NLP Basics 003 (Part-of-Speech Tagging)
Goal: part-of-speech tagging. Given a sentence $s = w_1 w_2 \ldots w_n$, find the tag sequence $z = (z_1, z_2, \ldots, z_n)$ that maximizes $p(z \mid s)$.

By the Noisy Channel Model,

$$\hat{z} = \arg\max_z p(z \mid s) = \arg\max_z p(s \mid z)\, p(z),$$

where $p(s \mid z)$ is the translation model and $p(z)$ is the language model. Written out,

$$\hat{z} = \arg\max_z p(w_1 w_2 \ldots w_n \mid z_1 z_2 \ldots z_n)\; p(z_1 z_2 \ldots z_n).$$

Assuming the words are conditionally independent given their tags, and expanding $p(z)$ by the chain rule,

$$\hat{z} = \arg\max_z p(w_1 \mid z_1)\, p(w_2 \mid z_2) \cdots p(w_n \mid z_n)\; p(z_1)\, p(z_2 \mid z_1)\, p(z_3 \mid z_1 z_2) \cdots$$

Applying the (first-order) Markov assumption, each tag depends only on the previous tag:

$$\hat{z} = \arg\max_z \prod_{i=1}^{n} p(w_i \mid z_i)\; p(z_1) \prod_{j=2}^{n} p(z_j \mid z_{j-1}).$$

Taking logarithms (which preserves the argmax and avoids numerical underflow):

$$\hat{z} = \arg\max_z \sum_{i=1}^{n} \log p(w_i \mid z_i) + \log p(z_1) + \sum_{j=2}^{n} \log p(z_j \mid z_{j-1}).$$
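To make the objective concrete, here is a minimal sketch of a scorer that evaluates this sum of logs for one candidate tag sequence. It assumes the pi, A, B arrays and the word2id/tag2id maps estimated by the code below, and uses a small eps instead of the zero-probability handling done later by the log helper:

    import numpy as np

    def log_score(word_ids, tag_ids, pi, A, B, eps=1e-6):
        # sum_i log p(w_i|z_i) + log p(z_1) + sum_j log p(z_j|z_{j-1})
        s = np.log(pi[tag_ids[0]] + eps)                        # log p(z_1)
        for i in range(len(word_ids)):
            s += np.log(A[tag_ids[i]][word_ids[i]] + eps)       # log p(w_i | z_i)
            if i > 0:
                s += np.log(B[tag_ids[i-1]][tag_ids[i]] + eps)  # log p(z_i | z_{i-1})
        return s

The Viterbi algorithm further below finds the tag sequence that maximizes exactly this quantity without enumerating all candidates.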
"計(jì)算pi、A、B代碼,traindata.txt文件數(shù)據(jù)見文章結(jié)尾"tag2id, id2tag = {}, {} # maps tag to id . tag2id: {"VB": 0, "NNP":1,..} , id2tag: {0: "VB", 1: "NNP"....}
word2id, id2word = {}, {} # maps word to idfor line in open('traindata.txt'):items = line.split('/')word, tag = items[0], items[1].rstrip() # 抽取每一行里的單詞和詞性if word not in word2id:word2id[word] = len(word2id)id2word[len(id2word)] = wordif tag not in tag2id:tag2id[tag] = len(tag2id)id2tag[len(id2tag)] = tagM = len(word2id) # M: 詞典的大小、# of words in dictionary
N = len(tag2id) # N: 詞性的種類個(gè)數(shù) # of tags in tag set# 構(gòu)建 pi, A, B
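For intuition, here is a toy run of the same mapping logic on two word/tag lines (a sketch; the demo_ names are mine, and the two lines are taken from the corpus sample at the end of the article):

    demo_word2id, demo_tag2id = {}, {}
    for line in ["Newsweek/NNP\n", "announced/VBD\n"]:
        items = line.split('/')
        word, tag = items[0], items[1].rstrip()
        if word not in demo_word2id:
            demo_word2id[word] = len(demo_word2id)
        if tag not in demo_tag2id:
            demo_tag2id[tag] = len(demo_tag2id)
    print(demo_word2id)  # {'Newsweek': 0, 'announced': 1}
    print(demo_tag2id)   # {'NNP': 0, 'VBD': 1}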
    # Build pi, A, B
    import numpy as np

    pi = np.zeros(N)      # pi[i]: probability that tag i appears at the first position of a sentence
    A = np.zeros((N, M))  # A[i][j]: probability of emitting word j given tag i
    B = np.zeros((N, N))  # B[i][j]: probability that state i is followed by state j

    prev_tag = ""
    for line in open('traindata.txt'):
        items = line.split('/')
        wordId, tagId = word2id[items[0]], tag2id[items[1].rstrip()]
        if prev_tag == "":  # this is the start of a sentence
            pi[tagId] += 1
            A[tagId][wordId] += 1
        else:               # not the start of a sentence
            A[tagId][wordId] += 1
            B[tag2id[prev_tag]][tagId] += 1
        if items[0] == ".":
            prev_tag = ""
        else:
            prev_tag = items[1].rstrip()

    # normalize the counts into probabilities
    pi = pi / sum(pi)
    for i in range(N):
        A[i] /= sum(A[i])
        B[i] /= sum(B[i])
    # At this point all of the model's parameters pi, A, B have been computed.
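As a quick sanity check (not in the original post): pi and every row of A should now sum to 1. Rows of B can be an exception: a tag that never precedes another tag inside a sentence (e.g. the sentence-final "." tag, since prev_tag is reset after every ".") has an all-zero row, and the division above turns it into NaN.

    assert abs(pi.sum() - 1.0) < 1e-8
    assert np.allclose(A.sum(axis=1), 1.0)
    # Ignore NaN rows of B (tags with no outgoing transition counts) in the check.
    b_sums = B.sum(axis=1)
    assert np.allclose(b_sums[~np.isnan(b_sums)], 1.0)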
Now that pi, A, and B are known, we still need to find the optimal z. Enumerating all N^T candidate tag sequences for a T-word sentence is exponential, so the Viterbi algorithm casts the search as a dynamic-programming problem of finding the best path. The final code is as follows:
    def log(v):
        # smoothed log: avoid taking the log of zero
        if v == 0:
            return np.log(v + 0.000001)
        return np.log(v)

    def viterbi(x, pi, A, B):
        """
        x: user input string/sentence, e.g. x = "I like playing soccer"
        pi: initial probability of each tag
        A: given a tag, the probability of each word appearing
        B: transition probabilities between tags
        """
        x = [word2id[word] for word in x.split(" ")]  # x: [4521, 412, 542, ...]
        T = len(x)

        dp = np.zeros((T, N))  # dp[i][j]: best log-score of w1...wi assuming wi has the j-th tag
        ptr = np.array([[0 for x in range(N)] for y in range(T)])  # T*N backpointer table
        # TODO: ptr = np.zeros((T, N), dtype=int)

        for j in range(N):  # base case of the DP
            dp[0][j] = log(pi[j]) + log(A[j][x[0]])

        for i in range(1, T):   # for each word
            for j in range(N):  # for each tag
                # TODO: the following lines can be written as one vectorized operation, which is more efficient
                dp[i][j] = -9999999
                for k in range(N):  # every tag k can transition to j
                    score = dp[i-1][k] + log(B[k][j]) + log(A[j][x[i]])
                    if score > dp[i][j]:
                        dp[i][j] = score
                        ptr[i][j] = k

        # decoding: print out the best tag sequence
        best_seq = [0] * T  # e.g. best_seq = [1, 5, 2, 23, 4, ...]
        # step 1: find the tag of the last word
        best_seq[T-1] = np.argmax(dp[T-1])
        # step 2: loop backwards to recover the tag of each earlier word
        for i in range(T-2, -1, -1):  # T-2, T-3, ..., 1, 0
            best_seq[i] = ptr[i+1][best_seq[i+1]]

        # best_seq now holds the tag sequence corresponding to x
        for i in range(len(best_seq)):
            print(id2tag[best_seq[i]])
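The TODO in the loop above points at a vectorized form. Here is a minimal sketch of that variant (the naming is mine; it precomputes smoothed log matrices instead of calling log per entry, and returns the tag-id sequence instead of printing it):

    def viterbi_vectorized(x_ids, pi, A, B, eps=1e-6):
        # x_ids: word ids, e.g. [word2id[w] for w in sentence.split(" ")]
        logpi = np.log(pi + eps)
        logA = np.log(A + eps)
        logB = np.log(np.nan_to_num(B) + eps)  # NaN rows (tags with no outgoing counts) become ~log(eps)
        T, n_tags = len(x_ids), len(pi)
        dp = np.zeros((T, n_tags))
        ptr = np.zeros((T, n_tags), dtype=int)
        dp[0] = logpi + logA[:, x_ids[0]]
        for i in range(1, T):
            scores = dp[i-1][:, None] + logB  # scores[k][j]: reach tag j from tag k
            ptr[i] = scores.argmax(axis=0)    # best previous tag for each j
            dp[i] = scores.max(axis=0) + logA[:, x_ids[i]]
        best = [int(np.argmax(dp[T-1]))]
        for i in range(T-1, 0, -1):           # follow the backpointers
            best.append(int(ptr[i][best[-1]]))
        return best[::-1]                     # tag ids, front to back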
x = "Social Security number , passport number and details about the services provided for the payment" print(viterbi(x, pi, A, B))NNP NNP NN , NN NN CC NNS IN DT NNS VBN IN DT NN traindata.txt 部分訓(xùn)練語料如下所示:Newsweek/NNP ,/, trying/VBG to/TO keep/VB pace/NN with/IN rival/JJ Time/NNP magazine/NN ,/, announced/VBD new/JJ advertising/NN rates/NNS for/IN 1990/CD and/CC said/VBD it/PRP will/MD introduce/VB a/DT new/JJ incentive/NN plan/NN for/IN advertisers/NNS ./. The/DT new/JJ ad/NN plan/NN from/IN Newsweek/NNP ,/, a/DT unit/NN of/IN the/DT Washington/NNP Post/NNP Co./NNP ,/, is/VBZ the/DT second/JJ incentive/NN plan/NN the/DT magazine/NN has/VBZ offered/VBN advertisers/NNS in/IN three/CD years/NNS ./. Plans/NNS that/WDT give/VBP advertisers/NNS discounts/NNS for/IN maintaining/VBG or/CC increasing/VBG ad/NN spending/NN have/VBP become/VBN permanent/JJ fixtures/NNS at/IN the/DT news/NN weeklies/NNS and/CC underscore/VBP the/DT fierce/JJ competition/NN between/IN Newsweek/NNP ,/, Time/NNP Warner/NNP Inc./NNP 's/POS Time/NNP magazine/NN ,/, and/CC Mortimer/NNP B./NNP Zuckerman/NNP 's/POS U.S./NNP News/NNP &/CC World/NNP Report/NNP ./. Alan/NNP Spoon/NNP ,/, recently/RB named/VBN Newsweek/NNP president/NN ,/, said/VBD Newsweek/NNP 's/POS ad/NN rates/NNS would/MD increase/VB 5/CD %/NN in/IN January/NNP ./. A/DT full/JJ ,/, four-color/JJ page/NN in/IN Newsweek/NNP will/MD cost/VB $/$ 100,980/CD ./. In/IN mid-October/NNP ,/, Time/NNP magazine/NN lowered/VBD its/PRP$ guaranteed/VBN circulation/NN rate/NN base/NN for/IN 1990/CD while/IN not/RB increasing/VBG ad/NN page/NN rates/NNS ;/: with/IN a/DT lower/JJR circulation/NN base/NN ,/, Time/NNP 's/POS ad/NN rate/NN will/MD be/VB effectively/RB 7.5/CD %/NN higher/JJR per/IN subscriber/NN ;/: a/DT full/JJ page/NN in/IN Time/NNP costs/VBZ about/IN $/$ 120,000/CD ./. U.S./NNP News/NNP has/VBZ yet/RB to/TO announce/VB its/PRP$ 1990/CD ad/NN rates/NNS ./. Newsweek/NNP said/VBD it/PRP will/MD introduce/VB the/DT Circulation/NNP Credit/NNP Plan/NNP ,/, which/WDT awards/VBZ space/NN credits/NNS to/TO advertisers/NNS on/IN ``/`` renewal/NN advertising/NN ./. ''/'' The/DT magazine/NN will/MD reward/VB with/IN ``/`` page/NN bonuses/NNS ''/'' advertisers/NNS who/WP in/IN 1990/CD meet/VBP or/CC exceed/VBP their/PRP$ 1989/CD spending/NN ,/, as/RB long/RB as/IN they/PRP spent/VBD $/$ 325,000/CD in/IN 1989/CD and/CC $/$ 340,000/CD in/IN 1990/CD ./.?
Summary

This concludes NLP Basics 003 (Part-of-Speech Tagging): estimate the HMM parameters pi, A, and B from a tagged corpus, then decode the most likely tag sequence with the Viterbi algorithm. Hopefully it helps you solve the problems you run into.
- Previous: NLP Basics 002 (Language Models)
- Next: NLP Basics 004 (Generative Models)