MIT Natural Language Processing, Lecture 4: Tagging
MIT Natural Language Processing, Lecture 4: Tagging (Part 1)
Natural Language Processing: Tagging
Author: Regina Barzilay (MIT, EECS Department, November 15, 2004)
Translator: 52nlp (www.52nlp.cn, February 24, 2009)
Review of the last lecture (Last time):
Language modeling:
n-gram models
LM evaluation
Smoothing:
Discounting
Backoff
Interpolation
Today:
Tagging
I. Introduction
a) The tagging problem
i. Task: label each word in a sentence with its appropriate part of speech.
ii. Input: Our enemies are innovative and resourceful, and so are we. They never stop thinking about new ways to harm our country and our people, and neither do we.
iii. Output: Our/PRP$ enemies/NNS are/VBP innovative/JJ and/CC resourceful/JJ ,/, and/CC so/RB are/VB we/PRP ./. They/PRP never/RB stop/VB thinking/VBG about/IN new/JJ ways/NNS to/TO harm/VB our/PRP$ country/NN and/CC our/PRP$ people/NN ,/, and/CC neither/DT do/VB we/PRP ./.
b) Motivation
i. Part-of-speech (POS) tagging is important for many applications:
1. Parsing
2. Language modeling
3. Question answering and information extraction
4. Text-to-speech
ii. Tagging techniques can be used for a variety of tasks:
1. Semantic tagging
2. Dialogue tagging
c) How do we determine the tag set?
i. "The definitions [of the parts of speech] are very far from having attained the degree of exactitude found in Euclidean geometry." (Jespersen, The Philosophy of Grammar)
ii. There is agreement on coarse lexical categories, at least for some languages:
1. Closed class: prepositions, determiners, pronouns, particles, auxiliary verbs
2. Open class: nouns, verbs, adjectives, and adverbs
iii. Multiple tag sets of various granularity exist:
1. Penn tag set (45 tags), Brown tag set (87 tags), CLAWS2 tag set (132 tags)
2. Example: Penn Treebank tags
Tag   Description                Example
CC    coordinating conjunction   and, but
DT    determiner                 a, the
JJ    adjective                  red
NN    noun, singular             rose
RB    adverb                     quickly
VBD   verb, past tense           grew
d) Is tagging hard?
i. Example: "Time flies like an arrow"
ii. Many words may appear in several categories.
iii. However, most words appear predominantly in one category.
1. A "dumb" tagger that assigns each word its most common tag achieves 90% accuracy (Charniak et al., 1993); a small code sketch of such a baseline appears at the end of this section.
2. Are we happy with 90%?
iv. Information sources in tagging:
1. Lexical: look at the word itself
Word    Noun   Verb   Preposition
flies    21     23      0
like     10     30     21
2. Syntagmatic: look at nearby words
– Which is more likely: "DT JJ NN" or "DT JJ VBP"?
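Both information sources are easy to see in code. The sketch below (my own illustration, not part of the lecture) builds the word/tag counts behind the lexical table above and implements the "dumb" most-frequent-tag baseline; the toy training corpus and the NN fallback for unseen words are assumptions made purely for the example.

```python
from collections import Counter, defaultdict

# Toy tagged corpus; a real baseline would be trained on something like the Penn Treebank.
train = [
    [("time", "NN"), ("flies", "VBZ"), ("like", "IN"), ("an", "DT"), ("arrow", "NN")],
    [("fruit", "NN"), ("flies", "NNS"), ("like", "VBP"), ("a", "DT"), ("banana", "NN")],
    [("he", "PRP"), ("flies", "VBZ"), ("planes", "NNS")],
]

# Lexical information source: how often each word occurs with each tag.
word_tag_counts = defaultdict(Counter)
for sentence in train:
    for word, tag in sentence:
        word_tag_counts[word.lower()][tag] += 1

def most_frequent_tag_tagger(words, default_tag="NN"):
    """Assign every word its most common training tag; unseen words fall back to default_tag."""
    tags = []
    for w in words:
        counts = word_tag_counts.get(w.lower())
        tags.append(counts.most_common(1)[0][0] if counts else default_tag)
    return tags

print(most_frequent_tag_tagger(["time", "flies", "like", "an", "arrow"]))
# ['NN', 'VBZ', 'IN', 'DT', 'NN']: ambiguous words such as "flies" always receive their majority tag
```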
To be continued in Part 2.
Appendix: MIT course page with the lecture slides (PDF):
http://people.csail.mit.edu/regina/6881/
Note: this translation is published in accordance with the MIT OpenCourseWare Creative Commons terms; when reposting, please credit the source, "52nlp" (www.52nlp.cn).
Source: http://www.52nlp.cn/mit-nlp-fourth-lesson-tagging-first-part/
MIT Natural Language Processing, Lecture 4: Tagging (Part 2)
Learning to Tag:
* Transformation-based learning (TBL)
* Hidden Markov Model taggers
* Log-linear models
II. Transformation-Based Learning (TBL)
a) Overview:
i. TBL is "in between" symbolic and corpus-based methods;
ii. TBL exploits a wider range of lexical and syntactic regularities, with very few parameters to estimate;
iii. Key TBL components:
1. a specification of which "error-correcting" transformations are admissible
2. the learning algorithm
b) Transformations
i. Rewrite rule: tag1 → tag2 if condition C holds
– Templates are hand-selected
ii. Triggering environment C:
1. tag-triggered
2. word-triggered
3. morphology-triggered
c) Transformation templates
i. (Figure omitted.)
ii. Note: the templates given in Eric Brill's 1995 paper "Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging" include:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.
Each rule reads "change tag1 to tag2 when the triggering condition holds", where a, b, z and w are variables over the set of parts of speech.
iii. Examples:
Source tag   Target tag   Triggering condition
NN           VB           previous tag is TO
VBP          VB           one of the previous tags is MD
JJR          RBR          next tag is JJ
VBP          VB           one of the previous two words is "n't"
d) The learning component of TBL:
i. Greedy search for the optimal sequence of transformations:
1. select the best transformations;
2. determine the order in which they are applied;
e) Algorithm
Notation:
1. C_k: the corpus tagging at iteration k
2. E(C_k): the number of mistakes in the tagged corpus C_k
C_0 := corpus with each word tagged with its most frequent tag
for k := 0 step 1 do
  v := the transformation u_i that minimizes E(u_i(C_k))
  if E(C_k) - E(v(C_k)) < ε then break fi
  C_{k+1} := v(C_k)
  τ_{k+1} := v
end
Output sequence: τ_1, ..., τ_n
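The pseudocode can be fleshed out as below. This is a minimal sketch under strong simplifications (a toy gold corpus, only the "previous tag is z" template, and a stopping threshold ε of one corrected error); it is meant to mirror the greedy loop above, not Brill's actual implementation.

```python
from collections import Counter, defaultdict
from itertools import product

# Gold-standard tagged corpus (toy data for illustration only).
gold = [
    ("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("good", "JJ"),
    ("the", "DT"), ("race", "NN"), ("starts", "VBZ"), ("now", "RB"),
    ("to", "TO"), ("race", "VB"), ("is", "VBZ"), ("fun", "NN"),
]
words = [w for w, _ in gold]

# C0: tag every word with its most frequent tag in the training data.
freq = defaultdict(Counter)
for w, t in gold:
    freq[w][t] += 1
current = [freq[w].most_common(1)[0][0] for w in words]

def errors(tags):
    """E(C): the number of mistakes relative to the gold standard."""
    return sum(1 for (_, g), t in zip(gold, tags) if g != t)

def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, prev_tag): an instance of the template
    'change from_tag to to_tag if the previous tag is prev_tag' (delayed effect)."""
    from_tag, to_tag, prev_tag = rule
    return [to_tag if i > 0 and t == from_tag and tags[i - 1] == prev_tag else t
            for i, t in enumerate(tags)]

tagset = sorted({t for _, t in gold})
learned = []
while True:
    # v: the instantiated rule that minimizes the remaining error count.
    best = min(product(tagset, repeat=3), key=lambda r: errors(apply_rule(current, r)))
    if errors(current) - errors(apply_rule(current, best)) < 1:   # epsilon = 1
        break
    current = apply_rule(current, best)
    learned.append(best)

print(learned)   # [('NN', 'VB', 'TO')]: change NN to VB when the previous tag is TO
```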
f) Initialization
i. Alternative approaches:
1. random
2. the most frequent tag
3. …
ii. In practice, TBL is not sensitive to the initial assignment.
g) Rule application:
i. Left-to-right order of application
ii. Immediate vs. delayed effect:
Consider "A → B if the preceding tag is A"
– Immediate (each change is visible to the decisions that follow): AAAA → ABAB
– Delayed (all decisions are made against the original annotation): AAAA → ABBB
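A quick way to check the difference (my own example, not from the slides) is to apply the rule "A → B if the preceding tag is A" to the sequence AAAA under both conventions:

```python
def apply_immediate(tags, frm="A", to="B", prev="A"):
    """Each change becomes visible to the decision at the next position."""
    tags = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == frm and tags[i - 1] == prev:
            tags[i] = to
    return tags

def apply_delayed(tags, frm="A", to="B", prev="A"):
    """All decisions are made against the original annotation, then applied at once."""
    return [to if i > 0 and t == frm and tags[i - 1] == prev else t
            for i, t in enumerate(tags)]

print(apply_immediate(list("AAAA")))  # ['A', 'B', 'A', 'B']
print(apply_delayed(list("AAAA")))    # ['A', 'B', 'B', 'B']
```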
h) Rule selection:
i. We select both the template and its instantiation;
ii. Each rule τ modifies the given annotations:
1. it improves them in some places: C_improved(τ)
2. it worsens them in some places: C_worsened(τ)
3. it does not touch the remaining data
iii. The contribution of a rule is:
contrib(\tau) = |C_{improved}(\tau)| - |C_{worsened}(\tau)|
iv. Rule selection at iteration i:
\tau_{selected}(i) = \arg\max_{\tau} contrib(\tau)
i) The tagger:
i. Input:
1. untagged data;
2. the rules learned by the learner;
ii. Tagging:
1. use the same initialization as the learner did
2. apply all the learned rules, keeping the proper order of application
3. the last intermediate result is the output
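Given the rules produced by the learner, the tagger itself is just the same initialization followed by ordered rule application. A minimal sketch, reusing the freq table, apply_rule function, and learned rule list from the hypothetical learning sketch above:

```python
def tbl_tag(words, freq, learned_rules, default_tag="NN"):
    """Initialize exactly as the learner did, then apply every learned rule in
    the order it was learned; the final intermediate state is the output."""
    tags = [freq[w].most_common(1)[0][0] if w in freq else default_tag for w in words]
    for rule in learned_rules:
        tags = apply_rule(tags, rule)
    return tags

print(tbl_tag(["to", "race"], freq, learned))   # ['TO', 'VB']
```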
j) Discussion
i. What is the time complexity of TBL?
ii. Is it possible to develop an unsupervised TBL tagger?
k) Relation to other models:
i. Probabilistic models:
1. "k-best" tagging;
2. encoding of prior knowledge;
ii. Decision trees:
1. TBL is more powerful (Brill, 1995);
2. TBL is immune to overfitting.
Chapter 8 of Speech and Language Processing (Jurafsky and Martin) gives a more accessible explanation of TBL and a more detailed description of the algorithm.
To be continued in Part 3.
Source: http://www.52nlp.cn/mit-nlp-fourth-lesson-tagging-second-part/
MIT Natural Language Processing, Lecture 4: Tagging (Part 3)
III. Markov Model
a) Intuition: pick the most likely tag for each word of a sequence.
i. We will model P(T, S), where T is a sequence of tags and S is a sequence of words.
ii. P(T \mid S) = \frac{P(T, S)}{\sum_{T'} P(T', S)}
Tagger(S) = \arg\max_{T \in \mathcal{T}^n} \log P(T \mid S) = \arg\max_{T \in \mathcal{T}^n} \log P(T, S)
b) Parameter estimation
i. Apply the chain rule:
P(T, S) = \prod_{j=1}^{n} P(T_j \mid S_1, \ldots, S_{j-1}, T_1, \ldots, T_{j-1}) \cdot P(S_j \mid S_1, \ldots, S_{j-1}, T_1, \ldots, T_j)
ii. Assume independence (the Markov assumption):
= \prod_{j=1}^{n} P(T_j \mid T_{j-2}, T_{j-1}) \cdot P(S_j \mid T_j)
c) Example
i. They/PRP never/RB stop/VB thinking/VBG about/IN new/JJ ways/NNS to/TO harm/VB our/PRP$ country/NN and/CC our/PRP$ people/NN ,/, and/CC neither/DT do/VB we/PRP ./.
ii. P(T, S) = P(PRP | START, START) · P(They | PRP) · P(RB | START, PRP) · P(never | RB) · …, where START is the padding symbol for the two positions before the sentence begins.
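The factored model can be scored directly in code. Below is a minimal sketch of computing log P(T, S) for a short prefix of the example under the trigram-transition, word-given-tag assumption; the probability tables hold made-up toy values, not estimates from a real corpus.

```python
import math

# Toy parameter tables with made-up values; real values come from corpus counts.
transition = {("START", "START", "PRP"): 0.5, ("START", "PRP", "RB"): 0.3, ("PRP", "RB", "VB"): 0.4}
emission = {("they", "PRP"): 0.1, ("never", "RB"): 0.2, ("stop", "VB"): 0.05}

def log_joint(words, tags, transition, emission):
    """log P(T, S) = sum_j [ log P(T_j | T_{j-2}, T_{j-1}) + log P(S_j | T_j) ],
    with START padding for the two positions before the sentence begins."""
    padded = ["START", "START"] + list(tags)
    total = 0.0
    for j, (word, tag) in enumerate(zip(words, tags)):
        total += math.log(transition[(padded[j], padded[j + 1], tag)])
        total += math.log(emission[(word.lower(), tag)])
    return total

print(log_joint(["They", "never", "stop"], ["PRP", "RB", "VB"], transition, emission))
# log(0.5 * 0.1 * 0.3 * 0.2 * 0.4 * 0.05), roughly -9.72
```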
d) Estimating transition probabilities
P(T_j \mid T_{j-2}, T_{j-1}) = \lambda_1 \frac{\mathrm{Count}(T_{j-2}, T_{j-1}, T_j)}{\mathrm{Count}(T_{j-2}, T_{j-1})} + \lambda_2 \frac{\mathrm{Count}(T_{j-1}, T_j)}{\mathrm{Count}(T_{j-1})} + \lambda_3 \frac{\mathrm{Count}(T_j)}{\sum_i \mathrm{Count}(T_i)}
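A minimal sketch of this linear interpolation computed from raw tag n-gram counts; the toy counts and the λ values below are assumptions (in practice the λ's are tuned, for example on held-out data, and should sum to 1):

```python
from collections import Counter

def interpolated_transition(t2, t1, t, tri, bi, uni, lambdas=(0.6, 0.3, 0.1)):
    """P(t | t2, t1) as a weighted mix of trigram, bigram, and unigram estimates."""
    l1, l2, l3 = lambdas
    p_tri = tri[(t2, t1, t)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
    p_bi = bi[(t1, t)] / uni[t1] if uni[t1] else 0.0
    p_uni = uni[t] / sum(uni.values())
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Toy tag sequence standing in for counts collected from a tagged corpus.
tag_seq = ["DT", "JJ", "NN", "VBZ", "DT", "NN", "DT", "JJ", "NN"]
uni = Counter(tag_seq)
bi = Counter(zip(tag_seq, tag_seq[1:]))
tri = Counter(zip(tag_seq, tag_seq[1:], tag_seq[2:]))

print(interpolated_transition("DT", "JJ", "NN", tri, bi, uni))   # about 0.93
```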
e) Estimating emission probabilities
P(S_j \mid T_j) = \frac{\mathrm{Count}(S_j, T_j)}{\mathrm{Count}(T_j)}
i. Problem: unknown or rare words
1. Proper names
"King Abdullah of Jordan, the King of Morocco, I mean, there's a series of places — Qatar, Oman — I mean, places that are developing — Bahrain — they're all developing the habits of free societies."
2. New words
"They misunderestimated me."
f) Dealing with low-frequency words
i. Split the vocabulary into two sets:
1. Frequent words: words occurring more than 5 times in training
2. Low-frequency words: all other words
ii. Map low-frequency words into a small, finite set of classes, depending on prefixes, suffixes, etc. (see Bikel et al., 1998)
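One common realization of this idea (in the spirit of the word classes of Bikel et al., 1998, although the class names and thresholds below are my own simplification) is to replace each low-frequency token with a coarse pseudo-word derived from its spelling:

```python
import re
from collections import Counter

def word_class(word, is_first_word=False):
    """Map a low-frequency word to a coarse, spelling-based class."""
    if re.fullmatch(r"\d{2}", word):
        return "_twoDigitNum_"
    if re.fullmatch(r"\d{4}", word):
        return "_fourDigitNum_"
    if re.fullmatch(r"[0-9.,-]+", word):
        return "_otherNum_"
    if word.isupper():
        return "_allCaps_"
    if re.fullmatch(r"[A-Z]\.", word):
        return "_capPeriod_"
    if word[0].isupper():
        return "_firstWord_" if is_first_word else "_initCap_"
    if word.islower():
        return "_lowercase_"
    return "_other_"

def preprocess(sentences, min_count=5):
    """Replace words seen at most min_count times in training with their class."""
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] > min_count else word_class(w, i == 0)
             for i, w in enumerate(sent)] for sent in sentences]

print(word_class("Qatar"))               # _initCap_
print(word_class("misunderestimated"))   # _lowercase_
print(word_class("1998"))                # _fourDigitNum_
```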
To be continued in Part 4.
Source: http://www.52nlp.cn/mit-nlp-fourth-lesson-tagging-third-part/
MIT Natural Language Processing, Lecture 4: Tagging (Part 4)
III. Markov Model (continued)
g) Efficient tagging
i. How do we find the most likely sequence of tags for a sequence of words?
1. Brute-force search is dreadful: for N tags and W words the cost is N^W.
2. Idea: use memoization (the Viterbi algorithm)
– sequences that end in the same tag can be collapsed together, since the next tag depends only on the current tag of the sequence
(Trellis diagram omitted.)
h) The Viterbi algorithm
i. Base case:
\pi[0, \mathrm{START}] = \log 1 = 0
\pi[0, t_{-1}] = \log 0 = -\infty for all other t_{-1}
ii. Recursive case:
1. For i = 1, \ldots, |S| and for all t_{-1} \in T:
\pi[i, t_{-1}] = \max_{t \in T \cup \{\mathrm{START}\}} \big( \pi[i-1, t] + \log P(t_{-1} \mid t) + \log P(S_i \mid t_{-1}) \big)
(Here t_{-1} denotes the tag at position i and t ranges over the possible tags at position i-1.)
2. Backpointers allow us to recover the maximum-probability sequence:
BP[i, t_{-1}] = \arg\max_{t \in T \cup \{\mathrm{START}\}} \big( \pi[i-1, t] + \log P(t_{-1} \mid t) + \log P(S_i \mid t_{-1}) \big)
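A minimal log-space implementation of the recursion above, assuming a bigram transition table log P(tag | previous tag) and an emission table log P(word | tag) have already been estimated; the toy values at the bottom are made up for illustration.

```python
import math

NEG_INF = float("-inf")

def viterbi(words, tags, log_trans, log_emit):
    """pi[i][t]: best log-probability of a tag sequence for words[:i+1] ending in tag t;
    bp[i][t]: the previous tag that achieved that score (the backpointer)."""
    n = len(words)
    pi = [{t: NEG_INF for t in tags} for _ in range(n)]
    bp = [{t: None for t in tags} for _ in range(n)]
    for i, word in enumerate(words):
        prev_tags = ["START"] if i == 0 else tags
        for t in tags:
            for prev in prev_tags:
                prev_score = 0.0 if i == 0 else pi[i - 1][prev]   # pi[0, START] = log 1 = 0
                score = (prev_score + log_trans.get((prev, t), NEG_INF)
                         + log_emit.get((word, t), NEG_INF))
                if score > pi[i][t]:
                    pi[i][t], bp[i][t] = score, prev
    # Follow the backpointers from the best final tag to recover the sequence.
    best_last = max(tags, key=lambda t: pi[n - 1][t])
    sequence = [best_last]
    for i in range(n - 1, 0, -1):
        sequence.append(bp[i][sequence[-1]])
    return list(reversed(sequence))

def lg(p):
    return math.log(p)

# Toy bigram model (log-probabilities); real tables come from the estimators above.
log_trans = {("START", "DT"): lg(0.8), ("START", "NN"): lg(0.2),
             ("DT", "NN"): lg(0.9), ("DT", "DT"): lg(0.1),
             ("NN", "NN"): lg(0.3), ("NN", "DT"): lg(0.7)}
log_emit = {("the", "DT"): lg(0.7), ("the", "NN"): lg(0.01),
            ("race", "NN"): lg(0.4), ("race", "DT"): lg(0.01)}

print(viterbi(["the", "race"], ["DT", "NN"], log_trans, log_emit))   # ['DT', 'NN']
```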
i) Performance
i. HMM taggers are very simple to train.
ii. They perform relatively well (over 90% performance on named entities).
iii. The main difficulty is modeling p(word | tag).
IV. Conclusions
a) Tagging is a relatively easy task, at least in a supervised framework and for English.
b) Factors that affect tagger performance include:
i. the amount of training data available
ii. the tag set
iii. the difference in vocabulary between the training and the test data
iv. unknown words
c) The TBL and HMM frameworks can be used for other NLP tasks.
End of Lecture 4!
Next: Lecture 5, Maximum Entropy and Log-linear Models.
Source: http://www.52nlp.cn/mit-nlp-fourth-lesson-tagging-fourth-part/