MIT Natural Language Processing, Lecture 1: Introduction and Overview (Part 2)
Natural Language Processing: Background and Overview
Author: Regina Barzilay (MIT, EECS Department, September 8, 2004)
Translator: 我愛自然語言處理 (www.52nlp.cn, January 4, 2009)
III. Knowledge Bottleneck in NLP
We need:
- Knowledge about language;
- Knowledge about the world.
Possible solutions:
- Symbolic approach: encode all the required information into the computer;
- Statistical approach: infer language properties from language samples.
1. Case study: Determiner Placement
Task: Automatically place determiners (a, the, null) in a text
Sample (determiners removed):
Scientists in United States have found way of turning lazy monkeys into workaholics using gene therapy. Usually monkeys work hard only when they know reward is coming, but animals given this treatment did their best all time. Researchers at National Institute of Mental Health near Washington DC, led by Dr Barry Richmond, have now developed genetic treatment which changes their work ethic markedly. “Monkeys under influence of treatment don’t procrastinate,” Dr Richmond says. Treatment consists of anti-sense DNA – mirror image of piece of one of our genes – and basically prevents that gene from working. But for rest of us, day when such treatments fall into hands of our bosses may be one we would prefer to put off.
2. Relevant Grammar Rules
a) Determiner placement is largely determined by:
i. Type of noun: countable, uncountable;
ii. Reference: specific, generic;
iii. Information value: given, new;
iv. Number: singular, plural.
b) However, many exceptions and special cases play a role, for example:
i. The definite article is used with newspaper titles (The Times), but the zero article is used with names of magazines and journals (Time).
3. Symbolic Approach: Determiner Placement
a) What categories of knowledge do we need?
i. Linguistic knowledge:
- Static knowledge: number, countability, …
- Context-dependent knowledge: co-reference, …
ii. World knowledge:
- Uniqueness of reference (the current president of the US), type of noun (newspaper vs. magazine), situational associativity between nouns (the score of the football game), …
iii. This information is hard to encode manually!
4. Statistical Approach: Determiner Placement
a) Naive approach:
i. Collect a large collection of texts relevant to your domain (e.g., newspaper text).
ii. For each noun, compute the probability that it takes a certain determiner:
- p(determiner | noun) = freq(noun, determiner) / freq(noun)
iii. Given a new noun, select the determiner with the highest likelihood as estimated on the training corpus.
b) Implementation:
i. Corpus: training on the first 21 sections of the Wall Street Journal (WSJ) corpus, testing on the 23rd section.
ii. Prediction accuracy: 71.5%.
c) Does it work?
i. The results are not great, but surprisingly high for such a simple method.
ii. A large fraction of nouns in this corpus always appear with the same determiner, for example:
- “the FBI”, “the defendant”, …
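The naive approach in (a) can be sketched in a few lines of Python. This is only an illustrative sketch, not the lecture's implementation: the function names, the toy (noun, determiner) pairs, and the fallback to "null" for nouns never seen in training are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train(pairs):
    """Count how often each noun occurs with each determiner.

    pairs: iterable of (noun, determiner) tuples, where determiner is
    "the", "a", or "null" (no determiner).
    """
    counts = defaultdict(Counter)
    for noun, det in pairs:
        counts[noun][det] += 1
    return counts


def prob(counts, det, noun):
    """p(determiner | noun) = freq(noun, determiner) / freq(noun)."""
    by_det = counts.get(noun, Counter())
    total = sum(by_det.values())
    return by_det[det] / total if total else 0.0


def predict(counts, noun, default="null"):
    """Return the determiner with the highest estimated probability.

    The fallback for unseen nouns is an assumption of this sketch.
    """
    if noun not in counts:
        return default
    return counts[noun].most_common(1)[0][0]


# Hypothetical toy data standing in for a real training corpus.
toy_pairs = [
    ("FBI", "the"), ("FBI", "the"),
    ("defendant", "the"),
    ("monkeys", "null"), ("monkeys", "null"), ("monkeys", "the"),
    ("treatment", "a"), ("treatment", "the"), ("treatment", "a"),
]
counts = train(toy_pairs)

print(predict(counts, "FBI"))
print(prob(counts, "null", "monkeys"))
```

Because the prediction depends only on the noun, every occurrence of a noun receives the same determiner, which is why nouns such as "FBI" that almost always take "the" are handled well while context-dependent cases are not.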
5. Determiner Placement as Classification
a) Prediction: “the”, “a”, “null”
b) Representation of the problem:
i. plural? (yes, no)
ii. first appearance in text? (yes, no)
iii. noun (a member of the vocabulary set)
c) (Example chart omitted)
d) Goal: learn a classification function that can predict unseen examples.
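As a rough illustration of the representation in (b), the sketch below turns a sequence of noun occurrences into feature dictionaries with the three listed features. The function name, the dictionary keys, and the assumption that plurality is already known for each noun (for instance from a part-of-speech tagger) are hypothetical choices for this example, not part of the lecture.

```python
def featurize(nouns):
    """Build one feature dict per noun occurrence, in text order.

    nouns: list of (noun, is_plural) tuples; is_plural is assumed to be
    supplied by some earlier processing step.
    """
    seen = set()
    rows = []
    for noun, is_plural in nouns:
        key = noun.lower()
        rows.append({
            "plural": is_plural,                 # plural? (yes, no)
            "first_appearance": key not in seen,  # first appearance in text?
            "noun": key,                         # member of the vocabulary set
        })
        seen.add(key)
    return rows


# Hypothetical noun occurrences taken in order from a short text.
rows = featurize([("monkeys", True), ("reward", False), ("monkeys", True)])
for row in rows:
    print(row)
```

Each feature dictionary would be paired with its observed label ("the", "a", or "null") to form a training example for the classification function.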
6. Classification Approach
a) Learn a function from X -> Y (in the previous example, {-1, 0, 1}).
b) Assume there is some distribution D(X, Y), where x ∈ X and y ∈ Y.
c) Attempt to explicitly model the distributions D(X, Y) and D(X|Y).
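One common way to turn such a model into a classifier, offered here as an illustrative assumption rather than something the notes state, is to predict the label with the highest modeled joint probability. The second equality uses only the product rule; D(y) denotes the marginal of D(X, Y) over labels, a symbol that does not appear in the original notes.

```latex
\hat{y} \;=\; \arg\max_{y \in Y} D(x, y) \;=\; \arg\max_{y \in Y} D(y)\, D(x \mid y)
```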
7. Beyond Classification
a) Many NLP applications can be viewed as a mapping from one complex set to another:
i. Parsing: strings to trees
ii. Machine Translation: strings to strings
iii. Natural Language Generation: database entries to strings
b) Note that the classification framework is not suitable in these cases!
8. Mapping in Machine Translation
a) The classic statement from Weaver (1955):
i. “… one naturally wonders if the problem of translation could conceivably be treated as a problem of cryptography. When I look at an article in Russian, I say: ‘this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
b) (Machine translation example omitted)
c) Learning for MT
i. Parallel corpora are available in several language pairs.
ii. Basic idea: use a parallel corpus as a training set of translation examples.
iii. Goal: learn a function that maps a string in a source language to a string in a target language.
To be continued: Part 3
Appendix: MIT course page (in English) with the course materials and PDF slides:
http://people.csail.mit.edu/regina/6881/
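Note: This translation is published in accordance with the MIT OpenCourseWare Creative Commons terms. When reposting, please credit the source “我愛自然語言處理”: www.52nlp.cn
from: http://www.52nlp.cn/mit-nlp-first-lesson-introduction-and-overview-second-part/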