[Natural Language Processing with Python, 2nd Edition] Reading Notes 1: Language Processing and Python
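Contents
- Preface
- Language Processing and Python
- I. Computing with Language: Texts and Words
- 1. Getting Started with NLTK
- (1) Installation (nltk, nltk.book)
- (2) Searching text
- (3) Counting vocabulary
- 2. Lists and strings
- (1) List operations
- (2) Indexing lists
- (3) Variables
- (4) Strings
- II. Computing with Language: Simple Statistics
- 1. Frequency distributions
- 2. Fine-grained selection of words
- (1) Selecting words longer than 15 characters
- (2) Frequently occurring long words
- (3) Extracting word pairs (bigrams)
- (4) Extracting frequent bigrams (collocations) from a text
- 3. Counting other things
- (1) Distribution of word lengths in a text
- (2) [w for w in text if condition]
- (3) Conditionals and loops
- III. Understanding Natural Language
- IV. Exercises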
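Source book: Natural Language Processing with Python, 2nd edition
Preface
Broadly speaking, "natural language processing" (NLP) covers any kind of computer manipulation of natural language.
NLTK defines an infrastructure for building NLP programs in Python. It provides basic classes for representing data relevant to natural language processing; standard interfaces for tasks such as part-of-speech tagging, syntactic parsing, and text classification; and standard implementations of each task, which can be combined to solve complex problems.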
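Language processing tasks, the corresponding NLTK modules, and examples of their functionality
| Language processing task | NLTK modules | Functionality |
|---|---|---|
| Accessing corpora | corpus | Standardized interfaces to corpora and lexicons |
| String processing | tokenize, stem | Tokenizers, sentence tokenizers, stemmers |
| Collocation discovery | collocations | t-test, chi-squared, pointwise mutual information (PMI) |
| Part-of-speech tagging | tag | n-gram, backoff, Brill, HMM, TnT |
| Machine learning | classify, cluster, tbl | Decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | chunk | Regular expression, n-gram, named entity |
| Parsing | parse, ccg | Chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | sem, inference | Lambda calculus, first-order logic, model checking |
| Evaluation metrics | metrics | Precision, recall, agreement coefficients |
| Probability and estimation | probability | Frequency distributions, smoothed probability distributions |
| Applications | app, chat | Graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | toolbox | Manipulating data in SIL Toolbox format |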
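Language Processing and Python
I. Computing with Language: Texts and Words
1. Getting Started with NLTK
(1) Installation (nltk, nltk.book)
Installing the nltk.book data

    import nltk
    nltk.download()

Use nltk.download() to browse the available packages. The Collections tab of the downloader shows how the packages are grouped into sets; select the line labeled book to obtain all the data needed for the book's examples and exercises.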

    from nltk.book import *
    print("text1 : ", text1)
    print("text2 : ", text2)

Output:

    *** Introductory Examples for the NLTK Book ***
    Loading text1, ..., text9 and sent1, ..., sent9
    Type the name of the text or sentence to view it.
    Type: 'texts()' or 'sents()' to list the materials.
    text1: Moby Dick by Herman Melville 1851
    text2: Sense and Sensibility by Jane Austen 1811
    text3: The Book of Genesis
    text4: Inaugural Address Corpus
    text5: Chat Corpus
    text6: Monty Python and the Holy Grail
    text7: Wall Street Journal
    text8: Personals Corpus
    text9: The Man Who Was Thursday by G . K . Chesterton 1908
    text1 : <Text: Moby Dick by Herman Melville 1851>
    text2 : <Text: Sense and Sensibility by Jane Austen 1811>

(2) Searching text

    # These methods print their results themselves and return None,
    # so they are called directly rather than wrapped in print().

    # Concordance: show every occurrence of "monstrous" in text1, with its context
    text1.concordance("monstrous")
    # Words in text1 that appear in contexts similar to those of "monstrous"
    text1.similar("monstrous")
    # Contexts shared by two words in text2
    text2.common_contexts(["monstrous", "very"])
    # Lexical dispersion plot: where each word occurs across text4
    text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

Output:

    Displaying 11 of 11 matches:
    ong the former , one was of a most monstrous size . ... This came towards us ,
    ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
    ll over with a heathenish array of monstrous clubs and spears . Some were thick
    d as you gazed , and wondered what monstrous cannibal and savage could ever hav
    that has survived the flood ; most monstrous and most mountainous ! That Himmal
    they might scout at Moby Dick as a monstrous fable , or still worse and more de
    th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
    ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
    ere to enter upon those still more monstrous stories of them which are to be fo
    ght have been rummaged out of this monstrous cabinet there is no telling . But
    of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

    true contemptible christian abundant few part mean careful puzzled
    mystifying passing curious loving wise doleful gamesome singular
    delightfully perilous fearless

    a_pretty am_glad a_lucky is_pretty be_glad

(3) Counting vocabulary

    # Total number of tokens in text3
    print(len(text3))
    # Sorted distinct tokens. Note: capitalized words sort before lowercase ones.
    print(sorted(set(text3)))
    # Number of distinct tokens
    print(len(set(text3)))
    # Lexical richness: distinct tokens are about 6% of all tokens,
    # or equivalently, each word is used about 16 times on average
    print(len(set(text3)) / len(text3))
    # Number of occurrences of "smote" in text3
    print(text3.count("smote"))
    # Percentage of the tokens in text4 that are "a"
    print(100 * text4.count('a') / len(text4))
    print('--------------------' * 2)

    # Compute lexical diversity
    def lexical_diversity(text):
        return len(set(text)) / len(text)

    # Compute the percentage of the tokens in text that are word
    def percentage(word, text):
        return 100 * text.count(word) / len(text)

    print(lexical_diversity(text3))
    print(lexical_diversity(text5))
    print(percentage('a', text4))

Output:

    44764
    ['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', ..., 'With', 'Woman', 'Ye', 'Yea', 'Yet', 'Zaavan', 'Zaphnathpaaneah', 'Zar', 'Zarah', 'Zeboiim', 'Zeboim', 'Zebul', 'Zebulun', 'Zemarite', 'Zepho', 'Zerah', 'Zibeon', 'Zidon', 'Zillah', 'Zilpah', 'Zimran', 'Ziphion', 'Zo', 'Zoar', 'Zohar', 'Zuzims', 'a', 'abated', 'abide', 'able', 'abode', 'abomination', 'about', 'above', 'abroad', 'absent', 'abundantly', 'accept', 'accepted', 'according', 'acknowledged', 'activity', 'add', ..., 'yielded', 'yielding', 'yoke', 'yonder', 'you', 'young', 'younge', 'younger', 'youngest', 'your', 'yourselves', 'youth']
    2789
    0.06230453042623537
    5
    1.4643016433938312
    ----------------------------------------
    0.06230453042623537
    0.13477005109975562
    1.4643016433938312

2. Lists and strings
(1) List operations

    print('sent2 : ', sent2)
    # Concatenation: combine several lists into one
    print('List : ', ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail'])
    # Appending: add a single element to the end of a list
    print('sent1 : ', sent1)
    sent1.append("Some")
    print('sent1 : ', sent1)

Output:

    sent2 : ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
    List : ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
    sent1 : ['Call', 'me', 'Ishmael', '.']
    sent1 : ['Call', 'me', 'Ishmael', '.', 'Some']

(2) Indexing lists

    # Look up the token at a given index
    print(text4[173])
    # Look up the index of a token's first occurrence
    print(text4.index('awaken'))

    # Slicing: extract an arbitrary piece of a large text, i.e. a sublist
    print(text5[16715:16735])
    print(text6[1600:1625])

    sent = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10']
    print(sent[5:8])  # sent[5], sent[6], sent[7]
    print(sent[0])
    print(sent[9])

    sent[0] = 'First'
    sent[9] = 'Last'
    # Replace a whole slice with new content
    sent[1:9] = ['Second', 'Third']
    print(sent)
    # The list now has only four elements, so indexing past the end raises an error
    # print(sent[9])

Output:

    awaken
    173
    ['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']
    ['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']
    ['word6', 'word7', 'word8']
    word1
    word10
    ['First', 'Second', 'Third', 'Last']
    # Traceback (most recent call last):
    #   File "/home/jie/Jie/codes/nlp/1_nltk.py", line 60, in <module>
    #     print(sent[9])
    # IndexError: list index out of range

(3) Variables
An assignment has the form: variable = expression
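As a small illustration of assignment, the sketch below stores a list of tokens in a variable and then reuses that variable in further expressions. The variable names and the sentence are illustrative only; they are not taken from these notes.

```python
# Assign a list of tokens to a variable (the right-hand side is evaluated first).
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode', 'forth', 'from', 'Camelot', '.']

# A variable can be used anywhere its value could be used.
noun_phrase = my_sent[1:4]            # slice of the list held by my_sent
sorted_phrase = sorted(noun_phrase)   # capitalized words sort before lowercase ones

print(noun_phrase)
print(sorted_phrase)
```

Choosing descriptive names such as noun_phrase makes later expressions easier to read than reusing a generic name.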
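(4) Strings
Some of the methods used to access list elements also work on individual words, that is, on strings.

    name = 'Monty'
    # Indexing and slicing
    print(name[0])
    print(name[:4])
    # Multiplication and addition
    print(name * 2)
    print(name + '!')

Output:

    M
    Mont
    MontyMonty
    Monty!

Converting between strings and lists

    print(' '.join(['Monty', 'Python']))
    print('Monty Python'.split())

Output:

    Monty Python
    ['Monty', 'Python']

II. Computing with Language: Simple Statistics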
1. Frequency distributions
A frequency distribution records how often each vocabulary item occurs in a text.

    # Frequency distribution: how the total number of word tokens
    # is distributed across the vocabulary items
    fdist1 = FreqDist(text1)
    print(fdist1)
    print(fdist1.most_common(50))
    print(fdist1['whale'])  # number of occurrences of "whale"
    # Cumulative frequency plot of the 50 most frequent words in Moby Dick
    fdist1.plot(50, cumulative=True)
    # Words that occur only once (hapaxes)
    print(fdist1.hapaxes())

Output:

    <FreqDist with 19317 samples and 260819 outcomes>
    [(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
    906
    ['Herman', 'Melville', ']', 'ETYMOLOGY', 'Late', 'Consumptive', 'School', 'threadbare', 'lexicons', 'mockingly', 'flags', 'mortality', 'signification', 'HACKLUYT', 'Sw',...'suction', 'closing', 'Ixion', 'Till', 'liberated', 'Buoyed', 'dirgelike', 'padlocks', 'sheathed', 'retracing', 'orphan']

Figure: cumulative frequency plot of the 50 most frequent words in Moby Dick. Together these words account for nearly half of all tokens.
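To see what these FreqDist calls compute without loading any corpus, here is a minimal sketch built on collections.Counter from the standard library (in NLTK 3, FreqDist builds on Counter). The token list is made up and merely stands in for text1; hapaxes and coverage are names chosen for this example.

```python
from collections import Counter

# Toy token list standing in for text1 (an assumption for illustration).
tokens = "the whale and the sea and the whale saw the white whale".split()

fdist = Counter(tokens)          # maps each word to its number of occurrences
top2 = fdist.most_common(2)      # the two most frequent words with their counts

# Words that occur exactly once, analogous to FreqDist.hapaxes()
hapaxes = [word for word, count in fdist.items() if count == 1]

# Share of all tokens covered by the two most frequent words,
# i.e. the value a cumulative frequency plot would reach at rank 2
coverage = sum(count for _, count in top2) / len(tokens)

print(top2)
print(hapaxes)
print(coverage)
```

The same cumulative share, computed for the top 50 words of Moby Dick, is what the plot above visualizes.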
2. Fine-grained selection of words
(1) Selecting words longer than 15 characters

    V = set(text1)
    long_words = [w for w in V if len(w) > 15]
    print(sorted(long_words), '\n')

Output:

    ['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

(2) Frequently occurring long words

    # All words longer than 7 characters that occur more than 7 times
    fdist5 = FreqDist(text5)
    long_words1 = [w for w in set(text5) if len(w) > 7 and fdist5[w] > 7]
    print(long_words1, '\n')

Output:

    ['remember', '((((((((((', 'listening', '#talkcity_adults', 'actually', 'football', 'seriously', 'something', 'innocent', 'everyone', 'Question', 'watching', '#14-19teens', 'anything', 'computer', 'tomorrow', 'together', '........', 'cute.-ass']

(3) Extracting word pairs (bigrams)

    bigrams_words = bigrams(['more', 'is', 'said', 'than', 'done'])
    print(list(bigrams_words), '\n')

Output:

    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

(4) Extracting frequent bigrams (collocations) from a text
collocations(): extracts the bigrams that occur frequently in a text.
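As a rough illustration of what counting frequently occurring bigrams involves, here is a standard-library sketch on a made-up token list. It uses raw counts only. NLTK's collocations() is more selective (the book describes collocations as word sequences that occur together unusually often), so this is not a reimplementation of it, and the names below are chosen for this example.

```python
from collections import Counter

# Made-up tokens for illustration; on real data this would be a text such as text4.
tokens = "fellow citizens of the United States , fellow citizens of the world".split()

# Pair each token with its successor, which is what a bigram is
pairs = list(zip(tokens, tokens[1:]))

# Count how often each pair occurs and keep the ones seen more than once
bigram_counts = Counter(pairs)
frequent = [pair for pair, count in bigram_counts.items() if count > 1]

print(frequent)
```

Raw counts tend to favor pairs of very common words such as ('of', 'the'), which is one reason a dedicated collocation finder is preferable on real text.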

    # collocations() prints its result itself and returns None,
    # so it is called directly rather than wrapped in print()
    text4.collocations()

Output:

    United States; fellow citizens; four years; years ago; Federal
    Government; General Government; American people; Vice President; Old
    World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
    God bless; every citizen; Indian tribes; public debt; one another;
    foreign nations; political parties

3. Counting other things
(1) Distribution of word lengths in a text

    # Count how many tokens there are of each word length
    fdist = FreqDist(len(w) for w in text1)
    print(fdist)
    print(fdist.most_common())
    # The most frequent word length
    print(fdist.max())
    # Number of tokens of length 3
    print(fdist[3])
    # Proportion of tokens that have length 3
    print(fdist.freq(3))

Output:

    <FreqDist with 19 samples and 260819 outcomes>
    [(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399), (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
    3
    50223
    0.19255882431878046

Analysis: the most frequent word length is 3. More than 50,000 tokens have length 3, which is about 20% of all tokens in the book.
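The same kind of analysis can be sketched with collections.Counter on a short made-up token list, which makes the arithmetic easy to check by hand. The tokens and variable names are assumptions for illustration; the relative frequency computed at the end corresponds to what fdist.freq() reports.

```python
from collections import Counter

# Toy tokens standing in for text1 (an assumption for illustration).
tokens = "Call me Ishmael . Some years ago , never mind how long".split()

# Count how many tokens there are of each length
length_counts = Counter(len(w) for w in tokens)

# Most frequent length and its count, analogous to fdist.max() and fdist[length]
most_common_length, count = length_counts.most_common(1)[0]

# Relative frequency of that length, analogous to fdist.freq(length)
relative = count / len(tokens)

print(most_common_length, count, relative)
```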
(2) [w for w in text if condition]

    # Words ending in "ableness"
    print(sorted(w for w in set(text1) if w.endswith('ableness')))
    # Words containing "gnt"
    print(sorted(term for term in set(text4) if 'gnt' in term))
    # Titlecase words (initial capital followed by lowercase letters)
    print(sorted(item for item in set(text6) if item.istitle()))
    # Tokens made up entirely of digits
    print(sorted(item for item in set(sent7) if item.isdigit()))
    # Tokens that are not entirely lowercase
    print(sorted(w for w in set(sent7) if not w.islower()))
    # Convert every word to uppercase
    print([w.upper() for w in text1])
    # Drop the non-alphabetic tokens of text1, lowercase the rest,
    # remove duplicates, and count what is left
    print(len(set(word.lower() for word in text1 if word.isalpha())))

Output:

    ['comfortableness', 'honourableness', 'immutableness', 'indispensableness', 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness', 'uncomfortableness']
    ['Sovereignty', 'sovereignties', 'sovereignty']
    ['A', 'Aaaaaaaaah', ... , 'Woa', 'Wood', 'Would', 'Y', 'Yapping', 'Yay', 'Yeaaah', 'Yeaah', 'Yeah', 'Yes', 'You', 'Your', 'Yup', 'Zoot']
    ['29', '61']
    [',', '.', '29', '61', 'Nov.', 'Pierre', 'Vinken']
    ['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', '(', 'SUPPLIED', 'BY', 'SHARP', 'BLEAK', 'CORNER', ',', 'WHERE', ... , 'WILD', 'OATS', 'IN', 'ALL', 'FOUR', 'OCEANS', '.', 'THEY', 'HAD', 'MADE', 'A', 'HARP

(3) Conditionals and loops
Example 1:

    for token in sent1:
        if token.islower():
            print(token, 'is a lowercase word')
        elif token.istitle():
            print(token, 'is a titlecase word')
        else:
            print(token, 'is punctuation')

Output:

    Call is a titlecase word
    me is a lowercase word
    Ishmael is a titlecase word
    . is punctuation

Example 2:

    tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
    for word in tricky:
        # Print on one line, separated by spaces instead of newlines
        print(word, end=' ')

Output:

    ancient ceiling conceit conceited conceive conscience conscientious conscientiously deceitful deceive deceived deceiving deficiencies deficiency deficient delicacies excellencies fancied insufficiency insufficient legacies perceive perceived perceiving prescience prophecies receipt receive received receiving society species sufficient

III. Understanding Natural Language
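Key points: information extraction, inference, and summarization.
IV. Exercises
1. What is the difference between the following two lines? Which one gives the larger value? Is this also the case for other texts?

    sorted(set([w.lower() for w in text1]))
    sorted([w.lower() for w in set(text1)])

The second is larger. The first lowercases every token and then applies set, so identical elements are kept only once. The second applies set first, so variants of a word that differ only in case are all kept; lowercasing them afterwards produces repeated lowercase entries. The same holds for any text: the second list can never be shorter than the first.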
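The difference is easy to verify on a tiny made-up token list, where both results can be checked by hand. The list and variable names are assumptions for illustration and stand in for text1.

```python
# Toy tokens with several case variants of the same words.
tokens = ['The', 'whale', 'the', 'Whale', 'THE', 'sea']

# Lowercase first, then deduplicate: each word appears once.
lower_then_set = sorted(set(w.lower() for w in tokens))

# Deduplicate first, then lowercase: case variants survive the set
# and turn into repeated lowercase entries.
set_then_lower = sorted(w.lower() for w in set(tokens))

print(lower_then_set)
print(set_then_lower)
```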
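2. What is the difference between the tests w.isupper() and not w.islower()?
w.isupper(): true only if w contains at least one letter and all of its letters are uppercase.
not w.islower(): true whenever w is not entirely lowercase. This covers all-uppercase words, but also mixed-case words such as 'Call' and tokens with no letters at all, such as digits or punctuation.
3. Write a slice expression that extracts the last two words of text2.

    text2[-2:]  # ['THE', 'END']

4. Find all the four-letter words in the Chat Corpus (text5). With the help of a frequency distribution (FreqDist), show these words in decreasing order of frequency.

    fdist = FreqDist([w for w in text5 if len(w) == 4])
    print(fdist.most_common())

Output:

    [('JOIN', 1021), ('PART', 1016), ('that', 274), ('what', 183), ('here', 181), ('....', 170), ('have', 164), ('like', 156), ('with', 152), ('chat', 142), ('your', 137), ('good', 130), ('just', 125), ('lmao', 107), ..., ('brwn', 1), ('hurr', 1), ('Were', 1)]

5. Write expressions that find all the words in text6 meeting each of the conditions below. The result should be a list of words: ['word1', 'word2', ...].
- Ending in ize
- Containing the letter z
- Containing the letter sequence pt
- Having all lowercase letters except for an initial capital (i.e., titlecase)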
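The notes give no answer for exercise 5. One possible set of expressions is sketched below on a made-up token list, so that it runs without NLTK; with the book's data loaded, tokens would simply be text6. All names here are chosen for this example, and the test for the letter z is assumed to mean lowercase z only.

```python
# Made-up tokens standing in for text6 (an assumption for illustration).
tokens = ['Camelot', 'prize', 'zoot', 'Zoot', 'empty', 'Sir', 'RUN', 'capsize', 'swept', '.']
vocab = set(tokens)  # work on distinct tokens so each word is listed once

ends_in_ize = sorted(w for w in vocab if w.endswith('ize'))
contains_z = sorted(w for w in vocab if 'z' in w)    # use w.lower() to also match 'Z'
contains_pt = sorted(w for w in vocab if 'pt' in w)
titlecase = sorted(w for w in vocab if w.istitle())

print(ends_in_ize)
print(contains_z)
print(contains_pt)
print(titlecase)
```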