[Natural Language Processing with Python, 2nd Edition] Reading Notes 2: Accessing Text Corpora and Lexical Resources
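Contents
- I. Accessing Text Corpora
- 1. Gutenberg Corpus
- (1) Listing the file identifiers in the corpus
- (2) Word counts and concordance
- (3) Text statistics
- 2. Web and Chat Text
- 3. Brown Corpus
- (1) Overview
- (2) Comparing modal verb usage across genres
- 4. Reuters Corpus
- (1) Overview
- (2) Looking up by topic and fileids
- (3) Retrieving words or sentences by document or category
- 5. Inaugural Address Corpus
- (1) Overview
- (2) Conditional frequency distribution plot
- 6. Annotated Text Corpora
- 7. Corpora in Other Languages
- 8. Text Corpus Structure
- 9. Loading Your Own Corpus
- (1) PlaintextCorpusReader
- (2) BracketParseCorpusReader
- II. Conditional Frequency Distributions
- 1. Conditions and Events
- 2. Counting Words by Genre
- 3. Plotting and Tabulating Distributions
- 4. Generating Random Text with Bigrams
- III. Code Reuse (Python)
- IV. Lexical Resources
- 1. Wordlist Corpora
- (1) Filtering a text
- (2) Stopwords corpus
- (3) A word puzzle
- (4) Names corpus
- 2. A Pronouncing Dictionary
- 3. Comparative Wordlists
- V. WordNet
- 1. Senses and Synonyms
- 2. The Hierarchy
- 3. More Lexical Relations
- 4. Semantic Similarity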
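Practical work in natural language processing relies on large bodies of linguistic data, known as corpora.
I. Accessing Text Corpora
1. Gutenberg Corpus
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which hosts about 25,000 free e-books.
(1) Listing the file identifiers in the corpus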
import nltk

# List the file identifiers in the corpus
print(nltk.corpus.gutenberg.fileids())

Output:
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
(2) Word counts and concordance
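import nltk
from nltk.corpus import gutenberg

# Number of tokens in Emma
emma = gutenberg.words('austen-emma.txt')
print(len(emma))

# Wrap the words in nltk.Text to get concordance support
emma = nltk.Text(gutenberg.words('austen-emma.txt'))
# concordance() prints its matches itself and returns None, so it is not wrapped in print()
emma.concordance("surprize")

Output: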
192427
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
expected by the best judges , for surprize -- but there was great joy . Mr .
sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
. It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai
(3) Text statistics
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    # raw(): the contents of the file without any linguistic processing
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    # sents(): divides the text into sentences, each of which is a list of words
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print('Average word length:', round(num_chars/num_words),
          'Average sentence length:', round(num_words/num_sents),
          'Average occurrences per word:', round(num_words/num_vocab),
          'from:', fileid)

Output:
Average word length: 5 Average sentence length: 25 Average occurrences per word: 26 from: austen-emma.txt
Average word length: 5 Average sentence length: 26 Average occurrences per word: 17 from: austen-persuasion.txt
Average word length: 5 Average sentence length: 28 Average occurrences per word: 22 from: austen-sense.txt
Average word length: 4 Average sentence length: 34 Average occurrences per word: 79 from: bible-kjv.txt
Average word length: 5 Average sentence length: 19 Average occurrences per word: 5 from: blake-poems.txt
Average word length: 4 Average sentence length: 19 Average occurrences per word: 14 from: bryant-stories.txt
Average word length: 4 Average sentence length: 18 Average occurrences per word: 12 from: burgess-busterbrown.txt
Average word length: 4 Average sentence length: 20 Average occurrences per word: 13 from: carroll-alice.txt
Average word length: 5 Average sentence length: 20 Average occurrences per word: 12 from: chesterton-ball.txt
Average word length: 5 Average sentence length: 23 Average occurrences per word: 11 from: chesterton-brown.txt
Average word length: 5 Average sentence length: 18 Average occurrences per word: 11 from: chesterton-thursday.txt
Average word length: 4 Average sentence length: 21 Average occurrences per word: 25 from: edgeworth-parents.txt
Average word length: 5 Average sentence length: 26 Average occurrences per word: 15 from: melville-moby_dick.txt
Average word length: 5 Average sentence length: 52 Average occurrences per word: 11 from: milton-paradise.txt
Average word length: 4 Average sentence length: 12 Average occurrences per word: 9 from: shakespeare-caesar.txt
Average word length: 4 Average sentence length: 12 Average occurrences per word: 8 from: shakespeare-hamlet.txt
Average word length: 4 Average sentence length: 12 Average occurrences per word: 7 from: shakespeare-macbeth.txt
Average word length: 5 Average sentence length: 36 Average occurrences per word: 12 from: whitman-leaves.txt

The output shows three statistics for each text: average word length, average sentence length, and the average number of times each word occurs in the text (our lexical diversity score). Average word length appears to be a general property of English, since it comes out as 4 or 5 for every text. (The true average is in fact about one lower than the figure shown, because the num_chars variable counts whitespace characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of individual authors.
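The three statistics can also be checked by hand on a tiny text. The sketch below is only an illustration using the standard library: the sample sentence, the pre-tokenised `sentences` list, and the helper name `text_stats` are all made up for this example and are not part of NLTK.

```python
def text_stats(raw, sentences):
    """Return (avg word length, avg sentence length, avg occurrences per word).

    raw       -- the unprocessed text as a single string
    sentences -- the same text as a list of sentences, each a list of tokens
    """
    words = [w for sent in sentences for w in sent]
    vocab = set(w.lower() for w in words)
    return (len(raw) / len(words),        # characters (spaces included) per token
            len(words) / len(sentences),  # tokens per sentence
            len(words) / len(vocab))      # tokens per distinct lowercased token


# Hypothetical toy data standing in for gutenberg.raw() and gutenberg.sents()
raw = "The cat sat. The dog sat."
sentences = [["The", "cat", "sat", "."], ["The", "dog", "sat", "."]]

avg_word_len, avg_sent_len, avg_repeats = text_stats(raw, sentences)
print(avg_word_len, avg_sent_len, avg_repeats)

# Counting only the characters inside tokens gives a smaller figure,
# which is the whitespace effect mentioned in the text.
words = [w for sent in sentences for w in sent]
token_only_len = sum(len(w) for w in words) / len(words)
print(token_only_len)
```

Here the raw-character average (25 characters over 8 tokens) is larger than the token-only average, because the raw string also contains the spaces between words.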
2. Web and Chat Text
- from nltk.corpus import webtext: a small collection of web text
- from nltk.corpus import nps_chat: a corpus of instant-messaging chat sessions
Output:
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop] KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening?Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
3. Brown Corpus
Created at Brown University in 1961, this was the first million-word electronic corpus of English; it contains text from 500 different sources.
(1) Overview
Accessing example documents from each section of the Brown Corpus:
from nltk.corpus import brown

print(brown.categories())
print(brown.words(categories='news'))
print(brown.words(fileids=['cg22']))
# NOTE: 'new' looks like a typo for 'news'. The output below starts with an
# editorial sentence, so the unknown category was apparently ignored; some
# NLTK versions may instead raise an error for an unknown category.
print(brown.sents(categories=['new', 'editorial', 'reviews']))

Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
[['Assembly', 'session', 'brought', 'much', 'good'], ['The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '.'], ...]
(2) Comparing modal verb usage across genres
Comparing modal verbs within a single genre:
import nltk
from nltk.corpus import brown

news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')

Output:
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

Frequency distributions of the words of interest across different genres:
# Conditional frequency distribution: one frequency distribution per genre
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
# The genres we want to display
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
# The words we want to count
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
cfd.plot(conditions=genres, samples=modals)

Output:
                  can could   may might  must  will
           news    93    86    66    38    50   389
       religion    82    59    78    12    54    71
        hobbies   268    58   131    22    83   264
science_fiction    16    49     4    12     8    16
        romance    74   193    11    51    45    43
          humor    16    30     8     8     9    13
4. Reuters Corpus
The corpus contains 10,788 news documents totalling 1.3 million words, classified into 90 topics and split into two groups, "training" and "test".
(1) Overview
from nltk.corpus import reuters

print(reuters.fileids())
print(reuters.categories())  # topics
print("\n")

Output:
['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843', ..., 'test/21567', 'test/21568', 'test/21570', 'test/21571', 'test/21573', 'test/21574', 'test/21575', 'test/21576', 'training/1', 'training/10', 'training/100', 'training/1000', 'training/10000', 'training/10002', 'training/10005', ..., 'training/9988', 'training/9989', 'training/999', 'training/9992', 'training/9993', 'training/9994', 'training/9995']
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', ..., 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
(2) Looking up by topic and fileids
Categories in the Reuters Corpus overlap with each other, because a news story often covers several topics.
# Topics covered by the given file id(s)
print(reuters.categories('training/9865'))
print(reuters.categories(['training/9865', 'training/9880']))
print("\n")
# File ids that belong to the given topic(s)
print(reuters.fileids('barley'))
print(reuters.fileids(['barley', 'corn']))
print("\n")

Output:
['barley', 'corn', 'grain', 'wheat']
['barley', 'corn', 'grain', 'money-fx', 'wheat']

['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', ..., 'training/8257', 'training/8759', 'training/9865', 'training/9958']
['test/14832', 'test/14858', 'test/15033', 'test/15043', 'test/15106', 'test/15287', 'test/15341', 'test/15618', 'test/15648', ..., 'training/9058', 'training/9093', 'training/9094', 'training/934', 'training/9470', 'training/9521', 'training/9667', 'training/97', 'training/9865', 'training/9958', 'training/9989']
(3) Retrieving words or sentences by document or category
print(reuters.words('training/9865')[:14])
print(reuters.words(['training/9865', 'training/9880']))
print(reuters.words(categories='barley'))
print(reuters.words(categories=['barley', 'corn']))

Output:
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French', 'operators', 'have', 'requested', 'licences', 'to', 'export']
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', ...]
['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...]
5. Inaugural Address Corpus
(1) Overview
import nltk
from nltk.corpus import inaugural

print(inaugural.fileids())
# The first four characters of each file id are the year of the address
print([fileid[:4] for fileid in inaugural.fileids()])

Output:
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', ..., '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']
(2) Conditional frequency distribution plot
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

Output: a conditional frequency distribution plot.
The plot counts, per address, all words in the Inaugural Address Corpus that begin with america or citizen.
6. Annotated Text Corpora
7. Corpora in Other Languages
import nltk

print(nltk.corpus.cess_esp.words())
print(nltk.corpus.floresta.words())
print(nltk.corpus.indian.words('hindi.pos'))
print(nltk.corpus.udhr.fileids())
print(nltk.corpus.udhr.words('Javanese-Latin1')[11:])

A conditional frequency distribution can be used to examine differences in word length across the language versions of the Universal Declaration of Human Rights (udhr) corpus:
from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)
8. Text Corpus Structure
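Common structures for text corpora:
- isolated: a collection of isolated texts with no particular organization;
- categorized: texts organized into categories;
- overlapping: categories that overlap, such as topic categories (Reuters Corpus);
- temporal: language use that changes over time (Inaugural Address Corpus).
9. Loading Your Own Corpus
(1) PlaintextCorpusReader
PlaintextCorpusReader is suited to plain text files, for example loading the files under corpus_root as a corpus: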
from nltk.corpus import PlaintextCorpusReader

corpus_root = '/usr/share/dict'
# '.*' matches every file under corpus_root
wordlists = PlaintextCorpusReader(corpus_root, '.*')
print(wordlists.fileids())
print(wordlists.words('american-english'))

Output:
['README.select-wordlist', 'american-english', 'british-english', 'cracklib-small', 'words', 'words.pre-dictionaries-common']
['A', 'A', "'", 's', 'AMD', 'AMD', "'", 's', 'AOL', ...]
(2) BracketParseCorpusReader
BracketParseCorpusReader is suited to corpora that have already been parsed:
from nltk.corpus import BracketParseCorpusReader

corpus_root = ''  # path to the corpus directory
file_pattern = r".*/wsj_.*\.mrg"  # pattern matching the files to load
# Initialise the reader with the corpus directory and the file pattern;
# the encoding defaults to utf8
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
print(ptb.fileids())
print(len(ptb.sents()))
# The file id must match file_pattern, so it ends in .mrg
print(ptb.sents(fileids='20/wsj_2013.mrg')[19])
II. Conditional Frequency Distributions
1. Conditions and Events
Each pair has the form (condition, event). If we process the entire Brown Corpus by genre, there will be 15 conditions (one per genre) and 1,161,192 events (one per word).
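To make the (condition, event) idea concrete, here is a minimal standard-library sketch of what a conditional frequency distribution stores: one counter of events per condition. It is not NLTK's implementation, and the pairs and the helper name `conditional_counts` are invented for illustration.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Group (condition, event) pairs into one Counter per condition."""
    cfd = defaultdict(Counter)
    for condition, event in pairs:
        cfd[condition][event] += 1
    return cfd


# Hypothetical toy pairs; a real run would use (genre, word) pairs from a corpus
pairs = [("news", "The"), ("news", "Fulton"), ("news", "The"),
         ("romance", "could"), ("romance", "could"), ("romance", "She")]

cfd = conditional_counts(pairs)
print(sorted(cfd))                    # the conditions
print(cfd["news"]["The"])             # count of one event under one condition
print(cfd["romance"].most_common(1))  # most frequent event for a condition
```

Indexing first by condition and then by event mirrors the cfd[condition][sample] access pattern used with nltk.ConditionalFreqDist in these notes.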
text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
2. Counting Words by Genre
import nltk
from nltk.corpus import brown

# Build the (genre, word) pairs
genre_word = [(genre, word)
              for genre in ['news', 'romance']
              for word in brown.words(categories=genre)]
print(len(genre_word))
print(genre_word[:4])
print(genre_word[-4:], '\n')

# Conditional frequency distribution over the pairs
cfd = nltk.ConditionalFreqDist(genre_word)
print(cfd)
print(cfd.conditions(), '\n')

print(cfd['news'])
print(cfd['romance'])
print(cfd['romance'].most_common(20))
print(cfd['romance']['could'])

Output:
170576
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]

<ConditionalFreqDist with 2 conditions>
['news', 'romance']

<FreqDist with 14394 samples and 100554 outcomes>
<FreqDist with 8452 samples and 70022 outcomes>
[(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502), ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993), ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690), ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
193
3. Plotting and Tabulating Distributions
import nltk
from nltk.corpus import inaugural

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
# Restrict the table to two languages and to word lengths below 10
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples=range(10), cumulative=True)
cfd.plot(cumulative=True)
4. Generating Random Text with Bigrams
Building a generative model from bigrams:
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=" ")
        # Always continue with the most likely next word
        word = cfdist[word].max()

text = nltk.corpus.genesis.words("english-kjv.txt")
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
print(cfd)
print(list(cfd))
print(cfd["so"])
print(cfd["living"])

generate_model(cfd, "so")
generate_model(cfd, "living")

Output:
<ConditionalFreqDist with 2789 conditions>
['In', 'the', 'beginning', 'God', 'created', 'heaven', 'and', 'earth', '.', 'And', 'was', 'without', 'form', ',', 'void', ';', 'darkness', 'upon', 'face', 'of', 'deep', 'Spirit', 'moved', 'waters', 'said', 'Let', 'there', 'be', 'light', ':', 'saw', 'that', 'it', 'good', 'divided', 'from', 'called', 'Day', 'he', ..., 'embalmed', 'past', 'elders', 'chariots', 'horsemen', 'threshingfloor', 'Atad', 'lamentati', 'floor', 'Egyptia', 'Abelmizraim', 'requite', 'messenger', 'Forgive', 'forgive', 'meant', 'Machir', 'visit', 'coffin']
FreqDist({'that': 8, '.': 7, ',': 4, 'the': 3, 'I': 2, 'doing': 2, 'much': 2, ':': 2, 'did': 1, 'Noah': 1, ...})
FreqDist({'creature': 7, 'thing': 4, 'substance': 2, 'soul': 1, '.': 1, ',': 1})
so that he said , and the land of the land of the land of living creature that he said , and the land of the land of
Common methods for conditional frequency distributions
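| cfdist = ConditionalFreqDist(pairs) | Create a conditional frequency distribution from a list of pairs |
| cfdist.conditions() | The conditions (sorted alphabetically, according to the book) |
| cfdist[condition] | The frequency distribution for this condition |
| cfdist[condition][sample] | The frequency of the given sample for this condition |
| cfdist.tabulate() | Tabulate the conditional frequency distribution |
| cfdist.tabulate(samples, conditions) | Tabulate, limited to the specified samples and conditions |
| cfdist.plot() | Plot the conditional frequency distribution |
| cfdist.plot(samples, conditions) | Plot, limited to the specified samples and conditions |
| cfdist1 < cfdist2 | Test whether samples occur less frequently in cfdist1 than in cfdist2 |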
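The greedy generation loop from "Generating Random Text with Bigrams" can be reproduced without NLTK to see why its output gets stuck repeating "the land of". The following is a rough standard-library sketch, not NLTK's code: the toy sentence and the helper names `bigram_cfd` and `generate` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def bigram_cfd(words):
    """Map each word to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        cfd[first][second] += 1
    return cfd


def generate(cfd, word, num=8):
    """Greedily follow the most frequent successor, at most num words."""
    out = []
    for _ in range(num):
        out.append(word)
        followers = cfd.get(word)
        if not followers:  # the word was never followed by anything
            break
        word = followers.most_common(1)[0][0]
    return out


# Hypothetical toy text; the notes use the Genesis corpus instead
words = "the land of the living and the land of the sea".split()
cfd = bigram_cfd(words)
print(" ".join(generate(cfd, "the")))
```

Because the choice is always the single most likely successor, the walk is deterministic: as soon as it revisits a word it has already passed through, it repeats the same cycle forever. Sampling the next word in proportion to its count would be one way to avoid this, at the cost of less predictable output.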
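III. Code Reuse (Python)
- Functions and methods
- Module: a collection of variable and function definitions in a single file. Importing the file gives access to the functions defined in it.
- Package: a collection of related modules.
Note: when Python imports a module, it looks first in the current directory (folder).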
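As a small illustration of modules, the sketch below writes a one-function module to a temporary folder, puts that folder at the front of the import search path, and imports it. The module name `text_utils_demo` and its `plural` function are hypothetical, and the temporary folder stands in for "the current directory" so the example does not leave files behind in your working folder.

```python
import importlib
import os
import sys
import tempfile

# Create a tiny module file (hypothetical name) in a temporary directory
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "text_utils_demo.py")
with open(module_path, "w") as f:
    f.write(
        "def plural(word):\n"
        "    # Very rough pluralisation rule, for demonstration only\n"
        "    if word.endswith(('s', 'x')):\n"
        "        return word + 'es'\n"
        "    return word + 's'\n"
    )

# Directories earlier in sys.path are searched first
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

# Importing the file by its name (without .py) exposes its functions
text_utils_demo = importlib.import_module("text_utils_demo")
print(text_utils_demo.plural("fox"), text_utils_demo.plural("word"))
```

In everyday use the module file would simply sit next to the script that imports it, and a plain `import text_utils_demo` statement would be enough.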
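IV. Lexical Resources
A lexicon, or lexical resource, is a collection of words and/or phrases together with some associated information. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts.
Lexicon terminology: two lexical entries that are spelled the same but have different meanings are homonyms. Each lexical entry consists of a headword (also called a lemma) plus additional information, such as the part of speech and a gloss.
1. Wordlist Corpora
The Words Corpus is the /usr/share/dict/words file from Unix, which is used by some spell checkers. We can use it to find unusual or misspelt words in a text corpus.
(1) Filtering a text
This program computes the vocabulary of a text, then removes every item that occurs in an existing wordlist, leaving only the uncommon or misspelt words.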
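The listing for this step is not included in these notes, so here is a minimal sketch of the filtering idea using plain Python sets. The function signature and the toy word lists are assumptions made for illustration; with a real wordlist and a real text the result would look like the output shown next, not like this toy result.

```python
def unusual_words(text, known_words):
    """Return the sorted words of text that are absent from known_words."""
    # Vocabulary of the text: alphabetic tokens only, lowercased
    text_vocab = set(w.lower() for w in text if w.isalpha())
    known_vocab = set(w.lower() for w in known_words)
    # Set difference removes everything the wordlist already knows
    return sorted(text_vocab - known_vocab)


# Hypothetical toy data standing in for a corpus and a wordlist
known_words = ["the", "surprise", "was", "great", "a"]
text = ["The", "surprize", "was", "great", ",", "a", "witticism", "!"]

result = unusual_words(text, known_words)
print(result)
```

Lowercasing both sides keeps "The" from being reported as unusual, and the isalpha() check drops punctuation tokens before the comparison.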
Output:
['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 'accents', 'accepting', 'accommodations', ..., 'wiping', 'wisest', 'wishes', 'withdrew', 'witnessed', 'witnesses', 'witnessing', 'witticisms', 'wittiest', 'wives', 'women', 'wondered', 'woods', 'words', 'workmen', 'worlds', 'wrapt', 'writes', 'yards', 'years', 'yielded', 'youngest']
['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack', 'acros', 'actualy', ..., 'yuuuuuuuuuuuummmmmmmmmmmm', 'yvw', 'yw', 'zebrahead', 'zoloft', 'zyban', 'zzzzzzzing', 'zzzzzzzz']
(2) Stopwords corpus
Stopwords are high-frequency words that usually carry little lexical content, such as the, to and also.
import nltk
from nltk.corpus import stopwords

print(stopwords.words('english'))

# Compute the fraction of words in a text that are not in the stopwords list
def content_fraction(text):
    stop_words = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stop_words]
    return len(content) / len(text)

frac = content_fraction(nltk.corpus.reuters.words())
print(frac)

Output:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", ..., 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
0.735240435097661
(3) A word puzzle
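Given a grid of randomly chosen letters, the task is to form words from the letters in it; this puzzle is known as "Target".
Requirements:
- each word has at least 6 letters
- each word must contain the centre letter
- each letter may be used only once per word
Output:
['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']
FreqDist comparison lets us check that the frequency of each letter in a candidate word is less than or equal to the frequency of the corresponding letter in the puzzle.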
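The letter-frequency check can be sketched with collections.Counter in place of FreqDist. The puzzle letters 'egivrvonl' and the obligatory centre letter 'r' are assumed here to be the ones from the book's example (they are not stated in these notes), and the short candidate list is made up so the sketch runs without a wordlist corpus.

```python
from collections import Counter


def puzzle_words(wordlist, letters, obligatory, min_len=6):
    """Return the words that can be spelled from letters under the puzzle rules."""
    available = Counter(letters)

    def fits(word):
        # Every letter must be used no more often than the grid provides it;
        # a Counter returns 0 for letters that are not in the grid at all.
        return all(n <= available[ch] for ch, n in Counter(word).items())

    return [w for w in wordlist
            if len(w) >= min_len and obligatory in w and fits(w)]


# Hypothetical candidates; a real run would scan a full wordlist
candidates = ["govern", "ignore", "region", "revolving",
              "lover", "engine", "violin"]
solutions = puzzle_words(candidates, "egivrvonl", "r")
print(solutions)
```

In this toy run "lover" is rejected for being too short, "engine" needs two e's where the grid has only one, and "violin" lacks the centre letter, so only the four remaining candidates survive.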
(4) Names corpus
The corpus contains 8,000 first names categorized by gender. Male and female names are stored in separate files.
import nltk

names = nltk.corpus.names
print(names.fileids())
male_names = names.words('male.txt')
female_names = names.words('female.txt')

# Names that appear in both files
print([w for w in male_names if w in female_names])

# Conditional frequency distribution of the final letter of each name, by file
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()

Output:
['female.txt', 'male.txt']
['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel', ..., 'Ted', 'Teddie', 'Teddy', 'Terri', 'Terry', 'Theo', 'Tim', 'Timmie', 'Timmy', 'Tobe', 'Tobie', 'Toby', 'Tommie', 'Tommy', 'Tony', 'Torey', 'Trace', 'Tracey', 'Tracie', 'Tracy', 'Val', 'Vale', 'Valentine', 'Van', 'Vin', 'Vinnie', 'Vinny', 'Virgie', 'Wallie', 'Wallis', 'Wally', 'Whitney', 'Willi', 'Willie', 'Willy', 'Winnie', 'Winny', 'Wynn']
The conditional frequency distribution plot shows the final letters of male and female names: most names ending in a, e or i are female; names ending in h and l are equally likely to be male or female; names ending in k, o, r, s and t are more likely to be male.
2. A Pronouncing Dictionary
The CMU Pronouncing Dictionary (cmudict) was designed for use by speech synthesizers.
import nltk

entries = nltk.corpus.cmudict.entries()
print(len(entries))
for entry in entries[39943:39951]:
    print(entry)

# Three-phone words that start with P and end with T: print word and middle phone
for word, pron in entries:
    if len(pron) == 3:
        ph1, ph2, ph3 = pron
        if ph1 == 'P' and ph3 == 'T':
            print(word, ph2, end=' ')

Output:
133737
('explorer', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0'])
('explorers', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'ER0', 'Z'])
('explores', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'Z'])
('exploring', ['IH0', 'K', 'S', 'P', 'L', 'AO1', 'R', 'IH0', 'NG'])
('explosion', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N'])
('explosions', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'ZH', 'AH0', 'N', 'Z'])
('explosive', ['IH0', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V'])
('explosively', ['EH2', 'K', 'S', 'P', 'L', 'OW1', 'S', 'IH0', 'V', 'L', 'IY0'])
pait EY1 pat AE1 pate EY1 patt AE1 peart ER1 peat IY1 peet IY1 peete IY1 pert ER1 pet EH1 pete IY1 pett EH1 piet IY1 piette IY1 pit IH1 pitt IH1 pot AA1 pote OW1 pott AA1 pout AW1 puett UW1 purt ER1 put UH1 putt AH1

Find all words whose pronunciation ends like nicks:
syllable = ['N', 'IH0', 'K', 'S']
print([word for word, pron in entries if pron[-4:] == syllable])

Output:
["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chamonix', 'chetniks', "clinic's", 'clinics', 'conics', 'conics', 'cryogenics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", "endotronics'", 'endotronics', 'enix', 'environics', 'ethnics', 'eugenics', 'fibronics', 'flextronics', 'harmonics', 'hispanics', 'histrionics', 'identics', 'ionics', 'kibbutzniks', 'lasersonics', 'lumonics', 'mannix', 'mechanics', "mechanics'", 'microelectronics', 'minix', 'minnix', 'mnemonics', 'mnemonics', 'molonicks', 'mullenix', 'mullenix', 'mullinix', 'mulnix', "munich's", 'nucleonics', 'onyx', 'organics', "panic's", 'panics', 'penix', 'pennix', 'personics', 'phenix', "philharmonic's", 'phoenix', 'phonics', 'photronics', 'pinnix', 'plantronics', 'pyrotechnics', 'refuseniks', "resnick's", 'respironics', 'sconnix', 'siliconix', 'skolniks', 'sonics', 'sputniks', 'technics', 'tectonics', 'tektronix', 'telectronics', 'telephonics', 'tonics', 'unix', "vinick's", "vinnick's", 'vitronics']
3. Comparative Wordlists
The Swadesh wordlists (core vocabulary lists)
from nltk.corpus import swadesh

print(swadesh.fileids())
print(swadesh.words('en'))

# entries(): pass a list of languages to get cognate words across them
fr2en = swadesh.entries(['fr', 'en'])
print(fr2en)
translate = dict(fr2en)
print(translate['chien'])
print(translate['jeter'])

de2en = swadesh.entries(['de', 'en'])  # German-English
es2en = swadesh.entries(['es', 'en'])  # Spanish-English
translate.update(dict(de2en))
translate.update(dict(es2en))
print(translate['Hund'])
print(translate['perro'])

languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
for i in [139, 140, 141, 142]:
    print(swadesh.entries(languages)[i])

Output:
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']
['I', 'you (singular), thou', 'he', 'we', 'you (plural)', 'they', 'this', 'that', 'here', 'there', 'who', 'what', 'where', 'when', 'how', 'not', 'all', 'many', 'some', 'few', 'other', 'one', 'two', 'three', 'four', 'five', 'big', 'long', 'wide', 'thick', 'heavy', 'small', 'short', 'narrow', 'thin', 'woman', 'man (adult male)', 'man (human being)', 'child', 'wife', 'husband', 'mother', 'father', 'animal', 'fish', 'bird', 'dog', 'louse', 'snake', 'worm', 'tree', 'forest', 'stick', 'fruit', 'seed', 'leaf', 'root', 'bark (from tree)', 'flower', 'grass', 'rope', 'skin', 'meat', 'blood', 'bone', 'fat (noun)', 'egg', 'horn', 'tail', 'feather', 'hair', 'head', 'ear', 'eye', 'nose', 'mouth', 'tooth', 'tongue', 'fingernail', 'foot', 'leg', 'knee', 'hand', 'wing', 'belly', 'guts', 'neck', 'back', 'breast', 'heart', 'liver', 'drink', 'eat', 'bite', 'suck', 'spit', 'vomit', 'blow', 'breathe', 'laugh', 'see', 'hear', 'know (a fact)', 'think', 'smell', 'fear', 'sleep', 'live', 'die', 'kill', 'fight', 'hunt', 'hit', 'cut', 'split', 'stab', 'scratch', 'dig', 'swim', 'fly (verb)', 'walk', 'come', 'lie', 'sit', 'stand', 'turn', 'fall', 'give', 'hold', 'squeeze', 'rub', 'wash', 'wipe', 'pull', 'push', 'throw', 'tie', 'sew', 'count', 'say', 'sing', 'play', 'float', 'flow', 'freeze', 'swell', 'sun', 'moon', 'star', 'water', 'rain', 'river', 'lake', 'sea', 'salt', 'stone', 'sand', 'dust', 'earth', 'cloud', 'fog', 'sky', 'wind', 'snow', 'ice', 'smoke', 'fire', 'ashes', 'burn', 'road', 'mountain', 'red', 'green', 'yellow', 'white', 'black', 'night', 'day', 'year', 'warm', 'cold', 'full', 'new', 'old', 'good', 'bad', 'rotten', 'dirty', 'straight', 'round', 'sharp', 'dull', 'smooth', 'wet', 'dry', 'correct', 'near', 'far', 'right', 'left', 'at', 'in', 'with', 'and', 'if', 'because', 'name']
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they'), ('ceci', 'this'), ('cela', 'that'), ('ici', 'here'), ('là', 'there'), ('qui', 'who'), ('quoi', 'what'), ('où', 'where'), ('quand', 'when'), ('comment', 'how'), ('ne...pas', 'not'), ('tout', 'all'), ('plusieurs', 'many'), ... , ('sec', 'dry'), ('juste, correct', 'correct'), ('proche', 'near'), ('loin', 'far'), ('à droite', 'right'), ('à gauche', 'left'), ('à', 'at'), ('dans', 'in'), ('avec', 'with'), ('et', 'and'), ('si', 'if'), ('parce que', 'because'), ('nom', 'name')]
dog
throw
dog
dog
('say', 'sagen', 'zeggen', 'decir', 'dire', 'dizer', 'dicere')
('sing', 'singen', 'zingen', 'cantar', 'chanter', 'cantar', 'canere')
('play', 'spielen', 'spelen', 'jugar', 'jouer', 'jogar, brincar', 'ludere')
('float', 'schweben', 'zweven', 'flotar', 'flotter', 'flutuar, boiar', 'fluctuare')
V. WordNet
WordNet is a semantically oriented dictionary of English, with 155,287 words and 117,659 synonym sets (synsets).
1. Senses and Synonyms
from nltk.corpus import wordnet as wn

print(wn.synsets('motorcar'))
# The words (lemmas) that share this meaning
print(wn.synset('car.n.01').lemma_names())
# The definition of this synset
print(wn.synset('car.n.01').definition())
# Example sentences for this synset
print(wn.synset('car.n.01').examples())

# All lemmas of the given synset
print(wn.synset('car.n.01').lemmas())
# Look up one particular lemma
print(wn.lemma('car.n.01.automobile'))
# The synset that a lemma belongs to
print(wn.lemma('car.n.01.automobile').synset())
# The name of a lemma
print(wn.lemma('car.n.01.automobile').name(), '\n')

print(wn.synsets('car'))
for synset in wn.synsets('car'):
    print(synset.lemma_names())

print(wn.lemmas('car'))

Output:
[Synset('car.n.01')]
['car', 'auto', 'automobile', 'machine', 'motorcar']
a motor vehicle with four wheels; usually propelled by an internal combustion engine
['he needs a car to get to work']
[Lemma('car.n.01.car'), Lemma('car.n.01.auto'), Lemma('car.n.01.automobile'), Lemma('car.n.01.machine'), Lemma('car.n.01.motorcar')]
Lemma('car.n.01.automobile')
Synset('car.n.01')
automobile

[Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
['car', 'auto', 'automobile', 'machine', 'motorcar']
['car', 'railcar', 'railway_car', 'railroad_car']
['car', 'gondola']
['car', 'elevator_car']
['cable_car', 'car']
[Lemma('car.n.01.car'), Lemma('car.n.02.car'), Lemma('car.n.03.car'), Lemma('car.n.04.car'), Lemma('cable_car.n.01.car')]
2. The Hierarchy
WordNet synsets correspond to abstract concepts, and they do not always have a corresponding word in English. These concepts are linked to one another in a hierarchy.
Output:
Synset('stanley_steamer.n.01')
['Model_T', 'S.U.V.', 'SUV', 'Stanley_Steamer', 'ambulance', 'beach_waggon', 'beach_wagon', 'bus', 'cab', 'compact', 'compact_car', 'convertible', 'coupe', 'cruiser', 'electric', 'electric_automobile', 'electric_car', 'estate_car', 'gas_guzzler', 'hack', 'hardtop', 'hatchback', 'heap', 'horseless_carriage', 'hot-rod', 'hot_rod', 'jalopy', 'jeep', 'landrover', 'limo', 'limousine', 'loaner', 'minicar', 'minivan', 'pace_car', 'patrol_car', 'phaeton', 'police_car', 'police_cruiser', 'prowl_car', 'race_car', 'racer', 'racing_car', 'roadster', 'runabout', 'saloon', 'secondhand_car', 'sedan', 'sport_car', 'sport_utility', 'sport_utility_vehicle', 'sports_car', 'squad_car', 'station_waggon', 'station_wagon', 'stock_car', 'subcompact', 'subcompact_car', 'taxi', 'taxicab', 'tourer', 'touring_car', 'two-seater', 'used-car', 'waggon', 'wagon']
[Synset('motor_vehicle.n.01')]
2
[<bound method Synset.name of Synset('entity.n.01')>, <bound method Synset.name of Synset('physical_entity.n.01')>, <bound method Synset.name of Synset('object.n.01')>, <bound method Synset.name of Synset('whole.n.02')>, <bound method Synset.name of Synset('artifact.n.01')>, <bound method Synset.name of Synset('instrumentality.n.03')>, <bound method Synset.name of Synset('container.n.01')>, <bound method Synset.name of Synset('wheeled_vehicle.n.01')>, <bound method Synset.name of Synset('self-propelled_vehicle.n.01')>, <bound method Synset.name of Synset('motor_vehicle.n.01')>, <bound method Synset.name of Synset('car.n.01')>]
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'container.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'artifact.n.01', 'instrumentality.n.03', 'conveyance.n.03', 'vehicle.n.01', 'wheeled_vehicle.n.01', 'self-propelled_vehicle.n.01', 'motor_vehicle.n.01', 'car.n.01']
[Synset('entity.n.01')]
3. More Lexical Relations
- part_meronyms(): the parts of something, e.g. the parts of a tree are its trunk, crown, and so on.
- substance_meronyms(): the substances something is made of, e.g. a tree is made of heartwood and sapwood.
- member_holonyms(): the whole that members form, e.g. a collection of trees forms a forest.
- entailments(): entailment relations
- antonyms(): antonyms
- dir(): lists the lexical relations and other methods defined on a synset
Output:
[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')]
[Synset('heartwood.n.01'), Synset('sapwood.n.01')]
[Synset('forest.n.01')]
[Synset('mint.n.02')]
[Synset('mint.n.05')]
batch.n.02: (often followed by `of') a large number or amount or extent
mint.n.02: any north temperate plant of the genus Mentha with aromatic leaves and small mauve flowers
mint.n.03: any member of the mint family of plants
mint.n.04: the leaves of a mint plant used fresh or candied
mint.n.05: a candy that is flavored with a mint oil
mint.n.06: a plant where money is coined by authority of the government
--------------------------------------------
[Synset('step.v.01')]
[Synset('chew.v.01'), Synset('swallow.v.01')]
[Synset('arouse.v.07'), Synset('disappoint.v.01')]
--------------------------------------------
[Lemma('demand.n.02.demand')]
[Lemma('linger.v.04.linger')]
[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]
[Lemma('legato.r.01.legato')]
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_all_hypernyms', '_definition', '_examples', '_frame_ids', '_hypernyms', '_instance_hypernyms', '_iter_hypernym_lists', '_lemma_names', '_lemma_pointers', '_lemmas', '_lexname', '_max_depth', '_min_depth', '_name', '_needs_root', '_offset', '_pointers', '_pos', '_related', '_shortest_hypernym_paths', '_wordnet_corpus_reader', 'also_sees', 'attributes', 'causes', 'closure', 'common_hypernyms', 'definition', 'entailments', 'examples', 'frame_ids', 'hypernym_distances', 'hypernym_paths', 'hypernyms', 'hyponyms', 'in_region_domains', 'in_topic_domains', 'in_usage_domains', 'instance_hypernyms', 'instance_hyponyms', 'jcn_similarity', 'lch_similarity', 'lemma_names', 'lemmas', 'lexname', 'lin_similarity', 'lowest_common_hypernyms', 'max_depth', 'member_holonyms', 'member_meronyms', 'min_depth', 'name', 'offset', 'part_holonyms', 'part_meronyms', 'path_similarity', 'pos', 'region_domains', 'res_similarity', 'root_hypernyms', 'shortest_path_distance', 'similar_tos', 'substance_holonyms', 'substance_meronyms', 'topic_domains', 'tree', 'unicode_repr', 'usage_domains', 'verb_groups', 'wup_similarity']
4. Semantic Similarity
from nltk.corpus import wordnet as wn

right = wn.synset('right_whale.n.01')
orca = wn.synset('orca.n.01')
minke = wn.synset('minke_whale.n.01')
tortoise = wn.synset('tortoise.n.01')
novel = wn.synset('novel.n.01')

# Lowest common hypernyms
print(right.lowest_common_hypernyms(minke))
print(right.lowest_common_hypernyms(orca))
print(right.lowest_common_hypernyms(tortoise))
print(right.lowest_common_hypernyms(novel))

# Quantify how specific each synset is by looking up its depth
print(wn.synset('baleen_whale.n.01').min_depth())
print(wn.synset('whale.n.02').min_depth())
print(wn.synset('vertebrate.n.01').min_depth())
print(wn.synset('entity.n.01').min_depth())

# path_similarity is based on the shortest path that connects the two
# concepts in the hypernym hierarchy and gives a score in the range 0-1.
# (The book says -1 is returned when there is no path; NLTK 3 appears to
# return None in that case.)
print(right.path_similarity(minke))
print(right.path_similarity(orca))
print(right.path_similarity(tortoise))
print(right.path_similarity(novel))

Output:
[Synset('baleen_whale.n.01')]
[Synset('whale.n.02')]
[Synset('vertebrate.n.01')]
[Synset('entity.n.01')]
14
13
8
0
0.25
0.16666666666666666
0.07692307692307693
0.043478260869565216
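The scores above are consistent with path similarity being 1 / (shortest path length + 1): 0.25 corresponds to a path of 3 links, and 0.0435 to a path of 22. The sketch below illustrates that relationship on a deliberately tiny, hand-made taxonomy. It does not use WordNet data, the `HYPERNYM` table and helper names are invented for this example, and the toy hierarchy is much shallower than WordNet, so its scores differ from the real ones.

```python
# Toy hypernym links (child -> parent). NOT WordNet data.
HYPERNYM = {
    "right_whale": "baleen_whale",
    "minke_whale": "baleen_whale",
    "baleen_whale": "whale",
    "orca": "whale",
    "whale": "vertebrate",
    "tortoise": "vertebrate",
    "vertebrate": "entity",
    "novel": "entity",
}


def hypernym_path(concept):
    """Return the chain from concept up to the root of the toy taxonomy."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def lowest_common_hypernym(a, b):
    """Return the most specific concept that is an ancestor of both a and b."""
    ancestors_of_b = set(hypernym_path(b))
    return next(c for c in hypernym_path(a) if c in ancestors_of_b)


def path_similarity(a, b):
    """1 / (number of links on the shortest path + 1), in a tree-shaped taxonomy."""
    common = lowest_common_hypernym(a, b)
    distance = hypernym_path(a).index(common) + hypernym_path(b).index(common)
    return 1 / (distance + 1)


for other in ["minke_whale", "orca", "tortoise", "novel"]:
    print(other, lowest_common_hypernym("right_whale", other),
          path_similarity("right_whale", other))
```

As in the WordNet output, the similarity to right_whale falls as the lowest common hypernym becomes more general: baleen_whale, then whale, then vertebrate, then entity.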