Natural Language Processing with Python: Reading Notes, Part 001
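There are plenty of Chinese editions of this book online that target Python 2, but hardly any for the later Python 3 update, so I have to read the English e-book on the official site. I'll treat it as practice anyway.
The official site explicitly lists a few points about NLTK 3.0, which indirectly suggests they are important or commonly used, much as print is to Python.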
NLTK also includes some pervasive changes:
- many types are initialised from strings using a fromstring() method
- many functions now return iterators instead of lists
- ContextFreeGrammar is now called CFG and WeightedGrammar is now called PCFG
- batch_tokenize() is now called tokenize_sents(); there are corresponding changes for batch taggers, parsers, and classifiers
- some implementations have been removed in favour of external packages, or because they could not be maintained adequately
Details: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0
Chapter 1 has nothing much new; the one addition is the concordance method:
>>> text5.concordance('lol')
Displaying 25 of 25 matches:
ast PART 24 / m boo . 26 / m and sexy lol U115 boo . JOIN PART he drew a girl w
ope he didnt draw a penis PART ewwwww lol & a head between her legs JOIN JOIN s
a bowl i got a blunt an a bong ...... lol JOIN well , glad it worked out my cha
e " PART Hi U121 in ny . ACTION would lol @ U121 . . . but appearently she does
30 make sure u buy a nice ring for U6 lol U7 Hi U115 . ACTION isnt falling for
didnt ya hear !!!! PART JOIN geeshhh lol U6 PART hes deaf ppl here dont get it
es nobody here i wanna misbeahve with lol JOIN so read it . thanks U7 .. Im hap
ies want to chat can i talk to him !! lol U121 !!! forwards too lol JOIN ALL PE
k to him !! lol U121 !!! forwards too lol JOIN ALL PErvs ... redirect to U121 '
loves ME the most i love myself JOIN lol U44 how do u know that what ? jerkett
ng wrong ... i can see it in his eyes lol U20 = fiance Jerketts lmao wtf yah I
cooler by the minute what 'd I miss ? lol noo there too much work ! why not ??
that mean I want you ? U6 hello room lol U83 and this .. has been the grammar
the rule he 's in PM land now though lol ah ok i wont bug em then someone wann
flight to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 80
ht to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 808265
082653953 K-Fed got his ass kicked .. Lol . ACTION laughs . i got a first class
. i got a first class ticket to hell lol U7 JOIN any texas girls in here ? any
. whats up U155 i was only kidding . lol he 's a douchebag . Poor U121 i 'm bo
??? sits with U30 Cum to my shower . lol U121 . ACTION U1370 watches his nads
ur nad with a stick . ca u U23 ewwww lol *sniffs* ewwwwww PART U115 ! owww spl
ACTION is resisting . ur female right lol U115 beeeeehave Remember the LAst tim
pm's me . charge that is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLO
is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLOLLL U12 thats not nic
s . lmao no U115 Check my record . :) Lol lick em U7 U23 how old r u lol Way to
Experimenting shows that dispersion_plot is case-sensitive, which hints that letter case needs careful attention throughout NLP processing.
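To make the case-sensitivity point concrete, here is a minimal sketch in plain Python rather than NLTK itself. A dispersion plot marks the token offsets at which a word occurs, so the sketch only collects those offsets. The helper `offsets` and the sample tokens are made up for illustration; they are not part of NLTK.

```python
def offsets(tokens, target, ignore_case=False):
    """Return the positions at which target occurs in tokens."""
    if ignore_case:
        target = target.lower()
        return [i for i, tok in enumerate(tokens) if tok.lower() == target]
    return [i for i, tok in enumerate(tokens) if tok == target]

# Hypothetical chat-like tokens, not taken from the NLTK corpus.
tokens = ["Lol", "that", "was", "funny", "lol", "LOL", "ok", "lol"]

exact = offsets(tokens, "lol")                      # exact, case-sensitive match
folded = offsets(tokens, "lol", ignore_case=True)   # case folded before matching

print(exact)
print(folded)
```

The exact match finds fewer positions than the case-folded one, so a case-sensitive plot of 'lol' would leave out every 'Lol' and 'LOL'.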
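As for the generate function, judging from https://github.com/nltk/nltk/issues/736 it is still unresolved. The most recent reply is as late as the 18th, yet most of the other replies offer no real answer either; they amount to "nothing can be done, leave it alone". I tried several approaches of my own without getting a good result, so I'm shelving it for now. The book says generate comes up again in Chapter 3, so let's revisit it in Part 3 of these notes.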
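While generate is shelved, random text can still be played with by hand. The following is only a rough bigram sketch under my own assumptions; it is not NLTK's implementation and not a fix for the issue above. The function names `build_bigram_model` and `generate_text` are hypothetical.

```python
import random
from collections import defaultdict

def build_bigram_model(tokens):
    """Map each token to the list of tokens that follow it in the text."""
    followers = defaultdict(list)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current].append(nxt)
    return followers

def generate_text(tokens, length=10, seed=0):
    """Random walk over bigram followers; jump to a random token on dead ends."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    followers = build_bigram_model(tokens)
    word = rng.choice(tokens)
    output = [word]
    while len(output) < length:
        candidates = followers.get(word)
        word = rng.choice(candidates) if candidates else rng.choice(tokens)
        output.append(word)
    return output

sample = "after all is said and done more is said than done".split()
generated = generate_text(sample, length=8, seed=42)
print(" ".join(generated))
```

With a real corpus, `sample` would be replaced by the token list of one of the book's texts; the output is only as coherent as a bigram walk can be.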
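token is translated as 標識符 ("identifier"; never mind how the second character is pronounced). A combination of brackets and punctuation marks apparently also counts as a token, which is kind of interesting.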
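Word type: once the vocabulary includes punctuation marks, its unique items are generally not called word types; the book simply calls them types. In other words, only a vocabulary made up purely of words consists of word types.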
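The token/type distinction is easy to check on a small list. This is a minimal sketch with a made-up token list; filtering with `str.isalpha()` is my own assumption for what counts as a "pure word", not something the book prescribes.

```python
# Hypothetical tokenized sentence, punctuation included as separate tokens.
tokens = ["After", "all", ",", "is", "said", "and", "done", ",",
          "more", "is", "said", "than", "done", "."]

types = set(tokens)                              # unique items, punctuation included
word_types = {t for t in types if t.isalpha()}   # unique purely alphabetic items

print(len(tokens))      # number of tokens
print(len(types))       # number of types
print(len(word_types))  # number of word types
```

The three counts differ: repeated tokens collapse into one type, and dropping punctuation shrinks the set again.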
Section 1.3 opens with this saying list, and I have no idea what it is; there is a run of ellipsis dots in the middle…
>>> saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']
>>> tokens = set(saying)
>>> tokens = sorted(tokens)
>>> tokens[-2:]
['said', 'than']
"Taken at face value."
While using the hapaxes method, IDLE may freeze briefly, but it recovers after a moment; there are more than 9,000 words, after all.
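For reference, hapaxes are simply the words that occur exactly once. Here is a minimal stand-alone sketch using `collections.Counter`; the local function `hapaxes` is my own illustration of the idea, not a copy of how NLTK's FreqDist implements it.

```python
from collections import Counter

def hapaxes(tokens):
    """Return the tokens that occur exactly once, in first-seen order."""
    counts = Counter(tokens)
    return [tok for tok, n in counts.items() if n == 1]

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

once = hapaxes(saying)
print(once)
```

On a full book the result is a list of thousands of words, and printing a list that long in the IDLE shell may well be what makes it stall.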
Collocations is translated as 搭配 ("pairing"), which seems fine.
Counting only lowercase words is bound to cause problems: country names, place names, and so on…
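A small sketch of the problem, assuming the counting in question lowercases every word before building the vocabulary. The token list is made up; the point is only that proper nouns collapse into common nouns.

```python
# Hypothetical tokens where case distinguishes a country from a common noun.
tokens = ["Turkey", "exports", "turkey", "and", "China", "exports", "china"]

vocab = set(tokens)                          # case kept
vocab_lower = {t.lower() for t in tokens}    # case folded before counting

# Capitalised words whose lowercase form is also in the vocabulary:
# these are the entries that case folding silently merges.
merged = sorted(t for t in vocab if t != t.lower() and t.lower() in vocab)

print(len(vocab), len(vocab_lower))
print(merged)
```

The folded vocabulary is smaller, and the words it loses are exactly the country names.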
babelize_shell() is no longer in use; the e-book on the official site explains:
Today, practical translation systems exist for particular pairs of languages, and some are integrated into web search engines. However, these systems have some serious shortcomings, which are starkly revealed by translating a sentence back and forth between a pair of languages until equilibrium is reached, e.g.:

0> how long before the next flight to Alice Springs?
1> wie lang vor dem folgenden Flug zu Alice Springs?
2> how long before the following flight to Alice jump?
3> wie lang vor dem folgenden Flug zu Alice springen Sie?
4> how long before the following flight to Alice do you jump?
5> wie lang, bevor der folgende Flug zu Alice tun, Sie springen?
6> how long, before the following flight to Alice does, do you jump?
7> wie lang bevor der folgende Flug zu Alice tut, tun Sie springen?
8> how long before the following flight to Alice does, do you jump?
9> wie lang, bevor der folgende Flug zu Alice tut, tun Sie springen?
10> how long, before the following flight does to Alice, do do you jump?
11> wie lang bevor der folgende Flug zu Alice tut, Sie tun Sprung?
12> how long before the following flight does leap to Alice, does you?

Observe that the system correctly translates Alice Springs from English to German (in the line starting 1>), but on the way back to English, this ends up as Alice jump (line 2). The preposition before is initially translated into the corresponding German preposition vor, but later into the conjunction bevor (line 5). After line 5 the sentences become nonsensical (but notice the various phrasings indicated by the commas, and the change from jump to leap). The translation system did not recognize when a word was part of a proper name, and it misinterpreted the grammatical structure.

As the earlier discussion concluded, the output of many current translators tends to drift: a sentence translated into another language and then back again does not come out the same as the original. This is perhaps another hard problem that NLP faces today.
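The "back and forth until equilibrium" procedure from the quoted passage can be sketched as a small loop. Everything here is an assumption for illustration: `back_and_forth` and `word_by_word` are hypothetical helpers, and the two toy lexicons stand in for a real translation service, which this sketch does not call.

```python
def word_by_word(table):
    """Build a toy translator that replaces the words found in table."""
    def translate(sentence):
        return " ".join(table.get(word, word) for word in sentence.split())
    return translate

def back_and_forth(sentence, forward, backward, max_steps=12):
    """Alternate forward/backward translation until a sentence repeats."""
    history = [sentence]
    for step in range(1, max_steps + 1):
        translate = forward if step % 2 == 1 else backward
        history.append(translate(history[-1]))
        # Equilibrium: same sentence as two steps ago, i.e. in the same language.
        if len(history) >= 3 and history[-1] == history[-3]:
            break
    return history

# Hypothetical toy lexicons that lose the proper name, as in the book's example.
toy_en_de = word_by_word({"springs": "springen", "jump": "springen"})
toy_de_en = word_by_word({"springen": "jump"})

history = back_and_forth("alice springs", toy_en_de, toy_de_en)
for step, sentence in enumerate(history):
    print(f"{step}> {sentence}")
```

Even with this toy lexicon the place name is lost on the first round trip and never recovered, which is the same failure the book's listing shows.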