Python Example: Text Word-Frequency Counting

Published: 2025/3/15 | python | 豆豆

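I have recently been learning Python on MOOC from Professor Song Tian (嵩天) of Beijing Institute of Technology (https://www.icourse163.org/learn/BIT-268001?tid=1003243006#/learn/announce). I have learned a great deal: his explanations are easy to follow, and I recommend the course.

These are my notes and annotations, kept for future reference.

Today's notes cover text word-frequency counting, using the text of Hamlet as the example.

English text

Hamlet text file for the word-frequency exercise: https://python123.io/resources/pye/hamlet.txt

Here is the source code: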

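# CalHamletV1.py
def getText():
    # Read the Hamlet text file; change the path to wherever hamlet.txt is saved.
    # A raw string keeps the backslash in the Windows path from being read as an escape.
    with open(r"E:\hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()  # convert every word to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()  # split the text on whitespace
counts = {}
for word in words:  # count how often each word occurs and store it in counts
    # get() returns 0 when word is not yet a key; see the function notes below
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())  # convert the dictionary to a list of (word, count) pairs so it can be sorted
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, highest first; see the function notes below
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))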
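As an alternative to counting by hand with a dictionary, the standard library's collections.Counter can do both the tallying and the ranking. The following is only a minimal sketch: the helper name top_words is hypothetical (it is not part of the course code), and a short inline sample string stands in for hamlet.txt.

```python
from collections import Counter
import string

def top_words(text, n=10):
    """Return the n most frequent lowercase words in text as (word, count) pairs."""
    # Map every ASCII punctuation character to a space.
    # Note: string.punctuation also contains the apostrophe, unlike the
    # character list used in the program above.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = text.lower().translate(table).split()
    return Counter(words).most_common(n)

sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Counter.most_common(n) returns the pairs already sorted by count, so the explicit list conversion and sort step are not needed in this variant.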

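Functions used:

① dict.get(key, default=None): returns the value stored under key; if the key is not in the dictionary, it returns default (None unless another value is given).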
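To make the counting idiom concrete, here is a small sketch of dict.get on a hand-written word list (the list is made up for illustration):

```python
counts = {}
for word in ["to", "be", "or", "not", "to", "be"]:
    # The first time a word is seen, get() returns the default 0
    # instead of raising KeyError, so the entry starts at 0 + 1.
    counts[word] = counts.get(word, 0) + 1

print(counts)                    # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
print(counts.get("hamlet"))      # None: the key is missing and no default was given
print(counts.get("hamlet", 0))   # 0: the key is missing, so the default is returned
```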

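② list.sort(key=None, reverse=False): sorts the list in place. The signature often quoted as list.sort(cmp=None, key=None, reverse=False) is the Python 2 one; Python 3 removed the cmp parameter.

cmp -- Python 2 only. An optional comparison function used to order the elements. In Python 3, an old-style comparison function can be wrapped with functools.cmp_to_key and passed as key.
key -- a function that takes one argument; it is called on each element of the list, and the values it returns are compared to determine the order.
reverse -- the sort direction: reverse=True sorts in descending order, reverse=False sorts in ascending order (the default).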
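A short sketch of how key and reverse behave, using made-up (word, count) pairs. The comparison function compare_counts is a hypothetical name, included only to show how a cmp-style function can still be used in Python 3:

```python
from functools import cmp_to_key

items = [("the", 5), ("and", 9), ("to", 7), ("of", 7)]

# Sort a copy by count, highest first. The sort is stable even with
# reverse=True, so "to" stays ahead of "of" (both have count 7).
by_count = list(items)
by_count.sort(key=lambda x: x[1], reverse=True)
print(by_count)  # [('and', 9), ('to', 7), ('of', 7), ('the', 5)]

# Python 3 has no cmp parameter; wrap a comparison function instead.
def compare_counts(a, b):
    # Negative result means a sorts before b, so larger counts come first.
    return b[1] - a[1]

by_cmp = list(items)
by_cmp.sort(key=cmp_to_key(compare_counts))
print(by_cmp)    # same order as by_count
```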

Chinese text

Romance of the Three Kingdoms (三国演义) text: https://python123.io/resources/pye/threekingdoms.txt

# CalThreeKingdomsV2.py
import jieba

# The string literals are written in Simplified Chinese, assuming the text file is too.
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    # Merge the different names used for the same character
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # drop frequent non-name words; pop avoids a KeyError if one is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
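The chain of elif branches that merges aliases could also be expressed as a lookup table, which is easier to extend as more aliases turn up. This is only a sketch under assumptions: ALIASES and normalize are hypothetical names, the literals are in Simplified Chinese on the assumption that the text file is, and a hand-written word list stands in for the output of jieba.lcut.

```python
# Hypothetical alias table: each alias maps to the name it is counted under.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

def normalize(word):
    """Return the canonical name for word, or word itself if it has no alias."""
    return ALIASES.get(word, word)

# Stand-in for jieba.lcut(txt); the last token is a single character.
words = ["玄德曰", "孔明", "诸葛亮", "曹操", "丞相", "玄德", "也"]
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens, as in the program above
        continue
    name = normalize(word)
    counts[name] = counts.get(name, 0) + 1

print(counts)  # {'刘备': 2, '孔明': 2, '曹操': 2}
```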

Function notes:

jieba.lcut(s): precise segmentation mode. It returns the segmentation result as a list, with no redundant (overlapping) words.

總結(jié)

以上是生活随笔為你收集整理的Python实例--文本词频统计的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇：业务中台01：中台解决方案本质在解决什么
下一篇： 2021住房消费品质服务报告