
NLP: Extractive text summarization with the nltk and jieba libraries (two methods: top_n_summary and mean_scored_summary)

Published: 2025/3/21


Contents: Output, Design approach, Core code

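Output

1. Test text (a Chinese news excerpt about two men filmed smoking in the Forbidden City; kept in Chinese because the code segments it with jieba)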

今天一大早，两位男子在故宫抽烟对镜头炫耀的视频在网络上传播，引发网友愤怒。有人感到后怕，600年的故宫真要这两个人给点了，万死莫赎。也有评论称，把无知当成炫耀的资本，丢人！视频中两位男子坐在故宫公共休息区的遮阳伞下，面对镜头问出：“谁敢在故宫抽烟？”语气极其嚣张，表情带有挑衅意味。话音刚落，另外一位男子面向镜头吸了一口烟。而视频中两人也表示知道有故宫禁止吸烟的规定。事实上，2013年5月18日是国际博物馆日，故宫从这一天开始至今一直实行全面禁烟。根据规定，故宫博物院全体员工在院合作单位和个人不管在室内和室外，也不分开放区与工作区，一律禁止吸烟，对违反禁止吸烟规定的人员将进行严格处罚并通报全院。此外，在2015年6月1日起北京全市也开始了《控制吸烟条例》，规定公共场所工作场所室内环境室外排队等场合禁止吸烟，违者将最高被罚200元，全市统一设立举报电话12320。视频在网络上传播开来，不少网友担心故宫的安危，称一旦发生火情，后果不堪设想，有网友表示，这样的行为应该被旅游景区拉近黑名单，建议终身禁止进入任何景区和各种场馆。

Design approach

To be updated later...
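Pending that update, the two strategies named in the title can be illustrated on their own, without jieba or nltk. The following is only a rough sketch: the scores are made up, and `top_n` and `above_mean` are hypothetical helper names, not functions from the post. It assumes each sentence already has a numeric score and shows how the two selections differ: one keeps a fixed number of sentences, the other keeps whatever clears a cutoff of mean plus half a standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical (sentence_index, score) pairs; real scores come from the scoring step.
scored = [(0, 1.0), (1, 4.0), (2, 2.0), (3, 9.0), (4, 1.5), (5, 3.0)]


def top_n(scored, n):
    """Keep the n highest-scoring sentences, restored to document order."""
    best = sorted(scored, key=lambda s: s[1])[-n:]
    return sorted(best, key=lambda s: s[0])


def above_mean(scored, k=0.5):
    """Keep sentences scoring more than mean + k * population std."""
    values = [score for _, score in scored]
    cutoff = mean(values) + k * pstdev(values)
    return [(idx, score) for idx, score in scored if score > cutoff]


top_n_result = top_n(scored, 3)
mean_result = above_mean(scored)
print(top_n_result)  # [(1, 4.0), (3, 9.0), (5, 3.0)]
print(mean_result)   # [(3, 9.0)]
```

With these toy numbers the cutoff is about 4.76, so the mean-based filter keeps a single sentence while top-n always returns three. `pstdev` is used because `numpy.std` defaults to the population standard deviation.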

Core code

import codecs

import jieba
import nltk
import numpy

# Assumed values: the post uses these names without defining them.
N = 100                # number of most frequent words treated as keywords
CLUSTER_THRESHOLD = 5  # max token distance between keywords in one cluster
TOP_SENTENCES = 5      # number of sentences returned in top_n_summary


def sent_tokenizer(texts):
    """Split text into sentences at end-of-sentence punctuation."""
    punt_list = set('.!?。！？')
    sentences = []
    start = 0
    for i, ch in enumerate(texts):
        next_ch = texts[i + 1] if i + 1 < len(texts) else ''
        # Split only when the character after the punctuation mark is not punctuation
        if ch in punt_list and next_ch not in punt_list:
            sentences.append(texts[start:i + 1])
            start = i + 1  # the next sentence starts after this mark
    if start < len(texts):
        sentences.append(texts[start:])  # text that ends without punctuation
    return sentences


def load_stopwordslist(path):
    print('load stopwords...')
    with codecs.open(path, 'r', encoding='utf8') as f:
        return {line.strip() for line in f}


def summarize(text):
    stopwords = load_stopwordslist('stopwords.txt')
    sentences = sent_tokenizer(text)
    words = [w for sentence in sentences for w in jieba.cut(sentence)
             if w not in stopwords and len(w) > 1 and w != '\t']
    wordfre = nltk.FreqDist(words)
    topn_words = [w for w, _ in sorted(wordfre.items(), key=lambda d: d[1], reverse=True)][:N]
    scored_sentences = _score_sentences(sentences, topn_words)

    # Approach 1: filter out unimportant sentences using the mean and standard deviation
    avg = numpy.mean([s[1] for s in scored_sentences])  # mean
    std = numpy.std([s[1] for s in scored_sentences])   # standard deviation
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences
                   if score > (avg + 0.5 * std)]

    # Approach 2: return the top n sentences, in document order
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-TOP_SENTENCES:]
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])

    return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],
                mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])


def _score_sentences(sentences, topn_words):
    scores = []
    for sentence_idx, sentence in enumerate(sentences):
        tokens = list(jieba.cut(sentence))
        word_idx = []
        for w in topn_words:
            try:
                word_idx.append(tokens.index(w))  # index of the keyword in this sentence
            except ValueError:  # w is not in this sentence
                pass
        word_idx.sort()
        if len(word_idx) == 0:
            continue

        # Group keyword positions into clusters using the distance threshold
        clusters = []
        cluster = [word_idx[0]]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])
            else:
                clusters.append(cluster[:])
                cluster = [word_idx[i]]
            i += 1
        clusters.append(cluster)

        # Score each cluster; the highest cluster score is the sentence score
        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)
            total_words_in_cluster = c[-1] - c[0] + 1
            score = 1.0 * significant_words_in_cluster * significant_words_in_cluster / total_words_in_cluster
            if score > max_cluster_score:
                max_cluster_score = score
        scores.append((sentence_idx, max_cluster_score))
    return scores
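To make the cluster scoring in `_score_sentences` concrete, here is a small worked sketch that runs the same arithmetic on a list of keyword positions, with no jieba dependency. `cluster_score` is a hypothetical helper name, and the default `threshold=5` is an assumed stand-in for `CLUSTER_THRESHOLD`, which the post does not define.

```python
def cluster_score(positions, threshold=5):
    """Score a sentence from the sorted token positions of its keywords.

    threshold is an assumed stand-in for CLUSTER_THRESHOLD.
    Returns 0.0 when there are no keywords (the original code skips
    such sentences instead of scoring them).
    """
    if not positions:
        return 0.0
    clusters = [[positions[0]]]
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev < threshold:
            clusters[-1].append(cur)  # close enough: same cluster
        else:
            clusters.append([cur])    # gap too large: start a new cluster
    # Each cluster scores (keywords in cluster)^2 / (cluster span in tokens)
    return max(len(c) ** 2 / (c[-1] - c[0] + 1) for c in clusters)


score = cluster_score([2, 3, 6, 15, 16])
print(score)  # 2.0
```

The positions split into two clusters, `[2, 3, 6]` and `[15, 16]`. The first scores 3²/5 = 1.8 and the second 2²/2 = 2.0, so the sentence score is 2.0: a tight pair of keywords can outrank a looser group of three.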
