
Dream of the Red Chamber (紅樓夢) in NLP

Published: 2024/3/7


After a month of studying NLP on and off, I have still only scratched the surface. Today I am going to use NLP to work through a novel I love, Dream of the Red Chamber.

The goal of this post is to build a knowledge base for Dream of the Red Chamber:

1. Extract keywords from the main characters' dialogue

2.

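I. Building the corpus

The corpus is the foundation for the word segmentation and model building that come later. Each chapter of Dream of the Red Chamber is written out as a corpus file with one sentence per line.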

    └─data              # root directory
      ├─chapters        # source documents
      │   01.txt
      │   02.txt
      │   03.txt
      │   04.txt
      │   05.txt
      │   06.txt
      │   07.txt
      │
      └─corpus          # corpus files
          01.txt
          02.txt
          03.txt
          04.txt
          05.txt
          06.txt
          07.txt

    # construct_corpus.py
    import os
    import re
    from itertools import chain
    from string import punctuation

    import matplotlib.pyplot as plt
    import pandas

    # Punctuation and other characters to delete
    add_punc = '，。、【】 “”：；（）《》‘’{}？！⑦()、%^>℃：.”“^-——=&#@￥『』'
    all_punc = punctuation + add_punc

    os.chdir('D:/good_study/NLP/紅樓夢/')
    chapters_path = 'D:/good_study/NLP/紅樓夢/data/chapters/'
    corpus_path = 'D:/good_study/NLP/紅樓夢/data/corpus/'

    # -----------------------------------------------
    # 1. Build the corpus: one sentence per line, one file per chapter
    # -----------------------------------------------
    # List of all chapter files
    listdir = os.listdir(chapters_path)
    # listdir = listdir[:9]
    # Sentences of every chapter (one inner list per chapter)
    sentences_all_list = []
    for filename in listdir:
        print("Processing chapter file {}".format(filename))
        chapters_root_path = chapters_path + str(filename)
        # Sentences of this chapter
        sentences_list = []
        with open(chapters_root_path, 'r', encoding='utf8') as f:
            for line in f:
                # Split on [，。！；？] to get sentence fragments
                line_split = re.split(r'[，。！；？]', line.strip())
                # Drop fragments of one character or less
                line_split = [s.strip() for s in line_split if len(s.strip()) > 1]
                # Remove Latin letters and digits
                # (the original pattern also had '/d+', a typo for '\d+',
                #  which [0-9] already covers)
                line_split = [re.sub(r'[A-Za-z0-9]', '', s) for s in line_split]
                # Remove punctuation
                line_split = [''.join(ch for ch in s if ch not in all_punc)
                              for s in line_split]
                # Skip fragments that are empty after cleaning
                sentences_list.append([s for s in line_split if s])
        # chain.from_iterable flattens the nested lists into one list
        sentences_list = list(chain.from_iterable(sentences_list))
        sentences_all_list.append(sentences_list)
        corpus_root_path = corpus_path + str(filename)
        with open(corpus_root_path, 'w', encoding='utf8') as f:
            for line in sentences_list:
                f.write(line)
                f.write('\n')

    # Build the whole-book corpus
    sentences_all_list = list(chain.from_iterable(sentences_all_list))
    corpus_root_path = corpus_path + 'whole_book.txt'
    with open(corpus_root_path, 'w', encoding='utf8') as f:
        for line in sentences_all_list:
            f.write(line)
            f.write('\n')

    # -----------------------------------------------
    # 2. Count the characters in each chapter
    # -----------------------------------------------
    listdir = os.listdir(corpus_path)
    line_words_list = []
    chapter_list = []
    # listdir = listdir[:9]
    for filename in listdir:
        # Extract the chapter number; skip files without one
        # (whole_book.txt lives in the same directory)
        digits = re.findall(r'\d+', filename)
        if not digits:
            continue
        num = int(digits[0])
        chapter_list.append(num)
        corpus_root_path = corpus_path + str(filename)
        with open(corpus_root_path, 'r', encoding='utf8') as f:
            line_words = 0
            for line in f:
                # Do not count the trailing newline as a character
                line_words += len(line.rstrip('\n'))
        line_words_list.append(line_words)
        print("File {}: {} characters, parsed chapter number {}".format(
            filename, line_words, num))

    chapter_words = pandas.DataFrame({'chapter': chapter_list,
                                      'chapter_words': line_words_list})
    chapter_words.sort_values(by='chapter', ascending=True, inplace=True)
    chapter_words = chapter_words.set_index(keys=['chapter'])
    chapter_words['chapter_words'].plot(kind='bar', color='g', alpha=0.5,
                                        figsize=(20, 15))
    plt.show()
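To check the cleaning rules without touching any files, the sentence-splitting step can be pulled out into a small function and run on a single line. This is only a sketch: `split_sentences`, `EXTRA_PUNC` and the sample line are illustrative names and data made up here, and `EXTRA_PUNC` is a shortened subset of the script's punctuation list.

```python
import re
from string import punctuation

# ASCII punctuation plus a few full-width marks.
# EXTRA_PUNC is a shortened, illustrative subset of the script's add_punc.
EXTRA_PUNC = '，。、“”：；（）《》‘’？！'
ALL_PUNC = set(punctuation + EXTRA_PUNC)


def split_sentences(line):
    """Split one line of text into cleaned sentence fragments."""
    parts = re.split(r'[，。！；？]', line.strip())
    cleaned = []
    for part in parts:
        # Remove Latin letters and digits, then punctuation
        part = re.sub(r'[A-Za-z0-9]', '', part.strip())
        part = ''.join(ch for ch in part if ch not in ALL_PUNC)
        # Keep only fragments longer than one character
        if len(part) > 1:
            cleaned.append(part)
    return cleaned


# A made-up sample line, used only to exercise the function
sample = '寶玉笑道：“我來遲了，妹妹莫怪！”'
print(split_sentences(sample))
```

Unlike the script, this version applies the length filter after cleaning, so a fragment that shrinks to one character or nothing is also dropped.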

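After processing, the whole-book corpus comes to about 820,000 characters, and the script plots the per-chapter counts as a bar chart. Each chapter averages about 7,000 characters. Like the plot itself, the chapter lengths rise and fall with a certain rhythm: chapters 57-78 show a slight peak, which is also where the love story of Baoyu and Daiyu reaches its height and the conflicts among the characters pile up.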
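A stretch of long chapters like this can also be located programmatically instead of by eye. The sketch below assumes a plain dict mapping chapter number to character count; `above_average_runs` is a hypothetical helper, and the numbers are toy values for illustration, not the real counts.

```python
from statistics import mean


def above_average_runs(counts):
    """Return (average, list of (first, last) runs of above-average chapters)."""
    avg = mean(counts.values())
    runs = []
    start = prev = None
    for chapter in sorted(counts):
        if counts[chapter] > avg:
            if start is None:
                start = chapter
            prev = chapter
        elif start is not None:
            # The run ended at the previous above-average chapter
            runs.append((start, prev))
            start = None
    if start is not None:
        runs.append((start, prev))
    return avg, runs


# Toy numbers only, not the counts from the book
toy_counts = {1: 6500, 2: 7200, 3: 8100, 4: 6200, 5: 7000}
avg, runs = above_average_runs(toy_counts)
print(avg, runs)
```

With the real data, the dict could be built from the `chapter_words` DataFrame in the script, for example with `chapter_words['chapter_words'].to_dict()`.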

References:

Dream of the Red Chamber Chinese-English parallel corpus: http://corpus.usx.edu.cn/hongloumeng/images/shiyongshuoming.htm

Modern Chinese + Classical Chinese corpus online search system: http://ccl.pku.edu.cn:8080/ccl_corpus/index.jsp?dir=xiandai

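II. Word segmentation: building a Dream of the Red Chamber lexicon

Word segmentation methods fall into rule-based and statistical approaches. We do not yet have a lexicon for Dream of the Red Chamber, and the general-purpose Chinese NLP tools available today are built mainly on modern Chinese corpora, so they handle classical Chinese poorly. Searching online, I found a package called Jiayan (甲言). The name is taken from 甲骨文言 (oracle-bone script and classical Chinese), and it is an NLP toolkit focused on processing classical Chinese.

The current version supports five features: lexicon construction, automatic word segmentation, part-of-speech tagging, classical sentence segmentation (句讀), and punctuation. More features are under development.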
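To give a feel for what unsupervised lexicon construction involves, here is a minimal sketch that scores adjacent character pairs by pointwise mutual information (PMI). It is not Jiayan's code or API: as far as I understand from its documentation, Jiayan combines PMI with left and right boundary entropy, and this sketch only shows the PMI part. `pmi_bigrams` and the toy sentences are made up for illustration.

```python
import math
from collections import Counter


def pmi_bigrams(sentences, min_count=2):
    """Rank adjacent character pairs by pointwise mutual information."""
    char_counts = Counter()
    pair_counts = Counter()
    for s in sentences:
        char_counts.update(s)
        pair_counts.update(s[i:i + 2] for i in range(len(s) - 1))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for pair, n in pair_counts.items():
        if n < min_count:
            continue  # too rare to estimate reliably
        p_pair = n / total_pairs
        p_first = char_counts[pair[0]] / total_chars
        p_second = char_counts[pair[1]] / total_chars
        # PMI = log( P(ab) / (P(a) * P(b)) )
        scores[pair] = math.log(p_pair / (p_first * p_second))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy sentences for illustration, in the one-sentence-per-line corpus format
toy_sentences = ['寶玉笑道', '黛玉笑道', '寶玉來了', '黛玉去了']
ranked = pmi_bigrams(toy_sentences)
for pair, score in ranked:
    print(pair, round(score, 3))
```

On this toy input, PMI alone scores the non-word 玉笑 exactly as high as 寶玉, which suggests why a second signal such as boundary entropy is needed before trusting the candidates.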

Fix for pip install kenlm errors on Windows: see this link


2.1 HMM

2.2 CRF


2.3 Measuring segmentation consistency

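III. Named entity recognition
IV. Chapter summaries
V. Chapter content overview
VI. Chapter content tags
VII. The social network of Dream of the Red Chamber
VIII. Chapter content overview
IX. Chapter content overview
X. Chapter content overview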

To be continued...
