
Chinese Text Classification with an LSTM in TensorFlow (Part 1)

Published: 2024/1/8


Introduction

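This post uses TensorFlow and an LSTM to classify Chinese text.
The dataset has the following format:
'''
體育	馬曉旭意外受傷讓國奧警惕無奈大雨格外青睞殷家軍記者傅亞雨沈陽報道來到沈陽，國奧隊依然沒有擺脫雨水的困擾。…
'''
Each line starts with the label (here 體育, "sports"), followed by a tab character and then the text.
Goal: given a piece of text, the model predicts the category it belongs to.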
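
As a minimal sketch of the line format described above, the snippet below splits one line into its label and text at the first tab. The helper name `parse_line` and the shortened sample string are illustrative assumptions, not part of the original code.

```python
def parse_line(line):
    """Split one dataset line into (label, text) at the first tab.

    parse_line is a hypothetical helper used only for illustration.
    """
    # maxsplit=1 keeps any further tabs inside the text itself
    label, text = line.rstrip('\n').split('\t', 1)
    return label, text


# Shortened sample line in the "label<TAB>text" format
sample = '體育\t馬曉旭意外受傷讓國奧警惕\n'
label, text = parse_line(sample)
print(label)  # 體育
print(text)   # 馬曉旭意外受傷讓國奧警惕
```

Splitting only at the first tab means a stray tab inside the article body does not break the unpacking into two fields.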

Dataset download

Chinese text classification dataset: https://download.csdn.net/download/missyougoon/11221027

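Text preprocessing

Chinese word segmentation

Convert words to ids, which are later used for the embedding lookup
For example, word A is mapped to id 5
Labels are converted to ids in the same way

Count word frequencies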
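
The preprocessing steps above can be sketched on a toy example. This sketch assumes the text has already been segmented into space-separated words, and it reserves id 0 for `<UNK>`; the function names `build_vocab` and `encode` and the two sample documents are assumptions made for illustration only.

```python
from collections import Counter


def build_vocab(segmented_texts):
    """Count word frequencies and assign ids by descending frequency.

    Assumes each text is already segmented into space-separated words.
    Id 0 is reserved for <UNK> (an illustrative choice).
    """
    counts = Counter()
    for text in segmented_texts:
        counts.update(text.split())
    vocab = {'<UNK>': 0}
    for word, _freq in counts.most_common():
        vocab[word] = len(vocab)
    return vocab


def encode(segmented_text, vocab):
    """Map each word to its id, falling back to the <UNK> id."""
    unk_id = vocab['<UNK>']
    return [vocab.get(word, unk_id) for word in segmented_text.split()]


docs = ['國奧 隊 來到 沈陽', '國奧 隊 訓練']
vocab = build_vocab(docs)
print(encode('國奧 隊 比賽', vocab))  # [1, 2, 0]
```

The word 比賽 never appears in the toy documents, so it falls back to the `<UNK>` id. The same fallback is what lets a model handle words at prediction time that were absent from the training set.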

Code

# -*- coding: utf-8 -*-
import jieba

# Input files
train_file = './news_data/cnews.train.txt'
val_file = './news_data/cnews.val.txt'
test_file = './news_data/cnews.test.txt'

# Segmented output files
seg_train_file = './news_data/cnews.train.seg.txt'
seg_val_file = './news_data/cnews.val.seg.txt'
seg_test_file = './news_data/cnews.test.seg.txt'

# Word-to-id and label-to-id mapping files
vocab_file = './news_data/cnews.vocab.txt'
category_file = './news_data/cnews.category.txt'

# Note: all files are opened as UTF-8, assuming the dataset is UTF-8 encoded.


def generate_seg_file(input_file, output_seg_file):
    '''
    Generate the word-segmented version of a data file.
    :param input_file: input file to be segmented
    :param output_seg_file: output file holding the segmented text
    :return:
    '''
    with open(input_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    with open(output_seg_file, 'w', encoding='utf-8') as f:
        for line in lines:
            # Split at the first tab only: label, then the text
            label, content = line.strip('\n').split('\t', 1)
            word_iter = jieba.cut(content)
            word_content = ''
            for word in word_iter:
                word = word.strip(' ')
                if word != '':
                    word_content += word + ' '
            # Remove the trailing space before writing
            out_line = '%s\t%s\n' % (label, word_content.strip(' '))
            f.write(out_line)


# Segment the three files (uncomment to run)
#generate_seg_file(train_file, seg_train_file)
#generate_seg_file(val_file, seg_val_file)
#generate_seg_file(test_file, seg_test_file)


def generate_vocab_file(input_seg_file, output_vocab_file):
    '''
    :param input_seg_file: file that has already been segmented
    :param output_vocab_file: output vocabulary file
    :return:
    '''
    with open(input_seg_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    word_dict = {}  # word -> frequency; only the frequency matters here
    for line in lines:
        label, content = line.strip('\n').split('\t', 1)
        for word in content.split(' '):
            word_dict.setdefault(word, 0)  # first occurrence: start the count at 0
            word_dict[word] += 1
    # dict.items() yields (key, value) pairs
    # See: http://www.runoob.com/python/att-dictionary-items.html
    sorted_word_dict = sorted(word_dict.items(), key=lambda d: d[1], reverse=True)
    # sorted_word_dict now looks like: [(word, frequency), ..., (word, frequency)]
    with open(output_vocab_file, 'w', encoding='utf-8') as f:
        # Words missing from the vocabulary are represented by <UNK>
        f.write('<UNK>\t1000000\n')
        for item in sorted_word_dict:
            f.write('%s\t%d\n' % (item[0], item[1]))


# Build the vocabulary from the training set (uncomment to run)
#generate_vocab_file(seg_train_file, vocab_file)


def generate_category_dict(input_file, category_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    category_dict = {}
    for line in lines:
        label, content = line.strip('\n').split('\t', 1)
        category_dict.setdefault(label, 0)
        category_dict[label] += 1
    with open(category_file, 'w', encoding='utf-8') as f:
        for category in category_dict:  # iterating over a dict yields its keys
            line = '%s\n' % category
            print('%s\t%d' % (category, category_dict[category]))
            f.write(line)


generate_category_dict(train_file, category_file)
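
The vocabulary file written above has one `word<TAB>frequency` entry per line, and the category file has one label per line. A minimal sketch of how these two files could be turned back into id mappings follows. The names `load_vocab` and `load_categories`, the `min_count` threshold, and the sample lines are assumptions for illustration; they are not taken from the original post or from Part 2.

```python
def load_vocab(lines, min_count=10):
    """Build a word -> id mapping from 'word<TAB>frequency' lines.

    min_count is a hypothetical cutoff: words seen fewer times are
    skipped, so they would be treated as <UNK> later on.
    """
    word_to_id = {}
    for line in lines:
        word, freq = line.rstrip('\n').rsplit('\t', 1)
        if int(freq) < min_count:
            continue
        word_to_id[word] = len(word_to_id)
    return word_to_id


def load_categories(lines):
    """Build a label -> id mapping from one-label-per-line input."""
    names = [line.strip() for line in lines if line.strip()]
    return {name: idx for idx, name in enumerate(names)}


# In-memory stand-ins for the contents of the two files
vocab_lines = ['<UNK>\t1000000\n', '的\t500\n', '沈陽\t3\n']
category_lines = ['體育\n', '財經\n']

word_to_id = load_vocab(vocab_lines)
category_to_id = load_categories(category_lines)
print(word_to_id)
print(category_to_id)
```

Because `<UNK>` is written first with a very large count, it always passes the threshold and receives id 0. Both functions accept any iterable of lines, so an open file object can be passed in place of the lists.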

Preprocessing is now complete. For model training and testing, see: Chinese Text Classification with an LSTM in TensorFlow (Part 2)
