
Using LDA to Process Text in Python

Published: 2025/7/25 python 16 豆豆


[Translation] Using LDA to Process Text in Python

Posted 2016-02-17 16:10


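Notes:

Original article: http://chrisstrelioff.ws/sandbox/2014/11/13/getting_started_with_latent_dirichlet_allocation_in_python.html

This post covers the main content of the original article.

About LDA: LDA漫游指南

The Python library used here, lda, comes from https://github.com/ariddell/lda.

The gensim library also provides LDA-related functions.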

Installation

$ pip install lda --user

Example

from __future__ import division, print_function

import numpy as np
import lda
import lda.datasets

# document-term matrix
X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))
print(X[:5, :5])

'''Output:
type(X): <type 'numpy.ndarray'>
shape: (395L, 4258L)

[[ 1  0  1  0  0]
 [ 7  0  2  0  0]
 [ 0  0  0  1 10]
 [ 6  0  1  0  0]
 [ 0  0  0  2 14]]
'''

X is a 395 × 4258 matrix: 395 documents and a vocabulary of 4258 words. Each entry is the number of times a word occurs in a document.
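To make the layout of such a document-term matrix concrete, here is a minimal sketch that builds one by hand. The toy corpus and the sorted column order are assumptions made for illustration; they are not how the Reuters data was prepared.

```python
from collections import Counter

# Hypothetical toy corpus; not part of the Reuters dataset.
docs = [
    "pope visits church",
    "church bells ring in church",
    "pope pope pope",
]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[word] for word in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Each row is a document and each column a word, so dtm[i][j] is the number of times word j appears in document i, the same convention as X.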

Let's see which words these are:

# the vocab
vocab = lda.datasets.load_reuters_vocab()
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}\n".format(len(vocab)))
print(vocab[:6])

'''Output:
type(vocab): <type 'tuple'>
len(vocab): 4258

('church', 'pope', 'years', 'people', 'mother', 'last')
'''

Column 0 of X corresponds to the word church, and column 1 to the word pope.

Next, the article titles:

# titles for each story
titles = lda.datasets.load_reuters_titles()
print("type(titles): {}".format(type(titles)))
print("len(titles): {}\n".format(len(titles)))
print(titles[:2])  # titles of the first two articles

'''Output:
type(titles): <type 'tuple'>
len(titles): 395

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
 '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21')
'''

Fit the model, specifying 20 topics and 500 iterations:

model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)

Topic-word distribution:

topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))

'''Output:
type(topic_word): <type 'numpy.ndarray'>
shape: (20L, 4258L)
'''

Each row of topic_word corresponds to one topic, and each row sums to 1. Here are the weights of the three words 'church', 'pope', and 'years' in each topic:

print(topic_word[:, :3])

'''Output:
[[ 2.72436509e-06 2.72436509e-06 2.72708945e-03]
 [ 2.29518860e-02 1.08771556e-06 7.83263973e-03]
 [ 3.97404221e-03 4.96135108e-06 2.98177200e-03]
 [ 3.27374625e-03 2.72585033e-06 2.72585033e-06]
 [ 8.26262882e-03 8.56893407e-02 1.61980569e-06]
 [ 1.30107788e-02 2.95632328e-06 2.95632328e-06]
 [ 2.80145003e-06 2.80145003e-06 2.80145003e-06]
 [ 2.42858077e-02 4.66944966e-06 4.66944966e-06]
 [ 6.84655429e-03 1.90129250e-06 6.84655429e-03]
 [ 3.48361655e-06 3.48361655e-06 3.48361655e-06]
 [ 2.98781661e-03 3.31611166e-06 3.31611166e-06]
 [ 4.27062069e-06 4.27062069e-06 4.27062069e-06]
 [ 1.50994982e-02 1.64107142e-06 1.64107142e-06]
 [ 7.73480150e-07 7.73480150e-07 1.70946848e-02]
 [ 2.82280146e-06 2.82280146e-06 2.82280146e-06]
 [ 5.15309856e-06 5.15309856e-06 4.64294180e-03]
 [ 3.41695768e-06 3.41695768e-06 3.41695768e-06]
 [ 3.90980357e-02 1.70316633e-03 4.42279319e-03]
 [ 2.39373034e-06 2.39373034e-06 2.39373034e-06]
 [ 3.32493234e-06 3.32493234e-06 3.32493234e-06]]
'''
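Many entries in a topic-word matrix like this one are tiny but nonzero, and the same tiny value repeats within a row. That pattern would be consistent with each row being a set of smoothed counts normalized to sum to 1. The sketch below illustrates that idea only; the counts and the smoothing constant eta are made-up values, and this is not claimed to be the library's exact computation.

```python
import math

# Hypothetical word counts per topic (2 topics, 4 words); illustration only.
topic_counts = [
    [8, 1, 1, 0],
    [0, 2, 3, 5],
]

# Small smoothing constant (an assumed value) so no word gets probability 0.
eta = 0.01

topic_word = []
for counts in topic_counts:
    total = sum(counts) + eta * len(counts)
    topic_word.append([(c + eta) / total for c in counts])

# Every row is a probability distribution over the vocabulary.
for row in topic_word:
    print(row, "sum =", sum(row))
    assert math.isclose(sum(row), 1.0)
```

Words with a zero count in a topic all receive the same small weight, eta divided by the row total, which is why identical tiny values can appear side by side.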

Get the 5 highest-weighted words for each topic:

n = 5
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

'''Output:
*Topic 0
- government british minister west group
*Topic 1
- church first during people political
*Topic 2
- elvis king wright fans presley
*Topic 3
- yeltsin russian russia president kremlin
*Topic 4
- pope vatican paul surgery pontiff
*Topic 5
- family police miami versace cunanan
*Topic 6
- south simpson born york white
*Topic 7
- order church mother successor since
*Topic 8
- charles prince diana royal queen
*Topic 9
- film france french against actor
*Topic 10
- germany german war nazi christian
*Topic 11
- east prize peace timor quebec
*Topic 12
- n't told life people church
*Topic 13
- years world time year last
*Topic 14
- mother teresa heart charity calcutta
*Topic 15
- city salonika exhibition buddhist byzantine
*Topic 16
- music first people tour including
*Topic 17
- church catholic bernardin cardinal bishop
*Topic 18
- harriman clinton u.s churchill paris
*Topic 19
- century art million museum city
'''
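The slice [:-(n+1):-1] applied to the argsort result is compact but easy to misread. Here is a pure-Python sketch of the same selection on a single made-up topic; the weights are hypothetical and only the first six vocabulary words are used.

```python
# Hypothetical vocabulary and a single topic's word weights; illustration only.
vocab = ("church", "pope", "years", "people", "mother", "last")
topic_dist = [0.05, 0.40, 0.10, 0.02, 0.30, 0.13]
n = 3

# Indices ordered from lowest to highest weight (the ordering np.argsort gives).
order = sorted(range(len(topic_dist)), key=lambda j: topic_dist[j])

# [:-(n+1):-1] walks backwards from the end and stops after n items,
# giving the n highest-weighted indices in descending order of weight.
top_idx = order[:-(n + 1):-1]
top_words = [vocab[j] for j in top_idx]
print(top_words)
```

Sorting ascending and then reading the last n entries backwards returns the top n words with the heaviest first, which is the order in which each topic's words are printed.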

Document-topic distribution:

doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))

'''Output:
type(doc_topic): <type 'numpy.ndarray'>
shape: (395, 20)
'''

Each row corresponds to one document, and each row sums to 1.

Print the most likely topic for each of the first 10 documents:

for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}".format(n, topic_most_pr))

'''Output:
doc: 0 topic: 8
doc: 1 topic: 1
doc: 2 topic: 14
doc: 3 topic: 8
doc: 4 topic: 14
doc: 5 topic: 14
doc: 6 topic: 14
doc: 7 topic: 14
doc: 8 topic: 14
doc: 9 topic: 8
'''
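Taking the argmax of a row simply picks the column with the largest probability. As a small sketch of that step without numpy, the snippet below uses a made-up document-topic table (3 documents, 4 topics) and also tallies how many documents land on each dominant topic.

```python
from collections import Counter

# Hypothetical document-topic rows (3 documents, 4 topics); each sums to 1.
doc_topic = [
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.20, 0.10, 0.10],
    [0.05, 0.55, 0.30, 0.10],
]

# Pure-Python equivalent of row.argmax(): index of the largest entry.
best = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]
print(best)

# Tally how many documents fall under each dominant topic.
print(Counter(best))
```

The argmax discards the rest of the distribution, so a document split almost evenly between two topics is still reported under a single topic.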

Replacing the dataset

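After downloading the package, open datasets.py and replace reuters.ldac in load_reuters(), reuters.tokens in load_reuters_vocab(), and reuters.titles in load_reuters_titles() with your own dataset files. Generate your files in the same format as the ones shipped with the package.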
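The .ldac extension suggests the sparse format used by Blei's LDA-C: one line per document, starting with the number of distinct words, followed by index:count pairs. Treat that as an assumption and compare against the file in the package before relying on it. Under that assumption, a minimal sketch of writing a count matrix in this layout could look like the following; the helper name to_ldac_lines and the toy matrix are hypothetical.

```python
# Hypothetical toy document-term matrix (rows: documents, columns: words).
dtm = [
    [1, 0, 2, 0],
    [0, 3, 0, 1],
]


def to_ldac_lines(matrix):
    """Yield one LDA-C style line per document: 'N idx:count idx:count ...'.

    Assumes 0-based word indices; check the packaged file to confirm.
    """
    for row in matrix:
        pairs = [(j, c) for j, c in enumerate(row) if c > 0]
        fields = ["{}:{}".format(j, c) for j, c in pairs]
        yield " ".join([str(len(pairs))] + fields)


lines = list(to_ldac_lines(dtm))
for line in lines:
    print(line)
```

Since model.fit(X) only needs a matrix of counts, it may be simpler to build X, vocab, and titles yourself and pass X to fit directly, rather than editing the files inside the package.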
