Machine Learning Notes (30): A Java Implementation of the LDA Topic Model
1. The Mathematics of LDA
1) Topic models: a traditional text classifier such as Naive Bayes, kNN, or SVM can only assign a document to a single definite category. Suppose I offer three categories, "algorithms", "word segmentation", and "literature", and ask such a classifier to label one of my posts. If it files the post under algorithms, that is passable; if it files it under word segmentation, I would say the classifier is not accurate enough.
Now suppose a literary-minded young reader visits my blog. He knows nothing about algorithms or word segmentation, so he cannot supply concrete candidate categories himself. Is there a model that can tell him that this article is very likely (80%) about algorithms, possibly (19%) about word segmentation, and almost certainly not (1%) about any other subject?
There is. Such a model is called a topic model.
2) The LDA concept
Latent Dirichlet Allocation (LDA) is the simplest topic model; it describes how a document is generated, as illustrated in Figure 1.
Reading the figure from left to right: a topic is defined by a distribution over words; for example, the blue topic assigns a 2% probability to "data", 2% to "number", and so on. A document, in turn, is composed of several topics, as in the histogram on the right. The generative process is: draw some topics from the topic set according to a probability distribution, then draw words from each chosen topic according to its word distribution; those words make up the final document. (In LDA a document is an unordered collection of words, i.e., word order does not matter.)
If we can work out these two probability distributions, we obtain a model that can infer the topic distribution of a given document, which amounts to classifying it. Inferring topics from a document is the inverse of the document generation process.
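To make the generative story concrete, here is a minimal sketch of it in Java (my own illustration, not part of the original code; it assumes fixed, made-up theta and phi, whereas in full LDA these are themselves drawn from Dirichlet priors):

import java.util.Random;

public class LdaGenerativeSketch {
    static final Random RNG = new Random();

    /** Draw an index from a discrete distribution given by probs (assumed to sum to 1). */
    static int sampleDiscrete(double[] probs) {
        double u = RNG.nextDouble(), cum = 0;
        for (int i = 0; i < probs.length; i++) {
            cum += probs[i];
            if (u < cum) return i;
        }
        return probs.length - 1; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        // Hypothetical example: 2 topics over a 4-word vocabulary.
        double[][] phi = {{0.5, 0.4, 0.05, 0.05},   // topic 0: distribution over words
                          {0.05, 0.05, 0.5, 0.4}};  // topic 1: distribution over words
        double[] theta = {0.8, 0.2};                // this document's topic proportions

        // Generate a 10-word document: pick a topic, then pick a word from that topic.
        for (int n = 0; n < 10; n++) {
            int z = sampleDiscrete(theta);      // topic assignment z_n
            int w = sampleDiscrete(phi[z]);     // word w_n drawn from topic z
            System.out.print(w + " ");
        }
        System.out.println();
    }
}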
The essay 《LDA数学八卦》 ("LDA Math Gossip") gives a very vivid description of this document generation process.
3) The probability model
LDA is a probabilistic model that uses the joint distribution to compute the conditional distribution of the hidden variables given the observed variables (the posterior). The observed variables are the collection of words; the hidden variables are the topics.
Joint distribution
The joint distribution of observed and hidden variables corresponding to LDA's generative process is:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
The formula is a little long; read it patiently from left to right.
Basic notation: β denotes topics, θ topic proportions, z the topic assignment of a particular word in a particular document, and w the words. In more detail:
β_{1:K} is the full set of topics, where β_k is the word distribution of the k-th topic (the left part of Figure 1). θ_d gives the topic proportions of the d-th document, where θ_{d,k} is the proportion of topic k in document d (the histogram on the right of Figure 1). z_d is the set of topic assignments of document d, where z_{d,n} is the topic of the n-th word in document d (the colored circles in Figure 1). w_d denotes all the words of document d, where w_{d,n} is the n-th word of document d; every word is an element of a fixed vocabulary.
p(β_k) is the probability of drawing a particular topic from the topic set, and p(θ_d) is the probability of that topic structure appearing in a particular document. Inside the parentheses, the first factor gives the topic of the document's n-th word once the proportions are fixed, and the second factor couples that word's topic with the word itself. The chain of products encodes the dependencies among the random variables, which the article depicts with a probabilistic graphical model.
For example, a topic must be drawn first before a word can be drawn from it. Concretely, each word is influenced (directly or indirectly) by two random variables: the topic proportions θ_d of its document, and the word distribution β_k of the k-th topic (the second urn in Figure 2).
Posterior distribution
With the same notation, the LDA posterior is computed as:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}$$
The numerator is the joint distribution, which is easy to evaluate given the corpus. The denominator, the marginal probability of the observations, cannot be computed by brute force: in principle it sums the joint distribution over every possible assignment of topics to the words, and with corpus vocabularies running into the millions the number of possible combinations is astronomically large. An approximate algorithm is therefore needed.
Sampling-based algorithms collect samples from the posterior and use the empirical distribution of those samples as an approximation of the posterior.
θ_d follows a Dirichlet distribution and z_{d,n} follows a multinomial distribution, and the two distributions are conjugate. Conjugacy means that the posterior has the same functional form as the prior.
Both are in fact distributions over vectors, and the vectors are obtained by sampling from them; the sampling method collects samples from these two distributions and approximates them with the distribution of the samples.
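In symbols, this is the standard Dirichlet-multinomial conjugacy: if $\theta_d \sim \mathrm{Dirichlet}(\alpha)$ and the assignments $z_d$ drawn from $\theta_d$ produce topic counts $n_{d,1}, \ldots, n_{d,K}$, then

$$p(\theta_d \mid z_d) = \mathrm{Dirichlet}(\alpha + n_{d,1}, \ldots, \alpha + n_{d,K}),$$

that is, the posterior is again a Dirichlet, just with updated counts.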
4) Markov chains and Gibbs sampling, a statistical simulation approach.
Markov chains
A Markov chain is a process whose current state depends only on the previous state. Markov chains have an important property: under suitable conditions, the powers of the state-transition matrix P converge, and the limit defines the chain's stationary distribution (each row of the converged matrix is that distribution). Given a target distribution p(x), if we can construct a P whose stationary distribution is exactly p(x), then starting from any initial state, after enough transitions every subsequent state is a sample from p(x).
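A toy illustration of this convergence (my own example with a made-up 2-state chain, not from the original article): repeatedly multiplying any initial distribution by P converges to the same stationary distribution.

public class MarkovChainDemo {
    public static void main(String[] args) {
        // Hypothetical 2-state transition matrix: P[i][j] = p(next = j | current = i).
        double[][] P = {{0.9, 0.1},
                        {0.5, 0.5}};
        double[] dist = {1.0, 0.0}; // arbitrary initial distribution

        // dist := dist * P, repeated; it converges to the stationary distribution.
        for (int step = 1; step <= 30; step++) {
            double[] next = new double[2];
            for (int j = 0; j < 2; j++)
                for (int i = 0; i < 2; i++)
                    next[j] += dist[i] * P[i][j];
            dist = next;
        }
        // Prints roughly (0.8333, 0.1667): the stationary distribution pi with pi = pi * P.
        System.out.printf("%.4f %.4f%n", dist[0], dist[1]);
    }
}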
Gibbs Sampling
Gibbs sampling is a sampling algorithm specialized to high-dimensional distributions (distributions such as a two-dimensional p(x,y) or a three-dimensional p(x,y,z)): each step resamples a single coordinate from its conditional distribution given the current values of all the other coordinates.
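A compact, self-contained example (my own, with an assumed correlation rho = 0.8, not from the original article): Gibbs sampling from a bivariate standard normal, where each coordinate's full conditional is the one-dimensional normal x | y ~ N(rho*y, 1 - rho^2).

import java.util.Random;

public class GibbsBivariateNormal {
    public static void main(String[] args) {
        Random rng = new Random(42);
        double rho = 0.8;                          // assumed correlation
        double condSd = Math.sqrt(1 - rho * rho);  // std dev of each conditional
        double x = 0, y = 0;                       // arbitrary initial state

        double sumXY = 0;
        int burnIn = 1000, samples = 100000;
        for (int i = 0; i < burnIn + samples; i++) {
            // Alternately resample each coordinate from its full conditional.
            x = rho * y + condSd * rng.nextGaussian(); // x | y ~ N(rho*y, 1 - rho^2)
            y = rho * x + condSd * rng.nextGaussian(); // y | x ~ N(rho*x, 1 - rho^2)
            if (i >= burnIn) sumXY += x * y;
        }
        // For unit-variance marginals, E[xy] = rho, so this should print about 0.8.
        System.out.println("estimated correlation = " + sumXY / samples);
    }
}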
2. Code
Reference: https://github.com/hankcs/LDA4j
The implementation originates from Gregor Heinrich's LdaGibbsSampler.java class.
/*
 * (C) Copyright 2005, Gregor Heinrich (gregor :: arbylon : net)
 * (This file is part of the org.knowceans experimental software packages.)
 *
 * LdaGibbsSampler is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License as published by the Free
 * Software Foundation; either version 2 of the License, or (at your option)
 * any later version.
 *
 * LdaGibbsSampler is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
 * for more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with this program; if not, write to the Free Software Foundation, Inc.,
 * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */

/*
 * Created on Mar 6, 2005
 */
package sk.ml;

import java.text.DecimalFormat;
import java.text.NumberFormat;

/**
 * Gibbs sampler for estimating the best assignments of topics for words and
 * documents in a corpus. The algorithm is introduced in Tom Griffiths' paper
 * "Gibbs sampling in the generative model of Latent Dirichlet Allocation" (2002).
 *
 * @author heinrich
 */
public class LdaGibbsSampler {

    /** document data (term lists) */
    int[][] documents;

    /** vocabulary size */
    int V;

    /** number of topics */
    int K;

    /** Dirichlet parameter (document--topic associations) */
    double alpha = 2.0;

    /** Dirichlet parameter (topic--term associations) */
    double beta = 0.5;

    /** topic assignments for each word: z[i][j] := topic of the j-th word in document i */
    int[][] z;

    /** counter: nw[i][j] := number of instances of word/term i assigned to topic j */
    int[][] nw;

    /** counter: nd[i][j] := number of words in document i assigned to topic j */
    int[][] nd;

    /** counter: nwsum[j] := total number of words assigned to topic j */
    int[] nwsum;

    /** counter: ndsum[i] := total number of words in document i */
    int[] ndsum;

    /** cumulative statistics of theta */
    double[][] thetasum;

    /** cumulative statistics of phi */
    double[][] phisum;

    /** size of statistics (number of samples accumulated) */
    int numstats;

    /** thinning interval: how often to update the displayed statistics */
    private static int THIN_INTERVAL = 20;

    /** burn-in period: iterations before convergence */
    private static int BURN_IN = 100;

    /** max iterations */
    private static int ITERATIONS = 1000;

    /**
     * sample lag (if -1 only one sample taken); averaging several
     * post-convergence samples yields a better model
     */
    private static int SAMPLE_LAG = 10;

    private static int dispcol = 0;

    /**
     * Initialise the Gibbs sampler with data.
     *
     * @param documents the corpus
     * @param V vocabulary size
     */
    public LdaGibbsSampler(int[][] documents, int V) {
        this.documents = documents;
        this.V = V;
    }

    /**
     * Initialisation: must start with an assignment of observations to topics.
     * Many alternatives are possible; here we perform random assignments with
     * equal probabilities.
     *
     * @param K number of topics
     */
    public void initialState(int K) {
        int M = documents.length;

        // initialise count variables
        nw = new int[V][K];
        nd = new int[M][K];
        nwsum = new int[K];
        ndsum = new int[M];

        // The z_i are initialised to values in [0,K) to determine the initial
        // state of the Markov chain.
        z = new int[M][];
        for (int m = 0; m < M; m++) {
            int N = documents[m].length;
            z[m] = new int[N];
            for (int n = 0; n < N; n++) {
                int topic = (int) (Math.random() * K);
                z[m][n] = topic;
                // number of instances of word i assigned to topic j
                nw[documents[m][n]][topic]++;
                // number of words in document i assigned to topic j
                nd[m][topic]++;
                // total number of words assigned to topic j
                nwsum[topic]++;
            }
            // total number of words in document i
            ndsum[m] = N;
        }
    }

    public void gibbs(int K) {
        gibbs(K, 2.0, 0.5);
    }

    /**
     * Main method: select an initial state, then repeat a large number of times:
     * 1. select an element, 2. update it conditional on the other elements.
     * If appropriate, output a summary for each run.
     *
     * @param K number of topics
     * @param alpha symmetric prior parameter on document--topic associations
     * @param beta symmetric prior parameter on topic--term associations
     */
    public void gibbs(int K, double alpha, double beta) {
        this.K = K;
        this.alpha = alpha;
        this.beta = beta;

        // init sampler statistics (allocate memory)
        if (SAMPLE_LAG > 0) {
            thetasum = new double[documents.length][K];
            phisum = new double[K][V];
            numstats = 0;
        }

        // initial state of the Markov chain
        initialState(K);

        System.out.println("Sampling " + ITERATIONS
                + " iterations with burn-in of " + BURN_IN + " (B/S="
                + THIN_INTERVAL + ").");

        for (int i = 0; i < ITERATIONS; i++) {
            // for all z_i
            for (int m = 0; m < z.length; m++) {
                for (int n = 0; n < z[m].length; n++) {
                    // (z_i = z[m][n]): sample from p(z_i | z_-i, w)
                    int topic = sampleFullConditional(m, n);
                    z[m][n] = topic;
                }
            }

            if ((i < BURN_IN) && (i % THIN_INTERVAL == 0)) {
                System.out.print("B");
                dispcol++;
            }
            // display progress
            if ((i > BURN_IN) && (i % THIN_INTERVAL == 0)) {
                System.out.print("S");
                dispcol++;
            }
            // get statistics after burn-in
            if ((i > BURN_IN) && (SAMPLE_LAG > 0) && (i % SAMPLE_LAG == 0)) {
                updateParams();
                System.out.print("|");
                if (i % THIN_INTERVAL != 0)
                    dispcol++;
            }
            if (dispcol >= 100) {
                System.out.println();
                dispcol = 0;
            }
        }
        System.out.println();
    }

    /**
     * Sample a topic z_i from the full conditional distribution:
     * p(z_i = j | z_-i, w) = (n_-i,j(w_i) + beta)/(n_-i,j(.) + W * beta)
     *                      * (n_-i,j(d_i) + alpha)/(n_-i,.(d_i) + K * alpha)
     *
     * @param m document
     * @param n word
     */
    private int sampleFullConditional(int m, int n) {
        // remove z_i from the count variables
        int topic = z[m][n];
        nw[documents[m][n]][topic]--;
        nd[m][topic]--;
        nwsum[topic]--;
        ndsum[m]--;

        // do multinomial sampling via the cumulative method
        double[] p = new double[K];
        for (int k = 0; k < K; k++) {
            p[k] = (nw[documents[m][n]][k] + beta) / (nwsum[k] + V * beta)
                    * (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
        }
        // cumulate multinomial parameters
        for (int k = 1; k < p.length; k++) {
            p[k] += p[k - 1];
        }
        // scaled sample because of unnormalised p[]
        double u = Math.random() * p[K - 1];
        for (topic = 0; topic < p.length; topic++) {
            if (u < p[topic])
                break;
        }

        // add newly estimated z_i to count variables
        nw[documents[m][n]][topic]++;
        nd[m][topic]++;
        nwsum[topic]++;
        ndsum[m]++;

        return topic;
    }

    /**
     * Add to the statistics the values of theta and phi for the current state.
     */
    private void updateParams() {
        for (int m = 0; m < documents.length; m++) {
            for (int k = 0; k < K; k++) {
                thetasum[m][k] += (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
            }
        }
        for (int k = 0; k < K; k++) {
            for (int w = 0; w < V; w++) {
                phisum[k][w] += (nw[w][k] + beta) / (nwsum[k] + V * beta);
            }
        }
        numstats++;
    }

    /**
     * Retrieve estimated document--topic associations. If sample lag > 0 then
     * the mean of all sampled statistics for theta[][] is taken.
     *
     * @return theta multinomial mixture of document topics (M x K)
     */
    public double[][] getTheta() {
        double[][] theta = new double[documents.length][K];
        if (SAMPLE_LAG > 0) {
            for (int m = 0; m < documents.length; m++) {
                for (int k = 0; k < K; k++) {
                    theta[m][k] = thetasum[m][k] / numstats;
                }
            }
        } else {
            for (int m = 0; m < documents.length; m++) {
                for (int k = 0; k < K; k++) {
                    theta[m][k] = (nd[m][k] + alpha) / (ndsum[m] + K * alpha);
                }
            }
        }
        return theta;
    }

    /**
     * Retrieve estimated topic--word associations. If sample lag > 0 then the
     * mean of all sampled statistics for phi[][] is taken.
     *
     * @return phi multinomial mixture of topic words (K x V)
     */
    public double[][] getPhi() {
        double[][] phi = new double[K][V];
        if (SAMPLE_LAG > 0) {
            for (int k = 0; k < K; k++) {
                for (int w = 0; w < V; w++) {
                    phi[k][w] = phisum[k][w] / numstats;
                }
            }
        } else {
            for (int k = 0; k < K; k++) {
                for (int w = 0; w < V; w++) {
                    phi[k][w] = (nw[w][k] + beta) / (nwsum[k] + V * beta);
                }
            }
        }
        return phi;
    }

    /**
     * Print table of multinomial data.
     *
     * @param data vector of evidence
     * @param fmax max frequency in display
     */
    public static void hist(double[] data, int fmax) {
        double[] hist = new double[data.length];
        // scale maximum
        double hmax = 0;
        for (int i = 0; i < data.length; i++) {
            hmax = Math.max(data[i], hmax);
        }
        double shrink = fmax / hmax;
        for (int i = 0; i < data.length; i++) {
            hist[i] = shrink * data[i];
        }

        NumberFormat nf = new DecimalFormat("00");
        String scale = "";
        for (int i = 1; i < fmax / 10 + 1; i++) {
            scale += "    .    " + i % 10;
        }

        System.out.println("x" + nf.format(hmax / fmax) + "\t0" + scale);
        for (int i = 0; i < hist.length; i++) {
            System.out.print(i + "\t|");
            for (int j = 0; j < Math.round(hist[i]); j++) {
                if ((j + 1) % 10 == 0)
                    System.out.print("]");
                else
                    System.out.print("|");
            }
            System.out.println();
        }
    }

    /**
     * Configure the Gibbs sampler.
     *
     * @param iterations number of total iterations
     * @param burnIn number of burn-in iterations
     * @param thinInterval update statistics interval
     * @param sampleLag sample interval (-1 for just one sample at the end)
     */
    public void configure(int iterations, int burnIn, int thinInterval, int sampleLag) {
        ITERATIONS = iterations;
        BURN_IN = burnIn;
        THIN_INTERVAL = thinInterval;
        SAMPLE_LAG = sampleLag;
    }

    /**
     * Infer the topic distribution of a new document from a pre-trained phi matrix.
     *
     * @param phi pre-trained phi matrix
     * @param doc document
     * @return a theta array
     */
    public static double[] inference(double alpha, double beta, double[][] phi, int[] doc) {
        int K = phi.length;
        int V = phi[0].length;

        // initialise count variables
        int[][] nw = new int[V][K];
        int[] nd = new int[K];
        int[] nwsum = new int[K];
        int ndsum = 0;

        // The z_i are initialised to values in [0,K) to determine the initial
        // state of the Markov chain.
        int N = doc.length;
        int[] z = new int[N];
        for (int n = 0; n < N; n++) {
            int topic = (int) (Math.random() * K);
            z[n] = topic;
            // number of instances of word i assigned to topic j
            nw[doc[n]][topic]++;
            // number of words in this document assigned to topic j
            nd[topic]++;
            // total number of words assigned to topic j
            nwsum[topic]++;
        }
        // total number of words in the document
        ndsum = N;

        for (int i = 0; i < ITERATIONS; i++) {
            for (int n = 0; n < z.length; n++) {
                // sample from p(z_i | z_-i, w):
                // first remove z_i from the count variables
                int topic = z[n];
                nw[doc[n]][topic]--;
                nd[topic]--;
                nwsum[topic]--;
                ndsum--;

                // do multinomial sampling via the cumulative method
                double[] p = new double[K];
                for (int k = 0; k < K; k++) {
                    p[k] = phi[k][doc[n]] * (nd[k] + alpha) / (ndsum + K * alpha);
                }
                // cumulate multinomial parameters
                for (int k = 1; k < p.length; k++) {
                    p[k] += p[k - 1];
                }
                // scaled sample because of unnormalised p[]
                double u = Math.random() * p[K - 1];
                for (topic = 0; topic < p.length; topic++) {
                    if (u < p[topic])
                        break;
                }
                if (topic == K) {
                    throw new RuntimeException("the param K or topic is set too small");
                }

                // add newly estimated z_i to count variables
                nw[doc[n]][topic]++;
                nd[topic]++;
                nwsum[topic]++;
                ndsum++;
                z[n] = topic;
            }
        }

        double[] theta = new double[K];
        for (int k = 0; k < K; k++) {
            theta[k] = (nd[k] + alpha) / (ndsum + K * alpha);
        }
        return theta;
    }

    public static double[] inference(double[][] phi, int[] doc) {
        return inference(2.0, 0.5, phi, doc);
    }

    /**
     * Driver with example data.
     */
    public static void main(String[] args) {
        // words in documents (each number is a term id)
        int[][] documents = {
                {1, 4, 3, 2, 3, 1, 4, 3, 2, 3, 1, 4, 3, 2, 3, 6},
                {2, 2, 4, 2, 4, 2, 2, 2, 2, 4, 2, 2},
                {1, 6, 5, 6, 0, 1, 6, 5, 6, 0, 1, 6, 5, 6, 0, 0},
                {5, 6, 6, 2, 3, 3, 6, 5, 6, 2, 2, 6, 5, 6, 6, 6, 0},
                {2, 2, 4, 4, 4, 4, 1, 5, 5, 5, 5, 5, 5, 1, 1, 1, 1, 0},
                {5, 4, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2}};
        // vocabulary size
        int V = 7;
        int M = documents.length;
        // number of topics
        int K = 2;
        // good values alpha = 2, beta = .5
        double alpha = 2;
        double beta = .5;

        System.out.println("Latent Dirichlet Allocation using Gibbs Sampling.");

        LdaGibbsSampler lda = new LdaGibbsSampler(documents, V);
        lda.configure(10000, 2000, 100, 10);
        lda.gibbs(K, alpha, beta);

        double[][] theta = lda.getTheta();
        double[][] phi = lda.getPhi();

        System.out.println();
        System.out.println();
        System.out.println("Document--Topic Associations, Theta[d][k] (alpha=" + alpha + ")");
        System.out.print("d\\k\t");
        for (int m = 0; m < theta[0].length; m++) {
            System.out.print("   " + m % 10 + "    ");
        }
        System.out.println();
        for (int m = 0; m < theta.length; m++) {
            System.out.print(m + "\t");
            for (int k = 0; k < theta[m].length; k++) {
                System.out.print(shadeDouble(theta[m][k], 1) + " ");
            }
            System.out.println();
        }
        System.out.println();
        System.out.println("Topic--Term Associations, Phi[k][w] (beta=" + beta + ")");
        System.out.print("k\\w\t");
        for (int w = 0; w < phi[0].length; w++) {
            System.out.print("   " + w % 10 + "    ");
        }
        System.out.println();
        for (int k = 0; k < phi.length; k++) {
            System.out.print(k + "\t");
            for (int w = 0; w < phi[k].length; w++) {
                System.out.print(shadeDouble(phi[k][w], 1) + " ");
            }
            System.out.println();
        }

        // Let's infer the topic distribution of a new document
        int[] aNewDocument = {2, 2, 4, 2, 4, 2, 2, 2, 2, 4, 2, 2};
        double[] newTheta = inference(alpha, beta, phi, aNewDocument);
        for (int k = 0; k < newTheta.length; k++) {
            System.out.print(shadeDouble(newTheta[k], 1) + " ");
        }
        System.out.println();
    }

    static String[] shades = {"     ", ".    ", ":    ", ":.   ", "::   ",
            "::.  ", ":::  ", ":::. ", ":::: ", "::::.", ":::::"};

    static NumberFormat lnf = new DecimalFormat("00E0");

    /**
     * Create a string representation whose gray value appears as an indicator
     * of magnitude, cf. Hinton diagrams in statistics.
     *
     * @param d value
     * @param max maximum value
     * @return the shade string
     */
    public static String shadeDouble(double d, double max) {
        int a = (int) Math.floor(d * 10 / max + 0.5);
        if (a > 10 || a < 0) {
            String x = lnf.format(d);
            a = 5 - x.length();
            for (int i = 0; i < a; i++) {
                x += " ";
            }
            return "<" + x + ">";
        }
        return "[" + shades[a] + "]";
    }
}

To really understand this code, you still need to dig into the mathematical form it implements and the relevant probability distributions from statistics.
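For reference, the full conditional that sampleFullConditional implements (the formula from its comment, rewritten cleanly with W = V, the vocabulary size) is:

$$p(z_i = k \mid z_{-i}, w) \propto \frac{n_{-i,k}^{(w_i)} + \beta}{n_{-i,k}^{(\cdot)} + V\beta} \cdot \frac{n_{-i,k}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + K\alpha}$$

where $n_{-i,k}^{(w_i)}$ counts how often word $w_i$ is assigned to topic $k$ excluding the current position $i$. The first factor corresponds to the nw/nwsum counters in the code, the second to nd/ndsum.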
Reference: http://www.hankcs.com/nlp/lda-java-introduction-and-implementation.html
For now I am simply gathering these machine learning topics in one place; I will study each as its own subject when an application calls for it.