Structural Topic Models (Part 1): The stm Package Workflow
Preface
This post organizes the workflow of the stm code from the paper "stm: An R Package for Structural Topic Models". The overall structure follows the paper, but I offer my own thoughts on the order in which some of the code should be run. Time being limited, some problems remain unsolved (such as choosing an appropriate number of topics), and some later parts of the paper are not yet covered in detail; I will fill these in when time permits. If anyone can offer useful suggestions or solutions, I will respond as soon as possible. Finally, I hope this helps those working with STM structural topic models 😁
A roundup of issues encountered while reproducing the paper:
 Structural Topic Models (Part 2): Reproduction
Original paper, data, and code:
 stm: An R Package for Structural Topic Models
Official documentation for the stm package
3.0 Reading in the data
The sample data poliblogs2008.csv is a collection of blog posts about American politics, drawn from the CMU 2008 political blog corpus: American Thinker, Digby, Hot Air, Michelle Malkin, Think Progress, and Talking Points Memo. Each blog outlet has its own political leaning, so every post carries metadata recording its date and political ideology.
I recommend reading the data from xlsx, because csv files use commas as delimiters, which occasionally causes problems. (See: "pandas: converting between csv and Excel files".)
```r
library(readxl)

# data <- read.csv("./poliblogs2008.csv", sep = ",", quote = "", header = TRUE, fileEncoding = "UTF-8")
data <- read_excel(path = "./poliblogs2008.xlsx", sheet = "Sheet1", col_names = TRUE)
```

If your data is in Chinese, consult the following article for word segmentation and other preprocessing before moving on to the subsequent steps.
The numbering starts at 3.0 to stay consistent with the original paper.
3.1 Ingest: Reading and processing text data
Ingest: process the raw data into the three components STM can analyze (documents, vocab, and meta), using either the textProcessor() or the readCorpus() function.
The textProcessor() function is designed to provide a convenient and quick way to process relatively small amounts of text for analysis with the package. It is meant to quickly ingest data in a simple form, such as a spreadsheet in which each document sits in a single cell.
```r
# Run textProcessor on data$documents, passing the full data frame as metadata
processed <- textProcessor(documents = data$documents, metadata = data, wordLengths = c(1, Inf))
```

The wordLengths parameter of textProcessor() (default c(3, Inf)) means that words shorter than the minimum length (3 characters by default) or longer than the maximum (Inf by default) are discarded. User @qq_39172034 suggests setting wordLengths = c(1, Inf) to avoid dropping single Chinese characters.
The paper notes that textProcessor() can handle multiple languages by setting language = "en", customstopwords = NULL, and so on. As of version 0.5, the supported languages are "Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Swedish, Turkish"; Chinese is not supported.
 See: textProcessor function - RDocumentation
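As an illustration (not from the paper), a minimal sketch of ingesting a non-English corpus; the language names follow the SnowballC stemmer conventions, and the custom stopword list here is a made-up placeholder:

```r
# Sketch: ingesting German text. 'language' follows SnowballC stemmer names;
# the customstopwords vector is a hypothetical placeholder.
processed_de <- textProcessor(documents = data$documents, metadata = data,
                              language = "german",
                              customstopwords = c("zb", "usw"))
```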
3.2 Prepare: Associating text with metadata
Prepare: convert the data format and remove low-frequency words according to a threshold, using the prepDocuments() and plotRemoved() functions.
The plotRemoved() function plots the number of documents, words, and tokens that would be removed at different thresholds.
pdf("output/stm-plot-removed.pdf") plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 100)) dev.off()根據此pdf文件的結果(output/stm-plot-removed.pdf),確定prepDocuments()中的參數lower.thresh的取值,以此確定變量docs、vocab、meta
論文中提到如果在處理過程中發生任何更改,PrepDocuments還將重新索引所有元數據/文檔關系。例如,當文檔因為含有低頻單詞而在預處理階段被完全刪除,那么PrepDocuments()也將刪除元數據中的相應行。因此在讀入和處理文本數據后,檢查文檔的特征和相關詞匯表以確保它們已被正確預處理是很重要的。
```r
# Remove words that appear in fewer than 15 documents
out <- prepDocuments(documents = processed$documents, vocab = processed$vocab, meta = processed$meta, lower.thresh = 15)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
```

-  docs: the documents, a list containing word indices and their associated counts
-  vocab: a character vector containing the words associated with the word indices
-  meta: a metadata matrix containing the document covariates
The following shows the documents object for two short documents: the first contains five words, located at positions 21, 23, 87, 98, and 112 of the vocab vector; the first word appears twice and the others once each. The second document contains three words, read the same way.
```
[[1]]
     [,1] [,2] [,3] [,4] [,5]
[1,]   21   23   87   98  112
[2,]    2    1    1    1    1

[[2]]
     [,1] [,2] [,3]
[1,]   16   61   90
[2,]    1    1    1
```
3.3 Estimate: Estimating the structural topic model
The key innovation of the STM is that it incorporates metadata into the topic modeling framework. In STM, metadata can enter the topic model in two ways: **topical prevalence** and topical content. Metadata covariates for topical prevalence allow the observed metadata to affect how frequently a topic is discussed. Covariates in topical content allow the observed metadata to affect the word usage rates within a given topic, i.e., how a particular topic is discussed. Both topical prevalence and topical content are estimated with the stm() function.
Topical prevalence captures how much each topic contributes to a document. Because different documents come from different places, it is natural to want topical prevalence to vary with the metadata.
Concretely, the paper uses the variable rating (ideology: Liberal or Conservative) as a topical prevalence covariate. Beyond ideology, other covariates can be added with +, such as the "day" variable in the raw data (the posting date).
The s() in s(day) is a spline function, a fairly flexible b-spline basis.
The day variable runs from the first to the last day of 2008, like panel data; entering it as daily time dummies (365 panels) would cost more than 300 degrees of freedom, so the spline function is introduced to avoid that loss of degrees of freedom.
The stm package also includes a convenience function, s(), which selects a fairly flexible b-spline basis. In the current example we allow for the variable day to be estimated with a spline.
```r
poliblogPrevFit <- stm(documents = out$documents, vocab = out$vocab, K = 20, prevalence = ~rating + s(day), max.em.its = 75, data = out$meta, init.type = "Spectral")
```

In R, the prevalence argument takes a formula that can contain multiple covariates, either factor or continuous. Other standard transforms are available as well, such as log(), and ns() or bs() from the splines package.
As the iterations proceed, the model is considered to have converged when the change in the bound becomes sufficiently small. (A quick way to inspect this is sketched below.)
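As a supplementary check (not in the paper), the fitted model stores the approximate variational bound at each EM iteration in its convergence element; plotting it shows the convergence path, and the tolerance itself is controlled by stm()'s emtol argument:

```r
# Sketch: inspect the EM convergence path of the fitted model.
# stm() declares convergence once the relative change in the bound drops below emtol.
plot(poliblogPrevFit$convergence$bound, type = "l",
     xlab = "EM iteration", ylab = "Approximate variational lower bound")
```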
3.4 Evaluate: Model selection and search
Because the posterior of mixed-membership topic models is often non-convex and intractable, the model obtained depends on the starting values of the parameters (for example, the word distribution of a specific topic). There are two ways to initialize the model:
- spectral initialization: init.type = "Spectral". This is the preferred option.
- a collapsed Gibbs sampler for LDA: init.type = "LDA"
selectModel() first casts a net of candidate models, running each for a small number (fewer than 10) of E and M steps and discarding the low-likelihood models; it then keeps running only the top 20% of models by likelihood until convergence or the maximum number of iterations (max.em.its) is reached.
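The selectModel() call that produces poliblogSelect (used below) is not shown in this post; based on the example in the paper (the runs and seed values are taken from the original vignette), it would look like this:

```r
# From the paper's example: cast a net of 20 differently-initialized models,
# keeping only the best-performing runs until convergence.
poliblogSelect <- selectModel(out$documents, out$vocab, K = 20,
                              prevalence = ~rating + s(day), max.em.its = 75,
                              data = out$meta, runs = 20, seed = 8458159)
```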
Choose a suitable model based on the semantic coherence and exclusivity displayed by plotModels(); the higher both values are, the better the model.
```r
# Plot the models' average scores, with a different plotting symbol for each model
plotModels(poliblogSelect, pch = c(1, 2, 3, 4), legend.position = "bottomright")
# Select model 3
selectedmodel <- poliblogSelect$runout[[3]]
```

To decide between two or more candidate numbers of topics, likewise compare their semantic coherence and exclusivity.
3.5 Understand: Interpreting the STM by plotting and inspecting results
Once the model is chosen, the next step is to present its results with the functions provided by the stm package. To stay consistent with the paper, the initial model poliblogPrevFit is used as the argument rather than the model selected via selectModel().
Ranking the high-frequency words under each topic: labelTopics(), sageLabels()
Both functions output the words associated with each topic. sageLabels() is only used for models with a content covariate; it also produces more detailed output than labelTopics(), and by default prints the top words and related information for every topic.
```r
# labelTopics(): label topics by listing the top words for selected topics 1 to 5
labelTopicsSel <- labelTopics(poliblogPrevFit, c(1:5))
sink("output/labelTopics-selected.txt", append = FALSE, split = TRUE)
print(labelTopicsSel)
sink()

# sageLabels() produces more detailed output than labelTopics()
sink("stm-list-sagelabel.txt", append = FALSE, split = TRUE)
print(sageLabels(poliblogPrevFit))
sink()
```

TODO: the outputs of the two functions differ.
Listing the documents most associated with a topic: findThoughts()
```r
shortdoc <- substr(out$meta$documents, 1, 200)
# texts = shortdoc prints the first 200 characters of each document; n is the number of matching documents to return
thoughts1 <- findThoughts(poliblogPrevFit, texts = shortdoc, n = 2, topics = 1)$docs[[1]]
pdf("findThoughts-T1.pdf")
plotQuote(thoughts1, width = 40, main = "Topic 1")
dev.off()

# How about more documents for more of these topics?
thoughts6 <- findThoughts(poliblogPrevFit, texts = shortdoc, n = 2, topics = 6)$docs[[1]]
thoughts18 <- findThoughts(poliblogPrevFit, texts = shortdoc, n = 2, topics = 18)$docs[[1]]
pdf("stm-plot-find-thoughts.pdf")
# mfrow = c(2, 1) arranges the plots in 2 rows and 1 column
par(mfrow = c(2, 1), mar = c(.5, .5, 1, .5))
plotQuote(thoughts6, width = 40, main = "Topic 6")
plotQuote(thoughts18, width = 40, main = "Topic 18")
dev.off()
```

Estimating the relationship between metadata and topics/topical content: estimateEffect()
```r
out$meta$rating <- as.factor(out$meta$rating)
# Since we're preparing these covariates by estimating their effects, we call the estimated effects 'prep'.
# We estimate effects across all 20 topics (1:20), using 'rating' and a spline of 'day',
# with the topic model poliblogPrevFit and the metadata in out$meta,
# telling it to account for all possible uncertainty.
# Note: when estimating the effect of one covariate, the others are held at their means.
prep <- estimateEffect(1:20 ~ rating + s(day), poliblogPrevFit, metadata = out$meta, uncertainty = "Global")
summary(prep, topics = 1)
summary(prep, topics = 2)
summary(prep, topics = 3)
summary(prep, topics = 4)
```

The uncertainty argument can be "Global", "Local", or "None". The default is "Global", which incorporates the estimation uncertainty of the topic proportions into the uncertainty estimates using the method of composition. Users who do not want to propagate the full amount of uncertainty, e.g., in order to speed up computation, can choose uncertainty = "None", which will generally result in narrower confidence intervals because the additional estimation uncertainty is not included.
Output of summary(prep, topics = 1):
```
Call:
estimateEffect(formula = 1:20 ~ rating + s(day), stmobj = poliblogPrevFit,
    metadata = meta, uncertainty = "Global")

Topic 1:

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.068408   0.011233   6.090 1.16e-09 ***
ratingLiberal  -0.002513   0.002588  -0.971  0.33170    
s(day)1        -0.008596   0.021754  -0.395  0.69276    
s(day)2        -0.035476   0.012314  -2.881  0.00397 ** 
s(day)3        -0.002806   0.015696  -0.179  0.85813    
s(day)4        -0.030237   0.013056  -2.316  0.02058 *  
s(day)5        -0.026256   0.013791  -1.904  0.05695 .  
s(day)6        -0.010658   0.013584  -0.785  0.43269    
s(day)7        -0.005835   0.014381  -0.406  0.68494    
s(day)8         0.041965   0.016056   2.614  0.00897 ** 
s(day)9        -0.101217   0.016977  -5.962 2.56e-09 ***
s(day)10       -0.024237   0.015679  -1.546  0.12216    
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

3.6 Visualize: Presenting STM results
Summary visualization
Bar chart of topic proportions
```r
# See the proportion of each topic in the entire corpus; just insert your STM output
pdf("top-topic.pdf")
plot(poliblogPrevFit, type = "summary", xlim = c(0, .3))
dev.off()
```

Metadata/topic relationship visualization
Topical prevalence contrast plot
```r
pdf("stm-plot-topical-prevalence-contrast.pdf")
plot(prep, covariate = "rating", topics = c(6, 13, 18),
     model = poliblogPrevFit, method = "difference",
     cov.value1 = "Liberal", cov.value2 = "Conservative",
     xlab = "More Conservative ... More Liberal",
     main = "Effect of Liberal vs. Conservative",
     xlim = c(-.1, .1), labeltype = "custom",
     custom.labels = c("Obama/McCain", "Sarah Palin", "Bush Presidency"))
dev.off()
```

Topics 6, 13, and 18 are given the custom labels "Obama/McCain", "Sarah Palin", and "Bush Presidency". Topics 6 and 13 come out ideologically neutral, neither conservative nor liberal, while topic 18 leans conservative.
Topic prevalence over time
```r
pdf("stm-plot-topic-prevalence-with-time.pdf")
plot(prep, "day", method = "continuous", topics = 13,
     model = poliblogPrevFit, printlegend = FALSE, xaxt = "n", xlab = "Time (2008)")
monthseq <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"), by = "month")
monthnames <- months(monthseq)
axis(1, at = as.numeric(monthseq) - min(as.numeric(monthseq)), labels = monthnames)
dev.off()
```

The original snippet passed model = z, where z is undefined (a leftover from the paper's example code); that is the likely cause of the error it reported, and poliblogPrevFit is used here instead. The plot may still emit warnings ("There were 50 or more warnings"; use warnings() to inspect them).
Topical content
Shows which words within a topic are more strongly associated with one value of a covariate than with another.
```r
# TOPICAL CONTENT.
# STM can plot the influence of a covariate included as a topical content covariate.
# A topical content variable allows the vocabulary used to talk about a particular
# topic to vary. First, the STM must be fit with a variable specified in the content option.
# Instead of looking at how prevalent a topic is in a class of documents categorized by
# a metadata covariate, let's see how the words of the topic are emphasized differently
# in documents of each category of the covariate.
# We estimate a new STM. It is the same as the old one, including the prevalence option,
# but we add a content option.
poliblogContent <- stm(out$documents, out$vocab, K = 20,
                       prevalence = ~rating + s(day), content = ~rating,
                       max.em.its = 75, data = out$meta, init.type = "Spectral")
pdf("stm-plot-content-perspectives.pdf")
plot(poliblogContent, type = "perspectives", topics = 10)
dev.off()
```

Topic 10 concerns Cuba. Its most common words are "detention, prison, court, illegal, torture, enforce, Cuba". The plot shows how liberals and conservatives discuss this topic differently: liberals emphasize "torture", while conservatives emphasize typical courtroom terms such as "illegal" and "legal".
From the paper: "Its top FREX words were 'detaine, prison, court, illeg, tortur, enforc, guantanamo'". Note that these are stems produced during preprocessing, so "tortur" is the stemmed form of "torture" rather than a typo.
Plotting vocabulary differences between topics
```r
pdf("stm-plot-content-perspectives-16-18.pdf")
plot(poliblogPrevFit, type = "perspectives", topics = c(16, 18))
dev.off()
```

Plotting covariate interactions
```r
# Interactions between covariates can be examined, such that one variable may "moderate"
# the effect of another variable.
# First, we estimate an STM with the interaction.
poliblogInteraction <- stm(out$documents, out$vocab, K = 20,
                           prevalence = ~rating * day, max.em.its = 75,
                           data = out$meta, init.type = "Spectral")
# Prep the covariates using estimateEffect(), this time including the interaction term.
# Then plot the variables and save as pdf files.
prep <- estimateEffect(c(16) ~ rating * day, poliblogInteraction,
                       metadata = out$meta, uncertainty = "None")
pdf("stm-plot-two-topic-contrast.pdf")
plot(prep, covariate = "day", model = poliblogInteraction,
     method = "continuous", xlab = "Days", moderator = "rating",
     moderator.value = "Liberal", linecol = "blue", ylim = c(0, 0.12),
     printlegend = FALSE)
plot(prep, covariate = "day", model = poliblogInteraction,
     method = "continuous", xlab = "Days", moderator = "rating",
     moderator.value = "Conservative", linecol = "red", add = TRUE,
     printlegend = FALSE)
legend(0, 0.06, c("Liberal", "Conservative"), lwd = 2, col = c("blue", "red"))
dev.off()
```

The plot shows the relationship between time (the day a blog post was published) and rating (Liberal vs. Conservative): the prevalence of topic 16 is plotted as a linear function of time, with rating at 0 (Liberal) or 1 (Conservative).
3.7 Extend: Additional tools for interpretation and visualization
Word cloud
```r
pdf("stm-plot-wordcloud.pdf")
cloud(poliblogPrevFit, topic = 13, scale = c(2, 0.25))
dev.off()
```

Topic correlations
```r
# topicCorr()
# STM permits correlations between topics. A positive correlation between two topics
# indicates that both are likely to be discussed within the same document. A graphical
# network display shows how closely related topics are to one another (i.e., how likely
# they are to appear in the same document). This function requires the 'igraph' package.
mod.out.corr <- topicCorr(poliblogPrevFit)
pdf("stm-plot-topic-correlations.pdf")
plot(mod.out.corr)
dev.off()
```

stmCorrViz
The stmCorrViz package provides a different D3 visualization environment, which focuses on visualizing topic correlations by grouping the topics with a hierarchical clustering method.
There is a garbled-character (encoding) problem with the output.
```r
# The stmCorrViz() function generates an interactive visualization of topic
# hierarchy/correlations in a structural topic model. The package performs a
# hierarchical clustering of topics, which is exported to a JSON object and
# visualized using D3.
library(stmCorrViz)
stmCorrViz(poliblogPrevFit, "stm-interactive-correlation.html",
           documents_raw = data$documents, documents_matrix = out$documents)
```

4 Changing basic estimation defaults
This section explains how to change the defaults of the stm package's estimation commands.
It first discusses how to choose among the different methods of initializing the model parameters, then how to set and evaluate convergence criteria, then describes a way to accelerate convergence when analyzing tens of thousands of documents or more, and finally discusses some variants of the content covariate model that let the user control model complexity. A sketch of the relevant arguments follows.
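As a sketch of what those knobs look like in code (the argument names come from the stm documentation; the values here are illustrative, not recommendations):

```r
# Illustrative only: alternative initialization, looser convergence tolerance,
# and grouped (memoized) EM updates to speed up estimation on large corpora.
fit <- stm(out$documents, out$vocab, K = 20,
           prevalence = ~rating + s(day), data = out$meta,
           init.type = "LDA",  # collapsed Gibbs LDA initialization instead of "Spectral"
           emtol = 1e-4,       # tolerance on the relative change in the bound (default 1e-5)
           ngroups = 5,        # number of document groups for memoized variational updates
           max.em.its = 75)
```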
Open questions
What is the difference between max.em.its and runs? max.em.its is the maximum number of EM iterations — are runs = 20 models run within each iteration?
In Section 3.4, how should the appropriate number of topics be determined from the four diagnostic plots? (One possible route is sketched below.)
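Regarding the second question, one route (not covered in this post) is the package's searchK() function, which fits models for several candidate K and reports held-out likelihood, residuals, semantic coherence, and the lower bound; plotting its result produces four diagnostic panels:

```r
# Sketch: compare candidate numbers of topics with searchK().
kresult <- searchK(out$documents, out$vocab, K = c(10, 15, 20, 25),
                   prevalence = ~rating + s(day), data = out$meta)
plot(kresult)  # four panels: held-out likelihood, residuals, semantic coherence, lower bound
```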
Supplementary notes
In the Ingest section, the authors mention the quanteda package for text processing, which makes it easy to import text and associated metadata, prepare the text for processing, and convert documents into a document-term matrix. Another package, readtext, contains very flexible tools for reading many text formats, such as plain text, XML, and JSON, from which a corpus can easily be created.
To read data from other text-processing programs, you can use txtorg, which creates three separate files: a metadata file, a vocabulary file, and a file with the original documents. The default export format is the LDA-C sparse matrix format, which readCorpus() can read with the "ldac" option, as in the sketch below.
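A minimal sketch (the file name is a hypothetical placeholder):

```r
# Sketch: read an LDA-C sparse-matrix export (e.g., produced by txtorg).
corpus <- readCorpus("exported-corpus.ldac", type = "ldac")
# corpus$documents and corpus$vocab can then be passed to prepDocuments()/stm().
```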
Paper: stm: An R Package for Structural Topic Models (harvard.edu)
Reference: R軟件 STM package實操 (Hands-on with the STM package in R) - bilibili (bilibili.com)
Related GitHub repositories:
JvH13/FF-STM: Web Appendix - Methodology for Structural Topic Modeling (github.com)
dondealban/learning-stm: Learning structural topic modeling using the stm R package. (github.com)
bstewart/stm: An R Package for the Structural Topic Model (github.com)