Topic Modeling in R with tidytext and textmineR (Latent Dirichlet Allocation)
In this article, we will learn how to do topic modeling in R using the tidytext and textmineR packages with the Latent Dirichlet Allocation (LDA) algorithm.
Natural Language Processing covers a wide area of knowledge and implementation, and one of its applications is the topic model. A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovering hidden semantic structures in a text body. For example, “dog”, “bone”, and “obedient” will appear more often in documents about dogs, while “cute”, “evil”, and “home owner” will appear in documents about cats. The “topics” produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.
Background
What is topic modeling? Topic modeling is how the machine collects groups of words within a document to build “topics” that contain groups of words with similar dependencies. With topic model methods (topic models, topic modeling, it's just the same) we can organize, understand, and summarize large collections of textual information. It helps in:
- Discovering hidden topical patterns that are present across the collection
- Annotating documents according to these topics
- Using these annotations to organize, search and summarize texts
In a business setting, topic modeling's power to discover hidden topics can help an organization better understand its customer feedback, so that it can concentrate on the issues customers are facing. It can also summarize text from company meetings; a high-quality meeting document enables users to recall the meeting content efficiently. Topic tracking and detection can also be used to build a recommender system.
There are many techniques used to obtain topic models, namely: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Correlated Topic Models (CTM), and TextRank. In this study, we will focus on implementing the LDA algorithm to build topic models with the tidytext and textmineR packages. Beyond building the models, we will also evaluate their goodness of fit using metrics such as R-squared and log-likelihood, and assess the quality of the topics with metrics such as coherence and prevalence.
Load these libraries in your working machine:
# data wrangling
library(dplyr)
library(tidyr)
library(lubridate)
# visualization
library(ggplot2)
# dealing with text
library(textclean)
library(tm)
library(SnowballC)
library(stringr)
# topic model
library(tidytext)
library(topicmodels)
library(textmineR)
Topic Model
From the introduction above we know that there are several ways to build a topic model. In this study, we will use the LDA algorithm. LDA is a mathematical model that is used to find the mixture of words belonging to each topic, and also to determine the mixture of topics that describes each document. LDA answers the following principles of topic modeling:
Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.” This distribution is symbolized as θ (theta).
Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally. This distribution is symbolized as φ (phi).
We will use two packages: tidytext (together with the topicmodels package) and textmineR. The tidytext workflow builds a topic model easily and provides a method for extracting the per-topic-per-word probabilities, called β (“beta”), from the model. But it doesn't provide metrics for measuring the goodness of the model the way textmineR does.
Latent Dirichlet Allocation (LDA)
LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups, which explain why some parts of the data are similar. For example, if the observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. Plate notation (pictured below) is a concise way of visually representing the dependencies among the model parameters.
Figure: LDA plate notation
- M denotes the number of documents
- N is the number of words in a given document
- α is the parameter of the Dirichlet prior on the per-document topic distributions. A high α indicates that each document is likely to contain a mixture of most of the topics (not just one or two); a low α indicates each document will likely contain just a few topics.
- β is the parameter of the Dirichlet prior on the per-topic word distribution. A high β indicates that each topic will contain a mixture of most of the words; a low β indicates the topic has a narrower mixture of words.
- θ_m is the topic distribution for document m
- z_mn is the topic for the n-th word in document m
- w_mn is the specific word
LDA is a generative process. LDA assumes that new documents are created in the following way:
1. Determine the number of words in the document.
2. Choose a topic mixture for the document over a fixed set of topics (for example: 20% topic A, 50% topic B, 30% topic C).
3. Generate the words in the document by:
- picking a topic based on the document’s multinomial distribution (z_mn ~ Multinomial(θ_m))
- picking a word based on that topic’s multinomial distribution (w_mn ~ Multinomial(φ_zmn)), where φ_zmn is the word distribution for topic z
4. Repeat the process for n iterations until the distribution of the words in the topics meets the criteria in step 2.
數(shù)據(jù)導(dǎo)入和目標(biāo) (Data Import & Objectives)
The data is from this Kaggle dataset. It contains customers' feedback on Amazon musical instruments. Every row represents one piece of feedback from one user. There are several columns, but we only need reviewText, which contains the text of the review, overall, the product rating from 1-5 given by the user, and reviewTime, which contains the time the review was given.
# data import and preparation
data <- read.csv("Musical_instruments_reviews.csv")
data <- data %>%
mutate(overall = as.factor(overall),
reviewTime = str_replace_all(reviewTime, pattern = " ",replacement = "-"),
reviewTime = str_replace(reviewTime, pattern = ",",replacement = ""),
reviewTime = mdy(reviewTime)) %>%
select(reviewText, overall,reviewTime)
head(data)
So the objective of this project is to discover what users are talking about for each rating. This will help the organization better understand its customer feedback, so that it can concentrate on the issues customers are facing.
Tidytext
Text cleaning process
Before we feed the text to the LDA model, we need to clean it. We will build a textcleaner function using several functions from the tm, textclean, and stringr packages. We also need to convert the text to Document-Term Matrix (DTM) format, because the LDA() function (from the topicmodels package loaded above) needs a dtm as input.
# build textcleaner function
textcleaner <- function(x){
x <- as.character(x)
x <- x %>%
str_to_lower() %>% # convert all the string to low alphabet
replace_contraction() %>% # replace contraction to their multi-word forms
replace_internet_slang() %>% # replace internet slang to normal words
replace_emoji() %>% # replace emoji to words
replace_emoticon() %>% # replace emoticon to words
replace_hash(replacement = "") %>% # remove hashtag
replace_word_elongation() %>% # replace informal writing with known semantic replacements
replace_number(remove = T) %>% # remove number
replace_date(replacement = "") %>% # remove date
replace_time(replacement = "") %>% # remove time
str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove mixed string n number
str_squish() %>% # reduces repeated whitespace inside a string.
str_trim() # removes whitespace from start and end of string
xdtm <- VCorpus(VectorSource(x)) %>%
tm_map(removeWords, stopwords("en"))
# convert corpus to document term matrix
return(DocumentTermMatrix(xdtm))
}
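As a quick sanity check (my addition, not part of the original walkthrough), we can run the cleaner on a couple of made-up review strings to confirm it returns a document-term matrix; the example texts below are invented for illustration.
# hypothetical example input, just to verify the cleaner returns a DTM
sample_dtm <- textcleaner(c("I LOVE this guitar!!! Totally worth the $120 :)",
                            "Strings broke after 2 days... won't buy again"))
inspect(sample_dtm)  # a 2-document DocumentTermMatrix with cleaned, lower-cased terms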
Because we want to know the topics for each rating, we should split/subset the data by rating.
data_1 <- data %>% filter(overall == 1)
data_2 <- data %>% filter(overall == 2)
data_3 <- data %>% filter(overall == 3)
data_4 <- data %>% filter(overall == 4)
data_5 <- data %>% filter(overall == 5)
table(data$overall)
##
## 1 2 3 4 5
## 14 21 77 245 735
From the table above we know that most of the feedback has the highest rating. Because the distributions are different, each rating will get a different treatment, especially in choosing the minimum term frequency. I'll make sure we use at least 700–1000 words to be analyzed for each rating.
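One practical way to pick those per-rating cutoffs (a hedged suggestion of mine, not shown in the original article) is to count how many terms survive a few candidate frequency thresholds and keep the threshold that lands in the 700–1000 range. The helper below is hypothetical and assumes a dtm such as dtm_5, which is built in the next step.
# hypothetical helper: number of terms whose total frequency is at least `cutoff`
count_terms <- function(dtm, cutoff) length(findFreqTerms(dtm, cutoff))
# e.g. sapply(c(5, 10, 20, 50), function(f) count_terms(dtm_5, f))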
Topic Modeling rating 5
# apply textcleaner function to the review text
dtm_5 <- textcleaner(data_5$reviewText)
# find the most frequent terms. I choose words that appear at least 50 times
freqterm_5 <- findFreqTerms(dtm_5, 50)
# we have 981 words. subset the dtm to keep only those selected words
dtm_5 <- dtm_5[, freqterm_5]
# keep only documents (rows) that still contain at least one of the selected words
rownum_5 <- apply(dtm_5, 1, sum)
dtm_5 <- dtm_5[rownum_5 > 0, ]
# apply the LDA function. set k = 6, meaning we want to build 6 topics
lda_5 <- LDA(dtm_5, k = 6, control = list(seed = 1502))
# apply auto tidy using tidy() and use beta as the per-topic-per-word probabilities
topic_5 <- tidy(lda_5, matrix = "beta")
# choose the 15 words with the highest beta from each topic
top_terms_5 <- topic_5 %>%
group_by(topic) %>%
top_n(15,beta) %>%
ungroup() %>%
arrange(topic, -beta)
# plot the topics and words for easy interpretation
plot_topic_5 <- top_terms_5 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_5
Figure: Rating 5 topic modeling using tidytext
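The walkthrough above stops at the beta matrix. As an optional follow-up (my addition, not in the original article), we could also pull the per-document-per-topic probabilities, called gamma in tidytext, to see which topic dominates each individual review.
# optional: per-document-per-topic probabilities (gamma)
gamma_5 <- tidy(lda_5, matrix = "gamma")
gamma_5 %>% arrange(document, desc(gamma)) %>% head(12)  # dominant topics per review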
Topic Modeling rating 4
dtm_4 <- textcleaner(data_4$reviewText)
freqterm_4 <- findFreqTerms(dtm_4, 20)
dtm_4 <- dtm_4[, freqterm_4]
rownum_4 <- apply(dtm_4, 1, sum)
dtm_4 <- dtm_4[rownum_4 > 0, ]
lda_4 <- LDA(dtm_4, k = 6, control = list(seed = 1502))
topic_4 <- tidy(lda_4, matrix = "beta")
top_terms_4 <- topic_4 %>%
group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
arrange(topic, -beta)
plot_topic_4 <- top_terms_4 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_4
Figure: Rating 4 topic modeling using tidytext
Topic Modeling rating 3
dtm_3 <- textcleaner(data_3$reviewText)
freqterm_3 <- findFreqTerms(dtm_3, 10)
dtm_3 <- dtm_3[, freqterm_3]
rownum_3 <- apply(dtm_3, 1, sum)
dtm_3 <- dtm_3[rownum_3 > 0, ]
lda_3 <- LDA(dtm_3, k = 6, control = list(seed = 1502))
topic_3 <- tidy(lda_3, matrix = "beta")
top_terms_3 <- topic_3 %>%
group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
arrange(topic, -beta)
plot_topic_3 <- top_terms_3 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_3
Figure: Rating 3 topic modeling using tidytext
Topic Modeling rating 2
dtm_2 <- textcleaner(data_2$reviewText)
freqterm_2 <- findFreqTerms(dtm_2, 5)
dtm_2 <- dtm_2[, freqterm_2]
rownum_2 <- apply(dtm_2, 1, sum)
dtm_2 <- dtm_2[rownum_2 > 0, ]
lda_2 <- LDA(dtm_2, k = 6, control = list(seed = 1502))
topic_2 <- tidy(lda_2, matrix = "beta")
top_terms_2 <- topic_2 %>%
group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
arrange(topic, -beta)
plot_topic_2 <- top_terms_2 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_2
Figure: Rating 2 topic modeling using tidytext
Topic Modeling rating 1
dtm_1 <- textcleaner(data_1$reviewText)
freqterm_1 <- findFreqTerms(dtm_1, 5)
dtm_1 <- dtm_1[, freqterm_1]
rownum_1 <- apply(dtm_1, 1, sum)
dtm_1 <- dtm_1[rownum_1 > 0, ]
lda_1 <- LDA(dtm_1, k = 6, control = list(seed = 1502))
topic_1 <- tidy(lda_1, matrix = "beta")
top_terms_1 <- topic_1 %>%
group_by(topic) %>%
top_n(15, beta) %>%
ungroup() %>%
arrange(topic, -beta)
plot_topic_1 <- top_terms_1 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip() +
scale_x_reordered()
plot_topic_1
Figure: Rating 1 topic modeling using tidytext
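The five blocks above repeat the same steps, with only the input subset and the frequency cutoff changing. As a hedged refactoring suggestion of mine (not part of the original article), they could be collapsed into a single helper function:
# hypothetical helper wrapping the repeated per-rating steps above;
# min_freq is the frequency cutoff used for each rating (50, 20, 10, 5, 5)
fit_rating_lda <- function(reviews, min_freq, k = 6, seed = 1502) {
  dtm <- textcleaner(reviews)
  dtm <- dtm[, findFreqTerms(dtm, min_freq)]
  dtm <- dtm[apply(dtm, 1, sum) > 0, ]
  LDA(dtm, k = k, control = list(seed = seed))
}
# e.g. lda_3 <- fit_rating_lda(data_3$reviewText, min_freq = 10)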
textmineR
Text cleaning process
Just like the previous text cleaning step, we will build a text cleaner function to automate the cleaning process. The difference is that we don't need to convert the text to dtm format ourselves: the textmineR package has its own dtm converter, CreateDtm(). Fitting an LDA model with textmineR needs a dtm made by the CreateDtm() function. There we can also set the n-gram size, remove punctuation and stopwords, and apply other simple text cleaning steps.
textcleaner_2 <- function(x){
x <- as.character(x)
x <- x %>%
str_to_lower() %>% # convert all the string to low alphabet
replace_contraction() %>% # replace contraction to their multi-word forms
replace_internet_slang() %>% # replace internet slang to normal words
replace_emoji() %>% # replace emoji to words
replace_emoticon() %>% # replace emoticon to words
replace_hash(replacement = "") %>% # remove hashtag
replace_word_elongation() %>% # replace informal writing with known semantic replacements
replace_number(remove = T) %>% # remove number
replace_date(replacement = "") %>% # remove date
replace_time(replacement = "") %>% # remove time
str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove mixed string n number
str_squish() %>% # reduces repeated whitespace inside a string.
str_trim() # removes whitespace from start and end of string
return(as.data.frame(x))
}
Topic Modeling rating 5
# apply textcleaner function. note: we only clean the text without converting it to a dtm
clean_5 <- textcleaner_2(data_5$reviewText)
clean_5 <- clean_5 %>% mutate(id = rownames(clean_5))
# create dtm
set.seed(1502)
dtm_r_5 <- CreateDtm(doc_vec = clean_5$x,
doc_names = clean_5$id,
ngram_window = c(1,2),
stopword_vec = stopwords("en"),
verbose = F)
dtm_r_5 <- dtm_r_5[, colSums(dtm_r_5) > 2]
Create the LDA model using `textmineR`. Here we are going to make 20 topics. The reason we build so many topics is that `textmineR` has metrics to calculate the quality of topics; we will then choose the topics with the best quality.
set.seed(1502)
mod_lda_5 <- FitLdaModel(dtm = dtm_r_5,
k = 20, # number of topic
iterations = 500,
burnin = 180,
alpha = 0.1,beta = 0.05,
optimize_alpha = T,
calc_likelihood = T,
calc_coherence = T,
calc_r2 = T)
Once we have created a model, we need to evaluate it. For overall goodness of fit, textmineR has R-squared and log-likelihood. R-squared is interpretable as the proportion of variability in the data explained by the model, as with linear regression.
mod_lda_5$r2
## [1] 0.2183867
The primary goodness-of-fit measure in topic modeling is the likelihood method. Likelihoods, generally the log-likelihood, are naturally obtained from probabilistic topic models. The log_likelihood here is P(tokens|topics) at each iteration.
plot(mod_lda_5$log_likelihood, type = "l")
Figure: log-likelihood at every iteration (rating 5)
Next, get the 15 top terms with the highest phi. Phi represents the distribution of words over topics; words with a high phi have the highest frequency within a topic.
mod_lda_5$top_terms <- GetTopTerms(phi = mod_lda_5$phi, M = 15)
data.frame(mod_lda_5$top_terms)
Figure: top terms in each topic (rating 5)
Let's look at the coherence value for each topic. Topic coherence measures score a single topic by measuring the degree of semantic similarity between the high-scoring words in that topic. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference. For each pair of words {a, b} among a topic's top words, probabilistic coherence calculates P(b|a) - P(b), where {a} is more probable than {b} in the topic. In simple words, coherence tells us how associated the words in a topic are.
mod_lda_5$coherence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7
## 0.12140404 0.08349523 0.05510456 0.11607445 0.16397834 0.05472121 0.09739406
## t_8 t_9 t_10 t_11 t_12 t_13 t_14
## 0.14221823 0.24856426 0.79310008 0.28175270 0.10231907 0.58667185 0.05449207
## t_15 t_16 t_17 t_18 t_19 t_20
## 0.09204392 0.10147505 0.07949897 0.04519463 0.13664781 0.21586105
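To see what that formula does, here is a tiny hand-rolled illustration of P(b|a) - P(b) for a single word pair; this is my own toy example on a made-up binary document-term matrix, not textmineR's internal code.
# toy illustration of probabilistic coherence for one word pair {a, b}
# rows are documents; 1 means the word appears in that document (made-up data)
m <- matrix(c(1, 1, 0, 1,
              1, 0, 0, 1), ncol = 2, dimnames = list(NULL, c("a", "b")))
p_b_given_a <- sum(m[, "a"] == 1 & m[, "b"] == 1) / sum(m[, "a"] == 1)
p_b <- mean(m[, "b"])
p_b_given_a - p_b  # positive: "b" co-occurs with "a" more often than chance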
We also want to look at the prevalence value. Prevalence tells us the most frequent topics in the corpus. Prevalence is the probability of each topic's distribution across the whole set of documents.
mod_lda_5$prevalence <- colSums(mod_lda_5$theta) / sum(mod_lda_5$theta) * 100
mod_lda_5$prevalence
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 5.514614 5.296280 4.868778 7.484032 9.360072 2.748069 4.269445 4.195638
## t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16
## 5.380414 3.541380 5.807442 5.305865 3.243890 4.657203 5.488087 2.738993
## t_17 t_18 t_19 t_20
## 4.821128 4.035630 7.385820 3.857221
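As a quick sanity check (my addition, not in the original), the prevalence values are percentages of the total theta mass, so they should sum to 100 up to rounding.
sum(mod_lda_5$prevalence)  # should be (approximately) 100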
Now we have the top terms of each topic, the goodness of the model from r2 and log_likelihood, and the quality of the topics from coherence and prevalence. Let's compile them into a summary.
mod_lda_5$summary <- data.frame(topic = rownames(mod_lda_5$phi),
coherence = round(mod_lda_5$coherence, 3),
prevalence = round(mod_lda_5$prevalence, 3),
top_terms = apply(mod_lda_5$top_terms, 2, function(x){paste(x, collapse = ", ")}))
modsum_5 <- mod_lda_5$summary %>%
`rownames<-`(NULL)
We know that the quality of the topics can be described with the coherence and prevalence values. Let's build a plot to identify which topics have the best quality.
modsum_5 %>% pivot_longer(cols = c(coherence, prevalence)) %>%
ggplot(aes(x = factor(topic, levels = unique(topic)), y = value, group = 1)) +
geom_point() + geom_line() +
facet_wrap(~name, scales = "free_y", nrow = 2) +
theme_minimal() +
labs(title = "Best topics by coherence and prevalence score",
subtitle = "Text review with 5 rating",
x = "Topics", y = "Value")
Figure: coherence and prevalence score per topic (rating 5)
From the graph above we know that topic 10 has the highest coherence, which means the words in that topic are strongly associated with each other. But in terms of the probability of the topic's distribution across the whole set of documents (prevalence), topic 10 has a low score. That means a review is unlikely to use the combination of words in topic 10, even though the words inside that topic support each other.
We can see whether topics can be grouped together using a dendrogram. The dendrogram uses Hellinger distance (the distance between two probability vectors) to decide if the topics are closely related. For instance, the dendrogram below suggests that there is greater similarity between topics 10 and 13.
mod_lda_5$linguistic <- CalcHellingerDist(mod_lda_5$phi)
mod_lda_5$hclust <- hclust(as.dist(mod_lda_5$linguistic), "ward.D")
# the original post uses mod_lda_5$labels without showing where it was created;
# presumably it comes from LabelTopics(), e.g.:
mod_lda_5$labels <- LabelTopics(assignments = mod_lda_5$theta > 0.05, dtm = dtm_r_5, M = 1)
mod_lda_5$hclust$labels <- paste(mod_lda_5$hclust$labels, mod_lda_5$labels[,1])
plot(mod_lda_5$hclust)
Figure: cluster dendrogram of topics (rating 5)
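As an optional follow-up (my addition, not in the original article), the same clustering object can be cut into a fixed number of groups to list which topics end up together.
# cut the dendrogram into, say, 10 clusters of related topics
mod_lda_5$cluster <- cutree(mod_lda_5$hclust, k = 10)
sort(mod_lda_5$cluster)  # topic -> cluster membership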
Now that we have built and interpreted the topic model for rating 5, let's apply the same steps to every rating and see the differences in what people are talking about.
I won't copy and paste the process for every rating because it's just the same process and I think it would waste space. But if you really want to look at it, please visit my publication on RPubs.
Conclusion
We've now done the whole topic model process, from cleaning the text to interpretation and analysis. Finally, let's see what people are talking about for each rating. For each rating we will choose the 5 topics with the highest quality (coherence). Each topic will show the 15 words with the highest value of phi (the distribution of words over topics).
Rating 5
modsum_5 %>%
arrange(desc(coherence)) %>%
slice(1:5)
Figure: top terms per topic, ordered by highest coherence (rating 5)
Topic 10 and topic 13, which have the highest coherence scores, contain lots of ‘sticking’ and ‘tongue’ words. Maybe that's just a phrase for a specific instrument. They contain similar words, which makes their coherence score rise, but the low prevalence means those words are rarely used in other reviews; that's why I suspect they come from a specific instrument. In topic 11 and others, people are talking about how good the product is; for example, there are words like ‘good’, ‘accurate’, ‘clean’, ‘easy’, ‘recommend’, and ‘great’ that indicate positive sentiment.
Rating 4
modsum_4 %>%
arrange(desc(coherence)) %>%
slice(1:5)
Figure: top terms per topic, ordered by highest coherence (rating 4)
Same as before, the topics with the highest coherence scores are filled with sticking and tongue words. In this rating, people are still praising the product, but not as much as in rating 5. Keep in mind that the dtm is built using bigrams, so two-word terms like solid_state or e_tongue are captured and counted just like single words. With that information, we know that all the words shown here have their own phi value and actually represent the reviews.
Rating 3
modsum_3 %>%
arrange(desc(coherence)) %>%
slice(1:5)
Figure: top terms per topic, ordered by highest coherence (rating 3)
It looks like stick and tongue words are everywhere. Topic 15 has high coherence and prevalence values in rating 3, which means lots of reviews in this rating talk about those words. On the other hand, positive words are barely seen in this rating; most of the topics are filled with guitar- or string-related words.
Rating 2
modsum_2 %>%
arrange(desc(coherence)) %>%
slice(1:5)
Figure: top terms per topic, ordered by highest coherence (rating 2)
Rating 1
modsum_1 %>%
arrange(desc(coherence)) %>%
slice(1:5)
Figure: top terms per topic, ordered by highest coherence (rating 1)
In the worst rating, people complain a lot. Words like ‘junk’, ‘cheap’, ‘just’, and ‘back’ are everywhere. There's a lot of difference compared with rating 5.
Overall, let's keep in mind that this dataset is a combination of products, so it's not surprising if some topics look like nonsense. But for every rating we were able to build topics around different instruments. Most of them talk about a particular instrument with its positive or negative review. In this project we managed to build topic models that separate by instrument, which shows LDA is able to build topics from semantically related words. It would be even better to do topic modeling on a single specific product and discover the problems to fix or the strengths to keep. This surely helps an organization better understand its customer feedback so that it can concentrate on the issues customers are facing, especially when there are lots of reviews to analyze.
Source: https://medium.com/@joenathanchristian/topic-modeling-in-r-with-tidytext-and-textminer-package-latent-dirichlet-allocation-764f4483be73