Learning R: Text Mining with R (the tm Package)
First, install and load the tm package.
1. Read the text

> x = readLines("222.txt")

2. Build the corpus

> r = Corpus(VectorSource(x))
> r
A corpus with 7012 text documents

3. Write the corpus to disk

> writeCorpus(r)
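
The first three steps can be tried end to end without the original data file. The following is a minimal sketch, assuming tm is installed; the three sample lines, the temporary paths, and the use of `VCorpus` in place of `Corpus` are choices made here for illustration and are not taken from the original post.

```r
library(tm)

# Hypothetical stand-in for "222.txt": three short lines in a temporary file.
input_file <- tempfile(fileext = ".txt")
writeLines(c("Female; Genital Neoplasms, Female/*therapy; Humans",
             "Breast cancer chemotherapy trial",
             "Stem cells and growth factor"),
           input_file)

# Step 1: read the file, one character element per line.
x <- readLines(input_file)

# Step 2: build a corpus with one document per line.
r <- VCorpus(VectorSource(x))

# Step 3: write the corpus to disk, one text file per document.
out_dir <- tempfile("corpus_out")
dir.create(out_dir)
writeCorpus(r, path = out_dir)

n_docs  <- length(r)
written <- list.files(out_dir)
```

Without a `path` argument, `writeCorpus(r)` writes into the current working directory, so for a corpus of several thousand documents it is worth pointing it at a dedicated folder.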
4. Inspect the corpus

> print(r)
A corpus with 7012 text documents
> summary(r)
A corpus with 7012 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID
> inspect(r[2:2])
A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

[[1]]
Female; Genital Neoplasms, Female/*therapy; Humans

> r[[2]]
Female; Genital Neoplasms, Female/*therapy; Humans
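
The output above comes from an older tm release. In more recent versions, printing `r[[2]]` shows a short metadata summary rather than the text itself, so the document text has to be extracted explicitly. A small sketch, assuming a current tm installation and a made-up two-document corpus:

```r
library(tm)

# Hypothetical two-document corpus; the second line is the sample from this post.
r <- VCorpus(VectorSource(c(
  "Breast cancer chemotherapy trial",
  "Female; Genital Neoplasms, Female/*therapy; Humans")))

# Pull the text of the second document out as a plain character vector.
doc_text <- as.character(r[[2]])
print(doc_text)

# inspect() on a sub-corpus prints summary information for the selected documents.
inspect(r[2])
```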
5. Build the document-term matrix

> dtm = DocumentTermMatrix(r)
> head(dtm)
A document-term matrix (6 documents, 16381 terms)

Non-/sparse entries: 110/98176
Sparsity           : 100%
Maximal term length: 81
Weighting          : term frequency (tf)

6. Inspect the document-term matrix

> inspect(dtm[1:2,1:4])

7. Find terms that occur at least 200 times

> findFreqTerms(dtm,200)
 [1] "acute"          "adjuvant"       "advanced"       "after"
 [5] "and"            "breast"         "cancer"         "cancer:"
 [9] "carcinoma"      "cell"           "chemotherapy"   "clinical"
[13] "colorectal"     "factor"         "for"            "from"
[17] "group"          "growth"         "iii"            "leukemia"
[21] "lung"           "lymphoma"       "metastatic"     "non-small-cell"
[25] "oncology"       "patients"       "phase"          "plus"
[29] "prostate"       "randomized"     "receptor"       "response"
[33] "results"        "risk"           "study"          "survival"
[37] "the"            "therapy"        "treatment"      "trial"
[41] "tumor"          "with"

8. Remove sparse terms (with a threshold of 0.4, terms missing from 40% or more of the documents are dropped)

> inspect(removeSparseTerms(dtm, 0.4))

9. Find terms whose correlation with "stem" is at least 0.5

> findAssocs(dtm, "stem", 0.5)
 stem cells
 1.00  0.61

10. Compute document dissimilarity (cosine distance)

> dist_dtm <- dissimilarity(dtm, method = 'cosine')
> head(dist_dtm)
[1] 1.0000000 0.7958759 0.8567770 0.9183503 0.9139337 0.9309934

11. Cluster the documents

> hc <- hclust(dist_dtm, method = 'ave')
> plot(hc,xlab='')
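
To make clear what the frequency, association, cosine-distance and clustering steps above compute, here is a base-R sketch on a tiny hand-made count matrix. The matrix, its term names and the thresholds are invented for illustration, and no tm functions are used. Newer tm releases appear to have dropped `dissimilarity()`; `proxy::dist(as.matrix(dtm), method = "cosine")` is one replacement that is often suggested, and the manual calculation below is another way to get the same kind of distance.

```r
# Hand-made document-term count matrix (hypothetical data, not from the post).
m <- matrix(c(2, 1, 0, 0,
              1, 1, 0, 0,
              0, 0, 3, 1,
              0, 1, 1, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("stem", "cells", "cancer", "trial")))

# Frequent terms: total count across all documents at least 4.
freq_terms <- colnames(m)[colSums(m) >= 4]

# Association: Pearson correlation of each term's counts with those of "stem".
assoc <- cor(m)[, "stem"]

# Cosine dissimilarity between documents: 1 - (a . b) / (|a| * |b|).
norms   <- sqrt(rowSums(m^2))
cos_dis <- 1 - (m %*% t(m)) / (norms %o% norms)
d       <- as.dist(cos_dis)

# Average-linkage hierarchical clustering, then cut the tree into two groups.
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)

# plot(hc, xlab = "") would draw the dendrogram, as in the last step above.
```

In this toy example doc1 and doc2 share the terms "stem" and "cells", so their cosine dissimilarity is small and they end up in the same cluster, while doc3 and doc4 form the other one.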
Reposted from: https://www.cnblogs.com/todoit/archive/2012/07/13/2589741.html