Python: Getting a Document's List of Unique Words and Building a Word Vector

Published: 2024/7/23 · python · 豆豆


Getting the list of unique words across a set of documents:

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    return postingList


def createVocabList(dataSet):
    vocabSet = set()  # create an empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # take the union with this document's words
    return list(vocabSet)


word = loadDataSet()
word_set = createVocabList(word)
print(word_set)

Output (no word appears more than once; because a set is unordered, the order may differ from run to run):

['stop', 'not', 'stupid', 'how', 'food', 'him', 'posting', 'worthless', 'I', 'has', 'please', 'dalmation', 'licks', 'problems', 'help', 'garbage', 'buying', 'maybe', 'my', 'to', 'quit', 'flea', 'so', 'mr', 'dog', 'park', 'is', 'love', 'steak', 'ate', 'take', 'cute']
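
Since a set has no defined order, the vocabulary can come out in a different order on each run. If a reproducible order matters, for example to compare word vectors across runs, one option is to sort the vocabulary. The following is only a minimal sketch: `create_sorted_vocab_list` is a hypothetical helper name that does not appear in the original code, and the three short documents are made up for illustration.

```python
def create_sorted_vocab_list(data_set):
    """Return the unique words of all documents, in sorted order.

    Hypothetical variant of createVocabList; sorting is an added
    assumption to make the result reproducible.
    """
    # set().union(*iterables) merges every document into a single set
    vocab_set = set().union(*data_set)
    return sorted(vocab_set)


# Illustrative documents (not the original data set)
docs = [['my', 'dog', 'has', 'flea'],
        ['my', 'dog', 'park'],
        ['stop', 'dog']]
vocab = create_sorted_vocab_list(docs)
print(vocab)  # ['dog', 'flea', 'has', 'my', 'park', 'stop']
```

The trade-off is an extra sort; for a vocabulary that is built once and reused, that cost is usually negligible.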

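Next comes the function that builds a word vector from an input document and the vocabulary:

vocabList is the vocabulary and inputSet is the input document. The output is the document vector, which has the same length as the vocabulary: each element is 1 or 0, indicating whether the vocabulary word at that position does or does not appear in the input document.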

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # create an all-zero vector as long as the vocabulary
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec
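
To see the function in action, the sketch below re-declares `setOfWords2Vec` so that it runs on its own, then applies it to a small hand-written vocabulary. The vocabulary and sample document are illustrative, not taken from the original data. It also includes a hypothetical variant, `set_of_words_to_vec_fast`, which is not part of the original article: it precomputes a word-to-index dictionary so that each lookup avoids a linear scan of the list. It assumes the vocabulary contains no duplicate words, which holds when it comes from `createVocabList`.

```python
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)  # all-zero vector, one slot per vocabulary word
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec


def set_of_words_to_vec_fast(vocab_list, input_set):
    """Hypothetical variant: same result, using a dict for index lookups."""
    # Map each vocabulary word to its position (assumes unique words)
    index_of = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in input_set:
        i = index_of.get(word)
        if i is None:
            print("the word: %s is not in my Vocabulary!" % word)
        else:
            vec[i] = 1
    return vec


vocab = ['dog', 'my', 'park', 'stupid']   # illustrative vocabulary
doc = ['my', 'dog', 'is', 'stupid']       # illustrative input document

vec = setOfWords2Vec(vocab, doc)          # reports that 'is' is not in the vocabulary
print(vec)                                # [1, 1, 0, 1]
```

Here 'park' is in the vocabulary but not in the document, so its slot stays 0, while 'is' is in the document but not in the vocabulary, so it only triggers the message. For a vocabulary of a few dozen words the two versions behave the same in practice; the dictionary lookup only pays off once the vocabulary grows large.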
