Naive Bayes for Book Classification
Table of Contents
- Naive Bayes for Book Classification
- 1. Dataset
- 2. Method
- 3. Code
- Experiment Results
Naive Bayes for Book Classification
Naive Bayes is a generative method: it models the joint distribution P(X,Y) of the features X and the label Y directly, then obtains the posterior via P(Y|X) = P(X,Y) / P(X).
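As a toy illustration of this generative view (all counts and names below are invented purely for the example), the joint distribution can be estimated from co-occurrence counts and the posterior recovered by normalizing:

```python
# Toy example: estimate P(X, Y) from counts, then get P(Y|X) = P(X,Y) / P(X).
# The counts and book names below are made up purely for illustration.
joint_counts = {
    # (word_present, book): number of chapters containing the word
    ("karma", "book_A"): 30,
    ("karma", "book_B"): 5,
}
total = sum(joint_counts.values())  # 35

# Joint probabilities P(X=karma, Y=book)
p_joint = {k: v / total for k, v in joint_counts.items()}

# Marginal P(X=karma) = sum over Y of P(X, Y)  (≈ 1.0 here,
# since every entry in this toy table contains "karma")
p_x = sum(p_joint.values())

# Posterior P(Y=book_A | X=karma)
p_a_given_x = p_joint[("karma", "book_A")] / p_x
print(p_a_given_x)  # 30/35 ≈ 0.857
```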
1. Dataset
Dataset link: https://wss1.cn/f/73l1yh2yjny
Data format:
X.Y
X is the book id
Y is the chapter id within that book
Goal: determine which book a passage of text comes from.
Split the data into training and test sets yourself.
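For reference, a label in the `X.Y` format above can be split into its book and chapter ids like this (the sample string is hypothetical):

```python
# Each label line has the form "X.Y": X is the book id, Y the chapter id.
# The sample string below is invented; real lines come from the dataset above.
line = "3.12"
book_id, chapter_id = (int(part) for part in line.strip().split("."))
print(book_id, chapter_id)  # 3 12
```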
2. Method
1. Data preprocessing
(1) Use 80% of the raw data as the training set and 20% as the test set.
(2) Remove all digits, punctuation, and redundant whitespace from the training and test data, and build a dictionary that maps each passage's book id (the key) to the passage text (the value).
2. Naive Bayes implementation
(1) Build the vocabulary
(2) Compute the prior probabilities
(3) Compute the conditional probabilities
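A minimal sketch of these three steps on a two-document toy corpus (the documents and words are invented; the smoothing constants of 1 and 2 mirror the Laplace-style values used in the full code below):

```python
import numpy as np

# Step (1): vocabulary from a tiny, made-up corpus of two documents
docs = [["dharma", "sutra"], ["sutra", "zen", "zen"]]
labels = np.array([0, 1])
vocab = sorted(set(w for d in docs for w in d))  # ['dharma', 'sutra', 'zen']

# Binary bag-of-words vectors over the vocabulary
mat = np.array([[1 if w in d else 0 for w in vocab] for d in docs])

# Step (2): prior P(Y=c) = fraction of documents with label c
priors = [np.mean(labels == c) for c in (0, 1)]  # [0.5, 0.5]

# Step (3): smoothed conditional log P(w|Y=c);
# word counts start at 1 and the denominator at 2, so no probability is zero
cond_log = {}
for c in (0, 1):
    num = np.ones(len(vocab)) + mat[labels == c].sum(axis=0)
    denom = 2.0 + mat[labels == c].sum()
    cond_log[c] = np.log(num / denom)
print(cond_log[0])
```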
3. Code
naive_bayes_text_classifier.py
```python
import numpy as np
import re

def loadDataSet(filepath):
    """Read labelled lines, strip digits/punctuation, return token lists and labels."""
    f = open(filepath, "r").readlines()
    raw_data = []
    print("Stripping digits and punctuation...")
    for i in range(0, len(f), 2):
        temp = dict()
        # the line before each passage starts with "X.Y"; X is the book id
        temp["class"] = int(f[i].strip().split(".")[0])
        # remove digits and punctuation
        mid = re.sub("[0-9,!?.:\"();&\t]", " ", f[i + 1].strip())
        # collapse runs of spaces
        temp["abstract"] = re.sub(" +", " ", mid).strip()
        if temp["abstract"] != "":
            raw_data.append(temp)
    postingList = [entry["abstract"].split() for entry in raw_data]
    classVec = [entry["class"] for entry in raw_data]
    return postingList, classVec

# Sets keep elements unique, so a union over all documents yields the vocabulary.
def createVocabList(dataSet):
    vocabSet = set()
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union of the two sets
    return list(vocabSet)

def setOfWords2Vec(vocabList, inputSet):
    """Convert a document into a binary bag-of-words vector over vocabList."""
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        # words missing from the vocabulary are simply skipped
    return returnVec

# Training: compute the priors and the (log) conditional probabilities.
def trainNB0(trainMatrix, trainCategory, numClasses=7):
    numTrainDocs = len(trainMatrix)   # total number of samples
    numWords = len(trainMatrix[0])    # vector length == vocabulary size
    # prior P(Y=c): fraction of documents in class c
    priors = [np.sum(trainCategory == c) / float(numTrainDocs)
              for c in range(numClasses)]
    # Laplace-style smoothing: word counts start at 1 and denominators at 2,
    # so no conditional probability is ever exactly zero.
    pNum = np.ones((numClasses, numWords))
    pDenom = np.full(numClasses, 2.0)
    # accumulate word counts per class
    for i in range(numTrainDocs):
        c = trainCategory[i]
        pNum[c] += trainMatrix[i]
        pDenom[c] += np.sum(trainMatrix[i])
    # take logs to avoid floating-point underflow when summing many terms
    pVect = [np.log(pNum[c] / pDenom[c]) for c in range(numClasses)]
    return pVect, priors

def classifyNB(vec2Classify, pVect, priors):
    # log posterior (up to a constant): sum of the log conditional
    # probabilities of the present words, plus the log prior
    res = [np.sum(vec2Classify * pVect[c]) + np.log(priors[c])
           for c in range(len(priors))]
    return res.index(max(res))

if __name__ == '__main__':
    # build training samples and labels
    print("Loading training data...")
    listOPosts, listClasses = loadDataSet("bys_data_train.txt")
    print("Training set size:", len(listOPosts))
    print("Building vocabulary...")
    myVocabList = createVocabList(listOPosts)
    # vectorize every training sample
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    # compute conditional and prior probabilities
    print("Training...")
    pVect, priors = trainNB0(np.array(trainMat), np.array(listClasses))
    # evaluate on the held-out test set
    print("Loading test data...")
    listOPosts, listClasses = loadDataSet("bys_data_test.txt")
    print("Test set size:", len(listOPosts))
    f = open("output.txt", "w")
    total = 0
    correct = 0
    for testEntry, answer in zip(listOPosts, listClasses):
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
        result = classifyNB(thisDoc, pVect, priors)
        print(" ".join(testEntry))
        print("predicted:", result, " actual:", answer)
        f.write(" ".join(testEntry) + "\n")
        f.write("predicted: " + str(result) + " actual: " + str(answer) + "\n")
        total += 1
        if result == answer:
            correct += 1
    print("total acc:", correct / total)
    f.write("total acc: " + str(correct / total))
    f.close()
```

preprocess.py
```python
from sklearn.model_selection import train_test_split

# NOTE: the filename below (including the stray space) matches the original dataset file
f = open("AsianReligionsData .txt", "r", encoding='gb18030', errors="ignore").readlines()
raw_data = []
# each sample is a label line followed by a passage line
for i in range(0, len(f), 2):
    raw_data.append(f[i] + f[i + 1])
train, test = train_test_split(raw_data, train_size=0.8, test_size=0.2,
                               random_state=42)
print(len(train))
print(len(test))
f = open("bys_data_train.txt", "w")
for i in train:
    f.write(i)
f.close()
f = open("bys_data_test.txt", "w")
for i in test:
    f.write(i)
f.close()
```

Experiment Results
(1) Comparison of predicted and true labels
(2) Accuracy computation
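Accuracy here is simply the fraction of test passages whose predicted book id matches the true one; with invented prediction and answer lists:

```python
predictions = [0, 3, 3, 6, 2]   # hypothetical classifier outputs
answers     = [0, 3, 1, 6, 2]   # hypothetical true labels
correct = sum(p == a for p, a in zip(predictions, answers))
acc = correct / len(answers)
print(acc)  # 0.8
```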