A Simple Implementation of the KNN Algorithm
1. Algorithm principle: We start from a training sample set in which every sample carries a label, i.e., we know which class each sample in the set belongs to. When new, unlabeled data arrives, we compare each of its features with the corresponding features of the samples in the set and retrieve the class labels of the most similar samples. Typically we take the k most similar samples from the set and assign the new data to the class that appears most often among them. Put simply, the k-nearest-neighbors algorithm classifies by measuring the distances between feature values.
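Concretely, the implementation below measures similarity with the ordinary Euclidean distance between two feature vectors x and y: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2).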
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: every sample to be classified must have its feature distances computed against every sample in the training set, so the time and space cost of classification is high.
2. Implementation (handwritten digit recognition)
1. Data preparation: the data set consists of 32*32-pixel black-and-white images of the digits 0-9, with roughly 200 samples per digit (the files under trainingDigits train the classifier; those under testDigits test it). To keep things easy to follow, each image has been converted to a text file.
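To give a feel for the format (an illustrative sketch, not an actual file from the dataset): a file named like 3_107.txt carries the label 3 in the part before the underscore, which the loading code below relies on, and its contents are 32 lines of 32 characters, with 1 marking an inked pixel:

```
00000000000111100000000000000000
00000000011111111000000000000000
00000000111100111100000000000000
...(29 more such lines)
```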
2. Code:
Converting an image to a vector: we flatten each 32*32 binary image matrix into a 1*1024 vector with a function vector2d, as in the code below.
```python
from numpy import *
import os

def vector2d(filename):
    '''Read one 32x32 text image and flatten it into a 1x1024 vector.'''
    rows = 32
    cols = 32
    imgVector = zeros((1, rows * cols))
    fileIn = open(filename)
    for row in xrange(rows):
        lineStr = fileIn.readline()
        for col in xrange(cols):
            imgVector[0, row * cols + col] = int(lineStr[col])
    fileIn.close()
    return imgVector
```
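A quick check of vector2d (the file name here is hypothetical, assuming the directory layout used below):

```python
vec = vector2d('trainingDigits/0_13.txt')  # hypothetical file name
print vec.shape       # (1, 1024)
print int(vec.sum())  # number of inked (1) pixels in the image
```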
Loading the training set and the test set:

```python
def loadDataSet():
    '''Load the training and test samples from the text-image directories.'''
    print '....Getting training data'
    dataSetDir = 'D:/pythonCode/MLCode/KNN/'
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits')
    numSamples = len(trainingFileList)

    train_x = zeros((numSamples, 1024))
    train_y = []
    for i in xrange(numSamples):
        filename = trainingFileList[i]
        train_x[i, :] = vector2d(dataSetDir + 'trainingDigits/%s' % filename)
        label = int(filename.split('_')[0])  # the digit before '_' is the label
        train_y.append(label)

    print '....Getting testing data...'
    testFileList = os.listdir(dataSetDir + 'testDigits')
    numSamples = len(testFileList)
    test_x = zeros((numSamples, 1024))
    test_y = []
    for i in xrange(numSamples):
        filename = testFileList[i]
        test_x[i, :] = vector2d(dataSetDir + 'testDigits/%s' % filename)
        label = int(filename.split('_')[0])
        test_y.append(label)

    return train_x, train_y, test_x, test_y
```
Constructing the classifier:

```python
def kNNClassify(newInput, dataSet, labels, k):
    '''Return the majority label among the k training samples nearest to newInput.'''
    numSamples = dataSet.shape[0]

    # Euclidean distance from the new sample to every training sample
    diff = tile(newInput, (numSamples, 1)) - dataSet
    squaredDiff = diff ** 2
    squaredDist = sum(squaredDiff, axis=1)
    distance = squaredDist ** 0.5

    sortedDistIndex = argsort(distance)

    # count the labels among the k nearest neighbours
    classCount = {}
    for i in xrange(k):
        votedLabel = labels[sortedDistIndex[i]]
        classCount[votedLabel] = classCount.get(votedLabel, 0) + 1

    # return the label with the most votes
    maxValue = 0
    maxIndex = None
    for key, value in classCount.items():
        if maxValue < value:
            maxValue = value
            maxIndex = key
    return maxIndex
```
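A quick sanity check of the classifier on a toy data set (my own illustration, not part of the original post):

```python
# four 2-D points in two classes; a query near the 'B' cluster should get label 'B'
group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print kNNClassify(array([0.1, 0.2]), group, labels, 3)  # prints: B
```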
Testing the classifier on the handwritten digits:

```python
def testHandWritingClass():
    print 'load data....'
    train_x, train_y, test_x, test_y = loadDataSet()
    print 'training....'  # kNN has no real training step; the stored samples are the model

    print 'testing'
    numTestSamples = test_x.shape[0]
    matchCount = 0.0
    for i in xrange(numTestSamples):
        predict = kNNClassify(test_x[i], train_x, train_y, 3)
        if predict != test_y[i]:
            print 'the predict is', predict, 'the target value is', test_y[i]
        else:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples

    print 'The accuracy is :%.2f%%' % (accuracy * 100)
```

Test results:
```
testHandWritingClass()
load data....
....Getting training data
....Getting testing data...
training....
testing
the predict is 7 the target value is 1
the predict is 9 the target value is 3
the predict is 9 the target value is 3
the predict is 3 the target value is 5
the predict is 6 the target value is 5
the predict is 6 the target value is 8
the predict is 3 the target value is 8
the predict is 1 the target value is 8
the predict is 1 the target value is 8
the predict is 1 the target value is 9
the predict is 7 the target value is 9
The accuracy is :98.84%
```

Note: the code above was run under Python 2.7.11.
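(Porting note, my addition: under Python 3 the print statements become print(...) calls and xrange becomes range; as far as I can tell, nothing else in the snippets is version-specific.)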
The results above show that kNN performs quite well here. As I see it, kNN is simple and brute-force: it compares the features of the unclassified data against every labeled sample and adopts the label of the most similar ones. But this raises a problem: if the new data's features are rare in the sample set, the chance of misclassification becomes very high; conversely, if one class is heavily over-represented in the sample set, new data is more likely to be assigned to that class purely because of the imbalance. To make the classification fairer, we need to weight the votes, as in the sketch below.
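A minimal sketch of distance-weighted voting (my own variant, not from the original post): each of the k nearest neighbours votes with weight 1/distance instead of 1, so closer samples count for more and a class cannot win merely by being over-represented among the neighbours.

```python
def weightedKNNClassify(newInput, dataSet, labels, k):
    '''Like kNNClassify, but each neighbour's vote is weighted by 1/distance.'''
    numSamples = dataSet.shape[0]
    diff = tile(newInput, (numSamples, 1)) - dataSet
    distance = sum(diff ** 2, axis=1) ** 0.5
    sortedDistIndex = argsort(distance)

    classCount = {}
    for i in xrange(k):
        votedLabel = labels[sortedDistIndex[i]]
        # inverse-distance weight; the small constant guards against division by zero
        weight = 1.0 / (distance[sortedDistIndex[i]] + 1e-6)
        classCount[votedLabel] = classCount.get(votedLabel, 0) + weight

    maxValue = 0
    maxIndex = None
    for key, value in classCount.items():
        if maxValue < value:
            maxValue = value
            maxIndex = key
    return maxIndex
```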
A further note on choosing k: the smaller k is, the more sensitive the result is to the individual training samples and the more easily it is swayed by abnormal data, i.e., the more complex the model. In the extreme case k = 1, each query is decided by its single nearest neighbour, so one mislabeled sample can flip the result.
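A quick way to see this effect (my addition; it assumes loadDataSet and kNNClassify from above are already defined) is to rerun the test for several values of k and compare accuracies:

```python
train_x, train_y, test_x, test_y = loadDataSet()
for k in [1, 3, 5, 10, 20]:
    matchCount = 0.0
    for i in xrange(test_x.shape[0]):
        if kNNClassify(test_x[i], train_x, train_y, k) == test_y[i]:
            matchCount += 1
    print 'k = %2d  accuracy = %.2f%%' % (k, matchCount / test_x.shape[0] * 100)
```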
Data source: http://download.csdn.net/download/qq_17046229/7625323
轉載于:https://www.cnblogs.com/lpworkstudyspace1992/p/5470621.html
總結
以上是生活随笔為你收集整理的KNN算法的简单实现的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 动手写一个快速集成网易新闻,腾讯视频,头
- 下一篇: 20145315 《Java程序设计》实