【StatLearn】KNN Algorithm Experiments in Statistical Learning (2)
Continuing from KNN Algorithm Experiments in Statistical Learning (1).
Problem:
Use a parallel coordinates plot to visualize the data. First, normalize the data so that each feature's values fall in the range [0, 1]; note that the normalization is applied per feature, not over the whole matrix.
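A minimal sketch of this per-feature min-max scaling (the same loop appears in the full script at the end; column 1 of the data is the class label and is left untouched):

data = load('wine.data.txt');   % column 1: class label, columns 2-14: the 13 features
uniData = [];
for i = 2:size(data, 2)
    col = data(:, i);
    uniData = cat(2, uniData, (col - min(col)) / (max(col) - min(col)));   % scale to [0, 1]
end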
By inspecting the plots carefully, we pick out the features whose class distributions overlap the least; feature selection here means keeping only the features on which the classes are easiest to tell apart, as input to the subsequent classification step. We select features (1), (2), (5), (6), (7), and (10). In my view, feature selection is a crude but simple form of dimensionality reduction and denoising, yet it can work remarkably well.
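One indexing detail worth flagging: after normalization the class label is re-attached as column 1 of uniData, so feature number n sits in matrix column n+1. The selection above therefore corresponds to:

%features (1)(2)(5)(6)(7)(10) -> columns [2 3 6 7 8 11]; column 1 keeps the label
FSData = uniData(:, [1 2 3 6 7 8 11]);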
Based on the parallel coordinates plots from the previous step, features (1), (2), (5), (6), (7), and (10) look the most suitable for classification. Running KNN on these selected features gives the following results:

When K = 1, the accuracy reaches 85.38%. Compared with plain KNN or PCA + KNN, the normalization + feature selection pipeline improves the accuracy substantially. Different feature combinations can also be tried experimentally to obtain even better results.
MaxAccuracy = 0.8834 when k = 17 (Normalization + FeatureSelection + KNN)
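These "MaxAccuracy ... when k = ..." lines are simply read off the per-k accuracy vector; since CalcAccuracy sweeps the odd values k = 1, 3, ..., 51, the best k is recovered from the index of the maximum, e.g. (AccuracyNormFS1 is one of the vectors computed in StatLearnProj.m below):

kList = 1:2:51;                        % the k values swept by CalcAccuracy
[maxAcc, idx] = max(AccuracyNormFS1);  % any per-k accuracy vector works here
fprintf('MaxAccuracy = %.4f when k = %d\n', maxAcc, kList(idx));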
In the experiments we used two different feature selection strategies. Keeping fewer features does cost some accuracy: even the features that look mediocre in the parallel coordinates plots still contribute something to the classification. At small values of k, however, feature selection beats using all features, which shows that on relatively clean data a small k already achieves good results; this matches intuition.

We next try further preprocessing of the data, namely denoising. The method removes the extreme minimum and maximum values from the training data: for a well-chosen feature the values should lie within a reasonable range, and values that are too large or too small are treated as outliers. The denoising code is as follows:
function [DNData] = DataDenoising(InputData, KillRange)
% Strip the KillRange smallest and KillRange largest samples of every
% feature from the training set (column 1 holds the class label).
DNData = InputData;
%MedianData = median(DNData);
for i = 2:size(InputData, 2)
    [~, DNIndex] = sort(DNData(:, i));                       % order samples by feature i
    DNData = DNData(DNIndex(1+KillRange:end-KillRange), :);  % drop both tails
end
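As used in the main script, DataDenoising(Trainning, 2) drops the two smallest and two largest training samples along each feature. Note that the pruning is cumulative across the loop: every feature column removes another 2*KillRange rows, so KillRange should stay small relative to the size of the training set.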
We then adopt LLE as the dimensionality reduction method and compare it against the schemes above:
MaxAccuracy = 0.9376 when k = 23 (LLE dimensionality reduction to 2)
For the LLE algorithm, see this paper:
- Sam Roweis & Lawrence Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, vol. 290, no. 5500, Dec. 22, 2000, pp. 2323-2326.
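The script below assumes an lle.m on the MATLAB path, presumably the reference implementation distributed by the paper's authors, whose signature is Y = lle(X, K, dmax) with X given as D-by-N (features in rows, samples in columns). A minimal sketch of the call used here:

data = load('wine.data.txt');
uniData = [];
for i = 2:size(data, 2)
    col = data(:, i);
    uniData = cat(2, uniData, (col - min(col)) / (max(col) - min(col)));   % [0, 1] per feature
end
Y = lle(uniData', 5, 2);       % K = 5 neighbors, embed the 13 features into 2 dimensions
LLEData = [data(:, 1), Y'];    % lle returns dmax-by-N; transpose and re-attach the labels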
Source code:
StatLearnProj.m
clear;
data = load('wine.data.txt');

%calc 5-fold knn
Accuracy = [];
for i = 1:5
    Test = data(i:5:end, :);
    TestData = Test(:, 2:end);
    TestLabel = Test(:, 1);
    Trainning = setdiff(data, Test, 'rows');
    TrainningData = Trainning(:, 2:end);
    TrainningLabel = Trainning(:, 1);
    Accuracy = cat(1, Accuracy, CalcAccuracy(TestData, TestLabel, TrainningData, TrainningLabel));
end
AccuracyKNN = mean(Accuracy, 1);

%calc PCA
Accuracy = [];
[Coeff, Score, Latent] = princomp(data(:, 2:end));
dataPCA = [data(:, 1), Score(:, 1:6)];
Latent   % display the PCA eigenvalues
for i = 1:5
    Test = dataPCA(i:5:end, :);
    TestData = Test(:, 2:end);
    TestLabel = Test(:, 1);
    Trainning = setdiff(dataPCA, Test, 'rows');
    TrainningData = Trainning(:, 2:end);
    TrainningLabel = Trainning(:, 1);
    Accuracy = cat(1, Accuracy, CalcAccuracy(TestData, TestLabel, TrainningData, TrainningLabel));
end
AccuracyPCA = mean(Accuracy, 1);

BarData = [AccuracyKNN; AccuracyPCA];
bar(1:2:51, BarData');
[D, I] = sort(AccuracyKNN, 'descend');
D(1)
I(1)
[D, I] = sort(AccuracyPCA, 'descend');
D(1)
I(1)

%pre-processing data: Normalization
labs1 = {'1)Alcohol', '2)Malic acid', '3)Ash', '4)Alcalinity of ash'};
labs2 = {'5)Magnesium', '6)Total phenols', '7)Flavanoids', '8)Nonflavanoid phenols'};
labs3 = {'9)Proanthocyanins', '10)Color intensity', '11)Hue', '12)OD280/OD315', '13)Proline'};
uniData = [];
for i = 2:size(data, 2)
    uniData = cat(2, uniData, (data(:, i) - min(data(:, i))) / (max(data(:, i)) - min(data(:, i))));
end
figure();
parallelcoords(uniData(:, 1:4), 'group', data(:, 1), 'labels', labs1);
figure();
parallelcoords(uniData(:, 5:8), 'group', data(:, 1), 'labels', labs2);
figure();
parallelcoords(uniData(:, 9:13), 'group', data(:, 1), 'labels', labs3);

%Normalization, all features
uniData = [data(:, 1), uniData];
Accuracy = [];
for i = 1:5
    Test = uniData(i:5:end, :);
    TestData = Test(:, 2:end);
    TestLabel = Test(:, 1);
    Trainning = setdiff(uniData, Test, 'rows');
    TrainningData = Trainning(:, 2:end);
    TrainningLabel = Trainning(:, 1);
    Accuracy = cat(1, Accuracy, CalcAccuracy(TestData, TestLabel, TrainningData, TrainningLabel));
end
AccuracyNorm = mean(Accuracy, 1);

%KNN vs PCA vs Normalization
BarData = [AccuracyKNN; AccuracyPCA; AccuracyNorm];
bar(1:2:51, BarData');

%Normalization & FS: select features 1 2 5 6 7 10
FSData = uniData(:, [1 2 3 6 7 8 11]);
size(FSData)
Accuracy = [];
for i = 1:5
    Test = FSData(i:5:end, :);
    Trainning = setdiff(FSData, Test, 'rows');
    TestData = Test(:, 2:end);
    TestLabel = Test(:, 1);
    TrainningData = Trainning(:, 2:end);
    TrainningLabel = Trainning(:, 1);
    Accuracy = cat(1, Accuracy, CalcAccuracy(TestData, TestLabel, TrainningData, TrainningLabel));
end
AccuracyNormFS1 = mean(Accuracy, 1);

%Normalization & FS: select features 1 6 7
FSData = uniData(:, [1 2 7 8]);
Accuracy = [];
for i = 1:5
    Test = FSData(i:5:end, :);
    Trainning = setdiff(FSData, Test, 'rows');
    TestData = Test(:, 2:end);
    TestLabel = Test(:, 1);
    TrainningData = Trainning(:, 2:end);
    TrainningLabel = Trainning(:, 1);
    Accuracy = cat(1, Accuracy, CalcAccuracy(TestData, TestLabel, TrainningData, TrainningLabel));
end
AccuracyNormFS2 = mean(Accuracy, 1);

figure();
BarData = [AccuracyNorm; AccuracyNormFS1; AccuracyNormFS2];
bar(1:2:51, BarData');
[D, I] = sort(AccuracyNorm, 'descend');
D(1)
I(1)
[D, I] = sort(AccuracyNormFS1, 'descend');
D(1)
I(1)
[D, I] = sort(AccuracyNormFS2, 'descend');
D(1)
I(1)

%denoising + Normalization & FS 1 6 7
FSData = uniData(:, [1 2 7 8]);
Accuracy = [];
for i = 1:5
    Test = FSData(i:5:end, :);
    Trainning = setdiff(FSData, Test, 'rows');
    Trainning = DataDenoising(Trainning, 2);
    TestData = Test(:, 2:end);
    TestLabel = Test(:, 1);
    TrainningData = Trainning(:, 2:end);
    TrainningLabel = Trainning(:, 1);
    Accuracy = cat(1, Accuracy, CalcAccuracy(TestData, TestLabel, TrainningData, TrainningLabel));
end
AccuracyNormFSDN = mean(Accuracy, 1);
figure();
hold on
plot(1:2:51, AccuracyNormFSDN);
plot(1:2:51, AccuracyNormFS2, 'r');

%other distance metrics
Dist = 'cityblock';
Accuracy = [];
for i = 1:5
    Test = uniData(i:5:end, :);
    TestData = Test(:, 2:end);
    TestLabel = Test(:, 1);
    Trainning = setdiff(uniData, Test, 'rows');
    TrainningData = Trainning(:, 2:end);
    TrainningLabel = Trainning(:, 1);
    Accuracy = cat(1, Accuracy, CalcAccuracyPlus(TestData, TestLabel, TrainningData, TrainningLabel, Dist));
end
AccuracyNormCity = mean(Accuracy, 1);
BarData = [AccuracyNorm; AccuracyNormCity];
figure();
bar(1:2:51, BarData');
[D, I] = sort(AccuracyNormCity, 'descend');
D(1)
I(1)

%denoising + cityblock distance
FSData = uniData(:, [1 2 7 8]);
Dist = 'cityblock';
Accuracy = [];
for i = 1:5
    Test = FSData(i:5:end, :);
    TestData = Test(:, 2:end);
    TestLabel = Test(:, 1);
    Trainning = setdiff(FSData, Test, 'rows');
    Trainning = DataDenoising(Trainning, 3);
    TrainningData = Trainning(:, 2:end);
    TrainningLabel = Trainning(:, 1);
    Accuracy = cat(1, Accuracy, CalcAccuracyPlus(TestData, TestLabel, TrainningData, TrainningLabel, Dist));
end
AccuracyNormCityDN = mean(Accuracy, 1);
figure();
hold on
plot(1:2:51, AccuracyNormCityDN);
plot(1:2:51, AccuracyNormCity, 'r');

%call lle
data = load('wine.data.txt');
uniData = [];
for i = 2:size(data, 2)
    uniData = cat(2, uniData, (data(:, i) - min(data(:, i))) / (max(data(:, i)) - min(data(:, i))));
end
uniData = [data(:, 1), uniData];
LLEData = lle(uniData(:, 2:end)', 5, 2);   % 5 neighbors, embed to 2 dimensions
%size(LLEData)
LLEData = LLEData';
LLEData = [data(:, 1), LLEData];
Accuracy = [];
for i = 1:5
    Test = LLEData(i:5:end, :);
    TestData = Test(:, 2:end);
    TestLabel = Test(:, 1);
    Trainning = setdiff(LLEData, Test, 'rows');
    Trainning = DataDenoising(Trainning, 2);
    TrainningData = Trainning(:, 2:end);
    TrainningLabel = Trainning(:, 1);
    Accuracy = cat(1, Accuracy, CalcAccuracyPlus(TestData, TestLabel, TrainningData, TrainningLabel, 'cityblock'));
end
AccuracyLLE = mean(Accuracy, 1);
[D, I] = sort(AccuracyLLE, 'descend');
D(1)
I(1)

BarData = [AccuracyNorm; AccuracyNormFS2; AccuracyNormFSDN; AccuracyLLE];
figure();
bar(1:2:51, BarData');
save('ProcessingData.mat');

CalcAccuracy.m
function Accuracy = CalcAccuracy(TestData, TestLabel, TrainningData, TrainningLabel)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Calculate classification accuracy for k = 1, 3, ..., 51.
% TestData:       M*D matrix, D is the dimension, M the number of test samples
% TrainningData:  T*D matrix
% TestLabel:      labels of TestData
% TrainningLabel: labels of TrainningData
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
CompareResult = [];
for k = 1:2:51
    ClassResult = knnclassify(TestData, TrainningData, TrainningLabel, k);
    CompareResult = cat(2, CompareResult, (ClassResult == TestLabel));
end
SumCompareResult = sum(CompareResult, 1);
Accuracy = SumCompareResult / length(CompareResult(:, 1));

CalcAccuracyPlus.m
function Accuracy = CalcAccuracyPlus(TestData, TestLabel, TrainningData, TrainningLabel, Dist)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Same as CalcAccuracy, but with a configurable distance metric.
% TestData:       M*D matrix, D is the dimension, M the number of test samples
% TrainningData:  T*D matrix
% TestLabel:      labels of TestData
% TrainningLabel: labels of TrainningData
% Dist:           distance metric passed to knnclassify, e.g. 'cityblock'
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
CompareResult = [];
for k = 1:2:51
    ClassResult = knnclassify(TestData, TrainningData, TrainningLabel, k, Dist);
    CompareResult = cat(2, CompareResult, (ClassResult == TestLabel));
end
SumCompareResult = sum(CompareResult, 1);
Accuracy = SumCompareResult / length(CompareResult(:, 1));
Reposted from: https://www.cnblogs.com/pangblog/p/3402651.html