ML_R kNN
Nearest Neighbor Algorithm
The k-Nearest Neighbor (kNN) classification algorithm is one of the simplest methods in data-mining classification. "K nearest neighbors" means exactly that: each sample is classified by the k training samples closest to it (a short toy sketch of this neighbour vote appears after the list below).
- Advantages: simple and effective; makes no prior assumptions about the distribution of the data;
- Disadvantages: it does not build a model, which limits its ability to reveal relationships between features;
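To make the neighbour-vote idea concrete, here is a minimal toy sketch in R (not from the original post; the data and variable names are made up): a single new point is labelled by a majority vote among its k nearest training points under Euclidean distance.

# Toy illustration of kNN: label one new point by a majority vote
# among its k nearest training points (Euclidean distance).
train_x <- matrix(c(1, 1, 1.2, 0.9, 8, 8, 7.8, 8.3), ncol = 2, byrow = TRUE)
train_y <- factor(c("B", "B", "M", "M"))      # made-up class labels
new_x   <- c(1.1, 1.0)                        # the point to classify
k       <- 3
d <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), 2, byrow = TRUE))^2))
nearest <- order(d)[1:k]                      # indices of the k closest training points
names(sort(table(train_y[nearest]), decreasing = TRUE))[1]   # majority vote -> "B"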
Below is a simple implementation of the kNN algorithm in R.
Dataset: the UCI breast cancer dataset.
Load and inspect the dataset
> b_c <- read.table("Breast_cancer.txt", sep = ",", stringsAsFactors = F)
> str(b_c)
'data.frame':	569 obs. of  32 variables:
 $ V1 : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
 $ V2 : chr  "M" "M" "M" "M" ...
 $ V3 : num  18 20.6 19.7 11.4 20.3 ...
 $ V4 : num  10.4 17.8 21.2 20.4 14.3 ...
 $ V5 : num  122.8 132.9 130 77.6 135.1 ...
 $ V6 : num  1001 1326 1203 386 1297 ...
 ... (V7-V32: 26 more numeric feature columns, output trimmed)
> # Column 1 is the ID and column 2 is the diagnosis
> b_c <- b_c[-1]      # drop the ID column
> table(b_c$V2)
  B   M 
357 212 
> str(b_c)            # now 31 variables: the diagnosis V2 plus 30 numeric features (output trimmed)
> # Convert the diagnosis column V2 into a factor
> b_c$V2 <- factor(b_c$V2, levels = c("B","M"), labels = c("B","M"))
> prop.table(table(b_c$V2))
        B         M 
0.6274165 0.3725835 
> # Standardize the numeric features (z-scores) and re-attach the diagnosis column
> bc_n <- as.data.frame(scale(b_c[,-1]))
> bc_n <- cbind(b_c[,1], bc_n)
> str(bc_n)
'data.frame':	569 obs. of  31 variables:
 $ b_c[, 1]: Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ V3      : num  1.096 1.828 1.578 -0.768 1.749 ...
 $ V4      : num  -2.072 -0.353 0.456 0.254 -1.151 ...
 ... (V5-V32: remaining standardized columns, output trimmed)
Set up the training and test sets
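The post standardizes the features with scale() (z-scores). Another rescaling that kNN examples often use is min-max normalization; the sketch below is not from the original post and assumes b_c has already been loaded and its ID column dropped as above (the column name diagnosis is made up).

normalize <- function(x) (x - min(x)) / (max(x) - min(x))   # map a vector to [0, 1]
bc_mm <- as.data.frame(lapply(b_c[,-1], normalize))          # rescale the 30 numeric features
bc_mm <- cbind(diagnosis = b_c[,1], bc_mm)                   # re-attach the diagnosis factor
summary(bc_mm$V3)                                            # each feature now lies in [0, 1]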
> ind <- sample(2, nrow(bc_n), replace = T, prob = c(0.7, 0.3))
> traindata <- bc_n[ind == 1, ]
> testdata  <- bc_n[ind == 2, ]
> traindata_lable <- traindata[, 1]
> testdata_lable  <- testdata[, 1]
> # Load the class package, call its knn() function to build the model, and use a loop to choose the k value
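The split above is random, so the exact precision values will differ from run to run. A small hedged sketch for making the split reproducible (the seed value 123 is arbitrary and not from the original post):

set.seed(123)                                     # fix the RNG so the split is repeatable
ind <- sample(2, nrow(bc_n), replace = TRUE, prob = c(0.7, 0.3))
traindata <- bc_n[ind == 1, ]
testdata  <- bc_n[ind == 2, ]
prop.table(table(traindata[, 1]))                 # check the B/M balance of the training set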
> library(class)
> Precesion <- as.data.frame(c(), c())    # build an empty data frame to collect results
> for (i in 1:round(sqrt(nrow(traindata)))) {
+   bc_pred <- knn(traindata[,-1], testdata[,-1], cl = traindata_lable, k = i)
+   precesion <- prop.table(xtabs(~ testdata[,1] + bc_pred), 2)[2,2]   # precision on the M class
+   temp <- cbind(i, precesion)
+   Precesion <- rbind(Precesion, temp)
+ }
> Precesion[order(Precesion$precesion), ]
    i precesion
4   4 0.9420290
5   5 0.9552239
18 18 0.9682540
19 19 0.9682540
17 17 0.9687500
20 20 0.9687500
16 16 0.9692308
1   1 0.9696970
2   2 0.9696970
6   6 0.9701493
7   7 0.9701493
12 12 0.9701493
13 13 0.9701493
8   8 0.9705882
11 11 0.9705882
3   3 0.9846154
15 15 0.9846154
14 14 0.9848485
9   9 0.9850746
10 10 0.9850746
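The loop above records only one cell of the cross-tabulation (precision on the M class). For whichever k comes out best in your run (k = 10 in the run shown; your numbers will differ), a full confusion matrix and the overall accuracy give a more complete picture. This is a hedged sketch that assumes traindata, testdata and traindata_lable exist as above.

best_k  <- 10                                            # pick the k that scored best in your run
bc_pred <- knn(traindata[,-1], testdata[,-1], cl = traindata_lable, k = best_k)
table(predicted = bc_pred, actual = testdata[,1])        # full confusion matrix
mean(bc_pred == testdata[,1])                            # overall accuracy on the test set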
Summary
kNN needs no training step: after standardizing the features and splitting the data, each test sample is labelled by a vote among its k nearest training samples, and a simple loop over candidate k values (here 1 to roughly the square root of the training-set size) is enough to pick a reasonable k.

Reposted from: https://www.cnblogs.com/li-volleyball/p/5565749.html