KNN Algorithm Experiment: Using the UCI Iris and DryBean Datasets

Published: 2023/12/20


KNN (K Nearest Neighbors)

All code and datasets are available on my GitHub.
The DryBean dataset could not be uploaded to GitHub, so it is hosted on CSDN and can be downloaded for 0 points: Drybean download

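1. Overview

KNN (K-nearest-neighbors voting algorithm)
Compute the distance from the test point to every training sample, select the K samples with the smallest distances, and let them decide the test point's label by majority vote.
Advantages: the algorithm and the idea are simple; no parameter estimation and no training are needed
Disadvantage: works well only when every class has a similar number of samples
Capability: multi-class classification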

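2. Principle

The Iris dataset is used as the worked example, with K=7, Euclidean distance, a 30% test set, a 70% training set, and random seed 0.
Steps
1. Normalize every feature with min-max normalization: (X - Min) / (Max - Min)
2. For the first test sample, compute its Euclidean distance to every sample in the training set
3. Take the K smallest distances and hold a vote; the majority label becomes the predicted label of this sample
4. Repeat for the next test sample until every prediction has been made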
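
The four steps above can be sketched on a tiny made-up dataset. This is only an illustration under stated assumptions: the helper names `min_max_normalize` and `knn_predict_one` and the toy numbers are hypothetical and do not come from the Excel workbook or the notebook. As in the article, the query point is normalized together with the rest of the data.

```python
import numpy as np
from collections import Counter


def min_max_normalize(X):
    """Scale every feature (column) to [0, 1] with (X - Min) / (Max - Min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant column: avoid division by zero
    return (X - col_min) / col_range


def knn_predict_one(train_x, train_y, query, k):
    """Predict one label by unweighted majority vote among the k nearest points."""
    dists = np.sqrt(((train_x - query) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                        # indices of the k smallest
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: 6 training points with 2 features, plus 1 query point.
raw = np.array([[1.0, 10.0], [2.0, 12.0], [1.5, 11.0],
                [8.0, 30.0], [9.0, 28.0], [8.5, 31.0],
                [2.2, 12.5]])           # the last row is the query
scaled = min_max_normalize(raw)         # step 1
train_x, query = scaled[:6], scaled[6]
train_y = np.array([0, 0, 0, 1, 1, 1])

pred = knn_predict_one(train_x, train_y, query, k=3)  # steps 2 and 3
print(pred)  # 0: the three nearest points all belong to class 0
```

Step 4 is simply a loop that calls `knn_predict_one` once per test sample.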

2.1 For the manual calculation, see the Excel workbook: Iris-count.xlsx

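3. Quick version with library calls

The Iris dataset is used as the worked example, with Euclidean distance, a 30% test set, a 70% training set, and random seed 0.
10-fold cross-validation is used here and finds the best value, K=7.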

# Import packages
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from scipy.spatial import distance
import operator

# Load the dataset
iriis = datasets.load_iris()
x = iriis.data
y = iriis.target

# Normalize the dataset (linear min-max normalization)
x = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))  # axis=0 takes the max/min of each column

# Split into training and test sets
split = 0.7  # trainset : testset = 7:3
np.random.seed(0)  # fix the random result
train_indices = np.random.choice(len(x), round(len(x) * split), replace=False)
test_indices = np.array(list(set(range(len(x))) - set(train_indices)))
train_indices = sorted(train_indices)
test_indices = sorted(test_indices)
train_x = x[train_indices]
test_x = x[test_indices]
train_y = y[train_indices]
test_y = y[test_indices]
print(train_indices)
print(test_indices)

[1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 18, 20, 22, 24, 26, 27, 30, 33, 37, 38, 40, 41, 42, 43, 44, 45, 46, 48, 50, 51, 52, 53, 54, 56, 59, 60, 61, 62, 63, 64, 66, 68, 69, 71, 73, 76, 78, 80, 83, 84, 85, 86, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 100, 101, 102, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 116, 119, 120, 121, 123, 124, 125, 126, 127, 128, 129, 132, 133, 134, 135, 137, 139, 141, 143, 144, 146, 147, 148, 149]
[0, 9, 14, 19, 21, 23, 25, 28, 29, 31, 32, 34, 35, 36, 39, 47, 49, 55, 57, 58, 65, 67, 70, 72, 74, 75, 77, 79, 81, 82, 87, 88, 99, 103, 115, 117, 118, 122, 130, 131, 136, 138, 140, 142, 145]

pd.DataFrame(train_x).head(10)  # matches the Excel calculation

	0	1	2	3
0	0.166667	0.416667	0.067797	0.041667
1	0.111111	0.500000	0.050847	0.041667
2	0.083333	0.458333	0.084746	0.041667
3	0.194444	0.666667	0.067797	0.041667
4	0.305556	0.791667	0.118644	0.125000
5	0.083333	0.583333	0.067797	0.083333
6	0.194444	0.583333	0.084746	0.041667
7	0.027778	0.375000	0.067797	0.041667
8	0.305556	0.708333	0.084746	0.041667
9	0.138889	0.583333	0.101695	0.041667

## KNN
from sklearn.neighbors import KNeighborsClassifier  # a simple model with a single parameter K, similar to K-means
from sklearn.model_selection import train_test_split, cross_val_score  # data splitting, cross-validation

k_range = range(1, 10)  # k is the number of voters
cv_scores = []  # holds the mean score of each model
for n in k_range:
    # KNN model; with a single hyperparameter a loop is enough,
    # with several hyperparameters GridSearchCV is the usual alternative
    knn = KNeighborsClassifier(n)
    # cv: number of folds; scoring='accuracy' is the evaluation metric
    # and can be omitted to use the default
    scores = cross_val_score(knn, train_x, train_y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
plt.plot(k_range, cv_scores)
plt.xlabel('K')
plt.ylabel('Accuracy')  # choose the best parameter from the plot
plt.show()

best_knn = KNeighborsClassifier(n_neighbors=7)  # pass the best K=7 into the model
best_knn.fit(train_x, train_y)  # train the model
print(best_knn.score(test_x, test_y))  # score = right/total = 44/45 = 0.9778 (one sample is mispredicted here)
print(best_knn.predict(test_x))
print(test_y)

0.9777777777777777
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 2
 2 2 2 2 2 2 2 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
 2 2 2 2 2 2 2 2]
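
When more than one hyperparameter is tuned, the loop over K can be replaced by a grid search. The following is a minimal sketch, not the notebook's code: it assumes a stratified `train_test_split` instead of the manual index split above, wraps `MinMaxScaler` in a `Pipeline` so that scaling is fitted on the training folds only, and also searches over `weights`. The names `pipe`, `param_grid`, and `search` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
# Assumption: a stratified 70/30 split, which differs from the article's manual split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scaling lives inside the pipeline, so each CV fold is scaled with its own training part.
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": list(range(1, 10)),
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

test_score = search.score(X_test, y_test)
print(search.best_params_)
print(test_score)
```

Because the split differs, the best K reported here need not equal the K=7 found above.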

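4. Advantages and disadvantages

4.1 Advantages

No training and no parameter estimation are required; classification can start as soon as the test data is available.

4.2 Disadvantages

When the classes contain unequal numbers of samples, the majority vote may be forced in favor of the larger class, even when the nearest samples belong to the smaller one.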
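
A small made-up example shows how this happens. It is a sketch on hypothetical one-dimensional data (20 samples of class 0, only 2 of class 1), not one of the article's experiments: once K exceeds the size of the minority class, the uniform vote flips to the majority class, while distance weighting keeps the nearby minority label.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: class 0 at 0.0, 0.1, ..., 1.9 and class 1 at 3.0 and 3.1.
X_major = np.arange(20).reshape(-1, 1) * 0.1
X_minor = np.array([[3.0], [3.1]])
X = np.vstack([X_major, X_minor])
y = np.array([0] * 20 + [1] * 2)

query = np.array([[3.05]])  # sits right between the two class-1 samples

preds = {}
for k in (1, 3, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    preds[k] = int(clf.predict(query)[0])
print(preds)  # {1: 1, 3: 1, 5: 0}

# With inverse-distance weights the two very close class-1 neighbors dominate.
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)
weighted_pred = int(weighted.predict(query)[0])
print(weighted_pred)  # 1
```

At K=5 the neighbors are the two class-1 points plus three class-0 points more than 1.1 away, so the uniform vote is 3 to 2 for class 0.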

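4.3 Verifying the disadvantage

The Iris and DryBean datasets are used here.
Both datasets are cleaned to produce new datasets with balanced and unbalanced class counts, and KNN is then compared on "balanced Iris", "unbalanced Iris", "balanced DryBean", and "unbalanced DryBean".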

# Still a 7:3 split of each dataset
# Balanced Iris: 7:3 = 105:45 = (33+34+38):(17+16+12) (the split made earlier)
# Unbalanced Iris: 7:3 = 105:45 = (45+45+15):(5+5+35)
# Balanced DryBean: 7:3 = 1960:840 = (280*7):(120*7)
# Unbalanced DryBean: 7:3 = 1960:840 = (6:6:6:6:6:4:3) =
# (318,318,318,318,318,212,159):(68,91,136,136,136,136,136)
# (as listed, these counts actually sum to 1961:839)
i_train_x = train_x
i_train_y = train_y
i_test_x = test_x
i_test_y = test_y

# Unbalanced Iris: 45+45+15 training samples
ui_train_x = np.concatenate((x[:45], x[50:95], x[100:115]), axis=0)  # axis=0 stacks the rows vertically
ui_train_y = np.concatenate((y[:45], y[50:95], y[100:115]), axis=0)
# 5+5+35 test samples, taken from the rows left over in each class.
# The original slices x[:5], x[50:55], x[100:135] overlapped the training rows.
ui_test_x = np.concatenate((x[45:50], x[95:100], x[115:150]), axis=0)
ui_test_y = np.concatenate((y[45:50], y[95:100], y[115:150]), axis=0)

i_score = []
ui_score = []
klist = []
for k in range(1, 12):
    klist.append(k)
    i_knn = KNeighborsClassifier(n_neighbors=k)
    i_knn.fit(i_train_x, i_train_y)
    i_score.append(i_knn.score(i_test_x, i_test_y))
    # print("Balanced Iris:", i_knn.score(i_test_x, i_test_y))
    ui_knn = KNeighborsClassifier(n_neighbors=k)
    ui_knn.fit(ui_train_x, ui_train_y)
    ui_score.append(ui_knn.score(ui_test_x, ui_test_y))
    # print("Unbalanced Iris:", ui_knn.score(ui_test_x, ui_test_y))
plt.plot(klist, i_score, marker='o', label='balanced Iris')
plt.plot(klist, ui_score, marker='*', label='unbalanced Iris')
plt.legend()  # show the legend
plt.xlabel('k-value')
plt.ylabel('accuracy-value')
plt.title(u'Iris map')
plt.show()

# import openpyxl
import operator
from sklearn.preprocessing import StandardScaler  # standardization (zero mean, unit variance)
from sklearn.metrics import confusion_matrix  # builds the confusion matrix
from sklearn.metrics import classification_report  # classification report

def openfile(filename):
    """
    Open the dataset and preprocess it
    :param filename: file name
    :return: feature data, label data, column names
    """
    # Open the Excel file
    sheet = pd.read_excel(filename, sheet_name='Dry_Beans_Dataset')
    data = sheet.iloc[:, :16].values
    target = sheet['Class'].values
    print(data.shape)
    print(target.shape)
    return data, target, sheet.columns

def split_data_set(data_set, target_set, rate=0.7):
    """
    Split the dataset; by default 30% of the data becomes the test set
    :param data_set: feature data
    :param target_set: label data
    :param rate: fraction of the data used for training
    :return: training data, training labels, test data, test labels
    """
    # Total number of samples
    train_size = len(data_set)
    # Randomly draw the training indices
    train_index = sorted(np.random.choice(train_size, round(train_size * rate), replace=False))
    # Sorting is optional; it is done only to stay consistent with the code above
    test_index = sorted(np.array(list(set(range(train_size)) - set(train_index))))
    # Split the dataset (x is data, y is labels).
    # data_set and target_set are DataFrames here, not ndarrays, so iloc is used
    x_train = data_set.iloc[train_index, :]
    x_test = data_set.iloc[test_index, :]
    y_train = target_set.iloc[train_index, :]
    y_test = target_set.iloc[test_index, :]
    return x_train, y_train, x_test, y_test

filename = r'D:\jjq\code\jupyterWorkSpace\datasets\DryBeanDataset\Dry_Bean_Dataset.xlsx'
o_bean_dataset = openfile(filename)
# Take 400 rows from each bean class; these are the starting indices of the classes
step = 400
start_index = [0, 1322, 1844, 3474, 7020, 8948, 10975]  # 7 classes in total
# bean_dataset_x = pd.DataFrame(columns=o_bean_dataset[2])
# bean_dataset_y = pd.DataFrame(columns=o_bean_dataset[2])
bean_dataset_x = pd.DataFrame(columns=range(16))
bean_dataset_y = pd.DataFrame(columns=range(1))
bean_dataset_x.drop(bean_dataset_x.index, inplace=True)
bean_dataset_y.drop(bean_dataset_y.index, inplace=True)
for i in range(7):
    bean_dataset_x = pd.concat(
        (bean_dataset_x, pd.DataFrame(o_bean_dataset[0][start_index[i]:(step + start_index[i])])),
        axis=0)
    bean_dataset_y = pd.concat(
        (bean_dataset_y, pd.DataFrame(o_bean_dataset[1][start_index[i]:(step + start_index[i])])),
        axis=0)
# bean_dataset_y.to_excel("./123.xlsx")

(13611, 16)
(13611,)

# Split into training and test sets, once balanced and once unbalanced
# Balanced
b_train_x, b_train_y, b_test_x, b_test_y = split_data_set(bean_dataset_x, bean_dataset_y)
print(b_train_x.shape, b_train_y.shape)
print(b_test_x.shape, b_test_y.shape)
# Unbalanced
steps_train = [318, 318, 318, 318, 318, 212, 159]
steps_test = [68, 91, 136, 136, 136, 136, 136]
now = 0
# Initialize the unbalanced frames
ub_train_x = pd.DataFrame(columns=range(16))
ub_test_x = pd.DataFrame(columns=range(16))
ub_train_y = pd.DataFrame(columns=range(1))
ub_test_y = pd.DataFrame(columns=range(1))
# Make sure the frames are empty before adding data
ub_train_x.drop(ub_train_x.index, inplace=True)
ub_test_x.drop(ub_test_x.index, inplace=True)
ub_train_y.drop(ub_train_y.index, inplace=True)
ub_test_y.drop(ub_test_y.index, inplace=True)
# Start adding data
for i in range(7):
    ub_train_x = pd.concat((ub_train_x, bean_dataset_x[now:(now + steps_train[i])]), axis=0)
    ub_train_y = pd.concat((ub_train_y, bean_dataset_y[now:(now + steps_train[i])]), axis=0)
    now = now + steps_train[i]
    ub_test_x = pd.concat((ub_test_x, bean_dataset_x[now:(now + steps_test[i])]), axis=0)
    ub_test_y = pd.concat((ub_test_y, bean_dataset_y[now:(now + steps_test[i])]), axis=0)
    now = now + steps_test[i]

(1960, 16) (1960, 1)
(840, 16) (840, 1)

b_score = []
ub_score = []
klist = []
for k in range(30, 50):
    klist.append(k)
    b_knn = KNeighborsClassifier(n_neighbors=k)
    b_knn.fit(b_train_x, b_train_y.values.ravel())
    b_score.append(b_knn.score(b_test_x, b_test_y.values.ravel()))
    ub_knn = KNeighborsClassifier(n_neighbors=k)
    ub_knn.fit(ub_train_x, ub_train_y.values.ravel())
    ub_score.append(ub_knn.score(ub_test_x, ub_test_y.values.ravel()))
plt.plot(klist, b_score, marker='o', label='balanced DryBean')
plt.plot(klist, ub_score, marker='*', label='unbalanced DryBean')
plt.legend()  # show the legend
plt.xlabel('k-value')
plt.ylabel('accuracy-value')
plt.title(u'DryBean map')
plt.show()

4. Implementation code

# A manual implementation of the algorithm that reaches the same accuracy as the library call
# The Iris dataset is used as the worked example
print(type(train_x))
print(pd.DataFrame(train_x).shape)
print(train_x[0][1])

<class 'numpy.ndarray'>
(105, 4)
0.41666666666666663

# Define a KNN class for classification. Prediction can be done with or without
# distance weights; the predict method below uses weights.
class KNN:
    '''K-nearest-neighbors algorithm implemented in Python (classification).'''

    def __init__(self, k):
        '''Initializer

        Parameters
        -----
        k : int
            Number of neighbors
        '''
        self.k = k

    def fit(self, X, y):
        '''Training method

        Parameters
        ----
        X : array-like, shape [n_samples, n_features]
            Features (attributes) of the training samples
        y : array-like, shape [n_samples]
            Target value (label) of each sample
        '''
        # Convert X and y to ndarrays
        self.X = np.asarray(X)
        self.y = np.asarray(y)

    def predict(self, X):
        """Predict the labels of the given samples.

        Parameters
        -----
        X : array-like, shape [n_samples, n_features]
            Features (attributes) of the samples to predict

        Returns
        -----
        result : ndarray
            The predicted labels
        """
        X = np.asarray(X)
        result = []
        # Iterate over the ndarray, one row at a time
        for x in X:
            # Distance from this test sample to every training sample
            dis = np.sqrt(np.sum((x - self.X) ** 2, axis=1))
            # Indices that would sort the distances in ascending order
            index = dis.argsort()
            # Truncate: keep the indices of the k nearest samples
            index = index[:self.k]
            # Weight each vote by the inverse of its distance. If a neighbor has
            # distance 0, only the zero-distance neighbors vote; the original code
            # skipped this case and left count undefined.
            d = dis[index]
            if np.any(d == 0):
                weights = (d == 0).astype(float)
            else:
                weights = 1 / d
            # Weighted count of each label; labels must be non-negative integers
            count = np.bincount(self.y[index], weights=weights)
            # The index of the largest weighted count is the predicted class
            result.append(count.argmax())
        return np.asarray(result)

# Create the KNN object, then train and test it
knn = KNN(k=7)
# Train
knn.fit(train_x, train_y)
# Test
result = knn.predict(test_x)
# print(result)
# print(test_y)
print(np.sum(result == test_y))
if len(result) != 0:
    print(np.sum(result == test_y) / len(result))
# Same result as the library call

44
0.9777777777777777

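5. The kNN model in scikit-learn (source: CSDN Ada_Concentration)

scikit-learn provides the KNeighborsClassifier class, which implements the k-nearest-neighbors classification model. Its prototype is:
sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1, **kwargs)

5.1 Parameters

n_neighbors: an integer that specifies k.
weights: a string or a callable that specifies how the votes are weighted, that is, whether all neighbors vote equally or not:
– 'uniform': all neighbors of a point have equal voting weight;
– 'distance': the voting weight of each neighbor is inversely proportional to its distance, so closer neighbors have more influence;
– [callable]: a callable that receives an array of distances and returns an array of weights with the same shape.
algorithm: a string that specifies the algorithm used to find the nearest neighbors:
– 'ball_tree': use the BallTree algorithm;
– 'kd_tree': use the KDTree algorithm;
– 'brute': use brute-force search;
– 'auto': choose the most suitable algorithm automatically.
leaf_size: an integer that specifies the leaf size of the BallTree or KDTree. It affects the speed of tree construction and of queries.
metric: a string that specifies the distance metric. The default is 'minkowski'.
p: an integer, the power parameter of the 'minkowski' metric (p=1 is Manhattan distance, p=2 is Euclidean distance).
n_jobs: the number of parallel jobs. -1 uses all CPU cores; this is not the default (the prototype above shows n_jobs=1, and recent scikit-learn releases default to None).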

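5.2 Methods

fit(X, y): train the model.
predict(X): use the model to predict; returns the labels of the given samples.
score(X, y): returns the prediction accuracy on (X, y).
predict_proba(X): returns the probability of each label for every sample.
kneighbors([X, n_neighbors, return_distance]): returns the k nearest neighbors of each sample. If return_distance=True, the distances to those neighbors are returned as well.
kneighbors_graph([X, n_neighbors, mode]): returns the connectivity graph of the samples.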
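
The parameters and methods listed above can be tried out together. The snippet below is a minimal sketch on the raw Iris data (no normalization and no train/test split, purely to show the calls); the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# K=7, inverse-distance voting, Euclidean distance (Minkowski with p=2)
clf = KNeighborsClassifier(n_neighbors=7, weights="distance", p=2)
clf.fit(X, y)

sample = X[:1]  # one query row; it is also part of the training data here

# Distances to and indices of the 3 nearest training samples
dist, idx = clf.kneighbors(sample, n_neighbors=3, return_distance=True)
print(dist.shape, idx.shape)  # (1, 3) (1, 3)

# Probability of each of the 3 classes for the query row
proba = clf.predict_proba(sample)
print(proba.shape)  # (1, 3)

# Sparse matrix marking which training samples are neighbors of the query
graph = clf.kneighbors_graph(sample, n_neighbors=3, mode="connectivity")
print(graph.shape)  # (1, 150)

label = clf.predict(sample)[0]
print(label, clf.score(X, y))
```

Because the query row belongs to the training data, its nearest neighbor is itself at distance 0, which is why scoring on the training set is not a meaningful accuracy estimate.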

6. Applicable data

# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

h = 0.02  # step size in the mesh

names = [
    "Nearest Neighbors",
    "Linear SVM",
    "RBF SVM",
    "Gaussian Process",
    "Decision Tree",
    "Random Forest",
    "Neural Net",
    "AdaBoost",
    "Naive Bayes",
    "QDA",
]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]

X, y = make_classification(
    n_features=2, n_redundant=0, n_informative=2, random_state=1, n_clusters_per_class=1
)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
linearly_separable = (X, y)

datasets = [
    make_moons(noise=0.3, random_state=0),
    make_circles(noise=0.2, factor=0.5, random_state=1),
    linearly_separable,
]

figure = plt.figure(figsize=(27, 9))
i = 1
# iterate over datasets
for ds_cnt, ds in enumerate(datasets):
    # preprocess dataset, split into training and test part
    X, y = ds
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=42
    )

    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # just plot the dataset first
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(["#FF0000", "#0000FF"])
    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
    if ds_cnt == 0:
        ax.set_title("Input data")
    # Plot the training points
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k")
    # Plot the testing points
    ax.scatter(
        X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6, edgecolors="k"
    )
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    i += 1

    # iterate over classifiers
    for name, clf in zip(names, classifiers):
        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)

        # Plot the decision boundary. For that, we will assign a color to each
        # point in the mesh [x_min, x_max]x[y_min, y_max].
        if hasattr(clf, "decision_function"):
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]

        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, cmap=cm, alpha=0.8)

        # Plot the training points
        ax.scatter(
            X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright, edgecolors="k"
        )
        # Plot the testing points
        ax.scatter(
            X_test[:, 0],
            X_test[:, 1],
            c=y_test,
            cmap=cm_bright,
            edgecolors="k",
            alpha=0.6,
        )

        ax.set_xlim(xx.min(), xx.max())
        ax.set_ylim(yy.min(), yy.max())
        ax.set_xticks(())
        ax.set_yticks(())
        if ds_cnt == 0:
            ax.set_title(name)
        ax.text(
            xx.max() - 0.3,
            yy.min() + 0.3,
            ("%.2f" % score).lstrip("0"),
            size=15,
            horizontalalignment="right",
        )
        i += 1

plt.tight_layout()
plt.show()
