當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

复现经典：《统计学习方法》第 3 章 k 近邻法

發(fā)布時間：2025/3/8 编程问答 27 豆豆

生活随笔收集整理的這篇文章主要介紹了复现经典：《统计学习方法》第 3 章 k 近邻法小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

本文是李航老師的《統(tǒng)計學習方法》^[1]一書的代碼復現(xiàn)。

作者：黃海廣^[2]

備注：代碼都可以在github^[3]中下載。

我將陸續(xù)將代碼發(fā)布在公眾號“機器學習初學者”，敬請關(guān)注。

代碼目錄

第 1 章統(tǒng)計學習方法概論
第 2 章感知機
第 3 章 k 近鄰法
第 4 章樸素貝葉斯
第 5 章決策樹
第 6 章邏輯斯諦回歸
第 7 章支持向量機
第 8 章提升方法
第 9 章 EM 算法及其推廣
第 10 章隱馬爾可夫模型
第 11 章條件隨機場
第 12 章監(jiān)督學習方法總結(jié)

代碼參考：wzyonggege^[4],WenDesi^[5],火燙火燙的^[6]

第 3 章 k 近鄰法

1．近鄰法是基本且簡單的分類與回歸方法。近鄰法的基本做法是：對給定的訓練實例點和輸入實例點，首先確定輸入實例點的個最近鄰訓練實例點，然后利用這個訓練實例點的類的多數(shù)來預測輸入實例點的類。

2．近鄰模型對應于基于訓練數(shù)據(jù)集對特征空間的一個劃分。近鄰法中，當訓練集、距離度量、值及分類決策規(guī)則確定后，其結(jié)果唯一確定。

3．近鄰法三要素：距離度量、值的選擇和分類決策規(guī)則。常用的距離度量是歐氏距離及更一般的pL距離。值小時，近鄰模型更復雜；值大時，近鄰模型更簡單。值的選擇反映了對近似誤差與估計誤差之間的權(quán)衡，通常由交叉驗證選擇最優(yōu)的。

常用的分類決策規(guī)則是多數(shù)表決，對應于經(jīng)驗風險最小化。

4．近鄰法的實現(xiàn)需要考慮如何快速搜索 k 個最近鄰點。kd樹是一種便于對 k 維空間中的數(shù)據(jù)進行快速檢索的數(shù)據(jù)結(jié)構(gòu)。kd 樹是二叉樹，表示對維空間的一個劃分，其每個結(jié)點對應于維空間劃分中的一個超矩形區(qū)域。利用kd樹可以省去對大部分數(shù)據(jù)點的搜索，從而減少搜索的計算量。

距離度量

設特征空間是維實數(shù)向量空間，,,?，則：,的距離定義為:

?曼哈頓距離
?歐氏距離
?切比雪夫距離

import math from itertools import combinations def L(x, y, p=2):# x1 = [1, 1], x2 = [5,1]if len(x) == len(y) and len(x) > 1:sum = 0for i in range(len(x)):sum += math.pow(abs(x[i] - y[i]), p)return math.pow(sum, 1 / p)else:return 0

課本例 3.1

x1 = [1, 1] x2 = [5, 1] x3 = [4, 4] # x1, x2 for i in range(1, 5):r = {'1-{}'.format(c): L(x1, c, p=i) for c in [x2, x3]}print(min(zip(r.values(), r.keys()))) (4.0, '1-[5, 1]') (4.0, '1-[5, 1]') (3.7797631496846193, '1-[4, 4]') (3.5676213450081633, '1-[4, 4]')

python 實現(xiàn)，遍歷所有數(shù)據(jù)點，找出個距離最近的點的分類情況，少數(shù)服從多數(shù)

import numpy as np import pandas as pd import matplotlib.pyplot as plt %matplotlib inlinefrom sklearn.datasets import load_iris from sklearn.model_selection import train_test_split from collections import Counter # data iris = load_iris() df = pd.DataFrame(iris.data, columns=iris.feature_names) df['label'] = iris.target df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label'] # data = np.array(df.iloc[:100, [0, 1, -1]]) df.head(10)
sepal lengthsepal widthpetal lengthpetal widthlabel0123456789

5.1	3.5	1.4	0.2
4.9	3.0	1.4	0.2
4.7	3.2	1.3	0.2
4.6	3.1	1.5	0.2
5.0	3.6	1.4	0.2
5.4	3.9	1.7	0.4
4.6	3.4	1.4	0.3
5.0	3.4	1.5	0.2
4.4	2.9	1.4	0.2
4.9	3.1	1.5	0.1

plt.scatter(df[:50]['sepal length'], df[:50]['sepal width'], label='0') plt.scatter(df[50:100]['sepal length'], df[50:100]['sepal width'], label='1') plt.xlabel('sepal length') plt.ylabel('sepal width') plt.legend() data = np.array(df.iloc[:100, [0, 1, -1]]) X, y = data[:,:-1], data[:,-1] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)class KNN:def __init__(self, X_train, y_train, n_neighbors=3, p=2):"""parameter: n_neighbors 臨近點個數(shù)parameter: p 距離度量"""self.n = n_neighborsself.p = pself.X_train = X_trainself.y_train = y_traindef predict(self, X):# 取出n個點knn_list = []for i in range(self.n):dist = np.linalg.norm(X - self.X_train[i], ord=self.p)knn_list.append((dist, self.y_train[i]))for i in range(self.n, len(self.X_train)):max_index = knn_list.index(max(knn_list, key=lambda x: x[0]))dist = np.linalg.norm(X - self.X_train[i], ord=self.p)if knn_list[max_index][0] > dist:knn_list[max_index] = (dist, self.y_train[i])# 統(tǒng)計knn = [k[-1] for k in knn_list]count_pairs = Counter(knn) # max_count = sorted(count_pairs, key=lambda x: x)[-1]max_count = sorted(count_pairs.items(), key=lambda x: x[1])[-1][0]return max_countdef score(self, X_test, y_test):right_count = 0n = 10for X, y in zip(X_test, y_test):label = self.predict(X)if label == y:right_count += 1return right_count / len(X_test)clf = KNN(X_train, y_train)clf.score(X_test, y_test)1.0 test_point = [6.0, 3.0] print('Test Point: {}'.format(clf.predict(test_point)))Test Point: 1.0 plt.scatter(df[:50]['sepal length'], df[:50]['sepal width'], label='0') plt.scatter(df[50:100]['sepal length'], df[50:100]['sepal width'], label='1') plt.plot(test_point[0], test_point[1], 'bo', label='test_point') plt.xlabel('sepal length') plt.ylabel('sepal width') plt.legend()

scikit-learn 實例

from sklearn.neighbors import KNeighborsClassifier clf_sk = KNeighborsClassifier() clf_sk.fit(X_train, y_train) KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',metric_params=None, n_jobs=None, n_neighbors=5, p=2,weights='uniform') clf_sk.score(X_test, y_test) 1.0

sklearn.neighbors.KNeighborsClassifier

n_neighbors: 臨近點個數(shù)
p: 距離度量
algorithm: 近鄰算法，可選{'auto', 'ball_tree', 'kd_tree', 'brute'}
weights: 確定近鄰的權(quán)重

kd 樹

kd樹是一種對 k 維空間中的實例點進行存儲以便對其進行快速檢索的樹形數(shù)據(jù)結(jié)構(gòu)。

kd樹是二叉樹，表示對維空間的一個劃分（partition）。構(gòu)造kd樹相當于不斷地用垂直于坐標軸的超平面將維空間切分，構(gòu)成一系列的 k 維超矩形區(qū)域。kd 樹的每個結(jié)點對應于一個維超矩形區(qū)域。

構(gòu)造kd樹的方法如下：

構(gòu)造根結(jié)點，使根結(jié)點對應于維空間中包含所有實例點的超矩形區(qū)域；通過下面的遞歸方法，不斷地對維空間進行切分，生成子結(jié)點。在超矩形區(qū)域（結(jié)點）上選擇一個坐標軸和在此坐標軸上的一個切分點，確定一個超平面，這個超平面通過選定的切分點并垂直于選定的坐標軸，將當前超矩形區(qū)域切分為左右兩個子區(qū)域（子結(jié)點）；這時，實例被分到兩個子區(qū)域。這個過程直到子區(qū)域內(nèi)沒有實例時終止（終止時的結(jié)點為葉結(jié)點）。在此過程中，將實例保存在相應的結(jié)點上。

通常，依次選擇坐標軸對空間切分，選擇訓練實例點在選定坐標軸上的中位數(shù) （median）為切分點，這樣得到的kd樹是平衡的。注意，平衡的kd樹搜索時的效率未必是最優(yōu)的。

構(gòu)造平衡 kd 樹算法

輸入：維空間數(shù)據(jù)集＝，，

其中?，＝；

輸出：kd樹。

（1）開始：構(gòu)造根結(jié)點，根結(jié)點對應于包含的維空間的超矩形區(qū)域。

選擇為坐標軸，以 T 中所有實例的坐標的中位數(shù)為切分點，將根結(jié)點對應的超矩形區(qū)域切分為兩個子區(qū)域。切分由通過切分點并與坐標軸垂直的超平面實現(xiàn)。

由根結(jié)點生成深度為 1 的左、右子結(jié)點：左子結(jié)點對應坐標小于切分點的子區(qū)域，右子結(jié)點對應于坐標大于切分點的子區(qū)域。

將落在切分超平面上的實例點保存在根結(jié)點。

（2）重復：對深度為的結(jié)點，選擇為切分的坐標軸，＝，以該結(jié)點的區(qū)域中所有實例的坐標的中位數(shù)為切分點，將該結(jié)點對應的超矩形區(qū)域切分為兩個子區(qū)域。切分由通過切分點并與坐標軸垂直的超平面實現(xiàn)。

由該結(jié)點生成深度為的左、右子結(jié)點：左子結(jié)點對應坐標小于切分點的子區(qū)域，右子結(jié)點對應坐標大于切分點的子區(qū)域。

將落在切分超平面上的實例點保存在該結(jié)點。

（3）直到兩個子區(qū)域沒有實例存在時停止。從而形成kd樹的區(qū)域劃分。

# kd-tree每個結(jié)點中主要包含的數(shù)據(jù)結(jié)構(gòu)如下 class KdNode(object):def __init__(self, dom_elt, split, left, right):self.dom_elt = dom_elt # k維向量節(jié)點(k維空間中的一個樣本點)self.split = split # 整數(shù)（進行分割維度的序號）self.left = left # 該結(jié)點分割超平面左子空間構(gòu)成的kd-treeself.right = right # 該結(jié)點分割超平面右子空間構(gòu)成的kd-treeclass KdTree(object):def __init__(self, data):k = len(data[0]) # 數(shù)據(jù)維度def CreateNode(split, data_set): # 按第split維劃分數(shù)據(jù)集exset創(chuàng)建KdNodeif not data_set: # 數(shù)據(jù)集為空return None# key參數(shù)的值為一個函數(shù)，此函數(shù)只有一個參數(shù)且返回一個值用來進行比較# operator模塊提供的itemgetter函數(shù)用于獲取對象的哪些維的數(shù)據(jù)，參數(shù)為需要獲取的數(shù)據(jù)在對象中的序號#data_set.sort(key=itemgetter(split)) # 按要進行分割的那一維數(shù)據(jù)排序data_set.sort(key=lambda x: x[split])split_pos = len(data_set) // 2 # //為Python中的整數(shù)除法median = data_set[split_pos] # 中位數(shù)分割點split_next = (split + 1) % k # cycle coordinates# 遞歸的創(chuàng)建kd樹return KdNode(median,split,CreateNode(split_next, data_set[:split_pos]), # 創(chuàng)建左子樹CreateNode(split_next, data_set[split_pos + 1:])) # 創(chuàng)建右子樹self.root = CreateNode(0, data) # 從第0維分量開始構(gòu)建kd樹,返回根節(jié)點# KDTree的前序遍歷 def preorder(root):print(root.dom_elt)if root.left: # 節(jié)點不為空preorder(root.left)if root.right:preorder(root.right) # 對構(gòu)建好的kd樹進行搜索，尋找與目標點最近的樣本點： from math import sqrt from collections import namedtuple# 定義一個namedtuple,分別存放最近坐標點、最近距離和訪問過的節(jié)點數(shù) result = namedtuple("Result_tuple","nearest_point nearest_dist nodes_visited")def find_nearest(tree, point):k = len(point) # 數(shù)據(jù)維度def travel(kd_node, target, max_dist):if kd_node is None:return result([0] * k, float("inf"),0) # python中用float("inf")和float("-inf")表示正負無窮nodes_visited = 1s = kd_node.split # 進行分割的維度pivot = kd_node.dom_elt # 進行分割的“軸”if target[s] <= pivot[s]: # 如果目標點第s維小于分割軸的對應值(目標離左子樹更近)nearer_node = kd_node.left # 下一個訪問節(jié)點為左子樹根節(jié)點further_node = kd_node.right # 同時記錄下右子樹else: # 目標離右子樹更近nearer_node = kd_node.right # 下一個訪問節(jié)點為右子樹根節(jié)點further_node = kd_node.lefttemp1 = travel(nearer_node, target, max_dist) # 進行遍歷找到包含目標點的區(qū)域nearest = temp1.nearest_point # 以此葉結(jié)點作為“當前最近點”dist = temp1.nearest_dist # 更新最近距離nodes_visited += temp1.nodes_visitedif dist < max_dist:max_dist = dist # 最近點將在以目標點為球心，max_dist為半徑的超球體內(nèi)temp_dist = abs(pivot[s] - target[s]) # 第s維上目標點與分割超平面的距離if max_dist < temp_dist: # 判斷超球體是否與超平面相交return result(nearest, dist, nodes_visited) # 不相交則可以直接返回，不用繼續(xù)判斷#----------------------------------------------------------------------# 計算目標點與分割點的歐氏距離temp_dist = sqrt(sum((p1 - p2)**2 for p1, p2 in zip(pivot, target)))if temp_dist < dist: # 如果“更近”nearest = pivot # 更新最近點dist = temp_dist # 更新最近距離max_dist = dist # 更新超球體半徑# 檢查另一個子結(jié)點對應的區(qū)域是否有更近的點temp2 = travel(further_node, target, max_dist)nodes_visited += temp2.nodes_visitedif temp2.nearest_dist < dist: # 如果另一個子結(jié)點內(nèi)存在更近距離nearest = temp2.nearest_point # 更新最近點dist = temp2.nearest_dist # 更新最近距離return result(nearest, dist, nodes_visited)return travel(tree.root, point, float("inf")) # 從根節(jié)點開始遞歸

例 3.2

data = [[2,3],[5,4],[9,6],[4,7],[8,1],[7,2]] kd = KdTree(data) preorder(kd.root) [7, 2] [5, 4] [2, 3] [4, 7] [9, 6] [8, 1] from time import clock from random import random# 產(chǎn)生一個k維隨機向量，每維分量值在0~1之間 def random_point(k):return [random() for _ in range(k)]# 產(chǎn)生n個k維隨機向量 def random_points(k, n):return [random_point(k) for _ in range(n)] ret = find_nearest(kd, [3,4.5]) print (ret) Result_tuple(nearest_point=[2, 3], nearest_dist=1.8027756377319946, nodes_visited=4) N = 400000 t0 = clock() kd2 = KdTree(random_points(3, N)) # 構(gòu)建包含四十萬個3維空間樣本點的kd樹 ret2 = find_nearest(kd2, [0.1,0.5,0.8]) # 四十萬個樣本點中尋找離目標最近的點 t1 = clock() print ("time: ",t1-t0, "s") print (ret2) time: 5.204035100000002 s Result_tuple(nearest_point=[0.09308431086306368, 0.5071110780404813, 0.7998624450062822], nearest_dist=0.009920338124925524, nodes_visited=88)

參考資料

[1] 《統(tǒng)計學習方法》:?https://baike.baidu.com/item/統(tǒng)計學習方法/10430179
[2] 黃海廣:?https://github.com/fengdu78
[3] github:?https://github.com/fengdu78/lihang-code
[4] wzyonggege:?https://github.com/wzyonggege/statistical-learning-method
[5] WenDesi:?https://github.com/WenDesi/lihang_book_algorithm
[6] 火燙火燙的:?https://blog.csdn.net/tudaodiaozhale

關(guān)于本站

“機器學習初學者”公眾號由是黃海廣博士創(chuàng)建，黃博個人知乎粉絲23000+，github排名全球前100名（33000+）。本公眾號致力于人工智能方向的科普性文章，為初學者提供學習路線和基礎資料。原創(chuàng)作品有：吳恩達機器學習個人筆記、吳恩達深度學習筆記等。

往期精彩回顧

那些年做的學術(shù)公益-你不是一個人在戰(zhàn)斗
適合初學者入門人工智能的路線及資料下載
吳恩達機器學習課程筆記及資源（github標星12000+，提供百度云鏡像）
吳恩達深度學習筆記及視頻等資源（github標星8500+，提供百度云鏡像）
《統(tǒng)計學習方法》的python代碼實現(xiàn)（github標星7200+）
機器學習的數(shù)學精華（在線閱讀版）

備注：加入本站微信群或者qq群，請回復“加群”

總結(jié)

以上是生活随笔為你收集整理的复现经典：《统计学习方法》第 3 章 k 近邻法的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇：复现经典：《统计学习方法》第 2 章感
下一篇：经典算法笔记：异常检测和推荐系统