[Machine Learning] Hierarchical Clustering
Preface
Hierarchical clustering is a general family of clustering algorithms that builds nested clusters either by merging them bottom-up or splitting them top-down. The resulting hierarchy of clusters is represented as a tree (dendrogram): the root of the tree gathers all samples, and the leaves are the individual samples. This post briefly reviews the principle of hierarchical clustering, and then focuses on running it with sklearn, scipy, and seaborn and visualizing the results.
Principle Overview
There is already a detailed and very readable article on the underlying principle, 层次聚类算法的原理及实现Hierarchical Clustering, so here I mainly go over the linkage criteria available in sklearn (please make sure sklearn >= 0.20):
Ward: minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and, in this sense, similar to k-means.
$$E = \sum_{i=1}^{k}\sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2$$
Complete linkage: minimizes the maximum distance between the sample pairs of two clusters (farthest-pair distance):
$$d_{\max}(C_i, C_j) = \max_{\vec{x}_i \in C_i,\, \vec{x}_j \in C_j} \mathrm{distance}(\vec{x}_i, \vec{x}_j)$$
Average linkage: minimizes the average distance over all sample pairs of two clusters:
$$d_{avg}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{\vec{x}_i \in C_i} \sum_{\vec{x}_j \in C_j} \mathrm{distance}(\vec{x}_i, \vec{x}_j)$$
Single linkage: minimizes the minimum distance between the sample pairs of two clusters (closest-pair distance):
$$d_{\min}(C_i, C_j) = \min_{\vec{x}_i \in C_i,\, \vec{x}_j \in C_j} \mathrm{distance}(\vec{x}_i, \vec{x}_j)$$
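To make the four criteria concrete, here is a minimal sketch (the toy clusters and the `ward_sse` helper are my own illustration, not part of sklearn) that evaluates them by hand on two small point sets:

```python
import numpy as np
from scipy.spatial.distance import cdist

# two toy clusters, chosen only for illustration
C1 = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
C2 = np.array([[5.0, 5.0], [6.0, 5.0]])

D = cdist(C1, C2)       # all pairwise euclidean distances between the two clusters

d_single = D.min()      # single linkage: closest pair
d_complete = D.max()    # complete linkage: farthest pair
d_average = D.mean()    # average linkage: mean over all |C1|*|C2| pairs

# Ward criterion: total within-cluster sum of squared distances to the centroid
def ward_sse(*clusters):
    return sum(((c - c.mean(axis=0)) ** 2).sum() for c in clusters)

print('single   :', d_single)
print('complete :', d_complete)
print('average  :', d_average)
print('SSE before merge:', ward_sse(C1, C2))
print('SSE after merge :', ward_sse(np.vstack([C1, C2])))  # Ward merges the pair whose merge increases this least
```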
Implementation
The hierarchical clustering implemented in sklearn is the bottom-up (agglomerative) variant; the wrapper class is imported with `from sklearn.cluster import AgglomerativeClustering`.
```python
from sklearn.cluster import AgglomerativeClustering
import numpy as np

clu = AgglomerativeClustering(n_clusters=2, affinity='euclidean',
                              linkage='ward', compute_full_tree=False)

# -------------------------- attributes --------------------------
# basic (constructor) parameters
print('n_clusters:', clu.n_clusters)
print('affinity:', clu.affinity)
print('memory:', clu.memory)
print('connectivity:', clu.connectivity)
print('compute_full_tree:', clu.compute_full_tree)
print('linkage:', clu.linkage)
print('pooling_func:', clu.pooling_func)

# the fitted attributes below only exist after calling fit(),
# so fit on a tiny toy dataset first
clu.fit(np.array([[1, 2], [1, 4], [1, 0],
                  [4, 2], [4, 4], [4, 0]]))

# important fitted attributes
print('labels_:', clu.labels_)              # cluster label of each sample
print('children_:', clu.children_)          # children of each non-leaf node
print('n_components_:', clu.n_components_)  # estimated number of connected components in the connectivity graph
print('n_leaves_:', clu.n_leaves_)          # number of leaves in the hierarchical tree

# -------------------------- methods --------------------------
print('get_params:', clu.get_params())      # counterpart of set_params()
# fit() and fit_predict() are covered in the case studies below
```
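The `connectivity` parameter and the `n_components_` attribute above only come into play when a connectivity constraint is supplied. Below is a hedged sketch (the dataset and parameter choices are mine, not from this post) of structure-constrained agglomerative clustering with a k-nearest-neighbors graph; it also shows `fit_predict()`, which is equivalent to calling `fit()` and then reading `labels_`:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# only allow merges between samples that are k-nearest neighbors of each other
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

clu = AgglomerativeClustering(n_clusters=2, linkage='ward',
                              connectivity=connectivity)
labels = clu.fit_predict(X)   # fit_predict() == fit() followed by reading labels_

print('cluster sizes:', np.bincount(labels))
# attribute name depends on the sklearn version (n_components_ was renamed
# to n_connected_components_ after 0.20)
n_cc = getattr(clu, 'n_connected_components_', getattr(clu, 'n_components_', None))
print('connected components in the constraint graph:', n_cc)
```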
Basic Example
The scikit-learn documentation provides a very good example, which we can walk through here (scikit-learn 0.20 or later):
```python
#! /usr/bin/python
# _*_ coding: utf-8 _*_
__author__ = 'Jeffery'
__date__ = '2018/11/12 17:45'

import time
import warnings

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice

np.random.seed(0)

######################################################################
# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5, noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None

# Anisotropically distributed data
random_state = 170
X, y = datasets.make_blobs(n_samples=n_samples, random_state=random_state)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
aniso = (X_aniso, y)

# blobs with varied variances
varied = datasets.make_blobs(n_samples=n_samples,
                             cluster_std=[1.0, 2.5, 0.5],
                             random_state=random_state)

# data shape:
# noisy_circles: (1500, 2)
# noisy_moons:   (1500, 2)
# varied(blobs): (1500, 2)
# blobs:         (1500, 2)
# aniso:         (1500, 2)
# no_structure:  (1500, 2), no labels

######################################################################
# Run the clustering and plot

# Set up cluster parameters
plt.figure(figsize=(9 * 1.3 + 2, 14.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96,
                    wspace=.05, hspace=.01)

plot_num = 1
default_base = {'n_clusters': 3}
datasets = [
    (noisy_circles, {'n_clusters': 2}),
    (noisy_moons, {'n_clusters': 2}),
    (varied, {}),
    (aniso, {}),
    (blobs, {}),
    (no_structure, {})]

for i_dataset, (dataset, algo_params) in enumerate(datasets):
    # update parameters with dataset-specific values
    params = default_base.copy()
    params.update(algo_params)

    X, y = dataset

    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)

    # ============
    # Create cluster objects
    # ============
    ward = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='ward')
    complete = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='complete')
    average = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='average')
    single = cluster.AgglomerativeClustering(
        n_clusters=params['n_clusters'], linkage='single')

    clustering_algorithms = (
        ('Single Linkage', single),
        ('Average Linkage', average),
        ('Complete Linkage', complete),
        ('Ward Linkage', ward),
    )

    for name, algorithm in clustering_algorithms:
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()

        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(int)
        else:
            y_pred = algorithm.predict(X)

        plt.subplot(len(datasets), len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        colors = np.array(list(islice(cycle(['#377eb8', '#ff7f00', '#4daf4a',
                                             '#f781bf', '#a65628', '#984ea3',
                                             '#999999', '#e41a1c', '#dede00']),
                                      int(max(y_pred) + 1))))
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[y_pred])

        plt.xlim(-2.5, 2.5)
        plt.ylim(-2.5, 2.5)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1

plt.show()
```
Key takeaways from this example:
- Single linkage is fast and can perform well on non-globular data, but it degrades badly in the presence of noise;
- Average linkage and complete linkage perform well on cleanly separated globular clusters, but give mixed results otherwise;
- Ward is the most effective approach for noisy data;
- Hierarchical clustering generally cannot be applied directly to high-dimensional data.
Advanced Examples
Dendrogram
For hierarchical clustering we usually also want to build the clustering tree or draw a heatmap, which is what we explore below. A few underlying concepts come into play here, including natural divisions, dissimilarity (the cophenetic correlation coefficient), and inconsistency; see 层次聚类原理解析 for details.
```python
import pandas as pd
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
import numpy as np

my_data = np.random.random(size=[10, 4])
my_data = pd.DataFrame(my_data,
                       index=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                       columns='A B C D'.split())

Z = hierarchy.linkage(my_data, method='ward', metric='euclidean',
                      optimal_ordering=False)
print(type(Z), '\n', Z)   # the linkage matrix Z deserves a short explanation, see below
hierarchy.dendrogram(Z)
plt.show()
```
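A note on the linkage matrix `Z` printed above (this is standard scipy behavior): it has one row per merge, and the four columns are the indices of the two clusters being merged (indices >= n refer to clusters formed in earlier merges), the distance at which they are merged, and the size of the newly formed cluster. Reusing `Z` and `my_data` from the snippet above, a quick sketch for the cophenetic correlation coefficient mentioned earlier might look like this:

```python
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

# a value close to 1 means the dendrogram preserves the original
# pairwise distances of my_data well
c, coph_dists = cophenet(Z, pdist(my_data))
print('cophenetic correlation coefficient:', c)
```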
There is also another way to cut the clustering tree and split it into several distinct clusters:
```python
from scipy.cluster.hierarchy import fcluster

# cut the tree into flat clusters
labels = fcluster(Z, t=0.7, criterion='inconsistent')
print(labels)
```

Parameter notes: `Z` is the linkage matrix returned by `hierarchy.linkage()`; `t` is the cut threshold; `criterion` controls how `t` is interpreted, e.g. `'inconsistent'` (the default, which compares the inconsistency statistic of each merge against `t`), `'distance'` (cut the tree at cophenetic distance `t`), or `'maxclust'` (form at most `t` flat clusters).
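For example, to cut the same tree either at a fixed cophenetic distance or directly into a fixed number of flat clusters (the threshold values below are arbitrary choices for illustration):

```python
from scipy.cluster.hierarchy import fcluster

# no two members of a flat cluster are farther apart than t (cophenetic distance)
labels_by_distance = fcluster(Z, t=1.0, criterion='distance')

# form at most t flat clusters
labels_by_count = fcluster(Z, t=3, criterion='maxclust')

print(labels_by_distance)
print(labels_by_count)
```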
Heatmap
The dendrogram above is rather plain. Sometimes we want a fancier clustered plot, and that is when a clustermap comes in: seaborn.clustermap() is a convenient plotting wrapper built on top of the scipy.cluster.hierarchy module discussed above.
```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)
iris = sns.load_dataset("iris")
species = iris.pop("species")   # {'setosa', 'versicolor', 'virginica'}

lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)
g = sns.clustermap(iris, row_colors=row_colors)

print(g.dendrogram_row.reordered_ind)   # reordered row indices
print(g.dendrogram_col.reordered_ind)   # reordered col indices
g.savefig('a.png')

plt.show()
```
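Since clustermap() forwards the linkage method and distance metric straight to scipy.cluster.hierarchy, the choices discussed in the principle section can be applied here as well. A hedged sketch reusing `iris` and `row_colors` from the example above (the particular method/metric values are arbitrary):

```python
# average linkage with correlation distance, scaling each column to the [0, 1] range
g2 = sns.clustermap(iris, method='average', metric='correlation',
                    standard_scale=1, row_colors=row_colors)
g2.savefig('iris_clustermap_average.png')
```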
Summary

The principle of hierarchical clustering is not hard in itself; the key points are understanding the linkage method (single, average, ward, etc.), the distance metric (euclidean, manhattan, etc.), and how the clustering tree is cut into flat clusters.