Agglomerative Clustering and Dendrograms, Explained
Agglomerative Clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.
- Points in the same cluster are closer to each other.
- Points in different clusters are far apart.
In the above sample 2-dimensional dataset, it is visible that the dataset forms 3 clusters that are far apart, and points in the same cluster are close to each other.
The intuition behind Agglomerative Clustering:
Agglomerative Clustering is a bottom-up approach: initially, each data point is a cluster of its own, and pairs of clusters are merged as one moves up the hierarchy.
Steps of Agglomerative Clustering:
In the above sample dataset, it is observed that the 2 clusters are far separated from each other, so we stop after obtaining 2 clusters.
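To make the bottom-up procedure concrete, here is a rough sketch (not the author's code) in plain Python/NumPy: each point starts as its own cluster, and the two closest clusters are merged repeatedly until the desired number of clusters remains. The array `X` and the `agglomerative` helper below are illustrative assumptions, kept simple rather than efficient.

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="single"):
    """Naive bottom-up clustering: start with one cluster per point,
    then repeatedly merge the two closest clusters until n_clusters remain."""
    # Each cluster starts as a list holding the index of a single point.
    clusters = [[i] for i in range(len(X))]

    def cluster_distance(a, b):
        # All pairwise Euclidean distances between points of clusters a and b.
        d = np.linalg.norm(X[a][:, None, :] - X[b][None, :, :], axis=-1)
        return d.min() if linkage == "single" else d.max()

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest distance and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Tiny synthetic example with two obvious groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(agglomerative(X, n_clusters=2))  # [[0, 1], [2, 3]]
```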
(Image by Author) Sample dataset separated into 2 clusters.
How to join two clusters to form one cluster?
To obtain the desired number of clusters, the number of clusters needs to be reduced from the initial n clusters (where n equals the total number of data points). Two clusters are combined by computing the similarity between them.
There are several methods used to calculate the similarity between two clusters:
- Distance between the two closest points in the two clusters (single linkage).
- Distance between the two farthest points in the two clusters (complete linkage).
- The average distance between all pairs of points in the two clusters (average linkage).
- Distance between the centroids of the two clusters (centroid linkage).
There are pros and cons to choosing any of the above similarity metrics.
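As a rough sketch (not from the original article), the four similarity notions above can be written directly with NumPy; `a` and `b` below are assumed to be arrays holding the points of two clusters.

```python
import numpy as np

def pairwise(a, b):
    """All Euclidean distances between points of cluster a and cluster b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def single_linkage(a, b):    # distance between the two closest points
    return pairwise(a, b).min()

def complete_linkage(a, b):  # distance between the two farthest points
    return pairwise(a, b).max()

def average_linkage(a, b):   # average distance over all point pairs
    return pairwise(a, b).mean()

def centroid_linkage(a, b):  # distance between the cluster centroids
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[4.0, 0.0], [5.0, 0.0]])
print(single_linkage(a, b), complete_linkage(a, b),
      average_linkage(a, b), centroid_linkage(a, b))  # 3.0 5.0 4.0 4.0
```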
Implementation of Agglomerative Clustering:
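A minimal sketch of such an implementation (not the author's original snippet), using scikit-learn's `AgglomerativeClustering`, could look like the following; the file name `sample_dataset.csv` is a placeholder for the 2-dimensional sample dataset linked later in the article.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

# Load the 2-dimensional sample dataset (file name is a placeholder).
X = pd.read_csv("sample_dataset.csv").values

# Fit agglomerative clustering with the desired number of clusters.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)

# Visualize the resulting clusters, colored by their assigned label.
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()
```

The `linkage="ward"` choice here is one option among the similarity metrics discussed above; any of the other linkages ("single", "complete", "average") could be passed instead.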
How to obtain the optimal number of clusters?
The implementation of the Agglomerative Clustering algorithm accepts the number of desired clusters. There are several ways to find the optimal number of clusters, such that the population is divided into k clusters in a way that:
- Points in the same cluster are closer to each other.
- Points in different clusters are far apart.
By observing the dendrograms, one can find the desired number of clusters.
A dendrogram is a diagrammatic representation of the hierarchical relationship between the data points. It illustrates the arrangement of the clusters produced by the corresponding analysis and is used to inspect the output of hierarchical (agglomerative) clustering.
Implementation of Dendrograms:
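A minimal sketch (again, not the author's original snippet) using SciPy's `linkage` and `dendrogram` functions, with the same placeholder file name, could look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the 2-dimensional sample dataset (file name is a placeholder).
X = pd.read_csv("sample_dataset.csv").values

# Build the merge hierarchy and plot it as a dendrogram.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.xlabel("data points")
plt.ylabel("Euclidean distance")
plt.show()
```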
Download the sample 2-dimensional dataset from here.
(Image by Author) Left: visualization of the sample dataset; Right: visualization of 3 clusters for the sample dataset.
For the above sample dataset, it is observed that the optimal number of clusters is 3. But for high-dimensional datasets, where visualizing the data is not possible, dendrograms play an important role in finding the optimal number of clusters.
How to find the optimal number of clusters by observing the dendrograms:
(Image by Author) Dendrogram for the above sample dataset.
From the above dendrogram plot, find the rectangle with the maximum height that does not cross any horizontal dendrogram line.
(Image by Author) Left: separating into 2 clusters; Right: separating into 3 clusters.
The dendrogram can be cut in the portion where the rectangle has the maximum height, and the optimal number of clusters is 3, as observed in the right part of the above image. The max-height rectangle is chosen because it represents the maximum Euclidean distance between the resulting clusters.
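The same rule can be sketched in code (an illustrative addition, not part of the original article): the widest gap between consecutive merge heights in the linkage matrix marks where to cut the tree, and SciPy's `fcluster` then assigns the flat cluster labels. The placeholder file name from the earlier sketches is reused.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

X = pd.read_csv("sample_dataset.csv").values  # placeholder file name, as above
Z = linkage(X, method="ward")

# Merge heights are stored in the third column of the linkage matrix.
heights = Z[:, 2]

# The widest gap between consecutive merge heights marks where to cut.
gaps = np.diff(heights)
cut = heights[gaps.argmax()] + gaps.max() / 2  # a threshold inside that gap

# Assign each point to a flat cluster by cutting the tree at this height.
labels = fcluster(Z, t=cut, criterion="distance")
print("optimal number of clusters:", len(np.unique(labels)))
```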
Conclusion:
In this article, we have discussed the intuition behind the agglomerative hierarchical clustering algorithm in depth. The algorithm has some disadvantages: it is not suitable for large datasets because of its high space and time complexity, and even observing the dendrogram to find the optimal number of clusters is very difficult for a large dataset.
Thank You for Reading
Translated from: https://towardsdatascience.com/agglomerative-clustering-and-dendrograms-explained-29fc12b85f23