Hierarchical Clustering: Agglomerative and Divisive, Explained
Hierarchical clustering is a method of cluster analysis used to group similar data points together. It follows either a top-down or a bottom-up approach to clustering.
What is Clustering?
Clustering is an unsupervised machine learning technique that divides a population into several clusters such that data points in the same cluster are similar to each other, while data points in different clusters are dissimilar.
- Points in the same cluster are close to each other.
- Points in different clusters are far apart.
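The two properties above can be checked numerically: within-cluster distances should be smaller than between-cluster distances. A minimal sketch with a hypothetical 2-D dataset (the points and the helper function are illustrative, not from the original article):

```python
import numpy as np

# Hypothetical 2-D dataset with two well-separated groups.
cluster_a = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.4]])
cluster_b = np.array([[5.0, 5.0], [5.3, 4.8], [4.9, 5.2]])

def mean_pairwise_distance(points_x, points_y):
    """Average Euclidean distance over every pair drawn from x and y."""
    diffs = points_x[:, None, :] - points_y[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()

within = mean_pairwise_distance(cluster_a, cluster_a)    # small
between = mean_pairwise_distance(cluster_a, cluster_b)   # large
```

For a well-clustered dataset, `within` is much smaller than `between`.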
In the sample 2-dimensional dataset above, it is visible that the dataset forms 3 clusters that are far apart, while points in the same cluster are close to each other.
Besides hierarchical clustering, there are several other types of clustering algorithms, such as k-Means clustering, DBSCAN, and many more. Read the article below to understand what k-means clustering is and how to implement it.
In this article, we cover hierarchical clustering and its two types.
There are two types of hierarchical clustering methods:
Divisive Clustering:
The divisive clustering algorithm is a top-down approach: initially, all the points in the dataset belong to one cluster, and splits are performed recursively as one moves down the hierarchy.
Steps of Divisive Clustering:
In the sample dataset above, it is observed that there are 3 clusters that are far apart from each other, so we stop after obtaining 3 clusters.
Even if we continue splitting into further clusters, the result below is obtained.
(Image by Author), Sample dataset separated into 4 clusters

How to choose which cluster to split?
Check the sum of squared errors (SSE) of each cluster and choose the one with the largest value. In the 2-dimensional dataset below, the data points are currently separated into 2 clusters. To separate them further and form a 3rd cluster, compute the SSE over the points in the red cluster and the blue cluster.
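This selection rule can be sketched in a few lines. The point coordinates below are hypothetical stand-ins for the red and blue clusters in the figure:

```python
import numpy as np

# Hypothetical points already assigned to two clusters.
red = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])   # spread out
blue = np.array([[6.0, 6.0], [6.1, 6.0], [6.0, 6.1]])              # tight

def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

# The cluster with the larger SSE is the candidate for the next split.
to_split = "red" if sse(red) > sse(blue) else "blue"
```

Here the spread-out red cluster has the larger SSE, so it would be split next.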
(Image by Author), Sample dataset separated into 2 clusters

The cluster with the largest SSE value is split into 2 clusters, hence forming a new cluster. In the image above, the red cluster has the larger SSE, so it is split into 2 clusters, giving 3 clusters in total.
How to split the above-chosen cluster?
Once we have decided which cluster to split, the question arises of how to split the chosen cluster into 2 clusters. One way is to use Ward’s criterion: choose the split that yields the largest reduction in the SSE criterion.
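One common way to approximate this in practice (an assumption here, not the article's stated method) is to run 2-means on the chosen cluster: k-means with k=2 directly minimizes within-cluster SSE, so it seeks the split with the largest SSE reduction. A sketch with synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical chosen cluster that actually contains two fused sub-groups.
chosen = np.vstack([
    np.random.RandomState(0).normal(loc=0.0, scale=0.3, size=(20, 2)),
    np.random.RandomState(1).normal(loc=4.0, scale=0.3, size=(20, 2)),
])

# 2-means minimizes within-cluster SSE, approximating the split that
# yields the largest reduction in the SSE criterion.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(chosen)
```

The two resulting label groups become the two child clusters in the hierarchy.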
How to handle noise or outliers?
The presence of an outlier or noise point can result in it forming a new cluster of its own. To handle noise in the dataset, use a threshold in the termination criterion, so that clusters that are too small are not generated.
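Such a threshold rule might look like the sketch below. The minimum size value and the helper name are assumptions for illustration; the article does not specify a concrete threshold:

```python
MIN_CLUSTER_SIZE = 5  # assumed threshold; tune for the dataset

def should_split(cluster_points, min_size=MIN_CLUSTER_SIZE):
    """Only split clusters large enough to yield two valid children,
    so isolated noise points never become tiny clusters of their own."""
    return len(cluster_points) >= 2 * min_size

large_enough = should_split(range(12))   # True: both children can reach min_size
too_small = should_split(range(7))       # False: likely noise, stop splitting
```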
Agglomerative Clustering:
Agglomerative clustering is a bottom-up approach: initially, each data point is a cluster of its own, and pairs of clusters are merged as one moves up the hierarchy.
Steps of Agglomerative Clustering:
In the sample dataset above, it is observed that 2 clusters are far apart from each other, so we stop after obtaining 2 clusters.
(Image by Author), Sample dataset separated into 2 clusters

How to join two clusters to form one cluster?
To obtain the desired number of clusters, the number of clusters needs to be reduced from the initial n clusters (where n equals the total number of data points). Two clusters are combined by computing the similarity between them.
There are several methods used to calculate the similarity between two clusters:
- Distance between the two closest points in the two clusters (single linkage).
- Distance between the two farthest points in the two clusters (complete linkage).
- The average distance between all pairs of points in the two clusters (average linkage).
- Distance between the centroids of the two clusters (centroid linkage).
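The four similarity measures above can be computed directly from the pairwise distance matrix between two clusters. A sketch on two hypothetical 1-D-style clusters:

```python
import numpy as np

a = np.array([[0.0, 0.0], [1.0, 0.0]])  # hypothetical cluster A
b = np.array([[3.0, 0.0], [5.0, 0.0]])  # hypothetical cluster B

# All pairwise Euclidean distances between points of A and points of B.
pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

single = pairwise.min()      # closest points
complete = pairwise.max()    # farthest points
average = pairwise.mean()    # average over all pairs
centroid = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))  # centroid distance
```

Note that average linkage and centroid linkage generally differ; they coincide here only because of the symmetric toy points.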
Each of the above similarity metrics has its own pros and cons.
Implementation:
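The author's original embedded code snippet is not preserved in this version. As a minimal sketch, agglomerative clustering can be run with scikit-learn's `AgglomerativeClustering` on a synthetic blob dataset (the library choice and parameters here are assumptions, not the author's exact code):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical 2-D dataset with 3 well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Bottom-up (agglomerative) clustering with Ward linkage,
# which merges the pair of clusters minimizing the SSE increase.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
```

The `linkage` parameter also accepts `"single"`, `"complete"`, and `"average"`, matching the similarity measures described above.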
(Code by Author)

Conclusion:
In this article, we discussed the intuition behind agglomerative and divisive hierarchical clustering algorithms in depth. A disadvantage of hierarchical algorithms is that they are not suitable for large datasets, because of their high space and time complexity.
Thank You for Reading
Translated from: https://towardsdatascience.com/hierarchical-clustering-agglomerative-and-divisive-explained-342e6b20d710