Clustering Algorithms: BIRCH Explained
1 Principle
1.1 B-tree
(1) m-way search tree
An m-way search tree is either an empty tree or a tree satisfying the following properties:
- A node has at most m subtrees and the structure (n, P0, K1, P1, K2, P2, ..., Kn, Pn), where each Pi is a pointer to a subtree, each Ki is a key, and K1 < K2 < ... < Kn.
- All keys in subtree Pi (for 0 < i < n) are greater than Ki and less than Ki+1.
- All keys in subtree Pn are greater than Kn.
- All keys in subtree P0 are less than K1.
- Every subtree Pi is itself an m-way search tree.
(2) B-tree
A B-tree of order m is an m-way search tree that is either empty or satisfies the following properties:
- Every node has at most m subtrees.
- The root node has at least two subtrees.
- Every non-terminal node other than the root has at least ⌈m/2⌉ subtrees.
- All leaf nodes are on the same level.
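To make the structure concrete, here is a minimal sketch of an m-way search-tree node and a search routine following the properties above (illustrative only; `MWayNode` and `search` are invented names, not from any library):

```python
import bisect

class MWayNode:
    """A node of an m-way search tree: sorted keys K1 < ... < Kn plus
    n + 1 child pointers P0..Pn, with n + 1 <= m."""

    def __init__(self, keys, children=None):
        self.keys = keys                 # K1..Kn, kept in ascending order
        self.children = children or []   # P0..Pn; empty list for a leaf

def search(node, key):
    """Locate `key` by following the key-range properties above."""
    if node is None:
        return False
    i = bisect.bisect_left(node.keys, key)   # first position with keys[i] >= key
    if i < len(node.keys) and node.keys[i] == key:
        return True
    if not node.children:                    # reached a leaf without finding it
        return False
    # Child P_i holds exactly the keys between K_i and K_(i+1).
    return search(node.children[i], key)
```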
1.2 Steps
For a detailed step-by-step simulation of the algorithm, see: https://www.cnblogs.com/pinard/p/6179132.html

BIRCH can recognize uneven density in a data set: points in dense regions are grouped into clusters, while points in sparse regions are treated as outliers and removed. BIRCH is also an incremental clustering method: the clustering decision for each point is based only on the points processed so far, not on the global data set.
① Build the CF tree
First, scan all the data and build an initial in-memory CF tree, within the given memory budget and the recycling space on disk, that reflects the clustering structure of the data set. Dense regions are grouped into fine subclusters, and points in sparse regions are removed as outliers.
② Condense the CF tree
This phase is optional and bridges phase ① and phase ③. Much like phase ①, it rescans the leaf entries of the initial CF tree to build a smaller tree, removing more outliers and grouping crowded subclusters into larger ones.
③ Global clustering
Apply a global or semi-global clustering algorithm to the complete set of leaf entries. An ordinary point-based clustering algorithm adapts easily to a set of subclusters, each represented by its CF vector: compute each subcluster's centroid, let the centroid stand in for the subcluster, and cluster the centroids. This phase captures the major patterns of the data distribution.
④ Cluster refinement
Because phase ③ works on a coarse summary and the raw data has been scanned only once, the clusters can be refined further. Using the centroids produced in phase ③ as seeds, the data points are redistributed to their closest seed, yielding a new set of clusters. This allows points to migrate between subclusters and ensures that all duplicates of a given data point land in the same cluster. The phase also offers an option to discard outliers: a point that is too far from its closest seed is treated as an outlier and excluded from the result.
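Underlying all four phases is the clustering feature triple CF = (N, LS, SS): the number of points in a subcluster, their linear sum, and the sum of their squared l2 norms. Centroid and radius can be derived from the triple alone, and two CFs merge by component-wise addition, which is what keeps phase ① incremental. A small illustrative check (toy values, not from the sklearn source):

```python
import numpy as np

X = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 2.0]])  # toy subcluster of 3 points

# Clustering feature triple CF = (N, LS, SS)
N = len(X)
LS = X.sum(axis=0)        # linear sum of the points
SS = (X ** 2).sum()       # sum of squared l2 norms

centroid = LS / N
# radius^2 = mean squared distance to the centroid = SS/N - ||centroid||^2,
# computable from the triple alone, without revisiting the points
radius = np.sqrt(SS / N - centroid @ centroid)

assert np.allclose(centroid, X.mean(axis=0))
assert np.allclose(radius, np.sqrt(((X - centroid) ** 2).sum(axis=1).mean()))

# CF additivity: merging two subclusters is component-wise addition of triples
N12, LS12, SS12 = N + N, LS + LS, SS + SS  # e.g. merging two identical subclusters
```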
2 Parameters
Class: sklearn.cluster.Birch
Parameters:
- threshold: (float, default=0.5) The radius of the subcluster obtained by merging a new sample with the closest subcluster must stay below this threshold; otherwise a new subcluster is started, which may in turn split the node.
- branching_factor: (int, default=50) Maximum number of CF subclusters in each node.
- n_clusters: (int, default=3) Number of clusters for the final clustering step. If None, the final clustering step is skipped and the subclusters are returned as they are; if an sklearn.cluster estimator instance, that model is fit on the subcluster centroids treated as new samples.
- compute_labels: (bool, default=True) Whether to compute labels for each fit.
- copy: (bool, default=True) Whether to make a copy of the given data. If set to False, the initial data will be overwritten.
Attributes:
- root_: Root of the CF tree.
- dummy_leaf_: Start pointer to all the leaf nodes.
- subcluster_centers_: Centroids of all subclusters, read directly from the leaves.
- subcluster_labels_: Labels assigned to the subcluster centroids after global clustering.
- labels_: Labels of all the input data points.
3 Implementation
See the scikit-learn example comparing BIRCH with MiniBatchKMeans: https://scikit-learn.org/stable/auto_examples/cluster/plot_birch_vs_minibatchkmeans.html#sphx-glr-auto-examples-cluster-plot-birch-vs-minibatchkmeans-py
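For something runnable inline, here is a condensed sketch in the spirit of that example; the data set and parameter values are made up for illustration:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data with a known number of blobs (values are illustrative)
X, _ = make_blobs(n_samples=10_000, centers=3, random_state=0)

# n_clusters=3 triggers the final global-clustering step on the subcluster
# centroids; n_clusters=None would return the raw subclusters instead.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(labels[:10])                           # final cluster label per sample
print(model.subcluster_centers_.shape)       # centroids of the leaf subclusters
print(np.unique(model.subcluster_labels_))   # subcluster -> final cluster mapping
```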
4 Source-code walkthrough
The source lives at Anaconda3/Lib/site-packages/sklearn/cluster/birch.py.
(1) Prerequisites
- hasattr() checks whether a class instance has an attribute or method with a given name, returning True or False. Signature: hasattr(obj, name), where obj is a class instance and name is the attribute or method name.
- getattr() fetches the value of a named attribute from a class instance. Signature: getattr(obj, name[, default]), where obj is the instance, name is the attribute name, and default is an optional fallback: if the lookup fails and no default is given, an AttributeError is raised; otherwise the default value is returned.
- setattr() is the most versatile of the three: its basic use is to change the value of an attribute on an instance, but it can also dynamically add new attributes or methods to an instance.
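A quick illustration of all three built-ins (the `Point` class is invented for the demo):

```python
class Point:
    def __init__(self):
        self.x = 1.0

p = Point()

print(hasattr(p, 'x'))       # True: the attribute exists
print(hasattr(p, 'y'))       # False: it does not

print(getattr(p, 'x'))       # 1.0
print(getattr(p, 'y', 0.0))  # 0.0: the fallback instead of an AttributeError

setattr(p, 'y', 2.0)         # dynamically add a new attribute
print(p.y)                   # 2.0
```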
(2) The Birch class
- Birch(BaseEstimator, TransformerMixin, ClusterMixin) inherits from the base classes defined in sklearn's base module.
- The constructor parameters are those described in section 2.
- The fit method is the public entry point; the core computation lives in _fit.
Other helper functions:
Iterating over a sparse matrix, yielding one densified row at a time:
```python
def _iterate_sparse_X(X):
    """This little hack returns a densified row when iterating over a sparse
    matrix, instead of constructing a sparse matrix for every row that is
    expensive.
    """
    n_samples = X.shape[0]
    X_indices = X.indices
    X_data = X.data
    X_indptr = X.indptr

    for i in range(n_samples):
        # Densify row i from the CSR buffers.
        row = np.zeros(X.shape[1])
        startptr, endptr = X_indptr[i], X_indptr[i + 1]
        nonzero_indices = X_indices[startptr:endptr]
        row[nonzero_indices] = X_data[startptr:endptr]
        yield row
```

Splitting a node: two empty subclusters and two CF nodes are created, and each node is attached as the child of one subcluster; if the node being split is a leaf, the doubly linked list of leaves is rewired. The pairwise squared distances between the node's subcluster centroids are then computed, the two subclusters farthest apart are chosen as seeds, and every subcluster is assigned to whichever seed it is closer to:
```python
def _split_node(node, threshold, branching_factor):
    """The node has to be split if there is no place for a new subcluster
    in the node.
    1. Two empty nodes and two empty subclusters are initialized.
    2. The pair of distant subclusters are found.
    3. The properties of the empty subclusters and nodes are updated
       according to the nearest distance between the subclusters to the
       pair of distant subclusters.
    4. The two nodes are set as children to the two subclusters.
    """
    new_subcluster1 = _CFSubcluster()
    new_subcluster2 = _CFSubcluster()
    new_node1 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf, n_features=node.n_features)
    new_node2 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf, n_features=node.n_features)
    new_subcluster1.child_ = new_node1
    new_subcluster2.child_ = new_node2

    if node.is_leaf:
        if node.prev_leaf_ is not None:
            node.prev_leaf_.next_leaf_ = new_node1
        new_node1.prev_leaf_ = node.prev_leaf_
        new_node1.next_leaf_ = new_node2
        new_node2.prev_leaf_ = new_node1
        new_node2.next_leaf_ = node.next_leaf_
        if node.next_leaf_ is not None:
            node.next_leaf_.prev_leaf_ = new_node2

    dist = euclidean_distances(
        node.centroids_, Y_norm_squared=node.squared_norm_, squared=True)
    n_clusters = dist.shape[0]

    farthest_idx = np.unravel_index(
        dist.argmax(), (n_clusters, n_clusters))
    node1_dist, node2_dist = dist[(farthest_idx,)]

    node1_closer = node1_dist < node2_dist
    for idx, subcluster in enumerate(node.subclusters_):
        if node1_closer[idx]:
            new_node1.append_subcluster(subcluster)
            new_subcluster1.update(subcluster)
        else:
            new_node2.append_subcluster(subcluster)
            new_subcluster2.update(subcluster)
    return new_subcluster1, new_subcluster2
```

Retrieving the leaf nodes:
```python
def _get_leaves(self):
    """Retrieve the leaves of the CF Node.

    Returns
    -------
    leaves : list of shape (n_leaves,)
        List of the leaf nodes.
    """
    leaf_ptr = self.dummy_leaf_.next_leaf_
    leaves = []
    while leaf_ptr is not None:
        leaves.append(leaf_ptr)
        leaf_ptr = leaf_ptr.next_leaf_
    return leaves
```

Global clustering: this step brings in the AgglomerativeClustering algorithm (a topic for a separate write-up).
```python
def _global_clustering(self, X=None):
    """Global clustering for the subclusters obtained after fitting"""
    clusterer = self.n_clusters
    centroids = self.subcluster_centers_
    compute_labels = (X is not None) and self.compute_labels

    # Preprocessing for the global clustering.
    not_enough_centroids = False
    if isinstance(clusterer, numbers.Integral):
        clusterer = AgglomerativeClustering(n_clusters=self.n_clusters)
        # There is no need to perform the global clustering step.
        if len(centroids) < self.n_clusters:
            not_enough_centroids = True
    elif (clusterer is not None and not
          hasattr(clusterer, 'fit_predict')):
        raise ValueError("n_clusters should be an instance of "
                         "ClusterMixin or an int")

    # To use in predict to avoid recalculation.
    self._subcluster_norms = row_norms(
        self.subcluster_centers_, squared=True)

    if clusterer is None or not_enough_centroids:
        self.subcluster_labels_ = np.arange(len(centroids))
        if not_enough_centroids:
            warnings.warn(
                "Number of subclusters found (%d) by Birch is less "
                "than (%d). Decrease the threshold."
                % (len(centroids), self.n_clusters), ConvergenceWarning)
    else:
        # The global clustering step that clusters the subclusters of
        # the leaves. It assumes the centroids of the subclusters as
        # samples and finds the final centroids.
        self.subcluster_labels_ = clusterer.fit_predict(
            self.subcluster_centers_)

    if compute_labels:
        self.labels_ = self.predict(X)
```
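One practical consequence of _global_clustering is that n_clusters accepts three kinds of values, matching the branches above. A sketch of the three usages (synthetic data; the choice of MiniBatchKMeans is arbitrary, any estimator with fit_predict works):

```python
from sklearn.cluster import Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2_000, centers=4, random_state=0)

# int: the subcluster centroids are clustered with AgglomerativeClustering
birch_int = Birch(n_clusters=4).fit(X)

# None: the global step is skipped; every leaf subcluster is its own cluster
birch_none = Birch(n_clusters=None).fit(X)

# estimator: anything with fit_predict may consume the centroids as samples
birch_est = Birch(n_clusters=MiniBatchKMeans(n_clusters=4, n_init=3)).fit(X)

print(len(set(birch_none.labels_)))  # typically far more than 4
```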
(3) _CFNode

| Parameter | Description | Attribute | Description |
| --- | --- | --- | --- |
| threshold: float | Radius threshold for absorbing a new subcluster | subclusters_: list | Subclusters stored in this node |
| branching_factor: int | Branching factor, i.e. the maximum number of subclusters per node | prev_leaf_: _CFNode | Previous leaf node |
| is_leaf: bool | Whether the node is a leaf | next_leaf_: _CFNode | Next leaf node |
| n_features: int | Number of features | init_centroids_ | Centroid buffer, shape=(branching_factor + 1, n_features) |
|  |  | init_sq_norm_ | Squared-norm buffer, shape=(branching_factor + 1,) |
|  |  | centroids_ | View of the centroids currently in use |
|  |  | squared_norm_ | View of the squared norms currently in use |
_CFNode consists of three methods.
The first method, append_subcluster(self, subcluster), appends a subcluster and updates the node's CF statistics:
```python
def append_subcluster(self, subcluster):
    # Number of subclusters already stored in this node
    n_samples = len(self.subclusters_)
    # Append the new subcluster to the node
    self.subclusters_.append(subcluster)
    # Record the new subcluster's centroid and squared norm
    self.init_centroids_[n_samples] = subcluster.centroid_
    self.init_sq_norm_[n_samples] = subcluster.sq_norm_

    # Keep centroids and squared norm as views. In this way
    # if we change init_centroids and init_sq_norm_, it is
    # sufficient.
    self.centroids_ = self.init_centroids_[:n_samples + 1, :]
    self.squared_norm_ = self.init_sq_norm_[:n_samples + 1]
```

The second method, update_split_subclusters(self, subcluster, new_subcluster1, new_subcluster2), replaces a split subcluster with the two halves produced by _split_node:
```python
def update_split_subclusters(self, subcluster,
                             new_subcluster1, new_subcluster2):
    """Remove a subcluster from a node and update it with the
    split subclusters.
    """
    ind = self.subclusters_.index(subcluster)
    self.subclusters_[ind] = new_subcluster1
    self.init_centroids_[ind] = new_subcluster1.centroid_
    self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
    self.append_subcluster(new_subcluster2)
```

The third method, insert_cf_subcluster(self, subcluster), inserts a new CF subcluster into the node:
```python
def insert_cf_subcluster(self, subcluster):
    """Insert a new subcluster into the node."""
    # If the node has no subclusters yet, just append the new one
    if not self.subclusters_:
        self.append_subcluster(subcluster)
        return False

    threshold = self.threshold
    branching_factor = self.branching_factor

    # We need to find the closest subcluster among all the
    # subclusters so that we can insert our new subcluster.
    dist_matrix = np.dot(self.centroids_, subcluster.centroid_)
    dist_matrix *= -2.
    dist_matrix += self.squared_norm_
    closest_index = np.argmin(dist_matrix)
    closest_subcluster = self.subclusters_[closest_index]

    # If the closest subcluster has a child node, we need a recursive
    # strategy, updating the CF entries on the way back up.
    if closest_subcluster.child_ is not None:
        split_child = closest_subcluster.child_.insert_cf_subcluster(
            subcluster)

        if not split_child:
            # If it is determined that the child need not be split, we
            # can just update the closest_subcluster
            closest_subcluster.update(subcluster)
            self.init_centroids_[closest_index] = \
                self.subclusters_[closest_index].centroid_
            self.init_sq_norm_[closest_index] = \
                self.subclusters_[closest_index].sq_norm_
            return False

        # things not too good. we need to redistribute the subclusters in
        # our child node, and add a new subcluster in the parent
        # subcluster to accommodate the new child.
        else:
            new_subcluster1, new_subcluster2 = _split_node(
                closest_subcluster.child_, threshold, branching_factor)
            self.update_split_subclusters(
                closest_subcluster, new_subcluster1, new_subcluster2)

            if len(self.subclusters_) > self.branching_factor:
                return True
            return False

    # good to go!
    else:
        # If merging keeps the subcluster radius below the threshold,
        # absorb the new subcluster and update the CF entries
        merged = closest_subcluster.merge_subcluster(
            subcluster, self.threshold)
        if merged:
            self.init_centroids_[closest_index] = \
                closest_subcluster.centroid_
            self.init_sq_norm_[closest_index] = \
                closest_subcluster.sq_norm_
            return False

        # not close to any other subclusters, and we still
        # have space, so add.
        elif len(self.subclusters_) < self.branching_factor:
            self.append_subcluster(subcluster)
            return False

        # We do not have enough space nor is it closer to an
        # other subcluster. We need to split.
        else:
            self.append_subcluster(subcluster)
            return True
```
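A detail worth pausing on: dist_matrix above is not the true squared distance but ||c||² − 2·c·x, dropping the constant ||x||² term, which cannot change the argmin. A quick numeric check with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
centroids = rng.normal(size=(5, 3))          # stand-in for self.centroids_
squared_norm = (centroids ** 2).sum(axis=1)  # stand-in for self.squared_norm_
x = np.array([0.5, -1.0, 2.0])               # centroid of the incoming subcluster

partial = squared_norm - 2.0 * centroids @ x  # what insert_cf_subcluster builds
full = ((centroids - x) ** 2).sum(axis=1)     # true squared distances

assert np.argmin(partial) == np.argmin(full)  # same closest subcluster
assert np.allclose(full - partial, x @ x)     # they differ by the constant ||x||^2
```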
(4) _CFSubcluster

| Parameter | Description | Attribute | Description |
| --- | --- | --- | --- |
| linear_sum: ndarray | Sample from which the subcluster is initialized | n_samples_: int | Number of samples in the subcluster |
|  |  | linear_sum_: ndarray | Linear sum of all the samples in the subcluster |
|  |  | squared_sum_: float | Sum of the squared l2 norms of all the samples |
|  |  | centroid_ | Centroid of the subcluster |
|  |  | child_ | Child _CFNode of the subcluster, if any |
|  |  | sq_norm_ | Squared norm of the centroid |
_CFSubcluster consists of three methods.
The first method, update(self, subcluster), updates the running statistics (sample count, linear sum, squared sum, centroid, and squared norm):
```python
def update(self, subcluster):
    self.n_samples_ += subcluster.n_samples_
    self.linear_sum_ += subcluster.linear_sum_
    self.squared_sum_ += subcluster.squared_sum_
    self.centroid_ = self.linear_sum_ / self.n_samples_
    self.sq_norm_ = np.dot(self.centroid_, self.centroid_)
```

The second method, merge_subcluster(self, nominee_cluster, threshold), merges a candidate subcluster into this one if the merged radius stays within the threshold:
```python
def merge_subcluster(self, nominee_cluster, threshold):
    """Check if a cluster is worthy enough to be merged. If
    yes then merge.
    """
    new_ss = self.squared_sum_ + nominee_cluster.squared_sum_
    new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
    new_n = self.n_samples_ + nominee_cluster.n_samples_
    new_centroid = (1 / new_n) * new_ls
    new_norm = np.dot(new_centroid, new_centroid)
    dot_product = (-2 * new_n) * new_norm
    sq_radius = (new_ss + dot_product) / new_n + new_norm
    if sq_radius <= threshold ** 2:
        (self.n_samples_, self.linear_sum_, self.squared_sum_,
         self.centroid_, self.sq_norm_) = \
            new_n, new_ls, new_ss, new_centroid, new_norm
        return True
    return False
```

The third method, radius, is a property that returns the radius of the subcluster (the root-mean-square distance of its samples to the centroid):
```python
@property
def radius(self):
    """Return radius of the subcluster"""
    dot_product = -2 * np.dot(self.linear_sum_, self.centroid_)
    return sqrt(
        ((self.squared_sum_ + dot_product) / self.n_samples_) +
        self.sq_norm_)
```
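To watch these three methods work together, one can instantiate the private class directly. This is a sketch only: _CFSubcluster is not public API, and the import path shown matches the pre-0.22 layout quoted at the top of this section (newer versions moved it to sklearn.cluster._birch):

```python
import numpy as np
from sklearn.cluster.birch import _CFSubcluster  # private API; path varies by version

# Each subcluster starts from a single sample (n_samples_ == 1).
a = _CFSubcluster(linear_sum=np.array([1.0, 2.0]))
b = _CFSubcluster(linear_sum=np.array([1.2, 2.2]))

# Merge succeeds because the merged radius (~0.14) is below the threshold.
merged = a.merge_subcluster(b, threshold=0.5)
print(merged, a.n_samples_, a.centroid_, a.radius)  # True 2 [1.1 2.1] ~0.141
```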