Machine Learning in Action (10): Grouping Unlabeled Data with the K-means Algorithm
- Chapter 10: Grouping Unlabeled Data with the K-means Algorithm
- 10.1 The K-means Clustering Algorithm
- 10.2 Using Post-processing to Improve Clustering Performance
- 10.3 The Bisecting K-means Algorithm
- 10.4 Summary
Chapter 10: Grouping Unlabeled Data with the K-means Algorithm
Cluster identification:
Cluster identification gives meaning to a clustering result. Given some data, clustering finds groups, and cluster identification tells us what each of those groups actually is.
The difference between clustering and classification:
In classification the target categories are known in advance; in clustering no categories are defined beforehand, which is why clustering is also called unsupervised classification.
10.1 The K-means Clustering Algorithm
K-means is an algorithm that finds k clusters in a given data set. The number of clusters k is supplied by the user, and each cluster is described by its centroid, the center of all the points belonging to it.
How the K-means algorithm works:
1) Randomly choose k initial points as centroids.
2) Assign each point in the data set to a cluster: find the centroid nearest to the point and assign the point to that centroid's cluster. Once every point has been assigned, update each cluster's centroid to the mean of all the points in that cluster. These two steps repeat until the assignments stop changing.
Pseudocode:

```
create k points as starting centroids (often randomly)
while the cluster assignment of any point has changed:
    for every point in the data set:
        for every centroid:
            compute the distance between the point and the centroid
        assign the point to the cluster with the nearest centroid
    for every cluster:
        update its centroid to the mean of all points in the cluster
```

Concretely:
1. Randomly pick k cluster centroids $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$.
2. Repeat the following until convergence:
   For every sample $i$, compute the cluster it should belong to:
   $$c^{(i)} := \arg\min_j \|x^{(i)} - \mu_j\|^2$$
   For every cluster $j$, recompute its centroid as the mean of the samples assigned to it:
   $$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$$
Here k is the number of clusters we fix in advance, $c^{(i)}$ is the index (a value from 1 to k) of the cluster whose centroid is nearest to sample $i$, and the centroid $\mu_j$ is our current guess at the center of the samples belonging to cluster $j$. In the star-cluster analogy: we want to group all the stars into k constellations, so we first pick k points in space (or k stars) at random to serve as the constellation centroids.
Step 1: for every star, compute its distance to each of the k centroids and assign the star to the nearest constellation, setting $c^{(i)}$; after this step every star belongs to a constellation.
Step 2: for every constellation, recompute its centroid $\mu_j$ by averaging the coordinates of all the stars inside it. Iterate steps 1 and 2 until the centroids stop moving, or move only very little.
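To make the two steps concrete, here is a minimal NumPy sketch of one assignment-plus-update iteration (the names `points`, `cents`, and `kmeans_step` are illustrative, not from the book):

```python
import numpy as np

def kmeans_step(points, cents):
    """One k-means iteration: assign every point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it.
    points: (n, d) array; cents: (k, d) array.
    Assumes every cluster keeps at least one point."""
    # squared Euclidean distance from every point to every centroid, shape (n, k)
    dists = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)                       # step 1: nearest centroid
    new_cents = np.array([points[labels == j].mean(axis=0)
                          for j in range(len(cents))])  # step 2: recompute means
    return labels, new_cents
```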
General workflow:
The figure below shows the result of running K-means on n sample points, with k set to 2.
The first question K-means has to answer is how to guarantee convergence. The algorithm above uses convergence as its stopping criterion, and it can be proven that K-means always converges.
Convergence:
First define the distortion function:

$$J(c, \mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

J is the sum of squared distances from every sample point to the centroid of its assigned cluster, and K-means tries to drive J to a minimum. Suppose J has not yet reached its minimum. We can first hold the centroids $\mu_j$ fixed and adjust each sample's assignment $c^{(i)}$ to decrease J; then, holding the assignments $c^{(i)}$ fixed, we can adjust the centroids $\mu_j$ to decrease J further. These two alternating steps are exactly the inner loop, and each makes J monotonically non-increasing. When J reaches a minimum, $\mu$ and $c$ have converged as well. (In theory several different pairs of $\mu$ and $c$ can attain the same minimum of J, but this is rarely seen in practice.)
Because the distortion function J is non-convex, we cannot guarantee that the minimum we reach is the global minimum; in other words, k-means is sensitive to the initial placement of the centroids. In most cases the local optimum k-means reaches is good enough. If you are worried about getting trapped in a poor local optimum, you can run k-means several times with different initial values and output the $\mu$ and $c$ of the run with the smallest J.
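A minimal sketch of that restart strategy, reusing the illustrative `kmeans_step` from the sketch above (the names and defaults are assumptions, not from the book):

```python
import numpy as np

def kmeans_restarts(points, k, n_runs=10, n_iters=100, seed=0):
    """Run k-means n_runs times from random starting centroids and
    keep the centroids/labels of the run with the smallest distortion J."""
    rng = np.random.default_rng(seed)
    best_J, best = np.inf, None
    for _ in range(n_runs):
        # seed each run with k distinct points drawn from the data
        cents = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iters):
            labels, cents = kmeans_step(points, cents)
        # re-assign once so the labels match the final centroids
        labels = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        J = ((points - cents[labels]) ** 2).sum()   # distortion of this run
        if J < best_J:
            best_J, best = J, (cents, labels)
    return best
```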
Support functions for K-means clustering:
```python
from numpy import *

def loadDataSet(fileName):
    """Load a tab-delimited data set.
    :param fileName: name of the file
    :return: list of data points, each a list of floats
    """
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        # map applies the given function to every element of a sequence;
        # the values read from the file are strings, so cast them to float
        # (wrapped in list() so this also works under Python 3)
        fltLine = list(map(float, curLine))
        dataMat.append(fltLine)
    return dataMat

def distEclud(vecA, vecB):
    """Compute the Euclidean distance between two vectors."""
    # power(x, 2) squares every element of the array
    return sqrt(sum(power(vecA - vecB, 2)))

def randCent(dataSet, k):
    """Build k random initial centroids.
    :param dataSet: data matrix
    :param k: number of centroids
    :return: k x n matrix of centroids
    """
    n = shape(dataSet)[1]            # number of features
    centroids = mat(zeros((k, n)))   # note: zeros takes a shape tuple
    for j in range(n):
        # use the min/max of each feature so the random centroids
        # stay inside the bounds of the data
        minJ = min(dataSet[:, j])
        maxJ = max(dataSet[:, j])
        rangeJ = float(maxJ - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids
```

Data set: testSet.txt
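A quick smoke test of these helpers might look like this (a sketch; it assumes testSet.txt, a tab-delimited file of 2-D points, sits in the working directory):

```python
datMat = mat(loadDataSet('testSet.txt'))   # matrix of 2-D points
print(randCent(datMat, 2))                 # 2 random centroids within the data bounds
print(distEclud(datMat[0], datMat[1]))     # distance between the first two points
```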
A complete, runnable k-means program (by zouxy09):

```python
#################################################
# kmeans: k-means cluster
# Author : zouxy
# Date : 2013-12-25
# HomePage : http://blog.csdn.net/zouxy09
# Email : zouxy09@qq.com
#################################################
from numpy import *
import time
import matplotlib.pyplot as plt

# calculate Euclidean distance
def euclDistance(vector1, vector2):
    return sqrt(sum(power(vector2 - vector1, 2)))

# init centroids with random samples
def initCentroids(dataSet, k):
    numSamples, dim = dataSet.shape
    centroids = zeros((k, dim))
    for i in range(k):
        index = int(random.uniform(0, numSamples))
        centroids[i, :] = dataSet[index, :]
    return centroids

# k-means cluster
def kmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster this sample belongs to,
    # second column stores the error between this sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))
    clusterChanged = True

    ## step 1: init centroids
    centroids = initCentroids(dataSet, k)

    while clusterChanged:
        clusterChanged = False
        ## for each sample
        for i in range(numSamples):
            minDist = 100000.0
            minIndex = 0
            ## for each centroid
            ## step 2: find the centroid which is closest
            for j in range(k):
                distance = euclDistance(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j
            ## step 3: update its cluster
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist ** 2
        ## step 4: update centroids
        for j in range(k):
            pointsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == j)[0]]
            centroids[j, :] = mean(pointsInCluster, axis=0)

    print('Congratulations, cluster complete!')
    return centroids, clusterAssment

# show your cluster (only available for 2-D data)
def showCluster(dataSet, k, centroids, clusterAssment):
    numSamples, dim = dataSet.shape
    if dim != 2:
        print("Sorry! I can not draw because the dimension of your data is not 2!")
        return 1

    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry! Your k is too large! please contact Zouxy")
        return 1

    # draw all samples
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])

    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    # draw the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)

    plt.show()

if __name__ == '__main__':
    ## step 1: load data
    print("step 1: load data...")
    dataSet = []
    fileIn = open('testSet.txt')
    for line in fileIn.readlines():
        lineArr = line.strip().split('\t')
        dataSet.append([float(lineArr[0]), float(lineArr[1])])

    ## step 2: clustering...
    print("step 2: clustering...")
    dataSet = mat(dataSet)
    k = 4
    centroids, clusterAssment = kmeans(dataSet, k)

    ## step 3: show the result
    print("step 3: show the result...")
    showCluster(dataSet, k, centroids, clusterAssment)
```

Result:
Each cluster is drawn in its own color, and the large diamonds are the mean centroids of the clusters.
Because basic K-means picks its initial centroids at random, its results vary from run to run and are often unsatisfactory; the final clustering frequently fails to separate the natural clusters. The bisecting K-means algorithm introduced below avoids this problem.
Centroids:

```
[[-3.53973889 -2.89384326]
 [ 3.54988221 -2.50520707]
 [ 1.27051487 -2.7993025 ]
 [-0.02298687  2.99472915]]
```

10.2 Using Post-processing to Improve Clustering Performance
k is a user-defined parameter, so how can we tell whether the chosen k is right and whether the resulting clusters are any good? The error of each point is the squared distance from the point to its cluster centroid.
In the figure above the cluster assignments are not quite right. K-means converged, but the clustering is poor because it converged to a local minimum rather than the global minimum.
One measure of clustering quality is the SSE (Sum of Squared Error), the sum of the error column (the second column) of the clusterAssment matrix. A smaller SSE means the points lie closer to their centroids, i.e. a better clustering. Because the errors are squared, points far from their centroid count more heavily. One way that is guaranteed to lower the SSE is to increase the number of clusters, but that defeats the purpose: the goal is to improve cluster quality without increasing the number of clusters.
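Given the `clusterAssment` matrix returned by the k-means code above (the squared error is stored in column 1), the total and per-cluster SSE take only a few lines; a sketch:

```python
from numpy import nonzero

totalSSE = clusterAssment[:, 1].sum()     # overall SSE
print('total SSE: %.3f' % totalSSE)
for j in range(k):                        # SSE of each cluster
    inJ = nonzero(clusterAssment[:, 0].A == j)[0]
    print('cluster %d SSE: %.3f' % (j, clusterAssment[inJ, 1].sum()))
```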
How can the result in Figure 10-2 be improved? One option is to post-process the generated clusters, for example by splitting the cluster with the largest SSE into two.
To keep the total number of clusters unchanged, two other clusters can then be merged.
How to merge (a sketch follows this list):
1) Merge the nearest centroids:
compute the distances between every pair of centroids and merge the two that are closest.
2) Merge the two centroids whose merger increases the total SSE the least.
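A minimal sketch of the first strategy (the function name is illustrative):

```python
import numpy as np

def closestCentroidPair(cents):
    """Return the indices (a, b) of the two closest centroids."""
    cents = np.asarray(cents)
    best, pair = np.inf, (0, 1)
    for a in range(len(cents)):
        for b in range(a + 1, len(cents)):
            d = np.sum((cents[a] - cents[b]) ** 2)  # squared distance suffices for argmin
            if d < best:
                best, pair = d, (a, b)
    return pair
```

The second strategy is more faithful to the SSE objective but costlier: for every candidate pair you would tentatively merge the two clusters, recompute the merged centroid, and keep the pair whose merge raises the total SSE the least.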
10.3 The Bisecting K-means Algorithm
The algorithm starts by treating all the points as one cluster and splitting it in two. It then repeatedly chooses one cluster to split, picking the cluster whose split reduces the SSE the most, until the user-specified number of clusters is reached.
Pseudocode:

```
treat all the points as one cluster
while the number of clusters is less than k:
    for every cluster:
        compute the total error
        run k-means with k = 2 on the points in that cluster
        compute the total error after splitting the cluster in two
    commit the split that yields the lowest total error
```

Code:
```python
#################################################
# kmeans: bisecting k-means cluster
# Author : zouxy
# Date : 2013-12-25
# HomePage : http://blog.csdn.net/zouxy09
# Email : zouxy09@qq.com
#################################################
from numpy import *
import time
import matplotlib.pyplot as plt

# euclDistance, initCentroids, kmeans and showCluster are identical to
# the listing in section 10.1 and are omitted here.

# bisecting k-means cluster
def biKmeans(dataSet, k):
    numSamples = dataSet.shape[0]
    # first column stores which cluster this sample belongs to,
    # second column stores the error between this sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))

    # step 1: the initial cluster is the whole data set
    centroid = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid]
    for i in range(numSamples):
        clusterAssment[i, 1] = euclDistance(mat(centroid), dataSet[i, :]) ** 2

    while len(centList) < k:
        # min sum of square error (original used 100000.0, which can fail on large data)
        minSSE = inf
        numCurrCluster = len(centList)
        # for each cluster
        for i in range(numCurrCluster):
            # step 2: get the samples in cluster i
            pointsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            # step 3: split it into 2 sub-clusters using k-means
            centroids, splitClusterAssment = kmeans(pointsInCurrCluster, 2)
            # step 4: calculate the sum of square error after splitting this cluster
            splitSSE = sum(splitClusterAssment[:, 1])
            notSplitSSE = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            currSplitSSE = splitSSE + notSplitSSE
            # step 5: remember the split that gives the min sum of square error
            if currSplitSSE < minSSE:
                minSSE = currSplitSSE
                bestCentroidToSplit = i
                bestNewCentroids = centroids.copy()
                bestClusterAssment = splitClusterAssment.copy()

        # step 6: rewrite the cluster indices of the two new sub-clusters
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 1)[0], 0] = numCurrCluster
        bestClusterAssment[nonzero(bestClusterAssment[:, 0].A == 0)[0], 0] = bestCentroidToSplit

        # step 7: update and append the centroids of the two new sub-clusters
        # (store plain lists so mat(centList) works at the end)
        centList[bestCentroidToSplit] = bestNewCentroids[0, :].tolist()[0]
        centList.append(bestNewCentroids[1, :].tolist()[0])

        # step 8: update the index and error of the samples whose cluster has changed
        # (the trailing [0] is needed; indexing with the bare nonzero tuple was a bug)
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentroidToSplit)[0], :] = bestClusterAssment

    return mat(centList), clusterAssment

if __name__ == '__main__':
    ## step 1: load data
    print("step 1: load data...")
    dataSet = []
    fileIn = open('testSet2.txt')
    for line in fileIn.readlines():
        lineArr = line.strip().split('\t')
        dataSet.append([float(lineArr[0]), float(lineArr[1])])

    ## step 2: clustering...
    print("step 2: clustering...")
    dataSet = mat(dataSet)
    k = 3
    centroids, clusterAssment = biKmeans(dataSet, k)

    ## step 3: show the result
    print("step 3: show the result...")
    showCluster(dataSet, k, centroids, clusterAssment)
```

Result:
10.4 Summary
1) Advantages:
Simple to implement and understand, and efficient in practice when k and the dimensionality are modest.
2) Disadvantages:
1. Convergence can be slow.
2. The time complexity is O(nkt), where n is the number of samples, k the number of clusters, and t the number of iterations.
3. It cannot discover non-convex clusters, or clusters that differ greatly in size.
4. The samples must have a computable mean, which restricts the kinds of data it can handle.
5. The number of clusters k must be fixed in advance.
6. It is sensitive to noise and outliers.
7. Most importantly, the result is only guaranteed to be a local optimum, not the global optimum.
3) Improved algorithms:
k-means++, among others.
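For reference, the heart of k-means++ is its seeding step: pick the first centroid uniformly at random, then pick each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far. A minimal sketch (assuming 2-D NumPy data; the function name is illustrative):

```python
import numpy as np

def kmeansppInit(points, k, seed=0):
    """k-means++ seeding: spread the initial centroids apart so the
    subsequent local search is far less sensitive to initialization."""
    rng = np.random.default_rng(seed)
    cents = [points[rng.integers(len(points))]]    # first centroid: uniform
    for _ in range(k - 1):
        # squared distance from every point to its nearest chosen centroid
        d2 = ((points[:, None, :] - np.array(cents)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()                      # D^2 weighting
        cents.append(points[rng.choice(len(points), p=probs)])
    return np.array(cents)
```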