dbscan java_DBSCAN算法的Java,C++,Python实现
最近由于要實(shí)現(xiàn)‘基于網(wǎng)格的DBSCAN算法’,網(wǎng)上有沒(méi)有找到現(xiàn)成的代碼[如果您有代碼,麻煩聯(lián)系我],只好參考已有的DBSCAN算法的實(shí)現(xiàn)。先從網(wǎng)上隨便找了幾篇放這兒,之后對(duì)比研究。
DBSCAN簡(jiǎn)介:
1.簡(jiǎn)介
DBSCAN 算法是一種基于密度的空間聚類(lèi)算法。該算法利用基于密度的聚類(lèi)的概念,即要求聚類(lèi)空間中的一定區(qū)域內(nèi)所包含對(duì)象(點(diǎn)或其它空間對(duì)象)的數(shù)目不小于某一給定閥 值。DBSCAN 算法的顯著優(yōu)點(diǎn)是聚類(lèi)速度快且能夠有效處理噪聲點(diǎn)和發(fā)現(xiàn)任意形狀的空間聚類(lèi)。但是由于它直接對(duì)整個(gè)數(shù)據(jù)庫(kù)進(jìn)行操作且進(jìn)行聚類(lèi)時(shí)使用了一個(gè)全局性的表征 密度的參數(shù),因此也具有兩個(gè)比較明顯的弱點(diǎn):
1. 當(dāng)數(shù)據(jù)量增大時(shí),要求較大的內(nèi)存支持 I/0 消耗也很大;
2. 當(dāng)空間聚類(lèi)的密度不均勻、聚類(lèi)間距離相差很大時(shí),聚類(lèi)質(zhì)量較差。
2.DBSCAN算法的聚類(lèi)過(guò)程
DBSCAN算法基于一個(gè)事實(shí):一個(gè)聚類(lèi)可以由其中的任何核心對(duì)象唯一確定。等價(jià)可以表述為: 任一滿(mǎn)足核心對(duì)象條件的數(shù)據(jù)對(duì)象p,數(shù)據(jù)庫(kù)D中所有從p密度可達(dá)的數(shù)據(jù)對(duì)象o 所組成的集合構(gòu)成了一個(gè)完整的聚類(lèi)C,且p屬于C。
3.DBSCAN中的幾個(gè)定義
密度可達(dá)是直接密度可達(dá)的傳遞閉包,非對(duì)稱(chēng)性關(guān)系;密度相連是對(duì)稱(chēng)性關(guān)系。DBSCA目的是找到密度相連對(duì)象的最大集合。
E領(lǐng)域:給定對(duì)象p半徑為E內(nèi)的區(qū)域稱(chēng)為該對(duì)象的E領(lǐng)域;
核心對(duì)象:p的E領(lǐng)域內(nèi)樣本數(shù)大于MinPts(算法輸入值),則該對(duì)象p為核心對(duì)象;
直接密度可達(dá):對(duì)于樣本集合D,如果樣本點(diǎn)q在p的E領(lǐng)域內(nèi),且p為核心對(duì)象,則p直接密度可達(dá)q;
密度可達(dá):對(duì)于樣本集合D,存在一串樣本點(diǎn)p1,p2,p3,...pn,其中連續(xù)兩個(gè)點(diǎn)直接密度可達(dá),則 p=p1,q=qn,則p密度可達(dá)q;
密度相連:對(duì)于樣本集合D中任意一點(diǎn)o,存在p到o密度可達(dá),并且q到o密度可達(dá),那么q從p密度相連;
算法偽代碼
1 DBSCAN(SetOfPoints, Eps, MinPts){2 ClusterId=nextId(NOISE)3 for(i=0;i
13 ExpandCluster(SetOfPoints,Point, ClId, Eps, MinPts){14 seeds=SetOfPoints.regionQuery(Point, Eps)15 if(seeds.size()0){22 currentP=seeds.first()23 result=SetOfPoints.regionQuery(currentP, Eps)24 if(result.size()>=MinPts){25 for(i=0;i
View Code
JAVA實(shí)現(xiàn):
1 packageorisun;2
3 importjava.io.File;4 importjava.util.ArrayList;5 importjava.util.Vector;6 importjava.util.Iterator;7
8 public classDBScan {9
10 double Eps=3; //區(qū)域半徑
11 int MinPts=4; //密度12
13 //由于自己到自己的距離是0,所以自己也是自己的neighbor
14 public Vector getNeighbors(DataObject p,ArrayListobjects){15 Vector neighbors=new Vector();16 Iterator iter=objects.iterator();17 while(iter.hasNext()){18 DataObject q=iter.next();19 double[] arr1=p.getVector();20 double[] arr2=q.getVector();21 int len=arr1.length;22
23 if(Global.calEditDist(arr1,arr2,len)<=Eps){ //使用編輯距離24 //if(Global.calEuraDist(arr1, arr2, len)<=Eps){//使用歐氏距離25 //if(Global.calCityBlockDist(arr1, arr2, len)<=Eps){//使用街區(qū)距離26 //if(Global.calSinDist(arr1, arr2, len)<=Eps){//使用向量夾角的正弦
27 neighbors.add(q);28 }29 }30 returnneighbors;31 }32
33 public int dbscan(ArrayListobjects){34 int clusterID=0;35 boolean AllVisited=false;36 while(!AllVisited){37 Iterator iter=objects.iterator();38 while(iter.hasNext()){39 DataObject p=iter.next();40 if(p.isVisited())41 continue;42 AllVisited=false;43 p.setVisited(true); //設(shè)為visited后就已經(jīng)確定了它是核心點(diǎn)還是邊界點(diǎn)
44 Vector neighbors=getNeighbors(p,objects);45 if(neighbors.size()
48 }else{49 if(p.getCid()<=0){50 clusterID++;51 expandCluster(p,neighbors,clusterID,objects);52 }else{53 int iid=p.getCid();54 expandCluster(p,neighbors,iid,objects);55 }56 }57 AllVisited=true;58 }59 }60 returnclusterID;61 }62
63 private void expandCluster(DataObject p, Vectorneighbors,64 int clusterID,ArrayListobjects) {65 p.setCid(clusterID);66 Iterator iter=neighbors.iterator();67 while(iter.hasNext()){68 DataObject q=iter.next();69 if(!q.isVisited()){70 q.setVisited(true);71 Vector qneighbors=getNeighbors(q,objects);72 if(qneighbors.size()>=MinPts){73 Iterator it=qneighbors.iterator();74 while(it.hasNext()){75 DataObject no=it.next();76 if(no.getCid()<=0)77 no.setCid(clusterID);78 }79 }80 }81 if(q.getCid()<=0){ //q不是任何簇的成員
82 q.setCid(clusterID);83 }84 }85 }86
87 public static voidmain(String[] args){88 DataSource datasource=newDataSource();89 //Eps=3,MinPts=4
90 datasource.readMatrix(new File("/home/orisun/test/dot.mat"));91 datasource.readRLabel(new File("/home/orisun/test/dot.rlabel"));92 //Eps=2.5,MinPts=493 //datasource.readMatrix(new File("/home/orisun/text.normalized.mat"));94 //datasource.readRLabel(new File("/home/orisun/text.rlabel"));
95 DBScan ds=newDBScan();96 int clunum=ds.dbscan(datasource.objects);97 datasource.printResult(datasource.objects,clunum);98 }99 }
View Code
C++實(shí)現(xiàn):
數(shù)據(jù)結(jié)構(gòu)
1 #include
2
3 using namespacestd;4
5 const int DIME_NUM=2; //數(shù)據(jù)維度為2,全局常量6
7 //數(shù)據(jù)點(diǎn)類(lèi)型
8 classDataPoint9 {10 private:11 unsigned long dpID; //數(shù)據(jù)點(diǎn)ID
12 double dimension[DIME_NUM]; //維度數(shù)據(jù)
13 long clusterId; //所屬聚類(lèi)ID
14 bool isKey; //是否核心對(duì)象
15 bool visited; //是否已訪(fǎng)問(wèn)
16 vector arrivalPoints; //領(lǐng)域數(shù)據(jù)點(diǎn)id列表
17 public:18 DataPoint(); //默認(rèn)構(gòu)造函數(shù)
19 DataPoint(unsigned long dpID,double* dimension , bool isKey); //構(gòu)造函數(shù)
20
21 unsigned long GetDpId(); //GetDpId方法
22 void SetDpId(unsigned long dpID); //SetDpId方法
23 double* GetDimension(); //GetDimension方法
24 void SetDimension(double* dimension); //SetDimension方法
25 bool IsKey(); //GetIsKey方法
26 void SetKey(bool isKey); //SetKey方法
27 bool isVisited(); //GetIsVisited方法
28 void SetVisited(bool visited); //SetIsVisited方法
29 long GetClusterId(); //GetClusterId方法
30 void SetClusterId(long classId); //SetClusterId方法
31 vector& GetArrivalPoints(); //GetArrivalPoints方法
32 };
View Code
實(shí)現(xiàn)
1 #include "DataPoint.h"
2
3 //默認(rèn)構(gòu)造函數(shù)
4 DataPoint::DataPoint()5 {6 }7
8 //構(gòu)造函數(shù)
9 DataPoint::DataPoint(unsigned long dpID,double* dimension , boolisKey):isKey(isKey),dpID(dpID)10 {11 //傳遞每維的維度數(shù)據(jù)
12 for(int i=0; idimension[i]=dimension[i];15 }16 }17
18 //設(shè)置維度數(shù)據(jù)
19 void DataPoint::SetDimension(double*dimension)20 {21 for(int i=0; idimension[i]=dimension[i];24 }25 }26
27 //獲取維度數(shù)據(jù)
28 double*DataPoint::GetDimension()29 {30 return this->dimension;31 }32
33 //獲取是否為核心對(duì)象
34 boolDataPoint::IsKey()35 {36 return this->isKey;37 }38
39 //設(shè)置核心對(duì)象標(biāo)志
40 void DataPoint::SetKey(boolisKey)41 {42 this->isKey =isKey;43 }44
45 //獲取DpId方法
46 unsigned longDataPoint::GetDpId()47 {48 return this->dpID;49 }50
51 //設(shè)置DpId方法
52 void DataPoint::SetDpId(unsigned longdpID)53 {54 this->dpID =dpID;55 }56
57 //GetIsVisited方法
58 boolDataPoint::isVisited()59 {60 return this->visited;61 }62
63
64 //SetIsVisited方法
65 void DataPoint::SetVisited( boolvisited )66 {67 this->visited =visited;68 }69
70 //GetClusterId方法
71 longDataPoint::GetClusterId()72 {73 return this->clusterId;74 }75
76 //GetClusterId方法
77 void DataPoint::SetClusterId( longclusterId )78 {79 this->clusterId =clusterId;80 }81
82 //GetArrivalPoints方法
83 vector&DataPoint::GetArrivalPoints()84 {85 returnarrivalPoints;86 }
View Code
PYTHON實(shí)現(xiàn):
1 from matplotlib.pyplot import *
2 fromcollections import defaultdict3 import random4
5 #function to calculate distance6 def dist(p1, p2):7 return ((p1[0]-p2[0])**2+ (p1[1]-p2[1])**2)**(0.5)8
9 #randomly generate around 100cartesian coordinates10 all_points=[]11
12 for i in range(100):13 randCoord = [random.randint(1,50), random.randint(1,50)]14 if not randCoord inall_points:15 all_points.append(randCoord)16
17
18 #take radius = 8 and min. points = 8
19 E = 8
20 minPts = 8
21
22 #find outthe core points23 other_points =[]24 core_points=[]25 plotted_points=[]26 for point inall_points:27 point.append(0) # assign initial level 0
28 total = 0
29 for otherPoint inall_points:30 distance =dist(otherPoint,point)31 if distance<=E:32 total+=1
33
34 if total >minPts:35 core_points.append(point)36 plotted_points.append(point)37 else:38 other_points.append(point)39
40 #find border points41 border_points=[]42 for core incore_points:43 for other inother_points:44 if dist(core,other)<=E:45 border_points.append(other)46 plotted_points.append(other)47
48
49 #implement the algorithm50 cluster_label=0
51
52 for point incore_points:53 if point[2]==0:54 cluster_label+=1
55 point[2]=cluster_label56
57 for point2 inplotted_points:58 distance =dist(point2,point)59 if point2[2] ==0 and distance<=E:60 print point, point261 point2[2] =point[2]62
63
64 #after the points are asssigned correnponding labels, we group them65 cluster_list =defaultdict(lambda: [[],[]])66 for point inplotted_points:67 cluster_list[point[2]][0].append(point[0])68 cluster_list[point[2]][1].append(point[1])69
70 markers = ['+','*','.','d','^','v','>','
72 #plotting the clusters73 i=0
74 print cluster_list75 for value incluster_list:76 cluster=cluster_list[value]77 plot(cluster[0], cluster[1],markers[i])78 i = i%10+1
79
80 #plot the noise points aswell81 noise_points=[]82 for point inall_points:83 if not point in core_points and not point inborder_points:84 noise_points.append(point)85 noisex=[]86 noisey=[]87 for point innoise_points:88 noisex.append(point[0])89 noisey.append(point[1])90 plot(noisex, noisey, "x")91
92 title(str(len(cluster_list))+"clusters created with E ="+str(E)+"Min Points="+str(minPts)+"total points="+str(len(all_points))+"noise Points ="+str(len(noise_points)))93 axis((0,60,0,60))94 show()
View Code
參考:http://www.cnblogs.com/zhangchaoyang/articles/2182748.html
http://www.cnblogs.com/lovell-liu/archive/2011/11/08/2241542.html
http://blog.sudipk.com.np/2013/02/implementation-of-dbscan-algorithm-for.html
http://caoyaqiang.diandian.com/post/2012-09-26/40039517485
總結(jié)
以上是生活随笔為你收集整理的dbscan java_DBSCAN算法的Java,C++,Python实现的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: java 串的顺序存储_算法入门之串的顺
- 下一篇: html5小游戏是用js做的吗,谁说做H