PyTorch Graph Neural Networks in Practice (7): Community Detection
Contents
- Preface
- Combinatorial Optimization
- Community Detection
- End-to-End Learning and Optimization
- About the Authors
- Core Idea
- Technical Approach
- Methodological Innovation
- Code Reproduction
- Importing Packages
- Data Conversion
- The ClusterNet Model
- Creating the Network
- Parameter Settings and Data Loading
- Training the Network
Preface
I have been working on combinatorial optimization problems recently. Last week I came across a NeurIPS 2019 paper that proposes an end-to-end learning and optimization framework and releases its code, so I reproduced it. The results on the community detection task are genuinely good, and the method is very simple.
Paper (NeurIPS 2019): End to end learning and optimization on graphs
GitHub source code
Combinatorial Optimization
Many problems on graphs are combinatorial optimization problems, for example maximum independent set, minimum vertex cover, graph partitioning, and shortest paths. Many of them are NP-hard, with no known polynomial-time algorithm, so they have traditionally been tackled with greedy algorithms or heuristics (genetic algorithms, particle swarm optimization, and so on).
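To make this concrete, here is a minimal sketch (my own illustration, not from the paper) of a greedy heuristic for one of these NP-hard problems, the maximum independent set; the helper name `greedy_independent_set` is just an illustrative choice:

```python
import networkx as nx

# Greedy heuristic sketch for maximum independent set:
# repeatedly take the lowest-degree node and remove it together with its neighbors.
def greedy_independent_set(G):
    H = G.copy()
    independent = set()
    while H.number_of_nodes() > 0:
        v = min(H.nodes(), key=H.degree)   # greedy choice: lowest-degree node
        independent.add(v)
        H.remove_nodes_from(list(H.neighbors(v)) + [v])
    return independent

G = nx.karate_club_graph()
print(len(greedy_independent_set(G)))  # size of one feasible (not necessarily optimal) solution
```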
Recently, many researchers have tried to tackle combinatorial optimization with deep learning or reinforcement learning, and in the past few years such work has appeared regularly at top venues such as AAAI and NeurIPS. I collected some representative studies (with links to the papers and code) in an earlier post; interested readers can take a look there.
Community Detection
Community detection is a classic problem in network science. The goal is to discover the community structure of a network, that is, subgraphs whose internal connections are especially dense and which often share particular functions or attributes. There are many flavors of community structure, such as hierarchical and overlapping communities, and a number of papers in this area have appeared in journals of the caliber of Nature and Science. The following papers are good starting points for a deeper look at community detection.
Finding and evaluating community structure in networks (2003)
Modularity and community structure in networks (2006)
Community detection in graphs (2009)
Community detection algorithms: a comparative analysis (2009)
Community detection in networks: A user guide (2016)
I have also written an earlier post on community detection, which includes community visualization code implemented with Python 3 and the networkx package.
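As a quick classical point of reference (unrelated to the paper discussed here), networkx ships a greedy modularity-maximization routine; a minimal usage sketch:

```python
import networkx as nx
from networkx.algorithms import community

# Classical baseline: greedy modularity maximization on a planted-partition graph.
G = nx.random_partition_graph([10, 10, 10], 0.8, 0.1)
communities = community.greedy_modularity_communities(G)
print([sorted(c) for c in communities])
print('modularity:', community.modularity(G, communities))
```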
Most community detection algorithms take modularity as the objective function: the aim is to find the partition that assigns every node to a community such that the modularity value is maximized. Community detection is therefore itself a combinatorial optimization problem.
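For reference, the quantity being maximized is the standard Newman-Girvan modularity

$$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j),$$

where $A$ is the adjacency matrix, $k_i$ the degree of node $i$, $m$ the number of edges, and $\delta(c_i, c_j) = 1$ when nodes $i$ and $j$ are placed in the same community. The `make_modularity_matrix` and `loss_modularity` functions later in this post implement exactly this: they build the matrix $B_{ij} = A_{ij} - k_i k_j / 2m$ and evaluate $\operatorname{tr}(R^\top B R)/2m$ for a (soft) assignment matrix $R$.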
The rest of this post walks through the main contributions of the paper.
End-to-End Learning and Optimization
About the Authors
The paper was published at NeurIPS 2019. Its four authors are from Harvard University and the University of Southern California. The first author, Bryan Wilder, is a PhD student about to graduate who has published many high-quality papers during his PhD; his personal homepage has the details. Bistra Dilkina is an associate professor at USC who has published extensively on combinatorial optimization, including the classic paper Learning combinatorial optimization algorithms over graphs.
Core Idea
Many practical applications involve both a learning problem and an optimization problem on a graph. The traditional approach solves the learning problem first and the optimization problem afterwards, which has the drawback that the results of the downstream optimization cannot feed back to guide the learning, so the two stages cannot improve jointly. The paper proposes an end-to-end framework that merges learning and optimization into a single network: the error of the final optimization task can be backpropagated all the way to the learning task, all parameters are optimized together, and performance on the optimization task improves.
Technical Approach
The paper takes link prediction as the representative learning problem and community detection as the representative optimization problem. Concretely, it assumes that the network structure is not fully known before community detection is run: only part of the network (40% of the structure) is observed. Link prediction is therefore used first to recover the unobserved edges, community detection is then performed on the reconstructed network, and modularity is used to evaluate the quality of the detected communities. The paper also sets up a control experiment in which community detection is run on the original network (with no hidden edges), and compares the two sets of results to assess the effectiveness of the proposed model.
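As a rough sketch of this setup (my own illustration, not the authors' preprocessing script; `split_observed_edges` is a hypothetical helper), hiding 60% of the edges could look like this:

```python
import random
import networkx as nx

# Keep a random 40% of the edges as the "observed" graph and hide the remaining
# 60%, which the link-prediction stage is then supposed to recover.
def split_observed_edges(G, observed_frac=0.4, seed=24):
    edges = list(G.edges())
    random.Random(seed).shuffle(edges)
    n_obs = int(observed_frac * len(edges))
    G_obs = nx.Graph()
    G_obs.add_nodes_from(G.nodes())
    G_obs.add_edges_from(edges[:n_obs])
    hidden_edges = edges[n_obs:]
    return G_obs, hidden_edges

G = nx.random_partition_graph([10, 10, 10], 0.8, 0.1)
G_obs, hidden_edges = split_observed_edges(G)
print(G.number_of_edges(), G_obs.number_of_edges(), len(hidden_edges))
```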
Methodological Innovation
The paper proposes a new end-to-end network model, ClusterNet, which consists of four main steps.
The overall framework is shown in the figure below: ClusterNet on top, the two-stage optimization model underneath.
The key ingredients are the decision step and the backpropagation of the decision error.
The decision function differs across optimization tasks, and it also differs between training and testing. This post covers only community detection. During training, the node clustering result is a matrix of probabilities that is treated as a soft community assignment; this makes the gradient computation more accurate and helps parameter optimization. At test time (inference), applying a sharpened softmax to the clustering result rounds it to a hard, binarized partition, from which the final modularity value is computed.
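A tiny self-contained illustration of that rounding step (the numbers are made up; the `torch.softmax(100 * r, dim=1)` trick is the one used in the training code below):

```python
import torch

# r holds soft cluster assignments for 5 nodes over 3 clusters (values invented for illustration).
r = torch.tensor([[0.70, 0.20, 0.10],
                  [0.55, 0.30, 0.15],
                  [0.10, 0.80, 0.10],
                  [0.25, 0.35, 0.40],
                  [0.05, 0.15, 0.80]])

# A high-temperature softmax pushes each row toward a one-hot (hard) membership vector.
r_hard = torch.softmax(100 * r, dim=1)
print(r_hard.round())          # approximately one-hot rows
print(r_hard.argmax(dim=1))    # equivalent hard labels: tensor([0, 0, 1, 2, 2])
```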
During backpropagation, two parameters strongly affect the optimization: β, the hardness of the cluster assignment, and δ, the degree of separation between clusters. Their product determines the quality of the community partition, and in general a larger product works better. In the code these two are represented by a single parameter, their product.
The authors also prove that the model can be optimized by gradient descent and derive the gradient formulas for its parameters; readers interested in the derivation should consult the original paper.
Besides the decision step and parameter optimization, the k-means clustering layer is essential. Without the intermediate clustering step the results degrade badly: for community detection, removing the clustering layer makes the model assign essentially all nodes to a single community, so that every edge lies inside a community. That is technically "optimal" but meaningless. The paper also includes a direct-optimization baseline without the clustering layer, GCN-e2e, whose results are clearly much worse than ClusterNet's.
Code Reproduction
Below, we reproduce the code from the paper step by step.
Importing Packages
```python
import numpy as np
import sklearn
import sklearn.cluster
import scipy.sparse as sp
import math
import networkx as nx   # needed by load_data below
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parameter import Parameter
```

Data Conversion
Convert a networkx graph object into the two inputs the network expects: a normalized adjacency matrix (as a sparse tensor) and a node feature matrix (for graphs without node features, the identity matrix is used by default).
```python
## Data handling
def normalize(mx):
    """Row-normalize sparse matrix"""
    rowsum = np.array(mx.sum(1), dtype=np.float32)
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0.
    r_mat_inv = sp.diags(r_inv)
    mx = r_mat_inv.dot(mx)
    return mx

def mx_to_sparse_tensor(mx):
    """Convert a scipy sparse matrix to a torch sparse tensor."""
    mx = mx.tocoo().astype(np.float32)
    indices = torch.from_numpy(np.vstack((mx.row, mx.col)).astype(np.int64))
    values = torch.from_numpy(mx.data)
    shape = torch.Size(mx.shape)
    return torch.sparse.FloatTensor(indices, values, shape)

def load_data(G):
    """Load network (graph)"""
    adj = nx.to_scipy_sparse_matrix(G).tocoo()
    adj = normalize(adj + sp.eye(adj.shape[0]))
    adj = mx_to_sparse_tensor(adj)
    features = torch.eye(len(G.nodes())).to_sparse()
    return adj, features
```

The ClusterNet Model
ClusterNet itself consists of just two modules. The first is a standard graph convolutional network; the second is k-means clustering, done in two calls: the first produces initial cluster-center vectors, and the second refines the nodes' cluster assignments. Once the soft assignments are available, a softmax-based rounding step yields the hard community partition.
```python
## Model
class GraphConvolution(nn.Module):
    '''Simple GCN layer, similar to https://arxiv.org/abs/1609.02907'''
    def __init__(self, in_features, out_features, bias=True):
        super(GraphConvolution, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.FloatTensor(in_features, out_features))
        if bias:
            self.bias = Parameter(torch.FloatTensor(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input, adj):
        support = torch.mm(input, self.weight)
        output = torch.spmm(adj, support)
        if self.bias is not None:
            return output + self.bias
        else:
            return output

    def __repr__(self):
        return self.__class__.__name__ + ' (' \
               + str(self.in_features) + ' -> ' \
               + str(self.out_features) + ')'

class GCN(nn.Module):
    '''2-layer GCN with dropout'''
    def __init__(self, nfeat, nhid, nout, dropout):
        super(GCN, self).__init__()
        self.gc1 = GraphConvolution(nfeat, nhid)
        self.gc2 = GraphConvolution(nhid, nout)
        self.dropout = dropout

    def forward(self, x, adj):
        x = F.relu(self.gc1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.gc2(x, adj)
        return x

def cluster(data, k, num_iter, init=None, cluster_temp=5):
    '''pytorch (differentiable) implementation of soft k-means clustering.'''
    # normalize x so it lies on the unit sphere
    data = torch.diag(1./torch.norm(data, p=2, dim=1)) @ data
    # use kmeans++ initialization if nothing is provided
    if init is None:
        data_np = data.detach().numpy()
        norm = (data_np**2).sum(axis=1)
        # note: sklearn.cluster.k_means_._k_init is a private helper from older
        # scikit-learn releases; newer versions expose sklearn.cluster.kmeans_plusplus instead
        init = sklearn.cluster.k_means_._k_init(data_np, k, norm, sklearn.utils.check_random_state(None))
        init = torch.tensor(init, requires_grad=True)
    if num_iter == 0:
        return init
    mu = init
    for t in range(num_iter):
        # get distances between all data points and cluster centers
        dist = data @ mu.t()
        # cluster responsibilities via softmax
        r = torch.softmax(cluster_temp*dist, 1)
        # total responsibility of each cluster
        cluster_r = r.sum(dim=0)
        # mean of points in each cluster weighted by responsibility
        cluster_mean = (r.t().unsqueeze(1) @ data.expand(k, *data.shape)).squeeze(1)
        # update cluster means
        new_mu = torch.diag(1/cluster_r) @ cluster_mean
        mu = new_mu
    dist = data @ mu.t()
    r = torch.softmax(cluster_temp*dist, 1)
    return mu, r, dist

class GCNClusterNet(nn.Module):
    '''
    The ClusterNet architecture. The first step is a 2-layer GCN to generate embeddings.
    The output is the cluster means mu and soft assignments r, along with the embeddings
    and the node similarities (just output for debugging purposes).

    The forward pass inputs are x, a feature matrix for the nodes, and adj, a sparse
    adjacency matrix. The optional parameter num_iter determines how many steps to run
    the k-means updates for.
    '''
    def __init__(self, nfeat, nhid, nout, dropout, K, cluster_temp):
        super(GCNClusterNet, self).__init__()
        self.GCN = GCN(nfeat, nhid, nout, dropout)
        self.distmult = nn.Parameter(torch.rand(nout))
        self.sigmoid = nn.Sigmoid()
        self.K = K
        self.cluster_temp = cluster_temp
        self.init = torch.rand(self.K, nout)

    def forward(self, x, adj, num_iter=1):
        embeds = self.GCN(x, adj)
        mu_init, _, _ = cluster(embeds, self.K, num_iter, init=self.init, cluster_temp=self.cluster_temp)
        mu, r, dist = cluster(embeds, self.K, num_iter, init=mu_init.detach().clone(), cluster_temp=self.cluster_temp)
        return mu, r, embeds, dist

# loss function: modularity of a (soft) community assignment r
# (unused arguments from the original repo are dropped so the signature matches the calls below)
def loss_modularity(r, bin_adj, mod):
    bin_adj_nodiag = bin_adj*(torch.ones(bin_adj.shape[0], bin_adj.shape[0]) - torch.eye(bin_adj.shape[0]))
    return (1./bin_adj_nodiag.sum()) * (r.t() @ mod @ r).trace()

# build the modularity matrix B = A - k k^T / 2m
def make_modularity_matrix(adj):
    adj = adj*(torch.ones(adj.shape[0], adj.shape[0]) - torch.eye(adj.shape[0]))
    degrees = adj.sum(dim=0).unsqueeze(1)
    mod = adj - degrees@degrees.t()/adj.sum()
    return mod
```

Creating the Network
Create a demo network for showcasing the community detection results: it contains three communities of 10 nodes each.
```python
import networkx as nx
import matplotlib.pyplot as plt

# create a graph
G = nx.random_partition_graph([10, 10, 10], 0.8, 0.1)

# plot the graph
fig, ax = plt.subplots(figsize=(5, 5))
option = {'font_family': 'serif', 'font_size': '15', 'font_weight': 'semibold'}
nx.draw_networkx(G, node_size=400, **option)  # pos=nx.spring_layout(G)
plt.axis('off')
plt.show()
```

The visualization of the network is shown below.
Parameter Settings and Data Loading
Set the hyperparameters and load the data.
```python
class arguments():
    def __init__(self):
        self.no_cuda = True           # Disables CUDA training
        self.seed = 24                # Random seed
        self.lr = 0.001               # Initial learning rate
        self.weight_decay = 5e-4      # Weight decay (L2 loss on parameters)
        self.hidden = 50              # Number of hidden units
        self.embed_dim = 50           # Dimensionality of node embeddings
        self.dropout = 0.5            # Dropout rate (1 - keep probability)
        self.K = 3                    # How many partitions
        self.clustertemp = 100        # how hard to make the softmax for the cluster assignments
        self.train_iters = 301        # number of training iterations
        self.num_cluster_iter = 1     # number of iterations for clustering

args = arguments()
args.cuda = not args.no_cuda and torch.cuda.is_available()

## Load data
adj_all, features = load_data(G=G)              # normalized adjacency matrix
bin_adj_all = (adj_all.to_dense() > 0).float()  # standard adjacency matrix
test_object = make_modularity_matrix(bin_adj_all)
nfeat = features.shape[1]
num_cluster_iter = args.num_cluster_iter
K = args.K
```

Training the Network
Create a ClusterNet model and train it on the CPU. We do not split the data; the model is evaluated directly on the original graph.
```python
%%time
## INITIALIZE MODELS
model_cluster = GCNClusterNet(nfeat=nfeat,
                              nhid=args.hidden,
                              nout=args.embed_dim,
                              dropout=args.dropout,
                              K=args.K,
                              cluster_temp=args.clustertemp)

optimizer = optim.Adam(model_cluster.parameters(),
                       lr=args.lr, weight_decay=args.weight_decay)

if args.cuda:
    model_cluster.cuda()
    adj_all = adj_all.cuda()
    features = features.cuda()
    bin_adj_all = bin_adj_all.cuda()
    test_object = test_object.cuda()

losses = []
losses_test = []

## Decision-focused training
best_train_val = 100
for t in range(args.train_iters):
    # optimization setting: get loss with respect to the full graph
    mu, r, embeds, dist = model_cluster(features, adj_all, num_cluster_iter)
    loss = loss_modularity(r, bin_adj_all, test_object)
    loss = -loss
    optimizer.zero_grad()
    loss.backward()
    # increase number of clustering iterations after 200 updates to fine-tune solution
    if t == 200:
        num_cluster_iter = 5
    # every 100 iterations, look and see if we've improved on the best training loss
    # seen so far. Keep the solution with best training value.
    if t % 100 == 0:
        # round solution to discrete partitioning
        r = torch.softmax(100*r, dim=1)
        # evaluate test loss -- note that the best solution is
        # chosen with respect to training loss. Here, we store the test loss
        # of the currently best training solution
        loss_test = loss_modularity(r, bin_adj_all, test_object)
        # training loss, to do rounding after
        if loss.item() < best_train_val:
            best_train_val = loss.item()
            curr_test_loss = loss_test.item()
        log = 'Iterations: {:03d}, ClusterNet modularity: {:.4f}'
        print(log.format(t, curr_test_loss))
    losses.append(loss.item())
    optimizer.step()
```

The output is as follows:
```
Iterations: 000, ClusterNet modularity: 0.3532
Iterations: 100, ClusterNet modularity: 0.4813
Iterations: 200, ClusterNet modularity: 0.4813
Iterations: 300, ClusterNet modularity: 0.4813
CPU times: user 59.7 s, sys: 869 ms, total: 1min
Wall time: 2.31 s
```

As the output shows, the network has converged by iteration 100; in fact convergence is even faster, at roughly 50 iterations, and the runtime is very short.
Now let's look at the community detection result:
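The plotting code for this figure is not included in the post; a minimal sketch, assuming `model_cluster`, `features`, `adj_all`, `num_cluster_iter`, and `G` from the cells above are still in scope, could be:

```python
# Recover hard community labels from the trained ClusterNet and color the nodes by label.
model_cluster.eval()
with torch.no_grad():
    _, r, _, _ = model_cluster(features, adj_all, num_cluster_iter)
labels = torch.softmax(100 * r, dim=1).argmax(dim=1).numpy()

fig, ax = plt.subplots(figsize=(5, 5))
nx.draw_networkx(G, node_color=labels, cmap=plt.cm.Set1, node_size=400)
plt.axis('off')
plt.show()
```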
As the figure shows, the 30 nodes are classified very accurately: nodes 1-10, 11-20, and 21-30 are each assigned to their own community, which is an excellent result. Interested readers should try it themselves. In my view, the method proposed in this paper is currently the most effective graph-neural-network approach to community detection: it balances performance and efficiency, and it is simple to implement.