On the efficiency of estimating penetrating rank on large graphs
P-Rank (Penetrating Rank) has been suggested as a useful measure of structural similarity that takes into account both incoming and outgoing edges in ubiquitous networks. Existing work often utilizes memoization to compute P-Rank similarity in an iterative fashion, which requires cubic time in the worst case. Besides, previous methods mainly focus on the deterministic computation of P-Rank, but lack a probabilistic framework that scales well for large graphs. In this paper, we propose two efficient algorithms for computing P-Rank on large graphs. The first observation is that a large body of objects in a real graph usually share similar neighborhood structures. By merging such objects with an explicit low-rank factorization, we devise a deterministic algorithm to compute P-Rank in quadratic time. The second observation is that by converting the iterative form of P-Rank into a matrix power series form, we can leverage the random sampling approach to probabilistically compute P-Rank in linear time with provable accuracy guarantees. The empirical results on both real and synthetic datasets show that our approaches achieve high time efficiency with controlled error and outperform the baseline algorithms by at least one order of magnitude.
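The iterative form mentioned in the abstract can be illustrated with a minimal sketch of the naive fixed-point iteration. It assumes the commonly stated P-Rank recurrence, with a damping factor `C` and a weight `lam` balancing in-links against out-links; the function name, parameter names, and default values are illustrative and do not come from the abstract. This is the kind of baseline the paper improves on, not its low-rank or sampling algorithm.

```python
def p_rank(edges, n, C=0.8, lam=0.5, iters=10):
    """Naive iterative P-Rank on a directed graph with nodes 0..n-1.

    Assumed recurrence (s(a, a) = 1):
      s(a, b) = lam * C * avg over in-neighbor pairs of s
              + (1 - lam) * C * avg over out-neighbor pairs of s
    """
    ins = [[] for _ in range(n)]
    outs = [[] for _ in range(n)]
    for u, v in edges:
        outs[u].append(v)
        ins[v].append(u)

    # Start from the identity: each node is similar only to itself.
    s = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(iters):
        nxt = [[0.0] * n for _ in range(n)]
        for a in range(n):
            nxt[a][a] = 1.0
            for b in range(a + 1, n):
                val = 0.0
                if ins[a] and ins[b]:  # in-link term
                    tot = sum(s[i][j] for i in ins[a] for j in ins[b])
                    val += lam * C * tot / (len(ins[a]) * len(ins[b]))
                if outs[a] and outs[b]:  # out-link term
                    tot = sum(s[i][j] for i in outs[a] for j in outs[b])
                    val += (1 - lam) * C * tot / (len(outs[a]) * len(outs[b]))
                nxt[a][b] = nxt[b][a] = val
        s = nxt
    return s
```

Setting `lam=1.0` keeps only the in-link term, which reduces the recurrence to a SimRank-style update.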
Structure fusion based on graph convolutional networks for semi-supervised classification
Because multi-view data are diverse and complex, most existing graph
convolutional networks for semi-supervised classification focus on network
architecture construction or on preserving the salient graph structure, and
ignore the contribution of the complete graph structure. To mine a more
complete distribution structure from multi-view data while considering both
specificity and commonality, we propose structure fusion based on graph
convolutional networks (SF-GCN) for improving the performance of
semi-supervised classification. SF-GCN not only retains the special
characteristics of each view by spectral embedding, but also captures the
common style of multi-view data through a distance metric between multi-graph
structures. Assuming a linear relationship between multi-graph structures, we
construct the optimization function of the structure fusion model by balancing
the specificity loss and the commonality loss. By solving this function, we
simultaneously obtain the fused spectral embedding of the multi-view data and
the fused structure, which serves as the adjacency matrix input to graph
convolutional networks for semi-supervised classification. Experiments
demonstrate that SF-GCN outperforms the state of the art on three challenging
citation network datasets: Cora, Citeseer, and Pubmed.
TPA: Fast, Scalable, and Accurate Method for Approximate Random Walk with Restart on Billion Scale Graphs
Given a large graph, how can we determine similarity between nodes in a fast
and accurate way? Random walk with restart (RWR) is a popular measure for this
purpose and has been exploited in numerous data mining applications including
ranking, anomaly detection, link prediction, and community detection. However,
previous methods for computing exact RWR require prohibitive storage sizes and
computational costs, and alternative methods which avoid such costs by
computing approximate RWR have limited accuracy. In this paper, we propose TPA,
a fast, scalable, and highly accurate method for computing approximate RWR on
large graphs. TPA exploits two important properties of RWR: 1) nodes close to a
seed node are likely to be revisited in the following steps due to the
block-wise structure of many real-world graphs, and 2) RWR scores of nodes which
reside far from the seed node are proportional to their PageRank scores. Based on
these two properties, TPA divides the approximate RWR problem into two
subproblems called neighbor approximation and stranger approximation. In the
neighbor approximation, TPA estimates RWR scores of nodes close to the seed based
on the scores of the first few steps from the seed. In the stranger approximation, TPA
estimates RWR scores for nodes far from the seed using their PageRank. The
stranger and neighbor approximations are conducted in the preprocessing phase
and the online phase, respectively. Through extensive experiments, we show that
TPA requires up to 3.5x less time with up to 40x less memory space than other
state-of-the-art methods for the preprocessing phase. In the online phase, TPA
computes approximate RWR up to 30x faster than existing methods while
maintaining high accuracy.
Comment: 12 pages, 10 figures
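For context, the quantity that TPA approximates can be computed on a small graph by plain power iteration. The sketch below assumes the usual RWR fixed point r = (1 - c) Ã^T r + c q, where Ã is the row-normalized adjacency matrix, q is the seed indicator, and c is the restart probability. The function name, the adjacency-list input, and the default c = 0.15 are illustrative assumptions. It does not implement TPA's neighbor or stranger approximation.

```python
def rwr(adj, seed, c=0.15, iters=100):
    """Random walk with restart by power iteration.

    adj[u] lists the out-neighbors of node u; seed is the restart node.
    Returns a list of RWR scores, one per node.
    """
    n = len(adj)
    r = [0.0] * n
    r[seed] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[seed] = c  # restart: a fraction c of the walker mass returns to the seed
        for u, nbrs in enumerate(adj):
            if not nbrs:
                continue  # dangling node: its mass is dropped in this sketch
            share = (1 - c) * r[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        r = nxt
    return r
```

Each iteration touches every edge once, so exact RWR costs one full pass over the graph per step and per seed, which is the cost that approximate methods try to avoid at billion scale.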
Taming computational complexity: efficient and parallel SimRank optimizations on undirected graphs
SimRank is considered one of the promising link-based ranking algorithms for evaluating the similarity of web documents in many modern search engines. In this paper, we investigate the optimization problem of SimRank similarity computation on undirected web graphs. We first present a novel algorithm to estimate the SimRank between vertices in O(n^3 + Kn^2) time, where n is the number of vertices and K is the number of iterations. In comparison, the most efficient implementation of the SimRank algorithm in [1] takes O(Kn^3) time in the worst case. To efficiently handle large-scale computations, we also propose a parallel implementation of the SimRank algorithm on multiple processors. The experimental evaluations on both synthetic and real-life data sets demonstrate the improved running time and parallel efficiency of our proposed techniques.
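As a point of reference for the complexities quoted above, here is a minimal sketch of the textbook SimRank iteration on an undirected graph, where the in-neighbors of a vertex are simply its neighbors. The decay factor `C`, the function name, and the adjacency-list input are illustrative assumptions. This naive version sums over all neighbor pairs and is slower than both the O(Kn^3) method cited as [1] and the algorithm proposed in the paper.

```python
def simrank(adj, C=0.8, iters=10):
    """Naive SimRank on an undirected graph given as adjacency lists.

    s(a, a) = 1; for a != b,
    s(a, b) = C / (|N(a)| * |N(b)|) * sum of s(i, j) over i in N(a), j in N(b).
    """
    n = len(adj)
    s = [[1.0 if a == b else 0.0 for b in range(n)] for a in range(n)]
    for _ in range(iters):
        nxt = [[0.0] * n for _ in range(n)]
        for a in range(n):
            nxt[a][a] = 1.0
            for b in range(a + 1, n):
                if adj[a] and adj[b]:
                    tot = sum(s[i][j] for i in adj[a] for j in adj[b])
                    # The score matrix is symmetric, so fill both halves.
                    nxt[a][b] = nxt[b][a] = C * tot / (len(adj[a]) * len(adj[b]))
        s = nxt
    return s
```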
Cluster-GCN: An Efficient Algorithm for Training Deep and Large Graph Convolutional Networks
Graph convolutional networks (GCNs) have been successfully applied to many
graph-based applications; however, training a large-scale GCN remains
challenging. Current SGD-based algorithms suffer from either a high
computational cost that grows exponentially with the number of GCN layers, or a
large space requirement for keeping the entire graph and the embedding of each
node in memory. In this paper, we propose Cluster-GCN, a novel GCN algorithm
that is suitable for SGD-based training by exploiting the graph clustering
structure. Cluster-GCN works as follows: at each step, it samples a block
of nodes associated with a dense subgraph identified by a graph clustering
algorithm, and restricts the neighborhood search to this subgraph. This
simple but effective strategy leads to significantly improved memory and
computational efficiency while being able to achieve comparable test accuracy
with previous algorithms. To test the scalability of our algorithm, we create a
new Amazon2M dataset with 2 million nodes and 61 million edges, which is more
than 5 times larger than the previous largest publicly available dataset (Reddit).
For training a 3-layer GCN on this data, Cluster-GCN is faster than the
previous state-of-the-art VR-GCN (1523 seconds vs 1961 seconds) and uses much
less memory (2.2GB vs 11.2GB). Furthermore, for training a 4-layer GCN on this
data, our algorithm can finish in around 36 minutes while all the existing GCN
training algorithms fail to train due to out-of-memory issues. In addition,
Cluster-GCN allows us to train much deeper GCNs without much time and memory
overhead, which leads to improved prediction accuracy: using a 5-layer
Cluster-GCN, we achieve a state-of-the-art test F1 score of 99.36 on the PPI
dataset, while the previous best result was 98.71 by [16]. Our code is
publicly available at
https://github.com/google-research/google-research/tree/master/cluster_gcn.
Comment: In Proceedings of the 25th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (KDD'19)
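The batching step described in the abstract, sampling a block of nodes and restricting neighborhood search to the subgraph it induces, can be sketched as follows. The sketch assumes the node partition is already given (the abstract only says it comes from a graph clustering algorithm), and the helper names `induced_subgraph` and `cluster_batches` are hypothetical. It covers batch construction only, not the GCN forward or backward pass.

```python
import random


def induced_subgraph(adj, nodes):
    """Keep only the edges whose two endpoints both lie in `nodes`."""
    keep = set(nodes)
    return {u: [v for v in adj[u] if v in keep] for u in nodes}


def cluster_batches(adj, clusters, epochs=1, seed=0):
    """Yield (nodes, subgraph) mini-batches, one cluster per SGD step.

    adj maps each node to its neighbor list; clusters is a precomputed
    partition of the nodes (assumed to come from a clustering algorithm).
    """
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(range(len(clusters)))
        rng.shuffle(order)  # visit clusters in a random order each epoch
        for k in order:
            nodes = clusters[k]
            yield nodes, induced_subgraph(adj, nodes)
```

Because each batch only holds the edges inside one cluster, memory per step scales with the cluster size rather than with the whole graph, at the price of ignoring between-cluster edges in that step.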