Graph Pattern Mining and Learning through User-defined Relations (Extended Version)
In this work we propose R-GPM, a parallel computing framework for graph
pattern mining (GPM) through a user-defined subgraph relation. More
specifically, we enable the computation of statistics of patterns through their
subgraph classes, generalizing traditional GPM methods. R-GPM provides
efficient estimators for these statistics by employing an MCMC sampling
algorithm combined with several optimizations. We provide both theoretical
guarantees and empirical evaluations of our estimators in application scenarios
such as stochastic optimization of deep high-order graph neural network models
and pattern (motif) counting. We also propose and evaluate optimizations that
improve the accuracy of our estimators while reducing their
computational costs by up to three orders of magnitude. Finally, we show that R-GPM
is scalable, providing near-linear speedups on 44 cores in all of our tests.
Comment: Extended version of the paper published in the ICDM 201
NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud
There is an increasing interest in executing complex analyses over large
graphs, many of which require processing a large number of multi-hop
neighborhoods or subgraphs. Examples include ego network analysis, motif
counting, personalized recommendations, and others. These tasks are not well
served by existing vertex-centric graph processing frameworks, where user
programs are only able to directly access the state of a single vertex. This
paper introduces NSCALE, a novel end-to-end graph processing framework that
enables the distributed execution of complex subgraph-centric analytics over
large-scale graphs in the cloud. NSCALE enables users to write programs at the
level of subgraphs rather than at the level of vertices. Unlike most previous
graph processing frameworks, which apply the user program to the entire graph,
NSCALE allows users to declaratively specify subgraphs of interest. Our
framework includes a novel graph extraction and packing (GEP) module that
utilizes a cost-based optimizer to partition and pack the subgraphs of interest
into memory on as few machines as possible. The distributed execution engine
then takes over and runs the user program in parallel, while respecting the
scope of the various subgraphs. Our experimental results show
orders-of-magnitude improvements in performance and drastic reductions in the
cost of analytics compared to vertex-centric approaches.
Comment: 26 pages, 15 figures, 5 tables
Graphlet Decomposition: Framework, Algorithms, and Applications
From social science to biology, numerous applications often rely on graphlets
for intuitive and meaningful characterization of networks at both the global
macro-level as well as the local micro-level. While graphlets have witnessed a
tremendous success and impact in a variety of domains, there has yet to be a
fast and efficient approach for computing the frequencies of these subgraph
patterns. Existing methods are not scalable to large networks with
millions of nodes and edges, which impedes the application of graphlets to new
problems that require large-scale network analysis. To address these problems,
we propose a fast, efficient, and parallel algorithm for counting graphlets of
size k={3,4} nodes that takes only a fraction of the time required by
current methods. The proposed graphlet counting
algorithms leverage a number of proven combinatorial arguments for different
graphlets. For each edge, we count a few graphlets, and with these counts along
with the combinatorial arguments, we obtain the exact counts of others in
constant time. On a large collection of 300+ networks from a variety of
domains, our graphlet counting strategies are on average 460x faster than
current methods. This brings new opportunities to investigate the use of
graphlets on much larger networks and newer applications as we show in the
experiments. To the best of our knowledge, this paper provides the largest
graphlet computations to date as well as the largest systematic investigation
on 300+ networks from a variety of domains.
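The edge-centric idea in this abstract (count a few graphlets per edge, then derive the rest combinatorially) can be illustrated for 3-node graphlets. This is only a minimal sketch, not the authors' algorithm: the function name `count_3_node_graphlets` is hypothetical, the graph is assumed to be simple and undirected with integer node ids, and only triangles and open wedges are handled. Triangles are counted directly per edge, and the open wedge count then follows from the endpoint degrees without further enumeration.

```python
def count_3_node_graphlets(edges):
    """Return (triangles, open_wedges) for a simple undirected graph.

    Hypothetical helper for illustration; `edges` is an iterable of (u, v)
    pairs with comparable (e.g. integer) node ids.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    tri_sum = 0    # each triangle is seen once per edge, i.e. 3 times
    wedge_sum = 0  # each open wedge is seen once per edge, i.e. 2 times
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge exactly once
                # Counted directly: triangles that contain edge (u, v).
                t = len(adj[u] & adj[v])
                tri_sum += t
                # Derived in constant time from degrees and t:
                # neighbors of u (or v) that do not close a triangle.
                wedge_sum += (len(adj[u]) - 1 - t) + (len(adj[v]) - 1 - t)
    return tri_sum // 3, wedge_sum // 2


if __name__ == "__main__":
    # A triangle 0-1-2 with a pendant node 3 attached to node 2.
    print(count_3_node_graphlets([(0, 1), (1, 2), (0, 2), (2, 3)]))
```

The same pattern (a small number of direct counts per edge plus closed-form identities) is what the abstract describes for 4-node graphlets, although those identities are not reproduced here.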
A Chronological Edge-Driven Approach to Temporal Subgraph Isomorphism
Many real-world networks are considered temporal networks, in which the
chronological ordering of the edges has importance to the meaning of the data.
Performing temporal subgraph matching on such graphs requires the edges in the
subgraphs to match the order of the temporal graph motif we are searching for.
Previous methods for solving this rely on the use of static subgraph matching
to find potential matches first, before filtering them based on edge order to
find the true temporal matches. We present a new algorithm for temporal
subgraph isomorphism that performs the subgraph matching directly on the
chronologically sorted edges. By restricting our search to only the subgraphs
with chronologically correct edges, we can improve the performance of the
algorithm significantly. We present experimental timing results to show
significant performance improvements on publicly available datasets for a
number of different temporal query graph motifs with four or more nodes. We
also demonstrate a practical example of how temporal subgraph isomorphism can
produce more meaningful results than traditional static subgraph searches.
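To make the edge-driven idea concrete, the sketch below searches directly over chronologically sorted edges, so a partial match is only ever extended with later edges. It is an illustrative backtracking search under stated assumptions, not the algorithm from the paper: the function name `temporal_matches`, the `delta` time-window parameter, and the motif encoding (a list of directed pattern edges in their required time order) are all hypothetical, and timestamps are assumed to be distinct.

```python
def temporal_matches(edges, motif, delta):
    """Find temporal motif matches by scanning time-sorted edges.

    edges: list of directed edges (u, v, t) with distinct timestamps t.
    motif: list of pattern edges (a, b), listed in required time order.
    delta: maximum allowed time between the first and last matched edge.
    All names and parameters are assumptions made for this sketch.
    """
    edges = sorted(edges, key=lambda e: e[2])
    results = []

    def extend(start, k, mapping, chosen):
        if k == len(motif):
            results.append(tuple(chosen))
            return
        a, b = motif[k]
        for i in range(start, len(edges)):
            u, v, t = edges[i]
            if chosen and t - chosen[0][2] > delta:
                break  # edges are sorted, so later ones are out of window too
            new_map = dict(mapping)
            ok = True
            # Bind pattern nodes to graph nodes consistently and injectively.
            for p, g in ((a, u), (b, v)):
                if p in new_map:
                    ok = new_map[p] == g
                else:
                    ok = g not in new_map.values()
                    new_map[p] = g
                if not ok:
                    break
            if ok:
                # Only edges after position i can complete this match.
                extend(i + 1, k + 1, new_map, chosen + [edges[i]])

    extend(0, 0, {}, [])
    return results


if __name__ == "__main__":
    graph = [(1, 2, 10), (2, 3, 20), (3, 1, 30), (2, 3, 5)]
    cycle = [("a", "b"), ("b", "c"), ("c", "a")]
    print(temporal_matches(graph, cycle, delta=100))
```

In this example a static triangle search would also pair the edge `(2, 3, 5)` with the other two edges, and a filter step would then have to reject it for violating the edge order; the chronological scan never generates that candidate.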
Distributed Estimation of Graph 4-Profiles
We present a novel distributed algorithm for counting all four-node induced
subgraphs in a big graph. These counts, called the 4-profile, describe a
graph's connectivity properties and have found several uses ranging from
bioinformatics to spam detection. We also study the more complicated problem of
estimating the local 4-profiles centered at each vertex of the graph. The
local 4-profile embeds every vertex in an -dimensional space that
characterizes the local geometry of its neighborhood: vertices that connect
different clusters will have different local 4-profiles compared to those
that are only part of one dense cluster.
Our algorithm is a local, distributed message-passing scheme on the graph and
computes all the local 4-profiles in parallel. We rely on two novel
theoretical contributions: we show that local 4-profiles can be calculated
using compressed two-hop information and also establish novel concentration
results that show that graphs can be substantially sparsified and still retain
good approximation quality for the global 4-profile.
We empirically evaluate our algorithm using a distributed GraphLab
implementation that we scaled up to cores. We show that our algorithm can
compute global and local 4-profiles of graphs with millions of edges in a few
minutes, significantly improving upon the previous state of the art.
Comment: To appear in part at WWW'1
An Faster Network Motif Detection Tool
Network motifs provide a way to uncover the basic building blocks of most
complex networks. This task usually demands heavy computation, especially
for motifs with 5 or more vertices. This paper presents an extended methodology
with the following features: (i) search for motifs of up to 6 vertices, (ii)
multithreaded processing, and (iii) a new enumeration algorithm with lower
complexity. The algorithm to compute motifs solves isomorphism in with
the use of a hash table. Concurrent threads evaluate distinct graphs. The
enumeration algorithm has smaller computational complexity. The experiments
show better performance than other methods available in the literature,
allowing bioinformatics researchers to efficiently identify motifs of size 3, 4,
5, and 6.
Parallel Algorithms for Counting Triangles in Networks with Large Degrees
Finding the number of triangles in a network is an important problem in the
analysis of complex networks. The number of triangles also has important
applications in data mining. Existing distributed memory parallel algorithms
for counting triangles are either Map-Reduce based or message passing interface
(MPI) based and work with overlapping partitions of the given network. These
algorithms are designed for very sparse networks and do not work well when the
degrees of the nodes are relatively large. For networks with larger degrees,
Map-Reduce based algorithms generate prohibitively large intermediate data, and
in MPI based algorithms with overlapping partitions, each partition can grow as
large as the original network, wiping out the benefit of partitioning the
network.
In this paper, we present two efficient MPI-based parallel algorithms for
counting triangles in massive networks with large degrees. The first algorithm
is a space-efficient algorithm for networks that do not fit in the main memory
of a single compute node. This algorithm divides the network into
non-overlapping partitions. The second algorithm is for the case where the main
memory of each node is large enough to contain the entire network. We observe
that for such a case, computation load can be balanced dynamically and present
a dynamic load balancing scheme which improves the performance significantly.
Both of our algorithms scale well to large networks and to a large number of
processors.
Arabesque: A System for Distributed Graph Mining - Extended version
Distributed data processing platforms such as MapReduce and Pregel have
substantially simplified the design and deployment of certain classes of
distributed graph analytics algorithms. However, these platforms do not
represent a good match for distributed graph mining problems, such as
finding frequent subgraphs in a graph. Given an input graph, these problems
require exploring a very large number of subgraphs and finding patterns that
match some "interestingness" criteria desired by the user. These algorithms are
very important for areas such as social networks, semantic web, and
bioinformatics. In this paper, we present Arabesque, the first distributed data
processing platform for implementing graph mining algorithms. Arabesque
automates the process of exploring a very large number of subgraphs. It defines
a high-level filter-process computational model that simplifies the development
of scalable graph mining algorithms: Arabesque explores subgraphs and passes
them to the application, which must simply compute outputs and decide whether
the subgraph should be further extended. We use Arabesque's API to produce
distributed solutions to three fundamental graph mining problems: frequent
subgraph mining, counting motifs, and finding cliques. Our implementations
require a handful of lines of code, scale to trillions of subgraphs, and
represent in some cases the first available distributed solutions.
Comment: A short version of this report appeared in the Proceedings of the
25th ACM Symp. on Operating Systems Principles (SOSP), 201
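The filter-process model described in this abstract can be sketched on a single machine for the clique-finding case. The code below is a simplified illustration and not Arabesque's actual API: the names `explore`, `filter_fn`, `process_fn`, and `is_clique` are hypothetical, and duplicate exploration is avoided by the shortcut of only adding vertices with larger ids, which is sufficient for cliques but is not a general canonicality check.

```python
def explore(adj, filter_fn, process_fn, max_size):
    """Level-wise subgraph exploration in a filter-process style.

    adj: dict mapping each vertex id to the set of its neighbors.
    filter_fn(adj, emb) decides whether an embedding is kept and extended.
    process_fn(emb) consumes each embedding that passes the filter.
    All names here are assumptions made for this sketch.
    """
    frontier = [(v,) for v in sorted(adj)]
    while frontier:
        next_frontier = []
        for emb in frontier:
            if not filter_fn(adj, emb):
                continue  # pruned: neither processed nor extended
            process_fn(emb)
            if len(emb) < max_size:
                candidates = set()
                for v in emb:
                    candidates |= adj[v]
                # Simplification: extend only with larger vertex ids so that
                # each vertex set is generated at most once (enough for cliques).
                for w in sorted(candidates):
                    if w > emb[-1]:
                        next_frontier.append(emb + (w,))
        frontier = next_frontier


def is_clique(adj, emb):
    # The prefix already passed the filter, so only the newest vertex
    # needs to be checked against the earlier ones.
    last = emb[-1]
    return all(u in adj[last] for u in emb[:-1])


if __name__ == "__main__":
    adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
    found = []
    explore(adj, is_clique, found.append, max_size=3)
    print([c for c in found if len(c) == 3])
```

The application code is limited to the two callbacks; the exploration loop is what a system in this style would automate and distribute.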
A sampling framework for counting temporal motifs
Pattern counting in graphs is fundamental to network science tasks, and there
are many scalable methods for approximating counts of small patterns, often
called motifs, in large graphs. However, modern graph datasets now contain
richer structure, and incorporating temporal information in particular has
become a critical part of network analysis. Temporal motifs, which are
generalizations of small subgraph patterns that incorporate temporal ordering
on edges, are an emerging part of the network analysis toolbox. However, there
are no algorithms for fast estimation of temporal motif counts; moreover, we
show that even counting simple temporal star motifs is NP-complete. Thus, there
is a need for fast and approximate algorithms. Here, we present the first
frequency estimation algorithms for counting temporal motifs. More
specifically, we develop a sampling framework that sits as a layer on top of
existing exact counting algorithms and enables fast and accurate
memory-efficient estimates of temporal motif counts. Our results show that we
can achieve one to two orders of magnitude speedups with minimal and
controllable loss in accuracy on a number of datasets.
Comment: 9 pages, 4 figures
An Experimental Evaluation of a Bounded Expansion Algorithmic Pipeline
Previous work has suggested that the structural restrictions of graphs from
classes of bounded expansion--locally dense pockets in a globally sparse
graph--naturally coincide with common properties of real-world networks such as
clustering and heavy-tailed degree distributions. As such, fixed-parameter
tractable algorithms for bounded expansion classes may offer a promising
framework for network analysis where other approaches have struggled to scale.
However, there has been little work done in implementing and evaluating the
performance of these structure-based algorithms. To this end we introduce
CONCUSS, a proof-of-concept implementation of a generic algorithmic pipeline
for classes of bounded expansion. In particular, we focus on using CONCUSS for
subgraph isomorphism counting (also called motif or graphlet counting), which
has been used extensively as a tool for analyzing biological and social
networks. Through a broad set of experiments we first evaluate the interactions
between implementation/engineering choices at multiple stages of the pipeline
and their effects on overall run time. From there, we establish viability of
the bounded expansion framework by demonstrating that in some scenarios CONCUSS
achieves run times competitive with a popular algorithm for subgraph
isomorphism counting that does not exploit graph structure. Finally, we
empirically identify two particular ways in which future theoretical advances
could alleviate bottlenecks in the algorithmic pipeline.