Estimating Graphlet Statistics via Lifting
Exploratory analysis over network data is often limited by the ability to
efficiently calculate graph statistics, which can provide a model-free
understanding of the macroscopic properties of a network. We introduce a
framework for estimating the graphlet count, the number of occurrences of a
small subgraph motif (e.g. a wedge or a triangle) in the network. For massive
graphs, where accessing the whole graph is not possible, the only viable
algorithms are those that make a limited number of vertex neighborhood queries.
We introduce a Monte Carlo sampling technique for graphlet counts, called
Lifting, which can simultaneously sample all graphlets of size up to k
vertices for arbitrary k. This is the first graphlet sampling method that can
provably sample every graphlet with positive probability and can sample
graphlets of arbitrary size k. We outline variants of lifted graphlet counts,
including the ordered, unordered, and shotgun estimators, random walk starts,
and parallel vertex starts. We prove that our graphlet count updates are
unbiased for the true graphlet count and have a controlled variance for all
graphlets. We compare the experimental performance of lifted graphlet counts to
the state-of-the-art graphlet sampling procedures: Waddling and the pairwise
subgraph random walk.
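The idea of an unbiased count built only from vertex neighborhood queries can be illustrated with a minimal sketch. This is not the Lifting procedure from the abstract: it is a simplified inverse-probability estimator for triangles only, and the function name `estimate_triangles` and the adjacency-dictionary input format are assumptions made for this illustration.

```python
import random


def estimate_triangles(adj, num_samples, rng=None):
    """Estimate the triangle count of an undirected graph.

    adj maps each vertex to the set of its neighbors (an assumed input
    format). Each sample picks a vertex uniformly at random, then an
    unordered pair of its neighbors uniformly at random, and checks
    whether that pair is adjacent.
    """
    rng = rng or random.Random()
    vertices = list(adj)
    n = len(vertices)
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice(vertices)      # probability 1/n
        nbrs = list(adj[v])           # one neighborhood query
        d = len(nbrs)
        if d < 2:
            continue                  # no wedge centered at v; contributes 0
        u, w = rng.sample(nbrs, 2)    # pair probability 1 / C(d, 2)
        if w in adj[u]:
            # Weight by the inverse sampling probability n * C(d, 2).
            # Every triangle can be reached from each of its 3 vertices,
            # so divide by 3 to keep the estimator unbiased.
            total += n * (d * (d - 1) / 2) / 3
    return total / num_samples
```

Averaging more samples reduces the variance. On a complete graph with four vertices every sample has the same weight, so the estimate equals the true count of 4 regardless of the random choices.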
Motif counting beyond five nodes
Counting graphlets is a well-studied problem in graph mining and social network analysis. Recently, several papers explored very simple and natural algorithms based on Monte Carlo sampling of Markov Chains (MC), and reported encouraging results. We show, perhaps surprisingly, that such algorithms are outperformed by color coding (CC) [2], a sophisticated algorithmic technique that we extend to the case of graphlet sampling and for which we prove strong statistical guarantees. Our computational experiments on graphs with millions of nodes show CC to be more accurate than MC; furthermore, we formally show that the mixing time of the MC approach is too high in general, even when the input graph has high conductance. All this comes at a price, however. While MC is very efficient in terms of space, CC’s memory requirements become demanding when the size of the input graph and that of the graphlets grow. And yet, our experiments show that CC can push the limits of the state-of-the-art, both in terms of the size of the input graph and of that of the graphlets.
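As a rough illustration of the color coding idea, the sketch below estimates the number of simple paths on k vertices. It is a textbook-style simplification, not the graphlet sampling extension described in the abstract: it handles only paths (not arbitrary treelets or induced graphlets), and the names `count_colorful_paths` and `estimate_paths`, as well as the adjacency-dictionary input, are assumptions made for this example. It assumes k is at least 2.

```python
import random
from math import factorial


def count_colorful_paths(adj, color, k):
    """Count undirected k-vertex paths whose vertices all have distinct colors.

    adj maps each vertex to the set of its neighbors; color maps each
    vertex to an integer in range(k). Assumes k >= 2.
    """
    full = (1 << k) - 1
    # table[v][mask] = number of colorful paths ending at v whose
    # color set is exactly the bitmask `mask`.
    table = {v: {1 << color[v]: 1} for v in adj}
    for _ in range(k - 1):
        nxt = {v: {} for v in adj}
        for v in adj:
            bit = 1 << color[v]
            for u in adj[v]:
                for mask, cnt in table[u].items():
                    if not mask & bit:  # color of v is not used yet
                        nxt[v][mask | bit] = nxt[v].get(mask | bit, 0) + cnt
        table = nxt
    directed = sum(t.get(full, 0) for t in table.values())
    return directed // 2  # each undirected path is found from both ends


def estimate_paths(adj, k, trials, rng=None):
    """Average the rescaled colorful counts over random colorings."""
    rng = rng or random.Random()
    p_colorful = factorial(k) / k ** k  # chance a fixed path is colorful
    total = 0.0
    for _ in range(trials):
        color = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, color, k) / p_colorful
    return total / trials
```

The dynamic program stores one counter per (vertex, color subset) pair, which hints at why memory grows quickly with both the graph size and k, in line with the trade-off the abstract describes.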
Motivo: Fast Motif Counting via Succinct Color Coding and Adaptive Sampling
The randomized technique of color coding is behind state-of-the-art
algorithms for estimating graph motif counts. Those algorithms, however, are
not yet capable of scaling well to very large graphs with billions of edges. In
this paper we develop novel tools for the 'motif counting via color coding'
framework. As a result, our new algorithm, Motivo, is able to scale well to
larger graphs while at the same time providing more accurate graphlet counts than
ever before. This is achieved thanks to two types of improvements. First, we
design new succinct data structures that support fast common color coding
operations, and a biased coloring trick that trades accuracy for running
time and memory usage. These adaptations drastically reduce the time and memory
requirements of color coding. Second, we develop an adaptive graphlet sampling
strategy, based on a fractional set cover problem, that breaks the additive
approximation barrier of standard sampling. This strategy gives multiplicative
approximations for all graphlets at once, allowing us to count not only the
most frequent graphlets but also extremely rare ones.
To give an idea of the improvements, in minutes Motivo counts -nodes
motifs on a graph with M nodes and B edges; this is and
times larger than the state of the art, respectively in terms of nodes and
edges. On the accuracy side, in one hour Motivo produces accurate counts of
distinct -node motifs on graphs where state-of-the-art
algorithms fail even to find the second most frequent motif. Our method
requires just a high-end desktop machine. These results show how color coding
can bring motif mining to the realm of truly massive graphs using only ordinary
hardware.
Comment: 13 page