
    Estimating Graphlet Statistics via Lifting

    Exploratory analysis over network data is often limited by the ability to efficiently calculate graph statistics, which can provide a model-free understanding of the macroscopic properties of a network. We introduce a framework for estimating the graphlet count: the number of occurrences of a small subgraph motif (e.g. a wedge or a triangle) in the network. For massive graphs, where accessing the whole graph is not possible, the only viable algorithms are those that make a limited number of vertex neighborhood queries. We introduce a Monte Carlo sampling technique for graphlet counts, called Lifting, which can simultaneously sample all graphlets of size up to k vertices for arbitrary k. This is the first graphlet sampling method that can provably sample every graphlet with positive probability and can sample graphlets of arbitrary size k. We outline variants of lifted graphlet counts, including the ordered, unordered, and shotgun estimators, random walk starts, and parallel vertex starts. We prove that our graphlet count updates are unbiased for the true graphlet count and have a controlled variance for all graphlets. We compare the experimental performance of lifted graphlet counts to the state-of-the-art graphlet sampling procedures: Waddling and the pairwise subgraph random walk.
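To make the idea of an unbiased estimator built from vertex neighborhood queries concrete, here is a minimal sketch for triangles only. It is not the Lifting procedure from the abstract; it is a simpler inverse-probability-weighted sampler chosen for illustration. The function name `estimate_triangles` and the `neighbors` callback are hypothetical, and the sketch assumes the vertex list is known so a start vertex can be drawn uniformly.

```python
import random

def estimate_triangles(neighbors, vertices, samples, rng):
    """Estimate the triangle count using only neighborhood queries.

    Illustrative stand-in, NOT the Lifting estimator: draw a vertex v
    uniformly, draw an unordered pair of its neighbors uniformly, and
    weight a closed pair by the inverse of its sampling probability.
    Each triangle can be reached from any of its 3 vertices, hence the
    division by 3.
    """
    n = len(vertices)
    total = 0.0
    for _ in range(samples):
        v = rng.choice(vertices)
        nbrs = neighbors(v)          # one neighborhood query
        d = len(nbrs)
        if d < 2:
            continue                 # no wedge centered at v
        a, b = rng.sample(nbrs, 2)
        if b in neighbors(a):        # second query: is the wedge closed?
            pairs = d * (d - 1) // 2
            total += n * pairs / 3
    return total / samples

# Complete graph on 4 vertices: every wedge is closed, so every sample
# contributes exactly 4 * 3 / 3 = 4, the true triangle count.
k4 = {v: [u for u in range(4) if u != v] for v in range(4)}
print(estimate_triangles(lambda v: k4[v], list(k4), 50, random.Random(0)))  # 4.0
```

On graphs that are not complete, individual samples vary and only the average converges to the true count; the variance of this naive sampler can be large on skewed degree distributions.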

    Motif counting beyond five nodes

    Counting graphlets is a well-studied problem in graph mining and social network analysis. Recently, several papers explored very simple and natural algorithms based on Monte Carlo sampling of Markov Chains (MC), and reported encouraging results. We show, perhaps surprisingly, that such algorithms are outperformed by color coding (CC) [2], a sophisticated algorithmic technique that we extend to the case of graphlet sampling and for which we prove strong statistical guarantees. Our computational experiments on graphs with millions of nodes show CC to be more accurate than MC; furthermore, we formally show that the mixing time of the MC approach is too high in general, even when the input graph has high conductance. All this comes at a price, however. While MC is very efficient in terms of space, CC’s memory requirements become demanding when the size of the input graph and that of the graphlets grow. And yet, our experiments show that CC can push the limits of the state-of-the-art, both in terms of the size of the input graph and of that of the graphlets.
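As a rough illustration of the color coding idea, the sketch below applies the classic technique to the simplest motif, a path on k vertices. It is not the graphlet-sampling extension described in the abstract. Vertices are colored uniformly at random with k colors, "colorful" paths (all colors distinct) are counted by dynamic programming over color subsets, and the count is rescaled by k^k / k!, the inverse of the probability that a fixed k-vertex path is colorful. The function names are hypothetical, and the sketch assumes k >= 2 and an undirected graph given as an adjacency dict.

```python
import random
from math import factorial

def count_colorful_paths(adj, colors, k):
    """Count k-vertex paths whose vertices all carry distinct colors (k >= 2)."""
    # table[v][mask] = number of colorful paths ending at v that use
    # exactly the colors in the bitmask `mask`.
    table = {v: {1 << colors[v]: 1} for v in adj}
    for _ in range(k - 1):
        nxt = {v: {} for v in adj}
        for u in adj:
            for mask, cnt in table[u].items():
                for v in adj[u]:
                    bit = 1 << colors[v]
                    if mask & bit:
                        continue     # color already used: not colorful
                    nxt[v][mask | bit] = nxt[v].get(mask | bit, 0) + cnt
        table = nxt
    directed = sum(cnt for v in adj for cnt in table[v].values())
    return directed // 2             # each path is found from both ends

def estimate_k_paths(adj, k, trials, rng):
    """Average the rescaled colorful-path count over random colorings."""
    scale = k ** k / factorial(k)
    total = 0.0
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        total += count_colorful_paths(adj, colors, k) * scale
    return total / trials

# A 3-vertex path with three distinct colors contains one colorful path.
path = {0: [1], 1: [0, 2], 2: [1]}
print(count_colorful_paths(path, {0: 0, 1: 1, 2: 2}, 3))  # 1
print(estimate_k_paths(path, 3, 2000, random.Random(0)))  # random, near 1
```

The table holds one entry per (vertex, color subset) pair, which hints at why memory grows quickly with both graph size and motif size, the trade-off the abstract mentions.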

    Motivo: Fast Motif Counting via Succinct Color Coding and Adaptive Sampling

    The randomized technique of color coding is behind state-of-the-art algorithms for estimating graph motif counts. Those algorithms, however, are not yet capable of scaling well to very large graphs with billions of edges. In this paper we develop novel tools for the 'motif counting via color coding' framework. As a result, our new algorithm, Motivo, is able to scale well to larger graphs while at the same time providing more accurate graphlet counts than ever before. This is achieved thanks to two types of improvements. First, we design new succinct data structures that support fast common color coding operations, and a biased coloring trick that trades accuracy versus running time and memory usage. These adaptations drastically reduce the time and memory requirements of color coding. Second, we develop an adaptive graphlet sampling strategy, based on a fractional set cover problem, that breaks the additive approximation barrier of standard sampling. This strategy gives multiplicative approximations for all graphlets at once, allowing us to count not only the most frequent graphlets but also extremely rare ones. To give an idea of the improvements, in 40 minutes Motivo counts 7-node motifs on a graph with 65M nodes and 1.8B edges; this is 30 and 500 times larger than the state of the art, respectively in terms of nodes and edges. On the accuracy side, in one hour Motivo produces accurate counts of ≈ 10,000 distinct 8-node motifs on graphs where state-of-the-art algorithms fail even to find the second most frequent motif. Our method requires just a high-end desktop machine. These results show how color coding can bring motif mining to the realm of truly massive graphs using only ordinary hardware. Comment: 13 page