64 research outputs found

    Wedge Sampling for Computing Clustering Coefficients and Triangle Counts on Large Graphs

    Full text link
    Graphs are used to model interactions in a variety of contexts, and there is a growing need to quickly assess the structure of such graphs. Some of the most useful graph metrics are based on triangles, such as those measuring social cohesion. Algorithms to compute them can be extremely expensive, even for moderately-sized graphs with only millions of edges. Previous work has considered node and edge sampling; in contrast, we consider wedge sampling, which provides faster and more accurate approximations than competing techniques. Additionally, wedge sampling enables estimation local clustering coefficients, degree-wise clustering coefficients, uniform triangle sampling, and directed triangle counts. Our methods come with provable and practical probabilistic error estimates for all computations. We provide extensive results that show our methods are both more accurate and faster than state-of-the-art alternatives.Comment: Full version of SDM 2013 paper "Triadic Measures on Graphs: The Power of Wedge Sampling" (arxiv:1202.5230

    Improved triangle counting in graph streams: Neighborhood multi-sampling

    Get PDF
    In this thesis, we study the problem of estimating the number of triangles of an undirected graph in the data stream model. Some of the well-known streaming algorithms work as follows: Sample a single triangle with high enough probability and repeat this basic step to obtain a global triangle count. For example, the neighborhood sampling algorithm attempts to sample a triangle by randomly choosing a single edge e, a single neighbor f of e and waits for a third edge that completes the triangle. The basic sampling step in the algorithm is repeated multiple times to obtain an estimate for the global triangle count in the input graph stream. In this work, we propose a multi-sampling variant of this algorithm. We provide a theoretical analysis of the algorithm and prove that it improves upon the known space and accuracy bounds. We experimentally show that this algorithm outperforms several well-known triangle counting streaming algorithms

    Triangle Counting in Dynamic Graph Streams

    Get PDF
    Estimating the number of triangles in graph streams using a limited amount of memory has become a popular topic in the last decade. Different variations of the problem have been studied, depending on whether the graph edges are provided in an arbitrary order or as incidence lists. However, with a few exceptions, the algorithms have considered {\em insert-only} streams. We present a new algorithm estimating the number of triangles in {\em dynamic} graph streams where edges can be both inserted and deleted. We show that our algorithm achieves better time and space complexity than previous solutions for various graph classes, for example sparse graphs with a relatively small number of triangles. Also, for graphs with constant transitivity coefficient, a common situation in real graphs, this is the first algorithm achieving constant processing time per edge. The result is achieved by a novel approach combining sampling of vertex triples and sparsification of the input graph. In the course of the analysis of the algorithm we present a lower bound on the number of pairwise independent 2-paths in general graphs which might be of independent interest. At the end of the paper we discuss lower bounds on the space complexity of triangle counting algorithms that make no assumptions on the structure of the graph.Comment: New version of a SWAT 2014 paper with improved result

    Triangle counting in graph streams: Power of multi-sampling

    Get PDF
    Some of the well known streaming algorithms to estimate number of triangles in a graph stream work as follows: Sample a single triangle with high enough probability and repeat this basic step to obtain a global triangle count. For example, one such algorithm uniformly at random picks a single vertex v and a single edge e and checks whether the two cross edges that connect v to e appear in the stream. In this algorithm, the basic sampling step is repeated multiple times to obtain an estimate for the global triangle count in the input graph stream. This work, proposes a multi-sampling variant of this algorithm: Instead of randomly choosing a single vertex and edge, randomly sample multiple vertices and multiple edges and collect cross edges that connect sampled vertices to the sampled edges. We provide a theoretical analysis of this algorithm and prove that this simple modification improves upon the known space and accuracy bounds. We experimentally show that the proposed algorithm out performs several well known triangle counting streaming algorithms

    Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution

    Full text link
    The degree distribution is one of the most fundamental graph properties of interest for real-world graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scale-free degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph, with small storage. We design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams will millions of edges with storage less than 1% and get extremely accurate approximations for all scales in the degree distribution. We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail
    corecore