64 research outputs found
Wedge Sampling for Computing Clustering Coefficients and Triangle Counts on Large Graphs
Graphs are used to model interactions in a variety of contexts, and there is
a growing need to quickly assess the structure of such graphs. Some of the most
useful graph metrics are based on triangles, such as those measuring social
cohesion. Algorithms to compute them can be extremely expensive, even for
moderately-sized graphs with only millions of edges. Previous work has
considered node and edge sampling; in contrast, we consider wedge sampling,
which provides faster and more accurate approximations than competing
techniques. Additionally, wedge sampling enables estimation local clustering
coefficients, degree-wise clustering coefficients, uniform triangle sampling,
and directed triangle counts. Our methods come with provable and practical
probabilistic error estimates for all computations. We provide extensive
results that show our methods are both more accurate and faster than
state-of-the-art alternatives.Comment: Full version of SDM 2013 paper "Triadic Measures on Graphs: The Power
of Wedge Sampling" (arxiv:1202.5230
Improved triangle counting in graph streams: Neighborhood multi-sampling
In this thesis, we study the problem of estimating the number of triangles of an undirected graph in the data stream model. Some of the well-known streaming algorithms work as follows: Sample a single triangle with high enough probability and repeat this basic step to obtain a global triangle count. For example, the neighborhood sampling algorithm attempts to sample a triangle by randomly choosing a single edge e, a single neighbor f of e and waits for a third edge that completes the triangle. The basic sampling step in the algorithm is repeated multiple times to obtain an estimate for the global triangle count in the input graph stream. In this work, we propose a multi-sampling variant of this algorithm. We provide a theoretical analysis of the algorithm and prove that it improves upon the known space and accuracy bounds. We experimentally show that this algorithm outperforms several well-known triangle counting streaming algorithms
Triangle Counting in Dynamic Graph Streams
Estimating the number of triangles in graph streams using a limited amount of
memory has become a popular topic in the last decade. Different variations of
the problem have been studied, depending on whether the graph edges are
provided in an arbitrary order or as incidence lists. However, with a few
exceptions, the algorithms have considered {\em insert-only} streams. We
present a new algorithm estimating the number of triangles in {\em dynamic}
graph streams where edges can be both inserted and deleted. We show that our
algorithm achieves better time and space complexity than previous solutions for
various graph classes, for example sparse graphs with a relatively small number
of triangles. Also, for graphs with constant transitivity coefficient, a common
situation in real graphs, this is the first algorithm achieving constant
processing time per edge. The result is achieved by a novel approach combining
sampling of vertex triples and sparsification of the input graph. In the course
of the analysis of the algorithm we present a lower bound on the number of
pairwise independent 2-paths in general graphs which might be of independent
interest. At the end of the paper we discuss lower bounds on the space
complexity of triangle counting algorithms that make no assumptions on the
structure of the graph.Comment: New version of a SWAT 2014 paper with improved result
Triangle counting in graph streams: Power of multi-sampling
Some of the well known streaming algorithms to estimate number of triangles in a graph stream work as follows: Sample a single triangle with high enough probability and repeat this basic step to obtain a global triangle count. For example, one such algorithm uniformly at random picks a single vertex v and a single edge e and checks whether the two cross edges that connect v to e appear in the stream. In this algorithm, the basic sampling step is repeated multiple times to obtain an estimate for the global triangle count in the input graph stream. This work, proposes a multi-sampling variant of this algorithm: Instead of randomly choosing a single vertex and edge, randomly sample multiple vertices and multiple edges and collect cross edges that connect sampled vertices to the sampled edges. We provide a theoretical analysis of this algorithm and prove that this simple modification improves upon the known space and accuracy bounds. We experimentally show that the proposed algorithm out performs several well known triangle counting streaming algorithms
Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution
The degree distribution is one of the most fundamental graph properties of
interest for real-world graphs. It has been widely observed in numerous domains
that graphs typically have a tailed or scale-free degree distribution. While
the average degree is usually quite small, the variance is quite high and there
are vertices with degrees at all scales. We focus on the problem of
approximating the degree distribution of a large streaming graph, with small
storage. We design an algorithm headtail, whose main novelty is a new estimator
of infrequent degrees using truncated geometric random variables. We give a
mathematical analysis of headtail and show that it has excellent behavior in
practice. We can process streams will millions of edges with storage less than
1% and get extremely accurate approximations for all scales in the degree
distribution.
We also introduce a new notion of Relative Hausdorff distance between tailed
histograms. Existing notions of distances between distributions are not
suitable, since they ignore infrequent degrees in the tail. The Relative
Hausdorff distance measures deviations at all scales, and is a more suitable
distance for comparing degree distributions. By tracking this new measure, we
are able to give strong empirical evidence of the convergence of headtail
- …