10 research outputs found
Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection
We revisit the classic problem of estimating the degree distribution moments of an undirected graph. Consider an undirected graph $G=(V,E)$ with $n$ (non-isolated) vertices, and define (for $s > 0$) $\mu_s = \frac{1}{n} \sum_{v \in V} d_v^s$. Our aim is to estimate $\mu_s$ within a multiplicative error of $(1+\epsilon)$ (for a given approximation parameter $\epsilon > 0$) in sublinear time. We consider the sparse graph model that allows access to: uniform random vertices, queries for the degree of any vertex, and queries for a neighbor of any vertex. For the case of $s=1$ (the average degree), $\widetilde{O}(\sqrt{n})$ queries suffice for any constant $\epsilon$ (Feige, SICOMP 06 and Goldreich-Ron, RSA 08). Gonen-Ron-Shavitt (SIDMA 11) extended this result to all integral $s > 0$ by designing an algorithm that performs $\widetilde{O}(n^{1-1/(s+1)})$ queries. (Strictly speaking, their algorithm approximates the number of star-subgraphs of a given size, but a slight modification gives an algorithm for moments.)
We design a new, significantly simpler algorithm for this problem. In the worst case, it exactly matches the bounds of Gonen-Ron-Shavitt, and has a much simpler proof. More importantly, the running time of this algorithm is connected to the degeneracy of $G$, which is (essentially) the maximum density of an induced subgraph. For the family of graphs with degeneracy at most $\alpha$, it has a query complexity of $\widetilde{O}\left(\frac{n^{1-1/s}}{\mu_s^{1/s}} \Big(\alpha^{1/s} + \min\{\alpha, \mu_s^{1/s}\}\Big)\right) = \widetilde{O}(n^{1-1/s}\alpha/\mu_s^{1/s})$. Thus, for the class of bounded degeneracy graphs (which includes all minor-closed families and preferential attachment graphs), we can estimate the average degree in $\widetilde{O}(1)$ queries, and can estimate the variance of the degree distribution in $\widetilde{O}(\sqrt{n})$ queries. This is a major improvement over the previous worst-case bounds. Our key insight is in designing an estimator for $\mu_s$ that has low variance when $G$ does not have large dense subgraphs.
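To make the quantity in this abstract concrete, the following is a minimal Python sketch of the moment $\mu_s$ and of a naive estimator that averages $d_v^s$ over uniformly sampled vertices. It is not the algorithm from the paper: the names `exact_moment`, `naive_moment_estimate`, and `star_graph`, and the adjacency-list representation, are assumptions made for illustration only.

```python
import random
from typing import Dict, List

# Assumed representation: undirected graph as adjacency lists, no isolated vertices.
Graph = Dict[int, List[int]]


def exact_moment(graph: Graph, s: float) -> float:
    """Exact mu_s = (1/n) * sum over v of d_v^s; reads every vertex."""
    return sum(len(nbrs) ** s for nbrs in graph.values()) / len(graph)


def naive_moment_estimate(graph: Graph, s: float, samples: int,
                          rng: random.Random) -> float:
    """Baseline only: average d_v^s over uniformly random vertices.

    Unbiased, but its variance can be very large when a few vertices
    carry most of the sum.
    """
    vertices = list(graph)
    total = 0.0
    for _ in range(samples):
        v = rng.choice(vertices)          # uniform random vertex
        total += len(graph[v]) ** s       # degree query
    return total / samples


def star_graph(leaves: int) -> Graph:
    """Hypothetical test input: vertex 0 joined to `leaves` degree-1 vertices."""
    graph: Graph = {0: list(range(1, leaves + 1))}
    for leaf in range(1, leaves + 1):
        graph[leaf] = [0]
    return graph
```

On a star with 1000 leaves, `exact_moment(star_graph(1000), 2)` is 1000, yet almost all of that value comes from the single center vertex, so a small uniform sample usually misses it. This is the kind of variance problem that a more careful estimator has to control.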
Sampling and Counting Edges via Vertex Accesses
We consider the problems of sampling and counting edges from a graph on
vertices where our basic access is via uniformly sampled vertices. When we have
a vertex, we can see its degree, and access its neighbors. Eden and Rosenbaum
[SOSA 2018] have shown it is possible to sample an edge -uniformly in
vertex accesses. Here, we get down to
expected vertex accesses. Next, we
consider the problem of sampling edges. For this we introduce a model
that we call hash-based neighbor access. We show that, w.h.p, we can sample
edges exactly uniformly at random, with or without replacement, in
vertex accesses. We present a
matching lower bound of which holds
for -uniform edge multi-sampling with some constant even
though our positive result has .
We then give an algorithm for edge counting. W.h.p., we count the number of
edges to within error in time . When is not too small (for ), we present a near-matching lower-bound of
. In the same range, the previous best
upper and lower bounds were polynomially worse in .
Finally, we give an algorithm that instead of hash-based neighbor access uses
the more standard pair queries (``are vertices and adjacent''). W.h.p.
it returns approximation of the number of edges and runs in
expected time .
This matches our lower bound when is not too small, specifically for
.
Comment: This paper subsumes the arXiv report (arXiv:2009.11178) which only
contains the result on sampling one edge.
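As background for the access model in this abstract, here is a minimal Python sketch of the textbook rejection-sampling way to draw an exactly uniform edge using only uniform vertex samples, degree lookups, and neighbor accesses. It is not the algorithm of the paper above. The function name `sample_edge` and the parameter `degree_bound` (an assumed known upper bound on the maximum degree) are illustrative assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_edge(vertices: Sequence[int], graph: Dict[int, List[int]],
                degree_bound: int, rng: random.Random) -> Tuple[int, int]:
    """Return a uniformly random directed edge (v, u).

    Assumes degree_bound >= the maximum degree and the graph has at least
    one edge. Each directed edge is returned with probability proportional
    to (1/n) * (d_v / degree_bound) * (1 / d_v) = 1 / (n * degree_bound),
    which is the same for every edge.
    """
    while True:
        v = rng.choice(vertices)                 # uniform random vertex
        d = len(graph[v])                        # degree query
        if rng.random() * degree_bound < d:      # accept with prob d / degree_bound
            return v, rng.choice(graph[v])       # uniform random neighbor
```

The expected number of loop iterations is `n * degree_bound / (2 * m)`, so this simple scheme is only efficient when the degree bound is close to the average degree. Reducing that cost on skewed degree sequences is where more refined methods are needed.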
Parallel Algorithms for Small Subgraph Counting
Subgraph counting is a fundamental problem in analyzing massive graphs, often
studied in the context of social and complex networks. There is a rich
literature on designing efficient, accurate, and scalable algorithms for this
problem. In this work, we tackle this challenge and design several new
algorithms for subgraph counting in the Massively Parallel Computation (MPC)
model:
Given a graph over vertices, edges and triangles, our first
main result is an algorithm that, with high probability, outputs a
-approximation to , with optimal round and space complexity
provided any space per machine, assuming
.
Our second main result is an -rounds
algorithm for exactly counting the number of triangles, parametrized by the
arboricity of the input graph. The space per machine is
for any constant , and the total space is ,
which matches the time complexity of (combinatorial) triangle counting in the
sequential model. We also prove that this result can be extended to exactly
counting -cliques for any constant , with the same round complexity and
total space . Alternatively, allowing space per
machine, the total space requirement reduces to .
Finally, we prove that a recent result of Bera, Pashanasangi and Seshadhri
(ITCS 2020) for exactly counting all subgraphs of size at most , can be
implemented in the MPC model in rounds,
space per machine and total space. Therefore,
this result also exhibits the phenomenon that a time bound in the sequential
model translates to a space bound in the MPC model.
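For reference, the sequential combinatorial triangle counting that this abstract compares against can be sketched as follows. This is a standard degree-ordering method written in Python, not the MPC algorithm from the paper, and the function name `count_triangles` and the adjacency-list input are assumptions. It assumes a simple undirected graph with comparable vertex ids.

```python
from typing import Dict, List, Set


def count_triangles(graph: Dict[int, List[int]]) -> int:
    """Count triangles exactly by orienting edges from lower to higher rank."""
    # Rank vertices by (degree, id); ties are broken by the vertex id.
    rank = {v: (len(nbrs), v) for v, nbrs in graph.items()}

    # Keep, for each vertex, only the neighbors that come later in the ranking.
    higher: Dict[int, Set[int]] = {
        v: {u for u in nbrs if rank[u] > rank[v]} for v, nbrs in graph.items()
    }

    triangles = 0
    for v, out_v in higher.items():
        for u in out_v:
            # Common higher-ranked neighbors of v and u close a triangle.
            # Each triangle is counted once, at its two lowest-ranked vertices.
            triangles += len(out_v & higher[u])
    return triangles
```

Each intersection costs at most the smaller of the two endpoint degrees, and the sum of min(d_u, d_v) over all edges is known to be O(m * arboricity) (Chiba and Nishizeki), which is the sequential time bound the total-space bound is being matched against.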
Sublinear-Time Algorithms for Counting Star Subgraphs via Edge Sampling
We study the problem of estimating the value of sums of the form $S_p \triangleq \sum \binom{x_i}{p}$ when one has the ability to sample $x_i \geq 0$ with probability proportional to its magnitude. When $p=2$, this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when $\{x_i\}$ is the degree sequence of a graph, which corresponds to counting the number of $p$-stars in a graph when one has the ability to sample edges randomly. Our algorithm for a $(1 \pm \epsilon)$-multiplicative approximation of $S_p$ has query and time complexities $O(m \log\log n / (\epsilon^2 S_p^{1/p}))$. Here, $m = \sum x_i / 2$ is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, $n$ is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when $\{x_i\}$ is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds (Gonen et al. in SIAM J Comput 25:1365–1411, 2011). With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where $S_p \leq n$ and $p=2$, our upper bound is $\widetilde{O}(n/S_p^{1/2})$, in contrast to their $\Omega(n/S_p^{1/3})$ lower bound when no random edge queries are available.
In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time. However, when the ratio between in-degree and out-degree is bounded—or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded—we give a sublinear time algorithm via a reduction to the undirected case.
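The sum $S_p$ and the effect of sampling proportionally to magnitude can be illustrated with a short Python sketch. It is a plain importance-sampling estimator, not the algorithm analyzed in the paper, and it assumes the total $\sum x_i = 2m$ is known. The function names `exact_star_count` and `estimate_star_count` are hypothetical.

```python
import math
import random
from typing import Sequence


def exact_star_count(degrees: Sequence[int], p: int) -> int:
    """Exact S_p = sum over i of C(x_i, p), the number of p-stars."""
    return sum(math.comb(d, p) for d in degrees)


def estimate_star_count(degrees: Sequence[int], p: int, samples: int,
                        rng: random.Random) -> float:
    """Unbiased estimate of S_p from magnitude-proportional samples.

    Sampling x_i with probability x_i / (2m) mimics "pick a random edge,
    then a random endpoint". Since E[C(x, p) / x] = S_p / (2m), rescaling
    the sample mean by 2m gives an unbiased estimate.
    Assumption: the total degree 2m is known to the estimator.
    """
    total_degree = sum(degrees)  # equals 2m
    picks = rng.choices(degrees, weights=degrees, k=samples)
    return total_degree * sum(math.comb(d, p) / d for d in picks) / samples
```

For a star with four leaves, `exact_star_count([4, 1, 1, 1, 1], 2)` is 6, all of it from the center. Magnitude-proportional sampling hits that center half of the time, whereas uniform vertex sampling hits it only one time in five, which gives some intuition for why edge samples help.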
Keywords: Subgraphs, Approximate counting, Randomized algorithms, Sublinear-time algorithms
National Science Foundation (U.S.). Graduate Research Fellowship Program (Grant CCF-1217423)
National Science Foundation (U.S.). Graduate Research Fellowship Program (Grant CCF-1065125)
National Science Foundation (U.S.). Graduate Research Fellowship Program (Grant CCF-1420692)
National Science Foundation (U.S.). Graduate Research Fellowship Program (Grant CCF-1122374