
    Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection

    We revisit the classic problem of estimating the degree distribution moments of an undirected graph. Consider an undirected graph $G=(V,E)$ with $n$ (non-isolated) vertices, and define (for $s > 0$) $\mu_s = \frac{1}{n} \sum_{v \in V} d_v^s$. Our aim is to estimate $\mu_s$ within a multiplicative error of $(1+\epsilon)$ (for a given approximation parameter $\epsilon > 0$) in sublinear time. We consider the sparse graph model that allows access to: uniform random vertices, queries for the degree of any vertex, and queries for a neighbor of any vertex. For the case of $s=1$ (the average degree), $\widetilde{O}(\sqrt{n})$ queries suffice for any constant $\epsilon$ (Feige, SICOMP 06 and Goldreich-Ron, RSA 08). Gonen-Ron-Shavitt (SIDMA 11) extended this result to all integral $s > 0$ by designing an algorithm that performs $\widetilde{O}(n^{1-1/(s+1)})$ queries. (Strictly speaking, their algorithm approximates the number of star-subgraphs of a given size, but a slight modification gives an algorithm for moments.) We design a new, significantly simpler algorithm for this problem. In the worst case, it exactly matches the bounds of Gonen-Ron-Shavitt, and has a much simpler proof. More importantly, the running time of this algorithm is connected to the degeneracy of $G$. This is (essentially) the maximum density of an induced subgraph. For the family of graphs with degeneracy at most $\alpha$, it has a query complexity of $\widetilde{O}\left(\frac{n^{1-1/s}}{\mu_s^{1/s}} \left(\alpha^{1/s} + \min\{\alpha, \mu_s^{1/s}\}\right)\right) = \widetilde{O}\left(n^{1-1/s}\alpha/\mu_s^{1/s}\right)$. Thus, for the class of bounded degeneracy graphs (which includes all minor closed families and preferential attachment graphs), we can estimate the average degree in $\widetilde{O}(1)$ queries, and can estimate the variance of the degree distribution in $\widetilde{O}(\sqrt{n})$ queries. This is a major improvement over the previous worst-case bounds. Our key insight is in designing an estimator for $\mu_s$ that has low variance when $G$ does not have large dense subgraphs.
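    To make the two quantities in this abstract concrete, here is a minimal Python sketch that computes the moment $\mu_s$ and the degeneracy exactly by reading the whole graph. It is not the sublinear estimator from the paper; the function names `degree_moment` and `degeneracy`, and the adjacency-dictionary input with integer vertex labels, are assumptions made for illustration.

```python
import heapq


def degree_moment(adj, s):
    """Exact mu_s = (1/n) * sum of d_v^s over all vertices (reads every vertex)."""
    n = len(adj)
    return sum(len(nbrs) ** s for nbrs in adj.values()) / n


def degeneracy(adj):
    """Degeneracy via repeated removal of a minimum-degree vertex.

    The result is the largest degree seen at removal time. Stale heap
    entries are skipped lazily instead of being updated in place.
    """
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed = set()
    best = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # outdated entry for a vertex whose degree has changed
        removed.add(v)
        best = max(best, d)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return best


# A star with center 0 and four leaves: degrees are 4, 1, 1, 1, 1.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
```

    On the star, `degree_moment(star, 2)` is $(16 + 4)/5 = 4$ while the degeneracy is 1, which is the kind of graph (large second moment, no dense subgraph) where the degeneracy-dependent bound is favorable.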

    Sampling and Counting Edges via Vertex Accesses

    We consider the problems of sampling and counting edges from a graph on $n$ vertices where our basic access is via uniformly sampled vertices. When we have a vertex, we can see its degree, and access its neighbors. Eden and Rosenbaum [SOSA 2018] have shown it is possible to sample an edge $\epsilon$-uniformly in $O(\sqrt{1/\epsilon}\,\frac{n}{\sqrt{m}})$ vertex accesses. Here, we get down to expected $O(\log(1/\epsilon)\,\frac{n}{\sqrt{m}})$ vertex accesses. Next, we consider the problem of sampling $s>1$ edges. For this we introduce a model that we call hash-based neighbor access. We show that, w.h.p., we can sample $s$ edges exactly uniformly at random, with or without replacement, in $\tilde{O}(\sqrt{s}\,\frac{n}{\sqrt{m}} + s)$ vertex accesses. We present a matching lower bound of $\Omega(\sqrt{s}\,\frac{n}{\sqrt{m}} + s)$ which holds for $\epsilon$-uniform edge multi-sampling with some constant $\epsilon>0$ even though our positive result has $\epsilon=0$. We then give an algorithm for edge counting. W.h.p., we count the number of edges to within error $\epsilon$ in time $\tilde{O}(\frac{n}{\epsilon\sqrt{m}} + \frac{1}{\epsilon^2})$. When $\epsilon$ is not too small (for $\epsilon \geq \frac{\sqrt{m}}{n}$), we present a near-matching lower bound of $\Omega(\frac{n}{\epsilon\sqrt{m}})$. In the same range, the previous best upper and lower bounds were polynomially worse in $\epsilon$. Finally, we give an algorithm that instead of hash-based neighbor access uses the more standard pair queries ("are vertices $u$ and $v$ adjacent"). W.h.p. it returns a $(1+\epsilon)$-approximation of the number of edges and runs in expected time $\tilde{O}(\frac{n}{\epsilon\sqrt{m}} + \frac{1}{\epsilon^4})$. This matches our lower bound when $\epsilon$ is not too small, specifically for $\epsilon \geq \frac{m^{1/6}}{n^{1/3}}$.
    Comment: This paper subsumes the arXiv report (arXiv:2009.11178) which only contains the result on sampling one edge.
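    As a point of reference for the access model described here (uniform vertices, degree lookups, neighbor access), the following Python sketch shows the textbook rejection-sampling baseline for drawing one exactly uniform directed edge. It is not the algorithm of this paper or of Eden and Rosenbaum; it assumes a known upper bound `max_degree` on all degrees and a graph with at least one edge, and it uses about $n \cdot \texttt{max\_degree} / (2m)$ vertex accesses in expectation. The name `sample_edge` is hypothetical.

```python
import random


def sample_edge(adj, vertices, max_degree, rng):
    """Return ((v, u), accesses): a uniform directed edge and the accesses used.

    Each round picks a uniform vertex v, keeps it with probability
    deg(v) / max_degree, and then picks a uniform neighbor u. Every
    directed edge is therefore returned with probability 1 / (n * max_degree)
    per round, so the output is exactly uniform over directed edges.
    Assumes max_degree >= every degree and that the graph has an edge.
    """
    accesses = 0
    while True:
        v = rng.choice(vertices)
        accesses += 1
        if rng.random() < len(adj[v]) / max_degree:
            u = rng.choice(adj[v])
            return (v, u), accesses


# Path 0 - 1 - 2: four directed edges, each expected about 1/4 of the time.
path = {0: [1], 1: [0, 2], 2: [1]}
rng = random.Random(0)
counts = {}
for _ in range(4000):
    edge, _used = sample_edge(path, list(path), 2, rng)
    counts[edge] = counts.get(edge, 0) + 1
```

    The baseline degrades when a few vertices have very high degree, since `max_degree` then dominates the rejection rate; the bounds quoted in the abstract depend on $n/\sqrt{m}$ instead.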

    Parallel Algorithms for Small Subgraph Counting

    Subgraph counting is a fundamental problem in analyzing massive graphs, often studied in the context of social and complex networks. There is a rich literature on designing efficient, accurate, and scalable algorithms for this problem. In this work, we tackle this challenge and design several new algorithms for subgraph counting in the Massively Parallel Computation (MPC) model: Given a graph $G$ over $n$ vertices, $m$ edges and $T$ triangles, our first main result is an algorithm that, with high probability, outputs a $(1+\varepsilon)$-approximation to $T$, with optimal round and space complexity provided any $S \geq \max(\sqrt{m}, n^2/m)$ space per machine, assuming $T=\Omega(\sqrt{m/n})$. Our second main result is an $\tilde{O}_{\delta}(\log \log n)$-rounds algorithm for exactly counting the number of triangles, parametrized by the arboricity $\alpha$ of the input graph. The space per machine is $O(n^{\delta})$ for any constant $\delta$, and the total space is $O(m\alpha)$, which matches the time complexity of (combinatorial) triangle counting in the sequential model. We also prove that this result can be extended to exactly counting $k$-cliques for any constant $k$, with the same round complexity and total space $O(m\alpha^{k-2})$. Alternatively, allowing $O(\alpha^2)$ space per machine, the total space requirement reduces to $O(n\alpha^2)$. Finally, we prove that a recent result of Bera, Pashanasangi and Seshadhri (ITCS 2020) for exactly counting all subgraphs of size at most $5$ can be implemented in the MPC model in $\tilde{O}_{\delta}(\sqrt{\log n})$ rounds, $O(n^{\delta})$ space per machine and $O(m\alpha^3)$ total space. Therefore, this result also exhibits the phenomenon that a time bound in the sequential model translates to a space bound in the MPC model.
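    The sequential $O(m\alpha)$ bound for combinatorial triangle counting that this abstract refers to can be illustrated with a short single-machine Python sketch. It is not the MPC algorithm of the paper: it orients each edge from the lower-ranked to the higher-ranked endpoint (ranked by degree, ties broken by vertex label) and intersects out-neighborhoods, so the work per edge is at most the smaller endpoint degree. The function name `count_triangles` and the adjacency-dictionary input are assumptions.

```python
def count_triangles(adj):
    """Count triangles exactly using a degree-based edge orientation.

    Every triangle has a unique lowest-ranked vertex a and middle vertex b;
    it is counted once, when the edge a -> b is processed and the third
    vertex shows up in both out-neighborhoods.
    """
    rank = {v: (len(adj[v]), v) for v in adj}
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    total = 0
    for v in adj:
        for u in out[v]:
            # Set intersection costs roughly the size of the smaller set.
            total += len(out[v] & out[u])
    return total


k4 = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
path = {0: [1], 1: [0, 2], 2: [1]}
```

    On the complete graph `k4` this returns 4, and on the path it returns 0.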

    Sublinear-Time Algorithms for Counting Star Subgraphs via Edge Sampling

    We study the problem of estimating the value of sums of the form $S_p \triangleq \sum_i \binom{x_i}{p}$ when one has the ability to sample $x_i \geq 0$ with probability proportional to its magnitude. When $p=2$, this problem is equivalent to estimating the selectivity of a self-join query in database systems when one can sample rows randomly. We also study the special case when $\{x_i\}$ is the degree sequence of a graph, which corresponds to counting the number of $p$-stars in a graph when one has the ability to sample edges randomly. Our algorithm for a $(1 \pm \epsilon)$-multiplicative approximation of $S_p$ has query and time complexities $O\left(\frac{m \log\log n}{\epsilon^2 S_p^{1/p}}\right)$. Here, $m=\sum_i x_i/2$ is the number of edges in the graph, or equivalently, half the number of records in the database table. Similarly, $n$ is the number of vertices in the graph and the number of unique values in the database table. We also provide tight lower bounds (up to polylogarithmic factors) in almost all cases, even when $\{x_i\}$ is a degree sequence and one is allowed to use the structure of the graph to try to get a better estimate. We are not aware of any prior lower bounds on the problem of join selectivity estimation. For the graph problem, prior work which assumed the ability to sample only vertices uniformly gave algorithms with matching lower bounds (Gonen et al. in SIAM J Comput 25:1365–1411, 2011). With the ability to sample edges randomly, we show that one can achieve faster algorithms for approximating the number of star subgraphs, bypassing the lower bounds in this prior work. For example, in the regime where $S_p \leq n$ and $p=2$, our upper bound is $\tilde{O}(n/S_p^{1/2})$, in contrast to their $\Omega(n/S_p^{1/3})$ lower bound when no random edge queries are available.
    In addition, we consider the problem of counting the number of directed paths of length two when the graph is directed. This problem is equivalent to estimating the selectivity of a join query between two distinct tables. We prove that the general version of this problem cannot be solved in sublinear time. However, when the ratio between in-degree and out-degree is bounded (or equivalently, when the ratio between the number of occurrences of values in the two columns being joined is bounded) we give a sublinear time algorithm via a reduction to the undirected case.
    Keywords: Subgraphs, Approximate counting, Randomized algorithms, Sublinear-time algorithms
    National Science Foundation (U.S.). Graduate Research Fellowship Program (Grants CCF-1217423, CCF-1065125, CCF-1420692, CCF-1122374)
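    The sum $S_p$ and the effect of sampling proportionally to magnitude can be illustrated with a small Python sketch. It computes $S_p$ exactly and also runs a naive unbiased estimator: if index $i$ is drawn with probability $x_i / X$, where $X = \sum_i x_i$, then $X \cdot \binom{x_i}{p} / x_i$ has expectation $S_p$. This is not the algorithm analyzed in the paper, and it assumes $X$ is known exactly and that the values are integers; the names `star_count` and `estimate_star_count` are hypothetical.

```python
import math
import random


def star_count(degrees, p):
    """Exact S_p = sum over i of C(x_i, p); for p = 2 this is the self-join pair count."""
    return sum(math.comb(d, p) for d in degrees)


def estimate_star_count(degrees, p, samples, rng):
    """Naive estimator of S_p from draws proportional to magnitude.

    Each draw d contributes C(d, p) / d; the average is rescaled by
    X = sum(degrees). Zero entries have weight 0 and are never drawn,
    so the division is safe. Assumes X is known and positive.
    """
    total = sum(degrees)
    picks = rng.choices(degrees, weights=degrees, k=samples)
    return total * sum(math.comb(d, p) / d for d in picks) / samples


# Degree sequence of a star with four leaves: S_2 = C(4, 2) = 6.
degrees = [4, 1, 1, 1, 1]
estimate = estimate_star_count(degrees, 2, 20000, random.Random(1))
```

    The estimate concentrates around `star_count(degrees, 2)`, which is 6 for this sequence; the number of samples needed in general is what the complexity bound in the abstract quantifies.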