19 research outputs found
Parallel Algorithms for Small Subgraph Counting
Subgraph counting is a fundamental problem in analyzing massive graphs, often
studied in the context of social and complex networks. There is a rich
literature on designing efficient, accurate, and scalable algorithms for this
problem. In this work, we tackle this challenge and design several new
algorithms for subgraph counting in the Massively Parallel Computation (MPC)
model:
Given a graph over vertices, edges and triangles, our first
main result is an algorithm that, with high probability, outputs a
-approximation to , with optimal round and space complexity
provided any space per machine, assuming
.
Our second main result is an -rounds
algorithm for exactly counting the number of triangles, parametrized by the
arboricity of the input graph. The space per machine is
for any constant , and the total space is ,
which matches the time complexity of (combinatorial) triangle counting in the
sequential model. We also prove that this result can be extended to exactly
counting -cliques for any constant , with the same round complexity and
total space . Alternatively, allowing space per
machine, the total space requirement reduces to .
Finally, we prove that a recent result of Bera, Pashanasangi and Seshadhri
(ITCS 2020) for exactly counting all subgraphs of size at most , can be
implemented in the MPC model in rounds,
space per machine and total space. Therefore,
this result also exhibits the phenomenon that a time bound in the sequential
model translates to a space bound in the MPC model
Massively Parallel Algorithms for Distance Approximation and Spanners
Over the past decade, there has been increasing interest in
distributed/parallel algorithms for processing large-scale graphs. By now, we
have quite fast algorithms -- usually sublogarithmic-time and often
-time, or even faster -- for a number of fundamental graph
problems in the massively parallel computation (MPC) model. This model is a
widely-adopted theoretical abstraction of MapReduce style settings, where a
number of machines communicate in an all-to-all manner to process large-scale
data. Contributing to this line of work on MPC graph algorithms, we present
round MPC algorithms for computing
-spanners in the strongly sublinear regime of local memory. To
the best of our knowledge, these are the first sublogarithmic-time MPC
algorithms for spanner construction. As primary applications of our spanners,
we get two important implications, as follows:
-For the MPC setting, we get an -round algorithm for
approximation of all pairs shortest paths (APSP) in the
near-linear regime of local memory. To the best of our knowledge, this is the
first sublogarithmic-time MPC algorithm for distance approximations.
-Our result above also extends to the Congested Clique model of distributed
computing, with the same round complexity and approximation guarantee. This
gives the first sub-logarithmic algorithm for approximating APSP in weighted
graphs in the Congested Clique model
Learning Spanning Forests Optimally using CUT Queries in Weighted Undirected Graphs
In this paper we describe a randomized algorithm which returns a maximal
spanning forest of an unknown {\em weighted} undirected graph making
queries in expectation. For weighted graphs, this is optimal due
to a result in [Auza and Lee, 2021] which shows an lower bound for
zero-error randomized algorithms. %To our knowledge, it is the only regime of
this problem where we have upper and lower bounds tight up to constants. These
questions have been extensively studied in the past few years, especially due
to the problem's connections to symmetric submodular function minimization. We
also describe a simple polynomial time deterministic algorithm that makes
queries on undirected unweighted graphs and
returns a maximal spanning forest, thereby (slightly) improving upon the
state-of-the-art
Graph sparsification for derandomizing massively parallel computation with low space
Massively Parallel Computation (MPC) is an emerging model which distills core aspects of distributed and parallel computation. It was developed as a tool to solve (typically graph) problems in systems where input is distributed over many machines with limited space. Recent work has focused on the regime in which machines have sublinear (in n, number of nodes in the input graph) space, with randomized algorithms presented for the fundamental problems of Maximal Matching and Maximal Independent Set. There are, however, no prior corresponding deterministic algorithms.
A major challenge in the sublinear space setting is that the local space of each machine may be too small to store all the edges incident to a single node. To overcome this barrier we introduce a new graph sparsification technique that deterministically computes a low-degree subgraph with additional desired properties: degrees in the subgraph are sufficiently small that nodes’ neighborhoods can be stored on single machines, and solving the problem on the subgraph provides significant global progress towards solving the problem for the original input graph.
Using this framework to derandomize the well-known randomized algorithm of Luby [SICOMP’86], we obtain O(log(\Delta) + loglog(n))- round deterministic MPC algorithms for solving the fundamental problems of Maximal Matching and Maximal Independent Set with O(n epsilon) space on each machine for any constant epsilon > 0. Based on the recent work of Ghaffari et al. [FOCS’18], this additive O(loglog(n)) factor is conditionally essential. These algorithms can also be shown
to run in O(log(\Delta)) rounds in the closely related model of CONGESTED CLIQUE, improving upon the state-of-the-art bound of O(log^2(\Delta)) rounds by Censor-Hillel et al. [DISC’17]
Exponential Speedup over Locality in MPC with Optimal Memory
Locally Checkable Labeling (LCL) problems are graph problems in which a solution is correct if it satisfies some given constraints in the local neighborhood of each node. Example problems in this class include maximal matching, maximal independent set, and coloring problems. A successful line of research has been studying the complexities of LCL problems on paths/cycles, trees, and general graphs, providing many interesting results for the LOCAL model of distributed computing. In this work, we initiate the study of LCL problems in the low-space Massively Parallel Computation (MPC) model. In particular, on forests, we provide a method that, given the complexity of an LCL problem in the LOCAL model, automatically provides an exponentially faster algorithm for the low-space MPC setting that uses optimal global memory, that is, truly linear.
While restricting to forests may seem to weaken the result, we emphasize that all known (conditional) lower bounds for the MPC setting are obtained by lifting lower bounds obtained in the distributed setting in tree-like networks (either forests or high girth graphs), and hence the problems that we study are challenging already on forests. Moreover, the most important technical feature of our algorithms is that they use optimal global memory, that is, memory linear in the number of edges of the graph. In contrast, most of the state-of-the-art algorithms use more than linear global memory. Further, they typically start with a dense graph, sparsify it, and then solve the problem on the residual graph, exploiting the relative increase in global memory. On forests, this is not possible, because the given graph is already as sparse as it can be, and using optimal memory requires new solutions
Massively Parallel Single-Source SimRanks in Rounds
SimRank is one of the most fundamental measures that evaluate the structural
similarity between two nodes in a graph and has been applied in a plethora of
data management tasks. These tasks often involve single-source SimRank
computation that evaluates the SimRank values between a source node and all
other nodes. Due to its high computation complexity, single-source SimRank
computation for large graphs is notoriously challenging, and hence recent
studies resort to distributed processing. To our surprise, although SimRank has
been widely adopted for two decades, theoretical aspects of distributed
SimRanks with provable results have rarely been studied.
In this paper, we conduct a theoretical study on single-source SimRank
computation in the Massive Parallel Computation (MPC) model, which is the
standard theoretical framework modeling distributed systems such as MapReduce,
Hadoop, or Spark. Existing distributed SimRank algorithms enforce either
communication round complexity or machine space
for a graph of nodes. We overcome this barrier. Particularly, given a graph
of nodes, for any query node and constant error ,
we show that using rounds of communication among machines is
almost enough to compute single-source SimRank values with at most
absolute errors, while each machine only needs a space sub-linear to . To
the best of our knowledge, this is the first single-source SimRank algorithm in
MPC that can overcome the round complexity barrier with
provable result accuracy