Efficient Triangle Counting in Large Graphs via Degree-based Vertex Partitioning
The number of triangles is a computationally expensive graph statistic which
is frequently used in complex network analysis (e.g., transitivity ratio), in
various random graph models (e.g., exponential random graph model) and in
important real world applications such as spam detection, uncovering of the
hidden thematic structure of the Web and link recommendation. Counting
triangles in graphs with millions and billions of edges requires algorithms
which run fast, use small amount of space, provide accurate estimates of the
number of triangles and preferably are parallelizable.
In this paper we present an efficient triangle counting algorithm which can
be adapted to the semistreaming model. The key idea of our algorithm is to
combine the sampling algorithm of Tsourakakis et al. and the partitioning of
the set of vertices into a high degree and a low degree subset respectively as
in the Alon, Yuster and Zwick work treating each set appropriately. We obtain a
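running time $O\left(n + \frac{m^{3/2} \Delta \log n}{t \epsilon^2}\right)$
and an $\epsilon$-approximation (multiplicative error), where $n$ is the number
of vertices, $m$ the number of edges and $\Delta$ the maximum number of
triangles in which an edge is contained.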
Furthermore, we show how this algorithm can be adapted to the semistreaming
model with space usage $O\left(m^{1/2} \log n + \frac{m^{3/2} \Delta \log n}{t \epsilon^2}\right)$ and a constant number of passes (three) over the graph
stream. We apply our methods in various networks with several millions of edges
and we obtain excellent results. Finally, we propose a random projection based
method for triangle counting and provide a sufficient condition to obtain an
estimate with low variance. Comment: 12 pages; to appear in the 7th Workshop on Algorithms and Models
for the Web Graph (WAW 2010).
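The two ingredients named in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm. It assumes the sampling step is uniform edge sampling (keep each edge independently with probability `p`, then scale the count by `1/p^3`), and it uses a degree-based edge orientation only as a stand-in for the high-degree/low-degree treatment. The function names and the `seed` parameter are hypothetical.

```python
import random


def count_triangles(edges):
    """Exact triangle count using a degree-based orientation of the edges."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Rank vertices by (degree, label) and orient each edge from lower to higher rank,
    # so every triangle is found exactly once, from its lowest-ranked vertex.
    rank = {v: (len(nbrs), v) for v, nbrs in adj.items()}
    out = {v: {w for w in nbrs if rank[w] > rank[v]} for v, nbrs in adj.items()}
    return sum(len(out[v] & out[w]) for v in out for w in out[v])


def estimate_triangles(edges, p, seed=0):
    """Assumed sampling step: keep each edge with probability p, rescale by 1/p^3."""
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3


# The complete graph on 4 vertices contains exactly 4 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4))
print(estimate_triangles(k4, p=1.0))
```

A triangle survives sampling only if all three of its edges are kept, which happens with probability `p^3`; dividing by `p^3` is what makes the estimate unbiased under this assumption.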
Massively Parallel Approximate Distance Sketches
Data structures that allow efficient distance estimation (distance oracles, distance sketches, etc.) have been extensively studied, and are particularly well studied in centralized models and classical distributed models such as CONGEST. We initiate their study in newer (and arguably more realistic) models of distributed computation: the Congested Clique model and the Massively Parallel Computation (MPC) model. We provide efficient constructions in both of these models, but our core results are for MPC. In MPC we give two main results: an algorithm that constructs stretch/space optimal distance sketches but takes a (small) polynomial number of rounds, and an algorithm that constructs distance sketches with worse stretch but that only takes polylogarithmic rounds.
Along the way, we show that other useful combinatorial structures can also be computed in MPC. In particular, one key component we use to construct distance sketches is an MPC construction of the hopsets of [Elkin and Neiman, 2016]. This result has additional applications such as the first polylogarithmic time algorithm for constant approximate single-source shortest paths for weighted graphs in the low memory MPC setting.
On Conceptually Simple Algorithms for Variants of Online Bipartite Matching
We present a series of results regarding conceptually simple algorithms for
bipartite matching in various online and related models. We first consider a
deterministic adversarial model. The best approximation ratio possible for a
one-pass deterministic online algorithm is $1/2$, which is achieved by any
greedy algorithm. Dürr et al. recently presented a $2$-pass algorithm called
Category-Advice that achieves approximation ratio $3/5$. We extend their
algorithm to multiple passes. We prove the exact approximation ratio for the
$k$-pass Category-Advice algorithm for all $k \geq 1$, and show that the
approximation ratio converges to the inverse of the golden ratio
$2/(1+\sqrt{5}) \approx 0.618$ as $k$ goes to infinity. The convergence is
extremely fast: the $5$-pass Category-Advice algorithm is already within
$0.01\%$ of the inverse of the golden ratio.
We then consider a natural greedy algorithm in the online stochastic IID
model, MinDegree. This algorithm is an online version of a well-known and
extensively studied offline algorithm MinGreedy. We show that MinDegree cannot
achieve an approximation ratio better than $1-1/e$, which is guaranteed by any
consistent greedy algorithm in the known IID model.
Finally, following the work in Besser and Poloczek, we depart from an
adversarial or stochastic ordering and investigate a natural randomized
algorithm (MinRanking) in the priority model. Although the priority model
allows the algorithm to choose the input ordering in a general but well defined
way, this natural algorithm cannot obtain the approximation of the Ranking
algorithm in the ROM model.
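As a point of reference for the one-pass deterministic setting described above, here is a minimal sketch of a greedy online matching rule. It is not Category-Advice, MinDegree, or MinRanking; the input format (online vertices arriving with their list of offline neighbours) and the function name are assumptions made for illustration.

```python
def greedy_online_matching(arrivals):
    """Match each arriving online vertex to its first still-free offline neighbour.

    arrivals: iterable of (online_vertex, offline_neighbours) in arrival order.
    Returns a dict mapping matched online vertices to offline vertices.
    """
    used_offline = set()
    matching = {}
    for online_vertex, neighbours in arrivals:
        for offline_vertex in neighbours:
            if offline_vertex not in used_offline:
                used_offline.add(offline_vertex)
                matching[online_vertex] = offline_vertex
                break  # the decision is irrevocable; move to the next arrival
    return matching


# u1 takes v1, leaving u2 unmatched, although {u1-v2, u2-v1} has size 2.
arrivals = [("u1", ["v1", "v2"]), ("u2", ["v1"])]
print(greedy_online_matching(arrivals))
```

The example input shows the worst case for this rule: greedy always produces a maximal matching, which has at least half the size of a maximum matching, and here it obtains exactly half.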
Tight Bounds on the Round Complexity of the Distributed Maximum Coverage Problem
We study the maximum $k$-set coverage problem in the following distributed
setting. A collection of sets $S_1, \ldots, S_m$ over a universe $[n]$ is
partitioned across $p$ machines and the goal is to find $k$ sets whose union
covers the largest number of elements. The computation proceeds in synchronous
rounds. In each round, all machines simultaneously send a message to a central
coordinator who then communicates back to all machines a summary to guide the
computation for the next round. At the end, the coordinator outputs the answer.
The main measures of efficiency in this setting are the approximation ratio of
the returned solution, the communication cost of each machine, and the number
of rounds of computation.
Our main result is an asymptotically tight bound on the tradeoff between
these measures for the distributed maximum coverage problem. We first show that
any $r$-round protocol for this problem either incurs a communication cost of
$k \cdot m^{\Omega(1/r)}$ or only achieves an approximation factor of
$k^{\Omega(1/r)}$. This implies that any protocol that simultaneously achieves
good approximation ratio ($O(1)$ approximation) and good communication cost
($\widetilde{O}(n)$ communication per machine) essentially requires a
logarithmic (in $k$) number of rounds. We complement our lower bound result by
showing that there exists an $r$-round protocol that achieves an
$\frac{e}{e-1}$-approximation (essentially best possible) with a communication
cost of $k \cdot m^{O(1/r)}$ as well as an $r$-round protocol that achieves a
$k^{O(1/r)}$-approximation with only $\widetilde{O}(n)$ communication per
machine (essentially best possible).
We further use our results in this distributed setting to obtain new bounds
for the maximum coverage problem in two other main models of computation for
massive datasets, namely, the dynamic streaming model and the MapReduce model.
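For orientation, the centralized baseline for maximum coverage can be sketched as follows. This is the classical single-machine greedy rule, not the distributed protocol of the paper; the function name and the representation of sets as Python `set` objects are assumptions.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k sets, each time taking the one that covers the most new elements.

    sets: list of Python sets over a common universe.
    Returns (indices of the chosen sets, set of covered elements).
    """
    remaining = dict(enumerate(sets))
    covered = set()
    chosen = []
    for _ in range(k):
        if not remaining:
            break
        # Marginal gain of a set is the number of not-yet-covered elements in it.
        best = max(remaining, key=lambda i: len(remaining[i] - covered))
        if not remaining[best] - covered:
            break  # no remaining set adds anything new
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered


sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1}]
print(greedy_max_coverage(sets, k=2))
```

This greedy rule is known to cover at least a $1 - 1/e$ fraction of what the best $k$ sets cover, which is the benchmark a distributed protocol is usually compared against.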
Massively Parallel Algorithms for Distance Approximation and Spanners
Over the past decade, there has been increasing interest in
distributed/parallel algorithms for processing large-scale graphs. By now, we
have quite fast algorithms (usually sublogarithmic-time and often
$\mathrm{poly}(\log\log n)$-time, or even faster) for a number of fundamental graph
problems in the massively parallel computation (MPC) model. This model is a
widely-adopted theoretical abstraction of MapReduce style settings, where a
number of machines communicate in an all-to-all manner to process large-scale
data. Contributing to this line of work on MPC graph algorithms, we present
$\mathrm{poly}(\log k) \in \mathrm{poly}(\log\log n)$ round MPC algorithms for computing
$O(k^{1+o(1)})$-spanners in the strongly sublinear regime of local memory. To
the best of our knowledge, these are the first sublogarithmic-time MPC
algorithms for spanner construction. As primary applications of our spanners,
we get two important implications, as follows:
- For the MPC setting, we get an $O(\log^2\log n)$-round algorithm for
$O(\log^{1+o(1)} n)$ approximation of all pairs shortest paths (APSP) in the
near-linear regime of local memory. To the best of our knowledge, this is the
first sublogarithmic-time MPC algorithm for distance approximations.
- Our result above also extends to the Congested Clique model of distributed
computing, with the same round complexity and approximation guarantee. This
gives the first sublogarithmic algorithm for approximating APSP in weighted
graphs in the Congested Clique model.
Engineering Aggregation Operators for Relational In-Memory Database Systems
In this thesis we study the design and implementation of aggregation operators in the context of relational in-memory database systems. In particular, we identify and address the following challenges: cache-efficiency, CPU-friendliness, parallelism within and across processors, robust handling of skewed data, adaptive processing, processing with constrained memory, and integration with modern database architectures. Our resulting algorithm outperforms the state of the art by up to 3.7x.
Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means
We analyze a compression scheme for large data sets that randomly keeps a
small percentage of the components of each data sample. The benefit is that the
output is a sparse matrix and therefore subsequent processing, such as PCA or
K-means, is significantly faster, especially in a distributed-data setting.
Furthermore, the sampling is single-pass and applicable to streaming data. The
sampling mechanism is a variant of previous methods proposed in the literature
combined with a randomized preconditioning to smooth the data. We provide
guarantees for PCA in terms of the covariance matrix, and guarantees for
K-means in terms of the error in the center estimators at a given step. We
present numerical evidence to show both that our bounds are nearly tight and
that our algorithms provide a real benefit when applied to standard test data
sets, as well as providing certain benefits over related sampling approaches. Comment: 28 pages, 10 figures
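The sampling step described in this abstract can be illustrated with a minimal sketch. It assumes independent Bernoulli sampling of each component with probability `keep_prob`, followed by rescaling by `1/keep_prob` so that each entry is unbiased in expectation. The abstract does not specify the exact sampling scheme, and the randomized preconditioning step is omitted here. The function name and parameters are hypothetical.

```python
import random


def sparsify_sample(sample, keep_prob, rng):
    """Keep each component with probability keep_prob; zero out the rest.

    Kept components are divided by keep_prob so the expected value of every
    output entry equals the corresponding input entry (an assumed design choice).
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in sample]


rng = random.Random(0)
sample = [2.0, 4.0, 6.0, 8.0]
sparse = sparsify_sample(sample, keep_prob=0.5, rng=rng)
print(sparse)  # each entry is either 0.0 or twice the original value
```

Because each data sample is processed on its own and only once, a scheme of this kind is compatible with the single-pass, streaming use the abstract mentions.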
Correlation clustering in data streams
In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on n nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, O(n·polylog n)-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the "quality" of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. However, the standard LP and SDP formulations are not obviously solvable in O(n·polylog n)-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling. Note that the improved space and running-time bounds achieved from streaming algorithms are also useful for offline settings such as MapReduce models.
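To make the "quality" of a node-partition concrete, here is a minimal sketch that computes the total weight of disagreements exactly, with the whole graph held in memory. It is not the linear-sketch data structure of the paper, which estimates such quantities in small space. The function name and the dictionary-based input format are assumptions.

```python
def disagreement_cost(edge_weights, cluster_of):
    """Total weight of edges that a partition gets 'wrong'.

    edge_weights: dict mapping (u, v) pairs to signed weights.
    cluster_of: dict mapping each node to its cluster label.
    A positive edge across clusters, or a negative edge inside a cluster,
    contributes the absolute value of its weight.
    """
    cost = 0.0
    for (u, v), weight in edge_weights.items():
        same_cluster = cluster_of[u] == cluster_of[v]
        if weight > 0 and not same_cluster:
            cost += weight
        elif weight < 0 and same_cluster:
            cost += -weight
    return cost


weights = {("a", "b"): 1.0, ("b", "c"): -2.0, ("a", "c"): 1.0}
print(disagreement_cost(weights, {"a": 0, "b": 0, "c": 0}))
print(disagreement_cost(weights, {"a": 0, "b": 0, "c": 1}))
```

In the example, putting all three nodes together violates the negative edge, while separating `c` violates only the lighter positive edge between `a` and `c`.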