Parallel Processing of Large Graphs
More and more large data collections are gathered worldwide in various IT
systems. Many of them are networked in nature and need to be processed and
analysed as graph structures. Because of their size, they very often require a
parallel paradigm for efficient computation. Three parallel techniques have
been compared in the paper: MapReduce, its map-side join extension and Bulk
Synchronous Parallel (BSP). They are implemented for two different graph
problems: calculation of single source shortest paths (SSSP) and collective
classification of graph nodes by means of relational influence propagation
(RIP). The methods and algorithms are applied to several network datasets
differing in size and structural profile, originating from three domains:
telecommunication, multimedia and microblog. The results revealed that
iterative graph processing with the BSP implementation always and
significantly outperforms MapReduce, by up to 10 times, especially for
algorithms with many iterations and sparse communication. The MapReduce
extension based on map-side join is also usually noticeably more efficient,
although not by as much as BSP. Nevertheless, MapReduce remains a good
alternative for enormous networks whose data structures do not fit in local
memories.
Comment: Preprint submitted to Future Generation Computer Systems
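The BSP style of iterative processing described in this abstract can be illustrated with a minimal single-process sketch of SSSP. It is not the paper's implementation: the function name `bsp_sssp`, the adjacency-dictionary input format and the message-passing loop are assumptions made for illustration. Each pass of the loop plays the role of one superstep, and swapping the outbox for the inbox stands in for the synchronization barrier.

```python
import math
from collections import defaultdict


def bsp_sssp(edges, source):
    """Single-source shortest paths in a BSP-like style (illustrative only).

    edges: dict mapping every vertex to a list of (neighbor, weight) pairs,
           with non-negative weights. Every vertex must appear as a key.
    Returns (distances, number_of_supersteps).
    """
    dist = {v: math.inf for v in edges}
    inbox = {source: [0]}  # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:  # halt when no messages are in flight
        outbox = defaultdict(list)
        # Compute phase: conceptually, every vertex runs this in parallel.
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                dist[v] = best
                # Only vertices whose distance improved send messages.
                for u, w in edges[v]:
                    outbox[u].append(best + w)
        inbox = outbox  # barrier: messages become visible in the next superstep
        supersteps += 1
    return dist, supersteps


if __name__ == "__main__":
    graph = {
        "a": [("b", 1), ("c", 4)],
        "b": [("c", 2), ("d", 6)],
        "c": [("d", 1)],
        "d": [],
    }
    print(bsp_sssp(graph, "a"))
```

Because only vertices with improved distances send messages, supersteps late in the run exchange little data, which is the "sparse communication" situation the abstract says favours BSP.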
A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs
We present a space and time efficient practical parallel algorithm for
approximating the diameter of massive weighted undirected graphs on distributed
platforms supporting a MapReduce-like abstraction. The core of the algorithm is
a weighted graph decomposition strategy generating disjoint clusters of bounded
weighted radius. Theoretically, our algorithm uses linear space and yields a
polylogarithmic approximation guarantee; moreover, for important practical
classes of graphs, it runs in a number of rounds asymptotically smaller than
those required by the natural approximation provided by the state-of-the-art
Δ-stepping SSSP algorithm, which is its only practical linear-space
competitor in the aforementioned computational scenario. We complement our
theoretical findings with an extensive experimental analysis on large benchmark
graphs, which demonstrates that our algorithm attains substantial improvements
on a number of key performance indicators with respect to the aforementioned
competitor, while featuring a similar approximation ratio (a small constant
less than 1.4, as opposed to the polylogarithmic theoretical bound).
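One standard way to bound the diameter from a single SSSP run is to use the eccentricity e of an arbitrary node: in a connected undirected graph, e <= diameter <= 2e. The sketch below illustrates that bound only. It uses plain sequential Dijkstra in place of Δ-stepping, and the name `diameter_bounds` and the input format are assumptions; it is not the algorithm proposed in the abstract.

```python
import heapq
import math


def diameter_bounds(adj, start):
    """Return (lower, upper) bounds on the weighted diameter.

    adj: dict mapping every vertex to a list of (neighbor, weight) pairs,
         describing a connected undirected graph with non-negative weights.
    Uses sequential Dijkstra from `start` (an assumption of this sketch).
    """
    dist = {v: math.inf for v in adj}
    dist[start] = 0
    heap = [(0, start)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        for u, w in adj[v]:
            if d + w < dist[u]:
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    ecc = max(dist.values())  # eccentricity of `start`
    # Any two nodes can be joined through `start`, hence the factor 2.
    return ecc, 2 * ecc


if __name__ == "__main__":
    path = {
        "a": [("b", 1)],
        "b": [("a", 1), ("c", 2)],
        "c": [("b", 2), ("d", 3)],
        "d": [("c", 3)],
    }
    print(diameter_bounds(path, "b"))
```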
Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation
We develop a novel parallel decomposition strategy for unweighted, undirected
graphs, based on growing disjoint connected clusters from batches of centers
progressively selected from yet uncovered nodes. With respect to similar
previous decompositions, our strategy exercises a tighter control on both the
number of clusters and their maximum radius.
We present two important applications of our parallel graph decomposition:
(1) k-center clustering approximation; and (2) diameter approximation. In
both cases, we obtain algorithms which feature a polylogarithmic approximation
factor and are amenable to a distributed implementation that is geared for
massive (long-diameter) graphs. The total space needed for the computation is
linear in the problem size, and the parallel depth is substantially sublinear
in the diameter for graphs with low doubling dimension. To the best of our
knowledge, ours are the first parallel approximations for these problems which
achieve sub-diameter parallel time, for a relevant class of graphs, using only
linear space. Besides the theoretical guarantees, our algorithms allow for a
very simple implementation on clustered architectures: we report on extensive
experiments which demonstrate their effectiveness and efficiency on large
graphs as compared to alternative known approaches.
Comment: 14 pages
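The idea of growing disjoint connected clusters from batches of centers picked among uncovered nodes can be sketched as follows. This is a simplified sequential illustration, not the authors' strategy: the doubling batch size, the one-hop growth per round and the name `grow_clusters` are assumptions chosen to keep the example short.

```python
import random


def grow_clusters(adj, batch_size=1, seed=0):
    """Partition an unweighted, undirected graph into connected clusters.

    adj: dict mapping every node to a list of its neighbors.
    Returns a dict mapping each node to the center of its cluster.
    The batch schedule (doubling every round) is an assumption of this sketch.
    """
    rng = random.Random(seed)
    owner = {}     # node -> center of the cluster that covers it
    frontier = []  # nodes added to some cluster in the previous round
    while len(owner) < len(adj):
        # Select a new batch of centers among the still-uncovered nodes.
        uncovered = sorted(v for v in adj if v not in owner)
        for c in rng.sample(uncovered, min(batch_size, len(uncovered))):
            owner[c] = c
            frontier.append(c)
        # Grow every active cluster by one hop; a node joins the first
        # cluster that reaches it, so clusters stay disjoint and connected.
        next_frontier = []
        for v in frontier:
            for u in adj[v]:
                if u not in owner:
                    owner[u] = owner[v]
                    next_frontier.append(u)
        frontier = next_frontier
        batch_size *= 2
    return owner


if __name__ == "__main__":
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
    print(grow_clusters(path))
```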
Scalable Facility Location for Massive Graphs on Pregel-like Systems
We propose a new scalable algorithm for facility location. Facility location
is a classic problem, where the goal is to select a subset of facilities to
open, from a set of candidate facilities F, in order to serve a set of clients
C. The objective is to minimize the total cost of opening facilities plus the
cost of serving each client from the facility it is assigned to. In this work,
we are interested in the graph setting, where the cost of serving a client from
a facility is represented by the shortest-path distance on the graph. This
setting allows us to model natural problems arising in the Web and in social media
applications. It also allows us to leverage the inherent sparsity of such graphs,
as the input is much smaller than the full pairwise distances between all
vertices.
To obtain truly scalable performance, we design a parallel algorithm that
operates on clusters of shared-nothing machines. In particular, we target
modern Pregel-like architectures, and we implement our algorithm on Apache
Giraph. Our solution makes use of a recent result to build sketches for massive
graphs, and of a fast parallel algorithm to find maximal independent sets, as
building blocks. In so doing, we show how these problems can be solved on a
Pregel-like architecture, and we investigate the properties of these
algorithms. Extensive experimental results show that our algorithm scales
gracefully to graphs with billions of edges, while obtaining values of the
objective function that are competitive with a state-of-the-art sequential
algorithm.
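The objective stated in this abstract (opening costs plus shortest-path service costs) can be evaluated for a given set of open facilities with a small sequential sketch. It assumes that each client is served by its nearest open facility and that the graph is connected; the names `facility_location_cost` and `nearest_open_distance` are hypothetical, and nothing here reflects the Giraph implementation.

```python
import heapq
import math


def nearest_open_distance(adj, sources):
    """Multi-source Dijkstra: distance from each vertex to its nearest source."""
    dist = {v: math.inf for v in adj}
    heap = []
    for s in sources:
        dist[s] = 0
        heap.append((0, s))
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        for u, w in adj[v]:
            if d + w < dist[u]:
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist


def facility_location_cost(adj, opening_cost, open_facilities, clients):
    """Total cost = opening costs + shortest-path cost of serving each client."""
    dist = nearest_open_distance(adj, open_facilities)
    opening = sum(opening_cost[f] for f in open_facilities)
    service = sum(dist[c] for c in clients)
    return opening + service


if __name__ == "__main__":
    graph = {
        "a": [("b", 1)],
        "b": [("a", 1), ("c", 2)],
        "c": [("b", 2), ("d", 3)],
        "d": [("c", 3)],
    }
    costs = {"b": 5, "d": 2}
    print(facility_location_cost(graph, costs, ["b"], ["a", "c", "d"]))
    print(facility_location_cost(graph, costs, ["b", "d"], ["a", "c", "d"]))
```

In this toy instance, opening the second facility lowers the total cost because the saving in service distance exceeds its opening cost.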
Scalable Online Betweenness Centrality in Evolving Graphs
Betweenness centrality is a classic measure that quantifies the importance of
a graph element (vertex or edge) according to the fraction of shortest paths
passing through it. This measure is notoriously expensive to compute, and the
best known algorithm runs in O(nm) time. The problems of efficiency and
scalability are exacerbated in a dynamic setting, where the input is an
evolving graph seen edge by edge, and the goal is to keep the betweenness
centrality up to date. In this paper we propose the first truly scalable
algorithm for online computation of betweenness centrality of both vertices and
edges in an evolving graph where new edges are added and existing edges are
removed. Our algorithm is carefully engineered with out-of-core techniques and
tailored for modern parallel stream processing engines that run on clusters of
shared-nothing commodity hardware. Hence, it is amenable to real-world
deployment. We experiment on graphs that are two orders of magnitude larger
than those in previous studies. Our method is able to keep the betweenness centrality
measures up to date online, i.e., the time to update the measures is smaller
than the inter-arrival time between two consecutive updates.
Comment: 15 pages, 9 Figures, accepted for publication in IEEE Transactions on
Knowledge and Data Engineering
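For reference, the static O(nm) computation mentioned in the abstract can be sketched with Brandes' algorithm for unweighted graphs. This is the classic batch baseline, not the online method the paper proposes; the function name `betweenness` and the adjacency-list input are assumptions of this sketch.

```python
from collections import deque


def betweenness(adj):
    """Vertex betweenness for an unweighted, undirected graph (Brandes).

    adj: dict mapping every vertex to a list of its neighbors.
    Runs one BFS per source, so the total time is O(n * m).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Forward phase: BFS that counts shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward phase: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: score / 2 for v, score in bc.items()}


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(betweenness(path))
```

Recomputing this from scratch after every edge update is what makes the dynamic setting expensive, which is the motivation given in the abstract.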
Comparing MapReduce and pipeline implementations for counting triangles
A generalized method for defining parallel solutions is the Divide & Conquer paradigm, in which each processor acts on its own data and the processors are scheduled in a parallel fashion. MapReduce is a programming model that follows this paradigm, and allows for the definition of efficient solutions by both decomposing a problem into steps on subsets of the input data and combining the results of each step to produce final results. Although it is used to implement a wide variety of computational problems, MapReduce performance can be negatively affected whenever the replication factor grows or the size of the input is larger than the resources available at each processor. In this paper we show an alternative approach to implementing the Divide & Conquer paradigm, named pipeline. The main features of pipeline are illustrated on a parallel implementation of the well-known problem of counting triangles in a graph. This problem is especially interesting when the input graph either does not fit in memory or is dynamically generated. To evaluate the properties of pipeline, a dynamic pipeline of processes and an ad-hoc version of MapReduce are implemented in the language Go, exploiting its ability to deal with channels and spawned processes.
An empirical evaluation is conducted on graphs of different sizes and densities. The observed results suggest that pipeline allows for an efficient solution to the problem of counting triangles in a graph, particularly in dense and large graphs, drastically reducing the execution time with respect to the MapReduce implementation.
Peer Reviewed. Postprint (published version)
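To make the triangle-counting problem concrete, here is a small sequential reference sketch. It is neither the pipeline nor the MapReduce implementation evaluated in the paper (those are written in Go); the function name `count_triangles` and the edge-list input are assumptions. Each edge is oriented from its smaller to its larger endpoint so that every triangle is counted exactly once.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as an edge list.

    edges: iterable of (u, v) pairs with mutually comparable vertex labels.
    Self-loops are ignored; duplicate edges collapse because sets are used.
    """
    higher = {}  # vertex -> set of neighbors with a larger label
    for u, v in edges:
        if u == v:
            continue
        a, b = (u, v) if u < v else (v, u)
        higher.setdefault(a, set()).add(b)
        higher.setdefault(b, set())
    # A triangle a < b < c is found once: on edge (a, b), with c in both sets.
    return sum(len(higher[a] & higher[b]) for a in higher for b in higher[a])


if __name__ == "__main__":
    k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
    print(count_triangles(k4))
```

This version keeps the whole graph in memory, which is exactly the limitation the abstract cites as motivation for streaming the input through a pipeline.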