Scalable Facility Location for Massive Graphs on Pregel-like Systems
We propose a new scalable algorithm for facility location. Facility location
is a classic problem, where the goal is to select a subset of facilities to
open, from a set of candidate facilities F , in order to serve a set of clients
C. The objective is to minimize the total cost of opening facilities plus the
cost of serving each client from the facility it is assigned to. In this work,
we are interested in the graph setting, where the cost of serving a client from
a facility is represented by the shortest-path distance on the graph. This
setting allows us to model natural problems arising in the Web and in social media
applications. It also allows us to leverage the inherent sparsity of such graphs,
as the input is much smaller than the full pairwise distances between all
vertices.
To obtain truly scalable performance, we design a parallel algorithm that
operates on clusters of shared-nothing machines. In particular, we target
modern Pregel-like architectures, and we implement our algorithm on Apache
Giraph. Our solution makes use of a recent result to build sketches for massive
graphs, and of a fast parallel algorithm to find maximal independent sets, as
building blocks. In so doing, we show how these problems can be solved on a
Pregel-like architecture, and we investigate the properties of these
algorithms. Extensive experimental results show that our algorithm scales
gracefully to graphs with billions of edges, while obtaining values of the
objective function that are competitive with a state-of-the-art sequential
algorithm.
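As a small illustration of the objective described in this abstract, the sketch below evaluates the facility location cost (opening costs plus each client's shortest-path distance to its nearest open facility) on a tiny graph. It is not the paper's parallel algorithm. The function names, the undirected unweighted path graph, and the uniform opening cost of 2 are all assumptions made for the example, and the set of opened facilities is assumed to be non-empty.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop-count shortest-path distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def facility_location_cost(adj, opened, clients, opening_cost):
    """Opening cost of the chosen facilities plus the distance from each
    client to its nearest open facility (graph assumed undirected)."""
    dist_from = {f: bfs_distances(adj, f) for f in opened}
    total = sum(opening_cost[f] for f in opened)
    for c in clients:
        # Unreachable clients contribute infinite cost.
        total += min(dist_from[f].get(c, float("inf")) for f in opened)
    return total

# Hypothetical path graph 0 - 1 - 2 - 3 - 4; every vertex is a client.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
opening_cost = {v: 2 for v in adj}
clients = list(adj)

cost_one_facility = facility_location_cost(adj, [2], clients, opening_cost)
cost_two_facilities = facility_location_cost(adj, [1, 3], clients, opening_cost)
```

In this toy instance, opening a second facility costs more up front but shortens enough client paths to lower the total, which is the trade-off the objective captures.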
Algorithm-Level Optimizations for Scalable Parallel Graph Processing
Efficiently processing large graphs is challenging, since parallel graph algorithms suffer from
poor scalability and performance due to many factors, including heavy communication and load-imbalance.
Furthermore, it is difficult to express graph algorithms, as users need to understand
and effectively utilize the underlying execution of the algorithm on the distributed system. The
performance of graph algorithms depends not only on the characteristics of the system (such as
latency, available RAM, etc.), but also on the characteristics of the input graph (small-world scale-free,
mesh, long-diameter, etc.), and characteristics of the algorithm (sparse computation vs. dense
communication). The best execution strategy, therefore, often heavily depends on the combination
of input graph, system and algorithm.
Fine-grained expression exposes maximum parallelism in the algorithm and allows the user to
concentrate on a single vertex, making it easier to express parallel graph algorithms. However,
this often loses information about the machine, making it difficult to extract performance and
scalability from fine-grained algorithms.
To address these issues, we present a model for expressing parallel graph algorithms using a
fine-grained expression. Our model decouples the algorithm-writer from the underlying details
of the system, graph, and execution and tuning of the algorithm. We also present various graph
paradigms that optimize the execution of graph algorithms for various types of input graphs and
systems. We show our model is general enough to allow graph algorithms to use the various graph
paradigms for the best/fastest execution, and demonstrate good performance and scalability for
a variety of graphs, algorithms, and systems, scaling to 100,000+ cores.
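To make the idea of a fine-grained, single-vertex expression concrete, here is a minimal sketch of connected components written as a per-vertex rule driven by synchronous supersteps. It is an assumption-laden illustration, not the model proposed in this work: the function name, the message-passing loop, and the min-label propagation rule are chosen only for the example, and the graph is assumed to be undirected.

```python
def vertex_centric_components(adj):
    """Label each vertex with the smallest vertex id in its connected
    component, using only per-vertex state and received messages."""
    label = {v: v for v in adj}

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in adj}
    for v in adj:
        for u in adj[v]:
            inbox[u].append(label[v])

    # Later supersteps: run until no vertex receives a message.
    while any(inbox.values()):
        outbox = {v: [] for v in adj}
        for v, messages in inbox.items():
            # Per-vertex rule: adopt a smaller label and forward it.
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                for u in adj[v]:
                    outbox[u].append(label[v])
        inbox = outbox
    return label

# Hypothetical graph with two components: {0, 1, 2} and {3, 4}.
adj = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
labels = vertex_centric_components(adj)
```

The per-vertex rule says nothing about partitioning, message batching, or scheduling, which is the kind of machine-level information the abstract notes is lost in a fine-grained expression.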
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several follow-up works after its introduction. This article
provides a comprehensive survey of a family of approaches and mechanisms for
large scale data processing that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
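For readers unfamiliar with the programming model this survey covers, the following is a minimal single-process sketch of one map, shuffle, and reduce pass, using word count as the example. It only illustrates the shape of the model. The helper names are hypothetical, and nothing here reflects the distribution, scheduling, or fault tolerance that a real MapReduce framework provides.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run a single in-memory map -> shuffle -> reduce pass."""
    groups = defaultdict(list)
    for record in records:                      # map phase
        for key, value in mapper(record):
            groups[key].append(value)           # shuffle: group values by key
    # reduce phase: one reducer call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1

def count_reducer(word, counts):
    """Sum the partial counts collected for one word."""
    return sum(counts)

lines = ["map reduce map", "reduce shuffle"]
word_counts = map_reduce(lines, word_mapper, count_reducer)
```

The application code is confined to `word_mapper` and `count_reducer`; everything else belongs to the framework, which is the isolation the abstract refers to.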
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
High-performance graph algorithms are challenging to implement on new
parallel hardware such as GPUs for three reasons:
(1) the difficulty of coming up with graph building blocks, (2) load imbalance
on parallel hardware, and (3) graph problems having low arithmetic intensity.
To address some of these challenges, GraphBLAS is an innovative, on-going
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which will allow graph algorithms to be expressed in a
performant, succinct, composable and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among the new design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying push and pull direction.
Exploiting output sparsity allows users to tell the backend which values of the
output in a single vectorized computation they do not want computed.
Load-balancing is an important feature for balancing work amongst parallel
workers. We describe the important load-balancing features for handling graphs
with different characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first high-performance linear
algebra-based graph framework on NVIDIA GPUs that is open-source. The results
show that on a single GPU, GraphBLAST has on average at least an order of
magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.
Comment: 50 pages, 14 figures, 14 tables
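The linear-algebra view of graph traversal mentioned in this abstract can be sketched in a few lines: breadth-first search becomes a repeated vector-matrix product over a Boolean semiring, with the set of visited vertices acting as a complemented mask on the output. The code below is a plain-Python illustration under those assumptions and does not use the GraphBLAST or GraphBLAS APIs. The names `masked_vxm` and `bfs_levels` are hypothetical.

```python
def masked_vxm(frontier, adj, visited):
    """Boolean-semiring vector-matrix product restricted to unvisited vertices.

    frontier: set of vertices with a nonzero entry (sparse input vector).
    adj: dict mapping each vertex to the set of its out-neighbours (sparse rows).
    visited: vertices excluded from the result (complemented output mask).
    """
    result = set()
    for u in frontier:              # input sparsity: touch only nonzero entries
        for v in adj[u]:
            if v not in visited:    # output sparsity: skip masked-out entries
                result.add(v)
    return result

def bfs_levels(adj, source):
    """BFS depth of every vertex reachable from source."""
    levels = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        frontier = masked_vxm(frontier, adj, levels.keys())
        for v in frontier:
            levels[v] = depth
    return levels

# Hypothetical directed graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
adj = {0: {1, 2}, 1: {3}, 2: {3}, 3: set()}
levels = bfs_levels(adj, 0)
```

Iterating over the frontier corresponds to a push-style traversal; a backend that exploits input sparsity could choose between push and pull on the user's behalf, as the abstract describes.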
A Partition-centric Distributed Algorithm for Identifying Euler Circuits in Large Graphs
Finding the Eulerian circuit in graphs is a classic problem, but inadequately
explored for parallel computation. With such cycles finding use in neuroscience
and Internet of Things for large graphs, designing a distributed algorithm for
finding the Euler circuit is important. Existing parallel algorithms are
impractical for commodity clusters and Clouds. We propose a novel
partition-centric algorithm to find the Euler circuit, over large graphs
partitioned across distributed machines and executed iteratively using a Bulk
Synchronous Parallel (BSP) model. The algorithm finds partial paths and cycles
within each partition, and refines these into longer paths by recursively
merging the partitions. We describe the algorithm, analyze its complexity,
validate it on Apache Spark for large graphs, and offer experimental results.
We also identify memory bottlenecks in the algorithm and propose an enhanced
design to address them.
Comment: To appear in Proceedings of 5th IEEE International Workshop on
High-Performance Big Data, Deep Learning, and Cloud Computing, in conjunction
with the 33rd IEEE International Parallel and Distributed Processing
Symposium (IPDPS 2019), Rio de Janeiro, Brazil, May 20th, 2019
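For context on the underlying problem, the sketch below finds an Euler circuit on a single machine with the classic Hierholzer approach, which also works by growing trails and splicing cycles together. It is a sequential stand-in chosen for illustration, not the partition-centric distributed algorithm proposed in this paper. The function name and the example graph are assumptions, and the input is assumed to be a connected undirected graph in which every vertex has even degree and `start` is an endpoint of some edge.

```python
def euler_circuit(edges, start):
    """Return an Euler circuit as a list of vertices.

    Assumes (without checking) a connected undirected graph whose vertices
    all have even degree, with start appearing in at least one edge.
    """
    adj = {}
    for idx, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, idx))
        adj.setdefault(v, []).append((u, idx))

    used = [False] * len(edges)
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        # Discard edges already traversed from their other endpoint.
        while adj[u] and used[adj[u][-1][1]]:
            adj[u].pop()
        if adj[u]:
            v, idx = adj[u].pop()
            used[idx] = True
            stack.append(v)               # extend the current trail
        else:
            circuit.append(stack.pop())   # dead end: emit vertex, splicing cycles
    return circuit[::-1]

# Hypothetical graph: two triangles sharing vertex 0.
edges = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)]
circuit = euler_circuit(edges, 0)
```

Each edge is consumed exactly once, so the returned list has one more vertex than there are edges and begins and ends at `start`.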
StreamLearner: Distributed Incremental Machine Learning on Event Streams: Grand Challenge
Today, massive amounts of streaming data from smart devices need to be
analyzed automatically to realize the Internet of Things. The Complex Event
Processing (CEP) paradigm promises low-latency pattern detection on event
streams. However, CEP systems need to be extended with Machine Learning (ML)
capabilities such as online training and inference in order to be able to
detect fuzzy patterns (e.g., outliers) and to improve pattern recognition
accuracy during runtime using incremental model training. In this paper, we
propose a distributed CEP system denoted as StreamLearner for ML-enabled
complex event detection. The proposed programming model and data-parallel
system architecture enable a wide range of real-world applications and allow
for dynamically scaling up and out system resources for low-latency,
high-throughput event processing. We show that the DEBS Grand Challenge 2017
case study (i.e., anomaly detection in smart factories) integrates seamlessly
into the StreamLearner API. Our experiments verify scalability and high event
throughput of StreamLearner.
Comment: Christian Mayer, Ruben Mayer, and Majd Abdo. 2017. StreamLearner:
Distributed Incremental Machine Learning on Event Streams: Grand Challenge.
In Proceedings of the 11th ACM International Conference on Distributed and
Event-based Systems (DEBS '17), 298-30
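The combination of online inference and incremental training described in this abstract can be illustrated with a very small outlier detector that scores each event against a running model and then updates that model with the same event. This is a sketch under stated assumptions and not the StreamLearner API: the class name, the `k` and `warmup` parameters, the sample stream, and the use of Welford's running mean and variance are all choices made for the example.

```python
import math

class RunningOutlierDetector:
    """Tracks mean and variance incrementally (Welford's method) and flags
    events more than k standard deviations from the running mean."""

    def __init__(self, k=3.0, warmup=5):
        self.k = k              # threshold in standard deviations
        self.warmup = warmup    # events to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0           # running sum of squared deviations

    def process(self, x):
        """Score x against the current model, then train on x."""
        is_outlier = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_outlier = std > 0 and abs(x - self.mean) > self.k * std
        # Incremental training step: constant time and memory per event.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier

# Hypothetical sensor readings with one anomalous value.
detector = RunningOutlierDetector(k=3.0, warmup=5)
stream = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0, 10.1]
flags = [detector.process(x) for x in stream]
```

Note that this sketch also trains on events it flags, which inflates the variance afterwards; a real system might exclude or down-weight flagged events, depending on the application.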