GraphHP: A Hybrid Platform for Iterative Graph Processing
The Bulk Synchronous Parallel (BSP) computational model has emerged as the
dominant distributed framework to build large-scale iterative graph processing
systems. While its implementations (e.g., Pregel, Giraph, and Hama) achieve high
scalability, frequent synchronization and communication among the workers can
cause substantial parallel inefficiency. To help address this critical concern,
this paper introduces the GraphHP (Graph Hybrid Processing) platform, which
inherits the friendly vertex-centric BSP programming interface and reduces
its synchronization and communication overhead.
To achieve this goal, we first propose a hybrid execution model that
differentiates between the computations within a graph partition and across the
partitions, and decouples the computations within a partition from distributed
synchronization and communication. By implementing the computations within a
partition as pseudo-superstep iterations in memory, the hybrid execution model
effectively reduces synchronization and communication overhead without
requiring heavy scheduling or graph-centric sequential algorithms. We
then demonstrate how the hybrid execution model can be easily implemented
within the BSP abstraction to preserve its simple programming interface.
Finally, we evaluate our implementation of the GraphHP platform on classical
BSP applications and show that it performs significantly better than the
state-of-the-art BSP implementations. Our GraphHP implementation is based on
Hama, but can easily generalize to other BSP platforms.
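The vertex-centric BSP model that GraphHP inherits can be sketched in a few lines. Below is a minimal single-process Python illustration of superstep-by-superstep minimum-label propagation (connected components); the names are illustrative, not GraphHP's API. In GraphHP's hybrid model, such supersteps would run within a partition as in-memory pseudo-supersteps, with the global barrier reserved for cross-partition messages.

```python
def bsp_min_label(adjacency):
    """Propagate the minimum vertex id through each connected component."""
    labels = {v: v for v in adjacency}
    # Superstep 0: every vertex announces its label to its neighbors.
    inbox = {v: [] for v in adjacency}
    for v in adjacency:
        for u in adjacency[v]:
            inbox[u].append(labels[v])
    # Each later superstep: read the inbox, update state, and emit messages
    # that are delivered only after the global barrier (the inbox swap).
    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v, msgs in inbox.items():
            if msgs and min(msgs) < labels[v]:
                labels[v] = min(msgs)
                for u in adjacency[v]:
                    outbox[u].append(labels[v])
        inbox = outbox  # global synchronization point
    return labels
```

A vertex with an empty inbox effectively votes to halt; the computation ends when no messages are in flight.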
Experimental Analysis of Distributed Graph Systems
This paper evaluates eight parallel graph processing systems: Hadoop, HaLoop,
Vertica, Giraph, GraphLab (PowerGraph), Blogel, Flink Gelly, and GraphX (Spark)
over four very large datasets (Twitter, World Road Network, UK 200705, and
ClueWeb) using four workloads (PageRank, WCC, SSSP and K-hop). The main
objective is to perform an independent scale-out study by experimentally
analyzing the performance, usability, and scalability (using up to 128
machines) of these systems. In addition to performance results, we discuss our
experiences in using these systems and suggest some system tuning heuristics
that lead to better performance.
Comment: Volume 11 of Proc. VLDB Endowment.
Toward Creating Subsurface Camera
In this article, the framework and architecture of Subsurface Camera (SAMERA)
is envisioned and described for the first time. A SAMERA is a geophysical
sensor network that senses and processes geophysical sensor signals, and
computes a 3D subsurface image in situ, in real time. The basic mechanism is:
geophysical waves propagating, reflected, or refracted through the subsurface
enter a network of geophysical sensors, where a 2D or 3D image is computed and
recorded; control software may be connected to this network to allow viewing
of the 2D/3D image and adjustment of settings such as resolution, filtering,
regularization, and other algorithm parameters. System prototypes based on
seismic imaging have been designed and built. SAMERA technology is envisioned
as a game changer for many subsurface survey and monitoring applications,
including oil/gas exploration and production, subsurface infrastructure and
homeland security, wastewater and CO2 sequestration, and earthquake and
volcano hazard monitoring.
Creating SAMERA requires an interdisciplinary collaboration and transformation
of sensor networks, signal processing, distributed computing, and geophysical
imaging.
Comment: 15 pages, 7 figures.
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.
Comment: 32 pages, 11 figures.
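As a point of reference for the kind of kernel a toolkit like GHOST optimizes, here is a plain serial sparse matrix-vector product over the standard CSR (compressed sparse row) format. This is only an illustration of the data structure; GHOST's actual kernels are hybrid-parallel C code with a different interface.

```python
def csr_spmv(vals, col_idx, row_ptr, x):
    """Compute y = A @ x for a CSR matrix.

    vals: nonzero values, row by row
    col_idx: column index of each nonzero
    row_ptr: row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y
```

The irregular, indirect access to `x` via `col_idx` is precisely what makes this kernel memory-bound and sensitive to the hardware topology issues the abstract lists.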
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Improving strong scaling of the Conjugate Gradient method for solving large linear systems using global reduction pipelining
This paper presents performance results comparing MPI-based implementations
of the popular Conjugate Gradient (CG) method and several of its communication
hiding (or 'pipelined') variants. Pipelined CG methods are designed to
efficiently solve SPD linear systems on massively parallel distributed memory
hardware, and typically display significantly improved strong scaling compared
to classic CG. This increase in parallel performance is achieved by overlapping
the global reduction phase (MPI_Iallreduce) required to compute the inner
products in each iteration with (chiefly local) computational work, such as the
matrix-vector product, as well as with other global communication. This work
includes a brief introduction to the deep pipelined CG method for readers who
may be
unfamiliar with the specifics of the method. A brief overview of implementation
details provides the practical tools required for implementation of the
algorithm. Subsequently, easily reproducible strong scaling results on the US
Department of Energy (DoE) NERSC machine 'Cori' (Phase I - Haswell nodes) on up
to 1024 nodes with 16 MPI ranks per node are presented using an implementation
of p(l)-CG that is available in the open source PETSc library. Observations on
the staggering and overlap of the asynchronous, non-blocking global
communication phases with communication and computational kernels are drawn
from the experiments.
Comment: EuroMPI 2019, 10-13 September 2019, ETH Zurich, Switzerland; 11
pages, 4 figures, 1 table. arXiv admin note: substantial text overlap with
arXiv:1902.0310
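For orientation, the classic CG baseline that the pipelined variants restructure looks as follows. This is a NumPy sketch, not the PETSc p(l)-CG implementation; the comments mark the two inner products per iteration that each require a global reduction in a distributed run, which pipelined CG overlaps with the matrix-vector product.

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=200):
    """Textbook conjugate gradients for an SPD system A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r                    # inner product: global reduction in MPI
    for _ in range(max_iter):
        Ap = A @ p                # (chiefly local) matvec; pipelined CG hides
                                  # the in-flight MPI_Iallreduce behind this
        alpha = rs / (p @ Ap)     # inner product: second global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

In the standard formulation each reduction sits on the critical path; the pipelined recurrences reorder the updates so the reduction result is not needed until one (or, for p(l)-CG, l) iterations later.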
GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
From social networks to language modeling, the growing scale and importance
of graph data has driven the development of numerous new graph-parallel systems
(e.g., Pregel, GraphLab). By restricting the computation that can be expressed
and introducing new techniques to partition and distribute the graph, these
systems can efficiently execute iterative graph algorithms orders of magnitude
faster than more general data-parallel systems. However, the same restrictions
that enable the performance gains also make it difficult to express many of the
important stages in a typical graph-analytics pipeline: constructing the graph,
modifying its structure, or expressing computation that spans multiple graphs.
As a consequence, existing graph analytics pipelines compose graph-parallel and
data-parallel systems using external storage systems, leading to extensive data
movement and a complicated programming model.
To address these challenges we introduce GraphX, a distributed graph
computation framework that unifies graph-parallel and data-parallel
computation. GraphX provides a small, core set of graph-parallel operators
expressive enough to implement the Pregel and PowerGraph abstractions, yet
simple enough to be cast in relational algebra. GraphX uses a collection of
query optimization techniques such as automatic join rewrites to efficiently
implement these graph-parallel operators. We evaluate GraphX on real-world
graphs and workloads and demonstrate that GraphX achieves performance
comparable to specialized graph computation systems while outperforming them
in end-to-end graph pipelines. Moreover, GraphX achieves a balance between
expressiveness, performance, and ease of use.
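The claim that graph-parallel operators can be cast in relational algebra can be illustrated with PageRank written as a join-then-aggregate dataflow over an edge table and a rank table. The Python sketch below uses illustrative names and plain dicts, not GraphX's Scala API.

```python
def pagerank(edges, num_iter=20, d=0.85):
    """PageRank as repeated join (rank onto edges) + group-by-destination sum."""
    vertices = {v for e in edges for v in e}
    out_deg = {v: 0 for v in vertices}
    for src, _ in edges:
        out_deg[src] += 1
    rank = {v: 1.0 / len(vertices) for v in vertices}
    for _ in range(num_iter):
        # "join" current ranks onto the edge table, then "group by" dst and sum
        contrib = {v: 0.0 for v in vertices}
        for src, dst in edges:
            contrib[dst] += rank[src] / out_deg[src]
        rank = {v: (1 - d) / len(vertices) + d * contrib[v] for v in vertices}
    return rank
```

Each iteration is one join and one aggregation, which is what lets a system like GraphX reuse relational query optimizations such as automatic join rewrites.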
Spinner: Scalable Graph Partitioning in the Cloud
Several organizations, like social networks, store and routinely analyze
large graphs as part of their daily operation. Such graphs are typically
distributed across multiple servers, and graph partitioning is critical for
efficient graph management. Existing partitioning algorithms focus on finding
graph partitions with good locality, but disregard the pragmatic challenges of
integrating partitioning into large-scale graph management systems deployed on
the cloud, such as dealing with the scale and dynamicity of the graph and the
compute environment.
In this paper, we propose Spinner, a scalable and adaptive graph partitioning
algorithm based on label propagation designed on top of the Pregel model.
Spinner scales to massive graphs, produces partitions with locality and balance
comparable to the state of the art, and efficiently adapts the partitioning upon
changes. We describe our algorithm and its implementation in the Pregel
programming model that makes it possible to partition billion-vertex graphs. We
evaluate Spinner with a variety of synthetic and real graphs and show that it
can compute partitions with quality comparable to the state of the art. In
fact, by using Spinner in conjunction with the Giraph graph processing engine,
we speed up different applications by a factor of 2 relative to standard hash
partitioning.
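The label-propagation core behind a partitioner like Spinner can be sketched in a few lines: each vertex repeatedly adopts the partition label most common among its neighbors, so labels spread along dense regions. This toy version ignores Spinner's balance constraints and Pregel execution, and all names are illustrative.

```python
import random

def label_propagation_partition(adjacency, k, iters=10, seed=0):
    """Assign each vertex one of k partition labels by neighbor majority vote."""
    rng = random.Random(seed)
    part = {v: rng.randrange(k) for v in adjacency}  # random initial labels
    for _ in range(iters):
        for v in adjacency:
            counts = [0] * k
            for u in adjacency[v]:
                counts[part[u]] += 1
            # adopt the most frequent neighbor label (ties -> lowest index)
            part[v] = max(range(k), key=lambda p: counts[p])
    return part
```

Spinner's actual algorithm additionally penalizes overloaded partitions in the vote, which is what keeps the resulting partitions balanced.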
Fast Distributed Algorithms for Computing Separable Functions
The problem of computing functions of values at the nodes in a network in a
totally distributed manner, where nodes do not have unique identities and make
decisions based only on local information, has applications in sensor,
peer-to-peer, and ad-hoc networks. The task of computing separable functions,
which can be written as linear combinations of functions of individual
variables, is studied in this context. Known iterative algorithms for averaging
can be used to compute the normalized values of such functions, but these
algorithms do not extend in general to the computation of the actual values of
separable functions.
The main contribution of this paper is the design of a distributed randomized
algorithm for computing separable functions. The running time of the algorithm
is shown to depend on the running time of a minimum computation algorithm used
as a subroutine. Using a randomized gossip mechanism for minimum computation as
the subroutine yields a complete totally distributed algorithm for computing
separable functions. For a class of graphs with small spectral gap, such as
grid graphs, the time used by the algorithm to compute averages is of a smaller
order than the time required by a known iterative averaging scheme.
Comment: 15 pages.
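One classical way to reduce a sum to the minimum computations the abstract describes uses exponential samples: if node i draws samples with rate x_i, the coordinate-wise minimum across nodes is exponential with rate sum_i x_i, so the sum can be estimated from r such minima. The sketch below simulates this centrally as an assumption about the reduction; in the actual distributed algorithm the minima are computed by the gossip subroutine.

```python
import random

def estimate_sum(values, r=20000, seed=1):
    """Estimate sum(values) from r minima of per-node exponential samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(r):
        # each node i draws Exp(rate=x_i); the network-wide minimum (which a
        # gossip minimum-computation would produce) is Exp(rate=sum(values))
        total += min(rng.expovariate(x) for x in values)
    return r / total  # the r minima have mean 1/sum(values)
```

The relative error shrinks like 1/sqrt(r), so a modest number of parallel minimum computations yields a good estimate of the unnormalized sum.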
Strategies and Principles of Distributed Machine Learning on Big Data
The rise of Big Data has led to new demands for Machine Learning (ML) systems
to learn complex models with millions to billions of parameters, that promise
adequate capacity to digest massive datasets and offer powerful predictive
analytics thereupon. In order to run ML algorithms at such scales, on a
distributed cluster with tens to thousands of machines, significant engineering
effort is often required, and one might fairly ask whether such engineering
truly falls within the domain of ML research. Taking the view that Big ML
systems can benefit greatly from ML-rooted statistical and algorithmic
insights, and that ML researchers should therefore not shy away from such
systems design, we discuss a series of principles and strategies
distilled from our recent efforts on industrial-scale ML solutions. These
principles and strategies span a continuum from application, to engineering,
and to theoretical research and development of Big ML systems and
architectures, with the goal of understanding how to make them efficient,
generally applicable, and supported with convergence and scaling guarantees.
They concern four key questions which traditionally receive little attention in
ML research: How to distribute an ML program over a cluster? How to bridge ML
computation with inter-machine communication? How to perform such
communication? What should be communicated between machines? By exposing
underlying statistical and algorithmic characteristics unique to ML programs
but not typically seen in traditional computer programs, and by dissecting
successful cases to reveal how we have harnessed these principles to design and
develop both high-performance distributed ML software as well as
general-purpose ML frameworks, we present opportunities for ML researchers and
practitioners to further shape and grow the area that lies between ML and
systems.
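As a concrete instance of the first question above (how to distribute an ML program over a cluster), the simplest strategy is synchronous data parallelism: each worker computes a gradient on its own data shard, the gradients are averaged at a barrier, and the shared parameter is updated. The sketch below fits a one-parameter least-squares model; all names are illustrative, and real systems go further by overlapping this communication with computation or relaxing the barrier.

```python
def shard_gradient(w, shard):
    """Gradient of mean squared error (w*x - y)^2 over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def distributed_sgd(shards, steps=200, lr=0.05):
    """Synchronous data-parallel SGD: compute per-shard gradients, average, step."""
    w = 0.0
    for _ in range(steps):
        grads = [shard_gradient(w, s) for s in shards]  # parallel workers
        w -= lr * sum(grads) / len(grads)               # aggregation barrier
    return w
```

The barrier in the aggregation step is exactly the bridging point between ML computation and inter-machine communication that the abstract's second question concerns.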