Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable
There has been significant recent interest in parallel graph processing due
to the need to quickly analyze the large graphs available today. Many graph
codes have been designed for distributed memory or external memory. However,
today even the largest publicly-available real-world graph (the Hyperlink Web
graph with over 3.5 billion vertices and 128 billion edges) can fit in the
memory of a single commodity multicore server. Nevertheless, most experimental
work in the literature reports results on much smaller graphs, and the work on
the Hyperlink graph uses distributed or external memory. Therefore, it is
natural to ask whether we can efficiently solve a broad class of graph problems
on this graph in memory.
This paper shows that theoretically-efficient parallel graph algorithms can
scale to the largest publicly-available graphs using a single machine with a
terabyte of RAM, processing them in minutes. We give implementations of
theoretically-efficient parallel algorithms for 20 important graph problems. We
also present the optimizations and techniques that we used in our
implementations, which were crucial in enabling us to process these large
graphs quickly. We show that the running times of our implementations
outperform existing state-of-the-art implementations on the largest real-world
graphs. For many of the problems that we consider, this is the first time they
have been solved on graphs at this scale. We have made the implementations
developed in this work publicly-available as the Graph-Based Benchmark Suite
(GBBS).
Comment: This is the full version of the paper appearing in the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 201
Parallel Batch-Dynamic Graph Connectivity
In this paper, we study batch parallel algorithms for the dynamic
connectivity problem, a fundamental problem that has received considerable
attention in the sequential setting. The most well known sequential algorithm
for dynamic connectivity is the elegant level-set algorithm of Holm, de
Lichtenberg and Thorup (HDT), which achieves amortized time per
edge insertion or deletion, and time per query. We
design a parallel batch-dynamic connectivity algorithm that is work-efficient
with respect to the HDT algorithm for small batch sizes, and is asymptotically
faster when the average batch size is sufficiently large. Given a sequence of
batched updates, where is the average batch size of all deletions, our
algorithm achieves expected amortized work per
edge insertion and deletion and depth w.h.p. Our algorithm
answers a batch of connectivity queries in expected
work and depth w.h.p. To the best of our knowledge, our algorithm
is the first parallel batch-dynamic algorithm for connectivity.Comment: This is the full version of the paper appearing in the ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), 201
Parallel algorithms for finding connected components using linear algebra
Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various models of parallel computation. This paper presents a class of parallel connected-component algorithms designed using linear-algebraic primitives. These algorithms are based on a PRAM algorithm by Shiloach and Vishkin and can be designed using standard GraphBLAS operations. We demonstrate two algorithms of this class: LACC (Linear Algebraic Connected Components) and FastSV, which can be regarded as a simplification of LACC. With the support of the highly-scalable Combinatorial BLAS library, LACC and FastSV outperform the previous state-of-the-art algorithm by a factor of up to 12x for small- to medium-scale graphs. For large graphs with more than 50B edges, LACC and FastSV scale to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperform previous algorithms by a significant margin. This remarkable performance is accomplished by (1) exploiting sparsity that was not present in the original PRAM algorithm formulation, (2) using high-performance primitives of Combinatorial BLAS, and (3) identifying hot spots and optimizing them away by exploiting algorithmic insights.
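The hooking-and-shortcutting pattern of the Shiloach-Vishkin PRAM algorithm that underlies LACC and FastSV can be sketched as follows. This is a minimal sequential illustration in plain Python, not the papers' GraphBLAS formulation; in LACC and FastSV each step is expressed as sparse matrix-vector operations. All function and variable names here are illustrative.

```python
# Sketch of Shiloach-Vishkin-style connected components: repeatedly
# hook the root of one endpoint under the lower-labeled root of the
# other (hooking), then halve tree depths (shortcutting), until stable.

def connected_components(n, edges):
    """n vertices labeled 0..n-1; edges is a list of (u, v) pairs.
    Returns a label per vertex: the minimum vertex id in its component."""
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Hooking: for each edge, link the higher-labeled root to the lower.
        for u, v in edges:
            ru, rv = parent[parent[u]], parent[parent[v]]
            if ru < rv and rv == parent[rv]:      # rv is a root
                parent[rv] = ru
                changed = True
            elif rv < ru and ru == parent[ru]:    # ru is a root
                parent[ru] = rv
                changed = True
        # Shortcutting: halve the depth of every tree.
        for x in range(n):
            if parent[parent[x]] != parent[x]:
                parent[x] = parent[parent[x]]
                changed = True
    # Flatten so every vertex points directly at its root.
    for x in range(n):
        while parent[x] != parent[parent[x]]:
            parent[x] = parent[parent[x]]
    return parent
```

In the linear-algebraic formulation, the hooking pass over all edges becomes one matrix-vector product over a (min, select)-style semiring, which is where the sparsity exploitation described above pays off.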
Parallel Index-Based Structural Graph Clustering and Its Approximation
SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely
used graph clustering algorithm. For large graphs, however, sequential SCAN
variants are prohibitively slow, and parallel SCAN variants do not effectively
share work among queries with different SCAN parameter settings. Since users of
SCAN often explore many parameter settings to find good clusterings, it is
worthwhile to precompute an index that speeds up queries.
This paper presents a practical and provably efficient parallel index-based
SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel
algorithm improves upon the asymptotic work of the sequential algorithm by
using integer sorting. It is also highly parallel, achieving logarithmic span
(parallel time) for both index construction and clustering queries.
Furthermore, we apply locality-sensitive hashing (LSH) to design a novel
approximate SCAN algorithm and prove guarantees for its clustering behavior.
We present an experimental evaluation of our algorithms on large real-world
graphs. On a 48-core machine with two-way hyper-threading, our parallel index
construction achieves 50--151x speedup over the construction of
GS*-Index. In fact, even on a single thread, our index construction algorithm
is faster than GS*-Index. Our parallel index query implementation achieves
5--32x speedup over GS*-Index queries across a range of SCAN parameter
values, and our implementation is always faster than ppSCAN, a state-of-the-art
parallel SCAN algorithm. Moreover, our experiments show that applying LSH
results in faster index construction while maintaining good clustering quality.
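The quantity at the heart of SCAN, and the one an index precomputes per edge, is the structural similarity between adjacent vertices; a vertex is a core if enough of its neighborhood is similar to it. The sketch below shows these two definitions in plain Python. It is an illustration of the standard SCAN definitions only, not of the GS*-Index or ppSCAN implementations, and all names are illustrative.

```python
import math
from collections import defaultdict

def build_adj(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def similarity(adj, u, v):
    """Structural similarity: |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|),
    where N[x] is the closed neighborhood (x plus its neighbors)."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def is_core(adj, u, eps, mu):
    """u is a core under (eps, mu) if at least mu vertices in N[u]
    (including u itself, whose similarity is 1) are eps-similar to u."""
    similar = sum(1 for v in (adj[u] | {u}) if similarity(adj, u, v) >= eps)
    return similar >= mu
```

Because users re-run clustering with many (eps, mu) pairs, precomputing the per-edge similarities once, as GS*-Index does, lets every subsequent query reuse them; the LSH variant described above approximates the intersection sizes instead of computing them exactly.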
Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUs
Connected components and spanning forest are fundamental graph algorithms due
to their use in many important applications, such as graph clustering and image
segmentation. GPUs are an ideal platform for graph algorithms due to their high
peak performance and memory bandwidth. While there exist several GPU
connectivity algorithms in the literature, many design choices have not yet
been explored. In this paper, we explore various design choices in GPU
connectivity algorithms, including sampling, linking, and tree compression, for
both the static as well as the incremental setting. These design choices
lead to over 300 new GPU implementations of connectivity, many of which
outperform the state-of-the-art. We present an experimental evaluation, and
show that we achieve an average speedup of 2.47x over existing static
algorithms. In the incremental setting, we achieve a throughput of up to 48.23
billion edges per second. Compared to state-of-the-art CPU implementations on a
72-core machine, we achieve a speedup of 8.26--14.51x for static connectivity
and 1.85--13.36x for incremental connectivity using a Tesla V100 GPU.
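Two of the design axes named above, linking and tree compression, are knobs on a union-find structure. The sketch below is a sequential CPU illustration of those two knobs, assuming a "hook higher-labeled root under lower" linking rule and path halving as one compression option; real GPU implementations perform these steps concurrently with atomic compare-and-swap. Names and the specific rule choices are illustrative, not taken from the paper's code.

```python
# Union-find connectivity with a pluggable tree-compression strategy,
# illustrating the linking/compression design space on one thread.

def components(n, edges, compression="halving"):
    """Returns one root label per vertex after processing all edges."""
    parent = list(range(n))

    def find(x):
        # Tree compression: "halving" shortens the path while walking
        # to the root; "none" just walks without modifying the tree.
        while parent[x] != x:
            if compression == "halving":
                parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Linking rule: hook the higher-labeled root under the lower,
            # so each component ends up labeled by its minimum root.
            if ru < rv:
                parent[rv] = ru
            else:
                parent[ru] = rv

    return [find(x) for x in range(n)]
```

Swapping the body of find (full compression, splitting, halving, none) and the linking rule yields the kind of implementation grid the paper enumerates; sampling strategies, which skip most edges in a first pass, add a further axis.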