624 research outputs found
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures
Irregular computations on unstructured data are an important class of
problems for parallel programming. Graph coloring is often an important
preprocessing step, e.g. as a way to perform dependency analysis for safe
parallel execution. The total run time of a coloring algorithm adds to the
overall parallel overhead of the application whereas the number of colors used
determines the amount of exposed parallelism. A fast and scalable coloring
algorithm using as few colors as possible is vital for the overall parallel
performance and scalability of many irregular applications that depend upon
runtime dependency analysis.
Catalyurek et al. have proposed a graph coloring algorithm which relies on
speculative, local assignment of colors. In this paper we present an improved
version which runs even more optimistically with less thread synchronization
and reduced number of conflicts compared to Catalyurek et al.'s algorithm. We
show that the new technique scales better on multi-core and many-core systems
and performs up to 1.5x faster than its predecessor on graphs with high-degree
vertices, while keeping the number of colors at the same near-optimal levels.Comment: To appear in the proceedings of Euro Par 201
Scalable partitioning for parallel position based dynamics
We introduce a practical partitioning technique designed for parallelizing Position Based Dynamics, and exploiting
the ubiquitous multi-core processors present in current commodity GPUs. The input is a set of particles whose
dynamics is influenced by spatial constraints. In the initialization phase, we build a graph in which each node
corresponds to a constraint and two constraints are connected by an edge if they influence at least one common
particle. We introduce a novel greedy algorithm for inserting additional constraints (phantoms) in the graph
such that the resulting topology is q-colourable, where ˆ qˆ ≥ 2 is an arbitrary number. We color the graph, and
the constraints with the same color are assigned to the same partition. Then, the set of constraints belonging to
each partition is solved in parallel during the animation phase. We demonstrate this by using our partitioning
technique; the performance hit caused by the GPU kernel calls is significantly decreased, leaving unaffected the
visual quality, robustness and speed of serial position based dynamics
Multi-GPU Graph Analytics
We present a single-node, multi-GPU programmable graph processing library
that allows programmers to easily extend single-GPU graph algorithms to achieve
scalable performance on large graphs with billions of edges. Directly using the
single-GPU implementations, our design only requires programmers to specify a
few algorithm-dependent concerns, hiding most multi-GPU related implementation
details. We analyze the theoretical and practical limits to scalability in the
context of varying graph primitives and datasets. We describe several
optimizations, such as direction optimizing traversal, and a just-enough memory
allocation scheme, for better performance and smaller memory consumption.
Compared to previous work, we achieve best-of-class performance across
operations and datasets, including excellent strong and weak scalability on
most primitives as we increase the number of GPUs in the system.Comment: 12 pages. Final version submitted to IPDPS 201
Recommended from our members
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
- …