Gunrock: A High-Performance Graph Processing Library on the GPU
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs have been two
significant challenges for developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high-performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We evaluate Gunrock on five key graph
primitives and show that Gunrock has on average at least an order of magnitude
speedup over Boost and PowerGraph, comparable performance to the fastest GPU
hardwired primitives, and better performance than any other GPU high-level
graph library.
Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the
previous version v5)
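As a rough illustration of the bulk-synchronous, frontier-centric style described in this abstract, breadth-first search can be written as repeated operations on a vertex frontier. This is a minimal CPU-side sketch in Python, not Gunrock's API: the names `advance`, `filter_unvisited`, and `bfs` are chosen here for illustration, and the graph is assumed to be a plain adjacency-list dictionary.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]  # assumed layout: vertex -> list of neighbors


def advance(graph: Graph, frontier: List[int]) -> List[int]:
    """Expand every frontier vertex to its neighbors (duplicates allowed)."""
    return [v for u in frontier for v in graph.get(u, [])]


def filter_unvisited(candidates: List[int], depth: Dict[int, int], level: int) -> List[int]:
    """Keep each not-yet-visited vertex once and record its BFS depth."""
    next_frontier = []
    for v in candidates:
        if v not in depth:
            depth[v] = level
            next_frontier.append(v)
    return next_frontier


def bfs(graph: Graph, source: int) -> Dict[int, int]:
    """Run BFS as a sequence of bulk steps, one per frontier."""
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        candidates = advance(graph, frontier)                  # bulk step 1
        frontier = filter_unvisited(candidates, depth, level)  # bulk step 2
    return depth
```

Each loop iteration is one synchronous step over the whole frontier; on a GPU, the two helper functions would presumably map to data-parallel kernels rather than Python loops.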
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high-performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU"
Mixing multi-core CPUs and GPUs for scientific simulation software
Recent technological and economic developments have led to widespread availability of
multi-core CPUs and specialist accelerator processors such as graphical processing units
(GPUs). The accelerated computational performance possible from these devices can be very
high for some application paradigms. Software languages and systems such as NVIDIA's
CUDA and the Khronos consortium's Open Computing Language (OpenCL) support a number of
individual parallel application programming paradigms. To scale up the performance of some
complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and
very many core GPUs for data parallelism is necessary. We describe our use of hybrid
applications using threading approaches and multi-core CPUs to control independent GPU devices.
We present speed-up data and discuss multi-threading software issues for the applications
level programmer and offer some suggested areas for language development and integration
between coarse-grained and fine-grained multi-thread systems. We discuss results from three
common simulation algorithmic areas including: partial differential equations; graph cluster
metric calculations and random number generation. We report on programming experiences
and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs;
a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and
trends in multi-core programming for scientific applications developers.
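The hybrid pattern this abstract describes, with host threads each controlling an independent device, can be sketched roughly as follows. This is an illustrative Python sketch only: `run_on_device` is a hypothetical stand-in for a data-parallel kernel launch, no real GPU API is called, and the sum-of-squares workload is chosen purely for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def run_on_device(device_id: int, chunk: Sequence[float]) -> float:
    """Hypothetical stand-in for a kernel run on device `device_id`."""
    return sum(x * x for x in chunk)


def split(data: Sequence[float], parts: int) -> List[Sequence[float]]:
    """Split data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def hybrid_sum_of_squares(data: Sequence[float], num_devices: int) -> float:
    """Coarse-grained level: one host thread per (simulated) device."""
    chunks = split(data, num_devices)
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        partials = list(pool.map(run_on_device, range(num_devices), chunks))
    # The host combines the per-device partial results.
    return sum(partials)
```

The thread pool carries the coarse-grained parallelism, while the body of `run_on_device` marks where fine-grained data parallelism would live on a real accelerator.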
Three Puzzles on Mathematics, Computation, and Games
In this lecture I will talk about three mathematical puzzles involving
mathematics and computation that have preoccupied me over the years. The first
puzzle is to understand the amazing success of the simplex algorithm for linear
programming. The second puzzle is about errors made when votes are counted
during elections. The third puzzle is: are quantum computers possible?
Comment: ICM 2018 plenary lecture, Rio de Janeiro, 36 pages, 7 Figures
Using Graph Properties to Speed-up GPU-based Graph Traversal: A Model-driven Approach
While it is well-known and acknowledged that the performance of graph
algorithms is heavily dependent on the input data, there has been surprisingly
little research to quantify and predict the impact the graph structure has on
performance. Parallel graph algorithms, running on many-core systems such as
GPUs, are no exception: most research has focused on how to efficiently
implement and tune different graph operations on a specific GPU. However, the
performance impact of the input graph has only been taken into account
indirectly as a result of the graphs used to benchmark the system.
In this work, we present a case study investigating how to use the properties
of the input graph to improve the performance of the breadth-first search (BFS)
graph traversal. To do so, we first study the performance variation of 15
different BFS implementations across 248 graphs. Using this performance data,
we show that significant speed-up can be achieved by combining the best
implementation for each level of the traversal. To make use of this
data-dependent optimization, we must correctly predict the relative performance
of algorithms per graph level, and enable dynamic switching to the optimal
algorithm for each level at runtime.
We use the collected performance data to train a binary decision tree, to
enable high-accuracy predictions and fast switching. We demonstrate empirically
that our decision tree is both fast enough to allow dynamic switching between
implementations, without noticeable overhead, and accurate enough in its
prediction to enable significant BFS speedup. We conclude that our model-driven
approach (1) enables BFS to outperform state-of-the-art GPU algorithms, and (2)
can be adapted for other BFS variants, other algorithms, or more specific
datasets.
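The idea of choosing an implementation per traversal level can be sketched as below. This is a minimal Python sketch under stated assumptions, not the paper's method: it switches between only two step variants (top-down and bottom-up), and `choose_step` is a hand-written single-threshold rule standing in for the trained decision tree. The 0.25 threshold and all function names are hypothetical, and the graph is assumed to be undirected with every vertex present as a key.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, List[int]]  # assumed: undirected, every vertex is a key


def top_down_step(graph: Graph, frontier: Set[int], depth: Dict[int, int], level: int) -> Set[int]:
    """Frontier vertices push to their unvisited neighbors."""
    nxt = set()
    for u in frontier:
        for v in graph[u]:
            if v not in depth:
                depth[v] = level
                nxt.add(v)
    return nxt


def bottom_up_step(graph: Graph, frontier: Set[int], depth: Dict[int, int], level: int) -> Set[int]:
    """Unvisited vertices look for any neighbor in the frontier."""
    nxt = {v for v in graph
           if v not in depth and any(u in frontier for u in graph[v])}
    for v in nxt:
        depth[v] = level
    return nxt


STEPS = {"top_down": top_down_step, "bottom_up": bottom_up_step}


def choose_step(frontier_size: int, num_vertices: int, threshold: float) -> str:
    """Hypothetical predictor: a one-split rule on relative frontier size."""
    return "bottom_up" if frontier_size > threshold * num_vertices else "top_down"


def adaptive_bfs(graph: Graph, source: int, threshold: float = 0.25) -> Tuple[Dict[int, int], List[str]]:
    """BFS that picks a step implementation at every level."""
    depth, frontier, level, choices = {source: 0}, {source}, 0, []
    while frontier:
        level += 1
        name = choose_step(len(frontier), len(graph), threshold)
        choices.append(name)
        frontier = STEPS[name](graph, frontier, depth, level)
    return depth, choices
```

Both step variants compute the same next frontier on an undirected graph, so the predictor affects only the cost of each level and never the resulting depths.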