Importance of Explicit Vectorization for CPU and GPU Software Performance
Much of the current focus in high-performance computing is on
multi-threading, multi-computing, and graphics processing unit (GPU) computing.
However, vectorization and non-parallel optimization techniques, which can
often be employed additionally, are less frequently discussed. In this paper,
we present an analysis of several optimizations done on both central processing
unit (CPU) and GPU implementations of a particular computationally intensive
Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the
equivalent, explicit memory coalescing, on the GPU are found to be critical to
achieving good performance of this algorithm in both environments. The
fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU
version, in addition to speedup from multi-threading. This is 2x faster than
the fully-optimized GPU version.Comment: 17 pages, 17 figure
Architecture-Aware Optimization on a 1600-core Graphics Processor
The graphics processing unit (GPU) continues to
make significant strides as an accelerator in commodity cluster
computing for high-performance computing (HPC). For example,
three of the top five fastest supercomputers in the world, as
ranked by the TOP500, employ GPUs as accelerators. Despite this
increasing interest in GPUs, however, optimizing the performance
of a GPU-accelerated compute node requires deep technical
knowledge of the underlying architecture. Although significant
literature exists on how to optimize GPU performance on the
more mature NVIDIA CUDA architecture, comparatively little exists
for OpenCL on the AMD GPU.
Consequently, we present and evaluate architecture-aware optimizations
for the AMD GPU. The most prominent optimizations
include (i) explicit use of registers, (ii) use of vector types, (iii)
removal of branches, and (iv) use of image memory for global data.
We demonstrate the efficacy of our AMD GPU optimizations by
applying each optimization in isolation as well as in concert to
a large-scale, molecular modeling application called GEM. Via
these AMD-specific GPU optimizations, the AMD Radeon HD
5870 GPU delivers 65% better performance than with the well-known
NVIDIA-specific optimizations.
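The paper's kernels are written in OpenCL for the AMD GPU; the sketch below uses CUDA syntax purely for consistency with the other examples in this listing, and every identifier in it is hypothetical. It illustrates two of the four optimizations named above: vector types (ii) and branch removal (iii).

#include <cuda_runtime.h>

// (ii) Vector types: float4 accesses move 16 bytes per instruction and,
// on a VLIW-style part like the Radeon HD 5870, also help fill ALU slots.
__global__ void scale_vec4(const float4* in, float4* out, float s, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        out[i] = make_float4(v.x * s, v.y * s, v.z * s, v.w * s);
    }
}

// (iii) Branch removal: replace a divergent if/else with arithmetic
// selection so every work-item in a wavefront follows the same path.
__global__ void clamp_branchless(const float* in, float* out,
                                 float lo, float hi, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fminf(fmaxf(in[i], lo), hi); // no data-dependent branch
}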
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high-performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.
Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing (TOPC); an extended version of the PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU"
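Gunrock's actual operators (advance, filter, compute) and its load-balancing strategies are considerably more involved, but the frontier-centric, bulk-synchronous style the abstract describes can be sketched as a single BFS advance step over a CSR graph. All names below are hypothetical, not Gunrock's API.

#include <cuda_runtime.h>

// One bulk-synchronous "advance": each thread takes a vertex from the
// input frontier, visits its neighbors, and pushes newly labeled
// vertices onto the output frontier.
__global__ void bfs_advance(const int* row_offsets, const int* col_indices,
                            const int* frontier_in, int frontier_in_size,
                            int* labels, int depth,
                            int* frontier_out, int* frontier_out_size) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= frontier_in_size) return;
    int v = frontier_in[t];
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int u = col_indices[e];
        // Claim each neighbor exactly once; -1 marks "unvisited".
        if (atomicCAS(&labels[u], -1, depth + 1) == -1) {
            int slot = atomicAdd(frontier_out_size, 1);
            frontier_out[slot] = u;
        }
    }
}

The host would run this level by level, swapping the input and output frontiers and reading back frontier_out_size between launches; that per-level synchronization is the bulk-synchronous part. Real Gunrock layers per-thread, per-warp, and per-block neighbor expansion on top of this to handle the irregular degree distributions the abstract mentions.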
Advances in GPU architecture for deep learning and scientific computing
The talk will cover the recent NVIDIA product announcements made at the GTC'16 conference, and how the Pascal GPU and NVLink interconnect technologies greatly improve multi-GPU performance and efficiency in deep learning and scientific computing applications.
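The talk itself is descriptive, but the multi-GPU path NVLink accelerates is the standard CUDA peer-to-peer mechanism, sketched below with real CUDA runtime calls; whether a given copy actually travels over NVLink or over PCIe depends on the hardware topology.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Check that GPUs 0 and 1 can address each other's memory directly.
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { printf("no peer access between GPU 0 and 1\n"); return 1; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0); // flags must be 0
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // Direct device-to-device copy: over NVLink when available, else PCIe.
    size_t bytes = 1 << 20;
    void *src, *dst;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    return 0;
}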
