A Similarity Measure for GPU Kernel Subgraph Matching
Accelerator architectures specialize in executing SIMD (single instruction, multiple data) operations in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically measures instruction and basic block frequencies. CUDAflow captures this information in a control flow graph (CFG) and performs subgraph matching across kernels' CFGs to gain insights into an application's resource requirements, based on the shape and traversal of the graph, the instruction operations executed, and the registers allocated, among other information. The utility of CUDAflow is demonstrated with SHOC and Rodinia application case studies on a variety of GPU architectures, revealing novel thread divergence characteristics that help end users, autotuners, and compilers generate high-performing code.
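To make the idea concrete, a toy example (a hypothetical stand-in, not CUDAflow's actual subgraph-matching measure; every name here is invented for illustration) might compare two kernels through the dynamic weight of their basic blocks:

// Hypothetical sketch: comparing two kernel CFGs by their basic-block
// frequency profiles. CUDAflow's actual similarity measure operates on
// subgraph matches; this cosine-similarity stand-in only illustrates the
// idea of characterizing kernels through per-block execution counts.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// One basic block: static instruction count and dynamic execution frequency.
struct BasicBlock {
    int num_instructions;
    long long exec_count;
};

// Flatten a CFG into a weight vector (dynamic instructions per block).
std::vector<double> profile(const std::vector<BasicBlock>& cfg) {
    std::vector<double> p;
    for (const auto& bb : cfg)
        p.push_back(static_cast<double>(bb.num_instructions) * bb.exec_count);
    return p;
}

// Cosine similarity between two profiles, padded to equal length.
double similarity(std::vector<double> a, std::vector<double> b) {
    size_t n = std::max(a.size(), b.size());
    a.resize(n, 0.0);
    b.resize(n, 0.0);
    double dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < n; ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return (na == 0 || nb == 0) ? 0.0 : dot / (std::sqrt(na) * std::sqrt(nb));
}

int main() {
    std::vector<BasicBlock> k1 = {{12, 1000}, {4, 500000}, {7, 1000}};
    std::vector<BasicBlock> k2 = {{12, 2000}, {4, 900000}, {9, 2000}};
    std::printf("similarity = %.3f\n", similarity(profile(k1), profile(k2)));
}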
CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures
The use of graphics processing units (GPUs) in high-performance parallel computing continues to become more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendor-neutral programming environment and runtime system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a "write once, run anywhere" ecosystem for heterogeneous computing. Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is typically straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to-OpenCL source-to-source translator that possesses a novel design and makes clever reuse of the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in the CUDA runtime API, and we have successfully translated many applications from the CUDA SDK and Rodinia benchmark suite. The performance of applications automatically translated by CU2CL is on par with that of their manually ported counterparts.
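The similarity that makes such translation tractable is easy to see in a toy kernel; the sketch below shows a CUDA SAXPY kernel with its OpenCL counterpart in a comment, using only the standard one-to-one mapping between the two APIs (the kernel is an illustrative example, not taken from CU2CL's test suite):

// CUDA kernel: the kind of construct CU2CL translates mechanically.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Equivalent OpenCL C kernel. The translation is largely one-to-one:
     __global__                        -> __kernel
     pointer parameters                -> __global-qualified pointers
     blockIdx/blockDim/threadIdx math  -> get_global_id(0)

   __kernel void saxpy(int n, float a,
                       __global const float* x, __global float* y) {
       int i = get_global_id(0);
       if (i < n)
           y[i] = a * x[i] + y[i];
   }
*/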
Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects
We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A total speed-up of 16-fold has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.
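Pipelines of this kind are typically built around frequency-domain matched filtering, which is dominated by FFTs and pointwise complex arithmetic, both well suited to GPUs. The following is a generic cuFFT-based sketch of that correlation step, offered as an assumption about the general technique rather than the authors' actual code; all names and parameters are invented:

// Hedged sketch of the frequency-domain correlation at the heart of a
// matched-filter inspiral search: snr(t) = IFFT[ d~(f) * conj(h~(f)) / S(f) ].
// A generic illustration, not the pipeline described in the paper.
#include <cufft.h>
#include <cuComplex.h>

__global__ void weighted_correlate(int n, const cuFloatComplex* data_f,
                                   const cuFloatComplex* tmpl_f,
                                   const float* psd, cuFloatComplex* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // d~(f) * conj(h~(f)), weighted by the inverse noise power spectrum
        cuFloatComplex c = cuCmulf(data_f[i], cuConjf(tmpl_f[i]));
        out[i] = make_cuFloatComplex(cuCrealf(c) / psd[i], cuCimagf(c) / psd[i]);
    }
}

// Host side (error checking omitted): inverse FFT back to the time domain,
// where peaks in |out(t)| indicate candidate signal times.
void correlate(int n, cuFloatComplex* d_data_f, cuFloatComplex* d_tmpl_f,
               float* d_psd, cuFloatComplex* d_out) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    weighted_correlate<<<blocks, threads>>>(n, d_data_f, d_tmpl_f, d_psd, d_out);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_out, d_out, CUFFT_INVERSE);  // unnormalized inverse FFT
    cufftDestroy(plan);
}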
NVIDIA Tensor Core Programmability, Performance & Precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit called the "Tensor Core," which performs one matrix multiply-and-accumulate operation on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflops/s in mixed precision. In this paper, we investigate current approaches to programming NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision.
Currently, NVIDIA provides three different ways of programming matrix multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflops/s in mixed precision on a Tesla V100 GPU, seven and three times the performance achieved in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflops/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from using NVIDIA Tensor Cores.
Comment: This paper has been accepted by the Eighth International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) 201
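For reference, the WMMA API mentioned above exposes Tensor Cores at warp granularity. In the minimal kernel below (a sketch using only the documented mma.h interface), one warp computes D = A*B + C on a single 16x16x16 tile with half-precision inputs and single-precision accumulation:

// One warp, one 16x16x16 Tensor Core tile: D = A*B + C.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half* A, const half* B,
                              const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core MAC

    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}

Launched with a single warp (e.g. wmma_16x16x16<<<1, 32>>>(...)), this covers one tile; a full GEMM distributes many such fragments across warps and thread blocks, which is essentially what CUTLASS and cuBLAS GEMM automate.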
Landau Gauge Fixing on GPUs and String Tension
We explore the performance of CUDA in performing Landau gauge fixing in Lattice QCD, using the steepest descent method with Fourier acceleration. The code performance was tested on a Tesla C2070 GPU (Fermi architecture). We also present a study of the string tension at finite temperature in the confined phase. The string tension is extracted from the color-averaged free energy and from the color singlet using Landau gauge fixing.
Comment: 7 pages, 4 figures, 1 table. Contribution to the International Meeting "Excited QCD", Peniche, Portugal, 06 - 12 May 201
Sapporo2: A versatile direct N-body library
Astrophysical direct N-body methods have been one of the first production algorithms to be implemented using NVIDIA's CUDA architecture. Now, almost seven years later, the GPU is the most used accelerator device in astronomy for simulating stellar systems. In this paper we present the implementation of the Sapporo2 N-body library, which allows researchers to use the GPU for N-body simulations with little to no effort. The first version, released five years ago, is actively used, but lacks advanced features and versatility in numerical precision and support for higher-order integrators. In this updated version we have rebuilt the code from scratch and added support for OpenCL, multi-precision, and higher-order integrators. We show how to tune these codes for different GPU architectures and present how to continue utilizing the GPU optimally even when only a small number of particles is integrated. This careful tuning allows Sapporo2 to be faster than Sapporo1 even with the added options and double-precision data loads. The code runs on a range of NVIDIA and AMD GPUs in single and double precision accuracy. With the addition of OpenCL support the library is also able to run on CPUs and other accelerators that support OpenCL.
Comment: 15 pages, 7 figures. Accepted for publication in Computational Astrophysics and Cosmology
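The computational core such libraries offload is the all-pairs force evaluation. A generic CUDA kernel for softened direct summation (a sketch of the common pattern, not Sapporo2's tuned implementation) looks like this:

// Generic direct N-body acceleration kernel: each thread accumulates the
// softened gravitational acceleration on one particle over all N others.
// Sapporo2 layers blocking, shared memory, multi-precision, and
// higher-order integrator support on top of this basic pattern.
__global__ void nbody_accel(int n, float eps2,
                            const float4* pos_mass,  // xyz = position, w = mass
                            float3* accel) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = pos_mass[i];
    float3 a = make_float3(0.f, 0.f, 0.f);
    for (int j = 0; j < n; ++j) {
        float4 pj = pos_mass[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx*dx + dy*dy + dz*dz + eps2;   // Plummer softening
        float inv_r = rsqrtf(r2);
        float w = pj.w * inv_r * inv_r * inv_r;    // m_j / r^3
        a.x += w * dx; a.y += w * dy; a.z += w * dz;
    }
    accel[i] = a;
}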
Parallelized Inference for Gravitational-Wave Astronomy
Bayesian inference is the workhorse of gravitational-wave astronomy, for example, determining the mass and spins of merging black holes, revealing the neutron star equation of state, and unveiling the population properties of compact binaries. The science enabled by these inferences comes with a computational cost that can limit the questions we are able to answer. This cost is expected to grow. As detectors improve, the detection rate will go up, allowing less time to analyze each event. Improvement in low-frequency sensitivity will yield longer signals, increasing the number of computations per event. The growing number of entries in the transient catalog will drive up the cost of population studies. While Bayesian inference calculations are not entirely parallelizable, key components are embarrassingly parallel: calculating the gravitational waveform and evaluating the likelihood function. Graphics processing units (GPUs) are adept at such parallel calculations. We report on progress porting gravitational-wave inference calculations to GPUs. Using a single code, which takes advantage of GPU architecture if it is available, we compare computation times using modern GPUs (NVIDIA P100) and CPUs (Intel Gold 6140). We demonstrate substantial speed-ups for compact binary coalescence gravitational waveform generation and likelihood evaluation, and even larger speed-ups for population inference, within the lifetime of current detectors. Further improvement is likely with continued development. Our Python-based code is publicly available and can be used without familiarity with the parallel computing platform, CUDA.
Comment: 5 pages, 4 figures, submitted to PRD. Code can be found at
https://github.com/ColmTalbot/gwpopulation
https://github.com/ColmTalbot/GPUCBC
https://github.com/ADACS-Australia/ADACS-SS18A-RSmith
Add demonstration of improvement in BNS spin
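To illustrate the embarrassingly parallel step the abstract identifies, a hedged sketch of a frequency-domain Gaussian log-likelihood evaluation on the GPU follows, with one thread per frequency bin. This is a generic illustration in CUDA (the paper's own code is Python-based), and every name in it is invented:

// Per-bin terms of a frequency-domain Gaussian log-likelihood; not the
// authors' code, just the common form of this computation.
#include <cuComplex.h>

__global__ void loglike_terms(int n, float df,
                              const cuFloatComplex* data,      // d~(f_k)
                              const cuFloatComplex* waveform,  // h~(f_k)
                              const float* psd,                // S(f_k)
                              float* terms) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) {
        // residual r = d - h for this frequency bin
        cuFloatComplex r = cuCsubf(data[k], waveform[k]);
        float r2 = cuCrealf(r) * cuCrealf(r) + cuCimagf(r) * cuCimagf(r);
        // log-likelihood contribution: -2 |d - h|^2 df / S(f)
        terms[k] = -2.0f * r2 * df / psd[k];
    }
}

Summing terms[] (for example with thrust::reduce) yields the log-likelihood up to a waveform-independent normalization; the waveform h~(f) is itself generated in parallel across bins in the same way.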