Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices
Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8×–4.3× speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves 2.9× and 48.2× speedups over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU-to-GPU interconnects, respectively.
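The abstract does not say how the SpMM kernel is tiled. As a rough illustration only, the sketch below assumes a CSR matrix split into blocks of consecutive rows, so that only one block needs to be resident at a time. The names `spmm_csr_rows`, `tiled_spmm`, and `tile_rows` are hypothetical, and the code runs on the CPU in plain Python rather than through OpenMP or OpenACC.

```python
def spmm_csr_rows(row_ptr, col_idx, vals, X, row_start, row_end):
    """Multiply rows [row_start, row_end) of a CSR matrix by the dense block X."""
    n_cols = len(X[0])
    out = []
    for i in range(row_start, row_end):
        acc = [0.0] * n_cols
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a, x_row = vals[k], X[col_idx[k]]
            for j in range(n_cols):
                acc[j] += a * x_row[j]
        out.append(acc)
    return out


def tiled_spmm(row_ptr, col_idx, vals, X, tile_rows):
    """Compute Y = A @ X one row tile at a time (assumed row-block tiling)."""
    n_rows = len(row_ptr) - 1
    Y = []
    for start in range(0, n_rows, tile_rows):
        end = min(start + tile_rows, n_rows)
        # In a GPU version, only this tile's nonzeros would be copied to the
        # device here; this sketch simply processes the tile in place.
        Y.extend(spmm_csr_rows(row_ptr, col_idx, vals, X, start, end))
    return Y


# A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form, X is a 3x2 dense block.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(tiled_spmm(row_ptr, col_idx, vals, X, tile_rows=2))
# prints [[3.0, 2.0], [0.0, 3.0], [9.0, 5.0]]
```

The result does not depend on `tile_rows`; the tile size only controls how much of the matrix must be in memory at once.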
Accelerating NBODY6 with Graphics Processing Units
We describe the use of Graphics Processing Units (GPUs) for speeding up the
code NBODY6, which is widely used for direct N-body simulations. Over the
years, the nature of the direct force calculation has proved a barrier
for extending the particle number. Following an early introduction of force
polynomials and individual time-steps, the calculation cost was first reduced
by the introduction of a neighbour scheme. After a decade of GRAPE computers
which speeded up the force calculation further, we are now in the era of GPUs
where relatively small hardware systems are highly cost-effective. A
significant gain in efficiency is achieved by employing the GPU to obtain the
so-called regular force which typically involves some 99 percent of the
particles, while the remaining local forces are evaluated on the host. However,
the latter operation is performed up to 20 times more frequently and may still
account for a significant cost. This effort is reduced by parallel SSE/AVX
procedures where each interaction term is calculated using mainly single
precision. We also discuss further strategies connected with coordinate and
velocity prediction required by the integration scheme. This leaves hard
binaries and multiple close encounters which are treated by several
regularization methods. The present nbody6-GPU code is well balanced for
simulations in the particle range for a dual GPU system
attached to a standard PC. Comment: 8 pages, 3 figures, 2 tables, MNRAS accepted
NBSymple, a double parallel, symplectic N-body code running on Graphic Processing Units
We present and discuss the characteristics and performance, in terms of both
computational speed and precision, of a code that numerically integrates the
equations of motion of N 'particles' that interact via Newtonian gravitation
and move in a smooth external galactic field. The force on every particle is
evaluated by direct summation of the contributions of all the other particles
in the system, avoiding truncation error. The time integration is done with
second-order and sixth-order symplectic schemes. The code, NBSymple, has been
parallelized twice: with the Compute Unified Device Architecture (CUDA), to
make the all-pair force evaluation as fast as possible on high-performance
NVIDIA TESLA C 1060 Graphics Processing Units, and with the OpenMP application
programming interface, which distributes the O(N) computations over multiple
CPUs. The code works in either single-precision or double-precision
floating-point arithmetic. Single precision makes the best use of GPU
performance but, of course, limits the precision of the simulation in some
critical situations. We find a good compromise in using a software
reconstruction of double precision for those variables that are most critical
for the overall precision of the code. The code is available on the web site
astrowww.phys.uniroma1.it/dolcetta/nbsymple.html
Comment: 29 pages, including 9 figures. Submitted to New
Astronomy
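The combination of direct summation and a second-order symplectic scheme described above can be sketched as a kick-drift-kick leapfrog step. This is a minimal illustration, not the NBSymple implementation: it assumes G = 1, omits the external galactic field and the sixth-order scheme, runs in plain Python on the CPU, and the names `accelerations`, `leapfrog_step`, and `total_energy` are hypothetical.

```python
import math


def accelerations(pos, mass, G=1.0):
    """All-pair gravitational accelerations by direct summation, O(N^2)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = d[0] ** 2 + d[1] ** 2 + d[2] ** 2
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc


def leapfrog_step(pos, vel, mass, dt):
    """One kick-drift-kick leapfrog step (second-order, symplectic)."""
    acc = accelerations(pos, mass)
    vel_half = [[v[k] + 0.5 * dt * a[k] for k in range(3)]
                for v, a in zip(vel, acc)]
    new_pos = [[p[k] + dt * v[k] for k in range(3)]
               for p, v in zip(pos, vel_half)]
    new_acc = accelerations(new_pos, mass)
    new_vel = [[v[k] + 0.5 * dt * a[k] for k in range(3)]
               for v, a in zip(vel_half, new_acc)]
    return new_pos, new_vel


def total_energy(pos, vel, mass, G=1.0):
    """Kinetic plus pairwise potential energy, used to monitor drift."""
    kinetic = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(mass, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            r = math.sqrt(sum((pos[j][k] - pos[i][k]) ** 2 for k in range(3)))
            potential -= G * mass[i] * mass[j] / r
    return kinetic + potential
```

A simple check is to integrate two equal masses on a circular orbit and confirm that `total_energy` stays close to its initial value, since a symplectic scheme keeps the energy error bounded rather than letting it grow steadily.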
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current successes and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Astrophysical Supercomputing with GPUs: Critical Decisions for Early Adopters
General purpose computing on graphics processing units (GPGPU) is
dramatically changing the landscape of high performance computing in astronomy.
In this paper, we identify and investigate several key decision areas, with a
goal of simplifying the early adoption of GPGPU in astronomy. We consider the
merits of OpenCL as an open standard in order to reduce risks associated with
coding in a native, vendor-specific programming environment, and present a GPU
programming philosophy based on using brute force solutions. We assert that
effective use of new GPU-based supercomputing facilities will require a change
in approach from astronomers. This will likely include improved programming
training, an increased need for software development best-practice through the
use of profiling and related optimisation tools, and a greater reliance on
third-party code libraries. As with any new technology, those willing to take
the risks, and make the investment of time and effort to become early adopters
of GPGPU in astronomy, stand to reap great benefits. Comment: 13 pages, 5 figures, accepted for publication in PAS
C Language Extensions for Hybrid CPU/GPU Programming with StarPU
Modern platforms used for high-performance computing (HPC) include machines
with both general-purpose CPUs, and "accelerators", often in the form of
graphical processing units (GPUs). StarPU is a C library to exploit such
platforms. It provides users with ways to define "tasks" to be executed on CPUs
or GPUs, along with the dependencies among them, and it automatically
schedules them over all the available processing units. In doing so, it also
relieves programmers from the need to know the underlying architecture details:
it adapts to the available CPUs and GPUs, and automatically transfers data
between main memory and GPUs as needed. While StarPU's approach is successful
at addressing run-time scheduling issues, being a C library makes for a poor
and error-prone programming interface. This paper presents an effort started in
2011 to promote some of the concepts exported by the library as C language
constructs, by means of an extension of the GCC compiler suite. Our main
contribution is the design and implementation of language extensions that map
to StarPU's task programming paradigm. We argue that the proposed extensions
make it easier to get started with StarPU, eliminate errors that can occur when
using the C library, and help diagnose possible mistakes. We conclude with
future work.