Multi-GPU Acceleration of the iPIC3D Implicit Particle-in-Cell Code
iPIC3D is a widely used massively parallel Particle-in-Cell code for the
simulation of space plasmas. However, its current implementation does not
support execution on multiple GPUs. In this paper, we describe the porting of
the iPIC3D particle mover to GPUs and the optimization steps taken to increase the
performance and parallel scaling on multiple GPUs. We analyze the strong
scaling of the mover on two GPU clusters and evaluate its performance and
acceleration. The optimized GPU version, which uses pinned memory and
asynchronous data prefetching, outperforms the corresponding CPU version by
5-10x on two different systems equipped with NVIDIA K80 and V100 GPUs.
Comment: Accepted for publication in ICCS 201
sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems
Large-scale simulations of plasmas are essential for advancing our
understanding of fusion devices, space, and astrophysical systems.
Particle-in-Cell (PIC) codes have demonstrated their success in simulating
numerous plasma phenomena on HPC systems. Today, flagship supercomputers
feature multiple GPUs per compute node to achieve unprecedented computing power
at high power efficiency. PIC codes require new algorithm design and
implementation for exploiting such accelerated platforms. In this work, we
design and optimize a three-dimensional implicit PIC code, called sputniPIC, to
run on a general multi-GPU compute node. We introduce a particle decomposition
data layout, in contrast to domain decomposition on CPU-based implementations,
to use particle batches for overlapping communication and computation on GPUs.
sputniPIC also natively supports different precision representations to achieve
speed up on hardware that supports reduced precision. We validate sputniPIC
through the well-known GEM challenge and provide performance analysis. We test
sputniPIC on three multi-GPU platforms and report a 200-800x performance
improvement with respect to the CPU OpenMP version of sputniPIC. We
show that reduced precision could further improve performance by 45% to 80% on
the three platforms. Because of these performance improvements, on a single
node with multiple GPUs, sputniPIC enables large-scale three-dimensional PIC
simulations that were only possible using clusters.
Comment: Accepted for publication at the 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2020)
Preparing HPC Applications for the Exascale Era: A Decoupling Strategy
Production-quality parallel applications are often a mixture of diverse
operations, such as computation- and communication-intensive, regular and
irregular, tightly coupled and loosely linked operations. In conventional
construction of parallel applications, each process performs all the
operations, which can be inefficient and severely limit scalability,
especially at large scale. We propose a decoupling strategy to improve the
scalability of applications running on large-scale systems.
Our strategy separates application operations onto groups of processes and
enables a dataflow processing paradigm among the groups. This mechanism is
effective in reducing the impact of load imbalance and increasing parallel
efficiency by pipelining multiple operations. We provide a proof-of-concept
implementation using MPI, the de facto programming system on current
supercomputers. We demonstrate the effectiveness of this strategy by decoupling
the reduce, particle communication, halo exchange and I/O operations in a set
of scientific and data-analytics applications. A performance evaluation on
8,192 processes of a Cray XC40 supercomputer shows that the proposed approach
can achieve up to 4x performance improvement.
Comment: The 46th International Conference on Parallel Processing (ICPP-2017)