23 research outputs found
A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
We present a portable platform, called PIC_ENGINE, for accelerating
Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as
Graphic Processing Units (GPUs). The aim of this development is efficient
simulations on future exascale systems by allowing different parallelization
strategies depending on the application problem and the specific architecture.
To this end, this platform contains the basic steps of the PIC algorithm and
has been designed as a test bed for different algorithmic options and data
structures. Among the architectures that this engine can explore, particular
attention is given here to systems equipped with GPUs. The study demonstrates
that our portable PIC implementation based on the OpenACC programming model can
achieve performance closely matching theoretical predictions. Using the Cray
XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we
show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the
one on an Intel Sandybridge 8-core CPU by a factor of 3.4
Particle-In-Cell Simulation using Asynchronous Tasking
Recently, task-based programming models have emerged as a prominent
alternative among shared-memory parallel programming paradigms. Inherently
asynchronous, these models provide native support for dynamic load balancing
and incorporate data flow concepts to selectively synchronize the tasks.
However, tasking models are yet to be widely adopted by the HPC community and
their effective advantages when applied to non-trivial, real-world HPC
applications are still not well comprehended. In this paper, we study the
parallelization of a production electromagnetic particle-in-cell (EM-PIC) code
for kinetic plasma simulations exploring different strategies using
asynchronous task-based models. Our fully asynchronous implementation not only
significantly outperforms a conventional, synchronous approach but also
achieves near perfect scaling for 48 cores.Comment: To be published on the 27th European Conference on Parallel and
Distributed Computing (Euro-Par 2021
An efficient mixed-precision, hybrid CPU-GPU implementation of a fully implicit particle-in-cell algorithm
Recently, a fully implicit, energy- and charge-conserving particle-in-cell
method has been proposed for multi-scale, full-f kinetic simulations [G. Chen,
et al., J. Comput. Phys. 230,18 (2011)]. The method employs a Jacobian-free
Newton-Krylov (JFNK) solver, capable of using very large timesteps without loss
of numerical stability or accuracy. A fundamental feature of the method is the
segregation of particle-orbit computations from the field solver, while
remaining fully self-consistent. This paper describes a very efficient,
mixed-precision hybrid CPU-GPU implementation of the implicit PIC algorithm
exploiting this feature. The JFNK solver is kept on the CPU in double precision
(DP), while the implicit, charge-conserving, and adaptive particle mover is
implemented on a GPU (graphics processing unit) using CUDA in single-precision
(SP). Performance-oriented optimizations are introduced with the aid of the
roofline model. The implicit particle mover algorithm is shown to achieve up to
400 GOp/s on a Nvidia GeForce GTX580. This corresponds to 25% absolute GPU
efficiency against the peak theoretical performance, and is about 300 times
faster than an equivalent serial CPU (Intel Xeon X5460) execution. For the test
case chosen, the mixed-precision hybrid CPU-GPU solver is shown to over-perform
the DP CPU-only serial version by a factor of \sim 100, without apparent loss
of robustness or accuracy in a challenging long-timescale ion acoustic wave
simulation.Comment: 25 pages, 6 figures, submitted to J. Comput. Phy
Evaluating kernels on Xeon Phi to accelerate Gysela application
This work describes the challenges presented by porting parts ofthe Gysela
code to the Intel Xeon Phi coprocessor, as well as techniques used for
optimization, vectorization and tuning that can be applied to other
applications. We evaluate the performance of somegeneric micro-benchmark on Phi
versus Intel Sandy Bridge. Several interpolation kernels useful for the Gysela
application are analyzed and the performance are shown. Some memory-bound and
compute-bound kernels are accelerated by a factor 2 on the Phi device compared
to Sandy architecture. Nevertheless, it is hard, if not impossible, to reach a
large fraction of the peek performance on the Phi device,especially for
real-life applications as Gysela. A collateral benefit of this optimization and
tuning work is that the execution time of Gysela (using 4D advections) has
decreased on a standard architecture such as Intel Sandy Bridge.Comment: submitted to ESAIM proceedings for CEMRACS 2014 summer school version
reviewe
Particle-in-cell simulation using asynchronous tasking
Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel programming paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data flow concepts to selectively synchronize the tasks. However, tasking models are yet to be widely adopted by the HPC community and their effective advantages when applied to non-trivial, real-world HPC applications are still not well comprehended. In this paper, we study the parallelization of a production electromagnetic particle-in-cell (EM-PIC) code for kinetic plasma simulations exploring different strategies using asynchronous task-based models. Our fully asynchronous implementation not only significantly outperforms a conventional, synchronous approach but also achieves near perfect scaling for 48 cores.Peer ReviewedPostprint (author's final draft
Implementation of 2D Domain Decomposition in the UCAN Gyrokinetic Particle-in-Cell Code and Resulting Performance of UCAN2
The massively parallel, nonlinear, three-dimensional (3D), toroidal, electrostatic, gyrokinetic, particle-in-cell (PIC), Cartesian geometry UCAN code, with particle ions and adiabatic electrons, has been successfully exercised to identify non-diffusive transport characteristics in present day tokamak discharges. The limitation in applying UCAN to larger scale discharges is the 1D domain decomposition in the toroidal (or z-) direction for massively parallel implementation using MPI which has restricted the calculations to a few hundred ion Larmor radii or gyroradii per plasma minor radius. To exceed these sizes, we have implemented 2D domain decomposition in UCAN with the addition of the y-direction to the processor mix. This has been facilitated by use of relevant components in the P2LIB library of field and particle management routines developed for UCLA's UPIC Framework of conventional PIC codes. The gyro-averaging specific to gyrokinetic codes is simplified by the use of replicated arrays for efficient charge accumulation and force deposition. The 2D domain-decomposed UCAN2 code reproduces the original 1D domain nonlinear results within round-off. Benchmarks of UCAN2 on the Cray XC30 Edison at NERSC demonstrate ideal scaling when problem size is increased along with processor number up to the largest power of 2 available, namely 131,072 processors. These particle weak scaling benchmarks also indicate that the 1 nanosecond per particle per time step and 1 TFlops barriers are easily broken by UCAN2 with 1 billion particles or more and 2000 or more processors.This work was supported in part in the USA by Grant No. DE-FG02-04ER54741 to the University of Alaska, Fairbanks, AK, from the Office of Fusion Energy Sciences, Office of Science, United States Department of Energy. It was also supported in part at Universidad Carlos III, Madrid, Spain, by Spanish National Project No. ENE2009-12213-C03-03. This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. It also took advantage of resources at the Barcelona Supercomputing Center (BSC), Centro Nacional de Supercomputación, Barcelona, Spain. One of us (Leboeuf) would particularly like to thank David Vicente from BSC and Zhengji Zhao from NERSC for their help in the porting, debugging, and optimization of UCAN2 on the mainframes at their respective centers
A massively parallel semi-Lagrangian solver for the six-dimensional Vlasov-Poisson equation
This paper presents an optimized and scalable semi-Lagrangian solver for the
Vlasov-Poisson system in six-dimensional phase space. Grid-based solvers of the
Vlasov equation are known to give accurate results. At the same time, these
solvers are challenged by the curse of dimensionality resulting in very high
memory requirements, and moreover, requiring highly efficient parallelization
schemes. In this paper, we consider the 6d Vlasov-Poisson problem discretized
by a split-step semi-Lagrangian scheme, using successive 1d interpolations on
1d stripes of the 6d domain. Two parallelization paradigms are compared, a
remapping scheme and a classical domain decomposition approach applied to the
full 6d problem. From numerical experiments, the latter approach is found to be
superior in the massively parallel case in various respects. We address the
challenge of artificial time step restrictions due to the decomposition of the
domain by introducing a blocked one-sided communication scheme for the purely
electrostatic case and a rotating mesh for the case with a constant magnetic
field. In addition, we propose a pipelining scheme that enables to hide the
costs for the halo communication between neighbor processes efficiently behind
useful computation. Parallel scalability on up to 65k processes is demonstrated
for benchmark problems on a supercomputer
Master of Science
thesisTo address the need of understanding and optimizing the performance of complex applications and achieving sustained application performance across different architectures, we need performance models and tools that could quantify the theoretical performance and the resultant gap between theoretical and observed performance. This thesis proposes a benchmark-driven Roofline Model Toolkit to provide theoretical and achievable performance, and their resultant gap for multicore, manycore, and accelerated architectures. Roofline micro benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these micro benchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism(TLP), instruction-level parallelism(ILP), and explicit Single Instruction, Multiple Data(SIMD) parallelism, measured in the context of the compilers and runtime environment on the target architecture. We also developed benchmarks to explore detailed memory subsystems behaviors and evaluate parallelization overhead. Beyond on-chip performance, we measure sustained Peripheral Component Interconnect Express(PCIe) throughput with four Graphics Processing Unit(GPU) memory managed mechanisms. By combining results from the architecture characterization with the Roofline Model based solely on architectural specification, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline Model when run on a Blue Gene/Q architecture