6 research outputs found
Particle-in-Cell algorithms for emerging computer architectures
AbstractWe have designed Particle-in-Cell algorithms for emerging architectures. These algorithms share a common approach, using fine-grained tiles, but different implementations depending on the architecture. On the GPU, there were two different implementations, one with atomic operations and one with no data collisions, using CUDA C and Fortran. Speedups up to about 50 compared to a single core of the Intel i7 processor have been achieved. There was also an implementation for traditional multi-core processors using OpenMP which achieved high parallel efficiency. We believe that this approach should work for other emerging designs such as Intel Phi coprocessor from the Intel MIC architecture
A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
We present a portable platform, called PIC_ENGINE, for accelerating
Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as
Graphic Processing Units (GPUs). The aim of this development is efficient
simulations on future exascale systems by allowing different parallelization
strategies depending on the application problem and the specific architecture.
To this end, this platform contains the basic steps of the PIC algorithm and
has been designed as a test bed for different algorithmic options and data
structures. Among the architectures that this engine can explore, particular
attention is given here to systems equipped with GPUs. The study demonstrates
that our portable PIC implementation based on the OpenACC programming model can
achieve performance closely matching theoretical predictions. Using the Cray
XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we
show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the
one on an Intel Sandybridge 8-core CPU by a factor of 3.4
An efficient mixed-precision, hybrid CPU-GPU implementation of a fully implicit particle-in-cell algorithm
Recently, a fully implicit, energy- and charge-conserving particle-in-cell
method has been proposed for multi-scale, full-f kinetic simulations [G. Chen,
et al., J. Comput. Phys. 230,18 (2011)]. The method employs a Jacobian-free
Newton-Krylov (JFNK) solver, capable of using very large timesteps without loss
of numerical stability or accuracy. A fundamental feature of the method is the
segregation of particle-orbit computations from the field solver, while
remaining fully self-consistent. This paper describes a very efficient,
mixed-precision hybrid CPU-GPU implementation of the implicit PIC algorithm
exploiting this feature. The JFNK solver is kept on the CPU in double precision
(DP), while the implicit, charge-conserving, and adaptive particle mover is
implemented on a GPU (graphics processing unit) using CUDA in single-precision
(SP). Performance-oriented optimizations are introduced with the aid of the
roofline model. The implicit particle mover algorithm is shown to achieve up to
400 GOp/s on a Nvidia GeForce GTX580. This corresponds to 25% absolute GPU
efficiency against the peak theoretical performance, and is about 300 times
faster than an equivalent serial CPU (Intel Xeon X5460) execution. For the test
case chosen, the mixed-precision hybrid CPU-GPU solver is shown to over-perform
the DP CPU-only serial version by a factor of \sim 100, without apparent loss
of robustness or accuracy in a challenging long-timescale ion acoustic wave
simulation.Comment: 25 pages, 6 figures, submitted to J. Comput. Phy
Recommended from our members
Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms
The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3-4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures. © 2011 Elsevier B.V. All rights reserved
Recommended from our members
Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms
The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3-4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures. © 2011 Elsevier B.V. All rights reserved