
    Particle-in-Cell algorithms for emerging computer architectures

    We have designed Particle-in-Cell (PIC) algorithms for emerging architectures. These algorithms share a common approach based on fine-grained tiles, but differ in implementation depending on the architecture. On the GPU there are two implementations, one using atomic operations and one avoiding data collisions entirely, written in CUDA C and Fortran; speedups of up to about 50 over a single core of an Intel i7 processor have been achieved. There is also an implementation for traditional multi-core processors using OpenMP, which achieves high parallel efficiency. We believe this approach should also work for other emerging designs such as the Intel Phi coprocessor from the Intel MIC architecture.
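
    As a concrete illustration of the tile-based approach with atomic operations, the sketch below shows a minimal CUDA C charge-deposition kernel. It is not taken from the paper's code: all names (TILE_NX, tile_first, and so on) are illustrative, particles are assumed pre-sorted into tiles, and nearest-grid-point weighting is used for brevity.

    #include <cuda_runtime.h>

    #define TILE_NX 16
    #define TILE_NY 16

    // One thread block owns one tile; charge is accumulated in shared
    // memory so atomic collisions stay on-chip, then flushed to the global
    // rho array. Particles are assumed pre-sorted by tile (tile_first /
    // tile_count), with nearest-grid-point weighting for brevity.
    __global__ void deposit_charge(const float *px, const float *py,
                                   const float *q, const int *tile_first,
                                   const int *tile_count, float *rho,
                                   int nx, int tiles_x)
    {
        __shared__ float rho_tile[TILE_NY][TILE_NX];
        float *flat = (float *)rho_tile;

        for (int i = threadIdx.x; i < TILE_NX * TILE_NY; i += blockDim.x)
            flat[i] = 0.0f;                      // zero the tile accumulator
        __syncthreads();

        int tile = blockIdx.x;                   // one block per tile
        int x0 = (tile % tiles_x) * TILE_NX;     // tile origin on the grid
        int y0 = (tile / tiles_x) * TILE_NY;

        int first = tile_first[tile], count = tile_count[tile];
        for (int p = first + threadIdx.x; p < first + count; p += blockDim.x) {
            int ix = (int)px[p] - x0;            // cell index within the tile
            int iy = (int)py[p] - y0;
            atomicAdd(&rho_tile[iy][ix], q[p]);  // collisions resolved on-chip
        }
        __syncthreads();

        for (int i = threadIdx.x; i < TILE_NX * TILE_NY; i += blockDim.x) {
            int ix = i % TILE_NX, iy = i / TILE_NX;
            atomicAdd(&rho[(y0 + iy) * nx + (x0 + ix)], flat[i]); // flush tile
        }
    }

    Keeping the atomics in shared memory is what makes this variant cheap: collisions between threads of one block are resolved on-chip, and only one global atomic per tile cell is needed at the end.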

    A portable platform for accelerated PIC codes and its application to GPUs using OpenACC

    We present a portable platform, called PIC_ENGINE, for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphics Processing Units (GPUs). The aim of this development is to enable efficient simulations on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, the platform contains the basic steps of the PIC algorithm and has been designed as a test bed for different algorithmic options and data structures. Among the architectures this engine can explore, particular attention is given here to systems equipped with GPUs. The study demonstrates that our portable PIC implementation based on the OpenACC programming model can achieve performance closely matching theoretical predictions. Using the Cray XC30 system Piz Daint at the Swiss National Supercomputing Centre (CSCS), we show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the same code on an 8-core Intel Sandy Bridge CPU by a factor of 3.4.
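
    To make the OpenACC approach concrete, here is a minimal sketch, not PIC_ENGINE's actual interface, of how a PIC particle-push loop can be offloaded with a single directive. The field gather is reduced to a 1D nearest-grid-point lookup for brevity, and all names are illustrative.

    /* Minimal OpenACC sketch of a PIC particle-push step. A single directive
     * offloads the loop to a GPU when compiled with, e.g., nvc -acc, and
     * falls back to the CPU otherwise; this same-source portability is the
     * property the abstract emphasizes. */
    void push_particles(long np, double *restrict x, double *restrict v,
                        const double *restrict Ex, int nx,
                        double dx, double qm, double dt)
    {
        double L = nx * dx;                      /* periodic domain length */
        #pragma acc parallel loop copyin(Ex[0:nx]) copy(x[0:np], v[0:np])
        for (long i = 0; i < np; i++) {
            int cell = (int)(x[i] / dx);         /* nearest grid point     */
            v[i] += qm * Ex[cell] * dt;          /* accelerate             */
            x[i] += v[i] * dt;                   /* advance position       */
            if (x[i] < 0.0) x[i] += L;           /* periodic wrap-around   */
            if (x[i] >= L)  x[i] -= L;
        }
    }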

    sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems

    Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices and of space and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation to exploit such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle-decomposition data layout, in contrast to the domain decomposition used in CPU-based implementations, so that particle batches can be used to overlap communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speedups on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide a performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement over the CPU OpenMP version of sputniPIC. We show that reduced precision can further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, sputniPIC enables, on a single node with multiple GPUs, large-scale three-dimensional PIC simulations that were previously possible only on clusters.

    Comment: Accepted for publication at the 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2020).
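
    The batching idea the abstract describes, overlapping particle transfers with computation, can be sketched with CUDA streams as below. This is an illustrative reconstruction, not sputniPIC's actual code: push_batch is a stand-in for the real particle mover, and the host arrays are assumed to be pinned (allocated with cudaMallocHost), since otherwise the asynchronous copies cannot overlap with kernels.

    #include <cuda_runtime.h>

    // Stand-in for the real particle mover: advances one batch of particles.
    __global__ void push_batch(float *x, float *v, int n, float dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += v[i] * dt;
    }

    // Splits the particle arrays into batches and cycles them over a few
    // CUDA streams, so the host-device transfer of one batch overlaps with
    // the kernel of another. h_x and h_v are assumed pinned.
    void process_batches(float *h_x, float *h_v, int n_total, int batch, float dt)
    {
        enum { NSTREAMS = 4 };
        cudaStream_t s[NSTREAMS];
        float *d_x[NSTREAMS], *d_v[NSTREAMS];
        for (int k = 0; k < NSTREAMS; k++) {
            cudaStreamCreate(&s[k]);
            cudaMalloc(&d_x[k], batch * sizeof(float));
            cudaMalloc(&d_v[k], batch * sizeof(float));
        }
        int k = 0;
        for (int off = 0; off < n_total; off += batch, k = (k + 1) % NSTREAMS) {
            int n = (off + batch <= n_total) ? batch : n_total - off;
            // Copy-in, kernel, and copy-out are all enqueued on stream k;
            // batches on different streams overlap transfer with compute.
            cudaMemcpyAsync(d_x[k], h_x + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, s[k]);
            cudaMemcpyAsync(d_v[k], h_v + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, s[k]);
            push_batch<<<(n + 255) / 256, 256, 0, s[k]>>>(d_x[k], d_v[k], n, dt);
            cudaMemcpyAsync(h_x + off, d_x[k], n * sizeof(float),
                            cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();                 // wait for all batches
        for (int j = 0; j < NSTREAMS; j++) {
            cudaFree(d_x[j]); cudaFree(d_v[j]); cudaStreamDestroy(s[j]);
        }
    }

    Reusing a stream's device buffers across batches is safe here because operations enqueued on one stream execute in order; only work on different streams overlaps.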

    Implementation of 2D Domain Decomposition in the UCAN Gyrokinetic Particle-in-Cell Code and Resulting Performance of UCAN2

    The massively parallel, nonlinear, three-dimensional (3D), toroidal, electrostatic, gyrokinetic, particle-in-cell (PIC), Cartesian-geometry UCAN code, with particle ions and adiabatic electrons, has been successfully exercised to identify non-diffusive transport characteristics in present-day tokamak discharges. The limitation in applying UCAN to larger-scale discharges is the 1D domain decomposition in the toroidal (or z-) direction used for massively parallel implementation with MPI, which has restricted calculations to a few hundred ion Larmor radii (gyroradii) per plasma minor radius. To exceed these sizes, we have implemented 2D domain decomposition in UCAN by adding the y-direction to the processor mix. This has been facilitated by the use of relevant components in the P2LIB library of field and particle management routines developed for UCLA's UPIC Framework of conventional PIC codes. The gyro-averaging specific to gyrokinetic codes is simplified by the use of replicated arrays for efficient charge accumulation and force deposition. The 2D domain-decomposed UCAN2 code reproduces the original 1D-domain nonlinear results to within round-off. Benchmarks of UCAN2 on the Cray XC30 Edison at NERSC demonstrate ideal scaling when the problem size is increased along with the processor count, up to the largest power of 2 available, namely 131,072 processors. These weak-scaling particle benchmarks also indicate that the 1 nanosecond per particle per time step and 1 TFlops barriers are easily broken by UCAN2 with 1 billion or more particles on 2000 or more processors.

    This work was supported in part in the USA by Grant No. DE-FG02-04ER54741 to the University of Alaska, Fairbanks, AK, from the Office of Fusion Energy Sciences, Office of Science, United States Department of Energy. It was also supported in part at Universidad Carlos III, Madrid, Spain, by Spanish National Project No. ENE2009-12213-C03-03. This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. It also took advantage of resources at the Barcelona Supercomputing Center (BSC), Centro Nacional de Supercomputación, Barcelona, Spain. One of us (Leboeuf) would particularly like to thank David Vicente from BSC and Zhengji Zhao from NERSC for their help in porting, debugging, and optimizing UCAN2 on the mainframes at their respective centers.
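
    For readers unfamiliar with the mechanics of 2D domain decomposition, the sketch below sets up a periodic (y, z) Cartesian process grid with plain MPI and queries each rank's coordinates and neighbors, which is the information that guard-cell exchange and particle passing require. UCAN2 itself relies on the P2LIB routines from UCLA's UPIC Framework, so this is an illustrative stand-in rather than UCAN2 source.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int size, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int dims[2] = {0, 0}, periods[2] = {1, 1};   /* periodic torus       */
        MPI_Dims_create(size, 2, dims);              /* factor size into y*z */

        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        MPI_Comm_rank(cart, &rank);                  /* ranks may be reordered */

        /* Each rank learns its (y, z) coordinates and its four neighbors. */
        int coords[2], ylo, yhi, zlo, zhi;
        MPI_Cart_coords(cart, rank, 2, coords);
        MPI_Cart_shift(cart, 0, 1, &ylo, &yhi);      /* y-neighbors */
        MPI_Cart_shift(cart, 1, 1, &zlo, &zhi);      /* z-neighbors */

        printf("rank %d -> (y=%d, z=%d); y nbrs %d/%d, z nbrs %d/%d\n",
               rank, coords[0], coords[1], ylo, yhi, zlo, zhi);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }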