460 research outputs found

    A portable platform for accelerated PIC codes and its application to GPUs using OpenACC

    Get PDF
    We present a portable platform, called PIC_ENGINE, for accelerating Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as Graphic Processing Units (GPUs). The aim of this development is efficient simulations on future exascale systems by allowing different parallelization strategies depending on the application problem and the specific architecture. To this end, this platform contains the basic steps of the PIC algorithm and has been designed as a test bed for different algorithmic options and data structures. Among the architectures that this engine can explore, particular attention is given here to systems equipped with GPUs. The study demonstrates that our portable PIC implementation based on the OpenACC programming model can achieve performance closely matching theoretical predictions. Using the Cray XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the one on an Intel Sandybridge 8-core CPU by a factor of 3.4

    Master of Science

    Get PDF
    thesisTo address the need of understanding and optimizing the performance of complex applications and achieving sustained application performance across different architectures, we need performance models and tools that could quantify the theoretical performance and the resultant gap between theoretical and observed performance. This thesis proposes a benchmark-driven Roofline Model Toolkit to provide theoretical and achievable performance, and their resultant gap for multicore, manycore, and accelerated architectures. Roofline micro benchmarks are specialized to quantify the behavior of different architectural features. Compared to previous work on performance characterization, these micro benchmarks focus on capturing the performance of each level of the memory hierarchy, along with thread-level parallelism(TLP), instruction-level parallelism(ILP), and explicit Single Instruction, Multiple Data(SIMD) parallelism, measured in the context of the compilers and runtime environment on the target architecture. We also developed benchmarks to explore detailed memory subsystems behaviors and evaluate parallelization overhead. Beyond on-chip performance, we measure sustained Peripheral Component Interconnect Express(PCIe) throughput with four Graphics Processing Unit(GPU) memory managed mechanisms. By combining results from the architecture characterization with the Roofline Model based solely on architectural specification, this work offers insights for performance prediction of current and future architectures and their software systems. To that end, we instrument three applications and plot their resultant performance on the corresponding Roofline Model when run on a Blue Gene/Q architecture

    Particle-In-Cell Simulation using Asynchronous Tasking

    Get PDF
    Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel programming paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data flow concepts to selectively synchronize the tasks. However, tasking models are yet to be widely adopted by the HPC community and their effective advantages when applied to non-trivial, real-world HPC applications are still not well comprehended. In this paper, we study the parallelization of a production electromagnetic particle-in-cell (EM-PIC) code for kinetic plasma simulations exploring different strategies using asynchronous task-based models. Our fully asynchronous implementation not only significantly outperforms a conventional, synchronous approach but also achieves near perfect scaling for 48 cores.Comment: To be published on the 27th European Conference on Parallel and Distributed Computing (Euro-Par 2021

    Particle-in-cell simulation using asynchronous tasking

    Get PDF
    Recently, task-based programming models have emerged as a prominent alternative among shared-memory parallel programming paradigms. Inherently asynchronous, these models provide native support for dynamic load balancing and incorporate data flow concepts to selectively synchronize the tasks. However, tasking models are yet to be widely adopted by the HPC community and their effective advantages when applied to non-trivial, real-world HPC applications are still not well comprehended. In this paper, we study the parallelization of a production electromagnetic particle-in-cell (EM-PIC) code for kinetic plasma simulations exploring different strategies using asynchronous task-based models. Our fully asynchronous implementation not only significantly outperforms a conventional, synchronous approach but also achieves near perfect scaling for 48 cores.Peer ReviewedPostprint (author's final draft

    A performance portable implementation of the semi-Lagrangian algorithm in six dimensions

    Full text link
    In this paper, we describe our approach to develop a simulation software application for the fully kinetic Vlasov equation which will be used to explore physics beyond the gyrokinetic model. Simulating the fully kinetic Vlasov equation requires efficient utilization of compute and storage capabilities due to the high dimensionality of the problem. In addition, the implementation needs to be extensibility regarding the physical model and flexible regarding the hardware for production runs. We start on the algorithmic background to simulate the 6-D Vlasov equation using a semi-Lagrangian algorithm. The performance portable software stack, which enables production runs on pure CPU as well as AMD or Nvidia GPU accelerated nodes, is presented. The extensibility of our implementation is guaranteed through the described software architecture of the main kernel, which achieves a memory bandwidth of almost 500 GB/s on a V100 Nvidia GPU and around 100 GB/s on an Intel Xeon Gold CPU using a single code base. We provide performance data on multiple node level architectures discussing utilized and further available hardware capabilities. Finally, the network communication bottleneck of 6-D grid based algorithms is quantified. A verification of physics beyond gyrokinetic theory for the example of ion Bernstein waves concludes the work

    An efficient mixed-precision, hybrid CPU-GPU implementation of a fully implicit particle-in-cell algorithm

    Full text link
    Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been proposed for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230,18 (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver, capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle-orbit computations from the field solver, while remaining fully self-consistent. This paper describes a very efficient, mixed-precision hybrid CPU-GPU implementation of the implicit PIC algorithm exploiting this feature. The JFNK solver is kept on the CPU in double precision (DP), while the implicit, charge-conserving, and adaptive particle mover is implemented on a GPU (graphics processing unit) using CUDA in single-precision (SP). Performance-oriented optimizations are introduced with the aid of the roofline model. The implicit particle mover algorithm is shown to achieve up to 400 GOp/s on a Nvidia GeForce GTX580. This corresponds to 25% absolute GPU efficiency against the peak theoretical performance, and is about 300 times faster than an equivalent serial CPU (Intel Xeon X5460) execution. For the test case chosen, the mixed-precision hybrid CPU-GPU solver is shown to over-perform the DP CPU-only serial version by a factor of \sim 100, without apparent loss of robustness or accuracy in a challenging long-timescale ion acoustic wave simulation.Comment: 25 pages, 6 figures, submitted to J. Comput. Phy
    corecore