A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
We present a portable platform, called PIC_ENGINE, for accelerating
Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as
Graphic Processing Units (GPUs). The aim of this development is to enable
efficient simulations on future exascale systems by allowing different
parallelization strategies depending on the application problem and the
specific architecture.
To this end, this platform contains the basic steps of the PIC algorithm and
has been designed as a test bed for different algorithmic options and data
structures. Among the architectures that this engine can explore, particular
attention is given here to systems equipped with GPUs. The study demonstrates
that our portable PIC implementation based on the OpenACC programming model can
achieve performance closely matching theoretical predictions. Using the Cray
XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we
show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the
same code running on an 8-core Intel Sandy Bridge CPU by a factor of 3.4.
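As a rough illustration of the programming model the abstract refers to, the hedged C sketch below shows an OpenACC-offloaded particle push; the loop body, array names, and data clauses are placeholders, not the PIC_ENGINE source.

```c
/* Minimal sketch (not PIC_ENGINE): an OpenACC-annotated particle push of the
 * kind a portable PIC engine would offload to a GPU. Names are illustrative. */
void push_particles(long np, float dt,
                    float *restrict x, float *restrict v,
                    const float *restrict efield)
{
    /* Each particle is independent, so the loop maps directly onto GPU threads. */
    #pragma acc parallel loop copy(x[0:np], v[0:np]) copyin(efield[0:np])
    for (long i = 0; i < np; ++i) {
        v[i] += dt * efield[i];   /* accelerate (placeholder field gather) */
        x[i] += dt * v[i];        /* drift */
    }
}
```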
Particle-In-Cell Simulation using Asynchronous Tasking
Recently, task-based programming models have emerged as a prominent
alternative among shared-memory parallel programming paradigms. Inherently
asynchronous, these models provide native support for dynamic load balancing
and incorporate data flow concepts to selectively synchronize the tasks.
However, tasking models are yet to be widely adopted by the HPC community and
their effective advantages when applied to non-trivial, real-world HPC
applications are still not well understood. In this paper, we study the
parallelization of a production electromagnetic particle-in-cell (EM-PIC) code
for kinetic plasma simulations exploring different strategies using
asynchronous task-based models. Our fully asynchronous implementation not only
significantly outperforms a conventional, synchronous approach but also
achieves near-perfect scaling on 48 cores.
Comment: To be published at the 27th European Conference on Parallel and
Distributed Computing (Euro-Par 2021).
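A minimal, hedged illustration of the kind of asynchronous tasking the paper explores, using OpenMP tasks with depend clauses as one concrete task-based model; the tile decomposition and stub kernels are illustrative, not the production EM-PIC code.

```c
#include <stdio.h>

/* Placeholder kernels standing in for the real PIC stages. */
static void advance_particles(int tile) { printf("push tile %d\n", tile); }
static void deposit_current(int tile)   { printf("deposit tile %d\n", tile); }
static void solve_fields(void)          { printf("field solve\n"); }

int main(void)
{
    enum { NTILES = 8 };
    int particles[NTILES] = {0}, currents[NTILES] = {0};

    #pragma omp parallel
    #pragma omp single
    {
        for (int t = 0; t < NTILES; ++t) {
            /* Particle push and current deposit of a tile form a chain,
             * expressed with data-flow (depend) clauses instead of barriers. */
            #pragma omp task depend(out: particles[t])
            advance_particles(t);

            #pragma omp task depend(in: particles[t]) depend(out: currents[t])
            deposit_current(t);
        }
        /* The field solve runs once every tile's deposit task has finished. */
        #pragma omp taskwait
        solve_fields();
    }
    return 0;
}
```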
XcalableMP PGAS Programming Language
XcalableMP is a directive-based parallel programming language based on Fortran and C, supporting a Partitioned Global Address Space (PGAS) model for distributed-memory parallel systems. This open access book presents the XcalableMP language, from its programming model and basic concepts to the experience and performance of applications written in XcalableMP. XcalableMP was adopted as a parallel programming language project in the FLAGSHIP 2020 project, which developed the Japanese flagship supercomputer Fugaku, in order to improve the productivity of parallel programming. XcalableMP is now available on Fugaku, and its performance is enhanced by the Fugaku interconnect, Tofu-D. The global-view programming model of XcalableMP, inherited from High Performance Fortran (HPF), provides an easy and useful way to parallelize data-parallel programs with directives for distributed global arrays, work distribution, and shadow communication. The local-view programming model adopts coarray notation from Coarray Fortran (CAF) to describe explicit communication in a PGAS model. The language specification was designed and proposed by the XcalableMP Specification Working Group organized in the PC Consortium, Japan. The Omni XcalableMP compiler is a production-level reference implementation of the XcalableMP compiler for C and Fortran 2008, developed by RIKEN CCS and the University of Tsukuba. XcalableMP programs have been run on Fugaku as well as on the K computer. A performance study showed that XcalableMP achieves scalable performance comparable to the Message Passing Interface (MPI) version with a clean and easy-to-understand programming style requiring little effort.
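A hedged, minimal sketch of the global-view model described above, in XcalableMP's C form: a block-distributed array with a shadow region, a halo exchange, and a work-mapped loop. The directive spellings follow the XcalableMP specification as recalled here and should be checked against it; the stencil itself is illustrative.

```c
#define N 1024

#pragma xmp nodes p[4]
#pragma xmp template t[N]
#pragma xmp distribute t[block] onto p

double a[N], b[N];
#pragma xmp align a[i] with t[i]
#pragma xmp align b[i] with t[i]
#pragma xmp shadow a[1:1]

int main(void)
{
    /* Exchange the one-element halo (shadow) of a between neighbouring nodes. */
    #pragma xmp reflect (a)

    /* The loop directive maps iterations onto the template distribution. */
    #pragma xmp loop on t[i]
    for (int i = 1; i < N - 1; i++)
        b[i] = 0.5 * (a[i - 1] + a[i + 1]);

    return 0;
}
```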
Optimised hybrid parallelisation of a CFD code on Many Core architectures
COSA is a novel CFD system based on the compressible Navier-Stokes model for
unsteady aerodynamics and aeroelasticity of fixed structures, rotary wings and
turbomachinery blades. It includes a steady, time domain, and harmonic balance
flow solver.
COSA has primarily been parallelised using MPI, but there is also a hybrid
parallelisation that adds OpenMP functionality to the MPI parallelisation.
This enables a larger number of cores to be utilised for a given simulation,
since the MPI parallelisation is limited to the number of geometric partitions
(or blocks) in the simulation, and allows multi-threaded hardware to be
exploited where appropriate. This paper outlines the work undertaken to
optimise these two parallelisation strategies, improving the efficiency of
both and therefore reducing the time required to run simulations. We also
analyse the power consumption of the code on a range of leading HPC systems to
further understand its performance.
Comment: Submitted to the SC13 conference; 10 pages with 8 figures.
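A hedged sketch of the hybrid pattern the abstract describes, not COSA itself: MPI ranks own whole geometric blocks while OpenMP threads share the cell loop within a block. Block and cell counts, and the update itself, are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

#define CELLS_PER_BLOCK 100000

static double cells[CELLS_PER_BLOCK];   /* one block's worth of cells */

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* FUNNELED is sufficient when only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank updates the cells of its own block; OpenMP splits the
     * per-block cell loop so more cores than blocks can be used. */
    #pragma omp parallel for
    for (int c = 0; c < CELLS_PER_BLOCK; ++c)
        cells[c] += 1.0;   /* placeholder flux/residual update */

    /* Inter-block halo exchange (MPI point-to-point) would go here. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("ranks: %d (one or more blocks per rank)\n", nranks);

    MPI_Finalize();
    return 0;
}
```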
An efficient mixed-precision, hybrid CPU-GPU implementation of a fully implicit particle-in-cell algorithm
Recently, a fully implicit, energy- and charge-conserving particle-in-cell
method has been proposed for multi-scale, full-f kinetic simulations [G. Chen,
et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free
Newton-Krylov (JFNK) solver, capable of using very large timesteps without loss
of numerical stability or accuracy. A fundamental feature of the method is the
segregation of particle-orbit computations from the field solver, while
remaining fully self-consistent. This paper describes a very efficient,
mixed-precision hybrid CPU-GPU implementation of the implicit PIC algorithm
exploiting this feature. The JFNK solver is kept on the CPU in double precision
(DP), while the implicit, charge-conserving, and adaptive particle mover is
implemented on a GPU (graphics processing unit) using CUDA in single precision
(SP). Performance-oriented optimizations are introduced with the aid of the
roofline model. The implicit particle mover algorithm is shown to achieve up to
400 GOp/s on an NVIDIA GeForce GTX 580. This corresponds to 25% absolute GPU
efficiency relative to the theoretical peak performance, and is about 300 times
faster than an equivalent serial CPU (Intel Xeon X5460) execution. For the test
case chosen, the mixed-precision hybrid CPU-GPU solver is shown to outperform
the DP CPU-only serial version by a factor of ~100, without apparent loss
of robustness or accuracy in a challenging long-timescale ion acoustic wave
simulation.
Comment: 25 pages, 6 figures, submitted to J. Comput. Phys.
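A hedged, host-side C sketch of the precision split described above (the paper's mover runs in CUDA on the GPU; plain C is used here only to illustrate the idea): the particle state and orbit update stay in single precision, while the moment handed back to the double-precision JFNK field solve is accumulated in double precision. All names are illustrative.

```c
#include <stddef.h>

typedef struct { float x, v; } particle_sp;   /* SP particle state */

/* Push particles in single precision and accumulate a current-like moment
 * in double precision for the DP field solver. */
double push_and_accumulate(particle_sp *p, size_t np, float efield, float dt)
{
    double current = 0.0;                      /* DP accumulator */
    for (size_t i = 0; i < np; ++i) {
        p[i].v += dt * efield;                 /* SP orbit update */
        p[i].x += dt * p[i].v;
        current += (double)p[i].v;             /* promote before summing */
    }
    return current;   /* fed back to the DP JFNK residual evaluation */
}
```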
Orthrus: A Framework for Implementing Efficient Collective I/O in Multi-core Clusters
Optimization of access patterns using collective I/O imposes the overhead of exchanging data between processes. In a multi-core-based cluster the costs of inter-node and intra-node data communication are vastly different, and this heterogeneity in the efficiency of data exchange poses both a challenge and an opportunity for implementing efficient collective I/O. The opportunity is to effectively exploit fast intra-node communication. We propose to improve communication locality for greater data-exchange efficiency. However, such an effort is at odds with improving access locality for I/O efficiency, which can also be critical to collective-I/O performance. To address this issue we propose a framework, Orthrus, that can accommodate multiple collective-I/O implementations, each optimized for some performance aspects, and dynamically select the best-performing one according to the current workload and system patterns. We have implemented Orthrus in the ROMIO library. Our experimental results with representative MPI-IO benchmarks on both a small dedicated cluster and a large production HPC system show that Orthrus can significantly improve collective I/O performance under various workloads and system scenarios.
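For context, the hedged sketch below shows the kind of collective MPI-IO call such a framework optimizes inside ROMIO; the application code itself is unchanged by Orthrus, and the file name and sizes here are illustrative.

```c
#include <mpi.h>

#define LOCAL_COUNT 1024

int main(int argc, char **argv)
{
    int rank;
    double buf[LOCAL_COUNT];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < LOCAL_COUNT; ++i)
        buf[i] = rank;                       /* rank-local data to write */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: ROMIO's two-phase I/O exchanges data between
     * processes to form large contiguous accesses; this is the exchange
     * whose intra-/inter-node cost differences Orthrus adapts to. */
    MPI_Offset offset = (MPI_Offset)rank * LOCAL_COUNT * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, LOCAL_COUNT, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```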
Photonic Interconnects Beyond High Bandwidth
The extraordinary growth of parallelism in high-performance computing requires efficient data communication to scale compute performance. Over the last decade, high-performance computing systems have used photonic links for communication with large bandwidth-distance products. Photonic interconnection networks, however, should not be a wire-for-wire replacement of their conventional electrical counterparts. Features of photonics beyond high bandwidth, such as transparent bandwidth steering, can implement important functionalities needed by applications. Conversely, application characteristics can be exploited to design better photonic interconnects. This thesis therefore explores codesign opportunities at the intersection between photonic interconnect architectures and high-performance computing applications. The key accomplishments of this thesis, ranging from system level to node level, are as follows.
Chapter 2 presents a system-level architecture that leverages photonic switching to enable a reconfigurable interconnect. The architecture, called Flexfly, reconfigures the inter-group level of the widely-used Dragonfly topology using information about the application’s communication pattern. It can steal additional direct bandwidth for communication-intensive group pairs. Simulations with applications such as GTC, Nekbone and LULESH show up to 1.8x speedup over Dragonfly paired with UGAL routing, along with halved hop count and latency for cross-group messages. To demonstrate the effectiveness of our approach, we built a 32-node Flexfly prototype using a silicon photonic switch connecting four groups and demonstrated 820 ns interconnect reconfiguration time. This is the first demonstration of silicon photonic switching and bandwidth steering in a high-performance computing cluster.
Chapter 3 extends photonic switching to the node level and presents a reconfigurable silicon photonic memory interconnect for many-core architectures. The interconnect targets important memory-access issues such as network-on-chip hot-spots and non-uniform memory access. Integrated with the processor through 2.5D/3D stacking, a fast-tunable silicon photonic memory tunnel can transparently direct traffic from any off-chip memory to any on-chip interface, thus alleviating the hot-spot and non-uniform access effects. We demonstrated the operation of our proposed architecture using a tunable laser, a 4-port silicon photonic switch (four wavelength-routed memory channels) and a 4x4 mesh network-on-chip synthesized on an FPGA. The emulated system achieves a 15-ns channel switching time. Simulations based on a 12-core, 4-memory model show that for such switching speeds the interconnect system can realize a 2x speedup for the STREAM benchmark in the hot-spot scenario and reduce the execution time of data-intensive applications such as 3D stencil and K-means clustering by 23% and 17%, respectively.
Chapter 4 explores application-level characteristics that can be exploited to hide photonic path setup delays. In view of the frequent reuse of optical circuits by many applications, we proposed a circuit-caching scheme that amortizes the setup overhead by maximizing circuit reuse. In order to improve circuit “hit” rates, we developed a reuse-distance-based replacement policy called “Farthest Next Use”. We further investigated the tradeoffs between the realized hit rate and energy consumption. Finally, we experimentally demonstrated the feasibility of the proposed concept using silicon photonic devices in an FPGA-controlled network testbed.
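A hedged sketch of how a farthest-next-use eviction choice can be computed when the upcoming circuit requests are known (in the thesis they are inferred from the application's communication pattern); the data layout is illustrative, not the testbed code.

```c
#include <stddef.h>

/* Return the index (into cached[]) of the circuit to evict: the one whose
 * next request in future[] is farthest away, or never requested again. */
size_t pick_victim(const int *cached, size_t ncached,
                   const int *future, size_t nfuture)
{
    size_t victim = 0, farthest = 0;
    for (size_t c = 0; c < ncached; ++c) {
        size_t next = nfuture;               /* sentinel: never used again */
        for (size_t t = 0; t < nfuture; ++t) {
            if (future[t] == cached[c]) { next = t; break; }
        }
        if (next >= farthest) { farthest = next; victim = c; }
    }
    return victim;
}
```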
Chapter 5 proceeds to develop an application-guided circuit-prefetch scheme. By learning temporal locality and communication patterns from upper-layer applications, the scheme not only caches a set of circuits for reuse but also proactively prefetches circuits based on predictions. We applied this technique to communication patterns from a spectrum of science and engineering applications. The results show that setup delays due to circuit misses are significantly reduced, demonstrating how the proposed technique can improve circuit switching in photonic interconnects.