A portable platform for accelerated PIC codes and its application to GPUs using OpenACC
We present a portable platform, called PIC_ENGINE, for accelerating
Particle-In-Cell (PIC) codes on heterogeneous many-core architectures such as
Graphic Processing Units (GPUs). The aim of this development is to enable
efficient simulations on future exascale systems by allowing different
parallelization strategies depending on the application problem and the
specific architecture.
To this end, this platform contains the basic steps of the PIC algorithm and
has been designed as a test bed for different algorithmic options and data
structures. Among the architectures that this engine can explore, particular
attention is given here to systems equipped with GPUs. The study demonstrates
that our portable PIC implementation based on the OpenACC programming model can
achieve performance closely matching theoretical predictions. Using the Cray
XC30 system, Piz Daint, at the Swiss National Supercomputing Centre (CSCS), we
show that PIC_ENGINE running on an NVIDIA Kepler K20X GPU can outperform the
same code running on an 8-core Intel Sandy Bridge CPU by a factor of 3.4.
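As a rough illustration of the programming model the abstract refers to, the hedged C sketch below shows an OpenACC-offloaded particle push; the loop body, array names, and data clauses are placeholders, not the PIC_ENGINE source.

```c
/* Minimal sketch (not PIC_ENGINE): an OpenACC-annotated particle push of the
 * kind a portable PIC engine would offload to a GPU. Names are illustrative. */
void push_particles(long np, float dt,
                    float *restrict x, float *restrict v,
                    const float *restrict efield)
{
    /* Each particle is independent, so the loop maps directly onto GPU threads. */
    #pragma acc parallel loop copy(x[0:np], v[0:np]) copyin(efield[0:np])
    for (long i = 0; i < np; ++i) {
        v[i] += dt * efield[i];   /* accelerate (placeholder field gather) */
        x[i] += dt * v[i];        /* drift */
    }
}
```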
Particle-In-Cell Simulation using Asynchronous Tasking
Recently, task-based programming models have emerged as a prominent
alternative among shared-memory parallel programming paradigms. Inherently
asynchronous, these models provide native support for dynamic load balancing
and incorporate data flow concepts to selectively synchronize the tasks.
However, tasking models are yet to be widely adopted by the HPC community and
their effective advantages when applied to non-trivial, real-world HPC
applications are still not well understood. In this paper, we study the
parallelization of a production electromagnetic particle-in-cell (EM-PIC) code
for kinetic plasma simulations exploring different strategies using
asynchronous task-based models. Our fully asynchronous implementation not only
significantly outperforms a conventional, synchronous approach but also
achieves near-perfect scaling on 48 cores.
Comment: To be published at the 27th European Conference on Parallel and
Distributed Computing (Euro-Par 2021).
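A minimal, hedged illustration of the kind of asynchronous tasking the paper explores, using OpenMP tasks with depend clauses as one concrete task-based model; the tile decomposition and stub kernels are illustrative, not the production EM-PIC code.

```c
#include <stdio.h>

/* Placeholder kernels standing in for the real PIC stages. */
static void advance_particles(int tile) { printf("push tile %d\n", tile); }
static void deposit_current(int tile)   { printf("deposit tile %d\n", tile); }
static void solve_fields(void)          { printf("field solve\n"); }

int main(void)
{
    enum { NTILES = 8 };
    int particles[NTILES] = {0}, currents[NTILES] = {0};

    #pragma omp parallel
    #pragma omp single
    {
        for (int t = 0; t < NTILES; ++t) {
            /* Particle push and current deposit of a tile form a chain,
             * expressed with data-flow (depend) clauses instead of barriers. */
            #pragma omp task depend(out: particles[t])
            advance_particles(t);

            #pragma omp task depend(in: particles[t]) depend(out: currents[t])
            deposit_current(t);
        }
        /* The field solve runs once every tile's deposit task has finished. */
        #pragma omp taskwait
        solve_fields();
    }
    return 0;
}
```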
XcalableMP PGAS Programming Language
XcalableMP is a directive-based parallel programming language based on Fortran and C, supporting a Partitioned Global Address Space (PGAS) model for distributed-memory parallel systems. This open access book presents the XcalableMP language, from its programming model and basic concepts to the experience and performance of applications written in XcalableMP. XcalableMP was adopted as a parallel programming language project in the FLAGSHIP 2020 project, which developed the Japanese flagship supercomputer Fugaku, in order to improve the productivity of parallel programming. XcalableMP is now available on Fugaku, and its performance is enhanced by the Fugaku interconnect, Tofu-D. The global-view programming model of XcalableMP, inherited from High Performance Fortran (HPF), provides an easy and useful way to parallelize data-parallel programs with directives for distributed global arrays, work distribution, and shadow communication. The local-view programming model adopts coarray notation from Coarray Fortran (CAF) to describe explicit communication in a PGAS model. The language specification was designed and proposed by the XcalableMP Specification Working Group organized in the PC Consortium, Japan. The Omni XcalableMP compiler is a production-level reference implementation of the XcalableMP compiler for C and Fortran 2008, developed by RIKEN CCS and the University of Tsukuba. XcalableMP programs have been run on Fugaku as well as on the K computer. A performance study showed that XcalableMP achieves scalable performance comparable to the Message Passing Interface (MPI) version with a clean and easy-to-understand programming style requiring little effort.
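A hedged, minimal sketch of the global-view model described above, in XcalableMP's C form: a block-distributed array with a shadow region, a halo exchange, and a work-mapped loop. The directive spellings follow the XcalableMP specification as recalled here and should be checked against it; the stencil itself is illustrative.

```c
#define N 1024

#pragma xmp nodes p[4]
#pragma xmp template t[N]
#pragma xmp distribute t[block] onto p

double a[N], b[N];
#pragma xmp align a[i] with t[i]
#pragma xmp align b[i] with t[i]
#pragma xmp shadow a[1:1]

int main(void)
{
    /* Exchange the one-element halo (shadow) of a between neighbouring nodes. */
    #pragma xmp reflect (a)

    /* The loop directive maps iterations onto the template distribution. */
    #pragma xmp loop on t[i]
    for (int i = 1; i < N - 1; i++)
        b[i] = 0.5 * (a[i - 1] + a[i + 1]);

    return 0;
}
```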
Optimised hybrid parallelisation of a CFD code on Many Core architectures
COSA is a novel CFD system based on the compressible Navier-Stokes model for
unsteady aerodynamics and aeroelasticity of fixed structures, rotary wings and
turbomachinery blades. It includes a steady, time domain, and harmonic balance
flow solver.
COSA has primarily been parallelised using MPI, but there is also a hybrid
parallelisation that adds OpenMP functionality to the MPI parallelisation.
This enables a larger number of cores to be utilised for a given simulation,
since the MPI parallelisation is limited to the number of geometric partitions
(or blocks) in the simulation, and allows multi-threaded hardware to be
exploited where appropriate. This paper outlines the work undertaken to
optimise these two parallelisation strategies, improving the efficiency of
both and therefore reducing the time required to run simulations. We also
analyse the power consumption of the code on a range of leading HPC systems to
further understand its performance.
Comment: Submitted to the SC13 conference; 10 pages with 8 figures.
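A hedged sketch of the hybrid pattern the abstract describes, not COSA itself: MPI ranks own whole geometric blocks while OpenMP threads share the cell loop within a block. Block and cell counts, and the update itself, are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

#define CELLS_PER_BLOCK 100000

static double cells[CELLS_PER_BLOCK];   /* one block's worth of cells */

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* FUNNELED is sufficient when only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank updates the cells of its own block; OpenMP splits the
     * per-block cell loop so more cores than blocks can be used. */
    #pragma omp parallel for
    for (int c = 0; c < CELLS_PER_BLOCK; ++c)
        cells[c] += 1.0;   /* placeholder flux/residual update */

    /* Inter-block halo exchange (MPI point-to-point) would go here. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("ranks: %d (one or more blocks per rank)\n", nranks);

    MPI_Finalize();
    return 0;
}
```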
An efficient mixed-precision, hybrid CPU-GPU implementation of a fully implicit particle-in-cell algorithm
Recently, a fully implicit, energy- and charge-conserving particle-in-cell
method has been proposed for multi-scale, full-f kinetic simulations [G. Chen,
et al., J. Comput. Phys. 230, 18 (2011)]. The method employs a Jacobian-free
Newton-Krylov (JFNK) solver, capable of using very large timesteps without loss
of numerical stability or accuracy. A fundamental feature of the method is the
segregation of particle-orbit computations from the field solver, while
remaining fully self-consistent. This paper describes a very efficient,
mixed-precision hybrid CPU-GPU implementation of the implicit PIC algorithm
exploiting this feature. The JFNK solver is kept on the CPU in double precision
(DP), while the implicit, charge-conserving, and adaptive particle mover is
implemented on a GPU (graphics processing unit) using CUDA in single precision
(SP). Performance-oriented optimizations are introduced with the aid of the
roofline model. The implicit particle mover algorithm is shown to achieve up to
400 GOp/s on an NVIDIA GeForce GTX 580. This corresponds to 25% absolute GPU
efficiency relative to the theoretical peak performance, and is about 300 times
faster than an equivalent serial CPU (Intel Xeon X5460) execution. For the test
case chosen, the mixed-precision hybrid CPU-GPU solver is shown to outperform
the DP CPU-only serial version by a factor of ~100, without apparent loss
of robustness or accuracy in a challenging long-timescale ion acoustic wave
simulation.
Comment: 25 pages, 6 figures, submitted to J. Comput. Phys.
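A hedged, host-side C sketch of the precision split described above (the paper's mover runs in CUDA on the GPU; plain C is used here only to illustrate the idea): the particle state and orbit update stay in single precision, while the moment handed back to the double-precision JFNK field solve is accumulated in double precision. All names are illustrative.

```c
#include <stddef.h>

typedef struct { float x, v; } particle_sp;   /* SP particle state */

/* Push particles in single precision and accumulate a current-like moment
 * in double precision for the DP field solver. */
double push_and_accumulate(particle_sp *p, size_t np, float efield, float dt)
{
    double current = 0.0;                      /* DP accumulator */
    for (size_t i = 0; i < np; ++i) {
        p[i].v += dt * efield;                 /* SP orbit update */
        p[i].x += dt * p[i].v;
        current += (double)p[i].v;             /* promote before summing */
    }
    return current;   /* fed back to the DP JFNK residual evaluation */
}
```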
Orthrus: A Framework for Implementing Efficient Collective I/O in Multi-core Clusters
Optimization of access patterns using collective I/O imposes the overhead of exchanging data between processes. In a multi-core-based cluster the costs of inter-node and intra-node data communication are vastly different, and this heterogeneity in the efficiency of data exchange poses both a challenge and an opportunity for implementing efficient collective I/O. The opportunity is to effectively exploit fast intra-node communication. We propose to improve communication locality for greater data-exchange efficiency. However, such an effort is at odds with improving access locality for I/O efficiency, which can also be critical to collective-I/O performance. To address this issue we propose a framework, Orthrus, that can accommodate multiple collective-I/O implementations, each optimized for some performance aspects, and dynamically select the best-performing one according to the current workload and system patterns. We have implemented Orthrus in the ROMIO library. Our experimental results with representative MPI-IO benchmarks on both a small dedicated cluster and a large production HPC system show that Orthrus can significantly improve collective I/O performance under various workloads and system scenarios.
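For context, the hedged sketch below shows the kind of collective MPI-IO call such a framework optimizes inside ROMIO; the application code itself is unchanged by Orthrus, and the file name and sizes here are illustrative.

```c
#include <mpi.h>

#define LOCAL_COUNT 1024

int main(int argc, char **argv)
{
    int rank;
    double buf[LOCAL_COUNT];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < LOCAL_COUNT; ++i)
        buf[i] = rank;                       /* rank-local data to write */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: ROMIO's two-phase I/O exchanges data between
     * processes to form large contiguous accesses; this is the exchange
     * whose intra-/inter-node cost differences Orthrus adapts to. */
    MPI_Offset offset = (MPI_Offset)rank * LOCAL_COUNT * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, LOCAL_COUNT, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```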
Photonic Interconnects Beyond High Bandwidth
The extraordinary growth of parallelism in high-performance computing requires efficient data communication to scale compute performance. Over the last decade, high-performance computing systems have used photonic links for communication with large bandwidth-distance products. Photonic interconnection networks, however, should not be a wire-for-wire replacement of their conventional electrical counterparts. Features of photonics beyond high bandwidth, such as transparent bandwidth steering, can implement important functionalities needed by applications. Conversely, application characteristics can be exploited to design better photonic interconnects. This thesis therefore explores codesign opportunities at the intersection between photonic interconnect architectures and high-performance computing applications. The key accomplishments of this thesis, ranging from system level to node level, are as follows.
Chapter 2 presents a system-level architecture that leverages photonic switching to enable a reconfigurable interconnect. The architecture, called Flexfly, reconfigures the inter-group level of the widely-used Dragonfly topology using information about the application’s communication pattern. It can steal additional direct bandwidth for communication-intensive group pairs. Simulations with applications such as GTC, Nekbone and LULESH show up to 1.8x speedup over Dragonfly paired with UGAL routing, along with halved hop count and latency for cross-group messages. To demonstrate the effectiveness of our approach, we built a 32-node Flexfly prototype using a silicon photonic switch connecting four groups and demonstrated 820 ns interconnect reconfiguration time. This is the first demonstration of silicon photonic switching and bandwidth steering in a high-performance computing cluster.
Chapter 3 extends photonic switching to the node level and presents a reconfigurable silicon photonic memory interconnect for many-core architectures. The interconnect targets important memory-access issues such as network-on-chip hot-spots and non-uniform memory access. Integrated with the processor through 2.5D/3D stacking, a fast-tunable silicon photonic memory tunnel can transparently direct traffic from any off-chip memory to any on-chip interface, thus alleviating the hot-spot and non-uniform access effects. We demonstrated the operation of our proposed architecture using a tunable laser, a 4-port silicon photonic switch (four wavelength-routed memory channels) and a 4x4 mesh network-on-chip synthesized on an FPGA. The emulated system achieves a 15-ns channel switching time. Simulations based on a 12-core, 4-memory model show that for such switching speeds the interconnect system can realize a 2x speedup for the STREAM benchmark in the hot-spot scenario and reduce the execution time of data-intensive applications such as 3D stencil and K-means clustering by 23% and 17%, respectively.
Chapter 4 explores application-level characteristics that can be exploited to hide photonic path setup delays. In view of the frequent reuse of optical circuits by many applications, we proposed a circuit-caching scheme that amortizes the setup overhead by maximizing circuit reuse. In order to improve circuit “hit” rates, we developed a reuse-distance-based replacement policy called “Farthest Next Use”. We further investigated the tradeoffs between the realized hit rate and energy consumption. Finally, we experimentally demonstrated the feasibility of the proposed concept using silicon photonic devices in an FPGA-controlled network testbed.
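A hedged sketch of how a farthest-next-use eviction choice can be computed when the upcoming circuit requests are known (in the thesis they are inferred from the application's communication pattern); the data layout is illustrative, not the testbed code.

```c
#include <stddef.h>

/* Return the index (into cached[]) of the circuit to evict: the one whose
 * next request in future[] is farthest away, or never requested again. */
size_t pick_victim(const int *cached, size_t ncached,
                   const int *future, size_t nfuture)
{
    size_t victim = 0, farthest = 0;
    for (size_t c = 0; c < ncached; ++c) {
        size_t next = nfuture;               /* sentinel: never used again */
        for (size_t t = 0; t < nfuture; ++t) {
            if (future[t] == cached[c]) { next = t; break; }
        }
        if (next >= farthest) { farthest = next; victim = c; }
    }
    return victim;
}
```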
Chapter 5 proceeds to develop an application-guided circuit-prefetch scheme. By learning temporal locality and communication patterns from upper-layer applications, the scheme not only caches a set of circuits for reuse but also proactively prefetches circuits based on predictions. We applied this technique to communication patterns from a spectrum of science and engineering applications. The results show that setup delays due to circuit misses are significantly reduced, demonstrating how the proposed technique can improve circuit switching in photonic interconnects.