Pipelining the Fast Multipole Method over a Runtime System
The fast multipole method (FMM) is a fundamental operation for the simulation
of many physical problems. The high-performance design of such methods usually
requires carefully tuning the algorithm for both the targeted physics and the
hardware. In this paper, we propose a new approach that achieves high
performance across architectures. Our method consists of expressing the FMM
algorithm as a task flow and employing a state-of-the-art runtime system,
StarPU, in order to process the tasks on the different processing units. We
carefully design the task flow, the mathematical operators, their Central
Processing Unit (CPU) and Graphics Processing Unit (GPU) implementations, as
well as scheduling schemes. We compute potentials and forces of 200 million
particles in 48.7 seconds on a homogeneous 160-core SGI Altix UV 100 and of 38
million particles in 13.34 seconds on a heterogeneous 12-core Intel Nehalem
processor enhanced with 3 Nvidia M2090 Fermi GPUs.
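As a rough illustration of the approach (not the authors' code), the pattern of declaring an FMM operator as a StarPU codelet and submitting it as a task might look like the sketch below; the `p2p_cpu` kernel, the single potentials vector, and the CPU-only registration are simplifying assumptions, whereas the paper also registers CUDA implementations.

```cpp
#include <starpu.h>
#include <vector>
#include <cstddef>

// Placeholder CPU kernel standing in for one FMM operator (e.g. the near-field P2P pass).
static void p2p_cpu(void *buffers[], void * /*cl_arg*/)
{
    float  *pot = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    size_t  n   = STARPU_VECTOR_GET_NX(buffers[0]);
    for (size_t i = 0; i < n; ++i)
        pot[i] += 1.0f;                // stand-in for the real interaction computation
}

int main()
{
    if (starpu_init(NULL) != 0) return 1;

    // A codelet groups the CPU (and, in the paper, CUDA) implementations of one task type.
    struct starpu_codelet p2p_cl;
    starpu_codelet_init(&p2p_cl);
    p2p_cl.cpu_funcs[0] = p2p_cpu;     // a cuda_funcs[0] entry would add a GPU variant
    p2p_cl.nbuffers     = 1;
    p2p_cl.modes[0]     = STARPU_RW;

    // Register the data so the runtime can manage transfers between processing units.
    std::vector<float> potentials(1024, 0.0f);
    starpu_data_handle_t handle;
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)potentials.data(),
                                potentials.size(), sizeof(float));

    // Submitting tasks only declares the task flow; StarPU schedules them asynchronously
    // on the available CPUs and GPUs while honoring data dependencies.
    starpu_task_insert(&p2p_cl, STARPU_RW, handle, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}
```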
PGAS-FMM: Implementing a distributed fast multipole method using the X10 programming language
The fast multipole method (FMM) is a complex, multi-stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchronous activities.
From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions
We study the simulation of stellar mergers, which requires complex
simulations with high computational demands. We have developed Octo-Tiger, a
finite-volume, grid-based hydrodynamics simulation code with Adaptive Mesh
Refinement, which is unique in conserving both linear and angular momentum to
machine precision. To face the challenge of increasingly complex, diverse, and
heterogeneous HPC systems, Octo-Tiger relies on high-level programming
abstractions.
We use HPX with its futurization capabilities to ensure scalability both
between and within nodes, and present first results from replacing MPI with
libfabric, achieving up to a 2.8x speedup. We extend Octo-Tiger to heterogeneous
GPU-accelerated supercomputers, demonstrating node-level performance and
portability. We show scalability up to full system runs on Piz Daint. For the
scenario's maximum resolution, the compute-critical parts (hydrodynamics and
gravity) achieve 68.1% parallel efficiency at 2048 nodes.
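The "futurization" HPX provides amounts to composing futures so that dependent work is expressed as continuations rather than barriers. A minimal sketch of that pattern follows; it is not Octo-Tiger code, `solve_hydro` and `solve_gravity` are hypothetical placeholders, and the header names assume a recent HPX release.

```cpp
#include <hpx/hpx_main.hpp>   // provides main() on top of an initialized HPX runtime
#include <hpx/future.hpp>
#include <iostream>
#include <utility>

// Hypothetical stand-ins for one sub-grid's hydrodynamics and gravity solves.
double solve_hydro(double state)   { return state * 0.5; }
double solve_gravity(double state) { return state + 1.0; }

int main()
{
    // Launch both solves asynchronously; each call returns immediately with a
    // future representing the eventual result, so other work can proceed.
    hpx::future<double> hydro   = hpx::async(solve_hydro, 1.0);
    hpx::future<double> gravity = hpx::async(solve_gravity, 1.0);

    // Attach a continuation that runs once both results are ready,
    // instead of blocking the calling thread at a barrier.
    hpx::future<double> coupled =
        hpx::when_all(std::move(hydro), std::move(gravity))
            .then([](auto both) {
                auto results = both.get();          // tuple of ready futures
                return hpx::get<0>(results).get() + hpx::get<1>(results).get();
            });

    std::cout << "coupled result: " << coupled.get() << "\n";
    return 0;
}
```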
Task-based programming for Seismic Imaging: Preliminary Results
The level of hardware complexity of current supercomputers is forcing the High Performance Computing (HPC) community to reconsider parallel programming paradigms and standards. The high level of hardware abstraction provided by task-based paradigms makes them excellent candidates for writing portable codes that can consistently deliver high performance across a wide range of platforms. While this paradigm has proved efficient for achieving such goals for dense and sparse linear solvers, it has yet to be demonstrated that industrial parallel codes relying on the classical Message Passing Interface (MPI) standard, and accumulating dozens of years of expertise (and countless lines of code), can be revisited and turned into efficient task-based programs. In this paper, we study the applicability of task-based programming in the case of a Reverse Time Migration (RTM) application for Seismic Imaging. The initial MPI-based application is turned into a task-based code executed on top of the PaRSEC runtime system. Preliminary results show that the approach is competitive with (and even potentially superior to) the original MPI code on a homogeneous multicore node and can exploit complex hardware, such as a cache-coherent Non-Uniform Memory Access (ccNUMA) node or an Intel Xeon Phi accelerator, much more efficiently.
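The essence of such a rewrite is replacing bulk-synchronous MPI phases with fine-grained tasks over wavefield tiles, so a runtime can schedule them and overlap communication with computation. The paper uses PaRSEC; the following is a hedged, runtime-agnostic sketch in plain C++ (std::async standing in for a real task runtime, with a hypothetical 1-D tile decomposition and a trivial 3-point stencil rather than the 3-D wave-equation kernel).

```cpp
#include <future>
#include <vector>
#include <iostream>

// One time step of a trivial 3-point stencil over the tile [lo, hi) of the wavefield.
// In a real RTM code this would be the 3-D wave-equation update kernel.
static void update_tile(const std::vector<double>& prev, std::vector<double>& next,
                        std::size_t lo, std::size_t hi)
{
    for (std::size_t i = lo; i < hi; ++i) {
        double left  = (i == 0) ? 0.0 : prev[i - 1];
        double right = (i + 1 == prev.size()) ? 0.0 : prev[i + 1];
        next[i] = 0.5 * prev[i] + 0.25 * (left + right);
    }
}

int main()
{
    const std::size_t n = 1 << 16, tiles = 8, tile = n / tiles;
    std::vector<double> prev(n, 1.0), next(n, 0.0);

    for (int step = 0; step < 10; ++step) {
        // Express the time step as independent per-tile tasks; a task runtime
        // (PaRSEC in the paper) would instead track halo dependencies between
        // neighboring tiles rather than relying on this global barrier.
        std::vector<std::future<void>> tasks;
        for (std::size_t t = 0; t < tiles; ++t)
            tasks.push_back(std::async(std::launch::async, update_tile,
                                       std::cref(prev), std::ref(next),
                                       t * tile, (t + 1) * tile));
        for (auto& f : tasks) f.get();   // barrier: all tiles done for this step
        std::swap(prev, next);
    }

    std::cout << "sample value: " << prev[n / 2] << "\n";
    return 0;
}
```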