3 research outputs found

    Scaling and optimizing the Gysela code on a cluster of many-core processors

    The current generation of the Xeon Phi Knights Landing (KNL) processor provides a highly multi-threaded environment on which standard programming models such as MPI/OpenMP can be used. This hardware offers both large memory bandwidth and large computing resources, and is currently available in computing facilities. Many factors affect the performance achieved by applications; two of the key ones are the efficient exploitation of SIMD vector units and the memory access pattern. Vectorization and optimization work has therefore been carried out on a plasma turbulence application, Gysela, using a set of techniques: loop splitting, inlining, grouping a set of LU solve operations, removing conditionals and some loop nests, auto-tuning of one computation kernel, and changing a key numerical scheme (Lagrange interpolation instead of cubic splines). As a result, KNL execution times have been reduced by up to a factor of 3 in some configurations. This effort has also yielded a speedup of 2x on the Broadwell architecture and 3x on Skylake. Good scalability up to a few thousand cores has been demonstrated in a strong scaling experiment. This incremental vectorization work on the Gysela code brought a large payoff without resorting to assembly code or low-level intrinsics.
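    The abstract does not show these loop transformations; as a minimal illustration, the C sketch below demonstrates two of the techniques named (removing a conditional from a hot loop, and loop splitting) on an invented kernel. All names and the kernel itself are hypothetical, not Gysela code.

```c
#include <stddef.h>

/* Before: a data-dependent branch inside the hot loop often blocks
 * auto-vectorization by the compiler. */
void kernel_scalar(const double *f, double *out, const int *mask, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (mask[i])
            out[i] = 2.0 * f[i] + 1.0;
        else
            out[i] = f[i];
    }
}

/* After: the conditional is replaced by branch-free arithmetic, so the
 * whole loop body can be emitted as SIMD instructions. */
void kernel_branchless(const double *f, double *out, const int *mask, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double m = (double)(mask[i] != 0);   /* 0.0 or 1.0, no branch */
        out[i] = m * (2.0 * f[i] + 1.0) + (1.0 - m) * f[i];
    }
}

/* Loop splitting: separating independent updates into two simple loops
 * gives the vectorizer small, regular bodies, at the cost of a second
 * pass over the data. */
void kernel_split(const double *f, double *a, double *b, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = 2.0 * f[i];
    for (size_t i = 0; i < n; i++) b[i] = f[i] * f[i];
}
```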

    A massively parallel semi-Lagrangian solver for the six-dimensional Vlasov-Poisson equation

    This paper presents an optimized and scalable semi-Lagrangian solver for the Vlasov-Poisson system in six-dimensional phase space. Grid-based solvers of the Vlasov equation are known to give accurate results. At the same time, these solvers are challenged by the curse of dimensionality, which results in very high memory requirements and demands highly efficient parallelization schemes. In this paper, we consider the 6d Vlasov-Poisson problem discretized by a split-step semi-Lagrangian scheme, using successive 1d interpolations on 1d stripes of the 6d domain. Two parallelization paradigms are compared: a remapping scheme and a classical domain decomposition approach applied to the full 6d problem. Numerical experiments show the latter approach to be superior in the massively parallel case in several respects. We address the challenge of artificial time step restrictions due to the decomposition of the domain by introducing a blocked one-sided communication scheme for the purely electrostatic case and a rotating mesh for the case with a constant magnetic field. In addition, we propose a pipelining scheme that hides the cost of halo communication between neighboring processes behind useful computation. Parallel scalability on up to 65k processes is demonstrated for benchmark problems on a supercomputer.
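    As a hedged illustration of hiding halo communication behind useful computation, the C/MPI sketch below overlaps a non-blocking two-sided halo exchange with interior work on a 1d decomposition. This shows the general overlap technique, not the paper's blocked one-sided scheme; update_interior and update_boundary are hypothetical stand-ins for the per-stripe semi-Lagrangian update.

```c
#include <mpi.h>

/* hypothetical placeholders for the per-stripe updates */
void update_interior(double *u, int n, int halo) { (void)u; (void)n; (void)halo; }
void update_boundary(double *u, int n, int halo) { (void)u; (void)n; (void)halo; }

/* One step on a 1d decomposition. u has layout [halo | n interior | halo];
 * left/right are neighbor ranks (MPI_PROC_NULL at the domain ends). */
void step(double *u, int n, int halo, int left, int right, MPI_Comm comm) {
    MPI_Request req[4];
    MPI_Irecv(u,            halo, MPI_DOUBLE, left,  0, comm, &req[0]); /* left halo  */
    MPI_Irecv(u + halo + n, halo, MPI_DOUBLE, right, 1, comm, &req[1]); /* right halo */
    MPI_Isend(u + halo,     halo, MPI_DOUBLE, left,  1, comm, &req[2]); /* our left boundary  */
    MPI_Isend(u + n,        halo, MPI_DOUBLE, right, 0, comm, &req[3]); /* our right boundary */

    update_interior(u, n, halo);   /* useful work while messages are in flight */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    update_boundary(u, n, halo);   /* cells that need the freshly received halos */
}
```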

    Benefits of SMT and of Parallel Transpose Algorithm for the Large-Scale GYSELA Application

    This article describes how we increase the performance and extend the features of a large parallel application through the use of simultaneous multithreading (SMT) and the design of a robust parallel transpose algorithm. The semi-Lagrangian code Gysela typically performs large physics simulations on a few thousand cores, from 1k up to 16k on x86-based clusters. However, simulations with finer resolutions and with kinetic electrons increase these needs by a huge factor, making Gysela a good example of an application requiring Exascale machines. To improve Gysela's compute times, we take advantage of the efficient SMT implementations available on recent INTEL architectures. We also analyze the cost of a transposition communication scheme that, in our case, involves a large number of cores. Adapting the code for load balancing, combined with SMT and a good deployment strategy, reduced execution times by up to 38%.
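    The transpose algorithm itself is not shown in the abstract; as a minimal sketch under the assumption of equal-sized blocks, a distributed transpose can be built on the MPI_Alltoall collective as below. The layout and names are illustrative, not the GYSELA implementation.

```c
#include <mpi.h>

/* Redistribute blocks so that after the call, rank i owns block i from
 * every rank: the communication phase of a global transpose. Each rank's
 * sendbuf holds nprocs contiguous blocks of `block` doubles. */
void transpose_blocks(const double *sendbuf, double *recvbuf,
                      int block, MPI_Comm comm) {
    MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                 recvbuf, block, MPI_DOUBLE, comm);
    /* a local in-block transpose would typically follow to finish the
     * global transposition */
}
```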