10 research outputs found
Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures
The alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model exploits parallelism and memory hierarchies on a node at all levels available in current hardware, which makes it possible to achieve portability of performant codes across various types of accelerators: unsupported levels are simply ignored, and only the ones supported on a specific accelerator are utilized. All hardware types (multi- and many-core CPUs, GPUs, and other accelerators) are treated and can be programmed in the same way. The provided C++ template interface allows for straightforward extension of the library to support other accelerators and for specialization of its internals for optimization.
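A minimal sketch may make the model concrete (hypothetical type and method names for illustration only, not Alpaka's actual API): a kernel only asks an abstract accelerator where it sits in the thread/element hierarchy, so a GPU back-end can map one element to each of many threads while a CPU back-end hands one thread a whole chunk of elements.

    // Hypothetical sketch of hierarchical redundant parallelism; type and
    // method names are illustrative, not Alpaka's actual API.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // A trivial serial "accelerator" standing in for a CPU back-end:
    // one thread that owns all elements.
    struct SerialAcc {
        std::size_t thread;          // global thread index
        std::size_t elems;           // elements mapped to each thread
        std::size_t globalThreadIndex() const { return thread; }
        std::size_t elementsPerThread() const { return elems; }
    };

    // The kernel is written once against an abstract accelerator TAcc; a GPU
    // back-end would report elementsPerThread() == 1 and launch many threads.
    template <typename TAcc>
    struct AxpyKernel {
        void operator()(TAcc const& acc, float alpha,
                        float const* x, float* y, std::size_t n) const {
            std::size_t const first = acc.globalThreadIndex() * acc.elementsPerThread();
            std::size_t const last = first + acc.elementsPerThread();
            for (std::size_t i = first; i < last && i < n; ++i)
                y[i] = alpha * x[i] + y[i];   // y = alpha*x + y
        }
    };

    int main() {
        std::size_t const n = 8;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        SerialAcc acc{0, n};   // one CPU "thread" covers all n elements
        AxpyKernel<SerialAcc>{}(acc, 3.0f, x.data(), y.data(), n);
        std::cout << y[0] << '\n';   // prints 5
    }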
Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library
We present an analysis of optimizing the performance of a single C++11 source code using the Alpaka hardware abstraction library. We use the general matrix multiplication (GEMM) algorithm to show that compilers can optimize Alpaka code effectively when key parameters of the algorithm are tuned. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this example to demonstrate that Alpaka allows for platform-specific tuning with a single source code. In addition, we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code on bleeding-edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL) and Haswell architectures, and IBM's Power8 system. On some of these we are able to reach almost 50% of the peak floating-point performance using the aforementioned means. When adding compiler-specific #pragmas we are able to reach 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system.
Comment: Accepted paper for the P^3MA workshop at ISC 2017 in Frankfurt
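To make the tuning idea concrete, here is a minimal tiled GEMM sketch in plain C++ (illustrative only, not the paper's actual Alpaka code): the tile size is a compile-time template parameter, so adapting the kernel to a new platform's cache hierarchy means changing one constant rather than the algorithm.

    // Minimal tiled GEMM sketch: C += A * B for square n x n row-major
    // matrices, processed in Tile x Tile blocks to keep the working set
    // cache-resident. Not the paper's Alpaka code, just the core idea.
    #include <algorithm>
    #include <cstddef>

    template <std::size_t Tile>
    void gemm_tiled(std::size_t n, double const* A, double const* B, double* C) {
        for (std::size_t i0 = 0; i0 < n; i0 += Tile)
            for (std::size_t k0 = 0; k0 < n; k0 += Tile)
                for (std::size_t j0 = 0; j0 < n; j0 += Tile)
                    // i-k-j order keeps the innermost accesses contiguous.
                    for (std::size_t i = i0; i < std::min(i0 + Tile, n); ++i)
                        for (std::size_t k = k0; k < std::min(k0 + Tile, n); ++k) {
                            double const a = A[i * n + k];
                            for (std::size_t j = j0; j < std::min(j0 + Tile, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

    // Platform-specific tuning reduces to picking the constant, e.g.
    // gemm_tiled<32>(n, A, B, C) on one CPU, gemm_tiled<64>(n, A, B, C) on another.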
Alpaka - An Abstraction Library for Parallel Kernel Acceleration
Porting applications to new hardware or programming models is a tedious and error-prone process. Every help that eases these burdens saves developer time that can then be invested into the advancement of the application itself instead of preserving the status quo on a new platform. The Alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. By doing so, it achieves platform and performance portability across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator. All hardware types (multi- and many-core CPUs, GPUs, and other accelerators) are supported and can be programmed in the same way. The Alpaka C++ template interface allows for straightforward extension of the library to support other accelerators and for specialization of its internals for optimization. Running Alpaka applications on a new (and supported) platform requires the change of only one source code line instead of a lot of #ifdefs.
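The "only one source code line" claim can be pictured as follows (a hypothetical sketch with made-up type names, not Alpaka's actual accelerator types): all application code is templated on an accelerator type, and that type is fixed in exactly one alias.

    #include <iostream>

    // Hypothetical back-end tags; in Alpaka these would be real accelerator types.
    struct AccCpuSerial { static char const* name() { return "CPU serial"; } };
    struct AccGpuCuda   { static char const* name() { return "CUDA GPU"; } };

    // Application code is templated on the accelerator and never names a
    // concrete back-end directly.
    template <typename TAcc>
    void runSimulation() { std::cout << "running on " << TAcc::name() << '\n'; }

    // The single source line that changes when porting to a new platform:
    using Acc = AccCpuSerial;
    // using Acc = AccGpuCuda;

    int main() { runSimulation<Acc>(); }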
Characterization of the vibration and noise sources of a diesel engine by coherence methods
SIGLE record, CNRS / INIST (Institut de l'Information Scientifique et Technique), France
Matrix multiplication software and results bundle for paper "Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library" for P^3MA submission
This is the archive containing the matrix multiplication software and the results of the publication "Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library" submitted to the P^3MA workshop 2017.

The archive has the following content:

- Source code for the (tiled) matrix multiplication in "src":
  - regular version in "src/matmul":
    - Remote: https://github.com/theZiz/matmul.git (copy will be removed)
    - Branch: topic-compatible-alpaka-0-1-0
    - Commit: a63ba4810d6bfcca62c68dd57408af15028e78a3
  - forked version for XL in "src/matmul":
    - Remote: https://github.com/theZiz/matmul.git (copy will be removed)
    - Branch: topic-xl-workaround
    - Commit: 1fee028eccb8cf7b677e8071233e08aa9f81846a
- The compiled binaries and the results of the tuning and scaling runs are in "runs", in subfolders for each type of run and architecture.
PIConGPU, Alpaka, and cupla software bundle for IWOPH 2016 submission
This is the archive containing the software used for evaluations in the publication "Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond" submitted to the International Workshop on OpenPOWER for HPC 2016.

The archive has the following content:

PIConGPU Kelvin-Helmholtz simulation code (picongpu-alpaka/):
- Remote: https://github.com/psychocoderHPC/picongpu-alpaka.git
- Branch: topic-scaling
- Commit: 1f004c8e0514ad1649f3958a6184878af6e75150

Alpaka code (alpaka/):
- Remote: https://github.com/psychocoderHPC/alpaka.git
- Branch: topic-picongpu-alpaka
- Commit: 4a6dd35a9aff62e7f500623c3658685f827f73e5

Cupla (cupla/):
- Remote: https://github.com/psychocoderHPC/cupla.git
- Branch: topic-dualAccelerators
- Commit: 4660f5fd8e888aa732230946046219f7e5daa1c9
The simulation was executed for one thousand time steps with the following configuration (a sketch of the Boris push follows the list):

- particle shape of higher order than CIC; we used TSC
- Boris pusher
- Esirkepov current solver (optimized, generalized)
- Yee field solver
- trilinear interpolation in field gathering
- 16 particles per cell
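For reference, here is a minimal sketch of the standard textbook Boris push for a single particle, in its non-relativistic form (PIConGPU's production pusher is the relativistic variant, so this is illustrative only): two half electric kicks bracket a pure magnetic rotation, so the magnetic field does no work on the particle.

    // Textbook non-relativistic Boris push for one particle (illustrative;
    // PIConGPU uses the relativistic variant). E and B are the fields
    // interpolated to the particle position; q, m are charge and mass.
    #include <array>

    using Vec3 = std::array<double, 3>;

    Vec3 cross(Vec3 const& a, Vec3 const& b) {
        return {a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]};
    }

    void borisPush(Vec3& x, Vec3& v, Vec3 const& E, Vec3 const& B,
                   double q, double m, double dt) {
        double const h = q * dt / (2.0 * m);
        Vec3 vMinus, t, s, vPrime, vPlus;
        double tSq = 0.0;
        for (int i = 0; i < 3; ++i) {            // first half electric kick
            vMinus[i] = v[i] + h * E[i];
            t[i] = h * B[i];                     // rotation vector
            tSq += t[i] * t[i];
        }
        Vec3 const c1 = cross(vMinus, t);        // magnetic rotation, step 1
        for (int i = 0; i < 3; ++i) {
            vPrime[i] = vMinus[i] + c1[i];
            s[i] = 2.0 * t[i] / (1.0 + tSq);     // rotation-angle correction
        }
        Vec3 const c2 = cross(vPrime, s);        // magnetic rotation, step 2
        for (int i = 0; i < 3; ++i) {
            vPlus[i] = vMinus[i] + c2[i];
            v[i] = vPlus[i] + h * E[i];          // second half electric kick
            x[i] += v[i] * dt;                   // position update
        }
    }

    // Example call for an electron in SI units (hypothetical field values):
    // borisPush(x, v, {0.0, 0.0, 1e3}, {0.0, 0.0, 0.1}, -1.602e-19, 9.109e-31, 1e-12);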
Compile flags:

- CPU g++-4.9.2: -g0 -O3 -m64 -funroll-loops -march=native -ffast-math --param max-unroll-times=512
- GPU nvcc: --use_fast_math --ftz=false -g0 -O3 -m64
Scalable, Data Driven Plasma Simulations with PIConGPU
PIConGPU is an open-source, multi-platform particle-in-cell code scaling to the fastest supercomputers in the TOP500 list. We present the architecture, novel developments, and workflows that enable high-precision, fast-turnaround computations on Exascale machines. Furthermore, we present our strategies to handle extreme data flows from thousands of GPUs for analysis with in situ processing and open data formats (openPMD). PIConGPU has recently gained native control through a Python Jupyter interface, and we are researching just-in-time kernel generation for C++ with our Cling-CUDA extensions.
Invited minisymposium talk at the Platform for Advanced Scientific Computing Conference (PASC19) at ETH Zurich (Zurich, Switzerland)
Talk "Next-Generation Simulations for XFEL-Plasma Interactions with Solid Density Targets with PIConGPU"
PIConGPU is reportedly the fastest particle-in-cell code in the world with respect to sustained FLOP/s. Written in performance-portable, single-source C++, it constantly pushes the envelope towards Exascale laser-plasma modeling. However, solving previously week-long simulation tasks in a few hours with a speedy framework is only the beginning.
This talk will present the architecture and recent additions driving PIConGPU. As we speak, we run on the fastest machines, and the community approaches a new generation of TOP10 clusters. Within those, many-core computing architectures and severe limitations in available I/O bandwidth demand a fundamental rethinking of established modeling workflows towards in situ processing.
We present our ready-to-use open-source solutions and address scientific repeatability, data reduction in I/O, predictability, and new atomic modeling for XFEL pump-probe experiments.