10 research outputs found
Investigating performance portability of a highly scalable particle-in-cell simulation code on various multi-core architectures
The alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model exploits parallelism and memory hierarchies on a node at all levels available in current hardware, which makes it possible to achieve portability of performant codes across various types of accelerators: unsupported levels are simply ignored, and only the ones supported on a specific accelerator are utilized. All hardware types (multi- and many-core CPUs, GPUs, and other accelerators) are treated and can be programmed in the same way. The provided C++ template interface allows for straightforward extension of the library to support other accelerators and for specialization of its internals for optimization.
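A minimal sketch may make the model concrete (hypothetical type and method names for illustration only, not Alpaka's actual API): a kernel only asks an abstract accelerator where it sits in the thread/element hierarchy, so a GPU back-end can map one element to each of many threads while a CPU back-end hands one thread a whole chunk of elements.

    // Hypothetical sketch of hierarchical redundant parallelism; type and
    // method names are illustrative, not Alpaka's actual API.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // A trivial serial "accelerator" standing in for a CPU back-end:
    // one thread that owns all elements.
    struct SerialAcc {
        std::size_t thread;          // global thread index
        std::size_t elems;           // elements mapped to each thread
        std::size_t globalThreadIndex() const { return thread; }
        std::size_t elementsPerThread() const { return elems; }
    };

    // The kernel is written once against an abstract accelerator TAcc; a GPU
    // back-end would report elementsPerThread() == 1 and launch many threads.
    template <typename TAcc>
    struct AxpyKernel {
        void operator()(TAcc const& acc, float alpha,
                        float const* x, float* y, std::size_t n) const {
            std::size_t const first = acc.globalThreadIndex() * acc.elementsPerThread();
            std::size_t const last = first + acc.elementsPerThread();
            for (std::size_t i = first; i < last && i < n; ++i)
                y[i] = alpha * x[i] + y[i];   // y = alpha*x + y
        }
    };

    int main() {
        std::size_t const n = 8;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);
        SerialAcc acc{0, n};   // one CPU "thread" covers all n elements
        AxpyKernel<SerialAcc>{}(acc, 3.0f, x.data(), y.data(), n);
        std::cout << y[0] << '\n';   // prints 5
    }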
Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library
We present an analysis of optimizing the performance of a single C++11 source code using the Alpaka hardware abstraction library. We use the general matrix multiplication (GEMM) algorithm to show that compilers can optimize Alpaka code effectively when key parameters of the algorithm are tuned. We do not intend to rival existing, highly optimized DGEMM versions, but merely choose this example to demonstrate that Alpaka allows for platform-specific tuning with a single source code. In addition, we analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka. We specifically test the code on bleeding-edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL) and Haswell architectures, and IBM's Power8 system. On some of these we are able to reach almost 50% of the peak floating-point performance using the aforementioned means. When adding compiler-specific #pragmas we are able to reach 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system.
Comment: Accepted paper for the P^3MA workshop at ISC 2017 in Frankfurt
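To make the tuning idea concrete, here is a minimal tiled GEMM sketch in plain C++ (illustrative only, not the paper's actual Alpaka code): the tile size is a compile-time template parameter, so adapting the kernel to a new platform's cache hierarchy means changing one constant rather than the algorithm.

    // Minimal tiled GEMM sketch: C += A * B for square n x n row-major
    // matrices, processed in Tile x Tile blocks to keep the working set
    // cache-resident. Not the paper's Alpaka code, just the core idea.
    #include <algorithm>
    #include <cstddef>

    template <std::size_t Tile>
    void gemm_tiled(std::size_t n, double const* A, double const* B, double* C) {
        for (std::size_t i0 = 0; i0 < n; i0 += Tile)
            for (std::size_t k0 = 0; k0 < n; k0 += Tile)
                for (std::size_t j0 = 0; j0 < n; j0 += Tile)
                    // i-k-j order keeps the innermost accesses contiguous.
                    for (std::size_t i = i0; i < std::min(i0 + Tile, n); ++i)
                        for (std::size_t k = k0; k < std::min(k0 + Tile, n); ++k) {
                            double const a = A[i * n + k];
                            for (std::size_t j = j0; j < std::min(j0 + Tile, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

    // Platform-specific tuning reduces to picking the constant, e.g.
    // gemm_tiled<32>(n, A, B, C) on one CPU, gemm_tiled<64>(n, A, B, C) on another.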
Alpaka - An Abstraction Library for Parallel Kernel Acceleration
Porting applications to new hardware or programming models is a tedious and error-prone process. Every help that eases these burdens saves developer time that can then be invested into the advancement of the application itself instead of preserving the status quo on a new platform. The Alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. By doing so, it achieves platform and performance portability across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator. All hardware types (multi- and many-core CPUs, GPUs, and other accelerators) are supported and can be programmed in the same way. The Alpaka C++ template interface allows for straightforward extension of the library to support other accelerators and for specialization of its internals for optimization. Running Alpaka applications on a new (and supported) platform requires the change of only one source code line instead of a lot of #ifdefs.
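The "only one source code line" claim can be pictured as follows (a hypothetical sketch with made-up type names, not Alpaka's actual accelerator types): all application code is templated on an accelerator type, and that type is fixed in exactly one alias.

    #include <iostream>

    // Hypothetical back-end tags; in Alpaka these would be real accelerator types.
    struct AccCpuSerial { static char const* name() { return "CPU serial"; } };
    struct AccGpuCuda   { static char const* name() { return "CUDA GPU"; } };

    // Application code is templated on the accelerator and never names a
    // concrete back-end directly.
    template <typename TAcc>
    void runSimulation() { std::cout << "running on " << TAcc::name() << '\n'; }

    // The single source line that changes when porting to a new platform:
    using Acc = AccCpuSerial;
    // using Acc = AccGpuCuda;

    int main() { runSimulation<Acc>(); }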
Characterization of the vibration and noise sources of a diesel engine by coherence methods
SIGLE record, CNRS / INIST (Institut de l'Information Scientifique et Technique), France
Matrix multiplication software and results bundle for paper "Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library" for P^3MA submission
This is the archive containing the matrix multiplication software and the results of the publication "Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library" submitted to the P^3MA workshop 2017.

The archive has the following content:

- Source code for the (tiled) matrix multiplication in "src":
  - regular version in "src/matmul":
    - Remote: https://github.com/theZiz/matmul.git (copy will be removed)
    - Branch: topic-compatible-alpaka-0-1-0
    - Commit: a63ba4810d6bfcca62c68dd57408af15028e78a3
  - forked version for XL in "src/matmul":
    - Remote: https://github.com/theZiz/matmul.git (copy will be removed)
    - Branch: topic-xl-workaround
    - Commit: 1fee028eccb8cf7b677e8071233e08aa9f81846a
- The compiled binaries and the results of the tuning and scaling runs are in "runs", in subfolders for each type of run and architecture.
PIConGPU, Alpaka, and cupla software bundle for IWOPH 2016 submission
This is the archive containing the software used for evaluations in the publication "Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond" submitted to the International Workshop on OpenPOWER for HPC 2016.

The archive has the following content:

PIConGPU Kelvin-Helmholtz simulation code (picongpu-alpaka/):
- Remote: https://github.com/psychocoderHPC/picongpu-alpaka.git
- Branch: topic-scaling
- Commit: 1f004c8e0514ad1649f3958a6184878af6e75150

Alpaka code (alpaka/):
- Remote: https://github.com/psychocoderHPC/alpaka.git
- Branch: topic-picongpu-alpaka
- Commit: 4a6dd35a9aff62e7f500623c3658685f827f73e5

Cupla (cupla/):
- Remote: https://github.com/psychocoderHPC/cupla.git
- Branch: topic-dualAccelerators
- Commit: 4660f5fd8e888aa732230946046219f7e5daa1c9
The simulation was executed for one thousand time steps with the following configuration (a sketch of the Boris push follows the list):

- particle shape of higher order than CIC; we used TSC
- Boris pusher
- Esirkepov current solver (optimized, generalized)
- Yee field solver
- trilinear interpolation in field gathering
- 16 particles per cell
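For reference, here is a minimal sketch of the standard textbook Boris push for a single particle, in its non-relativistic form (PIConGPU's production pusher is the relativistic variant, so this is illustrative only): two half electric kicks bracket a pure magnetic rotation, so the magnetic field does no work on the particle.

    // Textbook non-relativistic Boris push for one particle (illustrative;
    // PIConGPU uses the relativistic variant). E and B are the fields
    // interpolated to the particle position; q, m are charge and mass.
    #include <array>

    using Vec3 = std::array<double, 3>;

    Vec3 cross(Vec3 const& a, Vec3 const& b) {
        return {a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]};
    }

    void borisPush(Vec3& x, Vec3& v, Vec3 const& E, Vec3 const& B,
                   double q, double m, double dt) {
        double const h = q * dt / (2.0 * m);
        Vec3 vMinus, t, s, vPrime, vPlus;
        double tSq = 0.0;
        for (int i = 0; i < 3; ++i) {            // first half electric kick
            vMinus[i] = v[i] + h * E[i];
            t[i] = h * B[i];                     // rotation vector
            tSq += t[i] * t[i];
        }
        Vec3 const c1 = cross(vMinus, t);        // magnetic rotation, step 1
        for (int i = 0; i < 3; ++i) {
            vPrime[i] = vMinus[i] + c1[i];
            s[i] = 2.0 * t[i] / (1.0 + tSq);     // rotation-angle correction
        }
        Vec3 const c2 = cross(vPrime, s);        // magnetic rotation, step 2
        for (int i = 0; i < 3; ++i) {
            vPlus[i] = vMinus[i] + c2[i];
            v[i] = vPlus[i] + h * E[i];          // second half electric kick
            x[i] += v[i] * dt;                   // position update
        }
    }

    // Example call for an electron in SI units (hypothetical field values):
    // borisPush(x, v, {0.0, 0.0, 1e3}, {0.0, 0.0, 0.1}, -1.602e-19, 9.109e-31, 1e-12);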
Compile flags:

- CPU g++-4.9.2: -g0 -O3 -m64 -funroll-loops -march=native -ffast-math --param max-unroll-times=512
- GPU nvcc: --use_fast_math --ftz=false -g0 -O3 -m64
Scalable, Data Driven Plasma Simulations with PIConGPU
PIConGPU is an open-source, multi-platform particle-in-cell code scaling to the fastest supercomputers in the TOP500 list. We present the architecture, novel developments, and workflows that enable high-precision, fast-turnaround computations on Exascale machines. Furthermore, we present our strategies to handle extreme data flows from thousands of GPUs for analysis with in situ processing and open data formats (openPMD). PIConGPU has recently gained native control through a Python Jupyter interface, and we are researching just-in-time kernel generation for C++ with our Cling-CUDA extensions.
Invited minisymposium talk at the Platform for Advanced Scientific Computing Conference (PASC19) at ETH Zurich (Zurich, Switzerland)
Talk "Next-Generation Simulations for XFEL-Plasma Interactions with Solid Density Targets with PIConGPU"
PIConGPU is reportedly the fastest particle-in-cell code in the world with respect to sustained FLOP/s. Written in performance-portable, single-source C++, it constantly pushes the envelope towards Exascale laser-plasma modeling. However, solving previously week-long simulation tasks in a few hours with a speedy framework is only the beginning.
This talk will present the architecture and recent additions driving PIConGPU. As we speak, we run on the fastest machines, and the community approaches a new generation of TOP10 clusters. Within those, many-core computing architectures and severe limitations in available I/O bandwidth demand a fundamental rethinking of established modeling workflows towards in situ processing.
We present our ready-to-use open-source solutions and address scientific repeatability, data reduction in I/O, predictability, and new atomic modeling for XFEL pump-probe experiments.