Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library
We present an analysis of optimizing the performance of a single C++11 source
code using the Alpaka hardware abstraction library. For this we use the general
matrix multiplication (GEMM) algorithm in order to show that compilers can
optimize Alpaka code effectively when tuning key parameters of the algorithm.
We do not intend to rival existing, highly optimized DGEMM versions, but merely
choose this example to prove that Alpaka allows for platform-specific tuning
with a single source code. In addition we analyze the optimization potential
available with vendor-specific compilers when confronted with the heavily
templated abstractions of Alpaka. We specifically test the code on bleeding-edge
architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL)
and Haswell architectures, as well as IBM's Power8 system. On some of these we
are able to reach almost 50% of the peak floating-point performance
using the aforementioned means. When adding compiler-specific #pragmas we are
able to reach 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system.
Comment: Accepted paper for the P^3MA workshop at ISC 2017 in Frankfurt
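As an illustration of the tuning idea the abstract describes, here is a minimal, non-Alpaka C++ sketch of a blocked GEMM whose tile size is a compile-time parameter, so the compiler can unroll and vectorize the inner loops per target. The function name and `TileSize` parameter are hypothetical and not taken from the paper's source code:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: blocked GEMM C += A * B on n-by-n row-major matrices.
// TileSize is the tuning knob; the caller is expected to zero-initialize C.
template <int TileSize>
void gemm_blocked(int n, const std::vector<double>& A,
                  const std::vector<double>& B, std::vector<double>& C) {
    for (int ii = 0; ii < n; ii += TileSize)
        for (int kk = 0; kk < n; kk += TileSize)
            for (int jj = 0; jj < n; jj += TileSize)
                // Tile bounds are clamped so n need not be a multiple of TileSize.
                for (int i = ii; i < std::min(ii + TileSize, n); ++i)
                    for (int k = kk; k < std::min(kk + TileSize, n); ++k) {
                        const double a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + TileSize, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

In this spirit, per-architecture tuning amounts to instantiating the template with a different `TileSize` per platform while the source code itself stays unchanged.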
MP-STREAM: A Memory Performance Benchmark for Design Space Exploration on Heterogeneous HPC Devices
Sustained memory throughput is a key determinant
of performance in HPC devices. Having an accurate estimate of
this parameter is essential for manual or automated design space
exploration for any HPC device. While there are benchmarks for
measuring the sustained memory bandwidth for CPUs and GPUs,
such a benchmark for FPGAs has been missing. We present
MP-STREAM, an OpenCL-based synthetic micro-benchmark for
measuring sustained memory bandwidth, optimized for FPGAs,
but which can be used on multiple platforms. Our main contribution
is the introduction of various generic as well as device-specific
parameters that can be tuned to measure their effect on memory
bandwidth. We present results of running our benchmark on a
CPU, a GPU and two FPGA targets, and discuss our observations.
The experiments underline the utility of our benchmark for
optimizing HPC applications for FPGAs, and provide valuable
optimization hints for FPGA programmers.
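For context, a sustained-bandwidth micro-benchmark of the kind MP-STREAM generalizes can be sketched as a STREAM-style triad in plain C++. This is a hedged illustration only: the real benchmark is OpenCL-based and exposes device-specific knobs, and the function name and parameters below are hypothetical:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical sketch: measure sustained memory bandwidth (GB/s) with the
// STREAM triad kernel a[i] = b[i] + scalar * c[i]. The tunable parameters
// here are the array length n and the repetition count.
double triad_bandwidth_gbs(std::size_t n, int repetitions) {
    std::vector<double> a(n, 0.0), b(n, 1.0), c(n, 2.0);
    const double scalar = 3.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r)
        for (std::size_t i = 0; i < n; ++i)
            a[i] = b[i] + scalar * c[i];  // two loads + one store per element
    auto t1 = std::chrono::steady_clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    // Three 8-byte arrays are traversed per repetition.
    const double bytes = 3.0 * sizeof(double) * double(n) * repetitions;
    return bytes / seconds / 1e9;
}
```

Sweeping `n` past the cache sizes of a given device exposes the drop from cache to sustained DRAM bandwidth, which is the effect such benchmarks are built to measure.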