Alpaka - An Abstraction Library for Parallel Kernel Acceleration by Zenker, Erik et al.
Alpaka – An Abstraction Library for Parallel Kernel Acceleration∗
Erik Zenker¹,², Benjamin Worpitz¹,², René Widera¹, Axel Huebl¹,²,
Guido Juckeland¹,², Andreas Knüpfer², Wolfgang E. Nagel², Michael Bussmann¹
¹ Helmholtz-Zentrum Dresden - Rossendorf, Dresden, Germany
{e.zenker, r.widera, a.huebl, g.juckeland, m.bussmann}@hzdr.de
² Technische Universität Dresden, Dresden, Germany
{andreas.knuepfer, wolfgang.nagel}@tu-dresden.de, benjamin.worpitz@outlook.com
March 27, 2018
Abstract
Porting applications to new hardware or programming models is a tedious and error-prone process. Any help that eases this burden saves developer time that can then be invested in advancing the application itself instead of preserving the status quo on a new platform.
The Alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. By doing so, it achieves platform and performance portability across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator. All hardware types (multi- and many-core CPUs, GPUs and other accelerators) are supported and can be programmed in the same way. The Alpaka C++ template interface allows for straightforward extension of the library to support other accelerators and specialization of its internals for optimization.
Running Alpaka applications on a new (and supported) platform requires changing only one line of source code instead of a lot of #ifdefs.
∗This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 654220.
Keywords. Heterogeneous computing, HPC, C++,
CUDA, OpenMP, platform portability, performance
portability
1 Introduction
1.1 Motivation
Performance gain by employing parallelism in soft-
ware nowadays faces a variety of obstacles. Parallel
performance currently relies on the efficient use of
many-core architectures that are commonly found in
a heterogeneous environment of multi-core CPU and
many-core accelerator hardware.
Heterogeneous systems often expose a memory hi-
erarchy that has to be used efficiently as high com-
putational performance usually requires high memory
throughput, demanding the development of efficient
caching strategies by application developers.
The same developers face a variety of parallel computing models that are either specific to certain hardware or offer limited control over optimization. Many models aim to provide easy-to-learn interfaces that hide the complexities of memory management and parallel execution while promising performance portability, but ultimately fall short of at least one of their aims.
Due to the Herculean effort associated with maintaining a multi-source application, even large development teams thus usually have to choose a strategy of trading performance for portability or vice versa by choosing one single programming model.
arXiv:1602.08477v1 [cs.DC] 26 Feb 2016
Alpaka was designed to prevent this trade-off by providing a single source abstract interface that exposes all levels of parallelism existing on today's heterogeneous systems. Alpaka relies heavily on existing parallelism models, but encapsulates them via a redundant mapping of its abstract parallelization hierarchy [4] to specific hardware, allowing for mixing various models in a single source C++ code at runtime. Thus, hardware-specific optimizations are possible without the need for code replication.
Alpaka therefore is open to future performance optimizations while providing portable code. This is only possible because Alpaka relies on the developer's ability to write parallel code by explicitly exposing all information useful for defining parallel execution in a heterogeneous environment rather than hiding the complexities of parallel programming.
Moreover, Alpaka limits itself to a simple, pointer-based memory model that requires explicit deep copies between memory levels. This puts the developer in the position to develop portable parallel caching strategies without imposing restrictions on the memory layout. Thus, developers achieve performance portability by skillful code design, for which Alpaka provides a single source, explicit redundant parallelism model without any intrinsic optimization hidden from the user.
In the following, we define some categories in or-
der to compare Alpaka to existing models for parallel
programming.
Openness By Openness we refer to models li-
censed as open source or defined by an open stan-
dard.
Single source A model that provides for single
source code allows for the application code to be writ-
ten in a single programming language. It further-
more does not require extensive multiple compilation
branches with varying implementations of an algo-
rithm specific to a certain hardware. Single source
models may provide annotations (e.g. compiler di-
rectives) or add defined words to the language that
describe parallel execution.
Sustainability We define a sustainable parallel
programming model as a model where the porting
of an algorithm to another hardware requires min-
imum changes to the algorithmic description itself.
Sustainable models furthermore should be adaptable
to future hardware and be available for at least two
varieties of current hardware architectures.
Heterogeneity Parallel programming models map to heterogeneous systems if they allow for developing a single source code in such a way that execution on various hardware architectures requires minimum specific changes (e.g. offloading, memory scope) and that a single algorithmic implementation can be executed on various architectures in the same program and at the same time during run time.
Maintainability We define a parallel program-
ming model to serve code maintainability if it pro-
vides a single source code that is sustainable and
allows for execution on heterogeneous hardware by
changing or extending the programming model rather
than the application source code.
Testability A model provides testability if an algorithmic implementation can be tested on a specific hardware and gives, in a loose sense, the same results when migrated to other hardware. Testability requires sustainability, heterogeneity and maintainability, but furthermore demands a separation of the algorithmic description from hardware-specific features.
Optimizability We define an optimizable model
by the fact that it provides the user with complete
control over the parallelization of the algorithm as
well as the memory hierarchy in a heterogeneous sys-
tem. Furthermore, fine-tuning algorithmic perfor-
mance to a specific hardware should not force devel-
opers to write multiple implementations of the same
algorithm, but rather be provided for by the model.
Data structure agnostic A data structure agnostic model does not restrict the memory layout. Instead, it provides full control over the memory allocation and layout on all hierarchy levels, exposes deep copies between levels and does not assume a certain distribution of the memory over the various hierarchy levels. Specifically, it does not provide distributed data types that are intertwined with the parallelization scheme itself.
Performance Portability A model provides performance portability if for a given single source sustainable and maintainable implementation of an algorithm the hardware utilization on various systems is the same within reasonable margins, taking into account the limitations of the specific hardware. Performance portability does not require optimum utilization of the hardware.
2 Related Work
In the following we briefly discuss other libraries targeting portable parallel task execution within nodes. Some of them require language extensions, others advertise performance portability across a multitude of devices. However, none of these libraries can provide full control over the possibly diverse underlying hardware while being only minimally invasive. Furthermore, many of the libraries do not satisfy the requirement for full single source (C++) support.
CUDA [3] is a parallel computing platform and programming model developed by NVIDIA. The user is bound to the usage of NVIDIA GPUs. CUDA is not open source and does not provide for sustainability, heterogeneity, maintainability and testability. For CUDA-enabled hardware it provides for optimizability.
PGI CUDA-X86¹ is a compiler technology that generates x86-64 binary code from CUDA C/C++ applications using the CUDA runtime API, but it does not support the CUDA driver API. Compared to CUDA it allows for heterogeneity, maintainability and testability, but it currently falls behind in adapting to the latest CUDA features and thus has limited support for sustainability. As it does not provide for control of optimizations for x86 architectures, it lacks optimizability.
GPU Ocelot [2] is an open source dynamic JIT compilation framework based on LLVM which executes native CUDA binaries by dynamically translating the NVIDIA PTX virtual instruction set architecture to other instruction sets. It supports NVIDIA and AMD GPUs as well as multicore CPUs via a PTX to LLVM translator. The project is no longer in active development and only supports PTX up to version 3.1 while the current version is 4.2. Thus, it is in many respects similar to PGI CUDA-X86.
¹ https://www.pgroup.com/resources/cuda-x86.htm
OpenMP² is an open specification for vendor-agnostic shared memory parallelization which allows existing sequential C/C++/Fortran code to be parallelized easily and incrementally by adding annotations (pragmas in C/C++) to loops or regions. Up to version 4.5 there is no way to allocate device memory that is persistent between kernel calls in different methods because it is not possible to create a device data region spanning both functions in the general case. Currently OpenMP does not allow for controlling the hierarchical memory as its main assumption is a shared memory pool for all threads. Therefore, the block shared memory on CUDA devices cannot be explicitly utilized, and neither heterogeneity nor optimizability is provided for.
OpenACC³ is an open pragma-based programming standard for heterogeneous computing which is very similar to OpenMP and provides annotations for parallel execution and data movement as well as run-time functions for accelerator and device management. It allows for limited access to CUDA block shared memory but does not support dynamic allocation of memory in kernel code. It thus does not provide for optimizability and, in a practical sense, due to the very limited number of implementations, for heterogeneity and sustainability.
OpenCL⁴ is an open programming framework for heterogeneous platforms. It supports heterogeneity as it can utilize CPUs and GPUs of nearly all vendors. Versions prior to 2.1 (released in March 2015) only supported a C-like kernel language. Version 2.1 introduced a subset of C++14, but there are still no compilers available. OpenCL thus does not support single source programming. Furthermore, it does not allow for dynamic allocation of memory in kernel code and thus does not fully support optimizability.
SYCL⁵ is an open cross-platform abstraction layer based on OpenCL and thus shares most deficiencies with OpenCL; however, in principle it would allow for optimizability. In contrast to OpenCL it allows for single source heterogeneous programs, but as of now there is no usable free compiler implementation available that has good support for multiple accelerator devices, thus it currently lacks sustainability.
² http://openmp.org//
³ http://www.openacc-standard.org/
⁴ https://www.khronos.org/opencl/
⁵ https://www.khronos.org/sycl/
C++ AMP⁶ is an open specification from Microsoft which is implemented on top of DirectX 11 and thus currently limited in terms of heterogeneity, sustainability and testability. It is a language extension requiring compiler support that allows C++ code to be annotated so that it can then be run on multiple accelerators. It lacks full control of parallel execution and memory hierarchy and thus falls short of supporting optimizability. Due to restrictions on data types that provide for portability (see e.g. concurrency::array) it is not data structure agnostic.
KOKKOS⁷ is an open source abstract interface for portable, high-performance shared-memory programming and in many ways similar to Alpaka. However, kernel arguments have to be stored in members of the function object, coupling algorithm and data together. It thus is not data structure agnostic and in this sense limited in its optimizability.
Thrust [1] is an open source parallel algorithms library resembling the C++ Standard Template Library (STL) which is available for CUDA, Threading Building Blocks⁸ and OpenMP back-ends at make-time. Its container objects are tightly coupled with the parallelization strategy, therefore Thrust is not data structure agnostic. Thrust aims at hiding the memory hierarchy and is limited in expressing parallel execution, thus it cannot achieve full optimizability.
Table 1 provides a summary of all related work and
a comparison to Alpaka.
⁶ https://msdn.microsoft.com/en-us/library/hh265136.aspx
⁷ https://github.com/kokkos
⁸ https://www.threadingbuildingblocks.org/
3 Introduction to Alpaka
This section serves as an introduction to Alpaka. It
first explains the conceptual ideas behind Alpaka,
then provides an overview of the hardware abstrac-
tion model of Alpaka as well as how the model is
mapped to real devices. Lastly, the Alpaka program-
ming API is described.
3.1 Conceptual Overview
Alpaka provides a single abstract C++ interface to
describe parallel execution across multiple levels of
the parallelization hierarchy on a single compute
node. Each level of the Alpaka parallelization hierar-
chy is unrestricted in its dimensionality. In addition,
Alpaka uses the offloading model, which separates the
host from the accelerator device.
In order to execute Alpaka code on different hardware, the interface is currently implemented using various parallelism models such as OpenMP, CUDA, C++ threads and Boost fibers. Alpaka interface implementations, called back-ends, are not limited to these choices and will in the future be extended by e.g. Threading Building Blocks. By design, new back-ends can be added to Alpaka. Thus, Alpaka allows for mixing parallelization models in a single source program, thereby enabling the user to choose the implementation that is best suited for a given choice of hardware and algorithm. It even enables running multiple instances of the same or different back-ends simultaneously, e.g. to utilize all cores on a device as well as all accelerators concurrently.
The Alpaka library is based on the C++11 standard without any language extensions and makes extensive use of C++ template meta-programming. Algorithms, called kernels, are written in-line with single source code and can be compiled for multiple platform back-ends by simply selecting the appropriate back-end. The actual back-ends that execute an algorithm can thus be selected at configure-, compile- or run-time, making it possible to run an algorithm on multiple back-ends in one binary at the same time.
Alpaka does not automatically optimize data accesses or data transfers between devices. Data are stored in simple buffers with support for copies between devices and access to memory is completely data structure agnostic. Thus, the user needs to take care of distribution of data between devices.
Table 1: Properties of intra-node parallelization frameworks and their ability to solve the problems in porting high-performance HPC codes. ✓: yes / fully solved, ○: partially solved, ✗: no / not solved
Model         Openness  Single  Sustain-  Hetero-  Maintain-  Test-    Optimiz-  Data structure
                        source  ability   geneity  ability    ability  ability   agnostic
NVIDIA CUDA   ✗         ✓       ✗         ✗        ✗          ✗        ○         ✓
PGI CUDA-x86  ✗         ✓       ○         ✓        ✓          ✓        ✗         ✓
GPU Ocelot    ✓         ✓       ○         ✓        ✓          ✓        ✗         ✓
OpenMP        ✓         ✓       ✓         ○        ○          ✓        ✗         ✓
OpenACC       ✓         ✓       ○         ○        ✓          ✓        ✗         ✓
OpenCL        ✓         ○       ✓         ✓        ✓          ✓        ✗         ✓
SYCL          ✓         ✓       ○         ✓        ✓          ○        ○         ✓
C++AMP        ✓         ✓       ○         ○        ✓          ○        ✗         ○
KOKKOS        ✓         ✓       ✓         ✓        ✓          ✓        ✗         ○
Thrust        ✓         ✓       ✓         ✓        ✓          ✓        ✗         ✗
Alpaka        ✓         ✓       ✓         ✓        ✓          ✓        ✓         ✓
Alpaka neither automatically decomposes the algorithmic execution domain and data domain, nor does it assume any default or implicit state such as a default device, current device, default stream, or implicit built-in variables and functions.
3.2 Model of Parallel Abstraction
Alpaka abstracts data parallelism following the redundant hierarchical parallelism model [4], thereby enabling the developer to explicitly take the hierarchy of processing units, their data parallel features and corresponding memory regions into account. The Alpaka abstraction of parallelization is influenced by and based on the groundbreaking CUDA and OpenCL abstractions⁹ of a multidimensional grid of threads with additional hierarchy levels in between. Furthermore, it is amended with additional vectorization capabilities.
The four main hierarchy levels introduced by Alpaka are called grid, block, thread and element level, shown in Figure 1 together with their respective parallelization and synchronization features as discussed below.
Each parallelization level corresponds to a particular memory level (Figure 2): global memory (grid), shared memory (block) and register memory (thread).
⁹ Both CUDA and OpenCL are industry standards for accelerator programming.
[Figure 1 labels: Grid, Block, Thread, Element; Parallel, Sequential, Synchronize]
Figure 1: The Alpaka parallelization hierarchy consists of a grid of blocks, where each block consists of threads and each thread processes multiple elements. Both threads and grids are able to synchronize.
The Alpaka model separates the parallelization strategy from the algorithm. The algorithm is described by kernel functions that are executed by threads. A kernel is the common set of instructions executed by all threads on a grid.
The parallelization strategy is described by the accelerator and the work division (see Sections 3.3 and 3.4). An accelerator defines the acceleration strategy by a mapping of the parallelization levels to the hardware. The device is the actual hardware onto which these levels are mapped.
[Figure 2 labels: Host Memory; Global Memory (Grid); Shared Memory (Block); Register Memory (Thread)]
Figure 2: The memory hierarchy of the Alpaka abstraction model. Threads have exclusive access to fast register memory. All threads in a block can access the same shared memory. All blocks in a grid can access the same global memory.
3.2.1 Grid
A grid is an n-dimensional set of blocks with a usually large global memory accessible by all threads in all blocks. Grids are independent of each other and can thus be executed either sequentially or in parallel. Grids can be synchronized to each other via explicit synchronization invoked in the code.
3.2.2 Block
A block is an n-dimensional set of threads with a high
bandwidth, low latency but small amount of shared
memory. All blocks on a grid are independent of each
other and can thus be executed either sequentially or
in parallel. Blocks cannot be synchronized to each
other. The shared memory can only be accessed ex-
plicitly by threads within the same block and gets
discarded after the complete block has finished its
calculation.
3.2.3 Thread
A thread represents the execution of a sequence of instructions. All threads in a block are independent of each other and can thus be executed either sequentially or in parallel. Threads can be synchronized to each other via explicit synchronization invoked in the code. Threads can by default always access their private registers, the shared memory of the block and the global memory¹⁰. All variables within the default scope of a kernel are stored within register memory and are not shared between threads. Shared and global memory can be allocated statically or at run-time before the kernel starts.
3.2.4 Element
The element level represents an n-dimensional set of elements and unifies the data parallel capabilities of modern hardware architectures, e.g. vectorization on thread level. This is necessary as current compilers do not support automatic vectorization of basic, non-trivial loops containing control flow statements (e.g. if, else, for) or non-trivial memory operations. Furthermore, vectorization intrinsics as they are available in intrin.h, arm_neon.h and altivec.h are not portable across varying back-ends. Alpaka therefore currently relies on compiler recognition of vectorizable code parts. Code is refactored in such a way that it includes primitive inner loops over a fixed number of elements.
The user is free to sequentially loop over the el-
ements or to utilize vectorization where a single in-
struction is applied to multiple data elements in par-
allel e.g. by utilizing SIMD vector registers. Process-
ing multiple elements per thread on some architec-
tures may enhance caching.
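The element level can be pictured as a short inner loop with a fixed trip count inside each thread. The following is a minimal sketch in plain C++ and not Alpaka code; the names `kElementsPerThread`, `scaleElements` and `scaleAll` are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical number of elements processed by one thread.
constexpr std::size_t kElementsPerThread = 4;

// Work of a single "thread": scale one contiguous chunk of elements.
// The inner loop has a fixed trip count and no control flow in its body,
// which is the shape that compilers can usually vectorize on their own.
inline void scaleElements(double* data, std::size_t threadIdx, double factor)
{
    double* chunk = data + threadIdx * kElementsPerThread;
    for (std::size_t e = 0; e < kElementsPerThread; ++e)
    {
        chunk[e] *= factor;
    }
}

// Sequentially emulate all threads. For simplicity the size must be a
// multiple of the number of elements per thread.
inline void scaleAll(std::vector<double>& data, double factor)
{
    assert(data.size() % kElementsPerThread == 0);
    std::size_t const numThreads = data.size() / kElementsPerThread;
    for (std::size_t t = 0; t < numThreads; ++t)
    {
        scaleElements(data.data(), t, factor);
    }
}
```

Whether the inner loop is actually vectorized depends on the compiler and its flags; the sketch only shows the loop structure that makes this possible.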
3.3 Mapping of Abstraction to Hard-
ware
Alpaka clearly separates its parallelization abstraction from the specific hardware capabilities by an explicit mapping of the parallelization levels to the hardware. A major point of the hierarchical parallelism abstraction is to ignore specific unsupported levels of the model and utilize only the ones supported on a particular device. Mapping is left to the implementation of the accelerator.
¹⁰ However, Alpaka allows for atomic operations that serialize thread access to global memory.
This allows for variable mappings as shown in the examples below and, therefore, an optimum usage of the underlying compute and memory capabilities, albeit with two minor limitations: the grid level is always mapped to the whole device under consideration, and the kernel scheduler can always execute multiple kernel grids from multiple streams in parallel by statically or dynamically subdividing the available resources.
Figure 3 shows a mapping of the Alpaka abstraction model onto a CPU, a many integrated cores device (MIC) and a GPU architecture. For the MIC architecture a second mapping is shown, which spans a block over all cores to increase the shared memory. CPU and MIC process multiple elements per thread and benefit from their vectorization units, while a GPU thread processes only a small number of elements.
[Figure 3 labels: CPU (Package, Core, L1/2, L3, AVX, R, RAM); MIC (Tile, Core, L1/R, L2, VPU, 2D Mesh, On Chip RAM, RAM); GPU (SM, R, Shared Memory, L1 / Shared, RAM); legend: Grid, Block, Thread, Element, Register Memory, Shared Memory, Global Memory]
Figure 3: Possible mapping of blocks, threads and elements to a MIC, a CPU and a GPU device. The mapping can skip individual levels when they are not beneficial on a particular device.
Finally, the user needs to decide which back-end to use for which device. It can be selected from the set of predefined accelerators or users can write their own accelerator implementation. The set of predefined accelerator mappings is listed in Table 2.
Table 2: Predefined accelerators with problem size (N), threads per block (B) and elements per thread (V).
Arch  Acc            Grid  Block    Thread  Element
GPU   CUDA           1     N/(B·V)  B       V
CPU   OpenMP block   1     N/V      1       V
CPU   OpenMP thread  1     N/(B·V)  B       V
CPU   C++11 thread   1     N/(B·V)  B       V
CPU   Sequential     1     N/V      1       V
MIC   OpenMP block   1     N/V      1       V
MIC   OpenMP thread  1     N/(B·V)  B       V
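The block counts in Table 2 follow directly from N, B and V. The sketch below computes them in plain C++; the type `WorkDiv1D` and the function `divideWork` are hypothetical names, and rounding up for problem sizes that are not a multiple of B·V is an assumption added here, not something the table states.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 1D work division: extents of the block, thread and element levels.
struct WorkDiv1D
{
    std::size_t blocksPerGrid;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Compute N / (B * V), rounded up so that every element is covered even
// when N is not a multiple of B * V (assumption of this sketch).
inline WorkDiv1D divideWork(std::size_t n, std::size_t b, std::size_t v)
{
    assert(b > 0 && v > 0);
    std::size_t const elementsPerBlock = b * v;
    std::size_t const blocks = (n + elementsPerBlock - 1) / elementsPerBlock;
    return WorkDiv1D{blocks, b, v};
}
```

For N = 1024, B = 16 and V = 4 this yields 16 blocks; setting B = 1 reproduces the N/V entries of the OpenMP block and sequential rows (256 blocks for the same N and V).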
3.4 Alpaka Programming Interface
In the following each part of the Alpaka interface is
described briefly. The provided listings assume that
the Alpaka namespace is used. Source code exam-
ples are provided to give a more detailed insight into
Alpaka.
3.4.1 Kernel
The kernel is the central unit in Alpaka that acts as the bridge between host and accelerator code through a C++ class or a C++ lambda (C++14 required). The algorithm is described from the block down to the element level, which removes the need for a nested loop structure as used in OpenMP and SYCL. The kernel function object needs to implement the template operator() member function as shown in Listing 1. This member function is the first function called on the particular accelerator device.
The code within the kernel is written in C++11 (restricted by the utilized back-end compilers) with additional calls to the Alpaka run-time API. There are no implicit built-in variables and functions as is usual in CUDA or OpenCL. All information can be retrieved from the accelerator object (the acc parameter of operator() in Listing 1).
    struct Kernel {

        template <class T_Acc, class T>
        ALPAKA_FN_ACC void operator()(T_Acc acc, T data) const {

            /* Write kernel code here */

        }

    };
Listing 1: A skeleton of an Alpaka kernel. The kernel needs to implement the operator() with prefix ALPAKA_FN_ACC, which takes at least the accelerator as parameter.
    // Extent of each level in two dimensions
    Vec<Dim2, size_t> elementsPerThread(1, 1);
    Vec<Dim2, size_t> threadsPerBlock(1, 1);
    Vec<Dim2, size_t> blocksPerGrid(8, 16);

    auto workDiv = workdiv::WorkDivMembers<Dim2, size_t>(
        blocksPerGrid,
        threadsPerBlock,
        elementsPerThread);
Listing 2: Declaration of a work division in the host code. The work division is defined in two dimensions for all levels. Each block contains a single thread and each thread processes a single element, while the grid has an extent of 8 × 16 = 128 blocks.
3.4.2 Accelerator Executable Functions
Alpaka defines the macros ALPAKA_FN_HOST, ALPAKA_FN_ACC and ALPAKA_FN_HOST_ACC to define that functions are callable from host, from accelerator or from both host and accelerator device. All functions called from accelerator code need to be prefixed by these macros.
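One common way to realize such annotations is a set of preprocessor macros that expand to compiler-specific qualifiers. The sketch below is an assumption about how this can be done and not the definition used by Alpaka; the macro names `MY_FN_HOST`, `MY_FN_ACC`, `MY_FN_HOST_ACC` and the helper `elementAt` are hypothetical. It relies only on the fact that nvcc defines `__CUDACC__` and understands the `__host__` and `__device__` qualifiers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the Alpaka macros; the real definitions may differ.
#if defined(__CUDACC__)
    // Compiling with nvcc: use the CUDA function space qualifiers.
    #define MY_FN_HOST     __host__
    #define MY_FN_ACC      __device__
    #define MY_FN_HOST_ACC __host__ __device__
#else
    // Plain C++ compiler: accelerator code is ordinary host code.
    #define MY_FN_HOST
    #define MY_FN_ACC
    #define MY_FN_HOST_ACC
#endif

// A bounds-checked accessor that may be called from host code as well as
// from kernel code.
MY_FN_HOST_ACC inline double& elementAt(double* data, std::size_t size,
                                        std::size_t i)
{
    assert(i < size);
    return data[i];
}
```

With a plain C++ compiler the macros expand to nothing, so the same source compiles unchanged for a CPU back-end.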
3.4.3 Work Division and Index Retrieval
The work division defines the extent and dimension-
ality of each level of the Alpaka abstraction model.
Listing 2 shows the declaration of a two-dimensional
work division on the host where the grid level has
an extent of 128 blocks and the other levels have an
extent of one.
    template <class T_Acc>
    ALPAKA_FN_ACC void operator()(T_Acc acc) const {

        // Retrieve the global n-dim thread index
        auto gTIdx = idx::getIdx<Grid, Threads>(acc);

        // Retrieve the n-dim thread extent
        auto gTExtent = workdiv::getWorkDiv<Grid, Threads>(acc);

        // Retrieve the global one dim thread index
        auto linIdx = core::mapIdx<1>(gTIdx, gTExtent);
    }
Listing 3: Access of work division and thread index. Thread extent and index are mapped onto a one dimensional space to retrieve a linearized index.
The work division declared in Listing 2 can be ac-
cessed within the kernel via the accelerator object.
Furthermore, there exist methods to map the index
space between varying extents and dimensionalities.
Listing 3 shows a kernel function that calculates the
global linearized index of a thread with the help of
Alpaka run-time functions.
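The index mapping itself is simple arithmetic. The following sketch shows one possible linearization in plain C++; the function names `linearize` and `delinearize` are hypothetical, and the row-major ordering (last component varies fastest) is an assumption of this sketch, since the text does not state which ordering the Alpaka mapping uses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Map an N-dimensional index to a linear index within the given extent.
// Assumption: row-major order, i.e. the last component varies fastest.
template <std::size_t N>
std::size_t linearize(std::array<std::size_t, N> const& idx,
                      std::array<std::size_t, N> const& extent)
{
    std::size_t linear = 0;
    for (std::size_t d = 0; d < N; ++d)
    {
        assert(idx[d] < extent[d]);
        linear = linear * extent[d] + idx[d];
    }
    return linear;
}

// Inverse mapping: recover the N-dimensional index from a linear index.
template <std::size_t N>
std::array<std::size_t, N> delinearize(std::size_t linear,
                                       std::array<std::size_t, N> const& extent)
{
    std::array<std::size_t, N> idx{};
    for (std::size_t d = N; d-- > 0;)
    {
        idx[d] = linear % extent[d];
        linear /= extent[d];
    }
    return idx;
}
```

For an extent of 8 × 16, the index (2, 3) maps to 2 · 16 + 3 = 35 under this ordering, and `delinearize` maps 35 back to (2, 3).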
3.4.4 Memory
Alpaka provides simple memory buffers that store the
plain pointer to memory of the particular device and
additional information like residing device, extent,
pitch and dimension. These buffers are uniform for
all devices which allows for copying memory between
different devices with respect to pitch and extents,
see Listing 4.
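The role of the pitch can be made concrete with a small example: the pitch is the distance in bytes between the starts of two consecutive rows, which may be larger than the row width because of alignment padding. The sketch below copies a 2D region between two buffers with different pitches in plain C++; `PitchedView` and `copy2D` are hypothetical names and not part of the Alpaka interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical non-owning view of a 2D buffer, measured in bytes.
struct PitchedView
{
    unsigned char* ptr;      // start of the first row
    std::size_t pitchBytes;  // distance between consecutive row starts
};

// Copy `rows` rows of `rowBytes` bytes each. Source and destination may
// use different pitches, so the copy proceeds row by row.
inline void copy2D(PitchedView dst, PitchedView src,
                   std::size_t rowBytes, std::size_t rows)
{
    assert(rowBytes <= dst.pitchBytes && rowBytes <= src.pitchBytes);
    for (std::size_t r = 0; r < rows; ++r)
    {
        std::memcpy(dst.ptr + r * dst.pitchBytes,
                    src.ptr + r * src.pitchBytes,
                    rowBytes);
    }
}
```

A buffer type that stores pointer, extent and pitch together lets a single copy routine of this kind serve padded and tightly packed memory alike.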
    // Dim, data and index type
    using Dim = dim::DimInt<2>;
    using Data = std::uint32_t;
    using Size = std::size_t;

    // Declare extents of buffer
    Vec<Dim, Size> extents(10, 10);

    // Declare host and device buffer
    auto hostBuf = mem::buf::alloc<Data, Size>(host, extents);
    auto devBuf = mem::buf::alloc<Data, Size>(dev, extents);

    // Copy host buffer to device buffer
    mem::view::copy(stream, devBuf, hostBuf, extents);
Listing 4: Allocation of a two dimensional host and a device buffer with one hundred elements each. Moreover, the host buffer is copied to the device buffer.
3.4.5 Streams
A stream is the work queue of a particular device. Operations in streams are always executed in-order: No operation in a stream will begin before all previously issued operations in the stream have completed. Streams can be synchronous or asynchronous with respect to operations on the host. If an operation is issued in a synchronous stream, the host thread will block until this operation is finished. Asynchronous streams allow the host to resume computations while the accelerator is executing the operation.
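The in-order semantics and the difference between synchronous and asynchronous streams can be illustrated with a deliberately simplified model. The class `InOrderStream` below is hypothetical and involves no real concurrency: "asynchronous" here only means that `enqueue` returns before the task has run, and the pending tasks run in issue order once `wait` is called.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical single-threaded model of an in-order work queue.
class InOrderStream
{
public:
    explicit InOrderStream(bool synchronous) : m_synchronous(synchronous) {}

    void enqueue(std::function<void()> task)
    {
        assert(static_cast<bool>(task));  // reject empty tasks
        m_pending.push(std::move(task));
        if (m_synchronous)
        {
            wait();  // the caller only continues after the task has finished
        }
    }

    // Run all pending tasks in the order in which they were issued.
    void wait()
    {
        while (!m_pending.empty())
        {
            std::function<void()> task = std::move(m_pending.front());
            m_pending.pop();
            task();
        }
    }

    bool idle() const { return m_pending.empty(); }

private:
    bool m_synchronous;
    std::queue<std::function<void()>> m_pending;
};
```

A real asynchronous stream would hand the tasks to a worker thread or to the device driver instead of deferring them; the ordering guarantee would stay the same.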
3.4.6 Kernel Execution
A kernel will be executed by enqueuing a kernel ex-
ecutor into a stream of a particular device. An execu-
tor binds an accelerator, a work division, a kernel and
its parameters. Streams are filled with those execu-
tors and Alpaka takes care that they will be executed
in the specified way. Listing 5 shows a full host code
example from the accelerator type definition up to
the enqueuing of a kernel into a stream.
    // Define the dimensionality of the task
    using Dim = dim::DimInt<1u>;
    using Size = std::size_t;

    // Define the accelerator and stream to use
    using Acc = acc::AccCpuSerial<Dim, Size>;
    using Stream = stream::StreamCpuAsync;

    // Select a device to execute on
    auto devAcc = dev::DevMan<Acc>::getDevByIdx(0);

    // Create a stream to enqueue the execution into
    Stream stream(devAcc);

    // Create a 1d work division with 256 blocks
    // of 16 threads each and 1 element per thread
    auto workDiv(workdiv::WorkDivMembers<Dim, Size>(256u, 16u, 1u));

    // Create an instance of the kernel
    Kernel kernel;

    // Create the execution task
    auto exec(exec::create<Acc>(workDiv, kernel /*, arguments ... */));

    // Enqueue the task into the stream
    stream::enqueue(stream, exec);
Listing 5: Full example of a kernel execution. The kernel is executed with a work division of 256 blocks and 16 threads per block on a single accelerator (in this case the sequential CPU back-end is selected).
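What a sequential back-end has to do with such an executor can be sketched in a few lines of plain C++. The code below is a conceptual illustration and not the Alpaka implementation; `SerialAcc`, `runSerial` and `IotaKernel` are hypothetical names. It shows how a kernel function object receives all index information through an accelerator object instead of implicit built-in variables.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical accelerator object: carries the indices and extents that a
// kernel would otherwise read from implicit built-in variables.
struct SerialAcc
{
    std::size_t blockIdx;
    std::size_t threadIdx;
    std::size_t blocksPerGrid;
    std::size_t threadsPerBlock;

    std::size_t globalThreadIdx() const
    {
        return blockIdx * threadsPerBlock + threadIdx;
    }
};

// Hypothetical sequential executor: visits every thread of every block in
// order and calls the kernel with the accelerator object and its arguments.
template <class Kernel, class... Args>
void runSerial(std::size_t blocksPerGrid, std::size_t threadsPerBlock,
               Kernel const& kernel, Args... args)
{
    assert(blocksPerGrid > 0 && threadsPerBlock > 0);
    for (std::size_t b = 0; b < blocksPerGrid; ++b)
    {
        for (std::size_t t = 0; t < threadsPerBlock; ++t)
        {
            kernel(SerialAcc{b, t, blocksPerGrid, threadsPerBlock}, args...);
        }
    }
}

// Example kernel: each thread writes its own global index.
struct IotaKernel
{
    template <class Acc>
    void operator()(Acc const& acc, std::size_t* out, std::size_t n) const
    {
        std::size_t const i = acc.globalThreadIdx();
        if (i < n)
        {
            out[i] = i;
        }
    }
};
```

Because the kernel is templated on the accelerator type, the same kernel source could be handed to a different executor that distributes blocks and threads in parallel.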
4 Evaluation
This section provides an evaluation of Alpaka on a variety of hardware platforms using various back-ends, see Table 3. All CUDA evaluations are compiled with CUDA 7.0 and all CPU evaluations with gcc 4.9.2. Source codes denoted as native are not wrapped by Alpaka, but contain pure CUDA or OpenMP code.
The evaluation was performed in five stages: First, the PTX and assembler code generated during compilation of an Alpaka DAXPY program is compared to the respective native versions. Then, the overhead of two naïve Alpaka DGEMM kernels with respect to their native versions is measured. As a next step, it is investigated what happens to performance portability when the naïve Alpaka DGEMM kernels are mapped to an inappropriate back-end, followed by the description of a single source DGEMM kernel and how it can obtain performance portability with the help of Alpaka. Finally, the applicability of Alpaka to real world applications was evaluated using HASEonGPU [5].
4.1 Conceptual Comparison
This section compares the Alpaka implementation of the generalized vector addition algorithm in double precision (DAXPY) to a sequential C++ and a CUDA implementation on the source code and assembler level.
DAXPY computes Y ← αX + Y where X and Y are vectors while α is a scalar. DAXPY was selected as a trivial example to showcase, on the one hand, that the Alpaka abstraction does not obfuscate the code and, on the other hand, that it introduces zero overhead. The source code of the various DAXPY implementations is available in our GitHub repository¹¹.
From a developer's point of view, the source codes
are very similar to each other. However, some Alpaka
related changes are necessary: Alpaka adds the
accelerator as an additional template argument,
together with the corresponding function argument, to
the kernel function call. Each function that should be
callable from an accelerator needs to be annotated with
the Alpaka specific macro ALPAKA_FN_ACC. Furthermore,
Alpaka replaces the for loop index calculations by an
Alpaka equivalent that calculates the correct index for
each thread.
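The three changes listed above can be illustrated with a minimal host-only sketch. It does not use the Alpaka API: `SKETCH_FN_ACC`, `SeqAcc`, `globalThreadIdx` and `runSequential` are hypothetical names introduced only for this illustration, standing in for the annotation macro, the accelerator argument and the per-thread index calculation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an accelerator annotation macro; in this
// host-only sketch it expands to nothing.
#define SKETCH_FN_ACC

// Hypothetical accelerator type: it only carries the global index of the
// thread that is currently executing the kernel.
struct SeqAcc {
    std::size_t globalThreadIdx;
};

// Plain sequential C++ DAXPY: Y <- alpha * X + Y.
void daxpySequential(double alpha, std::vector<double> const& x,
                     std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}

// Accelerator-style kernel: the for loop is replaced by a per-thread index
// that is obtained from the accelerator argument.
struct DaxpyKernel {
    template <typename TAcc>
    SKETCH_FN_ACC void operator()(TAcc const& acc, std::size_t n, double alpha,
                                  double const* x, double* y) const {
        std::size_t const i = acc.globalThreadIdx;
        if (i < n) {  // threads beyond the vector length do nothing
            y[i] = alpha * x[i] + y[i];
        }
    }
};

// Hypothetical sequential "back-end": runs the kernel once per thread index.
template <typename TKernel>
void runSequential(TKernel const& kernel, std::size_t numThreads, std::size_t n,
                   double alpha, double const* x, double* y) {
    for (std::size_t t = 0; t < numThreads; ++t) {
        kernel(SeqAcc{t}, n, alpha, x, y);
    }
}
```

Calling `runSequential(DaxpyKernel{}, numThreads, n, alpha, x, y)` with at least `n` threads produces the same vector as `daxpySequential`; the kernel body itself stays free of back-end specific code.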
Figure 4 shows snippets of the PTX codes of the
compiled Alpaka and CUDA implementations.
Comparing the generated PTX code shows that the two
versions are identical except for two additional but
unused function parameters in the Alpaka variant,
different internal variable names, and a single use of
the non-coherent texture cache in the native CUDA
variant. Modern compilers are thus able to remove all
the meta-programming abstraction introduced by Alpaka.
This demonstrates the zero overhead abstraction of the
Alpaka interface relative to the CUDA interface.
11 https://github.com/BenjaminW3/vecadd
Alpaka CUDA PTX:
mov.u32    %r3, %ctaid.x;
mov.u32    %r4, %ntid.x;
mov.u32    %r5, %tid.x;
mad.lo.s32 %r1, %r4, %r3, %r5;
setp.ge.s32 %p1, %r1, %r2;
@%p1 bra BB6_2;
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
mul.wide.s32       %rd5, %r1, 8;
add.s64            %rd6, %rd4, %rd5;
ld.global.f64      %fd2, [%rd6];
add.s64            %rd7, %rd3, %rd5;
ld.global.f64      %fd3, [%rd7];
fma.rn.f64         %fd4, %fd2, %fd1, %fd3;
st.global.f64     [%rd7], %fd4;

Native CUDA PTX:
mov.u32    %r3, %ctaid.x;
mov.u32    %r4, %ntid.x;
mov.u32    %r5, %tid.x;
mad.lo.s32 %r1, %r4, %r3, %r5;
setp.ge.s32 %p1, %r1, %r2;
@%p1 bra BB6_2;
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
mul.wide.s32       %rd5, %r1, 8;
add.s64            %rd6, %rd4, %rd5;
ld.global.nc.f64   %fd2, [%rd6];
add.s64            %rd7, %rd3, %rd5;
ld.global.f64      %fd3, [%rd7];
fma.rn.f64         %fd4, %fd2, %fd1, %fd3;
st.global.f64     [%rd7], %fd4;
Figure 4: Snippet of the PTX code of Alpaka and
CUDA kernels. The PTX code is the same up to the
line where the CUDA PTX uses non coherent tex-
ture cache. This cache allows for access with higher
bandwidth and lower latency.
The assembler code of the native C++ implementation
does not exactly match the Alpaka assembler code, since
only the native implementation has been vectorized to
use the packed double precision SSE2 instructions
movupd, mulpd and addpd instead of the single value
versions movsd, mulsd and addsd. However, by looping
over the additional element level of the Alpaka
abstraction model, which has a constant size, the
compiler recognizes the iteration independent looping
pattern and optimizes it by using SIMD instructions to
process multiple consecutive iterations together.
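The effect of a constant-size element level can be sketched in plain C++ as follows. This is an illustration under assumptions, not Alpaka code: the chunk size `kElemsPerThread = 4` and the function names are hypothetical, and whether the inner loop is actually turned into packed SIMD instructions depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements processed per thread. It is a compile-time constant so
// that the trip count of the inner loop is known to the compiler. The value
// 4 is an assumption made for this sketch.
constexpr std::size_t kElemsPerThread = 4;

// One "thread" of a DAXPY kernel that processes a fixed-size chunk.
// The iterations of the inner loop are independent and the bounds are
// constant, which is the pattern auto-vectorizers typically look for.
inline void daxpyThread(std::size_t threadIdx, std::size_t n, double alpha,
                        double const* x, double* y) {
    std::size_t const first = threadIdx * kElemsPerThread;
    if (first + kElemsPerThread <= n) {
        for (std::size_t e = 0; e < kElemsPerThread; ++e) {
            y[first + e] = alpha * x[first + e] + y[first + e];
        }
    } else {
        // Remainder chunk at the end of the vector.
        for (std::size_t i = first; i < n; ++i) {
            y[i] = alpha * x[i] + y[i];
        }
    }
}

// Runs all threads one after another (sequential stand-in for a back-end).
inline void daxpyElements(std::size_t n, double alpha, double const* x,
                          double* y) {
    assert(x != nullptr && y != nullptr);
    std::size_t const numThreads = (n + kElemsPerThread - 1) / kElemsPerThread;
    for (std::size_t t = 0; t < numThreads; ++t) {
        daxpyThread(t, n, alpha, x, y);
    }
}
```

Inspecting the generated assembler (for example with `-O3 -S`) is the way to check whether packed instructions were emitted for a given compiler.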
4.2 Performance
As a next step, the performance characteristics of
the CUDA and OpenMP Alpaka back-ends are evaluated.
First, an algorithm is implemented for both Alpaka and
the particular native API to show the pure Alpaka
overhead in numbers. Then, the native Alpaka kernel is
mapped to the non-native back-end to show that Alpaka
is not naïvely performance portable. Afterwards, an
enhanced single source Alpaka kernel is introduced and
mapped to various architectures, and it is shown that
it can match the performance of the native
implementations when using the appropriate Alpaka
back-ends.
Table 3: List of utilized accelerator hardware for evaluation. Clock frequencies in
parentheses denote the turbo frequency of the particular architecture. Often turbo can
only be utilized when not all cores of a device are busy.

Vendor                        AMD             Intel          Intel                   NVIDIA        NVIDIA
Architecture                  Opteron 6276    Xeon E5-2609   Xeon E5-2630v3          K20 GK110     K80 GK210
Number of devices             4               2              2                       1             2
Number of cores per device    16              4              8 (16 hyper-threads)    2496          2x2496
Clock frequency               2.3 (3.2) GHz   2.4 GHz        2.4 (3.2) GHz                         0.56 (0.88) GHz
Release date                  Q4/2011         Q1/2012        Q3/2014                 Q4/2012       Q4/2014
Th. double peak performance   480 GFLOPS      150 GFLOPS     540 GFLOPS              1170 GFLOPS   2x1450 GFLOPS

For comparison, the double precision general
matrix-matrix multiplication (DGEMM) has been selected
as a compute bound problem. DGEMM computes
C ← αAB + βC, where A = (a_{i,j}), B = (b_{i,j}) and C
are matrices while α and β are scalars.
DGEMM implementations can utilize all levels of
parallelism on GPUs and CPUs ranging from block-
ing over shared memory to vectorization, providing
a perfect showcase of these techniques within the Al-
paka library. All DGEMM implementations are avail-
able in our GitHub repository 12.
All input matrices are dense and always have
square extents to minimize bias towards implemen-
tations preferring column- or row-major layout. Ini-
tially, the matrices are filled with random values in
the range [0.0, 10.0]. The matrices are mapped to
1D memory buffers with Alpaka aligning rows to op-
timum memory boundaries. Measurements do not
include times for allocating the matrices on the host,
filling them, a possible data transfer between the pro-
cessor and a co-processor as well as device and stream
initialization.
4.2.1 Zero Overhead Abstraction
Alpaka does not add additional overhead to the al-
gorithm execution time. In order to show this zero
overhead, native CUDA and OpenMP 2 kernels were
translated one-to-one to Alpaka kernels.
The CUDA kernels use a block parallelized tiling
algorithm based on the CUDA programming guide
([3], Sec. 3.2.3) and were executed on a compute node
with an NVIDIA K80 GK210. The OpenMP kernels use a
standard DGEMM algorithm with nested for loops and were
executed on a compute node with two Intel E5-2630v3
CPUs. The kernels were executed with an increasing
matrix size and their execution time was measured.
Figure 5 shows the speed of the Alpaka kernels mapped
to the corresponding back-end relative to their native
implementations.

12 https://github.com/BenjaminW3/matmul

[Figure 5 plot: "Less than 6% overhead compared to native DGEMM implementation". x-axis: matrix extents n=m=k (0 to 7000); y-axis: speedup relative to native (0.8 to 1.2). Series: Alpaka(OMP2) with native OMP2 style kernel on Intel E2630v3; Alpaka(CUDA) with native CUDA style kernel on NVIDIA K80; native implementations.]
Figure 5: The native Alpaka kernels were mapped to
their corresponding native back-ends and compared
to the native implementations. Both kernels show a
relative overhead of less than 6%, which supports the
claim of a near-zero overhead abstraction.
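For reference, the nested for loop DGEMM mentioned above can be sketched as follows. This is a generic textbook formulation and not the code from the repository; the row-major layout, the use of `std::vector` and the function name `dgemmNaive` are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C <- alpha * A * B + beta * C for square n x n matrices that are
// stored row-major in 1D buffers.
void dgemmNaive(std::size_t n, double alpha, std::vector<double> const& a,
                std::vector<double> const& b, double beta,
                std::vector<double>& c) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    // An OpenMP version would typically parallelize this outer loop.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}
```

Each of the n² result elements needs n multiply-add pairs, which is why DGEMM is commonly counted as roughly 2n³ floating point operations.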
The native CUDA Alpaka kernel provides more than 94%
relative performance for almost all matrix sizes, which
is an overhead of 6% or less. Inspection of the
compiled PTX code showed that this overhead results
from move and forward operators being translated to
copies. These operators are used for grid index
calculations within an Alpaka kernel. Furthermore, the
Alpaka CUDA back-end requires a small number of
additional CUDA runtime calls. The native OpenMP Alpaka
kernel provides an average relative performance of
100%.
[Figure 6 plot: "Native style kernel shows bad performance mapped to opposite back-end". x-axis: matrix extents n=m=k (0 to 7000); y-axis: speedup relative to native (0 to 0.2). Series: Alpaka(OMP2) with native CUDA style kernel on Intel E2630v3; Alpaka(CUDA) with native OMP2 style kernel on NVIDIA K80.]
Figure 6: The native Alpaka kernels with swapped
back-ends lead to poor performance. Alpaka does not
guarantee performance portability when data access,
work division and cache hierarchies are not considered.
One-to-one translation of a particular algorithm to
an Alpaka kernel demonstrates a minimal amount of
overhead compared to the native implementation on
the same architecture. However, Alpaka does not
guarantee that such a kernel will also show the same
run-time characteristics when it is mapped onto an-
other back-end, as it neither provides optimized nor
architecture dependent data access and work divi-
sion automatically. Figure 6 shows the performance
of the previously used kernels when their back-ends
are swapped relative to the native implementation13.
The Alpaka kernels originally translated from the
opposite back-end do not perform well. There are at
least two reasons why these mappings are not
performance portable. First, the back-ends require
completely different data access patterns to achieve
optimum data access performance, e.g. strided data
access in CUDA. Second, the amount of data a single
thread can process effectively differs because of
different cache sizes and hierarchies and varying
optimal work divisions.
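The difference between the two access patterns can be made concrete with a small index-mapping sketch. The function names and the two mappings are illustrative assumptions and not part of Alpaka: in the chunked mapping each thread walks a contiguous range, which suits a CPU core and its cache; in the strided mapping neighbouring threads touch neighbouring addresses in the same step, which is the pattern GPUs can coalesce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Chunked mapping: every thread owns one contiguous range of indices.
std::vector<std::size_t> chunkedIndices(std::size_t thread,
                                        std::size_t numThreads,
                                        std::size_t n) {
    assert(numThreads > 0 && thread < numThreads);
    std::size_t const chunk = (n + numThreads - 1) / numThreads;
    std::vector<std::size_t> indices;
    for (std::size_t i = thread * chunk; i < n && i < (thread + 1) * chunk; ++i) {
        indices.push_back(i);
    }
    return indices;
}

// Strided mapping: thread t touches t, t + numThreads, t + 2 * numThreads, ...
std::vector<std::size_t> stridedIndices(std::size_t thread,
                                        std::size_t numThreads,
                                        std::size_t n) {
    assert(numThreads > 0 && thread < numThreads);
    std::vector<std::size_t> indices;
    for (std::size_t i = thread; i < n; i += numThreads) {
        indices.push_back(i);
    }
    return indices;
}
```

For 8 elements and 4 threads, thread 1 processes indices 2 and 3 in the chunked mapping, but indices 1 and 5 in the strided mapping. A kernel that hard-codes one of the two mappings carries that choice to every back-end it is compiled for.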
Nevertheless, it is possible to write competitive
code for each back-end. Both the NVIDIA CUDA (nvcc) and
the gcc compiler remove all the abstraction layers
introduced by Alpaka.

13 In this case the triple nested loop is compiled using the
CUDA back-end, while the tiled shared-memory version is
mapped to OpenMP.
A naïve port of a kernel to an architecture it was
not meant to be executed on will almost always lead
to poor performance. Thus, providing a single,
performance portable kernel is not trivial. The
following section shows that Alpaka is able to provide
performance for various back-ends with a single source
kernel.
4.2.2 Single Source Kernel / Performance
It is possible to write a single source kernel that per-
forms well on all tested Alpaka back-ends without
a drop in performance compared to the native im-
plementations. In order to reach this performance,
the developer needs to abstract the access to data,
optimize the work division, and consider cache hi-
erarchies. The single source Alpaka DGEMM ker-
nel implements a tiling matrix-matrix multiplication
algorithm and considers the architecture cache sizes
by adapting the number of elements processed per
thread or block and the size of the shared memory to
provide minimal access latency. Many processor
architectures benefit from the Alpaka element level
parallelism, which lets a thread calculate multiple
elements in parallel in the vector processing unit.
Figure 7 provides a brief description of the hierar-
chical tiling algorithm. A block calculates the result
of a tile in matrix C. Each thread in this block loads
a set of elements of matrices A and B into shared
memory to increase memory reuse. It then calculates
the partial results of its set of elements before the
block continues with the next tiles of A and B.
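The tiling scheme described above can be sketched in sequential C++. This is a simplified host-only illustration and not the Alpaka kernel itself: the tile size `kTile = 4`, the local arrays that stand in for shared memory, and the name `dgemmTiled` are assumptions, and blocks are executed one after another instead of in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tile extent; the value 4 is an assumption chosen for this sketch.
constexpr std::size_t kTile = 4;

// C <- alpha * A * B + beta * C for square n x n row-major matrices.
void dgemmTiled(std::size_t n, double alpha, std::vector<double> const& a,
                std::vector<double> const& b, double beta,
                std::vector<double>& c) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    // Each (bi, bj) pair corresponds to one block computing one tile of C.
    for (std::size_t bi = 0; bi < n; bi += kTile) {
        for (std::size_t bj = 0; bj < n; bj += kTile) {
            double acc[kTile][kTile] = {};  // partial results of the C tile
            for (std::size_t bk = 0; bk < n; bk += kTile) {
                // Stand-ins for shared memory, zero padded at the borders.
                double tileA[kTile][kTile] = {};
                double tileB[kTile][kTile] = {};
                for (std::size_t r = 0; r < kTile; ++r) {
                    for (std::size_t s = 0; s < kTile; ++s) {
                        if (bi + r < n && bk + s < n) {
                            tileA[r][s] = a[(bi + r) * n + (bk + s)];
                        }
                        if (bk + r < n && bj + s < n) {
                            tileB[r][s] = b[(bk + r) * n + (bj + s)];
                        }
                    }
                }
                // Partial products for the current pair of tiles.
                for (std::size_t i = 0; i < kTile; ++i) {
                    for (std::size_t j = 0; j < kTile; ++j) {
                        for (std::size_t k = 0; k < kTile; ++k) {
                            acc[i][j] += tileA[i][k] * tileB[k][j];
                        }
                    }
                }
            }
            // Write the finished tile back to C.
            for (std::size_t i = 0; i < kTile && bi + i < n; ++i) {
                for (std::size_t j = 0; j < kTile && bj + j < n; ++j) {
                    std::size_t const idx = (bi + i) * n + (bj + j);
                    c[idx] = alpha * acc[i][j] + beta * c[idx];
                }
            }
        }
    }
}
```

Every element loaded into `tileA` and `tileB` is reused `kTile` times before the next tiles are fetched, which is the memory reuse the algorithm aims for. Adapting `kTile` per architecture corresponds to the cache size tuning mentioned in the text.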
Figure 8 shows the performance of the tiling kernel
mapped to the CUDA and OpenMP back-ends relative to the
original native implementations. No performance loss
compared to the native implementations is observed;
instead, a performance gain is seen in the majority of
cases. This is due to the more descriptive nature of
the Alpaka kernel, which enables further optimizations
by the back-end compilers.
[Figure 7 diagram: matrices A, B and C, with labels "Tile in global memory", "Tile in shared memory", "Elements per thread", "Blocks" and "Threads".]
Figure 7: An Alpaka optimized hierarchically tiled
matrix-matrix multiplication algorithm with multiple
elements per thread. A block loads tiles of the A and
B matrix into its shared memory to increase memory
reuse. A thread can calculate multiple elements by
using the vector processing unit of its particular
back-end.

It is clear that there exist even more optimized
versions of the algorithm, e.g. in mathematical
libraries such as cuBLAS, which is fine tuned for
different compute capabilities of NVIDIA GPUs, or MKL,
which is an optimized OpenMP kernel library. These
provide higher peak performance than Alpaka, but may
require additional setup (cuBLAS data transfers) or
include implicit, and maybe unwanted, data migration
between the host and the device. Nevertheless, if it
should be necessary to use one of these optimized
algorithms, it is possible to use them with Alpaka as
well by utilizing template specialization within
Alpaka kernels.
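How template specialization can route one back-end to an optimized routine may be sketched as follows. None of the names below come from Alpaka or from a vendor library: `GenericAcc`, `VendorAcc`, `Gemm` and `vendorDgemmStub` are hypothetical, and the stub only stands in for a call into a tuned library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical accelerator tags.
struct GenericAcc {};
struct VendorAcc {};

// Portable reference implementation: C <- alpha * A * B + beta * C,
// square n x n row-major matrices.
inline void referenceDgemm(std::size_t n, double alpha,
                           std::vector<double> const& a,
                           std::vector<double> const& b, double beta,
                           std::vector<double>& c) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}

// Stub standing in for a call into an optimized library routine. A real
// program would forward to the library here; the sketch reuses the
// reference code so that it stays self-contained.
inline void vendorDgemmStub(std::size_t n, double alpha,
                            std::vector<double> const& a,
                            std::vector<double> const& b, double beta,
                            std::vector<double>& c) {
    referenceDgemm(n, alpha, a, b, beta, c);
}

// Primary template: used for every accelerator without a specialization.
template <typename TAcc>
struct Gemm {
    static constexpr bool usesLibrary() { return false; }
    static void run(std::size_t n, double alpha, std::vector<double> const& a,
                    std::vector<double> const& b, double beta,
                    std::vector<double>& c) {
        referenceDgemm(n, alpha, a, b, beta, c);
    }
};

// Specialization: this accelerator is routed to the library stub.
template <>
struct Gemm<VendorAcc> {
    static constexpr bool usesLibrary() { return true; }
    static void run(std::size_t n, double alpha, std::vector<double> const& a,
                    std::vector<double> const& b, double beta,
                    std::vector<double>& c) {
        vendorDgemmStub(n, alpha, a, b, beta, c);
    }
};
```

Code that calls `Gemm<TAcc>::run(...)` stays identical for all accelerators; the choice between the portable and the library path is made at compile time from the accelerator type.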
4.2.3 Performance Portability
[Figure 8 plot: "Single source tiling kernel competes with native implementations". x-axis: matrix extents n=m=k (0 to 7000); y-axis: speedup relative to native (0 to 4). Series: Alpaka(CUDA) with tiling 4 elements on NVIDIA K80; Alpaka(CUDA) with tiling 1 element on NVIDIA K80; Alpaka(OMP2) with tiling 16k elements on Intel E2630v3; Alpaka(OMP2) with tiling 256 elements on Intel E2630v3; native implementations.]
Figure 8: The Alpaka single source DGEMM kernel
implements a hierarchical tiling matrix-matrix
multiplication algorithm. This kernel can compete with
and even outperform the original native implementations
on all tested back-ends.

Figure 9 shows the performance of the Alpaka tiling
kernel executed on varying architectures relative to
the theoretical peak performance of the corresponding
architecture. The kernel work division was selected in
a way that provides good performance for the particular
architecture. CPU devices were accelerated by the
OpenMP 2 back-end, while NVIDIA devices were
accelerated by the CUDA back-end. The performance of
all architectures lies around 20% of the theoretical
peak performance. This shows that a single Alpaka
kernel using all levels of the abstraction model
together with optimized data access patterns is able to
provide performance portability over various
architectures.
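The relation between a measured run time and the fraction of theoretical peak can be sketched as below. The sketch assumes the common convention of counting 2n³ floating point operations for a square DGEMM of extent n, which the text does not state explicitly; the function names are hypothetical.

```cpp
#include <cassert>

// Achieved GFLOPS of a square DGEMM with extent n that took `seconds`.
// Assumes 2 * n^3 floating point operations (one multiply and one add per
// inner-loop iteration); lower-order terms are ignored.
inline double dgemmGflops(double n, double seconds) {
    assert(seconds > 0.0);
    return 2.0 * n * n * n / seconds / 1.0e9;
}

// Fraction of the theoretical peak performance that was reached.
inline double fractionOfPeak(double achievedGflops, double peakGflops) {
    assert(peakGflops > 0.0);
    return achievedGflops / peakGflops;
}
```

As a purely hypothetical example (not a measurement from this evaluation), a run with n = 1000 that takes 0.02 s corresponds to about 100 GFLOPS, which would be roughly 19% of a 540 GFLOPS peak.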
4.3 Real World Example
HASEonGPU is an open-source adaptive massively
parallel multi-GPU Monte Carlo integration algo-
rithm for computing the amplified spontaneous emis-
sion (ASE) flux in laser gain media pumped by pulsed
lasers14.
The source code consists of about ten thousand
lines of code and was ported to Alpaka
(HASEonAlpaka) in three weeks by one person. After
porting, HASEonAlpaka was successfully executed on GPU
and CPU clusters.
Figure 10 shows the relative speed of a HASEonAlpaka
computation executed with identical parameters on
different systems. The original native CUDA version is
used as the basis for comparison. The Alpaka version
using the CUDA back-end running on the same NVIDIA K20
GK110 cluster as the native version does not show any
overhead, leading to identical execution times.
14 https://github.com/ComputationalRadiationPhysics/haseongpu

[Figure 9 plot: "Performance portability with single source kernel on all architectures". x-axis: matrix extents n=m=k (0 to 7000); y-axis: performance relative to th. peak performance (0 to 0.35). Series: Alpaka(CUDA) with tiling 4 elements on NVIDIA K80; Alpaka(CUDA) with tiling 4 elements on NVIDIA K20; Alpaka(OMP2) with tiling 16k elements on Intel E2609v1; Alpaka(OMP2) with tiling 16k elements on Intel E2630v3; Alpaka(OMP2) with tiling 16k elements on AMD Opteron 6276.]
Figure 9: Performance of the Alpaka kernel executed
on various architectures relative to the theoretical
peak performance of the corresponding architecture.
The Alpaka kernel provides about 20% relative peak
performance on all measured architectures.
On the Intel Xeon E5-2630v3 and AMD Opteron 6276
clusters, the OpenMP 2 back-end is used without support
for thread level parallelism, which is not required
here, i.e. each block contains exactly one thread
computing multiple elements. This maps well to the
CPUs' capabilities for independent vectorized
parallelism and leads to very good results. The nearly
doubled time to solution on both the Intel and AMD
clusters is on par with the halved double precision
peak performance of those systems relative to the
NVIDIA cluster used as reference.
[Figure 10 plot: "HASEonGPU ported to Alpaka shows performance portability". Categories: CUDA native; Alpaka(CUDA) on K20; Alpaka(OMP2) on Opteron 6276; Alpaka(OMP2) on E2630v3. Axes: Performance [Gflop/s] (0 to 1400) and Speedup (0 to 1.2). Series: Hardware Peak Performance; Application Speedup.]
Figure 10: HASEonGPU was ported to Alpaka
within three weeks by one person. The application
shows almost perfect performance portability on all
evaluated platforms.
5 Conclusion
We have presented the abstract C++ interface Alpaka
and its implementations for parallel kernel execution
across multiple hierarchy levels on a single compute
node. We have demonstrated platform and performance
portability for all studied use cases. A single source
Alpaka DGEMM implementation provides around 20% of the
theoretical peak performance on AMD, Intel and NVIDIA
hardware, being on par with the respective native
implementations. Moreover, performance measurements of
a real world application translated to Alpaka
demonstrated that Alpaka can be used to write
performance portable code.
Performance portability, maintainability, sustain-
ability and testability were reached through the us-
age of C++ metaprogramming techniques abstract-
ing the variations in the underlying architectures.
Alpaka code is sustainable, optimizable and eas-
ily extendable to support even more architectures
through the use of C++ template specialization.
It is data structure agnostic and provides a simple
pointer based memory model that requires explicit
deep copies between memory levels.
Future work will focus on including more Alpaka
back-ends, e.g. for OpenACC and OpenMP 4.x target
offloading, and on studying performance portability
for additional architectures (e.g. Intel Xeon Phi and
OpenPower) and applications.
Alpaka is an open-source project and available in
our GitHub repository15.
15 https://github.com/ComputationalRadiationPhysics/alpaka

References
[1] Nathan Bell and Jared Hoberock. Thrust: Productivity-oriented library for CUDA. Astrophysics Source Code Library, 1:12014, 2012.
[2] Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. GPU application development, debugging, and performance tuning with GPU Ocelot. GPU Computing Gems Jade Edition, pages 409–427, 2011.
[3] NVIDIA Corporation. NVIDIA CUDA C Programming Guide Version 7.0. http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf, March 2015. [Online; accessed May 20, 2015].
[4] Kamil Rocki, Martin Burtscher, and Reiji Suda. The future of accelerator programming: abstraction, performance or can we have both? In Proceedings of the 29th Annual ACM Symposium on Applied Computing, pages 886–895. ACM, 2014.
[5] E. Zenker, C. Eckert, D. Albach, and M. Bussmann. HASEonGPU - High performance Amplified Spontaneous Emission on GPU. http://dx.doi.org/10.5281/zenodo.13964, 2015. [Online; accessed September 26, 2015].
