K-Athena: a performance portable structured grid finite volume
  magnetohydrodynamics code by Grete, Philipp et al.
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 1
K-ATHENA: a performance portable structured
grid finite volume magnetohydrodynamics code
Philipp Grete, Forrest W. Glines, and Brian W. O’Shea
Abstract—Large scale simulations are a key pillar of modern research and require ever increasing computational resources. Different
novel manycore architectures have emerged in recent years on the way towards the exascale era. Performance portability is required
to prevent repeated non-trivial refactoring of a code for different architectures. We combine ATHENA++, an existing
magnetohydrodynamics (MHD) CPU code, with KOKKOS, a performance portable on-node parallel programming paradigm, into
K-ATHENA to allow efficient simulations on multiple architectures using a single codebase. We present profiling and scaling results for
different platforms including Intel Skylake CPUs, Intel Xeon Phis, and NVIDIA GPUs. K-ATHENA achieves > 108 cell-updates/s on a
single V100 GPU for second-order double precision MHD calculations, and a speedup of 30 on up to 24,576 GPUs on Summit
(compared to 172,032 CPU cores), reaching 1.94× 1012 total cell-updates/s at 76% parallel efficiency. Using a roofline analysis we
demonstrate that the overall performance is currently limited by DRAM bandwidth and calculate a performance portability metric of
83.1%. Finally, we present the implementation strategies used and the challenges encountered in maximizing performance. This will
provide other research groups with a straightforward approach to prepare their own codes for the exascale era. K-ATHENA is available
at https://gitlab.com/pgrete/kathena.
Index Terms—
F
1 INTRODUCTION
THE era of exascale computing is approaching. Differentprojects around the globe are working on the first
exascale supercomputers, i.e., supercomputers capable of
conducting 1018 floating point operations per second. This
includes, for example, the Exascale Computing Initiative
working with Intel and Cray on Aurora as the first exascale
computer in the US in 2021, the EuroHPC collaboration
working on building two exascale systems in Europe by
2022/2023, Fujitsu and RIKEN in Japan working on the
Post-K machine to launch in 2021/2022, and China who
target 2020 for their first exascale machine. While the exact
architectural details of these machines are not announced
yet and/or are still under active development, the overall
trend in recent years has been manycore architectures. Here,
manycore refers to an increasing number of (potentially
simpler) cores on a single compute node and includes
CPUs (e.g., Intel’s Xeon Scalable Processor family or AMD’s
Epyc family), accelerators (e.g., the now discontinued Intel
Xeon Phi line), and GPUs for general purpose computing.
MPI+OPENMP has been the prevailing parallel program-
ming paradigm in many areas of high performance com-
puting for roughly two decades. It is questionable, however,
whether this generic approach will be capable of making
efficient use of available hardware features such as parallel
threads and vectorization across different manycore archi-
tectures and between nodes.
• P. Grete, F. W. Glines, and B. .W. O’Shea are with the Department of
Physics and Astronomy, Michigan State University, East Lansing, MI
48824 USA, also with the Department of Computational Mathematics,
Science and Engineering, Michigan State University, East Lansing, MI
48824 USA.
E-mail: grete@pa.msu.edu
• B. W. O’Shea is also with the National Superconducting Cyclotron
Laboratory, Michigan State University, East Lansing, MI 48824 USA.
Manuscript received . . . ; revised . . . .
In addition to extensions of the MPI standard such as
shared-memory parallelism, several approaches in addition
to MPI+OPENMP exist and are being actively developed
to address either on-node, inter-node, or both types of par-
allelism. These include, for example, partitioned global ad-
dress space (PGAS) programming models such as UPC++
[1], or parallel programming frameworks such as CHARM++
or LEGION, which are based on message-driven migratable
objects [2], [3].
Our main goal is a performance portable version of
the existing MPI+OPENMP finite volume (general relativ-
ity) magnetohydrodynamics (MHD) code ATHENA++ [4],
[5]. This goal includes enabling GPU-accelerated simula-
tions while maintaining CPU performance using a single
code base. More generally, performance portability refers
to achieving consistent levels of performance across het-
erogeneous platforms using as little architecture-dependent
code as possible [6]. In order to keep the code changes
minimal and given the MPI+OPENMP basis of ATHENA++
we decided to keep MPI for inter-node parallelism and
focus on on-node performance portability.
For on-node performance portability several libraries
and programming language extensions exist. With version
4.5 OPENMP [7] has been extended to support offloading
to devices such as GPUs, but support and maturity is still
highly compiler and architecture dependent. This similarly
applies to OPENACC, which has been designed from the
beginning to target heterogeneous platforms. While these
two directives-based programming models are generally
less intrusive with respect to the code base, they only ex-
pose a limited fraction of various platform specific features.
OPENCL [8] is much more flexible and allows fine grained
control over hardware features (e.g., threads), but this, on
the other hand, adds substantial complexity to the code.
KOKKOS [9] and RAJA [10] try to combine the strength of
ar
X
iv
:1
90
5.
04
34
1v
1 
 [c
s.D
C]
  1
0 M
ay
 20
19
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 2
flexibility with ease of use by providing abstractions in the
form of C++ templates. Both KOKKOS and RAJA focus on
abstractions of parallel regions in the code, and KOKKOS
additionally provides abstractions of the memory hierarchy.
At compile time the templates are translated to different
(native) backends, e.g., OPENMP on CPUs or CUDA on
NVIDIA GPUs.
We chose KOKKOS for the refactoring of ATHENA++
for several reasons. KOKKOS offers the highest level of
abstraction without forcing the developer to use it by set-
ting reasonable implicit platform defaults. Moreover, the
KOKKOS core developer team actively works on integrat-
ing the programming model into the C++ standard. New,
upcoming features, e.g., in OPENMP, will replace manual
implementations in the KOKKOS OPENMP backend over
time. KOKKOS is already used in several large projects to
achieve performance portability, e.g., the scientific software
building block collection Trilinos [11] or the computational
framework for simulating chemical and physical reactions
Uintah [12]. In addition, KOKKOS is part of the DOE’s
Exascale Computing Project and we thus expect a backend
for Aurora’s new Intel Xe architecture when the system
launches. Finally, the KOKKOS community, including core
developers and users, is very active and supportive with re-
spect to handling issues, questions and offering workshops.
The resulting K-ATHENA code successfully achieves per-
formance portability across CPUs (Intel, AMD, and IBM),
Intel Xeon Phis, and NVIDIA GPUs. We demonstrate weak
scaling at 76% parallel efficiency on 24,576 GPUs on OLCF’s
Summit, reaching 1.94 × 1012 total cell-updates/s for a
double precision MHD calculation. Moreover, we calculate
a performance portability metric of 83.1% across Xeon Phis,
5 CPU generations, and 4 GPU generations. We make the
code available as an open source project1.
The paper is organized as follows. In Section 2 we
introduce KOKKOS, ATHENA++, and the changes made
and approach chosen in creating K-ATHENA. In Section 3
we present profiling, scaling and roofline analysis results.
Finally, we discuss current limitations and future enhance-
ments in Sec. 4 and make concluding remarks in Sec. 5.
2 METHOD
2.1 KOKKOS
KOKKOS is an open source2 C++ performance portability
programming model [9]. It is implemented as a template
library and offers abstractions for parallel execution of code
and data management. The core of the programming model
consists of six abstractions.
First, execution spaces define where code is executed. This
includes, for example, OPENMP on CPUs or Intel Xeon Phis,
CUDA on NVIDIA GPUs, or ROCm on AMD GPUs (cur-
rently experimental). Second, execution patterns are parallel
patterns, e.g. parallel_for or parallel_reduce, and
are the building blocks of any application that uses KOKKOS.
These parallel regions are often also referred to as kernels as
they can be dispatched for execution on execution spaces
1. K-ATHENA’s project repository is located at https://gitlab.com/
pgrete/kathena.
2. See https://github.com/kokkos for the library itself, associated
tools, tutorial and a wiki.
(such as GPUs). Third, execution policies determine how
an execution pattern is executed. There exist simple range
policies that only specify the indices of the parallel pattern
and the order of iteration (i.e., the fastest changing index
for multidimensional arrays). More complicated policies,
such as team policies, can be used for more fine-grained
control over individual threads and nested parallelism.
Fourth, memory spaces specify where data is located, e.g.,
in host/system memory or in device space such as GPU
memory. Fifth, the memory layout determines the logical
mapping of multidimensional indices to actual memory
location, cf., C family row-major order versus Fortran
column-major order. Sixth, memory traits can be assigned to
data and specify how data is accessed, e.g., atomic access,
random access, or streaming access.
These six abstractions offer substantial flexibility in fine-
tuning application, but the application developer is not al-
ways required to specify all details. In general, architecture-
dependent defaults are set at compile time based on the
information on devices and architecture provided. For ex-
ample, if CUDA is defined as the default execution space
at compile time, all Kokkos::Views, which are the funda-
mental multidimensional array structure, will be allocated
in GPU memory. Moreover, the memory layout is set to
column-major so that consecutive threads in the same warp
access consecutive entries in memory.
2.2 ATHENA++
ATHENA++ is a radiation general relativistic magnetohy-
drodynamics (GRMHD) code focusing on astrophysical ap-
plications [4], [5]. It is a rewrite in modern C++ of the
widely used ATHENA C version [13]. ATHENA++ offers a
wide variety of compressible hydro- and magnetohydro-
dynamics solvers including support for special and rela-
tivistic (M)HD, flexible geometries (Cartesian, cylindrical,
or spherical), and mixed parallelization with OPENMP and
MPI. Apart from the overall feature set, the main reasons
we chose ATHENA++ are a) its excellent performance on
CPUs and KNLs due to a focus on vectorization in the code
design, b) a generally well written and documented code
base in modern C++, c) point releases are publicly available
that contain many (but not all) features3, and d) a flexible
task-based execution model that allows for a high degree of
modularity.
ATHENA++’s parallelization strategy evolves around so-
called meshblocks. The entire simulation grid is divided
into smaller meshblocks that are distributed among MPI
processes and/or OPENMP threads. Each MPI processes (or
OPENMP thread) owns one or more meshblocks that can
be updated independently after boundary information have
been communicated. If hybrid parallelization is used, each
MPI process runs one or more OPENMP threads that each
are assigned one or more meshblock. This design choice is
often referred to as coarse-grained parallelization as threads
are used at a block (here meshblock) level and not over
loop indices. In general, ATHENA++ uses persistent MPI
3. Our code changes are based on the public version,
ATHENA++ 1.1.1, see https://github.com/PrincetonUniversity/
athena-public-version
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 3
communication handles to realize asynchronous communi-
cation. Moreover, each thread makes its own MPI calls to
exchange boundary information. As a result, using more
than one thread per MPI process may increase overall on-
node performance due to hyperthreading but also increases
both the number of MPI messages sent and the total amount
of data sent. The latter may result in overall worse parallel
performance and efficiency, as demonstrated in Sec 3.3.2.
Listing 1. Example triple for loop for a typical operation in a finite
volume method on a structured mesh such as in a code like ATHENA++,
where ks, ke, js, je, is, and ie are loop bounds and u is an
athena_array object of, for example, an MHD variable.
for( int k = ks; k < ke; k++){
for( int j = js; j < je; j++){
#pragma omp simd
for( int i = is; i < ie; i++){
/* Loop Body */
u(k,j,i) = ...
}}}
Given the coarse-grained OPENMP approach over
meshblocks the prevalent structures in the code base are
triple (or quadruple) nested for loops that iterate over the
content of each meshblock (and variables in the quadruple
case). A prototypical nested loop is illustrated in Listing 1.
Generally, all loops (or kernels) in ATHENA++ have been
written so that OPENMP simd pragmas are used for the
innermost loop. This helps the compiler in trying to auto-
matically vectorize the loops resulting in a more performant
application.
2.3 K-ATHENA = KOKKOS + ATHENA++
In order to combine ATHENA++ and KOKKOS, four ma-
jor changes in the code base were required: 1) making
Kokkos::Views the fundamental data structure, 2) con-
verting nested for loop structures to kernels, 3) converting
“support” functions, such as the equation of state, to inline
functions, and 4) converting communication buffer filling
functions into kernels.
First, Views are the KOKKOS’ abstraction of multidi-
mensional arrays. Thus, the multidimensional arrays orig-
inally used in ATHENA++, e.g., the MHD variables for
each meshblock, need to be converted to Views so that
these arrays can transparently be allocated in arbitrary
memory spaces such as device (e.g., GPU) memory or
system memory. ATHENA++ already implemented an ab-
stract athena_array class for all multidimensional ar-
rays with an interface similar to the interface of a View.
Therefore, we only had to add View objects as member
variables and to modify the functions of athena_arrays
to transparently use functions of those member Views. This
included using View constructors to allocate memory, us-
ing Kokkos::deep_copy or Kokkos::subview for copy
constructors and shallow slices, and creating public member
functions to access the Views. The latter is required in order
to properly access the data from within compute kernels.
Second, all nested for loop structures (see Listing 1 need
to be converted to so-called kernels, i.e., parallel region that
can be dispatched for execution by an execution space. As
described in Sec. 2.1 multiple execution policies are possible,
such as a multidimensional range policy (see Listing 2) or
a team policy that allows for more fine-grained control and
nested parallelism (see Listing 3).
Listing 2. Example for loop using KOKKOS. The loop body
is reformulated into a lambda function and passed into
Kokkos::parallel_for to execute on the target architecture.
The class Kokkos::MDRangePolicy specifies the loop bounds. The
array u is now a Kokkos::View, a KOKKOS building block that allows
transparent access to CPU and GPU memory. The loop body, i.e., the
majority of the code, remains mostly unchanged.
using namespace Kokkos;
parallel_for( MDRangePolicy<Rank<3>>
({ks,js,is},{ke,je,ie}),
KOKKOS_LAMBDA(int k, int j, int i){
/* Loop Body */
u(k,j,i) = ...
});
Listing 3. Another approach using KOKKOS’ nested team-based
parallelism through the Kokkos::TeamThreadRange and
Kokkos::ThreadVectorRange classes. This interface is closer
to the underlying parallelism used by the backend such as CUDA
blocks on GPUs and SIMD vectors on CPUs. Again, the loop body
remains mostly unchanged and u is now a Kokkos::View.
using namespace Kokkos;
parallel_for(team_policy(nk, AUTO),
KOKKOS_LAMBDA(member_type thread) {
const int k = thread.league_rank() + ks;
parallel_for(
TeamThreadRange<>(thread,js,je,
[&] (const int j) {
parallel_for(
ThreadVectorRange<>(thread,is,ie,
[=] (const int i) {
/* Loop Body */
u(k,j,i) = ...
});});});
Generally, the loop body remained mostly unchanged.
Given that it is not a priori clear what kind of execution
policy yields the best performance for a given implementa-
tion of an algorithm, we decided to implement a flexible
loop macro. That macro allows us to easily change the
execution policy for performance tests – see profiling results
in Sec. 3.3.1 and discussion in Sec. 4.
Third, all functions that are called within a kernel need
to be converted into inline functions (here, more specifically
using the KOKKOS_INLINE_FUNCTION macro). This is re-
quired because if the kernels are executed on a device such
as a GPU, the function need to be compiled for the device
(e.g., with a __device__ attribute when compiling with
CUDA). In ATHENA++, this primarily concerned functions
such as the equation of state and coordinate system-related
functions.
Fourth, ATHENA++ uses persistent communication
buffers (and MPI handles) to exchange data between pro-
cesses. Originally, these buffers resided in the system mem-
ory and were filled directly from arrays residing in the
system memory. In the case where a device (such as a GPU)
is used as the primary execution space and the arrays should
remain on the device to reduce data transfers, the buffer
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 4
filling functions need to be converted too. Thus, we changed
all buffers to be Views and converted the buffer filling
functions into kernels that can be executed on any execution
space. In addition, this allows for CUDA-aware MPI– GPU
buffers can be directly copied between the memories of
GPUs (both on the same node and on different nodes)
without an implicit or explicit copy of the data to system
memory.
In general, the first three changes above are required
in refactoring any legacy code to make use of KOKKOS.
The fourth change was required for ATHENA++ due to the
existing MPI communication patterns.
Finally, for the purpose of the initial proof-of-concept,
we only refactored the parts required for running hydrody-
namic and magnetohydrodynamic simulations on static and
adaptive Cartesian meshes. Running special and general rel-
ativistic simulations on spherical or cylindrical coordinates
is currently not supported. However, the changes required
to allow for these kind of simulations are straightforward
and we encourage and support contributions to re-enable
this functionality.
2.4 Development process and tools
Overall, the development process was driven and accom-
panied by several key decisions. First, we wanted to be
able to continuously track potential overhead and changes
in performance. Therefore, we added so-called KOKKOS
profiling regions to the original codebase covering all rel-
evant parts of the code, e.g., the main loop, communication,
reconstruction, and the Riemann solve. This allowed us to
obtain detailed timings (across MPI processes) with the
KOKKOS profiling tools of all parts of the code including
those that were not converted to make use of KOKKOS yet.
In addition to manual regions, all kernels are automatically
profiled once a parallel execution pattern is used.
In order to ensure that the code keeps producing correct
results we employ automated regression testing using Git-
Lab’s continuous integration features. ATHENA++ already
provides a regression test suite cover many aspect of the
code. For K-ATHENA we extended this suite to specifically
address changes related to KOKKOS. For example, we test
different execution policies for the kernels, KOKKOS’ par-
tition master for multi-threaded operation, and run tests
both on CPUs and GPUs. The tests are automated and
are triggered for every Git merge request. Moreover, we
required that any merge request that added functionality
or converted parts of the code to make use of KOKKOS
needs to be accompanied by a new test case addressing that
change. NVIDIA’s unified memory allowed us to run tests
on GPUs without having converted the entire code. With
unified memory a virtual memory space is available that
can be accessed from both CPU and GPU. Page migrations
occur automatically and transparently, i.e., no manual data
movement is required.
In general, unified memory eased our refactoring ex-
perience. Originally, we used Kokkos::DualViews in the
transition. DualViews store data in two different memory
spaces, e.g., on the system memory and GPU memory. How-
ever, synchronization between the memory spaces requires
manually specifying which memory space is needed and
when they have been modified at different points in the
algorithm. This introduced additional code overhead. For
this reason, we decided to discontinue using DualViews
and instead used normal Views with default allocations in
unified memory space. In addition, being able to change
the default memory space from unified memory to pure
GPU memory at compile time allowed us to easily identify
all parts of the code that need to be converted by tracking
segmentation faults with a debugger.
3 RESULTS
If not noted otherwise, all results in this section have been
obtained using a double precision, shock-capturing, unsplit,
adiabatic MHD solver consisting of Van Leer integration,
piecewise linear reconstruction, Roe Riemann solver, and
constrained transport for the integration of the induction
equation (see, e.g., [14] for more details). The test problem
is a linear fast magnetosonic wave on a static, structured,
three-dimensional grid. Unified memory in GPU runs was
not used, i.e., there was no data transfer between system and
GPU memory after the problem initialization. Generally,
we used the Intel compilers on Intel platforms, and gcc
and nvcc on other platforms as we found that (recent)
Intel compilers are more effective in automatic vectorization
than (recent) gcc compilers. We used the identical software
environment and compiler flags for both K-ATHENA and
ATHENA++ where possible. Details are listed in Table 1.
We used ATHENA++ version 1.1.1 (commit 4d0e425) and
K-ATHENA commit 73fec12d for the scaling tests. Addi-
tional information on how to run K-ATHENA on different
machines can be found in the documentation.
3.1 Profiling
In order to evaluate the effect on performance of the dif-
ferent loop structures presented in Sec. 2.3 we compare the
timings of different regions within the main loop of the code.
The results using both an NVIDIA V100 GPU and an Intel
Skylake CPU for a selection of the computationally most
expensive regions are shown in Fig. 1. The 1DRange loop
structure refers to a one dimensional range policy over a
single index that is explicitly unpacked to the multidimen-
sional indices in the code. While this 1DRange is the fastest
loop structure for all regions on the GPU, it is the slowest for
all regions on the CPU. We suspect that this is related to the
inability of the compiler to vectorize the loops on CPUs. All
other loop structures tested, i.e., simd-for, MDRange, and
TeamPolicy logically separate the nested loops and, thus,
make it easier for the compiler to automatically vectorize
the innermost loop. This also explains why the results for
simd-for, MDRange, and TeamPolicy are very close to
each other for all regions except the Riemann solver. Given
that the Riemann solve is the most complex loop/kernel, we
again suspect that the compiler is unable to vectorize that
loop if KOKKOS templates add additional complexity to the
loop structure. We expect that these differences will become
less pronounced with continued compiler development.
On the GPU, MDRange is the slowest loop structure,
being several times slower than the 1DRange across all
regions. TeamPolicy is on par with 1DRange for half of
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 5
TABLE 1
Software Environment and Compiler Flags Used In Scaling Tests.
Machine Compiler Compiler flags MPI version
Summit GPU GCC 6.4.0 & Cuda 9.2.148 -O3 -std=c++11 -fopenmp -Xcudafe
--diag_suppress=esa_on_defaulted_function_ignored
-expt-extended-lambda -arch=sm_70 -Xcompiler
Spectrum MPI 10.2.0.11
Summit CPU GCC 8.1.1 -O3 -std=c++11 -fopenmp-simd -fwhole-program
-flto -ffast-math -fprefetch-loop-arrays
-fopenmp -mcpu=power9 -mtune=power9
Spectrum MPI 10.2.0.11
Titan GPU GCC 6.3.0 & Cuda 9.1.85 -O3 -std=c++11 -fopenmp -Xcudafe
--diag_suppress=esa_on_defaulted_function_ignored
-expt-extended-lambda -arch=sm_35 -Xcompiler
Cray MPICH 7.6.3
Titan CPU GCC 6.3.0 -O3 -std=c++11 -fopenmp Cray MPICH 7.6.3
Theta ICC 18.0.0 -O3 -std=c++11 -ipo -xMIC-AVX512
-inline-forceinline -qopenmp-simd -qopenmp
Cray MPICH 7.7.3
Electra ICC 18.0.3 -O3 -std=c++11 -ipo -inline-forceinline
-qopenmp-simd -qopt-prefetch=4 -qopenmp
-xCORE-AVX512
HPE MPT 2.17
Rie
ma
nn
Co
mp
ute
Co
rne
rE
Re
co
ns
tru
ct 
Z
Re
co
ns
tru
ct 
X
Flu
xD
ive
rge
nc
e CT
We
igh
ted
Av
eU
0
1
2
3
4
wa
llt
im
e 
re
la
tiv
e 
to
 R
ie
m
an
n
NVidia Volta V100 GPU
1DRange
TeamPolicy
MDRange
Rie
ma
nn
Co
mp
ute
Co
rne
rE
Re
co
ns
tru
ct 
Z
Re
co
ns
tru
ct 
X
Flu
xD
ive
rge
nc
e CT
We
igh
ted
Av
eU
Intel Xeon Gold 6148 Skylake CPU
1DRange
simd for
MDRange
TeamPolicy
Fig. 1. Profiling results on a GPU (left) and CPU (right) for selected regions (x-axis) within the main loop of an MHD timestep using the algorithm
described in Sec. 3. The different lines correspond to different loop structures, see Sec. 2.3 and the timings are normalized to the fastest Riemann
region in each panel.
the regions shown. As discussed in more detail in Sec. 4,
we expected these non-optimized raw loop structures to not
cause any major differences in performance.
The results shown here for V100 GPUs and Skylake
CPUs equally apply to other GPU generations and other
CPUs (and Xeon Phis), respectively. For all tests conducted
in the following, we use the loop structure with the highest
performance on each architecture, i.e., 1DRange on GPUs
and simd-for on CPUs and Xeon Phis.
3.2 Performance portability
Our main objective for writing K-ATHENA is an MHD code
that runs efficiently on any current supercomputer and
possibly any future machines. A code that runs efficiently
on more architectures is said to be performance portable.
Determining what is meant by ”efficient code” can be vague,
especially when comparing performance across different
architectures. The specific memory space sizes, bandwidths,
instruction sets, and groupings of cores on different archi-
tectures can all affect how efficiently a code can utilize the
hardware.
In order to make fair comparisons of K-ATHENA’s per-
formance on different machines, we used the roofline model
[15] to quantify the application performance of K-ATHENA,
or the fraction of the performance obtained out of the max-
imum theoretical performance as limited by the arithmetic
intensity of the code and the bandwidth and throughput
of the processor. With the roofline model, we can take
into account the different hardware configurations as we
measure K-ATHENA’s performance.
For quantifying the performance portability from the
application performance on each machine, we used the met-
ric developed by [16]. This performance portability metric
emphasizes high efficiency on all machines as opposed to
only some machines.
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 6
3.2.1 Roofline models
The roofline model plot is a tool to determine the theoretical
peak performance of a code or algorithm on a specific
architecture, given the bandwidths of the different memory
spaces and the arithmetic intensity, or number of operations
vs memory usage of the algorithm for the different spaces.
The plot is a condensation of the performance limits im-
posed by the bandwidth of each memory space. For any
given architecture, the maximum obtainable floating point
operations per second (FLOP/s) is limited by
Pmax ≤ min
m∈M
{min ( TPeak, Bm × Im )} , (1)
where Pmax[FLOP/s] is the maximum possible FLOP/s,
TPeak[FLOP/s] is the peak throughput, or the maximum
number of FLOP/s the device can achieve, M is all the
memory spaces on the device (L1 cache, L2 cache, DRAM,
etc.), Bm[Byte/s] is the bandwidth of the memory space m,
and Im[FLOP/Byte] is the arithmetic intensity of specific
algorithm or code for the memory space m, or the number
of FLOP executed per number of bytes written and read to
and from m.
In the roofline model, throughputs and bandwidths are
plotted on a log FLOP/s versus log FLOP/Byte scale, so that
throughputs are horizontal lines and bandwidths as P ∝ I
lines (since bandwidth-limited P = Bm× I). The arithmetic
intensities of each memory space for a specific algorithm
appear as vertical lines, extending up to their bandwidth-
limited performance, Pm,max = min (TPeak, Bm × Im). The
maximum achievable performance of a given algorithm or
code is the minimum of these bandwidth-limited perfor-
mances, although optimizations of the code can change the
arithmetic intensities of the different memory spaces. We
can also mark the actual performance of K-ATHENA with
a horizontal dashed line, indicating the average FLOP/s
achieved. Figures 2a and 2b show roofline modes for an
Intel Skylake CPU on NASA’s Electra and NVIDIA Volta
V100 GPU on ORNL’s Summit, respectively.
Although numbers for different bandwidths and
throughputs can be obtained through vendor specifications
and theoretical arithmetic intensities can be computed by
hand, empirical testing is required to more accurately reflect
the performance of actual hardware and code. This required
a variety of different performance profiling tools on the dif-
ferent architectures and machines, depending on what was
available. For gathering the bandwidths and throughput
on both CPUs and GPUs, we used the Empirical Roofline
Toolkit [17]. However, the publicly available version of the
Empirical Roofline Toolkit does not measure L1 bandwidth
for GPUs, for which we used GPUMEMBENCH [18]. For
computing arithmetic intensities on GPUs, we were able
to use NVIDIA’s NVPROF to measure both the total FLOP
per finite volume cell and total reads and writes from the
L1 cache, L2 cache, and DRAM/HBM per cell. For Intel
CPUs, we used Intel’s Software Development Emulator
(SDE) [19] to capture the total FLOP per cell and L1 reads
and writes. To measure the reads and writes from L2 and
L3 on Intel CPUs, we used LIKWID [20]. For reads and
writes to DRAM, we used Intel’s VTUNE [21]. With the
data combined from SDE, LIKWID, and VTUNE, we could
compute the arithmetic intensities for K-ATHENA running
on Intel CPUs.
We did find that the L1 memory usage reported by
LIKWID was about half of what SDE reported, when we
expected the difference would be a few percent. We attribute
this to restricted permissions (as non-root user) to model-
specific registers that are used for performance profiling.
Given that we obtained all L2 arithmetic intensities with
LIKWID these numbers must be interpreted with care.
However, our primarily focus is on performance relative to
the DRAM performance.
In all cases, we found that K-ATHENA’s performance is
limited by the smallest memory space that will accommo-
date the data required by a single MPI task. For GPUs, this is
on device DRAM/HBM, for CPUs this is the DDR3/DDR4
DRAM, and for KNLs this was the MCDRAM. This is
expected, due to the current construction of K-ATHENA; the
finite volume MHD method is currently implemented as a
series of simple kernels for different tasks such as recon-
struction or flux calculations. Each kernel is a tightly nested
triple or quadruple for-loop. Within each iteration, data is
loaded from DRAM up to the registers, then discarded. Fu-
ture improvements can be made to K-ATHENA to explicitly
cache data in smaller 1D arrays and kept in higher level
caches. This would raise the DRAM arithmetic intensity
while lowering the caches’ arithmetic intensity, but overall
raising the performance ceiling. Similar improvements have
already been implemented upstream in ATHENA++. A more
complete solution would involve fusing consecutive kernels
into one kernel to reduce DRAM accesses.
3.2.2 Performance portability metric
Performance portability is at present nebulously defined. It
is generally held that a performance portable application
can execute wide variety of architectures and achieve ac-
ceptable performance, preferably maintaining a single code
base for all architectures. In order to make valid compar-
isons between codes, an objective metric of performance
portability is needed.
The metric proposed by [16] quantifies performance
portability by the harmonic sum of the performance
achieved on each platform, so that
P (a, p,H) =

|H|∑
i∈H
1
ei(a,p)
if i is supported ∀i ∈ H
0 otherwise
(2)
where a is an application, p is a problem to be solved by
the application, H is the space of all relevant platforms,
and ei(a, p) is the performance efficiency of application a
to solve the problem p on a platform i. If an application
does not support a platform, then it is not performance
portable across the platforms and so is assigned a metric of
0. The performance efficiency can also be defined in several
ways, such as the application efficiency and architectural
efficiency, as used in [16]. Application efficiency is the
achieved performance as a fraction of the performance of
the fastest application that can solve the problem on the plat-
form. Architectural efficiency is the achieved fraction of the
theoretical peak performance of the hardware, accounting
for bandwidths, throughputs, and arithmetic intensities. We
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 7
10 2 10 1 100 101
Arithmetic Intensity (Flop/Byte)
10 1
100
TF
lo
p/
s
0.11 TFlop/s
0.92 TFlop/s
0.79 TFlop/s
DR
AM
 (2
07
 G
B/s
)  
L2
 (3
74
0 G
B/s
)  
L1
 (5
27
3 G
B/s
)  
Peak (1.50 TFlop/s)
K-Athena (0.11 TFlop/s)
(a) CPU Roofline
10 1 100 101
Arithmetic Intensity (Flop/Byte)
100
101
TF
lo
p/
s
0.76 TFlop/s
2.42
TFlop/s
7.83 TFlop/s
DR
AM
 (7
84
 G
B/s
)  
L2
 (2
88
1 G
B/s
)  
L1
 (1
41
18
 G
B/s
)  
Peak (7.83 TFlop/s)
K-Athena (0.61 TFlop/s)
(b) GPU Roofline
Fig. 2. Roofline models of a 2 socket Intel Xeon Gold 6148 ”Skylake”
CPU node on NASA’s Electra (2a) and a single NVIDIA Tesla V100
”Volta” GPU on ORNL’s Summit (2b). For both cases shown here and
all other architectures we tested, DRAM bandwidth (or MCDRAM band-
width for KNLs) is the limiting bandwidth for K-ATHENA’s performance.
did not have publicly available MHD codes implementing
the same method as K-ATHENA to directly compute the
application efficiency, and so we used the architectural
efficiency obtained from the roofline model.
Since by its implementation K-ATHENA is always lim-
ited by the DRAM bandwidth in the roofline model, we
used the DRAM architectural efficiency, or the theoretical
peak performance of K-ATHENA if it could achieve 100%
of the peak DRAM bandwidth to define the architectural
efficiency. More explicitly, the architectural efficiency on
platform i is
ei(a, p) =
εa,p,i
Bi,DRAM × Ia,p,i,DRAM (3)
where εa,p,i is the achieved performance of K-ATHENA for
solving the problem p on the platform i, Bi,DRAM is the peak
DRAM bandwidth on the platform, and Ia,p,i,DRAM is the
arithmetic intensity of K-ATHENA for solving the problem
on that platform (which is compiler dependent). For ex-
San
dy B
ridg
e1
Ivy 
Brid
ge
1
Has
wel
l1
Bro
adw
ell
1
Sky
lake
2
Knig
ht's
 Lan
ding
3
Kep
ler K
20
4
Kep
ler K
40
1
Kep
ler K
80
4
Volt
a V1
00
5
0
25
50
75
100
Ar
ch
ite
ct
ur
al
 E
ffi
cie
nc
y 
(%
)
DRAM Perf. Port. 83.1%
L1 Perf. Port. 11.6%
CPU Machines GPU Machines
1Pleiades, 2Electra, 3Stampede 2, 4MSU HPCC, 5Summit
CPU DRAM
CPU L1
GPU DRAM
GPU L1
Fig. 3. Performance Portability plot of several CPU and GPU machines
with different architectures. Individual bars show the performance of
K-ATHENA compared to the theoretical peak performance limited by
the DRAM and L1 bandwidths. The performance portability metrics
computed from the harmonic mean of the two sets of efficiencies.
ample, on Summit’s Volta V100s, K-ATHENA achieves 0.61
TFLOP/s while the DRAM/HBM bandwidth ceiling is at
0.76 TFLOP/s, leading to an 80% architectural performance
compared to HBM bandwidth. For completeness, we also
computed the architectural efficiency as measured against
the L1 cache bandwidth.
For our purposes, we used a problem size roughly
proportional to the memory space and number of cores
available on the device, maximizing the cell updates per
second achievable on each device. On CPUs, we used mesh
blocks of 643, using one MPI task per CPU core and one
mesh block per task. We used all sockets on the node for
CPU tests, which was a single KNL socket on Stampede 2
and two sockets on all other machines. On KNLs, we also
used two OPENMP threads per core instead of one. For
GPUs, we used a single mesh block with the largest size
that would fit in GPU DRAM, starting at 256 × 128 × 128
on K20’s up to 2563 on V100’s, using just one GPU on a
node. If we used a constant problem size across all the CPU
and GPU platforms, the problem would be too large to fit
on small GPUs, too small to occupy all of the capabilities
on other GPUs, and would not divide evenly across all the
CPUs.
In Fig. 3, the architectural efficiencies as measured
against the DRAM bandwidth and L1 cache bandwidth are
shown with the computed performance portability metrics.
K-ATHENA achieved 83.1% DRAM performance portabil-
ity and 11.6% L1 cache performance portability, measured
across a number of CPU and GPU architectures. In general,
K-ATHENA achieved higher efficiencies on newer CPUs and
older GPUs. This could be due to changes in instructions
sets or trends in changing hardware details.
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 8
323 643 1283 2563
total # cells
107
108
ce
ll-
up
da
te
s/
s
Single GPU
K-Athena Volta
K-Athena Pascal
GAMER Pascal
643 1283 2563
total # cells
Single CPU
Athena++ SKX
K-Athena SKX
Athena++ BDW
K-Athena BDW
GAMER BDW
Fig. 4. Raw performance for double precision MHD (algorithm described
in Sec. 3) of K-ATHENA, ATHENA++, and GAMER on a single GPU (left)
or CPU (right) for varying problem sizes. Volta refers to an NVIDIA V100
GPU, Pascal refers to an NVIDIA P100 GPU, BDW (Broadwell) refers to
a 14-core Xeon E5-2680 CPU, and SKX (Skylake) refers to a 20-core
Xeon Gold 6148 CPU. The GAMER numbers were reported in [22] for
the same algorithm used here.
3.3 Scaling
3.3.1 Single CPU and GPU performance
In order to compare the degree to which the refactor-
ing of ATHENA++ affected performance we first compare
ATHENA++ and K-ATHENA on a single CPU. The right
panel of Fig. 4 shows the cell-updates/s achieved on an Intel
Broadwell and an Intel Skylake CPU for both codes for vary-
ing problem size. Overall, the achieved cell-updates/s are
practically independent of problem sizes reaching ≈ 8×106
on a single Broadwell CPU and ≈ 1.4 × 107 on a single
Skylake CPU. Moreover, without any additional perfor-
mance optimizations (see discussion in Sec. 4), K-ATHENA is
virtually on par with ATHENA++, reaching 93% or more of
the original performance. For comparison, we also show the
results of GAMER [22]. It is another recent (astrophysical)
MHD code with support for CPU and (CUDA-based) GPU
accelerated calculations and has directly been compared to
ATHENA++ in [22]. We also find that ATHENA++ (and thus
K-ATHENA) is about 1.5 times faster than GAMER on the
same CPU.
A slightly smaller difference (factor of ≈ 1.25) is ob-
served when comparing results for GPU runs as shown in
the left panel of Fig. 4. On a P100 Pascal GPU, K-ATHENA
is about 1.3 times faster than GAMER, suggesting that
the difference in performance is related to the fundamen-
tal code design and not related to the implementation of
specific computing kernels. On a single V100 Volta GPU,
K-ATHENA reaches a peak performance of greater than
108 cell-updates/s for large problem sizes. In general, the
achieved performance in cell-updates/s is strongly depen-
dent on the problem size. For small grids the performance
is more than one order of magnitude lower than what is
achieved for the largest permissible grid sizes that still fit
into GPU memory. This emphasizes that the nature of GPUs
are better suited for compute-intensive tasks.
3.3.2 Weak scaling
Weak scaling results (same test problem and algorithm as
in Sec. 3.3.1) for K-ATHENA and the original ATHENA++
version on different systems and architectures are shown
in Fig. 5. Overall, the differences between K-ATHENA and
ATHENA++ on CPUs and Xeon Phis are marginal. This is
expected as K-ATHENA employed simd-for loops for all
kernels that are similar to the ones already in ATHENA++.
Therefore, the parallel efficiency is also almost identical
between both codes, reaching ≈ 80% on NASA’s Electra
system with Skylake CPUs (first column in Fig. 5) and
≈ 70% on ALCF’s Theta system with Knights Landing Xeon
Phis (second column in Fig. 5) at 2,048 nodes each. Using
multiple hyperthreads per core on Theta has no significant
influence on the results given the intrinsic variations ob-
served on that system4.
The first major difference is observed on OLCF’s Titan
(third column in Fig. 5), where results for K-ATHENA on
GPUs are included. While the parallel efficiency for both
codes remains at 94% up to 8,192 nodes using only CPUs, it
drops to 72% when using GPUs with K-ATHENA. However,
the majority of loss in parallel efficiency already occurs
going from 1 to 8 nodes using GPUs and afterwards re-
mains almost flat. This behavior is equally present for CPU
runs but less visible due to the higher parallel efficiency
in general. The differences in parallel efficiency between
CPU and GPU runs can be attributed to the vastly different
raw performance of each architecture. On a single node the
single Kepler K20X GPU is about 7 times faster than the
16-core AMD Opteron CPU. Given that the interconnect is
identical for GPU and CPU communication, the effective
ratio of computation to communication is worse for GPUs.
Despite the worse parallel efficiency on GPUs the raw per-
node performance using GPUs is still about 5.5 times faster
than using CPUs at 8,192 nodes.
K-ATHENA on OLCF’s Summit system (last column in
Fig. 5) with six Volta V100 GPUs and two 21-core POWER9
CPUs exhibits a GPU weak scaling behavior similar to the
one observed on Titan. Going from 1 to 8 nodes results in
a loss of 15% and afterwards the parallel efficiency remains
almost flat to 76% on 4,096 nodes. The CPU weak scaling
results for both codes using CPUs reveal properties of the
interconnect. The weak scaling is almost perfect up to 256
nodes using 1 hyperthread per core and afterwards rapidly
plummets. Using 2 hyperthreads per core (i.e., doubling
the number of threads making MPI calls and doubling the
number of MPI messages sent and received, as described
in Sec. 2.2) the steep drop in parallel efficiency is already
observed beyond 128 nodes. No such drop is observed using
GPUs, which perform 42/6 = 7 times fewer MPI calls
(compared to using 1 hyperthread per core) with larger mes-
sages sizes in general. This suggests that the interconnect is
handling fewer but larger messages better than many small
messages (which likely results in network congestion).
Naturally, this issue is tightly related to the existing
communication pattern in ATHENA++, i.e., coarse grained
threading over meshblocks with each thread performing
4. According to the ALCF support staff, system variability con-
tributes around 10% to the fluctuations in performance between identi-
cal runs.
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 9
107
108
ce
ll-
up
da
te
s/
s/
no
de
Electra Skylake CPU
Athena++ 643
Athena++ 1283
K-Athena 643
K-Athena 1283
Theta Knights Landing
Athena++ HT-1
Athena++ HT-2
Athena++ HT-4
K-Athena HT-1
K-Athena HT-2
K-Athena HT-4
Titan Opteron/Kepler GPU Summit Power9/Volta GPU
100 101 102 103
# nodes
0.0
0.2
0.4
0.6
0.8
1.0
pa
ra
lle
l e
ffi
cie
nc
y
100 101 102 103
# nodes
100 101 102 103
# nodes
Athena++ CPU 1283
K-Athena CPU 1283
K-Athena GPU 1923
100 101 102 103
# nodes
Athena++ CPU HT-1 643
Athena++ CPU HT-2 643
K-Athena CPU HT-1 643
K-Athena CPU HT-2 643
K-Athena GPU 2563
K-Athena CPU nested 643
Fig. 5. Weak scaling for double precision MHD (exact algorithm described in Sec. 3) on different supercomputers and architectures for K-ATHENA
and the original ATHENA++ version. Numbers correspond to the 80th percentile of individual cycle performances of several runs in order to reduce
effects of network variability. The top row shows the raw performance in number of cell-updates per second per node and can directly be compared
between different system and architectures. The bottom row shows the parallel efficiency normalized to the individual single node performance.
The first column contains results for a workload of 643 and 1283 cells per core on NASA’s Electra system using two 20-core Intel Xeon Gold 6148
processors per node. The second column shows results for a workload of 643 per core on ALCF’s Theta system with one 64-core Intel Xeon Phi
7230 (Knights Landing) per node. HT-1, HT-2, and HT-4 refers to using 1, 2, and 4 hyperthreads per core, respectively. The third column shows
results for a workload of 1283 per CPU core and 1923 per GPU on OLCF’s Titan system with one AMD Opteron 6274 16-core CPU and one NVIDIA
K20X (Kepler) GPU per node. The last column contains results for a workload of 643 per CPU core and 2563 per GPU on OLCF’s Summit system
with two 21-core IBM POWER9 CPUs and six NVIDIA V100 (Volta) GPUs per node. On all systems the GPU runs used 1D loops and the CPU runs
used simd-for loops with the the exception of the dashed purple line on Summit that used KOKKOS nested parallelism, see Sec. 2.3 for more
details.
MPI calls. Without making additional changes to the code
base, this issue can directly be addressed using KOKKOS
nested parallelism in K-ATHENA. More specifically, we use
the triple nested construct illustrated in Listing 3 allowing
multiple threads handling a single meshblock. As a proof
of concept, the results for using using 1 MPI process per 2
cores each with one thread are shown in the purple dash
line in the last column of Fig. 5. While the raw perfor-
mance on a single node is slightly lower (about 16%), the
improved communication pattern results in a higher overall
performance for > 1, 024 nodes. Similarly, the sharp drop
in parallel efficiency has been shifted to first occur at 2,048
nodes.
Finally, the raw per-node performance is overall com-
parable between Intel Skylake CPUs, Intel Knight Land-
ing Xeon Phis, IBM POWER9 CPUs, and a single
NVIDIA Kepler GPU, ranging between ≈ 1.5 – 3×107 cell-
updates/s/node. The latest NVIDIA Volta GPU is a notable
exception, reaching more than 108 cell-updates/s/GPU.
This performance in combination with six GPUs per node
on Summit and a high parallel efficiency results in a total
performance of 1.94× 1012 cell-updates/s on 4,096 nodes.
3.4 Strong scaling
Strong scaling results for K-ATHENA on Summit on both
CPUs and GPUs are shown in Fig. 6 (same test problem and
algorithm as in Sec. 3.3.1). Overall, strong scaling in terms
of parallel efficiency is better on CPUs than on GPUs. For
example, for a 1,4083 domain the parallel efficiency using
CPUs remains > 83% going from 32 to 512 nodes whereas it
drops to 45% for the similar GPU case (1,5363 domain using
36 to 576 nodes). This is easily explained by comparing to
the single CPU/GPU performance discussed in Sec. 3.3.1,
which effectively corresponds to on-node strong scaling.
The more pronounced decrease in parallel efficiency on the
GPUs is a direct result of the decreased raw performance of
GPUs with smaller problem sizes per GPU. The increased
communication overhead of the strong scaling test plays
only a secondary role. As a result, additional performance
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 10
107
108
ce
ll-
up
da
te
s/
s/
no
de
Summit Power9/Volta GPU
101 102 103
# nodes
0.0
0.2
0.4
0.6
0.8
1.0
pa
ra
lle
l e
ffi
cie
nc
y
K-Athena GPU 15363
K-Athena GPU 30723
K-Athena CPU 14083
K-Athena CPU 29443
Fig. 6. Strong parallel scaling for double precision MHD (algorithm
described in Sec. 3) of K-ATHENA on NVIDIA V100 GPUs (6 GPUs per
node; green solid lines) and IBM Power 9 CPUs (42 cores per node;
orange/red dash dotted lines) on Summit. The top panel shows the raw
performance in cell-updates per second per node and the bottom panel
shows the parallel efficiency. The effective workload per GPU goes from
2563 to 643 for the 1,5363 domain and from 2563 to 1283 for the 30723
domain. In the CPU case the effective workload per single Power9 CPU
(21 cores) goes from 3533 to 883 for the 1,4083 domain and from 3533 to
1773 for the 2,9443 domain. The resulting effective workloads per node
are comparable (within few percent) between GPU and CPU runs.
improvements, as discussed in the following Section, will
greatly benefit the strong scaling behavior of GPUs in gen-
eral. Nevertheless, the raw performance of the GPUs still
outperforms CPUs by a large multiple despite the worse
strong scaling parallel efficiency. For example, in the case
discussed above on Summit, the per-node performance of
GPUs over CPUs is still about 14 times higher at > 512
nodes.
4 CURRENT LIMITATIONS AND FUTURE ENHANCE-
MENTS
Our primary goal for the current version of K-ATHENA
was to make GPU-accelerated simulations possible while
maintaining CPU performance, and to do so with the
smallest amount of code changes necessary. Naturally, this
resulted in several trade-offs and leaves room for further
(performance) improvements in the future.
For example, we are currently not making use of more
advanced hardware features such as scratch spaces on
GPUs. Scratch space can be shared among threads of a
TeamPolicy and allows for efficient reuse of memory. We
could use scratch space to reduce the number of reads from
DRAM in stenciled kernels (like the reconstruction step). We
could also fuse consecuetive kernels to further reduce reads
and writes to DRAM, although this would also increase reg-
ister and possibly spill store usage. Moreover, complex ker-
nels such as a Riemann solver could be broken further down
by using TeamThreadRanges and ThreadVectorRanges
structures that are closer to the structure of the algorithm.
This is in contrast to our current approach where all kernels
are treated equally, with the same execution policies inde-
pendent of the individual algorithms within the kernels. The
Riemann solver could also be split into separate kernels to
reduce the number of registers needed, eliminate the use of
spill stores on the GPU, and allow higher occupancy on the
GPU.
Similarly, on CPUs and Xeon Phis we are currently not
using a KOKKOS parallel execution pattern. The macro we
introduced to easily exchange parallel patterns replaces the
parallel region on CPUs and Xeon Phis with a simple nested
for loop including a simd pragma, as shown in Listing 1.
This is required for maximum performance as current com-
pilers are less effective in automatically vectorizing the more
complex KOKKOS patterns compared to simple for loops.
On the one hand, we expect that this will become less
of an issue with future compiler improvements. On the
other hand, more kernels and more fine-tuned kernels, as
discussed in the previous paragraph, will also automatically
address this issue.
Another possible future improvement is an increase
in parallel efficiency by overlapping communication and
computation. While ATHENA++ is already built for asyn-
chronous communication and a task based execution model
more fine-grained optimizations are possible. For example,
spatial dimensions in the variable reconstruction step that
occurs after the exchange of boundary information could
be split, so that the kernel in the first dimension could
run while the boundary information of the second and
third dimension are still being exchanged. In addition, the
next major KOKKOS release will contain more support for
architecture-dependent task based execution and, for exam-
ple, will allow for the transparent use of CUDA streams.
CUDA streams may also help in addressing another
current limitation of K-ATHENA on GPUs. Our minimal
implementation approach currently limits all meshblocks
to be allocated in a fixed memory space. This means that the
total problem size that can currently be addressed with K-
ATHENA is limited by the total amount of GPU memory
available. An alternative approach is keeping the entire
mesh in system memory, which is still several times larger
than the GPU memory on many (if not all) current machines.
For the execution of kernels individual meshblocks would
be copied back and forth between system memory and GPU
memory. Here, CUDA streams could be used to hide these
expensive memory transfers as they would occur in the
background while the GPU is executing different kernels.
Theoretically, meshes larger than the GPU memory could
already be used right now with the help of unified memory.
However, given that the code is not optimized for efficient
page migrations the resulting performance degradation is
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 11
large (more than a factor of 10). Thus, using unified memory
with meshes larger than the GPU memory is not recom-
mended.
5 CONCLUSIONS
We presented K-ATHENA – a KOKKOS-based perfor-
mance portable version of the finite volume MHD code
ATHENA++. KOKKOS is a C++ template library that pro-
vides abstractions for on-node parallel regions and the
memory hierarchy. Our main goal was to enable GPU-
accelerated simulations while maintaining ATHENA++’s ex-
cellent CPU performance using a single code base and with
minimal changes to the existing code.
Generally, four main changes were required in the
refactoring process. We changed the underlying memory
management in ATHENA++’s multi-dimensional array class
to make transparently use of KOKKOS’s equivalent multi-
dimensional arrays, i.e., Kokkos::Views. We exchanged
all (tightly) nested for loops with the KOKKOS equivalent
parallel region, e.g., a Kokkos::parallel_for, which are
now kernels that can be launched on any supported device.
We inlined all support functions (e.g., the equation of state)
that are called within kernels. We changed the communica-
tion buffers to be Views so that MPI calls between GPUs
buffers are directly possible without going through system
memory.
With all changes in place we performed both profiling
and scaling studies across different platforms, including
NASA’s Electra system with Intel Skylake CPUs, ALCF’s
Theta system with Intel Xeon Phi Knights Landing, OLCF’s
Titan with AMD Opteron CPUs and NVIDIA Kepler GPUs,
and OLCF’s Summit machine with IBM Power9 CPUs and
NVIDIA Volta GPUs. Using a roofline model analysis, we
demonstrated that the current implementation of the MHD
algorithms is memory bound by either the DRAM, HBM, or
MCDRAM bandwidths on CPUs and GPUs. Moreover, we
calculated a performance portability metric of 83.1% across
Xeon Phis, and 5 CPU and 4 GPU generations.
Detailed KOKKOS profiling revealed that there is cur-
rently no universal KOKKOS execution policy (how a par-
allel region is executed) that achieves optimal perfor-
mance across different architectures. For example, a one-
dimensional loop with manual index matching from 1 to
3D/4D is fastest on GPUs (achieving > 108) double preci-
sion MHD cell-updates/s on a single NVIDIA V100 GPU)
whereas tightly nested for loops with simd directives are
fastest on CPUs. This is a result of both a lack in compiler
capaabilities (e.g., with respect to automated vectorization)
and KOKKOS’s specific implementation details. Both are
expected to improve in the future and KOKKOS allows
for the flexibility to easily exchange execution policies in
general.
Strong scaling on GPUs is currently predominately lim-
ited by individual GPU performance and not by commu-
nication. In other words, insufficient GPU utilization out-
weighs additional performance overhead with decreasing
problem size per GPU.
Weak scaling is generally good, with parallel efficiencies
of 80% and higher for more than 1,000 nodes across all
machines tested. Notably, on Summit K-ATHENA achieves
a total calculation speed of 1.94 × 1012 cell-updates/s on
24,567 V100 GPUs at a speedup of 30 compared to using the
available 172,032 CPU cores.
Finally, there is still a great deal of untapped potential
left, e.g., using more advanced hardware features such
as fine-grained nested parallelism, scratch pad memory
(i.e., fast memory that can be shared among threads), or
CUDA streams. Nevertheless, we achieved our primary
performance portability goal of enabling GPU-accelerated
simulations while maintaining CPU performance using a
single code base. Moreover, we consider the current re-
sults highly encouraging and will continue with further
development on the project’s GitLab repository at https:
//gitlab.com/pgrete/kathena. Contributions of any kind
are welcome!
ACKNOWLEDGMENTS
The authors would like to thank the KOKKOS developers,
particularly Christian Trott and Steve Bova, and the orga-
nizers of the 2018 Performance Portability with KOKKOS
Bootcamp for their help using KOKKOS in ATHENA++.
We thank Kristian Beckwith for inspiring discussions on
KOKKOS. We thank the ATHENA++ team for making their
code public and for their well designed code. We acknowl-
edge funding by NASA Astrophysics Theory Program grant
#NNX15AP39G. Code development, testing, and bench-
marking was made possible through various computing
grants including allocations on NASA Pleiades (SMD-
16-7720), OLCF Titan (AST133), OLCF Summit (AST146),
ALCF Theta (athena performance), XSEDE Comet (TG-
AST090040), and Michgian State University’s High Perfor-
mance Computing Center.
REFERENCES
[1] Y. Zheng, A. Kamil, M. B. Driscoll, H. Shan, and K. Yelick, “Upc++:
A pgas extension for c++,” in 2014 IEEE 28th International Parallel
and Distributed Processing Symposium, May 2014, pp. 1105–1114.
[2] L. V. Kale and S. Krishnan, “Charm++: A portable concurrent
object oriented system based on c++,” Champaign, IL, USA, Tech.
Rep., 1993.
[3] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, “Legion:
Expressing locality and independence with logical regions,” in
Proceedings of the International Conference on High Performance
Computing, Networking, Storage and Analysis, ser. SC ’12. Los
Alamitos, CA, USA: IEEE Computer Society Press, 2012, pp.
66:1–66:11. [Online]. Available: http://dl.acm.org/citation.cfm?
id=2388996.2389086
[4] C. J. White, J. M. Stone, and C. F. Gammie, “An extension of
the athena++ code framework for grmhd based on advanced
riemann solvers and staggered-mesh constrained transport,” The
Astrophysical Journal Supplement Series, vol. 225, no. 2, p. 22, 2016.
[Online]. Available: http://stacks.iop.org/0067-0049/225/i=2/a=
22
[5] J. Stone, K. Tomida, K. G. Felker, and C. White, “Athena++: a new
code for radiation grmhd,” in prep., 2019.
[6] T. P. Straatsma, K. B. Antypas, and T. J. Williams, Exascale Scientific
Applications: Scalability and Performance Portability, 1st ed. Chap-
man & Hall/CRC, 2017.
[7] L. Dagum and R. Menon, “Openmp: An industry-standard
api for shared-memory programming,” IEEE Comput. Sci.
Eng., vol. 5, no. 1, pp. 46–55, Jan. 1998. [Online]. Available:
https://doi.org/10.1109/99.660313
[8] J. E. Stone, D. Gohara, and G. Shi, “Opencl: A parallel program-
ming standard for heterogeneous computing systems,” Computing
in Science Engineering, vol. 12, no. 3, pp. 66–73, May 2010.
GRETE et al.: K-ATHENA – A PERFORMANCE PORTABLE STRUCTURED GRID FINITE VOLUME MHD CODE 12
[9] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos:
Enabling manycore performance portability through polymorphic
memory access patterns,” Journal of Parallel and Distributed
Computing, vol. 74, no. 12, pp. 3202 – 3216, 2014, domain-Specific
Languages and High-Level Frameworks for High-Performance
Computing. [Online]. Available: http://www.sciencedirect.com/
science/article/pii/S0743731514001257
[10] R. Hornung, H. Jones, J. Keasler, R. Neely, O. Pearce,
S. Hammond, C. Trott, P. Lin, C. Vaughan, J. Cook,
R. Hoekstra, B. Bergen, J. Payne, and G. Womeldorff, “ASC
Tri-lab Co-design Level 2Milestone Report 2015,” LLNL,
Tech Report LLNL-TR-677453, 2015. [Online]. Available: https:
//e-reports-ext.llnl.gov/pdf/800854.pdf
[11] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra,
J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P.
Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S.
Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley,
“An overview of the trilinos project,” ACM Trans. Math. Softw.,
vol. 31, no. 3, pp. 397–423, Sep. 2005. [Online]. Available:
http://doi.acm.org/10.1145/1089014.1089021
[12] J. K. Holmen, A. Humphrey, D. Sunderland, and M. Berzins,
“Improving uintah’s scalability through the use of portable
kokkos-based data parallel tasks,” in Proceedings of the
Practice and Experience in Advanced Research Computing 2017
on Sustainability, Success and Impact, ser. PEARC17. New
York, NY, USA: ACM, 2017, pp. 27:1–27:8. [Online]. Available:
http://doi.acm.org/10.1145/3093338.3093388
[13] J. M. Stone, T. A. Gardiner, P. Teuben, J. F. Hawley, and J. B. Simon,
“Athena: A new code for astrophysical mhd,” The Astrophysical
Journal Supplement Series, vol. 178, no. 1, p. 137, 2008. [Online].
Available: http://stacks.iop.org/0067-0049/178/i=1/a=137
[14] J. M. Stone and T. Gardiner, “A simple unsplit godunov method
for multidimensional mhd,” New Astronomy, vol. 14, no. 2, pp. 139
– 148, 2009. [Online]. Available: http://www.sciencedirect.com/
science/article/pii/S1384107608000754
[15] S. Williams, A. Waterman, and D. Patterson, “Roofline: An In-
sightful Visual Performance Model for Multicore Architectures,”
Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
[16] S. Pennycook, J. Sewall, and V. Lee, “Implications of a metric
for performance portability,” Future Generation Computer Systems,
vol. 92, pp. 947 – 958, 2019. [Online]. Available: http://www.
sciencedirect.com/science/article/pii/S0167739X17300559
[17] Y. J. Lo, S. Williams, B. Van Straalen, T. J. Ligocki, M. J. Cordery,
N. J. Wright, M. W. Hall, and L. Oliker, “Roofline Model Toolkit:
A Practical Tool for Architectural and Program Analysis,” in High
Performance Computing Systems. Performance Modeling, Benchmark-
ing, and Simulation, S. A. Jarvis, S. A. Wright, and S. D. Hammond,
Eds. Springer International Publishing, 2015, pp. 129–148.
[18] E. Konstantinidis and Y. Cotronis, “A quantitative roofline model
for GPU kernel performance estimation using micro-benchmarks
and hardware metric profiling,” Journal of Parallel and Distributed
Computing, vol. 107, pp. 37–56, Sep. 2017.
[19] “Intel R© Software Development Emulator,” https://software.
intel.com/en-us/articles/intel-software-development-emulator,
Accessed: May 1, 2019.
[20] J. Treibig, G. Hager, and G. Wellein, “LIKWID: A Lightweight
Performance-Oriented Tool Suite for x86 Multicore Environ-
ments,” in 2010 39th International Conference on Parallel Processing
Workshops, Sep. 2010, pp. 207–216.
[21] “Intel R© VTuneTM Amplifier,” https://software.intel.com/en-us/
vtune, Accessed: May 1, 2019.
[22] U.-H. Zhang, H.-Y. Schive, and T. Chiueh, “Magnetohydro-
dynamics with gamer,” The Astrophysical Journal Supplement
Series, vol. 236, no. 2, p. 50, 2018. [Online]. Available:
http://stacks.iop.org/0067-0049/236/i=2/a=50
Philipp Grete After receiving a B.Sc. in Com-
puter Science in 2008 from the University of
Cooperative Education in Stuttgart, Germany,
Philipp Grete worked for Hewlett-Packard be-
fore studying Physics (B.Sc.) and Computer Sci-
ence (M.Sc.) at the University of Gttingen, Ger-
many, from 2010 to 2013. In his Ph.D. the-
sis (2014-2016, University of Gttingen, Ger-
many) he worked on subgrid-scale modeling
of compressible magnetohydrodynamic turbu-
lence. Since October 2016, he is a postdoctoral
research associate at Michigan State University. His current research
interest include fundamental processes involving magnetic fields in as-
trophysical fluids, numerical methods in computational fluid dynamics,
and high-performance computing with an emphasis on performance
portability.
Forrest W. Glines Forrest Glines is PhD student
at Michigan State University working towards a
dual degree in Astrophysics and Computational
Mathematics, Science, and Engineering. He re-
ceived his B.S. in Physics from Brigham Young
University in 2016, publishing work on magne-
tohydrodynamics methods using GPUs. In the
2015 and 2016 he also spent 6 months collec-
tively at Los Alamos National Laboratory, simu-
lating galaxy cluster and developing simulations
coupling radiative transfer and hydrodynamics.
Forrest’s current research interests includes the coupling of galaxy
clusters with active galactic nuclei, plasma turbulence, the evolution
of galaxies with magnetic fields, and developing simulations for the
exascale era.
Brian W. O’Shea Brian O’Shea received his
PhD in Physics from the University of Illinois
at Urbana-Champaign in 2005. His dissertation
research focused on simulations of cosmological
structure formation, particularly in the first gen-
eration of stars in the Universe. He was then a
Directors Postdoctoral Fellow at Los Alamos Na-
tional Laboratory in both the Theoretical Astro-
physics Group and the Applied Physics Division,
where he studied the formation of galaxies over
a wide range of mass scales and cosmic epochs.
Since 2008 he has been a professor at Michigan State University
with appointments in the Department of Computational Mathematics,
Science and Engineering, Department of Physics and Astronomy, and
National Superconducting Cyclotron Laboratory. His current research
interests range across a wide variety of topics in cosmological structure
formation, astrophysical plasma turbulence, high-performance comput-
ing, and computational science education. He is one of the core devel-
opers and leaders of the Enzo community code (enzo-project.org), and
a contributor to several other open-source projects.
