Evaluation of performance portability frameworks for the implementation
  of a particle-in-cell code by Artigues, Victor et al.
ar
X
iv
:1
91
1.
08
39
4v
1 
 [c
s.D
C]
  1
9 N
ov
 20
19
Evaluation of performance portability frameworks for the
implementation of a particle-in-cell code
Victor Artigues1,2, Katharina Kormann1, Markus Rampp2, Klaus Reuter2,∗
1Max-Planck-Institut fu¨r Plasmaphysik,
Boltzmannstr. 2, 85748 Garching, Germany
2Max Planck Computing and Data Facility
Gieenbachstr. 2, 85748 Garching, Germany
∗Correspondence: klaus.reuter@mpcdf.mpg.de
November 20, 2019
Abstract
This paper reports on an in-depth evaluation of the performance portability frameworks Kokkos and
RAJA with respect to their suitability for the implementation of complex particle-in-cell (PIC) simulation
codes, extending previous studies based on codes from other domains. At the example of a particle-in-cell
model, we implemented the hotspot of the code in C++ and parallelized it using OpenMP, OpenACC,
CUDA, Kokkos, and RAJA, targeting multi-core (CPU) and graphics (GPU) processors. Both, Kokkos
and RAJA appear mature, are usable for complex codes, and keep their promise to provide performance
portability across different architectures. Comparing the obtainable performance on state-of-the art hard-
ware, but also considering aspects such as code complexity, feature availability, and overall productivity, we
finally draw the conclusion that the Kokkos framework would be suited best to tackle the massively parallel
implementation of the full PIC model.
1 Introduction
Modern high-performance computing (HPC) systems are getting increasingly complex, in particular concerning
the hardware architecture of the clusters’ nodes. This trend is reflected in the development of the “Top 500”
ranking of the fastest supercomputers [1], where a growing fraction of HPC systems deploys various types
of accelerators or coprocessors, in addition to the prevailing multi-core CPUs. As of November 2018, such
accelerated systems contribute a total (peak-performance weighted) share of about 40% within the Top 500 [1]
list, with more than two dozens of different types of accelerators (GPUs in the majority of systems). For an
individual HPC system, in particular at the high end of the list, GPUs contribute a large fraction of the nominal
peak performance.
While for handling inter-node communication the Message Passing Interface (MPI [2]) remains unchallenged
as the programming model which has proven stable, reliable, portable, and well supported over more than 25
years, a de-facto standard for programming heterogeneous compute nodes has yet to emerge. The situation is
quite adequately described by the term “MPI + X” which has been around for a number of years [3], but today,
the “X” still represents a variety of node-level programming paradigms which are mostly specific for a certain
type of hardware, or even vendor.
With OpenMP[4], OpenACC[5], and OpenCL[6], to name the most relevant and widespread ones, there is
a set of language extensions to C and Fortran available that— at least partly—offer portable programming
across various types of compute nodes: OpenMP[4] has been very successful for programming multi-core CPUs,
and most recent specifications target also accelerators. However, there is currently only very limited compiler
support for accelerators, and some of the semantics differ between CPUs and accelerators.
Conceptually similar to OpenMP, OpenACC[5] was designed as a high-level, directive-based approach to
GPU programming and has been quite successful, in particular as an alternative to CUDA, but also supporting
multi-core processors. In practice, however, there is only limited compiler support for OpenACC (proprietary
compilers by PGI and CRAY, and experimental support in GCC). For complex application codes, both, the
1
OpenMP, and—maybe to a slightly lesser degree—the OpenACC-based approach likely require target-specific
adaptations of data structures and flow control in order to achieve good performance on both, CPUs and GPUs.
OpenCL[7, 6] was designed for heterogeneous systems with CPUs and GPUs, but, in practice, has limited
support by compilers and runtimes. Moreover, it exposes many of the complexities of the GPU hardware
architecture to its programming model. In particular, parallelism is expressed in the form of so-called kernels
which are mapped to the CPU’s or GPU’s threads in a SIMT or SIMD fashion. This often poses severe challenges
to HPC codes which typically express parallelism in loops or—more fashionably—in complex task graphs, both
of which requires a major rewriting of code, and, for Fortran programs, the construction of C interfaces.
In the absence of a de-facto standard for achieving portable performance, scientific HPC application develop-
ment, especially when targeting high-end machines, is facing an enormous challenge. In practice multiple code
versions, code paths, or alike, need to be implemented for the same algorithm, employing different node-level
programming models specialized for the target hardware platforms. Applying advanced software engineering
techniques, some of the complexity can be encapsulated in certain “lower” layers of the software architec-
ture of an application (e.g. in a custom domain-specific language [8, 9]), but readability, maintainability, and
sustainability of such code is often significantly impacted.
Hence, there is significant demand and motivation to develop tools for abstracting the source code from the
hardware. Performance portability frameworks enable programmers to target multiple architectures, ideally
achieving good performance on any of them without the need to implement and optimize individually. Ex-
isting frameworks typically build upon the C++ programming language with templates (ideally, architecture-
dependent decisions are taken automatically at compile time) and provide a limited set of building blocks
for parallelism while hiding the complexity of the target architecture from the application programmer. The
promise is that a single source code can run efficiently on multiple architectures. Following the paradigm
of “separation of concerns” the framework eventually delegates compilation and execution to well-established
“native” programming models and runtimes, such as the aforementioned OpenMP (for multi-core CPUs), or
the proprietary CUDA model (for NVIDIA GPUs). Those “backends” can be implemented and maintained by
specialists in an application-agnostic way and thus become transparent for the application programmer. For a
more comprehensive review on options for performance portability we refer to the introduction of Ref. [10] and
the references therein.
In this paper, we shall focus on Kokkos [11] and RAJA [12] which, at the time of writing, are the two
leading C++ performance portability frameworks. Both use advanced meta-programming techniques to generate
architecture-specific code and optimizations at compile time and are freely available as open source under
BSD-type licenses. Similar, though still in beta state and therefore not considered currently, is the alpaka
performance portability library[13, 14]. It provides hardware abstraction to support single-source accelerator
development. Specifically, this paper reports on our experiences with employing Kokkos and RAJA for achieving
performance portability of a complex particle-in-cell (PIC) code across various types of multi-core and GPU-
accelerated HPC platforms, and compares it to platform-optimized versions of the code. In addition to the
computational performance, our assessment considers aspects such as code complexity, feature availability, and
overall productivity of the approach.
The paper is structured as follows. Subsection 1.1 gives a brief review of related work on the assessment
of performance-portability frameworks. Next, subsection 1.2 introduces the methodology and the goals of the
present work and subsection 1.3 briefly introduces the numerical model of our study. In section 2, we briefly
summarize the main concepts of Kokkos and RAJA and put them into context by means of code examples from
our application. We then turn towards a usability review of both frameworks in section 3, before we report in
detail on the performance achieved on state-of-the art CPU and GPU nodes in section 4. Finally, the paper
closes with a summary and conclusions in section 5.
1.1 Related work
To our knowledge, there exist only few comparisons of Kokkos, RAJA, and other parallel frameworks at the
level of a complex HPC application.
Martineau et al. did a comparison of Kokkos, RAJA, OpenACC, OpenMP 4.0, CUDA, and OpenCL based
on the Tealeaf application, a miniapp solving the heat conduction equation using finite differences[15, 16].
The authors report a 5% to 30% performance penalty for Kokkos and RAJA compared to architecture-specific
implementations.
Sunderland et al. report on the Kokkos refactoring of the large legacy code base Uintah, used to model
turbulent combustion [17]. In their case, the introduction of Kokkos views and proper memory layouts even
increased the performance of specific kernels thanks to more efficient memory accesses and vectorization.
More recently, Demeshko et al. report specifically on a Kokkos port of parts of the complex finite element
framework Albany in great detail[10]. They focus on the refactoring process, and present performance compar-
isons on the target platforms CPU, GPU, and KNL, though a baseline defined by the original code’s performance
2
before the porting is lacking.
1.2 Methodology
We consider a complex HPC application from the plasma physics domain, based on a numerical model which
solves the Vlasov–Maxwell system of equations using a Particle-In-Cell (PIC) method[18]. The fact that a full-
scale implementation of the code has yet to be developed and optimized for state-of-the-art HPC systems, which
is a multi person-year effort, served as the main motivation for the pilot-study and assessment of portability
frameworks presented here.
As a baseline implementation, we extracted a generic part from the original FORTRAN implementation
taken from SeLaLib [19] and rewrote it in C++. Starting out from this code, six different parallel versions
were developed, namely, using plain OpenMP directives, plain OpenACC directives, plain CUDA, Kokkos,
hybrid OpenMP/Kokkos, and RAJA. The baseline version and the six parallel versions, targeting multi-core
and graphics processors, were extensively benchmarked, and are compared with respect to their performance.
In addition, we address soft factors such as the programmers’ productivity, the code complexity, and the overall
experience.
Since the fundamental computational kernels of the PIC method are somewhat complementary to the pat-
terns encountered in finite-element methods or mesh-based domain-decomposition schemes that were targeted in
the aforementioned studies, our assessment may serve as another major building block for judging the relevance
of portability frameworks for HPC application development as a whole.
1.3 Numerical model
The particle-in-cell method solves a hyperbolic conservation law by representing a distribution function by
macro-particles that evolve in a Lagrangian frame along the characteristic equations accociated with the con-
servation law. These macro-particles are represented by their position in phase space and a weight. Often the
particle description is coupled to a field description of some moments of the distribution function. We study
the solution of the Vlasov–Maxwell equations in six-dimensional phase space. The Vlasov equation models the
evolution of a species s of a plasma in its self-consistent and external electromagnetic fields
∂tfs(x,v, t) + v · ∇xfs(x,v, t) +
qs
ms
(E(x, t) + v×B(x, t)) · ∇vfs(x,v, t) = 0, (1)
where f denotes the distribution function, E the electric, and B the magnetic field, and qs and ms the charge
and mass of the particles, respectively. The self-consistent fields can be computed from Maxwell’s equations
∂E
∂t
−∇×B = −j, (2a)
∂B
∂t
+∇×E = 0, (2b)
∇E = ρ, (2c)
∇B = 0, (2d)
where ρ and j denote the charge and current density, respectively, which are defined as velocity moments of the
distribution functions,
ρ =
∑
s
qs
∫
fsdv, and j =
∑
s
qs
∫
vfsdv. (3)
Particle methods represent the distribution function fs by a large number of macroparticles that are defined by
their position (xp,vp) in six-dimensional phase space and a (constant) weight, and evolve over time. The fields
are represented on a grid by some discretization method.
In our case study, we follow the geometric electromagnetic particle-in-cell (GEMPIC) scheme as proposed
in [18] where the fields are represented by conforming spline finite elements as proposed in [20]. The common
parallel structure of the particle-in-cell method are particle loops which are embarrassingly parallel but assemble
and evaluate grid-based quantities that are shared between several processes and thus require a reduction step
(and the solution of the field equations) following the particle loop. Our implementation focuses of a subset of
equations, namely a particle loop that identifies the position of the particle in the grid, evaluates the magnetic
field at the particle position, updates x1,p and v2,p, v3,p, and accumulates the first component of the current
density. The particle loop is then followed by a reduction of the current density. Appendix A gives a more
detailed description of the implemented method. Moreover, we refer to [18] where this subset of instructions is
presented as the operator Hp1 .
3
Although the study uses a particular operator stemming from the GEMPIC discretization, we believe that the
findings are general enough to apply to other types of particle-in-cell schemes. Note that we focus in the present
work only on shared-memory parallelization with a simple data structure storing particle position, velocity,
and weight in an array-of-structures data type. Including advanced data types that enable vectorization and
efficient data access (cf. e.g. Ref. [21]) and adapting them to the special requirement of the current-deposition
loop as it appears in the GEMPIC equations is a different topic which we do not address in this work.
2 Performance Portability Frameworks
The aim of performance portability frameworks like Kokkos and RAJA is to enable scientific programmers to
write generic parallel code that can be compiled and run on several parallel architectures while minimizing
or even eliminating the need to implement architecture-specific code. At the same time, nearly the same
computational performance as obtained from an architecture-specific implementation is supposed to be achieved.
Thus, the application programmer has to take care of only a single code version and is shielded from technical
details of different target architectures. Moreover, the code is expected to run at high performance also on
future HPC hardware, provided that the chosen performance-portability framework is properly maintained.
To point out the concepts, similarities, and differences of Kokkos and RAJA, we present a recap of both
frameworks, partly based on code from our implementations of the numerical model. For direct comparison of
the source code from the different programming models we have included a color-coded listing in appendix B.
These code excerpts transport the essential ideas behind Kokkos and RAJA. We used Kokkos version 2.7.00
and RAJA version 0.6.0 during development work. For further details on the abstractions and the programming
models, we refer the reader to the official documentation of Kokkos[22] and RAJA[23].
In the following sections, we will briefly address the role of the individual building blocks for the implemen-
tation of our model code. The key feature of both Kokkos and RAJA is to offer abstractions for the memory
(organized in so-called views) and the parallel operations. Kokkos defines the following six fundamental ab-
stractions:
1. execution spaces specify on which processor to execute,
2. execution patterns specify the parallel operation (i.e. for, reduction, scan, task),
3. execution policies specify how to execute the pattern,
4. memory spaces specify where to allocate memory and store data,
5. memory layouts control the mapping of indices to physical memory,
6. memory traits specify how to access the memory.
RAJA supports various types of parallel operations, such as loops (for), reductions, atomics (handled by the
data structure in Kokkos), and scans, and defines a policy for each of them specifying both where and how
the parallel operation is executed. The memory abstraction is through views that wrap the pointer to the
actual memory and enable processing of the data independent of its index organization that is defined by the
concept of layouts. Note that Kokkos automatically allocates the memory according to the memory space while
RAJA requires the user to explicitly define the memory layout for each case. These abstractions enable the
implementation of parallel code as outlined in the following.
2.1 Array types and memory management
Kokkos and RAJA both offer multidimensional array primitives called views. These views allow for an abstrac-
tion of the storage layout from the data which, most importantly, relieves the programmer from architecture-
specific optimization work. For example, on cache-based architectures such as CPUs, it is of advantage to access
memory linearly (i.e. multiple consecutive elements from each thread) whereas on GPUs it typically performs
better to access memory in a coalesced fashion (i.e. one element only per lightweight thread). Using the view
abstraction, the optimal layout and padding can be chosen automatically and individually for each architecture
(execution space).
In both frameworks, a view contains only metadata which is stored in host memory, plus a pointer to the
actual data that can be located on the host or on the device. Kokkos allocates memory together with views,
whereas RAJA does not allocate memory. To be able to, e.g., define an array independent of its location,
Kokkos implements the so called HostMirror type. This special view transparently guarantees data access to
device memory from the host. In our code, the names of such views contain the keyword host. With RAJA,
managing host and device memory allocation and transfers is the responsibility of the user, similarly to the
CUDA heterogeneous programming model.
In the codes presented in the appendix, Kokkos views are used, e.g., in the lines 101-104 of the listing B,
4
this->view j dofs local = Kokkos::View<double*>("view j dofs local",
this->part mesh coupling.view n dofs host(0));
this->view j dofs local host = Kokkos::create mirror view(this->view j dofs local);
101 for(long i=0; i<this->view j dofs local host.size(); i++) {
102 this->view j dofs local host(i) = 0.0;
103 }
104 Kokkos::deep copy(this->view j dofs local, this->view j dofs local host);
and RAJA views are used, e.g., in the lines 108-122 of the listing B:
this->d j dofs local = memoryManager::allocate<double>(this->part mesh coupling.n dofs);
this->raja j dofs local. RajaVector1DType();
new(&this->raja j dofs local) RajaVector1DType(this->d j dofs local, this->part mesh coupling.n dofs);
108 #if defined(RAJA ENABLE CUDA) && defined(I USE CUDA)
109 RAJA::forall<RAJA::cuda exec<32>>(RAJA::RangeSegment(0, this->part mesh coupling.n dofs),
110 [*this] RAJA DEVICE (int i) {
111 if(i<this->part mesh coupling.raja n dofs(0)) {
112 this->raja j dofs local(i) = 0.0;
113 }
114 });
115 #else
116 RAJA::forall<RAJA::omp parallel for exec>(RAJA::RangeSegment(0, this->part mesh coupling.n dofs),
117 [this] (int i) {
118 if(i<part mesh coupling.raja n dofs(0)) {
119 raja j dofs local(i) = 0.0;
120 }
121 });
122 #endif
2.2 Parallel patterns
Kokkos and RAJA both implement data parallel operations (for, reduce, scan), and Kokkos in addition also
a task parallel operation (task). These parallel operations are called patterns in the case of Kokkos, whereas
RAJA subsumes them under the term policy, see section 2.3 below, however the concepts are the same. The
calls to parallel code can be done via lambda functions in both frameworks. With Kokkos, a functor can be
used alternatively to wrap parallel code.
The inputs of the lambda function or the functor have to match the expected input parameters of the
execution pattern and policy. This prohibits the use of arguments to pass data, therefore lambda functions
seem easier to work with on a small scale. To get access to the data, functors require all the data as class
members.
According to our experience, functors are often easier to test and somewhat more readable, especially for
source codes with many lines. However, plain functors seem to pose a problem when multiple parallel functions
to be accessed via the same functor need to be implemented. To address this, Kokkos uses an Execution Tag.
The tag is passed as a template parameter to the execution policy, which will then feed it to the operator(),
making the specialization to call a certain internal function a compile-time decision. An execution tag is used
at line 176 of Listing B:
176 auto policy = Kokkos::TeamPolicy<pic routine, ExecSpace>
The first template parameter, pic routine, is the tag that will define which operator() to use. The second
parameter is the execution space defining, at compile time, the information on where to execute the code.
At the time of writing and for both the frameworks, the reduction loop supported on both CPU and GPU
only covered scalar reductions. Vector reduction is a crucial feature for the implementation of a PIC method
such as the one considered by us in the present paper. While RAJA does not provide a “ready-to-use” solution
for vector reduction, Kokkos provides it by the use of a scatter view. The scatter view deals with the allocation
of per-thread memory for thread-safe access to the vector, see the example below taken from the lines 316, 445
of the listing B. For completeness we have added a call to the contribute function which performs the reduction:
445 ViewScatterAccessType scatter access = scatter view.access();
367 scatter access(index1d) += view j1d(i) * splinejk;
Kokkos::Experimental::contribute(hamiltonian splitting.view j dofs local, scatter view);
For RAJA, we have implemented our own vector reduction by using a different view for each particle’s partial
result, and summing the partial results at the end. See the example below adapted from lines 200, 447 and 214
of the listing B.
5
200 //RAJA CPU: initialisation of the handmade reduction 2D array
201 RAJA::kernel<fdPolicy>(RAJA::make tuple(RAJA::RangeSegment(0, total num threads)),
202 [&,this] (int i part) {
203 for(int i=0; i<part mesh coupling.n dofs; i++)
204 raja private raja array(i part).raja handmade reduce(i) = 0.0;
205 });
206
447 raja private raja array(i part).raja handmade reduce(index1d) += util raja.raja j1d(i) * splinejk;
448
214 //RAJA: CPU handmade reduction
215 RAJA::kernel<fdPolicy>(RAJA::make tuple(RAJA::RangeSegment(0, part mesh coupling.n dofs)),
216 [&,this] (int i part) {
217 for(int i=0; i<total num threads; i++)
218 d j dofs local[i part] += raja private raja array(i).raja handmade reduce(i part);
219 });
2.3 Execution policies
To specify how a parallel pattern is executed, Kokkos knows two types of execution policies, the RangePolicy
and the TeamPolicy. The range policy simply defines a range of indices on which a function will be called. No
assumption can be made about the order of the concurrent execution, and it is not allowed to synchronize the
threads. The team policy adds additional features on top of the range policy. It enables hierarchical parallelism
and scratch memory. The scratch memory is a very important feature for the present PIC application, as each
thread needs private variables to compute updated particle positions, velocities, etc. At line 176 of Listing B,
a TeamPolicy, with shared size bytes per-thread of scratch memory is declared.
176 Kokkos::TeamPolicy{...}.set scratch size(1,Kokkos::PerThread (shared size));
RAJA decides on which hardware the code will run based on the template arguments of the policy. The
indices for the data-parallel operation are defined using RangeSegments, a RAJA class to define a range [[n,m−
1]]. Moreover, e.g., ranges with constant strides {n, n+m,n+ 2 ∗m, ...} are possible. In line 189 of listing B,
we launch a lambda function on the GPU with 256 threads per thread-block on the indices [[0, n particles− 1]],
as shown below.
189 RAJA::forall<RAJA::cuda exec<256>>(RAJA::RangeSegment(0,
190 this->particles.group[0].n particles),
191 [=,*this] RAJA DEVICE (int i part){
192 {operator RAJA GPU}
193 });
These segments can further be combined into lists of segments for more complex access patterns. However, this
advanced index management feature is not necessary for our PIC model and has not been tested in this study.
In the following we turn towards a review of the different frameworks, based on our experience from imple-
menting the PIC algorithm.
3 Usability review
3.1 Implementation overview and methodology
In order to compare Kokkos and RAJA with OpenACC and the architecture-specific programming models, we
developed multiple versions of the PIC operator.
C++ A standalone C++ reference implementation of the PIC routine was written first. This code includes
input/output and driver functions to run, time, and verify the computation, and serves as the starting
point for the other codes below.
OpenMP Starting from the C++ code, we implemented a thread-parallel version for the CPU using plain
OpenMP. Results from this code define the baseline to compare to the Kokkos-CPU and RAJA-CPU
codes, see below.
Kokkos Next, the parallel loops from the OpenMP code were refactored using Kokkos and were verified to
produce correct results on CPUs and GPUs. In the following, we refer to this version as Kokkos-CPU
when run on the CPU with the Kokkos-OpenMP back end, and as Kokkos-GPU when run on the GPU
with the Kokkos-CUDA back end, respectively.
6
OpenMP/Kokkos Related to the pure Kokkos implementation, we developed a hybrid code version which
uses OpenMP directives in combination with Kokkos views, with the motivation to identify the overhead
of the pure Kokkos implementation. Moreover, such hybrid codes naturally emerge during code refactoring
work.
RAJA A parallel implementation was developed using RAJA, and verified on CPUs and GPUs. In the fol-
lowing, we refer to this version as RAJA-CPU when run on the CPU with the RAJA-OpenMP back end,
and as RAJA-GPU when run on the GPU with the RAJA-CUDA back end, respectively.
CUDA To define a baseline for Kokkos’ and RAJA’s GPU performance, a plain CUDA version was imple-
mented.
OpenACC To complement this comparison, a plain OpenACC version was implemented, which runs on the
CPU and on the GPU.
To compile all but the last code we used GCC 6.3 and CUDA 9.1 with appropriate library and compiler flags
to optimize for the specific hardware. In particular, the flags -O3 -march=native -fopenmp were passed to
gcc, and, e.g., the flags -O3 -arch=sm 70 were passed to nvcc with the -arch flag matching the actual GPU
architecture, NVIDIA Volta in this example. For Kokkos, e.g., the macro KOKKOS ARCH="SKX,Volta70" was
specified in addition to specify the target platforms, here Intel Skylake CPU and NVIDIA Volta GPU. To
compile the OpenACC code we used PGI 19 with the -fast -acc optimization flags and the respective target
specialization flags for the CPU (-ta=multicore) and GPU (e.g., -ta=tesla:cc70).
In the following, we discuss the implementation experience with Kokkos and RAJA individually and address
soft factors such as code complexity and productivity, before we turn towards performance benchmarks in
section 4.
3.2 Kokkos implementation
To perform a refactoring of existing sequential C++ code into parallel Kokkos code, essentially a two-step
process is necessary in the most simple case: First, replace legacy C-style array allocations or, similarly, higher-
level data structures such as std::vector with Kokkos::View types, and, second, replace for loops with parallel for
and move the loop bodies into functors or lambda functions.
As mentioned previously, Kokkos execution patterns offer two possibilities to run user code in parallel. For
the present work, the functor approach was preferred to the lambda functions, allowing for easy testing and
leading to well-readable code.
As a feature of convenience and robustness, Kokkos provides a default execution space that will consider
architectures in the order CUDA, OpenMP, pthreads, and serial, if available. This default execution space
makes it optional to the programmer to specify the architecture onto which the code is going to be deployed.
Complex implementations can of course have parallel sections specified to use distinct execution spaces.
Vector reductions are a crucial feature for our application. The Kokkos reference only covers scalar reductions
in the documentation∗, examples, and test files. However, Kokkos provides a ScatterView class under the
experimental namespace that can be used to implement vector reductions. It has a slightly different behavior
on the CPU and the GPU: On the CPU, the ScatterView class will duplicate its View internally per thread.
After the parallel section, the contribute function has to be called explicitly to compute the reduction from
all the internal Views. On the GPU, the ScatterView does not duplicate its View, hence, value updates from
concurrent threads need to be performed atomically. On present GPU hardware, the atomic add operation
is efficiently implemented[24], which makes it a good choice to perform a reduction instead of using variable
duplication. The ScatterView class expects a template parameter to request to use the duplication or the atomic
access, but there is no default setting. Hence, we specified different template settings for CPU and GPU using
a preprocessor macro. In case of the Kokkos framework, this was the only time two versions of a code section
(though 2 lines only) had to be implemented.
Lastly, we need some scratch memory for each thread. For the Kokkos implementation, this is simply
achieved by declaring all the needed variables inside the parallel section, such that allocation is done per thread.
The total size of the scratch memory inside a parallel loop needs to be known before the parallel dispatch is
launched, similarly to CUDA shared memory. Different memory levels with different size caps, corresponding
to L1 and L2 cache sizes, can be accessed on the CPU, and similarly for shared memory on the GPU.
∗https://github.com/kokkos/kokkos/wiki/Custom-Reductions:-Build-In-Reducers, accessed on 09/11/2018
7
3.3 RAJA implementation
Turning towards RAJA and the refactoring of existing sequential C++ code, the same two-step process as
already discussed with Kokkos is necessary: First, replace array types with RAJA::View types, and, second
replace for loops with forall and move the loop bodies into lambda functions. It is the duty of the user to
allocate memory for the RAJA views.
A difference to note is that RAJA does not implement defaults, e.g., for architectures, but rather forces the
programmer to consider and specify all the necessary settings explicitly. On the source code level this requires
branches to specialize for CPU and GPU versions.
Second, RAJA does not support functors for the parallel dispatch but only lambda functions which required
to structure the code differently, compared to Kokkos.
As already discussed for the Kokkos implementation, scratch memory is needed, for which RAJA offers an
implementation of shared memory on the GPU. On the CPU, we use regular views.
RAJA does not offer ready-to-use vector reductions, however, it was straight forward to implement such
reductions based on views.
3.4 Qualitative comparison
Table 1: Qualitative comparison of the programming models based on a subjective ranking on the scale low,
medium, high, as experienced by the programmers.
criterion OpenMP OpenACC CUDA Kokkos RAJA
code clarity high high low medium medium
productivity high medium low medium medium
portability low medium low high high
performance high high high high medium
In the following, a qualitative review on the code complexity, programmer’s productivity, portability, and
performance is given, summarized in table 1.
The directive-based OpenMP and OpenACC programming models were the least intrusive when applied
to the loops of the PIC routine, starting from the sequential C++ code. Kokkos and RAJA required both
significant restructuring of the existing code for the parallel dispatch via functors or lambda functions. CUDA
required a comparable amount of rewriting effort, in particular to map the loops onto a CUDA grid of threads
and thread blocks. The overhead for OpenMP and OpenACC in terms of lines of code is the smallest, followed
by Kokkos. For RAJA, the overhead is about a factor two compared to Kokkos in many places since separate
code for CPU and GPU can be necessary. The CUDA version is comparable to the Kokkos code in terms of
lines of code.
Concerning the programmer’s productivity, it took a graduate-level C++ programmer about two months
to learn and apply the Kokkos framework successfully to implement the PIC routine. RAJA, conceptually
similar and tackled with having the knowledge of Kokkos, took about another month. The OpenMP and
CUDA implementations were done faster due to previous experience. To develop the OpenACC code, it was
necessary to do the implementation in analogy to the CUDA implementation, keeping the correspondence of
OpenACC gangs and CUDA thread blocks in mind. Remarkably, Kokkos and RAJA keep their promise of
basically providing a GPU version for free based on implementation work mainly done using the CPU. For
completeness, we include the performance ranking in table 1 but refer to the following section 4 for details.
Moreover, it should be noted that the specification of the target platform in the Kokkos Makefile via the
KOKKOS ARCH variable already enables the use of correct compiler optimization flags for the respective platform,
while for the other frameworks the user has to set these flags manually, a procedure which is tedious and prone
to errors.
A drawback common to both frameworks, Kokkos and RAJA, is the fact that the hardware-specific code
generated at compile time via C++ template meta programming is not accessible to the programmer for
inspection. This limitation does not only affect Kokkos and RAJA, but is well-known to any C++ programmer
who uses templates. There is no way to obtain the resulting code in the high-level C++ or C++/CUDA
language, rather the programmer has to work at the level of assembly code.
3.5 Hybrid OpenMP/Kokkos
Even though performance portability frameworks enable portability of the code, they often come with a certain
overhead which we have also observed in our application (cf. the following section). For this reason, we were
8
interested in the question whether this overhead is intimately connected to the data structures or rather to
the implementation. For this reason, we have modified the Kokkos code such that it uses OpenMP parallel for
directives instead of the Kokkos loops but keeps the Kokkos views as array objects.
Because Kokkos does reference counting, accessing a regular view from a pure OpenMP parallel section
would cause significant overhead. In particular, any operation on a view would be an atomic operation. If
one wants to use a view from conventional OpenMP parallel code, the trait of the view needs to be set to
Unmanaged. This is what we used for the hybrid OpenMP/Kokkos code, see e.g. the lines 13 and 136 of listing
B:
13 ViewVector2DUnmanagedType view particle array unmanaged;
136 this->view i weight unmanaged = this->view i weight;
With these modifications it was possible to eliminate the overhead for Kokkos as will be reported on in the
next section. We did not perform such an experiment for RAJA. However, since memory management is more
transparent in RAJA, no difficulties should arise in this case.
4 Performance benchmarks
4.1 Hardware
For the performance benchmarks presented in the following we consider a single multi-core CPU socket and a
single GPU. This is commonly considered a fair comparison when measuring the performance of heterogeneous
codes. We used the following hardware in the scope of this work.
IvyBridge-Kepler Single node with two Intel Xeon E5-2680 v2 @2.80GHz CPUs (IvyBridge) and two NVIDIA
K40m GPUs (Kepler, PCIe 3.0) [25].
POWER8-Pascal Single node with two IBM POWER8 processors and four NVIDIA Tesla P100 GPUs (Pas-
cal, NVLINK) [26].
Skylake-Volta Single node with two Intel Xeon Gold 6148 2.4GHz CPUs (Skylake) and two NVIDIA V100
GPUs (Volta, PCIe 3.0) [27].
The IvyBridge-Kepler system was used for the majority of the development work. To obtain performance
numbers on state-of-the art hardware, we turned towards running the code on the POWER8-Pascal and the
Skylake-Volta systems.
4.2 Preparatory work
After porting, we performed analysis and optimization work on all codes for a comparable amount of time, hence,
the numbers presented below should provide a representative relative picture of the performance one may expect
from the various approaches, when starting from a sequential C++ PIC code. No specific optimizations were
necessary for the various versions of the code, except for a false sharing issue encountered with the RAJA
version. We first consider a single process and run 3 iterations to measure the single core performance of the
PIC routine. We choose the number of particles to be 10 million on a 16 by 8 by 8 mesh (i.e. Ng = 16 · 8 · 8)
and use splines of degree 3 (and 2) which we keep constant for all the runs discussed in the following. With
these parameters, the plain C++ code runs on a single IvyBridge core in about 7.5 seconds per iteration.
The PIC routine implements the particle-to-mesh transfer for which the spline basis for the finite element
description of the fields needs to be evaluated at the particle position—or integrated over the particle trajectory
in the time step. This requires the localization of the particles within the mesh, which is done via modulo
operations. A straight-forward optimization was to cache the modulo computations for each iteration, thereby
replacing computation with memory storage and reuse. This reduced the processing time by about a factor of
2 for all the implementations.
Note that we did not change the fundamental data structure used to store the particles. We use a C++ class
(struct), hence, the ensemble of particles is stored as an array of structures. Alternatively, one could consider
using a structure of arrays or a mix of both, potentially improving the performance due to better vectorization.
However, finding optimal particle data structures for PIC codes is still the subject of ongoing research, cf. e.g.,
Ref. [21], and is beyond the scope of the present work.
9
4.2.1 False sharing mitigation
During the development work it was found that the RAJA-based code shows the worst scaling on the CPU, which
is due to sub-optimal memory access, leading to false sharing of cached data between threads. The LIKWID
performance tool [28] was used successfully to shed light on the cause. Running a “FALSE SHARING” analysis
revealed that the OpenMP-based code has about 200 MB of last level cache (LLC) hits when using 10 cores
whereas the Kokkos code only has about 10 MB. However, for the RAJA-CPU implementation, a much larger
value of 18 GB was determined. This metric refers to the amount of memory the processor has to synchronize
between cores via the last level cache (L3 cache), because data present in the L1 or L2 caches of one core were
modified by other cores at the same time, a situation known as false sharing.
Ideally, threads on different CPU cores work on data that are far away in memory compared to the size of
a cache line, which is typically 8 doubles on x86 64 platforms. Our algorithm and implementation retrieves the
position, velocity and charge of a particle from a large array. It then works with local variables to compute
the new state of the particle before updating the large array. With OpenMP, the local variables are defined as
thread private. With Kokkos-CPU, we use the per thread scratch memory. Therefore, the memory allocated to
each thread is allocated as one block.
A very early version of our RAJA-CPU code used a simple #threads×3 View for the local variables, leading
to false sharing, see listing 1. The approach taken to mitigate this situation is known as cache alignment or
padding. In order to avoid threads copying neighbor values into their caches, ghost values are added between
the values. Instead of having a #threads× 3 View, we allocate a #threads × 8 View where the first 3 values
are used as before and the next 5 are never touched. They serve as padding to make sure the core only copies
the relevant elements into its cache, and not its neighbor’s vector. See the listing 2 for the modified code with
padding. Padding improves the scaling significantly, however, drawbacks are higher memory requirements and
transfers. The final and best solution is to create a situation similar to OpenMP and Kokkos-CPU, which
allocate local variables as contiguous blocks per thread. To do so, we regrouped all the memory needed per
thread in order to avoid any false sharing, as shown in listing 3. This implementation scales moderately well,
as reported in the next section.
Listing 1: Code version with false sharing.
1 this->d xi = memoryManager::allocate<double>(N THREADS*3);
2 this->d vt = memoryManager::allocate<double>(N THREADS*3);
3
4 RAJA::forall<RAJA::omp parallel for exec>(RAJA::RangeSegment(0, N THREADS),
5 [*this] RAJA DEVICE (int i) {
6 this->raja private raja array(i).raja xi. RajaVector1DType();
7 new(&this->raja private raja array(i).raja xi) RajaVector1DType(&(this->d xi[i*3]), 3);
8 this->raja private raja array(i).raja box. RajaVector1DType();
9 new(&this->raja private raja array(i).raja box) RajaVector1DType(&(this->d vt[i*2]), 2);
10 {...}
11 });
Listing 2: Code version with padding.
1 this->d xi = memoryManager::allocate<double>(N THREADS*8);
2 this->d vt = memoryManager::allocate<double>(N THREADS*8);
3
4 RAJA::forall<RAJA::omp parallel for exec>(RAJA::RangeSegment(0, N THREADS),
5 [*this] RAJA DEVICE (int i) {
6 this->raja private raja array(i).raja xi. RajaVector1DType();
7 new(&this->raja private raja array(i).raja xi) RajaVector1DType(&(this->d xi[i*3]), 3);
8 this->raja private raja array(i).raja box. RajaVector1DType();
9 new(&this->raja private raja array(i).raja box) RajaVector1DType(&(this->d vt[i*2]), 2);
10 {...}
11 });
Listing 3: Code version with per-thread contiguous memory.
1 shared size = 1362;
2 this->memory pool = new double[N THREADS*shared size]();
3
4 RAJA::forall<RAJA::omp parallel for exec>(RAJA::RangeSegment(0, N THREADS),
5 [&,this] (int i) {
6 int memory position = i*shared size;
7 raja private raja array(i).raja xi. RajaVector1DType();
8 new(&raja private raja array(i).raja xi) RajaVector1DType(&(memory pool[memory position]), 3);
9 memory position += 3;
10 raja private raja array(i).raja vt. RajaVector1DType();
10
 1
 1  2  4  8  16  20
T
im
e 
[s
]
Number of CPU Cores
OpenMP
OpenACC
Kokkos
RAJA
OpenMP/Kokkos hybrid
ideal
Figure 1: Log-log plot of the compute times per iteration as functions of the number of CPU cores. We compare
the performance of Kokkos, RAJA, and OpenACC on the CPU to plain OpenMP. Moreover, the scaling curve
of the hybrid code that uses OpenMP directive-based loop-parallelism on data stored in Kokkos views is shown.
The codes were run on a 20-core Intel Xeon Gold 6148 2.4GHz (Skylake) CPU. The times shown were averaged
over 10 runs.
11 new(&raja private raja array(i).raja vt) RajaVector1DType(&(memory pool[memory position]), 2);
12 memory position += 2;
13 {...}
14 });
4.3 Performance results
In this section, we provide a performance comparison of the different implementations on the Skylake CPU, and
the K40m, P100, and V100 GPUs, representing state-of-the-art hardware at the time of writing (cf. Sec. 4.1 for
details on the hardware). Fig. 1 shows for each implementation in terms of the wall clock time per iteration the
parallel scaling on the CPU. Fig. 2 shows the performance results from runs on the GPUs for comparison.
4.3.1 CPU
On the CPU, the OpenACC implementation turns out to be the fastest and scales nearly ideally up to the
full 20 Skylake cores with a speedup of about 19. Following closely, the next-ranked CPU code is the plain
OpenMP implementation which as well does show a near-ideal parallel scaling and speedup of about 18. Overall,
OpenMP is performing very similarly to OpenACC, it is only about 2.8% slower on 20 cores. † The Kokkos
implementation, in comparison and averaged over the different core numbers under consideration, is about a
factor of 1.21 slower than the OpenACC code which we attribute to Kokkos-internal overhead. Obviously,
the performance and also the scaling of the RAJA code is the worst in the present comparison, although
the per-thread memory management was already improved (cf. Sec. 4.2.1). Presumably, this could be further
†Note that we used PGI to compile the OpenACC code and GCC to compile the other codes, each with aggressive optimization
flags enabled (cf. Sec. 3.1).
11
 0
 0.5
 1
 1.5
 2
Kokkos RAJA OpenACC CUDA
Data Transfer
Computation
T
im
e 
[s
]
20-core Skylake CPU
K40m GPU
P100 GPU
V100 GPU
Figure 2: Plot of the times per iteration for various GPU models. We compare the performance of Kokkos,
RAJA, and OpenACC to CUDA. In addition to the compute time the time necessary for one data transfer
is shown. The codes were run on a NVIDIA K40m, a NVIDIA P100, and a NVIDIA V100 GPU. For direct
comparison, a horizontal line indicates the best result obtained on the 20-core Intel Xeon Gold 6148 2.4GHz
(Skylake) multicore CPU, cf. Fig. 1. The times shown for the GPUs were averaged over 200 runs.
optimized, however, this would require a considerable performance tuning effort which is exactly what one wants
to avoid when choosing to build the implementation based on a performance portability framework. The hybrid
implementation which uses plain OpenMP directives in combination with Kokkos views shows virtually identical
performance as the plain OpenMP implementation, with a parallel speedup of 19.18 on 20 cores. Hence, it is
possible to optimize certain critical parts of a code in situations when the overhead from the Kokkos parallel
execution is not acceptable.
4.3.2 GPU
Turning towards the GPUs, in addition to the compute time, we also have to consider the time for data transfers
between the host and the device. In a full-fledged MPI-parallel PIC code, the operation on the particle data is
embarrassingly parallel and only needs synchronization when particles are redistributed between processes due
to particle sorting for reasons of data locality. Other occasions for the exchange of particle data between host
and device are the initial setup or regular checkpointing. The frequency of such data transfers depends on the
characteristics of a particular setup. Only the field data needs to be synchronized in each step, however, that
data is almost negligibly small compared to the particle data. For our study, we therefore measure separately
the compute time and the time needed for the data transfer, where we synchronized the data between host and
device in each time step. The total time in a realistic scenario will usually be close to the pure compute time
due to infrequent data transfers.
What concerns the data transfer time, it is important to first recall that the three GPU models under
consideration use different interconnects. The P100 is connected via the NVLink communication bus with a
transfer rate of 20 gigatransfers (GT) per second, while the K40m and the V100 use a PCIe 3.0 bus with a
transfer rate of 8 GT/s. Therefore, the transfer times are roughly 2 times smaller with NVLink, as can be
clearly seen from the plot. The compute and transfer times shown in Fig. 2 were determined using the NVIDIA
12
nvprof profiler.
We now compare the compute time for the various implementations on the three GPU models. On the
oldest GPU hardware, the K40m, the rank order is given by OpenACC, CUDA, RAJA, and Kokkos. CUDA is
about 67% slower than OpenACC. The RAJA runs take a factor of 2.02 and the Kokkos runs take a factor of
2.30 longer than the times measured for OpenACC. This order changes on the more recent hardware.
On the P100 GPU, the CUDA code is the fastest, followed by OpenACC (1.23), Kokkos (1.95), and RAJA,
the latter being slower than CUDA by a factor of 2.18.
On the V100 GPU, again CUDA delivers the fastest result, followed by OpenACC (1.20), RAJA (2.51), and
finally Kokkos which is a factor of 3.53 slower than CUDA. Note that in the OpenACC case, in particular, the
transfer time is larger on the V100 (PCIe) than on the P100 (NVLink), which makes the OpenACC code overall
run the fastest on the P100 GPU. Moreover, the compute time is by far the smallest on the V100 GPU for
the CUDA implementation, where the data transfer takes about a factor of 2.3 longer than the computation.
In comparison to the best result from the 20-core Skylake CPU, the computation is significantly faster on the
V100 in all cases.
The fact that the CUDA code becomes relatively faster when going to more recent GPUs is caused by
an implementation detail. Atomic updates have received significantly improved hardware support with recent
GPUs. Our CUDA implementation uses atomic additions on a global array to perform the reductions, whereas
our OpenACC implementation uses atomic updates on per-gang private array copies followed by a final reduction
across the gangs, reducing the concurrent accesses compared to the CUDA approach and therefore showing
relatively better performance on the older K40m GPU. The OpenACC per-gang solution was chosen because
it demonstrated much better performance on the multicore CPU compared to a solution with a single global
array, while showing rather similar performance on the GPUs.
5 Summary
This paper presents a performance and usability assessment of two major performance portability frameworks,
Kokkos and RAJA, which are considered for the future development of a high-performance C++ implementation
of a particle-in-cell approach to solve the Vlasov–Maxwell equations. We focus on the node-local part of the
computation, comparing generic implementations based on Kokkos, RAJA, and OpenACC, to CPU- and GPU-
specific implementations based on OpenMP and CUDA, respectively.
5.1 Usability assessment
Considering programmability and usability, Kokkos and RAJA offer rather similar concepts and levels of abstrac-
tion. Both frameworks provide generic building blocks for parallel programming, targeting CPU and accelerator
platforms. Code specialization to a specific processor is done at compile time based on C++ template meta
programming, and is thus hidden from the user. Regarding the features necessary for our PIC application
such as vector reductions and scratch memory, Kokkos offers all of them whereas for RAJA some additional
implementation work was needed.
Kokkos provides useful default values for the majority of relevant template parameters. If not specified oth-
erwise, the code is compiled for the “fastest” architecture available, as defined by the ordering CUDA, OpenMP,
pthreads, and serial. With the default parameters, the execution space (CPU or GPU) of a parallel section
will automatically match the memory space (host memory or device memory) which significantly facilitates
implementation and simplifies the code. Moreover, with Kokkos we managed avoiding architecture-specific pre-
processor macros in the code. The Kokkos project is very well documented and the developers are supportive on
GitHub, according to our experience. As a plus, a simple architecture-specification string set by the user in the
Kokkos Makefile automatically enables the correct set of hardware-specific optimization flags for the compiler.
RAJA does not implement such architectural default values for the template parameters. Hence, the user
has to specify the architecture-specific parameters explicitly, e.g. by employing preprocessor macros. RAJA
required us to have multiple of such branches in the code for CPU and GPU compilation. While this adds some
fine-grained control options, the underlying paradigm of writing a single source code for multiple architectures
gets somewhat compromised.
It should be noted that both the Kokkos-GPU and RAJA-GPU codes for our PIC application were obtained
“for free” in the sense that all development started out on multi-core CPUs and no GPU-specific code was
ever written (except for some preprocessor branches to set template parameters in the case of RAJA), thereby
confirming the idea of portability at good performance.
13
5.2 Performance assessment
In total, seven C++ versions of the PIC routine were developed, namely a sequential one, and parallel codes using
OpenMP, OpenACC, CUDA, Kokkos, hybrid OpenMP/Kokkos, and RAJA. The performance was measured
on a Skylake multicore CPU and on three NVIDIA GPUs from different hardware generations. Only limited
effort was spent on optimizing each individual implementation.
On the CPU, Kokkos shows acceptable overhead compared to plain OpenMP of about 14% and scales nearly
as well as plain OpenMP. Moreover, we have also demonstrated that the overhead of the Kokkos framework
over a pure OpenMP implementation is not linked to the use of Kokkos views for data management. Therefore,
it is possible to mitigate the performance penalty of the Kokkos framework by manually optimizing critical
kernels. The results we have obtained with RAJA appear a bit inferior on the CPU since the parallel scalability
is hampered by a false sharing issue that could only partially be solved. On the other hand, the performance
on the GPU is comparable or even slightly better than the one of Kokkos across different GPU hardware
generations. On state-of-the-art GPUs, CUDA turns out to be the fastest choice, followed by OpenACC,
Kokkos and RAJA. Overall, OpenACC is very competitive on both, the CPU and the GPU, and should be
kept in mind as an option for performance portability, especially with improved compiler support that is to be
expected in the near future (GCC and LLVM, in addition to PGI).
In summary, both Kokkos and RAJA seem mature, usable for production codes, and keep their promise to
provide performance portability across different hardware architectures.
References
[1] Strohmaier E, Dongarra J, Simon H, Meuer M. Top 500. The List.. https://www.top500.org; 2019.
Accessed: 2019/03/18.
[2] Forum TM. MPI: A Message Passing Interface. https://www.mpi-forum.org/docs/; 1993.
[3] Wolfe M. Compilers and More: MPI+X. https://www.hpcwire.com/2014/07/16/compilers-mpix/;
2014. Accessed: 2019/03/21.
[4] OpenMP Architecture Review Board . OpenMP Application Programming Interface.
https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf; 2018. Ac-
cessed: 2019/03/21.
[5] OpenACC-Standard.org . The OpenACC Application Programming Interface Version 2.7.
https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf; 2018. Accessed:
2019/03/21.
[6] Khronos OpenCL Working Group . The OpenCL Specification Version 2.2.
https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_API.pdf; 2019. Accessed:
2019/03/21.
[7] Munshi A, Gaster B, Mattson TG, Fung J, Ginsburg D. OpenCL Programming Guide. Addison-Wesley
Professional. 1st ed. 2011.
[8] Fuhrer O, Osuna C, Lapillonne X, et al. Towards a performance portable, architecture agnostic implemen-
tation strategy for weather and climate models. Supercomputing Frontiers and Innovations 2014; 1(1).
[9] Clement V, Ferrachat S, Fuhrer O, et al. The CLAW DSL: Abstractions for Performance Portable Weather
and Climate Models. In: PASC ’18. ACM; 2018; New York, NY, USA: 2:1–2:10
[10] Demeshko I, Watkins J, Tezaur IK, et al. Toward performance portability of the Albany finite element anal-
ysis code using the Kokkos library. The International Journal of High Performance Computing Applications
2019; 33(2): 332-352. doi: 10.1177/1094342017749957
[11] Edwards HC, Trott CR, Sunderland D. Kokkos: Enabling manycore performance portability through
polymorphic memory access patterns. Journal of Parallel and Distributed Computing 2014; 74(12): 3202–
3216.
[12] Hornung R, Keasler J. The RAJA portability layer: overview and status. tech. rep., Lawrence Livermore
National Lab.(LLNL), Livermore, CA (United States); : 2014.
[13] Zenker E, Worpitz B, Widera R, et al. Alpaka - An Abstraction Library for Parallel Kernel Acceleration.
In: IEEE Computer Society; 2016.
14
[14] Matthes A, Widera R, Zenker E, Worpitz B, Huebl A, Bussmann M. Tuning and optimization for a variety
of many-core architectures without changing a single line of implementation code using the Alpaka library.
In: ; 2017.
[15] Martineau M, McIntoshSmith S, Gaudin W. Assessing the performance portability of modern parallel
programming models using TeaLeaf. Concurrency and Computation: Practice and Experience 2017; 29(15):
e4117.
[16] Martineau M, McIntoshSmith S, Boulton M, Gaudin W. An evaluation of emerging many-core parallel
programming models. In: ACM. ; 2016: 1–10.
[17] Sunderland D, Peterson B, Schmidt J, Humphrey A, Thornock J, Berzins M. An Overview of Performance
Portability in the Uintah Runtime System through the Use of Kokkos. In: ; 2016: 44-47
[18] Kraus M, Kormann K, Morrison PJ, Sonnendru¨cker E. GEMPIC: Geometric electromagnetic particle-in-cell
methods. Journal of Plasma Physics 2017; 83(4).
[19] Selalib. http://selalib.gforge.inria.fr/; 2014–2019. Accessed: 2019/03/21.
[20] Buffa A, Rivas J, Sangalli G, Va´zquez R. Isogeometric Discrete Differential Forms in Three Dimensions.
SIAM Journal on Numerical Analysis 2011; 49: 818–844. doi: 10.1137/100786708
[21] Barsamian Y, Hirstoaga SA, Violard E. Efficient data structures for a hybrid parallel and vectorized
particle-in-cell code. In: IEEE. ; 2017: 1168–1177.
[22] The Kokkos ProgrammingGuide. https://github.com/kokkos/kokkos/wiki/The-Kokkos-Programming-Guide;
2019. Accessed: 2019/03/21.
[23] RAJA User Guide. https://raja.readthedocs.io/en/master; 2016–2019. Accessed: 2019/03/21.
[24] Go´mez-Luna J. Atomic Operations across GPU generations. University Lecture ECE 408,
http://ece408.hwu-server2.crhc.illinois.edu/Shared%20Documents/Slides/Presentation_ECE408_JGL.pdf;
2015.
[25] NVIDIA Corporation . TESLAK40 GPUActive Accelerator. https://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf;
2013. Accessed: 2019/03/21.
[26] NVIDIA Corporation . NVIDIA Tesla P100. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf;
2016. Accessed: 2019/03/21.
[27] NVIDIA Corporation . NVIDIA Tesla V100. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf;
2017. Accessed: 2019/03/21.
[28] Treibig J, Hager G, Wellein G. Likwid: A lightweight performance-oriented tool suite for x86 multicore
environments. In: IEEE. ; 2010: 207–216.
A Numerical model
This section gives a more detailed description of the equations implemented in our case study. The basis for
particle-in-cell methods are the characteristic equations associated to (1) are given by
dX
dt
= V,
dV
dt
= (E(X, t) +V×B(X, t)) , (4)
along which the value of the distribution function remains constant over time. Particle methods represent the
distribution function by a number Np of macro-particles that evolve according to the characteristic equations.
The macro-particles are characterized by their position in phase space and a weight, that is particle p is
represented by (xp,vp, wp). The particle position and velocity are dynamic variables following the characteristic
equations (4) and the weights are constant in time (note that time-dependent weights are also possible in
advanced setups but are not discussed here). In order to compute the velocity moments of the distribution
function needed to solve Maxwell’s equations, a representation of the distribution function is necessary. This is
usually given by
fs(x,v, t) =
Nps∑
p=1
wpS(x− xp)δ(v − vp), (5)
15
where S can either be a δ distribution or a smoothing kernel. Finally, the fields are represented on a grid and
Maxwell’s equations are solved by finite elements or finite differences. Multiple discretization schemes have been
discussed in the literature which have similar building blocks based on solutions of the field equations and loops
over the particles with field evaluations for the particle push and current or charge depositions to assemble the
source terms for the Maxwell’s equations. Since usually the number of particles is much larger than the number
of degrees of freedom in the description of the fields, the computational complexity is dominated by the particle
loop.
The motivation of our work is an efficient and portable implementation of the scheme proposed in Ref. [18]
for the Vlasov–Maxwell equations. The scheme uses compatible spline finite elements as proposed by Buffa et
al. [20] for the fields and a Klimontovic distribution for the particle distribution (i.e. S = δ). Basis functions of
different order are used to represent the magnetic and the electric field, respectively. Let us denote by Λ1,ki the
basis functions for component k, k = 1, 2, 3, of the electric field associated with the grid point i, i = 1, . . . , Ng.
In the same way, we denote by Λ2,ki the basis functions for component k, k = 1, 2, 3, of the magnetic field
associated with grid point i, i = 1, . . . , Ng. Then, the semi-discretized fields are represented as
E˜k(x, t) =
Ng∑
i=1
ek,i(t)Λ
1,k
i (x), B˜k(x, t) =
Ng∑
i=1
bk,i(t)Λ
2,k
i (x). (6)
with ek =
(
ek,1, . . . , ek,Ng
)⊤
and bk =
(
bk,1, . . . , bk,Ng
)⊤
being the dynamic variables. Given a certain degree
p, the basis functions are constructed in the following way:
• Λ1,k is constructed as a tensor product of splines of degree p − 1 in xk and of degree p along the other
two dimensions,
• Λ2,k is constructed as a tensor product of splines of degree p in xk and of degree p − 1 along the other
two dimensions.
Furthermore, equation (2a) is solved in weak form and (2b) is solved in strong form. Inserting the representation
of the fields and particles into the equations yields the following semi-discrete equations of motion
dX
dt
= V, (7a)
dV1
dt
=
q
m
(
Λ
1,1(X)e1 +V2MqΛ
2,3(X)b3 −V3MqΛ
2,2(X)b2
)
, (7b)
dV2
dt
=
q
m
(
Λ
1,2(X)e2 +V3MqΛ
2,1(X)b1 −V1MqΛ
2,3(X)b3
)
, (7c)
dV3
dt
=
q
m
(
Λ
1,3(X)e3 +V1MqΛ
2,2(X)b2 −V2MqΛ
2,1(X)b1
)
, (7d)
de
dt
= M−11
(
C
⊤
M2b(t)− Λ
1(X)⊤MqV
)
, (7e)
db
dt
= −Ce(t). (7f)
The dynamic variables are given by (X,V, e,b), where X =
(
x⊤1 , . . . ,x
⊤
Np
)⊤
and V =
(
v⊤1 , . . . ,v
⊤
Np
)⊤
,
e = (e⊤1 , e
⊤
2 , e
⊤
3 )
⊤, b = (b⊤1 ,b
⊤
2 ,b
⊤
3 )
⊤. Furthermore, C represents the discrete curl matrix, Mj the finite element
mass matrix for basis j = 1, 2, and Mq is a diagonal matrix with the product of the particle weight and charge
on the diagonals. The matrix Λj = (Λj,1,Λj,2,Λj,3) is a Np × (3Ng) matrix with entries (Λj,k)p,i = Λ
j,k
i (xp).
These equations can either be solved by some implicit time-stepping scheme or by a splitting of the equations
into several subsystem that are chosen in such a way that the substeps can be solved explicitly. Such an explicit
time stepping scheme, called Hamiltonian splitting, has been proposed in Ref. [18]. For this study, we implement
only one of the building blocks of the complete scheme that, however, contains the main features of the overall
schemes. The building blocks solves the following part of the equations
dX1
dt
= V1, (8a)
dV2
dt
= −
q
m
V1MqΛ
2,3(X)b3, (8b)
dV3
dt
=
q
m
V1MqΛ
2,2(X)b2, (8c)
M1,1
de1
dt
= −Λ1,1(X)⊤MqV1, (8d)
16
by explicit integration over time as
x1,p(tm+1) = x1,p(tm) + ∆tv1,p, (9a)
v2,p(tm+1) = v2,p(tm)−
q
m
Ng∑
i=1
b3,i(tm)
∫ tm+1
tm
Λ2,3(xp(τ))v1,p(tm)dτ, (9b)
v3,p(tm+1) = v3,p(tm) +
q
m
Ng∑
i=1
b2,i(tm)
∫ tm+1
tm
Λ2,2(xp(τ))v1,p(tm)dτ, (9c)
M1,1e1(tm+1) = M1,1e1(tm)− q
Np∑
p=1
wp
∫ tm+1
tm
Λ
1,1(xp(τ))v1,p(tm)dτ. (9d)
where xp(τ) = (x1,p(tm) + τv1,p(tm), x2,p(tm), x3,p(tm))
⊤. The computationally most expensive part is to eval-
uate the integrals over the basis functions. If the velocity v1,p(tm) is nonzero, we can transform the integral
from τ to σ = x1,p(tm) + τv1,p(tm). Then equation (9b) reads
v2,p(tm+1) = v2,p(tm)−
q
m
Ng∑
i=1
b3,i(tm)
∫ x1,p(tm+1)
x1,p(tm)
Λ2,3(σ, x2,p(tm), x3,p(tm))dσ. (10)
The other integrals can be transformed in a similar way. Next, we apply the tensor product of one-variate
splines N q(x) of degree q = p and q = p− 1 as proposed before. Then, the update of the particle velocities and
the contribution of particle p to component i of the integrated current ji reads
v2,p(tm+1) = v2,p(tm)−
q
m
Ng∑
i=1
b3,i(tm)
∫ x1,p(tm+1)
x1,p(tm)
N
p−1
i1
(σ)dσNp−1i1 (x2,p(tm))N
p
i3
(x3,p(tm)) (11a)
v3,p(tm+1) = v3,p(tm) +
q
m
Ng∑
i=1
b2,i(tm)
∫ x1,p(tm+1)
x1,p(tm)
N
p−1
i1
(σ)dσNpi1(x2,p(tm))N
p−1
i3
(x3,p(tm)), (11b)
ji = ji + qwp
∫ x1,p(tm+1)
x1,p(tm)
N
p−1
i1
(σ)dσNpi1(x2,p(tm))N
p
i3
(x3,p(tm)), (11c)
where we decompose the unique index i for the basis functions into a three dimensional index (i1, i2, i3) reflecting
the tensor product structure of the basis. Since the basis functions have compact support, we identify first in
which grid cell the particle is located and then we evaluate the q + 1 basis functions (of degree q) with support
on this cell. To evaluate the splines, we use their representation in piecewise polynomial form (pp-form) using
Horner’s algorithm at the particle position (normalized to the grid cell). For the evaluation of the integral,
we evaluate the primitive basis function at the new and old position. This can either be implemented using
numerical quadrature or based on the primitive function of Np−1. We choose the latter approach and denoting
the primitive of Np−1 by N , we get
∫ x1,p(tm+1)
x1,p(tm)
N
p−1
i1
(σ)dσ = Ni1(x1,p(tm+1))−Ni1(x1,p(tm)). (12)
We note that the support of a spline of degree p − 1 is p cells while the support of this integral is variable
depending on how large the difference x1,p(tm+1) − x1,p(tm) between the old and new particle position is. We
note that this results in a variable loop length that makes it harder to design data structures for efficient and
vectorized current deposition of this particular scheme.
B Color-coded implementation comparison of the PIC routine
The source code shown in listing 4 contains the key features necessary to implement the PIC routine for each
parallel programming model under investigation. In particular the color coding is as follows: Black for common
code, blue for OpenMP, brown for OpenACC, red for Kokkos, green for RAJA, purple for CUDA, and orange
for hybrid OpenMP/Kokkos. Note that the listing presents a concatenation from the different source code parts
for illustration purposes.
Listing 4: Color-coded key implementation features of the PIC routine for direct comparison.
17
1 // Class particle group: Basic data structure saving
2 // the information on all particles of one species
3 class particle group base Kokkos {
4 KOKKOS FUNCTION particle group base Kokkos(char*, long, bool);
5 // common particle weight
6 double common_weight;
7 ViewScalarDoubleType view common weight;
8 ViewScalarDoubleType::HostMirror view common weight host;
9
10 // array storing position in phase space and weight for each of n_particles
11 double ** particle_array;
12 ViewVector2DType view particle array;
13 ViewVector2DUnmanagedType view particle array unmanaged;
14 ViewVector2DType::HostMirror view particle array host;
15
16 // information about the species ( mass and charge )
17 double m; //mass of a single particle
18 ViewScalarDoubleType view m;
19 ViewScalarDoubleUnmanagedType view m unmanaged;
20 ViewScalarDoubleType::HostMirror view m host;
21
22 // charge
23 double q; // charge of a single particle
24 ViewScalarDoubleType view q;
25 ViewScalarDoubleUnmanagedType view q unmanaged;
26 ViewScalarDoubleType::HostMirror view q host;
27
28 // number of particles in the particle group
29 long n_particles ;
30 ViewScalarLongType view n particles;
31 ViewScalarLongUnmanagedType view n particles unmanaged;
32 ViewScalarLongType::HostMirror view n particles host;
33 // Next the accessor functions are defined
34 { ... }
35 }
36
37 class particle group base RAJA {
38 particle group base RAJA(string, long, bool);
39 // common particle weight
40 double common_weight;
41
42 // array storing position in phase space and weight for each of n_particles
43 double ** particle_array;
44 double* d particle array;
45 RajaVector2DType raja particle array;
46
47 // information about the species ( mass and charge )
48 double m; //mass of a single particle
49 double *d m;
50 RajaVector1DType raja m;
51
52 double q; // charge of a single particle
53 double *d q;
54 RajaVector1DType raja q;
55
56 // number of particles in the particle group
57 long n_particles ;
58 long *d n particles;
59 RajaVector1DLongType raja n particles;
60 // Next the accessor functions are defined
61 { ... }
62 }
63
64 class particle group base cuda {
65 particle group base cuda(string, long, bool);
66 // common particle weight
67 double common_weight;
68
69 // array storing position in phase space and weight for each of n_particles
70 double ** particle_array;
71 double* d particle array;
18
72
73 // information about the species ( mass and charge )
74 double m; //mass of a single particle
75 double *d m;
76
77 double q; // charge of a single particle
78 double *d q;
79
80 // number of particles in the particle group
81 long n_particles ;
82 long *d n particles;
83 // Next the accessor functions are defined
84 { ... }
85 }
86
87 void hamiltonian_splitting :: pic_routine (double dt ,
88 int n threads, int n teams
89 int n threads
90 int n threads) {
91 int total num threads = omp get max threads();
92 //OpenMP: initialisation of the reduction array
93 double j dofs[total num threads][this->part mesh coupling.n dofs];
94 for(int i=0; i<total num threads; i++) {
95 for(int j=0; j<this->part mesh coupling.n dofs; j++) {
96 j dofs[i][j] = 0.0;
97 }
98 }
99
100 //Kokkos: initialisation of the reduction array
101 for(long i=0; i<this->view j dofs local host.size(); i++) {
102 this->view j dofs local host(i) = 0.0;
103 }
104 Kokkos::deep copy(this->view j dofs local, this->view j dof local host);
105
106 int total num threads = std::max(atoi(std::getenv("OMP NUM THREADS")), 1);
107 //RAJA: initialisation of the reduction array
108 #if defined(RAJA ENABLE CUDA) && defined(I USE CUDA)
109 RAJA::forall<RAJA::cuda exec<256>>(RAJA::RangeSegment(0, this->part mesh coupling.n dofs),
110 [*this] RAJA DEVICE (int i) {
111 if(i<this->part mesh coupling.raja n dofs(0)) {
112 this->raja j dofs local(i) = 0.0;
113 }
114 });
115 #else
116 RAJA::forall<RAJA::omp parallel for exec>(RAJA::RangeSegment(0, this->part mesh coupling.n dofs),
117 [this] (int i) {
118 if(i<part mesh coupling.raja n dofs(0)) {
119 raja j dofs local(i) = 0.0;
120 }
121 });
122 #endif
123
124 int M blocks = (this->particle group.group[0].n particles + n threads-1)/n threads; //#of blocks
125 //CUDA: no reduction array, atomic operations are used instead
126
127 int total num threads = omp get max threads();
128 //Hybrid: initialisation of the reduction array
129 double j dofs[total num threads][this->part mesh coupling.view n dofs host(0)];
130 for(int i=0; i<total num threads; i++) {
131 for(int j=0; j<this->part mesh coupling.view n dofs host(0); j++) {
132 j dofs[i][j] = 0.0;
133 }
134 }
135 //Hybrid: set all unmanaged views
136 this->view i weight unmanaged = this->view i weight;
137 {...}
138
139 int total num gangs = input n gangs();
140 int total num vectors = input n vectors();
141 //OpenACC: initialisation of the reduction array
142 double **j dofs = (double**)malloc(total num gangs * sizeof(double *));
19
143 for(int i=0; i<total num gangs; i++) {
144 j dofs[i] = (double*)malloc(this->part mesh coupling.n dofs * sizeof(double));
145 for(int j=0; j<this->part mesh coupling.n dofs; j++) {
146 j dofs[i][j] = 0.0;
147 }
148 }
149
150 #pragma omp parallel
151 {
152 //OpenMP: initialisation of the local variables
153 {...}
154 for(long i sp = 0; i sp<this->particle group.n species; i sp++) {
155 //OpenMP: get thread ID, set qoverm
156 {...}
157 #pragma omp for
158 for(long i part=0; i part<this->particle group.group[i sp].n particles; i part++) {
159 {operator openmp}
160 }
161 #pragma omp for
162 for(int i=0; i<this->part mesh coupling.n dofs; i++) {
163 for(int thread=1; thread<total num threads; thread++) {
164 j dofs[0][i] += j dofs[thread][i];
165 }
166 }
167 }
168 }
169
170 //Kokkos: set dt
171 {...}
172 for(long i sp = 0; i sp<this->particle group.view n species host(0); i sp++) {
173 //Kokkos: set view i sp and shared size=n shared*sizeof(double)
174 {...}
175 //Kokkos: set parallel policy
176 auto policy = Kokkos::TeamPolicy<pic routine,
ExecSpace>((this->particle group.view group host(i sp).n particles + n threads-1)/n threads,
n threads)
177 .set scratch size(1,Kokkos::PerThread(shared size));
178 //Kokkos: parallel computation
179 Kokkos::parallel for(policy, *this);
180 }
181
182 //RAJA: set dt
183 {...}
184 for(long i sp = 0; i sp<this->particle group.n species; i sp++) {
185 //RAJA: set view i sp and shared size
186
187 #if defined(RAJA ENABLE CUDA) && defined(I USE CUDA)
188 //RAJA: GPU parallel computation
189 RAJA::forall<RAJA::cuda exec<256>>(RAJA::RangeSegment(0, this->particle group.group[0].n particles),
190 [=,*this] RAJA DEVICE (int i part) {
191 {operator RAJA GPU} //RAJA GPU: uses atomics instead of reduction
192 });
193 #else
194 //RAJA CPU: raja private raja array is our solution to false-sharing
195 //It is an array of struct such that each thread gets contiguous memory to use as scratch
196
197 //RAJA: set parallel policy
198 using fdPolicy = RAJA::KernelPolicy< RAJA::statement::For< 0, RAJA::omp parallel for exec,
RAJA::statement::Lambda<0> > >;
199
200 //RAJA CPU: initialisation of the handmade reduction 2D array
201 RAJA::kernel<fdPolicy>(RAJA::make tuple(RAJA::RangeSegment(0, total num threads)),
202 [&,this] (int i part) {
203 for(int i=0; i<this->part mesh coupling.n dofs; i++)
204 this->raja private raja array(i part).raja handmade reduce(i) = 0.0;
205 });
206
207
208 //RAJA: CPU parallel computation
209 RAJA::kernel<fdPolicy>(RAJA::make tuple(RAJA::RangeSegment(0,
this->particle group.group[0].n particles)),
210 [&,this] (int i part) {
211 {operator RAJA CPU}
20
212 });
213
214 //RAJA: CPU handmade reduction
215 RAJA::kernel<fdPolicy>(RAJA::make tuple(RAJA::RangeSegment(0, this->part mesh coupling.n dofs)),
216 [&,this] (int i part) {
217 for(int i=0; i<total num threads; i++)
218 this->d j dofs local[i part] += this->raja private raja array(i).raja handmade reduce(i part);
219 });
220 #endif
221 }
222
223 //CUDA: parallel computation
224 operator cuda<<<M blocks,n threads>>>(0.05, 0, this->d particle group, this->d part mesh coupling,
225 this->d control vari, this->i weight, this->d bfield dofs, this->d j dofs local,
this->d Lx);
226
227 #pragma omp parallel
228 {
229 //Hybrid: initialisation of the local variables
230 {...}
231 for(long i sp = 0; i sp<this->particle group.view n species unmanaged(0); i sp++) {
232 //Hybrid: get thread ID, set qoverm
233 {...}
234 #pragma omp for
235 for(long i part=0; i part<this->particle group.view group unmanaged(
this->view i species unmanaged(0) ).view n particles unmanaged(0); i part++) {
236 {operator hybrid}
237 }
238 #pragma omp for
239 for(int i=0; i<this->part mesh coupling.view n dofs unmanaged(0); i++) {
240 for(int thread=1; thread<total num threads; thread++) {
241 j dofs[0][i] += j dofs[thread][i];
242 }
243 }
244 }
245 }
246
247 #pragma acc data
248 copy(j dofs[0:total num gangs][0:this->part mesh coupling.n dofs])
249 copyin(this[0:1],
250 this->part mesh coupling,
251 this->part mesh coupling.domain[0:3][0:2],
252 this->part mesh coupling.delta x[0:3],
253 this->part mesh coupling.rdelta x[0:3],
254 this->part mesh coupling.spline 0,
255 this->part mesh coupling.spline 0.d poly coeffs[0:(spline 0 degree+1)*(spline 0 degree+1)],
256 this->part mesh coupling.spline 0.d poly coeffs fpa[0:(spline 0 degree+2)*(spline 0 degree+1)],
257 this->part mesh coupling.spline 1,
258 this->part mesh coupling.spline 1.d poly coeffs[0:(spline 1 degree+1)*(spline 1 degree+1)],
259 this->part mesh coupling.spline 1.d poly coeffs fpa[0:(spline 1 degree+2)*(spline 1 degree+1)],
260 this->part mesh coupling.n grid[0:3],
261 this->part mesh coupling,
262 this->Lx[0:3],
263 this->bfield dofs[0:this->part mesh coupling.n dofs*3],
264 this->particle group,
265 this->particle group.group[0:this->particle group.n species],
266 this->particle group.group[0].sp),
267 copy( this->particle group.group[0].particle array[0:7][0:this->particle group.group[0].n particles])
268 {
269 #pragma acc parallel num gangs(total num gangs) vector length(total num vectors)
270 {
271 //OpenACC: initialisation of the per-gang reduction j dofs, local variables and get gang ID
272 int this gang = pgi gangidx();
273 {...}
274 for(long i sp = 0; i sp<this->particle group.n species; i sp++) {
275 //OpenACC: set qoverm
276 {...}
277 #pragma acc loop
278 for(long i part=0; i part<this->particle group.group[i sp].n particles; i part++) {
279 {operator openacc}
280 }
281 }
21
282 }
283 }
284
285 //OpenMP: copying the reduction’s result
286 for(int i=0; i<this->part mesh coupling.n dofs; i++) {
287 this->j dofs local[i] = j dofs[0][i];
288 }
289
290 //Hybrid: copying the reduction’s result
291 for(int i=0; i<this->part mesh coupling.view n dofs(0); i++) {
292 this->view j dofs local(i) = j dofs[0][i];
293 }
294
295 //OpenACC: second reduction on the gang results
296 for(int i=0; i<total num gangs; i++) {
297 for(int j=0; j<this->part mesh coupling.n dofs; j++) {
298 this->j dofs local[j] += j dofs[i][j];
299 }
300 }
301
302 //OpenACC: free memory
303 for(int i=0; i<total num gangs; i++)
304 free(j dofs[i]);
305 free(j dofs);
306 }
307
308 operator_openmp {
309 KOKKOS FUNCTION void hamiltonian splitting::operator() (const pic routine&, const
Kokkos::TeamPolicy<>::member type & team member) const {
310 operator RAJA GPU {
311 operator RAJA CPU {
312 global void operator cuda(double dt, int i sp, particles *d particle group, particle mesh coupling
*d part mesh coupling, control variate *d control vari, long i weight, double *d bfield dofs, double
*d j dofs, double *d Lx) {
313 operator hybrid {
314 operator openacc {
315 //Kokkos: thread access to special scatter view, used for vector reduction
316 ViewScatterAccessType scatter access = this->scatter view.access();
317
318 //Kokkos: initialise scratch variables
319 ViewVector3ScratchType view x old(team member.thread scratch(1));
320 {...}
321
322 //RAJA GPU: initialisation of the local variables BUT WITH FIXED SIZE
323 //The size is given by the number of dimensions (3) by the support of the spline (4 here for cubic
splines, but should be templated for varying order).
324 double d spline 0[4*3];
325 {...}
326
327 //CUDA: initialisation of the local variables (and qoverm) BUT WITH FIXED SIZE
328 double d spline 0[4*3]; // Size as for RAJA.
329 {...}
330
331 //OpenACC: initialisation of the local variables (and qoverm) BUT WITH FIXED SIZE
332 double d spline 0[4*3]; // Size as for RAJA.
333 {...}
334
335 // Read out particle position and velocity
336 {...}
337 // Then update particle position: X_new (0) = X_old (0) + dt * V(0)
338 x_new [0] = x_old [0] + dt * vi [0];
339 x_new [1] = x_old [1];
340 x_new [2] = x_old [2];
341 // Get charge for accumulation of j
342 {...}
343
344 this->part mesh coupling.add current update v primitive component1 spline 3d feec util (res[this thread],
x old, x new[0], wi[0], qoverm,
345 this->bfield dofs, vi, &util arrays);
346
347 this->part mesh coupling.view add current update v primitive component1 spline 3d feec scratch
(team member,
348 view x old, view x new(0), view wi(0), qoverm,
22
349 this->view bfield dofs, view vi, &(this->scatter view));
350
351 this->part mesh coupling.raja add current update v primitive component1 spline 3d feec util (raja x old,
raja x new(0), raja wi(0), qoverm,
352 this->raja bfield dofs, raja vi, this->d j dofs local, &util raja); //GPU
353
354 this->part mesh coupling.raja add current update v primitive component1 spline 3d feec util pool thread<0>
355 (THREAD, this->raja private raja array(THREAD).raja x old,
356 this->raja private raja array(THREAD).raja x new(0), this->raja private raja array(THREAD).raja wi(0),
357 qoverm, this->raja bfield dofs, this->raja private raja array(THREAD).raja vi,
358 this->raja private raja array(THREAD).raja handmade reduce,
this->raja private raja array(THREAD)); //CPU
359
360 d part mesh coupling->cuda add current update v primitive component1 spline 3d feec util (x old, x new[0],
wi[0], qoverm,
361 d bfield dofs, vi, d j dofs, &util cuda);
362
363 this->part mesh coupling.add current update v primitive component1 spline 3d feec util from view
(j dofs[this thread], x old, x new[0], wi[0], qoverm,
364 &this->view bfield dofs unmanaged, vi, &util arrays);
365
366 this->part mesh coupling.add current update v primitive component1 spline 3d feec util openacc
(j dofs[this gang], x old, x new[0], wi[0], qoverm,
367 this->bfield dofs, vi, &util openacc);
368
369 x_new [0] = fmod(x_new [0] + this ->Lx[0], this ->Lx[0]);
370 this -> particle_group.group [i_sp ]. set_x(i_part , x_new );
371 this -> particle_group.group [i_sp ]. set_v(i_part , vi);
372 }
373
374 void particle_mesh_coupling ::
→֒ add_current_update_v_primitive_component1_spline_3d_feec_util (double *j_dofs ,
375 double *position_old , double position_new , double marker_charge , double qoverm ,
376 double *bfield_dofs ,
377 ViewVector1DUnmanagedType *view bfield dofs unmanaged, // instead of double *bfield_dofs
378 double *vi , struct private_arrays *util_arrays ,
379 struct private raja util raja //GPU
380 struct private raja * util raja //CPU
381 struct private cuda util cuda
382 struct private openacc util openacc
383 ) {
384 // Initialise local variables (and qoverm ) BUT WITH FIXED SIZE
385 {...}
386
387 ViewScatterAccessType view j dofs scatter access = view j dofs scatter->access();
388 // Identify the grid cell (box) where the particle is located and
389 // its normalized position (xi) within the box
390 {...}
391 // Similarly as above , we identify box and normalized position for the
392 // new position (along x( component) only)
393 {...}
394 // Extract box index along the direction component for the old position
395 {...}
396 // In the tensor product basis , the three 1D components are
397 // combined with each other in every possible combination
398 // Therefore , we start by computing the three 1D components
399 {...}
400 // Along x(0), we have to integrate over splines of degree degree -1
401 // We get a contribution to the current in all indices of
402 // splines that are nonzero in either boxnew or boxold
403 // In each box , degree basis function are nonzero
404 // This gives the following number of nonzero values of
405 // the total integral
406 {...}
407 // First we evaluate the primitive of the degree basis functions that are
408 // nonzero at the old and new particle positions respectively
409 // For this we use the pp coefficients of the primitive function
410 // of the splines and evaluate using Horner ’s scheme
411 {...}
412 // Now , we glue everything together
413 // Note that the primitive function is equal to delta_x(component)
414 // in all intervals with index larger that the indices of the
415 // support of the basis function
23
416 {...}
417 // For the other two other directions , we need to evaluate the splines of
418 // degree p and (p-1) at the particle position
419 {...}
420 // We use the pp form and evaluate with Horner ’s scheme
421 {...}
422 // Set index of first basis function that is nonzero at
423 // the particle position
424
425 // Set index of first basis function that is nonzero at
426 // the particle position for the dimension with integration
427 {...}
428 // optimisation : precomputations of the indices
429 {...}
430
431 // loop over all the basis functions that are nonzero at the
432 // particle position and update velocities and current
433 for(long k=0; k<this ->spline_degree +1; k++) {
434 vtt2 = 0.0; vtt3 = 0.0;
435 for (long j=0; j<this -> spline_degree+1; j++) {
436 // Save the 2D spline product to avoid recomputation in the inner loop
437 splinejk = util_arrays ->spline_0 [1][j] * util_arrays -> spline_0 [2][k] *
→֒ marker_charge;
438 util_arrays ->vt[0] = 0.0; util_arrays ->vt[1] = 0.0;
439
440 for(long i=0; i< local_size ; i++) {
441 // Compute 1D index of the basis function from the 3D tensor product index
442 index1d = util_arrays -> startjk [k][j] + util_arrays ->index_x [i];
443 // update the current
444 j dofs[index1d] += util arrays->j1d[i] * splinejk;
445 view j dofs scatter access(index1d) += view j1d(i) * splinejk;
446 RAJA::atomic::atomicAdd<RAJA::atomic::cuda atomic>(&(d j dofs tmp[index1d]),
util raja->raja j1d(i) * splinejk); // GPU
447 d j dofs tmp(index1d) += util raja.raja j1d(i) * splinejk; // CPU
448 atomicAdd(&(d j dofs tmp[index1d]), util cuda->d j1d(i) * splinejk);
449 j dofs[index1d] += util arrays->j1d[i] * splinejk;
450 #pragma acc atomic
451 j dofs[index1d] += util openacc->d j1d[i] * splinejk;
452 // contributions for the velocities
453 util_arrays ->vt[0] += bfield_dofs [start1 +index1d ] * util_arrays ->j1d[i];
454 util_arrays ->vt[1] += bfield_dofs [start2 +index1d ] * util_arrays ->j1d[i];
455 }
456
457 if(j>0) {
458 vtt2 += util_arrays ->vt [0]* util_arrays ->spline_1 [1][j-1];
459 }
460 vtt3 -= util_arrays ->vt[1]* util_arrays -> spline_0 [1][j];
461 }
462 // update the velocities
463 vi[1] -= qoverm *vtt2*util_arrays ->spline_0 [2][k];
464 if(k >0) {
465 vi [2] -= qoverm *vtt3*util_arrays -> spline_1 [2][k-1];
466 }
467 }
468 }
24
