Offsite Autotuning Approach -- Performance Model Driven Autotuning
  Applied to Parallel Explicit ODE Methods by Seiferth, Johannes et al.
ar
X
iv
:2
00
4.
03
69
5v
1 
 [c
s.P
F]
  7
 A
pr
 20
20
Offsite Autotuning Approach
Performance Model Driven Autotuning Applied to
Parallel Explicit ODE Methods
Johannes Seiferth, Matthias Korch, Thomas Rauber
Department of Computer Science, University of Bayreuth, Bayreuth, Germany
{johannes.seiferth, korch, rauber}@uni-bayreuth.de
Abstract. Autotuning (AT) is a promising concept to minimize the of-
ten tedious manual effort of optimizing scientific application for a specific
target platform. Ideally, an AT approach can reliably identify the most
efficient implementation variant(s) for a new platform or new character-
istics of the input by applying suitable program transformations and an-
alytic models. In this work, we introduce Offsite, an offline AT approach
which automates this selection process at installation time by rating im-
plementation variants based on an analytic performance model without
requiring time-consuming runtime tests. From abstract multilevel YAML
description languages, Offsite automatically derives optimized, platform-
specific and problem-specific code of possible variants and applies the
performance model to these variants.
We apply Offsite to parallel numerical methods for ordinary differential
equations (ODEs). In particular, we investigate tuning a specific class of
explicit ODE solvers (PIRK methods) for four different initial value prob-
lems (IVPs) on three different shared-memory systems. Our experiments
demonstrate that Offsite can reliably identify the set of most efficient im-
plementation variants for different given test configurations (ODE solver,
IVP, platform) and effectively handle important AT scenarios.
Keywords: Autotuning · performance modeling · description language
· ODE methods · ECM performance model · shared-memory
1 Introduction
The performance of scientific applications strongly depends on the character-
istics of the targeted computing platform, such as, e.g., the processor design,
the core topology, the cache architectures, the memory latency or the memory
bandwidth. Facing the growing diversity and complexity of today’s computing
landscape, the task of writing and maintaining highly efficient application code
is getting more and more cumbersome for software developers. A highly opti-
mized implementation variant on one target platform, might, however, perform
poorly on another platform. That particular poorly performant implementation
variant, though, could again potentially outperform all other variants on the
next platform. Hence, in order to achieve a high efficiency and obtain optimal
performance when migrating an existing scientific application, developers need
to tune and adapt the application code for each specific platform anew.
2 Johannes Seiferth, Matthias Korch, Thomas Rauber
1.1 State of the Art
A promising concept to avoid this time-consuming, manual effort is autotun-
ing (AT), and many different approaches have been proposed to automatically
tune software [2]. AT is based on two core concepts: (i) the generation of opti-
mized implementation variants based on program transformation and optimiza-
tion techniques such as, e.g., loop unrolling or loop tiling, and (ii) the selection
of the most efficient variant(s) on the target platform from the set of generated
variants. In general, there are (i) offline and (ii) online AT techniques. Off
line AT tries to select the supposedly most efficient variant at compile or instal-
lation time without actual knowledge of the input data. These approaches are
applicable for use-cases, whose execution behavior does no depend on the input
data. This is the case, e.g., for dense linear algebra problems, which can, i.a., be
tuned offline with ATLAS [20] and PhiPAC [4]. In other fields, such as sparse
linear algebra or particle codes, characteristics of the input data heavily influ-
ence the execution behavior. By choosing the best variant at runtime—when all
input is known—, online AT approaches such as Active Harmony [19] and ATF
[13] incorporate these influences.
Selecting a suitable implementation variant from a potentially large set of
available variants in a time-efficient manner is a big challenge in AT. Various
techniques and search strategies have been proposed in previous works to meet
this challenge [2]. A straightforward approach is the time-consuming comparison
of variants by runtime tests, possibly steered by a single search strategy, such as
an exhaustive search or more sophisticated mathematical optimization methods
like differential evolution [6], or a combination of multiple search strategies [1].
[12] proposes a hierarchical approach that allows the use of individual search
algorithms for dependent subspaces of the search space. As an alternative to
runtime tests, analytic performance models can be applied to either select the
most efficient variant or to reduce the number of tests required by filtering out
inefficient variants beforehand. In general, two categories of performance models
are distinguished: (i) black box models applying statistical methods and ma-
chine learning techniques to observed performance data like hardware metrics
or measured runtimes in order to learn to predict performance behavior [16,18],
and (ii) white box models such as the Roofline model [21] or the ECM per-
formance model [9,17] that describe the interaction of hardware and code using
simplified machine models. For loop kernels, the Roofline and the ECM model
can be automatically constructed with the Kerncraft tool [8].
1.2 Main Contributions
In this work, we propose Offsite, an offline AT approach that automatically
identifies the most efficient implementation variant(s) during installation time
based on performance predictions. These predictions stem from an analytic per-
formance prediction methodology for explicit ODE methods proposed by [15]
that uses a combined white and black box model approach based on the ECM
model. The main contributions of this paper are:
Offsite – A Performance Model Driven Autotuning Approach 3
1 for l ← 1, . . . , s do Y
(0)
l
← yκ
2 for k ← 1, . . . ,m do
3 for l ← 1, . . . , s do
4 Y
(k)
l
← yκ + hκ
∑
s
i=1 aliF
(k−1)
i
with F
(k−1)
i
← f(tκ + cihκ,Y
(k−1)
i
)
5 yκ+1 ← yκ + hκ
∑
s
i=1 biF
(m)
i
Listing 1. Timestep function of a PIRK method.
(i) We develop a novel offline AT approach for shared-memory systems based
on performance modelling. This approach automates the task of generating the
pool of possible implementation variants using abstract description languages.
For all these variants, our approach can automatically predict their performance
and identify the best variant(s). Further, we integrated a database interface
for collected performance data which enables the reusability of data and which
allows to include feedback from possible online AT or actual program runs.
(ii) We show how to apply Offsite to an algorithm from numerical analysis
with complex runtime behavior: the parallel solution of IVPs of ODEs.
(iii) We validate the accuracy and efficiency of Offsite for different test con-
figurations and discuss its applicability to four different AT scenarios.
1.3 Outline
Section 2 details the selected example use-case (PIRK methods) and the corre-
sponding testbed. Based on this use-case, Offsite is described in Section 3. In
Section 4, we experimentally evaluate Offsite in four different AT scenarios and
on three different target platforms. Section 5 concludes the paper.
2 Use-Case and Experimental Test Bed
Use-Case: PIRK Methods
As example use-case, we investigate parallel iterated Runge-Kutta (PIRK).
PIRK methods are part of the general class of explicit ODE methods [10] and
solve an ODE System
y′(t) = f(t,y(t)) , y(t0) = y0 , y ∈ R
n (1)
by performing a series of timesteps until the end of the integration interval is
reached. In each timestep, a new numerical approximation yκ+1 for the unknown
solution y is determined by an explicit predictor–corrector process in a fixed
number of substeps.
PIRK methods are an excellent candidate class for AT. Their complex four-
dimensional loop structure (Lst. 1) can be modified by loop transformations
resulting in a large pool of possible implementation variants whose performance
behavior potentially varies highly depending on: (i) the composition of computa-
tions and memory accesses, (ii) the number of stages of the base ODE method,
(iii) the characteristics of the ODE system solved, (iv) the target hardware, (v)
the compiler and the compiler flags, and (vi) the number of threads started.
4 Johannes Seiferth, Matthias Korch, Thomas Rauber
Table 1. Characteristics of the test set of IVPs.
IVP Cusp IC Medakzo Wave1D
Acces distance1 unlimited limited unlimited limited
Computational behavior mixed compute-bound mixed memory-bound
1 In practice, many IVPs are sparse, i.e. only access few components of Y when
evaluating function f (line 4, Lst. 1). A special case of sparse is limited access
distance d(f), where fj only accesses components yj−d(f) to yj+d(f).
Table 2. Characteristics of the test set of target platforms.
Name HSW IVB SKY
Micro-architecture Haswell EP Ivy-Bridge EP Skylake SP
CPU Xeon E5-2630 v3 Xeon E5-2660 v2 Xeon Gold 6148
Clock speed 2.3 GHz 2.2 GHz 1.76 GHz
Cores 8 10 20
L1 cache (data) 32 kB 32 kB 32 kB
L2 cache 256 kB 256 kB 1 MB
L3 cache (shared) 20 MB 25 MB 27.5 MB
Cache line size 64 B 64 B 64 B
Instruction throughput per cycle (double precision)
ADD / FMA / MUL 2 / 1 / 1 1 / - / 1 4 / 2 / 4
Compiler icc 19.0.5 icc 19.0.4 icc 19.0.2
Test Set of Initial Value Problems
In our experiments, we consider a broad set of IVPs (Table 1) that exhibit
different characteristics: (i) Cusp combines Zeeman’s cusp catastrophe model
for a threshold-nerve-impulse mechanism with the van der Pol oscillator [7], (ii)
IC describes a traversing signal through a chain of N concatenated inverters
[3], (iii) Medakzo describes the penetration of radio-labeled antibodies into a
tissue infected by a tumor [11], and (iv) Wave1D describes the propagation of
disturbances at a fixed speed in one direction [5].
Test Set of Target Platforms
We conducted our experiments on three different shared-memory systems (Table
2). For all experiments, the CPU clock was fixed, hyper-threading disabled and
thread binding set with KMP AFFINITY=granularity=fine,compact. All codes
were compiled with the Intel C compiler and compiler flags -O3, -xAVX and
-fno-alias set.
Offsite – A Performance Model Driven Autotuning Approach 5
T
u
n
in
g
S
c
e
n
a
r
io
Kernel
Working Sets
Kernel Code
Generation
Derived Impl.
Variants
Communication
Cost Benchmarks
Kernel prediction
Impl. Variant
Prediction
Impl. Variant
Ranking
Impl. Variant
Code Generation
DBOnline AT
Offsite
Fig. 1. Workflow of the Offsite autotuning approach.
3 Offsite Autotuning Approach
In this work, we introduce the Offsite offline AT approach on the example of
explicit ODE methods. Before starting a new Offsite run, the tuning scenario
desired, which consists of: (i) the pool of possible implementations and program
transformations, (ii) the ODE base method(s), (iii) the IVP(s), and (iv) the
target platform, is defined using description languages in the YAML standard1.
From its input data, Offsite automatically handles the whole tuning workflow
(Fig. 1). First, Offsite generates optimized, platform-specific and problem-specific
code for all kernels and derives all possible implementation variants. Applying
an analytic performance prediction methodology, the performance of each kernel
is predicted for either (i) a fixed ODE system size n—if specified by the user or
prescribed by the ODE2—or (ii) a set of relevant ODE system sizes determined
by a working set model. The performance of a variant is derived by combin-
ing the predictions of its kernels and adding an estimate of its synchronization
costs. Variants are ranked by their performance to identify the most efficient
variant(s). All obtained prediction and ranking data are stored in a database.
For the best ranked variants, Offsite generates optimized, platform-specific and
problem-specific code.
3.1 Input Description Languages
A decisive, yet cumbersome step in AT is generating optimized code. Often,
there is a large pool of possible implementation variants, applicable program
transformations (e.g. loop transformations) and tunable parameters (e.g. tile
sizes) available. Furthermore, exploiting characteristics of the input data can
enable more optimizations (e.g. constant propagation). Writing all variants by
hand, however, would be tedious and error-prone and there is demand for au-
tomation. In this work, we introduce multilevel description languages to describe
implementations, ODE methods, IVPs and target platforms in an abstract way.
Offsite can interpret these languages and automatically derives optimized code.
1 YAML is a data serialization language; https://yaml.org
2 There are scalable ODE systems but also ODEs with a fixed size [7].
6 Johannes Seiferth, Matthias Korch, Thomas Rauber
1 stages: 4
2 order: 7
3 corrector_steps: 6
4 A: [["0.1130", "-0.0403", "0.0258",
"-0.0099"], ..., [...]]
5 b: ["0.2205", ...]
6 c: ["0.1130 - 0.0403 + 0.0258 -
0.0099", ...]
Listing 2. ODE method description
format on the example of Radau
IIA(7).
1 components:
2 first: 1
3 size: n-1
4 code: |
5 (((U_op - %in[j]) * R - (eta * ((%in[j
-1] - U_thres) * (%in[j-1] -
U_thres) - (%in[j-1] - %in[j] -
U_thres) * (%in[j-1] - %in[j] -
U_thres)))) / C);
6 constants:
7 - double R = 1.0
8 - ...
Listing 3. IVP description format on the
example of InverterChain.
The Base ODE Method of a PIRK method is characterized by its Butcher
table—i.e., coefficient matrix A, weight vector b, node vector c—and a small set
of properties: (i) number of stages s, (ii) order o, (iii) number of corrector steps
m. Exploiting these properties, however, can have a large impact on the efficiency
of an implementation variant and should be included into the code generation
in order to obtain the most efficient code. The i-loop in Listing 4, e.g., might
be replaceable by a single vector operation for specific s, or zero entries in the
Butcher table might allow to save computations.
Listing 2 shows the ODE method description format on the example of Radau
IIA(7) which is a four-stage method with order seven applying six corrector steps
per timestep. To save space, only an excerpt of the Butcher table is shown with
a reduced number of digits.
IVPs are described in the IVP description format shown by IC (Lst. 3):
(i) components describes the n components of the IVP. Each component
contains a code YAML block that describes how function evaluation f(tκ +
cihκ,Y
(k−1)
i ) (l. 4, Lst. 1) will be substituted during code generation whereby
%in is a placeholder for the used input vector Y
(k−1)
i . Adjacent components
that execute the same computation can be described by a single block whereby
first denotes the first component and size specifies the total number of adjacent
components handled by that particular block.
(ii) constants defines IVP-specific parameters replaced with their actual val-
ues during code generation and might possibly enable further code optimizations.
In IVP IC, e.g., a multiplication could be saved if the given electrical resistance
R equals 1.0.
Target Platform and Compiler are described using the machine description
format introduced by Kerncraft3. Its general structure is tripartite: (i) the exe-
cution architecture description, (ii) the cache and memory hierarchy description,
and (iii) benchmark results of typical streaming kernels.
3 For example files, we refer to https://github.com/RRZE-HPC/kerncraft.
Offsite – A Performance Model Driven Autotuning Approach 7
1 datastructs:
2 - double b[s]
3 - double F[s][n]
4 - double dy[n]
5 computations:
6 C1: "dy[j] = dy[j] + b[i] * F[i][j]"
7 variants:
8 - name: APRX_ij
9 code: |
10 %PRAGMA nounroll_and_jam
11 %LOOP_START i s
12 %LOOP_START j n
13 %COMP C1
14 %LOOP_END j
15 %LOOP_END i
16 working sets: { "(s+1)*n+s", "2*n" }
17 - name: APRX_ji
18 code: |
19 %LOOP_START j n
20 %LOOP_START i s unroll
21 %COMP FMA
22 %LOOP_END i
23 %LOOP_END j
24 working sets: { "(s+1)*n+s" }
Listing 4. Kernel template description
YAML on the example of APRX.
1 double F[4][1912]; // s=4; n=161
2 double dy[1912]; // n=161
3 for (int j=0; j<161; ++j) { // n=161;
unrolled i; replaced b[i]
4 dy[0] += 0.2205 * F[0][j];
5 dy[1] += 0.3882 * F[1][j];
6 dy[2] += 0.3288 * F[2][j];
7 dy[3] += 0.0625 * F[3][j];
8 }
Listing 5. Code generated for kernel
APRX ji of kernel template APRX
specialized on Radau IIA(7) & n = 161.
1 code: |
2 %COM omp_barrier
3 %LOOP_START k m
4 %KERNEL RHS
5 %COM omp_barrier
6 %KERNEL LC
7 %COM omp_barrier
8 %LOOP_END k
9 %KERNEL RHS
10 %COM omp_barrier
11 %KERNEL APRX
12 %KERNEL UPD
Listing 6. Implementation skeleton
description format on the example of A.
1 void timestep(...) {
2 #omp barrier
3 for (k=0; k<6; ++k) { // m=6
4 // Code for kernel template RHS
5 #omp barrier
6 // Code for kernel template LC
7 #omp barrier
8 }
9 #omp barrier
10 // Code for kernel template RHS
11 // Kernel APRX
12 for (int j=0; j<161; ++j) { // n=161;
unrolled i; replaced b[i]
13 dy[0] += 0.2205 * F[0][j];
14 dy[1] += 0.3882 * F[1][j];
15 dy[2] += 0.3288 * F[2][j];
16 dy[3] += 0.0625 * F[3][j];
17 }
18 // Code for kernel template UPD
19 }
Listing 7. Code generated for a
variant of impl. skeleton A using kernel
APRX ji specialized on Radau IIA(7) &
n = 161.
Implementation Variants of numerical algorithms are abstracted by descrip-
tion languages as (i) kernel templates and (ii) implementation skeletons.
Kernel Templates define basic computation kernels and possible variations
of this kernel enabled by program transformations that preserve semantic cor-
rectness. Listing 4 shows the kernel template description format on the example
of APRX, which covers computation
∑s
i=1 biF
(m)
i (l. 5, Lst. 1):
(i) datastructs defines required data structures.
(ii) computations describes the computations covered by a kernel template.
Each computation corresponds to a single line of code and has an unique iden-
tifier (E.g. C1 in Lst. 4). Computations can contain IVP evaluations which are
marked by keyword %RHS and are replaced by an IVP component during code
generation (E.g. for IC by line 5 of Lst. 3). Hence, if a kernel template contains
%RHS , a separate, specialized kernel version has to be generated for each IVP
component.
8 Johannes Seiferth, Matthias Korch, Thomas Rauber
(iii) variants contains possible kernels of a kernel template enabled by pro-
gram transformations. For each kernel, its workings sets (working sets) and its
program code (code) are specified. A code block defines for a kernel its order
of computations and program transformations and using four keywords. Com-
putations are specified by keyword %COMP whose parameter must correspond
to one of the identifiers defined in the computations block (E.g. C1 in Lst. 4).
For-loop statements are defined by %LOOP START and %LOOP END . The
first parameter of %LOOP START specifies the loop variable name, the second
the number of loop iterations and an optional third parameter unroll indicates
that the loop will be unrolled during code generation. Loop-specific pragmas can
be added using %PRAGMA.
Implementation skeletons define processing orders of kernel templates and re-
quired communication points. From skeletons, concrete implementation variants
are derived by replacing its templates with concrete kernel code. Listing 6 shows
the implementation skeleton description format on the example of skeleton A
which is a realization of a PIRK method (Lst. 1) that focuses on parallelism
across the ODE system, i.e its n equations are distributed blockwise among the
threads. A contains a loop k over the m corrector steps dividing each correc-
tor step into two templates: RHS computes the IVP function evaluations (l. 5,
Lst. 1) which are used to compute the linear combinations (l. 4, Lst. 1) in LC.
Per corrector step, two synchronizations are needed as RHS—depending on the
IVP solved—can potentially require all components of the linear combinations
from the last iteration of k. After all corrector steps are computed, the next
approximation yκ+1 is calculated by templates APRX and UPD (l. 6, Lst. 1).
Four keywords suffice to specify skeletons:
(i) %LOOP START and %LOOP END define for-loops.
(ii) %COM states communication operations of an implementation skeleton.
Skeleton A, e.g., requires 2m+ 2 barrier synchronizations.
(iii) %KERNEL specifies an executed kernel template. Its parameter must
correspond to the name of a kernel template. During code generation%KERNEL
is replaced by actual kernel code (e.g. APRX in Lst. 7).
3.2 Rating Implementation Variant Performance
Offsite can automatically identify the most efficient implementation variant(s)
from a pool of available variants using analytic performance modelling (Fig. 1):
(i) In a first step, Offsite automatically generates code for all kernels in a
special code format processable by kerncraft4. Kernel code generation (Kernel
Code Generation in Fig. 1) includes specializations of the code on the target plat-
form, IVP, ODE method and (if fixed) ODE system size n. Listing 5 exemplary
shows the code generated for kernel APRX ji of kernel template APRX (Lst. 4)
when specialized on ODE method Radau II A(7) and n = 161. As specified in
the template description, j loop is unrolled completely. Further, Butcher table
coefficients (b) and known constants (s = 4, n = 161) are substituted.
4 In this work, version 0.8.3 of the Kerncraft tool was used.
Offsite – A Performance Model Driven Autotuning Approach 9
(ii) In some tuning scenarios, the ODE system size n is not yet known during
installation time. Giving predictions for all valid n values, however, is in general
not feasible. By applying a working set model (Sec. 3.4), Offsite automatically
determines for each kernel a set of relevant n (Kernel Working Sets, Fig. 1) for
which predictions are then obtained in the next step.
(iii) Offsite automatically computes node-level runtime predictions (Sec. 3.3)
for each implementation variant (Impl. Variant Prediction, Fig. 1) by adding up
the kernel predictions of its kernels and adding an estimate of its communication
costs (Communication Cost Benchmarks, Fig. 1), which Offsite derives from
benchmark data. For each of the kernel codes generated in step (i), its kernel
prediction is automatically derived by Offsite (Kernel Prediction, Fig. 1) whereby
Kerncraft is used to construct the ECM model.
(iv) Using these node-level runtime predictions, Offsite ranks implementation
variants by their performance (Impl. Variant Ranking, Fig. 1).
(v) From the ranking of implementation variants, Offsite automatically de-
rives the subset Λ of the best rated variant(s) which contains all variants λ whose
performance is within an user-provided maximum deviation from the best rated
variant. For each variant of λ, Offsite generates optimized, platform-specific and
problem-specific code (Impl. Variant Code Generation, Fig. 1). Listing 7 shows
an excerpt of the code generated for an variant of implementation skeleton A
which substitutes kernel template APRX with kernel APRX ji and was special-
ized on ODE method Radau IIA(7), IVP IC and n = 161.
3.3 Performance Prediction Methodology
The performance prediction methodology applied by Offsite expands on the
works of [15] and comprises: (i) a node-level runtime prediction of an imple-
mentation variant and (ii) an estimate of its intra-node communication costs.
Node-level Runtime Prediction. Base of the node-level prediction is the
analytic ECM (Execution-Cache-Memory) performance model. For an in-depth
explanation, we refer to [17,9]. The ECM model gives an estimation of the num-
ber of CPU cycles per cache line (CL) required to execute a particular loop
kernel on a multi- or many-core chip which includes contributions from the in-
core execution time Tcore and the data transfer time Tdata:
Tcore = max(TOL, TnOL) , (2)
T L3data = T
data
L1L2 + T
data
L2L3 + T
p
L2L3 . (3)
Tcore is defined as the time required to retire the instructions of a single loop
iteration under the assumptions that (i) there are no loop-carried dependen-
cies, (ii) all data are in the L1 data cache, (iii) all instructions are scheduled
independently to the ports of the units, and (iv) the time to retire arithmetic
instructions and load/store operations can overlap due to speculative execution
depending on the target platform. Hence, the unit that takes the longest to retire
10 Johannes Seiferth, Matthias Korch, Thomas Rauber
its instructions determines Tcore. T
level
data factors in the time required to transfer
all data from its current location in the memory hierarchy to the L1 data cache
and back. The single contributions of transfers between levels i and j of the
memory hierarchy T dataij are determined depending on the amount of transferred
CLs. Depending on the platform used, an optional latency penalty T pij might
be added. In (3) T leveldata is exemplarily shown for data coming from the L3 cache
under the assumption that a latency penalty between L2 and L3 cache has to
be factored in on the platform used. Combining all contributions, a single-core
prediction
T levelECM = max(TOL, TnOL + T
level
data ) (4)
can be derived, whereby the overlapping capabilities of the target platform de-
termine whether a contribution is considered overlapping or non-overlapping.
Using Kerncraft, Offsite obtains ECM model predictions (4) for all kernels λ.
For each kernel, kernel runtime prediction
φλ =
αλ · βλ
δ · f
(5)
yields the runtime in seconds of kernel λ, where αλ is (4) computed for a specific
number of running cores τ , βλ is the number of loop iterations executed, δ is the
number of data elements fitting into one CL and f is the CPU frequency. By
summing up the individual kernel runtime predictions φλ of its basic kernels λ
and adding an estimate of its communication costs tcom, the node-level runtime
prediction θǫ of an implementation variant ǫ is given by:
θǫ =
∑
λ
φλ + tcom . (6)
Remark: [15] used an older Kerncraft version that could not yet return ECM
predictions for multiple core counts τ with a single run, but additionally returned
the kernel’s saturation point σλ. Hence, an extra factor min(τ, σλ) was needed
in the denominator of (5) in their work.
Estimate of Intra-node Communication Costs. For the implementation
variants considered in this work, the costs of the required OpenMP barrier oper-
ations are estimated depending on the number of threads using linear regression.
Reusability of Performance Predictions. Prediction data (e.g. kernel run-
time predictions) collected for a specific implementation variant can be reused
to estimate other variants (if they also include that kernel) or to estimate other
IVPs (if the kernel does not contain any IVP evaluations). In the context of
AT, this is a decisive advantage compared to time-consuming runtime testing
of variants which requires running each additionally added variant as well or to
run all variants anew when changing the IVP solved.
Offsite – A Performance Model Driven Autotuning Approach 11
3.4 Working Set Model
If the ODE system size n is not fixed—either by the user or restrictions of the
IVP—selecting the most efficient implementation variant(s) at installation time
leads to an exhaustive search over the possibly vast space of values for n. To
minimize the number of predictions required per kernel, the set of estimated n
values is reduced by a model-based restriction, the working set of the kernel,
which corresponds to the amount of data referenced by a kernel.
We use the working sets to identify for each kernel the maximum n that still
fit into the single cache levels. Using these maximums, ranges of consecutive n
values for which the ECM prediction (4) stays constant5 can be derived. The
medium values of these ranges form the working set of the kernel.
4 Experimental Evaluation
We validate Offsite using the experimental test bed introduced in Section 2. In
particular, we study the efficiency of Offsite in four AT scenarios when tuning
four different IVPs on three different target platforms and compare the efficiency
of the following AT strategies:
(i) BestVariant covers the case that the most efficient implementation variant
is already known (e.g. from previous execution) and no AT is required.
(ii) RunAll runs all variants in order to identify the most efficient variant.
(iii) OffsitePreselect5 runs an Offsite determined subset of all variants, which
contains all variants withing a 5 % deviation of the best ranked variant, to
identify the most efficient variant of that subset.
(iv) OffsitePreselect10 runs variants within a 10 % deviation of the best variant.
(v) RandomSelect randomly runs 20 of the total 56 variants.
4.1 Derived Implementation Variants
Table 3 summarizes the implementation skeletons and kernel templates used in
this work. In total, we consider eight skeletons from which 56 implementation
variants can be derived. Each table column shows the templates required by a
particular skeleton. E.g., skeleton A (Lst. 6) uses templates LC, RHS,APRX and
UPD and twelve variants can be derived from A as there are six different kernels
of LC (enabled by loop interchanges, unrolls, pragmas) and two of APRX.
In total, 17 different kernels can be derived from the eight kernel templates
available. To predict the performance of all 56 variants, only these 17 kernels
have to be estimated. Further, when obtaining predictions off all 56 variants for
a different IVP, only those four templates that contain IVP evaluations—and
thus their six corresponding kernels—need to be re-evaluated, while prediction
data of the remaining kernels can be retrieved from database.
5 The ECM prediction factors in the location of data in the memory hierarchy. As a
simplified assumption—neglecting overlapping effects at cache borders—, this means
that as long as data locations do not change, the ECM model yields the same value
for a kernel independent from the actual n.
12 Johannes Seiferth, Matthias Korch, Thomas Rauber
Table 3. Overview of the implementation variants considered.
Impl. Skeleton Kernel Template1(#Kernels)
(#Impl. Variants) LC(6) RHS*(1) RHSLC*(1) APRX(2) RHSAPRX*(2) UPD(1) APRXUPD(2) RHSAPRXUPD*(2)
A (12) x x x x
B (12) x x x
C (2) x x x
D (2) x x
E (2) x x x
F (2) x x
G (12) x x x x
H (12) x x x
1 A kernel template marked with * contains evaluations of the IVP.
4.2 AT Scenario – All Input Known
As first test scenario, we consider the case that all input is known at installation
time, in particular the ODE system size n. In such cases, Offsite is applied with-
out the working set model. Performance predictions, however, are only obtained
for that particular n and a new Offsite AT run would be required if n changes.
Table 4 compares the accuracy and efficiency of AT strategies when tuning
four different IVPs on three different target platforms for n = 36,000,000 and
ODE method Radau IIA(7). For AT strategies OffsitePreselect5 and OffsitePre-
select10, tstep yields the time in seconds it takes to execute a timestep using the
measured best implementation variant from the subset Λ of variants λ tested by
that strategy. Performance loss denotes the percent runtime deviation of that
particular measured best variant from the variant selected by BestVariant (tbest).
Ideally, an AT strategy correctly identifies the measured best variant and, thus,
would suffer no performance loss. For an AT strategy, |Λ| yields the cardinality
of subset Λ and the percent tuning overhead of applying that strategy is defined
as ttune−|Λ|tbest|Λ|tbest · 100 where ttune =
∑Λ
λ tλ is the time required to test all variants
and |Λ|tbest is the time needed to execute the measured best variant instead.
Haswell. AT strategy RunAll causes a significant tuning overhead for all IVPs,
while OffsitePreselect5 and OffsitePreselect10 only lead to marginal overhead
as the subset of tested variants is considerably smaller, while still being able to
select the measured best variant for all IVPs but Wave1D.
IvyBridge. Again, RunAll leads to decisive overhead compared to both Off-
site strategies and the measured best variant is correctly identified for all IVPs.
However, for IVP IC only OffsitePreselect10 finds the best variant. As IC is
compute-bound (Table 1), the IVP evaluation dominates the computation time
while the order of the remaining computations has only minor impact. Hence,
already minor jitter can lead to a different variant being selected.
Skylake. Similar observations as on the two previous systems can be made.
The overhead of both Offsite strategies is marginal compared to RunAll. For all
IVPs, the measured best variant is successfully identified.
Offsite – A Performance Model Driven Autotuning Approach 13
Table 4. Comparison of different AT strategies applied to four different IVPs with
n = 36,000,000 and Radau IIA(7).
IVP Cusp IC Medakzo Wave1D
H
a
sw
e
ll
(8
c
o
re
s)
BestVariant F *ji F *ij F *ji H LCjli *ij
BestVariant tstep[s] 1.28 0.80 1.29 1.04
AT strategy – OffsitePreselect5
|Λ| (tuning overhead) 3 (1%) 3 (3%) 2 (2%) 3 (5%)
tstep[s] selected variant (perf. loss) 1.28 (–) 0.80 (–) 1.29 (–) 1.08 (4%)
AT strategy – OffsitePreselect10
|Λ| (tuning overhead) 3 (1%) 3 (3%) 3 (1%) 4 (5%)
tstep[s] selected variant (perf. loss) 1.28 (–) 0.80 (–) 1.29 (–) 1.08 (4%)
AT strategy – RunAll
|Λ| (tuning overhead) 56 (42%) 56 (44%) 56 (20%) 56 (16%)
Iv
y
B
ri
d
g
e
(1
0
c
o
re
s)
BestVariant E *ji F *ij F *ji F *ji
BestVariant tstep[s] 1.16 0.725 1.20 1.04
AT strategy – OffsitePreselect5
|Λ| (tuning overhead) 3 (3%) 1 (1%) 3 (1%) 3 (0.4%)
tstep[s] selected variant (perf. loss) 1.16 (–) 0.734 (1%) 1.20 (–) 1.04 (–)
AT strategy – OffsitePreselect10
|Λ| (tuning overhead) 3 (3%) 3 (3%) 3 (1%) 3 (0.4%)
tstep[s] selected variant (perf. loss) 1.16 (–) 0.725 (–) 1.20 (–) 1.04 (–)
AT strategy – RunAll
|Λ| (tuning overhead) 56 (54%) 56 (59%) 56 (41%) 56 (44%)
S
k
y
la
k
e
(2
0
c
o
re
s)
BestVariant E *ji F *ji F *ji F *ji
BestVariant tstep[s] 0.43 0.24 0.45 0.40
AT strategy – OffsitePreselect5
|Λ| (tuning overhead) 3 (1%) 3 (2%) 3 (1%) 3 (%)
tstep[s] selected variant (perf. loss) 0.43 (–) 0.24 (–) 0.45 (–) 0.40 (–)
AT strategy – OffsitePreselect10
|Λ| (tuning overhead) 3 (1%) 3 (2%) 3 (1%) 3 (1%)
tstep[s] selected variant (perf. loss) 0.43 (–) 0.24 (–) 0.45 (–) 0.40 (–)
AT strategy – RunAll
|Λ| (tuning overhead) 56 (69%) 56 (65%) 56 (41%) 56 (44%)
4.3 AT Scenario – Unknown ODE System Size
The next scenario considered is that of a still unknown ODE system size n at
installation time. In these cases, the working set model is applied to determine
a set of sample n values for which Offsite computes predictions and from which
predictions for the whole range of possible n are derived. As this requires com-
puting multiple performance predictions, a single Offsite run takes longer than
in the previous scenario. This particular Offsite run, however, already covers all
possible n and no further run will be required when switching n at a later point.
Figures 2 and 3 show for the single implementation variants selected as best
variant by the AT strategies considered, the time per timestep of IC and Cusp
on three platforms (each using their max. number of cores). On the x-axis, n is
14 Johannes Seiferth, Matthias Korch, Thomas Rauber
0 1 2 3
·107
3.4
3.6
3.8
4
4.2
·10−8
n
T
im
e
p
er
co
m
p
o
n
en
t
[s
/
n
]
0 1 2 3
·107
3
3.2
3.4
3.6
3.8
4
·10−8
n
0 1 2 3
·107
1.1
1.2
1.3
1.4
·10−8
n
BestVariant OffsitePreselect5 OffsitePreselect10
(a) Haswell (8 cores) (b) IvyBridge (10 cores) (c) Skylake (20 cores)
Fig. 2. Comparison of AT strategies applied to Cusp with varying n and Radau II A(7).
plotted up to n = 60,000,000. The y-axis shows the time per component of n in
seconds needed by a specific variant to solve a timestep for Radau II A(7).
Tuning Cusp (Fig. 2). On Haswell (Fig. 2a), OffsitePreselect5 and OffsiteP-
reselect10 select the same subset of three variants independent of n. Both strate-
gies always correctly identify the measured best variant. The same observations
can be made on IvyBridge (Fig. 2b) and on Skylake (Fig. 2c) where also the
same subset of three variants is selected and the measured best variant is always
found.
Tuning IC (Fig. 3). On Haswell (Fig. 3a), the same subset of one (for Off-
sitePreselect5 ) respectively of two variants (for OffsitePreselect10 ) is picked for
n up to 8,500,000. For bigger n, both strategies select the same three variants.
Except for n = 5,760,000, OffsitePreselect10 always correctly finds the measured
best variant. The single variant selected by OffsitePreselect5 is slightly off for
n = 1,440,000 and n = 2,560,000. In both cases, however, the absolute time
difference is only marginal. IC is compute-bound (Table 1) and, thus, the IVP
evaluation dominates the computation time. Hence, in particular for small n,
the order of the remaining computations has only minor impact on the time and
already minor jitter can lead to a different variant being selected.
OffsitePreslect5 selects on IvyBridge (Fig. 3b) the same variant for all n
while OffsitePreselect10 adds two additional variants for n ≥ 2,560,000. While
OffsitePreselect10 always finds the measured best variant, OffsitePreselect5 is
slightly off for n = 4,000,000 and n = 5,760,000 but the absolute time difference
is only marginal.
On Skylake (Fig. 3c), the same variant is selected for n up to 1,440,000 by
OffsitePreselect5 as well as by OffsitePreselect10 while for larger n two additional
Offsite – A Performance Model Driven Autotuning Approach 15
0 1 2 3
·107
2.2
2.3
2.4
2.5
·10−8
n
T
im
e
p
er
co
m
p
o
n
en
t
[s
/
n
]
0 1 2 3
·107
1.9
2
2.1
2.2
·10−8
n
0 1 2 3
·107
5
5.5
6
6.5
7
·10−9
n
BestVariant OffsitePreselect5 OffsitePreselect10
(a) Haswell (8 cores) (b) IvyBridge (10 cores) (c) Skylake (20 cores)
Fig. 3. Comparison of AT strategies applied to IC with varying n and Radau IIA(7).
variants are considered. Except for n = 1,440,000 both Offsite strategies manage
to always correctly identify the measured best variant. As on the two previous
systems, the absolute time difference is again only marginal.
4.4 AT Scenario – Variable Number of Cores
Offsite is capable of predicting the performance of an implementation variant
for different core counts with a single AT run. In this AT scenario, we consider
tuning an IVP for a fixed ODE system size n and multiple core counts.
Figure 4 shows the effectiveness of different AT strategies compared to strat-
egy RunAll when tuning IVP IC on three target platforms for n = 9,000,000
and Radau IIA(7). On the x-axis, we plot the number of cores. The y-axis plots
for different AT strategies the percent performance gain Π achieved by applying
that particular strategy instead of RunAll which tests all 56 variants (tRA). The
performance gain is defined as tRA−tAT
tRA
∗ 100 where tAT includes the time to run
the variants Λ tested by that strategy and the time to run the measured best
variant from Λ an additional 56 − |Λ| times. Ideally, the bar of an AT strategy
would be close to the horizontal line of BestVariant.
Haswell (Fig. 4a). Depending on the number of cores, OffsitePreselect5 picks
different subsets Λ. For core counts smaller than eight, the same variant is se-
lected, while for eight cores two additional variants are selected. Using OffsiteP-
reselect10, these two variants are also included for four cores. For all core counts,
a significant performance gain close to BestImplVariant can be observed when
using the Offsite strategies. The total performance gain grows with increasing
number of cores as the performance gap between best and worst variants also
increases. While outperforming RunAll, RandomSelect is still far off from the
maximum gain.
16 Johannes Seiferth, Matthias Korch, Thomas Rauber
1 core 2 cores 4 cores 8 cores
0
20
40
60
P
er
fo
rm
an
ce
ga
in
[%
]
1 core 2 cores 5 cores 10 cores
0
20
40
60
1 core 5 cores 10 cores 20 cores
0
20
40
60
BestVariant OffsitePreselect5 OffsitePreselect10 RandomSelect
(a) Haswell (b) IvyBridge (c) Skylake
Fig. 4. Percent performance gain achieved by different AT strategies when tuning IVP
IC for different core counts, Radau IIA(7) and n = 9,000,000.
IvyBridge (Fig. 4b). OffsitePreselct5 selects the same variant for all core
counts. Using OffsitePreselect10, only for 20 cores two further variants are se-
lected. Again, a significant performance gain close to BestImplVariant, can be
observed for all core counts when using the Offsite strategies while RandomSelect
is far off from that ideal gain.
Skylake (Fig. 4c). Both Offsite strategies select the same three variants for
20 cores, while the same single variant is selected for smaller core counts. As on
the two previous target platforms, both Offsite strategies are close too BestIm-
plVariant while RandomSelect is again further off.
4.5 AT Scenario – Variable ODE Method
In the last AT scenario, we consider tuning an IVP for a fixed ODE system
size n for four different ODE methods. Depending on the characteristics of the
ODE method, different optimizations might be applicable—for specific number
of stages s, e.g., loops over s can be replaced by a vector operation—which
potentially results in varying efficiency of the same implementation variant for
different ODE methods.
Figure 5 shows the effectiveness of different AT strategies when tuning IVP
IC on three target platforms for n = 9,000,000 and four different ODE methods:
(i) Radau IA(5) (s = 3, m = 4), (ii) Radau IIA(7) (s = 4, m = 6), (iii) Lobatto
III C(6) (s = 4, m = 5), and (iv) Lobatto III C(8) (s = 5, m = 7). On the
x-axis the ODE method used is shown. The y-axis plots for each AT strategy
the percent performance gain Π achieved by applying that particular strategy
instead of RunAll which tests all 56 variants. The bar of an AT strategy is ideally
close to the horizontal line of BestVariant.
Tuning Haswell (Fig. 5a). OffsitePreselect5 selects the same subset of two
variants for Lobatto III C(6) and Radau IA(5). For Lobatto III C(8) and Radau
IIA(7), an additional variant is selected. Using OffsitePreselect10, these three
Offsite – A Performance Model Driven Autotuning Approach 17
L
ob
at
to
II
I
C
(6
)
L
ob
at
to
II
I
C
(8
)
R
ad
au
I
A
(5
)
R
ad
au
II
A
(7
)
0
20
40
60
P
er
fo
rm
an
ce
ga
in
[%
]
L
ob
at
to
II
I
C
(6
)
L
ob
at
to
II
I
C
(8
)
R
ad
au
I
A
(5
)
R
ad
au
II
A
(7
)
0
20
40
60
L
ob
at
to
II
I
C
(6
)
L
ob
at
to
II
I
C
(8
)
R
ad
au
I
A
(5
)
R
ad
au
II
A
(7
)
0
20
40
60
BestVariant OffsitePreselect5 OffsitePreselect10 RandomSelect
(a) Haswell (8 cores) (b) IvyBridge (10 cores) (c) Skylake (20 cores)
Fig. 5. Percent performance gain achieved by different AT strategies when tuning IVP
IC for different ODE methods and n = 9,000,000.
variants are selected for all ODE methods. For all ODE methods, a significant
performance gain close to BestImplVariant can be observed when using one of
the two Offsite strategies. Further, both Offsite strategies decisively outperform
RandomSelect.
IvyBridge (Fig. 5b). For all ODE methods, the same single variant is chosen
when using OffsitePreselect5, while OffsitePreselect10 selects two variants for Lo-
batto III C(6) and the same three variants for Lobatto III C(8) and Radau IIA(7).
As on Haswell, the performance gain of both Offsite strategies for all ODE meth-
ods is close to the maximum gain, while the achieved gain of RandomSelect is
far off from BestImplVariant.
Skylake (Fig. 5c). Both Offsite AT strategies select the same subset of three
variants for all ODE methods but for Radau IA(5) which only selects two vari-
ants when using OffsitePreselect5. Again, the performance gain achieved by both
Offsite strategies is close to BestVariant while RandomSearch is further off.
5 Conclusion and Future Work
In this work, we have introduced the Offsite AT approach which automates the
process of identifying the most efficient implementation variant(s) from a pool of
possible variants at installation time. Offsite ranks variants by their performance
using analytic performance predictions. To facilitate specifying tuning scenarios,
multilevel YAML description languages allow to describe these scenarios in an
abstract way and enable Offsite to automatically generate optimized codes. More-
over, we have demonstrated that Offsite can reliably tune a representative class
of parallel explicit ODE methods, PIRK methods, by investigating different AT
scenarios and AT strategies on three different shared-memory platforms.
18 Johannes Seiferth, Matthias Korch, Thomas Rauber
Our future work includes expanding Offsite to cluster systems as well as to
AMD and ARM platforms. Further, we intend to extend Offsite to a combined
offline-online AT approach that incorporates feedback data from previous online
AT (or program runs) and to study whether these data can be used to predict
the performance in scenarios with unknown input data (e.g. new IVP).
Acknowledgments
This work is supported by the German Ministry of Science and Education
(BMBF) under project number 01IH16012A. Furthermore, we thank the Erlan-
gen Regional Computing Center (RRZE) for granting access to their IvyBridge
and Skylake systems.
References
1. Ansel, J., Kamil, S., Veeramachaneni, K., Ragan-Kelley, J., Bosboom, J.,
O’Reilly, U.M., Amarasinghe, S.: OpenTuner: An Extensible Framework for
Program Autotuning. In: Proc. 23rd Int. Conf. on Parallel Architecture
and Compilation Techniques. pp. 303–316. PACT ’14, ACM (Aug 2014).
https://doi.org/10.1145/2628071.2628092
2. Balaprakash, P., Dongarra, J., Gamblin, T., Hall, M., Hollingsworth,
J.K., Norris, B., Vuduc, R.: Autotuning in High-Performance Computing
Applications. Proceedings of the IEEE 106(11), 2068–2083 (Nov 2018).
https://doi.org/10.1109/JPROC.2018.2841200
3. Barthel, A., Gu¨nther, M., Pulch, R., Rentrop, P.: Numerical Techniques for Differ-
ent Time Scales in Electric Circuit Simulation. In: High Performance Scientific and
Engineering Computing: Proc. 3rd Int. FORTWIHR Conf. HPSEC. pp. 343–360.
HPSEC’ 02, Springer (Mar 2002). https://doi.org/10.1007/978-3-642-55919-8 38
4. Bilmes, J., Asanovic, K., Chin, C.W., Demmel, J.: Optimizing Matrix Multiply
Using PHiPAC: A Portable, High-performance, ANSI C Coding Methodology. In:
Proc. 11th Int. Conf. on Supercomputing. pp. 340–347. ICS ’97, ACM (Jul 1997).
https://doi.org/10.1145/263580.263662
5. Calvo, M., Franco, J.M., Randez, L.: A New Minimum Storage Runge-Kutta
Scheme for Computational Acoustics. Journal of Computational Physics 201(1),
1–12 (Nov 2004). https://doi.org/10.1016/j.jcp.2004.05.012
6. Das, S., Mullick, S.S., Suganthan, P.N.: Recent Advances in Differential Evolution
– An Updated Survey. Swarm and Evolutionary Computation 27, 1–30 (Apr 2016).
https://doi.org/10.1016/j.swevo.2016.01.004
7. Hairer, E., Wanner, G.: Solving Ordinary Differential Equations II: Stiff and
Differential-Algebraic Problems. Springer, 2nd rev. edn. (2002)
8. Hammer, J., Eitzinger, J., Hager, G., Wellein, G.: Kerncraft: A Tool for Analytic
Performance Modeling of Loop Kernels. In: Tools for High Performance Computing
2016. pp. 1–22. Springer (Oct 2017). https://doi.org/10.1007/978-3-319-56702-0 1
9. Hofmann, J., Alappat, C., Hager, G., Fey, D., Wellein, G.: Bridging the Architec-
ture Gap: Abstracting Performance-Relevant Properties of Modern Server Proces-
sors (2019), preprint
Offsite – A Performance Model Driven Autotuning Approach 19
10. van der Houwen, P. J.and Sommeijer, B.P.: Parallel Iteration of High-order Runge-
Kutta Methods with Stepsize Control. J. Comput. Appl. Math. 29(1), 111–127
(Jan 1990). https://doi.org/10.1016/0377-0427(90)90200-J
11. Mazzia, F., Magherini, C.: Test Set for Initial Value Problem Solvers, Release 2.4.
https://archimede.dm.uniba.it/~testset/ (Feb 2008)
12. Pfaffe, P., Grosser, T., Tillmann, M.: Efficient Hierarchical Online-Autotuning:
A Case Study on Polyhedral Accelerator Mapping. In: Proc. ACM Int. Conf.
Supercomputing. pp. 354–366. ICS ’19, ACM, New York, NY, USA (2019).
https://doi.org/10.1145/3330345.3330377
13. Rasch, A., Gorlatch, S.: ATF: A Generic Directive-Based Auto-Tuning
Framework. Concurrency Computat. Pract. Exper. 31(5), e4423 (Mar 2019).
https://doi.org/10.1002/cpe.4423
14. Scherg, M., Seiferth, J., Korch, M., Rauber, T.: Performance Prediction of Ex-
plicit ODE Methods on Multi-Core Cluster Systems. In: Proc. 2019 ACM/SPEC
Int. Conf. on Performance Engineering. pp. 139–150. ICPE ’19, ACM (2019).
https://doi.org/10.1145/3297663.3310306
15. Seiferth, J., Alappat, C., Korch, M., Rauber, T.: Applicability of the
ECM Performance Model to Explicit ODE Methods on Current Multi-
core Processors. In: High Performance Computing: Proc. 33rd Int. Conf.,
ISC High Performance 2018. pp. 163–183. ISC ’18, Springer (Jun 2018).
https://doi.org/10.1007/978-3-319-92040-5 9
16. Shudler, S., Vrabec, J., Wolf, F.: Understanding the Scalability of Molec-
ular Simulation Using Empirical Performance Modeling. In: Program-
ming and Performance Visualization Tools. pp. 125–143. Springer (2019).
https://doi.org/10.1007/978-3-030-17872-7 8
17. Stengel, H., Treibig, J., Hager, G., Wellein, G.: Quantifying Performance Bot-
tlenecks of Stencil Computations Using the Execution-Cache-Memory Model. In:
Proc. 29th ACM Int. Conf. on Supercomputing. pp. 207–216. ICS ’15, ACM (2015).
https://doi.org/10.1145/2751205.2751240
18. Tallent, N.R., Mellor-Crummey, J.M.: Effective Performance Measurement and
Analysis of Multithreaded Applications. In: Proc. 14th ACM SIGPLAN Symp. on
Principles and Practice of Parallel Programming. pp. 229–240. PPoPP ’09, ACM
(Feb 2009). https://doi.org/10.1145/1504176.1504210
19. Tiwari, A., Hollingsworth, J.K.: Online Adaptive Code Generation and Tuning. In:
Proc. 2011 IEEE Int. Parallel Distributed Processing Symp. pp. 879–892. IPDPS
’11, IEEE (May 2011). https://doi.org/10.1109/IPDPS.2011.86
20. Whaley, R.C., Petitet, A., Dongarra, J.: Automated Empirical Optimizations of
Software and the ATLAS Project. Parallel Computing 27(1), 3–35 (Jan 2001).
https://doi.org/10.1016/S0167-8191(00)00087-9
21. Williams, S., Waterman, A., Patterson, D.: Roofline: An Insightful Visual Perfor-
mance Model for Multicore Architectures. Communications of the ACM 52(4),
65–76 (Apr 2009). https://doi.org/10.1145/1498765.1498785
