Quantifying performance bottlenecks of stencil computations using the
  Execution-Cache-Memory model by Stengel, Holger et al.
Quantifying performance bottlenecks of stencil
computations using the Execution-Cache-Memory model
Holger Stengel Jan Treibig Georg Hager Gerhard Wellein
Erlangen Regional Computing Center (RRZE)
Friedrich-Alexander University of Erlangen-Nuremberg
Erlangen, Germany
{holger.stengel,jan.treibig,georg.hager,gerhard.wellein}@fau.de
ABSTRACT
Stencil algorithms on regular lattices appear in many fields of com-
putational science, and much effort has been put into optimized
implementations. Such activities are usually not guided by per-
formance models that provide estimates of expected speedup. Un-
derstanding the performance properties and bottlenecks by per-
formance modeling enables a clear view on promising optimiza-
tion opportunities. In this work we refine the recently developed
Execution-Cache-Memory (ECM) model and use it to quantify
the performance bottlenecks of stencil algorithms on a contempo-
rary Intel processor. This includes applying the model to arrive
at single-core performance and scalability predictions for typical
“corner case” stencil loop kernels. Guided by the ECM model we
accurately quantify the significance of “layer conditions,” which
are required to estimate the data traffic through the memory hierar-
chy, and study the impact of typical optimization approaches such
as spatial blocking, strength reduction, and temporal blocking for
their expected benefits. We also compare the ECM model to the
widely known Roofline model.
Keywords
stencils, performance model, optimization, multicore
1. INTRODUCTION AND RELATED WORK
Stencil computations on regular lattices are important in many
computational science applications operating on regular lattices.
The combination of low computational intensity and their iterative
nature have made them a popular target for loop optimizations. Be-
yond established cache blocking techniques, leveraging SIMD ca-
pabilities [1] and an efficient shared memory parallelization [2] are
subject to recent research. An overview about the state of the art in
stencil optimizations can be found in a review by Datta et al. [3].
In the compiler community modeling approaches based on cache
usage have been employed for stencil optimizations more than a
decade ago [4]. Modeling was also used to estimate the optimal
performance of stencil codes and to validate the effectiveness of
applied optimizations. One can roughly separate those efforts into
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ICS ’15 Newport Beach, California USA
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.
two categories: “black box” modeling, which relies on statistical
methods and machine learning to describe and predict performance
behavior based on observed data such as hardware performance
metrics [5], and “white box” modeling, which uses simplified ma-
chine models to describe the interaction of code with the hardware
from “first principles.” Simple bottleneck-centric approaches based
on memory bandwidth vs. computational peak performance have
been in use for a long time [6, 7]. This line of thinking was pop-
ularized and refined by Williams et al. [8], and is now known as
the “Roofline model.” The Roofline model is tremendously useful
in getting first estimates of the “light speed” of a loop on a given
architecture. Its central assumption is that data transfers through
the memory hierarchy overlap with code execution on the core(s).
This implies that data access latencies are assumed to be hidden
by prefetching mechanisms. However, the Roofline model cannot
describe relevant bottlenecks beyond memory bandwidth and peak
performance. Especially in long-range stencils common in seismic
imaging there is no single dominating bottleneck anymore and the
influence of in-cache transfers on performance is significant. The
recently introduced ECM model provides an accurate single-core
performance and scaling prediction. It can be seen as an exten-
sion and refinement of the Roofline model for multicore proces-
sors. Like the Roofline model it focuses on resource utilization
but considers contributions from the complete memory hierarchy.
The ECM model was first introduced in [2] and later refined and
combined with a multicore power model in [9]. It provides clear
insights into the relevant contributions of the hardware model to
the performance of a given loop code and therefore allows a clear
identification of optimization opportunities. Based on the single-
core performance estimate it also predicts a performance saturation
point. This paper refines the ECM model with clear rules for over-
lapping execution and data transfers and introduces an accessible
notation to ease the usage and description of the model. The power
of the model to validate and guide performance engineering efforts
is demonstrated on the example of three relevant stencil code pat-
terns.
The paper is structured as follows. In Sect. 2 we describe the ar-
chitecture of the test machine used for measurements. Section 3
introduces and refines the ECM model on the example of sim-
ple streaming loop kernels. In Sections 4, 5, and 6 we apply the
model to successively more complex stencils to show how it can
identify bottlenecks, reveal optimization opportunities, and prevent
misleading conclusions. The paper concludes in Sect. 7.
2. EXPERIMENTAL TESTBED
Unless noted otherwise, a standard two-socket server based on
eight-core SandyBridge-EP (SNB) CPUs was used for the mea-
surements. The core supports SSE and AVX SIMD instruction set
ar
X
iv
:1
41
0.
50
10
v2
  [
cs
.PF
]  
17
 Ja
n 2
01
5
Microarchitecture Intel SandyBridge-EP [10]
Model Xeon E5-2680
Clock speed (fixed) 2.7 GHz
Cores/Threads 8/16
SIMD Support SSE (128 bit), AVX (256 bit)
L1/L2/L3 Cache 8×32 kB/8×256 kB/20 MB
Main Memory Configuration 4 channels DDR3-1600
Update Benchmark Bandwidth 40 GB/s
Table 1: Test machine specifications. All information is shown
for one socket. “Update” is a multi-threaded streaming bench-
mark that modifies an array in memory.
extensions. Using floating-point arithmetic, each core can execute
one multiply and one add instruction per cycle, leading to a peak
performance of eight double precision (DP) or 16 single precision
(SP) flops per cycle. Memory is connected via four DDR3-1600
memory channels per socket. The SNB core can sustain one full-
width AVX load and one half-width AVX store per cycle. With
SSE or scalar execution, these limits are changed: In both cases
the core can sustain either one load and one store, or two loads
per cycle, to the effect that many loops do not show a 4× speedup
of core execution when going from scalar mode to AVX. The L2
cache sustains refills and evicts to and from L1 at 256 bits per cycle
(half-duplex). A full 64-B cache line transfer thus takes two cy-
cles. The L3 cache is segmented, with one segment per core. All
segments are connected by a ring bus. Each segment has the same
bandwidth capabilities as the L2 cache, i.e., it can sustain 256 bits
per cycle (half-duplex) to and from L2. This means that the L3
cache is usually not a bandwidth bottleneck, which is an improve-
ment compared to previous Intel processors. An overview of the
specifications is given in Table 1. In Sect. 5 we use a 3.0GHz Intel
Ivy Bridge CPU for comparison. The relevant differences will be
described there.
Measurements for the vector summation example were per-
formed with LIKWID-bench [11], and the Intel C Compiler (ver-
sion 13.1.3) was used to compile the stencil codes. The clock
frequency was fixed to 2.7 GHz.
3. THE ECM PERFORMANCE MODEL
The ECM model delivers a prediction of the number of CPU
cycles required to execute a certain number of iterations nit of a
given loop on a single core. Since the smallest amount of data that
can be transferred between adjacent cache levels is usually a cache
line (CL), nit is typically chosen to be “one cache line’s worth of
work.” With a 64-B CL, this amounts to nit = 8 for a streaming
kernel with double precision data and stride-one data access to all
arrays (such as the STREAM benchmarks). Depending on the data
types and access characteristics this number may vary, however.
3.1 Model input, construction, and assump-
tions
Some input data is required to construct the ECM model for a
loop code on a given architecture. We consider two parts of the
input separately: The time to execute the instructions on the pro-
cessor core, assuming that there are no cache misses, and the time
to transfer data between its initial location and the L1 cache. For
the sake of clarity we review and refine the model, which was first
introduced in [2, 9], using simple streaming loops such as DAXPY
(A(:)=A(:)+s*B(:)) before turning to more complex cases.
MOV/MASK
Memory control
DIV
L1 Dcache
ADD ADRS
Data flow
Control flow
ADRS
ALUALUALU
Port 0 Port 5
MULT
Ex
ec
ut
io
n LOAD
u
n
its
LOAD
Port 4
STORE
Port 3
JMP
Port 2Port 1
Figure 1: Simplified view of execution resources on a modern
out-of-order CPU core.
3.1.1 In-core execution time
To determine the in-core execution time a simple model for in-
struction throughput capabilities of a microarchitecture is required.
On Intel processors the port scheduler model can be used for this
purpose. Figure 1 shows the port model for the Intel Sandy Bridge
microarchitecture. We assume that all instructions in a loop are
independently scheduled to the ports and executed out of program
order within the execution units ui. The execution time is then set
by the unit that takes the most cycles to execute the instructions:
Tcore = max
i
Tui . (1)
An additional limitation that may be relevant is the overall maxi-
mum instruction throughput of four instructions per cycle on Intel
microarchitectures.1 For example, the loop body of the DAXPY
operation
for(i=0; i<N; ++i)
a[i] = a[i] + s * b[i];
consists of two loads, one store, one add, and one multiply instruc-
tion regardless of SIMD vectorization. In addition there is typi-
cally one integer add, one integer compare, and one conditional
branch in the body to implement the “loop mechanics.” If the loop
is unrolled, the relative share of these instructions is smaller and
can usually be neglected. With AVX vectorization, one work unit
(eight iterations) comprises four load, two store, two add, and two
multiply instructions, which require four cycles to execute on the
SNB core. In this case the load (and store) pipeline throughput is
the bottleneck. In Fig. 2 we visualize this situation in a timeline di-
agram (cycles 1 to 4). Since there are no loop-carried dependencies
in the code, the four-cycle prediction holds if the loop is sufficiently
unrolled. In complex cases one can employ the Intel Architecture
Code Analyzer (IACA) [12] to get an estimate of the loop body
execution time under the throughput assumption.
3.1.2 Data transfers through the memory hierarchy
Data not present in the L1 cache must be fetched from lower
levels of the memory hierarchy before it can be processed in CPU
registers, and modified data must be evicted to make room for new
cache lines. We call the time needed for these transfers the “transfer
time.” Again we assume the best possible case of perfect streaming,
i.e., there are no latency effects and the cost for transferring one CL
can be computed using the maximum bandwidth. For the SNB
architecture, one CL transfer takes two cycles between adjacent
cache levels. Getting a 64-B CL from memory to L3 or back takes
1The term “instruction” is used in a microarchitectural context; on
Intel architectures these are called “µops.”
64B · f/bS cycles, where f is the CPU clock speed and bS is the
achievable memory bandwidth as measured by a suitable streaming
benchmark.
Calculating the transfer times through the memory hierarchy re-
quires an accurate account of the data volumes between all cache
levels. For the DAXPY loop kernel discussed above this means
that, due to the cache hierarchy being inclusive, three CLs must be
transferred between each pair of adjacent memory levels: two for
loading the data for a and b, and one for evicting modified data of
a. This leads to transfer times of six cycles each between cache lev-
els, and 13 cycles between memory and L3 cache (at a maximum
socket bandwidth of 40 GB/s and a clock frequency of 2.7 GHz). If
the “uncore” part of the chip, especially the L3 cache, has a clock
speed that is set independently from the core clock, a correction
factor can be applied to the L3-L2 cycle count.
3.1.3 Single-core prediction and examples
The in-core execution and transfer times described above must
be put together to arrive at a prediction of single-thread execution
time. Therefore we must specify which of the runtime contribu-
tions can overlap with each other. If Tdata is the transfer time (L1
downward, as described above), TOL is the part of the core execu-
tion that overlaps with the transfer time, and TnOL is the part that
does not, then
Tcore = max(TnOL,TOL) (2)
TECM = max(TnOL +Tdata,TOL) . (3)
There are two open questions here: (i) Which components of Tcore
in (1) are overlapping? (ii) How is Tdata composed of the different
data transfer contributions along the memory hierarchy?
The ECM model makes the following fundamental assumptions
for single-core performance prediction: (1) All instructions in the
loop are independently scheduled to the execution ports and exe-
cuted out of program order within the execution units. The unit
that takes the longest time executing its instructions is the bottle-
neck on the core level (see Sect. 3.1.1). The maximum throughput
capabilities of the microarchitecture are a further limitation. (2)
Loads (more specifically, cycles in which loads are retired) do not
overlap with any other data transfer in the memory hierarchy. All
other instructions, and also pipeline bubbles, overlap perfectly with
data transfers. This means that only the time needed for the loads
to the L1 cache contributes to TnOL (store-bound loop kernels show
significant overlap). (3) The transfer times up to the L1 cache are
mutually non-overlapping, i.e., all cache line transfers across the
memory hierarchy are serialized.
The non-overlapping assumptions seem very pessimistic; they
were motivated by the observation that the L1 cache on Intel pro-
cessors can either communicate with the L2 cache or the registers,
but not both in the same cycle. We will show below that they lead to
the most accurate description of performance behavior. Due to this
non-overlap assumption the ECM model does not provide an abso-
lute upper performance bound. In contrast to the Roofline model it
allows for performance measurements to exceed the model predic-
tion.
In case of the DAXPY example on the Intel SNB core, these rules
lead to the timeline diagram shown in Fig. 2. In this particular case
the overall execution time is dominated by data transfers, even if
the data comes from the L1 cache.
In order to be able to discuss the ECM model more efficiently we
introduce a shorthand notation that summarizes the most relevant
information from timeline diagrams such as Fig. 2. We write the
predicted cycle counts for TOL and TnOL, separated by “‖”. After
TnOL we attach the individual contributions for the different mem-
TOL 16104
t
29
[cy]
nOLT
dataTST
MULT
ADD
LD L2−L3 L3−MemL1−L2
Figure 2: Single-core ECM model for the DAXPY loop kernel
on Intel SNB. The predicted number of cycles per CL is 29 if
the data is in memory, 16 for the L3 cache, 10 for the L2 cache,
and 4 in L1 (using AVX).
case ECM model [cy] prediction [cy] measurement [cy]
naive {24‖4 |2 |2 |4.3} {24e24e24e24} 24e24e24e27
scalar {8‖4 |2 |2 |4.3} {8e8e8e12} 8.1e8.9e11e18
SSE {4‖2 |2 |2 |4.3} {4e4e6e10} 4.1e5e7.2e15
AVX {2‖2 |2 |2 |4.3} {2e4e6e10} 2.1e4.9e7.2e14
Table 2: ECM model and cycle predictions on a SNB core for
the vector summation loop in double precision.
ory hierarchy levels (which make up Tdata), separated by “|”:
{TOL ‖TnOL |TL1L2 |TL2L3 |TL3Mem} (4)
For example, the ECM model for DAXPY would be written as
{4‖4 |6 |6 |13} cy. Cycle predictions for data sets fitting into any
particular cache level can then be easily calculated from this rep-
resentation by adding up the appropriate contributions from Tdata
and TnOL, and finding the maximum of that number and TOL ac-
cording to (3). For instance, the prediction for data in L2 will be
max(4,4+6) cy = 10cy. As a shorthand notation for these pre-
dictions we use a similar format but with “e” as the delimiter. For
DAXPY this would read as TECM = {4e10e16e29} cy.
As a further example that does not exhibit the strong data transfer
dominance of DAXPY we pick a double precision vector summa-
tion:
for(i=0; i<N; ++i)
s = s + a[i];
We construct the ECM model for this kernel in four cases: naive,
scalar, SSE, and AVX vectorization on a SNB core. In all but the
“naive” case we assume that appropriate modulo unrolling has been
applied (at least 3-way due to the 3-cycle ADD latency) on top of
SIMD vectorization to allow for filling the ADD pipeline despite
the register dependency. The “naive” code is scalar and does not
use modulo unrolling, so the full latency must be paid for every
ADD instruction. The results are shown in Table 2. One can im-
mediately see that the naive code is executing so slowly that it does
not “feel” the memory hierarchy even if the data is in memory. The
scalar code is dominated by core execution down to the L3 cache
and starts to degrade in performance only when the data comes
from memory. Furthermore, although SSE vectorization yields a
significant speedup in L1 and L2 cache, the gain is only about 20%
in memory. Finally, going from SSE to AVX pays off only if the
data is in the L1 cache. These predictions are in line with the gen-
eral observation that the performance of “slow,” i.e., unoptimized
code is largely independent of the position of data in the memory
hierarchy.
3.1.4 Conclusions from the ECM model
The ECM model explains accurately why the theoretical band-
width limits of all memory hierarchy levels except the L1 cache
cannot be hit with a single thread. Even with a highly efficient,
bandwidth-demanding loop kernel the non-overlapping contribu-
tions to Tdata from the higher memory levels prevent full bandwidth
utilization. If multiple threads are used, bandwidth saturation is
possible. See Sect. 3.1.5 for scalability predictions.
Once the execution time for the basic work unit is known from
the ECM model, one can calculate the single-core performance by
dividing work by time: PECM = W/T . Any appropriate unit of
“work” will do, be it flops, iterations, “lattice site updates," etc. In
case of the scalar summation, the performance in flop/s on a SNB
core is
P=
8flops · f
{8e8e8e8+4.3 f/ f0} cy . (5)
Here we allow for a modification of the CPU clock speed away
from the base frequency f0. At the nominal clock speed of f0 =
2.7GHz we get P( f0) = {2.7e2.7e2.7e1.8} Gflop/s. The model
shows clearly that the single-core performance changes with the
clock frequency even if the data is in memory: P(1.6GHz) =
{1.6e1.6e1.6e1.2} Gflop/s. The larger the relative share of TL3Mem
versus the other contributions, the weaker this dependence will be.
3.1.5 Chip-level scaling and comparison with
Roof line
For describing the chip-level performance scaling we assume
that the single-core performance scales linearly with the number of
cores until a bottleneck is hit [13]. In case of the modern Intel pro-
cessors the only bottleneck is the memory bandwidth, which means
that an upper performance limit is given by the Roofline prediction
for memory-bound execution: PBW = bS/BC, where BC is the code
balance (inverse computational intensity) of the loop code and bS
is the memory bandwidth (one may certainly consider other bottle-
necks, such as a non-scalable L3 cache like on the Intel Westmere
processors). The performance scaling for n cores is thus described
by
P(n) = min
(
nPmemECM,
bS
BC
)
(6)
if PmemECM is the ECM model prediction if the data is in main memory.
This expression allows us to draw a comparison between the
ECM model and the Roofline model. The Roofline model con-
siders only a single data transfer bottleneck. This could be main
memory access but also any other memory hierarchy level. Con-
structing the Roofline model for any core count thus requires mea-
sured maximum bandwidth numbers for all memory levels and all
core counts as input parameters, since none of these values can be
predicted by the model. Due to the overlapping assumption of the
Roofline model, the predicted single-core performance is then
PRoof(n) = min
i
(
nPcoremax ,
bS,i(n)
BC,i
)
. (7)
Here, Pcoremax is the maximum in-core sequential performance, i.e.,
with all data coming from the L1 cache (one may use the arith-
metic peak performance here and treat the L1 cache as an additional
memory level). It corresponds to Tcore as used in the ECM model.
bS,i(n) and BC,i are the measured maximum bandwidth and code
LC ECM model [cy] prediction [cy] P
mem
ECM
[MLUP/s]
Ni < nS
L1 {6‖8 |6 |6 |13} {8e14e20e33} 659 683 3
L2 {6‖8 |10 |6 |13} {8e18e24e37} 587 5461 3
L3 {6‖8 |10 |10 |13} {8e18e28e41} 529 436900 4
— {6‖8 |10 |10 |22} {8e18e28e50} 438 N/A 3
Table 3: ECM model, main memory code balance, predicted
performance, required leading dimension, and saturation point
for the 2D Jacobi kernel when the layer condition (LC) is ful-
filled in different memory hierarchy levels, from L1 down to
main memory on one SNB core.
balance, respectively, at n cores and in memory level i (exclud-
ing L1). There are two limiting cases where the Roofline model
and ECM model predictions coincide on the processor architectures
considered here:
• Tcore is large compared to all contributions from data trans-
fers. Then, if n is large enough, the memory bottleneck may
become relevant, but this is described in the same way by
both models. One typical example would be a kernel that is
absolutely dominated by long-latency operations such as di-
vides or square roots, like the “divide triad” kernel analyzed
in [9].
• The loop code has single-core data transfer and execution
characteristics that make its Roofline-based performance es-
timate coincide with the measured streaming benchmark per-
formance used for obtaining the bS,i(n). We will show an
example for this in Sect. 4.3.
If, however, in-core execution time and data transfer times are com-
parable but much larger than those of the streaming benchmark, the
Roofline model is not able to describe the performance character-
istics accurately. See Sect. 6 for an example.
According to (6), the performance will saturate at nS cores:
nS =
⌈
bS/BC
PmemECM
⌉
=
⌈
TmemECM
TL3Mem
⌉
. (8)
The model confirms the well-known effect that “slow” loop code
needs more cores to reach saturation. For example, the AVX sum-
mation code has a predicted single-core performance of PmemECM =
2.1Gflop/s at nominal clock speed and it will saturate at three cores,
while the naive, non-pipelined code with PmemECM = 0.9Gflop/s will
require six. Hence, in this case a low code quality can be “healed”
by using more resources.2 This would not be possible at a clock
speed of 1.6 GHz, since the slow code would then need ten cores to
saturate, but the CPU only has eight.
4. CASE STUDY: 5-POINT STENCIL
We use the arguably most simple non-trivial stencil variant to
employ the ECM model in a setting that is more complex than the
pure streaming kernels shown earlier. All concepts, however, are
generalizable to other stencils. One sweep (complete update of all
points) of the 2D five-point Jacobi stencil looks as follows (double
precision assumed):
2If time to solution is the only relevant metric, this attitude may be
justified. However, it is not acceptable in view of energy consump-
tion. See [9], and references therein.
for(j=1; j<N j -1; ++j)
for(i=1; i<Ni -1; ++i)
b[j][i] = ( a[j][i-1] + a[j][i+1]
+ a[j-1][i] + a[j+1][i] ) * s;
Unless otherwise noted we assume that the array dimensions are
large so that the arrays a and b do not fit in any cache. We adopt
“lattice site updates per second” (LUP/s), i.e., scalar inner kernel
iterations, as an appropriate performance metric.
4.1 Basic analysis
With AVX vectorization, one unit of work (eight LUPs) com-
prises eight loads, two stores, six adds, and two multiply instruc-
tions (note that there will be no benefit from inner loop unrolling to
enable re-use of data along the leading dimension). Hence, TnOL =
8cy and TOL = 6cy. The code is thus limited by the LOAD pipeline
throughput at the core level.
The data transfer properties of the loop nest depend on the prob-
lem size in relation to the available cache sizes. The target array
b causes a traffic of two CLs across the complete memory hier-
archy due to the write-allocate transfer on every store miss. The
right hand side element a[j][i-1] can always be obtained from
the L1 cache since it was loaded two inner iterations before as
a[j][i+1], and a[j+1][i] must be loaded from memory since
it was not used before within the sweep. Whether a[j-1][i] and
a[j][i+1] cause cache misses in a certain cache level depends on
whether three successive rows of the array a fit into the cache [3].
We call this the layer condition (LC). If the size of cache k is Ck, it
can be formulated as
3 ·Ni ·8B < Ck2 . (9)
The factor of 1/2 is a rule of thumb: we assume that half the cache
size is available for storing layers. The number of layers is con-
nected to the stencil radius in the outer dimension. If r is the radius,
(2r+1) layers have to fit.
If the layer condition is fulfilled for cache level k, only a[j+1][i]
causes a cache miss in this level, and is subsequently used three
times from the cache. The code balance is then 24B/LUP as seen
from the next memory hierarchy level (cache k+1, or main mem-
ory), because only three data streams must be sustained. If the
condition is violated, only a[j][i-1] comes from the cache and
all other elements cause misses, leading to five data streams and a
code balance of 40B/LUP. Columns 2–4 in Table 3 show a sum-
mary of the ECM model timing and performance predictions on
a SNB core. In column 5 we give the leading dimension below
which the respective layer condition is satisfied. This number was
obtained by solving (9) for Ni and using the cache sizes of the SNB
processor in Table 1.
4.2 Single-core performance vs. problem size
Figure 3 shows the measured performance and the ECM model
prediction for a large range of memory-bound problem sizes. After
the problem drops out to main memory at Ni ≈ 1145 (quadratic
domain), three phases can be distinguished. These can be at-
tributed to the layer condition being satisfied in the L2 cache, the
L3 cache, and not at all, corresponding to rows 2–4 in Table 3. An
L1 layer condition would require a leading dimension of less than
683, which leads to an in-cache problem size if Ni = N j. For each
phase we also show the predicted code balance for different levels
of the memory hierarchy. These results indicate that the Roofline
model may not be a reliable model for the single core, because
the performance changes considerably when the leading dimension
grows beyond Ni = 5461 although BmemC stays the same. See below
for more discussion on this issue.
103 104 105 106 107
Leading dimension (Ni)
0
200
400
600
800
Pe
rfo
rm
an
ce
 [M
LU
P/
s]
Ni x Nj = 4 x 10
8Ni = Nj
L3
 li
m
it
no LCLC in L3LC in L2
BC
mem
 = 40 B/LUPBC
L3
 = 40 B/LUP
BC
mem
 = 24 B/LUP
B
Cm
em
 
=
 2
4 
B/
LU
P
B
CL3
 
=
 2
4 
B/
LU
P
B
CL2
 
=
 4
0 
B/
LU
P
Figure 3: In-memory performance vs. leading array dimension
of the 2D Jacobi sweep on a SNB core. Up to a leading dimen-
sion of 2×104 the domain is quadratic. Beyond that, the overall
problem size is kept constant. The dashed line is the prediction
from the ECM model.
The dashed line in Fig. 3 shows the ECM model prediction,
which is in good agreement with the measurement in all three
phases. The model fits well within a 10% margin, and it is even
slightly below the measurement if the layer condition is fulfilled in
L2 cache. We have observed this effect in many situations where
the pressure on the memory hierarchy gets smaller when mov-
ing towards main memory (i.e., when the number of streams goes
down). One may speculate that the non-overlapping assumption of
the model is inaccurate in these cases; the deviation is generally
small, however.
Despite the performance fluctuations (especially in the phase
with BC = 40B/LUP), the actual memory code balance obtained
by measuring the data volume with hardware performance events
is within 5% of the predicted value.
4.3 Spatial blocking optimization
Spatial blocking (also called “loop tiling”) is a well-known opti-
mization for loop nests that aims at improving temporal data access
locality. In the context of stencil solvers it amounts to rearranging
the order of the stencil updates so that some layer condition is met,
leading to a decrease in code balance. We can predict the expected
gain of spatial blocking on the single core by setting up an ECM
model for different blocking strategies. This is especially simple
for the 2D Jacobi stencil as the layer condition (9) only depends on
the leading dimension. One must conclude that it is sufficient to
block the inner loop:
// inner block size = b_i
for(is=1; is<Ni -1; is += bi)
for(j=1; j<N j -1; ++j)
for(i=is; i<min(Ni -1,is+bi -1; ++i)
b[j][i] = ( a[j][i-1] + a[j][i+1]
+ a[j-1][i] + a[j+1][i] ) * s;
The layer condition (9) now takes a modified form:
3 ·bi ·8B < Ck2 . (10)
Hence, the block size bi determines whether the layer condition
will be satisfied in cache level k, and column 5 in Table 3 again pro-
vides the relevant thresholds. Choosing bi appropriately is called
“blocking for cache level k.” Figure 4 shows performance vs. lead-
ing dimension for three block sizes and no blocking. These four
103 104 105 106 107
Leading dimension (Ni)
0
200
400
600
800
Pe
rfo
rm
an
ce
 [M
LU
P/
s]
no blocking
bi = 2.3 x 10
5
 (L3 blocking)
bi = 4.3 x 10
3
 (L2 blocking)
bi = 800, bj = 300 (L1 blocking)
ECM predictions
Roofline BC
mem
 = 40 B/LUP
Roofline BC
mem
 = 24 B/LUP Ni x Nj = 4 x 10
8Ni = Nj
Figure 4: Performance vs. problem size for the 2D Jacobi sten-
cil as in Fig. 3 but with spatial blocking of the leading dimen-
sion with block size bi. The four ECM model predictions corre-
spond to the rows of Table 3. In case of the L1 layer condition
(squares) blocking in j direction is required (see also Fig. 5).
The single-core Roofline predictions for the two in-memory
code balance values are shown for comparison. Roofline does
not yield different predictions across different blocking strate-
gies.
cases (squares, circles, triangles, gray line) correspond to the four
possible layer conditions in Table 3, and the ECM models given
there apply. The solid lines denote the ECM performance predic-
tions. The dotted and dashed-dotted lines mark the Roofline model
predictions derived from measured single-thread cache and mem-
ory bandwidths.3 Since the Roofline model predicts a memory-
bound situation in all cases (independent of the blocking), there are
only two performance levels: a lower level for BmemC = 40B/LUP
and a higher level for BmemC = 24B/LUP. Although the Roofline
model cannot distinguish between different blocking strategies, the
best and worst case scenarios (L1 blocking and no blocking) are
covered with useful accuracy. Note, however, that this is a co-
incidence: As the ECM model shows, the runtime contributions
are rather evenly distributed across the memory hierarchy. The
Roofline model works in this case because the benchmark used for
taking the single-thread memory bandwidth limit has similar char-
acteristics (in fact, the ECM model prediction for the STREAM
COPY loop is {4e10e16e29} cy, and the number of data streams
is the same as for the Jacobi 2D kernel with L1 blocking).
The data shows that the ECM model is able to describe the per-
formance impact of spatial blocking quite accurately, and that the
gain from moving a layer condition up in the memory hierarchy
(i.e., satisfying it in a higher cache level) can be leveraged also for
larger leading dimensions. In case of pure L1 blocking, however,
the observed performance was at first lower than expected. A mea-
surement of the memory data traffic revealed that the actual code
balance was much higher than 24B/LUP, which was caused by the
hardware prefetcher: If the inner block size bi is small, the lead-
ing array dimension is only traversed partially, but the hardware
prefetcher cannot foresee when the loop ends. It fetches cache lines
from beyond the actual end of the short inner loop block. Those
cannot be used before the next inner block is updated, they are thus
evicted prematurely and cause excess memory traffic. Additional
3STREAM COPY kernel: bmemS = 17GB/s, b
L3
S = 34GB/s, b
L2
S =
56GB/s. The assumed in-core performance was the same as for the
ECM model.
1000 2000
outer block size bj
550
600
650
Pe
rfo
rm
an
ce
 [M
LU
P/
s]
(a)
1000 2000
outer block size bj
24
25
26
27
28
B
Cm
em
 
[B
/L
UP
]
(b)
minimum
Figure 5: Influence of outer loop blocking on the 2D Jacobi
stencil at a constant inner loop block size of bi = 800 and a con-
stant problem size of Ni = 3.5×104 and N j = 1.2×104. (a) Per-
formance vs. outer loop block size. (b) Measured in-memory
code balance vs. outer loop block size.
blocking in the outer dimension was added to correct this problem:
for(js=1; js<N j -1; js += b j)
for(is=1; is<Ni -1; is += bi)
for(j=js; j<min(N j -1,js+b j -1); ++j)
for(i=is; i<min(Ni -1,is+bi -1); ++i)
b[j][i] = ( a[j][i-1]+a[j][i+1]
+ a[j-1][i]+a[j+1][i] ) * s;
In Fig. 5 we show the effect of the outer loop blocking with block
size b j on the performance and on the measured in-memory code
balance at a constant inner block size of bi = 800. It turns out
that b j ≈ 300 alleviates the prefetcher problem: blocks of size bi×
b j are updated consecutively along the leading dimension, and all
excess cache lines can be used. Note that one can determine the
excess traffic, and hence the prefetch distance, from the measured
data volume at large b j. We have calculated the prefetch distance to
be about 33 CL in our case, but this result is probably not generic
since the details of the prefetch mechanism are unknown. Data
overheads at block boundaries (of which eager prefetching is just
the most striking example) may be incorporated into the model if
the corresponding data volumes are known.
Although all considered problem sizes are bigger than the L3
cache, the execution is far from memory bound. Even when block-
ing for the L1 cache (squares in Fig. 4), the memory bandwidth
as calculated from the product of performance and code balance is
barely 17GB/s out of a maximum of over 40GB/s.
Since the code balance is already at its possible minimum of
24B/LUP across all cache levels, spatial blocking cannot improve
the single-core performance any further. The ECM model predic-
tion in the first row of Table 3 shows that if one were able to re-
duce the core time by a factor of two (from 8 to 4 cycles) by some
advanced optimization technique such as register blocking, single-
core performance would improve by a factor of 33/(33−4) = 1.14.
4.4 Multi-core scaling
When going from single-threaded to multi-threaded execution
on a multicore chip, two architectural features impact the perfor-
mance behavior: (i) the memory bandwidth bottleneck and (ii) the
shared outer-level cache (if present). In Fig. 6 we show perfor-
mance scaling across the cores of a SNB socket at Ni = 1.2×106.
Two-dimensional L1 blocking was only used where necessary. In
all cases (blocked or not), the j loop was parallelized with OpenMP
1 2 3 4 5 6 7 8
# cores
0
200
400
600
800
1000
1200
1400
1600
1800
Pe
rfo
rm
an
ce
 [M
LU
P/
s]
no blocking
bi = 800, bj = 300
ECM model (L1 blocking)
ECM model (no blocking)
(a)
1 2 3 4 5 6 7 8
# cores
0
200
400
600
800
1000
1200
1400
1600
1800
Pe
rfo
rm
an
ce
 [M
LU
P/
s]
40B/LUP
24B/LUP
no blocking
bi = 2.3 x 10
5
bi = 1.1 x 10
5
bi = 2.3 x 10
5
 / nthreads
(b)
1 2 3 4 5 6 7 8
# cores
0
200
400
600
800
1000
1200
1400
1600
1800
Pe
rfo
rm
an
ce
 [M
LU
P/
s]
bi = 800, bj = 300
bi = 4.3 x 10
3
bi = 2.3 x 10
5
 / nthreads
(c)
Figure 6: Scaling of the Jacobi 2D kernel across the cores of a SNB chip at Ni = 1.2× 106. (a) Comparison of ECM model and
measurement for L1 blocking and no blocking. (b) Choosing the right block size bi for the L3 layer condition. (c) Comparison of
saturation behavior for L1, L2, and L3 blocking.
and static loop scheduling. This ensures that the prefetcher problem
described in the previous section remains solved, since each thread
works on a consecutive string of blocks in the leading dimension.
Figure 6a compares the ECM model and scaling predictions with
measurements for the two extreme cases of no blocking (circles)
and L1 blocking (stars). The general scaling behavior is described
well, although there is a deviation in the vicinity of the saturation
point, which we attribute to changes in the prefetching strategy
when the memory interface is almost fully utilized [10]: For in-
stance, cache lines are only prefetched into L3 and the prefetch dis-
tance is smaller. This leads to an additional latency penalty, which
is not part of the ECM model (but could be included). The satura-
tion level is slightly smaller than predicted by the code balance, but
the deviation is well below 10%.
In Fig. 6b we study the influence of the block size on the scalabil-
ity when blocking for the L3 cache. At bi = 2.3×105 (squares) the
single-core speedup versus the non-blocked version (circles) is as
expected, but performance saturates at a level that indicates a code
balance of 40B/LUP. Reducing the block size to bi = 1.1× 105
shows no improvement on the single core but leads to much higher
performance up to four cores. Beyond four cores, however, perfor-
mance degrades and settles again at the same level as before. The
reason for this behavior is the shared L3 cache, which requires a
modified layer condition:
3 ·bi ·n ·8B < C32 . (11)
The n-dependent block size ensures that three layers per thread fit
into the cache, and leads to the saturation performance expected
from a code balance of 24B/LUP (triangles). Note that block size
adaptivity is not required when blocking for core-private caches,
since their aggregate size scales with the number of cores.
Finally we need to comment on the influence of the chosen
blocking strategy (L1, L2, or L3) on the scalability. Technically
the ECM model predicts different saturation points for L3 block-
ing (4 cores) and L1 or L2 blocking (3 cores), as shown in the
last column of Table 3. In reality they display very similar satura-
tion behavior for the Jacobi 2D kernel (see Fig. 6c), although the
general trend to saturate earlier with faster single-core code can
certainly be observed. If maximum performance is the only signif-
icant metric, it is irrelevant in practice which blocking strategy is
chosen as long as the in-memory code balance is at its minimum of
24B/LUP. Saturating earlier, however, has some advantages: “ex-
pendable” cores can be put into a low-power mode, saving energy
(“concurrency throttling”), or they can be used for other tasks such
as asynchronous communication.
Note that the use of non-temporal stores (also called “streaming
stores”) may reduce the in-memory code balance from 24B/LUP
to 16B/LUP by ignoring the cache hierarchy on a store miss, sav-
ing the write-allocate transfers. However, their implementation on
Intel processors cannot be described by a simple throughput as-
sumption. Therefore it is unclear as yet how to incorporate the
non-temporal store instructions in the ECM model.
5. CASE STUDY: UXX STENCIL
The “uxx” stencil [14] is a part of a simulation code for dy-
namic rupture and earthquake wave propagation. The basic loop
nest without any blocking optimizations looks as follows:
#pragma omp parallel for schedule(static) \
private(d)
for(int k=2; k<=N-1; k++){
for (int j=2; j<=N-1; j++){
for (int i=2; i<=N-1; i++){
d = 0.25*(d1[ k ][j][i] + d1[ k ][j-1][i]
+ d1[k-1][j][i] + d1[k-1][j-1][i]);
u1[k][j][i] = u1[k][j][i] + (dth/d)
*( c1*(xx[ k ][ j ][ i ]-xx[ k ][ j ][i-1])
+ c2*(xx[ k ][ j ][i+1]-xx[ k ][ j ][i-2])
+ c1*(xy[ k ][ j ][ i ]-xy[ k ][j-1][ i ])
+ c2*(xy[ k ][j+1][ i ]-xy[ k ][j-2][ i ])
+ c1*(xz[ k ][ j ][ i ]-xz[k-1][ j ][ i ])
+ c2*(xz[k+1][ j ][ i ]-xz[k-2][ j ][ i ]));
}}}}
The code is used with either single or double precision, and it con-
tains a divide operation in the inner loop. Without any guidance
by performance modeling one would probably try to eliminate the
divide (which is non-trivial) because “divides are expensive.”
5.1 Analysis and ECM model
The loop code as generated by the compiler is quite complex, so
we employ the IACA tool. At double precision (DP) the through-
put analysis reports the divider to be the bottleneck on the core,
version ECM model [cy] prediction [cy]
DP {84‖38 |20 |20 |26} {84e84e84e104}
SP {45‖38 |20 |20 |26} {45e58e78e104}
DP noDIV {41‖38 |20 |20 |26} {41e58e78e104}
Table 4: ECM model and cycle predictions for the uxx stencil
code in double precision (DP) and single precision (SP), respec-
tively, with L3 blocking. The unit of work is 8LUPs for DP
and 16LUPs for SP. The “noDIV” version performs a multiply
instead of a divide in the inner loop.
causing a core time of Tcore = TOL = 84cy per CL (eight iterations)
due to the very slow 42-cycle throughput for the divide instruction
(vdivpd). The non-overlapping part is TnOL = 38cy, caused by the
loads to the arrays xx, xy, xz, d1, and u1. With single precision
(SP) the machine code does not contain any divide at all: the com-
piler uses the vrcpps instruction instead, which provides a low-
precision (11bits) but fully vectorized and pipelined reciprocal, and
employs a subsequent Newton-Raphson step and a multiply to per-
form a 22-bit divide [10]. This results again in TnOL = 38cy as the
load instructions are the same as with DP. The overlapping part is
TOL = 35cy, caused by add, multiply, and 16-byte move instruc-
tions. The code is close to the limit of four micro-ops per cycle on
the SNB, which is why IACA reports a front-end bottleneck and an
overall throughput of Tcore = 45cy. Since front-end stalls overlap
with the transfer time, the seven extra cycles have no significance
and will be ignored in the following.
Calculating the transfer time requires an analysis of relevant
layer conditions. There are two types of layers in a 3D stencil
code: one-dimensional layers along the leading dimension (rows
along i of size N) and layers that span the extent of the inner two-
level loop nest (size N ×N). At relevant problem sizes only the
latter will matter, and we can safely ignore the “row conditions”
since they will be automatically fulfilled in the L1 cache. The only
data structures which are subject to layer conditions are thus xz
and d1, because their stencil radius is larger than zero in the outer
(k) dimension. The cache must hold as many layers as there are
distinct outer index references, i.e., two for d1 (which is accessed
in layers k and k− 1) and four for xz (which is accessed in layers
k− 2 to k+ 1). Hence, the layer condition for n threads in the
shared L3 cache and blocking in the j dimension is
(4+2) ·N ·b j ·n ·
{
4B (SP)
8B (DP)
}
<
C3
2
. (12)
One may certainly consider higher cache levels or block the inner
dimension as well, but we omit this discussion here since it would
not provide new insights.
With L3 blocking, the transfer time will be independent of the
precision. Six consecutive data streams hit the memory interface
per thread: one each for xx, xy, xz, and d1, and two for u1.
The code balance in memory is thus BmemC = 24B/LUP in SP and
48B/LUP in DP. Assuming that the higher caches are too small to
hold the layers, the L3 cache will be hit by ten streams per thread.
The analysis in the previous sections now enables us to construct
the ECM model for the uxx stencil. The first two rows in Table 4
show the model and the cycle predictions for SP and DP. For com-
parison we also include a special DP version (“noDIV”) in which
the divide as substituted by a multiplication.
The ECM model yields the same cycle prediction per unit of
work in main memory for all versions (104cy), hence we expect a
factor of two in performance between SP and DP. Also, all versions
1 2 3 4 5 6 7 8
#cores
0
200
400
600
800
1000
1200
1400
1600
1800
2000
Pe
rfo
rm
an
ce
 [M
LU
P/
s]
(a) Sandy Bridge
1 2 3 4 5 6 7 8 9 10
#cores
ECM Model SP
single precision
ECM Model DP
double precision noDIV
double precision
(b) Ivy Bridge
Figure 7: Performance scaling and ECM model predictions of
the uxx stencil benchmark with L3 blocking at a problem size
of 2763 grid points across the cores of one SNB socket (left) and
one IVB socket (right) in single and double precision.
should saturate at four cores. Probably the most striking result is
that the presence or absence of the divide does not make a differ-
ence as
Tdata +TnOL > TOL (13)
despite the low-throughput divide. Going to great lengths in trying
to eliminate it from the code will thus not be rewarded with per-
formance improvements. These predictions are confirmed by the
performance data shown for one SNB socket in Fig. 7a. There is
actually a slight speedup for the “noDIV” version below saturation.
Using the latency analysis in the IACA tool revealed that the crit-
ical code path through the loop body has a worst case execution
time of 192cy. Usually the throughput prediction is much closer
to the measured runtime since the out of order logic can fill a lot
of pipeline bubbles across the independent loop iterations. In this
case, however, the critical path is so long that (13) is just barely sat-
isfied (in fact there is a 10% deviation from the model on the single
core). Removing the divide shortens the critical path so much that
(13) can be easily met, and the ECM model prediction is very pre-
cise. For comparison we present the results on an Intel Ivy Bridge
CPU (3.0GHz, bS = 47GB/s) in Fig. 7b. This architecture is quite
similar to Sandy Bridge apart from the higher memory bandwidth
and the faster divide throughput (28cy instead of 42cy for a dou-
ble precision divide with AVX). The accuracy of the model is even
better here because of the shorter critical path in double precision.
The ECM model in this case is {56‖38 |20 |20 |25} cy, leading to
a prediction of {56e58e78e103} cy. The divide instruction has no
impact if the data does not fit into the L1 cache.
5.2 Optimization opportunities
One typical stencil optimization that goes beyond spatial block-
ing is temporal blocking, which can dramatically reduce the in-
memory code balance by performing multiple updates on each grid
point in cache before it gets evicted. The ECM model can provide
upper limits to the expected benefit of temporal blocking. In case of
uxx, optimal temporal blocking for the L3 cache would completely
remove the memory transfer time of TL3Mem = 26cy, resulting in a
24% (DP) or 33% (SP) speedup on the single core. The next bottle-
neck will then be the divide for DP (84cy) and the L3 transfer time
for SP (78cy), respectively, but both of these are purely core-local.
The true potential of temporal blocking is thus not in the speedup
on the single core but in the removal of the memory bandwidth
bottleneck, allowing for scalable performance with an upper limit
of over 2000MLUP/s even for DP and still including the divide.
Even in this case the benefit from temporal blocking would by far
outweigh the speedup gained from removing the divide.
See [15] for a case study of temporal blocking with a 3D Jacobi
smoother supported by an ECM model analysis.
6. CASE STUDY: 3D LONG-RANGE STEN-
CIL
A large stencil radius in the outer grid dimension leads to a high
pressure on the cache levels since they have to hold a large number
of layers. As an example we consider a constant coefficient long-
range stencil code in single precision with a radius of four in each
dimension:
#pragma omp parallel for
for(int k=4; k < N-4; k++) {
for(int j=4; j < N-4; j++) {
for(int i=4; i < N-4; i++) {
float lap = c0 * V[k][j][i]
+ c1 * ( V[ k ][ j ][i+1]+ V[ k ][ j ][i-1])
+ c1 * ( V[ k ][j+1][ i ]+ V[ k ][j-1][ i ])
+ c1 * ( V[k+1][ j ][ i ]+ V[k-1][ j ][ i ])
+ c2 * ( V[ k ][ j ][i+2]+ V[ k ][ j ][i-2])
+ c2 * ( V[ k ][j+2][ i ]+ V[ k ][j-2][ i ])
+ c2 * ( V[k+2][ j ][ i ]+ V[k-2][ j ][ i ])
+ c3 * ( V[ k ][ j ][i+3]+ V[ k ][ j ][i-3])
+ c3 * ( V[ k ][j+3][ i ]+ V[ k ][j-3][ i ])
+ c3 * ( V[k+3][ j ][ i ]+ V[k-3][ j ][ i ])
+ c4 * ( V[ k ][ j ][i+4]+ V[ k ][ j ][i-4])
+ c4 * ( V[ k ][j+4][ i ]+ V[ k ][j-4][ i ])
+ c4 * ( V[k+4][ j ][ i ]+ V[k-4][ j ][ i ]);
U[k][j][i] = 2.f * V[k][j][i]- U[k][j][i]
+ ROC[k][j][i] * lap;
}}}
The only array that is relevant for layer conditions here is V; the
others (U and ROC) are accessed in linear order without potential
for cache re-use.
6.1 Analysis and ECM model
The IACA tool reports an overlapping core time of TOL = 68cy,
which is mainly due to adds and frontend stalls, while the load
port limitation is at TnOL = 62cy (there are only 27 visible loads in
the loop body, but due to some integer register spilling four more
cycles are required). Since the stencil radius is 4 in all directions,
the block size in the j direction must be chosen so that the cache
can hold nine layers. The n-thread layer condition is
9 ·N ·b j ·n ·4B < C32 . (14)
In this case, four data streams per thread will hit the memory and
the code balance will be Bmemc = 16B/LUP. If the layer con-
dition is only satisfied in L3, this cache will be hit by twelve
streams per thread, which already indicates that the transfer time
is significant for this kernel and that the main contribution to
it does not come from main memory. Indeed the ECM model
is {68‖62 |24 |24 |17} cy, which leads to a cycle prediction of
{68e86e110e127} cy. Hence, only 17/127 ≈ 13% of the exe-
cution time is attributed to TL3Mem, and the code will just barely
saturate at eight cores. If the layer condition is not satisfied at all,
the memory code balance will rise to Bmemc = 48B/LUP. In Fig. 8
we show performance data on a SNB socket with and without
blocking. At the chosen problem size of 4803, the layer condi-
tion is fulfilled without blocking for a single thread, which is why
1 2 3 4 5 6 7 8
#cores
0
500
1000
1500
2000
2500
Pe
rfo
rm
an
ce
 [M
LU
P/
s]
Roofline Model L3 blocking
ECM Model L3 blocking
L3 blocking
ECM Model no blocking
no blocking
BC
mem
 = 16 B/LUP
BC
mem
 = 48 B/LUP
Figure 8: Performance scaling and ECM model predictions of
the 3D long-range stencil benchmark with (circles) and without
(triangles) L3 blocking at a problem size of 4803 grid points
across the cores of one SNB socket in single precision. The
Roofline model prediction is shown for reference.
the single-core performance and prediction is the same in both
cases. At two cores and above, however, blocking is required. The
blocked code scales well almost up to the full socket, as predicted
by the ECM model.
This stencil provides a good example for a situation where the
Roofline model fails to deliver a useful performance estimate. Due
to the large in-core execution time, the loop is predicted to be core-
bound until the memory bandwidth is saturated, which happens at
four cores (top data set in Fig. 8). However, the other data trans-
fer contributions are far from negligible, so the Roofline model is
much too optimistic here due to its overlapping assumption.
6.2 Optimization opportunities
The blocking optimization has almost removed the memory
bottleneck, and the ECM model predicts that temporal blocking,
which would eliminate TL3Mem altogether, would only have a mi-
nor effect on the single core and the full socket. In order to get
any performance improvement one would have to reduce the ma-
jor contribution in TECMmem , which is TnOL. If all core contribu-
tions including TnOL could be shrunk by 50%, the ECM model
would be {34‖31 |24 |24 |17} cy and the prediction would be
{34e55e79e96} cy. The single-core code would be accelerated
by 33% and saturation would set in at six cores, making tempo-
ral blocking a viable optimization. Possible code transformations
leading to a substantial improvement of Tcore would have to save
on arithmetic operations as well as on loads. The semi-stencil
algorithm [16] is one recent example.
7. SUMMARY
In this work we have reviewed the ECM performance model us-
ing simple streaming kernels and refined it to clearly state its as-
sumptions regarding the overlap of different contributions to the
single-core runtime of a loop kernel across the memory hierarchy.
By introducing a short-hand notation we could ease the presenta-
tion of the model results and performance predictions. We have
then applied the model to three stencil algorithms with different
code characteristics on an Intel SandyBridge processor. Using a
2D Jacobi stencil as the initial example we were able to explain the
single-core performance vs. problem size by accurately calculating
the runtime contributions. The performance and scalability impact
of spatial blocking for establishing “layer conditions” in different
cache levels was quantified using the ECM model, and it was shown
why outer loop blocking is required although the layer condition
does not depend on the outer problem dimension. The performance
of a more complex 3D stencil with an expensive divide operation in
the inner loop was shown to be independent of the presence of the
divide because of the dominance of transfer times. Temporal block-
ing was predicted to show a minor speedup on the single core but a
major performance boost on the full chip even if the divide opera-
tion is left untouched. Finally we have used the model to show that
a 3D long-range stencil cannot saturate the memory bandwidth of
the CPU at minimum in-memory code balance because of in-cache
contributions to the single-core runtime. Any optimization attempt
would thus have to aim at reducing the in-cache runtime before
more advanced techniques such as temporal blocking are applied.
It must be stressed that we have barely scratched the surface of
the vast range of known stencil optimizations, but the basic line of
thinking has been clearly demonstrated. We have also shown when
and why Roofline model predictions fail to coincide with the ECM
model. The ECM model is to our knowledge the only model that
can usefully estimate the contributions to the single-core execution
time of streaming loop kernels, of which stencil codes are just one
example. It leads to a clear understanding of relevant bottlenecks
and potential optimization approaches but can mostly be set up us-
ing “pencil and paper.” Work is ongoing to build a simple tool that
can construct the model from a high-level description of the code
and the architecture, making it simpler to apply for the non-expert.
One basic limitation of the ECM model is its “streaming assump-
tion:” The model is over-optimistic if the data transfers through
the memory hierarchy are latency-bound, which is the case, e.g.,
in “pointer chasing” scenarios, or on processors for which hard-
ware and/or software prefetch mechanisms cannot hide the latency
cost. In addition, we do not know yet how to incorporate the cost
of non-temporal stores. Finally we have observed the model to
be too optimistic for the memory data transfer time of tight loops
with low Tcore on processors with high-frequency memory inter-
faces. These shortcomings may be “fixed” by introducing addi-
tional latency penalties, but we consider such an approach undesir-
able, since it would introduce additional parameters just for mak-
ing the predictions more accurate. The main goal of the model is,
however, to understand trends and bottlenecks, for which a precise
performance prediction is often not required.
Acknowledgment
We thank Andrey Semin (Intel Germany) for useful discussions,
Olaf Schenk (USI Lugano) for providing the “uxx” benchmark
case, and Hatem Ltaief (KAUST) for providing the 3D long-range
stencil case. Part of this work was supported by the DFG priority
programme 1648 “SPPEXA” under the project “EXASTEEL.”
8. REFERENCES
[1] T. Henretty, K. Stock, L.-N. Pouchet, F. Franchetti,
J. Ramanujam, and P. Sadayappan, “Data layout
transformation for stencil computations on short-vector simd
architectures,” in Proceedings of the 20th International
Conference on Compiler Construction: Part of the Joint
European Conferences on Theory and Practice of Software,
ser. CC’11/ETAPS’11. Berlin, Heidelberg:
Springer-Verlag, 2011, pp. 225–245.
[2] J. Treibig and G. Hager, “Introducing a performance model
for bandwidth-limited loop kernels,” in Parallel Processing
and Applied Mathematics, ser. Lecture Notes in Computer
Science, R. Wyrzykowski, J. Dongarra, K. Karczewski, and
J. Wasniewski, Eds. Springer Berlin / Heidelberg, 2010,
vol. 6067, pp. 615–624.
[3] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, and
K. Yelick, “Optimization and performance modeling of
stencil computations on modern microprocessors,” SIAM
Review, vol. 51, no. 1, pp. 129–159, 2009.
[4] C. Leopold, “Tight bounds on capacity misses for 3d stencil
codes,” in Proceedings of the International Conference on
Computational Science-Part I, ser. ICCS ’02. London, UK,
UK: Springer-Verlag, 2002, pp. 843–852.
[5] S. M. F. Rahman, Q. Yi, and A. Qasem, “Understanding
stencil code performance on multicore architectures,” in
Proceedings of the 8th ACM International Conference on
Computing Frontiers, ser. CF ’11. New York, NY, USA:
ACM, 2011, pp. 30:1–30:10.
[6] D. Callahan, J. Cocke, and K. Kennedy, “Estimating
interlock and improving balance for pipelined architectures,”
Journal of Parallel and Distributed Computing, vol. 5, no. 4,
pp. 334 – 358, 1988.
[7] R. W. Hockney and I. J. Curington, “ f1/2: A parameter to
characterize memory and communication bottlenecks,”
Parallel Computing, vol. 10, no. 3, pp. 277–286, 1989.
[8] S. Williams, A. Waterman, and D. Patterson, “Roofline: An
insightful visual performance model for multicore
architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76,
2009.
[9] G. Hager, J. Treibig, J. Habich, and G. Wellein, “Exploring
performance and power properties of modern multicore chips
via simple machine models,” Concurrency Computat.: Pract.
Exper., 2013, DOI: 10.1002/cpe.3180.
[10] “Intel 64 and IA-32 architectures optimization reference
manual,” September 2014.
[11] J. Treibig, G. Hager, and G. Wellein, “LIKWID: A
lightweight performance-oriented tool suite for x86
multicore environments,” in Proc. PSTI2010. Los
Alamitos, CA, USA: IEEE Computer Society, 2010, pp.
207–216, DOI: 10.1109/ICPPW.2010.38.
[12] “Intel architecture code analyzer.” [Online]. Available:
http://software.intel.com/en-us/articles/
intel-architecture-code-analyzer/
[13] M. A. Suleman, M. K. Qureshi, and Y. N. Patt,
“Feedback-driven threading: power-efficient and
high-performance execution of multi-threaded workloads on
CMPs,” SIGARCH Comput. Archit. News, vol. 36, no. 1, pp.
277–286, Mar. 2008.
[14] M. Christen and O. Schenk, “A performance study of an
anelastic wave propagation code using auto-tuned stencil
computations,” in Proc. ICCS 2012, Omaha, Nebraska, USA,
4-6 June, 2012, ser. Procedia Computer Science, H. H. Ali,
Y. Shi, D. Khazanchi, M. Lees, G. D. van Albada,
J. Dongarra, and P. M. A. Sloot, Eds., vol. 9. Elsevier,
2012, pp. 956–965.
[15] S. Kronawitter, H. Stengel, G. Hager, and C. Lengauer,
“Domain-specific optimization of two Jacobi smoother
kernels and their evaluation in the ECM performance
model,” Parallel Processing Letters, vol. 24, no. 03, p.
1441004, 2014.
[16] R. de la Cruz and M. Araya-Polo, “Algorithm 942:
Semi-Stencil,” ACM Trans. Math. Softw., vol. 40, no. 3, pp.
23:1–23:39, Apr. 2014.
