Towards a Multi-Level Cache Performance Model for 3D Stencil Computation  by de la Cruz, Ràul & Araya-Polo, Mauricio
Available online at www.sciencedirect.com
1877–0509 © 2011 Published by Elsevier Ltd. 
Selection and/or peer-review under responsibility of Prof. Mitsuhisa Sato and Prof. Satoshi Matsuoka 
doi:10.1016/j.procs.2011.04.235
Procedia Computer Science 4 (2011) 2146–2155
International Conference on Computational Science, ICCS 2011, iWAPT2011
Towards a Multi-Level Cache
Performance Model for 3D Stencil Computation
Rau´l de la Cruza,1, Mauricio Araya-Polob
aBarcelona Supercomputing Center, Barcelona (Spain)
bRepsol USA, The Woodlands, TX (USA)
Abstract
It is crucial to optimize stencil computations since they are the core (and most computational demanding segment)
of many Scientiﬁc Computing applications, therefore reducing overall execution time. This is not a simple task,
actually it is lengthy and tedious. It is lengthy because the large number of stencil optimizations combinations to test,
which might consume days of computing time, and the process is tedious due to the slightly different versions of code
to implement. Alternatively, models that predict performance can be built without any actual stencil execution, thus
reducing the cumbersome optimization task. Previous works have proposed cache misses and execution time models
for speciﬁc stencil optimizations. Furthermore, most of them have been designed for 2D datasets or stencil sizes that
only suit low order numerical schemes. We propose a ﬂexible and accurate model for a wide range of stencil sizes up
to high order schemes, that captures the behavior of 3D stencil computations using platform parameters. The model
has been tested in a group of representative hardware architectures, using realistic dataset sizes. Our model predicts
successfully stencil execution times and cache misses. However, predictions accuracy depends on the platform, for
instance on x86 architectures prediction errors ranges between 1-20%. Therefore, the model is reliable and can help
to speed up the stencil computation optimization process. To that end, other stencil optimization techniques can be
added to this model, thus essentially providing a framework which covers most of the state-of-the-art.
Keywords: stencil computation, HPC, code optimization, homogeneous multi-core, performance model
1. Introduction
Finite Difference (FD) is a widely used method to solve Partial Differential Equations (PDE). Astrophysics [1]
and oceanography [2] are examples, among many, of scientiﬁc ﬁelds where large computer simulations are frequently
deployed. These kind of simulations may consume weeks of supercomputer time. Speciﬁcally, if they are PDE+FD
based, most of this execution time is spent on stencil computations. For instance, Reverse-Time Migration is a seismic
imaging technique in geophysics, where up to 80% of the execution time is devoted to stencil computations [3].
The main objective of this work is to improve stencil-based applications performance. Usually, high performance
is achieved by algorithmic changes that lead to suboptimal implementations, where many executions of those dif-
ferent implementations are required. Notice here that code modiﬁcations must not taint the numerical soundness of
Email addresses: delacruz@bsc.es (Rau´l de la Cruz), araya.mauricio@repsol.com (Mauricio Araya-Polo)
1This work was partially supported by project TIN2007-60625 of Spanish Government’s Science and Innovation Ministry.
Open access under CC BY-NC-ND license.
Rau´l de la Cruza et al. / Procedia Computer Science 4 (2011) 2146–2155 2147
the algorithm. Thus, adding another issue to be kept into account. In this work, we deﬁne a test case as a set of
characteristics that include platform and stencil parameters, such as: stencil size, data size, algorithm implementation,
architecture, etc. Since, every one of those parameters may change among executions, the number of total test cases
is large. Therefore it is not feasible to ﬁnd a possible suboptimal stencil scheme by manual means; the automatization
of this task is a must.
 	
	
 
	
















Figure 1: Test cases tree, this one with a ba-
sic conﬁguration based on three main factors:
platform, stencil size and dataset size.
1: for k =  to Y −  do
2: for j =  to X −  do
3: for i =  to Z −  do
4: Xti, j,k = C0 ∗ Xt−1i, j,k
+CZ1 ∗ (Xt−1i−1, j,k + Xt−1i+1, j,k)+ . . .+CZ ∗ (Xt−1i−, j,k + Xt−1i+, j,k)
+CX1 ∗ (Xt−1i, j−1,k +Xt−1i, j+1,k)+ . . .+CX ∗ (Xt−1i, j−,k +Xt−1i, j+,k)
+CY1 ∗ (Xt−1i, j,k−1 + Xt−1i, j,k+1)+ . . .+CY ∗ (Xt−1i, j,k− + Xt−1i, j,k+)
5: end for
6: end for
7: end for
Algorithm 1: The classical stencil algorithm pseudo-code. Z, X,
Y are the dimensions of the data set including ghost points.  de-
notes the neighbors used for the central point contribution. CZ1...Z,
CX1...X, CY1...Y are the spatial discretization coefﬁcients for each
direction and C0 for the self-contribution. Notice that in this work
the coefﬁcients are considered constant.
We identify three ways to accomplish this: by brute force, i.e. trying every possible test case (as auto-tuning), by
modeling the algorithm behavior and a hybrid approach based on the two previous ideas. The auto-tuning approach
has been widely covered in the literature [4]. The latter two are the guiding ideas followed in this work. Unfortunately,
modeling has a shortcoming, which is the limited coverage of the experimental space regarding ﬁne grain test cases.
Therefore, the hybrid approach may be the ideal solution. In this scenario, the branches shown in Figure 1 are covered
by modeling, and the leaves by an auto-tuning kind of mechanism.
In this work, the model-based way of covering the experiments space is explored. We consider that it has three
advantages with respect to auto-tuning. First, there is no need for real executions of the code, which can reduce hours
or days of testing time. Second, the development of a framework to control the experiments execution is not needed.
And third, which is the most important, the model’s ﬂexibility allows the extension without a substantial effort in
implementation and experimental executions. Furthermore, auto-tuning framework extensions, in particular if these
modiﬁcations highly affects the test cases tree, produce combinatorial explosion of code to be modiﬁed.
Nevertheless, the modeling approach has a main drawback: it strongly relies on the ability to capture the algorithm
behavior in an accurate fashion, which is not simple to achieve. In fact, this is the key point of this work. Thus the
main effort has focused on modeling in an efﬁcient way how stencil computations perform on different platforms.
Furthermore, in contrast to previous work [5], our model takes into account a complete memory hierarchy, from RAM
to register level, considering any intermediate cache level.
In order to achieve our goal, we have modeled the behavior of the stencil computation. After that, the model has
been implemented in a program that receives the following parameters: stencil size, architecture speciﬁcations and
dataset size. The reﬁnement and deployment of the model follow a methodology described in this work as well.
In the following subsections we provide the contextual information for this work. First, a detailed explanation of
the stencil computation itself and its challenges, and secondly, an overview of the literature about them.
1.1. Stencil Computation Overview
In this subsection, we review the stencil computation characteristics, and then its main challenges are identiﬁed.
Basically, the stencil central point accumulates the contribution of neighbor points in every axis of the cartesian system
(see Algorithm 1). This operation is repeated for every point in the computational domain, thus solving the spatial
differential operator. The domain contains interior points (inside the solution domain) and ghost points (outside the
solution domain). We identify the two main stencil computation challenges [6] as:
2148  Rau´l de la Cruza et al. / Procedia Computer Science 4 (2011) 2146–2155
• The non-contiguous memory access pattern. In order to compute the central point of the stencil, a set of neigh-
bors have to be accessed. Some of these neighbor points are spread out in the memory hierarchy, thus paying
many cycles in latencies. Furthermore, depending on the stencil order (related to the number of neighbors that
contribute to the spatial differential operator) this problem becomes even more burdensome.
• The low computation/access and re-utilization ratios. After gathering a set of data points, just one central point
is computed, and only some of those accessed data points will be useful for the next central point computation.
1.2. Related Work
Several works have been published about performance and modeling of stencil computations on modern proces-
sors. For instance, Datta and Kamil [7, 5] have proposed cost models to capture the performance of a set of stencil
optimizations. Such models are based basically on the Stanza Triad micro-benchmark (STriad) derived from the
STREAM Triad benchmark [8]. Using this model, a 7-point naive stencil computation is modeled by taking into
account three types of memory access costs: ﬁrst, intermediate and stream access. The access cost of the ﬁrst non-
streamed cache line is represented by Cf irst, and the streamed cache lines by Cstream. Prefetched accesses are only
activated after some number of misses (denoted as k), where Cintermediate is used instead of that cost. Considering a FD
domain of L elements (where L = N3, being N the axis domain size) and W elements per cache line, (L/W) cache
lines must be fetched to compute the stencil. Therefore, their cost model for low-order stencils can be summarized as:
Cstencil = C f irst + k ∗Cintermediate + ((L/W) − k − 1) ∗Cstream (1)
In addition, they developed a cache blocking model for Rivera tiling [9], where an N3 problem is traversed using
I× J×N blocks, being I and J the most unit-stride dimensions of the cut. This model was built on the top of the above
mentioned model. They set the lower and upper bounds for I × J × N blocking of a 7-point stencil computation. The
lower bound assumes high reuse (2Cstencil, one plane to read and one plane to write), while the upper bound assumes
no cache reuse at all (4Cstencil, three planes to read and one plane to write). Given an N3 grid problem, the number
of non-streamed (T f irst), partially streamed (Tintermediate) and streamed (Tstreamed) cache lines are computed as: T f irst=
N3
I if I N, or
N3
IJ if I =N  J, or
N2
IJ if I = J =N, Tintermediate =T f irst ∗ (k − 1) and Tstream =Ttotal − Tintermediate − T f irst,
where Ttotal= (I/W)N
3
I . Consequently, the cost of sweeping a 3D domain in a blocked fashion is approximated as:
Cstencil = C f irstT f irst +CintermediateTintermediate +CstreamTstream (2)
Lately, Datta et al. [5] have extended their model to time-skewing stencil computations. They distinguished mainly
ﬁve cases of cache misses: three preferred cases with compulsory misses and other two cases with capacity misses.
Kamil, Datta et al. set the basis for modeling stencil computations, e.g. streamed and non-streamed data for Rivera
blocking. However, there are at least two issues which were not considered in their works. First, they are taking into
account only low-order 3D stencil computations (7-point or 27-point in a compact fashion). Nevertheless, in many
scientiﬁc computing applications high-order stencils are required, from 25-point upward [10] to 85-points and more.
We believe that a parametrical model supporting different stencil order sizes is advisable for future research. Second,
the memory model is approached as an uniﬁed homogeneous hierarchy, leaving aside multi-level cache peculiarities
of modern architectures. Further, the number of available registers in the ALU or the different cost between load
and store operations are not considered in their model. We believe that those aspects must be taken into account in
order to lay down a ﬂexible, accurate and reliable model suitable for any kind of stencil computation and architecture.
However, as far as we know, no stencil computation model with such characteristics can be found in the literature,
which motived us to develop the current work.
The remainder of this paper is organized as follows: Section 2 introduces the stencil computation model, its
internals, features and considerations. In Section 3, we evaluate the performance and the model validity over different
current architectures. Finally, Section 4 presents our conclusions.
2. Modeling Stencil Computation
In this section, we present the basis for a ﬂexible and accurate model suitable for a wide range of stencil computa-
tions and architectures. As mentioned before, the multi-level cache hierarchy has to be considered to predict properly
Rau´l de la Cruza et al. / Procedia Computer Science 4 (2011) 2146–2155 2149
the behavior of stencil computations in current architectures.
It is a widely accepted fact that the stencil computation is memory bound [5, 6]. Therefore, only memory transac-
tions between the CPU and the memory subsystem are contemplated, and computation bottlenecks are supposed to be
negligible. In other words, the execution of the pipeline is dominated by the data transfer latency. Also, interferences
between data and instruction at cache level are not taken into account. On the other hand, due to possible register
spilling, our model estimates the number of loads and stores in the ﬁrst cache level (L1).
This section proceeds as follows: ﬁrstly, we review the base model. Secondly, the number of read and write misses
in stencil computations are approximated. And ﬁnally, some peculiarities, like prefetching, are added to the model.
Software prefetching is considered as an extension to the model, as many other user driven possible optimizations,
and therefore it is not covered in this work, but it is part of our future work plan.
2.1. Base Model
Given a grid of N3 words (element precision, either single or double) to traverse, we can set the fundamental
parameters involved in stencil computations. Consider the grid dimensions as I= J=K=N for a naive implementation,
where I is the unit-stride dimension and J and K are the least stride dimensions excluding ghost points. A 3D stencil
computation of order  requires 2 ∗  + 1 Z-X input planes in order to compute one Z-X output plane [6]. The size of
these planes (in words) differs depending on the operation to perform (read or write). Thus, we proceed to calculate
the memory requirements to compute one k iteration of the sweep (outermost loop of Algorithm 1). Let P be the
number of planes and S the plane size in words, the following terms for read and write operations are obtained,
Pread = 2 ∗  + 1 Pwrite = 1
S read = (I + 2 ∗ ) ∗ (J + 2 ∗ ) S write = I ∗ J
(3)
The stencil computation involves back and forth dataset transfers, which may produce conﬂicts depending on
cache policies because the memory resources are shared. For instance, during a write operation, two types of write
policies can be enforced for a cache miss: write-through and write-back. Considering the memory hierarchy of
our testbed architectures (described in Section 3.1.3), the former does not produce a cache line allocation (no write
allocate). On the contrary, the latter produces cache line allocation (write allocate), thus sharing memory resources
among input data and leading to cache pollution. Therefore, considering the above described policies, the words kept
by the memory subsystem to compute a Z-X plane (one k iteration) are:
Pwritebacktotal = Pread + Pwrite P
writethrough
total = Pread
S writebacktotal = Pread ∗ S read + Pwrite ∗ S write S writethroughtotal = Pread ∗ S read
(4)
Given an architecture with a memory hierarchy of n cache levels, it can be assumed that the total time (Ttotal)
required to compute a stencil is based on the following formula,
Ttotal = TL1 + · · · + TLi + · · · + TLn + TMemory (5)
where TLi and TMemory are the time to access data in Li cache level and main memory respectively. In order to
estimate TLi and TMemory, the number of cache line hits (HitsLi) and cache line misses (MissesLi) must be calculated.
Depending on the level, the way to compute hits and misses differ due to hardware policies. Regarding this aspect,
three memory hierarchy groups have been established to calculate hits and misses: ﬁrst (L1), intermediate (from L2
to Ln) and last (Memory). When the CPU issues a word load instruction, the data is brought from the closer cache
level (L1). If the data is not present, a miss is ﬂagged and passed to the next level of the hierarchy, which will ﬁnally
fetch an entire cache line containing the data. The remaining levels also follow this mechanism.
Recall that a stencil computation, as shown in Algorithm 1, requires several values in the internal loop to compute
a single point: grid points (Xt−1), weights (CZ,X,Y,0), and indices (i, j, k) to access grid points. In the CPU register bank,
grid points and weights are kept in Floating-Point Registers (FPR), while indices use General-Purpose Registers
(GPR). Leaving aside compiler optimizations, grid points must be fetched in every sweep of the loop, whereas
weights and indices might be partially reused. However, note that depending on: the order of the stencil (), the
dimension of the problem (dim) and the available registers (FPRf ree and GPRfree), data reuse becomes burdensome,
leading to register spilling and a higher CPU-L1 trafﬁc.
After deep analysis of the assembly stencil code on our testbed architectures, a couple of assertions can be pointed
out. First, the required  ∗ dim + 1 weights are reused along the sweep if enough FPRf ree registers are available.
2150  Rau´l de la Cruza et al. / Procedia Computer Science 4 (2011) 2146–2155
Second, the compiler keeps one index register for each grid point, whereas those points along Z axis share the same
index ((2 ∗  ∗ (dim − 1)) + 1). Likewise, index registers are also reused whether enough GPRfree resources exist.
As shown, the L1 cost is difﬁcult to predict accurately, and some extra information about how registers are used is
required. In order to obtain such information in the ﬁrst level, we proceed to estimate the register pressure (Reggrid,
Regweight and Regindex) to compute each point of the domain (I × J × K) as,
Reggrid = (2 ∗  ∗ dim) + 1 TwordL1 = word/BwreadL1
Regweight = max(( ∗ dim + 1) − FPRf ree, 0) HitswordL1 = ((Reggrid + Regweight + Regindex) ∗ I ∗ J ∗ K) − MisseswordL1
Regindex = max(((2 ∗  ∗ (dim − 1)) + 1) −GPRf ree, 0) TL1 = HitswordL1 ∗ TwordL1
(6)
where hits (HitswordL1 ) are estimated by the difference between loads issued by CPU ((Reggrid + Regweight + Regindex) ∗
I ∗ J ∗ K) and misses triggered to the ﬁrst hierarchy (MisseswordL1 ). Let word be the data granularity and BwreadL1 the L1
read bandwidth, the L1 cost (TL1) is calculated as the cost of transferring a word from L1 to CPU (TwordL1 ) times the
hits in L1 (HitswordL1 ). Following a similar scheme, the cost for any intermediate cache level can be approximated as,
TcachelineLi = cacheline/Bw
read
Li Hits
cacheline
Li = Misses
cacheline
Li−1 − MissescachelineLi
TLi = HitscachelineLi ∗ TcachelineLi
(7)
where the number of hits (HitscachelineLi ) depends on misses issued in the previous cache level (Li−1) and misses ﬂagged
in the current one (Li). Then, the time cost in level i can be estimated straightforwardly as the time required to transfer
a cache line (data granularity) from level i (TcachelineLi ) times the number of transfers performed (Hits
cacheline
Li ). Finally,
the accessing data cost for the last level (main memory) can be computed with the following equations,
TcachelineMemory = cacheline/Bw
read
Memory Hits
cacheline
Memory = Misses
cacheline
Ln
TMemory = HitscachelineMemory ∗ TcachelineMemory
(8)
Given that data is allocated in the last hierarchy level, any miss issued by previous level (Ln) will generate neces-
sarily a hit in main memory (misses are not possible). In other words, the access cost (TMemory) is directly based on
misses issued by the previous level (MissescachelineLn ) times the bandwidth (T
cacheline
Memory ) to access the data in such level.
2.2. Read Misses
Once the base model has been set up for the memory hierarchy, the different cases to be predicted are explored.
We will treat read misses because write misses have a negligible inﬂuence on the total miss estimation. Our model
covers four cases of miss estimation (see Figure 2), where these cases depend on three distinct types of cache misses
(compulsory, capacity and conﬂict misses) [11]. Ordered from lower to highest penalty, the four cases are:
1. Best possible scenario (lower bound) is obtained when all the required Z-X planes (S writepolicytotal ) for one k iteration
ﬁt entirely in Li cache. Thus, one plane (nplanesLi) per k iteration is fetched. In this case, only compulsory
misses (cold-start misses) are ﬂagged. Cold misses arise due to loads of new data which is not in the hierarchy.
Let II, JJ and KK be the extended dimensions (including ghost points) of I, J and K , and let W be the words
per cache line (W=cacheline/word), the misses of Li are,
MissescachelineLi = II/W ∗ JJ ∗ KK ∗ nplanesLi where nplanesLi = 1 (9)
2. The second scenario occurs when all the required planes do not ﬁt in Li, leading to capacity misses. However,
we assume that k-central plane, with a higher temporal reuse than the rest (1 vs 2 ∗  + 1), is probably not
replaced from cache due to conﬂict misses. In other words, k-central plane, when it ﬁts in cache, might be
partially reused in the following iterations, giving nplanesLi = (Pread − 1) for Equation 9.
3. On the third scenario, we also assume that all the planes (S writepolicytotal ) do not ﬁt in Li, as on the previous one.
Nevertheless in this case, the k-central plane overwhelms a signiﬁcant part of the Li cache (say half of the
capacity), therefore reducing the possibility of temporal reuse. Therefore, nplanesLi = Pread planes must be
read for Equation 9 when traversing the Y axis of the stencil loop.
4. Finally, in the worst scenario (upper bound) neither the planes nor the columns of the k-central plane ﬁt in the
cache. In this case, due to capacity and conﬂict misses, we consider that all the planes must be read at every k
iteration (Pread), but the columns of the k-central plane must be read at each j iteration (Pread − 1) as well. This
scenario gives nplanesLi = (2 ∗ Pread − 1) for Equation 9.
Rau´l de la Cruza et al. / Procedia Computer Science 4 (2011) 2146–2155 2151



	










	






	







	















	



	




	







Figure 2: Cases considered for read misses during k iteration. Each plane color represents the misses generated when
accessing the plane. A light color means none or few misses and a dark color implies a high ratio of misses.
2.3. Prefetching
Hardware prefetching is a common feature in modern architectures. It enables a reduction of cache misses, thus
increasing the hit ratio and minimizing latency. Hardware prefetching, also known as streaming in the literature,
consists of prefetching data from main memory into cache before it is explicitly requested. As processors have
become faster, and memory latencies have risen, the signiﬁcance of prefetching has increased. Besides, most modern
architectures support hardware prefetching of streams in different levels of the hierarchy simultaneously. Hence, it
is necessary to include prefetching so that predict stencil behaviour in current architectures. Dick [12] proposes a
simple prefetching model. Unfortunately, this model computes cost in cycles per memory access, and therefore it is
not compatible with our model, which is based on hit and miss metrics.
The prefetching modeling is complex, in particular to ﬁgure out if a prefetched stream ﬂags a miss. We devise a
simple approach, misses of the model are divided into two groups: prefetched and non-prefetched. First, we calculate
the X-Y planes (nplanesS treamLi ) that can be prefetched and not prefetched (nplanes
Non−stream
Li ) in the cache level i,
nplanesNon-S treamLi = max(nplanesLi − pre fLi, 0) nplanesS treamLi = nplanesLi − nplanesNon-streamLi
T cachelineLi = cacheline/Bw
read
Li T -S tream
cacheline
Li = cacheline/Bw-S tream
read
Li
MissescachelineLi = II/W ∗ JJ ∗ KK ∗ nplanesNon-streamLi Misses-S treamcachelineLi = II/W ∗ JJ ∗ KK ∗ nplanesS treamLi
(10)
HitscachelineLi = Misses
cacheline
Li−1 − MissescachelineLi
Hits-S treamcachelineLi = Misses-S tream
cacheline
Li−1 − Misses-S treamcachelineLi TLi = HitscachelineLi ∗ TcachelineLi + Hits-S treamcachelineLi ∗ TcachelineLi
(11)
where pre fLi refers to the number of stream channels supported by the current hierarchy level i. Second, prefetched
and non-prefetched cache line misses within planes are computed. Finally, hits are obtained for the next level (i +
1) using their transfer cost, TcachelineLi for non-prefetched hits and T -S tream
cacheline
Li for prefetched hits. The way to
compute the speciﬁc bandwidths for streamed and non-streamed planes is shown in Subsection 3.1.3.
3. Experiments and Model Validation
In this section the experimental results are presented to validate the model. To that end the predicted execution
times (or cache misses) are compared with real execution results (or HW/SW counted cache misses), and their relative
error are estimated. In order to achieve that, we have followed a methodology that is introduced in this section.
3.1. Methodology
Basically, our experimental methodology consists in the following ﬁve stages:
Stage 1: Some representatives test cases are selected within the experimental space.
Stage 2: Test cases selected in Stage 1 are executed and the results are stored for further use.
Stage 3: Information regarding platform conﬁguration is gathered, for instance cacheline size, prefetching options or
streamed and non-streamed bandwidth. Some of them, for instance bandwidth, is obtained executing bench-
marking programs (see Section 3.1.3). The rest of the information is obtained reviewing the manufacturer’s
datasheets. Afterwards, the consolidated information is added to the model through conﬁguration ﬁles.
Stage 4: Predicted performance is obtained by using the model for test cases selected in Stage 1.
Stage 5: Finally, real execution (Stage 2) and model generated (Stage 4) performance data are compared, obtaining
relative errors.
2152  Rau´l de la Cruza et al. / Procedia Computer Science 4 (2011) 2146–2155
The methodology has a feedback mechanism, where model parameters can be adjusted (Stage 3) and experiments
repeated if the relative error (computed in Stage 5) is out of acceptable correctness range. In general terms, once
the model parameters are compiled from platform characteristics, the model should return accurate results. However,
it may occur that some characteristics are not clearly speciﬁed for the platform (e.g. streaming channels through
memory hierarchy), then a process of reﬁnement must be performed. This platform parameters reﬁnement process
follows Stages 3-4-5 as many times as it is needed.
In the following subsections we describe the main features of each methodology stage.
3.1.1. Stage 1: Test-cases Parameters
The test cases space is a combination of: stencil size, platform and dataset dimension. The stencil sizes of the
experiments are: 13 (academic), 25 (widely used), 43 and 85 (high order) points. We will carry out experiments for
three platforms. For the sake of simplicity we have collapsed the 3D dataset dimension to one parameter, since the
datasets are cubic. The dataset size ranges from 128 to 512 points cubic, with a 16 points stride. In total, we have
288 test case results to analyze, but before proceeding to that, the following two subsections describe the testbed
conﬁguration and the experimental setup used.
3.1.2. Stage 2: Real Execution Performance
Two sets of real data have been collected for the validation step: the real execution times and the software/hardware
counters (cache misses), when they are available by the platform. The results are presented combined with model
results in Subsection 3.1.4. In order to gather the results, in particular those related to cache misses, the following
tools (or frameworks) have been deployed:
PapiEx (PAPI): PapiEx is a performance analysis tool designed to measure transparently and passively the hardware
performance counters of an application using PAPI framework.
hpmcount/libhpm (HPCT/HPM): hpmcount and libhpm provide hardware metric and resource utilization statistics
after application execution. They are developed by IBM to support Power-based systems.
Valgrind: Valgrind is an instrumentation framework for building dynamic analysis tools. It contains a set of tools to
perform debugging and proﬁling.
The ﬁrst two items of the above list provide hardware counter information. The information gathered is used as the
reference data in validation. The last item is a cache simulator, it roughly captures the platform behavior by software
counters, notice that prefetching is not considered by that tool. Although that tool do not capture streaming features,
it is useful in terms of analysis to proﬁle memory instructions issued by CPU. Therefore, the Valgrind cache simulator
allows to partially validate the non-streaming version of our model.
3.1.3. Stage 3: Model Parameters and Architectural Characterization
In this section, ﬁrst, the platforms used to carry out the experiments are described. Finally, the tools employed to
obtain the platform proﬁle information are introduced. We have used the following platforms available in the PRACE 2
project (detailed platform characteristics are presented in Table 1):
AMD Opteron: Louhi is a Cray XT4/XT5 supercomputer with 1012 XT4 and 672/180 XT5 compute nodes, hosted
at the IT Center for Science in Finland.
Intel Nehalem: Inti supercomputer, hosted at the Atomic Energy Commission, sports 256 Nehalem processors.
IBM Blue Gene/P: Jugene supercomputer, hosted at the Julich Research Centre, is based on the Blue Gene/P pro-
cessor with 294912 PowerPC 450 cores.
In order to obtain memory bandwidth measures on the testbed architectures, we have used STREAM2 3, which
aims to extend the functionality of the STREAM [8] benchmark in two aspects. First, it measures sustained band-
width, providing accurate information across the memory hierarchy. Second, STREAM2 exposes more clearly the
2Partnership for Advanced Computing in Europe - http://www.prace-project.eu/
3http://www.cs.virginia.edu/stream/stream2
Rau´l de la Cruza et al. / Procedia Computer Science 4 (2011) 2146–2155 2153
System name Inti Jugene Louhi
Architecture Intel Xeon Core i7 IBM BlueGene/P AMD Opteron
X5570 (Nehalem-EP) PowerPC450 2360/8360 (Barcelona)
Chips × Cores 2 × 4 1 × 4 1 × 4
Clock 2.93 GHz 850 MHz 2.3 GHz
SP GFlops† 93.76 (SSE2) 13.6 73.6 (SSE2)
DP GFlops† 46.88 13.6 36.8
L1 Cache (D+I) 32 kB + 32 kB 32 kB + 32 kB 64 kB + 64 kB
L2 Cache 256 kB per core 1920 B per core 512 kB per core
L3 Cache 8 MB (shared) 2 x 4 MB (shared) 2048 kB (shared)
Main memory 24 GB 2 GB 8 GB
Bandwidth 32 GB/s 13.6 GB/s 42.7 GB/s
Watts x hour 95 39 75
Compiler Intel Compiler IBM XL Compiler Portland/GCC Compiler
(v11.1) (v9.0/v11.01) (v10.2/v4.1.2)
Table 1: Architectural conﬁguration of the platforms used in our tests. †Only one core is considered.
performance differences between reads and writes. It is based on the same ideas as STREAM, but a different set
of vector kernels are used: FILL (8 bytes/write), COPY (16 bytes/read&write), DAXPY (24 bytes/read&write) and
DOT (16 bytes/read). Recall that our model requires isolated read and write bandwidths for all cache hierarchy levels.
Thus, we have deployed DOT and FILL kernels. Furthermore, to obtain the necessary non-streamed bandwidths for
our model, we have extended STREAM2 to provide such information. DOT and FILL kernels have been modiﬁed
in such a way that the stride parameter (o f f set) is carefully set for consecutive reads and writes. Also, most modern
architectures support next-stride cacheline prefetching. Thus, we have used a combination of two techniques in DOT
and FILL kernels to avoid triggering the prefetching engine. The stride parameter has been redeﬁned as,
o f f set = stride + MOD(I, epsilon) (12)
where stride is a constant large enough to avoid next cacheline prefetching mechanism. Besides, the second term
MOD(I, epsilon), which is a cyclic variable that depends on the I iteration of the loop and epsilon constant, prevents
stride cacheline prefetching engine to be triggered.
Table 2 shows detailed streamed and non-streamed bandwidths measures for the testbed platforms. The different
cache level bandwidths can be deduced from analyzing STREAM2 output (see Figure 3) and focusing in the areas
where bandwidth drops substantially (due to cache level overﬂow) when vector size increases (X axis). Every step in
Figure 3 represents the sustained bandwidth for each memory level in the hierarchy.
Parameters Intel IBM AMDNehalem BlueGene/P Opteron
Register
availability†
GPRfree 10 26 10
FPRf ree 12 26 12
Block size cacheline 64 bytes 32/128 bytes 64 bytes
Prefetching
channels
pre fL1 2 0 2
pre fL2 2 7 0
pre fL3 0 0 0
Cache
capacity
sizeL1 32 kB 32 kB 64 kB
sizeL2 256 kB 1920 bytes 512 kB
sizeL3∗ 8 MB 8 MB 2 MB
Measured
Bandwidth‡
(GB/s)
BwreadL1 49.4 (23.4) 6.2 (3.2) 29.1 (14.6)
BwreadL2 29.4 (12.5) 2.1 (1.3) 13.9 (4.1)
BwreadL3 21.1 (8.1) 2.0 (0.6) 7.6 (2.2)
BwreadMemory 8.2 (3.5) 2.0 (0.6) 3.3 (1.3)
BwwriteL1 49.4 (24.6) 3.3 (3.3) 29.9 (8.6)
BwwriteMemory 7.9 (3.9) 3.3 (1.9) 4.9 (0.8)
Table 2: Parameters used in our model to predict stencil
executions on each platform. †The available registers for
each architecture have been estimated doing assembly reg-
ister analysis of a naive case. ‡Bandwidth measured using
our STREAM2 version, which is able to capture streaming
and non-streaming bandwidths (shown in parenthesis). ∗L3
cache is shared among the cores.
 0
 10
 20
 30
 40
 50
 1000  10000  100000  1e+06
G
B/
s
Size in words
Intel Nehalem (Inti)
Read Stream
Write Stream
Read Non-stream
Write Non-stream
Figure 3: STREAM2 results of stream and non-stream
bandwidths for read and write operations for the hierar-
chy levels in Inti platform (Intel Xeon). Several compiler
ﬂags were tested to obtain the maximum bandwidth per-
formance.
2154  Rau´l de la Cruza et al. / Procedia Computer Science 4 (2011) 2146–2155
Going through the results, we can observe that almost all the platforms show an impressive beneﬁt when using
prefetching engines. As the manufacturer’s datasheet indicates [13], the IBM architecture implement write-through
policy (no write allocation) in their L1 data caches. This matches the STREAM2 results, as it should be expected the
performance remains constant between L1-L3 caches.
Finally, STREAM2 gives a better performance for write than read operations in Intel Nehalem architecture. This
reveals a problem with the binary code generated by the compiler for STREAM2. As Molka [14] reports, the write
and read operations should be similar, therefore same read and write bandwidths were used in these cases.
3.1.4. Stage 4 & 5: Model Performance and Validation
In order to visualize and analyze the amount of data at hand, ﬁrstly we review the results for an interesting case,
secondly we summarize the platforms results in Table 3. In Figure 4.1, we can observe a clear pattern, the predicted
curve of cache misses separates from real and simulated curves, the latter two following a similar path. Even so, the
relative error respect to the predicted results stays between 10 to 20% (see Figure 4.2).
Figure 4.3 depicts a different scenario, the real execution time and the model curve follow a closer path, where
relative error remains under 12%. The reason for this behavior is that although the model computes execution time
based on cache misses and latencies, it captures correctly the effect of the streaming mechanism giving a different
penalty cost to prefeteched and non-prefeteched cache misses (see Subsection 2.3). Therefore, the execution time
prediction is more precise than the cache misses prediction. Also, the stencil size for those experiments is 85-point,
which implies that the time prediction is only slightly perturbed by the prefetching mechanism.
 0
 1e+08
 2e+08
 3e+08
 4e+08
 5e+08
 6e+08
 100  150  200  250  300  350  400  450  500  550
Ca
ch
e 
m
iss
es
s 
[s]
Dataset sizes [points]
predicted L2 cache misses
HW counted L2 cache misses
Valgrind simulated L2 cache misses
 0
 5
 10
 15
 20
 25
 30
 35
 40
 45
 50
 100  150  200  250  300  350  400  450  500  550
Er
ro
r [%
]
Dataset sizes [points]
relative error (predicted vs HW counted) L2 cache misses
relative error (predicted vs simulated) L2 cache misses
 0
 5
 10
 15
 20
 25
 30
 35
 40
 45
 50
 100  150  200  250  300  350  400  450  500  550
 0
 3
 6
 9
 12
 15
Ti
m
e 
[s]
Er
ro
r [%
]
Dataset sizes [points]
predicted execution time
measured execution time
relative error
Figure 4: SW/HW counters and model results (the top two graphs shows misses, the bottom graph represents execu-
tions time) for Lohui (AMD) system, with 85-point stencil. In terms of L2 cache misses, the relative error (the middle
graph) stays under 15% for the most relevant part of the graph. This is important since realistic 3D datasets in industry
and large scientiﬁc simulations are usually bigger than the datasets covered in this work.
Further works will include an extended robustness analysis. In particular, we will break down the factors contri-
bution of the modeled performance for the carried out experiments.
Rau´l de la Cruza et al. / Procedia Computer Science 4 (2011) 2146–2155 2155
Platforms AMD Opteron IBM BlueGene/P Intel Nehalem
Stencil sizes (points) 13 25 43 85 13 25 43 85 13 25 43 85
Max 37.2 37.0 18.2 13.4 17.2 24.5 27.9 16.8 46.2 28.9 38.2 20.0
Min 3.5 1.8 1.8 0.5 2.4 0.6 8.7 1.0 0.4 0.6 0.6 2.4
Average 13.0 11.6 8.5 5.8 11.1 17.1 21.1 10.8 9.9 11.9 22.1 12.5
Standard deviation 16.4 18.7 3.7 2.7 6.8 7.5 10.8 10.1 0.5 7.1 4.9 0.5
Table 3: Relative errors (%) between predicted and measured execution times. The model is highly accurate for AMD
architecture, where the average relative error is 9.7% for all stencil sizes. In terms of relative error standard deviation,
the model is most effective for Intel architecture with an average of 3.5%. Overall, considering both accuracy and
stability of the results, the model results for AMD outperform the ones for Intel followed by IBM closely.
4. Conclusions and Future Work
In general terms, the proposed model captures the performance behavior of the stencil computation. The accuracy
of the prediction ranges regarding the platform. In x86 architectures case, on which we have spent most of our research
time, a high level of prediction accuracy is obtained. Furthermore, the average relative error in execution time is 13%
for all stencil sizes. These results show that we are in the right path to achieve a dependable multi-level cache model,
which at the same time remains simple, ﬂexible and extensible.
Our future work can be summarized in the following three lines. First, improve the model for +90% accu-
racy, enhancing mainly the prefetching equations. Second, add support for multi-core architectures to the model.
Third, extend the model to capture stencil optimizations performance, particularly spatial/temporal blocking, soft-
ware prefetching and semi-stencil algorithm. The latter line of research is the one that motivates us the most, because
those techniques contribute to achieving highly optimized stencil-based applications.
The ﬁnal goal is not only to have an accurate and ﬂexible model, but also to have the chance of freely changing
platform parameters. The latter will be helpful to forecast performance for current platform modiﬁcations as well
as for future platforms. Finally, the model is integrable with auto-tuning techniques, thus providing a rich synergy
towards efﬁcient stencil codes implementation.
References
[1] A. Brandenburg, Computational aspects of astrophysical MHD and turbulence, Vol. 9, CRC, 2003.
[2] C. D. Groot-Hedlin, A ﬁnite difference solution to the helmholtz equation in a radially symmetric waveguide: Application to near-source
scattering in ocean acoustics, Journal of Computational Acoustics 16 (2008) 447–464.
[3] M. Araya-Polo, F. Rubio, M. Hanzich, R. de la Cruz, J. M. Cela, D. P. Scarpazza, 3D seismic imaging through reverse-time migration on
homogeneous and heterogeneous multi-core processors, Scientiﬁc Programming: Special Issue on the Cell Processor 16.
[4] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick, Stencil computation optimization and
auto-tuning on state-of-the-art multicore architectures, in: SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, IEEE
Press, Piscataway, NJ, USA, 2008, pp. 1–12.
[5] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick, Optimization and performance modeling of stencil computations on modern
microprocessors, SIAM Rev. 51 (1) (2009) 129–159.
[6] R. de la Cruz, M. Araya-Polo, J. M. Cela, Introducing the semi-stencil algorithm, 8th International Conference on Parallel Processing and
Applied Mathematics.
[7] S. Kamil, P. Husbands, L. Oliker, J. Shalf, K. Yelick, Impact of modern memory subsystems on cache optimizations for stencil computations,
in: MSP ’05: Proceedings of the 2005 workshop on Memory system performance, ACM Press, New York, NY, USA, 2005, pp. 36–43.
[8] J. D. McCalpin, Stream: Sustainable memory bandwidth in high performance computers, Tech. rep., University of Virginia, Charlottesville,
Virginia, a continually updated technical report. http://www.cs.virginia.edu/stream/ (1991-2007).
[9] G. Rivera, C. W. Tseng, Tiling optimizations for 3D scientiﬁc computations, in: Proc. ACM/IEEE SC 2000, 2000, p. 32.
[10] L. Peng, R. Seymour, K. ichi Nomura, R. K. Kalia, A. Nakano, P. Vashishta, A. Loddoch, M. Netzband, W. R. Volz, C. C. Wong, High-order
stencil computations on multicore clusters, in: IPDPS ’09: Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed
Processing, IEEE Computer Society, Washington, DC, USA, 2009, pp. 1–11.
[11] O. Temam, C. Fricker, W. Jalby, Cache interference phenomena, in: SIGMETRICS ’94: Proceedings of the 1994 ACM SIGMETRICS
conference on Measurement and modeling of computer systems, ACM, New York, NY, USA, 1994, pp. 261–271.
[12] K. Dick, A mathematical model of hardware prefetching (2007).
[13] I. J. of Research, Development, Overview of the ibm blue gene/p project, IBM J. Res. Dev. 52 (1/2) (2008) 199–220.
[14] D. Molka, D. Hackenberg, R. Schone, M. S. Muller, Memory performance and cache coherency effects on an intel nehalem multiprocessor
system, in: PACT ’09, IEEE Computer Society, Washington, DC, USA, 2009, pp. 261–270.
