A New Locality Metric and Case Studies for HPCS Benchmarks by Yu, Jing et al.
A NEW LOCALITY METRIC AND CASE STUDIES FOR HPCS BENCHMARKS
BY
Jing Yu, Sara Baghsorkhi and Marc Snir
Department of Computer Science
University of Illinois Urbana Champaign
201 N. Goodwin Avenue
Urbana, IL 61801-2302
{jingyu,bsadeghi, snir}@uiuc.edu
Technical Report UIUC DCS-R-2005-2564
Engr. No. UILU-ENG-2005-1758
Department of Computer Science, UIUC
April, 2005
A New Locality Metric and Case Studies for HPCS Benchmarks ∗
Jing Yu, Sara Baghsorkhi, and Marc Snir
Computer Science Department
University of Illinois at Urbana-Champaign
Thomas M. Siebel Center for Computer Science
201 N. Goodwin, Urbana, IL 61801-2302, USA
{jingyu,bsadeghi,snir}@uiuc.edu
Abstract
We propose in this paper a new approach to studying the temporal and spatial locality of codes, using a
plot of cache miss bandwidth as a function of cache size and line size for a fully associative LRU cache.
We apply this new approach to the study of locality for several High-Performance benchmarks. We
show that this plot captures fine behavior of these benchmarks and explains some of the difficulties that
recent attempts to characterize locality using a few parameters are facing: Codes can exhibit different
levels of temporal or spatial locality for different cache sizes; averaging these different behaviors requires
properly weighting the cost of misses at different levels of the memory hierarchy. We propose such a
scheme for an average measure of temporal locality.
1 Introduction
The performance of modern microprocessors is increasingly constrained by the performance of the
memory subsystem; high cache hit rates are essential to performance. Codes achieve good cache hit rates
if they have good locality: temporal locality, meaning that if a datum is accessed then it is likely to be
accessed again soon after; and spatial locality, meaning that if a datum is accessed then nearby
locations are likely to be accessed soon after (e.g., because memory locations are accessed in sequential
order). Therefore, it is important to understand the amount of temporal and spatial locality available in
the memory accesses of important codes.
A good measure of the temporal locality in a sequence of accesses is provided by the reuse distance
metric (also called stack distance): Given a sequence of accesses to locations A = a1, a2, . . . , an,
ReuseDist(A, i), the reuse distance for access i, is the number of distinct locations accessed since
the last access to ai; ReuseDist(A, i) = ∞ if location ai is first accessed at step i. Thus, if A =
7, 7, 3, 4, 3, 7, 3, 3, 4 then ReuseDist(A, .) = ∞, 0, ∞, ∞, 1, 2, 1, 0, 2. A reuse distance computation
simulates the operation of a fully associative cache [14]. Thus, if each access is to a word, and the
cache has word-sized lines and is fully associative and LRU, then the i-th access is a hit if and only
if ReuseDist(A, i) < m, where m is the cache size. A histogram of the function ReuseDist(A, .)
indicates cache hit rates for caches of arbitrary size.
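The definition above can be made concrete with a short sketch. This is a minimal illustration in Python, not the measurement tool used for the experiments; the function names `reuse_distances` and `hits` are hypothetical, and the list-based LRU stack is chosen for clarity rather than speed.

```python
import math

def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    The reuse distance of an access is the number of distinct locations
    touched since the previous access to the same location, or math.inf
    if the location has not been accessed before.
    """
    stack = []        # LRU stack: the most recently used location is last
    distances = []
    for loc in trace:
        if loc in stack:
            pos = stack.index(loc)
            # Every entry above `loc` was touched since its last use.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(math.inf)
        stack.append(loc)
    return distances

def hits(trace, m):
    """Hit/miss outcome for a fully associative LRU cache of m word-sized lines."""
    return [d < m for d in reuse_distances(trace)]

example = [7, 7, 3, 4, 3, 7, 3, 3, 4]
print(reuse_distances(example))  # [inf, 0, inf, inf, 1, 2, 1, 0, 2]
print(hits(example, 2))
```

The printed distances reproduce the worked example in the text; with m = 2 only the accesses at distance 0 or 1 are hits.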
Many empirical studies have shown that set-associative caches with high associativity are a good
approximation to fully associative caches for real codes, assuming that pathological array sizes are avoided.
Thus, reuse distance is indicative of cache hit rates for real caches, once cache line size is taken into
account.
∗This work is supported by DARPA Contract NBCHC-02-0056 and NBCH30390004.
Reuse distance can be computed by tracing memory references [17]. (The result depends both on the
source code executed and the compiler used.) It is intrinsically a good metric for the temporal locality of a
program [7]. Recent work has used reuse distance histograms to predict cache performance for a
given cache line size [10, 11, 13].
It is not clear how one can characterize temporal locality by less than a full histogram: there are
theoretical reasons why such a compression is not possible, unless one makes some assumptions about
the target systems [16]. A characterization that uses one parameter only becomes possible with such
assumptions.
It is even less clear how to characterize spatial locality in a useful manner. Grimsrud et al. [12]
introduced a way to unite data reuse and its neighbors' reuse information by counting the occurrence
frequency of each distance (number of memory accesses between repeated references) and stride (offset
to the nearby location) pair for a given trace. However, this occurrence frequency has no direct relationship
with cache behavior; e.g., it cannot show how much benefit a longer cache line will bring.
The main contributions of this paper are as follows.
• We propose a new metric (Miss Bandwidth) to measure both temporal locality and spatial locality.
This measure is defined and discussed in Section 2.
• We apply the new metric Miss Bandwidth to seven HPCS benchmarks and characterize their
locality. We find that our metric is able to reveal fine details of the benchmarks, such as the effect on
locality of the use of auxiliary arrays and of calls to intrinsic functions. We explain the locality
properties of these codes by examining their source code. The analysis also shows that a single
characterization of a code as having good or bad temporal or spatial locality is misleading: a code
can have good temporal or spatial locality for small caches but bad temporal or spatial locality for
large caches, and vice versa. This analysis is presented in Section 3.
• We propose, in Section 4, a scheme to characterize the temporal locality by a single score, based
on accepted performance models for caches.
In addition, Section 5 discusses the measurement of reuse distance, while Section 6 discusses the
results and future work.
2 The Miss Bandwidth Locality Metric
2.1 Definitions
Let A = a1, a2, . . . , an be a sequence of addresses. We generalize the definition of reuse distance to
take into account the cache line size ℓ: ReuseDist(A, ℓ, i) = ReuseDist(A|ℓ, i), where A|ℓ is defined to
be the sequence a1|ℓ, a2|ℓ, . . . , an|ℓ; i.e., we use the addresses of the cache line frames, rather than the
addresses of the locations accessed. We define the Reuse Distance Distribution to be the histogram of the
reuse distance function:
$$\mathrm{ReuseDD}(A, \ell, d) = \frac{|\{i : \mathrm{ReuseDist}(A, \ell, i) = d\}|}{n}.$$
We define the Miss Rate function as
$$\mathrm{MissRate}(A, \ell, m) = \sum_{d \ge m/\ell} \mathrm{ReuseDD}(A, \ell, d).$$
MissRate(A, ℓ, m) equals the miss rate of a fully associative LRU cache of size m with cache lines of
size ℓ. Finally, we define the Miss Bandwidth function as
$$\mathrm{MissBw}(A, \ell, m) = \mathrm{MissRate}(A, \ell, m) \times \ell.$$
MissBw(A, ℓ, m) equals the amount of memory traffic generated per access by misses of a fully
associative LRU cache of size m with cache lines of size ℓ.
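As an illustration of these definitions, the following Python sketch computes the miss rate and miss bandwidth directly from reuse distances over cache line frames. It assumes byte addresses and takes a|ℓ to be the integer quotient of the address by the line size; the function names and the small test trace are hypothetical.

```python
import math

def reuse_distances(trace):
    """Reuse distance of each access; math.inf marks a first access."""
    stack, distances = [], []      # stack holds locations, most recent last
    for loc in trace:
        if loc in stack:
            pos = stack.index(loc)
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(math.inf)
        stack.append(loc)
    return distances

def miss_rate(addresses, line_size, cache_size):
    """Fraction of accesses whose reuse distance is at least cache_size / line_size."""
    frames = [a // line_size for a in addresses]   # the frame sequence A|l
    threshold = cache_size / line_size             # number of lines in the cache
    misses = sum(1 for d in reuse_distances(frames) if d >= threshold)
    return misses / len(addresses)

def miss_bw(addresses, line_size, cache_size):
    """Miss rate times line size: bytes fetched per access."""
    return miss_rate(addresses, line_size, cache_size) * line_size

# Hypothetical trace: 16 consecutive 8-byte words, swept twice.
sweep = [8 * i for i in range(16)] * 2
for line_size in (8, 16, 32):
    print(line_size, miss_rate(sweep, line_size, 64), miss_bw(sweep, line_size, 64))
```

For this sequential sweep and a 64-byte cache, the miss rate halves each time the line size doubles, so the miss bandwidth stays at 8 bytes per access for all three line sizes.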
2.2 Study of Locality Using Miss Bandwidth
Let A be the sequence of memory accesses executed when an application is run for a particular problem
size and input. A plot of MissBw(A, ℓ, m), for a fixed cache line length ℓ, as a function of cache size m,
indicates the level of temporal locality available in the application: If the application has good temporal
locality, then the miss bandwidth decreases rapidly as a function of cache size; if it has bad temporal
locality, then the miss bandwidth stays relatively flat.
Figure 1: 3D Spatial-temporal locality plot
Spatial locality can be studied by plotting the miss bandwidth as a function of cache line length as
well. It is easy to see that MissRate(A, ℓ, mℓ) ≤ MissRate(A, m), where MissRate(A, m) denotes the
miss rate with lines of length 1; a cache with m lines of length ℓ will suffer no more misses than a
cache with m lines of length 1 [16]. In the worst case, where exactly one word is used within each line,
equality holds. We say that the access sequence A has no spatial locality if
MissRate(A, ℓ, mℓ) = MissRate(A, m). In such a case, increasing the cache line size by a
factor of ℓ, while keeping the cache size fixed, holds no benefit: the effective cache size is reduced, the
number of cache misses does not decrease, and the miss bandwidth increases by at least a factor of ℓ, i.e.,
MissBw(A, ℓ, m) ≥ ℓ MissBw(A, m).
In general, one expects that increasing the cache line length to ℓ while keeping the cache size fixed will
decrease the number of cache misses, but will not decrease the miss bandwidth. This is not universally
true; one can build pathological access sequences where the use of longer cache lines reduces the miss
bandwidth. However, such “pathological” behavior is rarely observed in practice. (This “pathological”
behavior is due to the use of LRU replacement; it is easy to see that if an optimal replacement policy is
used then increasing the cache line length will not decrease the miss bandwidth; see [16].) We say that
the access sequence A has perfect spatial locality if MissBw(A, ℓ, m) ≤ MissBw(A, m), i.e., if miss
bandwidth does not increase with longer cache lines.
Most applications fall in between these two extremes. An application has good spatial locality if it is
closer to having perfect spatial locality and has bad spatial locality if it is closer to having no spatial locality.
Note that an application may exhibit different spatial locality for different values of m. For example,
an application may have good spatial locality for small values of m – having a small number of heavily
reused variables that are accessed with good spatial locality, but bad spatial locality for large values of m,
because the remaining variables are accessed in random order, with no spatial locality; the RandomAccess
benchmark studied in Section 3 exhibits such a behavior. More typically one may have bad spatial locality
for small m, but good spatial locality for large m: e.g., when a small number of temporary variables are
accessed randomly but large arrays are accessed sequentially.
The spatial-temporal locality of a sequence of accesses can be characterized by a 3D plot of the miss
bandwidth as a function of cache size and line size, as shown in Figure 1. We call this the spatial-temporal
locality plot. To simplify experiments and analysis, we shall only consider discrete line sizes of 8B, 16B,
32B, 64B and 128B, and plot the five cuts through the spatial-temporal locality plot at ℓ = 8, 16, 32, 64 and
128. This choice of cache line lengths includes the lengths that are used in most current
microprocessors. We also call this set of five 2D plots the spatial-temporal locality plot.
[Figure 2 plots: (a) miss bandwidth / 8 (miss rate × line size / 8) and (b) MissBw ratio, both versus
cache size (bytes, 0 to 2 × 10^4), with one curve per line size: 8B, 16B, 32B, 64B, 128B.]
(a) set of 5 2D spatial-temporal locality plots
(b) set of 5 2D MissBwRatio plots
Figure 2: Data Locality Characteristics for STREAM with working size 2000000 of double precision.
In some cases it is more convenient to plot miss bandwidth relative to a cache with word length (8B)
lines, rather than plotting the absolute value. We define MissBwRatio to be this ratio.
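The ratio can be sketched in a few lines of Python. The sketch below simulates a fully associative LRU cache directly (using `collections.OrderedDict`) instead of going through reuse distances; the function names and the two synthetic traces are hypothetical and serve only to show the two extremes of spatial locality.

```python
from collections import OrderedDict

def miss_bw(addresses, line_size, cache_size):
    """Bytes fetched per access by a fully associative LRU cache."""
    capacity = cache_size // line_size   # number of lines the cache can hold
    cache = OrderedDict()                # keys are line frames, oldest first
    misses = 0
    for a in addresses:
        frame = a // line_size
        if frame in cache:
            cache.move_to_end(frame)     # refresh the LRU position on a hit
        else:
            misses += 1
            cache[frame] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return misses / len(addresses) * line_size

def miss_bw_ratio(addresses, line_size, cache_size, word=8):
    """Miss bandwidth relative to a cache with word-sized (8B) lines."""
    return miss_bw(addresses, line_size, cache_size) / miss_bw(addresses, word, cache_size)

sequential = [8 * i for i in range(1024)]   # consecutive 8-byte words
strided = [128 * i for i in range(1024)]    # one word used per 128 bytes
print(miss_bw_ratio(sequential, 64, 1024))  # 1.0: perfect spatial locality
print(miss_bw_ratio(strided, 64, 1024))     # 8.0: no spatial locality
```

A ratio of 1 corresponds to perfect spatial locality, while a ratio equal to the line size divided by the word size corresponds to no spatial locality.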
3 Case Studies: Data Locality of HPCS Benchmarks
In the following subsections, we shall concentrate on the locality characterization of HPCS benchmarks
based on the spatial-temporal locality plot. We shall examine the source code of these benchmarks to
explain the characteristics of the plot. Our case studies include five individual benchmarks from the HPC
Challenge Benchmark suite [1, 2] and two benchmarks from NPB3.2-SER [4]. The simulation
environment and working size issues are addressed in Section 5.
3.1 STREAM - perfect spatial locality
Let’s first look at the spatial-temporal locality plot in Figure 2(a). Each curve is flat, which means that
irrespective of cache size, the cache miss rate remains the same at 0.5; the code has poor temporal locality.
Furthermore, Figure 2(b) shows that the miss bandwidth ratio is one: the miss rate is inversely propor-
tional to the cache line size, so that the bandwidth does not increase with longer cache lines; the code has
perfect spatial locality.
The FORTRAN code for the STREAM core function is listed below. A, B and C are double-precision arrays
of length N, which is 2000000 in our experiment. The large arrays are traversed sequentially, and each
array element is touched once in each loop. This explains the good spatial locality and the bad temporal
locality.
      DO 30 J=1,N
         C(J) = A(J)
   30 CONTINUE
      DO 40 J=1,N
         B(J) = SCALAR * C(J)
   40 CONTINUE
      DO 50 J=1,N
         C(J) = A(J) + B(J)
   50 CONTINUE
3.2 RandomAccess
The RandomAccess benchmark accesses array elements (64 bits each) in random order. As the array
is very large, 2MB in our experiment, it is unlikely that the same element is touched twice in close
succession. Also, accessing array element A(i) does not imply that A(i + 1) or A(i − 1) is touched in
the near future. Thus, we expect to see poor spatial locality and poor temporal locality.
[Figure 3 plots: (a) miss bandwidth / 8 (miss rate × line size / 8) and (b) MissBw ratio, both versus
cache size (bytes, 0 to 2 × 10^4); (c) frequency versus stack distance (10^0 to 10^4), with peaks marked
at (256, 0.035), (191, 0.016) and (158, 0.007). One curve per line size: 8B, 16B, 32B, 64B, 128B.]
(a) set of 5 2D spatial-temporal locality plots
(b) set of 5 2D MissBwRatio plots
(c) Reuse distance frequency distribution
Figure 3: Data Locality Characteristics for RandomAccess with working size 2MB.
In Figure 3(a), the 8B curve is relatively flat except for a sharp 30% drop around 2KB. The other curves
also show a more or less sudden decrease (0.0% ∼ 17%) somewhere between 2KB and 10KB. To explain this
drop, we need to investigate the source code more closely.
for (i=0; i<NUPDATE/128; i++) {
  for (j=0; j<128; j++) {
    ran[j] = (ran[j]<<1) ^ ((s64Int)ran[j]<0 ? POLY : 0);
    Table[ran[j] & (TableSize-1)] ^= ran[j];
  }
}
The above code segment is the core part of the RandomAccess benchmark. Table is the working array of
64-bit elements, in which a total of NUPDATE elements are randomly picked for read and write.
An auxiliary array ran saves the latest 128 random numbers. The array ran is accessed sequentially.
Although the principal working array Table is accessed randomly, the accesses to the auxiliary array ran
have good temporal and spatial locality. This improves locality metrics for small caches.
To make clear how the auxiliary array affects the entire reuse distance distribution, we show in Fig-
ure 3(c) the frequency of each reuse distance. With 8B lines, reuse distance peaks at 256 (3.5% of
accesses); with 16B lines it peaks at 191 (1.6% of accesses); and with 32B lines it peaks at 158 (0.7% of
accesses). An examination of the source code will indicate the reason for these local peaks.
Checking the source code, we can see that when the line size is 8B (one line contains only one array
element) the line containing ran[j] will be re-loaded after all the other 127 elements of the array are loaded, and
after 128 distinct entries from Table are loaded, assuming no collisions. Thus, most accesses to elements
of the array ran have reuse distance 255. We actually observe reuse distance 256, the difference being
due to an additional access to a temporary variable that occurs in the assembly code. Similarly, when the line
size is 16B, the line containing ran[j] will be re-accessed after 128/2 − 1 = 63 lines are loaded and
128 distinct lines containing entries from the array Table are loaded (assuming no collisions). Thus,
we expect reuse distance 63 + 128 = 191 to occur frequently. For the same reason, when the line size
is 32B, we expect frequent occurrence of reuse distance 31 + 128 = 159, etc. Notice that the total number of
lines containing entries from the array ran decreases in inverse proportion to the cache line size, so that the
frequency of reuse distances due to accesses to ran[] decreases in inverse proportion to the cache line size;
this matches the observed frequencies well.
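The arithmetic behind these expected peaks can be written out as a small Python sketch. It encodes only the counting argument given above (lines spanned by ran[], plus one distinct Table line per update, assuming no collisions); the function and constant names are hypothetical.

```python
RAN_ENTRIES = 128   # length of the auxiliary array ran[] in the kernel
WORD = 8            # bytes per element of ran[] and Table[]

def expected_peak(line_size):
    """Expected reuse distance for the line holding ran[j], ignoring collisions."""
    ran_lines = RAN_ENTRIES * WORD // line_size   # lines spanned by ran[]
    table_lines = RAN_ENTRIES                     # one distinct Table line per update
    return (ran_lines - 1) + table_lines

for line_size in (8, 16, 32):
    print(line_size, expected_peak(line_size))    # 255, 191, 159
```

These values are within one of the measured peaks at 256, 191 and 158.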
The analysis in the previous paragraph indicates that accesses to the auxiliary array introduce a certain
amount of temporal locality for small cache sizes (from 2KB to 5KB). When the cache size is large
enough to contain the auxiliary array, there is no temporal locality in the accesses that generate misses
(accesses to Table[]).
For the same reason, the observed miss bandwidth ratio in Figure 3(b) drops for cache sizes that are
insufficient to contain the auxiliary array and stays flat afterwards. The miss bandwidth ratio is 2.0 for
16B line size, 3.7 for 32B line size, 7.0 for 64B line size and 13.5 for 128B line size, all illustrating the poor
spatial locality of RandomAccess. If the RandomAccess benchmark had no spatial locality at all, we
would expect the miss bandwidth ratios for 16B, 32B, 64B and 128B lines to be 2, 4, 8 and 16 respectively. Our
ratios are smaller, especially for 128B line size, indicating some level of spatial locality. This cannot be
due to accesses to the auxiliary array ran[], as these only impact a small cache. So there would seem to
be some spatial locality in the accesses to the array Table[]. If this is true, then, in effect, the sequence
of successive values assigned to the array ran[] fails a statistical test for randomness: the probability
that a value v is followed within a small number of steps by a value w so that v÷8 = w÷8 is greater than
in a true random sequence. We are investigating this result, to ensure that it is not due to a measurement
error and to understand its cause.
[Figure 4 plots: (a) miss bandwidth / 8 (miss rate × line size / 8) and (b) MissBw ratio, both versus
cache size (bytes, 0 to 2 × 10^4), with one curve per line size: 8B, 16B, 32B, 64B, 128B.]
(a) set of 5 2D spatial-temporal locality plots
(b) set of 5 2D MissBwRatio plots
Figure 4: Data Locality Characteristics for LAPACK with working size 2MB and 32 ∗ 32 blocking.
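The comparison made for RandomAccess, between the measured miss bandwidth ratios and the ratios expected with no spatial locality, can be tabulated with a short Python sketch. The measured values are the ones quoted in the text; the helper name `no_locality_ratio` is hypothetical.

```python
WORD = 8  # bytes in a word-sized line, the baseline for the ratio

# Miss bandwidth ratios quoted in the text for RandomAccess.
measured = {16: 2.0, 32: 3.7, 64: 7.0, 128: 13.5}

def no_locality_ratio(line_size, word=WORD):
    """Ratio expected when only one word of every fetched line is used."""
    return line_size / word

for line_size, ratio in measured.items():
    baseline = no_locality_ratio(line_size)
    # A fraction below 1 means some of the extra fetched bytes were useful.
    print(line_size, baseline, round(ratio / baseline, 3))
```

The fraction falls from 1.0 at 16B lines to about 0.84 at 128B lines, which is the residual spatial locality discussed above.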
3.3 LAPACK
We measured the reuse distance distribution for DGESV, a double precision linear equation solver that uses
LU factorization and partial pivoting. We used a 512 × 512 input matrix and 32 × 32 blocking. LAPACK
addresses the multi-level memory hierarchy by reorganizing the original LINPACK algorithms
so that each working block fits in cache; memory accesses therefore have much more locality than in a
naïve implementation. In our experiment, we expect to see good data reuse for cache sizes less than
the 8KB blocking size and little temporal locality improvement for larger caches due to blocking. As for
spatial locality, since arrays and subarrays are accessed in storage order, good spatial locality is expected,
though we may access only 32 contiguous array elements at a time after blocking instead of 512 before
blocking.
The 8B 2D spatial-temporal locality curve in Figure 4(a) matches our expectation on temporal locality
quite well. The miss rate decreases in a continuous manner until the cache size reaches the 8KB blocking limit.
Then the curve stays flat, illustrating no reuse after the turning point (8KB). Considering all line size
curves together, we can see small distances between them, showing a certain spatial locality. In order to
better understand spatial locality, we plot the miss bandwidth ratio in Figure 4(b). We can see that
the miss bandwidth ratio is around 1.1 for 128B line size, and less than 1.05 for other line sizes. Thus
we can conclude that DGESV has very good spatial locality. Note that spatial locality is better for large
caches, where misses are due to accesses to the full matrix.
3.4 FFTE
FFTE works by decomposing a large FFT into smaller transforms that fit in cache. In our experiment, we
measured the one-dimensional complex FFT routine, on an input of 2^18 complex, double precision
(16B) elements. This routine uses a blocking algorithm with maximum 16 × 16 blocks. Thus accesses
tend to have reuse distance close to the blocking size. Good spatial locality should always be seen, in the
sense that in every block elements are referenced consecutively.
Although the basic element in FFTE has size 16B, we still use 8B cache lines in Figure 5(a) to have
comparable results for all the benchmarks. The 8B line drops quickly to 0.2 and then continues to decrease
steadily, exhibiting good temporal locality. The steady decrease can be attributed to the well-known
“surface to volume ratio” of FFT: in a good implementation one expects a miss rate of Θ(1/log m) for a
cache of size m. The initial high miss rate occurs when a block and other temporaries do not fit in cache.
For spatial locality, the miss bandwidth ratio in Figure 5(b) shows two phases. During the first phase,
where the cache size is less than the block size (4KB), we see that miss bandwidth increases with cache
line size, showing imperfect spatial locality. Above 4KB we have perfect spatial locality, with a ratio that
is very near 1.
[Figure 5 plots: (a) miss bandwidth / 8 (miss rate × line size / 8) and (b) MissBw ratio, both versus
cache size (bytes, 0 to 2 × 10^4), with one curve per line size: 8B, 16B, 32B, 64B, 128B.]
(a) set of 5 2D spatial-temporal locality plots
(b) set of 5 2D MissBwRatio plots
Figure 5: Data Locality Characteristics for FFTE with working array length 256K.
3.5 DGEMM
DGEMM performs a matrix-matrix multiply-add operation. The kernel is a triply nested loop that multiplies
matrix A by matrix B. Pseudocode for the kernel is listed below.
      DO 90, J = 1, N
         DO 80, L = 1, K
            IF (B(L,J).NE.ZERO) THEN
               TEMP = ALPHA*B(L,J)
               DO 70, I = 1, M
                  C(I,J) = C(I,J) + TEMP*A(I,L)
   70          CONTINUE
            END IF
   80    CONTINUE
   90 CONTINUE
All arrays are stored in column-major order with 8-byte elements, and N, M and K are 512 in our
experiment. An entire column of C is reused within the L loop as long as this C column, one A column
and one B element can fit in cache. Thus reuse will increase significantly once the cache size reaches
2 × 512 × 8B = 8KB. If the cache size is large enough to hold all of A, then we get further reuse of the
elements of array A. As for spatial locality, since array elements are always referenced in storage
order, perfect spatial locality is expected.
The spatial-temporal locality plot confirms this behavior. The 8B curve in Figure 6(a) exhibits a sharp
drop at cache size 8KB, where the miss rate decreases by 50%, from 0.328 to 0.165. The miss bandwidth ratios
in Figure 6(b) are all very close to one. The match is perfect for large caches; there is a narrow peak around
8KB, as the cache line size affects the exact size of the cache that can fit two matrix columns, one matrix
element and additional temporaries.
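The working-set sizes behind the DGEMM thresholds can be checked with a few lines of Python. This is only the arithmetic from the text (two columns of 512 doubles, and the full 512 × 512 matrix A); the function names are hypothetical, and the single B element and temporaries are ignored.

```python
ELEM = 8   # bytes per double-precision element
M = 512    # rows of A and C in the experiment
K = 512    # columns of A in the experiment

def column_reuse_threshold(rows, elem_bytes=ELEM):
    """Bytes needed to keep one column of C and one column of A resident."""
    return 2 * rows * elem_bytes

def full_matrix_bytes(rows, cols, elem_bytes=ELEM):
    """Bytes needed to keep an entire matrix resident."""
    return rows * cols * elem_bytes

print(column_reuse_threshold(M))   # 8192 bytes = 8KB
print(full_matrix_bytes(M, K))     # 2097152 bytes = 2MB
```

The second figure is far beyond the 20KB range covered by the plots, which is consistent with only the 8KB drop being visible there.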
3.6 CG
Most memory references in the Conjugate Gradient benchmark occur while computing sparse matrix-
vector multiplication. The source code is listed below.
do j=1,lastrow-firstrow+1
   sum = 0.d0
   do k=rowstr(j),rowstr(j+1)-1
      sum = sum + a(k)*p(colidx(k))
   enddo
   q(j) = sum
enddo
[Figure 6 plots: (a) miss bandwidth / 8 (miss rate × line size / 8) and (b) MissBw ratio, both versus
cache size (bytes, 0 to 2 × 10^4), with one curve per line size: 8B, 16B, 32B, 64B, 128B.]
(a) set of 5 2D spatial-temporal locality plots
(b) set of 5 2D MissBwRatio plots
Figure 6: Data Locality Characteristics for DGEMM with matrix size 512 ∗ 512.
[Figure 7 plots: (a) miss bandwidth / 8 (miss rate × line size / 8) and (b) MissBw ratio, both versus
cache size (bytes, 0 to 2 × 10^4), with one curve per line size: 8B, 16B, 32B, 64B, 128B.]
(a) set of 5 2D spatial-temporal locality plots
(b) set of 5 2D MissBwRatio plots
Figure 7: Data Locality Characteristics for CG with 4.4MB non-zero matrix data and 504KB non-zero
vector data.
All nonzero matrix elements are stored in a[] contiguously in row-major order, and they are indexed
by rowstr[]. The matrix elements are traversed in storage order; the vector elements p[] are accessed
in random order, as the index of the vector element accessed is determined by colidx[], the column
index of the matrix element.
Since our experiment has about 4.4MB non-zero matrix data and 504KB non-zero vector data in mem-
ory, data reuse is very limited. For spatial locality, the accesses to matrix elements will have good spatial
locality, while the vector element accesses will have no spatial locality. Therefore we expect average
spatial locality.
Our spatial-temporal locality plot in Figure 7(a) and the miss bandwidth ratio plot in Figure 7(b) match
quite well with our conclusions drawn from the source code. The 8B line size curve is very flat. Its miss
rate decreases only 5%, from 0.49 to 0.46, when the cache size increases from 1KB to 16KB, which illustrates
bad temporal locality. (Spatial locality would increase significantly once the cache is large enough to
contain the vector.) In Figure 7(b), we can see that the miss bandwidth ratios for 16B, 32B, 64B and 128B
line size caches are about 1.5, 2, 3.5 and 6 respectively; these would be 1 for perfect spatial locality and 2,
4, 8 and 16 for no spatial locality.
3.7 EP
The core loop of the Embarrassingly Parallel benchmark reads a double-precision working array x[] in a
sequential manner and at the same time updates a 10-entry double-precision array q[] relatively randomly.
l               0       1       2      3      4      5      6      7      8      9
frequency (%)   46.602  44.514  8.351  0.520  0.013  0.000  0.000  0.000  0.000  0.000
Table 1: Statistics of q[l]'s occurrences for EP
[Figure 8 plots: (a) miss bandwidth / 8 (miss rate × line size / 8) versus cache size (bytes, 10^2 to 10^5);
(b) miss bandwidth ratio versus cache size (bytes, 0 to 2 × 10^4); (c) miss bandwidth / 8 versus cache size
(bytes, 10^2 to 10^5) for EP without LOG and SQRT. One curve per line size: 8B, 16B, 32B, 64B, 128B.]
(a) set of 5 2D spatial-temporal locality plots
(b) set of 5 2D MissBwRatio plots
(c) 2D spatial-temporal locality plot for log-sqrt-removed version
Figure 8: Data Locality Characteristics for EP with 2MB working size.
In each iteration i, the q[] entry to be accessed is determined by a formula based on the values of
x[2*i-1] and x[2*i]. In our experiment, x[] is 2MB and q[] occupies only 80B. x[] definitely has no
reuse for small caches. Since each access to an entry of q[] is accompanied by accesses to two entries
of array x[], we expect entries of the array q[] not to be evicted once the cache is larger than 240B,
three times the size of array q[]. Thus, we expect little temporal locality after 240B until the cache size
is large enough to hold all of x[]. Since accesses to x[] are sequential, we expect good spatial locality for
caches larger than 240B.
Figures 8(a) and 8(b) show a different behavior. The 8B line curve starts to decrease from 200B and reaches
its minimum around 300B, which is more than the expected 240B threshold. In addition, the miss bandwidth
ratio for long cache lines indicates bad spatial locality for cache sizes up to 15KB.
To understand this discrepancy, we count the frequency of each q[l]'s occurrences. The statistics in
Table 1 illustrate a strong bias towards small index entries; only the first five entries of q[] are actually used. This
would lead us to expect a turning point at less than 240B; another explanation is needed.
One possible explanation is the accesses to scalar variables inside the loop, which may cause extra
memory accesses. But these variables tend to be allocated in continuous memory chunks, so that they
would not hurt spatial locality for large caches. After examining the assembly code and simulator config-
uration, we are sure that almost all these scalar variables are actually held in registers; accesses to these
variables cannot explain the locality behavior.
The unexpected behavior turns out to be caused by the calls to the intrinsic functions (log and sqrt)
made to calculate the index used to access q[]. To verify this, we removed the invocations of log and
sqrt from the code and adjusted the indices to be in a valid range. The generated index sequence was
different from the original one, but as we discussed before, the order of access to array q[] should not
affect locality much. The spatial-temporal locality plot for the modified version (Figure 8(c)) shows
perfect spatial locality, which demonstrates that the execution of the log and sqrt functions actually
results in irregular memory accesses that degrade the spatial locality.
3.8 Summary of Case Studies
To summarize the above seven case studies, we generalize the correlation between program memory
access pattern and locality.
There are two extreme access patterns in our case studies, sequential access and random access. The
former has perfect spatial locality whereas the latter shows no spatial locality at all. If the kernel loop has
a combination of these two patterns, the one with more memory accesses determines the spatial locality.
However, the minority pattern still affects the spatial locality for small caches. As an example, suppose the
kernel loop has m sequential accesses to seq[] of size s1, n random accesses to ran[] of size s2, and
that the accesses to seq[] and ran[] are mixed. If s2 ≫ s1, as in the RandomAccess benchmark, the spatial
locality deteriorates after the cache size exceeds (n/m + 1)s1. On the other hand, if s1 ≫ s2, as in
EP, the spatial locality will be perfect when the cache size exceeds (m/n + 1)s2. Finally, if s1 and s2 are
relatively close, for example in CG, an average spatial locality is expected.
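The two threshold expressions in this summary can be checked against the case studies with a small Python sketch. The access counts below are assumptions made for illustration: for RandomAccess, one pass is taken to touch the 128-entry (1KB) auxiliary array 128 times alongside 128 Table updates (note that the benchmark's own ran[] array plays the role of the sequential array seq[] here); for EP, each access to the 80B array q[] is taken to accompany two accesses to x[]. The function name is hypothetical.

```python
def residency_threshold(small_array_bytes, accesses_to_small, accesses_to_large):
    """Cache size (bytes) past which the small array stays resident.

    Mirrors the summary's expressions (n/m + 1) * s1 and (m/n + 1) * s2,
    where the ratio is accesses to the large array per access to the small one.
    """
    return (accesses_to_large / accesses_to_small + 1) * small_array_bytes

# RandomAccess (assumed counts): small sequential array of 1KB, n/m = 128/128.
print(residency_threshold(1024, 128, 128))   # 2048.0 bytes
# EP (assumed counts): small random array q[] of 80B, m/n = 2/1.
print(residency_threshold(80, 1, 2))         # 240.0 bytes
```

Under these assumptions the sketch gives 2KB for RandomAccess, where the 8B curve drops sharply, and the 240B threshold derived for EP in Section 3.7.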
In the case of temporal locality, decomposition and blocking techniques contribute a lot. In FFTE and
LAPACK, once the cache can hold the entire block, most cache capacity misses disappear.
Besides memory references and compiler optimization techniques, there are other factors that may
improve or degrade temporal and spatial locality. As we have discussed in Section 3.7, code generated
by the compiler and run time for the use of temporary variables or for calls to intrinsics that use large tables
may affect locality; other compiler changes in the generated code may have the same effect.
4 Locality Information Compression
The previous section has illustrated the usefulness of the spatial-temporal locality plot to characterize
the locality of the memory accesses performed by an application. It is always tempting to attempt to
compress the information so as to come up with a single “index of locality”, or a few locality parameters.
Unfortunately, it is intrinsically impossible to do so in a manner that is architecture independent: For any
monotonic non-increasing function f(m) such that 0 ≤ f(m) ≤ 1 and f(m) → 0 as m → ∞, it is possible to
define a sequence of accesses so that f(m) closely approximates the miss rate function for this sequence
of accesses [16]. Thus, one parameter cannot possibly characterize the cache miss rates for a sequence of
accesses at all levels of the memory hierarchy. This is not only a theoretical issue: The different sizes of
arrays accessed in a computation can cause jumps in miss rates at arbitrary cache sizes.
Information can be compressed if one studies one fixed architecture. The information of interest is the
overhead due to memory accesses, i.e., the difference between the actual compute time and the compute
time that would occur if all values used resided in registers, and no loads or stores were necessary. To
a first approximation, this overhead can be computed by weighting the miss rates at each level of the
memory hierarchy by the latency at that level of the hierarchy. That is, the overhead is approximated by a
function of the form
$\sum_{i=1}^{k-1} \mathrm{MissRate}(m_i)\,\mathrm{latency}_{i+1}$, where k is the number of levels in the memory
hierarchy, $m_i$ is the size of $L_i$, the cache at the i-th level in the hierarchy, and $\mathrm{latency}_i$ is the access time
of $L_i$. The access time to a cache is, to a first approximation, logarithmic in the cache size [18]. Thus, the
overhead is approximated by
$a \sum_{i=1}^{k-1} \mathrm{MissRate}(m_i)\log(m_{i+1})$, for some positive constant a.
To the extent that different systems, at the same point in time, tend to have caches of similar size, one
can use the same weighted sum to compute an index of locality that is valid across systems at this point in
time. The index will not be valuable for comparisons across time, as cache sizes increase. Alternatively,
one can choose a mathematically convenient sequence of sizes $m_i$ that presents a good abstraction of a
memory hierarchy, without attempting to fit technology at a particular point in time. A natural choice is
to pick $m_i = 2^i$, thus resulting in a Temporal Locality Index of the form
$\sum_i (i+1)\,\mathrm{MissRate}(2^i)$.
We show in Table 2 the value of the temporal locality index for some of the applications studied in this
paper. To calculate the above sum, we let i range from 10 to 14 because of the limitations of
our experimental setup.
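The index can be computed directly from a miss-rate curve. The following is a minimal sketch, assuming the miss rate is available as a function of cache size; the names `temporal_locality_index` and `example_miss_rate` are hypothetical, and the step-shaped curve is made up for illustration and is not measured data. The weight i + 1 corresponds to the base-2 logarithm of the next size, 2^(i+1).

```python
def temporal_locality_index(miss_rate, exponents=range(10, 15)):
    """Return the sum of (i + 1) * MissRate(2**i) over the given exponents i.

    miss_rate is any callable mapping a cache size to a miss rate in [0, 1].
    The default exponents 10..14 match the range used for Table 2.
    """
    return sum((i + 1) * miss_rate(2 ** i) for i in exponents)


def example_miss_rate(cache_size):
    """Hypothetical miss-rate curve: a single drop once a working set fits."""
    return 0.5 if cache_size < 4096 else 0.1


index = temporal_locality_index(example_miss_rate)
print(index)
```

A lower value indicates better temporal locality. Because misses at larger cache sizes carry larger weights, a code whose miss rate stays high at 2^14 is penalized more than one whose misses are confined to the smallest sizes.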
Benchmark  FFTE  RandomAccess  LAPACK  STREAM  DGEMM  CG     EP    EP(modified)
Score      7.80  10.68         12.97   32.20   17.75  30.88  7.30  23.37
Table 2: Temporal locality index for i from 10 to 14 (higher index implies worse temporal locality)
The scores shown in the table may look surprising: for example, RandomAccess has a better score
(more temporal locality) than most other benchmarks, even though RandomAccess is supposed to be a
benchmark exhibiting bad temporal locality. The reason for this surprising phenomenon is that, although
the accesses to the main array are random and exhibit no locality, most of the accesses are to a short
auxiliary array: while, at the source code level, we expect half of the accesses to be to the array ran[] and half to
be to the array Table[], an examination of the assembly code generated by our compiler indicates that
more than three quarters of the accesses are to the array ran[]. This may be due to imperfect compiler
analysis.
5 Reuse Distance Measurement
In this section, we describe the technique used to measure the miss bandwidth. The key step is to
measure the reuse distance distribution. Reuse distance can be estimated at compile time via static loop
analysis [8, 9] or computed at run time from a trace of memory accesses [10, 11, 13]. The compile-time
approach avoids the simulation overhead, but is not accurate for codes with data-dependent control and
memory access patterns. The trace-driven approach is very accurate because it accounts for compiler
and microarchitecture effects on the memory access sequence and handles data-dependent traces. To get an
accurate reuse distance histogram, we use the latter method. We used the MIPS SESC simulator [15] and
extended it by effectively adding a large, fully associative LRU cache. The extended simulator maintains
recently accessed addresses on a stack and moves each address to the top of the stack when it is accessed.
The depth of the old position is returned as the reuse distance. A reuse distance of ∞ is returned for an
address that is not on the stack (cold miss). The stack size is bounded by the maximum memory in use.
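The stack mechanism described above can be sketched in a few lines. This is a naive illustration, not the simulator extension itself: it uses a Python list as the stack, so each access costs time linear in the stack size. The names `reuse_distances` and `miss_rate` are hypothetical. The sketch assumes a 0-based depth, so that the reuse distance equals the number of distinct other addresses accessed since the previous access, and an access hits in a fully associative LRU cache of C locations exactly when its distance is smaller than C.

```python
import math


def reuse_distances(trace):
    """Return the reuse distance of every access in trace (naive LRU stack)."""
    stack = []       # most recently used address is at the end of the list
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Depth of the old position, counted from the top of the stack.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(math.inf)   # cold miss
        stack.append(addr)               # move (or push) to the top
    return distances


def miss_rate(distances, cache_size):
    """Fraction of accesses that miss in an LRU cache of cache_size locations."""
    misses = sum(1 for d in distances if d >= cache_size)
    return misses / len(distances)


trace = ["a", "b", "c", "a", "b", "b"]
distances = reuse_distances(trace)
print(distances)
print(miss_rate(distances, 3))
```

A single pass over the trace therefore yields the miss rate for every cache size at once, which is what makes the reuse distance histogram a convenient basis for the plots.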
5.1 Efficient Implementation
The trace stack data structure has to support efficient search, delete and insert operations. To do so, we
maintain the trace stack as a weighted balanced priority tree (similar to [6]) with the time of the last access
to a memory location as the key. The number of nodes in subsequent subtrees indicates the number of distinct
memory locations accessed since the last access to a node. A hash table is used to map each memory
address to its last access time. When a memory access is made, its address, addr, is looked up in the hash
table to find timestep, the corresponding time of the previous access.
If no hash entry exists, a new one is made and a new node is inserted into the priority tree. In this case,
a MAXINT value is returned as the reuse distance. Otherwise, we search for timestep in the priority tree,
find the number of nodes with smaller keys and delete the old node. Then, a new node is inserted into the
tree to represent the updated state of that memory location. The average search and update complexity
for each memory access is O(log n), where n is the number of distinct addresses accessed so far. The tree
is trimmed after the number of nodes passes a threshold. Also, priority can be given to branches that
contain younger nodes to improve search and update in that part of the tree.
In the previous sections, we assumed that each accessed location is equivalent to a cache line. The
same assumption can be made to simulate caches with larger lines. In our implementation, we define a
granularity parameter Block. Instead of directly using addr as the hashing key, we use ⌈addr/Block⌉.
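The combination of a hash table and an order-statistics structure keyed by access time can be sketched as follows. This is not the balanced priority tree described above: for brevity, the sketch swaps in a Fenwick (binary indexed) tree over access times, which also gives logarithmic updates and counts but needs the trace length in advance and does no trimming. The class name `ReuseDistanceTracker` and its parameters are hypothetical. The sketch returns `math.inf` instead of MAXINT for a cold miss, counts the live timestamps that are more recent than the previous access, and maps an address to its block with floor division, which is an assumption made so that blocks are aligned.

```python
import math


class ReuseDistanceTracker:
    """Reuse distances from a hash table plus a Fenwick tree over access times."""

    def __init__(self, max_accesses, block=1):
        self.block = block                    # granularity parameter (Block)
        self.size = max_accesses
        self.tree = [0] * (max_accesses + 1)  # 1 at each live last-access time
        self.last_access = {}                 # block id -> time of last access
        self.time = 0

    def _add(self, pos, delta):
        while pos <= self.size:
            self.tree[pos] += delta
            pos += pos & -pos

    def _prefix(self, pos):
        total = 0
        while pos > 0:
            total += self.tree[pos]
            pos -= pos & -pos
        return total

    def access(self, addr):
        """Record one access and return its reuse distance."""
        if self.time >= self.size:
            raise IndexError("trace longer than max_accesses")
        self.time += 1
        key = addr // self.block              # assumption: floor, aligned blocks
        prev = self.last_access.get(key)
        if prev is None:
            distance = math.inf               # cold miss
        else:
            # Live timestamps after prev = distinct blocks touched since then.
            distance = self._prefix(self.time - 1) - self._prefix(prev)
            self._add(prev, -1)               # delete the old entry
        self._add(self.time, 1)               # insert the updated entry
        self.last_access[key] = self.time
        return distance


tracker = ReuseDistanceTracker(max_accesses=16, block=4)
observed = [tracker.access(a) for a in [0, 4, 8, 1, 5, 6]]
print(observed)
```

With block=4, addresses 0 and 1 fall in the same block, so the fourth access is a reuse at distance 2 rather than a cold miss. Running the same trace with a larger block size is how one obtains the curves for larger line sizes.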
5.2 Reuse Distance Measurements of HPCS Benchmarks
We have done measurements for five individual benchmarks from the HPC Challenge benchmark suite [1, 2]
and two benchmarks from NPB3.2-SER [4] with reasonable input sizes. The input size must meet the
requirements of the algorithm for its benchmarking purpose. For example, the STREAM benchmark requires the
working set to exceed the largest cache or TLB size, so that the referenced data cannot
all be held in the caches or TLBs. For the same reason, the ideal memory usage of the HPL benchmark is half
the total system memory. However, our program has a limit of $2^{16}$ on the number of distinct locations
accessed. In order to make our simulation results representative, we have used, whenever possible, input
sizes that are commonly used for benchmarking, and otherwise the largest input size that fits our simulator.
To better illustrate the locality of each benchmark algorithm itself, we commented out several unimportant
or non-simulated instructions such as timing calls, dumps of large amounts of data, etc. Also, all our
simulations are for uniprocessors. Note that HPL is an MPI implementation of the Linpack double-precision
linear equations solver, designed for distributed-memory multiprocessor systems. To avoid the MPI overheads,
we did not pick the trivial approach of running HPL with the MPI calls on one process.
Instead, we used DGESV from LAPACK [3]. Details about the modifications we made to the benchmarks
can be found on our group website [5].
6 Summary and Conclusion
We have developed in this paper an approach for characterizing temporal and spatial locality of codes
and have applied it to several HPC benchmarks. We have shown that the proposed characterization can
explain well the behavior of these benchmarks, as determined from an examination of their code. We have
not attempted, in this paper, to relate measures of temporal and spatial locality to actual running time, as
measured on real systems; we hope to do so in follow-up work.
Comparison of locality metrics with actual running time is especially important for metrics that attempt
to summarize locality information using one or a few parameters. Such attempts are bound to imply a
significant loss of information: one can either use a parsimonious characterization that focuses on cache
characteristics of current systems, or use a characterization that has a larger number of values (in effect,
a histogram) if one wishes to have an architecture-independent metric.
In this work, we presented results for each benchmark for a fixed input size. Our experiments with
different input sizes indicate that the spatial-locality curves look similar as the input size is changed. This is not
necessarily true of any code; rather, it is due to the fact that the codes we examined are written in a style
where input size does not affect behavior for the small cache sizes we have studied.
References
[1] High productivity computer systems. http://www.highproductivity.org/.
[2] Hpc challenge benchmark. http://icl.cs.utk.edu/hpcc/.
[3] Linear algebra package. http://www.netlib.org/lapack/.
[4] Nas parallel benchmarks. http://www.nas.nasa.gov/Software/NPB/.
[5] Parallel processing principles group at university of illinois at urbana-champaign.
http://wing.cs.uiuc.edu.
[6] G. Almási, C. Caşcaval, and D. A. Padua. Calculating stack distances efficiently. In MSP ’02:
Proceedings of the workshop on Memory system performance, pages 37–43, New York, NY, USA,
2002. ACM Press.
[7] K. Beyls and E. D’Hollander. Reuse distance as a metric for cache behavior. In Proceedings of the
IASTED Conference on Parallel and Distributed Computing and systems, August 2001.
[8] C. Caşcaval and D. A. Padua. Estimating cache misses and locality using stack distances. In ICS
’03: Proceedings of the 17th annual international conference on Supercomputing, pages 150–159,
New York, NY, USA, 2003. ACM Press.
[9] S. Chatterjee, E. Parker, P. J. Hanlon, and A. R. Lebeck. Exact analysis of the cache behavior of
nested loops. In PLDI ’01: Proceedings of the ACM SIGPLAN 2001 conference on Programming
language design and implementation, pages 286–297, New York, NY, USA, 2001. ACM Press.
[10] C. Ding and Y. Zhong. Predicting whole-program locality through reuse distance analysis. In ACM
Conference on Programming Languages Design and Implementations’03, 2003.
[11] C. Fang, S. Carr, S. Onder, and Z. Wang. Reuse-distance-based miss-rate prediction on a per
instruction basis. In MSP.
[12] K. Grimsrud, J. Archibald, R. Frost, and B. Nelson. On the accuracy of memory reference models.
In Proceedings of the 7th international conference on Computer performance evaluation : modelling
techniques and tools, pages 369–388, Secaucus, NJ, USA, 1994. Springer-Verlag New York, Inc.
[13] G. Marin and J. Mellor-Crummey. Cross-architecture performance predictions for scientific applica-
tions using parameterized models. In SIGMETRICS 2004/PERFORMANCE 2004: Proceedings of
the joint international conference on Measurement and modeling of computer systems, pages 2–13,
New York, NY, USA, 2004. ACM Press.
[14] R. Mattson, J. Gecsei, D. Slutz, and I. Traiger. Evaluation techniques for storage hierarchies. IBM
Systems Journal, 9(2), 1970.
[15] J. Renau and L. Ceze, 2002. http://sourceforge.net/projects/sesc/.
[16] M. Snir. Measuring and leveraging locality of reference in uniprocessors. 2005.
[17] W.-H. Wang and J.-L. Baer. Efficient trace-driven simulation methods for cache performance anal-
ysis. ACM Trans. Comput. Syst., 9(3):222–241, 1991.
[18] S. Wilton and N. P. Jouppi. Cacti: an enhanced cache access and cycle time model. Solid-State
Circuits, IEEE Journal of, 32(5):677–688, May 1996.
