An Empirical Study on How Program Layout affects Cache Miss Rates by Bradford, Jeffrey P. & Quong, Russell W.
Purdue University
Purdue e-Pubs
ECE Technical Reports Electrical and Computer Engineering
6-1-1995
An Empirical Study on How Program Layout
affects Cache Miss Rates
Jeffrey P. Bradford
Purdue University School of Electrical and Computer Engineering
Russell W. Quong
Purdue University School of Electrical and Computer Engineering
Bradford, Jeffrey P. and Quong, Russell W., "An Empirical Study on How Program Layout affects Cache Miss Rates" (1995). ECE




An Empirical Study on How 
Program Layout affects Cache Miss Rates 
Jeffrey P. Bradford Russell W. Quong 
School of Electrical Engineering 
Purdue University 
West Lafayette, IN 47907 
{jbradfor,quong}@ecn.purdue.edu
Abstract 
Cache miss rates are quoted for a specific program, cache configuration, and input set; the effect 
of program layout on the miss rate has largely been ignored. We examine the variation of the miss 
rate resulting from randomly chosen layouts, the miss variation, for several cache configurations 
(cache size, line size, and set-associativity), input sets, and optimization levels for five programs
in the SPEC benchmark suite. We observed miss rates that varied from 0.6m to 1.8m, where m
is the mean miss rate. We did not observe any consistently good layouts across different
parameters; in contrast, several layouts were consistently bad. Overall, cache line size has little
effect on the miss variation, while increasing the cache size (decreasing the miss rate), decreasing
the set-associativity, or increasing the optimization level increased the miss variation. We question
the validity of using a single layout to represent the miss rate of a given program for a direct- 
mapped cache. 
1. Introduction
In designing a memory system, computer architects run trace-driven simulations for many 
programs to determine the miss rate for various cache configurations. For a given cache 
configuration, the resultant miss rate depends on three factors: (i) the program being executed; 
(ii) the input data; (iii) the layout, namely the specific compile/link time mapping of the object
code and data objects to memory addresses. To address the first two factors, standard benchmark 
suites (SPEC, Perfect, etc.) have been developed and are commonly used. 
We are concerned with the last factor, the layout. Different layouts have different miss rates, as 
changing the layout changes which instructions and data map to the same cache line(s). Almost 
invariably only one layout is used when calculating the miss rates. The layout is affected by many 
factors including compiler optimization, the order in which the program is linked together, and 
the specific libraries on a system. 
The original impetus for this work is the gap model [Quo94], which analytically predicts the 
"true" miss rate averaged over all possible layouts. As expected, there was some deviation 
between the miss rates as predicted by the gap model and the miss rates as measured from a cache 
simulation on the default layout. In this paper, we explore the related issue of how much miss
rates vary "randomly" due to layouts.
Furthermore, it makes sense to purposely consider different layouts for two reasons. First, the
notion of a single "standard" layout for a program is erroneous. Compilers and system libraries
evolve over time, both of which affect the layout in unforeseen ways. More importantly, different
architectures have different object code densities which greatly affect layouts and miss rates.
Second, we show that layouts do affect the miss rate, which in turn can affect system performance.
For time-critical applications, it would be wise to view the finding of a layout with a low miss rate
as another standard compile/link optimization.
Our primary goal was to determine if measuring the miss rate for a single layout gives an accurate 
estimate of the true miss rate. In addition, we hoped to answer the following questions.
What is the typical variation in the miss rates (the miss variation) for programs?
How do cache size, line size, associativity, optimization, and input set affect miss
variation?
Are there consistently good or bad layouts as we change the cache size, line size,
optimization, and input set?
How does the miss variation affect system performance? 
The paper is organized as follows. Section 2 discusses miss variation in detail and gives some 
definitions. Section 3 describes the experiments we performed to generate the data. Section 4 
summarizes the results and discusses some interesting observations. Section 5 answers the four
questions posed above, and Section 6 presents conclusions and offers ideas for further research. 
2. Background and Previous Work 
For a program to have a high miss variation requires two memory fragments that obey three 
properties (a memory fragment being a code section or data object). First, the fragments must 
be executed frequently. Second, the fragments must have high temporal locality, namely after one
fragment is referenced, the other is likely to be referenced soon. Third, the relative addresses of
the fragments must be layout-dependent, namely in some but not all layouts the fragments map 
to the same cache lines. 
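To make the third property concrete, the following is a minimal sketch (not part of the original study) of how two fragments can collide in a direct-mapped cache under one layout but not under another. The function names, addresses, and fragment sizes are hypothetical; the default cache parameters (4 kB, 16-byte lines) match one of the configurations measured later.

```python
def cache_lines(start, size, cache_size=4096, line_size=16):
    """Return the set of direct-mapped cache line indices a fragment occupies."""
    num_lines = cache_size // line_size
    first_block = start // line_size
    last_block = (start + size - 1) // line_size
    return {block % num_lines for block in range(first_block, last_block + 1)}


def fragments_conflict(start_a, size_a, start_b, size_b, **cache):
    """True if the two fragments share at least one cache line index."""
    lines_a = cache_lines(start_a, size_a, **cache)
    lines_b = cache_lines(start_b, size_b, **cache)
    return bool(lines_a & lines_b)


# Layout 1 (hypothetical addresses): the fragments start exactly one
# cache size (4096 bytes) apart, so they map to the same lines.
layout1_conflict = fragments_conflict(0x1000, 64, 0x2000, 64)

# Layout 2: the second fragment is shifted by 256 bytes, so the two
# 64-byte fragments no longer share any line.
layout2_conflict = fragments_conflict(0x1000, 64, 0x2100, 64)
```

If both fragments are executed frequently and close together in time, the first layout would be expected to miss repeatedly while the second would not.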
Note that the only layout-dependent data objects are static (global) scalars and arrays; in 
particular, the addresses of stack-allocated variables (locals) and heap-allocated variables (via
malloc) are not affected by the layout. Similarly, the relative addresses of elements of the same
array are not affected by the layout. All code is layout dependent. 
A program consists of modules; a module consists of procedures; and a procedure consists of basic
blocks. Usually when generating the object file, the compiler places procedures and their constituent
basic blocks consecutively in memory based on their order in the source module. Similarly, the
linker places object files consecutively in memory based on their order in the link command.
There is rarely any thought given to the ordering, both because of and despite the fact that
procedures and modules can be arbitrarily ordered in a program.
We can rearrange the layout of a program at three granularities, module, procedure, and basic 
block. While we can rearrange modules and procedures arbitrarily, rearranging basic blocks
requires adding unconditional jumps to preserve the original control flow. These unconditional 
jumps change the program in two ways. They make the code larger, and they change the 
instruction stream. For these reasons, we did not rearrange at the basic block level. 
For a program with M modules and P procedures, there are M! module-level and P! procedure-level
rearrangements; even for small M and P there are many possible layouts. Note that we have
ignored layouts that have holes or unused addresses. In practice, a code or data segment is
contiguous (no holes), to minimize memory and disk space usage. 
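As a quick illustration of how fast these counts grow, the following sketch computes M! and P! for a hypothetical program; the values M = 10 and P = 50 are assumptions chosen only for the example and do not describe any of the benchmarks.

```python
import math


def layout_counts(num_modules, num_procedures):
    """Number of hole-free module-level and procedure-level orderings."""
    return math.factorial(num_modules), math.factorial(num_procedures)


# Hypothetical program with 10 modules and 50 procedures.
module_layouts, procedure_layouts = layout_counts(10, 50)
print(f"module-level layouts:    {module_layouts}")
print(f"procedure-level layouts: {procedure_layouts:.3e}")
```

Even this small example has millions of module-level layouts, which is why only a random sample of layouts can be measured.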
We would expect the miss variation to increase as we rearrange at finer granularities, due to 
intra-module and intra-procedure locality. Namely, after referencing a procedure we expect to 
reference another procedure in the same module soon. By default, the compiler places intra-module
procedures consecutively in memory, which guarantees these do not conflict (unless the
procedures fill up the cache and wrap around). Procedure-level rearranging removes this
guarantee. This argument applies to an even greater extent when rearranging at the basic-block
level, as there is stronger temporal locality among intra-procedural basic blocks than among
intra-module procedures.
We expect a direct-mapped cache to show the greatest variation, with variation decreasing as the
set-associativity increases. In the extreme case, a fully associative LRU cache is insensitive to
the layout. In
addition, we expect decreasing the compiler optimization level to decrease the miss variation, as
a common optimization is to remove redundant loads/stores [FL91]. A redundant load/store to
address a duplicates a previous reference to a, so it is very likely a is in the cache. Thus,
redundant loads/stores are "sure" cache hits, which effectively reduces both the miss rate and the 
miss variation (but increases cache accesses). 
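The expectation that set-associativity reduces layout sensitivity can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the simulator used in this study: the trace is a hypothetical pair of addresses referenced alternately, and the function names are invented for the example.

```python
from collections import OrderedDict


def miss_rate(trace, cache_size, line_size, ways):
    """Simulate an LRU set-associative cache over a list of byte addresses."""
    num_sets = cache_size // (line_size * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line_size
        lru = sets[block % num_sets]
        if block in lru:
            lru.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            lru[block] = True
            if len(lru) > ways:
                lru.popitem(last=False)  # evict the least recently used block
    return misses / len(trace)


def alternating_trace(addr_a, addr_b, repeats=1000):
    """Two fragments referenced one after the other, many times."""
    return [addr_a, addr_b] * repeats


conflicting = alternating_trace(0x0000, 0x1000)  # one cache size (4 kB) apart
separated = alternating_trace(0x0000, 0x1010)    # shifted by one 16-byte line

dm_bad = miss_rate(conflicting, 4096, 16, ways=1)   # direct-mapped, bad layout
dm_good = miss_rate(separated, 4096, 16, ways=1)    # direct-mapped, good layout
sa_bad = miss_rate(conflicting, 4096, 16, ways=4)   # 4-way, same bad layout
```

In this toy case the direct-mapped cache misses on every reference for the conflicting layout, while the 4-way cache sees only the two compulsory misses regardless of layout.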
A parameter is any of the factors we change, namely the cache configuration (line size, size, or
set-associativity), the compiler optimization, and input set. We call a layout good (bad) if it
yields a miss rate, for a given set of parameters, that is significantly lower (higher) than the
average miss rate. We call a layout extreme if the layout is either good or bad. Note that a layout
that is bad for one set of parameters might be good for another set of parameters. Finally, we use
m to denote the experimental mean miss rate over all measured layouts for a given set of
parameters.
We found no reference in previous literature to how randomly chosen layouts affect the miss rate,
and very few references to variation in execution time in general. Sarkar [Sar89] discusses
execution time variation caused by loops processing different data in the context of picking the
optimal grain size for parallelization. Various cache models have been proposed [AHH89]
[Quo94], but none of these models consider miss variation. 
The idea of rearranging code to improve cache performance is not new. [PH90] used two
compiler optimizations to improve cache performance (and TLB and virtual memory
performance). Their first optimization arranged procedures using a "closest is best" heuristic by
repeatedly coalescing procedures that were adjacent in a weighted call graph until a single node
(the entire program) existed. Their second optimization split each procedure into two procedures,
one of commonly used basic blocks and one of "fluff", and within each of the two new procedures
chained together basic blocks based on usage counts of the internal control-flow graph for that
procedure. Unfortunately, they did not report the improvement in the miss rate, only the
improvement in execution time (10%-26%). [McF89] reordered basic blocks using compile-time
usage estimates and showed that a direct-mapped cache could perform better than a set-associative 
cache, if instructions could selectively be excluded from the cache. Rearranging code is also a 
variation of the well-studied problem of rearranging the address map to minimize paging [Ker71] 
w72:I[Har88][OMHL93]. Our research is unique in that we are the first to consider the range 
of miss rates resulting from randomly chosen layouts. 
3. Method 
We ran six types of experiments to empirically answer the questions posed in the introduction.
We chose five programs from the SPEC92 benchmark suite as shown in Table 1. Two are C
programs, espresso and gcc, and three are FORTRAN programs, spice, doduc, and
fpppp. We ran all six types of experiments under SunOS 5.3 and compiled them with cc 2.0.1
using static linking. We used cachesim5 5.2 with shade 4.1.6 [CK93] to calculate the miss
rates. To run the experiments, we changed the SPEC makefile and wrote shell scripts that
generated different layouts, changed the compile/link flags, and changed the input set used.
Table 1: Properties of the Benchmarks Used

Benchmark   Input Set Used
espresso    input.ref/ti.in
spice       input.short/greycode.in
doduc       input.ref/doduc.in
gcc         input.short/*
fpppp       input.short/natoms

We ran all types of experiments on the same 10 cache configurations, consisting of five cache
sizes, 256 bytes, 1 kB, 4 kB, 16 kB, and 64 kB, and two line sizes, 16 bytes and 64 bytes. Unless
otherwise noted in Table 2, we measured the miss rates for 21 layouts (20 random layouts and the
original layout) on these 10 cache configurations, with full optimization, a direct-mapped cache,
the input sets shown in Table 1, and module-level rearrangement. We generated the random
layouts by changing the order in which the modules are specified to the linker.
For the first type of experiment, we used the parameters listed in the preceding paragraph to ...
For the second type of experiment, we generated 21 layouts by rearranging spice at the procedure
level instead of at the module level. We split the modules so that each new module
contained exactly one procedure. We then linked the new modules in 21 random orders. We chose
spice because we wanted a FORTRAN program, as FORTRAN programs are easier to break
into procedures than C programs, and of the three FORTRAN programs only spice has multiple
procedures per module.

Experiment  Description                    Figures
2           Rearrange at Procedure Level   7
3           100 Layouts                    11, 15
4           Different Input Sets           2, 3, 14
5           Different Optimization Level   4, 5, 8, 9
6           4-way set-associative Cache    16
Table 2: Description of Experiments Performed
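The layout generation described above (changing the order in which modules are given to the linker) can be sketched as follows. This is an illustrative sketch, not the shell scripts used in this study: the object file names, output names, and the helper functions are hypothetical, and the command is only built as a list, never executed.

```python
import random


def random_link_orders(object_files, num_random=20, seed=0):
    """Return the original order followed by num_random shuffled orders."""
    rng = random.Random(seed)          # fixed seed so layouts are reproducible
    orders = [list(object_files)]      # layout 0 is the original layout
    for _ in range(num_random):
        shuffled = list(object_files)
        rng.shuffle(shuffled)
        orders.append(shuffled)
    return orders


def link_command(order, output, linker="cc"):
    """Build a (hypothetical) link command line for one layout."""
    return [linker, "-o", output, *order]


# Hypothetical module names; a real run would list the benchmark's object files.
objects = ["main.o", "parse.o", "eval.o", "util.o"]
layouts = random_link_orders(objects)
commands = [link_command(order, f"prog.layout{i}")
            for i, order in enumerate(layouts)]
```

For procedure-level rearrangement the same approach applies once each procedure has been split into its own module, so that `objects` holds one file per procedure.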
For the third type of experiment, we measured the miss rates for doduc and fpppp for 100
layouts. We were concerned that "only" 21 layouts might not yield miss variation representative
of the true miss variation; we wanted to confirm that the variation was due to the program itself
and not to the small number of layouts chosen. We chose doduc because it gave a low, evenly
spread out variation with no outlying points in Experiment 1, and we chose fpppp because it
gave a wide, unevenly distributed variation.
For the fourth type of experiment, we ran espresso and fpppp with different input sets. We
used bca.in and tial.in for espresso and input.ref/natoms for fpppp.
For the fifth experiment, we compiled espresso and spice with medium optimization (-O2)
and with no optimization, instead of full optimization (-O4) which we used in the other five types
of experiments. Full optimization (on the Sun compilers we use) can cause the following
problems: (i) an increase in code size due to loop unrolling and inlining; (ii) a significant increase
in compile time; (iii) incorrect code being generated by a risky optimization. For these reasons,
many programmers use medium optimization instead. Compiling with no optimization is common
during debugging. Table 3 shows how different optimization levels affect the size of the
executable, the number of instructions executed, and the number of data references (the number
of loads/stores) for the three optimization levels. Note that shifting from no optimization to
medium optimization reduced the number of instructions executed by at least a third in all cases.
Table 3: Effect of Optimization on static code size, the number of instructions executed, and
the number of data references 
For the sixth experiment, we ran fpppp with a four-way set-associative cache. We chose
fpppp because its miss variation was greatest for the direct-mapped cache.
4. Results 
Please refer to Table 4, which summarizes all the figures. To aid comprehension we use the same
scale for all figures. The X-axis shows the size of the cache on a log scale. The Y-axis shows
the miss rate on a log scale over the range from 0.3% to 30%, which is likely to be the range of
interest to a computer architect. When the miss rate is above 30%, performance is likely to be
poor irrespective of the variation of the miss rate. Similarly, unless the miss penalty is high
(100's of cycles), a cache miss rate less than 0.3% will have little impact on program execution
time. The following descriptions point out a few of the interesting features; see the figures for
full details.
Figure 1 shows the miss rate for espresso for twenty-one layouts and the input set ti.in.
The I-cache has three consistently bad layouts for both 16-byte and 64-byte lines. For example,
for the 4-kB I-cache and 16-byte lines, the bad layouts have a miss rate of 3.3%, compared to 2%
for the remaining 18 layouts. For the 16-kB I-cache, there was only one bad layout (with miss
rate of 2.3% for 16-byte lines), which was one of the three bad layouts. The data cache shows little
miss variation except for the 1-kB D-cache. For 64-byte lines, the miss rates split into two groups,
with three layouts yielding an 18.5% miss rate and the other 18 layouts yielding a 15.3% miss rate.
(These three layouts are not the three bad layouts for the I-cache.)
Figure 2 shows the miss rate for espresso for the same 21 layouts as in Figure 1 and the second
input set, bca.in. For both 16-byte and 64-byte lines, this second input set yields a much
greater variation than the first input set, which is most noticeable for the 1-kB I-cache. The 4-kB
I-cache has four bad layouts, three of which are bad for the first input set too.
Figure 3 shows the miss rate for espresso with the third input set, tial.in. Only one layout
is bad for the 4-kB I-cache. This layout is also bad for the 16-kB I-cache for all three input sets;
obviously, this is a layout to avoid. There is also a very slight splitting in the miss rates for the
4-kB D-cache. Changing to 64-byte lines changed the split found in the 4-kB D-cache; 16-byte
lines have 1 bad layout, 64-byte lines have 7 bad layouts.
Figure 4 shows the miss rate for espresso with medium optimization. For 16-byte lines, the
miss rate is similar to full optimization (Figure 1), with three bad layouts for the 4-kB I-cache,
and one of these three layouts is bad for the 16-kB I-cache. But the three bad layouts for
medium optimization are different layouts than the three bad layouts for full optimization. For
64-byte lines the miss rates for the 4-kB and 16-kB I-cache do not split and the overall variation
is decreased. The three layouts that are bad in Figure 1 for the 1-kB D-cache are bad here as
well.
Figure 5 shows the miss rate for espresso with no optimization. Compared to the results for
full optimization (Figure 1), the biggest difference is the 64-kB I-cache. While all the miss rates
are low (below 0.6%), they are higher than the miss rates for the other two optimization levels.
Again we see one bad layout for the 16-kB I-cache, but this is a different layout than the 6 bad
layouts for the other two optimization levels. Besides a slight splitting in the 1-kB I-cache, the
results for 64-byte lines are almost identical to the results for 16-byte lines.

Table 4: Summary of Figures. Unless otherwise noted, for each figure we generated
21 layouts by rearranging at the module level and used a direct-mapped cache.
Figure 6 shows the miss rate for spice. The miss rates for the 16-kB I-cache are evenly spread
out between 0.3% and 1.2%. The miss rates for the 64-kB I-cache yield a bi-modal distribution,
with 14 layouts yielding low miss rates (most below 0.1%), and 7 bad layouts yielding miss rates
grouped around 0.8%. These 7 layouts are also bad for 64-byte lines.
Figure 7 shows the miss rate for spice when rearranged at the procedure level instead of at the
module level. Note that rearranging spice at the procedure level slightly decreases the variation
for 16-byte lines, but slightly increases the variation for 64-byte lines. While the 64-kB I-cache
gives similar results when rearranged at both the module and procedure level, the bad layouts
differ. Of the 7 bad layouts when rearranged at the module level and 8 bad layouts when
rearranged at the procedure level, only 2 layouts are bad in both cases.
Figure 8 shows the miss rate for spice with medium optimization; the results are very similar
to the results for full optimization (Figure 6). Of the 7 bad layouts for the 64-kB I-cache, 6 of
these are among the 7 bad layouts in Figure 6. For 64-byte lines, the 4-kB and 16-kB D-caches
each have 1 bad layout and the 1-kB D-cache has 2 bad layouts.
Figure 9 shows the miss rate for spice with no optimization. Overall, the miss variation has
decreased, especially for the 16-kB I-cache. One layout yields a miss rate above 0.3% for the
64-kB I-cache; this layout is not bad for the other cases. As in Figure 8, changing to 64-byte lines
causes an interesting split in the D-cache which we cannot currently explain. Three layouts are
bad for the 1-kB D-cache, two layouts are bad for the 4-kB D-cache, and one layout is bad for the
16-kB D-cache. Of the three layouts that are bad for the D-cache, two are also bad in Figure 8.
Figure 10 shows the miss rate for doduc. For 16-byte lines, the miss rates for the 16-kB I-cache
are evenly spread out between 3.8% and 7.8%. The miss rates for the 64-kB I-cache are also
evenly spread out, with fifteen layouts yielding a miss rate between 1.0% and 2.0%, and all
twenty-one layouts yielding a miss rate between 0.7% and 3.4%.
Figure 11 shows the miss rate for doduc with 100 layouts instead of the previous 21. The miss
rates for the 16-kB I-cache are still evenly spread out, with miss rates between 3.3% and 7.8%,
except for one layout which yields a miss rate of 2.65%. For the 64-kB I-cache, 94 of the 100
layouts yield miss rates between 0.75% and 2.8%. The remaining six layouts yield slightly higher
miss rates, the highest being 4.1%. For 64-byte lines there is a slight decrease in variance
compared to 16-byte lines.
Figure 12 shows the miss rates for gcc. All cases have very little variation. As the results are
the sum of 19 different input files, it is possible that much of the variation has been averaged out.
The miss rates for each individual input case are not shown, due to the low number of instructions
executed for each input. The results for the 64-byte line case are very similar, except for a
slightly larger, but still small, variation for the D-cache.
Figure 13 shows the miss rates for fpppp with the first input set, input.short/natoms.
The miss rates for the 16-kB D-cache yield a quad-modal distribution. A majority of the layouts
(eleven) yield miss rates spread out between 0.9% and 1.5%, four are between 2.2% and 2.7%,
five are grouped together around 4.7%, and one layout yields a miss rate of 5.9%. The 64-kB
cache yields a bi-modal distribution, with 16 layouts yielding miss rates below 0.3% and 5 bad
layouts yielding a miss rate over ten times higher (near 4%). While otherwise the results are
similar for 64-byte lines, the quad-modal distribution found in the 16-kB D-cache is almost
non-existent for 64-byte lines, and the number of layouts in each group is different.
Figure 14 shows the miss rates for fpppp with the second input set, input.ref/natoms.
The results are very similar to the results with the first input set (Figure 13) for both line sizes,
the only difference being that the lower two groupings for the 16-kB D-cache in Figure 13b have
merged into one group in Figure 14b.
Figure 15 shows the miss rate for fpppp for 100 layouts. For 16-byte lines, we see a clear
bi-modal distribution for the 16-kB D-cache, with 65 layouts in the lower group and 35 layouts in
the upper group. The 64-kB D-cache shows a bi-modal split for both line sizes, with 73 layouts
in the lower group and 27 layouts in the upper group.
Figure 16 shows the miss rate for fpppp with a 4-way set-associative cache. As expected, the
variation decreased to almost nothing for both 16-byte and 64-byte lines.
5. Analysis
We now return to the four questions posed in the introduction. 
What is the typical miss variation for programs? 
Our figures show that programs exhibit a wide variety of variation, from no variation, to wide
variation, to multi-modal variation. The I-cache shows more variation than the D-cache for three
(espresso, spice, and doduc) of the five benchmarks, and the D-cache shows more variation
for one benchmark, fpppp. Finally, gcc shows almost no variation for both the I-cache and
D-cache, but the gcc miss rates were the average of 19 input cases. There were no consistently
good layouts, namely layouts that were good for a wide variety of parameters. There were,
however, many consistently bad layouts.
We define the relative deviation to be the standard deviation divided by m, the mean. For
example, if m = 3.0% and the standard deviation is 1.5%, then the relative deviation is 50%.
We used the formula for standard deviation for measured data,
$\hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - m)^2}$ [GKP89], where $x_i$ is the miss
rate of layout $i$ and $n$ is the number of layouts, to
calculate the relative deviation for all figures as shown in Table 5. There are three important
observations. First, when m is moderate (1% to 5%), the relative deviation is often between 40%
and 80%. Second, all the very high relative deviation values (>90%) occur when m is low
(<0.3%). Third, while most of the miss rates lie within ±2 standard deviations of the mean, the
extreme layouts lie even further out.
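The relative deviation defined above can be computed as in the following sketch. The miss rates in `rates` are hypothetical values chosen for illustration (seven similar layouts and one bad one), not measurements from this study; `statistics.stdev` uses the n-1 denominator for measured data.

```python
import statistics


def relative_deviation(miss_rates):
    """Sample standard deviation divided by the mean miss rate m."""
    m = statistics.mean(miss_rates)
    return statistics.stdev(miss_rates) / m


def extreme_layouts(miss_rates, k=2.0):
    """Miss rates lying more than k standard deviations from the mean."""
    m = statistics.mean(miss_rates)
    s = statistics.stdev(miss_rates)
    return [r for r in miss_rates if abs(r - m) > k * s]


# Hypothetical miss rates (in percent) for eight layouts.
rates = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 3.3]
rel_dev = relative_deviation(rates)
outliers = extreme_layouts(rates)
```

With these assumed values, the single layout at 3.3% lies more than two standard deviations above the mean and is the only one reported as extreme.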
Figure    Instruction Cache    Data Cache
Table 5: Relative deviation (the standard deviation divided by the mean) for all figures. Within
each box, the first number is for 16-byte lines and the second number is for 64-byte lines. The
relative deviation values are low for 256-byte caches, and thus are not shown.
How do (i) cache size, (ii) line size, (iii) set-associativity, (iv) optimization, and (v) input 
set affect miss variation? 
(i) The variation is low at small cache sizes (high miss rate); as the cache size increases, and the
miss rate decreases, the variation either increases or remains low.
(ii) The two line sizes (16 and 64 bytes) give nearly identical variations, with one exception. In
the following four cases, the miss rates split into two groups for 64-byte lines but not for 16-byte
lines. For the 1-kB D-cache for espresso, Figure 1b shows no split while Figure 1d shows an
18/3 split. For the 4-kB D-cache for espresso, Figure 3b shows a 20/1 split while Figure 3d
shows a 14/7 split. For the 1-kB D-cache for spice, Figures 8b and 9b show no split, while
Figures 8d and 9d show a 19/2 and an 18/3 split, respectively.
(iii) Direct-mapped caches show much more variation than four-way set-associative caches, which
show almost no variation.
(iv) Decreasing the optimization usually decreases the variation; we give two examples and one
counter-example. First, for the 4-kB and 16-kB I-caches for espresso, full optimization
(Figure 1) yields several bad layouts; for medium optimization (Figure 4) the bad layouts are
"mediocre", while no optimization (Figure 5) yields no bad layouts. Second, for the 16-kB
I-cache for spice, full optimization (Figure 6) and medium optimization (Figure 8) have
significant variation while no optimization (Figure 9) has little variation. One counter-example
is the 64-kB I-cache for espresso, for which no optimization yields a higher variation than
either full optimization or medium optimization.
(v) Changing the input set does not change the variation, with the following single exception. For
the 1-kB I-cache for espresso, two input sets (Figures 1a and 3a) show low variation, while
one input set (Figure 2a) shows high variation. In a separate case, it appears that the 16-kB
D-cache for fpppp shows a difference between the two input sets, with Figure 13b showing a
quad-modal distribution and Figure 14b showing a bi-modal distribution. However, running 100
layouts (Figure 15) seems to show that the true distribution is bi-modal, and the quad-modal
distribution is an artifact of using only 21 layouts.
Are there consistently good or bad layouts as we change the (i) cache size, (ii) line size,
(iii) optimization, and (iv) input set? 
(i) As we change the cache size, bad layouts remain bad, but there are no consistently good
layouts. We give two examples of consistently bad layouts. First, for the D-cache for fpppp
(Figures 13b and 14b), of the 6 bad layouts for the 16-kB cache, 5 are bad for the 64-kB cache.
Second, for the I-cache for spice (Figure 6a), of the 7 bad layouts for the 64-kB cache, 6 are
bad or yield above average miss rates for the 16-kB cache.
(ii) Line size has no effect on layouts, with the following single exception. For the D-cache for 
espresso and spice, 64-byte lines occasionally show a splitting while 16-byte lines do not.
(iii) Changing the optimization level affects which layouts are bad, as seen in the following three
examples. First, for the 4-kB I-cache for espresso, both full optimization (Figure 1a) and
medium optimization (Figure 4a) have 3 bad layouts, but they are a different 3 layouts. Second,
for the 64-kB I-cache for espresso, full optimization yields 2 bad layouts, medium optimization
yields no bad layouts, and no optimization (Figure 5a) yields 6 bad layouts. Furthermore, the 2
bad layouts for full optimization are not among the 6 bad layouts for no optimization. Third, for
the 64-kB I-cache for spice, 7 layouts are bad for each of full optimization (Figure 6a) and medium
optimization (Figure 8a); of these 7 layouts, 6 are bad in both cases. No optimization (Figure
9a) yields 1 bad layout, which is not among the 8 bad layouts for full optimization and medium
optimization.
(iv) Changing the input set usually does not change which layouts are bad, as the following two
examples show. First, for the 4-kB I-cache for espresso, of the four bad layouts in Figure 2a,
three are bad in Figure 1a and one is bad in Figure 3a. In addition, the bad layout in Figure 3a
is bad for all three input sets for the 16-kB I-cache. Second, for the D-cache for fpppp (Figures
13b and 14b), both input sets have the same 6 bad layouts for the 16-kB cache and the same 5 bad
layouts for the 64-kB cache.
How does the miss variation affect system performance?
When the mean measured miss rate m is between 0.3% and 7.0%, we often observed miss rates that
varied from 0.6m to 1.8m, which can affect system performance significantly. As a representative
example, consider the 1-kB I-cache for espresso (Figure 2a). Here, m = 2.3%, the lowest miss
rate is 1.4% (0.6m) and the highest miss rate is 3.7% (1.6m). Assuming a cache miss penalty of
4 cycles, the best layout has a cache miss penalty of 0.06 CPI while the worst layout has a cache
miss penalty of 0.15 CPI, for a difference of 0.09 CPI. For a processor with a base CPI of 0.7,
this difference amounts to a 13% penalty on system performance.
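The arithmetic in this example can be reproduced with the short sketch below. It assumes, as the example implicitly does, one I-cache access per instruction; the function name is hypothetical.

```python
def miss_cpi(miss_rate, miss_penalty_cycles, accesses_per_instruction=1.0):
    """Cycles per instruction lost to cache misses."""
    return miss_rate * miss_penalty_cycles * accesses_per_instruction


BASE_CPI = 0.7          # base CPI assumed in the text
MISS_PENALTY = 4        # cycles, as assumed in the text

best = miss_cpi(0.014, MISS_PENALTY)    # lowest miss rate, 1.4%
worst = miss_cpi(0.037, MISS_PENALTY)   # highest miss rate, 3.7%
penalty_percent = (worst - best) / BASE_CPI * 100

print(f"best {best:.2f} CPI, worst {worst:.2f} CPI, "
      f"penalty {penalty_percent:.0f}% of base CPI")
```

The unrounded difference is slightly above 0.09 CPI, which is why the penalty comes out at about 13% of the base CPI.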
To perform a "real-world" test of how layout affects execution time, we measured the execution
time of spice on 21 layouts. We used a Sun SPARC 5, which has a 4-kB 32-byte line I-cache
with a 6 cycle miss penalty and a 2-kB 16-byte line D-cache with a 4 cycle miss penalty. We
measured the execution time for each layout ten times, discarded the longest and shortest times,
and used the average of the remaining eight times. The (averaged) execution times ranged from
695.2 seconds to 727.4 seconds, with a mean of 708.0 seconds and a standard deviation of 8.5
seconds. Thus, execution times varied ±2.3% of the mean, which would not normally be
noticeable.
We also compared the preceding execution time variation to the variation we would expect based
on our previous measurements of miss variation. For spice on this cache configuration, the
miss rate standard deviation was 9% for the I-cache and 2% for the D-cache. We measured a
2.6% miss rate for the I-cache, a 38.4% miss rate for the D-cache, and found 35% of the
instructions were loads/stores. Assuming a CPI of 1.3 without cache effects, we expect
the CPI to be
$1.3 + (0.026 \pm 9\%)(6) + (0.35)(0.38 \pm 2\%)(4) = 1.99 \pm 0.025$
Thus we expected a variation of ±2.5% of the mean (±2 standard deviations), which is very close
to the measured variation (2.3%).
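The expected CPI and its spread can be checked with the following sketch. It reproduces the arithmetic of the text under one stated assumption: the two relative standard deviations (9% and 2%) are applied to their respective miss terms and added linearly, which is what yields the 0.025 figure. The function name is hypothetical.

```python
def expected_cpi(base_cpi, i_miss, i_penalty, d_miss, d_penalty, mem_fraction):
    """Return (total CPI, I-cache miss term, D-cache miss term)."""
    i_term = i_miss * i_penalty
    d_term = mem_fraction * d_miss * d_penalty
    return base_cpi + i_term + d_term, i_term, d_term


# Values taken from the text (D-cache miss rate rounded to 0.38 as in the formula).
cpi, i_term, d_term = expected_cpi(1.3, 0.026, 6, 0.38, 4, 0.35)

# One standard deviation of CPI: 9% of the I-cache term plus 2% of the D-cache term.
spread = 0.09 * i_term + 0.02 * d_term

# Expected variation at two standard deviations, as a percentage of the mean CPI.
two_sigma_percent = 2 * spread / cpi * 100

print(f"CPI = {cpi:.2f} +/- {spread:.3f}, "
      f"2 sigma = {two_sigma_percent:.1f}% of the mean")
```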
One concern about the general validity of our results is that we examined a very small subset of
all possible layouts, and there is no guarantee that another subset of layouts would produce similar
results. However, we have confidence in our results, as for both doduc and fpppp the results
for 21 layouts versus 100 layouts are essentially identical.
In summary, our data show that miss rates vary considerably on direct-mapped caches, with a
typical variation from 0.6m to 1.8m across just 21 layouts, where m is the measured mean miss
rate. We observed many layouts that had consistently poor miss rates on different caches, but we
found no consistently good layouts. Thus, when picking a layout, the problem is not so much
picking a good layout, but rather not picking a bad layout. In practice, when linking a
time-crucial program for a specific system, we recommend picking the best of five random
layouts. Experimentally, we found a bad layout occurs less than 1/3 of the time, so picking the
best of five layouts reduces the odds of a bad layout to under 0.5%.
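The best-of-five figure follows from a one-line probability calculation. The sketch below assumes each randomly chosen layout is bad independently with probability at most 1/3. The function name is illustrative.

```python
def prob_all_bad(p_bad, n_layouts):
    """Probability that every one of n independently chosen layouts is bad."""
    return p_bad ** n_layouts


# Upper bound from the text: a bad layout occurs less than 1/3 of the time.
p = prob_all_bad(1 / 3, 5)
print(f"chance that all five layouts are bad: {p:.2%}")
```

The best of five layouts is bad only if all five are bad, so this probability bounds the chance of ending up with a bad layout.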
We conclude with some suggestions for future work phrased as unanswered questions.
Is there a method to analytically predict the miss variation? We have started to extend the
gap model in this regard.
Is there a practical method to find a good layout? An exhaustive search is impractical
given the huge number of possible layouts. Furthermore, is there a consistently good
layout? Based on our results, there are no consistently good layouts, only consistently
average and bad ones. (We have ignored the experimental compiler/linker systems
[PH90, McF89], which seem to work, as these systems are not generally available.)
Do our results extend to other caches, programs, and machines? Our measurements were
limited, as we have measured data for only five SPEC benchmark programs on one
platform (SunOS 5.3, SPARC V8, SunOS cc). While we see no reason that other systems
should have significantly different results, our results should be confirmed.
References 
[AHH89] Anant Agarwal, Mark Horowitz, and John Hennessy, An Analytical Cache Model,
ACM Transactions on Computer Systems, 7(2):184-215, May 1989.
[BC72] Jean-Loup Baer and R. Caughey, Segmentation and Optimization of Programs from
Cyclic Structure Analysis, AFIPS, pages 23-36, 1972.
[CK93] Robert F. Cmelik and D. Keppel, Shade: A Fast Instruction Set Simulator for
Execution Profiling, Technical Report TR-93-12, Sun Microsystems Inc., July 1993.
[FL91] Charles N. Fischer and Richard J. LeBlanc, Crafting a Compiler with C,
Benjamin/Cummings, Reading, MA, 1991.
[GKP89] R. Graham, D. E. Knuth, and O. Patashnik, Concrete Mathematics,
Addison-Wesley, Reading, MA, 1989.
[Har88] Stephen J. Hartley, Compile-Time Program Restructuring in Multiprogrammed
Virtual Memory Systems, IEEE Transactions on Software Engineering,
14(11):1640-1644, November 1988.
[Ker71] Brian W. Kernighan, Optimal Sequential Partitions of Graphs, Journal of the ACM,
18:34-40, January 1971.
[McF89] Scott McFarling, Program Optimization for Instruction Caches, ASPLOS-III,
Boston, MA, April 3-6, 1989.
[OMHL93] Douglas B. Orr, Robert W. Mecklenburg, Peter J. Hoogenboom, and Jay Lepreau,
Dynamic Program Monitoring and Transformation Using the OMOS Object Server,
Hawaii International Conference of System Sciences, pages 232-241, January 1993.
Also available as technical report UUCS-92-034.
[PH90] Karl Pettis and Robert C. Hansen, Profile Guided Code Positioning, Programming
Language Design and Implementation, pages 16-27, White Plains, NY, June 20-22,
1990.
[Quo94] Russell W. Quong, Expected I-cache Miss Rates via the Gap Model, International
Symposium on Computer Architecture, pages 372-383, April 18-21, 1994.
[Sar89] Vivek Sarkar, Determining Average Program Execution Times and their Variance,
Programming Language Design and Implementation, pages 298-312, 1989.
Figures 1 through 16 each plot cache miss rate against cache size (256 bytes to 64 kB). In every
figure, panel a shows the instruction cache with 16-byte lines, panel b the data cache with
16-byte lines, panel c the instruction cache with 64-byte lines, and panel d the data cache
with 64-byte lines.
Figure 1: espresso, first input set.
Figure 2: espresso, second input set.
Figure 3: espresso, third input set.
Figure 4: espresso, medium optimization.
Figure 5: espresso, no optimization.
Figure 6: spice.
Figure 7: spice, procedure-level rearrangement.
Figure 8: spice, medium optimization.
Figure 9: spice, no optimization.
Figure 10: doduc.
Figure 11: doduc, 100 layouts.
Figure 12: gcc.
Figure 13: fpppp.
Figure 14: fpppp, second input set.
Figure 15: fpppp, 100 layouts.
Figure 16: fpppp, 4-way associative.