decreasing the miss rate), decreasing the set-associativity, or increasing the optimization level increases the miss variation. For a direct-mapped cache, the results in this paper call into question the validity of using a single layout (i) to determine the miss rate of a given program, (ii) to determine how a given compiler optimization affects the miss rate, and (iii) to make architecture design decisions based on the miss rate.

1 Introduction
When designing a memory system, computer architects simulate many programs to determine the miss rates resulting from various cache configurations. For a given cache configuration, the resultant miss rate depends on three factors: the program being executed, the input data, and the layout, namely the specific compile- and link-time mapping of the object code and data objects to memory addresses. To address the first two factors, standard benchmark suites (e.g. SPEC, Perfect, SPLASH) containing both specified programs and input sets have been developed and are commonly used.
This paper investigates the effects of the last factor, the layout. Different layouts have different miss rates, as changing the layout changes which instructions and data map to the same cache line(s).
Almost invariably only one layout is used when calculating the miss rates. The layout is affected by many factors including compiler optimization, the order in which the program is linked together, and the specific libraries in a system.
Our goal for this paper is to answer the following five questions:
1. Does measuring the miss rate for a single layout give an accurate estimate of the true miss rate?
2. What is the typical variation in the miss rates (the miss variation) for programs? Why do programs show this variation?
3. How do cache size, line size, associativity, optimization, and input set affect miss variation?
4. Are there consistently good or bad layouts as the cache size, line size, optimization, input set, and compiler are varied?
5. How does the miss variation affect system performance?
The impetus for this work is the gap model [1], which analytically predicts the "true" miss rate, namely the miss rate averaged over all possible layouts. As expected, there was some deviation between the miss rates predicted by the gap model and the miss rates measured from a cache simulation on the default layout. This paper explores the related issue of how much miss rates vary "randomly" due to layouts.
There are two practical reasons to purposely consider different layouts. First, the notion of a single standard layout for a program is erroneous. Compilers and system libraries evolve over time, affecting the layout in unforeseen ways. A minor change to a program may cause two frequently accessed data objects to conflict that did not previously conflict in the cache. A minor upgrade to the compiler, system libraries, or source code can result in a new, effectively random, layout. Second, this paper shows that layouts do affect the miss rate, which in turn can affect system performance. For time-critical applications, we feel that finding a layout with a low miss rate should be another standard compile/link optimization. This paper does not focus on algorithms or methods for improving layouts to lower the miss rate; rather, it focuses on the range of miss rates resulting from randomly chosen layouts. Before considering how to optimize the layout to lower the miss rate, one must know how much one can expect to lower the miss rate, both to know whether optimization is worthwhile and to have a reference for the quality of an optimization. This paper also does not examine the change to the miss rate resulting from changing the array-index to memory-address mapping (the intra-array relationship); that is, this paper does not examine such techniques as using blocking to improve matrix multiplication performance. Rather, when changing the "data layout" what is changed is where each scalar or array is laid out relative to one another (the inter-array relationship). This paper further assumes virtually addressed caches. For a physically addressed cache, the virtual-to-physical mapping may affect which addresses conflict in the cache [2]. As this mapping is set nondeterministically by the operating system at runtime, it is not considered in this paper.
The remainder of this paper is organized as follows. Section 2 discusses miss variation in detail, provides some needed definitions, and discusses previous work in this area. Section 3 describes the experiments performed to generate the data. Section 4 answers the five questions posed above and discusses some interesting observations. Section 5 presents conclusions and offers ideas for further research.
2 Background and Previous Work
For a program to have a high miss variation on an n-way set-associative cache requires n + 1 memory fragments that obey three properties. (A memory fragment is a code section or data object.) First, the fragments must be executed or accessed frequently. Second, the fragments must have high temporal locality, namely, after one fragment is referenced, another is likely to be referenced soon. Third, the relative addresses of the fragments must be layout-dependent, namely in some but not all layouts the fragments map to the same cache line(s).
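The third property can be illustrated with a minimal sketch for a direct-mapped cache. The helper names (`cache_lines`, `fragments_conflict`) and the fragment addresses are hypothetical, chosen only to show how the same two fragments can conflict in one layout and not in another.

```python
def cache_lines(start, size, cache_size, line_size):
    """Return the set of direct-mapped cache line indices a fragment occupies."""
    num_lines = cache_size // line_size
    first_block = start // line_size
    last_block = (start + size - 1) // line_size
    return {block % num_lines for block in range(first_block, last_block + 1)}


def fragments_conflict(frag_a, frag_b, cache_size, line_size):
    """True if two (start, size) fragments share at least one cache line."""
    lines_a = cache_lines(*frag_a, cache_size, line_size)
    lines_b = cache_lines(*frag_b, cache_size, line_size)
    return bool(lines_a & lines_b)


# Two hypothetical 256-byte fragments, 4-kB direct-mapped cache, 16-byte lines.
# Layout 1 places them exactly one cache size apart; layout 2 shifts the second
# fragment by a further 256 bytes.
print(fragments_conflict((0, 256), (4096, 256), 4096, 16))  # True
print(fragments_conflict((0, 256), (4352, 256), 4096, 16))  # False
```

Whether such a conflict actually raises the miss rate still depends on the first two properties: the fragments must be accessed frequently and close together in time.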
The only layout-dependent data objects are static (global) scalars and arrays. In contrast, the addresses of stack-allocated variables (locals) and heap-allocated variables are not affected by the layout. The addresses of heap-allocated objects are determined at run-time, for example by calls to malloc(). As rearranging the code does not affect the order in which the calls to malloc() are made, changing the layout does not change the addresses of heap-allocated variables. (It is, however, an open question what effect changing the order of the calls to malloc() would have on the miss rate.) Similarly, the relative addresses of elements of the same array are not affected by the layout; the addresses of a[5] and a[1005] will always differ by 4000 bytes (for 4-byte elements), irrespective of the layout. All code from different basic blocks is layout dependent.
A program consists of modules (files); a module consists of procedures; and a procedure consists of basic blocks. Usually when generating the object file, the compiler places procedures and their constituent basic blocks consecutively in memory based on their order in the source module. Similarly, the linker places object files consecutively in memory based on their order in the link command. Rarely is there any thought given to the ordering, both because of and despite the fact that procedures and modules can be arbitrarily ordered in a program.
The layout of a program can be modified at these three granularities: module, procedure, and basic block. While the order of modules and procedures can be rearranged arbitrarily, rearranging basic blocks requires adding unconditional jumps to preserve the original control flow. These unconditional jumps change the program execution by making the code larger and by changing the instruction stream, and can change the optimization performed by the compiler due to changing the basic block order. These changes can introduce a significant change in the miss rate, obscuring the change in the miss rate due to the layout. For these reasons, this paper does not present results for rearranging at the basic block level.
For a program with M modules and P procedures, there are M! module-level and P! procedure-level rearrangements; even for small M and P there are many possible layouts. For layouts that contain holes or unused addresses the number of possible layouts is unlimited. In practice, a code or data segment is contiguous (no holes) to minimize memory and disk space usage.
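To give a sense of scale, the short sketch below counts hole-free layouts for a few module counts. The function name `count_layouts` and the module counts are illustrative only.

```python
import math


def count_layouts(num_units):
    """Number of hole-free orderings of num_units modules (or procedures)."""
    return math.factorial(num_units)


for m in (5, 10, 20):
    print(m, count_layouts(m))
# 5 120
# 10 3628800
# 20 2432902008176640000
```

Even a 20-module program has far more orderings than could be simulated exhaustively, which is why a sample of random layouts is used instead.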
We expect the miss variation to increase as the program is rearranged at finer granularities, due to intra-module and intra-procedure locality. Namely, after referencing a procedure in a module the probability of referencing another procedure in the same module soon is increased. By default, the compiler places intra-module procedures consecutively in memory, which guarantees they do not conflict, unless the procedure is larger than the cache, and thus wraps around. Procedure-level rearranging removes this guarantee. This argument applies to an even greater extent when rearranging at the basic-block level, as there is usually stronger temporal locality among intra-procedural basic blocks than among intra-module procedures.
We expect a direct-mapped cache to exhibit the greatest variation, with variation decreasing as the set-associativity increases. In the extreme case, a fully associative cache with LRU replacement and 1-byte lines is insensitive to the layout. In addition, we expect decreasing the compiler optimization level to decrease the miss variation of the data cache, as a standard optimization is removing redundant loads and stores [3]. A redundant load/store to address a duplicates a previous reference to a, so a is very likely to be in the cache. Thus, redundant loads/stores are "sure" cache hits, reducing both the miss rate and the miss variation, but increasing the number of cache accesses.
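The contrast between the two extremes can be checked with a toy LRU cache model. This is a minimal sketch, not the simulator used in the paper; the function `miss_count`, the 64-byte cache, and the two-address traces are all hypothetical.

```python
from collections import OrderedDict


def miss_count(addresses, cache_size, line_size, ways):
    """Count misses for a byte-address trace on a set-associative LRU cache."""
    num_sets = cache_size // (line_size * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // line_size
        lru = sets[block % num_sets]
        if block in lru:
            lru.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least recently used block
            lru[block] = True
    return misses


# The same access pattern (alternate between two objects) under two layouts.
layout_a = [0, 64] * 100   # second object 64 bytes away: same line when direct-mapped
layout_b = [0, 80] * 100   # second object 80 bytes away: different line

# 64-byte direct-mapped cache with 16-byte lines: the layout matters.
print(miss_count(layout_a, 64, 16, 1), miss_count(layout_b, 64, 16, 1))  # 200 2
# 64-byte fully associative cache with 1-byte lines: the layout does not matter.
print(miss_count(layout_a, 64, 1, 64), miss_count(layout_b, 64, 1, 64))  # 2 2
```

With 1-byte lines and full associativity, hits and misses depend only on the order in which distinct bytes are reused, so renaming addresses cannot change the miss count.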
A parameter is any of the factors we change; in this paper the parameters are the cache configuration (line size, size, or set-associativity), the compiler optimization level, the compiler itself, and the input set. We call a layout good (bad) if it yields a miss rate, for a given set of parameters, that is significantly lower (higher) than the average of the miss rates of the other layouts. We call a layout extreme if the layout is either good or bad. Note that a layout that is bad for one set of parameters might be good for another set of parameters. The default layout of a program is the layout that results from compiling and linking the source as distributed. Typically, a "makefile" specifies the compile and link commands. The default layout is arbitrary in the sense that miss rates are not considered when ordering the modules for linking.
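The good/bad definition can be sketched as a leave-one-out comparison. The text does not quantify "significantly", so the factor of 2 used here is purely an assumption, as are the function name `classify_layouts` and the sample miss rates.

```python
def classify_layouts(miss_rates, factor=2.0):
    """Label each layout by comparing it with the mean of the other layouts.

    factor is a hypothetical threshold standing in for "significantly".
    """
    total = sum(miss_rates)
    n = len(miss_rates)
    labels = []
    for rate in miss_rates:
        others_mean = (total - rate) / (n - 1)
        if rate > factor * others_mean:
            labels.append("bad")
        elif rate < others_mean / factor:
            labels.append("good")
        else:
            labels.append("typical")
    return labels


# Hypothetical miss rates (in percent) for five layouts.
print(classify_layouts([2.0, 2.1, 1.9, 6.0, 0.5]))
# ['typical', 'typical', 'typical', 'bad', 'good']
```

A layout labelled "good" or "bad" here is what the text calls an extreme layout; the labels apply only to the parameter set for which the miss rates were measured.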
We found no reference in the literature to how randomly chosen layouts affect the miss rate, and very few references to variation in execution time due to the layout in general. In [4], Sarkar discusses execution time variation caused by loops processing different data in the context of picking the optimal grain size for parallelization. Various cache models have been proposed [5, 1], but none of these models consider miss variation.
The idea of rearranging code to improve cache performance is not new. Pettis and Hansen [6] used two compiler optimizations to maximize cache reuse, and thus improve cache (and TLB and virtual memory) performance; the focus was not on reducing cache conflicts. Their first optimization used a "closest is best" heuristic to arrange procedures based on usage counts of the call graph. Their second optimization split each procedure into two new procedures, one of commonly used basic blocks and one of less commonly used basic blocks ("fluff") such as error handling code. Within each of the two newly created procedures, they again used a "closest is best" heuristic to place basic blocks based on usage counts of the internal control-flow graph for that procedure. Unfortunately, they did not report their improvement in the miss rate, only the improvement in execution time (10%-26%). McFarling [7] reordered basic blocks using compile-time usage estimates and showed that a direct-mapped cache could perform better than a set-associative cache if instructions could selectively be excluded from the cache. Hwu and Chang [8] used profile-guided rearrangement of code at the "trace" level, where they define a trace to be several basic blocks that are usually executed in sequence. Unfortunately, their results are inconclusive at best because Chen et al. [9] use the same code rearrangement strategy as had been used previously [8]; averaging over seven small benchmarks (15-kB to 140-kB static code size, fewer than 150 million instructions executed), they found a 25% improvement in the miss rate for moderate (5%) miss rates, and no improvement for low (<1%) miss rates. Gloy et al. [10] refine previous methods by introducing the notion of "temporal ordering". That is, assume Procedure A calls both Procedure X and Procedure Y frequently.
When determining the performance penalty of placing Procedure X and Procedure Y into the same cache line(s), it is important to know the ordering of the calls to X and Y. For example, if Procedure X is called only in the first half of program execution and Procedure Y is called only in the second half, there is no performance penalty; in contrast, if the procedure calls alternate between Procedure X and Procedure Y, the penalty is high. They found their method improves on previous methods in five of the six benchmark programs used, taken from SPECint95. A similar concept was presented in [11]. Torrellas et al. [12] used a procedure similar to that used in [8] to determine whether reordering the operating system will reduce the instruction cache miss rate; the primary difference is they reorder basic blocks across the entire program, not just within procedures. (They did not actually reorder the operating system; rather, they simulated the effects had they reordered the operating system.) They found a 31%-86% (0.8-3.9 percentage point) improvement in the instruction cache miss rate. Truong et al. [13] computed the correlation between each data object; highly correlated objects were placed into the same cache-line sized block of memory, and highly correlated blocks were placed in memory so that they would map to different lines in the cache. Using four benchmark programs (ls, f77, 078.swm256, and gzip), the miss rate decreased to 24%, 67%, 75%, and 44%, respectively, of its original value on a direct-mapped cache. In addition, they found a direct-mapped cache on a modified data layout can outperform a fully-associative cache with perfect prediction (i.e. replace the line that will be used furthest in the future) on an unmodified layout. Their results are, however, optimistic in that for a given program they used the same input set to both generate the improved data layout and to measure the benefit of the improved data layout. Hashemi et al.
[14] use an approach similar to the coloring method used for register allocation. However, their method requires knowing the cache parameters, requiring a re-compile or re-link for each new cache organization. This type of code rearrangement is also similar to the well-studied problem of rearranging the address map to minimize paging [15, 16, 17, 18].
This paper is unique in that it is the first to consider the range of miss rates resulting from randomly chosen layouts and only the second to consider the effect layout has on the data cache miss rates. Gloy et al. [10] generate different layouts by adding random noise to the weights on the call graph, and applying their procedure to generate a layout; however, this was done to model the effects of different input sets, rather than, as here, to see the range of miss rates.
3 Experimental Methodology
Eight types of experiments were run to empirically answer the questions posed in the introduction. Five programs from the SPEC92 benchmark suite were chosen, shown in Table 1. Two are C programs, espresso and gcc, and three are FORTRAN programs, spice, doduc, and fpppp. All experiments were run under SunOS 5.3 and compiled with cc and f77 using static linking. Cachesim5 5.2 and shade 4.1.6 [19] were used to calculate the miss rates. To run the experiments, we wrote shell scripts to generate different layouts, change the compile/link flags, and change the input set used in the SPEC makefile.
All of the experiments were run on the same 10 cache configurations, which consist of all pairs of five cache sizes (256 bytes, 1 kB, 4 kB, 16 kB, and 64 kB) and two line sizes (16 bytes and 64 bytes). Unless otherwise noted in Table 2, the miss rates were measured for the same 21 layouts (20 random layouts and the default layout) on these 10 cache configurations; the programs were compiled with cc (version 2.0.1) or f77 with full optimization (-O4), a direct-mapped cache was simulated, and the input sets shown in Table 1 were used. The random layouts were generated by changing the module order specified to the linker. The eight experiment types are described below and summarized in Table 2.
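The experimental grid can be sketched as follows. The paper's own layouts were produced by shell scripts; this Python version is only an illustration, and the function names, the seed, and the module file names are assumptions.

```python
import itertools
import random

CACHE_SIZES = [256, 1024, 4096, 16384, 65536]  # bytes
LINE_SIZES = [16, 64]                           # bytes


def cache_configurations():
    """All (cache size, line size) pairs: 5 sizes x 2 line sizes = 10."""
    return list(itertools.product(CACHE_SIZES, LINE_SIZES))


def make_layouts(modules, num_random=20, seed=0):
    """Return the default link order followed by num_random shuffled orders."""
    rng = random.Random(seed)  # fixed seed so the layouts are reproducible
    layouts = [list(modules)]
    for _ in range(num_random):
        order = list(modules)
        rng.shuffle(order)
        layouts.append(order)
    return layouts


# Hypothetical object files; a real program would list its own modules.
modules = [f"mod{i}.o" for i in range(1, 9)]
layouts = make_layouts(modules)
print(len(cache_configurations()), len(layouts))  # 10 21
```

Each layout would then be passed to the linker as the module order, and each resulting executable simulated on all 10 configurations.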
The first type of experiment determined the baseline miss variation. To do so, the default parameters listed above were used to measure the miss rate.
The second type of experiment determined the effect of the input set on the miss variation. The miss rate was measured for different input sets for espresso (bca.in and tial.in) and fpppp (input.ref/natoms).
The third type of experiment determined the effect the compiler optimization level has on the miss variation. The miss rate was measured for espresso and spice compiled with medium optimization (-O2) and with no optimization, instead of full optimization (-O4) which was otherwise used. Full optimization (on the SUN compilers used) can cause the following problems: (i) an increase in code size due to loop unrolling and inlining; (ii) a significant increase in compile time; and (iii) incorrect code being generated by a risky optimization. For these reasons, many programmers use medium optimization instead. Compiling with no optimization is common during debugging. Table 3 shows how the three optimization levels affect the size of the executable, the number of instructions executed, and the number of data references (the number of loads/stores). Note that shifting from no optimization to medium optimization reduced the number of instructions executed by at least a third in all cases.
The fourth type of experiment determined the effect the compiler version has on the miss variation. The miss rate was measured using a newer version of the SUN ANSI C compiler (cc) to compile and link espresso with full optimization and the second input set, bca.in.
The fifth type of experiment determined the effect a small perturbation, such as would be caused by a change to a compiler optimization or the system libraries, has on the miss variation. The miss rate was measured after inserting a dummy module between every fifth module when linking espresso (using the new cc compiler and the bca.in input set). Each dummy module consisted of a small amount of text (code) and a small amount of data. In total, eight dummy modules were inserted into each layout, increasing the text size from 300,340 bytes to 300,916 bytes, and increasing the data size from 9,160 bytes to 9,896 bytes.
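The perturbation can be sketched as a transformation of the link order. The exact placement rule is not spelled out beyond "every fifth module", so this sketch assumes a dummy is inserted after each complete group of five modules, and never at the very end; `insert_dummies` and the file names are hypothetical.

```python
def insert_dummies(modules, every=5, prefix="dummy"):
    """Return a new link order with a dummy module after each group of `every` modules."""
    result = []
    count = 0
    for position, module in enumerate(modules, start=1):
        result.append(module)
        # Only insert between modules, never after the last one (an assumption).
        if position % every == 0 and position < len(modules):
            count += 1
            result.append(f"{prefix}{count}.o")
    return result


modules = [f"m{i}.o" for i in range(1, 12)]  # 11 hypothetical modules
print(insert_dummies(modules))
# ['m1.o', 'm2.o', 'm3.o', 'm4.o', 'm5.o', 'dummy1.o',
#  'm6.o', 'm7.o', 'm8.o', 'm9.o', 'm10.o', 'dummy2.o', 'm11.o']
```

Because each dummy adds a small amount of text and data, every module linked after it shifts to a slightly different address, which is the perturbation being studied.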
The sixth type of experiment determined the effect the cache set-associativity has on the miss variation. The miss rate was measured for fpppp with a four-way set-associative cache. We chose fpppp because its data cache miss variation was greatest for the direct-mapped cache.
The seventh type of experiment confirmed that the variation observed in the previous experiments was due to the program itself and not to the small number (21) of layouts chosen. The miss rates for doduc and fpppp were measured for 100 randomly chosen layouts, rather than the 21 used previously. We chose these two programs because they represented extremes in experiment one: doduc yielded a low, evenly spread-out variation with no outlying points, while fpppp yielded a wide, unevenly distributed variation.
The eighth type of experiment determined the effect rearranging at the procedure level, as opposed to the module level used in the other experiments, has on the miss variation. Twenty-one new layouts were generated by splitting the existing modules into new modules so that each new module contained exactly one procedure. The new modules were then linked in 21 random orders. We chose spice because we wanted a FORTRAN program, as FORTRAN programs are easier to automatically break into procedures than C programs (due to FORTRAN's lack of a file namespace and ease of finding the BEGIN/END of FORTRAN procedures), and of the three FORTRAN programs only spice has multiple procedures per module.
Neither experiment seven nor experiment eight provided any new information. The results for 100 layouts (experiment seven) were essentially identical to the results for 21 layouts. Rearranging spice at the procedure level (experiment eight) resulted in a slight decrease in the miss variation for 16-byte lines and a slight increase in the variation for 64-byte lines, but in both cases the difference was minor. For space reasons we omit the results for these two experiments; refer to [20] for more details.
4 Results and Analysis
We now return to the five questions posed in the introduction. In doing so, we will make reference to many of the figures, which are summarized in Table 4. There are two types of figures: the baseline figures show the absolute miss rate on the Y-axis, and the comparison figures show the miss rate relative to the baseline on the Y-axis. To aid comprehension all figures use the same log-log scale: the X-axis shows the size of the cache and the Y-axis shows the absolute or relative miss rate. The baseline miss rate is shown over the range 0.3% to 30%, which is likely to be the range of interest for performance evaluation. When the miss rate is above 30%, performance is likely to be poor irrespective of the variation of the miss rate. Similarly, unless the miss penalty is hundreds of cycles, when the cache miss rate is less than 0.3% the miss variation will have little impact on program execution time. For the comparison figures, the miss rates are shown over a factor of 16; a value of 1 means the two experiments yielded the same miss rate for that layout and those parameters, while a value of 2 means changing the parameter doubled the miss rate. The miss rates of the default layout are indicated with filled squares.
4.1 Does measuring the miss rate for a single layout give an accurate estimate of the true miss rate?
Not always. The best example is the 64-kB data cache for fpppp (Figure 5); merely changing the order of the modules in the link command can change the miss rate from 0.2% to 3%, over a factor of 10. A second example is changing the compiler version (Figure 7). While the instruction miss rate averaged over all layouts remains approximately unchanged, some layouts yield a four-fold decrease in the miss rate, while other layouts yield a two-fold increase in the miss rate, due solely to changing the compiler version.
Overall, however, for a direct-mapped cache in about 70% of the cases a single layout gives an accurate estimate of the true miss rate. Certain benchmarks, such as gcc, show almost no miss variation over a wide range of parameters. The other four benchmarks we used (espresso, spice, doduc, and fpppp) show significant variation for some parameters, but little to no variation for other parameters. For a 4-way (and we assume higher) set-associative cache, the miss rate showed no significant variation, so measuring a single layout is probably an accurate estimation of the true miss rate on a 4-way set-associative cache.
The inaccuracies caused by measuring the miss rate for only one layout can affect design decisions, as different layouts can give different answers. For example, consider using fpppp to determine whether it is worthwhile to increase the data cache from 16 kB to 64 kB (Figure 5c). Fifteen layouts say "definitely yes" while six, including the default layout, say "definitely no".
4.2 What is the typical miss variation for programs?
The experiments show that there is no typical miss variation; these five programs exhibit a wide variety of variation. In general, we expect most programs to show some variation in their I-cache miss rate. As discussed in Section 2, for the D-cache miss rates to show variation requires multiple, frequently accessed, statically allocated (e.g. not allocated using malloc()) scalars or arrays. We note that three of the five benchmark programs (espresso, spice, and doduc) show variation in the I-cache miss rate, and two of the five benchmark programs (doduc and fpppp) show variation in the D-cache miss rate. Although it is possible to determine exactly which blocks (for the I-cache) or arrays (for the D-cache) are causing this variation, in the absence of a compiler/linker system that can use such information, such a determination is moot. Gcc showed no variation in either the I-cache or D-cache miss rates, but the miss rate for gcc is calculated by taking the average miss rate over 19 input sets, which may have averaged out any variation. Espresso did not show any variation in the D-cache, which is expected as almost all the data structures are dynamically allocated. Fpppp did not show any variation in the I-cache, as fpppp has a very large (approximately 32 kB) procedure that consists of one basic block. For a cache smaller than this procedure, the miss rate will be uniformly high, as that procedure will always miss. For a cache larger than this procedure, the miss rates are low.
We calculated the standard deviation for all the figures using the equation for the standard deviation for measured data from a normal distribution, $s = \sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)}$ [21]. We observed that the miss rates tend to fall into three categories. (i) When the miss rate is high (>10%), in only one case (Figure 15d, 16-kB cache) did the standard deviation exceed 10% of the mean. (ii) When the miss rate is moderate (1%-10%), which is a typical miss rate for modern L1 caches, the standard deviation ranged from 0% to 65% of the mean. (iii) When the miss rate is low (<1%), the standard deviation often exceeded the mean.
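The statistic reported above, the sample standard deviation expressed as a fraction of the mean, can be computed as in this minimal sketch. The function name `relative_std` and the four miss rates are hypothetical; they are not measurements from the paper.

```python
import statistics


def relative_std(miss_rates):
    """Sample standard deviation (n - 1 denominator) as a fraction of the mean."""
    return statistics.stdev(miss_rates) / statistics.mean(miss_rates)


# Hypothetical miss rates (in percent) for four layouts.
rates = [2.0, 2.5, 3.0, 2.5]
print(f"{relative_std(rates):.1%}")  # 16.3%
```

`statistics.stdev` uses the n - 1 denominator, which matches the estimator for measured data described in the text.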
4.3 How do (i) cache size, (ii) line size, (iii) set-associativity, (iv) optimization, and (v) input set affect miss variation?
(i) The variation is low at small cache sizes (high miss rate); the variation either increases or remains low as the cache size increases. This is true for all figures.
(ii) The two line sizes (16 and 64 bytes) yield nearly identical variations. The only exceptions were two cases in which the miss rates split into two groups for 64-byte lines but not for 16-byte lines (Figure 13c versus Figure 13d and Figure 15c versus Figure 15d).
(iii) Direct-mapped caches (Figure 5) show much more variation than four-way set-associative caches (Figure 6), which show no variation.
(iv) Decreasing the compiler optimization level usually decreases the variation; two examples and one counter-example are given. First, the 4-kB and 16-kB I-caches for espresso with full optimization (Figure 1a) yield several bad layouts; medium optimization (Figure 12a) reduces the variation; no optimization (Figures 5a) reduces it even further. Second, the 16-kB I-cache for spice with both full optimization (Figure 2a) and medium optimization (Figure 13a) yields miss rates with significant variation, while no optimization (Figure 15a) yields miss rates with little variation. One counter-example is the 64-kB I-cache for espresso, for which no optimization (Figure 14a) yields a higher variation (six layouts yield a miss rate greater than 0.3%) than either full optimization (Figure 1a) or medium optimization (Figure 12a) (for which no layouts yield a miss rate greater than 0.3%).
(v) Over all the datasets, in only one case did changing the input set change the variation. That case is the 1-kB I-cache for espresso, two input sets (Figure 1a
4.4 Are there consistently good or bad layouts as the cache size, line size, optimization, input set, and compiler are varied?
There were no consistently good layouts, namely layouts that were good for a wide variety of parameters. There were many consistently bad layouts. The effect changing parameters has on the miss variation can easily be seen in Figures 7-15. The greater the range of values for a given cache size, the more changing the parameters changes the miss rate. Consider Figure 9b; it shows that changing the dataset from ti.in to bca.in results in both an overall decrease in the miss rate, and, more relevant to this paper, that different layouts are affected differently by changing the input set. This has the important practical result that if one is trying to determine the instruction cache miss rate of different input sets, which layout is chosen will change the answer. In contrast, consider Figure 9c. While changing the input set results in an increase in the data cache miss rate, all layouts are changed equally. This means that if one is trying to compare the data cache miss rates of different input sets, all layouts give the same answer.
(i) Changing the cache size does not change which layouts are bad; we give two examples. First, for the D-cache for fpppp (Figure 5c), of the 6 bad layouts for the 16-kB cache, 5 are bad for the 64-kB cache. Second, for the I-cache for spice (Figure 2a), of the 7 bad layouts for the 64-kB cache, 6 are bad or yield above average miss rates for the 16-kB cache.
(ii) Line size has almost no effect on which layouts are bad. Almost invariably, layouts that are bad for one line size are bad for the other.
(iii) Changing the optimization level does affect which layouts are bad; this can be seen by the wide range of values in the instruction caches for Figures 12-15. Consider the following three examples. First, for the 4-kB I-cache for espresso, both full optimization (Figure 1a) and medium optimization (Figure 12a) have 3 bad layouts, but they are a different 3 layouts. Second, for the 64-kB I-cache for espresso, full optimization yields 2 bad layouts, medium optimization yields no bad layouts, and no optimization (Figure 14a) yields 6 bad layouts. Furthermore, the 2 bad layouts for full optimization are not among the 6 bad layouts for no optimization. Third, for the 64-kB I-cache for spice, 7 layouts are bad for both full optimization (Figure 2a) and medium optimization (Figure 13a); of these 7 layouts, 6 are bad in both cases. No optimization (Figure 15a) yields 1 bad layout; this layout is not one of the 8 layouts that are bad for full optimization or for medium optimization.
(iv) Changing the input set has little effect on which layouts are bad; we give two examples. First, for the 4-kB I-cache for espresso, of the four bad layouts for the second input set (Figure 9a), three are bad for the first input set (Figure 1a) and one is bad for the third input set (Figure 10a). In addition, for the 16-kB I-cache, the same layout is bad for all three input sets. Second, for the D-cache for fpppp, both the first input set (Figure 5c) and the second input set (Figure 11c) have the same 6 bad layouts for the 16-kB cache and the same 5 bad layouts for the 64-kB cache.
(v) Changing the compiler version does affect which layouts are bad. For example, for the 4-kB cache for espresso (Figure 9a), 4 layouts are bad. The newer version of the compiler (Figure 7a) also yields 4 bad layouts, but only 1 layout is bad in both cases. Even a minor change, such as adding fewer than 600 bytes of code to a 300 kB program (Figure 8a), decreases the number of bad layouts to 3. In fact, the default layout went from being a good layout to being a bad layout when changing compiler version.
4.5 How does the miss variation affect system performance?
Determining the effect the miss variation has on system performance requires assuming parameters for the CPI and for the miss penalty. Clearly, the lower the CPI and the greater the miss penalty chosen, the greater effect the miss variation will have on performance. We pick a cache miss penalty of 5 cycles and a base CPI of 1.2, values which we feel are appropriate to the benchmark programs used (from SPEC92), although they are not representative of current processors. Based on these parameters, we found that the miss variation has little effect on system performance. Even in a high variation case, such as fpppp with 64-byte lines and 16-kB instruction and data caches (Figure 5b and Figure 5d), we estimate only a 6% increase in execution time going from a good layout (2.661% I-cache miss rate, 0.830% D-cache miss rate) to a bad layout (3.254% I-cache miss rate, 4.197% D-cache miss rate), assuming a 5-cycle miss penalty, a base CPI (without cache effects) of 1.2, and that 1/3 of the instructions are loads/stores. Picking values appropriate for recent processors, such as a base CPI of 0.5 and a miss penalty of 10 cycles, results in a 22% performance improvement going from a bad layout to a good layout.
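The two estimates can be reproduced with a simple additive CPI model. This is a sketch under the assumption that every instruction makes one I-cache access, that the stated fraction of instructions makes one D-cache access, and that each miss costs the full penalty; the function names are hypothetical.

```python
def cpi(base_cpi, i_miss, d_miss, penalty, ls_fraction=1 / 3):
    """Base CPI plus instruction-fetch and load/store miss stall cycles."""
    return base_cpi + i_miss * penalty + ls_fraction * d_miss * penalty


def slowdown(base_cpi, penalty):
    """Relative increase in execution time from the good to the bad fpppp layout."""
    good = cpi(base_cpi, 0.02661, 0.00830, penalty)
    bad = cpi(base_cpi, 0.03254, 0.04197, penalty)
    return bad / good - 1.0


print(f"{slowdown(1.2, 5):.1%}")   # 6.4%, i.e. roughly the 6% quoted
print(f"{slowdown(0.5, 10):.1%}")  # 21.6%, i.e. roughly the 22% quoted
```

The second case shows why the same miss variation matters more on a processor with a lower base CPI and a longer miss penalty: the cache stall term makes up a larger share of the total CPI.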
To illustrate a more typical case, we performed a "real-world" estimate of how the layout affects execution time by measuring the execution time of spice for the same 21 layouts. We chose spice because it is one of the longer running benchmarks, which reduces the effects of other forms of execution time variation. We used a Sun SPARC 5, which has a 4-kB 32-byte line I-cache with a 6-cycle miss penalty and a 2-kB 16-byte line D-cache with a 4-cycle miss penalty. We measured the execution time for each layout ten times, discarded the longest and shortest times, and used the average of the remaining eight times. The averaged execution times ranged from 695.2 seconds to 727.4 seconds for the 21 layouts; thus, execution times varied by ±2.3% of the mean, which would not be noticeable. The mean was 708.0 seconds and the standard deviation was 8.5 seconds, so the standard deviation was 1.2% of the mean.
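The averaging procedure and the reported spread can be sketched as follows. This is an illustration only: the ten run times in the usage example are hypothetical, and the helper names are ours.

```python
from statistics import mean


def trimmed_mean(times):
    """Discard the single longest and single shortest run, then average the rest."""
    if len(times) < 3:
        raise ValueError("need at least three runs")
    ordered = sorted(times)
    return mean(ordered[1:-1])


def half_range_pct(lowest, highest, overall_mean):
    """Half of the min-to-max spread, expressed as a percentage of the mean."""
    return 100.0 * (highest - lowest) / 2.0 / overall_mean


if __name__ == "__main__":
    # Hypothetical run times (seconds) for one layout, with two outliers
    runs = [700, 702, 698, 705, 701, 699, 703, 704, 650, 760]
    print(trimmed_mean(runs))

    # Figures reported in the text for the 21 layouts
    print(f"spread: +/-{half_range_pct(695.2, 727.4, 708.0):.1f}% of the mean")
    print(f"std dev: {100.0 * 8.5 / 708.0:.1f}% of the mean")
```

Dropping the two extreme runs is a simple guard against one-off disturbances such as other system activity during a run.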
The preceding execution time variation was also compared to the variation expected based on the previous measurements of miss variation. For spice on this cache configuration, the standard deviation in the miss rate was 9% for the I-cache and 2% for the D-cache. We measured a 2.6% miss rate for the I-cache, a 38.4% miss rate for the D-cache, and found 35% of the instructions were loads/stores. Assuming a CPI of 1.2 without cache effects, the expected CPI is
$$1.2 + (0.026 \pm 9\%)(6) + (0.35)(0.38 \pm 2\%)(4) = 1.89 \pm 0.025$$
Thus we expected the standard deviation to be 1.3% of the mean, which is very close to the measured standard deviation of 1.2%.
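The expected CPI and its spread can be checked numerically. The sketch below uses the values that appear in the CPI expression (a load/store fraction of 0.35 and a D-cache miss rate of 0.38). It assumes that the two stall-term deviations are added linearly, which is the combination that reproduces the 0.025 figure; adding them in quadrature would give a smaller value. The function name is ours.

```python
def expected_cpi(base, i_miss, i_rel_sd, i_penalty,
                 mem_frac, d_miss, d_rel_sd, d_penalty):
    """Return (expected CPI, spread), with each stall term scaled by its relative deviation."""
    i_term = i_miss * i_penalty             # I-cache stall cycles per instruction
    d_term = mem_frac * d_miss * d_penalty  # D-cache stall cycles per instruction
    cpi = base + i_term + d_term
    # Assumption: the deviations of the two terms are summed linearly.
    spread = i_term * i_rel_sd + d_term * d_rel_sd
    return cpi, spread


if __name__ == "__main__":
    cpi, spread = expected_cpi(base=1.2,
                               i_miss=0.026, i_rel_sd=0.09, i_penalty=6,
                               mem_frac=0.35,
                               d_miss=0.38, d_rel_sd=0.02, d_penalty=4)
    print(f"CPI = {cpi:.2f} +/- {spread:.3f} ({100.0 * spread / cpi:.1f}% of the mean)")
```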
Conclusion
In summary, the data show that miss rates for direct-mapped caches vary considerably, with a typical variation in the miss rate of 60% to 180% of the mean measured miss rate across just 21 random layouts. We observed many layouts that had consistently poor miss rates on different caches, but we found no consistently good layouts. Overall, cache line size and input set have little effect on the miss variation, while increasing the cache size, decreasing the set-associativity, or increasing the optimization level increased the miss variation.
There are three major conclusions from this work.
• For direct-mapped caches, the miss rate should not be measured using only one layout. For 4-way (and most likely higher) set-associative caches, using only a single layout is acceptable. However, even for direct-mapped caches, execution time is not strongly affected by miss variation for the cache miss penalty (5 cycles) and base CPI (1.2) chosen.
• Architects should be aware of miss variation when making design decisions, such as selecting the cache size, or judging the effectiveness of layout optimization, based on results gathered from only one layout.
• When picking a layout to lower the miss rate, the concern is not so much picking a good layout, but rather not picking a bad layout. We found a bad layout occurs less than one-third of the time, so picking the best of five layouts reduces the odds of a bad layout to under 0.5%.
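The arithmetic behind the best-of-five claim is short enough to check directly. The sketch below assumes that layouts are drawn independently and that each is bad with probability at most one-third; the function name is ours.

```python
def p_all_bad(p_bad, n_layouts):
    """Probability that every one of n independently chosen layouts is bad."""
    return p_bad ** n_layouts


if __name__ == "__main__":
    # The best of five layouts is bad only if all five are bad.
    print(f"{p_all_bad(1 / 3, 5):.2%}")
```

Since one-third is an upper bound on the chance of a bad layout, the result is an upper bound as well.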
Running a number of random layouts is a very time-consuming method of measuring the miss variation. One area for future research is to determine whether a method exists to analytically predict the miss variation. Towards that end, we have started to extend the gap model to estimate the miss variation.
1 While recent processors would likely show a greater effect due to their greater cache miss penalty and lower base CPI, we picked values appropriate to the benchmark programs (i.e. SPEC92) used. In addition, rearranging at the basic block level would likely result in more variation, as evidenced by previous research that found rearranging at the basic block level to provide a greater performance increase than the rearranging at the module level we performed.
Another area for future investigation is finding a practical method to find a good layout. An exhaustive search is impractical given the huge number of possible layouts. Furthermore, are there consistently good layouts? Based on our results, there are no consistently good layouts, only consistently average and bad ones. The experimental compiler/linker systems listed in Section 2 appear promising, and should be pursued further. However, their lack of general availability, especially in commercial systems, makes comparison difficult.
The four graphs per figure are, from left to right, the miss rate for the instruction cache for 16-byte lines, the miss rate for the instruction cache for 64-byte lines, the miss rate for the data cache for 16-byte lines, and the miss rate for the data cache for 64-byte lines. Each line is the miss rate for a given program layout for various cache sizes. Full details on the parameters for each graph are provided in Table 4. Figures 1-5 show the baseline miss rate for a direct-mapped cache; Figure 6 shows the baseline miss rate for fpppp for a four-way set-associative cache. The remaining figures show the change in the miss rate compared to the baseline miss rate. The squares show the default miss rate, that is, the miss rate of the original layout.
