This paper shows that code expanding optimizations have strong and non-intuitive implications on instruction cache design. Three types of code expanding optimizations are studied in this paper: instruction placement, function inline expansion, and superscalar optimizations. Overall, instruction placement reduces the miss ratio of small caches. Function inline expansion improves the performance for small cache sizes, but degrades the performance of medium caches. Superscalar optimizations increase the cache size required for a given miss ratio. On the other hand, they also increase the sequentiality of instruction access so that a simple load-forward scheme effectively cancels the negative effects. Overall, we show that with load forwarding, the three types of code expanding optimizations jointly improve the performance of small caches and have little effect on large caches.
Introduction
Compiler technology plays an important role in enhancing the performance of processors. Many code optimizations are incorporated into a compiler to produce code that is comparable to or better than hand-written machine code. Classic code optimizations decrease the number of executed instructions. Function inlining replaces a function call with the function body. To further enlarge the scope of code optimization and scheduling, compilers unroll loops by duplicating the loop body several times.
The IMPACT-I C compiler utilizes inline expansion, loop unrolling, and other code optimization techniques. These techniques increase the execution efficiency at the cost of increasing the overall code size. Therefore, these compiler optimizations can affect the instruction cache performance. This paper examines the effect of these code expanding optimizations on the performance of a wide range of instruction cache configurations. The experimental data indicate that code expanding optimizations have strong and non-intuitive implications on instruction cache design. For small cache sizes, the overall cache miss ratio of the expanded code is lower than that of the code without expansion. The opposite is true for large cache sizes. This paper studies three types of code expanding optimizations: instruction placement, function inline expansion, and superscalar optimizations. Overall, instruction placement increases the performance of small caches. Function inline expansion improves the performance of small caches, but degrades that of medium caches.
Superscalar optimizations increase the cache size required for a given miss ratio. However, they also increase the sequentiality of instruction access so that a simple load-forward scheme removes the performance degradation. Overall, it is shown that with load forwarding, the three types of code expanding optimizations jointly improve the performance of small caches and have little effect on large caches.
To appear, IEEE Transactions on Computers. CRHC-91-17, University of Illinois, Urbana-Champaign.
Related Work
Cache memory is a popular and familiar concept. Smith studied cache design tradeoffs extensively with trace driven simulations [5]. In his work, many aspects of the design alternatives that can affect the cache performance were measured. Later, both Smith and Hill focused on specific cache design parameters. Smith studied the cache block (line) size design and its effect on a range of machine architectures, and found that the miss ratios for different block sizes can be predicted regardless of the workload used [6]. The causes of cache misses were categorized by Hill and Smith into three types: conflict misses, capacity misses, and compulsory misses [7]. The loop model was introduced by Smith and Goodman to study the effect of replacement policies and cache organizations [8].
They showed that under some circumstances, a small direct mapped cache performs better than the same cache using full associativity with an LRU replacement policy. The tradeoffs between a variety of cache types and on-chip registers were reported by Eickenmeyer and Patel [9]. This work showed that when the chip area is limited, a small- or medium-sized instruction cache is the most cost effective way of improving processor performance. Przybylski et al. studied the interaction of cache size, block size, and associativity with respect to the CPU cycle time and the main memory speed [10]. This work found that cache size and cycle time are dependent design parameters. Alpert and Flynn introduced a utilization model to evaluate the effect of the block size on cache performance [11]. They considered the actual physical area of caches and found that larger block sizes have a better cost-performance ratio. All of these studies assumed an invariant compiler technology and did not consider the effects of compiler optimizations on the instruction cache performance.
Load forwarding is used to reduce the penalty of a cache miss by overlapping the cache repair with the instruction fetch. Hill and Smith evaluated the effects of load forwarding for different cache configurations [12]. They concluded that load forwarding in combination with prefetching and sub-blocking increases the performance of caches. In this paper a simpler version of the load-forward scheme is used, where neither prefetching nor sub-blocking is performed. The effectiveness of this load-forward technique is measured by comparing the cache performance of code without optimizations and with code expanding optimizations. Load forwarding potentially can hide the effects of code expanding optimizations.
Davidson and Vaughan compared the cache performances of three architectures with different instruction set complexities [13]. They have shown that less dense instruction sets consistently generate more memory traffic. The effect of instruction sets of over 50 architectures on cache performance has been characterized by Mitchell and Flynn [14]. They showed that intermediate cache sizes are not suited for less dense architectures. Steenkiste [15] was concerned with the relationship between the code density pertaining to instruction encoding and instruction cache performance. He presented a method to predict the performance of different architectures based on the miss rate of one architecture. Unlike less dense instruction sets, which typically have a higher miss rate for small caches [13], we show that code expansion due to optimizations improves the performance of small caches and degrades that of large caches. Our approach is also different from these previous studies in that the instruction set is kept constant. A load/store RISC instruction set whose code density is close to that of the MIPS R2000 instruction set is assumed.
Cuderman and Flynn have simulated the effects of classic code optimizations on architecture design decisions [16]. Classic code optimizations do not significantly alter the actual working sets of programs. In contrast, in this paper, classic code optimizations are always performed; code expanding optimizations that enlarge the working sets are the major concern. Code expanding optimizations increase the actual code size and change the instruction sequential and spatial localities.
Outline Of This Paper
Section 2 describes the instruction cache design parameters and the performance metrics. The cache performance is explained using the recurrence/conflict model [17]. Section 3 describes the code expanding optimizations and their effects on the target code and the cache design. Section 4 presents and analyzes experimental results. Section 5 provides some concluding remarks.
Instruction Cache Design Parameters

Performance Metrics with Recurrences and Conflicts
The dimension of a cache is expressed by three parameters: the cache size, the block size, and the associativity of the cache [5]. The size of the cache, $2^C$, is defined by the number of bytes that can simultaneously reside in the cache memory. The cache is divided into $b$ blocks, and the block size, $2^B$, is the cache size divided by $b$. The associativity of a cache is the number of cache blocks that share the same cache set. A cache with an associativity of one is commonly called a direct mapped cache, and an associativity of $2^{C-B}$ defines a fully associative cache.
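As a minimal sketch of how the three parameters fix where an address may reside, the following Python fragment computes the number of sets and the set index of a byte address. The function names and the modulo mapping of block numbers to sets are assumptions of this example (the conventional choice), not details given in the paper; C and B are the log2 cache and block sizes from the text.

```python
def num_sets(C: int, B: int, assoc: int) -> int:
    """Number of sets in a cache of 2**C bytes with 2**B-byte blocks."""
    blocks = 2 ** (C - B)                 # total number of blocks, b
    if assoc < 1 or blocks % assoc != 0:
        raise ValueError("associativity must divide the number of blocks")
    return blocks // assoc                # 1 set means fully associative


def set_index(addr: int, C: int, B: int, assoc: int) -> int:
    """Set a byte address maps to: block number modulo number of sets (assumed)."""
    return (addr >> B) % num_sets(C, B, assoc)


# A 1K-byte cache with 16-byte blocks has 64 blocks.
print(num_sets(10, 4, 1))    # direct mapped: one block per set
print(num_sets(10, 4, 64))   # fully associative: a single set
```

With an associativity of one, every block is its own set, so two addresses whose block numbers differ by a multiple of the number of blocks compete for the same location.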
The metric used in many cache memory system studies is the cache miss ratio. This is the ratio of the number of references that are not satisfied by a cache at a level of the memory system hierarchy over the total number of references made at that cache level. The miss ratio has served as a good metric for memory systems since it is characteristic of the workload (e.g., the memory trace) yet independent of the access time of the memory elements. Therefore, a given miss ratio can be used to decide whether a potential memory element technology will meet the required bandwidth for the memory system.
The recurrence/conflict model [17] of the miss ratio will be used to analyze the cause of cache misses. Consider the trace in Figure 1. Accesses $a_1$, $a_2$, $a_3$, and $a_4$ are the first occurrences of an access, and they are unique in the trace. The recurrences in the trace are accesses $a_5$, $a_6$, $a_7$, and $a_8$. Without a context switch, all four recurrences would result in hits in an infinite cache. In the ideal case of an infinite cache and in the absence of context switching, the intrinsic miss ratio is expressed as

$$\rho_o = \frac{N - R}{N}, \tag{1}$$

where $R$ is the total number of recurrences and $N$ is the total number of references. Note that an access can be of only two types: either a unique or a recurrent access. Non-ideal behavior occurs due to conflicts, and this paper considers only the dimensional conflicts; multiprogramming conflicts are considered in [18].
A dimensional conflict is defined as an event which converts a recurrent access into a miss due to limited cache capacity or mapping inflexibility. For illustration, consider a direct mapped cache composed of two one-byte blocks, as shown in Figure 2. The recurrence $a_5$ results in a miss because reference $a_4$ purges address 1 from the cache due to insufficient cache capacity. Hence, $a_4$ represents a dimensional conflict for the recurrence $a_5$. The other misses, $a_1$, $a_2$, $a_3$, and $a_4$, occur because these are the first references to addresses 0, 1, 2, and 3, respectively (i.e., they are unique accesses). Therefore, the following formula can be used for deriving the cache miss ratio, $\rho$, for a given trace and a given cache dimension:

$$\rho = \rho_o + \frac{C_D}{N}, \tag{2}$$

where $C_D$ is the total number of dimensional conflicts, $N$ is the total number of references, and $\rho_o$ is the intrinsic miss ratio.
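The classification of misses described above can be sketched in a few lines of Python. This is an illustrative simulation, not the paper's simulator: the function name is made up, the cache is the two-block direct mapped cache of the example, and only the first five addresses of the trace (0, 1, 2, 3, 1) follow the text; the last three accesses are invented for the example because Figure 1 is not reproduced here. The miss ratio is computed as the intrinsic miss ratio plus the number of dimensional conflicts divided by the number of references, which follows from the definitions above.

```python
def classify_misses(trace, num_blocks):
    """Direct mapped cache with one-byte blocks.

    Returns (unique accesses, dimensional conflicts,
             intrinsic miss ratio, miss ratio).
    """
    cache = [None] * num_blocks      # one resident address per block
    seen = set()                     # addresses referenced so far
    unique = conflicts = 0
    for addr in trace:
        slot = addr % num_blocks
        if addr not in seen:
            unique += 1              # first occurrence: intrinsic miss
            seen.add(addr)
        elif cache[slot] != addr:
            conflicts += 1           # recurrence converted into a miss
        cache[slot] = addr
    n = len(trace)
    recurrences = n - unique
    intrinsic = (n - recurrences) / n
    return unique, conflicts, intrinsic, intrinsic + conflicts / n


# Hypothetical 8-access trace; a5 (address 1) was purged by a4 (address 3).
print(classify_misses([0, 1, 2, 3, 1, 1, 2, 2], num_blocks=2))
```

For this made-up trace, four accesses are unique and one recurrence is lost to a dimensional conflict, so five of the eight references miss.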
In a simple design, when a cache miss occurs, instruction fetch stalls and the instruction cache waits for the appropriate cache block to be filled. After instruction cache repair is completed, the instruction fetch resumes. The number of stalled cycles is determined by three parameters: the initial cache repair latency ($L$), the block size ($2^B$), and the cache-memory bandwidth ($\beta$). For a single cache miss, the number of stalled cycles is the initial cache repair latency plus the number of transfers required to repair the cache block. The total miss penalty without load forwarding, $t_n$, is expressed by the number of total misses multiplied by the number of stalled cycles for a single miss:

$$t_n = N \rho \left( L + \frac{2^B}{\beta} \right), \tag{3}$$

where $N \rho$ is the total number of misses. This is the miss-penalty model used when load forwarding is not assumed. The miss penalty ratio is calculated by dividing the miss penalty, $t_n$, by $N$.
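The miss-penalty model without load forwarding can be written as a short Python function. This is a sketch under stated assumptions: the function and parameter names are invented for the example, and the bandwidth is interpreted as bytes transferred per cycle, so that the number of transfers per block equals the block size divided by the bandwidth.

```python
def miss_penalty_no_forwarding(misses, latency, block_bytes, bandwidth):
    """Total stalled cycles: misses * (initial latency + transfers per block)."""
    transfers = block_bytes / bandwidth     # cycles to refill one block (assumed)
    return misses * (latency + transfers)


def miss_penalty_ratio(misses, references, latency, block_bytes, bandwidth):
    """Stalled cycles per instruction reference."""
    total = miss_penalty_no_forwarding(misses, latency, block_bytes, bandwidth)
    return total / references


# One miss on a 16-byte block, 4-cycle latency, 4 bytes per cycle.
print(miss_penalty_no_forwarding(1, 4, 16, 4))
```

Because every miss pays for the whole block, doubling the block size at a fixed miss count adds one full set of transfer cycles per miss; this is the cost that load forwarding tries to hide.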
Load Forwarding
Load forwarding was evaluated by Hill and Smith [12]. They concluded that load forwarding in combination with prefetching and sub-blocking increases the performance of the cache. In this paper, we use a simpler version of the load forwarding scheme where neither prefetching nor sub-blocking is performed. The state transition diagram for load forwarding is shown in Figure 3.
The instruction cache is in the standby state initially (state 0). When a cache miss occurs, the instruction fetch stalls (state 1). Instead of waiting for the entire cache block to be filled before resuming, the cache loads the block from the currently-referenced instruction and forwards the instruction to the instruction fetch unit (state 2). Furthermore, if the instruction reference stream is sequential, each subsequent instruction is forwarded to the instruction fetch unit until the end of the block is reached or a taken branch is encountered. Any remaining unfilled cache-block bytes are repaired in the normal manner, and the instruction fetch stalls (state 3). This load forwarding scheme requires no sub-block valid bits and therefore has simpler logic for cache block repair than sub-block-based schemes.
An example of the cache-block repair process with load forwarding is provided in Figure 4 .
Reference X results in a miss. It takes L cycles before this reference is placed in the appropriate block location and is forwarded to the fetch unit. Reference Y is a sequential access; thus it is considered a hit. It is placed in the cache and forwarded to the fetch unit. Reference Z breaks the sequential-reference stream, load forwarding stops, and cache repair of block 0 continues. At cycle L+2, the end of the block is reached, and the cache repair continues from the beginning of the cache block. At cycle L+3, the entire cache block is filled, and the fetch unit continues with the next instruction reference. The block wrap-around time is assumed to be negligible compared to the total block-repair time (footnote 1). References X and Y are sequential and constitute a run length (the number of sequential instructions before a taken branch) of 2.
For the $i$-th cache miss, if the total number of bytes where the instruction fetch and cache repair overlap is $S[i]$, then the miss penalty with load forwarding, $t_l$, is

$$t_l = t_n - t_S, \tag{4}$$

where $t_S$ is

$$t_S = \sum_{i} \frac{S[i]}{\beta}, \tag{5}$$

the sum is taken over all cache misses, and $\beta$ is the cache-memory bandwidth. $t_S$ measures the number of cycles saved by load forwarding. Equation 4 is the miss-penalty model used when load forwarding is assumed. The miss penalty ratio with load forwarding is calculated by dividing the miss penalty, $t_l$, by $N$.

Footnote 1: For the actual hardware implementation, the cache repair can start at the beginning of the cache block. When the location of the instruction to be fetched is encountered within the cache block, load forwarding begins. Load forwarding terminates when the end of the block is reached or when a taken branch is encountered. Cache repair stops at the end of the block. The miss penalty incurred by this method is the same as the one presented in the paper.
The number of saved cycles expressed in Equation 5 is constrained by two factors. First, load forwarding is limited by the sequentiality of the instruction reference stream. The more sequential the instruction reference stream is, the more overlap between the cache repair and load forwarding cycles can be achieved. Second, assuming the sequentiality of the referencing stream is not a problem, load forwarding is performed only from the missed reference until the end of the block. Thus the savings are highly dependent upon the location of the miss within the cache block. The sequentiality of the reference stream can be increased by appropriate compiler optimizations, and this will be discussed in Section 3. The second factor is highly variable and dependent upon the instruction reference stream and the block size.
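One way to read the two constraints is that the bytes forwarded on a miss are bounded both by the remaining sequential run and by the distance from the missed instruction to the end of its block. The Python sketch below encodes that reading. The function names, the example offsets, and the conversion from overlapped bytes to saved cycles (bytes divided by a bandwidth in bytes per cycle) are assumptions of this example rather than definitions taken from the paper.

```python
def overlap_bytes(miss_offset: int, run_bytes: int, block_bytes: int) -> int:
    """Bytes forwarded while the block is being repaired, for one miss.

    miss_offset: byte offset of the missed instruction within its block.
    run_bytes:   bytes of sequential code starting at the missed
                 instruction, up to and including the next taken branch.
    """
    if not 0 <= miss_offset < block_bytes:
        raise ValueError("miss_offset must lie inside the block")
    to_block_end = block_bytes - miss_offset   # second constraint: miss location
    return min(run_bytes, to_block_end)        # first constraint: sequentiality


def cycles_saved(overlaps, bandwidth: int = 4) -> float:
    """Assumed conversion: overlapped bytes divided by bytes per cycle."""
    return sum(overlaps) / bandwidth


# A two-instruction run (8 bytes) that misses at offset 4 of a 16-byte block
# is limited by the run; a long run that misses at offset 12 is limited by
# the end of the block.
print(overlap_bytes(4, 8, 16), overlap_bytes(12, 52, 16))
```

The second call shows why the position of the miss matters: even a perfectly sequential stream gains little when the miss falls near the end of a block.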
Optimizations and Code Transformations
Base Optimizations
A standard set of classic optimizations is available in commercial compilers today (see Table 1). Local optimizations are performed within basic blocks, whereas global optimizations are performed across operations in different basic blocks. In this paper, these classic code optimizations are always performed on the compiled programs.
Execution Profiler
Execution profiling is performed on all measured benchmarks. The IMPACT-I profiler translates each target C program into an equivalent C program with additional probes. When the equivalent C program is executed, these probes record the basic block weights and the branch characteristics for each basic block. Profile information is used to guide the code expanding optimizations. The profile information is collected using an average of 20 program inputs per benchmark. An additional input is then used to measure the cache performance.
Instruction Placement
Reordering program structure to improve the memory system performance is not a new subject.
In more recent literature regarding instruction caches, instruction placement has been shown to 
Optimizations for Superscalar Processors
Since basic blocks typically contain few instructions, there is little parallelism within a basic block. copying the target basic block of a frequently taken branch into its fall-through path. The number of static instructions increases due to this optimization.
Super-block formation, loop unrolling, loop peeling, and branch target expansion increase the sequentiality of the code. Loop unrolling and loop peeling decrease both spatial and temporal locality. A reduction in cache performance can be expected due to a decrease in spatial locality.
The increased code size and increased unique references can be expected to increase the cache size requirement.
programs are large enough for studying instruction caches. The instruction references column gives the corresponding number of dynamic instruction references. These instruction references are for the full run of each benchmark program; no sampling or reference partitioning is used.
Experiments and Analysis
Benchmark Programs
Measurement Tools
The measurement results are generated by trace driven simulation. To collect the instruction traces, the compiler's code generator was modified to insert probes into the assembly language program. Executing the modified program with sample input data produced the instruction trace. The traces consist of IMPACT assembly instructions (LCODE; see footnote 3), which are similar to the MIPS R2000 assembly language [32].
Since the performance numbers for many cache dimensions are needed, a one pass cache simulator is used. The cache simulator for the experiments uses the recurrence/conflict model [17], where only one pass over the instruction trace is needed to simulate all cache dimensions. Similarly, the information required to derive the miss penalty with load forwarding is collected for all cache dimensions. In this paper, one-way, two-way, four-way, and fully associative caches are simulated. The block sizes considered are 16, 32, 64, and 128 bytes. The cache sizes range from 1K to 128K bytes.
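To illustrate how a single pass over a trace can yield miss ratios for many cache sizes at once, the sketch below uses the classic LRU stack-distance technique for fully associative caches. This is a substitute technique chosen for brevity, not the recurrence/conflict simulator used in the paper, and it covers only fully associative LRU caches of one block size. The function names and the toy trace are made up for the example.

```python
from collections import Counter


def lru_stack_distances(block_trace):
    """One pass over a trace of block numbers.

    Returns a histogram of LRU stack distances; the key None counts
    first-time (unique) references.
    """
    stack = []                      # most recently used block is last
    hist = Counter()
    for blk in block_trace:
        if blk in stack:
            depth = len(stack) - stack.index(blk)   # 1 = most recently used
            stack.remove(blk)
            hist[depth] += 1
        else:
            hist[None] += 1
        stack.append(blk)
    return hist


def miss_ratio(hist, num_blocks):
    """Miss ratio of a fully associative LRU cache holding num_blocks blocks."""
    total = sum(hist.values())
    deep = sum(n for d, n in hist.items() if d is not None and d > num_blocks)
    return (hist[None] + deep) / total


hist = lru_stack_distances([0, 1, 2, 0, 1, 2])
for size in (1, 2, 3, 4):
    print(size, miss_ratio(hist, size))
```

A reference hits in every cache that holds at least as many blocks as its stack distance, so the histogram gathered in one pass answers the question for all sizes. The list-based stack is quadratic in the worst case and is meant only to show the idea.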
Empirical Data and Analysis
For the purpose of experimentation, the code expanding optimizations described in Section 3 are organized into four optimization levels with increasing functionality: no (no code expanding optimization), pl (instruction placement), in (function inline expansion plus instruction placement), and su (superscalar optimization, function inline expansion, and instruction placement). Experiments are conducted by varying the optimization level to measure the incremental and accumulative effects of these optimizations.

program     no    pl    in     su
cccp        -     2%    36%    54%
eqntott     -     1%    2%     7%
espresso    -     1%    10%    60%
mpla        -     1%    13%    41%
tbl         -     3%    22%    67%
xlisp       -     1%    18%    49%
yacc        -     4%    21%    110%
average     -     2%    17%    55%
Table 3: Accumulated code size increase.

Footnote 3: LCODE documentation is available as an internal report.
General Effects
In order to quantify the effect of optimization on code size, the object code size was measured for each level of optimization. Table 3 shows the relative object code size for each optimization level. All ratios and percentages are computed based on the code size without code expanding optimization.
Instruction placement increases the average code size by 2%. Function inline expansion results in a 15% code expansion after instruction placement, as indicated by the 17% increase in average code size in the in column of Table 3 . Superscalar optimization further increases the code size by 38% after both inline expansion and instruction placement. The total code expansion due to all the three optimizations is 55%, which reinforces the concern that these optimizations may degrade the instruction cache performance.
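The averages and increments quoted above can be checked directly from Table 3. The short Python sketch below reads each row of the table as a dash for the no column followed by the pl, in, and su percentages; the variable names are invented for the example. The increments are differences in percentage points relative to the unexpanded code, which is how the 15% and 38% figures in the text are obtained.

```python
# Accumulated code size increase (percent) for levels pl, in, su (Table 3).
table3 = {
    "cccp":     (2, 36, 54),
    "eqntott":  (1, 2, 7),
    "espresso": (1, 10, 60),
    "mpla":     (1, 13, 41),
    "tbl":      (3, 22, 67),
    "xlisp":    (1, 18, 49),
    "yacc":     (4, 21, 110),
}

# Average of each column, rounded to whole percent as in the table.
columns = list(zip(*table3.values()))
avg_pl, avg_in, avg_su = (round(sum(col) / len(col)) for col in columns)

# Incremental expansion contributed by each additional level.
inline_increment = avg_in - avg_pl
superscalar_increment = avg_su - avg_in

print(avg_pl, avg_in, avg_su)
print(inline_increment, superscalar_increment)
```

The per-benchmark spread is wide: eqntott grows by 7% in total while yacc more than doubles, so the average hides very different cache pressures across programs.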
The instruction working set of a program is defined as the smallest fully-associative instruction cache which achieves a 0.1% miss ratio for the program. It provides a relative measure of the cache size requirement of programs. Table 4 presents the instruction working set size of each benchmark for all optimization levels. All numbers presented are in $\log_2$ scale (e.g., 14 is a 16K byte cache).
The largest working set size needs at most a 32K byte cache. All miss ratios for the larger caches are considered negligible, and for this reason, cache sizes larger than 32K will generally not be considered.
Table 6: Number of dynamic references.
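The working set definition translates into a simple search over measured miss ratios. The Python sketch below is illustrative only: the function name is invented, "achieves a 0.1% miss ratio" is read as "has a miss ratio of at most 0.1%", and the miss ratios in the usage example are made-up numbers, not data from Table 4.

```python
def working_set_log2(miss_ratio_by_log2_size, threshold=0.001):
    """Smallest log2 cache size whose miss ratio is at most the threshold.

    miss_ratio_by_log2_size maps log2(cache bytes) to the miss ratio of a
    fully associative cache of that size. Returns None if no size qualifies.
    """
    for log2_size in sorted(miss_ratio_by_log2_size):
        if miss_ratio_by_log2_size[log2_size] <= threshold:
            return log2_size
    return None


# Hypothetical miss ratios for 1K to 16K byte caches.
example = {10: 0.05, 11: 0.02, 12: 0.004, 13: 0.0009, 14: 0.0001}
print(working_set_log2(example))
```

Because the sizes are powers of two, an increase of one in this measure means that the program needs a cache twice as large to stay below the threshold.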
As discussed in Section 3, all three code expanding optimizations can improve the sequentiality of instruction access. To quantify this effect, the average number of sequential instructions executed between taken branches was measured. As shown in Table 5, all three optimizations improve the sequentiality significantly. With all optimizations, the average number of sequential instructions increased from 4.6 to 12.3. This dramatic increase in sequentiality suggests that schemes such as load forwarding may be able to offset the negative effect of code expansion. We will further explore this subject later in this section.
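The sequentiality measure can be computed from an instruction address trace as sketched below. This is an assumption-laden illustration rather than the measurement tool used in the paper: the function name is invented, instructions are assumed to be 4 bytes each, and any reference that does not immediately follow its predecessor in memory is treated as the target of a taken branch.

```python
def average_run_length(addresses, instr_bytes=4):
    """Average number of sequential instructions between taken branches."""
    if not addresses:
        return 0.0
    runs = 1                                   # the first reference opens a run
    for prev, cur in zip(addresses, addresses[1:]):
        if cur != prev + instr_bytes:          # non-sequential: a new run starts
            runs += 1
    return len(addresses) / runs


# Three runs of lengths 3, 2, and 3 instructions.
trace = [0, 4, 8, 100, 104, 0, 4, 8]
print(average_run_length(trace))
```

Multiplying the result by the instruction size gives the average number of bytes fetched sequentially, which is the quantity that matters when comparing run lengths against block sizes.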
Although the static code size increases significantly after the code expanding optimizations, the number of dynamic instruction references tends to decrease with each additional level of optimization. Table 6 presents the number of instruction references for each benchmark program. The largest improvement results from function inline expansion; this is due to the increased opportunity to apply classic local and global optimizations on the inlined version of the code and to eliminate instructions that save and restore registers across function boundaries. The purpose of superscalar optimizations is to uncover parallelism and scheduling opportunities. Note, however, that superscalar optimizations often result in a decrease in the number of instruction references. The contribution of instruction placement to the number of dynamic references is small when compared to the other optimizations since instruction placement only performs code reordering.
The sum of the number of recurrent references and the number of unique references constitutes the number of total dynamic references. Table 7 shows that the number of unique references increases for inlining and superscalar optimizations, but decreases for instruction placement. The absolute difference within the unique references does not constitute a significant variation in the miss ratio since the difference is insignificant when compared to the number of dynamic references in Table 6.
Instruction Placement
Figure 5 shows the effect of instruction placement on the average cache miss ratio (footnote 4). On one hand, instruction placement reduces the miss ratio for small caches (1K and 2K). For example, the miss ratio of a 1K cache with placement is comparable to that of a 2K cache without placement. On the other hand, instruction placement has very little effect on large caches (8K and 16K). The same trend can be observed from the worst case miss ratios in Figure 6. The worst case miss ratio is the maximal miss ratio observed among all benchmark programs. Note that the benefit of instruction placement is more pronounced for programs with high miss ratios. This is a very desirable effect since it increases the stability of the cache performance.
Figure 7: Effect of placement on dimensional conflicts and unique references. (Legend: intrinsic miss ratio; dimensional miss ratio without placement; dimensional miss ratio with placement.)
To analyze why instruction placement improves the performance of small caches, we have measured the misses due to unique references (intrinsic misses, see Section 2) and those due to dimensional conflicts (dimensional misses). The log plot of Figure 7 shows the contribution of each to the miss ratio with and without placement. The black bars show the intrinsic miss ratio. Figure 7 clearly indicates that instruction placement makes a negligible difference in the number of intrinsic misses (footnote 5). The shaded bars in Figure 7 show the dimensional misses. As can be seen in the figure, the reduced miss ratio after placement is due to decreased dimensional conflicts (footnote 6).

Footnote 4: We found that the effect of instruction placement on the cache miss ratio of other associativities closely follows the trend of the direct mapped cache case; therefore only the direct mapped cache results are presented.
The changes in program behavior due to instruction placement explain the discrepancy between small and large caches. The working sets of the benchmark programs do not fit into small caches. This accounts for the high miss ratio of the small caches. Instruction placement separates the frequently executed code segments from those executed infrequently. This helps the small caches to accommodate the frequently executed portions of the programs. Therefore, the performance of small caches improves significantly after instruction placement. Since large caches can accommodate the working set of most benchmark programs, the compaction effect of instruction placement does not make a significant difference for these cache sizes.
Function Inline Expansion
Function inline expansion has two conflicting effects on cache performance. On the positive side, with inlining the caller and callee bodies are processed together by instruction placement. This allows instruction placement to significantly increase the sequentiality of the program (see Table 5).
When the cache miss ratio is high, the increased sequentiality reduces the miss ratio because it increases the number of useful bytes transferred for each cache miss. On the negative side, inlining increases the working set size (see Tables 3 and 4). If the working set fits into a cache before inlining but does not after inlining, the cache miss ratio may increase substantially. Figures 8 and 9 show the effect of inline function expansion on cache performance (footnote 7). The cache miss ratio is relatively high for small caches before inlining. In this range, the increased sequentiality reduces the cache miss ratio. In the middle range (8K, 16K, and 32K), the working sets of some benchmarks fit in the cache before inlining but not after inlining. As a result, inlining increases the cache miss ratio. The 64K cache is large enough to accommodate the program working set before and after inlining. Therefore, inlining has a negligible effect in caches of size 64K and greater.

Footnote 5: The reader is encouraged to derive the intrinsic miss ratio by dividing the number of unique references in Table 7 by the number of dynamic references in Table 6.
Footnote 6: Note that Figure 7 is in log scale, which is necessary to make the intrinsic miss ratio visible. However, the log scale also magnifies the miss ratio of large caches. For example, instruction placement seems to make a comparable difference for small caches (1K and 2K) and large caches (16K and 32K) in Figure 7. However, it is clear from Figure 5 that instruction placement has a strong effect on small caches but a negligible effect on large caches.
Superscalar Optimizations
Figure 10 shows the changes in the cache miss ratios when superscalar optimizations are applied after inlining and placement. The miss ratios are consistently higher with superscalar optimizations. Therefore, a larger cache is required to compensate for the effect of superscalar optimizations to maintain the same miss ratio. This information is consistent with the working set size calculated in Table 4. If the block sizes are kept constant, the cache size required to maintain the same level of miss ratio is approximately twice that of code with no superscalar optimizations. Figure 11 indicates that superscalar optimizations increase the number of unique references, but the increase is not significant. Therefore, it is the increase in code size rather than the increase in unique references that is the primary cause of reduced cache performance.

Footnote 7: As before, the trend for higher set associativities is very close to the results for the direct mapped cache. Thus, only the direct mapped results are presented.
All Optimizations
Figure 12 shows the cumulative effect of all optimizations on direct mapped caches. Intuitively, smaller caches should perform worse on expanded code because of an increase in the expected number of dimensional conflicts. However, the experimental data show the opposite. For the 1K and 2K caches, the miss ratios of code without code expanding optimizations are larger than the miss ratios of code with code expanding optimizations. Sequentiality is increased by superscalar optimizations; thus for larger block sizes, the decrease in miss ratio is due to sequentiality (e.g., for the 1K cache in Figure 12, code with superscalar optimizations has a larger drop in miss ratio going from a 64B to a 128B block size than code with no optimization). For small block sizes, the positive effect of higher sequentiality disappears, and the negative effect of code expansion causes an increase in the miss ratio. However, the increase in code locality by function inlining and instruction placement is still large enough to offset the negative effect of the code expansion, and a slight decrease in the miss ratio can still be seen in small caches.
Load Forwarding
The results of load forwarding are presented in Figure 13. Since superscalar optimizations have the worst results thus far, they are used here to evaluate the effectiveness of load forwarding. The initial memory repair latency ($L$) is assumed to be 4 cycles, and the cache-memory bandwidth ($\beta$) is assumed to be 4 bytes. Equations 3 and 4 are used to calculate the relative miss time penalty.
Load forwarding reduces the miss penalty and effectively upgrades the cache to a performance level similar to that of a cache of twice the size without load forwarding. For example, assume that a 2K direct mapped cache with a block size of 64 bytes is used with load forwarding. Using the same block size, the miss penalty is approximately the same as that of a 4K cache without load forwarding. When superscalar optimizations are used, the designer can either double the cache size to maintain the same performance level or use load forwarding and achieve the same result.
Another observation is that a block size of 128 bytes has consistently higher average miss penalties than the other block sizes. This can be explained by the number of sequential instructions shown in Table 5. The overall average run length for superscalar optimizations is approximately 12.3 instructions (49.2 bytes). It is possible that the first non-sequential miss will not be in the beginning of the block (see Figure 14). By using the symbol R for the run length, and l as the run 
For simplicity, an integer approximation of the run length is used. Instead of 12.3, the value of 13
is used for R in Equations 6 and 7.
$$P(13, 4) = 19 \text{ cycles} \tag{8}$$
$$P(13, 5) = 17 \text{ cycles} \tag{9}$$
$$P(13, 6) = 22 \text{ cycles} \tag{10}$$
$$P(13, 7) = 36.5 \text{ cycles} \tag{11}$$

The calculated values follow the trend in Figure 13 closely. For B equal to 4, 5, and 6, the load forwarding miss penalties are nearly the same, with B equal to 5 the lowest and B equal to 4 the next lowest. For B equal to 7, the load forwarding miss penalty is noticeably higher than for the other block sizes, and this can also be shown by using Equation 7.
The miss penalty for each run of sequential accesses is dominated by three values: the initial load delay, the number of refill cycles with load forwarding, and the number of refill cycles without load forwarding. While the initial load delay is dependent upon the hardware design technology, the non-stalling and stalling refill cycles are related to the block size and the instruction sequentiality.
Before the initial load delay reaches a certain threshold value, the number of refill cycles will have a dominant effect upon the miss penalty. Larger block sizes will tend to have a higher number of wasted refill cycles than smaller block sizes. However, larger block sizes are penalized less for the initial load delay than smaller block sizes. Figure 15 shows the effect of varying the value of the initial load delay on block sizes for a 4K cache. For each value of L, the miss penalty ratio is compared between four block sizes. For small values of L, 16-byte and 32-byte blocks perform the best. But for larger values of L, the 64-byte block performs the best. This is also verified by Equation 7. Here, the value of L is set to 10.

$$P(13, 4) = 43 \text{ cycles} \tag{12}$$
$$P(13, 5) = 32 \text{ cycles} \tag{13}$$
$$P(13, 6) = 32.5 \text{ cycles} \tag{14}$$
$$P(13, 7) = 44.75 \text{ cycles} \tag{15}$$

From Figure 15, for an initial delay of 10, block sizes of 32 and 64 bytes have similar performance, and block sizes of 16 and 128 bytes have similar performance.
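As a sanity check on the values in Equations 8 to 15, the Python sketch below evaluates a closed form that is an assumption of this example and should not be read as the paper's Equation 7. It charges the initial load delay once for each block that a run of R instructions is expected to touch, and adds one block's worth of refill cycles minus one. Instructions are assumed to be 4 bytes and the bandwidth 4 bytes per cycle, so one instruction arrives per transfer. The function name is invented.

```python
def run_miss_penalty(R, B, L, bytes_per_transfer=4):
    """Assumed penalty (cycles) for a sequential run of R instructions.

    B is log2 of the block size in bytes and L is the initial load delay.
    One instruction per transfer is assumed (4-byte instructions).
    """
    per_block = 2 ** B // bytes_per_transfer     # instructions (transfers) per block
    expected_blocks = 1 + (R - 1) / per_block    # blocks touched by the run
    wasted_refill = per_block - 1                # refill cycles not hidden
    return L * expected_blocks + wasted_refill


for L in (4, 10):
    print(L, [run_miss_penalty(13, B, L) for B in (4, 5, 6, 7)])
```

Under these assumptions the expression reproduces all eight values listed for L equal to 4 and L equal to 10, and it makes the tradeoff in the text explicit: the first term shrinks as blocks grow, while the second term grows with the block size.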
As the value of L increases, the performance of the larger block sizes increases while the performance of the smaller block sizes decreases. Not until the initial load delay reaches 40 cycles do 128-byte blocks start to outperform the other block sizes. For smaller cache sizes, the miss ratios are the dominating factor, and a smaller block size should be used. In contrast, for larger cache sizes, since the miss ratios are very small, larger block sizes are preferred.
Conclusions
This paper analyzes the effect of compile-time code expanding optimizations on instruction cache design. We first show that instruction placement, function inline expansion, and superscalar optimizations cause substantial code expansion, reinforcing the concern that they may increase the cache size required to achieve a given performance level. We then show the actual effect of each optimization on cache design.
Among the three types of optimizations, instruction placement causes the least amount of code expansion. Its effects on the cache performance are mostly due to the increased instruction access sequentiality. For small caches where the miss ratio is relatively high, the increased sequentiality reduces the number of cache misses by increasing the useful bytes transferred for each cache miss. For large caches where the miss ratio is relatively low, the effect of instruction placement is negligible.
Inline function expansion affects the cache performance by increasing both the sequentiality and the working set size. For small caches where the miss ratio is high, the increased sequentiality helps to reduce the miss ratio. Due to the increased working set size, some benchmarks which fit into moderately sized caches before inlining do not fit after inlining. Therefore, inlining increases the miss ratio of moderately sized caches. For large caches, since the working sets fit in the cache before and after inlining, the effect of inlining is insignificant.
Superscalar optimizations increase the cache size required for a given miss ratio. However, they increase the sequentiality of instruction access so much that a simple load-forward scheme effectively cancels the negative effects. Using load forwarding, the three types of code expanding optimizations jointly improve the performance of small caches in spite of the substantial code expansion. Load forwarding also allows the code expanding optimizations to have little negative effect on the performance of large caches.
