Abstract - Instruction cache replacement policies and organizations are analyzed both theoretically and experimentally. Theoretical analyses are based on a new model for cache references, the loop model. First the loop model is used to study replacement policies and cache organizations. It is concluded theoretically that random replacement is better than LRU and FIFO, and that under certain circumstances, a direct-mapped or set-associative cache may perform better than a fully associative cache organization. Experimental results using instruction trace data are then given and analyzed.
Index Terms - Cache memories, direct-mapped, fully associative, loop model, memory organization, replacement algorithms, set-associative.
I. INTRODUCTION
SYSTEMS with a cache memory specifically for instructions have been used in large scale scientific computers for some time [1], [2], and are beginning to be used in other high performance systems [3]-[5]. If they continue to follow the evolutionary pattern of other performance enhancement techniques, instruction caches will be used in a wide variety of computer systems within a few years.
Instruction caches provide the same advantages as more general caches, i.e., reduced memory access time and reduced memory bandwidth requirements, and they provide the following additional advantages.
1) When used in conjunction with a data cache, they increase the total cache bandwidth.
2) An instruction cache designed for a modern architecture can be simpler than a data cache or a combined instruction/data cache because stores into cache locations can be disallowed (i.e., there can be no self-modifying code, an assumption consistent with modern programming principles and protection methods).
3) An instruction cache can be tailored to the specific referencing patterns found in fetching instruction streams. A separate data cache can also be tailored to data referencing patterns, resulting in better usage of both.

This paper is aimed at exploiting the last advantage given above. It studies cache organizations and replacement policies that are intended solely for instruction caches. We first propose a new model for instruction address patterns, the loop model. This model is the basis for a theoretical analysis of cache organizations and replacement policies. Then, we give results of an experimental study using instruction trace data. These experimental data are analyzed in light of the theoretical conclusions.
A. Definitions
We consider memory/instruction cache systems where memory is divided into contiguous blocks of some fixed size (typically a power of 2). An instruction cache holds a fixed number of blocks of instructions, and all instruction fetch references are checked to see if the requested instruction word is in the instruction cache. If so, there is a hit and the instruction word from the cache is immediately decoded and executed. If it is not present, there is a miss; the program block containing the requested instruction is brought in from main memory, and it replaces some block in the instruction cache according to a replacement policy. In this paper, we assume cache blocks are only replaced following a miss; there is no prefetching of program blocks as in [2] . Note that we call the physical locations that make up the cache the cache blocks, and all the blocks of a program, only some of which may be in the cache, are the program blocks.
There are three common ways to organize a cache [8]. Each program block resident in a cache has a tag associated with it to help in determining if a referenced program block is in the cache. In a fully associative cache, any program block potentially can be found in any of the cache blocks. A memory address is divided into two fields, shown in Fig. 1(a). The word field indicates the instruction word within a block, and the tag field is compared against the tags of all the blocks in the cache to determine if there is a hit.
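The field decomposition can be illustrated with a short sketch. The helper name `split_address` and its parameters `word_bits` and `set_bits` are hypothetical and are not taken from the paper; the sketch assumes word addressing and power-of-2 block sizes. With `set_bits=0` it yields the two-field (tag, word) split of a fully associative cache, and a nonzero `set_bits` adds an index field between the tag and the word.

```python
def split_address(address, word_bits, set_bits=0):
    """Split a word address into (tag, set_index, word) bit fields.

    word_bits: log2 of the block size in words (assumed power of 2).
    set_bits:  log2 of the number of sets; 0 models a fully associative cache,
               where the index field is empty and set_index is always 0.
    """
    word = address & ((1 << word_bits) - 1)                     # low-order bits
    set_index = (address >> word_bits) & ((1 << set_bits) - 1)  # middle bits
    tag = address >> (word_bits + set_bits)                     # remaining high bits
    return tag, set_index, word
```

For example, with 4-word blocks (`word_bits=2`), address 182 splits into tag 45 and word 2 when there is no index field.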
In a direct-mapped cache, any program block can be placed in only one cache block. A memory address is divided into three fields [Fig. 1(b)]. In a set-associative cache, the word field is as before, the set field identifies the set of cache blocks that may contain a referenced program block, and the tag field is compared to the tags of all the blocks in the set to determine if there is a hit or a miss.

An instruction reference fetches a basic unit of storage, typically a "word," into the instruction unit. It may include part or all of one or more instructions. For an instruction cache, the hit ratio is the number of instruction references that result in hits divided by the total instruction references. Hit ratio is a measure often used for cache performance.

In this paper, we will be more interested in block hit ratio. To evaluate the block hit ratio, we consider only instruction references that are in a program block different from the block holding the immediately previous instruction reference. We call these new block references. The block hit ratio is then found by dividing the new block references that result in instruction cache hits by the total new block references.
We define block hit ratio because only a new block reference can result in a cache miss, and more importantly, a block replacement. By considering only references that potentially can result in a replacement, the block hit ratio provides a better measure of the impact of replacement policy on cache performance. Conversely, the measure is intended to eliminate cache hits caused by instruction word "prefetching" that comes with using block sizes larger than the instruction word size. This prefetching effect occurs when there are consecutive instruction references to the same block. All hits after the first are independent of replacement policy, and are eliminated from our performance measure.
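The difference between the two measures can be seen in a minimal sketch. It assumes a fully associative cache with LRU replacement and a non-empty trace of word addresses; the function name `hit_ratios` and its parameters are illustrative choices, not part of the paper.

```python
from collections import OrderedDict

def hit_ratios(word_addresses, words_per_block, cache_blocks):
    """Return (hit ratio, block hit ratio) for a non-empty trace.

    Assumption: fully associative cache, LRU replacement, no prefetching.
    """
    cache = OrderedDict()            # resident program blocks, least recent first
    hits = refs = 0                  # counts over all instruction references
    block_hits = block_refs = 0      # counts over new block references only
    previous_block = None
    for address in word_addresses:
        block = address // words_per_block
        hit = block in cache
        refs += 1
        hits += hit
        if block != previous_block:  # a new block reference
            block_refs += 1
            block_hits += hit
        if hit:
            cache.move_to_end(block)           # mark as most recently used
        else:
            if len(cache) >= cache_blocks:
                cache.popitem(last=False)      # evict the least recently used block
            cache[block] = True
        previous_block = block
    return hits / refs, block_hits / block_refs
```

For a loop of 8 sequential words (two 4-word blocks) executed three times in a 2-block cache, this gives a hit ratio of 22/24 but a block hit ratio of only 4/6, because the hits within a block are excluded from the second measure.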
B. Instruction Reference Model
In most programs a significant percentage of instructions are executed within loops [6]. This leads us to study instruc...

Following the survey of previous research given in Section I-C, Sections II, III, and IV consider the loop model for fully associative, direct-mapped, and set-associative caches, respectively. Section V gives experimental results and analysis.
C. Previous Research
Most past research on cache memories has considered data caches or combined instruction/data caches. One exception is in [7] where it is assumed that there is a fixed amount of cache memory, and it can be partitioned into instruction and data portions. The authors of [7] conclude that such a scheme is not worthwhile. As we have pointed out, however, current high performance systems are being built with separate caches. This is probably because separate caches yield up to twice the cache bandwidth, a factor not considered in [7]. Also, hardware costs are now low enough to permit extra hardware for instruction caches (i.e., the amount of cache available for data is not necessarily reduced as an instruction cache is added). In a recent survey paper on caches [8], Smith studied separate instruction and data caches, but looked at a different set of problems than those discussed here. He considered ways data and instruction caches should be partitioned and whether duplicate entries should be allowed; all the instruction caches studied were set-associative with LRU replacement. Theoretical work has also been geared to data or combined caches. A good example is [9] where the independent reference model [10] is used.
Other related work is in the ATLAS system [11] where virtual memory page replacement algorithms attempted to recognize looping behavior. Since they were used at a higher level in the memory hierarchy, these methods could be implemented in software and could be more sophisticated than the hardware methods that must be used for cache replacement. In any case, the ATLAS method was later found to be of questionable value [12] . 
B. Optimum Replacement
In [13] , it is shown that a replacement strategy that maximizes hit ratio replaces the block that will be needed the farthest in the future. Of course, this cannot be practically implemented since it requires knowledge of the future, but it does provide an upper bound against which other replacement strategies can be measured. For the loop model and a fully associative cache one can derive a closed-form expression for the optimum steady state block hit ratio.
It is easiest to begin as the complex loop is entered. According to our initial condition assumption, the cache is full when the loop is begun, but all the M residual blocks are replaced by the first M blocks of the loop. Without loss of generality, assume that the cache blocks are filled in order (0, 1, ..., M - 1). Then, there are N - M further misses as the first pass through the loop completes. All these missed blocks go into cache block M - 1 if the optimum algorithm is followed. Then, on pass 2 through the loop, the first M - 1 blocks are all hits. Then, there are again N - M misses, all of which are placed in cache block M - 2. Reference N of pass 2 is still in cache block M - 1 and hits, as do the first M - 2 references of pass 3. This again gives M - 1 consecutive hits. This is followed by N - M misses, all of which go into cache block M - 3, etc. By formalizing the above argument, it can be shown that the hit-miss pattern in the steady state is always M - 1 hits followed by N - M misses. This leads to the following theorem.

... replacement policy gives a steady state hit ratio of (M - 1)/(M + 1).
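The steady state pattern derived above for optimum replacement (M - 1 hits followed by N - M misses) implies a closed-form block hit ratio. The following is a worked sketch from that pattern alone, for a loop larger than the cache (N > M); it is not the original wording of the theorem, and the symbol bhr_opt is introduced here only for illustration.

```latex
\[
\mathrm{bhr}_{\mathrm{opt}}
  = \frac{\text{hits per period}}{\text{references per period}}
  = \frac{M-1}{(M-1) + (N-M)}
  = \frac{M-1}{N-1}, \qquad N > M .
\]
% Special case of a loop one block larger than the cache:
\[
N = M + 1 \;\Longrightarrow\; \mathrm{bhr}_{\mathrm{opt}} = \frac{M-1}{M} .
\]
```

Under this sketch, a 4-block cache and a 5-block loop give an optimum block hit ratio of 3/4.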
Using Theorem 2 the optimum block hit ratio when
We see that for the case N = M + 1 the random replacement policy is reasonably close to optimum. When N ≤ M, it can be shown that the steady state hit ratio with random replacement becomes 1 with probability 1. For values of M and N where N > M + 1, we have not found a general closed form solution. We did find block hit ratios for several specific cases, however, and these are shown graphically in Fig. 5. Fig. 5(a) assumes a cache with 4 blocks and shows block hit ratios for loops that vary in size from 3 blocks to 12 blocks. Loops of size less than 3 have block hit ratios of 1. Fig. 5(b) assumes a cache with 16 blocks and shows block hit ratios for loops that vary in size from 15 blocks to 24 blocks. Loops of size less than 15 have block hit ratios of 1. Along with the block hit ratios for random replacement, the block hit ratios for optimum and LRU/FIFO replacement are also shown in Fig. 5. In addition, results for a direct-mapped cache, to be discussed in the next section, are given. The superiority of random replacement over LRU/FIFO is clearly shown. In addition, random replacement is shown to be reasonably close to optimum for smaller loops, but as loop sizes become larger, hit ratios for random replacement tend to fall off faster.
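Comparisons of this kind can be reproduced approximately with a small simulation. The sketch below is not the procedure used to produce Fig. 5; it assumes a loop of N distinct program blocks referenced cyclically, a fully associative cache of M blocks, and a warm-up period that is discarded to approximate the steady state. The function name, the policy strings, and the `passes`, `warmup`, and `seed` parameters are all hypothetical.

```python
import random

def loop_block_hit_ratio(n_blocks, cache_blocks, policy,
                         passes=2000, warmup=100, seed=0):
    """Estimate the steady state block hit ratio on a cyclic loop.

    policy is one of "lru", "fifo", "random", "opt" (labels chosen here).
    """
    rng = random.Random(seed)
    cache = []                    # resident blocks; front of list is oldest
    hits = refs = 0
    for p in range(passes):
        for block in range(n_blocks):
            hit = block in cache
            if p >= warmup:       # count only after the warm-up passes
                refs += 1
                hits += hit
            if hit:
                if policy == "lru":           # refresh recency on a hit
                    cache.remove(block)
                    cache.append(block)
                continue
            if len(cache) >= cache_blocks:
                if policy in ("lru", "fifo"):
                    cache.pop(0)              # evict the oldest entry
                elif policy == "random":
                    cache.pop(rng.randrange(len(cache)))
                elif policy == "opt":
                    # evict the block whose next reference is farthest away
                    victim = max(cache, key=lambda b: (b - block) % n_blocks)
                    cache.remove(victim)
                else:
                    raise ValueError(policy)
            cache.append(block)
    return hits / refs
```

Under these assumptions, LRU and FIFO return 0 for any loop larger than the cache, the optimum policy approaches (M - 1)/(N - 1), and random replacement with N = M + 1 lands near (M - 1)/(M + 1).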
III. DIRECT-MAPPED CACHE
In a direct-mapped cache, a program block can only be placed in one cache block; hence, there is only one possible replacement policy. Nevertheless, it is interesting to see how well a direct-mapped cache performs under the loop model. The discussion in this section assumes that addresses are broken up into fields as in Fig. 1(b).

For the special case N = M + 1, Theorem 4 gives a hit ratio of (M - 1)/(M + 1), the same as for random replacement in a fully associative cache. For 2M ≤ N it is always worse than random since random is nonzero for any finite M and N. This, as well as the data in Fig. 5, leads us to conclude that for the simple loop model, using a direct-mapped cache gives no better performance than a fully associative cache using random replacement. This conclusion is somewhat weaker than the others because it is not proved for all combinations of M and N.
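The two cases quoted above can be checked with a short counting argument. This is a sketch under stated assumptions, not the original statement of Theorem 4: it assumes the simple loop occupies N consecutive program blocks and that program block b maps to cache block b mod M. The symbol bhr_dm is introduced here only for illustration.

```latex
% For M <= N <= 2M, exactly N - M cache blocks are shared by two loop blocks,
% which evict each other on every pass: 2(N - M) misses per pass.
% The remaining N - 2(N - M) = 2M - N loop blocks always hit.
\[
\mathrm{bhr}_{\mathrm{dm}} = \frac{2M - N}{N}, \qquad M \le N \le 2M ,
\]
\[
N = M + 1 \;\Longrightarrow\; \mathrm{bhr}_{\mathrm{dm}} = \frac{M-1}{M+1},
\qquad
N \ge 2M \;\Longrightarrow\; \mathrm{bhr}_{\mathrm{dm}} = 0 .
\]
```

The first special case agrees with the value (M - 1)/(M + 1) given in the text, and the second is the regime in which every cache block is contested by at least two loop blocks.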
The complex loop model is more difficult to analyze because program blocks do not necessarily map into the cache blocks in any regular way. In the worst case, all N blocks map into the same cache block, and the block hit ratio is 0, even if 1 < N ≤ M. In the best case, M - 1 of the program blocks map into M - 1 different cache blocks, and the remaining program blocks all map into the single remaining cache block. The block hit ratio is (M - 1)/N, and is close to optimum. Hence, for the complex loop model, we can only observe that the particular assignment of program blocks to cache blocks determines whether a direct-mapped cache or an associative cache works better.
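The dependence on the mapping can be explored with a minimal sketch that takes the assignment of loop blocks to cache blocks as an explicit input. The function name `direct_mapped_bhr` and the list `slot_of_block` are hypothetical; the sketch assumes each loop block is referenced once per pass, in order, and discards a few warm-up passes.

```python
def direct_mapped_bhr(slot_of_block, passes=50, warmup=5):
    """Steady state block hit ratio of a direct-mapped cache on a loop.

    slot_of_block[i] is the cache block that loop block i maps to
    (an explicit mapping supplied by the caller, assumed for illustration).
    """
    cache = {}                    # cache block index -> resident loop block
    hits = refs = 0
    for p in range(passes):
        for block, slot in enumerate(slot_of_block):
            hit = cache.get(slot) == block
            if p >= warmup:       # count only after the warm-up passes
                refs += 1
                hits += hit
            cache[slot] = block   # on a miss the block replaces the occupant
    return hits / refs
```

With M = 4 and N = 6, the worst-case mapping `[0] * 6` gives 0, while the best-case mapping `[0, 1, 2, 3, 3, 3]` gives 3/6, matching (M - 1)/N.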
IV. SET-ASSOCIATIVE CACHE
We now show ways some of the previous results can be extended to set-associative caches. When analyzing a set-associative cache, we come up against the same problem we did with a direct-mapped cache: for complex loops, the mapping of program blocks into sets must be known. Once the mapping is known, all the blocks mapping to a set behave as if they are in a fully associative cache consisting of one set. Then by collecting together the block hit ratios for each of the sets, the aggregate block hit ratio can be found by forming a weighted average.

Let bhr(P, B, S, R) be the block hit ratio for a fully associative cache with S blocks of size B, using replacement policy R, when P program blocks all map into the cache and follow the complex loop model.

Theorem 5: Under the complex loop model, for a set-associative cache of set size S, block size B, P_i program blocks mapping into set i, and replacement policy R, the block hit ratio is

$$\frac{1}{N} \sum_i P_i \, \mathrm{bhr}(P_i, B, S, R).$$
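The weighted average can be written out as a short sketch. It assumes each set is weighted by the fraction of loop blocks that map into it, and it folds the block size B and policy R into a caller-supplied per-set function. All names below are hypothetical. The two per-set functions are illustrative stand-ins: LRU on a loop larger than the set always misses, and the optimum value (S - 1)/(P - 1) is an assumption derived from a steady state pattern of S - 1 hits followed by P - S misses.

```python
def lru_loop_bhr(p, s):
    """Per-set ratio for LRU: all hits if the loop fits, otherwise all misses."""
    return 1.0 if p <= s else 0.0

def optimum_loop_bhr(p, s):
    """Per-set ratio for optimum replacement (assumed closed form)."""
    return 1.0 if p <= s else (s - 1) / (p - 1)

def set_associative_bhr(blocks_per_set, set_size, per_set_bhr):
    """Weighted average of per-set block hit ratios.

    blocks_per_set[i] is the number of loop blocks mapping into set i.
    """
    total = sum(blocks_per_set)   # loop size: every block maps to one set
    weighted = sum(p * per_set_bhr(p, set_size) for p in blocks_per_set)
    return weighted / total
```

For a 2-way cache with 2, 3, and 1 loop blocks mapping into its three sets, this gives 0.5 under LRU and 0.75 under the assumed optimum form.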
To take one example, consider a set-associativ...

Fig. 11. Block hit ratios for a 2-way set-associative cache with 16 byte blocks (horizontal axis: cache size in bytes).
performance with a direct-mapped cache. Our conclusion that a fully associative cache with random replacement is superior to a direct-mapped cache is also supported by the data. In no case was a fully associative cache with random replacement worse than a direct-mapped cache, and it was typically much superior.
Finally, there are some cache sizes where a set-associative cache with random replacement gives the best block hit ratio of all the organizations and replacement policies. This is a common occurrence for small caches and also tends to support our observations concerning simple loop behavior.
