In current processors, the cache controller, which contains the cache directory and other logic such as tag comparators, is active for each instruction fetch and is responsible for 20-25% of the power consumed in the I-cache. Reducing the power consumed by the cache controller is important for low power I-cache design. We present three architectural modifications which, in concert, allow us to reduce the cache controller activity to less than 2% for most applications. The first modification involves comparing cache tags for only those instructions that result in fetches from a new cache block. The second modification involves the tagging of those branches that cause instructions to be fetched from a new cache block. The third modification involves augmenting the I-cache with a small on-chip memory called the S-cache. The most frequently executed basic blocks of code are statically allocated to the S-cache before program execution. We present empirical data to show the effect that these modifications have on the cache controller activity.
Introduction
Caches are an integral part of processors because of the increasing disparity between processor cycle time and memory access time. While caches are important for high performance, they are also important for low power since they reduce the amount of off-chip traffic. In a typical processor with a split cache architecture, the I-cache consumes more power than the D-cache because the I-cache is accessed for each instruction while the D-cache is accessed only for loads and stores. Since loads and stores constitute less than 25% of the executed instructions, the activity of the D-cache is less than 25% of the activity of the I-cache. The cache controller, which contains the cache directory and other logic such as tag comparators, can account for 20-25% of the power consumed in the I-cache. This work focuses on reducing the power consumption of the I-cache by reducing the controller activity through architectural modifications.
Partial support for this work was provided by the Office of Naval Research under grant N00014-91-J-1009.
The organization of this paper is as follows. Section 2 presents a discussion on prefetch buffers and shows why an I-cache with a prefetch buffer is not good for low power. The discussion in this section sets the stage for the rest of the paper, which assumes that low power processors will not have prefetch buffers. Section 3 presents the conditional tag compare scheme, whereby a cache directory lookup and tag compare is done only for those instructions that result in fetches from a new cache block. This scheme requires minor hardware modifications involving a few gates; hence its cost is negligible. This scheme also relies on an architectural modification called branch tagging, whereby branches that result in an instruction fetch from a new cache block are tagged. This entails a change to the instruction set architecture. However, since most instruction sets have unused space in the fields of the branch instruction, this modification doesn't incur any extra cost. Section 4 presents the S-cache, which is a small on-chip memory that augments the I-cache. The most frequently executed basic blocks are statically allocated to the S-cache before program execution. The S-cache obviates the need for tag compares for the instructions in these basic blocks. The impact of all three modifications has been evaluated empirically using several benchmarks. The benchmarks and the empirical protocol are presented in Section 6. Finally, we conclude the paper with a summary of the results.
Prefetch buffers
Most recent processors such as the PowerPC [6] and the Pentium [4] have prefetch buffers, which allow an entire block of data to be fetched from the I-cache on a single access. The prefetch buffer allows the fetch unit to build a stockpile of instructions into which it is able to look ahead to detect branch instructions early. When a branch is encountered, the prefetch buffer can be used to continue feeding the pipeline while the branch target is fetched. Prefetch buffers can thus reduce branch penalties in processors. While prefetch buffers are good for performance, they are bad for power, as can be shown using the metrics that follow (see footnote 1).
To show the impact of prefetch buffers on power consumption, we ignore the cache controller since its activity can be reduced to less than 2% using the techniques that follow. The dominant contribution to the power consumption in the I-cache is then due to the bit line capacitance and the sense amplifiers [3]. If we consider the energy required to fetch a block of instructions, then it makes no difference whether the instructions in the block are read out one at a time or whether the entire block of instructions is read out in a single cycle. Consider a prefetch buffer that can hold $s_B$ instructions. Let the hit rate of the prefetch buffer be $h_B$. If an application executes $n$ instructions, then $n(1 - h_B)$ I-cache fetches will be required. Since $s_B$ instructions are fetched on each access, the total number of instructions fetched from the I-cache is $n(1 - h_B)s_B$. The impact of the prefetch buffer on the power consumption can thus be characterized by a metric called the buffer traffic ratio $t_B$, which is defined as the ratio of the number of instructions fetched from the I-cache in the absence of a prefetch buffer to the number of instructions fetched in the presence of a prefetch buffer. Clearly, the buffer traffic ratio is given by:

$$t_B = \frac{n}{n(1 - h_B)\,s_B} = \frac{1}{(1 - h_B)\,s_B}$$

Figure 2 shows the buffer traffic ratios for some applications as a function of the buffer size. The applications used are the same applications that we will use later in Section 6.

Footnote 1: The best metric for evaluating the impact of any architectural feature would be the power or energy consumed by representative applications. However, direct power or energy metrics require many implementation details. To avoid making implementation assumptions, we shall use indirect metrics in our paper. These metrics can be easily translated into direct power or energy metrics when the implementation details are known.
The figure shows that even a small prefetch buffer that holds only four instructions has a very detrimental effect on the power consumption of the I-cache, since the number of instructions fetched from the I-cache increases by at least 60%. The detrimental effect of the prefetch buffer on the I-cache power increases with increasing buffer size, since the likelihood of prefetching useless instructions into the buffer increases. A prefetch buffer that can hold 16 instructions will increase the I-cache power consumption by 1000%. Clearly, prefetching the entire cache block is not a good idea for low power design. In low power processors, instructions should be fetched on a cycle-by-cycle basis from the I-cache. When instructions are fetched on a cycle-by-cycle basis, the cache controller is active on each cycle and can account for 20-25% of the power consumed in the I-cache. The techniques presented in the rest of this paper show how the cache controller activity can be reduced to less than 2% even when instructions are fetched on a cycle-by-cycle basis.
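The buffer traffic ratio defined in this section can be explored numerically. The following is a minimal sketch, not the authors' evaluation code; the function names and the (buffer size, hit rate) pairs are hypothetical and chosen only for illustration.

```python
def buffer_traffic_ratio(hit_rate: float, buffer_size: int) -> float:
    """Return t_B = 1 / ((1 - h_B) * s_B) for a prefetch buffer.

    hit_rate    -- h_B, the hit rate of the prefetch buffer, in [0, 1)
    buffer_size -- s_B, the number of instructions the buffer holds
    """
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    return 1.0 / ((1.0 - hit_rate) * buffer_size)


def extra_fetch_percent(hit_rate: float, buffer_size: int) -> float:
    """Percentage increase in instructions read from the I-cache
    relative to fetching one instruction per cycle without a buffer."""
    return (1.0 / buffer_traffic_ratio(hit_rate, buffer_size) - 1.0) * 100.0


# Hypothetical (s_B, h_B) pairs; these are not measurements from the paper.
for s_b, h_b in [(4, 0.60), (8, 0.70), (16, 0.75)]:
    print(s_b, h_b, buffer_traffic_ratio(h_b, s_b), extra_fetch_percent(h_b, s_b))
```

A ratio below 1 means the buffer causes more instructions to be read from the I-cache than are executed; a ratio of exactly 1 corresponds to fetching one instruction per access.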
Conditional tag compares
Cache directory lookups and tag compares do not need to be done for all instruction fetches. Let us consider two instructions i and j where the execution of j immediately follows the execution of i. We need to consider four cases. The first case, called intrablock non-sequential flow, occurs when i is a taken branch instruction with j as its target and i and j reside in the same cache block. The second case, called intrablock sequential flow, occurs when i is a non-branch or untaken branch instruction and i and j reside in the same cache block. The third case, called interblock non-sequential flow, occurs when i is a taken branch instruction with j as its target and i and j reside in different cache blocks. The fourth case, called interblock sequential flow, occurs when i is a non-branch or untaken branch instruction and i and j reside in different cache blocks. If i and j reside in the same cache block, then a tag compare for j is not necessary since the block containing j must be present in the cache. Thus, a cache directory lookup and tag compare is required only for interblock non-sequential flow and interblock sequential flow.
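The four cases above can be expressed as a small classifier. This is an illustrative sketch only: the names `Flow`, `classify_flow` and `needs_tag_compare` are hypothetical, and byte addresses with a power-of-two block size are assumed.

```python
from enum import Enum


class Flow(Enum):
    INTRABLOCK_NON_SEQUENTIAL = 1
    INTRABLOCK_SEQUENTIAL = 2
    INTERBLOCK_NON_SEQUENTIAL = 3
    INTERBLOCK_SEQUENTIAL = 4


def classify_flow(addr_i: int, addr_j: int, taken_branch: bool,
                  block_size: int) -> Flow:
    """Classify the transition from instruction i to instruction j.

    addr_i, addr_j -- byte addresses of i and j (assumed byte-addressed PC)
    taken_branch   -- True if i is a taken branch with j as its target
    block_size     -- cache block size in bytes
    """
    same_block = (addr_i // block_size) == (addr_j // block_size)
    if same_block:
        return (Flow.INTRABLOCK_NON_SEQUENTIAL if taken_branch
                else Flow.INTRABLOCK_SEQUENTIAL)
    return (Flow.INTERBLOCK_NON_SEQUENTIAL if taken_branch
            else Flow.INTERBLOCK_SEQUENTIAL)


def needs_tag_compare(flow: Flow) -> bool:
    """Only interblock flow requires a directory lookup and tag compare."""
    return flow in (Flow.INTERBLOCK_NON_SEQUENTIAL, Flow.INTERBLOCK_SEQUENTIAL)
```

For example, with 32-byte blocks a sequential step from address 0x11C to 0x120 crosses a block boundary and needs a tag compare, while a step from 0x100 to 0x104 does not.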
Interblock non-sequential flow
A non-sequential instruction fetch is directly indicated by the control signal that loads the program counter (PC). Interblock non-sequential flow can be indicated by branch tagging, whereby unused space in the opcode is used as a compiler hint that the branch will transfer control to an instruction outside the current cache block. Table 1 shows the branch instruction frequency $f_1$ in the four applications that we used in our study. The probability $p_f$ of a branch being a forward branch is also shown in the table. Table 2 shows the probability $p_i$ of a forward branch causing interblock flow as a function of the cache block size in bytes. We have found that all backward branches get captured by the S-cache, which we present in the next section. Hence we are concerned only with forward branches for interblock non-sequential flow.

Table 2: Forward branch interblock flow probability

[...] size of the basic blocks is increased [5]. A guarded instruction is a normal instruction augmented with a guard condition specifier. The instruction is executed if the guard condition evaluates to true; otherwise the instruction is treated as a NOP. Some of the recent RISC architectures such as the SPARC V9 [7] provide simple guarded instructions like conditional moves.
Interblock sequential flow
Interblock sequential flow can be detected quite simply by looking at a single bit of the program counter (PC) and EXORing it with the value of the bit from the previous cycle. Let the cache block size be $2^k$ bytes. If the bits of the PC are labeled 31, 30, ..., 1, 0, with 0 being the least significant bit, then interblock sequential flow can be detected by EXORing bit $k$ of the PC with the previous value of the bit. The frequency $f_2$ of tag compares necessitated by interblock sequential flow depends on the size of the cache block. It can be shown that this frequency is given by the following expression:
where $f_1$ is the frequency of non-sequential instruction fetches and $s_B$ is the number of instructions contained in a single cache block.
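One form consistent with these definitions can be obtained by a simple counting argument. It is offered here as an assumption, not as the paper's own expression: a fraction $1 - f_1$ of instruction fetches is sequential, and if those fetches are spread uniformly over the $s_B$ positions within a block, about one sequential fetch in $s_B$ crosses a block boundary.

```latex
% Assumed form: sequential fetches (1 - f_1), one in s_B crosses a block boundary
f_2 \approx \frac{1 - f_1}{s_B}
```

Under this assumption, $f_1 = 0.1$ and $s_B = 8$ give $f_2 \approx 0.11$.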
It can be seen that the hardware overhead for doing conditional tag compares is quite minimal. A signal can be generated on a cycle-by-cycle basis as to whether a tag compare is needed, and the I-cache controller enabled accordingly. Otherwise the cache controller can be disabled and the tag array can be kept in a state of constant precharge. We now present the S-cache, which will allow us to reduce the cache controller activity even further.
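The per-cycle enable signal described in this section can be modeled in software. This is a behavioral sketch under stated assumptions, not a hardware description: the function and parameter names are hypothetical, the PC is assumed to be byte-addressed, and `branch_tagged` stands for the compiler-provided tag bit of the branch that loaded the PC.

```python
def crosses_block_boundary(prev_pc: int, pc: int, k: int) -> bool:
    """EXOR bit k of the PC with its value in the previous cycle.

    Valid for sequential flow only, where the PC advances by one
    instruction and bit k toggles exactly when a 2**k-byte block
    boundary is crossed.
    """
    return bool(((prev_pc >> k) & 1) ^ ((pc >> k) & 1))


def tag_compare_enable(prev_pc: int, pc: int, k: int,
                       pc_load: bool, branch_tagged: bool) -> bool:
    """Decide whether the cache controller must be active this cycle.

    pc_load       -- True when the PC was loaded (non-sequential fetch)
    branch_tagged -- tag bit of the branch, meaningful only if pc_load
    """
    if pc_load:
        # Non-sequential flow: rely on the branch tag as the interblock hint.
        return branch_tagged
    # Sequential flow: compare bit k with its previous value.
    return crosses_block_boundary(prev_pc, pc, k)
```

With 32-byte blocks (k = 5) and 4-byte instructions, the sequential step from 0x1C to 0x20 enables the controller, while the step from 0x20 to 0x24 leaves it disabled.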
S-cache
The Pareto principle applies to program execution since programs spend most of their execution time within a few basic blocks of code. This principle can be exploited to reduce the cache controller activity even further by augmenting the I-cache with a small on-chip memory in which frequently executed basic blocks of code are stored before program execution. This small memory is called the S-cache since basic blocks of code are statically cached in this memory. On-chip program memories are found in many digital signal processors. Our S-cache was inspired by the writable control store (WCS) of some earlier machines. The difference between the S-cache and the WCS of earlier machines is that the WCS was used to store microinstructions while the S-cache is used to store macroinstructions. The S-cache may even store decoded instructions to bypass the instruction decoding stage and save power.
Size
The issue of the size of the S-cache can be addressed by observing that only processes that run for significant amounts of time, and thus consume significant amounts of energy, should request space allocation in the S-cache. Such processes, which can request space in the S-cache, are called PSC (Possibly Statically Cached) processes. The linker/loader needs to make the appropriate patches to a PSC executable when it is loaded for running. Some processes do not need to have basic blocks allocated to the S-cache. Such processes, which use only the I-cache for execution, are called PDC (Purely Dynamically Cached) processes. As an example, the process associated with the Unix hostname command executes fewer than 2000 instructions and should be compiled as a PDC executable. In Section 5, we will show that a space allocation of 4 KB in the S-cache is sufficient for most processes. Thus, if the S-cache is required to support the simultaneous execution of $n_P$ PSC processes, then a size of $4n_P$ KB is sufficient. If more than $n_P$ PSC processes need to run simultaneously, then they can still do so, except that only $n_P$ of them will be able to get their basic blocks allocated to the S-cache. The other PSC processes can either wait until space is available in the S-cache or go ahead and execute without using the S-cache. The kernel may implement other policies such as swapping out the S-cache state of a PSC process to allow another PSC process to use the S-cache. The parameter $n_P$ is crucial in determining the size of the S-cache. Various tradeoffs will need to be made by the system architect in arriving at a reasonable value for this parameter. Our guess is that $n_P = 4$ would suffice for most portable applications.
Allocation
[...] The integer $S$ represents the size of the per-process space in the S-cache. We have stated above that a good choice of the per-process space would be 4 KB, which corresponds to 1024 instructions. The 0-1 knapsack problem can be solved optimally by using dynamic programming [2].
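A 0-1 knapsack selection of basic blocks by dynamic programming can be sketched as follows. The exact formulation used in the paper is not reproduced in this text, so the sketch makes its own assumptions: each block has a size in instructions (its weight) and a profiled execution count, its value is taken to be the number of dynamic instructions it contributes (size times count), and the capacity is the per-process space $S$ in instructions. The function name `allocate_blocks` is hypothetical.

```python
from typing import List, Tuple


def allocate_blocks(blocks: List[Tuple[int, int]],
                    capacity: int) -> Tuple[int, List[int]]:
    """Choose basic blocks for the S-cache by 0-1 knapsack.

    blocks   -- list of (size_in_instructions, execution_count)
    capacity -- per-process S-cache space S, in instructions
    Returns (dynamic instructions covered, indices of chosen blocks).
    """
    n = len(blocks)
    # best[i][c]: maximum value using the first i blocks within capacity c.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i, (size, count) in enumerate(blocks, start=1):
        value = size * count  # assumed value: dynamic instructions served
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if size <= c:
                candidate = best[i - 1][c - size] + value
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Walk the table backwards to recover which blocks were selected.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= blocks[i - 1][0]
    chosen.reverse()
    return best[n][capacity], chosen
```

The table has one row per block and one column per instruction slot, so with $S = 1024$ the cost grows linearly with the number of basic blocks in the profile.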
Block patching
Since basic blocks of code are moved to the S-cache, some blocks may need to be patched to ensure correct execution. Only blocks that are terminated by conditional branches need patches. A block $b_i$ that is terminated by a conditional branch needs to be patched if it is marked for S-cache allocation and its succeeding block $b_j$ is not. The patch to such a basic block $b_i$ involves the addition of a jump instruction as the last instruction of the block. This jump instruction transfers control from block $b_i$ to block $b_j$, so that correct execution is ensured when the conditional branch, which is now the penultimate instruction in block $b_i$, is not taken and execution falls through to the patching jump instruction. Likewise, a block $b_i$ that is not marked for S-cache allocation needs to be patched if its succeeding block $b_j$ is marked for S-cache allocation.
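The patching rule above can be sketched as a pass over the basic blocks in their original layout order. The `BasicBlock` record and the function name are hypothetical, and the sketch follows the rule exactly as stated in the text: only blocks ending in a conditional branch are considered, and a block is patched when it and its fall-through successor are placed in different memories.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BasicBlock:
    name: str
    in_scache: bool              # marked for S-cache allocation
    ends_with_cond_branch: bool  # terminated by a conditional branch


def blocks_needing_patch(blocks: List[BasicBlock]) -> List[str]:
    """Return names of blocks that need a trailing jump to their successor.

    blocks must be given in original program order, so that blocks[i + 1]
    is the fall-through successor of blocks[i].
    """
    patched = []
    for current, successor in zip(blocks, blocks[1:]):
        # A patch is needed when the fall-through path would cross
        # between the S-cache and the I-cache address ranges.
        if (current.ends_with_cond_branch
                and current.in_scache != successor.in_scache):
            patched.append(current.name)
    return patched
```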
Controller activity
The combined effect on the cache controller activity can be modeled analytically. Let us designate the S-cache hit rate as $h_S$. The cache controller is inactive when instructions are fetched from the S-cache. When instructions are fetched from the I-cache, the cache controller is active only when we have an interblock flow in the program execution. The frequency $f_T$ of interblock flow is given by:
The controller activity $a_C$ is merely $(1 - h_S) f_T$ and is given by:
If we choose some typical numbers for the parameters in the above equation, then we can obtain a rough estimate of the controller activity. If we choose $s_B = 8$, $f_1 = 0.1$, $p_i = 0.6$, $p_f = 0.8$, and $h_S = 0.8$, then the controller activity is around 3%. This implies that the combination of the techniques presented in this paper can almost completely eliminate the cache controller activity in I-caches. We now present the empirical validation of this result.
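The rough estimate can be checked numerically. The text states only that $a_C = (1 - h_S) f_T$; the breakdown of $f_T$ used below is an assumption made for this sketch, namely that interblock non-sequential flow occurs with frequency $f_1 p_f p_i$ (forward branches that leave the block) and interblock sequential flow with frequency $(1 - f_1)/s_B$. The function names are hypothetical.

```python
def interblock_frequency(f1: float, p_f: float, p_i: float, s_b: int) -> float:
    """Assumed form of f_T: tagged forward branches plus sequential crossings."""
    non_sequential = f1 * p_f * p_i   # assumed: forward branches leaving the block
    sequential = (1.0 - f1) / s_b     # assumed: one sequential fetch in s_B crosses
    return non_sequential + sequential


def controller_activity(h_s: float, f1: float, p_f: float,
                        p_i: float, s_b: int) -> float:
    """a_C = (1 - h_S) * f_T, as stated in the text."""
    return (1.0 - h_s) * interblock_frequency(f1, p_f, p_i, s_b)


# Parameter values quoted in the text.
a_c = controller_activity(h_s=0.8, f1=0.1, p_f=0.8, p_i=0.6, s_b=8)
print(f"controller activity: {a_c:.4f}")
```

With the quoted parameters this assumed form evaluates to about 0.032, which agrees with the figure of around 3% given in the text.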
Table 3: Number of instructions executed in the benchmark runs (columns: Run, Instructions)

The efficiency of the techniques was evaluated using four benchmarks. The benchmarks used in this study were espresso, a PLA minimization program; wacc, a lossy image compression program using wavelets and arithmetic coding; gzip, the GNU lossless compression program using a Lempel-Ziv compressor; and latex, a program for document preparation. Each benchmark was run with seven different inputs since our hypothesis was that a single run would not provide sufficient confidence for basic block allocation in the static cache. It appeared at the outset of the experiment that a program like espresso, which implements several different algorithms for PLA minimization, can use different algorithms depending on the nature of the PLAs and hence emphasize different basic blocks. Likewise, for a program like latex, two inputs that differ significantly, such as one requiring a lot of mathematical typesetting but having no figures or tables and the other requiring very little mathematical typesetting but having a lot of figures and tables, can emphasize different basic blocks. After running an extensive set of simulations, we have concluded that even a single run provides a very good estimate of the relative execution frequency of the basic blocks as would be experienced as an average over all possible runs. Our experiments resulted in roughly 2.6 billion instructions being executed, which is an average of 93.5 million instructions per run. Table 3 shows the number of instructions executed in the various benchmark runs. Shade, an instruction-level tracing tool [1], was used for instruction tracing on SPARC workstations running SunOS 5.3. Shade allowed us to do custom analysis of the instruction traces for the SPARC V8 architecture.
Results
We first investigated the applicability of the Pareto principle for program execution. The Pareto graph for the wacc benchmark is shown in Figure 3. The Pareto graph gives a first cut estimate of the expected S-cache hit ratio. From the Pareto graph for wacc, it appears that even an S-cache allocation of 1 KB will result in a hit rate greater than 90%. Figure 4 shows the number of basic blocks that can be allocated in the S-cache for varying amounts of per-process allocation space for espresso. The figure also shows the number of blocks that require patches. We find that about 150 basic blocks can be allocated to the S-cache for a per-process allocation of 4 KB. The identity of these blocks was determined by dynamic programming as indicated above. Tables 4 and 5 show the cache controller activity $a_C$ for espresso and latex. Due to constraints of space, we have not shown the results for wacc and gzip. The results for wacc and gzip are even better than those of espresso and latex that are shown here. The activity is shown as a function of the per-process allocation in the S-cache and the block size of the I-cache. The experimental data shows that the S-cache in conjunction with conditional tag compares can almost completely eliminate the power overhead of the cache controller by rendering it inactive 98% of the time for an S-cache per-process allocation of 4 KB and 64-byte blocks in the I-cache.
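A first cut hit-rate estimate in the spirit of the Pareto graph can be computed from a basic block profile. This sketch is an assumption about how such an estimate might be formed, not the procedure used for Figure 3: blocks are ranked by execution count, the hottest blocks that fit are placed in the S-cache, and the hit rate is the share of dynamic instructions they cover. The function name and the sample profile are hypothetical.

```python
from typing import List, Tuple


def estimate_hit_rate(blocks: List[Tuple[int, int]], capacity: int) -> float:
    """Estimate the S-cache hit rate for a given per-process allocation.

    blocks   -- list of (size_in_instructions, execution_count)
    capacity -- S-cache space in instructions
    """
    total = sum(size * count for size, count in blocks)
    if total == 0:
        return 0.0
    used = 0
    covered = 0
    # Rank blocks from most to least frequently executed.
    for size, count in sorted(blocks, key=lambda b: b[1], reverse=True):
        if used + size <= capacity:  # skip blocks that no longer fit
            used += size
            covered += size * count
    return covered / total


# Hypothetical profile, for illustration only.
profile = [(4, 100), (3, 10), (5, 90), (2, 500)]
print(estimate_hit_rate(profile, 8))
```

This greedy ranking is only an approximation; an optimal selection for a fixed capacity requires the knapsack formulation solved by dynamic programming.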
Conclusion
We presented a case against prefetch buffers in low power processors. Since the lack of prefetch buffers will necessitate instruction fetching on a cycle-by-cycle basis from the I-cache, the I-cache controller can be active 100% of the time unless we adopt architectural techniques to reduce the controller activity. The I-cache controller can consume 20-25% of the I-cache power if it is active 100% of the time. We have presented three architectural modifications that, when used in concert, reduce the I-cache controller activity to less than 2% for most applications. We have presented empirical data using benchmarks with over 2.6 billion instructions in the address traces to justify our claims of the efficiency of the proposed techniques.
