Leakage energy will be the major energy consumer in future deep sub-micron designs. The memory sub-system of future SoCs in particular will be negatively affected by this trend. In order to reduce leakage energy, memory banks are transitioned to a low-energy state when possible. This transition itself costs some energy, which is termed the transition energy.
Introduction
Ever-thinning gate oxides in CMOS transistors have introduced gate leakage (through tunneling) in addition to sub-threshold leakage (due to lower supply voltages). Alarming data presented in [2] indicates that leakage power would jump from 50% to 460% of active power as technology moves from 65 nm to 45 nm. It thus becomes crucial to integrate circuit-level leakage-current reduction techniques and architecture-level design techniques into system-level design flows in order to tackle static power consumption. Over the past decade, researchers have paid increasing attention to static power reduction in logic circuits ([16], [6], [19], [11], etc.). These circuit-level techniques, while successful in reducing leakage-power costs, bring new challenges, namely the transition-energy costs that are incurred when circuits switch to shut-down modes. Since caches occupy a considerable amount of chip area, they also account for a significant share of the transition-energy costs. In this paper we present, as the first approach of its kind, a novel energy-saving replacement policy for instruction caches called LRU-SEQ. It effectively reduces static energy consumption and thereby reduces total cache energy by 20%-28% (with an average of 23%). The paper is structured as follows: related work is described in Sec. 2, the motivation for transition-energy reduction through architecture-level policies is presented in Sec. 3, and our novel LRU-SEQ cache organization is presented in Sec. 4. Our evaluation platform is presented in Sec. 5, followed by results in Sec. 6. We conclude in Sec. 7.
Circuit-level leakage reduction techniques can be broadly classified into two categories. [...] The minimum number of cycles for which the circuit has to remain in a low-power mode to overcome this energy overhead is termed the break-even time. As circuit-level leakage-reduction techniques become more advanced and faster (in terms of their latency), circuits will transition to/from low-power modes more often. This will eventually increase the significance of the contribution of E_tr to the total energy.
In this paper, we introduce a novel architecture-level policy that reduces the transition energy in cache sub-banks by redirecting sequential cache fills to the last bank accessed. By regrouping sequential accesses in this manner, the policy not only reduces inter-bank transitions but also increases the chances that a bank can be shut down for a longer period and thus meet the requirements of the break-even time. Leakage energy is thereby minimized. We describe the motivation for such a policy in the next section, followed by its implications for the hit rate and its hardware implementation.
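The break-even time can be made concrete with a small calculation. The sketch below assumes a simple linear model that is not taken from the paper: a sleeping bank saves a fixed amount of leakage energy per cycle, and each sleep/wake round trip costs a fixed transition energy. The function names, parameter names, and numeric values are illustrative assumptions.

```python
import math


def break_even_cycles(e_transition: float, leak_saving_per_cycle: float) -> int:
    """Smallest whole number of sleep cycles whose leakage savings
    cover the transition cost (assumed linear model)."""
    if leak_saving_per_cycle <= 0:
        raise ValueError("sleeping must save some leakage energy per cycle")
    return math.ceil(e_transition / leak_saving_per_cycle)


def net_saving(sleep_cycles: int, e_transition: float,
               leak_saving_per_cycle: float) -> float:
    """Energy gained (positive) or lost (negative) by one sleep episode."""
    return sleep_cycles * leak_saving_per_cycle - e_transition


# Hypothetical numbers in arbitrary energy units.
print(break_even_cycles(10.0, 2.5))   # cycles needed to break even
print(net_saving(3, 10.0, 2.5))       # a sleep shorter than break-even loses energy
print(net_saving(8, 10.0, 2.5))       # a longer sleep gains energy
```

Under this model, any sleep episode shorter than the break-even time costs more energy than it saves, which is why lengthening the time between bank transitions matters as much as reducing their number.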
Motivation and Basic Idea
In contrast to traditional approaches that employ heuristics to determine less frequently used cache regions, we revisit the replacement policy. The cache activity (at the granularity of a cache sub-bank) is thereby altered in such a way that any circuit-level leakage-reduction technique can profit from it. The change of policy is motivated by the following two observations (we concentrate only on state-preserving leakage-reduction techniques in this paper).
Consider a 2-way set-associative instruction cache to which the basic blocks B2 and B10 (address range: 76-88) would map as shown in (b). B2 has a partial conflict with B10 at address 28. Now consider a run-time snapshot of the cache as shown in (c). At this point, the instruction at address 24 has been executed and the instruction at address 28 has to be fetched from memory. Assume that way-2 holds the LRU position for that cache line. The conventional LRU policy would place the fetched instructions in way-2 (d). A subsequent loop execution would result in 100 bank transitions. Also, assuming that a) leakage-reduction techniques have been applied at the granularity of a bank, and b) a bank is put to sleep as soon as another bank is activated, the break-even time should be less than 3 cycles (with respect to way-2) in order to yield an advantage (16, 20, 24). The proposed policy would instead constrain a sequential fill to the same bank that was previously accessed. This would place the instruction at 28 in the same bank as (16, 20, 24), thus eliminating the bank transitions. Such a placement also automatically eliminates the policy constraints imposed by the break-even time.
It has to be noted that the motivation behind LRU-SEQ is valid only for instruction caches. Its validity for data caches has to be evaluated and is outside the scope of this paper. We now present a formal definition of our policy as well as the cache organization.
The LRU-SEQ Cache
This section describes the LRU-SEQ policy along with the assumptions and some necessary hardware modifications.
System/Cache architecture: We assume that a set-associative cache is structurally divided into banks, where each set/way corresponds to one or more banks (Fig. 3). We have con-
LRU-SEQ policy: Assume that the way and line (index) accessed in the previous cache access cycle (a read or a write) are stored as P_way and P_line, and that the current access is denoted by C_way and C_line, respectively. LRU-SEQ then operates as shown in Fig. 2. SEQ-DST is the maximum distance (in cache lines) between P_line and C_line beyond which traditional LRU way selection takes over from LRU-SEQ way selection. The reason for still using the LRU state for far neighbors (a jump beyond SEQ-DST) is to retain, in part or in whole, the temporal-locality advantages inherent to the LRU policy, as well as to avoid scenarios where cache fills are concentrated in a single bank (which would undermine the advantages of the associativity offered by the cache).
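Since Fig. 2 is not reproduced here, the following is only one plausible reading of the way-selection rule, written as a minimal Python sketch. It assumes that the distance is the absolute difference between the previous and current line indices, and that a distance of zero falls back to LRU so that the block just accessed is not evicted. Both choices, and the name `select_fill_way`, are assumptions rather than details confirmed by the paper.

```python
def select_fill_way(c_line: int, p_line: int, p_way: int,
                    lru_way: int, seq_dst: int = 1) -> int:
    """Choose the way to fill on a cache miss.

    c_line  -- line index of the current (missing) access
    p_line  -- line index of the previous access
    p_way   -- way used by the previous access
    lru_way -- way that plain LRU would evict in the current set
    seq_dst -- maximum line distance still treated as sequential
    """
    distance = abs(c_line - p_line)          # assumed distance metric
    if 0 < distance <= seq_dst:
        return p_way                         # sequential fill: stay in the same bank
    return lru_way                           # far jump (or same line): plain LRU


# A fill to the line right after the previous one stays in the previous way.
print(select_fill_way(c_line=7, p_line=6, p_way=0, lru_way=1))
# A jump beyond SEQ-DST falls back to the LRU victim.
print(select_fill_way(c_line=9, p_line=6, p_way=0, lru_way=1))
```

Raising `seq_dst` widens the set of fills that are steered into the previously accessed way, at the cost of relying less on the LRU state.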
Hardware modifications: In order to implement our policy, we need some logic to track the cache way and cache line accessed in the previous cache access cycle. This data is stored in two additional registers as shown in Fig. 3 (for simplicity, details are shown for a single way of the cache). During a cache hit, P_way is updated. During a cache miss, C_line is compared to P_line to determine whether LRU or LRU-SEQ way selection has to be applied. Since this logic is activated during a cache miss, any delay it imposes is overlapped with memory cycles. In summary, there is no timing disadvantage, and the energy overhead of this additional operation is virtually unnoticeable as a percentage of the total energy consumption of the cache. The next section evaluates our methodology in various scenarios with diverse parameters.
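To illustrate how the two tracking registers interact with way selection, here is a small behavioral model in Python. It is a sketch under several assumptions not stated in the paper: each way is one bank, both registers are refreshed after every access, valid bits are ignored (an empty way is not preferred over the sequential way), and a line distance of zero falls back to LRU. The class name `LruSeqCache` and all field names are hypothetical.

```python
class LruSeqCache:
    """Toy set-associative cache with optional sequential-fill way selection."""

    def __init__(self, num_lines, num_ways, line_bytes, seq_dst=1, use_seq=True):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.seq_dst = seq_dst
        self.use_seq = use_seq
        # tags[line][way] is the stored tag, or None while the entry is empty.
        self.tags = [[None] * num_ways for _ in range(num_lines)]
        # lru[line] lists the ways from least to most recently used.
        self.lru = [list(range(num_ways)) for _ in range(num_lines)]
        self.p_way = None      # register: way of the previous access
        self.p_line = None     # register: line index of the previous access
        self.hits = 0
        self.misses = 0
        self.bank_transitions = 0

    def _victim(self, line):
        """Pick the way to fill on a miss."""
        sequential = (self.use_seq and self.p_line is not None
                      and 0 < abs(line - self.p_line) <= self.seq_dst)
        return self.p_way if sequential else self.lru[line][0]

    def access(self, address):
        """Simulate one fetch and return the way (bank) that served it."""
        block = address // self.line_bytes
        line = block % self.num_lines
        tag = block // self.num_lines
        if tag in self.tags[line]:
            way = self.tags[line].index(tag)
            self.hits += 1
        else:
            way = self._victim(line)
            self.tags[line][way] = tag
            self.misses += 1
        # Mark the way as most recently used in this set.
        self.lru[line].remove(way)
        self.lru[line].append(way)
        # A change of way between consecutive accesses is a bank transition.
        if self.p_way is not None and way != self.p_way:
            self.bank_transitions += 1
        self.p_way, self.p_line = way, line
        return way


# Hypothetical trace: one warm-up fetch, then a two-instruction loop.
trace = [20] + [0, 4] * 10
lru = LruSeqCache(num_lines=4, num_ways=2, line_bytes=4, use_seq=False)
seq = LruSeqCache(num_lines=4, num_ways=2, line_bytes=4, use_seq=True)
for addr in trace:
    lru.access(addr)
    seq.access(addr)
print(lru.bank_transitions, seq.bank_transitions)
```

In this toy trace both configurations see the same hits and misses, but the sequential-fill variant keeps the loop in a single way, so the other bank is never woken during the loop.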
Evaluation of LRU-SEQ
This section describes our evaluation platform for LRU-SEQ, the tool flow and the energy model.
The evaluation platform
We have embedded two different flows in our platform (Fig. 4). The left side depicts a trace-based flow, i.e. an address trace is generated and afterwards analyzed, whereas [...]. We have used the Mediabench [17] benchmark set to evaluate our LRU-SEQ replacement policy on this platform.
Energy model
As mentioned earlier, the major goal is to explore the en-
Results
In this section, we present the experiments conducted using our LRU-SEQ policy. For lack of space, most of the discussion refers to the SysTRACE set-up (see Sec. 4 and Sec. 5) unless otherwise stated. However, a summary of the results obtained through SysSIM is also given (see Sec. 6.5).
Transition energy and break-even time
LRU-SEQ has been designed to save energy, as explained earlier. This is achieved directly by a reduction in total bank transitions (TBTs) and indirectly by increasing the inter-bank transition time (IBT time). Fig. 5 considers the inter-bank transitions that occur within n cycles; ΔBT_n captures the reduction offered by LRU-SEQ. The plot shows the data for n = 2, 4, 8, and 10 cycles across the different cache configurations. It can be seen that the minimum improvement is already about 80%. Thus, in at least 80% of the IBTs, the cache banks will be able to overcome their energy costs using LRU-SEQ, depending on the break-even time (2, 4, 8, or 10 cycles), which in turn depends on the technology.
Figure 6: E_tr savings by LRU-SEQ for rasta
The excellent improvement in ΔBT_n is due to two factors: one is the absolute reduction of (IBT)_n (as discussed), and the other is the reduction of the total number of inter-bank transitions (TBTs). That means: 1) many transition events are completely eliminated, and 2) the remaining events are pushed farther apart in time. Both effects support the goal of energy reduction. This is understandable, as the LRU-SEQ policy tends to cluster activity in banks.
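The two quantities discussed here can be computed from a per-cycle record of which bank served each access. The sketch below assumes that the IBT time is the number of cycles between two consecutive bank transitions and that a transition counts towards (IBT)_n when that interval is at most n cycles; these definitions, like the function name and the sample traces, are assumptions made for illustration.

```python
def bank_transition_stats(banks, n):
    """Return (total bank transitions, transitions that follow the
    previous transition within n cycles) for a per-cycle bank trace."""
    total = 0
    within_n = 0
    last_transition = None               # cycle index of the previous transition
    for cycle in range(1, len(banks)):
        if banks[cycle] != banks[cycle - 1]:
            total += 1
            if last_transition is not None and cycle - last_transition <= n:
                within_n += 1
            last_transition = cycle
    return total, within_n


# Hypothetical traces: bank id accessed in each cycle.
scattered = [0, 0, 0, 1, 0, 0, 0, 1]     # a conflicting fetch bounces between banks
clustered = [0, 0, 0, 0, 0, 0, 0, 0]     # all fetches served by one bank
print(bank_transition_stats(scattered, n=2))
print(bank_transition_stats(clustered, n=2))
```

Comparing the two tuples for a baseline trace and an LRU-SEQ trace gives the reduction in TBTs and in short-interval transitions for a chosen break-even time n.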
The improvement in TBTs is plotted in Part (b) of Fig. 5. Col. 3 of Tab. 1 summarizes the TBT reduction for all the benchmarks. It can be seen that, on average, there is a 71.93% reduction in total transitions across the benchmarks. Referring to our energy model (Sec. 5.2), this translates accordingly into savings in E_tr. These savings are plotted in Fig. 6, where the contribution of E_tr is given by Eqn. 4 and (N_tr, N_active, N_idle) are the numbers of (transition, active, and idle) events in the cache. It is seen that up to 35% of the total cache energy can be saved by the LRU-SEQ policy (assuming that E_tr is comparable to E_active and E_idle, which is a fairly reasonable assumption). Col. 4 of Tab. 1 presents this data for all the benchmarks, averaged over all cache configurations.
It can be seen that LRU-SEQ has the potential to save a considerable amount of cache energy compared to LRU (20%-28%).
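The relation between event counts and energy can be sketched as follows. Because the energy equations are not legible in this copy, the code assumes an additive model in which total cache energy is the sum of the transition, active, and idle event counts weighted by their per-event energies. The function names and all numbers are hypothetical, and the idle and active counts are held fixed for simplicity.

```python
def total_energy(n_tr, n_active, n_idle, e_tr, e_active, e_idle):
    """Total cache energy under an assumed additive event model."""
    return n_tr * e_tr + n_active * e_active + n_idle * e_idle


def transition_share(n_tr, n_active, n_idle, e_tr, e_active, e_idle):
    """Fraction of the total energy spent on bank transitions."""
    total = total_energy(n_tr, n_active, n_idle, e_tr, e_active, e_idle)
    return n_tr * e_tr / total


def saving_percent(baseline, improved):
    """Relative energy saving of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical event counts; per-event energies set equal (comparable costs).
base = total_energy(50, 100, 50, 1.0, 1.0, 1.0)
fewer_transitions = total_energy(14, 100, 50, 1.0, 1.0, 1.0)   # 72% fewer transitions
print(transition_share(50, 100, 50, 1.0, 1.0, 1.0))
print(saving_percent(base, fewer_transitions))
```

The example shows why the total saving is smaller than the transition reduction itself: the saving scales with the share of energy that transitions account for in the first place.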
Effects on the hit ratio
The major performance characteristic of any replacement policy is the cache hit ratio. This is shown in Fig. 7 for different cache configurations (example: rasta benchmark). Part (a) illustrates the variation in hit ratio for a conventional LRU policy. Results are plotted for caches from 1 KB to 16 KB, with block sizes from 8 B to 32 B and associativities from 2-way to [...] shows the difference in the hit ratios, ΔHR (LRU-SEQ minus LRU). It can be seen that in many cases LRU-SEQ offers even a slightly better hit ratio than LRU.
Col. 2 of Tab. 1 gives comprehensive ΔHR data for all the applications, averaged over all cache configurations. On average, LRU-SEQ indeed performed better by 0.46% for rasta. Over all benchmarks, the average ΔHR ranged from -0.04% to 0.46%. From this data, it can be seen that LRU-SEQ with SEQ-DST = 1 does not trade off performance for its energy savings. Indeed, the performance even increases slightly in many cases. Next we study the effect of varying SEQ-DST.
Effects when varying SEQ-DST
Previously, we presented SEQ-DST as a possible parameter of our policy. The idea behind this parameter is to group farther and farther branch destinations under sequential fills (to the same way). We want to investigate the effect on the hit ratio and the advantage in terms of a further reduction in bank transitions (ΔBT). It can be seen that there is very little increase in ΔBT as SEQ-DST increases. However, the effect on ΔHR is more apparent. Let us define a "good" value for SEQ-DST as one for which -1 ≤ ΔHR ≤ +1. Reviewing the values of the standard deviation as SEQ-DST is increased, we found that a SEQ-DST ≥ 2 can be detrimental (for 95% test coverage). If we look further at the range of variation (R) in ΔHR, LRU-SEQ has the best performance for a SEQ-DST of 1 (-0.5 vs. -3). In the following, we combine the sequential-fill methodology with different policies and observe the effects on hit ratio and bank-transition reduction.
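The selection criterion for SEQ-DST can be expressed as a small check over measured hit-ratio differences. The sketch below is one simple reading of the criterion: every ΔHR sample (in percentage points) must lie within [-1, +1]. The paper's own test also involves the standard deviation and a 95% coverage argument, which are only summarized here. The function names and sample values are hypothetical.

```python
import statistics


def delta_hr_summary(delta_hr):
    """Mean, population standard deviation, and range of ΔHR samples."""
    values = list(delta_hr)
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
        "range": (min(values), max(values)),
    }


def is_good_seq_dst(delta_hr, low=-1.0, high=1.0):
    """True if every ΔHR sample lies within the accepted band."""
    return all(low <= value <= high for value in delta_hr)


# Hypothetical ΔHR samples (percentage points) for two SEQ-DST settings.
samples_dst1 = [0.4, -0.2, 0.1]
samples_dst4 = [0.2, -3.0]
print(delta_hr_summary(samples_dst1))
print(is_good_seq_dst(samples_dst1), is_good_seq_dst(samples_dst4))
```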
PLRU-SEQ and Random-SEQ
Given the implementation complexity of LRU, pseudo-LRU (PLRU) is often adopted (x86 through Pentium [7]). We applied the principles behind LRU-SEQ to this policy and analyzed plots similar to those in Fig. 8. Our analysis showed that a SEQ-DST of 1 provides the best performance, with a bank-transition reduction of 82% and a maximum variation in hit ratio, ΔHR (PLRU-SEQ minus PLRU), of (2, -0.5) when compared to regular PLRU implementations. Analysis of the methodology applied to the random replacement policy (Sparc family [15]) showed that a SEQ-DST of 1 yields an 82% reduction in bank transitions and a maximum variation in hit ratio of (0, -0.5).
Effects of system configuration using SysSIM
In the previous sections, an address trace generator (Shade) was used to evaluate the new replacement policy. In this section we present the results obtained with the SimpleScalar framework [13]. The goal behind these tests is to obtain an idea of the relationship between the effect on hit ratio and the corresponding effect on the run-time due to various system-[...] denotes the (mean, median). It can be seen that a SEQ-DST of 1 or 2 is preferable. The second column shows that the effect on the run-time is 0.36% on average.
The system-level study presented in this section thus confirms our observations in the previous sections that the LRU-SEQ policy with a SEQ-DST of 1 or 2 is excellent for reducing transition-energy effects with a minimal impact on the run-time.

Conclusions
We presented LRU-SEQ, a novel replacement policy for instruction caches. The policy exploits the fact that reducing inter-bank transitions and increasing the sleep time of the banks between transitions can save a considerable amount of cache energy. This holds especially for future silicon technologies, where static power consumption is expected to dominate through leakage. We obtain cache energy savings with an average of 23% across all benchmarks. The policy has been carefully studied within different system set-ups (SysTRACE and SysSIM) as well as for a large set of parameters such as SEQ-DST and cache parameters (cache size, associativity, block size, etc.). Beyond comparing LRU-SEQ to LRU, we have also studied the pseudo-LRU and random replacement policies, as they are used by some processor architectures. Despite the energy savings of 23% on average, our policy does not incur any noticeable performance penalty (though the cache access patterns change), as shown, and the additional hardware required within the cache is minimal. It can be concluded that the energy savings come almost for free.
LRU-SEQ is currently validated only for instruction caches. Future work will include its validation for data caches.
