On-chip L1 and L2 caches represent a sizeable fraction of the total power consumption of microprocessors. In deep sub-micron technology, the subthreshold leakage power is becoming the dominant fraction of the total power consumption of those caches. In this paper, we present optimization techniques to [...]
Introduction
As semiconductor process technology moves below 0.1µm, subthreshold leakage power is becoming a dominant fraction of total power. A potentially important source of this power loss is on-chip caches, because larger and larger on-chip caches are being integrated on the chip. For example, Intel's Madison processor has 1MB and 6MB on-chip L2 and L3 caches, respectively [1].
To alleviate this problem, transistors in caches could be designed for low leakage, for example, by assigning them a high threshold voltage, VTH, or by controlling the VTH with adaptive body biasing or, if a better balance of speed and power is required, by employing dual VTH [2] [3] [4] [5] [6] [7] [8]. Traditionally, only two VTH's have been available in high performance process technologies, giving cache designers limited flexibility to suppress leakage current. To further reduce leakage, several dynamic circuit and microarchitectural techniques [9-12] have therefore been proposed, targeted at leakage power reduction of L1 caches. However, due to the increasing importance of subthreshold leakage current, the number of available VTH's in future process technologies will increase.
Next generation 65nm processes are expected to support 3 VTH's, and future processes are likely to provide designers with even more VTH choices. This increase provides new flexibility for leakage power reduction methods, allowing new trade-offs between the VTH of different parts of a cache and between different levels in the cache hierarchy. The availability of additional VTH's suggests a new examination of the trade-off between cache size and VTH to reduce subthreshold leakage power loss.
In this paper, we investigate combinations of circuit and microarchitectural techniques to minimize leakage and dynamic energy in microprocessor memory hierarchies under access time constraints. We present systematic approaches to VTH assignment and memory hierarchy configuration to minimize leakage and dynamic energy consumption. Our study is limited to hierarchies consisting of L1 and L2 caches and main memory. However, our approach is readily extended to systems with more cache levels. First, we examine the optimization of leakage power of individual on-chip cache memories that can be achieved if more than one VTH is used to optimize leakage power dissipation. We show how many independent VTH's are needed for effective leakage power reduction and how much VTH can be effectively increased without sacrificing the access time of caches. Second, we show that cache miss characteristics of L1 and L2 caches under SPEC2000 workloads allow us to reduce leakage as well as total dynamic energy dissipation while maintaining the same overall average memory access time in the microprocessor memory system.
The next section of this paper explains the circuit and microarchitectural simulation methodology used in this research. Section 3 and Section 4 present our proposed leakage power optimization techniques for individual and multi-level cache systems. Section 5 discusses future directions of this research and adds some concluding remarks.
Methodology

Circuit simulation
To examine trade-offs between leakage power dissipation and access time of a microprocessor memory system, we need [...] SRAM caches were designed with sizes ranging from 16KB to 1024KB. Bit-lines and word-lines were segmented to improve access time, and sub-banks were employed to reduce dynamic power dissipation [15] (see Table 1 for the cache sub-bank configuration). The caches were broken into four components for the purposes of assigning distinct VTH's: address bus drivers, data bus drivers, decoders, and 6T-SRAM cell arrays with sense-amps. We employed an "H-tree" topology for the address and data bus routing and inserted repeaters on each branch of the buses to optimize the [...]
HSPICE simulations were run to obtain access times and dynamic and leakage power dissipations for various cache sizes and for various VTH's for their four components. We considered VTH's between 0.2V and 0.5V in steps of 0.05V at a 1V supply voltage. In addition, we measured the delay time, dynamic power, and leakage power dissipation of each memory component separately. Figure 1 and Figure 2 show VTH vs. leakage power and delay time of the 7x128, 8x256, and 9x512 row decoders that we designed. The HSPICE simulation results shown in Figure 1 agree with the exponential decay in leakage power with VTH that is characteristic of CMOS circuits:
(1)
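For orientation, a commonly used first-order textbook form of this exponential dependence is sketched below. It is an assumption, not necessarily the exact expression labelled Equation (1) here; the prefactor I_0, the subthreshold slope factor n, and the thermal voltage v_T are symbols introduced only for this sketch.

```latex
P_{leak} \;\approx\; V_{DD}\, I_0 \exp\!\left(-\frac{V_{TH}}{n\, v_T}\right),
\qquad v_T = \frac{k_B T}{q}
```

Under this form, each fixed increment in VTH divides the leakage power by a constant factor, which is why the measured curves appear as straight lines on a logarithmic leakage axis.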
The CMOS circuit delay of ultra deep sub-micron short-channel transistors is:
$$T_d = k \, \frac{V_{DD}}{(V_{DD} - V_{TH})^{\alpha}} \qquad (2)$$
where k is a constant and α is about 1.3 depending on the technology [17]. Figure 2 shows the HSPICE measurement results for the circuit delay of the decoders, which agree with Equation (2).
However, the circuit delay or access time also fits well to an exponential growth function with a very small exponent over our range of interest. It was convenient in some of our optimizations to model delay this way.
To obtain approximate analytic equations for the leakage power and access time as functions of VTH and cache size, we applied exponential decay and growth curve fitting techniques after measuring leakage power and access time for each VTH point and each cache size. Assuming that we can apply four distinct VTH's, the approximate analytic equations for leakage power, LP, and access time, AT, are: [...] curve fitting software based on HSPICE simulation measurement results. We also define baseline caches as those with all low-VTH's (0.2V). According to the HSPICE measurements, the access time and the leakage power trends of the designed baseline caches agree with those of earlier studies.
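The exponential fitting step can be illustrated with a minimal sketch. It assumes a single-term model y = a·exp(b·x) with no offset and fits it by ordinary least squares on ln(y); the curve fitting software and exact model form used in this work are not specified, and the sample points below are synthetic, not HSPICE measurements.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by a least-squares line through (x, ln y).

    Requires every y to be positive. b < 0 gives exponential decay
    (leakage vs. VTH); b > 0 gives exponential growth (delay vs. VTH).
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxl / sxx                      # slope of the log-linear fit
    a = math.exp(mean_l - b * mean_x)  # intercept mapped back from log space
    return a, b

# Synthetic illustration only: VTH swept from 0.20 V to 0.50 V in 0.05 V steps.
vth_points = [0.20 + 0.05 * i for i in range(7)]
leak_points = [50.0 * math.exp(-20.0 * v) for v in vth_points]  # made-up decay curve
a_fit, b_fit = fit_exponential(vth_points, leak_points)
```

Fitting in log space weights relative rather than absolute error, which suits leakage data that spans orders of magnitude; a model with an additive offset would need a nonlinear fitter instead.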
Microarchitectural simulation
We use the SimpleScalar cycle simulator [18] to obtain L1 and L2 cache miss rates, which were used to estimate the average memory access time (AMAT) [19] for a two-level cache memory hierarchy system. The SPEC2000 benchmark suite was used and compiled with GCC 2.6.3 using -O4 level optimizations. It was also statically linked with library code. To get reliable L2 cache miss rates, we completed the execution for each benchmark application and each given input. The reason we completed the executions is that L2 cache accesses are far less frequent than L1 cache accesses, and an insufficient number of L2 accesses may result in unrepresentatively high L2 cache miss rates. Table 2 shows the L1 and L2 cache miss rates for 16KB, 32KB, and 64KB L1 caches, respectively. We assume that we have two L1 caches, one each for instruction and data, but a unified L2 cache for each configuration. The L1 instruction caches are direct mapped, and the L1 data caches are 4-way set associative. Also, the L2 caches are 8-way set associative. Each L1 cache miss rate is obtained as the sum of the total instruction and data cache misses divided by the sum of the total instruction and data cache accesses. A 16KB L1 means that the instruction and data caches are each 16KB in size.
Table 3 shows the dynamic energy per access, and the dynamic and leakage power dissipation of the baseline caches used in this research. First, the dynamic energy dissipation is measured for an access using HSPICE; then we divide it by the access time of the cache to estimate the average dynamic power dissipation during the access. If caches are designed with the same size sub-bank, there is not much difference in the average dynamic energy dissipation, because we assume that only one sub-bank is accessed during the cache access and the sub-bank size is the same regardless of the cache size in a certain range (e.g., 256-1024KB). The only difference in energy dissipation of the different caches with the same sub-bank size is caused by the energy dissipation of address and data bus drivers. However, the access time increases as the cache size grows due to the propagation delay caused by longer address [...]
FIGURE 2. VTH vs. delay of 7x128, 8x256, and 9x512 row decoder logic.
where VTH1, VTH2, VTH3 and VTH4 represent the VTH's for address bus drivers, data bus drivers, decoders and 6T-SRAM arrays.
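The two bookkeeping rules described above (the combined L1 miss rate and the average dynamic power during an access) can be written down directly. This is a minimal sketch; the function names and the sample counts are hypothetical and are not taken from Table 2 or Table 3.

```python
def combined_l1_miss_rate(i_misses, i_accesses, d_misses, d_accesses):
    """L1 miss rate over both split caches: total misses / total accesses."""
    return (i_misses + d_misses) / (i_accesses + d_accesses)

def average_dynamic_power(energy_per_access_j, access_time_s):
    """Average dynamic power during one access: energy divided by access time."""
    return energy_per_access_j / access_time_s

# Hypothetical counts: 10 instruction misses and 30 data misses
# out of 1000 accesses to each L1 cache.
l1_miss_rate = combined_l1_miss_rate(10, 1000, 30, 1000)

# Hypothetical cache: 1 nJ per access with a 2 ns access time.
dyn_power_w = average_dynamic_power(1e-9, 2e-9)
```

Note that the combined rate is weighted by access counts, so it is not the simple mean of the instruction and data miss rates unless both caches see the same number of accesses.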
A single cache leakage optimization
There exist numerous combinations of VTH1, VTH2, VTH3 and VTH4 satisfying a specific target access time. Among these combinations, we find the quadruple of VTH1, VTH2, VTH3 and VTH4 that produces the minimum leakage power using a numerical optimization method (e.g., Matlab's fmincon function) while satisfying the specified access time to within a 5% error range. We can repeat this with modified objective and constraint functions to find an optimal VTH combination for cache memories that have only 2 or 3 VTH's.
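The constrained search can be sketched as follows. This sketch swaps the continuous optimizer (fmincon) for an exhaustive search over the 0.05V grid of simulated VTH points, treats the access time target as an upper bound rather than a 5% band, and uses made-up model coefficients; all names and numbers are assumptions for illustration only.

```python
import itertools
import math

# Hypothetical fitted coefficients, for illustration only (not values from the paper).
# Component order: address bus drivers, data bus drivers, decoders, 6T-SRAM array.
LEAK_COEF = [(4.0, 18.0), (6.0, 18.0), (2.0, 18.0), (60.0, 18.0)]   # A_i * exp(-a_i * V)
DELAY_COEF = [(0.20, 1.5), (0.25, 1.5), (0.40, 1.5), (0.10, 1.5)]   # B_i * exp(+b_i * V)
VTH_GRID = [round(0.20 + 0.05 * i, 2) for i in range(7)]            # 0.20 V .. 0.50 V
BASELINE = (0.20, 0.20, 0.20, 0.20)                                  # all low-VTH

def leakage(vths):
    """Modelled leakage power: a sum of exponential decay terms, one per component."""
    return sum(A * math.exp(-a * v) for (A, a), v in zip(LEAK_COEF, vths))

def access_time(vths):
    """Modelled access time: a sum of exponential growth terms, one per component."""
    return sum(B * math.exp(b * v) for (B, b), v in zip(DELAY_COEF, vths))

def best_assignment(target_ratio):
    """Lowest-leakage VTH quadruple whose access time stays within
    target_ratio times the baseline access time (target_ratio >= 1)."""
    limit = target_ratio * access_time(BASELINE)
    feasible = (v for v in itertools.product(VTH_GRID, repeat=4)
                if access_time(v) <= limit)
    return min(feasible, key=leakage)

best = best_assignment(1.10)  # allow a 10% slower cache than the baseline
```

With 7 grid points per component there are only 2401 candidates, so brute force is adequate here; restricting the candidates (for example, forcing the three peripheral VTH's to be equal) gives the reduced two- or three-VTH variants.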
To examine the dependence of the optimization results on access time, we sweep the target access time from the fastest possible (all low-VTH's) to the slowest possible (all high-VTH's). The following are the VTH assignment schemes we examined in this study:
Scheme I: assigning a high-VTH to all the cache circuit components including address bus drivers, data bus drivers, decoders and 6T-SRAM cell arrays.
Scheme II: assigning a high-VTH only to 6T-SRAM cell arrays and assigning a default or low-VTH (0.2V) to the rest of the transistors.
Scheme III: assigning a high-VTH to 6T-SRAM cell arrays and assigning another high-VTH to the peripheral circuit components of the cache (address bus drivers, data bus drivers and decoders).
Scheme IV: assigning four different high-VTH's to all four circuit components of the cache.
In Figure 3, we plot the normalized minimum leakage power values at different target access times (105%, 110%, 115% and so on). In the graph, the normalized delay and leakage of 100% correspond to the access time and leakage of a cache with the default VTH for all four cache components. Also, the 115% access time means that the cache is 15% slower than the baseline cache.
Figure 4 shows the VTH trends of each cache component against the normalized delay of a 32KB cache with scheme IV. According to this simulation result, the VTH of the 6T-SRAM cell array starts increasing first, because it has the most significant impact on the leakage power reduction but the least significant impact on the overall access time. In contrast, the decoder has the least significant impact on the leakage but the most significant impact on the overall access time. The address and data bus drivers show a middling impact on both leakage and access time compared to the 6T-SRAM cell array and the decoder. This trend suggests that we should give the top optimization priority to the 6T-SRAM cell array to achieve the lowest leakage power of caches for a given access time constraint.
Table 4 compares the leakage power of schemes II, III, and IV for each cache size against scheme I. As expected, we can reduce more leakage power while achieving the same access time by having more VTH's to control. However, once the target access time increases beyond the 150% point, the scheme II caches dissipate more leakage power than the scheme I caches (see both Table 4 and Figure 3). The address and data bus drivers and decoders, i.e., the cache peripheral circuits, consume non-negligible leakage power.
Also, the leakage power of those components becomes substantial when we cut down the leakage power of the 6T-SRAM array. Furthermore, the slowest delay point of scheme II ends around 150% in small caches. This means that the peripheral circuits also play important roles in both leakage power and access time of caches. In other words, increasing the VTH of the 6T-SRAM cell array alone gives diminishing returns at some point, without reducing the leakage power further. This is why scheme I caches give even better results than scheme II caches as VTH increases. Another noticeable result is that there is a negligible difference between schemes III and IV in terms of leakage power reduction, which implies that 2 distinct VTH's (scheme III) are enough for leakage reduction in caches.
Trade-off between leakage and access time
One interesting point from these experimental results is that we may not need to use a very high VTH (e.g., 0.5V for a 1V supply voltage), since this impedes the circuit speed unnecessarily without reducing the leakage power further. Figure 5 shows a general trend in a leakage power vs. access time graph. In the "fast / leaky" region, we can reduce the leakage power dramatically with a small increase of the access time. On the other hand, beyond some point we cannot reduce the leakage power very much by increasing VTH, while the access time increases rapidly. Based on this observation, we can calculate the point whose tangential slope equals negative 1 in the graph shown in Figure 5, and we call this the inflexion point of the leakage power; in other words, this point can be regarded as an optimal leakage / access time trade-off point. Table 5 shows the normalized access times vs. leakage power of caches at their inflexion point. This is also a good indication of how many VTH's are enough to control the leakage power effectively. The normalized access time and leakage power are based on the fastest access time and highest leakage power values, obtained when using all low-VTH's (0.2V). According to this result, we can achieve more leakage power reduction with a faster access time as we increase the number of distinct VTH's, but we quickly reach a point of diminishing return.
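Locating the point where the tangential slope equals negative 1 can be sketched numerically. The sketch assumes both axes are normalized as fractions of the baseline (1.0 = 100%) and that the curve is a smooth convex decay, so its slope rises monotonically toward zero; the single-exponential curve and its constant C are hypothetical stand-ins for the fitted leakage vs. access time relation.

```python
import math

def find_unit_slope_point(curve, lo, hi, tol=1e-9, h=1e-6):
    """Bisection for the t in [lo, hi] where d(curve)/dt equals -1.

    Assumes the slope is steeper than -1 at lo, shallower than -1 at hi,
    and monotonically increasing in between (a convex decaying curve).
    """
    def slope(t):
        # Central finite difference approximation of the derivative.
        return (curve(t + h) - curve(t - h)) / (2.0 * h)

    if not (slope(lo) < -1.0 < slope(hi)):
        raise ValueError("slope does not cross -1 inside [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slope(mid) < -1.0:
            lo = mid   # still in the steep "fast / leaky" region
        else:
            hi = mid   # already in the flat region
    return 0.5 * (lo + hi)

C = 12.0  # hypothetical decay constant, not a fitted value

def normalized_leakage(t):
    """Hypothetical normalized leakage as a function of normalized access time t >= 1."""
    return math.exp(-C * (t - 1.0))

t_star = find_unit_slope_point(normalized_leakage, 1.0, 2.0)
```

For this particular curve the answer has the closed form t = 1 + ln(C)/C, which is a convenient check. The location of the unit-slope point depends on how the two axes are scaled, so the same normalization must be used when comparing schemes.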
$$\mathrm{AMAT} = AT(L1) + MR(L1) \times \left[ AT(L2) + MR(L1, L2) \times AT(\text{main mem}) \right] \qquad (9)$$
where AT and MR are the access time and miss rate for the specific size of caches, respectively. In (9) we use the local miss rate MR(L1, L2), i.e., the miss rate of the L2 seen by the accesses that first miss the L1. This depends on the size of the L1 caches, because the number of accesses to the L2 cache is equal to the number of misses of the L1 caches (of course, the total number of misses remains the same).
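A small worked example of the AMAT calculation with a local L2 miss rate may help. The function name and all numeric values are hypothetical; only the structure (L1 access time, plus the L1 miss rate times the cost of going to L2 and, on a local L2 miss, to main memory) follows the description above.

```python
def amat(at_l1, mr_l1, at_l2, mr_l2_local, at_mem):
    """Average memory access time for an L1 / L2 / main memory hierarchy.

    mr_l2_local is the L2 miss rate measured over L2 accesses only,
    i.e. over the accesses that already missed in L1.
    """
    return at_l1 + mr_l1 * (at_l2 + mr_l2_local * at_mem)

# Hypothetical values in ns and as fractions (not taken from Table 2).
example_amat = amat(at_l1=1.0, mr_l1=0.05, at_l2=5.0, mr_l2_local=0.2, at_mem=100.0)
# 1.0 + 0.05 * (5.0 + 0.2 * 100.0) = 2.25 ns

# The global L2 miss rate (misses to memory per processor access) is the product.
global_l2_miss_rate = 0.05 * 0.2
```

Because the memory term is multiplied by both miss rates, a small change in the local L2 miss rate can move the AMAT by as much as a large change in the L2 access time.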
In addition, to compare the dynamic energy dissipation of each memory hierarchy configuration, we define the average memory access energy (AMAE) similarly to the AMAT. Assuming that the L1 cache is accessed every cycle, the AMAE represents the average energy dissipation per access in the entire microprocessor memory system that includes L1, L2 and main memory. We can estimate the AMAE as follows: [...]
To obtain the main memory access time and dynamic energy dissipation per access, we use a 2-channel 1066MHz 256MB Rambus DRAM RIMM module whose sustained transfer rate is 4.2GB/s [20]. Though the sustained transfer rate is quite high, we should also consider the RAS/CAS latency of the memory, which is about 20ns. For the energy dissipation per access, we used the number given in [21]: 3.57nJ per access.
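By analogy with (9), one plausible form of the AMAE replaces each access time with the corresponding energy per access. The sketch below is an assumption about that form, not the authors' stated equation; the cache energies and the 64-byte line size are hypothetical, while the 3.57nJ memory energy, 20ns latency, and 4.2GB/s transfer rate are the figures quoted above. Adding the line transfer time to the latency is likewise an assumed way of combining them.

```python
def amae(e_l1, mr_l1, e_l2, mr_l2_local, e_mem):
    """Average memory access energy, mirroring the structure of the AMAT:
    every access pays for L1, L1 misses pay for L2, local L2 misses pay for memory."""
    return e_l1 + mr_l1 * (e_l2 + mr_l2_local * e_mem)

def main_memory_access_time_ns(line_bytes, latency_ns=20.0, bandwidth_gb_s=4.2):
    """Assumed model: RAS/CAS latency plus line transfer time.
    1 GB/s equals 1 byte/ns, so bytes / (GB/s) is already in ns."""
    return latency_ns + line_bytes / bandwidth_gb_s

# Hypothetical cache energies (nJ) and miss rates; 3.57 nJ is the quoted memory energy.
example_amae = amae(e_l1=0.1, mr_l1=0.05, e_l2=0.5, mr_l2_local=0.2, e_mem=3.57)

# Hypothetical 64-byte cache line.
example_mem_time = main_memory_access_time_ns(64)
```

As with the AMAT, the memory term is scaled by the product of the two miss rates, so a larger L2 with a lower local miss rate can lower the AMAE even if its own energy per access is somewhat higher.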
L2 cache leakage power optimization
We will examine the leakage power optimization of L2 caches first, because their contribution to leakage power dominates due to their size. Consider a conventional cache hierarchy of 16KB and 128KB for L1 and L2 caches, respectively, designed with low-VTH (0.2V) devices. If we fix the L1 and reduce the leakage of the L2 by increasing VTH, the cache system becomes slower. However, we can maintain the same AMAT and reduce the leakage power of the L2 by increasing its size to reduce its miss rate. Since the main memory access penalty is quite significant, even a slight reduction of L2 cache miss rates results in a significant improvement of the AMAT.
We note that although area was one of the most important design constraints in the past, this trend is changing and power is becoming an equally important constraint in many situations [22].
Figure 6 shows the leakage power vs. AMAT of L2 caches with the fixed size L1 cache of 16KB. Taking the AMAT of a 128KB L2 cache as a base, we compare the leakage power of other caches at the same AMAT point (see the dotted vertical line in Figure 6). As can be seen from the graphs, the AMAT can be maintained while the leakage power is reduced by replacing a 128KB L2 with a 256KB L2 cache that is intentionally slowed down by increasing its VTH to reduce leakage. Similarly, the use of a 512KB L2 cache can further reduce leakage compared to the 256KB cache (see the dashed vertical line in Figure 6).
Table 6 shows the results for normalized leakage power and AMAE for each L1 cache size designed using scheme III at a fixed AMAT. To compare leakage power and AMAE, the following standard cache configurations were used: 128KB L2 with 16KB L1, 256KB L2 with 32KB L1, and 512KB L2 with 64KB L1. Table 6 gives the somewhat counterintuitive result that we can reduce both leakage power and AMAE by employing larger L2 caches while maintaining a constant AMAT.
[...] caches for a certain range of cache sizes (e.g., 16KB-64KB). Figure 7 shows the leakage power vs. the AMAT of 16KB, 32KB, and 64KB L1 caches using scheme III, each with a fixed L2 cache of size 512KB. Like the comparison performed in Section 4.1, the leakage power of different caches is compared at the same AMAT point. The graphs show that leakage power can be reduced by replacing a 64KB L1 cache with a 32KB L1 cache that is intentionally slowed down by increasing its VTH's to reduce the leakage power (see the dotted vertical line in Figure 7), while the resulting hierarchy has the same AMAT.
Similarly, a slowed 16KB cache with increased VTH's can replace a 32KB cache without changing the AMAT of the L1/L2 hierarchy. The new system consumes much less leakage power (see the dashed vertical line in Figure 7). Table 7 shows the results for normalized leakage power and AMAE for each fast but leaky L1 cache size using scheme III with fixed AMATs. The comparisons were performed in the same manner as for Table 6. According to the comparisons, we can reduce both leakage power and AMAE by employing smaller L1 caches. This is in contrast to the case for L2 caches, where the leakage of the overall hierarchy can be reduced by increasing their size. It should be noted that these results are only valid within the specific set of sizes given in this paper. A 4KB L1 cache will have a cache miss rate that is much higher than a 16KB cache, but its access time will not be sufficiently smaller to make the trade-off worthwhile. Also, the normalized AMAE is rather high because the total power fraction of L1 caches is relatively small compared to L2 caches.
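The iso-AMAT comparison used for the L2 caches in this section can be made concrete with a short sketch: if a larger L2 lowers the local miss rate, the AMAT expression tells us how much slower that L2 may become before the AMAT exceeds the baseline. The function names and all numbers below are hypothetical and are not taken from Table 2 or Figure 6.

```python
def amat(at_l1, mr_l1, at_l2, mr_l2_local, at_mem):
    """Average memory access time with a local L2 miss rate."""
    return at_l1 + mr_l1 * (at_l2 + mr_l2_local * at_mem)

def iso_amat_l2_access_time(at_l2_base, mr_l2_base, mr_l2_new, at_mem):
    """L2 access time that keeps the AMAT unchanged when the local L2 miss rate
    changes from mr_l2_base to mr_l2_new (L1 held fixed).

    Setting the two AMAT expressions equal, the L1 terms cancel and
    at_l2_new = at_l2_base + (mr_l2_base - mr_l2_new) * at_mem.
    """
    return at_l2_base + (mr_l2_base - mr_l2_new) * at_mem

# Hypothetical baseline: small L2 with 5 ns access time and 20% local miss rate.
AT_L1, MR_L1, AT_MEM = 1.0, 0.05, 100.0
base = amat(AT_L1, MR_L1, 5.0, 0.20, AT_MEM)

# Hypothetical larger L2 with a 15% local miss rate: how slow may it be?
slow_l2_time = iso_amat_l2_access_time(5.0, 0.20, 0.15, AT_MEM)
same = amat(AT_L1, MR_L1, slow_l2_time, 0.15, AT_MEM)
```

The access time slack gained this way is what can be spent on a higher VTH; whether the larger, slower cache then leaks less than the smaller, faster one depends on the fitted leakage model for each size.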
Conclusion and future work
In this paper, we examined the leakage power and access time trade-off where multiple VTH's are allowed. We used curve fitting techniques to model leakage power and access time. Our results show that 2 distinct VTH's for caches are sufficient to yield a significant reduction in leakage power. Such an arrangement can reduce the leakage power by up to 91% (see scheme III in Table 5) for a 1MB SRAM cache without significantly increasing access time.
We also show that smaller L1 and larger L2 caches than are typical in today's processors result in significant leakage and dynamic power reduction without affecting AMAT. Given that the processor core may need a distinct VTH, and each of the caches may need up to two VTH's (scheme III), we could require up to five distinct VTH's.
In this work, we assume that we have two-level on-chip caches. Recently, however, microprocessors with three-level caches are being deployed, and their L2 and L3 cache sizes are much larger than the caches discussed here. For future work, we will investigate leakage power optimization in a multi-level cache hierarchy that includes L2 and L3 caches.
