Power-Performance Trade-Offs in Nanometer-Scale Multi-Level Caches
  Considering Total Leakage by Bai, Robert et al.
Robert Bai1, Nam-Sung Kim2, Tae Ho Kgil1, Dennis Sylvester1, Trevor Mudge1
1 University of Michigan, EECS Department, Ann Arbor, MI 48109 {rbai, tkgil, dennis, tnm} @eecs.umich.edu
2 Intel Corporation, Portland, Oregon, nam.sung.kim@intel.com
Abstract
In this paper, we investigate the impact of Tox and Vth on
power-performance trade-offs for on-chip caches. We start by examining
the optimization of the various components of a single level cache
and then extend this to two level cache systems. In addition to 
leakage, our studies also account for the dynamic power expended
as a result of cache misses. Our results show that one can often
reduce overall power by increasing the size of the L2 cache if we
only allow one pair of Vth/Tox in L2. However, if we allow the
memory cells and the peripherals to have their own Vth’s and Tox’s,
we show that a two-level cache system with smaller L2’s will yield
less total leakage. We further show that two Vth’s and two Tox’s are 
sufficient to get close to an optimal solution, and that Vth is generally
a better design knob than Tox for leakage optimization. It is therefore
better to restrict the number of Tox’s rather than Vth’s if cost is a 
concern.
1. Introduction
Leakage power is a problem for all microprocessor circuit 
components, but it is particularly important in processor on-chip
caches, where a large number of potentially high-leakage 
cross-coupled inverters (the storage elements of caches) are 
integrated. We can expect the fraction of the
leakage power to exceed that of the dynamic power in future 
processor generations. There have been several previous studies on 
cache leakage power reduction [1-7]; all of them focused on 
subthreshold leakage power. However, with aggressive Tox
scaling, gate leakage power can potentially surpass the 
subthreshold leakage at low Tox. In this paper, we investigate 
various techniques to minimize total (gate + subthreshold) leakage 
power plus dynamic power under delay constraints by
systematically assigning values of Tox and Vth for a single cache, 
a two-level cache system, and an entire microprocessor memory
system consisting of L1 and L2 caches and main memory.
2. Circuit Evaluation Methodology
For our experiments, we use the technology files from the
Berkeley Predictive Technology Model (BPTM) for the 65nm 
technology node [8]. We then characterize the technology files for a 
range of Vth and Tox values. We let Vth vary from 0.2V to 0.5V,
while allowing Tox to scale from 10Å to 14Å. The lower limits of
these ranges are chosen to reflect typical values of high-
performance logic for the studied technology node. Such
transistors would be required for the non-memory portion of a 
processor or system. While there is no physical reason for a Vth
upper bound, we expect that values above 0.5V are unlikely in
65nm technology with an approximately 1V supply. Increasing
Tox while maintaining the same drawn channel length may cause 
the gate terminal to lose control of the conduction state of the 
channel due to drain-induced barrier lowering (DIBL) [9]. Hence, when Tox changes, the 
drawn channel length must be scaled appropriately. Also, in order 
to maintain memory cell stability, the widths of the transistors in
the memory cell need to be adjusted proportionately with the 
change in the drawn channel lengths. Thus the impact of Tox
scaling on the cell area must be taken into account, as the cell will
grow in both horizontal and vertical dimensions. 
3. Delay and Leakage Power Models
First, we re-designed the cache netlists used in [7] to target 
the 65nm technology node. We assume that internally, the cache
consists of four components: memory cell array and sense
amplifier, decoder, address bus drivers, and data bus drivers. 
Second, extensive HSPICE simulation shows that 
the total leakage current of the memory cell array depends
exponentially on Tox and Vth. We therefore approximate the total leakage 
power as follows: 
\[ P_{total}(V_{th}, T_{ox}) = A_0 + A_1 e^{a_1 V_{th}} + A_2 e^{a_2 T_{ox}} \]
On the other hand, the delay of the array is linear
in Tox, and over the range of interest its dependence on Vth
can be approximated by an exponential growth function with a very
small exponent, as follows: 
\[ T_d(V_{th}, T_{ox}) = k_0 + k_1 e^{(k_3 V_{th})} + k_2 T_{ox} \]
Although these total leakage and delay trends are for the memory
cell array, the same trends also hold for the rest of the cache
components (decoders and address/data bus drivers). Therefore, we 
can model the total leakage and delay of each component in the
same way as for the memory cell array, assuming that the 
leakage and delay of each component are independent of one 
another. Thus we can approximate both the total leakage and the 
delay of a cache by summing the leakage and delay of
each cache component.
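As an illustration of how the two fitted forms above combine across components, the following minimal Python sketch evaluates the per-component leakage and delay models and sums them over the four cache components. All coefficient values, the component names, and the helper `cache_totals` are hypothetical placeholders chosen only so that the example runs; they are not the fitted values from our HSPICE characterization.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentModel:
    """Curve-fit coefficients for one cache component (placeholder values only)."""
    A0: float  # leakage offset (mW)
    A1: float  # weight of the Vth-dependent leakage term (mW)
    a1: float  # Vth exponent (1/V); negative, since leakage falls as Vth rises
    A2: float  # weight of the Tox-dependent leakage term (mW)
    a2: float  # Tox exponent (1/Angstrom); negative, leakage falls as Tox rises
    k0: float  # delay offset (ps)
    k1: float  # weight of the Vth-dependent delay term (ps)
    k3: float  # small positive Vth exponent (1/V)
    k2: float  # delay slope versus Tox (ps/Angstrom)

    def leakage(self, vth, tox):
        # P_total = A0 + A1*exp(a1*Vth) + A2*exp(a2*Tox)
        return (self.A0 + self.A1 * math.exp(self.a1 * vth)
                + self.A2 * math.exp(self.a2 * tox))

    def delay(self, vth, tox):
        # T_d = k0 + k1*exp(k3*Vth) + k2*Tox
        return self.k0 + self.k1 * math.exp(self.k3 * vth) + self.k2 * tox


def cache_totals(models, assignment):
    """Sum leakage and delay over components; assignment maps name -> (Vth, Tox)."""
    leak = sum(models[name].leakage(*assignment[name]) for name in models)
    delay = sum(models[name].delay(*assignment[name]) for name in models)
    return leak, delay


# Hypothetical coefficients in the field order declared above.
MODELS = {
    "array":    ComponentModel(0.5, 300.0, -15.0, 5000.0, -0.9, 200.0, 40.0, 2.0, 15.0),
    "decoder":  ComponentModel(0.1, 30.0, -15.0, 500.0, -0.9, 80.0, 20.0, 2.0, 6.0),
    "addr_bus": ComponentModel(0.1, 20.0, -15.0, 300.0, -0.9, 60.0, 15.0, 2.0, 4.0),
    "data_bus": ComponentModel(0.1, 20.0, -15.0, 300.0, -0.9, 60.0, 15.0, 2.0, 4.0),
}

fast = {name: (0.2, 10.0) for name in MODELS}  # low Vth, thin Tox everywhere
slow = {name: (0.5, 14.0) for name in MODELS}  # high Vth, thick Tox everywhere
print("fast corner (leakage mW, delay ps):", cache_totals(MODELS, fast))
print("slow corner (leakage mW, delay ps):", cache_totals(MODELS, slow))
```

With negative leakage exponents and positive delay terms, the slow corner leaks less and is slower than the fast corner, which is the trade-off the optimization in the next section explores.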
4. Single Cache Leakage Optimization
To examine the dependence of leakage power on Vth and Tox
assignment, we study three different Vth/Tox assignment schemes:
x Scheme I: assign independent Vth’s and Tox’s to each cache 
component.
x Scheme II: assign a Vth/Tox pair to the memory cell array and 
another pair to the remaining three cache components.
x Scheme III: assign the same Vth/Tox pair to all four cache 
components.
We formulate the problem of minimizing the leakage power given
the delay constraint as the following optimization problem [10]:
\[
\begin{aligned}
\text{Minimize}\quad & \mathrm{LeakagePower}(V_{th1}, T_{ox1}, \ldots, V_{th4}, T_{ox4}) \\
& \quad = A_0 + A_1 e^{a_1 V_{th1}} + A_2 e^{a_2 T_{ox1}} + \cdots + A_7 e^{a_7 V_{th4}} + A_8 e^{a_8 T_{ox4}} \\
\text{Subject to}\quad & T_d(V_{th1}, T_{ox1}, \ldots, V_{th4}, T_{ox4}) \\
& \quad = B_0 + B_1 e^{(b_1 V_{th1})} + B_2 T_{ox1} + \cdots + B_7 e^{(b_4 V_{th4})} + B_8 T_{ox4} \le \text{delay constraint}; \\
& 10\,\text{Å} \le T_{ox1}, T_{ox2}, T_{ox3}, T_{ox4} \le 14\,\text{Å}; \\
& 0.2\,\text{V} \le V_{th1}, V_{th2}, V_{th3}, V_{th4} \le 0.5\,\text{V}
\end{aligned}
\]
In our optimization process, Vth and Tox take on 
discrete values with a small step size. The optimization shows that
scheme III is the worst performer and scheme I is the best. 
Scheme II is only slightly behind scheme I for the same 
delay constraint, and from a process standpoint scheme I is more
costly than scheme II. Therefore, scheme II is the preferred scheme, as it is
not only economically feasible but also achieves close to optimal 
leakage. It is worth noting that in schemes I and II, high values of 
Vth and thick Tox’s are always assigned to the memory cell arrays,
and Vth/Tox in the peripheral components have been set 
sufficiently low to help meet the delay target. To gain further 
insight into the selection of the decision variables during the 
optimization process, we perform an experiment in which for a 
16KB cache we hold either Vth or Tox constant, and at the same 
time observe how leakage power is impacted by the other decision
variable independently. In Figure 1, we show four curves, two of 
which are constructed by fixing Tox at 10Å and 14Å, respectively,
and the other two are created by fixing Vth at 0.2V and 0.4V,
respectively. It is evident that leakage is more sensitive to Tox
than to Vth, and that the delay spans a narrower range when Vth is 
fixed than when Tox is fixed. Hence, to achieve minimum overall 
leakage, it is best to set Tox conservatively at a high value and let Vth
be the knob that designers vary to meet a delay constraint.
[Figure 1 plot: leakage power (mW) versus access time (ps), with curves for Tox = 10Å, Tox = 14Å, Vth = 200mV, and Vth = 400mV.]
Figure 1. Fixed Vth vs. Fixed Tox
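The comparison of the three assignment schemes in this section can be sketched as an exhaustive search over a discrete (Vth, Tox) grid. The Python sketch below is illustrative only: the coefficients, the coarse grid, the delay bound `MAX_DELAY`, and the function names are assumptions made for the example, and a brute-force search stands in for whatever solver is actually used.

```python
import itertools
import math

NAMES = ("array", "decoder", "addr_bus", "data_bus")

# Hypothetical leakage fits per component: (A0, A1, a1, A2, a2).
LEAK_FIT = {
    "array":    (0.5, 300.0, -15.0, 5000.0, -0.9),
    "decoder":  (0.1, 30.0, -15.0, 500.0, -0.9),
    "addr_bus": (0.1, 20.0, -15.0, 300.0, -0.9),
    "data_bus": (0.1, 20.0, -15.0, 300.0, -0.9),
}
# Hypothetical delay fits per component: (weight, b, Tox slope).
DELAY_FIT = {
    "array":    (40.0, 2.0, 15.0),
    "decoder":  (20.0, 2.0, 6.0),
    "addr_bus": (15.0, 2.0, 4.0),
    "data_bus": (15.0, 2.0, 4.0),
}
B0 = 400.0  # constant delay term (ps), placeholder

VTH_GRID = [0.2, 0.3, 0.4, 0.5]   # volts, coarse step for illustration
TOX_GRID = [10.0, 12.0, 14.0]     # Angstrom
PAIRS = list(itertools.product(VTH_GRID, TOX_GRID))


def evaluate(assignment):
    """Return (leakage, delay) for four (Vth, Tox) pairs given in NAMES order."""
    leak, delay = 0.0, B0
    for name, (vth, tox) in zip(NAMES, assignment):
        c0, c1, e1, c2, e2 = LEAK_FIT[name]
        leak += c0 + c1 * math.exp(e1 * vth) + c2 * math.exp(e2 * tox)
        weight, b, slope = DELAY_FIT[name]
        delay += weight * math.exp(b * vth) + slope * tox
    return leak, delay


def candidates(scheme):
    """Enumerate the assignments each scheme allows."""
    if scheme == "I":    # independent pair per component
        return itertools.product(PAIRS, repeat=4)
    if scheme == "II":   # one pair for the array, one shared by the periphery
        return ((cell, per, per, per)
                for cell, per in itertools.product(PAIRS, repeat=2))
    if scheme == "III":  # one pair for the whole cache
        return ((p, p, p, p) for p in PAIRS)
    raise ValueError(f"unknown scheme: {scheme}")


def optimize(scheme, max_delay):
    """Minimum-leakage assignment meeting the delay bound, or None if infeasible."""
    best = None
    for assignment in candidates(scheme):
        leak, delay = evaluate(assignment)
        if delay <= max_delay and (best is None or leak < best[0]):
            best = (leak, assignment)
    return best


MAX_DELAY = 920.0  # ps, hypothetical constraint
for scheme in ("I", "II", "III"):
    print(scheme, optimize(scheme, MAX_DELAY))
```

Because every scheme III assignment is also a scheme II assignment, and every scheme II assignment is also a scheme I assignment, the minimum leakage found can only improve or stay equal when moving from III to II to I under the same delay bound.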
5. Two-Level Cache Leakage Optimization
We use architectural simulations to gather cache access statistics
for each L1 and L2 cache size combination. To perform our 
evaluation, results from various benchmark suites such as 
SPEC2000, SPECWEB, TPC/C, etc., are collected.
L2 Cache Leakage Power Optimization. Because of their size, L2 
caches naturally dissipate much more leakage power than L1 caches. In
the first of our experiments, we fix the size of an L1 cache and 
assign the default Vth and Tox to the L1 cache, and then proceed to 
see which L2 organization would yield better leakage whilst still 
meeting the same average memory access time (AMAT)
constraint of the two-level cache system. For example, we can 
adjust both Vth and Tox knobs to make two different size caches 
that have the same AMAT: note that the AMAT is a function of
both the cache miss rate and access (hit) time. The results show
that, in general, a larger L2 consumes less leakage power than a
smaller one under the same delay constraint. This agrees with the 
trends presented in [7], which focused only on Vth assignment to 
optimize L2 cache leakage power. A larger L2 cache results in a 
lower miss rate and a lower AMAT; therefore, Vth and Tox of L2 
can be set more conservatively for a larger L2 than for a smaller
one. Nevertheless, having the largest available L2 does not always
yield the best leakage. This is because we reach a point where the
leakage of a very large L2 outweighs the benefit of the 
improvement in the L2 miss rate. 
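To make the AMAT argument above concrete, the sketch below uses the standard textbook expression for a two-level hierarchy, AMAT = L1 hit time + L1 miss rate × (L2 hit time + L2 local miss rate × memory latency). This specific form, the function names, and every number in the example are assumptions for illustration; they are not measurements from our architectural simulations.

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, mem_latency):
    """Average memory access time of an L1 + L2 + main memory hierarchy (ps)."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_local_miss_rate * mem_latency)


def max_l2_hit_time(target, l1_hit, l1_miss_rate, l2_local_miss_rate, mem_latency):
    """Slowest L2 hit time that still meets the target AMAT (solves amat() for l2_hit)."""
    return (target - l1_hit) / l1_miss_rate - l2_local_miss_rate * mem_latency


# Hypothetical operating point: times in ps, miss rates as fractions.
TARGET = 3000.0
L1_HIT, L1_MISS = 1000.0, 0.05
MEM_LATENCY = 100000.0

small_l2_budget = max_l2_hit_time(TARGET, L1_HIT, L1_MISS, 0.30, MEM_LATENCY)
large_l2_budget = max_l2_hit_time(TARGET, L1_HIT, L1_MISS, 0.20, MEM_LATENCY)
print("allowed hit time, small L2 (ps):", small_l2_budget)
print("allowed hit time, large L2 (ps):", large_l2_budget)
```

In this made-up example the larger L2, with its lower local miss rate, tolerates a longer hit time for the same AMAT target. That extra timing slack is what allows its Vth and Tox to be set more conservatively.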
In the second part of our analysis, we assign a Vth/Tox pair to the 
core array cells in an L2 cache and another pair to its peripheral
circuitry. In this scenario, there are two ways to improve AMAT. 
One is through reducing the L2 miss rate by employing a large L2 
cache as was done in the first part of our analysis; the other is by
setting the Vth/Tox assignments more aggressively in the peripheral
circuitry. We found that the latter approach works better in the 
cases we have investigated. After optimization we see that Vth and 
Tox in the core cell arrays are always set much more
conservatively than those in the peripheral circuitry. This allows 
the leakiest component in the L2, which is the core cell array, to
take on high values for Vth and Tox, thus saving leakage. At the 
same time, we can still meet the target delay because Vth and Tox
in the peripheral circuitry can be set sufficiently low. 
L1 Cache Leakage Power Optimization. Local L1 cache miss
rates are already very low and do not vary much among
L1 caches ranging from 4KB to 64KB, as illustrated in [7]. Hence,
given a fixed L2, the key to minimizing total leakage power is to
reduce the leakage power consumed by L1. A smaller L1 
consumes less leakage and is also 
faster. Therefore, a small L1 will probably be the optimal solution.
Entire Processor Memory System Energy Optimization. We
determine the number of Vth and Tox values needed to 
achieve the optimal total energy for a system comprising L1, 
L2 and main memory. Figure 2 shows that the best scheme has 2 
Tox’s and 3 Vth’s. However, the difference between a system with 
dual Tox, dual Vth and one with dual Tox, triple Vth is very small.
So in general a process with dual Tox and dual Vth is sufficient to 
achieve near optimal total energy. It is also worth noting that a 
single Tox and dual Vth process outperforms one with a single Vth
and dual Tox for the same reason that we discovered in Section 4: 
Vth is generally a more effective design knob than Tox.
[Figure 2 plot: total energy (pJ) versus AMAT (ps), with curves for 2 Tox + 2 Vth, 2 Tox + 3 Vth, 3 Tox + 2 Vth, 2 Tox + 1 Vth, and 1 Tox + 2 Vth.]
Figure 2. (Tox, Vth) Tuple Problem
References
[1] K. Nii, et al., Proc. ISLPED, pp. 293-298, 1998.
[2] M. Powell, et al., Proc. ISLPED, pp. 90-95, 2000.
[3] F. Hamzaoglu, et al., IEEE Transactions on VLSI Systems, Vol. 10, pp. 91-95, Apr. 2002.
[4] N. Azizi, et al., Proc. ISLPED, pp. 48-51, 2002.
[5] A. Agarwal, et al., IEEE JSSC, Vol. 38, pp. 319-328, 2003.
[6] C. H. Kim, et al., Proc. ISLPED, pp. 6-9, 2003.
[7] N. Kim, et al., Proc. ICCAD, pp. 627-632, 2003.
[8] Berkeley Predictive Technology Model, 2002.
[9] A. Sultania, et al., Proc. DAC, pp. 761-766, 2004.
[10] D. Bertsekas, “Nonlinear Programming,” Athena Scientific, 1995.
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’05) 
1530-1591/05 $ 20.00 IEEE 
