Managing Static Leakage Energy in Microprocessor Functional Units
Steven Dropsho, Volkan Kursun, David H. Albonesi, Sandhya Dwarkadas,
and Eby G. Friedman
Department of Computer Science
Department of Electrical and Computer Engineering
University of Rochester
Rochester, NY 14627
Abstract
Static energy due to subthreshold leakage current is pro-
jected to become a major component of the total energy
in high performance microprocessors. Many studies so far
have examined and proposed techniques to reduce leakage
in on-chip storage structures. In this study, static energy
is reduced in the integer functional units by leveraging the
unique qualities of dual threshold voltage domino logic.
Domino logic has desirable properties that greatly re-
duce leakage current while providing fast propagation
times. However, due to the energy cost of entering the
low leakage current state (sleep mode), domino logic has
thus far been used only for leakage reduction in the long-
term standby mode. We examine the utility of the sleep
mode (while considering the aforementioned costs) when
idle times are relatively short, one to a few hundred cycles,
as is often the case for functional units.
Using an analytical energy model suitable for
architecture-level analysis, we explore the interaction
of the application and technology, and the effect on energy
and performance as the underlying parameters are varied,
on a set of benchmarks. Our results show that if the
leakage approaches the magnitude projected in the
literature, then even for idle intervals as short as ten cycles,
an aggressive policy of activating the sleep mode at every
idle period performs well and a more complex control
strategy may not be warranted. We also propose a simple
design, called GradualSleep, to reduce the energy impact
of using the sleep mode for smaller idle periods.

This work was supported in part by NSF grants EIA-9972881, EIA-
0080124, CCR–9702466, CCR–9701915, CCR–9811929, CCR-9988361,
and CCR–9705594; by DARPA/ITO under AFRL contract F29601-00-K-
0182; by New York State Office of Science, Technology & Academic Re-
search to the Center for Advanced Technology – Electronic Imaging Sys-
tems and the Microelectronics Design Center; by an IBM Faculty Partner-
ship Award; and by external research grants from the corporations of Intel,
DEC/Compaq, Xerox, Eastman Kodak, Lucent Technologies, and Photon
Vision Systems, Inc.
1 Introduction
Energy dissipation has become a critical design con-
straint in high performance microprocessors. Until recently,
the focus has been on the dynamic energy dissipated in
CMOS circuits. In older technologies, the majority of
the energy is dissipated when transistors switch (transient
power dissipation). When the circuits are not active the cur-
rent is extremely low relative to switching, and thus, the
static energy consumed is negligible. This exaggerated re-
lationship between dynamic and static energy will experi-
ence a marked shift in the near future.
Static energy dissipation is a result of leakage current
due to the finite resistance of the off transistors between
power and ground, which is present whenever power is applied to
a CMOS circuit. The magnitude of the leakage current is
highly dependent on the threshold voltage ($V_t$) characteristics.
As integrated circuit technology scales to ever smaller
dimensions, supply voltage levels are likewise scaled. To
improve circuit speed, the threshold voltages are also de-
creased. This decrease in threshold voltage results in an ex-
ponential increase in the subthreshold leakage current [3].
The International Technology Roadmap for Semiconduc-
tors [2] projects dimensions of 70 nm to be in production
by the year 2006. At these dimensions, the leakage energy
is estimated to be on par with the dynamic switching energy
if novel circuit techniques are not developed [3].
Since most of the transistors in a microprocessor re-
side in the storage structures (the caches and buffers), the
RAMs are responsible for a large portion of the leakage
power [7, 9, 11, 14, 20]. The functional units, alternatively,
consist of a much smaller fraction of the transistors. How-
ever, the model developed by Butts and Sohi [6] for estimat-
ing leakage current in various logic structures reveals an or-
der of magnitude larger leakage current for combinational
logic relative to cache RAM transistors. While precise es-
timates for static power require detailed circuit knowledge
of the processor, which is not readily available, this model
indicates the integer and floating point functional units con-
tribute a noticeable fraction of the overall static power de-
spite the smaller transistor count relative to the caches.
In this paper, we present the benefits of employing a
dual threshold voltage domino logic circuit technique [16]
(dual-$V_t$) to reduce subthreshold leakage current in the integer
functional units (FUs) of a general-purpose processor.
We focus on domino logic dual-$V_t$ circuits because domino
logic has both superior speed and area characteristics as
compared to static CMOS logic circuits [1, 10, 13, 16]. We
restrict the analysis to the integer FUs because it is these
units that are most heavily utilized.
Some domino logic designs have a sleep mode in which
the circuit expends very little static energy. However, due
to the energy cost of entering this mode, it has thus far
been proven useful only to reduce leakage during long-term
standby mode. Idle times in the functional units can often
be relatively short, from one to a few hundred cycles. We
develop an energy model appropriate for the architecture-
level analysis of logic circuits and explore strategies to employ
the sleep mode in the dual-$V_t$ circuits so as to minimize
the overall energy when idle times are short. We use this en-
ergy model to develop insight into the dependencies among
the application behavior, activation of the idle mode, and
the underlying technology characteristics of the circuit.
We study both analytically and empirically (by deter-
mining the effects on the performance and energy of a
set of integer benchmarks) the benefits and costs of ag-
gressively enabling the sleep mode at every opportunity
(MaxSleep) relative to never enabling the sleep mode (Al-
waysActive). These two extreme sleep mode management
policies, MaxSleep and AlwaysActive, are the two simplest
policies possible and provide bounds on the energy savings
to which other sleep management methods should be com-
pared. Our results show that with idle intervals as short
as ten cycles, the MaxSleep policy performs well across a
broad range of parameters. We also propose a circuit-based
scheme we call GradualSleep that blends the best behav-
iors of MaxSleep and AlwaysActive and reduces the energy
impact of using the sleep mode for even smaller idle pe-
riods. We show that GradualSleep performs well across a
wide range of conditions. The simple GradualSleep design
achieves most of the potential energy savings, indicating
that more complex control strategies may not be warranted.
The rest of the paper is organized as follows. The low-
leakage domino circuit and its behavior are described in Section 2.
A static energy model appropriate for architectural
energy studies of functional units is developed in Section 3.
Our experimental methodology is described in Section 4.
The use of the sleep mode to reduce overall energy in inte-
ger functional units is evaluated in Section 5. Related work
is discussed in Section 6. Finally, concluding remarks are
made in Section 7.
2 Low-leakage logic-based circuit design
Dynamic domino logic gates are frequently used in crit-
ical paths within the functional units of high speed pro-
cessors. The structures of a static CMOS AND-gate with
its counterpart implemented as dynamic domino logic are
contrasted in Figures 1a and 1b. In static CMOS, the in-
puts are loaded by both the PMOS and NMOS transistors.
In domino logic, the inputs have only the NMOS device
as a load and thus are inherently faster. The operation of
the domino AND-gate is shown in Figure 1c. The inter-
nal node Dynamic is precharged during the low phase of
the clock. Note that the path to ground is cut-off by an
NMOS transistor during this time. When the clock transi-
tions high, the path to ground is enabled and the inputs are
evaluated. When both inputs are high, the dynamic node
is discharged and the output goes high. When either input
is low the dynamic node remains charged and the output
is low. The state of the dynamic node is preserved against
coupling noise, charge sharing, and charge leakage by the
keeper transistor. In contrast to static CMOS, every clock
cycle the dynamic nodes are precharged and the inputs eval-
uated regardless of whether the inputs change state. When
the circuit is not required, useless re-evaluation (and energy
cost) can be avoided by gating the clock such that the clock
input is forced high.
As described in [1, 13, 16], domino circuits permit the
use of dual threshold voltage techniques to reduce sub-
threshold leakage current without sacrificing active mode
circuit performance. The key to achieving this balance is
to place low-$V_t$ transistors only along the critical evaluation
path, as shown in Figure 2a, in which the shaded transistors
are the slower high-$V_t$ devices. The leakage current
of a dual-$V_t$ domino circuit is asymmetric and depends on
the voltage level at the internal dynamic node. If either In 1
or In 2 is low, the dynamic node remains high. In
this state, the voltage across the high leakage transistors N1,
N2, N3, and N4 results in a large subthreshold leakage
current. Alternatively, when both inputs and the clock
are high, the dynamic node is discharged and the low leakage
transistors P1, P2, and N5 are strongly cut off. When
the dynamic node is discharged, the voltage drop is across
the high-$V_t$ devices, which act as high resistance switches,
and not across the low-$V_t$ transistors. In this state, the static
energy of the circuit is dramatically reduced.
Dual-$V_t$ domino circuits that incorporate a low leakage
sleep mode do so by adding the ability to force the internal
nodes into the low leakage state. Many circuits incorporat-
ing a sleep mode have been proposed [1, 12, 13, 16]. For
the purpose of this paper, the essential behavior of all these
circuits is similar. The differences are in the complexity and
energy overhead of the sleep mode function. For the ensu-
ing discussion, we select a circuit from [16] that is simple
and incurs minimal energy overhead.
The proposed method for incorporating the low leakage
sleep (idle) mode into a dual-$V_t$ domino circuit is shown in
Figure 2b. A high-$V_t$ transistor is added to discharge the
dynamic node when the Sleep signal is asserted, regardless
of the input vector. Only the first stage in a sequence of
domino circuits requires this additional transistor. Assert-
ing the Sleep signal drives the Out signal high which turns
off the keeper and forces any subsequent domino gates to
evaluate to the low leakage state in a domino fashion. Not
shown is the standard gating of the clock when Sleep is as-
serted to disable the precharge phase. An important aspect
of this design is that the activation energy overhead of the
sleep transistor is negligible relative to the switching energy
of the gate, 0.14 fJ versus 22.2 fJ.
Table 1. OR8 gate characteristics (70 nm), [...], Period = 250 ps

                              Delay (ps)          Energy (fJ)
Circuit                       Evaluation  Sleep   Dynamic    Vector   Vector   Sleep
                                                  (1 gate)   LO Lkg   HI Lkg
low-$V_t$                     19.3        na      26.7       1.2      1.4      na
dual-$V_t$, no sleep mode     15.0        na      22.2       7.1E-4   1.4      na
dual-$V_t$, w/sleep mode      15.0        16.0*   22.2       7.1E-4   7.1E-4*  0.14*

* indicates sleep mode is enabled

Figure 1. Static vs. dynamic domino AND-gates: (a) static CMOS AND-gate,
(b) domino AND-gate, (c) domino operation (precharge and evaluate phases).

Figure 2. Low leakage domino AND-gates: (a) dual-$V_t$, (b) dual-$V_t$ with
sleep mode (sleep transistor added to the first stage of the logic pipeline).

The delay and energy parameters of an 8-input OR (OR8)
domino gate in 70 nm technology are compared in Table 1 for
low-$V_t$, dual-$V_t$ without the sleep mode, and dual-$V_t$ with
the sleep transistor. The parameters are [...] and [...].
Since leakage energy in dual-$V_t$ domino
circuits depends on the state of the circuit, Vector LO Lkg is the
input 10000000 which discharges the dynamic node to the
low leakage state, and Vector HI Lkg is the input 00000000
which does not discharge the dynamic node. The keeper
maintains the dynamic node at the high voltage level, which
is also the high leakage state.
The lower gate overdrive of the high-$V_t$ keeper transistor
in dual-$V_t$ domino circuits reduces the contention current
when switching the output and improves the propagation
delay and dynamic power characteristics as compared to
the low-$V_t$ domino circuit. In the dual-$V_t$ domino circuit,
the difference in leakage energy between the LO Lkg and
HI Lkg vectors is a factor of 2,000. Our method of incor-
porating a sleep mode is not in the evaluation path of the
gate so there is no impact on the propagation speed of the
circuit. The sleep transistor is minimally sized and intro-
duces negligible additional loading on the dynamic node of
a domino gate. With the sleep mode capability, we can force
the internal state of all of the gates to the low leakage state,
drastically reducing the leakage energy regardless of the in-
put vector. Enabling the sleep transistor, however, requires
some additional energy (0.14 fJ) which must be accounted
for. The delay in discharging the gate via the sleep tran-
sistor, 16 ps, is comparable to the delay of the evaluation
phase, 15 ps, so the circuit can transition to the sleep state
in one cycle. The measurements assume a 4 GHz clock.
The overhead of enabling the sleep mode depends upon
the state of the circuit from the previous evaluation phase.
The contributors to the dynamic energy dissipated during
an evaluation are the circuits whose input vectors cause the
dynamic nodes to discharge. In a complex circuit such as
an ALU, on average not all dynamic nodes will discharge
during an evaluation cycle. The activity factor is the probability
that a domino logic gate will evaluate and place its
dynamic node into a low voltage state in any given clock cycle.
The average activity factor ($\alpha$), therefore, determines
the fraction of the dynamic nodes that are discharged during
each evaluation period, $0 \le \alpha \le 1$. Activating the
sleep switch leaves the circuit in the same state as if the
activity factor were 1.0 in the last evaluation; thus, activating
the sleep mode discharges the dynamic nodes of the rest of
the gates in the circuit. This portion is the $(1-\alpha)$ fraction
evaluation period before the sleep mode. To return to the
active mode, the clock is again enabled and one precharge
phase readies the circuit for evaluation; thus, activation also
occurs within a single clock cycle.
2.1 Tradeoffs between active versus sleep modes
Enabling the sleep mode reduces the static energy
dissipated; however, this mode is entered by discharging all
of the dynamic nodes within the circuit that did not discharge
in the evaluation phase. Thus, there is a tradeoff
between the energy saved due to lower leakage current and
the additional energy expended in the next active cycle from
precharging these dynamic nodes that would have remained
charged had the sleep mode not been entered. The activity
factor $\alpha$ affects both the leakage energy and the energy
overhead in transitioning to the sleep mode. As previously
mentioned, activating the sleep mode is equivalent
to an evaluation with a maximum activity factor of $\alpha = 1.0$.
We approximate a generic functional unit (FU) by a circuit
consisting of 500 OR8-gates arranged as 100 rows of five
cascaded domino circuits. The circuit contains the drivers
that distribute the Sleep signal throughout the FU and this
energy is accounted for. The energy expenditure for this
circuit relative to the idle interval is shown in Figure 3. The
plot compares the effects of enabling the sleep mode versus
idling the circuit with clock gating only (the clock is gated
high and Sleep is not enabled). We refer to this latter case
as uncontrolled idle. We compare the tradeoffs at three
activity levels, $\alpha = 0.1$, $0.5$, and $0.9$. Results using only an
uncontrolled idle (Sleep signal not asserted) are the straight
lines emanating from the origin. Plots of using the sleep
mode rise quickly then plateau.

Figure 3. Uncontrolled idle versus sleep mode: energy (pJ) versus idle
interval (cycles) for alpha = 0.1, 0.5, and 0.9.
The graph shows that for a low activity factor there is a
considerable expenditure of energy to transition to the sleep
mode after which the additional energy is minimal. If the
circuit is not idle for at least 17 cycles then more energy is
used than is saved by shifting to the low leakage sleep state.
This extra energy decreases as the activity factor increases
since more nodes enter the low leakage state during the pre-
vious evaluation phase before the idle mode. Interestingly,
the time to break even is relatively insensitive across this
range of activity factor. The reason is that as the activity
factor increases, both the sleep transition overhead and the
uncontrolled idle circuit leakage energy decrease at a similar
rate, roughly proportional to $(1-\alpha)$.
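The tradeoff in this subsection can be approximated numerically from the per-gate values in Table 1. The following is a minimal sketch, not the authors' model: it assumes 500 identical OR8 gates, ignores the energy of the drivers that distribute the Sleep signal (which the measured circuit includes), and uses hypothetical function names.

```python
# Per-gate values from Table 1 (dual-Vt OR8 with sleep mode), in femtojoules.
E_DYNAMIC_FJ = 22.2    # energy of one full evaluation
E_LEAK_HI_FJ = 1.4     # leakage per cycle, dynamic node charged
E_LEAK_LO_FJ = 7.1e-4  # leakage per cycle, dynamic node discharged
E_SLEEP_FJ = 0.14      # energy to switch the sleep transistor
N_GATES = 500          # generic FU size used in this subsection


def uncontrolled_idle_pj(idle_cycles, alpha, n_gates=N_GATES):
    """Leakage energy (pJ) when the clock is gated but Sleep is not asserted."""
    per_cycle = alpha * E_LEAK_LO_FJ + (1.0 - alpha) * E_LEAK_HI_FJ
    return n_gates * per_cycle * idle_cycles / 1000.0


def sleep_mode_pj(idle_cycles, alpha, n_gates=N_GATES):
    """Energy (pJ) when Sleep is asserted for the whole idle interval.

    The transition cost is the later precharge of the (1 - alpha) nodes that
    the sleep transistor discharged, plus the sleep transistor energy.
    Assumption: Sleep-signal driver energy is not modeled here.
    """
    transition = (1.0 - alpha) * E_DYNAMIC_FJ + E_SLEEP_FJ
    return n_gates * (transition + E_LEAK_LO_FJ * idle_cycles) / 1000.0


def first_cycle_sleep_wins(alpha, max_cycles=1000):
    """Smallest whole idle interval at which sleeping costs no more than idling."""
    for cycles in range(1, max_cycles + 1):
        if sleep_mode_pj(cycles, alpha) <= uncontrolled_idle_pj(cycles, alpha):
            return cycles
    return None


if __name__ == "__main__":
    for alpha in (0.1, 0.5, 0.9):
        print(alpha, first_cycle_sleep_wins(alpha))
```

Under these assumptions the crossover lands at 16 to 17 cycles for all three activity levels, in line with the roughly 17 cycles read from Figure 3 and with the observation that the break even time is insensitive to the activity factor.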
3 Static energy model
A precise energy model depends heavily on the details
of the logic design and the circuit design. General circuit
methods to reduce static power include combining high-$V_t$
devices (slow, low leakage) with low-$V_t$ devices (fast, high
leakage) and placing the high-$V_t$ transistors along the non-
critical paths throughout a functional unit. We develop a
simple energy model that is parameterized and can repre-
sent the energy characteristics across a wide range of logic
and circuit designs at a level useful for architectural stud-
ies. The model parameterizes the contribution of the low
leakage and high leakage transistors in the overall energy
dissipation of the circuit. This parameterization of the frac-
tion of high leakage transistors abstracts the circuit specifics
into a single primary parameter that can be varied.
The total energy of a circuit is shown in equation (1).
The total energy consists of the dynamic and leakage en-
ergy during active cycles plus the leakage energy when the
circuit is idle. We divide the total run-time into three cate-
gories of operation. The cycles of actual computation are
called active cycles and their number is denoted as $n_A$.
The cycles when the circuit is clock-gated (no computation)
but the sleep mode is not enabled are called uncontrolled
idle cycles and denoted by $n_{UI}$. Cycles when the circuit
is forced into the low-leakage state of the sleep mode are
called sleep cycles and denoted by $n_S$.

$$
\begin{aligned}
E ={} & n_A \left[ \alpha E_D + (1-D)\,E_{HI} + D \left( \alpha E_{LO} + (1-\alpha)\,E_{HI} \right) \right] \\
      & + n_{UI} \left[ \alpha E_{LO} + (1-\alpha)\,E_{HI} \right]
        + n_T \left[ (1-\alpha)\,E_D + E_{sleep} \right]
        + n_S\,E_{LO}
\end{aligned}
\tag{1}
$$
The dynamic energy is the number of active cycles $n_A$
times the maximum dynamic energy per cycle, $E_D$, prorated
by an activity factor $\alpha$, which is the fraction of the
internal dynamic nodes that are actually discharged during
the evaluation phase. Recall that the dynamic nodes are
precharged prior to evaluation. The precharged state is also
the high leakage state of the circuit. If the clock has a duty
cycle (i.e., fraction of time the clock signal is high) of $D$,
$0 < D < 1$, then the precharge portion is $(1-D)$ of the
clock period. The leakage energy of every active cycle is
accounted for by prorating the per cycle high leakage energy,
$(1-D)E_{HI}$. Also added is the leakage energy after
evaluation. This active mode leakage energy consists of two
components. The first energy component is for the fraction
of nodes $\alpha$ that are placed into the low-leakage state, $E_{LO}$
(per cycle leakage energy with the internal dynamic node
discharged), in the normal operation of the circuit. The second
component is the fraction of nodes $(1-\alpha)$ that are
not discharged (internal dynamic node is high) and have a
greater leakage energy, $E_{HI}$. In the active cycles we account
for this energy only for the portion of the clock period when
the clock is high, $D$. For uncontrolled idle cycles, where
gating the clock prevents precharge, we do not prorate $n_{UI}$
by the duty cycle. If the circuit is sometimes placed into the
sleep mode, we add the energy expended in transitioning
$n_T$ times into the sleep state. This energy cost is the additional
energy to precharge the $(1-\alpha)$ nodes that would not
have been discharged if the circuit had not been forced into
the sleep mode. The per transition energy is thus $(1-\alpha)E_D$.
Also included is the overhead of activating the sleep mode
transistors and distributing the sleep signal across the FU,
$E_{sleep}$. The final term is the static energy while in the sleep
state, i.e., all internal dynamic nodes have been discharged
and the gates dissipate an energy of $E_{LO}$ for $n_S$ cycles.
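The four contributions just described (active cycles, uncontrolled idle cycles, sleep transitions, and sleep cycles) can be written as a small function. This is a sketch under stated assumptions: the class, field, and argument names are hypothetical, and the default duty cycle of 0.5 is simply the value the analysis section adopts later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitEnergy:
    """Per-cycle energies of one circuit, all in the same unit (e.g. fJ)."""
    e_dynamic: float   # maximum dynamic energy of one evaluation
    e_leak_hi: float   # leakage per cycle with dynamic nodes charged
    e_leak_lo: float   # leakage per cycle with dynamic nodes discharged
    e_sleep: float     # overhead of asserting and distributing Sleep


def total_energy(c, n_active, n_unc_idle, n_sleep, n_transitions,
                 alpha, duty=0.5):
    """Sum the four contributions described for equation (1)."""
    # Leakage of a circuit whose nodes are in the post-evaluation mix:
    # a fraction alpha discharged (low leakage), the rest still charged.
    mixed_leak = alpha * c.e_leak_lo + (1.0 - alpha) * c.e_leak_hi
    active = n_active * (
        alpha * c.e_dynamic            # switching energy
        + (1.0 - duty) * c.e_leak_hi   # precharge phase: all nodes charged
        + duty * mixed_leak            # evaluate phase: mixed node states
    )
    unc_idle = n_unc_idle * mixed_leak  # clock gated, not prorated by duty
    transitions = n_transitions * ((1.0 - alpha) * c.e_dynamic + c.e_sleep)
    sleeping = n_sleep * c.e_leak_lo
    return active + unc_idle + transitions + sleeping
```

For example, with the dual-threshold OR8 values of Table 1, `total_energy(CircuitEnergy(22.2, 1.4, 7.1e-4, 0.14), 0, 0, 0, 1, 0.5)` isolates the cost of a single sleep transition at an activity factor of 0.5.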
Since we are using circuits based on dual-$V_t$ domino
logic, we can simplify (1). In dual-$V_t$ circuits, the static
energy $E_{LO}$ is much less than $E_{HI}$ [16]. We define a
relationship between the two as $E_{LO} = k \cdot E_{HI}$ where $k$ is
typically in the range of [...]. Furthermore,
for a given technology, we can define the leakage energy as
a fraction of the dynamic energy for a device, $E_{HI} = p \cdot E_D$
where $p > 0$. To elaborate, for a single gate the factor $p$ is
the ratio of the maximum leakage energy expended to the
maximum energy for evaluation per time unit (1 cycle). For
circuits in a 130 nm technology, the value will be small,
$p <$ [...]. This leakage factor $p$ is a versatile parameter.
Functional units may be designed using all domino logic or
a mix of dynamic domino logic and static logic. In the latter
case where there is a mix of low-$V_t$ devices along the critical
paths and high-$V_t$ devices along the non-critical paths,
we can consider the circuit as a whole and use the ratio of
its leakage energy to its evaluation energy as the factor $p$.
This value of $p$ is lower than that of a single low-$V_t$ gate but
greater than that of a high-$V_t$ gate. Thus, the factor $p$ abstracts
the details of the circuit into a single value that models the
worst-case leakage behavior relative to the dynamic energy $E_D$. The
factor $p$ becomes a key parameter that we vary to explore
the technology design space. Applying the above relationships
results in equation (2).

$$
\begin{aligned}
E ={} & n_A \left[ \alpha E_D + (1-D)\,p E_D + D \left( \alpha k p E_D + (1-\alpha)\,p E_D \right) \right] \\
      & + n_{UI} \left[ \alpha k p E_D + (1-\alpha)\,p E_D \right]
        + n_T \left[ (1-\alpha)\,E_D + E_{sleep} \right]
        + n_S\,k p E_D
\end{aligned}
\tag{2}
$$
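In this architectural study we focus on the relative energy
between control policies. We can further simplify (2) by
normalizing to the active energy $E_D$ as in (3).

$$
\begin{aligned}
\frac{E}{E_D} ={} & n_A \left[ \alpha + (1-D)\,p + D\,p \left( \alpha k + (1-\alpha) \right) \right]
        + n_{UI}\,p \left( \alpha k + (1-\alpha) \right) \\
      & + n_T \left[ (1-\alpha) + \frac{E_{sleep}}{E_D} \right]
        + n_S\,k p
\end{aligned}
\tag{3}
$$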
Equation (3) represents the total energy of a circuit in
terms of three factors: the technology, the control policy,
and the application. The technology defined parameters are
$p$, $k$, $E_{sleep}$, and $E_D$. Together, the control policy and
application determine the active, uncontrolled idle, and sleep
times $n_A$, $n_{UI}$, and $n_S$, respectively. The application
determines the activity factor $\alpha$.
To give perspective to the magnitude of the technology
variables, we calculate the values for the circuit
characterization described in Section 2 from the data listed in
Table 1. The maximum dynamic energy, $E_D$, is 22.2 fJ. The
ratio $E_{sleep}/E_D$ is 0.006. The ratio $k$ of the static energy per
cycle in the low leakage state to that in the high leakage
state is $k \approx 0.0005$. The ratio of per cycle leakage energy
to the dynamic switching energy is the leakage factor
$p = 1.4/22.2 \approx 0.06$. Since (3) models the energy relative
to $E_D$, the relatively small values of the other factors
mean that the leakage factor $p$ has the greatest impact. We note
here that from a similar circuit characterization by Heo and
Asanovic [10] we estimate from the data in the paper that
their implementation of a Han-Carlson adder circuit in 70
nm technology has a leakage factor that is comparable to
our result, with $p$ between [...] and [...].
3.1 Analysis
The analytical model permits quick exploration of the
parameter space to find interesting regions that might not be
evident from simulating individual data points. We choose
values for $k$ and $E_{sleep}$ that are in agreement with the circuit
measurements but somewhat pessimistic (higher). Specifically,
we set $k =$ [...] and $E_{sleep} =$ [...] $E_D$. We vary
the leakage factor $p$ from 0 to 1 to cover a broad range
of technology points that include relatively extreme points
in terms of the energy contribution caused by subthreshold
leakage current. In some of the results we select specific
values for $p$. In these cases, we restrict $p$ to be either 0.05
(motivated by the values calculated from the circuit
characterization) or 0.50 (a convenient number to demonstrate
contrasting behavior). These two technology points act as
representatives for two distinct behavior regions that, as we
shall see, require very different methods for reducing leakage
energy. In the rest of this paper, we assume a fixed clock
duty cycle of 50% ($D = 0.5$).
Breakeven idle interval. The break even idle interval is
the length of time that a circuit must be idle in order that
the energy saved in the sleep mode offsets the additional
energy required for the transition. Let us parameterize the
normalized energy from (3) as $E(n_A, n_{UI}, n_S, n_T, \alpha, p)$ and calculate
the break even point for a single idle interval. Thus, the
break even interval $t_{be}$ is the interval that satisfies the
following relationship:

$$
E(n_A,\; t_{be},\; 0,\; 0,\; \alpha,\; p) = E(n_A,\; 0,\; t_{be},\; 1,\; \alpha,\; p) \tag{4}
$$

$$
t_{be} = \frac{(1-\alpha) + E_{sleep}/E_D}{p\,(1-\alpha)(1-k)} \tag{5}
$$

In (4), the left-hand side represents the energy if the circuit
is not placed into the sleep mode, $n_S = n_T = 0$, and
is left as uncontrolled idle, $n_{UI} = t_{be}$ cycles. The right-hand
side of (4) is the energy required for a single transition
to the sleep mode, $n_T = 1$ and $n_S = t_{be}$, assuming no
uncontrolled idle cycles, $n_{UI} = 0$. We omit the simple
algebraic manipulations and give the solution for $t_{be}$ in
(5); a graph of (5) is shown in Figure 4a (the curves for
$\alpha = 0.1$ and $\alpha = 0.5$ are almost identical at this scale). The
vertical line at $p =$ [...] indicates where the near-term
technology point lies. The plot delineates the break even
intervals across a range of leakage factors, $p$, for three activity
levels, $\alpha = 0.1$, $0.5$, and $0.9$. From the graph it is apparent
that as leakage becomes a larger component of the energy,
the break even interval decreases, approximately as $1/p$.
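The break even interval can be evaluated directly once the omitted algebra is carried out. Equating the normalized cost of t uncontrolled idle cycles, t·p·(1 − α + αk), with the cost of one sleep transition plus t sleep cycles, (1 − α) + E_sleep/E_D + t·k·p, and solving for t gives the closed form coded below. This is a sketch; the function and argument names are hypothetical, and `sleep_ratio` stands for E_sleep divided by the maximum dynamic energy.

```python
def break_even_interval(p, alpha, k, sleep_ratio):
    """Idle length (cycles) at which one sleep transition pays for itself.

    p           -- leakage factor (high-state leakage / dynamic energy)
    alpha       -- activity factor, 0 <= alpha < 1
    k           -- low-state leakage / high-state leakage, 0 <= k < 1
    sleep_ratio -- sleep activation overhead / dynamic energy
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if p <= 0.0 or not 0.0 <= k < 1.0:
        raise ValueError("need p > 0 and 0 <= k < 1")
    overhead = (1.0 - alpha) + sleep_ratio            # cost of one transition
    saving_per_cycle = p * (1.0 - alpha) * (1.0 - k)  # leakage avoided per cycle
    return overhead / saving_per_cycle


if __name__ == "__main__":
    # Illustrative values only: k and sleep_ratio are taken from the circuit
    # characterization (about 0.0005 and 0.006), which is an assumption here.
    for p in (0.05, 0.50):
        print(p, round(break_even_interval(p, 0.5, 0.0005, 0.006), 2))
```

Because the (1 − α) factor appears in both the overhead and the per-cycle saving, the result depends only weakly on the activity factor and scales as the reciprocal of the leakage factor.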
Modeling control strategies of the sleep mode. An
advantage of a mathematical model is that it permits
exploration of the parameter space before any simulations
are run. For this section, we explore three basic methods
for controlling the Sleep signal. These methods are
distinguished by being easily modeled and defining the boundary
cases of managing the sleep mode. The first method,
AlwaysActive, represents the case of doing nothing other than
clock gating. We never enable the sleep mode so all idle
cycles are uncontrolled idle cycles and the circuit expends
greater leakage energy. The second method, MaxSleep,
aggressively enables the sleep mode whenever the circuit has
no useful calculation to perform in the following cycle. The
MaxSleep method incurs the maximum energy transition
overhead. The third method, NoOverhead, provides an upper
bound on energy savings. This method is the same as
MaxSleep but we omit the energy overhead for transitioning
into the sleep mode. Thus, the NoOverhead strategy
represents an unachievable lower bound on total energy and,
therefore, is an upper bound on energy savings for any
control method. Formally, the energies for each of the strategies
are defined in (6)-(8). $E_{base}$ of (9) is the maximum dynamic
energy that the circuit can expend by performing a calculation
on every cycle assuming an activity factor $\alpha$, and $N$ is
the total number of cycles for the simulation. We normalize
the graphs to $E_{base}$ as a useful baseline for the magnitude
of the energy differences. Here, we are exploring how the
relative energy costs vary across the parameter space.

$$
E_{AlwaysActive} = E(n_A,\; n_{UI},\; 0,\; 0,\; \alpha,\; p) \tag{6}
$$

$$
E_{MaxSleep} = E(n_A,\; 0,\; n_S,\; n_T,\; \alpha,\; p) \tag{7}
$$

$$
E_{NoOverhead} = E(n_A,\; 0,\; n_S,\; 0,\; \alpha,\; p) \tag{8}
$$

$$
E_{base} = E(N,\; 0,\; 0,\; 0,\; \alpha,\; p) \tag{9}
$$
To limit the degrees of freedom, we link the four parameters
$n_A$, $n_{UI}$, $n_S$, and $n_T$ by a single parameter called a
usage factor ($f_A$). We define this relationship as follows.
Assuming a simulation with a total of $N$ cycles, we define
$n_A = f_A N$, $0 \le f_A \le 1$. Since $N = n_A + n_{UI} + n_S$,
for the AlwaysActive policy in which we do nothing and
all idle cycles are uncontrolled idle cycles (there are no
sleep cycles), $n_{UI} = (1 - f_A) N$ and $n_S = 0$. Conversely,
in the MaxSleep policy, all idle cycles are sleep
cycles (there are no uncontrolled idle cycles) so $n_{UI} = 0$
and $n_S = (1 - f_A) N$. We also define $n_T$ as a function of
$n_A$, $n_S$, and $t_{idle}$, the average idle interval duration. Recall
that $n_T$ is the number of times the circuit is placed into the
sleep mode and determines how often the transition energy
overhead is incurred. For a given average interval length
$t_{idle}$, the number of transitions to the sleep mode (in the
MaxSleep policy) is $n_T = \min(n_A,\; n_S / t_{idle})$ or, equivalently,
$n_T = \min(f_A N,\; (1 - f_A) N / t_{idle})$. The min function is
necessary because we must limit the number of transitions to be
no greater than the number of active cycles. This restriction
ensures that every transition into the sleep mode implies at
least one prior active cycle. The energy for the NoOverhead
method is the same as for MaxSleep if $n_T = 0$. The base
energy $E_{base}$ is the energy of the functional unit if during
every cycle the unit performs a calculation, thus, $n_A = N$
and $n_{UI} = n_S = n_T = 0$.
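The three policies and the usage-factor parameterization can be combined into one short routine. The following is a sketch, not the authors' code: the function names are hypothetical, the normalized energy follows the model of (3), and the values passed for `k` and `sleep_ratio` in any call are assumptions chosen by the caller.

```python
def normalized_energy(n_a, n_ui, n_s, n_t, alpha, p, k, sleep_ratio, duty=0.5):
    """Energy in units of the maximum dynamic energy, following (3)."""
    mixed = p * (1.0 - alpha + alpha * k)  # leakage with post-evaluation states
    return (n_a * (alpha + (1.0 - duty) * p + duty * mixed)
            + n_ui * mixed
            + n_t * ((1.0 - alpha) + sleep_ratio)
            + n_s * k * p)


def policy_energies(f_a, t_idle, alpha, p, k, sleep_ratio, n_total=1_000_000):
    """Energies of the three policies, relative to 100% computation."""
    n_a = f_a * n_total
    n_idle = (1.0 - f_a) * n_total
    # At most one sleep transition per active cycle.
    n_t = min(n_a, n_idle / t_idle)
    tech = (alpha, p, k, sleep_ratio)
    base = normalized_energy(n_total, 0, 0, 0, *tech)
    return {
        "AlwaysActive": normalized_energy(n_a, n_idle, 0, 0, *tech) / base,
        "MaxSleep": normalized_energy(n_a, 0, n_idle, n_t, *tech) / base,
        "NoOverhead": normalized_energy(n_a, 0, n_idle, 0, *tech) / base,
    }


if __name__ == "__main__":
    # Assumed illustrative technology values: k = 0.001, sleep_ratio = 0.01.
    for p in (0.05, 0.50):
        print(p, policy_energies(0.10, 10, 0.5, p, 0.001, 0.01))
```

With a 10% usage factor and 10-cycle idle intervals, this sketch ranks MaxSleep above AlwaysActive in energy at a leakage factor of 0.05 and well below it at 0.50, which is the qualitative contrast between the two technology points discussed in the text.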
The total energy for the three control strategies versus
the leakage factor $p$, for a fixed activity factor $\alpha =$ [...], and
normalized to $E_{base}$ is plotted in Figure 4b with the idle
interval $t_{idle} = 10$ cycles. The bottom three lines are for
$f_A = 0.10$. The top three lines are for $f_A = 0.90$. The
plots for MaxSleep and NoOverhead lie almost on top of
each other. A similar plot is shown in Figure 4c but with
$t_{idle} = 100$ cycles. Together, these plots show behavior at
extremes of the usage factor and idle intervals (100 cycles
happens to be a long idle interval).
In Figure 4b, the lower grouping of three lines is for a
10% usage factor. The lowest energy line is the NoOver-
head policy. The slope is relatively flat since 90% of the
time the circuit is in the low leakage sleep mode. The Al-
waysActive line shows a sharp rise as the leakage factor in-
creases. The line for the MaxSleep policy runs parallel to
that of the NoOverhead policy. The difference between the
two lines is the energy overhead to place the circuit into the
low leakage state when the Sleep signal is enabled. At small
values of $p$ (low-leakage), the MaxSleep policy uses more
energy than the AlwaysActive policy when the break even
interval is greater than 10 cycles (see Figure 4a).

Figure 4. Exploring the parameter space of the model: (a) break even
idle interval (cycles) versus leakage factor p for alpha = 0.1, 0.5, and
0.9; (b) idle interval = 10 cycles and (c) idle interval = 100 cycles,
energy relative to 100% computation versus leakage factor p for the
AlwaysActive, MaxSleep, and NoOverhead policies at f_A = 0.90 and
f_A = 0.10; (d) worst case, idle interval = 1 cycle, at f_A = 0.50.
The relative behavior of the policies at the 90% usage
factor is similar but the differences are compressed. Since
all three policies have identical energy in the active phase,
which accounts for 90% of the time, differentiation between
the policies can occur only in the remaining 10% of the
cycles. Figure 4c is a similar plot with an average idle
interval of 100 cycles. With the larger idle interval the
MaxSleep policy is nearly identical to the NoOverhead policy
at the 10% usage level. The difference between Figures 4b and
4c is that in the latter figure the transition energy is
amortized over 100 cycles as compared to only 10 cycles. The
worst case at the 50% usage level is shown in Figure 4d, where
the average idle interval is 1 cycle, meaning the circuit
alternates between one active and one sleep cycle to incur the
maximum transition overhead.
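The worst case can be made concrete by extracting idle intervals from a per-cycle activity trace. The following is a minimal sketch; the trace representation (one boolean per cycle, True when the unit is active) and the function name are assumptions made for illustration.

```python
from itertools import groupby


def idle_intervals(busy_trace):
    """Return the lengths of the maximal runs of idle cycles."""
    return [sum(1 for _ in run)
            for busy, run in groupby(busy_trace)
            if not busy]


# One active cycle followed by one idle cycle, repeated: 50% usage.
alternating = [True, False] * 50
lengths = idle_intervals(alternating)

# Under MaxSleep every idle interval costs one sleep transition, so the
# alternating trace pays one transition per active cycle.
transitions = len(lengths)
active_cycles = sum(alternating)
```

For this trace the number of transitions equals the number of active cycles, which is the upper limit discussed in Section 3.1.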
3.2 The GradualSleep design
The brief exploration of the energy model space in
Section 3.1 showed that the preferred policy for managing the
sleep mode depends on parameters for the technology ($p$)
and the control policy/application behavior (embodied by
the usage factor $f_A$ and the average idle interval in the
discussion). The MaxSleep policy works well if the average
idle interval is longer than the break even interval, but the
AlwaysActive policy performs better when the idle interval is
shorter than the break even interval. A policy that selects
the minimum energy between the two options,
$\min(E_{\text{MaxSleep}}, E_{\text{AlwaysActive}})$, is the
best combination of the two policies.
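To illustrate the break even reasoning, the sketch below uses a deliberately simplified cost model that is an assumption of this example, not the energy model of Section 3: an idle unit that stays awake pays a constant leakage energy per cycle, and a sleeping unit pays a one-time transition energy plus a smaller per-cycle leakage. All names and numeric values are hypothetical, and the minimum is applied per idle interval, which is one possible reading of the combined policy.

```python
def always_active_energy(idle_len, leak):
    """Idle-phase energy when the unit never sleeps."""
    return leak * idle_len


def max_sleep_energy(idle_len, sleep_leak, transition):
    """Idle-phase energy when the unit sleeps in the first idle cycle."""
    return transition + sleep_leak * idle_len


def break_even_interval(leak, sleep_leak, transition):
    """Idle length at which both policies cost the same."""
    return transition / (leak - sleep_leak)


def oracle_energy(idle_len, leak, sleep_leak, transition):
    """Energy of an idealized policy that picks the cheaper option."""
    return min(always_active_energy(idle_len, leak),
               max_sleep_energy(idle_len, sleep_leak, transition))


# Hypothetical units: 1.0 per awake idle cycle, 8.0 per sleep transition.
LEAK, SLEEP_LEAK, TRANSITION = 1.0, 0.0, 8.0
n_be = break_even_interval(LEAK, SLEEP_LEAK, TRANSITION)
short = oracle_energy(4, LEAK, SLEEP_LEAK, TRANSITION)
long_ = oracle_energy(20, LEAK, SLEEP_LEAK, TRANSITION)
```

With these values the idealized policy stays awake for the 4-cycle interval and sleeps for the 20-cycle interval. Such a policy needs to know the interval length in advance, which is why it serves only as a reference point.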
Here we propose a method that is a hybrid of the
MaxSleep and AlwaysActive control schemes. By dividing
the circuit into slices and staggering the Sleep enable sig-
nal, we can incrementally place the circuit into the sleep
mode and avoid the initial energy dissipation in the first
idle cycle as in the MaxSleep policy. This method also
protects against excessive static energy consumption that
the AlwaysActive policy would incur in the event of a long
idle interval. A block diagram of a circuit divided into four
slices is shown in Figure 5a. The timing of the Sleep sig-
nal is shown in Figure 5b. The Sleep signal feeds one end
of a shift register whose bits supply the Sleep signal to the
different slices of the circuit. The AND gates ensure simul-
taneous re-activation of the circuit. All of the register bits
are simultaneously cleared upon de-assertion of the Sleep
signal. While any level of granularity can be used, we assume
the number of slices matches the number of cycles in the break
even interval for the technology, so that in every idle cycle
one more slice of the circuit enters the sleep mode. Using
fewer slices changes the curve for GradualSleep to be more
similar to the MaxSleep behavior. Adding more slices re-
sults in a shift towards the Always Active behavior. We hide
assertion/deassertion of the Sleep signal behind the register
read stage of the pipeline. The basic pipeline of the Al-
pha 21264 [15] is shown in Figure 6, as is a single, generic
Sleep signal to one of the FUs. At the end of the issue
stage the number of integer instructions to be executed is
known and the appropriate FUs are activated before the in-
structions reach the execute stage. Since the Sleep signal is
staged and is not along a critical path, the shift register and
AND gates can be constructed from slower, high-$V_t$ transistors
with very low subthreshold leakage current. We do not
include the small additional dynamic energy in the analysis.
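The staggered Sleep signals can be modeled at the cycle level as follows. This is a behavioral sketch under stated assumptions, not the circuit itself: it assumes that the first slice enters the sleep mode in the first idle cycle and that one further slice follows in each subsequent idle cycle. The function name and trace format are hypothetical.

```python
def gradual_sleep_signals(sleep_trace, num_slices):
    """Return, for each cycle, the tuple of per-slice sleep signals."""
    register = [0] * num_slices
    history = []
    for sleep in sleep_trace:
        if sleep:
            # Shift a 1 in at one end; earlier bits move one slice along.
            register = [1] + register[:-1]
        else:
            # De-assertion of Sleep clears all register bits at once.
            register = [0] * num_slices
        # AND each bit with Sleep, mirroring the gates that re-activate
        # all slices together.
        gate = 1 if sleep else 0
        history.append(tuple(bit & gate for bit in register))
    return history


# Five idle cycles, one active cycle, then idle again, with four slices.
trace = [1, 1, 1, 1, 1, 0, 1]
signals = gradual_sleep_signals(trace, num_slices=4)
sleeping_per_cycle = [sum(cycle) for cycle in signals]
```

In this run the number of sleeping slices grows by one per idle cycle until all four are asleep, drops to zero in the active cycle, and restarts from one slice afterwards.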
The energy costs of transitioning to the sleep mode for
the three policies are compared in Figure 5c. We fix the
leakage factor $p$ for reasons discussed in Section 3.1 and
arbitrarily fix the activity factor $\alpha$ and the usage
factor $f_A$. The relative shape of the curves is consistent
regardless of the parameter values. The GradualSleep policy
saves energy over the MaxSleep policy when the idle interval
is short and is better than the AlwaysActive policy when the
interval is long. Near the break even point the GradualSleep
policy expends more energy than the other two policies. The
GradualSleep design acts as a hedge against the pathological
case of short alternating active and idle intervals as
highlighted in Figure 4d of Section 3.1. The results described
in Section 5 show that the GradualSleep policy successfully
avoids the extremes of the other two policies.
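The qualitative shape of this comparison can be reproduced with a simplified cost model. The sketch below is an assumption of this example rather than the model used to generate Figure 5c: leakage is linear in the number of awake idle cycles, the transition energy is split evenly across slices, slice k enters the sleep mode at the start of idle cycle k, and all numeric values are hypothetical.

```python
def always_active(idle_len, leak):
    return leak * idle_len


def max_sleep(idle_len, sleep_leak, transition):
    return transition + sleep_leak * idle_len


def gradual_sleep(idle_len, leak, sleep_leak, transition, num_slices):
    """Idle-phase energy when slice k goes to sleep in idle cycle k."""
    total = 0.0
    for k in range(1, num_slices + 1):
        awake = min(idle_len, k - 1)  # idle cycles slice k spends awake
        total += (leak / num_slices) * awake
        if idle_len >= k:  # the interval lasts long enough for slice k
            total += transition / num_slices
            total += (sleep_leak / num_slices) * (idle_len - awake)
    return total


# Hypothetical units: break even interval of 8 cycles, 8 slices.
LEAK, SLEEP_LEAK, TRANSITION, SLICES = 1.0, 0.0, 8.0, 8
rows = {
    n: (always_active(n, LEAK),
        max_sleep(n, SLEEP_LEAK, TRANSITION),
        gradual_sleep(n, LEAK, SLEEP_LEAK, TRANSITION, SLICES))
    for n in (1, 8, 100)
}
```

Under these assumptions GradualSleep costs far less than MaxSleep for a 1-cycle interval, far less than AlwaysActive for a 100-cycle interval, and more than both at the 8-cycle break even point.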
4 Experimental methodology
We use the Simplescalar simulator [5] to verify the
analysis in Section 3. The processor is
modeled after the Alpha 21264 and the configuration pa-
rameters are given in Table 2. We have modified the sim-
ulator to have individual structures for the reorder buffer,
integer queue, floating point queue, and load store queue as
in the Alpha 21264. We restrict the study to the integer units
since integer operations are generally the dominant type of
instructions executed, thus, these functional units are heav-
ily utilized. The integer benchmarks are listed in Table 3.
The goal of this study is to explore the potential for im-
proving energy efficiency with fine-grained control of static
energy in large logic circuits. To ensure the results are not
inflated by excess resources that can be trivially put to sleep,
we limit the number of functional units. Our processor
configuration supports up to four integer functional units.
For each application, we determine the minimum
number of functional units required to achieve at least
95% of the peak performance from using four functional
units. Restricting the number of functional units makes it
more difficult for a control method to successfully exploit
the sleep mode and, thus, makes differences between control
methods more meaningful. Implicit in this methodology is the
assumption that some technique of profiling [19] or compiler
analysis [18] can be used to identify a priori when functional
units are not needed. Such an analysis could be used to signal
the run-time system that some functional units are unnecessary
and can be disabled at the start of an application. The last
column in Table 3 shows the number of integer units used for
each benchmark in all of the simulations. The fourth column
lists the maximum IPC with four functional units, while the
fifth column lists the achieved IPC for the given number of
functional units.

[Figure 5. The GradualSleep design. (a) Block diagram: a functional unit divided into Slice 1 through Slice 4, controlled by signals Sleep 1 through Sleep 4 derived from the Clock and Sleep inputs. (b) GradualSleep signal timing for Sleep 1 through Sleep 4 over an idle period. (c) Energy to transition to the sleep mode: energy relative to E_A versus idle interval (0 to 100 cycles) for MaxSleep, GradualSleep, and AlwaysActive.]

[Figure 6. The Sleep signal timing, shown against the pipeline stages Fetch, Rename, Issue, Reg Read, Execute, and Memory.]
Table 2. Architectural Parameters
Parameter                     Value
Fetch queue                   8 entries
Branch predictor              combination of bimodal and 2-level gshare;
                              bimodal/gshare Level 1/2 entries: 2048, 1024 (hist. 10), 4096 (global), resp.;
                              combining predictor entries: 1024;
                              RAS entries: 32; BTB: 4096 sets, 2-way
Branch mispred. latency       10 cycles
Fetch, decode, issue width    4 instructions
Reorder buffer                128 entries
Integer issue                 32 entries
Floating point issue          32 entries
Physical integer regs         96 entries
Physical floating point regs  96 entries
Load entries                  32 entries
Store entries                 32 entries
Instruction TLB               256 entry 4-way, 8K pages, 30 cycle miss
Data TLB                      512 entry 4-way, 8K pages, 30 cycle miss
Memory latency                80 cycles
L1 I-cache                    64 KB, 4-way, 64B line, 2 cycle
L1 D-cache                    64 KB, 4-way, 64B line, 2 cycle
L2 unified                    2 MB 8-way, 128B line, 12 cycle
Table 3. Benchmarks
App     Suite       Instr. Window     Max IPC  IPC    FUs
health  Olden       80M-140M          0.560    0.554  2
mst     Olden       entire pgm, 14M   1.748    1.748  4
gcc     SPEC95 INT  1650M-1750M       1.622    1.619  2
gzip    SPEC2K INT  2000M-2050M       2.120    2.120  4
mcf     SPEC2K INT  1000M-1050M       0.523    0.503  2
parser  SPEC2K INT  2000M-2100M       1.692    1.692  4
twolf   SPEC2K INT  1000M-1100M       1.542    1.475  3
vortex  SPEC2K INT  2000M-2100M       2.387    2.387  4
vpr     SPEC2K INT  2000M-2100M       1.481    1.431  3
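The selection rule for the number of functional units (the smallest count that reaches at least 95% of the four-unit IPC) can be sketched as follows. The function name is hypothetical. In the twolf example, the IPC values for three and four units come from Table 3, while the values for one and two units are made-up placeholders because the table does not report them.

```python
def min_units_for_target(ipc_by_units, target=0.95):
    """Smallest unit count whose IPC is at least target * peak IPC.

    The peak is taken as the IPC at the largest unit count provided.
    """
    peak = ipc_by_units[max(ipc_by_units)]
    for units in sorted(ipc_by_units):
        if ipc_by_units[units] >= target * peak:
            return units
    return max(ipc_by_units)


# twolf: 1.475 (3 units) and 1.542 (4 units) are from Table 3;
# the entries for 1 and 2 units are hypothetical.
twolf_ipc = {1: 0.90, 2: 1.30, 3: 1.475, 4: 1.542}
twolf_units = min_units_for_target(twolf_ipc)

# Each (Max IPC, IPC) pair from Table 3 should satisfy the 95% criterion.
table3 = {"health": (0.560, 0.554), "mst": (1.748, 1.748),
          "gcc": (1.622, 1.619), "gzip": (2.120, 2.120),
          "mcf": (0.523, 0.503), "parser": (1.692, 1.692),
          "twolf": (1.542, 1.475), "vortex": (2.387, 2.387),
          "vpr": (1.481, 1.431)}
all_meet_target = all(ipc >= 0.95 * peak for peak, ipc in table3.values())
```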
In the simulations, we allocate operations to the set of
functional units in round robin fashion and record precise
statistics on the idle times for each functional unit. From
this data, we calculate the total energy used by each func-
tional unit by summing the energies for active cycles, un-
controlled idle cycles, and sleep mode cycles as given in
equation (3). The total energy of the integer unit is the sum
of the energies of the individual functional units. Values of
the equation parameters are listed in Table 4.
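The bookkeeping described above can be sketched as follows. This is a simplified stand-in for the simulator instrumentation, with several assumptions: operations occupy a unit for exactly one cycle, the round-robin pointer persists across cycles, and an idle run that is still open at the end of the trace is recorded as an interval. The function name and input format are hypothetical.

```python
def simulate_round_robin(ops_per_cycle, num_units):
    """Allocate operations round robin and record idle interval lengths.

    ops_per_cycle[i] is the number of single-cycle operations issued in
    cycle i. Returns one list of idle interval lengths per unit.
    """
    next_unit = 0
    idle_run = [0] * num_units
    intervals = [[] for _ in range(num_units)]
    for ops in ops_per_cycle:
        if ops > num_units:
            raise ValueError("more operations than functional units")
        busy = set()
        for _ in range(ops):
            busy.add(next_unit)
            next_unit = (next_unit + 1) % num_units
        for unit in range(num_units):
            if unit in busy:
                if idle_run[unit]:
                    # An idle interval just ended for this unit.
                    intervals[unit].append(idle_run[unit])
                    idle_run[unit] = 0
            else:
                idle_run[unit] += 1
    # Record idle runs that are still open when the trace ends.
    for unit in range(num_units):
        if idle_run[unit]:
            intervals[unit].append(idle_run[unit])
    return intervals


per_unit = simulate_round_robin([1, 1, 0, 2, 0, 0, 1], num_units=2)
```

The per-unit interval lists are the raw data from which the numbers of active, idle, and sleep cycles and the number of transitions for each policy can be counted.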
We present results for three values of the activity factor
$\alpha$: a low value, $\alpha = 0.5$, and a high value. Since
values in the integer units are dominated by either zeros or
ones [4], we expect the final state after evaluation of the
domino gates in the functional units to also be biased to
either the high leakage state or the low leakage state
depending on the bias. A low activity factor ($\alpha < 0.5$)
corresponds to a bias of the input values that leaves the
majority of the domino gates in the high leakage state.
Conversely, a high activity factor ($\alpha > 0.5$) sets the
majority of gates to the low leakage state.
5 Results
The distribution of idle intervals across the benchmarks
is plotted in Figure 7. The x-axis is a logarithmic scale in
cycles of the length of the idle interval and the y-axis is
the fraction of the total time that the ALUs are idle. The
data for each of the functional units from the different
applications are combined as fractions to give the data equal
weight regardless of the instruction window size of the
application. To improve readability, idle intervals longer
than 8192 cycles have the total idle time accumulated at the
8192 cycle marker, hence the short but sharp step at the right
of the graph. The graph shows that across the suite of
benchmarks, any given integer ALU is idle 46.8% of the time
when the L2 access latency is 12 cycles. Furthermore, nearly
all of the idle intervals are shorter than 128 cycles and a
large fraction, 75%, occur within the L2 access latency time.
To highlight the influence that the L2 access latency has on
the distribution, also plotted is the idle interval
distribution using a 32 cycle L2 access latency. The increased
overall idle time reflects the additional time to access the
L2 cache. As demonstrated in Figure 7, extremely large idle
intervals are rare and relatively short intervals are common.

[Table 4. Parameter values for energy calculations. Four parameters take their distribution from the simulation data; the two remaining parameters are set to 0.001 and 0.01.]

[Figure 7. Distribution of idle intervals: fraction of total time the ALUs are idle versus idle interval (cycles, logarithmic axis from 1 to 16384), for 12 cycle and 32 cycle L2 access latencies.]
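A distribution of this kind can be computed from recorded idle interval lengths as sketched below. The power-of-two bucketing and the function name are assumptions made for illustration; only the 8192 cycle cap is taken from the text.

```python
from collections import defaultdict


def idle_time_fraction_by_bucket(idle_lengths, total_cycles, cap=8192):
    """Fraction of total time spent idle, grouped by interval length.

    Each interval is assigned to the largest power of two that does not
    exceed its length. Intervals longer than `cap` are accumulated in
    the `cap` bucket but still contribute their full idle time.
    """
    buckets = defaultdict(float)
    for length in idle_lengths:
        clipped = min(length, cap)
        bucket = 1 << (clipped.bit_length() - 1)
        buckets[bucket] += length / total_cycles
    return dict(sorted(buckets.items()))


# Hypothetical interval lengths for one unit over a 40000-cycle window.
example = idle_time_fraction_by_bucket([1, 3, 10, 20000], total_cycles=40000)
```

A running sum over the buckets gives a cumulative view, and averaging the per-unit fractions weights every application equally regardless of its instruction window size.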
The relative energies of the three policies presented in
Section 3 are compared in Figure 8. The energy is normalized
to the energy that would be expended if the circuit performed
a calculation every cycle, i.e., there are no idle cycles (the
base energy of Section 3). The results for a circuit with a
subthreshold leakage factor of $p = 0.05$ are shown in
Figure 8a. The applications are listed along the x-axis with
the number of functional units in parentheses. In each
grouping of bars, the first bar is the MaxSleep policy that
enables the Sleep signal at a functional unit as soon as there
is an idle cycle for that unit. Multiple functional units are
managed independently. The second bar is the GradualSleep
design that incrementally places a circuit into the sleep
mode. The third bar in the grouping is the AlwaysActive
policy, which never enables the Sleep signal. The fourth bar
plots the NoOverhead policy, which in this study represents an
unachievable lower bound for reducing static energy. For each
policy, the primary bar represents $\alpha = 0.5$. The small
bar at the top delineates the range between the low value of
$\alpha$ (the top) and the high value of $\alpha$ (the
bottom).
Let us discuss only the primary bars, where $\alpha = 0.5$.
From the bar chart, when $p = 0.05$, the MaxSleep policy
always uses more energy than the simpler (i.e., do nothing)
AlwaysActive policy, 8.3% more on average. The reason is that
at the lower value of $p$ the breakeven interval to recoup the
transition energy is significantly greater than the average
idle interval in this set of applications. The AlwaysActive
policy is within only 5.3% of the energy of the NoOverhead
method. Thus, at this technology point, there is no need to
enable the sleep mode. The GradualSleep design uses slightly
more energy than the AlwaysActive policy, but is within 2.0%.
These conclusions hold at the low (high) value of $\alpha$,
except that the differences increase (decrease). Recall that
at the low value of $\alpha$, fewer of the domino logic gates
end up in the low leakage state after evaluation, so
transitioning to the sleep mode dissipates more energy than
when $\alpha = 0.5$. The converse is true for the high value
of $\alpha$.
The results are considerably different when the technology
involves high leakage transistors. The same plot is shown for
a relatively high leakage factor, $p = 0.50$, in Figure 8b.
The greater subthreshold leakage current shortens the
breakeven interval (recall Figure 4a) such that the MaxSleep
policy is always more energy efficient than the AlwaysActive
policy, saving an average of 19.2% at $\alpha = 0.5$. This
savings represents 70.4% of the maximum potential bounded by
the NoOverhead policy. The remaining 29.6% difference
represents the overhead to transition to the sleep mode and
can be reduced by decreasing the number of these transitions,
possibly with a policy that schedules operations on the
functional units. Notice that the GradualSleep design performs
about as well as the MaxSleep policy and even slightly better
on three applications: parser, vortex, and vpr. Averaged
across the benchmark suite, GradualSleep is essentially
identical to the MaxSleep policy (the difference is
negligible). Here, again, the differences increase for the low
value of $\alpha$ and decrease for the high value of
$\alpha$.
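The relation between a policy's savings and the maximum potential savings can be written out as a small helper. The function name is hypothetical, and the example normalizes energies to the AlwaysActive policy, which is an assumption of this sketch. The NoOverhead value in the example is not a reported number: it is back-computed from the 19.2% and 70.4% figures quoted above.

```python
def fraction_of_potential(e_always_active, e_policy, e_no_overhead):
    """Share of the achievable savings that a policy actually captures.

    The achievable savings are bounded by the NoOverhead policy.
    """
    potential = e_always_active - e_no_overhead
    if potential <= 0:
        raise ValueError("NoOverhead must use less energy than AlwaysActive")
    return (e_always_active - e_policy) / potential


# Energies normalized to AlwaysActive (assumption). MaxSleep saves 19.2%.
e_aa = 1.0
e_max_sleep = 1.0 - 0.192
# Back-computed so that 19.2% corresponds to 70.4% of the potential.
e_no_overhead = 1.0 - 0.192 / 0.704
captured = fraction_of_potential(e_aa, e_max_sleep, e_no_overhead)
overhead_share = 1.0 - captured
```

Here `captured` evaluates to about 0.704 and `overhead_share` to about 0.296, matching the 70.4% and 29.6% split between realized savings and transition overhead.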
The energy of each of the three policies relative to the
energy of the NoOverhead policy across the range of values
of $p$ is plotted in Figure 9a. We do not show the results for
the individual benchmarks, only the average. For each data
point, we calculate the average of the relative energies for
the benchmark suite. This plot shows the relative behavior of
each policy across the technology space. The technology points
of $p = 0.05$ and $p = 0.50$ used to generate the results
illustrated in Figures 8a and 8b are marked on the graph. As
described before, when the leakage energy of the circuit is
small the AlwaysActive policy outperforms the MaxSleep policy,
but the reverse is true when the leakage energy becomes large.
The GradualSleep design, however, behaves well across the
complete technology range, and performs better near the
breakeven point for the distribution of idle intervals of the
benchmarks. Thus, the ability to blend both policies has
little negative impact and can actually improve the overall
energy efficiency when the distribution of idle times centers
around the breakeven point. The fact that the GradualSleep
design avoids the extreme behaviors of the other two policies
means that the GradualSleep policy will still perform well as
a design is scaled in the same circuit technology or
implemented in a different technology having a different value
of $p$.

[Figure 8. Comparing MaxSleep, GradualSleep, AlwaysActive, and NoOverhead policies: normalized energy (to 100% activity) for gcc (2), gzip (4), health (2), mcf (2), mst (4), parser (4), twolf (3), vortex (4), vpr (3), and the average. (a) p = 0.05. (b) p = 0.50.]
The problem of leakage energy is often reported as the
fraction of the total energy due to leakage. This view of the
data is plotted in Figure 9b. At $p = 0.05$, the leakage energy
is 13% of the total energy for the AlwaysActive policy, but
increases to 60% at $p = 0.50$.
The results shown in Figure 9b are best appreciated in
the context of the processor as a whole. Borkar [3] indicates
that at 70 nm dimensions and beyond ($p \approx 0.05$),
leakage will comprise 30% or more of the total power. Our
results showing only 13% at $p = 0.05$ do not conflict with
this conclusion for the following reasons. The primary factor
producing the lower than projected fraction of leakage energy
is our methodology of eliminating unnecessary functional units
that would contribute significantly to leakage but not to
dynamic energy. For example, in our simulations mcf utilizes
only 31% of the two functional units and the fraction of
leakage energy is 15%. The fraction increases to 25% for a
microarchitecture with four functional units. Second, we do
not include the non-integer functional units in our analysis
because they are mostly idle in this benchmark suite (and,
thus, trivially controlled). In the integer benchmarks, these
non-integer functional units add disproportionately to the
leakage portion of the total energy. This effect would further
increase the overall fraction of leakage energy relative to
the total energy.

[Figure 9. Averaged simulation results. (a) Average energy relative to NoOverhead: energy normalized to the NoOverhead policy versus technology factor p for GradualSleep, MaxSleep, and AlwaysActive, with p = 0.05 and p = 0.50 marked. (b) Ratio of leakage energy to total energy versus technology factor p for GradualSleep, MaxSleep, AlwaysActive, and NoOverhead, with the same two points marked.]
Depicted in Figure 9b is the plot for the No Overhead
policy. This policy represents a lower bound on the fraction
of static energy since all the idle cycles are at the lowest
leakage state and there is no additional energy cost to tran-
sition to that state. Thus, for this policy, the static energy is
almost entirely due to leakage during computation cycles.
The active mode leakage energy is a significant fraction of
the overall leakage energy, and becomes the dominant
fraction as $p$ becomes larger. Circuit techniques are required to
reduce this portion of the leakage energy.
6 Related work
Dual-$V_t$ domino logic circuits with a sleep mode have
been proposed in [1, 10, 13, 16]. While all of these cir-
cuits limit leakage energy by forcing the dynamic nodes
into the low leakage state, the overhead of this sleep mech-
anism varies. We selected the circuit from [16] because the
technique has no delay penalty and a low energy overhead.
However, the energy model parameters can be adjusted to
reflect many other circuit techniques.
Heo and Asanovic [10] introduce the technique of controlling
the sleep mode of dual-$V_t$ circuits for fine-grained
reduction of leakage energy. The focus is on the circuit it-
self and ends with an analysis of the breakeven interval for
an adder. We extend this work by introducing an analytical
energy model for a logic functional unit and perform a de-
tailed study on how to implement fine-grained control of the
sleep mode in heavily used functional units of a micropro-
cessor. Our results reveal the interdependencies among the
circuit technology, the application, and the control strategy.
Butts and Sohi [6] introduce a static energy model for es-
timating static power consumption early in the design pro-
cess at the architectural level. This static energy model
can be parameterized to provide steady-state estimates of
various types of circuits, e.g., RAM cells, CAM cells, and
logic gates. To relate this work to our own, the Butts and
Sohi model is appropriate for estimating an energy parameter of
our model and the leakage factor $p$. In contrast, our model is special-
ized for logic but estimates total energy of the functional
units, both dynamic and static, based on the behavior of the
application. The ability to consider the dynamic behavior
of a circuit is essential in analyzing the tradeoffs between
schemes that manage the sleep mode of the circuit.
Rele et al. [18] use the compiler to identify when func-
tional units will be idle for long periods of time and can
be power gated, thus reducing the static power. The basis
of our study presumes a technique such as [18] has already
been applied. By limiting the number of functional units,
our study explores how to manage resources that are criti-
cal to performance and, consequently, have short idle times.
Both Brooks and Martonosi [4] and Ghose et al. [8]
demonstrate that many operands do not require the full
width of the datapath. To save dynamic energy, datapath
hardware detects these bytes and gates the logic from
performing unnecessary work. In the context of this paper,
this phenomenon might be exploited in the GradualSleep policy
by placing the high order bytes into the sleep mode first and,
upon re-activation, activating only those bytes that are also
enabled by the datapath hardware.
Pyreddy and Tyson [17] use dual speed pipelines to save
dynamic energy by scheduling non-critical instructions on
the slow pipeline. A slow pipeline could have a higher
threshold voltage and lower leakage current. Off-loading
the non-critical instructions from the fast pipeline will in-
crease the average idle duration in the fast pipeline. This
strategy may offer additional opportunities to enable the
sleep mode of the fast pipeline.
At the architectural level, the study of leakage reduction
has centered on the storage structures in the microproces-
sor. Yang et al. [20] gate the power supply voltage to the
L1 instruction cache RAM cells to turn off power to the
storage cells and essentially eliminate the leakage energy.
The state of the cell is lost. Kaxiras et al. [14] present a
control scheme that dynamically adjusts when to place the
cache lines into the sleep mode to minimize leakage energy.
Flautner et al. [7] propose a drowsy cache design for the L1
data cache that maintains the cell state in the sleep mode
but at the cost of higher leakage energy than if power to
the cell were completely turned off. Their study concluded
that a simple control scheme sufficed to achieve most of
the energy savings. Hanson et al. [9] compare these two
techniques and a third method in an extensive study that
includes the L1 instruction cache, the L1 data cache, and
the L2 unified cache. Heo et al. [11] take a novel approach
to reduce the static energy associated with the bitlines in a
RAM by simply tristating the drivers to the lines. The float-
ing bitlines settle naturally at the voltage level that mini-
mizes the leakage energy.
7 Conclusion
In this study, we evaluate the circuit technology of
dual-$V_t$ domino logic along with the sleep mode as a
promising technique for reducing subthreshold leakage energy at a
fine-grained time scale, from one to a few hundred cycles.
Taking the energy cost of entering the low-leakage sleep
state into account, we introduce an analytical energy model
to characterize the energy behavior of functional logic units
at the architectural level. We use this model to character-
ize the interaction of the application with the technology
as well as evaluate the effects on performance and energy
of a set of integer benchmarks as technology parameters
are varied. We show that the simple GradualSleep design
works well across a range of technology and application
parameters by amortizing the energy cost of entering the
sleep mode across several cycles. Our results indicate that
a more complex control strategy to determine when to enter
the sleep state may not be warranted and that the leakage en-
ergy lost during the active cycles of the functional units may
eventually become the dominant component of the overall
leakage energy.
References
[1] M. W. Allam, M. H. Anis, and M. I. Elmasry. High-Speed
Dynamic Logic Styles for Scaled-Down CMOS and MTC-
MOS Technologies. In IEEE International Symposium on
Low Power Electronics and Design, July 2000.
[2] Semiconductor Industry Association. The International Technology Roadmap
for Semiconductors. In http://www.semichips.org, 2001.
[3] S. Borkar. Design Challenges of Technology Scaling. In
IEEE Micro, July 1999.
[4] D. Brooks and M. Martonosi. Value-Based Clock Gating
and Operation Packing: Dynamic Strategies for Improving
Processor Power and Performance. In ACM Transactions on
Computer Systems, May 2000.
[5] D. Burger and T. Austin. The Simplescalar toolset, version
2.0. Technical Report TR-97-1342, University of Wisconsin-
Madison, June 1997.
[6] J. A. Butts and G. S. Sohi. A Static Power Model for Archi-
tects. In 33rd Annual International Symposium on Microar-
chitecture, December 2000.
[7] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge.
Drowsy Caches: Simple Techniques for Reducing Leakage
Power. In 29th Annual International Symposium on Com-
puter Architecture, May 2002.
[8] K. Ghose, D. Ponomarev, G. Kucuk, A. Flinders, P. M.
Kogge, and N. Toomarian. Exploiting Bit-Slice Inactivities
for Reducing Energy Requirements of Superscalar Proces-
sors. In Kool Chips Workshop, MICRO-33, 2000.
[9] H. Hanson, M. S. Hrishikesh, V. Agarwal, S. W. Keckler,
and D. Burger. Static Energy Reduction Techniques for Mi-
croprocessor Caches. In 2001 International Conference on
Computer Design, September 2001.
[10] S. Heo and K. Asanovic. Leakage-Biased Domino Circuits
for Dynamic Fine-Grain Leakage Reduction. In Symposium
on VLSI Circuits, June 2002.
[11] S. Heo, K. Barr, M. Hampton, and K. Asanovic. Dynamic
Fine-Grain Leakage Reduction Using Leakage-Biased Bit-
lines. In 29th Annual International Symposium for Computer
Architecture, May 2002.
[12] S. Jung, S. Yoo, K. Kim, and S. Kang. Skew-Tolerant High-
Speed (STHS) Domino Logic. In IEEE International Sym-
posium on Circuits and Systems, May 2001.
[13] J. Kao and A. Chandrakasan. Dual-Threshold Voltage Tech-
niques for Low-Power Digital Circuits. In IEEE Journal of
Solid-State Circuits, volume 35, July 2000.
[14] S. Kaxiras, Z. Hu, and M. Martonosi. Cache Decay: Exploit-
ing Generational Behavior to Reduce Cache Leakage Power.
In 28th Annual International Symposium on Computer Ar-
chitecture, June 2001.
[15] R. E. Kessler. The Alpha 21264 Microprocessor. In IEEE
Micro, April 1999.
[16] V. Kursun and E. G. Friedman. Low Swing Dual Threshold
Voltage Domino Logic. In 12th Great Lakes Symposium on
VLSI, April 2002.
[17] R. Pyreddy and G. Tyson. Evaluating Design Tradeoffs in
Dual Speed Pipelines. In Workshop on Complexity-Effective
Design in conjunction with ISCA 2001, June 2001.
[18] S. Rele, S. Pande, S. Onder, and R. Gupta. Optimizing Static
Power Dissipation by Functional Units in Superscalar Pro-
cessors. In International Conference on Compiler Construc-
tion, April 2002.
[19] G. Semeraro, G. Magklis, R. Balasubramonian, D. H. Al-
bonesi, S. Dwarkadas, and M. L. Scott. Energy-Efficient
Processor Design Using Multiple Clock Domains with Dy-
namic Voltage and Frequency Scaling. In Eighth Interna-
tional Symposium on High Performance Computer Architec-
ture, February 2002.
[20] S.-H. Yang, M. D. Powell, B. Falsafi, K. Roy, and T. N. Vi-
jaykumar. An Integrated Circuit/Architecture Approach to
Reducing Leakage in Deep-Submicron High Performance
I-Caches. In Seventh International Symposium on High-
Performance Computer Architecture, January 2001.
