ET^2: A Metric For Time and Energy Efficiency of Computation by Martin, Alain J. et al.
Chapter 1
 
$Et^2$: A METRIC FOR TIME AND ENERGY EFFICIENCY
OF COMPUTATION
Alain J. Martin
Department of Computer Science
California Institute of Technology
alain@cs.caltech.edu
Mika Nyström
Department of Computer Science
California Institute of Technology
mika@cs.caltech.edu
Paul I. Pénzes
Department of Computer Science
California Institute of Technology
penzes@cs.caltech.edu
Abstract
We investigate an efficiency metric for VLSI computation that includes energy,
$E$, and time, $t$, in the form $Et^2$. We apply the metric to CMOS circuits operating
outside velocity saturation when energy and delay can be exchanged by adjusting
the supply voltage; we prove that under these assumptions, optimal $Et^2$ implies
optimal energy and delay. We give experimental and simulation evidence of the
range and limits of the assumptions. We derive several results about sequential,
parallel, and pipelined computations optimized for $Et^2$, including a result about
the optimal length of a pipeline.
We discuss transistor sizing for optimal $Et^2$ and show that, for fixed, nonzero
execution rates, the optimum is achieved when the sum of the transistor-gate
capacitances is twice the sum of the parasitic capacitances, not for minimum
transistor sizes. We derive an approximation for $Et^n$ (for arbitrary $n$) of an
optimally sized system that can be computed without actually sizing the
transistors; we show that this approximation is accurate. We prove that when multiple,
adjustable supply voltages are allowed, the optimal $Et^2$ for the sequential
composition of components is achieved when the supply voltages are adjusted so that
the components consume equal power. Finally, we give rules for computing the
$Et^2$ of the sequential and parallel compositions of systems, when the $Et^2$ of the
components are known.
1. Introduction
With energy becoming as important as time as a criterion for computational
efficiency, analytical tools are needed to evaluate computations according to
both criteria simultaneously. How are energy and time traded against each
other in the design process? How are algorithms compared when they have
different energy and time figures? A useful metric must separate the algorithmic
tradeoffs of energy against time from the physical (usually electrical) tradeoffs.
We propose an efficiency metric for VLSI computation that combines energy,
$E$, and time, $t$, in the form $Et^2$. The choice of this metric is based on
CMOS VLSI technology: in CMOS, $Et^2$ is independent of the voltage in first
approximation. Instead of attempting to optimize a circuit for both $E$ and $t$,
the designer can now optimize the design for the single metric and adjust the
voltage to obtain the chosen tradeoff between $E$ and $t$.
We prove that the $Et^2$ metric is optimal for CMOS circuits under assumptions
that hold approximately in the normal range of operation. Under those
assumptions, energy and delay can be freely exchanged through supply-voltage
adjustment. Although the metric is not adequate over the entire range of
operation for CMOS transistors, we have experimental evidence that a large class of
circuits exhibit a collective behavior that is more regular than that of individual
transistors. We also investigate when and why the $Et^2$ metric is inadequate; in
some cases, we can use the metric $Et^n$ with $n \neq 2$. We shall see that most
results we prove for $Et^2$ generalize to $Et^n$ with $n \neq 2$.
 
The objection that a metric grounded in a specific technology (CMOS) cannot
be general enough for the study of algorithms can be answered by observing that
the CMOS model of computation is certainly as general as the “random-access
machine” model, which has been used successfully in the traditional analysis
of algorithms.
The $Et^2$ metric was originally introduced for the design of an asynchronous
MIPS R3000 microprocessor [4]. The arguments about the validity of the
metric and the analysis of the pipeline were published by Martin [3]. The
results about transistor sizing for minimal $Et^2$ have been described previously
by the authors [7, 11].
2. Energy and Delay in VLSI Computations
Our study of energy and delay in computations is based on CMOS imple-
mentations. We consider a digital VLSI computation to be a partially ordered
sequence of transitions. Each transition changes the value of a boolean variable
of the computation. We consider irreversible computations only, i.e., computa-
tions in which assigning a value to a variable destroys the previous value of the
variable.
In digital CMOS, a boolean variable is implemented as an electrical node
whose voltage represents the present value of the variable. Each transition
charges or discharges the capacitor attached to the node, bringing its voltage
either to the supply voltage Vdd or to the ground voltage GND. We are interested
in the computational, or dynamic, energy spent in charging and discharging the
capacitances of all nodes involved in a computation. We ignore leakage energy
and short-circuit energy. Since we ignore leakage, we also assume that no
energy is spent maintaining the values of variables stored in registers.
Any algorithm can be implemented as a set of production rules [5]. A
production rule is of the form $B \to t$, where $B$ is a boolean expression, and $t$
is a single assignment of the value true or false to a boolean variable. Such an
assignment is called a transition; in this example, the transition $t$ is performed
after $B$ becomes true. A logic gate or operator is the physical implementation
of the pair of production rules that set and reset a given variable, of the following
form:
$$B_u \to z\uparrow$$
$$B_d \to z\downarrow \tag{1.1}$$
Let $E_{z\uparrow}$ and $E_{z\downarrow}$ be the energy spent firing the first and second production
rules, respectively. (A firing is an execution of a production rule that changes
the value of a variable.) The energy spent firing production rule $B_u \to z\uparrow$ is
the energy dissipated charging the capacitor $C_z$ associated with the node of $z$.
The total energy required for charging the capacitor up to voltage $V$ is $C_z V^2$;
half of this energy is stored in the capacitor, and half is dissipated as heat in the
pull-up network connecting the capacitor to the power supply. We have that
$$E_{z\uparrow} = \tfrac{1}{2} C_z V^2 \tag{1.2}$$
where $V$ is the power-supply voltage. When the capacitor is discharged to
ground, the energy stored in the capacitor is dissipated in the pull-down network.
Hence, $E_{z\downarrow} = E_{z\uparrow}$.
We shall not elaborate on the calculation of the capacitance $C_z$ beyond noting
that $C_z$ depends mostly on the “load” of $z$, i.e., the topology of the logical gates
of which $z$ is an input, and hardly depends on the structure of the logical gate
of which $z$ is an output. In other words, the energy consumed in computing the
value of $z$ does not depend on what is computed but rather on where the result
of the computation is needed.
The delay $t_{z\uparrow}$ for firing $z\uparrow$ is the ratio of the final electrical charge $Q_z$
on $C_z$ to the current $i_{z\uparrow}$ available for charging $C_z$:
$$t_{z\uparrow} = \frac{Q_z}{i_{z\uparrow}} \tag{1.3}$$
with $Q_z = C_z V$. The current $i_{z\uparrow}$ is the current flowing in the transistor network
connecting the constant power-supply to $z$ when and only when $B_u$ holds;
similarly for the delay $t_{z\downarrow}$ and current $i_{z\downarrow}$.
In general, the transistor current is difficult to analyze. Let us look first at
one single nMOS-transistor as pull-down network. (The analysis for a pMOS
transistor as pull-up network is similar.) We assume that the transistor is above
threshold ($V_{gs} > V_t$), and not in velocity saturation. Then, the current is either
the saturation current, $i_{sat}$, when $V_{gs} - V_t \leq V_{ds}$; or it is the linear current, $i_{lin}$,
when $V_{gs} - V_t > V_{ds}$, where $V_{gs}$ and $V_{ds}$ are the gate-to-source and
drain-to-source voltages of the transistor, respectively, and $V_t$ is the threshold voltage.
The formulas for $i_{lin}$ and $i_{sat}$ are well-known:
$$i_{lin} = k \left[ (V_{gs} - V_t)\, V_{ds} - \tfrac{1}{2} V_{ds}^2 \right] \tag{1.4}$$
$$i_{sat} = \tfrac{1}{2}\, k\, (V_{gs} - V_t)^2 \tag{1.5}$$
where $k$ is a constant that depends on the process and on the transistor geometry.
If we assume that the voltages $V_{gs}$, $V_{ds}$, $V_t$ vary proportionally to the supply
voltage $V$, then both $i_{sat}$ and $i_{lin}$ depend quadratically on $V$ (see footnote 1), and
therefore the current $i_{z\downarrow}$ discharging $C_z$ is of the form $i_{z\downarrow} = k_{z\downarrow} V^2$. Similarly, for a
pull-up network, we have $i_{z\uparrow} = k_{z\uparrow} V^2$. Hence, we have for the delay $t_{z\uparrow}$ that
$$t_{z\uparrow} = \frac{C_z}{k_{z\uparrow} V} \tag{1.6}$$
Footnote 1: Of course, $V_{gs}$ and $V_{ds}$ must vary by quite a different mechanism from the one
governing $V_t$: $V_{gs}$ and $V_{ds}$ can vary “automatically” as a result of changing Vdd, whereas
$V_t$ must be set at the time of fabrication. The main reason that the proportional variation
breaks down is that it is in practice impossible to scale $V_t$ with Vdd because that would
lead to unacceptably large leakage currents at the low end of the scale.
Combining the expressions for delay and for energy, we see that the expression
$E_{z\uparrow} t_{z\uparrow}^2$ is independent of $V$.
Under certain restrictions, it is possible to extend the result that the current
is quadratic in $V$ to cover the arbitrary composition of pullup and pulldown
networks. Papadantonakis has proved this result for a class of circuits called
“smooth circuits” [8]. A smooth circuit is a network of transistors in which
each node has a capacitance to ground, the power supplies are modeled as large
capacitors, and again the threshold voltage is assumed to scale with the supply
voltage.
If we assume that a CMOS circuit is a reasonable approximation of a smooth
circuit, we can assume that the quadratic relation between currents and supply
voltage $V$ holds, and therefore that the delays are inversely proportional to $V$.
For those circuits, $Et^2$ is independent of $V$, where $E$ is the dynamic energy
dissipated by a computation, and $t$ represents either the latency or the cycle
time of the computation. We shall return to the limitations of this assumption.
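The voltage independence claimed above can be checked numerically. The following is a minimal sketch, assuming the first-order model of this section, i.e. energy $\tfrac{1}{2} C V^2$ per transition (Equation 1.2) and delay $C/(kV)$ (Equation 1.6); the capacitance `C`, the current factor `K`, and the function names are illustrative assumptions, not values from the text.

```python
# First-order CMOS model of this section; all numeric values are assumed.

def energy(c, v):
    """Dynamic energy of one transition: E = (1/2) * C * V^2 (Equation 1.2)."""
    return 0.5 * c * v ** 2

def delay(c, k, v):
    """Delay of one transition: t = C / (k * V) (Equation 1.6)."""
    return c / (k * v)

def et2(c, k, v):
    """The product E * t^2, which this model predicts does not depend on V."""
    return energy(c, v) * delay(c, k, v) ** 2

C = 10e-15  # node capacitance in farads (assumed value)
K = 1e-4    # current factor in A/V^2 (assumed value)

for v in (1.5, 2.0, 3.3):
    print(f"V = {v:.2f} V  E = {energy(C, v):.3e} J  "
          f"t = {delay(C, K, v):.3e} s  Et^2 = {et2(C, K, v):.3e} J s^2")
```

In this model the product reduces to $C^3 / (2k^2)$, so the last column is the same for every supply voltage while $E$ and $t$ individually change.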
3. Comparing Algorithms for Energy and Delay
Given two algorithms $A$ and $B$, with energy and delay $(E_A, t_A)$ and $(E_B, t_B)$,
how do we compare them for energy and delay? In evaluating the time efficiency
of a computation, we may be interested in either one of two delay parameters:
the latency and the cycle time. For an algorithm computing the function $F$, the
latency is the delay between the input of parameter $x$ and the output of $F(x)$,
averaged over all values of $x$. The cycle time is the delay between the input of
parameter $x$ and the input of the next parameter $x'$, again averaged over all
values of $x$.
3.1 Why $Et$ is not the Right Metric
The energy-delay product $Et$ is often used for comparing designs, but it is
not usually an acceptable metric, as we shall presently demonstrate.
Let us assume that we have two circuits, $A$ and $B$, that compute the same
thing in two different ways. Assume $E_A = 2E_B$ and $t_A = t_B/2$. Then, according
to the $Et$ metric, $A$ and $B$ are equally good. But let us reduce the supply voltage
of $A$ by half. Let $(E'_A, t'_A)$ be the new values of energy and delay for $A$. Given
the dependence of energy and delay on voltage, Equations 1.2 and 1.6, we have
that
$$E'_A = \tfrac{1}{4} E_A \tag{1.7}$$
$$t'_A = 2 t_A \tag{1.8}$$
which gives
$$E'_A = \tfrac{1}{2} E_B \tag{1.9}$$
$$t'_A = t_B \tag{1.10}$$
Hence, $A$ now has the same delay as $B$ but at only half the energy, and
therefore $A$ is a better implementation than $B$, contrary to what the $Et$ metric
indicates.
These results are borne out in practice [10]. In Table 1.1, we see the results
of simulating two different implementations of an eight-bit comparator with
the simulator HSPICE. In each case, eight single-bit comparators perform the
comparison: in the “linear” comparator, the results of the single-bit compara-
tors are merged in a linear chain; in the “log” comparator, in a binary tree.
Comparing the performance of the comparators at 3.3-V Vdd, we see that the
linear comparator is slower than the log comparator, but using the $Et$ metric,
we find that it more than makes up for its sluggishness with its lower energy
consumption. On the other hand, using the $Et^2$ metric, we find that the log
comparator is better. Which is it?
If we adjust the supply voltage on the log comparator down to 2.15 V, we see
that we can match the delay of the linear comparator while using less energy;
thus, the log comparator outperforms the linear one in both speed and energy
if we are allowed to adjust the supply voltage. Even over this relatively wide
range of supply voltages, $Et^2$ changes only by an insignificant 3.2 percent.
This example illustrates that the $Et^2$ metric is more trustworthy for circuit
comparisons when we are allowed to adjust the supply voltage.
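The comparison can be retraced from the numbers in Table 1.1. The sketch below recomputes $Et$ and $Et^2$ from the tabulated $E$ and $t$; because those inputs are already rounded, the recomputed products can differ from the tabulated ones in the last digits. The variable and function names are ours.

```python
# (name, E, t) as listed in Table 1.1, in the units of the table.
rows = [
    ("Linear (3.3 V)", 25.24, 3.93),
    ("Log (3.3 V)", 44.97, 2.35),
    ("Log (2.15 V)", 16.52, 3.93),
]

def metrics(e, t):
    """Return the pair (E*t, E*t^2) for one table row."""
    return e * t, e * t * t

for name, e, t in rows:
    et, et2 = metrics(e, t)
    print(f"{name:15s} Et = {et:7.2f}  Et^2 = {et2:7.2f}")

# Relative change of the tabulated Et^2 of the log comparator
# between 3.3 V (247.57) and 2.15 V (255.59).
change = (255.59 - 247.57) / 247.57
print(f"Et^2 change for the log comparator: {100 * change:.1f}%")
```

The linear comparator has the smaller $Et$ and the log comparator the smaller $Et^2$; only the second ranking agrees with what happens once the log comparator is slowed down to the same delay.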
3.2 The $\Theta$ Metric
Let us now ignore the lower and upper bounds imposed on the voltage by
the technology and assume that we can always trade $E$ and $t$ against each other
through voltage adjustment. Suppose that under these conditions there exists a
function $\Theta(E, t)$ with the properties:
Property 1. $\Theta$ is monotonically increasing in $E$ and $t$,
Property 2. $\Theta$ is independent of $V$.
 
8-bit comparator    E        t        Et        Et^2
Linear (3.3 V)      25.24    3.93     99.21    389.97
Log (3.3 V)         44.97    2.35    105.50    247.57
Log (2.15 V)        16.52    3.93     64.97    255.59
Table 1.1. Comparison of $E$, $t$, $Et$, and $Et^2$ of two kinds of 8-bit comparators. Simulations
with HSPICE using parameters for HP’s 0.6-µm CMOS process (via MOSIS).
Theorem 1 Given two computations $A$ and $B$ with corresponding $\Theta_A$ and $\Theta_B$:
If $\Theta_A < \Theta_B$ then $A$ is more delay-efficient than $B$ when $A$ and $B$ use
equal energy, and $A$ is more energy-efficient than $B$ when $A$ and $B$ have
the same delay.
If $\Theta_A = \Theta_B$ then $A$ is equivalent to $B$ with respect to energy when their
delays are the same.
Proof: Through supply-voltage adjustment, we can equalize either the energy
or the delay of the two computations. Let us arbitrarily choose to equalize the
delays: $t_A = t_B$. Because $\Theta$ is independent of $V$ (Property 2), the $\Theta$s have not changed.
We can now compare the two computations, thanks to the monotonicity of $\Theta$ (Property 1):
$\Theta_A < \Theta_B \Rightarrow E_A < E_B$, i.e., $A$ is better than $B$,
$\Theta_A = \Theta_B \Rightarrow E_A = E_B$, i.e., $A$ and $B$ are equally good.
Hence, for any chosen delay $t$, $A$ is more energy-efficient than $B$. Likewise,
for any chosen energy $E$, $A$ is more time-efficient than $B$.
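The argument of the proof can be sketched in code, assuming the concrete choice $\Theta = Et^2$ and the example of Section 3.1 in normalized units ($E_A = 2E_B$, $t_A = t_B/2$). The helper names are ours, and the voltage adjustment is modeled only through the invariance of $Et^2$.

```python
import math

def theta(e, t):
    """Theta = E * t^2, the concrete metric assumed in this sketch."""
    return e * t ** 2

def energy_at_delay(e, t, target_t):
    """Energy after voltage scaling to delay target_t, keeping E * t^2 constant."""
    return theta(e, t) / target_t ** 2

def delay_at_energy(e, t, target_e):
    """Delay after voltage scaling to energy target_e, keeping E * t^2 constant."""
    return math.sqrt(theta(e, t) / target_e)

# Example of Section 3.1 in normalized units: E_A = 2 E_B and t_A = t_B / 2.
E_B, T_B = 1.0, 1.0
E_A, T_A = 2.0 * E_B, T_B / 2.0

print("Theta_A / Theta_B      :", theta(E_A, T_A) / theta(E_B, T_B))
print("E_A / E_B at equal delay:", energy_at_delay(E_A, T_A, T_B) / E_B)
print("t_A / t_B at equal energy:", delay_at_energy(E_A, T_A, E_B) / T_B)
```

The circuit with the smaller $\Theta$ wins both comparisons: it needs less energy at the same delay and less delay at the same energy.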
3.3 The $Et^2$ Metric
Any expression in $E$ and $t$ that is monotonically increasing in $E$ and $t$ and
that is independent of $V$ can be used as complexity metric $\Theta$. We have shown
in Section 2 that, in CMOS technology, the following definition for $\Theta$ is valid:
$$\Theta = E t^2 \tag{1.11}$$
Henceforth, $\Theta$ will always mean $Et^2$. If we now return to the example at
the beginning of Section 3.1, we compare the two computations $A$ and $B$ by
comparing their $\Theta$s:
$$\Theta_A = E_A t_A^2 \tag{1.12}$$
$$\Theta_A = 2 E_B \left( \frac{t_B}{2} \right)^2 \tag{1.13}$$
$$\Theta_A = \frac{\Theta_B}{2} \tag{1.14}$$
Hence, we can conclude that $A$ is twice as $\Theta$-efficient as $B$. For equal delays,
$E_A = E_B / 2$. For equal energies, $t_A = t_B / \sqrt{2}$.
3.4 $Et^2$ Measurements
How constant is $Et^2$ in reality? There are several operating modes for the
CMOS transistor, each with a very different relation between current and voltage.
In particular, at high electric field, the carrier velocity saturates and becomes
constant; the delay becomes independent of the voltage, and $Et^2$ becomes
quadratic in the voltage. Figure 1.1 shows the measured $Et^2$ for the
two-million-transistor asynchronous MIPS R3000 microprocessor designed at
Caltech between 1996 and 1998. It was fabricated in 0.6-µm CMOS and was
entirely functional on first silicon [4]. (Measurements on other fabricated chips
give similar results.)
The behavior below 1.3 V shows the effect of approaching the threshold
voltage; in our calculations we have assumed that the threshold voltage would
scale with Vdd, but we obviously cannot enforce this for HP’s 0.6-µm process
whose threshold voltage is fixed at 0.8 V. The positive slope from 3 V and up
shows the onset of velocity saturation. The nominal voltage of this process is
3.3 V; the graph shows that $Et^2$ varies only about 20% around its average in
the range 1.5–4.9 volts Vdd.
4. The $\Theta$-Efficiency of Designs
In this section, we use the $\Theta$ metric to determine when two standard design
transformations, parallel composition and pipelining, improve the efficiency
of a design compared with sequential execution.
4.1 The $\Theta$-Efficiency of Parallelism
Given a collection of independent tasks, when does the parallel execution
of the tasks improve (i.e., reduce) the $\Theta$ of the computation? For simplicity,
consider $m$ identical tasks each consuming energy $E$ and using delay $t$
to complete.
If the $m$ tasks are executed sequentially, the total energy is $mE$ and the total
delay (execution time) is $mt$, giving a $\Theta_{seq}$ of $m^3 E t^2$.
 
[Figure: measured $Et^2$ in units of $10^{-24}$ kg m$^2$ (vertical axis, 0 to 10) against supply voltage in V (horizontal axis, 0 to 6), for the MiniMIPS processor in 0.6-micron CMOS.]
Figure 1.1. Measured $Et^2$ for a two-million-transistor asynchronous microprocessor.
Now, let us consider what happens when the $m$ tasks run in parallel. First,
we ignore the cost of the split circuitry that may be needed to distribute the
control (and possibly data) to all tasks, and the cost of the merge circuitry that
may be needed to gather the completion signal (and possibly some results) from
all tasks. In that simple case, the total energy is still $mE$, but the total delay is
now reduced to $t$, giving a $\Theta_{par}$ of $mEt^2$. Compared with the sequential value
$\Theta_{seq} = m^3 E t^2$, the improvement is $\Theta_{seq} / \Theta_{par} = m^2$: parallelism reduces the
$\Theta$ of the computation by a factor $m^2$.
Assuming we can vary the voltage of the new design so as to make the delay
equal to the delay of the sequential design, then under the invariance of $Et^2$,
the parallel transformation decreases the energy consumption by a factor $m^2$.
(For large $m$, it may in practice be impossible to scale down the voltage by a
factor $m$, and therefore it may be impossible to exploit all the potential energy
improvement of parallelism.)
Theorem 2 The parallel composition of $m$ identical tasks without overhead
gives an $Et^2$ reduction of $m^2$ compared with sequential execution, and a potential
energy reduction of $m^2$ if the voltage can be reduced by a factor $m$.
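Theorem 2 can be illustrated with a small calculation. This is a sketch under the stated idealization (identical tasks, no split or merge cost, $Et^2$ exactly invariant under voltage scaling); the function names and the values of `m`, `E`, `T` are illustrative assumptions.

```python
def theta(e, t):
    """Theta = E * t^2."""
    return e * t ** 2

def sequential(m, e, t):
    """m identical tasks run one after another: energy m*E, delay m*t."""
    return m * e, m * t

def parallel(m, e, t):
    """m identical tasks run side by side, no overhead: energy m*E, delay t."""
    return m * e, t

def energy_after_slowdown(e, t, target_t):
    """Energy once the supply is lowered until the delay is target_t
    (E * t^2 is assumed to stay constant)."""
    return theta(e, t) / target_t ** 2

m, E, T = 4, 1.0, 1.0  # normalized, assumed values
e_seq, t_seq = sequential(m, E, T)
e_par, t_par = parallel(m, E, T)

print("Theta_seq / Theta_par:", theta(e_seq, t_seq) / theta(e_par, t_par))
print("energy saving at equal delay:",
      e_seq / energy_after_slowdown(e_par, t_par, t_seq))
```

With `m = 4` both printed ratios equal $m^2 = 16$: the $\Theta$ improvement can be spent entirely on energy by slowing the parallel design down to the sequential delay.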
The situation is more complicated if the tasks are different. Let us assume
that each task $T_i$ now uses energy $E_i$ and delay $t_i$, such that we still have
$\sum_i E_i = mE$ and $\sum_i t_i = mt$. The parallel composition still uses energy $mE$, but
the delay is now $\max_i t_i$; let us call the task with maximal delay $T_{max}$ and
its delay $t_{max} = \max_i t_i$. Remember that we are assuming that the voltage
of each task can be adjusted freely. Then it is clear that, for each task except
$T_{max}$, the early termination of the task amounts to a waste of energy since
every task but $T_{max}$ can be slowed down to $t_{max}$ without affecting the delay of
the whole computation, but with an energy improvement corresponding to the
voltage decrease.
According to the above analysis, parallelism always improves $Et^2$ if we
ignore any overhead it introduces. Let us now examine the case when the cost
of the split and merge circuitries cannot be ignored. We are looking at the
simple case where we split the original task into two parallel tasks with the help
of just one binary split and one binary merge. We assume that both the split and
merge have an energy and delay that are a fraction $k$ of the energy and delay of
the original task:
$$E_{split} = E_{merge} = kE \tag{1.15}$$
$$t_{split} = t_{merge} = kt \tag{1.16}$$
With the added overhead, the energy $E_p$ and delay $t_p$ of the parallel execution
become:
$$E_p = E + 2kE = (1 + 2k) E \tag{1.17}$$
$$t_p = \frac{t}{2} + 2kt \tag{1.18}$$
which means that
$$\Theta_p = (1 + 2k) \left( \tfrac{1}{2} + 2k \right)^2 E t^2 \tag{1.19}$$
The ratio $\Theta_p / \Theta_s$, where $\Theta_s = Et^2$ is the $\Theta$ of the original task, is less than 1
only for $k < 0.18$, approximately. In other words, binary
parallel composition using a split and a merge with the above characteristics
decreases $\Theta$ only when the task to be parallelized is at least about 5.6 times as expensive
as a split or merge.
As a concrete example, the authors have investigated the possibility of im-
proving the performance of a 32-bit four-stage carry-lookahead adder by inter-
leaving two identical adders. For this type of circuit, k is empirically found to
be approximately 	 (the split and merge networks are about as expensive as
one adder stage). Therefore splitting the adder in this particular way does not
help.
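The break-even overhead can be located numerically. The sketch below assumes our reading of the overhead model: the split and the merge each cost a fraction $k$ of the original task's energy and delay, so the parallel version uses energy $(1 + 2k)E$ and delay $(\tfrac{1}{2} + 2k)t$. The function names and the bisection bounds are our own choices.

```python
def theta_ratio(k):
    """Theta_parallel / Theta_original for one binary split and one binary merge,
    each costing a fraction k of the original task (assumed model)."""
    energy_factor = 1.0 + 2.0 * k  # original energy plus split plus merge
    delay_factor = 0.5 + 2.0 * k   # half the original delay plus split plus merge
    return energy_factor * delay_factor ** 2

def break_even(lo=0.0, hi=1.0, tol=1e-9):
    """Bisection for the k at which theta_ratio(k) == 1.
    The ratio increases with k, is below 1 at lo and above 1 at hi."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if theta_ratio(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

k_star = break_even()
print(f"break-even k = {k_star:.3f}; "
      f"task must cost at least {1 / k_star:.1f} times a split or merge")
```

Under this model, any overhead fraction above the break-even value makes the parallel version worse in $\Theta$ than the original task, which is the situation described for the interleaved adders.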
4.2 The $\Theta$-Efficiency of Pipelining
Now let us consider a task $S$ that repeatedly evaluates the function $F$ for a
sequence of parameter values: $S$ receives a parameter value $x$ from the
environment, evaluates $F(x)$, sends the result to the environment, and repeats the
cycle for a next parameter value.
Pipelining is the transformation that replaces the single task $S$ with a chain
$P$ of $m$ tasks $S_i$ computing functions $F_i$, called the “stages” of the pipeline.
Each stage $S_i$ behaves exactly like the original task, except that it computes $F_i$
instead of $F$, and its environment is different: $S_i$ receives its parameters from
$S_{i-1}$ if $i > 1$, or from the environment of $S$ if $i = 1$; it sends its results to
$S_{i+1}$ if $i < m$, or to the environment of $S$ if $i = m$.
The $F_i$s are chosen such that $F = F_m \circ \cdots \circ F_2 \circ F_1$, with $m \geq 1$.
In this example, we are interested in the cycle time and the energy consumed
by one cycle. We first determine the energy $E_S$ and cycle time $t_S$ for the
computation of one $F(x)$ by task $S$.
There are two parts to the activity of a cycle: the computation of $F(x)$ and
the communication overhead (receiving parameters and sending results). Let
$E$ be the energy and $t$ the time to compute $F(x)$. We assume that the energy of
the communication overhead is $cE$ and that the delay is $ct$. Putting the pieces
together, we get:
$$E_S = E + cE = (1 + c) E \tag{1.20}$$
$$t_S = t + ct = (1 + c) t \tag{1.21}$$
$$\Theta_S = E t^2 (1 + c)^3 \tag{1.22}$$
Now, we want to choose the length of the pipeline, $m$, and the functions $F_i$ so
as to minimize the $\Theta$ of the pipeline in terms of the energy and cycle time for
computing one $F(x)$.
The cycle time of the pipeline is the cycle time of the slowest stage. Hence,
we should choose the functions $F_i$ such that all stages have the same cycle time.
For all $i$:
$$t_i = \frac{t}{m} + ct \tag{1.23}$$
But the stages do not need to have the same energy; let us say that each stage
consumes $e_i$, with $\sum_{i=1}^{m} e_i = E$. For simplicity, let us assume that the
communication overhead for the entire non-pipelined implementation is paid
by each stage of the pipelined implementation. Under these assumptions, the
total energy for stage $i$ is
$$E_i = e_i + cE \tag{1.24}$$
We can now compute the energy $E_P$ and the cycle time $t_P$ for computing
one $F(x)$:
$$E_P = \sum_{i=1}^{m} E_i \tag{1.25}$$
$$t_P = t_i \quad \text{for all } i \tag{1.26}$$
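which gives:
$$E_P = E + mcE = (1 + mc) E \tag{1.27}$$
$$t_P = \frac{t}{m} + ct \tag{1.28}$$
Let $\Theta_P$ be the $\Theta$ of the pipeline. By definition, $\Theta_P = E_P t_P^2$, i.e.,
$$\Theta_P = E t^2 \frac{(1 + mc)^3}{m^2} \tag{1.29}$$
We can express the improvement in $\Theta$ compared with the non-pipelined case,
for which $\Theta_S = E t^2 (1 + c)^3$, as the ratio of the two $\Theta$s:
$$\frac{\Theta_S}{\Theta_P} = \frac{m^2 (1 + c)^3}{(1 + mc)^3} \tag{1.30}$$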
The ideal case $c = 0$ (no overhead) gives an improvement
$$\frac{\Theta_S}{\Theta_P} = m^2 \tag{1.31}$$
with $E_P = E$ and $t_P = t/m$. Although it looks like we have gained nothing
in energy, in fact we can save up to a factor $m^2$ in energy if we equalize the
cycle time to that of the non-pipelined case by adjusting the supply voltage.
For $c > 0$, the optimal improvement is achieved for
$$\frac{d}{dm} \left( \frac{\Theta_S}{\Theta_P} \right) = 0 \tag{1.32}$$
i.e., for $m = 2/c$.
In the optimal case
$$E_P = 3E \quad \text{and} \tag{1.33}$$
$$t_P = \tfrac{3}{2}\, ct \tag{1.34}$$
whence we derive the following theorem on the optimal length of a pipeline.
Theorem 3 The $\Theta$-optimal pipeline requires an energy per computation step
that is 3 times the energy required for computing $F$. It has a cycle time that is
3/2 times the overhead’s cycle time.
Let us compute the optimal pipeline improvement as a function of the overhead
ratio $c$ (with $m = 2/c$). We get the following result:
$$\frac{\Theta_S}{\Theta_P} = \frac{4 (1 + c)^3}{27 c^2} \tag{1.35}$$
The result shows that the pipeline is very sensitive to the communication
overhead. For an overhead ratio of one (which obtains when the pipelining
communication is as costly as the operation itself), the ratio is only $32/27 \approx 1.19$:
the pipeline offers practically no gain in $Et^2$.
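The sensitivity to the overhead ratio is easy to tabulate. The sketch below assumes our reading of the pipeline model of this section: improvement $m^2 (1 + c)^3 / (1 + mc)^3$ for $m$ stages and overhead ratio $c$, maximized at $m = 2/c$ (treated as a real number; a real pipeline would round it to an integer). The function names are ours.

```python
def improvement(m, c):
    """Theta_S / Theta_P = m^2 (1 + c)^3 / (1 + m c)^3 for m stages, overhead c."""
    return m ** 2 * (1.0 + c) ** 3 / (1.0 + m * c) ** 3

def optimal_stages(c):
    """Real-valued stage count m = 2 / c that maximizes the improvement (c > 0)."""
    return 2.0 / c

def optimal_improvement(c):
    """Improvement at m = 2 / c, i.e. 4 (1 + c)^3 / (27 c^2)."""
    return 4.0 * (1.0 + c) ** 3 / (27.0 * c ** 2)

for c in (0.05, 0.1, 0.5, 1.0):
    m = optimal_stages(c)
    print(f"c = {c:4.2f}  optimal m = {m:5.1f}  "
          f"improvement = {optimal_improvement(c):7.2f}")
```

For small overheads the gain is large and the optimal pipeline is long; as $c$ approaches one the optimal length drops to two stages and the gain nearly vanishes.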
* *
In the second part of this chapter, we examine the relationship between $Et^2$
and the two physical parameters that a designer (usually) can adjust: the supply
voltage and the transistor widths. First, we address the issue of optimizing $Et^2$
as a function of the transistor widths. Secondly, we introduce the notion of
minimum-energy functions $E(t)$ to express the dependence of $E$ and $t$ on each
of the two physical parameters. We use those functions for deriving a number
of important results about the sequential and parallel compositions of systems.
5. Transistor Sizing for Optimal $Et^2$
The task of adjusting the transistor widths of a circuit is called “transistor
sizing,” or “circuit sizing.” We are interested in sizing transistors so as to
minimize $\Theta$. Both capacitance $C_z$ and current $i_{z\uparrow}$ in Equations 1.2 and 1.3 for
$E$ and $t$ depend on the size of the transistors. The capacitance contributed
by transistors increases linearly with the transistor widths, but the current also
increases linearly. Hence, it is not immediately clear how transistors should be
sized to optimize the $\Theta$ of a circuit.
We shall find that, in a $\Theta$-optimal circuit, the transistors are sized such
that the total transistor capacitance is approximately twice the total parasitic
capacitance. As we shall see, the result is exact only for a restricted class of
circuits; nevertheless, it is a good approximation for most circuits.
Let $E$ be the total energy of the computation: it is the sum of all energy spent
exercising the nodes of the computation. Assume, without loss of generality,
that there are exactly two transitions corresponding to each node. (This amounts
to “unrolling” into several nodes the nodes of the circuit that see more than two
transitions and ignoring the nodes that never transition.) Let $t$ be the cycle time
of a critical cycle. We assume that the circuit is designed so that all cycles
are critical; this is true in many well designed circuits, and it is true for any
optimally sized circuit in the absence of additional constraints (e.g.,
minimum-size or slew-rate constraints) on transistor sizes.
We distinguish between two types of capacitances attached to a node $i$:
first, the “gate” (or transistor) capacitance $g_i$ contributed by the transistors
of the operators to which node $i$ is connected, and secondly, the “parasitic”
capacitance $p_i$ contributed by the wires connecting node $i$ to other operators.
We assume that $p_i$ is fixed and that we can change $g_i$ as we please by adjusting
the transistors’ widths.
We first consider scaling all the transistors in the circuit by the same factor:
we want to determine the global scaling factor $\alpha$ to be applied to all transistors’
widths that achieves the lowest $Et^2$. We have
$$E = \sum_{i \in N} (g_i + p_i) V^2 \tag{1.36}$$
where $N$ is the set of all nodes on the chosen critical cycle, i.e.,
$$E = (G + P) V^2 \tag{1.37}$$
with $G = \sum_{i \in N} g_i$ and $P = \sum_{i \in N} p_i$.
For the sake of simplicity, we assume that upgoing and downgoing transitions
on a given node have the same delay. For the cycle time $t$ of a critical cycle,
$t = \sum_{i \in N} t_i$, where $t_i$ is the delay of a transition of node $i$. We use Equation 1.6 to
compute all $t_i$s:
$$t = \sum_{i \in N} \frac{g_i + p_i}{k_i V} \tag{1.38}$$
We first simplify the problem and assume that the circuit is “homogeneous,”
i.e., that all gates are identical, and hence that all the $k_i$ are equal and
proportional to the scaling factor; in normalized units, $k_i = \alpha$. We get
that
$$t = \sum_{i \in N} \frac{g_i + p_i}{\alpha V} \tag{1.39}$$
i.e.,
$$t = \frac{G + P}{\alpha V} \tag{1.40}$$
By definition of the global sizing factor $\alpha$, we have $G = \alpha G_0$, where $G_0$ is the
total gate capacitance at unit scaling, and
therefore we can eliminate $\alpha$ from the expression of $t$:
$$t = \frac{(G + P)\, G_0}{G V} \tag{1.41}$$
$$Et^2 = K \frac{(G + P)^3}{G^2} \tag{1.42}$$
where $K = G_0^2$. It is easy to check that $\frac{d(Et^2)}{dG} = 0$ for $G = 2P$.
Hence the theorem:
Theorem 4 For a homogeneous circuit, the minimal $Et^2$ is achieved when the
total gate capacitance is twice the total parasitic capacitance.
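The check behind Theorem 4 can be written out explicitly. The derivation below assumes, as our reading of Equation 1.42, that $Et^2$ is proportional to $(G + P)^3 / G^2$ with $P$ fixed and a prefactor that does not depend on $G$.

```latex
\frac{d}{dG}\,\frac{(G+P)^3}{G^2}
  = \frac{3\,(G+P)^2}{G^2} - \frac{2\,(G+P)^3}{G^3}
  = \frac{(G+P)^2\,(G - 2P)}{G^3}
```

The derivative is negative for $G < 2P$ and positive for $G > 2P$, so $G = 2P$ is the unique minimum for positive $G$.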
 
5.1 Using $Et^n$ With $n \neq 2$
Sometimes, we may want to optimize $Et^n$ for $n \neq 2$ when using a
$\Theta$-optimal circuit would not be possible because the required delay or energy
would result in a supply voltage outside the practically possible range. Roughly
speaking, when we perform $Et^n$ optimization we mean that we consider a 1%
improvement in speed to be worth an $n$% increase in energy. For example, for a
circuit operating in velocity saturation, we might have to expend twice the energy
for a 10% speed improvement. In that case, we should optimize for $Et^n$ with
a correspondingly large $n$.
There is another reason to examine $Et^n$, besides extreme supply voltages
(and mathematical insight). Even though a large system may be optimized for
$Et^2$, components of that system may not individually be optimized for $Et^2$.
For example, speeding up critical paths while lowering the $Et^2$ of these paths
may make the entire design run faster and actually improve $Et^2$ for the entire
design. If multiple supply voltages are allowed, then Theorem 1 applies to each
component of the system, so each component is optimally characterized by $Et^2$.
But multiple supplies are impractical; instead we can use $Et^n$ optimization for
the different paths, with a larger $n$ for the more critical paths and a smaller $n$
for the less critical paths.
As an example of $Et^n$ optimization, we generalize Theorem 4 for all $n$:
$$Et^n = K_n \frac{(G + P)^{n+1}}{G^n} \tag{1.43}$$
where $K_n$ does not depend on $G$, and we find that $\frac{d(Et^n)}{dG} = 0$ for $G = nP$. Hence the theorem:
Theorem 5 For a homogeneous circuit, the minimal $Et^n$ is achieved when the
total gate capacitance is $n$ times the total parasitic capacitance.
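Theorem 5 can be confirmed by direct numerical minimization. The sketch below assumes that, up to a factor independent of $G$, $Et^n$ behaves as $(G + P)^{n+1} / G^n$; the ternary search, its bounds, and the value of `P` are our own illustrative choices.

```python
def et_n(g, p, n):
    """E * t^n up to a factor that does not depend on G: (G + P)^(n+1) / G^n."""
    return (g + p) ** (n + 1) / g ** n

def best_gate_capacitance(p, n, lo=1e-6, hi=1e6, iters=200):
    """Ternary search for the G that minimizes et_n.
    The function decreases for G < n*P and increases for G > n*P."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if et_n(a, p, n) < et_n(b, p, n):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

P = 1.0  # total parasitic capacitance in normalized units (assumed value)
for n in (1, 2, 3, 5):
    g = best_gate_capacitance(P, n)
    print(f"n = {n}: optimal G / P = {g / P:.4f}")
```

The printed ratios come out at the exponent $n$ itself, so a larger $n$ (more weight on speed) calls for proportionally more gate capacitance relative to the parasitics.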
5.2 Optimal Energy and Cycle Time
We have seen that, for a ring of identical operators, $E$ and $t$ are of the
following form:
$$E = (G + P) V^2 \tag{1.44}$$
$$t = \frac{(G + P)\, G_0}{G V} \tag{1.45}$$
where $G_0$ is the total gate capacitance at unit scaling.
When optimizing for $Et^n$ by transistor sizing, we have established that the
minimum is achieved for $G = nP$, to which correspond an energy $E_n$ and a
cycle time $t_n$, with
$$E_n = (n + 1)\, P V^2 \tag{1.46}$$
and
$$t_n = \frac{n + 1}{n} \cdot \frac{G_0}{V} \tag{1.47}$$
Two interesting quantities are $E_0$ and $t_\infty$, the values of Equations 1.46 and 1.47
for $n = 0$ and for $n \to \infty$: $E_0 = P V^2$ and $t_\infty = \lim_{n \to \infty} t_n$.
By definition of $Et^n$, $E_0$ is the theoretical minimal energy, corresponding to
minimizing $E$ without regard for $t$; it corresponds to the situation when the
transistors are all zero-sized and the fixed parasitic capacitances constitute the
entire $E$. Conversely, $t_\infty$ is the theoretical minimal cycle time corresponding
to minimizing $t$ without regard for $E$. It is obtained when $G$ goes to infinity,
i.e., when only gate capacitances contribute to $E$ and $t$. We may eliminate the
circuit parameters from Equations 1.46 and 1.47 and restate our results in terms of
$E_0$ and $t_\infty$; thus,
$$E_n = (n + 1)\, E_0 \tag{1.48}$$
and
$$t_n = \frac{n + 1}{n}\, t_\infty \tag{1.49}$$
In particular
$$E_2 = 3 E_0 \tag{1.50}$$
$$t_2 = \tfrac{3}{2}\, t_\infty \tag{1.51}$$
Theorem 6 The cycle time for optimal $Et^2$ is 3/2 the theoretical minimal cycle
time at that supply voltage. The energy for optimal $Et^2$ is three times the
theoretical minimal energy at that supply voltage.
If we eliminate $n$ from Equations 1.48 and 1.49, we arrive at the following
relation between the minimum energy and the minimum delay of a single-cycle
circuit at a fixed voltage:
$$\frac{E_n}{E_0} = \frac{t_n}{t_n - t_\infty} \tag{1.52}$$
5.3 A Minimum-Energy Function
We can define an antimonotonic minimum-energy-consumption function or
minimum-energy function $E(t)$ that describes the effect of transistor sizing on
the minimum energy required for a system to run at a given $t$ at a fixed voltage.
(Tierno has previously used a similar energy function [12].) If we rewrite
Equation 1.52 with $E$ a function of $t$, we get the following function:
$$E(t) = E_0\, \frac{t}{t - t_\infty} \tag{1.53}$$
It is easy to prove that Equation 1.53 satisfies the above definition of the
minimum-energy function.
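The relation between the optimal operating points and the minimum-energy function can be sketched as follows, assuming our reading of Equations 1.48, 1.49 and 1.53: $E_n = (n+1)E_0$, $t_n = \frac{n+1}{n} t_\infty$, and $E(t) = E_0\, t / (t - t_\infty)$. The names `e0` and `t_inf` stand for the theoretical minimal energy and cycle time; the normalized values are illustrative.

```python
def optimum(n, e0, t_inf):
    """Energy and cycle time of the Et^n-optimal sizing, for n > 0."""
    return (n + 1) * e0, (n + 1) / n * t_inf

def min_energy(t, e0, t_inf):
    """Minimum-energy function E(t) = E_0 * t / (t - t_inf), for t > t_inf."""
    if t <= t_inf:
        raise ValueError("cycle time must exceed the theoretical minimum t_inf")
    return e0 * t / (t - t_inf)

E0, T_INF = 1.0, 1.0  # normalized units (assumed values)
for n in (1, 2, 4, 8):
    e_n, t_n = optimum(n, E0, T_INF)
    print(f"n = {n}: t_n = {t_n:.3f}  E_n = {e_n:.3f}  "
          f"E(t_n) = {min_energy(t_n, E0, T_INF):.3f}")
```

Every $Et^n$-optimal point lies on the curve $E(t)$, which decreases toward $E_0$ as $t$ grows and diverges as $t$ approaches $t_\infty$.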
5.4 Experimental Evidence
Even though Equations 1.48 and 1.49 have been derived for only a very
restricted class of circuits, they are in fact good approximations for a much wider
class. The authors have checked the equations against the minimal $Et^n$ obtained
by applying an optimization algorithm (gradient descent) to a class of circuits.
The circuits, each consisting of a ring of operators, were chosen at random
with a uniform-squared distribution of parasitic capacitances; the number of
transistors in series was also chosen according to such a distribution. The
authors used real numbers for both parameters; they optimized the expression
for $Et^n$ using Equations 1.36 and 1.38 for $E$ and $t$. The range of parasitics
was [1,100] in normalized units; the range of transistors in series was [1,6].
The results show that Equations 1.48 and 1.49 hold, with very good accuracy,
over a wide range of parasitics, logic-gate types, and circuit sizes.
[Plots: "energy error for 100 operators" and "delay error for 100 operators"; each panel shows relative error versus optimization index (n).]
Figure 1.2. Results of simulating random circuits.
The results of the simulations for circuits consisting of a ring of 100 operators
are summarized in Figure 1.2. (Simulations for rings of 10 and 1000 operators
show similar results.) The figure shows the mean and standard deviation of
the error in the estimates of Equations 1.48 and 1.49 for a range of different
optimization indices (the exponent $n$ in $Et^n$). The estimates get more dependable
for larger circuits, where the random variation in operators tends to average out
over the cycle. Overall, the estimates are usually good to within five percent of
the energy and delay values for the actual optimum $Et^n$.
5.5 Multi-cycle Systems
Let us now consider a system composed of $m$ subsystems $S_i$ ($1 \le i \le m$)
executing in parallel; each subsystem $S_i$ has minimum-energy function
$$E_i(t_i) = E_{\min,i}\,\frac{t_i}{t_i - t_{\min,i}}.$$
These subsystems can be chains or rings of arbitrary logic gates, since our
experiment shows that Equations 1.48 and 1.49 adequately describe the minimum
energy and delay of a large class of circuits. Let us assume that the
subsystems are synchronized so that all $t_i$s are equal. As a consequence, the
minimum-energy function for the composed system is
$$E(t) = \sum_{i=1}^{m} E_{\min,i}\,\frac{t}{t - t_{\min,i}}. \tag{1.54}$$
Theorem 7 For a system composed of $m$ subsystems $S_i$ ($1 \le i \le m$) as specified
above, if the system is optimally sized for $Et^n$ then
$$E \le (n+1)\sum_{i=1}^{m} E_{\min,i} \tag{1.55}$$
with equality if and only if all $t_{\min,i}$s are equal.
(Note that Equation 1.48 is simply the special case of Theorem 7 that holds
when all $t_{\min,i}$s are equal.)
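Proof. The optimal $Et^n$ of this composed system is reached for $E$ and $t$ that
satisfy
$$\frac{dE}{dt} = -n\,\frac{E}{t}, \tag{1.56}$$
which is achieved when
$$\sum_{i=1}^{m} E_{\min,i}\,\frac{t^2}{(t - t_{\min,i})^2} = (n+1)\sum_{i=1}^{m} E_{\min,i}\,\frac{t}{t - t_{\min,i}}. \tag{1.57}$$
We may now invoke the Cauchy-Schwarz inequality
$\left(\sum_i a_i b_i\right)^2 \le \left(\sum_i a_i^2\right)\left(\sum_i b_i^2\right)$,
where equality holds if and only if $a_i/b_i$ has the same value for all $i$. If we
substitute $a_i = \sqrt{E_{\min,i}}\;t/(t - t_{\min,i})$ and $b_i = \sqrt{E_{\min,i}}$, we get that
$$\left(\sum_{i=1}^{m} E_{\min,i}\,\frac{t}{t - t_{\min,i}}\right)^2 \le \left(\sum_{i=1}^{m} E_{\min,i}\,\frac{t^2}{(t - t_{\min,i})^2}\right)\left(\sum_{i=1}^{m} E_{\min,i}\right) \tag{1.58}$$
with equality if and only if all $t_{\min,i}$s are equal. Using Equation 1.57, we replace
$\sum_i E_{\min,i}\,t^2/(t - t_{\min,i})^2$ with $(n+1)\sum_i E_{\min,i}\,t/(t - t_{\min,i})$
in Equation 1.58, and we get the following result:
$$\left(\sum_{i=1}^{m} E_{\min,i}\,\frac{t}{t - t_{\min,i}}\right)^2 \le (n+1)\left(\sum_{i=1}^{m} E_{\min,i}\,\frac{t}{t - t_{\min,i}}\right)\left(\sum_{i=1}^{m} E_{\min,i}\right). \tag{1.59}$$
By Equation 1.54, then,
$$E(t) = \sum_{i=1}^{m} E_{\min,i}\,\frac{t}{t - t_{\min,i}} \le (n+1)\sum_{i=1}^{m} E_{\min,i}. \tag{1.60}$$
And therefore
$$E \le (n+1)\sum_{i=1}^{m} E_{\min,i}. \tag{1.61}$$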
In Theorem 7, equality holds if and only if all $t_{\min,i}$s are equal; in this situation, we
also have that $t = \frac{n+1}{n}\,t_{\min,i}$. This is a generalization of Equations 1.48 and 1.49
to multi-cycle systems. As we have already pointed out, the $t_{\min,i}$s are likely
to be close to each other in most well designed circuits, so we should expect
that usually $E \approx (n+1)\sum_i E_{\min,i}$. In other words, in a multi-cycle system
optimally sized for $Et^2$, the gate capacitance is (close to) twice the parasitic
capacitance.
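Theorem 7 lends itself to a quick numerical experiment. The sketch below assumes that each subsystem follows the minimum-energy form E_min,i t / (t - t_min,i) and that all subsystems share one cycle time t. The sample values, function names, and the golden-section search are illustrative choices, not the authors' optimization procedure.

```python
import math


def composed_energy(t, e_mins, t_mins):
    """Energy of the composed system at a common cycle time t (assumed form)."""
    return sum(e * t / (t - tm) for e, tm in zip(e_mins, t_mins))


def optimal_point(e_mins, t_mins, n, iters=200):
    """Return (E, t) minimizing E(t) * t**n for t above the largest t_min."""
    t_floor = max(t_mins)

    def cost(u):
        t = t_floor + math.exp(u)
        return composed_energy(t, e_mins, t_mins) * t ** n

    # Golden-section search on u = log(t - t_floor).
    a, b = math.log(1e-6 * t_floor), math.log(1e6 * t_floor)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = b - g * (b - a)
        d = a + g * (b - a)
        if cost(c) < cost(d):
            b = d
        else:
            a = c
    t = t_floor + math.exp(0.5 * (a + b))
    return composed_energy(t, e_mins, t_mins), t


if __name__ == "__main__":
    n = 2
    for e_mins, t_mins in [([1.0, 3.0], [1.0, 1.0]), ([1.0, 1.0], [1.0, 2.0])]:
        e_opt, t_opt = optimal_point(e_mins, t_mins, n)
        bound = (n + 1) * sum(e_mins)
        print(t_mins, "E =", e_opt, "bound =", bound, "t =", t_opt)
```

In the first case the minimal cycle times are equal and the optimum should sit on the bound; in the second they differ and the optimum energy should fall strictly below it.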
We have shown that this capacitance-ratio result is correct for a ring of operators.
We previously observed that, if a dominant term exists, the same relation is
approximately correct for general circuits. We have experimental evidence that
the relation is true for a large class of multi-cycle systems. Such evidence is
also provided by SPICE simulations of an adder published by Chandrakasan
and Brodersen and summarized in Figure 4.7 of their book [2]. Their figure
shows that, for the five different parasitic contributions they study, the mini-
mum energy for a given speed (allowing supply-voltage adjustment) is achieved
when the gate capacitance is very close to twice the parasitics. (They did not,
however, draw the conclusion that we have reached here.)
6. Power Rule for Sequential Composition
Let us now consider the sequential composition of two systems $A$ and $B$.
Let us assume a sequential computation that runs $A$ to completion and then $B$
to completion; we assume the delay between the end of $A$ and the start of $B$ to
be negligible. We want to know at what $t_A$, $t_B$ to run $A$ and $B$ so as to optimize
the $Et^n$ of the sequential composition.
We now recall the concept of a minimum-energy function introduced by
Equation 1.53. Equation 1.53 applies to the specific transformation of changing
transistor sizes; i.e., it describes what the minimum energy of a circuit will be
when that circuit is sized to achieve a certain performance. We are no longer
limiting our discussion to the effects of transistor sizing, so we can allow other
transformations to be used as a basis for this $E(t)$.
Theorem 8 For the sequential composition of two systems $A$ and $B$, if the
composed system is optimized for $Et^n$, then
$$\frac{dE_A(t_A)}{dt_A} = \frac{dE_B(t_B)}{dt_B}. \tag{1.62}$$
Proof. The latency of the composed system is $t = t_A + t_B$, while its energy
is $E = E_A(t_A) + E_B(t_B)$. Hence we are minimizing
$$\left(E_A(t_A) + E_B(t_B)\right)(t_A + t_B)^n. \tag{1.63}$$
If we set the partial derivatives of this expression with respect to $t_A$ and $t_B$
equal to zero, we obtain
$$\frac{dE_A(t_A)}{dt_A} = -n\,\frac{E_A(t_A) + E_B(t_B)}{t_A + t_B}$$
and
$$\frac{dE_B(t_B)}{dt_B} = -n\,\frac{E_A(t_A) + E_B(t_B)}{t_A + t_B},$$
from which it is clear that Equation 1.62 holds.
Theorem 8 holds for any minimum-energy function $E(t)$ and any value of
the optimization index $n$.
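As an illustration of Theorem 8, the following sketch minimizes the Et^n of a two-stage sequence numerically and compares the two slopes at the optimum. It assumes, purely for the example, that both components follow the sizing-based form E_min t / (t - t_min); the theorem itself does not depend on that choice. The coordinate-descent search, the parameter values, and the function names are hypothetical.

```python
import math


def golden_min(f, lo, hi, iters=100):
    """Golden-section search for the minimizer of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if f(c) < f(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)


def energy(t, e_min, t_min):
    """Assumed minimum-energy function of one component."""
    return e_min * t / (t - t_min)


def slope(t, e_min, t_min):
    """Analytic derivative dE/dt of the assumed minimum-energy function."""
    return -e_min * t_min / (t - t_min) ** 2


def optimize_sequence(a, b, n, sweeps=60):
    """Coordinate descent on (t_a, t_b); a and b are (e_min, t_min) pairs."""
    t_a, t_b = 2.0 * a[1], 2.0 * b[1]

    def cost(x, y):
        return (energy(x, *a) + energy(y, *b)) * (x + y) ** n

    for _ in range(sweeps):
        t_a = golden_min(lambda x: cost(x, t_b), a[1] * (1 + 1e-9), 100.0 * a[1])
        t_b = golden_min(lambda y: cost(t_a, y), b[1] * (1 + 1e-9), 100.0 * b[1])
    return t_a, t_b


if __name__ == "__main__":
    a, b, n = (1.0, 1.0), (4.0, 2.0), 2
    t_a, t_b = optimize_sequence(a, b, n)
    total_e = energy(t_a, *a) + energy(t_b, *b)
    print("dE_A/dt_A =", slope(t_a, *a))
    print("dE_B/dt_B =", slope(t_b, *b))
    print("-n E / t  =", -n * total_e / (t_a + t_b))
```

The three printed values should agree to within the tolerance of the search, matching Equation 1.62 and the stationarity conditions in the proof.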
If we now vary the supply voltages of the components $A$ and $B$ of the
sequential composition so as to optimize $Et^2$, we have the following theorem:
Theorem 9 For the sequential composition of two systems $A$ and $B$ with power
consumptions $P_A$ and $P_B$, respectively, if the composed system is optimized for
$Et^2$ by adjusting the supply voltages of the components, then
$$P_A = P_B. \tag{1.64}$$
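Proof. Let us define $\Theta_A = E_A t_A^2$, $\Theta_B = E_B t_B^2$; as we have established,
$\Theta_A$ and $\Theta_B$ are voltage independent. Using Theorem 8 with
$E_A(t_A) = \Theta_A / t_A^2$ and $E_B(t_B) = \Theta_B / t_B^2$, we get:
$$\frac{\Theta_A}{t_A^3} = \frac{\Theta_B}{t_B^3}. \tag{1.65}$$
Hence,
$$\frac{E_A}{t_A} = \frac{E_B}{t_B}. \tag{1.66}$$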
 
In other words, circuits composed sequentially in a $\Theta$-optimal way should
have their supply voltages adjusted so as to equalize their power use. (If the
circuits are themselves $\Theta$-optimal, then equalizing their power is a necessary
and sufficient condition for making the composition $\Theta$-optimal.)
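The power rule can be illustrated with a one-dimensional search. The sketch assumes that each component obeys E = Theta / t^2 with a voltage-independent Theta, and it fixes t_B = 1 because the Et^2 of the composition does not change under a global scaling of t. The names and the sample values are hypothetical.

```python
import math


def sequence_theta(r, theta_a, theta_b):
    """Et^2 of the sequence when t_a = r and t_b = 1 (E = Theta / t^2 assumed)."""
    e_a = theta_a / r ** 2
    e_b = theta_b
    return (e_a + e_b) * (r + 1.0) ** 2


def best_ratio(theta_a, theta_b, iters=200):
    """Golden-section search for the latency ratio r = t_a / t_b on log(r)."""
    a, b = math.log(1e-4), math.log(1e4)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = b - g * (b - a)
        d = a + g * (b - a)
        if sequence_theta(math.exp(c), theta_a, theta_b) < sequence_theta(math.exp(d), theta_a, theta_b):
            b = d
        else:
            a = c
    return math.exp(0.5 * (a + b))


def powers(r, theta_a, theta_b):
    """Power E / t of each component at t_a = r, t_b = 1."""
    return theta_a / r ** 3, theta_b


if __name__ == "__main__":
    theta_a, theta_b = 8.0, 1.0
    r = best_ratio(theta_a, theta_b)
    print("r =", r, "cube root of ratio =", (theta_a / theta_b) ** (1.0 / 3.0))
    print("powers =", powers(r, theta_a, theta_b))
```

At the optimum found by the search, the two powers should coincide, which is the content of Equation 1.64.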
7. $\Theta$-Rules for Parallel and Sequential Compositions
Finally, let us consider the parallel and sequential composition rules for
$Et^2$, assuming that we have the freedom of independently adjusting the supply
voltages of the composed systems. Given are two systems $A$ and $B$ with
latencies $t_A$ and $t_B$, energies $E_A$ and $E_B$; we have $\Theta_A = E_A t_A^2$ and
$\Theta_B = E_B t_B^2$.
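First, consider the parallel composition of the two systems. We want to
compute the minimum $\Theta_{A \parallel B}$ as a function of $\Theta_A$ and $\Theta_B$.
The minimum $\Theta_{A \parallel B}$ is achieved when $t_A = t_B$ (see Section 4.1). With
$E_{A \parallel B} = E_A + E_B$ and $t_{A \parallel B} = t_A = t_B$, we get
$\Theta_{A \parallel B} = E_A t_A^2 + E_B t_B^2$.
Hence the theorem:
Theorem 10 For two systems $A$ and $B$ composed in parallel,
$$\Theta_{A \parallel B} = \Theta_A + \Theta_B. \tag{1.67}$$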
Now consider the two systems composed in sequence as in Section 6. As
in the previous example, we are given $\Theta_A = E_A t_A^2$ and $\Theta_B = E_B t_B^2$, and we
wish to determine the optimal $\Theta_{A;B}$ as a function of $\Theta_A$ and $\Theta_B$. We have:
$$\Theta_{A;B} = (E_A + E_B)(t_A + t_B)^2, \tag{1.68}$$
i.e.,
$$\Theta_{A;B} = \left(\frac{\Theta_A}{t_A^2} + \frac{\Theta_B}{t_B^2}\right)(t_A + t_B)^2. \tag{1.69}$$
Since the optimal total $Et^2$ is independent of global scaling of $t$, it is
sufficient to determine a single parameter defined as follows: $r = t_A / t_B$. From
Equation 1.65, we have $r = (\Theta_A / \Theta_B)^{1/3}$. For this $r$, Equation 1.69 gives the optimal
$\Theta_{A;B} = \left(\Theta_A / r^2 + \Theta_B\right)(1 + r)^2$.
Theorem 11 For two systems $A$ and $B$ composed in sequence,
$$\Theta_{A;B} = \left(\Theta_A^{1/3} + \Theta_B^{1/3}\right)^3. \tag{1.70}$$
The -rules are very useful for computing the  of a computation as we did
for parallelism and pipelining in Section 4. The industrious reader that goes
through the exercise of using the -rules to compute the same 

that was
computed in Section 4 is in for a surprise: the result is different!
Using the -rules, we get 

  

 

 
 	

	
 which is smaller than
the result of Equation 1.19. The reason is subtle but important. In the first
computation, we postulate that  split    merge     and split   merge    
By fixing the $E$ and $t$ ratios, we implicitly assume that the voltages of the
different components are identical. The computation using the $\Theta$-rules just
assumes that

  

  
	


 Hence the voltages of the split and
merge can be adjusted independently, which leads to a lower $\Theta$.
(An analysis of the $\Theta$-efficiency of parallelism can be found in a companion
paper [6].)
8. Summary and Conclusion
In this chapter, we have seen that $Et^2$ constitutes an excellent metric for
comparing computations for energy and delay efficiency when the physical
behavior is that of CMOS VLSI circuits. We started by observing that the $Et^2$
metric for a CMOS circuit is independent of the supply voltage, as long as we
can scale the threshold voltage linearly with the supply voltage and as long as
we stay away from velocity saturation.
We showed that when supply-voltage adjustment is allowed, the popular
$Et$ metric is inferior. Following along these lines, we established that any
metric with certain properties (we called such a metric $\Theta$) could be used to
compare designs independently of the voltage. As long as the required speed
or energy lies within the threshold-voltage to velocity-saturation range of the
implementations, we saw that the implementation with the better $\Theta$ is better
for any desired speed or energy consumption. We showed that $Et^2$ is to first
order a metric with the required properties. We then applied the metric to
various circuit transformations, namely pipelining and parallelism. We also
applied the metric to transistor sizing and were able to show that the optimal
sizing for energy efficiency is not what is commonly used (minimal sizes).
Finally, we established rules for computing the $\Theta$ of the sequential and parallel
compositions of systems.
Overall, $Et^2$ is a very useful efficiency metric for designing CMOS VLSI
circuits. Time and experience will show how applicable it is to other computations.
Acknowledgments
Acknowledgment is due to Karl Papadantonakis, Martin Rem, and Catherine
Wong for their comments and criticisms. The research described in this
paper was sponsored by the Defense Advanced Research Projects Agency and
monitored by the Air Force.
References
[1] Carver A. Mead and Lynn Conway. Introduction to VLSI Systems,
Addison-Wesley, Reading MA, 1980.
[2] Anantha P. Chandrakasan and Robert W. Brodersen. Low Power Digital
CMOS Design, Kluwer Academic Publishers, Dordrecht, 1995.
[3] Alain J. Martin. Towards an Energy Complexity of Computation.
Information Processing Letters, 77, 2001.
[4] Alain J. Martin, Andrew Lines, Rajit Manohar, Mika Nyström, Paul Penzes,
Robert Southworth, Uri Cummings, and Tak Kwan Lee. The Design
of an Asynchronous MIPS R3000 Microprocessor. Proceedings of the
17th Conference on Advanced Research in VLSI, IEEE Computer Society
Press, 164-181, 1997.
[5] Alain J. Martin. Synthesis of Asynchronous VLSI Circuits. In Formal
Methods for VLSI Design, ed. J. Staunstrup, North-Holland, 1990.
[6] Alain J. Martin and Mika Nyström. The $Et^2$-Efficiency of Parallelism,
Caltech Technical Report, October 2001.
[7] Mika Nyström. $Et^2$ and Multi-voltage Logic, Caltech Technical Report,
April 1995.
[8] Karl Papadantonakis. A Theory of Constant $Et^2$ CMOS Circuits, Caltech
Computer Science Technical Report 2001.004, July 2001.
[9] Karl Papadantonakis. Hierarchical Voltage Scaling for $Et^2$ Optimization
of CMOS Circuits, Caltech Computer Science Technical Report 2001.005,
July 2001.
[10] Paul I. Pénzes. Energy-delay Efficiency of Asynchronous Circuits, Ph.D.
Thesis (in preparation), California Institute of Technology, 2002.
[11] Paul I. Pénzes and Alain Martin. Global and Local Properties of
Asynchronous Circuits Optimized for Energy Efficiency. IEEE Workshop on
Power Management for Real-time and Embedded Systems, May 2001.
[12] José A. Tierno. An Energy-Complexity Model for VLSI Computations,
Ph.D. Thesis, California Institute of Technology, 1995.
[13] José A. Tierno and Alain J. Martin. Low-Energy Asynchronous Memory
Design. Proceedings of International Symposium on Advanced Research
in Asynchronous Circuits and Systems, IEEE Computer Society Press,
pp. 176-185, 1994.
