Introduction
With energy becoming as important as time as a criterion for computational efficiency, analytical tools are needed to evaluate computations according to both criteria simultaneously. How are energy and time traded against each other in the design process? How are algorithms compared when they have different energy and time figures? A useful metric must separate the algorithmic tradeoffs of energy against time from the physical (usually electrical) tradeoffs.
We propose an efficiency metric for VLSI computation that combines energy, $E$, and time, $t$, in the form $Et^2$. The choice of this metric is based on CMOS VLSI technology: in CMOS, $Et^2$ is independent of the voltage to a first approximation. Instead of attempting to optimize a circuit for both $E$ and $t$, the designer can now optimize the design for the single metric and adjust the voltage to obtain the chosen tradeoff between $E$ and $t$.
We prove that the $Et^2$ metric is optimal for CMOS circuits under assumptions that hold approximately in the normal range of operation. Under those assumptions, energy and delay can be freely exchanged through supply-voltage adjustment. Although the metric is not adequate over the entire range of operation for CMOS transistors, we have experimental evidence that a large class of circuits exhibits a collective behavior that is more regular than that of individual transistors. We also investigate when and why the $Et^2$ metric is inadequate; in some cases, we can use the metric $Et^n$ with $n \neq 2$. We shall see that most results we prove for $Et^2$ generalize to $Et^n$ with $n \neq 2$.
The objection that a metric grounded in a specific technology (CMOS) cannot be general enough for the study of algorithms can be answered by observing that the CMOS model of computation is certainly as general as the "random-access machine" model, which has been used successfully in the traditional analysis of algorithms.
The $Et^2$ metric was originally introduced for the design of an asynchronous MIPS R3000 microprocessor [4]. The arguments about the validity of the metric and the analysis of the pipeline were published by Martin [3]. The results about transistor sizing for minimal $Et^n$ have been described previously by the authors [7, 11].
Energy and Delay in VLSI Computations
Our study of energy and delay in computations is based on CMOS implementations. We consider a digital VLSI computation to be a partially ordered sequence of transitions. Each transition changes the value of a boolean variable of the computation. We consider irreversible computations only, i.e., computations in which assigning a value to a variable destroys the previous value of the variable.
In digital CMOS, a boolean variable is implemented as an electrical node whose voltage represents the present value of the variable. Each transition charges or discharges the capacitor attached to the node, bringing its voltage either to the supply voltage Vdd or to the ground voltage GND. We are interested in the computational, or dynamic, energy spent in charging and discharging the capacitances of all nodes involved in a computation. We ignore leakage energy and short-circuit energy. Since we ignore leakage, we also assume that no energy is spent maintaining the values of variables stored in registers.
Any algorithm can be implemented as a set of production rules [5]. A production rule is of the form
$G \rightarrow t$,
where $G$ is a boolean expression, and $t$ is a single assignment of the value true or false to a boolean variable. Such an assignment is called a transition; in this example, the transition $t$ is performed after $G$ becomes true. A logic gate or operator is the physical implementation of the pair of production rules that set and reset a given variable, of the following form:
$B_u \rightarrow z\uparrow$
$B_d \rightarrow z\downarrow$
Let $E_{z\uparrow}$ and $E_{z\downarrow}$ be the energy spent firing the first and second production rules, respectively. (A firing is an execution of a production rule that changes the value of a variable.) The energy spent firing production rule $B_u \rightarrow z\uparrow$ is the energy dissipated charging the capacitor $C_z$ associated with the node of $z$. The total energy required for charging the capacitor up to voltage $V$ is $C_z V^2$; half of this energy is stored in the capacitor, and half is dissipated as heat in the pull-up network connecting the capacitor to the power supply. We have that
$E_{z\uparrow} = \frac{1}{2} C_z V^2$,
where $V$ is the power-supply voltage. When the capacitor is discharged to ground, the energy stored in the capacitor is dissipated in the pull-down network. Hence, $E_{z\downarrow} = E_{z\uparrow}$. We shall not elaborate on the calculation of the capacitance $C_z$ beyond noting that $C_z$ depends mostly on the "load" of $z$, i.e., the topology of the logic gates of which $z$ is an input, and hardly depends on the structure of the logic gate of which $z$ is an output. In other words, the energy consumed in computing the value of $z$ does not depend on what is computed but rather on where the result of the computation is needed.
The delay $t_{z\uparrow}$ for firing $z\uparrow$ is the ratio of the final electrical charge $Q_z$ on $C_z$ to the current $i_{z\uparrow}$ available for charging $C_z$:
$t_{z\uparrow} = Q_z / i_{z\uparrow}$,
with $Q_z = C_z V$. The current $i_{z\uparrow}$ is the current flowing in the transistor network connecting the constant power supply to $z$ when and only when $B_u$ holds; similarly for the delay $t_{z\downarrow}$ and current $i_{z\downarrow}$.
In general, the transistor current is difficult to analyze. Let us look first at a single nMOS transistor as pull-down network. (The analysis for a pMOS transistor as pull-up network is similar.) We assume that the transistor is above threshold ($V_{gs} > V_t$) and not in velocity saturation. Then, the current is either the saturation current, $I_s$, when $V_{ds} \ge V_{gs} - V_t$; or it is the linear current, $I_l$, when $V_{ds} < V_{gs} - V_t$, where $V_{gs}$ and $V_{ds}$ are the gate-to-source and drain-to-source voltages of the transistor, respectively, and $V_t$ is the threshold voltage.
The formulas for $I_s$ and $I_l$ are well-known:
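For reference, the standard long-channel (square-law) textbook forms can be written as follows. The transconductance parameter $k$, which lumps carrier mobility, oxide capacitance, and the width-to-length ratio, is notation assumed here and may differ from the notation of the original equations.

```latex
\begin{align*}
I_s &= \frac{k}{2}\,(V_{gs} - V_t)^2,
      && V_{ds} \ge V_{gs} - V_t \quad \text{(saturation)} \\
I_l &= k\left[(V_{gs} - V_t)\,V_{ds} - \frac{V_{ds}^2}{2}\right],
      && V_{ds} < V_{gs} - V_t \quad \text{(linear)}
\end{align*}
```

Both expressions are homogeneous of degree two in the voltages, which is why scaling every voltage by the same factor scales the current quadratically.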
If we assume that the voltages $V_{gs}$, $V_{ds}$, $V_t$ vary proportionally to the supply voltage $V$, then both $I_l$ and $I_s$ depend quadratically on $V$ (see footnote 1), and therefore the delay is inversely proportional to $V$. Combining the expressions for delay and for energy, we see that the expression $E_z t_z^2$ is independent of $V$.
Footnote 1: Of course, $V_{gs}$ and $V_{ds}$ must vary by quite a different mechanism from the one governing $V_t$: $V_{gs}$ and $V_{ds}$ can vary "automatically" as a result of changing Vdd, whereas $V_t$ must be set at the time of fabrication. The main reason that the proportional variation breaks down is that it is in practice impossible to scale $V_t$ with Vdd because that would lead to unacceptably large leakage currents at the low end of the scale.
Under certain restrictions, it is possible to extend the result that the current is quadratic in $V$ to cover the arbitrary composition of pull-up and pull-down networks. Papadantonakis has proved this result for a class of circuits called "smooth circuits" [8]. A smooth circuit is a network of transistors in which each node has a capacitance to ground, the power supplies are modeled as large capacitors, and again the threshold voltage is assumed to scale with the supply voltage.
If we assume that a CMOS circuit is a reasonable approximation of a smooth circuit, we can assume that the quadratic relation between currents and supply voltage $V$ holds, and therefore that the delays are inversely proportional to $V$. For those circuits, $Et^2$ is independent of $V$, where $E$ is the dynamic energy dissipated by a computation, and $t$ represents either the latency or the cycle time of the computation. We shall return to the limitations of this assumption.
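The voltage independence can be checked numerically with a minimal sketch. It assumes the idealized model of this section, in which the energy is $\frac{1}{2}CV^2$ and the charging current is proportional to $V^2$. The capacitance `CAP` and current coefficient `K` are hypothetical values chosen only for illustration.

```python
def dynamic_energy(cap, vdd):
    """Energy dissipated when a node of capacitance cap is charged to vdd."""
    return 0.5 * cap * vdd ** 2


def transition_delay(cap, vdd, k):
    """Delay Q / i, with Q = cap * vdd and an assumed current i = k * vdd**2."""
    charge = cap * vdd
    current = k * vdd ** 2
    return charge / current


def et2(cap, vdd, k):
    """The product E * t**2 for one transition."""
    return dynamic_energy(cap, vdd) * transition_delay(cap, vdd, k) ** 2


CAP = 10e-15  # hypothetical node capacitance, farads
K = 1e-4      # hypothetical current coefficient, A/V^2

for vdd in (1.0, 2.15, 3.3):
    print(f"Vdd={vdd:.2f} V  E={dynamic_energy(CAP, vdd):.3e} J  "
          f"t={transition_delay(CAP, vdd, K):.3e} s  "
          f"Et^2={et2(CAP, vdd, K):.3e}")
```

In this model the energy grows as $V^2$ and the delay shrinks as $1/V$, so the printed $Et^2$ column is the same at every voltage.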
3. Comparing Algorithms for Energy and Delay
These results are borne out in practice [10]. In Table 1.1, we see the results of simulating two different implementations of an eight-bit comparator with the simulator HSPICE. In each case, eight single-bit comparators perform the comparison: in the "linear" comparator, the results of the single-bit comparators are merged in a linear chain; in the "log" comparator, in a binary tree. Comparing the performance of the comparators at 3.3-V Vdd, we see that the linear comparator is slower than the log comparator, but using the $Et$ metric, we find that it more than makes up for its sluggishness with its lower energy consumption. On the other hand, using the $Et^2$ metric, we find that the log comparator is better. Which is it?
If we adjust the supply voltage on the log comparator down to 2.15 V, we see that we can match the delay of the linear comparator while using less energy; thus, the log comparator outperforms the linear one in both speed and energy if we are allowed to adjust the supply voltage. Even over this relatively wide range of supply voltages, $Et^2$ changes only by an insignificant 3.2 percent. This example illustrates that the $Et^2$ metric is more trustworthy for circuit comparisons when we are allowed to adjust the supply voltage.
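The reasoning of the comparator example can be sketched in a few lines. The energy and delay figures below are hypothetical, in arbitrary units, and are not the values of Table 1.1; the helper `scale_to_delay` assumes that $Et^2$ stays exactly constant under voltage adjustment.

```python
def et(e, t):
    """The energy-delay product E * t."""
    return e * t


def et2(e, t):
    """The metric E * t**2."""
    return e * t ** 2


def scale_to_delay(e, t, t_target):
    """Energy after voltage adjustment brings the delay to t_target,
    assuming E * t**2 is invariant under that adjustment."""
    return e * (t / t_target) ** 2


# Hypothetical designs (energy, delay): a slow low-energy one and a fast one.
e_lin, t_lin = 1.0, 2.0
e_log, t_log = 1.6, 1.3

print("Et   linear:", et(e_lin, t_lin), " log:", et(e_log, t_log))
print("Et^2 linear:", et2(e_lin, t_lin), " log:", et2(e_log, t_log))

# Slow the fast design down to the delay of the slow one.
e_log_scaled = scale_to_delay(e_log, t_log, t_lin)
print("log design at the linear design's delay uses energy", e_log_scaled)
```

With these numbers the $Et$ product favors the slow design, yet once the fast design is slowed to the same delay it uses less energy, which is the ordering that $Et^2$ predicts.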
The $\Theta$ Metric
Let us now ignore the lower and upper bounds imposed on the voltage by the technology and assume that we can always trade $E$ and $t$ against each other through voltage adjustment. Suppose that under these conditions there exists a function $\Theta(E, t)$ with the properties:
Property $\Theta$1. $\Theta$ is monotonically increasing in $E$ and $t$.
Property $\Theta$2. $\Theta$ is independent of $V$.
Consider two computations $A$ and $B$ with $\Theta_A < \Theta_B$. Then, for any chosen delay $t$, $A$ is more energy-efficient than $B$. Likewise, for any chosen energy $E$, $A$ is more time-efficient than $B$.
3.3 The $Et^2$ Metric
Any expression in $E$ and $t$ that is monotonically increasing in $E$ and $t$ and that is independent of $V$ can be used as complexity metric $\Theta$. We have shown in Section 2 that, in CMOS technology, the following definition for $\Theta$ is valid:
$\Theta = Et^2$.
Henceforth, $\Theta$ will always mean $Et^2$. If we now return to the example at the beginning of Section 3.1, we compare the two computations $A$ and $B$ by comparing their $\Theta$s: [...]
Hence, we can conclude that [...] is twice as $\Theta$-efficient as [...]. For equal delays, [...]
How constant is $Et^2$ in reality? There are several operating modes for the CMOS transistor, each with a very different relation between current and voltage. In particular, at high electric field, the carrier velocity saturates and becomes constant; the delay becomes independent of the voltage, and $Et^2$ becomes quadratic in the voltage.
The $\Theta$-Efficiency of Designs
In this section, we use the $\Theta$ metric to determine when two standard design transformations (parallel composition and pipelining) improve the efficiency of a design compared with sequential execution.
The $\Theta$-Efficiency of Parallelism
Given a collection of independent tasks, when does the parallel execution of the tasks improve (i.e., reduce) the $\Theta$ of the computation? For simplicity, consider $m$ identical tasks each consuming energy $E/m$ and using delay $t/m$ to complete.
If the $m$ tasks are executed sequentially, the total energy is $E$ and the total delay (execution time) is $t$, giving a $\Theta_0$ of $Et^2$. If the tasks are executed in parallel, we first ignore the cost of the split circuitry that may be needed to distribute the control (and possibly data) to all tasks, and the cost of the merge circuitry that may be needed to gather the completion signal (and possibly some results) from all tasks. In that simple case, the total energy is still $E$, but the total delay is now reduced to $t/m$, giving a $\Theta_{par}$ of $(Et^2)/m^2$. Hence, the improvement is $\Theta_{par}/\Theta_0 = 1/m^2$: parallelism reduces the $\Theta$ of the computation by a factor $m^2$. Assuming we can vary the voltage of the new design so as to make the delay equal to the delay of the sequential design, then under the invariance of $Et^2$, the parallel transformation decreases the energy consumption by a factor $m^2$.
(For large $m$, it may in practice be impossible to scale down the voltage by a factor $m$, and therefore it may be impossible to exploit all the potential energy improvement of parallelism.) If the tasks have different delays and a task terminates before the slowest task, whose delay is $t_{max}$, the early termination of that task amounts to a waste of energy, since every task but the slowest can be slowed down to $t_{max}$ without affecting the delay of the whole computation, but with an energy improvement corresponding to the voltage decrease.
According to the above analysis, parallelism always improves $Et^2$ if we ignore any overhead it introduces. Let us now examine the case when the cost of the split and merge circuitries cannot be ignored. We are looking at the simple case where we split the original task into two parallel tasks with the help of just one binary split and one binary merge. We assume that both the split and merge have an energy and delay that are a fraction $k$ of the energy and delay of the original task:
$E_{split} = E_{merge} = kE$, $t_{split} = t_{merge} = kt$,
so that $\Theta_{par}/\Theta_0 = (1 + 2k)(1/2 + 2k)^2$.
The ratio $\Theta_{par}/\Theta_0$ is less than 1 only for $0 \le k \lesssim 0.18$. In other words, binary parallel composition using a split and a merge with the above characteristics decreases $\Theta$ only when the task to be parallelized is at least about 5.6 times as expensive as a split or merge. As a concrete example, the authors have investigated the possibility of improving the performance of a 32-bit four-stage carry-lookahead adder by interleaving two identical adders. For this type of circuit, $k$ is empirically found to be approximately 0.25 (the split and merge networks are about as expensive as one adder stage). Therefore splitting the adder in this particular way does not help.
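The break-even overhead can be located numerically with a small sketch. It assumes a model consistent with the description above but not spelled out in the surviving text: the two half-tasks run in parallel, the split and merge execute in series with them, and each costs a fraction `k` of the original energy and delay. The function names are illustrative.

```python
def theta_ratio(k):
    """Theta_par / Theta_0 for a binary split, under the assumed model."""
    energy = 1.0 + 2.0 * k   # both half-tasks (total E) plus split and merge
    delay = 0.5 + 2.0 * k    # one half-task delay plus split and merge
    return energy * delay ** 2


def break_even(lo=0.0, hi=1.0, tol=1e-12):
    """Bisection for the k at which the ratio reaches 1.
    theta_ratio is increasing in k, so there is a single crossing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if theta_ratio(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


k_star = break_even()
print("ratio with no overhead:", theta_ratio(0.0))
print("break-even k:", k_star)
print("task-to-overhead cost ratio at break-even:", 1.0 / k_star)
print("ratio at k = 0.25:", theta_ratio(0.25))
```

With no overhead the ratio reduces to $1/m^2$ for $m = 2$; at $k = 0.25$ it exceeds 1, which agrees with the conclusion that interleaving the adders does not help.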
4.2 The $\Theta$-Efficiency of Pipelining
[...] whence we derive the following theorem on the optimal length of a pipeline.
Theorem 3 The $\Theta$-optimal pipeline requires an energy per computation step that is 3 times the energy required for computing [...]. It has a cycle time that is $3/2$ the overhead's cycle time.
Let us compute the optimal pipeline improvement as a function of the overhead ratio $k$ ($m = 2/k$). We get the following result: [...]
The result shows that the pipeline is very sensitive to the communication overhead. For an overhead ratio of one (which obtains when the pipelining communication is as costly as the operation itself), the pipeline offers practically no gain in $Et^2$.
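The statements of Theorem 3 can be checked against a simple pipeline model. The model is an assumption made here for illustration and is not taken from the surviving text: an operation with energy 1 and delay 1 is divided into `m` stages, each stage adds an overhead energy `k` and an overhead delay `k`, so the energy per step is `1 + m * k` and the cycle time is `1 / m + k`. The stage count is treated as a real number.

```python
def theta(m, k):
    """E * t**2 of an m-stage pipeline under the assumed overhead model."""
    energy = 1.0 + m * k    # operation energy plus m stage overheads
    cycle = 1.0 / m + k     # per-stage share of the delay plus overhead
    return energy * cycle ** 2


def optimal_stages(k, lo=1e-6, hi=1e6, iters=300):
    """Ternary search for the real-valued m minimizing theta.
    theta(m, k) has a single minimum in m for m > 0."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if theta(a, k) < theta(b, k):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)


k = 0.1
m_opt = optimal_stages(k)
print("optimal number of stages:", m_opt)
print("energy per step relative to the operation:", 1.0 + m_opt * k)
print("cycle time relative to the overhead delay:", (1.0 / m_opt + k) / k)
```

Under this model the optimum falls at about $2/k$ stages, where the energy per step is 3 times that of the operation and the cycle time is $3/2$ of the overhead delay, matching the two claims of the theorem.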
* *
In the second part of this chapter, we examine the relationship between $E$ and $t$ and the two physical parameters that a designer (usually) can adjust: the supply voltage and the transistor widths. First, we address the issue of optimizing $Et^2$ as a function of the transistor widths. Secondly, we introduce the notion of minimum-energy functions $E(t)$ to express the dependence of $E$ and $t$ on each of the two physical parameters. We use those functions for deriving a number of important results about the sequential and parallel compositions of systems.
5. Transistor Sizing for Optimal $\Theta$
The task of adjusting the transistor widths of a circuit is called "transistor sizing," or "circuit sizing." We are interested in sizing transistors so as to minimize $\Theta$. Both capacitance $C_z$ and current $i_z$ in Equations 1.2 and 1.3 for $E$ and $t$ depend on the size of the transistors. The capacitance contributed by transistors increases linearly with the transistor widths, but the current also increases linearly. Hence, it is not immediately clear how transistors should be sized to optimize the $\Theta$ of a circuit.
We shall find that, in a $\Theta$-optimal circuit, the transistors are sized such that the total transistor capacitance is approximately twice the total parasitic capacitance. As we shall see, the result is exact only for a restricted class of circuits; nevertheless, it is a good approximation for most circuits. Let $E$ be the total energy of the computation: it is the sum of all energy spent exercising the nodes of the computation. Assume, without loss of generality, that there are exactly two transitions corresponding to each node. (This amounts to "unrolling" into several nodes the nodes of the circuit that see more than two transitions and ignoring the nodes that never transition.) Let $t$ be the cycle time of a critical cycle. We assume that the circuit is designed so that all cycles are critical; this is true in many well-designed circuits, and it is true for any optimally sized circuit in the absence of additional constraints (e.g., minimum-size or slew-rate constraints) on transistor sizes.
We distinguish between two types of capacitances attached to a node $i$: first, the "gate" (or transistor) capacitance $g_i$ contributed by the transistors of the operators to which node $i$ is connected, and secondly, the "parasitic" capacitance $p_i$ contributed by the wires connecting node $i$ to other operators. We assume that $p_i$ is fixed and that we can change $g_i$ as we please by adjusting the transistors' widths.
We first consider scaling all the transistors in the circuit by the same factor: we want to determine the global scaling factor $w$ to be applied to all transistors' widths that achieves the lowest $Et^2$. We have [...]
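The effect of a global scaling factor can be explored with a small numerical sketch. It assumes a simplified model that is not necessarily the one used in the original derivation: at fixed voltage the energy is proportional to the total capacitance `w * gate + parasitic`, and the delay is that capacitance divided by a current proportional to `w`. The names `gate` and `parasitic` denote the total gate and parasitic capacitances at `w = 1`.

```python
def et_n(w, gate, parasitic, n):
    """E * t**n, up to constant factors, after scaling all widths by w."""
    cap = w * gate + parasitic   # total switched capacitance
    energy = cap                 # proportional to capacitance at fixed voltage
    delay = cap / w              # charge over a current proportional to w
    return energy * delay ** n


def best_scale(gate, parasitic, n, lo=1e-6, hi=1e6, iters=300):
    """Ternary search for the w minimizing et_n (single minimum for w > 0)."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if et_n(a, gate, parasitic, n) < et_n(b, gate, parasitic, n):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)


gate, parasitic = 1.0, 3.0   # hypothetical capacitances, normalized units
w_opt = best_scale(gate, parasitic, n=2)
print("optimal scaling factor:", w_opt)
print("gate-to-parasitic ratio at the optimum:", w_opt * gate / parasitic)
```

In this model the optimum is reached when the scaled gate capacitance equals $n$ times the parasitic capacitance, which for $n = 2$ gives the factor of two stated at the start of this section.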
5.1 Using $Et^n$ With $n \neq 2$
Sometimes, we may want to optimize $Et^n$ for $n \neq 2$ when using a $\Theta$-optimal circuit would not be possible because the required delay or energy would result in a supply voltage outside the practically possible range. Roughly speaking, when we perform $Et^n$ optimization we mean that we consider that a 1% improvement in speed is worth an $n$% increase in energy. For example, for a circuit operating in velocity saturation, we might have to expend twice the energy for a 10% speed improvement. In that case, we should optimize for $Et^n$ with $n \approx 7$.
There is another reason to examine $Et^n$, besides extreme supply voltages (and mathematical insight). Even though a large system may be optimized for $Et^2$, components of that system may not individually be optimized for $Et^2$. For example, speeding up critical paths while lowering the $Et^2$ of these paths may make the entire design run faster and actually improve $Et^2$ for the entire system.
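The exponent implied by a given energy-for-speed exchange can be estimated with a short sketch. It assumes that a speed improvement of `speedup` means the delay is divided by `1 + speedup`, and it solves for the `n` that leaves $Et^n$ unchanged; the function name is illustrative.

```python
import math


def exponent_for_tradeoff(energy_factor, speedup):
    """Return n such that E * t**n is unchanged when the energy is multiplied
    by energy_factor and the delay is divided by (1 + speedup)."""
    return math.log(energy_factor) / math.log(1.0 + speedup)


# Twice the energy for a 10% speed improvement, as in the example above.
n_saturated = exponent_for_tradeoff(2.0, 0.10)
print("n for 2x energy per 10% speed:", n_saturated)

# A 1% speed improvement bought with a 2% energy increase.
n_small = exponent_for_tradeoff(1.02, 0.01)
print("n for 2% energy per 1% speed:", n_small)
```

For small changes the result approaches the rule of thumb that a 1% speed improvement is worth an $n$% energy increase; for large changes the compounding makes the exponent smaller than the naive ratio of percentages.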
Optimal Energy and Cycle Time
If we eliminate $n$ from Equations 1.48 and 1.49, we arrive at the following relation between the minimum energy and the minimum delay of a single-cycle circuit at a fixed voltage:
A Minimum-Energy Function
We can define an antimonotonic minimum-energy-consumption function or minimum-energy function $E(t)$ that describes the effect of transistor sizing on the minimum energy required for a system to run at a given $t$ at a fixed voltage.
(Tierno has previously used a similar energy function [12].) If we rewrite Equation 1.52 with $E$ a function of $t$, we get the following function:
It is easy to prove that Equation 1.53 satisfies the above definition of the minimum-energy function.
Experimental Evidence
Even though Equations 1.48 and 1.49 have been derived for only a very restricted class of circuits, they are in fact good approximations for a much wider class. The authors have checked the equations against the minimal $Et^n$ obtained by applying an optimization algorithm (gradient descent) to a class of circuits. The circuits, each consisting of a ring of operators, were chosen at random with a uniform-squared distribution of parasitic capacitances; the number of transistors in series was also chosen according to such a distribution. The authors used real numbers for both parameters; they optimized the expression for $Et^n$ using Equations 1.36 and 1.38 for $E$ and $t$. The range of parasitics was [1, 100] in normalized units; the range of transistors in series was [1, 6]. The results show that Equations 1.48 and 1.49 hold, with very good accuracy, over a wide range of parasitics, logic-gate types, and circuit sizes.
Multi-cycle Systems
Let us now consider a system composed of $m$ subsystems $S_i$ [...] executing in parallel; each subsystem $S_i$ has minimum-energy function [...]
These subsystems can be chains or rings of arbitrary logic gates, since our experiment shows that Equations 1.48 and 1.49 adequately describe the minimum energy and delay of a large class of circuits. Let us assume that the subsystems are synchronized so that all $t$s are equal. As a consequence, the minimum-energy function for the composed system is [...]
We have shown that the $G = nP$ result (total gate capacitance equal to $n$ times the total parasitic capacitance) is correct for a ring of operators.
We previously observed that if a dominant term exists then $G = nP$ is approximately correct for general circuits. We have experimental evidence that the relation is true for a large class of multi-cycle systems. Such evidence is also provided by SPICE simulations of an adder published by Chandrakasan and Brodersen and summarized in Figure 4.7 of their book [2]. Their figure shows that, for the five different parasitic contributions they study, the minimum energy for a given speed (allowing supply-voltage adjustment) is achieved when the gate capacitance is very close to twice the parasitics. (They did not, however, draw the conclusion that we have reached here.)
Power Rule for Sequential Composition
Let us now consider the sequential composition of two systems $A$ and $B$. Let us assume a sequential computation that runs $A$ to completion and then $B$ to completion; we assume the delay between the end of $A$ and the start of $B$ to be negligible. We want to know at what $t_A$, $t_B$ to run $A$ and $B$ so as to optimize the $Et^n$ of the sequential composition.
We now recall the concept of a minimum-energy function introduced by Equation 1.53. Equation 1.53 applies to the specific transformation of changing transistor sizes; i.e., it describes what the minimum energy of a circuit will be when that circuit is sized to achieve a certain performance. We are no longer limiting our discussion to the effects of transistor sizing, so we can allow other transformations to be used as a basis for this $E(t)$.
Summary and Conclusion
In this chapter, we have seen that $Et^2$ constitutes an excellent metric for comparing computations for energy and delay efficiency when the physical behavior is that of CMOS VLSI circuits. We started by observing that the $Et^2$ metric for a CMOS circuit is independent of the supply voltage, as long as we can scale the threshold voltage linearly with the supply voltage and as long as we stay away from velocity saturation. We showed that when supply-voltage adjustment is allowed, the popular $Et$ metric is inferior. Following along these lines, we established that any metric with certain properties (we called such a metric $\Theta$) could be used to compare designs independently of the voltage. As long as the required speed or energy lies within the threshold-voltage to velocity-saturation range of the implementations, we saw that the implementation with the better $\Theta$ is better for any desired speed or energy consumption. We showed that $Et^2$ is to first order a metric with the required properties. We then applied the metric to various circuit transformations, namely pipelining and parallelism. We also applied the metric to transistor sizing and were able to show that the optimal sizing for energy efficiency is not what is commonly used (minimal sizes).
Finally, we established rules for computing the $\Theta$ of the sequential and parallel compositions of systems.
Overall, $Et^2$ is a very useful efficiency metric for designing CMOS VLSI circuits. Time and experience will show how applicable it is to other computations.
paper was sponsored by the Defense Advanced Research Projects Agency and monitored by the Air Force.
