This paper proposes an optimum methodology for assigning supply and threshold voltages to modules in a CMOS circuit such that the overall energy consumption is minimized for a given delay constraint. The modules of the circuit should have large enough gate depths such that the delay and energy penalties of the level shifters connecting them are negligible. Both static and dynamic energy are considered in the optimization. Energy savings of up to 48% have been achieved on various example circuits. The first step in the optimization finds optimum supply and threshold voltages for each module in the circuit. If the circuit has a large number of modules, this step might yield a correspondingly large number of different supply and threshold voltages for minimum energy consumption. Since having a large number of different supply and threshold voltages on an IC is not feasible in current technologies, an additional step clusters the multiple voltages obtained from the first step into a fixed number of supply and threshold voltages (for example, 2 different supply voltages and 2 different threshold voltages). In addition to the application of this method to circuit optimization, it can also be applied to a wide range of problems with delay constraints, such as software tasks running on a dynamically variable V_DD and V_th processor.
Introduction
Energy consumption is recognized as one of the most important parameters in designing modern portable electronic systems. Dynamic energy has been the main component of total energy since it is proportional to the square of V_DD. However, with the shrinking of device sizes and reduction of supply voltages, static energy has become as important as dynamic energy. To obtain high gate overdrive (V_DD - V_th) for high speeds of operation, V_th is also decreased as V_DD is decreased. The decrease in threshold voltage increases the leakage current exponentially, which makes static energy consumption more significant in every new technology generation. Therefore, it has become essential to consider both supply and threshold voltage in any circuit optimization for low-energy consumption.
There has been significant research in the usage of dynamic supply voltage scaling [3, 4]; static assignment of different supply voltages to a system [1, 5, 6]; static multithreshold voltage systems [7, 8, 16]; and dynamic threshold voltage scaling [9] for energy minimization. The optimization procedure that we propose in this paper can be used in all of the above scenarios. It can be used to determine the optimal supply and threshold voltages for tasks executing on a variable voltage processor, or it can be used to statically set supply and threshold voltages for a system, given the probable switching activity and timing requirements. The benefit of our method is that it considers both supply and threshold voltages simultaneously, an idea which is gaining importance as a means of saving energy without sacrificing performance [15]. Another important contribution of this work is that it gives a metric which can be used by circuit designers to test how close the energy consumption of their design is to the minimum possible. We use this metric to determine the stopping conditions of our optimization algorithm. If an unlimited number of supply and threshold voltages are available, the proposed algorithm is optimum in the sense that no other voltage assignment for the given modules will give lower energy consumption for the given delay constraint.
The complete procedure has two steps. The first step finds optimum supply and threshold voltage values for CMOS modules in a digital circuit that minimize the total energy consumption. Considering a circuit as composed of modules allows energy optimization of much larger circuits than is possible with gate-level optimization algorithms. This is due to the significant reduction of problem complexity. We find the exact conditions for minimum energy using the Lagrange Multiplier Method. Then we iteratively find the supply and threshold voltage values for each module that satisfy the minimum energy condition. This step of the algorithm converges to an exact solution in a small number of iterations for a large and varied set of problems. If it is technologically feasible to assign the optimum (and perhaps all different) supply and threshold voltages to all the modules, then we stop here. Otherwise we continue to the next step.
The second step clusters the multiple voltages obtained from the first step into a fixed number of supply and threshold voltages (for example, 2 different supply voltages and 2 different threshold voltages). This step results in a feasible implementation of the system in current technologies. The organization of the paper is as follows. In Section 2, we provide the theoretical background of the paper. In Section 3, we explain the Lagrange Multiplier based optimization procedure. In Sections 4 and 5, we explain the algorithms for the first and second steps of the procedure. Section 6 shows experimental results. Finally, Section 7 concludes.
Theoretical background
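In order to formulate the optimization, we need equations for delay, dynamic energy, and static energy in terms of V_DD and V_th. For a CMOS circuit, the delay of the ith module can be approximated as
$$d_i = k_{0i}\,\frac{V_{DDi}}{(V_{DDi} - V_{thi})^{\alpha}} \qquad (1)$$
where k_0i is a circuit-dependent parameter, V_DDi and V_thi are the supply and threshold voltage, respectively, applied to that module, and α is the velocity saturation coefficient.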
Here C_i is the term for all the capacitances that are switched during operation of the ith module, including possible multiple switching of some nodes. To simplify the derivation, we rewrite dynamic energy as:
$$E_{di} = k_{1i} V_{DDi}^{2}$$
where k_1i stands for the circuit, process, and application dependent terms including switching activity. An average value for total switching activity in the module can be found by running several different tasks on the module and averaging the switching activity results. Short-circuit power dissipation can also be included in k_1i because of its quadratic dependence on V_DD [11].
Given these approximate models for delay and energy in terms of supply and threshold voltages, we state the energy optimization problem for a digital system (assumed given to us) consisting of N modules and P paths from primary inputs to primary outputs as follows:
$$\min \sum_{i=1}^{N} E_i \quad \text{subject to} \quad \sum_{i \in P_j} d_i \le T_d \quad \text{for all paths } P_j$$
where E_i = E_di + E_si, T_d is the time constraint, and the variables are V_DDi and V_thi for each module. Note that we obtain the time constraint, T_d, for the optimized circuit from the initial delays of the modules of the unoptimized circuit.
In the following sections, we will consistently use i to index the modules and j to index the paths. An example circuit for N=4 and P=2 is given in Figure 1. In the next section, we derive the conditions on V_DDi and V_thi for minimum energy consumption.
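The delay and dynamic energy models of this section can be exercised with a short Python sketch. It assumes the alpha-power-law form of the delay equation; the value `alpha=1.3`, the unit-valued `k0` and `k1`, and the function names are illustrative assumptions, not values from the paper. The static energy model is not included.

```python
def module_delay(vdd, vth, k0, alpha=1.3):
    """Alpha-power-law delay: d = k0 * vdd / (vdd - vth)**alpha.

    alpha=1.3 is an assumed velocity saturation coefficient.
    """
    if vdd <= vth:
        raise ValueError("supply voltage must exceed threshold voltage")
    return k0 * vdd / (vdd - vth) ** alpha


def dynamic_energy(vdd, k1):
    """Dynamic energy E_d = k1 * vdd**2 (k1 lumps capacitance and activity)."""
    return k1 * vdd ** 2


# Illustrative numbers only: scale the supply from 2.5 V to 1.8 V.
energy_ratio = dynamic_energy(1.8, k1=1.0) / dynamic_energy(2.5, k1=1.0)
slowdown = module_delay(1.8, 0.5, k0=1.0) / module_delay(2.5, 0.5, k0=1.0)
print(f"dynamic energy ratio: {energy_ratio:.4f}")  # (1.8 / 2.5)**2
print(f"delay ratio: {slowdown:.3f}")
```

With these assumed numbers, dynamic energy falls to (1.8/2.5)^2 = 0.5184 of its nominal value while the module gets slower, which is the trade-off the optimization balances against the delay constraint.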
Lagrange Multiplier based optimization
Consider a system of N modules and P paths from primary inputs to primary outputs. We form a binary matrix, A, of P rows and N columns as follows:
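$$a_{ji} = \begin{cases} 1 & \text{if module } i \text{ lies on path } P_j \\ 0 & \text{otherwise} \end{cases}$$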
For example, the A matrix corresponding to the circuit in Figure 1 (N=4, P=2) is:
If a path P_a is a subset of a path P_b (i.e. all modules on P_a also lie on P_b), then the row of A corresponding to P_a (Row_a(A)) is removed from A to avoid unnecessary computation. For the rest of the paper, we assume that the resulting A matrix has more columns than rows (i.e. N > P).
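The construction of A and the removal of subset paths can be sketched in Python as follows. The 4-module topology, the module delays, and the function names are hypothetical and chosen only for illustration; they are not the circuit of Figure 1.

```python
def build_path_matrix(paths, n_modules):
    """Return the binary matrix A as a list of rows.

    A[j][i] is 1 if module i lies on path j. Duplicate paths and paths
    whose modules all lie on another path are dropped.
    """
    unique = []
    for path in paths:
        modules = frozenset(path)
        if modules not in unique:
            unique.append(modules)
    # "<" on frozensets tests for a proper subset.
    kept = [s for s in unique if not any(s < other for other in unique)]
    return [[1 if i in s else 0 for i in range(n_modules)] for s in kept]


def path_delays(A, d):
    """Delay of each path: sum of the delays of the modules on it."""
    return [sum(a * di for a, di in zip(row, d)) for row in A]


# Hypothetical circuit: modules 0..3, three input-to-output paths.
paths = [[0, 1, 3], [0, 2, 3], [0, 3]]   # [0, 3] is a subset of the others
A = build_path_matrix(paths, n_modules=4)
delays = path_delays(A, d=[1.0, 2.0, 3.0, 1.5])
print(A)       # [[1, 1, 0, 1], [1, 0, 1, 1]]
print(delays)  # [4.5, 5.5]
```

Dropping the subset path is safe because module delays are positive, so its delay can never exceed that of the path containing it.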
The total energy consumed by the system is given by:
$$E = E_1 + E_2 + \dots + E_N$$
The initial delay of each path in the system is given by:
$$T_j = \sum_{i=1}^{N} a_{ji}\, d_i$$
We can represent the above equation in vector form as follows:
$$T = A \cdot d$$
Following is a Lagrange Multiplier formulation with multiple constraints, where the function to minimize is total energy, the constraint for each path j is that its delay, T_j, should be less than T_d, and λ_j is the Lagrange Multiplier for the jth path.
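The formulation and its stationarity conditions can be sketched as below. This is a reconstruction under stated assumptions, not the paper's own equations: the sign convention of the Lagrangian is assumed, and CTEG_i and CSEG_i are read as the two energy-to-delay gradient ratios of module i, one taken with respect to each voltage. That reading is chosen because it yields a condition of the form CTEG_i = CSEG_i = Row_i(A^T)·λ.

```latex
\begin{align*}
L &= \sum_{i=1}^{N} E_i + \sum_{j=1}^{P} \lambda_j \,(T_j - T_d),
  \qquad T_j = \sum_{i=1}^{N} a_{ji}\, d_i \\
\frac{\partial L}{\partial V_{DDi}} &=
  \frac{\partial E_i}{\partial V_{DDi}}
  + \Big(\sum_{j=1}^{P} a_{ji}\lambda_j\Big)\frac{\partial d_i}{\partial V_{DDi}} = 0 \\
\frac{\partial L}{\partial V_{thi}} &=
  \frac{\partial E_i}{\partial V_{thi}}
  + \Big(\sum_{j=1}^{P} a_{ji}\lambda_j\Big)\frac{\partial d_i}{\partial V_{thi}} = 0 \\
\Rightarrow\quad
-\frac{\partial E_i/\partial V_{DDi}}{\partial d_i/\partial V_{DDi}}
  &= -\frac{\partial E_i/\partial V_{thi}}{\partial d_i/\partial V_{thi}}
  = \sum_{j=1}^{P} a_{ji}\lambda_j
  = \mathrm{Row}_i(A^{T})\cdot\lambda
\end{align*}
```

Under this reading, each module contributes two equations in its own two voltages, coupled to the other modules only through the multipliers of the paths it lies on.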
At the minimum, every module i must satisfy
$$CTEG_i = CSEG_i = \mathrm{Row}_i(A^{T}) \cdot \lambda$$
For each module, these are 2 equations in 2 variables (V_DDi and V_thi). However, solving them is not trivial since the Lagrange Multiplier Vector, λ, is unknown. In the next section, we propose an iterative gradient search algorithm that yields a solution to this problem in a small number of iterations. After every iteration, the condition in Equation 23 will be used to check if minimum energy is achieved.
Gradient search algorithm for the optimization problem
We use an iterative algorithm to fulfill the conditions of Equation 23. The inputs to the algorithm are the initial parameters of all the N modules, such as the V_DDi's, the V_thi's, the module delays (d_i's) and the circuit- and process-dependent parameters k_0i's, k_1i's, k_2i's, k_3's, k_4's, k_5i's, k_6's and k_7's.
To solve CTEG_i = CSEG_i for the ith module, we fix the delay, d_i, for that module. Then we can write V_thi in terms of V_DDi using Equation 1. This makes CTEG_i and CSEG_i functions of V_DDi only, and the equation CTEG_i = CSEG_i can be solved easily (we use MATLAB's FZERO function) to get the V_DDi and V_thi values. Then with these V_DDi and V_thi values, we can find the energy consumption of that module (this will be the optimum energy consumption for that module, for the given delay, d_i). Hence, we can consider energy consumption of a module as a function of delay for that module. In vector form, we can write:
- ( E ( d , , , ) ) is minimum in the direction of gradient vector.
We now derive the stopping condition for the gradient search. Designers can use Metric-cost-fn to determine how close their design is to the optimum.
In our algorithm, we terminate the iterations when Metric-cost-fn goes below lo'. The overview of the optimization algorithm is given in the flowchart in Figure 2 .
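The per-module solve described in this section (fix d_i, eliminate V_thi through the delay equation, then find the root of CTEG_i - CSEG_i in V_DDi) can be sketched in Python as follows. Several parts are assumptions rather than the paper's method: the static energy term `k2 * vdd * exp(-vth / S)` is a stand-in leakage model, `ALPHA`, `S` and the `k` values are illustrative, the two ratios are taken to be the energy-to-delay gradient ratios with respect to each voltage, and plain bisection replaces MATLAB's FZERO.

```python
import math

ALPHA = 1.3  # assumed velocity saturation coefficient
S = 0.04     # assumed leakage slope parameter in volts (stand-in model)


def vth_from_delay(vdd, d, k0):
    """Invert d = k0 * vdd / (vdd - vth)**ALPHA for vth at fixed delay d."""
    return vdd - (k0 * vdd / d) ** (1.0 / ALPHA)


def energy(vdd, vth, k1, k2):
    """Dynamic energy plus the stand-in static energy of one module."""
    return k1 * vdd ** 2 + k2 * vdd * math.exp(-vth / S)


def ratio_gap(vdd, d, k0, k1, k2):
    """Difference of the two gradient ratios on the fixed-delay curve."""
    vth = vth_from_delay(vdd, d, k0)
    overdrive = vdd - vth
    leak = k2 * math.exp(-vth / S)
    dE_dvdd = 2.0 * k1 * vdd + leak
    dE_dvth = -leak * vdd / S
    dd_dvdd = k0 * ((1.0 - ALPHA) * vdd - vth) / overdrive ** (ALPHA + 1.0)
    dd_dvth = k0 * ALPHA * vdd / overdrive ** (ALPHA + 1.0)
    return -dE_dvdd / dd_dvdd + dE_dvth / dd_dvth


def solve_module(d, k0, k1, k2, vdd_max=3.0, tol=1e-10):
    """Bisection for the (vdd, vth) pair where both ratios agree at delay d."""
    lo = (k0 / d) ** (1.0 / (ALPHA - 1.0))  # supply at which vth = 0
    hi = vdd_max
    if lo >= hi:
        raise ValueError("delay target not reachable below vdd_max")
    g_lo = ratio_gap(lo, d, k0, k1, k2)
    if g_lo * ratio_gap(hi, d, k0, k1, k2) > 0:
        raise ValueError("no sign change in the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ratio_gap(mid, d, k0, k1, k2) * g_lo > 0:
            lo = mid  # same sign as the lower end: root lies above mid
        else:
            hi = mid
    vdd = 0.5 * (lo + hi)
    return vdd, vth_from_delay(vdd, d, k0)


# Illustrative, unit-free parameters.
vdd_opt, vth_opt = solve_module(d=1.0, k0=1.0, k1=1.0, k2=5.0)
print(vdd_opt, vth_opt, energy(vdd_opt, vth_opt, 1.0, 5.0))
```

The search is restricted to non-negative threshold voltages. Repeating the solve for different values of d traces out the module's energy as a function of its delay, which is the quantity the gradient search operates on.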
Clustering heuristic for limited number of supply and threshold voltages
The algorithm described in Section 4 yields optimum values of supply and threshold voltages for each module that minimize the overall circuit energy. But these voltages might all have different values, in which case a practical implementation of the optimized circuit is difficult in current technologies. In this section, we propose a heuristic algorithm that clusters the optimum supply and threshold voltage values obtained into a limited number of supply and threshold voltages. The final solution meets the delay constraint at the expense of slightly higher total energy consumption than the optimum case.
Assume only n supply voltage planes and m threshold voltages are available (n<N, m<N). Two vectors will finally hold the n supply voltage values and the m threshold voltage values that will be used in the circuit. For any module i, the function "Map" finds the pair of available supply and threshold voltages nearest to the module's optimum pair [V_DD_opt(i), V_th_opt(i)] and assigns that pair to the module. See Figure 3.
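The "Map" step can be illustrated with a minimal Python sketch. The Euclidean distance, the function name, and all voltage values are assumptions made for illustration; the text does not specify the distance measure, and this sketch covers only the nearest-pair assignment, not the full clustering heuristic.

```python
import math


def map_to_nearest(opt_pairs, vdd_levels, vth_levels):
    """Assign each module the allowed (vdd, vth) pair closest to its optimum.

    Closeness is Euclidean distance in the (vdd, vth) plane, an assumption.
    """
    candidates = [(vdd, vth) for vdd in vdd_levels for vth in vth_levels]
    assigned = []
    for vdd_opt, vth_opt in opt_pairs:
        nearest = min(
            candidates,
            key=lambda c: math.hypot(c[0] - vdd_opt, c[1] - vth_opt),
        )
        assigned.append(nearest)
    return assigned


# Hypothetical optimum pairs for three modules, n = 2 supplies, m = 2 thresholds.
optimum = [(1.9, 0.12), (1.2, 0.10), (1.4, 0.31)]
clustered = map_to_nearest(optimum, vdd_levels=[1.3, 2.0], vth_levels=[0.1, 0.3])
print(clustered)  # [(2.0, 0.1), (1.3, 0.1), (1.3, 0.3)]
```

A nearest-pair assignment alone does not guarantee that every path still meets T_d, so the path delays would have to be rechecked after mapping.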
Experimental results
We synthesized the hierarchical Verilog descriptions of the combinational ISCAS'85 circuits and a 16-bit Wallace Tree Multiplier using Synopsys Design Compiler (with the TSMC 0.25 µm library) to get the delay, dynamic energy and static energy consumption values for the modules at the top level of the design hierarchy. The modules at the top level of hierarchy in the Verilog description were directly mapped to the modules used in the optimization. A power-aware partitioning of the circuit into modules could further improve the results, but that by itself is a very difficult problem to solve and is not handled in this work.
The values of the process-dependent parameters (k_3, k_4, k_6, k_7) were obtained from SPICE simulations as explained in Section 2. SPICE simulation of simple gates showed that k_5 is 6 orders of magnitude smaller than k_2 for this technology. Since k_2 and k_5 scale almost linearly with number of gates [10, 12], k_5 can be taken to be 10^-6 times k_2 for any module. The circuit-dependent parameters (k_0, k_1, k_2) were then calculated for each module by using the delay, dynamic energy and static energy values obtained from Synopsys and the process-dependent parameters.
We use the following notation for describing the results. The symbol "I" denotes the initial circuit, which has the standard 0.25 µm TSMC voltages (V_DD = 2.5 V, V_th = 0.5 V). We obtain the delay of the initial circuit using Synopsys Design Compiler and use this value as the time constraint for the optimization, i.e. the optimized circuits (II, III, IV) will have the same delay as I. "II" denotes the baseline circuit (for energy comparisons) that has the single V_DD and V_th values that give the minimum energy consumption for the given deadline. "III" denotes the circuit having optimum (and possibly all different) V_DDs and V_ths for the modules. "IV" denotes the circuit in which the V_DDs and V_ths in III have been clustered into two V_DDs and one V_th. We only use one V_th in the final circuit because we found that having more V_ths only saved an additional 2.3% of energy in the benchmark circuits designed using 0.25 µm technology. The need for multiple V_ths will become more pronounced as technology shrinks.
For the experiments, we used various switching activities for the input ports to observe their effects on the energy savings and the optimum voltages obtained. We noticed that for switching activities above 0.05, the optimum V_ths were of the order of 10 mV. This is due to the fact that the static energy in 0.25 µm technology is very small compared to the dynamic energy for high switching activities. So for these cases, the optimization algorithm scales down V_DD aggressively and, to achieve the delay constraint, it reduces V_th to very small values without incurring a significant increase in static energy. Since such small V_th values are not currently feasible, for these cases we fixed V_th at 0.1 V and found the optimum V_DDs. This phenomenon is not expected to occur for deep sub-micron technologies, where static energy is significant.
Table 1 provides detailed results for the Wallace Tree Multiplier circuit. The first column in the table shows the top level of the Verilog design hierarchy. The modules are a partial product generator (level0), Carry Save Adders (CSAs), and a Carry Propagate Adder (CPA). Also shown is the A matrix corresponding to the circuit. The second column gives the V_DD, V_th and energy consumption values for the baseline circuit (II) for two different input switching activities (SA=0.01 and SA=0.0001). Note that the delay for the baseline circuit is the same as the delay of the initial circuit (I), which had V_DD = 2.5 V, V_th = 0.5 V. The third and fourth columns give the voltages for each module as well as the energy consumptions for circuits III and IV, respectively.
Figure 4 shows the energy savings obtained for the various benchmark circuits as a percentage of the baseline energy consumption for an input switching activity of 0.01. The dynamic and static components of energy are also shown. It is observed that in II and III, static energy is ~10% of the total energy. This validates the fact that at the optimum, static energy is a fixed fraction of the total energy [14], although this fraction depends on the technology used.
Figures 5 and 6 show the savings for different input switching activities for circuits III and IV, respectively. The results show that the energy savings tend to increase as the input switching activity increases. Thus, accurate estimation of the input switching activity is crucial for obtaining good energy savings. Table 2 summarizes the results of the experiments. V_DD1 and V_DD2 are the two voltages applied to the circuit after clustering. We obtained up to 48.4% savings for circuit III and up to 36% savings for circuit IV for switching activities below 0.05 (V_ths variable). For switching activities above 0.05 (V_ths fixed at 0.1 V), we obtained up to 58.4% savings for circuit III and up to 31.6% savings for circuit IV. The average saving, for switching activities above 0.05, was 29% for circuit III and 18% for circuit IV. For switching activities below 0.05, the average saving was 28% for circuit III and 15% for circuit IV.
Table 2. Optimization Results
The optimization algorithms in Sections 4 and 5 were implemented in MATLAB and run on a PC with 1 GB RAM and a PIII 800 MHz processor. To compare execution times for the different circuits, we terminate the optimal algorithm in Section 4 after 10 iterations of the loop in Figure 2. 10 iterations were enough to get near optimal results for most of the cases. Since we observed that the clustering algorithm (Section 5) takes only about 5-10% of the total execution time, we let it run to completion. Figure 7 shows the execution times for the different circuits.
Conclusion and future work
In this paper, we presented an algorithm to find optimum values of supply and threshold voltages for circuit modules such that the energy consumption is minimized. The conditions for optimum energy were found mathematically and then a gradient search algorithm was presented which iteratively converges to the optimum values. An additional step clusters these optimum values into a limited number of supply and threshold voltages. The method can be applied to circuit modules of any kind, given the delay and energy parameters for the modules.
As a next step, we plan to apply our algorithm to deep submicron technologies, which we believe will give more energy savings than the results for 0.25 µm. We are also investigating techniques for power-aware partitioning of circuits into modules.
