Abstract-A direct approach to transistor sizing for minimizing the power consumption of a CMOS circuit under a delay constraint is presented. In contrast to the existing assumption that the power consumption of a static CMOS circuit is proportional to the active area of the circuit, it is shown that the power consumption is a convex function of the active area. An analytical formulation for the power dissipation of a circuit in terms of the transistor size is derived, which includes both the capacitive and the short circuit power dissipation. SPICE circuit simulation results are presented to confirm the correctness of the analytical model. Based on the intuitions drawn from the analytical model, heuristics for initial transistor sizing on critical and noncritical paths for minimum power consumption are developed. Further, fast heuristics to perform transistor sizing in CMOS circuits for minimizing power consumption while meeting the given delay constraints are presented.
Special Issue Short Papers
I. INTRODUCTION
Transistor sizing is the operation of enlarging (or reducing) the width of the channel of a transistor. It is an effective technique to improve the delay of a CMOS circuit. When the width of the channel is increased, the current drive capability of the transistor increases, which reduces the signal rise/fall times at the gate output. The active area, i.e., the area occupied by active devices (e.g., transistors), increases with increased transistor sizes, and the layout area may increase. Earlier approaches for transistor sizing concentrated on minimizing the area of the circuit subject to a certain delay constraint. With improved process techniques and reduced feature sizes, the constraint on area is becoming less strict; on the other hand, the growing market of low-power electronics is pushing the constraint on power consumption to the top of the stack of priorities. Most of the existing methods for transistor sizing are based on the assumption that the power consumption of a circuit is proportional to the active area [1]-[8].
Recent studies have revealed that the power consumption of a static CMOS circuit is not always minimized by minimizing the active area, but can be improved by enlarging some of the transistors driving large active loads [9], [10]. This led to the belief that the "active area" is not a reliable indicator of the power consumption of a circuit. A direct approach for transistor sizing to minimize the power consumption of a circuit subject to a delay constraint is required. For such an approach to be applicable, a reliable model for the power consumption of a circuit with respect to transistor sizing is essential.
Veendrick, in [11], studied the behavior of the short circuit power consumption of a CMOS inverter and showed that the short circuit power consumption is proportional to the input rise/fall time and the transistor gain factor. Since the change in the output rise/fall time of a gate affects the power consumption of the next stage gates of the circuit, it is necessary to include the power consumption of the load devices in the analysis. This paper presents an analytical model for the power consumption of a CMOS circuit based on the analysis of short circuit power consumption given in [11] and the gate delay model of [12]. In contrast to the existing assumption that the power dissipation of a circuit is proportional to the active area of the circuit, it is shown that the power consumption of a circuit is a convex function of the active area. Circuit simulation results are presented to confirm the validity of the formulation. Analytical methods for transistor sizing to minimize the power consumption of a CMOS circuit subject to a given delay constraint are presented with experimental results to show their applicability to actual circuits.
The convex programming techniques [8] can be applied using the analysis presented here to produce near optimal results. However, such an approach is slow, and may not be feasible for large circuits.
A good approximate solution can be obtained quickly by using an algorithm similar to TILOS [2]. The heuristics for transistor sizing presented in this paper can be applied to a transistor level layout optimizer as well as a standard-cell based layout generator.
The rest of the paper is organized as follows. Section II presents the formulation and analysis of the power dissipation of a circuit, deriving the power optimal and power-delay optimal transistor sizes. The heuristics for optimizing power consumption subject to a delay constraint are developed in Section III. Results from MCNC benchmark circuits are presented in Section IV to demonstrate the usefulness of the heuristics. Section V concludes with comments on further research.
II. POWER DISSIPATION OF A CMOS CIRCUIT
CMOS inverters are used in the analysis for simplicity. The analysis is extended to consider complex gates later in this section. The power dissipation of a CMOS gate, neglecting the static power, is given by $P_{cap} + P_{sc}$, where $P_{cap}$ is the capacitive power consumed during transfer of charge between the logic levels and $P_{sc}$ is the short circuit dissipation during the period in which both the pull-up and the pull-down blocks are "ON." The capacitive power consumption is given by
$$P_{cap} = C_L V^2 f$$
where $C_L$ is the total load capacitance of the gate, $V$ is the supply voltage, and $f$ is the transition frequency of the output node.
The short circuit power dissipation $P_{sc}$ is given by the following expression [11]:
$$P_{sc} = \frac{\beta}{12}\,(V - 2V_T)^3\,\tau f \qquad (1)$$
Here, $\tau$ is the input transition time, $\beta$ is the gain factor of the transistor, $V$ is the supply voltage, $V_T$ is the threshold voltage, and $f$ is the transition frequency.¹ Equation (1) was derived for the maximum short circuit dissipation under the no-load condition at the output of the inverter. The same formulation holds in the general loading case, and it was found empirically that the short circuit dissipation for an inverter with equal transition times at the input and output is about half of that obtained using (1) [11].
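As a rough numerical illustration of the two power components, the sketch below evaluates the capacitive term in the standard form $C_L V^2 f$ and the no-load short circuit term in the form published by Veendrick, $(\beta/12)(V - 2V_T)^3 \tau f$. The function names and all parameter values are assumptions chosen for illustration; they are not taken from this paper.

```python
def capacitive_power(c_load, vdd, freq):
    """Capacitive switching power, assuming the standard form C_L * V^2 * f."""
    return c_load * vdd ** 2 * freq


def short_circuit_power(beta, vdd, vt, tau, freq):
    """No-load short circuit power, assuming (beta/12) * (V - 2*V_T)^3 * tau * f."""
    if vdd <= 2 * vt:
        # In this symmetric model the two blocks are never ON together.
        return 0.0
    return beta / 12.0 * (vdd - 2 * vt) ** 3 * tau * freq


# Hypothetical operating point (illustrative values only).
p_cap = capacitive_power(c_load=50e-15, vdd=5.0, freq=10e6)
p_sc = short_circuit_power(beta=100e-6, vdd=5.0, vt=1.0, tau=2e-9, freq=10e6)
print(f"P_cap = {p_cap:.3e} W, P_sc = {p_sc:.3e} W")
```

In this form the short circuit term grows linearly with both the gain factor and the input transition time, which is the dependence the rest of the analysis builds on.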
The gain factor of the gate, $\beta$, is determined by the width of the transistor $W$ and the mobility of the carrier responsible for the transition ($\mu_p$ for a low-to-high transition and $\mu_n$ for high-to-low).
¹ In the case of complex gates, the appropriate term is the trigger density, i.e., the transition activity involving the given input as the trigger input.
Consider the case where a given CMOS gate drives a load of several other CMOS gates (Fig. 1). The total power consumed by the circuit is given by (3).
The first term in (3) is the total capacitive power consumption of the driver and the load gates. The load capacitance $C_L$ consists of three components: the gate capacitance of the load transistors ($C_g$), the interconnect capacitance ($C_i$), and the source/drain capacitance of the transistors of the driver ($C_d$). Only $C_d$ changes due to sizing of the transistor of the driver. Since $C_d$ is small compared to $C_g$ and $C_i$, the contribution of the change in $C_d$ to the capacitive power consumption of the driver can be neglected. Therefore, the first term in (3) is considered invariant and denoted as $P_1$ in the rest of the analysis. The second term is the capacitive power consumption due to the gate capacitance of the driver ($c_1$ is a constant). The third term is the total short circuit power consumption of all the gates. Here, $\mu$ is the mobility of the carrier responsible for the transition at $g_1$ and $\mu'$ is the mobility of the carrier responsible for the opposite transition in the load gates ($g_2, g_3, \ldots, g_n$).
The delay and transition time of the output signal of the gate $g_1$ (i.e., $\tau_1$) are affected by the sizing of the transistors in $g_1$. The output signal delay of an inverter is given by (4) [12], in which the first term is the delay response to a step function based on the gate geometry and the second term is the delay due to the input transition time. Here, $d$ is a process dependent constant. Substituting (4) in (3) gives an expression in which $k_1$ and $k_2$ are constants. The right-hand side (RHS) of (6) represents a convex function of $W_1$.
Circuit simulations [13] were performed using 1.2 μm HP process parameters to compute the total power consumption of a circuit consisting of an inverter driving five uniformly sized inverters with 4λ wide transistors as load. The input to the driving inverter was derived from another fixed-size "driver," and the output of each of the load inverters was connected to a capacitive load of 5 fF.
The average power consumption of the circuit for the rising and falling transitions was plotted when the p-transistor of the gate was sized from minimum size (3λ) and up, keeping the n-transistor at the minimum size. Three random points were selected from the experimental results to compute the coefficients for (6), and the curve generated by (6) was plotted against the results obtained from the SPICE simulations (Fig. 2). The analytical model matches the experimental results closely.
The curve represented by (6) is convex, and it attains a minimum for a certain value of $W_1$. Differentiating (6) with respect to $W_1$ and setting the derivative to zero gives the value of $W_1$ at which $P$ attains a minimum (8). This value is the power optimal size for the transistor in $g_1$. The power optimal p-transistor size is given by substituting $\mu_p$, the hole mobility, for $\mu$ and $\mu_n$ for $\mu'$ in (8), and the power optimal n-transistor size can be obtained by substituting $\mu_n$ and $\mu_p$, respectively. The power optimal sizes for the p-type transistor and the n-type transistor were obtained independently using SPICE simulations for fan-out loads varying from one inverter to ten inverters. The relationship of the power optimal transistor sizes to the load is shown in Fig. 3. As expected from (8), the power optimal p-type and n-type transistor sizes vary linearly with the load. Moreover, the power optimal p-transistor size is more than two times larger than the corresponding n-transistor size, due to the combined effect of $\mu$ and $\mu'$. The power optimal p-transistor sizes are plotted for the same set of loads with different input transition times in Fig. 4. The power optimal transistor sizes for slower inputs are smaller than those for the faster inputs, showing a similar trend as predicted by (8).
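The fitting step and the resulting minimum can be sketched as follows. The sketch assumes that the convex model has the three-coefficient form $P(W) = aW + b/W + c$, which is consistent with three sample points being enough to determine the coefficients but is not confirmed by the text as the exact form of (6). The function names and the sample values are hypothetical.

```python
from fractions import Fraction


def fit_power_model(samples):
    """Fit P(W) = a*W + b/W + c exactly through three (W, P) samples.

    The three-coefficient form is an assumption made for this sketch.
    """
    (w1, p1), (w2, p2), (w3, p3) = [(Fraction(w), Fraction(p)) for w, p in samples]
    # Subtracting the first equation from the other two eliminates c,
    # leaving a 2x2 linear system in a and b, solved by Cramer's rule.
    a11, a12, r1 = w2 - w1, 1 / w2 - 1 / w1, p2 - p1
    a21, a22, r2 = w3 - w1, 1 / w3 - 1 / w1, p3 - p1
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("the three sample widths must be distinct")
    a = (r1 * a22 - a12 * r2) / det
    b = (a11 * r2 - r1 * a21) / det
    c = p1 - a * w1 - b / w1
    return float(a), float(b), float(c)


def power_optimal_width(a, b):
    """Setting dP/dW = a - b/W^2 to zero gives W* = sqrt(b/a) for a, b > 0."""
    if a <= 0 or b <= 0:
        raise ValueError("model has no interior minimum")
    return (b / a) ** 0.5


# Hypothetical (width, power) samples, not measured data.
a, b, c = fit_power_model([(1, 62), (5, 30), (10, 35)])
w_opt = power_optimal_width(a, b)
print(a, b, c, w_opt)
```

Under this assumed form the second derivative $2b/W^3$ is positive for $W > 0$, so the stationary point is the unique minimum.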
Finally, the p-transistor and the n-transistor were sized together to obtain the overall power optimal configuration for the circuit. Table I shows the results. The fan-out load is varied from 2 to 20 inverters. The second column shows the power optimal p and n transistor sizes for the circuit. The power saving obtained by power optimal transistor sizing, compared to the power dissipation with minimum sized transistors in the driver, is presented in the third column. The power saving can be as much as 58% with power optimal sizing when the fan-out load is 20. However, as seen in real circuit examples, such high fan-out gates usually constitute only a small fraction of the total gates in a circuit.
A. Extension to General CMOS Gates
Only CMOS inverters have been considered for the analysis so far. The problem of sizing series connected transistors is considered here. It is assumed that the transistors in a series connected chain are sized uniformly. Tapering of the transistor widths in long series chains such that the transistor closest to the power/ground node has the largest width may result in speedup and saving in power dissipation [14]. Tapering is useful only when the total internal source/drain capacitance of a gate is larger than or comparable to the total load capacitance, which is not true in the case of a typical high fan-out gate.
Several researchers proposed extensions of the inverter based delay model for series connected transistors [15], [16]. The problem is commonly framed as that of finding the equivalent width of a single inverter that would result in the same delay as the chain of transistors. Sakurai and Newton in [16] showed that, due to the carrier velocity saturation effect in short channel devices, the delay of a chain of $N$ transistors of width $W$ is less than $N$ times the delay of a single inverter of width $W$, the relation being governed by a constant $C$ that is less than unity for short channel devices and equal to one for long channel transistors.
Using this equivalent inverter in the derivation of (8) and taking into account the fact that the total gate capacitance of the driver is proportional to $N \cdot W_1$, the power optimal size of a transistor in a gate with $N$ series connected transistors is given by (9). Using 1.2 μm technology, the results for two- and three-input NOR gates are compared with those of an inverter in Fig. 5. For small $N$, the second term in the denominator of (9) is dominant. However, when $N$ becomes larger, the effect of the first term in the denominator reduces the difference between the optimal transistor size for a series chain and that of an inverter.
The power consumption of a NAND or NOR gate also depends on the positioning of the inputs and their arrival sequence. The switching capacitances involved during the transition using different paths within a gate may vary significantly due to source/drain parasitics. Hossain et al., in [17], reported 13-17% improvement in power dissipation by switching activity based input reordering. Circuit simulation results indicate that the power savings by power optimal transistor sizing can be augmented by the power optimization due to input reordering.
B. Power-Delay Product of a Static CMOS Circuit
Consider the circuit of Fig. 1 again. The product of the power dissipation (5) and the gate delay (4), when simplified with respect to $W_1$, gives an equation of the form (10).
Here, $A$, $B$, $C$, and $D$ are constants depending on process, supply voltage, and threshold voltage. The coefficients in (10) were computed using four sample points from SPICE simulation results for an inverter with a load of five uniformly sized inverters. The predicted power-delay curve using (10) is plotted against the actual power-delay curve obtained from circuit simulation in Fig. 6. The accuracy of the analytical model with respect to SPICE results is apparent from the two curves. The RHS of (10) is also a convex function of $W_1$ and attains a global minimum for a certain $W_1$. The value of $W_1$ for which this function attains a minimum is the power-delay optimal size of the transistor for a given load. The value of the power-delay optimal size of a transistor is much larger than the value of the power optimal size for the same load. The variation in the power-delay optimal sizes for the p- and n-transistors with various load sizes, as obtained from SPICE simulations, is shown in Fig. 7. Fig. 8 shows the variation of the power-delay optimal size with different input transition times.
Fig. 8. Power-delay optimal size versus input transition time.
C. Intuitions from the Analytical Model
The relationship among the curves for delay (4), power (6), and power-delay product (10) is represented graphically in Fig. 9. When the size of the transistor is below the power optimal size, the power and delay both decrease with an increase in the transistor size. As a result, the power-delay product decreases very fast (region A of Fig. 9). Beyond the power optimal size, the power consumption starts to increase with the increase in transistor size while the delay of the gate is still decreasing, resulting in a slower decrease in the power-delay product (region B) until it reaches the power-delay optimal size. Beyond this size, the power-delay product starts increasing (region C) because the rate of increase in the power dissipation becomes higher than the rate of decrease in delay.
Let us analyze the implication of the three regions in Fig. 9 with regards to a transistor sizing heuristic. When the transistors are in region A, it is a win-win situation in terms of power and delay.
Therefore, all transistors should be sized at least to the corresponding power optimal size. In region B, the reduction rate in delay is higher than the rate of increase in power. Therefore, a transistor in region B can be sized profitably. The steeper the curve in region B, the more profit can be obtained by sizing the transistor up. However, if the transistor is in region C, then the yield in speed due to increasing its size is lower than the increase in power consumption. When the demand on circuit speed is aggressive, the speed improvement by sizing the transistors within region B may not suffice. Given two transistors in region C, the transistor with a smaller power-delay slope should be given preference for sizing up over a transistor with a steeper power-delay slope. On the other hand, if a transistor in region B needs to be sized down (i.e., reduced in width), the one with the slowest rate of decrease in the power-delay product should be chosen.
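The three regions can be illustrated with a small numerical sketch. It assumes a convex power model $P(W) = aW + b/W + c$ and a delay model $D(W) = d/W + e$ with positive coefficients; both forms, the function names, and the coefficient values are assumptions made for illustration and are not the paper's equations (4), (6), or (10). Under these assumptions the product $P \cdot D$ is convex in $W$, so a golden-section search locates its minimum.

```python
import math


def power(w, a, b, c):
    """Assumed convex power model P(W) = a*W + b/W + c."""
    return a * w + b / w + c


def delay(w, d, e):
    """Assumed delay model D(W) = d/W + e (decreasing in W)."""
    return d / w + e


def power_delay_optimal_width(a, b, c, d, e, lo, hi, tol=1e-9):
    """Golden-section search for the minimum of P(W)*D(W) on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2

    def pd(w):
        return power(w, a, b, c) * delay(w, d, e)

    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if pd(x1) < pd(x2):
            hi, x2 = x2, x1          # minimum lies in [lo, old x2]
            x1 = hi - inv_phi * (hi - lo)
        else:
            lo, x1 = x1, x2          # minimum lies in [old x1, hi]
            x2 = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2


def region(w, w_power_opt, w_pd_opt):
    """Classify a width into region A, B, or C of the power-delay curve."""
    if w < w_power_opt:
        return "A"  # sizing up lowers both power and delay
    if w < w_pd_opt:
        return "B"  # power rises, but the power-delay product still falls
    return "C"      # the power-delay product rises


# Hypothetical coefficients, for illustration only.
A_, B_, C_, D_, E_ = 2.0, 50.0, 10.0, 20.0, 1.0
w_p = math.sqrt(B_ / A_)  # power optimal width under the assumed model
w_pd = power_delay_optimal_width(A_, B_, C_, D_, E_, lo=1.0, hi=100.0)
print(w_p, round(w_pd, 3), [region(w, w_p, w_pd) for w in (3, 8, 20)])
```

With these illustrative coefficients the power-delay optimal width comes out well above the power optimal width, matching the ordering of the regions described above.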
III. MINIMIZING POWER CONSUMPTION SUBJECT TO DELAY CONSTRAINT
The above intuitions can be directly applied to an iterative transistor sizing algorithm such as TILOS [2]. In a TILOS-like algorithm, the circuit performance is optimized by sizing the transistor which gives the best performance gain on the current critical path in each iteration. While choosing the transistor to size on a critical path, the power-delay model developed in this paper can be used as the cost criterion.
A. Power-Optimal Initial Sizing
The existing transistor sizing approaches start with a layout with minimum active area, i.e., one in which all the transistors are in their minimum size. In contrast, this paper proposes that all the transistors in a circuit should be initially sized to their power optimal sizes given by (8). This gives the configuration with minimum power dissipation of the layout with respect to transistor sizing, and any further sizing of transistors (enlarging or reducing) would only increase the power consumption of the circuit.
While a transistor in a gate is being sized to the power optimal size, the increase in its size is reflected in the load seen by the gate in the previous stage. The power optimal sizes of the gates in the previous stage are determined by the sizing at the current stage. Therefore, the transistors in a gate should be sized only after all the fan-out gates are sized. A simple algorithm for power optimal sizing is given in Algorithm I (see Fig. 10).
Algorithm powerOptimalSizing( ) does not handle feedback loops in the circuit. This problem can be solved by breaking a loop arbitrarily. This would result in an incorrect power optimal size for the gate where the loop is broken. Considering typical fan-outs of CMOS circuits, the load change due to power optimal sizing of a fan-out transistor results in only a small change, if any, in the power optimal size of the driver. This effect rarely propagates beyond two gates. Therefore, a reasonable solution can be obtained by running the algorithm more than once, each time on the layout obtained from the previous iteration.
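The ordering constraint described above (size a gate only after all of its fan-out gates) can be sketched as a reverse topological traversal. This is a minimal illustration, not the paper's Algorithm I: the netlist representation, the function name `power_optimal_sizing`, and the caller-supplied `optimal_width(load)` are all hypothetical, and `optimal_width` stands in for the closed-form size of (8), which is not reproduced here.

```python
from graphlib import TopologicalSorter


def power_optimal_sizing(fanout, optimal_width, min_width=1.0):
    """Size every gate after all of its fan-out gates have been sized.

    fanout maps each gate name to the list of gates it drives.
    optimal_width(load) is a caller-supplied stand-in for the power
    optimal size as a function of the driven load (an assumption).
    """
    # TopologicalSorter expects node -> predecessors. Passing the fan-out
    # lists as "predecessors" makes each gate appear after the gates it
    # drives. A feedback loop raises graphlib.CycleError.
    order = TopologicalSorter(fanout).static_order()
    width = {}
    for gate in order:
        load = sum(width[g] for g in fanout.get(gate, []))
        width[gate] = max(min_width, optimal_width(load))
    return width


# Hypothetical four-gate netlist and a hypothetical linear size rule.
netlist = {"g1": ["g2", "g3"], "g2": ["g4"], "g3": [], "g4": []}
widths = power_optimal_sizing(netlist, optimal_width=lambda load: 1.5 * load)
print(widths)
```

As in the text, the sketch does not handle feedback loops; a loop would have to be broken before the traversal, and the pass could then be repeated on the result.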
B. Power Optimization Under Delay Constraint
Since the power-delay product is a convex function of the transistor size, a greedy iterative approach such as TILOS can be used to obtain a good solution quickly. It is assumed that the power-delay characteristic of a transistor in a circuit can be determined locally, i.e., from the gate geometry, the transition times of the fan-in gates, and the fan-out of the gate. The optimization method starts with an initial power optimal layout configuration and progresses along a power-delay optimal path to meet the required delay based on the power-delay characterization in (10). The basic algorithm is given in Algorithm II (see Fig. 11).
The initial transistor sizing is performed using procedure powerOptimalSizing( ) (see Algorithm I). If the minimum power layout satisfies the required delay constraint, a layout with the minimum power consumption is obtained and the algorithm terminates. After detecting the critical paths using timingAnalysis, the algorithm iteratively selects the transistor with the best power-delay characteristic to be sized. After sizing a transistor on the current critical path, incremental timing analysis is performed on the circuit to compute the new critical path. The iterative process is repeated until all the critical path delays satisfy the delay constraints or no further improvement in circuit speed is possible.
TABLE II. MINIMUM AREA VERSUS MINIMUM POWER SOLUTIONS WITH NO DELAY CONSTRAINT (circuits: c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, c7552)
Algorithm powerOptimalSizing( ) requires time linear in the number of transistors in the circuit. Every iteration of the while loop sizes up one transistor by a certain amount. Therefore, in the worst case, the number of iterations of the while loop is limited by the total active area of the circuit. Each iteration of the while loop also involves computing the sensitivity of each transistor on the current critical path. The worst case bound for this step is given by the maximum depth of the circuit. The timing analysis requires time proportional to the number of transistors in the circuit. Thus, the time complexity of the algorithm is $O(T^2 \cdot W_{max})$, where $T$ is the total number of transistors in the circuit and $W_{max}$ is the maximum size of a transistor in the optimized layout. The complexity computed above considers a complete timing analysis in each iteration of the while loop. Since incremental timing analysis typically requires only a fraction of the time required for a complete timing analysis, the actual time complexity of the algorithm is nearly linear in the active area.
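The greedy loop described above can be sketched as follows. This is a simplified illustration rather than the paper's Algorithm II: it assumes that every gate lies on the critical path (a single inverter chain), so no timing analysis is needed, and it uses hypothetical delay and power models (`chain_delay`, `chain_power`) supplied by the caller. The selection criterion is delay saved per unit of added power.

```python
def greedy_sizing(widths, delay_fn, power_fn, target_delay,
                  step=1.0, max_width=64.0):
    """TILOS-like loop: repeatedly enlarge the width that buys the most
    delay reduction per unit of added power until the target is met.

    Assumes every entry of widths is on the critical path; a real flow
    would recompute the critical path after each sizing step.
    """
    widths = list(widths)
    while delay_fn(widths) > target_delay:
        base_delay, base_power = delay_fn(widths), power_fn(widths)
        best_index, best_score = None, 0.0
        for i in range(len(widths)):
            if widths[i] + step > max_width:
                continue
            trial = widths[:i] + [widths[i] + step] + widths[i + 1:]
            gain = base_delay - delay_fn(trial)   # delay saved
            cost = power_fn(trial) - base_power   # power added
            if gain <= 0:
                continue
            # A move that also saves power is always preferred.
            score = float("inf") if cost <= 0 else gain / cost
            if score > best_score:
                best_index, best_score = i, score
        if best_index is None:
            break  # no further improvement in speed is possible
        widths[best_index] += step
    return widths


C_OUT = 16.0  # hypothetical fixed load on the last stage
K_SC = 0.5    # hypothetical short circuit weighting


def chain_delay(w):
    """Hypothetical chain delay: each stage contributes load / width."""
    loads = w[1:] + [C_OUT]
    return sum(load / wi for load, wi in zip(loads, w))


def chain_power(w):
    """Hypothetical power: gate capacitance plus a short circuit term that
    grows with the width of a gate and the slowness of its driver."""
    return sum(w) + K_SC * sum(nxt * nxt / wi for wi, nxt in zip(w, w[1:]))


sized = greedy_sizing([1.0, 1.0, 1.0], chain_delay, chain_power, target_delay=8.0)
print(sized, chain_delay(sized), chain_power(sized))
```

Each pass of the while loop enlarges exactly one width by one step, so the number of iterations is bounded by the total width budget, in line with the worst case argument above.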
IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS
The heuristics developed in this paper have been incorporated into PerflexII, the second generation prototype of the performance driven module generator program, Perflex [5]. The earlier version, Perflex, is a CMOS module generator which minimizes the active area of the circuit subject to a given delay constraint. Given the functional description of the circuit in the glue description language [18] and a delay constraint, it generates a virtual grid layout with optimized area satisfying the given delay constraint. Perflex uses the Elmore delay model to analyze the circuit and applies transistor sizing and input reordering to achieve circuit speed.
PerflexII, the power optimizing module generator, uses the same layout style as Perflex. Based on actual data extracted from the virtual grid layout, the parasitics and interconnect capacitances are computed using the appropriate technology parameters. The capacitive power consumption is computed using the total capacitance at the output node of the gate. The short circuit power dissipation of a gate is computed from the input transition times, which are obtained using the Elmore delay model and the maximum resistance of the discharge path in the gate. The Elmore delay model does not capture the dependence of the output delay of a gate on the input transition time. This results in an underestimation of the delay and, consequently, of the short circuit power. In an optimization environment, the delay and power estimates are used to compare different solutions. Therefore, the reliability of the power estimates generated using the analytical model is more important than the absolute accuracy.
Several circuits were used to verify the reliability of the analytical model with respect to the SPICE circuit simulator. The circuits include a chain of inverters (chain), inverter trees (tree1, tree2) with different structures, an 8-b ripple carry adder (rca8), 4- For the arithmetic circuits, the accurate switching activities predicted by [19] were used to account for the variations in switching activities in the nodes, and 200 random vectors were used for the circuit simulations. The results, scaled and translated for ease of viewing, are presented in Fig. 12. The correlation between the two estimates is found to be satisfactory, indicating good fidelity of the analytical model.
The transistor sizing heuristic presented in this paper was compared directly with the traditional area optimization based heuristic using a TILOS-like algorithm. Several large combinational circuits from the MCNC benchmark set were used for this comparison. The size of the circuits ranged from 728 transistors (c432) up to 8854 transistors (c7552). First, the layouts were generated with no delay constraints. Table II shows the results of minimum area and minimum power solutions for the benchmark circuits.
The second, third, and fourth columns present the active area, critical path delay, and the power consumption of the minimum area solution (when all the transistors in the circuit are at their minimum width). Columns five through seven present the active area, critical path delay, and power dissipation for the minimum power solution (when all the transistors in the circuit are sized to their power optimal point). Column eight presents the increase in the active area due to power optimal sizing and column nine shows the power saving obtained by power optimal sizing compared to the minimum area layout. For all the circuits there is power saving obtained by power optimal sizing at the cost of an increase in the active area. It is not surprising that the delay of some circuits also improved due to power optimal sizing because some of the gates driving large fan-outs lie on the critical path.
In the second experiment, the circuits were subject to delay constraints. The results are shown in Table III. The delay constraints applied to the circuits are shown in column two. Columns three and four present the active area and the power dissipation for the area driven optimization by Perflex. The last two columns present the active area and power dissipation for the power optimizing program, PerflexII. PerflexII consistently produced layouts with lower power dissipation even with tight delay constraints. The CPU time required by the two programs was found to be similar. The area of the layout generated by PerflexII is usually larger than the area of the layout generated by Perflex. It has also been observed that even for very tight delay constraints, the power optimizing approach produces solutions with lower power dissipation than the area optimizing solution.
V. CONCLUSION
An analytical formulation for the total power consumption of a CMOS circuit with respect to transistor sizing is presented which includes the capacitive and the short circuit dissipation. It is shown that the power consumption of a CMOS circuit is a convex function of the active area and that, for a transistor sizing algorithm, the objective of minimizing the active area is different from that of minimizing the power dissipation. A direct approach for transistor sizing to minimize the power consumption of a CMOS circuit subject to a given delay constraint is presented. An implementation and experimental results for the MCNC benchmark circuits are presented to establish the usefulness of the approach.
The analysis presently assumes transition activities lumped at each output node of a gate. However, the transition activities in a gate take place through different transition arcs involving different trigger input signals and different capacitances. A sophisticated model using the trigger probabilities for each transition arc would enhance the results further. The analysis presented in this paper does not consider the effect of variation in the supply voltage and the minimum feature size. A study of the transistor sizing issues with scaling of voltage and feature sizes would be interesting future work.
