We present a framework for combining Voltage Scaling (VS) and Gate Sizing (GS) techniques for power optimization. We introduce a fast heuristic for choosing gates for sizing and voltage scaling such that the total power is minimized under delay constraints. We also use a more accurate estimate of the power dissipation of the circuit by taking into account the short-circuit power along with the dynamic power. The short-circuit power model we use accounts for the load capacitance of the gates. Our results show that the combination of VS and GS performs better than either technique applied in isolation. An average power reduction of 73 % is obtained when decisions are taken assuming dynamic power only. In contrast, the average power reduction is 77 % when decisions include the short-circuit power dissipation.
Introduction
Advances in semiconductor technologies have led to chips with millions of transistors. As circuit density and speed increase, power dissipation has become one of the critical parameters in circuit design. The expanding and converging fields of computing and digital communications are creating new demands for high-performance, programmable signal processing engines. Enhancing the performance of today's DSP systems implies higher power consumption. Since the fastest growing area in the computing industry is the provision of high-throughput DSP systems in a portable form, the operating time that the battery provides for these systems becomes a major design issue. Hence, a lot of research has been done on power reduction at various levels of design abstraction (such as the system, architectural, logic and layout levels) [1], especially for portable DSP applications.
The average dynamic power consumed by a CMOS circuit is given by [1]

$$P_{avg} = 0.5\, V_{dd}^2\, f \sum_{v} C(v)\, E(v) \qquad (1)$$

where $f$ is the clock frequency, $V_{dd}$ the supply voltage, $C(v)$ the load capacitance of gate $v$, and $E(v)$ the switching activity at the output of gate $v$. Because the charging and discharging of capacitance is the most significant source of power dissipation in CMOS circuits, previous work optimizes the power by considering three factors in a circuit: supply voltage, load capacitance and switching activity. However, most of these approaches deal with one factor at a time. In this work, we are interested in power optimization by reducing both the supply voltage and the load capacitance.
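As a quick numerical illustration of Eq. (1), the following minimal sketch evaluates the average dynamic power for a hypothetical two-gate circuit at two supply voltages. The function name `average_dynamic_power` and all capacitance, activity and frequency values are assumptions chosen for the example, not data from our experiments.

```python
def average_dynamic_power(vdd, freq, gates):
    """Eq. (1): P_avg = 0.5 * Vdd^2 * f * sum over gates of C(v) * E(v).

    gates is a list of (load capacitance in farads, switching activity) pairs.
    """
    return 0.5 * vdd ** 2 * freq * sum(c * e for c, e in gates)


# Hypothetical circuit: two gates with 10 fF and 20 fF loads.
example_gates = [(10e-15, 0.5), (20e-15, 0.25)]

p_high = average_dynamic_power(5.0, 100e6, example_gates)
p_low = average_dynamic_power(3.3, 100e6, example_gates)

# The ratio depends only on the supply voltages: (3.3 / 5.0)^2.
print(p_low / p_high)
```

With capacitance, activity and frequency unchanged, the ratio of the two results depends only on the square of the voltage ratio, which is the quadratic dependence that voltage scaling exploits.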
Since the dynamic power consumption is quadratically related to the supply voltage, reducing the supply voltage (voltage scaling) promises to be an effective technique for power saving. The basic problem with Voltage Scaling (VS) is the increased circuit delay, since the relation between delay ($t_d$) and supply voltage ($V_{dd}$) is given by [1]

$$t_d = K\, \frac{C\, V_{dd}}{(V_{dd} - V_T)^2} \qquad (2)$$

where $C$ is the load capacitance, $V_T$ the threshold voltage, and $K$ a constant. If $V_{dd}$ is much greater than $V_T$, then the delay is almost inversely proportional to the supply voltage. For supply voltages near the threshold voltage, however, the $V_T$ term causes the delay to increase rapidly. Another major overhead of using different supply voltages in a circuit is the additional level converters required at the interfaces and the added layout design effort. For this reason, it is advisable to restrict oneself to a dual-voltage approach, where two supply voltages are available for power optimization.

Another technique for reducing power at the logic or transistor level is Gate Sizing (GS), which targets power optimization by reducing the load capacitance. Since the intrinsic resistance of a gate is inversely proportional to its size, down sizing a gate increases its delay. Gate sizing is well known to be a useful tool for reducing circuit delays in CMOS integrated circuits. Several methods have been proposed as solutions when the problem is posed as an area-delay tradeoff, such as the work in [9-11].
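The delay penalty of voltage scaling can be made concrete with a small sketch. It assumes the commonly used form t_d = K C V_dd / (V_dd - V_T)^2 for the delay-voltage relation; the threshold voltage of 0.7 V, the unit constants and the function name `delay` are illustrative assumptions.

```python
def delay(vdd, v_t=0.7, k=1.0, c_load=1.0):
    """Assumed delay model: t_d = K * C * Vdd / (Vdd - V_T)^2."""
    if vdd <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return k * c_load * vdd / (vdd - v_t) ** 2


def relative_delay(vdd, v_ref=5.0):
    """Delay at vdd normalized to the delay at the reference supply."""
    return delay(vdd) / delay(v_ref)


for vdd in (5.0, 3.3, 1.5, 1.0):
    print(vdd, round(relative_delay(vdd), 2))
```

Under these assumptions the delay grows moderately between 5 V and 3.3 V but very steeply as the supply approaches the threshold, which is why a low supply voltage close to $V_T$ is rarely a usable choice.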
From a general point of view, reducing either the supply voltage or the physical size of a gate at the logic level leads to an increase in gate delay, which implies decreased slack time. In this sense, VS and GS can be effective for delay-constrained optimization only if the given circuit has significant timing slack available in some or all of its constituent gates. Because of the discrete nature of supply voltages and gate sizes, VS or GS alone tends to leave more slack unutilized [20], preventing effective power reduction. Further, slack used up by one technique could have been used by the other technique to give a higher power reduction. This fact motivates us to opt for a combined VS and GS algorithm. We propose a fast heuristic for GS and VS which identifies the maximum number of gates for gate sizing or voltage scaling under the delay constraints, so that the total power dissipation of the circuit is minimized.
Previous approaches have also attempted to minimize the total power using simultaneous voltage scaling and gate sizing [12]. But these approaches consider the dynamic power dissipation only and neglect the role of the short-circuit power. This is not a valid assumption, as short-circuit power can account for up to 20 % of the total power. Minimizing a power function that considers only the dynamic power, without any constraints on delay, would imply that all transistors must necessarily be minimum sized. However, a minimum-sized circuit does not necessarily correspond to a minimum-power circuit, the effect being more pronounced when large loads are driven.
Further, down sizing a gate might increase the short-circuit power of the fanout gates, and this increase could be high enough to offset the decrease in the dynamic power. Most of the traditional models for short-circuit power neglect the effect of the load capacitance and are therefore inaccurate. In this work, we use a more accurate estimate for short-circuit power and minimize the total dynamic and short-circuit power using a combined VS and GS technique. We also propose a fast algorithm which identifies more nodes for sizing or for voltage scaling.
Our optimization problem may be described as:

$$\text{minimize} \quad Power(W, V) \qquad (3)$$
$$\text{subject to} \quad Delay(W, V) \le T_{spec} \qquad (4)$$
$$V_i = V_{high} \ \text{or} \ V_{low}, \quad \forall\ \text{gate } i \qquad (5)$$
$$Maxsize(i) \ge w_i \ge Minsize(i) \qquad (6)$$

where both Power and Delay are functions of the gate sizes ($W$) and supply voltages ($V$), $T_{spec}$ is the timing constraint, $V_{high}$ and $V_{low}$ are the two supply voltages, $V_i$ and $w_i$ are the supply voltage and size of gate $i$, respectively, and $Minsize(i)$ and $Maxsize(i)$ are given by the gate library. This is a delay-constrained power minimization problem. In [16], a method which makes use of transistor reordering was described to address a similar problem. Since transistor reordering is simply intended to reduce the average number of transitions at the internal nodes of gates, the resulting power reduction is very limited. In this work, we provide new cost models for delay and power with voltage scaling and gate sizing. Algorithms for VS alone, GS alone, and combined VS and GS are proposed to optimize power. Experiments show that combined VS and GS obtains the largest power improvement.
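To make the structure of the problem concrete, the following sketch solves a toy instance by exhaustive enumeration. It is only a brute-force baseline for illustration and not the heuristic proposed in this paper. It assumes a simple chain of gates whose delays add up, ignores the coupling between a gate's size and the load of its fanins, and ignores level converters. The option tables, labels and the function name `best_assignment` are hypothetical.

```python
from itertools import product

# options[g] lists the (label, delay, power) choices for gate g.
# Each choice stands for one (size, supply voltage) pair; numbers are made up.
options = [
    [("big@Vhigh", 1.0, 4.0), ("small@Vhigh", 1.5, 2.5), ("big@Vlow", 1.8, 1.8)],
    [("big@Vhigh", 1.0, 3.0), ("small@Vhigh", 1.4, 2.0), ("big@Vlow", 1.9, 1.3)],
]


def best_assignment(options, t_spec):
    """Return (power, labels) of the lowest-power choice meeting t_spec, or None."""
    best = None
    for combo in product(*options):
        total_delay = sum(d for _, d, _ in combo)
        total_power = sum(p for _, _, p in combo)
        if total_delay <= t_spec and (best is None or total_power < best[0]):
            best = (total_power, [label for label, _, _ in combo])
    return best


print(best_assignment(options, t_spec=3.0))
```

The enumeration grows exponentially with the number of gates, which is why a heuristic is needed for circuits of realistic size.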
For our work, we assume that switching activity is a constant for each node and is independent of gate delays.
Switching activity is the measure of signal transitions per clock cycle. The switching activity at the nodes inside a circuit not only depends strongly on the topological structure and input patterns of the circuit, but may also vary with gate delay, which introduces glitching transitions. Therefore, the zero-delay model provides a lower bound on the activity.
Under a general delay model, updating activities iteratively is computationally prohibitive. Fortunately, VS and GS do not change the circuit topology, and both tend towards path balancing by reducing the slacks. This helps eliminate glitching to some extent. Intuitively, for the purpose of power reduction, the nodes with high switching activity are good candidates to work at the low supply voltage under VS (or with a small load capacitance under GS).
The remainder of the paper is organized as follows. Section 2 discusses delay and power modeling with both VS and GS. Section 3 discusses the VS and GS problem in detail. In Section 4, we discuss an algorithm for combined VS and GS for power optimization. Finally, experimental results are described in Section 5.
Timing and Power Models
Because of the nature of the problem shown in Eqns. (3)-(6), the general idea behind GS (or VS) is to iteratively select a set of gates to down size (or to operate at a reduced supply voltage), so that the total power reduction is maximized and the timing constraints are met. Thus, a reasonably accurate timing and power model is required to estimate the delay and power consumption of a gate at a specific supply voltage and physical size. In this section we discuss the timing model, followed by the dynamic and short-circuit power models that we use.
Timing Model
In most standard-cell libraries, the gate delay is defined as

$$d_i = \tau_i + c_i\, \frac{C_i^{load}}{w_i} \qquad (7)$$

where $\tau_i$ is the intrinsic delay, $w_i$ and $C_i^{load}$ are the size and load capacitance of gate $i$ respectively, and $c_i$ is a constant. The load drive capability of gate $i$ increases with $w_i$. The internal capacitance of gate $i$, however, also varies almost linearly with $w_i$. Together these keep $\tau_i$ almost independent of $w_i$. $C_i^{load}$ is determined by the sizes of the fanout gates and the wiring capacitance, i.e.,

$$C_i^{load} = c \sum_{j \in FO(i)} w_j + C_i^{wire} \qquad (8)$$

where $FO(i)$ is the set of fanouts of gate $i$, $C_i^{wire}$ is the wiring capacitance, and $c$ is a constant. When the wiring capacitance is ignored, (7) can be written as

$$d_i = \tau_i + k_i \sum_{j \in FO(i)} \frac{w_j}{w_i} \qquad (9)$$

where $k_i = c\, c_i$. Basically, (9) indicates that a larger gate is required for delay reduction if it drives more fanouts.
Furthermore, it has been shown in [13] that the gate delay at supply voltage $V_{dd}$ is approximately proportional to $k V_{dd}/(V_{dd} - V_t)^2$, where $V_t$ is the threshold voltage and $k$ is a constant. Assuming that $d_i$ in (9) is the delay at $V_{high}$, the gate delay with size $w_i$ and supply voltage $V_i$ is given by

$$d_i(w_i, V_i) = \left(\tau_i + k_i \sum_{j \in FO(i)} \frac{w_j}{w_i}\right) \frac{V_i}{V_{high}} \left(\frac{V_{high} - V_t}{V_i - V_t}\right)^2 \qquad (10)$$

For the purpose of VS, $V_i$ can be either $V_{high}$ or $V_{low}$. From (10), reducing the supply voltage results in increased delay of the gate, while reducing the gate size does not always degrade the overall delay. The reason is that the loading, and hence the delay, of its fanins decreases with the reduced size of this gate.
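The two effects described above (a down-sized gate becomes slower while its fanin becomes faster) can be checked with a short sketch of the timing model. It assumes the delay form d = (tau + k * sum of fanout sizes / w), scaled by (V / V_high) * ((V_high - V_t) / (V - V_t))^2, with wiring capacitance ignored. The function name `gate_delay` and all numeric values are illustrative assumptions.

```python
def gate_delay(tau, k, w, fanout_sizes, v, v_high=5.0, v_t=0.7):
    """Delay of a gate of size w driving gates of the given sizes at supply v."""
    base = tau + k * sum(fanout_sizes) / w
    scale = (v / v_high) * ((v_high - v_t) / (v - v_t)) ** 2
    return base * scale


# Chain A -> G -> L. Gate G is down-sized from 2 to 1; A has size 1, L has size 2.
g_before = gate_delay(tau=1.0, k=1.0, w=2.0, fanout_sizes=[2.0], v=5.0)
g_after = gate_delay(tau=1.0, k=1.0, w=1.0, fanout_sizes=[2.0], v=5.0)
a_before = gate_delay(tau=1.0, k=1.0, w=1.0, fanout_sizes=[2.0], v=5.0)
a_after = gate_delay(tau=1.0, k=1.0, w=1.0, fanout_sizes=[1.0], v=5.0)

print("G:", g_before, "->", g_after)   # G slows down
print("A:", a_before, "->", a_after)   # fanin A speeds up
print("G at 3.3 V:", gate_delay(1.0, 1.0, 2.0, [2.0], v=3.3))
```

In this toy chain the slowdown of G is exactly compensated by the speedup of its fanin A, so the path delay is unchanged. In general the net effect depends on the constants of the two gates.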
Dynamic Power Dissipation
The dynamic power dissipated in a circuit corresponds to the power dissipated in charging and discharging the capacitances in the circuit. The magnitude of this power for a gate driving a load capacitance $C_i^{load}$, with internal capacitance $C_i^{int} = c\, w_i$, operating under a clock frequency $f$ and having a probability $p_T$ of switching is given by

$$P_i^{dyn} = 0.5\, (C_i^{load} + C_i^{int})\, V_{dd}^2\, f\, p_T$$

where $V_{dd}$ is the supply voltage. It can be seen that reducing the size of gate $i$ reduces the power consumption of both gate $i$ itself and its fanin gates.
Short Circuit Power Dissipation
Most transistor sizing methods have considered only the dynamic power dissipation. Recently, a few methods have also considered short-circuit power using the formula

$$P_{sc} = \frac{\beta}{12}\, (V_{dd} - 2V_T)^3\, \tau\, f\, p_T \qquad (13)$$

where $\beta$ is the MOS transistor gain factor, $\tau$ is the transition time of the input signal, and $f$ and $p_T$ are as defined earlier.
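The traditional short-circuit estimate can be sketched in a few lines. The sketch assumes the classical form P_sc = (beta / 12) * (V_dd - 2 V_T)^3 * tau * f * p_T for a symmetric gate with no load. The function name `short_circuit_power` and the parameter values are illustrative assumptions.

```python
def short_circuit_power(beta, vdd, v_t, tau, freq, p_t):
    """Traditional short-circuit power estimate for an unloaded symmetric gate.

    Below Vdd = 2 * V_T the two transistors never conduct simultaneously,
    so the estimate is taken as zero.
    """
    overdrive = vdd - 2.0 * v_t
    if overdrive <= 0.0:
        return 0.0
    return (beta / 12.0) * overdrive ** 3 * tau * freq * p_t


p_fast = short_circuit_power(beta=1e-4, vdd=5.0, v_t=0.7, tau=1e-9, freq=100e6, p_t=0.5)
p_slow = short_circuit_power(beta=1e-4, vdd=5.0, v_t=0.7, tau=2e-9, freq=100e6, p_t=0.5)

print(p_fast, p_slow)  # doubling the input transition time doubles the estimate
```

Note that the load capacitance of the gate does not appear anywhere in this expression. The estimate grows linearly with the input transition time, which is how down sizing a driver raises the short-circuit power of its fanout gates.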
Eqn. (13) is inaccurate since it does not model the effect of the load capacitance on the short-circuit power.

Assuming that $w_p = 2 w_n$, a modified model would be:

Gate sizing consists of choosing, for each node of a technology-mapped network, a gate implementation in the library so that the total power of the circuit is minimized without affecting the overall delay of the network, i.e. under some delay constraints. This is possible because gates on the non-critical paths of the network have a lot of slack, so they can be down sized to save power without violating the timing constraints. Figure 1 shows the effect of down sizing gate G on the total power of the circuit. On down sizing gate G, the input capacitance of gate G decreases.

Hence, the load capacitance of the gates which are the fanins of gate G, i.e. gate G1, decreases. According to the dynamic power model, this results in a decrease in the dynamic power of gate G1. As a consequence of down sizing gate G, the transition time of the signal at the output of gate G increases. This affects the gates which are the fanouts of gate G, since the time for which both the n and the p transistors are ON increases. This results in an increase in the short-circuit power dissipated by the fanout nodes. Hence, if the number of fanouts is very high, the total increase in short-circuit power dissipation may offset the decrease in dynamic power dissipation, resulting in an increase in the total power even though we have down sized gate G.

Figure 2 shows the need for choosing the gates for down sizing carefully. If gate G is chosen for down sizing, then the corresponding decrease in the slack of this gate reduces the slack of its fanout gates, which could otherwise have been down sized. If instead both fanout gates G1 and G2 were down sized, we would obtain a greater reduction in power. Hence, gates which are part of fewer paths are better candidates for down sizing than gates which are part of a large number of paths. Again, since both dynamic and short-circuit power are directly proportional to switching activity, gates with a high switching activity should be down sized earlier. Section 5 describes an algorithm for combined voltage scaling and gate sizing.
Combination of Voltage Scaling and Gate Sizing
Since both VS and GS decrease the available slack in the circuit, it is better to apply the two techniques simultaneously rather than one after the other. In [12], a technique for power reduction by simultaneous VS and GS using a maximum weighted independent set (MWIS) approach has been proposed. Formulating the power optimization problem as finding a maximum weighted independent set of the sensitive transitive closure of the graph exposes several opportunities to reduce power. However, the time complexity of the algorithm is quite high. The algorithm attempts to reduce power dissipation by finding a set of nodes for which delay can be traded for power.
The selected nodes are usually sized down or operated at a lower $V_{dd}$. This results in a lower power dissipation and an increased delay for the node. To ensure that the increase in the delay of the nodes does not violate any critical path timing constraints, the delay at any step is increased by at most $\min\{\min_{v \in Q_m}(\Delta d(v)),\ s_{max} - s_{max-1}\}$, where $s_{max}$ is the maximum slack available for any node in the graph and $s_{max-1}$ is the second largest slack available.
A Fast Heuristic
The principal reason behind the success of the MWIS-based approach is that the algorithm is able to choose the maximum number of nodes to trade delay for power, given the slacks along the paths. For example, consider Figure 3. The MWIS algorithm obtains the optimal solution because it selects the nodes $V_1, V_2, V_3, V_4$ over the nodes $V_5$, $V_6$ or $V_7$ to introduce delay. Our heuristic is guided by the same principle. The heuristic is based on the number of paths that pass through a node from any primary input to any primary output. The intuition is that if the number of paths that pass through a node is large, then introducing a delay at that node uses up the slack of a large number of nodes that lie on those paths. On the other hand, introducing delay at a node which has a small number of paths passing through it affects the slacks of only a small number of other nodes.
Returning to the example of Figure 3, the number of paths that pass through each node is shown in parentheses.
For simplicity, the delay of each node is assumed to be 1. If we take into account the number of paths that pass through each node when selecting the nodes at which to introduce delays, giving more priority to nodes that have fewer paths passing through them, then we arrive at the same solution as the MWIS algorithm. Thus we use the number of paths that pass through each node in deciding where to introduce delays. Further, since the power dissipated at a node is directly proportional to the switching activity at the node, nodes with a high switching activity should be gate sized or voltage scaled first. This guides us to the following weight function for each node.
$$Weight(i) = \frac{Slack^{\alpha} \cdot p_T^{\beta}}{(No.OfPaths)^{\gamma}} \qquad (19)$$

where $p_T$ is the switching probability and $\alpha, \beta, \gamma$ were assumed to be 1. The weight function assigns a larger weight to gates which have larger slack, as these gates can be sized or voltage scaled by a large factor, giving a larger reduction in power. Gates with high switching activity are also given a larger weight, as the power reduction is directly proportional to the switching activity of the gates. Our path-based heuristic assigns a lower weight to gates having a large number of paths passing through them, so that changing the slack of an individual gate does not reduce the slack of a large number of gates. The parameters $\alpha, \beta, \gamma$ were chosen to be 1 so that the effect of slack, switching activity and number of paths on the total power reduction could be studied. These parameters could be changed to obtain better solutions.

The heuristic is described next. Afterwards we describe the algorithm that calculates the number of paths going through a node. Note that computing the number of paths going through a node is efficient. Moreover, as it is a property of the graph that does not change with the delays of the nodes, we need to calculate it only once, as opposed to the MWIS approach, where the MWIS has to be recalculated after each iteration.
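The weight function can be sketched directly. The sketch assumes the form Weight = Slack^alpha * p_T^beta / (number of paths)^gamma with all three exponents defaulting to 1. The node names, slack values, activities and path counts below are hypothetical.

```python
def weight(slack, p_t, n_paths, alpha=1.0, beta=1.0, gamma=1.0):
    """Larger slack and activity raise the weight; more paths lower it."""
    return slack ** alpha * p_t ** beta / n_paths ** gamma


# node -> (slack, switching probability, number of paths through the node)
nodes = {
    "g1": (4.0, 0.5, 2),
    "g2": (4.0, 0.5, 8),
    "g3": (1.0, 0.9, 1),
}

weights = {name: weight(*attrs) for name, attrs in nodes.items()}
ranking = sorted(weights, key=weights.get, reverse=True)

print(weights)
print(ranking)  # candidates for VS or GS, most attractive first
```

Here g1 and g2 have the same slack and activity, but g2 lies on four times as many paths and is therefore ranked last, behind even the low-slack node g3.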
Algorithm 1 gives our combined VS and GS algorithm. It has the advantage that any slack left over by one technique is used up by the other. Further, the technique that brings the larger power reduction is used for each particular node. The algorithm finds the number of paths through each gate and uses this to assign a weight to each node, based on the available slack in the node, using Eqn. (19). Gates which have a larger slack and fewer paths passing through them are chosen first for VS or GS. The change in the total power per unit delay is calculated for these chosen gates. Since the main objective is to achieve maximum power reduction, gates are chosen for VS or GS depending on which operation decreases the total available slack in the circuit by the least amount. The algorithm terminates when the available slack in the circuit is reduced so far that any further VS or GS operation would violate the timing constraints of the circuit.

Algorithm 1 (combined VS and GS), closing steps:
    ...
        update slacks on affected paths
      else
        apply GS on node i
        update slacks on affected paths
    endfor
  while (at least one node is changed)

Algorithm 2 is a linear-time algorithm for calculating the number of paths, which is used in the Weight function to choose the candidate nodes for VS or GS. We now show that this algorithm indeed gives the number of paths passing through a node. Consider the number of paths entering a particular node. Each of these paths must either pass through one of its predecessors or originate at one of its predecessors. Moreover, a path passing through a node has a unique predecessor along the path, as the graph is acyclic. Hence the number of paths entering a node is the sum of all paths going through or originating at its predecessors. A similar argument applies to paths leaving a node. Each path leaving a node must pass through or terminate at a successor.
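The counting argument above can be sketched as follows. This is a minimal illustration and not the listing of Algorithm 2: it assumes the circuit is given as a dictionary mapping each node to its successors, uses a standard topological sort, and multiplies the number of entering paths by the number of leaving paths to obtain the number of input-to-output paths through each node. The function name `paths_through` and the example graph are hypothetical.

```python
from collections import deque


def paths_through(succ):
    """Count primary-input-to-primary-output paths through each node of a DAG.

    succ maps a node to the list of its successor nodes.
    """
    nodes = set(succ)
    for outs in succ.values():
        nodes.update(outs)

    pred = {v: [] for v in nodes}
    for u, outs in succ.items():
        for v in outs:
            pred[v].append(u)

    # Topological order (Kahn's algorithm), linear in nodes plus edges.
    indeg = {v: len(pred[v]) for v in nodes}
    queue = deque(v for v in nodes if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ.get(u, ()):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")

    # Paths entering a node: sum over predecessors (1 for a primary input).
    entering = {}
    for v in order:
        entering[v] = sum(entering[p] for p in pred[v]) if pred[v] else 1

    # Paths leaving a node: sum over successors (1 for a primary output).
    leaving = {}
    for v in reversed(order):
        outs = succ.get(v, ())
        leaving[v] = sum(leaving[s] for s in outs) if outs else 1

    return {v: entering[v] * leaving[v] for v in nodes}


# Two inputs a, b feed gate c, which drives two outputs d, e.
counts = paths_through({"a": ["c"], "b": ["c"], "c": ["d", "e"]})
print(counts)
```

In the example all four input-to-output paths pass through c, while each of the other nodes lies on two of them. Each node and edge is visited a constant number of times, so the running time is linear in the size of the graph.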
The number of entering paths for each node is computed by visiting

Table 1 shows the percentage reduction in total power using only the voltage scaling technique. We see a power reduction of about 50 % for circuit 9symml when the total power is taken to be the dynamic power, and about 58 % when short-circuit power is also considered during the decision. Table 2 shows the percentage reduction in total power using only the gate sizing technique, when all gates operate on a single supply voltage. Figure 4 shows the percentage reduction in power using gate sizing graphically. We see a power reduction of about 47 % for circuit 9symml when the total power is taken to be the dynamic power, and about 54 % when short-circuit power is also considered during the decision.

Figure 5 shows that a combined VS and GS approach gives more power reduction than VS alone. Table 3 gives the percentage power reduction using our combined VS and GS technique. A power reduction as high as 80 % is obtained for circuits like i1. The percentage power reduction is very high because the algorithm finds the maximum number of nodes that are candidates for either VS or GS and do not violate the timing constraints. We can conclude that although VS and GS individually give high power reduction, a much higher reduction is obtained with a combined approach, as the slack left unutilized by one technique can be used by the other. We have not considered the effect on power of the additional level converters that would be introduced by the dual voltages in the circuit.

Figure 1 shows that down sizing a gate might not always result in a total power reduction. Hence, a decision taken with only the dynamic power in consideration is less accurate. We can see from Figure 6 that an additional power reduction as high as 6 % can be obtained by including the short-circuit power in the decision process. The improvement in power reduction depends on the number of implementations of the gates in the library.
[12] defines completeness of a gate library for gate sizing. A more complete library would definitely improve the flexibility of the algorithm. The execution time of our algorithm using our fast heuristic for circuit C1908 is 85.87 seconds. The execution time using the MWIS approach [6] is reported as 117.7 seconds for Library A, 136.6 seconds for Library B, 256.6 seconds for Library C and 1485.7 seconds for Library D. We do not report a complete comparison with the combined VS and GS technique using the MWIS approach, as the gate libraries used there were different from those available to us. However, from the execution times and the complexity analysis presented in Section 5, it can be concluded that our algorithm is much faster than the MWIS algorithm.
Conclusion
We have presented an effective framework for integrating voltage scaling and gate sizing techniques to obtain maximum power reduction. We have proposed a fast algorithm for choosing the maximum number of gates for voltage scaling and gate sizing. We use a better model for short-circuit power dissipation and show that combined voltage scaling and gate sizing, with total power modeled as the sum of dynamic power and short-circuit power, gives an average power reduction of 77 %, which is greater than the power reduction achieved when the decisions are taken with dynamic power only.
