Abstract-This paper presents a new approach that uses simultaneous voltage scaling and gate sizing to reduce power without violating timing constraints. We formulate the problem and propose algorithms for voltage scaling alone, gate sizing alone, and their simultaneous application. We target a globally optimal solution by showing how the power optimization is related to the maximum-weighted-independent-set (MWIS) problem. Experimental results on a set of benchmark circuits show that simultaneous voltage scaling and gate sizing yields the largest power reduction. The average power savings range from 23% to 57% over all tested circuits, depending upon the circuit topology, the underlying gate library, and the specific supply voltages.
I. INTRODUCTION
AS CIRCUIT density and speed increase, power dissipation has become one of the critical metrics of circuit synthesis and optimization. Low power can improve circuit reliability, increase lifetime, and reduce cooling and packaging costs. Considerable effort on power reduction has been made at various levels of design abstraction, ranging from system level to layout level. Because charging and discharging of capacitance is the most significant source of power dissipation in CMOS circuits, previous work optimizes power consumption by considering three factors in a circuit: supply voltage, loading capacitance, and switching activity. However, most of these approaches address only one factor at a time. In this work, we are interested in power optimization by reducing the supply voltage and loading capacitance simultaneously.
Reducing the supply voltage, or voltage scaling (VS), is an effective low-power technique because dynamic power consumption is quadratically related to supply voltage [1]-[7], [23]. While reducing the supply voltage of a whole circuit costs circuit speed, a low voltage applied only to noncritical paths of the circuit does not necessarily degrade performance. The major overhead in using different supply voltages in different parts of a circuit is that level converters are required to eliminate the static current at their interface [3], [18]. To avoid too many level converters, which introduce an additional power penalty, it is reasonable to use a dual-voltage approach in which only two different supply voltages are available for the optimized circuits (unless otherwise stated, VS denotes the dual-voltage approach throughout the paper). The typical dual-voltage approach is the cluster voltage scaling (CVS) scheme [3]. Its basic idea is to use a depth-first search from the primary outputs to find gates that may operate at a low supply voltage without violating the timing constraints of the circuit. Some improvements on CVS have been made [4], [19], but they lack a global view and do not consider the switching activity information, which is linearly related to the dynamic power. A linear programming approach was proposed in [18] to address the dual-voltage problem. However, it is based on the so-called delay balanced configurations, whose generation is computationally very expensive. Unlike VS, gate sizing (GS) is a well-known technique that targets power optimization by reducing load capacitance. Several approaches have been explored [8]-[12]. In [9], for example, gate resizing was performed in two phases. The first phase carries out single gate resizing, and the second phase targets multiple gate resizing.
Resizing a single gate at the early stage may prevent the subsequent multiple gate resizing from achieving further power reduction. In [8], an integer linear programming technique was used, and a gain function was defined as the product of slack time and switching activity, which is oversimplified. An exact algorithm proposed in [10] can yield an optimal gate resizing solution. The authors assumed that for any two gates of the same logic function, the ratio of their loading capacitance difference to their delay difference is a constant. This assumption, however, is unrealistic. A fast gate sizing algorithm was also presented in [11], where it was claimed that the effect of a single gate resizing on slack variation decreases geometrically with fanin/fanout level. Unfortunately, this observation does not hold in general, especially when the gate lies on a long slack-sensitive path [20].
From a general point of view, reducing either the supply voltage or the physical size of a gate increases the gate delay, which in turn decreases the slack time. In this sense, VS and/or GS can be effective for the delay-constrained optimization problem only if the given circuit has significant timing slack available in some or all of its constituent gates. Fig. 1 shows the average distribution of gates with different slacks for 16 MCNC'91 benchmark circuits after technology mapping (note that the slack value has been normalized to the longest path delay). It can be seen that the number of gates on critical paths (i.e., with zero slack or close-to-zero slack in the figure) accounts for only 14% of the total number of gates, while more than 60% of gates have a slack larger than 0.2. This potentially provides much room for power reduction using VS and/or GS. Another common feature of VS and GS is that both are nonlinear, discrete optimization problems and do not modify the topological structure of the given circuit. Because of the discrete nature of supply voltages (or gate sizes), VS or GS alone tends to leave more slack unused, preventing further power reduction. All these facts motivate us to look at the best combination or simultaneous application of VS and GS for low-power design. More recently, Yeh et al. [19] reported an approach that uses gate upsizing to create new slack for VS. Essentially, VS and GS were done separately and locally in their algorithm, where GS tried to minimize the area penalty and no specific timing model was given. In their method, VS is still based on the CVS scheme [3], which breaks down under tight timing constraints. Their algorithm is therefore fundamentally different from our technique.
In this paper, we deal with the problem of reducing power dissipation of a technology-mapped circuit under timing constraints using simultaneous voltage scaling and gate sizing. Our optimization problem can be described as

\[
\begin{aligned}
\text{minimize}\quad & \mathrm{Power}(S, V) \\
\text{subject to}\quad & \mathrm{Delay}(S, V) \le T_c, \\
& v_i = V_H \ \text{or}\ V_L \quad \forall\ \text{gate } i, \\
& s_i^{\min} \le s_i \le s_i^{\max} \quad \forall\ \text{gate } i,
\end{aligned}
\tag{1}
\]

where both Power and Delay are functions of the gate sizes $S$ and supply voltages $V$, $T_c$ is the timing constraint, $V_H$ and $V_L$ are the two supply voltages, $v_i$ and $s_i$ are the supply voltage and the size of gate $i$, respectively, and $s_i^{\min}$ and $s_i^{\max}$ are given by a gate library. This is a delay-constrained power minimization problem. In [25], a method which makes use of transistor reordering was described to address a similar problem. Since transistor reordering is simply intended for reducing the average number of transitions at internal nodes of gates for low power, the resulting power reduction is very limited. In this work, we relate the power optimization to the maximum-weighted-independent-set (MWIS) problem [14]. Algorithms for single VS, single GS, and simultaneous VS and GS are proposed to optimize power. Experiments show that simultaneous VS and GS obtains the largest power improvement. The improvement rate ranges from 23% to 57% (on average) over tested circuits. In this paper, we make the following assumptions: 1) glitching is ignored; 2) only downsizing of gates is considered in GS; and 3) short-circuit power is not accounted for.
The remainder of the paper is organized as follows. Section II discusses delay and power modeling with both VS and GS. Section III describes a basic algorithm for VS and GS based on the MWIS. In Section IV, we discuss simultaneous VS and GS for power optimization. Finally, experimental results and conclusions are given in Sections V and VI, respectively.
II. TIMING AND POWER MODELING WITH VOLTAGE SCALING AND GATE SIZING
Because of the nature of the problem shown in (1), the general idea behind the GS (or VS) approach is to iteratively select a set of gates to downsize (or reduce their supply voltages) so that the total power reduction is maximized and the timing constraints are met. In this section, we first describe the timing and power models. In Section III-A, we will show how MWIS can be used to optimize power globally.
A. Timing Model
In most standard-cell libraries, the gate delay is defined as

\[ d_i = \tau_i + k\,\frac{C_i}{s_i} \tag{2} \]

where $\tau_i$ is the intrinsic delay, $s_i$ and $C_i$ are the size and load capacitance of gate $i$, respectively, and $k$ is a constant. The load drive capability of gate $i$ increases with $s_i$. The internal capacitance of gate $i$, however, varies almost linearly with $s_i$. This keeps $\tau_i$ almost independent of $s_i$. $C_i$ is determined by the size of fanout gates and wiring capacitance, i.e.,

\[ C_i = \sum_{j \in FO(i)} \alpha\, s_j + C_{w,i} \tag{3} \]

where $FO(i)$ is the set of (immediate) fanouts of gate $i$, $C_{w,i}$ is the wiring capacitance, and $\alpha$ is a constant. When ignoring the wiring capacitance, (2) can be rewritten as

\[ d_i = \tau_i + c \sum_{j \in FO(i)} \frac{s_j}{s_i} \tag{4} \]

where $c = k\alpha$. Basically, (4) indicates that a larger gate is required for the delay reduction if it drives more fanouts. Furthermore, it has been shown in [13] that the gate delay at supply voltage $V$ is approximately proportional to $V/(V - V_{th})^{\beta}$, where $V_{th}$ is the threshold voltage, and $\beta$ is a constant. Assuming $d_i$ in (4) is the delay at $V_H$, the gate delay with size $s_i$ and supply voltage $v_i$ is given by

\[ d_i(s_i, v_i) = \lambda(v_i)\, d_i, \quad \text{where}\quad \lambda(v_i) = \frac{v_i / (v_i - V_{th})^{\beta}}{V_H / (V_H - V_{th})^{\beta}}. \tag{5} \]

For the purpose of VS, $v_i$ can be either $V_H$ or $V_L$. From (5), reducing the supply voltage increases the delay of the gate. Reducing gate size, however, does not always degrade the delay even though the driving capability of the gate is reduced. The reason is that the loading and, hence, the delay of its fanins decrease with the reduced gate size.
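As a rough numerical illustration of the voltage dependence described above, the following sketch evaluates a delay that is proportional to V/(V - V_th)^beta at two supply voltages and reports the ratio, alongside the quadratic dynamic-power ratio. The 0.6 V threshold matches the value quoted in Section V; the supply voltages (5 V and 3.3 V), the exponent beta = 2, and the helper name `delay_scale` are assumptions chosen only for illustration.

```python
def delay_scale(v, v_ref, v_th, beta):
    """Ratio of gate delay at supply v to the delay at v_ref,
    assuming delay is proportional to V / (V - V_th)**beta."""
    if v <= v_th or v_ref <= v_th:
        raise ValueError("supply voltage must exceed the threshold voltage")

    def shape(x):
        return x / (x - v_th) ** beta

    return shape(v) / shape(v_ref)


# Illustrative values only (not taken from the experiments).
V_H, V_L, V_TH, BETA = 5.0, 3.3, 0.6, 2.0

delay_ratio = delay_scale(V_L, V_H, V_TH, BETA)
power_ratio = (V_L / V_H) ** 2  # dynamic power scales with V squared

print(f"delay at V_L / delay at V_H = {delay_ratio:.3f}")
print(f"power at V_L / power at V_H = {power_ratio:.3f}")
```

Under these assumed values, the lower supply trades a delay increase for a larger-than-proportional drop in dynamic power, which is why only gates with enough slack are candidates for the low voltage.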
B. Power Model
In CMOS circuits, the average dynamic power consumption for gate is given by (6) where is the clock frequency, is the supply voltage, is the switching activity, is the internal capacitance of gate , and is as defined earlier. It can be seen that reducing the size of gate leads to reduced power consumption of both gate itself and its fanin gates. Another source of power consumption is the short-circuit power which arises when both the NMOS and PMOS transistors are ON at the same time, providing a direct path from power supply to ground. In general, by controlling the input and output transition times, the short-circuit power could be kept to within 10% of the total power. However, when the input signal transition time (or signal slope) increases, this power component needs to be accounted for.
C. Weight Function on a Single Gate
Having established the relations between the delay/power and both of gate size and supply voltage, we define a weight function, , for gate as the average power savings per unit delay penalty. Take VS for example. We define (7) where and are the power reduction and delay penalty, respectively, due to voltage scaling for gate , given as follows: (8) and (9) Similarly, we can have based on (5) and (6) (see [28] for more details). Intuitively, gates with high weight are better candidates for VS or GS.
From (8) and (9), delay variation and power savings for gate due to the VS depend upon the parameters of gate itself (such as gate size, switching activity, and supply voltage) and its fanouts. In contrast, GS on gate affects the delay and capacitance of not only gate itself and its fanouts, but also all its fanins, as shown in (4). More specifically, for a single gate, GS tends to be more effective than VS, if: (a) the number of fanouts of the gate is small, which leads to little delay penalty when downsizing it; (b) the number of fanins of the gate is large, which results in high power reduction on these fanins when downsizing it; and/or (c) the switching activity of fanins of the gate is relatively high, which also promises significant power savings when downsizing it. As an example, Fig. 2 shows a gate with three fanins (denoted by , , and ) and three fanouts (denoted by , and ). For the sake of calculation, we assume the following parameters for gates in the figure: (a) the size of all gates is 2 units, the intrinsic delay in (4) for all gates is 1 unit, the factor in (4) for all gates is 0.2 units, and the factor in (3) is 1 unit; (b) V, V, and the clock frequency ; (c) the probability of output signals of three fanins (i.e., , , and ) to be logic 1 is 0.8. Suppose the supply voltage of all gates is and gate is downsized by half to 1 unit size. The resulting delay increase is calculated as , and the power reduction as . The weight function for GS turns out to be . For VS, we have , and (the detailed calculation is omitted here). The weight function for VS is . This shows that the GS is more effective than the VS since . In contrast, if the number of fanouts (with same size) of gate increases from 3 to 10, our estimation shows would increase from 0.5 to 1.9, leading to , which indicates that VS is preferable to GS for this case.
This example shows that the weight functions on a single gate strongly depend upon the parameters of its fanins and/or fanouts, and, hence, vary dynamically during VS and/or GS.
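The delay figures quoted in the example (0.5 for three fanouts, 1.9 for ten) can be checked with a short calculation. The sketch below reads (4) as intrinsic delay plus 0.2 times the sum of fanout sizes divided by the gate's own size. It further assumes that the delay penalty is the delay increase of the downsized gate minus the delay decrease of one fanin, and that each fanin drives only the downsized gate; these two assumptions are an interpretation that reproduces the quoted numbers, and the function names are hypothetical.

```python
def gate_delay(tau, c, size, fanout_sizes):
    """Delay model read from (4): tau + c * (sum of fanout sizes) / own size."""
    return tau + c * sum(fanout_sizes) / size


def downsizing_delay_penalty(n_fanouts, tau=1.0, c=0.2,
                             old_size=2.0, new_size=1.0, other_size=2.0):
    """Net delay penalty along a fanin -> gate path when the gate is downsized."""
    fanouts = [other_size] * n_fanouts
    # The downsized gate drives the same load with less strength: its delay grows.
    own_increase = (gate_delay(tau, c, new_size, fanouts)
                    - gate_delay(tau, c, old_size, fanouts))
    # A fanin (assumed to drive only this gate) now sees a smaller load: its delay shrinks.
    fanin_decrease = (gate_delay(tau, c, other_size, [old_size])
                      - gate_delay(tau, c, other_size, [new_size]))
    return own_increase - fanin_decrease


penalty_3 = downsizing_delay_penalty(3)
penalty_10 = downsizing_delay_penalty(10)
print(round(penalty_3, 3), round(penalty_10, 3))
```

The penalty grows with the number of fanouts because the load term in (4) is divided by the shrinking gate size, while the benefit on the fanin side stays fixed.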
III. BASIC ALGORITHM
From Section II, it is reasonable to select gates one by one for VS/GS based on their weight functions for power reduction. However, as mentioned above, such gate-by-gate selection lacks a global view. We next tackle this problem algorithmically by relating the power optimization to the MWIS problem, which is polynomial-time solvable for transitive graphs (we will show that circuits can be translated to transitive graphs).
A. GS/VS Algorithm
Traditionally, a technology-mapped circuit is modeled as a directed acyclic graph , where each node (or each edge ) corresponds to a gate (or a signal net) of the circuit (the "gate" and "node" are used interchangeably throughout the paper). Given a timing model and timing constraints of the circuit, the slack time of each node (or gate) can be obtained by calculating its arrival/required time through forward/backward delay propagation. If the circuit initially meets the timing constraints, we have for each node . The problem is how to assign the slacks to nodes/edges such that the initial slacks can be fully exploited for power optimization [17] . A typical approach for this slack assignment problem is the zero-slack-algorithm [16] . Its basic idea is to first find the nodes with minimum positive slack, and then do the slack assignment such that their slacks become zero. The algorithm, however, ignores the effect of slack on low power applications, and does not account for the discrete nature of node delay which characterizes the VS/GS technique. Note that we are concerned with exploiting the slacks available in all nodes to reduce the power. This may depend strongly upon the specific gate library with GS and the supply voltages used in VS. Before going further, we have the following definitions. 
Definition 3.4: An object graph, , is an induced subgraph of on a subset such that there is an edge if , , .
Fig. 3 shows an example. This is part of a graph where the delay penalty due to GS for each node is denoted by , and the slack of each node before GS is denoted by . Assuming each node's size , we have four resizable nodes (i.e., nodes 1, 2, 3, and 4) by Definition 3.1. Fig. 3(b) is the transitive closure graph of , where the dotted lines represent transitive edges. Given a subset of nodes , the object graph on is shown in Fig. 3(c). Obviously, the object graph on any subset of nodes in is a (directed) transitive graph where the existence of a directed path from any node to another node implies that there is a directed edge from to .
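One way to picture the object graph is as the transitive closure of the circuit DAG restricted to the chosen subset of nodes. The sketch below follows that reading; the dictionary-based graph representation, the function names, and the small example DAG are all hypothetical and are not the graph of Fig. 3.

```python
def transitive_closure(succ):
    """succ: {node: set of direct successors} of a DAG.
    Returns {node: set of all nodes reachable by a directed path}."""
    reach = {}

    def visit(u):
        if u not in reach:
            reach[u] = set()
            for v in succ.get(u, ()):
                reach[u] |= {v} | visit(v)
        return reach[u]

    for u in succ:
        visit(u)
    return reach


def object_graph(succ, subset):
    """Keep only closure edges whose two endpoints both lie in `subset`."""
    reach = transitive_closure(succ)
    return {u: reach.get(u, set()) & subset for u in subset}


# Hypothetical DAG: 1 -> 5 -> 2 -> 3 and 4 -> 3; node 5 is not in the subset.
dag = {1: {5}, 5: {2}, 2: {3}, 4: {3}, 3: set()}
og = object_graph(dag, {1, 2, 3, 4})
print(og)
```

In this example node 1 becomes adjacent to nodes 2 and 3 even though it reaches them only through node 5, which is the property that makes two nodes on a common path mutually exclusive in an independent set. The recursive closure is adequate for a sketch; a large netlist would call for an iterative traversal in topological order.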
Let be the set of resizable nodes. Any node may be downsized without violating the timing constraints. In general, however, not all nodes in can be downsized at the same time. The reason is that, once a gate is downsized, the slack of other nodes in may be reduced and, hence, they may no longer be resizable. In Fig. 3(a), for instance, nodes 2 and 3 are resizable. Downsizing node 3 reduces both and from 6 to 1, making node 2 no longer resizable. Downsizing both of them will lead to , violating the timing constraints. Similarly, if we use to represent the set of scalable nodes, not all the nodes in can be selected to work at while meeting the timing constraints. However, if we consider an object graph on and let be the maximum independent set of , downsizing all the nodes in at the same time keeps the timing constraints satisfied. The same is true for the object graph on . Note that the is, in general, not necessarily unique. Finding a in an object graph is straightforward. This can be done by iteratively identifying the node with minimum degree in the graph. Initially let . We add into , and update the graph by deleting and all its neighbors (together with their associated edges) from the graph. This process repeats until the graph is empty.
As discussed in Section II, the average power reduction on a single node per unit delay penalty is represented by its weight function. Since the weight functions vary with specific nodes, downsizing different nodes (or reducing the supply voltage of different nodes) may result in totally different power savings. In order to maximize power reduction without violating the timing constraints, it is natural to associate each node with its weight function and extend MIS to MWIS. From Section II, the weight functions on node for GS and VS are denoted by and , respectively. When we simultaneously select all nodes in the MWIS of object graph on (or ) for GS (or VS), the timing performance is still guaranteed, and the power reduction is maximum in the sense that no other independent set can generate greater power reduction. Algorithmically, each time all nodes in MWIS are downsized (or voltage-scaled), the node slack, weight, and object graph need to be updated and a new MWIS must be identified again. This process repeats until (or ) is empty. Formally, we give the GS algorithm as follows: (the VS-algorithm is omitted here since it can be obtained by simply replacing and in the GS algorithm with and , respectively)
Before discussing the MWIS algorithm further, we briefly take a look at the example graph in Fig. 3(c). Suppose that the weight functions for all 4 nodes in the graph are , , , and . Although node 2 has maximum weight of 6 for individual nodes, the MWIS is which leads to the weight sum of 7. If, instead, , the MWIS becomes . In contrast, independent of node weight, the MIS in this graph is . The MIS is a special case of MWIS when all the nodes have uniform weights.
B. MWIS Algorithm
The MIS and MWIS problems are known to be NP-complete on general graphs [21]. However, both are polynomial-time solvable when restricted to transitive graphs. Because even a polynomial-time exact algorithm is computationally expensive in our application, we resort to fast heuristics. A fast algorithm for finding MWIS is as follows. Initially we set . Then select a node with maximum ratio (where and are weight function and degree of node , respectively, in the graph), add the node into MWIS and delete this node and all its neighbors (along with all the edges incident with at least one of the nodes) from the graph. Repeat this selection process until the graph is empty. Alternatively, one can set in this algorithm, where is a constant. In general, trying different values of for this process gives a high probability of obtaining a reasonably good solution, depending on the specific object graph. The time complexity of the algorithm is .
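The selection loop described above can be sketched as follows, taking the ratio to be a node's weight divided by its current degree. The graph is treated as undirected for independence purposes; the handling of isolated nodes (always selected, since they conflict with nothing), the tie-breaking by smallest node label, and the example graph and weights are assumptions made for this illustration and are not taken from Fig. 3.

```python
def greedy_mwis(adj, weight):
    """Greedy weighted independent set.
    adj: {node: set of neighbours}; weight: {node: positive weight}.
    Returns the selected nodes in the order they were chosen."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a private copy
    chosen = []
    while adj:
        def ratio(u):
            # An isolated node excludes nothing, so it is always worth taking.
            return float("inf") if not adj[u] else weight[u] / len(adj[u])

        best = max(sorted(adj), key=ratio)  # sorted() makes ties deterministic
        chosen.append(best)
        removed = adj[best] | {best}
        for u in removed:          # delete the node and all its neighbours
            del adj[u]
        for u in adj:              # drop edges that pointed at deleted nodes
            adj[u] -= removed
    return chosen


# Hypothetical star-shaped graph: node 2 conflicts with nodes 1, 3 and 4.
graph = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}
weights = {1: 3, 2: 6, 3: 2, 4: 2}

selected = greedy_mwis(graph, weights)
total = sum(weights[n] for n in selected)
print(selected, total)
```

Here the heaviest single node is passed over because choosing it would exclude three others whose combined weight is larger. Being a heuristic, the loop carries no optimality guarantee: with the weight of node 2 raised to 8, it still selects the three leaves for a total of 7.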
IV. COMBINATION OF VOLTAGE SCALING AND GATE SIZING
Regardless of the specific algorithms used, the effectiveness of GS (or VS) depends on the underlying gate library (or supply voltages). It can be preferable, for example, to downsize one gate, followed by reducing supply voltage (instead of reducing the size) of another. Thus, VS and GS are dependent on each other. In this section, we look at simultaneous VS and GS problem, and discuss the level converters which are required at the interface of different supply voltages.
A. Simultaneous VS and GS
VS and GS can be combined in one of the following three ways: VS followed by GS, GS followed by VS, and simultaneous VS and GS. From the discussions in Section III, it is straightforward to carry out the first two combinations. In order to do the simultaneous VS and GS, we need to construct an object graph on . Particularly, if any node is both resizable and scalable (i.e., ), it is assigned the weight of and downsized or voltage-scaled accordingly so long as it is in the MWIS of the object graph. Thus, some nodes in the MWIS are selected to be downsized, while others are voltage-scaled. As an extension of the GS algorithm, the simultaneous VS and GS algorithm follows: Note that in the previous algorithm, the weight functions are calculated for resizable and scalable gates, and updated in the iterative process. Depending upon their sizes, supply voltages and the switching activity at their outputs, gates with large slack may be voltage-scaled in one iteration and resized in another. This is a greedy algorithm. With the discreteness of delay and/or power changes, there is no guarantee that the best results can be obtained.
SIMULtaneous-VS-and-GS-

B. Level Converter
When VS is applied, the circuit requires level converters (LCs) at the interface of two different voltages to block the possible static current which occurs if a gate drives a gate [3]. Since the LCs introduce additional delay and dynamic power consumption, the number of LCs should be minimized. Otherwise, the power savings due to the VS may be offset by the power consumption of the LCs. Fig. 4 shows a conventional LC circuit with six transistors. In general, the delay of an LC (denoted by ) can be treated as a constant dependent on the specific technology (in our experiments, 0.5 ns for was used as in [3]). After VS, we have two partitions of gates: one with gates (denoted by ) and the other with gates (denoted by ). Assuming that the input capacitance of LC is , and that the switching activity at gate where the LC is inserted is , the power consumption due to the LC can be approximated by . We examine the problem of reducing the total power consumption by moving a gate from to or from to . Let us first look into the possibility of moving a gate from to . Without loss of generality, consider an -type gate with -type fanins and -type fanouts. The power reduction by moving node from to is given by move power (10) where the first term is the reduced power due to LCs and the second term accounts for the increased power of the node itself by moving from to . Thus, the problem of maximizing power reduction can be solved by the constrained Fiduccia-Mattheyses (CFM) algorithm [26]. "Constrained" means that a move is accepted only if it does not violate the given timing constraints. Although more effective cluster-based F-M algorithms (like hMetis [27]) are available for general partitioning purposes, they cannot be used in this application since each individually moved node must not violate the timing constraints.
Our algorithm for minimizing LCs' power penalty is given as follows: After the process of gate move from to is complete, a similar algorithm is applied to the possible move of gate from to where the cost function is modified as move power (11) Since the cost function during each tentative move is (10) or (11) which requires only constant time for computation, the time complexity of LCPM algorithm is the same as that of the traditional F-M algorithm [26] which needs time for each pass, where is the number of terminals in the circuit. It has been shown [5] that for most circuits, the power overhead due to LCs can be controlled (within about 5%, on average, of total power consumption) by using the LCPM algorithm.
V. EXPERIMENT AND DISCUSSION
We implemented our algorithms for VS, GS, and simultaneous VS and GS in the SIS environment [15]. Experiments were carried out on a set of MCNC benchmark circuits using all combinations of VS and GS: single VS, single GS, VS plus GS, GS plus VS, and simultaneous VS and GS. Before running our algorithms, we performed technology mapping on the given circuit in minimum-delay mode using SIS, and then used the resulting delay as the timing constraint. The power consumption was estimated using a clock frequency of 20 MHz, a threshold voltage of 0.6 V, and supply voltages of V and V (unless otherwise stated).
First, we ran our VS, GS, and simultaneous VS and GS algorithms using a standard cell library with range-completeness 0.08 and granularity-completeness 0.78, where the range-completeness measures the maximum difference in size for each type of gate, while the granularity-completeness measures the number of cells for each type of gate in the library (readers are referred to [28] for more about library completeness). The average power reduction over all tested circuits is 6.6% for GS, 19.5% for VS, and 23.3% for simultaneous VS and GS (specific data will be shown later). As an example, Fig. 5 shows the power reduction and slack distribution for circuit 9symml before and after optimization. The maximum power reduction of 16.1% is achieved by simultaneous VS and GS. The discrete nature of the gate library and supply voltage prevents further power optimization.
To see how the underlying library affects our algorithms, we tested libraries with different completeness. The results with four libraries (library A, B, C, and D) are summarized in Table I, where columns 2 and 3 give the number of gates and circuit delay, respectively, before optimization. Since our experiments show that the simultaneous algorithm produces better results than both the GS-plus-VS and VS-plus-GS algorithms with less CPU time, Table I only lists the performance of the GS, VS, and simultaneous algorithms on the benchmarks. Library A is the least complete library, with range-completeness of 0.08 and granularity-completeness of 0.78. Library D is the most complete of the four libraries. It can be seen that, for most circuits (except ), GS is less effective than VS when library A is used. On average, VS generates about 13% more power reduction than GS. The effectiveness of GS, however, improves as a more complete library is used. Whatever library is used, the simultaneous algorithm always leads to the best results, as shown in this table. The average power reduction by the simultaneous algorithm ranges from 23.3% to 56.9% over all tested circuits.
Also, we tested our algorithms using different supply voltages. Table II shows the comparison of results (using library A) with four groups of supply voltages. In general, the best supply voltage should be chosen carefully, depending on the slack distribution of specific circuits. In particular, using too widely separated supply voltages (e.g., V and V) is not advisable for most circuits, as it may prevent most or all gates in the circuit from operating at , resulting in less or no power savings. This is shown in the last column of Table II.
VI. CONCLUSION AND FURTHER WORK
We have presented the first paper on power reduction using simultaneous voltage scaling and gate sizing. The algorithms optimize the dynamic power consumption under the given timing constraints by dealing with the MWIS problem on transitive graphs. It has been shown that the proposed simultaneous voltage-scaling and gate-sizing provides globally good solutions with inexpensive computation cost.
In this paper, our delay model does not account for the signal slew effect and other secondary effects (such as the short channel effect) of deep submicron technology. Further efforts are needed to extend our technique to a more accurate model. Also, there might be additional noise due to cross coupling between signals with different voltages. Noise checking is thus increasingly necessary during the subsequent layout. Finally, it is worthwhile to explore post-layout extraction and accurate simulations of delay and power to validate the power improvement.
* This is the CPU time in seconds using SIMULtaneous algorithm on a SUN SPARCstation 5 with 32 MB RAM.
TABLE II POWER REDUCTION (%) WITH DIFFERENT SUPPLY VOLTAGES ** (USING LIBRARY A)
** The CPU times in this experiment are almost the same as those in Column "library A" of Table I. 
