Introduction
The continuously increasing packing density and clock frequency of static CMOS circuits have pushed low power as one of the principal design parameters, especially in battery-powered portable systems, such as note-pad computers, personal digital assistants, multi-media terminals and mobile telephones.
This paper addresses the optimization of a circuit for low power using transistor reordering from a gate-level description. The optimization algorithm uses a power-consumption model of a static CMOS gate that takes into account the power of the internal nodes of the gate. This model allows a fast exploration of the different configurations of a gate that are obtained by reordering its transistors. Thus, the best configuration of each gate is selected and the overall power consumption of the circuit is decreased.
We focus on combinational multilevel circuits, where it has been shown that the power consumption of useless signal transitions (i.e. those transitions that do not contribute to the final result of the circuit) accounts for a large fraction of the overall dynamic power consumption of the circuit. Thus, it is necessary to incorporate the switching activity of the input signals into the power-consumption model of the gate.
Motivation examples
To illustrate why it is important to incorporate the switching activity information into the power estimation of a gate, consider the four possible configurations of the gate in Figure 1(a) that implement function y = (a1 + a2) b. Different switching activities of the inputs ($D_{a_1}$, $D_{a_2}$ and $D_b$) result in a different optimal transistor reordering of the gate, as shown in Table 1(b). The equilibrium probability (i.e. the probability for a signal to be '1') of all input signals has been set to 0.5. Table 1(b) shows the power consumption for two different input switching activity scenarios (cases (1) and (2)) of the different configurations relative to configuration (D) in case (1). In case (2), the power is decreased by 17% if configuration (D) is taken instead of (A).

The ripple-carry adder is another example in which the equilibrium probability does not give enough information to optimize gates for low power. Consider an adder implemented as a chain of full-adders that has to calculate the addition of two n-bit operands with equal equilibrium probability for all bits. The equilibrium probability of all inputs of the full-adders is 0.5, but it is clear that the switching activity of the inputs of the full-adders corresponding to the operands is low (0.5 transitions per operation) whereas the switching activity of the input corresponding to the propagated carry is higher (especially in those full-adders that compute the most-significant bits) because of the generation and propagation of useless signal transitions.
Previous work and overview
Carlson [2] hinted at the possibility of using the transistor reordering technique to decrease power consumption and presented an algorithm for delay/power/area optimization where high speed was synonymous with high power consumption. This approach to measuring power consumption is not sufficiently accurate since it does not consider the probability and switching activity of signals.
Transition density
The transition density is a compact measure of the switching activity in digital circuits. The transition density of a node is the average number of signal transitions per time unit of that node and it is defined as

$$D(y_j) = \sum_{i=1}^{n} P\left(\frac{\partial y_j}{\partial x_i}\right) D(x_i),$$

where $P(z)$ is the equilibrium probability, $D(z)$ is the transition density, and $x$ ($y$) are the $n$ ($m$) gate inputs (outputs). $\partial y_j / \partial x_i$ is named the boolean difference and it is a boolean function that may depend on all the gate inputs except $x_i$. When it evaluates to 1, all the transitions at input $x_i$ are propagated to output $y_j$.
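The definition above can be illustrated with a minimal Python sketch. It assumes statistically independent inputs and evaluates each boolean difference by brute-force enumeration of the remaining inputs, which is only practical for small gates. The function names and the example gate are illustrative choices, not part of the original work.

```python
from itertools import product

def boolean_difference_probability(f, n_inputs, i, probs):
    """Probability that toggling input i toggles f (inputs assumed independent)."""
    others = [k for k in range(n_inputs) if k != i]
    total = 0.0
    for values in product([0, 1], repeat=len(others)):
        x = [0] * n_inputs
        weight = 1.0
        for k, v in zip(others, values):
            x[k] = v
            weight *= probs[k] if v else 1.0 - probs[k]
        x[i] = 0
        f0 = f(x)
        x[i] = 1
        f1 = f(x)
        if f0 != f1:          # the boolean difference is 1 for this assignment
            total += weight
    return total

def transition_density(f, probs, densities):
    """D(y) = sum_i P(dy/dx_i) * D(x_i)."""
    n = len(probs)
    return sum(
        boolean_difference_probability(f, n, i, probs) * densities[i]
        for i in range(n)
    )

def gate(x):
    """Example function y = (a1 + a2) b, with inputs ordered (a1, a2, b)."""
    a1, a2, b = x
    return (a1 | a2) & b

# All probabilities 0.5, one transition per time unit on every input.
d_y = transition_density(gate, [0.5, 0.5, 0.5], [1.0, 1.0, 1.0])
print(d_y)
```

With these inputs the boolean differences have probabilities 0.25 (for a1 and a2) and 0.75 (for b), so the output density is the weighted sum of the three input densities. Complementing the output would not change the result, since an inverter propagates every transition.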
Thus, the transition density provides a fast way of propagating switching activity from the primary inputs to the outputs of the gates that compose a circuit. Henceforth, we assume that a node is charged (discharged) only when there is a direct path from power (ground) supply to the node, i.e. we do not consider charge sharing among the nodes of the gate. The algorithm to obtain the $H_{n_k}$ function is depicted in Figure 2(b). Function $H_{n_k}$ is obtained by generating a minterm for each possible path from node $n_k$ to supply node vdd. A path from node $n_k$ to vdd is a set of $r$ edges $e_k$,
Extended power-consumption model

Notation
CALCULATE_H_FUNCTION (n_k, function)
    for each element (dfs_list_k, e_ki)
is the source (destination) node of edge $e_k$. Using a depth-first-search approach [3], a list of all edges to visit is created (depth_first_search_list_k). Afterwards, the edges of this list are added to the current minterm of the $H_{n_k}$ boolean function (ADD_INPUT_TO_MINTERM()) until an edge $e_{k_i}$ is reached so that src($e_{k_i}$) = vdd. In this case, a new minterm is created (CREATE_NEW_MINTERM()), sharing with the last created minterm all its edges but the last one visited.
Because modeling the capacitances of the nodes of a gate is difficult, these capacitances should be extracted and stored for all gates of the library whenever possible. This is the approach followed in this paper.
Footnote 2: Note that $G_{n_k}$ and $H_{n_k}$ are complementary functions only when $n_k$ is the output node ($y$) of the gate.
In the example of Figure 2(a), the four minterms generated when calculating $H_{n_1}$ lead to $H_{n_1} = b\,(a_1 + a_2)$. Similarly, $G_{n_1}$ can also be derived. In Figure 2(a), $G_{n_1} = b$. The time complexity of these algorithms is linear in the number of transistors of the gate.
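The path-based construction of a charging function can be sketched in Python as follows. This is a simplified illustration, not the algorithm of Figure 2(b): it enumerates simple paths with a plain recursive depth-first search, so it favors clarity over the linear-time behavior reported above. The pull-up network, the node names and the `!x` notation for the literal that turns a p-transistor on are all hypothetical, since the gate of Figure 2(a) is not reproduced in the text.

```python
def paths_to_supply(edges, start, supply="vdd"):
    """Return one product term (a frozenset of literals) per simple path
    from `start` to `supply` through the given transistor edges."""
    adjacency = {}
    for u, v, literal in edges:
        adjacency.setdefault(u, []).append((v, literal))
        adjacency.setdefault(v, []).append((u, literal))

    terms = []

    def dfs(node, visited, literals):
        if node == supply:                 # reached the supply: close one term
            terms.append(frozenset(literals))
            return
        for nxt, literal in adjacency.get(node, []):
            if nxt not in visited:         # keep paths simple (no revisits)
                dfs(nxt, visited | {nxt}, literals + [literal])

    dfs(start, {start}, [])
    return terms

# Hypothetical pull-up network: transistor b between vdd and y, in parallel
# with transistors a1 and a2 in series through internal node n1.
pull_up = [
    ("vdd", "y", "!b"),
    ("vdd", "n1", "!a1"),
    ("n1", "y", "!a2"),
]

h_n1 = paths_to_supply(pull_up, "n1")   # terms of the charging function of n1
h_y = paths_to_supply(pull_up, "y")     # terms of the charging function of y
print(h_n1, h_y)
```

In this toy network the internal node n1 can be charged either directly through a1 or through the output node via a2 and b, which shows why the charging function of an internal node differs from that of the output.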
Afterwards, the boolean difference of function $H_{n_k}$ with respect to input $x_i$ (that is, $\partial H_{n_k} / \partial x_i$) and the equilibrium probability of node $n_k$ need to be calculated. The boolean function $\partial H_{n_k} / \partial x_i$ is calculated as explained in Section 3.2. The equilibrium probability of node $n_k$ is obtained as follows [4]: the probability of node $n_k$ being '1' at a given instant of time ($P(n_k)|_t$) is the probability that $n_k$ was '1' in the instant before ($P(n_k)|_{t-1}$) and it is not discharged ($P(\overline{G_{n_k}})$) or that it was '0' ($P(\overline{n_k})|_{t-1}$) and it is charged ($P(H_{n_k})$), i.e.

$$P(n_k)|_t = P(n_k)|_{t-1}\, P(\overline{G_{n_k}}) + P(\overline{n_k})|_{t-1}\, P(H_{n_k}).$$

Since all signals are assumed to be 0-1 stationary Markov processes, the steady state value of $P(n_k)$ can be derived as

$$P(n_k) = \frac{P(H_{n_k})}{P(H_{n_k}) + P(G_{n_k})}.$$
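As a numerical check of this reasoning, the following Python sketch iterates the verbal recurrence (a node stays at '1' if it is not discharged, and goes from '0' to '1' if it is charged) and compares the result with the fixed point obtained by setting the probability at consecutive instants equal. The function names and the example probabilities are illustrative assumptions, and the closed form is derived here from the recurrence rather than quoted from the original equations.

```python
def iterate_node_probability(p_charge, p_discharge, iterations=200, p0=0.0):
    """Iterate P_t = P_{t-1} * (1 - P(G)) + (1 - P_{t-1}) * P(H)."""
    p = p0
    for _ in range(iterations):
        p = p * (1.0 - p_discharge) + (1.0 - p) * p_charge
    return p

def steady_state_node_probability(p_charge, p_discharge):
    """Fixed point of the recurrence: P = P(H) / (P(H) + P(G)).
    Assumes P(H) + P(G) > 0, i.e. the node is not permanently floating."""
    return p_charge / (p_charge + p_discharge)

# Hypothetical values: the node is charged with probability 0.25 and
# discharged with probability 0.5 at any given instant.
p_iter = iterate_node_probability(0.25, 0.5)
p_closed = steady_state_node_probability(0.25, 0.5)
print(p_iter, p_closed)
```

The iteration converges regardless of the starting value because each step shrinks the distance to the fixed point by the factor |1 - P(H) - P(G)|, which is below 1 whenever the node can be charged or discharged.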
We conclude that $W_{n_k|x_i}$ is

[Equation not recoverable from the source; the surviving fragment refers to $T_{cyc}$.]
If the contributions of all nodes (output and internal) are taken into account, the power estimation of the gate is obtained as

$$P_{gate} = \sum_{k} \sum_{i=1}^{n} W_{n_k|x_i},$$

where $k$ ranges over the output node and the $p$ internal nodes, $p$ is the number of internal nodes and $n$ is the number of inputs of the gate.
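The idea of summing node contributions can be sketched with the familiar average dynamic power expression, one half of the node capacitance times the supply voltage squared times the node's transition density. This is a simplification and an assumption: the model in the text splits each node's contribution per input, whereas the sketch below takes the transition density of each node as already known. Capacitances, densities and the supply voltage are hypothetical values.

```python
def node_power(capacitance, vdd, transition_density):
    """Average dynamic power of one node: 0.5 * C * Vdd^2 * D."""
    return 0.5 * capacitance * vdd ** 2 * transition_density

def gate_power(nodes, vdd):
    """Sum the contributions of the output node and all internal nodes.
    `nodes` is a list of (capacitance in F, transition density in 1/s)."""
    return sum(node_power(c, vdd, d) for c, d in nodes)

# Hypothetical gate: an output node and one internal node.
nodes = [
    (20e-15, 1.0e6),   # output node: 20 fF, 1e6 transitions/s
    (5e-15, 0.4e6),    # internal node: 5 fF, 4e5 transitions/s
]
total = gate_power(nodes, vdd=5.0)
print(total)   # watts
```

Even in this toy case the internal node adds about ten percent to the estimate, which is the kind of contribution an output-only model would miss.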
Power-optimization algorithm

OBTAIN_PROBABILITIES (circuit)
gate_list = DEPTH_FIRST_TRAVERSE (circuit)
for each gate gate in gate_list do
    info_inputs = OBTAIN_PROB_AND_DENS (gate, circuit)
    FIND_BEST_REORDERING (info_inputs, gate, circuit)
    info_output = CALCULATE_DENS (info_inputs, gate)
    UPDATE_CIRCUIT_INFORMATION (info_output, circuit)

Figure 3: Optimization algorithm.
In this section, an algorithm that traverses the gate description of the circuit is presented. For each gate, it finds the best transistor reordering using the power-consumption model explained in Section 4.
Algorithm overview
Finding the best transistor reordering implies an exhaustive exploration of each gate. Since most gates only have a small number of transistors in series, an exhaustive exploration is feasible. The algorithm to obtain all possible transistor reorderings of a gate will be addressed later.
A simplified algorithm of the optimization process for low power is shown in Figure 3. The probabilities for all output nodes of the gates of the circuit are computed in OBTAIN_PROBABILITIES() following the algorithm proposed in [7]. DEPTH_FIRST_TRAVERSE() returns the list of gates (gate_list) of the circuit (circuit) ordered in a depth-first fashion [3] from the outputs, i.e. every gate appears somewhere after all of its transitive fan-in gates. For each gate (gate) of this list, the probability and transition density information for all of its inputs is obtained from the circuit (OBTAIN_PROB_AND_DENS()).
Afterwards, the best reordering is derived for gate gate (FIND_BEST_REORDERING()). Finally, the transition density of the output node of the gate is calculated (CALCULATE_DENS()) and this information is transferred to the circuit (UPDATE_CIRCUIT_INFORMATION()).
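The traversal described above can be sketched in Python as a single pass over the gates in topological order. The data layout, the function names and both toy models (`toy_power` and `toy_output_stats`) are hypothetical stand-ins used only to make the loop runnable. They are not the power model or the density propagation of the paper.

```python
from graphlib import TopologicalSorter

def optimize(circuit, primary_inputs, gate_power, output_stats):
    """circuit: gate name -> (fan-in signal names, candidate configurations).
    primary_inputs: signal name -> (probability, transition density)."""
    stats = dict(primary_inputs)
    chosen = {}
    graph = {name: set(fanin) for name, (fanin, _) in circuit.items()}
    # static_order() yields every signal after all of its fan-in signals.
    for name in TopologicalSorter(graph).static_order():
        if name not in circuit:            # primary input, nothing to choose
            continue
        fanin, configurations = circuit[name]
        info_inputs = [stats[s] for s in fanin]
        chosen[name] = min(configurations,
                           key=lambda cfg: gate_power(cfg, info_inputs))
        # Output statistics do not depend on the chosen configuration.
        stats[name] = output_stats(info_inputs)
    return chosen

def toy_power(cfg, info_inputs):
    """Toy cost: an input placed at stack position pos costs (pos + 1) * D."""
    return sum((pos + 1) * info_inputs[idx][1] for pos, idx in enumerate(cfg))

def toy_output_stats(info_inputs):
    """Toy propagation: probability 0.5, half the summed input densities."""
    return (0.5, 0.5 * sum(d for _, d in info_inputs))

circuit = {
    "g1": (["a", "b"], [(0, 1), (1, 0)]),
    "g2": (["g1", "c"], [(0, 1), (1, 0)]),
}
primary_inputs = {"a": (0.5, 1.0), "b": (0.5, 4.0), "c": (0.5, 0.5)}
chosen = optimize(circuit, primary_inputs, toy_power, toy_output_stats)
print(chosen)
```

Because the output statistics are computed from the inputs alone, the choice made for one gate never changes the information seen by the gates it drives, which is what allows a single traversal.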
Monotonic characteristic
The algorithm takes advantage of the following property of the model explained in Section 3: the reduction of the power in an individual gate always decreases the power of the circuit. The reason for this monotonic behavior is that all possible transistor reorderings of a gate lead to the same probability and transition density at its output node if the model explained in Section 3 is used to compute them. Since (a) the model relies precisely on the probability and transition density of the inputs of a gate to decrease its power consumption and (b) the power of the circuit is the sum of the power of its gates, it is clear that the reduction of the power in an individual gate always decreases the total power of the circuit. This monotonic behavior may not correspond to the actual behavior of a circuit, but the experiments have shown that this local (greedy) approach results in an overall power reduction for the whole circuit.
Thus, with only one traversal of the circuit, the optimal reordering (always with respect to the model) for all gates is obtained.
Exhaustive exploration of gate configurations
The algorithm in Figure 4 recursively points to an internal node (current_node) and pivots on it to obtain a new reordering (PIVOTING_ON_INTERNAL_NODE()). Further searching for new reorderings is pruned if the reordering obtained has already been visited (VISITED()). If it has not been visited, it is added to the set of transistor reorderings of the gate already visited (ADD_TO_VISITED_REORDERINGS()) and the algorithm is called again for all internal nodes of the gate except the current one (this prevents the generation of a reordering that is known beforehand to have been visited). In [5] it is demonstrated that all possible transistor reorderings of a gate are generated with the algorithm in Figure 4. To illustrate how this algorithm works, it has been applied to the gate implementing the function y = (a1 + a2) b, for which the four configurations of Figure 1(a) are generated.
PIVOT_AND_SEARCH (gate_graph, visited_reords, current_node)
    gate_graph = PIVOTING_ON_NODE (gate_graph, current_node)
    if not VISITED (gate_graph, visited_reords) then
        visited_reords = ADD_TO_VISITED_REORDS (gate_graph)
        for index = 1 to number_of_internal_nodes do
            if index ≠ current_node then
                PIVOT_AND_SEARCH (gate_graph, visited_reords, index)

FIND_ALL_REORDERINGS (gate_graph)
    visited_reords
The algorithm in Figure 4 works for gates that can be represented with a series-parallel graph. The gates of typical libraries can all be represented with this type of graph.
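For series-parallel gates, the set of reorderings can also be enumerated directly from the series-parallel structure. The Python sketch below does this by permuting the children of every series node. It is a different and simpler technique than the pivoting search of Figure 4, shown only to make the search space concrete. The tuple encoding of the networks and the pull-up/pull-down decomposition of the example gate are assumptions.

```python
from itertools import permutations, product

def reorderings(network):
    """Enumerate transistor orderings of a series-parallel network.
    A network is an input name, or ("series" | "parallel", [children]).
    Only the order of series children is varied; the order of parallel
    branches is electrically irrelevant and is left unchanged."""
    if isinstance(network, str):
        return [network]
    kind, children = network
    child_options = [reorderings(child) for child in children]
    results = []
    for combo in product(*child_options):
        if kind == "series":
            for perm in permutations(combo):
                results.append((kind, list(perm)))
        else:
            results.append((kind, list(combo)))
    return results

# Hypothetical networks for a gate built around (a1 + a2) b.
pull_down = ("series", [("parallel", ["a1", "a2"]), "b"])
pull_up = ("parallel", [("series", ["a1", "a2"]), "b"])

configurations = [
    (down, up)
    for down in reorderings(pull_down)
    for up in reorderings(pull_up)
]
print(len(configurations))
```

Each network admits two orderings here, giving four combined configurations. This agrees with the number of configurations mentioned for Figure 1(a), although the correspondence with that figure is an assumption since the figure is not reproduced in the text.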
Scenarios for the experiments
A wide range of MCNC circuits have been used as benchmarks. They have been mapped into the gate library shown in Table 2. In some cases, to obtain all transistor reorderings of a gate, it is necessary to have more instances of that gate. For example, there are two instances of gate oai21: oai21[A], which is able to implement configurations (A) and (B) of Figure 1(a), and oai21[B], which is able to implement configurations (C) and (D). All instances of the gates in Table 2 have been implemented in a Sea-of-Gates design style.
Two scenarios have been considered to evaluate the power-consumption savings obtained with the transistor reordering technique (see Figure 6(a)). In Scenario A, the circuit is considered to be embedded in a larger digital system. Thus, the equilibrium probability and the transition density of the inputs of the circuit may take very different values. In this scenario, the probabilities and transition densities of the primary inputs of each circuit are randomly set with a uniform distribution. Probabilities range from 0 to 1 and transition densities range from 0 to 1 million transitions per second. In Scenario B, the circuit is considered to be the whole digital system, with latches at its inputs and working at a fixed frequency. In this scenario, the probability and the transition density of the primary inputs of the circuit are set to, respectively, 0.5 and 0.5 transitions per cycle. In both scenarios, the optimization algorithm has been applied to the original gate-level description of the circuits to obtain, for each gate, the best instance and, for each instance, the best input reordering. Because all instances of the same gate have the same area, the total area of the optimized circuit remains the same.
Table 3: Results obtained for several MCNC benchmarks for both scenarios considered. The number of gates is given in column G.
For each scenario and circuit, two new gate-level descriptions have been created. One of them contains the best transistor reordering for low power for all gates found with the optimization algorithm, whereas the other one contains the worst one. A switch-level simulator [11] extracts the power consumption of each description. Thus, the maximum power reduction for each scenario is obtained. The input signals to the circuits used by the switch-level simulator have been generated with an exponential distribution, i.e. time intervals between two consecutive transitions of input signal $k$ follow an exponential distribution with average $1/D_k$, $D_k$ being the transition density of input signal $k$.

Table 3 shows the results obtained. Columns M and S show the power-consumption reduction (best case with regard to worst case for low power) obtained with the model and with switch-level simulations respectively. Column D shows the increase in delay (best case for low power with regard to a mapping into the original cell library). The delay increases in most of the benchmarks because the best transistor reorderings of a gate for low power and for low delay do not always coincide. In fact, the rule of thumb that states that the critical transistor should always be placed near the output terminal to obtain a fast gate contradicts the low-power rule of placing it close to the ground node, as can be observed in the motivation example (case (2)) in Section 1.1 and in [9].
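The stimulus generation described above, with exponentially distributed intervals of mean 1/D_k between consecutive transitions, can be sketched in Python with the standard library. The function name, the simulation horizon and the fixed seed are illustrative assumptions, and the sketch produces transition times only, not the waveform format expected by any particular simulator.

```python
import random

def transition_times(density, horizon, rng):
    """Return the instants (in seconds) at which a signal toggles, drawing
    each inter-transition interval from an exponential distribution with
    mean 1 / density, until `horizon` is reached."""
    times = []
    t = rng.expovariate(density)       # expovariate takes the rate, not the mean
    while t < horizon:
        times.append(t)
        t += rng.expovariate(density)
    return times

rng = random.Random(0)                 # fixed seed for repeatability
times = transition_times(density=1.0e6, horizon=1.0e-3, rng=rng)
print(len(times))                      # about density * horizon transitions
```

With a density of one million transitions per second and a one-millisecond horizon, the number of generated transitions fluctuates around one thousand from run to run of the random generator.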
It is shown that the average improvement in power consumption in scenario A is 12% with an average increase in delay of 4%. The estimated average improvement is 9%. The reason for this lower value in the estimated improvement is that the model, in general, overestimates the power consumption by an offset, thus leading to a lower estimated improvement.
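The effect of a constant offset on a relative improvement can be made explicit with a short worked equation. The symbols $W_{\text{worst}}$, $W_{\text{best}}$ and $o$ and the numerical values are hypothetical, chosen only so that the two percentages resemble the averages quoted above. They are not measurements from the experiments.

```latex
\begin{align*}
R_{\text{sim}} &= \frac{W_{\text{worst}} - W_{\text{best}}}{W_{\text{worst}}}, \\
R_{\text{model}} &= \frac{(W_{\text{worst}} + o) - (W_{\text{best}} + o)}{W_{\text{worst}} + o}
                 = \frac{W_{\text{worst}} - W_{\text{best}}}{W_{\text{worst}} + o}
                 < R_{\text{sim}} \quad (o > 0). \\[4pt]
\text{Example: } & W_{\text{worst}} = 100,\; W_{\text{best}} = 88,\; o = 33.3
  \;\Rightarrow\; R_{\text{sim}} = 12\%,\quad
  R_{\text{model}} = \frac{12}{133.3} \approx 9\%.
\end{align*}
```

The absolute saving is the same in both cases. Only the denominator grows, so an offset in the estimate lowers the reported percentage without affecting which reordering is selected.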
The power reduction in scenario B is roughly half the one in scenario A. The power and delay of latches and the clock line in scenario B have not been included in the results. In both scenarios there is a small average increase in delay.
Thus, significant power consumption reduction can be obtained in both scenarios with little average increase in delay and it is possible to achieve power reductions without increasing the delay of the circuit.
Conclusions
This paper shows that average power reductions of 12% with a 4% increase in delay can be achieved by applying the transistor reordering technique. An optimization algorithm that uses a power-consumption model of a static CMOS gate has been presented. This novel power-consumption gate model takes into account both the probabilities and the transition densities of the inputs of the gates that compose the circuit.
The results suggest that (a) current libraries may be upgraded with more instances of the gates with different transistor reorderings, so that an optimization algorithm can choose the best instance for power reduction and (b) it is possible to obtain power reductions without increasing the delay of the circuit. Our future work in the transistor reordering field is devoted to this second direction.
