In this age of portable electronic systems, the problem of logic synthesis for low power has acquired great importance. The most popular approach has been to target the widely accepted two-phase paradigm of technology-independent optimization and technology mapping for power minimization. Before mapping, each function of a multi-level network is decomposed into two-input gates. How this decomposition is done can have a significant impact on the power dissipation of the final implementation. The problem of decomposition for low power was recently addressed by Pedram et al. [11]. However, they ignore the power consumption due to glitches, which can be a sizeable fraction of the total power [3]. In this paper, we show how to obtain a transition-optimum binary tree decomposition (i.e., one that has the minimum number of transitions in the worst case, including those due to glitches) for some specific functions (AND, OR, and EX-OR) under the zero gate-delay model. For a non-zero gate-delay model, we present conditions under which our algorithm yields an optimum solution for such functions. We propose a straightforward extension of this algorithm to arbitrary functions and Boolean networks. Experimental results on a set of standard combinational benchmarks indicate that, on average, our algorithm generates networks (using two-input gates) that have 16% fewer transitions in the worst case than the networks generated by a simple-minded two-input technology-decomposition algorithm implemented in sis [8], a widely used logic synthesis system.
Introduction
The goal of synthesis is to generate a design that meets area, performance, and power constraints. While the fields of area and performance optimization have matured considerably, power optimization is still in its infancy. The most popular approach for power optimization has been to suitably modify the widely accepted two-phase logic synthesis paradigm of technology-independent optimization and technology mapping. Technology-independent optimization generates a network with minimum power using some reasonable power estimation model. Then, the mapping step implements the optimized representation on the target technology, such that the power consumed in the final implementation is minimized.
In this paper, we focus our attention on technology mapping.
The first step in technology mapping is to decompose each function of the optimized network into two-input gates. How this decomposition is done can have a significant impact on the power dissipation of the final implementation [11, 10]. Recently, the problem of decomposition for low power was addressed by Pedram et al. [11]. However, they ignore the power consumption due to glitches, which accounts for a significant fraction of the total power dissipated in the circuit. According to [3], an 8-bit ripple-carry adder with a uniformly distributed set of random input patterns typically consumes an extra 30% in energy because of glitches. This motivated us to investigate the problem of decomposing logic functions into two-input gates so as to minimize the total power dissipated in the resulting structure. In this paper, we show how to obtain a transition-optimum binary tree decomposition (i.e., one that has the minimum number of transitions in the worst case, including those due to glitches) for some specific functions (AND, OR, and EX-OR) under the zero gate-delay model. For a non-zero gate-delay model, we present conditions under which our algorithm yields an optimum solution for such functions. We propose a straightforward extension of this algorithm to arbitrary functions and Boolean networks.

* This work, done when the first author was at the University of California, Berkeley, is supported in part by DARPA under contract number J-FBI-90-073. brayton,alberto@eecs.berkeley.edu
The paper is organized as follows. Terms used are defined in Section 2. The power model used and the problem being addressed are described in Section 3. Section 4 briefly mentions previous work on decomposition for low power. Huffman's algorithm, which is key to our solutions, is presented in Section 5. Sections 6, 7, and 8 describe the results we have proved and the algorithms we have proposed for the decomposition of logic functions for low power. In Section 9, we present experimental results demonstrating the effectiveness of our decomposition techniques.
Terminology
Different inputs to the circuit may change their values at different times. This implies that in a single clock cycle, the output of a gate may change several times before settling to its final value. These spurious changes are known as glitches. The glitching activity is a function of the input arrival times, input assignments (values), the internal signal values, circuit structure, and gate delays. We capture the overall activity at a gate (or node) of a circuit by a transition vector.
The transition vector at a node g, X(g), is the set of all possible times at which the output of g can switch (make a transition). We define X(p) for a primary input p to be the set with a single element, which is the arrival time of p.
X(g) depends on the delay through the node g and the arrival times at the inputs of g. If d(g) denotes the delay through the node g, and FI(g) the set of fanins of g,

$$X(g) = \bigcup_{j \in FI(g)} \bigl( d(g) + X(j) \bigr) \qquad (1)$$

The notation d(g) + X(j) refers to the set obtained by adding d(g) to each element of X(j).
For instance, if a two-input node g has a delay of 4 units and has as its fanins primary inputs x1 and x2 arriving at times 1 and
Given the arrival times of the primary inputs, (1) can be used to compute the transition vector at each node of the network if the network is traversed topologically from primary inputs to primary outputs.
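The topological computation of transition vectors can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `transition_vectors`, the dictionary-based network encoding, and the example delays and arrival times are all assumptions made for the sketch.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def transition_vectors(fanins, delay, arrival):
    """Return {node: set of possible switching times}.

    fanins  : {gate: list of fanin names}      (hypothetical encoding)
    delay   : {gate: delay through the gate}
    arrival : {primary input: arrival time}
    """
    # A primary input switches only at its arrival time.
    X = {p: {t} for p, t in arrival.items()}
    # static_order() yields each node after all of its fanins.
    for g in TopologicalSorter(fanins).static_order():
        if g in X:
            continue  # primary input, already initialized
        # Union over the fanins j of (d(g) + X(j)).
        X[g] = {delay[g] + t for j in fanins[g] for t in X[j]}
    return X


# Hypothetical two-gate network: g1 reads x1, x2; g2 reads g1, x3.
fanins = {"g1": ["x1", "x2"], "g2": ["g1", "x3"]}
delay = {"g1": 2, "g2": 1}
arrival = {"x1": 0, "x2": 3, "x3": 4}
X = transition_vectors(fanins, delay, arrival)
print(X["g1"], X["g2"])
```

In this example g1 can switch at times 2 and 5, and g2 at times 3, 5, and 6, so g2 may make up to three transitions in one clock cycle.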
Power Model
We consider only static CMOS circuits. For such circuits, the average power dissipated by a gate g is given by

$$P_{avg}(g) = \frac{1}{2} \, C_l \, V_{dd}^2 \, \frac{E(\text{switching})}{T}$$

where $C_l$ is the load capacitance seen at the output of g, $V_{dd}$ is the supply voltage, E(switching) is the expected number of transitions at the output of g per clock cycle, and T is the clock cycle. The power dissipated in the entire circuit is the sum of the power dissipated in each gate. If we assume the load capacitance to be the same for all gates, the total power dissipated in the circuit is directly proportional to the sum of the expected number of transitions at the output of each gate. Computing this number is somewhat complicated, since a transition at a gate input may not cause any change at the gate output. For instance, consider a two-input AND gate. If one of its inputs is 0, the changes at the other input do not affect the value of the output. In this work, we consider the worst-case behavior with respect to the transition activity. We make the following assumptions.
1. Each input makes a transition (at its arrival time).
2. A transition at the input of a gate causes a transition at the output of the gate.

Under this model, the expected number of transitions is the same as the total number of transitions. Thus, we model the power dissipated in the circuit by the total number of transitions in the worst case, including those due to glitches.

Although this model is pessimistic in that it overestimates the number of transitions (realistically, each function input may not make a transition every clock cycle, and the output of a gate may not make a transition when its input makes a transition), it has the nice feature of being simple, as it does not involve probabilistic computations. As we will see shortly, this is key to deriving a provably optimum algorithm for some special classes of functions under certain delay models.

The problem of decomposition for minimum power dissipation can then be stated as follows.

We restrict ourselves to a tree decomposition, since the problem of decomposition into a general graph is more difficult.
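The reduction from power to a transition count can be written out explicitly. The derivation below is a sketch that assumes the usual switching-power expression with a factor of 1/2 and a common load capacitance $C_l$ for all gates; the symbol $E_g$ (expected transitions per cycle at gate g) is introduced here only for illustration.

$$P_{total} = \sum_{g} \frac{1}{2} \, C_l \, V_{dd}^2 \, \frac{E_g}{T} = \frac{C_l V_{dd}^2}{2T} \sum_{g} E_g$$

Under the two worst-case assumptions, every element of the transition vector corresponds to an actual transition, so $E_g = |X(g)|$ and

$$P_{total} \propto \sum_{g} |X(g)|$$

Minimizing worst-case power is therefore equivalent to minimizing the sum of the transition-vector sizes over all gates.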
Previous Work
There has been a flurry of research activity in the last couple of years on power estimation and synthesis for low power [4, 1, 9, 2, 7, 11, 10]. In their seminal work, Ghosh et al. [4] proposed techniques based on symbolic simulation for estimating the power consumed in combinational and sequential networks. Shen et al. [9] presented logic optimization algorithms for reducing power consumption and derived two-level representations that reduce switching probabilities.

* Footnote to the tree decomposition of Section 3: in general, a leaf-dag decomposition, since some inputs may feed two or more gates.
However, not much work has been done on Problem 3.1. We are aware of the following efforts.
In [3], Chandrakasan et al. considered a simple function for which they showed that a balanced tree results in the fewest transitions, including those due to glitches. However, they assumed that all the inputs have identical arrival times. As we will show, a transition-optimum tree may not always be balanced. This happens when some, but not all, of the inputs have identical arrival times. The authors of [10] presented a technology mapper for low power. They also assume a zero gate-delay model. To eliminate glitches to some extent, they suggest a post-mapping buffering step that makes all the paths of equal length.
Our work is a first step towards reasoning about glitches and accounting for them during synthesis.
Huffman's Algorithm
Consider a binary tree T with n inputs $x_1$ through $x_n$, where input $x_i$ has weight $w_i$. The length of the path between input $x_i$ and the root of the tree T, denoted $l_i$, is the number of nodes on the path. The weighted path length of the tree T is $\sum_i w_i l_i$.
Given $x_1$ through $x_n$ with weights $w_1$ through $w_n$ respectively, consider the problem of constructing a binary tree with minimum weighted path length. An elegant algorithm for constructing such a binary tree was given by Huffman [5], and is as follows.
1. Combine the two weights of lowest value, say $w_1$ and $w_2$ (without loss of generality, $w_1 \le w_2 \le w_3 \le \cdots \le w_n$). This generates a node n of the tree with children $x_1$ and $x_2$. The output of n is added to the input set with the weight $w_1 + w_2$.
2. Recursively solve the problem for n - 1 weights: $(w_1 + w_2), w_3, \ldots, w_n$.
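The two steps above can be sketched with a priority queue. This is an illustrative sketch rather than code from the paper: the function name `huffman_tree` and the nested-tuple tree encoding are assumptions. It relies on the fact that the weighted path length equals the sum of the weights of all internal nodes, since each input weight is counted once for every node on its path to the root.

```python
import heapq
from itertools import count


def huffman_tree(weights):
    """Return (tree, weighted_path_length) for a non-empty list of weights.

    Leaves are input indices; an internal node is a pair (left, right).
    """
    tie = count()  # unique tie-breaker so the heap never compares trees
    heap = [(w, next(tie), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        # Step 1: combine the two lowest weights into a new node.
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        total += w1 + w2  # weight of the new internal node
        # Step 2: continue with the reduced set of n - 1 weights.
        heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))
    return heap[0][2], total


tree, wpl = huffman_tree([1, 2, 3, 4])
print(tree, wpl)
```

For the weights 1, 2, 3, 4 the path lengths are 3, 3, 2, 1, giving a weighted path length of 3 + 6 + 6 + 4 = 19.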
Decomposition of Simple Functions
First, we restrict the problem to some simple functions, namely n-input AND, OR, and EX-OR. In Section 6.1, we consider the simplest possible case, that of zero gate-delays. We present an algorithm to solve Problem 3.1 optimally. From a logical viewpoint it does not make sense to talk about glitching under the zero gate-delay model. However, in the presence of input arrival times, a gate with zero delay can still make more than one transition. Moreover, as we will show, the algorithm presented for this model can be generalized for the non-zero delay model. In Section 6.2, we comment on the non-zero delay case. We do not know yet how to solve the problem optimally in the general case. However, for a special case, we present an optimum algorithm based on Huffman's algorithm. The key insight is drawing a correspondence between the weight of a tree node in Huffman's algorithm (see the comment at the end of the last section) and the cardinality of the transition vector at the corresponding gate in the tree decomposition of the function.
Gates with Zero Delays
In this model, each two-input gate has a gate-delay of zero. First consider the case when all the inputs arrive at distinct times, i.e., 
Then use Huffman's algorithm to construct the tree T .
Note that each node of the tree T represents a two-input gate. For instance, if f is an n-input AND, each node of T is a two-input AND.
Proposition 6.1 Given an n-input function f (AND, OR, or EX-OR), no two of whose inputs arrive at the same time, Procedure 6.1 generates a tree T that has the minimum number of transitions under the zero gate-delay model.
Proof
The total number of transitions in a circuit is the sum of the number of transitions at the output of each gate. Since all the gates have zero delays, the transition vector X(g) at each node g of a tree circuit is simply the set of arrival times of the inputs in the transitive fanin of g. Since the arrival times are distinct and we are only considering tree decompositions, $X(g) \cap X(h) = \emptyset$ for any two distinct nodes g and h, neither of which lies in the transitive fanin of the other. Then, X(g) can be computed by simply taking the disjoint union of the transition vectors of g's fanins. In other words, |X(g)| is the sum of the number of transitions at g's fanins. Using the comment at the end of Section 5, it follows that the problem of finding a binary tree with the minimum number of transitions becomes identical to that of finding a tree with the minimum weighted path length, with the weight $w_i$ of each input $x_i$ set to 1. Note that the weight of a node corresponds to the number of transitions at its output. Also, since each input $x_i$ contributes one transition (at its arrival time), $w_i$ is set to 1. Since Huffman's algorithm solves the minimum weighted path length problem optimally, it also solves Problem 3.1 optimally in this setting. In other words, at each stage the procedure combines two signals with the least weight, i.e., the fewest transitions.
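The correspondence between transition counts and weighted path length can be checked on a small case. The sketch below is illustrative only: `count_transitions`, the nested-tuple tree encoding, and the four arrival times are assumptions. It sums |X(g)| over the gates of a given tree under the zero gate-delay model.

```python
def count_transitions(tree, arrival):
    """Return (X, total) for a tree of zero-delay two-input gates.

    tree    : an input index (leaf) or a pair (left, right) (a gate)
    arrival : list of arrival times, indexed by input
    X       : transition vector at the root of `tree`
    total   : sum of |X(g)| over all gates in `tree`
    """
    if not isinstance(tree, tuple):  # leaf: a primary input
        return {arrival[tree]}, 0
    left, right = tree
    x_left, t_left = count_transitions(left, arrival)
    x_right, t_right = count_transitions(right, arrival)
    x = x_left | x_right  # zero delay: union of the fanin vectors
    return x, t_left + t_right + len(x)


arrival = [1, 2, 3, 4]  # four inputs with distinct arrival times
balanced = ((0, 1), (2, 3))
chain = (((0, 1), 2), 3)
print(count_transitions(balanced, arrival)[1])  # gates contribute 2 + 2 + 4
print(count_transitions(chain, arrival)[1])     # gates contribute 2 + 3 + 4
```

With distinct arrival times the balanced tree has 8 transitions and the chain has 9, matching the weighted path lengths of the two trees with unit weights.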
Observe that a balanced tree decomposition is obtained in this case. Interestingly, Chandrakasan et al. concluded that a balanced tree results in the fewest transitions, including the transitions due to glitches [3]. However, they assumed that all the inputs have identical arrival times. As we show soon, the optimum tree may not necessarily be balanced. This happens when some, but not all, of the inputs have identical arrival times. Now we consider the most general case, in which inputs can arrive at arbitrary times. In other words, some of the arrival times can be the same. We propose an algorithm that is a slight modification of Procedure 6.1. The following proposition states that this procedure generates a tree with the minimum number of transitions.
Proposition 6.2 Given an n-input function f (AND, OR, or EX-OR) whose inputs arrive at arbitrary times, Procedure 6.2 generates a tree T that has the minimum number of transitions under the zero gate-delay model.
We skip the proof due to lack of space.
Note that the part of the tree T generated by Huffman's algorithm in step 3 of Procedure 6.2 has the property that every pair of nodes in this part has mutually disjoint transition vectors. The part of the tree generated in step 2 has the property that the transition vectors of the inputs and of the output of each node are the same, each having one element.
Procedure 6.2 may not generate a balanced tree. The following example illustrates this. 
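A small hypothetical instance makes the point concrete. The sketch below does not implement Procedure 6.2; instead it searches all binary trees exhaustively (feasible only for a handful of inputs) under the zero gate-delay model. The function name `best_tree` and the arrival times are assumptions chosen for illustration.

```python
from itertools import combinations


def best_tree(signals):
    """Exhaustively search all binary trees over `signals` (zero gate delay).

    Each signal is a pair (transition_vector, tree).
    Returns (minimum total transitions, a tree achieving it).
    """
    if len(signals) == 1:
        return 0, signals[0][1]
    best = None
    for i, j in combinations(range(len(signals)), 2):
        (xi, ti), (xj, tj) = signals[i], signals[j]
        merged = (xi | xj, (ti, tj))  # output of a new two-input gate
        rest = [s for k, s in enumerate(signals) if k not in (i, j)]
        total, tree = best_tree(rest + [merged])
        total += len(merged[0])  # transitions at the new gate
        if best is None or total < best[0]:
            best = (total, tree)
    return best


# Three inputs arrive at time 1, one input arrives at time 2.
arrival = {"a": 1, "b": 1, "c": 1, "d": 2}
signals = [(frozenset({t}), name) for name, t in arrival.items()]
min_total, tree = best_tree(signals)
print(min_total, tree)
```

The optimum found is a chain with 4 transitions: the three inputs arriving at time 1 are combined first (one transition at each of those two gates), and d is added last (two transitions). Any balanced tree must pair d with an input arriving at time 1, giving 1 + 2 + 2 = 5 transitions.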
Gates with Non-zero Delays
The decomposition problem becomes more complicated if the gates are assumed to have non-zero delays. While in the case of the zero gate-delay model no signal needs buffering (a buffer, which has zero delay too, does not create any opportunity for reducing transitions further down the tree; it only adds to the total number of transitions), in the case of non-zero gate-delays, buffering may reduce the total number of transitions. Currently we are searching for an algorithm to solve the problem optimally for non-zero gate-delays, both with and without buffers.
Decomposition of Arbitrary Functions
The above discussion holds for n-input AND, OR, EX-OR functions. We now try to extend the results to arbitrary functions. First consider the case of zero gate-delays.
Gates with Zero Delays
We are given a logic function f with n inputs, $x_1$ through $x_n$, whose arrival times are $a_1$ through $a_n$ respectively. Consider a sum-of-products representation of f: an AND tree is built for each cube, and the cube-functions are then combined in an OR tree $T_{OR}$. The transition vectors at the inputs of a node of the tree T in Procedure 6.2 satisfy the following property: either they consist of single elements that are the same or they are disjoint. This property is crucial in proving the optimality of the procedure. For the OR-tree $T_{OR}$, the cube-functions may not satisfy this property. For instance, if f = xy + x'z, with arrival times of the inputs a(x) = 1, a(y) = 2, a(z) = 3, the cube-functions will have transition vectors {1, 2} and {1, 3} (assuming inverters have zero delays). These vectors do not satisfy the above property. So for the OR-tree, an extension needs to be made to handle arbitrary transition vectors. Let us study Procedures 6.1 and 6.2 to gain some insight for handling arbitrary transition vectors. Procedure 6.1 can be seen as combining two signals of least weights (i.e., smallest sizes of the transition vectors), whereas Procedure 6.2 can be looked at as combining those two signals such that the union of their transition vectors is as small as possible.
1. min-trans-in: At each step, combine two signals with the smallest transition vectors. This is exactly like Huffman's algorithm.
2. min-trans-out: At each step, combine those two signals such that the resulting signal has the minimum number of transitions. In other words, select those two signals $f_i$ and $f_j$ such that $|X(f_i) \cup X(f_j)|$ is minimum.
Both techniques yield an optimum tree for a function f whose cubes have disjoint inputs, if no two inputs arrive at the same time. An example of such a function is f = abc + de + ghij.

... and generates g4 with X(g4) = {1, 2, 3, 4}. Finally, it combines the output of g4 with that of g3, generating g5 with X(g5) = {1, 2, 3, 4}. This is shown in Figure 8 ... one less than the earlier case.
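The difference between the two heuristics can be seen on a small hypothetical set of transition vectors (not the example from the figure). The sketch below assumes zero gate delays; the function name `decompose` and the input vectors are illustrative assumptions.

```python
from itertools import combinations


def decompose(vectors, strategy):
    """Greedily build a tree of zero-delay two-input gates.

    vectors  : transition vectors of the signals to be combined
    strategy : "min-trans-in" or "min-trans-out"
    Returns the total number of worst-case transitions at the gate outputs.
    """
    signals = [frozenset(v) for v in vectors]
    total = 0
    while len(signals) > 1:
        if strategy == "min-trans-in":
            # The two signals with the smallest transition vectors.
            order = sorted(range(len(signals)), key=lambda k: len(signals[k]))
            i, j = order[0], order[1]
        else:
            # The pair whose combined output has the fewest transitions.
            i, j = min(combinations(range(len(signals)), 2),
                       key=lambda p: len(signals[p[0]] | signals[p[1]]))
        out = signals[i] | signals[j]
        total += len(out)
        signals = [s for k, s in enumerate(signals) if k not in (i, j)]
        signals.append(out)
    return total


vectors = [{1, 2, 3}, {1, 2, 3}, {4}, {5}]
print(decompose(vectors, "min-trans-in"))
print(decompose(vectors, "min-trans-out"))
```

Here min-trans-in yields 2 + 5 + 5 = 12 transitions, because it merges {4, 5} with one copy of {1, 2, 3} before the two identical vectors meet. min-trans-out yields 2 + 3 + 5 = 10, because it notices that combining the two identical vectors produces only three transitions.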
Gates with Non-zero Delays
The approaches of the last section can be directly applied, except that the transition vector, X(g), at the output of a gate g whose inputs are a and b, is now computed as

$$X(g) = \bigl( d(g) + X(a) \bigr) \cup \bigl( d(g) + X(b) \bigr)$$

where d(g) is the delay of the gate g.
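A short sketch shows how a non-zero delay can create a glitch that the zero-delay model does not see. The helper name `combine` and the three-input example are assumptions made for illustration.

```python
def combine(xa, xb, d):
    """Transition vector at a two-input gate with delay d and fanin vectors xa, xb."""
    return {d + t for t in xa | xb}


# Three inputs all arriving at time 0, combined as a chain ((a, b), c).
a = b = c = {0}

# Unit gate delay: the second gate sees its inputs at times 1 and 0.
g1 = combine(a, b, d=1)
g2 = combine(g1, c, d=1)
unit_delay_total = len(g1) + len(g2)

# Zero gate delay: every signal still switches only at time 0.
h1 = combine(a, b, d=0)
h2 = combine(h1, c, d=0)
zero_delay_total = len(h1) + len(h2)

print(g2, unit_delay_total, zero_delay_total)
```

With unit delays the second gate can switch at times 1 and 2, for 3 transitions in total, whereas the zero-delay model counts only 2. The extra transition comes from the unequal path lengths to the second gate.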
Multi-level Networks
It is now straightforward to extend the techniques proposed for a single function in the last section to arbitrary Boolean networks, where one function feeds another. We are given the arrival time of each primary input of the network. We traverse the network topologically, i.e., visit a node N only after visiting all its fanins. The transition vectors at the fanins are thus known when N is visited. A two-input AND-OR tree is constructed for the node function f at N using min-trans-in or min-trans-out. Note that the inputs of f may no longer have a single transition time, but a set of them. Both min-trans-in and min-trans-out handle this case. A by-product of the tree construction is the computation of the transition vector at the root of the tree, which can then be used to determine the tree structure of the fanout nodes. The procedure terminates after all the primary outputs of the network have been visited.
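The network-level procedure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: each node function is given as a list of cubes over fanin names, inverters are ignored (treated as having zero delay), every two-input gate has unit delay, and only min-trans-out is shown. The names `build_tree`, `decompose_network`, and `GATE_DELAY` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from itertools import combinations

GATE_DELAY = 1  # assumed delay of every two-input AND/OR gate


def build_tree(vectors):
    """min-trans-out: merge the pair whose output has the fewest transitions.

    Returns (transition vector at the root, transitions at the new gates).
    """
    signals, total = [frozenset(v) for v in vectors], 0
    while len(signals) > 1:
        i, j = min(combinations(range(len(signals)), 2),
                   key=lambda p: len(signals[p[0]] | signals[p[1]]))
        out = frozenset(GATE_DELAY + t for t in signals[i] | signals[j])
        total += len(out)
        signals = [s for k, s in enumerate(signals) if k not in (i, j)]
        signals.append(out)
    return signals[0], total


def decompose_network(sop, arrival):
    """sop: {node: list of cubes}, each cube a list of fanin names."""
    X = {p: frozenset({t}) for p, t in arrival.items()}
    deps = {n: {v for cube in cubes for v in cube} for n, cubes in sop.items()}
    total = 0
    for n in TopologicalSorter(deps).static_order():  # fanins first
        if n in X:
            continue  # primary input
        cube_vectors = []
        for cube in sop[n]:
            vec, t = build_tree([X[v] for v in cube])  # AND tree of the cube
            cube_vectors.append(vec)
            total += t
        X[n], t = build_tree(cube_vectors)  # OR tree over the cubes
        total += t
    return X, total


# Hypothetical network: n1 = ab + c, n2 = n1 d, all inputs arriving at time 0.
sop = {"n1": [["a", "b"], ["c"]], "n2": [["n1", "d"]]}
X, total = decompose_network(sop, {"a": 0, "b": 0, "c": 0, "d": 0})
print(sorted(X["n2"]), total)
```

The transition vector computed at the root of n1 ({1, 2}) is reused when n2 is decomposed, which is the by-product mentioned above. The example network has 1 + 2 + 3 = 6 worst-case transitions.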
Experimental Results
We took a set of optimized combinational MCNC benchmark examples and did the following experiments. We assumed that the delay of each 2-input AND and OR gate is 1 unit, and that the arrival times of the primary inputs of the network are 0. For each internal node of the optimized multi-level network, we did the following:
td: Decomposed the sum-of-products representation into 2-input AND and OR gates using tech_decomp -a 2 -o 2 of sis [8]. tech_decomp first builds a tree for each cube and then constructs the OR tree. However, the trees are built by combining inputs and intermediate signals arbitrarily.
Huffman-based: Since the fanins of a node may not be primary inputs, they have in general transition vectors associated with them. First, we decomposed each product term of the sum-of-products representation into two-input AND gates and then ORed the cube-functions two at a time. The fanins to be combined at each step were chosen based on the Huffman-based algorithms described in Section 7. They are
1. min-trans-in: At each step, combine two signals with the smallest transition vectors.
2. min-trans-out: At each step, combine those two signals such that the resulting signal has the minimum number of transitions.
The total number of transitions in the resulting networks is computed for each option and shown in the corresponding columns of Table 1.

[Tables 1 and 2 (column and row labels): # transitions at the gate output; sum over all networks of estimated power consumed]
We also conducted another experiment. Instead of counting the total number of transitions, we used the power estimation tool developed by Ghosh et al. [4] to estimate the power dissipated by the two-input AND-OR network generated by either our algorithm or the td algorithm of sis. This power estimation tool uses a symbolic simulation method and takes glitching into account. In the estimator, we set the delay of each two-input gate to 1, the same as that used at the time of decomposition. Also, the probability of each primary input being a one is one half. The results are shown in Table 2. On some of the large ISCAS benchmarks (e.g., C1355, C1908), the power estimator could not finish, and so we do not report results for them. On the rest, on average, a 4% improvement in power is obtained using min-trans-out. This is only a small improvement. We attribute it to the discrepancies in the models used. Our algorithm generates a decomposition tree for the worst case (each input transition causes an output transition), whereas the model used in [4] estimates only the average switching power.
Future work lies in the following directions:
1) We plan to work on an optimum algorithm for the non-zero delay model.
2) Instead of minimizing the worst-case number of transitions, as we did, it will be more accurate to minimize the expected number of transitions. The model needs to be modified to reflect the dependence of the output transition on the input signal values.
3) For a general function, we considered a specific form of decomposition: building an AND-OR tree from a sum-of-products representation. It may be better to start from a factored form.
4) The final goal of synthesis for low power is to generate a mapped circuit that dissipates minimum power. The decomposition generated by our technique may have fewer transitions, but there is no guarantee that applying mapping on this decomposition will yield a circuit that also has fewer transitions. A comprehensive study of this aspect needs to be done.
