We address the problem of optimizing logic-level sequential circuits for low power. We present a powerful sequential logic optimization method that is based on selectively precomputing the output logic values of the circuit one clock cycle before they are required, and using the precomputed values to reduce internal switching activity in the succeeding clock cycle. We present two different precomputation architectures which exploit this observation.
Introduction
Average power dissipation has recently emerged as an important parameter in the design of general-purpose and application-specific integrated circuits. Optimization for low power can be applied at many different levels of the design hierarchy. For instance, algorithmic and architectural transformations can trade off throughput, circuit area, and power dissipation [5], and logic optimization methods have been shown to have a significant impact on the power dissipation of combinational logic circuits [12].
In CMOS circuits, the probabilistic average switching activity of a circuit is a good measure of the average power dissipation of the circuit. Average power dissipation can thus be computed by estimating the average switching activity. Several methods to estimate power dissipation for CMOS combinational circuits have been developed (e.g., [7, 10]). More recently, efficient and accurate methods of power dissipation estimation for sequential circuits have been developed [9, 13]. In this work, we are concerned with the problem of optimizing logic-level sequential circuits for low power. Previous work in the area of sequential logic synthesis for low power has focused on state encoding (e.g., [11]) and retiming [8] algorithms. We present a powerful sequential logic optimization method that is based on selectively precomputing the output logic values of the circuit one clock cycle before they are required, and using the precomputed values to reduce internal switching activity in the succeeding clock cycle.
*Currently at AT&T Bell Laboratories, Allentown, PA
The primary optimization step is the synthesis of the precomputation logic, which computes the output values for a subset of input conditions. If the output values can be precomputed, the original logic circuit can be "turned off" in the next clock cycle and will not have any switching activity. Since the savings in the power dissipation of the original circuit are offset by the power dissipated in the precomputation phase, the selection of the subset of input conditions for which the output is precomputed is critical. The precomputation logic adds to the circuit area and can also result in an increased clock period.
Given a logic-level sequential circuit, we present an automatic method of synthesizing the precomputation logic so as to achieve a maximal reduction in switching activity. We present experimental results on various sequential circuits. For some circuits, 75% reductions in average power dissipation are possible with marginal increases in circuit area and delay.
The model we use to relate switching activity to power dissipation can be found in [7]. In Section 2 we describe two different precomputation architectures. An algorithm that synthesizes precomputation logic so as to achieve power dissipation reduction is presented in Section 3. In Section 4 we describe a method for multiple-cycle precomputation. In Section 5 we describe additional precomputation architectures which are the subject of ongoing research. Experimental results are presented in Section 6.
Precomputation Architectures
We describe two different precomputation architectures and discuss their characteristics in terms of their impact on power dissipation, circuit area and circuit delay.
First Precomputation Architecture
Consider the circuit of Figure 1. We have a combinational logic block A that is separated by registers R1 and R2. We will first assume that block A has a single output and that it implements the Boolean function f. The first precomputation architecture is shown in Figure 2. Two Boolean functions g1 and g2 are the predictor functions. We require:
$$g_1 = 1 \Rightarrow f = 1 \qquad (1)$$
$$g_2 = 1 \Rightarrow f = 0 \qquad (2)$$
Therefore, during clock cycle t, if either g1 or g2 evaluates to a 1, we set the load enable signal of the register R1 to be 0. This means that in clock cycle t + 1 the inputs to the combinational logic block A do not change.
If g1 evaluates to a 1 in clock cycle t, the input to register R2 is a 1 in clock cycle t + 1, and if g2 evaluates to a 1, then the input to register R2 is a 0. Note that g1 and g2 cannot both be 1 during the same clock cycle due to the conditions imposed by Equations 1 and 2. A power reduction in block A is obtained because, for the subset of input conditions corresponding to g1 + g2, the inputs to A do not change, implying zero switching activity. However, the area of the circuit has increased due to the additional logic corresponding to g1, g2, the two additional gates shown in the figure, and the two flip-flops marked FF. The delay between R1 and R2 has increased due to the addition of the AND-OR gate. Note also that g1 and g2 add to the delay of paths that originally ended at R1 but now pass through g1 or g2 and the NOR gate before ending at the load enable signal of the register R1. Therefore, we would like to apply this transformation on non-critical logic blocks.
The choice of g1 and g2 is critical. We wish to include as many input conditions as we can in g1 and g2. In other words, we wish to maximize the probability of g1 or g2 evaluating to a 1. In the extreme case this probability can be made unity if g1 = f and g2 = $\bar{f}$. However, this would imply a duplication of the logic block A.
Second Precomputation Architecture
In the architecture of Figure 3, the inputs to the block A have been partitioned into two sets, corresponding to the registers R1 and R2. The output of the logic block A feeds the register R3. The functions g1 and g2 satisfy the conditions of Equations 1 and 2 as before, but g1 and g2 only depend on a subset of the inputs to f. If g1 or g2 evaluates to a 1 during clock cycle t, the load enable signal to the register R2 is turned off. This implies that the outputs of R2 during clock cycle t + 1 do not change. However, since the outputs of register R1 are updated, the function f will evaluate to the correct logical value. A power reduction is achieved because only a subset of the inputs to block A change, which should produce reduced switching activity in most cases.
As before, g1 and g2 have to be significantly less complex than f and the probability of g1 + g2 being a 1 should be high in order to achieve substantial power gains. The delay of the circuit between R1/R2 and R3 is unchanged, allowing precomputation of logic that is on the critical path. However, the delay of paths that originally ended at R1/R2 has increased.
The choice of inputs to g1 and g2 has to be made first, and then the particular functions that satisfy Equations 1 and 2 have to be selected. A method to perform this selection is described in Section 3.
An Example
We give an example that illustrates the fact that substantial power gains can be achieved with marginal increases in circuit area and delay. The circuit we are considering is an n-bit comparator that compares two n-bit numbers C and D and computes the function C > D. The optimized circuit with precomputation logic is shown in Figure 4. The precomputation logic is as follows.
$$g_1 = C(n-1) \cdot \overline{D(n-1)}$$
$$g_2 = \overline{C(n-1)} \cdot D(n-1)$$
Clearly, when g1 = 1, C is greater than D, and when g2 = 1, C is less than D. We have to implement
$$\overline{g_1 + g_2} = C(n-1) \otimes D(n-1)$$
where $\otimes$ stands for the exclusive-nor operator.
Assuming a uniform probability for the inputs, the probability that the XNOR gate evaluates to a 1 is 0.5, regardless of n. For large n, we can neglect the power dissipation in the XNOR gate, and therefore we can expect to achieve a power reduction of close to 50%.
The reduction will depend upon the relative power dissipated by the vector pairs with $C(n-1) \otimes D(n-1) = 1$ and the vector pairs with $C(n-1) \otimes D(n-1) = 0$. If we add the inputs C(n-2) and D(n-2) to g1 and g2, we expect to achieve a power reduction close to 75%.
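The 50% and 75% figures quoted for the comparator can be checked by exhaustive enumeration. The sketch below is illustrative only: the function name `precompute_stats` and the parameter `k` (number of most significant bits seen by the predictors) are assumptions, inputs are taken to be uniformly distributed, and the fraction reported is the fraction of input pairs for which the remaining inputs could be held, not a power estimate.

```python
from itertools import product

def precompute_stats(n, k):
    """Fraction of (C, D) pairs for which the top k bits decide C > D.

    g1 and g2 are modelled as comparisons of the top k bits only,
    which for k = 1 reduces to the most-significant-bit predictors.
    """
    shift = n - k
    hits = 0
    total = 0
    for c, d in product(range(2 ** n), repeat=2):
        c_top, d_top = c >> shift, d >> shift
        g1 = c_top > d_top   # top bits already prove C > D
        g2 = c_top < d_top   # top bits already prove C < D
        f = c > d
        # Predictor conditions: g1 = 1 implies f = 1, g2 = 1 implies f = 0.
        if g1:
            assert f
        if g2:
            assert not f
        hits += int(g1 or g2)
        total += 1
    return hits / total

print(precompute_stats(4, 1))  # 0.5
print(precompute_stats(4, 2))  # 0.75
```

The internal assertions confirm that the predictors never contradict the comparator output, so disabling the low-order inputs in those cycles cannot change the result.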
Synthesis of Precomputation Logic

Introduction
In this section, we will describe methods to determine the functionality of the precomputation logic, and then describe methods to efficiently implement the logic.
We will focus primarily on the second precomputation architecture illustrated in Figure 3. In order to ensure that the precomputation logic is significantly less complex than the combinational logic in the original circuit, we will restrict ourselves to identifying g1 and g2 such that they depend on a relatively small subset of the inputs to the logic block A.
Precomputation and Observability Don't-Cares
Assume that we have a logic function f(X), with X = {x1, ..., xn}, corresponding to block A of Figure 2.
Given that the logic function implemented by block A is f, the observability don't-care set for input xi is given by:
$$ODC_i = f_{x_i} \cdot f_{\bar{x}_i} + \bar{f}_{x_i} \cdot \bar{f}_{\bar{x}_i}$$
where $f_{x_i}$ and $f_{\bar{x}_i}$ are the cofactors of f with respect to xi, and similarly for $\bar{f}$.
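As a small illustration of the observability don't-care set, the following sketch computes it by truth-table enumeration: an input combination belongs to the set for input i exactly when the two cofactors of f with respect to that input agree. The helper names `cofactor` and `odc` and the example function are hypothetical, and explicit enumeration is only practical for a handful of inputs.

```python
from itertools import product

def cofactor(f, i, value):
    """Return f with input i fixed to `value` (still takes a full input tuple)."""
    def g(x):
        y = list(x)
        y[i] = value
        return f(tuple(y))
    return g

def odc(f, n, i):
    """Input combinations for which the value of f does not depend on x_i."""
    f1, f0 = cofactor(f, i, 1), cofactor(f, i, 0)
    return {
        x for x in product((0, 1), repeat=n)
        if (f1(x) and f0(x)) or (not f1(x) and not f0(x))
    }

# Hypothetical example: f = x0 AND (x1 OR x2)
f = lambda x: x[0] & (x[1] | x[2])
print(sorted(odc(f, 3, 2)))  # x2 is unobservable when x0 = 0 or x1 = 1
```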
If we determine that a given input combination is in $ODC_i$, then we can disable the loading of xi into the register.
Consider the architecture of Figure 3. Assume that the inputs x1, ..., xm, with m < n, have been selected as the variables that g1 and g2 depend on. We have to find g1 and g2 such that they satisfy the constraints of Equations 1 and 2, respectively, and such that prob(g1 + g2 = 1) is maximum.
We can determine g1 and g2 using universal quantification on f. The universal quantification of a function f with respect to a variable xi is defined as:
$$U_{x_i} f = f_{x_i} \cdot f_{\bar{x}_i}$$
If we wish to disable the loading of the registers of the remaining inputs x(m+1), ..., xn, we use g1 + g2 as the (active low) load enable signal for those registers. We can compute the functionality of the precomputation logic as g1 + g2.
Selecting a Subset of Inputs
Given a function f we wish to select the "best" subset of inputs S of cardinality K. Given S, we have D = X - S and we compute g1 = $U_D f$ and g2 = $U_D \bar{f}$. In the sequel, we assume that the best set of inputs corresponds to the inputs which result in prob(g1 + g2 = 1) being maximum for a given K. We know that prob(g1 + g2 = 1) = prob(g1 = 1) + prob(g2 = 1) since g1 and g2 cannot both be 1 for the same input combination.
A branch and bound algorithm is used to determine the optimal set of inputs maximizing the probability of the g1 and g2 functions. This algorithm is shown in pseudo-code in Figure 5 and is described in detail in [1].
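To make the selection criterion concrete, here is a minimal sketch that computes g1 and g2 by universal quantification over the inputs outside a candidate set and picks the candidate with the highest prob(g1 + g2 = 1). It is not the branch and bound algorithm of Figure 5: it substitutes plain exhaustive search over all subsets, assumes uniformly distributed inputs, and uses truth tables rather than ROBDDs. The names `universal` and `select_inputs` are hypothetical.

```python
from itertools import combinations, product

def universal(f, n, keep):
    """Quantify f universally over every input not in `keep`.

    Returns a table mapping each assignment of the kept inputs to True
    iff f is 1 for all completions of the remaining inputs.
    """
    rest = [i for i in range(n) if i not in keep]
    table = {}
    for kept_vals in product((0, 1), repeat=len(keep)):
        holds = True
        for rest_vals in product((0, 1), repeat=len(rest)):
            x = [0] * n
            for i, v in zip(keep, kept_vals):
                x[i] = v
            for i, v in zip(rest, rest_vals):
                x[i] = v
            if not f(tuple(x)):
                holds = False
                break
        table[kept_vals] = holds
    return table

def select_inputs(f, n, k):
    """Exhaustively pick the k inputs maximising prob(g1 + g2 = 1)."""
    best_prob, best_set = -1.0, None
    for keep in combinations(range(n), k):
        g1 = universal(f, n, keep)                      # g1 = 1 implies f = 1
        g2 = universal(lambda x: not f(x), n, keep)     # g2 = 1 implies f = 0
        prob = sum(g1[a] or g2[a] for a in g1) / 2 ** k
        if prob > best_prob:
            best_prob, best_set = prob, keep
    return best_prob, best_set

# Hypothetical 2-bit comparator C > D with inputs ordered (c1, c0, d1, d0).
comparator = lambda x: (2 * x[0] + x[1]) > (2 * x[2] + x[3])
print(select_inputs(comparator, 4, 2))  # (0.5, (0, 2))
```

For this small comparator the search selects the two most significant bits, in agreement with the comparator example of Section 2.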
Implementing the Logic
The Boolean operations of OR and universal quantification required in the input selection procedure can be carried out efficiently using reduced, ordered Binary Decision Diagrams (ROBDDs) [4]. We obtain a ROBDD for the g1 + g2 function. A ROBDD can be converted into a multiplexor-based network (see [2]) or into a sum-of-products cover. The network or cover can be optimized using standard combinational logic optimization methods that reduce area [3] or those that target low power dissipation [12].
Multiple-Output Functions
In general, we have a multiple-output function 
Selecting a Subset of Outputs
In general, it is hard to find a set of inputs for which every output of a multiple-output function is precomputable. We have developed an algorithm which, given a multiple-output function, selects a subset of outputs and a subset of inputs so as to maximize a given cost function that is dependent on the probability of the precomputation logic and the number of selected outputs.
This algorithm is shown in pseudo-code in Figure 6 and is described in detail in [1].
Since we are only precomputing a subset of outputs, we may incorrectly evaluate the outputs that we are not precomputing as we disable certain inputs during particular clock cycles. If an output that is not being precomputed depends on an input that is being disabled, then the output will be incorrect.
Once a set of outputs G ⊆ F and a set of precomputation logic inputs S ⊆ X have been selected, we need to duplicate the registers corresponding to (support(G) - S) ∩ support(F - G). The inputs that are being disabled are in support(G) - S. Logic in the F - G outputs that depends on the set of duplicated inputs has to be duplicated as well. It is precisely for this reason that we maximize prG × gates(G)/total-gates rather than prG in the output-selection algorithm, as we want to reduce the amount of duplication as much as possible.
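The register duplication rule can be illustrated with ordinary set operations. In this sketch the function name `registers_to_duplicate`, the dictionary layout of `support`, and the example circuit are all hypothetical; only the expression (support(G) - S) ∩ support(F - G) comes from the text.

```python
def registers_to_duplicate(support, G, S):
    """Inputs whose registers must be duplicated.

    support: dict mapping each output name to the set of inputs it depends on
    G:       set of outputs selected for precomputation
    S:       set of inputs feeding the precomputation logic
    """
    support_G = set().union(*(support[o] for o in G))
    support_rest = set().union(*(support[o] for o in set(support) - set(G)))
    disabled = support_G - set(S)      # inputs held when g1 + g2 = 1
    return disabled & support_rest     # ...that other outputs still need

# Hypothetical three-output circuit.
support = {"f1": {"a", "b", "c", "d"}, "f2": {"c", "e"}, "f3": {"e"}}
print(registers_to_duplicate(support, G={"f1"}, S={"a", "b"}))  # {'c'}
```

Here inputs c and d are disabled, but only c also feeds an output that is not precomputed, so only its register needs a duplicate.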
Multiple Cycle Precomputation
Basic Strategy
It is possible to precompute output values that are not required in the succeeding clock cycle, but are required 2 or more clock cycles later. We give an example illustrating multiple-cycle precomputation.
Consider the circuit of Figure 7. The function f computes (C + D) > (X + Y) in two clock cycles. Attempting to precompute C + D or X + Y using the methods of the previous section does not result in any savings because there are too many outputs to consider. However, 2-cycle precomputation can reduce switching activity by close to 12.5% if the functions below are used.
$$g_1 = C(n-1) \cdot D(n-1) \cdot \overline{X(n-1)} \cdot \overline{Y(n-1)}$$
$$g_2 = \overline{C(n-1)} \cdot \overline{D(n-1)} \cdot X(n-1) \cdot Y(n-1)$$
Figure 6: Selecting a Subset of Outputs (procedure SELECT-OUTPUTS; body of the recursive step SELECT-OREC)
    lf = gates(G ∪ H)/total-gates × proldG ;
    if( lf ≤ BEST-COST )
        return ;
    BEST-PROB = total-gates/gates(G ∪ H) × BEST-COST ;
    if( G ≠ ∅ ) {
        if( SELECT-INPUTS( G, k ) == ∅ )
            return ;
        prG = BEST-PROB ;
        cost = prG × gates(G)/total-gates ;
        if( cost > BEST-COST ) {
            BEST-COST = cost ;
            SEL-OP-SET = G ;
        }
    }
    choose fi ∈ H such that i is minimum ;
    SELECT-OREC( G ∪ fi, H - fi, prG, k ) ;
    SELECT-OREC( G, H - fi, prG, k ) ;
    return ;
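Returning to the two-cycle example of Figure 7, the 12.5% figure can be checked by enumeration. The sketch below assumes that g1 is asserted when the most significant bits of C and D are both 1 and those of X and Y are both 0, that g2 covers the complementary case, that the sums are kept at full precision (no overflow truncation), and that inputs are uniformly distributed. The function name `two_cycle_stats` is hypothetical.

```python
from itertools import product

def two_cycle_stats(n):
    """Fraction of (C, D, X, Y) vectors decided by the MSB-only predictors."""
    top = 1 << (n - 1)
    hits = 0
    total = 0
    for c, d, x, y in product(range(1 << n), repeat=4):
        cm, dm, xm, ym = (bool(v & top) for v in (c, d, x, y))
        g1 = cm and dm and not xm and not ym   # C + D >= 2^n > X + Y
        g2 = not cm and not dm and xm and ym   # C + D < 2^n <= X + Y
        f = (c + d) > (x + y)
        # Predictor conditions must hold for the final comparison.
        if g1:
            assert f
        if g2:
            assert not f
        hits += int(g1 or g2)
        total += 1
    return hits / total

print(two_cycle_stats(3))  # 0.125
```

Each predictor fires for 1/16 of the input vectors, so together they cover 1/8 of the cases.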
Other Precomputation Architectures
In this section, we describe additional precomputation architectures. We first present an architecture that is applicable to all logic circuits and does not require, for instance, that the inputs should be in the observability don't-care set in order to be disabled. This was the case for the architectures shown in Section 2. We also extend precomputation so that it can be used in combinational logic circuits.
We implement the functions $f_{x_1}$ and $f_{\bar{x}_1}$. Depending on the value of x1, only one of the cofactors is computed while the other is disabled by setting the load-enable signal of its input register. The input x1 drives the select line of a multiplexor which chooses the correct cofactor.
The main advantage of this architecture is that it applies to all logic functions. The input x1 in the example was chosen for the purpose of illustration. In fact, any input could be chosen. Unlike the architectures described earlier, we do not require that the inputs being disabled should be don't-cares for the input conditions which we are precomputing. In other words, the inputs being disabled do not have to be in the observability don't-care set. A disadvantage of this architecture is that we need to duplicate the registers for the inputs not being used to turn off part of the logic. On the other hand, no precomputation logic functions have been added to the circuit.
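The behaviour of the multiplexor-based architecture can be mimicked in software: the function is split into its two cofactors with respect to one input, and that input selects which cofactor is evaluated. This is a behavioural sketch only; the name `make_mux_architecture`, the evaluation counters, and the example function are hypothetical, and counting evaluations merely stands in for "the other cofactor's registers are not loaded".

```python
from itertools import product

def make_mux_architecture(f, i):
    """Split f on input i; return an evaluator and per-cofactor counters."""
    evaluations = {"f_pos": 0, "f_neg": 0}

    def f_pos(rest):
        # Cofactor of f with x_i = 1.
        evaluations["f_pos"] += 1
        return f(rest[:i] + (1,) + rest[i:])

    def f_neg(rest):
        # Cofactor of f with x_i = 0.
        evaluations["f_neg"] += 1
        return f(rest[:i] + (0,) + rest[i:])

    def evaluate(x):
        rest = x[:i] + x[i + 1:]
        # x_i acts as the multiplexor select; the other cofactor stays idle.
        return f_pos(rest) if x[i] else f_neg(rest)

    return evaluate, evaluations

# Hypothetical example: f = (x0 AND x1) OR x2, split on x0.
f = lambda x: (x[0] & x[1]) | x[2]
evaluate, evaluations = make_mux_architecture(f, 0)
results = [evaluate(x) == f(x) for x in product((0, 1), repeat=3)]
print(all(results), evaluations)
```

Every input vector exercises exactly one cofactor, which is the source of the switching activity reduction, and no don't-care condition on the disabled inputs is needed.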
The algorithm to select the best input for this architecture is also quite different. We will not discuss this algorithm in detail, except to mention that in this case, we are interested in finding the input xi that yields the most area-efficient $f_{x_i}$ and $f_{\bar{x}_i}$ functions.
Combinational Logic Precomputation
The architectures described so far apply only to sequential circuits. We now describe precomputation of combinational circuits.
Suppose we have some combinational logic function f composed of two sub-functions A and B as shown in Figure 9(a). Suppose we also want to precompute this function with the inputs x4 and x5. The arrival time of an input is defined as the time at which the input settles to its steady state value [6]. If the delay constraint is not met, then it may be necessary to delay the x1, x2, and x3 inputs with respect to the x4 and x5 inputs in order to get the switching activity reduction in logic block B.
Experimental Results
We first present results on datapath circuits such as carry-select adders, comparators, and interconnections of adders and comparators in Table 1. The precomputation architecture of Figure 3 was used in all examples, and the selection of outputs and inputs to use for precomputation was done manually for examples csa16, add-comp16 and addmax16 and automatically (using the algorithms outlined in Figures 5 and 6) for the rest. For each circuit, the number of literals, levels of logic and power of the original circuit, the number of inputs, literals and levels of the precompute logic, the final power and the percent reduction in power are shown. All power estimates are in microwatts and are computed using the techniques described in [7]. A zero delay model and a clock frequency of 20 MHz were assumed. The rugged script of SIS was used to optimize the precompute logic.
Power dissipation decreases for almost all cases. For circuit comp16, a 16-bit parallel comparator, the power decreases by as much as 60% when 8 inputs are used for precomputation. Multiple-cycle precomputation results are given for circuits add-comp16 and addmax16. The circuit add-comp16 is shown in Figure 7, and the circuit addmax16 is the same circuit with the comparator replaced by a maximum function. For circuit add-comp16, for instance, the entry 4/8 under the fifth column indicates that 4 inputs are used to precompute the adders in the first cycle and 8 inputs are used to precompute the comparator in the next cycle.
Results on random logic circuits are presented in Table 2. The random logic circuits are taken from the MCNC combinational benchmark sets. We have presented results for those examples where significant savings in power were obtained. Once again, the same precomputation architecture and input and output selection algorithms are used as in Table 1 and the columns have the same meaning, except for the second and third columns, which show the number of inputs and outputs of each circuit. It is noteworthy that in some cases, as much as a 75% reduction in power dissipation is obtained.
The area penalty incurred is indicated by the number of literals in the precomputation logic and is 3% on the average. The extra delay incurred is proportional to the number of levels in the precomputation logic and is quite small in most cases. It should be noted that it may be possible to use the other precomputation architectures for all of the examples presented here. Some of these examples are perhaps better suited to other architectures than the one we used to derive the results, and therefore larger savings in power may be possible. Secondly, the inputs and outputs to be selected and the precomputation logic are determined automatically, making this approach suitable for automatic logic synthesis systems. Finally, the significant power savings obtained for random logic circuits indicate that this approach is not restricted only to certain classes of datapath circuits.
Conclusions and Ongoing Work
We have presented a method of precomputing the output response of a sequential circuit one clock cycle before the output is required, and exploited this knowledge to reduce power dissipation in the succeeding clock cycle. Several different architectures that utilize precomputation logic were presented.
In a finite state machine there is typically a single register, whose inputs are combinational functions of the register outputs. The precomputation architectures make no assumptions regarding feedback. For instance, R1 and R2 in Figure 2 can be the same register.
Precomputation increases circuit area and can adversely impact circuit performance. In order to keep area and delay increases small, it is best to synthesize precomputation logic which depends on a small set of inputs.
Precomputation works best when there are a small number of complex functions corresponding to the logic block A of Figures 2 and 3. If the logic block has a large number of outputs, then it may be worthwhile to selectively apply precomputation-based power optimization to a small number of complex outputs. This selective partitioning will entail a duplication of combinational logic and registers, and the savings in power are offset by this duplication.
Other precomputation architectures are being explored, including the architectures of Section 5, and those that rely on a history of previous input vectors. More work is required in the automation of a logic design methodology that exploits multiplexor-based, combinational and multiple-cycle precomputation. 
