This paper presents a systematic theoretical approach for the analysis of bounds on power consumption in digital multipliers. Such bounds are of interest because in many applications the maximum power consumption, and not just the average power, may be of importance to the designer. The maximum values can be used to predict the maximum battery life in portable applications and also to determine the nature of heat sinks in non-portable applications. The proposed approach involves the development of state transition diagrams (stds) for the sub-circuits making up the digital multipliers. The std consists of states and edges, with each edge representing a transition (switching activity) from one state to another in the sub-circuit. The maximum (minimum) energy values associated with the edges constituting the stds are then used to derive the upper (lower) bound. The multipliers analyzed in this paper include the Baugh-Wooley multiplier, the binary tree multiplier, and the Wallace tree multiplier. The analysis is performed for both non-pipelined and p-bit-level pipelined multipliers. It is theoretically shown that there is a significant reduction in the upper bound as p is decreased, with the lower bound being unaffected by the level of bit-pipelining. Experimental results are presented to show that the average power consumption values indeed lie within the predicted theoretical bounds, and that the theoretical upper bounds are quite tight.
Introduction
The rapid advancement in semiconductor technology in the last decade has made possible the integration of a large number of digital CMOS circuits on a single chip. Moreover, the desirability of portable operations of these circuits has necessitated the development of low power technology.
This research was supported in part by the Office of Naval Research under contract number N00014-91-J-1008 and by Bell Laboratories.
Even in the case of non-portable systems, reduction in power consumption can greatly cut cooling costs.
Digital CMOS circuits form the backbone of digital signal processing (DSP) systems, microprocessors, wireless communication systems, etc. Therefore, the design of low power digital circuits is imperative. However, the power dissipated by a digital circuit can be measured only after it has been designed, tested, and fabricated. As a result, the turnaround time for the assessment of the design in terms of its power consumption may be too long. This has motivated the need for efficient techniques for estimation of the average power consumption without having to fabricate the design. Much of the earlier work done in this area involved performing exhaustive simulations for randomly generated input sequences [1], [2]. Although these methods gave accurate results, lengthy computation times rendered them impractical for large circuits. As a result, researchers resorted to stochastic approaches [3]-[15], where the probability of transition of nodes in the circuit is estimated.
However, in many non-portable applications some knowledge of the maximum power together with the average power is required in order to determine the nature of heat-sinks. Moreover, a theoretical analysis of the maximum and minimum power consumption of the digital circuit can give the designer some idea about the circuit switching activity, and this can be useful during the design phase.
In this paper, based on [10], we present a theoretical approach for the analysis of bounds on power consumption in digital CMOS circuits. Here, the sub-circuits constituting the different digital circuits are modeled using state transition diagrams (stds) characterized by states and edges. The energy consumed by the sub-circuit while traversing a particular edge is computed using HSPICE. This value reflects the transistor-level properties (size, technology, etc.) of the sub-circuit and has to be computed only once. These energy values are used to develop a model for estimating bounds on the power consumption. The salient feature of the proposed approach is that it can model glitching power, which can be quite significant depending on the nature of the design.
The approach has also been applied to p-bit-level pipelined circuits, and useful results for power consumption have been obtained as a function of the word length W. The actual average power consumption values have been obtained using the HEAT tool and are shown to lie within the theoretical bounds. Experimental results are also presented for the upper bounds, and it is shown that the theoretical bounds are quite tight.
The organization of this paper is as follows. Section 2 presents a technique for the modeling of a digital circuit using a std. A systematic approach for the computation of bounds on power consumption in non-pipelined multipliers based on the energies of the edges in the state transition diagrams is presented in Section 3. This approach is then extended for the analysis of p-bit-level pipelined multipliers in Section 4. Experimental results for the average power and bounds on power consumption values are presented in Section 5. It is shown that the average values lie within the theoretical bounds. Finally, the main conclusions of the paper are summarized in Section 6.
Circuit modeling using state transition diagrams
In this section, a systematic approach is presented to model digital circuits using state transition diagrams. The modeling is done by deriving analytical expressions for the state-update of all nodes in the corresponding digital circuits. The approach presented is general and can be used to model any arbitrary digital circuit. Most of the non-pipelined and pipelined arithmetic units, including multipliers, dividers, etc., can be designed using basic sub-circuits like NOR, NAND, full adders, half adders, flip-flops, etc. Therefore, in this section several examples are presented to illustrate the modeling of these sub-circuits.
Static CMOS NOR gate
Consider a typical static CMOS NOR gate shown in Fig. 1(a), where $x_1$ and $x_2$ represent the two input signals and $x_3$ represents the output signal. It is clear from Fig. 1(a) that there are two nodes, $\mathit{node}_2$ and $\mathit{node}_3$, whose values change between 1 and 0. The presence of charging/discharging capacitances at these nodes enables us to develop the state-update arithmetic equations for these nodes in accordance with

$$\mathit{node}_2(n+1) = (1 - x_1(n)) + x_1(n)\, x_2(n)\, \mathit{node}_2(n) \tag{1}$$

$$\mathit{node}_3(n+1) = (1 - x_1(n))\,(1 - x_2(n)). \tag{2}$$
The above equations can be used to derive the state transition diagram (std) for the NOR gate as shown in Fig. 1(b), where, for example, $S_1$ represents the state with node values $\mathit{node}_2 = \mathit{node}_3 = 0$, and the edge $e_1$ represents a transition (switching activity) from state $S_1$ to $S_3$. Although at first glance it may appear that the std would blow up as the number of nodes increases, this is not the case. It turns out in practice that as the number of nodes increases, the number of states which do not exist also increases. For example, the circuit which computes the carry output of the full adder has 6 nodes. However, the corresponding std has only 8 states, i.e., 56 of the possible 64 states do not exist.
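The construction of an std from state-update equations can be sketched in a few lines of Python. The sketch below is illustrative only: the function names `nor_step` and `build_std` are hypothetical, and the rule used to decide which states exist (repeatedly discarding states that no input can produce) is an assumption made here, not a procedure stated in the paper.

```python
from itertools import product


def nor_step(state, x1, x2):
    """Apply state-update equations (1) and (2) to a NOR-gate state."""
    node2, node3 = state
    next_node2 = (1 - x1) + x1 * x2 * node2
    next_node3 = (1 - x1) * (1 - x2)
    return (next_node2, next_node3)


def build_std(step, num_nodes, num_inputs):
    """Return (states, edges) of the std implied by a state-update function.

    Assumption: a state "exists" if it survives repeated application of the
    update function to the set of candidate states.
    """
    inputs = list(product((0, 1), repeat=num_inputs))
    states = set(product((0, 1), repeat=num_nodes))
    while True:
        image = {step(s, *x) for s in states for x in inputs}
        if image == states:
            break
        states = image
    # One edge per (state, input vector) pair, labelled by its destination.
    edges = {(s, x): step(s, *x) for s in states for x in inputs}
    return states, edges


nor_states, nor_edges = build_std(nor_step, num_nodes=2, num_inputs=2)
# Edges that leave their source state correspond to switching activity.
switching_edges = {k: dst for k, dst in nor_edges.items() if k[0] != dst}
```

Under equations (1) and (2), this sketch retains three of the four candidate states; the combination $\mathit{node}_2 = 0$, $\mathit{node}_3 = 1$ is never produced by any input.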
A static CMOS full adder
Consider the architecture of a static CMOS full adder as shown in Fig. 2. It is clear from the figure that the architecture is comprised of a carry generation portion and a sum generation portion.
The state-update equations can be determined in a similar manner for the carry and the sum portion of the full adder. Then, independent state transition diagrams are constructed for both these portions. For example, the state transition diagram for the carry portion of the full adder is shown in Fig. 3. Here, for the sake of brevity, only a few edges have been shown. It is clear from Fig. 3 that the state transition diagram is comprised of eight states. Each state is associated with the six nodes present in the carry portion of the full adder, and each edge is associated with the 3 inputs to the full adder.
True single phase clocked (TSPC) flip-flop
A TSPC flip-flop can be designed by cascading a TSPC p-latch and a TSPC n-latch or vice versa. Therefore, it is sufficient to model each one of them individually. A TSPC p-latch is shown in Fig. 4(a) and is comprised of 4 nodes. The corresponding std is shown in Fig. 4(b) and is comprised of 4 states and 16 edges.
In this manner, any arbitrary digital CMOS circuit can be modeled using a state transition diagram, and the input signal and conditional probabilities determine which edge is traversed in it. We have designed a library where typical sub-circuits like NAND, NOR, XOR, type-0 adder, type-1 adder, 2:1 multiplexer, 4:2 compressor, etc., have been modeled. The energy associated with each edge in the std is computed using HSPICE as outlined in [9]. These values depend on transistor sizes and technology parameters only, and not on the input signal characteristics.
Consequently, they have to be computed only once for a given state transition diagram.
Systematic switching activity analysis for non-pipelined multipliers
In this section, a systematic switching activity analysis is presented to determine the lower and upper bounds on the average power consumption in different types of multipliers. There are two assumptions made in this analysis. The first assumption is that the inputs to the digital circuit are allowed to change only at the beginning of every clock cycle, which is generally the case. The second assumption is that time is divided into sub-units called time-slots. The duration of each time-slot is determined by performing accurate SPICE analysis for all the sub-circuits in the library. For example, let us assume that a NAND gate in the library takes 2.7 ns to generate the output and a NOR gate takes 3.6 ns. Then, the duration of the time-slot is fixed as 0.9 ns, with the NAND gate requiring 3 time-slots to compute the output and the NOR gate requiring 4 time-slots. Therefore, variable delays (in terms of time-slots) are assigned to all the gates in the library. It turns out that with this assumption the full adder cell takes one time-slot to compute the carry and two time-slots to compute the sum.
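The NAND/NOR example is consistent with choosing the time-slot as the greatest common divisor of the gate delays. The following minimal sketch works under that assumption; the paper does not state the selection rule explicitly, and the function name `assign_time_slots` is hypothetical. Delays are expressed in integer picoseconds to avoid floating-point rounding.

```python
from functools import reduce
from math import gcd


def assign_time_slots(delays_ps):
    """Return (slot duration in ps, number of slots per gate).

    Assumption: the slot duration is the greatest common divisor of all
    gate delays, so every delay is a whole number of time-slots.
    """
    slot_ps = reduce(gcd, delays_ps.values())
    slots = {gate: delay // slot_ps for gate, delay in delays_ps.items()}
    return slot_ps, slots


# Delays from the example in the text: NAND 2.7 ns, NOR 3.6 ns.
slot_ps, slots = assign_time_slots({"NAND": 2700, "NOR": 3600})
```

For the delays in the text, this gives a 900 ps slot with 3 slots for the NAND gate and 4 for the NOR gate, matching the example.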
Let the number of clock cycles over which the switching activity analysis is performed be denoted by $N$, and let each clock cycle be divided into $S$ time-slots (numbered from 0 to $S-1$). Moreover, let the duration of each clock cycle be $T_{clk}$, and let $i$ represent the number of all possible time-slots in one clock cycle where switching can actually occur in a particular sub-circuit.
An activity factor associated with that sub-circuit is then defined in accordance with

$$q_i = \lim_{N \to \infty} \frac{\#\ \text{time-slots among } iN \text{ time-slots where switching actually occurs}}{N}. \tag{3}$$

Let us assume, for simplicity, that there is switching in at least one time-slot in every clock cycle. This is not a stringent assumption, and it will be shown later that it does not influence the nature of the result. It then follows that

$$1 \le q_i \le i.$$
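As an illustration of the definition in (3), the activity factor can be estimated over a finite number of clock cycles from a recorded switching trace. The trace format and the function name `activity_factor` below are hypothetical, and a finite-$N$ average only approximates the limit in (3).

```python
def activity_factor(trace):
    """Finite-N estimate of q_i from a switching trace.

    trace[n][j] is truthy if switching occurs in the j-th of the i
    candidate time-slots of clock cycle n (hypothetical format).
    """
    num_cycles = len(trace)
    if num_cycles == 0:
        raise ValueError("trace must contain at least one clock cycle")
    switched = sum(1 for cycle in trace for slot in cycle if slot)
    return switched / num_cycles


# Illustrative trace with i = 3 candidate slots over N = 4 clock cycles.
example_trace = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]
q_estimate = activity_factor(example_trace)
```

Because every cycle in this trace contains at least one switching slot, the estimate lies between 1 and $i = 3$, in line with the bounds on $q_i$.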
The bounds on power consumptions for various multipliers are analyzed using this model.
Baugh-Wooley multiplier
Consider the architecture of an 8 × 8-b Baugh-Wooley (BW) [16] multiplier shown in Fig. 5 [17], where the full adders have been numbered according to the row and column in which they appear.
Let $\mathit{eFAC}_k$ ($1 \le k \le$ # of edges) represent the energy consumed when the $k$-th edge is traversed in the std for the carry portion, and let $\mathit{eFAS}_l$ ($1 \le l \le$ # of edges) represent the energy consumed when the $l$-th edge is traversed in the std for the sum portion, of a full adder. Then, $\mathit{eFA}_{\max}$ and $\mathit{eFA}_{\min}$ are defined in accordance with

$$\mathit{eFA}_{\max} = \max\{(\mathit{eFAC}_k + \mathit{eFAS}_l)\ \forall\, (k, l) \mid (k, l) \in \mathcal{C}\}$$

$$\mathit{eFA}_{\min} = \min\{(\mathit{eFAC}_k + \mathit{eFAS}_l)\ \forall\, (k, l) \mid (k, l) \in \mathcal{C}\},$$

where $\mathcal{C}$ denotes the set of numbers $(k, l)$ which cause a change of state in the std for the carry or sum or both.
In the above equations, the maximum (minimum) is computed over those edges which cause a change of state, because only those edges result in some switching activity inside the full adder.
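The definitions of the maximum and minimum full-adder energies can be sketched as follows. The edge energies used here are made-up placeholders (not HSPICE results), the function name `fa_energy_extremes` is hypothetical, and the sketch assumes that every carry edge may be paired with every sum edge; in practice the admissible pairs may be restricted to edges driven by the same input transition.

```python
from itertools import product


def fa_energy_extremes(carry_edges, sum_edges):
    """Return (eFA_max, eFA_min) over edge pairs that change some state.

    Each edge is a tuple (changes_state, energy). A pair (k, l) is kept
    if the carry edge, the sum edge, or both cause a change of state.
    """
    totals = [
        e_carry + e_sum
        for (c_changes, e_carry), (s_changes, e_sum)
        in product(carry_edges, sum_edges)
        if c_changes or s_changes
    ]
    return max(totals), min(totals)


# Placeholder energies in arbitrary units, for illustration only.
carry_edges = [(True, 20), (False, 1), (True, 15)]
sum_edges = [(True, 30), (False, 2)]
efa_max, efa_min = fa_energy_extremes(carry_edges, sum_edges)
```

With these placeholder values the pair of self-loops (1 + 2) is excluded, so the minimum is taken over pairs with actual switching.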
Let us assume that the partial products become available to the multiplier array at time-slot 0 in every clock cycle. Moreover, let us assume that the probability that each of the multiplier and multiplicand bits is 1 is 0.5. Then, the input signal characteristics for each of the adders in the first row of the multiplier array are identical. Therefore, the maximum energy associated with the first row of adders is obtained in accordance with

$$UB_{E_0} = \max\{(W - 1)\, q_1\, \mathit{eFA}_{\max}\} = (W - 1)\, \mathit{eFA}_{\max}, \tag{6}$$

where $W$ represents the multiplicand (multiplier) word length. In the second row of the BW multiplier, the $(W-1)$-th adder from the LSB end consumes slightly less energy when compared to the remaining $(W-2)$ adders, since it is influenced by only the carry of the previous row. In other words, the input signals to the first $(W-2)$ adders in the second row can change at time-slots 0, 1, 
If the assumption of one transition in every clock-cycle is dropped, the lower bound on the energy consumption in the full adders assumes a trivial value of zero. This could represent, for example, the case when both the multiplicand and the multiplier do not change over all clock-cycles and as a result there would be no energy consumed in the multiplier.
The switching activity in the vector merging portion of the BW multiplier can be computed in a similar manner. It has not been shown here since this portion is common to all the multipliers analyzed in this paper.
Binary tree multiplier
Consider the architecture of an 8 × 8-b binary tree (BT) multiplier (employing the carry-save technique) shown in Fig. 6, where the full adders have been numbered according to the row and column in which they appear.
The architecture shown in Fig. 6 is comprised of 9 half adders and 36 full adders.
Two half adders are treated as equivalent to one full adder to simplify the switching activity analysis. It is also clear from 
It is clear from the theoretical results that the upper bound of the BT multiplier is much lower than that of the BW multiplier. This is because in the BT multiplier the delays tend to get balanced and, as a result, the glitching activity is minimized. A well-known technique which can be used to further minimize glitching activity is pipelining. A given architecture can be pipelined at various levels depending on the throughput requirement. The next section considers the ramifications of pipelining on the bounds on power consumption.
Bit-level pipelined digital multipliers
In this section, the effect of bit-level pipelining on the bounds on power consumption in digital circuits is analyzed. As before, the analysis is performed considering two examples, i.e., the BW multiplier and the BT multiplier.
p-bit-level pipelined BW multiplier
A p-bit-level pipelined BW multiplier will have a row of flip-flops after every p rows of full adders in the multiplier array shown in Fig. 5. The lower bound in (9) is not influenced by glitching, and consequently is not affected by the addition of a row of flip-flops. Therefore, one needs to consider the effect of pipelining only on the upper bound.
Performing a similar analysis as before, the energy consumed by all the rows of full adders before the first row of latches is given by
Once the first row of latches is encountered, the glitching activity is absorbed by the latches and the situation becomes identical to the p rows of adders before the first row of latches.
Since the number of rows in the BW multiplier is (W- 
Binary tree multiplier
The analysis for the p-bit-level pipelined case is performed assuming that p is even. Here, a row of flip-flops is added after p rows of full adders in the BT multiplier array. 
The above expression approaches the theoretical upper bound of (14) for values of p close to $2(\log_2 W - 1)$. The bounds on power consumption in a pipelined BT multiplier are plotted in Fig. 8, as a function of both the word length and the bit-pipelining level. It is clear from the figure that the upper bound decreases significantly as p is decreased.
Experimental results and comparison
The theoretical analysis was also performed for the Wallace tree (WT) [18] multiplier but has not been shown here for the sake of brevity (see [10]). The bounds for the different multipliers are plotted as a function of the word length in Fig. 9. It is clear from the figure that the lower bounds for the BW and the BT multiplier are very close to each other. However, the upper bound for the BW multiplier increases at a much faster rate than that of the BT multiplier. This implies that there is more switching activity inside a BW multiplier than inside a BT multiplier.
This is because in a BT multiplier the adders are arranged in the form of a tree and, as a result, the computation delays tend to get equalized. In the case of a WT multiplier, a closed-form solution for the upper bound on the power consumption is hard to find due to the irregular structure. However, since the number of full adders in a WT multiplier varies as $\log_{1.5} W$ and that in a BT multiplier varies as $\log_2 W$, we predict that the BT multiplier may consume slightly less power when compared to the WT multiplier for large word lengths. In order to verify the theoretical results, the HEAT tool was used to compute the average power for the three multipliers (with and without pipelining). The results are summarized in Table 1, where the entries in the first 
Conclusions and future work
A systematic theoretical approach for the analysis of bounds on the power consumption in digital circuits was presented. This was achieved by modeling a given circuit using state transition diagrams. The analysis was performed for non-pipelined and p-bit-level pipelined multipliers. It was theoretically shown that the upper bound is significantly reduced and that the lower bound is unaffected as p is decreased. Experimental results were presented to show that the average power consumption values were within the predicted theoretical bounds. It was also found that the error between the predicted theoretical upper bound and the experimental upper bound was less than 15%. Future work involves the development of a systematic approach to generate sequences which cause maximum power consumption in the multipliers.
