Adder architectures are presented here in a unified formalism and analysed from the points of view of delay, complexity and power consumption. An analytical model for the power consumption is derived, assuming that it is proportional to the transition density [DHNT95]. The model is subsequently validated by simulation, using a tool that propagates signal transition probabilities [Cra89]. Finally, glitches are taken into account when transitions at the inputs of a cell are separated by one or more cell delays. A redundant-to-total power ratio is also derived.
INTRODUCTION
Addition is the most frequently used arithmetic primitive, involved not only in simple addition but also in more complex operations like multiplication and division. The present study covers the linear ripple carry adder and different architectures of carry select and carry lookahead adders.
Designing low-power, high-speed circuits requires a combination of techniques at four levels: technology, circuitry, architectures and algorithms [BCS92]. This work concentrates on the architecture level and considers a static CMOS technology.
The paper is organised as follows: the ∆ operator introduced by Brent and Kung [BrKu82] is first recalled. It is then used to describe several well-known adder architectures in a unified formalism. An analytical power consumption model is derived first for the ripple carry adder and then extended to other architectures. The notion of glitch threshold is then introduced and validated by HSPICE simulations, providing a glitch filtering model usable by the power evaluation software [Cra89]. Finally, concluding remarks and a brief presentation of future work are given.
THE ∆ OPERATOR
Let $P_{i,j}$ denote the group propagate and $G_{i,j}$ the group generate, with $n-1 \ge i \ge j \ge 0$. $P_{i,j} = 1$ means that the carry propagates from position $j$ up to position $i$, that is, $c_{i+1}$ is equal to $c_j$: $P_{i,j} = \prod_{m=j}^{i} p_m$. $G_{i,j} = 1$ means that a carry is generated somewhere between $j$ and $i$ and propagated from that location up to position $i$, which yields $c_{i+1} = 1$.
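Clearly, $P_{i,i} = p_i = a_i \oplus b_i$, $G_{i,i} = g_i = a_i \wedge b_i$, $P_{i,j} \wedge G_{i,j} = 0$ and, with no carry-in, $c_{i+1} = G_{i,0}$. For any $k$ such that $n-1 \ge i \ge k > j \ge 0$, the pair of bits $(P_{i,j}, G_{i,j})$ can be computed from $(P_{i,k}, G_{i,k})$ and $(P_{k-1,j}, G_{k-1,j})$ in the following way:
$$P_{i,j} = P_{i,k} \wedge P_{k-1,j}, \qquad G_{i,j} = G_{i,k} \vee (P_{i,k} \wedge G_{k-1,j}).$$
The operator ∆ is defined by:
$$(P_{i,j}, G_{i,j}) = (P_{i,k}, G_{i,k}) \,\Delta\, (P_{k-1,j}, G_{k-1,j}).$$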
In the subsequent figures, a dedicated icon represents the ∆-cell, which has a 4-bit input and a 2-bit output. It is easy to prove that:
• ∆ is associative, non-commutative and idempotent.
• Any $(P_{i,j}, G_{i,j})$ requires $(i-j)$ ∆-cells to be computed from the adder inputs. Intermediate results from the ∆-cells may be reused, which reduces the total number of ∆-cells but increases the fan-out of some of them [Zim96].
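The algebraic properties listed above can be checked exhaustively on a small model. The sketch below assumes the usual Brent and Kung definition of the ∆ operator, $(P, G) = (P_h \wedge P_l,\ G_h \vee (P_h \wedge G_l))$; the function and variable names are illustrative and do not come from the paper.

```python
from itertools import product

def delta(left, right):
    """Combine the upper group (P_h, G_h) with the lower group (P_l, G_l)."""
    p_hi, g_hi = left
    p_lo, g_lo = right
    return (p_hi & p_lo, g_hi | (p_hi & g_lo))

# Valid (P, G) pairs: P and G are never 1 at the same time.
PAIRS = [(0, 0), (0, 1), (1, 0)]

def is_associative():
    return all(delta(delta(x, y), z) == delta(x, delta(y, z))
               for x, y, z in product(PAIRS, repeat=3))

def is_idempotent():
    return all(delta(x, x) == x for x in PAIRS)

def is_commutative():
    return all(delta(x, y) == delta(y, x)
               for x, y in product(PAIRS, repeat=2))

print(is_associative(), is_idempotent(), is_commutative())
```

Associativity is the property that lets the carries be computed by trees of ∆-cells rather than by a single chain; the lack of commutativity means that the order of the operands (upper group on the left) must be preserved.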
COST AND DELAY MODELS: WORST CASE
The cost of an $n$-bit adder consists of a linear cost, to compute the $g_i$ and $p_i$ from $a_i$ and $b_i$ and the $s_i$ from $p_i$ and $G_{i-1,0}$, plus a cost that varies with the implementation chosen to obtain the $G_{i-1,0}$. The latter is given roughly by the number of ∆-cells, which may range from $n-1$ up to $n \log_2 n$ for regular adders, or even reach $\frac{1}{2}(n-1)^2$ for special-purpose adders [Zim96]. Note that the $P_{i-1,0}$ are never used, so the ∆-cells at the bottom of the following figures produce only the $G_{i-1,0}$ output; since the $P_{i,j}$ are only useful for the right input of those cells, all the $n-1$ ∆-cells at the bottom of the figures are simplified. This saving is accounted for in the fixed cost. The adder delay is the sum of the delays of the ∆-cells along the critical path plus a fixed delay to obtain the $g_i$ and $p_i$ and finally the $s_i$. In the following, the delay of a ∆-cell is used as the delay unit.
Some Adder Architectures
Let us now examine some well-known architectures [GBB94], their delay (the number of ∆-cells along the critical path) and their cost (the total number of ∆-cells).
Ripple Carry Adder
The delay and the cost of the ripple carry adder (figure 1) are both in $O(n-1)$. It is slow, but it is easily constructed by mere abutment of ∆-cells.
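As an illustration, a ripple carry adder can be modelled as a chain of ∆-cells in a few lines. This is a minimal sketch, assuming a carry-in of 0 and the usual Brent and Kung definition of ∆; `ripple_carry_add` is a hypothetical helper name, not something taken from the paper.

```python
def delta(left, right):
    """Combine the upper group (P_h, G_h) with the lower group (P_l, G_l)."""
    p_hi, g_hi = left
    p_lo, g_lo = right
    return (p_hi & p_lo, g_hi | (p_hi & g_lo))

def ripple_carry_add(a, b, n):
    """Add two n-bit integers; return (sum, number of delta-cells used)."""
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]  # p_i = a_i xor b_i
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]  # g_i = a_i and b_i
    group = (p[0], g[0])          # (P_{0,0}, G_{0,0})
    carries = [0, group[1]]       # c_0 = 0 (assumed), c_1 = G_{0,0}
    cells = 0
    for i in range(1, n):
        group = delta((p[i], g[i]), group)   # (P_{i,0}, G_{i,0})
        carries.append(group[1])             # c_{i+1} = G_{i,0}
        cells += 1
    total = sum((p[i] ^ carries[i]) << i for i in range(n))  # s_i = p_i xor c_i
    return total | (carries[n] << n), cells

print(ripple_carry_add(11, 6, 4))
```

The loop uses one ∆-cell per bit position above the least significant one, so both the cell count and the length of the critical path are $n-1$.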
Two Level Carry Select Adder (2-CSA)
The two-level carry select adder (figure 2), also named conditional-sum adder or carry-increment adder, is based on the previous one truncated into blocks of varying sizes. Its cost is in $O(2n)$ and its delay in $O(\lceil \sqrt{2n} \rceil)$; more precisely, with $k$ ∆-cells along the critical path, an adder can accommodate up to 1
Brent and Kung Adder
The Brent and Kung adder [BrKu82] is based on binary ∆-cell trees. The cost is $O(2n)$ and the delay $O(2\lceil \log_2 n \rceil - 2)$. One binary tree outputs the $G_{i,0}$ for all $i$ of the form $2^j - 1$, then another tree gives the remaining $G_{i,0}$.
Sklansky Adder
The Sklansky adder [Skla60] has proved to be the fastest architecture. Its cost is $O(\lceil \frac{n}{2} \log_2 n \rceil)$ and its delay $O(\lceil \log_2 n \rceil)$. The main drawback is that the fan-out grows exponentially from the inputs to the outputs along the critical path, and consequently the transistors must be sized accordingly.
Kogge & Stone and Han & Carlson Adders
The most significant bit of a Brent and Kung adder, as well as of a Sklansky adder, is obtained by a perfectly balanced binary tree in time $\log_2 n$. If the tree for the most significant position is simply copied for all other positions, the Kogge and Stone adder [KoSt73] is obtained. The fan-out is reduced to just two, at the expense of a larger number of ∆-cells, which becomes $O(n(\log_2 n - 1) + 1)$. As for the Sklansky adder, the delay is $O(\lceil \log_2 n \rceil)$. In order to reduce the number of cells of the Kogge and Stone adder, Han and Carlson [HaCa87] have proposed to compute only the odd positions, and then to add a layer that computes the even positions from the odd ones. The delay is slightly increased to $O(\lceil \log_2 n \rceil + 1)$, while the complexity becomes $O(\frac{n}{2}(\lceil \log_2 n \rceil + 1))$.
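The cell counts and delays quoted for the Sklansky and the Kogge and Stone adders can be reproduced by building both prefix networks explicitly. The sketch below assumes that $n$ is a power of two and uses the usual Brent and Kung definition of ∆; the function names and the `ripple_prefix` reference are illustrative only.

```python
def delta(left, right):
    """Combine the upper group (P_h, G_h) with the lower group (P_l, G_l)."""
    p_hi, g_hi = left
    p_lo, g_lo = right
    return (p_hi & p_lo, g_hi | (p_hi & g_lo))

def ripple_prefix(pg):
    """Reference: all (P_{i,0}, G_{i,0}) computed by a simple chain."""
    out = [pg[0]]
    for pair in pg[1:]:
        out.append(delta(pair, out[-1]))
    return out

def kogge_stone(pg):
    """Return (prefixes, cell count, depth) of a Kogge and Stone network."""
    n, cur, cells, depth, dist = len(pg), list(pg), 0, 0, 1
    while dist < n:
        nxt = list(cur)
        for i in range(dist, n):
            nxt[i] = delta(cur[i], cur[i - dist])
            cells += 1
        cur, dist, depth = nxt, dist * 2, depth + 1
    return cur, cells, depth

def sklansky(pg):
    """Return (prefixes, cell count, depth) of a Sklansky network."""
    n, cur, cells, depth, span = len(pg), list(pg), 0, 0, 1
    while span < n:
        nxt = list(cur)
        for i in range(n):
            if i & span:                         # upper half of a 2*span block
                src = (i // span) * span - 1     # top of the lower half
                nxt[i] = delta(cur[i], cur[src])
                cells += 1
        cur, span, depth = nxt, span * 2, depth + 1
    return cur, cells, depth

pg = [(1, 0), (0, 1), (1, 0), (1, 0), (0, 0), (1, 0), (0, 1), (1, 0)]
print(kogge_stone(pg)[1:], sklansky(pg)[1:])
```

In the Sklansky network the cell at index `src` drives up to `span` cells at each level, which is the exponentially growing fan-out mentioned above, whereas each Kogge and Stone cell drives at most two cells in the next level.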
Comparison
ACTIVITY MODEL FOR THE RCA
In this part of the paper, a model for the activity of a ripple carry adder (RCA) is derived without taking into account the attenuation of the spurious transitions. In the ripple carry adder, when all the inputs are applied at once, the activity is mainly due to the propagation of the carry through a chain of positions with $p_i = 1$. Let $T(n,k)$ be the number of different chains of $k$ consecutive "1"s, counted over all binary words of length $n$. Clearly, $T(n,0) = 0$ (there is no chain of zero bits), $T(n,n) = 1$ (the word 111...11) and $T(n,n-1) = 2$ (two possibilities: 011...111 or 11...1110).
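Let us now compute the general term $T(n,k)$ for $0 < k < n$. Since the word extremities as well as the bit value 0 act as chain separators, we distinguish two cases. When the chain touches one of the two extremities of the $n$-bit word, it occupies $k+1$ bits together with its separating 0, and there are $2^{n-(k+1)}$ different values for the $n-(k+1)$ bits outside the chain:
11…10 | 011…01   ($k+1$ bits | $n-(k+1)$ bits)
When the chain lies in the middle of the word, it occupies $k+2$ bits together with its two separating 0s. There are $n-(k+1)$ possible positions for this pattern, and for each position there are $2^{n-(k+2)}$ values of the $n-(k+2)$ remaining bits:
01…10 | 011…01   ($k+2$ bits | $n-(k+2)$ bits)
Thus, for $0 < k < n$,
$$T(n,k) = 2 \cdot 2^{n-(k+1)} + \bigl(n-(k+1)\bigr)\, 2^{n-(k+2)} = (n-k+3)\, 2^{n-k-2},$$
with $T(n,0) = 0$ and $T(n,n) = 1$.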
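The chain count can be checked by brute force for small word lengths. The sketch below makes two assumptions: a chain is a maximal run of exactly $k$ ones, and the middle case contributes $n-(k+1)$ positions for the pattern 0 1…1 0, which gives the closed form $(n-k+3)\,2^{n-k-2}$ for $0 < k < n$. The function names are illustrative.

```python
def count_chains(n, k):
    """Count maximal runs of exactly k ones over all n-bit words."""
    total = 0
    for word in range(1 << n):
        run = 0
        for bit in range(n + 1):      # the extra step flushes the last run
            if bit < n and (word >> bit) & 1:
                run += 1
            else:
                if k > 0 and run == k:
                    total += 1
                run = 0
    return total

def closed_form(n, k):
    """T(n, k) under the assumptions stated in the lead-in."""
    if k <= 0 or k > n:
        return 0
    if k == n:
        return 1
    # (n - k + 3) * 2^(n - k - 2), written so that it stays an integer
    return (n - k + 3) * 2 ** (n - k) // 4

for n in range(1, 7):
    print(n, [count_chains(n, k) for k in range(n + 1)])
```

With these assumptions the closed form also returns 2 for $k = n-1$, in agreement with the two words 011...1 and 1...110.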
Activity of the RCA
In the case of the RCA, none of the outputs is obtained from a balanced binary tree, and thus the activity window of any output is equal to its logical depth. This is not true for other architectures, such as the Kogge and Stone adder, where the outputs are obtained by balanced binary trees. The activity caused by a carry propagation over $k$ positions is proportional to $k^2/2$ [MoPa96]. Thus, averaging over the $2^n$ equiprobable words, the average activity is
$$A = \frac{1}{2^n} \sum_{k=1}^{n} \frac{k^2}{2}\, T(n,k).$$
Let us recall some useful identities [Kre93]:
since they simplify the expression of the average activity.
With these identities, one can also easily verify that:
which is the known average delay of the ripple carry adder. In the following, the higher order terms are neglected, i.e. it is assumed that:
Table 2 shows the relative error for 8, 16, 32 and 64 bits. Due to the equiprobability of the output vectors, the average number of useful transitions in an RCA is equal to half the number of cells. The total activity $A$ is split into two parts: $A = A_{useful} + A_{redundant}$. Thus, the ratio $\eta$ of redundant over total activity is: $\eta = A_{redundant} / A = 1 - A_{useful} / A$.
For large values of $n$, $\eta \approx 1/3$. This result is consistent with the BDD simulations using a unit delay and with [LMJ95]. With this approach, it is also possible to determine the activity at a given time $t_i$ (figure 3). The activity $A(t_1)$ at time $t_1$ is given by:
The ripple chains of length $k$ that exist at $t_2$ are those of length $k+1$ at $t_1$, thus:
The sum goes to $n-1$ because, in a word of length $n$, one cannot have a ripple carry chain of length greater than $n$. More generally, the activity at time $t_i$ is given by:
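The limit $\eta \approx 1/3$ quoted above can be checked numerically. The sketch below is not the paper's derivation: it assumes that each chain of length $k$ contributes an activity of exactly $k^2/2$, that the average is taken over the $2^n$ equiprobable words, that $T(n,k) = (n-k+3)\,2^{n-k-2}$ for $0 < k < n$, and that the useful activity is $n/2$. Under these assumptions the sum behaves like $3n/4 - 1$ for large $n$.

```python
from fractions import Fraction

def chains(n, k):
    """Assumed closed form for T(n, k)."""
    if k <= 0 or k > n:
        return 0
    if k == n:
        return 1
    return (n - k + 3) * 2 ** (n - k) // 4

def average_activity(n):
    """Assumed model: a chain of length k costs k^2 / 2, averaged over 2^n words."""
    total = sum(Fraction(k * k, 2) * chains(n, k) for k in range(1, n + 1))
    return total / 2 ** n

def redundant_ratio(n):
    """eta = (A - A_useful) / A, with A_useful assumed equal to n / 2."""
    total = average_activity(n)
    return (total - Fraction(n, 2)) / total

for n in (8, 16, 32, 64):
    print(n, float(average_activity(n)), float(redundant_ratio(n)))
```

Exact rational arithmetic is used so that the comparison with the asymptotic value is not blurred by rounding; the ratio approaches 1/3 slowly, since the constant term in the activity is not negligible for short words.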
EXTENSION OF THE MODEL TO OTHER ARCHITECTURES
The previous model is extended to the adders that can be obtained by an association of ripple carry chains, such as the carry select adder or the Kogge and Stone adder.
In the computation of $\eta$, the useful activity $A_{useful}$ is assumed to be half the number of ∆-cells, since "0" and "1" are equiprobable (this is consistent with the BDD simulations).
Two Level Carry Select Adder
The 2-CSA is an RCA truncated into blocks. For $n$ bits, the length of these blocks varies from 1 to $\sqrt{2n} - 1$. Thus an $n$-bit 2-CSA can be viewed as $\sqrt{2n}$ RCAs of length varying from 1 to $\sqrt{2n} - 1$ (first level) plus a row of cells that form the second level (figure 2).
First level
The total activity of the first level of the 2-CSA is given by the sum of the activities of the ripple carry chains, as they are independent from each other.
Second Level
The second level of the 2-CSA is approximated here by a ripple carry chain of length $\sqrt{2n}$, in which the cell at position $i$ is duplicated $i$ times. This approach neglects the activity generated at the second level by the ripple of the outputs of the first level. The activity of such a ripple carry chain can be deduced from the activity of the RCA by assuming that the capacitance of the $k$-th cell is $k$ instead of 1.
Total Activity of the 2-CSA
The total activity of the 2-CSA is the sum of the activities of the first and second levels:
Assuming that the useful activity is given by half the number of cells, the redundant-to-total activity ratio can be computed:
Kogge and Stone adder
Each bit of the Kogge and Stone adder is obtained by a balanced binary tree, so the output of any cell can change only once during a clock cycle and there are no redundant transitions. The carry propagation is the result of a logical AND, so its activity decreases very rapidly with the depth (like $2^{-i}$), whereas the transition probability of the carry generation is almost constant (1/2). These considerations allow us to approximate the activity of the Kogge and Stone adder by half the number of its cells: $A \approx \frac{1}{2}\bigl(n(\log_2 n - 1) + 1\bigr)$.
When the transitions at the inputs of a gate are separated by a delay δ, a glitch is generated at "Out" (figure 4). The amplitude of this glitch is proportional to δ (figure 5). Depending on its width and on the delay of the following gate, this glitch is either absorbed or propagated. As can be seen in figure 5, there is a threshold for the glitch propagation from "Out" to "Out1".
HSPICE measurements have been carried out (ATMEL-ES2 ECPD07 technology) and plotted. The plot shows that there is a threshold delay below which the glitch is absorbed and above which the glitch grows into a spurious transition (figure 5). We call this threshold delay $G_{th}$.
The variation of $G_{th}$ with respect to τ (the buffers' delay) is linear (figure 6), with a slope of approximately 1.89. This means that a glitch of width δ becomes a spurious transition only if $\delta \ge G_{th} \approx 1.89\,\tau$. The glitch threshold depends on the loading capacitance of OUT1, and the threshold phenomenon is attenuated when the capacitance is large. This characteristic behaviour of spurious transitions has been implemented in a BDD simulation tool [Cra89]. In the following section, the above analytical model is compared to the simulations for different adder architectures.
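The threshold rule lends itself to a very small filtering model. The sketch below is only an illustration of the idea, not the implementation used in the simulation tool: it assumes a purely linear relation $G_{th} = 1.89\,\tau$ with no offset, and `filter_transitions` is a hypothetical helper that keeps a transition only when it is separated from the previously kept one by at least the threshold.

```python
SLOPE = 1.89  # slope of G_th versus tau reported in the text

def glitch_threshold(tau, slope=SLOPE):
    """Assumed linear model of the glitch threshold, with zero intercept."""
    return slope * tau

def propagates(width, tau, slope=SLOPE):
    """True if a glitch of the given width grows into a spurious transition."""
    return width >= glitch_threshold(tau, slope)

def filter_transitions(arrival_times, tau, slope=SLOPE):
    """Drop transitions that follow the last kept one by less than G_th."""
    kept = []
    for t in sorted(arrival_times):
        if not kept or propagates(t - kept[-1], tau, slope):
            kept.append(t)
    return kept

print(filter_transitions([0.0, 1.0, 2.0, 5.0], tau=1.0))
```

Times and τ must be expressed in the same unit. Since the text notes that the threshold depends on the load capacitance, a more faithful model would make the slope a per-gate parameter instead of a global constant.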
BDD SIMULATION AND RESULTS
The tool used for implementation and experiments is ASYL+ [Cra89]. It provides a complete environment for macrogeneration and low-level synthesis. The size of the operands is sufficient to build a netlist for any kind of adder architecture. The mapping and the estimates are then performed with the user's library, delay model and dissipation model. The power depends on the circuit structure as well as on the circuit inputs: it is said to be input pattern-dependent. To solve this problem, one can simulate the circuit for a large number of inputs and then average the switching activity. Alternatively, probabilities were introduced [Bur88] to perform the averaging before running the analysis [Najm95], by estimating the number of transitions per clock cycle. Using the functionality and connectivity of the Boolean network, these input probabilities are propagated through the network. To apply statistical properties, reconvergent fanouts and feedback have to be taken into account. A convenient way to do this is to use binary decision diagrams [Najm91]. As the adder architectures do not contain any reconvergent fanouts, the probability computation is performed without any approximation.
A glitch is created at the output of a gate because of the difference in arrival times at its inputs. Then, the glitch can be propagated to the fanout gates according to their sensitivity. The probability of a switch due to a glitch cannot be estimated the same way as a useful switch for the simple reason that the probability of a node to undergo a transition does not depend only on the Boolean network functionality but also on its structure (path lengths for instance).
The formula $P_T = 2P_1(1-P_1)$ is no longer valid for all the possible transitions. To solve this problem, the probability calculation must be based on real delay models. Since the switching probabilities are supposed to be known at the primary inputs of the circuit, they are propagated to the fanout gates up to the roots of the circuit using the gate delays. Each gate modifies the switching probability according to its ability to propagate a transition from its inputs to its output, which we call its sensitivity. The sensitivity calculation rests on the functionality of the gate and on the probabilities at its inputs. Finally, each gate has a set of switching probabilities, separated from each other according to the glitch threshold previously introduced, which are added. A simple example of a ripple carry adder is illustrated in figure 7. A switch is represented by a square, coloured according to its occurrence time. Its probability is a function of the input switching probabilities $P_i$ and of the sensitivities $S_i$ of the cells encountered so far. The authors propose a gate-level estimator based on these remarks in [Lau96]. Its dissipation estimates are close to those of an exhaustive simulation for classical combinational circuits.
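For reference, the zero-delay formula $P_T = 2P_1(1-P_1)$ can be applied to the carry chain of a ripple carry adder in a few lines. This sketch covers only the glitch-free case, which is the one the formula is valid for; it assumes independent input bits, a carry-in of 0 and temporally independent successive vectors, and the function names are illustrative.

```python
def transition_probability(p_one):
    """P_T = 2 * P_1 * (1 - P_1) for temporally independent values."""
    return 2.0 * p_one * (1.0 - p_one)

def carry_probabilities(n, p_a=0.5, p_b=0.5):
    """Return [P(c_0 = 1), ..., P(c_n = 1)] for independent input bits."""
    p_gen = p_a * p_b                       # g_i = a_i and b_i
    p_prop = p_a + p_b - 2.0 * p_a * p_b    # p_i = a_i xor b_i
    probs = [0.0]                           # c_0 = 0 (assumed)
    for _ in range(n):
        # g_i and (p_i and c_i) are mutually exclusive, so the terms add up.
        probs.append(p_gen + p_prop * probs[-1])
    return probs

for i, p in enumerate(carry_probabilities(8)):
    print(i, p, transition_probability(p))
```

With equiprobable inputs the carry probability tends towards 1/2 along the chain, so the useful transition probability of each carry tends towards 1/2 as well. The spurious transitions caused by unequal arrival times are not captured here and require the delay-based propagation described above.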
The automatic simulation is consistent with the analytical model presented above, which rests on unit delay and capacitance and on a power dissipation function that is linear with respect to the fanout capacitances. However, technology-mapped adders have realistic delays as well as a more complex dissipation model at each switching, including for instance the charging of internal capacitances. Consequently, we built a ∆-cell at transistor level and simulated it with HSPICE. The submicron technology used is ATMEL ES2 ECPD07. Once the elementary cell is fully characterised, the synthesis tool estimates the total power dissipation of classical adder architectures. These estimates are presented in figure 8.
CONCLUSION
In this paper, the most frequent adder architectures were compared from the points of view of activity, delay and cost. The originality of the approach is that the estimation of the activity was achieved analytically, by an implicitly exhaustive enumeration of all vectors. This is possible thanks to the properties of the ∆ operator. Redundant transitions were taken into account, and a redundant-to-total power ratio was derived. Finally, glitch filtering was taken into account by feeding the BDD tool with technology-driven HSPICE simulation results.
In this article, it is assumed that all the inputs are ready at the same time and that all the outputs are required at the same time. This is not always the case, especially if an adder is associated with other operators that have their own delays. In multipliers or dividers, for example, the input arrival times are accessible to simulation, so different adder architectures adapted to these conditions should be examined in order to find the best power-delay-cost trade-off.
BIBLIOGRAPHY

