Study and Design of a 32-bit High-Speed Adder by Stangherlin, Matteo
Università degli Studi di Padova
Degree Thesis in Information Engineering
Supervisor: Andrea Neviani
Candidate: Matteo Stangherlin
Academic Year 2012/2013

Contents
Introduction
1 Study of Adders
  1.1 Full Adder
  1.2 Ripple Carry Adder
  1.3 Carry Lookahead Adder
    1.3.1 Monolithic Carry Lookahead Adder
    1.3.2 Logarithmic Carry Lookahead Adder
  1.4 Recurrence Solver Adders
  1.5 Radix-4 implementation of adders
2 Design of a 32-bit High-Speed Adder
  2.1 Circuit implementation
    2.1.1 Blocks implementation
    2.1.2 32-bit Kogge-Stone adder implementation
    2.1.3 32-bit Brent-Kung adder implementation
  2.2 Transistor sizing
  2.3 Simulation
Conclusion
Introduction
Addition is the most basic and most frequently used operation in digital
circuit design. For this reason, it often represents the limiting factor for the
computational speed of a system. The accurate optimization of adders plays a
major role in the architecture of more complex structures such as the
arithmetic logic units of microprocessors. Optimization proceeds at both the
logical and the circuit level. At the logical level, since hardware can only
perform a relatively simple set of Boolean operations, arithmetic calculations
are based on a hierarchy of operations built upon simpler ones, arranged so
that a fast circuit with a small area is obtained. At the circuit level,
optimization is achieved by adjusting the size of the transistors and the
topology of the logic gates so that every single element of the adder is
optimized. The following sections discuss various adder architectures, with
particular attention to speed, power and chip area as measures of the
efficiency of each device.
Chapter 1
Study of Adders
1.1 Full Adder
The basic building block from which more complex adding structures are derived
is the one-bit full adder. The operation of a full adder is defined by the
Boolean equations for the sum and the carry signals:
$$S_i = A_i \oplus B_i \oplus C_i = A_i \bar{B}_i \bar{C}_i + \bar{A}_i B_i \bar{C}_i + \bar{A}_i \bar{B}_i C_i + A_i B_i C_i \tag{1.1.1}$$
$$C_{i+1} = A_i B_i + B_i C_i + A_i C_i \tag{1.1.2}$$
where $A_i$ and $B_i$ are the two bits to be summed and $C_i$ is the input
carry; $S_i$ and $C_{i+1}$ are the sum and the carry outputs of the i-th stage,
respectively. Since the previous equations require two XOR gates to be
implemented, it is customary to define the sum and carry output signals as
functions of two intermediate signals: generate ($G_i$) and propagate ($P_i$).
The former indicates that a carry is generated when both bits $A_i$ and $B_i$
are set to 1, independently of the value of $C_i$; the latter, when set to 1,
passes the value of $C_i$ on to $C_{i+1}$, propagating the carry. The Boolean
expressions of these two functions are the following:
$$G_i = A_i B_i \tag{1.1.3}$$
$$P_i = A_i \oplus B_i \tag{1.1.4}$$
Note that the generate and propagate terms are functions of the inputs $A_i$
and $B_i$ only and do not depend on the carry signal $C_i$. Generate and
propagate can be used to redefine the sum $S_i$ and the carry $C_{i+1}$:
$$S_i = P_i \oplus C_i \tag{1.1.5}$$
$$C_{i+1} = G_i + P_i C_i \tag{1.1.6}$$
The logical implementation of a full adder is shown in Figure 1.1.
Figure 1.1: Logical implementation of a full adder (inputs $A_i$, $B_i$ and $C_{in}$; outputs $S_i$ and $C_{out}$).
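The generate/propagate formulation can be checked against the direct definitions (1.1.1) and (1.1.2) by enumerating the truth table. The following is a minimal sketch; the function names `full_adder` and `full_adder_reference` are illustrative and do not come from the thesis.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry_out) using the generate/propagate form."""
    g = a & b              # generate, equation (1.1.3)
    p = a ^ b              # propagate, equation (1.1.4)
    s = p ^ c              # sum, equation (1.1.5)
    c_out = g | (p & c)    # carry, equation (1.1.6)
    return s, c_out


def full_adder_reference(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry_out) using equations (1.1.1) and (1.1.2) directly."""
    s = a ^ b ^ c
    c_out = (a & b) | (b & c) | (a & c)
    return s, c_out


if __name__ == "__main__":
    # Enumerate all eight input combinations and compare the two forms.
    for a, b, c in product((0, 1), repeat=3):
        assert full_adder(a, b, c) == full_adder_reference(a, b, c)
        print(a, b, c, full_adder(a, b, c))
```

Both forms agree on all eight input combinations, which is why the later sections work only with $G_i$ and $P_i$.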
Since the delay from $A_i$ or $B_i$ to $S_i$ is two XOR delays and the delay
from $C_i$ (or any of the inputs) to $C_{i+1}$ is given by an AND and an OR
delay, a full adder is not efficiently implemented with static circuits.
Conversely, pass-transistor circuits can implement the functions more
efficiently in terms of speed, at the cost of a larger area. A possible
pass-transistor implementation of a full adder is presented in Figure 1.2.
Figure 1.2: Pass-transistor implementation of a full adder: (a) sum and (b) carry signals.
1.2 Ripple Carry Adder
A simple way to implement an adder for N-bit numbers is to cascade N full
adders, connecting the output carry of block $k-1$ ($C_{out,k-1}$) to the
input carry of block $k$ ($C_{in,k}$), with $k = 1, 2, \ldots, N-1$, while in
general the input carry of the first full adder ($C_{in,0}$) is set to zero.
Each full adder block computes the sum of the i-th bits of the two numbers and
the corresponding carry, which is passed to the following full adder block.
This kind of architecture is called a Ripple Carry Adder (RCA), since the carry
signal propagates from the least significant bit position to the most
significant one. The structure of an RCA is shown in Figure 1.3.
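Figure 1.3: Structure of a Ripple Carry Adder (four full-adder blocks FA with inputs $A_0 \ldots A_3$ and $B_0 \ldots B_3$, sums $S_0 \ldots S_3$, and the carry rippling from $C_{i,0}$ through $C_{o,0}$, $C_{o,1}$, $C_{o,2}$ to $C_{o,3}$).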
The delay introduced by the circuit depends on the number of full adder
blocks that must propagate the carry, which in turn depends on the
configuration of the bits given as input to the adder. The path from the input
to the output signal that is likely to take the longest time is called the
"critical path". In the case of the RCA, this is the path from the least
significant input bits $A_0$ or $B_0$ to the last sum bit $S_{N-1}$. Therefore,
the delay is proportional to the number N of bits of the input words. Assuming
that every full adder block is implemented as shown in Figure 1.1, the delay
introduced by the RCA is given approximately by the following equation:
$$t_{RCA} \approx t_{XOR} + (N-1)\,t_{AO} + t_{GP} \tag{1.2.1}$$
where $t_{GP}$ is the delay introduced by computing the generate and propagate
functions in each FA block, $t_{XOR}$ is the delay of computing the sum $S_i$
in each FA block and $t_{AO}$ is the delay introduced by the computation of the
carry from $C_{in,i}$ to $C_{out,i}$, which corresponds to the sum of an AND
and an OR gate delay. Since every full adder block is likely to have the same
internal architecture, $t_{AO}$ is assumed to be the same for each block.
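The ripple structure and the delay estimate of equation (1.2.1) can be sketched in a few lines. This is only a behavioural model under stated assumptions: the names `ripple_carry_add` and `rca_delay` are illustrative, and the gate delays passed to `rca_delay` are arbitrary units chosen by the caller, not values from the thesis.

```python
def ripple_carry_add(a: int, b: int, n: int, c_in: int = 0) -> tuple[int, int]:
    """Add two n-bit words bit by bit; return (n-bit sum, final carry out)."""
    carry = c_in
    total = 0
    for i in range(n):                 # the carry ripples from bit 0 to bit n-1
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        g = a_i & b_i                  # generate, equation (1.1.3)
        p = a_i ^ b_i                  # propagate, equation (1.1.4)
        total |= (p ^ carry) << i      # sum bit, equation (1.1.5)
        carry = g | (p & carry)        # carry out, equation (1.1.6)
    return total, carry


def rca_delay(n: int, t_xor: float, t_ao: float, t_gp: float) -> float:
    """Evaluate equation (1.2.1) for hypothetical gate delays (any unit)."""
    return t_xor + (n - 1) * t_ao + t_gp


if __name__ == "__main__":
    print(ripple_carry_add(0b1011, 0b0110, 4))
    # Doubling the word length roughly doubles the estimated delay.
    for n in (8, 16, 32):
        print(n, rca_delay(n, t_xor=1.0, t_ao=2.0, t_gp=1.0))
```

The loop makes the linear dependence visible: each iteration needs the carry produced by the previous one.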
As equation (1.2.1) shows, the delay of an RCA grows linearly with the number
N of FA stages, that is, with the number of bits of the input words. This is
an important aspect to consider when designing digital systems that add words
with a large number of bits. In particular, it is customary to avoid the RCA
topology for adders operating on 16 binary digits or more, even though its
area is much smaller than that of the faster adders introduced in the
following sections.
1.3 Carry Lookahead Adder
1.3.1 Monolithic Carry Lookahead Adder
As seen in the previous section, adder performance is strongly influenced
by the carry propagation process. In order to achieve higher computational
speed, it is thus necessary to make the sum independent of carry propagation.
The principle on which the monolithic Carry Lookahead Adder (monolithic CLA)
is based offers a way to solve this issue. As previously stated in equation
(1.1.6), in an N-bit adder each carry bit can be expressed as:
$$C_{out,i} = G_i + P_i C_{out,i-1} \tag{1.3.1}$$
The dependence of $C_{out,i}$ on $C_{out,i-1}$ can be eliminated by making
explicit the dependence of $C_{out,i-1}$ on the input signals $G_{i-1}$ and
$P_{i-1}$:
$$C_{out,i} = G_i + P_i (G_{i-1} + P_{i-1} C_{out,i-2}) \tag{1.3.2}$$
The previous equation states that a carry signal exits from stage $i$ if (a) a
carry is generated in stage $i$; (b) a carry is generated at stage $i-1$ and
propagates across stage $i$; or (c) a carry enters stage $i-1$ and propagates
across both stages $i-1$ and $i$.
By proceeding recursively, we obtain the following equation:
$$C_{out,i} = G_i + P_i (G_{i-1} + P_{i-1}( \cdots + P_1 (G_0 + P_0 C_{in,0}))) \tag{1.3.3}$$
where $C_{in,0}$ is typically set to zero. Equation (1.3.3) can be used to
implement an N-bit adder. For every bit, the result of the sum is independent
of the carry of the previous one, thus eliminating the problem of carry
propagation. In theory, the delay introduced by the monolithic CLA should be
constant and independent of the number N of bits.
In practice, the implementation of the circuit, shown in Figure 1.4 for a 4-bit
adder, reveals that the delay is not constant but grows linearly with the
number N of bits. In fact, even though the carry $C_{out,i}$ does not depend on
the previous carry outputs, it depends on all the respective generate and
propagate functions, making the delay proportional to $N + 1$, i.e. the number
of parallel branches of the circuit. Moreover, for a large N the circuit
features a high fan-in and fan-out. For example, the signals $P_0$ and $G_0$
appear in the logical expression of every other signal, significantly
increasing the capacitance associated with each connected line. For these
reasons, the monolithic CLA is useful only for small numbers of bits (typically
not more than four).
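To illustrate how equation (1.3.3) removes the dependence on earlier carries, the sketch below evaluates the flattened sum-of-products form directly from the $G_i$ and $P_i$ signals and compares it with the recursive definition (1.3.1). The helper names (`bit_gp`, `carries_recursive`, `carry_lookahead`) are illustrative only.

```python
def bit_gp(a: int, b: int, n: int) -> tuple[list[int], list[int]]:
    """Per-bit generate and propagate lists (index 0 is the LSB)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    return g, p


def carries_recursive(g: list[int], p: list[int], c_in: int = 0) -> list[int]:
    """Carry outputs computed stage by stage, equation (1.3.1)."""
    carries, c = [], c_in
    for g_i, p_i in zip(g, p):
        c = g_i | (p_i & c)
        carries.append(c)
    return carries


def carry_lookahead(g: list[int], p: list[int], i: int, c_in: int = 0) -> int:
    """Carry out of stage i from the flattened form of equation (1.3.3)."""
    result = 0
    for j in range(i, -1, -1):
        term = g[j]                    # carry generated at stage j ...
        for m in range(j + 1, i + 1):
            term &= p[m]               # ... and propagated up to stage i
        result |= term
    term = c_in                        # input carry propagated across all stages
    for m in range(i + 1):
        term &= p[m]
    return result | term


if __name__ == "__main__":
    g, p = bit_gp(0b1011, 0b0110, 4)
    print([carry_lookahead(g, p, i) for i in range(4)])
    print(carries_recursive(g, p))
```

The number of product terms for stage $i$ grows with $i$, which mirrors the fan-in problem described above.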
Figure 1.4: CMOS implementation of a 4-bit Monolithic Carry Lookahead Adder
1.3.2 Logarithmic Carry Lookahead Adder
The performance of the monolithic CLA makes it suitable only for adding small
groups of bits. In order to enhance the performance of the CLA, it is necessary
to modify the architecture of the adder by reorganizing it in a hierarchical
structure, as in the Logarithmic Carry Lookahead Adder (commonly referred to
simply as CLA, since its monolithic version is not particularly efficient).
The CLA uses generate and propagate modules for each bit position and lookahead
modules that generate carry signals independently for a group of k bits. In
addition to the carry signal for the group, lookahead modules produce group
carry generate and group carry propagate outputs, which indicate that a carry
is generated within the group or that an incoming carry would propagate across
the group. In this way, it is possible to know in advance whether a carry is
generated or propagated inside a large group of bits, decreasing the delay due
to the carry computation. Such modules are implemented by extending equation
(1.3.2) to a group of k bits (for reasons that will shortly be explained, the
index $i$ is zero or an integer multiple of four):
$$\begin{aligned}
C_{out,i} &= G_i + P_i C_{in,i} \\
C_{out,i+1} &= G_{i+1} + P_{i+1} C_{out,i} \\
C_{out,i+2} &= G_{i+2} + P_{i+2} C_{out,i+1} \\
C_{out,i+3} &= G_{i+3} + P_{i+3} C_{out,i+2}
\end{aligned} \tag{1.3.4}$$
Substituting $C_{out,i}$ into $C_{out,i+1}$, then $C_{out,i+1}$ into
$C_{out,i+2}$, then $C_{out,i+2}$ into $C_{out,i+3}$ yields the expanded
equations:
$$\begin{aligned}
C_{out,i} &= G_i + P_i C_{in,i} \\
C_{out,i+1} &= G_{i+1} + P_{i+1} G_i + P_{i+1} P_i C_{in,i} \\
C_{out,i+2} &= G_{i+2} + P_{i+2} G_{i+1} + P_{i+2} P_{i+1} G_i + P_{i+2} P_{i+1} P_i C_{in,i} \\
C_{out,i+3} &= G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i \\
&\quad + P_{i+3} P_{i+2} P_{i+1} P_i C_{in,i}
\end{aligned} \tag{1.3.5}$$
As seen in the previous section, each additional stage increases the size
of the logic gates in terms of number of inputs. Therefore, it is not
appropriate to create a group that calculates the carry for a large number of
bits. In current technologies, the practical maximum number of inputs per gate
is four. In order to continue the process, the CLA collects carry, generate and
propagate signals into lookahead modules that compute group-generate and
group-propagate signals, respectively $G_{i+3:i}$ and $P_{i+3:i}$ (where $i$ is
zero or an integer multiple of four), over a four-bit group. Every lookahead
module is able to anticipate whether a carry entering the module will be
propagated all the way across it, or generated within it, four times faster
than a normal ripple process. $G_{i+3:i}$ and $P_{i+3:i}$ are described by the
following equations:
$$\begin{aligned}
G_{i+3:i} &= G_{i+3} + P_{i+3} G_{i+2} + P_{i+3} P_{i+2} G_{i+1} + P_{i+3} P_{i+2} P_{i+1} G_i \\
P_{i+3:i} &= P_{i+3} P_{i+2} P_{i+1} P_i
\end{aligned} \tag{1.3.6}$$
The notations $G_{j:i}$ and $P_{j:i}$ denote respectively the group-generate
and group-propagate signals for the group that includes the bit positions from
the i-th to the j-th. The carry equation can be expressed in terms of the 4-bit
group-generate and group-propagate signals:
$$C_{out,i+3:i} = G_{i+3:i} + P_{i+3:i} C_{in,i+3:i} \tag{1.3.7}$$
where $C_{in,i+3:i}$ is the input carry of the group, i.e. the carry entering
the i-th stage. Usually, the carry input for the first block is set to zero.
The extended expressions of the carry equations can be obtained by
substitution, as done for equation (1.3.4).
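The group signals of equation (1.3.6) and the group carry of equation (1.3.7) can be sketched as follows. The function names `group_gp` and `group_carry_out`, and the choice of passing the per-bit signals as lists with the least significant bit first, are assumptions made for this illustration.

```python
def group_gp(g: list[int], p: list[int], i: int) -> tuple[int, int]:
    """Return (G_{i+3:i}, P_{i+3:i}) for the 4-bit group starting at bit i."""
    g0, g1, g2, g3 = g[i:i + 4]
    p0, p1, p2, p3 = p[i:i + 4]
    # Equation (1.3.6): a carry leaves the group if it is generated at some
    # bit and propagated by all the more significant bits of the group.
    group_g = g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0)
    group_p = p3 & p2 & p1 & p0
    return group_g, group_p


def group_carry_out(g: list[int], p: list[int], i: int, c_in: int) -> int:
    """Equation (1.3.7): carry out of the group from its input carry."""
    group_g, group_p = group_gp(g, p, i)
    return group_g | (group_p & c_in)


if __name__ == "__main__":
    a, b = 0b1001, 0b0111
    g = [(a >> k) & (b >> k) & 1 for k in range(4)]
    p = [((a >> k) ^ (b >> k)) & 1 for k in range(4)]
    print(group_gp(g, p, 0), group_carry_out(g, p, 0, c_in=0))
```

Note that `group_gp` never looks at the input carry, so it can be evaluated in parallel for all groups before any carry is known.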
In a recursive fashion, from the group generate, propagate and carry signals it
is possible to create a "group of groups" or "super group". The inputs of the
"super group" are the $G_{i+3:i}$ and $P_{i+3:i}$ signals computed by all the
groups within it. Each "super group" produces propagate $P^*_{j:i}$ and
generate $G^*_{j:i}$ signals, which indicate that the carry signal will be
propagated across all the groups belonging to the "super group" or will be
generated in one of them. Similarly to the group, a "super group" produces an
output carry signal as well as an input carry signal for each of the groups in
the level above:
$$\begin{aligned}
G^*_{j:i} &= G_{i+3:i} + P_{i+3:i} G_{i+2:i} + P_{i+3:i} P_{i+2:i} G_{i+1:i} + P_{i+3:i} P_{i+2:i} P_{i+1:i} G_{i:i} \\
P^*_{j:i} &= P_{i+3:i} P_{i+2:i} P_{i+1:i} P_{i:i} \\
C_{out,j} &= G^*_{j:i} + P^*_{j:i} C_{in,i}
\end{aligned} \tag{1.3.8}$$
Depending on the number of bits of the sum, it is possible to build a group
of "super groups" and so on. In particular, a 32-bit adder needs eight 4-bit
groups, two "super groups" and a final group that collects the carry, generate
and propagate signals from the "super groups", producing the final carry and
the intermediate ones for the levels below.
The main advantage of the hierarchical configuration is that the critical path
does not travel in the horizontal direction, as happens in the RCA. Because of
the tree structure, the delay of the CLA is not directly proportional to the
number of bits N, but to the number of levels used. Therefore, the delay of the
CLA is proportional to the logarithm of the number of bits N.
The delay introduced by the CLA can be evaluated by observing that each
lookahead level takes one gate delay to compute $P_{i+3:i}$ and two gate delays
to compute $G_{i+3:i}$ and $C_{out,i+3:i}$. Therefore, a lookahead module
introduces a two-gate delay. Moreover, each generate and propagate function
$G_i$ and $P_i$ takes one gate delay, which must be added to the delay of the
XOR gate that computes the sum at the bit level. Finally, since the number of
lookahead levels is given by the logarithm to the base k (where k is the number
of bits in a group) of the total number of bits N, minus one, it is possible to
compute the delay of a logarithmic CLA in gate delays:
$$t_{CLA} = 1 + 1 + 2(\log(\lceil N \rceil) - 1) = 2\log(\lceil N \rceil) \tag{1.3.9}$$
The logarithmic dependence of the delay on the number of bits of the sum
makes the CLA one of the theoretically fastest structures for addition.
Unfortunately, in practice the efficiency of the CLA is much lower than
expected. This is because the model used to compute the delay does not take
into account the fan-in and fan-out dependencies of the logic gates. In fact,
as shown in Figure 1.6, where a domino realization of a CLA is presented, logic
gates do not have a constant fan-in, making the delay of the group and "super
group" modules much greater than that of the individual generate and propagate
functions.
Nevertheless, the CLA can reach a remarkable computational speed if properly
implemented. An example is the circuit of Figure 1.6, which shows a Manchester
Carry-Chain (MCC) implementation of a CMOS Domino realization of a 64-bit adder
by Motorola. Due to the properties of pass-transistor logic which characterize
the MCC, the adder can reach the remarkable speed of 4.5 ns at
$V_{DD} = 5$ V and a temperature of 25 °C.
Figure 1.5: Block structure of a 16-bit Carry lookahead adder. From top to bottom: individual adders generating $G_i$, $P_i$ and the sum $S_i$; carry-lookahead blocks of 4 bits generating $G_{i+3:i}$, $P_{i+3:i}$ and $C_{in}$ for the adders; a carry-lookahead super-block of 4 bits generating $G^*_{i+3:i}$, $P^*_{i+3:i}$ and $C_{in}$ for the 4-bit blocks. The intermediate carries are $C_4$, $C_8$ and $C_{12}$.
Figure 1.6: CMOS Domino realization of a 4-bit module of a 64-bit adder by Motorola.
1.4 Recurrence Solver Adders
A particularly efficient class of adders is the one based on solving recurrence
equations, introduced by Brent and Kung on the basis of previous work by Kogge
and Stone.
They realized that any first-order recurrence problem can be written in an
alternate form by using a concept called recursive doubling, which consists of
breaking the calculation of one term into two equally complex subterms. In
particular, as seen in the previous sections, the carry output at the i-th
stage can be written as a linear combination of the generate and propagate
functions at the i-th and (i-1)-th stages:
$$C_{out,i} = G_i + P_i G_{i-1} + P_i P_{i-1} C_{out,i-2} \tag{1.4.1}$$
The previous equation can now be divided into two subterms by defining the
$\bullet$ operator, termed "dot", as follows:
$$(G_i, P_i) \bullet (G_{i-1}, P_{i-1}) = (G_i + P_i G_{i-1},\; P_i P_{i-1}) \tag{1.4.2}$$
where $G_i$, $P_i$, $G_{i-1}$ and $P_{i-1}$ are all Boolean functions. By using
the dot operator, group-generate and group-propagate functions can be easily
written for a group of two bits. For example, given the generate and propagate
functions for the two least significant bits, it is possible to obtain the
group-generate and group-propagate functions for the two-bit group:
$$(G_1, P_1) \bullet (G_0, P_0) = (G_1 + P_1 G_0,\; P_1 P_0) = (G_{1:0}, P_{1:0}) \tag{1.4.3}$$
Moreover, the dot operator is associative. This property can be used to
compute the generate and propagate functions over a group of k bits:
$$(G_{k:0}, P_{k:0}) = (G_k, P_k) \bullet (G_{k-1}, P_{k-1}) \bullet \ldots \bullet (G_0, P_0) \tag{1.4.4}$$
Since the dot operator is associative, every $(G_{k:0}, P_{k:0})$ can be
computed in the order defined by a binary tree. This is shown in Figure 1.7,
for k = 16. In the figure each black dot processes the dot operator, while the
white circles process the generate and propagate functions for a single bit.
Furthermore, it is possible to calculate the carry signal for every bit at
positions $2^n - 1$, where $n = 1, 2, \ldots, \log_2(N)$, by simply extending
equation (1.4.4). In fact we have:
$$(C_{o,k}, 0) = (G_k, P_k) \bullet (G_{k-1}, P_{k-1}) \bullet \ldots \bullet (G_0, P_0) \bullet (C_{i,0}, 0) \tag{1.4.5}$$
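The dot operator of equation (1.4.2) and its associativity can be checked exhaustively, since each operand is just a pair of bits. The sketch below is illustrative: `dot` and `prefix_gp` are names chosen here, and pairs are represented as Python tuples `(G, P)`.

```python
from functools import reduce
from itertools import product


def dot(hi: tuple[int, int], lo: tuple[int, int]) -> tuple[int, int]:
    """Dot operator of equation (1.4.2); `hi` is the more significant pair."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def prefix_gp(pairs: list[tuple[int, int]]) -> tuple[int, int]:
    """(G_{k:0}, P_{k:0}) from the per-bit pairs, LSB first, equation (1.4.4)."""
    # The operands appear from the most significant pair down to bit 0.
    return reduce(dot, reversed(pairs))


if __name__ == "__main__":
    all_pairs = list(product((0, 1), repeat=2))
    # Associativity: the grouping of the operands does not matter,
    # which is what allows the tree evaluation.
    assert all(
        dot(dot(x, y), z) == dot(x, dot(y, z))
        for x in all_pairs for y in all_pairs for z in all_pairs
    )
    a, b = 0b1101, 0b0011
    pairs = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(4)]
    print(prefix_gp(pairs))
```

Associativity does not imply commutativity: the order of the operands (more significant pair on the left) must be preserved.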
Computing the carry output only for the bit positions $2^n - 1$ is, however,
not sufficient to perform a correct sum. It is thus necessary to compute
$(G_{k:0}, P_{k:0})$ for $0 \le k < N$, where N is the number of bits of the
sum, and then evaluate the carry output by implementing equation (1.4.5). The
computation can be performed by extending the tree structure in Figure 1.7, so
that it is possible to obtain $(G_{k:j}, P_{k:j})$ for every k and j. This
process is illustrated in Figure 1.9, for N = 16. In addition to the black dots
and white squares that compute the dot operator and $(G_k, P_k)$ respectively,
Figure 1.9 also features the sum operator (depicted as a white diamond),
obtained by implementing equation (1.1.5). This particular architecture is
called the Kogge-Stone adder.

Figure 1.7: Computation of $(G_{15:0}, P_{15:0})$ using a tree structure (inputs $(G_0, P_0)$ to $(G_{15}, P_{15})$, logic levels 0 to 4).
Although they are a variation of the CLA, recurrence solver adders feature
several properties that make them preferable for implementation, such as:
(a) a regular layout that allows easier implementation
(b) a fan-out that can be controlled and limited to no more than 2
(c) the possibility of trading off fan-in, fan-out and hierarchical
topology
Moreover, since they belong to the class of binary trees, recurrence solver
adders can compute the output carry with a complexity proportional to the
number of levels of the adder, which is variable and depends on the particular
architecture. For the Kogge-Stone adder, the number of levels is given by
$\log_2(N)$.
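A behavioural sketch of the Kogge-Stone scheme follows: at each level every bit position combines its current pair with the pair located a power-of-two distance below it, so that after $\log_2(N)$ levels every position holds $G_{i:0}$. The function name `kogge_stone_add` and its return convention are assumptions made for illustration, the input carry is taken to be zero, and the model counts every combining node as one dot operator without distinguishing the variants used in hardware.

```python
def kogge_stone_add(a: int, b: int, n: int = 16) -> tuple[int, int, int, int]:
    """Return (n-bit sum, carry out, number of levels, number of dot nodes)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]       # equation (1.1.3)
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]     # equation (1.1.4)
    G, P = g[:], p[:]
    levels = dots = 0
    d = 1
    while d < n:
        new_G, new_P = G[:], P[:]
        for i in range(d, n):
            # Dot operator, equation (1.4.2), between spans ending at i and i-d.
            new_G[i] = G[i] | (P[i] & G[i - d])
            new_P[i] = P[i] & P[i - d]
            dots += 1
        G, P = new_G, new_P
        d *= 2
        levels += 1
    # G[i] is now G_{i:0}; with zero input carry, the carry into bit i
    # is G_{i-1:0}, and the sum bit follows equation (1.1.5).
    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1], levels, dots


if __name__ == "__main__":
    total, carry, levels, dots = kogge_stone_add(0xBEEF, 0x1234)
    print(hex(total), carry, levels, dots)
```

For N = 16 this model uses four levels, in agreement with the $\log_2(N)$ figure stated above.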
Figure 1.8: Logical structure of (a) dot and (b) grey operators. The dot operator takes $(G_{i:k}, P_{i:k})$ and $(G_{k-1:j}, P_{k-1:j})$ and produces $(G_{i:j}, P_{i:j})$; the grey operator takes $G_{i:k}$, $P_{i:k}$ and $G_{k-1:j}$ and produces only $G_{i:j}$.

Figure 1.9: Tree structure of a 16-bit Kogge-Stone adder (inputs $(A_0, B_0)$ to $(A_{15}, B_{15})$, outputs $S_0$ to $S_{15}$).

Figure 1.10: CMOS Domino implementation of (a) propagate and (b) generate signals.

Figure 1.11: CMOS Domino implementation of the dot operator: (a) group-propagate and (b) group-generate signals.

A possible implementation of the Kogge-Stone adder can be realized in CMOS
technology by exploiting the properties of dynamic logic. As already pointed
out in Figure 1.9, the Kogge-Stone adder consists of the recursive repetition
of simpler modules, which realize the dot operator and the generate, propagate
and sum functions.
The Domino CMOS implementation of the generate and propagate functions from
equations (1.1.3) and (1.1.4) is shown in Figure 1.10. Note that the propagate
function does not actually implement an XOR gate, but the following equation:
$$P_i = A_i + B_i \tag{1.4.6}$$
which represents an OR gate. This is an alternate definition of the propagate
function that allows the carry to be forwarded also when both $A_i$ and $B_i$
assume the logical value 1. Due to the way generate and propagate bits are used
by CLA and recurrence solver adders, the propagate definition given by equation
(1.4.6) is equivalent to the one given by equation (1.1.4).¹ Because of this,
the propagate function can be easily implemented in Domino logic. In addition,
both the generate and propagate functions feature a bleeder circuit whose
function is to eliminate any static power consumption originated by the
pull-down network. Every bleeder p-MOS has a small inverter connected to its
gate in order to decouple it from the output load, allowing it to switch off
quickly.
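The equivalence of the OR and XOR definitions of propagate for carry computation can be verified by enumeration: the two definitions differ only when $A_i = B_i = 1$, and in that case $G_i = 1$ already forces the carry. The sketch below uses illustrative names (`carry_out`, `xor_propagate`, `or_propagate`) and also shows that the sum of equation (1.1.5) still requires the XOR form.

```python
from itertools import product


def xor_propagate(a: int, b: int) -> int:
    return a ^ b            # equation (1.1.4)


def or_propagate(a: int, b: int) -> int:
    return a | b            # equation (1.4.6)


def carry_out(a: int, b: int, c: int, propagate) -> int:
    """Carry of equation (1.1.6) with a selectable propagate definition."""
    g = a & b
    return g | (propagate(a, b) & c)


if __name__ == "__main__":
    for a, b, c in product((0, 1), repeat=3):
        # The carry is identical under both definitions.
        assert carry_out(a, b, c, xor_propagate) == carry_out(a, b, c, or_propagate)
    # The sum bit is not: for a = b = 1, c = 0 the OR form gives the wrong value.
    a, b, c = 1, 1, 0
    print(xor_propagate(a, b) ^ c, or_propagate(a, b) ^ c)
```

This is why an adder that uses the OR propagate in its carry tree must compute the sum bits from a separate XOR of the operands.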
Figure 1.11 shows the Domino logic implementation of the dot operator. In
order to implement equation (1.4.2), every dot operator consists of two
modules: the first computes the group-propagate signal and the second the
group-generate signal. The inputs of the group-propagate module are the signals
$P_{i:i-k+1}$ and $P_{i-k:i-2k+1}$, where $i$ represents the bit position of
the operand and $k$ represents the logic level from which the input is
produced. Similarly, the inputs of the group-generate module are
$G_{i-k:i-2k+1}$, $G_{i:i-k+1}$ and $P_{i:i-k+1}$. The outputs of the two
modules are $P_{i:i-2k+1}$ and $G_{i:i-2k+1}$ respectively. In addition, since
the group-propagate and group-generate functions do not represent the first
stage of the circuit, it would be possible to remove the evaluation transistor
from the PDN in order to reduce the logical effort from 1 to 2/3. Note that by
doing this it would also be necessary to delay the clock signal of stage $k$
with respect to that of stage $k-1$ in order to eliminate the direct path from
$V_{DD}$ to ground. In this way, the clock signal is delayed by a constant time
between one stage and the next.
After computing all the generate and propagate functions, it is also necessary
to calculate the sum of the i-th bits of the two operands by implementing the
following equation:
$$S_i = A_i \oplus B_i \tag{1.4.7}$$
A static implementation of equation (1.4.7) is not particularly efficient,
given that the previous stages are implemented in Domino logic. In fact, a
static gate could increase the delay of the circuit and cause unwanted
transitions if the following stage is implemented in dynamic logic. A possible
way to compute the sum using Domino logic is to precompute the two possible
outputs for $S_i$, $S_i^0 = A_i \oplus B_i$ and
$S_i^1 = \overline{A_i \oplus B_i}$, and select the proper output by exploiting
the $G_{i:0}$ function. The implementation of the sum operator is shown in
Figure 1.12. Since domino logic does not provide the negated value of the
inputs, the sum operator is implemented by three logic stages. Note that the
first two stages, namely the NOT and NAND gates, do not feature an inverter at
the output, and this may cause unwanted glitches at the output. This problem
can be solved by delaying the clock for the second NAND gate so that the
delayed clock edge follows every transition of the inputs from the previous
stage.

¹ Note that equations (1.4.6) and (1.1.4) are not equivalent for full adders or RCA.

Figure 1.12: CMOS Domino implementation of the sum selector.
As pointed out in Figure 1.9, the implementation of a 16-bit Kogge-Stone
adder requires 49 modules to realize all the dot operators, 16 modules to
implement the generate and propagate functions and, finally, 16 more logic
gates for the implementation of the sum. If domino logic is used as suggested
above, the adder needs a total of 1507 transistors (this is the worst case, in
which every gate in the dot modules has the evaluation transistor in the PDN).
Moreover, more than one wiring track is required between two logic levels.
For these reasons, it is sometimes preferable to reduce the computational
speed in order to improve area occupation and decrease power consumption, as
suggested by Brent and Kung.
In fact, it is possible to exploit the properties of the dot operator in order
to implement a structure that computes the sum of two operands with a smaller
area occupation and power consumption than the Kogge-Stone adder. In
particular, it was previously stated that every instance $(G_{k:0}, P_{k:0})$
can be computed in the order defined by a binary tree, and an example of the
computation was given in Figure 1.7 for k = 16. Note that the binary tree
allows the carry computation only for the bits at positions $2^n - 1$, with
$n = 1, 2, \ldots, \log_2(N)$. For example, referring to the structure in
Figure 1.7, it is possible to compute the following carry outputs:
$$\begin{aligned}
(C_{o,0}, 0) &= (G_0, P_0) \bullet (C_{i,0}, 0) \\
(C_{o,1}, 0) &= [(G_1, P_1) \bullet (G_0, P_0)] \bullet (C_{i,0}, 0) = (G_{1:0}, P_{1:0}) \bullet (C_{i,0}, 0) \\
(C_{o,3}, 0) &= [(G_{3:2}, P_{3:2}) \bullet (G_{1:0}, P_{1:0})] \bullet (C_{i,0}, 0) = (G_{3:0}, P_{3:0}) \bullet (C_{i,0}, 0) \\
(C_{o,7}, 0) &= [(G_{7:4}, P_{7:4}) \bullet (G_{3:0}, P_{3:0})] \bullet (C_{i,0}, 0) = (G_{7:0}, P_{7:0}) \bullet (C_{i,0}, 0) \\
(C_{o,15}, 0) &= [(G_{15:8}, P_{15:8}) \bullet (G_{7:0}, P_{7:0})] \bullet (C_{i,0}, 0) = (G_{15:0}, P_{15:0}) \bullet (C_{i,0}, 0)
\end{aligned} \tag{1.4.8}$$

Figure 1.13: Tree structure of a 16-bit Brent-Kung adder (inputs $(A_0, B_0)$ to $(A_{15}, B_{15})$, outputs $S_0$ to $S_{15}$).
Since the binary tree by itself does not allow the complete sum to be computed,
it is necessary to add another structure to it so that the carry can be
calculated for every position of the two operands. Kogge and Stone's solution
to this problem was to reproduce the tree structure for every bit position,
significantly increasing the number of dot operators and wiring tracks.
Instead of reproducing the tree structure, it is possible to compute the
remaining carry outputs by simply adding a second tree to the structure in
Figure 1.7, as suggested by Brent and Kung. The resulting scheme, shown in
Figure 1.13, is known as the Brent-Kung adder. The additional structure, called
the inverse binary tree, combines the intermediate results in order to compute
the remaining carry outputs.
In this way, the number of dot operators implemented in the adder is
significantly reduced: referring to a 16-bit adder, instead of the 49 required
by a Kogge-Stone adder, a Brent-Kung adder only needs 27. Moreover, the number
of wiring tracks between two logic levels is limited to one.
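The two-tree organization can be sketched behaviourally as an up-sweep (the binary tree) followed by a down-sweep (the inverse tree). This is a minimal model under stated assumptions: the name `brent_kung_add` is illustrative, N is assumed to be a power of two, the input carry is zero, and the node placement follows the common textbook arrangement, which may differ in detail from Figure 1.13.

```python
def brent_kung_add(a: int, b: int, n: int = 16) -> tuple[int, int]:
    """Return (n-bit sum, carry out); n is assumed to be a power of two."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]

    # Binary tree (up-sweep): afterwards positions 2^m - 1 hold G_{i:0}.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            G[i], P[i] = G[i] | (P[i] & G[i - d]), P[i] & P[i - d]
        d *= 2

    # Inverse binary tree (down-sweep): fill in the remaining positions by
    # combining each partial span with an already complete prefix below it.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            G[i], P[i] = G[i] | (P[i] & G[i - d]), P[i] & P[i - d]
        d //= 2

    # With zero input carry, the carry into bit i is G_{i-1:0}.
    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[n - 1]


if __name__ == "__main__":
    print(brent_kung_add(0xBEEF, 0x1234))
```

Each level touches far fewer positions than in the Kogge-Stone scheme, at the price of roughly twice as many levels.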
Note that the inverse binary tree consists of a different class of dot
operators, depicted as grey dots in Figure 1.13. Since the implementation of
the sum only requires the group-generate signal $G_{i:0}$ in addition to the
two possible values of the signal $S_i$, grey operators implement only the
following equation:
$$G_{i:i-2k+1} = G_{i:i-k+1} + P_{i:i-k+1} G_{i-k:i-2k+1} \tag{1.4.9}$$
instead of equation (1.4.2). Because of this, only three inputs are needed
and only one output is produced. The implementation of the grey operator
can be seen in Figure 1.11(b). Note that grey operators are also used in the
implementation of the Kogge-Stone adder for the nodes directly connected to the
modules that compute the sum.
The width $w$ of an adder can be defined as the number of bits it accepts at
one time from each operand. For the parallel adders considered so far,
$w = N$ was assumed. For any Brent-Kung adder it can be proven that all the
carries in an N-bit addition can be computed in time proportional to
$(N/w) + \log(w)$ and in area proportional to $w\log(w) + 1$, and so can the
addition. Applying this bound to the case $w = N$ leads to the conclusion that
the computation delay of an N-bit Brent-Kung adder is proportional to
$\log(N) + 1$.
Nevertheless, the Brent-Kung adder does not reach the computational speed of a
Kogge-Stone adder. This is because the wiring tracks are less regularly
distributed and the fan-out varies from one logic gate to another. In
particular, referring to the 16-bit adder in Figure 1.13, the fan-out of the
node associated with the intermediate carry output $(C_{o,7}, 0)$ is given by
five dot operators and a sum function, significantly increasing the delay.
1.5 Radix-4 implementation of adders
Another factor that contributes to the enhancement of the delay is the num-
ber of logic levels of the adder which, in the case of a recurrence solver adder,
corresponds to the number of carries (or equivalently dot operators) grouped in
each step of the computation. This parameter is often referred to as the depth
of the adder. Considering that in a recurrence solver adder all the dot operators
have the same architecture and therefore the same delay, to a greater depth
corresponds greater delay. Moreover, a great number of subsequent logic gates
signi￿cantly increases the fan-out associated to each bit of the operands.
A possible way to reduce the depth of an adder is the implementation through
radix-4 architecture. All the recurrence solver adders analyzed so far were im-
plemented by using radix-2 architecture, which means coupling two signals into
a dot operator so that it is possible to compute generate and propagate func-
tions over a group of two bits. As the name suggests, radix-4 architecture groups
four signals into a single dot operator, computing generate and propagate for a
group of four bits. It is thus possible to rede￿ne equation (1.4.2) for a group of
four bits as follows:
Dot4

(Gi+3;Pi+3);(Gi+2;Pi+2);(Gi + 1;Pi+1);(Gi;Pi)

=
(Gi+3 + Pi+3Gi+2 + Pi+3Pi+2Gi+1 + Pi+3Pi+2Pi+1Gi;Pi+3Pi+2Pi+1Pi)
(1.5.1)
Consequently recurrence solver adders can be implemented by a quaternary tree
instead of a binary one. Figure 1.14 shows a 16-bit radix-4 Kogge-Stone adder.
Note that in addition to the radix-4 dot operators the adder also features radix-2
dot operators (depicted as a white circle) and radix-4 grey operators (depicted
as a grey circle). The former implement equation (1.4.2) as for radix-2 adders,
the latter implement the following equation:
Grey4

(Gi+2;Pi+2);(Gi + 1;Pi+1);(Gi;Pi)

=
= (Gi+2 + Pi+2Gi+1 + Pi+2Pi+1Gi;Pi+2Pi+1Pi)
(1.5.2)
A possible CMOS Domino logic implementation of the radix-4 dot and grey operators is shown in Figure 1.15 and Figure 1.16 respectively. As for the implementation of the radix-2 dot operator presented in the previous section, both modules feature a bleeder in order to eliminate the static power consumption originating from the pull-down network.
By applying a radix-4 architecture it is possible to reduce the depth of the adder, since each step groups four signals instead of two and fewer steps are needed. For example, for a 16-bit Kogge-Stone adder, the depth decreases from four to two.
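The depth figures quoted above follow from counting how many grouping levels are needed before one group spans the whole word. The sketch below is illustrative only; `prefix_depth` is a hypothetical helper name, and it counts prefix levels alone, excluding the input and sum stages.

```python
def prefix_depth(n_bits, radix):
    """Smallest number of levels d such that radix**d >= n_bits,
    i.e. the number of grouping steps of a full prefix tree."""
    depth, span = 0, 1
    while span < n_bits:
        span *= radix   # each level multiplies the covered span by the radix
        depth += 1
    return depth

for n in (16, 32):
    print(n, prefix_depth(n, 2), prefix_depth(n, 4))
```

For 16 bits this gives four levels in radix-2 and two in radix-4, matching the example in the text; for 32 bits the same count gives five and three levels.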
Figure 1.14: Tree structure of a 16-bit radix-4 Kogge-Stone adder. Inputs are the operand bit pairs (A0, B0) to (A15, B15); outputs are the sum bits S0 to S15.
Nevertheless, radix-4 nodes are more complex than radix-2 ones, so the resulting quaternary-tree-based adder could turn out to be slower than the binary-tree-based one. Moreover, regardless of computational speed, radix-4 gates consume more power than radix-2 ones, making radix-4 adders more power-hungry.
Figure 1.15: CMOS Domino implementation of the radix-4 dot operator: (a) group-propagate and (b) group-generate signals. Labelled nodes: CLK, VDD, inputs $G_i$ to $G_{i+3}$ and $P_i$ to $P_{i+3}$, outputs $GG_j$ and $GP_j$.
Figure 1.16: CMOS Domino implementation of the radix-4 grey operator: (a) group-propagate and (b) group-generate signals. Labelled nodes: CLK, VDD, inputs $G_i$ to $G_{i+2}$ and $P_i$ to $P_{i+2}$, outputs $GG_j$ and $GP_j$.
Chapter 2
Design of a 32-bit High-Speed Adder
Since adders occupy a critical position inside the arithmetic logic units of microprocessors, it is very important to ensure that their performance meets the given specifications on speed, power and area occupation. These figures can be optimized at both the logical and the circuit level. The previous chapter analyzed the logical level, presenting several efficient topologies for implementing high-speed adders. The purpose of the second part of this work is to verify the results presented in the previous chapter by implementing notable adder structures and evaluating their performance through simulation.
The adder topologies chosen for simulation are the following: 32-bit Kogge-Stone radix-2 adder; 32-bit Brent-Kung radix-2 adder; 32-bit Kogge-Stone radix-4 adder; and 32-bit Brent-Kung radix-4 adder. The Kogge-Stone topology was chosen mainly because it ensures high performance at the cost of considerable power consumption and area occupation, while the Brent-Kung topology offers lower computational speed but also lower area occupation and power consumption. Moreover, a further comparison can be made between radix-2 and radix-4 architectures in terms of overall computational speed and area occupation.
Circuit simulation for performance verification was carried out with Cadence Design Framework II®, an electronic design automation software suite for digital, analog and mixed-signal circuits.
2.1 Circuit implementation
Dynamic Domino logic was chosen for the circuit implementation because of its particular properties. Since each gate can make only one 0 → 1 transition during evaluation, every operation is free from glitches, which makes Domino logic particularly well suited to cascades of logic gates such as those of recurrence solver adders. Moreover, as with all Dynamic logic, area occupation is much smaller and computational speed much higher than in conventional CMOS logic.
Given the modular structure of recurrence solver adders, many circuit blocks are shared between different adder topologies. For example, the dot operator is implemented identically for both Kogge-Stone and Brent-Kung adders. For this reason, this section deals with each module separately before introducing the adder topologies themselves. All the following blocks were implemented in the 0.35 µm CMOS C35 process.
2.1.1 Blocks implementation
The first stage of each recurrence solver adder is the block that computes the generate and propagate functions from the operand bit values $A_i$ and $B_i$. Since this function was depicted as a square in the tree diagrams of recurrence solver adders, it will henceforth be called the square operator. Its implementation derives directly from equations (1.1.3) and (1.4.6), and is shown in Figure 2.1. Note that, unlike the implementation already presented in section 1.4, the circuits presented in this section do not include a bleeder p-MOS, as its presence may influence the evaluation of the computational speed of the adder.
Depending on the topology of the adder, the square block is followed by either dot or grey operators. Radix-2 and radix-4 implementations of the former are shown in Figure 2.2 and Figure 2.3 respectively, while the implementation of the latter is shown in Figure 2.4.
The final stage of the adder computes the sum from the group-generate and group-propagate signals. As with the square operator, since this block is shown as a diamond in the tree diagrams, it will henceforth be referred to as the diamond operator. Unlike the implementation presented in section 1.4, the one used for simulations was derived from equations (1.1.5) and (1.4.5) and is shown in Figure 2.5. The diamond operator consists of a cascade of two logic gates: the first computes the carry output of the k-th stage, $C_{o,k}$, while the second computes the sum as an exclusive OR of the group-propagate signal $P_{k:0}$ and the carry output signal.
Figure 2.1: Schematic view of the square block.
Figure 2.2: Schematic view of the radix-2 dot operator.
Figure 2.3: Schematic view of the radix-4 dot operator.
Figure 2.4: Schematic view of the radix-4 grey operator.
Figure 2.5: Schematic view of the diamond block.
2.1.2 32-bit Kogge-Stone adder implementation
The implementation of a 32-bit Kogge-Stone radix-2 adder directly derives from
the tree structure presented in Figure 2.6. Since the implementation chosen for
the sum requires both generate and propagate signals, all the grey operators,
which normally compute only the generate signal, are replaced with common dot
operators. Figure 2.7 shows the full 32-bit Kogge-Stone radix-2 adder circuit,
while Figure 2.8 and Figure 2.9 show details of the interconnections among the
various blocks.
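The behaviour of the radix-2 Kogge-Stone structure can be reproduced with a short bit-level model. This is a sketch under stated assumptions, not a description of the schematic: it uses the textbook definitions G = A AND B and P = A XOR B for the square stage, the usual form of the radix-2 dot operator, and the textbook sum S_i = P_i XOR C_(i-1), which may differ in detail from the diamond block used in this work. As in the text, every node is a full dot operator.

```python
import random

def dot2(hi, lo):
    """Radix-2 dot operator on (G, P) pairs (assumed form of equation (1.4.2))."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def kogge_stone_add(a, b, n=32):
    """Return (sum mod 2**n, carry out) of a + b using a Kogge-Stone prefix tree."""
    # Square stage: per-bit generate and propagate (XOR form is an assumption).
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    p_bit = [p for _, p in gp]
    # Prefix stage: at distance d, bit i merges its group with that of bit i - d.
    d = 1
    while d < n:
        gp = [dot2(gp[i], gp[i - d]) if i >= d else gp[i] for i in range(n)]
        d *= 2
    # With carry-in 0, the group-generate of bits i..0 is the carry out of bit i.
    carries = [g for g, _ in gp]
    # Sum stage (textbook form): S_i = P_i XOR carry into bit i.
    s = 0
    for i in range(n):
        c_in = carries[i - 1] if i > 0 else 0
        s |= (p_bit[i] ^ c_in) << i
    return s, carries[n - 1]

# Self-check against Python integer addition on random 32-bit operands.
random.seed(0)
for _ in range(1000):
    a, b = random.getrandbits(32), random.getrandbits(32)
    total = a + b
    assert kogge_stone_add(a, b) == (total & 0xFFFFFFFF, total >> 32)
```

The `while` loop runs five times for 32 bits, one iteration per level of the binary tree.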
Figure 2.6: Tree structure of a 32-bit Kogge-Stone radix-2 adder. Inputs are the operand bit pairs (A0, B0) to (A31, B31); outputs are the sum bits S0 to S31.
Figure 2.7: Schematic view of a 32-bit Kogge-Stone radix-2 adder.
Figure 2.8: Detail showing the first four inputs of Figure 2.7.
Figure 2.9: Detail showing the last five outputs of Figure 2.7.
Figure 2.10 shows the tree structure of a 32-bit Kogge-Stone radix-4 adder. The white circles represent radix-2 dot operators, the grey circles represent radix-4 grey operators and the black circles represent radix-4 dot operators. The adder implementation in Cadence is shown in Figure 2.11, while Figure 2.12 and Figure 2.13 show details of the interconnections among the various blocks.
Figure 2.10: Tree structure of a 32-bit Kogge-Stone radix-4 adder. Inputs are the operand bit pairs (A0, B0) to (A31, B31); outputs are the sum bits S0 to S31.
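One way to see why a radix-4 Kogge-Stone tree mixes two-, three- and four-input operators is to model each level as merging, for bit i, the groups found at i, i - s, i - 2s and i - 3s, where s is the stride of the level, and to drop the operands that fall below bit 0. The sketch below follows that idea; it is an interpretation offered for illustration, and the exact node placement of the figure is not claimed. The helper names and the XOR form of the propagate signal are assumptions.

```python
from functools import reduce

def dot2(hi, lo):
    """Radix-2 dot operator on (G, P) pairs (assumed form of equation (1.4.2))."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def radix4_kogge_stone_carries(a, b, n=32):
    """Return (list of carry outs per bit, number of prefix levels)."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    stride, levels = 1, 0
    while stride < n:
        merged = []
        for i in range(n):
            # Up to four operands; fewer near the least significant bits.
            group = [gp[i - k * stride] for k in range(4) if i - k * stride >= 0]
            # Chaining dot2 over the group equals a single 2-, 3- or 4-input node.
            merged.append(reduce(dot2, group))
        gp = merged
        stride *= 4
        levels += 1
    return [g for g, _ in gp], levels

def reference_carries(a, b, n=32):
    """Carry out of every bit position, computed with integer arithmetic."""
    out = []
    for i in range(n):
        low = (1 << (i + 1)) - 1
        out.append(((a & low) + (b & low)) >> (i + 1))
    return out

carries, levels = radix4_kogge_stone_carries(0x7FFFFFFF, 0x00000001)
print(levels, carries == reference_carries(0x7FFFFFFF, 0x00000001))
```

In this model a node with two operands plays the role of a radix-2 dot operator, one with three operands that of the three-input operator of equation (1.5.2), and one with four operands that of the radix-4 dot operator.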
Figure 2.11: Schematic view of a 32-bit Kogge-Stone radix-4 adder.
Figure 2.12: Detail showing the first four inputs of Figure 2.11.
Figure 2.13: Detail showing the last three outputs of Figure 2.11.
2.1.3 32-bit Brent-Kung adder implementation
The implementation of a 32-bit Brent-Kung radix-2 adder directly derives from
the tree structure presented in Figure 2.14. Similarly to Kogge-Stone radix-2
implementation, all grey operators are replaced with dot operators. Figure 2.15
shows the full 32-bit Brent-Kung radix-2 adder circuit, while Figure 2.16 and
Figure 2.17 show details of the interconnections among the various blocks.
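For comparison with the Kogge-Stone structure, the Brent-Kung prefix network can be modelled as an up-sweep that builds groups of doubling size followed by a down-sweep that fills in the remaining prefixes. The sketch below is the textbook formulation, assuming a word length that is a power of two and the XOR form of the propagate signal; it counts levels and operators so that the trade-off described in the text can be observed, and is not derived from the schematic itself.

```python
def dot2(hi, lo):
    """Radix-2 dot operator on (G, P) pairs (assumed form of equation (1.4.2))."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def brent_kung_carries(a, b, n=32):
    """Return (carry outs per bit, prefix levels, dot operators used).
    n is assumed to be a power of two."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    levels = ops = 0
    # Up-sweep: positions 2d-1, 4d-1, ... absorb the group d positions below.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            gp[i] = dot2(gp[i], gp[i - d])
            ops += 1
        d *= 2
        levels += 1
    # Down-sweep: complete the prefixes that the up-sweep left partial.
    d //= 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            gp[i] = dot2(gp[i], gp[i - d])
            ops += 1
        d //= 2
        levels += 1
    return [g for g, _ in gp], levels, ops

def reference_carries(a, b, n=32):
    """Carry out of every bit position, computed with integer arithmetic."""
    out = []
    for i in range(n):
        low = (1 << (i + 1)) - 1
        out.append(((a & low) + (b & low)) >> (i + 1))
    return out

carries, levels, ops = brent_kung_carries(0x7FFFFFFF, 0x00000001)
print(levels, ops, carries == reference_carries(0x7FFFFFFF, 0x00000001))
```

For 32 bits this model uses nine prefix levels, compared with five for the radix-2 Kogge-Stone tree, but far fewer operators, which is the speed-for-area trade-off the text attributes to the Brent-Kung topology.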
Figure 2.14: Tree structure of a 32-bit Brent-Kung radix-2 adder. Inputs are the operand bit pairs (A0, B0) to (A31, B31); outputs are the sum bits S0 to S31.
Figure 2.15: Schematic view of a 32-bit Brent-Kung radix-2 adder.
Figure 2.16: Detail showing the first four inputs of Figure 2.15.
Figure 2.17: Detail showing the first four outputs of Figure 2.15.
Figure 2.18 shows the tree structure of a 32-bit Brent-Kung radix-4 adder. The operators are represented as in the Kogge-Stone radix-4 adder. The adder implementation in Cadence is shown in Figure 2.19, while Figure 2.20 and Figure 2.21 show details of the interconnections among the various blocks.
Figure 2.18: Tree structure of a 32-bit Brent-Kung radix-4 adder. Inputs are the operand bit pairs (A0, B0) to (A31, B31); outputs are the sum bits S0 to S31.
Figure 2.19: Schematic view of a 32-bit Brent-Kung radix-4 adder.
Figure 2.20: Detail showing the first four inputs of Figure 2.19.
Figure 2.21: Detail showing the first four outputs of Figure 2.19.
2.2 Transistor sizing
In a digital circuit, transistor sizing should meet both area occupation and performance specifications. Since the purpose of this work is only to verify the behavior of the examined structures, only preliminary sizing was carried out. Further changes to the form factors of the transistors in order to achieve higher computing performance go beyond the purpose of this work.
Preliminary transistor sizing was carried out so that each logic gate has the same delay as the minimum-size inverter. The 0.35 µm CMOS C35 process allows a minimum transistor gate width of 0.40 µm. Since the optimum ratio between the p-MOS and n-MOS gate widths that ensures symmetrical delays for both low-to-high and high-to-low transitions is 3 (see footnote 1), the p-MOS gate width of the minimum-size inverter was set to 1.20 µm, while that of the n-MOS was set to the minimum value of 0.40 µm.
In a Dynamic logic gate the worst-case delay occurs when a series of transistors is active. Using an equivalent RC model, the delay is approximately given by the following equation:
\[ t_p = 0.69 \cdot R \cdot C \tag{2.2.1} \]
where $R$ is the equivalent electrical resistance of the transistor and $C$ the capacitance associated with the node concerned. For a series of $n$ transistors, the high-to-low delay is given by the following equation:
\[ t_{p,HL} = 0.69 \cdot n \cdot R \cdot C \tag{2.2.2} \]
Therefore, since the electrical resistance of a transistor is approximately inversely proportional to its gate width, reducing the delay by a factor of $n$ requires increasing the width of each transistor by the same factor. This criterion was used to size all the logic gates of the adder. The following table summarizes the size of the transistors in each logic gate.
                                   Transistor size
Block         Logic gate           p-MOS      n-MOS (each)
inverter                           1.20 µm    0.40 µm
square        propagate            1.20 µm    1.20 µm
              generate             1.20 µm    0.60 µm
dot radix-2   group-propagate      1.20 µm    1.20 µm
              group-generate       1.20 µm    1.20 µm
dot radix-4   group-propagate      1.20 µm    2.00 µm
              group-generate       1.20 µm    2.00 µm
grey radix-4  group-propagate      1.20 µm    1.60 µm
              group-generate       1.20 µm    1.60 µm
diamond       Carry                1.20 µm    1.20 µm
              XOR                  1.20 µm    1.20 µm
Footnote 1: The actual optimum ratio is 2.4, but CMOS technology only allows integer scaling of minimum length and width.
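The sizing criterion of equations (2.2.1) and (2.2.2) can be illustrated numerically. The following is a minimal first-order sketch, not the author's procedure: the resistance and capacitance values are hypothetical placeholders, the increase in node capacitance caused by widening the transistors is ignored, and the stack depths of 3, 4 and 5 devices are assumptions chosen for illustration.

```python
import math

W_MIN_NM = 400  # minimum gate width of the process (0.40 um), in nanometres

def series_delay(n, r_unit, c_node, width_scale=1.0):
    """Equation (2.2.2) with every transistor widened by width_scale.
    Resistance is taken as inversely proportional to width; the effect of
    widening on the node capacitance is ignored (first-order model)."""
    return 0.69 * n * (r_unit / width_scale) * c_node

def stack_width_nm(n):
    """Width giving an n-transistor stack the resistance of one minimum device."""
    return n * W_MIN_NM

R_UNIT = 10e3    # hypothetical resistance of a minimum-width n-MOS, in ohms
C_NODE = 10e-15  # hypothetical node capacitance, in farads

t_ref = series_delay(1, R_UNIT, C_NODE)  # single minimum-width transistor
for n in (3, 4, 5):
    t_scaled = series_delay(n, R_UNIT, C_NODE, width_scale=n)
    print(n, stack_width_nm(n), math.isclose(t_scaled, t_ref))
```

Under these assumptions, stacks of 3, 4 and 5 devices call for widths of 1.20 µm, 1.60 µm and 2.00 µm. These values also appear in the n-MOS column of the table, although the stack depth of each gate is not stated in the text and is only assumed here.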
2.3 Simulation
The purpose of the simulation is to verify the performance of the adder by evaluating the delay with which the sum is computed. To do so, it is necessary to identify the critical path of each adder and measure the time elapsed between the rising edge of the clock and the rising edge of the output signal at the end of the critical path during the first evaluation phase. For a recurrence solver adder the critical path is the one that goes from the least significant bit of the operands to the most significant bit of the sum, propagating the carry throughout the whole structure. This propagation occurs when the configuration of the operands is one of the following:
A = 011111
B = 000001
or, equivalently:
A = 111111
B = 000001
where the leftmost bit is the most significant one. The first will henceforth be referred to as configuration 1 and the second as configuration 2. Both configurations were used in the evaluation of the delay. In the first one the delay was evaluated for the most significant bit of the sum $S_{31}$ and the previous carry output $C_{o,30}$, while in the second the delay was evaluated only for the last carry output $C_{o,31}$. The results were obtained by simulating the circuits in the Cadence® Virtuoso® analog environment with a 100 MHz clock frequency. The delay measurements were taken at 10% (330 mV), 50% (1.65 V) and 90% (2.97 V) of the high output voltage. The tables below summarize the results in terms of propagation delay (measured at 50% of the transition) and rise time; the corresponding waveforms are shown in Figures 2.22 to 2.29.
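The choice of output observed in each configuration can be checked with plain integer arithmetic. The sketch below assumes that the two operand patterns extend to the full 32-bit word, that is A = 0x7FFFFFFF or 0xFFFFFFFF with B = 1; this extension and the helper names are assumptions made for illustration.

```python
N = 32

def carry_out_of_bit(a, b, k):
    """Carry produced out of bit position k when adding a and b (carry-in 0)."""
    low = (1 << (k + 1)) - 1
    return ((a & low) + (b & low)) >> (k + 1)

def sum_bit(a, b, k):
    """Bit k of the sum a + b."""
    return ((a + b) >> k) & 1

config1 = (0x7FFFFFFF, 0x00000001)  # assumed 32-bit form of configuration 1
config2 = (0xFFFFFFFF, 0x00000001)  # assumed 32-bit form of configuration 2

for name, (a, b) in (("configuration 1", config1), ("configuration 2", config2)):
    print(name,
          "S31 =", sum_bit(a, b, N - 1),
          "Co,30 =", carry_out_of_bit(a, b, N - 2),
          "Co,31 =", carry_out_of_bit(a, b, N - 1))
```

In configuration 1 the carry ripples up to bit 30 and sets S31, so S31 and Co,30 are the signals that rise; in configuration 2 the carry also leaves bit 31, so Co,31 is the signal that rises while S31 stays low.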
Kogge-Stone radix-2 adder, configuration 1
Signal    Propagation delay    Rise time
CLK       0                    0.080 ns
S31       1.024 ns             0.071 ns
Co,30     1.093 ns             0.113 ns

Kogge-Stone radix-2 adder, configuration 2
Signal    Propagation delay    Rise time
CLK       0                    0.080 ns
Co,31     1.153 ns             0.113 ns

Brent-Kung radix-2 adder, configuration 1
Signal    Propagation delay    Rise time
CLK       0                    0.080 ns
S31       1.482 ns             0.071 ns
Co,30     1.571 ns             0.113 ns

Brent-Kung radix-2 adder, configuration 2
Signal    Propagation delay    Rise time
CLK       0                    0.080 ns
Co,31     1.137 ns             0.113 ns

Kogge-Stone radix-4 adder, configuration 1
Signal    Propagation delay    Rise time
CLK       0                    0.080 ns
S31       1.411 ns             0.070 ns
Co,30     1.624 ns             0.113 ns

Kogge-Stone radix-4 adder, configuration 2
Signal    Propagation delay    Rise time
CLK       0                    0.080 ns
Co,31     1.594 ns             0.113 ns

Brent-Kung radix-4 adder, configuration 1
Signal    Propagation delay    Rise time
CLK       0                    0.080 ns
S31       2.060 ns             0.071 ns
Co,30     2.032 ns             0.114 ns

Brent-Kung radix-4 adder, configuration 2
Signal    Propagation delay    Rise time
CLK       0                    0.080 ns
Co,31     2.119 ns             0.113 ns
Figure 2.22: Kogge-Stone radix-2 adder configuration 1. Waveforms of Clock, Sum bit 30, Sum bit 31 and Carry out 30 over time (0 to 20 ns). Markers: M6 (4.01 ns, 330 mV), M0 (4.05 ns, 1.65 V), M5 (4.09 ns, 2.97 V); M3 (5.048 ns, 330 mV), M1 (5.074 ns, 1.65 V), M4 (5.119 ns, 2.97 V); M8 (5.091 ns, 330 mV), M2 (5.143 ns, 1.65 V), M7 (5.204 ns, 2.97 V).
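The marker values reported for Figure 2.22 can be turned into the delay and rise-time figures with two subtractions: the propagation delay is the difference between the 50% crossings of the output and of the clock, and the rise time is the difference between the 90% and 10% crossings of the same signal. The sketch below assumes that the three marker groups belong to Clock, Sum bit 31 and Carry out 30 in that order, an assignment inferred from the tabulated values rather than stated in the figure.

```python
def edge_metrics(t10, t50, t90, clk_t50):
    """Return (propagation delay, rise time) in ns, rounded to 1 ps.
    Delay is measured between 50% crossings, rise time from 10% to 90%."""
    return round(t50 - clk_t50, 3), round(t90 - t10, 3)

# Crossing times in ns at 10%, 50% and 90% of the high output voltage.
# The signal-to-marker assignment is an assumption (see the lead-in).
CLK = (4.010, 4.050, 4.090)
S31 = (5.048, 5.074, 5.119)
CO30 = (5.091, 5.143, 5.204)

clk_t50 = CLK[1]
for name, marks in (("CLK", CLK), ("S31", S31), ("Co,30", CO30)):
    delay, rise = edge_metrics(*marks, clk_t50)
    print(f"{name}: delay = {delay} ns, rise time = {rise} ns")
```

With this assignment the computed values agree with the table for the Kogge-Stone radix-2 adder in configuration 1.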
Figure 2.23: Kogge-Stone radix-2 adder configuration 2. Waveforms of Clock and Carry out 31 over time (0 to 20 ns). Markers: M1 (4.01 ns, 330 mV), M2 (4.05 ns, 1.65 V), M0 (4.09 ns, 2.97 V); M4 (5.152 ns, 330 mV), M5 (5.203 ns, 1.65 V), M6 (5.264 ns, 2.97 V).
Figure 2.24: Brent-Kung radix-2 adder configuration 1. Waveforms of Clock, Sum bit 30, Sum bit 31 and Carry out 30 over time (0 to 20 ns). Markers: M0 (4.01 ns, 330 mV), M1 (4.05 ns, 1.65 V), M2 (4.09 ns, 2.97 V); M3 (5.506 ns, 330 mV), M4 (5.532 ns, 1.65 V), M5 (5.577 ns, 2.97 V); M6 (5.569 ns, 330 mV), M7 (5.621 ns, 1.65 V), M8 (5.682 ns, 2.97 V).
Figure 2.25: Brent-Kung radix-2 adder configuration 2. Waveforms of Clock and Carry out 31 over time (0 to 20 ns). Markers: M0 (4.01 ns, 330 mV), M1 (4.05 ns, 1.65 V), M2 (4.09 ns, 2.97 V); M3 (5.136 ns, 330 mV), M4 (5.187 ns, 1.65 V), M5 (5.249 ns, 2.97 V).
Figure 2.26: Kogge-Stone radix-4 adder configuration 1. Waveforms of Clock, Sum bit 30, Sum bit 31 and Carry out 30 over time (0 to 20 ns). Markers: M0 (4.01 ns, 330 mV), M1 (4.05 ns, 1.65 V), M2 (4.09 ns, 2.97 V); M3 (5.436 ns, 330 mV), M4 (5.461 ns, 1.65 V), M5 (5.506 ns, 2.97 V); M6 (5.561 ns, 330 mV), M7 (5.613 ns, 1.65 V), M8 (5.674 ns, 2.97 V).
Figure 2.27: Kogge-Stone radix-4 adder configuration 2. Waveforms of Clock and Carry out 31 over time (0 to 20 ns). Markers: M0 (4.01 ns, 330 mV), M1 (4.05 ns, 1.65 V), M2 (4.09 ns, 2.97 V); M3 (5.592 ns, 330 mV), M4 (5.644 ns, 1.65 V), M5 (5.705 ns, 2.97 V).
Figure 2.28: Brent-Kung radix-4 adder configuration 1. Waveforms of Clock, Sum bit 30, Sum bit 31 and Carry out 30 over time (0 to 20 ns). Markers: M0 (4.01 ns, 330 mV), M1 (4.05 ns, 1.65 V), M2 (4.09 ns, 2.97 V); M3 (6.03 ns, 330 mV), M4 (6.056 ns, 1.65 V), M5 (6.101 ns, 2.97 V); M6 (6.03 ns, 330 mV), M7 (6.082 ns, 1.65 V), M8 (6.144 ns, 2.97 V).
Figure 2.29: Brent-Kung radix-4 adder configuration 2. Waveforms of Clock and Carry out 31 over time (0 to 20 ns). Markers: M0 (4.01 ns, 330 mV), M1 (4.05 ns, 1.65 V), M2 (4.09 ns, 2.97 V); M3 (6.117 ns, 330 mV), M4 (6.169 ns, 1.65 V), M5 (6.23 ns, 2.97 V).
Simulations show that the recurrence solver adders tested introduce a delay on the order of nanoseconds. As predicted, in both the radix-2 and radix-4 cases the Kogge-Stone topology achieves a higher computational speed than the Brent-Kung architecture. On the other hand, the area occupation of the Kogge-Stone adder is greater than that of the Brent-Kung adder: 2669 versus 1733 transistors in the radix-2 case (about 54% more) and 1896 versus 1618 in the radix-4 case (about 17% more). Consequently, the power consumption of the Kogge-Stone topology is also greater than that of the Brent-Kung topology.
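The trade-off between speed and area can be summarized by two ratios per radix. The sketch below uses only figures reported in this section, namely the transistor counts and the configuration 1 delays of S31; choosing S31 as the reference signal is an assumption made here for illustration, and no power figure is computed because none was measured.

```python
transistors = {
    ("Kogge-Stone", 2): 2669, ("Brent-Kung", 2): 1733,
    ("Kogge-Stone", 4): 1896, ("Brent-Kung", 4): 1618,
}
# Propagation delay of S31 in configuration 1, in ns (from the tables).
s31_delay_ns = {
    ("Kogge-Stone", 2): 1.024, ("Brent-Kung", 2): 1.482,
    ("Kogge-Stone", 4): 1.411, ("Brent-Kung", 4): 2.060,
}

ratios = {}
for radix in (2, 4):
    ks, bk = ("Kogge-Stone", radix), ("Brent-Kung", radix)
    area_ratio = transistors[ks] / transistors[bk]      # Kogge-Stone area cost
    delay_ratio = s31_delay_ns[bk] / s31_delay_ns[ks]   # Brent-Kung delay cost
    ratios[radix] = (area_ratio, delay_ratio)
    print(f"radix-{radix}: KS/BK transistors = {area_ratio:.2f}, "
          f"BK/KS delay = {delay_ratio:.2f}")
```

With these figures, the Kogge-Stone adder uses about 1.54 times the transistors of the Brent-Kung adder in radix-2 and about 1.17 times in radix-4, while the Brent-Kung S31 delay is roughly 1.45 times longer in both cases.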
Conclusion
As a conclusion to this work, the importance of both theory and simulation in the design of a complex logic circuit must be underlined. On the one hand, theoretical research lays the foundation for increasingly high-performance structures by refining the logical and topological aspects of the circuit. On the other hand, simulation and verification of the theoretical results are an essential part of the design of all digital and analog circuits. Manual calculations alone are insufficient to implement a circuit and can only be regarded as a first step towards the realization of the system. Moreover, further optimization of the circuit can be achieved only through additional calculations and simulations, through which the designer can compare different structures in order to meet specifications in terms of computational performance, power consumption and area occupation.
Bibliography
[1] Jan M. Rabaey, Anantha Chandrakasan, Borivoje Nikolić, Digital Integrated Circuits, A Design Perspective, Prentice Hall, 2nd edition, 2003. ISBN: 0-13-120764-4.
[2] R. Brent and H.T. Kung, "A Regular Layout for Parallel Adders", IEEE Trans. on Computers, vol. C-31, no. 3, pp. 260-264, March 1982.
[3] P.M. Kogge and H.S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations", IEEE Trans. on Computers, vol. C-22, pp. 786-793, August 1973.
[4] Hoang Q. Dao and Vojin Oklobdzija, "Performance Comparison of VLSI Adders Using Logical Effort", PATMOS 2002, LNCS 2451, pp. 25-34, 2002, Springer-Verlag Berlin Heidelberg.
[5] Vojin Oklobdzija, "High-Speed VLSI Arithmetic Units: Adders and Multipliers", in Design of High-Performance Microprocessor Circuits, Chapter 4, edited by A. Chandrakasan, IEEE Press, 2000.
[6] J. A. Abraham, "Design of Adders", lecture slides, EE-382M.7, Department of Electrical and Computer Engineering, The University of Texas at Austin, September 21, 2011.