A 1.0 nsec 32-bit prefix tree adder in 0.25 micron static CMOS by Goldovsky, Alexander
Lehigh University
Lehigh Preserve
Theses and Dissertations
1999
Recommended Citation
Goldovsky, Alexander, "A 1.0 nsec 32-bit prefix tree adder in 0.25 micron static CMOS" (1999). Theses and Dissertations. Paper 580.
May 31, 1999
A 1.0 nsec 32-bit Prefix Tree Adder in 0.25 Micron Static CMOS
by
Alexander Goldovsky
A Thesis
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science
in
Department of Electrical Engineering and Computer Science
Lehigh University
May 1999

Acknowledgments
Although this work bears my name, it is a result of the efforts and contributions of several
people. First and foremost, I would like to thank Professor Michel Schulte, my thesis
supervisor, for his original contributions and help throughout the project. His willingness
to discuss ideas, and enthusiasm have made it a unique and enjoyable learning experience
for me.
I am also grateful to Ravi Kolagotla of Lucent Technologies, Microelectronics for helping
me learn about different adder architectures, assisting me in testing of the shuttles and
helping me in implementing this project. He has been a good friend and a source of
inspiration.
I would also like to thank the following people for their contribution to the project: Chris
Nicol, for his ideas on superimposed adder architecture, Mat Besz, Tracey Dellarova,
Scott van-Horn and David Koehler for helping with the layout design of the adders, and
Vance Archer for his help in including the adders on the Lucent 0.25 micron test shuttle.
Table of Contents
List of Tables
List of Figures
Abstract
1. Introduction
1.1 Carry-lookahead addition
1.2 Binary addition
1.3 Implementation issues
2. Existing architectures for binary addition
2.1 Carry-ripple adder
2.2 Carry-skip adder
2.3 Carry-lookahead adder
2.4 Brent-Kung adder
2.5 Superimposed tree CLA
2.6 Superimposed prefix tree Ling adder
3. New architecture for superimposed prefix tree CLA
3.1 New architecture for prefix tree adder
3.2 Improved architecture of prefix tree adder for low power
4. Prefix tree CLA architecture comparison
4.1 Architecture comparison
4.2 VLSI implementation of 32-bit architecture
4.3 Test results
4.4 Test structure
Conclusions and Future Work
References
Appendix
Vita
List of Tables
Table 1. Truth table for the full adder cell.
Table 2. Comparison of different prefix tree adder architectures.
Table 3. Lucent 0.25-um CMOS process parametrics.
Table 4. Comparison with other 32-bit adders.
Table 5. Measured performance of the different implementations of the adder in Fig. 7.
List of Figures
Figure 1. Modified full adder for fast carry path propagation.
Figure 2. Modified full adder for fast carry path propagation with carry buffering.
Figure 3. XOR and XNOR mux-type structure.
Figure 4. N-bit carry-ripple adder design.
Figure 5. A 16-bit carry-skip adder design using multi-level skip and variable block
size.
Figure 6. A 16-bit Brent-Kung adder architecture.
Figure 7. A 16-bit superimposed prefix tree CLA adder architecture.
Figure 8. A 16-bit superimposed prefix tree CLA adder architecture with modification
to G0.
Figure 9. New architecture for 16-bit prefix tree adder with carry incorporated into the
tree.
Figure 10. New architecture for 16-bit prefix tree adder with carry incorporated into the
tree with low power solution.
Figure 11. Speed vs. transistor count of different 32-bit adders in 0.25-um technology.
Figure 12. AND gate delay measurement results.
Figure 13. Divide network for testing the adder off-chip.
Figure 14. Examples of applications for a prefix tree adder with carry-in incorporated
Figure A-1. 32-bit circuit implementation of the carry-skip adder.
Figure A-2. 32-bit circuit implementation of the Brent-Kung adder.
Figure A-3. 32-bit circuit implementation of the Ling adder.
Figure A-4. 32-bit circuit implementation of the superimposed tree CLA, according to
Fig. 7.
Figure A-5. 32-bit circuit implementation of the superimposed tree CLA with carry in
incorporated into the tree, according to Fig. 9.
Figure A-6. 32-bit circuit implementation of the superimposed tree CLA with carry in
incorporated into the tree (low power solution), according to Fig. 10.
Figure A-7. Layout of the 32-bit prefix tree adder according to Fig. 7.
Figure A-8. Layout of the 32-bit prefix tree adder according to Fig. 9.
Figure A-9. Circuit diagram for the test structure on the shuttle.
Figure A-10. Two different layout styles according to Table 5 [Section 4.3].
Figure A-11. Photograph of the die.
Abstract
The carries in a carry-lookahead adder can be computed by using a separate prefix tree for
each bit location. This is nearly twice as fast as the standard Brent and Kung addition
technique. This thesis shows that the primary carry input signal can be incorporated into the
prefix trees without incurring any additional delay. The proposed architecture reduces the
logic depth of an n-bit adder by one, compared to existing architectures. Using this
technique with fully-static circuits, a 32-bit, radix-2 prefix tree adder has a delay of 1.0 nsec in
the Lucent 0.25um CMOS Technology. The proposed adder architecture can easily be
extended to other word sizes and radices.
Chapter 1: Introduction.
1.1 Carry Lookahead Addition
With ever-shrinking VLSI process geometries, transistor count and chip
area are becoming secondary considerations to delay and power. Hence, it is necessary to
reexamine the tradeoffs that have been made in existing designs and implementations of
computer arithmetic algorithms.
The carry lookahead technique, first reported by Weinberger and Smith,
speeds up the addition process by unrolling the recursive carry equation [1]. Both transistor
count and interconnection complexity have typically limited unrolling to four bits.
Larger adders have been built as block carry-lookahead adders, where the lookahead
operation occurs within small blocks [2].
The recursive carry computation can also be reduced to a prefix computation [3], [4].
With this technique, a prefix tree computes the carry at the most-significant bit position,
and an additional tree superimposed on the prefix tree is used to compute the intermediate
carries [5]. Faster computation of the carries can be achieved by using a separate prefix
tree for each bit position [6], [7]; however, this approach requires more hardware.
Full prefix tree adders, also known as Kogge-Stone adders, have not been
frequently used because of the additional delay and area introduced by their exponentially
growing interconnect complexity [2], [8]. Existing architectures have emphasized the
reduction of interconnection complexity at the expense of higher gate fanouts [6], [9].
Interconnection complexity can also be reduced by using hybrid carry-lookahead / carry-
select architectures, which eliminate the need to implement a full prefix tree for each bit
position [10]. The imminent, widespread use of low R and C materials [11] reduces the
negative effects of architectures that depend on large amounts of interconnect [12].
Furthermore, with additional levels of interconnect, the area overhead of implementing these
adders is alleviated through the use of extensive over-the-cell routing, which removes the
routing channels and further minimizes the interconnect capacitance.
This thesis shows that the primary carry input can be incorporated into the
full prefix tree adder without additional overall delay. To demonstrate the benefits of this
approach, 32-bit prefix tree adders were implemented with the carry input based on both
the existing and proposed architectures. Both implementations use fully static circuits in
the standard Lucent 0.25-um CMOS technology. Static circuits are preferred to dynamic
circuits for their low power and their ease of design. The measured delay of the adder
with the existing architecture is 1.1 nsec, while the measured delay of the adder with the
proposed architecture is 1.0 nsec. This delay is expected to be lower if the adders are
implemented in technologies with lower interconnect RC delays [12].
1.2 Binary addition
The addition of two numbers,
$$A = -a_{n-1} \cdot 2^{n-1} + \sum_{j=0}^{n-2} a_j \cdot 2^j$$
and
$$B = -b_{n-1} \cdot 2^{n-1} + \sum_{j=0}^{n-2} b_j \cdot 2^j$$
represented in two's complement binary form, can be accomplished by computing:
$$g_j = a_j \cdot b_j$$
$$p_j = a_j \oplus b_j$$
$$c_j = g_j + p_j \cdot c_{j-1}$$
$$s_j = p_j \oplus c_{j-1}$$
where $0 \le j < n$ and $c_{-1}$ is the primary carry-input. An overflow occurs, and the
resulting sum is invalid, if
$$c_{n-1} \oplus c_{n-2} = 1$$
A straightforward implementation of a 1-bit adder is achieved through a full adder, which
implements the above equations. In an n-bit addition, the longest path through the adder is
the carry propagation path. Hence, the design of the full adder is optimized to minimize
the delay of the carry path. The diagram for a modified full adder is shown below. The
following equation is used to implement the carry path of the full adder:
$$c_j = a_j \cdot \overline{p_j} + p_j \cdot c_{j-1}$$
FIGURE 1. Modified full adder for fast carry path propagation.
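The bit-level equations above can be checked with a short behavioral sketch. This is only a software model, not the circuit of Figure 1; the function names (`full_adder`, `mux_carry`, `add_twos_complement`) are illustrative and do not come from the thesis.

```python
def full_adder(a, b, c_in):
    """One bit position: returns (s_j, c_j) from a_j, b_j and c_{j-1}."""
    g = a & b                 # generate
    p = a ^ b                 # propagate
    s = p ^ c_in              # sum bit
    c_out = g | (p & c_in)    # carry out
    return s, c_out


def mux_carry(a, b, c_in):
    """Carry of the modified full adder: a 2-to-1 mux selected by p_j."""
    p = a ^ b
    return c_in if p else a


def add_twos_complement(A, B, n, c_in=0):
    """Ripple the full adder over n bits of the bit patterns A and B.

    Returns (sum_bits, overflow) with overflow = c_{n-1} XOR c_{n-2}.
    """
    carries = [c_in]          # carries[j + 1] holds c_j, carries[0] is c_{-1}
    total = 0
    for j in range(n):
        s, c = full_adder((A >> j) & 1, (B >> j) & 1, carries[-1])
        total |= s << j
        carries.append(c)
    overflow = carries[n] ^ carries[n - 1]
    return total, overflow
```

The mux form of the carry works because when $p_j = 0$ the two operand bits are equal, so $a_j$ alone decides whether a carry is generated.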
In practice, when designed in 0.25um technology, all pass-gate structures, such as muxes
and latches, are buffered for better performance. This also applies to the full adder cell.
The carry out signal is inverted to buffer the 2-to-1 mux. The full adder which follows the
one with the inverted carry out signal recovers the positive logic, as shown in Figure 2(B).
FIGURE 2. Modified full adder for fast carry path propagation with carry buffering.
Section 1.3 covers basic blocks for the adder design in 0.25um technology. Assuming the
delay for an inverter cell (2 transistors) is $0.5\Delta$, the delay for an XOR/XNOR (10
transistors) is $2\Delta$, and the delay for a mux (4 transistors) is $1\Delta$, the following
hold for the diagram in Figure 2(B):
Number of transistors = 28
logic depth from $c_{j-1}$ to $s_j$ = $2\Delta$
logic depth from $c_{j-1}$ to $c_j$ = $1.5\Delta$
logic depth from $a_j$ to $s_j$ = $4\Delta$
logic depth from $a_j$ to $c_j$ = $3.5\Delta$
1.3 Implementation issues.
With shrinking process geometries, it is critical to find the best implementation for the
basic cells, due to threshold voltage limitations of static CMOS technology. Simulation
shows that high-performance 0.25um designs should use 2-input NOR gates, 2-input
or 3-input NAND gates, and full CMOS type structures such as muxes and latches.
XOR and XNOR cells are designed using mux-type structures as shown in Figure 3.
Although the transistor count is larger than typical implementations of XOR and XNOR
gates, these designs are about 20% faster than the standard weak PMOS pull-up structures
in 0.25um technology at 1.6 Volts. These XOR and XNOR designs each require 10
transistors.
All the mux structures are designed with fully complementary pass gates, with local
buffering to reduce capacitive loading on the output nodes.
FIGURE 3. XOR and XNOR mux-type design.
2 Existing architectures for binary addition.
Parallel adders can be classified into two categories based on the way in which internal
carries from stage to stage are handled: ripple carry and lookahead carry. Externally, both
types of adders are the same in terms of inputs and outputs. The differences are the speed
at which they operate, the area of the layout, and the power consumption. Carry-ripple
adders, as described in Section 2.1, are simpler to design and require very little area and
hardware. However, they are also the slowest type of adder. Hence, if performance is not an
issue, carry-ripple adders are often used. To improve the speed of binary addition,
carry-skip techniques, as described in Section 2.2, were introduced. Although it takes more
hardware to incorporate the skip logic, the gain in speed is more than 200% for 16-bit
implementations, according to the simulation data presented in Figure 11. When designing
even faster adders, it is essential to get around the rippling effect of the carry that is present
in both carry-skip and carry-ripple adders. The carry-lookahead principle offers a possible
way to do so. This thesis describes 3 types of carry-lookahead adders: Brent-Kung adders
[Section 2.4], Ling adders [Section 2.6], and superimposed prefix tree adders [Section
2.5].
2.1 Carry-ripple adder design.
The ripple carry adder (RCA) provides a slow, but hardware efficient, method for adding
two binary numbers. An n-bit RCA is formed by cascading n full adders (FAs). For high
speed, two types of FAs are used, as described in Section 1.2. The carry out of the $j$th FA is
used as the carry in of the $(j+1)$th FA, as shown in Figure 4. The carry propagation delay
for each full adder is the time from the application of the input carry until the output carry
is valid, assuming the a and b inputs are already present. For the n-bit RCA, the number of
transistors and number of logic stages (or logic depth) are
Number of transistors = $28 \cdot n$
depth for RCA adder = $4\Delta + (1.5\Delta \cdot (n - 2)) + 2\Delta = (1.5n + 3)\Delta$
FIGURE 4. N-bit RCA design.
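As a worked instance of the two RCA expressions above, the following sketch evaluates them for a given word size. The function name `rca_cost` is hypothetical, and the depth is returned in units of $\Delta$.

```python
def rca_cost(n):
    """Transistor count and logic depth (in units of Delta) of an n-bit RCA.

    Uses the figures quoted in the text: 28 transistors per full adder,
    4 Delta through the first stage, 1.5 Delta per intermediate carry stage,
    and 2 Delta from the last carry to the last sum bit.
    """
    transistors = 28 * n
    depth = 4.0 + 1.5 * (n - 2) + 2.0
    return transistors, depth
```

For a 32-bit adder this gives 896 transistors and a depth of $51\Delta$, which matches the closed form $(1.5n + 3)\Delta$.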
2.2 Carry-skip adder design.
To reduce the carry propagation path of the RCA, the carry-skip adder (CSKA) is
introduced, where each carry is evaluated from the previous adder stages. The CSKA is based
on the following observation. The propagation process can skip any adder stage for which
$a_j \ne b_j$, or equivalently, $a_j \oplus b_j = 1$. Several stages can be skipped if all satisfy
$a_j \ne b_j$. Thus, an adder consisting of n stages is divided into blocks of consecutive stages
with a simple RCA scheme used within each block. Every block of length k (which is
called the block size) also generates a block-carry-propagate signal that is defined as
$$P_j^{j+k-1} = p_j \cdot p_{j+1} \cdots p_{j+k-1}$$
The carry out of block k is expressed as
$$c_{j+k} = P_j^{j+k-1} \cdot c_j + G_j^{j+k-1}$$
where $c_{j+k}$ is the carry out of the last FA in subgroup k.
A good strategy when designing a CSKA is to vary the block size to optimize
the carry propagation timing. Also, for improved performance, multi-level skip is
performed. This is illustrated in Fig. 5. Empty squares represent full adders, filled rectangles
implement skip equations, dashed lines are propagate signals, and solid lines are carry
paths. Appendix [Figure A-1] shows the circuit implementation of a 32-bit carry-skip
adder.
FIGURE 5. A 16-bit carry-skip adder design using multi-level skip and variable block size
techniques.
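The skip equation can be exercised with a small behavioral model. This is a sketch under simplifying assumptions: it uses a single-level skip with one fixed block size, whereas Fig. 5 uses multi-level skip and variable block sizes, and the function name `carry_skip_add` and its indexing are illustrative only.

```python
def carry_skip_add(A, B, n, block=4, c_in=0):
    """Add two n-bit patterns block by block with a single-level carry skip.

    Inside each block the carry ripples; the carry handed to the next block
    is block_generate OR (block_propagate AND block_carry_in).
    """
    total = 0
    c = c_in                              # carry into the current block
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        block_p = 1                       # AND of all p_j in the block
        block_g = 0                       # carry out of the block for carry-in 0
        carry = c                         # rippling carry used for the sum bits
        for j in range(lo, hi):
            a, b = (A >> j) & 1, (B >> j) & 1
            g, p = a & b, a ^ b
            total |= (p ^ carry) << j
            carry = g | (p & carry)
            block_g = g | (p & block_g)
            block_p &= p
        c = block_g | (block_p & c)       # skip equation
    return total, c
```

In hardware the benefit comes from `block_p` being available early, so the incoming carry can bypass the block instead of rippling through it; the model only confirms that the skip equation yields the same carry.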
2.3 Carry-lookahead adder design.
In the RCA, the speed with which an addition is performed is limited by the time required
for carries to propagate, or ripple, through all stages of the adder. One method for
speeding up the addition process is to eliminate this ripple carry delay by carry look-ahead
addition. This method is based on the two ways in which the full adder produces an output
carry: carry generation and carry propagation. Carry generation occurs when an output
carry is set independently of the carry input. A carry is generated only when both $a_j$ and
$b_j$ are one. The generate signal is expressed as:
$$g_j = a_j \cdot b_j$$
An input carry is propagated by the full adder when either $a_j$ or $b_j$ is one, as discussed
in Section 2.2. The propagate signal is expressed as:
$$p_j = a_j \oplus b_j$$
Alternatively, a transmit signal can be used, where
$$t_j = a_j + b_j$$
The truth table for the carry generation and carry propagation conditions is shown in Table 1.
Table 1: Truth table for the full adder cell.
a_j  b_j  c_{j-1}  p_j  t_j  g_j  s_j  c_j
0    0    0        0    0    0    0    0
0    0    1        0    0    0    1    0
0    1    0        1    1    0    1    0
0    1    1        1    1    0    0    1
1    0    0        1    1    0    1    0
1    0    1        1    1    0    0    1
1    1    0        0    1    1    0    1
1    1    1        0    1    1    1    1
From Table 1, we can derive the relationship between the carry in $c_{j-1}$ and carry out $c_j$:
$$c_j = g_j + p_j \cdot c_{j-1} \quad \text{or} \quad c_j = g_j + t_j \cdot c_{j-1}$$
As mentioned previously, an overflow occurs when $c_{n-1} \oplus c_{n-2} = 1$.
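The claim that the propagate and transmit forms of the carry equation agree can be verified by enumerating all eight input combinations. The sketch below regenerates the rows of Table 1; the helper name `full_adder_signals` is illustrative.

```python
from itertools import product


def full_adder_signals(a, b, c_in):
    """Return (p, t, g, s, c) for one full adder cell, as in Table 1."""
    p = a ^ b
    t = a | b
    g = a & b
    s = p ^ c_in
    c_from_p = g | (p & c_in)
    c_from_t = g | (t & c_in)
    # The two carry expressions differ only when a = b = 1, where g = 1 anyway.
    assert c_from_p == c_from_t
    return p, t, g, s, c_from_p


# One row per input combination, in the same column order as Table 1.
table = [(a, b, c) + full_adder_signals(a, b, c)
         for a, b, c in product((0, 1), repeat=3)]
```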
2.4 Brent and Kung adder.
The Brent and Kung adder uses a generic associative operator, called the dot operation ($\circ$).
The associativity property implies that the following statement is valid:
$a \circ (b \circ c) = (a \circ b) \circ c$. Under these conditions, combining n arguments using the dot
operator can be executed with a critical path equal to $\lceil \log_2 n \rceil \cdot t_\circ$, where $t_\circ$ is the
propagation delay of the dot operator, defined later as a propagation stage. This property
can be applied to an n-bit adder. This requires the definition of a "$\circ$" operator that
establishes the following relationship between two tuples $(g_j, p_j)$:
$$(g_j, p_j) \circ (g_i, p_i) = (g_j + p_j \cdot g_i,\; p_j \cdot p_i)$$
As in [5] and [6], we define $(G_j^j, P_j^j) = (g_j, p_j)$, and
$$(G_i^j, P_i^j) = (G_k^j, P_k^j) \circ (G_i^{k-1}, P_i^{k-1}), \quad i < k \le j$$
where "$\circ$", the fundamental carry operator, is associative [5]. In particular, for radix-2
operation, the "$\circ$" operator is a function that takes in two sets of two inputs $(g_j, p_j)$ and
$(g_i, p_i)$ and produces a set of two outputs $(G_i^j, P_i^j)$. At each bit position, the carry is
given by
$$c_j = G_0^j + P_0^j \cdot c_{-1}$$
where $c_{-1}$ is the primary carry input. If there is no primary carry input, then $c_j$ is
simply $G_0^j$. This is illustrated in Fig. 6 for a 16-bit adder. Filled circles implement the
fundamental carry equation, empty circles are buffers, and empty squares compute first
order of propagate and generate signals. The circuit diagram is shown in the Appendix
[Figure A-2].
FIGURE 6. A 16-bit Brent and Kung adder [5] where the carries propagate from top to bottom.
The number of stages needed to implement a Brent-Kung adder is $2 \cdot \lceil \log_2 n \rceil$.
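The associativity that the prefix formulation relies on can be checked exhaustively. The sketch below assumes the standard form of the fundamental carry operator, $(g, p) \circ (g', p') = (g + p \cdot g',\; p \cdot p')$; the names `dot`, `is_associative` and `carries` are illustrative, and the fold is sequential rather than a Brent-Kung tree.

```python
from itertools import product


def dot(hi, lo):
    """Fundamental carry operator on (generate, propagate) pairs.

    `hi` covers the more significant bits, `lo` the less significant ones.
    """
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def is_associative():
    """Check x o (y o z) == (x o y) o z over every combination of bit pairs."""
    pairs = list(product((0, 1), repeat=2))
    return all(dot(x, dot(y, z)) == dot(dot(x, y), z)
               for x, y, z in product(pairs, repeat=3))


def carries(A, B, n, c_in=0):
    """Carries c_0..c_{n-1}, folding the operator from bit 0 upwards."""
    out = []
    acc = None                            # holds (G_0^j, P_0^j)
    for j in range(n):
        a, b = (A >> j) & 1, (B >> j) & 1
        gp = (a & b, a ^ b)
        acc = gp if acc is None else dot(gp, acc)
        G, P = acc
        out.append(G | (P & c_in))        # c_j = G_0^j + P_0^j * c_{-1}
    return out
```

Because the operator is associative but not commutative, a tree may regroup the operands freely as long as their bit order is preserved.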
2.5 Superimposed tree CLA adder.
With a superimposed tree CLA, the computation of $(G_0^j, P_0^j)$ for $0 \le j < n$ from
$p_0 \ldots p_{n-1}$ and $g_0 \ldots g_{n-1}$ can be accomplished in $\lceil \log_2 n \rceil$ stages [3], [6]. A
complete adder is constructed by implementing the following steps.
Step 1 (1 stage)
calculate $g_j = a_j \cdot b_j$ and $p_j = a_j \oplus b_j$, $0 \le j < n$
Step 2 ($\lceil \log_2 n \rceil$ stages)
For $k = 1$ to $\lceil \log_2 n \rceil$ calculate
$$(G_0^j, P_0^j) = (G_{j-2^{k-1}+1}^{j}, P_{j-2^{k-1}+1}^{j}) \circ (G_0^{j-2^{k-1}}, P_0^{j-2^{k-1}})$$
$$(G_{j-2^k+1}^{j}, P_{j-2^k+1}^{j}) = (G_{j-2^{k-1}+1}^{j}, P_{j-2^{k-1}+1}^{j}) \circ (G_{j-2^k+1}^{j-2^{k-1}}, P_{j-2^k+1}^{j-2^{k-1}})$$
Step 3 (1 stage)
calculate $c_j = G_0^j + P_0^j \cdot c_{-1}$, $0 \le j < n$
Step 4 (1 stage)
calculate $s_j = p_j \oplus c_{j-1}$
This is illustrated in Fig. 7 for a 16-bit adder. The open squares at the top compute $p_j$ and
$g_j$ for each bit position according to step 1. The empty circles apply the fundamental carry
operator according to step 2, and the filled circles represent buffers. The last stage shown
in Fig. 7, using crossed circles, applies $c_{-1}$ to every $(G_0^j, P_0^j)$ according to step 3. The
output of this array is the carry at each bit position.
FIGURE 7. Computation of the carry equation using prefix trees for a 16-bit adder/subtractor or a 16-bit
adder with carry input. The empty circles implement the fundamental carry operator, the filled circles are
buffers, the crossed circles implement the equation in step 3, the empty squares compute generate and
propagate signals.
An additional stage (not shown) is needed to generate the sum at each bit position from $p_j$
and $c_{j-1}$ according to step 4. The logic depth of this adder is $3 + \lceil \log_2 n \rceil$. If there is no
carry input, then the last stage shown in Fig. 7 is not needed.
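The four steps can be modelled in software as a full prefix tree in which the carry input is applied in a separate last stage. This is a behavioral sketch only: the function names are illustrative, list indices stand in for wires, and buffers are implicit (a position that is already complete is simply copied forward).

```python
def prefix_tree_carries(A, B, n, c_in=0):
    """Carries c_0..c_{n-1} from a full prefix tree, carry input applied last.

    Returns (carries, propagates, number_of_tree_stages).
    """
    # Step 1: generate and propagate for every bit position.
    g = [(A >> j) & (B >> j) & 1 for j in range(n)]
    p = [((A >> j) ^ (B >> j)) & 1 for j in range(n)]
    G, P = g[:], p[:]
    # Step 2: each stage doubles the span covered at every bit position.
    d, stages = 1, 0
    while d < n:
        G_new, P_new = G[:], P[:]
        for j in range(d, n):
            G_new[j] = G[j] | (P[j] & G[j - d])
            P_new[j] = P[j] & P[j - d]
        G, P = G_new, P_new
        d *= 2
        stages += 1
    # Step 3: apply the primary carry input to every (G_0^j, P_0^j).
    c = [G[j] | (P[j] & c_in) for j in range(n)]
    return c, p, stages


def prefix_tree_add(A, B, n, c_in=0):
    """Step 4: s_j = p_j XOR c_{j-1}; returns (sum_bits, carry_out)."""
    c, p, _ = prefix_tree_carries(A, B, n, c_in)
    total = 0
    for j in range(n):
        c_prev = c_in if j == 0 else c[j - 1]
        total |= (p[j] ^ c_prev) << j
    return total, c[n - 1]
```

For n = 16 the tree loop runs four times, so together with steps 1, 3 and 4 the model reproduces the logic depth of $3 + \lceil \log_2 n \rceil$ stated above.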
Alternatively, the contribution due to the carry input can be incorporated by redefining the
first generate in an adder/subtractor as [6]
$$g_0 = a_0 \cdot b_0 + (a_0 + b_0) \cdot c_{-1}$$
with this change
$$c_j = G_0^j$$
This is illustrated in Fig. 8 for the 16-bit adder. This replaces the hardware required to
implement step 3 above and reduces the fanout on the $c_{-1}$ input from n to 1. However, the
logic depth remains $3 + \lceil \log_2 n \rceil$, and the overall theoretical delay of the adder is
unchanged.
FIGURE 8. Computation of the carry equation using prefix trees for a 16-bit adder with carry input
according to [6]. The filled square implements the equation for the first generate signal, the empty
squares compute generate and propagate signals, the empty circles implement the fundamental carry
operator, the filled circles are buffers.
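Folding the carry input into the first generate signal can be checked with a small model. The sketch assumes the first generate is redefined as the carry out of bit 0, written here as $a_0 b_0 + (a_0 + b_0) c_{-1}$; the function name `carries_with_cin_in_g0` is illustrative.

```python
def carries_with_cin_in_g0(A, B, n, c_in):
    """Carries c_0..c_{n-1} with the carry input merged into g_0.

    After the prefix tree, c_j is simply G_0^j; no final carry stage is used.
    """
    g = [(A >> j) & (B >> j) & 1 for j in range(n)]
    p = [((A >> j) ^ (B >> j)) & 1 for j in range(n)]
    a0, b0 = A & 1, B & 1
    g[0] = (a0 & b0) | ((a0 | b0) & c_in)     # redefined first generate
    G, P = g[:], p[:]
    d = 1
    while d < n:
        G_new, P_new = G[:], P[:]
        for j in range(d, n):
            G_new[j] = G[j] | (P[j] & G[j - d])
            P_new[j] = P[j] & P[j - d]
        G, P = G_new, P_new
        d *= 2
    return G
```

The carry input now drives only the bit-0 cell, which is the fanout reduction from n to 1 described above; the extra gate level in that cell is why the logic depth does not improve.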
In CMOS technology a small speedup can be achieved by using transmit signals instead of
propagate signals to compute the carries for each bit position. The final sum computation
still requires the propagate signals to be generated from the primary inputs. The addition
operation in this case is defined as
$$t_j = a_j + b_j$$
$$c_j = g_j + t_j \cdot c_{j-1}$$
$$s_j = p_j \oplus c_{j-1}$$
where $0 \le j < n$ and $c_{-1}$ is the primary carry-input.
We define $(G_j^j, T_j^j) = (g_j, t_j)$ and
$$(G_i^j, T_i^j) = (G_k^j, T_k^j) \circ (G_i^{k-1}, T_i^{k-1}), \quad i < k \le j$$
where "$\circ$" is the fundamental carry operator. The computation of $(G_0^j, T_0^j)$ follows the same
methodology as in step 2 for $(G_0^j, P_0^j)$. At each bit position, the carry is given by
$$c_j = G_0^j + T_0^j \cdot c_{-1}$$
If there is no primary carry input, then $c_j$ is simply $G_0^j$.
The $t_j$ signals can be computed faster than the $p_j$ signals since an OR gate is typically
faster than an XOR gate. Hence, the carry computation through the prefix tree can start
slightly earlier if transmit signals are used. Since the sum generation step still uses the
propagate signals, the load on the transmit signals in this architecture is smaller than the
load on propagate signals in the earlier architecture. However, the load on the input signals
is now higher since both transmit and propagate signals need to be generated.
In Fig. 7, the open squares at the top need to compute the transmit signal in
addition to the generate and propagate signals. The remaining circles operate on the
transmit signals instead of the propagate signals. The circuit diagram is shown in the Appendix
[Figure A-4].
The superimposed prefix tree adder has smaller logic depth than the Brent-
Kung adder, hence there is less delay through the adder. Although the superimposed prefix
tree adder implementation in VLSI requires more hardware than the Brent-Kung adder as
shown in Figure 11 (transistor count), the layout area in the regular datapath structure is
smaller, due to the decreased number of stages (rows in the layout), with interconnect
complexity not being an issue with the use of the multilevel metal interconnect.
2.6 Ling CLA.
A new approach to represent the carry formation and propagation was introduced in [18].
A new H function was derived that represents the relationship of neighboring bits, similar
to group transmit (represented as T) and group generate (represented as K) signals in the
CLA. The adder can be constructed using the following steps.
Step 1 (1 stage)
calculate $k_j = a_j \cdot b_j$, $0 \le j < n$, and $t_j = a_j + b_j$, $0 \le j < n$
Step 2 ($\lceil \log_2 n \rceil$ stages)
For $k = 1$ to $\lceil \log_2 n \rceil$ calculate
$$H_j = K_{j-2^{k-1}+1}^{j} + T_{j-2^{k-1}+1}^{j} \cdot H_{j-2^{k-1}}, \quad 2^{k}-1 \le j < n$$
Step 3 (2 stages)
calculate $s_j = (H_j \oplus T_j^j) + T_{j-1}^{j-1} \cdot K_j^j \cdot H_{j-1}$, $1 \le j < n$,
and $c_{n-1} = H_{n-1} \cdot T_{n-1}^{n-1}$
To reduce the fanout of $c_{-1}$ to 1, $c_{-1}$ is incorporated into the first H
function. This is illustrated in Figure 8. The filled square now calculates $H_0$, and the
remaining circles operate on the transmit/kill signals instead of the propagate/generate
signals. The final sum computation still requires 2 stages according to step 3. The logic depth
of this adder is still $3 + \lceil \log_2 n \rceil$. The proof of the algorithm is given in [18]. The circuit
implementation of the superimposed tree [Section 2.5] Ling adder is shown in the Appendix
[Figure A-3].
The significance of the Ling adder compared to other adders is the use of
the OR gate for the transmit signal in conjunction with superimposed prefix trees with low
order H functions used inside the tree for generating the high order H function. Also,
similar to the adder shown in Figure 8, the primary carry input $c_{-1}$ of the adder has a fanout
of one, compared to the adder in Figure 7. It requires the same layout area as the
superimposed prefix tree adder with smaller propagation delay. This is due to the use of the OR
gate for the transmit signal, instead of XOR for the propagate signal, and faster logic for
the final sum calculation.
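The idea behind the H function can be illustrated with the commonly used textbook form of Ling's recurrence, which is an assumption here and is not taken from the thesis's own group notation: $H_j = g_j + c_{j-1}$, so that $H_j = g_j + t_{j-1} H_{j-1}$ and $c_j = t_j H_j$. The sketch evaluates the recurrence serially rather than with a prefix tree, and the name `ling_add` is illustrative.

```python
def ling_add(A, B, n, c_in=0):
    """Add two n-bit patterns using Ling pseudo-carries H_j.

    Assumed formulation: H_0 = g_0 + c_{-1}, H_j = g_j + t_{j-1} * H_{j-1},
    and the true carry is recovered as c_j = t_j * H_j.
    """
    g = [(A >> j) & (B >> j) & 1 for j in range(n)]
    t = [((A >> j) | (B >> j)) & 1 for j in range(n)]
    H = []
    for j in range(n):
        if j == 0:
            H.append(g[0] | c_in)
        else:
            H.append(g[j] | (t[j - 1] & H[j - 1]))
    total = 0
    for j in range(n):
        c_prev = c_in if j == 0 else (t[j - 1] & H[j - 1])
        # s_j = (H_j XOR t_j) + g_j * c_{j-1}
        total |= ((H[j] ^ t[j]) | (g[j] & c_prev)) << j
    c_out = t[n - 1] & H[n - 1]
    return total, c_out
```

The pseudo-carry drops one factor of $t_j$ from the carry recurrence, which is where the faster first tree level comes from; that factor is reinserted in the sum and carry-out logic.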
3 New architecture for prefix tree CLA.
3.1 New architecture for prefix tree adder.
An alternative solution to the carry computation shown in Fig. 7 is to allow the low order
carries to be used to compute the high order carries in parallel inside the prefix tree. For
example, $c_0$ can be used in stage 2 and further stages of the prefix tree without affecting
the delay of the carry at the higher bit positions. This algorithm is described below for a
radix-2 prefix tree.
Step 1 (1 stage)
calculate $g_j = a_j \cdot b_j$ and $t_j = a_j + b_j$, $0 \le j < n$
Step 2 ($\lceil \log_2 n \rceil$ stages)
For $k = 1$ to $\lceil \log_2 n \rceil$ calculate
$$c_j = G_{j-2^{k-1}+1}^{j} + T_{j-2^{k-1}+1}^{j} \cdot c_{j-2^{k-1}}, \quad 2^{k-1}-1 \le j < 2^{k}-1$$
$$(G_{j-2^k+1}^{j}, T_{j-2^k+1}^{j}) = (G_{j-2^{k-1}+1}^{j}, T_{j-2^{k-1}+1}^{j}) \circ (G_{j-2^k+1}^{j-2^{k-1}}, T_{j-2^k+1}^{j-2^{k-1}})$$
Step 3 (1 stage)
calculate $c_{n-1} = G_0^{n-1} + T_0^{n-1} \cdot c_{-1}$ and $s_j = p_j \oplus c_{j-1}$, $\forall j$, $0 \le j < n$.
This is illustrated in Fig. 9 for a 16-bit adder. The open squares at the top compute $t_j$ and
$g_j$ for each bit position according to step 1. The crossed circles implement the first
equation in step 2 and the first equation in step 3. The empty circles apply the fundamental
carry operator according to the second equation in step 2, and the filled circles are buffers.
An additional stage (not shown) is needed to generate the sum at each bit position from
$p_j$ and $c_{j-1}$ according to step 3. The sum computation occurs in parallel with the
computation of the final carry output $c_{n-1}$. The logic depth of this adder is $2 + \lceil \log_2 n \rceil$. The
fanout of $c_{-1}$ is $1 + \lceil \log_2 n \rceil$. This algorithm can also be extended to higher radix prefix
trees. The circuit implementation is shown in the Appendix [Figure A-5].
FIGURE 9. Prefix tree adder with carry incorporated into the tree. The empty circles implement the
fundamental carry operator, the filled circles are buffers, the crossed circles compute carries according to
Step 2 and 3, the empty squares compute generate and propagate signals.
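The three steps can be modelled as follows. This is a behavioral sketch under stated assumptions: the word size is restricted to a power of two for brevity, a dictionary keyed by bit position (with key -1 for the primary carry input) stands in for the carry wires, and the names `dot` and `carry_in_tree_add` are illustrative rather than taken from the thesis.

```python
def dot(hi, lo):
    """Fundamental carry operator on (generate, transmit) pairs."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])


def carry_in_tree_add(A, B, n, c_in=0):
    """Prefix tree adder with the carry input used inside the tree.

    Returns (sum_bits, carry_out, number_of_tree_stages).
    """
    assert n >= 1 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    # Step 1: generate, transmit (and propagate for the final sum).
    g = [(A >> j) & (B >> j) & 1 for j in range(n)]
    t = [((A >> j) | (B >> j)) & 1 for j in range(n)]
    p = [((A >> j) ^ (B >> j)) & 1 for j in range(n)]
    span = list(zip(g, t))        # span[j] covers bits j-len+1..j, len = 1
    c = {-1: c_in}
    half, stages = 1, 0
    # Step 2: stage k finishes the carries 2^(k-1)-1 <= j < 2^k - 1 from
    # already finished lower carries, and doubles the remaining spans.
    while half < n:
        for j in range(half - 1, 2 * half - 1):
            G, T = span[j]
            c[j] = G | (T & c[j - half])
        span = [dot(span[j], span[j - half]) if j >= 2 * half - 1 else span[j]
                for j in range(n)]
        half *= 2
        stages += 1
    # Step 3: final carry out and sum bits, computed in parallel in hardware.
    G, T = span[n - 1]
    c[n - 1] = G | (T & c_in)
    total = 0
    for j in range(n):
        total |= (p[j] ^ c[j - 1]) << j
    return total, c[n - 1], stages
```

With n = 16 the loop runs four times, so steps 1 to 3 give $2 + \lceil \log_2 n \rceil = 6$ stages, one fewer than the arrangement in which the carry input is applied after the tree.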
3.2 Improved architecture of prefix tree adder for low power.
With shrinking technology, the interconnect capacitance becomes a major
factor in the loading per node, especially for the routing in the datapath used to connect the
high and low bits of the structures. In the adder structure of Fig. 9, the previous stage (low
order) carries are used to produce the high order carries. The interconnect lines to the last
stage of carry generation need to run a distance of n/2 bits, where n is the adder size. Also,
to generate the carry out ($c_{n-1}$), the primary carry in ($c_{-1}$) needs to run from bit position 0
to n-1.
The carry out ($c_{n-1}$) in Fig. 9 with n=16 uses EQ. 1 below, requiring an extra column of
carry operators for calculating $(G_0^{15}, T_0^{15})$.
(EQ 1)
Instead, the new design implements EQ. 2, thus eliminating some of the long
interconnecting wires, which compensates for the increase of the load on $c_{14}$.
(EQ 2)
FIGURE 10. Prefix tree adder with carry input incorporated into the tree and with the low power
solution with the same delay as the carry tree shown in Figure 9.
The same idea is applied to the intermediate carry out in bit position 7, which is $c_7$,
shown in EQ. 3 and EQ. 4. $c_7$ is an intermediate carry, which was generated by $(G_0^7, T_0^7)$
and $c_{-1}$ (EQ. 3). Eliminating the cell which generated $(G_0^7, T_0^7)$ and using $(G_4^7, T_4^7)$ and
$c_3$ to generate $c_7$ reduces the total number of gates needed to implement the above
architecture. The fanout of the cell which generates $(G_4^7, T_4^7)$ is reduced from 2 to 1, and
the fanout of $c_3$ is increased from 1 to 2. This balances the overall delay for $c_3$
generation.
(EQ 3)
(EQ 4)
This idea can be generalized for an n-bit prefix-tree adder with the primary carry input
incorporated into the tree. The final carry out is generated as:
$$c_{n-1} = G_{n-1}^{n-1} + T_{n-1}^{n-1} \cdot c_{n-2} \qquad \text{(EQ 5)}$$
The first (from the right) intermediate carry for every carry generation stage starting with
$c_7$, as shown in Figure 9, is generated as follows:
(EQ 6)
where $j = 8, 16, 32, \ldots$
Table 4 compares the 32-bit prefix tree adder according to Fig. 9 with the new and
improved low power version of the same adder according to Fig. 10. This shows that for
the same delay in both designs, the number of transistors used is improved by 3.3%. The
circuit implementation is shown in the Appendix [Figure A-6].
The power numbers improve for two main reasons: 1) A full bit slice for $c_{n-1}$ generation
and an additional cell in each row, shown in Figure 10, represented as $c_{j/2-1}$ with j = 8, 16,
32, ..., were eliminated. This reduces the amount of hardware required for the adder
implementation; 2) The fanout of the primary carry input signal was reduced to 3.
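The substitution described for $c_7$ can be checked exhaustively: generating $c_7$ from the full span over bits 0 to 7 and $c_{-1}$ gives the same value as generating it from the span over bits 4 to 7 and the already available $c_3$. The sketch below is a software check only, and the helper names are illustrative.

```python
def span_gt(A, B, lo, hi):
    """Group generate and transmit (G, T) over bit positions lo..hi."""
    G, T = 0, 1
    for j in range(lo, hi + 1):
        g = (A >> j) & (B >> j) & 1
        t = ((A >> j) | (B >> j)) & 1
        G = g | (t & G)
        T = t & T
    return G, T


def c7_from_full_span(A, B, c_in):
    """c_7 from (G, T) over bits 0..7 and the primary carry input."""
    G, T = span_gt(A, B, 0, 7)
    return G | (T & c_in)


def c7_from_c3(A, B, c_in):
    """c_7 from (G, T) over bits 4..7 and the intermediate carry c_3."""
    G_low, T_low = span_gt(A, B, 0, 3)
    c3 = G_low | (T_low & c_in)
    G, T = span_gt(A, B, 4, 7)
    return G | (T & c3)
```

Both forms use one carry cell at the output; the saving comes from no longer needing the cell that merges the two 4-bit spans into the 8-bit span at bit position 7.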
4 Prefix tree CLA architecture comparison.
4.1 Architecture comparison.
Table 2 compares different prefix tree adder architectures with primary carry input. The
architectures shown in Figures 9 and 10 have the smallest logic depth and intermediate
amounts of fanout on the $c_{-1}$ input. The wiring complexity is manageable in 0.25um and
smaller CMOS technologies, which have several levels of interconnect.
Table 2: Comparison of different prefix tree adder architectures.
Figure     G, P, T fanout   c_{-1} fanout     logic depth       wiring
1a [6]     n/2              1                 3 + ⌈log n⌉       low
1b [6]     2                1                 3 + ⌈log n⌉       high
2 [6]      2                1                 3 + ⌈log n⌉       med
Fig. 7     2                n                 3 + ⌈log n⌉       high
Fig. 9     2                1 + ⌈log n⌉       2 + ⌈log n⌉       high
Fig. 10    2                3                 2 + ⌈log n⌉       high
4.2 VLSI Implementation.
32-bit versions of the following adders were implemented in the Lucent 0.25-um CMOS
process: the Brent-Kung adder, carry-skip adder, Ling adder, and the adders from Figures 6
and 7, Figure 9, and Figure 10. All designs use fully static circuits. Table 3 summarizes the
characteristics of the Lucent 0.25-um CMOS process.
Table 3: Lucent 0.25-um CMOS process parametrics.
            NMOS         PMOS
Tox         50 Å         50 Å
Lpoly       0.24 um      0.28 um
Vth         0.55 V       0.85 V
Ion         570 uA/um    230 uA/um
M1 pitch    0.84 um      0.84 um
M2 pitch    0.88 um      0.88 um
M3 pitch    0.88 um      0.88 um
The appendix [Figure A-7] shows the layout of a 32-bit prefix tree adder according to the
architecture shown in Fig. 7. Appendix [Figure A-8] shows the layout of a 32-bit prefix
tree adder according to the architecture shown in Fig. 9.
4.3 Test results.
The adders have been implemented on a 0.25-um CMOS test chip and connected as ring
oscillators that exercise their critical paths. The critical paths were identified from
Pathmill simulations (Pathmill is a static timing analysis tool from Synopsys Inc.). The
output waveform frequency of the ring oscillators is divided by 2^12 for observation
off-chip. Additional test structures on the chip are used to determine the delay of the
control circuitry in the ring oscillator path, as described in Section 4.4. The delay of
these control circuits is subtracted from the measured adder delay, and the resulting values
are reported in Table 4. The power numbers reported in Table 4 are based on Powermill
simulations (Powermill is a power analysis tool from Synopsys Inc.).
Table 4: Comparison with other 32-bit adders.
Reference   Delay, ns   Power, mW      Area, mm2   Vdd, V   Technology     Year
Fig. 7      1.1         32@400MHz      0.04        2.5      0.25um CMOS    1999
Fig. 9      1.0         32@400MHz      0.03        2.5      0.25um CMOS    1999
Fig. 10     1.0         30@400MHz      0.03        2.5      0.25um CMOS    1999
Ref. [14]   1.27        114@580MHz     0.3         0.9      0.6um GaAs     1997
Ref. [15]   2.7         -              0.71        5.0      1.2um EMODL    1997
Ref. [16]   3.1         -              0.28        5.0      0.9um MODL     1989
Ref. [17]   2.1         900@(?)MHz     27.84       4.5      3.5um ECL      1988
Table 4 shows the speed of the adders shown in Figures 7, 9 and 10 in comparison with
other published work. The speed was achieved with static CMOS circuits. Static circuits
are preferred to dynamic circuits because of their ease of design. In addition, static
circuits consume less power because they do not need clocks to precharge internal nodes. As
shown in Table 4, the areas of the adders in Figures 9 and 10 are the smallest reported so
far. Figure 11 shows the trade-off between speed and transistor count for 32-bit adders in
0.25-um CMOS.
[Figure 11 plot: adder delay (axis marked at 1 ns, 1.5 ns, 2 ns, and 6 ns) vs. transistor
count (axis marked up to 3000 transistors). Plotted designs: carry-ripple adder (estimated
delay from Pathmill), carry-skip adder, Brent-Kung adder, Ling adder, and variants of the
Fig. 9 adder.]
FIGURE 11. Speed vs. transistor count of different 32-bit adders in 0.25-um.
Since the adder architecture in Figure 7 was implemented first on a test shuttle, three
different gate styles were implemented for it. The first of these used and-or-invert (AOI)
and or-and-invert (OAI) gates to implement the fundamental carry operator at alternate
stages of the prefix trees. The second implementation used and-or (AO) gates at each stage,
and hence had better buffering to drive the interconnect wires. The third implementation
used OAI and AOI gates at the first two stages of the prefix tree and AO gates for the
remaining stages. This was done to measure the effect of buffering and drive only on the
stages with large interconnect.
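The alternation of AOI and OAI gates can be illustrated with a small truth-table check. The
sketch below assumes the standard generate/propagate form of the fundamental carry operator,
(g, p) o (g', p') = (g + p g', p p'), and shows that an AOI stage on true-polarity inputs
produces inverted outputs, which an OAI stage can then consume directly. The function names
are illustrative, and the thesis's actual gate-level wiring (including its transmit signal)
is not reproduced here.

```python
from itertools import product


def aoi(a, b, c):
    """And-or-invert: NOT(a*b + c)."""
    return 1 - ((a & b) | c)


def oai(a, b, c):
    """Or-and-invert: NOT((a + b) * c)."""
    return 1 - ((a | b) & c)


def carry_op(g_hi, p_hi, g_lo, p_lo):
    """Reference (non-inverting) fundamental carry operator."""
    return g_hi | (p_hi & g_lo), p_hi & p_lo


def aoi_stage(g_hi, p_hi, g_lo, p_lo):
    """True-polarity inputs, inverted outputs (AOI for G, NAND for P)."""
    return aoi(p_hi, g_lo, g_hi), 1 - (p_hi & p_lo)


def oai_stage(gn_hi, pn_hi, gn_lo, pn_lo):
    """Inverted inputs, true-polarity outputs (OAI for G, NOR for P)."""
    return oai(pn_hi, gn_lo, gn_hi), 1 - (pn_hi | pn_lo)


def check_all():
    """Exhaustively compare both inverting stages with the reference operator."""
    for g_hi, p_hi, g_lo, p_lo in product((0, 1), repeat=4):
        g, p = carry_op(g_hi, p_hi, g_lo, p_lo)
        if aoi_stage(g_hi, p_hi, g_lo, p_lo) != (1 - g, 1 - p):
            return False
        if oai_stage(1 - g_hi, 1 - p_hi, 1 - g_lo, 1 - p_lo) != (g, p):
            return False
    return True
```

Because each inverting stage flips the signal polarity, alternating the two gate types
avoids a separate inverter per stage, whereas an AO gate keeps the polarity and adds
buffering.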
Table 5 summarizes the measured delays of the adders based on the architecture in Fig. 7.
The fastest implementation is the one with AOI/OAI gates at alternate stages of the prefix
trees and with reduced interconnect coupling. Each adder has been implemented with two
different wiring schemes. An implementation with horizontal metal3 and vertical metal2 is
area optimal, and an implementation with vertical metal3 and horizontal metal2 is delay
optimal.
Table 5: Performance of the different implementations of the adder in Figure 7.
Type implemented                typical     worst case   Area,   Vdd,   Mt3 direction
                                delay, ns   delay, ns    mm2     V
AOI/OAI @ alternate stages      1.00        1.08         0.035   2.5    horizontal
AO @ every stage                1.16        1.26         0.035   2.5    horizontal
OAI/AOI @ first two stages,     1.06        1.14         0.035   2.5    horizontal
  AO @ other stages
AOI/OAI @ alternate stages      0.97        1.05         0.047   2.5    vertical
AO @ every stage                1.12        1.22         0.047   2.5    vertical
OAI/AOI @ first two stages,     1.02        1.11         0.047   2.5    vertical
  AO @ other stages
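The delay/area trade-off between the two wiring schemes in Table 5 can be quantified with a
short calculation. The sketch below copies the typical delay and area values from the table
and assumes the area column is in mm2; the dictionary layout and function name are
illustrative only.

```python
# (typical delay in ns, area in mm2) copied from Table 5 for each Mt3 direction.
TABLE5 = {
    "AOI/OAI @ alternate stages": {"horizontal": (1.00, 0.035), "vertical": (0.97, 0.047)},
    "AO @ every stage": {"horizontal": (1.16, 0.035), "vertical": (1.12, 0.047)},
    "OAI/AOI first two stages, AO others": {"horizontal": (1.06, 0.035), "vertical": (1.02, 0.047)},
}


def tradeoff(style: str) -> tuple:
    """Return (delay reduction %, area increase %) going from horizontal to vertical Mt3."""
    h_delay, h_area = TABLE5[style]["horizontal"]
    v_delay, v_area = TABLE5[style]["vertical"]
    delay_reduction = 100.0 * (h_delay - v_delay) / h_delay
    area_increase = 100.0 * (v_area - h_area) / h_area
    return delay_reduction, area_increase


if __name__ == "__main__":
    for style in TABLE5:
        d, a = tradeoff(style)
        print(f"{style}: delay -{d:.1f}%, area +{a:.1f}%")
```

With the tabulated values, switching Mt3 from horizontal to vertical reduces the typical
delay by roughly 3 to 4 percent while increasing the area by about 34 percent.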
The appendix [Figure A-10] shows two layout styles of the adders discussed above, one
with MT3 horizontal and MT2 vertical and the other with MT3 vertical and MT2 horizontal.
The appendix [Figure A-11] shows the photograph of the die on which the described adders
were implemented.
4.4 Test structure.
[Figure 12 diagram: output of the ring oscillator (GHz) -> divide by 4 -> muxing tree ->
divide by 1024 (10-bit division) -> data off-chip (MHz).]
FIGURE 12. Divide network for testing the speed of adders off-chip.
To test the speed of the ring oscillators on the test chip, a mux-tree network was built
with separate enables to measure the speed of one ring oscillator at a time. Since the
output frequency of the ring oscillators is in the GHz range, which is hard to measure on a
scope, a division by 1024 (10 bits) is performed to slow down the output signal. Also, to
allow more time for the muxes in the mux-tree to switch, a divide-by-4 network is used
before the mux-tree for each oscillator circuit. This is shown in Fig. 12.
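The effect of the divide network on the observed signal can be sketched as follows. The
divide ratios (4 before the mux-tree, 1024 after it) are taken from the text; the constant
and function names are illustrative, and the 1 GHz input is only an example value.

```python
PRE_MUX_DIVIDE = 4        # divide-by-4 stage before the mux-tree
POST_MUX_DIVIDE = 1024    # 10-bit division after the mux-tree
TOTAL_DIVIDE = PRE_MUX_DIVIDE * POST_MUX_DIVIDE  # 4096 = 2**12


def off_chip_frequency_hz(ring_frequency_hz: float) -> float:
    """Frequency seen off-chip for a given ring oscillator frequency."""
    return ring_frequency_hz / TOTAL_DIVIDE


def ring_frequency_hz(observed_frequency_hz: float) -> float:
    """Ring oscillator frequency recovered from the off-chip measurement."""
    return observed_frequency_hz * TOTAL_DIVIDE


if __name__ == "__main__":
    print(off_chip_frequency_hz(1.0e9))  # example: a 1 GHz oscillator
```

For example, a 1 GHz oscillator would be observed off-chip at about 244 kHz, which is easy
to measure on a scope.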
The same enable signals were used to isolate each oscillator circuit via an AND gate (or a
NAND gate, depending on the output-to-input relationship), where the output of the
oscillator is fed back to the input of the AND gate. To measure the delay of the AND (NAND)
gate, 5 different ring oscillators were built: a chain of 7 inverters, a chain of 11
inverters, a chain of 15 inverters, a chain of 19 inverters, and a chain of 23 inverters,
all using gates of the same size and each connected via an AND gate enabled by the same
enable signals as the adder's ring oscillators. To calculate the delay of the AND gate, the
delays of these 5 oscillators are measured as follows:
T(delay7) = T(AND) + T(7 inv.)
T(delay11) = T(AND) + T(11 inv.)
T(delay15) = T(AND) + T(15 inv.)
T(delay19) = T(AND) + T(19 inv.)
T(delay23) = T(AND) + T(23 inv.)
Plotting the measured delay against the number of inverters and extrapolating to zero
inverters gives the delay of the AND gate; this is the zero point on the time plot. The
results of this experiment are shown in Fig. 13. According to Figure 13, the average
measured delay for an AND gate is 0.38 nsec. The circuit diagram for the test structure is
shown in the Appendix [Figure A-9].
FIGURE 13. AND gate delay measurement results.
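The extrapolation to zero inverters amounts to fitting a straight line through the five
measurements and reading off its intercept. The sketch below shows one way to do this with
an ordinary least-squares fit. The per-inverter delay of 0.05 ns and the resulting sample
data are hypothetical values chosen for illustration, not measurements from the thesis; only
the chain lengths and the 0.38 ns AND gate delay come from the text.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


INVERTER_COUNTS = [7, 11, 15, 19, 23]   # chain lengths from the text
HYPOTHETICAL_T_INV_NS = 0.05            # assumed per-inverter delay, illustration only
T_AND_NS = 0.38                         # AND gate delay reported in the text

# Synthetic "measurements": T(delayN) = T(AND) + N * T(inv)
delays_ns = [T_AND_NS + n * HYPOTHETICAL_T_INV_NS for n in INVERTER_COUNTS]

t_inv_est, t_and_est = fit_line(INVERTER_COUNTS, delays_ns)

if __name__ == "__main__":
    print(f"estimated T(inv) = {t_inv_est:.3f} ns, estimated T(AND) = {t_and_est:.3f} ns")
```

The slope estimates the delay of one inverter and the intercept estimates T(AND). Using five
chain lengths rather than two averages out measurement noise in the intercept.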
5 Conclusions and Future Research.
The prefix tree adders were implemented based on architectures that use a separate prefix
tree for each bit position. An architecture was described that incorporates the contribution
of the primary carry input into the prefix trees without any additional overall delay.
Measured results from the test chip, fabricated in the Lucent 0.25-um CMOS technology
using fully static circuits, verify adder operation with a delay of 1.0 ns, corresponding
to 1 GHz.
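The idea of absorbing the carry input into the prefix computation can be illustrated at the
behavioral level. The sketch below is a generic Kogge-Stone-style parallel prefix recurrence
operating on whole words, in which the carry input is folded into the generate signal of bit
0 before the prefix levels are applied. It is not the superimposed-tree architecture of this
work and says nothing about fanout or wiring; the function name and word-level formulation
are assumptions made for illustration.

```python
def prefix_add(a: int, b: int, cin: int = 0, n: int = 32):
    """Add two n-bit words with a log2(n)-level prefix recurrence.

    Returns (sum mod 2**n, carry out).
    """
    mask = (1 << n) - 1
    p_bit = (a ^ b) & mask          # per-bit propagate
    g = (a & b) & mask              # per-bit generate
    g |= p_bit & cin                # fold carry-in into bit 0: g0 = a0*b0 + p0*cin
    p = p_bit
    shift = 1
    while shift < n:                # 5 levels for n = 32
        g |= p & (g << shift)       # G_i = G_i + P_i * G_(i - shift)
        p &= p << shift             # P_i = P_i * P_(i - shift)
        g &= mask
        p &= mask
        shift <<= 1
    carries = ((g << 1) | cin) & mask   # carry into each bit position
    total = p_bit ^ carries
    cout = (g >> (n - 1)) & 1
    return total, cout


if __name__ == "__main__":
    print(prefix_add(0xFFFFFFFF, 0, 1))
```

Because the carry input only modifies the bit-0 generate signal, the number of prefix levels
in this sketch stays at log2(n) with or without a carry input.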
The new architecture presented in this work for 32-bit adders can be extended to 64-bit
adders or other word sizes. Also, the proposed adder could be compared with Tyagi adders
[19] and carry-select adders [20]. Furthermore, the architecture of the superimposed prefix
tree adder can be incorporated into other applications, such as comparators, arithmetic and
logic units (ALUs), data arithmetic units (DAUs), and multiply-add units. Examples of
such applications for a prefix tree adder with carry-in incorporated into the tree are shown
in Figure 14.
[Figure 14 diagrams: (A) DAU example, (B) ALU/ACS example, (C) multiply-add unit example,
(D) comparator example. Block labels include "Bit Manipulation Unit" and "Partial product
generation".]
FIGURE 14. Examples of possible applications for a prefix tree adder with carry-in
incorporated into the tree. Filled rectangles indicate the places where the adder/subtractor
based on the new architecture would be used.
References
[1] A. Weinberger and J. L. Smith, "A one-microsecond adder using one-megacycle
circuitry," IEEE Transactions on Electronic Computers, pp. 65-73, June 1956.
[2] T.-F. Ngai, M. J. Irwin, and S. Rawat, "Regular, area-time efficient carry-lookahead
adders," Journal of Parallel and Distributed Computing, vol. 3, pp. 92-105, 1986.
[3] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a
general class of recurrence equations," IEEE Transactions on Computers, vol. C-22,
pp. 786-793, Aug. 1973.
[4] R. E. Ladner and M. J. Fischer, "Parallel prefix computation," Journal of the ACM,
vol. 27, pp. 831-838, Oct. 1980.
[5] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Transactions
on Computers, vol. C-31, pp. 260-264, Mar. 1982.
[6] D. Dozza, M. Gaddoni, and G. Baccarani, "A 3.5ns, 64 bit, carry-lookahead adder," in
Proceedings of the International Symposium on Circuits and Systems, pp. 297-300,
1996.
[7] Synopsys, Module Compiler User's Guide, Feb. 1998.
[8] T. K. Callaway and E. E. Swartzlander, "Low power arithmetic components," in Low
Power Design Methodologies, Jan M. Rabaey and Massoud Pedram, eds., pp. 161-200,
Kluwer Academic Publishers, 1996.
[9] W. Liu, C. T. Gray, D. Fan, W. J. Farlow, T. A. Hughes, and R. K. Cavin, "A 250-MHz
wave pipelined adder in 2-um CMOS," IEEE Journal of Solid State Circuits, vol. 29,
pp. 1117-1128, Sept. 1994.
[10] T. Lynch and E. E. Swartzlander, "A spanning tree carry lookahead adder," IEEE
Transactions on Computers, pp. 931-939, Aug. 1992.
[11] P. Singer, "Tantalum, copper and damascene: The future of interconnects,"
Semiconductor International, pp. 90-98, June 1998.
[12] J. Silberman, N. Aoki, D. Boerstler, J. Burns, S. Dhong, A. Essbaum, U. Ghoshal, D.
Heidel, P. Hofstee, K. Lee, D. Meltzer, H. Ngo, K. Nowka, S. Posluszny, O. Takahashi,
I. Yo, and B. Zozic, "A 1.0 GHz single-issue 64b PowerPC integer processor," in IEEE
International Solid-State Circuits Conference, pp. 230-231, Feb. 1998.
[13] I. C. Kizilyalli, R. Huang, D. Hwang, H. Vaidya, B. Kane, R. Ashton, S. Kuehne, X.
Deng, M. Twiford, D. Shuttleworth, E. Martin, X. Li, and M. J. Thoma, "A merged
2.5V and 3.3V 0.25-um CMOS technology for ASICs," in Proceedings of IEEE
CICC, pp. 159-162, May 1998.
[14] A. Beaumont-Smith and N. Burgess, "A GaAs 32-bit adder," in IEEE 13th Symposium
on Computer Arithmetic, pp. 10-17, July 1997.
[15] Z. Wang, G. A. Jullien, W. C. Miller, J. Wang, and S. S. Bizzan, "Fast adders using
enhanced multiple-output domino logic," IEEE Journal of Solid State Circuits, vol. 32,
pp. 206-214, Feb. 1997.
[16] I. S. Hwang and A. L. Fisher, "Ultrafast compact 32-bit CMOS adder in
multiple-output domino logic," IEEE Journal of Solid State Circuits, vol. 24,
pp. 358-369, Apr. 1989.
[17] G. Bewick, P. Song, G. D. Micheli, and M. J. Flynn, "Approaching a nanosecond: A
32-bit adder," in IEEE International Conference on Computer Design, pp. 221-226,
Oct. 1988.
[18] H. Ling, "High-speed binary adder," IBM Journal of Research and Development, vol. 25,
no. 3, pp. 156-166, May 1981.
[19] A. Tyagi, "A reduced-area scheme for carry-select adders," IEEE Transactions on
Computers, vol. 42, pp. 1163-1170, Oct. 1993.
[20] O. J. Bedrij, "Carry-select adder," IRE Transactions on Electronic Computers, vol.
EC-11, pp. 340-346, 1962.
Appendix.
Figure A-1. 32-bit circuit implementation of the carry-skip adder.
Figure A-2. 32-bit circuit implementation of the Brent-Kung adder.
Figure A-3. 32-bit circuit implementation of the Ling adder.
Figure A-4. 32-bit circuit implementation of the superimposed tree CLA, according
to Fig. 7.
Figure A-5. 32-bit circuit implementation of the superimposed tree CLA adder
with carry in incorporated into the tree, according to Fig. 9.
Figure A-6. 32-bit circuit implementation of the superimposed tree CLA with
carry in incorporated into the tree (low power solution), according to Fig. 10.
Figure A-7. Layout of the 32-bit prefix tree adder according to Fig. 7.
Figure A-8. Layout of the 32-bit prefix tree adder according to Fig. 9.
Figure A-9. Circuit diagram for the test structure on the shuttle.
Figure A-10. Two different layout styles according to Table 5, discussed
in Section 4.3.
Figure A-11. Photograph of the die.
Vita
Alexander Goldovsky was born in Minsk, Belorussia, on December 5, 1973. He received the
B.S. degree in electrical engineering technology from Temple University, Philadelphia, in
1995.
Since 1995 he has been employed by Lucent Technologies. His current research interests
include the definition and design of advanced computer systems, with particular emphasis on
the introduction of parallelism into computer design and problem solutions.
