Design of an IEEE compliant 32-bit floating point multiplier/accumulator by Niescier, Richard J.
Lehigh University
Lehigh Preserve
Theses and Dissertations
1994
Recommended Citation
Niescier, Richard J., "Design of an IEEE compliant 32-bit floating point multiplier/accumulator" (1994). Theses and Dissertations.
Paper 253.
AUTHOR: Niescier, Richard J.
TITLE: Design of an IEEE Compliant 32-Bit Floating Point Multiplier/Accumulator
DATE: May 29, 1994
Design of an IEEE Compliant 32-bit Floating Point Multiplier/Accumulator
by
Richard J. Niescier
A Thesis
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science
in
Electrical and Computer Engineering Department
Lehigh University
May 16, 1994

Acknowledgment
This thesis is dedicated to my mother and father for
their continual support for my educational pursuits.
Table of Contents
Acknowledgment iii
Table of Contents iv
List of Figures v
List of Tables vii
Abstract 1
Chapter 1 Introduction 2
Chapter 2 IEEE Floating Point Standard 754 4
Chapter 3 System Level Design 10
3.1 Mantissa Logic 11
3.1.1 Partial Product Addition 11
3.1.2 Accumulator add, final summation, normalization and rounding 15
3.1.3 Accumulator Alignment 21
3.2 Exponent Processing 27
3.3 Exceptions Processing 32
3.4 Rounding Logic 35
3.5 Sign Logic 38
Chapter 4 Circuit Design 42
4.1 Full Adder Design 43
4.2 Full Adder Simulation 44
4.3 Booth Recoder and Partial Product Tree 54
4.4 Partial Product Adder Tree Simulation 63
4.5 Final 74 Bit Fast Adder Design 66
4.5.1 Ripple Adder 66
4.5.2 Manchester Adder 67
4.5.3 Carry-Skip Adder 68
4.5.4 Carry Lookahead Adder 69
4.5.5 Carry Select Adder 70
4.5.6 Fast Adder for this Design 71
4.6 Final 74 Bit Fast Adder Simulation 73
4.7 Leading Zero/One Detector Design 77
4.8 Leading Zero/One Detector Simulation 79
4.9 Left Shifter/Right Shifter Design 84
4.10 Left/Right Shifter Simulations 85
Chapter 5 Conclusions 87
Chapter 6 References and Bibliography 90
Chapter 7 Appendix 93
Chapter 8 Brief Biography 94
List of Figures
FIGURE 1. Functional Diagram of the Mantissa/Accumulator Processing unit 23
FIGURE 2. Mantissa input latches, Booth recoder and partial product adders 24
FIGURE 3. Fast Adder, Normalizer and Rounder 25
FIGURE 4. Accumulator alignment, and 2's complementer 26
FIGURE 5. Functional Diagram of the Exponent processing unit 30
FIGURE 6. Top-Level circuit schematic of the Exponent Processing Unit 31
FIGURE 7. Top-Level circuit schematic of the Exceptions Processing Unit 34
FIGURE 8. Top-Level Circuit Schematic for the Rounding Determination Logic 37
FIGURE 9. Functional Diagram of the sign/accumulator negation logic 40
FIGURE 10. Top-Level Circuit Schematic of the Sign and Accumulation negation Logic 41
FIGURE 11. Full Adder, Version 1 46
FIGURE 12. Full Adder, Version 2 47
FIGURE 13. Full Adder, Version 3 48
FIGURE 14. 4-2 Adder implementation using non-restoring n-channel XOR gates 49
FIGURE 15. Simulations of the full adders, under 0.9um nominal processing, 5.0 volts 50
FIGURE 16. Simulation of Full Adders under 0.9um worst case conditions, 4.5 volts 51
FIGURE 17. Simulations of the Full Adders under 0.9um best case conditions, 5.5 volts 52
FIGURE 18. Simulations of the Full Adders under 0.6um HD, worst case conditions, 3.0 volts 53
FIGURE 19. "L-tree" type partial product tree reduction using carry-save adder cells 54
FIGURE 20. "V-Tree" partial product tree reduction using carry-save adders with a 12 to 2 compressor technique 55
FIGURE 21. 4-2 Adders shown with the carry propagation 56
FIGURE 22. Use of 4-2 and 3-2 Adders to reduce 12 partial products to 2 57
FIGURE 23. Wallace Tree implementation for adding 7 bits 58
FIGURE 24. Textual representation of how to reduce 13 partial products to 2 59
FIGURE 25. Block diagram of the reduction of 13 partial products to 2 60
FIGURE 26. Textual representation of how to reduce 24 partial products to 2 61
FIGURE 27. Block Diagram of how to reduce 24 partial products to 2 62
FIGURE 28. Circuit Schematic for the 13-2 adder/compressor using 3-2 adders 64
FIGURE 29. Worst case path simulations of the 13-2 adder 65
FIGURE 30. Ripple adder configuration 66
FIGURE 31. Manchester Adder Configuration 67
FIGURE 32. Simple 1-stage representation of a carry-skip adder 68
FIGURE 33. Block diagram of a binary reduction carry-lookahead adder tree 70
FIGURE 34. Representation of one stage of a carry-select adder 71
FIGURE 35. Circuit Schematic for one 5-bit adder cell in the fast adder 74
FIGURE 36. Top-Level circuit schematic for the 74-bit fast adder 75
FIGURE 37. Simulations of the worst case path in the fast adder 76
FIGURE 38. Top level circuit schematic of the leading 0/1 detector 80
FIGURE 39. Circuit schematic for one ten-bit section of the leading 0/1 detector 81
FIGURE 40. Circuit schematic of the five-bit leading 0/1 predictor logic 82
FIGURE 41. Simulations of the worst case path for the leading 0/1 detector and the shifter 83
FIGURE 42. Simulations of the worst case path for the 174 bit shifter, including ROM controller 86
List of Tables
TABLE 1. Second Order Booth Recoder 12
TABLE 2. Entire Partial Product Summing Tree for a 25 to 13 Booth Recoded Multiply 14
TABLE 3. Partial Product Addition with no shifting 15
TABLE 4. Partial Product Addition with shifting 15
TABLE 5. Partial Product Addition of all zeros 17
TABLE 6. Accumulator addition of positive or negative number 38
TABLE 7. Final sign Determination 39
TABLE 8. Propagation delay of any input to the sum and carry output of the full adder 45
TABLE 9. Propagation delay of the carry input to the carry output of the full adder 45
TABLE 10. Propagation delay of any input to the output of the 13 to 2 compressor 63
TABLE 11. Worst case propagation delay through the 74 bit adder 73
TABLE 12. Worst case propagation delay through the Leading 0/1 Detector 79
TABLE 13. Worst case propagation delay through the 74 bit shifter 85
Abstract
This thesis describes the design of a fully functional 32-Bit Floating Point Multiply/
Accumulator that accepts two 32-bit floating point single basic IEEE format
operands and generates a 32-bit IEEE format result. The two operands are
multiplied together and then summed with an internal accumulation register. The
main objective of this design is to use only one final adder and one rounder block
to add the partial product terms and the accumulator term and still comply with the
IEEE Floating Point Standard 754. The fused multiplication with addition allows
one-cycle throughput with only one rounding error and executes the multiply/
accumulate operation as one indivisible operation. Other features of this design
are the four IEEE rounding modes (round-to-nearest, round-to-positive-infinity,
round-to-negative-infinity, and round-to-zero) and four accumulation modes, (A+B),
(A-B), (-A+B), and -(A+B), where A is the product term and B is the accumulator term.
The thesis covers the design from the system-level description down to transistor-level circuit
schematics and simulations of the multiplier/accumulator. A description of the
IEEE standard is presented for reference. Next, the architectural/functional level
description of the design is examined in detail. A "C-level" programming model
was developed to implement the function and study design trade-offs. Finally, the
circuit implementation is described, with transistor-level, "Spice-like" simulations
shown for the critical circuit blocks. The greatest emphasis is on
the array multiplication and incorporation of the accumulator into the partial
product adder tree.
1.0 Introduction
Recently, the performance of digital signal processors (DSP) and microproces-
sors has been improving rapidly. This has come about because of finer-lined VLSI
processes and the use of new arithmetic computing architectures. The need for
speed has been most evident in the conversion from iterative floating-point multi-
plier/accumulators to one or two cycle array multipliers and ALUs. The majority of
digital signal applications require high speed floating-point processing because
the critical processing path in the algorithms usually involves many multiplications
and/or accumulations. Other applications that require high throughput multiply/
accumulates are circuit simulation, image processing, speech processing, and
modems.
This thesis attempts to build on this explosion of fast multiplier/accumulator algo-
rithms and circuits to design a 32-bit IEEE compliant floating-point multiply/accu-
mulator with emphasis on the array multiplication and incorporation of the
accumulator into the partial product adder tree. The standard method used to
implement a floating point multiplier/accumulator is to use two final adders and
two rounders. The first adder/rounder adds the multiplier's partial products. The
second adds and rounds the rounded multiplier term and the accumulator term.
The main objective of this study is to use only one final adder and one rounder
block to add the partial product terms and the accumulator term and still comply
with the IEEE Floating Point Standard 754. A version of this type of multiply/accumulator,
called Multiply-Add Fused, is being used in the current generation of
IBM's RISC processors [5]. The advantage of this configuration is that it executes
the multiply/accumulate operation as one indivisible operation, with no intermediate
rounding. The fused multiplication with addition allows one-cycle throughput with
only one rounding error. Also, this single functional unit, requiring an instruction
set which others only emulate, reduces the hardware overhead associated with
adders/normalizers by combining various operations necessary for fast multiplica-
tion with accumulations.
Other features of the design presented in this thesis are the inclusion of the four
IEEE rounding modes (round-to-nearest, round-to-positive-infinity, round-to-negative-infinity,
and round-to-zero) and four accumulation modes, (A+B), (A-B), (-A+B), and
-(A+B), where A is the product term and B is the accumulator term.
This thesis will design a fully functional floating-point multiply/accumulator from
system level description to transistor-level circuit schematics and simulations. A
description of the IEEE standard is presented for reference. Next, the architectural/functional
level description of the design is examined in detail. A "C-level"
programming model was developed to implement the function and study design
trade-offs. Finally, the circuit implementation is described, with transistor-level,
"Spice-like" simulations shown for the critical circuit blocks.
2.0 IEEE Floating Point Standard 754
The single basic 32-bit floating point format, as required by the IEEE Floating
Point Standard 754 [3], consists of a 23-bit mantissa, an 8-bit exponent, and a
one-bit sign. The complete mantissa is 24 bits because the leading bit is considered a
one for all normalized values.
    1-bit    8-bits       23-bits              ...widths
    sign     exponent     fraction/mantissa
    s        msb e lsb    msb f lsb            ...order
Format of X and Y input operands and the output operand.
The maximum positive normalized number which can be represented in single
basic IEEE format has a mantissa of all ones and an exponent of all ones except
for the exponent's least significant bit:
    1-bit     8-bits        23-bits                     ...widths
    sign(s)   exponent(e)   fraction(f)
    0         11111110      11111111111111111111111
              msb   lsb     msb                 lsb     ...order
The minimum positive normalized number which can be represented in single
basic IEEE format has a fraction of all zeros and an exponent of all zeros except
for the exponent's least significant bit:
    1-bit     8-bits        23-bits                     ...widths
    sign(s)   exponent(e)   fraction(f)
    0         00000001      00000000000000000000000
              msb   lsb     msb                 lsb     ...order
\[ \text{value} = (-1)^{s} \cdot 2^{e-127} \cdot (1.f)_{2} = 1.0 \cdot 2^{-126} \]
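The field layout and the two extreme normalized patterns described above can be checked with a short sketch. The names `fp32_fields`, `fp32_unpack`, `fp32_full_mantissa` and `fp32_from_bits` are hypothetical helpers introduced here for illustration (they are not taken from the thesis's own "C-level" model), and `fp32_from_bits` assumes the host `float` is IEEE 754 single precision.

```c
#include <stdint.h>
#include <string.h>

/* Fields of a single basic (32-bit) IEEE 754 operand. */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t fraction;  /* 23 stored bits; the leading one is implicit */
} fp32_fields;

/* Split a 32-bit word into sign, exponent and fraction. */
fp32_fields fp32_unpack(uint32_t word)
{
    fp32_fields f;
    f.sign     = word >> 31;
    f.exponent = (word >> 23) & 0xFFu;
    f.fraction = word & 0x7FFFFFu;
    return f;
}

/* Complete 24-bit mantissa of a normalized operand (implicit one restored). */
uint32_t fp32_full_mantissa(fp32_fields f)
{
    return (1u << 23) | f.fraction;
}

/* Reinterpret a bit pattern as the host float (assumption: host uses IEEE 754). */
float fp32_from_bits(uint32_t word)
{
    float x;
    memcpy(&x, &word, sizeof x);
    return x;
}
```

With these helpers, the pattern `0x7F7FFFFF` unpacks to exponent `11111110` with an all-ones fraction, and `0x00800000` decodes to 1.0 x 2^(-126), matching the two formats shown above.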
There are four exceptions that need to be detected to process IEEE standard
floating point numbers: Not-A-Number (NaN), infinity, zero, and denormals. Zero
is represented by all zeros in the mantissa and the exponent. Zero is defined as
any number that is less than the smallest possible number representable in the final
destination's precision.
    0/1 | 00000000 | 00000000000000000000000
Single Basic IEEE Floating Point Format for Zero
Infinity is represented by all ones in the exponent and all
zeros in the mantissa. Infinity is defined as any number that is larger than the
largest possible number representable in the final destination's precision.
    0/1 | 11111111 | 00000000000000000000000
Single Basic IEEE Floating Point Format for Infinity
Positive and negative infinity, and positive and negative zero, are supported by the sign bit.
NaN (Not-A-Number) is represented by all ones in the exponent and any ones in
the mantissa. NaN is defined by the IEEE standard as an exception format. When
this representation is encountered, the system can use it to signal an exception
handler, which could be software or hardware.
    0/1 | 11111111 | 11111111111111111111111
Single Basic IEEE Floating Point Format for "Quiet" NaN
    0/1 | 11111111 | "any group of ones and zeros"
Single Basic IEEE Floating Point Format for "Signaling" NaN
There are two different types of NaN representations, quiet and signaling NaNs.
A signaling NaN may have any nonzero bit pattern in the mantissa section of the
number. Signaling NaNs afford values for uninitialized variables and arithmetic-like
enhancements, such as complex-affine infinities or extremely wide ranges, that are
not the subject of the standard. They can be used by the exception handler to
"signal" it as to how to service the interrupt or as an address into the exception
handler. The "quiet" NaN should, by means left to the implementer's discretion,
afford retrospective diagnostic information inherited from invalid or unavailable
data and results.
The final pattern in the IEEE standard is reserved for numbers called denormals.
To extend the precision of the IEEE standard at the minimum-number boundary,
numbers less than the lowest possible normalized number can be represented
with a loss of precision in the mantissa area. These numbers have an exponent
of all zeros and a mantissa of any pattern except for all zeros, which would represent
zero. These numbers normalize the mantissa until the exponent is zero. This
means that the implicit leading one is not possible with this format; in fact, this
format has an implicit leading zero. Also, precision is lost because not all of the
possible mantissa bits are used to represent the numbers. Mantissa bits are "chopped" off
to bring the exponent to zero.
    1-bit     8-bits        23-bits                               ...widths
    sign(s)   exponent(e)   fraction(f)
    s         00000000      "any bit pattern except all zeros"
              msb   lsb     msb                           lsb     ...order
\[ \text{value} = (-1)^{s} \cdot 2^{-126} \cdot (0.f)_{2} \]
With this representation included in the standard, the smallest representable nonzero magnitude is:
\[ \text{value} = (-1)^{s} \cdot 2^{-149} \]
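The encodings for zero, denormals, infinity and NaN described in this chapter reduce to two comparisons on the exponent and fraction fields. The following is a minimal sketch of that detection; the enum and function names are hypothetical, and treating the Gradual-Underflow flag as the OR of the two multiplier inputs is an assumption made for illustration only.

```c
#include <stdint.h>

typedef enum {
    CLS_ZERO,      /* exponent all zeros, fraction all zeros */
    CLS_DENORMAL,  /* exponent all zeros, fraction non-zero  */
    CLS_NORMAL,    /* any other exponent                     */
    CLS_INFINITY,  /* exponent all ones, fraction all zeros  */
    CLS_NAN        /* exponent all ones, fraction non-zero   */
} fp32_class;

/* Classify a single basic IEEE 754 bit pattern. The sign bit is ignored. */
fp32_class fp32_classify(uint32_t word)
{
    uint32_t e = (word >> 23) & 0xFFu;
    uint32_t f = word & 0x7FFFFFu;

    if (e == 0u)
        return (f == 0u) ? CLS_ZERO : CLS_DENORMAL;
    if (e == 0xFFu)
        return (f == 0u) ? CLS_INFINITY : CLS_NAN;
    return CLS_NORMAL;
}

/* Hypothetical Gradual-Underflow flag: set when either input is a denormal. */
int guf_flag(uint32_t x, uint32_t y)
{
    return fp32_classify(x) == CLS_DENORMAL || fp32_classify(y) == CLS_DENORMAL;
}
```

In hardware the same test is an all-zeros and an all-ones detector on the 8-bit exponent plus a zero detector on the 23-bit fraction, so it can run in parallel with the mantissa multiplication.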
The addition of denormalized numbers has been termed gradual underflow. It
does not eliminate underflow, but reduces the gap between the smallest repre-
sentable value and zero. To implement this part of the standard in hardware would
require a large amount of extra circuitry and more importantly, more delay in the
critical path. This amount of extra hardware and delay is not acceptable for so few
numbers. [4] The IEEE standard gives the implementer the flexibility to support
the standard in hardware, software, or a mixture of both. The design in this thesis
will not support the full denormal operations in hardware, but will detect denormal
inputs with the Gradual-Underflow flag (GUF). This flag can then be used by an
exception handler, and these operations can be handled in software. This implementation
still keeps the spirit of the IEEE standard.
The design for this thesis also supports output representations of infinity, zero and
NaN. If a NaN is detected on one of the inputs, the output will become a "quiet"
NaN with all ones in the mantissa. Multiplication of zero and infinity will also result
in a "quiet" NaN. If one of the inputs is -infinity and the others are non-zero, non-
NaN numbers, then the accumulated result will be infinity.
Since digital arithmetic must be done on machines with finite bit lengths, the
rounding of the infinitely precise number is necessary for the result to fit into the
destination's word size. There are four types of rounding methods mentioned in
the IEEE standard, but only the round-to-nearest is required by the standard. The
other three are truncation or round-to-zero, round-to-positive infinity, and round-
to-negative infinity.
The round-to-nearest mode is the most desirable rounding mode because it offers
the smallest rounding bias error over a large set of operations and returns
the representable value nearest to the infinitely precise result. It is obtained by
adding 2^(exponent-24) to the mantissa and then truncating. This form of rounding
still has a small positive rounding bias. To eliminate this bias, the IEEE standard
specifies that if the determination to round is a tie, then the decision to increment
the mantissa is determined by the least significant bit of the mantissa. If the two
nearest representable values are equally near, the representation whose least
significant bit is zero shall be the rounded value. This is called round-to-nearest-even.
To detect this tie condition, all of the bits to be discarded except for the most
significant bit of this group need to be examined for any ones. When the most
significant discarded bit is a one, the presence of any ones in the rest of the group
will signal an increment for the mantissa. The presence of all zeros in this group
and a one in the most significant bit of the discarded bits will signal the
even (tie) condition. A guard bit, called the sticky bit, is used to represent the presence
of any ones in the discarded bits, because it would be impractical to carry around
all of the discarded bits.
The other modes of rounding are much simpler than round-to-nearest and are
easily implemented. When rounding towards positive infinity the result shall be the
format's value closest to and no less than the infinitely precise result. When
rounding towards negative infinity, the result shall be the format's value closest to
and no greater than the infinitely precise result. Round-to-positive and negative
infinities are useful in algorithm analysis to examine the effects of the precision
boundaries on the stability of the software code. When rounding towards zero the
result shall be the format's value closest to and no greater in magnitude than the
infinitely precise result. This is also called truncation and is a widely used round-
ing technique because of its simplicity and processing speed. This can be used by
developers whose target system uses this mode, but the development system is
IEEE compliant.
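The four rounding modes can be summarized as a single "increment or not" decision taken from the least significant kept bit, the most significant discarded bit, the sticky bit and the sign. The sketch below illustrates that decision on a sign-magnitude value; the names `round_mode`, `round_increment` and `round_magnitude` are hypothetical, and the 64-bit word is only a stand-in for the wider datapath of the design.

```c
#include <stdint.h>

typedef enum { RND_NEAREST, RND_POS_INF, RND_NEG_INF, RND_ZERO } round_mode;

/* Return 1 when the kept magnitude must be incremented.
 *   sign      : 1 for a negative result, 0 for a positive result
 *   lsb       : least significant bit that is kept
 *   round_bit : most significant bit that is discarded
 *   sticky    : OR of all remaining discarded bits */
int round_increment(round_mode mode, int sign, int lsb, int round_bit, int sticky)
{
    int inexact = round_bit | sticky;

    switch (mode) {
    case RND_NEAREST: return round_bit & (sticky | lsb); /* ties go to even */
    case RND_POS_INF: return !sign && inexact;
    case RND_NEG_INF: return sign && inexact;
    case RND_ZERO:
    default:          return 0;                          /* truncation */
    }
}

/* Discard the low `drop` bits (1..63) of a magnitude and round the rest. */
uint64_t round_magnitude(uint64_t mag, unsigned drop, int sign, round_mode mode)
{
    uint64_t kept;
    int round_bit, sticky;

    if (drop == 0u)
        return mag;                                      /* nothing to discard */
    kept      = mag >> drop;
    round_bit = (int)((mag >> (drop - 1u)) & 1u);
    sticky    = (mag & ((1ull << (drop - 1u)) - 1u)) != 0u;
    return kept + (uint64_t)round_increment(mode, sign, (int)(kept & 1u),
                                            round_bit, sticky);
}
```

For example, discarding two bits of binary 1010 is a tie and keeps binary 10 in round-to-nearest, while binary 1110 is also a tie but rounds up to binary 100 so that the result is even. A mantissa overflow caused by the increment is not handled here and would require renormalization.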
3.0 System Level Design
This 32-Bit Floating Point Multiply/Accumulator accepts two 32-bit floating point
single basic IEEE format operands and generates a 32-bit IEEE format result. The
two operands are multiplied together and then summed with an internal accumu-
lation register. The main objective of this system level design is to use only one
final adder and one rounder block to add the partial product terms and the accu-
mulator term and still comply with the IEEE Floating Point Standard 754. Other
features of this design are the four IEEE rounding modes (round-to-nearest,
round-to-positive-infinity, round-to-negative-infinity, and round-to-zero) and four
accumulation modes, (A+B), (A-B), (-A+B), and -(A+B), where A is the product term and B
is the accumulator term.
The IEEE Floating Point standard considers a multiply/accumulate to be two operations.
This means that there are only two ways to perform a multiply/accumulate and still
be compliant with the IEEE Floating Point standard: (1) multiply, round,
sum, and then round again, or (2) multiply, add the infinitely precise product to the
rounded accumulator, and then round the result. Only the second option will allow
the accumulator to be added in the partial product tree and still comply with the
IEEE Floating Point standard.
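The difference between the two orderings can be illustrated with a small sketch. This is not the thesis's bit-level model: it assumes the host `float` and `double` are IEEE 754 types with round-to-nearest as the active mode, and it uses `double` as a stand-in for the wide internal datapath. The product of two 24-bit mantissas always fits exactly in a `double`, but the subsequent sum is exact only when it fits in 53 bits, so the second function is an approximation of the single-rounding scheme in general. Both function names are hypothetical.

```c
/* Option (1): multiply, round, sum, round again. */
float mac_two_roundings(float x, float y, float acc)
{
    /* volatile forces each intermediate to be rounded to single precision */
    volatile float product = x * y;
    volatile float sum = product + acc;
    return sum;
}

/* Option (2): keep the exact product, add the accumulator, round once. */
float mac_one_rounding(float x, float y, float acc)
{
    double exact_product = (double)x * (double)y;   /* 48 bits fit in 53 */
    double wide_sum = exact_product + (double)acc;  /* exact if it fits in 53 bits */
    return (float)wide_sum;                         /* the only rounding to single */
}
```

With x = y = 1 + 2^(-12) and an accumulator of -(1 + 2^(-11)), the exact product is 1 + 2^(-11) + 2^(-24). Option (1) rounds the product to 1 + 2^(-11) and returns zero, while option (2) returns 2^(-24).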
To verify the algorithms and functionality of this implementation before design of
the circuits, a "C-level" programming model was developed. This "C" model of the
multiply/accumulator was coded at the bit level to help debug different implementations
and point out deficiencies in the algorithms before the design of the circuits
began. This model will also be used to generate vectors for the circuit model. In
this way the circuit model can be brought to match the "C-level" model, in which
a mistake is easier to find and correct.
The purpose of this system level implementation is to design a high speed floating
point multiply/accumulate algorithm that utilizes an array multiplier and processes,
as much as possible, the mantissa, exponent, and sum/accumulator in parallel to
reduce the overall operation latency. The need for parallelism and speed will outweigh
transistor count, although the transistor count should not become excessive.
3.1 Mantissa Logic
The mantissa section can be broken into three sections: 1) the partial product tree
of the product term {X*Y}, 2) the sum of the accumulator term {S} with the product
term, followed by the final sum, normalization and rounding, and 3) the alignment
and preparation of the accumulation term for addition to the product term. Figure
1, on page 23, is a functional level description of the entire mantissa multiply/accumulate
section.
3.1.1 Partial Product Addition
A three-bit, second-order modified Booth's recoding method [1] is used to reduce
the 24 partial products to 13. Since Booth's recoding method requires 2's complement
numbers, the 24-bit magnitude-only number needs to be extended to 25
bits to include a sign. Twenty-five bits reduce to thirteen partial products when
the second-order modified Booth's recoding scheme is used. Since the mantissa
bits can be considered positive numbers, 24 partial products would be
required if Booth's recoding method were not implemented. Table 1 describes
this modified Booth recoding scheme, which requires the use of 2's complement
notation and shifting capability in the partial product adder tree.
    a(n+1)  a(n)  a(n-1)  |  +X  -X  +2X  -2X  Zero
      0      0      0     |   0   0   0    0    1
      0      0      1     |   1   0   0    0    0
      0      1      0     |   1   0   0    0    0
      0      1      1     |   0   0   1    0    0
      1      0      0     |   0   0   0    1    0
      1      0      1     |   0   1   0    0    0
      1      1      0     |   0   1   0    0    0
      1      1      1     |   0   0   0    0    1
TABLE 1. Second Order Booth Recoder.
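The recoding in Table 1 can be exercised with a short sketch that turns a 24-bit magnitude into 13 radix-4 digits and then rebuilds the product from them. The function names are hypothetical, and the digits are represented as signed integers in {-2, -1, 0, +1, +2} rather than as the five select lines (+X, -X, +2X, -2X, Zero) a circuit would use.

```c
#include <stdint.h>

#define BOOTH_DIGITS 13

/* Recode a 24-bit magnitude, zero-extended so the sign is positive,
 * into 13 digits. Digit i is selected by the overlapping bit triple
 * a(2i+1) a(2i) a(2i-1), with a(-1) = 0, exactly as in Table 1. */
void booth_recode(uint32_t multiplier, int digits[BOOTH_DIGITS])
{
    static const int table[8] = { 0, +1, +1, +2, -2, -1, -1, 0 };
    /* Shift left once so the implicit a(-1) = 0 sits in bit position 0. */
    uint64_t padded = (uint64_t)(multiplier & 0xFFFFFFu) << 1;
    int i;

    for (i = 0; i < BOOTH_DIGITS; i++) {
        unsigned triple = (unsigned)(padded >> (2 * i)) & 7u;
        digits[i] = table[triple];
    }
}

/* Sum the 13 partial products digit[i] * x * 4^i. */
int64_t booth_multiply(uint32_t x, uint32_t y)
{
    int digits[BOOTH_DIGITS];
    int64_t sum = 0;
    int i;

    booth_recode(y, digits);
    for (i = 0; i < BOOTH_DIGITS; i++)
        sum += (int64_t)digits[i] * (int64_t)(x & 0xFFFFFFu) * ((int64_t)1 << (2 * i));
    return sum;
}
```

For a multiplier of binary 110 the digits are -2 and +2, giving -2 + 2*4 = 6. The thirteenth digit is needed because the top triple includes the zero sign bit added above bit 23.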
The modified Booth's recoding scheme is used in place of a standard 24-term
adder tree because the standard adder tree would require two more exclusive-or
gates in the path of the partial product sum tree. The increase in complexity of the
Booth's recoding scheme is offset by the large increase in transistor count
required by a standard 24-term adder tree, especially with an efficient Booth's
recoding and multiplexing implementation. With smaller adder trees, up to 16 bits,
the added complexity required by the 2-bit Booth's recoding scheme is offset by
the increased transistor count of a standard adder tree. Larger recoding schemes,
such as 3- or 4-bit recoding, may reduce partial product tree requirements, but only
with a greater variety of terms to be created and added in the partial product tree.
Because of the increased complexity and recoding delay caused by the more
extensive recoding, the multi-bit recoders are most useful when one term arrives
before the other term.
Table 2 is a representation of the entire partial product adder tree. Letters a-d, g-j,
m, n, p, q and v correspond to the 13 partial products. The 52 least-significant,
post-aligned accumulator bits are represented by s. Letters y and z are the 13 to 2
partial product reduction terms and o is the lower 52 bits of the multiply/accumulate.
Bits an, bn, cn, etc. are the increment bits for the corresponding partial product if
that term is required to be negated. Bits as, bs, cs, etc. are the sign bits of the
corresponding partial product.
    ~as as as as a23 a22 ... a1 a0
    1 ~bs bs b23 b22 ... b1 b0 an
    1 ~cs cs c23 c22 ... c1 c0 bn
    1 ~ds ds d23 d22 ... d1 d0 cn
    1 ~gs gs g23 g22 ... g1 g0 dn
    1 ~hs hs h23 h22 ... h1 h0 gn
    1 ~is is i23 i22 ... i1 i0 hn
    1 ~js js j23 j22 ... j1 j0 in
    1 ~ms ms m23 m22 ... m1 m0 jn
    1 ~ns ns n23 n22 ... n1 n0 mn
    1 ~ps ps p23 p22 ... p1 p0 nn
    1 ~qs qs q23 q22 ... q1 q0 pn
    1 ~vs vs v23 v22 ... v1 v0 qn
    s51 s50 s49 ... s1 s0
    y50 y49 y48 ... y1 y0
    z50 z49 z48 ... z1 z0
    o51 o50 o49 ... o1 o0
(Each partial product row is offset two bit positions to the left of the row above it; ~ marks a complemented sign bit.)
Table 2: Entire Partial Product Summing Tree for a 25 to 13 Booth Recoded Multiply.
A 13-2 compressor is used to add the thirteen 2's complement partial products to
produce 2 remaining terms. The 13-2 compressor is discussed in detail in section
4.3 on page 54. This type of adder tree design allows a bit slice of the entire partial
product tree to be designed and laid out, optimizing the size and critical
paths, while keeping the binary tree reduction of the partial products. Tables 3
and 4 show the formation of the partial products before they enter the
13-2 compressor. One thing to note about the tables is the formation of the most
significant bits. When using 2's complement notation, the sign must be extended
to the full length of the final word size which, in this design, would be 52 bits. This
design uses a common method of sign extending the bits in the partial product
tree without applying the sign of the partial product across the entire word.
    ~as as as as a23 a22 ... a1 a0
    1 ~bs bs b23 b22 ... b1 b0 as
    1 ~cs cs c23 c22 ... c1 c0 bs
TABLE 3. Partial Product Addition with no shifting
    ~as as as a23 a22 ... a1 a0 as
    1 ~bs b23 b22 ... b1 b0 bs as
    1 ~cs c23 c22 ... c1 c0 bs
TABLE 4. Partial Product Addition with shifting
(~ marks a complemented sign bit.)
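The general idea behind sign extension without replicating the sign across the whole word can be sketched as follows: complement the sign bit and add a constant that has ones from the sign position up to the top of the word. Because that constant does not depend on the data, the constants of all partial products can be combined ahead of time. This is a generic illustration of the technique under that assumption and may differ in detail from the arrangement in Tables 3 and 4; both function names are hypothetical.

```c
#include <stdint.h>

/* Conventional sign extension of an n-bit 2's complement field to `width`
 * bits (1 <= n <= width < 64): replicate the sign bit into every upper bit. */
uint64_t sign_extend(uint64_t value, unsigned n, unsigned width)
{
    uint64_t mask_n = (1ull << n) - 1u;
    uint64_t mask_w = (1ull << width) - 1u;
    uint64_t sign   = (value >> (n - 1u)) & 1u;
    uint64_t upper  = sign ? (mask_w & ~mask_n) : 0u;

    return (value & mask_n) | upper;
}

/* Equivalent form: flip the sign bit, then add a data-independent constant
 * with ones in bit positions n-1 through width-1, modulo 2^width. */
uint64_t sign_extend_by_constant(uint64_t value, unsigned n, unsigned width)
{
    uint64_t mask_n   = (1ull << n) - 1u;
    uint64_t mask_w   = (1ull << width) - 1u;
    uint64_t flipped  = (value & mask_n) ^ (1ull << (n - 1u));
    uint64_t constant = mask_w & ~((1ull << (n - 1u)) - 1u);

    return (flipped + constant) & mask_w;
}
```

The two functions agree for every input because flipping the sign bit s adds (1 - 2s) * 2^(n-1), and the constant contributes -2^(n-1) modulo 2^width, leaving the field's original 2's complement value.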
3.1.2 Accumulator add, final summation, normalization and rounding
These two final partial product terms are then added to the exponent-aligned
accumulator term. The standard method for adding and aligning the accumulator
term with the product term is to determine which term is larger and then right shift
the other term. This implementation was not chosen because of the increase in
complexity of shifting either a two-term 48-bit product or the single-term accumulator
and keeping track of the sticky bits. Also, this implementation would require a
73-bit wide accumulation alignment shifter to be in the direct path of the partial
product tree, adding delay into the multiplication path. The preferred implementation
makes the product terms stationary and only aligns the accumulator term.
This design requires two 48-bit product terms and a 74-bit accumulator term. The
74 bits for the final sum are necessary to fully align the accumulator above and
below the infinitely precise product term and still retain the precision required by
the IEEE standard. 74 bits are used because the product is a maximum of 48 bits
and the 2's complement length of the accumulator is 25 bits, which adds to 73 bits.
One extra bit is added to separate the product term and the accumulator term
when the least significant bit of the accumulator term is larger than the most
significant bit of the product terms.
The advantage of this scheme is that the exponent logic can determine the
appropriate shift of the accumulator and shift it outside of the multiplier/sum path
while the product is being formed. The alignment procedure is discussed in more
detail in section 3.1.3 on page 21. The major drawback of this implementation is
that the size of the infinitely precise result is 74 bits wide instead of 49 bits (48 bits
of the product term and one extra bit for the accumulation), and the accumulator
alignment shifter is 98 bits wide. However, only the lower 52 bits of the final sum
need the 3-2 adders to sum the product terms and the accumulator term. The
multiplier tree cannot produce bit terms above bit 51. Above bit 51 an incrementer
is used to produce the final result.
There are two main conditions that need to be examined for this scheme: 1)
where the most significant bit of the accumulator is less than the most significant
bit of the product and 2) where the most significant bit of the accumulator is
greater than the most significant bit of the product. For case 1, since the
product term is infinitely precise at this point, the addition of an accumulator term
whose value is less than the product term can be easily handled within the 48-bit
product term adder. The accumulator bits less than the least significant bit of the
product are handled by the 98-bit right shifter and the sticky bit.
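The role of the sticky bit in the alignment shift can be sketched with a right shift that records whether any one was shifted out. This is a simplified illustration that uses a 64-bit word as a stand-in for the 98-bit shifter of the design; the type and function names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint64_t value;   /* accumulator bits that remain after alignment */
    int      sticky;  /* 1 if any one was shifted out below bit 0 */
} aligned_term;

/* Right-shift `term` by `shift` positions, collecting the discarded bits
 * into a single sticky bit instead of keeping them. */
aligned_term shift_right_sticky(uint64_t term, unsigned shift)
{
    aligned_term r;

    if (shift == 0u) {
        r.value  = term;
        r.sticky = 0;
    } else if (shift >= 64u) {
        /* everything is shifted out of the word */
        r.value  = 0u;
        r.sticky = (term != 0u);
    } else {
        r.value  = term >> shift;
        r.sticky = (term & ((1ull << shift) - 1u)) != 0u;
    }
    return r;
}
```

The sticky bit produced here is the same quantity the rounding logic consumes, so the bits that fall below the product's least significant bit still influence the rounded result without widening the adder.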
The handling of case 2 is more difficult. The final summing of the lower 48 bits is
straightforward. These comprise the two product terms and any bits of the
accumulator term aligned into this area. Bits 48 through 51 are needed in the last
full adder stage because the sign extension method used in the partial product
tree can produce terms in these bit locations. These extra bits cannot be
stripped because the accumulator addition is done in the partial product
tree. The method of sign extension also adds complexity to the formation
of the accumulator term prior to its addition to the product terms. An extra bit
is always produced at bit 51, no matter what two numbers are being multiplied. By
examining Table 2 closely, the minimum and maximum possible carry out of the
partial product tree can be determined. If all the partial products were zero, the
terms would look like:
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TABLE 5. Partial Product Addition of all zeros
and so on for all 13 partial products. The ones at the most significant bit
positions of the 13 partial products will propagate a carry into bit 51. For
the case where all the bits are one, a hypothetical edge condition that is not
possible in this implementation, the maximum number of bits needed to represent the
entire number without any carry out of the most significant bit is 52. So for the
minimum, maximum, and all in-between cases, bit 51 will always be a one before the
accumulator term is added. Since this bit cannot be dealt with in the partial
product tree, it must be dealt with by modifying the accumulator term. This modification
will be discussed in section 3.1.3.
If the least significant bit of the accumulator term is greater than the most
significant bit of the product term, then the accumulator shift before it is added to
the multiplier term is zero. Under this condition, the accumulator term must not be
modified by the product term. To make sure that the product term does not affect the
accumulator term in the final add, bits 48 through 51 of the multiplier partial
product terms are cleared. This will prevent any carry propagation into the
accumulator term because the least significant bit of the accumulator term is at
bit 49. This will leave one guard bit, bit 48, for the final add of the lower 48
bits. These lower 48 bits are still required to be fully determined to produce the
correct sticky bit for rounding. Since the product term is by default a positive
number, bit 48, the sign bit, should be zero. However, the sign extension method
might produce a carry out into bit 48, which under normal conditions would be pushed
out to bit 51. Since bits 48 through 51 are being cleared, this push-out does not
happen. Also, the infinitely precise result of the product term is preserved in the
lower 48 bits. Rounding and sticky bit determination in this case use only the
lower 48 bits and not bit 48. Sign extension of bits 73 through 52 is
technically unnecessary because the product is positive. However, the Booth
recoding technique requires 2's complement numbers and the artifact of this is an
extra one at bit 51.
Once all the bits to be added are determined, the final add can be performed. The
lower 52 bits consist of three terms and require a 3-2 adder recoding before the
fast final adder is used. The upper 22 bits need no adder because they consist of
only one possible term, the accumulator term. The final sum of the lower 52 bits
might produce a carry out into the upper 22 bits. To make this complete final add
as fast as possible, the upper 22 bits are incremented while the fast add is done.
When the fast add is finished, the carry-out can be used to select whether the
incremented or non-incremented version of the upper 22 bits is to be passed to the
normalizer and rounding blocks.
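The select-on-carry scheme described above can be sketched behaviorally in Python. This is only an illustration under stated assumptions: the function name `final_add` and its arguments are hypothetical, the two 52-bit inputs stand for the sum and carry vectors leaving the 3-2 stage, and Python integers replace the actual adder circuits.

```python
LOW_BITS = 52    # width handled by the fast adder (from the text)
HIGH_BITS = 22   # width handled by the incrementer (from the text)


def final_add(sum_vec: int, carry_vec: int, acc_high: int) -> int:
    """Behavioral sketch of the 74-bit final add (names are hypothetical).

    sum_vec, carry_vec: the two 52-bit vectors produced by the 3-2 stage.
    acc_high: the upper 22 bits, which come only from the accumulator term.
    """
    low_mask = (1 << LOW_BITS) - 1
    high_mask = (1 << HIGH_BITS) - 1

    # Fast add of the lower 52 bits.
    total = (sum_vec & low_mask) + (carry_vec & low_mask)
    carry_out = total >> LOW_BITS

    # Both versions of the upper bits are prepared in parallel with the add.
    high_plain = acc_high & high_mask
    high_incremented = (acc_high + 1) & high_mask

    # The carry out of the fast add selects which version is used.
    high = high_incremented if carry_out else high_plain
    return (high << LOW_BITS) | (total & low_mask)
```

The point of the arrangement is that the incremented value is already available when the carry out arrives, so the upper 22 bits add only a multiplexer delay to the fast add.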
Since the current number can be either positive or negative and the final form of
the results must be in sign magnitude notation, a negative number needs to be
made positive. The final form must also be normalized. These two operations can
be performed in parallel. While the 2's complement of the 74 bit final sum (PSUM)
is being formed, the leading 0 and 1 are both found and a 74 bit left shifter is set
up to normalize the PSUM term. The leading 1 is used to normalize a positive
number, while the leading 0 is used to normalize a negative number since the
leading 0/1 search is performed on the 2's complement number and not the sign-
magnitude number.
[Figure: normalization example for a negative number and a positive number, showing
the number out of the final add, the leading 1 or 0, the positive magnitude
representation, and the normalized representation.]
In one case this method can miss the proper normalization by one place. It is the
condition in which the PSUM is negative and the mantissa is a string of ones
followed by a string of zeros. This condition is easily recognized and can be dealt
with by just detecting it after the PSUM term passes through the 74-bit left shifter
and incrementing the exponent. Since the bits will all be zero below the leading
one, no extra logic is needed to shift the mantissa or round it.
[Figure: normalization exception condition, showing the number out of the final add,
the leading 1 or 0, the positive magnitude representation, and the incorrect and
correct normalized representations.]
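The leading 0/1 search and the one-place miss can be illustrated with a small Python sketch. It is a behavioral model under assumptions, not the circuit: the helper names are hypothetical, and a 6-bit width is used in the examples instead of 74 bits so the values are easy to follow.

```python
def leading_position(psum: int, width: int) -> int:
    """Index of the leading 1 (positive) or leading 0 (negative) below the sign bit.

    psum is a width-bit 2's complement pattern. Returns -1 when every bit
    equals the sign bit (all zeros or all ones).
    """
    sign = (psum >> (width - 1)) & 1
    for bit in range(width - 2, -1, -1):
        if ((psum >> bit) & 1) != sign:
            return bit
    return -1


def magnitude(psum: int, width: int) -> int:
    """Positive magnitude of a width-bit 2's complement pattern."""
    mask = (1 << width) - 1
    if (psum >> (width - 1)) & 1:
        return (-psum) & mask
    return psum & mask


def misses_by_one(psum: int, width: int) -> bool:
    """True when the magnitude has a one just above the detected position."""
    pos = leading_position(psum, width)
    if pos < 0:
        return False
    return ((magnitude(psum, width) >> (pos + 1)) & 1) == 1
```

For the 6-bit pattern 111010 the leading 0 is at bit 2 and the magnitude 000110 also has its leading 1 at bit 2, so the search is exact. For 111000 (ones followed by zeros) the leading 0 is again at bit 2, but the magnitude is 001000, one place higher, which is the exception condition handled by incrementing the exponent.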
The output of the leading 0/1 detector sets up the 74-bit left shifter to normalize
the positive magnitude PSUM number. The right side of the shifter is filled in with
all zeros in case the leading one or zero is below bit 24. The output of the shifter
consists of all 73 bits. Bits 49 through 71 are the pre-rounded magnitude bits. Bit 48
is used in rounding determination. Bit 72 is used to determine if the number is a
normalization exception condition. Bits 48 through 0 are used in sticky bit
determination. The unneeded bit 73 is the sign bit and will always be zero at this point.
Now all of the signals necessary for rounding are available. If the rounding
requires an increment, the 23 magnitude bits are incremented. If there is a carry
out of the last bit then the exponent needs to be incremented. Bit 23 of the
mantissa will always be one, so a carry out of bit 22 will produce a carry out of bit 23.
No shifting of the mantissa will be necessary when the increment, due to
rounding, produces a carry out of the 24th bit, because the magnitude will be zero for all
24 bits. Now the final magnitude value of the mantissa is formed.
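A short Python sketch shows why no mantissa shift is needed after a rounding carry. The function name and the use of a 24-bit integer that includes the leading one are assumptions made for illustration.

```python
def increment_mantissa(mant24: int, exponent: int) -> tuple[int, int]:
    """Increment a normalized 24-bit mantissa (hypothetical helper).

    Returns the low 24 bits after the increment and the exponent, which is
    bumped by one when the increment carries out of bit 23.
    """
    total = mant24 + 1
    carry = total >> 24
    return total & 0xFFFFFF, exponent + carry
```

A carry out can only occur when all 24 bits were one, and in that case the 24 bits left behind are all zero, so the stored fraction is already correct for the incremented exponent.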
3.1.3 Accumulator Alignment
At the start of a multiply/accumulate operation, the user can select one of four
accumulation operations. The four accumulation modes are implemented using the
forms (A+B), -(A+B), -(A+(-B)), and (A+(-B)). In each case the multiplier
term (A) is positive. Cases 1 and 2 simply require the non-negated value of
both terms. Cases 2 and 3 are implemented by inverting the sign bit at the end of
the operation. Cases 3 and 4 are implemented by negating the accumulator term
before it is added to the multiplier term. This requires either the positive
or negative representation of the accumulator term to be available. Thus the first
operation on the accumulator is a 2's complement of the mantissa. If case 3 or 4
is needed to perform the desired operation, then the sum will use the negative
accumulator term. If a zero is detected then the accumulator will be all zeros, no
matter what.
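The way the four forms reduce to two controls (negate the accumulator before the add, invert the sign after it) can be sketched in Python. This is an arithmetic illustration only: the mode labels and the `accumulate` helper are hypothetical, and ordinary signed integers stand in for the sign-magnitude hardware values.

```python
# mode label -> (negate accumulator before the add, invert sign after the add)
MODES = {
    "A+B":    (False, False),   # case 1: (A+B)
    "-(A+B)": (False, True),    # case 2: -(A+B)
    "-A+B":   (True,  True),    # case 3: -(A+(-B))
    "A-B":    (True,  False),   # case 4: (A+(-B))
}


def accumulate(a: int, b: int, mode: str) -> int:
    """Combine product a and accumulator b using only the two controls."""
    negate_acc, invert_result = MODES[mode]
    total = a + (-b if negate_acc else b)
    return -total if invert_result else total
```

With a = 6 and b = 2 the four modes give 8, -8, -4, and 4, matching A+B, -(A+B), -A+B, and A-B without ever negating the product term.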
This 2's complement of the accumulator is done in parallel with the alignment
determination by the exponent logic, so this 2's complement operation is not in the
critical path. The exponent logic will produce the necessary control to a 98-bit right
shifter to properly align the accumulator with the product term for addition. If the
least significant bit of the accumulator is greater than the most significant bit of the
product then the right shift will be zero. If the least significant bit of the product is
greater than the most significant bit of the accumulator, then the shift is 74. Only
the top 74 bits of the accumulator are to be added to the product term. The bottom
24 bits are used in sticky bit determination. The sign extension of the
accumulator term is accomplished by back-filling all of the bits on the left with the
sign of the accumulator. The 98-bit right shifter will not shift 98 bits right, but
only 74. However, the number of outputs from the right shifter is 98 bits.
The upper 74 bits of the aligned and sign-extended accumulator are necessary for
the multiply/accumulate operation to be accomplished. The lower 51 bits are
added to the product term unmodified. Bits 51 through 73 need to be decre-
mented to account for the extra one in bit 51 of the product term. However if the
least significant bit of the accumulator is greater than the most significant bit of the
product, then the decrement is not performed on the accumulator term. This dec-
rement will not impact the speed of the multiply/accumulate because this opera-
tion will be performed while the lower 51 bits are being added in the fast adder.
Now all 74 bits are ready to be added to the product.
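The alignment step can be modeled roughly in Python. The sketch below makes several assumptions that are not stated in the text: the accumulator is treated as a 25-bit signed value placed at the top of the 98-bit field, the function name is hypothetical, and the decrement of bits 51 through 73 is left out.

```python
ACC_BITS = 25                              # assumed signed accumulator width
SHIFTER_BITS = 98                          # shifter output width (from the text)
ADD_BITS = 74                              # bits passed to the adder (from the text)
STICKY_BITS = SHIFTER_BITS - ADD_BITS      # bottom 24 bits feed the sticky logic


def align_accumulator(acc: int, shift: int) -> tuple[int, bool]:
    """Right-shift a signed accumulator and split the 98-bit result.

    Returns (top 74 bits as an unsigned pattern, sticky flag).
    """
    if not 0 <= shift <= ADD_BITS:
        raise ValueError("shift must be between 0 and 74")
    wide = acc << (SHIFTER_BITS - ACC_BITS)    # place at the top of the field
    shifted = wide >> shift                    # arithmetic shift back-fills the sign
    sticky = (shifted & ((1 << STICKY_BITS) - 1)) != 0
    top = (shifted >> STICKY_BITS) & ((1 << ADD_BITS) - 1)
    return top, sticky
```

Under these assumptions a shift of zero leaves the accumulator's least significant bit at bit 49 of the 74-bit field, which agrees with the bit position given in section 3.1.2.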
Figures 2, 3, and 4 are the actual top-level circuit schematics of the mantissa
section. Figure 2 is the input latches, Booth recoder and multiplexers, the 13-2 partial
product tree and the final 3-2 adder used to add the product and sum terms.
Figure 3 is the final fast adder, leading 0/1 detector, normalizer, mantissa rounder,
and accumulator latches. Figure 4 is the accumulator alignment and 2's
complementer circuits.
[Figure 1 block diagram, "Floating Point Mantissa Processing". Blocks shown: Booth
recoder, 24-13 Booth muxes, 13-2 compressors, 3-2 adder, 22-bit incrementer, 52-bit
carry-lookahead/carry-select adder, leading zero/one detect, 74-bit 2's complementer,
decrementer, 74-bit left shifter, 48-bit any-1 detect/sticky bit, 24-bit incrementer,
25-bit latch, 25-bit 2's complementer, 98-bit right shifter, any-1's detect.]
FIGURE 1. Functional Diagram of the Mantissa/Accumulator Processing unit.
[Figure 2 schematic. Drawing title: "Initial mantissa latch and Booth encoder, and
partial product summation, and accumulator sum", February 20, 1994. AT&T - Proprietary,
use pursuant to Company Instructions.]
FIGURE 2. Mantissa input latches, Booth recoder and partial product adders.
FI~ND
~~"%F,e.~r:r2""L] Zio-,22j SJ-UFOO::21j
"'" I ~'"U~~mO~~::""-I
0<
~ I""EAR'
1.FQ3
R1NCE,S(O:22j
~
,e.~rJI315 I 11011) f'J1
fJlIJ<·lffJI0',11
'H~U)
q:::~~:;:~1 .. u_, ", I ,.,,~~., 'I 1
CJJ~r",IJ
f"JlJl;l'If,;
s,o'7'2'!:-4Tj..e.ri'sz.v..,SL'M'i=j:r22].BrTS77.UC_BITSn
r
~A~BrrSo\8:~ ~~~
IEl2 FINVC ~-~~_"'
S'"y!~,',J;;ii54a.Sll\lFO A KlirJ'OI
FINVC
"\.1;.;1,:).:,.7'3;
F9.M"r.P I'rf"o::.nj ,'Ylr~
fSD.l./'4YJj
I ':;'''''' ~::f""" '''''Ji] I Fsq').e.'(.24Z2_,.c.:.o,~.:.t ...A,?Z;1] IF (H] F~7JFWifJIIlrJJ!['J ull-l-------jI ",~r" I !j'''''O','J>Fr" I""nlI ~;.f)i() r;jq~f,':j,'.o! '11r,:!'.!
srom
"~ ,~. 7j
FICURE 3, Fa~t ArlrIer. Normalizer and Rounder
AT&T - PROPRIETARY
Use pursuant to COlTl)any Instructions
25
MULTIPLIER SECTION 1 I RJN
FEBRUARY 6, 1994
OWG SUE I ISSUE
6S 1.0
SHEEl
10F1
P~T1:DINU.5.A
" .... -'
[Figure 4 schematic. Drawing title: "Prepare to add accumulator to multiply value".]
FIGURE 4. Accumulator alignment, and 2's complementer.
3.2 Exponent Processing
Figure 5 shows a functional description of the exponent processing
section of this design. To multiply two floating point numbers the exponents are
summed together. The 8-bit exponents of the two floating point terms to be
multiplied are added together with -127 to form a 10-bit 2's complement number. This
number is called MSUMEXP in figure 5. Bit 8 of MSUMEXP is used for overflow
detection. Bit 9 of MSUMEXP is the sign bit and will be used to detect an
exponent that is less than zero (underflow). Since both exponents are biased by
127, the sum of two biased exponents results in a number biased by 254. To
re-bias the summed exponent back to 127, all that must be done is to add -127 along
with the exponents of the two terms to be multiplied.
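The re-biasing arithmetic can be checked with a few lines of Python. The function names are hypothetical; the 10-bit width, the -127 term, and the roles of bits 8 and 9 follow the text.

```python
BIAS = 127


def msumexp(x_exp: int, y_exp: int) -> int:
    """10-bit 2's complement pattern of x_exp + y_exp - 127 (biased inputs)."""
    return (x_exp + y_exp - BIAS) & 0x3FF


def status_bits(value: int) -> tuple[int, int]:
    """Return (bit 8, bit 9) of a 10-bit pattern.

    Bit 9 is the sign bit. Note that bit 8 is also set for negative values
    because of sign extension, so it indicates overflow only when bit 9 is 0.
    """
    return (value >> 8) & 1, (value >> 9) & 1
```

For example, two operands with biased exponent 127 (true exponent 0) give 127 again, 200 and 200 give 273 with bit 8 set, and 10 and 20 give -97, which sets bit 9.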
The multiplication of two 24-bit positive numbers with a decimal fractional
representation (f), 0 <= f < 2 (1.999...), results in a mantissa of 48 bits with a decimal
fractional representation, 0 <= f < 4 (3.999...). In a binary representation, the
decimal point of the product is between bits 45 and 46. Since the initial product/sum
result, before normalization, will assume a binary decimal point between bits 72
and 71, an exponent with the binary decimal point in that position needs to be formed.
The difference between bit 72 and bit 46 is 26, so 26 must be added to
MSUMEXP to form the desired exponent. This number is called M26EXP in figure
5. If the value of the least significant bit of the accumulator term is greater than the
product term, then M26EXP is loaded with the exponent of the current
accumulator term.
The MSUMEXP exponent is also added to 26 and the 2's complement of the current
accumulator exponent. This number is used to determine the shift of the
accumulator's mantissa before it is added to the multiplier's final mantissa and
whether M26EXP is loaded with the product exponent or the current accumulator
exponent. This number is called CMPS_M in figure 5. If the result of CMPS_M is
negative then the multiplier's final sum is smaller than the value of the least
significant bit of the accumulator's result. If this is the case then M26EXP will be
replaced by the accumulator's exponent.
MSUMEXP is used in a third adder to determine if the shift of the accumulator, to
align it with the product, would be greater than the least significant bit of the
product. MSUMEXP, the 2's complement of the current accumulator exponent, and -48 are
added together. If the result of the 2's complement sum is less than zero, bit 10 of
the comparison, then the accumulator is smaller than the least significant bit of the
multiplier term. This signal is called PSHIFT74.
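The three exponent sums described in this section can be collected into one Python sketch. It simply follows the arithmetic as stated (MSUMEXP + 26, MSUMEXP + 26 - accumulator exponent, and MSUMEXP - accumulator exponent - 48); the function name and return layout are hypothetical, and plain Python integers are used instead of 10-bit adders.

```python
def exponent_controls(msumexp: int, acc_exp: int) -> tuple[int, int, bool]:
    """Return (M26EXP, CMPS_M, PSHIFT74) for signed integer inputs."""
    m26exp = msumexp + 26                 # product exponent, binary point moved
    cmps_m = msumexp + 26 - acc_exp       # controls the accumulator shift
    third_sum = msumexp - acc_exp - 48    # comparison against the product LSB

    # A negative CMPS_M means the accumulator exponent replaces M26EXP.
    if cmps_m < 0:
        m26exp = acc_exp

    # PSHIFT74 is taken here as the sign of the third sum, as the text describes.
    pshift74 = third_sum < 0
    return m26exp, cmps_m, pshift74
```

For instance, with MSUMEXP = 127 and an accumulator exponent of 127 the sketch returns M26EXP = 153 and CMPS_M = 26, while MSUMEXP = 100 with an accumulator exponent of 200 gives a negative CMPS_M and loads M26EXP with 200.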
Once the leading one or zero is determined from the final sum of the accumulator
and the product terms, the leading one or zero location is converted into a binary
number and subtracted from the exponent M26EXP. Except for rounding and
exceptions, this is the final exponent of the multiply/accumulate operation. If the
rounding of the mantissa results in a carry out of the most significant bit, the final
exponent needs to be incremented. Also, if the normalization exception condition
exists, an increment needs to happen. There is no need for two incrementers,
because the normalization exception condition results in a mantissa of all zeros.
With the mantissa containing all zeros, there is no possible way that an increment
of the mantissa will produce a carry out of bit 22 of the mantissa. The final
determination of the exponent now depends on various exceptions. All of the exponent
bits are cleared if the result is determined to be zero. The exponent's bits are set if
infinity or NaN is detected.
Ten bits are used in the processing of the 8-bit magnitude exponent so that
precision is not lost in the final result. Because the operation between the product and
the current accumulator can be addition or subtraction, the preliminary exponent
may be greater than 127 before normalization and within precision after
normalization.
Figure 6 is the actual top-level circuit description of the exponent processing unit.
[Figure 5 block diagram, "Exponent Processing". Blocks and signals shown: inputs
X[30:23], Y[30:23] and S[30:23]; constants -127 and 26; three 10-bit 3-2
adder/summers; MSUMEXP; m26exp; cmpsm[9:0] and cmpsm9; comparator (bit 10, greater
than or equal to zero) producing PSHIFT74; 2's complementers; leading one/zero
position; 10-bit summer; increment (rounder/shift exception); clear/set latches
(zero sum, infinity/NaN); MSEXP; output S[30:23]next.]
FIGURE 5. Functional Diagram of the Exponent processing unit.
[Figure 6 schematic. Drawing title: "Exponent Logic".]
FIGURE 6. Top-Level circuit schematic of the Exponent Processing Unit.
3.3 Exceptions Processing
There are four exceptions that are recognized by this design. They are Zero,
Infinity, Gradual Underflow, and Not-A-Number. Figure 7 is the top-level circuit
schematic of the Exceptions processing unit.
Zero is determined by examining the multiplier (X) and multiplicand (Y) terms to see
if they are zero. This is simply done by detecting all zeros in the mantissa and
exponent. The value of the sign is not necessary for any exception processing. To
allow the use of the all-zero detect signals for the other exceptions, the exponent
has its own zero detect signal and so does the mantissa. A zero is
detected when both zero detect signals are active.
If either the X or Y term is found to be zero, the product is zero. However the
result of the multiply/accumulate may not be zero because the current accumula-
tor might not be zero. There are three ways to set the final zero flag. The final zero
flag can be set if either the X or Y term is zero and the current accumulator is
zero, the final exponent of the result is less than -127, or the sum of the product
and the accumulator results in zero.
The determination of the resultant exponent less than -127 is found from bit 10 of
the MSEXP term in figure 5. The determination of a zero in the sum of the product
and accumulator term is found in the leading 0/1 search of the FSUM term. If the
FSUM term is all zeros (positive number) or all ones (negative number), then the
leading 0/1 detect will produce a zero signal called LEADZERO.
Infinity is determined in two ways: by either X, Y, or the current accumulator term
being infinity, or by an overflow of the final exponent. X and Y
are determined to be infinite when the mantissa is all zeros and the exponent is all
ones. The overflow of the exponent is determined from bit 9 of MSEXP.
Not-A-Number (NaN) is determined by the X and Y inputs only. If the exponent is
all ones and the mantissa is not all zeros for either the X or Y term, or one of the
inputs is infinite and the other is zero, then the NaN flag is signaled.
Gradual Underflow (GUF) is also determined only from the X and Y inputs. If the
exponent is all zeros and the mantissa is not all zeros then the GUF flag is
signaled. This design does not process GUF numbers; it only signals if one is
detected.
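The input classification rules above can be written as a short Python sketch operating on a raw 32-bit word. The function names and string labels are hypothetical; the field positions (exponent in bits 30:23, mantissa in bits 22:0) follow the IEEE 754 single-precision format.

```python
def classify(word: int) -> str:
    """Classify a 32-bit IEEE 754 single-precision bit pattern.

    The sign bit (bit 31) is ignored, as in the exception logic described above.
    """
    exponent = (word >> 23) & 0xFF
    mantissa = word & 0x7FFFFF
    exp_zero, exp_ones = exponent == 0, exponent == 0xFF
    man_zero = mantissa == 0

    if exp_zero and man_zero:
        return "zero"
    if exp_zero:
        return "guf"        # gradual underflow (denormalized) input
    if exp_ones and man_zero:
        return "infinity"
    if exp_ones:
        return "nan"
    return "normal"


def product_is_nan(x: int, y: int) -> bool:
    """NaN flag: either input is NaN, or one is infinite and the other zero."""
    cx, cy = classify(x), classify(y)
    return "nan" in (cx, cy) or {cx, cy} == {"infinity", "zero"}
```

Keeping separate exponent and mantissa tests mirrors the text's point that the same all-zero detect signals are shared by the zero, infinity, NaN, and GUF checks.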
Both the zero and infinity flags are used in the processing of the next operation.
The NaN and GUF flags are not used. If Infinity is detected by this design, the
accumulator will store this and every result from that point on will be infinity. A
reset of the accumulator will be necessary to resume normal operation.
[Figure 7 schematic. Drawing title: "NaN, Infinity, Zero, GUF Detection".]
FIGURE 7. Top-Level circuit schematic of the Exceptions Processing Unit.
3.4 Rounding Logic
There are four rounding modes: round-to-nearest, round-to-plus-infinity,
round-to-minus-infinity, and truncate. Figure 8 is the circuit schematic for the rounding
determination logic.
Round-to-nearest is the most useful because it introduces the lowest quantization
noise of the four rounding modes in some signal processing algorithms. An
increment of the mantissa is required if the bit to the right of the least significant bit
of the mantissa is a one and there is at least one other bit to the right of that bit
that is a one. The second condition is captured by the sticky bit. The sticky bit is
defined as the "OR" of any bits that are less than the least significant bit of the
final mantissa. The two areas where sticky bits may be found are in the accumulator
term after the accumulator alignment shift, as shown in figure 4, and the final add
term after the normalizing left shift, as shown in figure 3. If there is a one in any
of these bits, then the sticky bit is a one. The sticky bit is important in determining
if the value to be rounded is equally distant from the two possible rounded results. If
this tie exists, an increment is required if the least significant bit of the mantissa is a
one and the bit to the right of the least significant bit is also a one, even if the sticky
bit is a zero. If the bit to the right of the least significant mantissa bit is a zero, no
increment is required.
Round to positive and negative infinity are similar to each other in implementation.
For positive infinity, if the sign is positive and the sticky bit or the bit to the right of
the least significant mantissa bit is a one then the mantissa is incremented. For
negative infinity, the mantissa is incremented if the sign is negative and the sticky
bit or the bit to the right of the least significant mantissa bit is a one. These modes
are used in signal processing to explore the quantization boundaries of
algorithms.
The final mode is truncation. This mode is also called round-to-zero and never
requires the incrementation of the mantissa. This is the simplest and requires no
extra hardware to implement.
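The four rounding decisions can be summarized in a small Python sketch. The mode names and the function are hypothetical. The sketch assumes `guard` is the bit immediately to the right of the least significant mantissa bit and `sticky` is the OR of all bits below the guard bit, with ties in round-to-nearest resolved toward an even mantissa.

```python
def needs_increment(mode: str, sign: int, lsb: int, guard: int, sticky: int) -> bool:
    """Decide whether the mantissa magnitude must be incremented.

    sign: 1 for a negative result. lsb: least significant mantissa bit.
    guard: bit just below lsb. sticky: OR of every bit below guard.
    """
    if mode == "nearest":
        # Above the halfway point, or exactly halfway with an odd mantissa.
        return bool(guard and (sticky or lsb))
    if mode == "plus_infinity":
        return bool(not sign and (guard or sticky))
    if mode == "minus_infinity":
        return bool(sign and (guard or sticky))
    if mode == "truncate":
        return False
    raise ValueError(f"unknown rounding mode: {mode}")
```

Because the mantissa is kept as a magnitude, rounding toward plus or minus infinity only ever increments, and only for the matching sign.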
[Figure 8 schematic. Drawing title: "Rounding Logic".]
FIGURE 8. Top-Level Circuit Schematic for the Rounding Determination Logic
3.5 Sign Logic
The sign determination logic determines the final sign of the accumulated number
and also the accumulation mode in which the accumulator and the product should
be summed. Figure 10 shows the circuit schematic of this sign determination
logic. The initial function of the sign logic is to determine if the accumulator should
be added to or subtracted from the product term. The add or subtract is determined
from the sign of the two multiplier terms, the sign of the accumulator term and the
accumulation mode desired. Tables 6 and 7 describe the add/subtract
determination. The product/accumulator summing process is complex
because the sum is accomplished in the partial product adder tree. The final two
terms that determine the mantissa of the product have not yet been summed
together. It would be very difficult to determine the 2's complement of the product
when it has not yet been completely summed together, and it is not necessary. The
four accumulation modes can be accomplished by addition or subtraction of the
accumulator from the always positive product and then, after the product and the
accumulator have been summed, the sign can be modified to complete the
desired operation. Since the accumulator term is in a final rounded form, the 2's
complement can be easily formed and summed with the product.
Add/Subtract
X31 Y31 S31 A+B -(A+B) -A+B A-B
0 0 0 Add Add Subtract Subtract
0 0 1 Subtract Subtract Add Add
0 1 0 Subtract Subtract Subtract Add
0 1 1 Add Add Subtract Subtract
1 0 0 Subtract Subtract Subtract Add
1 0 1 Add Add Subtract Subtract
1 1 0 Add Add Subtract Subtract
1 1 1 Subtract Subtract Add Add
TABLE 6. Accumulator addition of positive or negative number
The second function of the sign logic is to determine the final sign of the
accumulated sum and product. This is determined from the accumulation mode, the sign
of the two multiplier terms, and the sign of the final accumulated sum before
normalization and rounding. Since the final sum is accomplished with 2's complement
numbers, the sign of the result is bit 74 of the 74-bit adder. X31 and Y31 are
needed to determine if the product term is negative or positive. Figure 9 shows
the logical representation of these tables. POSNEG determines whether the
positive or 2's complemented version of the accumulator term should be used and
S31next is the final sign, since the sum of the product and sum is converted into a
signed positive number before it is normalized and rounded.
Final Sign
X31  Y31  Mres73  A+B       -(A+B)    -A+B      A-B
0    0    0       Positive  Negative  Negative  Positive
0    0    1       Negative  Positive  Positive  Negative
0    1    0       Negative  Positive  Positive  Negative
0    1    1       Positive  Negative  Negative  Positive
1    0    0       Negative  Positive  Positive  Negative
1    0    1       Positive  Negative  Negative  Positive
1    1    0       Positive  Negative  Negative  Positive
1    1    1       Negative  Positive  Positive  Negative
TABLE 7. Final sign Determination
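Table 7 appears to reduce to a parity function: the result is negative when an odd number of X31, Y31, and Mres73 are one, with the outcome inverted for the two modes that negate the result. The Python sketch below encodes that reading; the function name and mode labels are hypothetical, and the return value 1 is taken to mean a negative final sign.

```python
NEGATING_MODES = {"-(A+B)", "-A+B"}
ALL_MODES = {"A+B", "A-B"} | NEGATING_MODES


def final_sign(x31: int, y31: int, mres73: int, mode: str) -> int:
    """Return 1 for a negative final sign, 0 for positive, per Table 7."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown accumulation mode: {mode}")
    invert = 1 if mode in NEGATING_MODES else 0
    return x31 ^ y31 ^ mres73 ^ invert
```

As a spot check against the table, the row X31 = 1, Y31 = 0, Mres73 = 1 gives Positive, Negative, Negative, Positive across the four columns, which is what the XOR form produces.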
Example:
X31 = 1, Y31 = 0, S31 = 1, Accumulation Mode = (A-B), Product of X and Y (A) is
greater than the initial accumulator value (B).
From table 6, the initial accumulator term is subtracted from the product. The ini-
tial accumulator is 2's complemented and passed to the summer. Since A is
greater than B, Mres73 will be positive(1). From table 7, the final result should be
negated to complete the operation.
[Figure 9 logic diagram, "Sign Logic". Signals shown: X31, Y31, S31, Mres73, ZeroSum,
the accumulation mode selects, POSNEG, and S31next, combined through XOR and AND gates.]
FIGURE 9. Functional Diagram of the sign/accumulator negation logic.
[Figure 10 schematic. Drawing title: "Sign and Instruction Logic".]
FIGURE 10. Top-Level Circuit Schematic of the Sign and Accumulation negation Logic.
4.0 Circuit Design
This circuit design section will concentrate on five areas of the multiply/accumulator:
the full adder, the Booth recoder and partial product adder array, the final
adder, the leading zero/one detector, and the shifters. The other aspects of the
design either utilize circuits designed for the mentioned areas and do not have to
be discussed in depth, or the circuits are not time/space critical and do not need
special design attention. The simulations presented are from schematic extractions
and not from actual layout. These simulations take into account worst case drain-
source depletion capacitance, but no routing capacitance, unless otherwise
stated. The transistor-level models used to simulate the various circuits come from
AT&T's 0.9 micron two-level metal process and AT&T's 0.6 micron two-level
metal, low voltage process.
4.1 Full Adder Design
The basic design of a full adder or 3-2 adder consists of a 3-input exclusive-OR
gate for the sum and a 3-bit majority detector circuit for the carry.
\[ \mathrm{SUM} = A \oplus B \oplus C \]
\[ \mathrm{CARRY} = A \cdot B + A \cdot C + B \cdot C \]
The design of the sum term is usually done by cascading two 2-input exclusive-OR
gates together. The design of the carry term utilizes the initial exclusive-OR
gate and a multiplexer. The equation of the carry is:
\[ \mathrm{CARRY} = (A \oplus B) \cdot C + \overline{(A \oplus B)} \cdot A \]
\[ \mathrm{CARRY} = (A \cdot \overline{B} + \overline{A} \cdot B) \cdot C + (A \cdot B + \overline{A} \cdot \overline{B}) \cdot A \]
\[ \mathrm{CARRY} = A \cdot \overline{B} \cdot C + \overline{A} \cdot B \cdot C + A \cdot (\overline{A} + B) \cdot (A + \overline{B}) \]
\[ \mathrm{CARRY} = A \cdot \overline{B} \cdot C + \overline{A} \cdot B \cdot C + A \cdot B \]
This final logic equation for the carry is equivalent to the 3-bit majority encoding
equation.
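The equivalence between the multiplexer form of the carry and the 3-bit majority function can be confirmed by exhaustive enumeration. The Python sketch below is a logic-level check only; the function names are hypothetical and nothing about the transistor-level circuits is modeled.

```python
from itertools import product


def majority_carry(a: int, b: int, c: int) -> int:
    """Carry as a 3-bit majority function."""
    return (a & b) | (a & c) | (b & c)


def mux_carry(a: int, b: int, c: int) -> int:
    """Carry selected by the first XOR: C when A xor B is 1, otherwise A."""
    return c if (a ^ b) else a


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) using two cascaded XORs and the multiplexer carry."""
    return (a ^ b) ^ c, mux_carry(a, b, c)


# Check all eight input combinations.
for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    assert carry == majority_carry(a, b, c)
    assert 2 * carry + s == a + b + c
```

Reusing the A xor B term for both the sum and the carry select is what lets the carry be built from a multiplexer rather than a separate majority gate.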
Three full adder designs were evaluated. All utilize transmission gate logic to
perform the exclusive-OR function. The transistor-level circuit schematics for these
three full adders are shown in figures 11, 12, and 13. The criteria for each version
were that it must be small in layout size, be fast, draw zero dc static current, have
full CMOS levels, and have buffered outputs. These criteria will allow operation at any
frequency and voltage levels down to 2.0 volts in most of today's current
processing technologies.
Version 1, shown in figure 11, consists of 26 transistors and uses 3 standard
transmission gate-type exclusive-OR gates. Full PN transmission gates are utilized to
keep all of the voltage levels at the power supplies to maximize performance at
low voltages. Version 2, shown in figure 12, consists of 23 transistors and utilizes
the all N-channel exclusive-OR gates. This design does not utilize full PN
transmission gates because it would become impractical for this design. To reestablish
full voltage supply logic levels, a P transistor "keeper" is used to raise the voltage
level at the input of the sum inverter to the power supply. Also, the n-channel gate
voltage at the carry bit logic is one threshold drop below the power supply.
P-channel transistors are then required to allow the input voltage to completely pass
through the transmission gate without severe degradation of the output voltage
level. If speed is not a factor, the number of bits added in one cell can be extended
easily with this version to form a 4-2 adder, as shown in figure 14, with a minimum
of extra transistors. Version 3, shown in figure 13, consists of 22 transistors and is
made up of a unique initial exclusive-OR gate [6] and standard transmission gate
exclusive-OR gates for the rest of the circuitry. This 6-transistor exclusive-OR gate
benefits from the fact that only one of the inputs needs to be inverted. In most
designs, this inversion is not "free" and needs to be generated somewhere. This
gate reduces the transistor count from 8 to 6 and still keeps speed and full rail
voltage level on the outputs. No "keepers" are needed in this design, because the
logic of the initial exclusive-OR gate allows only logic 1 to pass through the
P-channel transistor and only logic 0 to pass through the N-channel transistor. The
initial exclusive-OR gate was not preferred for the rest of the cell because it
affords no advantage in terms of speed or number of transistors.
4.2 Full Adder Simulation
The simulation of the full adders utilized a "SPICE-like" simulator called
ADVICE. This simulator is an internal AT&T Bell Labs creation that models AT&T's
0.9 micron CMOS and 0.6 micron CMOS integrated circuit processes. Since the
circuits under consideration have only 3 inputs and 2 outputs, all combinations
of inputs were simulated with two nanosecond rise times and 250 picofarads of
load capacitance on the outputs. The measured delay times were taken at 1.5
volts for all the cases. The reported delay value is the worse of the rise and
fall times.
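As a behavioral companion to the exhaustive input sweep described above, the following minimal Python sketch enumerates all eight input combinations of a 3-2 (full) adder. It models only the logic function, not the transistor-level circuits or the delays simulated in ADVICE; the function name `full_adder` is an illustrative assumption.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Behavioral 3-2 adder: returns (sum, carry) for three input bits."""
    s = a ^ b ^ c                          # sum is the XOR of the three inputs
    carry = (a & b) | (a & c) | (b & c)    # carry is the majority function
    return s, carry


# Sweep all 2**3 input combinations, as the circuit simulation does.
for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    # The two outputs together must encode the number of ones at the inputs.
    assert a + b + c == s + 2 * carry
```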
Tables 8 and 9 describe the merits of the three full adder implementations:

Process         0.9um       0.9um        0.9um       0.6um
Processing      Fast (nS)   Median (nS)  Slow (nS)   Slow (nS)
Power Supply    5.5 Volts   5.0 Volts    4.5 Volts   3.0 Volts
                sum/carry   sum/carry    sum/carry   sum/carry
Version 1       0.56/0.71   0.95/1.2     2.2/2.5     3.2/2.9
Version 2       0.51/0.65   0.81/1.1     1.8/2.3     2.5/3.1
Version 3       0.54/0.80   0.75/1.1     1.4/2.0     2.6/2.8
TABLE 8. Propagation delay of any input to the sum and carry output of the full adder

Process         0.9um       0.9um        0.9um       0.6um
Processing      Fast (nS)   Median (nS)  Slow (nS)   Slow (nS)
Power Supply    5.5 Volts   5.0 Volts    4.5 Volts   3.0 Volts
Version 1       1.1         1.2          1.7         1.7
Version 2       1.1         1.3          2.2         2.5
Version 3       0.9         1.0          1.5         1.6
TABLE 9. Propagation delay of the carry input to the carry output of the full adder
The propagation delay curves are shown in figures 15, 16, 17, and 18. Full adder
version 3 was chosen for use throughout this design for its speed and size.
FIGURE 12. Full Adder, Version 2.
FIGURE 13. Full Adder, Version 3.
FIGURE 14. 4-2 Adder implementation using non-restoring n-channel XOR gates.
FIGURE 15. Simulation of the full adders under 0.9um nominal processing, 5.0 volts
(ADVICE plot, 25 deg C; traces VB, VC, VSUMV1-3 and VCOUTV1-3 versus time in nS).
FIGURE 16. Simulation of the full adders under 0.9um worst case conditions, 4.5 volts
(ADVICE plot, 125 deg C; traces VB, VC, VSUMV1-3 and VCOUTV1-3 versus time in nS).
FIGURE 17. Simulations of the full adders under 0.9um best case conditions, 5.5 volts
(ADVICE plot, 0 deg C; traces VB, VC, VSUMV1-3 and VCOUTV1-3 versus time in nS).
FIGURE 18. Simulations of the full adders under 0.6um HD, worst case conditions, 3.0 volts
(ADVICE plot, 110 deg C; traces VB, VC, VSUMV1-3 and VCOUTV1-3 versus time in nS).
4.3 Booth Recoder and Partial Product Tree
The Booth recoder circuit and the partial product tree can be considered two
separate circuits; however, both sections must be considered as one to optimize
the speed and layout area of the multiplier's partial product adders.
There are two ways to add the partial product terms together to obtain two final
terms, which can then be added to form one term. The first method is the
"L-tree" and the second is the "V-tree". The "L-tree" adds three partial
products and generates two terms. These two terms are then added to another
partial product to generate another two terms, and so on, until all of the
partial products are added up. With each 3-2 adder step the partial product is
shifted with respect to the previously added terms. Figure 19 shows how this is
accomplished. The advantage of this design is that it creates an orderly layout
structure and makes the area loss due to routing the bit shift minimal. The
disadvantage is that it creates more gates in the critical path than are needed
to sum the partial products. For example, to add 12 partial products, ten 3-2
adders are in the critical path.
FIGURE 19. "L-tree" type partial product tree reduction using carry-save adder cells.
The "V-tree" method, using 3-2 adders, adds three partial products to produce
two terms, which will be called terms t1 and t2. Then three other partial
products are added to produce another two terms, called t3 and t4, and so on.
After all the partial products are added, the intermediate terms t1, t2, t3, t4,
etc. are added together with 3-2 adders. So t1, t2, and t3 are shifted with
respect to each other and then added to form another set of terms called u1 and
u2. This is repeated until all of the terms are added and only two terms remain.
Figure 20 shows how this is accomplished for 12 partial products. The advantage
of this design is that it reduces the critical path. For an add of 12 partial
products, the critical path goes through only five 3-2 adders. The penalty is
that the structure does not lay out well.
FIGURE 20. "Y-Tree" partial product tree reduction using carry-save adders with a 12 to
2 compressor technique.
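The critical-path counts quoted for the two structures can be checked with a short sketch. The function names and the greedy grouping rule (reduce every available group of three terms at each level) are assumptions made for illustration, not the author's procedure. For 12 partial products the two functions give 10 and 5 levels of 3-2 adders, matching the counts in the text.

```python
def linear_depth(n: int) -> int:
    """3-2 adder levels when terms are folded in one at a time (n >= 3)."""
    # The first adder consumes three terms; each later adder consumes one more.
    return n - 2


def tree_depth(n: int) -> int:
    """3-2 adder levels when every available triple is reduced in parallel."""
    levels = 0
    while n > 2:
        n -= n // 3      # each group of three terms becomes two terms
        levels += 1
    return levels


if __name__ == "__main__":
    for terms in (12, 13, 24):
        print(terms, linear_depth(terms), tree_depth(terms))
```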
A more desirable implementation of the "V-tree" would reduce the partial
products by a factor of two for each row of adders. This can be accomplished
with the use of 4-2 adders, also known as 4-2 compressors [7], [16]. These
adders are composed of two 3-2 adders connected so that the carry-in does not
affect the carry-out of the cell. Figure 21 shows how that is done. The use of
this summing technique allows the reduction of the partial products by two for
each stage, with a critical path for all inputs-to-outputs of only two 3-2
adders. Figure 22 shows how 4-2 adders can be used to reduce 12 partial products
to two. Even though the structure is more regular, the number of 3-2 adders in
the critical path is still five. The advantage of this structure over the
standard "V-tree" is that there are fewer adder rows for layout routing. The 3-2
adders in the 4-2 adder cell can be optimized for layout. Thus the whole
structure will consume less silicon area. Also, reduced routing can result in
less power and higher speed.
FIGURE 21. 4-2 Adders shown with the carry propagation.
FIGURE 22. Use of 4-2 and 3-2 Adders to reduce 12 partial products to 2.
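The key property of the 4-2 adder, that the carry-in never reaches the carry-out of the same cell, can be illustrated with a behavioral sketch. The particular wiring below (inputs a, b, c into the first 3-2 adder; its sum, d and the carry-in into the second) is one common arrangement and is an assumption here; the names `full_adder` and `compressor_4_2` are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Behavioral 3-2 adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(a, b, c, d, cin):
    """4-2 adder built from two 3-2 adders; returns (sum, carry, cout)."""
    s1, cout = full_adder(a, b, c)       # cout depends only on a, b, c
    s, carry = full_adder(s1, d, cin)    # cin enters only the second adder
    return s, carry, cout


for a, b, c, d in product((0, 1), repeat=4):
    results = [compressor_4_2(a, b, c, d, cin) for cin in (0, 1)]
    # The carry-out is identical for both carry-in values.
    assert results[0][2] == results[1][2]
    for cin, (s, carry, cout) in enumerate(results):
        # Sum has weight 1; carry and cout both have weight 2.
        assert a + b + c + d + cin == s + 2 * (carry + cout)
```

Because the carry-in only passes through the second 3-2 adder, a row of these cells has a critical path of two 3-2 adders regardless of the word width.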
The Wallace Tree [2] partial product summing method is the most efficient means
by which one can add a group of numbers. The basic unit is a 3-2 adder, grouped
in such a way as to use the fewest 3-2 adders and minimize the critical path
throughout the addition. Figure 23 shows a seven bit Wallace tree implementation
with only four 3-2 adders and a critical path of three 3-2 adders. The
configuration shown in figure 23 has also been called a 7-3 counter circuit.
FIGURE 23. Wallace Tree implementation for adding 7 bits.
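A 7-3 counter of the kind described can be sketched behaviorally with four 3-2 adders arranged in three levels. The specific interconnection below is one arrangement that meets the stated adder count and critical path; it is an assumption and may differ from the wiring in figure 23. The names `full_adder` and `counter_7_3` are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Behavioral 3-2 adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def counter_7_3(bits):
    """Count the ones among seven input bits; returns (bit0, bit1, bit2)."""
    a, b, c, d, e, f, g = bits
    s1, c1 = full_adder(a, b, c)           # level 1
    s2, c2 = full_adder(d, e, f)           # level 1
    out0, c3 = full_adder(s1, s2, g)       # level 2: weight-1 result
    out1, out2 = full_adder(c1, c2, c3)    # level 3: weight-2 and weight-4 results
    return out0, out1, out2


for bits in product((0, 1), repeat=7):
    out0, out1, out2 = counter_7_3(bits)
    assert sum(bits) == out0 + 2 * out1 + 4 * out2
```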
The design proposed in this thesis tries to produce a partial product adder
structure which has the small layout structure of the "L-tree" but the speed
advantage of the "V-tree". The 4-2 adder/compressor structure can be taken to
its natural conclusion with an adder cell designed to add any number of partial
products. For this design, 13 partial products need to be added to form two
terms which, when added to an accumulator, will produce a multiply/accumulate
operation. The minimum number of 3-2 adders in the critical path needed to
reduce the 13 partial products to two would be five. The total number of 3-2
adders in each 13-2 compressor cell would be 11: four 3-2 adders in the first
stage, three 3-2 adders in the second stage, two 3-2 adders in the third stage,
and one 4-2 adder in the final stage. Figures 24 and 25 show how the 13 partial
products are added together to minimize the critical path. Figure 28 is the
circuit schematic used in the design of the multiplier array.

1st stage (3-2 adders):    a0, a1, a2       ->  t1, t0
                           a3, a4, a5       ->  t3, t2
                           a6, a7, a8       ->  t5, t4
                           a9, a10, a11     ->  t7, t6
2nd stage (3-2 adders):    t0, t2, t4       ->  u1, u0
                           T1, T3, T5       ->  u3, u2
                           a12, t6, T7      ->  u5, u4
3rd stage (3-2 adders):    u0, u2, u4       ->  r1, r0
                           U1, U3, U5       ->  r3, r2
Final stage (4-2 adder):   r0, r1, R2, R3   ->  l1, l0
a = initial sum bits.
t = sums of 1st sum stage.
T = carrys of 1st sum stage, from the joining block.
u = sums of 2nd sum stage.
U = carrys of 2nd sum stage, from the joining block.
r = sums of 3rd sum stage.
R = carrys of 3rd sum stage, from the joining block.
l = final sum and carry.
FIGURE 24. Textual representation of how to reduce 13 partial products to 2.
FIGURE 25. Block diagram of the reduction of 13 partial products to 2.
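The four-stage 13-to-2 reduction can be sketched at word level, where each 3-2 adder row becomes a carry-save operation on whole integers and the shift of the carry word plays the role of the carry passed in from the joining block. This is a behavioral illustration under those assumptions, not the author's circuit; the names `csa` and `compress_13_2` are hypothetical, and the 4-2 adder is modelled as two carry-save steps, giving 11 in total.

```python
def csa(x, y, z):
    """Word-level 3-2 (carry-save) adder on non-negative integers."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1   # carries move one column left
    return s, c


def compress_13_2(a):
    """Reduce 13 integers to 2 using 11 carry-save adders in four stages."""
    if len(a) != 13:
        raise ValueError("expected 13 terms")
    # Stage 1: four 3-2 adders on a0..a11; a12 passes through.
    stage1 = [csa(a[i], a[i + 1], a[i + 2]) for i in (0, 3, 6, 9)]
    sums = [s for s, _ in stage1]
    carries = [c for _, c in stage1]
    # Stage 2: three 3-2 adders.
    u_a = csa(sums[0], sums[1], sums[2])
    u_b = csa(carries[0], carries[1], carries[2])
    u_c = csa(a[12], sums[3], carries[3])
    # Stage 3: two 3-2 adders (sums together, carries together).
    r0, r1 = csa(u_a[0], u_b[0], u_c[0])
    r2, r3 = csa(u_a[1], u_b[1], u_c[1])
    # Stage 4: one 4-2 adder, modelled as two chained 3-2 adders.
    s, c = csa(r0, r1, r2)
    return csa(s, c, r3)


terms = list(range(1, 14))
x, y = compress_13_2(terms)
assert x + y == sum(terms)   # the two outputs still add up to the total
```

The longest path passes through one adder in each of the first three stages and two in the final 4-2 adder, which is the five-adder critical path quoted above.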
4-2 adders would be more useful in the reduction of a non-Booth recoded adder
tree with 24 terms, as shown in figures 26 and 27. The minimum number of 3-2
adders in the critical path for the reduction of 24 terms is seven. The use of
4-2 adders achieves this minimum of seven 3-2 adders in the critical path and
allows an efficient layout. The number of 3-2 adders for the 24 to 2 compressor
is twenty-two, double the size of the 13-2 compressor, with only two more 3-2
adders in the critical path.
1st stage (4-2 adders):    a0, a1, a2, a3       ->  t1, t0
                           a4, a5, a6, a7       ->  t3, t2
                           a8, a9, a10, a11     ->  t5, t4
                           a12, a13, a14, a15   ->  t7, t6
                           a16, a17, a18, a19   ->  t9, t8
                           a20, a21, a22, a23   ->  t11, t10
2nd stage (3-2 adders):    t0, t2, t4           ->  u1, u0
                           t6, t8, t10          ->  u3, u2
                           T1, T3, T5           ->  u5, u4
                           T7, T9, T11          ->  u7, u6
3rd stage (4-2 adders):    u0, u2, u4, u6       ->  r1, r0
                           U1, U3, U5, U7       ->  r3, r2
Final stage (4-2 adder):   r0, r1, R2, R3       ->  l1, l0
a = initial sum bits.
t = sums of 1st sum stage.
T = carrys of 1st sum stage, from the joining block.
u = sums of 2nd sum stage.
U = carrys of 2nd sum stage, from the joining block.
r = sums of 3rd sum stage.
R = carrys of 3rd sum stage, from the joining block.
l = final sum and carry.
FIGURE 26. Textual representation of how to reduce 24 partial products to 2.
FIGURE 27. Block Diagram of how to reduce 24 partial products to 2.
4.4 Partial Product Adder Tree Simulation
Table 10 describes the merits of the 13-2 compressor partial product adder
implementation:

Process         0.9um       0.9um        0.9um       0.6um
Processing      Fast (nS)   Median (nS)  Slow (nS)   Slow (nS)
Power Supply    5.5 Volts   5.0 Volts    4.5 Volts   3.0 Volts
Version 1       2.2         3.5          7.0         9.3
TABLE 10. Propagation delay of any input to the output of the 13 to 2 compressor

Figure 29 shows the worst case path simulation of the 13-2 adder. The data in
table 10 show the effect of low voltage on the speed of transmission gate
structures. More transistor size optimization should decrease the delay at the
lower voltage operating points.
FIGURE 28. Circuit Schematic for the 13-2 adder/compressor using 3-2 adders.
FIGURE 29. Worst case path simulation of the 13-2 adder
(ADVICE plot, 0 deg C, fast 0.9um process; traces VB0, VLPP00 and VLPP10 versus time in nS).
4.5 Final 74 Bit Fast Adder Design
There are five popular types of adders: ripple, Manchester, carry-skip,
carry-lookahead, and carry-select adders. Ripple adders allow the carry to pass
from bit to bit until the sum is completed. The Manchester adder is similar to
the ripple adder except that the carry path is heavily optimized. The carry-skip
adder builds on the ideas of the Manchester adder to speed up the carry
propagation of large adders by grouping the bits and propagating the carry from
the group. The carry-lookahead adder has a separate circuit for determining the
carry and another to do the sum, as determined from expanding the basic summing
equations. The carry-select adder works on small groups of bits for which the
final carry into the group has not yet been determined, and pre-determines the
final result both with and without the carry from the previous stage.
4.5.1 Ripple Adder
The ripple adder is the slowest and simplest form of adding two binary numbers
and producing a single binary result.
FIGURE 30. Ripple adder configuration.
A 74 bit ripple adder configured as in figure 30 with the full adder circuits
presented in this thesis would take approximately 110 nS for the carry to propagate.
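The ripple behavior, and a rough cross-check of the delay figure, can be sketched as follows. The per-bit delay of 1.5 nS is an assumption: it is the Version 3 carry-in to carry-out value from Table 9 at the 0.9um slow corner, and the text does not state which corner the 110 nS estimate uses. The names `ripple_add` and `CARRY_DELAY_NS` are hypothetical.

```python
def ripple_add(a: int, b: int, width: int = 74, cin: int = 0):
    """Bit-serial ripple-carry addition; returns (sum, carry_out)."""
    carry = cin
    total = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        total |= (x ^ y ^ carry) << i                 # sum bit of this stage
        carry = (x & y) | (x & carry) | (y & carry)   # carry into next stage
    return total, carry


# Assumed per-bit carry delay (Table 9, Version 3, 0.9um slow, 4.5 volts).
CARRY_DELAY_NS = 1.5
estimated_delay_ns = 74 * CARRY_DELAY_NS   # carry crosses all 74 stages
```

Under that assumption the estimate comes to 111 nS, which is in line with the approximate figure quoted above.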
4.5.2 Manchester Adder
This adder is similar to a ripple carry adder in that the carry ripples from bit
to bit; however, the carry path is optimized in a particular way. The sum of
each bit will either generate a carry (A=1, B=1), stop a carry (A=0, B=0), or
allow a carry to propagate (A=1, B=0 or A=0, B=1). A carry cell is created in
each bit. This cell performs this function in an efficient manner to allow the
carry to be determined as fast as possible. These cells usually consist of a
transmission gate to allow the carry to propagate or not propagate and, if it is
not propagating, then circuitry to drive the carry output to a one or zero. The
individual bit will then use the carry into the block to finish the full add.
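The generate / stop / propagate decision made by each carry cell can be sketched behaviorally. This models only the logic of the carry chain, not the transmission gates themselves; the function names are hypothetical.

```python
def manchester_carries(a: int, b: int, width: int, cin: int = 0):
    """Return [c0, c1, ..., c_width]: the carry into each bit and the carry out."""
    carries = [cin]
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:                  # generate: A=1, B=1
            carry = 1
        elif x ^ y:                # propagate: exactly one input is 1
            carry = carries[-1]
        else:                      # stop: A=0, B=0
            carry = 0
        carries.append(carry)
    return carries


def manchester_add(a: int, b: int, width: int, cin: int = 0):
    """Finish the add: each sum bit is (A xor B) xor carry-in of that bit."""
    carries = manchester_carries(a, b, width, cin)
    total = 0
    for i in range(width):
        p = ((a >> i) ^ (b >> i)) & 1
        total |= (p ^ carries[i]) << i
    return total, carries[width]
```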
FIGURE 31. Manchester Adder Configuration.
The full adders previously discussed act like Manchester adders when the carry
output is connected to the C input of the next stage. However, the critical
carry path through those full adder cells is two inverters and a transmission
gate. An optimized implementation of a Manchester adder would have only one
transmission gate per full adder cell in the critical path and an occasional
buffer in the carry chain to reestablish the drive and voltage level. A 74-bit
adder of the type shown in figure 31 would take approximately 52 nanoseconds to
complete its final add.
4.5.3 Carry-Skip Adder
The carry-skip adder is similar in concept to the Manchester adder. The idea
behind this implementation is to reduce the path of the final carry and equalize
the generation of the carry/sum for as many bits as possible. The way this is
accomplished is to group the bits into small sections, add each small section,
and produce a carry out or generate bit for that group and a propagate bit to
allow a carry from a previous stage to propagate to the following stages. Those
group carrys and propagate terms are then collected and used to determine the
final carrys of all the groups. When these final carrys arrive at each group,
they are used to determine the final sum of each group.
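The grouping idea can be sketched behaviorally: each group produces a group generate and a group propagate term, and only those two terms sit in the group-to-group carry chain. The function name, the default width of 74 and the default group size of 5 are illustrative choices; the sketch says nothing about delay.

```python
def carry_skip_add(a: int, b: int, width: int = 74, group: int = 5):
    """Behavioral carry-skip addition; returns (sum, carry_out)."""
    carry = 0
    total = 0
    for lo in range(0, width, group):
        n = min(group, width - lo)           # the last group may be short
        m = (1 << n) - 1
        x = (a >> lo) & m
        y = (b >> lo) & m
        g = (x + y) >> n                     # group generate (carry-in = 0)
        p = int((x ^ y) == m)                # group propagate: all bits propagate
        total |= ((x + y + carry) & m) << lo # final sum uses the arriving carry
        carry = g | (p & carry)              # carry either generated or skipped
    return total, carry
```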
FIGURE 32. Simple t-stage representation of a carry-skip adder.
For a 74-bit adder with a group size of 5 and of the configuration shown in
figure 32, the propagation delay would be approximately 25 nanoseconds.
4.5.4 Carry Lookahead Adder
The carry-lookahead adder takes the basic full adder equations and expands the
equation to include as many bits as required. The advantage of this is that a
straightforward equation can be developed for the carry bits of any word length.
This equation can then be implemented in logic and optimized for a given
application. Also, adder generators can use these equations to easily generate a
fast adder. The carry bit has two basic terms that are called the propagate and
generate terms. These terms are simply AND and OR functions that can be easily
grouped to form the carry bit of the desired location. The equations below show
what the carry lookahead equation is for the 5th bit.
$$G_n = a_n \cdot b_n$$
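As a sketch of the kind of expansion described above, the standard textbook carry-lookahead recurrence can be written out as follows. The OR form of the propagate term (suggested by the "AND and OR functions" wording), the zero-based bit indexing, and the symbol $C_0$ for the carry into the least significant bit are assumptions here, not the author's exact notation.

```latex
\begin{aligned}
G_n &= a_n \cdot b_n, \qquad P_n = a_n + b_n, \qquad C_{n+1} = G_n + P_n \cdot C_n \\
C_5 &= G_4 + P_4 G_3 + P_4 P_3 G_2 + P_4 P_3 P_2 G_1
       + P_4 P_3 P_2 P_1 G_0 + P_4 P_3 P_2 P_1 P_0 C_0
\end{aligned}
```

Each product term corresponds to a carry generated at one bit position and propagated through every higher position up to the fifth bit, which is why the expression grows quickly with word length.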
Since the equations for large adders can become unmanageable, a grouping method
similar to the carry-skip method is employed to create very fast and efficient
adders. Usually this grouping is done in a tree manner, with the size usually
set at four bits for the initial stage and the carry reduction after that by a
factor of two, so an add of 32 bits would be done in 4 stages. A carry-lookahead
of 74 bits would be accomplished in 6 stages. Figure 33 shows a typical
carry-lookahead or even a carry-skip tree implementation. Out of each block
there is a block propagate (P) and block generate (G) term. For a worst case 74
bit add, the carry propagate signal would be generated in the least significant
block and propagate through only five block propagate/generate gates.
FIGURE 33. Block diagram of a binary reduction carry-lookahead adder tree.
4.5.5 Carry Select Adder
The carry select adder is a widely used implementation for large adder circuits.
Figure 34 shows that the A and B terms are added twice. One is a normal add; the
other is an add with an increment. This is done so that when the carry from the
previous stage is determined, this stage only has to flip a multiplexer to
determine the sum and the carry into the next stage. The initial sum is usually
done with a ripple or carry-lookahead design, and the bits are divided into
sections to keep the gate count low and increase the speed.
FIGURE 34. Representation of one stage of a carry-select adder (two N-bit sums
of a[n:0] and b[n:0], one with carry-in 1 and one with carry-in 0, feed an N-bit
MUX selected by the carry-in from the previous stage).
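One stage of a carry-select adder can be sketched behaviorally as two precomputed results and a multiplexer. The function name and argument order are hypothetical; `n` is the number of bits in the stage.

```python
def carry_select_stage(a: int, b: int, cin: int, n: int):
    """One n-bit carry-select stage; returns (sum, carry_out)."""
    mask = (1 << n) - 1
    sum0 = a + b          # computed in advance assuming carry-in = 0
    sum1 = a + b + 1      # computed in advance assuming carry-in = 1
    chosen = sum1 if cin else sum0   # the multiplexer, flipped by the real carry
    return chosen & mask, chosen >> n
```

Both candidate results are ready before the carry-in arrives, so the only delay after the carry is known is that of the multiplexer.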
4.5.6 Fast Adder for this Design
For large bit length designs, the carry-skip and carry-lookahead designs offer
the fastest addition time. The carry-skip has an advantage in that it can be
implemented in fewer transistors, so this design utilizes the carry-skip
technique with some modifications. First a group size needs to be determined.
This design chose five bits instead of the usual four because it produces 15
groups across the 74 bit adder. A group size of four would produce 19 groups.
The group carry (propagate and generate terms) is produced by a carry-lookahead
implementation because this should offer the least propagation delay in
determining the carry out of the five bit group. The group carry of each section
produces a generate/propagate signal to either allow the carry from the previous
section to pass through its section, stop the carry from the previous block from
passing to the next block, or generate its own carry irrespective of the carry
from the previous group. This is the carry-skip section of the implementation,
which is similar to the Manchester carry method of propagating the carrys, but
only the group carrys are in the carry chain. The other way to determine the
final carry would have been to bring all of the group generate and propagate
terms into a central block and determine the final carrys there. This was not
chosen because it seemed to offer no speed advantage and had a larger transistor
count. A factor-of-two tree reduction, as shown in figure 33, could have been
used to decrease the delay even further; however, the saving in delay did not
seem to warrant the increase in complexity.
Within the groups, a method was needed to produce the initial sum. Since the
speed of these terms is not critical, the bits within the group are added with a
non-optimized carry-ripple adder. These bits are not only needed later to
determine the final sum, but they are also used in the leading 0/1 detector to
predict the leading 0 and 1 within the groups.
At the end, the final sum term needs to be determined from the initial sum and
the group carry. Since this final sum term is critical, a carry-select adder is
used. The initial sum is incremented to produce the final terms for the
carry-select adder. When the global carry arrives in each block, it selects
which term produces the correct final sum. The carry-select, however, does not
produce the carry out of the block. The carry-in will either pass through an
already set-up transmission gate, or the carry will already have been generated
and passed down the line. To maximize speed, the carry signal is buffered every
3 or 4 stages. This is done because the signal level deteriorates as it passes
through transmission gates. After a certain number of gates, the rise/fall time
becomes slower than the propagation delay of two inverters. This is the point at
which the signal should be buffered. Figure 35 shows the circuit schematic for
one five-bit section of this fast adder implementation.
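The combination described above (an initial sum per group, an incremented copy, a group generate/propagate pair in the carry chain, and a select by the arriving carry) can be sketched behaviorally. This is a simplified model under stated assumptions: it performs a full add across all 74 bits in 5-bit groups, it does not model buffering or delay, and the names `group_count` and `fast_add` are hypothetical.

```python
def group_count(width: int, size: int) -> int:
    """Number of groups needed to cover `width` bits (ceiling division)."""
    return -(-width // size)


def fast_add(a: int, b: int, width: int = 74, group: int = 5):
    """Carry-skip chain with carry-select sums; returns (sum, carry_out)."""
    carry = 0
    total = 0
    for lo in range(0, width, group):
        n = min(group, width - lo)
        m = (1 << n) - 1
        x = (a >> lo) & m
        y = (b >> lo) & m
        initial = x + y                       # initial sum, no group carry-in
        incremented = initial + 1             # alternative for carry-in = 1
        generate = initial >> n               # group makes its own carry
        propagate = int((initial & m) == m)   # all ones: a carry-in passes through
        chosen = incremented if carry else initial   # carry-select multiplexer
        total |= (chosen & m) << lo
        carry = generate | (propagate & carry)       # only group terms in the chain
    return total, carry
```

With these helpers, `group_count(74, 5)` and `group_count(74, 4)` reproduce the 15 and 19 groups mentioned in the text.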
The actual add of two terms is only necessary through 52 of the 74 bits. The top
22 bits only need to be incremented if a carry propagates into them, so those
bits are determined with two ten-bit and one two-bit carry-select incrementers.
The design is similar to the five bit adder cells with the ripple-carry adder
and the carry-lookahead circuits stripped out. If these upper bits needed to use
a full addition, there would be no increase in the critical path delay because
there is more than enough time to do a full carry-select add. The entire 74-bit
adder is shown in figure 36.
4.6 Final 74 Bit Fast Adder Simulation
The simulation is the worst case delay for the terms (A) all ones and (B) all
zeros except for the least significant bit, which is a one. Table 11 shows the
worst case delay of the fast adder with this bit pattern for the A and B terms.

Process               0.9um       0.9um        0.9um       0.6um
Processing            Fast (nS)   Median (nS)  Slow (nS)   Slow (nS)
Power Supply          5.5 Volts   5.0 Volts    4.5 Volts   3.0 Volts
74-bit Carry Delay    7.6         11.4         17.1        27.1
Sum Bit 5 Delay       4.2         5.7          7.6         10.9
Sum Bit 73 Delay      7.7         11.5         17.1        26.2
TABLE 11. Worst case propagation delay through the 74 bit adder

The simulation waveforms for the fast adder are shown in figure 37. From the
data in table 11, the effects of low voltage on the transmission gates are
obvious. An optimized five bit carry lookahead could be designed to decrease the
delay of the initial block carry term. Also, the use of a tree reduction
technique would decrease the transistor/routing load on the carry chain, thus
decreasing its delay.
FIGURE 35. Circuit Schematic for one 5-bit adder cell in the fast adder.
[FIGURE 36: schematic of the entire 74-bit adder; caption not legible in the source.]
FIGURE 37. Simulations of the worst case path in the fast adder
(ADVICE plot, 25 deg C, nominal 0.9um process; traces VA0, VC73, VFSUM5 and VFSUM73 versus time in nS).
4.7 Leading Zero/One Detector Design
The objective of this circuit is to find both the leading one and leading zero
bit of the final sum of the 74-bit adder (FSUMA), as the sum is simultaneously
being determined. A leading zero determination is needed if bit 73 of FSUMA is a
one (negative result); otherwise the result is positive and a leading one
determination is needed. The output of the leading 0/1 circuit is all zeros
except for the bit determined to be the leading one or zero. This output then
feeds a "rom-like" array to create the necessary signals to control the
normalizing left shifter.
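The input/output behavior described above can be sketched as a simple sequential search. This models only the result (a one-hot word marking the leading one or leading zero), not the parallel, group-based prediction circuit of this design; the function name `leading_detect` is hypothetical.

```python
WIDTH = 74


def leading_detect(fsuma: int, width: int = WIDTH) -> int:
    """Return a one-hot word marking the leading 1 (positive) or leading 0 (negative)."""
    sign = (fsuma >> (width - 1)) & 1
    target = 0 if sign else 1      # negative result: look for the leading zero
    for i in range(width - 1, -1, -1):
        if ((fsuma >> i) & 1) == target:
            return 1 << i          # all zeros except the detected position
    return 0                       # no such bit (all zeros or all ones)
```

For example, a positive word whose highest set bit is bit 13 yields an output with only bit 13 set, and a negative word with its highest clear bit at position 3 yields an output with only bit 3 set.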
To create a fast leading 0/1 determination, the adder and leading 0/1 circuit
should work together. As the addition proceeds from the least significant bit to
the most significant bit, the leading 0/1 circuit should analyze the bits in
that order to pre-determine the leading 0 or 1. When the add is complete, the
leading 0/1 circuit needs to proceed left to right quickly through the
pre-determined bit groups and produce a result. This design uses a unique
prediction method to isolate the groups containing the leading zero and one. As
the group carry is determined for each group, transmission gates are set up in a
similar carry-skip fashion to allow a group-clear signal to propagate from the
most significant group to the least significant group, so that when the most
significant carry bit is determined, the group-clear transmission gates are
already set up and the signal has an unobstructed path.
Since the 74-bit final adder works on five bits at a time, the leading 0/1
detector works on five bits at a time. However, a bypass clear/flow through
circuit is then employed in groups of 10 bits to zero the bits to the right of
the leading 0/1. The grouping of ten instead of five offered less delay in the
propagation of the clear signal. The five initial sum bits from the adder, which
are determined without the group carry, and the group carry for each 5 bit group
are used to determine whether the group contains a leading one and/or zero.
First the initial sum bits are used to predict whether the group contains a
leading zero or one. If a leading zero or one cannot be determined from the
initial sum bits, then the predictor sets up transmission gates to quickly
determine the group status when the group carry arrives. The determination of
whether a group contains a leading one or zero only needs the initial sum bits
and the group carry. It does not need the final sum bits (FSUMA).
A group leading 1 is bypassed when all five bits are one and the carry is zero,
when any of the bits are zero but not all are zero, and when all the bits are
zero and the carry is one. A group leading 0 is bypassed when bits 1 through 4
are ones and bit 0 and the carry are zero, when any of bits 1 through 4 are
zero, or when bits 0 through 4 are one and the carry is a one. The 10 bit bypass
signal is then formed by "ORing" together the two 5 bit results. This bypass
clear/flow through signal will then clear all leading 0/1 bits to the right of
the most significant 1 or 0 regardless of any other signals. This local
determination of the leading zero and one from five bits to ten results in fewer
propagation stages for the final clear signal. The circuit implementation of
this predictor algorithm is shown in figure 37.
Within the 10 bit block, the lower five bits are cleared if the leading 0/1 is
found in the upper five bits. Then the individual 5-bit groups of the final sum
bits (FSUMA) are examined for the leading 0/1 and all other bits are zeroed.
When the sign bit of the final sum term (FSUMA73) is found, it is used to select
whether the leading 0/1 detect circuit should output the position of the leading
one or leading zero. Figure 38 is the top-level circuit implementation of the
74-bit leading 0/1 detector. Figure 39 is the circuit schematic of one of the
ten-bit blocks that make up the leading 0/1 detector.
4.8 Leading Zero/One Detector Simulation
The delay through the leading 0/1 detector is in the critical path of the multiply/
accumulation operation. Proper delay simulation of this block needs to take into
account the coupling of this circuit with the 74 bit adder. The initial sum bits need
to be applied first and all at once. The carry bits and final sum bits need to be
applied staggered to mimic the operation of the74-bit adder.
0.9 urn, 0.9 urn, 0.9 urn, 0.6 urn,
Processing Fast (nS) Median (oS) Slow (oS) Slow (oS)
Power Supply 5.5 Volts 5.0 Volts 4.5 Volts 3.0 Volts
Delay C69 change to Bit 0=1 3.6 4.9 9.4 11.6
Delay AO change to bit 0 change 4.1 4.9 7.1 7.8
TABLE 12. Worst case propagation delay through the Leading 0/1 Detector
The worst case path is from the final determination of the most significant carry bit
and the sign bit, FSUMA73, to propagating the clear signal through the entire bit
length and producing a one in the least significant bit location. This would result in
a normalization left shift of 72 bits. This worst case delay is shown in table 12 and
figure 41.
FIGURE 38. Top level circuit schematic of the leading 0/1 detector.
FIGURE 39. Circuit schematic for one ten-bit section of the leading 0/1 detector.
FIGURE 40. Circuit schematic of the five-bit leading 0/1 predictor logic.
FIGURE 41. Simulations of the worst case path for the leading 0/1 detector and the shifter
(ADVICE simulation run on 04/27/94; 110.0 deg C, Low Gain 0.6um process file 11-23-93.
Two panels: VC69 with VFLOUTO, and VAO with VFLOUTO, plotted against time (x 1E-9).)
4.9 Left Shifter/Right Shifter Design
This floating point multiplier/accumulator design requires the use of a 74-bit left
shifter and a 98-bit right shifter.
Both the 73-bit left shifter and the 98-bit right shifter are divided into two stages. The
first stage is a one-of-ten selection. The second stage is a one-of-eight selection.
The first stage shifts left 0, 8, 16, 24, 32, 40, 48, 56, 64, or 72 places. The second
stage shifts 0 through 7 places. The actual design of the shifters is straightforward.
Both shifters share the same bit path design. The main differences are in the signal
routing and the number of bits. The most important features of an efficient large bit
width shifter are speed and a dense layout.
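
The two-stage decomposition can be illustrated with a small behavioral sketch in
Python. The coarse stage applies a multiple of eight and the fine stage applies the
remaining 0 through 7 places, so the pair covers every shift from 0 to 79. The
function name and the default width are assumptions made for this sketch, and the
model ignores the multiplexer circuitry itself.

```python
def two_stage_left_shift(value: int, shift: int, width: int = 74) -> int:
    """Left shift modeled as a one-of-ten coarse stage followed by a
    one-of-eight fine stage, truncated to `width` bits."""
    if not 0 <= shift <= 79:
        raise ValueError("shift must be in the range 0..79")
    mask = (1 << width) - 1
    coarse, fine = divmod(shift, 8)          # coarse in 0..9, fine in 0..7
    stage1 = (value << (8 * coarse)) & mask  # shifts of 0, 8, ..., 72
    stage2 = (stage1 << fine) & mask         # shifts of 0 through 7
    return stage2
```

For example, a shift of 29 places is performed as a coarse shift of 24 followed by a
fine shift of 5.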
The first stage consists of just a ten-to-one multiplexer. This requires 72 plus 10
metal lines in the Y-dimension to implement. The second stage, which is just an
8-to-1 multiplexer, requires only 8 plus 8 metal lines in the Y-dimension to
implement. It is important that the second stage have the lower routing capacitance
since there is no buffering between the first and second stages. The total number
of lines in the Y-dimension needed to implement both stages should be only 98 metal
lines. If the metal pitch is 2.5 microns, then the size of the shifter in the Y-dimension
is 245 microns, by the bit width in the X-dimension. This X-dimension size
could be substantial. For the first stage of the 98-bit shifter, there are 20 inputs
and one output. Assuming the output is routed efficiently, the width is 20 times
2.5 microns per cell times 98 cells. This results in an X-dimension size of 4.9 mm.
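
The routing-line and size arithmetic in the paragraph above can be checked with a
few lines of Python. The numbers are taken directly from the text; the function and
parameter names are hypothetical and exist only for this check.

```python
METAL_PITCH_UM = 2.5  # metal pitch assumed in the text, in microns


def shifter_height_um(first_stage_lines: int = 72 + 10,
                      second_stage_lines: int = 8 + 8,
                      pitch_um: float = METAL_PITCH_UM) -> float:
    """Height of the two multiplexer stages: total routing lines times pitch."""
    return (first_stage_lines + second_stage_lines) * pitch_um


def first_stage_width_mm(inputs_per_cell: int = 20,
                         cells: int = 98,
                         pitch_um: float = METAL_PITCH_UM) -> float:
    """Width of the first stage of the 98-bit shifter, converted to mm."""
    return inputs_per_cell * pitch_um * cells / 1000.0
```

With the default values, `shifter_height_um()` gives 245 microns for the 98 routing
lines and `first_stage_width_mm()` gives 4.9 mm, matching the figures quoted above.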
4.10 Left/Right Shifter Simulations
The shifter simulation includes not only the multiplexers, but also the shifter ROMs
that recode the leading 0/1 output into the control signals the shifter requires.
                         0.9 um,     0.9 um,       0.9 um,     0.6 um,
Processing               Fast (nS)   Median (nS)   Slow (nS)   Slow (nS)
Power Supply             5.5 Volts   5.0 Volts     4.5 Volts   3.0 Volts
Leading 0/1 output to    2.6         3.4           5.8         7.5
73 bit data shift delay
TABLE 13. Worst case propagation delay through the 74 bit shifter
The delay is determined from the output of the leading 0/1 detector through the
recoding ROM and through the two stage multiplexer. Table 13 and figure 42 show
the results of this simulation. Since parasitic capacitance and resistance play a
major role in determining the speed of the shifter, capacitance has been added to
the interconnect for simulation purposes.
FIGURE 42. Simulation of the worst case path for the 74 bit left shifter, including ROM controller
(ADVICE simulation run on 05/07/94; 110.0 deg C, Low Gain 0.6um process file 11-23-93.
Traces: VFLO and VFMUX073, plotted against time (x 1E-9).)
5.0 Conclusions
This thesis has shown a method for designing a single cycle 32-bit IEEE compli-
ant multiply/accumulator with the accumulation as part of the partial product tree
of the multiplier.
The critical path through the multiply/accumulator can be estimated from the
simulations performed in this thesis. Using the 0.9 um worst case delay simulations
as the basis for estimating the overall delay, the total delay is estimated at 46 nS
for a single precision IEEE floating point multiply/accumulate. The application of
the operands, Booth recoding, partial product tree and the full add of the product
and accumulator would be approximately 10 nS. The fast adder would be about
18 nS. The leading 0/1 detect would be 10 nS. The normalization would be about
6 nS. The latching of the data and the exponent processing would be about 2 nS.
This would give this processing unit the ability to execute approximately 40
MFLOPS (millions of floating point operations per second) in a 0.9 um technology.
The best case 0.9 um processing simulation shows a performance in excess of
100 MFLOPS. The low voltage 0.6 um High Density process shows a performance
possibility of about 28 MFLOPS.
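
The delay budget above can be tallied with a short Python check. The stage values
are the estimates quoted in the text, with the leading 0/1 detect taken as 10 nS,
the value that makes the stages sum to the stated 46 nS total. The conversion to
MFLOPS assumes that each multiply/accumulate counts as two floating point
operations (one multiply and one add); that counting convention is an assumption
of this sketch, chosen because it reproduces the quoted figure of roughly 40 MFLOPS.

```python
# Estimated stage delays in nS for the 0.9 um worst case, from the text.
STAGE_DELAYS_NS = {
    "operands, Booth recoding, partial product tree, product/accumulator add": 10,
    "fast adder": 18,
    "leading 0/1 detect": 10,
    "normalization": 6,
    "latching and exponent processing": 2,
}


def total_delay_ns(stages: dict) -> int:
    """Sum the stage delays along the critical path."""
    return sum(stages.values())


def mflops(cycle_ns: float, ops_per_cycle: int = 2) -> float:
    """Millions of floating point operations per second for one
    multiply/accumulate per cycle. ops_per_cycle=2 is an assumption."""
    return ops_per_cycle * 1000.0 / cycle_ns
```

Here `total_delay_ns(STAGE_DELAYS_NS)` is 46, and `mflops(46)` is a little over 43,
consistent with the approximate 40 MFLOPS quoted above.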
In the design and implementation of this multiply/accumulator, a trade-off was
made in the partial product tree to minimize area. The use of a modified Booth
recoder reduced the size of the partial product tree by half, but complicated the
sum of the product and accumulator. The sign extension required by the use of 2's
complement numbers caused a bit to be produced in bit 51. If a normal 24 term
product tree were used, this bit would not be produced and the final adder would
consist of only 48 bits of fast adder and 25 bits of incrementer. The final term would
be 73 bits instead of 74. Two's complement numbers in the partial product tree
required the use of a decrementer and other zeroing circuitry to alleviate the
undesired bit propagation. Since faster adders are possible, this decrementer and
zeroing logic would enter into the critical path. This would mean that the exponent
processing, accumulator alignment and preparation would be the critical path and
not the partial product reduction. Future studies could examine the trade-offs
between the use of a non-recoded multiplier and its impact on double/extended
precision operations with this design.
To speed up the throughput of most multiply/accumulate processing units, a
pipeline stage is usually added. One other minor problem with the design
presented in this thesis is that most signal processing algorithms need to multiply two
terms and sum them with the current accumulator. This design has no pipeline
stages because the intent was to demonstrate the feasibility of such a design. The
best place to put a pipeline stage in most multiplier/accumulators, from a
software view, would be after the product summation and before the add to the
accumulator. This allows the execution of multiple A*B+C instructions in a pipelined
fashion. From a circuit propagation delay perspective for this design, the optimal
place to put the pipeline latches is after the sum of the accumulator with the
product, but before the leading 0/1 detect. This, however, means that the
accumulator would not be ready for the addition to the next multiplication for one more
cycle. Thus A*B+C pipelining could not be done. To implement the best software
pipelining scheme, the latches would be placed before the accumulator is added
to the product. This configuration would result in a lopsided critical path.
The partial product add would be about 2-3 times faster than the critical path of
the product/accumulator addition. The critical path of this pipelining configuration
would be the summation, normalization, and rounding. Future work could
examine the merits of pipelining this design for higher throughput.
The main purpose of this study is to show the feasibility of performing the
accumulation of a multiply/accumulate inside the partial product tree without adversely
affecting the speed of the multiply operation. The effects of adding the accumulator
inside the partial product tree are 1) an increase of one full adder delay before the
final fast add, 2) an increase in the length of the final add from a critical path of
about 28 bits to a critical path of 74 bits, 3) a need for faster and more parallel
exponent processing, and 4) wider shifters.
The advantages that this design has over a typical two stage multiply/accumulator
are 1) only one rounding unit, 2) one fast adder unit, and 3) full precision product
term added to the accumulator before rounding.
One of the greatest advantages of this multiply/accumulate configuration is the
extent to which the accumulator alignment can be performed while the partial
product reduction is done. The shift calculation and alignment are thereby hidden. This
reduces the overall delay of the multiply/accumulate operation with respect to the
standard two cycle multiply/accumulate operation and, if properly pipelined,
would allow faster operation than the standard multiply/accumulators.
Other significant features of this design are the 22-transistor full adder and the
higher-order 13-2 compressor for the partial product tree. These circuits coupled
with new techniques in fast addition have greatly decreased the partial product
tree latency and have made array multipliers a standard unit on all modern data
processing devices.
6.0 References and Bibliography
[1] Booth, A. D., A signed binary multiplication technique., Quart. J. Mech. Appl.
Math., vol. 4, pp. 236-240, 1951.
[2] Wallace, C. S., A suggestion for fast multipliers., IEEE Trans. Electron.
Comput., vol. EC-13, pp. 14-17, Feb. 1964.
[3] IEEE Standard for Binary Floating-Point Arithmetic., ANSI/IEEE Standard 754-
1985, August 12, 1985.
[4] Koren, I., Computer Arithmetic Algorithms., Prentice Hall, Englewood Cliffs,
NJ, 1993, page 57.
[5] Hokenek, E., Montoye, R. K., Cook, P. W., Second-Generation RISC Floating
Point with Multiply-Add Fused., IEEE J. of Solid State Circuits, Vol. 25, No. 5,
April 1990, page 1207.
[6] Zhuang, N., Wu, H., A New Design of the CMOS Full Adder., IEEE J. of Solid
State Circuits, Vol. 27, No. 5, May 1992, page 843.
[7] Santoro, M. R., Horowitz, M. A., SPIM: A Pipelined 64x64-bit Iterative
Multiplier., IEEE J. of Solid State Circuits, Vol. 24, No. 2, April 1989, page 487.
[8] Hwang, Kai, Computer Arithmetic: Principles, Architecture, and Design., New
York: John Wiley & Sons, 1979.
[9] Cavanagh, J. J. F., Digital Computer Arithmetic: Design and Implementation.,
McGraw-Hill, New York, 1984.
[10] Sato, T., et al., A 8.5-ns 112-b Transmission Gate Adder with Conflict-Free
Bypass Circuit., IEEE J. of Solid State Circuits, Vol. 27, No. 4, April 1992,
page 657.
[11] Song, P. J., Giovanni, D. M., Circuit and Architecture Trade-offs for High
Speed Multiplication., IEEE J. of Solid State Circuits, Vol. 26, No. 9, Sept.
1991, page 1184.
[12] Cooper, A. R., Parallel architecture modified Booth multiplier., IEE
Proceedings, Vol. 135, Pt. G, No. 3, June 1988.
[13] Weste, N. H. E., Eshraghian, K., Principles of CMOS VLSI Design: A Systems
Perspective., Addison-Wesley, Reading, MA, 1988.
[14] Bechade, R., et al., A 32b 66MHz 1.8W Microprocessor., 1994 ISSCC Digest,
Session 12, paper TP 12.4, page 208.
[15] Pham, D., et al., A 3.0W 75SPECint92 85SPECfp92 Superscalar RISC
Microprocessor., 1994 ISSCC Digest, Session 12, paper TP 12.6, page 212.
[16] Anderson, S. F., et al., The IBM System/360 Model 91: Floating-point
execution unit., IBM J., vol. 11, no. 1, Jan. 1967, page 34.
[17] Shen, D. T., Weinberger, A., 4-2 carry-save adder implementation using send
circuits., IBM Tech. Disc. Bull., vol. 20, page 3594, Feb. 1978.
[18] Heikes, C., A 4.5mm2 Multiplier Array for a 200MFLOP Pipelined
Coprocessor., 1994 ISSCC Digest, Session 18, paper FA 18.1, page 290.
[19] Mori, J., et al., A 10-ns 54x54b Parallel Structured Full Array Multiplier with
0.5um CMOS Technology., IEEE J. of Solid State Circuits, Vol. 26, No. 4, April
1991, page 600.
[20] Hwang, I. S., Ultrafast Compact 32-bit CMOS Adders in Multiple-Output
Domino Logic., IEEE J. of Solid State Circuits, Vol. 24, No. 2, April 1989, page
358.
[21] Lu, F., Samueli, H., A 200-MHz CMOS Pipelined Multiplier-Accumulator Using
a Quasi-Domino Dynamic Full-Adder Cell Design., IEEE J. of Solid State
Circuits, Vol. 28, No. 2, Feb. 1993, page 123.
[22] Ide, N., et al., A 320-MFLOPS CMOS Floating-Point Processing Unit for
Superscalar Processors., IEEE J. of Solid State Circuits, Vol. 28, No. 3, March
1993, page 352.
[23] Chan, P. K., Schlag, M. D. F., Analysis and design of CMOS Manchester
adders with variable carry-skip., IEEE Trans. Comput., vol. 39, no. 8, page 983,
August 1990.
[24] Suzuki, M., et al., A 1.5-ns 32-b CMOS ALU in Double Pass-Transistor
Logic., IEEE J. of Solid State Circuits, Vol. 28, No. 11, Nov. 1993, page 1145.
[25] Kwentus, A. Y., Hung, H. T., Willson Jr., A. N., An Architecture for High-
Performance/Small-Area Multipliers for Use in Digital Filtering Applications.,
IEEE J. of Solid State Circuits, Vol. 29, No. 2, Feb. 1994, page 117.
[26] Fujii, H., et al., A Floating-Point Cell Library and a 100-MFLOPs Image
Signal Processor., IEEE J. of Solid State Circuits, Vol. 27, No. 7, July 1992,
page 1080.
[27] Goto, G., et al., A 54x54 Regularly Structured Tree Multiplier., IEEE J. of
Solid State Circuits, Vol. 27, No. 9, Sept. 1992, page 1229.
[28] Srinivas, H. R., Keshab, K. P., A Fast VLSI Adder Architecture., IEEE J. of
Solid State Circuits, Vol. 27, No. 5, May 1992, page 761.
[29] Yano, K., et al., A 3.8-ns CMOS 16x16-b Multiplier Using Complementary
Pass-Transistor Logic., IEEE J. of Solid State Circuits, Vol. 25, No. 2, April
1990, page 388.
[30] Nagamatsu, M., et al., A 15-ns 32x32-b CMOS Multiplier with an Improved
Parallel Structure., IEEE J. of Solid State Circuits, Vol. 25, No. 2, April 1990,
page 494.
[31] Singh, H. P., et al., A 6.5-ns GaAs 20x20-b Parallel Multiplier with 67-ps Gate
Delay., IEEE J. of Solid State Circuits, Vol. 25, No. 5, October 1990, page
1226.
[32] Lotz, J., et al., A CMOS RISC CPU Designed for Sustained High
Performance on Large Applications., IEEE J. of Solid State Circuits, Vol. 25,
No. 5, October 1990, page 1190.
[33] Anderson, D., et al., The 68040 32-b Monolithic Processor., IEEE J. of Solid
State Circuits, Vol. 25, No. 5, April 1990, page 1178.
7.0 Appendix
This thesis utilized a schematic capture package developed internally at AT&T
known as SCHEMA. Since this schematic package is not generally available to
Lehigh University, a complete set of circuit-level schematics has been given to
the Electrical and Computer Engineering Department of Lehigh University.
In the formation of this thesis, a "C-level" model of the floating point
multiply/accumulator was created. An IBM-PC compatible 3.5" disk containing this
program has also been given to the Electrical and Computer Engineering
Department of Lehigh University.
An input file format for the "C-model" was developed to help test the
multiply/accumulate functionality against the IEEE format of the PC. The format consists of:
01000001100110111110111111100011 00    <- Initial accumulator value, then initial sum zero/inf flag values
01000000100000000000000000000000       <- X term for multiplication
00110001000000000000000000000000       <- Y term for multiplication
102                                    <- Instruction
01000000100000000000000000000000       <- Next X term for multiplication
01000001000000000000000000000000       <- Next Y term for multiplication
002
01000000100000000000000000000000
01000001000000000000000000000000
The initial accumulator term is inserted into the accumulator before simulation
begins. The zero and infinity flags also need to be initialized at this time. The next
three lines consist of the X operand, the Y operand, and then the desired
instruction. The instruction consists of three terms. The first term is the reset state.
The second term is the multiply/accumulate instruction. The third term is the
rounding mode desired.
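
As an illustration of this file layout, the following Python sketch reads such a file
and decodes the 32-bit patterns as IEEE single precision values. It is not the
"C-model" itself, which is not reproduced in this thesis text. The function names
and the dictionary keys are hypothetical, the instruction digits are only split into
their three terms without interpreting their values, and any trailing operand pair
that lacks an instruction line is ignored.

```python
import struct


def bits_to_float(bits: str) -> float:
    """Decode a 32-character string of 0s and 1s as an IEEE single."""
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected 32 binary digits")
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]


def parse_model_input(text: str) -> dict:
    """Parse the accumulator header line and the X / Y / instruction triples."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    acc_bits, flags = rows[0][0], rows[0][1]
    body = [row[0] for row in rows[1:]]
    ops = []
    # Step through complete triples only: X operand, Y operand, instruction.
    for i in range(0, len(body) - 2, 3):
        x_bits, y_bits, instr = body[i:i + 3]
        ops.append({
            "x": bits_to_float(x_bits),
            "y": bits_to_float(y_bits),
            "reset": instr[0],      # first term: reset state
            "mac": instr[1],        # second term: multiply/accumulate instruction
            "rounding": instr[2],   # third term: rounding mode
        })
    return {"accumulator": bits_to_float(acc_bits), "flags": flags, "ops": ops}


SAMPLE = """\
01000001100110111110111111100011 00
01000000100000000000000000000000
00110001000000000000000000000000
102
01000000100000000000000000000000
01000001000000000000000000000000
002
"""
```

Calling `parse_model_input(SAMPLE)` yields two operations; the first X operand
decodes to 4.0 and the second Y operand decodes to 8.0.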
8.0 Brief Biography
Richard Niescier was born in Philadelphia, Pennsylvania on December 14, 1962
to Anthony and Jane Niescier. He received the B.S. degree with highest honors in
electrical engineering from Lehigh University in 1984. He joined the Advanced
Technology Laboratories of RCA in Moorestown, N.J. in 1984. At RCA, he worked
on High Reliability Circuit Design in their 1.25um CMOS/SOS process. He
designed various process test chips and helped in the design of a 64K-bit static
memory. In 1987, he went to AT&T Bell Laboratories in Reading, PA and worked
on advanced GaAs HFET devices. He designed various high speed GaAs digital
test chips and contributed to the creation of a GaAs HFET Digital Standard Cell
library. He finished his work there with the design of a GaAs IEEE compliant
32-bit floating point multiplier with an average propagation delay of under 10 nS. In
1990, he transferred to the Digital Signal Processor Laboratory in Allentown, PA.
There he has been engaged in design and development activities for the
DSP3210/06, AT&T's 32-bit floating point digital signal processor chip, and the
DSP1602/04 family of low cost 16-bit integer digital signal processors. His
activities range from module design to overall lead circuit designer. The following is a
list of publications that he has contributed to:
32-Bit GaAs HFET IEEE Floating Point Multiplier., 14th Annual GaAs IC
Symposium, Technical Digest 1992.
A SEU Hardening Technique for CMOS Static RAMs., IEEE Journal of Radiation
Effects, 1986.
500 MHz GaAs Macrocell Library for High Speed Low Power LSI Digital ASICS.,
SPIE Technical Symposium on Optical Engineering and Photonics, Orlando,
Florida, April 18, 1990.
