











Thesis Advisor: Herschel H. Loomis, Jr.
Approved for public release; distribution is unlimited

Unclassified
security classification of this page
REPORT DOCUMENTATION PAGE
1a Report Security Classification Unclassified 1b Restrictive Markings
2a Security Classification Authority
2b Declassification/Downgrading Schedule
3 Distribution/Availability of Report
Approved for public release; distribution is unlimited.
4 Performing Organization Report Number(s) 5 Monitoring Organization Report Number(s)




7a Name of Monitoring Organization
Naval Postgraduate School
6c Address (city, state, and ZIP code)
Monterey, CA 93943-5000
7b Address (city, state, and ZIP code)
Monterey, CA 93943-5000
8a Name of Funding/Sponsoring Organization 8b Office Symbol
( if applicable)
9 Procurement Instrument Identification Number
8c Address (city, state, and ZIP code) 10 Source of Funding Numbers
Program Element No Project No Task No Work Unit Accession No
11 Title (include security classification) VLSI DESIGNS FOR PIPELINED FFT PROCESSORS (Unclassified)
12 Personal Author(s) David Charles Stuart








16 Supplementary Notation The views expressed in this thesis are those of the author and do not reflect the official policy or po-
sition of the Department of Defense or the U.S. Government.
Cosati Codes
Field Group Subgroup
18 Subject Terms (continue on reverse if necessary and identify by block number)
VLSI multipliers, VLSI adders, pipelined arrays, floating-point multipliers, floating-point
adders, FFT processors
19 Abstract (continue on reverse if necessary and identify by block number)
A system of custom cell building blocks utilizing scaleable CMOS technology is described. The cells are designed to support
the high speed, pipelined addition, subtraction, and multiplication operations necessary in a cyclic spectral analyzer or other
applications involving the FFT. The cells are structured in such a manner as to permit a designer to tailor the bit-length of
the operations and the number of pipeline stages used. Both fixed and floating operations are supported by the system. The
size and performance characteristics of devices produced using the cells are compared with previously produced Genesil
Silicon Compiler pipelined designs. The appendix contains designs of a 16-bit mantissa, 12-bit exponent floating point
multiplier and adder produced from the standard cells. If fabricated in 1.2-micron feature size technology, the theoretical maximum
clock speed and throughput rate is 102 MHz with an asymmetric clock and 61 MHz using a symmetric clock waveform.
Devices with clock speeds up to 178 MHz are possible if the number of logic cells between a pipeline stage is reduced to one.
20 Distribution Availability of Abstract
[X] unclassified/unlimited [ ] same as report [ ] DTIC users
21 Abstract Security Classification
Unclassified
22a Name of Responsible Individual
Herschel H. Loomis, Jr.




DD FORM 1473,84 MAR 83 APR edition may be used until exhausted
All other editions are obsolete
security classification of this page
Unclassified
Approved for public release; distribution is unlimited.
VLSI Designs for Pipelined FFT Processors
by
David Charles Stuart
Lieutenant, United States Navy
B.S., Southern Illinois University at Carbondale, 1979
Submitted in partial fulfillment of the
requirements for the degrees of







A system of custom cell building blocks utilizing scaleable CMOS technology
is described. The cells are designed to support the high speed, pipelined addition,
subtraction, and multiplication operations necessary in a cyclic spectral analyzer
or other applications involving the FFT. The cells are structured in such a
manner as to permit a designer to tailor the bit-length of the operations and the
number of pipeline stages used. Both fixed and floating operations are supported
by the system. The size and performance characteristics of devices produced us-
ing the cells are compared with previously produced Genesil Silicon Compiler
pipelined designs. The appendix contains designs of a 16-bit mantissa, 12-bit
exponent floating point multiplier and adder produced from the standard cells.
If fabricated in 1.2-micron feature size technology, the theoretical maximum clock
speed and throughput rate is 102 MHz with an asymmetric clock and 61 MHz
using a symmetric clock waveform. Devices with clock speeds up to 178 MHz
are possible if the number of logic cells between a pipeline stage is reduced to one.





I. INTRODUCTION
A. CYCLIC SPECTRAL ANALYSIS
B. PIPELINING CIRCUITS FOR HIGH PERFORMANCE
C. APPLICATION SPECIFIC INTEGRATED CIRCUIT DESIGN
D. THESIS GOALS AND ORGANIZATION
II. FULL-CUSTOM SCALEABLE CMOS
A. CMOS VLSI DESIGN PRINCIPLES
B. SCALING CMOS DESIGNS
C. FULL-CUSTOM DESIGN TOOLS
III. NUMBER SYSTEMS AND ALGORITHMS
A. NUMBER SYSTEMS AND ARITHMETIC OPERATIONS
B. FLOATING POINT NUMBER SYSTEMS
C. FLOATING POINT MULTIPLICATION
D. FLOATING POINT ADDITION
E. OVERFLOW, UNDERFLOW, AND ROUNDING
IV. HARDWARE
A. ADDER DESIGNS
B. LATCH LAYOUTS
C. MULTIPLIER DESIGNS
D. OTHER CELL FEATURES
V. DESIGN COMPARISONS
VI. CONCLUSIONS AND RECOMMENDATIONS





APPENDIX A. SPICE SIMULATION EXAMPLE
APPENDIX B. STANDARD CELL DESCRIPTIONS AND LAYOUTS
A. EXPONENT ADDITION FUNCTION CELLS
B. MULTIPLICATION FUNCTION CELLS
C. EXPONENT SUBTRACTION FUNCTION CELLS
D. MANTISSA ALIGNMENT AND SELECTION
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST

I. INTRODUCTION
A. CYCLIC SPECTRAL ANALYSIS
Cyclic spectral analysis is a fundamental tool for the study of periodicity in
signals and systems [Ref. 1: p. ii]. Signal detection, modulation recognition, and
signal parameter estimation are only a few of the potential uses of this relatively
new analysis technique. A particularly important application of cyclic spectral
analysis is the study of modulated signals. As an example, consider results derived
in Ref. 2. Figure 1 on page 2 contains the plots of four different phase-shift
keyed (PSK) signals. The magnitude of the cyclic spectrum is plotted as the
height above a bi-frequency plane defined by $f$ and $\alpha$, where $f$ is the spectral
frequency and $\alpha$ is called the cyclic frequency. Notice that the four signals have
identical power spectral density functions ($\alpha = 0$), but have highly distinct
spectral correlation functions.
Several methods and algorithms exist for computing the cyclic spectrum of
signals [Ref. 3: pp. 1-17]. Figure 2 on page 3 (from Ref. 4) shows a block diagram
of one such algorithm using a time-averaging method. The computation
requires:
(a) prefiltering using overlapping window Fast Fourier Transforms (FFT's);
(b) multiplication by complex phase terms; and
(c) performing another, longer FFT.
These results cover only a small portion of the bi-frequency plane. Many such
overlapping computations must be performed to cover the entire plane. This
computational complexity, which far exceeds that of conventional spectral anal-
ysis, presently limits the use of the cyclic spectrum as a signal and systems anal-
ysis tool.













Figure 2. Time-averaging method
The basic operations required to compute the cyclic spectrum, such as Fourier
transformation and product modulation, are common to most signal processing
algorithms. However, the extreme number of these operations required to yield
meaningful results proves taxing to general-purpose computers. A wide range of
applications, including applied research in the field of cyclic spectral analysis,
would open up if cyclic spectral analysis algorithms could be computed more
rapidly. One way of computing these algorithms more rapidly is to use computational
systems that are specifically designed for digital cyclic spectral analysis.
[Ref. 3: p. 2]
The mathematical operations required to compute the cyclic spectrum are
multiplication, addition and subtraction. The cyclic spectral algorithm's use of
structured computations such as the FFT may make implementation in very-
large-scale integrated (VLSI) circuit devices practical if many mathematical
operations can be performed on a single chip. As an example, a radix-2 FFT
decimation in time "butterfly" (Figure 3) requires one complex multiply and two
complex addition operations. This is equivalent to four multiplications and six
additions of real numbers [Ref. 5]. If such a set of computations could be per-
formed on a single chip at sufficiently high rates, a near-real-time cyclic spectral
analyzer might be practical.
Figure 3. FFT "butterfly"
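The operation count quoted above can be checked by writing the butterfly out in real arithmetic. The following Python sketch is an illustration only and is not taken from the thesis; the function name `butterfly` and its counting helpers are assumptions made for this example, and subtractions are counted as additions.

```python
def butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly in real arithmetic.

    a, b and w are (real, imag) tuples; w is the twiddle factor.
    Returns (a + w*b, a - w*b, counts), where counts tallies the
    real multiplications and real additions actually performed.
    """
    counts = {"mul": 0, "add": 0}

    def mul(x, y):
        counts["mul"] += 1
        return x * y

    def add(x, y):
        counts["add"] += 1
        return x + y

    ar, ai = a
    br, bi = b
    wr, wi = w

    # Complex multiply w*b: four real multiplies and two real adds.
    pr = add(mul(br, wr), -mul(bi, wi))
    pi = add(mul(br, wi), mul(bi, wr))

    # Two complex additions: four more real adds.
    top = (add(ar, pr), add(ai, pi))
    bottom = (add(ar, -pr), add(ai, -pi))
    return top, bottom, counts


# Example with a = 1 + 2j, b = 3 - 1j and twiddle factor w = -j.
top, bottom, counts = butterfly((1, 2), (3, -1), (0, -1))
```

Each call tallies four real multiplications and six real additions, which is the figure used to size the arithmetic needed on a single chip.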
B. PIPELINING CIRCUITS FOR HIGH PERFORMANCE
The basic idea behind a pipelined design is quite natural. The name pipeline
stems from the analogy with petroleum pipelines in which a sequence of products
is pumped through a pipeline. Figure 4 on page 5 shows a sequential process
made up of "N" discrete cascaded sub-processes. We can apply this analogy to
an electronic circuit made up of cascaded logic stages. The maximum delay
through the circuit is the sum of the maximum delays through each logic stage
and can be expressed as

$$T_D = t_{d_1} + t_{d_2} + \cdots + t_{d_N}, \tag{1}$$








Figure 4. An N-step sequential process.
where $T_D$ is the maximum total circuit delay and $t_{d_1}, t_{d_2}, \ldots, t_{d_N}$ are the delays of
the individual stages of logic. In using a circuit of this type, inputs usually must
be held constant until the output is stable. The maximum rate at which this type
of circuit can perform operations is, therefore,

$$f_{\max} = \frac{1}{T_D}. \tag{2}$$
Figure 5 on page 6 shows the circuit with clocked registers added between the
logic blocks. These registers are used to save the state of the preceding logic block
at periodic intervals, and then to input that state into the next set of logic. The
minimum clock period that could be used in this circuit is
$$T_{c_{\min}} = t_{d_{\max}} + t_{d_{\text{register}}}, \tag{3}$$

where $t_{d_{\max}}$ is the maximum delay of the slowest logic block and $t_{d_{\text{register}}}$ is the total
delay associated with a register. If a design breaks up the operation into logic
blocks with similar delay times, and the delay of a register is small compared with












Figure 5. Pipelined process
unpipelined circuit. Notice that the total delay (sometimes referred to as
latency) through the circuit is now

$$T_D = N \cdot T_c, \tag{4}$$

where $N$ is the number of stages and $T_c$ is the clock period.
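A small numerical sketch may help make the trade-off in Equations (1) through (4) concrete. The Python below is illustrative only: the function names and the delay values (four 4 ns logic stages and a 1 ns register) are hypothetical and are not measurements from this thesis.

```python
def unpipelined_rate(stage_delays):
    """Maximum operation rate of the cascaded circuit without registers."""
    total_delay = sum(stage_delays)          # Eq. (1): sum of the stage delays
    return 1.0 / total_delay                 # Eq. (2): one result per total delay


def pipelined_timing(stage_delays, register_delay):
    """Return (minimum clock period, latency, operation rate) with registers."""
    t_clock = max(stage_delays) + register_delay   # Eq. (3): slowest stage + register
    latency = len(stage_delays) * t_clock          # Eq. (4): N clock periods
    return t_clock, latency, 1.0 / t_clock


# Hypothetical delays in nanoseconds; rates are in operations per nanosecond.
stages = [4.0, 4.0, 4.0, 4.0]
rate_before = unpipelined_rate(stages)
t_clock, latency, rate_after = pipelined_timing(stages, register_delay=1.0)
```

With these assumed numbers the unpipelined circuit completes one operation every 16 ns, while the pipelined version accepts a new operation every 5 ns at the cost of a 20 ns latency.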
This pipelining principle is easily applied to mathematical operations. For
example, consider Figure 6 on page 7, a four-bit ripple-carry adder made up of
four one-bit full adders. All the inputs (A's and B's) must be held at a fixed logic
level as the carry "ripples" through the adders. This adder can be pipelined by
adding registers as in Figure 7 on page 8. This adder could operate with higher
data input rates, but the design has several potential disadvantages.
First, the answer from one addition operation is not available until several
clock periods after the next addition can be started. If the next addition to be
performed requires a previous output, it cannot be started until that previous
output is available. This could cause the pipeline to "empty" and defeat the
purpose of pipelining. It may even result in slower overall operations, since the




Figure 6. Four-bit ripple-carry adder
overall delay of the circuit is increased by pipelining. Pipelining will only increase
the throughput of a system if the computations can be structured in a way to
keep the pipeline "full". Fortunately, an FFT-type algorithm can be structured
in just such a way [Ref. 6].
Secondly, the pipelined system is physically larger and more complex. Not
only does it have all the logic required of the unpipelined circuit, it also requires
registers and at least one clock. This extra space requirement can be very signif-
icant and could limit the use of pipelining in some applications.
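The latency and throughput behavior described above can be seen in a behavioral model. The following Python sketch is an assumption-laden illustration, not the layout of Figure 7: the class name `PipelinedRippleAdder` is hypothetical, and each register rank simply carries the operands, the carry, and the partial sum forward by one stage per clock.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


class PipelinedRippleAdder:
    """Ripple-carry adder with a register rank after every full adder."""

    def __init__(self, width=4):
        self.width = width
        # registers[i] holds (a, b, carry, partial_sum) leaving stage i, or None.
        self.registers = [None] * width

    def clock(self, operands=None):
        """Apply one clock edge; return the sum leaving the last stage, if any."""
        new_item = None if operands is None else (operands[0], operands[1], 0, 0)
        stage_inputs = [new_item] + self.registers[:-1]
        next_registers = []
        for bit, state in enumerate(stage_inputs):
            if state is None:
                next_registers.append(None)
                continue
            a, b, carry, partial = state
            s, carry = full_adder((a >> bit) & 1, (b >> bit) & 1, carry)
            next_registers.append((a, b, carry, partial | (s << bit)))
        self.registers = next_registers
        done = self.registers[-1]
        if done is None:
            return None
        _, _, carry, total = done
        return total | (carry << self.width)   # include the carry-out bit


adder = PipelinedRippleAdder(width=4)
stream = [(3, 5), (7, 8), (15, 1), (9, 6), None, None, None]
outputs = [adder.clock(item) for item in stream]
# outputs == [None, None, None, 8, 15, 16, 15]
```

The first sum appears only on the fourth clock, but afterwards one sum emerges per clock as long as new operands keep arriving, which is why the computation must be structured to keep the pipeline full.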
C. APPLICATION SPECIFIC INTEGRATED CIRCUIT DESIGN
Application specific integrated circuit (ASIC) designs have become popular
in the military. These circuits, custom designed for a particular application, are
especially useful in specialized, high performance systems. Traditional methods
of ASIC design include full-custom design, gate-array circuit design, and stand-
ard cell circuit design [Ref. 7: p. 38]. Silicon compilation is the newest method of







Figure 7. Pipelined four-bit adder
than previously available. The silicon compiler works from a high-level de-
scription of the circuit that allows the designer to perform successive design iter-
ations quickly and efficiently, providing the designer rapid access to key
parameters such as chip size, power consumption, and timing constraints
[Ref. 8].
The Genesil Silicon Compiler used at the Naval Postgraduate School (NPS)
in Monterey allows a designer to easily create complex circuits using devices from
the compiler library. This design procedure requires little knowledge of VLSI
circuit design on the transistor level. Two theses done at NPS, References 9 and
10, provide excellent background on the system and its capabilities.
Several pipelined adder and multiplier designs have been produced at NPS
using the Genesil Silicon Compiler. Reference 9 (pp. 46-61) documents a custom
16-bit integer adder that reduced the maximum propagation delay by slightly less
than a factor of three. This increase in performance was achieved with a significant
sacrifice in circuit area; the pipelined version took 23 times the chip surface
area of the unpipelined adder. Pipelined multiplier designs also produced large
increases in chip surface area requirements [Ref. 9: pp. 62-81].
Since the physical size of IC layouts limits the number of mathematical operations
that can be performed on a single chip, the large increase in chip area
necessary to pipeline designs using the Genesil Silicon Compiler could limit its
usefulness in designing chips for a cyclic spectral analyzer. Other design methods,
such as full-custom, might be more appropriate.
Full-custom design of integrated circuits is a time-consuming method that
requires extensive programming for simulation and timing analysis of the VLSI
design prior to fabrication. Full-custom design is normally used by IC manufac-
turers producing large quantities of standard off-the-shelf type chips such as
microprocessors. With a full-custom chip, the designer has the capability to
control the physical arrangement at the lowest level. The designer using Genesil
can only rearrange the compiler library devices. A full-custom design should be
able to produce chips which use less area than chips developed by Genesil, but
the only way to determine the layout size is to first produce a full-custom design.
D. THESIS GOALS AND ORGANIZATION
In order to make appropriate design choices to produce pipelined VLSI components
for a cyclic spectral analyzer, certain information is needed, such as:
• Is it feasible to use full-custom design to produce pipelined adders, multipli-
ers and subtracters?
• If feasible, do the full-custom devices offer significant advantages over
Genesil-produced designs?
• Can enough mathematical operations be put on a single IC chip to perform
an FFT "butterfly" operation?
• If so, what feature size will be necessary?
This thesis answers these questions by developing a full-custom VLSI cell
system that can be used to produce pipelined adders, subtracters, and multipliers
appropriate for a cyclic spectral analyzer. This system will be capable of
producing both fixed and floating point mathematical operations to enable a designer
to evaluate more appropriately the size and accuracy tradeoffs inherent in
each format.
Chapter II describes the CAD design tools and programs available at NPS
for full-custom design. The principles of scaleable CMOS devices that make
them appropriate choices for cyclic spectral analyzer components are also
discussed. Chapter III describes the features of the number systems used,
including floating point operations. Chapter IV describes the basic adder and
multiplier cell components. Other specialized devices are also documented.
Chapter V documents the size and performance specifications for fixed and
floating point operations produced from the cell designs. Appendix A contains
sample Spice inputs and results. Appendix B contains standard cell descriptions
and layouts.
II. FULL-CUSTOM SCALEABLE CMOS
A. CMOS VLSI DESIGN PRINCIPLES
The past few years have seen a rapid shift in the technology of choice for
high-complexity digital microelectronics from nMOS to Complementary Metal
Oxide Silicon (CMOS) [Ref. 11: p. ix]. This shift has occurred because CMOS
offers high performance with low power consumption; CMOS draws current from
the power line only when logic transitions occur [Ref. 12: p. 51]. CMOS designs
also scale extremely well to small feature sizes, enhancing their speed and decreasing
the chip area used [Ref. 11: p. 150].
A VLSI chip with hundreds of thousands of transistors can be an extremely
complex device. The role of design aids and strategies is to reduce this complexity
and assure the designer a working product [Ref. 11: p. 238]. Choosing appropriate
architectures is a necessary step in the simplification process. If a structure
can truly be decomposed into a few types of simple substructures or building
blocks, which are used repetitively with simple interfaces, the principles of mod-
ularity and hierarchy can be applied [Ref. 13].
B. SCALING CMOS DESIGNS
As CMOS processes are improved and device dimensions are reduced, the
performance of a CMOS design changes. Many times, prototype devices or sub-
systems are designed, fabricated, and tested using relatively inexpensive "large"
feature sizes [Ref. 12: p. 53]. Later, these prototype designs may be incorpo-
rated into a production device which is fabricated with smaller dimensions. The
effects that the reduced dimensions have on the electrical behavior of the device
are important considerations in VLSI design and the final choice of feature size.
First-order MOS scaling theory is based on a "constant field" model formulated
by R. H. Dennard [Ref. 11: p. 150]. A device is "scaled" by applying a
dimensionless factor $\alpha$ to:
• all dimensions, including those vertical to the surface,
• device voltages, and
• the concentration densities.
The resultant effects of this first-order scaling process are illustrated in Table 1
on page 14.
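To see what the first-order relations imply for a given shrink, the recoverable entries of Table 1 can be tabulated as powers of the scaling factor. The Python sketch below is illustrative only; the dictionary and function names are assumptions, and the value of 2.0 used for the scaling factor is a hypothetical example rather than a process figure from this thesis.

```python
# Exponent k for each Table 1 entry, where the quantity scales as alpha**k.
# Only the rows legible in Table 1 are listed.
SCALING_EXPONENTS = {
    "width": -1,
    "substrate doping": 1,
    "supply voltage": -1,
    "electric field across gate oxide": 0,
    "depletion layer thickness": -1,
    "parasitic capacitance": -1,
    "gate delay": -1,
    "DC power dissipation": -2,
    "dynamic power dissipation": -2,
    "power-speed product": -3,
    "gate area": -2,
    "power density": 0,
    "current density": 1,
    "transconductance": 0,
}


def scale_factors(alpha):
    """Return the first-order multiplier for each quantity at scaling factor alpha."""
    return {name: float(alpha) ** k for name, k in SCALING_EXPONENTS.items()}


# Hypothetical example: every dimension is halved (alpha = 2).
factors = scale_factors(2.0)
```

Under this assumption the gate delay halves and the gate area falls to one quarter, while the power density stays constant and the current density doubles.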
C. FULL-CUSTOM DESIGN TOOLS
Computer-Aided Design (CAD) tools and programs are used by a designer
to simplify the design process and make large, complex designs feasible. Per-
formance predictions, design validation, and checking are also enhanced (or made
possible) by the use of these design tools.
The VLSI design facilities available at NPS include a custom VLSI design
tools "package" of programs put together by the CS Division of the Department
of EECS, University of California, Berkeley [Ref. 14: p. 1]. The package consists
of about twenty programs for designing and analyzing VLSI circuits. Reference
14 contains descriptions of these programs, and tutorials on the graphical layout
editor. Familiarity with this design package is necessary for anyone wishing to
use cells developed in this thesis. As an overview, the following are some of the
design tools applicable:
Crystal A timing analyzer that assists the designer in identifying performance
problems in a design.
Table 1. FIRST-ORDER SCALING
(from Ref. 11: p. 152)

Parameter                                Scaling factor
Width; W                                 $1/\alpha$
Substrate doping; N                      $\alpha$
Supply voltage; $V_{DD}$                 $1/\alpha$
Electric field across gate oxide; E      1
Depletion layer thickness; d             $1/\alpha$
Parasitic capacitance; $WL/t_{ox}$       $1/\alpha$
Gate delay; (VC/I)                       $1/\alpha$
DC power dissipation; $P_s$              $1/\alpha^2$
Dynamic power dissipation; $P_d$         $1/\alpha^2$
Power-speed product                      $1/\alpha^3$
Gate area                                $1/\alpha^2$
Power density; (VI/A)                    1
Current density; (I/A)                   $\alpha$
Transconductance; $g_m$                  1
Esim An event driven logic-level simulator developed at MIT and distributed
with their permission.
Ext2Sim Part of the Magic suite of programs. Used for converting the output
of Magic's hierarchical extractor into a form usable by other tools
such as Esim, Crystal, and Spice.
Magic The graphical layout editor.
The graphical layout editor, Magic, is the heart of the CAD design package.
CMOS designs made using Magic's SCMOS technology are flavor-less and
scaleable; they may be fabricated in either N-well or P-well technology in a vari-
ety of feature sizes. The lambda units used in Magic are dimensionless; MOSIS
currently supports fabrication at the 1.5 microns/lambda, 1.0 microns/lambda,
and 0.7 microns/lambda scale factors [Ref. 14: p. 285]. These scale factors cor-
respond to 3.0, 2.0, and 1.2-micron feature sizes respectively. Other scale factors
are expected to become available in the future.
Using Magic, a designer "draws" rectangles of polysilicon, diffusion, metal,
and various contacts to generate VLSI layouts. Magic supports hierarchical de-
sign methods; layouts that are collections of cells and sub-cells are easily gener-
ated. Magic also automatically checks designs to verify that design rules have
been adhered to. Rule violations are displayed on the screen in the vicinity of the







Figure 8. Legend for layout layers and contacts.
Spice, a general-purpose circuit simulation program for non-linear analysis,
was also used for verification and optimization of low-level circuit designs
produced in this thesis. Spice has built-in models for semiconductor devices; a
user need only specify the pertinent model parameters [Ref. 15]. Spice performs
DC and transient analysis at various temperatures selected by the designer. Spice
utilizes matrix equations relating voltages, currents, and resistances. This type
of simulation is characterized by high accuracy but long simulation times. Simulation
time is typically proportional to $n^2$, where $n$ is the number of nonlinear devices
in the circuit. It is unrealistic to use this type of program for the verification
of large VLSI chips. For large numbers of transistors, switch-level simulators
are generally used to verify logic and wiring paths. [Ref. 11: p. 255]
III. NUMBER SYSTEMS AND ALGORITHMS
A. NUMBER SYSTEMS AND ARITHMETIC OPERATIONS
Three number systems were considered for use in arithmetic operations:
• sign-magnitude,
• ones' complement, and
• two's complement.
The addition process, when using a ones' complement number system, requires
adding the carry-out of the most significant bit back into the carry-in of the least
significant bit. Letting this carry "ripple through" an adder the second time is
costly in terms of overall delay. It is also undesirable in pipelined adders from a
layout size consideration; performing the addition process twice requires more
hardware in a pipelined system. For these reasons, use of the ones' complement
number system was rejected.
The sign-magnitude system uses the most significant bit to represent the sign
of the number; all other bits represent the magnitude. Customarily, a 1 in the
sign bit position represents a negative number and a 0 represents a positive
number. That convention is used in the system described in this thesis. The






$$= \sum_{i} b_i \cdot 2^{i-p}, \tag{5}$$

where $b_i$ represents the $i$th bit ($i = 0, 1, \ldots, N-1$), and $p$ identifies the location of
the radix point (specifically, the number of bits to the right of the radix point).
The sign-magnitude system appears well-suited for use in the multiplication
process. The magnitude of the product depends only on the magnitude of the
multiplier and the magnitude of the multiplicand; the sign of the product depends
only on the sign of the multiplier and multiplicand.
The two's complement number system represents both positive and negative
numbers. To negate a value, the value is subtracted from $2^N$ [Ref. 16: p. 33].




$$= -b_{N-1} \cdot 2^{N-1-p} + \sum_{i=0}^{N-2} b_i \cdot 2^{i-p}, \tag{6}$$

where $N$, $p$, and $i$ are defined as described in the sign-magnitude system. The
two's complement has the advantage of representing one more value than an
equally sized sign-magnitude system. The value zero only has one representation
in two's complement, vice the two representations in a sign-magnitude system.
Also, subtraction can be accomplished by adding the two's complement of a
number. This provides advantages in a system performing addition and sub-
traction of numbers which can be either positive or negative. To perform addi-
tion, simply add the numbers regardless of whether either or both are positive or
negative. The answer will be correct (assuming no overflow, discussed later).
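The two representations can be compared by decoding the same bit patterns both ways. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, `n` is the word length including the sign bit, and `p` is the number of bits to the right of the radix point.

```python
def sign_magnitude_value(bits, n, p):
    """Decode an n-bit sign-magnitude pattern with p fractional bits."""
    sign = (bits >> (n - 1)) & 1
    magnitude = bits & ((1 << (n - 1)) - 1)
    value = magnitude / (1 << p)
    return -value if sign else value


def twos_complement_value(bits, n, p):
    """Decode an n-bit two's complement pattern with p fractional bits."""
    sign = (bits >> (n - 1)) & 1
    low_bits = bits & ((1 << (n - 1)) - 1)
    # The most significant bit carries a negative weight.
    return (low_bits - (sign << (n - 1))) / (1 << p)


def twos_complement_negate(bits, n):
    """Negate by subtracting from 2**n, keeping the result to n bits."""
    return ((1 << n) - bits) & ((1 << n) - 1)


# Four-bit integers (p = 0): the same pattern decodes differently.
sm = sign_magnitude_value(0b1101, 4, 0)       # sign bit set, magnitude 5
tc = twos_complement_value(0b1011, 4, 0)      # -8 + 3
negated = twos_complement_negate(0b0101, 4)   # pattern for -5
```

Sign-magnitude decodes both `0b0000` and `0b1000` as zero, whereas two's complement uses `0b1000` for the extra value -8, which is the one additional value mentioned above.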
The addition of two numbers, A and B, in a sign-magnitude system is much
more complex; several cases must be considered, as listed below with the sign
explicitly shown inside the parentheses:
1. (+A) + (+B),
2. (-A) + (+B),
3. (+A) + (-B), and
4. (-A) + (-B).
The subtraction process has similar variations. Cases 1 and 4 can be accom-
plished by adding the magnitudes and setting the sign bit positive if A and B are
positive and negative if A and B are both negative. Cases 2 and 3 are more
complex. Either case could be accomplished by subtracting B from A and sub-
tracting A from B, then choosing the "correct" magnitude and sign based on the
relative magnitude and sign of A and B. This is a complex set of procedures that
could require performing each addition (or subtraction) three "ways" simultane-
ously, and then choosing the correct answer. This procedure would use signif-
icantly more chip area than a two's complement number system.
Subtraction in a two's complement number system can be accomplished by
taking the two's complement of the subtrahend and adding it to the minuend.
The two's complement of a number can be obtained by inverting all bits and
adding a 1 to the least significant bit [Ref. 16: p. 35]. Figure 9 on page 20 shows
a possible gate-level logic design which will accomplish this complement and in-
crement operation. Notice that the conversion process will "ripple" from right to
left. Also note that the least significant bit of the operand is available imme-
diately. It is not necessary to perform this conversion process prior to a
Figure 9. Two's complement converter
subtraction; simply feed the complement (invert each bit individually) of the
subtrahend into an adder and assert the carry-in of the adder. This is simple and
fast; in CMOS, inverters are small, rapidly operating devices. Due to the overall
simplicity, size constraints, and the number of additions and subtractions in an
FFT algorithm, the two's complement system was selected for use in the addition
and subtraction cell designs produced in this thesis.
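The shortcut described above, inverting the subtrahend and asserting the adder's carry-in, can be sketched at the bit level. The following Python is a minimal illustration and not the thesis cell design; the function names and the 4-bit width in the example are assumptions.

```python
def ripple_carry_add(a, b, carry_in, width):
    """Add two width-bit patterns bit by bit; returns (sum_bits, carry_out)."""
    total = 0
    carry = carry_in
    for bit in range(width):
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        total |= (x ^ y ^ carry) << bit
        carry = (x & y) | (carry & (x ^ y))
    return total, carry


def subtract(minuend, subtrahend, width):
    """Compute minuend - subtrahend using only inverters and an adder."""
    mask = (1 << width) - 1
    inverted = ~subtrahend & mask              # invert each bit individually
    difference, _ = ripple_carry_add(minuend & mask, inverted, 1, width)
    return difference                          # carry-in of 1 completes the negation


def to_signed(bits, width):
    """Interpret a width-bit pattern as a two's complement integer."""
    return bits - (1 << width) if bits >> (width - 1) else bits


# 3 - 5 in four bits gives the pattern 1110, which is -2.
result = to_signed(subtract(3, 5, 4), 4)
```

No separate increment stage is needed, so the subtracter costs only a row of inverters beyond the adder itself.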
The choice of a number system for the multipliers in the FFT butterfly still
leaves two options:
1. Use a sign magnitude system, converting from two's complement prior to the
multiply and then converting back to two's complement after the multiply;
or,
2. Use an algorithm that correctly computes the product of two two's comple-
ment numbers.
Reference 17 documents a two's complement parallel array multiplication algo-
rithm. Other algorithms, such as Booth's [Ref. 16: p. 90], are also available.
All these algorithms require complicated interconnection patterns which increase
the complexity of a multiplier and require more space on the chip [Ref.
18: p. 35]. Keeping in mind the design goal of producing a structure which uses
simple repetitive substructures, the sign-magnitude system was chosen for the
multiplier design.
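The first option, converting to sign-magnitude around the multiply, can be sketched as follows. This Python is an integer illustration only: the function names are hypothetical, and the built-in `*` stands in for the unsigned array multiplier hardware.

```python
def tc_to_sm(bits, width):
    """Split a width-bit two's complement pattern into (sign, magnitude)."""
    sign = (bits >> (width - 1)) & 1
    magnitude = ((1 << width) - bits) if sign else bits
    return sign, magnitude


def sm_to_tc(sign, magnitude, width):
    """Pack (sign, magnitude) back into a width-bit two's complement pattern."""
    mask = (1 << width) - 1
    return (-magnitude if sign else magnitude) & mask


def multiply(a_bits, b_bits, width):
    """Multiply two width-bit two's complement patterns via sign-magnitude."""
    sign_a, mag_a = tc_to_sm(a_bits, width)
    sign_b, mag_b = tc_to_sm(b_bits, width)
    sign = sign_a ^ sign_b          # product sign depends only on the input signs
    magnitude = mag_a * mag_b       # product magnitude depends only on the magnitudes
    return sm_to_tc(sign, magnitude, 2 * width)


# (-3) * 5 with 4-bit inputs and an 8-bit product.
product = multiply(0b1101, 0b0101, 4)
```

The sign path is a single exclusive-OR and the magnitude path is an unsigned multiply, which is what keeps the multiplier array a simple repetitive structure.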
B. FLOATING POINT NUMBER SYSTEMS
Fixed point number systems are frequently used in digital signal processing
applications. If values are scaled as they enter, a system can be designed to en-
sure that intermediate values are sufficiently represented by the number of bits
in the system [Ref. 16: p. 37]. In a radix-2 FFT butterfly with both a real and
an imaginary part in each input, the maximum factor by which any output can grow in a
single stage is

$$1 + \sin(45^\circ) + \cos(45^\circ) = 2.414213562. \tag{7}$$
Many FFT implementation schemes take this maximum growth into account and
compensate by scaling each stage to prevent "overflow" [Ref. 19: pp. 75-76]. By
implementing such scaling practices and carefully examining accuracy require-
ments, a system designer may preclude the necessity of the wider range of data
representation available in a floating point number system. This thesis produces
designs in both fixed and floating point formats to allow a designer flexibility in
choosing which type of number system to use based on accuracy requirements,
size of the VLSI implementations, and number of stages of delay in the pipeline.
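The growth bound in Equation (7) can be checked numerically. The sketch below assumes, for illustration, that the real and imaginary parts of both butterfly inputs are bounded by 1 in magnitude; the function name and the choice of 64 twiddle angles are assumptions of this example.

```python
import math


def max_output_growth(num_angles=64):
    """Worst-case magnitude of one output component over the twiddle angles."""
    worst = 0.0
    for k in range(num_angles):
        theta = 2.0 * math.pi * k / num_angles
        # Real part of A + W*B is Ar + Br*cos(theta) + Bi*sin(theta); with every
        # input component bounded by 1, its magnitude is at most this sum.
        growth = 1.0 + abs(math.cos(theta)) + abs(math.sin(theta))
        worst = max(worst, growth)
    return worst


worst_case = max_output_growth()
```

The maximum occurs at odd multiples of 45 degrees, so a design that scales by a factor of at least this amount per stage cannot overflow under the stated bound.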
To specify a floating point number, seven different pieces of information are
required:
• base of the system; $r_b$,
• sign of the mantissa; $S_m$,
• magnitude of the mantissa; $M_m$,
• base of the mantissa; usually also $r_b$,
• sign of the exponent; $S_e$,
• magnitude of the exponent; $M_e$, and
• base of the exponent; $r_e$.
The value of the number represented, $V_{fpn}$, is

$$V_{fpn} = (-1)^{S_m} \cdot M_m \cdot r_b^{(-1)^{S_e} \cdot M_e} \tag{8}$$

[Ref. 16: pp. 42-45]. While the representation shown above implies use of a sign-magnitude
system for expressing the values of the mantissa and exponent, this is
not necessary. Other types of number systems are used quite frequently.
The excess number system is one such number system. An excess code is
produced by purposely adding an "excess" to the value to be represented. The
resultant bit pattern is then stored or used as required. One of the most prevalent
uses of excess codes is to store exponents in floating point number systems [Ref.
16: p. 38]. If S represents the excess code value that will be stored or used, V
the true value of the number, and E the excess, then the relationship can be ex-
pressed as
S = V + E. (9)
Table 2 on page 24 (from Ref. 16: p. 55) identifies a number of machines and
the floating point system they use. Note that most systems use excess coded exponents,
which will represent the smallest number (most negative value of exponent)
with an excess coded exponent equal to 000...0. In all cases where excess
coded exponents are used, the coded exponent will never be negative. This allows
various machines to use a special code for an "exact" zero. This practice also
makes some floating point operations less complex at the expense of other
operations.
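The relationship in Equation (9) is simple enough to state directly in code. The Python below is a minimal sketch with hypothetical function names; the 8-bit, excess-128 parameters in the example follow the DEC entries of Table 2.

```python
def to_excess(value, excess, bits):
    """Return the stored code S = V + E, checking that it fits in the field."""
    stored = value + excess
    if not 0 <= stored < (1 << bits):
        raise ValueError("exponent out of range for this excess code")
    return stored


def from_excess(stored, excess):
    """Recover the true value V = S - E."""
    return stored - excess


# 8-bit exponent field with an excess of 128.
most_negative = to_excess(-128, 128, 8)   # stored as 0, the all-zeros pattern
zero_exponent = to_excess(0, 128, 8)      # stored as 128
largest = to_excess(127, 128, 8)          # stored as 255
```

Because the stored code is never negative, exponents can be compared as unsigned integers, and the all-zeros pattern is free to mark the smallest exponent or an exact zero.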
In addition to the previously discussed characteristics, most floating point
number systems use a process called normalization which makes an assumption
as to the location of an "implied" radix point. This eliminates the need to code
and store radix point location. The floating point number system used in this
thesis assumes the most significant bit of the mantissa (excluding the sign bit) is
the first bit to the right of the radix point. This first bit is normally not allowed
to be a 0 (exceptions discussed later). This means that the only legal values of the
magnitude of the mantissa are fractions between $1/2$ (assuming $r_b = 2$) and almost
$1$. For example, consider a sign-magnitude system which uses a 4-bit mantissa.
Showing the sign bit explicitly, the largest legal mantissa would be

$$01111 = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{15}{16}. \tag{10}$$

The smallest positive mantissa is
Table 2. FLOATING POINT INFORMATION SYSTEMS

                        Word size             Exponent           Mantissa
System                  (# Bits)   $r_b$      # Bits  Code       # Bits        Repre.  Code
Burroughs B6700/7700    48         8          7       SM         39            Int     SM
CDC 7600                60         2          11      Ex 1024    48            Int     1'sC
DEC (single)            32         2          8       Ex 128     24            Fra     SM
DEC (double)            64         2          8       Ex 128     56            Fra     SM
Honeywell 8200          48         2 or 10    7       Ex 64      40 (base 2)   Fra
                                                                 20 (base 10)  Fra     SM
IBM (single)            32         16         7       Ex 64      24            Fra     SM
IBM (double)            64         16         7       Ex 64      56            Fra     SM
IEEE (single)           32         2          8       Ex 127     24            Fra     SM
IEEE (double)           64         2          11      Ex 1023    53            Fra     SM
Cray                    64         2          15      Ex 16384   48            Fra     SM

Int = Integer representation
SM = Sign/magnitude
Fra = Fractional
1'sC = Ones' complement
Ex = Excess code
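$$01000 = \frac{1}{2}. \tag{11}$$

The largest (most negative) mantissa is

$$11111 = (-1) \times \left(\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16}\right) = -\frac{15}{16}. \tag{12}$$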
This system of normalization will also work with a two's complement number
system with some modifications. Consider the following examples with an im-
plied radix point to the right of a "sign" bit:
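$$01111 = \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{15}{16}, \tag{13}$$

$$01000 = \frac{1}{2}, \tag{14}$$

$$11000 = -1 + \frac{1}{2} = -\frac{1}{2}, \tag{15}$$

$$11111 = -1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = -\frac{1}{16}. \tag{16}$$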
Notice that the convention of requiring the first bit to the right of the radix point
to be a 1 has resulted in a shift in the range of magnitude of the value repres-
ented. If the convention is modified to require a negative two's complement
number to have a zero as the first bit to the right of the radix point, the corre-
spondence between positive and negative numbers is more appropriate. Consider
the following two's complement examples:
$$10001 = -1 + \frac{1}{16} = -\frac{15}{16}, \tag{17}$$
$$10111 = -1 + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = -\frac{9}{16}, \tag{18}$$
$$10000 = -1. \tag{19}$$
This thesis uses a sign-magnitude system for multiplication and a two's com-
plement system for addition. Such a mixing of systems must allow for conversion
from one to another. The above examples show two specific instances where a
legal representation in one system has no corresponding legal representation in
the other system. To compensate for this, these two cases must be sensed,
specifically:
1. In conversion from two's complement to sign-magnitude, if a 1000...0 is
sensed, the two's complement number is "sign-extended" and shifted one bit
to the right before conversion and the exponent value is increased by 1.
2. In conversion from sign-magnitude to two's complement, if a 11000...0 is
sensed, the number is shifted one bit to the left and the exponent is decreased
by 1.
The floating point number system developed in this thesis uses a base of 2
with the exponent using an integer two's complement representation. In multi-
plication, a sign-magnitude number system representation is used in the mantissa;
for addition and subtraction, a two's complement number system is used in the
mantissa. Conversion between the two systems is performed as required.
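The mantissa examples above can be checked with a short sketch. It assumes 5-bit strings with the leading bit shown explicitly and the radix point immediately to its right; the function names are illustrative and do not come from the thesis.

```python
from fractions import Fraction


def sign_magnitude_value(bits: str) -> Fraction:
    """Value of a sign-magnitude fraction; bits[0] is the sign bit."""
    magnitude = sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(bits[1:]))
    return -magnitude if bits[0] == "1" else magnitude


def twos_complement_value(bits: str) -> Fraction:
    """Value of a two's complement fraction; the leading bit has weight -1."""
    fraction = sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(bits[1:]))
    return fraction - int(bits[0])


def is_normalized_twos_complement(bits: str) -> bool:
    """Modified convention: the bit right of the radix point differs from the leading bit."""
    return bits[0] != bits[1]
```

Under this convention 10000 (the value -1) is a legal two's complement mantissa with no sign-magnitude counterpart, and 11000 (the value -1/2) is not legal, which are the two cases the conversion rules above must sense.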
C. FLOATING POINT MULTIPLICATION
Floating point multiplication of two numbers, using a sign-magnitude system,
can be stated as
$$A = B \times C = \left((-1)^{\mathrm{sign}_B} \cdot M_B \cdot r_b^{E_B}\right)\left((-1)^{\mathrm{sign}_C} \cdot M_C \cdot r_b^{E_C}\right). \tag{20}$$
Thus, the product mantissa is the product of the multiplier and multiplicand
mantissas; the product exponent is the sum of the exponents. The sign bit can
be easily obtained in VLSI circuitry; it is simply the "exclusive-or" of the sign bits.
Note that the above three operations can be performed simultaneously; they do
not depend on each other.
In any floating point operation using normalized operands the normalization
of the result must be checked and adjusted to meet system requirements. This
"post normalization" is relatively simple following a multiplication. Consider the




In this case, the mantissa is properly aligned; no post normalization is required.




The mantissa must be shifted one bit to the left. This normalization requires
subtracting a 1 from the exponent to complete the operation.
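As a minimal sketch of the multiplication steps described above, the following function treats each mantissa as an integer holding `n_bits` fraction bits (value = mantissa / 2**n_bits). The integer encoding, the function name, and the absence of rounding are assumptions made for illustration only.

```python
def fp_multiply(sign_b, mant_b, exp_b, sign_c, mant_c, exp_c, n_bits=4):
    """Multiply two normalized sign-magnitude operands (no rounding applied)."""
    sign = sign_b ^ sign_c        # exclusive-or of the sign bits
    exponent = exp_b + exp_c      # sum of the exponents
    product = mant_b * mant_c     # product holds 2 * n_bits fraction bits
    # Normalized inputs lie in [1/2, 1), so the product lies in [1/4, 1):
    # at most one left shift is needed to restore a leading 1.
    if product < 1 << (2 * n_bits - 1):
        product <<= 1
        exponent -= 1
    return sign, product, exponent
```

In hardware the three results are formed concurrently; the sequential statements here only model the values, not the timing.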
D. FLOATING POINT ADDITION
Compared to a multiplication, floating point addition is a much more complex
operation. Figure 10 on page 29 [Ref. 16: p. 111] shows a block diagram
for the addition process. First, the exponents are compared. The largest expo-
nent is saved, and the difference in the exponents is used to shift the mantissa of
the number with the smallest exponent an appropriate number of places to the
right. In a two's complement system the right shift is accompanied with a sign-
extension process. After the alignment process is complete, the mantissas are





This result is incorrect due to an "overflow" which changed the sign of the result.
In this case, the correct result can be obtained by "sign-extending" the operands




Note that this result requires a post normalization shift of one bit to the right
with a corresponding increase in the exponent.
Now consider the case of adding relatively large negative mantissas using a
















[Figure 10. Block diagram of the floating point addition process. Recoverable labels: POST NORMALIZATION, RESULT MANTISSA.]





This is the correct result, but it also requires a post normalization shift of one bit
to the right, and a corresponding correction of the exponent.
The above two examples establish the limits of the magnitude increase with
addition or subtraction. If one mantissa is shifted many places to the right in the
alignment process, its magnitude compared with the unshifted mantissa is small.
Therefore, the magnitude of the sum of the mantissas will be close to the value
of the unshifted mantissa. However, if the magnitudes of the mantissas are close
in value, but opposite in sign, the result may require a post normalization shift
of many bits. Using a 5-bit two's complement example where the exponent




Post normalization will require a shift of four places to the left. Thus the post
normalization unit must be capable of a shift of at least one position to lesser
significance, and shifts of many positions to higher significance. The post nor-
malization must also adjust the exponent saved in the compare process for any
shifts of the mantissa.
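The alignment, addition, and post normalization steps can be sketched as follows. This is an illustrative model, not the thesis hardware: a mantissa is a signed Python integer m representing m / 2**(n_bits - 1), bits shifted out are simply truncated, and the special exponent code for a true zero is not modeled.

```python
def fp_add(mant_a, exp_a, mant_b, exp_b, n_bits=5):
    """Add two operands with n_bits-wide two's complement mantissas.

    Returns (mantissa, exponent), or (0, None) when the sum is a true zero.
    """
    # Alignment: keep the larger exponent and shift the other mantissa right.
    # Python's >> on a negative int sign-extends, like the hardware sign fill.
    if exp_a >= exp_b:
        exponent, total = exp_a, mant_a + (mant_b >> (exp_a - exp_b))
    else:
        exponent, total = exp_b, mant_b + (mant_a >> (exp_b - exp_a))
    if total == 0:
        return 0, None
    half, full = 1 << (n_bits - 2), 1 << (n_bits - 1)
    if total >= full or total < -full:
        # Magnitude grew past the mantissa range: one place to the right.
        return total >> 1, exponent + 1
    # Otherwise shift left until the two leading bits differ.
    while -half <= total < half:
        total <<= 1
        exponent -= 1
    return total, exponent
```

For instance, adding 01000 and 10111 with equal exponents gives 11111, which this model shifts four places to the left, in line with the left-shift case discussed above.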





Various methods have been devised to handle these "true" zero situations. The
system developed in this thesis uses a mantissa of 0.000...0 and an exponent of
1000...0 to indicate a zero value. The post normalization process must also sense
0.000...0 in the mantissa and set the exponent to the required value.
E. OVERFLOW, UNDERFLOW, AND ROUNDING
Underflow occurs when an operation produces a result that is smaller in
magnitude than the smallest representable non-zero magnitude. This can occur
in a multiplication process when two negative exponents are added and the result
is more negative than 1000...0. This condition can be sensed by noting a sign
change out of the exponent adder when two negative exponents are added.
Underflow could also be caused by a post normalization in addition or multipli-
cation. In the system developed in this thesis, when underflow is sensed, the ex-
ponent is set to 1000...0 and the mantissa is set to 0.000...0.
Overflow occurs when an operation produces a result that is larger in magnitude
than the largest representable value. This can occur in multiplication when
two positive exponents are added and the result is larger than 0111...1. This
condition can be sensed by noting a sign change out of the exponent adder.
Overflow could also be caused by a post normalization in addition. In the system
developed in this thesis, when overflow is sensed, the exponent is set to 0111...1
and the mantissa is set to its largest magnitude while preserving its sign. A design
may also incorporate an "error bit" which would be set if overflow occurs. As
discussed earlier, because the maximum growth of data in a single stage of an
FFT algorithm is known, certain hardware designs for specific systems may preclude
the possibility of overflow. It is the responsibility of a system designer to
use safeguards appropriate to the system.
The floating point operations of addition, subtraction, and multiplication in-
crease the number of bits in the mantissa. There are many ways to deal with
these extra bits. The simplest is truncation, which simply ignores these extra bits.
Truncation throws away information, and results in a bias; the number to be
stored is smaller than the true value [Ref. 16: p. 115].
Rounding, a commonly used method, adds half the value of the least signif-
icant bit position to the number before truncation. This results in a smaller av-
erage bias than the truncation method.
The technique of jamming was proposed by von Neumann to reduce overall
error. This method "jams" a 1 into the least significant bit position of the result,
regardless of the values of the extra bits. This method is as fast as truncation,
but has a much smaller bias. Other techniques such as a round-to-zero scheme
are available which use two bits of information to produce a result with zero bias.
[Ref. 16: p. 116]
There is another consideration in the "extra bit" problem: how many of the
extra bits should be retained in an addition alignment process? Consider a 7-bit




Reference 16: pp. 116-118 discusses this problem and shows that "at most three
bits are needed beyond that required of the number system." These three bits
may be needed in a round-to-zero zero bias technique. If simple rounding is used,
only two extra bits need be saved.
The VLSI cells developed in this thesis use the simple rounding technique.
The designs are easily modified for jamming or truncation. Other more complex
techniques can be used with more modifications.
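The three simplest ways of discarding extra bits can be compared with a small sketch. It assumes non-negative integer magnitudes with `extra` low-order bits to be dropped; the helper `mean_error` is a hypothetical measure of average bias over all bit patterns, expressed in units of the discarded least significant bit.

```python
def truncate(value: int, extra: int) -> int:
    """Ignore the extra bits."""
    return value >> extra


def round_half_up(value: int, extra: int) -> int:
    """Add half of the retained LSB weight, then truncate."""
    return (value + (1 << (extra - 1))) >> extra


def jam(value: int, extra: int) -> int:
    """Force a 1 into the retained LSB regardless of the extra bits."""
    return (value >> extra) | 1


def mean_error(method, kept: int = 4, extra: int = 3) -> float:
    """Average of (stored - true) over every (kept + extra)-bit magnitude."""
    count = 1 << (kept + extra)
    errors = [(method(v, extra) << extra) - v for v in range(count)]
    return sum(errors) / count
```

With uniformly distributed bit patterns, this measure gives truncation a clearly negative mean error while rounding and jamming both land close to zero. Note that rounding can carry out of the retained bits, which a hardware design must accommodate.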
IV. HARDWARE
Registers and full adder cells form the basis of a pipelined adder design. Full
adders are also used in many multiplier designs. Because most of the active chip
surface will be composed of these two devices, many potential designs were eval-
uated. To keep comparisons on an equal basis, all layout simulations were made
using "typical" CMOS specifications (see Appendix A for values), Vdd equal to
+5.0 volts, and with lambda equal to 1.5 microns.
A. ADDER DESIGNS
Before designs for adders could be evaluated, a decision on whether to use
look-ahead carry or ripple-carry adders was made. Figure 11 on page 34 [Ref.
20: p. 208] shows a gate-level design of a 4-bit look-ahead carry adder. Notice
that several gates in the design have five inputs. Figure 12 on page 35 shows a
gate-level design of a full adder that may be used in a ripple-carry circuit. The
number of logic gate levels a signal passes through is sometimes referred to as the
depth of the design. For a 4-bit ripple carry adder, the carry-out of the fourth
stage is nine logic levels deep (one XOR, four AND's, and four OR's). The
carry-out of the look-ahead carry adder is only three deep, but that can be de-
ceiving; with CMOS logic, the maximum delay of a gate is proportional to the
number of inputs in that gate [Ref. 11: p. 188]. CMOS delays are also sensitive
to the number of gates driven by a particular gate (fan-out).
In order to more adequately evaluate the potential speed increase using look-





























































Figure 12. Full adder design.
speed advantage could not be obtained for the look-ahead device due to the gates
with five inputs, and the larger fan-out of the look-ahead carry design. Reference
21: p. 155 also found that "the speed improvement [of a look-ahead carry adder]
over a ripple adder is not significant for a small number of bits." In a pipelined
arrangement only a relatively small number of additions will be done in any particular
pipeline stage. Due to the necessity of producing simple, regular substructures
with simple interfaces, and the limited speed improvement of a
look-ahead carry adder in a pipelined design, ripple-carry adders were selected
for use.
Many designs exist for implementing a full adder function. Table 3 on page
36 (from Ref. 12: p. 51) compares a conventional CMOS design with three other
logic types. While it has the lowest power consumption, the conventional CMOS
design is one of the slowest and has a much higher transistor count. If the tran-
sistor count could be reduced, the speed and size disadvantages of conventional
CMOS logic might be minimized.
















Logic type             Frequency   Power      Power/Frequency   Transistors
Conventional CMOS      100 MHz     0.500 mW   5.00 μW/MHz       32
Pseudo-NMOS            100 MHz     0.828 mW   8.28 μW/MHz       22
Standard N-P Domino    130 MHz     0.671 mW   5.16 μW/MHz       22
Quasi N-P Domino       165 MHz     0.962 mW   5.83 μW/MHz       23
Figure 13 shows a gate-level implementation of the exclusive-or (XOR)
function. Using complementary logic, implementation of this design requires 16
transistors. Figure 14 on page 37 shows a transistor-level implementation of the
XOR function using two transmission gates and an inverter as proposed by Ref.
11: p. 317. This design requires only six transistors. A full adder implementa-
tion of the gate-level design in Figure 12 using these transmission gate XOR's
requires only 24 transistors. This is only one or two more transistors than other
logic designs shown in Table 3.




Figure 14. Exclusive-or function generation using transmission gates.
Many switch-level simulation programs will not satisfactorily simulate a
CMOS XOR gate designed using transmission gates. Reference 22 describes the
problem in detail and documents a computer program to modify input files and
allow switch-level simulation of circuits using transmission gates.
A transmission gate XOR layout was simulated using Spice. Transistor
widths were varied to achieve similar rise and fall times. The worst-case delay
of the layout was 1.0 nanoseconds (ns). Simulation of an XOR gate using the
NAND-OR-AND gates of Figure 13 on page 36 resulted in a maximum delay
of 2.75 ns.
Due to the speed and size advantages of the transmission gate XOR, several
full adder layouts using these XOR's were simulated. Transistor widths were
varied to achieve approximately equal rise and fall times, and to reduce the
carry-in to carry-out delay. Results of the simulation of the fastest full adder are
listed in Table 4 on page 38. Figure 15 on page 39 shows the layout design used
in the simulation.
Table 4. FULL ADDER MAXIMUM DELAY TIMES.




SUM-OUT delay 4.25 ns 2.75 ns
CARRY-OUT delay 4.5 ns 3.0 ns
The transmission gate has another useful feature; it produces the complement
of one of the inputs. Figure 16 on page 40 shows that, if the complement of A is
available, there is no need to invert A before the XOR gate; moving two con-
nection points will accomplish the same function. Since two's complement sub-
traction can be performed by inverting the subtrahend and asserting the carry-in
of the least significant bit of an adder, use of the transmission gate XOR will al-
low a subtraction cell design to have the same delays as a full adder. All sub-
traction cells developed in this thesis are modifications of full adder cells. Their
delay times are the same as the full adder's delays (Table 4).
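The relationship between the full adder and the subtraction cell can be illustrated at the bit level. The sketch below is a behavioral model only (bit lists are least significant bit first, and the function names are illustrative); it says nothing about transistor-level delays.

```python
def full_adder(a: int, b: int, carry_in: int):
    """One full adder: returns (sum, carry_out) from single-bit inputs."""
    half_sum = a ^ b
    return half_sum ^ carry_in, (a & b) | (half_sum & carry_in)


def ripple_add(a_bits, b_bits, carry_in=0):
    """Ripple-carry addition of equal-length bit lists, LSB first."""
    result, carry = [], carry_in
    for a, b in zip(a_bits, b_bits):
        bit, carry = full_adder(a, b, carry)
        result.append(bit)
    return result, carry


def ripple_subtract(a_bits, b_bits):
    """Two's complement a - b: invert the subtrahend and assert the LSB carry-in."""
    inverted = [bit ^ 1 for bit in b_bits]
    return ripple_add(a_bits, inverted, carry_in=1)
```

For example, `ripple_subtract([1, 0, 1, 0], [1, 1, 0, 0])` models 5 - 3 on four bits and yields the bits of 2.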
B. LATCH LAYOUTS
Pipelining requires devices to store logic states between stages. Various types
of registers, flip-flops, and latches have been successfully used to perform this
function. This thesis uses a compact pseudo 2-phase latch design proposed by
Ref. 11: p. 206. Figure 17 on page 40 shows the latch design which requires
eight transistors and four clock inputs. When $\phi_1$ is high (and $\bar{\phi}_1$ is low), the in-




























Figure 16. Exclusive-or function generation using complemented input.
low, the value on node N is stored. When $\phi_2$ goes high, the stored value is inverted
and Q is conditionally pulled high or low. This design has several potential











Figure 17. Pseudo 2-phase latch.
First, the design depends on the clock signals ($\phi_1$, $\bar{\phi}_1$, $\phi_2$, $\bar{\phi}_2$) to prevent
the "latch" from becoming transparent. This would occur if $\phi_1$ and $\phi_2$ were high
simultaneously. In a large design with many storage devices triggered off of clock
lines, local clock skew could cause problems with timing and transparency.
Figure 18 shows the design modified with two inverters. The first inverter
serves as a buffer for a single clock line. The second inverter produces a complement
for use. Since $\phi_1$ and $\phi_2$ must never be high at the same time, and the
latch also requires the complements of $\phi_1$ and $\phi_2$, the two signals can be used to
supply the four clock lines of the latch. Simulation results show that the small
amount of delay in level transition between $\phi_1$ and $\bar{\phi}_1$, and between $\phi_2$ and $\bar{\phi}_2$, does not
prevent proper functioning of the latch. This latch design was chosen due to its
small size and its requirement of only one clock line.
Figure 18. Modified clocks for pseudo 2-phase latch.
The latch design was refined with simulation in Spice to reduce the delay
times and create approximately equal rise and fall times. Inverters were sized to
produce rise and fall times of approximately 2 ns when driving the latches. The
inverter transistor widths vary in different cells depending on the number of
latches driven. To keep rise times fast and prevent skew problems, no inverter
pair drives more than four latches in any cell design. The latch transistors were
sized to produce a delay of 4 ns for each stage (input and output) of the latch.
Figure 19 shows the clock pulse shape for the minimum allowable clock period,
where $t_{logic}$ is defined as the longest logic delay in any stage of the pipeline. If
a symmetric clock wave form is desired, the minimum clock period becomes
$$T_{c_{min}} = 2 \cdot (t_{logic} + 4\ \text{ns}).$$










Figure 19. Clock pulse requirements. (Recoverable label: $t_{logic\,MAX}$ + 4 ns.)
C. MULTIPLIER DESIGNS
The following formula illustrates the operations required to multiply two 4-bit
binary numbers.
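$$
\begin{array}{cccccccc}
 & & & & x_3 & x_2 & x_1 & x_0 \\
 & & & \times & y_3 & y_2 & y_1 & y_0 \\ \hline
 & & & & x_0y_3 & x_0y_2 & x_0y_1 & x_0y_0 \\
 & & & x_1y_3 & x_1y_2 & x_1y_1 & x_1y_0 & \\
 & & x_2y_3 & x_2y_2 & x_2y_1 & x_2y_0 & & \\
 & x_3y_3 & x_3y_2 & x_3y_1 & x_3y_0 & & & \\ \hline
P_7 & P_6 & P_5 & P_4 & P_3 & P_2 & P_1 & P_0
\end{array} \tag{30}
$$
The partial products ($x_i y_j$'s) are formed by the AND function. The columns of
partial products are then added to produce the product.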
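The formation of partial products by AND gates and the column-wise addition can be modeled in a few lines. This sketch only reproduces the arithmetic; it sums each column with ordinary integer addition and does not model the carry-save array or any other adder arrangement. Bit lists are least significant bit first, and the function name is illustrative.

```python
def array_multiply(x_bits, y_bits):
    """Multiply two unsigned bit lists (LSB first) via AND-formed partial products."""
    columns = [0] * (len(x_bits) + len(y_bits))
    for i, x in enumerate(x_bits):
        for j, y in enumerate(y_bits):
            columns[i + j] += x & y      # partial product x_i AND y_j, weight 2**(i + j)
    product, carry = [], 0
    for column_sum in columns:
        total = column_sum + carry
        product.append(total & 1)        # product bit for this column
        carry = total >> 1               # carries move to more significant columns
    return product
```

For 4-bit operands the result has eight bits, the most significant of which arises purely from carries out of the lower columns.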
Reference 21 describes and analyzes three design schemes for adding the col-
umns of partial products. Two of the schemes (Figure 20 on page 44) strive to
minimize the time required for a multiplication by reducing the number of logic
levels in a design. These level reduction methods increase wiring complexity and
do not produce regular rectangular-shaped VLSI layouts. Pipelining allows a
system to have a high throughput rate which is not dependent on the number of
logic levels in a design; therefore, minimizing the number of logic levels is not of
prime importance. The third design type, a carry-save scheme, is shown in
Figure 21 on page 45 (From Ref. 21: p. 156). This array employs straightfor-
ward full adders in an architecture in which a modular structure and repetitive
layout can be maintained. Due to this regularity and modularity, a carry-save
array design was chosen for the multiplier layouts in this thesis.
The array multiplier can be "reshaped" slightly to form a rectangular array
as shown in Figure 22 on page 46 (from Ref. 11: p. 345). Since other devices,
such as adders, used in the thesis designs have rectangular layouts, the use of a















Figure 20. Wallace tree and Dadda multipliers. (Panels: block diagram of a 5x5 multiplier using Dadda's scheme; block diagram of a 5x5 multiplier using Wallace's scheme. In each panel a carry-look-ahead adder or a string of full adders produces the final product.)
rectangular shape in the multipliers allows for more efficient overall usage of
VLSI chip area.































Figure 21. Block diagram of a 5 by 5 multiplier using carry-save technique. (The final row is a string of full adders that could be replaced by a carry-look-ahead adder.)
The functions of cells in the array are shown in Figure 23 on page 47. The
partial products ($x_i y_j$'s) are formed in the individual multiplier cells by the AND
gate. This array design and layout keeps communication distances at a mini-
mum; once the operands are distributed to the multiplier cells, no cell communi-
cates to any cell other than its nearest neighbor. This eliminates many of the
delay times necessary in other multiplier types and makes the design simpler to
implement. This design can also be easily modified for various length operands






























































Figure 22. Parallel multiplier array.
appears well-suited for pipelining; latches can be placed between "stages" as nec-
essary to produce acceptable throughput rates.
The multiplier cells used in this thesis perform the functions shown in
Figure 23 on page 47. Because a CMOS AND gate is a NAND gate followed
by an inverter, the multiplier cells use NAND gates vice the AND gates shown.
Since the transmission gate XOR design can be arranged to accept an input or its
complement, this eliminates an inverter in the critical timing path. A gate level
design for the multiplier cell is shown in Figure 24 on page 48. Delay times ob-
tained by simulation using Spice are given in Table 5 on page 47. Figure 25 on
page 49 shows the layout of the multiplier cell used in the simulation.
Figure 23. Parallel multiplier cell
Table 5. MULTIPLIER CELL MAXIMUM DELAY TIMES.

EVENT:               X_i and Y_j change   P_i changes   C_i changes
Delay for P_{i+1}    5.75 ns              4.25 ns       2.75 ns
Delay for C_{i+1}    6.0 ns               4.25 ns       3.0 ns
D. OTHER CELL FEATURES.
In addition to the duties performed by the addition, subtraction, and multi-
plication cells, many other functions such as rounding, mantissa alignment, and
normalization are necessary. Brief descriptions of these additional functions are










Figure 24. Multiplier cell design.
A gate-level logic diagram for two's complement conversion, required when
changing between addition and multiplication, was shown in Figure 9 on page
20. Because a CMOS OR gate is composed of a NOR gate followed by an
inverter, the conversion speed can be increased slightly by modifying the design.
Figure 26 on page 50 shows the modified design; Table 6 lists the maximum de-
lay times of the modified design.
Table 6. TWO'S COMPLEMENT CONVERSION DELAY TIMES.

A_TC0     A_TC1     A_TC2     A_TCi
0.0 ns    1.0 ns    3.0 ns    [(i - 2) · 2 + 3] ns
As shown earlier, in the multiplication process, mantissa alignment may be
required. This shift requires subtracting 1 from the sum of the operand expo-
nents. Many systems follow the alignment process with the exponent adjustment
process. In a hardware pipelined system, if an operation might be required, the






































Figure 25. Multiplier cell layout
Figure 26. Two's complement converter (modified design). (Recoverable output labels: A_TC4, A_TC3, A_TC2, A_TC1, A_TC0.)
whether the process is necessary, and route signals is necessary. Waiting until
after the multiplication is complete to start the subtract 1 process is time con-
suming; the subtraction process must then ripple through more stages of a pipe-
line. Time and chip area can be saved by precomputing the value for the
exponent if alignment is necessary, and then "selecting" the appropriate exponent
value.
Figure 27 on page 51 shows a design implemented in the multiplication ex-
ponent addition layout to perform this computation process. This design is used
to modify all bits except the least significant bit, which is always the inverse of
the unmodified bit. Note that both the exponent and the modified exponent are
retained for possible use.
Figure 27. Design to subtract 1 from exponent.
There are two other possible adjustments to the exponent in the multipli-
cation process. Underflow requires setting the exponent to 1000...0; overflow re-
quires setting the exponent to 0111...1. Figure 28 on page 52 shows the
transmission gate network used to select the appropriate exponent at the end of
the multiplication process.
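The precompute-then-select idea for the product exponent can be sketched as below. The 12-bit default width, the flag names, and the order in which the flags are tested are assumptions for illustration; the thesis performs the selection with a transmission gate network, not sequential code.

```python
def select_exponent(exp_sum, shift_needed, underflow, overflow, n_bits=12):
    """Choose the product exponent from candidates that are all available up front."""
    unmodified = exp_sum
    decremented = exp_sum - 1                 # precomputed alongside the mantissa product
    most_negative = -(1 << (n_bits - 1))      # 1000...0
    most_positive = (1 << (n_bits - 1)) - 1   # 0111...1
    if underflow:
        return most_negative
    if overflow:
        return most_positive
    return decremented if shift_needed else unmodified
```

Because both the unmodified and decremented values exist before the mantissa result is known, the final step is only a selection.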
In the multiplication process, precomputation can also be used to speed up
the mantissa rounding process. Since there are only two possible locations of the
radix point, the rounding process can be performed from both starting points,
and both results stored until the location of the radix point is known and the
appropriate value is selected. When overflow occurs, the mantissa is set to either
1.000...0 or 0.111...1 (depending on its sign). When underflow occurs, the
mantissa is set to 0.000...0. A shift network very similar to the one used for exponent
selection is used to select the correct mantissa value from the four possible
choices.

















Figure 28. Exponent selection network. (Recoverable labels: EXPONENT MSB, ALL OTHER EXPONENT BITS.)
Figure 29 shows the gate-level diagram of the design used in the rounding
process. The carry-in of the least significant bit is set with the value of the




Figure 29. Rounding design.
Zero sensing is also required in the multiplication process. If one of the
multiplication operands is a true zero, and the exponent was not set to 1000...0
at the end of the process, the mantissa alignment in the next floating point
addition could be erroneous. (The larger magnitude mantissa might be shifted.)
With the floating point system proposed in this thesis, checking the two most
significant bits (excluding the sign bit) for a 1 is all that is necessary to check for
the true zero condition. This can be accomplished with a single NOR gate.
The first step in a floating point addition or subtraction is to subtract the
exponents. The difference is the amount of mantissa alignment necessary for the
smaller magnitude operand. The shift network used to align the mantissas is ca-
pable of shifts up to 31 bit positions to the right. This shift network uses the five
least significant bits (LSB's) of the difference to determine the number of places
shifted. With the two's complement number system used for exponents, the shift
value out of the exponent subtraction cells could be positive or negative. The
"sign" bit of the difference is used to determine which exponent to store and
which mantissa to align. Since the mantissa shift network requires positive bit
values to function properly, the exponent subtraction cells also concurrently take
the two's complement of the five LSB's. Both sets of LSB's are stored until the
sign of the operation is known, and the appropriate selection of control bits for
the shift network is made.
The mantissa alignment of the present design is limited to 31 bits. Modifications
to the design are possible to allow any magnitude shift. If the exponent
subtraction process indicates that more than 31 bit places of shift are necessary, the
shift is reduced to 31 places. For positive values of the difference in exponents,
this condition is indicated by the presence of a 1 in any bit position more significant
than the five LSB's. It is sensed with a string of OR gates as shown in
Figure 30 on page 54. Similarly, when the difference in exponents is negative, a
1 in any bit position more significant than the five LSB's of the two's complement
of the difference indicates a shift of more than 31 places. If a shift of more than
31 places is called for, 1's are loaded into the shift network control to obtain the
31 place shift. Figure 31 on page 55 shows a gate-level diagram of the network
used to determine whether to store the "A" exponent and shift the "B" mantissa
or vice versa. In this design, B is subtracted from A; the difference is labeled D.
A negative difference indicates that B's exponent was larger, so exponent B is
stored and the mantissa of A is shifted in the alignment process. The signals
POS1 and TC1 are produced by the chains of OR gates that sense 1's in bit positions
more significant than the 5 LSB's. The signal LOAD_1S is OR'ed with
each of the five LSB's of the selected difference. If LOAD_1S is high, all five
shift control bits are high. If LOAD_1S is low, the OR gates pass the shift bits
through unaltered.
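The control selection just described can be summarized in a behavioral sketch. Exponents are modeled as Python integers, and the function name and return layout are illustrative; the hardware computes the difference and its two's complement concurrently and selects between them, which this model collapses into a single branch.

```python
def shift_control(exp_a: int, exp_b: int):
    """Return (stored exponent, operand whose mantissa is shifted, 5-bit shift control)."""
    difference = exp_a - exp_b                 # D = A - B
    if difference < 0:
        stored, shifted, magnitude = exp_b, "A", -difference
    else:
        stored, shifted, magnitude = exp_a, "B", difference
    load_1s = (magnitude >> 5) != 0            # a 1 above the five LSB's
    control = (magnitude & 0b11111) | (0b11111 if load_1s else 0)
    return stored, shifted, control
```

Any difference of 32 or more therefore saturates to a 31 place shift.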
Figure 30. Ones' sensing design. (Recoverable input labels: D8, D7, D6, D5.)
The mantissa alignment network functions in two stages. First, the most
significant control bit shifts the mantissa either 0 or 16 places. Figure 32 on page
56 shows one cell of the shift network. When the control bit, C4, is a 0, the ith bit
is selected in the ith stage, resulting in a shift of 0 places. When the control bit is














Figure 31. Shift network control selection design.
The second stage of the shift network uses the four least significant control
bits, C3 through C0, to shift from 0 to 15 places. Figure 33 on page 57 shows a
cell and the decoders necessary to control the shift. The inverters are used to
"buffer" the long signal line runs and speed up overall operation. One of the upper
four transmission gates selects a shift of 0, 4, 8, or 12 places while the lower
transmission gates select shifts between 0 and 3 places. The combination results
in shifts of 0 to 15 places. The magnitude of the shift is determined by the value
of the control bits.
Figure 32. First-stage mantissa alignment shift network.
The combination of both stages of the mantissa alignment allows for shifts
of 0 to 31 places. Since at most two extra bits need to be saved in the alignment
process, the 31-bit shift capability allows for mantissa lengths (including the
"sign" bit) of up to 30. (A 30-bit mantissa shifted 31 places to the right and "sign
filled" produces 32 bits, all equal to the sign bit.) If more than 30-bit length
mantissas are desired, the system must be modified.
Because a two's complement number system is used, any mantissa that is
shifted to the right requires filling in the most significant bits with the value of the
"sign" bit. (While the most significant bit of a two's complement number is not
technically a sign bit, its value determines whether the operand represented is
positive or negative. This leading bit is sometimes loosely referred to as a sign
bit in this thesis for simplicity of reference.) Due to this sign filling process, the
sign bit has a fan-out of 16 bit lines in the first-stage shifter. A designer must
buffer this sign bit as necessary to ensure proper operation of the system.
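The staged alignment shift can be modeled as three successive arithmetic right shifts. The assignment of C3 and C2 to the 0/4/8/12 selection and of C1 and C0 to the 0 to 3 selection is an assumption consistent with the description above, and the mantissa is modeled as a signed Python integer so that `>>` supplies the sign fill.

```python
def align_mantissa(mantissa: int, control: int) -> int:
    """Sign-filled right shift by 0..31 places, decomposed as in the two-stage network."""
    c4 = (control >> 4) & 1
    coarse = (control >> 2) & 0b11   # assumed: C3, C2 choose 0, 4, 8, or 12 places
    fine = control & 0b11            # assumed: C1, C0 choose 0 to 3 places
    first_stage = mantissa >> (16 * c4)
    return (first_stage >> (4 * coarse)) >> fine
```

The decomposition is equivalent to a single shift by the full control value, which is what allows the network to be built from a small number of selectable fixed shifts.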

















Figure 33. Second-stage shift network.
After the aligned mantissas are added (or subtracted), a post normalization
procedure is necessary. As discussed earlier, this normalization could be from 1
place to the right to many places to the left. The maximum left shift is N-1,
where N is the length of the mantissa (including the sign bit). For a 30-bit long
mantissa, provisions for shifts of from 1 bit to the right to 29 bits to the left are
necessary. This normalization can be accomplished with a shift network that is
a "mirror image" of the one described for a mantissa alignment if the proper
control signals are available.
To determine the required shift, the location of the left-most 01 or 10 bit
pattern must be sensed. This location can then be used to code the control lines
and adjust the exponent value. There are several possible ways to sense the bit
patterns. This thesis uses a "ripple" arrangement shown in Figure 34 on page
59. An EXCLUSIVE-OR gate's output is high if a 10 or 01 bit pattern exists.
The string of NOR gates and inverters senses and "remembers" the existence of a
1 out of an EXCLUSIVE-OR of more significant bits. Only one NAND gate
(and possibly none) will output a 0 for any possible pattern. The output of the
NAND gates can be sent to an encoder design to activate the control lines for the
post normalization network. This design has a drawback in the time required to
perform the operation; sufficient time must be provided to allow a signal to
propagate through the OR gates.
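The ripple sensing arrangement can be modeled behaviorally as follows. For readability this sketch uses active-high outputs, whereas the NAND gates described above are active-low; the mantissa is a bit string with the most significant bit first, and the function names are illustrative.

```python
def sense_radix_point(bits: str):
    """One-hot list: entry i is 1 where the left-most 01 or 10 pattern begins."""
    outputs, seen = [], False
    for left, right in zip(bits, bits[1:]):
        differs = left != right                   # EXCLUSIVE-OR of adjacent bits
        outputs.append(int(differs and not seen))
        seen = seen or differs                    # ripple chain remembers an earlier 1
    return outputs


def left_shift_needed(bits: str):
    """Position of the left-most pattern, or None when no pattern exists."""
    outputs = sense_radix_point(bits)
    return outputs.index(1) if 1 in outputs else None
```

A mantissa of all 0's or all 1's produces no active output, matching the "possibly none" case noted above.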
Figure 34. Radix point location sense design.
An encoder design is shown in Figure 35 on page 60. This design OR's 16
inverted inputs to produce each of the five control bits used in the post normalization
process. The inputs to the initial NAND gates are the NAND gate
outputs from the bit pattern sensing arrangement of Figure 34 on page 59. The
inputs to each control encoder are:
$$C_0 = S_1 + S_3 + S_5 + S_7 + S_9 + S_{11} + S_{13} + S_{15} + S_{17} + S_{19} + S_{21} + S_{23} + S_{25} + S_{27} + S_{29} + S_{31}, \tag{31}$$
$$C_1 = S_2 + S_3 + S_6 + S_7 + S_{10} + S_{11} + S_{14} + S_{15} + S_{18} + S_{19} + S_{22} + S_{23} + S_{26} + S_{27} + S_{30} + S_{31}, \tag{32}$$
$$C_2 = S_4 + S_5 + S_6 + S_7 + S_{12} + S_{13} + S_{14} + S_{15} + S_{20} + S_{21} + S_{22} + S_{23} + S_{28} + S_{29} + S_{30} + S_{31}, \tag{33}$$
$$C_3 = S_8 + S_9 + S_{10} + S_{11} + S_{12} + S_{13} + S_{14} + S_{15} + S_{24} + S_{25} + S_{26} + S_{27} + S_{28} + S_{29} + S_{30} + S_{31}, \tag{34}$$
$$C_4 = S_{16} + S_{17} + S_{18} + S_{19} + S_{20} + S_{21} + S_{22} + S_{23} + S_{24} + S_{25} + S_{26} + S_{27} + S_{28} + S_{29} + S_{30} + S_{31}, \tag{35}$$
where S is the NAND gate output of the EXCLUSIVE-OR between the two
most significant bits of the unnormalized mantissa. (The + sign represents an
OR operation.) Grouping of the same terms used in different encoders might
result in some reduction in gate numbers while increasing the wiring complexity.
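The encoder equations follow the binary expansion of the index: S_k feeds the OR gate for control bit C_b exactly when bit b of k is 1. A minimal sketch, using active-high one-hot inputs for clarity (the hardware inputs are inverted) and illustrative function names:

```python
def encoder_inputs(bit: int):
    """Indices k of the S_k terms OR'ed together to form control bit C_bit."""
    return [k for k in range(32) if (k >> bit) & 1]


def encode_shift(select):
    """Map a 32-entry one-hot list S_0..S_31 to the control bits [C0, C1, C2, C3, C4]."""
    return [int(any(select[k] for k in encoder_inputs(bit))) for bit in range(5)]
```

Each control bit therefore has 16 inputs, and an input pattern with no active entry encodes a shift of zero.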
Figure 35. Encoder design for post normalization control lines.
V. DESIGN COMPARISONS
To demonstrate pipelining techniques and to compare the speed capabilities and
layout area requirements of the multipliers and adders developed in this
thesis, several example layouts are shown in the appendix. As a design example,
and to keep comparisons consistent, all layouts use no more than three
multiplier cells or five full adder or subtracter cells in a single pipeline
stage. The maximum logic delays, $t_{\text{logic}}$, for three multiplier or
five full adder cells with $\lambda = 1.5$ microns are
\begin{align}
t_{\text{logic-3mul}(\max)} &= 6.0 + 2 \cdot 4.25\ \text{ns} = 14.5\ \text{ns} \tag{36}\\
t_{\text{logic-5add}(\max)} &= 4.5 + 4 \cdot 3.0\ \text{ns} = 16.5\ \text{ns} \tag{37}
\end{align}
Note that the logic delay for the five full adder cells is the critical timing path.
The logic delays in pipeline stages with other logic functions were kept below the
five full adder maximum delay times to make the addition cells the critical timing
path in determining the maximum clock rate. Table 7 on page 62 lists the pre-
dicted maximum pipeline throughput possible using that critical timing path for
various feature sizes.
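The arithmetic behind Equations (36) and (37) and the scaling of the Table 7 entries can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the method used in the thesis: it reads each equation as the delay of one cell plus a fixed delay for each additional cell in the stage, and it assumes that clock rate scales inversely with feature size, which is the relationship the Table 7 entries exhibit. The function names are hypothetical.

```python
def stage_logic_delay(first_cell_ns, later_cell_ns, n_cells):
    """Delay of n_cells in series: one first-cell delay plus (n_cells - 1)
    later-cell delays (assumed reading of Equations 36 and 37)."""
    return first_cell_ns + (n_cells - 1) * later_cell_ns


def scale_frequency(f_mhz, from_feature_um, to_feature_um):
    """Scale a clock rate assuming it is inversely proportional to feature size."""
    return f_mhz * from_feature_um / to_feature_um


T_MUL = stage_logic_delay(6.0, 4.25, 3)   # three multiplier cells
T_ADD = stage_logic_delay(4.5, 3.0, 5)    # five full adder cells
CRITICAL = max(T_MUL, T_ADD)              # the adder chain sets the clock

if __name__ == "__main__":
    print("3 multiplier cells:", T_MUL, "ns")
    print("5 full adder cells:", T_ADD, "ns")
    print("critical path:", CRITICAL, "ns")
    for feature in (2.0, 1.2):
        print(feature, "micron asymmetric:",
              round(scale_frequency(40.8, 3.0, feature), 1), "MHz")
        print(feature, "micron symmetric:",
              round(scale_frequency(24.4, 3.0, feature), 1), "MHz")
```

Starting from the 3.0 micron entries, the scaled values agree with the 2.0 micron and 1.2 micron entries of Table 7.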
Table 7. PREDICTED MAXIMUM PIPELINE THROUGHPUT.
FEATURE SIZE:       3.0 microns   2.0 microns   1.2 microns
Asymmetric clock    40.8 MHz      61.2 MHz      102.0 MHz
Symmetric clock     24.4 MHz      36.6 MHz      61.0 MHz

For comparison purposes, the layout size requirements of the example systems
were determined. 16-bit mantissa (including sign bit), 12-bit exponent
floating point devices are compared with 16-bit and 29-bit fixed point devices.
The 29-bit fixed point example was chosen due to its association with a 16-bit
floating point format with a 1024-point FFT. A 1024-point FFT requires ten
radix-2 FFT butterfly stages. Using the maximum growth factor per FFT stage




Noting that 2^13 = 8192, 13 extra bits should be sufficient to prevent overflow
and eliminate the need to scale results between stages. Table 8 on page 63 shows
the VLSI chip area required for various devices. The table listings do not include
areas for pads, line routing between devices, etc., so the final chip size require-
ments will be significantly larger than that shown.
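The word-growth argument can be illustrated numerically. The sketch below is an assumption-laden example rather than the thesis's own calculation: it takes the per-stage growth factor to be 1 + sqrt(2), a commonly quoted worst-case bound for one component of a radix-2 butterfly output, and the function name is hypothetical. Under that assumption, ten stages require 13 extra bits, which added to a 16-bit word gives the 29-bit fixed point format used for comparison.

```python
import math


def growth_bits(stages, growth_per_stage):
    """Return (total growth, extra integer bits needed to hold that growth)."""
    total_growth = growth_per_stage ** stages
    return total_growth, math.ceil(math.log2(total_growth))


# Assumed worst-case growth per radix-2 stage (not taken from the thesis text).
ASSUMED_GROWTH = 1.0 + math.sqrt(2.0)

TOTAL, EXTRA_BITS = growth_bits(10, ASSUMED_GROWTH)
WORD_LENGTH = 16 + EXTRA_BITS

if __name__ == "__main__":
    print("growth over ten stages:", round(TOTAL, 1))
    print("extra bits:", EXTRA_BITS)
    print("fixed point word length:", WORD_LENGTH)
```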
Individual cell layouts were made with the goal of producing function designs
that are composed of individual cells that abut other cells and require no addi-
tional connections or routings except at the periphery. Figures 36, 37, and 38
show the arrangement of cells necessary to produce a 16-bit mantissa multiply, a
12-bit exponent addition (used in a floating point multiply process), and a 12-bit
exponent subtraction (used in the exponent compare process of a floating point
addition or subtraction). Due to the complicated flow paths in a floating point
addition process, the number of cell designs required to perform the function with
no extra internal routings would be prohibitive. Use of the standard cell designs
to perform a floating point addition presently requires numerous connections and
wire routings between various functional blocks.
The above examples are for comparison only. The cell system described in
this thesis is capable of producing designs with any number of logic cells in a
pipeline stage. The theoretical maximum pipeline throughput of a multiplier with
pipeline latches between each multiplier cell using 1.2 micron feature size tech-
nology is 178.5 MHz. This extreme speed is probably not practically achievable;
I/O interfaces will normally limit overall system throughput rates.







16-bit fixed point multiplication                  12.30 mm^2   5.47 mm^2    2.68 mm^2
29-bit fixed point multiplication                  43.15 mm^2   19.0 mm^2    9.40 mm^2
16-bit mantissa, 12-bit exponent multiplication    16.6 mm^2    7.38 mm^2    3.62 mm^2
16-bit fixed point addition                        2.06 mm^2    0.917 mm^2   0.449 mm^2
29-bit fixed point addition                        4.99 mm^2    2.22 mm^2    1.09 mm^2
16-bit mantissa, 12-bit exponent addition          14.27 mm^2   6.43 mm^2    3.11 mm^2
16-bit fixed point radix-2 FFT                     0.616 cm^2   0.274 cm^2   0.134 cm^2
29-bit fixed point radix-2 FFT                     2.03 cm^2    0.893 cm^2   0.438 cm^2
floating point radix-2 FFT                         1.52 cm^2    0.676 cm^2   0.330 cm^2
Reference 23, a thesis completed at the Naval Postgraduate School, docu-
ments the chip area and performance specifications of a pipelined multiplier de-














































































































































































































































































































































Figure 38. Cell structure for a 12-bit exponent addition.
That multiplier design used an array type functional layout similar to the
multiplier designs of this thesis. A 16-bit unsigned integer multiplier array
required approximately 99,328 mils^2 (256 mils by 388 mils) if fabricated with a
1.5 micron feature size. This area requirement is equal to 64.1 mm^2. The
Genesil layout operating speed was estimated at 38 MHz. In comparison, the
16-bit fixed point multiplier design example developed in this thesis (scaled to
a 1.5 micron feature size) requires 3.08 mm^2 and has a predicted maximum clock
frequency of 48.8 MHz with a symmetric clock waveform.
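The unit conversion and the resulting comparison can be verified as follows. This is an illustrative check only; the function names are hypothetical, and the last calculation assumes that the 12.30 mm^2 entry for the 16-bit fixed point multiplier corresponds to a 3.0 micron feature size and that area scales with the square of the feature size.

```python
MM_PER_MIL = 0.0254  # one mil is 0.001 inch


def mils_rect_to_mm2(width_mils, height_mils):
    """Area in mm^2 of a rectangle whose sides are given in mils."""
    return width_mils * height_mils * MM_PER_MIL ** 2


def scale_area(area_mm2, from_feature_um, to_feature_um):
    """Scale a layout area assuming it varies as the square of feature size."""
    return area_mm2 * (to_feature_um / from_feature_um) ** 2


GENESIL_MM2 = mils_rect_to_mm2(256, 388)
AREA_RATIO = GENESIL_MM2 / 3.08        # compiled layout vs. custom cells
SPEED_RATIO = 48.8 / 38.0              # custom cells vs. compiled layout
SCALED_CUSTOM_MM2 = scale_area(12.30, 3.0, 1.5)

if __name__ == "__main__":
    print("Genesil array area:", round(GENESIL_MM2, 1), "mm^2")
    print("area ratio:", round(AREA_RATIO, 1))
    print("clock rate ratio:", round(SPEED_RATIO, 2))
    print("custom multiplier at 1.5 micron:", round(SCALED_CUSTOM_MM2, 3), "mm^2")
```

On these figures the custom cell multiplier occupies roughly one twentieth of the area of the compiled design.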
VI. CONCLUSIONS AND RECOMMENDATIONS
A. SUMMARY AND CONCLUSIONS
The purpose of this thesis was to determine the appropriateness and feasibil-
ity of using full-custom design methods to produce pipelined adders, subtracters,
and multipliers for FFT type operations. The full-custom scaleable CMOS cell
system developed produces high speed design layouts compact enough to put an
entire radix-2 FFT butterfly on a single VLSI chip. The Genesil Silicon Compiler
system, using its present library, was unable to produce designs compact enough
to meet that objective.
The cell system developed in this thesis allows a system designer to evaluate
the layout size requirements and difficulty tradeoffs for various mathematical
operations in both fixed and floating number systems. The system enables a de-
signer to make educated choices as to feature size and system characteristics be-
fore starting a layout process. The system will reduce the workload required to
complete a layout, but it will by no means eliminate it. The process of layout
verification, simulation, and testing will also be very time consuming.
B. RECOMMENDATIONS
Due to the difficulty of generating random logic control functions, design
verification, timing simulations, and clock and bus routings on a limited pro-
duction full-custom design, the cell system developed in this thesis may not be
appropriate for direct implementation. Silicon compilers were developed to
overcome the shortcomings of the more time consuming layout methods. Several
silicon compiler systems have the capability to allow a user to custom design li-
brary cells. Genesil's Genport system, not currently licensed at the Naval Post-
graduate School, provides IC design experts with the capability to extend the
Genesil Compiler Library with fully parameterized cells that work with Genesil
verification and floorplanning tools [Ref. 24]. The capability of the Genport
system should allow incorporation of the basic cell designs developed in this thesis
into the Genesil library. This would greatly increase the ability to verify designs
and aid in wiring and pad placement of the chip.
Designs ported into Genesil lose their flavorlessness and scalability; a
specific fabrication line technology and feature size must be chosen as part of the
import process. This is not a significant disadvantage; a designer can predict the
size of the layouts and the pipeline throughput by examining the characteristics
of the cell system developed in this thesis. A choice of feature size and fabrication
line can then be made before the cells are ported into Genesil. As an added
benefit, Genesil is supported by fabrication lines that can produce radiation
































































Figure 39. Four adder cell layouts.
APPENDIX A. SPICE SIMULATION EXAMPLE
All Spice simulations were performed with lambda = 1.5 microns (3.0 micron
feature size) and VDD equal to +5.0 volts. Due to Spice's inability to simulate
large numbers of transistors, modifications to layouts were made when necessary
to obtain results. The example shown is the simulation used to determine the full
adder CARRY-OUT delay when only CARRY-IN changes (see Table 4 on
page 38). To perform this simulation, four full adder cells were laid out in "series"
as shown in Figure 39 (a). After unsuccessful attempts to simulate the circuit,
the layout was modified by removing the EXCLUSIVE-OR gates and other logic
elements not in the CARRY-IN to CARRY-OUT logic path. The capacitance
of this modified layout shown in Figure 39 (b) was adjusted using data from the
original circuit to keep the load on logic devices the same as in the unmodified
circuit. The Spice deck used for the simulation and the results are shown on the
following pages. Note that the results show that the CARRY "ripples" through
the circuit with a delay of slightly less than 3 ns per adder cell.
*******06/05/90 ******** SPICE 2G.6 3/15/83 ********14:30:07*****
*** SPICE DECK CREATED FROM ADD2T.SIM, TECH=SCMOS






KP       1.38d-05    3.11d-05
GAMMA    0.590       1.319
PHI      0.660       0.743
CGSO     4.00d-10    5.20d-10
CGDO     4.00d-10    5.20d-10
CGBO     4.00d-10    5.20d-10
RSH      95.000      20.000
CJ       2.00d-04    3.20d-04
MJ       0.500       0.500
CJSW     4.50d-10    9.00d-10
MJSW     0.330       0.330
JS       1.00d-04    1.00d-04
TOX      5.00d-08    5.00d-08
NSUB     5.00d+15    2.50d+16
TPG      -1.000      1.000
XJ       6.00d-07    8.00d-07
LD       5.00d-07    6.40d-07
UO       200.000     450.000
UCRIT    8.00d+04    8.00d+04
UEXP     0.150       0.150
UTRA     0.300       0.300
VMAX     5.00d+04    5.00d+04










TIME         VOLTAGE (first plotted variable)
9.000d-09    0.000d+00
9.500d-09    0.000d+00
1.000d-08    0.000d+00
1.050d-08    1.250d+00
1.100d-08    2.500d+00
1.150d-08    3.750d+00
1.200d-08    5.000d+00
1.250d-08    5.000d+00
1.300d-08    5.000d+00
1.350d-08    5.000d+00
1.400d-08    5.000d+00
1.450d-08    5.000d+00
1.500d-08    5.000d+00
1.550d-08    5.000d+00
1.600d-08    5.000d+00
1.650d-08    5.000d+00
1.700d-08    5.000d+00
1.750d-08    5.000d+00
1.800d-08    5.000d+00
1.850d-08    5.000d+00
1.900d-08    5.000d+00
1.950d-08    5.000d+00
2.000d-08    5.000d+00
2.050d-08    5.000d+00
2.100d-08    5.000d+00
2.150d-08    5.000d+00
2.200d-08    5.000d+00
2.250d-08    5.000d+00
2.300d-08    5.000d+00
2.350d-08    5.000d+00
2.400d-08    5.000d+00
2.450d-08    5.000d+00
2.500d-08    5.000d+00
2.550d-08    5.000d+00
2.600d-08    5.000d+00
2.650d-08    5.000d+00
2.700d-08    5.000d+00
2.750d-08    5.000d+00
2.800d-08    5.000d+00
2.850d-08    5.000d+00
2.900d-08    5.000d+00
2.950d-08    5.000d+00
3.000d-08    5.000d+00
3.050d-08    3.750d+00
3.100d-08    2.500d+00
3.150d-08    1.250d+00
3.200d-08    8.272d-14
3.250d-08    0.000d+00
3.300d-08    0.000d+00
3.350d-08    0.000d+00
3.400d-08    0.000d+00
3.450d-08    0.000d+00
3.500d-08    0.000d+00
3.550d-08    0.000d+00
3.600d-08    0.000d+00
3.650d-08    0.000d+00
3.700d-08    0.000d+00
3.750d-08    0.000d+00
3.800d-08    0.000d+00
3.850d-08    0.000d+00
3.900d-08    0.000d+00
3.950d-08    0.000d+00
4.000d-08    0.000d+00
4.050d-08    0.000d+00
4.100d-08    0.000d+00
4.150d-08    0.000d+00
4.200d-08    0.000d+00
4.250d-08    0.000d+00
4.300d-08    0.000d+00
4.350d-08    0.000d+00
4.400d-08    0.000d+00
4.450d-08    0.000d+00

























APPENDIX B. STANDARD CELL DESCRIPTIONS AND LAYOUTS
A. EXPONENT ADDITION FUNCTION CELLS
The standard cells used to perform the exponent addition function used in a
floating point multiplication are listed below. These cells not only sum the two
exponents, but also simultaneously produce a modified sum by subtracting 1 from
the sum. These two values are then stored in latches until the result of the mul-
tiply is completed and the appropriate value can be "selected". The arrangement
of the cells is shown in Figure 38 on page 66. The following list describes the
cells and their functions:
Cell AA: Figure 40 on page 80; a standard full adder cell. The inputs are la-
beled A, B, and Ci (CARRY-IN); the outputs are Co
(CARRY-OUT) and S (SUM). Note that a clock line (CLK) passes
through the cell although it is unused by the cell. Also note that a
ground bus (GND) runs along the lower edge of the cell. The cell's
connections to the positive bus (VDD) are "made" by placing the cell
directly below a cell with the VDD bus running along its lower edge.
Cell AAL: Figure 41 on page 81; a full adder cell with the same inputs and
outputs as cell AA. The cell has a slightly modified layout and is de-
signed to be used as the last logic cell before a set of pipeline latches.
Cell AAR: Figure 42 on page 82; a full adder cell produced by modifying cell
AA. It is designed to be the first full adder cell used in an addition
process. The CARRY-IN logic line is tied to ground (a logic zero).
Cell AA1: Figure 43 on page 83; a full adder cell designed to be the first adder
cell in a pipeline stage. The CARRY-IN line is arranged to provide
a connection to a latch instead of a previous adder cell.
Cell AB: Figure 44 on page 84; a standard full adder cell. The VDD bus runs
along the lower edge of the cell. The cell must be placed directly be-
low a cell with a GND bus running along its lower edge.
Cell AB1: Figure 45 on page 85; a full adder cell designed to be the first adder
cell in a pipeline stage. The CARRY-IN line is arranged to provide
a connection to a latch instead of a previous adder cell.
Cell CA: Figure 46 on page 86; a "clock" cell that provides a CLK output to
cells in its row by inverting the CLKin signal. This cell has a GND
bus running along its lower edge as well as GND and VDD busses
running vertically.
Cell CB: Figure 47 on page 87; a clock cell similar to cell CA, but with a VDD
bus running along its lower edge.
Cell CI: Figure 48 on page 88; a clock cell designed to be used at the top of
the layout. It has a GND bus running along its top edge and a VDD
bus running along its lower edge.
Cell ENA: Figure 49 on page 89; a "two-level" cell that performs the function
of subtracting 1 from the exponent sum. Figure 27 on page 51 shows
the gate level logic of the subtraction process. The signal S is the sum
produced by one of the full adder cells. The signal Sm is the modified
sum (sum-1). The CARRY-IN (Ci) and CARRY-OUT (Co) signals
are used in the subtraction process. Cell ENA also contains four
latches which store S and Sm for two clock cycles.
Cell ENAL: Figure 50 on page 90; similar to cell ENA, but designed to be the
last logic cell before a pipeline latch stage. To prevent the logic from
being in the critical timing path, this cell latches S and Ci prior to
performing the subtraction function.
Cell ENAW: Figure 51 on page 91; similar to cell ENA, but designed to be the
first logic cell after a pipeline latch stage.
Cell ENB: Figure 52 on page 92; similar to cell ENA in function, but has the
VDD bus running across its center and the GND bus running along
its lower edge.
Cell ENBL: Figure 53 on page 93; similar to cell ENB, but designed to be the
last logic cell before a pipeline latch stage. To prevent the logic from
being in the critical timing path, this cell stores S and Ci in latches
prior to performing the subtraction function.
Cell ENBR: Figure 54 on page 94; designed as the first cell in the subtract 1
from exponent sum process. Sm is the inverse of S. Other functions
remain the same as in cell ENB.
Cell ENBW: Figure 55 on page 95; similar to cell ENB, but designed to be the
first logic cell after a pipeline latch stage.
Cell RA: Figure 56 on page 96; a pipeline latch cell designed to store two values
(A and B). This cell is used both to store the exponent bits prior to
the addition process, and to store the sum and modified sum (S and
Sm) bits after the addition and subtract 1 processes. Cell RA has a
GND bus running along its lower edge.
Cell RAI: Figure 57 on page 97; a latch cell designed to be used at the top of
the layout. Cell RAI stores two values and has a GND bus running
along its top edge and a VDD bus running along its lower edge.
Cell RAIW: Figure 58 on page 98; performs the same function, but is slightly
"wider" than cell RAI. The extra width is needed in cells used below
RAIW.
Cell RAW: Figure 59 on page 99; a latch cell similar to cell RA, but slightly
wider.
Cell RA1: Figure 60 on page 100; a latch cell that stores three values. It is used
to the left of cell AAL, and stores the CARRY-OUT for use in the
next pipeline logic stage.
Cell RB: Figure 61 on page 101; a latch cell that performs the same function
as cell RA, but has a VDD bus running along its lower edge.
Cell RBW: Figure 62 on page 102; a latch cell similar to cell RB, but slightly
wider.
Cell RBI: Figure 63 on page 103; a latch cell similar to cell RB, but modified
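
The function performed by the exponent addition cells can be summarized with a bit-level behavioral sketch. The code below is an illustration under assumptions, not the gate-level logic of the cells: it models the full adder row with the first CARRY-IN tied to zero, and a subtract-1 row in which a borrow ripples upward for as long as the sum bits are zero, so that the first modified sum bit is the inverse of the first sum bit. The names, the modulo-2^12 arithmetic, and the `use_modified` selection flag are hypothetical.

```python
def full_adder(a, b, ci):
    """One full adder cell: returns (sum, carry_out)."""
    s = a ^ b ^ ci
    co = (a & b) | (ci & (a ^ b))
    return s, co


def to_bits(value, width):
    """Least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def exponent_add(a, b, width=12):
    """Return (S, Sm): the exponent sum and the modified sum S - 1,
    both modulo 2**width, produced in the same pass over the bits."""
    carry = 0    # first adder cell has CARRY-IN tied to ground
    borrow = 1   # subtracting 1 starts a borrow at the least significant bit
    s_bits, sm_bits = [], []
    for a_bit, b_bit in zip(to_bits(a, width), to_bits(b, width)):
        s, carry = full_adder(a_bit, b_bit, carry)
        sm_bits.append(s ^ borrow)   # modified sum bit
        borrow &= s ^ 1              # borrow continues only while s is 0
        s_bits.append(s)
    return from_bits(s_bits), from_bits(sm_bits)


def select(s, sm, use_modified):
    """Choose which stored value is passed on once the multiply completes."""
    return sm if use_modified else s


if __name__ == "__main__":
    s, sm = exponent_add(1000, 24)
    print(s, sm, select(s, sm, use_modified=True))
```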



























Figure 40. Cell AA layout











































































































































































































































































































































































































































































































































































































































































































































Figure 63. Cell RBI layout
B. MULTIPLICATION FUNCTION CELLS.
The standard cells used to perform the multiplication function are listed be-
low. An example arrangement of cells to perform a 16-bit mantissa multiply is
shown in Figure 36 on page 64. The following list describes the cells and their
functions:
Cell BMA: Figure 64 on page 108; a cell used to provide space for vertical
VDD and GND bus lines. The cell also contains a latch to store the
Y bit used on that level for the multiply process. The Y bit is sent on
line Yp to the cell one clock cycle prior to use, stored, and then dis-
tributed to cells nearby. The control lines for the latch, PI and P2,
are driven by CMRC cells located above the BMA cells.
Cell BMB: Figure 65 on page 109; similar to cell BMA, but has a VDD (vice
GND) bus running along its lower edge.
Cell CAE: Figure 66 on page 110; a "clock" cell that contains vertical GND
and VDD busses. Also contains a clock inverter driven by the CLKin
line.
Cell CMRA: Figure 67 on page 111; the uppermost clock cell in the multiplier.
It contains an inverter that drives the CLKin line for all the clock cells
in that column. The cell also contains vertical VDD and GND
busses, an inverter to drive the CLK lines in nearby cells in the same
row, and clock inverters that drive the latch control lines (P1 and P2)
for the BMA and BMB cells located below the CMRA cells.
Cell CMRB: Figure 68 on page 112; a "three level" cell which provides space
for vertical VDD and GND busses. It also contains clock inverters
to provide CLK signals to nearby cells in the same row.
Cell CMRC: Figure 69 on page 113; similar to cell CMRB, but with the hori-
zontal VDD and GND busses reversed.
Cell CN: Figure 70 on page 114; a clock cell that functions similarly to cell
CAE, but with a VDD bus running along the bottom.
Cell CP: Figure 71 on page 115; a clock cell similar to cell CAE, but with ad-
ditional lines which pass signals through the cell.
Cell CS: Figure 72 on page 116; a clock cell used in the "selection" row at the
bottom of the multiplier. It contains eight lines necessary to pass the
select control line signals through the cell.
Cell CYA: Figure 73 on page 117; a clock cell used in the top row to provide
VDD, GND, and CLK distribution to the latches that store and distribute
the Y mantissa bits.
Cell CYC: Figure 74 on page 118; similar to cell CYA, but with no GND bus
at the top.
Cell FA: Figure 75 on page 119; a cell that shifts the Y mantissa bits to the
right one place every multiplier level to provide for proper distribution
of the Y bits. Also grounds the PRODUCT-IN (Pi) lines of the left-
most column of multiplier cells.
Cell FB: Figure 76 on page 120; similar to cell FA, but has a GND bus run-
ning along its lower edge.
Cell FBA: Figure 77 on page 121; provides space for vertical VDD and GND
busses running between CYC cells.
Cell FBB: Figure 78 on page 122; similar to cell FBA, but with a VDD bus
running along its bottom edge.
Cell FYA: Figure 79 on page 123; similar in function to cell FA.
Cell FYB: Figure 80 on page 124; similar in function to cell FB.
Cell M1A: Figure 81 on page 125; a multiplier cell. The inputs are Pi, Ci, X,
and Y; the outputs are Po and Co. Figure 24 on page 48 shows the
logic functions of this cell.
Cell M1B: Figure 82 on page 126; similar in function to cell M1A, but with a
VDD bus running along its lower edge.
Cell NA: Figure 83 on page 127; a three level cell. The top level functions as
a full adder cell that adds the PRODUCT and CARRY bits from
the bottom row of multiplier cells. The second row performs the al-
ternate rounding function (see Figure 29 on page 52) on the product
term one bit less in significance than the product term provided by the
full adder level above. The center level also contains two latches to
store the product (P) and alternate product (Pm) bits. The lower
level of cell NA contains an OR gate used in a zero sensing scheme,
two latches used to store P and Pm, and a pair of inverters used to
drive the control lines for the latches used in the cell.
Cell NAE: Figure 84 on page 128; functions similarly to cell NA, but is
designed to be the last logic cell in a pipeline stage. To prevent the
alternate rounding logic from being in the critical timing path, the cell
latches the Pi and Pi-1 bits before performing logic functions.
Cell NA1: Figure 85 on page 129; functions similarly to cell NA. Designed to
be the first logic cell in a pipeline stage.
Cell NB: Figure 86 on page 130; functions the same as cell NA, but with VDD
and GND bus locations switched.
Cell NBE: Figure 87 on page 131; functions the same as cell NAE, but with
GND and VDD bus locations similar to cell NB.
Cell NBF: Figure 88 on page 132; designed to be the last cell in the ripple
adder at the bottom of the multiplier. Unlike other NB type cells, no
adder cell is needed in the top level because the Ci and Pi bits will
always be zero. Cell NBF performs the last alternate rounding process
and contains latches and control inverters to store the appropriate
bits.
Cell NBS: Figure 89 on page 133; designed as the first cell in the ripple add
process at the end of the multiply. Similar in function and layout to
cell NB, it contains an additional line to obtain and store the product
term needed to start the alternate rounding process.
Cell NB1: Figure 90 on page 134; performs the same function as cell NA1, but
with the VDD and GND bus layouts of cell NB.
Cell REA: Figure 91 on page 135; a latch cell that contains three latches and
the latch control inverters driven by the CLK signal line.
Cell REB: Figure 92 on page 136; a latch cell with two latches and control
inverters.
Cell REBP: Figure 93 on page 137; a latch cell similar in function to cell REA.
Cell REBS: Figure 94 on page 138; a latch cell similar to cell REA.
Cell RED: Figure 95 on page 139; a latch cell similar to cell REA, but with a
GND bus along its lower edge.
Cell REE: Figure 96 on page 140; a latch cell similar to cell RED, but with
only two latches.
Cell REEP: Figure 97 on page 141; a latch cell similar in function to cell RED.
Cell REES: Figure 98 on page 142; a latch cell similar in function to cell RED.
Cell RM1A: Figure 99 on page 143; a latch cell used at the top of the multiplier
array. Cell RM1A initially stores the X bits for two clock cycles prior
to the start of the multiply process. Cell RM1A also grounds the Pi
and Ci lines of the upper row multiplier cells.
Cell RM1B: Figure 100 on page 144; a latch cell not needed if an odd number
of multiplier cells are used in a pipeline stage. If an even number is
used, cell RM1B is "alternated" with cell RM1C in the pipeline latch
stages.
Cell RM1C: Figure 101 on page 145; a latch cell that stores the Pi, Ci, and X
values between pipeline logic stages.
Cell SFM: Figure 102 on page 146; the selection cell used at the bottom of the
multiplier to select one of four values. Requires four control signals
and their complements. Only one control signal can be allowed to be
active at any time. Cell SFM also contains a latch to store the se-
lected value, and the control inverters necessary to operate the latch.
Cell YA: Figure 103 on page 147; a latch cell that stores four bits of the Y
mantissa prior to use.
Cell YAA: Figure 104 on page 148; a cell made up of a GND bus used to
provide for proper operation of cell YA when it is the topmost cell.
Cell YMC: Figure 105 on page 149; a latch cell similar in function to a YA and
YAA pair.
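
The way the multiplier cells and the final ripple adder combine can be illustrated with a behavioral model. The sketch below is a minimal example under assumptions and is not the logic of Figure 24: it treats each multiplier cell as a full adder that sums the incoming product bit Pi, the incoming carry Ci, and the partial product X AND Y; it handles unsigned operands only; and it omits the pipeline latches, the rounding logic, and the selection row. All names are hypothetical.

```python
def multiplier_cell(pi, ci, x, y):
    """Assumed cell function: full add of Pi, Ci, and the partial product X*Y.
    Returns (Po, Co)."""
    pp = x & y
    po = pi ^ ci ^ pp
    co = (pi & ci) | (pp & (pi ^ ci))
    return po, co


def array_multiply(x, y, n=16):
    """Unsigned n-bit by n-bit multiply using a carry-save array of cells
    followed by a ripple adder on the upper half of the product."""
    x_bits = [(x >> i) & 1 for i in range(n)]
    y_bits = [(y >> j) & 1 for j in range(n)]
    p = [0] * (2 * n)   # product (sum) bits, indexed by bit weight
    c = [0] * (2 * n)   # carry bits, indexed by bit weight
    for j in range(n):                 # one row of cells per Y bit
        next_c = [0] * (2 * n)
        for i in range(n):             # one cell per X bit
            w = i + j                  # weight handled by this cell
            p[w], co = multiplier_cell(p[w], c[w], x_bits[i], y_bits[j])
            next_c[w + 1] = co         # carry is saved for the next row
        c = next_c
    carry = 0
    for w in range(n, 2 * n):          # final ripple add of sums and carries
        s = p[w] ^ c[w] ^ carry
        carry = (p[w] & c[w]) | (carry & (p[w] ^ c[w]))
        p[w] = s
    return sum(bit << w for w, bit in enumerate(p))


if __name__ == "__main__":
    print(array_multiply(12345, 54321), 12345 * 54321)
```

In this model the product and carry inputs of the first row are zero, which corresponds to the grounding of the Pi and Ci lines of the upper row of multiplier cells described for cell RM1A.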
































































































































Figure 74. Cell CYC layout
Figure 75. Cell FA layout



















Figure 79. Cell FYA layout






























































































































































































Figure 84. Cell NAE layout















Figure 85. Cell NA1 layout
Figure 86. Cell NB layout




















































































































































































































































































































































































































Figure 103. Cell YA layout











































Figure 106. Cell YRA layout
C. EXPONENT SUBTRACTION FUNCTION CELLS.
The standard cells used to perform the exponent subtraction function used in
a floating point addition or subtraction exponent compare operation are listed
below. An example arrangement of cells to perform a 12-bit exponent subtraction
is shown in Figure 37 on page 65. The cells subtract the "B" exponent from the
"A" exponent; the difference is labeled D. The two's complement of the differ-
ence, labeled Dtc, is also computed. All four values (A, B, D, and Dtc) are stored
until the sign of the difference is known. Then the appropriate exponent is se-
lected to be retained, and the five LSBs of D or Dtc are used to generate the
control bits for the addition mantissa alignment process. The following list de-
scribes the cells and their functions:
Cell CA: See cell description in the exponent addition function area.
Cell CAC: Figure 107 on page 154; a "clock" cell that provides a CLK output
to cells in its row by inverting the CLKin signal. The cell has a GND
bus running along its lower edge as well as GND and VDD busses
running vertically.
Cell CB: See cell description in the exponent addition function area.
Cell CI: See cell description in the exponent addition function area.
Cell CSB: Figure 108 on page 155; a cell used to provide space and connections
to GND and VDD busses. There are no active elements in this cell.
Cell ESI: Figure 109 on page 156; a cell composed of three inverter pairs. The
inverters drive the control lines (STORA, STORB, and LD1) for the
exponent selection and shift control generation cells.
Cell EXSLOG: Figure 110 on page 157; a logic cell that generates the shift
network control selection signals. If STORA (store A exponent) is
active, the A exponent is selected for storage. If STORB is active, the
B exponent is stored. If LD1 (load ones) is active, all alignment con-
trol lines are pulled high. Figure 31 on page 55 shows the gate level
logic design incorporated into the cell.
Cell RA: See cell description in the exponent addition function area.
Cell RAI: See cell description in the exponent addition function area.
Cell RAIW: See cell description in the exponent addition function area.
Cell RASW: Figure 111 on page 158; a pipeline latch cell that stores an A and
a B bit for two clock cycles. The cell also contains the control
inverters for the latch.
Cell RAW: See cell description in the exponent addition function area.
Cell RA1: See cell description in the exponent addition function area.
Cell RA2W: Figure 112 on page 159; a pipeline latch cell that stores two bits
for a single clock cycle.
Cell RA3: Figure 113 on page 160; a pipeline latch cell used to store the LD1,
STORA, and STORB control signals prior to use.
Cell RA4: Figure 114 on page 161; a pipeline latch cell used to store four
values (A, B, D, and Dtc).
Cell RA4CL: Figure 115 on page 162; a pipeline latch cell used to store four
values. It is designed to be used to store the D and Ci values of the
last logic cell in a pipeline stage. The D and Ci are used in the next
pipeline stage to compute the two's complement of D.
Cell RA4W: Figure 116 on page 163; a pipeline latch cell that stores the D,
D1s, Dtc1s, and Ci signals used in computing the two's complement
of the difference and in the ones sensing scheme.
Cell RB: See cell description in the exponent addition function area.
Cell RBW: See cell description in the exponent addition function area.
Cell RBI: See cell description in the exponent addition function area.
Cell RB4: Figure 117 on page 164; a cell similar in function to cell RA4, but
with a VDD bus running along its lower edge.
Cell RI4: Figure 118 on page 165; a pipeline latch cell that stores two values.
Cell RI4VV: Figure 119 on page 166; a cell similar to cell RI4, but slightly
"wider".
Cell SAE: Figure 120 on page 167; a subtraction cell that produces the differ-
ence, D, by subtracting B from A.
Cell SAEL: Figure 121 on page 168; a subtraction cell designed to be the last
cell in the subtraction process. (The CARRY-OUT signal is not
necessary.)
Cell SAER: Figure 122 on page 169; a subtraction cell designed to be the first
logic cell in a pipeline stage.
Cell SAL4: Figure 123 on page 170; a subtraction cell used as the last logic cell
in a pipeline stage.
Cell SAR4: Figure 124 on page 171; a subtraction cell used as the first logic cell
in the subtraction process. The Ci (CARRY-IN) line is pulled high
to provide the increment required in two's complement generation.
Cell SA4: Figure 125 on page 172; a subtraction cell.
Cell SBCL: Figure 126 on page 173; a subtraction cell with functions similar
to cell SAL4, but with a VDD bus running along its lower edge.
Cell SBE: Figure 127 on page 174; a subtraction cell.
Cell SBEL: Figure 128 on page 175; a subtraction cell with functions similar
to cell SAL4, but with a VDD bus running along its lower edge.
Cell SCA: Figure 129 on page 176; a logic cell that computes the two's
complement of the difference, Dtc, and also performs the ones sensing
function of Figure 30 on page 54 for both the difference and the two's
complement of the difference.
Cell SCAR: Figure 130 on page 177; a logic cell that generates Dtc and also
serves as the first logic cell in the ones sensing function.
Cell SCB: Figure 131 on page 178; a cell similar in function to cell SCA, but
with a VDD bus running along its lower edge.
Cell SCBL: Figure 132 on page 179; similar in function to cell SCB, but de-
signed to be the first logic cell in a pipeline stage.
Cell SCBW: Figure 133 on page 180; a logic cell similar to cell SCB.
Cell STC: Figure 134 on page 181; a logic cell that generates Dtc.
Cell STCL: Figure 135 on page 182; a logic cell similar to cell STC, but de-
signed to be the first logic cell in a pipeline stage.
Cell STCR: Figure 136 on page 183; a cell that generates Dtc. It is designed to
generate the two's complement of the least significant difference bit,
D0. (Dtc0 = D0; no logic is necessary.)
Cell SVB: Figure 137 on page 184; a selection cell composed of two trans-
mission gates. The control lines select whether the A exponent or the
B exponent is stored.
Cell SVBW: Figure 138 on page 185; similar in function to SVB, but slightly
wider.
Cell SVBWW: Figure 139 on page 186; similar to cell SVBW, but slightly
wider.
Cell SV4B: Figure 140 on page 187; a selection cell that carries out the expo-
nent selection and mantissa alignment control signal generation proc-





































































































































































































































































































Figure 115. Cell RA4CL layout







































Figure 116. Cell RA4W layout
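
The two's complement generation that the latched D and Ci signals support can be illustrated with a short behavioural sketch. The Python below is not the cell-level logic of the thesis; it assumes the usual invert-and-increment scheme built from a chain of half adders whose initial carry is one. The names twos_complement, to_bits, and from_bits are hypothetical helpers introduced only for this illustration. The sketch also shows why the least significant bit of Dtc equals the least significant bit of D.

```python
def to_bits(value, width):
    """Return `value` as a list of `width` bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits: rebuild the unsigned integer from an LSB-first list."""
    return sum(bit << i for i, bit in enumerate(bits))


def twos_complement(d_bits):
    """Two's complement of an LSB-first bit list, same width as the input.

    Assumed scheme (not taken from the thesis): invert every bit, then add
    one with a half-adder chain whose first carry is 1.
    """
    carry = 1  # the "+1" enters at the least significant position
    dtc_bits = []
    for d in d_bits:
        inverted = d ^ 1
        dtc_bits.append(inverted ^ carry)  # half-adder sum
        carry = inverted & carry           # half-adder carry to the next bit
    return dtc_bits


if __name__ == "__main__":
    d = to_bits(4, 8)
    dtc = twos_complement(d)
    print(from_bits(dtc))    # 8-bit two's complement of 4
    print(dtc[0] == d[0])    # the least significant bit passes through unchanged
```

At bit 0 the carry is 1, so the sum is the inverted bit inverted again. This is consistent with the note that the least significant position needs no logic.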




























































































































































Figure 123. Cell SAL4 layout
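
The subtraction carried out by the row of subtraction cells can be modelled behaviourally as an addition of A to the inverted B operand with the first carry-in held high. The sketch below is a minimal Python model under that assumption. It does not reproduce the transistor-level or pipelined structure of the cells, and the names full_adder, ripple_subtract, to_bits, and from_bits are hypothetical.

```python
def to_bits(value, width):
    """Return `value` as a list of `width` bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild the unsigned integer from an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def full_adder(a, b, c_in):
    """One-bit full adder; returns (sum, carry_out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (c_in & (a ^ b))
    return s, c_out


def ripple_subtract(a_bits, b_bits):
    """Compute D = A - B on equal-width, LSB-first bit lists.

    Returns (d_bits, carry_out). For unsigned operands carry_out is 1 when
    A >= B and 0 when the result has wrapped around (A < B).
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same width")
    carry = 1  # carry-in of the first cell held high supplies the "+1"
    d_bits = []
    for a, b in zip(a_bits, b_bits):
        d, carry = full_adder(a, b ^ 1, carry)  # b ^ 1 inverts the B bit
        d_bits.append(d)
    return d_bits, carry


if __name__ == "__main__":
    d, c_out = ripple_subtract(to_bits(9, 8), to_bits(5, 8))
    print(from_bits(d), c_out)
```

In this model the final carry is not needed to form D itself, which matches the observation that the last cell of the subtraction process can omit the CARRY-OUT signal.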









































































































































































































































































































































































Figure 140. Cell SV4B layout
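
The exponent selection behaviour described for the STORA, STORB, and LD1 signals can be summarised in a small behavioural model. The Python sketch below rests on several assumptions that are not stated in the cell descriptions: the larger exponent is the one stored, a tie keeps the A exponent, LD1 is asserted when the exponent difference exceeds the largest value the alignment control word can hold, and "all control lines high" is modelled as an all-ones control word. The function name exponent_select and the parameter control_bits are hypothetical.

```python
def exponent_select(exp_a, exp_b, control_bits=4):
    """Behavioural sketch of exponent selection and alignment control.

    Assumptions (not from the thesis): the larger exponent is stored, ties
    keep A, and LD1 saturates the control word when the distance is too
    large to encode in `control_bits` bits.
    """
    all_ones = (1 << control_bits) - 1
    diff = exp_a - exp_b
    stor_a = diff >= 0                      # select the A exponent for storage
    stor_b = not stor_a                     # otherwise store the B exponent
    distance = diff if stor_a else -diff    # D, or its two's complement
    ld1 = distance > all_ones               # distance cannot be encoded
    control = all_ones if ld1 else distance
    return {
        "stored": exp_a if stor_a else exp_b,
        "STORA": stor_a,
        "STORB": stor_b,
        "LD1": ld1,
        "control": control,
    }


if __name__ == "__main__":
    print(exponent_select(10, 7))
    print(exponent_select(3, 40))
```

Choosing between the difference and its two's complement gives the magnitude of the exponent difference without a separate comparison step, which is one reason both values are carried through the pipeline in this model.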
D. MANTISSA ALIGNMENT AND SELECTION
Most of the functions required to perform a floating point addition can be
accomplished using cells previously documented. The mantissa selection and shift
network cells are described below:
Cell SFAB: Figure 141 on page 189; a selection cell composed of eight trans-
mission gates. The cell routes two pairs of mantissa bits to the storage
and alignment cells as appropriate. The control line signals, STORA
and STORB, are the signals generated by cell EXSLOG described in
the exponent subtraction function cell list.
Cell SFI: A second stage shift cell that is composed of eight transmission gates
and an inverter. Figure 33 on page 57 shows a gate level diagram of
a cell and the control signal generation gates. Figure 142 on page 190
shows a layout of three SFI cells. For a 16-bit mantissa, 21 SFI cells
must be placed next to each other to form the second stage shift net-
work.
Cell SF5: A first stage shift cell composed of two transmission gates.
Figure 32 on page 56 shows a gate level diagram of two SF5 cells and
the control signal generation gates. Figure 143 on page 191 shows a





























































































Figure 143. Cell SF5 layout
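
The two-stage shift network formed by the first stage and second stage shift cells can be sketched behaviourally as a coarse shift followed by a fine shift. The Python below is an illustration only. The split it uses (a first stage that shifts by 0 or 8 places and a second stage that shifts by 0 to 7 places) is an assumption suggested by the two and eight transmission gates per cell; the thesis's actual shift weights and any guard bits are not modelled. The names shift_stage and align_mantissa are hypothetical.

```python
def shift_stage(bits, amount):
    """Shift an MSB-first bit list right by `amount`, filling with zeros."""
    amount = min(amount, len(bits))
    return [0] * amount + bits[: len(bits) - amount]


def align_mantissa(bits, distance, coarse=8):
    """Right-shift a mantissa by `distance` places using two stages.

    Stage 1 (assumed 2-way select): shift by 0 or `coarse` places.
    Stage 2 (assumed 8-way select): shift by 0 .. coarse - 1 places.
    """
    if not 0 <= distance < 2 * coarse:
        raise ValueError("distance outside the range of the two stages")
    first = distance // coarse    # 0 or 1: selects the coarse shift
    second = distance % coarse    # remaining fine shift
    stage1 = shift_stage(bits, first * coarse)
    return shift_stage(stage1, second)


if __name__ == "__main__":
    mantissa = [1, 0, 1, 1] + [0] * 12   # 16-bit mantissa, MSB first
    print(align_mantissa(mantissa, 3))
    print(align_mantissa(mantissa, 10))
```

Splitting the shift into two stages keeps each selection small: a single-stage network covering the same range would need a 16-way selection at every bit position.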















1. Defense Technical Information Center 2
Cameron Station
Alexandria, VA 22304-6145
2. Library, Code 0142 2
Naval Postgraduate School
Monterey, CA 93943-5002
3. Chairman, Code EC 1
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, California 93943-5000
4. Curricular Officer, Code 32 1
Naval Postgraduate School
Monterey, California 93943-5000
5. Prof. H. H. Loomis Jr., Code EC/Lm 5
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, California 93943-5000
6. Prof. M. Cotton, Code EC/Cc 1
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, California 93943-5000
7. Prof. Chyan Yang, Code EC/Ya 1
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, California 93943-5000
8. Dr. W. A. Gardner 1







