








NASA SERC 1990 Symposium on VLSI Design
Welcome to the first annual NASA Symposium on VLSI Design. NASA's involvement
in this event demonstrates a need for research and development in the area of high
performance computing. High performance computing addresses problems faced by the scientific
as well as industrial communities. High performance computing is needed in:
• Manipulating large quantities of data in real time
• Sophisticated digital control of space craft systems
• Digital data transmission, error correction and image compression
• Expert system control of space craft
In addition to requiring high performance computing, NASA imposes the constraint of
zero power, weight and space. Clearly, a valuable technology in meeting these needs is
Very Large Scale Integration.
This conference addresses important issues of VLSI design.




It is clear that solutions to problems faced by NASA have commercial applications. One
goal of this conference is to share technology advances with the industrial community and
encourage interaction between industry and NASA.
This symposium is organized by the NASA Space Engineering Research Center at the
University of Idaho and is held in conjunction with a quarterly meeting of the NASA Data
System Technology Working Group (DSTWG). One task of the DSTWG is to develop new
electronic technologies that will meet next generation data system handling needs.
The NASA SERC is proud to offer, at its first symposium on VLSI design, presentations
by an outstanding set of individuals from national laboratories and the electronics industry.
These featured speakers share insights into next generation advances that will serve as a
basis for future VLSI design.
Clearly there are individuals whose assistance was critical to the success of this
symposium. Barbara Martin worked long hours with every single manuscript to place it into
proper LaTeX form. Judy Wood did an excellent job at coordinating the many conference
activities. The efforts of these professionals were vital and are greatly appreciated.
Our goal is to build upon this symposium in years to come and suggestions are en-
couraged that would allow a better symposium next year. I hope you enjoy your stay in





Automating Analog Design: Taming the Shrew	 1
A. Barlow
Next Generation VLSI Tools	 9
J. Gibson
CCSDS Reed Solomon VLSI Chip Set 	 20
K. Cameron, S. Whitaker, N. Liu, K. Liu, J. Canaris
Reed Solomon Error Correction for the Space Telescope	 32
S. Whitaker, K. Cameron, J. Canaris, P. Vincent, N. Liu and P. Owsley
VLSI Chip-set for Data Compression Using the Rice Algorithm 41
J. Venbrux and N. Lau
Optimal Digital Control of a Stirling Cycle Cooler	 52
J. Feeley, P. Feeley and S. Langford
Semiautomated Switched Capacitor Filter Design System	 55
D. Thelen
Integrated CMOS RF Amplifier 	 64
C. Charity, S. Whitaker, J. Purviance and M. Canaris
A Comparison of Two Fast Binary Adder Configurations	 78
J. Canaris and K. Cameron
Self Arbitrated VLSI Asynchronous Sequential Circuits 	 87
S. Whitaker and G. Maki
Using Advanced Microelectronic Test Chips to Qualify ASIC's for Space	 105
M. Buehler, B. Blaes and Y-S. Lin
Real Time SAR Processing	 117
A. Premkumar and J. Purviance
Using Algebra for Massively Parallel Processor Design and Utilization	 140
L. Campbell and M. Fellows
On Well-Partial-Order Theory and Its Application to Combinatorial Problems of VLSI Design	 151
M. Fellows and M. Langston
Burst Error Correction Extensions for Large Reed Solomon Codes	 163
P. Owsley
Performance Comparison of Combined ECC/RLL Codes	 186
C. French and Y. Lin
Serial Multiplier Arrays for Parallel Computation 	 197
K. Winters
PLA Realizations for VLSI State Machines 	 213
S. Gopalakrishnan, S. Whitaker, G. Maki and K. Liu
A Programmable Architecture for CMOS Sequential Circuits 	 223
S. Whitaker, G. Maki and M. Canaris
A Bit Serial Sequential Circuit	 231
S. Hu and S. Whitaker
Sequence Invariant State Machines 	 241
S. Whitaker and S. Manjunath
Pass transistor Implementations of Multivalued Logic	 253
G. Maki and S. Whitaker
Statistical Circuit Design for Yield Improvement in CMOS Circuits 260
H. Kamath, J. Purviance and S. Whitaker

N94-71075







Automating Analog Design: Taming the Shrew
A. Barlow
The march, or rather, the sprint of progress in integrated circuits continues to amaze
observers both within and without the industry. Three decades ago, a 50 transistor chip
was a technological wonder. Fifteen years later, a 5000 transistor device would "wow" the
crowds. Today, 50,000 transistor chips will earn a "not too bad" assessment, but it takes
500,000 to really leave an impression.
In 1975 a typical ASIC device had 1000 transistors, took one year to first samples
(and two years to production) and sold for about 5 cents per transistor. Today's 50,000
transistor gate array takes about 4 months from spec to silicon, works the first time, and
sells for about 0.02 cents per transistor.
Fifteen years ago, the single most laborious and error prone step in IC design was the
physical layout. Today, most IC's never see the hand of a layout designer: an automatic
place and route tool converts the engineer's computer captured schematic to a complete
physical design using a gate array or a library of standard cells also created by software
rather than by designers. CAD has also been a generous benefactor to the digital design
process. The architect of today's digital systems creates his design using an RTL or other
high level simulator. Then he pushes a button to invoke his logic synthesizer-optimizer
tool. A fault analyzer checks the result for testability and suggests where scan based
cells will improve test coverage.
One obstinate holdout amidst this parade of progress is
the automation of analog design, and its reduction to semi-custom techniques. While the
variety and power of architectural options available to the analog designer have mushroomed,
his methods remain largely unchanged from two decades ago. Synthesis by repeated
trial-and-error SPICE simulations is still the norm. The layout is still painstakingly
hand-crafted, transistor by transistor. Unlike its digital counterpart, analog first silicon
that does not perform to spec is still the rule rather than the exception. Analog design has
stubbornly refused to be tamed by the array and cell methodologies that have overwhelmed
the digital world. While analog cell libraries are widely advertised, in practice they find
very little use [1].
2 What's the Problem?
The comparatively stunted growth of analog CAD has multiple causes. Some are natural
consequences of macroeconomics and of the general state of the computing industry. Others
are intrinsic to the nature of the analog problem. Still others appear to be rooted in quirks
of human nature. I will focus on three of the more significant barriers.
2.1 Help Wanted (Semicustom need not apply)
Gate array vendors quickly learned that, like the memory product business, theirs is
basically a simple two dimensional problem: the trade off between speed and chip area. The
definition of next year's new product line is ever so predictable: more speed, higher gate
count, lower cost. And they can feel confident that theirs is a reasonably broad and
complete product line if it includes a half dozen arrays that span the range from 1k to
50k gates. Analog, by contrast, is a multi-dimensional nightmare. If we try to offer a
semi-custom, structured product line containing reconfigurable analog elements, in what
ratio should we include amplifiers, capacitors, resistors, switches, free transistors, matched
pairs, etc.? How many different combinations constitute a complete family of such analog
arrays? In our amplifier cell library, what combinations of DC gain, bandwidth, noise,
PSRR+, PSRR-, common-mode input range, offset voltage, settling time, etc. are
required? In a phase-locked loop, what combinations of center frequency, capture range,
hold range, jitter immunity, no signal frequency drift, output phase angle, etc. will suffice?
Do next year's improvements target lower power, noise, matching accuracy, higher speed
or something else?
2.2 Where are the experts?
While digital design is as exciting and challenging, and even more economically rewarding,
it differs from analog in a very fundamental way: it is not conceptually taxing. Digital
systems may be mathematically sophisticated, but given a system design, the logic syn-
thesis and layout is not mathematically challenging. We teach digital theory in its entirety
to college freshmen. Digital's complexity and challenge is more akin to that of a large and
involved cost accounting system than it is to analog design.
This essential difference in the nature of the problem has a very natural and interesting
consequence: the digital world is readily comprehended by computer scientists and pro-
grammers who lack explicit training in electronic theory and its prerequisite mathematics.
Thus those who best know how to create computerized automation can (and do) address
themselves to the digital problem.
Analog automation on the other hand, demands a marriage of a circuit design expert
and a design automation (computer) expert. As a species, analog designers still think of
themselves, perhaps correctly, as artists, and are wary of the inevitable degradation and
inelegance of an automated version of their craft. And as artists, they tend to enjoy the
challenge of specific design situations, and the creation of entirely new architectures, more
than the broad, accountant-like thought process that must reduce a range of previously
invented possibilities to fixed design procedures. The result of this incompatibility is that
few really excellent analog designers, who are a scarce breed to begin with, have found their
way into the design automation field.
2.3 An expert is not enough
Even given the mathematical prowess of expert designers, the fact remains that many
analog design problems are too involved to be reduced by manual methods to tractable
solutions. Traditionally, much of the designer's skill has been the paring of the problem to
a manageable essence. And despite his best efforts, a large dose of trial and error remains.
Can we automate trial and error? The answer is certainly "yes", but only at the price of
enormous computing power. Computerized search algorithms rarely have the same degree
of intelligence guiding the sequence of trials, and must make a far greater number of poor
choices before arriving at a suitable solution.
To summarize then, pivotal barriers to progress in analog design automation include
the lack of a suitable methodology, and the lack of experts willing to take on the task of
automation. Given capable people and plausible methodologies, a further problem re-
mains: the algorithm maker needs tools to help him do previously unmanageably complex
mathematics. Finally, even given that tool, the task demands access to fabulously large
computing power. The good news: all of these barriers are beginning to crumble.
3 A Light at the End of the Tunnel
While arrays and standard cells have proven ineffectual in the analog domain, a slight
variation on the concept, standard generators, holds excellent promise. This is not a novel
concept: it has been applied to the physical design of standard cell libraries for 6-7 years. In
this discussion, I expand the usual definition of a generator to include the design process
as well as the layout, and propose that we think of them not only as tools for library
generation, but also as custom design aids.
Like standard cells, generators are usually based on fixed circuit schematics (though
algorithmic arraying can also be included under this same label). But unlike standard cells,
the component sizes are not predetermined and the layout is not fixed. Generators can
have either of two distinct functions: component size determination based on performance
specifications, and physical layout. By introducing the flexibility of variable device sizes,
a vastly broader range of specifications can be addressed. The concept is hierarchically
extensible: macro generators can call lower level generators to create their subcomponents.
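The generator idea can be illustrated with a minimal Python sketch. It is not taken from any of the tools discussed here: the square-law MOSFET model, the process constants (KP_N, L_MIN), and the names size_transistor, matched_pair and place_row are all assumptions chosen for illustration. A leaf generator sizes one transistor from a performance target, a macro generator calls it to build a matched pair, and a toy placement step stands in for physical layout.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

# Assumed process constants, for illustration only (not from the paper).
KP_N = 100e-6   # NMOS transconductance parameter k' (A/V^2)
L_MIN = 2e-6    # channel length used for every device (m)


@dataclass
class Device:
    name: str
    width: float   # m
    length: float  # m


def size_transistor(name: str, gm: float, i_d: float) -> Device:
    """Leaf generator: choose W so a square-law MOSFET reaches the target gm.

    Square-law saturation model: gm = sqrt(2 * k' * (W/L) * Id).
    """
    w_over_l = gm ** 2 / (2.0 * KP_N * i_d)
    return Device(name, w_over_l * L_MIN, L_MIN)


def transconductance(dev: Device, i_d: float) -> float:
    """Evaluate the same square-law model for an already sized device."""
    return math.sqrt(2.0 * KP_N * (dev.width / dev.length) * i_d)


def matched_pair(gm: float, i_tail: float) -> List[Device]:
    """Macro generator: a differential pair built by calling the leaf generator."""
    half = i_tail / 2.0  # each side carries half of the tail current
    return [size_transistor("M1", gm, half), size_transistor("M2", gm, half)]


def place_row(devices: List[Device], spacing: float = 4e-6) -> List[Tuple[str, float]]:
    """Toy layout step: place devices left to right, returning (name, x offset)."""
    placements = []
    x = 0.0
    for dev in devices:
        placements.append((dev.name, x))
        x += dev.width + spacing
    return placements


pair = matched_pair(gm=1e-3, i_tail=100e-6)
row = place_row(pair)
```

Because the device sizes are computed rather than stored, the same two functions cover any gm and bias current the model supports, which is the flexibility a fixed cell library lacks.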
Aiding the design automator's modeling efforts, a viable first generation of symbolic
mathematics software has appeared in the last several years. These are as yet immature
and require a good deal of training to use. And quite properly, they have not attempted
to supplant the need for mathematical understanding by the user. But even in this infant
stage they offer significant benefit in addressing very complex math problems. One aspect
in which they prove particularly useful is in keeping track of the signs and coefficients of
problems with many variables. For example, the algebraic (not numeric) solution of a ten
by ten determinant would be a year's work by manual methods. These new tools compute
the twenty-something page result in an hour on a personal computer.
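To give a feel for the bookkeeping involved, the following Python sketch expands a fully general n by n determinant into signed products using the Leibniz (permutation) formula. It is an illustration only, not the method used by any symbolic mathematics package; the function names and the aij element naming are assumptions. A completely dense ten by ten determinant has 10! = 3,628,800 signed terms; circuit matrices are sparse, so most of those products vanish, which is consistent with a result measured in pages rather than volumes.

```python
from itertools import permutations
from math import factorial


def permutation_sign(perm):
    """Return +1 for an even permutation, -1 for an odd one (inversion count)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return -1 if inversions % 2 else 1


def symbolic_determinant(n):
    """Expand det(A) of a general n x n matrix as a list of (sign, factors)."""
    terms = []
    for perm in permutations(range(n)):
        # One element from each row; the permutation selects the column.
        factors = tuple(f"a{row + 1}{col + 1}" for row, col in enumerate(perm))
        terms.append((permutation_sign(perm), factors))
    return terms


def format_terms(terms):
    """Render the signed products as a single algebraic expression string."""
    return " ".join(("+ " if sign > 0 else "- ") + "*".join(f) for sign, f in terms)


two_by_two = format_terms(symbolic_determinant(2))
dense_term_count_10 = factorial(10)  # number of products in a dense 10 x 10 case
```

Even the three by three case already has six products with alternating signs; the sign and coefficient tracking is exactly the part that is error prone by hand.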
Lastly, it should be apparent to the most skeptical observer that the age of boundless
computing power is upon us. Five years ago, thirty engineers time-shared a 5 MIPS, 4
Meg RAM minicomputer. Today each has a 10 MIPS, 16 Meg machine sitting on his desk
for his personal use. Within 5 years it will be a 100 MIPS, 64 Meg superworkstation. In
a single decade we will have transitioned from a world where CPU time was a dominant
resource limitation to a time where it is a non-issue. Automated trial and error will be
both wonderfully rapid and virtually free.
Exploiting this new computing muscle, a viable first generation of analog synthesis
tools has begun to emerge. Carnegie-Mellon University researchers have created an op amp
synthesizer capable of creating a very broad range of high performance custom amplifiers
[2]. It hierarchically builds on generator submodules as small as matched transistor pairs
and complementary drivers. Keying on various specification criteria, the system makes
repeated educated guesses in determining both the device configuration and the device
sizes. Based more, but not entirely, on fixed amplifier schematics, CSEM (Switzerland)
has created a tool that algorithmically sizes op amps, comparators, and even a few larger
analog blocks such as sigma-delta converters [3]. It follows the design with a high quality
automatic layout that is sensitive to analog design issues, and is interactively changeable
by the user. Adding switched capacitor filter synthesis and layout to the CSEM tool,
Silicon Compiler Systems is introducing the first fully featured commercial analog synthesis
tools. The overall system behaves very similarly to a switched capacitor synthesis system
previously reported by Asahi Kasei Microsystems (Japan) [4].
4 An Example: Asahi Kasei Microsystems' SCF Design System
Harnessing the power of recent workstations to implement the algorithmic guesswork of
a non-linear programming numerical optimizer, and drawing on the modeling potential
of symbolic mathematics tools, the Asahi Kasei system well illustrates the current state-
of-the-art in analog design automation. The system integrates three new design modules
into the design environment: a filter synthesizer (SCULPTOR), an op amp synthesizer
(OPTIMIST), and a switched capacitor circuit layout synthesizer (SCARLET). A fourth
required capability, op amp layout generation, has been implemented using a cell layout
generation system similar to commercially available generator tools.
4.1 Automated Filter Design: SCULPTOR
The filter synthesizer is employed for both gain and delay designs. Filter order and coeffi-
cients may be determined either by classical approximations or by the numerical optimizer.
This latter choice allows optimization of particular specs. For example, the user may choose
to minimize Q, and hence noise and sensitivity. Delay equalization filters are also designed
using the numerical optimizer. The optimizer is particularly valuable in the design of non-
standard filter functions. In telecommunication applications, gain equalization filters that
compensate for frequency dependent line attenuation are often required. Since no formal
mathematical solutions to these functions exist, designing them manually is a long and
tedious trial-and-error process. SCULPTOR's optimizer created a filter with the transfer
characteristic of Figure 1 in five minutes on a SUN4-260.
SCULPTOR's analysis includes all key non-ideal effects: capacitor mismatch, amplifier
and switch noise, and finite amplifier gain. Filters are implemented as composites of single
stage, biquad, interpolator and cosine filter sections. Programmable gain functions can
also be automatically included in the design, thanks to SCULPTOR's embedded mixed-
mode switched capacitor / logic simulator. Filter sections may be analyzed separately or
as cascaded composites. SCULPTOR outputs a captured schematic, op amp specifications
to the amplifier generator and a netlist to the filter layout synthesizer.
4.2 Automated Amplifier Design: OPTIMIST
Also based on numerical optimization methods, OPTIMIST sizes the devices of amplifiers
and switches to meet a specification received from SCULPTOR. Min/max limits on gain,
bandwidth, noise, PSRR, etc. are inputs; device sizes and actual performance to spec
are outputs. Using analytic models based on SPICE-like IV equations, and by including
high order poles and zeros in the analysis, the result matches full conventional simulation
very closely: within a fraction of a decibel for gain functions, and within 1 degree for
phase. The creation of analytic models of this complexity is an entirely impractical task
by manual methods. For OPTIMIST, the modeler employs a symbolic math modeling
tool to create the transfer functions. Input is a set of node equations. Output is the
transfer function polynomial coefficient expressions. Figure 2 summarizes the nature of
OPTIMIST's modeling structure.
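The flow of min/max limits in, device sizes and achieved performance out, can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not OPTIMIST itself: it swaps the non-linear programming optimizer for a plain seeded random search, and it uses a textbook square-law model of a single common-source stage. The constants (KP, LAMBDA, L, C_LOAD), the search ranges, and the function names are all assumptions.

```python
import math
import random

# Assumed model constants, for illustration only (not from the paper).
KP = 100e-6      # transconductance parameter k' (A/V^2)
LAMBDA = 0.05    # channel-length modulation (1/V)
L = 2e-6         # fixed channel length (m)
C_LOAD = 2e-12   # load capacitance (F)


def performance(w, i_d):
    """Analytic square-law model of a single common-source stage."""
    gm = math.sqrt(2.0 * KP * (w / L) * i_d)
    r_out = 1.0 / (LAMBDA * i_d)
    return {
        "gain": gm * r_out,                       # low-frequency gain (V/V)
        "gbw_hz": gm / (2.0 * math.pi * C_LOAD),  # gain-bandwidth product (Hz)
    }


def meets(spec, perf):
    """True if every spec item lies within its (min, max) limits."""
    return all(lo <= perf[key] <= hi for key, (lo, hi) in spec.items())


def log_uniform(rng, lo, hi):
    """Draw a sample uniformly in log space between lo and hi."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))


def size_amplifier(spec, trials=2000, seed=0):
    """Random search: keep the feasible candidate with the lowest bias current."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        w = log_uniform(rng, 2e-6, 500e-6)    # candidate width (m)
        i_d = log_uniform(rng, 1e-6, 1e-3)    # candidate bias current (A)
        perf = performance(w, i_d)
        if meets(spec, perf) and (best is None or i_d < best[1]):
            best = (w, i_d, perf)
    return best


spec = {"gain": (100.0, math.inf), "gbw_hz": (10e6, math.inf)}
result = size_amplifier(spec)
```

Each trial costs only a handful of floating point operations because the model is analytic; replacing performance with a full circuit simulation would make the same loop orders of magnitude slower, which is the motivation for the analytic models described above.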
Each of OPTIMIST's design generators has a corresponding cell layout generator. The
design and layout generators are correlated to have matching diffusion areas, etc., thus
assuring correct modeling of parasitics.
While it has proven very helpful for amplifier design, and greatly accelerated a user
defined trial-and-error process, OPTIMIST is a prime example of the need for still more
computation speed in workstations. A full optimization search in OPTIMIST can take up
to 30 minutes on a SUN4-260. This is acceptable in some instances, but detracts from the
interactive feel that such tools should ideally have.
4.3 Automated Switched Capacitor Circuit Layout: SCARLET
SCARLET's layout capability is not limited to SCFs: it compiles any circuit composed of
op amps, switches and capacitors. Its ability to draw circuits, properly considering noise
and crosstalk, stems from an intelligent preanalysis of the netlist to be drawn. Prior to
layout, SCARLET decomposes the network into clusters of elements connected to charge
sensitive nodes. Having thus analyzed the circuit, SCARLET can create the physical
layout with the same attention to signal crossing of critical nodes as would a human layout
expert. The resultant physical design is of comparable quality and density to hand drawn
filters. Figure 3 is an example of a SCARLET layout for a 6th order bandpass filter. It
took 7 minutes on a VAX 8650.
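The netlist preanalysis can be pictured with a small Python sketch. It is not SCARLET's algorithm: the netlist format, the example circuit, and the rule that the charge sensitive nodes are the op amp inverting inputs (taken here as the first node listed for each op amp) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical netlist format: (element name, element kind, nodes it touches).
NETLIST = [
    ("OA1", "opamp", ("sum1", "gnd", "out1")),
    ("C1", "cap", ("in_sw", "sum1")),
    ("C2", "cap", ("sum1", "out1")),
    ("S1", "switch", ("vin", "in_sw")),
    ("OA2", "opamp", ("sum2", "gnd", "out2")),
    ("C3", "cap", ("out1_sw", "sum2")),
    ("C4", "cap", ("sum2", "out2")),
    ("S2", "switch", ("out1", "out1_sw")),
]


def sensitive_nodes(netlist):
    """Assumption: the sensitive nodes are the first-listed op amp inputs."""
    return {nodes[0] for _, kind, nodes in netlist if kind == "opamp"}


def cluster_by_sensitive_node(netlist):
    """Group element names by the charge sensitive node they connect to."""
    sensitive = sensitive_nodes(netlist)
    clusters = defaultdict(list)
    for name, _, nodes in netlist:
        for node in sensitive.intersection(nodes):
            clusters[node].append(name)
    return dict(clusters)


clusters = cluster_by_sensitive_node(NETLIST)
```

In this toy circuit each summing node collects its op amp and the capacitors tied to it, while the switches, which touch no sensitive node, remain outside the clusters and can be routed with less care.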
5 Scan the Horizon
The wealth of existing analyses of amplifiers and filters as well as the pervasiveness of
their use made them ideal candidates for automation in this first generation. But many
other common functional blocks also hold promise for reduction to automated techniques.
ADCs, DACs and DLLs all seem amenable to automation via the principles described
above. The computational problem will be more severe, but well within the capabilities
of the coming generation of workstations. The mathematics to model these is also more
difficult, but they can be attacked by an ever more potent arsenal of analysis aids. And
the proliferation of CASE tools is beginning to relieve the tedium of many programming
tasks, leaving algorithm creation as the dominant task of the CAD developer. This may
incline more analog artists to take up the challenge of analog CAD.
The shrew now has a tamer, and the tamer has a whip.
References
[1] In a panel discussion about analog design methodologies at the 1989 IEEE CICC,
representatives of three leading ASIC vendors, Sierra Semiconductor, IMP and AMI
all admitted that while they have created analog cell libraries and actively market
them, in actual practice, the cells are rarely reused without modification to fit each
new application's requirements.
[2] R. Harjani et al., "A Prototype Framework for Knowledge-Based Analog Circuit
Synthesis", IEEE DAC, pp. 42-49, 1987.
[3] M. Degrauwe et al., "IDAC: An Interactive Design Tool for Analog CMOS Circuits",
IEEE JSSC, vol. SC-22, no. 6, Dec. 1987.
[4] A. Barlow et al., "An Integrated Switched Capacitor Filter Design System", IEEE
CICC, 4.5.1, 1989.







[Figure 1: filter transfer characteristic produced by SCULPTOR's optimizer; horizontal axis Frequency (Hz), 1 kHz to 1 MHz]










I as f(V's, W, L)
V's as f(I, W, L, other V's)
W as f(I, V's, L)
1. Choose dependent and independent variables
2. Make sequence of
single transistor gm, gp, gmb as f(I, V)
Cj as f(V, W)
Cg as f(W, V)
DC gain, fo, phase margin, PSRR, CMRR as f(xfer polys)
1. Write linearized model node equations
2. Solve for polynomial coefficients via symbolic math utility
Noise as f(W, L, polys)
Slew rate as f(polys, SR)
Settle time as f(polys, SR)
Area as f(W, L, C)
Figure 2: OPTIMIST model structure
Figure 3: Generated layout for a 6th order SCF. Opamps and switches are represented by
shaded bounding boxes.
N94-71076
Next Generation VLSI Tools
J. Gibson
Hewlett Packard Company
Disk Mechanism Division
Boise, Idaho
Abstract - This paper focuses on what features would be useful in VLSI Com-
puter Aided Design Tools and Systems to be used in the next five to ten years.
Examples of current design tasks will be used to emphasize the areas where
new or expanded VLSI CAD tools are needed.
To provide a basis for projecting the future of VLSI tools, a brief history
of the evolution of VLSI design software and hardware platforms is presented.
The role of design methodology is considered, with respect to the anticipated
scale of future VLSI design projects. Future requirements of design verification
and manufacturing testing are projected, based on the challenge of surviving
in a competitive market.
Examples of VLSI design and related issues are centered primarily on cell
library based structured custom design. Structured custom design implies the
use of a hierarchical block organization in the implementation of a complex IC.
The perspectives of what capabilities are needed in future VLSI tools reflect
the author's involvement on VLSI design teams developing integrated circuits
for disk memory and other computer peripherals, in the last eight years.
1 Introduction.
The transition from nicely manageable, fully synchronous structured custom designs to
mixed synchronous/asynchronous digital designs, coupled with analog and digital functions
on the same die, will require considerable investment in tools and engineering skills. More
engineers will become involved in high speed digital designs, requiring more analog circuit
expertise.
The requirements of new products will determine the methodology necessary to pro-
duce cost effective and timely VLSI designs. The increasingly difficult demands on design
verification will require simulation capability well beyond the limits of current tools. The
manufacturability of complex integrated circuits becomes even more important as com-
puter peripheral product volumes go from a few thousand units per month to tens of
thousands or hundreds of thousands of units per month.
The capability of tools available to VLSI designers is increasing rapidly, but the man-
agement of large complex projects still requires considerable investment. Marketplace
pressures are requiring shorter IC development times, with the need for perfect first pass
parts growing dramatically. Cost issues are pushing chip architectures towards the most
efficient and cost effective chip layouts, involving more custom design. Full custom design
or library cell based structured custom design requires the best possible tool environments
to support a skilled design team. Future VLSI tools will need to effectively address the
needs of full custom IC designers, to help provide a competitive edge in the marketplace.
The following discussion highlights VLSI CAD tool issues in the design capture, veri-
fication and testing processes of integrated circuit development, as encountered in several
product development cycles at Hewlett Packard's Disk Mechanism Division. The require-
ments for next generation VLSI tools will be projected, at least from one organization's
point of view.
2 History.
It is difficult to imagine a more dynamic area of technological development than computer
aided design, with VLSI design being at the forefront of the CAD evolutionary process.
Every VLSI chip project since the early 1980's has been accompanied by a new generation
of computers, graphics tools and peripherals, usually a new operating system, and new
VLSI software tools that often required a change in the design methodology used by the
IC development teams. The rapidly changing tools made it difficult to anticipate tech-
nical issues and almost impossible to accurately predict project completion dates. VLSI
manufacturing processes and design parameters were changing at an equally fast rate.
The growth rate of most electronics companies has been very high in the last ten years.
Consequently, many new engineers have been introduced into this complex, dynamic
environment. The only thing different between the early 1980's and the late 1980's is that the
rate of change of IC technology, software tools and systems has further quickened. This
rate of technological change appears to be permanently increasing, a bit more rapidly each
year. It is the product of an amazingly competitive marketplace and a very diverse range
of applications of VLSI technology.
Just ten years ago, many IC's, some of significant complexity, were still composed by
hand, using tape and mylar film and requiring several years to be completed. The work was
exhausting, tedious and very error prone, regardless of the methodology. The first CAD
systems were minicomputer or mainframe systems with a rudimentary graphics capability.
A typical collection of software included not much more than a layout rule checker, an
analog simulator (SPICE), a unit delay switch type digital simulator, and possibly a
graphics based schematics editor. No tools existed to compare a schematic netlist verified
by simulation to the netlist extracted from a layout. Huge plots of chips were generated,
spread on the floor, and sometimes four or five manual checking cycles, involving different
people for each cycle, were necessary to find mismatches between schematics and layouts.
Manual artwork verification of digital designs continued into the early 1980's, and is still
important in analog designs.
The early minicomputer and mainframe based CAD systems generally cost $250,000
or much more, permitting only large corporations to participate in VLSI design. The
systems would support only a few users, and the tasks had to be kept small, since the
CPU's could easily be overloaded. Disk memory storage devices were expensive and not as
reliable as today's products. One CPU handled all tasks and the only way to communicate
with other remote systems was with expensive modems on costly leased phone lines. Cost
limited the number of graphics terminals, as most engineers used RS-232 terminals. Early
computer aided design methodologies were developed primarily by experienced designers
who learned integrated circuit design in the 1970's, using calculators and second generation
minicomputers with limited software. The VLSI tool limitations caused the development
of a rigorous and efficient design methodology which was, and still is, the best way to
achieve success.
CAD really started to be a factor when machines that could be called workstations
finally began to appear on the market. These machines combined a sharply reduced cost of
computing power with a mix of features that fit naturally with engineering groups designing
ICs. Workstations began to provide a hardware and software system of interactive desktop
machines, of affordable cost, connected to a local area network to provide transparent
access to data and programs in real time. This development gradually eliminated the
need to go through time-consuming departmental minicomputers or to use mainframes
for batch processing. Providing each engineer with a graphics terminal quickly became
affordable, and greatly improved productivity. Additional processing power could be added
incrementally. Although it could be debated in some respects, probably the first 32-bit
graphics workstation to be shipped was the Apollo DN100, which first shipped
in 1981. This was the earliest machine on the market that had most of the
features of today's workstations. Some of the earlier proprietary machines changed so
much between generations, that the older hardware was obsoleted at the end of a project,
which was not unusual up to about 1987. UNIX became more common, providing a
multiuser, multitasking environment ideal for engineering tasks. Software from several
vendors slowly became available on most workstations. Networking allowed the sharing of
computer and peripheral resources, which made it possible to apply every machine on a
network to a time critical set of tasks.
VLSI CAD technology is hardly ten years old, but several revolutions have already
taken place. The tremendous rate of hardware development, however, seems to have
outpaced the development of software tools. The productivity of one engineer has probably
increased several hundred to a thousand times the rate possible just one decade earlier.
There is a good chance that the rate of productivity could increase nearly as much in
the next ten years. Most of the VLSI CAD tool vendors are still enhancing their first
generation products or are just beginning to introduce second generation products. Even
large, diversified electronics companies are recognizing that the continuing investment in
VLSI CAD tools is too large to justify only in-house use.
3 Platforms.
In 1980, a minicomputer with a quarter of a megabyte of random access memory was
considered a powerful machine. This larger than average minicomputer may have had 50
to 100 megabytes of disk memory. The CPU was probably a sixteen bit machine, with no
cache, some DMA, and moderate I/O data transfer rates, with a generalized performance
rating of about 500,000 instructions per second. The graphics screens may have had about
535 by 390 pixels of display capability, with a slow re-windowing rate. All hardware
attached to the minicomputer probably was designed for that particular machine. IC's
designed for disk memory devices in the early 1980's ranged in complexity from 5000 to
35,000 FETs.
The slightly above average workstation in 1990 will probably have 16 to 32 megabytes
of RAM, one or more 700 megabyte disk drives, 64 to 256 thousand bytes of cache, a 32 bit
CISC CPU with 20 MIPS performance or a RISC CPU of about 35 MIPS, and 5 megabyte
per second I/O data rates. Most graphics terminals will have 1280 by 1024 pixel display
capability, with the top end at 2048 by 2048 pixels. The graphics display will be handled
by a separate processor system. In addition, industry standard interface busses permit
the attachment of peripherals and specialized processors from a number of vendors. The
integrated circuits designed in the late 1980's for disk memory devices ranged from 30,000
to 70,000 FETs for most designs, with one close to 380,000 FETs. Other organizations
have developed designs of about 750,000 FETs.
What might the fairly well loaded workstation look like in the year 2000? Some of
the possibilities include maybe a gigabyte of RAM, 10 gigabytes of solid state disk, 100 or
more gigabytes of disk memory, two or more CPU's, each providing 300 to 400 MIPS, plus
several processors facilitating communications with peripherals. Fiberoptic I/O should be
much less expensive than it is today and should make possible I/O rates to peripherals of 50
to 100 megabytes per second, the limit being the rate that the peripherals can accept data.
Inexpensive multiscreen graphics systems that allow the viewing of artwork, schematics,
and textual simulation data simultaneously might be possible. Each screen may be able to
display 4096 by 4096 pixels, and almost certainly will provide integrated video capability.
Some of these projections may actually be on the conservative side, given the current rate
of progress in the computer industry. Disk memory peripherals may not require IC's of
greater than 200,000 to 300,000 FETs, but the shorter development times possible with
more powerful VLSI tools and computers will be needed. Such workstations could easily
support the development of IC's of greater than 1,000,000 FETs.
The future use of networks could contribute as much to productivity as will the in-
creasing power of workstations. Network computing will require much more sophisticated
software to effectively utilize multi-CPU workstations in a LAN environment. The network
will likely contain several types of specialized processors that are very effective in processing
some of the VLSI design tasks, especially compute intensive tasks such as artwork design
rule verification. It will be difficult for software technology to keep up with hardware
technology. Software technology will probably set the pace of productivity improvements
in the next ten years.
The continual improvement of workstations will force most design groups to consider
new equipment every three or four years. The issues of obsolescence and return on invest-
ment will probably not diminish in the next ten years. Competitive pressures will continue
to shrink product lifetimes, making it mandatory to keep design productivity as high as
possible.
4 Methodology.
Many large, complex designs have been entirely synchronous, since the cost and perfor-
mance has been acceptable. Synchronous designs generally involve considerably less risk
than asynchronous designs and are much easier to test. The methodology of synchronous
structured custom IC design involves defining a set of specifications for a cell library that
will allow the product performance and cost goals to be met. Then a cell library is con-
structed, so that each cell can be used in a design as a building block, with known limits
for its use. System designers generally work at the cell level and above, leaving cell circuit
design to one or two experts. Individual cells are designed to specification by a FET
circuit expert, using an analog simulation tool such as SPICE. Within a synchronous sys-
tem, the cell design is fairly straightforward, since a fixed period of time is available for
budgeting delays in the implementation of a cell. Cell loading is fairly predictable, since
each type of cell is used in a regular structure. The types of cells are divided into four
general groups: datapath (registers and arithmetic functions); programmable logic arrays
(state machines); input/output (I/O pads); and testability circuits. For synchronous de-
sign, this basic methodology is not anticipated to change very much for future designs.
Improvements are anticipated in the areas of test coverage, test time and overall cost per
function.
The need for lower cost and large product volumes is making higher levels of integration
more attractive each year. The smaller physical size of each generation of disk drive
is requiring much smaller printed circuit assemblies. Where the 20 to 50 megabyte 14
inch disk drives of 1980 contained electronics on about 400 square inches on several circuit
boards, 1989's 380 to 760 megabyte 5.25 inch disk drives are limited to 44 square inches
on one board. The 3.5 inch and 2.5 inch drives of the future will require about the same
amount of electronic functionality on printed circuit boards of well under 10 square inches.
Rather than build disk drives with essentially perfect read/write heads and media, error
correcting circuits are making possible lower cost means of achieving demanding error rate
limits for disk products.
Nearly all of the digital functions in disk drive electronics, other than memory devices
and microprocessors, have already been integrated in gate array, standard cell and struc-
tured custom circuits. Some performance and cost advantages can be obtained by merging
some of these circuits together, but many of the benefits of digital integration have already
been gained. The next step in the integration of disk electronics involves either more ana-
log integration, or the conversion of currently analog circuit techniques into an acceptable
digital form. Large complex analog circuits are still difficult and time consuming projects,
which are very hard to fit into the short development times permitted for new disk prod-
ucts. The number of expert analog IC developers is very small. It is likely that few analog
IC experts are also expert analog read/write or servo control designers. Servo control has
been largely moved into the digital domain, by using commercial digital signal processor
circuits. DSP architectures allow the flexibility needed to optimize servo performance, but
are still somewhat expensive. Analog servo control IC's are available, but are limited to
certain types of mechanisms and performance ranges.
In order to reduce the number of components on disk drive circuit boards, integrated
circuits involving synchronous and asynchronous timing with some analog functions on
chip, will be necessary in the next five years. Some examples of these complex designs
are beginning to appear, mostly from analog circuit vendors. In order to continue to
be successful in the development of complex mixed-methodology custom IC's, engineers
will have to identify a methodology that partitions asynchronous and analog design into
manageable sections. Managing the complexity of asynchronous designs will involve testing
the boundaries of asynchronous and synchronous circuits, along with understanding the
limits of the internal design of the asynchronous circuits. The main issues facing the mixing
of analog and digital design on the same chip involve testability of the analog functions
and overall yield or cost, due to the large number of processing steps for such IC's.
The implications for designers of mixed-methodology IC's include possibly larger de-
sign teams of people with a wider range of skills. The need for digital systems designers
will continue, since pure digital design will continue to have the lowest cost per function
and the widest range of application. Designers will have to know both synchronous and
asynchronous (or timing) simulators to adequately verify future designs. Analog designers
will be needed for the interfaces to mechanical devices that can't be handled digitally.
Each of the above trends will require more investment in cell libraries used for structured
design.
Analog effects will become more significant as overall system clock rates continue to
increase. Where the 3 micron circuits of 1980 could be designed to operate at 10 mega-
Hertz, today's 1.0 micron circuits are being designed for 30 to 40 megaHertz operation.
In 1990, most circuits could be designed in 0.8 micron processes that will support 60 to 70
megaHertz system clocks. The 0.5 micron circuits of the mid-1990's will probably support
100 megaHertz system clock speeds. In the area above 30 megaHertz, secondary analog
effects are already significant in most circuits. Inductance, parasitic capacitance, signal
cross-coupling and other effects can have serious influence on circuits. More complicated
analytical techniques, such as transmission line analysis, and more use of analog simulators
could become common. Digital simulators that are aware of secondary analog effects may
be needed for the higher speed designs.
The challenge for IC design teams in the next five to ten years will be to carefully
identify the kind of IC's that will be needed in a particular kind of product, and then
develop a strong methodology around the chosen technologies and design tools. It appears
that the analog content of future designs will increase, unless good digital algorithms can
be found to replace currently analog functions. Several vendors have tools that can provide
the proper design environments, but choosing a tool set is only the start of the process of
building an integrated circuit design methodology.
The future development of computers and peripherals is much easier to project than
the future development of software tools and environments. The effort put into software in
the next ten years will become an even larger portion of VLSI design system development
costs. But, just as the hardware developers have drifted toward somewhat standardized
architectures and interfaces, VLSI software tool developers are defining standards for in-
terfacing some of the tools used for VLSI design. Frameworks or platforms upon which
tools of various capabilities can be attached, with a reasonable investment, appear to be
the path to the future. As long as the interfaces between the platforms and the tools
are fairly simple, and the connections fairly loose, then considerable flexibility could be
gained. Some past attempts at large tightly coupled VLSI design tool sets either lost all
performance advantages or became too large to manage. If a design tool platform is to
be effective, most of the bandwidth of the workstation involved should be handed over to
the specific tool currently being used, keeping platform overhead processing low. It seems
inevitable that VLSI design software tools are going to grow in size as more services are
provided, and overall productivity should gain accordingly.
5 Verification.
Efficient verification of integrated circuit design is made possible by having many strengths
and few weaknesses in the design methodology used to implement an IC. The random
logic design of the early microprocessors has given way to the highly structured RISC
processors of today. The use of pin level high speed testers has been complemented with
scan path circuitry on chip. The synchronous unit delay digital simulators used for much
of the 1980's are gradually being replaced with event wheel timing simulators, using rise
and fall time delays, and several levels of drive strengths. FET level only simulators are
being supplemented with simulators that can mix FET, gate, and functional or behavioral
models in one simulation.
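The event-driven style of simulation described above can be illustrated with a minimal sketch. It is not taken from any particular simulator: the two-gate netlist, the delay values, and all names are hypothetical, a priority queue stands in for the event wheel, and a simple transport-delay model is assumed (pending events are never cancelled).

```python
import heapq

# Hypothetical netlist: two NAND gates with separate rise and fall delays (ns).
GATES = [
    {"out": "n1", "ins": ("a", "b"), "rise": 3, "fall": 2},   # n1 = NAND(a, b)
    {"out": "y", "ins": ("n1", "c"), "rise": 3, "fall": 2},   # y  = NAND(n1, c)
]

def nand(x, y):
    return 0 if (x and y) else 1

def simulate(stimulus, initial):
    """Event-driven simulation with rise/fall delays.

    stimulus: list of (time, net, value) input changes.
    initial:  dict giving a consistent starting value for every net.
    Returns the list of (time, net, value) changes that actually occurred.
    """
    values = dict(initial)
    queue = []
    seq = 0  # tie-breaker so simultaneous events keep insertion order
    for t, net, v in stimulus:
        heapq.heappush(queue, (t, seq, net, v))
        seq += 1
    trace = []
    while queue:
        t, _, net, v = heapq.heappop(queue)
        if values[net] == v:
            continue  # no change on this net, so nothing downstream is scheduled
        values[net] = v
        trace.append((t, net, v))
        # Only gates driven by the changed net are re-evaluated.
        for g in GATES:
            if net in g["ins"]:
                new = nand(values[g["ins"][0]], values[g["ins"][1]])
                delay = g["rise"] if new == 1 else g["fall"]
                heapq.heappush(queue, (t + delay, seq, g["out"], new))
                seq += 1
    return trace

start = {"a": 1, "b": 1, "c": 1, "n1": 0, "y": 1}
trace = simulate([(0, "a", 0)], start)
```

In this sketch a falling edge on `a` propagates through `n1` with the rise delay and then through `y` with the fall delay, which is the behavior a unit delay simulator cannot represent.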
How will design verification be improved in the 1990's? It is likely that one answer will
involve the efficient management of even more complex designs. The structure of future
integrated circuits will need to be partitioned to whatever level is needed to keep each
circuit block small enough for an engineer to work with efficiently. Too many blocks can
become a file system nightmare, and too few blocks can cause tools to be slow processing
blocks that are too large. It's a matter of knowing the design, hardware platform, and
software tool practical limitations. Hierarchical design is fairly well developed now, and
should improve with higher speed networks and file servers.
Just as the logical structure of an IC is partitioned into hierarchical blocks, the verifi-
cation effort can be similarly structured. Behavioral modeling can be used in the top down
analysis to help determine how the functions of an IC are to be implemented, before all
of the lower level circuit detail is invented. During the bottom-up implementation phase,
FET level modeling is generally used to make sure that the performance of the lowest level
blocks is exactly as desired. As the major blocks of a chip come together, the opportunity
exists for using a functional representation of a block rather than the FET representation,
to increase simulation speed. In order to ensure that a functional block exactly matches
the operation of a FET level representation, it should be easy to switch representations.
If extensive testing is necessary to verify the design, more higher level functional modeling
may be feasible, depending on the project schedule, the simulation execution time, and
engineer workloads. Each level of logical representation requires additional verification, so
much in some cases, that functional modeling may not reduce overall design verification
time. High performance workstations on a network can greatly extend the utility of a
low level simulation environment, if the simulation task can be shared between machines.
Given personal preference, most design groups will find different paths to efficient design
verification.
One of the growing needs for verification is the modeling of functions external to an
IC. For example, if an IC has a Small Computer Systems Interface (SCSI), which has
several asynchronous control lines, how is verification accomplished? One of the present
techniques is to write vectors that cover all of the SCSI command and data operations.
Such an approach can be made to work, but is fairly inflexible, since only one sequence of
events is provided. Another approach is to interface the simulator to other workstations
via the network and sockets, so that vectors can be computed based on previous simulation
results. The use of sockets is flexible enough, but the performance can be slow. Another
option that holds promise is to write functional models within the simulation environment
that represent the SCSI operations. If a flexible language is used, such as "C", then almost
any operation can be synthesized. This approach appears to have promise for representing
the disk drive features necessary for the complete operation of an interface, formatting,
and error correcting integrated circuit.
Going one more level up from the integrated circuit verification effort, the process of
developing the drive electronics microprocessor firmware has become one of the more sched-
ule critical tasks on recent projects. Quite often, firmware development is dependent on
all of the drive electronics being functional. Firmware is usually developed using the real
disk electronics, including any custom IC's and a functioning disk mechanism. Building
breadboard prototypes is no longer practical, as breadboards have become too complex
to build quickly, and the function of breadboards rarely matched that of the IC's. But,
if the firmware development team could interact with a model that exactly represents the
function of a custom interface IC, after the IC functionality has been frozen, but before
IC silicon is available, firmware development may be able to start several months earlier
than possible now. The IC model simulation performance has to be sufficient to allow
the firmware developers to make reasonable progress on a daily basis. As high a level
model as possible of a chip will be needed to have any possibility of being fast enough for
firmware development. High level simulations may be able to take advantage of multipro-
cessor workstation architectures, where independent IC functions can be implemented in
different processors, to emulate the parallel processing inherent in the IC itself. Such a ver-
ification environment would mean using much the same software techniques used now, but
partitioning that software between several processors. Interprocess communication would
have to be very efficient. Some VLSI tool environments that allow firmware developers to
access IC models are beginning to be marketed at this time. Tools that allow firmware
and IC design to interact will improve the ability of product development teams to reach
optimal software/hardware partitioning in system designs.
More detailed extraction of layout capacitance will be needed to use timing simulators
effectively, especially as system operating speeds increase. A delay calculator that considers
the fan-in and fan-out for all cells in asynchronous circuits should be based on detailed
layout and interconnect capacitances, and provide data that can be easily inserted into
a simulation model of an IC. Such a tool will be needed for the accurate verification of
unclocked standard cell blocks, included in a largely synchronous chip, that provide fast
asynchronous control response. Pieces of such delay calculation and simulation capability
exist now in some tool environments, but the process is not completely automated at this
time.
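As a rough illustration of the kind of delay calculator discussed above, the following sketch estimates the delay of one cell output from its extracted wiring capacitance and the input capacitances of the cells it drives. It assumes a simple lumped RC model with the 50% delay taken as ln(2) times R times C; the function name and all component values are hypothetical, and a production tool would use a far more detailed model.

```python
import math

def stage_delay_ns(r_drive_kohm, c_wire_pf, fanout_c_in_pf):
    """50% delay of a lumped RC stage: ln(2) * R * C_total.

    kilo-ohm times picofarad gives nanoseconds, so no unit conversion is needed.
    """
    c_total = c_wire_pf + sum(fanout_c_in_pf)
    return math.log(2) * r_drive_kohm * c_total

# Hypothetical net: 2 kohm driver, 0.15 pF of extracted wiring, three 0.05 pF loads.
delay = stage_delay_ns(2.0, 0.15, [0.05, 0.05, 0.05])
```

The resulting per-net delays are the sort of data that would be back-annotated into the timing simulation model of the IC.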
One of the major problem areas of current IC design is that the engineer is still required
to invent a set of vectors that prove that a given circuit performs exactly as intended. As
circuits become more and more complex, the possibility of undiscovered functional flaws
increases. The need exists for tools that will help the engineer understand the bounds
of a design, by asking questions that the engineer may not have considered. The most
vulnerable areas of a design are on block or system boundaries, where specifications may
be imprecise or incomplete. Expert systems tied to automatic vector generators that can
be directed to part or all of a design may be an effective way of improving circuit quality.
Only in the last year or two has the time required to define and verify a simulation
model of an integrated circuit been shorter than the time to build and verify the circuit
layout. For a completely new IC design of the type used in disk drives, the definition phase
is about 8 months and the implementation phase about 5 months. It seems likely that in the
future, the implementation phase will become shorter, especially for largely synchronous
designs, as VLSI tools improve and workstations gain in performance. Very soon, the
length of the definition phase will need to be reduced. Some improvement can be made
through better use of existing tools, such as functional simulators. But the opportunity
for entirely new tools, using the computational power that will be available in the 1990's,
that manage most of the details of a model, so that the engineer can concentrate on the
function of a design, should be welcome. Considerable software tool help will be needed
to maintain short schedules if large amounts of asynchronous logic or analog circuits are
added to designs.
6 Testing.
From 1980 to about 1987, integrated circuit test cost was a small fraction of the total
manufactured cost of most custom IC's used in disk drives. For most of the $100 NMOS
circuits, a test cost of $6 to $8 was not unusual. As the total IC cost was brought down,
the test cost remained somewhat constant, and now is about 30% of the price of some
custom IC's. The overhead expenses associated with IC testing are increasing gradually,
making it difficult to achieve further IC cost reductions. At the same time, the much larger
production volumes of recent years have added more emphasis on test coverage, tending to
increase test costs further. The issues of having synchronous, asynchronous, and analog
circuits on the same chip could easily increase test costs even more.
Higher production volumes have focused attention on the rate of line scrap, which
involves parts that fail to work at the final stage of assembly of a printed circuit board.
Where a rate of 1% to 2% was once at least tolerable, when production volumes were
about 1000 units per month, goals now are set at a few hundred parts per million, about
0.03%. Production volumes greater than 100,000 units per month will require line scrap
rates of less than 0.01%. The higher production volumes make it easier to purchase faster
test equipment, but much more engineering effort is still required to improve test coverage.
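The scrap rate figures above can be checked with a few lines of arithmetic. The helper names below are illustrative only.

```python
def ppm_to_percent(ppm):
    """Convert parts per million to a percentage."""
    return ppm / 1_000_000 * 100

def scrapped_units(units_per_month, scrap_ppm):
    """Expected number of scrapped parts per month at a given line scrap rate."""
    return units_per_month * scrap_ppm / 1_000_000

old_scrap = scrapped_units(1_000, 10_000)   # 1% at 1000 units per month
new_scrap = scrapped_units(100_000, 100)    # 0.01% at 100,000 units per month
```

Under these numbers, a 1% rate at 1000 units per month and a 0.01% rate at 100,000 units per month both correspond to about ten scrapped parts per month, which is one way to see why the rate goal tightens as volume grows.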
One high production volume part has shown that just one test methodology may not be
enough to achieve 100 ppm line scrap rates. This standard cell part was designed with scan
circuitry access to almost all nodes, with automatic test vector generation used to achieve
greater than 99.7% node coverage. When run on production lots, the scan path vectors
produced about a 5% test escape rate. In a simultaneous effort, pin level parallel testing
was implemented on the same chip, with somewhat better than 70% node coverage. When
run at the normal operating speed of the chip, the parallel vectors also produced a test
escape rate of nearly 5%. The conclusions derived from these tests indicated that, since the
scan testing was based on stuck-at faults, and was not run at the full operating speed of the
chip, several high impedance shorts and some open circuits were probably missed. In the
parallel testing, incomplete node coverage allowed some faults to slip through. Running
both tests has, in fact, nearly reached the design goal of 100 ppm line scrap rate, with a
reasonable overall test cost. The overlap of the two test methodologies appears effective
in catching more of the failure modes of CMOS IC's. It also shows that going from 99.7%
node coverage to 99.999% node coverage for scan testing is probably not worth the effort.
Similarly, going from 70% to 90% or higher for parallel test node coverage may not reduce
the test escapes by very much.
Fault simulators are very effective in reducing line scrap rates of IC's, but the use of only
the stuck-at models misses shorts and opens, which can be some of the more significant
fabrication problems in CMOS today. Fault simulators need to be provided, for example,
with pairs of signals that are close together in routing channels for 100 microns or more,
so that tests can be generated for signals that may be shorted together by metal stringers or
other flaws. The problem is that signals that are in routing channels may come from distant
blocks that have no relation to each other, and have lengthy initialization sequences. Such
lengthy tests can raise test costs quickly. One alternative is to avoid the use of layouts that
require the use of long routing channels. Another choice may be to back further away from
minimum design rules in routing channels, which could easily increase the silicon cost.
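A minimal sketch of how such signal pairs might be extracted for a fault simulator is given below. The layout model is a deliberate simplification and entirely hypothetical: each routing channel segment is a tuple of net name, track number, start and end position in microns, and only segments on adjacent tracks are considered bridging candidates.

```python
from itertools import combinations

def bridging_candidates(segments, min_run_um=100.0):
    """Return pairs of nets that run side by side for at least min_run_um.

    segments: list of (net, track, start_um, end_um) for one routing channel,
    where adjacent tracks differ by exactly 1.
    """
    pairs = set()
    for (n1, t1, s1, e1), (n2, t2, s2, e2) in combinations(segments, 2):
        if n1 == n2 or abs(t1 - t2) != 1:
            continue  # same net, or not on neighboring tracks
        overlap = min(e1, e2) - max(s1, s2)  # length of the shared parallel run
        if overlap >= min_run_um:
            pairs.add(tuple(sorted((n1, n2))))
    return sorted(pairs)

channel = [
    ("clk", 0, 0, 500),
    ("data3", 1, 50, 400),
    ("reset", 2, 0, 80),
    ("scan_en", 1, 420, 480),
]
candidates = bridging_candidates(channel)
```

Each returned pair could then be handed to the fault simulator as a candidate short, so that vectors driving the two nets to opposite values are generated.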
Board level scan testing, based on the IEEE 1149 standard, will no doubt see consid-
erable development in the 1990's. Board testing that involves complex custom IC's can
be difficult. If a test fixture cannot report which part has failed, the printed circuit man-
ufacturing line tends to remove the most complex IC on a board first, in the hope that
a test failure can be quickly fixed. The problem is that the removed part doesn't get a
second chance, and goes into the line scrap bin, accompanied with very little or no diag-
nostic data. The proposals for an industry standard printed circuit board boundary scan
methodology could help manufacturing lines, by providing more tools to identify when a
complex IC has failed. With surface mount parts preventing as many board tester contact
points as were possible with through-hole parts, boundary scan may be a big help towards
improving printed circuit diagnostic capability. In theory, the boundary scan protocol of
a printed circuit board could permit scan access to some blocks within the complex IC's,
which could further improve the location of failure mechanisms. The use of built-in self
testing and firmware that is downloaded into RAM may add more flexibility to printed
circuit board testing in the future.
Another diagnostic tool that could see considerable use in the future is the phase
contrast scanning electron microscope. This equipment is capable of displaying a video
interpretation of the surface of a portion of a chip on one screen, and a symbolic layout of
the same portion of the chip in a second screen. A virtual probe can be used on the SEM
display to see the voltage at any point on the surface of a passivated or non-passivated
chip. A high speed parallel tester drives the pins of the chip with a repeating pattern. Even
the voltage profile along interconnect can be observed. Delay times can be measured, so
that new circuits could be characterized in considerable detail. The symbolic layout is
used to locate noncontact probes on specific circuit nodes on the actual chip surface. A
particular net of the circuit can be selected and highlighted in the layout, and the SEM
processor will find and magnify the same area on the actual chip. Defects could be found
by comparing voltage levels on the test chip with reference levels. This kind of capability
suggests a common data base between the test and the design environments. Integrated
design, test and manufacturing support tools will be increasingly valuable in the future.
7 Summary.
The VLSI tools developed in the next decade could easily increase one engineer's produc-
tivity by two or three orders of magnitude. The measurement of that productivity increase
will probably not be solely in terms of FETs per day, but more likely in the ability to set
and meet schedules, create highly manufacturable designs, and to produce manufacturable
parts on the first pass through the fab. The VLSI design engineer may well be part of a
small team, consisting of engineers that are specialists in using one or more VLSI software
tools. More analog design skills will probably be needed.
The path to the most cost effective designs will very likely continue to be full custom
or structured custom design, with the use of logic synthesis tools or function generators at
the block level. Mixed asynchronous, synchronous and analog circuits will appear on more
IC's. Higher speed designs will certainly require more attention to be placed on secondary
analog effects in digital circuits.
In the next decade, the potential exists for single chip designs to replace almost all of
the electronic components of present disk memory devices, at all levels of performance.
The intensely competitive disk drive market suggests that shorter development schedules
are most important, with cost and manufacturability following closely. Technology choices
are going to be influenced heavily by potentially very large production volumes. These
issues will no doubt be true for much of the electronics industry.
CCSDS Reed Solomon VLSI Chip Set
K. Cameron, S. Whitaker, N. Liu, K. Liu and J. Canaris
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - A highly efficient error correcting code has been selected by NASA
as a CCSDS standard: the 16 symbol error correcting Reed Solomon code.
A VLSI implementation of this decoder is described in this paper. A total
of 4 full custom VLSI chips are needed that correct data in real time at a
sustained rate up to 80 Mbits/second.
1 Introduction
A Reed Solomon (RS) code has been selected by both the European Space Agency (ESA)
and NASA [1] as the outer code in a concatenated coding scheme for CCSDS space com-
munications. This Reed Solomon code is a (n, n-32) block code of 8-bit symbols capable
of correcting up to 16 symbol errors; n assumes values less than or equal to 255. When
n < 255, a shortened code is generated which is desirable for certain applications.
Several VLSI implementations of the decoder have been presented in the literature.
The first by Liu [2] required 40 VLSI chips with 100 support chips and operated at a 2.5
Mbit/second rate. Another VLSI implementation was suggested by Shao et al. [3] with
no performance data given. Both of these designs utilized systolic arrays. The design
presented in this paper does not utilize systolic arrays but rather is a set of custom VLSI
chips. Moreover, the architecture is invariant for any RS code defined over $GF(2^8)$.
The VLSI architecture requires a small chip count and guarantees real time decoding
for data rates up to 80 Mbits/second. The design presented in this paper achieves the data
rate with 4 VLSI chips.
2 Code Specification
The RS code used can be described with the following parameters and notation:
Symbol             Definition
$q$                the number of bits in each symbol
$n \le 2^q - 1$    the number of symbols per RS codeword
$t$                the number of correctable symbol errors
$2t$               the number of check symbols
$k = n - 2t$       the number of information symbols
$c(x)$             the code block represented as an order $n - 1$ polynomial
$m(x)$             the $k$ information symbols represented as an order $k - 1$ polynomial
$g(x)$             the order $n - k$ generator polynomial
For the code under consideration, q = 8 and t = 16.
2.1 Code Description
The RS code word is defined as:
$$c(x) = x^{2t} m(x) + \left[ x^{2t} m(x) \bmod g(x) \right]. \tag{1}$$
Simply stated, every valid code word is a multiple of the generator polynomial g(x). In its
simplest form, the generator polynomial is defined as
$$g(x) = \prod_{i=0}^{2t-1} (x - \alpha^i) = \sum_{j=0}^{2t} g_j x^j \tag{2}$$
where $\alpha$ is a primitive element of the field.
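The statement that every valid code word is a multiple of $g(x)$ can be checked numerically. The sketch below is illustrative only and makes several assumptions that are not part of the chip set description: it uses the field polynomial $x^8 + x^7 + x^2 + x + 1$ (0x187) published for the CCSDS code, takes $\alpha$ as the element 2, uses the simple generator form with roots $\alpha^0$ through $\alpha^{2t-1}$, and picks a small $2t = 4$ and a three symbol message to keep the example short. All function names are hypothetical.

```python
PRIM_POLY = 0x187  # x^8 + x^7 + x^2 + x + 1 (assumed CCSDS field polynomial)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add, reducing modulo PRIM_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def poly_mul(p, q):
    """Multiply polynomials with GF(2^8) coefficients (highest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, pc in enumerate(p):
        for j, qc in enumerate(q):
            out[i + j] ^= gf_mul(pc, qc)
    return out

def poly_mod(dividend, divisor):
    """Remainder of polynomial division; the divisor must be monic."""
    rem = list(dividend)
    for i in range(len(dividend) - len(divisor) + 1):
        coef = rem[i]
        if coef:
            for j, dc in enumerate(divisor):
                rem[i + j] ^= gf_mul(coef, dc)
    return rem[-(len(divisor) - 1):]

def generator(two_t, alpha=2):
    """g(x) = product of (x - alpha^i), i = 0 .. 2t-1; minus equals plus in GF(2^8)."""
    g = [1]
    for i in range(two_t):
        g = poly_mul(g, [1, gf_pow(alpha, i)])
    return g

def encode(message, two_t):
    """Systematic code word: message symbols followed by 2t check symbols."""
    g = generator(two_t)
    shifted = list(message) + [0] * two_t          # x^{2t} m(x)
    return list(message) + poly_mod(shifted, g)    # append the remainder

codeword = encode([0x12, 0x34, 0x56], two_t=4)
remainder = poly_mod(codeword, generator(4))       # all zero for a valid code word
```

Dividing the resulting code word by $g(x)$ leaves a zero remainder, and the message symbols appear unchanged in the high order positions.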
A more general form of the generator polynomial is defined as:
$$g(x) = \prod_{i=s}^{s+2t-1} (x - \beta^i) = \sum_{j=0}^{2t} g_j x^j \tag{3}$$
where $s$ is an offset and $\beta$ is a primitive field element equal to $\alpha^h$. This form is the one
used by NASA and ESA, where $\beta = \alpha^{11}$ and $s = 112$. The offset of 112 makes the
coefficients of $g(x)$ symmetrical [1].
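The symmetry of the generator coefficients can be confirmed with a short calculation. The sketch below assumes the field polynomial $x^8 + x^7 + x^2 + x + 1$ (0x187) published for the CCSDS code and takes $\alpha$ as the element 2; the function names are hypothetical. It builds the generator from the 32 roots $\beta^{112}$ through $\beta^{143}$ with $\beta = \alpha^{11}$.

```python
PRIM_POLY = 0x187  # x^8 + x^7 + x^2 + x + 1 (assumed CCSDS field polynomial)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements, reducing modulo PRIM_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def ccsds_generator(t=16, s=112, h=11, alpha=2):
    """Coefficients of g(x), highest degree first, with roots beta^s .. beta^(s+2t-1)."""
    beta = gf_pow(alpha, h)
    g = [1]
    for i in range(s, s + 2 * t):
        root = gf_pow(beta, i)
        shifted = g + [0]                               # g(x) * x
        scaled = [0] + [gf_mul(c, root) for c in g]     # g(x) * root
        g = [a ^ b for a, b in zip(shifted, scaled)]    # g(x) * (x + root)
    return g

g = ccsds_generator()
is_symmetric = (g == g[::-1])
```

The symmetry follows because the exponents 112 through 143 pair up to sums of 255, so the set of roots is closed under inversion. A symmetric $g(x)$ roughly halves the number of distinct constant multipliers needed in an encoder.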
2.2 Decoding Algorithm
During transmission, errors can occur due to noise in the channel which is equivalent to
an error polynomial being added to the code polynomial c(x). Let the received polynomial
be:
$$R(x) = c(x) + E(x) = R_{n-1}x^{n-1} + \cdots + R_1 x + R_0 \tag{4}$$
where $E(x)$ is the error polynomial, $n \le 255$ and each $R_i$ is a field element. Symbols
$R_i$, $i < 32$, are the check symbols. The first step in the decoding algorithm is to calculate
the syndromes. The syndrome polynomial is defined as:
$$S(x) = R(x) \bmod g(x) \tag{5}$$
and contains the information needed to correct errors and/or detect the presence of an
uncorrectable error. Each byte $S_k$ of the syndrome polynomial is defined as:
$$S_k = \sum_{i=0}^{n-1} R_i \beta^{i(k+s)} \tag{6}$$
where $0 \le k \le 2t - 1$. The syndrome polynomial can be expressed as:
$$S(x) = \sum_{k=0}^{2t-1} S_k x^k. \tag{7}$$
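The syndrome calculation can be sketched as the evaluation of the received polynomial at each of the 32 generator roots. The example below is an illustration, not the chip's implementation: it assumes the field polynomial 0x187, $\alpha = 2$, $\beta = \alpha^{11}$ and $s = 112$, and for brevity uses $g(x)$ itself as a short valid code word. All names are hypothetical.

```python
PRIM_POLY = 0x187  # x^8 + x^7 + x^2 + x + 1 (assumed CCSDS field polynomial)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements, reducing modulo PRIM_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

T, S, H = 16, 112, 11
BETA = gf_pow(2, H)

def poly_eval(coeffs_low_first, x):
    """Evaluate a polynomial (coefficient of x^0 first) by Horner's rule."""
    acc = 0
    for c in reversed(coeffs_low_first):
        acc = gf_mul(acc, x) ^ c
    return acc

def syndromes(received):
    """S_k = R(beta^(k+s)) for k = 0 .. 2t-1; received[i] is the coefficient R_i."""
    return [poly_eval(received, gf_pow(BETA, k + S)) for k in range(2 * T)]

def generator_low_first():
    """g(x) with roots beta^s .. beta^(s+2t-1), coefficient of x^0 first."""
    g = [1]
    for i in range(S, S + 2 * T):
        root = gf_pow(BETA, i)
        g = [a ^ b for a, b in zip([0] + g, [gf_mul(c, root) for c in g] + [0])]
    return g

codeword = generator_low_first()       # g(x) is itself a multiple of g(x)
clean = syndromes(codeword)            # every syndrome is zero
corrupted = list(codeword)
corrupted[5] ^= 0x5A                   # single symbol error of magnitude 0x5A at i = 5
dirty = syndromes(corrupted)           # every syndrome is now nonzero
```

For a single error of magnitude $e$ at position $i$, each syndrome reduces to $e\,\beta^{i(k+s)}$, which is the information the later decoding steps use to recover the location and magnitude.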
The next step is to obtain the error location $\Lambda(x)$ and error magnitude $\Omega(x)$ polynomials.
These polynomials have the following relationship with the syndrome polynomial:
$$S(x)\Lambda(x) = \Omega(x) \bmod x^{2t} \tag{8}$$
The error location and error magnitude polynomials can be obtained by using Euclid's
greatest common divisor algorithm [4], which is a recursive operation. The algorithm is
described later.
Once the two polynomials are known, the location and magnitude of a given error are
found as follows. Let $\beta^i$ be a zero of $\Lambda(x)$ (i.e. $\Lambda(\beta^i) = 0$); then the error at
location $n - i - 1$ has a magnitude computed from $\Omega(x)$ and $\Lambda'(x)$ evaluated at $\beta^i$,
where $\Lambda'(x)$ is the first derivative of $\Lambda(x)$ with respect to $x$.
For more details and examples, the reader is referred to Clark and Cain [4].
2.3 Mathematical Considerations
Each of the 255 8-bit symbols of the code polynomial is a member of the finite Galois
Field $GF(2^8)$. A Galois Field can be defined by an irreducible polynomial $p(x)$ [4]. For
the field under consideration, $p(x) = x^8 + x^7 + x^2 + x + 1$. Addition of field elements is
accomplished by bit-wise modulo 2 addition (exclusive-or).
Multiplication of field elements is a bit more complicated. If each of the two field
elements is represented as a polynomial of order 7, then the product is obtained by
multiplying the two polynomials modulo $p(x)$. The result is an order 7 polynomial,
which represents a field element.
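The two field operations can be written out directly. The sketch below assumes the field polynomial $x^8 + x^7 + x^2 + x + 1$ (0x187) published for the CCSDS code and stores each element as an 8-bit integer whose bit $j$ is the coefficient of $x^j$; the function names are illustrative.

```python
PRIM_POLY = 0x187  # x^8 + x^7 + x^2 + x + 1 (assumed CCSDS field polynomial)

def gf_add(a, b):
    """Addition is bit-wise modulo 2 addition (exclusive-or)."""
    return a ^ b

def gf_mul(a, b):
    """Multiply two order 7 polynomials over GF(2), then reduce modulo p(x)."""
    # Step 1: carry-less product, a polynomial of order up to 14.
    product = 0
    for bit in range(8):
        if (b >> bit) & 1:
            product ^= a << bit
    # Step 2: cancel the terms of order 14 down to 8 with shifted copies of p(x).
    for bit in range(14, 7, -1):
        if (product >> bit) & 1:
            product ^= PRIM_POLY << (bit - 8)
    return product

x8 = gf_mul(0x80, 0x02)   # x^7 * x = x^8, which reduces to x^7 + x^2 + x + 1
```

Here `x8` evaluates to 0x87, the bit pattern of $x^7 + x^2 + x + 1$, since $x^8$ is congruent to that polynomial modulo $p(x)$.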
Multiplication by a constant is a special case which is used frequently in the imple-
mentation of the encoder/decoder. Multiplication by a constant is a unary operator that
operates on a polynomial representation of the field element. The operator can be repre-
sented by an 8 by 8 matrix that maps the polynomial onto its final representation [5,6].
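The matrix view of a constant multiplier can be sketched as follows. Column j of the matrix is the product of the constant with x^j, so applying the matrix to the bit vector of a field element needs only AND and exclusive-or operations. The field polynomial 0x187 and all function names here are assumptions made for illustration.

```python
P = 0x187  # assumed field polynomial x^8 + x^7 + x^2 + x + 1


def gf_mul(a, b):
    """Reference GF(2^8) multiply (shift-and-add, reduced modulo P)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= P
    return result


def const_mul_matrix(c):
    """8 x 8 bit matrix M such that M applied to v equals c * v."""
    columns = [gf_mul(c, 1 << j) for j in range(8)]      # c * x^j
    return [[(columns[j] >> i) & 1 for j in range(8)] for i in range(8)]


def apply_matrix(matrix, v):
    """Multiply the bit vector of v by the matrix over GF(2)."""
    out = 0
    for i in range(8):
        bit = 0
        for j in range(8):
            bit ^= matrix[i][j] & ((v >> j) & 1)
        out |= bit << i
    return out
```

Because the matrix is fixed once the constant is chosen, each output bit is a fixed exclusive-or of input bits, which is why a constant multiplier is so much smaller than a general one.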
Moreover, it is possible to allow the code to be described in a dual basis [7]. A dual
basis is actually just another representation of the original field. If v is a q bit symbol in
the original representation of the field, it can be represented by the vector V in the dual
basis. The relationship between v and V is
$$V = Tv \quad \text{and} \quad v = T^{-1}V$$
where T is a linear operator in the field. Any operator L in the original representation
of the field can be used in the dual representation by transforming it as follows:
$$L_{dual} = T L_{original} T^{-1}. \qquad (10)$$
A single chip encoder that produces RS block codes in the dual
basis has been implemented [5]. The decoder described here operates in either the dual
basis or regular representation.
3 Architecture
The architecture and cell design are crucial factors in efficient use of silicon area. Cell
interconnect is the most important issue in efficient chip design. Interconnect can consume
major portions of a chip and greatly limit the amount of circuitry that can be placed on
a chip. The objective in the design here was to minimize the amount of cell interconnect.
One of the major problems to overcome in using a Reed Solomon code is the large
number of operations that must be executed to perform error correction. The operations
that must be performed for each message are:
Syndrome     evaluation of 32 equations of order 254
Euclid       recursive evaluation between polynomials of degree 32 and 31
Polynomial   256 evaluations of polynomial of degree 16
             256 evaluations of polynomial of degree 15
             256 evaluations of polynomial of degree 15
Correct      256 divisions and additions of field elements
The number of operations for each of the above modules is:






Table 1: Number of Operations per Module
The number of calculations per message in the CCSDS code is 38,772. Operating at 80
Mbits/second, the number of operations per second is 1.5 billion. Clearly, this operation
rate cannot be realized with a stored program computer.
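The 1.5 billion figure can be checked with a short calculation. The sketch assumes a full-length message of 255 symbols of 8 bits each, so that an 80 Mbit/second stream carries roughly 39,000 messages per second; the variable names are illustrative.

```python
ops_per_message = 38_772            # calculations per message, from the text
bits_per_message = 255 * 8          # assumed: full 255-symbol block of 8-bit symbols
bit_rate = 80_000_000               # 80 Mbits/second

messages_per_second = bit_rate / bits_per_message
ops_per_second = ops_per_message * messages_per_second

print(f"{messages_per_second:,.0f} messages/s, {ops_per_second / 1e9:.2f} billion ops/s")
```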











Table 3: Custom Transistor Count
VLSI is one approach to implementing high performance Reed Solomon decoders.
There are three technologies for realizing VLSI: Gate arrays, standard cells and full cus-
tom. The first two approaches are relatively easy to implement but are limited in both
performance and complexity. The CCSDS decoder would require approximately 131,000
gate equivalents, not counting necessary ROM and RAM. Clearly, it is impossible to re-
alize the entire decoder on a single chip using standard cells or gate arrays. The next
step would be to try to partition the decoder into separate modules. Shown below are the
gate equivalents and number of transistors using standard cell logic needed to realize the
CCSDS decoder:
Full custom VLSI yields higher performance and greater density. Shown below is the
number of transistors required to realize the CCSDS decoder with full custom VLSI.
Notice that the full custom approach requires only 138,800 transistors compared to
526,146 using standard cells.
The above system resides on 4 full custom CMOS VLSI circuits. The critical element
in achieving high level integration is to implement a custom architecture that produces
highly dense circuits. Approaches that are effective using discrete MSI or SSI logic do
not result in similar savings in full custom VLSI. One such example involves selecting
the generator polynomial. In discrete logic, selecting a symmetric generator polynomial
results in major savings [7]. However, in VLSI, this savings does not materialize [5,6,8].
Reducing interconnect is a major concern and therefore it is often more efficient to replicate
a functional unit like a multiplier than it is to attempt to share it. Sharing a multiplier will
greatly increase interconnect which consumes more area and also increases the capacitance
values thereby reducing speed.
The VLSI cells used throughout the decoder consist of the following Galois Field
processing elements: adder, constant multiplier, general multiplier, and field inverse. The
constant multiplier performs the operation $c \cdot x$, where $c$ is a constant and $x$ is a variable;
the general multiplier performs the operation $x_1 x_2$ on variables $x_1$ and $x_2$.

Figure 1: Reed Solomon VLSI System
4 System Architecture
The decoder consists of 4 VLSI chips as depicted in Figure 1. The system is configured to
perform in a pipelined manner where several messages are being processed simultaneously
as depicted next:
Message i:   Syndrome Generator
Message i-1: Euclid Multiply and Euclid Divide
Message i-2: Polynomial Solver and Correction
Message i-3: Data Output
Therefore, the latency of this system is 4.
The general operation can be described as follows: A serial data stream is input into the
serial-to-parallel converter from which the received message polynomial R(x) is generated.
R(x) is stored in a buffer RAM for temporary storage. The syndrome generator produces
the 32 symbol syndrome polynomial that is received by Euclid. The Euclid chip performs
the division and multiply portions of Euclid's algorithm. A ROM is attached to Euclid
to calculate the inverse of a given field element. Euclid produces the error location
polynomial and the error magnitude polynomial. Polynomial Solver receives these polynomials
from Euclid and performs the following simultaneous operations. The error location
polynomial is evaluated for each element in the field generated by the primitive element $\beta$.
If $\beta^i$ is a root of $\Lambda(x)$, then a signal Zero-Found is passed to the Error Correction Module.
Both $\Omega(x)$ and $\Lambda'(x)$ are evaluated for $x = \beta^i$ and these results are also presented to Error
Correction. Error Correction determines the error magnitudes; if Zero-Found is true for
$x = \beta^i$, then the magnitude for location $n-i-1$ is given by Equation 9; otherwise the
magnitude for location $n-i-1$ is 0 (no error). Since the Polynomial Solver calculates both
$\Omega(x)$ and $\Lambda'(x)$, Error Correction only has to divide these two values. Finally, the error
magnitudes are exclusive-ored with the original information.
Real time decoding is achieved. A very important feature is that the system clock is the
symbol clock, so this decoder can decode symbols at the same rate at which message
symbols are presented. Decoders that cannot use the symbol clock as the system clock
must utilize a more complex clock system where the decoder operates at a higher clock
rate than the symbol clock. Therefore, for a given technology, this decoder can operate
faster than other designs which require a system clock that operates at a higher rate than
the symbol clock. Moreover, operating at the symbol clock rate reduces the amount of
message buffering.
4.1 Syndrome Generator
The calculation of the syndromes is given in Equation 6. The term $R_i \beta^{i(k+1)}$ of each
syndrome byte is evaluated for all $R_i$, for each $k$ in $\{0, 1, 2, \ldots, 2t-1\}$ and $i$ in
$\{0, 1, 2, \ldots, n-1\}$, where $n$ is the number of input symbols in the message. A well known
logic circuit for calculating syndrome $S_k$ is shown in Fig. 2 [4]. The multiplier is a constant
multiplier with the constant $\beta^{k+1}$. A CMOS version of this circuit is implemented here with a
constant multiplier. With $n$ input symbols $R_i$, a total of $n$ clock pulses are needed to calculate
a syndrome. All 32 syndromes are calculated simultaneously with 32 circuits operating in parallel.
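The register-plus-constant-multiplier circuit amounts to Horner's rule: on every clock pulse each register is multiplied by its constant and the incoming symbol is added. The sketch below is a software model of that idea under stated assumptions, not a description of the chip: the field polynomial is taken as 0x187, the primitive element as the field element 2, symbols arrive highest-order first, and all names are hypothetical.

```python
P = 0x187   # assumed field polynomial x^8 + x^7 + x^2 + x + 1
BETA = 2    # assumed primitive element used for the syndrome constants


def gf_mul(a, b):
    """GF(2^8) multiply (shift-and-add, reduced modulo P)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= P
    return result


def gf_pow(a, e):
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def syndromes(received, two_t):
    """Compute S_0 .. S_(2t-1) in n clock pulses.

    received[0] is R_(n-1), the first symbol in; received[-1] is R_0.
    """
    constants = [gf_pow(BETA, k + 1) for k in range(two_t)]
    registers = [0] * two_t
    for symbol in received:                      # one clock pulse per symbol
        registers = [gf_mul(r, c) ^ symbol       # all registers update in parallel
                     for r, c in zip(registers, constants)]
    return registers
```

After the last symbol, register k holds the sum of $R_i \beta^{i(k+1)}$ over all i, so a word that is a multiple of a generator polynomial with roots $\beta^1, \ldots, \beta^{2t}$ yields all-zero syndromes.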
Since one of the design constraints placed on the syndrome generator is that the sys-
tem clock be equal to the symbol clock, it is necessary to calculate 32 syndromes in n
clock pulses. A common means to configure 32 circuits depicted in Figure 2 is to first
calculate 32 syndromes and then reconfigure the registers into a shift register and shift the
syndromes out. However, this would require n + 32 clock pulses to calculate and shift out
the syndromes, which is unacceptable.
Let the registers depicted in Figure 2 be called Syndrome registers. Let another register,
defined as part of the register stack, be called Shift and serve as a shift register. With
the system clock being the symbol clock, if the contents of Syndrome are transferred to
Shift after n clock pulses (n input symbols), the contents of Shift can be shifted out while
the next set of syndromes are being calculated.
The NASA specification requires that the decoder be capable of decoding dual basis RS
code words. It is necessary to transform the dual basis code words into regular field
code words; this is accomplished by operating on each received word by $T^{-1}$ as defined
above. Operating by $T^{-1}$ is equivalent to multiplying by a constant and therefore can be
implemented in a similar manner as a constant multiplier. An extra feature is added to
the syndrome generator to operate in either the regular field or the dual basis. An input
signal DUAL is provided such that if DUAL is 1, then each input symbol is multiplied by
$T^{-1}$ (translation into regular field); if DUAL = 0, then the input symbols are not affected.

Figure 2: Syndrome generator
The Syndrome engine is implemented on a single 3 micron CMOS chip measuring 4800 x 5140
microns. There are approximately 26,000 transistors with only 5% of the area devoted to
interconnect. With 32 additions and 32 multiplications occurring every 100 nanoseconds,
the equivalent instruction rate is 640 MOPS for a classical processor with a Galois Field
ALU.
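The 640 MOPS figure follows directly from the clock period. The arithmetic below assumes that the 100 nanosecond period is the symbol clock of an 80 Mbit/second stream of 8-bit symbols; the variable names are illustrative.

```python
bit_rate = 80_000_000                 # 80 Mbits/second
symbol_bits = 8
symbol_clock_hz = bit_rate // symbol_bits         # 10,000,000 symbols per second
clock_period_ns = 1e9 / symbol_clock_hz           # 100 ns per symbol

ops_per_clock = 32 + 32               # 32 additions and 32 multiplications
mops = ops_per_clock * symbol_clock_hz / 1e6

print(clock_period_ns, "ns per clock,", mops, "MOPS")
```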
4.2 Euclid Divide and Multiply
The syndrome polynomial is shifted serially into the Euclid chip from the syndrome chip.
The Euclid multiply and divide circuits recursively apply Euclid's Algorithm to find the
error location and magnitude polynomials. The Euclid module uses the following algorithm
to recursively obtain $\Lambda(x)$ and $\Omega(x)$.
$$\Omega_i(x) = \Omega_{i-2}(x) \bmod \Omega_{i-1}(x) \qquad (11)$$
$$\Lambda_i(x) = -q_i(x)\Lambda_{i-1}(x) + \Lambda_{i-2}(x) \qquad (12)$$
where $q_i(x)$ are the non-negative powers of the division of $\Omega_{i-2}(x)$ by $\Omega_{i-1}(x)$. The initial
conditions are:
1. $\Omega_{-1}(x) = x^{2t}$   2. $\Lambda_{-1}(x) = 0$   3. $\Omega_0(x) = S(x)$   4. $\Lambda_0(x) = 1$
The algorithm continues until the order of $\Omega_i(x)$ is less than $t$.
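A software sketch of the recursion in Equations 11 and 12 is given below. It is a plain sequential model, not the multiply/divide architecture of the chip. Polynomials are lists in which index j holds the coefficient of x^j, the field polynomial is assumed to be 0x187, and every helper name is hypothetical. Since subtraction in GF(2^8) is exclusive-or, the minus sign in Equation 12 disappears.

```python
P = 0x187  # assumed field polynomial x^8 + x^7 + x^2 + x + 1


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= P
    return result


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    return gf_pow(a, 254)            # a^255 = 1 for every nonzero element


def deg(p):
    """Degree of a coefficient list; -1 for the zero polynomial."""
    for i in range(len(p) - 1, -1, -1):
        if p[i]:
            return i
    return -1


def poly_add(a, b):
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [x ^ y for x, y in zip(a, b)]


def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] ^= gf_mul(x, y)
    return out


def poly_divmod(num, den):
    """Return (quotient, remainder) of num / den; den must be nonzero."""
    num = list(num)
    dd = deg(den)
    inv_lead = gf_inv(den[dd])
    quot = [0] * max(deg(num) - dd + 1, 1)
    while deg(num) >= dd:
        shift = deg(num) - dd
        factor = gf_mul(num[deg(num)], inv_lead)
        quot[shift] = factor
        for i in range(dd + 1):
            num[i + shift] ^= gf_mul(factor, den[i])
    return quot, num


def euclid(synd, t):
    """Return (Lambda, Omega) with S*Lambda = Omega mod x^(2t), deg Omega < t."""
    omega_prev, omega = [0] * (2 * t) + [1], list(synd)    # x^(2t) and S(x)
    lam_prev, lam = [0], [1]
    while deg(omega) >= t:
        q, r = poly_divmod(omega_prev, omega)              # Equation 11
        omega_prev, omega = omega, r
        lam_prev, lam = lam, poly_add(poly_mul(q, lam), lam_prev)  # Equation 12
    return lam, omega
```

The returned pair is determined only up to a common scalar factor, which cancels when the two polynomials are later divided.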
The organization of Euclid minimizes interconnect, and when implemented with the
general multiplier, can calculate the error magnitude polynomial very rapidly. The version
implemented for NASA finds the location and magnitude polynomials in less than 237
clock cycles.
The Euclid chip is implemented on a single 3 micron CMOS chip, with
approximately 61,900 transistors in a 7600 x 6800 area. This extraordinary density is
achievable because: 1) The general multiplier can be drawn exceedingly dense, and 2)
The given architecture is highly regular and requires virtually no interconnect. These two
characteristics make it ideal for VLSI implementation.
4.3 Polynomial Solver
Polynomial Solver evaluates three polynomials simultaneously: the error location polynomial
$\Lambda(x)$, error magnitude polynomial $\Omega(x)$, and the first derivative of the error location
polynomial $\Lambda'(x)$. These polynomials are evaluated in three register stacks. One stack, the
$\Lambda(x)$ stack, searches for the zeros of the error location polynomial. An adjacent register
stack evaluates the derivative of the error location polynomial. The $\Lambda'(x)$ register stack
shares the same input bus as error location, but only loads the odd coefficients of the location
polynomial. The third register stack receives the error magnitude polynomial from
the Euclid module. The $\Omega(x)$ register stack has a data path totally separate from the other
two register stacks.
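Loading only the odd coefficients works because the field has characteristic 2: in the formal derivative, the term $j \Lambda_j x^{j-1}$ vanishes whenever j is even. A small sketch, with a hypothetical function name and coefficient lists indexed by power of x:

```python
def formal_derivative(coeffs):
    """Derivative of a polynomial over GF(2^m); coeffs[j] multiplies x^j.

    j * c_j equals c_j when j is odd and 0 when j is even, so only the
    odd-indexed coefficients survive, each moved down one position.
    """
    return [coeffs[j] if j % 2 == 1 else 0 for j in range(1, len(coeffs))]
```

For instance, 7 + 5x + 9x^2 + 3x^3 has derivative 5 + 3x^2, so `formal_derivative([7, 5, 9, 3])` gives `[5, 0, 3]`.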
With n symbols in the received polynomial, there are n possible symbol errors. The
zeros of the error location polynomial $\Lambda(x)$ specify the location of the symbol errors as
defined in Equation 9 and restated here: If $\beta^i$ is a zero of $\Lambda(x)$ ($\Lambda(\beta^i) = 0$), then the
error is in location $n-i-1$, $i = 0, 1, \ldots, n-1$. Finding the zeros of $\Lambda(x)$
involves a search of the elements in the field. If the number of zeros of $\Lambda(x)$ is equal to
the degree of $\Lambda(x)$, then the message is said to be correctable, otherwise an uncorrectable
error condition exists [4].
For a full code length where n = 254, all 255 field elements must be searched. If $\Lambda(x)$
can be evaluated for each field element in one clock pulse, then a total of 255 clock pulses
are required to search through the elements of the field. Moreover, since the complete set
of field elements is being examined, the order in which the field elements are searched
does not affect execution speed.
For shortened codes and with the constraint that the system clock is equal to the symbol
clock, searching through 255 elements cannot be permitted. However, for n < 254, the
possible error message locations are $n-1, n-2, \ldots, 1, 0$. To determine if an error occurred
in one of the locations $n-1, n-2, \ldots, 1, 0$, field elements $\beta^0, \beta^1, \ldots, \beta^{n-1}$ respectively must
be evaluated in $\Lambda(x)$. Any zero of $\Lambda(x)$ for $x = \beta^j$, $j > n-1$, would correspond to a
nonexistent message symbol. To evaluate $\Lambda(x)$ for only n field elements and hence require
only n clock pulses, it is necessary to search the field in the order defined above, which is
done in the Polynomial Solver module.
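The ordered search can be modelled as a loop that evaluates the error location polynomial at successive powers of the primitive element and stops after n of them. This is only a sketch: the field polynomial 0x187, the primitive element 2, and the function names are assumptions, and the mapping from a zero index i to a message location follows whatever convention the decoder uses.

```python
P = 0x187  # assumed field polynomial x^8 + x^7 + x^2 + x + 1


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= P
    return result


def poly_eval(coeffs, x):
    """Evaluate a polynomial (coeffs[j] multiplies x^j) by Horner's rule."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def find_zeros(lam, n, beta=2):
    """Return every i in 0..n-1 for which lam(beta^i) is zero.

    One field element is tested per clock pulse, in the order
    beta^0, beta^1, ..., beta^(n-1), so a shortened code needs only n pulses.
    """
    zeros = []
    x = 1                                  # beta^0
    for i in range(n):
        if poly_eval(lam, x) == 0:
            zeros.append(i)
        x = gf_mul(x, beta)                # advance to the next power
    return zeros
```

With a polynomial whose zeros are at the 3rd and 40th powers, a search over 20 elements reports only index 3; the other zero lies outside a block of that length and is never examined.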
Since only n clock pulses are allowed to determine the zeros of $\Lambda(x)$, there must be
a separate control section to input the $\Lambda(x)$ coefficients from the Euclid module. As in
the case of the Syndrome Generator, the Polynomial Solver has a serial shift register that
accepts $\Lambda(x)$ asynchronously. In this mode, it is possible to receive the coefficients of $\Lambda(x)$
for one message while at the same time searching for the zeros of the previous message.
When the field elements of one error location polynomial have been completely searched,
the coefficients from the $\Lambda(x)$ previously loaded into the shift register can begin immediate
evaluation, hence completing the required search in n clock pulses.
The error magnitude and first derivative of the error location polynomials are evaluated
at the same time and for the same field elements as the error location polynomial. The
correction module is interested in the evaluation of $\Omega(x)$ and $\Lambda'(x)$ only for those field
elements where $\Lambda(x) = 0$. Even though it is not necessary to evaluate $\Omega(x)$ and $\Lambda'(x)$
at every field element, it does no harm to perform these calculations. However, from a
speed of operation point of view, there is a great advantage in parallel evaluation of all
polynomials for each field element and to calculate all polynomials in synchronism. When
$\Lambda(x) = 0$, the value of $\Omega(x)$ and $\Lambda'(x)$ at the field element which forced $\Lambda(x)$ to 0 has
already been calculated. The error correction module simply divides these two values as
defined in Equation 9.
Since the evaluation of $\Omega(x)$ and $\Lambda'(x)$ must operate in synchronism with the evaluation
of $\Lambda(x)$, it is necessary to have the same shift register storage system that $\Lambda(x)$ has to accept
the data from the Euclid modules.
4.4 Error Correction
The inputs from Polynomial Solver are signals:
Zero-found	 1 bit
Error magnitude evaluation 8 bits
Derivative of error location 8 bits
The essential calculation of this module is given in Equation 9. Since all the data values to
make this calculation are input from the polynomial solver, determining the error magnitude
is straightforward. A ROM is inserted in the data path between the chips to provide
the inverse of the derivative of the error location polynomial. The division specified by
Equation 9 becomes a multiplication and a general multiplier can be utilized to determine
the error magnitude.
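Replacing the division by a table lookup followed by a multiplication can be sketched as below. The 256-entry list stands in for the inverse ROM; its contents are computed here from the assumed field polynomial 0x187, and the names `INV_ROM` and `error_magnitude` are hypothetical.

```python
P = 0x187  # assumed field polynomial x^8 + x^7 + x^2 + x + 1


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= P
    return result


def gf_inv(a):
    """Inverse of a nonzero element, using a^254 = a^(-1)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


# Stand-in for the inverse ROM: address = element, data = its inverse.
# Address 0 has no inverse; it is stored as 0 and never used for a valid zero.
INV_ROM = [0] + [gf_inv(a) for a in range(1, 256)]


def error_magnitude(omega_value, lambda_deriv_value):
    """Quotient omega / lambda' computed as omega * ROM[lambda']."""
    return gf_mul(omega_value, INV_ROM[lambda_deriv_value])
```

For a single error of value e whose locator is X, the two evaluated values are e·X and X, and the lookup-then-multiply returns e.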
When Zero-found is true, the error magnitude is stored in a RAM; when Zero-found is
false, a zero error magnitude is stored in the RAM. The number of errors is counted and
any uncorrectable error condition is noted. In outputting corrected data, the error magnitudes
are fetched from the RAM and added to the input message symbols to present them to
the output system.
5 Summary
A decoder has been presented that corrects up to 16 symbol errors for a Reed Solomon code
at an 80 Mbit/second data rate. The output consists of the corrected information symbols
and a status word. The status word, which is inserted in symbol location 31 (location of the
first check symbol), contains the number of errors found and an uncorrectable error flag.
If the message is uncorrectable, the information symbols are unchanged. The equivalent
instruction rate of the decoder chip set as a whole is 1.5 BOPS. This figure ignores all
loading of registers, reading ROMs and writing RAMs, and interchip communication.
The size and transistor count for each chip is summarized next:
Module       Number of Transistors   Chip size
Syndrome     26,100                  4800 x 5140
Euclid       61,900                  7600 x 6800
Polynomial   27,600                  5540 x 5750
Correct      23,200                  17600 x 6800
The final chip set was fabricated at Hewlett Packard in a 3.0 micron CMOS process.
The VLSI chips with support FIFO and ROM chips have been incorporated onto a single
board and delivered to GSFC NASA in September 1989.
Acknowledgement The authors wish to acknowledge the efforts and support from Warner
Miller and Jim Morakis of Goddard Space Flight Center in guiding this project. Jack Venbrux,
T. J. Berge, Jay McDougal and Carrie Claflin, former graduate students, are recognized
for their efforts in designing chips that now comprise this system. Patrick Owsley
was a member of the original team that designed this chip set; he is now with Advanced
Hardware Architectures which has commercialized this chip set under the product name
AHA 4600.
References
[1] H. F. Reefs and A. R. Best, "Concatenated Coding on a Spacecraft-to-ground Telemetry
Channel Performance," Proc. ICC-81, 1981
[2] K. Y. Liu, "Architecture for VLSI Design of Reed-Solomon Decoders," IEEETC vol
C-33, pp. 178-189, Feb 1984
[3] H. M. Shao et al., "A VLSI Design of a Pipelined Reed-Solomon Decoder," IEEETC
vol C-34, pp. 393-403, May 1985
[4] G. C. Clark and J. B. Cain Error Correcting Coding For Digital Communications,
New York NY, Plenum Press, 1981
[5] G. Maki, P. Owsley, K. Cameron, and J. Shovic, "A VLSI Reed Solomon Encoder:
An Engineering Approach," IEEE Custom Integrated Circuit Conference, pp. 177-181,
May 1986
[6] G. Maki, P. Owsley, K. Cameron and J. Venbrux, "VLSI Reed Solomon Decoder
Design", IEEE Military Communications Conference, pp. 46.5.1 - 46.5.6, Oct 1986
[7] M. Perlman and J. Lee, "Reed-Solomon Encoders - Conventional vs Berlekamp's
Architecture," Jet Propulsion Laboratory, 82-71, Dec 1982
[8] G. Maki, and P. Owsley, "Parallel Berlekamp vs Conventional VLSI Architectures",
Government Microcircuit Applications Conference, pp 5-9, November 1986
Reed Solomon Error Correction
for the Space Telescope
S. Whitaker, K. Cameron, J. Canaris, P. Vincent, N. Liu and P. Owsley¹
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - This paper reports a single 8.2mm by 8.4mm, 200,000 transistor
CMOS chip implementation of the Reed Solomon code required by the Space
Telescope. The chip features a 10 MHz sustained byte rate independent of
error pattern. The 1.6 μm CMOS integrated circuit has complete decoder and
encoder functions and uses a single data/system clock. Block lengths up to
255 bytes as well as shortened codes are supported with no external buffer-
ing. Erasure corrections as well as random error corrections are supported
with programmable correction of up to 10 symbol errors. Correction time is
independent of error pattern and the number of errors.
1 Introduction
Reed Solomon (RS) codes are highly-efficient and powerful error correcting codes used by
NASA for space communication. The efficiency and power has led to the selection of the
(255, 239) RS code for the Space Telescope (ST). One of the major problems to overcome
in using a Reed Solomon code is the large number of operations that must be executed
to perform error correction. For example, the number of calculations per message in the
NASA CCSDS (255,223) code is 38,772 [1,2]. Operating at 80 Mbits/second, the number
of operations per second is 1.5 billion. Clearly, this operation rate cannot be realized with
a stored program computer.
A VLSI RS coder chip that supports real time decoding for the ST code has the
following features:
• Functions either as an encoder or decoder
• Programmed error correction capability up to 10 symbol errors
• 10 Mbytes/sec sustained data rate
• User selectable symbol clock rate, block length, number of check symbols and resulting
error correction capability
• Erasure capability
1 P. Owsley is now with Advanced Hardware Architectures.
• Shortened block length capability
• Single VLSI chip that contains the RS coder and all ROM and FIFO circuitry
2 Reed Solomon Codes
The RS code used can be described with the following parameters and notation:
Symbol          Definition
q               the number of bits in each symbol
N ≤ 2^q - 1     the number of symbols per RS codeword
t               the number of correctable symbol errors
2t              the number of check symbols
k = N - 2t      the number of information symbols
c(x)            the code block represented as an order N - 1 polynomial
m(x)            the k information symbols represented as an order k - 1 polynomial
g(x)            the order N - k generator polynomial
For the Space Telescope code, q = 8 and t = 8.
2.1 Code Description
The RS code word is defined as:
$$c(x) = x^{2t} m(x) + \left( x^{2t} m(x) \bmod g(x) \right). \qquad (1)$$
Simply stated, every valid code word is a multiple of the generator polynomial g(x). In its
simplest form, the generator polynomial is defined as:
$$g(x) = \prod_{i=0}^{2t-1} (x - \alpha^i) = \sum_{j=0}^{2t} g_j x^j \qquad (2)$$
where $\alpha$ is a primitive element of the field.
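The code word construction can be illustrated with a short sketch of systematic encoding, in which the parity symbols are the remainder of $x^{2t} m(x)$ after division by g(x). It uses the simplest generator form (roots $\alpha^0 \ldots \alpha^{2t-1}$), an assumed field polynomial 0x187 with $\alpha$ taken as the element 2, and hypothetical function names. Polynomials are lists with the highest-order coefficient first.

```python
P = 0x187  # assumed field polynomial x^8 + x^7 + x^2 + x + 1


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= P
    return result


def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] ^= gf_mul(x, y)
    return out


def generator_poly(two_t, alpha=2):
    """g(x) = product of (x - alpha^i) for i = 0 .. 2t-1 (minus is xor)."""
    g = [1]
    root = 1                                   # alpha^0
    for _ in range(two_t):
        g = poly_mul(g, [1, root])
        root = gf_mul(root, alpha)
    return g


def poly_mod(num, den):
    """Remainder of num / den for a monic den, highest-order first."""
    rem = list(num)
    for i in range(len(num) - len(den) + 1):
        coef = rem[i]
        if coef:
            for j, d in enumerate(den):
                rem[i + j] ^= gf_mul(coef, d)
    return rem[-(len(den) - 1):]


def encode(message, two_t):
    """Return the message followed by its 2t parity symbols."""
    g = generator_poly(two_t)
    shifted = list(message) + [0] * two_t      # x^(2t) * m(x)
    return list(message) + poly_mod(shifted, g)
```

Because the remainder is added back, the resulting word is a multiple of g(x) and therefore evaluates to zero at every root of the generator.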
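A more general form of the generator polynomial is defined as
$$g(x) = \prod_{i=s}^{s+2t-1} (x - \beta^i) = \sum_{j=0}^{2t} g_j x^j \qquad (3)$$
where $s$ is an offset and $\beta$ is a primitive field element equal to $\alpha^h$. The Space Telescope
code uses the generator polynomial
$$g(x) = \prod_{i=120}^{119+2t} (x - \alpha^i) \qquad (4)$$
where t = 8.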
2.2 Decoding Algorithm
During transmission, errors can occur due to noise in the channel which is equivalent to
an error polynomial being added to the code polynomial c(x). Let the received polynomial
be
$$R(x) = c(x) + E(x) = R_{N-1}x^{N-1} + \cdots + R_1 x + R_0 \qquad (5)$$
where E(x) is the error polynomial, $N \le 255$ and each $R_i$ is a field element. Symbols
$R_i$, $i < 2t$, are considered to be the check symbols. The first step in the decoding algorithm
is to calculate the syndrome polynomial S(x) which contains necessary information to
correct correctable errors or detect uncorrectable errors. Each byte $S_j$ of the syndrome
polynomial is defined as:
$$S_j = \sum_{i=0}^{N-1} R_i \alpha^{i(s+j)} \qquad (6)$$
where $0 \le j \le 2t-1$. The syndrome polynomial can be expressed as:
$$S(x) = \sum_{j=0}^{2t-1} S_j x^j. \qquad (7)$$
The next step is to obtain the error location $\Lambda(x)$ and error magnitude $\Omega(x)$
polynomials. These polynomials have the following relationship with the syndrome polynomial:
$$S(x)\Lambda(x) = \Omega(x) \bmod x^{2t} \qquad (8)$$
The error location and error magnitude polynomials can be obtained by using Euclid's
greatest common divisor algorithm [5], which is a recursive operation.
Once the two polynomials are known, the location and magnitude of a given error are
found as follows: Let $\alpha^i$ be a zero of $\Lambda(x)$ (i.e. $\Lambda(\alpha^i) = 0$), then the error magnitude at
location $N-i-1$ is
^(a') a120	 (9)
where $\Lambda'(x)$ is the first derivative of $\Lambda(x)$ with respect to x. For more details and examples,
the reader is referred to Clark and Cain [5].
3 Chip Overview
The circuit implements both the encoder and the decoder functions for a set of RS codes.
The code is defined over the finite field $GF(2^8)$ specified by the primitive polynomial
$p(x) = x^8 + x^7 + x^2 + x + 1$ and the generator polynomial, dependent on the variable t,
is given by:
$$g(x) = \prod_{i=120}^{119+2t} (x - \alpha^i) \qquad (10)$$
where $t \in \{1, 1.5, 2, 2.5, \ldots, 10\}$. The coder circuit has data in and out ports. Data is
input at a constant rate, and output with a fixed latency. The coder operates in either an
encoder or a decoder mode. All buffering is internal to the chip.

Number of Actual Errors     Correct   Remarks
E/2 + e ≤ P/2               True      Attempt Correction
P/2 < E/2 + e ≤ t           False     Attempt Correction
t < E/2 + e                 False     No Correction
2t < P                      False     No Correction
2t < E                      False     Ignore Erasures, Attempt Correction
Table 1: Error correction parameters.
The block length of the code is variable, as large as 255 bytes and as small as 23 + 10t
bytes. The code block consists of the message and 2t parity bytes, where 2t ranges from 2
to 20.
The correction/detection ability of the code is quite flexible, with the limits given by
$$t = E/2 + e + d/2 \qquad (11)$$
where 2t is the number of parity symbols, E is the number of erasures, e is the number
of random errors, e + E/2 is the correction ability of the code and d is the additional
detection ability of the code. Also let P be the number of parity symbols that will be used
for correction.
An erasure is any symbol that is identified to be in error prior to the actual decoding
process. Any byte flagged as an erasure will count against the correcting ability of the
code whether that byte is in error or not. The detection ability of the code is the ability
to detect errors beyond the correction ability of the code.
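The trade-off in Equation 11 can be made concrete with a small sketch. It classifies an error pattern by comparing E + 2e (twice the quantity E/2 + e, kept as an integer) against P and 2t, following one reading of the first three rows of Table 1. The function names are hypothetical, and the sketch assumes P ≤ 2t and E ≤ 2t.

```python
def detection_margin(E, e, two_t):
    """Additional detection ability d from t = E/2 + e + d/2."""
    return two_t - E - 2 * e


def classify(E, e, P, two_t):
    """Return (correct_flag, remark) for E erasures and e random errors.

    Assumes P <= 2t and E <= 2t; the boundary cases follow an
    interpretation of Table 1, not a verified chip behaviour.
    """
    load = E + 2 * e                  # 2 * (E/2 + e)
    if load <= P:
        return True, "attempt correction"
    if load <= two_t:
        return False, "attempt correction"
    return False, "no correction"
```

With 2t = 16 and P = 16, four erasures plus six random errors give E + 2e = 16, which is exactly at the limit and still reported as correctable; one more random error pushes the pattern past 2t.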
The parameters P and 2t are fixed. The first is read during reset on the P0:4 inputs
and the second is the number of parity symbols input with the first code block. The relationship
between the number of errors and erasures and the fixed parameters P and t is given in
Table 1.
Full correction ability of the coder is achieved when P = 2t. Making P smaller does
not change how the data path will perform correction, but it does change how the coder
reports the integrity of the output data. When P < 2t, three regions of error magnitude are
determined. In the first, where E/2 + e ≤ P/2, the error pattern is guaranteed correctable.
In the second region, where P/2 < E/2 + e < 2t - P/2, the error pattern is guaranteed
detectable. The coder will make a best guess to the right code word. If P/2 < E/2 + e ≤ t,
the coder will perform correction, but report that the block was uncorrectable. In the







Figure 1: Reset Timing.
4 Operation
4.1 Initialization
The initialization sequence consists of the reset timing and the first code block through
the coder. The reset timing is shown in Figure 1. Reset must be low for 4 clock cycles
and high for 2 clock cycles. The number indicated on P0:4 sets P. The value of StatEn is
latched at this time. DataEn must be low during reset to ensure that no spurious messages
are processed.
If StatEn is low during reset, then the chip will output the corrected parity symbols on
Dout0:7 after the message symbols of the code block. If StatEn is high, then status will
be output instead of the parity. The first status byte will indicate the number of erasures
and the second byte will indicate the total number of errors processed. The first status
byte will include a flag that indicates whether correction was attempted. If correction
is attempted then the most significant bit, Dout7, will be low, and it will be high if no
correction was attempted.
The coder will not function properly if P is set to an invalid value during initialization.
Invalid values are 0 and 21 through 31. If P > 2t, but still a valid value, then the circuit
will work as if P = 2t.
During the first code block, the number of actual parity bytes, 2t, will be set. In
subsequent blocks, if DataEn is low for more than 2t clocks, then data input on Din0:7
will be passed through the decoder with no correction applied to those bytes, i.e. they will
be treated as data inserted between code blocks and not as an element of a code block.
4.2 Normal Operation
After initialization, the coder receives data as indicated in Figure 2. DataEn is the timing
signal. It must be high when the message symbols are input and it must be low when
the parity symbols are input. The coder takes in code blocks consecutively, performs the
appropriate coding operation and outputs the data with a fixed latency of 2N + 10t + 34
clock pulses. The 2t parity symbols are indicated by the number of clock pulses that
DataEn is low during the first block.
The coder will allow two different block lengths to be intermixed. In that case, the
latency is a function of the longest block length, $N_l$. If the shorter block of length $N_s$ is
processed first, the latency will be $2N_s + 10t + 34$ until the first large block is ready to be
output, at which time the latency will become $2N_l + 10t + 34$.

[Figure 2: timing of Din0:7 and DataEn]

Figure 3: Decoder output timing.
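The latency expression is easy to evaluate for specific codes. The sketch below computes it for a full-length block with t = 8 and shows how, with two intermixed lengths, the longer block sets the steady-state latency. Function names are hypothetical, and the conversion to time assumes a 10 MHz byte clock as quoted in the abstract.

```python
def latency_clocks(block_length, t):
    """Fixed latency in clock pulses: 2N + 10t + 34."""
    return 2 * block_length + 10 * t + 34


def mixed_latency_clocks(block_lengths, t):
    """Steady-state latency when several block lengths are intermixed."""
    return latency_clocks(max(block_lengths), t)


full = latency_clocks(255, 8)                   # 510 + 80 + 34 clock pulses
mixed = mixed_latency_clocks([100, 255], 8)
microseconds = full / 10_000_000 * 1e6          # assumed 10 MHz byte clock

print(full, "clocks, about", round(microseconds, 1), "microseconds")
```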
EnabIn must be high for each block of data. If EnabIn is low for a block of data, that
block will not be corrected as it passes through the decoder. However, if EnabIn is low for
the entire block, the error statistics for that block will be output when the block is output.
This allows a block of data to be passed through the coder without correction, but error
statistics can also be presented if that option is selected.
Data can also be passed through the decoder transparently between code blocks. As
indicated in the previous section, if the DataEn is low for more than 2t clock pulses, the
data that is on the bus at that time will pass through the decoder without being operated
upon. It should be noted that this is not possible during the first code block after a reset.
The coder outputs data as shown in Figure 3. The timing signal for the output is
DataRdy. It is high while the message bytes are output and it is low for the parity
symbols. If the Correct line is high, then the message was correctable; if it is low, then
the code was determined to be uncorrectable. If StatEn is low during initialization, then
Dout will have the corrected message and parity during the times shown. If StatEn is high
during initialization, then Dout will have status bytes for the first two clock pulses during
the time labeled parity. The status words will report the number of erasures, the total
number of errors, and whether or not a correction was attempted.
4.3 Encoder Operation
The encoder function is a special case of the decoder function. A code block is input to









Table 2: Standard cell hardware requirements
After the message is input, both Erase and DataEn are brought low for the 2t clock pulses
which correspond to the parity symbols. Erase low during the parity indicates that these
locations are in error. The coder will correct these locations to the proper parity. Of
course, the StatEn line must be low during initialization to allow the parity to be output
correctly.
5 VLSI Implementation
VLSI is one approach to implementing high performance Reed Solomon decoders. There
are three VLSI technologies that could be used: Gate arrays, standard cells and full cus-
tom. The first two approaches are relatively easy to implement but are limited in both
performance and density. Using standard cells, a 10 symbol error correcting decoder would
require approximately 82,200 gate equivalents, not counting necessary ROM and RAM.
Shown in Table 2 are the number of gate equivalents and the associated number of transis-
tors that would be required to realize a standard cell design for each module that comprises
an RS decoder, except ROM and RAM. The total number of transistors needed for a stan-
dard cell design is 328,465. The full custom chip presented here requires only 200,000
transistors which includes the RAM and the ROM.
Full custom VLSI was used to achieve both circuit density and speed. Full custom
allows control of the amount of interconnect. Speed, which is a function of capacitance,
which in turn is a function of interconnect, is an important parameter in high performance VLSI.
Interconnect was minimized in this design. The VLSI architectures implemented here are
similar to previous full custom designs presented in the literature [2,3,4].
The functional modules within the coder are identified by their function. The syndrome
module produces the syndrome values according to Equation 6. Circuitry exists to calculate
2t syndromes in parallel; since t_max = 10, there are 20 parallel syndrome generator circuits.
For the ST code, 18 syndrome values are determined in parallel. The recursive Euclid
module implements Equation 8 and determines the error magnitude Ω(x) and error location
Λ(x) polynomials. The Euclid module uses an internal ROM to calculate the field inverse.
The Polynomial Solver module evaluates polynomials Λ(x), Ω(x) and Λ'(x) in parallel. This
evaluation identifies the location of an error and produces the values to calculate the error
magnitude according to Equation 9. The Correction module performs the field division and
multiplication as specified by Equation 9 to determine the error magnitude and corrects
the raw data which has been stored in the FIFO.
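As an illustration of what each parallel syndrome circuit computes, the following is a minimal software sketch. It assumes 8-bit symbols in GF(2^8) with the field polynomial x^8 + x^4 + x^3 + x^2 + 1 and code roots alpha^0 through alpha^(2t-1); neither assumption is taken from this paper, and the function names are hypothetical. Each syndrome is evaluated by Horner's rule, one multiply-and-add per incoming symbol, which is the kind of per-clock step a hardware syndrome cell would perform.

```python
FIELD_POLY = 0x11D  # assumed field polynomial, not taken from the paper


def gf_mul(a, b):
    """Multiply two GF(2^8) elements (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= FIELD_POLY
        b >>= 1
    return result


def gf_pow(a, exponent):
    """Raise a GF(2^8) element to a non-negative integer power."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, a)
    return result


def syndromes(received, two_t, alpha=2):
    """Evaluate the received word at alpha^0 .. alpha^(2t-1).

    received[0] is taken as the highest-degree coefficient (assumption).
    """
    values = []
    for i in range(two_t):
        root = gf_pow(alpha, i)
        s = 0
        for symbol in received:
            s = gf_mul(s, root) ^ symbol  # one Horner step per symbol
        values.append(s)
    return values
```

In this model an error-free code word of the assumed code gives all-zero syndromes, and a single symbol error of value e at degree j gives S_i = e * (alpha^i)^j.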
Message data enters the chip and is stored in the FIFO and input to the Syndrome
module. Messages are processed in a pipeline fashion through each of the modules. The
architecture of the chip is depicted in Figure 4. Each module was configured to mini-
mize interconnect. This was accomplished through careful data path placement such that
functional modules were adjacent.
6 Summary
A VLSI coder was presented that can function either as an encoder or decoder for Reed
Solomon codes. The error correction/detection capability of the coder can be programmed
by the user. The maximum error correction is 10 symbol errors and the maximum data
rate is 10 Mbytes/second. The correction time is independent of the number of errors in
the incoming message.
The chip was designed in a 1.6 micron CMOS process and fabricated at Hewlett
Packard. This chip was delivered to Goddard Space Flight Center in April, 1989, and
has been installed in the ground communication link for service in the Space Telescope
system.
Acknowledgement The authors wish to acknowledge the support from Warner Miller
and Jim Moralds at Goddard Space Flight Center in guiding this project. Support is also
appreciated from Dr. Paul Smith of NASA Headquarters and the NASA Space Engineering
Research Center program. This chip is commercially available from Advanced Hardware
Architectures.
References
[1] K. Cameron, et al., "CCSDS Reed Solomon VLSI Chip Set," NASA Symposium for
VLSI Design, January 1990
[2] G. Maki, P. Owsley, K. Cameron and J. Venbrux, "VLSI Reed Solomon Decoder
Design", IEEE Military Communications Conference, pp. 46.5.1 - 46.5.6, Oct 1986
[3] G. Maki, and P. Owsley, "Parallel Berlekamp vs Conventional VLSI Architectures",
Government Microcircuit Applications Conference, pp 5-9, Nov 1986
[4] G. Maki, P. Owsley, K. Cameron, and J. Shovic, "A VLSI Reed Solomon Encoder:
An Engineering Approach," IEEE Custom Integrated Circuit Conference, pp. 177-181,
May 1986
[5] G. C. Clark and J. B. Cain, Error-Correction Coding for Digital Communications,
New York NY, Plenum Press, 1981
VLSI Chip-set for Data Compression
Using the Rice Algorithm
J. Venbrux
N. Liu
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - A full custom VLSI implementation of a data compression encoder
and decoder which implements the lossless Rice data compression algorithm
is discussed in this paper. The encoder and decoder reside on single chips.
The data rates are projected to be 5 and 10 Mega-samples-per-second for the
decoder and encoder respectively.
1 Introduction
An encoder/decoder VLSI chip-set is designed for lossless image compression applicable
to NASA space requirements. With the ever increasing precision of flight instruments
that produce greater amounts of data comes a need for data compression. This data-
compression chipset uses the Rice Algorithm [1,2], and is able to perform lossless compres-
sion of pixel data at 5 Mega-samples-per-second. The pixel quantization levels may range
from 4 bits through 14 bits. Designed using full custom VLSI for a 1.6 micron CMOS
technology, the chips are low power and not larger than 8mm on a side. Rather than
design the chip-set for a single project or single application, flexibility was designed into
the chip-set to accommodate future imaging needs.
2 Algorithm Overview
This presentation assumes that the reader is familiar with the Rice algorithm [1,2]. The
Rice Algorithm codes differences between the present pixel value and a predictor value.
As a default, the previous pixel is used to predict the value of the present pixel. An
external predictor may also be supplied by the user to change prediction from
the X direction to either the Y or Z direction.
Taking the difference between two N bit pixels results in a difference that is N+1
bits in precision. A mapper adjusts this difference back to N bits without any loss in
information. To simplify the VLSI implementation, the mapper function used by the
encoder and decoder is slightly different from the function specified in the literature. The
mapped difference is called a "sigma" value. Because the compression method operates on
differences, an original reference pixel must precede the coded data. Without the original
reference, the decoder is not able to reconstruct the original pixels from all the coded
differences.
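The mapping step can be sketched as follows. This is the mapper form commonly given in the literature for Rice-type coders, not the slightly modified function used on the chips, which this paper does not specify. The function names and the assumption that pixels are unsigned N bit values are illustrative only.

```python
def map_difference(x, xp, n_bits):
    """Map the (N+1)-bit difference x - xp to an N-bit sigma value."""
    x_max = (1 << n_bits) - 1
    delta = x - xp
    theta = min(xp, x_max - xp)  # distance from predictor to nearest limit
    if 0 <= delta <= theta:
        return 2 * delta
    if -theta <= delta < 0:
        return 2 * (-delta) - 1
    return theta + abs(delta)


def unmap_sigma(sigma, xp, n_bits):
    """Invert map_difference and return the original pixel value."""
    x_max = (1 << n_bits) - 1
    theta = min(xp, x_max - xp)
    if sigma <= 2 * theta:
        if sigma % 2 == 0:
            delta = sigma // 2
        else:
            delta = -((sigma + 1) // 2)
    else:
        magnitude = sigma - theta
        # Only one sign is possible once |delta| exceeds theta.
        delta = magnitude if theta == xp else -magnitude
    return xp + delta
```

Small differences of either sign interleave into small sigma values, so a well predicted image yields sigma values concentrated near zero.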
A block of sigma values is encoded by multiple parallel encoders. The winning coder
is the coder that achieves the highest compression ratio. Before the compressed data
is transmitted, an ID is sent that specifies the winning coder for that particular block.
Compressed data follows the ID bits.
The next section describes some of the theory behind the Rice Algorithm. For a more
detailed discussion of the Rice Algorithm, the reader is asked to refer to Rice's work [1,2].
3 Algorithm Theory
The Rice Algorithm uses two techniques that allow it to efficiently compress image data
that varies over a wide range of entropy conditions. The first technique, that of using
multiple coders, was briefly mentioned in the Algorithm Overview. The second technique
is to use small block sizes. Both are discussed in more detail in the following paragraphs.
3.1 Multiple Coders
Many compression schemes adapt to varying entropy conditions by some form of estima-
tion. Based on past history, a certain codebook is either generated or chosen to compress
the present data. The Rice Algorithm uses a brute force approach to coding by performing
the equivalent of using multiple coders, each targeted for a different entropy level, and then
choosing the winner. Instead of estimating the winner, this implementation of the Rice Algorithm
chooses the actual winner. Even if the entropy changes radically within an image,
the method proposed by Rice will track the changes and result in efficient compression.
The encoder uses 8 different coders, each targeted for a particular entropy range. The
8 coders are formed from 2 coding techniques: default coding and fundamental sequence
coding. The default coding option is selected when all the other options fail. A default
block of data includes a 3 bit ID followed by sigma data. Because sigma words are the
same size as input pixel data, the default blocksize is equal to the number of input bits
plus 3 ID bits. Instead of expanding during high entropy conditions, as happens in most
compression schemes, the default condition limits expansion for any block to just a few ID
bits.
The next option type is the Fundamental Sequence (FS), which is a Huffman code with
a few special properties. The length of every codeword is equal to the magnitude of the
number to encode plus one. The unique prefix property of Huffman codes is attained in
the FS by having all zeros precede a single one in a given codeword. The decoder simply
counts the number of zeros until it finds a one. The decoded value (before any un-mapping)
is equal to the number of zeros. If the decoded value has no zeros preceding a one, then
the magnitude is equal to zero.
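A minimal sketch of the Fundamental Sequence as described above, using character strings of "0" and "1" to stand in for the bit stream. The function names are hypothetical.

```python
def fs_encode(values):
    """Encode each value m as m zeros followed by a single one."""
    return "".join("0" * m + "1" for m in values)


def fs_decode(bits, count):
    """Recover `count` values by counting zeros before each one."""
    values = []
    zeros = 0
    for bit in bits:
        if bit == "0":
            zeros += 1
        else:
            values.append(zeros)
            zeros = 0
            if len(values) == count:
                break
    return values
```

For example, fs_encode([12]) produces the 13 bit codeword 0000000000001, and a value of zero produces the single bit 1.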
The remaining options are variations of the Fundamental Sequence. Before the codeword
is encoded with the Fundamental Sequence, the least significant bit is removed ("pre-split")
from the word to encode. Removing the least significant bit is equivalent to considering
that bit as being random, and hence, not able to be compressed. The remaining bits
are then coded using the FS. The stripped off least significant bits are sent to the decoder
before the coded FS bits.
The encoder option set consists of the default condition, FS with no splitting, and six
pre-split bit options. The number of split bits ranges from 1 to 5, and then jumps to 7.
The jump was added to extend the efficient coding range up to approximately 10.5 bits
while maintaining an 8 option code set. An 8 option code set only requires a 3 bit ID;
a set with more than 8 options requires a 4 bit ID.
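The pre-split options can be sketched as below: the k least significant bits of every sigma value are sent first, followed by the FS code of the remaining upper bits. The bit order within each k bit field and the string representation of the stream are assumptions made for illustration, and the function names are hypothetical.

```python
def split_encode(sigmas, k):
    """Emit the k LSBs of every sigma, then the FS code of the upper bits."""
    mask = (1 << k) - 1
    lsbs = "".join(format(s & mask, "b").zfill(k) for s in sigmas) if k else ""
    fs = "".join("0" * (s >> k) + "1" for s in sigmas)
    return lsbs + fs


def split_decode(bits, k, count):
    """Invert split_encode for a block of `count` sigma values."""
    lows = [int(bits[i * k:(i + 1) * k], 2) if k else 0 for i in range(count)]
    pos = k * count
    sigmas = []
    for low in lows:
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        pos += 1  # step over the terminating one
        sigmas.append((zeros << k) | low)
    return sigmas
```

With k = 0 this reduces to the plain Fundamental Sequence.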
The decoder is capable of decoding an encoded data stream that may have 12 different
options. It will handle both the 8 option set, with a 3 bit ID, and a 12 option set with a
4 bit ID.
3.2 Small Block Size
All the encoding and decoding is done in blocks of pixel data. Small block sizes have
the advantage of allowing the encoding to quickly adapt to changes within an image.
Encoding small blocks of data also reduces the storage requirements within the encoder
and decoder. The disadvantage of encoding a small block of pixels at a time is that there
is a slight increase in overhead due to the ID bits that must precede each block.
The present encoder and decoder require that the block size be 16 pixels. If the scanline
does not end on a multiple of 16, the encoder will fill in the missing data with the last
valid pixel.
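The fill rule for a scanline that does not end on a block boundary can be sketched as follows (the function name is hypothetical):

```python
BLOCK_SIZE = 16  # pixels per block, as required by the present chips


def pad_scanline(pixels, block_size=BLOCK_SIZE):
    """Extend a scanline to a multiple of block_size with its last pixel."""
    if not pixels:
        return []
    shortfall = -len(pixels) % block_size
    return list(pixels) + [pixels[-1]] * shortfall
```

Repeating the last valid pixel makes the padded differences zero under the default nearest neighbor prediction, so the fill costs few coded bits.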
4 Encoder/Decoder Chip Set
4.1 System Overview
Figure 1 shows the system diagram for the encoder/decoder chipset. In addition to the
encoder and decoder, additional system blocks must include: a packetizer, error correction,
circuitry to unpack the data, and an input FIFO for the decoder.
The packetizer is needed to concatenate the variable length data blocks that have been
encoded. In the concatenating process, the "fill bits" that were added to make each block
end on a word boundary, must be stripped off. After the packet is constructed it should
be protected using error correction. As with most data compression schemes, an error in
the compressed data will generally result in incorrectly decoded packet data. However, the
decoder is designed to not allow errors to propagate between packets.
The decoder is able to handle packets of various types and sizes. If a packet is fixed
in length, the two typical scenarios for decoding are processing a truncated packet or
processing a packet that has fill bits at the end. A truncated packet occurs when the
expected compression ratio was not achieved and the packetizer had to truncate the encoder
output data. Most compressed image data, however, will fit within the packet size and
the packetizer will add fill bits to make the packet the required bit length. The decoder is
even able to handle compressed data that is not in any packet, but is a continuous stream
of data.

Figure 1: System Diagram
At the receiver, error correction is followed by an operation which removes packet
header bits from the packet. The decoder is expecting all packets to begin with an ID
followed by a reference pixel. Packet data must be continuous with no fill bits between
blocks. An external input FIFO stores data while the decoder de-compresses blocks of
data into pixels.
4.2 Chip Set Features
1. Variable Quantization Levels
• The chipset will encode and decode N bit wide pixel data with N in the range:
4<=N<=14.
2. External Prediction
• Nearest neighbor prediction is the default condition, where the previous pixel
acts as the predictor for the present pixel. The chip-set supports an externally
supplied predictor. An external memory chip could be used to store scanlines
or frames to allow for prediction in the Y and Z dimensions.
3. Inserting References
• By setting a few control lines that specify the number of blocks per reference, the
encoder will automatically insert a reference pixel. References may be inserted
every block or as infrequently as once every 128 blocks (2K pixels). The decoder
will correctly interpret reference data by setting the control lines to the same
number as was used on the encoder.
4. Entropy Range
• As discussed in the algorithm overview, the chip-set can be used to efficiently
encode and decode a wide range of entropy levels. The encoder's 8 option
coder set will efficiently code from 2 bits-per-pixel of entropy to approximately
10.5 bits-per-pixel of entropy. Conditions higher than 10.5 bits will require
the default condition. The decoder will decode up to 12 options and efficiently
decode compressed data from 2 bits of entropy up to 14 bits-per-pixel.
The option set used by the encoder and decoder is not designed to handle
entropy conditions less than 2 bits-per-pixel very efficiently. The reason for this
is that the lowest entropy option is simply a Huffman code, which requires a
minimum of one bit-per-symbol even if the entropy is close to zero.
5. Performance
• The encoder will encode a maximum of 10 Mega-samples-per-second. The decoder,
which requires more complex state machines, will decode at a maximum
of 5 Mega-samples-per-second. By using one encoder with two decoders and
some external logic, the chipset could operate at 10 Mega-samples-per-second.
Using two decoders with external logic, the maximum encoding and decoding
rate would be 140 Mega-bits-per-second for 14 bit pixels.
The architectures for both the encoder and decoder involve distributed processes
that operate on one block of data at a time. The processes, described in more
detail in encoder and decoder sections of this paper, are pipelined.
6. Full Custom VLSI CMOS
• The full custom VLSI chipset is designed in a 1.6 micron CMOS process. Full custom
design, as opposed to standard cell design, allows greater flexibility in algorithm
implementation, the potential of decreased die size and increased speed due to
reduced interconnect. By using a custom RAM design in the encoder, the logic
to generate the fundamental sequence was simplified. By designing a custom
shifter for the decoder, words were able to be pulled from the bit stream in one
clock pulse. Without the flexibility of custom design, complexity would have
increased and performance would have been impacted.
The encoder, with 66 pins on the IC, will have a die size of approximately 7mm
on a side. The decoder, with 74 pins, is pad limited and will require a die size
of approximately 8 mm on a side.
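The rate figures quoted under Performance follow from simple arithmetic, sketched below. The model assumes that the external logic lets parallel decoders share the encoder output evenly, which the paper implies but does not detail; the names are hypothetical.

```python
ENCODER_RATE = 10_000_000  # samples per second, from the text
DECODER_RATE = 5_000_000   # samples per second, from the text


def chipset_sample_rate(n_decoders):
    """Sustained rate of one encoder feeding n parallel decoders."""
    return min(ENCODER_RATE, n_decoders * DECODER_RATE)


def bit_rate(sample_rate, bits_per_pixel):
    """Uncompressed pixel bit rate for a given sample rate."""
    return sample_rate * bits_per_pixel
```

With two decoders the sample rate is 10 Mega-samples-per-second, and at 14 bits per pixel this corresponds to 140 Mega-bits-per-second.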
5 Encoder
The encoder is implemented on a single VLSI chip approximately 7mm on a side. It uses
a single clock which may run up to 10 MHz. It will process up to 14 bit pixel data at a
continuous rate up to 10 Mega-samples-per-second. The encoder logic assumes that all
data is continuous within blocks, but the chip may be placed in a wait state when there
are no blocks of data. The output is 16 bits parallel clocked on the 10MHz clock. The
compressed output data must be sent to a packetizer to concatenate the variable length
data blocks.




























Figure 2: Block Diagram for Rice Algorithm Encoder
Control is distributed around the chip to minimize interconnect between control centers
and the data paths or memories. Most of the state machines are being implemented using
a modified ring counter. There are five main control sections that control the input section,
the data output, and the three generate blocks.
The Mapper block takes the difference between the pixel value X, and the predictor
Xp. The difference creates N+1 bits, so the mapper adjusts the value back into the range
covered by the N bits, creating a "sigma" value. It is the sigma value that is encoded. The
sigma value is stored in a 32 word X 14 bit FIFO to be used when the block is encoded in
the generate sections.
The Count section calculates an exact count of the number of bits that will be required
to encode a block for each of the eight options. The evaluate section does a compare
between the eight counts and chooses the winning option. It is much more area efficient to
first find the winning option and then encode that option than it is to generate all 8 options
and then choose the winner. The generate sections are larger and more complicated than
the count or evaluate sections.
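The count-then-evaluate approach can be sketched in software as below. The option labels are hypothetical, and including the 3 ID bits in every count is an assumption that does not affect the choice of winner. The count for each split option follows from the FS codeword length (value plus one) plus k split bits per sample.

```python
SPLIT_OPTIONS = (0, 1, 2, 3, 4, 5, 7)  # FS with no splitting, then six pre-split options
ID_BITS = 3


def option_bit_counts(sigmas, n_bits):
    """Exact coded size in bits of one block under each of the 8 options."""
    counts = {"default": ID_BITS + n_bits * len(sigmas)}
    for k in SPLIT_OPTIONS:
        fs_bits = sum((s >> k) + 1 for s in sigmas)
        counts[f"k={k}"] = ID_BITS + fs_bits + k * len(sigmas)
    return counts


def choose_option(sigmas, n_bits):
    """Return the label and size of the smallest option."""
    counts = option_bit_counts(sigmas, n_bits)
    winner = min(counts, key=counts.get)
    return winner, counts[winner]
```

Only additions and a comparison are needed to find the winner, so just the winning option has to be generated afterwards.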
The FS block generates a fundamental sequence from a block of sigma values. The
coded sigma values are variable in length, with the codeword length in bits equal to the
magnitude of the sigma value plus one. For example, if the sigma value had a magnitude
of 12, the coded word would have a length of 13 bits. The codeword has 12 zeros followed
by a single one, 0000000000001. Because the FS codewords can radically vary in size,
from a single bit to over two-hundred bits in length, a unique RAM was designed that
allows single bit writes to the RAM with full 16 bit reads. The RAM avoided the need to
generate serial FS codewords. Generating the codewords in a serial manner would require
more control or memory than generating a complete word every clock pulse. Two such
RAMs are ping-ponged: one is being written to while the other is being read from. Each
RAM is cleared in a single clock cycle just before a write operation.
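A behavioral sketch of why a cleared RAM with single bit writes avoids serial codeword generation: because the zeros are already present, each sigma value needs only one write, which places the terminating one. The most significant bit first ordering and the function name are assumptions for illustration.

```python
WORD_BITS = 16


def fs_ram_write(sigmas, ram_words):
    """Model one FS RAM: clear it, then do one single-bit write per sigma."""
    ram = [0] * ram_words  # cleared just before the write phase
    pointer = 0
    for sigma in sigmas:
        pointer += sigma  # skip the zeros, which are already in the RAM
        word, offset = divmod(pointer, WORD_BITS)
        ram[word] |= 1 << (WORD_BITS - 1 - offset)  # write the single one
        pointer += 1
    return ram, pointer  # 16 bit words for reading, and total bits used
```

For a sigma value of 12 the first word reads back as 0000000000001000, the 13 bit codeword followed by three unused bits.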
The k generate data-path and control sections split off the k least-significant bits from
each sigma value and pack them into words that are stored in a FIFO. While the k generate
section is operating, the FS section is encoding the Fundamental Sequence on the remaining
N-k bits. After the block is encoded, the k bits are read from the FIFO, followed by the
FS bits that are read from one of the FS RAMS.
The default generate section packs sigma values into 16 bit words.
The output FIFO and control follows the generate sections. Sixteen bit words are read
from the generate sections' FIFOs or RAMS and are stored in the Output FIFO. When an
entire block is present in the output FIFO, the data words are sent from the chip, with no
handshaking.
A block of output data begins with a header word that contains the number of bits in
the block. With variable length data it is unlikely that the coded blocks will end on a
word boundary. The packetizer can read the count contained in the header, and strip off
any fill bits that trail actual data.
Although the chip could be designed to have an on board packetizer, it would add more
complexity and area to the design. Since packets can vary widely in size and structure
from one mission to the next, requiring the use of a separate packetizer results in a clean
encoder design with maximum flexibility as to packet type.
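How an external packetizer might use the header count can be sketched as follows. The representation of a block as a (bit count, list of 16 bit words) pair, the assumption that the count covers only the data bits, and the function name are all illustrative and not taken from the chip specification.

```python
def packetize(blocks):
    """Concatenate variable length blocks, dropping each block's fill bits.

    blocks: iterable of (bit_count, words), where words are 16 bit
    integers read from the encoder after the header word.
    """
    stream = []
    for bit_count, words in blocks:
        bits = "".join(format(word, "016b") for word in words)
        stream.append(bits[:bit_count])  # keep data, strip trailing fill
    return "".join(stream)
```

The result is a continuous bit stream with no fill bits between blocks, which is the form the decoder expects.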
6 Decoder
The decoder is on a single VLSI chip with approximately 14,000 gates. The chip has
one clock which runs at 10MHz. The effective decoding rate is slightly greater than 5
Mega-samples-per-second.
The decoder consists of the following sections, as shown in Figure 3.











Global control is provided by the Time-Frame controller. Because the data blocks are
variable in length, the control would be complex if the sequencing depended on the widely
varying block size. To reduce state machine logic and complexity, each of the decoding
tasks was assigned a fixed number of clock cycles, independent of block length.
Local control sections control major data paths and output logic within the chip. The
control sections include state machines that were designed using a special binary tree
structure (BTS) developed by Whitaker and Maki [3,4]. The BTS structure has many
advantages over random logic designs. First, each bit of the state machine is identical
in structure, greatly simplifying layout tasks. The individual bits of the state machine
are programmed with supply connections, allowing design changes to be implemented
with minimal effort. Secondly, with the identical cell structure, only one cell must be
characterized to determine performance characteristics. Finally, the BTS network, being a
very structured form of pass logic design, can be laid out in a very compact manner with
high speed performance.
The databus interface acquires 8 bits of compressed data from an external FIFO using
handshaking, and concatenates two bytes to form a 16 bit word. The input section of the
decoder operates on 16 bit words instead of 8 bit bytes. Even though data is input a byte
at a time, the decoder must parse and decode the data stream in a serial manner. The
beginning of data words may occur anywhere within the 16 bit input word. Rather than
using a serial shift register with some control logic, which would slow down decoding, a
special moving-window-shifter was designed. It performs a special purpose serial to parallel
conversion. Given the start bit position of a new word, the next 16 bits of the word are
provided within one clock cycle. The moving-window-shifter was implemented as a special
full custom VLSI module.
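The function of the moving-window-shifter can be modeled in software as below. This is a behavioral sketch only, assuming most significant bit first ordering and zero fill past the end of the buffer; it says nothing about the circuit itself, and the function name is hypothetical.

```python
def window16(words, start_bit):
    """Return the 16 bits that begin at start_bit in a list of 16 bit words."""
    index, offset = divmod(start_bit, 16)
    high = words[index]
    low = words[index + 1] if index + 1 < len(words) else 0
    combined = (high << 16) | low  # 32 bit view of two adjacent words
    return (combined >> (16 - offset)) & 0xFFFF
```

For example, with the words ABCD and 1234 (hexadecimal) and a start position of 4, the window returns BCD1.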
A data word from the moving-window-shifter may be decoded as default data, FS data,

























Figure 3: Data Compression Decoder Architecture
FIFO with no decoding. If the data contains split bits, the decode data-path simply packs
them into the k split FIFO. If the data is an FS sequence, it is first decoded, then stored
in the FS FIFO.
The unsplitter section takes data from, say, the FS FIFO and concatenates the corresponding
k split bits to form a sigma word. The k split bits are packed into 16 bit words
with k ranging from 0 to 10 bits. To pull out the desired k split bits from a word, something
very similar to the moving-window-shifter was used.
The unmapper converts sigma data into a pixel value. The unmapper performs the re-
verse of the encoder's mapper function. Using 2's complement arithmetic and performing
some algebraic manipulations produced a form for the unmapper that simplified imple-
mentation. The output of the unmapper is pixel data. Since the algorithm is lossless, the
decoded pixels are identical to the pixel values encoded at the source.
7 Summary
A data compression encoder/decoder full custom VLSI chip-set that implements the Rice
algorithm has been designed. Image data is compressed at a rate of 10 Mega-samples-per-
second. The decoder is designed to operate at 5 Mega-samples-per-second. By paralleling
two decoders and adding extra logic, one encoder could compress data for two decoders for
a maximum rate of 10 Mega-samples-per-second. Pixel size can range from 4 bits through
14 bits and external prediction is supported. Both designs employ pipelined architectures,
have a single clock, and take advantage of full custom VLSI to reduce die size and maximize
processing speed. The chips are in the layout phase of the design cycle, with parts expected
in the Fall of 1990. Using a 1.6 micron CMOS process, the die size for the encoder should be
approximately 7mm on a side, and the decoder die should be approximately 8mm on a
side.
References
[1] R. F. Rice, "Some Practical Universal Noiseless Coding Techniques", JPL Publication
79-22, March, 1979.
[2] R. F. Rice and Jun-Ji Lee, "Some Practical Universal Noiseless Coding Techniques,
Part II", JPL Publication 83-17, March 1983.
[3] S. Whitaker and G. Maki "A Programmable Architecture for CMOS Sequential Cir-
cuits", NASA Symposium on VLSI Design, January 1990.
[4] G. Peterson and G. Maki, "Binary Tree Structured Logic Circuits: Design and Fault
Detection," Proceedings of IEEE International Conference on Computer Design: VLSI




Of A Stirling Cycle Cooler
J. Feeley, P. Feeley and G. Langford




This short paper describes work in progress on the conceptual design of a control sys-
tem for a cryogenic cooler intended for use aboard spacecraft. The cooler will produce 5
watts of cooling at 65°K and will be used to support experiments associated with earth
observation, atmospheric measurements, infrared, x-ray, and gamma-ray astronomy, and
magnetic field characterization. The cooler has been designed and constructed for the
National Aeronautics and Space Administration (NASA) Goddard Space Flight Center by Philips
Laboratories and is described in detail in Reference 1. The cooler has a number of unique
design features intended to enhance long life and maintenance free operation in space,
including use of the high efficiency Stirling thermodynamic refrigeration cycle, linear mag-
netic motors, clearance-seals, and magnetic bearings. The proposed control system design
is based on optimal control theory and is targeted for custom integrated circuit implemen-
tation. The resulting control system will meet mission requirements of efficiency, reliability,
optimal thermodynamic, electrical, and mechanical performance, freedom from operator
intervention, light weight, and small size.
2 System Description
The Philips cryogenic refrigerator consists of three sections: the expander, the compressor,
and the counter-balance. The moving part in the expander section is the displacer. It is
supported by magnetic bearings at each of its ends and is driven by a linear motor to
produce axial rectilinear motion with little, or no, rotation. The moving part in the
compressor section, the piston, is also supported by magnetic bearings at its ends, and
is similarly driven by its own linear motor. The moving part in the counter balance
section is the counter balance. Like the displacer and the piston it is also supported by
magnetic bearings and driven by a linear motor. The axial positions of the displacer, piston,
and counter balance are measured by linear variable differential transformers (LVDT's).
The radial positions of each moving part are measured by optical sensors located in each
magnetic bearing. In addition, the pressures in the compression volume and the buffer
volume behind the piston are also measured, as is the temperature at the "cold finger" end
of the expansion volume. The acceleration of the refrigerator case is also measured to aid
in controlling the motion of the counter balance. Cooling is produced in the cold finger end
of the expansion section of the refrigerator by carefully controlled motion of the displacer
and the piston. Motion of the counter balance is controlled to produce a force equal in
magnitude and opposite in direction to the combined force produced by the motion of the
displacer and the piston so that there is no net motion of the refrigerator frame.
The linear electromagnetic motors that drive the displacer, the piston, and the counter
balance are similar in design. Permanent magnets are mounted on the moving parts to
create a unidirectional constant amplitude magnetic field. A coil wound circumferentially
on the fixed part is energized by an alternating voltage source that produces a bidirectional
magnetic field of variable amplitude. The interaction of the two magnetic fields produces
the force used to accelerate the moving part. The motors are carefully designed so that the
force exerted on the moving part is almost entirely axial and is very nearly proportional
to the current in the coil.
3 Mathematical Models
A mathematical model of the linear motor subsystems of the refrigerator is developed in
this section. Interaction between the linear motors and the magnetic bearings is negligible,
therefore magnetic bearing modeling and control will be considered separately later. The
motor model will be used for two purposes. First, a simplified and linearized version
of the model will be used in a state variable control system design procedure to design
the controller. Second, the complete nonlinear model equations will be used to develop
a computer simulation of the refrigerator. This computer simulation will then be used
to assess the integrated performance of the refrigerator and its controller under various
transient and steady-state operating conditions. The intended uses of the model help
determine the level of detail it should contain.
A useful mathematical model of a linear motor can be obtained by applying Kirchhoff's
voltage law to a series circuit containing the controlling voltage source, the winding resis-
tance, the winding inductance, and the motor back emf generator. The electromagnetic
force exerted on the moving part is assumed proportional to the motor current. Because of
the similarity of the displacer, piston, and counter balance motors the same mathematical
model is used in each case. Appropriate parameter values for each motor are available
from design calculations and test data.
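The series circuit described above can be written out as follows. This is a sketch of the standard form such a model takes, not the authors' equations; the symbols are assumptions introduced here: v is the controlling source voltage, i the motor current, R and L the winding resistance and inductance, x the axial position of the moving part, K_b a back emf constant, and K_f a force constant.

```latex
\[
v(t) = R\,i(t) + L\,\frac{di(t)}{dt} + K_b\,\dot{x}(t),
\qquad
F(t) = K_f\,i(t)
\]
```

The first equation is Kirchhoff's voltage law around the loop, with the back emf taken as proportional to the velocity of the moving part; the second states the assumed proportionality between electromagnetic force and motor current.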
A useful set of equations describing the motion of the displacer, piston, and counter
balance may be obtained by applying Newton's second law to each moving mass. The forces
acting on the displacer and piston include the applied electromagnetic forces imparted
by the linear motors, the forces due to the differential pressures between the expansion,
compression, and buffer volumes, drag forces exerted by gas flow, and friction forces due
to displacer and piston motion. Expressions for the pressures in each volume are obtained
by applying the principle of conservation of mass and the ideal gas law. Counter balance
forces are similar but do not involve differential pressures or gas flow effects.
4 Control System Design Approach
A multi-input multi-output digital control system is being designed using optimal linear-
quadratic-gaussian theory. The mathematical model described above consists of fourteen
nonlinear differential equations and involves fourteen state variables, three control vari-
ables, and seven output variables. Two alternative control system designs are currently
being considered, based on minimizing two different performance indices. The first is a
tracking controller where the position error between actual and desired displacer and piston
positions is minimized. The second controller will maximize overall efficiency by maximizing
thermodynamic output power and minimizing control input power. A Kalman filter
will also be designed to estimate unmeasured states both for use in the controller and for
performance monitoring. Controller and estimator designs are being carried out using the
PC-MATLAB computer aided design package.
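For orientation, a discrete time linear-quadratic-gaussian design generally minimizes a performance index of the following form. This is the generic textbook statement, given here as an assumption about the structure of the problem rather than the specific indices chosen by the authors; x_k denotes the state vector (fourteen states in the linearized model), u_k the control vector (three controls), \hat{x}_k the Kalman filter state estimate, and Q, R and K are weighting and gain matrices introduced for illustration.

```latex
\[
J = E\left\{ \sum_{k} \left( x_k^{T} Q\, x_k + u_k^{T} R\, u_k \right) \right\},
\qquad
u_k = -K\,\hat{x}_k
\]
```

In the tracking design the state term would weight the position errors, while in the efficiency design the weights would reflect the trade between thermodynamic output power and control input power.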
5 Computer Simulation
The refrigerator and control system models are being simulated using the Advanced Con-
tinuous Simulation Language (ACSL). ACSL permits simulation of the complete system
consisting of the discrete time digital controller interfaced with the continuous time non-
linear refrigerator model. The simulation will facilitate controller comparisons, sampling
rate and word length investigations, and transient and steady state performance studies.
6 Results to Date
Efforts to date have focused on understanding refrigerator operation and developing a dy-
namic mathematical model appropriate for control system design. The model described
above is new in that it introduces pressures as state variables through conservation of
mass and ideal gas law considerations. This clarifies the relationship between the thermodynamic
and dynamic aspects of refrigerator operation and should lead to an improved
control system design. Efforts are underway to validate the model using test data given
in Reference 1. Efforts have also been initiated on the optimal multi-variable control and
estimation system design. This integrated design approach is expected to lead to a control
system that is superior in performance and simpler in implementation than the control sys-
tems described in Reference 1 or a more conventional microprocessor based digital control
system.
References
[1] F. Stolfi, et al., "Design and Fabrication of a Long-Life Stirling Cycle Cooler for
Space Application," Philips Laboratories, March, 1983.





Gould Inc. Semiconductor Division
2300 Buckskin Rd., Pocatello, ID 83201
Abstract - A software system is described which reduces the time required to
design monolithic switched capacitor filters. The system combines several soft-
ware tools into an integrated flow. Switched capacitor technology and alter-
native technologies are discussed. Design time using the software system is
compared to typical design time without the system.
1 Introduction
CMOS switched capacitor filters are a wise choice for precisely filtering electronic signals
while keeping power consumption and board space to a minimum. Monolithic switched
capacitor filters with custom transfer functions are especially attractive because they can
be integrated with other analog and digital functions on the same chip. Switched capacitor
filters have historically been full custom designs that took a long time to complete and often
did not work properly on first silicon.
A software package called Filgen has been designed, which shortens the time to mar-
ket, and increases the chances of first time functionality of CMOS chips with switched
capacitor filters. Filgen achieves shorter design spans, and higher first time functionality
by integrating filter design software and utilizing analog circuit synthesis software. Each
software tool in the filter design flow creates a file which is the input for the next tool in
the flow. This approach to design automation speeds the design process for most switched
capacitor filters, while still allowing each tool to be used individually for special cells which
do not fit the standard filter design flow.
2 Monolithic Filter Technology
Building an electronic filter on a silicon die presents some unique challenges. First, passive
RLC filters are not practical, because high quality inductors can not be built on silicon.
Active RC filters can be built, but tolerances on resistors and capacitors cause time con-
stants to vary over a five to one range when temperature and process variations are taken
into account. Loop tuning schemes can be used to control the resistance or the transcon-
ductance of active devices to maintain well controlled time constants in active RC filters.














Figure 1: Switched Capacitor Equivalent Resistor
These filters hold much promise for high frequency applications, and do not require an-
tialiasing, and smoothing filters. Digital filters [6] can also be realized on a silicon die, and
have the advantage of realizing several transfer functions without changing the layout. As
with any digital circuit, digital filters have excellent rejection of process, temperature, and
power supply effects. Digital filters do, however, require A to D and D to A converters,
and analog antialiasing and smoothing filters. In most cases digital filters are larger than
their analog counterparts when realized in silicon.
Figure 2: Single Pole Active RC Filter
Switched capacitor filters replace the resistors in an active RC filter with switches and
capacitors. An approximation of a resistor can be built with four switches and a capacitor
as shown in Figure 1. The switches are run from non-overlapping clocks which do not
allow phase 1 switches to be closed when phase 2 switches are closed. When the clock is
running, a packet of charge proportional to the voltage across the capacitor moves through
the capacitor every clock cycle. These packets of charge approximate a current through
a resistor when the clock rate is constant. Such capacitors and switches can replace all






Figure 3: Single Pole Switched Capacitor Filter
operation reveals that moving charges every clock cycle is a sampled data system, which
can be modeled by a z transfer function [1,6]. Carefully laid out silicon capacitors can
be matched to each other to about 0.1%. Since the equivalent resistance is determined
by capacitance, very accurate time constants are formed, from which very accurate pole
and zero frequencies and Q's can be realized. Switched capacitor filters do not require
tuning schemes to maintain accurate transfer functions, but do require antialiasing and
smoothing filters. For a more complete tutorial on the operation of switched capacitor
filters see [1]. The frequency of operation of switched capacitor filters is limited by opamp
gain bandwidth product, and slew rate. Some papers [2] have reported switched capacitor
filters operating well into the megahertz range. The techniques used by Filgen limit the filters to
signals of a few hundred kilohertz. Dynamic range in switched capacitor filters is limited
by the signal swing of the opamps, opamp noise, and switch noise.
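The charge-packet description above implies an average current of C·V·f through the switched capacitor, so the emulated resistance is 1/(f·C). The following is a minimal sketch of that relation, not part of Filgen; the function names and the example component values are hypothetical. It also illustrates why the time constant depends only on a capacitor ratio and the clock rate.

```python
def equivalent_resistance(c_farads, f_clk_hz):
    """Average resistance emulated by a capacitor switched at f_clk: R = 1 / (f_clk * C)."""
    return 1.0 / (f_clk_hz * c_farads)


def time_constant(c_switched, c_integrating, f_clk_hz):
    """Time constant when the switched capacitor stands in for the resistor of an RC stage."""
    return equivalent_resistance(c_switched, f_clk_hz) * c_integrating


# Hypothetical values: 1 pF switched at 100 kHz, 10 pF integrating capacitor.
r_eq = equivalent_resistance(1e-12, 100e3)
tau = time_constant(1e-12, 10e-12, 100e3)
print(f"R_eq = {r_eq / 1e6:.1f} Mohm, tau = {tau * 1e6:.1f} us")

# Scaling both capacitors by the same process factor leaves tau unchanged,
# which is why capacitor matching, not absolute value, sets the accuracy.
tau_scaled = time_constant(1.2e-12, 12e-12, 100e3)
```

Because only the ratio of the two capacitors enters the time constant, an absolute capacitance error common to both cancels out.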
3 Filter Design System
The design flow (Figure 4) starts with Scholar which is a filter synthesis program. Scholar
calculates capacitor sizes, opamp specifications, and switch sizes from filter specifications
entered by the user. Scholar writes two files when all the information has been entered. The

















Figure 4: Filgen Design Flow
the second file defines the opamp specification for Optimist. Optimist is an opamp synthesis
program which calculates transistor sizes from specifications using an optimization routine.
The output of Optimist is a Mentor command file which modifies the transistor sizes on the
schematic created by Scholar. Netlists can be created from the schematic to run simulations
using Switcap, Swap, or Scar. A Swap noise simulation is required by Scholar to size the
unit capacitor. A netlist is also created for Score, our procedural layout program. Once
the layout is completed, a layout versus schematic program is run.
4 Scholar (Filter Synthesis)
Scholar first prompts the user for filter specifications such as pass band and stop band
edges, pass band ripple, stop band attenuation, and clock frequency. Scholar then creates
a low pass s domain transfer function using standard filter approximations (Butterworth,
Bessel, Chebyshev, inverse Chebyshev, and elliptic) which meets the band edge specifica-
tions. Scholar then transforms the transfer function to a frequency scaled low pass, high
pass, band pass, or band reject filter. Once the transfer function has been determined, frequencies
are pre-warped to accommodate the bilinear z transform [6], and capacitor ratios
are calculated for a cascaded biquad filter. The low Q poles are placed first in the cascade
with the high Q's last. High Q poles are paired with the closest zeros as shown in Figure 5.
These methods tend to maximize dynamic range. A more thorough approach is discussed
in [7]. Scholar uses six circuit types which may be cascaded to realize any IIR transfer
function. The circuits include a single pole single zero stage, a high Q biquad stage, and
low Q biquad stage. Three similar stages are used to realize IIR transfer functions with
zeros in the right half s plane to make all pass group delay equalizers. The topologies
used are discussed in [8], and are shown to have low sensitivities. The second order biquad
sections have eight capacitors, but only five coefficients in the transfer function. Two of
the extra degrees of freedom are used to scale the gain to the outputs of the opamps to
maximize dynamic range. The third degree of freedom is chosen to keep capacitor ratios
as small as possible.
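The pre-warping step mentioned above can be illustrated with the standard bilinear transform relation, in which an analog frequency f_a maps to a discrete frequency f_d through f_a = tan(π·f_d·T)/(π·T). This is a sketch of the textbook formula only; Scholar's internal implementation is not described in the text, and the band edge and clock values below are hypothetical.

```python
import math


def prewarp(f_hz, f_clk_hz):
    """Analog frequency (Hz) that the bilinear z transform maps onto f_hz."""
    t = 1.0 / f_clk_hz
    return math.tan(math.pi * f_hz * t) / (math.pi * t)


f_edge = 10e3     # desired pass band edge (hypothetical)
f_clk = 100e3     # clock frequency (hypothetical)
f_design = prewarp(f_edge, f_clk)
print(f"design the s domain prototype for {f_design:.1f} Hz")
```

The correction grows as the band edge approaches half the clock frequency and becomes negligible when the clock is much faster than the signal.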
Swap is used to simulate the noise of the switched capacitor filter with a nominal unit
capacitor size. Scholar assumes that folded cascode opamps will be used, and therefore
most of the noise comes from the switches [9]. Scholar resizes the unit capacitor based
on the simulation results. Once the unit capacitor size is determined, coupler sizes are
calculated to ensure adequate settling time, without excessive charge injection. Opamp
gain bandwidth [10], slew rate, and capacitive load are calculated, and passed to Optimist,
the opamp synthesizer.
All capacitor ratios, unit capacitor size, coupler sizes, and biquad types are written
to a file which creates a schematic on a Mentor Graphics workstation. The schematic is
used to create netlists for additional simulations, and is also used to create a command
file for the automatic layout. The schematic can be modified if the user does not like the
assumptions made by Scholar.
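The pairing and ordering rule described earlier in this section (high Q poles take the closest zeros, and the cascade runs from low Q to high Q) can be sketched as a simple greedy procedure. This is an illustration under assumptions, not Scholar's actual algorithm: the function names are hypothetical, and each pole or zero is represented by its upper half plane member only.

```python
def q_of(pole):
    """Q of a complex s plane pole: |p| / (2 * |Re p|)."""
    return abs(pole) / (2.0 * abs(pole.real))


def pair_and_order(poles, zeros):
    """Give the highest Q poles first pick of the closest zero, then order low Q first."""
    remaining = list(zeros)
    sections = []
    for p in sorted(poles, key=q_of, reverse=True):
        z = min(remaining, key=lambda zz: abs(zz - p)) if remaining else None
        if z is not None:
            remaining.remove(z)
        sections.append((p, z))
    # Low Q sections go first in the cascade, high Q sections last.
    return sorted(sections, key=lambda s: q_of(s[0]))


# Hypothetical normalized poles and zeros.
cascade = pair_and_order([-0.5 + 1j, -0.1 + 1.2j], [1.5j, 3j])
for pole, zero in cascade:
    print(f"pole {pole} (Q = {q_of(pole):.2f}) paired with zero {zero}")
```

A greedy pairing is only a heuristic; the more thorough approach cited in [7] may produce a different assignment.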
Figure 5: Pole Zero Pairing, and Biquad Order




Figure 6: Biquad Layout with Different Aspect Ratios
5 Optimist (Opamp Synthesis)
Optimist is an optimization program with a special shell around it for opamp design. Opti-
mist contains Gould AMI's proprietary Spice transistor model which it uses to find the DC
operating point of the opamp. Once the DC operating point is found, a small signal model
is used to calculate AC performance. Optimist can be used in several modes of operation.
First, Optimist can search for a set of transistor and capacitor sizes (parameters) which
meet all specifications (constraints) at one set of process, temperature, load, and power
supply conditions. Once device sizes are chosen, Optimist can show what the performance
is at four sets of process, temperature, load, and power supply conditions. When performance
is not sufficient at all operating conditions, device sizes may be changed manually,
or Optimist can be asked to search for a new set of device sizes given a new set of operating
conditions, and/or specifications. Finally, Optimist can choose device sizes to maximize
desirable parameters such as gain bandwidth product, and common mode rejection ratio
while minimizing objectionable parameters such as silicon area, and power supply current.
Optimist writes a file which modifies the opamp transistor sizes in the schematic created
by Scholar.
6 Score (Procedural Layout)
When the engineer is satisfied with the filter simulations, device sizes are passed from
the schematic to Score, our automatic layout tool. The filter is laid out procedurally one
biquad at a time to control parasitic capacitors. Each biquad is partitioned into switches,
capacitors, and opamps. The three sections are assembled along a common centerline
as shown in Figure 6. In some cases, the opamp will be wider than the switches, and
capacitors, while in other cases the capacitors will be wider. Since the layout is automatic,
the dividing line between capacitors, and opamps and the total cell height can be tweaked
until sometimes the opamps are wider, and sometimes the capacitors are wider. The
biquads are then placed in the order which minimizes total width by allowing cells to abut
on a zig zagged line as shown in Figure 7. The interconnect maintains the correct electrical
order of the biquads. Using this type of layout scheme, the capacitors are never very far
from their switches and opamps, which keeps parasitic capacitors small.
[Figure 7; surviving labels: Biquad 2, Biquad 1]

Task                With Filgen   Without Filgen
Design Cap Ratios   1 day         1 day
Create Netlists     2.5 hours     1 week
Noise Analysis      12 hours      ECO
Opamp Design        8 weeks       20 weeks
Layout              2 hours       2 weeks
Layout Checking     2 days        5 days
TOTAL               9 weeks       24 weeks
Table 1: Estimated Time Savings (6 biquads)
7 Conclusion and Results
Typical runtimes for filter designs are reduced as shown in Table 1. This system is functional,
and significantly reduces design time, while increasing the probability of first time success
for switched capacitor filters.
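The TOTAL row of Table 1 can be turned into a relative saving with a line of arithmetic. The sketch below assumes the smaller total (9 weeks) is the design time with the system and the larger (24 weeks) is the time without it; the function name is hypothetical.

```python
def reduction(with_tool_weeks, without_tool_weeks):
    """Return (fraction of time saved, speed-up factor)."""
    saved = 1.0 - with_tool_weeks / without_tool_weeks
    speedup = without_tool_weeks / with_tool_weeks
    return saved, speedup


saved, speedup = reduction(9, 24)   # TOTAL row of Table 1
print(f"{saved:.1%} less time, {speedup:.2f}x faster")
```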
References
[1] R. Gregorian, K. W. Martin, G. C. Temes, "Switched-Capacitor Circuit Design,"
Proc. IEEE, vol. 71, no. 8, pp. 941-964, Aug. 1983.
[2] Bang-Sup Song, "A 10.7-MHz Switched-Capacitor Bandpass Filter," Proc. CICC
1988, pp. 12.3.1-12.3.4
[3] Y. Tsividis, M. Banu, and J. Khoury, "Continuous-time MOSFET-C filters in VLSI,"
IEEE J. Solid-State Circuits, vol. 21, pp. 15-30, Feb. 1986.
[4] F. Krummenacher, N. Joehl, "A 4-MHz CMOS Continuous-Time Filter with On-Chip
Automatic Tuning," IEEE J. Solid-State Circuits, vol. 23, no. 3, pp. 750-758, June
1988.
[5] C. S. Park, R. Schaumann, "Design of a 4-MHz Analog Integrated CMOS Transcon-
ductance C Bandpass Filter," IEEE J. Solid-State Circuits, vol. 23, no. 4, pp. 987-996,
Aug. 1988.
[6] A. V. Oppenheim, and R. W. Schafer, Digital Signal Processing. Englewood Cliffs,
New Jersey: Prentice-Hall, 1975.
[7] A. S. Sedra, and P. O. Brackett, Filter Theory and Design: Active and Passive.
Champaign, Illinois: Matrix Publishers, 1978.
[8] K. R. Laker, A. Ganesan, and P. E. Fleischer, "Design and Implementation of Cas-
caded Switched-Capacitor Delay Equalizers," IEEE Trans. Circuits and Systems, vol.
32, no. 7, pp. 700-711, July 1985.
[9] R. Castello, and P. R. Gray, "Performance Limitations in Switched-Capacitor Filters,"
IEEE Trans. Circuits and Systems, vol. 32, no. 9, pp. 865-876, Sept. 1985.
[10] K. Martin, and A. S. Sedra, "Effects of the Op Amp Finite Gain and Bandwidth on
the Performance of Switched-Capacitor Filters," IEEE Trans. Circuits and Systems,
vol. 28, no. 8, pp. 822-829, Aug. 1981.
	 N94- 71082
Integrated CMOS RF Amplifier
C. Charity
HP Disc Mechanism Division
P.O. Box 39
Boise, Idaho 83707
S. Whitaker, J. Purviance and M. Canaris
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - This paper reports an integrated 2.0µm CMOS RF amplifier de-
signed for amplification in the 420-450 MHz frequency band. Design techniques
are shown for the test amplifier configuration. Problems of decreased ampli-
fier bandwidth, gain element instability and low Q values for the inductors
were encountered. Techniques used to overcome these problems are discussed.
Layouts of the various elements are described and a summary of the simulation
results is included. Test circuits have been submitted to MOSIS for
fabrication.
1 Introduction
Signals carrying information are attenuated as they propagate through communication
channels. Amplification is needed to restore the attenuated signals. Radio Frequency
(RF) amplifiers are usually manufactured with discrete components. Integration of the
amplifier can improve manufacturability [1]. Many amplifier designs are now implemented
in integrated circuits (ICs) [2,3,4].
Bipolar has been the dominant technology for integrating amplifier circuits. In the last
decade, great progress has been made in MOS fabrication techniques. As a result, MOS
circuits have expanded from just memory and digital logic applications to include many
analog circuits. New developments in communication technology require functional blocks
to consist of both analog and digital sections. Since many digital circuits are integrated in
MOS, there is a strong motivation to develop analog circuits in MOS [5].
The viability of developing an integrated CMOS RF amplifier is explored with this
test chip project. A single gate MOSFET RF amplifier was chosen as the test circuit
[6]. Section 2 discusses and analyzes the circuit. Design techniques are shown for this
particular amplifier configuration. In the process of the design, a few problems were
encountered. Techniques were modified to overcome these problems. Section 3 presents
the layout strategies for the components. Section 4 gives a summary of the simulation
results.
Figure 1: Y-equivalent circuit with source and load.
2 Circuit Design
CMOS RF amplifiers can be designed by using a four-terminal network model for the
amplifier. Short circuit admittance parameters are used to describe the network. An
admittance parameter is a complex number with the form y = g + jb where g is the real
(conductive) part and b is the imaginary (susceptive) part. The y equivalent circuit is
shown in Fig. 1 and is described by the following equations.
$$i_1 = y_{11} v_1 + y_{12} v_2 = -v_1 Y_S$$ (1)
$$i_2 = y_{21} v_1 + y_{22} v_2 = -v_2 Y_L$$ (2)
The short circuit admittance parameters are $y_{11}$ (input admittance), $y_{21}$ (forward transfer
admittance), $y_{12}$ (reverse transfer admittance), and $y_{22}$ (output admittance). $Y_S$ and $Y_L$
are the source and load admittances. From Eqs. 1 and 2 with the output shorted ($v_2 = 0$),
$y_{11} = i_1/v_1$ and $y_{21} = i_2/v_1$. When the input is shorted ($v_1 = 0$), $y_{12} = i_1/v_2$ and
$y_{22} = i_2/v_2$.
The admittance of a circuit varies with frequency. In tuned circuits, the imaginary
component disappears at the resonance frequency [7]. The width of the resonant peak
where the magnitude differs from the resonant magnitude by less than 3 dB is called the
bandwidth. A quantity Q (quality factor) is the ratio of the resonant frequency to the
bandwidth, which gives a measure of sharpness of the resonance.
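As a quick numerical illustration of this definition, the sketch below evaluates Q for the 420-450 MHz band named in the abstract, taking the arithmetic centre of the band as the resonant frequency. The function name is hypothetical.

```python
def quality_factor(f_res_hz, bandwidth_hz):
    """Q as defined in the text: resonant frequency divided by the 3 dB bandwidth."""
    return f_res_hz / bandwidth_hz


f_low, f_high = 420e6, 450e6          # band edges from the abstract
f_res = (f_low + f_high) / 2.0        # arithmetic centre of the band
q_band = quality_factor(f_res, f_high - f_low)
print(f"f_res = {f_res / 1e6:.0f} MHz, Q = {q_band}")
```

A wider bandwidth at the same centre frequency gives a proportionally lower Q.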
2.1 Circuit Analysis
The RF amplifier chosen for this project is shown in Fig. 2. Tuning the amplifier to the
desired resonant frequency is accomplished by the input and output passive elements: C1
through C4, L1, and L2. Amplification is provided by the active device, Q1. Biasing of the
amplifier is accomplished by a voltage divider consisting of R1 and R2. The two resistors
are isolated from the a.c. input by L1 and the bypass capacitor, BPC. An RF choke
inductor and a bypass capacitor (1 µF) are used to properly isolate the power supplies.
















Figure 2: RF Amplifier Schematic Diagram
	









(a) Equivalent input circuit






(b) Equivalent output circuit
Figure 3: Equivalent input and output circuit of CMOS RF amplifier.
The inside of the "black box" in Fig. 1 shows up as $R_{in}$, $C_{in}$, $R_{out}$, and $C_{out}$ in Fig. 3.
The output equivalent circuit is essentially the same as the input equivalent circuit. The
equations developed next will apply to both circuits.
The actual source impedance is transformed to the optimum source impedance for
the MOSFET by load matching. Load matching produces the desired signal transmission
without signal loss. An impedance has the form $Z = R + jX$ where R is the resistance and X is the
reactance. The optimum source resistance for the MOSFET is
$$R_S = \frac{1}{G_S}$$ (3)
The capacitor C1 is in series with the actual source resistance $R_S'$. To find the series
capacitive reactance for the optimum source impedance, the series RC network must be
converted to the parallel RC network. Table 3-5.1 in [7] defines the Q's for parallel and
series RC networks.
$$Q = X_s / R_s, \qquad X_s = 1/(\omega C_s)$$
$$Q = R_p / X_p, \qquad X_p = 1/(\omega C_p)$$
where s designates series and p designates parallel. The Thevenin equivalent for the series
RC network is
$$Z_{th} = R_s + \frac{1}{j \omega C_s}$$ (4)
The Thevenin equivalent for the parallel RC network is:
$$Z_{th} = \frac{R_p}{j \omega C_p R_p + 1}$$ (5)
By combining the definitions with Eqs. 4 and 5 and then equating the two Thevenin
equivalent equations, the following formula results:
$$R_p = R_s \left[ 1 + \left( \frac{X_s}{R_s} \right)^2 \right]$$ (6)




The optimum source resistance, $R_S$, is the parallel resistance in Eq. 6 and the actual source
resistance, $R_S'$, is the series resistance. The numerical value of C1 can be obtained from
the resonant frequency and $X_{C1}$. From Fig. 3a, the total resistance of the input resonance
circuit is $R_{in}$ in parallel with $R_S$ or
$$R_T = \frac{1}{G_{in} + G_S}$$ (8)
As indicated above, C1 is the passive element involved in matching the actual source to
the MOSFET resistance.
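Solving Eq. 6 for the series reactance gives X_s = R_s·sqrt(R_p/R_s - 1), which is presumably the relation later cited as Eq. 7. The sketch below applies it to the input match. The 50 Ω actual source resistance is an assumption (it is the termination used in the SPICE runs described later), the source conductance is the mismatched value quoted in Section 3, and the function names are hypothetical.

```python
import math


def series_reactance(r_series, r_parallel):
    """Solve Eq. 6, R_p = R_s * (1 + (X_s / R_s)**2), for X_s."""
    return r_series * math.sqrt(r_parallel / r_series - 1.0)


def parallel_reactance(x_series, r_series):
    """Parallel equivalent reactance: X_p = X_s * ((R_s / X_s)**2 + 1)."""
    return x_series * ((r_series / x_series) ** 2 + 1.0)


def capacitance(x_c, f_hz):
    """C = 1 / (2 * pi * f * X_C)."""
    return 1.0 / (2.0 * math.pi * f_hz * x_c)


r_actual = 50.0             # assumed actual source resistance, ohms
r_opt = 1.0 / 2.0e-3        # R_S = 1/G_S with G_S = 2.0e-3 S from Section 3
x_s = series_reactance(r_actual, r_opt)
c_1 = capacitance(x_s, 435e6)
print(f"X_s = {x_s:.1f} ohm, C1 = {c_1 * 1e12:.2f} pF")
```

Under these assumptions the result lands close to the C1 value reported in Section 3; the series and parallel forms share the same Q, which is a useful sanity check on the conversion.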
To determine the bandwidth of the circuit, the Q and the resonant frequency need to
be known. Since Q is dependent on the ratio of resistance to reactance in the input circuit,
the following equation results:




The resistance RT has already been determined for matching purposes. Using Eq. 9, CT










where $C_1'$ is the parallel equivalent of C1. Thus, the reactance of C2 fixes the bandwidth.
Finally, to make the input circuit resonant at the desired frequency, the reactance terms
of the circuit need to cancel each other. The two reactance terms are $\omega L_1$ and $1/(\omega C_T)$. By
setting the two terms equal to each other and rearranging the equation, the value for the
inductor L1 can be found by
$$L_1 = \frac{1}{(2 \pi f)^2 C_T}$$ (12)
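Eq. 12 is easy to check numerically. The sketch below computes the inductance that resonates with a given total capacitance and then confirms the result by recomputing the resonant frequency. The total capacitance used is a hypothetical value chosen for illustration, and the function names are not from the paper.

```python
import math


def resonant_inductance(f_hz, c_total):
    """Eq. 12: inductance that cancels the reactance of c_total at f_hz."""
    return 1.0 / ((2.0 * math.pi * f_hz) ** 2 * c_total)


def resonant_frequency(l_henry, c_total):
    """Inverse relation: f = 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_total))


c_t = 8.6e-12                                  # hypothetical total capacitance, farads
l_1 = resonant_inductance(435e6, c_t)
print(f"L1 = {l_1 * 1e9:.2f} nH")
print(f"check: f = {resonant_frequency(l_1, c_t) / 1e6:.1f} MHz")
```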
One potential problem of an amplifier circuit is instability. Sustained oscillation occurs
when an amplifier is unstable. An amplifier can become unstable if a feedback path occurs
that adds rather than subtracts the output signal from the input signal. Coupling between
the input and the output can occur through capacitance within the active device as well as
through the passive circuit elements. At higher frequencies the reactance of the capacitor
decreases, thus decreasing the phase margin of the system.
Two criteria are used for determining the stability of the amplifier: the Linvill C factor
and the Stern K factor. The Linvill factor measures stability under worst-case conditions,
when the input and the output terminals are unloaded. The following equation determines
the Linvill C factor.
$$C = \frac{|y_{12} y_{21}|}{2 g_{11} g_{22} - \mathrm{Re}(y_{12} y_{21})}$$ (13)
If C is less than 1, the device is unconditionally stable. If C is greater than 1, certain
combinations of load and source admittances can be found to produce oscillations.
The Stern factor includes the effect of source and load admittances. The Stern K factor
is calculated from
$$K = \frac{2 (g_{11} + G_S)(g_{22} + G_L)}{|y_{12} y_{21}| + \mathrm{Re}(y_{12} y_{21})}$$ (14)
If K is greater than 1, the amplifier is stable. If K is less than 1, the circuit is potentially
unstable. It is recommended to obtain a K value of 3 or 4 rather than 1 as a safety margin













Figure 4: Linvill stability plot
The Stern solution for creating a stable amplifier is to deliberately add some mismatch
into the source or load tuning circuits. From Eq. 14, the Stern K factor can become greater
than 1 (ensuring a stable circuit) by choosing $G_S$ and $G_L$ large enough. Since the source
and load are not actually modified, but designed as if they were, a mismatch results. There
is however a reduction in gain.
3 Design Procedure
The RF amplifier design is for a 420 MHz to 450 MHz operating range. Thus, the resonant
frequency will be at 435 MHz with a bandwidth of 30 MHz. The first step in this RF
amplifier design is to characterize the MOSFET by the y parameters. SPICE simulations
were used to determine the y parameters at various frequencies. At 435 MHz, the typical
y parameter values are
$y_{11} = 6.6 \times 10^{-4} + j5.2 \times 10^{-3}$, $y_{21} = 8.6 \times 10^{-3} - j1.8 \times 10^{-3}$,
$y_{12} = -1.8 \times 10^{-8} - j7.5 \times 10^{-5}$, and $y_{22} = 6.9 \times 10^{-5} + j6.6 \times 10^{-5}$.
The potential instability of the amplifier needs to be determined by using the Linvill
C factor equation (13). The results of the Linvill calculations at different frequencies are
shown in Fig. 4. As Fig. 4 shows, the device is potentially unstable up to 1 GHz.
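The Linvill test at the design frequency can be reproduced directly from Eq. 13 using Python's built-in complex numbers. In this sketch the exponents of y12 (imaginary part) and y22 are taken as 10^-5, an assumption chosen because it agrees with the mismatch results quoted later in this section; the function name is hypothetical.

```python
def linvill_c(y11, y12, y21, y22):
    """Eq. 13: C < 1 is unconditionally stable, C > 1 is potentially unstable."""
    loop = y12 * y21
    return abs(loop) / (2.0 * y11.real * y22.real - loop.real)


# y parameters at 435 MHz (exponents of y12 and y22 assumed, see lead-in)
y11 = 6.6e-4 + 5.2e-3j
y21 = 8.6e-3 - 1.8e-3j
y12 = -1.8e-8 - 7.5e-5j
y22 = 6.9e-5 + 6.6e-5j

c_factor = linvill_c(y11, y12, y21, y22)
verdict = "potentially unstable" if c_factor > 1.0 else "unconditionally stable"
print(f"Linvill C = {c_factor:.2f}: {verdict}")
```

With these values C comes out well above 1, in line with the statement that the device is potentially unstable at the operating frequency.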
An optimal Stern solution to stabilize the amplifier is difficult, but a good solution, as
defined in [6], finds the appropriate source and load admittances ($Y_S$ and $Y_L$) such that
$$B_L = -b_{22} = -6.6 \times 10^{-5}$$
$$B_S = -b_{11} = -5.2 \times 10^{-3}$$
To calculate $G_S$ and $G_L$, a mismatch ratio R is defined as
$$R = \frac{G_S}{g_{11}} = \frac{G_L}{g_{22}}$$ (15)
It is necessary to find a ratio that gives the desired Stern K factor. By using Eq. 14 and
Eq. 15, the following equation is derived, which relates R to the K factor.
$$R = \sqrt{\frac{K \left( |y_{21} y_{12}| + \mathrm{Re}(y_{21} y_{12}) \right)}{2 g_{11} g_{22}}} - 1$$
With K = 3, R = 3.07 and the appropriate conductances are found from Eq. 15. The
mismatched source and load admittances are $Y_S = 2.0 \times 10^{-3} - j5.2 \times 10^{-3}$ and
$Y_L = 2.1 \times 10^{-4} - j6.6 \times 10^{-5}$.
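Substituting G_S = R·g11 and G_L = R·g22 into Eq. 14 gives K = 2·g11·g22·(1 + R)^2 / (|y21·y12| + Re(y21·y12)), which can be solved for R. The sketch below does this and then verifies the result by recomputing K. The y parameter exponents for y12 and y22 are assumed to be 10^-5, and the function names are hypothetical.

```python
import math


def mismatch_ratio(k, y11, y12, y21, y22):
    """Mismatch ratio R that yields Stern factor k when G_S = R*g11 and G_L = R*g22."""
    loop = y21 * y12
    margin = abs(loop) + loop.real
    return math.sqrt(k * margin / (2.0 * y11.real * y22.real)) - 1.0


def stern_k(y11, y12, y21, y22, g_s, g_l):
    """Eq. 14 evaluated for given source and load conductances."""
    loop = y12 * y21
    return 2.0 * (y11.real + g_s) * (y22.real + g_l) / (abs(loop) + loop.real)


y11 = 6.6e-4 + 5.2e-3j
y21 = 8.6e-3 - 1.8e-3j
y12 = -1.8e-8 - 7.5e-5j     # imaginary exponent assumed
y22 = 6.9e-5 + 6.6e-5j      # exponents assumed

r = mismatch_ratio(3.0, y11, y12, y21, y22)
g_s, g_l = r * y11.real, r * y22.real
k_check = stern_k(y11, y12, y21, y22, g_s, g_l)
print(f"R = {r:.2f}, G_S = {g_s:.2e} S, G_L = {g_l:.2e} S, K = {k_check:.2f}")
```

The computed ratio is close to the quoted R = 3.07; a small difference is expected because the y parameters are given to only two significant figures.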
The device input and output admittances need to be calculated. Using Eq. 1 and Eq.
2, the following equation results:
$$Y_{in} = \frac{i_1}{v_1} = y_{11} - \frac{y_{12} y_{21}}{y_{22} + Y_L} = 1.2 \times 10^{-3} + j7.5 \times 10^{-3}$$
Rearranging Eq. 2 and Eq. 1, the output admittance for the MOSFET is
$$Y_{out} = \frac{i_2}{v_2} = y_{22} - \frac{y_{21} y_{12}}{y_{11} + Y_S} = 126.3 \times 10^{-6} + j304.6 \times 10^{-6}$$
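These two expressions follow from eliminating v2 (or v1) between Eqs. 1 and 2, and they can be evaluated with complex arithmetic. The sketch below uses the 435 MHz y parameters (with the y12 and y22 exponents assumed to be 10^-5) and the mismatched terminations quoted above; the function names are hypothetical.

```python
def input_admittance(y11, y12, y21, y22, y_load):
    """Y_in = y11 - y12*y21 / (y22 + Y_L)."""
    return y11 - y12 * y21 / (y22 + y_load)


def output_admittance(y11, y12, y21, y22, y_source):
    """Y_out = y22 - y21*y12 / (y11 + Y_S)."""
    return y22 - y21 * y12 / (y11 + y_source)


y11 = 6.6e-4 + 5.2e-3j
y21 = 8.6e-3 - 1.8e-3j
y12 = -1.8e-8 - 7.5e-5j     # imaginary exponent assumed
y22 = 6.9e-5 + 6.6e-5j      # exponents assumed
y_s = 2.0e-3 - 5.2e-3j      # mismatched source admittance
y_l = 2.1e-4 - 6.6e-5j      # mismatched load admittance

y_in = input_admittance(y11, y12, y21, y22, y_l)
y_out = output_admittance(y11, y12, y21, y22, y_s)
print(f"Y_in  = {y_in:.3e}")
print(f"Y_out = {y_out:.3e}")
```

Both results agree with the quoted values to within the rounding of the two-digit inputs.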
Now, the actual values of the passive elements can be determined. The optimum source
or load resistance and then the series capacitive reactance are calculated by Eqs. 3 and
7 respectively. To convert a capacitive reactance to capacitance, the following equation is
used.
$$C = \frac{1}{(2 \pi f) X_C}$$ (16)
The capacitance that provides the series reactance at 435 MHz is C1 = 2.47 pF for the
input circuit and C4 = 0.759 pF for the output circuit. The resonant circuit capacitors can
be obtained by
$$C_2 = C_T - C_{in} - C_1'$$ (17)
and
$$C_3 = C_T - C_{out} - C_4'$$ (18)
To determine the required total capacitance ($C_T$) of each circuit, Eqs. 8 and 10 are
used. The bandwidths for the input and output circuits need to be different as these circuits
are cascaded. The bandwidths should be chosen so that the output circuit determines the
desired frequency response. In this design, the input circuit bandwidth will be 60 MHz
and the output circuit bandwidth will be 30 MHz.
The equivalent device input capacitance ($C_{in}$) and output capacitance ($C_{out}$) are found
from $C = B/(2 \pi f)$, where B is the susceptance. The parallel capacitive reactance of each
circuit is found by an equation derived from Eq. 6 and the RC network definitions for Q,
$$X_p = X_s \left[ \left( \frac{R_s}{X_s} \right)^2 + 1 \right]$$
Using Eq. 16, the parallel equivalents of C1 and C4 are found. The values for the capacitors
at the resonant frequency of 435 MHz are C2 = 3.66 pF and C3 = 0.937 pF.
To finish the design for the tuning circuits, the input and output inductances need to
be calculated. Using Eq. 12, L1 = 15.5 nH and L2 = 74.4 nH for the resonant frequency
of 435 MHz.
The RF choke, the BPC capacitor, and the 1 µF capacitor are external discrete components.
The discrete inductor needs to have a self-resonant frequency above 450 MHz
to operate as an RF choke. The bias circuit needs to be well bypassed; otherwise it will
become part of the a.c. input circuit.
Finally, R1 and R2 have to be determined to bias the MOSFET gate at 2.5 V. Since
the supply voltage is +5 V, the two resistor values are equal. The value of the resistors is
somewhat arbitrary. When R1 = R2 = 2 kΩ, the current through the bias stick is 1.25
mA.
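The bias figures can be confirmed with the unloaded voltage divider relations. This is a small sketch with a hypothetical function name; it assumes the gate draws no d.c. current, which is reasonable for a MOSFET.

```python
def divider(v_supply, r_top, r_bottom):
    """Unloaded divider: returns (tap voltage, current through the resistor string)."""
    current = v_supply / (r_top + r_bottom)
    return current * r_bottom, current


v_bias, i_bias = divider(5.0, 2e3, 2e3)
print(f"bias voltage = {v_bias:.2f} V, bias current = {i_bias * 1e3:.2f} mA")
```

Any equal pair of resistors gives the same bias voltage; the choice of 2 kΩ only sets the standing current.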
The SPICE simulator reported that although the resonant frequency was within 0.5%
of the designed resonant frequency, the circuit element values did not give the desired
bandwidth for the amplifier. The bandwidth was 43% of the expected bandwidth. Using
the SPICE simulator again, a simulation was made of the equivalent circuits in Fig. 3
with the designed values. This resulted in two circuits with the desired bandwidths and
resonant frequency. At this point, either the active device model or the Stern solution for stability,
which gives the input and output admittances, was suspected of causing the bandwidth reduction.
The active device model seemed an unlikely cause, so it was decided to revisit the Stern solution.
The calculation for the susceptance term is the only difference between the optimal Stern
solution and the good solution. The following equations from [8] are used for the optimal
Stern solution.
$$B_1 = \frac{z_0 \sqrt{K \left( |y_{21} y_{12}| + \mathrm{Re}(y_{21} y_{12}) \right)}}{2 G_2}$$
where $z_0$ is a real root of a third order equation and $G_2 = G_L - g_{22}$. To find the source
susceptance term, $B_S = B_1 - b_{11}$ is used. The third order equation is
$$z^3 + \left[ K (|X| + \mathrm{Re}(X)) + 2 \mathrm{Re}(X) \right] z - 2 \, \mathrm{Im}(X) \sqrt{K (|X| + \mathrm{Re}(X))} = 0$$
where $X = y_{21} y_{12}$. The load susceptance term is found by $B_L = B_2 - b_{22}$ where
$$B_2 = \frac{G_2 z_0}{\sqrt{K \left( |y_{21} y_{12}| + \mathrm{Re}(y_{21} y_{12}) \right)}}$$
The recalculated mismatched source and load admittances are $Y_S = 2.0 \times 10^{-3} - j8.8 \times 10^{-3}$
and $Y_L = 2.1 \times 10^{-4} - j1.6 \times 10^{-4}$. Recalculating all the passive tuning elements and
simulating the new design shows an increase in the bandwidth to 75% of the desired result.
The optimum Stern solution maximizes both the susceptances and the conductances for
maximum power gain realizable for a given stability factor, K. As Stern points out in [8],
maximizing the power gain imposes certain restrictions on the bandwidth. The restrictions
could be due to the active device or the circuit elements. There is no simple relationship










Table 1: Table of the Passive Element Values
The amplifier design was then analyzed using a microwave optimization program called
Touchstone by EEsof [9]. The Touchstone results showed a frequency response similar to
the SPICE results. Unfortunately, the input and output reflection coefficients indicated
a potential instability within the desired operating range of the amplifier. A SPICE run
indicated the amplifier was stable with a source and load of 50 Ω. A second run with
the source and load open-circuited displayed an oscillator. Stern's paper [8] implies the
amplifier must be used with the designed terminations. Touchstone was then used to
optimize the amplifier design. The resulting values for the passive elements in Fig. 2 are
shown in Table 1.
4 Layout
The RF amplifier is designed to be fabricated in an n-well 2 µm CMOS process. The
layout of each type of circuit element is now addressed.
A capacitor is formed by placing an insulator (dielectric) between two conductive plates.
In a CMOS process, the dielectric is generally silicon dioxide (SiO2). The value is determined
by
$$C = \frac{\epsilon_{SiO_2} \epsilon_0 A}{t_{ox}}$$ (19)
where $t_{ox}$ is the thickness of the oxide and A is the area of the capacitor. Any variations
in $t_{ox}$ or the area will change the result of the capacitance. The tolerance is usually within
±15% and is mainly determined by the oxide thickness variation [5].
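Eq. 19 can be used to estimate the plate area a given capacitor needs. The sketch below is illustrative only: the oxide thickness is a hypothetical value (the paper does not state it), the relative permittivity of 3.9 is the usual textbook figure for SiO2, and the function names are not from the paper.

```python
EPS_0 = 8.854e-12       # permittivity of free space, F/m
EPS_R_SIO2 = 3.9        # relative permittivity of SiO2 (typical textbook value)


def plate_capacitance(area_m2, t_ox_m):
    """Eq. 19: parallel plate capacitance with an SiO2 dielectric."""
    return EPS_R_SIO2 * EPS_0 * area_m2 / t_ox_m


def plate_area(c_farads, t_ox_m):
    """Area needed for a target capacitance (Eq. 19 solved for A)."""
    return c_farads * t_ox_m / (EPS_R_SIO2 * EPS_0)


t_ox = 50e-9                                   # hypothetical oxide thickness, 50 nm
area = plate_area(2.47e-12, t_ox)              # area for a 2.47 pF capacitor
side_um = area ** 0.5 * 1e6
print(f"square plate side = {side_um:.1f} um")

# A +15% oxide thickness lowers the capacitance by the factor 1/1.15.
c_thick = plate_capacitance(area, 1.15 * t_ox)
print(f"C with +15% oxide = {c_thick * 1e12:.2f} pF")
```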
There is a parasitic capacitance from the bottom plate to the substrate. A parasitic
capacitance also exists on the top plate due to the connecting leads. To minimize the
parasitics in the layout of the RF amplifier, the bottom plate of C1 is the top plate of C2
as shown in Fig. 5, where M1 designates the Metal 1 layer, M2 designates the Metal 2
layer, and POLY designates the polysilicon layer. Likewise, the bottom plate of C4 is the
top plate of C3. This layout configuration removes the parasitic capacitance that would
be between the bottom plate of one capacitor and the top plate of the other capacitor.
The bottom layer of the structure is grounded and the substrate is also at an a.c. ground,
eliminating the effects of that bottom plate parasitic.
The RF amplifier layout uses a polysilicon type resistor. A polysilicon resistor is a














Figure 5: Capacitor Layout
uniform slab of polysilicon surrounded by silicon dioxide. The resistance is determined by
$$R = \frac{R_s l}{w}$$ (20)
where $R_s$ is the sheet resistance, $l$ is the length, and $w$ is the width. The sheet resistance
is defined as
$$R_s = \frac{\rho}{t}$$ (21)
where $\rho$ is the resistivity of the polysilicon and $t$ is the thickness. In this particular process
the sheet resistance is 40 Ω/square.
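Eq. 20 reduces resistor sizing to counting squares. The sketch below sizes one of the 2 kΩ bias resistors from Section 3 at the stated 40 Ω/square; the 4 µm drawn width is a hypothetical choice, and the function names are not from the paper.

```python
def squares_needed(r_ohms, sheet_ohms_per_sq):
    """Number of squares (l / w) needed for a target resistance, from Eq. 20."""
    return r_ohms / sheet_ohms_per_sq


def resistor_length(r_ohms, sheet_ohms_per_sq, width):
    """Resistor length in the same units as width."""
    return squares_needed(r_ohms, sheet_ohms_per_sq) * width


n_sq = squares_needed(2000.0, 40.0)
length_um = resistor_length(2000.0, 40.0, 4.0)     # hypothetical 4 um width
print(f"{n_sq:.0f} squares, {length_um:.0f} um long at 4 um wide")
```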
Inductors are difficult to implement in an IC due to several problems. One problem
is deriving an accurate model. Several shapes may be used to make a thin film inductor.
A straight line can be used for low inductances on the order of two to three nH. Circular
spiral or square spiral shapes provide a much higher inductance and Q value. The circular
spiral has a Q value about 10% higher than a square spiral of the same diameter [10].
There have been many papers written on how to model the various structures [11,12,13].
Remke [14] has a good summary of the traditional equations used in circular spiral inductor
design.
Another problem with thin film inductors is the large area needed to implement even
a small sized inductor (nH). Since the RF amplifier is designed for UHF frequencies, the
inductor sizes are reduced. A third problem for this amplifier was how to implement the
shape given the tools that are available. A circular spiral was implemented by placing
the center of small width rectangles around a spiral line so that the rectangles overlapped
equally. The rectangles were rotated around the spiral. The spiral ribbon is a metal 2
layer.
Using Wheeler's original formula [14] with an adjustment for ground plane effects, the
resulting inductance formula [10] is
L(nH) = \frac{1}{25.4} \cdot \frac{a^2 n^2}{8a + 11c} \cdot K_g
where n is the number of turns, K_g is the ground plane adjustment, and a and c are
given by a = (d_o + d_i)/4 and c = (d_o - d_i)/2, where d_o is the
outer diameter and d_i is the inner diameter. The ground plane adjustment equation [15,16]
is
K_g = 0.57 - 0.145 \ln \frac{W}{h}   (22)
where W is the width of the spiral track and h is the separation height of the inductor to
the ground plane.
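The spiral inductance formula and Equation 22 can be evaluated with a short sketch such as the one below. It assumes all dimensions are in micrometres (which is what the 1/25.4 scale factor suggests) and that the result is in nH; the function names and the example geometry are hypothetical.

```python
import math


def ground_plane_factor(track_width, height):
    """Equation 22: K_g = 0.57 - 0.145 ln(W / h)."""
    return 0.57 - 0.145 * math.log(track_width / height)


def spiral_inductance_nh(n_turns, d_outer, d_inner, k_g=1.0):
    """Circular spiral inductance in nH, dimensions assumed in micrometres."""
    a = (d_outer + d_inner) / 4.0   # mean radius
    c = (d_outer - d_inner) / 2.0   # radial width of the winding
    return (a * a * n_turns * n_turns) / (25.4 * (8.0 * a + 11.0 * c)) * k_g


# Hypothetical 5-turn spiral, 400 um outer and 100 um inner diameter,
# without the ground plane correction (k_g = 1).
example_l = spiral_inductance_nh(5, 400.0, 100.0)
```

Passing `ground_plane_factor(W, h)` as `k_g` applies the ground plane adjustment to the same geometry.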
In order to find the Q of the inductor, the resistance must be found. The resistance of
a circular spiral inductor is [10]
R = \frac{K \pi a n R_s}{W}
where R_s is the sheet resistance of the metal and K is a correction factor that takes into




where S is the spacing in the spiral. The Q value for each of the required inductors is low:
7.6 for L1 and 10.0 for L2.
The gate of the active device, Q1 of Fig. 2, is made of a polysilicon layer. Because of
polysilicon's high resistance with long lines and the parasitic capacitances associated with
the active device, the propagation of signals is delayed. A way of reducing the propagation
delay is to segment the gates.
The signal delay of a distributed n-section network as n becomes very large is t_d =
(r c l^2)/2, where r denotes the resistance per unit length, c denotes the capacitance per unit
length, and l denotes the length of the wire. The active device was designed for a gate
length of 5 µm and a gate width of 500 µm. If the poly is a distributed network of length
500 µm, an unacceptable propagation delay of 7 ns would result. If the gate is divided into
20 segments of 25 µm length, the resulting propagation delay is only 17.5 ps.
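The benefit of segmenting follows from the quadratic dependence of the delay on length. The sketch below backs the r·c product out of the quoted 7 ns figure (the individual r and c values are not given in the text, so this is an assumption made for illustration) and re-evaluates the delay for a 25 µm segment.

```python
def distributed_rc_delay(rc_product, length_um):
    """t_d = r * c * l^2 / 2 for a distributed RC line of the given length."""
    return rc_product * length_um ** 2 / 2.0


def rc_product_from_delay(delay_s, length_um):
    """Invert t_d = r * c * l^2 / 2 to recover r * c (per um squared)."""
    return 2.0 * delay_s / length_um ** 2


# Back out r*c from the 7 ns delay of the unsegmented 500 um gate.
rc = rc_product_from_delay(7e-9, 500.0)

# Each of the 20 segments is 25 um long, so the delay falls by 20^2 = 400.
segment_delay = distributed_rc_delay(rc, 25.0)
```

The result is 7 ns / 400 = 17.5 ps, in agreement with the figure quoted above.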
The I/O pad cells contain electrostatic discharge (ESD) protection. Only parallel protection
was used to eliminate series resistance delays. To help prevent latch-up in the I/O
pad cells, a p+ diffusion ring surrounds the nMOS transistors and an n+ ring is diffused
into the perimeter of each of the n-wells. The active device, Q1, is surrounded by a p+
guard ring, as is the entire amplifier.
Since MOSIS requires 30 pads in the tiny chip frame, some of the extra pads are used
as taps for the inductors and capacitors. The tap and power supply lines to the inductors
are made as wide as possible to reduce their contribution to the total inductance. The
inductance of a straight strip of ribbon is [10]
L(nH) = 2 \times 10^{-4} \, l \left[ \ln\left(\frac{l}{w + t}\right) + 1.193 + 0.2235 \, \frac{w + t}{l} \right] K_g
where l is the length of the strip, w is the width, t is the thickness and K_g is given in Eq.
22. In addition to the metal line inductance, there is inductance associated with the bond
wire from the pad to the pin of the packaged IC. The bond wire inductance is reduced by
using a PLCC (plastic leaded chip carrier). The tap and signal lines to the capacitors are
made as wide as possible to reduce the resistance of each of the lines.
Standard Amplifier     15.2   420   48
+15%                   15.3   402   46
-15%                   13.9   440   58
+30%                   17.7   365   31
-30%                   12.3   511   75
Parasitic Capacitors   18.6   425   17
Lossy Inductors         6.5   436   96
1.5V Bias               6.9   433   71
3.5V Bias              19.0   419   33
Table 2: Table of SPICE Results
5 Conclusions
Numerous SPICE simulations were run to show the effects of component value variations,
inductor Q values, parasitics and voltage variations. The results are summarized in Table 2.
The amplifier continues to have gain and a usable bandwidth under the changes simulated.
A test circuit has been submitted to MOSIS to verify the theory presented in this paper.
Although the amplifier continues to function, there are dramatic changes in center
frequency and bandwidth. This is to be expected for L and C component value changes
since the center frequency is a function of these values. Variations of ±15% in capacitance
are expected for MOS capacitors. For an amplifier to meet a manufacturing specification,
a digitally trimmed capacitor array might be necessary. The inductor variation should be
much smaller than the capacitance variation, but no data can be found in the open literature
on MOS inductors. A study of MOS inductors needs to be conducted to establish this
data. Variations of ±30% were simulated since the accuracy and applicability of the models
established in the literature [15,11,12] for inductors in GaAs were in question. Since both
inductors are connected to supplies through pins, external trimming is possible. Bias
voltages can be controlled accurately and present no real problem. Parasitics can be
accurately modeled and can be incorporated into the design optimization procedure. The
real concern is the low Q values for the inductances. Additional amplifier configurations
need to be studied to find configurations requiring smaller values of inductance, which can
be designed for higher Q values.
This work indicates that RF amplifiers can be designed in MOS, but that further
work needs to be conducted to establish techniques and configurations that will result in
manufacturable RF amplifiers.
References
[1] P. Allen and D. Holberg, CMOS Analog Circuit Design, New York, N.Y., Holt, Rine-
hart and Winston, 1987, Chap 1
[2] M. Milkovic "VLSI High Frequency CMOS Operational Amplifiers for Communica-
tions Applications", Proceedings of the 27th Midwest Symposium on Circuits and
Systems, vol. 2, June 1984, pp.784-787
[3] K. Niclas, W. Wilser, R. Gold, and W. Hitchens, "The Matched Feedback Amplifier:
Ultrawide-Band Microwave Amplification with GaAs MESFET's", IEEE Transactions
on Microwave Theory and Techniques, vol. MTT-28, April 1980, pp.285-294
[4] D. Ribner, "Some variations in CMOS Operational Amplifier Design", Proceedings of
the 27th Midwest Symposium on Circuits and Systems, vol. 2, June 1984, pp. 788-791
[5] R. Gregorian and G. Temes, Analog MOS Integrated Circuits for Signal Processing,
New York, John Wiley & Sons, 1986
[6] J. Lenk, Manual for MOS Users, Reston, Virginia, Prentice-Hall, 1975
[7] H. Krauss, C. Bostian, and F. Raab, Solid State Radio Engineering, New York, John
Wiley & Sons, 1980
[8] A. Stern, "Stability and Power Gain of Tuned Transistor Amplifiers", Proceedings of
the IRE, vol. 45, pp. 335-343, March 1957
[9] EEsof, Inc., User Manual for Touchstone, Westlake Village, California, 1985
[10] I. Bahl and P. Bhartia, Microwave Solid State Circuit Design, New York, John Wiley
& Sons, 1988
[11] H. Dill, "Designing Inductors for Thin-Film Applications", Electronic Design, Febru-
ary 17, 1964, pp. 52-60
[12] P. Shepherd, "Analysis of Square-Spiral Inductors for Use in MMIC's", vol. 34, April
1986, pp. 467-472
[13] E. Pettenpaul, Hartmut Kapusta, Andreas Weisgerber, Heinrich Mampe, Jurgen Lug-
insland, and Ingo Wolff, "CAD Models of Lumped Elements on GaAs up to 18 GHz",
vol. 36, February 1988, pp. 294-304
[14] R. Remke and G. Burdick, "Spiral Inductors for Hybrid and Microwave Applications",
Proceedings of the 1974 Electronic Components Conference, May 1974, pp. 152-161
[15] R. Chaddock, "The Application of Lumped Element Techniques to High Frequency
Hybrid Integrated Circuits", The Radio and Electronic Engineer, vol. 44, 1974, pp.
414-420
[16] K. Gupta, R. Garg, R. Chadha, Computer-Aided Design of Microwave Circuits, Ded-
ham, Massachusetts, Artech House, Inc., 1981
	 N94- 71083
A Comparison of Two
Fast Binary Adder Configurations
J. Canaris and K. Cameron





Abstract - Conditional sum and binary lookahead carry are two methods for
performing fast binary addition. These methods are quite different, but the
adders have a common feature that makes them interesting to compare. Both
adders have the carry generating logic implemented as a binary tree, which
grows in depth as log₂ n, where n is the number of bits in the adder. The delay
in the carry paths also grows in proportion to log₂ n. This paper shows that
the Transmission-Gate Conditional-Sum adder and the binary lookahead carry
adder have the same speed of addition, but that the conditional sum adder
requires only 46% of the area.
1 Introduction
There are many high performance binary adders described in the literature [1,2,3,4,5,6].
These adders use a variety of techniques to speed up the generation of the carry signal.
Carry lookahead, carry select and carry completion are among the techniques used. Some
methods [4] use specialized encodings to perform addition without a carry being gener-
ated. This paper covers two techniques which, although quite different in the method they
use, lend themselves to a regular layout as described in [7]. The two high performance
adders discussed in this paper are the Binary Lookahead Carry (BLC) adder [3] and the
Transmission-Gate Conditional-Sum (TGCS) Adder [6]. These adders are interesting because
their area grows in proportion to n log₂ n and their propagation delay grows with
log₂ n, where n is the number of bits in the adder.
2 Binary Addition
Binary addition can of course be implemented by a one-dimensional array of full adder
cells, each of which implements the truth table shown in Table 1.
A  B  Ci | So Co
0  0  0  | 0  0
0  0  1  | 1  0
0  1  0  | 1  0
0  1  1  | 0  1
1  0  0  | 1  0
1  0  1  | 0  1
1  1  0  | 0  1
1  1  1  | 1  1
Table 1: Full Adder cell truth table.
This function can also be represented by:
S_i = A_i \oplus B_i \oplus {Ci}_i   (1)
{Co}_i = (A_i \cdot B_i) + (A_i \cdot {Ci}_i) + (B_i \cdot {Ci}_i)   (2)
It is Equation 2 which is the target of addition optimization techniques, as it forms an
n bit ripple path. The methods which the BLC and TGCS adders use to speed up this
carry path will be discussed next.
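Equations 1 and 2, and the n bit ripple path they form, can be illustrated with a minimal sketch. The function names and the least-significant-bit-first list convention are assumptions made for this example.

```python
def full_adder(a, b, carry_in):
    """One full adder cell: Equation 1 (sum) and Equation 2 (carry out)."""
    sum_out = a ^ b ^ carry_in
    carry_out = (a & b) | (a & carry_in) | (b & carry_in)
    return sum_out, carry_out


def ripple_add(a_bits, b_bits, carry_in=0):
    """Chain full adders; bit lists are least significant bit first.

    The carry variable is the serial dependency: bit i cannot finish
    until bit i-1 has produced its carry, giving delay proportional to n.
    """
    sums = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sums.append(s)
    return sums, carry


# 3 + 5 = 8 on four bits (LSB first): the carry ripples through three cells.
example_sum, example_carry = ripple_add([1, 1, 0, 0], [1, 0, 1, 0])
```

The loop makes the linear growth of the carry path explicit, which is the delay the two adders discussed next are designed to avoid.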
2.1 BLC carry generation
This discussion parallels that of [3]. First we look at carry lookahead adders in general.
The linear growth in the delay of the carry chain can be improved by calculating the carries
to each bit in parallel.
C_i = G_i + P_i \cdot C_{i-1}   (3)
G_i = A_i \cdot B_i   (4)
P_i = A_i \oplus B_i   (5)
The carry equation, Equation 3, can be expanded as follows:
C_i = G_i + P_i \cdot G_{i-1} + P_i \cdot P_{i-1} \cdot G_{i-2} + \cdots + P_i \cdots P_1 \cdot C_0   (6)
The sum, S_i, is generated by:
S_i = C_{i-1} \oplus P_i   (7)
Writing down a small number of terms of Equation 6 will show that the layout of such a
network will be quite irregular and that the number of gates needed to implement such a
scheme increases rapidly. Four stages of lookahead, for example, will have the following
terms.
C_1 = G_1 + (P_1 \cdot C_0)
C_2 = G_2 + (P_2 \cdot G_1) + (P_2 \cdot P_1 \cdot C_0)
C_3 = G_3 + (P_3 \cdot G_2) + (P_3 \cdot P_2 \cdot G_1) + (P_3 \cdot P_2 \cdot P_1 \cdot C_0)
C_4 = G_4 + (P_4 \cdot G_3) + (P_4 \cdot P_3 \cdot G_2) + (P_4 \cdot P_3 \cdot P_2 \cdot G_1) + (P_4 \cdot P_3 \cdot P_2 \cdot P_1 \cdot C_0)
For these reasons a pure carry lookahead adder is usually implemented with small
sections of lookahead combined with some other addition scheme, such as carry select.
The BLC adder however takes a different approach, which lends itself to a regular layout.
A new operator, o is introduced, such that:
(g, p) \, o \, (g', p') = (g + (p \cdot g'), \; p \cdot p')   (8)
where g, p, g' and p' are Boolean variables. This operator is associative and the carry signals





(G_i, P_i) = \begin{cases} (g_1, p_1) & \text{if } i = 1 \\ (g_i, p_i) \, o \, (G_{i-1}, P_{i-1}) & \text{if } 2 \le i \le n \end{cases}   (10)
that is:
(G_i, P_i) = (g_i, p_i) \, o \, (g_{i-1}, p_{i-1}) \, o \cdots o \, (g_1, p_1)   (11)
The associative property of the o operator allows the carry lookahead circuitry to be
organized as a binary tree structure whose depth is proportional to log₂ n, hence the name
Binary Lookahead Carry adder. The propagation delay through this carry section is also
proportional to log₂ n. This decrease in the carry propagation delay is quite significant
when compared to the delay through a ripple carry adder. A 16 bit adder, for example,
with a delay of 1 nanosecond/stage, will drop from 16 nanoseconds to 4 nanoseconds of
total delay.
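The way associativity turns the carry chain into log₂ n levels can be sketched in software. The example below applies the o operator of Equation 8 with a simple span-doubling schedule, which is an assumption chosen for brevity and is not the specific tree layout of the BLC adder; it only shows that the number of combining levels is log₂ n. Function names and the LSB-first bit lists are illustrative.

```python
def o(left, right):
    """The o operator of Equation 8 on (generate, propagate) pairs."""
    g, p = left
    g_prime, p_prime = right
    return (g | (p & g_prime), p & p_prime)


def blc_add(a_bits, b_bits):
    """Add two equal-length bit lists (least significant bit first).

    Returns (sum_bits, carry_out, levels). No carry enters the least
    significant bit, matching the BLC adder described in the text.
    """
    p_bits = [a ^ b for a, b in zip(a_bits, b_bits)]           # Equation 5
    pairs = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]   # (g_i, p_i)
    n = len(pairs)
    span, levels = 1, 0
    while span < n:
        # Combine each position with the group ending `span` bits below it.
        pairs = [o(pairs[i], pairs[i - span]) if i >= span else pairs[i]
                 for i in range(n)]
        span *= 2
        levels += 1
    carries = [g for g, _ in pairs]        # G_i is the carry out of bit i
    carry_in = [0] + carries[:-1]
    sum_bits = [p ^ c for p, c in zip(p_bits, carry_in)]       # Equation 7
    return sum_bits, carries[-1], levels


def to_bits(value, n):
    """Integer to n-bit list, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Bit list (least significant bit first) back to an integer."""
    return sum(b << i for i, b in enumerate(bits))
```

For a 16 bit operand the loop runs four times, which corresponds to the 4 nanosecond figure quoted above at 1 nanosecond per stage.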
Figure 1 shows the organization of a Binary Lookahead Carry adder. It should be
noted that the BLC adder has no carry signal into the least significant bit.
2.2 TGCS carry generation
The BLC adder introduces parallelism in the carry propagation path through the clever use
of a new operator, the o operator. The TGCS adder, as described in [2,6], uses a more brute
force approach to increase performance. In the TGCS adder, parallelism is introduced at
the beginning of the addition process by calculating two sums and two carries at each bit.
These four outputs are calculated from the two input signals (the addend and the augend)
as if a carry was/was not propagating into that bit position. Multiplexers are then used
to select the proper sum and carry from that bit position based on whether C_i does/does
not propagate. This calculation is known as the conditional sum, and can be described as
follows.
S_i^0 = A_i \oplus B_i = (A_i \cdot \overline{B_i}) + (\overline{A_i} \cdot B_i)   (12)
C_{i+1}^0 = A_i \cdot B_i   (13)
S_i^1 = A_i \odot B_i = (\overline{A_i} \cdot \overline{B_i}) + (A_i \cdot B_i)   (14)
C_{i+1}^1 = A_i + B_i   (15)
As in the BLC adder, the multiplexers required to choose the proper sum and carry
from each bit can be organized into a binary tree of depth log₂ n. Each subsection of the
tree can calculate the provisional sum and carry in parallel with other subsections, so a
propagation delay proportional to log₂ n can be attained. Figure 2 shows the organization
of a Transmission-Gate Conditional-Sum adder.
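A behavioural sketch of the conditional sum scheme is given below. Each bit starts with the two provisional results of Equations 12 to 15, and neighbouring blocks are then merged pairwise, with the low block's carry acting as the multiplexer select. This models only the logic, not the transmission-gate circuit; the function names and the LSB-first bit lists are assumptions.

```python
def merge(low, high):
    """Merge two adjacent blocks; each block is (s0, c0, s1, c1).

    s0/c0 are the sum bits and carry out assuming carry-in 0, s1/c1
    assuming carry-in 1. The low block's carry selects the high result.
    """
    low_s0, low_c0, low_s1, low_c1 = low
    high_s0, high_c0, high_s1, high_c1 = high

    def select(carry):
        # 2-to-1 multiplexer on the high block's provisional results.
        return (high_s1, high_c1) if carry else (high_s0, high_c0)

    hs0, hc0 = select(low_c0)
    hs1, hc1 = select(low_c1)
    return (low_s0 + hs0, hc0, low_s1 + hs1, hc1)


def conditional_sum_add(a_bits, b_bits, carry_in=0):
    """Add two equal-length bit lists (least significant bit first)."""
    blocks = [([a ^ b], a & b, [1 - (a ^ b)], a | b)   # Equations 12-15
              for a, b in zip(a_bits, b_bits)]
    while len(blocks) > 1:
        merged = [merge(blocks[k], blocks[k + 1])
                  for k in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:
            merged.append(blocks[-1])   # odd block passes to the next level
        blocks = merged
    s0, c0, s1, c1 = blocks[0]
    return (s1, c1) if carry_in else (s0, c0)


def to_bits(value, n):
    """Integer to n-bit list, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Bit list (least significant bit first) back to an integer."""
    return sum(b << i for i, b in enumerate(bits))
```

Each pass of the `while` loop halves the number of blocks, so n bits need about log₂ n levels of multiplexing.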
3 Logic Design of the Adders
CMOS technology allows for a wide choice of different logic types, such as fully complementary
gates, domino logic, pseudo-nMOS and pass transistors [3]. The types of logic
chosen to implement the BLC adder and the TGCS adder are quite different. The specific
logic configurations chosen are described below.
3.1 BLC adder logic
Initial investigations of a pass transistor (transmission gate) implementation of the BLC
adder indicated that it would be difficult to introduce inverter buffers in the delay path.
These buffers would have been required to implement the pass functions needed. For this
reason a traditional, fully complementary gate logic was chosen. As can be seen from
Figure 1, three different cells are needed to form a BLC adder. They are:
• The G and P generator cell, labeled G.
• The o operator cell, labeled O.
• The final summation cell, labeled SO.
In addition, there are two cells which provide interconnections between the three cells listed
above. These routing cells are provided as a layout convenience and have no logical purpose.
The G cell directly implements the logic functions given by Equation 4 and Equation 5.
The O cell implements the function given by Equation 8, which is restated here in a more
accessible form.
G = g + (p \cdot g')   (16)
P = p \cdot p'   (17)
The SO cell performs the final summation of the partial sum (P) of the current bit with
the carry out of the previous bit, as given by Equation 7. It should be noted that this
logic implementation differs from that described in [3,7]. In that implementation o and
® are alternated every other column in the carry evaluation block. That implementation
has also introduced inverter buffers in the locations where the O cells are missing. This
organization, while still regular, does not grow in a true log₂ n fashion. For instance in a four
bit adder three columns in the carry block are required, not two. In the implementation
described in this paper the buffering is performed by the O cells themselves, and the carry
block depth does grow as described.
3.2 TGCS adder logic
Conditional sum adders can be implemented in standard gate logic, just as the BLC
adder has been. This implementation however uses pass transistor logic (transmission
gates) instead. This particular Transmission-Gate Conditional Sum adder is based on
work presented in [6]. This work differs from [6] however. Based on formal pass transistor
design techniques [8,9,10] and the use of n-transistor only pass networks for arithmetic
units [11], a smaller and simpler TGCS adder has been designed. As can be seen from
Figure 2, seven different cells are needed to form this TGCS adder. They are:
• An input buffer cell, labeled IBUFF.
• The conditional sum cell, labeled CONSUM.
• A four section 2-1 multiplexer, labeled MUX2A.
• A two section 2-1 multiplexer, labeled MUX2B.
• A one section 2-1 multiplexer, labeled MUX1.
• An inverter buffer cell, labeled MUXBUFF.
• An output buffer, labeled OBUFF.
All routing required by this adder is contained within the cells listed above. Unlike the
BLC adder described above no special interconnection cells are required. The IBUFF and
OBUFF cells are simply buffering stages. If the adder were driven directly by a flip-flop and
drove directly into a flip-flop, these cells would not be required. The CONSUM cell directly
implements the logic functions given by Equations 12, 13, 14 and 15. These equations are
rewritten in the pass transistor format described by [8,9,11] as:
S_i^0 = A_i(\overline{B_i}) + \overline{A_i}(B_i)
C_{i+1}^0 = A_i(B_i) + \overline{A_i}(0)
S_i^1 = A_i(B_i) + \overline{A_i}(\overline{B_i})
C_{i+1}^1 = A_i(1) + \overline{A_i}(B_i)
The multiplexing cells MUX2A, MUX2B and MUX1 are used to select the proper output
from the CONSUM cells. The MUXBUFF cell is used to buffer the outputs of certain
multiplexer cells so as to drive other multiplexers.
4 Circuit Design and Layout of the Adders
The BLC and TGCS adders presented here were designed to be used in a multiply-
accumulate (MAC) block required by an image processing chip. The MAC was organized
as an 8 bit by 10 bit multiplier, with a 28 bit accumulator. The multiplier itself was
organized as a carry-save multiplier operating at a 20MHz clock frequency. The adders
were to be used as the final summation stage in the multiplier as well as performing the
accumulation required by this application. The adders therefore needed to be designed to
have a carry propagation delay of less than 25 nanoseconds under worst case conditions
using a 1.6µm double-metal CMOS process. Worst case conditions for this project were:
• 4.5V supply, 0.2V noise on VDD and VSS.
• 140°C.
• Worst Case Parameter set.
• 2.0 pF output load.
The circuit design and layout of the BLC adder were straightforward, as it was designed
with traditional CMOS logic gates. CAD tools were available to aid the design engineers
in sizing transistors. The worst case propagation delay is 24 nanoseconds. The layout of
the adder was also uncomplicated. The area required by the 28 bit BLC adder is 1,282µm
X 452.8µm, which is 45.8µm X 452.8µm per bit.
The circuit design and layout of the TGCS adder were more complicated than for the
BLC adder. The circuit design and sizing of pass transistor networks is still more of an art
than a science. In this case the adder layout was required to be pitch matched with two
memory blocks and a data path. Transistor sizes were chosen to fit into a cell layout with
a fixed width; parasitic capacitances were extracted from the layout and fed into SPICE
simulations of the critical path. This procedure was iterated until the specified speed was
attained. It is not surprising that the worst case carry propagation delay in this adder was
also 24 nanoseconds. The area required by the 28 bit TGCS adder is 963.2µm X 278.6µm,
which is 34.4µm X 278.6µm per bit. This is only 46% of the area required by the BLC
adder.
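The 46% figure and the per-bit widths follow directly from the quoted dimensions, as the short check below shows. The variable names are illustrative.

```python
BITS = 28

# Quoted adder dimensions in micrometres: (width, height).
blc_dims = (1282.0, 452.8)
tgcs_dims = (963.2, 278.6)


def area(dims):
    """Rectangle area in square micrometres."""
    return dims[0] * dims[1]


# Width of one bit slice for each adder.
blc_width_per_bit = blc_dims[0] / BITS      # about 45.8 um
tgcs_width_per_bit = tgcs_dims[0] / BITS    # about 34.4 um

# TGCS area as a fraction of BLC area.
area_ratio = area(tgcs_dims) / area(blc_dims)
```

The ratio evaluates to roughly 0.462, which rounds to the 46% stated in the text.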
5 Summary and Conclusions
This paper describes two methods, Binary Lookahead Carry and Transmission Gate Con-
ditional Sum, for performing high speed addition. The investigation of these adders was
undertaken to find the best adder for a particular application, the design of a multiply-
accumulate block in an image processor. At the outset neither method seemed to have
an advantage over the other. When each adder was designed and met the performance
requirements it was shown that the TGCS configuration has a significant area advantage
over the BLC configuration.
Acknowledgement
The authors wish to acknowledge the NASA Space Engineering Research Center program
and the Lawrence Livermore Laboratory, whose support made this investigation possible.
References
[1] I. Flores, The Logic of Computer Arithmetic, Englewood Cliffs, New Jersey, Prentice-
Hall, 1963.
[2] Kai Hwang, Computer Arithmetic, New York, New York, John Wiley and Sons, 1979.
[3] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Reading, Mass.,
Addison-Wesley, 1985.
[4] Y. Harata et al., "A High-Speed Multiplier Using a Redundant Binary Adder Tree",
IEEE JSSC, Vol. SC-22, February 1987, pp. 28-33.
[5] T. G. Noll et al., "A Pipelined 330-MHz Multiplier", IEEE JSSC, Vol. SC-21, June
1986, pp. 411-416.
[6] A. Rothermel et al., "Realization of Transmission-Gate Conditional-Sum (TGCS)
Adders with Low Latency Time", IEEE JSSC, Vol. 24, June 1989, pp. 558-561.
[7] R. P. Brent and H. T. Kung, "A Regular Layout for Parallel Adders", IEEE Trans.
Comput., Vol. C-31, March 1982, pp. 260-264.
[8] D. Radhakrishnan, S. Whitaker and G. Maki, "Formal Design Procedures for Pass
Transistor Switching Circuits", IEEE JSSC, Vol. SC-20, April, 1985, pp. 531-536
[9] G. Peterson and G. Maki, "Binary Tree Structured Logic Circuits: Design and Fault
Detection", Proceedings of IEEE International Conference on Computer Design: VLSI
in Computers, Port Chester, NY, Oct. 1984, pp. 671-676.
[10] C. Pedron and A. Stauffer, "Analysis and Synthesis of Combinational Pass Transistor
Circuits", IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, Vol. 7, July 1988, pp. 775-786.
[11] J. Canaris, "A High Speed Fixed Point Binary Divider", Proceedings ICASSP-89,
Glasgow, Scotland, May 1989, pp. 2393-2396.
Figure 1: BLC adder block diagram.
Figure 2: TGCS adder block diagram.
	 N94- 71084
Self Arbitrated VLSI Asynchronous
Sequential Circuits
S. Whitaker and G. Maki
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - A new class of asynchronous sequential circuits is introduced in
this paper. The new design procedures are oriented towards producing asyn-
chronous sequential circuits that are implemented with CMOS VLSI and take
advantage of pass transistor technology. The first design algorithm utilizes a
standard Single Transition Time (STT) state assignment. The second method
introduces a new class of self synchronizing asynchronous circuits which elimi-
nates the need for critical race free state assignments. These circuits arbitrate
the transition path action by forcing the circuit to sequence through proper
unstable states. These methods result in near minimum hardware since only
the transition paths associated with state variable changes need to be imple-
mented with pass transistor networks.
1 Introduction
Most control logic in current VLSI is realized as a synchronous sequential circuit. Major
disadvantages in using synchronous machines include the clock and power distribution.
Races are avoided in synchronous logic by selecting a clock rate that is slow enough to
allow propagation of signals from the slowest logic block. This forces the clock rate to
be governed by the slowest block on a VLSI chip. Moreover, the clock signal is assumed
to transition simultaneously for all flip flops. Clock distribution on a VLSI circuit must
account for the RC time delay caused by the finite resistance of metal lines. Another
major design consideration for high speed circuits is power bussing. With synchronous
logic, CMOS circuits have a peak current demand at the clock edge since many nodes
transition simultaneously.
Asynchronous design avoids these limitations. Asynchronous designs are not often
used because of the more complicated design procedures and the electronic considerations
for proper operation. This paper introduces a new class of asynchronous circuits which
eliminates the need for critical race free state assignments. Non-normal circuit operation
is employed in the new architecture. A non-normal transition is characterized by allowing
the circuit to assume a series of intermediate states prior to reaching the stable state [9].
The non-normal mode circuits take advantage of the ability of pass transistor networks to
be tristated. The circuits are straightforward and easy to implement since they are free
from hazards.
Figure 1: Enable-Disable Block Diagram.







Table 1: Buffer state table.
Section 2 of this paper introduces a general enable-disable pass transistor model. Section 3
presents a design procedure and a design example utilizing a Liu [1] state assignment.
Section 4 introduces the theory for self arbitrated circuits and uses a one-hot code to illustrate
a handshaking protocol. Section 5 presents a general design procedure, then modifies
the procedure to cover scale-of-two loops [10] and presents a design example. Section 6
discusses hardware bounds and Section 7 summarizes the results of this work.
2 Circuit Model
The general model for the circuit is shown in Figure 1. There are two networks of pass
transistors labeled Enable and Disable followed by a Buffer stage. The state variable
values are held by a memory circuit in the buffer. The enable and disable pass transistor
networks are designed to respond to input transitions that cause state variables to change
logic states.
The buffer circuit is described by the state table shown in Table 1, where Z represents
a high impedance state driving the input of the buffer. The next state follows the input
when the input is either a 1 or a 0; Y = y whenever the input to the buffer is tristated.
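The buffer behaviour just described can be captured in a few lines. This is a behavioural sketch only; the `Z` marker and the function name are assumptions introduced for illustration.

```python
Z = "Z"  # marker for a high impedance (tristated) buffer input


def buffer_next_state(buffer_input, present_state):
    """Next state Y of the buffer for a given input and present state y.

    A driven input (0 or 1) is followed; a tristated input leaves the
    stored state unchanged.
    """
    if buffer_input == Z:
        return present_state
    if buffer_input in (0, 1):
        return buffer_input
    raise ValueError("buffer input must be 0, 1 or Z")


def run_buffer(inputs, initial_state=0):
    """Apply a sequence of buffer inputs and record the state after each."""
    states = []
    state = initial_state
    for value in inputs:
        state = buffer_next_state(value, state)
        states.append(state)
    return states
```

For example, `run_buffer([1, Z, Z, 0, Z])` holds the 1 through the two tristated steps and then holds the 0, which is the memory behaviour the enable and disable networks rely on.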
The pass transistor networks are driven by n input variables and constants such that
the set of inputs, I, is described by
I = [I_1, I_2, I_3, \ldots, I_n, 0, 1]   (1)
     I_p | y_1 y_2 y_3
A    A   |  1   1   0
B    A   |  1   0   0
C    A   |  1   0   0
D    D   |  1   1   1
E    D   |  1   0   1
F    F   |  0   1   0
G    F   |  0   0   0
Table 2: Example τ partition flow table.
A state machine is constructed from a set of circuits shown in Figure 1 where each circuit
produces one next state variable, Y_i. The set of present state variables, y, allows the state
machine to assume as many as 2^m unique states.
y = [y_1, y_2, \ldots, y_m]   (2)
The internal state of the machine may be described by an m-tuple consisting of each present
state variable. S is the set of m-tuples which define the specified states of a sequential
machine. Let the specified q \le 2^m internal states be
S = [S_1, S_2, \ldots, S_q]   (3)
Definition 1 Let p_s be a partition that partitions states of set S from all other states
under input I_p of a flow table.
The states of S represent all the states of a transition path associated with S. S could
be the elements of a transition pair or a k-set, depending on the situation. A k-set [3,6]
consists of all next state entries in a column of a flow table that lead to the same stable
state. For example, p_abc = ABC; DEFG under I_p of Table 2 partitions ABC from DEFG.
In the hardware realization, partition p_s represents a path in the circuitry that forms the
next state variable Y_i. In Table 2 there are 3 k-sets consisting of states ABC, DE, and
FG under input I_p. Let τ_1 = ABCDE; FG, τ_2 = AFD; BCGE and τ_3 = DE; ABCFG.
The product expression covering p_abc is then y_1 \overline{y_3}. The path realizing p_abc would consist
of two series transistors, one controlled by y_1 and the other by \overline{y_3}.
Several key concepts can be applied to p_s.
1. The set S is dependent on the state assignment method.
(a) If the state assignment is a Tracey assignment, S consists of transition pairs
under input I_p.
(b) If the state assignment is a Liu assignment, then S consists of k-sets under
input I_p.
2. Partition p_s can be expressed as a product of the partitioning variables of I_p.
Definition 2 Let p_ij I_k(x) represent a pass transistor network decoding the transition path
that contains S_i and S_j, S_j being stable, where p_ij covers the transition path containing S_i, S_j,
and a qualifying input I_k passing x, where x ∈ [0, 1].
In the enable-disable model, when the enable and disable pass networks are inactive,
the input to the buffer is presented with a high impedance. The next state variable, Y,
does not change under this input condition. An input only needs to be passed by the pass
network when a state transition is required. The disable circuit provides a path to force
1 → 0 transitions of the state variable. The enable circuit provides a path to force 0 → 1
transitions.
Referring to Figure 1, the enable network is armed by present state information contained
in y_e ⊂ y to respond to input I_j in I_e ⊂ I, which forces Y_i → 1. The enable circuit
is a set of pass implicants decoding each total state requiring Y_i → 1. The states where
y_i = 0 with next state Y_i = 0 are don't care states for the enable network logic.
When the circuit is in a state where y_i = 1, the disable network performs a similar
function. The disable network is armed by present state information contained in y_d ⊂ y
to keep y_i = 1 until an input I_j in I_d ⊂ I transitions, causing the sequential machine to
move to a new state with y_i = 0. The disable circuit could be a set of pass implicants
decoding each total state requiring y_i → 0. The states in which y_i = 1, but whose next
state also requires y_i = 1, are don't care states.
A simple double inverter buffer could be used as the buffering element and would
function according to Table 1; however, the state information is dynamically stored as
charge on the gates of the first buffering inverter. This would result in a minimum operating
frequency due to leakage of the stored charge through the reverse-biased junctions forming
the drain regions of the pass transistors connected to the buffer gates. Also, this circuit
would be susceptible to potential accumulated loss of charge due to charge sharing with
nodes internal to the pass network core. To avoid these restrictions, two weak feedback
transistors are added to each buffering circuit as illustrated in Figure 2. This forms a
latch which the NMOS pass transistor network must overdrive to toggle. The weak PMOS
feedback device also overcomes the threshold voltage drop across the NMOS pass network,
thus avoiding potential high current draw by the first inverter in the buffer.
The enable-disable model is subject to race conditions as is any asynchronous sequential
circuit. Whether a circuit is critical race free or not is determined by the state assignment.
3 STT Design Procedure
In general, critical race conditions can be avoided by state assignments in which transition
paths between states are disjoint. Disjoint state assignments are in general known as single
transition time (STT) state assignments [3].
In the following theorem, let the term p_s I_p(x), where x ∈ [0, 1], in the expression for
Y_i represent a series of pass transistors which are qualified by p_s and I_p to pass x to the
buffer input.
Figure 2: Buffer circuit with weak feedback devices.
Theorem 1 Let s be the states of a transition path and let p_s partition the states of the
transition path under I_p where y_i must transition from 0 → 1 or 1 → 0. If Y_i contains
p_s I_p(1) or p_s I_p(0) respectively, then Y_i will properly specify the state transitions of p_s.
Proof If y_i is not required to change in the transition within s, that is y_i transitions
0 → 0 or 1 → 1, then p_s I_p(1) or p_s I_p(0) need not appear in the expression for Y_i since a
high impedance presented at the input of the buffer does not change Y_i. Therefore, the
absence of p_s I_p(x) does not impact Y_i.
If y_i must transition 0 → 1 or 1 → 0, then Y_i must equal 1 or 0 respectively in the entire
transition path of s. If p_s I_p(1) or p_s I_p(0) appears in Y_i, then Y_i is properly specified in all
the states of the transition path of s. If all next state variables satisfy these conditions,
the circuit will transition to the correct stable state. □
Another way this operation can be viewed is that partitioning variables partition transition
paths and guarantee critical race free operation [9]. Partition variables must not
change during transition from the unstable to the stable state. The conditions of Theorem 1
specify that only non-partitioning variables (those that change state) are excited to
change; all others remain unchanged and hence the operation is guaranteed to be critical
race free.
The following is a general design equation.
$$Y_i = \sum p_{\mathrm{disable}}\, I_j(0) + \sum p_{\mathrm{enable}}\, I_k(1) \qquad (4)$$
Procedure 1 A formal procedure for the design of an enable-disable realization for each
state variable of an asynchronous sequential circuit follows:
Design procedure.
1. Encode the flow table with an STT state assignment.
2. Identify each p partition, $p_j$, under each input variable, $I_k$, of the flow table.
3. List a table of transitions which cause the state variables to change states.
4. For each state variable $y_i$,
(a) For any $p_j$ under $I_k$ in which $y_i$ transitions $0 \rightarrow 1$, the term $p_j I_k(1)$ appears
in $Y_i$.
(b) For any $p_j$ under $I_k$ in which $y_i$ transitions $1 \rightarrow 0$, the term $p_j I_k(0)$ appears
in $Y_i$.
5. Find a covering for each p partition and substitute to derive the design equations.
Some asynchronous designs assume that only one input to the state machine is asserted
at a time and one input must always be asserted. This is not a practical assumption since
essential hazards may exist in any input forming logic. There will be overlaps of the
input signals where two inputs will simultaneously be either 1's or 0's. However, the input
forming logic can be designed to eliminate either 1-1 crossovers or 0-0 crossovers but not
both.
Theorem 2 When enable-disable model asynchronous sequential circuits are constructed
using procedure 1, 0-0 cross over on the inputs is tolerated.
Proof Consider an arbitrary $p_s I_p(x)$ circuit. When $I_p = 0$, the output is in a high
impedance state. For a 0-0 cross over, all $I_p = 0$. Therefore all buffer inputs are high
impedance. Since a high impedance input causes no change in state, circuit operation is
unaffected. Therefore, the inputs may freely have 0-0 cross over. □
Theorem 3 When enable-disable model asynchronous sequential circuits are constructed
using procedure 1 and the inputs to the circuit are constrained to not have 1-1 crossovers,
the circuits are free of essential hazards on the inputs.
Proof Due to the nature of pass transistors, when the gate is driven low the output floats
remaining unaffected, provided the effects of leakage currents are overcome. The latching
buffer overcomes the effects of leakage currents and state variables will remain unaffected
for any length of time that the inputs are all low. If the inputs are then constrained to be
free of 1-1 cross over, essential hazards on the inputs have been eliminated for the circuit.
Essential hazards have in the past been eliminated by either adding or limiting delay
at selected points in a circuit [2]. The floating nature of the pass transistor can be viewed
as adding a delay greater than the 0-0 crossover time.
Critical race free state assignments of Tracey [3], Liu [1] and Tan [4] are examples of
STT state assignments. The following is an example of an asynchronous state machine
designed with the Liu state assignment. The flow table is derived from Table 3.13 of [5].
Example 1 Using design procedure 1, design the enable and disable pass networks to
implement the flow table listed in Table 3.
State   I1   I2   I3     y1  y2  y3  y4
A       A    B    A      0   0   0   0
B       C    B    E      1   0   0   1
C       C    F    C      1   0   1   0
D       D    G    C      1   1   0   0
E       E    F    E      0   0   1   1
F       D    F    E      1   1   1   1
G       G    G    A      0   1   0   0
Table 3: Flow table with Liu state assignment.
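The flow table has been given a Liu state assignment covering the following p partitions
under $I_1$.
$$\begin{aligned}
p_{a} &= A;\ BCDEFG \\
p_{bc} &= BC;\ ADEFG \\
p_{df} &= DF;\ ABCEG \\
p_{e} &= E;\ ABCDFG \\
p_{g} &= G;\ ABCDEF
\end{aligned} \qquad (5)$$
Under $I_2$, there are three p partitions.
$$\begin{aligned}
p_{ab} &= AB;\ CDEFG \\
p_{cef} &= CEF;\ ABDG \\
p_{dg} &= DG;\ ABCEF
\end{aligned} \qquad (6)$$
Under $I_3$, there are also three p partitions.
$$\begin{aligned}
p_{ag} &= AG;\ BCDEF \\
p_{bef} &= BEF;\ ACDG \\
p_{cd} &= CD;\ ABEFG
\end{aligned} \qquad (7)$$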
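The p partitions can be read mechanically from the flow table. The following is a minimal sketch, assuming every next-state entry of Table 3 names a state that is stable in the same column (true for this table); the names `FLOW` and `p_partitions` are illustrative and not part of the original paper.

```python
from collections import defaultdict

# Table 3 flow table: present state -> next state under I1, I2, I3.
FLOW = {
    "A": "ABA",
    "B": "CBE",
    "C": "CFC",
    "D": "DGC",
    "E": "EFE",
    "F": "DFE",
    "G": "GGA",
}

def p_partitions(flow, column):
    """Group the present states by the state they lead to under one input column."""
    blocks = defaultdict(list)
    for state in sorted(flow):
        blocks[flow[state][column]].append(state)
    return sorted("".join(block) for block in blocks.values())

for column in range(3):
    print(f"I{column + 1}:", p_partitions(FLOW, column))
```

Each block returned is the first half of one p partition (the states of one transition path); the second half is simply the remaining states.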
The enable-disable model requires paths in the enable circuit to allow $0 \rightarrow 1$ transitions
and paths in the disable circuits to allow $1 \rightarrow 0$ transitions. Table 4 contains a summary
of the transitions from the flow table. For $Y_1$, the enable circuit must sense $I_2 \rightarrow 1$ while
the circuit is in state $S_a$ or $S_e$. The disable circuit for $Y_1$ requires a path to sense $I_2 \rightarrow 1$
when the circuit is in state $S_d$ to bring the machine to state $S_g$ and must sense $I_3 \rightarrow 1$
when the circuit is in state $S_b$ or $S_f$.
The enable circuits can now be formed by covering the $0 \rightarrow 1$ transitions.
$$\begin{aligned}
Y_1 &= p_{ab} I_2(1) + p_{cef} I_2(1) \\
Y_2 &= p_{cef} I_2(1) \\
Y_3 &= p_{bc} I_1(1) + p_{bef} I_3(1) + p_{cd} I_3(1) \\
Y_4 &= p_{ab} I_2(1) + p_{cef} I_2(1)
\end{aligned} \qquad (8)$$
The disable circuits can then be formed by decoding the transition paths causing $1 \rightarrow 0$
transitions in the state variables.

        0 -> 1     Input     1 -> 0     Input
Y1      A -> B     I2        B -> E     I3
        E -> F     I2        D -> G     I2
                             F -> E     I3
Y2      C -> F     I2        D -> C     I3
        E -> F     I2        F -> E     I3
                             G -> A     I3
Y3      B -> C     I1        F -> D     I1
        B -> E     I3
        D -> C     I3
Y4      A -> B     I2        B -> C     I1
        C -> F     I2        F -> D     I1
Table 4: Summary of transitions for the Liu state assignment.

Then the complete design equations are
$$\begin{aligned}
Y_1 &= p_{ab} I_2(1) + p_{cef} I_2(1) + p_{bef} I_3(0) + p_{dg} I_2(0) \\
Y_2 &= p_{cef} I_2(1) + p_{cd} I_3(0) + p_{bef} I_3(0) + p_{ag} I_3(0) \\
Y_3 &= p_{bc} I_1(1) + p_{bef} I_3(1) + p_{cd} I_3(1) + p_{df} I_1(0) \\
Y_4 &= p_{ab} I_2(1) + p_{cef} I_2(1) + p_{bc} I_1(0) + p_{df} I_1(0)
\end{aligned} \qquad (9)$$
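Step 3 of Procedure 1 (listing the transitions that change each state variable) can be checked against Table 4 with a short script. This is only a sketch: the dictionaries `FLOW` and `CODE` transcribe Table 3, and the function name `changing_transitions` is an illustrative choice.

```python
# Table 3: next states under I1, I2, I3 and the Liu state assignment y1 y2 y3 y4.
FLOW = {"A": "ABA", "B": "CBE", "C": "CFC", "D": "DGC",
        "E": "EFE", "F": "DFE", "G": "GGA"}
CODE = {"A": "0000", "B": "1001", "C": "1010", "D": "1100",
        "E": "0011", "F": "1111", "G": "0100"}

def changing_transitions(flow, code):
    """Map (variable index, '0->1' or '1->0') to a list of (source, destination, input)."""
    table = {}
    for src in sorted(flow):
        for column, dst in enumerate(flow[src], start=1):
            for bit, (old, new) in enumerate(zip(code[src], code[dst]), start=1):
                if old != new:
                    table.setdefault((bit, f"{old}->{new}"), []).append((src, dst, column))
    return table

table4 = changing_transitions(FLOW, CODE)
for key in sorted(table4):
    print(key, table4[key])
```

Every entry under `'0->1'` contributes an enable term and every entry under `'1->0'` a disable term, qualified by the p partition that contains the source state under that input.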
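For implementation it is then necessary to construct the decode circuits for each of the
state transition paths. The decode equations for these state transition paths are
$$\begin{aligned}
p_{ab} &= \bar{y}_2 \bar{y}_3 & p_{cd} &= y_1 \bar{y}_4 \\
p_{cef} &= y_3 & p_{ag} &= \bar{y}_1 \bar{y}_4 \\
p_{bef} &= y_4 & p_{bc} &= y_1 \bar{y}_2 \\
p_{dg} &= y_2 \bar{y}_3 & p_{df} &= y_1 y_2
\end{aligned} \qquad (10)$$
Substituting the decode equations, the next state design equations become
$$\begin{aligned}
Y_1 &= \bar{y}_2 \bar{y}_3 I_2(1) + y_3 I_2(1) + y_4 I_3(0) + y_2 \bar{y}_3 I_2(0) \\
Y_2 &= y_3 I_2(1) + y_1 \bar{y}_4 I_3(0) + y_4 I_3(0) + \bar{y}_1 \bar{y}_4 I_3(0) \\
Y_3 &= y_1 \bar{y}_2 I_1(1) + y_4 I_3(1) + y_1 \bar{y}_4 I_3(1) + y_1 y_2 I_1(0) \\
Y_4 &= \bar{y}_2 \bar{y}_3 I_2(1) + y_3 I_2(1) + y_1 \bar{y}_2 I_1(0) + y_1 y_2 I_1(0)
\end{aligned} \qquad (11)$$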
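The decode covers can be sanity-checked against the state assignment: each cover should be true for exactly the states of its transition path. The sketch below assumes the complemented literals inferred from the Table 3 codes (overbars are not legible in every copy of the paper); `COVER` and `covered_states` are illustrative names.

```python
# Liu state assignment from Table 3 (y1 y2 y3 y4).
CODE = {"A": "0000", "B": "1001", "C": "1010", "D": "1100",
        "E": "0011", "F": "1111", "G": "0100"}

# Decode covers: partition name -> {state variable index: required value}.
COVER = {
    "ab":  {2: 0, 3: 0},
    "cef": {3: 1},
    "bef": {4: 1},
    "dg":  {2: 1, 3: 0},
    "cd":  {1: 1, 4: 0},
    "ag":  {1: 0, 4: 0},
    "bc":  {1: 1, 2: 0},
    "df":  {1: 1, 2: 1},
}

def covered_states(cover, code):
    """Return the states (as a sorted string) whose code satisfies the cover."""
    return "".join(sorted(
        state for state, bits in code.items()
        if all(bits[var - 1] == str(value) for var, value in cover.items())
    ))

for name, cover in COVER.items():
    print(f"p_{name} covers {covered_states(cover, CODE)}")
```

For this assignment each cover selects exactly the states named in its subscript, so no cover can fire in a state that belongs to a different transition path.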
Figure 3 shows the diagram of the circuit implemented with NMOS pass transistor
networks.
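The behaviour of the design equations for this example can be explored with a small behavioural model. The sketch below is an assumption-laden simplification, not the authors' circuit: exactly one input is high at a time, all state variables update together, and a buffer whose pass network floats simply holds its value. The complemented literals are those inferred from the Table 3 state codes.

```python
FLOW = {"A": "ABA", "B": "CBE", "C": "CFC", "D": "DGC",
        "E": "EFE", "F": "DFE", "G": "GGA"}
CODE = {"A": (0, 0, 0, 0), "B": (1, 0, 0, 1), "C": (1, 0, 1, 0), "D": (1, 1, 0, 0),
        "E": (0, 0, 1, 1), "F": (1, 1, 1, 1), "G": (0, 1, 0, 0)}

# Next-state equations: variable -> list of (cover, input index, value passed).
# A cover maps a state-variable index to the value it must have.
EQUATIONS = {
    1: [({2: 0, 3: 0}, 2, 1), ({3: 1}, 2, 1), ({4: 1}, 3, 0), ({2: 1, 3: 0}, 2, 0)],
    2: [({3: 1}, 2, 1), ({1: 1, 4: 0}, 3, 0), ({4: 1}, 3, 0), ({1: 0, 4: 0}, 3, 0)],
    3: [({1: 1, 2: 0}, 1, 1), ({4: 1}, 3, 1), ({1: 1, 4: 0}, 3, 1), ({1: 1, 2: 1}, 1, 0)],
    4: [({2: 0, 3: 0}, 2, 1), ({3: 1}, 2, 1), ({1: 1, 2: 0}, 1, 0), ({1: 1, 2: 1}, 1, 0)],
}

def next_value(var, state, active_input):
    """Value held by the buffer of `var`: a driven 0 or 1, or the old value when floating."""
    driven = {value for cover, inp, value in EQUATIONS[var]
              if inp == active_input
              and all(state[v - 1] == bit for v, bit in cover.items())}
    assert len(driven) <= 1, "enable and disable paths must never conflict"
    return driven.pop() if driven else state[var - 1]

def settle(state, active_input):
    """Apply the equations repeatedly until the state stops changing."""
    while True:
        new = tuple(next_value(var, state, active_input) for var in (1, 2, 3, 4))
        if new == state:
            return state
        state = new

def matches_flow_table():
    """Check every (state, input) pair against the flow table."""
    return all(settle(CODE[src], column) == CODE[dst]
               for src, row in FLOW.items()
               for column, dst in enumerate(row, start=1))

print(matches_flow_table())
```

Because the model updates all variables at once, it does not exercise every possible race ordering; it only confirms that the equations drive each starting state to the stable state the flow table names.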




Figure 3: Circuit diagram for Liu state assignment.
4 Self Arbitrated Circuits
The state variable design equation for the enable-disable model is a summation of the
enable terms plus a summation of the disable terms.
$$Y_i = \sum p_{\mathrm{disable}}\, I_j(0) + \sum p_{\mathrm{enable}}\, I_k(1) \qquad (12)$$
where $p_s$ is the covering for a transition path which leads to the change of a state variable.
The whole transition path is covered for next state variable $Y_i$ by the $p_s$ term. In
reality, it is not necessary to cover the entire transition path. It is only necessary to cover
the portion of the transition path that contains the unstable state. If $S_u$ is the unstable
state of a transition path and $\mu_s$ is the covering of only the unstable state in the transition
path where state variable $y_i$ must experience either a $0 \rightarrow 1$ or $1 \rightarrow 0$ transition then
$$Y_i = \sum \mu_{\mathrm{disable}}\, I_j(0) + \sum \mu_{\mathrm{enable}}\, I_k(1) \qquad (13)$$
Theorem 4 Equation 13 is sufficient to create the proper asynchronous sequential circuit
action.
Proof Theorem 1 has shown that a high impedance input to the buffer causes the next
state to remain equal to the present state ($Y_i = y_i$) and no change occurs to state variable $y_i$.
State table transitions where $y_i$ does not change need not be covered in the design equation
for $Y_i$ as long as a high impedance is presented to the buffer input. Only transitions in
$y_i$ need be effected by the enable and disable circuit. Therefore, only state transitions
between unstable and stable states need be accounted for in the design equation.
If state transition $S_u \rightarrow S_s$ requires $Y_i$ to transition from $0 \rightarrow 1$ or $1 \rightarrow 0$, then $\mu_s I_p(1)$
or $\mu_s I_p(0)$ respectively must appear in the equation for $Y_i$. Let the circuit begin in a stable
state under $I_k$. When $I_k \rightarrow 0$ and $I_p \rightarrow 1$ unstable state $S_u$ is entered and $Y_i$ must
experience a $0 \rightarrow 1$ transition as the circuit goes from $S_u \rightarrow S_s$. While $I_k = 1$, the term
$S_u I_p(1)$ is disabled and passes a high impedance term to the buffer. Moreover, since only
total circuit states that are unstable appear in the design equations, the buffer for $Y_i$ has a
high impedance input as long as $I_k = 1$. When the input switches to $I_p = 1$, $S_u I_p(1)$ passes
a 1 to the input of $Y_i$ forcing it to a 1 state. As soon as the circuit leaves state $S_u$, term
$S_u I_p(1)$ outputs a high impedance state, but $Y_i$ remains 1 to effect the transition for $y_i = 1$.
Since the buffer for $Y_i$ has a high impedance input for all the other states of the transition path,
$y_i$ will assume the proper next state value and a critical race cannot occur. □
The above theorem allows a state assignment to be given that has critical races. The
theorem shows that the next state variables are excited only when the circuit enters a total
circuit state that is unstable. All other times, present and next state variables are the same
and no state transitions occur. In a sense, the circuit is operating as a synchronous, or a
self synchronized circuit [7,8]. In this case state variable transitions are arbitrated by the
total unstable state.
If the circuit is constrained to allow only one state variable to change at a time as a
circuit transitions between states, there will be no races and hence no critical races [5].
Using a series of unit distance state transitions is an acceptable mechanism to achieve
state transitions. By using unit distance transitions between states, only one state variable
is allowed to change at a time. This type of operation is called non-normal mode [9]. This
is not an STT state assignment and therefore the penalty is a slower operating circuit.
In the following assignment, each state has a distance 2 from all other states, and
hence two successive state variable changes are required for each state transition. Thus
the circuit operates at half the speed of an STT assignment.
The state assignment problem for non-normal mode operation is similar to that of STT
operation and the critical concept is that transition paths of different k-sets (or transition
pairs) must not contain states in common [9]. The transition paths must be partitioned to
be disjoint. Transition paths for non-normal operation are more difficult to characterize
than the STT assignment. For example, suppose state $S_a$ transitions to $S_b$, and $S_a$ is coded
000 and $S_b$ 111. One transition path is $000 \rightarrow 001 \rightarrow 011 \rightarrow 111$. Another transition path
is $000 \rightarrow 100 \rightarrow 110 \rightarrow 111$. There are a total of 6 unique paths in this case, all equally
valid.
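The count of six paths follows from the three differing bits being flipped in any of $3! = 6$ orders. A minimal sketch that enumerates them (the function name `unit_distance_paths` is illustrative):

```python
from itertools import permutations

def unit_distance_paths(start, goal):
    """All shortest paths from start to goal that change one state variable at a time."""
    differing = [i for i, (a, b) in enumerate(zip(start, goal)) if a != b]
    paths = []
    for order in permutations(differing):
        state = list(start)
        path = [start]
        for position in order:
            state[position] = goal[position]
            path.append("".join(state))
        paths.append(path)
    return paths

paths = unit_distance_paths("000", "111")
for path in paths:
    print(" -> ".join(path))
```

The two paths quoted in the text appear in the list; in general two codes at distance $d$ are joined by $d!$ such paths, which is why non-normal mode transition paths are harder to characterize.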
The state assignment used here will be a 1-hot-code where state $S_i$ is coded by state
variable $y_i = 1$ and all other state variables are 0.
Definition 3 Let $[y_i\ y_j\ y_k \ldots y_n]$ represent a state where each element of the set is a state
variable equal to 1 and all other state variables are 0.
For example, $[y_2\ y_3]$ denotes the state where $y_2$ and $y_3$ are 1 and all other state variables
are 0. If state $S_i$ transitions to $S_j$, then the transition path consists of the states $[y_i]$,
$[y_i\ y_j]$ and $[y_j]$ such that $[y_i] \rightarrow [y_i\ y_j] \rightarrow [y_j]$. There are three states in the transition path.
All states with more than two 1's are not members of any transition path.
Theorem 5 All transition paths are disjoint for the case where no state is both a successor
and predecessor state.
Proof In a transition $S_i \rightarrow S_j$, $S_i$ is the predecessor state of $S_j$ and $S_j$ is the successor
state of $S_i$. The transition path for transition $S_i \rightarrow S_j$ is $[y_i]$, $[y_i\ y_j]$ and $[y_j]$. The state
assignment and associated transition paths produce a valid design if the transition path
for $(S_i, S_j)$ is disjoint from all other transition paths. Clearly, the only state that is of
concern is $[y_i\ y_j]$ and it must be shown that it is a member of only one transition path.
State variable $y_j$ is excited only when a state transitions to $S_j$, that is $S_k \rightarrow S_j$, or
when $S_j$ transitions to some other state, $S_j \rightarrow S_k$. In both cases, state $[y_j\ y_k]$ is entered. If
$S_k \neq S_i$, then $[y_i\ y_j]$ is not the same state as $[y_j\ y_k]$ and the paths are disjoint. Moreover,
since no state is both a successor and predecessor state to some other state, there cannot
be a transition $S_j \rightarrow S_i$. Therefore, $[y_i\ y_j]$ is entered only for the transition $S_i \rightarrow S_j$ and
no other time and cannot be an element of some other transition path. □
Theorem 6 Under the conditions of Theorem 5, where no state is both a successor and
predecessor state, proper operation occurs with the design equation for $Y_i$ as $y_k I_p(1) +
y_i y_j(0)$, where $S_k$ is the predecessor state to $S_i$, and $S_j$ is a successor state of $S_i$.
State   I1   I2     y1  y2
A       A    B      1   0
B       -    B      0   1
Table 5: Design procedure flow table.
Proof Consider states $S_i$ and $S_k$ where $S_k \rightarrow S_i$ transitions under input $I_p$. The design
equations for $Y_i$ and $Y_k$ are
$$\begin{aligned}
Y_i &= y_k I_p(1) + y_i y_j(0) \\
Y_k &= y_k y_i(0) + \ldots
\end{aligned} \qquad (14)$$
When the circuit is in $S_k$, $y_k = 1$ and the circuit is in state $[y_k]$. When $I_p$ is true, then
$y_i \rightarrow 1$ and state $[y_i\ y_k]$ is assumed. When $[y_i\ y_k]$ is true then $y_k y_i(0)$ forces $Y_k \rightarrow 0$ and
$[y_i]$ is entered. By Theorem 5, $[y_i\ y_k]$ belongs only to the transition path of $(S_k, S_i)$ and
therefore the transition $S_k \rightarrow S_i$ is properly effected. Transitioning out of $S_i$ occurs in
the same manner as transitioning from $S_k$. When another input is true, $y_i = 1$ causes the
circuit to assume $[y_i\ y_j]$ which forces $y_i \rightarrow 0$. □
5 Self Arbitrated Design Procedures
Procedure 2 The following procedure can be used for the design of pass networks for each
state variable of an asynchronous sequential circuit with a handshake operation using the
enable-disable model.
Design Procedure.
1. Encode the flow table with a one-hot-code state assignment.
2. For each stable state $S_i$ under $I_j$, with associated state variable $y_i = 1$,
(a) For state transition $S_k \rightarrow S_i$, introduce an enable term $y_k I_j(1)$.
(b) For state transition out of the state, $S_i \rightarrow S_j$, introduce a disable term
$y_i y_j(0)$.
For a transition from $S_a \rightarrow S_b$ under $I_2$ shown in the partial flow table of Table 5,
design procedure 2 introduces an enable term for $S_b$ such that $Y_2 = y_1 I_2(1)$. This term
will cause the transition $[y_1] \rightarrow [y_1\ y_2]$. The design procedure also introduces the disable
term $Y_1 = y_1 y_2(0)$ into state $S_a$ thus forming a handshake.
The following relaxes the predecessor and successor requirements of Theorem 6.
Theorem 7 A valid design equation for $Y_i$ is $Y_i = y_j I_p(1) + y_i y_j I_m(0)$, where $S_j$ is both a
predecessor and successor state to $S_i$, such that $S_j \rightarrow S_i$ under $I_p$ and $S_i \rightarrow S_j$ under $I_m$.
State   I1   I2     y1  y2
A       A    B      1   0
B       A    B      0   1
Table 6: Scale of two loop flow table.
Proof Assume that the transition path for transition $S_i \rightarrow S_j$ under $I_k$ is $[y_i]$, $[y_i\ y_j]$ and
$[y_j]$. Assume that the transition path for transition $S_j \rightarrow S_i$ under $I_m$ is $[y_j]$, $[y_j\ y_i]$ and
$[y_i]$. Clearly, $[y_i\ y_j] = [y_j\ y_i]$ and a problem exists.
If the input state is added to the pass implicant $y_i y_j(0)$ to form $y_i y_j I_k(0)$ then the
transition path for $S_i \rightarrow S_j$ under $I_k$ is $[y_i]$, $[y_i\ y_j]$ and $[y_j]$ and the transition path for
$S_j \rightarrow S_i$ under $I_m$ is $[y_j]$, $[y_j\ y_i]$ and $[y_i]$. As long as $I_k$ and $I_m$ are guaranteed to never be
high at the same time, then the transition paths are disjoint. □
Scale-of-two-loops are common in flow tables [10]. If the flow table were altered to
introduce a scale of two loop as shown in Table 6, the design procedure 2 would not
produce valid circuits since $S_a$ is both a predecessor and successor state of $S_b$ and the
transition paths are not disjoint. From Theorem 7, it can be seen that the input variable
can be introduced in the design equation to partition the transition paths and hence,
scale-of-two-loops do not present a problem.
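The disjointness condition can be tested directly on a flow table. The sketch below is an illustration under simple assumptions: a one-hot assignment, a transition path identified by its intermediate state $[y_i\ y_j]$, and an optional tag with the input column to model the extra input literal in the disable term. `TABLE_5`, `TABLE_6` and the function names are illustrative.

```python
# Flow tables as present state -> next state under I1, I2 (None = unspecified).
TABLE_5 = {"A": ("A", "B"), "B": (None, "B")}
TABLE_6 = {"A": ("A", "B"), "B": ("A", "B")}

def intermediate_states(flow, qualify_with_input=False):
    """Map each intermediate state [yi yj] to the set of transitions that pass through it."""
    users = {}
    for src, row in flow.items():
        for column, dst in enumerate(row, start=1):
            if dst is None or dst == src:
                continue  # unspecified or stable entry: no transition path
            key = frozenset((src, dst))
            if qualify_with_input:
                key = (key, column)  # the input literal separates the two directions
            users.setdefault(key, set()).add((src, dst))
    return users

def paths_disjoint(flow, qualify_with_input=False):
    """True when every intermediate state belongs to exactly one transition."""
    return all(len(t) == 1 for t in intermediate_states(flow, qualify_with_input).values())

print(paths_disjoint(TABLE_5))
print(paths_disjoint(TABLE_6))
print(paths_disjoint(TABLE_6, qualify_with_input=True))
```

Without the input literal the scale-of-two-loop table shares $[y_1\ y_2]$ between both directions; tagging the state with the active input separates them, provided the two inputs are never high together.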
Procedure 3 The following procedure can be used to design pass networks for each state
variable of an asynchronous sequential circuit using a handshake operation with the enable-
disable model.
Design procedure.
1. Encode the flow table with a one-hot-code state assignment.
2. For each stable state $S_i$ under $I_j$ with associated state variable $y_i = 1$,
(a) For state transition $S_k \rightarrow S_i$, introduce an enable term $y_k I_j(1)$.
(b) For state transition out of the state, $S_i \rightarrow S_k$, under $I_m$ introduce a disable
term $y_i y_k I_m(0)$.
For a transition from $S_a \rightarrow S_b$ under $I_2$ shown in the flow table of Table 6, design
procedure 3 introduces an enable term for $S_b$ such that $Y_2 = y_1 I_2(1)$. This term will
cause the transition $[y_1] \rightarrow [y_1\ y_2]$. The design procedure also introduces the disable term
$Y_1 = y_1 y_2 I_2(0)$ into state $S_a$ forming a handshake and causing the transition $[y_1\ y_2] \rightarrow [y_2]$.
The transition from $S_b \rightarrow S_a$ under $I_1$ requires an enable term for $S_a$ such that $Y_1 = y_2 I_1(1)$.
This term will cause the transition $[y_2] \rightarrow [y_2\ y_1]$. The design procedure will also introduce
the disable term $Y_2 = y_1 y_2 I_1(0)$ into state $S_b$ thus forming a unique handshake and causing
the transition $[y_2\ y_1] \rightarrow [y_1]$.
Example 2 Design the enable and disable pass networks to implement the flow table listed
in Table 7.
State   I1   I2   I3     y1  y2  y3  y4  y5  y6
A       A    F    D      1   0   0   0   0   0
B       A    B    D      0   1   0   0   0   0
C       C    F    C      0   0   1   0   0   0
D       C    B    D      0   0   0   1   0   0
E       E    B    C      0   0   0   0   1   0
F       E    F    C      0   0   0   0   0   1
Table 7: Flow table with handshake state assignment.
The first step is to assign a one-hot-code to the flow table as shown in Table 7.
Derivation of the design equations can again be understood by studying the flow table.
A state variable makes a 0 --^ 1 transition when entering the state which requires that
variable to be asserted. This is accomplished by a term in the design equation qualified by
the state from which the circuit is transitioning and the input under which the new state
is stable passing a 1. For example, when the machine is in state $S_a$ or $S_c$, if $I_2$ is asserted
high, the machine will move towards state $S_f$. The enable terms of the design equation
for state variable $Y_6$ can be written as $Y_6 = y_1 I_2(1) + y_3 I_2(1)$. To guarantee that the
machine traverses states $S_a \rightarrow S_f$ such that $[y_1] \rightarrow [y_1\ y_6] \rightarrow [y_6]$ and $S_a \rightarrow S_d$ such that
$[y_1] \rightarrow [y_1\ y_4] \rightarrow [y_4]$ when leaving stable state $S_a$, the disable terms $y_1 y_6 I_2(0)$ and $y_1 y_4 I_3(0)$
are introduced into the design equation for $Y_1$, thus forming a handshake.
First the enable terms are read from the flow table.
$$\begin{aligned}
Y_1 &= y_2 I_1(1) \\
Y_2 &= y_4 I_2(1) + y_5 I_2(1) \\
Y_3 &= y_4 I_1(1) + y_5 I_3(1) + y_6 I_3(1) \\
Y_4 &= y_1 I_3(1) + y_2 I_3(1) \\
Y_5 &= y_6 I_1(1) \\
Y_6 &= y_1 I_2(1) + y_3 I_2(1)
\end{aligned} \qquad (15)$$
The entire design equations with the disable terms are
$$\begin{aligned}
Y_1 &= y_2 I_1(1) + y_1 y_6 I_2(0) + y_1 y_4 I_3(0) \\
Y_2 &= y_4 I_2(1) + y_5 I_2(1) + y_2 y_1 I_1(0) + y_2 y_4 I_3(0) \\
Y_3 &= y_4 I_1(1) + y_5 I_3(1) + y_6 I_3(1) + y_3 y_6 I_2(0) \\
Y_4 &= y_1 I_3(1) + y_2 I_3(1) + y_4 y_3 I_1(0) + y_4 y_2 I_2(0) \\
Y_5 &= y_6 I_1(1) + y_5 y_2 I_2(0) + y_5 y_3 I_3(0) \\
Y_6 &= y_1 I_2(1) + y_3 I_2(1) + y_6 y_5 I_1(0) + y_6 y_3 I_3(0)
\end{aligned} \qquad (16)$$
The logical implementation of the circuit is shown in Figure 4.
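Design procedure 3 is mechanical enough to script. The following sketch applies it to the Table 7 flow table, assuming each unstable entry names a state that is stable under the same input (true for this table). The term strings and the names `FLOW` and `procedure_3` are illustrative, not taken from the paper.

```python
# Table 7 flow table: present state -> next state under I1, I2, I3.
# One-hot assignment: A..F correspond to y1..y6.
FLOW = {"A": "AFD", "B": "ABD", "C": "CFC", "D": "CBD", "E": "EBC", "F": "EFC"}

def procedure_3(flow):
    """Return {state-variable index: (enable terms, disable terms)}."""
    index = {state: n for n, state in enumerate(sorted(flow), start=1)}
    equations = {i: ([], []) for i in index.values()}
    for src, row in flow.items():
        for column, dst in enumerate(row, start=1):
            if dst == src:
                continue  # stable entry, nothing to add
            # Enable term in the destination's equation: predecessor AND input passes 1.
            equations[index[dst]][0].append(f"y{index[src]} I{column}(1)")
            # Disable term in the source's equation: handshake state AND input passes 0.
            equations[index[src]][1].append(f"y{index[src]} y{index[dst]} I{column}(0)")
    return equations

equations = procedure_3(FLOW)
for i, (enable, disable) in equations.items():
    print(f"Y{i} = " + " + ".join(enable + disable))
```

Dropping the input literal from the disable terms gives the procedure 2 form, which is valid only when no state is both predecessor and successor of another; in Table 7 the pairs (C, F) and (B, D) violate that condition, so the input-qualified form is the appropriate one here.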
6 Hardware Bound
A hardware bound count can be established for the design of enable-disable sequential
circuits using a handshake state assignment operation. The hardware count, $T_t$, is the
number of transistors required to build the enable circuits, $T_e$, plus the disable circuits,
$T_d$, and the buffer circuits, $T_b$.
$$T_t = T_e + T_d + T_b \qquad (17)$$
Using design procedure 2, there are two transistors in each enable term and two transistors
in each disable term. There is an enable term and a disable term for every unstable
state in the flow table. For a flow table with $n$ states and $i$ inputs, there are at most $ni$
next state entries. If $t$ transistors are required to build the buffer circuit then the hardware
bound is such that
$$T_t \le 2ni + 2ni + nt \le n(4i + t) \qquad (18)$$
Using design procedure 3, there are two transistors in each enable term and three transistors
in each disable term. Again there is an enable term and a disable term for every unstable
state in the flow table. The hardware bound is
$$T_t \le 2ni + 3ni + nt \le n(5i + t) \qquad (19)$$
An exact transistor count could be determined for a specific flow table. If $u$ of the $ni$
next state entries in an $n$ row flow table with $i$ inputs are stable, so that $ni - u$ entries
are unstable, then for design procedure 2
$$T_t = 2(ni - u) + 2(ni - u) + nt = 4(ni - u) + nt \qquad (20)$$
For design procedure 3
$$T_t = 2(ni - u) + 3(ni - u) + nt = 5(ni - u) + nt \qquad (21)$$
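As a worked illustration, the counts can be evaluated for the six-state, three-input flow table of Example 2 by counting unstable entries directly (one enable term and one disable term per unstable entry). The buffer size `T_BUFFER = 6` is an assumption made only for this sketch; the paper leaves $t$ symbolic.

```python
# Table 7 flow table: present state -> next state under I1, I2, I3.
FLOW = {"A": "AFD", "B": "ABD", "C": "CFC", "D": "CBD", "E": "EBC", "F": "EFC"}

T_BUFFER = 6  # assumed transistors per buffer; not a figure from the paper

def unstable_entries(flow):
    """Count next-state entries that differ from the present state."""
    return sum(dst != src for src, row in flow.items() for dst in row)

def hardware_bound(n, i, t, disable_size):
    """Upper bound n((2 + d)i + t), treating all n*i entries as unstable."""
    return n * ((2 + disable_size) * i + t)

def exact_count(flow, t, disable_size):
    """Two transistors per enable term, d per disable term, t per buffer."""
    return (2 + disable_size) * unstable_entries(flow) + len(flow) * t

n, i = len(FLOW), 3
print("unstable entries:", unstable_entries(FLOW))
print("procedure 2:", exact_count(FLOW, T_BUFFER, 2), "bound", hardware_bound(n, i, T_BUFFER, 2))
print("procedure 3:", exact_count(FLOW, T_BUFFER, 3), "bound", hardware_bound(n, i, T_BUFFER, 3))
```

With 11 unstable entries out of 18, the exact counts sit well below the bounds, since the bounds assume every entry is unstable.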
7 Summary
The enable-disable model allows efficient implementation of VLSI asynchronous sequential
circuits. The circuit for each state variable is composed of three sections. First, the enable
network for $y_i$, which arms the circuit to look for input changes that cause $0 \rightarrow 1$ transitions
in $y_i$. Second, the disable network, which arms the circuit to look for input transitions
that cause $y_i$ to change from $1 \rightarrow 0$. Third, the buffer circuit, which isolates the enable-disable
network from the state variable load capacitance, restores the high level and provides a
memory function to hold the state information when the pass network is tristated.
STT state assignments are used to provide critical race free operation. Only transitions
which require state variable changes need to be covered by paths in the pass network. This
eliminates the need to cover all p partitions for each state variable. The above procedures
could be extended to the output states also. In this case the output states would be
available from a buffer like the next state variables.
Two design procedures using a handshake operation with a 1-hot-code state assignment
were presented and an example was given. The hardware bounds for both procedures were
established. The handshake code is free of critical races and hazards. The handshaking
arbitrates differences in delays and forces the circuit through a unique sequence of states.
No electronic circuit design constraints exist for the circuit. Since the circuit is based on
a nonnormal operation, the handshake code will cause the circuits to function at half the
speed of the STT state assignment based circuits.
The cross over constraint on the inputs was addressed by Theorem 2, which introduces
a method to allow 0-0 cross overs. The 1-1 cross over can be solved in two ways. First,
the flow table could be expanded to introduce columns which incorporate 1-1 cross overs.
Second, 1-1 cross over could be eliminated in the input circuitry by using cross coupled
NOR gates.
References
[1] C. Liu, "A State Variable Assignment Method for Asynchronous Sequential Switching
Circuits", JACM, Vol. 10, Apr. 1963, pp. 209-216
[2] C. Roth, Fundamentals of Logic Design, 3rd Ed., St. Paul, Minn., West Publishing,
1985, Unit 23-27
[3] J. Tracey, "Internal State Assignments for Asynchronous Sequential Machines", IEEE
Transactions on Electronic Computers, Vol. EC-15, Aug. 1966, pp. 551-560
[4] C. Tan, "State Assignments for Asynchronous Sequential Machines", IEEE Transac-
tions on Computers, Vol. C-20, No. 4, April 1971, pp. 382-391
[5] S. Unger, Asynchronous Sequential Switching Circuits, New York, NY, Wiley-
Interscience, 1969
[6] G. Maki and D. Sawin, "Fault Tolerant Asynchronous Sequential Machines", IEEE
Transactions on Computers, Vol. C-23, pp. 651-657, July, 1974
[7] H. Chuang and S. Das, "Synthesis of Multiple Input Change Asynchronous Machines
using controlled Excitation and Flip-flops", IEEE Transactions on Computers, vol 22,
December 1973, pp. 1103-1109
[8] H. Chuang, "Fail Safe Asynchronous Machines with Multiple Input Changes", IEEE
Transactions on Computers, June 1976, pp. 637-642
[9] G. Maki and J. Tracey, "A State Assignment Procedure for Asynchronous Sequential
Circuits", IEEE Transactions on Computers, June 1971, vol C-20, pp. 666-668
[10] L. Hollaar, "Direct Implementation of Asynchronous Control Units", IEEE Transac-
tions on Computers, vol C-31, Dec. 1982, pp. 1133-1141
This work was supported in part by NASA under Contract NAGW-1406 and by the
Idaho State Board of Education under Research Grant # 87-009.
Figure 4: Circuit diagram for handshake state assignment.
NASA SERC 1990 Symposium on VLSI Design	 N94- 71085	 105
Using Advanced Microelectronic Test
Chips to Qualify ASIC'S for Space
M. G. Buehler, B. R. Blaes, and Y S. Lin
Jet Propulsion Laboratory, MS 300-329
California Institute of Technology
Pasadena, California 91109
Qualification procedures for complex integrated circuits are being developed under a
U. S. government program known as QML, Qualified Manufacturing Lines. This effort is
focused on circuits designed by IC manufacturers and has not addressed application specific
ICs (ASICs) designed at system houses. The qualification procedures described here are
intended to be responsive to the needs of system houses who design their own ASICs and
have them fabricated at Silicon foundries.
A particular focus of this presentation will be the use of the TID (Total Ionizing Dose)
Chip to evaluate CMOS foundry processes and to provide parameters for circuit simulators.
This chip is under development as a standard chip for qualifying the total dose aspects of
ASICs. The benefits of standardization are that the results will be well understood and
easy to interpret.
Data is presented and compared for 1.6-µm and 3.0-µm CMOS. The data shows that
1.6-µm CMOS is significantly harder than 3.0-µm CMOS. Two failure modes are explored:
(a) the radiation-induced degradation of timing delays and (b) radiation-induced leakage
currents.
In order to focus this effort, five critical questions have been formulated:
• WHAT ARE THE QUALIFICATION PROCEDURES FOR ASICs?
• HOW GOOD ARE RADIATION-HARDENED CIRCUIT SIMULATORS?
• DO LABORATORY IRRADIATIONS ACCURATELY SIMULATE SPACE?
• WHAT ARE THE RADIATION-HARDENED CIRCUIT DESIGN RULES?
• CAN NON RAD-HARD CMOS ASICs BE USED IN SPACE?
Initial answers to these questions are given at the end of this presentation.
An ASIC qualification scheme used for the fabrication of an ASIC Direct Memory
Access Controller (DMAC) is illustrated in Figure 1. A set of test chips were placed next
to the ASICs in order to verify the quality of the fabrication process. These chips analyze
total dose hardness, single-event upset sensitivity, metal interconnect wire and contact
reliability, and manufacturing yield.
The physics of total dose and single-event upset (SEU) radiation is illustrated in Figure
2 for the case of a static memory cell. The total dose effects are seen to introduce positive





Figure 1: Microelectronic Test Chips for Space Qualified ASIC's











[Figure annotations: Uniform radiation causes permanent damage which shifts threshold voltages and increases leakage currents. Cosmic rays induce transients that cause memory cell upsets. Legend: electron, hole, oxide hole trap, interface state defect, carrier removal, donor ion, acceptor ion; the sketch also labels the funnel and diffusion regions.]
Figure 2: Total Dose and Cosmic Ray Effects in a Static Memory Cell
Figure 3: Total Ionizing Dose (TID) Chip
hole-electron plasma in the silicon which temporarily shorts out the struck junction and
can cause the cell to change state.
The TID chip, shown in Figure 3, is packaged in a 28-pin DIP. The chip contains a
MOSFET matrix [1] with up to 64 MOSFETs designed with a variety of geometries. The
matrix contains gate-oxide, field-oxide, and closed geometry MOSFETs. The chip
also contains a Timing Sampler circuit [2] with 16 timing chains for determining inverter-
pair delays at different capacitance loads.
p-MOSFET Matrix test results [2] are shown in Figure 4 for VTO, KP, ΔW, and ΔL as
a function of Cobalt-60 dose and anneal time. These results show a significant gate bias
dependence for VTO (-1.37 V/Mrad(Si)) and KP (-2.5 µA/V²·Mrad(Si)). This is attributed
to the build-up of positive charge in the oxide when gates are biased in the ON state. For
gates biased in the OFF state, positive oxide charge and positive donor interface states
cause even stronger shifts in VTO (-3.45 V/Mrad(Si)) and KP (-25.5 µA/V²·Mrad(Si)).
n-MOSFET Matrix test results [2] are shown in Figure 5 for VTO, KP, ΔW, and ΔL as
a function of Cobalt-60 dose and anneal time. The shifts are due to the build up of positive
oxide charge and negative acceptor interface states. The opposite signs of these charges
cancel their effect on VTO. The negative acceptor interface states cause a significant shift
in KP (-105 µA/V²·Mrad(Si)).

[Figure 4: p-MOSFET Matrix results. Four panels plot VTO, KP, ΔW, and ΔL against dose (krad(Si)) and anneal time (hr) for gates biased ON and OFF.]
• RADIATION INDUCED SHIFTS IN VTO ARE FIVE TIMES THOSE FOR THE n-MOSFETs
• VTO AND KP DO NOT CHANGE SIGNIFICANTLY DURING ANNEAL
• ΔW AND ΔL ARE NOT AFFECTED DURING IRRADIATION OR ANNEAL
• DOSE RATE = 1 rad (Si)/SEC

[Four panels plot VTO, KP, ΔW, and ΔL against dose (krad(Si)) and anneal time (hr) for gates biased ON and OFF.]
• DOSE RATE: 10 rad (Si)/SEC
• RADIATION INDUCED SHIFTS IN VTO SHOW A TYPICAL BIAS DEPENDENCE
• KP IS DEGRADED DUE TO INTERFACE TRAPS
• THE MOSFET CHANNEL WIDENS DUE TO POSITIVE CHARGE BUILDUP IN THE BIRDS BEAK REGION
Figure 5: n-MOSFET Results (1.6-µm n-Well CMOS)

[Inverter-pair delay plotted along a dose-anneal axis (dose 0 to 100 krad(Si)) for load capacitances on node b: poly, metal 1, n+ diffusion, p+ diffusion, metal 2, and gate.]
• RADIATION SENSITIVITY OF INVERTER-PAIR DELAY IS ABOUT 2.2 ns/Mrad (Si)
FOR RISING-STEPS AND LESS THAN 300 ps/Mrad (Si) FOR FALLING-STEPS
Figure 6: Timing Sampler Delay (1.6-µm n-Well CMOS)
Timing Sampler test results [2] are shown in Figure 6 for a variety of capacitive
loads: Polysilicon on field oxide (1,2), Metal 1 (3, 4), Metal 2 (9, 10), n+Diffusion (5,
6), p+Diffusion (7, 8), Polysilicon on gate oxide (11, 12, 13, 14). The results are shown for
Cobalt-60 dose up to 100 krad(Si) and an annealing time up to 30 hours. The rising-delay
results are dominated by p-MOSFETs which pull-up the loaded nodes. The rising-delay
shift with radiation is 2.32 ns/Mrad(Si). These results are explained in terms of the
radiation-induced shifts in the DC p-MOSFET parameters, VTO, KP, ΔW, and ΔL. The
falling-delay results are dominated by n-MOSFETs which pull-down the loaded nodes.
The falling-delay shift with radiation is very small being 0.32 ns/Mrad(Si). This result
is unexpected and is not explained by radiation-shifts in the DC MOSFET parameters.
[Table 1 equations are not legible in the source. Recoverable entries: Node Current; MOSFET Saturation Region Current; Falling Delay; Damage Factors for VTO, KP, ΔW, and ΔL.]
Table 1: Model Results
Interface trapping effects are being proposed to explain these results.
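The delay sensitivities quoted above can be turned into a delay shift at a given total dose. The following is a minimal sketch that assumes the shift is linear in dose over the range tested; the function name and the dose values in the loop are illustrative and do not come from the measurements.

```python
# Sketch: first-order delay shift from a delay sensitivity (damage factor).
# Assumption: the shift is linear in total dose, shift = sensitivity * dose.

RISING_NS_PER_MRAD = 2.32   # rising-delay sensitivity quoted in the text
FALLING_NS_PER_MRAD = 0.32  # falling-delay sensitivity quoted in the text


def delay_shift_ns(sensitivity_ns_per_mrad, dose_krad):
    """Return the delay shift in ns for a total dose given in krad(Si)."""
    return sensitivity_ns_per_mrad * dose_krad / 1000.0


if __name__ == "__main__":
    # Illustrative doses only, up to the 100 krad(Si) used in the tests.
    for dose_krad in (10, 50, 100):
        rising = delay_shift_ns(RISING_NS_PER_MRAD, dose_krad)
        falling = delay_shift_ns(FALLING_NS_PER_MRAD, dose_krad)
        print(f"{dose_krad:4d} krad(Si): rising {rising:.3f} ns, falling {falling:.3f} ns")
```

Under this linear assumption the rising-delay shift at 100 krad(Si) is a small fraction of a nanosecond, which can be compared directly against the timing margin of a design.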
The model, which couples results from the MOSFET Matrix and Timing Sampler, is
shown in Figure 1. The Damage Factors used in this analysis are also defined in this figure.
Damage Factors for 1.6-µm and 3.0-µm CMOS are listed in Figure 2. A comparison of the
damage factors reveals that the 1.6-µm CMOS is much more radiation resistant than the
3.0-µm CMOS.
Another ASIC failure mode is radiation-induced leakage currents. The data, shown in
Figure 7, were taken from four MOSFETs with different geometries. The results reveal
that side-wall charging and acceptor interface states are responsible for the leakage. That
is, the leakage scales with channel length, L, and is independent of channel width, W.
Some total dose test issues are outlined in Figure 7. The interpretation of ground test
is limited by the following: (a) The particles used in ground tests are often different from
space particles. (b) The dose rate of the ground tests is much higher than in space. Both
of these issues argue for space tests that can verify the correctness or limitations of ground
tests.
The conclusions, listed in Figure 3, indicate that the radiation-induced increase in
rising delays can be modeled by radiation-induced shifts in MOSFET parameters. However,
radiation-induced falling-delay degradation cannot be explained in terms of
radiation-induced MOSFET parameter shifts. Interface-state trapping effects are suggested as the
explanation of this effect.
Data was presented that shows that 1.6-µm CMOS is significantly harder than 3.0-µm
CMOS. Leakage currents were identified as being due to leakage along the side-walls of the
MOSFETs.
Responses to the critical questions posed at the beginning of this presentation are given
in Figure 4. As the CMOS technology shrinks, mother nature seems to be cooperating
by providing more radiation tolerant processes. However, caution must be exercised as
trapping and side-wall charging effects could become more important at the smallest feature
n ON    0.78    0.64   58.6    -91.0   0.62  -0.6    0.79  -0.4    Falling  0.90   0.30   0.18
n OFF   0.78    0.28   58.5   -119.5   0.62  -0.55   0.79  -0.65
p ON   -0.78   -1.37   21.1     -2.6   0.67   0.46   0.62  -0.15   Falling
p OFF   0.78   -3.46   21.1    -25.8   0.67   0.3    0.52   0      Rising   1.50   2.32   5.65
D = Mrad(Si), During Dosing TSI = Low, Run No. [illegible]

n ON    0.94   -2.35   66.3   -434.7   1.32   0      1.40   0
n OFF   0.94   -2.54   66.3   -683.8   1.32   0      1.40   0      Falling  3.1    33.4   10.6
p ON   -0.88  -14.40   19.2      0     2.40   0      1.43   0      Rising   3.94   23.8   37.7
p OFF  -0.88  -14.03   19.2      0     2.40   0      1.43   0      Falling
D = Mrad(Si), During Dosing TSI = High, Run No. M84M (MOSFET) and M88F (Delay)
Table 2: CMOS DAMAGE FACTORS
• Dose-Shielding Codes Over Predict Total Dose by More Than Ten Times
• Correlation of Ground Radiation Sources using Gammas and X-Rays with Space
Particles (Electrons and Protons) Has Not Been Established
• Ground Test Acceleration Factor Large (e.g. 2E9)
1. ARACOR (X-RAY) = 6000 rad (Si)/sec
2. ASTM (COBALT 60) = 100 rad (Si)/sec
3. SPACE (HIGH) = 3 milli-rad (Si)/sec
4. SPACE (LOW) = 3 micro-rad (Si)/sec
• Test Conditions (Dose Rate and Gate Bias) Affect Parameter Degradation Due to
Device Annealing
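The acceleration factor in the list above is the ratio of a ground dose rate to a space dose rate. The sketch below reproduces that arithmetic from the four listed rates; the helper names and the 100 krad(Si) target dose are illustrative assumptions.

```python
# Sketch: ground-test acceleration factor as a ratio of dose rates.
# The four rates are the ones listed above, in rad(Si)/sec.

SECONDS_PER_YEAR = 365.25 * 24 * 3600

DOSE_RATES_RAD_PER_S = {
    "ARACOR (X-ray)": 6000.0,
    "ASTM (Cobalt-60)": 100.0,
    "Space (high)": 3e-3,
    "Space (low)": 3e-6,
}


def acceleration_factor(ground_rate, space_rate):
    """Ratio of ground dose rate to space dose rate (dimensionless)."""
    return ground_rate / space_rate


def exposure_time_s(total_dose_rad, dose_rate):
    """Seconds needed to accumulate total_dose_rad at a constant dose rate."""
    return total_dose_rad / dose_rate


if __name__ == "__main__":
    factor = acceleration_factor(
        DOSE_RATES_RAD_PER_S["ARACOR (X-ray)"], DOSE_RATES_RAD_PER_S["Space (low)"]
    )
    print(f"X-ray source versus low space rate: {factor:.1e}")
    # Illustrative target dose of 100 krad(Si), an assumption for this example.
    for name, rate in DOSE_RATES_RAD_PER_S.items():
        seconds = exposure_time_s(100e3, rate)
        print(f"{name:18s} {seconds:.3e} s ({seconds / SECONDS_PER_YEAR:.3e} yr)")
```

The ratio of the X-ray rate to the low space rate gives the 2E9 figure quoted in the list, and the exposure times show why the dose-rate dependence of annealing is a concern when ground data are extrapolated to space.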









[Figure 7 plots: subthreshold leakage versus DOSE (krad(Si)) and TIME (hr) for W(µm)/L(µm) = 13.5/3.0, 13.5/9.0, and 4.5/9.0, with VBS = 0.0 V and VDS = 5.0 V; leakage curves versus GATE VOLTAGE, VG (V), for W = 13.50 µm, L = 3.00 µm, DOSE = 12 krad(Si)]
• SUBTHRESHOLD LEAKAGE DEPENDS ON L NOT W.
• LEAKAGE CAUSED BY BIRDS BEAK ACCEPTOR INTERFACE TRAPS.
• LEAKAGE OBSERVED ONLY IN n-MOSFETs BIASED WITH 5V DURING RADIATION.
Figure 7: n-MOSFET Subthreshold Leakage Current 	
What Are The Qualification Procedures For ASICs?
Prior to fabrication, circuits are designed with circuit simulators to ensure circuits
meet performance goals. After fabrication, circuit tests are supplemented with tests
from test chips which are fabricated along with the custom chips. Such tests can
reveal the presence of hazards that were not simulated and are often due to processing flaws.
How Good Are Radiation-Hardened Circuit Simulators?
Circuit timing simulations are based on radiation-induced shifts in DC MOSFET
parameters, and accurate capacitance and resistor values. The simulators are only as
good as the parameters put into them. Currently the simulators do not account for
important second order effects such as trapping effects.
Do Laboratory Irradiations Accurately Simulate Space?
Ground tests are highly accelerated and use single, mono-energetic particles. These
particles are different from those found in space. Thus great care must be exercised in
predicting circuit space performance from ground tests. Data from space is extremely
limited. Only five experiments have been devoted to the testing of electronic components
in space. More space test results are needed on current technologies to ensure
their survivability in space.
What Are The Radiation-Hardened Circuit Design Rules?
Are closed geometry MOSFETs required to eliminate leakage? They take too
much space! Given a radiation scenario, what are the allowable timing margins?
What are the rules for latch-up? Are cross strapped latches and memory cells really
needed? This takes up room and requires additional processing steps for high value
resistors.
Can Non Rad-Hard CMOS ASICs Be Used In Space?
It appears that the answer is yes, but more testing is required to verify the answer.
Table 4: Critical Questions
sizes. Thus a rigorous testing program should be undertaken to evaluate the latest CMOS
processes. Such tests must be compared to results from space so as to validate the ground
tests.
Test chips can play a vital role in providing first principle answers that can be used
to evaluate processes and to provide ASIC design parameters. The TID chip is being
fabricated at a number of CMOS foundries in preparation for its use in CMOS foundry
qualification.
References
[1] B. R. Blaes, M. G. Buehler, and Y-S. Lin, A CMOS Matrix for Extracting MOSFET
Parameters Before and After Irradiation, IEEE Trans. Nuclear Science, NS-35, 1529-1535
(1988).
[2] B. R. Blaes, M. G. Buehler, and Y-S. Lin, Radiation Dependence of Inverter Propagation
Delay From Timing Sampler Measurement, IEEE Trans. Nuclear Science,
NS-36, (1989).
Acknowledgment
The authors are indebted to MOSIS of the Information Sciences Institute for brokering
the fabrication of the TID Chips. The research described in this paper was carried out by
the Jet Propulsion Laboratory, California Institute of Technology, and was sponsored by
the Defense Advanced Research Projects Agency and the National Aeronautics and Space
Administration.
N94- 71086
Real Time SAR Processing
A. B. Premkumar & J. E. Purviance
Department of Electrical Engineering
University of Idaho
Moscow, Idaho 83843
Abstract- A simplified model for the SAR imaging problem is presented.
The model is based on the geometry of the SAR system. Using this model an
expression for the entire phase history of the received SAR signal is formulated.
From the phase history, it is shown that the range and the azimuth coordinates
for a point target image can be obtained by processing the phase information
during the intrapulse and interpulse periods respectively. An architecture
for a VLSI implementation for the SAR signal processor is presented which
generates images in real time. The architecture uses a small number of chips,
a new correlation processor, and an efficient azimuth correlation process.
1 Introduction
Radar imaging of a scene requires processing the signals reflected off point targets comprising
the scene. The received signals contain two dimensional image information,
and processing the two dimensional information is time consuming. Hence,
imaging in real time becomes difficult. The basic processing functions to be implemented
are convolution, magnitude detection, interpolation and system control. Due to dramatic
improvements in fabrication techniques, the VLSI implementation of these functions can
now be achieved and it is now possible to do real time imaging on-board the spacecraft.
This will have significant impact on the capabilities of on-board signal processing functions
allowing for more sophistication, flexibility and reliability.
The basic concept behind any radar system is that it illuminates its targets by trans-
mitting bursts of microwave energy and collects the reflected signals from the targets. The
illuminated target is said to be in the footprint of the radar beam. Each target in the
footprint to be mapped has two coordinates, range and azimuth. The radar's resolving
ability in the range dimension is inversely proportional to the bandwidth of the transmit-
ted pulse. In the azimuth dimension it is inversely proportional to the length of the radar's
antenna. By way of example, a radar bandwidth of 1 GHz provides a range resolution of 15
cm and an antenna length of 1000 times the wavelength of the transmitted wave provides
an azimuth resolution of 10 m. As can be seen, there is a wide difference in range and
azimuth resolutions. For equal resolutions in both azimuth and range the antenna size




Figure 1: SAR Geometry
The Synthetic Aperture Radar (SAR) overcomes this problem by electronically syn-
thesizing a long antenna from a physically small antenna. It does this by recording the
intensity and the phase of the reflected signal and coherently processing the phase infor-
mation. This synthesizes a large antenna aperture to obtain high azimuth resolution that
is independent of range. The SAR achieves range resolution in much the same way as the
conventional radar [1].
1.1 SAR Basics
In this section the phase function of the received signal will be derived. The expressions for
range and azimuth resolutions in terms of the transmitted pulse bandwidth and the Doppler
bandwidth, respectively, will also be derived. A simplified model of a SAR undergoing
linear motion with velocity, V, with respect to a point target P is shown in Figure 1. The
physical aperture of the radar generates a radio frequency signal (RF) and radiates it in
the form of a beam of width B. The time the point target lies in the footprint is the dwell
time, $T_d$. The received signal contains the radar reflectivity (intensity) information of
the point target as well as the radar carrier phase. The coherent radar determines the
phase difference between the transmitted and the received pulses. The coherent detector
in the receiver compares the received signal with the sum of the carrier frequency, $f_c$, and
an offset frequency, $f_{if}$. The offset frequency centers the received spectrum about $f_{if}$. The
slant range, $R(t)$, to the image point, P, is time varying. $R_0$ is the boresight range to the
point target. The phase modulation on the carrier, $\phi(t)$, due to relative position between
the platform and the point target is given by the following equation [2,3]:
$$\phi(t) = \frac{4\pi R(t)}{\lambda} \qquad (1)$$
where $\lambda$ is the wavelength. From Figure 1,
$$R(t) = \sqrt{R_0^2 + (x - x_0)^2} \qquad (2)$$
Substituting for $R(t)$ in equation (1), with $x - x_0 = V(t - t_0)$:
$$\phi(t) = \frac{4\pi}{\lambda}\sqrt{R_0^2 + (x - x_0)^2} \qquad (3)$$
$$= \frac{4\pi}{\lambda}\sqrt{R_0^2 + V^2(t - t_0)^2} \qquad (4)$$
$$\approx \frac{4\pi}{\lambda}\left\{R_0 + \frac{V^2(t - t_0)^2}{2R_0}\right\} \qquad (5)$$
The shift in frequency, $f_d$ (Doppler shift), which is the time rate of change of the phase, is given by the
following equation:
$$f_d = \frac{\delta\phi(t)}{\delta t} \qquad (6)$$
$$= \frac{4\pi V^2(t - t_0)}{\lambda R_0} \qquad (7)$$
The output of the coherent detector is called the Doppler phase history.
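As a numerical check on equations (2) through (7), the sketch below compares the exact phase with its quadratic approximation and evaluates the phase rate of equation (7). The wavelength, boresight range and platform velocity used in the example are hypothetical values chosen for illustration; they are not taken from any particular radar.

```python
import math

# Sketch: phase history of a point target for a platform in linear motion.
# All numeric parameters below are hypothetical illustration values.


def slant_range(t, r0, v, t0=0.0):
    """Equation (2) with x - x0 = V (t - t0)."""
    return math.sqrt(r0 ** 2 + (v * (t - t0)) ** 2)


def phase_exact(t, lam, r0, v, t0=0.0):
    """Equations (1) and (4): phi(t) = 4 pi R(t) / lambda."""
    return 4.0 * math.pi * slant_range(t, r0, v, t0) / lam


def phase_quadratic(t, lam, r0, v, t0=0.0):
    """Equation (5): quadratic approximation of the phase."""
    return 4.0 * math.pi / lam * (r0 + (v * (t - t0)) ** 2 / (2.0 * r0))


def phase_rate(t, lam, r0, v, t0=0.0):
    """Equation (7): time derivative of the quadratic phase."""
    return 4.0 * math.pi * v ** 2 * (t - t0) / (lam * r0)


if __name__ == "__main__":
    lam, r0, v = 0.03, 10_000.0, 200.0  # metres, metres, metres per second
    for t in (-0.5, -0.25, 0.0, 0.25, 0.5):
        err = phase_exact(t, lam, r0, v) - phase_quadratic(t, lam, r0, v)
        print(f"t = {t:+.2f} s  approximation error = {err:+.2e} rad  "
              f"phase rate = {phase_rate(t, lam, r0, v):+.1f} rad/s")
```

The phase rate is linear in time and passes through zero at boresight, which is the linear FM structure that the azimuth correlator exploits.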
This phase history for each point target is generated during the time the point target
is in the physical aperture beam, that is, an interval $T_d$. Similar signal characteristics are
obtained when the radar beam is directed at a different angle called the squint angle, $\theta_s$.
The corresponding Doppler frequency is given by the following equation [4]:
$$f_d = -\frac{2V\cos\theta_s}{\lambda} - \frac{2V^2\sin^2\theta_s\,(t - t_0)}{\lambda R_0} \qquad (8)$$
In order to obtain a fine range resolution, the transmitter produces a linear FM waveform
represented as
$$f(t) = \begin{cases} e^{i2\pi a t^2}, & 0 < t < T_c \\ 0, & \text{elsewhere} \end{cases} \qquad (9)$$
where $a$ is the chirp rate and $T_c$ is the FM pulse duration. When the detected signal is
passed through a correlator or compressor, the output of the one dimensional correlator is
the envelope of the function $\sin(x)/x$. Due to the frequency coding of the pulse and proper
phase matching in the correlator, the height of the correlated output is the amplitude of
the original pulse multiplied by the square root of the time bandwidth product, $aT_c^2$. In
the range direction the resolution is $\delta_R$ and in the azimuth direction the resolution is $\delta_{Az}$.
It is desirable for $\delta_R$ and $\delta_{Az}$ to be nearly equal. Expressions for the two dimensions and
the derivations of their resolutions in the following chart are taken from Kovaly [4].
Range Direction
Time width $\tau_R$ of image at the -4 dB level: $\tau_R = 1/(\Delta f)_R$, where $(\Delta f)_R$ is the transmitting FM bandwidth.
Multiplying both sides of the above equation by $c/2$ produces: $c\tau_R/2 = c/(2(\Delta f)_R)$.
Here $c\tau_R/2$ is defined as Range Resolution, $\delta_R = c/(2(\Delta f)_R)$.

Azimuth Direction
Time width $\tau_{Az}$ of image at the -4 dB level: $\tau_{Az} = 1/(\Delta f)_{Az}$, where $(\Delta f)_{Az}$ is the Doppler bandwidth.
Multiplying both sides of the above equation by the airborne vehicle velocity $V$ produces: $V\tau_{Az} = V/(\Delta f)_{Az}$.
Here $V\tau_{Az}$ is defined as Azimuth Resolution, $\delta_{Az} = V/(\Delta f)_{Az}$.
The expressions have similar forms in their respective derivations. The resolutions are also
seen to be inversely proportional to bandwidth. When these bandwidths are realized using
signal processing, a synthetic aperture has been generated.
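The two resolution expressions can be evaluated directly. The sketch below does so; the 1 GHz bandwidth is the value used in the Introduction, while the platform velocity and Doppler bandwidth are hypothetical numbers chosen only to show the calculation.

```python
# Sketch: range and azimuth resolution from bandwidth.
#   delta_R  = c / (2 * (Delta f)_R)
#   delta_Az = V / (Delta f)_Az

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def range_resolution(fm_bandwidth_hz):
    """Range resolution in metres for a transmitted FM bandwidth in Hz."""
    return SPEED_OF_LIGHT / (2.0 * fm_bandwidth_hz)


def azimuth_resolution(velocity_m_per_s, doppler_bandwidth_hz):
    """Azimuth resolution in metres for a platform velocity and Doppler bandwidth."""
    return velocity_m_per_s / doppler_bandwidth_hz


if __name__ == "__main__":
    print(f"range resolution at 1 GHz: {range_resolution(1e9):.3f} m")
    # Hypothetical platform: V = 200 m/s with 100 Hz of processed Doppler bandwidth.
    print(f"azimuth resolution: {azimuth_resolution(200.0, 100.0):.3f} m")
```

A 1 GHz bandwidth gives a range resolution of about 15 cm, in agreement with the example in the Introduction.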
Depending on the application and also the system capability, a high resolution can be
obtained by processing all of the Doppler bandwidth and compensating for the phase
difference on a pulse-to-pulse basis. If the phase deviations are not all compensated,
a lower level of resolution is achieved, called an unfocussed aperture. However, if phase
correction is applied to each returned pulse, then a focussed aperture is obtained
[4].
1.2 SAR Signal Processing
The physical measurement employed by SAR to achieve range resolution is the time delay
introduced in the received signal, while in the azimuth direction it is the carrier phase
history. Signal processing to implement SAR uses the measurements of time delay of the
reflected signal to determine range of the target and the phase history of the carrier to
give the angular coordinate of the target since the phase history is determined by the
geometry of source and target. The range of a target varies at every instant, since the
radar-to-target distance varies every instant. The processing is complicated further by the
presence of several point targets in each beam footprint. The reflected signal is then the
superposition of the reflections from all of the targets.
A general signal processing architecture is shown in Figure 2. In almost all SAR signal
processors, the azimuth processing and the range processing are performed sequentially on
the data. In reality the SAR signal is a two dimensional function of range and azimuth and
hence the cross coupling between the two has to be considered. However, under certain
conditions and with suitable compensation, the cross coupling can be made negligible and,
to a first order, the two dimensional processing can be treated as two orthogonal one-dimensional
processes. This makes processing a little easier since the compression in each
direction is similar.
[Figure 2 block labels: Digitized Complex Radar Reflection; Focussed Image]
Figure 2: Block Diagram
1.3 Goals
The primary goal of this paper is to present a simplified model for the SAR from the
geometry. It will be shown from this model that the range and the azimuth coordinates of
any point target can be determined from the intra and interpulse periods when the point
target is in the footprint of the beam. The second goal is to present a VLSI architecture
for real time SAR data processing using time domain methods in both range and azimuth
directions.
2 Review of SAR Architectures
Historically, SAR architectures were ground based optical processors which were analog
in nature. These processors used light sources and a series of lenses to perform two-
dimensional processing. In these processors film is typically used as both the input and
output media. Although optical systems feature high throughput relative to digital pro-
cessors, they are constrained by the dynamic range of the film and the limitations of the
lenses. Although digital systems also suffer when the quality of the input data is poor,
digital systems are more adaptable to specific data problems and can compensate for the
poor quality data.
All of the existing architectures fall within the main framework of time domain and
frequency domain architectures. However, more recently combined forms called "hybrid
architectures" have been developed which exploit the time and frequency domain concepts.
Whether the architectures are time domain, frequency domain or hybrid, all have to employ
parallel processing techniques to achieve real time imaging. A brief review of three types
of existing digital architectures, and their merits and deficiencies, is given in the
following sections.
2.1 Time Domain Architectures
Time domain architectures use algorithms that are based on the correlation or the convo-
lution integral for range and azimuth compression. These take advantage of the repetitive
nature of the correlation algorithm and employ high speed computers to implement the pro-
cess. In these systems the range correlation is usually done before the azimuth correlation.
However, since the azimuth correlation requires many received pulses before processing
can begin, a considerable amount of memory is required to implement this architecture.
Two of the important time domain architectures are discussed below.
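The correlation on which time domain architectures are based can be illustrated with a short sketch of pulse compression. The code below correlates a received record against a baseband linear FM reference (equation (10) with the carrier removed). The sample count, chirp rate, delay and function names are hypothetical choices for illustration and do not describe either of the architectures reviewed below.

```python
import cmath
import math

# Sketch: time domain pulse compression by direct correlation.
# All numeric parameters are hypothetical illustration values.


def chirp(n_samples, rate, dt):
    """Baseband linear FM pulse: exp(i 2 pi 0.5 a t^2) sampled at t = k dt."""
    return [cmath.exp(2j * math.pi * 0.5 * rate * (k * dt) ** 2)
            for k in range(n_samples)]


def simulate_return(reference, delay, total_samples, amplitude=1.0):
    """Place a scaled copy of the reference pulse at a sample delay."""
    received = [0j] * total_samples
    for k, sample in enumerate(reference):
        received[delay + k] += amplitude * sample
    return received


def correlate(signal, reference):
    """out[lag] = sum over k of signal[lag + k] * conj(reference[k])."""
    n, m = len(signal), len(reference)
    out = []
    for lag in range(n - m + 1):
        acc = 0j
        for k in range(m):
            acc += signal[lag + k] * reference[k].conjugate()
        out.append(acc)
    return out


if __name__ == "__main__":
    ref = chirp(n_samples=64, rate=0.005, dt=1.0)
    rx = simulate_return(ref, delay=40, total_samples=200)
    compressed = correlate(rx, ref)
    peak = max(range(len(compressed)), key=lambda lag: abs(compressed[lag]))
    print("peak lag:", peak, "peak magnitude:", round(abs(compressed[peak]), 3))
```

The inner loop is the repetitive multiply-accumulate operation that these architectures replicate in hardware. The azimuth pass applies the same operation across pulses, which is why many pulses must be held in memory before it can start.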
2.1.1 Systolic Array Architecture
B. Arambepola and S. R. Brooks [5] have suggested a real time systolic-array architecture
employing a regular array of modules. The modules can be expanded to meet a variety of
resolution and throughput requirements. The hardware uses a partitioned array of shift
registers to process the data in both the range and azimuth directions. The system does not
use any explicit corner turning and uses a minimum amount of two dimensional memory.
The system degrades gracefully as the modules in the array fail and so no total failure
occurs. However, the architecture is very complicated and the data processing rates are
high.
2.1.2 Surface Acoustic Wave Compression Technique
A Surface Acoustic Wave (SAW) azimuth processor was suggested by Elachi [6]. This
uses the quadratic nature of the azimuth phase history. This system uses charge coupled
devices (CCD) for multiplexing and storing the data. This architecture tends to be slow
because of the CCD devices used for storing and multiplexing. Since there are no prefilters
in this system large amounts of data must be processed and stored.
2.2 Frequency Domain Architectures
The frequency domain architectures for SAR perform the range and azimuth correlation
using the discrete Fourier transform, computed with the fast Fourier transform (FFT). This
technique has the advantage of converting convolution to multiplication and thereby can
save computation time.
However, a large amount of memory is needed to store the data before and after this
transformation. Corner turn memory is also required since the data has to be read in the
azimuth direction before compression can be applied. Six frequency domain
architectures are examined in the following sections.
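The conversion of correlation into multiplication, and the corner turn that precedes azimuth compression, can be sketched as follows. For brevity the code uses a direct discrete Fourier transform with O(N^2) cost rather than an FFT, so it illustrates the identity and not the speed advantage; all function names are illustrative.

```python
import cmath
import math

# Sketch: circular correlation in the time domain versus the frequency domain,
# plus a corner turn (matrix transpose). A direct DFT stands in for the FFT.


def dft(x, inverse=False):
    """Direct discrete Fourier transform of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for f in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * f * t / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_correlate_time(x, h):
    """c[lag] = sum over k of x[(lag + k) mod N] * conj(h[k])."""
    n = len(x)
    return [sum(x[(lag + k) % n] * h[k].conjugate() for k in range(n))
            for lag in range(n)]


def circular_correlate_freq(x, h):
    """Same result via C[f] = X[f] * conj(H[f]) followed by an inverse DFT."""
    spectrum_x, spectrum_h = dft(x), dft(h)
    product = [a * b.conjugate() for a, b in zip(spectrum_x, spectrum_h)]
    return dft(product, inverse=True)


def corner_turn(matrix):
    """Transpose range lines into azimuth lines."""
    return [list(column) for column in zip(*matrix)]


if __name__ == "__main__":
    x = [1, 2j, -1, 0.5, 0, 3, 1 - 1j, 2]
    h = [1, 1j, 0, 0, 0, 0, 0, 0]
    t_dom = circular_correlate_time(x, h)
    f_dom = circular_correlate_freq(x, h)
    print("largest difference:", max(abs(a - b) for a, b in zip(t_dom, f_dom)))
    print(corner_turn([[1, 2, 3], [4, 5, 6]]))
```

The corner turn is only a transpose, but for a full SAR record it requires the whole block of range compressed data to be resident in memory, which is the memory cost noted above.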
2.2.1 Linear Range Migration Approximation Architecture
M. Benson [7] has proposed a frequency domain architecture in which he approximated the
range migration path by a straight line. In this research he skewed the range compressed
data linearly to form a set of azimuth lines on which azimuth compressions were done.
Although this is a frequency domain approach, the algorithm can also be implemented in
the time domain. The linear approximation for range migration requires calculating the
FFT of the range compressed data. This process is computationally intensive. Since no
prefiltering is done, there is too much data to be processed in real time.
2.2.2 The CRC Architecture
The Communication Research Company of Canada [8] implemented a true two-dimensional
matched filter technique to process SAR data. This technique eliminates the assumption
of no coupling between range and azimuth data. The advantage of the system is that it is
capable of fine tuning itself by monitoring its own performance. However, two dimensional
processing is employed which requires large amounts of processing time and memory. Also
there is no data reduction and as such the data to be processed is large. Removal of
geometric distortion is more difficult due to the two dimensional processing.
2.2.3 Dornier Systems
Dornier System GmbH and MacDonald Dettwiler Associates [9] have jointly developed
a SAR processor based on the SPECAN method. The SPECAN method employs the
frequency domain approach. The Dornier system has been used in studies of image qualities,
effects due to quantization errors, finite length registers, effects of interpolating filter side-
lobes and ghost images. The architecture allows for parallel processing of sets of range
compressed data by employing identical pipeline processors. No presumming is done on
the data, which means that the data to be processed is large. Also because corner turning
is done on the data, the memory requirements are high.
2.2.4 Single Instruction Multiple Data Systems
The single instruction multiple data system developed by B. Arambepola [10] uses an
array architecture and is configured for high processing efficiency. The processor has high
regularity and hence is suitable for VLSI implementation. The processing power can be
increased by increasing the size of the array. Macros are used to increase the performance of
the system. However, this system requires mass storage and needs extensive data routing.
The random access memory is constructed as blocks and hence memory is not easily
accessible between blocks. This architecture cannot be implemented in real time because
there are no existing VLSI chips to implement it.
2.2.5 MacDonald Dettwiler Associates (MDA) Architecture
This architecture was developed by MDA for implementing the processing algorithm on
a main frame system [11]. The computing facility of the MDA architecture consists of
an Interdata 8/32 main-frame with limited fast storage and a large amount of disk storage.
Several looks are processed and then individually interpolated in both the range and
azimuth directions to accommodate the increase in the bandwidth due to the detection process.
The final SAR image is generated by reducing the large radiometric dynamic range
to 8 bits using a square root law transformation, preserving a dynamic range of 45 dB. This
architecture is accurate and fast and the range migration correction is efficiently implemented
in the frequency domain. However, it needs a large memory and the data rates
are high. Since a large amount of data is to be processed, a real time implementation is
not possible.
2.2.6 Template Controlled Image Processor
The template controlled image processor (TIP) developed by the Nippon Electric Company
is essentially a data flow architecture and uses a special purpose, high speed processor to
do signal processing [12]. Since the TIP uses a pipeline structure which is flexible, the
same arrangement can be applied to all processes in the system. This architecture can
be made real time by stacking up several TIP systems and applying segments of received
data in parallel to all of them. The TIP system has a ring type architecture with one main
and two major rings. Data are circulated into the rings and are processed by the modules
connected to the rings until all operations on the data are complete. However, as in any
frequency domain implementation this requires a large amount of memory for performing
the corner turn operation. Real time implementation by stacking up the ring systems
leads to other problems such as data handling capability, data flow, memory accessibility,
processing capability and power.
2.3 Hybrid Architectures
The newly developed hybrid architectures use both time and frequency domain process-
ing, combining the advantages of both. In these systems either the range or azimuth
compression can be done first. The order in which the range and azimuth correlations are
done gives rise to two different systems, with each system having some advantages. Two
architectures which use this type of processing are discussed below.
2.3.1 Jet Propulsion Laboratory (JPL) Base Line Architecture
The JPL [13] has proposed a baseline architecture for a three frequency quad polarization
system. The processor uses time domain parallel accumulators. The range correlation is
done in the frequency domain and the azimuth correlation is done in the time domain.
One of the significant features of this architecture is that it does not use bulk corner
turn memory. It also does not have data transfer between the accumulator channels in
the azimuth processor. It generates precise Doppler reference functions and the range
migration correction is programmable. The system achieves a data reduction rate of three.
However, the number of azimuth reference samples is limited to 64 and this affects the
compression ratio. Also since the range correlation is done in the frequency domain, some
amount of memory is needed before the Fourier transformation. Most of the chips that
would implement this architecture have yet to be designed.
2.3.2 Interim Digital Processor
The interim digital processor (IDP) with multiple execution developed by B. Barken, C.
Wu and W. Karplus [14] uses a SEL 32/55 minicomputer and three AP 120B array proces-
sors. The array processors are controlled by executives which allow dynamic assignment
and control of multiple array processors from a single control program. The processing is
done on subtasks partitioned from the entire job. Two post-processing operations, namely
data handling and multiple look overlay, are performed to derive the final image. This ar-
chitecture is highly repetitive. The correlator software allows a number of array processors
to perform in parallel. However, the system can suffer considerable time latency whenever
the time taken for an operation is longer than the associated disk transfer time. This can
cause a loss in efficiency.
3 Modeling of SAR
Any research into a VLSI implementation of the SAR processing algorithm requires an
understanding of the principles involved in imaging the received data. Forming an image
from the received data involves focusing in both the range and azimuth directions. The
received signal, therefore, contains two-dimensional image information and hence a two
dimensional matched filtering operation must be performed on the data. However, using
the geometry of the SAR, it can be shown that the range and azimuth information can be
obtained from processing the intra pulse and inter pulse data respectively. Furthermore,
when the radiated modulated waveform is a periodic sequence of pulses, the range and
the azimuth image information are contained in the reflected waveform in approximately
independent ways. The following sections demonstrate how the two-dimensional received
signal can be treated as two one-dimensional data for range and azimuth processing.
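The treatment of the two-dimensional problem as two one-dimensional problems can be illustrated numerically. The sketch below assumes that the two-dimensional reference is exactly separable, that is, the product of an azimuth reference and a range reference; in a real SAR this holds only after the compensations discussed in this paper. The data values and reference functions are arbitrary illustrative numbers.

```python
# Sketch: a separable 2-D correlation computed as a range pass over each pulse
# (row) followed by an azimuth pass over each range bin (column).
# Assumption: kernel[a][b] = h_az[a] * h_rg[b], an idealisation.


def correlate_1d(x, h):
    """Valid-lag correlation: out[lag] = sum over k of x[lag + k] * conj(h[k])."""
    m = len(h)
    return [sum(x[lag + k] * complex(h[k]).conjugate() for k in range(m))
            for lag in range(len(x) - m + 1)]


def correlate_2d(data, kernel):
    """Direct 2-D correlation over all valid offsets."""
    rows, cols = len(data), len(data[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    return [[sum(data[r + a][c + b] * complex(kernel[a][b]).conjugate()
                 for a in range(k_rows) for b in range(k_cols))
             for c in range(cols - k_cols + 1)]
            for r in range(rows - k_rows + 1)]


def correlate_separable(data, h_az, h_rg):
    """Range compression on every row, then azimuth compression on every column."""
    range_compressed = [correlate_1d(row, h_rg) for row in data]
    columns = [correlate_1d(list(col), h_az) for col in zip(*range_compressed)]
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    data = [[1, 2, 3, 4, 5],
            [0, 1, 0, 2, 1],
            [3, 1, 2, 0, 1],
            [1, 1, 1, 1, 1]]
    h_az, h_rg = [1, 2], [1, -1, 0.5]
    kernel = [[a * b for b in h_rg] for a in h_az]
    full = correlate_2d(data, kernel)
    two_pass = correlate_separable(data, h_az, h_rg)
    print(max(abs(full[r][c] - two_pass[r][c])
              for r in range(len(full)) for c in range(len(full[0]))))
```

For a reference with P azimuth taps and Q range taps, the direct form needs P x Q multiplications per output sample while the two-pass form needs roughly P + Q, which is the practical reason for seeking the separation.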
3.1 Transmitted and Received Signals
Let a point target, P, be located at a range x and an azimuth y with reference to the
co-ordinate system shown in Figure 1. As the platform carrying the radar travels with a
velocity, V, with respect to the point to be mapped, linear FM pulses of duration $T_c$ are
transmitted at some pulse repetition frequency (PRF). The time between the transmitted
pulses is sufficiently long to prevent ambiguous range responses when there are many point
targets. The time delay introduced in the received signal depends on the location of the
point target and the location of the platform. The point target stays in the footprint of
the radar beam for a short time, $T_d$, during which time a number of pulses are transmitted
and received. A single transmitted waveform, $s(t)$, translated with a carrier frequency, $f_c$,
can be represented, in complex phasor form, by the following equation:
$$s(t) = e^{i2\pi(f_c t + 0.5at^2)}\{U(t) - U(t - T_c)\} \qquad (10)$$
A single pulse travels a total distance of $2R(t)$ from the time of transmission to the time
of reception, where $R(t)$ is the time varying distance of the point target to the platform.
The received signal is the delayed version of $s(t)$ and is of the form $\sigma s(t - \tau)$, where $\sigma$ is
the reflectivity function, which is characteristic of the point target. Hence, the collected
data comprise a set of reflectivity measurements in two dimensions and do not resemble
the point target at all before processing. Rather the signals from the point target are
dispersed in both range and azimuth. The effect of the carrier frequency in the reflected
waveform can be removed (without destroying its phase) by coherently beating the signal
down to an intermediate frequency, $f_{if}$. The received signal, $r(t)$, is translated to a lower
frequency by low pass filtering. The filtered output is then
$$r(t) = \sigma e^{i2\pi(-f_c\tau + f_{if}t + 0.5a(t - \tau)^2)}\{U(t - \tau) - U(t - T_c - \tau)\} \qquad (11)$$
It is important for the phase coherence to be maintained between the transmitter and the
receiver. The term $-f_c\tau$ in the exponent is the coherent carrier phase information.
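Equations (10) and (11) can be evaluated directly. In the sketch below U(t) is taken to be the unit step, which is the usual reading of the gating terms; the carrier, intermediate frequency, chirp rate, pulse duration, delay and reflectivity values are hypothetical and chosen only so that the example runs.

```python
import cmath
import math

# Sketch: the transmitted pulse of equation (10) and the filtered return of
# equation (11). U(t) is assumed to be the unit step. All numeric values in
# the example are hypothetical.


def unit_step(t):
    return 1.0 if t >= 0 else 0.0


def transmitted(t, fc, a, tc):
    """Equation (10): linear FM pulse of duration tc on a carrier fc."""
    gate = unit_step(t) - unit_step(t - tc)
    return cmath.exp(2j * math.pi * (fc * t + 0.5 * a * t * t)) * gate


def received_if(t, tau, sigma, fc, f_if, a, tc):
    """Equation (11): delayed return after translation to the frequency f_if."""
    gate = unit_step(t - tau) - unit_step(t - tc - tau)
    phase = 2.0 * math.pi * (-fc * tau + f_if * t + 0.5 * a * (t - tau) ** 2)
    return sigma * cmath.exp(1j * phase) * gate


if __name__ == "__main__":
    fc, f_if, a, tc = 1e6, 1e5, 1e9, 1e-4   # Hz, Hz, Hz/s, s
    tau, sigma = 2e-5, 0.5                   # s, dimensionless
    for t in (1e-5, 3e-5, 1.1e-4, 1.3e-4):
        value = received_if(t, tau, sigma, fc, f_if, a, tc)
        print(f"t = {t:.1e} s  |r(t)| = {abs(value):.2f}")
```

The return is zero outside the interval from tau to tau + tc and has magnitude sigma inside it; the constant term in the exponent, which depends only on the delay, is what the azimuth processing later uses.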
3.2 Digital Processing of the Received Signal
The analog received signal can be sampled with a sampling frequency consistent with
the Nyquist sampling requirements and this forms the basis for digital processing of the
received signals. Let n be the number of pulses transmitted during the dwell time, Td,
and m be the total number of samples in the reflected signal. The image reflections are
then sampled by n × m samples. The analog phase of the received signal for a single point
image is given by the following equation:
$$\phi_a = 2\pi\{-f_c\tau + f_{if}t + 0.5a(t - \tau)^2\}\{U(t - \tau) - U(t - T_c - \tau)\} \qquad (12)$$
The platform moves a distance $VT_p$ every time a pulse is transmitted and the total distance
traveled during the dwell time is $VnT_p$, $T_p$ being the interpulse period. The delay, $\tau$, as
stated before, is a function of the platform position and the target co-ordinates. Let j be
the index for each of the samples in the received signal, and let k be the index for each
of the transmitted pulses. The platform position, p, can be uniquely determined by the
indices j and k, since the platform moves during every sample in the transmitted pulse
train. The delay can then be written as:
$$\tau = d\{p(j,k), x, y\} \qquad (13)$$
where $p(j,k)$, the platform position, is determined by the jth sample index in the
received signal and the kth pulse index in the pulse train. The sampled time, $t(j,k)$,
corresponding to the platform position is also a function of j and k and is
$$t(j,k) = j\beta + kT_p, \qquad j = 1, 2, \ldots, m, \qquad k = 1, 2, \ldots, n \qquad (14)$$
where $\beta$ is the sampling period of the received signal. With this new notation for the delay,
the sampled phase of the received signal sampled over the entire dwell time is:
$$\phi_{sp}\{t(j,k)\} = 2\pi\left\{f_{if}(j\beta + kT_p) - f_c\,d\{p(j,k),x,y\} + 0.5a\{(j\beta + kT_p) - d\{p(j,k),x,y\}\}^2\right\}\{U(j\beta + kT_p - \tau) - U(j\beta + kT_p - T_c - \tau)\},$$
$$j = 1, 2, 3, \ldots, m, \qquad k = 1, 2, 3, \ldots, n \qquad (15)$$
The above equation forms a coherent record of all the received pulses during the dwell time.
Thus, a sampled record of the image is collected and stored in the SAR processor, and the
total number of samples stored is $n \times m$. The phase is a function of only the delay
of the returned signal for a given set of parameters. An examination of the expression
for the phase reveals that it contains the range and the azimuth information of the
point target. We will show that the range information is present in the term
$0.5a\{(j\beta + kT_p) - d\{p(j,k),x,y\}\}^2$, while the azimuth information is present in $-f_c\,d\{p(j,k),x,y\}$. The
goal of the digital processing is to find the range and azimuth ($x$ and $y$ coordinates)
of the point target from this function. It is well known that the $x$ and $y$ coordinates
can be determined separately by independent processing. The details
of this separation are given in the next section.
3.3 Range and Azimuth Processing
We will show that the delay in the received signal is a function of only x for intra PRP
period samples and a function of only y for inter PRP period samples. However, the above
independence is valid only when suitable compensations are applied for pitch, yaw and roll
(attitude) changes of the platform and the data are corrected for range migration before
the azimuth processing begins. As the platform moves between every sample, there is a
positional displacement of the platform, $V\beta$, which gives rise to a different distance to the
point target at every sampling instant. Therefore the distance, $D_j$, at the $j$th sampling
instant is
$$D_j = \sqrt{x^2 + (y - V\{j\beta + kT_p\})^2} \qquad (16)$$
since the platform is moving toward the target. Hence the delay corresponding to $D_j$ is
$$d_j\{p(j,k),x,y\} = 2\sqrt{x^2 + (y - V\{j\beta + kT_p\})^2}/c \qquad (17)$$
where $c$ is the speed of light.
3.3.1 Intrapulse Processing
When a pulse is transmitted, the parameter $k$ is a constant until the reflected signal is
completely received. Therefore, for a fixed $k$ and for $j$ varying between 1 and $m$, the delay
$d_j\{p(j,k),x,y\}$ varies as follows:
$$2\sqrt{x^2 + (y - V\{m\beta + kT_p\})^2}/c < d_j\{p(j,k),x,y\} < 2\sqrt{x^2 + (y - V\{\beta + kT_p\})^2}/c \qquad (18)$$
Since $V$, $\beta$ and $T_p$ are known, the terms $V\{m\beta + kT_p\}$ and $V\{\beta + kT_p\}$ can be evaluated
as constants. Using the Seasat data as an example, the number of pulses transmitted and
received during the dwell time is 4509. Using a sampling frequency of 45 MHz and an FM
duration of 33.9 µs, $m$ is 1525. Assume that a point target is at a range of 900 km
and an azimuth of 10 km. During the intrapulse duration for $k = 0$, when the platform
is at one extreme of the footprint, the distance expression $\sqrt{x^2 + (y - V\{j\beta + kT_p\})^2}$
is 900.0555538 km for $j = 1$ and 900.0555510 km for $j = 1525$. In this case the distance is
nearly equal to 900 km for all $j$, with an error in the 5th significant digit. However, when
$k$ is 2254, that is, when the target is at the boresight of the platform, the distance
evaluates to 900.0000610 km for both $j = 1$ and $j = 1525$. The distance for all
$j$ is again nearly equal to 900 km, with an error in the 8th significant digit. The value of
the distance expression evaluated for $k = 4508$, that is, when the platform is at the other
extreme of the footprint, is the same as that evaluated for $k = 0$.
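The footprint-edge figures quoted above can be checked numerically. The following is a minimal sketch of the distance expression in equation (16); the platform speed of 7.5 km/s and the interpulse period are assumptions introduced here for illustration (neither is stated in the text), and the function name is hypothetical. With that assumed speed, the two printed values should land close to the distances quoted for the first and last samples at $k = 0$.

```python
import math

def slant_distance_km(x_km, y_km, v_km_s, beta_s, tp_s, j, k):
    """Equation (16): platform-to-target distance at sample j of pulse k."""
    along_track_km = y_km - v_km_s * (j * beta_s + k * tp_s)
    return math.hypot(x_km, along_track_km)

X_KM, Y_KM = 900.0, 10.0   # range and azimuth of the point target (from the text)
BETA_S = 1.0 / 45e6        # sampling period for the 45 MHz sampling frequency
M = 1525                   # samples per pulse (from the text)
V_KM_S = 7.5               # ASSUMED platform speed; not given in the text
TP_S = 1.0 / 1647.0        # ASSUMED interpulse period; has no effect here since k = 0

d_first = slant_distance_km(X_KM, Y_KM, V_KM_S, BETA_S, TP_S, j=1, k=0)
d_last = slant_distance_km(X_KM, Y_KM, V_KM_S, BETA_S, TP_S, j=M, k=0)
print(f"j=1: {d_first:.7f} km, j={M}: {d_last:.7f} km")
```

The difference between the two values is a few millimetres over a 900 km range, which is why the intrapulse delay can be treated as depending on $x$ alone once range migration has been corrected.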
The numbers generated by evaluating the above example show that there is a small
difference in the distance expressions when the platform is at either extreme of the footprint
compared to when it is at the boresight. Therefore, the distance (and hence the delay) is
not solely dependent on the range but is dependent on the azimuth as well. The dependence
of the delay on both the range and azimuth is called the range migration effect. If the
difference in the computed values is greater than half the range resolution, this effect has to
be corrected. Thus, the data for any transmitted pulse during the dwell time will have to
be corrected for range migration before they can be used to determine the delay and hence
the range. When the range migration correction has been made, the delay expression can
be approximated as:
$$d_j\{p(j,k),x,y\} \approx 2\sqrt{x^2}/c \qquad (19)$$
Therefore, the intrapulse delay and hence the phase profile variation is essentially only a
function of the x co-ordinate of the image of the point target. This allows the x co-ordinate
of the image to be determined using only the intrapulse processing.
3.3.2 Interpulse Processing
However, for every $k$, there is considerable motion of the platform toward or away from
the point target. Therefore, with fixed $j$ and varying $k$ ($k$ varies from 0 to $n$), the delay
will vary as follows:
$$2\sqrt{x^2 + (y - V\{j\beta + nT_p\})^2}/c < d_k\{p(j,k),x,y\} < 2\sqrt{x^2 + (y - V\{j\beta + T_p\})^2}/c \qquad (20)$$
The delay illustrated by the above expression gives rise to a phase shift in the received
signal. Processing for the $y$ coordinate of the image thus involves processing this phase
information. Note that as the platform passes the point target, the relative distance along
the $y$ direction between the platform and the point target decreases, becomes zero and
then increases. Hence, there is a point along the path of the platform where the $y$ distance
to the point target is at a minimum and the corresponding delay of the returned signal is
also at a minimum. The minimum delay in the returned signal occurs for a particular value of
$k$, since $k$ determines the relative position of the platform at any PRF instant. Assume,
for now, that $k$ is a continuous variable. The value of $k$ that gives the minimum delay can be
determined by minimizing the delay expression with respect to $k$. The $k_{min}$ expression is:
$$k_{min} = (y/V - j\beta)/T_p \qquad (21)$$
The $y$ coordinate corresponding to this minimum delay is seen to be determined by both
$j$ and $k$. However, even when $j$ is at a maximum, the contribution due to the term $j\beta$ is
small. Hence, it is valid to state that $k_{min}$ is a function of only the $y$ coordinate. Hence,
from the value of $k_{min}$, the $y$ coordinate of the image can be determined. A correlator
in the azimuth direction traditionally is used to determine $k_{min}$, including its fractional
part, and hence it also determines the $y$ coordinate of the image point reflector. The
determination of $k_{min}$ forms the basis for the azimuth correlation processing.
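Equation (21) can be illustrated with a short numerical sketch that compares the closed-form $k_{min}$ with a brute-force search for the pulse index of minimum delay. The platform speed and interpulse period below are assumed values chosen for illustration only, so the resulting pulse index is not expected to match the Seasat boresight index quoted earlier; the function names are hypothetical.

```python
import math

C_KM_S = 299_792.458       # speed of light in km/s

def delay_s(x_km, y_km, v_km_s, beta_s, tp_s, j, k):
    """Equation (17): two-way delay at sample j of pulse k."""
    along_track_km = y_km - v_km_s * (j * beta_s + k * tp_s)
    return 2.0 * math.hypot(x_km, along_track_km) / C_KM_S

def k_min(y_km, v_km_s, beta_s, tp_s, j):
    """Equation (21): continuous pulse index of minimum delay."""
    return (y_km / v_km_s - j * beta_s) / tp_s

def y_from_k_min(k, v_km_s, beta_s, tp_s, j):
    """Invert equation (21) to recover the azimuth coordinate."""
    return v_km_s * (k * tp_s + j * beta_s)

X_KM, Y_KM = 900.0, 10.0   # point target from the worked example
BETA_S = 1.0 / 45e6        # sampling period
V_KM_S = 7.5               # ASSUMED platform speed
TP_S = 1.0 / 1647.0        # ASSUMED interpulse period
N_PULSES = 4509            # pulses in the dwell time (from the text)
J = 1                      # fixed sample index

k_cont = k_min(Y_KM, V_KM_S, BETA_S, TP_S, J)
k_best = min(range(N_PULSES),
             key=lambda k: delay_s(X_KM, Y_KM, V_KM_S, BETA_S, TP_S, J, k))
y_back = y_from_k_min(k_cont, V_KM_S, BETA_S, TP_S, J)
print(k_cont, k_best, y_back)
```

The brute-force index agrees with the closed form to within half a pulse, and inverting the expression returns the azimuth coordinate that was put in, which is the sense in which $k_{min}$ encodes $y$.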
3.4 Conclusion
We have presented a simplified geometrically based model for the entire SAR imaging
problem. Using this model, we have formulated an expression for the entire phase history
of the received SAR signal. We have shown that the azimuth and the range coordinate for
a point target image can be obtained by processing the phase history during the interpulse
and intrapulse periods respectively. Although we have considered a point target, the theory
also applies to more generalized images.
4 A Real Time SAR architecture
The theory developed in the above section reveals that, for Seasat-type data, range and
azimuth processing can be done independently. Processing for range and azimuth involves
compression techniques using correlation operations. The similarity of the interpulse
and intrapulse data processing enables similar hardware to be used for both operations.
The correlation operation involves repetitive multiplications and additions. Although
the correlation can be easily implemented in the frequency domain by multiplying
the frequency domain functions, the repetitive nature of the correlation algorithm suggests
a time domain approach. Due to the recent availability of the technology and design
capability, it is now possible to implement the correlation operation with VLSI chips in
a time domain architecture. Hence, a time domain implementation is discussed in the following
sections.
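The repetitive multiply-and-add structure of time domain correlation can be sketched in a few lines. This is an illustrative software model only, not a description of the chips: the chirp rate, lengths and function name are assumptions made for the example.

```python
import cmath
import math

def correlate_time_domain(samples, reference):
    """Slide the reference over the samples; each output is a sum of
    products of samples with the complex conjugate of the reference."""
    n_out = len(samples) - len(reference) + 1
    output = []
    for lag in range(n_out):
        acc = 0j
        for i, ref in enumerate(reference):
            acc += samples[lag + i] * ref.conjugate()   # one multiply-accumulate
        output.append(acc)
    return output

# Hypothetical 16-sample linear FM (chirp) reference, embedded at offset 5.
reference = [cmath.exp(1j * math.pi * 0.05 * i * i) for i in range(16)]
samples = [0j] * 5 + reference + [0j] * 10

output = correlate_time_domain(samples, reference)
peak_lag = max(range(len(output)), key=lambda lag: abs(output[lag]))
print(peak_lag, abs(output[peak_lag]))
```

The correlation peak appears at the lag where the chirp was embedded, which is the compression effect the range and azimuth correlators rely on. Every inner-loop step is an independent multiply followed by an add, which is what makes the operation suitable for an array of parallel multiplier cells.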
[Figure 3 block diagram. Blocks: Filter & Presummer, Range Correlator, Range Migration Correction, Azimuth Correlator, Video Controller, Display, System Controller. Input: signal received from the point target.]
Figure 3: Time Domain Architecture
4.1 Block Diagram of Real Time SAR
A block diagram of the time domain architecture is shown in Figure 3. The data received
from the target area is sampled and converted to digital form by an A to D converter
(not shown). The system controller performs this operation using the system clock in
order to retain the phase coherence of the received signal. The data is collected for every
transmitted pulse and stored as a range line and is then sent to the filter and presumming
circuit buffer. The following sections describe the block diagram briefly.
4.1.1 Filter and Presummer
The filter and presummer is shown in Figure 4 [15]. The azimuth time-bandwidth product
required for the desired resolution is much lower than what is available in the received
signal. In order to reduce the memory requirements of the system and the processing data
rates, it is required by the system to filter the received data and use only the azimuth
spectrum required for a particular resolution. Down sampling then is possible because
a lower frequency spectrum is available for azimuth compression. The filtering and the
presummer circuit performs this operation. The filtering operation is applied on the data
in the azimuth direction. The presummer part of the circuit down samples the data by
summing the range lines which are n lines apart with n channels running in parallel, n
being the presum number. With presumming, the data rate is reduced by a factor of n.
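The data-rate reduction from presumming can be sketched as follows. This is one simple reading of the operation, assuming that each output line is the sum of a block of n consecutive (already filtered) range lines; the azimuth filter and the parallel channel arrangement of the actual circuit are not modeled, and the function name is hypothetical.

```python
def presum(range_lines, n):
    """Sum each block of n consecutive range lines into one output line.

    range_lines: list of range lines, each a list of complex samples.
    Returns len(range_lines) // n output lines, so the line rate drops by n.
    """
    output = []
    for start in range(0, len(range_lines) - n + 1, n):
        block = range_lines[start:start + n]
        # Add the block sample by sample along the azimuth direction.
        output.append([sum(column) for column in zip(*block)])
    return output

# Eight hypothetical range lines of three complex samples each.
lines = [[complex(row, col) for col in range(3)] for row in range(8)]
presummed = presum(lines, 4)
print(len(lines), "lines in,", len(presummed), "lines out")
```

With a presum number of 4, eight input lines become two output lines, so downstream memory and processing rates scale down by the same factor.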
4.1.2 Range Correlator
The range correlator shown in Figure 5 performs range compression on the presummed
data in the time domain. Range compression compresses the image in the range direction.
We used a new architecture to implement the range correlation. In this new architecture,
the correlator consists of a number of multiplier cells operating in parallel on the incoming
data. The high speed multipliers used here perform the correlation operation in real time.
The data flow diagram of the correlator chip is shown in Figure 6.
[Figure 4 (filter and presummer). Recoverable labels: From System Controller; Filter; Complex Data; To Range Correlator.]
[Figure 5 (range correlator). Recoverable labels: Down Sampled Complex Data In; To Range Migration Correction Circuit.]
[Figure 6. Recoverable labels: Coefficient; 12.]
Figure 6: Data Flow Diagram of Range Correlator
4.1.3 Range Migration Correction Module
The range migration correction module (RMC) is shown in Figure 7. The received signal
reflected off the point targets comprising the area to be mapped travels through several
memory cells in the system memory. The range of the target varies with every transmitted
pulse. Hence, the range compressed data must be interpolated according to the path that a
particular point target takes through the memory. The range migration correction module
performs this interpolation and sends out the corrected data to the azimuth correlator.
Since the maximum range migration can be computed before correction is made, the
correction is implemented by selecting one of several sets of interpolated data. The
interpolation is done in the RMC module. The correct set is chosen by the system
control processor.
4.1.4 Azimuth Correlator
The azimuth correlator with its associated video memory is shown in Figure 8. The azimuth
compression is performed on the range migration corrected data to compress the spread
of the image in the azimuth direction. This operation is performed in the time domain
and requires the generation of a reference function which is essentially a matched filter
in the azimuth direction. The matched filter characteristics are a function of the range
and attitude parameters and hence have to be computed. To save time, these could
be precomputed and stored in the system controller. The selection of proper samples of
the matched filter characteristics for the correlation is done by the system controller and
is based on the particular attitude parameters of the platform. The azimuth correlation is
achieved by a single multiplier chip and part of the display memory.
[Figure 7 block diagram. Blocks: Range Interpolator, Mux, Buffer, Latch, RMC Coefficient Buffer. Inputs: range correlated data; coefficients from the system controller. Output: to the azimuth correlator.]
Figure 7: Range Migration Correction Module
Figure 8: Azimuth Correlator
4.1.5 Video Control Processor
The azimuth correlator uses the main display memory as part of its memory to do the
correlation in the azimuth direction. Every time a multiplication is done, the previously
computed and stored data is retrieved, summed with the present multiplied data and
stored back in the same location in the video portion of the main display memory. Each
location corresponds to a pixel in the image. This retrieval and storage operation is possible
because the data rate has been considerably reduced and sufficient time is available between
azimuth correlation operations. The video control processor performs this function of
retrieving, summing and storing the data in the memory.
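The retrieve, sum and store cycle described above can be modeled as an accumulation into an array that stands in for the video memory. This is a software sketch under assumptions made here (range lines arrive one pulse at a time, and each pixel row collects one product per reference sample); it is not a description of the controller's actual program, and the names are hypothetical.

```python
def azimuth_correlate_in_place(lines, reference):
    """Accumulate azimuth correlation products directly in 'video memory'.

    lines[k][r] is the range-compressed sample for pulse k and range bin r.
    Each product is added to the pixel it contributes to, so complete
    azimuth columns never need to be gathered before processing.
    """
    n_out = len(lines) - len(reference) + 1
    n_bins = len(lines[0])
    video = [[0j] * n_bins for _ in range(n_out)]     # one accumulator per pixel
    for k, line in enumerate(lines):                  # lines arrive pulse by pulse
        for tap, ref in enumerate(reference):
            row = k - tap                             # pixel row this product feeds
            if 0 <= row < n_out:
                for r, sample in enumerate(line):
                    # retrieve the stored value, add the new product, store it back
                    video[row][r] = video[row][r] + sample * ref.conjugate()
    return video

# Six hypothetical range lines of two bins and a three-sample reference.
lines = [[complex(k + r, k - r) for r in range(2)] for k in range(6)]
reference = [1 + 0j, 1j, -1 + 0j]
video = azimuth_correlate_in_place(lines, reference)
print(len(video), "pixel rows;", video[0])
```

Because every incoming line is consumed as it arrives, the data never has to be transposed from range order to azimuth order, which is consistent with the architecture's claim that no corner turn memory is required.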
4.1.6 The System Controller
The system controller coordinates the operation of the entire system. It synchronizes the
presumming operation so that proper data reduction is achieved. In addition, the system
controller loads the coefficients for the filters and the RMC buffers and selects the proper
interpolated data for the azimuth compression. The system controller also supervises the
operation of the video control processor and the video output.
4.1.7 Video Output
This is the final interface for the image. The data stored in the video memory comprises
the final focussed image which can be transmitted or displayed.
4.2 Significant Features of the Time Domain Architecture
1) It is a fast and efficient implementation.
2) Range processing is performed before azimuth processing.
3) No corner turn memory is required.
4) The time domain processing is performed on both range and azimuth.
4.3 VLSI Chip Descriptions
The following sections give overview descriptions of the VLSI chips used in the various
parts of the time domain architecture. These descriptions are by no means exhaustive, but
give a brief account of the different blocks that comprise each VLSI chip.
4.3.1 Filter and Presummer
The finite impulse response filter has twelve coefficients. A down sampling factor of 8
is achieved with two identical channels operating in parallel. This factor is achieved by
staggering the data coming into the two channels by 8 range lines between the two. The
filter circuit consists of a buffer to store the filter coefficients, a complex multiplier, an adder
and a 12 K memory. The memory is needed to apply the filtering operation in the azimuth
direction. A controller which forms part of the chip synchronizes the operation of both
channels and controls the flow of data from the memory during the filtering operation.
A multiplexor connected to the channels enables the data to be output from both the
channels sequentially.
4.3.2 Range Correlator
The range correlator uses the filtered and down sampled data and a reference function
with 1200 samples. Each processor chip has 64 multiplier cells. Each multiplier cell has
its own multiplier operating at 16MHz, an adder and a 4 word register. The 4 word
register is used for complex multiplication. With 1200 samples in the reference function,
20 such chips must be connected in cascade to perform real time correlation. The reference
function samples are stored in buffers which are double buffered. This double buffering
enables correlation with a different reference function. The final output of the correlator
is taken out of a complex output register adder pair.
4.3.3 Range Migration Correction Module (RMC)
The RMC consists of an interpolator which is programmed to do interpolation on the
range compressed data. The controller which performs the interpolation outputs a set of 8
interpolated data on a parallel bus. A multiplexor which forms part of the chip selects the
data based on the range migration coefficients from the system controller that are stored
in a buffer provided for that purpose. A second buffer is provided to store the selected
interpolated data, since the data rate is changed after interpolation. The corrected data is
output from this buffer at the system rate by the clock supplied by the system controller.
A latch is provided at the output of the RMC module to latch the data before the azimuth
processing.
4.3.4 Azimuth Correlator
The compression ratio for the azimuth processor is fixed at 64 for this example, although
it is flexible and can be as high as 128. The azimuth correlation is done with a reference
function with 64 samples. The samples are stored in double buffered registers which make
it possible for an updated reference function to be loaded. These buffers and registers
form part of the multiplier cell. The azimuth processor has 64 multiplier cells and adders
associated with each multiplier. When the compression ratio is 128, two correlator chips
have to be cascaded. The memory required for performing correlations in the azimuth
direction is part of the display memory associated with the video control processor.
4.3.5 Video and System Controllers
The video controller is basically a microcontroller or a microprocessor which performs the
complex function of retrieving processed data stored in the memory each time the
azimuth processor does a multiplication operation. The video controller also stores the
summed data from the azimuth correlator back in the original video memory location. The
control program is loaded into the video controller program memory from the system
controller at the beginning of the imaging process.
The system controller is a microcomputer coordinating the functions of all the subsys-
tems of the architecture. It has its own program memory, buffers and peripheral devices
to compute the reference functions, update the coefficients, etc. in the subsystems.
4.4 Total Chip Count
The real time SAR architecture discussed above has several chips which are dedicated in
their functions. All of them are custom VLSI chips. A summary of the number of each
type of VLSI chip used in the architecture is given below. The total number of custom
VLSI chips is 4.
Subsystem             Number of Chips
Filter & Presummer    2
Range Processor       20





The theory behind the synthetic aperture radar signal processing has been presented. An
alternative model for SAR from its geometry has been described. This geometrically based
model shows that the range and azimuth information are present in the phase history of
the signal received from the point targets comprising the area to be imaged. The model
also describes how the range and azimuth information can be obtained by processing the
information in the intrapulse and interpulse period samples. This makes range and azimuth
processing independent of each other.
Some of the past work and architectures have been reviewed along with their merits and
deficiencies. A real time architecture in the time domain using a new correlator has been
described. The architecture uses only a small amount of memory and performs no corner
turn. The architecture is fast and efficient and uses only five different types of custom
VLSI chips for its implementation. A brief description of each of these five VLSI chips has
also been presented. This work was partially funded by the NASA SERC at the University
of Idaho. Additionally, we, the authors, want to publicly thank and praise the Lord Jesus
Christ for influencing our lives and our work through His love, faithfulness, grace, death
and resurrection.
References
[1] W. M. Brown and L. J. Porcello, "An Introduction to Synthetic Aperture Radar",
IEEE Spectrum, Vol. 6, Sept. 1969, pp. 52-62.
[2] J. P. Fitch, Synthetic Aperture Radar, Springer-Verlag, New York, 1988.
[3] R. O. Harger, Synthetic Aperture Radar Systems: Theory and Design, Academic Press,
New York, 1970.
[4] J. J. Kovaly, Synthetic Aperture Radar, Artech House Inc., Dedham, MA, 1976.
[5] B. Arambepola and S. R. Brooks, "Systolic Array Architecture for Real Time SAR
Processing", Marconi research report, pp. 360-364.
[6] C. Elachi, Spaceborne Radar Remote Sensing: Application and Techniques, IEEE
Press, Inc., New York, 1987.
[7] M. Benson, "Digital Processing of Seasat-A SAR Data Using Linear Approximations
to the Range Cell Migration Curves", IEEE International Radar Conference, June
1980, pp. 176-181.
[8] M. R. Vant, G. E. Haslam and W. E. Thorp, "The CRC SAR Digital Processor",
Selected Papers, European Space Agency Workshop, Paris, France, 1979, pp. 101-105.
[9] R. Schotter, R. Gunzenhauser and H. Holzi, "Real Time SAR Processor", 1982,
pp. FA-1, 2.1-2.6.
[10] R. J. Offen, VLSI Image Processing, McGraw-Hill Book Company, New York.
[11] J. R. Bennet, I. G. Cumming and R. A. Deane, "The Digital Processing of Seasat
Synthetic Aperture Radar Data", Proceedings of IEEE International Radar Conference,
June 1980, pp. 168-174.
[12] H. Nohmi, N. Ito and S. Hanaki, "Digital Processing of Space-Borne Synthetic
Aperture Radar Data", Proceedings of the 3rd Seasat-SAR Workshop on SAR Image
Quality, Frascati, Italy, Dec. 1980, pp. 47-49.
[13] K. Y. Liu and W. E. Arens, "Processing SAR Images on Board", JPL Invention
Report, NPO-17195/6667, Jan. 1989, pp. 1.
[14] B. Barken, C. Wu, W. Karplus and D. Caswell, "Application of Parallel Array
Processor for Seasat SAR Processing", Geoscience and Remote Sensing, Vol. 1, pp. 542-547.




Using Algebra for Massively Parallel
Processor Design and Utilization
Lowell Campbell
NASA Space Engineering Research Center
University of Idaho, Moscow, ID 83843
Michael R. Fellows
Department of Computer Science
University of Idaho, Moscow, ID 83843
Abstract- This paper summarizes the authors' advances in the design of dense
processor networks. Within is reported a collection of recent constructions of
dense symmetric networks that provide the largest known values for the num-
ber of nodes that can be placed in a network of a given degree and diameter.
The constructions are in the range of current potential engineering significance
and are based on groups of automorphisms of finite-dimensional vector spaces.
Key Words. interconnection networks, Cayley graphs, linear groups
1 Introduction
Two important objectives in the design of parallel processor interconnection networks are
to minimize the number of wires connecting to a processor node and the number of nodes
that a message must pass through to reach the destination node [11,12]. This is equivalent
to the problem of constructing the largest possible graph when the degree and diameter of
the graph are restricted [3].
Various approaches have been tried to construct graphs that improve these properties
[4,7,6,8]. In many cases, our technique provides dramatic improvements over the best
previously known constructions. Our new graphs are the largest known graphs of given
degree and diameter for many of the degree and diameter pairs of engineering interest for
large parallel processing systems [5]. These graphs are significantly better than current
interconnection networks in terms of degree and diameter. An example comparison is the
(degree = 9, diameter = 9) combination. For a 9-dimensional hypercube the number of nodes
is $2^9 = 512$. For our graphs, the (9,9) pair has 4,773,696 nodes. This is an increase
of nearly four orders of magnitude. This paper outlines these results and gives the generator sets
for 35 new record constructions. For an overview of our new results see Table 1. Improved
entries are shown in bold.
Our graphs are a type of graph called a Cayley graph. Section 2 discusses the properties
of these graphs that are relevant to processor design and Section 3 discusses the
construction technique.
                                   Diameter
Deg       3       4        5         7           8           9            10
 5       70     174      532     4,368      11,200      33,600       123,120
 6      105     355    1,081    13,104      50,616     202,464       682,080
 7      128     506    2,162    39,732     150,348     911,088     4,773,696
 8      203     842    3,081   103,776     455,544   2,386,848     7,738,848
 9      585   1,248    6,072   215,688   1,361,520   4,773,696    19,845,936
10      650   1,820   12,144   492,960   2,386,848   7,738,848    47,059,200
11      715   3,200   14,625   898,776   4,773,696  25,048,800   179,755,200
Table 1: New Records
2 Algebraic symmetry as an organizing principle for
parallel processing
There are important considerations apart from degree and diameter that must figure in
any choice of network topology for parallel computation. Our approach yields symmetric
constructions, and we believe that in this lies their greater value. Symmetry is one of
the most powerful and natural tools to apply to the central problem of massively parallel
computation: how to organize and coordinate computational resources.
The symmetries of the networks we describe are represented by simple algebraic opera-
tions (such as 2 by 2 matrix multiplications and modulo arithmetic). The main advantage
of algebraic networks is that the developed mathematical resources of algebra are available
to structure the problems of testing, data exchange, message routing, scheduling and the
mapping of computations onto the network. The appeal of hypercubes, cube-connected-
cycles, butterfly networks and others rests in large part on this same availability of easily
computed (and comprehended) symmetries. These popular networks and those that we
describe all belong to a class of algebraic networks based on vector spaces and their sym-
metry groups. For recent algebraic approaches to routing algorithms, deadlock avoidance,
emulation and scheduling for algebraically described networks of this sort see [1,2,9,10].
The next section describes our basic approach and some examples of our constructions.
3 Technique and Example Constructions
A network is (vertex-) symmetric if for any two nodes $u$, $v$ there is an automorphism of
the network mapping $u$ to $v$. Every Cayley network is symmetric (symmetries are given
by group multiplication). If $A$ is a group and $S \subseteq A$ is a generating set that is closed
under inverses, i.e., $S = S \cup S^{-1}$, then the (undirected) Cayley graph $(A, S)$ is the graph
with vertex set $A$ and with an edge between elements $a$ and $b$ of $A$ if and only if $as = b$
for some $s \in S$. It is remarkable (but, indeed, natural) that most networks that have
been considered for large parallel processing systems (including hypercubes, grids,
cube-connected-cycles and butterfly networks) are Cayley graphs. The degree of a Cayley graph
$(A, S)$ is $\Delta = |S|$ and the diameter of $(A, S)$ is
$D = \max_{a \in A} \min\{t : a = s_1 \cdots s_t,\ s_i \in S \text{ for } i = 1, \ldots, t\}$.
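The definitions above translate directly into a breadth-first search from the identity element: the number of elements reached is the order of the graph, and because a Cayley graph is vertex-symmetric, the largest distance from the identity equals the diameter. The sketch below is illustrative; the function names are hypothetical, matrices are written as row-major tuples (a, b, c, d) by assumption, and `pow(det, -1, p)` requires Python 3.8 or later.

```python
from collections import deque

def cayley_graph_stats(identity, generators, multiply):
    """Return (vertex count, diameter) of the Cayley graph generated from identity."""
    dist = {identity: 0}
    queue = deque([identity])
    while queue:
        a = queue.popleft()
        for s in generators:
            b = multiply(a, s)          # edge between a and b = a*s
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return len(dist), max(dist.values())

def mat_mul_mod(p):
    """Multiplication of 2 by 2 matrices (a, b, c, d) modulo p."""
    def mul(x, y):
        return ((x[0] * y[0] + x[1] * y[2]) % p, (x[0] * y[1] + x[1] * y[3]) % p,
                (x[2] * y[0] + x[3] * y[2]) % p, (x[2] * y[1] + x[3] * y[3]) % p)
    return mul

def mat_inv_mod(m, p):
    """Inverse of an invertible 2 by 2 matrix modulo a prime p."""
    inv_det = pow((m[0] * m[3] - m[1] * m[2]) % p, -1, p)
    return ((m[3] * inv_det) % p, (-m[1] * inv_det) % p,
            (-m[2] * inv_det) % p, (m[0] * inv_det) % p)

# The 9-dimensional hypercube as a Cayley graph: bit strings under XOR.
cube = cayley_graph_stats(0, [1 << i for i in range(9)], lambda a, s: a ^ s)
print(cube)  # (512, 9)

# Generators as transcribed from Example 1, over the integers modulo 13.
P = 13
base = [(0, 1, 1, 0), (11, 2, 8, 12), (11, 4, 7, 5)]
gens = set(base) | {mat_inv_mod(g, P) for g in base}
linear = cayley_graph_stats((1, 0, 0, 1), gens, mat_mul_mod(P))
print(len(gens), linear)
```

If the generators are transcribed correctly, the second result should reproduce the order and diameter claimed in Example 1; that outcome is not asserted here.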
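Example 1 Degree 5, diameter 7: 4,368 vertices. (Best previous: 2,988.)
This is a Cayley graph on the subgroup of GL[2,13] consisting of the matrices with
determinant in the set $\{1, -1\}$. The generators are the following elements together with
their inverses.
$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \text{ order 2}, \qquad \begin{bmatrix} 11 & 2 \\ 8 & 12 \end{bmatrix} \text{ order 52}, \qquad \begin{bmatrix} 11 & 4 \\ 7 & 5 \end{bmatrix} \text{ order 14}$$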
Example 2 Degree 6, diameter 10: 682,080 vertices. (Best previous: 199,290.)
This is a Cayley graph on the group GL[2,29]. The generators are the following elements
together with their inverses.
17 13 ]2$ 18o rder 28
	 16 27 order 28	 27 14 order 840
Example 3 Degree 10, diameter 5: 12,144 vertices. (Best previous: 10,000.)
This is a Cayley graph on the group SL[2,23]. The generators are the following elements
together with their inverses.
18 18 ]order 11	 18 21 order 11	 0
9 10
 17 order 22
7 ]
14 3 order 22	 17 20 order 24
The tables at the end of the paper are a complete list of the new constructions. In
them, B[2,n] denotes the Borel subgroup of GL[2,n].
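The vertex counts in the examples can be cross-checked against the standard formulas for the orders of GL[2,p] and SL[2,p]. The sketch below also includes a count for matrices of the form [a,b,0,1], which is the shape of the B[2,n] generators listed in the tables; treating B[2,n] as that group is an assumption made here because it matches the listed orders, and the function names are hypothetical.

```python
def gl2_order(p):
    """Order of GL[2,p] for a prime p: (p^2 - 1)(p^2 - p)."""
    return (p * p - 1) * (p * p - p)

def sl2_order(p):
    """Order of SL[2,p] for a prime p: p(p^2 - 1)."""
    return p * (p * p - 1)

def affine_order(p):
    """Number of matrices [a,b,0,1] with a nonzero modulo a prime p: p(p - 1)."""
    return p * (p - 1)

# Orders compared with the examples and tables in this paper.
checks = {
    "Example 1, index 6 in GL[2,13]": gl2_order(13) // 6,
    "Example 2, GL[2,29]": gl2_order(29),
    "Example 3, SL[2,23]": sl2_order(23),
    "degree 6, diameter 4, index 14 in B[2,71]": affine_order(71) // 14,
}
for label, order in checks.items():
    print(label, order)
```

An index-i subgroup has order equal to the group order divided by i, which is how the "index i in" entries of the tables relate to the vertex counts.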
4 Conclusions
We have presented new constructions for graphs with the largest known number of nodes for a
given degree and diameter. These graphs have potential for use in the design of parallel
processing systems. They have the additional advantage of possessing algebraic symmetry,
which simplifies processor design. These graphs and their properties are the subject of ongoing
research.
References
[1] F. Annexstein, M. Baumslag and A.L. Rosenberg, "Group Action Graphs and Parallel
Architectures," COINS Technical Report 87-133, Univ. of Mass., Amherst, 1987.
[2] S.B. Akers and B. Krishnamurthy, "On Group Graphs and Their Fault-Tolerance,"
IEEE Trans. on Computers, 36 (1987), pp. 885-888.
[3] J.C. Bermond, C. Delorme, and J.J. Quisquater, "Strategies for Interconnection Net-
works: Some Methods from Graph Theory", Journal of Parallel and Distributed Com-
puting 3 (1986), pp. 433-449.
[4] J. Bond, C. Delorme, and W.F. de La Vega, "Large Cayley Graphs with Small Degree
and Diameter", Rapport de Recherche no. 392, LRI, Orsay,1987.
[5] L. Campbell, M. Fellows, et al., "Dense Symmetric Networks from Linear Groups",
The Second Symposium on Frontiers of Massively Parallel Computation, Fairfax,
Virginia, October 1988.
[6] G.E. Carlsson, J.E. Cruthirds, H.B. Sexton, and C.G. Wright, "Interconnection Net-
works Based on a Generalization of Cube-Connected Cycles", IEEE Trans. on Com-
puters, Vol. C-34, No. 8, August 1985, pp. 769-772.
[7] D.V. Chudnovsky, G.V. Chudnovsky, and Denneau, "Regular Graphs with Small
Diameter as Models for Interconnection Networks", 3rd Int. Conf. on Supercomputing,
Boston, MA, May 1988, pp. 232-239.
[8] K. Doty, "New Designs for Processor Interconnection Networks", IEEE Trans. on
Computers, Vol. C -33, No. 5 May 1984, pp. 447-450.
[9] V. Faber, "Global Communication Algorithms for Hypercubes and Other Cayley
Coset Graphs", Technical Report, Los Alamos National Laboratories, 1988.
[10] M. Fellows, "A Category of Graphs for Parallel Processing," Technical Report, Uni-
versity of Idaho, 1988.
[11] P. Mazumder, "Evaluation of Three Interconnection Networks for CMOS VLSI
Implementation", International Conference on Parallel Processing, August 1986,
pp. 200-207.
[12] A. Ranade and S. L. Johnsson, "The Communication Efficiency of Meshes, Boolean
Cubes and Cube Connected Cycles for Wafer Scale Integration", International
Conference on Parallel Processing, August 1987, pp. 479-482.
Parameters    Order      Previous   Moore       Group          Generators:order
                         Record     Bound                      S = S U S^-1
degree 5      4,368      2,988      27,306      index 6 in     [0,1,1,0]:2
diameter 7                                      GL[2,13]       [11,2,8,12]:52
                                                (det = r^6)    [11,4,7,5]:14
degree 5      123,120    52,224     1,747,626   GL[2,19]       [0,1,1,0]:2
diameter 10                                                    [16,11,2,0]:45
                                                               [11,16,0,15]:18
degree 6      355        320        937         index 14 in    [54,66,0,1]:5
diameter 4                                      B[2,71]        [5,43,0,1]:5
                                                               [57,38,0,1]:5
degree 6      1,081      992        4,687       index 2 in     [7,20,0,1]:23
diameter 5                                      B[2,47]        [6,33,0,1]:23
                                                               [9,42,0,1]:23
degree 6      13,104     13,056     117,187     index 2 in     [10,12,10,9]:39
diameter 7                                      GL[2,13]       [12,3,9,8]:84
                                                (det = r^2)    [8,11,5,10]:84
degree 6      50,616     32,256     585,937     SL[2,37]       [32,24,35,2]:19
diameter 8                                                     [12,24,15,27]:37
                                                               [23,16,28,34]:36
degree 6      202,464    72,345     2,929,687   index 9 in     [25,1,31,1]:36
diameter 9                                      GL[2,37]       [12,35,23,30]:76
                                                (det = r^9)    [12,4,28,16]:152
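The Moore Bound column can be checked with the standard Moore bound for a graph of maximum degree d and diameter k, which counts the root, its d neighbours, and at most d(d-1)^(i-1) new nodes at each distance i up to k. The sketch below is a small helper for that check; the function name is hypothetical.

```python
def moore_bound(degree, diameter):
    """Upper bound on the number of nodes in a graph of given degree and diameter."""
    return 1 + degree * sum((degree - 1) ** i for i in range(diameter))

# Compare against the Moore Bound column for a few table rows.
for degree, diameter in [(5, 7), (5, 10), (6, 4), (6, 9)]:
    print(degree, diameter, moore_bound(degree, diameter))
```

Comparing the Order column with this bound shows how far each construction remains from the theoretical maximum; for degree 5 and diameter 7, for example, 4,368 nodes are achieved against a bound of 27,306.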




Parameters    Order      Previous   Moore        Group          Generators:order
                         Record     Bound                       S = S U S^-1

degree 7      39,732     35,154     391,910      index 2 in     [0,42,1,0]:2
diameter 7                                       SL[2,43]       [18,16,38,41]:22
                                                 (/{±1})        [8,28,14,33]:43
                                                                [34,2,37,6]:22
degree 7      150,348    93,744     2,351,462    index 2 in     [0,66,1,0]:2
diameter 8                                       SL[2,67]       [48,48,4,11]:66
                                                 (/{±1})        [59,18,42,31]:33
                                                                [7,64,66,58]:134
degree 7      911,088    304,668    14,108,774   index 2 in     [0,1,1,0]:2
diameter 9                                       GL[2,37]       [27,33,19,22]:684
                                                 (det = r^2)    [25,16,13,6]:36
                                                                [23,17,14,26]:18






Parameters    Order       Previous    Moore        Group           Generators:order
                          Record      Bound                        S = S U S^-1
degree 8      203         200         457          index 4 in      [16,9,0,1]:7
diameter 3                                         B[2,29]         [16,21,0,1]:7
                                                                   [25,15,0,1]:7
                                                                   [25,9,0,1]:7
degree 8      3,081       2,808       22,409       index 2 in      [49,72,0,1]:39
diameter 5                                         B[2,79]         [46,43,0,1]:13
                                                                   [19,26,0,1]:39
                                                                   [13,13,0,1]:39

degree 8      455,544     234,360     7,686,401    index 4 in      [28,32,33,33]:171
diameter 8                                         GL[2,37]        [9,34,25,16]:342
                                                   (det = r^4)     [21,9,17,5]:57
                                                                   [0,26,3,1]:171
degree 8      2,386,848   1,822,176   53,804,809   index 2 in      [26,20,25,10]:1081
diameter 9                                         GL[2,47]        [8,23,21,33]:552
                                                   (det = r^2k)    [20,37,31,28]:184
                                                                   [33,4,25,44]:23





Parameters              Order      Previous   Moore        Group            Generators:order
                                   Record     Bound                         (S = S ∪ S^-1)
degree 9                6,072      5,150      42,130       index 2 in       [0,22,1,0]:2

degree 9, diameter 8    1,316,520  910,000    21,570,706   index 10 in      [60,1,0,1]:2
                                                           GL[2,61]         [10,21,19,8]:12
                                                           (det = r^10k)    [11,15,4,51]:93
                                                                            [51,7,43,60]:60
                                                                            [50,1,18,26]:62

degree 10               12,144     10,000     73,811       SL[2,23]         [9,0,18,18]:11






Parameters              Order      Previous   Moore        Group            Generators:order
                                   Record     Bound                         (S = S ∪ S^-1)
degree 10               492,960    486,837    5,978,711    index 2 in       [48,18,5218]:78

degree 10, diameter 8   2,386,848  2,002,000  53,808,401   index 2 in       [29,5,0,22]:46
                                                           GL[2,47]         [23,8,3,12]:552
                                                           (det = r^2k)     [19,7,11,19]:1104
                                                                            [15,16,38,0]:46
                                                                            [46,6,22,28]:23












Parameters              Order      Previous   Moore        Group            Generators:order
                                   Record     Bound                         (S = S ∪ S^-1)
degree 12, diameter 8   9,922,968  8,370,180  257,230,657  index 2 in       [13,26,63,49]:66
                                                           GL[2,67]         [50,44,6,19]:1122

degree 13, diameter 7   2,723,040  2,657,340  42,346,682   index 5 in       [60,1,0,1]:2
                                                           GL[2,61]         [43,2,25,27]:93




























N94-71088
On Well-Partial-Order Theory and
Its Application to Combinatorial
Problems of VLSI Design
M. Fellows




Department of Computer Science
University of Tennessee
Knoxville, TN 37996
Department of Computer Science
Washington State University
Pullman, WA 99164-1301
Abstract- We nonconstructively prove the existence of decision algorithms with
low-degree polynomial running times for a number of well-studied graph layout,
placement and routing problems. Some were not previously known to be in P
at all; others were only known to be in P by way of brute force or dynamic
programming formulations with unboundedly high-degree polynomial running
times. Our methods include the application of the recent Robertson-Seymour
theorems on the well-partial-ordering of graphs under both the minor and
immersion orders. We also briefly address the complexity of search versions of
these problems.
1 Introduction
Practical problems are often characterized by fixed-parameter instances. In the VLSI
domain, for example, the parameter may represent the number of tracks permitted on a
chip, the number of processing elements to be employed, the number of channels required
to connect circuit elements or the load on communications links. In fixing the value of
such parameters, we help focus on the physically realizable nature of the system rather
than on the purely abstract aspects of the model.
In this paper, we employ and extend Robertson-Seymour poset techniques to prove low-
degree polynomial-time decision complexity for a variety of fixed-parameter layout, place-
ment and routing problems, dramatically lowering known time-complexity upper bounds.
Our main results are summarized in Table 1, where n denotes the number of vertices in an
input graph and k denotes the appropriate fixed parameter. (A mark after a problem name
in Table 1 indicates that the input is restricted to graphs of maximum degree three.)
In the next section, we survey the necessary background from graph theory and graph
algorithms that makes these advances possible. Sections 3, 4 and 5 describe our results on
several representative types of decision problems, illustrating a range of techniques based
on well-partially-ordered sets. In Section 6, we discuss how self-reducibility can be used
General            Problem                       Best Previous        Our Result
Problem Area                                     Upper Bound
Circuit Layout     GATE MATRIX LAYOUT            open                 O(n^2) [FL1]
Linear             MIN CUT LINEAR ARRANGEMENT    O(n^{k-1})           O(n^2)
Arrangement        MODIFIED MIN CUT              O(n^k)               O(n^2)
                   TOPOLOGICAL BANDWIDTH*        O(n^k)               O(n^2) [FL2]
                   VERTEX SEPARATION             O(n^{k^2+2k+4})      O(n^2)
Circuit Design     CROSSING NUMBER*              open                 O(n^3) [FL2]
and Utilization    MAX LEAF SPANNING TREE        O(n^{2k+1})          O(n^2)
                   SEARCH NUMBER                 O(n^{2k^2+4k+8})     O(n^2)
Embedding          2-D GRID LOAD FACTOR          open                 O(n^2)
and Routing        BINARY TREE LOAD FACTOR       open                 O(n^2)
                   DISK DIMENSION                open                 O(n^3) [FL1]
                   EMULATION                     open                 O(n^3) [FL2]
Table 1: Main Results






Figure 1: Construction demonstrating that W4 is a minor of Q3 (legend: dashed edges are
contracted)
to bound the complexity of search versions of these problems. A few open problems and
related issues are briefly addressed in the final section.
2 Background
We consider only finite, undirected graphs. A graph H is less than or equal to a graph
G in the minor order, written H ≤_m G, if and only if a graph isomorphic to H can be
obtained from G by a series of these two operations: taking a subgraph and contracting
an edge. For example, the construction depicted in Figure 1 shows that W4 ≤_m Q3.
Note that the relation ≤_m defines a partial ordering on graphs. A family F of graphs
is said to be closed under the minor ordering if the facts that G is in F and that H ≤_m G
together imply that H must be in F. The obstruction set for a family F of graphs is the
set of graphs in the complement of F that are minimal in the minor ordering. Therefore,
if F is closed under the minor ordering, it has the following characterization: G is in F if
and only if there is no H in the obstruction set for F such that H ≤_m G.
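The contraction operation can be made concrete with a small sketch. It is not the construction of Figure 1 (which did not survive in this copy); it is one possible sequence of contractions, chosen here for illustration, that turns the cube Q3 into a wheel with a hub and four rim vertices, under the assumption that W4 names that five-vertex wheel. The function name `contract` and the vertex labels are hypothetical.

```python
def contract(adj, u, v):
    """Contract edge uv into u and return a new simple graph (dict of neighbor sets)."""
    if v not in adj[u]:
        raise ValueError("u and v must be adjacent")
    # Drop v everywhere, then reattach v's other neighbors to u.
    new = {x: set(nbrs) - {v} for x, nbrs in adj.items() if x != v}
    for w in adj[v] - {u}:
        new[u].add(w)
        new[w].add(u)
    return new


# Q3: vertices 0..7, adjacent when their labels differ in exactly one bit.
q3 = {x: {x ^ 1, x ^ 2, x ^ 4} for x in range(8)}

# Collapse the face {0, 1, 2, 3} into vertex 0 by three contractions.
h = contract(q3, 0, 1)
h = contract(h, 0, 3)   # the Q3 edge 1-3 now joins 0 and 3
h = contract(h, 0, 2)

hub_neighbors = h[0]                                  # the four rim vertices
rim_degrees = sorted(len(h[x]) for x in h if x != 0)  # each rim vertex has degree 3
```

Because only contractions are used, the result is a minor of Q3; taking subgraphs of the result would yield further minors.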
Theorem 1. [RS5] (formerly known as Wagner's Conjecture [Wa]) Graphs are well-partially-
ordered by ≤_m. That is, any set of graphs contains only a finite number of minor-minimal
elements, and there are no infinite descending chains.
Theorem 2. [RS4] For every fixed graph H, the problem that takes as input a graph G
and determines whether H ≤_m G is solvable in polynomial time.
Theorems 1 and 2 guarantee only the existence of a polynomial-time decision algorithm
for any minor-closed family of graphs. Moreover, no proof of Theorem 1 can be entirely
constructive. For example, there can be no systematic method of computing the finite
obstruction set for an arbitrary minor-closed family F from the description of a Turing
machine that accepts precisely the graphs in F [FL5].
An interesting feature of Theorems 1 and 2 is the low degree of the polynomials
bounding the decision algorithms' running times (although the constants of proportionality
are enormous). Letting n denote the number of vertices in G, the time required to recognize
F is O(n^3). If F excludes a planar graph, then F has bounded tree-width [RS2] and the
time complexity decreases to O(n^2).






Figure 2: Construction demonstrating that C4 is immersed in K1 + 2K2 (legend: dashed
edges are lifted)
A graph H is less than or equal to a graph G in the immersion order, written H ≤_i G,
if and only if a graph isomorphic to H can be obtained from G by a series of these two
operations: taking a subgraph and lifting [Ma3] a pair of adjacent edges. For example, the
construction depicted in Figure 2 shows that C4 ≤_i K1 + 2K2 (although C4 ≰_m K1 + 2K2).
The relation ≤_i, like ≤_m, defines a partial ordering on graphs with the associated
notions of closure and obstruction sets.
Theorem 3. [RS1] (formerly known as Nash-Williams' Conjecture [Na]) Graphs are well-
partially-ordered by ≤_i.
The proof of the following result is original, although it has been independently observed
by others as well [Rob].
Theorem 4. For every fixed graph H, the problem that takes as input a graph G and
determines whether H ≤_i G is solvable in polynomial time.
Proof. Letting k denote the number of edges in H, we replace G = (V, E) with G' =
(V', E'), where |V'| = k|V| + |E| and |E'| = 2k|E|. Each vertex in V is replaced in G'
with k vertices. Each edge e in E is replaced in G' with a vertex and 2k edges connecting
this vertex to all of the vertices that replace e's endpoints. We can now apply the disjoint-
connecting paths algorithm of [RS4], since it follows that H ≤_i G if and only if there
exists an injection from the vertices of H to the vertices of G' such that each vertex of H
is mapped to some vertex in G' that replaces a distinct vertex from G and such that G'
contains a set of k vertex-disjoint paths, each one connecting the images of the endpoints
of a distinct edge in H. □
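The replacement of G by G' in the proof can be sketched as follows. This is a minimal illustration of the counting only, assuming a simple graph given as a vertex list and an edge list; the function name and the tuple labels used for the new vertices are hypothetical, and the disjoint-paths step of [RS4] is not attempted.

```python
def build_g_prime(vertices, edges, k):
    """Return (V', E') for a simple graph G = (vertices, edges) and k = |E(H)|."""
    # Each vertex of G is replaced by k copies.
    v_prime = [(v, i) for v in vertices for i in range(k)]
    e_prime = []
    for a, b in edges:
        hub = ("edge", a, b)            # one new vertex per edge of G
        v_prime.append(hub)
        for end in (a, b):
            for i in range(k):
                # 2k new edges: the hub is joined to every copy of both endpoints.
                e_prime.append((hub, (end, i)))
    return v_prime, e_prime


# A triangle with k = 3.
vp, ep = build_g_prime(["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")], 3)
```

For the triangle, |V'| = 3*3 + 3 = 12 and |E'| = 2*3*3 = 18, matching k|V| + |E| and 2k|E|.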
Theorems 3 and 4, like Theorems 1 and 2, guarantee only the existence of a polynomial-
time decision algorithm for any immersion-closed family F of graphs. The method we have
used in proving Theorem 4 yields an obvious time bound of O(n^{h+6}), where h denotes the
order of the largest graph in F's obstruction set. (There are O(n^h) different injections to
consider; the disjoint-paths algorithm takes cubic time on G', a graph of order at most n^2.)
Thanks to the next theorem due to Mader, however, we find that the bound immediately
reduces to O(n^{h+3}), because the problem graphs of interest permit only a linear number
of distinct edges.
Theorem 5. [Ma1] For any graph H there exists a constant c_H such that every simple
graph G = (V, E) with |E| > c_H |V| satisfies G ≥_i H.
We shall show in Section 4 that, by exploiting excluded-minor knowledge on immersion-
closed families, the time complexity for determining membership can in many cases be
reduced to O(n^2).
3 Exploiting the Minor Order
Given a graph G of order n, a linear layout of G is a bijection ℓ from V to {1, 2, ..., n}.
For such a layout ℓ, the vertex separation at location i, s_ℓ(i), is |{u : u ∈ V, ℓ(u) ≤ i, and
there is some v ∈ V such that uv ∈ E and ℓ(v) > i}|. The vertex separation of the entire
layout is s_ℓ = max{s_ℓ(i) : 1 ≤ i ≤ n}, and the vertex separation of G is vs(G) = min{s_ℓ : ℓ
is a linear layout of G}.
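For small graphs the definition can be evaluated directly. The sketch below is a brute-force illustration, not the algorithm whose existence is proved in this section: it tries every layout, so it is only usable on a handful of vertices. The function names are hypothetical, graphs are dictionaries of neighbor sets, and the location of a vertex is taken to be its 1-based index in the layout.

```python
from itertools import permutations


def layout_separation(adj, order):
    """Vertex separation of one layout; order[i - 1] is the vertex at location i."""
    pos = {v: i for i, v in enumerate(order, start=1)}
    return max(
        # Count vertices at locations <= i that have a neighbor at a location > i.
        sum(1 for u in order[:i] if any(pos[v] > i for v in adj[u]))
        for i in range(1, len(order) + 1)
    )


def vertex_separation(adj):
    """vs(G): the minimum layout separation over all n! layouts."""
    return min(layout_separation(adj, p) for p in permutations(adj))


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
star = {"c": {"x", "y", "z"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
```

On these inputs the path and the star have vertex separation 1 and the complete graph on four vertices has vertex separation 3.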
Given both G and a positive integer k, the NP-complete VERTEX SEPARATION
problem [Le] asks whether vs(G) is less than or equal to k. It has previously been reported
that VERTEX SEPARATION can be decided in O(n^{k^2+2k+4}) time [EST], and is thus in P
for any fixed value of k. We now prove that the problem can be solved in time bounded
by a polynomial in n, the degree of which does not depend on k.
Theorem 6. For any fixed k, VERTEX SEPARATION can be decided in O(n^2) time.
Proof. Let k denote any fixed, positive integer. We shall show that the family F of
"yes" instances is closed under the minor ordering. To do this, we must prove that if
vs(G) ≤ k then vs(H) ≤ k for every H ≤_m G. Without loss of generality, we assume that
H is obtained from G by exactly one of these three actions: deleting an edge, deleting an
isolated vertex, or contracting an edge.
If H is obtained from G by deleting an edge, then vs(H) ≤ vs(G) ≤ k because the
vertex separation of any layout of G either remains the same or decreases by 1 with the
removal of an edge. If H is obtained from G by deleting an isolated vertex, then clearly
vs(H) ≤ k.
Suppose H is obtained from G by contracting the edge uv. Let ℓ denote a layout of G
whose vertex separation does not exceed k and assume ℓ(u) < ℓ(v). We contract uv to u
in the layout ℓ' of H as follows: we set ℓ'(x) = ℓ(x) if ℓ(x) < ℓ(v) and set ℓ'(x) = ℓ(x) - 1
if ℓ(x) > ℓ(v). Let us consider the effect of this action on the vertex separation at each
location of the layout. Clearly, s_ℓ'(i) = s_ℓ(i) for 1 ≤ i < ℓ(u). If there exists a vertex w
with ℓ(w) > ℓ(u) and either uw ∈ E or vw ∈ E, then s_ℓ'(ℓ(u)) ≤ s_ℓ(ℓ(u)). Otherwise,
s_ℓ'(ℓ(u)) ≤ s_ℓ(ℓ(u)) - 1. Similar arguments establish that s_ℓ'(i) ≤ s_ℓ(i) for the ranges
ℓ(u) < i < ℓ(v) and ℓ(v) ≤ i < n. Therefore, the vertex separation of ℓ' does not exceed k
and vs(H) ≤ k.
We conclude that, in any case, H is in F and hence F is minor-closed. It remains only
to note that there are trees with arbitrarily large vertex separation. □
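The contraction step of the proof can be checked exhaustively on a small example. The sketch below builds the layout of H described in the proof (the later endpoint is dropped and everything after it shifts left by one) and confirms, for one assumed six-edge graph on five vertices, that no layout gets worse under any single contraction. All function names are hypothetical, and the check is an illustration rather than a proof.

```python
from itertools import permutations


def layout_separation(adj, order):
    """Vertex separation of one layout; order[i - 1] is the vertex at location i."""
    pos = {v: i for i, v in enumerate(order, start=1)}
    return max(
        sum(1 for u in order[:i] if any(pos[v] > i for v in adj[u]))
        for i in range(1, len(order) + 1)
    )


def contract(adj, keep, drop):
    """Contract the edge keep-drop into keep; return a new simple graph."""
    new = {x: set(nbrs) - {drop} for x, nbrs in adj.items() if x != drop}
    for w in adj[drop] - {keep}:
        new[keep].add(w)
        new[w].add(keep)
    return new


def contracted_layout(order, u, v):
    """Keep the earlier endpoint in place and remove the later one from the layout."""
    keep, drop = (u, v) if order.index(u) < order.index(v) else (v, u)
    return keep, drop, [x for x in order if x != drop]


def contraction_never_hurts(adj):
    """Check every layout and every edge of a small graph."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    for order in permutations(adj):
        before = layout_separation(adj, order)
        for u, v in edges:
            keep, drop, new_order = contracted_layout(order, u, v)
            if layout_separation(contract(adj, keep, drop), new_order) > before:
                return False
    return True


g = {0: {1, 4}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {0, 3}}
result = contraction_never_hurts(g)
```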
Given a graph G and a positive integer k, the NP-complete SEARCH NUMBER
problem [Par] asks whether k searchers are sufficient to ensure the capture of a fugitive
who is free to move with arbitrary speed about the edges of G, with complete knowledge
of the location of the searchers. More precisely, we say that every edge of G is initially
contaminated. An edge e = uv becomes clear either when a searcher is moved from u to
v (v to u) while another searcher remains at u (v) or when all edges incident on u (v)
except e are clear and a searcher at u (v) is moved to v (u). (A clear edge e becomes
recontaminated if the movement of a searcher produces a path without searchers between
a contaminated edge and e.) The goal is to determine if there exists a sequence of search
steps that results in all edges being clear simultaneously, where each such step is one of
the following three operations: 1) place a searcher on a vertex, 2) move a searcher along
an edge, or 3) remove a searcher from a vertex. It has been reported that SEARCH
NUMBER is decidable in O(n^{2k^2+4k+8}) time [EST]. As has been independently noted by
Papadimitriou [Pap], however, minor-closure can be applied to reduce this bound.
Theorem 7. For any fixed k, SEARCH NUMBER can be decided in O(n^2) time.
Proof. Straightforward, by showing that, for fixed k, the family of "yes" instances is closed
under the minor ordering and by observing that there are excluded trees. □
Consider next the NP-complete MAX LEAF SPANNING TREE problem [GJ]. Given
a graph G and a positive integer k, this problem asks whether G possesses a spanning tree
in which k or more vertices have degree one. This problem can be solved by brute force
in O(n^{2k+1}) time. (There are (n choose k) ways to select k leaves and O(n) possible
adjacencies to consider at each leaf. For each of these O(n^{2k}) candidate solutions, the
connectivity of the remainder of G can be determined in linear time because there can be
at most a linear number of edges.) Although this means that MAX LEAF SPANNING TREE
is in P for any fixed k, we seek to exploit minor-closure so as to ensure a low-degree
polynomial running time.
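The brute-force approach can be sketched in a few lines. This is a simplified variant rather than a transcription of the counting argument above: instead of enumerating an adjacency for each leaf, it only checks that each chosen leaf has some neighbor among the remaining vertices and that the remaining vertices induce a connected graph. The function names are hypothetical, and the sketch assumes a simple graph with more than k vertices, given as a dictionary of neighbor sets.

```python
from itertools import combinations


def connected(adj, vs):
    """True if the nonempty vertex set vs induces a connected subgraph."""
    vs = set(vs)
    if not vs:
        return False
    start = next(iter(vs))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y in vs and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == vs


def has_leafy_spanning_tree(adj, k):
    """Decide whether some spanning tree has at least k leaves (assumes n > k)."""
    for leaves in combinations(adj, k):
        rest = set(adj) - set(leaves)
        # The non-leaf part must hold together, and every leaf must attach to it.
        if connected(adj, rest) and all(adj[x] & rest for x in leaves):
            return True
    return False


c5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
```

A cycle on five vertices has a spanning tree with two leaves but none with three, while the complete graph on four vertices has a spanning star with three leaves.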
Theorem 8. For any fixed k, MAX LEAF SPANNING TREE can be decided in O(n^2)
time.
Proof. Let k denote any fixed, positive integer. Consider a proper subset of the "no"
instances, the family F of graphs none of whose connected components has a spanning
tree with k or more leaves. F is clearly closed under the minor ordering, from which
the theorem follows because one need only test an input graph for connectedness and
membership in F. □
4 Exploiting the Immersion Order
An embedding of an arbitrary graph G into a fixed constraint graph C is an injection
f : V(G) → V(C) together with an assignment, to each edge uv of G, of a path from
f(u) to f(v) in C. The minimum load factor of G relative to C is the minimum, over all
embeddings of G in C, of the maximum number of paths in the embedding that share a
common edge in C.
For example, for the case in which C is the infinite-length one-dimensional grid, the
minimum load factor of G with respect to C is called the cutwidth of G. In the NP-
complete MIN CUT LINEAR ARRANGEMENT problem [GJ], we are given a graph G
and an integer k, and are asked whether the cutwidth of G is no more than k. Related
NP-complete problems address the cutwidth of G relative to C when C is the infinite-
length, fixed-width two-dimensional grid (2-D GRID LOAD FACTOR) or when C is the
infinite-height binary tree (BINARY TREE LOAD FACTOR).
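For the one-dimensional case, the load factor of a layout is simply the largest number of edges crossing a gap between consecutive vertices. The sketch below computes cutwidth by trying every layout; it is a brute-force illustration for tiny graphs, with hypothetical function names, and it assumes that an optimal embedding places the vertices on consecutive grid points so that only permutations need to be examined.

```python
from itertools import permutations


def layout_cutwidth(edges, order):
    """Largest number of edges crossing any gap between consecutive layout positions."""
    pos = {v: i for i, v in enumerate(order)}
    return max(
        (
            # An edge crosses the gap after position i when its endpoints straddle it.
            sum(1 for u, v in edges if min(pos[u], pos[v]) <= i < max(pos[u], pos[v]))
            for i in range(len(order) - 1)
        ),
        default=0,
    )


def cutwidth(vertices, edges):
    """Minimum layout cutwidth over all permutations of the vertices."""
    return min(layout_cutwidth(edges, p) for p in permutations(vertices))


path_edges = [(0, 1), (1, 2), (2, 3)]
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
k4_edges = [(a, b) for a in range(4) for b in range(a + 1, 4)]
```

A path on four vertices has cutwidth 1, a four-cycle has cutwidth 2, and the complete graph on four vertices has cutwidth 4.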
Theorem 9. For any fixed k and any fixed C, the family of graphs for which the minimum
load factor relative to C is less than or equal to k is closed under the immersion ordering.
Proof. Let an embedding f of G in C with load factor no more than k be given. Suppose
H ≤_i G. If H ⊆ G, then the embedding that restricts f to H clearly has load factor no
more than k. If H is obtained from G by lifting the edges uv and vw incident at vertex
v, then an embedding for H can be defined by assigning to the resulting edge uw the
composition of the paths from u to v and from v to w in C. This cannot increase the load
factor. □
Corollary. For any fixed k, MIN CUT LINEAR ARRANGEMENT, 2-D GRID LOAD
FACTOR and BINARY TREE LOAD FACTOR can be decided in polynomial time.
This result has previously been reported for MIN CUT LINEAR ARRANGEMENT,
using an algorithm with time complexity O(n^{k-1}) [MaS]. We now prove that it is sometimes
possible to employ excluded-minor knowledge on immersion-closed families to guarantee
quadratic-time decision complexity.
Theorem 10. For any fixed k, MIN CUT LINEAR ARRANGEMENT, 2-D GRID LOAD
FACTOR and BINARY TREE LOAD FACTOR can be decided in O(n^2) time.
Proof. For MIN CUT LINEAR ARRANGEMENT, it is known that there are binary trees
with cutwidth exceeding k for any fixed k [CMST]. Let T denote such a tree. Because T
has maximum degree three, it follows that G ≥_m T implies G ≥_i T. Thus no G ≥_m T can
be a "yes" instance (recall that the "yes" family is immersion closed) and we know from
[RS2] that all "yes" instances have bounded tree-width. (Tree-width and the associated
metric branch-width are defined and related to each other in [RS3].) Now one needs only
search for a satisfactory tree-decomposition, using the O(n^2) method of [RS4]. Testing for
obstruction containment in the immersion order can be done in linear time on graphs of
bounded tree-width, given such a tree-decomposition.
Sufficiently large binary trees are excluded for 2-D GRID LOAD FACTOR as well
(recall that both k and the grid-width are fixed).
For BINARY TREE LOAD FACTOR, it is a simple exercise to see that all "yes"
instances have bounded tree-width, by building a tree-decomposition with width at most
3k from a binary tree embedding with load factor at most k. (The decomposition tree T
can be taken to be the finite subtree of C that spans the image of G. For vertex u ∈ V(T),
the set S_u contains the inverse image of u if one exists, and every vertex v ∈ V(G) with an
incident edge that is assigned a path in C that includes u. It follows that |S_u| ≤ 3k + 1.)
5 Other Methods
The application of Theorems 1 through 4 directly ensures polynomial-time decidability. A
less direct approach relies on the well-known notion of polynomial-time transformation, as
we now illustrate with an example. The NP-complete MODIFIED MIN CUT problem
was first introduced in [Le]. Given a linear layout ℓ of a simple graph G, the modified
cutwidth at location i, c_ℓ(i), is |{e : e = uv ∈ E such that ℓ(u) < i and ℓ(v) > i}|. The
modified cutwidth of the entire layout is c_ℓ = max{c_ℓ(i) : 1 ≤ i ≤ n}, and the modified
cutwidth of G is mc(G) = min{c_ℓ : ℓ is a linear layout of G}. Given both G and a positive
integer k, the MODIFIED MIN CUT problem asks whether mc(G) is less than or equal
to k. Observe that, while the MIN CUT LINEAR ARRANGEMENT problem addresses
the number of edges that cross any cut between adjacent vertices in a linear layout, the
MODIFIED MIN CUT problem addresses the number of edges that cross (and do not end
at) any cut on a vertex in the layout.
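The difference from ordinary cutwidth is easy to see in code: the cut sits on a vertex, and only edges passing strictly over that vertex are counted. The following brute-force sketch is for illustration on very small graphs; the function names are hypothetical and locations are 1-based as in the definition.

```python
from itertools import permutations


def layout_modified_cutwidth(edges, order):
    """Largest number of edges passing strictly over any single layout location."""
    pos = {v: i for i, v in enumerate(order, start=1)}
    return max(
        # An edge crosses the cut at location i when one end is before i and one after.
        sum(1 for u, v in edges if min(pos[u], pos[v]) < i < max(pos[u], pos[v]))
        for i in range(1, len(order) + 1)
    )


def modified_cutwidth(vertices, edges):
    """mc(G): minimum over all layouts of the layout's modified cutwidth."""
    return min(layout_modified_cutwidth(edges, p) for p in permutations(vertices))


path_edges = [(0, 1), (1, 2)]
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
k4_edges = [(a, b) for a in range(4) for b in range(a + 1, 4)]
```

A path has modified cutwidth 0 because every edge can join consecutive locations, a four-cycle has modified cutwidth 1, and the complete graph on four vertices has modified cutwidth 2.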
When k is fixed, neither the family of "yes" instances nor the family of "no" instances
for MODIFIED MIN CUT is closed under either of the available orders. Nevertheless, we
can employ a useful consequence of well-partially-ordered sets.
Consequence. [FL2] If (S, ≤) is a well-partially-ordered set that supports polynomial-time
order tests for every fixed element of S, and if there is a polynomial-time computable map
t : D → S such that for F ⊆ D
a) t(F) ⊆ S is closed under ≤ and
b) t(F) ∩ t(D - F) = ∅
then there is a polynomial-time decision algorithm to determine for input z in D whether
z is in F.
To use this result on fixed-k MODIFIED MIN CUT, observe that if any vertex of a
simple graph G has degree greater than 2k + 2, then G is automatically a "no" instance.
Given a simple graph G with maximum degree less than or equal to 2k + 2, we first
augment G with loops as follows: if a vertex v has degree d < 2k + 2, then it receives
(2k + 2) — d new loops. Letting G' denote this augmented version of G, we now replace
G' with the Boolean matrix M, in which each row of M corresponds to an edge of G' and
each column of M corresponds to a vertex of G'. That is, M has |E'| rows and n columns,
with M_ij = 1 if and only if edge i is incident on vertex j. M and the parameter 3k + 2 are
now viewed as input to the GATE MATRIX LAYOUT problem [DKL], in which we are asked
whether the columns of M can be permuted so that, if in each row we change to * every
0 lying between the row's leftmost and rightmost 1, then no column contains more 1s and
*s than the parameter allows (here, 3k + 2). Thus a permutation of the columns of M
corresponds to a linear layout of G. For such a permutation, each * in column i,
1 ≤ i ≤ n, represents a distinct edge crossing a cut at vertex i in the corresponding
layout of G.
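The transformation can be tried out on small inputs. The sketch below represents each row of M by the set of vertices it is incident on, pads every vertex with single-entry loop rows up to degree 2k + 2, and evaluates the gate matrix layout cost of a column permutation by counting, for each column, the rows whose span from leftmost to rightmost 1 covers it. The function names are hypothetical, and the exhaustive search over permutations is for illustration only.

```python
from itertools import permutations


def incidence_rows(vertices, edges, k):
    """Rows of M as vertex sets; loops pad every vertex up to degree 2k + 2."""
    deg = {v: 0 for v in vertices}
    rows = []
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
        rows.append({u, v})
    for v in vertices:
        if deg[v] > 2 * k + 2:
            raise ValueError("degree exceeds 2k + 2: automatically a 'no' instance")
        rows.extend({v} for _ in range(2 * k + 2 - deg[v]))   # one row per loop
    return rows


def gml_cost(rows, order):
    """Largest number of 1s and *s in any column under the given column order."""
    pos = {v: i for i, v in enumerate(order)}
    cost = [0] * len(order)
    for row in rows:
        lo = min(pos[v] for v in row)
        hi = max(pos[v] for v in row)
        for c in range(lo, hi + 1):
            cost[c] += 1          # a 1 at the ends of the span, a * strictly inside
    return max(cost)


def min_gml_cost(vertices, edges, k):
    vertices = list(vertices)
    rows = incidence_rows(vertices, edges, k)
    return min(gml_cost(rows, p) for p in permutations(vertices))


cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
k4_edges = [(a, b) for a in range(4) for b in range(a + 1, 4)]
```

With k = 1, the four-cycle (modified cutwidth 1) reaches cost 5 = 3k + 2 and is a "yes" instance, while the complete graph on four vertices (modified cutwidth 2) cannot do better than 6 and is a "no" instance.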
Theorem 11. For any fixed k, MODIFIED MIN CUT can be decided in O(n^2) time.
Proof. We apply the consequence, using the set of all graphs for S, ≤_m for ≤, the set of
simple graphs of maximum degree 2k + 2 for D, the family of "yes" instances in D for
F, and the composition of the map just defined from graphs to matrices with the map of
[FL1] from matrices to graphs for t. Testing for membership in D and computing t are
easily accomplished in O(n^2) time. That t(F) is closed under ≤_m and excludes a planar
graph for any fixed k is established in [FL1]. Finally, condition b) holds because, for any
G in D, t(G) is a "yes" instance for GATE MATRIX LAYOUT with parameter 3k + 2 if
and only if G is a "yes" instance for MODIFIED MIN CUT with parameter k. □
6 Search Problems
Given a decision problem Π_D and its search version Π_S, any method that pinpoints a
solution to Π_S by repeated calls to an algorithm that answers Π_D is termed a self-reduction.
This simple notion has been formalized with various refinements in the literature, but the
goal remains the same: use the existence of a decision algorithm to prove the existence of
a search algorithm.
It sometimes suffices to fatten up a graph by adding edges to isolate a solution. For
example, this strategy can be employed to construct solutions to (fixed-k) GATE MATRIX
LAYOUT, when any exist, in O(n^4) time [BFL]. It follows from the proof of Theorem
11 that the same can be said for MODIFIED MIN CUT as well. We leave it to the
reader to verify that such a scheme works for the search version of (fixed-k) VERTEX
SEPARATION, by attempting to add each edge in V × V - E in arbitrary order, retaining
in turn only those whose addition maintains a "yes" instance, and at the end reading off a
satisfactory layout (from right to left) by successively removing a vertex of smallest degree.
This self-reduction automatically solves the search version of SEARCH NUMBER too (see
the discussion of "2-expansions" in [EST]).
Conversely, it is sometimes possible to trim down a graph by deleting edges so as to
isolate a solution. It is easy to see that this simple strategy yields an O(n^4) time algorithm
for the search version of (fixed-k) MAX LEAF SPANNING TREE, by attempting to delete
each edge in E in arbitrary order, retaining in turn only those whose deletion does not
maintain a "yes" instance.
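The edge-deletion strategy can be sketched with any decision procedure standing in as the oracle. In the illustration below the oracle is a brute-force test, not the quadratic-time algorithm whose existence is proved in Section 3, so the running time is not the one stated above. The function names are hypothetical, vertices are assumed to be comparable labels, and the graph is assumed to have more than k vertices.

```python
from itertools import combinations


def connected(adj, vs):
    """True if the nonempty vertex set vs induces a connected subgraph."""
    vs = set(vs)
    if not vs:
        return False
    start = next(iter(vs))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if y in vs and y not in seen:
                seen.add(y)
                stack.append(y)
    return seen == vs


def oracle(adj, k):
    """Brute-force stand-in for the decision algorithm (assumes more than k vertices)."""
    for leaves in combinations(adj, k):
        rest = set(adj) - set(leaves)
        if connected(adj, rest) and all(adj[x] & rest for x in leaves):
            return True
    return False


def find_leafy_tree(adj, k):
    """Self-reduction: delete edges one by one, keeping only those the answer depends on."""
    if not oracle(adj, k):
        return None
    g = {v: set(nbrs) for v, nbrs in adj.items()}
    for u, v in sorted((u, v) for u in adj for v in adj[u] if u < v):
        g[u].discard(v)
        g[v].discard(u)
        if not oracle(g, k):
            # Deleting uv destroyed the "yes" answer, so the edge is retained.
            g[u].add(v)
            g[v].add(u)
    return g


k4 = {i: {j for j in range(4) if j != i} for i in range(4)}
c5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
tree = find_leafy_tree(k4, 3)
```

Because adding edges never turns a "yes" instance into a "no" instance, every edge retained during the pass is still necessary at the end, so the surviving graph is itself a spanning tree with at least k leaves.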
Another technique involves the use of graph gadgets. A simple gadget, consisting of two
new vertices with k edges between them, is useful in constructing a solution to (fixed-k)
MIN CUT LINEAR ARRANGEMENT, when any exist, in O(n^4) time [BFL]. A similar
use of gadgets enables efficient self-reductions for load factor problems. (On BINARY
TREE LOAD FACTOR, for example, one can begin by using two k-edge gadgets uv and
wx to locate a vertex y of the input graph that can be mapped to a leaf of the constraint
tree by identifying u, w and y.)
In addition to these rather straightforward techniques, faster but more elaborate meth-
ods are described in [FL4, FL5].
7 Concluding Remarks
The range of problems amenable to an approach based on well-partially-ordered sets is
remarkable. Although the problems we have addressed in this paper are all fixed-parameter
versions of problems that are NP-hard in general, we remind the reader that by fixing
parameters one does not automatically trivialize problems and thereby obtain polynomial-
time decidability (consider, for example, GRAPH k-COLORABILITY [GJ]). Moreover,
the techniques we have employed can be used to guarantee membership in P for problems
that have no associated (fixed) parameter [FL2].
The results we have derived here immediately extend to hypergraph problem vari-
ants as long as hypergraph instances can be efficiently reduced to graph instances. For
example, such reductions are known for HYPERGRAPH VERTEX SEPARATION and
HYPERGRAPH MODIFIED MIN CUT [MiS, Su]. Nevertheless, Table 1 suffers from
one notable omission, namely, BANDWIDTH [GJ]. The only success reported to date has
concerned restricted instances of TOPOLOGICAL BANDWIDTH. Both BANDWIDTH
and the related EDGE BANDWIDTH problem [FL3] have so far resisted this general line
of attack. Clearly, BANDWIDTH is at least superficially similar to other layout permuta-
tion problems we have addressed, and fixed-k BANDWIDTH, like the others, is solvable
in (high-degree) polynomial time with dynamic programming [GS]. But perhaps BAND-
WIDTH really is different; it is one of the very few problems that remain NP-complete
when restricted to trees [GGJK].
Finally, we observe that even partial-orders that fail to be well-partial-orders (on the
set of all graphs) may be useful. For example, although it is well known that graphs are not
well-partially-ordered under the topological order, it has been shown [Ma2] that all graphs
without h vertex-disjoint cycles are well-partially-ordered under topological containment.
Also, polynomial-time order tests exist [RS4]. Problems such as (fixed-k) TOPOLOGICAL
BANDWIDTH, therefore, are decidable in polynomial time as long as the input is restricted
to graphs with no more than h disjoint cycles (for fixed h). Similarly, one might employ
the result [Se] that graphs without a path of length h, for h fixed, are well-partially-ordered
under subgraph containment.
8 Bibliography
[BFL] D. J. Brown, M. R. Fellows and M. A. Langston, "Polynomial-Time Self-Reducibility:
Theoretical Motivations and Practical Results," Int'l J. of Computer Mathematics,
to appear.
[CMST] M-J Chung, F. Makedon, I. H. Sudborough and J. Turner, "Polynomial Time
Algorithms for the Min Cut Problem on Degree Restricted Trees," SIAM J. on
Computing 14 (1985), 158-177.
[DKL] N. Deo, M. S. Krishnamoorthy and M. A. Langston, "Exact and Approximate
Solutions for the Gate Matrix Layout Problem," IEEE Trans. on Computer-Aided
Design 6 (1987), 79-84.
[EST] J. A. Ellis, I. H. Sudborough and J. S. Turner, "Graph Separation and Search
Number," Proc. 21st Allerton Conf. on Communication, Control and Computing
(1983), 224-233.
[FL1] M. R. Fellows and M. A. Langston, "Nonconstructive Advances in Polynomial-Time
Complexity," Information Processing Letters 26 (1987), 157-162.
[FL2] M. R. Fellows and M. A. Langston, "Nonconstructive Tools for Proving
Polynomial-Time Decidability," J. of the ACM 35 (1988), 727-739.
[FL3] M. R. Fellows and M. A. Langston, "Layout Permutation Problems and
Well-Partially-Ordered Sets," Proc. 5th MIT Conf. on Advanced Research in VLSI
(1988), 315-327.
[FL4] M. R. Fellows and M. A. Langston, "Fast Self-Reduction Algorithms for
Combinatorial Problems of VLSI Design," Proc. 3rd Aegean Workshop on Computing
(1988), 278-287.
[FL5] M. R. Fellows and M. A. Langston, "On Search, Decision and the Efficiency of
Polynomial-Time Algorithms," Proc. 21st ACM Symp. on Theory of Computing
(1989), 501-512.
[GGJK] M. R. Garey, R. L. Graham, D. S. Johnson and D. E. Knuth, "Complexity Results
for Bandwidth Minimization," SIAM J. on Applied Mathematics 34 (1978), 477-495.
[GJ] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory
of NP-Completeness, Freeman, San Francisco, CA, 1979.
[GS] E. M. Gurari and I. H. Sudborough, "Improved Dynamic Programming Algorithms
for Bandwidth Minimization and the Min Cut Linear Arrangement Problem," J. of
Algorithms 5 (1984), 531-546.

[Ma1] W. Mader, "Hinreichende Bedingungen für die Existenz von Teilgraphen, die zu
einem vollständigen Graphen homöomorph sind," Math. Nachr. 53 (1972), 145-150.
[Ma2] W. Mader, "Wohlquasigeordnete Klassen endlicher Graphen," J. Combinatorial
Theory Series B 12 (1972), 105-122.
[Ma3] W. Mader, "A Reduction Method for Edge-Connectivity in Graphs," Annals of
Discrete Mathematics 3 (1978), 145-164.
[MaS] F. S. Makedon and I. H. Sudborough, "On Minimizing Width in Linear Layouts,"

[MiS] Z. Miller and I. H. Sudborough, "Polynomial Algorithms for Recognizing Small
Cutwidth in Hypergraphs," Proc. 2nd Aegean Workshop on Computing (1986),
252-260.
[Na] C. Nash-Williams, "On Well-Quasi-Ordering Infinite Trees," Proc. Cambridge
Philosophical Society 61 (1965), 697-720.
[Pap] C. H. Papadimitriou, private communication.
[Par] T. D. Parsons, "Pursuit-Evasion in a Graph," in Theory and Application of Graphs
(Y. Alavi and D. R. Lick, eds.), Springer-Verlag, 1976, 426-441.
[Rob] N. Robertson, private communication.
[RS1] N. Robertson and P. D. Seymour, "Graph Minors IV. Tree-Width and Well-Quasi-
Ordering," J. Combinatorial Theory Series B, to appear.
[RS2] N. Robertson and P. D. Seymour, "Graph Minors V. Excluding a Planar Graph,"
J. Combinatorial Theory Series B 41 (1986), 92-114.
[RS3] N. Robertson and P. D. Seymour, "Graph Minors X. Obstructions to
Tree-Decomposition," to appear.
[RS4] N. Robertson and P. D. Seymour, "Graph Minors XIII. The Disjoint Paths
Problem," to appear.
[RS5] N. Robertson and P. D. Seymour, "Graph Minors XVI. Wagner's Conjecture,"
to appear.
[Se] P. D. Seymour, private communication.
[Su] I. H. Sudborough, private communication.
[Wa] K. Wagner, "Über eine Eigenschaft der ebenen Komplexe," Math. Ann. 114 (1937),
570-590.
9 Footnote
• A preliminary version of a portion of this paper [FL3] was presented at the Fifth
MIT Conference on Advanced Research in VLSI held in Cambridge, Massachusetts,
in March, 1988.
• Michael R. Fellows's research has been supported in part by the National Science
Foundation under grants MIP-8603879 and MIP-8919312, by the Office of Naval
Research under contract N00014-88-K-0456, and by the National Aeronautics and
Space Administration under engineering research center grant NAGW-1406.
• Michael A. Langston's research has been supported in part by the National Science
Foundation under grants MIP-8603879 and MIP-8919312, and by the Office of Naval
Research under contract N00014-88-K-0343.
N94-71089
Burst Error Correction Extensions




Abstract- Reed Solomon codes are powerful error correcting codes that include
some of the best random and burst correcting codes currently known. It
is well known that an (n, k) Reed Solomon code can correct up to (n - k)/2
errors. Many applications utilizing Reed Solomon codes require correction of
errors consisting primarily of bursts. In this paper, it is shown that the burst
correcting ability of Reed Solomon codes can be increased beyond (n - k)/2
with an acceptable probability of miscorrect.
1 Decoding Burst Errors in a Reed Solomon Code
The random error correcting ability of a code is set by the minimum Hamming distance
of the code, d*. The Hamming distance between two codewords is the number of symbols
in which the codewords are different. A code can correct all error patterns of t or fewer
symbols where
2t ≤ d* - 1     (1)
An upper bound for the minimum distance of a code given the number of parity symbols
added to form a linear (n, k) code is given by the Singleton Bound
d* - 1 ≤ n - k     (2)
The minimum number of random errors that a code can correct is found at the point of
equality. Any code that meets the bound with equality is said to be a Maximum Distance
code. Combining Equations 1 and 2 gives
2t ≤ n - k     (3)
Equation 3 says that a code that corrects all patterns of t or fewer symbol errors requires
at least 2t parity symbols.
The burst correcting ability of a linear code can also be related to the number of parity
symbols added to form the (n, k) codeword. The Rieger Bound [2] states: "In order to
correct all bursts of length t or less, a linear block code must have at least 2t parity
symbols." Any code that meets the Singleton Bound with equality also meets the Rieger
Bound with equality. The Reed Solomon code is such a code.
A maximum distance code, such as the Reed Solomon code, can correct all combinations
of (n - k)/2 random symbol errors. It can also correct all bursts of length (n - k)/2 symbols,
which are a subset of the random error patterns that can be corrected. The Reed Solomon
code can correct t errors, either randomly placed or contiguously placed as in a burst.
Many applications that require burst error correction use a Reed Solomon code. However,
there is a significant amount of information that is not being used by the code in a burst
error environment. When t errors occur at random, there are 2t unknowns: the t locations
and the t magnitudes. However, when a burst of length t occurs, only the t magnitudes
and the location of the burst are unknown. This accounts for only t + 1 variables that need
to be solved for. Less information is required to solve for the burst.
In this section, Reed Solomon codes are studied as burst error correcting codes. The
amount of extra information that is available when the errors are known a priori to be
bursts is found. The added burst correcting ability of the code is studied and the cost of
increasing the burst length to be corrected beyond (n - k)/2 is found.
1.1 Definitions and Conjectures
To begin the study, some terms need to be defined. Also, some assumptions concerning
the occurrence of burst errors are stated.
Definition 1 Reed Solomon codes are linear codes and therefore are a subspace of $GF(q)^n$.
$GF(q)^n$ is the n dimensional vector space over the finite field GF(q). As a linear code, Reed
Solomon codes satisfy the properties of a group. As such, the difference between any two
codewords is also a codeword. Consequently, the relationship of any particular codeword
to all other points in the space is isomorphic to the relationship of any other codeword
to the vector space. The distance properties of any one codeword to all other codewords
is also isomorphic to the distance properties for all codewords. Because of this, a study
of the distance properties of any single codeword provides all the distance information for
the complete code.
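Definition 2 A burst polynomial, b(x), representing a burst of length $l$, located at position
$c$, is a polynomial whose coefficients $b_i$ satisfy
$$b_i = 0 \text{ for } i < c \text{ and } i > c + l - 1, \quad b_c \ne 0, \quad b_{c+l-1} \ne 0. \qquad (4)$$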
A burst can have many zero coefficients within the burst itself. The length of the burst
is the number of coefficients from the first error symbol to the last error symbol inclusive
in the block. A burst is considered to be located at the position of its least significant
coefficient.
Definition 3 The burst distance $d_b$ between any two codewords is the burst length of the
difference between the two words.
This is analogous to the Hamming distance [1] between two code words. The Hamming
distance between two codewords is the number of symbols in which they differ. The burst
distance between two codewords is the smallest burst that can be added to one to get the
other.
The Hamming distance measures distance with no constraints on location, much as
a random noise source corrupts data with equal probability. The Hamming distance is
a metric for the random error correcting ability of an error correcting code. The burst
distance measures distance with the additional constraint that the locations must be
consecutive. The burst distance is a metric for the burst error correcting ability of an error
correcting code.
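The two distance measures can be compared with a minimal Python sketch. The function names and the sample words below are illustrative assumptions, not part of the paper. Because the burst distance depends only on which positions differ, the sketch needs no finite field arithmetic.

```python
def hamming_distance(a, b):
    """Number of positions in which two equal-length words differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def burst_length(word):
    """Span from the first to the last nonzero symbol, inclusive (0 if all zero)."""
    nonzero = [i for i, s in enumerate(word) if s != 0]
    if not nonzero:
        return 0
    return nonzero[-1] - nonzero[0] + 1


def burst_distance(a, b):
    """Burst length of the difference between two equal-length words."""
    # Mark each position where the words differ; the field values do not matter.
    diff = [0 if x == y else 1 for x, y in zip(a, b)]
    return burst_length(diff)


# Hypothetical 7-symbol words; symbols 1 and 4 are in error.
sent = [3, 0, 5, 1, 0, 0, 2]
received = [3, 4, 5, 1, 6, 0, 2]
print(hamming_distance(sent, received))  # 2
print(burst_distance(sent, received))    # 4
```

Two symbol errors three positions apart give a Hamming distance of 2 but a burst distance of 4, since the burst spans every position from the first error to the last.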
Theorem 1 The minimum burst distance $d_b$ of a linear code is the burst length of the
nonzero codeword with smallest burst length.
Proof For a linear code, the difference between any two codewords is another codeword.
Therefore, the minimum burst distance of the code $d_b$ is equal to the burst length of the
nonzero codeword with the smallest burst length. □
This is analogous to the minimum Hamming distance of a code. A burst error correcting
code can correct all burst patterns of length t or less where
$$2t \le d_b - 1 \qquad (5)$$
The Singleton Bound still holds for burst distance, because the burst distance between two
codewords can never be greater than the Hamming distance. Therefore
$$d_b - 1 \le n - k \qquad (6)$$
Combining Equations 5 and 6 gives
$$2t \le d_b - 1 \le n - k$$
with the result
$$2t \le n - k \qquad (7)$$
which is the Rieger Bound.
As seen previously, the Rieger Bound states that a code cannot correct all bursts of
length t if fewer than 2t parity symbols are added to form the codeword.
Definition 4 A burst shell of radius r contains all of the points that are equal to a burst
distance of r from the code polynomial at the center.
Definition 5 A burst sphere of radius r contains all of the points that are a burst distance
of r or less from the code polynomial at the center.
A "normal" sphere, as distinguished from a burst sphere, of radius r contains all points
that are Hamming distance r or less from the center point. A burst sphere of radius t is
a subset of a sphere of radius t, since bursts of length t or less are included in the set of
words of weight t or less. It should be noted that a point in the sphere of radius two could
be a point in the outer shell of the burst sphere of radius n/2 where n is the blocklength
of the code.
Definition 6 A decoding sphere is a sphere with a codeword as the center point.
A t random error correcting code corrects all points within a sphere of radius t about the
codewords. Likewise for a burst error correcting code, the code corrects all points within
the burst sphere.
An error correcting code has a decoding sphere about each of the codewords: a "normal"
sphere for a random error correcting code and a burst sphere for a burst correcting code.
Current theory assumes that the spheres about different codewords do not overlap. It
is the purpose of this work to show that choosing to correct errors of a magnitude such
that the decoding spheres overlap is of benefit. Because there is an overlap, there will be
points in the space that could be corrected to more than one code word. Any point within
this intersection region of two decoding spheres could be corrected to the codeword at the
center of either sphere. Decoding decisions for these points are presented in Section 1.2.
Definition 7 $l_t$ is the radius of the largest non-overlapping burst spheres.
For Reed Solomon codes $l_t$ is equal to (n - k)/2.
If the radii of the burst decoding spheres are allowed to grow beyond $l_t$, then the spheres
begin to overlap. Overlap occurs when the Rieger bound is violated. Points within the
area of overlap might not be correctable because they belong to more than one sphere. If a
point is at burst distance greater than $l_t$ from a codeword but is an element of only one
sphere, then that point can be corrected unambiguously. The following term is used to
define the radius of the expanded sphere.
Definition 8 The maximum burst that the code attempts to correct is $l_{max}$.
$l_{max}$ is the length of the longest burst that the decoder attempts to correct. As $l_{max}$
increases beyond $l_t$, the burst spheres begin to overlap. Bursts of length greater than $l_{max}$
should be detected by the decoder as an uncorrectable error condition.
Since the spheres are allowed to overlap, the decoder will not always be able to correct
bursts of length $l_{max}$. The following two definitions name the conditions when the decoder
fails. The decoder is designed to correct bursts of length $l_{max}$.
Definition 9 A miscorrect occurs when the decoder fails to correct to the right codeword
when a correctable burst of length $l_{max}$ or less has occurred.
Definition 10 A misdetect occurs when a burst of length greater than $l_{max}$ has occurred
and the decoder corrects to a codeword.
When an error occurs that is larger than the code can correct, i.e. greater than $l_{max}$,
then the desired decoding action would be error detection. If the received vector is in
a decoding sphere and a decision to correct is made, then a misdetect has occurred. To
summarize, if a burst error of length less than or equal to $l_{max}$ occurs and the decoder
does not correct to the right codeword, then a miscorrect has occurred. If a burst error of
length greater than $l_{max}$ occurs, and the decoder corrects to a codeword, then a misdetect
has occurred.
Definition 11 The minimum sized burst that can be miscorrected is $l_{min}$.
The length $l_{min}$ is the radius of the points that do not overlap with any other sphere. This
does not mean that all points of distance greater than $l_{min}$ overlap with other spheres,
but all points of distance less than or equal to $l_{min}$ do not overlap with any other spheres.
For n - k even,
$$l_{min} = n - k - l_{max} + 1 \qquad (8)$$
For n - k odd,
$$l_{min} = n - k - l_{max} \qquad (9)$$
Conjecture 1 All bursts of length $l_i$ are equally likely.
If this is not true, then a code can be created that puts the more likely burst patterns into
more favorable regions. The arguments developed in this section assume that burst errors
of identical length are equally likely.
Conjecture 2 For all $l_m$ and $l_n$, and for some $l_t$, such that $l_t < l_n$ and $l_m < l_n$, the
probability of a burst of length $l_n$ is less than the probability of a burst of length $l_m$.
The second conjecture assumes that longer bursts are less likely than shorter bursts as
long as the length of the longer burst is greater than $l_t$.
1.2 Decoding Decisions for Points in the Overlap Region
Any burst pattern that moves a codeword into one of the regions of overlap is a pattern
that cannot unambiguously be detected. However it might be possible to identify all of
the spheres of which it is an element. If a received word is in the area of overlap, then
there are five possibilities.
1. The received word is within a burst sphere of radius $l_t$ with the sent codeword at the
center of the sphere. By Conjecture 2, the decision should be made to correct to the
closer codeword and the decoder would function correctly.
2. The burst distance between the received word and the sent codeword is less than
the burst distance between the received word and any other codeword. If all of the
spheres can be identified, then by Conjecture 2, the shorter burst should be chosen,
and again the decoder would function correctly.
3. The burst distance between the received word and the sent codeword is less than
or equal to the burst distance between the received word and any other codeword
and equal to at least one of the other burst distances. In this case, an unambiguous
decoding choice cannot be made. However, the burst distance between the received
word and the sent codeword is less than or equal to $l_{max}$ and should have been
correctable. The proper decoding choice is to detect an uncorrectable error. This is
a miscorrect in the sense that the burst is of a length that is considered correctable,
but the decoder cannot make an unambiguous decoding decision.
4. The burst distance between the received word and the sent codeword is greater than
the burst distance between the received word and another codeword, but less than
$l_{max}$. The decoder will not correct to the sent codeword and a miscorrect occurs.
5. The burst distance between the received word and the sent codeword is greater than
$l_{max}$ but within another burst decoding sphere. The error should be detected, but
will be corrected to the other codeword. This is a misdetect. It should be noted that
all decoders suffer from this problem.
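The five cases above reduce to a simple rule: correct to the unique closest codeword within $l_{max}$, otherwise flag an uncorrectable error. The sketch below illustrates that rule only. The function `decide`, its dictionary input, and the labels are hypothetical, and a real decoder would derive the candidate bursts from the syndrome rather than from a table of distances.

```python
def decide(distances, l_max):
    """Pick a codeword for a received word, or None to flag an uncorrectable error.

    distances: dict mapping a codeword label to its burst distance from the
               received word (hypothetical input used only for illustration).
    l_max:     longest burst the decoder attempts to correct.
    """
    candidates = {c: d for c, d in distances.items() if d <= l_max}
    if not candidates:
        return None                      # outside every decoding sphere: detect
    best = min(candidates.values())
    closest = [c for c, d in candidates.items() if d == best]
    if len(closest) > 1:
        return None                      # tie (case 3): no unambiguous choice
    return closest[0]                    # unique shortest burst (cases 1, 2, 4, 5)


print(decide({"sent": 17, "other": 19}, l_max=20))  # case 2: correct decoding
print(decide({"sent": 18, "other": 18}, l_max=20))  # case 3: detected, not corrected
print(decide({"sent": 19, "other": 17}, l_max=20))  # case 4: miscorrect
print(decide({"sent": 23, "other": 18}, l_max=20))  # case 5: misdetect
```

The decoder cannot tell cases 4 and 5 from a correct decode, because it never knows which codeword was sent. Only the tie in case 3 is visible to it.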
The next task is to find the probability of miscorrect given that a burst error of length
$l_j$ has occurred. This is done by finding the number of burst errors of length $l_j$ that exist.
Then the number of those errors that result in miscorrects is found. The ratio between the
two is the probability of miscorrect given that a burst of length $l_j$ has occurred. The first
number is equal to the volume of the burst shell of radius $l_j$. The second number is equal
to the volume of the overlap region of that shell with all other shells with radius less than
or equal to $l_j$.
1.3 Volumes for Burst Shells
Two volumes need to be quantified. The first is the volume of the shell of burst radius $l_j$.
This represents the number of ways that a burst error of length $l_j$ can occur. The second
volume is the intersection of a burst shell of radius $l_j$ with all burst shells of radius $l_k$. This
is the number of potential received words within the first shell that are a distance $l_k$ from
another codeword.
1.3.1 Burst Shell Volume
The objective in this section is to find the volume of a burst shell about a code polynomial.
The shell volume is a function of three parameters:
1. n, the blocklength of the code.
2. q, the number of symbols in the field over which the code is defined.
3. $l_j$, the radius of the shell.
For a full length Reed Solomon code, n = q —1.
Any burst of length $l_j$ added to the codeword at the center results in a point that is
in the shell. Each point represents a received word that is burst distance $l_j$ from the sent
codeword and the number of such points in the shell is the volume of the shell.
Consider the all zero codeword. The volume of the shell of burst radius $l_j$ about the
all zero codeword is equal to the number of bursts of length $l_j$. The method for finding
the total number of bursts of length $l_j$ is combinatoric. The following theorem gives the
volume of a shell of burst radius $l_j$.
Theorem 2 The volume $V_j$ of a burst shell of burst radius $l_j$ is:
$$V_0 = 1 \qquad (10)$$
$$V_1 = n(q - 1) \qquad (11)$$
and for $l_j > 1$
$$V_j = (n - l_j + 1)(q - 1)^2 q^{l_j - 2} \qquad (12)$$
Proof The shell of burst radius 0 includes only one point, therefore the volume is 1. This
would correspond to no errors.
The shell of burst radius 1 contains all of the bursts of length 1. There are n possible
locations and in each location the value must be nonzero, therefore there are q - 1 possible
values for each location. This would correspond to one random error.
For a burst of length greater than 1, there are $n - l_j + 1$ different ways to place the
burst, b(x). For each placement of the burst, the two endpoints of the burst, which are the
coefficients $b_0$ and $b_{l_j-1}$, must be nonzero, therefore there are $(q - 1)^2$ ways to choose the
two endpoints. The $l_j - 2$ interior points of the burst, which are the coefficients $b_1$ through
$b_{l_j-2}$, can take on any of q values. Therefore there are $q^{l_j-2}$ ways to choose the interior
points for each burst location. Therefore, the total volume for a burst shell of radius $l_j$ is
$(n - l_j + 1)(q - 1)^2 q^{l_j - 2}$. □
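The counting argument of Theorem 2 can be checked by brute force on a small block. The following Python sketch is only illustrative: the function names are assumptions, and the toy sizes n = 5, q = 3 were chosen so that all $q^n$ words can be enumerated quickly. Symbols are represented as integers 0 to q - 1, which is enough for counting.

```python
from itertools import product


def shell_volume(n, q, l):
    """Closed form of Theorem 2 for a burst shell of radius l (0 <= l <= n)."""
    if l == 0:
        return 1
    if l == 1:
        return n * (q - 1)
    return (n - l + 1) * (q - 1) ** 2 * q ** (l - 2)


def burst_length(word):
    """Span from the first to the last nonzero symbol, inclusive (0 if all zero)."""
    nonzero = [i for i, s in enumerate(word) if s != 0]
    return nonzero[-1] - nonzero[0] + 1 if nonzero else 0


def count_bursts(n, q):
    """Brute-force count of the words of each burst length in a block of n symbols."""
    counts = [0] * (n + 1)
    for word in product(range(q), repeat=n):
        counts[burst_length(word)] += 1
    return counts


n, q = 5, 3  # toy sizes, assumed only so that enumeration is fast
assert count_bursts(n, q) == [shell_volume(n, q, l) for l in range(n + 1)]

# The same closed form applied to the full-length code over GF(256).
print(shell_volume(255, 256, 17))
```

Every word has exactly one burst length, so the shell volumes for radii 0 through n must also add up to $q^n$, which gives a second check on the formula.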
1.3.2 Volume of Overlap
Points that are in the overlap region are the points that can cause a decoding failure.
If the volume of the region is small compared to the total volume of the shell, then the
probability of landing in the overlap region is also small.
In investigating the overlap region, the codeword at the center of a burst sphere $C_j$ of
radius $l_j$ is considered to be the sent codeword; the codeword at the center of any other
sphere $C_i$, $i \ne j$, is considered to be erroneous. The study is from the point of view of the
sent codeword. Burst shells about the sent codeword are studied to find how many points
in the shell are a burst distance equal to $l_k$ from another codeword where $l_k$ is any burst
distance less than or equal to $l_j$.
The following two definitions define the two different types of bursts.
Definition 12 The real burst, bj (x), is the difference between the point in the region of
overlap and the sent codeword cj(x).
Definition 13 The phantom burst, bk (x), is the difference between the point in the region
of overlap and a codeword, ck(x), which is not the sent codeword.
It will be shown that if the difference of the two bursts is a codeword, then the point is in
the overlap region. The probability of miscorrect that is calculated is conditioned on the
probability of the real burst occurring. The above two definitions provide a mechanism to
distinguish between the actual error and the error that is miscorrected.
The overlap region is the intersection between the two spheres, one of burst radius $l_j$
and the other of burst radius $l_k$. In this section, the volume of the intersection of the two
shells is found as a function of the blocklength n, the two burst lengths $l_j$ and $l_k$, the size
of the symbol space q, and the number of parity symbols n - k.
Theorem 3 For an (n, k) Reed Solomon code, any set of n — k symbols can be expressed
as n — k functions dependent on the other k symbols.
Theorem 4 The minimum Hamming weight of a Reed Solomon codeword with n - k parity
symbols is n - k + 1.
Theorem 5 The difference between two codewords is a codeword.
The above three theorems have been proved numerous times. For example, see [1][2].
Theorem 6 If a received polynomial $v_j(x) = c_j(x) + e_j(x)$ is in the region of overlap,
then $v_j(x) = c_k(x) + e_k(x)$ where the difference of the two errors is also a codeword $c_i(x)$,
$c_j(x) \ne c_k(x)$ and $c_i(x)$ has at least n - k + 1 nonzero coefficients.
Proof For the first part of the theorem,
$$c_j(x) + e_j(x) = c_k(x) + e_k(x)$$
$$c_j(x) - c_k(x) = e_k(x) - e_j(x)$$
By Theorem 5,
$$c_j(x) - c_k(x) = c_i(x)$$
where $c_i(x)$ is a codeword. Therefore
$$c_i(x) = e_k(x) - e_j(x)$$
For the second part, by Theorem 4, the minimum Hamming weight of a Reed Solomon
codeword with n - k parity symbols is n - k + 1. If $e_k(x) - e_j(x)$ is a codeword, then its Hamming
weight must be greater than or equal to n - k + 1. □
If the code alphabet is a characteristic 2 alphabet, i.e., the Galois field is $GF(2^m)$ where
m is a positive integer, then subtraction in the field is identical to addition. In that case,
if the sum of two bursts is a codeword, then either of the bursts added to a codeword will
be in the region of overlap.
Theorem 7 If a point in the region of overlap is a burst distance $l_j$ from a codeword $c_j$,
and a burst distance $l_k$ from a codeword $c_k$, then
$$l_k + l_j > n - k \qquad (13)$$
Proof By Theorem 6 the sum of the two bursts must be a codeword. The Hamming
weight $w_j$ of $b_j$ must satisfy
$$1 \le w_j \le l_j$$
The Hamming weight $w_k$ of $b_k$ must satisfy
$$1 \le w_k \le l_k$$
Therefore
$$l_j + l_k \ge w_k + w_j$$
The weight of the codeword is at most $w_k + w_j$. By Theorem 4, all codewords have a weight greater
than n - k. Therefore
$$l_j + l_k \ge w_j + w_k > n - k$$
Enough groundwork has now been laid to find the volume of the region of overlap
between a burst shell of radius $l_j$ and all other burst shells of radius $l_k$. This region is
defined below.
Definition 14 $V_{j \cap k}$ is the intersection of the burst shell of radius $l_j$ about the code
polynomial $c_j(x)$ with all other burst shells of radius $l_k$.
In the following theorem, let $c_j(x)$ represent the sent codeword, $b_j(x)$ represent the
actual burst error, and $v_j(x)$ represent the received word which is in the overlap region.
Let $b_k(x)$ represent the phantom burst and let $c_k(x)$ represent the codeword at the center
of the intersecting shell of burst radius $l_k$. In this case








Proof By Theorem 6, $v_j(x)$ is in the region of overlap if the difference of $b_j(x)$ and $b_k(x)$
is a code polynomial. The right side of Equation 14 consists of two parts. The first part
is the number of ways that two bursts, one of length $l_j$ and the other of length $l_k$, can be
placed in a block of length n without overlap. The second part of Equation 14 counts the
number of polynomials for each placement that are code polynomials.
First, there are $n - l_j + 1$ ways that the burst $b_j(x)$ of length $l_j$ can be placed in a block
of length n. This is broken into two cases. For the first case, the distance between the
edge of the block and $b_j(x)$ is less than $l_k$. This is true for $2 l_k$ of the locations. In this case
the number of ways to place the burst $b_k(x)$ is $n - l_j - l_k - i + 1$ where i is the distance
between the edge of the block and $b_j(x)$. This is the first term in Equation 14 summed
over all $i < l_k$ for both ends of the block.
In the second case, the burst $b_j(x)$ is located far enough from the edge such that $b_k(x)$
can be located on either side of it. There are $n - 2 l_k - l_j + 1$ of these locations. For each
of these locations there are $n - l_j - 2 l_k + 2$ ways to locate $b_k(x)$. This is the second term
of Equation 14.
For each of the ways of placing the bursts, there are $l_k + l_j$ coefficients that are included
in either of the two bursts. By Theorem 6 the sum of the two bursts must be a codeword.
By Theorem 3, for the sum to be a codeword, n - k of the coefficients must be uniquely
specified. Of the $(q - 1)^4 q^{l_k + l_j - 4}$ ways of choosing the two bursts, only $q^{l_k + l_j - (n - k)} - 1$ of
them are nonzero codewords. The all zero codeword is not a possibility, because a burst
did occur. This is the third term of Equation 14.
The number of points in a shell of burst radius $l_j$ that are in the overlap region is the
product of the ways to place the bursts and the number that are codewords. □
With the volume of the intersection between a burst shell of radius $l_j$ and any burst
shell of radius $l_k$ known, it is possible to evaluate the probability of decoding failure. Every point
within the overlap region represents one error of burst length $l_j$ that is also a burst error
of length $l_k$ from another codeword.
1.4 Probability of Miscorrect and Misdetect
A desirable feature of an error correcting decoder is that uncorrectable errors are de-
tectable. With any error correcting code, there is always the chance that an uncorrectable
error can cause a misdetect.
Any error correction code that does not try to correct errors beyond that which the
distances in Equations 1 and 5 allow should never miscorrect. However, if correction of
errors greater than these is attempted, then there is a possibility of miscorrect. In this
section these probabilities are found given that a burst of length $l_j$ which violates the
Rieger bound occurs.
1.4.1 Miscorrect
A miscorrect occurs if a burst of length less than or equal to $l_{max}$ is not corrected to the
sent codeword. The conditions for which this occurs were outlined in Section 1.2. To
summarize, the decoder miscorrects if the burst distance from the sent codeword to the
received word (the real burst) is greater than or equal to the burst distance to any other
codeword (the phantom burst).
Theorem 9 Given that a burst of length $l_j$ for
$$l_t < l_j \le l_{max}$$
has occurred, the conditional probability of a miscorrect is
$$P(\text{miscorrect} \mid l_j) \le \frac{\sum_{l_k = n-(k-1)-l_j}^{l_j} V_{j \cap k}}{V_j} \qquad (15)$$
Proof The points in the region of overlap that cause a miscorrect are those points that
are a burst distance less than or equal to $l_j$ from another codeword. If $l_j \le l_t$, then there
is no $l_k$ that can satisfy both bounds on the summation. Any phantom burst of length less than
$n - (k - 1) - l_j$, combined with the real burst, has fewer than n - k + 1 nonzero coefficients
and cannot form a codeword. Any
burst of length greater than $l_j$ will not cause a miscorrect because of the assumptions made
in Conjecture 2. The numerator is a count of the points that cause miscorrect with a burst
of length $l_j$ and the denominator is a count of the total number of points that are in a
burst shell of radius $l_j$. By Conjecture 1, each of the points in the shell is equally likely.
□
The conditional probability of miscorrect given a burst of length $l_j$ is not a function of
the maximum burst that the decoder will attempt to correct. It is a function of the
length of the burst that is actually found.
The bound given in Theorem 9 can be simplified through approximation. First, for the
numerator the following approximations can be made to Equation 15.
t,,-s






The term in the brackets enumerates the number of ways a burst of length $l_j$ and a
burst of length $l_k$ can be placed in a block of length n. There are fewer than n ways for
each burst, therefore the total value of the bracketed term is less than $n^2$. For a Reed
Solomon code, $n \le q - 1$, and $n^2$ can be replaced with $q^2$ and the remainder follows. The
denominator of Equation 15 is simplified below.
$$V_j = (n - l_j + 1)(q - 1)^2 q^{l_j - 2} < n q^2 q^{l_j - 2} < q^{l_j + 1}$$
The bound given in Equation 15 can be approximated as
$$P(\text{miscorrect} \mid l_j) \approx \sum_{l_k = n-(k-1)-l_j}^{l_j} q^{-(n-k-l_k-1)} \qquad (16)$$
Equation 16 is not a bound but can be used as a rough guess.
Finally, the greatest conditional probability of miscorrect occurs when a phantom burst that is
the same length as the real error occurs. In this case the approximation is
$$P(\text{miscorrect} \mid l_j) \approx q^{-(n-k-l_j-1)} \qquad (17)$$
Since the approximations of both the numerator and the denominator of Equation 15
are greater than the original quantities, it is not possible to state that the approximation given
in Equation 17 bounds the conditional probability. However, it appears that if $q \gg l_j$,
the approximation is a good one.
From the above, once the channel error statistics are known, the conditional probability
of miscorrect can be calculated. This conditional probability can then be used to
specify the error correction scheme for the channel.
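As a rough numerical illustration, the estimate $q^{-(n-k-l_j-1)}$ of Equation 17 can be evaluated directly. The Python sketch below uses the (255,223) code over GF(256) from the example in Section 1.5. The function name is an assumption, and the result is only the approximation, not the bound of Equation 15.

```python
def miscorrect_approx(n, k, q, l_j):
    """Equation 17: q**-(n - k - l_j - 1), a rough estimate for a burst of length l_j."""
    return float(q) ** -(n - k - l_j - 1)


n, k, q = 255, 223, 256
for l_j in range(17, 25):
    # One line per burst length beyond (n - k)/2 = 16.
    print(l_j, f"{miscorrect_approx(n, k, q, l_j):.1e}")
```

Each additional symbol of burst length raises the estimate by a factor of q = 256, which is why the values climb by a little more than two orders of magnitude per row. For a burst of length 17 the estimate is about 1.9e-34.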
1.4.2 Misdetect
A misdetect can occur when a burst of length greater than $l_{max}$ occurs. A burst of this
length is considered uncorrectable. If the burst vector moves the original codeword into
another decoding sphere, then the decoder corrects to the wrong codeword instead of detecting
an uncorrectable error.
Theorem 10 Given a burst of length greater than $l_{max}$ the conditional probability of a




1, n P(I, > Imo)
q
(18)
Proof The numerator is an upper bound on the total number of polynomials within all of
the burst decoding spheres of radius $l_j$. Some polynomials belong to more than one sphere.
The summation is the volume of one sphere. There are $q^k$ codewords and consequently $q^k$
spheres. The denominator is the total number of polynomials. □
The conditional probability of misdetect is a strong function of the maximum length
burst that the decoder attempts to correct.
1.5 Example
It is now possible to determine the conditional probability of miscorrect for a decoder that
violates the Rieger Bound for an (n, k) Reed Solomon code.
Example 1 The (255,223) Reed Solomon code over $GF(2^8)$ is used to protect satellite
communication channels [4]. It is concatenated with a convolutional inner code. The
purpose of the convolutional code is to correct random errors. When the raw bit error rate
is too high for the convolutional decoder, it creates a burst of errors in the output. This
data is then passed to a Reed Solomon decoder for burst correction. The code corrects all
bursts of length 16 that occur within a block. If a burst of length greater than 16 occurs,
then the conditional probability of a misdetect from Equation 18 is less than 4.7(10)^-14.
This means that a misdetect occurs with a probability
$$P(\text{misdetect}) < 4.7(10)^{-14} \, P(l_j > 16).$$
Given that the code is being used on a burst channel, the number of errors that the code
can correct could be increased. The bound on the conditional probabilities of miscorrect can
be found through direct application of Equation 15. The approximate conditional probability as
given by Equation 17 is given as a reference. As can be seen in the table, the approximation is
greater than the actual conditional probability. It is also well within an order of magnitude
of the value it is approximating. The results are summarized in Table 1.

Burst length   Equation 15   Equation 17
17             1.4(10)^-34   1.9(10)^-34
18             3.6(10)^-32   4.9(10)^-32
19             9.1(10)^-30   1.3(10)^-29
20             2.3(10)^-27   3.2(10)^-27
21             5.8(10)^-25   8.3(10)^-25
22             1.4(10)^-22   2.1(10)^-22
23             3.6(10)^-20   5.4(10)^-20
24             9.1(10)^-18   1.4(10)^-17

Table 1: Probability of miscorrect for a (255,223) Reed Solomon code with the given burst
lengths.
One interesting
result is that the conditional probability of miscorrect for a burst of length 35 is less than
the conditional probability of misdetect for a normal decoder. This is possible because the
normal decoder allows for the occurrence of random errors.
Another interesting problem would be how many parity symbols are really needed to
correct a burst of length 16 with a conditional probability of miscorrect less than $10^{-10}$?
A (255, 234) code requiring 21 parity symbols has a conditional probability of miscorrect
equal to 1.8(10)^-10. A common metric for measuring the efficiency of a code is the rate
R = k/n. The rate of the original Reed Solomon code is 87%. The new rate is 92%, a
significant improvement.
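The rate comparison can be reproduced with a few lines of Python. This is a sketch with assumed function names. The miscorrect figure it prints is the rough estimate of Equation 17, so it is expected to land near, not exactly on, the 1.8(10)^-10 quoted from Equation 15.

```python
def code_rate(n, k):
    """Rate R = k / n of an (n, k) block code."""
    return k / n


def miscorrect_approx(n, k, q, l_j):
    """Rough estimate of Equation 17 (an approximation, not a bound)."""
    return float(q) ** -(n - k - l_j - 1)


# Original code with 32 parity symbols versus the reduced code with 21.
for n, k in [(255, 223), (255, 234)]:
    print(f"({n},{k}): parity = {n - k}, rate = {code_rate(n, k):.1%}")

# Estimate for a burst of length 16 when only 21 parity symbols are used.
print(f"{miscorrect_approx(255, 234, 256, 16):.1e}")
```

The estimate for the (255, 234) code comes out at about 2.3e-10, the same order of magnitude as the value obtained from Equation 15.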
1.6 Summary
For a burst error environment, the error correcting ability of a Reed Solomon code can be
extended beyond the Rieger bound with a high degree of confidence that the bursts that
are found are the bursts that occurred. This is significant in that the number of parity symbols
in a code can be reduced with little if any reduction in burst correcting ability relative to a
code that only corrects bursts that meet the Rieger Bound.
The decoder that performs this burst correcting does not correct any random errors
that occur outside of the bursts that are being corrected. One very common use for a burst
correcting code is as an outer code for a random error correcting code as illustrated in the
above example. When the inner random error correcting code fails, it creates a large burst
error, but the inner code corrects all of the random errors. In this case, the increased burst
correcting ability of the Reed Solomon codes is valuable.
When the size of the bursts to be corrected is increased such that the Rieger bound is
violated, the possibility of a miscorrect is non-zero. Significant improvement of the burst
error correcting ability of a Reed Solomon code can be accomplished while maintaining a
negligible conditional probability of miscorrect given that a burst of length $l_{max}$ occurs. A bound
on the conditional probability of miscorrect given that an error of length $l_{max}$ occurs
was found to be approximated by $q^{-(n-k-l_{max}-1)}$.
2 A Decoder for Bursts that Violate the Rieger Bound
Reed Solomon codes are a special case of the BCH codes. Any decoding algorithm that
works for BCH codes also works for the Reed Solomon codes.
Reed Solomon decoding is a computationally complex process. Since the first decoding
algorithms were defined, most of the research in the area has been focused at reducing the
complexity rather than improving the correcting ability of the code. The results in the
previous section indicate that in a burst error environment, the error correcting ability can
be improved beyond what the Rieger bound would indicate.
The possibility of extending the burst correcting ability of a Reed Solomon code was
developed in the last section. For a given amount of information and a given maximum
burst size to be corrected, the number of parity symbols could be reduced significantly to
achieve essentially the same burst correcting ability.
Error trapping, a decoding algorithm first identified in 1964 by Rudolph and Mitchell
[3], decodes extended burst errors. This algorithm identifies the error polynomial e(x)
and then corrects the received polynomial by subtracting e(x) to get the code polynomial.
In the following sub-sections, the error trapping algorithm and a decoder for trapping
extended bursts are described.
2.1 The Syndrome
The polynomial received by the decoder, v(x), is the sent codeword added to the error
induced by the channel.
$$v(x) = c(x) + e(x) \qquad (19)$$
As can be seen in Equation 19, the error is additive and, once found, can be subtracted
from v(x) to get the original codeword.
The syndrome, as described below, is a function of the error polynomial and is indepen-
dent of the sent code word. There are two forms of the syndrome: the partial syndromes
and the syndrome polynomial. Most of the current methods for decoding Reed Solomon
codes use partial syndromes. The partial syndromes are discussed in a later section. The
error trapping decoder is based on the syndrome polynomial.
Definition 15 The syndrome polynomial s(x) is given by the equation
$$s(x) = R_{g(x)}[v(x)] \qquad (20)$$
where $R_{g(x)}[v(x)]$ denotes the remainder of v(x) after division by the generator polynomial g(x).
The syndrome generator for calculation of s(x) is developed in Section 2.3.
The syndrome is a function only of the error polynomial and not of the code polynomial.
This is true only because g(x) divides c(x).
$$s(x) = R_{g(x)}[c(x) + e(x)] \qquad (21)$$
$$= R_{g(x)}[c(x)] + R_{g(x)}[e(x)] \qquad (22)$$
$$= 0 + R_{g(x)}[e(x)] \qquad (23)$$
$$= R_{g(x)}[e(x)] \qquad (24)$$
From Equation 24 it can be seen that when the degree of the error polynomial is
less than n - k, the syndrome polynomial is equal to the error polynomial. When the
syndrome polynomial equals the error polynomial, the special condition known as the
error trap occurs. An equivalent statement is that when the error pattern is wholly contained
within the n - k lowest degree coefficients, the error is trapped.
All bursts of length $l_j$ such that
$$l_j \le 2t$$
are trapped if the $n - k - l_j$ coefficients that are not part of the burst are equal to zero.
If a burst is not wholly contained in locations 0 through n - k - 1, then it is not trapped.
However, if the received polynomial can be cyclically shifted, and the syndrome for that
shifted polynomial found, then any burst error can be trapped. The Meggitt Theorem
provides such a mechanism for cyclically shifting the received polynomial and updating
the syndrome polynomial to correspond to the new polynomial.
2.2 The Meggitt Theorem and Trapping the Error
The major significance of the Meggitt Theorem is that it allows for a simple method of
calculating the syndrome of a cyclically shifted codeword given that the syndrome of the








Figure 1: Cyclic shift caused by a multiplication by x mod (x^n - 1).
then
R_{g(x)}\{R_{x^n - 1}[x v(x)]\} = R_{g(x)}[x s(x)]	 (25)
The left side of Equation 25 is the received polynomial cyclically shifted to the left.
The most significant coefficient becomes the least significant and the degree of all of the
other coefficients is increased by one. This is illustrated in Figure 1.
On the right side of the equation, the syndrome for the new, cyclically shifted poly-
nomial is found. The original syndrome has been multiplied by x and the residue with
respect to g(x) found. This can be done as follows:
R_{g(x)}[x s(x)] = x s(x) - s_{n-k-1} g(x)	 (26)
where s_{n-k-1} is the most significant coefficient of s(x).
In Section 2.3 it is shown that the syndrome generator accomplishes this operation.
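As a minimal sketch of Equations 25 and 26, the following Python fragment checks the identity exhaustively for a small binary cyclic code. The (7,3) code with g(x) = x^4 + x^3 + x^2 + 1 over GF(2) is an assumption made for illustration only (the codes discussed here are Reed Solomon codes over GF(2^m)), and all function names are hypothetical. Polynomials are stored as integers whose bit i is the coefficient of x^i.

```python
N = 7                      # block length of the assumed example code
G = 0b11101                # assumed g(x) = x^4 + x^3 + x^2 + 1, a divisor of x^7 - 1
R = G.bit_length() - 1     # n - k = 4 syndrome coefficients


def gf2_mod(a, g):
    """Remainder of a(x) divided by g(x) over GF(2)."""
    while a.bit_length() >= g.bit_length():
        a ^= g << (a.bit_length() - g.bit_length())
    return a


def cyclic_shift(v, n=N):
    """x * v(x) mod (x^n - 1): every coefficient moves up one place."""
    return ((v << 1) | (v >> (n - 1))) & ((1 << n) - 1)


def meggitt_shift(s, g=G):
    """x * s(x) mod g(x), computed as x*s(x) - s_{n-k-1} * g(x)."""
    s <<= 1
    if (s >> R) & 1:       # the most significant syndrome coefficient was 1
        s ^= g             # subtraction is XOR over GF(2)
    return s


# Compare both sides of Equation 25 for every possible received word.
all_agree = all(
    gf2_mod(cyclic_shift(v), G) == meggitt_shift(gf2_mod(v, G))
    for v in range(1 << N)
)
print(all_agree)
```

The point of the check is that the right-hand side never touches the received word: only the n - k syndrome coefficients are updated per shift.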
Theorem 12 If




R_{g(x)}\{R_{x^n - 1}[x^{-r} v_j(x)]\} = b_j(x)	 (27)
Proof
R_{g(x)}\{R_{x^n - 1}[x^{-r} v_j(x)]\} = R_{g(x)}\{R_{x^n - 1}[x^{-r}(c(x) + x^r b_j(x))]\}	 (28)
= R_{g(x)}\{R_{x^n - 1}[x^{-r} c(x)]\} + R_{g(x)}\{R_{x^n - 1}[b_j(x)]\}	 (29)
= 0 + R_{g(x)}\{R_{x^n - 1}[b_j(x)]\}	 (30)
Theorem 12 specifies the direction a codeword with an additive burst error must be
shifted to get the burst within the window. If the burst is offset from the zeroth coefficient
by r places, then v(x) must be divided by x^r mod x^n - 1 to get the syndrome equal to
the burst. This is illustrated in Figure 2. A polynomial is multiplied by x mod x^n - 1
and is rotated counterclockwise in the figure. An n - k coefficient window, located on
coefficients 0 through p - 1, is fixed, where p is the number of parity symbols for the code.
The coefficients are shifted through the window in the direction shown.
2.3 Implementation Considerations
The functions that an error trapping decoder has to implement are calculation of the
syndrome, cyclically shifting the received vector and calculating the shifted syndrome,
trapping the error, and applying the correction. Circuits for implementing each of these are
described below. These circuits are commonly known [2] [1].
2.3.1 Syndrome
The syndrome is the remainder of the division v(x)/g(x). A circuit for dividing two
polynomials is shown in Figure 3. In this implementation of the divider, the n - k registers
s_0 to s_{n-k-1} are initialized to zero. The received polynomial is input to the circuit, most significant
coefficient first. After n - k clocks, the first coefficient of the quotient appears on the
feedback line. After n clocks, the registers s_0 to s_{n-k-1} contain the respective coefficients
of the remainder, which is the syndrome polynomial.
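The behaviour of the divider can be sketched clock by clock as follows. This is a behavioural model rather than a gate-accurate one, and it assumes a small binary code with g(x) = x^4 + x^3 + x^2 + 1 in place of a Reed Solomon code; the function names are hypothetical. The example word is a codeword plus an error confined to the n - k low order coefficients, so the final register contents equal the error itself, which is the trapped condition.

```python
G = 0b11101                      # assumed g(x) = x^4 + x^3 + x^2 + 1 over GF(2)


def syndrome_by_shift_register(bits_msb_first, g=G):
    """Clock v(x) into the divider, most significant coefficient first."""
    r = g.bit_length() - 1       # number of syndrome registers, n - k
    reg = 0                      # registers s_0 .. s_{n-k-1}, initialized to zero
    for bit in bits_msb_first:
        reg = (reg << 1) | bit   # shift and admit the next coefficient
        if (reg >> r) & 1:       # feedback line carries a quotient coefficient
            reg ^= g             # subtract g(x) scaled by that coefficient
    return reg


def to_bits_msb_first(v, n):
    """Coefficients of v(x), highest degree first."""
    return [(v >> i) & 1 for i in reversed(range(n))]


codeword = 0b0111010             # x * g(x), hence a multiple of g(x)
error = 0b0000110                # confined to the n - k low order coefficients
received = codeword ^ error
s = syndrome_by_shift_register(to_bits_msb_first(received, 7))
print(bin(s), s == error)
```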
The same circuit can be used to find the syndrome of the shifted code word as described
in Equation 25. Each shift performs the multiplication by x residue g(x).
This circuit does not perform efficiently for Reed Solomon codes that have been shortened.
A full length Reed Solomon code has 2^m - 1 symbols, where m is the number of
bits in each symbol. A code can be shortened to blocklength n by letting the 2^m - 1 - n
most significant symbols be equal to zero [1]. It is not necessary to send these zeros, as
the syndrome generator is in the same state after the zeros have been shifted in as it is
initially.
After the received polynomial has been input and the syndrome polynomial generated,
the burst decoding window is the least significant n - k coefficients of v(x). As the shifting
of the received word and the syndrome begins, the most significant coefficients are shifted
into the window. This can be seen by referring to Figure 2.
This works well unless the code has been shortened. For the shortened codeword,
the most significant coefficients were not sent and are known to be zero. The syndrome
generator wastes 2^m - 1 - n clock pulses searching the most significant coefficients for the
burst error.
The problem of wasted clock pulses can be solved in one of two ways. The first is
to reposition the burst decoding window to the most significant coefficients of the actual
codeword. This can be accomplished by cyclically shifting v(x) so that the most significant






Figure 2: Cyclic shift of the codeword through the burst decoding window with a) before






Figure 3: Polynomial Division Circuit
Let v'(x) be the shifted v(x). Then
v'(x) = R_{x^n - 1}[x^{2^m - 1 - n} v(x)]	 (31)
and its syndrome, s'(x), is
s'(x) = R_{g(x)}[v'(x)]	 (32)
The shifted syndrome can be modified into a form equal to the residue with respect
to g(x) of the original received polynomial multiplied by a new polynomial. A circuit will
then be shown that performs this as the received polynomial is shifted in. Let
m(x) = R_{g(x)}\{R_{x^n - 1}[x^{2^m - 1 - n}]\}
then
s'(x) = R_{g(x)}[v'(x)]	 (33)
= R_{g(x)}\{R_{x^n - 1}[x^{2^m - 1 - n} v(x)]\}	 (34)
= R_{g(x)}[R_{g(x)}\{R_{x^n - 1}[x^{2^m - 1 - n}]\} R_{g(x)}\{R_{x^n - 1}[v(x)]\}]	 (35)
= R_{g(x)}[m(x) v(x)]	 (36)
where the order of the polynomial m(x) is less than n - k. Circuits that divide by a
polynomial (syndrome generator) and multiply by a polynomial can be concatenated [2].
The one that performs the operation described in Equation 36 is shown in Figure 4.
The second solution is to change the direction that the burst decoding window slides
over v(x). This can be done by modifying Equation 25 as follows:
R_{g(x)}\{R_{x^n - 1}[x^{-1} v(x)]\} = R_{g(x)}[x^{-1} s(x)]	 (37)
Since v(x) is multiplied by x^{-1} instead of x, it is equivalent to shifting the codeword in
the opposite direction indicated by Figure 3. The advantage is that the decoding window






Figure 4: Circuit that performs the operation z(x) = i(x)m(x)/g(x).
This function can be accomplished by reversing the direction of the syndrome generator.
When the circuit is run in reverse, two constant multipliers must be changed. The least
significant multiplier is the multiplicative inverse of g_0. The most significant multiplier is
equal to g_{n-k} instead of the multiplicative inverse of g_{n-k}. The circuit for clocking backward
is shown in Figure 5.
A circuit that incorporates both the syndrome generator and the ability to shift the
syndromes in a reverse direction is shown in Figure 6. The circuit must be initialized to
all zeros. For the first n clock pulses, v(x) is shifted into the shift register, most significant
coefficient first. The syndrome generator registers shift to the left, the leftmost multiplexor
selects g_0, and the rightmost multiplexor selects g_{n-k}^{-1}. The circuit in this configuration
performs the syndrome generation.
For the second n clock pulses, the syndrome generator registers shift to the right,
the leftmost mux selects g_0^{-1}, and the rightmost mux selects g_{n-k}. The circuit in this
configuration performs a division by x modulo the generator polynomial.
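The role of the two constant multipliers in each direction can be illustrated with a short sketch. It assumes a prime field GF(7) and an arbitrary non-monic polynomial g(x) = 3 + x + 2x^3, both chosen only so that the inverses of g_0 and g_{n-k} are not trivially 1; neither is taken from the circuits in the figures, and the function names are hypothetical. Coefficient lists are ordered low degree first, and `pow(a, -1, p)` requires Python 3.8 or later.

```python
from itertools import product

P = 7                       # GF(7), a small prime field assumed for illustration
G = [3, 1, 0, 2]            # assumed g(x) = 3 + x + 2x^3, low order coefficient first


def shift_forward(s, g=G, p=P):
    """x * s(x) mod g(x): feedback scaled by the inverse of g_{n-k}."""
    r = len(g) - 1
    f = s[r - 1] * pow(g[r], -1, p) % p
    t = [0] + list(s)                                  # multiply by x
    return [(t[i] - f * g[i]) % p for i in range(r)]   # degree r term cancels


def shift_reverse(s, g=G, p=P):
    """x^{-1} * s(x) mod g(x): feedback scaled by the inverse of g_0."""
    r = len(g) - 1
    f = s[0] * pow(g[0], -1, p) % p
    t = list(s) + [0]
    # The constant term of s(x) - f*g(x) is zero, so dividing by x is exact.
    return [(t[i] - f * g[i]) % p for i in range(1, r + 1)]


# Clocking forward and then backward should restore every register state.
round_trips = all(
    shift_reverse(shift_forward(list(s))) == list(s)
    for s in product(range(P), repeat=len(G) - 1)
)
print(round_trips)
```

In the forward direction the register contents are multiplied by g_0 through g_{n-k-1} after scaling by the inverse of g_{n-k}; in the reverse direction the roles of the two end coefficients are exchanged, as described above.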
After the first n clock pulses, the burst decoding window is in the n - k lowest degree
coefficients. On each clock pulse the window is shifted one coefficient towards the higher









Figure 6: Circuit that combines both the forward and reverse syndrome calculation
2.3.2 Recognizing the Error
The syndrome register cycles through the code, i.e., the observable window moves across
v(x). When the syndrome is equal to the burst then an error has been trapped. The
detection circuitry used to recognize the burst is a pattern recognizer. It has to recognize
valid bursts.
A valid burst is recognized whenever one or more of the coefficients at the outer extreme
of the window is equal to zero. If the burst is of length l_j, then n - k - l_j zeros occur in the
window. If the window is moving from low order coefficients of v(x) to high order, the
zeros first appear in the low order coefficients of the syndrome. As the burst shifts through
the syndrome, the zeros shift from the low order coefficients to the high. If the window
is moving from the high order coefficients of v(x) to the low, then the opposite situation
occurs.
In a normal error trapping decoder, if a valid error pattern is found, correction can
proceed immediately because only one error pattern is possible if the Rieger bound is not
violated. When the Rieger bound is violated, then the decoder must be capable of trapping
all possible error patterns and choosing the most likely error from them. When a burst is
found it should be saved, and its position and length recorded. As the search through v(x)
continues, if another burst is found that is more likely than any previous, then it should
be saved. After searching through the whole code, the most likely burst, if one exists, has
been found.
During the time that the burst is shifting through the syndrome generator, the feedback
line is equal to zero. The number of consecutive clock pulses that the line is zero determines
the length of the burst. The end of the burst is located when the feedback line becomes
nonzero. At this point, the burst is valid on the outputs of the syndrome registers and can
be latched into a holding register.
The latching should be conditioned on the length of the burst being less than the length
of any burst that was found previously. If they are of the same length, then there is no
clear choice, and in the previous section this case was identified as a detected but not
correctable error condition.
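The search described in this subsection can be sketched in software as follows. The sketch makes several assumptions that are not part of the decoder described here: it uses a binary (7,3) cyclic code with g(x) = x^4 + x^3 + x^2 + 1 instead of a Reed Solomon code, it recomputes the syndrome of each cyclic shift directly rather than clocking a syndrome register, and it treats a burst as trapped only when it begins at the low order edge of the window. All names are hypothetical.

```python
N = 7
G = 0b11101                # assumed g(x) = x^4 + x^3 + x^2 + 1 over GF(2)


def gf2_mod(a, g=G):
    """Remainder of a(x) divided by g(x) over GF(2)."""
    while a.bit_length() >= g.bit_length():
        a ^= g << (a.bit_length() - g.bit_length())
    return a


def rotate(v, places, n=N):
    """x^places * v(x) mod (x^n - 1); negative places rotate the other way."""
    places %= n
    return ((v << places) | (v >> (n - places))) & ((1 << n) - 1)


def trap_shortest_burst(v):
    """Return the corrected word, or None if no unique shortest burst exists."""
    if gf2_mod(v) == 0:
        return v                                   # already a codeword
    best_len, best_err, tie = None, None, False
    for i in range(N):                             # slide the window over v(x)
        s = gf2_mod(rotate(v, -i))
        if s & 1:                                  # burst starts at the window edge
            length = s.bit_length()                # remaining window positions are zero
            err = rotate(s, i)                     # burst moved back to its location
            if best_len is None or length < best_len:
                best_len, best_err, tie = length, err, False
            elif length == best_len and err != best_err:
                tie = True                         # equally likely bursts: detect only
    if best_len is None or tie:
        return None
    return v ^ best_err


codeword = 0b0111010                               # x * g(x)
received = codeword ^ rotate(0b11, 6)              # length-2 burst wrapping around the end
print(trap_shortest_burst(received) == codeword)
```

As in the text, the candidate is only replaced when a strictly shorter burst is found, and two different bursts of equal shortest length are reported as a detected but uncorrectable condition (here, `None`).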
2.3.3 Correcting the Error
Work has been done to build error trapping decoders that load v(x) into a shift register as
the syndrome is created, and shift out v(x) as the syndrome is shifted, and have the burst
shift out of the syndrome generator coincident with the symbols of v(x) which are in error
[1]. These circuits do not work when more than one burst is present. The location and
values of the burst are not known until the burst decoding window has traversed the entire
received polynomial. For this reason, control circuitry is needed to apply the corrections
at the right time.
Both v(x) and the burst e(x) need to be stored in memory. Correction can be applied
to v(x) as it is read from memory.
2.4 Summary
It was shown in the previous section that the burst error correcting ability of a Reed
Solomon code is much better than thought previously.
In this section, a decoder that finds the large bursts has been identified. The error
trapping decoder is a well known and simple algorithm that accomplishes the error correc-
tion. It has been shown that it also finds and corrects the bursts that violate the Rieger
bound.
For codes that are designed to protect against bursts exclusively, significant savings
in decoder cost, as well as increased performance can be achieved over the Reed Solomon
decoders in current use. The core engine of the decoder is the same as the systematic
encoder circuit. This allows the decoder to also serve as the encoder.
References
[1] R. E. Blahut, Theory and Practice of Error Control Codes, Reading, MA, Addison-
Wesley, 1983
[2] W. W. Peterson, E. J. Weldon, Error-Correcting Codes, Cambridge, MA, MIT Press,
1972
[3] L. D. Rudolph and M. E. Mitchell, "Implementation of Decoders for Cyclic Codes,"
IEEE Trans. on Inf. Theory, IT-10 pp. 259-260, 1964.






C. French	 Y. Lin




University of Idaho	 University of California, San Diego
Moscow, ID 83843	 La Jolla, CA 92093
Abstract- In this paper, we present a performance comparison of several
combined error correcting/run-length limited (ECC/RLL) codes created by
concatenating a convolutional code with a run-length limited code. In each
case, encoding and decoding are accomplished using a single trellis based on
the combined code. Half of the codes under investigation use conventional
(d,k) run-length limited codes, where d is the minimum and k is the maximum
allowable run of 0's between 1's. The other half of the combined codes use a
special class of (d,k) codes known as distance preserving codes. These codes
have the property that pairwise Hamming distances out of the (d,k) encoder are
at least as large as the corresponding distances into the encoder (i.e., the codes
preserve distance). Thus a combined code, created using a convolutional code
concatenated with a distance preserving (d,k) code, will have a free distance
(d_free) no smaller than the free distance of the original convolutional code. It
should be noted that this does not hold if the (d,k) code is not distance
preserving. A computer simulation is used to compare the performance of
these two types of codes over the binary symmetric channel for various (d,k)
constraints, rates, free distances, and numbers of states. Of particular interest
for magnetic recording applications are codes with run-length constraints (1,3),
(1,7) and (2,7).
1 Creating Combined Codes
In recent work on combined ECC/RLL trellis codes [1,2], it has been demonstrated that
some of the best codes, in the sense of lowest decoded error probability, are codes created
by concatenating a convolutional code with an RLL code, and then decoding using a single
trellis based on the combined code. In this work, we will be dealing exclusively with such
concatenated coding schemes. As an example of a concatenated code, consider the trellis
for the rate 1/4, d_free = 10, 4-state convolutional code shown in Figure 1 (a). Here, free
distance is defined to be the minimum Hamming distance between any two sequences out
of the encoder that diverge in one state and remerge in another state. Notice in Figure 1
(a) that each branch of the trellis has a label of the form X/Y, where X is the encoder input
(1 bit long, in this case) and Y is the encoder output (4 bits long). We wish to concatenate
this code with the rate 1/2, 7-state, (2,7) code described by Adler, Coppersmith & Hassner
in their paper on (d,k) code construction [3]. The trellis for the (2,7) code is shown in
Figure 1 (b). Initially, it would be expected that the rate 1/8 combined code would have
28 states (i.e., 4 * 7 = 28). However, the trellis for the combined code can be simplified
to 10 states, as shown in Figure 1 (c). In addition, the combined code now satisfies the
more stringent (2,5) constraint. The last parameter to be determined is the free distance
of the combined code. Since it is difficult to determine the free distance of a non-linear
code such as this, we will give the smallest distance found (and the free distance is then
less than or equal to this smallest distance). For this case, two paths separated by a
distance of 6 were found. These paths go through the sequences of states 5-3-1-4-9 and
5-7-5-7-9. The free distance of the combined code is thus d_free ≤ 6. As is clear from the
example described above, a combined ECC/RLL code can have a lower free distance than
the original convolutional code. This is due to the manner in which the RLL code was
constructed [3]. There has been some work recently involving a class of RLL codes known
as distance preserving codes [4,5]. As the name suggests, distance preserving codes have
the property that the Hamming distance between any two encoder outputs is at least as
large as the Hamming distance between the corresponding inputs. Thus, when a distance
preserving RLL code is concatenated with a convolutional code, the combined code will
have a free distance greater than or equal to the free distance of the convolutional code.
The trade-off is that, in general, distance preserving RLL codes will have lower rates than
classical RLL codes, due to the additional requirement that the code preserve distance.
For this reason, a higher rate convolutional code is usually required to create a combined
code when using a distance preserving RLL code. One of the main reasons for this study is
to determine whether this decrease in rate is balanced by the preservation of free distance.
As an example of a combined code created using a distance preserving RLL code, consider
the convolutional code and the distance preserving (2,7) code shown in Figure 2 (a) and
2(b) respectively. It is not hard to show that the rate 3/8 (2,7) code is indeed a distance
preserving code, as discussed in [5]. The combined code has a rate equal to 1/8, as in the
previous example. The trellis for the combined code can be simplified from 8x2=16 states
to 10 states, as shown in Figure 2 (c). Also, the combined code satisfies a (2,6) constraint.
Finally, since the RLL code was a distance preserving code, the overall free distance is
d_free ≥ 10. These parameters are summarized in Table 1, along with the parameters from
the previous example.
The parameters for a third code, to be described in the next section, are also included in
the table. We would expect that the combined code with the higher free distance (i.e., the
code labeled 1b, which utilizes a distance preserving RLL code) would perform
better than the other code (labeled 1a). In Section 3, we will verify this by comparing
the decoded probability of error for each code.
2 An Interesting Special Case
As discussed in the previous section, the free distance of a general convolutional code can
be preserved with an appropriate choice of a (d,k) code. While experimenting with codes
of this type, we have run across some interesting examples of combined codes created
using the rate 1/4 convolutional code of Figure 1 (a). Notice that this convolutional code
utilizes the codewords 0000, 1000, 0111 and 1111, and no others. Thus, when choosing a
distance preserving (d,k) mapping for use with this code, we need only concern ourselves
with these 4 sequences (instead of all 16 4-bit sequences). The pairwise distances between
these sequences are as follows:
      0000  1000  0111  1111
0000     0     1     3     4
1000           0     4     3
0111                 0     1
1111                       0






Note that the mapping has a rate equal to 4/8, thus the rate of the combined code is
1/8. The pairwise distances between the (2,6) sequences are as follows:
          00001001  00010001  00100010  00100100
00001001         0         2         4         4
00010001                   0         4         4
00100010                             0         2
00100100                                       0
Comparing this to the pairwise distances of the 4-bit sequences, we see that we have
achieved a distance preserving mapping. Thus the overall free distance of the combined
code is bounded by d_free ≥ 10. The parameters for this code are also summarized in Table
1 (code 1c). Notice that this code is comparable to code 1b, except that it has only 4
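The distance preserving property of the (2,6) mapping can be checked mechanically. In the sketch below the pairing of 4-bit codewords with 8-bit (2,6) sequences is an assumption inferred from the order in which the two distance tables list them; the function names are hypothetical.

```python
from itertools import combinations

# Pairing assumed from the order of the rows in the two distance tables.
mapping = {
    "0000": "00001001",
    "1000": "00010001",
    "0111": "00100010",
    "1111": "00100100",
}


def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_distance_preserving(m):
    """True if every output distance is at least the corresponding input distance."""
    return all(
        hamming(m[a], m[b]) >= hamming(a, b)
        for a, b in combinations(m, 2)
    )


for a, b in combinations(mapping, 2):
    print(a, b, hamming(a, b), hamming(mapping[a], mapping[b]))
print(is_distance_preserving(mapping))
```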





It is easy to show that this mapping is also distance preserving. In this case, the
resulting combined code will have a rate equal to 1/6, and d_free ≥ 10. In Section 3,
this code will be compared to some other rate 1/6 (1,k) codes. An interesting thing to
note is that the rates of the above two mappings are larger than the capacities of the
corresponding (d,k) constraints. Specifically, the (2,6) and (1,4) capacities are 0.4979 and
0.6174, respectively, compared to rates of 0.5 and 0.6667 for the mappings. This is due to
the fact that we need only 4 (d,k) sequences (instead of 16) for each mapping.
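The capacities quoted above can be reproduced numerically. The sketch below relies on the standard characterization of a (d,k) constraint, which is not derived in this paper: every constrained sequence is a concatenation of phrases consisting of j zeros followed by a single one, d ≤ j ≤ k, so the capacity is log2 of the root z > 1 of sum_{j=d}^{k} z^{-(j+1)} = 1. The function name and the bisection solver are illustrative choices.

```python
from math import log2


def dk_capacity(d, k, iterations=60):
    """Capacity (bits per channel symbol) of the (d,k) run-length constraint."""

    def f(z):
        # Phrase-length equation: a run of j zeros and a one occupies j + 1 symbols.
        return sum(z ** -(j + 1) for j in range(d, k + 1)) - 1.0

    lo, hi = 1.0, 2.0          # f(1) >= 0 and f(2) < 0, so the root is bracketed
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:         # f is decreasing in z, so the root lies above mid
            lo = mid
        else:
            hi = mid
    return log2((lo + hi) / 2)


for d, k, rate in [(2, 6, 4 / 8), (1, 4, 4 / 6)]:
    print((d, k), dk_capacity(d, k), rate)
```

The printed capacities can be compared with the values 0.4979 and 0.6174 given in the text, and with the mapping rates 0.5 and 0.6667 that exceed them.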
3 Performance of Combined Codes
A computer simulation was utilized to compare the codes in Table 1 over the binary
symmetric channel. In each case, the Viterbi algorithm was utilized in decoding. The
results are shown in Figure 3. Notice that, as expected, the codes that used a distance
preserving RLL code (codes 1b and 1c) have a lower decoded probability of error than the
other code (code 1a).
As another example, consider the rate 1/4 and rate 2/8 (2,k) codes listed in Table 2(b)
with three different convolutional codes to create combined codes with rates all equal to
2/8, and with different free distances. For comparison, we also include a rate 1/4 code that
utilizes the (2,7) code from Figure 1(b). In Figure 4 we give the probability of error curves
for these codes. From the figure we see that, at low channel bit error probability, the codes
created using the distance preserving RLL code (codes 2b, 2c, and 2d) all perform better
than the other code (code 2a).
For the next set of comparisons, we are interested in codes that satisfy a (1,k) constraint.
In Table 3 we give the parameters for a rate 1/6 and a rate 2/12 (1,k) code. The rate
1/6 code (code 3a) uses the 5-state, rate 2/3 (1,7) code in [6]. This is the (1,7) code used
in many existing recording systems. The rate 2/12 code (code 3b) utilizes the rate 2/4
distance preserving (1,5) code from [5]. This code is really a block code, thus the trellis
has only one state. The last code listed in Table 3 is the 4-state (1,4) code described in
Section 2. In Figure 5 we give the probability of error curves for codes 3a, 3b, and 3c.
As a final comparison, consider the rate 1/4 (1,k) codes in Table 4. Codes 4a and 4b
utilize the Miller code (also known as MFM), and codes 4c and 4d utilize the distance
preserving (1,5) code from [5]. Although the Miller code was not constructed specifically
to be distance preserving, it happens to satisfy the distance preserving criterion. Thus, all
the codes in Table 4 were constructed from distance preserving RLL codes. In Figure 6 we compare the
performance of these four codes.
4 Summary
We have given decoded probability of error curves for several concatenated codes that
satisfy a run-length constraint in addition to providing error correction capabilities. We




[1] P. Lee & J.K. Wolf, "Combined Error Correction/Modulation Codes," IEEE Transactions
on Magnetics, Sept. 1987.
[2] Y. Lin & J.K. Wolf, "Combined ECC/RLL Trellis and Tree Codes," IEEE Transactions
on Magnetics, Nov. 1988.
[3] R. Adler, D. Coppersmith & M. Hassner, "Algorithms for Sliding Block Codes," IEEE
Transactions on Information Theory, Jan. 1983.
[4] H.C. Ferreira, D.A. Wright & A.L. Nel, "Hamming Distance Preserving Mappings and
Trellis Codes with Constrained Binary Symbols," IEEE Transactions on Information
Theory, July 1989.
[5] C.A. French, "Distance Preserving Run-Length Limited Codes," IEEE Transactions
on Magnetics, Sept. 1989.








































































Figure 1: Trellis for (a) rate 1/4, d_free = 10, 4-state convolutional code, (b) rate 1/2 (2,7)










































































Figure 2: Trellis for (a) rate 1/3, d_free = 10, 8-state convolutional code, (b) rate 3/8 (2,7)
code and (c) combined code




















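Code  Conv. d_free  Conv. rate  Conv. states  RLL (d,k)  RLL rate  RLL states  Dist. pres.  Comb. (d,k)  Comb. d_free  Comb. rate  Comb. states
1a    10            1/4         4             (2,7)      1/2       7           No           (2,5)        ≤6            1/8         10
1b    10            1/3         8             (2,7)      3/8       7           Yes          (2,6)        10-13         1/8         10
1c    10            1/4         4             (2,6)      4/8       1           Yes          (2,6)        10-12         1/8         4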















[Plot: horizontal axis Channel Bit Error Probability; curves for Code 1a, Code 1b, Code 1c]
Figure 3: Performance of rate 1/8 (2,k) combined codes



















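Code  Conv. d_free  Conv. rate  Conv. states  RLL (d,k)  RLL rate  RLL states  Dist. pres.  Comb. (d,k)  Comb. d_free  Comb. rate  Comb. states
2a    5             1/2         4             (2,7)      1/2       7           No           (2,7)        ≤2            1/4         18
2b    3             2/3         4             (2,7)      3/8       2           Yes          (2,7)        3-5           2/8         6
2c    4             2/3         8             (2,7)      3/8       2           Yes          (2,7)        4             2/8         12
2d    5             2/3         16            (2,7)      3/8       2           Yes          (2,7)        5-7           2/8         24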




































[Plot: horizontal axis Channel Bit Error Probability]



















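Code  Conv. d_free  Conv. rate  Conv. states  RLL (d,k)  RLL rate  RLL states  Dist. pres.  Comb. (d,k)  Comb. d_free  Comb. rate  Comb. states
3a    10            1/4         4             (1,7)      2/3       5           No           (1,5)        ≤8            1/6         9
3b    10            1/3         8             (1,5)      2/4       1           Yes          (1,5)        10-14         2/12        8
3c    10            1/4         4             (1,4)      4/6       1           Yes          (1,4)        10-12         1/6         4
Table 3: Rate 1/6 and 2/12 (1,k) combined codes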





































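Code  Conv. d_free  Conv. rate  Conv. states  RLL (d,k)  RLL rate  RLL states  Dist. pres.  Comb. (d,k)  Comb. d_free  Comb. rate  Comb. states
4a    3             1/2         2             (1,3)      1/2       2           Yes          (1,3)        3             1/4         4
4b    5             1/2         4             (1,3)      1/2       2           Yes          (1,3)        5-7           1/4         8
4c    5             1/2         4             (1,5)      2/4       1           Yes          (1,5)        5-6           1/4         4
4d    6             1/2         8             (1,5)      2/4       1           Yes          (1,5)        6-7           1/4         8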




[Plot: horizontal axis Channel Bit Error Probability]
Figure 6: Performance of rate 1/4 (1,k) combined codes
Serial Multiplier Arrays
for Parallel Computation
Kel Winters
Department of Electrical Engineering
Montana State University
Bozeman, Montana
Abstract- Arrays of systolic serial-parallel multiplier elements are proposed
as an alternative to conventional SIMD mesh serial adder arrays for applica-
tions that are multiplication intensive and require few stored operands. The
design and operation of a number of multiplier and array configurations featur-
ing locality of connection, modularity, and regularity of structure are discussed.
A design methodology combining top-down and bottom-up techniques is described
to facilitate development of custom high-performance CMOS multiplier
element arrays as well as rapid synthesis of simulation models and semicustom
prototype CMOS components. Finally, a differential version of NORA dynamic
circuits requiring a single-phase uncomplemented clock signal is introduced for
this application.
1 Introduction
Single instruction/multiple datapath (SIMD) computer arrays were proposed for high per-
formance processing of large planar data structures with Unger's Spatial Computer pro-
posal in 1958 [16], the Solomon array proposal in 1962 [14], and later the ILLIAC IV
project at the University of Illinois in the sixties and seventies [1]. These early machines,
however, failed to gain commercial acceptance over vector based supercomputers in scien-
tific applications. The technology did not exist to exploit the inherent modularity of the
architecture or the locality of reference provided by the mesh interconnection network.
With the advent of Very Large Scale Integrated (VLSI) circuit methodologies in the
late 1970s, SIMD array architectures re-emerged tailored primarily for image processing
applications. This new generation of machines, like the original Unger and Slotnick designs,
featured bit-serial arithmetic and I-O operations, rather than the word-wide arrangement of
the ILLIAC processing elements. Bit-serial architectures, such as the early CLIP, Digital
Array Processor (DAP), and Massively Parallel Processor (MPP) [7], avoided much of
the functional complexity and interconnect cost of the larger-grained ILLIAC IV, at the
expense of arithmetic and I-O throughput. Subsequent SIMD mesh arrays, such as Blitzen
[4] and the Geometric Arithmetic Parallel Processor (GAPP) [6], have remained close to
the DAP/MPP architecture.
While SIMD mesh processor arrays have evolved into a set of highly similar designs,
there is in fact a continuum of possible configurations with respect to word width, interconnection,
and functionality. Optimization of the architecture for a particular domain of
applications is a matter of balancing the ratio of IC area allocated to logic, memory, and
interconnection, to the requirements of the application set.
For example, algorithms requiring few stored arguments favor very fine-grained PEs
with a high logic/memory area ratio. On the other hand, the PE must have sufficient
storage to hold all arguments required by the application algorithms without wasting IC
memory or running short. One solution is to define a processor array as a matrix of elements
that may be flexibly allocated to data elements (pixels, for instance) of the problem space
in groups. Thus, an array of fixed size could serve as a large array of very fine-grained
element groups (later referred to as virtual processors) for applications with few operands,
or as a smaller array of larger groups for problems requiring more storage per problem
element.
For multiplication intensive massively parallel problems requiring relatively little operand
storage, arrays of multipliers can offer better performance and better resource utilization
than DAP/MPP style adder arrays. Adder arrays typically have a large random access
memory store (1K bits for the MPP) to accommodate varying word widths and operand
storage requirements. The access time of RAM storage can significantly reduce clock speed.
In these applications, much of this memory capacity can go unutilized, while insufficient
arithmetic resources are available to exploit bit-level parallelism. An alternative approach
is a mesh array of serial multipliers where:
1. The ratio of arithmetic logic to memory silicon area is higher than that for conven-
tional SIMD adder arrays.
2. The function set is optimized for serial multiplication rather than serial addition to
better serve multiplication intensive applications.
3. Operand storage primarily consists of high-speed shift registers, rather than RAM
to enable higher clock rates.
4. Multiplier elements are of a fixed word width, but may be logically concatenated to
accommodate multiple word operations without degradation of performance.
2 Bit Serial Multiplication
Figure 1: Simple Serial-Parallel Multiply-Accumulator
A very simple serial multiply-accumulator is shown in Figure 1, consisting of an AND
gate, a shift register (or equivalent in random accessed memory), and a bit-serial adder.
Figure 2: Serial Adder Module
The serial adder, designated by &, consists of a full adder and a carry-save flip-flop as
shown in Figure 2. Arguments x and y, n bits in length, are fed serially such that y is
repeated n times as each bit of x is shifted in, one bit per n clock cycles. Partial products
are accumulated in the shift register, which is n+1 bits in length so that each successive
partial product is effectively multiplied by two before it is summed in the shift register. n^2
clock cycles are required to complete a multiplication operation.
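The serial adder and the n^2 cycle count can be illustrated with a small functional model. This is a sketch under simplifying assumptions, not a cycle-accurate model of Figure 1: the accumulator is held as a 2n-bit list rather than an n+1 bit recirculating shift register, operands are unsigned, and the class and function names are hypothetical.

```python
class SerialAdder:
    """Full adder plus a carry flip-flop, clocked one bit per cycle (LSB first)."""

    def __init__(self):
        self.carry = 0

    def clock(self, a, b):
        total = a + b + self.carry
        self.carry = total >> 1          # carry saved for the next clock
        return total & 1                 # sum bit produced on this clock


def to_bits(value, width):
    """Bits of value, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def serial_multiply(x, y, n):
    """Multiply two n-bit unsigned integers by bit-serial shift-and-add."""
    acc = [0] * (2 * n)                  # accumulated product, LSB first
    y_bits = to_bits(y, n)
    cycles = 0
    for i, x_bit in enumerate(to_bits(x, n)):
        adder = SerialAdder()            # carry flip-flop cleared for each pass
        for j in range(n):               # y is replayed once per bit of x
            acc[i + j] = adder.clock(acc[i + j], x_bit & y_bits[j])
            cycles += 1
        acc[i + n] = adder.carry         # final carry lands in the next bit position
    return from_bits(acc), cycles


product, cycles = serial_multiply(13, 11, 4)
print(product, cycles)
```

Each pass offsets the partial product by one bit position, which plays the role of the multiplication by two described above, and the cycle counter confirms that n passes of n clocks are needed.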
Despite the slow speed of this configuration, it is the basis of integer multiplication
in virtually all current SIMD processor arrays, including the DAP and MPP and their
successors. This can create a substantial performance bottleneck given the multiplication-
intensive nature of image and signal processing applications typically run on these ma-
chines.
High performance serial multipliers for VLSI processor arrays should (a) be modular in
structure, (b) have minimal internal signal fanout, (c) require no asynchronous carry
propagation that would constrain the clock rate, (d) be extensible in word width, and (e)
require a minimum number of clock cycles.
The problems of input loading, adder delay, and extensibility for serial-parallel multipliers
may be addressed by pipelining both the multiplicand input path and the product
accumulation path. Figure 3 illustrates a fully pipelined or systolic multiplication network
[17]. On the first clock cycle, x_0 is shifted into the multiplier pipeline and x_0 y_0 is shifted
into the product pipeline, and appears at the product output n cycles later. This circuit requires
2n clock cycles to multiply two n-bit operands. An addend, a, may be summed with
the product by shifting it into the product pipeline concurrent with the serial multiplicand,
x.
An alternative multiplier, shown in Figure 4, is a fully systolic adaptation of an early
serial-parallel multiplier introduced by Daniel Hampel, et al., in 1975 [10]. In this configu-
ration, the multiplicand is pipelined through n 2 gate inputs. This systolic multiplier also
requires 2n clock cycles. The least significant bit of the product xoyo, appears at the out-








Figure 3: Fully Systolic Pipelined-Multiplier Configuration
n bits into the product pipeline prior to a multiplication.
An interesting property of the second systolic multiplication circuit is that it contains n
bits of storage for the multiplicand and 2n bits for the product. An array of these circuits
would have the proper ratio of operand versus product storage. Thus, a product sum could
be accumulated by a single multiplier whose output is fed back to its addend input in 2n
clock cycles per multiplication.
Figure 4: Fully Systolic Hampel Multiplier










Figure 5: Multiplier Array Element
3 Multiplier Array Topology
The following examples illustrate preliminary configurations intended for sum of products
evaluation involving multiplication by a constant, such as convolution. 1
In the first configuration, serial-parallel multipliers are interconnected to four nearest
neighbors with dynamically segmented serial busses that allow communication between
non-adjacent neighbors. This facilitates on-the-fly allocation of multiplier elements (MEs)
to the application. Control signal and parallel multiplier (constant) inputs to the MEs are
routed by individual diagonal array rows. Thus, each diagonal row of MEs is controlled
independently, allowing constrained multiple-instruction (MIMD) execution.
At the system level, an Instruction Sequencer provides control signals to each diagonal
row of MEs. These in turn are controlled by a Host Interface Controller, whose function
is to manage the communication channel to the host scalar computer and decode array
instructions.
The multiplier element of the array, shown in Figure 5, has four input ports (NIN, EIN,
WIN, SIN) and four output ports (NOUT, EOUT, WOUT, SOUT). Each output port is
in one of two modes, dump or pass. In the dump mode, the multiplier element drives the
output port, while in the pass mode, a CMOS switch passes the value of the corresponding
input port to the output port.
As only N-fet pass transistors are required for data switching to N-logic NORA stages,
three control lines are required, Load, Dump, and NotDump, for each input/output pair.






This scheme enables communication between nonadjacent MEs by passing through
intermediate MEs, or configurable length array tessellation. Since diagonal rows of MEs are
controlled independently, traversals across arbitrary vertical column or horizontal row
distances are supported. Without I/O port pipelining, the propagation delay increases with
the square of the number of series pass devices [13], or, in this case, the square of the array
distance traversed.
Configurable length tessellation of the register array with independent control of
diagonal register rows enables MEs to be dynamically allocated to the application in groups
or virtual processors (VPs). Register elements within a virtual processor can then be ran-
domly accessed by neighboring virtual processors. For example, Figure 6 illustrates the
ME allocation to a two-dimensional problem requiring three one-word arguments, A, B,











Figure 6: Virtual Processor Partitioning
Horizontal communication is more constrained. The diagonal control routing serves to
offset horizontally adjacent virtual processors by one row. This allows horizontal commu-
nication between different argument registers in adjacent VPs, but not random register
access. In this example, B may only communicate with A in the next left VP neighbor or
with C in the next right VP. Other combinations require data shifting to reorder the argu-
ments. Despite this constrained horizontal VP communication, this arrangement is suited
to a good number of applications, particularly where products are summed by column,
then collected by row.
To illustrate, a two-dimensional convolution may be described by the function:
$$c(x, y) = \sum_{i=0}^{m-1} \sum_{j=0}^{m-1} w_{i,j} \, p_{x-i,\,y-j} \qquad (1)$$
Figure 7: SASM Element
where c is the resulting convolution array, w is the weight or mask matrix, and p is the
input array data (pixels, in an image application). A configurable-length tessellation array
can perform an m-by-m convolution in 2nm^2 clock cycles if two MEs are allocated per
pixel, one to hold the pixel datum (footnote 2), the other to accumulate the product-sum. Thus,
a 3-by-3 by 8-bit convolution would require 144 clock cycles, versus 795 for the MPP [8].
Allocating four MEs per pixel, the same convolution could be performed in as little as
2(2m - 1)n, or 80, clock cycles by executing column multiplications in the weight matrix
in parallel, then summing the columns. However, the area-time performance is diminished.
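The convolution sum and the two cycle-count expressions quoted above can be checked with a short sketch. The function names are hypothetical, the indexing of p as p[x][y] is an assumption, and out-of-range pixels are treated as zero because the text does not state a boundary rule.

```python
def convolve2d(p, w):
    """Direct evaluation of c(x, y) = sum_i sum_j w[i][j] * p[x-i][y-j].

    p is a list of rows of pixel values and w is an m-by-m weight matrix.
    Pixels outside the array are taken as zero (an assumption).
    """
    rows, cols, m = len(p), len(p[0]), len(w)
    c = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            for i in range(m):
                for j in range(m):
                    if 0 <= x - i < rows and 0 <= y - j < cols:
                        c[x][y] += w[i][j] * p[x - i][y - j]
    return c


def cycles_two_me(n, m):
    """Clock cycles with two MEs per pixel: 2 * n * m^2."""
    return 2 * n * m * m


def cycles_four_me(n, m):
    """Clock cycles with four MEs per pixel: 2 * (2m - 1) * n."""
    return 2 * (2 * m - 1) * n


if __name__ == "__main__":
    print(cycles_two_me(8, 3), cycles_four_me(8, 3))
```

For n = 8 and m = 3 the two expressions give the 144 and 80 cycles quoted in the text.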
Spice simulation (footnote 3) suggests that inter-element transfers at an array distance of three
elements are practical up to a clock rate of about 50 MHz in 2-micron CMOS. While faster than
current SIMD mesh arrays, this is considerably below the performance of the multiplier
circuits investigated. An alternative approach is to add a pipeline latch to each input and
output port. These latches add a delay of one clock cycle, for each element traversed, to
inter-ME transfers, eliminating the need for wait cycles in long distance transfers where RC
delays through I/O pass gates would be greater than a single clock cycle allows. While
enabling high clock rates, adding pipeline delays to the configurable length interconnection
scheme does add considerable control complexity, as serial word boundaries are no longer
aligned between array elements. Multiple word-width operations would suffer a significant
performance reduction due to I/O pipeline delays.
A configuration for systolic arrays of systolic multipliers (SASM) is shown in Figure
7. Here the systolic multiplier of Figure 4 is used to store both data and an accumulated
result. Data operands are stored in the multiplier or mr pipeline, which is loaded from itself
or its West (left) neighbor. The product accumulator pipeline, 2n stages in length, may be
loaded from the Northwest, West, or Southwest neighbors. Thus, the array is connected
Footnote 2: If an n-bit shift register were added to each ME to hold pixel data, only one ME per pixel would be
required for this convolution.
Footnote 3: Conducted by D. Wall and C. Hsiaochi at Montana State University with slow speed device models for
the MOSIS 2-micron (drawn gate length) SCMOS process, T=100 degrees C, Vdd=4.5V.
as a shuffle-exchange network, rather than a mesh as were the previous examples. The
multiplier array boundaries must be connected to form a cylinder or torus.
This multiplication circuit requires the multiplier operand to be shifted into the mul-
tiplier pipeline in n cycles and out in n additional cycles for a total of 2n clock cycles
per multiplication. Nominally, the multiplier pipeline should be cleared at the beginning
and end of each multiplication. To perform this while storing the multiplicand operand in
an n-bit shift register with no external storage, the product logic must effectively ignore
the multiplier operand on every second pass through its shift register. This is provided
by ANDing the parallel multiplicand operand, md, with the sequence 1000...0, 1100...0,
1110...0, ..., 1111...1, 0111...1, 0011...1, and finally 0000...0 during each multiplication se-
quence. Such sequences are easily generated with Johnson-Ring or Mobius [15] counters.
Multiplicand enabling would occur in the control logic external to the multiplier array.
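The enable sequence listed above is the state sequence of a Johnson (twisted-ring) counter. A minimal sketch of such a counter follows; the function name is hypothetical and the model is behavioral only, starting from a cleared register.

```python
def johnson_sequence(n):
    """Return the 2n successive states of an n-bit Johnson counter.

    The register starts cleared. On each clock the complement of the last
    bit is shifted into the first position. States are returned as bit
    strings, first register position on the left.
    """
    state = [0] * n
    states = []
    for _ in range(2 * n):
        state = [1 - state[-1]] + state[:-1]
        states.append("".join(str(bit) for bit in state))
    return states


if __name__ == "__main__":
    for mask in johnson_sequence(4):
        print(mask)
```

The counter returns to the all-zero state after 2n clocks, which matches the 2n clock cycles of one multiplication.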
In this array, operands in the mr pipeline and products in the product pipeline may be
moved in two dimensions relative to each other, not the physical array. To move products
east relative to the mr data, mrout is fed back to the MR register in the same multiplier
element. To move the products west relative to the mr data, mr is fed from its west
neighbor, so that the mr operands travel across the array eastbound at twice the rate of
the product pipeline. Northwest and southwest input switching enable vertical movement.
woomobius -► md, mrout -+ mrin, win -+ a; (ncycles)
woomobius -+ md, mrin --> mr, win -+ a; (ncycles)
wolmobius -+ md, mrin -+ mr, win - ► a; (2ncycles)
woomobius --> md, mrin -►mr, win -► a; (ncycles)
woomobius md, mrout -+ mr, nwin -+ a; (ncycles)
w12mobius -+ md, mrin --+ mr, nwin -► a; (ncycles)
w12mobius -► md, mrout --+ mr, win -> a; (ncycles)
wllmobius -► md, mrout - ► mr, win -+ a; (2ncycles)
wlomobius --> md, mrout - ► mr, win -► a; (ncycles)
wlomobius -> md, mrout -> mr, nwin -+ a; (ncycles)
woomobius ->md, mrin -►mr, nwin ---► a; (ncycles)
woomobius --> md, mrin --> mr, win -+ a; (ncycles)
W21mobius -+ md, mrin - ► mr, win -► a; (2ncycles)
w22mobius md, mrin - ► mr, win -+ a; (ncycles)
w22mobius -+ md, mrin -> mr, win -> a; (ncycles)
Table 1: SASM Convolution
Table 1 illustrates a 3-by-3 by n-bit convolution. The array is initialized with the
product shift registers cleared, the mr registers containing the pixel data, and the mobius
enable register in the control sequencer (footnote 4) cleared. In this example, only one ME per
pixel is required, so identical control and multiplicand information is sent to all MEs
Footnote 4: w00·mobius -> md means that the multiplier, w00, is ANDed with the output of an n-bit Johnson-ring
counter, as described above.
simultaneously (SIMD operation). Control and multiplicand information is indicated with
the transfer operator, ->.
The 3-by-3 integer convolution requires 18n clock cycles or 144 for 8-bit operands.
This array configuration is fully pipelined for all communication within and between the
multiplier elements.
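The 18n figure can be recovered by tallying the cycle annotations of the fifteen steps in Table 1: twelve steps of n cycles and three steps of 2n cycles. The sketch below does only that arithmetic; the list and function names are hypothetical.

```python
# Cycle annotation of each Table 1 step, in units of n.
ROW_CYCLES_IN_N = [1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1]


def sasm_convolution_cycles(n):
    """Total clock cycles for the 3-by-3 SASM convolution with n-bit operands."""
    return sum(ROW_CYCLES_IN_N) * n


if __name__ == "__main__":
    print(sasm_convolution_cycles(8))
```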
4 Design Methodology
The design methodology adopted for the development of high-performance multiplier ar-
rays is two-faceted, combining bottom-up and top-down approaches. First, a library of
custom multiplier datapath cells based on differential single-clock NORA circuits was de-
veloped that could be utilized in a variety of array configurations. Second, hardware
description languages (HDLs), logic synthesis, logic and switch simulation, and module
compilation tools are used for top-down definition and verification of multiplier array sys-
tems and rapid prototyping of semi-custom CMOS models.
Custom datapath cells were developed using the Tekspice circuit simulator and graphics
editor Quickic from Tektronix, Inc. Scalable CMOS design rules from the National Sci-
ence Foundation MOSIS program were used to provide portability to a number of silicon
foundries, compatibility with cell libraries from the academic community, and access to
economical multi-project CMOS prototyping through the MOSIS service. Design rule veri-
fication is done with SDRC, provided by the Northwest Laboratory for Integrated Systems
(NWLIS) at the University of Washington.
The OCT toolset from the University of California, Berkeley, is used for top-down
behavioral and structural HDL modeling, logic and switch simulation, CMOS standard-
cell prototype synthesis, and custom layout. The toolset is an integrated system for VLSI
design, including tools and libraries for multi-level logic synthesis, standard cell placement
and routing, programmable logic array and gate matrix module generation, custom cell
design, and utility programs for managing design data. Most tools are integrated with the
OCT data-base manager and the X-based VEM graphical user interface.
The OCT tools currently use non-industry standard hardware description languages
BDS for behavior (originally from the Digital Equipment Corporation) and Bdnet for
structure. EDIF support has been recently introduced and VHDL support is under devel-
opment. Unlike some recent commercial VHDL tools, BDS descriptions cannot be directly
simulated but must first be compiled into logic netlists. While this has not been a major
inconvenience, direct HDL modeling is also under development at Berkeley.
A behavioral description of an 8-bit serial-parallel multiplier of the pipelined-product
configuration is listed in Table 2.
Behavioral models are structurally decomposed in a top-down fashion. For instance,
the multiplier of Table 2 is decomposed into eight instances of the bit-multiplier cell
whose behavior is defined in Table 3.
In turn, this bit-multiplier cell is used to define the function of the custom differential
single-clock NORA circuit described in the next section.
Datapath module compilers to assemble tiled arrays of multiplier elements are under
development locally in the OCT environment. Composite placement and routing of custom
and semi-custom components is done with a combination of OCT toolset place-and-route
and symbolic layout programs. Final mask verification is performed using software from
the Berkeley OCT and Washington NWLIS toolsets. Masks are released for fabrication
via the MOSIS service in Caltech Intermediate Form (CIF). Currently, all custom
and semi-custom components are designed under 2-micron design rules for fabrication
using economical high-volume commercial processes. MOSIS scalable design rules (SCMOS
Revision 6) are currently supported to a minimum feature size of 1.2 µm, which could be
used to fabricate the custom datapath library without modification.
5 Differential Single-Clock NORA CMOS Circuits
High performance serial multiplier arrays require CMOS circuit realizations with properties
complementing those of the multiplier architecture. Specifically, these circuits should have
low input loading, low internal signal fanout, high locality of interconnection, minimal
series device delays, and little clock skew.
At high clock speeds, in excess of 50 MHz, clock skews and associated hold time mar-
gins occupy a significant portion of the clock cycle timing budget for synchronous systems.
For this reason, two phase non-overlapped clocking schemes (a favorite technique of NMOS
circuit designers) are not commonly used at clock rates above 40 MHz. Above 100 MHz,
skew between a single phase clock and its complement can become significant. One solution
is to eliminate complemented clocks by using circuits requiring only a single phase
uncomplemented clock. One such circuit was proposed by Yuan Ji-Ren et al. in 1987 [11] as an
improvement to NORA, or "No-Race" logic [9]. With this technique, synchronous systems
may be constructed from alternating precharged P-fet and N-fet logic stages separated by
clocked inverters. This scheme, like its predecessors NORA and Domino [12] logic, is free
of precharge race failures, yet eliminates the need for a complemented clock and associated
skewing problems at very high clock rates. This method appears to have good potential
for future CMOS control [18] and datapath applications if the following design constraints
are imposed:
1. Series logic fets are minimized. NOR functions are preferable to NAND in N-logic
stages; NAND is preferable to NOR in P-logic.
2. Series inverters to form complemented gate outputs are eliminated from critical tim-
ing paths. Unlike Domino logic, the sense of dynamic logic stage output transitions
is not important in preventing a precharge race condition in successive stages. There-
fore, inverters may be used to complement logic stage outputs. These, however, add
significant output delay. In critical timing paths, it is preferable to construct differ-
ential logic stages that output complemented output pairs with equal delay. In many
cases, programming fets may be shared between the true and complemented logic
networks.
3. Logic functions are split between successive P-fet and N-fet logic stages which also
serve as synchronous master and slave delay elements. This serves to reduce series
logic delays (a). Where practical, layout area and input loading may be reduced by
realizing the largest logic stages in N-logic rather than P.
A high-performance serial multiplication cell is under development in 2-µm (gate length)
CMOS using differential single-clock NORA techniques. A circuit diagram is shown in
Figure 8. The P-logic stages at left precharge low during clock and evaluate during clockbar
and serve as master storage latches. The N-logic stages at right precharge high during
clockbar and evaluate during clock, serving as slave dynamic storage latches. The clocked
inverters at each stage output ensure that the inputs to successive stages only change
during the precharge phase and are stable during evaluation, eliminating precharge race
conditions.
In this circuit, a maximum of two series logic fets are used in any dynamic stage.
Differential logic stage configurations are used wherever there are two logic fets in series
and complemented outputs are required, such as the P-logic and N-logic differential XOR
stages. Discrete output inverters are allowed only where one series logic fet exists in a
dynamic stage, such as the md-AND-mr P-logic stage. N-logic stages are used to drive
the cell outputs to take advantage of higher N mobility in critical inter-cell communication
timing.
The multiplier cell was simulated (footnote 5) at clock rates in excess of 100 MHz with SPICE
using 2-micron design rules. Substantially higher speeds should be possible using more
advanced processes.
A preliminary layout of the multiplier cell, with most metal-2 bussing removed, is
shown in Figure 9. It is 120 µm by 150 µm, or 18,000 µm^2, compared to 99,000 µm^2 for
the standard-cell prototype implementation.
6 Conclusions
Arrays of systolic serial-parallel multiplier elements have been proposed as an alternative
to SIMD mesh arrays of serial adders for multiplication intensive parallel applications
requiring relatively little operand storage capacity. This type of machine is suited for a
narrower class of problems than conventional SIMD mesh arrays but broader than for
special purpose systolic machines. Targeted applications include image processing and
compression, particularly those involving convolution based algorithms, such as Laplacian
pyramid encoding [5].
A new variation of single-clock NORA CMOS circuits is presented for application in
high speed systolic multiplication networks. A design methodology is proposed that em-
phasizes matching the properties of the array design at the system, register (multiplier
Footnote 5: Spice simulations and layout by M. Feister, D. Virag, and D. Mathews, of Montana State University,
using Tektronix, Inc. Tekspice, Tektronix MFET Level 2 device models, and slow speed device parameters














Figure 8: Circuit Diagram
Figure 9: Layout Diagram
element), and circuit levels. The Berkeley OCT tool set is used to facilitate both custom
VLSI design and the rapid development of simulation models and semicustom
prototype components using logic synthesis methods. The circuit elements, serial multiplier
modules, and design methodology are intended to serve as a set of building blocks to
facilitate development of processing arrays for a variety of applications.
Currently, custom CMOS modules and semicustom prototypes for high performance
systolic arrays of systolic multipliers are under development. It is hoped that this work
will lead to the implementation of a large scale parallel array prototype for image and
integer matrix processing. Future investigations will also include the application of these
circuit modules and methodology to other types of computing arrays, including application
specific systolic arrays, digital neural networks, and error correction encoders/decoders.
This work was supported by grants from the NASA Space Engineering Research Cen-
ter at the University of Idaho, Moscow, and the Montana State University Engineering
Experiment Station. The author would like to thank Dr. Gary Maki, Diane Mathews,
and the students of EE501 at Montana State University for their invaluable contribution
to this work.
References
[1] G. Barnes, R. Brown, M. Kato, D. Kuck, D. Slotnick, R. Stokes, "The ILLIAC IV
Computer," IEEE Trans., C-17, vol. 8, pp. 746-757, August, 1968.
[2] K. Batcher, "Design of a Massively Parallel Processor," IEEE Trans. Computers, vol.
C-29, no. 9, Sept. 1980, pp. 836-840.
[3] K. Batcher, "The Architecture of Tomorrow's Massively Parallel Computer," Proc.
1st Symposium on the Frontiers of Massively Parallel Scientific Computation, Sept.
24, 1986, Greenbelt, MD, pp. 151-157.
[4] D. Blevins, E. Davis, R. Heaton, J. Reif, "BLITZEN: A Highly Integrated Massively
Parallel Machine," Proc. 2nd Symposium on the Frontiers of Massively Parallel Com-
putation, Oct. 10, 1988, Fairfax, VA, pp. 399-406.
[5] P. Burt and E. Adelson, "The Laplacian Pyramid as a Compact Image Code," IEEE
Trans. on Communications, Vol. COM-31, No. 4, April 1983.
[6] E. L. Cloud, "The Geometric Arithmetic Parallel Processor," Proc. 2nd Symposium
on the Frontiers of Massively Parallel Computation, Oct. 10, 1988, Fairfax, VA, pp.
373-381.
[7] T. Fountain, "A Survey of Bit-Serial Array Processor Circuits," Computing Structures
for Image Processing, M. Duff ed., Academic Press, 1983.
[8] F. A. Gerritsen, "A Comparison of the CLIP4, DAP, and MPP Processor-Array
Implementations," Computing Structures for Image Processing, M. J. Duff ed., Academic
Press, 1983.
[9] N. Goncalves and H. De Man, "NORA: A Racefree Dynamic CMOS Technique for
Pipelined Logic Structures," IEEE J. Solid-State Circuits, vol. SC-18, no. 3, June
1983, pp. 261-266.
[10] D. Hampel, K. McGuire, and K. Prost, "CMOS/SOS Serial-Parallel Multiplier," IEEE
J. on Solid-State Circuits, Vol. SC-10, No. 5, October 1975.
[11] Y. Ji-Ren, I. Karlsson, and C. Svensson, "A True Single-Phase-Clock Dynamic CMOS
Circuit Technique," IEEE J. Solid-State Circuits, vol. SC-22, no. 5, October 1987, pp.
899-901.
[12] R. Krambeck, C. Lee, and H. Law, "High-Speed Compact Circuits with CMOS,"
IEEE J. Solid-State Circuits, vol. SC-17, June 1982, pp. 614-619.
[13] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, 1980.
[14] D. Slotnick, W. Borck, R. McReynolds, "The Solomon Computer," Proc. of AFIPS
Fall Joint Comp. Conf., Wash. DC, 1962, pp. 97-107.
[15] H. Taub and D. Schilling, Digital Integrated Electronics, McGraw-Hill Inc., New York,
1977, pp. 349-355.
[16] S. Unger, "A Computer Oriented Toward Spatial Problems," Proceedings of the IRE,
vol. 46, no. 10, pp. 1744-1750, October, 1958.
[17] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, a Systems Perspective,
Addison-Wesley, 1985.
[18] K. Winters, EES Quarterly Report: Project 16233045, Bridger Processor Array In-
vestigation, Montana State University, Feb. 12, 1988.




! Variables: nextprod & presentprod - product state variables
! a - addend, mr - multiplier, md - multiplicand
!
model mulreglog nextprod<7:0> = a<0>, mr<0>, md<7:0>, presentprod<7:0>;
routine cycle;
nextprod = (presentprod SR0 1) + (128 x a) + (mr x md);
endroutine;
endmodel;
Table 2: BDS Description of 8-bit Multiplier
! MULBIT multiplier register bit description
! File : mulbit.bds
! Kel Winters
! Rev: 9-5-89
! Variables: a - addend, mr - multiplier, md - multiplicand,
! cin - carry in, cout - carry out, sum - sum of prod.
! p - product of mr and md.
!
model mulbit sum< 0 >, cout< 0 > = a< 0 >, mr< 0 >, md< 0 >, cin< 0 >;
routine cycle;
state p< 0 >;
p = mr AND md;
sum = a XOR p XOR cin;
cout = (a AND p) OR (a AND cin) OR (p AND cin);
endroutine;
endmodel;
Table 3: BDS Description of Bit-Multiplier Cell
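The bit-multiplier cell of Table 3 is an AND gate feeding a full adder. The sketch below restates it in Python so that the logic can be checked exhaustively; it is a transliteration for illustration, not output of the OCT tools, and it omits the state declaration of the BDS model.

```python
def mulbit(a, mr, md, cin):
    """One bit-multiplier cell: returns (sum, cout) for single-bit inputs."""
    p = mr & md                                # partial product bit
    s = a ^ p ^ cin                            # full-adder sum
    cout = (a & p) | (a & cin) | (p & cin)     # full-adder carry (majority)
    return s, cout


if __name__ == "__main__":
    print(mulbit(1, 1, 1, 0))
```

For every input combination, sum + 2*cout equals a + (mr AND md) + cin.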





S. Gopalakrishnan, S. Whitaker, G. Maki and K. Liu
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - A major problem associated with state assignment procedures
for VLSI controllers is obtaining an assignment that produces minimal or near
minimal logic. The key item in PLA area minimization is the number of unique
product terms required by the design equations. This paper presents a state
assignment algorithm for minimizing the number of product terms required to
implement a finite state machine using a PLA. Partition algebra with prede-
cessor state information is used to derive a near optimal state assignment. A
maximum bound on the number of product terms required can be obtained by
inspecting the predecessor state information. The state assignment algorithm
presented is much simpler than existing procedures and leads to the same number
of product terms or fewer. An area-efficient PLA structure implemented in
a 1.0µm CMOS process is presented along with a summary of the performance
for a controller implemented using this design procedure.
1 Introduction
Most VLSI can be partitioned into data path and control logic. In many applications the
control logic is realized with a PLA. Most chips perform operations that are synchronized
with a clock and can be modeled as a synchronous machine. The control logic can be
formally modeled as a synchronous sequential circuit. Once a formal description of the
control logic is given in the form of a flow table or equivalent representation, a state
assignment procedure is invoked to generate a binary encoding of the states.
A main problem associated with the state assignment procedure is obtaining an as-
signment that produces minimal or near minimal logic. Finding a valid state assignment
that produces a valid hardware realization of the flow table is trivial; however, finding
an assignment that has near minimal hardware is not. The state assignment process is
combinatorial in nature and fits into the category of NP-complete problems. The lower
bound on the number of assignments for a flow table with 2^q states is 2^q!. Enumeration,
even for medium sized machines, is not practical.
The state assignment problem has been a research subject for a long time [1,2,3],
however, the problem has gained renewed interest with the increasing complexity of VLSI
circuits [4,5,6,7] that realize sequential circuits. Sequential circuits are often realized with
Programmable Logic Arrays (PLA). The size of a PLA is related to the number of unique
product terms required in the design equations. The number of literals in each product
term is not the important issue. Hence finding a MSP expression does not necessarily lead
to a minimal PLA but producing a solution that uses a minimum number of product terms
does.
This paper presents a state assignment algorithm that can be used to reduce the number
of product terms needed to implement a state machine using a PLA and also presents a
dynamic, area-efficient PLA architecture. Section 2 of this paper deals with the design
algorithms. The state assignment algorithm given in this paper is much simpler than the
existing procedures [6,7] and leads to the same number of product terms or less. Section
3 applies the state assignment algorithm to an example. Section 4 deals with the PLA
architecture and Section 5 reports the performance of a controller designed using the
algorithms described in this paper.
2 Design Algorithms
Before demonstrating the design algorithms, some basic definitions must be reviewed [8,9]
and established.
Definition 1 A τ partition is a two block partition that partitions the states that are coded
1 by a state variable from those coded 0.
If τ_i = {S_i S_j; S_k S_l}, where S_i, S_j, S_k and S_l are internal states of a state machine, then
S_i and S_j are coded 1 by y_i and the states S_k and S_l are coded 0 by y_i.
Definition 2 A total circuit state of a sequential machine is a pair (Si ,Ip), where Si is
an internal state of the machine and Ip is a member of the set of input states.
The total circuit state of the machine uniquely specifies the current state of the machine,
i.e., the internal state and the input state.
Definition 3 An internal state Sj has a total predecessor state Sk * Ip, if the circuit tran-
sitions from Sk to Sj when the input Ip is applied.
Predecessor states specify the next state values for each next state variable Y and
output state variable Z. If y_i encodes state S_j as 1(0), then everywhere S_j appears in the
flow table, Y_i = 1(0). The total predecessor states uniquely specify the next state entries
of each Y and Z.
Partition algebra can be used to formalize the design equations. Let τ_i code S_j and S_k
with a 1. Let the total predecessor states for S_j and S_k be S_l * I_p, S_m * I_q and S_n * I_r, S_o * I_s
respectively. Then the next state partition of Y_i is {S_l * I_p, S_m * I_q, S_n * I_r, S_o * I_s; ...}. The
design equation for state variable Y_i is given below.
$$Y_i = S_l \cdot I_p + S_m \cdot I_q + S_n \cdot I_r + S_o \cdot I_s \qquad (1)$$
This equation specifies that Y_i consists of four product terms, each covering a total
circuit state. Since y_i codes states S_j and S_k with a 1, it should attain a value of 1
whenever the circuit transitions to the states S_j and S_k. These transitions occur when the
circuit assumes the total predecessor states for either S_j or S_k. Covering all predecessor
states where Y_i = 1 produces a sum-of-products expression for Y_i.
Each predecessor state corresponds to one next state entry; the total number of prede-
cessor states equals the number of specified entries in the flow table. Since each predecessor
state produces a unique product term, the maximum bound on the number of unique prod-
uct terms is equal to the number of specified entries in the flow table. If product terms
can be shared in generating Y_i and Z_j, then the number of entries in the flow table is an
upper bound on the number of product terms needed in a PLA.
Whenever a state variable y_i assigns a value 0 to a state, the corresponding product
term for Y_i need not be generated. This produces two important observations:
1. States which are coded with more 0's than 1's require fewer product terms.
2. The state that appears most often (the greatest number of predecessor states) ought
to be coded all 0.
Outputs play an important role in the state assignment process which seeks to minimize
the total number of product terms. A product term is needed for all total circuit states
that require an output of 1 and its presence is independent of the state assignment. The
state assignment only determines which product term is needed. Since product terms are
needed for every total circuit state with an output = 1, these states are ignored in the
first step of the state assignment procedure. The design procedure for a Mealy
type machine [10,11] is outlined next.
State Assignment Procedure
1. Identify all the predecessor terms necessary to generate the outputs and remove
them from the predecessor table.
2. Assign the all zero state as the state with the maximum number of remaining
predecessor terms in the predecessor table. If the inputs are decoded using the
same PLA, then expand all the inputs to their corresponding expressions and
count the number of product terms that should be generated.
3. Apply the Armstrong-Humphrey adjacency conditions [2] to complete the re-
maining part of the state assignment. The conditions that are applicable for
PLA based implementation are:
(a) States that have the same next state for a given input should be given
adjacent assignments (in general, an assignment which allows a single product
term to cover the corresponding predecessor state terms).
(b) States that have the same output for the given input should be given ad-
jacent assignments (In general an assignment which allow a single product




State  I1    I2    I3
A      C/10  B/00  E/10
B      E/11  B/00  D/10
C      A/10  B/00  E/10
D      A/00  D/01  B/10
E      C/00  D/00  C/01
Table 1: Sample Flow Table
Each output variable Z_i can be treated the same as the next state variable Y_i. The
expression for Z_i covers the predecessor terms where Z_i = 1. The number of product
terms for each Z_i is independent of the state assignment. This fact is useful in selecting
the state assignment. The objective of the state assignment generation is to derive an
assignment which minimizes the number of product terms that need to be covered. Since
predecessor states that are present in the generation of Z_i must be covered, the state
assignment selection can impact only those predecessor states that are not contained in
any Z_i. The first step in the algorithm incorporates this action. When using this algorithm
for a Moore type machine [10], the designer should try to assign states such that they map
into outputs.
3 Design Example
The flow table in Table 1 is used to demonstrate this algorithm. The first step is to
list the predecessor states for the state machine of Table 1. Table 2 shows this list. The
product terms required to generate the outputs have been listed in Table 3. These terms
are necessary for the outputs and are removed from consideration for state assignment
selection. The reduced predecessor table is given in Table 4. Following the second step in
the state assignment algorithm, State B is selected as the all zero state since it contains
the largest number of predecessor terms. Applying the adjacency conditions, the state
assignment of Equation 2 is derived.
$$
\begin{aligned}
\tau_1 &: ED;\ ABC \\
\tau_2 &: AE;\ BCD \\
\tau_3 &: ACDE;\ B
\end{aligned}
\tag{2}
$$
From the τ partitions in Equation 2, state A is coded 0 by y1 and 1 by y2 and y3. Y1 is
coded 1 by the states E and D, leading to the following design equation for Y1.
$$
Y_1 = B \cdot I_1 + A \cdot I_3 + C \cdot I_3 + D \cdot I_2 + E \cdot I_2 + B \cdot I_3 \tag{3}
$$
Similarly, the design equations for the other state variables and outputs are given below.





E       B * I1, A * I3, C * I3
Table 2: Predecessor Table
Output  Product terms
Z1      A * I1, B * I1, C * I1, A * I3, B * I3, C * I3, D * I3
Z2      B * I1, D * I2, E * I3
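The first two steps of the procedure can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary `FLOW` transcribes Table 1, the function names are hypothetical, and the final count assumes that every output term and every remaining predecessor term of a state other than the all zero state needs its own product term (no sharing through adjacencies).

```python
# Table 1 of the text: state -> (next state, "Z1Z2") under I1, I2, I3.
FLOW = {
    "A": [("C", "10"), ("B", "00"), ("E", "10")],
    "B": [("E", "11"), ("B", "00"), ("D", "10")],
    "C": [("A", "10"), ("B", "00"), ("E", "10")],
    "D": [("A", "00"), ("D", "01"), ("B", "10")],
    "E": [("C", "00"), ("D", "00"), ("C", "01")],
}


def predecessor_table(flow):
    """Map each next state to the (present state, input index) terms leading to it."""
    table = {s: [] for s in flow}
    for state, row in flow.items():
        for k, (nxt, _) in enumerate(row, start=1):
            table[nxt].append((state, k))
    return table


def output_terms(flow):
    """Terms for which at least one output bit is 1 (needed regardless of assignment)."""
    return {(s, k) for s, row in flow.items()
            for k, (_, z) in enumerate(row, start=1) if "1" in z}


def reduced_table(flow):
    """Step 1: drop the terms that the outputs already require."""
    needed = output_terms(flow)
    return {s: [t for t in terms if t not in needed]
            for s, terms in predecessor_table(flow).items()}


def all_zero_state(flow):
    """Step 2: the state with the most remaining predecessor terms gets code 00...0."""
    reduced = reduced_table(flow)
    return max(reduced, key=lambda s: len(reduced[s]))


def total_product_terms(flow):
    """Output terms plus remaining terms of every state except the all zero state."""
    zero = all_zero_state(flow)
    extra = sum(len(t) for s, t in reduced_table(flow).items() if s != zero)
    return len(output_terms(flow)) + extra
```

For Table 1 this selects state B and gives a count that agrees with the total quoted in the text.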











The total number of product terms generated is 12, including the outputs. The state
assignment uses 3 state variables. This is the same number of product terms generated
using KISS [6]. By making use of the adjacencies in the state assignment, the following
reduced design equations can be obtained.










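$$
\begin{aligned}
Y_3 &= I_1 + (A + B + C) I_3 + D \cdot I_2 + E \cdot I_2 + E \cdot I_3 \\
Z_1 &= (A + B + C) I_1 + (A + B + C) I_3 + D \cdot I_3 \\
Z_2 &= B \cdot I_1 + D \cdot I_2 + E \cdot I_3
\end{aligned}
\tag{5}
$$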
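This leads to 10 unique product terms to realize the circuit. The design equations in
terms of the state variables are as follows.
$$
\begin{aligned}
Y_1 &= \bar{y}_1\bar{y}_2\bar{y}_3 \cdot I_1 + \bar{y}_1 \cdot I_3 + y_1\bar{y}_2 y_3 \cdot I_2 + y_1 y_2 y_3 \cdot I_2 \\
Y_2 &= \bar{y}_1\bar{y}_2\bar{y}_3 \cdot I_1 + \bar{y}_1 y_3 \cdot I_3 + \bar{y}_2 y_3 \cdot I_1 \\
Y_3 &= I_1 + \bar{y}_1 \cdot I_3 + y_1\bar{y}_2 y_3 \cdot I_2 + y_1 y_2 y_3 \cdot I_2 + y_1 y_2 y_3 \cdot I_3 \\
Z_1 &= \bar{y}_1 \cdot I_1 + \bar{y}_1 \cdot I_3 + y_1\bar{y}_2 y_3 \cdot I_3 \\
Z_2 &= \bar{y}_1\bar{y}_2\bar{y}_3 \cdot I_1 + y_1\bar{y}_2 y_3 \cdot I_2 + y_1 y_2 y_3 \cdot I_3
\end{aligned}
\tag{6}
$$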
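As a consistency check, the state-variable equations can be evaluated against every entry of Table 1. The sketch below is illustrative only: the state codes are read off the τ partitions of Equation 2 (A = 011, B = 000, C = 001, D = 101, E = 111), exactly one input is assumed active at a time, and the names `evaluate` and `check` are hypothetical.

```python
# Table 1 of the text: state -> (next state, "Z1Z2") under I1, I2, I3.
FLOW = {
    "A": [("C", "10"), ("B", "00"), ("E", "10")],
    "B": [("E", "11"), ("B", "00"), ("D", "10")],
    "C": [("A", "10"), ("B", "00"), ("E", "10")],
    "D": [("A", "00"), ("D", "01"), ("B", "10")],
    "E": [("C", "00"), ("D", "00"), ("C", "01")],
}
# Codes (y1, y2, y3) implied by the partitions of Equation 2.
CODE = {"A": (0, 1, 1), "B": (0, 0, 0), "C": (0, 0, 1),
        "D": (1, 0, 1), "E": (1, 1, 1)}


def evaluate(y, k):
    """Evaluate the five equations for present-state bits y and active input I_k."""
    y1, y2, y3 = y
    n1, n2, n3 = 1 - y1, 1 - y2, 1 - y3           # complemented state variables
    i1, i2, i3 = (int(k == j) for j in (1, 2, 3))  # one-hot input
    b = n1 & n2 & n3                               # product selecting state B (000)
    d = y1 & n2 & y3                               # product selecting state D (101)
    e = y1 & y2 & y3                               # product selecting state E (111)
    # '&' binds tighter than '|', so each line is a sum of products.
    Y1 = b & i1 | n1 & i3 | d & i2 | e & i2
    Y2 = b & i1 | n1 & y3 & i3 | n2 & y3 & i1
    Y3 = i1 | n1 & i3 | d & i2 | e & i2 | e & i3
    Z1 = n1 & i1 | n1 & i3 | d & i3
    Z2 = b & i1 | d & i2 | e & i3
    return (Y1, Y2, Y3), f"{Z1}{Z2}"


def check(flow, code):
    """True if the equations reproduce every next state and output in the flow table."""
    return all(evaluate(code[s], k) == (code[nxt], z)
               for s, row in flow.items()
               for k, (nxt, z) in enumerate(row, start=1))
```

Calling `check(FLOW, CODE)` walks all 15 table entries; a mismatch in any product term would make it return `False`.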
4 PLA Architecture
The architecture for the PLA implementation is shown in the block diagram of Figure 1.
The main paths are represented schematically in Figure 2. The AND and OR planes are
both precharged to a 1 by p-channel transistors when φ = 1. During precharge, n-channel
evaluate transistors for the OR and AND planes disconnect the ground structures to avoid
a ratioed DC current path. The input latch is enabled to capture new state and input
information during the precharge time. The input lines to the AND core must propagate
and settle before φ → 0. The OR plane precharge transistors are gated by a dummy
line through the AND plane. This dummy line is constructed to be the slowest possible
configuration for an AND plane term such that the OR plane remains in a precharge state
after φ → 0 until all the AND plane output lines are settled. This self-timed concept
avoids charge sharing and races between the AND and OR planes during evaluation and
is the same concept used in RAM design.
When φ → 0, first, the AND plane evaluates; then the OR plane evaluates. The output
of the OR plane is then captured in the output Flip Flops. Two Flip Flop designs are used
for the state and output registers. One is set and the other is reset by a control signal. The
control input is tied to the reset condition for the state machine and an appropriate latch
is chosen for each state variable and output such that the initial state is reached under
any reset condition.
Figure 1: PLA Block Diagram
5 Results
The state assignment procedure, design techniques and PLA implementation were applied
to the design of the PLA controller shown in Figure 3 [12]. The controller required 13
states, 4 state variables, 20 outputs and 139 PLA product terms. The circuit was drawn
in a 1.0µm, double metal CMOS process utilizing minimum transistor sizes in the core.
The resulting PLA core was 270.2 µm by 642.2 µm. Capacitance information was then
extracted, and SPICE simulations were run to determine the operating frequency and
margins. The controller operated at 25 MHz under 3σ worst case speed parameters at
100°C and Vdd = 4.5V. The AND plane inputs run in both polysilicon and Metal 2
through the core, and poly was occasionally shorted to avoid significant RC time delays
for propagating the input signals. The width of the OR core was sufficiently small for
this PLA such that no periodic shorting was needed. Speed of operation is limited by
the self-timing circuit. If timing signals were available for the AND and OR plane, the
implementation would operate at 35 MHz under 3σ worst case speed parameters at 100°C
and Vdd = 4.5V.
References
[1]J. Hartmanis, "On The State Assignment Problem for Sequential Machines", IEEE
Transactions on Electronic Computers, Vol. EC-10, pp. 157-167, Jun., 1961
[2]D. Armstrong, "Efficient Assignment of Internal Codes to Sequential Machines", IRE
Transactions on Electronic Computers, vol EC-11, pp. 611-622, Oct., 1962
[3]J. R. Story, H. J. Harrison and E. A. Reinhard, "Optimum State Assignment for






Figure 2: PLA Path Schematic
[4] G. Michelli, R. K. Brayton and A. L. Sangiovanni-Vincentelli, "Computer-aided Syn-
thesis of PLA-based Finite State Machines", Proceedings of ICCAD-84, pp. 154-157,
Sep., 1984
[5] T. Sasao, "Input Variable Assignment and Output Phase Optimization of PLA's",
IEEE Transactions on Computers, Vol. C-33, pp. 879-894, Oct., 1984
[6] G. Michelli, R. K. Brayton and A. L. Sangiovanni-Vincentelli, "Optimal State As-
signment for Finite State Machines", IEEE Transactions on CAD, Vol. CAD-4, pp.
269-285, Jul., 1985
[7] R. Amann and U. G. Baitinger, "Optimal State Chains and State Codes in Finite
State Machines", IEEE Transactions on CAD, Vol. CAD-8, pp. 153-170, Feb., 1989
[8] J. Tracey, "Internal State Assignments for Asynchronous Sequential Machines", IEEE
Transactions on Electronic Computers, Vol. EC-15, pp. 551-560, Aug., 1966
[9] G. Maki, D. Sawin and B. Jeng, "Improved State Assignment Selection Tests", IEEE
Transactions on Computers, Vol. C-21, pp. 1443-1449, Dec., 1972
[10] C. Roth, Fundamentals of Logic Design, 3rd Ed., St. Paul, Minn., West Publishing,
1985
[11] D. Lewin, Design of Logic Systems, England, Van Nostrand Reinhold, 1985
[12] S. Gopalakrishnan, S. Whitaker, G. Maki and J. Gibson, "Simple Partition Algebra
Based State Assignments", Hewlett-Packard VLSI Design Technology Conference,
Portland, Ore., May 1989














for CMOS Sequential Circuits
S. Whitaker, G. Maki and M. Canaris
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - This paper presents a programmable architecture for sequential pass
transistor circuits. The resulting circuits are such that a state machine with
N states and M outputs is constructed using a single layout replicated N + M
times.
1 Introduction
Control circuits in digital logic are often designed as finite state machines. Control often
occupies a small portion of the overall chip area but a major portion of the logic design
effort. Layout of these controllers is often random in nature. Pass transistors have been
studied over the past several years resulting in high speed and high density, practical
combinational logic circuits [1]. This paper utilizes new design methods for sequential pass
transistor circuits, which result in circuits such that the realization for each next state
variable and output variable is identical for a given flow table [2]. Thus, a state machine
with N states and M outputs can be constructed using a single layout replicated N + M
times. The personalization of each state variable is made in the input pass variables applied
to the circuit. The number of paths in the network for each state variable is a function of
the flow table, not the state assignment.
Synchronous sequential circuits can be drawn with nicely structured PLA architectures.
Random logic in VLSI is normally avoided because of the increased cost of layout, veri-
fication, and design when compared with a regular architecture. Structured designs also
are more easily set up for programmatic generation. Attempts at structured asynchronous
sequential circuit design have been pursued in the past [3,4]. This architecture can reduce
the required design effort and lends itself to programmatic generation.
2 Design Equations
The circuits developed in [2] using the Tracey, Liu and Tan state assignments resulted in
networks which were identical in structure for each state variable. The personalization of
each state variable was made in the input pass variables applied to the circuit. The number
of paths in the network for each state variable is a function of the flow table, not the state
assignment and is equal to the number of p partitions. This architecture is suitable for
State   I1  I2  I3  y1  y2  y3
A       A   F   E   0   1   0
B       A   B   D   0   0   0
C       A   C   D   0   0   1
D       D   B   D   1   0   0
E       D   C   E   1   1   1
F       D   F   E   1   1   0








Input   Y1  Y2  Y3
x6      1   0   0
x5      1   1   0
x4      0   0   0
x3      0   0   1
x2      1   1   1
x1      1   0   0
Table 2: Liu circuit inputs.
establishing a structured layout. Since Liu assignments resulted in the most economical
circuits, this assignment will be employed.
The next state design equations for the flow table shown in Table 1 are
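$$
Y_1 = \bar{y}_1 I_1(0) + y_1 I_1(1) + y_2\bar{y}_3 I_2(1) + \bar{y}_2\bar{y}_3 I_2(0) + y_3 I_2(0) + y_2 I_3(1) + \bar{y}_2 I_3(1) \tag{1}
$$
$$
Y_2 = \bar{y}_1 I_1(1) + y_1 I_1(0) + y_2\bar{y}_3 I_2(1) + \bar{y}_2\bar{y}_3 I_2(0) + y_3 I_2(0) + y_2 I_3(1) + \bar{y}_2 I_3(0) \tag{2}
$$
$$
Y_3 = \bar{y}_1 I_1(0) + y_1 I_1(0) + y_2\bar{y}_3 I_2(0) + \bar{y}_2\bar{y}_3 I_2(0) + y_3 I_2(1) + y_2 I_3(1) + \bar{y}_2 I_3(0) \tag{3}
$$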
The circuit diagram of Figure 1 shows the logic for the next state variable Yi. The logic
is replicated three times and the inputs are driven by the next state information as shown
in Table 2 to form the total circuit diagram.
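The constants passed in the design equations can be regenerated from the flow table. The sketch below is a hedged illustration rather than the authors' procedure: it groups the states in each input column by their common next state (taken here to be the p partitions), then passes the code of that destination. The names `p_partitions` and `circuit_inputs`, and the ordering of the blocks, are assumptions chosen to follow the term order of the equations.

```python
# Flow table with the Liu state assignment: state -> next state under I1, I2, I3.
FLOW = {
    "A": ("A", "F", "E"), "B": ("A", "B", "D"), "C": ("A", "C", "D"),
    "D": ("D", "B", "D"), "E": ("D", "C", "E"), "F": ("D", "F", "E"),
}
# State codes (y1, y2, y3) from the same table.
CODE = {"A": (0, 1, 0), "B": (0, 0, 0), "C": (0, 0, 1),
        "D": (1, 0, 0), "E": (1, 1, 1), "F": (1, 1, 0)}


def p_partitions(flow):
    """One block per (input index, destination): the states that move to it."""
    blocks = {}
    for k in range(3):
        for state, row in flow.items():
            blocks.setdefault((k + 1, row[k]), []).append(state)
    return blocks


def circuit_inputs(flow, code):
    """Constants (for Y1, Y2, Y3) passed by each block: the destination's code."""
    return [code[dest] for (_, dest) in p_partitions(flow)]


def next_code(flow, code, state, k):
    """Code passed to the buffers when the circuit is in `state` under input I_k."""
    for (inp, dest), members in p_partitions(flow).items():
        if inp == k and state in members:
            return code[dest]
    raise ValueError("state not covered by any partition")
```

The number of blocks depends only on the flow table, while the constants depend only on the state assignment, which is the separation the architecture relies on.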
By observing the circuit diagram of Figure 1 and the circuit input matrix in Table 2,
three distinct sections are shown and a fourth is implied. The input section is a coding of
1's and 0's to program the state assignment for a given state variable. The p partition
section programs the structure of the flow table into the sequential circuit and is identical
for each state variable. The buffer section restores the threshold drop on the 1 level out
of the pass network and eliminates essential hazards on 0-0 cross over of the inputs. The
fourth section, which is implied if the circuit is to be programmatically generated, is a
programmable feedback section. The block diagram of the architecture is shown in Figure
2.
Figure 1: Circuit diagram for Liu state assignment.
By overlaying the architectural block diagram with the logic, the layout form can be
envisioned. Figure 3 shows the general logic/layout form. The buffer section requires
one input and two outputs along with power supply lines. The latching buffer is driven
by the output of the pass transistor network in the p partition section and feeds back to
drive two of the state variable lines in that section. This would be drawn as a cell such
that it fits the height of a minimum number of p partition variable lines. The feedback
section has both signals from the buffer in one layer of interconnect arranged such that a
contact can be dropped to the layer driving potential gates in the partition section. The
p partition section would be an array such that series structures driven by state variables
and primary inputs can be programmed. The programming could be accomplished with
transistor structures and jumper connections. The input section would consist of Vdd and
Vss supply lines which would be programmed by contacts on the input node lines to the
pass array.
The programming features needed by the architecture shown in Figure 3 can be more
easily seen by an example. Figure 4 shows an overlay of the logic for state variable ya
from the Liu state assignment circuit. The feedback section has contacts programmed
to connect the buffer outputs to the ya and Va lines driving through the pass network.
Transistors and jumpers are programmed in the p partition section to create the required
pass network. The input variables shown in Table 2 are programmed as connections to the
Vdd and Vss supply lines running through the input section.




















Figure 3: Architecture with logic





















Figure 4: Architecture with yi logic overlay.
programming and abutting the cells together. The complete circuit has been drawn in a
1.6 µm CMOS double metal N-well process and is shown in Figure 5.
The feed back lines are metal 1 which is programmed by placing a contact to the poly
lines feeding the gates in the p partition section. The size of the machine (93.2 µm by
121.6 µm) allowed the state lines and input lines to run in poly. No metal 2 was used in
the p partition section so that these lines could be run in metal 2 for machines requiring
a large number of state variables or a large number of partition lines.
The pass transistor matrix is programmed with either a diffusion — contact — metal 1
transistor structure or a metal 1 jumper. The transistors in the pass transistor network
are sized such that the metal overlap of the contact rule is just met, forming minimum
capacitance structures thus allowing maximum speed.
The input section has Vdd and Vss running in metal 2 with the programming vias
dropping to metal 1 lines feeding the pass array. The state variable metal 1 lines are
passed out of the cell under the Vdd and Vss lines of the input section to drive external
requirements.
3 Performance
The pass network transistors were sized to minimize the node diffusion at Wn = 3.2 µm.
The first buffer inverter was sized with a Wp = 6.4 µm p-channel transistor. The n-channel
transistor was the same size to lower the switch point of the inverter in order to compensate
for the threshold loss on the 1 level out of the p partition pass transistor array. The second
inverter in the buffer was also sized with Wp = Wn = 6.4 µm to minimize 1-1 cross over of
the state variables and avoid any potential essential hazards. The feedback devices were
weak transistors with Wp = 2.8 µm, Lp = 5.0 µm, Wn = 2.8 µm and Ln = 10.4 µm. The
sizes were set to ensure proper operation when these devices ratio with the pass network.
The state machine occupies an area 93.2 µm by 121.6 µm. The layout density is a very
respectable 171.7 µm² (0.266 mil²) per transistor or 54.75 µm² (0.085 mil²) per transistor
site. For perspective, a single standard cell D flip flop in this same 1.6 µm double metal
CMOS process is 70.4 µm by 139.2 µm [5]. The layout of Figure 5, which contains a 3
input, 5 state, 3 state variable state machine, occupies an area only 1.16 times that of a
single standard cell D flip flop drawn in the same process.
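The density figures above follow from simple arithmetic, which the short sketch below reproduces. The transistor and site counts are not stated in the text; they are back-calculated here from the quoted densities and should be read as derived estimates. The constant and function names are illustrative.

```python
UM2_PER_MIL2 = 25.4 ** 2          # 1 mil = 25.4 µm, so 1 mil² = 645.16 µm²

MACHINE_AREA = 93.2 * 121.6       # state machine layout, µm²
DFF_AREA = 70.4 * 139.2           # standard cell D flip flop, µm²


def um2_to_mil2(area_um2):
    """Convert an area from square micrometres to square mils."""
    return area_um2 / UM2_PER_MIL2


def implied_count(total_area_um2, area_per_item_um2):
    """Number of items implied by a total area and a quoted area per item."""
    return round(total_area_um2 / area_per_item_um2)


area_ratio = MACHINE_AREA / DFF_AREA                  # state machine vs. one flip flop
transistors = implied_count(MACHINE_AREA, 171.7)      # derived, not quoted in the text
transistor_sites = implied_count(MACHINE_AREA, 54.75) # derived, not quoted in the text
```

Under these numbers roughly one third of the available transistor sites are populated, which is consistent with a programmable array whose sites are only partly used.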
Parasitic capacitances were then extracted from the layout and a SPICE simulation
was run to determine the operating frequency of the state machine. Worst case speed 3σ
parameters were used in the simulations along with high temperature, Tj = 100 deg C, low
power supply, Vdd = 4.5V and supply bus drops of 0.2V. The inputs were assumed to have
a rise and fall time of 1.0 nsec. Under these assumptions, the circuit ran in fundamental
mode for 30 MHz input changes.
Typical speed parameters were then used in simulations along with room temperature,
Tj = 25 deg C, and typical power supply, Vdd = 5V. The inputs were assumed to have
a rise and fall time of 1.0 nsec. Under these assumptions, the circuit ran in fundamental
mode for 100 MHz input changes.
3.1 Improvements
The operating speed can be improved by two means. First, the buffer could be sized
to increase the speed of the state variables at the cost of increased dynamic power. An
improvement in speed with no penalty could be achieved by laying out the p partition
section such that the transistors driven by the circuit inputs would be next to the output.
This places the last arriving signal next to the pass transistor network output node and
maximizes the operating speed [6].
If speed needs to be optimized at the expense of programmability, then another im-
provement would be to reduce the logic. The reduction in logic for the p partition section
would reduce the total node capacitance that must be charged in the pass array and would
also reduce the gate capacitance driven by the state variable buffers.
References
[1] D. Radhakrishnan, S. Whitaker and G. Maki, "Formal Design Procedures for Pass
Transistor Switching Circuits", IEEE JSSC, vol. SC-20, April, 1985, pp. 531-536
[2] S. Whitaker and G. Maki, "Pass Transistor Asynchronous Sequential Circuits", IEEE
JSSC, Vol. SC-24, Feb., 1989, pp.
[3]J. Jump, "Asynchronous Control Arrays", IEEE Transactions on Computers, vol C-
23, no 10, Oct. 1974, pp. 1020-1029
[4] R. David, "Modular Design of Asynchronous Circuits Defined by Graphs", IEEE
Transactions on Computers, vol C-26, no 8, Aug. 1977, pp. 727-737
[5] EDS40 CMOS40 Standard Cell User Manual, Santa Clara, California, Hewlett-
Packard Company, 1986
[6]N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Reading, Mass.,
Addison-Wesley, 1985, pp. 55-57
This work was supported in part by NASA under Contract NAGW-1406 and by the




















A Bit Serial Sequential Circuit
S. Hu and S. Whitaker
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - Normally a sequential circuit with n state variables consists of a total
of n unique hardware realizations, one for each state variable. All variables are
processed in parallel. This paper introduces a new sequential circuit architec-
ture that allows the state variables to be realized in a serial manner using only
one next state logic circuit. The action of processing the state variables in a
serial manner has never been addressed before.
This paper presents a general design procedure for circuit construction and
initialization. Utilizing pass transistors to form the combinational next state
forming logic in synchronous sequential machines, a bit serial state machine
can be realized with a single NMOS pass transistor network connected to shift
registers. The bit serial state machine occupies less area than other realizations
which perform parallel operations. Moreover, the logical circuit of the bit serial
state machine can be modified by simply changing the circuit input matrix to
develop an adaptive state machine.
1 Introduction
Most controllers in digital circuits are modeled as sequential circuits. The controller dic-
tates the activity of a data path and interacts with external control signals. Since most
VLSI circuits are controlled by a clock, the controller is represented as a synchronous
sequential circuit.
In general, a sequential circuit processes all state variables in parallel with unique hard-
ware realizations for each state variable. With n next state variables, n unique hardware
circuits are needed to generate the variable values. Whitaker [1] has introduced a new de-
sign that produces identical hardware for each state variable. If the hardware is identical
for each state variable, then the logical question to address centers around the concept of
trying to develop a new architecture that has only one next state hardware circuit that is
used to generate all state variables. The advantage is much less hardware circuitry. The
disadvantage results from the same hardware being used n different times to calculate the
n state variables. Thus the hardware savings in next state generating logic is a factor of
n, while the time to calculate each next state increases by a factor of n.
It has been shown that pass transistor realizations of digital circuits possess the
advantages of high density and speed [2]. A unique form of pass transistor circuits is known as
Binary Tree Structure (BTS) circuits [3]. BTS circuits often require fewer transistors
and produce regular layouts. BTS networks are used to implement
the combinational portion of the next state equations in the design presented here.
        I1  I2  I3
A       C   B   A
B       D   C   B
C       E   D   C
D       F   E   D
E       A   F   E
F       B   A   F
Table 1: Example flow table.
The next chapter of this work shows the design method used to realize traditional pass
transistor sequential networks. Chapter 3 establishes the design procedure of bit serial
state machines. Chapter 4 discusses adaptable circuit realization.
2 Regular State Machine Design
2.1 Pass Transistor State Machine Design
The logic that forms each next state equation Yi consists of the following elements: a
storage device (normally a flip-flop), next state excitation circuitry which generates the
next state values to the flip-flop, and input logic. In classical state machine realizations,
all next state equations are evaluated in parallel with independent logic. Present state
information is fed back by state variables yi to the excitation logic. The excitation logic is
a combinational logic function of the input and state variable information. Current input
and state variables select the specific next state value.
Whitaker has derived a unique realization for asynchronous sequential circuits that
uses identical circuitry for every next state equation [1]. This concept can
be extended to synchronous sequential circuits discussed in this paper by replacing the
transition paths of the asynchronous circuits with predecessor states. In the asynchronous
case, a transition path defines the set of states that must have the same next state entry
and a pass implicant covers each transition path, resulting in a unique network of pass
transistors in the next state equations. In the synchronous case, predecessor states define
the states that have the same next state entry and a pass implicant covers the predecessor
states and is realized as a unique network of pass transistors in the next state equations.
This notion of using predecessor states is described next.
Predecessor states for state Si under an input, Ip, are states which have Si as a next
state entry. In Table 1, the predecessor state for C under I1 is A. Whenever the circuit is
in a predecessor state of Si, the next clock pulse will effect a transition to Si. Moreover,
the next state is present at the input of the flip-flops prior to the clock pulse. At the next
clock pulse, the flip-flops will assume the value of the code associated with Si.
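The definition of a predecessor state translates directly into a lookup over the flow table. The following minimal sketch transcribes Table 1 into a dictionary; the names `FLOW` and `predecessors` are illustrative and not taken from the paper.

```python
# Table 1 of the text: present state -> next state under I1, I2, I3.
FLOW = {
    "A": ("C", "B", "A"), "B": ("D", "C", "B"), "C": ("E", "D", "C"),
    "D": ("F", "E", "D"), "E": ("A", "F", "E"), "F": ("B", "A", "F"),
}


def predecessors(flow, state, k):
    """States that have `state` as their next state entry under input I_k (k = 1, 2, 3)."""
    return [s for s, row in flow.items() if row[k - 1] == state]
```

For example, `predecessors(FLOW, "C", 1)` returns the single state A, matching the example in the text.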




[Figure 1: Parallel architecture. Blocks: Destination State Codes, Input Switch Matrix, Next State Logic.]
Conceptually, the new architecture operates as follows: For each predecessor state
of Si, there exists a pass transistor network in the excitation network that presents the
next state value, Si, to the flip-flops. The pass transistor network consists of a single pass
implicant that covers each predecessor state such that when any state is entered, a unique
pass transistor path is enabled that passes the proper next state value to the flip-flops.
For example, in Table 1, there is a pass transistor path that presents the code for state C
to the flip-flops when I1 = 1 and the circuit is in state A.
All pass transistor implementations for the synchronous sequential circuits described
in this paper operate as described above, including those that use PLA pass transistor
networks to generate the next state equations. All next state equations are identical when
they are realized with general pass networks that completely decode all the internal states.
That is, if there are n state variables, then the pass network must decode all 2^n states.
The value for the next state entries for each predecessor state for Si is the code for Si and
the constants for this code are input to the pass network.
A parallel implementation for an architecture to realize sequential circuits is shown in
Figure 1, where the next state logic is the general pass network. The destination state
codes contain all possible destination states for the sequential circuit. Since every state of
the flow table appears as a next state entry, all the specified states are contained in the
destination code set. These codes are input to the input switch matrix, which passes
the destination codes for the given input Ip. The next state logic selects
the unique destination code as specified by the current state variables y.
The following illustrates specifically how this architecture works. In general, let the
circuit be in state Sk with input Ij. Let Sk be the predecessor for Si. All destination
state codes are normally input to the input switch matrix. Input Ij selects the codes for
the destination states under Ij to be presented to the next state logic. The feedback that
specifies that the present state is Sk enables one and only one path in the pass network
to select the correct destination code, Si, and presents this value to the flip-flops. This is
depicted in Figure 2.
In summary, the input state selects the set of potential next states that the circuit can
assume (selects the input column) and the present state variables select the exact next
state (row in the flow table) that the circuit will assume at the next clock pulse.
Figure 2: State machine operation.
y1 y2 y3   State   I1  I2  I3
0  0  0    A       C   B   A
0  0  1    B       D   C   B
0  1  0    C       E   D   C
0  1  1    D       F   E   D
1  0  0    E       A   F   E
1  0  1    F       B   A   F
Table 2: Example flow table with next state entries.
2.2 Design Procedure
The first step in the design process is to generate the state assignment. Any state assign-
ment can be used, as long as no two distinct states have the same code. Consider the state
assignment and next state entries shown in Table 2.
Let $p_{S_i}$ define a state partition that partitions state Si from all other states of a flow
table. If states $S_0, S_1, \ldots, S_j$ are the destination states for a given input, then state partitions
$p_{S_0}, p_{S_1}, \ldots, p_{S_j}$ must be covered in the design equations. For Table 2, the partitions are:
$$
\begin{aligned}
p_a &= A;\ BCDEF \\
p_b &= B;\ ACDEF \\
p_c &= C;\ ABDEF \\
p_d &= D;\ ABCEF \\
p_e &= E;\ ABCDF \\
p_f &= F;\ ABCDE
\end{aligned}
\tag{1}
$$
The general design equation for each next state variable consists of a term for each of
the p partitions and is as follows:
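$$
\begin{aligned}
Y ={} & p_a I_1() + p_b I_1() + p_c I_1() + p_d I_1() + p_e I_1() + p_f I_1() + {} \\
& p_a I_2() + p_b I_2() + p_c I_2() + p_d I_2() + p_e I_2() + p_f I_2() + {} \\
& p_a I_3() + p_b I_3() + p_c I_3() + p_d I_3() + p_e I_3() + p_f I_3()
\end{aligned}
\tag{2}
$$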
By rearranging the partitions, the general next state design equation can be written as
follows:
$$
Y_i = \sum_{k=1}^{m} C(p_{s_k}) \left[ \sum_{l=1}^{p} I_l(x_{s_k i}) \right] \tag{3}
$$
where $C(p_{s_k})$ is the product expression covering state $S_k$, m is the number of states, p is
the number of inputs, and $x_{s_k i}$ is the ith bit of the code for the destination state of $S_k$ for
Yi under input $I_l$. For this example,
$$
\begin{aligned}
Y ={} & p_a[I_1() + I_2() + I_3()] + p_b[I_1() + I_2() + I_3()] + {} \\
& p_c[I_1() + I_2() + I_3()] + p_d[I_1() + I_2() + I_3()] + {} \\
& p_e[I_1() + I_2() + I_3()] + p_f[I_1() + I_2() + I_3()]
\end{aligned}
\tag{4}
$$
The next step specifies the constants that are passed. To do this, enter pjIk(1) for the
destination state under pj where Yi = 1 and enter pjIk(0) for the destination state under
pj where Yi = 0. For example, the next state under the input I1 for present state A or
partition pa is C. C has the code y1 = 0, y2 = 1 and y3 = 0 and this information can be
entered into the design equations as shown:
$$
\begin{aligned}
Y_1 ={} & p_a[I_1(0) + I_2() + I_3()] + p_b[I_1() + I_2() + I_3()] + {} \\
& p_c[I_1() + I_2() + I_3()] + p_d[I_1() + I_2() + I_3()] + {} \\
& p_e[I_1() + I_2() + I_3()] + p_f[I_1() + I_2() + I_3()] \\
Y_2 ={} & p_a[I_1(1) + I_2() + I_3()] + p_b[I_1() + I_2() + I_3()] + {} \\
& p_c[I_1() + I_2() + I_3()] + p_d[I_1() + I_2() + I_3()] + {} \\
& p_e[I_1() + I_2() + I_3()] + p_f[I_1() + I_2() + I_3()] \\
Y_3 ={} & p_a[I_1(0) + I_2() + I_3()] + p_b[I_1() + I_2() + I_3()] + {} \\
& p_c[I_1() + I_2() + I_3()] + p_d[I_1() + I_2() + I_3()] + {} \\
& p_e[I_1() + I_2() + I_3()] + p_f[I_1() + I_2() + I_3()]
\end{aligned}
\tag{5}
$$
By inspection of Table 2, the destination state codes for the remaining p partitions are
incorporated into the design equations.
$$
\begin{aligned}
Y_1 ={} & p_a[I_1(0) + I_2(0) + I_3(0)] + p_b[I_1(0) + I_2(0) + I_3(0)] + {} \\
& p_c[I_1(1) + I_2(0) + I_3(0)] + p_d[I_1(1) + I_2(1) + I_3(0)] + {} \\
& p_e[I_1(0) + I_2(1) + I_3(1)] + p_f[I_1(0) + I_2(0) + I_3(1)] \\
Y_2 ={} & p_a[I_1(1) + I_2(0) + I_3(0)] + p_b[I_1(1) + I_2(1) + I_3(0)] + {} \\
& p_c[I_1(0) + I_2(1) + I_3(1)] + p_d[I_1(0) + I_2(0) + I_3(1)] + {} \\
& p_e[I_1(0) + I_2(0) + I_3(0)] + p_f[I_1(0) + I_2(0) + I_3(0)] \\
Y_3 ={} & p_a[I_1(0) + I_2(1) + I_3(0)] + p_b[I_1(1) + I_2(0) + I_3(1)] + {} \\
& p_c[I_1(0) + I_2(1) + I_3(0)] + p_d[I_1(1) + I_2(0) + I_3(1)] + {} \\
& p_e[I_1(0) + I_2(1) + I_3(0)] + p_f[I_1(1) + I_2(0) + I_3(1)]
\end{aligned}
\tag{6}
$$
The final step is to derive the covering equations for each p partition. Let $C(p_s)$ define
the product term that partitions state S from all other states. For example, $C(p_a)$ is
$\bar{y}_1\bar{y}_2\bar{y}_3$. With a unique code for each state, there is a unique product term for each state
partition.
$$
\begin{aligned}
C(p_a) &= \bar{y}_1\bar{y}_2\bar{y}_3 & C(p_d) &= \bar{y}_1 y_2 y_3 \\
C(p_b) &= \bar{y}_1\bar{y}_2 y_3 & C(p_e) &= y_1\bar{y}_2\bar{y}_3 \\
C(p_c) &= \bar{y}_1 y_2\bar{y}_3 & C(p_f) &= y_1\bar{y}_2 y_3
\end{aligned}
\tag{7}
$$
The circuit in Figure 3 shows the logic for implementing state variable y3 . The logic
of Figure 3 is replicated three times and the inputs are driven by the destination state
information as shown in Table 2. Except for constant input values driving the input
switch matrix, the design equations and pass transistor realizations for each state variable
are identical.
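The design procedure of this section can be sketched as follows. The sketch is an illustration under stated assumptions: the flow table entries are those of Table 1, the state codes are those of Table 2, exactly one input is active at a time, and the helper names (`constants`, `cover`, `next_state_bit`) are hypothetical. The constant passed for Yi in partition p_s under input Ik is taken to be bit i of the code of the destination state.

```python
# Table 1 entries: present state -> next state under I1, I2, I3.
FLOW = {
    "A": ("C", "B", "A"), "B": ("D", "C", "B"), "C": ("E", "D", "C"),
    "D": ("F", "E", "D"), "E": ("A", "F", "E"), "F": ("B", "A", "F"),
}
# State assignment of Table 2: (y1, y2, y3).
CODE = {"A": (0, 0, 0), "B": (0, 0, 1), "C": (0, 1, 0),
        "D": (0, 1, 1), "E": (1, 0, 0), "F": (1, 0, 1)}


def constants(i):
    """Constants passed for Y_i: one (I1, I2, I3) triple per partition p_a..p_f."""
    return {s: tuple(CODE[dest][i - 1] for dest in FLOW[s]) for s in FLOW}


def cover(state, y):
    """C(p_state): 1 only when the present-state bits y equal the code of `state`."""
    return int(tuple(y) == CODE[state])


def next_state_bit(i, y, k):
    """Y_i as an OR over the six partition terms, with input I_k active."""
    consts = constants(i)
    return int(any(cover(s, y) and consts[s][k - 1] for s in FLOW))


def next_state(y, k):
    """All three next state bits for present-state bits y under input I_k."""
    return tuple(next_state_bit(i, y, k) for i in (1, 2, 3))
```

Only `constants(i)` changes from one state variable to the next; `cover` and the OR structure are shared, which is the property that makes the hardware identical for every Yi.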
3 Serial Implementation
A serial machine consists of a single next state logic circuit to calculate each next state
variable. In order to utilize one hardware circuit to calculate all next state variables,
the hardware must be identical for each next state variable. It is shown that the design
presented in the previous section achieves the property of having identical hardware for
each state variable.
An architecture that can be used to realize a bit serial sequential circuit is depicted
in Figure 4. The circuit consists of a single next state variable calculator, a D flip-flop to
store each next state variable value, a shift register, a state latch and a storage means to
hold the destination codes.
Let the number of state variables be n, the number of states in the flow table be m
and the number of inputs be p. Since the state variables are calculated serially, a shift
register is used to store and generate the entire next state one bit at a time. The shift
register length must be n. The next state of the sequential circuit is a function of the



























Figure 3: Example next state equation for y3.
the next state. Since the present state of the machine is needed to generate each new bit
of the next state, the present state must be held in a register until the entire next state
is generated. A state latch is used to hold the present state value until the next state is
generated. When the next state is completely formed, this latch receives a new input from
the shift register.
The input state Ip selects one of the destination code set blocks which pass all the
destination states in Ip to the next state logic. The destination state codes reside in a
storage means such as a ROM, RAM or register stack. Let the bits of each destination
code set be denoted as Dji, where i denotes next state variable Yi and j denotes the state in
the flow table; j takes on values 1, 2, ..., m and i takes on values 1, 2, ..., n. The destination
code sets are organized such that the next state variables assume the columns and the next
state codes assume the rows. Below is the destination code set for input I1 for Table 1.
State yl y2 y3
A 0 1 0
B 0 1 1
C 1 0 0
D 1 0 1
E 0 0 0
F 0 0 1








Figure 4: Bit serial architecture
The operation of the serial circuit can be described as follows: Assume the initial state is
in the state latch. Next state variable Y1 is calculated first. The input state passes the
appropriate destination code set to the next state logic. The first column representing the
next state values for y1 is selected and next state variable Y1 is calculated and gated into
the D flip-flop. The second column in the destination code set, representing y2, is then
selected and input to the next state logic, and Y2 is calculated. Y2 is gated into the D flip-flop
at the same time as Y1 is gated into the shift register. The process continues n times to
calculate the next state. When the next state is generated, it is gated into the state latch
which denotes the new present state of the sequential circuit. At this time, a new input
state can be accepted for the circuit to begin the process again. The operation of the entire
process is controlled by a state controller which can be implemented with a simple ring
counter.
Consider the flow table of Table 1 to illustrate the operation of this circuit.
Let the initial state be B and the input be I1. The state latch would have initial value 001, the
code for B. To calculate Y1, 001100 from the destination code set for I1 is passed to the
next state logic. With present state B, the path selected by y1y2y3 = 001 passes 0 to the input of the flip-flop.
At the next clock pulse, 0 is gated into the flip-flop and value 110000 is presented to the
pass network from the destination code set to calculate Y2. With present state B, value 1
is passed to the input of the flip-flop. At the next clock pulse, 1 is gated into the flip-flop
and the previous flip-flop value, 0, is gated into the shift register. Value 010101 is passed to
the next state logic to calculate Y3, passing 1 to the flip-flop. At the next clock pulse, the 1 for Y2 is
gated into the shift register, which now holds value 01, and the flip-flop assumes value 1 for Y3.
At the next clock pulse, 1 from the flip-flop is gated into the shift register, which now holds
value 011, the next state of the machine. This value is gated to the state latch, signifying
that the next state is D. The process repeats with a new input state.
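The walk-through above can be replayed with a short behavioral sketch. It is not the authors' circuit: the names (`CODE_SET_I1`, `serial_next_state`) are hypothetical, and the D flip-flop stage is left out because it only delays each bit by one clock.

```python
STATES = "ABCDEF"
CODES = {"A": "000", "B": "001", "C": "010", "D": "011", "E": "100", "F": "101"}

# Destination code set for I1 (rows = present states, columns = y1 y2 y3).
CODE_SET_I1 = {"A": "010", "B": "011", "C": "100", "D": "101", "E": "000", "F": "001"}


def column(code_set, bit):
    """Column of the code set for state variable `bit` (0-based), ordered A..F."""
    return "".join(code_set[s][bit] for s in STATES)


def serial_next_state(present_code, code_set, n=3):
    """Generate the next state code one bit per step (flip-flop delay omitted)."""
    present = {code: s for s, code in CODES.items()}[present_code]
    row = STATES.index(present)      # the state latch holds this for all n steps
    shift_register = ""
    columns_used = []
    for bit in range(n):
        col = column(code_set, bit)  # column passed to the next state logic
        columns_used.append(col)
        shift_register += col[row]   # present state selects one entry
    return shift_register, columns_used


next_code, columns_used = serial_next_state("001", CODE_SET_I1)
print(columns_used)  # ['001100', '110000', '010101']
print(next_code)     # 011, the code for D
```

The three columns and the final value 011 match the values quoted in the example for present state B under I1.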
4 Adaptable Circuit
The architecture described above can implement any sequential circuit with a maximum
of m internal and p input states. Therefore, this architecture could be termed universal.
In addition to being universal, this architecture allows for adaptable operation. If the
destination code sets are changed, the circuit will assume a different set of destination
states. Simply changing these sets changes the operation of the circuit. In the above
example, if the destination code set for I1 had column 000100 for y1 instead of 001100, the
next state for C under I1 would be A (000) instead of E (100).
Therefore, one circuit configuration can be adapted to implement different flow tables.
Moreover, it is possible to modify the flow table dynamically, which opens the possibility
of making changes to provide some level of fault tolerance.
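The effect of editing a single column can be checked with a small sketch. The helper name `rows_from_columns` is an assumption made for illustration; the column values are those quoted in the text.

```python
STATES = "ABCDEF"
NAMES = {"000": "A", "001": "B", "010": "C", "011": "D", "100": "E", "101": "F"}


def rows_from_columns(columns):
    """Combine per-variable columns (y1, y2, y3) into one destination code per state."""
    return {s: "".join(col[k] for col in columns) for k, s in enumerate(STATES)}


original = rows_from_columns(["001100", "110000", "010101"])
modified = rows_from_columns(["000100", "110000", "010101"])  # only the y1 column changes

print(NAMES[original["C"]])  # E
print(NAMES[modified["C"]])  # A
changed = [s for s in STATES if original[s] != modified[s]]
print(changed)               # ['C']
```

Only the entry for state C changes; the selection logic itself is untouched.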
5 Conclusion
A bit serial synchronous sequential circuit has been presented that enables all state vari-
ables to be generated with single next state logic circuitry. Circuit operation can be
changed to implement different flow tables by simply changing input constants that are
fed to the next state equation circuitry.
References
[1] S. Whitaker and G. Maki, "Pass-Transistor Asynchronous Sequential Circuits," IEEE
Journal of Solid State Circuits, pp. 71-78, Feb. 1989.
[2] D. Radhakrishnan, S. Whitaker and G. Maki, "Formal Design Procedures for Pass-
Transistor Switching Circuits," IEEE Journal of Solid State Circuits, pp. 531-536,
Apr. 1985.
[3] G. Peterson and G. Maki, "Binary Tree Structured Logic Circuits: Design and Fault
Detection," Proceedings of IEEE International Conference on Computer Design: VLSI
in Computers, pp. 139-144, Oct. 1984.
Footnotes
This research was supported in part by NASA under the NASA Space Engineering Research
Center grant NAGW-1406 and by the Idaho State Board of Education under grant 88-038.
N94-71095
Sequence Invariant State Machines
S. Whitaker and S. Manjunath
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - A synthesis method and new VLSI architecture are introduced to
realize sequential circuits that have the ability to implement any state machine
having N states and m inputs, regardless of the actual sequence specified in the
flow table. A design method is proposed that utilizes BTS logic to implement
regular and dense circuits. A given state sequence can be programmed with
power supply connections or dynamically reallocated if stored in a register.
Arbitrary flow table sequences can be modified or programmed to dynamically
alter the function of the machine. This allows VLSI controllers to be designed
with the programmability of a general purpose processor but with the compact
size and performance of dedicated logic.
1 Introduction
Most digital systems include a controller. This is usually either a general machine such as
a microprocessor, or a dedicated, custom designed sequential state machine. The trade-off
between these two respective approaches is breadth of applicability and ease of reconfigu-
ration versus cost, performance and complexity. This work describes an architecture that
retains the traditional strengths of dedicated state machines, but offers the adaptability
of a microcontroller. This paper describes the concept and structure of this new architec-
ture which produces sequence invariant machines. The controllers on the full custom data
compression chips for NASA are being implemented using this new architecture [1].
The advantage of a hardware realization of a sequence invariant sequential machine
is that it can implement any flow table without a change in the logic equations. This
type of circuit is also known in the literature as a universal state machine [2]. The only
parameters needed to realize this circuit are the maximum number of input states m and the
maximum number of internal states 2^n. The hardware realization of a given circuit for m
and n can implement any circuit with a maximum of m input states and 2^n internal states.
This capability to transition through all possible circuit sequences requires a hardware
realization whose design equations are adaptable. A new design procedure
for asynchronous sequential circuits [3] has produced identical design equations and this
procedure is modified in this paper for synchronous machines.
A sequential circuit consists of a combinational logic excitation network, which
implements the state variable equations, and storage elements, which in this case are D flip-flops.
A sequential circuit is often defined in terms of a flow table, like the example shown in
Table 1, which is used in a later section. The input states are denoted Ip, internal states Si,
next states Nj, present state variables yi and next state variables Yi.
     I1  I2  I3
A    C   B   A
B    D   C   B
C    E   D   C
D    F   E   D
E    A   F   E
F    B   A   F
Table 1: Example flow table.
Pass transistor logic design is known for producing circuits that have high density and
speed [4]. When these circuits are realized as Binary Tree Structured (BTS) networks,
they also have unique properties which often produce minimal transistor circuits and
possess fault detecting properties [5]. Since these properties are attractive, BTS networks are
used in the realization of the design equations in this paper.
Section 2 of this paper introduces the combinational logic realized as BTS networks and
specifies a general BTS structure that is used in the design equations. Section 3 presents
the design procedure using the general BTS structure along with an example. The example
is then altered demonstrating the adaptive nature of this architecture.
2 Binary Tree Structured Logic
Pass transistor logic has the significant advantages over gate logic of high speed and density
[4,5]. A pass transistor network realized in a BTS form often requires fewer transistors and
displays attractive fault detection characteristics [5]. A unique form of a BTS network,
called a general BTS network, is used here to formulate the next state equations for the
sequential circuit.
In general, a BTS circuit is characterized by having a maximum of 2^n - 1 nodes, where
each node has exactly two branches. One branch is controlled by variable xi and the other
by its complement. The maximum number of transistors in a BTS network is 2^(n+1) - 2, at which point
only constants 0 and 1 are input to the network. A general BTS network contains the
maximum number of transistors and represents a complete decoding of an input space.
Figure 1 shows a general BTS network which implements all three-variable functions. In
this general network, any three-variable function can be realized by simply changing the
pass variable constants 1(0) at the input of the appropriate branch. A general BTS network
is a fully decoded binary tree and is used in the design which follows.
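A general BTS network can be modeled in software as a fully decoded binary tree with constants at the leaves. The sketch below is a behavioral illustration only; the function names are hypothetical and the majority function is just one arbitrary choice of leaf constants.

```python
from itertools import product


def bts_output(constants, xs):
    """Return the constant passed by a fully decoded binary tree.

    constants: 2**n leaf constants (0 or 1), indexed by the binary value of xs.
    xs: n control bits, x1 first. At each level one of the two branches conducts.
    """
    lo, hi = 0, len(constants)
    for x in xs:
        mid = (lo + hi) // 2
        if x:
            lo = mid   # branch controlled by the variable
        else:
            hi = mid   # branch controlled by its complement
    return constants[lo]


def bts_size(n):
    """Node and transistor counts for a general BTS network of n variables."""
    nodes = 2**n - 1
    transistors = 2 * nodes  # two branches per node, i.e. 2**(n + 1) - 2
    return nodes, transistors


# Program the leaves with the truth table of a three-variable majority function.
majority = [int(a + b + c >= 2) for a, b, c in product((0, 1), repeat=3)]
assert all(bts_output(majority, xs) == int(sum(xs) >= 2)
           for xs in product((0, 1), repeat=3))
print(bts_size(3))  # (7, 14)
```

Changing the function means changing only the list of leaf constants, which mirrors the role of the pass variable constants 1(0) in the network.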
Figure 1: General three variable BTS network.
3 State Machine Design
The logic that forms each next state equation Yi consists of the following elements: a
storage device (normally a flip-flop), next state excitation circuitry which generates the
next state values to the flip-flop, and input logic. Present state information is fed back
by the state variables yi to the excitation logic. The excitation logic is a combinational
logic function of the input and the state variable information. In general, the information
needed to generate all possible next state values that can be assumed by the circuit at the
next clock pulse is resident within the excitation circuitry. The specific next state value is
selected by the current input and state variables.
A sequence invariant sequential machine can implement any flow table without a change
in hardware. In this paper, a hardware circuit can realize any flow table which requires
at most m input states and 2^n internal states. In other words, only m and
n are needed to specify the hardware necessary to realize all possible state transitions in a
sequential circuit.
In order for the next state circuitry to implement sequence invariant operation, it must
assume a unique form to allow a circuit to implement an arbitrary state transition. First
the circuitry for each next state variable must be identical. Second, specific next state
information must not be hardwired into the logic that forms the next state equations.
Rather, specific next state values must be presented from an external source.
Whitaker has derived a unique realization for asynchronous sequential circuits where
all of the next state equations have identical circuitry [3]. This concept can be extended to
synchronous sequential circuits by replacing transition paths of the asynchronous circuits
with predecessor states. In the asynchronous case, a transition path defines the set of
states that must have the same next state entry, and a pass implicant covers each transition
path, resulting in a unique network of pass transistors in the next state equations. In the
synchronous case, predecessor states also define the states that have the same next state
entry; a pass implicant covers the predecessor states and is realized as a unique network
of pass transistors in the next state equations. This notion of using predecessor states is
described next.
The predecessor states for state Si under an input Ip are states which have Si as a next
state entry. In Table 1, the predecessor state for C under I1 is A. Whenever the circuit is
in a predecessor state of Si, the next clock pulse will effect a transition to Si. Moreover,
the next state entry is present at the input of the flip-flops prior to the clock pulse. At the
next clock pulse, the flip-flops will assume the value of the code associated with Si.
Conceptually, the new architecture operates as follows. For each predecessor state of
state Si, there exists a pass transistor network in the excitation network that presents the
next state value, Si, to the flip-flops. The pass transistor network consists of a single pass
implicant that covers each predecessor state such that when any state is entered, a unique
pass transistor path is enabled that passes the proper next state value to the flip-flops.
For example, in Table 1, there is a pass transistor path that presents the code for state C
to the flip-flops when I1 = 1 and the circuit is in state A.
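The notion of predecessor states can be made concrete with a few lines of code. This is an illustrative sketch; `FLOW_TABLE` transcribes Table 1 and the function name `predecessors` is hypothetical.

```python
# Table 1: next state under (I1, I2, I3) for each present state.
FLOW_TABLE = {
    "A": ("C", "B", "A"),
    "B": ("D", "C", "B"),
    "C": ("E", "D", "C"),
    "D": ("F", "E", "D"),
    "E": ("A", "F", "E"),
    "F": ("B", "A", "F"),
}


def predecessors(flow_table, state, input_index):
    """States whose next state entry under the given input (0-based column) is `state`."""
    return [s for s, row in flow_table.items() if row[input_index] == state]


print(predecessors(FLOW_TABLE, "C", 0))  # ['A']
print(predecessors(FLOW_TABLE, "C", 2))  # ['C']
```

Under I1 the only predecessor of C is A, as stated in the text; under I3 every state is its own predecessor.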
All pass transistor implementations for the synchronous sequential circuits described
in this paper operate as described above, including those that use PLA pass transistor
networks to generate the next state equations. The circuit becomes invariant when all
state equations are identical. All equations are identical when they are realized with
general BTS networks that completely decode all the internal states. That is, if there are
n state variables, then the BTS network must decode all 2^n states. The value for the next
state entries for each predecessor state for Si is the code for Si and the constants for this
code are input to the BTS network.
A specific implementation of an architecture to realize sequential circuits is shown in
Figure 2, where the next state decoder is a general BTS network. The destination state
codes represent all of the possible destination states for the sequential circuit. Since every
state of the flow table appears as a next state someplace in the flow table, all of the specified
states appear in the destination code set. These codes are input to the input switch matrix
which generates all of the destination codes for a given input Ip as specified by Ip. The
next state decoder selects the unique destination code as specified by the current state
variables y.
The following illustrates specifically how this architecture works. In general, let the
circuit be in state Sk with input Ij. Let Sk be the predecessor for Si. All destination state
codes are normally input to the input switch matrix. Input Ij selects the codes for the
destination states under Ij to be presented to the next state decoder. The feedback that
specifies that the present state is Sk enables one and only one path in the BTS network
to select the correct destination code, Si, for Sk under Ij and presents this value to the
flip-flops. This is depicted in Figure 3.

















[Figures 2 and 3: block diagrams. Legible labels: codes, Matrix, Decoder, Si, all destination states of Ij.]





















      I1   I2   I3
S0    N01  N02  N03
S1    N11  N12  N13
S2    N21  N22  N23
S3    N31  N32  N33
S4    N41  N42  N43
S5    N51  N52  N53
Table 2: General six-state three-input flow table.
Figure 4: General six-state three-input next state equation bit.
Let the general six-state, three-input flow table shown in Table 2 serve as an
example for a general m state machine. I1, I2 and I3 are the inputs, S0 ... S5 are the
present states of the sequential circuit and N(S0,I1), N(S0,I2), ... N(S5,I3) are the next states. This
can be generalized so that N(Si,Ij) is the next state for Si under input Ij. N(Si,Ij) has been
abbreviated as Nij. Let the state assignment be S0 = 000, S1 = 001, S2 = 010, ..., S5 =
101. The next state decoder is a general BTS circuit with paths that decode each state.
The input switch matrix is a pass transistor matrix that passes the destination state codes
to the next state pass network as shown in Figure 4.
The circuit realization of the next state pass network depicted in Figure 4 operates in
the following manner. All of the destination state codes Nij are presented to the input
switch matrix. For each input state Ij, all of the destination states under Ij are presented to
the next state decoder. The present state variables, y, select one and only one next state
entry. This single next state entry passed to the flip-flop is determined by the present state
of the circuit. If the machine is in state S1 and input I2 is asserted, then N12i, bit i of N12, would be
passed to the input of the flip-flop for next state variable Yi.
y1 y2 y3  State   I1   I2   I3
0  0  0   A       010  001  000
0  0  1   B       011  010  001
0  1  0   C       100  011  010
0  1  1   D       101  100  011
1  0  0   E       000  101  100
1  0  1   F       001  000  101
Table 3: Example flow table with next state entries.
In summary, the input state selects the set of potential next states that the circuit can
assume (selects the input column) and the present state variables select the exact next
state (row in the flow table) that the circuit will assume at the next clock pulse.
If the machine is to be adaptive, the circuitry must also be able to implement any
flow table and therefore must be independent of the sequences in a flow table. This
architecture can implement different flow tables by changing only the destination state
information driving the input switch matrix. Hence, the pass transistor hardware remains
constant. Changing constants can be implemented by programming the supply connections
to provide the 1's and 0's or through using enabling transistors that present data from a
register array. With a register array, the destination state information could be changed
by a host controller to allow dynamic reconfiguration.
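The separation between fixed selection logic and replaceable destination state information can be sketched in software. This is a behavioral model under assumptions, not the pass transistor circuit: the class name, its methods, and the second table used for reprogramming are all hypothetical.

```python
class SequenceInvariantMachine:
    """Fixed selection logic driven by a replaceable table of destination codes."""

    def __init__(self, destination_codes, n_bits, state=0):
        # destination_codes[j][k]: code of the next state for state k under input j
        self.destination_codes = destination_codes
        self.n_bits = n_bits
        self.state = state

    def next_bit(self, i, input_index):
        """Identical logic for every next state variable Yi (i = 0 is the MSB)."""
        selected = self.destination_codes[input_index]  # input switch matrix
        code = selected[self.state]                     # next state decoder
        return (code >> (self.n_bits - 1 - i)) & 1

    def clock(self, input_index):
        bits = [self.next_bit(i, input_index) for i in range(self.n_bits)]
        self.state = int("".join(map(str, bits)), 2)
        return self.state

    def reprogram(self, destination_codes):
        self.destination_codes = destination_codes


# Table 1 with A..F coded 0..5; rows are inputs I1, I2, I3.
TABLE_1 = [[2, 3, 4, 5, 0, 1], [1, 2, 3, 4, 5, 0], [0, 1, 2, 3, 4, 5]]
machine = SequenceInvariantMachine(TABLE_1, n_bits=3)
print([machine.clock(j) for j in (0, 0, 1)])  # [2, 4, 5], i.e. C, E, F

# Hypothetical replacement table: every state returns to A under every input.
machine.reprogram([[0] * 6 for _ in range(3)])
print(machine.clock(0))  # 0
```

`reprogram` plays the role of a host controller rewriting the register array; `next_bit` and `clock` never change.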
Design Procedure The first step in the design process is to generate a state assignment.
Any synchronous state assignment could be used, as long as no two distinct states have
the same assignment. For ease of illustration, consider the flow table shown in Table 1
with the state assignment and next state entries shown in Table 3.
Let ps define a state partition that partitions state S from all other states of the flow
table. If states S0, S1, ... Si are all the destination states for a given input, then state
partitions pS0, pS1, ... pSi must be covered in the design equations. For this flow table, the
partitions are
$$
\begin{aligned}
p_a &= A;\ BCDEF \\
p_b &= B;\ ACDEF \\
p_c &= C;\ ABDEF \\
p_d &= D;\ ABCEF \\
p_e &= E;\ ABCDF \\
p_f &= F;\ ABCDE.
\end{aligned} \tag{1}
$$
The general design equation for each of the next state variables consists of a term for
each p partition and is as follows:
$$
\begin{aligned}
Y_i ={} & p_a I_1(\,) + p_b I_1(\,) + p_c I_1(\,) + p_d I_1(\,) + p_e I_1(\,) + p_f I_1(\,) + {} \\
& p_a I_2(\,) + p_b I_2(\,) + p_c I_2(\,) + p_d I_2(\,) + p_e I_2(\,) + p_f I_2(\,) + {} \\
& p_a I_3(\,) + p_b I_3(\,) + p_c I_3(\,) + p_d I_3(\,) + p_e I_3(\,) + p_f I_3(\,)
\end{aligned} \tag{2}
$$
By rearranging the partitions, the general next state design equation can be written as
follows:
$$
Y_i = \sum_{k=1}^{m} C(p_{S_k}) \left[ \sum_{j=1}^{p} I_j(x_{kji}) \right] \tag{3}
$$
where C(pSk) is the product expression covering state Sk, m is the number of states, p
is the number of inputs and x_kji is the ith bit of the code for the destination state of Sk
under input Ij. For this example,
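$$
\begin{aligned}
Y_i ={} & p_a[I_1(\,) + I_2(\,) + I_3(\,)] + p_b[I_1(\,) + I_2(\,) + I_3(\,)] + {} \\
& p_c[I_1(\,) + I_2(\,) + I_3(\,)] + p_d[I_1(\,) + I_2(\,) + I_3(\,)] + {} \\
& p_e[I_1(\,) + I_2(\,) + I_3(\,)] + p_f[I_1(\,) + I_2(\,) + I_3(\,)]
\end{aligned} \tag{4}
$$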
The next step is to enter pj Ik(1) when the destination state for partition pj under Ik has Yi = 1, and
to enter pj Ik(0) when that destination state has Yi = 0. For example, the next
state under input I1 for present state A, or partition pa, is C. C has the code y1 = 0,
y2 = 1 and y3 = 0, and this destination state can be entered into the design equations as
shown:
$$
\begin{aligned}
Y_1 ={} & p_a[I_1(0) + I_2(\,) + I_3(\,)] + p_b[I_1(\,) + I_2(\,) + I_3(\,)] + {} \\
& p_c[I_1(\,) + I_2(\,) + I_3(\,)] + p_d[I_1(\,) + I_2(\,) + I_3(\,)] + {} \\
& p_e[I_1(\,) + I_2(\,) + I_3(\,)] + p_f[I_1(\,) + I_2(\,) + I_3(\,)] \\
Y_2 ={} & p_a[I_1(1) + I_2(\,) + I_3(\,)] + p_b[I_1(\,) + I_2(\,) + I_3(\,)] + {} \\
& p_c[I_1(\,) + I_2(\,) + I_3(\,)] + p_d[I_1(\,) + I_2(\,) + I_3(\,)] + {} \\
& p_e[I_1(\,) + I_2(\,) + I_3(\,)] + p_f[I_1(\,) + I_2(\,) + I_3(\,)] \\
Y_3 ={} & p_a[I_1(0) + I_2(\,) + I_3(\,)] + p_b[I_1(\,) + I_2(\,) + I_3(\,)] + {} \\
& p_c[I_1(\,) + I_2(\,) + I_3(\,)] + p_d[I_1(\,) + I_2(\,) + I_3(\,)] + {} \\
& p_e[I_1(\,) + I_2(\,) + I_3(\,)] + p_f[I_1(\,) + I_2(\,) + I_3(\,)]
\end{aligned} \tag{5}
$$
By inspection of the flow table, the destination state codes for the remaining p partitions
are incorporated in the design equations.
$$
\begin{aligned}
Y_1 ={} & p_a[I_1(0) + I_2(0) + I_3(0)] + p_b[I_1(0) + I_2(0) + I_3(0)] + {} \\
& p_c[I_1(1) + I_2(0) + I_3(0)] + p_d[I_1(1) + I_2(1) + I_3(0)] + {} \\
& p_e[I_1(0) + I_2(1) + I_3(1)] + p_f[I_1(0) + I_2(0) + I_3(1)] \\
Y_2 ={} & p_a[I_1(1) + I_2(0) + I_3(0)] + p_b[I_1(1) + I_2(1) + I_3(0)] + {} \\
& p_c[I_1(0) + I_2(1) + I_3(1)] + p_d[I_1(0) + I_2(0) + I_3(1)] + {} \\
& p_e[I_1(0) + I_2(0) + I_3(0)] + p_f[I_1(0) + I_2(0) + I_3(0)] \\
Y_3 ={} & p_a[I_1(0) + I_2(1) + I_3(0)] + p_b[I_1(1) + I_2(0) + I_3(1)] + {} \\
& p_c[I_1(0) + I_2(1) + I_3(0)] + p_d[I_1(1) + I_2(0) + I_3(1)] + {} \\
& p_e[I_1(0) + I_2(1) + I_3(0)] + p_f[I_1(1) + I_2(0) + I_3(1)]
\end{aligned} \tag{6}
$$
The final step is to find the covering equations for each p partition. Let C(ps) define the
product term that partitions state S from all other states. For example, C(pa) is $\bar{y}_1 \bar{y}_2 \bar{y}_3$.
With a unique code for each state, there is a unique product term for each state partition.
$$
\begin{aligned}
C(p_a) &= \bar{y}_1 \bar{y}_2 \bar{y}_3 & C(p_d) &= \bar{y}_1 y_2 y_3 \\
C(p_b) &= \bar{y}_1 \bar{y}_2 y_3 & C(p_e) &= y_1 \bar{y}_2 \bar{y}_3 \\
C(p_c) &= \bar{y}_1 y_2 \bar{y}_3 & C(p_f) &= y_1 \bar{y}_2 y_3
\end{aligned} \tag{7}
$$
The circuit in Figure 4 shows the logic for implementing each state variable yi. Each
C(ps) is a path through the BTS network forming the next state decoder. The logic
of Figure 4 is replicated three times and the inputs are driven by the destination state
information shown in Table 3. Figure 5 shows the programming of the input switch matrix
for state variable y3 . Except for the constant input values that are driving the input
switch matrix, the design equations and pass transistor realizations for each state variable
are identical.
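The constants that program the input switch matrix for each next state variable can be read off the flow table mechanically. The sketch below illustrates that bookkeeping; the function name `switch_matrix_constants` is hypothetical, and the flow table and state codes are transcribed from Table 1 and the state assignment used in the example.

```python
FLOW_TABLE = {
    "A": ("C", "B", "A"),
    "B": ("D", "C", "B"),
    "C": ("E", "D", "C"),
    "D": ("F", "E", "D"),
    "E": ("A", "F", "E"),
    "F": ("B", "A", "F"),
}
CODES = {"A": "000", "B": "001", "C": "010", "D": "011", "E": "100", "F": "101"}


def switch_matrix_constants(flow_table, codes, bit):
    """Constants (0/1) for next state variable `bit` (0-based): one triple
    (under I1, I2, I3) per present state partition."""
    return {s: tuple(int(codes[dest][bit]) for dest in row)
            for s, row in flow_table.items()}


y1 = switch_matrix_constants(FLOW_TABLE, CODES, 0)
print(y1["A"])  # (0, 0, 0)
print(y1["D"])  # (1, 1, 0)
```

Calling the same function with `bit` set to 1 or 2 gives the constants for the other two state variables; the logic that consumes them is the same in each case.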
A major result of this work is that it is no longer necessary to derive the pass logic
configurations for each next state equation. The next state information is only used as the
input pattern to the input switch matrix. The actual realization or hardware is therefore
independent of the next state information. This allows for an adaptive circuit realization for
the state machine. Since the next state information is stored in the input switch matrix,
only the programming of the destination state codes need be changed to implement a
different flow table.
To illustrate the ease of adapting a flow table to a different transition sequence, consider
the flow table in Table 4. This flow table differs from Table 1 in columns I2 and I3. The
new next state information is shown in Table 5 and the circuitry is shown in Figure 6.
Note that the next state circuitry is identical to that of Figure 5 and only the destination
state codes are changed at the input of the input switch matrix. Figure 6 shows next state











Figure 5: Example next state equation for Y3.
     I1  I2  I3
A    D   C   B
B    E   C   C
C    F   D   D
D    A   E   D
E    B   F   E
F    C   A   E
Table 4: Modified example flow table.
y1 y2 y3  State   I1   I2   I3
0  0  0   A       011  010  001
0  0  1   B       100  010  010
0  1  0   C       101  011  011
0  1  1   D       000  100  011
1  0  0   E       001  101  100
1  0  1   F       010  000  100
























Figure 6: Modified example next state equation for Y3.
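Adapting the machine to the modified flow table amounts to regenerating the destination codes while leaving the selection logic alone. The sketch below is illustrative; the function names are hypothetical, and `TABLE_4` is transcribed from Table 4 as printed here.

```python
CODES = {"A": "000", "B": "001", "C": "010", "D": "011", "E": "100", "F": "101"}

TABLE_1 = {"A": ("C", "B", "A"), "B": ("D", "C", "B"), "C": ("E", "D", "C"),
           "D": ("F", "E", "D"), "E": ("A", "F", "E"), "F": ("B", "A", "F")}
TABLE_4 = {"A": ("D", "C", "B"), "B": ("E", "C", "C"), "C": ("F", "D", "D"),
           "D": ("A", "E", "D"), "E": ("B", "F", "E"), "F": ("C", "A", "E")}


def destination_codes(flow_table):
    """Destination state codes (under I1, I2, I3) for each present state."""
    return {s: tuple(CODES[d] for d in row) for s, row in flow_table.items()}


def next_code(dest_codes, present_code, input_index):
    """Selection logic; identical for every flow table."""
    present = next(s for s, c in CODES.items() if c == present_code)
    return dest_codes[present][input_index]


print(next_code(destination_codes(TABLE_1), "000", 0))  # 010 (C)
print(next_code(destination_codes(TABLE_4), "000", 0))  # 011 (D)
print(destination_codes(TABLE_4)["F"])                  # ('010', '000', '100')
```

Only the data passed to `next_code` differs between the two calls, which is the software analogue of changing the constants at the input of the input switch matrix.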
4 Conclusion
This research introduces a new VLSI architecture which realizes sequential circuits. This
general architecture provides a hardware realization which is independent of any state
transition that may be specified in a flow table. The architecture promotes an easy and
straightforward way to design synchronous machines, and this design can be accomplished
by inspection. Traditional design steps, such as state table generation and derivation of
design equations from K-maps, are not needed.
It is shown that a given hardware implementation can realize any m row flow table by
simply changing a set of input constants. The design equations have identical hardware
resulting in easy VLSI replication. From a flow table, the destination state codes can
be programmed directly into the input switch matrix. This allows designers to produce
compact, programmable VLSI controllers. The state machine architecture is also unique
because it requires a minimal amount of extra hardware to introduce adaptability into the
state machine. If the next state information is stored in a register that drives the input
matrix, any sequence can be changed dynamically to alter the function of the circuit by




[1] J. Venbrux and N. Liu, "VLSI Architectures for Data Compression using the Rice
Algorithm," accepted for publication at NASA Space Engineering Research Center
VLSI Symposium, January 1990.
[2] S. H. Unger, Asynchronous Sequential Switching Circuits, Robert E. Krieger Publishing
Company, Inc., 1983.
[3] S. Whitaker and G. Maki, "Pass-Transistor Asynchronous Sequential Circuits," IEEE
Journal of Solid State Circuits, pp. 71-78, Feb. 1989.
[4] D. Radhakrishnan, S. Whitaker and G. Maki, "Formal Design Procedures for Pass-
Transistor Switching Circuits," IEEE Journal of Solid State Circuits, pp. 531-536,
Apr. 1985.
[5] G. Peterson and G. Maki, "Binary Tree Structured Logic Circuits: Design and Fault
Detection," Proceedings of IEEE International Conference on Computer Design: VLSI
in Computers, pp. 139-144, Oct. 1984.
[6] T. Martinez and J. Vidal, "Adaptive Parallel Logic Networks," Journal of Parallel
and Distributed Computing, pp. 26-58, Feb. 1988.
[7] I. Aleksander and E. Mamdani, "Adaptive Logic Elements in Pulse Control Systems,"
Proceedings of the Symposium on Pulse-Rate and Pulse-Number Signals in Automatic
Control, pp. 486-493, Apr. 1968.
Footnotes
This research was sponsored in part by NASA under the NASA Space Engineering Research
Center Grant NAGW-1406 and by the Idaho State Board of Education under research grant
88-038.
N94-71096
Pass Transistor Implementations
of Multivalued Logic
G. Maki and S. Whitaker
NASA Engineering Research Center
for VLSI System Design
University of Idaho
Moscow, Idaho 83843
Abstract - A simple, straightforward Karnaugh map logic design procedure for
realization of multiple-valued logic circuits is presented in this paper. Pass
transistor logic gates are used to realize multiple-valued networks. This work
is an extension of pass transistor implementations for binary-valued logic.
1 Introduction
Multiple-valued logic has been a research topic for the past two decades. Two basic moti-
vations for multivalued logic functions are to increase the amount of information conveyed
on each interconnect line and to decrease the amount of area required to build logical cir-
cuits implemented in VLSI technology [1]. Hurst in his technology status report concluded
that practical application of the research was dependent on circuit realizations and that
CMOS transmission gates should be exploited as a potential circuit realization [2].
Formal techniques for the realization of Boolean logic functions with pass transistors
have been introduced [3] and it will be shown that they can be applied to the design of
multiple-valued logic functions. In Section 2, the properties of CMOS pass gates which
allow the implementation of multiple-valued logic are presented. Formal logic design
techniques are introduced in Section 3 for the realization of pass transistor circuits passing
multiple logic levels. Section 4 extends the logic design theory to cover the passing of
multivalued variables as well as constants.
2 Properties of Pass Transistors
In the design of combinational pass transistor logic for Boolean functions, p-channel MOS-
FET's are normally used to pass logic 1's while n-channel MOSFET's are used to pass
logic 0's. Both the PMOS and NMOS transistors are used to form a transmission gate
when the input is a variable which could assume either Boolean value.
The MOSFET is a voltage controlled current source device [4]. When used as a pass
gate, however, the MOSFET can be described as a voltage follower for a limited input
voltage range. For a pass gate, the drain and source are input and output terminals and





Figure 1: NMOS pass transistor
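of an n-channel MOSFET when Vg = Vdd is given by
$$
V_o = \begin{cases} V_i & \text{for } V_i < V_{dd} - V_{tn} \\ V_{dd} - V_{tn} & \text{for } V_i > V_{dd} - V_{tn} \end{cases} \tag{1}
$$
where the threshold $V_{tn} = V_{tn0} + \gamma_n(\sqrt{V_{sb} + 2\phi_f} - \sqrt{2\phi_f})$ and Vdd is the most positive
voltage on the chip.
The equation for the PMOS FET output voltage when Vg = Vss is given by
$$
V_o = \begin{cases} V_i & \text{for } V_i > |V_{tp}| \\ |V_{tp}| & \text{for } V_i < |V_{tp}| \end{cases} \tag{2}
$$
where the threshold $V_{tp} = V_{tp0} - \gamma_p(\sqrt{|V_{sb}| + 2\phi_f} - \sqrt{2\phi_f})$ and Vss is the reference voltage
for the chip.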
Multivalued logic signals could therefore be passed by networks of PMOS and NMOS
transistors. From the DC operating equations given above, PMOS FET's can be used to
pass logic level Vi when Vi > |Vtp|, while NMOS FET's can be used to pass logic level
Vi when Vi < Vdd - Vtn. The operating voltage ranges for the inputs based on the DC
equations therefore overlap for |Vtp| < Vi < Vdd - Vtn.
The switching speed of FET's used as pass gates is proportional to Vgs - |Vt| [4]. Good
switching characteristics could be achieved if the input range were divided such that NMOS
FET's were used when Vi < Vdd/2 and PMOS FET's were used when Vi > Vdd/2. The
pass transistor control gates would still be driven by Boolean functions of the highest logic
value (1 = Vdd) and the lowest logic value (0 = Vss).
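The DC behavior described by equations (1) and (2), together with the suggested split of the input range at Vdd/2, can be sketched numerically. The sketch makes several assumptions that are not in the text: Vss = 0 V, Vdd = 5 V, constant thresholds of 1 V in magnitude (the body effect terms are ignored), and hypothetical function names.

```python
def nmos_pass(v_in, vdd=5.0, vtn=1.0):
    """Output of an enabled NMOS pass gate (gate at Vdd): follows the input
    up to Vdd - Vtn, then clamps. Threshold assumed constant."""
    return min(v_in, vdd - vtn)


def pmos_pass(v_in, vtp=-1.0):
    """Output of an enabled PMOS pass gate (gate at Vss = 0): follows the input
    down to |Vtp|, then clamps. Threshold assumed constant."""
    return max(v_in, abs(vtp))


def choose_device(v_in, vdd=5.0):
    """Device suggested by the Vdd/2 split of the input range."""
    if v_in < vdd / 2:
        return "NMOS"
    if v_in > vdd / 2:
        return "PMOS"
    return "either"  # the boundary is not specified; both devices pass it


print(nmos_pass(5.0))      # 4.0
print(pmos_pass(0.0))      # 1.0
print(nmos_pass(2.5))      # 2.5
print(choose_device(0.0))  # NMOS
```

With these illustrative numbers, both devices pass an input unchanged anywhere between 1 V and 4 V, which is the overlap range described above.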
A general multivalued pass network can be depicted as shown in Figure 2. Each Pi is
a series of NMOS transistors for Vi < Vdd/2 or a series of PMOS transistors for Vi > Vdd/2
such that when each transistor is enabled by a Boolean logic 1 (NMOS) or 0 (PMOS)
the voltage on the input, Vi, passes to the output, Vo. Pi can be represented as a product of
literals where each literal represents the Boolean variable driving the gate of a series pass
transistor. The output voltage Vo can be expressed as a logic (voltage) level being passed
to the output,
$$
V_o = P_1(V_1) + P_2(V_2) + \cdots + P_n(V_n) \tag{3}
$$
Each transistor in Pi can be schematically represented as shown in Figure 3.
The pass transistor network is to be distinguished from the multiple-valued T-gate [5].
The general T-gate has r + 1 inputs with one of the inputs being a r-valued control input
whose value determines which of the other r (r-valued) inputs is selected for output. The
pass transistor itself has one r-valued input and a binary control input that switches the
r-valued input to the output of the transistor. A pass transistor network has n r-valued
inputs that are switched by m binary variables.
Figure 2: General pass network.
Figure 3: Pass transistor logical representation.
3 Logic Design
3.1 Formal Statement of Problem
The inputs to a general logic circuit are the multivalue logic voltage levels, constants in
the set [v1, v2, ... vn]. These logic levels will be presented to the output by means of a
switching network consisting of pass transistors. Each pass transistor is controlled by
binary variables x1, x2, ... xm.
Example 1 Consider the circuit illustrated in Figure 4.
The inputs are multivalue logic levels [0,2,1] where 0 = Vss, 1 = Vdd and 2 = Vdd/2. The
control variables are x1, x2 and x3. A few examples help illustrate the circuit operation.
If x1 and x3 are both low then the output is logic level 0; if x1x2x3 = 110 then the output
is logic level 2. The Karnaugh map for the above circuit is shown in Figure 5. Every
combination of inputs produces one and only one output value.
3.2 Definitions
The following definitions extend the formalism of pass transistor logic already presented





















      x1x2
x3   00  01  11  10
0     0   0   2   0
1     1   1   1   1







Figure 5: Karnaugh map representation.
Figure 6: General network.
Definition 1 A multivalue pass implicant (MPI), Pi(Vi), consists of a product expression
Pi and a pass variable Vi.
When the literals of Pi evaluate to 1, the multivalue logic level Vi is presented to the
output of the pass transistor network realizing Pi(Vi); otherwise the output is in the high
impedance state.
For Example 1, the multivalue pass implicants are $\bar{x}_1 \bar{x}_3(0)$, $\bar{x}_2 \bar{x}_3(0)$, $x_3(1)$, and
$x_1 x_2 \bar{x}_3(2)$. It is clear that a pass implicant can pass only one multivalue logic level. The
general model of a network realization is shown in Figure 6. Each Ri represents a set of
MPI's that pass logic level Vi. In Example 1, $R_0 = \bar{x}_1 \bar{x}_3(0) + \bar{x}_2 \bar{x}_3(0)$, $R_1 = x_3(1)$ and
$R_2 = x_1 x_2 \bar{x}_3(2)$.
Definition 2 A multivalue MSP expression is a sum of multivalue pass implicants that
realize a given function.
The multivalue MSP expression for the above circuit is
$$
f = \bar{x}_1 \bar{x}_3(0) + \bar{x}_2 \bar{x}_3(0) + x_3(1) + x_1 x_2 \bar{x}_3(2) \tag{4}
$$
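A multivalue MSP expression can be evaluated in software by treating each MPI as a set of required literal values plus the level it passes. The sketch below does this for Example 1. The literal polarities are inferred from the circuit description (x1 and x3 low gives level 0; x1x2x3 = 110 gives level 2) and should be read as an assumption, as should the data layout and the name `passed_level`.

```python
from itertools import product

# Each MPI: ({variable index: required value}, level passed when all literals hold).
MPIS = [
    ({1: 0, 3: 0}, 0),
    ({2: 0, 3: 0}, 0),
    ({3: 1}, 1),
    ({1: 1, 2: 1, 3: 0}, 2),
]


def passed_level(mpis, x):
    """Level presented at the output, or None if no implicant conducts
    (high impedance). Raises if two implicants pass different levels."""
    levels = {level for literals, level in mpis
              if all(x[v] == val for v, val in literals.items())}
    if len(levels) > 1:
        raise ValueError("conflicting levels passed to the output")
    return levels.pop() if levels else None


print(passed_level(MPIS, {1: 1, 2: 1, 3: 0}))  # 2
print(passed_level(MPIS, {1: 0, 2: 1, 3: 0}))  # 0

# Every input combination produces one and only one output value.
assert all(passed_level(MPIS, dict(zip((1, 2, 3), bits))) is not None
           for bits in product((0, 1), repeat=3))
```

Two implicants may conduct at once (for example at x1x2x3 = 000), which is harmless as long as they pass the same level; the check in `passed_level` guards against the conflicting case.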
3.3 Design Algorithm
Procedure 1 The logic design rules for implementing a multivalue pass network can be
incorporated into a design procedure.
1. Place the multivalue logic states on a Karnaugh map.
2. Identify each set of multivalue entries that meet the conditions of a prime implicant;
these are the pass implicants.
3. Find an optimal covering of the circuit and form the multivalue MSP.
        x1x2
x3x4   00  01  11  10
00      1   1   1   2
01      1   3   3   2
11      2   3   3   0
10      2   1   1   0
Figure 7: Design Example
Pass Variable   MPI
0               $x_1 \bar{x}_2 x_3(0)$
1               $\bar{x}_1 \bar{x}_2 \bar{x}_3(1) + x_2 \bar{x}_4(1)$
2               $x_1 \bar{x}_2 \bar{x}_3(2) + \bar{x}_1 \bar{x}_2 x_3(2)$
3               $x_2 x_4(3)$
Table 1: Design example MPI.
Example 2 Consider the example with the Karnaugh map in Figure 7.
The MPI's for this circuit are shown in Table 1. The function then is
$$f = x_1\bar{x}_2x_3(0) + \bar{x}_1\bar{x}_2\bar{x}_3(1) + x_2\bar{x}_4(1) + x_1\bar{x}_2\bar{x}_3(2) + \bar{x}_1\bar{x}_2x_3(2) + x_2x_4(3) \tag{5}$$
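The covering step of Procedure 1 can be checked mechanically. The sketch below is an illustration under stated assumptions: the Karnaugh map of Figure 7 is taken to have columns x1x2 and rows x3x4, both in Gray order 00, 01, 11, 10, and the complemented literals of the MPI's are those that make the cover consistent with the map. The names `KMAP`, `MPIS` and `check_cover` are hypothetical.

```python
from itertools import product

GRAY = ("00", "01", "11", "10")

# Entries of Figure 7. Assumed layout: columns are x1x2, rows are x3x4 (Gray order).
KMAP = [
    [1, 1, 1, 2],
    [1, 3, 3, 2],
    [2, 3, 3, 0],
    [2, 1, 1, 0],
]

# MPI's of Table 1 as (required control values, passed level).
# Which literals are complemented is an assumption chosen to match the map.
MPIS = [
    ({"x1": 1, "x2": 0, "x3": 1}, 0),
    ({"x1": 0, "x2": 0, "x3": 0}, 1),
    ({"x2": 1, "x4": 0}, 1),
    ({"x1": 1, "x2": 0, "x3": 0}, 2),
    ({"x1": 0, "x2": 0, "x3": 1}, 2),
    ({"x2": 1, "x4": 1}, 3),
]


def passed_levels(inputs):
    """List the level of every MPI whose product term evaluates to 1."""
    return [
        level
        for literals, level in MPIS
        if all(inputs[name] == value for name, value in literals.items())
    ]


def check_cover():
    """Return the map cells that are not covered by exactly one correct MPI."""
    mismatches = []
    for row, col in product(range(4), repeat=2):
        x1, x2 = (int(bit) for bit in GRAY[col])
        x3, x4 = (int(bit) for bit in GRAY[row])
        levels = passed_levels({"x1": x1, "x2": x2, "x3": x3, "x4": x4})
        if levels != [KMAP[row][col]]:
            mismatches.append(((row, col), levels))
    return mismatches
```

An empty list from `check_cover()` means that each of the 16 cells is covered by exactly one pass implicant and that the implicant passes the level written in the map.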
4 Multivalue Switching Functions
A generalization of the problem addressed above is to allow the multivalue logic inputs to
the pass network to be variables instead of constants. The formal definition of the problem
becomes: the control variables are binary signals $[x_1, x_2, \ldots, x_n]$, $x_i \in [0, 1]$, and the
pass variables are multivalue variables $[F_1, F_2, \ldots, F_m]$, $F_i \in [V_0, V_1, \ldots, V_k]$, where
$V_j$ is a multivalue logic level.
The only change in the design process presented in the previous section is to replace $V_i$
with $F_i$ in the algorithm and to implement the pass network with transmission gates rather
than single transistors. Since pass transistor networks using transmission gate structures,
that is, both PMOS and NMOS pass transistor arrays, can pass multivalue logic levels,
these signals can be variables and not just constants. Definition 1 is modified below while
Definition 2 remains unchanged.
Definition 3 A multivalue pass implicant (MPI), $P_i(F_i)$, consists of a product expression
$P_i$ and a pass variable $F_i$.
When the literals of $P_i$ evaluate to 1, the multivalue logic variable $F_i$ is presented to the
output of the pass transistor network realizing $P_i(F_i)$; otherwise the output is in the high
impedance state.
Figure 9: Multivalue switching network.
Example 3 Consider the example shown in Figure 8 to illustrate Definition 3.
The inputs are multivalue logic variables $[F_0, F_1, F_2]$ and the control variables are x1, x2
and x3. Identifying the MPI's is accomplished in an identical manner as before. Here the
MPI's are $\bar{x}_1(F_0)$, $x_1\bar{x}_2(F_1)$ and $x_1x_2(F_2)$.
The pass logic expression is
$$f = \bar{x}_1(F_0) + x_1\bar{x}_2(F_1) + x_1x_2(F_2) \tag{6}$$
The pass logic circuit shown in Figure 9 implements the function with both an NMOS
array and a PMOS array.
References
[1] J. Muzio and I. Rosenberg, "Introduction - Multiple-Valued Logic", IEEE Transactions
on Computers, Vol. C-35, No. 2, Feb., 1986, pp. 97-98
[2] S. Hurst, "Multiple-Valued Logic – Its Status and Its Future", IEEE Transactions on
Computers, Vol. C-33, No. 12, Dec., 1984, pp. 1160-1179
[3] D. Radhakrishnan, S. Whitaker and G. Maki, "Formal Design Procedures for Pass
Transistor Switching Circuits", IEEE JSSC, Vol. SC-20, April, 1985, pp. 531-536
[4] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Reading, Mass.,
Addison-Wesley, 1985
[5] K. C. Smith, "Multiple-Valued Logic: A Tutorial and Appreciation," IEEE Computer,
April, 1988, pp. 17-27
This work was supported in part by NASA under Contract NAGW-1406.
	 N94-71097
Statistical Circuit Design For Yield Improvement In CMOS Circuits
H.J. Kamath, J.E. Purviance, and S.R. Whitaker
Department of Electrical Engineering
University of Idaho
Moscow, Idaho 83843
Abstract- This paper addresses the statistical design of CMOS integrated cir-
cuits for improved parametric yield. The work uses the Monte Carlo technique
of circuit simulation to obtain an unbiased estimation of the yield. A simple
graphical analysis tool, the yield factor histogram, is presented. The yield
factor histograms are generated by a new computer program called SPICENTER.
Using the yield factor histograms, the most sensitive circuit parameters
are noted, and their nominal values changed to improve the yield. Two ba-
sic CMOS example circuits, one analog and one digital, are chosen and their
designs are 'centered' to illustrate the use of the yield factor histograms for
statistical circuit design.
1 Introduction
Manufacturability is a key word in industry today. Due to inherent fluctuations in the thermal,
chemical, and optical processes used in the fabrication of any integrated circuit (IC),
the yield (which is defined as the ratio of the number of IC's that perform correctly to the
total number of IC's manufactured) is usually less than 100 percent. As the very large
scale integrated (VLSI) circuits become more complex, and the dimensions of VLSI devices
decrease, the circuit performance becomes more sensitive to fluctuations in the manufac-
turing process. Since the profitability of a manufacturing process is directly related to
yield, the need for computer aided methods for yield improvement has risen sharply in
recent years.
Generally a CMOS designer takes the manufacturing variation into account by doing
a worst-case design. A worst-case design assures that the circuit will meet specifications
when all the process parameters are simultaneously at their worst possible values. Three
problems with worst-case design are apparent:
1. Choosing the worst-case design parameters can be difficult.
2. There is a significant performance tradeoff when a circuit is required to meet speci-
fications at the worst-case parameter values.
3. It is usually very improbable that a circuit will simultaneously encounter all the
worst-case parameters during manufacture.
The price that is paid for a worst-case design is a significant performance degradation.
For instance, it may be possible to double the clock speed specification of a certain digital
IC if the goal for the circuit is an 80% yield, rather than the 100% yield imposed by
a worst-case design. It is the premise of this paper that the worst-case design is very
conservative and leads to circuits which are underspecified. The solution to this problem
is to use statistical design methods.
The intent of this paper is to apply the statistical design centering techniques developed
for microwave circuits in [15,17] to the design of CMOS integrated circuits. Two example
circuits are chosen and design centering is applied to these circuits using the new tech-
niques. A CMOS operational amplifier (op amp) and a chain of five CMOS inverters were
chosen as example circuits since they are the basic building blocks of digital and analog
VLSI circuits.
2 Design Centering And Yield
During the manufacture of a circuit, the actual component values used in the circuit
are not the nominal values determined by the designer, but are values that are near the
nominal values. The range of values encountered during manufacture is given by the
component tolerance, usually expressed as a percentage of the nominal values. A circuit
design that is manufacturable (i.e., it achieves a high yield) must not only perform well
at the nominal value of the components but also perform well over the entire tolerance
range of the component values. Therefore, if the design goal is high yield it is important to
include in the design process the circuit performance over the entire range of component
values encountered during manufacture.
The design criterion we wish to maximize in this work is manufacturing yield — the
fraction of circuits which meets the performance criteria when the circuit is manufactured.
The component value statistics are assumed known. For our example circuits, the perfor-
mance criteria are rise time (which is a measure of bandwidth) for the operational amplifier
and delay time (which is the measure of its speed) for the inverter chain. Circuit yield in
general cannot be calculated exactly. Therefore, we must be satisfied with an estimate of
the yield and a knowledge of its statistics.
Consider a circuit performance function, G(X), which depends on a set of parameter
values $X = (x_1, x_2, \ldots, x_n)$. A strength of this study is that no assumptions are needed as





$\Phi$ is the region of G in which the circuit meets all its performance specifications.
Therefore,
$$G(X) \in \Phi \implies X \in R \tag{2}$$
where R is the region of acceptable parameter values in the parameter space.
During manufacture, the values of parameters are not necessarily independent but are
varying statistically with a joint probability density function, p(X). Thus, the manufac-
turing yield, Y, can be described by:
$$Y = \int_R p(X)\, dX \tag{3}$$
Another useful formulation results from the acceptance function accept(X), such that:
$$\mathrm{accept}(X) = \begin{cases} 1, & \text{if } G(X) \in \Phi \text{: circuit accepted, } X \in R \\ 0, & \text{if } G(X) \notin \Phi \text{: circuit rejected, } X \notin R \end{cases}$$
Now Y can be expressed as an expectation with respect to accept(X):
$$Y = E[\mathrm{accept}(X)] = \int_{-\infty}^{\infty} \mathrm{accept}(X)\, p(X)\, dX$$
where $0 \le E[\mathrm{accept}(X)] \le 1$.
Although this is an exact expression for Y, it cannot be evaluated in general because
an exact expression for accept(X) is not known.
3 Monte Carlo Technique for Yield Estimation
A method commonly used for evaluating integrals in higher dimensions is the Monte Carlo
technique. The Monte Carlo technique approximates Y by:
$$Y \approx \frac{1}{N} \sum_{i=1}^{N} \mathrm{accept}(X_i) = \frac{N_{pass}}{N}$$
where $N_{pass}$ is the number of circuits which pass the specification test, N is the total
number of circuits which are simulated, and the $X_i$ are chosen according to p(X). If we
do not know the analytical expression for the acceptability region, R, then a Monte Carlo
evaluation of the yield for a particular nominal design (i.e., X at its nominal value) requires
N circuit analyses, one for each trial set of parameters $X_i$. Typically, hundreds of trials are
required to obtain a reasonable estimate for Y.
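The estimate can be sketched in a few lines of Python. This is a minimal illustration and not the SPICENTER implementation: the circuit simulation is replaced by a hypothetical closed-form rise time of an RC stage, parameters are drawn uniformly and independently within their tolerance bands, and all names and numerical values (`estimate_yield`, `rise_time`, `NOMINAL`, `TOLERANCE`, `SPEC`) are assumptions made for the example.

```python
import random


def estimate_yield(accept, nominal, tolerance, n_trials=1000, seed=0):
    """Monte Carlo estimate of the yield, N_pass / N, for one nominal design."""
    rng = random.Random(seed)
    n_pass = 0
    for _ in range(n_trials):
        # Draw each parameter uniformly and independently within its tolerance band.
        sample = {
            name: value * (1.0 + rng.uniform(-tolerance[name], tolerance[name]))
            for name, value in nominal.items()
        }
        if accept(sample):
            n_pass += 1
    return n_pass / n_trials


def rise_time(sample):
    """Hypothetical stand-in for a circuit simulation: 10%-90% rise time of an RC stage."""
    return 2.2 * sample["R"] * sample["C"]


NOMINAL = {"R": 1.0e3, "C": 1.0e-9}   # ohms, farads (illustrative values)
TOLERANCE = {"R": 0.10, "C": 0.10}    # +/-10% of nominal (illustrative)
SPEC = 2.3e-6                         # rise time limit in seconds (illustrative)

yield_estimate = estimate_yield(lambda s: rise_time(s) < SPEC, NOMINAL, TOLERANCE)
```

In practice the call to `rise_time` would be replaced by one circuit analysis per trial, which is why the cost of the estimate grows linearly with N.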
The primary use of Monte Carlo analysis is to determine the yield, given component
values and tolerances. Standard software packages include routines which accomplish this.
However, in this work we are attempting to reverse the process and determine the compo-
nent values and tolerances for a specified yield. This problem is not as simple and is not
accomplished for CMOS circuits by any commercially available software packages. The
yield factor histograms developed in the next section are excellent tools which will help
the designer solve this problem.
4 Yield Factor Histograms
We wish to evaluate the yield as a function of each of the parameter values, $(x_1, x_2, \ldots, x_n)$.
This is done by developing a yield factor which is given by:
$$Y(x_i) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \mathrm{accept}(X)\, p(X)\, dx_1\, dx_2 \cdots dx_{i-1}\, dx_{i+1} \cdots dx_n$$
This factor essentially averages out the effects of all but the ith component, and then a
parametric study can be made to determine the effects of the ith component on this factor,
and on the yield. To better calculate the yield factor, an unbiased estimator of $Y(x_i)$ is:
$$\hat{Y}(x_i) = \frac{1}{M} \sum_{j=1}^{M} \mathrm{accept}(X_j)$$
where the ith component of X is fixed at $x_i$ and the others are allowed to vary by their
known statistics.
The implementation is further simplified by dividing the acceptable values for $x_i$ into
nine equal regions, with the nominal value of the component lying at the center of region
five. An approximation to $Y(x_i)$ is developed by performing a Monte Carlo analysis where
all parameters are allowed to vary by their known statistics and the sums are evaluated
separately depending on the interval in which the value of $x_i$ lies. Since the estimate
variance goes down as 1/M, we have found a value of M = 100 is satisfactory. As 9
regions are used and approximately 100 evaluations of the circuit per region are adequate,
at least 900 Monte Carlo runs (rounded off to 1000 iterations) are needed to adequately
describe the estimator of yield (Y).
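The binning described above can be sketched as follows. This is an illustrative reading of the method, not the SPICENTER code: all parameters vary together, and each trial is credited to the one of nine equal regions in which the chosen parameter falls. Uniform, independent parameter statistics are assumed, and the function and argument names are hypothetical.

```python
import random


def yield_factor_histogram(accept, nominal, tolerance, param,
                           n_trials=1000, n_bins=9, seed=0):
    """Estimate the yield factor of `param` in each of n_bins equal regions."""
    rng = random.Random(seed)
    low = nominal[param] * (1.0 - tolerance[param])
    width = 2.0 * nominal[param] * tolerance[param] / n_bins
    passed = [0] * n_bins
    total = [0] * n_bins
    for _ in range(n_trials):
        # All parameters vary by their (assumed uniform) statistics.
        sample = {
            name: value * (1.0 + rng.uniform(-tolerance[name], tolerance[name]))
            for name, value in nominal.items()
        }
        # Credit the trial to the region in which the chosen parameter lies.
        index = int((sample[param] - low) / width)
        index = max(0, min(index, n_bins - 1))
        total[index] += 1
        passed[index] += int(accept(sample))
    # A region with no samples has no estimate.
    return [p / t if t else None for p, t in zip(passed, total)]


# Hypothetical example: the circuit passes only when R is below its nominal value.
histogram = yield_factor_histogram(
    accept=lambda s: s["R"] < 100.0,
    nominal={"R": 100.0, "C": 1.0},
    tolerance={"R": 0.10, "C": 0.10},
    param="R",
)
```

With 1000 trials each region receives roughly 110 samples, which is consistent with the M = 100 evaluations per region quoted above. In this toy case the histogram is high in the low regions and low in the high regions, which would suggest lowering the nominal value of R.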
The graphical display of the yield factor (histogram) provides the important information
of the circuit's yield with respect to a given parameter. The program SPICENTER
computes and plots these histograms for each of the nine regions of a given circuit
parameter. An interpretation of the histograms follows:
1) Design centering step: If the histogram is not symmetric about the selected point, the
operating point should be moved to make the curve symmetrical.
2) Tolerancing step: If the histogram shows low yield values at the parameter limits, the
acceptable parameter limits can be narrowed within permissible costs.
These two steps are iterative. After the tolerances and nominal values are adjusted, the
circuit is reanalyzed and the histograms are again plotted. Further adjustments are made
if necessary. We have noticed that there is interaction between the two steps, because
changing the nominal value can change the slope of the yield factor and vice versa. These
steps and a few example histograms are shown in Figures 1 and 2 respectively.
Studying histograms will quickly tell the designer where the nominal value for each
component might be placed for a better yield. Figure 2 shows three typical histograms.








Figure 1: A Typical Approach to Statistical Circuit Design. Flowchart steps: determine circuit
nominal values by standard optimization techniques; calculate and plot the yield; adjust
center values to make the yield curves symmetric and decrease tolerances to flatten the
yield curves.
Figure 2: Typical Yield VS Component Value Histograms. Horizontal axis: bins (value range).
Histogram (a) indicates that the nominal value and tolerance are correct for that parameter.
Histogram (b) indicates that a higher nominal value should be tried for the displayed
parameter because the yield is higher for higher values of the parameter. Histogram (c)
suggests that the designer should pick a slightly lower nominal value and tighten the
tolerance on that parameter provided the cost of doing so is within reasonable limits.
In our work on the design centering problem, we ran through histograms for 12 parameters
for the CMOS op amp circuit and for 13 parameters for the CMOS inverter chain
circuit and made the required changes as demonstrated earlier with Figures 2(b) and (c).
We then reran SPICENTER and obtained a new set of histograms. This process was
repeated until the last set of simulations achieved the desired yield.
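The interpretation rules for the histograms can be written as a small decision helper. The sketch below is an assumption-laden illustration, not part of SPICENTER: the function name `suggest_adjustment` and the threshold `flat_tol` are hypothetical, and the designer's judgment about cost is not modeled. It compares the mean yield factor below and above the center region, and then compares the center with the edges.

```python
def suggest_adjustment(histogram, flat_tol=0.05):
    """Suggest a centering or tolerancing action from a yield factor histogram.

    `histogram` holds one yield factor per region, ordered from the lowest to
    the highest parameter values, with the nominal value in the middle region.
    `flat_tol` is an illustrative threshold, not a value from the paper.
    """
    n = len(histogram)
    mid = n // 2
    low_side = sum(histogram[:mid]) / mid
    high_side = sum(histogram[mid + 1:]) / (n - mid - 1)

    # Design centering step: an asymmetric histogram calls for a shifted nominal.
    if high_side - low_side > flat_tol:
        return "raise nominal"
    if low_side - high_side > flat_tol:
        return "lower nominal"

    # Tolerancing step: low yield at the parameter limits calls for tighter limits.
    edges = (histogram[0] + histogram[-1]) / 2.0
    if histogram[mid] - edges > flat_tol:
        return "tighten tolerance"
    return "keep"
```

A histogram that rises toward higher parameter values returns "raise nominal", matching the reading of histogram (b). The helper reports only one action per call, so a case like histogram (c), which calls for both a lower nominal value and a tighter tolerance, would be resolved over successive iterations.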
5 Example Circuits
Two example CMOS circuits are chosen to illustrate the use of the Yield Factor Histograms
for improving the parametric yield of a circuit. The first example circuit is a CMOS
operational amplifier. This is an analog circuit used in many standard analog cells. In
both examples, the parameter statistics were assumed to be uniform and independent.
5.1 Operational Amplifier
Figure 3 is a two stage buffered CMOS op amp which is the first of the example circuits
chosen for this paper.
5.1.1 Design Centering for Rise Time
The performance criterion chosen for design centering this op amp was the rise time, which
is defined as the time taken by the output signal to rise from 10% to 90% of its final steady
state value in response to a step voltage applied at the input. For the op amp the rise time
is specified to be below 1.75 usec. It is important to note that our efforts in optimizing
the design for a better rise time did not affect the other specifications of the circuit. The
step input is applied at the noninverting input terminal of the op amp. SPICENTER
uses the transient analysis output (voltage versus time) supplied by SPICE to evaluate the
circuit performance.
5.1.2 Yield Factor Histograms
The Figures CMOS OP AMP-SPICENTER(1) through (4) are samples of yield factor
histograms generated by SPICENTER after iterations (1) through (4) of 1000 simulations
of the circuit. These yield factor histograms depict the sensitivity of the circuit's yield
to a few critical components and parameters. The histograms of the other parameters
to which the yield is practically insensitive are not included here. The four iterations of


































Figure 3: A Low-Power Buffered CMOS Op Amp Example Circuit
The first iteration of SPICENTER at the nominal values of the components and parameters
resulted in an average yield of 21.9%. This yield and other information are printed
on the right side of the histograms. It was observed that out of the twelve parameters two
are very critical to the yield of the circuit. They are component 4, RBIAS (RES2), and
component 10, Kp(PMOS). It is also clear from the plots that the yield is higher for lower
values of RBIAS, which is the bias resistor, and higher values of Kp(PMOS), which is the
transconductance of the PMOS transistors (CMOS OP AMP-SPICENTER(1)).
For the second iteration of the circuit simulations, the nominal value of RBIAS was
changed from 100K ohms to 95K ohms in the circuit file. Its tolerance and other components
or parameters were not altered. This resulted in nearly 100% improvement in the
yield (21.9% to 43.0%) (CMOS OP AMP-SPICENTER(2)).
Based on our interpretation of the histogram of the bias resistor from SPICENTER(2), its
value was further reduced to 90K ohms and the circuit simulated again. Once again there
was a considerable improvement in the average yield of the circuit (43% to 70.6%) (CMOS
OP AMP-SPICENTER(3)).
For the fourth and final run of SPICENTER we increased Kp(PMOS) from 13.5 to
15.525uA/V^2, fixing the bias resistor at 90K ohms (CMOS OP AMP-SPICENTER(4)). These
changes resulted in a yield of 80.7%.
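Because each yield figure comes from a finite number of simulations, it carries sampling uncertainty. The sketch below assumes that the 1000 simulations of an iteration are independent pass/fail trials, so that the estimate has the binomial standard error sqrt(Y(1 - Y)/N). This assumption and the names `yield_standard_error` and `OP_AMP_YIELDS` are illustrative and not taken from the paper.

```python
import math


def yield_standard_error(yield_estimate, n_trials):
    """Binomial standard error of a Monte Carlo yield estimate (independent trials assumed)."""
    return math.sqrt(yield_estimate * (1.0 - yield_estimate) / n_trials)


# Yields reported for the four op amp iterations, each from 1000 simulations.
OP_AMP_YIELDS = [0.219, 0.430, 0.706, 0.807]
N_SIMULATIONS = 1000

standard_errors = [yield_standard_error(y, N_SIMULATIONS) for y in OP_AMP_YIELDS]
```

Under this assumption each standard error is below two percentage points, which is small compared with the step from one iteration to the next.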
5.2 Inverter Chain
The inverter is an important circuit element in both analog and digital CMOS designs. It
is the most basic gain stage of all amplifiers and the design of digital systems is virtually
impossible without the basic inverter. A chain of inverters (inverter chain), chosen as the
second example for this work, is mainly used in ring oscillators and memory circuits.
5.2.1 Design Centering for Delay Time
One of the important characteristics of the inverter chain is the propagation delay time
(delay time, for short). Delay time is taken here to be the time taken at the output
to attain 90% of its final steady state value from the instant the input is triggered with a
step voltage. The transition magnitude on the input of the inverter chain is -5 volts to 5
volts. VDD is 5 Volts and VSS is -5 Volts.
Delay time is used as the performance criterion for the inverter chain example circuit.
For the inverter chain circuit the delay time is specified to be less than 4.5 nsec. Like the
CMOS opamp's rise time, the CMOS inverter chain's delay time is calculated after each
simulation of the circuit from the data provided by SPICE in the transient analysis table.
5.2.2 Histograms
The Figures CMOS Inverter Chain-SPICENTER(1) through (4) shown below are samples
of the yield factor histograms generated by SPICENTER. The first run of 1000 simulations
of the inverter chain estimated the yield at 38.5%. The histograms also pointed out two








Figure 4: CMOS Inverter Chain Example Circuit
Kp(PMOS) the transconductances of both P and N channel transistors (CMOS Inverter
Chain-SPICENTER(1) ).
For the second iteration, Kp(NMOS) was changed to 20uA/V^2 from its nominal value
of 17uA/V^2 without altering the nominal values of the other parameters and components.
The resulting yield was 65.7% (CMOS Inverter Chain-SPICENTER(2)).
With Kp(NMOS) restored to 17uA/V^2, Kp(PMOS) was changed to 9.5uA/V^2 from its
nominal 8uA/V^2 before the start of the third round of circuit simulations. This time the
yield was 56% as shown in CMOS Inverter Chain-SPICENTER(3). This indicates that
the influence of the transconductance of the N channel MOSFET on the circuit yield is
slightly more than that of the P channel MOSFET.
The last simulation was done with Kp(NMOS) changed to 20uA/V^2 and Kp(PMOS)
changed to 9.5uA/V^2. The yield from this iteration was the highest at 80.9% (CMOS
Inverter Chain-SPICENTER(4)).
6 Conclusions
This paper presents a unified design approach to parametric yield optimization by using
statistical design methods recently developed at the University of Idaho [15,17]. The work
involved the development and implementation of the computer program SPICENTER
which calculates and displays Yield Factor Histograms. These histograms are used by the
CMOS IC designer to perform design centering. These methods increased the yield of the
op amp from 21.9% to 80.7% and the yield of the inverter chain from
38.5% to 80.9%.
Acknowledgements- This work was partly supported by a grant from the Idaho State
Board of Education and the NASA SERC at the University of Idaho. A patent application
has been filed by the Idaho Research Foundation covering, among other things, the
Yield Factor Histograms and their use in statistical circuit design. The Idaho Research
Foundation address is P.O. Box 9645, Moscow, Idaho 83843-0178, (208) 883-8366.
References
[1] . Balaban and J. J. Golembeski, "Statistical Analysis for Practical Circuit Design"
,IEEE Transactions on Circuits and Systems, Vol. CAS22, No.2, February 1975,
pp.100-108.
[2] . K. Brayton and S. W. Director, "Computation of Delay Time Sensitivities for use
in Time Domain Optimization" IEEE Transactions on Circuits and Systems, Vol.
CAS22, No.12,December 1975, pp.910-920.
[3] . W. Director, G. D. Hatchel and L. M. Vidigal, "Computationally Efficient Yield
Estimation Procedures Based on Simplical Approximation" IEEE Transactions on
Circuits and Systems, Vol. CAS25, No.3, March 1978, pp.121-130.
[4] . W. Bandler and H. L. Abdel-Malek, "Optimal Centering, Tolerancing and Yield
Determination", IEEE Transactions on CAS25, No.10, October 1978, pp.853-871.
[5] . Singhal and J. F. Pinel, "Statistical Design Centering and Tolerancing using Para-
metric Sampling", IEEE Transactions on CAS28, No.7, July 1981, pp.692-702.
[6] . K. Brayton, G. D. Hachtel and A. L. Sangiovanni Vincentelli, "A Survey of Optimiza-
tion for I.C. Design", IEEE Transactions on CAS69, No.10, October 1981, pp.1334-
1362.
[7] . J. Anterich and R. K. Koblitz, "Design Centering by Yield Prediction", IEEE Trans-
actions on CAS29, No.2, February 1982,pp.88-96.
[8] . E. Hocevar, P. Yang, T. N. Trick and B. D. Epler, "Time Domain Sensitivities",
IEEE Transactions on CAD4, No.4, October,1985, pp.609-620.
[9] . R. Nassif, A. J. Strojwas and S. W. Director, "A Methodology for Worst Case
Analysis of Integrated Circuits", IEEE Transactions on Computer Aided Design, Vol.
CAD5, No.1, January 1986, pp.104-112.
[10] . Yang, D. E. Hocevar, P. F. Cox, C. Machala, and P. K. Chatterjee, "An Integrated
and Efficient Approach for MOS VLSI Statistical Circuit Design", IEEE Transactions
on Computer Aided Design, Vol. CAD5, No.1, January 1986, pp.5-14.
[11] . A. Styblinski and L. J. Opalski, "Algorithms and Software Tools for I. C. Yield
Optimization based on Fundamental Fabrication Parameters", IEEE Transactions on
CAD5, No.1, January 1986, pp.79-89.
[12] . Herr and J. J. Barnes, "Statistical Circuit Simulation Modeling of CMOS VLSI",
IEEE Transactions on CAD5, No.1, January 1986, pp.15-22.
[13] . D. Matson and L. A. Glasser, "Macromodeling and Optimization of Digital Mos
VLSI Circuits", IEEE Transactions on CAD5, No.4, October 1986, pp.659-677.
[14] . Hedensierna and K. 0. Jeppson, "CMOS Circuit Speed and Buffer Optimization",
IEEE Transactions on CAD6, No.2, March 1987, pp.270-281.
[15] . MacFarland and J. E. Purviance, "Centering and Tolerancing the Components of
Microwave Amplifiers", Proceedings of IEEE International Symposium on Microwave
Theory and Techniques, June 1987.
[16] . K. Yu, S. M. Kang, I. N. Hajj, and T. N. Trick, "Statistical Performance Modeling
and Parametric Yield Estimation of MOS VLSI", IEEE Transactions on CAD6,
No.6, November 1987, pp.1013-1022.
[17] . E. Purviance and M. D. Meehan, "A Sensitivity Figure for Yield Improvement",
IEEE Transactions on MTT, Vol.36, No.2, February 1988, pp.413-417.
[18]. E. Hocevar, P. F. Cox and P. Yang, "Parametric Yield Optimization for MOS Circuit
Blocks", IEEE Transactions on CAD7, No.6, June 1988, pp.645-658.
[19] . M. Butler, "Realistic Design Using Large Change Sensitivities and Performance
Contours", IEEE Transactions on Circuit Theory, Vol. CT18, NO.1, January 1971,
pp.58-66.
[20]aly W, and Director S. W, "Dimension Reduction Procedure for Simplical Approxi-
mation Approach to Design Centering", Proceedings of Institute of Electrical Engg,
Vol.127, No.6,December 1981.
[21]. E. Allen and D. R. Holberg, "CMOS Analog Circuit Design", 1st edition, Holt,
Rinehart, and Winston Inc., New York.
[22]. Vladimirescu and S. Liu, "The Simulation of MOS Integrated Circuits Using
SPICE2", Electronic Research Laboratory, College of Engineering, University of Cal-
ifornia,Berkeley, CA.













Figure 5: CMOS Inverter Chain. Yield factor histogram panels for Component 7, Kp(NMOS),
and Component 11, Kp(PMOS); horizontal axis: regions 1 through 9.
Figure 6: CMOS Op Amp. Yield factor histogram panels for Component 4, RBIAS (RES2),
and Component 10, Kp(PMOS); horizontal axis: regions 1 through 9.

