Wafer Scale Integration of Parallel Processors by Hedlund, Kye Sherrick
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1981 
Wafer Scale Integration of Parallel Processors 
Kye Sherrick Hedlund 
Report Number: 
81-401 
Hedlund, Kye Sherrick, "Wafer Scale Integration of Parallel Processors" (1981). Department of Computer 
Science Technical Reports. Paper 328. 
https://docs.lib.purdue.edu/cstech/328 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 





This work is part of the Blue CHiP Project. It is supported in part by the Office of







I would like to express my strongest thanks to my advisor, Lawrence
Snyder, for his tireless guidance and support throughout the course of this
study. Thanks are also extended to the other members of my research
committee, Peter Denning, Dennis Gannon and Gerold Neudeck, for their time
and effort.
Special thanks go to my colleagues Dan Reed, David Mount and Ching
Hsiao for their listening, helpful suggestions and advice, and to Julie
Hanover who provided valuable assistance in preparing this manuscript.
Partial support for this work was provided by the Office of Naval
Research. It is greatly appreciated.
Very special thanks are due to my mother and father for their love and
encouragement.







LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1 - INTRODUCTION
1. Wafer Scale Integration
2. Previous Work on Wafer Scale Integration
3. Introduction to CHiP Processors
CHAPTER 2 - YIELD MODEL
1. The Price Model
2. Yield Model for Analysis of Fault Tolerance
3. Probability Density Function
4. Comparison of Yield Models
a) Distinguishable Defects
b) Indistinguishable Defects
5. Applications of the Yield Model
a) Recovery Analysis
b) Fault Tolerant CHiP Modules
c) Optimum Lattice Size
d) Design Analysis
CHAPTER 3 - TWO LEVEL HIERARCHY
1. The Structuring Problem
2. Global Strategy
3. Two Level Decomposition
CHAPTER 4 - BUILDING BLOCK DESIGN
1. Block Requirements
a) Block Yield
b) Wire Around Capability
2. Processing Element Design
a) Functional Requirements
b) Processor Characteristics
c) Layout and Area Estimation
3. Datapath Design
4. Switch Design
a) Switch Layout
b) Switch Yield
5. Lattice Design
a) Processing Elements
b) Switches
c) Mappability
d) Wire Through
CHAPTER 5 - A WAFER SCALE CHiP PROCESSOR
1. Wafer Layout
2. Lattice Dimensions
3. Column Exclusion
4. External Connections
5. Efficiency
6. Effect of Technological Advances
a) Wafer Size
b) Device Scaling
7. Practical Implementation Considerations
a) Power Consumption
b) Skeleton Routing
c) Clocking
CHAPTER 6 - TESTING CHiP PROCESSORS
1. Design for Testability
2. Model of Lattice Testing
a) Definitions
b) Testable Components
c) Goals of a Testing Procedure
3. Lattice Testing
CHAPTER 7 - CONCLUSIONS
1. Summary of Results
2. Implementation of General Wafer Scale Integration
3. Restructurable Design Methodology
4. Future Research
a) Penalties for Restructurable Circuitry
b) Modular PE Design
BIBLIOGRAPHY
APPENDIX 1 - SUMMATION OF RANDOM VARIABLES
1. Lumped Approximation
2. Two Class Approximation








LIST OF TABLES
2.2.1 Expected Number of Defects as a Function of Area (in NUA)
2.3.1 Probability of m Fatal Defects as a Function of Area (in NUA)
2.3.2 Cumulative Probability of m Fatal Defects as a Function of Area (in NUA)
2.4.1 Comparison of Gaussian and Poisson Approximations
2.5.1 Recovery Probability (0-4 Redundant PEs)
2.5.2 Recovery Probability (5-8 Redundant PEs)
2.5.3 Recovery of Four PEs in N PEs
2.5.4 Optimum Lattice Size for the Recovery of Four PEs
2.5.5 Optimum Lattice Size to Maximize Number of Good Chips Per Wafer
4.1.1 Effect of Block Yield on Grid Size (Worst Case)
4.1.2 Comparison of 90% and 99% Block Yield
4.2.1 Area Estimation for a Processing Element
4.5.1 Effect of Switch Yield on the Number of Faulty Switches Per Block
4.5.2 Probability Density of Defective Switches
5.2.1 Size of Wafer Scale Processor
5.2.2 Wafer Area Occupied by a Grid of Building Blocks
5.2.3 Size of Wafer Scale Processor for 9 x 8 Grid
5.2.4 Size of Wafer Scale Processor for 9 x 9 Grid
5.2.5 Size of Wafer Scale Processor for 10 x 9 Grid
5.6.1 Effect of Wafer Diameter on the Wafer Scale CHiP Processor
5.6.2 Effect of Device Scaling on the Wafer









LIST OF FIGURES
1.1.1 Structuring by Column Exclusion
1.1.2 Interrelationship of Main Concepts
1.3.1 Three CHiP Processors (Circles Represent Switches; Squares Represent PEs)
1.3.2 Mesh Configured CHiP Processor
2.2.1 Yield vs. Area (in NUA - Normalized Unit Area)
2.3.1 Probability of m Fatal Defects as a Function of Area (in NUA)
2.3.2 Cumulative Probability of m Fatal Defects
2.4.1 Taxonomy of Yield Models
2.4.2 Probability of m Defects (Area = 1.0 NUA)
2.4.3 Probability of m Defects (Area = 2.0 NUA)
2.4.4 Probability of m Defects (Area = 3.0 NUA)
2.5.1 Recovery Probability vs. Number of Processors
2.5.2 Recovery Probability for Four PEs in N PEs
2.5.3 2 x 2 Virtual Lattice (Datapaths Not Shown)
2.5.4 3 x 2 Physical Lattice (Datapaths Not Shown)
2.5.5 Example of a Partial Mapping
2.5.6 Complete Mapping of the Virtual Lattice Into the Physical Lattice
2.5.7 Optimum Lattice Size to Maximize Number of Good Chips Per Wafer
2.5.8 Effect of Scaling on Recovery (Scale Factor = 0.5)
2.5.9 Effect of Scaling on Recovery (Scale Factor = 0.25)
3.1.1 Example of a Structured Wafer - 4 x 4 Virtual Lattice in a 6 x 5 Lattice
3.1.2 4 x 4 Virtual Lattice Which Is Functionally Equivalent to the Structured Wafer
3.3.1 Composition of Lattices of Identical Size
3.3.2 Composition of Lattices of Nonuniform Size
3.3.3 Structuring With the Two Level Hierarchy
4.1.1 Example of Wire Through in a Building Block
4.2.1 Systolic Algorithm for Band Matrix Multiplication (from [Mead80])
4.2.2 Processing Element Layout - a Schematic Floor Plan
4.3.1 Approximate Relative Size of a PE, Switch and Datapath
4.4.1 Switch Layout - An Approximate Floor Plan
4.5.1 Recovery Curve for 4 PEs
4.5.2 Virtual Lattice to be Mapped Into a Building Block
4.5.3 Building Block for a Wafer Scale CHiP Processor
4.5.4 Wire Savings Due to Switches With Degree 8 and Crossover Capability
5.1.1 Layout of a Wafer Scale CHiP Processor
5.4.1 Redundant Pad Drivers for High Reliability
5.7.1 Enhancement Mode Transfer Gate with Reference Voltages
5.7.2 Implementation of Programmable Power Down Mechanism
6.0.1 Indirect Testing via a Path of Switches
6.0.2 Testing with the Reflective Switch
6.2.1 Example of a Generic Path
6.2.2 Three Paths Required to Test a Port
6.2.3 Testing a Port
6.3.1 Testing a Port Pair
6.3.2 Testing a PE Square
7.3.1 Advantages of Restructurable Design Methodology
7.3.2 Spectrum of Semiconductor Devices
7.3.3 Elements of the Restructurable Design Methodology
Appendix Figure
A1.1 Probability of j Defects Clustering in 4 of 16 PEs
ABSTRACT
Hedlund, Kye Sherrick. Ph.D., Purdue University, December 1982. Wafer
Scale Integration of Configurable, Highly Parallel Processors. Major
Professor: Lawrence Snyder.
Integrated circuit size (and hence complexity) is limited by the fact
that chips created using current design techniques will not function
correctly in the presence of even a single circuit defect. This research
examines the problem of constructing chips up to the size of the wafer
(wafer scale integration) that operate correctly despite the occurrence of
such flaws. We concentrate on a particular family of parallel processors,
configurable, highly parallel (CHiP) processors.
The key problem in the implementation of wafer scale integration is
structuring the wafer so that only the functional PEs are connected
together. A methodology, the two level hierarchy, that efficiently and
economically solves the structuring problem for CHiP processors is
presented. The principal elements are the use of column exclusion with high
yield building blocks that contain redundant components. This approach
limits the performance degradation due to structuring and allows the
structuring problem to be solved with tractable computational effort.
Since the yield of building blocks must be high for the two level
hierarchy to be a practical approach, yield phenomena are investigated in detail.
A model of the integrated circuit manufacturing process is developed that




defects. These results are applied to the analysis of parallel processors in
which several PEs occupy a single chip. In addition, they are used to design
the building blocks meeting the requirements of the column exclusion
strategy.
It was shown that these building blocks can be assembled into a wafer
scale CHiP processor. With current technology, it is possible to fabricate a
wafer scale system with 250 to 300 PEs. This represents a truly large
parallel machine. Furthermore, this machine is highly robust to faults occurring
during the machine's lifetime, consumes a manageable amount of power and
can be efficiently tested.
Although the techniques for implementing wafer scale integration were
developed for CHiP processors, they can be applied to other system com-










The question that motivated this research is: how can VLSI technology
be utilized in the design of parallel processors? With VLSI technology it is
possible to fabricate chips containing hundreds of thousands of transistors.
But designing and debugging a complex integrated circuit is a lengthy and
costly process. To reduce this cost and delay, it is necessary to decompose
a circuit into a few different types of small substructures with simple
interfaces. Technology favors replicating many copies of a simple circuit.
Consequently, this research analyzes parallel processors that are
composed of a large number of simple processing elements (PEs). Each PE
is a simple microprocessor and can be fabricated on a single piece of silicon.
Large mainframe computers in which a single processor contains thousands
of chips are not within the scope of this research.
This work concentrates on a particular family of parallel processors,
configurable, highly parallel (CHiP) computers. Although the techniques for
implementing wafer scale integration are developed for CHiP processors,
they are entirely general and can be applied to other systems composed of
uniform parts. This includes parallel processors with fixed interconnection







generalizations of this work are discussed.
The goal of the CHiP processors considered in this work is to provide
substantial parallelism at low cost. For problems that can make use of this
parallelism, high performance results. We are not attempting to compete with
the Cray 1 nor are the machines intended to be completely general purpose.
It is hoped that CHiP processors will have wide applicability, but this is an
open question and a subject of further research.
1. Wafer Scale Integration
Many different architectures for parallel processors have been
proposed but few large-scale parallel systems have actually been built. One
reason is that a large-scale parallel processor consists of a great many
components. This introduces severe practical problems of construction,
wiring and reliability. If the number of individual components could be
decreased, parallel processors would be far easier and cheaper to construct.
The absolute minimum number of components is reached when the
entire parallel processor is fabricated on a single piece of silicon. These
wafer scale systems have greatly reduced cost due to the increased level of
integration. Reliability is higher since the connections between processors
are implemented in silicon. Furthermore, there is the potential for
increased performance since data values passed between processors are not
driven off the wafer.
Consider the implementation of a wafer scale system. Fabricating high
density integrated circuits is a delicate process. On any given wafer, many
of the chips will contain defects - errors in the circuitry such as broken
wires or nonfunctional transistors. Defects are randomly distributed over
the wafer surface. They are caused by imperfections inherent in the silicon
or are introduced during the manufacturing process. Consequently, it is not
unusual for complex circuitry to yield only 5-10% working integrated circuits
from any one wafer.
To implement a wafer scale system, all chips on a wafer are tested, and
then the good chips are connected together. The wafer is structured so that
the presence of faulty chips is masked and only functional chips are used.
This structuring problem is the key problem in the implementation of wafer
scale integration (WSI). With low yield, the good chips are sparsely and
irregularly distributed over the wafer surface, so the essential requirement
is a highly flexible means of connecting chips.
Consider the problem of connecting functional chips in a mesh pattern.
This is fundamental for constructing CHiP computers. The structuring
problem is made difficult by low chip yield. For any particular good chip, it
is very unlikely that all its four neighbors will also be functional; the
positioning of good chips on the wafer differs from the required connection
pattern - the mesh. Hence, considerable wiring may be required to connect
a chip to its neighbor in the mesh.
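As a rough illustration of why low yield makes the mesh hard to form, the sketch below computes the chance that all four mesh neighbors of a chip are functional. It assumes that chips fail independently with a common yield, which is a simplification made here for illustration only; the function name `prob_all_neighbors_good` is hypothetical, and the thesis develops a more careful yield model in Chapter 2.

```python
def prob_all_neighbors_good(chip_yield: float, neighbors: int = 4) -> float:
    """Probability that every mesh neighbor of a chip is functional.

    Illustrative assumption: each chip is good independently with the
    same probability `chip_yield`.
    """
    if not 0.0 <= chip_yield <= 1.0:
        raise ValueError("chip_yield must lie in [0, 1]")
    return chip_yield ** neighbors


if __name__ == "__main__":
    # Compare the 5-10% yields quoted in the text with a high-yield case.
    for y in (0.05, 0.10, 0.90):
        p = prob_all_neighbors_good(y)
        print(f"yield={y:.2f}  P(all 4 neighbors good)={p:.2e}")
```

Under this assumption a 10% chip yield gives a probability of about one in ten thousand, while a 90% yield gives roughly two chances in three, which is the contrast the next paragraph relies on.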
Now suppose that most chips are functional. The good chips are
distributed in a more regular pattern - one closely resembling a mesh. This
simplifies the structuring problem. For example, Figure 1.1.1 shows a wafer
containing a 4 x 5 grid of chips with only one faulty chip. A 4 x 4 mesh is
obtained by excluding all chips in the column containing the fault. This







Figure 1.1.1 - Structuring by Column Exclusion
wire around faulty or unused chips. This strategy has been used in 64K
memories [Cenk79, Eato81, Kokk81] and in a computer architecture, the
Massively Parallel Processor [Batc79].
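Column exclusion as described for Figure 1.1.1 can be sketched in a few lines. The representation (a list of rows of booleans, `True` meaning a functional chip) and the function name `exclude_faulty_columns` are assumptions made for this example, and the position of the single fault is arbitrary. The wiring needed to bypass an excluded column is not modeled.

```python
from typing import List, Tuple


def exclude_faulty_columns(good: List[List[bool]]) -> Tuple[List[List[bool]], List[int]]:
    """Drop every column that contains at least one faulty chip.

    Returns the remaining sub-grid and the indices of the columns kept.
    """
    if not good or not good[0]:
        return [], []
    n_cols = len(good[0])
    # A column survives only if every chip in it is functional.
    kept = [c for c in range(n_cols) if all(row[c] for row in good)]
    mesh = [[row[c] for c in kept] for row in good]
    return mesh, kept


if __name__ == "__main__":
    # 4 rows x 5 columns with a single faulty chip (position chosen arbitrarily).
    wafer = [[True] * 5 for _ in range(4)]
    wafer[1][2] = False
    mesh, kept = exclude_faulty_columns(wafer)
    print("columns kept:", kept)
    print("mesh size:", len(mesh), "x", len(mesh[0]))
```

With one faulty chip the 4 x 5 grid loses exactly one column and a 4 x 4 mesh remains. Each additional fault in a new column costs a further full column, which is why the approach depends on very high yield.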
For this simple approach to be practical, the wafer must contain very
few faulty chips. But due to the nature of the integrated circuit
manufacturing process, high yield is achievable only with very simple chips -
much less complex than a processing element that is needed for a parallel
processor.
But suppose the units patterned on the wafer are not individual
processors but building blocks of a mesh. With each block contributing a
small mesh of fixed size, the blocks can be assembled to form a larger mesh.
For example, with a 4 x 4 grid of blocks each containing a 2 PE by 2 PE
mesh, a mesh with 8 PEs on a side is formed. The key idea is that each block
will contain sufficiently many redundant PEs to ensure that a small,
functional mesh will exist within almost every block. Virtually every block on
the wafer will contribute a small subpart to the overall structure, so the
structuring problem can be solved by eliminating the columns (or rows)
containing the relatively rare blocks which are completely dysfunctional.
This technique is practical if the blocks meet two requirements:
1) Blocks must have high yield; most blocks must contain a smaller,
fully functional mesh.
2) Blocks that are unused or faulty can be "wired around" to connect
the two blocks in the adjacent columns.
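The effect of redundancy on block yield can be illustrated with a simple binomial calculation. This is only a sketch under stated assumptions: PEs are taken to fail independently with a common yield, and switches, datapaths and the question of whether the surviving PEs can actually be mapped into a mesh are ignored (the thesis treats these in Chapters 2 and 4). The names `block_yield` and `mesh_side` are hypothetical.

```python
from math import comb


def block_yield(n_pes: int, needed: int, pe_yield: float) -> float:
    """P(at least `needed` of `n_pes` PEs are functional).

    Illustrative assumption: independent PE failures with identical yield.
    """
    if not 0 <= needed <= n_pes:
        raise ValueError("needed must lie between 0 and n_pes")
    return sum(
        comb(n_pes, j) * pe_yield**j * (1.0 - pe_yield) ** (n_pes - j)
        for j in range(needed, n_pes + 1)
    )


def mesh_side(blocks_per_side: int, pes_per_block_side: int) -> int:
    """Side length, in PEs, of the mesh assembled from a square grid of blocks."""
    return blocks_per_side * pes_per_block_side


if __name__ == "__main__":
    # A 4 x 4 grid of blocks, each contributing a 2 x 2 mesh.
    print("PEs per side:", mesh_side(4, 2))
    # Yield of a block that needs 4 good PEs, with 0, 2 and 4 spares.
    for n in (4, 6, 8):
        print(n, "PEs per block ->", round(block_yield(n, 4, 0.5), 4))
```

Even at a PE yield of 0.5, two spare PEs raise the chance of finding four good PEs from 1/16 to 22/64 under these assumptions, which shows the direction of the effect the two requirements rely on.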
In the remainder of this chapter, we survey previous work on wafer
scale integration and give a concise summary of the ideas behind CHiP
processors. The approach to wafer scale integration using column exclusion
and building blocks is discussed in more detail in Chapter 3. Since the yield
of building blocks must be high, yield phenomena are investigated in
Chapter 2. In Chapter 4, the yield results are used to design the building
blocks of a wafer scale CHiP processor. The assembly of the blocks into a
complete wafer scale system is the topic of Chapter 5. The testing of CHiP
processors is discussed in Chapter 6, and the final chapter provides a brief
summary of the results along with possible extensions and generalizations of
this research. Figure 1.1.2 shows the interrelationships of the main
concepts in this thesis. The numbers in parentheses indicate the chapters in
which each topic is discussed.
2. Previous Work on Wafer Scale Integration
Research into wafer scale integration has been conducted for over
fifteen years, starting with discretionary wiring. In this approach, modules
(PEs, memory units, etc.) are patterned on the wafer and are individually
tested by wafer probing. A wiring pattern to connect together the good
modules is automatically generated. This wiring is implemented by extra
levels of metal interconnections that are placed over the modules. The
structuring problem is solved by these extra layers of customized wiring.
Discretionary wiring was strongly backed by both Texas Instruments
and the Air Force. Despite strong funding and years of research, it never
became a practical means of implementing WSI. There are two major
problems with this approach:
Figure 1.1.2 - Interrelationship of Main Concepts (figure not reproduced; surviving box labels: Generalized Techniques (7), Testing (6), Implementation Considerations (5))
• Excessive cost. Defects are randomly distributed over the wafer
surface. With a large number of modules per wafer, there are an
enormous number of different patterns of good and bad modules. This
requires that a unique set of photolithography masks be made to define
the wiring pattern for each individual wafer. This is prohibitively
expensive [Aubu70].
• Faults occur in the upper levels of metalization used for structuring.
The topmost levels of interconnection, as with the lower levels, are
subject to faults such as poor contacts between levels and shorts to
underlying levels [Aubu78, IEEE82]. These faults affect not just a single
module but the entire wafer.
As these problems surfaced, researchers attempted to reduce the
complexity of the custom wiring. Each level of interconnection requires two
photolithography masks. One defines the wiring pattern, and the other
determines the connections between levels. The initial work on
discretionary wiring required two customized metalization levels and hence
four unique masks for each wafer.
The pad relocation technique [Calh72] reduces the number of unique
masks to one. A single, standard wiring pattern on the topmost metal level
interconnects fixed position "pads" on the first level of metalization. This
lower metalization level is customized for each wafer to relocate the wiring
of modules to the pads. Only good modules are connected to a pad. The
upper level makes a standard sequence of connections between fixed







in response to the defect pattern of the particular wafer. Only the mask
defining the lower metalization level need be modified from wafer to wafer.
Despite this cost reduction, pad relocation did not produce reliable and
economical wafer scale systems. The problems are the assumptions that
the customized processing steps would be fault free and that no modules
tested as good would fail during the remaining processing. It was recognized
that the additional processing steps required to define the customized
wiring are the Achilles heel of these approaches.
The work of Manning [Mann75] and the independent but closely related
research of Aubusson [Aubu73, Aubu78] proposed solutions to the
structuring problem that required no extra wafer processing steps. The
essential feature of the approach is that each module can be externally
programmed to connect to any of its immediate neighbors. There is an
implicit switching mechanism within each module. By selectively connecting
modules only to functional neighbors, a linear array of good modules can be
"snaked" through the grid of modules on the wafer. Heuristics for
maximizing the length of the chain were developed [Aubu78, Fuss82,
Mann75].
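To make the idea of snaking a chain through the good modules concrete, the sketch below searches for the longest chain of adjacent functional modules on a small grid. It is a brute-force depth-first search written for illustration and is not one of the heuristics from the cited papers; its running time grows exponentially, so it is only usable on tiny grids. The grid encoding (`True` for a good module) and the name `longest_chain` are assumptions.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]


def longest_chain(good: List[List[bool]]) -> List[Cell]:
    """Longest simple path through 4-connected good modules (exhaustive search)."""
    if not good or not good[0]:
        return []
    rows, cols = len(good), len(good[0])
    best: List[Cell] = []

    def extend(path: List[Cell], used: Set[Cell]) -> None:
        nonlocal best
        if len(path) > len(best):
            best = path[:]  # remember the longest chain seen so far
        r, c = path[-1]
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and good[nxt[0]][nxt[1]] and nxt not in used:
                used.add(nxt)
                path.append(nxt)
                extend(path, used)
                path.pop()        # backtrack
                used.remove(nxt)

    # Try every good module as the start of the chain.
    for r in range(rows):
        for c in range(cols):
            if good[r][c]:
                extend([(r, c)], {(r, c)})
    return best


if __name__ == "__main__":
    # 3 x 3 grid with a faulty module in the center.
    grid = [[True, True, True], [True, False, True], [True, True, True]]
    print(longest_chain(grid))
```

On the 3 x 3 example the eight good modules form a ring, so the chain can visit all of them. On realistic wafers an exhaustive search is impractical, which is why the cited work relies on heuristics.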
Since no extra processing steps are required, this solves the problems
that plagued discretionary wiring and pad relocation, but at the cost of
flexibility. The wafer is structured only into a linear array; the solution to
the structuring problem is only one dimensional.
The structuring of the wafer into a richer set of two dimensional
configurations is a major problem in the implementation of wafer scale
systems. Fussell and Varman [Fuss82] have presented algorithms for a
priority queue and a triangular array capable of performing the
multiplication of a band matrix and a vector. Koren [Kore81] developed
algorithms for a binary tree and a mesh.
Recent advances in integrated circuit manufacturing may provide new
methods for implementing wafer scale integration. The most promising of
these is laser programming [Kuhn75, Legu80, Mano80, Wu82]. Submicron
thick layers of quartz sandwich the uppermost level of metal with a lower
level of metal underneath. A series of short laser pulses burns through the
quartz layers to weld the two metal levels. This forms a low impedance
contact.
The use of laser programming to implement wafer scale systems is
under investigation at Lincoln Laboratories [Chap]. Modules are patterned
on the wafer with fixed wiring corridors between them. Vertical wires are
run in the first metal layer and horizontal wires in the second. Initially, the
modules are unconnected. After testing, laser programming makes the
connections required to interconnect the functional modules.
This technique resembles discretionary wiring. Although the wiring
pattern is fixed, the connections between wires are completed after testing.
But with advances in semiconductor processing technology, wiring channels
can be manufactured with high reliability. Also, the laser welds form low
impedance contacts with very high probability. Thus there are very few
faults in the custom wiring.
However, this approach has one serious drawback. The connections
made with laser programming are static; once they are made they cannot be






and millions of transistors. During the lifetime of a system, faults are very
likely to occur. It is certainly undesirable to discard an entire wafer due to
a single faulty transistor. With laser programming, there is no method of
reconfiguring the wafer after manufacturing. A single fault during the
system lifetime may disable the entire wafer scale system.
3. Introduction to CHiP Processors
A brief introduction to CHiP processors is presented here. More
detailed information can be found in [Snyd82a]. The CHiP processor is a
family of architectures each constructed from three components: a
collection of microprocessors, a switch lattice and a controller. The switch
lattice is the most important component and the main source of differences
between family members. It is composed of programmable switches
connected by datapaths. The microprocessors function as the processing
elements of the system. They are not directly connected to each other, but
rather are inserted at regular intervals into the switch lattice. Figure 1.3.1
shows three different switch lattices. The perimeter switches are connected
to external storage devices.
Each switch has local memory capable of storing several configuration
settings. A configuration setting enables the switch to establish a direct,
static connection between two or more of its incident datapaths. (This is
circuit switching rather than packet switching.) Figure 1.3.2 shows a mesh
configured CHiP processor. Switches in alternating columns are assigned
the North-South configuration setting and every other row has switches set
to connect East to West. The controller is responsible for loading the switch







Figure 1.3.1 - Three CHiP Processors ( Circles Represent
Switches; Squares Represent PEs )
Figure 1.3.2 - Mesh Configured CHiP Processor
processor and is responsible for starting and stopping the PEs.
Members of the CHiP family are distinguished by their lattice
parameters:
o degree - number of incident datapaths
o crossover - number of distinct datapath groups that a switch can
simultaneously connect
o corridor width - number of switches that separate two adjacent PEs.
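The switch behavior and the first two lattice parameters can be captured in a small data structure. The sketch below is one possible model, not the CHiP hardware design: the class name `Switch`, the port labels "N", "E", "S" and "W", and the setting names are all hypothetical. Each stored configuration setting is a list of datapath groups, the degree is the number of ports, and the crossover value bounds how many groups one setting may connect at once.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List

Setting = List[FrozenSet[str]]  # each frozenset is one group of joined ports


@dataclass
class Switch:
    ports: List[str]      # incident datapaths; degree = len(ports)
    crossover: int        # max groups connected simultaneously
    memory: Dict[str, Setting] = field(default_factory=dict)

    @property
    def degree(self) -> int:
        return len(self.ports)

    def store(self, name: str, groups: List[List[str]]) -> None:
        """Store a configuration setting in the switch's local memory."""
        setting = [frozenset(g) for g in groups]
        if len(setting) > self.crossover:
            raise ValueError("setting exceeds the crossover capability")
        seen: set = set()
        for g in setting:
            if len(g) < 2:
                raise ValueError("a group joins two or more datapaths")
            if not g <= set(self.ports):
                raise ValueError("group names an unknown port")
            if g & seen:
                raise ValueError("a port may belong to only one group")
            seen |= g
        self.memory[name] = setting

    def connected(self, name: str, a: str, b: str) -> bool:
        """True if ports a and b are joined under the named setting."""
        return any(a in g and b in g for g in self.memory[name])


if __name__ == "__main__":
    s = Switch(ports=["N", "E", "S", "W"], crossover=2)
    s.store("north_south", [["N", "S"]])
    s.store("cross", [["N", "S"], ["E", "W"]])
    print(s.degree, s.connected("north_south", "N", "S"), s.connected("cross", "E", "W"))
```

Corridor width is a property of the lattice rather than of a single switch, so it is not represented here.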
The lattice of Figure 1.3.1a, the white lattice, is a simple CHiP structure







The implementation choices that must be made when designing a fault
tolerant CHiP machine are strongly influenced by the percentage of faulty
processing elements within the parallel processor. For example, greater
flexibility in interconnecting the PEs may be required if a large fraction of
PEs are faulty than if only a small number fail. Furthermore, redundancy
can be used to increase the yield of a CHiP lattice. The amount of
redundancy required to achieve a given yield depends on the mean number
of faulty PEs. Consequently, a necessary prerequisite to the analysis of fault
tolerant parallel processor design is to determine the number of faulty
processing elements. This problem is the focus of this chapter.
This research analyzes implementations of CHiP machines in silicon. A
number of PEs will be fabricated on a single area of silicon called a building
block. A complete CHiP machine consists of one or more building blocks. The
individual building blocks may reside on separately packaged chips or, in
wafer scale systems, on different portions of a single piece of silicon. Since
the occurrence of defects on a silicon wafer is a random process, the exact
number of faulty PEs cannot be predicted. Instead, a probability density
function describes the fault process. This is the probability that a given
number of defects will occur. It is dependent on many interrelated factors
of design and semiconductor processing technology. A yield model is a
mathematical model of the integrated circuit manufacturing process that
relates the probability of the occurrence of defects to factors such as defect
density, design rules, etc. The design parameter most directly controlled by
the computer architect is the area occupied by a building block.
Consequently, a yield model and the corresponding density function
dependent on the silicon area will be derived below.
The starting point for the development of the yield model is a widely
accepted model due to Price [Pric70]. It will be simplified to exclude factors
that pertain to the fabrication process but are not under the control of the
silicon architect, and some parameters will be assigned values appropriate
for the implementation of CHiP machines. The end result of the modeling of
the semiconductor fabrication process will be a function, Pr(Z=m; A),
computing the probability of exactly m defects occurring within an area of
silicon, A. This function will be used to compute the expected number of
defective PEs in a building block. It will be a workhorse in the analysis of
the effect of fault tolerance on parallel processor design.
1. The Price Model
The starting point of our development of a yield model is the multistep
Price model [Pric70] which is one of the more realistic models of integrated
circuit manufacturing [Glas79, Stap76]. It has shown close agreement with






1. All point defects belong to one of k distinguishable classes of
indistinguishable defects. Defects in different classes can be told apart
by inspection, but within a single class, defects are indistinguishable.
Each class represents the defects introduced by one critical masking
step in the fabrication process. (Throughout this paper we use the
terms processing or fabrication step to refer to a critical masking step,
not operations such as etching, oxide growth, etc.)
2. Each of the fabrication steps is independent of the others; the
number of defects introduced by the i-th step does not depend on the
number of defects introduced by previous steps. This is a direct result of
the design rules. Design rules incorporate sufficient spacing between
levels such as polysilicon and diffusion to ensure that a minor mask
misalignment will not create unwanted transistors. Furthermore,
design restrictions such as not allowing contact cuts over gates
ensure that the processing at upper levels will not damage fragile
portions of underlying layers. The primary consequence of this
assumption is that the total number of defects is the sum of the defects
introduced by each processing step.
3. The density of fatal defects is the same for each fabrication step. On
the average, each processing step contributes equally to the probability
of a fatal defect occurring. Yield is maximized when all steps
contribute equally to the introduction of defects. Consequently, the
design rules are set to ensure this. For example, the metalization layer
runs over rougher terrain than does the polysilicon layer. This makes







and spacings are typically larger than for the polysilicon layer.
From these assumptions, we can derive the following relationship [Glas79]:
$$Y = \frac{C}{\left(1 + d\,Q(r/r_0)\right)^{k}} \qquad (1.1)$$
where Y is the yield (i.e., fraction of chips which are functional). The
parameters have the following interpretations:
C fraction of wafer area not wasted due to clustering defects
Q(r/r0) represents the effect of the design rules employed on the specific
circuit. It depends on the minimum spacing, r, and an empirical
threshold spacing r0. When r approaches r0, Q(r/r0) >> 1 and the
yield drops appreciably. With relaxed design rules, r > r0 and
Q(r/r0) approaches a limit q' with 0 < q' ≤ 1, and yield increases.
k number of critical masking steps (i.e., number of defect classes)
d defect density/chip for a single fabrication step
The above model will be modified to make it applicable specifically to
the analysis of fault tolerant parallel processors. Parameters representing
details of the fabrication process or the design rules will be eliminated, and
specific values for other parameters will be introduced. The result will be a
simplified model relating the yield to the chip area.
2. Yield Model for Analysis of Fault Tolerance
The following simplifications in the above model are made to tailor it to
the analysis of fault tolerant design:
1. Only random defects are considered. (Throughout this paper, the
term defect will refer to a fatal defect: one that causes the circuit in
which it occurs to function incorrectly.) It is assumed that defects have
no tendency to cluster on any portion of the wafer [Stap75, Stap76,
Stap80, Stap82, Sait82]. Non-random defects are due to scratches in a
photolithography mask, surface imperfections resulting from polishing,
etc. Currently, the number of non-random defects per wafer can be
made low (e.g., 1-2 for a 2" wafer). Improvements in processing
technology and increased care in handling wafers during fabrication
can reduce the number of non-random defects. Experience at Lincoln
Laboratories shows that they can be virtually eliminated [Chap]
by more careful wafer screening, increased care in wafer handling and
more frequent mask inspection. Consequently, we assume C = 1.
2. A 4-layer process is assumed. Currently, a 3-layer process defining
three levels of interconnection (diffusion, poly and metal) is common.
For implementation of CHiP processors, it is highly desirable to have an
additional level to facilitate the interconnection of PEs and the routing
of common control and power signals (the skeleton). Since metal has
the lowest RC constant, it is desirable to use an additional level of metal
for the relatively long wires of the skeleton and for the wires between
PEs. A two level metal process is in use by several manufacturers. Thus
it is reasonable to assume such a process for CHiP implementation.
Consequently, we assume there are four interconnection levels
(diffusion, poly and two metal layers), and we let k = 4.
These simplifications reduce equation 1.1 to
$$Y = \frac{1}{\left(1 + d\,Q(r/r_0)\right)^{4}} \qquad (2.1)$$
Yields vary greatly depending on the particular fabrication line, the
process being run, etc. It is undesirable to have the results of this work
apply only to a specific circuit or fabrication process. The results should be
independent of the semiconductor processing details. Consequently, the
many processing and design parameters must be lumped together into a single
factor. To accomplish this, rather than measure area by absolute quantities
(e.g., square mils), we will introduce the concept of normalized unit area.
Yield depends on both the details of the circuit layout and the design
rules employed since different layouts will have different sensitivities to
variances in the design rules. In Chapter 4, the design of a "standard" PE for
CHiP processors is outlined. It has an 8-bit ALU with 64 bytes of memory and
a simple arithmetic oriented instruction set. This is sufficient to execute a
wide variety of systolic algorithms [Snyd82a]. This is the yardstick by which
PE complexity will be measured.
From one fabrication line to another, the design rule spacings of the
circuit layout of the standard PE can be modified to change the yield.
Relaxed design rules will increase both the yield and the area occupied by
the PE while tight design rules can be used on fabrication lines with more
precise manufacturing tolerances to pack more PEs into a given area. Thus
the design rules and the yield can be traded off against each other (within
certain limits). Depending on the particular fabrication line, the design
rules are adjusted so that the standard PE is produced with predetermined
yield.
A normalized unit area (NUA) is the silicon area occupied by a 2 x 2
white lattice of standard PEs with the design rules set to achieve a 20%
yield of the lattices.
(The yield for the unit area definition assumes no fault tolerance: one defect
renders the chip dysfunctional.) The 20% yield figure is somewhat arbitrary
but was chosen so that a normalized unit area represents a medium to
medium large chip. All area measurements in this work will be in terms of
normalized unit area with the understanding that the exact size of a NUA will
vary from one fabrication line to another, with improvements in
semiconductor technology, from nMOS to CMOS implementation, etc.
To convert equation 2.1 to units of normalized unit area, we define
s0 = average number of defects per normalized unit area for a single
processing step
We can then replace d Q(r/r0) in the yield model by A s0:
$$Y = \frac{1}{(1 + A s_0)^{4}} \qquad (2.2)$$
where A is the chip area measured in NUA. The concept of unit area has
eliminated the dependence on the design rules and the particular circuit
being manufactured. The area of a building block will be measured relative
to the area of the standard 2 x 2 white lattice.
To determine the value of s0, solve equation 2.2 for s0. By definition
Y = 0.20 at A = 1.0, so
$$s_0 = (0.20)^{-1/4} - 1 = 0.495 \quad \text{(defects per unit area per step)}$$
Figure 2.2.1 shows the yield as a function of the chip area measured in
NUA. Note that the yield drops steeply at first then levels off at low yield.
This is consistent with empirical evidence. Defects limit chip area; chips
that are too large have prohibitively low yield.
Because the processing steps are assumed independent, the total
number of defects is the sum of the defects introduced by each processing
step. Thus d0, the average number of defects per normalized unit area after all
four fabrication steps, is
$$d_0 = 4 s_0 = 1.98 \quad \text{(defects per unit area)}$$
d0 is a fundamental quantity in the analysis of fault tolerance. From it we
know the mean number of defects in a CHiP lattice of a given area - since
defects are randomly distributed, the expected number of defects in area A
is A d0 (Table 2.2.1).
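The simplified model is easy to evaluate numerically. The following Python sketch is an illustration only; the names S0, D0, chip_yield and expected_defects are chosen here and do not appear in the original analysis. It solves equation 2.2 for s0 from the 20% unit-area definition and then evaluates yield and expected defect count for a few areas.

```python
# Illustrative sketch of the simplified four-step yield model (equation 2.2).
# All identifiers are hypothetical names introduced for this example.

K_STEPS = 4        # number of interconnection levels (k = 4)
UNIT_YIELD = 0.20  # yield of one normalized unit area, by definition

# Solve Y = (1 + A*s0)**(-4) for s0 at A = 1.0, Y = 0.20.
S0 = UNIT_YIELD ** (-1.0 / K_STEPS) - 1.0   # defects per NUA per step
D0 = K_STEPS * S0                           # defects per NUA, all steps


def chip_yield(area_nua):
    """Probability of zero defects on a chip of the given area (in NUA)."""
    return (1.0 + area_nua * S0) ** (-K_STEPS)


def expected_defects(area_nua):
    """Mean number of defects in the given area (A * d0)."""
    return area_nua * D0


if __name__ == "__main__":
    print(f"s0 = {S0:.3f}, d0 = {D0:.2f}")
    for area in (0.5, 1.0, 2.0, 3.0):
        print(f"A = {area:.1f}  yield = {chip_yield(area):.3f}  "
              f"expected defects = {expected_defects(area):.2f}")
```

Because s0 is derived from the 20% definition rather than from process data, the same code applies to any fabrication line once areas are expressed in NUA.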
3. Probability Density Function
The yield is the probability of no defects. Since we are concerned with
the design of fault tolerant machines, a certain number of defects (the
exact number depends on the design details) can be present without
rendering the machine dysfunctional. Therefore, rather than yield, we are
interested in the number of defects and their probability distribution. It is
at this point that this research diverges from previous work on yield models.
The design of fault tolerant CHiP processors requires a more detailed

















Figure 2.2.1 - Yield vs. Area (in NUA -
Normalized Unit Area)
Table 2.2.1 - Expected Number of Defects as a
Function of Area (in NUA)
Area Expected Number
















The probability that exactly m defects occur in a lattice of area A is
denoted by Pr(Z=m; A), where Z is a random variable representing the
number of defects. For a design that can accommodate up to m' defects
and occupies area A, the probability that the machine is functional is
Pr(Z ≤ m'; A). When the area is a fixed quantity, the area parameter will
sometimes be omitted and the density function abbreviated as Pr(Z=m).
Let z_i be the random variable denoting the number of defects
introduced by the i-th processing step and Z be the number of defects after
all processing steps. Pr(z_i = m) follows a geometric distribution [Glas79]
$$\Pr(z_i = m;\, A) = p\,(1-p)^{m} \quad \text{with} \quad p = \frac{1}{1 + A s_0}$$
where s0 is the defect density.
In a multistep process, the total number of defects is the sum of the
defects introduced by the individual processing steps. Hence, for a given
area A, Z is the sum of independent and identically distributed
geometric random variables. For a four step fabrication process,
Z = z_1 + z_2 + z_3 + z_4. Summing the four independent variables, we have
$$\Pr(Z = m) = \sum_{i=0}^{m} \sum_{j=0}^{m-i} \sum_{k=0}^{m-i-j} \Pr(z_1 = i)\,\Pr(z_2 = j)\,\Pr(z_3 = k)\,\Pr(z_4 = m-i-j-k) = \frac{1}{6}(m+1)(m+2)(m+3)\,p^{4}(1-p)^{m}$$
where i, j and k are the numbers of defects introduced by the 1st, 2nd and 3rd
processing steps. The derivation of this equation is given below.
Derivation - Summation of Geometric Random Variables
Assume the random variables z_1, z_2, z_3, z_4 are independent and have
identical geometric distributions, Pr(z_i = m) = p q^m with q = 1 - p. We will
derive the distribution for the sum of 2, 3 and 4 of the random
variables. The four variable case represents the probability of m defects
as predicted by the 4 step Price yield model, the primary model used in
this research.
Two Random Variables:
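The m successes must be divided between the two random variables. z_1
can account for between none and all of them with z_2 making up the
remainder.
$$\Pr(z_1 + z_2 = m) = \sum_{i=0}^{m} \Pr(z_1 = i)\,\Pr(z_2 = m-i) = \sum_{i=0}^{m} p q^{i}\, p q^{m-i} = \sum_{i=0}^{m} p^{2} q^{m} = (m+1)\,p^{2} q^{m}$$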
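Three Random Variables:
Divide the successes into two groups, those of z_1 and those of z_2 and z_3
combined. The total number of successes, m, can be arbitrarily divided
between the two groups, and the two random variable result from above
can be used to evaluate Pr(z_2 + z_3 = m-i).
$$\Pr(z_1 + z_2 + z_3 = m) = \sum_{i=0}^{m} \Pr(z_1 = i)\,\Pr(z_2 + z_3 = m-i) = \sum_{i=0}^{m} p q^{i}\,(m-i+1)\,p^{2} q^{m-i}$$
$$= p^{3} q^{m} \sum_{i=0}^{m} (m-i+1) = p^{3} q^{m} \left[(m+1)^{2} - \frac{1}{2}\,m(m+1)\right] = \frac{1}{2}(m+1)(m+2)\,p^{3} q^{m}$$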
Four Random Variables:
Analogously to the three random variable case, we partition the random
variables into two groups: {z_1} and {z_2, z_3, z_4}. The three variable result
from above is employed.
$$\Pr(z_1 + z_2 + z_3 + z_4 = m) = \sum_{i=0}^{m} \Pr(z_1 = i)\,\Pr(z_2 + z_3 + z_4 = m-i) = \sum_{i=0}^{m} p q^{i}\,\frac{1}{2}(m-i+1)(m-i+2)\,p^{3} q^{m-i}$$
$$= \frac{1}{2}\,p^{4} q^{m} \sum_{i=0}^{m} (m-i+1)(m-i+2)$$
$$= \frac{1}{2}\,p^{4} q^{m} \left[(m+1)(m^{2}+3m+2) + \frac{1}{6}\,m(m+1)(2m+1) - (2m+3)\,\frac{1}{2}\,m(m+1)\right]$$
$$= \frac{1}{6}(m+1)(m+2)(m+3)\,p^{4} q^{m}$$
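The closed forms derived above can be checked numerically by convolving the single-step geometric distribution with itself. The sketch below is an illustration, not part of the original derivation; the function names and the test value p = 0.6 are arbitrary choices made here.

```python
# Numerical check of the sum-of-geometrics derivation (illustrative sketch).
from math import comb, isclose


def geometric_pmf(m, p):
    """Pr(z = m) = p * q**m for a single processing step."""
    return p * (1.0 - p) ** m


def convolve(f, g):
    """Distribution of the sum of two independent variables.

    f and g are lists indexed by the number of defects; entry m of the
    result only needs entries 0..m of the inputs, so truncation is exact.
    """
    n = min(len(f), len(g))
    return [sum(f[i] * g[m - i] for i in range(m + 1)) for m in range(n)]


def closed_form(m, p, steps):
    """C(m+steps-1, steps-1) p**steps q**m; steps=4 gives (1/6)(m+1)(m+2)(m+3) p^4 q^m."""
    return comb(m + steps - 1, steps - 1) * p ** steps * (1.0 - p) ** m


p = 0.6      # arbitrary single-step success probability for the check
M_MAX = 12   # largest defect count examined
single = [geometric_pmf(m, p) for m in range(M_MAX + 1)]

total = single
for _ in range(3):               # add the 2nd, 3rd and 4th steps
    total = convolve(total, single)

if __name__ == "__main__":
    for m in range(M_MAX + 1):
        assert isclose(total[m], closed_form(m, p, 4), rel_tol=1e-9)
    print("four-step closed form matches direct convolution")
```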
Figure 2.3.1 and Table 2.3.1 show the probability of m defects, Pr(Z=m;
A), for several different areas measured in units of NUA. It is important to
observe that for smaller areas the curves peak at a very small value ( e.g. 1 -









Figure 2.3.1 - Probability of m Fatal Defects
as a Function of Area (in NUA)
(x-axis: m = Number of Defects; legend: Area 1.0, Area 2.0)
Table 2.3.1 - Probability of m Fatal Defects as a Function of
Area (in NUA)
                 Pr(Z = m; A)
number of            Area (in NUA)
defects (m)     0.6     1.0     2.0     3.0
 0             .353    .200    .064    .026
 1             .324    .265    .127    .063
 2             .185    .219    .158    .094
 3             .085    .145    .157    .112
 4             .034    .084    .137    .117
 5             .012    .045    .109    .112
 6             .004    .022    .081    .100
 7             .001    .010    .058    .086
 8             .000    .005    .039    .070
 9             .000    .002    .026    .056
10             .000    .001    .017    .044
11             .000    .001    .011    .033
12             .000    .000    .007    .025
For example, in a unit area, the probability of 6 defects is 2% whereas a
single defect occurs 27% of the time. Consequently, the cumulative
probability, Pr(Z ≤ m; A), rises quickly (see Figure 2.3.2 and Table 2.3.2). This
means that at low yield, even though there is a large probability of at least
one defect, the number of defects is likely to be small. The yield of the
whole fabrication process is the product of the yields of the individual steps.
With four processing steps and under the assumption of identical yield at
each step, overall yield equals the yield of an individual step to the fourth
power (equation 2.2). The yield of a single step, 1/(1 + A s0), falls off roughly
inversely with the chip area. Consequently, yield decreases quickly as chip area increases
(Figure 2.2.1): yield is the product of four identical terms. On the other
hand, the probability distribution of the number of defects per chip, Z, is
the sum of four identically distributed random variables. This exhibits a
peaked distribution in which the probability of a large number of defects is
small.
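The entries of Tables 2.3.1 and 2.3.2 follow directly from the four-step density function. The Python sketch below is an illustration added here, with hypothetical function names; it evaluates Pr(Z = m; A) and its cumulative sum for the areas used in the tables.

```python
# Illustrative evaluation of the four-step Price density Pr(Z = m; A).

S0 = 0.20 ** -0.25 - 1.0   # defects per NUA per step, from Y = 0.20 at A = 1


def defect_pmf(m, area):
    """Pr(Z = m; A) = (1/6)(m+1)(m+2)(m+3) p^4 q^m with p = 1/(1 + A*s0)."""
    p = 1.0 / (1.0 + area * S0)
    q = 1.0 - p
    return (m + 1) * (m + 2) * (m + 3) / 6.0 * p ** 4 * q ** m


def defect_cdf(m, area):
    """Pr(Z <= m; A): probability that a design tolerating m defects works."""
    return sum(defect_pmf(i, area) for i in range(m + 1))


if __name__ == "__main__":
    areas = (0.6, 1.0, 2.0, 3.0)
    print("m   " + "  ".join(f"A={a:.1f}" for a in areas))
    for m in range(13):
        row = "  ".join(f"{defect_pmf(m, a):.3f}" for a in areas)
        print(f"{m:<3d} {row}")
```

Calling defect_cdf in place of defect_pmf in the loop gives the cumulative values.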
4. Comparison of Yield Models
In the previous sections, a multistep Price yield model was developed.
Is this particular model the most appropriate? There are other yield models
such as the Poisson and Gaussian models which are based on slightly
different and less realistic assumptions about the semiconductor
manufacturing process. However, their mathematical formulation is
considerably simpler than the Price model. Are they sufficiently accurate
for the types of problems we will consider? Can a good approximation be











































Figure 2.3.2 - Cumulative Probability of m
Fatal Defects
Table 2.3.2 - Cumulative Probability of m Defects as a Function of
Area (in NUA)
                 Pr(Z ≤ m; A)
number of            Area (in NUA)
defects (m)     0.6     1.0     2.0     3.0
 0             .353    .200    .064    .026
 1             .677    .465    .191    .089
 2             .862    .685    .348    .183
 3             .947    .830    .505    .294
 4             .981    .914    .642    .412
 5             .994    .959    .751    .523
 6             .998    .981    .832    .624
 7             .999    .992    .890    .709
 8            1.000    .996    .929    .780
 9            1.000    .998    .956    .836
10            1.000    .999    .973    .880
11            1.000   1.000    .984    .913







models and compares their accuracy. The basic question is whether the
increased accuracy of the Price model is worth its added complexity. It is
answered affirmatively.
Figure 2.4.1 shows the relationship of the different yield models. The
key underlying assumption is the distinguishability of defects. If the wafer
were examined by an inspector, could each of the individual defects be told
apart? The Poisson and Gaussian models assume distinguishable defects
whereas the Price model assumes the defects have identical appearances.
This assumption determines the form of the probability density function for
the occurrence of defects. For example, consider the total number of ways
m defects can occur in a set of n different chips. For many of the
probabilities that will arise in applications of the yield model, this is the size
of the sample space. If the defects are distinguishable, there are n^m
different assignments of defects to chips whereas indistinguishable defects
give only
$$\binom{n + m - 1}{m}$$
placements. The different sizes of the sample space give rise to different
probability distributions. Additionally, equations involving terms such as n^m
generally are simpler than those involving the more complex combinatorial
formulae. Consequently, the Price models are more complex and difficult to
work with than the Poisson and Gaussian models.
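The two sample-space sizes can be confirmed by brute-force enumeration for small cases. This Python sketch is an illustration added here; the function names are hypothetical.

```python
# Counting placements of m defects on n chips (illustrative sketch).
from itertools import combinations_with_replacement, product
from math import comb


def count_distinguishable(m, n):
    """Each of the m labelled defects independently picks one of n chips."""
    return sum(1 for _ in product(range(n), repeat=m))


def count_indistinguishable(m, n):
    """Only the number of defects per chip matters (multisets of size m)."""
    return sum(1 for _ in combinations_with_replacement(range(n), m))


if __name__ == "__main__":
    m, n = 3, 4
    print(count_distinguishable(m, n), n ** m)                   # 64 64
    print(count_indistinguishable(m, n), comb(n + m - 1, m))     # 20 20
```

For three defects on four chips the distinguishable model has 64 equally likely outcomes and the indistinguishable model has 20, which is why the two assumptions lead to different defect distributions.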
Although they are more complex, the Price models are more realistic.
They agree more closely with empirical evidence [Glas79]. Furthermore, it




































Figure 2.4.1 - Taxonomy of Yield Models
pinholes) can be told apart. However, an inspector could tell a metal short
from an oxide pinhole. This supports the assumption of distinguishable classes of
indistinguishable defects which underlies the Price model.
a) Distinguishable Defects
Assume each defect is unique and can be differentiated from all other
defects. With M distinguishable defects distributed over N chips, the
probability that any given chip contains exactly k defects after a single
processing step is
$$\Pr(z = k) = \binom{M}{k} \left(\frac{1}{N}\right)^{k} \left(1 - \frac{1}{N}\right)^{M-k} \qquad (4.1)$$
This is a form of the binomial distribution. It can be approximated in
different ways depending on the frequency of defects: rare, occasional or
frequent. The last two cases are of practical interest since, in any large
scale circuit, defects are likely to occur.
1) Occasional defects. If the yield is moderate then equation 4.1 can be
approximated by a Poisson distribution
$$\Pr(z = k) = \frac{e^{-s_0}\, s_0^{\,k}}{k!}$$
where s0 = M/N is the expected value of the random variable z.
A key advantage of the Poisson approximation is its simple extension to
modeling multiple fabrication steps. Since the sum of independent Poisson
random variables is itself Poisson,
$$\Pr(z_1 + z_2 + \cdots + z_l = k) = \frac{e^{-s_l}\, s_l^{\,k}}{k!} \qquad (4.2)$$
where s_l = λ_1 + λ_2 + ... + λ_l with λ_i = expected value of z_i. For identically
distributed z_i,
$$\Pr(z_1 + z_2 + \cdots + z_l = k) = \frac{e^{-l s_0}\,(l s_0)^{k}}{k!} \qquad (4.3)$$
with s0 = expected value of z_i. This contrasts with the more complex sum of
geometric random variables distribution encountered in the Price model
(see section 2.3).
Note that in equation 4.2 it is not necessary to assume (as in the Price
model) that each processing step contributes equally to the probability of
occurrence of defects. All that is necessary is to sum the expected number
of defects in each processing step and use the sum as the parameter in a
Poisson distribution. In contrast, the Price model without this assumption






where d_i is the expected number of fatal defects introduced by the i-th
processing step.
2) Frequent defects. For a low yield and M large, equation 4.1 is more
accurately approximated by a Gaussian distribution [Ross76]
$$\Pr(z = k) = \frac{1}{\sqrt{2\pi}\,\sigma_z} \exp\left[-\frac{1}{2}\left(\frac{k - s_0}{\sigma_z}\right)^{2}\right] \qquad (4.4)$$
where σ_z^2 = s0(1 - 1/N) is the variance of z.
How much more accurate is the Gaussian approximation for low but still
realistic yields? First, assume N is large so σ_z^2 ≈ s0 and equation 4.4
becomes
$$\Pr(z = k) = \frac{1}{\sqrt{2\pi s_0}} \exp\left[-\frac{1}{2}\,\frac{(k - s_0)^{2}}{s_0}\right]$$
(N = 20 is clearly a lower bound on the number of chips per wafer. For
N = 20, σ_z^2 = s0(1 - 1/20) = 0.95 s0 and σ_z = 0.98 sqrt(s0), so this approximation is
highly accurate.)
To compute the yield we take k = 0:
$$Y = \Pr(z = 0) = \frac{1}{\sqrt{2\pi s_0}}\, e^{-s_0/2}$$
Table 2.4.1 compares yield vs. s0 for the Gaussian and Poisson
approximations. With low yields (<5%), for a given value of s0, the Gaussian
approximation predicts a higher yield than the Poisson model. Since the
Poisson approximation is known to underestimate yields [Glas79], the
Gaussian approximation is indeed more accurate. However, the difference
between the approximations is not large (~22%) even at extremely low yields
(1%). The relationship between yield and area is
$$Y = \Pr(z = 0) = \frac{1}{\sqrt{2\pi A s_0}}\, e^{-\frac{1}{2} A s_0} \qquad (4.5)$$
where s0 = 1.20 defects per unit area per step, which is derived by solving
equation 4.5 for s0 with Y = 0.20. A 1% yield corresponds to A s0 = 5.64 or A





Table 2.4.1 - Comparison of Gaussian and Poisson Approximations
                  s0
Yield    Gaussian    Poisson    Gaussian/Poisson
0.01     5.64        4.61       1.223
0.02     4.49        3.91       1.148
0.03     3.83        3.51       1.091
0.04     3.38        3.22       1.050
0.05     3.04        3.00       1.013
0.06     2.77        2.81       0.961
0.07     2.55        2.66       0.962
0.10     2.05        2.30       0.891
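The s0 columns of Table 2.4.1 can be reproduced by inverting the two zero-defect formulas, Y = exp(-s0) for the Poisson approximation and Y = exp(-s0/2) / sqrt(2 pi s0) for the large-N Gaussian approximation. The Python sketch below is an illustration added here: the function names are hypothetical and the bisection bracket [0.5, 50] is an assumption that covers yields below about 0.43.

```python
# Inverting the Poisson and Gaussian yield formulas (illustrative sketch).
from math import exp, log, pi, sqrt


def gaussian_yield(s):
    """Y = Pr(z = 0) under the large-N Gaussian approximation."""
    return exp(-s / 2.0) / sqrt(2.0 * pi * s)


def poisson_s0(y):
    """Solve Y = exp(-s0) for s0."""
    return -log(y)


def gaussian_s0(y, lo=0.5, hi=50.0):
    """Solve gaussian_yield(s0) = y by bisection.

    gaussian_yield is strictly decreasing for s > 0, so the root is
    bracketed whenever gaussian_yield(hi) < y < gaussian_yield(lo).
    """
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if gaussian_yield(mid) > y:
            lo = mid      # yield still too high: need more defects
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    for y in (0.01, 0.02, 0.03, 0.04, 0.05, 0.10):
        g, p = gaussian_s0(y), poisson_s0(y)
        print(f"Y = {y:.2f}  Gaussian s0 = {g:.2f}  Poisson s0 = {p:.2f}  ratio = {g / p:.3f}")
```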






block. Consequently, in the range of chip areas under consideration, the
Gaussian approximation is only marginally more accurate than the Poisson
approximation, so it will not be considered further.
b) Indistinguishable Defects
Assume all the defects are identical and cannot be told apart. With M
indistinguishable defects on a wafer of N chips, there are
$$\binom{N + M - 1}{M}$$
different ways of distributing the defects on the chips. To evaluate Pr(z=k),
the probability that one specific chip contains exactly k defects, note that a
subset of k indistinguishable defects can be chosen in only one way. The
















Thus a geometric distribution characterizes the defect distribution for a
single processing step with indistinguishable defects.
Extending this result to multiple classes of defects, we assume that
defects within each class are indistinguishable but two defects in different
classes can be told apart. A different defect class is associated with each
interconnection level. Since the fabrication steps are assumed to be
independent, the total number of defects is the sum of the number of
defects introduced by each step. By the assumption of equal defect
densities at each level, the z_i are identically distributed. Consequently, Z,
the total number of defects, is the sum of independent, identically
distributed geometric random variables, and the probability density
functions, Pr(Z=m), for 3 and 4 classes of defects are:
$$\Pr(z_1 + z_2 + z_3 = m) = \frac{1}{2}(m+1)(m+2)\,p^{3} q^{m}$$
$$\Pr(z_1 + z_2 + z_3 + z_4 = m) = \frac{1}{6}(m+1)(m+2)(m+3)\,p^{4} q^{m}$$
with p = 1/(1 + A s0) and q = 1 - p.
Graphs of Pr(Z=m; A) for the Poisson, 3 and 4 class models are shown in
Figures 2.4.2 - 2.4.4 for different areas.
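A rough numerical comparison of the Poisson and four class Price distributions can be sketched as follows. This is an illustration added here, not the computation behind the figures: it assumes that both models are calibrated to give 20% yield at one NUA, so the Poisson mean is taken as -A ln(0.20). That calibration and all function names are assumptions of this sketch.

```python
# Poisson vs. four-class Price defect distributions (illustrative sketch).
# Assumption: both models are calibrated to Y = 0.20 at A = 1 NUA.
from math import exp, factorial, log

S0 = 0.20 ** -0.25 - 1.0     # Price model: defects per NUA per step
LAMBDA_UNIT = -log(0.20)     # assumed Poisson mean number of defects per NUA


def price_pmf(m, area):
    """Four-class Price model: (1/6)(m+1)(m+2)(m+3) p^4 q^m."""
    p = 1.0 / (1.0 + area * S0)
    return (m + 1) * (m + 2) * (m + 3) / 6.0 * p ** 4 * (1.0 - p) ** m


def poisson_pmf(m, area):
    """Poisson model with mean area * LAMBDA_UNIT."""
    lam = area * LAMBDA_UNIT
    return exp(-lam) * lam ** m / factorial(m)


def tail(pmf, m_min, area):
    """Pr(Z >= m_min), computed as one minus the lower partial sum."""
    return 1.0 - sum(pmf(m, area) for m in range(m_min))


if __name__ == "__main__":
    for area in (1.0, 2.0, 3.0):
        print(f"A = {area:.1f}: Pr(Z >= 8) Price = {tail(price_pmf, 8, area):.3f}, "
              f"Poisson = {tail(poisson_pmf, 8, area):.3f}")
```

Under this calibration the Price model assigns noticeably more probability to large defect counts as the area grows.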
Comparing the Poisson and Price models, we find that the Poisson
model is less accurate as the chip area increases. At unit area, the number
of defects is overestimated. But for larger areas, the Poisson model
underestimates the number of defects by a considerable amount. In short,
















































































Figure 2.4.2 - Probability of m Defects
(x-axis: m = Number of Defects)
























































Figure 2.4.3 - Probability of m Defects
(x-axis: m = Number of Defects)


















Figure 2.4.4 - Probability of m Defects
(Area = 3.0 NUA)
(x-axis: m = Number of Defects; legend: 4-step model, 3-step model)
area of a wafer scale building block is large, and we would rather make
conservative estimates than overly optimistic ones, the Poisson model is
unsuitable for precise defect analysis. It is useful only for order of
magnitude estimates.
Comparing the 3 and 4 class Price models, we find that both curves have
very similar shapes. Furthermore, they converge as n -> ∞, but the 4 class
model shows greater variance. The three class model is only a moderately
good approximation to the four class model. Since a 4 level process
is most appropriate for the implementation of wafer scale CHiP machines,
its added complexity will be endured except when it is prohibitively costly.
5. Applications of the Yield Model
In the previous sections we developed a model of the integrated circuit
manufacturing process. The analysis was based on the properties of the
fabrication process. The end result was to characterize the distribution of
imperfections in the fabrication process, and from this model the yield of a
given size chip can be predicted.
This is not, however, our ultimate objective. In this work we are
interested in the analysis of parallel processors. But the processors under
consideration are fabricated out of silicon with several PEs per chip. So the
modeling of integrated circuit fabrication technology is a necessary
prerequisite to parallel processor analysis. The choice of the number of
processing elements per chip, size of the PEs, etc. depends in part on the






In this section, the yield model developed above is applied to the study
of the design of parallel processors. In very large and complex parallel
processing systems, fault tolerance is a desirable (if not mandatory)
property of the system. With the homogeneous structure of CHiP machines,
redundancy is a natural means of achieving fault tolerance. To analyze the
yield of fault tolerant CHiP modules, one must know, for a chip containing a
fixed number of redundant components, the probability that the
number of faulty components does not exceed the number of redundant
ones. This is the yield of the fault tolerant chip. Conversely, a design
oriented version of the above question is how much redundancy is required
to achieve a given yield. Knowledge of this can guide the designer of a
parallel processor in choosing the amount of redundancy within the
processor.
Furthermore, changes in technology impact the design of parallel
processors. The scaling down of device dimensions increases yield with
resulting reduction in cost. Alternatively, scaling can be exploited by using
more powerful and faster PEs on a chip with the same yield. Combinations
of increased PE capacity and better yield are also possible.
There are also tradeoffs between the size of the individual PEs and the
dimensions of the CHiP lattice. Which is preferable, a small number of
complex PEs or a larger number of simple ones? With respect to yield, this
tradeoff can be quantified through the use of the yield model. These





a) Recovery Analysis
Given a set of Np identical PEs fabricated on a chip of area A, what is
the probability, R_m, that at most m of the PEs are faulty? This is the
recovery problem. R_m is the probability that at least Np - m of the PEs can
be recovered from the chip. If the chip contains m redundant PEs, R_m is the
yield of the fault tolerant chip. The chip is usable if no more than m of the
PEs are faulty. Otherwise the chip does not contain a sufficient number of
good PEs.
From a solution to the recovery problem, the mean number of good PEs
per chip is easily calculated. The probability that a chip has exactly m
defective PEs is R_m - R_{m-1}. The expected number of good PEs is
$$\sum_{m=0}^{N_p} (N_p - m)\,(R_m - R_{m-1}) \qquad (5.1)$$
where R_{-1} = 0. This is the average yield of PEs per chip.1
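Equation 5.1 can be evaluated directly from a list of recovery probabilities. The Python sketch below is an illustration added here; the function name is hypothetical, and the sample values are the Np = 4 row of Table 2.5.1 (R_0 through R_4).

```python
# Expected number of good PEs per chip from equation 5.1 (illustrative sketch).


def expected_good_pes(recovery):
    """recovery[m] = R_m for m = 0..Np; returns sum of (Np - m)(R_m - R_{m-1})."""
    n_p = len(recovery) - 1
    total = 0.0
    previous = 0.0                      # R_{-1} = 0
    for m, r_m in enumerate(recovery):
        total += (n_p - m) * (r_m - previous)   # Pr(exactly m faulty) * good PEs
        previous = r_m
    return total


# R_0 .. R_4 for a chip of four standard PEs, read from Table 2.5.1.
R_FOUR_PES = [0.263, 0.650, 0.904, 0.989, 1.000]

if __name__ == "__main__":
    print(f"expected good PEs out of 4: {expected_good_pes(R_FOUR_PES):.3f}")
```

The sum telescopes to R_0 + R_1 + ... + R_{Np-1}, so for these values the result is 0.263 + 0.650 + 0.904 + 0.989 = 2.806 good PEs on average.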
How does a solution to the recovery problem apply to the analysis of
CHiP processors? CHiP machines are composed of two types of components:
switches and PEs. The recovery problem considers only faults in PEs. But it
will be shown (Chapter 4) that PE faults are the dominant factor in the yield
of a CHiP lattice. Switches are very small and simple. As a result, they have
high yield; there are few faulty switches. On the other hand, PEs are much
larger, and defects are much more likely to occur in PEs than in switches.
Consequently, if the PEs of a lattice are functional then there is a very high
probability that the entire lattice is functional. Analyzing the yield of PEs
1 This probability can also be calculated from the binomial distribution. Our emphasis on




provides a very good approximation to the yield of the lattice as a whole.
To solve the recovery problem, note that by the assumption that all
defects are point defects, each defect will disable exactly one PE. A point
defect causes localized circuit damage, so it is impossible for a point defect
to span two or more PEs. Consequently, if the number of defects on the chip
is less than or equal to m, no more than m PEs can be faulty. In addition,
recall that defects are randomly distributed over the wafer surface. It is
possible for a PE to contain multiple defects. In short, the chip may contain
more than m defects but they may be clustered in m (or fewer) PEs. Thus
R_m consists of two terms
$$R_m = \Pr(Z \le m;\, A) + \sum_{i=m+1}^{\infty} \Pr(Z = i;\, A)\, \Pr(i \text{ defects cluster in } m \text{ PEs}) \qquad (5.2)$$
The distribution of Z is known from the yield model results, and the
clustering probability is derived in appendix one. Different forms of the
clustering probability can be derived depending on the number of classes of
defects. As seen earlier, a four class assumption is the most appropriate
model of the integrated circuit manufacturing process for CHiP machines.
However, the solutions to the clustering probability become increasingly
complex as the number of defect classes increases. Figure A1.1 in the
appendix compares the solutions for one, two and three classes of defects
with all defects clustering in four or fewer of 16 PEs. Note that the
probability distributions converge as the number of defect classes increases.
The difference between the curves for two and three classes is less than the
gap between the one and two class curves. This indicates that the three and
four class solutions will be in even closer agreement. Additionally, the two
and three class solutions differ by only a few percent. As a result, the three
class solution will be accepted as sufficiently accurate; the added
complexity of the four class solution does not justify the slight increase in
accuracy.
Equation 5.2 gives the relationship between PE area, number of PEs,
redundancy and yield. It can be used to analyze tradeoffs between these
quantities. To demonstrate the results of this analysis, we will study one
example that will be of considerable use in the design of the wafer scale
CHiP machine. Recall that the definition of the normalized unit area is
tailored to this standard PE. One NUA is defined to be the area that can hold
a 2 x 2 white CHiP lattice of standard PEs with the design rules set to
achieve 20% yield.
Figure 2.5.1 displays the results of applying equation 5.2 to the
standard PE. On the x-axis is the number of PEs in the collection. Each one
of the different curves shows the relationship between recovery probability,
R_m, and the total number of PEs, Np, for a fixed number of redundant PEs,
m. Exactly m of the Np PEs are redundant. The individual curves depict
R_0, R_1, ..., R_8. This information is also displayed in Tables 2.5.1 and 2.5.2.
The lowest of the curves, R_0, is a standard yield curve. There is no
redundancy so a single defect renders the chip unusable. The shape of R_0 is
similar to Figure 2.2.1. Note the point Np = 4 and R_0 = .26. One normalized
unit area holds a 2 x 2 lattice and has yield .20. However, the lattice
contains both switches and PEs. Some of the defects within a lattice will fall






































Figure 2.5.1 - Recovery Probability vs. Number
of Processors
(x-axis: Np = Number of Processors)
Table 2.5.1 - Recovery Probability (0-4 Redundant PEs)
                Recovery Probability
number of          Redundant PEs
PEs          0       1       2       3       4
 1         .686   1.000   1.000   1.000   1.000
 2         .485    .904   1.000   1.000   1.000
 3         .353    .776    .968   1.000   1.000
 4         .263    .650    .904    .989   1.000
 5         .200    .540    .822    .958    .996
 6         .155    .447    .733    .910    .981
 7         .122    .371    .647    .850    .955
 8         .097    .309    .567    .783    .916
 9         .078    .259    .495    .714    .869
10         .064    .218    .432    .647    .816
11         .053    .184    .377    .583    .760
12         .044    .157    .329    .525    .704
13         .037    .134    .288    .471    .648
14         .031    .115    .253    .422    .595
15         .026    .099    .222    .379    .545
16         .022    .086    .196    .340    .498
Table 2.5.2 - Recovery Probability (5-8 Redundant PEs)
                Recovery Probability
number of          Redundant PEs
PEs          5       6       7       8
 1        1.000   1.000   1.000   1.000
 2        1.000   1.000   1.000   1.000
 3        1.000   1.000   1.000   1.000
 4        1.000   1.000   1.000   1.000
 5        1.000   1.000   1.000   1.000
 6         .998   1.000   1.000   1.000
 7         .991    .999   1.000   1.000
 8         .977    .998    .999   1.000
 9         .953    .988    .998   1.000
10         .922    .973    .993    .998
11         .883    .953    .984    .996
12         .840    .926    .971    .990
13         .793    .893    .952    .981
14         .745    .857    .929    .969
15         .697    .817    .901    .952
16         .649    .776    .870    .931
with the four PEs - not with the switches. Because of this the yield of four
PEs is higher than the yield of a 2 x 2 lattice.
The size of each PE is fixed so as the number of processors, Np,
increases, the area occupied by the PEs increases proportionately. Since
defects are distributed randomly, more PEs means a larger area to be "hit"
by one of the defects. The R_0 curve decreases rapidly, reflecting the fact that the
yield declines as the 4th power of the area. For larger m, the decline is less
steep. Redundancy moderates the effect of defects.
Figure 2.5.1 can be used in a variety of ways to analyze the design of
parallel processors composed of the "standard" processing element. For
example, suppose we want to produce chips containing a set number of
functional PEs, but a yield higher than the R_0 curve is required. In other
words, simply patterning the required number of PEs on the chip does not
give high enough yield. Adding redundant PEs to the chip will increase its
yield. Exactly how much redundancy is required to achieve the target yield?
The answer is found in Figure 2.5.1.
For example, consider fabricating a chip that contains four good
PEs. (This is not a randomly chosen example. CHiP lattices with four PEs
will be used as basic units out of which wafer scale CHiP machines will be
built.) Let the target yield be 75%. Simply patterning four PEs per chip
results in only 26% yield (Table 2.5.1). The datapoints from Figure 2.5.1
corresponding to four PEs (Np = 4 and m = 0; Np = 5 and m = 1; ... Np = 12
and m = 8) are summarized in Figure 2.5.2 and Table 2.5.3. 73% of the time
four good PEs can be found in a collection of six PEs. At least four PEs are
functional out of seven 85% of the time. This shows that the target yield is
achieved by providing 2 - 3 redundant PEs.
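The lookup described above amounts to scanning the recovery values for the smallest amount of redundancy that meets a target. The Python sketch below is an illustration added here; the names are hypothetical, and the values are the entries of Tables 2.5.1 and 2.5.2 for Np = 4 + m.

```python
# Choosing the number of redundant PEs for a target yield (illustrative sketch).

# Recovery probability of four good PEs when m redundant PEs are added
# (Np = 4 + m), read from Tables 2.5.1 and 2.5.2.
RECOVERY_OF_FOUR = {
    0: 0.263, 1: 0.540, 2: 0.733, 3: 0.850, 4: 0.916,
    5: 0.953, 6: 0.973, 7: 0.984, 8: 0.990,
}


def min_redundancy(target, table=RECOVERY_OF_FOUR):
    """Smallest m whose recovery probability meets the target, else None."""
    for m in sorted(table):
        if table[m] >= target:
            return m
    return None


if __name__ == "__main__":
    for target in (0.50, 0.75, 0.90, 0.95):
        print(f"target {target:.0%}: {min_redundancy(target)} redundant PEs")
```

For a 75% target the scan returns three redundant PEs; two redundant PEs fall just short at 73%.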
From Figure 2.5.2 it can be seen that adding a single redundant PE
increases recovery from 26% to 54%. This is a surprising result. Why?
Adding an additional PE increases the chip area. There is more area to be
"hit" by a randomly distributed defect. One might naively suppose that the
addition of a redundant PE would be counterbalanced by the increase in
chip area. The net result would be little or no increase in recovery. The
reason this does not happen can be traced back to the characteristics of the
cumulative probability distribution of the number of defects in a given area.
It was noted (see section 3 - Probability Density Function) that for
moderately large areas, even though there may be a large probability of at
least one defect, the number of defects is likely to be small. For example, in
one normalized unit area there is an 80% chance of there being at least one
defect. However, the mean number of defects is less than two (Table 2.2.1).
It takes only a small number of redundant PEs to absorb the few defects
that are likely to occur. Thus a little redundancy provides a large increase
in recovery.
b) Fault Tolerant CHiP Modules
One aspect of this work is to consider the design of CHiP modules ...
chips containing a small CHiP lattice. Due to pinout constraints, each
module can contain only a small number of processing elements. The
individual modules can be packaged and assembled to form larger CHiP
machines. Alternately, the modules can remain on the wafer and be




















Figure 2.5.2 - Recovery Probability for Four
PEs vs. N PEs
(x-axis: N = Number of PEs)
Table 2.5.3 - Recovery of 4 PEs from N PEs
N = number    relative    prob ≥ 4     number of
of PEs        area        good PEs     redundant PEs
 4            1.00        .263         0
 5            1.25        .540         1
 6            1.50        .733         2
 7            1.75        .850         3
 8            2.00        .916         4
 9            2.25        .953         5
10            2.50        .973         6
11            2.75        .984         7
12            3.00        .990         8
The results of the previous section show that redundancy can cause
large increases in yield. This suggests that redundancy could be a cost
effective approach to manufacturing CHiP modules. A fault tolerant CHiP
module could be designed that contains redundant PEs. The switch lattice
can be used to route around the faulty PEs and connect together the
functional ones. Faults, of course, can also occur in switches so redundant
switches are also required.
Three problems in the design of fault tolerant CHiP modules must be
solved:
o Choose the number of redundant PEs.
o Choose the switch lattice.
o Configure the lattice to avoid defects (the mapping problem).
The first problem can be solved using the recovery analysis results. As for
the second, in Chapter 4 it will be shown that switches are quite small so
they have very high yield. Doubling the corridor width of the switch lattice
provides 100% switch redundancy. This allows virtually all switch faults to be
absorbed. Consequently, faulty switches have virtually no effect on the yield
of fault tolerant CHiP modules. The recovery analysis results (which
considered only PEs) are an upper bound on the recovery of CHiP lattices
containing both PEs and switches. However, this upper bound is a very close
approximation to actual lattice recovery.
Finally, the lattice must be configured to mask the presence of defects.




Figure 2.5.3 - 2 x 2 Virtual Lattice
(Datapaths Not Shown)

Figure 2.5.4 - 3 x 2 Physical Lattice
(Datapaths Not Shown)
containing a 3 x 2 double corridor lattice (Figure 2.5.4). The 3 x 2 lattice
that is actually patterned in silicon is termed the physical lattice. Switches
in the physical lattice will be set so that it emulates a fault free 2 x 2 lattice,
the virtual lattice. We say that the virtual lattice is mapped into the physical
lattice. The configured physical lattice could be used in place of the virtual
lattice or vice versa. An observer of the input/output behavior of a fault
tolerant CHiP module cannot detect the presence or location of the faulty
components.
There are two subtasks in finding a mapping of the virtual lattice into
the physical lattice:
o Assign PEs and switches in the physical lattice to their counterparts in
the virtual lattice.
o Define a one-to-one correspondence between datapaths in the virtual
lattice and paths in the physical lattice.
The process will be explained through the example of mapping a 2 x 2 virtual
lattice into a 3 x 2 physical lattice (Figures 2.5.3 and 2.5.4). The four PEs of
the virtual lattice can be assigned to functional PEs in the physical lattice as
shown in Figure 2.5.5. The 12 switches of the virtual lattice that are
connected to ports (shaded in Figure 2.5.5a) can be assigned as in Figure
2.5.5b. The datapaths between a port and a switch in the virtual lattice
become paths in the physical lattice as shown. The right port of PE A is
separated from its switch by six intervening switches. The complete
mapping is shown in Figure 2.5.6.
Figure 2.5.5 - Example of a Partial Mapping

Figure 2.5.6 - Complete Mapping of the Virtual Lattice
Into the Physical Lattice
c) Optimum Lattice Size
An examination of Table 2.5.3 shows that chip yield approaches one as
the number of redundant PEs increases. Arbitrarily high yield can be
achieved by providing enough extra PEs. However, with more PEs per chip
the area of the chip increases. With larger area, fewer chips can be
fabricated on a single wafer. Since the cost of processing a wafer is
independent of the number of chips it holds, fewer chips per wafer leads to
higher cost per chip. Unless the gain in recovery makes up for the area
increase, redundancy could result in higher chip cost.
What is the level of redundancy that optimizes the number of good
chips per wafer? Consider once again recovering four PEs from a chip.
Using the terminology of recovery analysis, let there be Np PEs per chip. Np
- 4 of these are redundant, and R_{Np-4} is the yield of the fault tolerant chips.
The number of chips per wafer is inversely proportional to the chip area. Since
PEs are of fixed size, area increases linearly with the number of PEs. Hence, the
number of chips per wafer is proportional to 1 / Np. Consequently,
maximizing R_{Np-4} / Np determines the value of Np that also maximizes the
number of good chips per wafer. In fact, 4 R_{Np-4} / Np is the fraction of PEs
on the wafer that are actually used: R_{Np-4} of the chips are good, and on these
good chips, 4 / Np of the PEs are used. 4 R_{Np-4} / Np is the PE utilization.
Table 2.5.4 shows the PE utilization for the recovery of four PEs from a
chip containing Np PEs. With Np = 4, 100% of the PEs on good chips are used
but only R_0 = 26.3% of the chips are good. Adding one redundant PE more
Table 2.5.4 - Optimum Lattice Size for the
Recovery of Four PEs

                   Recovery of 4 PEs
Np = number                               Gain with
of PEs / chip      R (%)      4R / Np     FT (%)
     4             26.3       .263          0.0
     5             56.8       .456         73.3
     6             77.2       .516         95.6
     7             88.5       .504         92.1
     8             91.6       .460         46.0

Table 2.5.5 - Optimum Lattice Size to Maximize
Number of Good Chips Per Wafer

PEs           Optimum
Recovered     Lattice        Redundancy                Gain with
( Nv )        Size ( Np )    ( % )          R (%)      FT (%)
   1              1            0.0          68.6         0.0
   2              3           50.0          60.4        10.3
   3              4           33.3          68.0        44.1
   4              6           50.0          77.2        95.6
   5              8           60.0          78.3       144.8
   6             10           66.7          81.6       215.9
   7             12           71.4          U{.O       301.6
   8             14           75.0          85.7       404.7
than doubles chip yield. There is a 73% (= .456 / .263 - 1) gain in PE
utilization. The increase in chip yield, R_1 - R_0, more than makes up for the
increase in chip area. With two redundant PEs, the gain in utilization rises to
96% (= .516 / .263 - 1). Adding additional redundancy reduces utilization. So Np
= 6 is the optimum number of PEs per chip for maximizing the number of
chips per wafer that contain four good PEs.
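
The search for the optimum can be retraced with a short calculation. The following is a minimal sketch, assuming only the recovery values R listed in Table 2.5.4; the function and variable names are illustrative and are not part of the original analysis. Because the tabulated R values are rounded, the recomputed utilizations may differ from the table in the last digit.

```python
# Recovery probabilities R for extracting four good PEs from a chip with
# Np PEs, copied from Table 2.5.4 (given there as percentages).
RECOVERY = {4: 0.263, 5: 0.568, 6: 0.772, 7: 0.885, 8: 0.916}
NEEDED = 4  # PEs that must be recovered from each chip


def pe_utilization(np_per_chip, recovery):
    """Fraction of all PEs on the wafer that end up in use: 4 * R / Np."""
    return NEEDED * recovery / np_per_chip


def utilization_table(recovery_by_np):
    """Return {Np: (utilization, gain over the chip with no redundancy)}."""
    base = pe_utilization(NEEDED, recovery_by_np[NEEDED])
    table = {}
    for np_per_chip, r in sorted(recovery_by_np.items()):
        u = pe_utilization(np_per_chip, r)
        table[np_per_chip] = (u, u / base - 1.0)
    return table


def optimum_lattice_size(recovery_by_np):
    """Np that maximizes PE utilization (and so good chips per wafer)."""
    return max(recovery_by_np,
               key=lambda n: pe_utilization(n, recovery_by_np[n]))


if __name__ == "__main__":
    for n, (u, gain) in utilization_table(RECOVERY).items():
        print(f"Np={n}  4R/Np={u:.3f}  gain={100 * gain:.1f}%")
    print("optimum Np:", optimum_lattice_size(RECOVERY))
```

Maximizing 4R / Np rather than R alone is what penalizes the area cost of each extra PE; with these inputs the maximum falls at Np = 6.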
Why is six the optimum lattice size? The optimum is reached when the
gain in recovery is exactly counterbalanced by the area increase of the chip.
Examining Figure 2.5.2 it can be seen that six PEs is at the knee of the
curve. Beyond this point the slope of the curve is less than one; the
marginal increase in the recovery probability is less than 0.1 for each
additional redundant PE. Before this point the slope exceeds one; additional
redundancy increases recovery by more than 0.1.
How many more good chips per wafer are there? It will be shown
(Chapter 5) that a standard PE occupies a 1.75 mm x 1.75 mm region of
silicon. A chip containing four PEs is therefore of size 3.5 mm x 3.5 mm.
(This estimate ignores the area occupied by bonding pads and their drivers.)
The number of square chips with edge length e that can be packed onto a
circular wafer of diameter D is [Phis79]
1. 77 12...
e
A 4" wafer can hold 64? four PE chips. At 26.3% yield a wafer has 170 good
chips. A six PE chip has 50% more area. Assume that it occupies a square
with edge 3.5 x sqrt(1.5) = 4.29 mm. A 4" wafer holds only 399 of these larger chips. But






wafer. Thus redundancy has resulted in an additional 308 - 170 = 138 good
chips per wafer - an 81% increase. The fixed cost of processing a wafer is
divided between more chips. In short,
• redundancy can substantially decrease the manufacturing cost of chips
containing several processing elements.
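
The arithmetic behind this comparison can be checked directly. The sketch below uses only figures quoted in the text (399 six PE chips per wafer, 77.2% recovery, 170 good four PE chips); the helper names are hypothetical, and the packing formula from [Phis79] is deliberately not reproduced.

```python
from math import sqrt

PE_EDGE_MM = 1.75                # edge of one standard PE (Chapter 5)
FOUR_PE_EDGE_MM = 2 * PE_EDGE_MM  # 2 x 2 PEs give a 3.5 mm square chip


def edge_for_area_ratio(base_edge_mm, area_ratio):
    """Edge of a square chip whose area is area_ratio times the base chip."""
    return base_edge_mm * sqrt(area_ratio)


def good_chips(chips_per_wafer, chip_yield):
    """Expected number of working chips on one wafer, rounded."""
    return round(chips_per_wafer * chip_yield)


six_pe_edge = edge_for_area_ratio(FOUR_PE_EDGE_MM, 6 / 4)  # about 4.29 mm
good_without_ft = 170                # four PE chips, as stated in the text
good_with_ft = good_chips(399, 0.772)  # six PE chips with two spares
extra = good_with_ft - good_without_ft
relative_gain = extra / good_without_ft

if __name__ == "__main__":
    print(f"six PE chip edge: {six_pe_edge:.2f} mm")
    print(f"good chips with redundancy: {good_with_ft}")
    print(f"additional good chips: {extra} ({100 * relative_gain:.0f}% more)")
```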
The optimum lattice size for recovering Nv PEs per chip with Nv ranging
from one to eight is shown in Table 2.5.5 and Figure 2.5.7. In every case
except for Nv = 1, redundancy can increase the PE utilization and
subsequently reduce cost. The gains in utilization increase with Nv. This is
because the baseline for the comparison (no fault tolerance) is a standard
yield curve. As shown earlier, yield decreases rapidly as a function of area
(Figure 2.2.1). So as Nv increases, the baseline utilization drops sharply.
Additionally, the percentage redundancy required at the optimum
lattice size increases as a function of Nv. With lattices occupying a large
area, a higher fraction of the PEs must be redundant. With large lattices,
there is a decline in the marginal increase in redundancy of each extra PE
added. More redundant PEs are required to provide the same level of
protection against defects.
d) Design Analysis
By combining the yield model with recovery analysis, the
interrelationships between PE size, lattice dimensions, redundancy and yield
are known. Tradeoffs between these quantities can be assessed. Since the
manufacturing cost of a chip depends on its yield, these results show how






























Figure 2.5.7 - Optimum Lattice Size to Maximize
Number of Good Chips Per Wafer
(Horizontal Axis: Number of Virtual PEs)
In the previous sections, the effect of redundancy on yield was studied.
However, the methodology of the yield model and recovery analysis can be
used to investigate a wide variety of design tradeoffs. The primary
advantage of this methodology is that it provides quantitative analysis. We
consider one example below.
The state of the art of integrated circuit manufacturing is not static.
The dimensions of individual devices continue to shrink. Given a design of a
parallel processor which is constructed from chips containing several PEs,
what is the effect of advances in technology on the machine? How will the
yields of the individual chips improve? How much redundancy is required
with smaller PEs? Figures 2.5.8 and 2.5.9 display the recovery probabilities
for device area scaled by a factor of one half and one quarter respectively.
We assume the same standard PE is produced only at doubled and
quadrupled density.
Let us reconsider the example proposed in section A - manufacturing a
chip with four good PEs at 75% yield. With device area shrunk by a factor of
two, only one instead of two redundant PEs is required. The recovery of
four good PEs from a set of six jumps from 75% to 95%. With quadrupled
density, no redundancy is required. The yield of a chip containing four














































Figure 2.5.8 - Effect of Scaling on Recovery
(Horizontal Axis: Np = Number of PEs)

Figure 2.5.9 - Effect of Scaling on Recovery
( Scale Factor = 0.25 )
Instead of exploiting the increase in density to manufacture the same
design more economically, it can also be used to produce a more powerful
machine at the same cost. For example, with doubled density, nine PEs per
chip can be fabricated with about the same yield as four PEs per chip at the
previous density. Assuming pinout constraints are satisfied, the lattice
dimensions can be increased by a factor of 2.25 without increasing the
number of chips in the machine and for approximately the same cost.
This methodology can be used to investigate many other tradeoffs in
the design of a parallel processor. The effect of technological advances is
but one such example. Many design decisions reflect themselves in terms of






In this chapter, we return to the problem of designing a wafer scale
CHiP processor. The goal is to fabricate a large-scale parallel processor on a
single wafer of silicon. There are many problems to be considered in the
design of such a system: processing element design, testing, PE to PE
communication, power consumption, etc. In this section, we consider the
problem of structuring a wafer containing individual switches and
processing elements into a CHiP processor.
As shown in Chapter 1, structuring is the key problem in the
implementation of any wafer scale system. Since the semiconductor
manufacturing process is imperfect, each wafer contains many defective
PEs and some defective switches. These must be bypassed so their presence
is masked. Only the good processing elements and switches are connected
together. Furthermore, the good components must be connected to form a
CHiP lattice. The structured wafer emulates a smaller but fully functional
CHiP lattice.
This chapter synthesizes previously presented ideas of wafer
structuring by column exclusion (Chapter 1) and of fault tolerant CHiP
modules (Chapter 2). A two level decomposition of the structuring problem
is proposed. The basic idea is to divide the wafer into a number of separate
building blocks. Each building block contains sufficiently many redundant
components to insure that a smaller functional lattice exists within almost
every block. Virtually every block on the wafer will contribute a small
subpart to the overall structure; the blocks have high yield. In addition, the
switch lattice of the blocks provides a substantial amount of wiring
bandwidth through the block. A very large number of independent wiring
paths can pass through from one side of the block to the other.
Recall that the column exclusion strategy for structuring has two
requirements: high yield and wire around capability. Redundancy within the
building block insures high yield, and the switch lattice of the building block
provides the wire around capability. As a result, building blocks are
suitable for the column exclusion strategy for wafer structuring. This
makes CHiP machines a natural choice for wafer scale implementation.
Before explaining the two level decomposition further, the structuring
problem and its global solution are examined. This will provide the
motivation for the decomposition of the wafer into building blocks.
1. The Structuring Problem
We are given a wafer with a very large lattice patterned on it. Due to
circuit defects, every wafer will contain both faulty PEs and faulty switches.
It is assumed that the yield model and recovery analysis of Chapter 2 apply
to the lattice, and that the lattice has been completely tested. (This is a
difficult problem by itself. It is considered in detail in Chapter 6.) The
status, good/bad, of every component in the lattice is known. All functional
components have been found, and no dysfunctional components have been
incorrectly identified as good.
The goal is to structure the wafer so it behaves as a smaller but fully
functional lattice. The switch lattice is used to bypass faulty components. An
observer of the input/output behavior of the structured wafer can not
detect the presence, number or location of the faults. Additionally, the
wafer is structured so that it emulates a virtual lattice (see Chapter 2). The
behavior of the structured wafer and the virtual lattice are identical.
For example, Figure 3.1.1 shows one method of structuring a wafer. For
simplicity the switches are not shown. The wafer contains a lattice of
dimension 6 PEs by 5 PEs with ten of the PEs defective. A 4 x 4 virtual
lattice (Figure 3.1.2) is mapped onto the wafer. The numbering of the PEs
shows the correspondence between elements of the structured wafer and
the virtual lattice. The logical structure of the virtual lattice and the
structured wafer are the same since their components are connected in
identical topologies. The structured wafer could be used in place of the
virtual lattice or vice versa.
There are two subproblems to the structuring problem. The first is to
specify the lattice structure that is patterned on the wafer. Second, an
algorithm for structuring the wafer into a fault-free virtual lattice must be
specified.
The designer has complete freedom in choosing the lattice parameters:
PE and switch redundancy, corridor width, switch degree, crossover
capability, datapath width, etc. As in the fault tolerant CHiP modules
previously discussed (Chapter 2), increased wiring bandwidth must be
Figure 3.1.1 - Example of a Structured Wafer

Figure 3.1.2 - 4 x 4 Virtual Lattice Which Is
Functionally Equivalent to the Structured Wafer
provided to route around faulty components. This additional wiring
capability can be implemented with a combination of extra switch corridors,
additional crossover capability and increased switch degree. The goal is to
provide sufficient additional wiring bandwidth to be able to replace faulty
components and also to route around the defects.
The flexibility gained by the additional Wiring bandwidth within the
lattice is not without its cost. Extra switches or additional switch complexity
are overhead that is required for fault tolerant reconfiguration. This
overhead consumes wafer area which could be occupied by processing
elements. Perhaps more importantly, it also adversely affects performance
by increasing the number of switching levels between PEs. Every extra
switch a signal must traverse introduces additional impedance and
capacitance. This increases the time of flight of the signal and reduces the
speed with which PEs can communicate. Consequently, one design objective
is to minimize switching overhead while still insuring the reconfigurability of
the wafer in the presence of faults. The choice of lattice parameters will be
deferred until Chapter 4 on "Building Block Design." This chapter
concentrates on the second goal.
An algorithm must be specified for performing the structuring. The
input to the algorithm is the status, good/bad, of all the components on the
wafer. The algorithm must compute all switch settings necessary to
structure the wafer into a CHiP processor (i.e. the virtual lattice). There
are two aspects to this problem: virtual lattice selection and mapping the
virtual lattice onto the wafer. Given a wafer (with faults, of course), the






After choosing the virtual lattice size, it must be mapped onto the wafer (see
Chapter 2): the virtual switches and PEs are associated with their
counterparts on the wafer. and the datapaths of the virtual lattice are
mapped into paths of switches. First, consider perhaps the simplest
algorithm for structuring the wafer.
2. Global Strategy
In the global strategy, the wafer is considered to be a single, continuous
lattice. The choice of a virtual lattice and the mapping problem are applied
to the wafer as a whole. Thus the name of the approach - the algorithms are
applied globally to the entire wafer. From the wafer, a single large virtual
lattice is extracted, and it is mapped onto the entire wafer surface. The
virtual lattice is mapped onto the wafer just as in the fault tolerant CHiP
modules (Chapter 2). Figure 3.1.1 depicts an example of a global
structuring.
Several problems are encountered with this approach. First, two logical
neighbors in the virtual lattice are not necessarily in nearby locations on the
wafer. They may be separated by long distances. This results in very long
paths between PEs. Figure 3.1.1 depicts an example of this for a small
lattice. A path between PEs, instead of going to an adjacent neighbor, may
have to route around several intervening PEs. With the much larger lattices
(e.g. 30 PEs by 30 PEs) that can be fabricated with current technology on a
4" wafer, very long path lengths can result. This causes serious signal
propagation delays. Furthermore, due to the pipelined nature of the
computations performed, a CHiP machine is no faster than its slowest link.




As a result, it is desirable to minimize the maximum path length in a
mapping. This is difficult in general to achieve for two reasons. First, the
mapping problem for the whole wafer is by itself computationally difficult.
Attempting a simultaneous minimization over all possible mappings is not
practical. Second, even if a minimax path length mapping is obtained, there
is no guarantee that it will be acceptably short. The minimax path length
for the global structuring may be so long that it seriously impairs machine
performance. A global solution to the structuring problem may inherently
lead to unacceptably long path lengths.
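
To make the minimax criterion concrete, the sketch below scores a candidate mapping by the largest separation between logically adjacent PEs. It is an illustration only: the dictionary representation of a mapping is hypothetical, and Manhattan distance on the physical grid is used as a stand-in for path length. A real path through the switch lattice may be longer, since it must also route around intervening components, so this score is a lower bound.

```python
def max_neighbor_distance(mapping, rows, cols):
    """Largest physical separation between logically adjacent virtual PEs.

    mapping[(r, c)] = (pr, pc) gives the physical site assigned to the
    virtual PE in row r, column c of a rows x cols virtual lattice.
    """
    worst = 0
    for r in range(rows):
        for c in range(cols):
            # Checking east and south neighbors visits every edge once.
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols:
                    pr, pc = mapping[(r, c)]
                    qr, qc = mapping[(nr, nc)]
                    worst = max(worst, abs(pr - qr) + abs(pc - qc))
    return worst


# A 2 x 2 virtual lattice placed on a fault free region: neighbors adjacent.
ideal = {(0, 0): (0, 0), (0, 1): (0, 1), (1, 0): (1, 0), (1, 1): (1, 1)}

# The same lattice when physical site (0, 1) is faulty and virtual PE (0, 1)
# is displaced one column to the right.
displaced = {(0, 0): (0, 0), (0, 1): (0, 2), (1, 0): (1, 0), (1, 1): (1, 1)}

if __name__ == "__main__":
    print(max_neighbor_distance(ideal, 2, 2))
    print(max_neighbor_distance(displaced, 2, 2))
```

Comparing candidate mappings by this score is cheap; the difficulty described above lies in the number of candidate mappings, not in evaluating any one of them.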
Second, given the selection of a virtual lattice, consider the problem of
mapping the virtual lattice onto the wafer. The number of possibilities for
the mapping between the virtual lattice and the lattice patterned on the wafer
grows exponentially with the total number of components. Since a wafer can
hold a very large lattice, exhaustive search techniques for finding a mapping
are not practical.
The mapping problem is an instance of the subgraph homeomorphism
problem [Gare79, LaPa78a, LaPa78b]. No polynomial time algorithm is known
for the mapping problem. Furthermore, the global strategy gives rise to a
very large instance of the mapping problem. A 30 PE by 30 PE double
corridor lattice (which is feasible to fabricate on a single wafer - see
Chapter 5) contains over 20,000 switches and PEs. Even a polynomial time
algorithm may not be computationally tractable on problem instances of
this magnitude.
In summary, the global approach leads to a computationally intractable




resulting CHiP processor. What is needed is a means of reducing the size of
the mapping problem and placing a limit on the minimax path length of any
mapping. In the following section, a divide and conquer approach, the two
level decomposition, is proposed which achieves these objectives.
3. Two Level Decomposition
Rather than trying to structure the wafer as a whole, the idea of the two
level decomposition is to divide the wafer into logical pieces. A virtual lattice
is mapped into each of these pieces. and the individual solutions are
composed to form a larger CHiP lattice. The organization of the wafer is
divided into two components: the individual pieces and their composition
which forms the wafer scale CHiP processor. There is a two level hierarchy
within the processor - the individual pieces are the components out of which
the wafer scale machine is built. This division of the problem into small
pieces leads to a computationally tractable divide and conquer approach to
the structuring problem.
Each of the individual pieces is a building block of the wafer scale
machine. From each block we will extract a lattice of fixed size. For the
blocks proposed in the following chapter, a 2 x 2 lattice is extracted. This
eliminates the problem of choosing the dimensions of the virtual lattice (at
the cost of sometimes underutilizing the good components of the block). All
blocks yield the same size lattice regardless of how many functional PEs and
switches they contain. More importantly, the uniformity of the virtual lattice
size makes it easy to compose the individual lattices. Each block
contributes a fixed size piece to the overall machine. Each of the pieces
connects to its four neighbors in a simple and regular manner (Figure 3.3.1).
In contrast, if blocks contribute virtual lattices of different sizes (see Figure
3.3.2), this introduces difficult problems of matching the pieces. Simplicity
is a key to success.
Figure 3.3.3 depicts an example of structuring with a two level
hierarchy. The faulty or simply unused processing elements are marked
with Xs. A 6 x 4 lattice is patterned on the wafer. (For simplicity, switches
are not shown. The structuring of the switches is performed similarly to the
structuring of the PEs.) In the first level of the hierarchy, the wafer is
divided into four building blocks each containing a 3 x 2 lattice. A 2 x 2
virtual lattice is mapped into each of these blocks. The individual 2 x 2
lattices are in turn connected together to form a 4 x 4 array of processors
on the wafer surface. The structured wafer is functionally equivalent to the
4 x 4 lattice in Figure 3.1.2.
In this particular example, no building block has more than two faulty
processing elements so a virtual lattice can be mapped into every block. In
practice, some blocks may not contain enough functional components to
host a virtual lattice - the block is considered faulty. The random nature of
defects makes it impossible to completely safeguard against this possibility.
The column exclusion strategy is used to deal with faulty blocks. Wherever a
faulty block occurs, the entire column (or row) containing that block is
excluded. In order to efficiently implement column exclusion, blocks must
have high yield and wire around capability. These problems are discussed in
Chapter 4 on Building Block Design.
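
The exclusion step itself is simple to state in code. The following is a minimal sketch, assuming the status of each block is already known and considering columns only (rows can be handled by transposing the grid); the function names and the boolean grid representation are illustrative, not taken from the original design.

```python
def surviving_columns(block_ok):
    """Indices of columns in which every block hosts a virtual lattice.

    block_ok[r][c] is True when block (r, c) is functional, False when
    it is faulty.
    """
    n_cols = len(block_ok[0])
    return [c for c in range(n_cols) if all(row[c] for row in block_ok)]


def exclude_columns(block_ok):
    """Grid that remains after dropping every column with a faulty block."""
    keep = surviving_columns(block_ok)
    return [[row[c] for c in keep] for row in block_ok]


# A 3 x 4 grid of blocks with a single faulty block in column 2.
grid = [
    [True, True, True,  True],
    [True, True, False, True],
    [True, True, True,  True],
]

if __name__ == "__main__":
    print(surviving_columns(grid))
    print(len(exclude_columns(grid)), "x", len(exclude_columns(grid)[0]))
```

The sketch shows why one bad block is costly: the two good blocks sharing its column are discarded along with it.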
The advantages of the two level decomposition are twofold. First, a bound






Figure 3.3.1 - Composition of Lattices of Identical Size

Figure 3.3.2 - Composition of Lattices of Nonuniform Size

Figure 3.3.3 - Structuring With the Two Level Hierarchy
performed on the individual blocks are contained totally within the block.
Any two PEs in the virtual lattice mapped into a block are connected by a
path which does not go outside the block. This limits the maximum length
of any path and establishes an upper bound on the processor to processor
communication time.
Second, the problem of structuring the wafer is made computationally
tractable. The one very large instance of the mapping problem that is
generated by the global strategy is divided into many small instances. Each of
the building blocks is small, and the virtual lattice can be mapped onto it by
brute force methods. Since the same size virtual lattice is mapped into each
block, individual solutions are easily composed. In short, the structuring
problem is made computationally tractable by a divide and conquer
approach.
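
The first level of the decomposition can be sketched as follows, using the dimensions of the Figure 3.3.3 example (a 6 x 4 lattice of PEs cut into four 3 x 2 blocks, each contributing a 2 x 2 virtual lattice). This is a simplified illustration under stated assumptions: only PEs are considered, a block is taken to be 3 PEs wide and 2 PEs high, and choosing the first four good PEs stands in for the brute force search over mappings. All names are hypothetical.

```python
BLOCK_ROWS, BLOCK_COLS = 2, 3  # assumed orientation of a 3 x 2 block
NEEDED = 4                     # PEs in the 2 x 2 virtual lattice per block


def split_into_blocks(pe_ok):
    """Yield ((block_row, block_col), good PE coordinates) for each block."""
    for br in range(0, len(pe_ok), BLOCK_ROWS):
        for bc in range(0, len(pe_ok[0]), BLOCK_COLS):
            good = [(r, c)
                    for r in range(br, br + BLOCK_ROWS)
                    for c in range(bc, bc + BLOCK_COLS)
                    if pe_ok[r][c]]
            yield (br // BLOCK_ROWS, bc // BLOCK_COLS), good


def structure_wafer(pe_ok):
    """Map each block to the PEs chosen for its lattice, or None if faulty."""
    return {blk: (good[:NEEDED] if len(good) >= NEEDED else None)
            for blk, good in split_into_blocks(pe_ok)}


# 4 rows x 6 columns of PEs; False marks a defective PE.
wafer = [
    [True,  True,  False, True,  True,  True],
    [True,  False, True,  True,  False, True],
    [False, True,  True,  True,  True,  True],
    [True,  True,  True,  False, False, False],
]

if __name__ == "__main__":
    for block, chosen in sorted(structure_wafer(wafer).items()):
        print(block, chosen if chosen else "faulty block")
```

Each block is examined independently of the others, so the work grows linearly with the number of blocks rather than with the number of mappings of the whole wafer.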
The primary disadvantage of the two level decomposition is that fewer
good PEs are used than in the global strategy. By extracting a fixed size
lattice from each block there will be functional but unused PEs on the wafer.
Many of the blocks on the wafer will have more good PEs than are used in
the virtual lattice. These extra PEs will not be utilized now. Additionally, no
PEs in the excluded columns are used.
Area is clearly sacrificed in the two level hierarchy. But the commodity
in greatest supply in a wafer scale system is area. The two level hierarchy
trades area for performance and simplicity of structuring.
Additionally, the good but unused PEs can be held in reserve for future
use. During the lifetime of the wafer scale CHiP processor, if a PE fails, an





requires only a local modification to the affected building block. Thus even






This section considers the design of a building block of a wafer scale
processor. A building block implements the first level of the two level
hierarchy. Each functional block is configured into a virtual lattice. This
mapping is performed as with fault tolerant CHiP modules (see section
2.5b). The wafer has patterned on it a grid of blocks typically 8 x 8 to 10 x
10 in size which is structured by column exclusion - wherever there is a
faulty block, the entire column containing that block is excluded from the
grid. To be practical, the column exclusion strategy has two requirements:
high block yield and the capability to wire around unused columns of blocks.
These requirements are examined in detail and a quantitative evaluation is
made.
Several important design choices must be made for building blocks. In
order to provide high block yield necessary for column exclusion, fault
tolerance is an essential characteristic of the building block. The amount of
redundancy within a block is one of the major design choices, and it is
dependent on the yield of the individual processing elements. Since yield is
directly related to area, the size of the CHiP processing elements must be




processors, systolic algorithms, dictates the minimum functional
requirements of a processing element. From this, a high level floor plan of a
processing element is proposed. The floor plan combined with the sizes of
individual register, ALU and control cells gives a rough estimate of the area
of the processing element without actually designing the PE in detail.
Once the area of a PE is known, our previously developed technique of
recovery analysis is used to determine the lattice dimensions of a building
block. After a similar consideration of switch design and estimating switch
yield, a fault tolerant switch lattice for the building block is designed.
1. Block Requirements
a) Block Yield
With the column exclusion strategy, every faulty block causes the loss
of an entire column of blocks. There is a multiplier effect associated with
faulty blocks. (Once again, a faulty block does not have to be completely
dysfunctional, but it is a block which, due to faults, does not contain an
embedded virtual lattice.) As a result, very few bad blocks can be allowed.
Otherwise a large percentage of the wafer will be unused.
What is the required block yield? To estimate this, assume a wafer
contains an 8 x 8 grid of blocks. (In Chapter 5 on the Wafer Scale CHiP
Processor, it will be shown that this is a reasonable and somewhat
conservative grid size.) For any given block yield, p, we can compute the
probability distribution of the number of faulty blocks in the 64 block grid.
Since defects on the wafer are randomly distributed, the events that
individual blocks are good are independent. The status of a block is
either functional or faulty, so the number of faulty
blocks is a binomial random variable. Pr(F = i), the probability of exactly i of
the 64 blocks being faulty, is

    Pr(F = i) = \binom{64}{i} (1 - p)^{i} p^{64 - i}

To estimate the number of blocks left after column exclusion, we
assume that i faulty blocks eliminate i columns (or rows) from the grid.
(Note that it is certainly possible for two or more defective blocks to fall in
the same column. This results in only one column, not two, being eliminated.
This more detailed analysis of column exclusion is found in Chapter 5. It
differs from the following estimate by only about 5%.)
Table 4.1.1 shows the results of this analysis for different block yields.
Because of the multiplying effect of faulty blocks, the grid size obtainable is
highly sensitive to the block yield. Even if 95% of the blocks are good, this
still results in the loss of a large portion of the wafer; over 40% of the wafers
use less than two thirds of the grid. Even with 97% block yield, 25% of the
wafers will use only about two thirds of the blocks, and only 14% of the time
will all the blocks be functional. This shows that even a small percentage of
defective blocks causes a large reduction in the size of the grid after column
exclusion.
Block yields of 98% and 99% show significant improvement. They are
compared in more detail in Table 4.1.2. With 99% yield, over half of the
wafers are fully functional, and with 98% yield over one quarter have no bad
blocks. The expected number of usable blocks is 54.0 for 98% yield and 59.1
for 99%. This relatively small difference results from the fact that with 99%
Table 4.1.1 - Effect of Block Yield on Grid Size
( Worst Case )

number of       resulting                 block yield
faulty blocks   grid size     0.95      0.97      0.98      0.99
     0            8 x 8       .0375     .142      .274      .526
     1            8 x 7       .126      .287      .358      .340
     2            7 x 7       .210      .275      .230      .108
     3            7 x 6       .228      .175      .0972     .0226
     4            6 x 6       .183      .0828     .0303     .0040

Table 4.1.2 - Comparison of 98% and 99% Block Yield

block yield = 0.98      block yield = 0.99
         cumulative              cumulative     resulting    % of
prob     prob           prob     prob           grid size    grid used
.274     .274           .526     .526             8 x 8      100%
.358     .633           .340     .865             8 x 7      87.5%
.230     .863           .108     .973             7 x 7      76.6%
.0972    .961           .0226    .996             7 x 6      65.6%

yield, few lattices are smaller than 7 x 7. As a result, the 98% case receives
a much larger contribution to its expected value from the 7 x 7 and 7 x 6
grids. This makes up for its smaller contributions from the 8 x 8 and 8 x 7
grids.
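
The entries of Table 4.1.1 and the expected grid sizes can be recomputed from the binomial model. The sketch below assumes, as in the worst case estimate above, that each faulty block removes one row or column and that rows and columns are removed alternately (8 x 8, 8 x 7, 7 x 7, 7 x 6, ...). Function names are illustrative. The expected values it produces come out near 54 and 59 blocks, in rough agreement with the figures quoted; small differences may reflect rounding or truncation in the original calculation.

```python
from math import comb

GRID = 8  # blocks along each side of the wafer grid


def prob_faulty(i, p, n=GRID * GRID):
    """Binomial probability that exactly i of n blocks are faulty."""
    return comb(n, i) * (1 - p) ** i * p ** (n - i)


def grid_after_exclusion(i, side=GRID):
    """Worst case grid left when i faulty blocks each remove one line."""
    rows = max(0, side - i // 2)
    cols = max(0, side - (i + 1) // 2)
    return rows, cols


def expected_usable_blocks(p, side=GRID):
    """Expected number of blocks remaining after column exclusion."""
    n = side * side
    total = 0.0
    for i in range(n + 1):
        rows, cols = grid_after_exclusion(i, side)
        total += prob_faulty(i, p, n) * rows * cols
    return total


if __name__ == "__main__":
    for p in (0.95, 0.97, 0.98, 0.99):
        row = "  ".join(f"{prob_faulty(i, p):.4f}" for i in range(5))
        print(f"p={p:.2f}  Pr(F=0..4): {row}")
    for p in (0.98, 0.99):
        print(f"p={p:.2f}  expected usable blocks: "
              f"{expected_usable_blocks(p):.1f}")
```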
Although the expected number of usable blocks is similar, there are
twice as many completely functional wafers with 99% yield as with 98%
yield. This is important since a fully functional wafer enjoys a substantial
performance gain over wafers with one or more faults. Excluding a column
introduces a performance penalty. When a column is excluded, the two
adjoining columns must be connected together. The length of wire (and the
number of intervening switching levels) to implement this connection is
substantially longer than if the columns are adjacent. The connecting wires
must traverse at least the entire width of a column whereas adjacent
columns are separated by very short distances. This longer wire length
increases the signal propagation time. Inter-PE communication speed is
decreased, and system speed goes down. Consequently, it is desirable to
have wafers with no faulty blocks even though redundancy must be
increased to achieve the higher block yield. To achieve this
• 99.0% or better yield is required for the building block.
b) Wire Around Capability
When a column is excluded, the two adjacent columns must be
connected together. To accomplish this, the switches and datapaths in the
unused blocks are used to make the required connections. The PEs in the






to connect together the two adjacent columns. Thus the "wire around"
requirement becomes a "wire through" capability via the CHiP switch lattice.
Figure 4.1.1 depicts an example of wire through.
If each block emulates an N x N virtual lattice with corridor width w, wN
+ 1 connections must be made. Each one of these requires a path from one
side of the block to the opposite side. Since either rows or columns may be
eliminated, any block must be able to provide the needed paths between
both its East and West sides and between its North and South sides. Figure
4.1.1 shows the five connections that must be made for a 2 x 2 single
corridor lattice.
Switches and datapaths are subject to failure just as processing
elements are. Switch redundancy within each block is required so that wire
through can be implemented despite the presence of faulty switches.
Determining the degree of redundancy required is one of the building block
design decisions that will be considered later.
2. Processing Element Design
The goal of the research in CHiP architectures is to investigate
problems in parallel computation such as parallel programming,
interprocessor communication, testing of concurrent systems, etc. CHiP
machines are an assembly of many conventional microprocessors. Each is a
von Neumann machine sequentially executing instructions dictated by the
contents of its program counter. The substantial body of knowledge and
design experience with such machines is built upon by using conventional











Figure 4.1.1 - Example of Wire Through in a Building Block
processing elements are largely treated as "black boxes." We are not
concerned with details of the inner workings of the processing elements, nor
do we want to design a processing element - this has been done many times
by others.
However, knowledge of the area occupied by a processing element is
essential to the quantitative analysis of the implementation of wafer scale
machines. Fault tolerance is a necessity in a wafer scale system. It is
achieved through redundancy, and the degree of redundancy required
depends on the yield of the processing elements. Yield and area are closely
linked.
Area estimation involves us in the design of processing elements. It is
impossible to know the exact area of a processing element without
specifying all the design details of the machine. Choice of word length,
instruction set, control structure, etc. have a profound effect on the area
occupied by the machine. However, the design of a processing element is a
complex and lengthy task. Since the design of conventional and simple
processors is a well explored topic, we will not repeat it. To circumvent
this, our goal is to estimate the area without producing a complete and
detailed design of a specific processor. This will be done in four steps:
1) Analyze the functional requirements of the processing component of
a CHiP processor. The intended applications of the machine determine
the capabilities the machine must provide.
2) Determine the major architectural features. Very high level design
decisions such as word length and memory size determine the gross
characteristics of the processing element.
3) Sketch the layout of the processing element. A simple schematic
floor plan showing the major elements of the implementation of the
processing element such as control logic, memory, registers and ALU is
proposed. Details of the implementation of the major blocks and their
interconnection are not covered.
4) Determine the size of the primitive cells. Each of the subsections of
the floor plan is composed of basic cells such as memory bits, a bit slice
of the ALU, PLA term, etc. The dimensions of these primitive cells can
be closely estimated from a previous design project by the author
[Hedl81a] and from published reports on processor implementation
[Fitz81].
Combining the floor plan and the dimensions of the individual cells, the area
of the major blocks of the PE can be closely estimated. Adding to the size of
the components an estimate of the wiring area required for their
interconnection, the total PE area can be estimated.
a) Functional Requirements
The intended applications of CHiP processors determine the
computational requirements of the individual processing elements. For
example, the granularity of parallelism of the applications is a primary
determinant of the processing element's required memory capacity. If a
relatively large computation is performed by each processing element,
there must be substantial memory to hold the object code of the
computation and store the intermediate results. Similarly, if there are only






must be fast (and therefore complex) in order for the entire assembly to
have high throughput.
CHiP processors are capable of implementing a wide variety of
applications: database operations [Hsia82], signal processing [Snyd82b],
dataflow programs [Cuny82], and numerical applications [Gann81] are
among the problems suitable for processing by the CHiP family of
architectures. A major application of CHiP machines is the execution of
systolic algorithms [Snyd82a]. Systolic algorithms implement the control
structure of an algorithm primarily through the topology of the processing
element array and the synchronization of the processors. As a result,
different systolic algorithms require different interconnection patterns of
processors. The switch lattice of CHiP machines provides the
interconnection flexibility required for a processor array to reconfigure into
a wide variety of different topologies. Additionally, many of the algorithms
for the above application areas are systolic in nature. Systolic computation
is fundamental to CHiP machines.
The basic characteristics of systolic algorithms are [Kung79, Kung82,
Mead80]:
• simple and regular pattern in the flow of data and control signals
• highly pipelined computation
• only a small operation is performed at each computational site. This is
consistent with the pipelined nature of the computation. Each stage in




• the input data, intermediate results and output values are continuously
and rhythmically passed from one computational site to another. This is
the source of the term "systolic." There is a regular pumping of data
through the processor in a manner analogous to the pumping of blood
in a living organism. Data circulates rather than being stored in a
central memory.
An example of a systolic algorithm is matrix multiplication performed
on a hexagonal array of processors (example from [Mead80]; algorithm due
to Kung et al.). The problem is that of multiplying two n x n matrices with
bandwidth w (see Figure 4.2.1). The elements in the bands of the matrices
A, B and C move through the network in three directions simultaneously.
Each element of C is initialized to zero. Every processor performs an inner
product step multiplying the incoming values of A and B and adding the
result to the incoming C value. A careful study of the flow of data and its
timing will convince the reader that each c_ij is able to accumulate all its
terms before it leaves the processor through the upper boundary (see
[Mead80] for a more complete discussion). The following observations about
the algorithm influence the design of processing elements to execute the
algorithm:
• Each processing element performs one addition and one multiplication
(and, of course, any read / write operations required to transfer the
operands). Thus the program of each processing element is very short
and simple.
• Only three data values are stored in a processing element at any one










Figure 4.2.1 - Systolic Algorithm for Band Matrix Multiplication (from [Mead80])
each individual processing element stores only a few values. This
exemplifies the principle of processing power through the collective
action of many simple components rather than a few complex devices.
• High throughput is achieved through parallelism. A large number of
processing elements are concurrently active. It is not necessary for
each of the individual units to be fast in order for the entire assembly
to achieve a high processing rate. Once again, strength through
numbers.
• The computation is highly pipelined. As a single value of C passes
through the array, it accumulates more and more product terms. By
the time it reaches the upper boundary of the processor, the correct
value has been accumulated. Pipelining especially in combination with
large scale parallelism favors simple computational elements with
modest speed.
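The inner product step performed at each computational site is small enough to write out directly. The sketch below is illustrative only: it models the arithmetic of one site and the accumulation of a single C value as it passes through a chain of sites, but it does not model the hexagonal topology, the band structure, or the timing of the data streams described in [Mead80]. The function names are assumptions, not names from the report.

```python
def inner_product_step(a, b, c):
    """One computational site: pass a and b on unchanged, add their product to c."""
    return a, b, c + a * b


def accumulate_c(a_values, b_values):
    """Pass one C value, initialized to zero, through a chain of sites.

    Each site meets one (a, b) pair, so the value leaving the last site
    is the sum of the products of the paired operands.
    """
    c = 0
    for a, b in zip(a_values, b_values):
        _, _, c = inner_product_step(a, b, c)
    return c


def matmul_by_steps(A, B):
    """Dense n x n product built only from inner product steps (no timing model)."""
    n = len(A)
    return [
        [accumulate_c(A[i], [B[k][j] for k in range(n)]) for j in range(n)]
        for i in range(n)
    ]
```

Each site holds only three values at a time (a, b and c), which is the property the observations above rely on.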
A large body of systolic algorithms for a wide variety of problems has
been developed in recent years. Algorithms exist for pattern matching in a
string, LU decomposition, transitive closure, minimum spanning tree,
dynamic programming, etc. (see [Kung82] for a comprehensive
bibliography). All systolic algorithms exhibit the above general
characteristics.
b) Processor Characteristics
What are the implications of the characteristics of systolic algorithms
for the design of the processing element? The following basic architectural
features are proposed as being well suited to the implementation of systolic
algorithms:
1) Simple arithmetic oriented instruction set. The computational sites
in systolic algorithms in general do not perform long, complex
sequences of operations. Furthermore, many of the control operations
of the algorithms are implicit in the topology and synchronization of
the processing elements. This reduces the need for complex condition
codes and branching instructions. Furthermore, a simple streamlined
instruction set is consistent with an increasingly popular trend towards
simplified machine architectures. A very small number of different
instructions account for a very high percentage of instructions
executed. These commonly used instructions typically perform simple
operations. This phenomenon has been observed for many different
machines ranging from microprocessors to mainframes. Additionally, it
has been found to hold for the object code produced for a large number
of different high level languages [Peut77a, Peut77b, Knut70]. The
philosophy of simplified machine architecture is to directly implement
in the PE hardware only the most commonly used instructions. More
complex operations are performed by sequences of the simple
instructions. This philosophy is exemplified by the RISC [Patt81], MIPS
[Henn81] and 801 [Radi82] architectural projects.
2) 8-bit ALU. An 8-bit word is both the ALU width and the size of words
transferred between individual PEs and between PE and external
memory. As previously noted, the parallel and pipelined nature of
systolic computation deemphasizes the speed of the individual






one byte at a time - digit pipelined arithmetic [Owen81]. This further
increases the pipelining of the machine. Furthermore, implementation
considerations favor short word size. The restricted number of
connections of the parallel processor to its external memory, and the
limitations on memory bandwidth place a restrictive upper bound on
the amount of data that can be practically transferred to or from the
processing array in unit time. The rate at which the processor array
requires operands must be matched to the limited memory bandwidth.
A small word size decreases the number of memory bits transferred for
each operand. Additionally, the area occupied by wiring between
processing elements is dependent on the word size. Switch area is
proportional to the square of the word size. A small word size
decreases wiring overhead.
3) Five internal registers. There is one register for each port and an
accumulator to hold temporary results. The port registers serve to
buffer PE to PE communications.
4) 64 bytes of random access memory. This is the main memory of
each processing element, and it holds both the PE's program and
temporary data storage which can not be contained in the registers.
The simple instruction set and the digit pipelined nature of the
arithmetic computation increase the amount of program memory
required. Some high level language operations can not be performed
by a single machine instruction but require a sequence of simple
instructions. Plus digit pipelined arithmetic implements a single





the main memory can hold 32 16-bit instructions which should be more
than sufficient for systolic algorithms.
In many regards, the PE is similar to an 8-bit microprocessor such as
the 8080. Both have simple instruction sets, an 8-bit ALU, a limited
register file and byte wide data transfers. However, a CHiP processing
element has important differences from a general purpose microprocessor.
The environment of the PE is much more constrained. The limitations
imposed by tailoring the PE for systolic algorithms provide a more
restricted computational environment than that in which general purpose
devices operate. These restrictions allow the following simplifications in the
design of a PE:
• There is no need to provide a flexible and complex interrupt
mechanism. The environment surrounding a PE is simple and fixed. A
processing element communicates only with neighboring PEs or
external memory. On the other hand, the general purpose
microprocessor must be capable of interfacing to a wide variety of
different devices from laboratory instruments, to terminals, to other
input / output devices. Furthermore, it must be able to communicate
with several of these devices simultaneously and perhaps with differing
priorities. One of the microprocessor's strengths is generality. As a
result, microprocessors commonly have a flexible, prioritized interrupt
mechanism. This greatly increases the usability of the device but also
increases its complexity. The constrained and limited forms of
communication required of a CHiP processing element allow it greatly





• Microprocessors generally provide a rich assortment of addressing
modes to allow flexibility and convenience in fetching operands from
the central memory. But with systolic computation, operands are
continually being passed from PE to PE rather than residing in a
central memory. The need for sophisticated memory access
techniques is greatly reduced.
• Processing elements have a simple instruction set. As noted previously,
there is reduced need for complex condition code setting and
branching instructions.
• With the exception of PEs on the lattice edge, no signals are transferred
off-chip. This eliminates bonding pads and pad drivers from the
majority of PEs reducing their area.
In summary, CHiP processing elements due to their constrained
environment and simpler computational requirements can be considerably
simpler than conventional microprocessors. Simplicity leads to reduced
area and greater reliability. Additionally, a simple machine has fewer gates
in the critical path of an instruction execution. Simplicity increases speed.
c) Layout and Area Estimation
Experience with the design of a simple prototype processing element
[Hedl81a] suggests the PE layout shown in Figure 4.2.2. (Note that this
rough floor plan is intended to be schematic in nature. The exact sizes of
the components and their arrangement are approximately but not precisely
reproduced. The point is to "rough out" the design of a processing element
but not to provide the detailed design.) The register file contains the
Figure 4.2.2 - Processing Element Layout











Table 4.2.1 - Area Estimation for a Processing Element

                              Area (Kλ²)
Misc. Expansion (20%)         530
Total PE Area                 3000
accumulator, the four port buffers and the 64 bytes of program and data
memory. Both instructions and operands are fetched from the register file.
One of the operands can be passed through the shifter before entering the
ALU. The output of the ALU is stored back into the register file. The control
logic section is a set of PLAs which decode the contents of the instruction
register and time the sequence of data transfers to implement the current
instruction. The distinguished registers of the machine include the
instruction register (IR), program counter (PC), memory address register
(MADR) and the accumulator (AC).
To estimate the sizes of the components of the layout, we draw on the
experience of the RISC design team [Fitz81]. Both the CHiP processing
element and the data path of the RISC machine share similar design
objectives. Both machines have simple instruction sets and datapaths of
reduced complexity, and both attempt to support high level language
programming with minimum processor complexity. Additionally, the RISC
team reported very detailed data on the layout complexity and size of their
design. In their design, they spent considerable time and effort in the layout
of compact and efficient components such as memory cells, ALU slices, etc.
This has proved invaluable in making tighter and more realistic estimates of
the area of the CHiP processing element.
Table 4.2.1 shows the area estimates of the major functional
components of the processing element. All estimates were derived from the
RISC Blue design group. Their layout was restricted to using only horizontal
and vertical lines, a Manhattan geometry. This restriction was forced due to
the computational complexity of the automatic circuit extraction and
design rule checking programs. Additionally, Mead and Conway design rules
were employed. A more realistic, industrial design environment would use a
richer and much more complex set of design rules which are fine tuned to a
particular fabrication process. Process specific rules have tighter spacings
and smaller wire widths than the Mead and Conway "generic" rules.
Designing with fewer restrictions and tighter design rules, better results
both in area and performance are certainly obtainable. The following
estimate may be regarded as an upper bound.
The area estimates in Table 4.2.1 were derived by scaling the functional
block area reported by the RISC Blue design team. For example, the RISC
register array consists of 138 32-bit words and occupies 4.12 Mλ². Each of
the static RAM cells is a standard six transistor design with two independent
data busses allowing two port access to the register file. Conceptually, this
allows the accumulator and port registers to occupy the same memory
array as the program / data memory. This reduces processor complexity.
The RISC word size is longer and the number of registers is larger, so the
area occupied by the memory of a CHiP processing element is estimated by
scaling down the area figures for RISC. Direct scaling of the memory area
reported by the RISC project shows that the 64 byte memory array of the
CHiP processing element will occupy area
$$\frac{64}{138} \cdot \frac{8}{32} \cdot 4.12\ (\mathrm{M}\lambda^2) = 478\ (\mathrm{K}\lambda^2)$$
Similarly, the ALU area scales linearly down from the 32-bit wide RISC
datapath that occupies 0.44 Mλ² to
$$\frac{8}{32} \cdot 0.44\ (\mathrm{M}\lambda^2) = 110\ (\mathrm{K}\lambda^2)$$
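The linear scaling used for the memory array and the ALU can be reproduced numerically. This is a minimal sketch using only the figures quoted in the text (138 words of 32 bits occupying 4.12 Mλ², and a 32-bit ALU datapath occupying 0.44 Mλ²); the constant and function names are illustrative.

```python
# RISC figures quoted in the text, with areas in square lambda.
RISC_REGISTER_WORDS = 138
RISC_WORD_BITS = 32
RISC_MEMORY_AREA = 4.12e6
RISC_ALU_AREA = 0.44e6

# CHiP processing element parameters from the proposed architecture.
PE_MEMORY_WORDS = 64   # 64 one-byte words
PE_WORD_BITS = 8


def scale_linear(reference_area, *ratios):
    """Scale a reference area by the product of the given size ratios."""
    area = reference_area
    for ratio in ratios:
        area *= ratio
    return area


pe_memory_area = scale_linear(
    RISC_MEMORY_AREA,
    PE_MEMORY_WORDS / RISC_REGISTER_WORDS,   # fewer words
    PE_WORD_BITS / RISC_WORD_BITS,           # narrower words
)
pe_alu_area = scale_linear(RISC_ALU_AREA, PE_WORD_BITS / RISC_WORD_BITS)
```

Rounded to the nearest Kλ², these give the 478 Kλ² and 110 Kλ² figures used in the estimate.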
The critical component of the RISC design is the memory. It occupies
most of the area of their design. As a result, considerable effort was spent
optimizing the memory cell layout and the memory fetch/store timing. The
memory area estimate can be considered to be quite near optimal. But the
pitch of the memory cell determined the height of the ALU. The ALU area
was not independently optimized, but rather its layout was dictated by the
requirement to mesh with a previously designed memory unit. This is not
necessarily optimal for the CHiP processing element. In short, there may be
room for improvement in the ALU estimate.
Not all components of the layout scale linearly. Decoder size is
proportional to the square of the number of inputs. The RISC memory
contains 138 words, and its memory decoder occupies area 0.87 Mλ². From
this the size of the address decoder for a 64 byte memory is roughly
$$\left(\frac{64}{138}\right)^2 0.87\ (\mathrm{M}\lambda^2) = 187\ (\mathrm{K}\lambda^2)$$
The shifter area also scales quadratically (from 0.89 Mλ² for a 32-bit shifter
of RISC) to
$$\left(\frac{8}{32}\right)^2 0.89\ (\mathrm{M}\lambda^2) = 56\ (\mathrm{K}\lambda^2)$$
Note that all components contain an area component which is independent
of the number of inputs. A more accurate scaling model is Area = An + B
where n is the number of inputs. The above analysis is an approximation
with B = 0.
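The quadratic scaling of the decoder and shifter can be checked the same way. The sketch below assumes, as the text does, that the size-independent term is zero; the names are illustrative.

```python
# RISC figures quoted in the text, with areas in square lambda.
RISC_DECODER_AREA = 0.87e6   # decoder for a 138 word memory
RISC_SHIFTER_AREA = 0.89e6   # 32-bit shifter


def scale_quadratic(reference_area, new_inputs, reference_inputs):
    """Scale an area with the square of the input-count ratio (fixed term taken as 0)."""
    return reference_area * (new_inputs / reference_inputs) ** 2


pe_decoder_area = scale_quadratic(RISC_DECODER_AREA, 64, 138)
pe_shifter_area = scale_quadratic(RISC_SHIFTER_AREA, 8, 32)
```

Because the fixed term is neglected, small components such as the 8-bit shifter are likely to be somewhat underestimated by this rule.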
In addition to the above components, the processing element
architecture includes a number of one byte registers: four port registers,
the accumulator, program counter, memory address register plus the two
byte long instruction register. A single byte register is estimated to occupy
area
$$\frac{1}{138} \cdot \frac{8}{32} \cdot 4.12\ (\mathrm{M}\lambda^2) = 7464\ (\lambda^2)$$
so the nine bytes required for the auxiliary registers occupy 67.2 Kλ².
Memory occupies a significant 24.4% of the PE area. To double check
the memory estimates, we calculate the estimated size of a single bit of
memory. Direct scaling estimates its area to be
$$\frac{1}{138} \cdot \frac{1}{32} \cdot 4.12\ (\mathrm{M}\lambda^2) = 933\ (\lambda^2)$$
With the reported vertical pitch of a register bit being 44 λ, this results in
each memory bit occupying a 44 λ x 21 λ region. Since this is quite
reasonable for a six transistor, dual bus memory cell, our estimates are
accurate.
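The register and single-bit figures follow from the same RISC register array data. This is a small sketch of that arithmetic; the variable names are illustrative, and the nine-byte count is the one given in the text (four port registers, accumulator, program counter, memory address register and a two byte instruction register).

```python
# RISC register array quoted in the text: 138 words x 32 bits in 4.12 M square lambda.
RISC_MEMORY_AREA = 4.12e6
RISC_REGISTER_WORDS = 138
RISC_WORD_BITS = 32
BIT_VERTICAL_PITCH = 44      # lambda, as reported for the RISC register bit

bit_area = RISC_MEMORY_AREA / (RISC_REGISTER_WORDS * RISC_WORD_BITS)
byte_register_area = 8 * bit_area

AUXILIARY_REGISTER_BYTES = 4 + 1 + 1 + 1 + 2   # ports, AC, PC, MADR, two-byte IR
auxiliary_register_area = AUXILIARY_REGISTER_BYTES * byte_register_area

# Horizontal extent of one bit cell implied by its area and the reported pitch.
bit_width = bit_area / BIT_VERTICAL_PITCH
```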
The instruction sets of both the CHiP processing element and the RISC
machine are similar. Consequently, the control logic for the two machines
will be of similar complexity. The control logic area for the CHiP PE is taken
to be identical to the RISC values. This figure includes PLAs, latches to






The total area of the above components is 2.058 Mλ². An additional 20% of
area is added for wire routing between the major functional
blocks. (This is the same percentage as reported by the RISC group.) Since
layouts always occupy more area than expected, an additional 26% of area is
added to bring the total area estimate to a round 3.00 Mλ².
From this estimate, a PE occupies a square of side 1732 λ. Bringing this
estimate up to the nearest round number,
• each CHiP processing element is estimated to occupy a 1750 λ x 1750 λ
region of silicon.
This final rounding results in an additional (1750)² - (1732)² ≈ 62.5 Kλ² of area
for each PE. Our area estimate is conservative. Above and beyond the
estimated size for all components and wire routing between them (2.470
Mλ²), an additional 0.530 + 0.0625 = 0.5925 Mλ² has been added to the
estimate. This is an additional 24% for miscellaneous expansion. The
estimate contains considerable "free area" for unanticipated uses.
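The roll-up from component area to the final 1750 λ square can be traced step by step. The sketch below uses only the totals quoted in the text; the names are illustrative, and the percentages are computed relative to the bases that reproduce the quoted 26% and 24% figures (component area and routed area respectively), which is an interpretation rather than something the report states explicitly.

```python
import math

COMPONENT_AREA = 2.058e6     # square lambda, sum of the functional blocks
ROUTING_FRACTION = 0.20      # wire routing allowance
TOTAL_ESTIMATE = 3.00e6      # rounded total PE area
FINAL_SIDE = 1750            # lambda, rounded side of the PE square

routed_area = COMPONENT_AREA * (1 + ROUTING_FRACTION)     # about 2.470 M
misc_expansion = TOTAL_ESTIMATE - routed_area              # about 0.530 M
misc_fraction_of_components = misc_expansion / COMPONENT_AREA

exact_side = math.sqrt(TOTAL_ESTIMATE)                     # about 1732 lambda
rounding_area = FINAL_SIDE ** 2 - TOTAL_ESTIMATE           # extra area from rounding up
free_area = misc_expansion + rounding_area
free_fraction_of_routed = free_area / routed_area
```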
3. Datapath Design
Datapaths are the busses connecting switches and processing elements.
In addition to data, these signals also include control signals for the PEs and
switches. Each of the individual bus wires is independent of the others.
(Note that the term "datapath" is used ambiguously. In the context of
processing element design. the datapath is the portion of the machine that
transforms and modifies data - the shifter and ALU. Within the context of
lattice design, where PEs are treated as black boxes, datapaths are simply





busses transmitting data without alteration. The intended meaning of the
term will be clear from the context of its usage.)
The datapath is quite small in comparison to the processing elements
and switches. A PE occupies a square 1750 λ on a side while in the following
section it will be shown that switches are approximately 250 λ on a side. To
estimate the datapath width, assume there are ten signals per datapath.
This is sufficient for one byte of data and two control signals - one for the
processing elements and one for switches. The distance between PEs is
much longer than the distances encountered when routing data within a
single PE. To reduce signal transmission time, datapaths are implemented
in the metal layer since metal has much lower resistance and capacitance
than the polysilicon or diffusion layers. With Mead and Conway design rules,
each metal wire is 3 λ wide, and the separation between wires is also 3 λ.
Therefore a ten wire datapath has a minimum width of approximately 60 λ.
This is one quarter the width of a switch and only 3.5% the size of a
processing element (Figure 4.3.1).
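The datapath width estimate is a one-line calculation. A minimal sketch, using the wire width, spacing, and component sizes quoted above (names are illustrative):

```python
SIGNALS_PER_DATAPATH = 10    # eight data bits plus two control signals
METAL_WIRE_WIDTH = 3         # lambda, Mead and Conway rules
METAL_WIRE_SPACING = 3       # lambda
SWITCH_SIDE = 250            # lambda
PE_SIDE = 1750               # lambda

# Each wire contributes its own width plus one spacing.
datapath_width = SIGNALS_PER_DATAPATH * (METAL_WIRE_WIDTH + METAL_WIRE_SPACING)

fraction_of_switch = datapath_width / SWITCH_SIDE
percent_of_pe = 100.0 * datapath_width / PE_SIDE
```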
In addition to being small, the datapath width can be increased without
increasing the lattice area. Widening the datapaths in Figure 4.3.1 does not
increase the separation of switches and processing elements and so has no
effect on the size of the lattice. Note that this is dependent on the layout
details and shapes of the PEs and switches.
As a result, datapaths can be designed with relaxed design rules without
increasing lattice area. By increasing the width of wires. the probability of a
break in a wire is reduced. By increasing the separation between wires,








Figure 4.3.1 - Approximate Relative Size of a PE, Switch and Datapath
rules decrease the circuit's sensitivity to defects [Rung81]. The same
number of defects may occur but the probability of a defect causing the
circuitry to malfunction is reduced.
The relationship between the design rule spacing and yield for a given
circuit is process specific. The amount of yield increase for a given increase
in design rule spacing can not be predicted without also specifying the
fabrication line on which the circuit will be manufactured. However, the
large disparity between switch and datapath size gives great flexibility in the
design rules for the datapath. The datapath width can be increased by a
factor of four without affecting lattice area. This allows wire widths and
spacings to be up to four times as large as the minimum allowed,
resulting in large yield increases.
From the combination of datapaths being small, simple and designed
with flexible design rules, we
• assume there are no fatal defects in datapaths.
This is of course an approximation, but with very high datapath yield, it is a
very close approximation.
Note that an increase in design rule spacing of the datapath has no
effect on the machine's performance. The signal propagation time is
unaffected by the width of datapath wires. As the width of a wire increases,
its capacitance per unit length increases proportionately. However,
resistance decreases linearly with width. Since the signal propagation time
is proportional to the product of the wire's resistance and capacitance, the




A sample layout of a switch is shown in Figure 4.4.1. The switch
displayed there is one of the simplest possible - degree four, no crossover
capability and only one configuration setting. Extensions to a more complex
switch are straightforward.
The switch architecture is organized around its bus rail - concentric
squares of independent bus wires. There is one wire in the bus rail for each
wire of the datapath. At each of the compass point directions, NSEW, the
bus rail is connected to the datapath. This connection is controlled by the
configuration setting. The four bits of the setting determine which subset of
the four datapaths are connected to the bus rail. If bits N and E of the
setting are "on" (with S and W "off"), these two datapaths are connected
together via the bus rail while the S and W datapaths are disconnected from
the bus rail. The configuration setting controls datapath access via four sets
of pass transistors. Each of the groups of pass transistors is driven by one
bit of the configuration settings as indicated by the labels on the control
lines in Figure 4.4.1.
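The effect of a configuration setting can be modeled as a small function. The following sketch is a behavioral illustration only, assuming a single four-bit setting with one bit per compass direction and no crossover capability; the data representation and function names are hypothetical and say nothing about the circuit-level implementation.

```python
from itertools import combinations

PORTS = ("N", "S", "E", "W")


def connected_ports(setting):
    """Ports whose pass transistors are on, given a mapping of port name to bit."""
    return {port for port in PORTS if setting.get(port, False)}


def connected_pairs(setting):
    """Pairs of datapaths joined through the shared bus rail.

    Every port that is switched on reaches the same bus rail, so all
    switched-on ports are mutually connected.
    """
    return set(combinations(sorted(connected_ports(setting)), 2))
```

For example, a setting with only N and E on yields the single pair (E, N), matching the case described in the text.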
b) Switch Yield
A simple switch with degree (} and crossover capability occupies an area
of approximately 250 λ x 250 λ. To estimate the yield of an individual













has 20% yield (see Chapter 2). Since we have assumed that no fatal defects
occur in datapaths, a unit area consists of four PEs and 21 switches. With
PEs occupying a 1750 λ x 1750 λ region, the lattice area sensitive to defects
is
$$4\,(1750)^2 + 21\,(250)^2\ (\lambda^2) = 13.56\ (\mathrm{M}\lambda^2) = 1\ (\text{unit area})$$
The area of a single switch is
$$A_s = \frac{(250)^2}{13.56 \times 10^6} = 4.61 \times 10^{-3}\ (\text{unit area})$$
Substituting this into the yield model
$$Y_s = \frac{1}{(1 + A_s S_0)^4} = 0.991$$
This indicates that switches will have over 99% yield.
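The switch yield figure can be reproduced numerically. The sketch below rests on two assumptions that should be checked against Chapter 2: that the yield model has the form Y = 1 / (1 + A S0)^4 with area A in unit areas, and that S0 is fixed by requiring one unit area to have 20% yield. Under those assumptions the result agrees with the 0.991 quoted above. All names are illustrative.

```python
PE_SIDE = 1750          # lambda
SWITCH_SIDE = 250       # lambda
PES_PER_UNIT = 4
SWITCHES_PER_UNIT = 21
UNIT_AREA_YIELD = 0.20  # assumed yield of one unit area
EXPONENT = 4            # assumed exponent of the yield model


def yield_model(area, s0, exponent=EXPONENT):
    """Yield of a region of the given area (in unit areas) for defect parameter s0."""
    return 1.0 / (1.0 + area * s0) ** exponent


# Defect-sensitive area of one unit, in square lambda.
unit_area = PES_PER_UNIT * PE_SIDE ** 2 + SWITCHES_PER_UNIT * SWITCH_SIDE ** 2

# Area of a single switch expressed in unit areas.
switch_area = SWITCH_SIDE ** 2 / unit_area

# Solve yield_model(1, s0) == UNIT_AREA_YIELD for s0.
s0 = UNIT_AREA_YIELD ** (-1.0 / EXPONENT) - 1.0

switch_yield = yield_model(switch_area, s0)
```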
The yield equation results from the mathematical modeling of the
manufacturing of typical integrated circuits. Yield commonly varies in the
2% to 50% range. Extrapolating this model to exceptionally high yields may
be unreliable. The 99% estimate may be either low or high. Although the
specific yield figure may be questionable, the general conclusion that can be
drawn is that switches have a very high yield. There is also another factor
that supports high yield.
Switches are quite small compared to processing elements. As a result,
a proportionately large increase in switch area results in only a small
increase in total lattice area. Furthermore, some switch expansion results
in absolutely no increase in lattice area. In Figure 4.3.1, the switch can be
expanded horizontally without increasing the PE to PE spacing.
Consequently, relaxed design rules can be used for switch design. As
with datapaths, switch yield can be increased with little or no impact on
area. In short, through their small size and use of relaxed design rules
• very high switch yield can be assured.
To roughly estimate the size of a 2 x 2 white lattice, note that the lattice
has two rows of PEs and three of switches. The total edge length is at
least 2(1750) + 3(250) = 4250 λ. Allowing for spacing between components,
datapath routing, power lines, etc., we conservatively estimate that a 2 x 2
white lattice occupies a square of edge 4750 λ. With λ = 1 μm technology, the edge
length is 4.75 mm.
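The edge length estimate can be written out as follows. This is a small sketch; the value λ = 1 μm is an assumption inferred from the quoted 4.75 mm edge, and the names are illustrative.

```python
PE_SIDE = 1750            # lambda
SWITCH_SIDE = 250         # lambda
PE_ROWS = 2
SWITCH_ROWS = 3
ESTIMATED_EDGE = 4750     # lambda, conservative estimate from the text
LAMBDA_MICRONS = 1.0      # assumed process scale

minimum_edge = PE_ROWS * PE_SIDE + SWITCH_ROWS * SWITCH_SIDE
routing_allowance = ESTIMATED_EDGE - minimum_edge
edge_mm = ESTIMATED_EDGE * LAMBDA_MICRONS / 1000.0
```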
5. Lattice Design
So far in this chapter, the requirements for building blocks have been
specified, and the design of the individual processing elements, datapaths
and switches has been discussed. This section considers the integration of
these individual components into a building block meeting the requirements
of 99% yield and reliable wire through capability.
The first design decision to be made is the dimensions of the virtual
lattice which is mapped into the building block. After this, the characteristics
of the building block which hosts the virtual lattice must be decided upon.
This involves the degree of PE redundancy required to achieve high block
yield. Additionally, a switch lattice must be chosen that provides sufficient
wiring flexibility despite switch faults to implement both the mapping of the





considerations are discussed in detail below.
The size of the virtual lattice determines in part the size of the building
block. A larger virtual lattice with more PEs necessitates a larger building
block.
Large building blocks entail several disadvantages. First, after the
mapping of the virtual lattice into the block, the maximum path length
between PEs is bounded by the size of the block. Larger blocks permit
longer paths. System speed is reduced by long paths. Hence, there is a
strong preference for small blocks that can be mapped using only short
wires.
Secondly, the complexity of determining the mapping of the virtual
lattice into the block increases with block size. With a larger block more
different mappings are possible. Since the mapping problem is solved by
basically brute force methods, increases in block size may substantially
increase the time required to determine the mapping. As a result of these
considerations, a small virtual lattice is chosen (Figure 4.5.2).
• a 2 x 2 virtual lattice will be mapped into the building blocks
A building block must be chosen that effectively hosts the virtual
lattice. What are the requirements for a virtual lattice to be mapped into a
building block? Each component in the virtual lattice must have a
counterpart in the block. Therefore, at a minimum,




















Figure 4.5.2 - Virtual Lattice to be Mapped Into a Building Block
and 2) as many functional switches as the virtual lattice.
In addition to the block switches which are images of switches in the
virtual lattice, there must be enough functional switches left in the block to
act as the connecting switches. These implement the datapaths of the virtual
lattice. They serve as the "glue" to wire together the components of the
virtual lattice. In short,
3) the datapaths of the virtual lattice must be mappable into the building
block.
The virtual lattice must be recoverable from the block with at least 99%
probability. For a successful mapping, each of the three requirements must
be met by a block. If a block fails to meet any one of the requirements, it
will be impossible to map the virtual lattice into the block.
By far the most difficult of these three requirements is that the block
has the requisite number of PEs. There are likely to be very few defective
switches or datapaths, and the yield of PEs is much lower than that of
switches or datapaths.
In the subsequent sections, a switch lattice that is highly robust will be
proposed for building blocks. Switches are small so the addition of redundant
switches causes little increase in the lattice area. The area of a switch
can be increased by a large percentage while increasing the lattice by only a
small fraction. Furthermore, much of this increase is in the portion of the
lattice occupied by the datapath. This part of the lattice is highly insensitive




or fatal defects: lattice yield is almost unaffected. As a result, it is
inexpensive (in terms of area) to provide essentially 100% reliability through
redundancy in the switch lattice. Consequently, if a block contains enough PEs,
the mapping of a virtual lattice will be almost assured. PEs are the weak
link. We consider them next and then return to the switch lattice design.
a) Processing Elements
We must determine the number of PEs, N, per block such that four good
PEs can be found out of the set of N PEs with 99% probability. This is an
instance of the recovery problem discussed in Chapter 2. Drawing on the
results of recovery analysis, Figure 4.5.1 shows R (probability of recovery of
four PEs) vs. the total number of PEs. A total of 12 PEs gives the required
99% recovery so
• each building block contains a 4 PE x 3 PE CHiP lattice.
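The shape of the recovery curve can be illustrated with a short sketch. It is not the recovery analysis of Chapter 2: it assumes PE faults are independent with a single per-PE yield, and the yield value used (ASSUMED_PE_YIELD) is a hypothetical placeholder, not a figure from this work. The function names are likewise illustrative.

```python
from math import comb

def recovery_probability(n_total, n_needed, pe_yield):
    """P(at least n_needed of n_total PEs are good), assuming independent faults."""
    return sum(
        comb(n_total, good) * pe_yield**good * (1.0 - pe_yield)**(n_total - good)
        for good in range(n_needed, n_total + 1)
    )

def min_pes_for_recovery(n_needed, pe_yield, target=0.99, n_max=64):
    """Smallest N whose recovery probability reaches the target."""
    for n in range(n_needed, n_max + 1):
        if recovery_probability(n, n_needed, pe_yield) >= target:
            return n
    raise ValueError("target not reachable within n_max PEs")

# HYPOTHETICAL per-PE yield, used only to exercise the functions.
ASSUMED_PE_YIELD = 0.65

for n in (4, 8, 12, 16):
    r = recovery_probability(n, 4, ASSUMED_PE_YIELD)
    print(f"N = {n:2d}  R = {r:.4f}")
```

Under these assumptions R rises steeply with N, which is the qualitative behaviour the choice of a redundant block relies on; the actual N depends on the yield model of Chapter 2.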
b) Switches
From section 4.4, switch yield, Ys, is estimated from the yield model to
be 99.1%. This yield is achieved through the combination of small switch
area, simplicity and use of relaxed design rules. Throughout this section,
calculations will be made for the purposes of comparison based on both 99%
and 97% switch yield. This is a more conservative approach than simply
assuming 99% yield, and it will indicate the sensitivity of the design
decisions to changes in switch yield.
As noted in previous sections, switches have high yield. But no matter
how high the yield, the random nature of defects means that functionality
cannot be guaranteed; some switches will always be faulty. Consequently,
each switch must have at least one other in the lattice that can take its
place. To provide adequate switch redundancy,
• the corridor width of the building block is two (Figure 4.5.2); twice that
of the virtual lattice.
This provides 100% switch redundancy. The building block has twice as
many switches as necessary.
Note that this redundancy has low cost. Switches are quite small in
comparison to PEs. Adding extra switches causes only a small increase in
overall lattice area. In Figure 4.3.1, increasing the width of the switch
corridor between the PEs from one to two increases the separation of the PEs.
This increases the area occupied by the lattice by (no more than) 4 x 28
units for every row and column of switches. This increases the lattice area
by approximately 14%. Most of this additional area is occupied not by switches
but by the datapaths, which are highly insensitive to the presence of defects.
The portion of the lattice sensitive to defects (PEs and switches) is called its
active area. This increases by only 2% (= 4² / (2(4²) + 28²)). As a result, the
yield of the lattice is affected very little by the increase in corridor width.
Furthermore, depending on the details of the switch layout, it may be possible
to pack the second switch into the inter-PE area in such a way that it
causes a smaller increase in the PE separation. In turn, lattice area would
increase less. In summary, both overall lattice area and lattice yield change
little as a result of increasing corridor width.
As noted in the previous section, PEs are the "weak link" in a building
block. The key to a high block recovery rate is having the required number
of functional PEs. For a PE to be good, all four of its ports must be
functioning correctly. A port itself may be functional but it is unusable if the
switches to which it is connected are faulty. In the virtual lattice (Figure
4.5.2), failure of any one of the four switches directly connected to a port
renders the entire PE dysfunctional. A PE is not usable unless it can
communicate with its surrounding environment from all four of its ports.
To safeguard against a switch failure rendering a PE unusable, the
building block provides 100% switch redundancy at each port. Every port has
two switches connected to it. Either one switch or the other can connect
the port to the remainder of the switch lattice. Only one of the two switches
must be functional. Clearly, it is still possible for both switches attached to a
port to be faulty. At switch yield, Ys, of 0.97, the probability of a PE having a
port which is disconnected from the switch lattice due to a double switch
fault is 4 x (1 - Ys)² = 0.36%. At Ys = 0.99, this probability shrinks to 0.04%.
We cannot totally prevent switch faults from disabling PEs, but the probability
is reduced to a very small value.
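The two percentages follow directly from the expression 4 x (1 - Ys)², which sums the double-fault probability over the four ports. A minimal check, with an illustrative function name:

```python
def pe_disconnect_probability(switch_yield, ports=4, switches_per_port=2):
    """Probability that some port of a PE has lost all of its switches.

    Uses the expression from the text, ports * (1 - Ys)^switches_per_port,
    i.e. the per-port double-fault probability summed over the ports.
    """
    return ports * (1.0 - switch_yield) ** switches_per_port

for ys in (0.97, 0.99):
    print(f"Ys = {ys}: {100 * pe_disconnect_probability(ys):.2f}%")
# prints 0.36% for Ys = 0.97 and 0.04% for Ys = 0.99
```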
How many switches in a building block are likely to be faulty? The
switch yield determines the average number of faulty switches. But since
defects are a random process, the exact number of faults per block will
fluctuate from block to block. What is the maximum number of faulty switches
that can "reasonably" be expected?
By the assumption of the random distribution of defects, the probabilities
of the individual switches being defective are independent. Since
switches can be in one of two states, good or bad, a binomial probability
distribution applies to the collection of switches in a block. An n PE x m PE
building block with double corridor width and two switches per port has a
total of (4n + 2)(4m + 2) - nm switches. The 4 x 3 building block has 240
switches. Let F be a random variable representing the number of faulty
switches per block. With Pr(F = f) representing the probability that a block
has exactly f faulty switches, we have
$$\Pr(F = f) = \binom{240}{f} (1 - Y_s)^{f} \, Y_s^{240 - f}$$
The expected value of F is 240(1 - Ys), and its standard deviation is
$$\sigma = \sqrt{240 \, Y_s (1 - Y_s)}$$
These values are shown in Table 4.5.1 for 99% and 97% average switch yield,
Ys. From this it can be seen that on the average there are only a small
number of faulty switches per block. How does the actual number of faults
vary from block to block? By Chebyshev's Theorem, at least 1 - (1/2)² =
3/4 = 75% of the blocks are within ±2 standard deviations of the mean. At
97% switch yield, this means 75% ≤ Pr(7.20 - 2(2.64) < F < 7.20 + 2(2.64))
≤ Pr(F < 12.5), or at least 75% of the blocks have no more than 5% (=
12.5/240) of their switches faulty. For Ys = 0.99, the same fraction of blocks
has no more than 2% (= (2.40 + 2(1.54)) / 240 = 5.48/240) faulty switches.
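The mean, standard deviation and spread of F can be checked numerically. The sketch below uses the switch count formula and the binomial model stated above; Chebyshev's inequality guarantees a fraction of at least 1 - 1/k² within k standard deviations for any distribution, and the exact binomial sum shows how much tighter the actual distribution is. Function names are illustrative.

```python
from math import comb, sqrt

def switch_count(n, m):
    """Switches in an n PE x m PE block: (4n + 2)(4m + 2) - nm."""
    return (4 * n + 2) * (4 * m + 2) - n * m

def fault_stats(n_switches, ys):
    """Mean and standard deviation of the binomial fault count F."""
    mean = n_switches * (1 - ys)
    sigma = sqrt(n_switches * ys * (1 - ys))
    return mean, sigma

def chebyshev_lower_bound(k):
    """Guaranteed fraction within k standard deviations of the mean."""
    return 1 - 1 / k**2

def exact_fraction_at_most(n_switches, ys, f_max):
    """Exact binomial probability that F <= f_max."""
    q = 1 - ys
    return sum(comb(n_switches, f) * q**f * ys**(n_switches - f)
               for f in range(f_max + 1))

n = switch_count(4, 3)  # 240 switches in the 4 x 3 block
for ys in (0.99, 0.97):
    mean, sigma = fault_stats(n, ys)
    limit = int(mean + 2 * sigma)  # largest whole fault count below mean + 2 sigma
    print(f"Ys = {ys}: mean = {mean:.2f}, sigma = {sigma:.2f}, "
          f"Chebyshev bound (k = 2) = {chebyshev_lower_bound(2):.2f}, "
          f"exact Pr(F <= {limit}) = {exact_fraction_at_most(n, ys, limit):.3f}")
```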
Chebyshev's Theorem bounds the spread of F for any probability
distribution of F. The exact distribution of F is shown in Table 4.5.2. Examining
these more exact calculations, it can be seen that the spread of the defect
distribution is somewhat less than predicted by Chebyshev's Theorem. The
binomial distribution clusters more tightly about the mean value than the












Figure 4.5.3 - Building Block for a Wafer Scale
CHiP Processor
Table 4.5.1 - Effect of Switch Yield on the Number
of Faulty Switches Per Block
                                      Switch Yield
                                      0.99     0.97
expected number of faulty
switches per block (M)                2.40     7.20
standard deviation (σ)                1.54     2.64
Table 4.5.2 - Probability Density of Defective Switches
f = number     Switch Yield = 0.99        Switch Yield = 0.97
of faults      Pr(F = f)   Pr(F ≤ f)      Pr(F = f)   Pr(F ≤ f)
  0            .0896       .0896          .0037       .0037
  1            .217        .307           .010        .014
  2            .262        .569           .022        .035
  3            .210        .779           .043        .078
  4            .126        .905           .073        .15
  5            .0600       .965           .11         .26
  6            .0237       .988           .14         .39
  7            .00801      .997           .15         .54
  8            ---         ---            .14         .68
  9            ---         ---            .12         .80
 10            ---         ---            .087        .89
 11            ---         ---            .054        .95
 12            ---         ---            .029        .97
 13            ---         ---            .014        .99
blocks (>99%) realize at least 97% (= (240 - 7)/240) switch yield. Although the
actual switch yield can fluctuate in accordance with the binomial
distribution, it almost never dips below 97%. Similarly, with Ys = 97%, all but one
percent of the blocks achieve 95% (= (240 - 13)/240) yield.
Derivation of Table 4.5.2
The distribution of F for Ys = 0.99 was derived by directly applying the
formula for the binomial distribution
$$\Pr(F = f) = \binom{240}{f} (1 - Y_s)^{f} \, Y_s^{240 - f}$$
For all but very small values of f, computing the binomial coefficient
$\binom{240}{f}$ is cumbersome and lengthy.
For Ys = 0.97, the binomial distribution was approximated by a normal
distribution [Ross76] with
$$\Pr(F = f) \approx \Pr(f - 0.5 < F < f + 0.5)$$
Let M be the mean value of F and σ its standard deviation. Converting
to the unit normal distribution, Φ, we have
$$\Pr(F = f) \approx \Pr\left[ \frac{f - 0.5 - M}{\sigma} < \frac{F - M}{\sigma} < \frac{f + 0.5 - M}{\sigma} \right]$$
$$= \Phi\left( \frac{f + 0.5 - M}{\sigma} \right) - \Phi\left( \frac{f - 0.5 - M}{\sigma} \right)$$
where the values of Φ are obtained from a table of the normal distribution.
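Both procedures in the derivation are short enough to sketch directly. The block below evaluates the exact binomial formula and the normal approximation with the ±0.5 continuity correction; it uses the error function from the standard library in place of a printed table of Φ, which is a substitution for convenience. Function names are illustrative.

```python
from math import comb, erf, sqrt

N_SWITCHES = 240  # switches in the 4 x 3 building block

def binomial_pmf(f, ys, n=N_SWITCHES):
    """Exact Pr(F = f) for independent switches with yield ys."""
    return comb(n, f) * (1 - ys)**f * ys**(n - f)

def phi(z):
    """Unit normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_approx_pmf(f, ys, n=N_SWITCHES):
    """Pr(f - 0.5 < F < f + 0.5) under the normal approximation."""
    mean = n * (1 - ys)
    sigma = sqrt(n * ys * (1 - ys))
    return phi((f + 0.5 - mean) / sigma) - phi((f - 0.5 - mean) / sigma)

print(" f  exact(Ys=0.99)  normal(Ys=0.97)")
for f in range(14):
    print(f"{f:2d}  {binomial_pmf(f, 0.99):14.4f}  {normal_approx_pmf(f, 0.97):15.4f}")
```

With arbitrary-precision integers the binomial coefficient is no longer cumbersome to compute, so the exact formula could be used for both yields; the normal approximation is kept here only to mirror the derivation.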
c) Mappability
The building block must contain the PEs and switches to serve as the
images of the PEs and switches in the virtual lattice. Additionally, the
datapaths of the virtual lattice must be implemented by the building block.
These are mapped to either single datapaths in the block or a path of
connected switches and datapaths; a single datapath of the virtual lattice may
become a chain of switches in the block.
In addition to producing one single mapping, it is desirable to find a
mapping that has a short maximum path length between components. As
noted elsewhere, long paths reduce system performance.
The switch lattice of the building block can be chosen to help reduce
path lengths. By increasing the wiring bandwidth of the switch lattice,
shorter and more compact mappings can result. In particular, we propose
for the switch lattice of building blocks:
a) switch degree eight. The switch degree is increased from four in the
virtual lattice to eight in the building block. The addition of diagonal
connections allows some routings to "cut the corner" to reduce path
length. In Figure 4.5.4a, the diagonal datapath replaces one switch and
two datapaths that would be required in a degree four lattice. Longer









Figure 4.5.4 - Wire Saving Due to Switches With
Degree 8 and Crossover Capability
b) crossover capability. By allowing two independent paths to cross at
a switch, paths can often follow the most direct route instead of detour-
ing around crossover points. In Figure 4.5.4b, the crossover at the
center switch saves one switch and one datapath.
Incorporating these characteristics in the switch lattice of the building
block increases the efficiency of the resulting mappings. More compact
mappings result with the corresponding increase in performance. We propose
that
• building blocks have degree eight switches with a crossover capability
of two.
Even with this increased wiring capability of the switch lattice, it is
impossible to guarantee a mapping of the virtual lattice into the building
block even when there are the required number of functional PEs. It is
always possible that a mapping will be prevented by a particular pattern of
faulty switches. For example, an entire row of faulty switches divides the
block into two disconnected components. These particular patterns are
extremely unlikely given the high switch yield and the large amount of wir-
ing bandwidth provided by the switch lattice.
d) Wire Through
The requirement for wire through capability is that there exist five
continuous paths from the left side of the block to the right side (Figure
4.1.1). The block is unused so all functional switches are available for
implementing the paths. Orienting the block so the short side is vertical provides
the least wiring bandwidth from left to right. This orientation is chosen for
the following worst case analysis. Additionally, the paths are allowed to start
and end at any switch on the edge. Note that this is a somewhat more
liberal criterion than is actually required for wire through in which paths
must maintain their relative positions. But adding restrictions to the format
of the path simply decreases the probability that such paths exist. In
effect, we derive an upper bound for the probability that the paths do not
exist. We show this upper bound is acceptably small.
Model the problem as a graph with switches represented by nodes and
the datapaths by edges. Since PEs do not participate in the wire through,
they are not included in the graph. A faulty switch corresponds to removing
that node from the graph. The problem is to find sets of nodes whose removal
reduces the minimum edge bisection width of the graph to four or less.
Call this bisecting the graph. Since the probability density of defective
switches decreases rapidly as the number of defective switches increases,
we first find the minimum set of nodes to bisect the graph. Bisections
requiring more than the minimum number of switch faults will occur less
frequently.
The narrowest portion of the graph is the eight columns from which a
PE has been removed. The graph is divided by the missing PEs into four
separate wiring channels each of which is two switches wide. For the
graph to be bisected at slice A, each of the four channels must have at least
one faulty switch. The minimum bisection width of the graph is greater than
four unless this condition is met. (Note that by using the crossover capability






in any column - not five. The following is an upper bound.) The probability of
a given channel having at least one fault is
$$\Pr(C) = \binom{2}{1} (1 - Y_s) \, Y_s + \binom{2}{2} (1 - Y_s)^2 = 1 - Y_s^2$$
     = 0.0591   for Ys = 0.97
     = 0.0199   for Ys = 0.99
The probability of all four of the channels having at least one fault is




With eight different slices. the probability of one of the slices bisecting the
graph is




To bisect the graph through one of the columns containing 14 switches
(slice B, Figure 4.5.5), the probability is less than
$$10 \times \binom{14}{10} (0.97)^{4} (0.03)^{10} \approx 5 \times 10^{-12}$$
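The bounds in this analysis can be evaluated with a few lines of code. The sketch assumes independent switch faults and a channel two switches wide, as described above. The slice A bound is formed here as the number of slices times the fourth power of the channel fault probability, a union bound over the eight narrow slices; that product is an assumption of this sketch. Function names are illustrative.

```python
from math import comb

def channel_fault_probability(ys, width=2):
    """P(at least one of the `width` switches in a channel is faulty)."""
    return 1 - ys**width

def slice_a_bound(ys, channels=4, slices=8):
    """ASSUMED union bound: some narrow slice has a fault in every channel."""
    return slices * channel_fault_probability(ys) ** channels

def slice_b_bound(ys, column=14, faults=10, slices=10):
    """Bound from the text: 10 of the 14 switches in a full column are faulty."""
    return slices * comb(column, faults) * ys**(column - faults) * (1 - ys)**faults

for ys in (0.97, 0.99):
    print(f"Ys = {ys}: Pr(C) = {channel_fault_probability(ys):.4f}, "
          f"slice A bound = {slice_a_bound(ys):.2e}, "
          f"slice B bound = {slice_b_bound(ys):.2e}")
```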
Consequently, the probability of faulty switches causing the minimum
bisection width of the graph to fall below five is negligible. As a result, we will
assume that






A WAFER SCALE CHiP PROCESSOR
In this section we consider the design of a wafer scale CHiP processor
using the building block described in the previous chapter. The goal is to
fabricate a large-scale parallel processor on a single wafer of silicon. This
would allow the processor component of a parallel processing system to be
constructed from a small number (perhaps one) of wafer scale components.
Consideration is given to the problems of the layout of the blocks on the
wafer, external connections, the actual number of processing elements per
wafer, and the overall efficiency of this approach.
1. Wafer Layout
Each building block occupies three times the area of a 2 x 2 lattice.
Since a 2 x 2 lattice occupies a square of side 4.75 mm, we approximate the
size of a block by a square with edge 4.75√3 or 8.23 mm. (The actual
aspect ratio of the building blocks is highly dependent on the layout of
processors and switches. Blocks may have one side slightly longer than the
other. For simplicity we assume throughout this work that blocks are
square. However, we avoid packing the wafer tightly with blocks. This leaves
unused wafer area available in the proposed wafer scale machines to













The first term is the ratio of wafer area to chip area. The second term
represents the number of chips that do not entirely fit on the wafer due to
the curvature of the wafer edge. A 4" (101.6 mm) diameter wafer is the
industry standard,¹ and it can hold a maximum of 98 of the 4 x 3 building
blocks.
However, it is not desirable to pack as many blocks as possible on the
wafer. Obviously, room must be left for bonding pads to connect the
machine to external memory or other wafer scale CHiP machines. But there
is a more important and subtle reason for limiting the number of blocks on
the wafer.
Defects, in general, are randomly distributed over the wafer surface.
The yield model developed in Chapter 2 is based, in part, on this assumption.
As a result, the analysis of fault tolerance, and subsequently, the choice of a
4 x 3 building block depends on random distribution of defects. This
assumption applies quite accurately to the entire wafer except for its
periphery [Stap73, Stap76, Laws66]. A band at the outer edge of the wafer
exhibits a substantially higher density of defects [Gupt72]. This results from
several processing effects:
¹ 5" wafers have been available for some time but are gaining acceptance slowly due to some
incompatibilities with existing fabrication equipment.
a) crystal dislocations formed during crystal growth
b) nonuniform diffusion caused by temperature variations at the wafer
periphery. This is particularly acute near the orientation flat that is in
contact with the cooler diffusion boat.
c) beading of the photoresist near the edge
d) rounding of the wafer at the edge which causes pattern distortion.
The defect density measured inward from the edge decreases exponentially
to a constant value for the central region of the wafer. The width of the
region in which the density is significantly increased has been reported to
be in the range 4-5 mm [Gupt72] although it can be expected to vary
considerably from process to process.¹
To accommodate these phenomena, building blocks are placed in the
central portion of the wafer and bonding pads are located on the periphery
(Figure 5.1.1). Pads are simply areas used as targets for soldering wires
onto the silicon. Their functionality is unaffected by the presence of defects
in the silicon. On the other hand, processing elements and switches are in
general rendered dysfunctional by defects. Therefore they are located in
the large central portion of the wafer where defects are fewer.
This results in efficient utilization of the wafer area. Instead of
uniformly distributing processors and bonding pads over the wafer (as in a
conventional layout), they are separated and placed in the most appropriate
portion of the wafer. Although processing exhibits a great deal of variability,
¹ Industrial sources are very reluctant to reveal any exact figures regarding yield results.
One source [Stap76] defines a two area model with "inner" and "outer" rings of the wafer ex-




Figure 5.1.1 - Layout of a Wafer Scale CHiP Processor
some researchers report virtually no functional chips in the outermost 3
mm. The wafer scale machine effectively utilizes some of this area. In sum,
defect insensitive components are placed where defects are most frequent,
and defect sensitive circuitry is located where there are fewer defects.
2. Lattice Dimensions
The layout of the wafer scale CHiP processor is shown (in somewhat
schematic form) in Figure 5.1.1. In the center of the wafer is an 8 x 8 grid of
building blocks. From the results of the previous section, each of the blocks
has a 99% probability of containing a fully functional 2 x 2 mesh. When a
wafer contains a block that does not have a 2 x 2 mesh, the entire column
containing the faulty block is discarded. The column exclusion strategy
described in Chapter 1 is used to eliminate the occasional defective block.
On the average, how many usable blocks will a wafer yield? Since the
defects are randomly distributed, the chances of the individual blocks being
functional are independent events. Because the events are either "success"
(i.e. functional) or "failure" (i.e. faulty), the number of faulty blocks
is a binomial random variable. Let Pr(F = i) represent the
probability that exactly i blocks are faulty. Then
$$\Pr(F = i) = \binom{64}{i} (1 - p)^{i} \, p^{64 - i}$$
where p = 0.99 is the probability that a block is functional, a successful
event. The probability of occurrence of a given lattice size is derived as
follows:
Derivation of Table 5.2.1
The probability of having a completely functional 8 x 8 grid is simply
the probability that there are no defective blocks, (0.99)⁶⁴ = 0.526.
In general each defective block eliminates an entire column of blocks.
But to accurately compute the probability of occurrence of a given
lattice size, we must account for defective blocks falling in the same
column of the grid. In this case, only a single column is lost despite the
occurrence of multiple defects.
The probability of exactly one excluded column (giving a 16 by 14
lattice since each column is two PEs wide) is:
Pr(F = 1) + Pr(F = 2) Pr(2 bad blocks in same row or col) =
0.340 + (0.108)(14/63) = 0.364
The first defective block can occur anywhere in the grid. There are
seven blocks in the same row as the first defective block and seven in
the same column. So 14 of the remaining 63 blocks can be faulty but
still leave just one row (or column) excluded. The chances of 3 or more
bad blocks occurring and all falling in the same column are negligible.
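The first two probabilities of the derivation can be reproduced from the binomial distribution of faulty blocks. The sketch follows the counting argument above (14 of the 63 remaining positions share a row or column with the first faulty block) and neglects three or more faulty blocks, as the text does. Names are illustrative.

```python
from math import comb

BLOCKS = 64      # 8 x 8 grid of building blocks
P_GOOD = 0.99    # probability that a block hosts a 2 x 2 mesh

def pr_faulty(i, n=BLOCKS, p=P_GOOD):
    """Binomial probability that exactly i of the n blocks are faulty."""
    return comb(n, i) * (1 - p)**i * p**(n - i)

# No faulty blocks: the full 16 x 16 PE lattice is recovered.
p_full = pr_faulty(0)

# Exactly one excluded column: one faulty block, or two faulty blocks
# that share a row or column (14 of the 63 remaining positions).
p_one_excluded = pr_faulty(1) + pr_faulty(2) * 14 / 63

print(round(p_full, 3), round(p_one_excluded, 3))  # 0.526 0.364
```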
The probability of exactly two excluded columns (yielding a 14 x 14
lattice) is similarly derived:
Pr(F = 2) Pr(2 bad blocks fall in different cols) +





(0 108) 49 + (0 0226)· [16 + 64 50 1= 00914
· 3·808079· .
Table 5.2.1 shows the different possible grid sizes resulting from an 8 x
8 grid on a wafer and their probabilities of occurrence. About 53% of the
time all blocks are usable, and the wafer holds a CHiP processor of size 16
PEs by 16 PEs. 36% of the wafers contain exactly one excluded column. With
each block being 2 PEs wide, a 16 x 14 PE processor is recovered from the
wafer. Only 1.9% of the wafers will yield a CHiP machine of size smaller than
14 x 14. The expected number of usable PEs per wafer is 237. This
represents a truly large-scale parallel processor on a single wafer, and this
is achievable with current technology. With future scaling of device
dimensions, even more processors per wafer will be possible. Thus, these
results indicate that the processing element portion of a parallel processing
system can indeed be constructed from a small number (perhaps one) of
wafer scale components.
The choice of an 8 x 8 grid is quite conservative. It results in
substantial wafer area being left for bonding pads and drivers or to be unused
due to high defect density. In fact, an 8 x 8 grid occupies area
64 x (8.23)² = 4335 mm²
(recall that each building block has an edge length of 8.23 mm, see section
1). But a 4" wafer has area 8107 mm² so only 53% of the wafer is occupied by
the CHiP lattice. Why was the 8 x 8 grid proposed? For the simple reason
that it is a safe choice. It is the largest square lattice that fits onto a 4"
Table 5.2.1 - Size of Wafer Scale Processor for 8 x 8 Grid
Lattice Size from an 8 x 8 Grid
               cumulative                 size of CHiP
probability    probability   grid size    processor (PEs)
.526           .526          8 x 8        16 x 16 = 256
.364           .890          8 x 7        16 x 14 = 224
.0914          .981          7 x 7        14 x 14 = 196
.0186          1.000         < 7 x 7
Expected Number of Good PEs = 237
Table 5.2.2 - Wafer Area Occupied by a Grid
of Building Blocks
                                % of 4"
grid size      area (sq mm)     wafer area
 8 x 8         4335             53.5
 9 x 8         4877             60.2
 9 x 9         5486             67.7
10 x 9         6096             75.2
10 x 10        6773             83.5
Table 5.2.3 - Size of Wafer Scale Processor for 9 x 8 Grid
Lattice Size from a 9 x 8 Grid
               cumulative                 size of CHiP
probability    probability   grid size    processor (PEs)
.485           .485          9 x 8        18 x 16 = 288
.380           .865          8 x 8        16 x 16 = 256
.109           .974          8 x 7        16 x 14 = 224
.0199          .994          7 x 7        14 x 14 = 196
.0060          1.000         < 7 x 7
Expected Number of Good PEs = 266
Table 5.2.4 - Size of Wafer Scale Processor for 9 x 9 Grid
Lattice Size from a 9 x 9 Grid
               cumulative                 size of CHiP
probability    probability   grid size    processor (PEs)
.443           .443          9 x 9        18 x 18 = 324
.394           .837          9 x 8        18 x 16 = 288
.129           .966          8 x 8        16 x 16 = 256
.0271          .993          8 x 7        16 x 14 = 224
.0069          1.000         < 8 x 7
Expected Number of Good PEs = 297
Table 5.2.5 - Size of Wafer Scale Processor for 10 x 9 Grid
Lattice Size from a 10 x 9 Grid
               cumulative                 size of CHiP
probability    probability   grid size    processor (PEs)
.405           .405          10 x 9       20 x 18 = 360
.400           .805          9 x 9        18 x 18 = 324
.140           .945          9 x 8        18 x 16 = 288
.0414          .986          8 x 8        16 x 16 = 256
.0139          1.000         < 8 x 8
Expected Number of Good PEs = 329
wafer with a substantial safety margin of area. This area is required for
bonding pads, drivers, regions to be unused due to high defect density, area
loss due to the packing of rectangular blocks, the wafer's orientation flat,
variations from fabrication process to fabrication process in the size of a
unit area, etc. In accordance with Slotnik's Law, in this section the machine
architecture proposed incorporates as few new features, in addition to wafer
scale integration, as possible. Highly conservative choices are made for
virtually all design decisions. Additionally, variances from the conservative
choices are noted and their effects are analyzed.
The 47% unused area in the 8 x 8 grid is a very large safety margin. It is
quite likely that larger grids can be accommodated on a 4" wafer. (Or
alternatively, one could fabricate an 8 x 8 grid with larger PEs that are more
complex and faster. This option is more complex to analyze since changing
the PE area necessitates a reexamination of the degree of redundancy
required within a block. A 4 x 3 block may not be appropriate for
substantially larger or smaller PEs.) The maximum size grid that can be
patterned on a wafer depends on the details of the fabrication process,
layout details of the processing elements and switches, and wafer
characteristics. This must be determined experimentally for a particular
combination of PE design and process technology. We will be content to
propose a conservative approach and note the extensions that may be
possible.
Consider the range of possible grid sizes. First, what is the upper
bound on the wafer area that can be occupied by a grid? Once again, this is
strongly dependent on the particular technology, but we make some rough
estimates. Assume the outermost 5 mm of the wafer is unusable due to high
defect density. The ring of pads and drivers is approximately 0.2 mm wide.
To make a conservative estimate of the effective area, assume that the
bonding pads are placed within the 5 mm outer ring. This will define a lower
bound on the effective wafer diameter. Thus the effective wafer diameter is
reduced from 4" (101.6 mm) to 91.2 mm. The area of this central portion of
the wafer is 6532 mm² or 80.6% of the total wafer area. Table 5.2.2 shows the
area occupied by grids of different dimensions. A 10 x 9 grid is the
maximum allowed by the above bound.
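The area figures above and in Table 5.2.2 follow from the block edge length and the wafer diameter. The sketch below recomputes them and applies the area bound. As in the text, it compares areas only and does not check that a rectangular grid fits geometrically inside the circular region. Constant and function names are illustrative.

```python
from math import pi

WAFER_DIAMETER_MM = 101.6   # 4" wafer
BLOCK_EDGE_MM = 8.23        # edge of one (approximately square) building block
EDGE_BAND_MM = 5.0          # outer band assumed unusable
PAD_RING_MM = 0.2           # ring of pads and drivers

def circle_area(diameter_mm):
    return pi * (diameter_mm / 2) ** 2

def grid_area(rows, cols, edge_mm=BLOCK_EDGE_MM):
    """Area occupied by a rows x cols grid of building blocks."""
    return rows * cols * edge_mm ** 2

wafer_area = circle_area(WAFER_DIAMETER_MM)
effective_diameter = WAFER_DIAMETER_MM - 2 * (EDGE_BAND_MM + PAD_RING_MM)
effective_area = circle_area(effective_diameter)

print(f"wafer area     = {wafer_area:.0f} sq mm")
print(f"effective area = {effective_area:.0f} sq mm "
      f"({100 * effective_area / wafer_area:.1f}% of wafer)")
for rows, cols in [(8, 8), (9, 8), (9, 9), (10, 9), (10, 10)]:
    area = grid_area(rows, cols)
    verdict = "within bound" if area <= effective_area else "exceeds bound"
    print(f"{rows:2d} x {cols:2d}: {area:.0f} sq mm, "
          f"{100 * area / wafer_area:.1f}% of wafer, {verdict}")
```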
Consider a possible alternative to the 8 x 8 grid. A 9 x 9 grid leaves
32.3% of the wafer area unused. This constitutes a fairly large safety
margin. It is still well below the 80% bound on usable wafer area derived
above. Thus a 9 x 9 grid is a reasonable choice for a 4" wafer although it
pushes the limits of technology more than the conservative 8 x 8 grid.
With the 9 x 9 grid, 44% of the wafers will have no excluded columns and
will realize an 18 PE by 18 PE processor (Table 5.2.4). This is a truly large
parallel machine. It represents a 27% increase over the 8 x 8 grid. Another
39% of all wafers will have exactly one excluded column and will implement an
18 x 16 processor array. This is still 12.5% larger than the maximum size
machine achievable with the 8 x 8 grid. In total, 96% of the wafers will host a
CHiP processor at least as large as 16 PEs by 16 PEs. The expected number
of good PEs per wafer is 297. This is 25% more than the 8 x 8 grid. In
summary, a substantially larger CHiP lattice is obtained with a 9 x 9 grid as





What effect does the use of a larger grid have on the size of the CHiP
machine? Tables 5.2.3 - 5.2.5 show the lattice sizes obtainable with grids
larger than 8 x 8. The expected number of good blocks per wafer increases
in direct proportion to the grid size. As the grid size increases, the
probability of a fully functional grid decreases from ~50% to ~40%. With
more building blocks, there is an increased chance that one block will be
faulty. With technological improvements, the size of PEs and switches will
continue to decrease thus making even larger grids possible. The increased
possibility of a faulty block may ultimately put a limit on the maximum grid
dimensions.
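The trend described here is a direct consequence of independent block faults: a grid of B blocks is fully functional with probability p^B, while the expected number of good blocks is pB. A short sketch, assuming the block yield p = 0.99 used throughout (function names are illustrative):

```python
def full_grid_probability(rows, cols, block_yield=0.99):
    """Probability that every block in a rows x cols grid is functional."""
    return block_yield ** (rows * cols)

def expected_good_blocks(rows, cols, block_yield=0.99):
    """Expected number of functional blocks (before any column exclusion)."""
    return rows * cols * block_yield

for rows, cols in [(8, 8), (9, 8), (9, 9), (10, 9)]:
    print(f"{rows:2d} x {cols}: Pr(fully functional) = "
          f"{full_grid_probability(rows, cols):.3f}, "
          f"expected good blocks = {expected_good_blocks(rows, cols):.1f}")
```

The computed probabilities match the first rows of Tables 5.2.1 and 5.2.3 - 5.2.5.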
3. Column Exclusion
When a column (or row) contains a faulty block and is excluded, the
adjacent columns must be connected together. The switches and datapaths
in the unused or faulty blocks are used to make the connection. Thus the
"wire around" requirement for blocks becomes a "wire through" capability
via the CHiP switch lattice (Figure 4.1.1). The double corridor width switch
lattice of the building block provides twice as much wiring bandwidth
through the lattice as is necessary. This redundant wiring capability can be
used to circumvent faulty switches. As shown in the previous chapter,
blocks provide wire through capability with very high reliability.
However, each switch introduces additional signal delay since a signal
must pass through a pair of transfer gates in each switch. To traverse an
unused column, typically ten to fourteen extra switches are introduced into
the path. In addition to switching delays, this requires that periodically in




catastrophic signal degradation. But buffers introduce additional delays. In
short, column exclusion incurs a performance penalty.
The amount of signal delay incurred depends on the impedance of the
individual switches and the number of switches separating PEs. The design of
low impedance switches is an important practical problem in the
implementation of the CHiP family of machines. A combination of circuit
design and fabrication technology techniques such as the use of depletion
mode transfer gates with high channel doping levels reduces impedance.
These techniques substantially reduce switch delays. However, the delay
through even a fast switch is more than the delay incurred by directly wiring
together processors. The gain in flexibility due to the switch lattice is
bought at a loss in performance. This problem is common to all machines in
the CHiP family.
The number of switches between two PEs depends on two interrelated
factors: the specific PE configuration and the corridor width of the switch
lattice. The processor configuration is under the control of the
programmer. Some topologies can be mapped onto a lattice efficiently with
only short distances separating the PEs (for example, the mesh). Other
more complex arrangements require longer paths. A wider corridor width
provides additional wiring bandwidth and will in some instances allow more
compact layouts. But in any event, the corridor width of the switch lattice is
the minimum separation for any configuration. Since wafer scale systems
must be robust to switch failures in addition to processor faults, they must
have extra switching corridors used exclusively for fault tolerant
reconfiguration. Thus, wafer scale systems, with their redundant switches,
increase the number of switches that inter-PE signals must traverse. Wafer
scale systems pay for their low cost in the currency of performance.
4. External Connections
Consider the requirements of connecting the wafer scale machine to
external devices - either memory or other CHiP machines or both. At the
very maximum, every switch on the lattice edge has an external
connection (see footnote 1). With a data transfer width of one byte and two
control lines per datapath, each switch requires ten bonding pads. In a 16 PE
by 16 PE lattice there are 32 switches on a lattice edge or 320 bonding pads
per edge. (No external connections need be provided for the redundant
switching corridors since they are used exclusively for fault tolerant
reconfiguration.) Each bonding pad is a square approximately 0.1 mm on a
side with 0.075 mm spacing between pads [Mead80]. So at a total width
requirement of 0.175 mm per pad, a line of 320 pads extends 56 mm. This is
just slightly more than the radius of the wafer. Counting all lattice edges,
4 radii of pads are required. The circumference of the wafer is 2π radii
long. So the pads can be arranged around the perimeter of the wafer in a
single circular pattern. Note that additional external connections can be
implemented with
multiple concentric circles of pads.
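The pad count above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original design: the constant names are ours, and the wafer radius is assumed to be exactly 2 inches (50.8 mm). It compares the total run of pads for all four lattice edges against the wafer circumference.

```python
import math

# Figures quoted in the text
PAD_EDGE_MM = 0.1         # edge length of one square bonding pad
PAD_SPACING_MM = 0.075    # spacing between adjacent pads
DATA_BITS = 8             # one-byte data transfer width
CONTROL_LINES = 2         # control lines per datapath
SWITCHES_PER_EDGE = 32    # edge switches of a 16 PE x 16 PE lattice

# Assumption (ours): a 4-inch wafer is exactly 101.6 mm in diameter
WAFER_RADIUS_MM = 50.8

pads_per_switch = DATA_BITS + CONTROL_LINES            # ten pads per switch
pads_per_edge = SWITCHES_PER_EDGE * pads_per_switch    # pads on one lattice edge
pad_pitch_mm = PAD_EDGE_MM + PAD_SPACING_MM            # width needed per pad
edge_run_mm = pads_per_edge * pad_pitch_mm             # pads of one edge laid end to end
total_run_mm = 4 * edge_run_mm                         # all four lattice edges
circumference_mm = 2 * math.pi * WAFER_RADIUS_MM

# A single ring of pads suffices if the total run fits on the circumference
fits_single_ring = total_run_mm <= circumference_mm

print(f"pads per edge:       {pads_per_edge}")
print(f"run per edge (mm):   {edge_run_mm:.1f}")
print(f"total run (mm):      {total_run_mm:.1f}")
print(f"circumference (mm):  {circumference_mm:.1f}")
print(f"fits in one ring:    {fits_single_ring}")
```

In practice the ring of pads lies somewhat inside the wafer edge, so the available circumference is a little smaller than the figure computed here; the margin is large enough that the conclusion is unchanged.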
The off-chip drivers are located between the circle of bonding pads and
the CHiP lattice. They connect a subset of the switches on the lattice edge to
bonding pads and provide the required signal amplification to reliably and
quickly transmit signals to an off-chip source.
1 In practice, providing connections just for the switches directly connected to PEs should
be sufficient to meet the I/O requirements. This cuts the number of external connections at
least in half which may be more in line with the limitations of packaging technology.
A CHiP machine can not afford to have a switch on the lattice edge with
a missing external connection. The interface of the switch lattice to its
external connections must be complete and symmetric. Therefore the
integrity of the driver circuitry and the connections to the bonding pads and
switch lattice must be very high. There is the potential for the loss of an
entire column of blocks should a driver fail.
A number of steps can be taken to ensure reliability. First, the drivers
are placed inside the band of high defect density near the wafer edge. This
removes them from the wafer area most prone to circuit faults. The exact
location depends on the wafer characteristics and the sensitivity of driver
circuitry to defects.
Second, drivers can be designed to be highly reliable. Pad drivers are
relatively simple, which reduces their chance of failure. Also, much of the
circuitry is composed of large transistors - many times the size of a
minimum geometry transistor [Hon80]. This is necessary due to the large
power and current requirements of off-chip signals. Large size decreases
the sensitivity to defects and increases yield. Additionally, the entire pad
driver, especially the smaller geometry circuitry on the switch lattice side,
can be designed with relaxed design rules. This once again can substantially
increase yield. Wider wires with larger spacings are less likely to fail. This
slightly increases the pad area and the signal transmission time but is a
small price to pay for increased reliability.
Third, provide redundant drivers. In addition to making drivers
reliable, add 100% driver redundancy at each pad. In the rare case that a
driver is faulty, its redundant counterpart functions in its place. Both
drivers are connected to the pad (and switch) via a common bus (see Figure
5.4.1). In case of an active fault (e.g. a short of the bus connection to Vdd
or Gnd), the driver can be physically disconnected from the bus by laser
trimming or fuse blowing. The bus wire can be made wide enough and with
sufficient spacing from neighboring circuitry to ensure bus integrity. Lastly,
redundancy in the form of complete pad / driver combinations can be
added. This guards against the occurrence of non-random defects at the
wafer edge.
In summary, the problem of providing extremely reliable off-chip
drivers can be solved by technological means. There are no fundamental
difficulties. A combination of driver reliability achieved through relaxed
design rules and redundancy achieves the required reliability. The exact
combination of these techniques required to produce the desired reliability
is technology specific.
5. Efficiency
In each block only four of the twelve processors are used regardless of
how many more are actually functional. Furthermore, every time there is
one bad block, an entire column of eight blocks is discarded. It appears
that the two level hierarchy approach to implementing wafer scale
integration makes very inefficient use of the wafer surface. Surprisingly





Figure 5.4.1 - Redundant Pad Drivers for High Reliability
Consider the alternative to implementing a 2 x 2 lattice with fault
tolerant building blocks. The fault tolerant approach will be compared to
conventional manufacturing of integrated circuits without redundant
components. Let us simply pattern as many 2 x 2 lattices on the wafer as
possible, scribe the wafer into the individual 2 x 2 lattices and package
them. Since the 2 x 2 lattices are considerably smaller than the fault
tolerant building blocks, a 4" wafer can hold 321 of them. At 20% yield (our
reference point since one normalized unit area holds a 2 x 2 lattice and has
by definition 20% yield). there are 321 x 0.20 = 64 good lattices per wafer. In
the wafer scale machine (with the conservative choice of an 8 x 8 grid), the
expected number of PEs is 237 occurring in 59 2 x 2 lattices. This is 9%
fewer than with conventional processing.
Is this a victory for the conventional approach? Not quite. First, the
number of 2 x 2 lattices actually patterned on the wafer will be lower than
321. The bonding pads required at each lattice have not been accounted for.
As a result, the area of each lattice must be slightly larger (footnote 1).
Also there must be scribe lines between lattices. This consumes a little more
area, leaving less for the lattices. Secondly, the increased defect density along
the edge of the wafer greatly reduces the chip yield there. 20% yield is
achieved only in the central portion of the wafer. Averaged over the entire
wafer, somewhat less than 20% yield will actually be realized. As a
consequence, there will be fewer than 64 good lattices per wafer with
conventional processing. The exact number depends on processing details.
1 One advantage of the wafer scale approach is that there are fewer total bonding
pads. The internal lattice connections are made not by large (and slow) pads and off-chip
drivers, but by direct wiring in silicon from PE to PE.
Just as we were liberal with the estimates in the conventional approach,
we have been conservative in the estimations for the wafer scale case.
Remember that the 8 x 8 grid of building blocks is a very conservative
choice that occupies only 53% of the wafer area. In practice, a larger grid
could be used. The exact dimensions of the largest lattice that can be
patterned on a wafer is dependent on the particular fabrication process and
the characteristics of the wafers. This must be determined experimentally,
but in any case, there would probably be more than 59 lattices per wafer.
In short, the initial estimates overstated the number of good lattices
obtained through conventional technology and understated them in the
wafer scale case. In practice, the number of good lattices per wafer is
comparable in both approaches, but the exact number of good lattices is
dependent on processing technology. As a result, we can conclude that the
• use of fault tolerant building blocks to implement wafer scale
integration makes efficient use of silicon area.
The reason behind this is that the area lost to redundant PEs is more
than made up for by the increased yield provided by the redundancy.
Examining the curve of building block recovery vs. the number of PEs (for
the recovery of a fixed size 2 x 2 lattice, Figure 2.5.2), we find the curve
rises quite quickly. This means a small amount of fault tolerance has a big






On the other hand, the area increase due to redundancy is linear. The
area of a lattice increases in direct proportion to the number of processors.
This follows for two reasons. First, the mesh connected structure of the
lattice requires that each component be connected to only a fixed number
of other components. The number of connections does not increase with the
size of the lattice. (This property is not enjoyed by many of the other
interconnection networks. For example, the binary cube requires that each
processor in an N node machine be connected to ⌈log N⌉ other nodes. Thus
the number of wires per processor can be very large for large-scale binary
cubes.) Secondly, the local connection structure of the mesh requires that
each node be connected only to its physically adjacent neighbors. Each of
the wires connecting PEs has constant length independent of the lattice
dimensions. The distance of a PE from its neighbors to which it is connected
is independent of the size of the mesh. (Once again few other
interconnection strategies preserve locality. A perfect shuffle connection
network has a constant number of connections per processor regardless of
the network size. But each node must be connected to a node in a fixed
relative position in the shuffle. For example, node 1 is connected to node
N/2. So, as N increases, the length of each connection (on the average)
increases. As a result, a perfect shuffle of N PEs requires O(N^2 / log^2 N) area
[Klei81].) With the number of wires and their length both constant, the area
occupied by wires increases in proportion to the number of processors in
the mesh. Since PE area is also independent of lattice dimensions, the
lattice area grows linearly with the number of PEs.
As was shown in Chapter 2, redundancy can provide large increases in
the recovery probability. This means that modest amounts of redundancy
increase the efficiency of use of the wafer area. The area taken up by
redundant PEs is more than made up for by the increased recovery. In
Chapter 2 it was seen that modest amounts of redundancy (e.g. ~50%) lead
to optimum use of the wafer area. The need for very high block yield (as
required by the column exclusion strategy) necessitates that building blocks
have much higher redundancy than for optimal area utilization. However,
the PE utilization does not fall below the PE utilization without redundancy.
Utilization for conventional, non-redundant chips and building blocks are
similar.
6. Effect of Technological Advances
The wafer scale CHiP machine described above can be fabricated with
current (1982) technology. Four inch wafers are the industry standard and
have been commonplace for several years. The complexity of the processing
elements is less than that of a simple microprocessor, and switches are
considerably simpler. The design of the individual components is
straightforward in comparison to the current generation of advanced
microprocessors. Simple PEs 1.75 mm on a side can be produced with gate
lengths and wire widths attainable by current state of the art semiconductor
manufacturing processes. In summary, the wafer scale processor does not
depend upon unconventional or experimental technology.
However, semiconductor fabrication technology is not static.
Transistors will continue to shrink in size. Defect densities will continue to
be reduced. Wafers will become purer and larger in diameter. In short,
more circuitry will continue to be packed into a smaller area with decreased
power consumption and increased circuit speed. The pace of these advances
has been slowing in recent years due to increasingly difficult technological
problems, physical limitations and mounting capital costs of the
increasingly sophisticated fabrication equipment. Although the pace of
advancement is slowing, the trend is inexorable [Noyc77].
What will be the effect of technological advances on wafer scale
machines? Larger wafers will allow the fabrication of CHiP lattices of larger
dimension which are composed of more powerful processing elements. Also,
the scaling down of device sizes has positive impact on virtually all circuit
parameters. Processing elements will become smaller, more reliable and
less power hungry. This will lead to larger lattices on the wafer, less
redundancy required within each building block and reduced switching
overhead. Although the direction of these trends is clear, this section
quantitatively analyzes the effect of technology improvements on wafer
scale CHiP processors.
In previous sections, the estimates of PE size and number of PEs per
wafer are based on a conservative assessment of current technology. We
have assumed 4" wafers and transistors with 2 μm channel lengths. Both of
these are typical of state of the art fabrication processes currently (1982) in
volume production. This represents the baseline case against which
technological advances will be compared. For the purposes of comparison,
we will project a short term and a long term technological advancement.
Some major facets of the design of wafer scale CHiP processors will be





Wafer diameter has steadily increased over the years. In the early
1960s, wafers 1.5" in diameter were common. Today, 4" wafers have been
commonplace for some years. They are the standard of the industry.
Additionally, 5" wafers are available. Due to some incompatibilities with
existing fabrication equipment, their use has not become widespread but
their acceptance is growing. A fivefold increase in wafer area over the span
of two decades is a snail's pace compared to the pace of advances in device
scaling. Consequently, as the representative of long term future technology,
a modest increase in wafer diameter to 7" is selected. This represents a
doubling of the area of the current state of the art 5" wafer.
In the following discussion, the characteristics of wafer scale machines
fabricated on 5" and 7" wafers will be compared to the 4" wafer. The 4" wafer
represents the baseline case for well established current technology. 5"
wafers are at the cutting edge of the current state of the art, and 7" wafers
represent the possibilities of long term future technology.
The characteristics of wafers with diameter of 4", 5" and 7" are shown in
Table 5.6.1. The 7" wafer has over three times the area of the 4" wafer.
However, recall that not all the wafer area can be occupied by building
blocks. Assume that the outer 5 mm of a wafer can not be occupied by
building blocks due to high defect density. (In practice, some of this area
will be occupied by bonding pads and their drivers.) We will estimate the
number of building blocks that can be fabricated on a wafer. Define the
effective diameter of the wafer as the wafer diameter minus 10 mm. It
delimits a lower bound on the effective wafer area. This is the area that can
potentially be occupied by processing elements and switches.
Table 5.6.1 - Effect of Wafer Diameter on the Wafer Scale CHiP Processor

                                            Wafer Diameter
                                        4"        5"        7"
total wafer area (sq mm)              8107    12,668    24,829
effective wafer area (sq mm)          6590    10,751    22,114
maximum number of blocks in           77.6     133.6     290.4
  effective area (lower bound)




The increases in effective area between the 4" wafer and the larger ones
are even more pronounced than the increase in the total area. Removing a
fixed size outer band eliminates proportionately more area from small
wafers than from large ones. The effective area of a 5" wafer is 63% larger
than the 4" wafer, and the 7" wafer has 3.4 times the effective area of the 4"
wafer. There is room for substantially more building blocks on the larger
wafer.
The maximum number of square building blocks with edge length e that
can be packed onto a circular wafer of diameter D is given by formula 1.1:
    \frac{\pi D^2}{4 e^2} - 1.77 \frac{D}{e}
Using the above equation, Table 5.6.1 shows that the maximum number
of blocks per wafer increases by 72% for the 5" and 274% for the 7" wafer. As
expected, much larger CHiP processors can potentially be fabricated on the
larger wafers. The maximum number of blocks increases more quickly than
the effective area. Note that the effective area increases are only 63% and
236% for 5" and 7" wafers respectively. The reason for this is that larger
wafers are less affected by edge curvature. With an arc of larger radius, the
relatively small building blocks can be placed around the wafer edge
with less waste of area. Additionally, with larger wafers, a larger fraction of
the wafer area falls in the center and is unaffected by edge curvature. In
particular, from the second term of equation 1.1 we see that a 4" wafer has a
20% reduction in the number of blocks due to edge curvature. 5" and 7"
wafers lose only 16% and 11% of their blocks respectively. In summary,
building blocks can be packed more efficiently into larger wafers than
smaller diameter wafers. This results in more efficient use of the wafer area






processor that can be patterned on a larger wafer is greater than simply the
increase in wafer area. A 7" wafer can hold 3.7 (= 290.4 / 77.6) times as many
standard PEs as a 4" wafer whereas the ratio of total wafer area is only 3:1.
In terms of maximum square grid size, a 4" wafer can hold a 9 x 9 grid.
An 11 x 11 fits onto a 5" wafer, and a 7" wafer can hold a huge 17 x 17 grid of
building blocks. This represents a 34 PE by 34 PE CHiP lattice - a truly
large-scale parallel machine. Even the use of a 5" wafer (which is well within
the scope of current technology) allows the fabrication of a CHiP lattice with
50% (≈ 11^2 / 9^2 - 1) more PEs than the 4" wafer. In summary, even a
modest increase in wafer diameter substantially increases the maximum
size of a wafer scale CHiP processor through both an increase in wafer area
and more efficient utilization of that area.
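The entries of Table 5.6.1 can be reproduced numerically. The sketch below is illustrative only: the function names are ours, an inch is taken as 25.4 mm, and formula 1.1 is read as pi D^2 / (4 e^2) - 1.77 D / e, with D the effective diameter and e = 8.23 mm the edge length of the baseline building block. Under that reading the computed block counts agree with the table.

```python
import math

INCH_MM = 25.4
EDGE_BAND_MM = 5.0      # outer band excluded because of high defect density
BLOCK_EDGE_MM = 8.23    # edge length of the baseline 4 x 3 building block


def effective_diameter(diameter_in):
    """Wafer diameter in mm minus the excluded band on both sides."""
    return diameter_in * INCH_MM - 2 * EDGE_BAND_MM


def circle_area(diameter_mm):
    return math.pi * diameter_mm ** 2 / 4


def max_blocks(diameter_mm, edge_mm):
    """Formula 1.1 as read above: area term minus edge-curvature term."""
    area_term = math.pi * diameter_mm ** 2 / (4 * edge_mm ** 2)
    curvature_term = 1.77 * diameter_mm / edge_mm
    return area_term - curvature_term


def curvature_loss(diameter_mm, edge_mm):
    """Fraction of the area term lost to edge curvature."""
    area_term = math.pi * diameter_mm ** 2 / (4 * edge_mm ** 2)
    return (1.77 * diameter_mm / edge_mm) / area_term


for d in (4, 5, 7):
    d_eff = effective_diameter(d)
    print(f'{d}" wafer: total {circle_area(d * INCH_MM):8.0f} sq mm, '
          f'effective {circle_area(d_eff):8.0f} sq mm, '
          f'blocks {max_blocks(d_eff, BLOCK_EDGE_MM):6.1f}, '
          f'curvature loss {curvature_loss(d_eff, BLOCK_EDGE_MM):.0%}')
```

The curvature loss printed for each diameter corresponds to the 20%, 16% and 11% reductions discussed in the text.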
b) Device Scaling
As advances in semiconductor manufacturing technology continue, the
size of devices continues to be reduced. Wires become narrower and
transistors smaller. Although the rate of progress is slowing, further
advances can be expected. What will be the effect on wafer scale machines?
This section examines some of the consequences of smaller processing
elements and switches on the design of wafer scale CHiP processors.
In the previous sections, the area estimates for PEs and switches were
based on Mead and Conway design rules under the assumption that λ = 1
μm. This corresponds to a transistor channel length of 2 μm and is
conservatively representative of current technology. Intel's HMOSII process





years. HMOSII is a mature technology and its successor will soon be
introduced. As a result, we select as a representative of near term
technology a doubling of the device density. This corresponds to shrinking
the lateral dimensions of devices to 70% of their current dimensions - a
channel length of approximately 1.4 μm. As for the long term advances in
device scaling, the DOD has launched a concentrated effort to achieve 1 μm
technology which would quadruple the device density. It appears that this
goal is achievable through the extension of current optical lithography
techniques, and it is a feasible goal for the late 1980s. Consequently, a
channel length of 1 μm is selected as the representative of long term
technology advances.
With smaller PEs, their yield increases. When the PE area is shrunk in
half, the yield (computed by the yield model, equation 2.2) for a 2 x 2 CHiP
lattice doubles (Table 5.6.2). Higher PE yield reduces the amount of
redundancy that is required to achieve 99% block yield. Consider the
reduction in device area by a factor of two. Examining Figure 2.5.4 shows
that four functional standard PEs (with their area scaled by a factor of 0.5)
can be found in a group of 9 PEs with 99% probability. Thus the dimensions
of a building block with a 99% recovery rate can be reduced from 4 x 3 to 3 x
3. Only five redundant PEs per block are required instead of eight.
Redundancy is decreased by one third with no decrease in the recovery
probability of the block. Not only is the area of a single PE cut in half, but
the number of PEs in a block is reduced. This results in a double area








There are two main consequences of the reduction in block dimensions
due to smaller PE size:
1) More efficient use is made of the wafer area. Wafer scale integration
implemented via column exclusion imposes overhead in the form of the
redundancy required to achieve high block yield. The redundant PEs
are not an integral part of the CHiP lattice; only a fixed number of PEs
are recovered from each block. But still they occupy area that could
be used by the lattice. Smaller PEs have higher yield and require less
redundancy for the same block recovery rate. This frees wafer area for
additional blocks.
2) Smaller building blocks have shorter paths between PEs. Recall that
the maximum path length between two PEs is determined by the
building block dimensions (Chapter 4). Reducing the block size to 3 x 3
results in fewer switches between PEs and decreased signal
transmission time. Device scaling leads to not only more efficient use
of the wafer area, but decreased switching overhead. Performance is
correspondingly enhanced.
How much can the block area be reduced by the use of the smaller PEs
and switches? In the baseline case, each 4 x 3 building block occupies a 67.7
mm^2 region of silicon (Table 5.6.2). By scaling down this value, we estimate
the area occupied by each 3 x 3 building block to be approximately








channel length (μm)                   2.00     1.41     1.00
PE area (M μm^2)                      12.3     6.13     3.06
yield of a 2 x 2 lattice             0.200    0.412    0.627
PEs / block for 99% recovery            12        9        7
building block area (sq mm)           67.7     25.4     9.88
block edge length (mm)                8.23     5.04     3.14









The area of each PE and switch is cut in half, and the number of PEs is
reduced from twelve to nine. Assuming a square block, the block edge length
is \sqrt{25.4} = 5.04 (mm).
How many of these smaller building blocks can be placed on a wafer?
As shown previously, a 9 x 9 grid of blocks with edge length 8.23 mm can be
fabricated on a single 4" wafer. In the scaled down technology, the shorter
block edge length means that a grid of roughly
    \frac{8.23}{5.04} \times 9 = 14.7
blocks per side can be fabricated on a 4" wafer. Rounding this down, the
wafer can hold a 14 x 14 grid of building blocks. Since the same 2 x 2 virtual
lattice is mapped into each of the building blocks, a 28 PE x 28 PE lattice
will fit on a single wafer. The number of PEs increases by a factor of 2.4 (=
28^2 / 18^2). So cutting PE area in half more than doubles the number of PEs
per wafer due to the increased efficiency in the use of the wafer area. There
is an additional 20% (= 0.4 / 2.0) increase attributable to increased
efficiency. Note that the increase due to efficiency is not equal to the
reduction in the number of PEs per block, 25% = (12 - 9) / 12. The increase
is lower due to the restriction that the wafer contains a square grid of
building blocks. If (14.7)^2 blocks could be put on the wafer, a full 25% gain
due to efficiency would be realized.
Now consider the quadrupling of the device density. The yield of a 2 x 2
lattice of standard PEs more than triples to 62.7%. Once again the yield
increase reduces the amount of redundancy required. Only three extra PEs
are required to give a 99% recovery rate of four PEs (Figure 2.5.5).
Redundancy is reduced from eight extra PEs in the baseline case to only
three PEs. The building block area is correspondingly reduced to
    \frac{67.7}{4} \times \frac{7}{12} = 9.87 (mm^2)
with the block edge length of \sqrt{9.87} = 3.14 mm. This results in a grid of no
more than
    \frac{8.23}{3.14} \times 9 = 23.6
blocks per side (Table 5.6.1). This is an increase of 653% over the baseline
case. Of this, 400% is directly attributable to smaller PEs, and the
remainder to the reduction in the number of PEs per block.
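The scaling arithmetic of this section can be collected in one short calculation. This is a sketch under the assumptions stated in the text (block area scales with both the device area factor and the number of PEs per block, blocks are square, and the 9 x 9 baseline grid scales inversely with block edge length); the function and variable names are ours.

```python
import math

BASE_BLOCK_AREA_MM2 = 67.7   # baseline 4 x 3 building block
BASE_PES_PER_BLOCK = 12
BASE_EDGE_MM = 8.23
BASE_GRID = 9                # 9 x 9 grid of baseline blocks on a 4" wafer


def scaled_block(area_scale, pes_per_block):
    """Return (block area, block edge, blocks per side) for a scaled technology."""
    area = BASE_BLOCK_AREA_MM2 * area_scale * pes_per_block / BASE_PES_PER_BLOCK
    edge = math.sqrt(area)                       # square block assumed
    per_side = BASE_EDGE_MM / edge * BASE_GRID   # grid grows as the edge shrinks
    return area, edge, per_side


cases = (
    ("device density doubled", 0.5, 9),
    ("device density quadrupled", 0.25, 7),
)
for label, scale, pes in cases:
    area, edge, per_side = scaled_block(scale, pes)
    grid = math.floor(per_side)                  # only whole blocks fit
    gain = (grid / BASE_GRID) ** 2               # ratio of recovered PEs
    print(f"{label}: area {area:.2f} sq mm, edge {edge:.2f} mm, "
          f"{per_side:.1f} blocks per side, {grid} x {grid} grid, "
          f"PE count x {gain:.2f}")
```

Because intermediate values are not rounded here, the last digit can differ slightly from a hand calculation that uses the rounded edge lengths 5.04 mm and 3.14 mm.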
7. Practical Implementation Considerations
The previous chapters have covered the general principles of the
implementation of wafer scale integration: two level hierarchy, column
exclusion and fault tolerant building blocks. Structuring, the major hurdle
in the implementation of wafer scale systems, is achieved through a
combination of these design principles. In addition, a number of lower level
implementation issues have also been discussed: wafer layout, switch lattice
structure, external connections, etc. Despite the (apparent) success of this
approach, a host of engineering problems must all be solved before the
wafer scale CHiP machine can make the transition from paper to silicon.
The problems of heat dissipation, clock skew, routing of power and ground







constructed. A number of these practical implementation considerations
are discussed in this section.
a) Power Consumption
Electrical signals are changed by the storing and discharging of
electrical energy. This requires the application of power which is
transformed into heat. To maintain a continuously operating device at an
acceptable temperature, this heat must be transferred from the device to
the surrounding environment. As more and more devices are packed into
smaller and smaller volumes, there is a greater concentration of heat in a
smaller volume with less surface area available to conduct away the heat.
Power dissipation becomes increasingly difficult. The problem of power
dissipation is a difficult one for high density LSI chips.
This problem is particularly acute for wafer scale systems. A wafer
scale system has on the order of 100 times as many components as a
complex LSI chip. This very large number of circuits is packed into a single
package. A single wafer scale system may replace an entire printed circuit
board resulting in a large increase in the density of gates per cubic
centimeter.
To address this problem, we will first estimate how much power can be
dissipated by a wafer. This will in turn dictate the power consumption
requirements of the individual switches and processing elements. Finally,
the design of the switches and PEs to meet these power requirements will be
considered.
Since we have not proposed one specific design and layout of PEs and
switches, the power consumption figures derived below will necessarily be
rough estimates. Exact figures can be obtained only for a specific
processor. We will attempt to show that the class of wafer scale CHiP
processor discussed in this work can with proper design meet reasonable
power consumption restrictions.
A single chip can dissipate 1W with only common and inexpensive
packaging technology. Up to 5W per chip can be dissipated through the use
of exotic and expensive packaging techniques such as direct water cooling,
heat sinks and cooling towers. A wafer has approximately 200 times the
surface area of a single chip; there is a much larger surface area over which
to perform the heat exchange with the surrounding environment. With
similar packaging technology, the larger wafer scale system should be
capable of dissipating substantially more power than a single chip.
With forced air cooling a printed circuit board can dissipate up to
approximately 0.5W per in^2 [Stee81]. With the surface area of a 4" wafer
being 12.6 in^2, this indicates a limit of approximately 6W per wafer. Since
the 0.5W / in^2 figure was for printed circuit boards consisting of a number of
separate packages, the application of this estimate to a single package
wafer scale system may not be entirely accurate. Consequently, 6W will be
regarded as an upper bound. In accordance with our conservative design
philosophy, in the following considerations we will attempt to not exceed 50%
of this bound. 3W per wafer will be the target for power consumption.
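The power budget follows from two multiplications. A minimal sketch, with names of our own choosing and the 50% margin expressed as a parameter:

```python
import math

WAFER_DIAMETER_IN = 4.0
BOARD_LIMIT_W_PER_IN2 = 0.5   # forced-air figure for printed circuit boards [Stee81]
SAFETY_FRACTION = 0.5         # stay at or below half of the upper bound

wafer_area_in2 = math.pi * (WAFER_DIAMETER_IN / 2) ** 2
upper_bound_w = wafer_area_in2 * BOARD_LIMIT_W_PER_IN2
target_w = SAFETY_FRACTION * upper_bound_w

print(f"wafer area:  {wafer_area_in2:.1f} sq in")
print(f"upper bound: {upper_bound_w:.1f} W")
print(f"target:      {target_w:.1f} W")
```

The unrounded values are slightly above the 6W and 3W used in the text, which rounds down at each step.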
CMOS technology is the natural choice for reducing power consumption




for any other technology. CMOS circuitry typically runs at a small fraction
of the power consumption of an identical circuit implemented in nMOS
technology.
An additional advantage of CMOS technology is that the static power
consumption of gates is virtually zero. CMOS gates consume power only
when they are changing state. A static gate draws only the current
necessary for the gate leakage current - on the order of a few nanoamps. On
the other hand, with nMOS circuitry, all gates that are "on" continuously
draw an appreciable amount of power.
This is especially advantageous for CHiP processors since they have a
large number of static components. The switch lattice structure remains
fixed for relatively long periods of time. The switch settings remain
unchanged except when the lattice is being reconfigured into a new
interconnection pattern. With CMOS implementation, the switches will draw
essentially no power except during a reconfiguration. Since there are a very
large number of switches on a wafer (~20,000), this results in a large power
savings.
Furthermore, with CMOS technology the power consumption is directly
proportional to the clock rate. The faster the gates change state, the more
power is consumed. This allows the system architect to fine tune the
power consumption by varying the clock rate. System speed can be traded
for power, if necessary.
As a result of the overwhelming advantage with regard to power
consumption and the competitive speed and density characteristics of state





• wafer scale systems be implemented in CMOS technology.
Use of CMOS technology solves the power consumption problem (as will
be shown in this section), but it introduces another difficulty. The estimates
of PE and switch size (Chapter 4) were based on the implementation of the
standard PE in nMOS technology. Implementing an identical design in
another technology will not necessarily result in the layout occupying the
same area. CMOS circuits typically require somewhat more area than their
nMOS counterparts. As a result, a second pass through the design of building
blocks (Chapter 4) should be made for the CMOS implementation of the
standard PE.
However, state of the art CMOS processes require only marginally more
area (e.g. ~10 - 15%) than the corresponding nMOS circuits and in some
cases require slightly less area. Consequently, the CMOS area estimates
depend on the particular design rules of the CMOS process and the details of
the PE design, but in any case, the design of a building block should not
vary drastically from that which was proposed in Chapter 4.
What power consumption requirements are imposed on the individual
PEs and switches by the need to collectively dissipate a total of 3W? The
answer to this question depends on the operation performed by the CHiP
machine. CHiP processors operate in one of two modes:
a) Computational - the switch lattice is held in a fixed structure. The
PEs compute and exchange data values.
b) Restructuring - during a restructuring phase, computation is
generally not performed by the PEs, but rather the structure of the






individual switches each fetch a new current configuration setting from
their local memory.
In a restructuring phase, how much power is consumed by the switches
simultaneously accessing their local memories? To estimate this, we draw
on power consumption figures for available memory chips. Recently
announced 64K static RAMs implemented in CMOS technology have a power
consumption of 10 μW in standby mode and 15 - 200 mW in active mode
[Mina82, Koni82]. The local memory of a switch is in active mode when it is
changing its current configuration setting. When not reading or writing, the
memory is in standby mode. To estimate the power consumption of the
switches, the above power consumption values will be scaled down in
accordance with the size of the switch's local memory.
The PEs are quiescent during reconfiguration. The only power
consumed by a PE is to maintain its local memory in standby mode. The
maximum number of good PEs per wafer is 972 (= 81 x 12). With a 64 byte
PE memory, the standby power consumption of the PEs does not exceed
    972 \times \frac{64 \times 8}{65,536} \times 10 (\mu W) \approx 75 (\mu W)
The PEs consume a negligible amount of power during reconfiguration.
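The scaling rule used here (power of a small memory taken as proportional to its share of the bits in the 64K reference chip) is easy to state in code. The sketch below is illustrative; the helper name is ours, and a 64 byte PE memory is taken as 512 bits.

```python
RAM_BITS = 64 * 1024        # the 64K static RAM holds 65,536 bits
RAM_STANDBY_UW = 10.0       # standby power of the reference chip, in microwatts
PE_MEMORY_BYTES = 64
MAX_GOOD_PES = 81 * 12      # 81 building blocks with 12 PEs each


def scaled_power(bits, chip_power):
    """Scale the reference chip's power by the fraction of its bits used."""
    return bits / RAM_BITS * chip_power


pe_memory_bits = PE_MEMORY_BYTES * 8
pe_standby_uw = MAX_GOOD_PES * scaled_power(pe_memory_bits, RAM_STANDBY_UW)

print(f"standby power of all PE memories: {pe_standby_uw:.1f} microwatts")
```

The unrounded result is just under 76 microwatts, in line with the roughly 75 microwatts quoted above and negligible against a 3W budget.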
Now to estimate the power consumed by the switches during
reconfiguration, note that all switches in the building block fall into one of
three categories (see Chapter 4): unused or faulty, a connecting switch or an
image of a switch in the virtual lattice mapped into the building block. The
connecting switches do not change configuration settings from phase to









reconfiguration necessary to map the virtual lattice into the building block.
As a result, connecting switches are always in standby mode. In contrast,
the image switches change their setting during a reconfiguration and so
must be in active mode.
Each block contains 240 switches. Of these. 21 are image switches. The
remaining 219 switches are in standby mode. With a 9 x 9 grid of building
blocks on a wafer. there are a total of 19,440 (= 240 x 81) switches on the
wafer. 1701 ( = 21 x 81) of these are image switches leaving 17,739 switches
in standby mode. Now. each switch in the building block is of degree eight
so no more than eight memory bits are required to store a switch setting.
With four settings per switch (a typical local memory size) and one register
to hold the current confl.guration setting, there are 40 bits of memory per
switch.
By scaling down the larger (200 mW) of the cited values for active power
consumption for the 64K memory (containing 65,536 bits), we can obtain an
approximate upper bound on the power consumed by the local memory of
the switches. The total power consumption of the image switches (in active
mode) does not exceed
1701 x (40 / 65,536) x 200 (mW) ≈ 208 (mW)
While the image switches are in transition, the connecting switches are
idling along consuming no more than
17,739 x (40 / 65,536) x 10 (µW) = 0.11 (mW)
In total, the switches consume well less than a single watt. Reconfiguring
does not tax the power dissipation capabilities of a wafer.
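The scaling argument above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the original report: the helper name scaled_power and the variable names are illustrative assumptions, and the only inputs are the figures quoted in the text (a 65,536 bit reference memory drawing 10 µW in standby and at most 200 mW when active, a 64 byte PE memory, and 40 bits per switch). It assumes, as the text does, that power scales linearly with memory size.

```python
# Reference 64K static RAM figures quoted in the text.
REF_BITS = 65_536      # bits in the reference memory
STANDBY_W = 10e-6      # 10 microwatts in standby mode
ACTIVE_W = 200e-3      # 200 milliwatts, the larger cited active figure


def scaled_power(units, bits_per_unit, ref_power_w):
    """Scale the reference chip's power linearly with memory size (an assumption)."""
    return units * bits_per_unit / REF_BITS * ref_power_w


BLOCKS = 81                                         # 9 x 9 grid of building blocks
pes = 12 * BLOCKS                                   # 972 PEs per wafer
image_switches = 21 * BLOCKS                        # 1701 switches in active mode
standby_switches = 240 * BLOCKS - image_switches    # 17,739 switches in standby
bits_per_switch = (4 + 1) * 8                       # four stored settings plus the current one

pe_standby = scaled_power(pes, 64 * 8, STANDBY_W)
image_active = scaled_power(image_switches, bits_per_switch, ACTIVE_W)
switch_standby = scaled_power(standby_switches, bits_per_switch, STANDBY_W)

print(f"PE standby:       {pe_standby * 1e6:.1f} uW")
print(f"image switches:   {image_active * 1e3:.1f} mW")
print(f"standby switches: {switch_standby * 1e3:.2f} mW")
```

Under these assumptions the active image switches account for roughly 0.2 W and every other term is well under a milliwatt, which agrees with the conclusion that reconfiguration costs well less than a single watt.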
Now consider the power requirements of a CHiP processor in
computational mode. The switch lattice connections are fixed so all







65,536 10 (µW) = 0.11 (mW)
Switch power dissipation is well less than a milliwatt. This is a negligible
amount. This leaves approximately the full 3W to be consumed by the
processing elements.
It is difficult to estimate the power consumption of a processing
element without knowing all its design details and performing detailed
simulation studies. So, as with the estimates of the memory power
consumption, we will rely on reported power consumption figures for similar
devices. In particular, a team at Bell Laboratories designed and fabricated a
systolic array processor implemented in twin tub CMOS technology and with
several simple PEs per chip. They reported 10 mW / PE power dissipation
[West]. Due to the close similarities of the Bell project and the wafer scale
CHiP processor, we will adopt a 10 mW estimate for the power consumption
of the CHiP processing element.
Processing elements fall into one of three categories: active PEs which
are images of PEs in the virtual lattice, faulty PEs and PEs which are
functional but unused. With four PEs in each virtual lattice and a 9 x 9 grid
of building blocks on the wafer, there is a maximum of 324 (= 4 x 81) active
PEs per wafer. At 10 mW per active PE, just over 3W are dissipated by the
active PEs. With switches consuming a negligible amount of power, the target
power dissipation of 3W is (approximately) met as long as the faulty and
unused PEs consume no power.
Faulty PEs pose no problems. They can be completely disconnected
from the lattice and from the power supply by laser trimming or fuse
blowing. No power need be consumed by any faulty PE.
On the average, there will be a large number of fully functional but
unused PEs. Many of the building blocks will contain more than the
minimum number (four) of PEs required to host the virtual lattice. The
extra PEs in each block will not be used. Of the 972 (= 12 x 81) PEs on the
wafer, approximately 65% are functional. With 324 active PEs, this leaves
972 x 0.65 - 324 ≈ 300 functional but unused PEs. If each of these consumed
10 mW, the total power consumption of the wafer would double. This is
unacceptable.
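The size of the problem can be made concrete with the figures above. This is a short Python sketch under the text's stated assumptions (10 mW per running PE, 65% of PEs functional, four active PEs per block); the variable names are illustrative and do not come from the original.

```python
PE_POWER_W = 10e-3            # adopted estimate: 10 mW per running PE
BLOCKS = 81                   # 9 x 9 grid of building blocks
PES_PER_BLOCK = 12
ACTIVE_PER_BLOCK = 4          # PEs needed to host the virtual lattice
FUNCTIONAL_FRACTION = 0.65    # approximate share of PEs that work

total_pes = PES_PER_BLOCK * BLOCKS                  # 972
active_pes = ACTIVE_PER_BLOCK * BLOCKS              # 324
functional_pes = total_pes * FUNCTIONAL_FRACTION    # about 632
unused_pes = functional_pes - active_pes            # about 308

active_power = active_pes * PE_POWER_W
unused_power = unused_pes * PE_POWER_W

print(f"active PEs: {active_pes}, dissipating {active_power:.2f} W")
print(f"unused functional PEs: {unused_pes:.0f}, "
      f"dissipating {unused_power:.2f} W if left running")
```

The unused PEs would add nearly as much power as the active ones, which is the doubling the text refers to.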
Unlike faulty PEs, it is undesirable to disconnect the functional but
unused PEs. Laser trimming (or fuse blowing) physically severs the links to
a PE. This is irreversible. Once a PE is disconnected, it can not be
reconnected. During the lifetime of the machine, some PEs will undoubtedly
fail. We would like to keep the unused PEs in reserve so they can be
switched in to take the place of a PE that has failed. If functional but unused
PEs are permanently disconnected from the lattice during the initial
configuration of a building block into a virtual lattice then the block is left
without any redundant PEs. It has no fault tolerance. The failure of a single
PE renders the building block faulty which in turn causes the entire column





• components of the lattice which must be tested
• requirements for a complete test
• goals of the testing procedure
The model is at a high level of abstraction. It does not deal with responses
to specific test patterns, the mechanisms of performing the testing, or
details of generating the test data. These factors will vary greatly with
changes in the implementation details of a specific CHiP machine. The
resulting model achieves independence from the myriad of design details
underlying the overall machine architecture. It captures the essential
problems of testing complex lattices of PEs and switches without being tied
down to specific implementations of the components. This allows formal
descriptions of testing algorithms without excessive and obfuscating detail.
a) Definitions
In this section certain key concepts concerning testing and the lattice
structure are precisely defined. This replaces intuitive notions of testing
and testability with sharply defined and delimited concepts. Through this
approach, the fault coverage of a testing procedure can be formally
determined, and the correctness of a testing algorithm can be proven.
There are two actors in the testing process:
a) Processing element being tested, also referred to as the unit under
test (UUT).
b) Testing device (TD). This controls the UUT, applies test signals to the





The testing device may be external to the lattice - a separate and
independent device. It may be special purpose testing equipment such as a
programmable logic analyzer or a general purpose computing device such
as the CHiP controller. An external device can access the component being
tested directly by probing the bonding pads of the UUT. Indirect access is
also possible. A subset of the switches on the lattice edge, gateways, are
connected to bonding pads. The external device can access the UUT via a
path of switches originating at a gateway.
In addition, the testing device can be another PE in the lattice. In this
case, a small subset of the PEs are initially tested by an external testing
device. The PEs found to be functional are used to test their neighbors
which in turn test other PEs, etc. This is a self testing strategy which is
initiated by a limited amount of external testing.
A single testing step consists of three distinct phases:
1) Generation of test data - the input test pattern to the UUT and the
correct response.
2) Application of the input test pattern to the UUT.
3) Evaluation of the response. This most commonly consists of
comparing the response to the known, correct value. Other
characterizations of the response such as number of 1's (bit count) and
number of transitions from 0 to 1 (transition count) can also be used.
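The evaluation phase can be illustrated with a small Python sketch. The function names and the bit-list representation of a response are assumptions made for illustration; the original does not prescribe any particular encoding.

```python
def ones_count(bits):
    """Bit count: the number of 1's in the response."""
    return sum(bits)


def transition_count(bits):
    """Transition count: the number of 0 -> 1 changes between successive bits."""
    return sum(1 for prev, cur in zip(bits, bits[1:]) if prev == 0 and cur == 1)


def evaluate(response, expected):
    """Compare a response with the correct value in the three ways named above."""
    return {
        "exact_match": response == expected,
        "ones_match": ones_count(response) == ones_count(expected),
        "transitions_match": transition_count(response) == transition_count(expected),
    }


expected = [0, 1, 1, 0, 1]
response = [0, 1, 0, 1, 1]   # two bits swapped relative to the correct value
print(evaluate(response, expected))
```

In this example the corrupted response has the same bit count and the same transition count as the correct one, so only the exact comparison detects the error. The compact characterizations need less reference data but can miss some faults.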
A testing step is an exchange of signals between the testing device and the
UUT. The TD initiates the testing step by presenting an input pattern to the




input pattern may include instructions for the UUT to execute. Thus a
typical testing step starts with the TD downline loading the UUT with a small
program segment and input data. The UUT executes the code while the TD
monitors the output and halts the UUT at the completion of the testing step.
An individual testing step can verify that the UUT correctly executes a
single program segment. A test of a component is a sequence of testing
steps which thoroughly exercises the component and provides adequate
fault coverage. A test is successful only if every testing step succeeds.
Some basic lattice terminology will be introduced. Processing
elements have a port at each compass point, NSEW, through which the PE
can communicate with its neighbors. Each switch is also connected to its
four neighbors. A configuration setting specifies which pair of incident
datapaths the switch will connect. There are six possible switch settings
(NW, NE, SW, SE, NS, EW). Each setting is denoted by the pair of compass
points that are connected. Lattice elements are matrix numbered with zero
origin. A path through the lattice is a connected sequence of switch settings
with, optionally, a port on either end. The components of the path are the
individual switch settings and ports. When the specific switch settings are
unimportant, a "generic" path as in Figure 6.2.1 will be specified where the
setting of switch S is assumed to be the one required to connect path
segment P to PE[i,j].
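The six settings follow directly from choosing two of the four compass points. The Python sketch below shows one possible encoding; the names SIDES, SETTINGS, other_side and settings_on_side are hypothetical and chosen only for illustration.

```python
from itertools import combinations

SIDES = "NSEW"

# Every unordered pair of sides is one switch setting: six in all.
SETTINGS = ["".join(pair) for pair in combinations(SIDES, 2)]


def other_side(setting, entry):
    """Side by which a signal leaves a switch after entering on `entry`."""
    if entry not in setting:
        raise ValueError(f"setting {setting} does not connect side {entry}")
    return setting.replace(entry, "")


def settings_on_side(side):
    """The three settings that are incident upon one side of a switch."""
    return [s for s in SETTINGS if side in s]


print(SETTINGS)
print(other_side("NW", "W"))
print(settings_on_side("W"))
```

Each side of a switch is served by exactly three settings, which is why a port attached to that side has three access routes through the switch.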
b) Testable Components
Datapaths are not explicitly tested but rather are tested in conjunction








Figure 6.2.1 - Example of a Generic Path
components connected to the datapath. For example the datapath fault
results in faults in PEE and the NW, EW and SW settings of switch SW.
An intrinsic fault is caused by a defect within the lattice element which
causes erroneous behavior. Broken wires or shorted transistors are
examples. Any component incident upon an intrinsic fault is also faulty. If
the East port of a PE is faulty, so is the West side of the adjacent switch.
Settings SNW, SEW and SSW are termed connectivity faults since they are
attached to an intrinsic fault.
Each of the six switch settings are considered independent and can be
individually good or bad. Analogously, ports are independent. For any given
PE, some of its ports can be functional and others faulty.
Both switches and PEs have internal mechanisms in addition to their
observable communication behavior. A switch must be able to latch new
settings sent to it and select amongst those settings stored in its local
memory. A PE consists of a processor which interprets the PE's instruction
set and four ports. A PE must correctly execute its full instruction set and
have a fault free memory. A failure in the internal mechanism of a switch
or the processor of a PE causes all its settings or ports to be considered
faulty. Each individual component is good only if the internal mechanism is
fully functional. There is no point to communicating with a faulty PE nor
using a switch which can not reliably select its setting.
Testing the internal mechanism of PEs and switches will be implicitly
assumed. When "test West port" is specified in a testing algorithm, it is
assumed that the first port of a PE that is tested also includes a full test of
the internal mechanism of the PE; similarly for switches. As a result, we can
be concerned with only testing ports and switch settings.
Switches are not directly accessible from testing devices. A switch
setting is tested by establishing a path between two PEs (or external testing
devices) and performing the sequence of testing steps required to fully
exercise the switch and the datapath. In general, a path may contain more
than one untested switch setting. Consequently, a failure of the test along a
path will not necessarily pinpoint the faulty component. In fact, there may
be more than one defective device on the path. Hence, a test can, in
general, verify the functionality of components but an unsuccessful test
requires that tests along additional paths be performed to locate the
fault(s). In summary,
• a switch setting is functional if it is on the path of a successful test
• a port can communicate if it is on the terminating end of a path which
is successfully tested. A port is functional if it can communicate and
the internal mechanism of the PE functions correctly. To conclude that
a port can not communicate, it must be impossible to successfully test
the port via any of the three access routes into the port (see Figure 6.2.2).
If a test along any one of these access routes is good then the port is good.
In general, an unsuccessful test along a path with more than one
untested component does not provide any new information on the status of
the untested components. Any combination of untested components of the
path may be faulty. A single test does not separate the possible
combinations of faults. One important example is:






Figure 6.2.3 - Testing a Port
Lemma 2.1a - Given the path in Figure 6.2.3 with path segment
P1 S" S'NS good and both SNW, PE1E untested, the status of SNW is
determined by the test along the path
P = P1 S" S'NS SNW PE1E
independently of the status of PE1E.
Proof -
Case 1 - PE1E is good.
If SNW is good then all components of path P are good and the test along
P succeeds. Otherwise the test fails.
Case 2 - PE1E is bad.
The test along path P will fail. This is correct since SNW is a connectivity
fault.
QED
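The two cases of the proof can be tabulated mechanically. The Python sketch below enumerates all four combinations of the two untested components; the function names are illustrative assumptions, and the rule that a working setting attached to a faulty port counts as a connectivity fault is taken from the definitions earlier in this section.

```python
from itertools import product


def path_test_succeeds(snw_ok, pe1e_ok):
    """The test along P passes only if both untested components work."""
    return snw_ok and pe1e_ok


def snw_status(snw_ok, pe1e_ok):
    """Status assigned to SNW, with the reason for a FAULTY verdict."""
    if not snw_ok:
        return "FAULTY (intrinsic)"
    if not pe1e_ok:
        return "FAULTY (connectivity)"
    return "GOOD"


cases = []
for snw_ok, pe1e_ok in product([True, False], repeat=2):
    passed = path_test_succeeds(snw_ok, pe1e_ok)
    status = snw_status(snw_ok, pe1e_ok)
    cases.append((snw_ok, pe1e_ok, passed, status))
    print(f"SNW ok={snw_ok!s:5} PE1E ok={pe1e_ok!s:5} "
          f"test passes={passed!s:5} SNW is {status}")
```

In every row the test passes exactly when SNW is GOOD, so the single test outcome fixes the status of SNW without any separate knowledge of PE1E.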
The above lemma is easily generalized to
a) any port of the PE
b) allowing S' to occupy any position adjacent to S and S" any position
adjacent to S'
This generalization is stated somewhat informally:
Lemma 2.1 - Any switch directly connected to a port can be tested
independently of the status of the port if there exists a good path from
a gateway to the switch.
c) Goals of a Testing Procedure
In a CHiP machine, every component must be fully operational.
However, the switches and PEs fabricated on the wafer may be only partially
functional. In a PE, the processor may work but one of the ports may be
dysfunctional. Also a switch may have only a (proper) subset of its settings
working correctly.
Partially functional components may serve a useful function in a fault
tolerant machine although they will not be an integral part of the virtual
CHiP lattice. A partially functional switch may serve as a connective switch
providing a path between two fully functional components. Additionally, a
PE with at least two good ports may be used in the self testing of the lattice.
As a result, a go/no-go test for PEs and switches is insufficient. The
goal of any testing procedure is to provide fault location at the component
level. It is necessary to know which settings of every switch and ports of
every PE are good even though the device may only be partially functional.
Below the component level, fault detection is sufficient. For example, if
a switch setting is bad, it is not necessary to know which particular
transistor(s) are defective. If the processor of a PE is faulty, knowing
whether the memory, datapath or control logic is the culprit is unimportant.
Furthermore, the testing algorithm must provide complete
component-level resolution. It is unacceptable to have otherwise functional
components reported as faulty due to limitations of the testing algorithm in
resolving the source of errors.
In addition to providing reliable, component-level fault location, any
testing algorithm must be efficient. A wafer scale CHiP machine is a very
large collection of components. A processor fabricated on a 4" wafer
consists of over 20,000 switches and 900 PEs. An inefficient testing
algorithm will be computationally intractable.
3. Lattice Testing
Given an arbitrary port in the lattice, what are the requirements for
testing it? First, the port must be connected to a testing device. An
external testing device can access the port via a functional path from a
gateway to the port. The port may also be tested by another PE in the
lattice. But this PE doing the testing must be previously tested. So the
testing PE must have a functional path to a gateway or to another PE which
in turn has a path to a gateway or ... As a result, only regions of the lattice
which are connected to a gateway can be tested but with the connecting
path allowed to pass through intervening PEs. If a component is not
accessible from a gateway, it is untestable and hence considered faulty.
A region may be functional but if it is disconnected from the remainder
of the lattice, there is no way to use the region; it can not communicate with
the other PEs. So this testing assumption that inaccessible regions are
faulty does not cause the loss of otherwise usable PEs.
Secondly, it would be ideal if all the components on the path to the
gateway were known to be functional. A successful test verifies the
functionality of all components on the path. But an unsuccessful test, in





by a single test only if there is exactly one untested component on the path.
Otherwise, additional tests (perhaps a large number) are required to isolate
the defective component.
However, testing a port and testing the switch to which it is connected
can not be separated. A port can not be accessed independently of its
switch. Similarly, the West side of switch S can be tested only by being on a
path that terminates at the PE. The switch and the port are mutually
coupled with respect to testing. They must be simultaneously tested.
Because of this coupling, the primitive unit that will be tested is a port
pair: two adjacent ports and the intervening switch (Figure 6.3.1). A single
port and its switch could have been chosen but, as will be seen, testing can
proceed by pairs of ports as easily as individual ports.
What are the requirements to be able to test a port pair? To test ports
PE1E and PE2W through S' (see Figure 6.3.1), there must be a functional path
from S' to a gateway, and S must complete the connection from S' to each
port. When these two conditions are met, we say that S' is a hook since it
allows the testing device to latch onto the port pair for testing. In the worst
case (e.g. a faulty port), testing a port requires that all three access routes
into the port be attempted. So, both S' and S" must be hooks for the port
pair. We say S' and S" are a hook pair for the port pair. Furthermore, since
both the North and South sides of switch S must be accessible from a
gateway in order to fully test the ports, the existence of a hook pair is the





Figure 6.3.1 - Testing a Port Pair
The following algorithm can be used to test a port pair.
Port Pair Test Algorithm
Input - a port pair (see Figure 6.3.1) with a hook pair S' - S" and test
paths to a gateway P' and P".
Output - status of all components in the port pair.
Remarks - Initially, all components are marked FAULTY. If a test
succeeds, all components on the test path are marked GOOD.
Mark all components FAULTY.
T1: test SNS via path P' S' SNS S" P"
(The following paths all use the segments P' S' or P" S". They will be
omitted for clarity.)
T2: test PE1E via SNW
T3: test PE1E via SSW
T4: test PE2W via SNE
T5: test PE2W via SSE
T6: test along path PE1E SEW PE2W
Test T1 exercises the NS setting of switch S which is connected to the
testing device via path P' and switch S' and path P" and switch S". The four
test paths in T2 - T5 have only one connection to the testing device. The





After the completion of tests T1 - T5, the only untested component is the
EW setting of switch S. This test is qualitatively different from the others
since it is by necessity a self test; the two ports must test the setting
themselves. Self testing is possible only if both ports are functional. The
code for test T6 is downline loaded into each PE via one of the functional
paths found in tests T2 - T5, test T6 is performed and the PEs report the
results back to the testing device.
Theorem - If S' and S" are a hook pair for R (with test paths P' and P"
respectively, see Figure 6.3.1) then the PE Pair algorithm tests
PE1E, PE2W and all settings of S, despite faults.
Proof - SNS is tested since the path P' S' SNS S" P" contains only good
components except for SNS. No other components in R affect this path
so faults in other components will not alter the testability of SNS.
By Lemma 2.1, SNW, SSW, SNE and SSE are tested regardless of the status of
the incident port. No other components in R can affect the test paths
used to test these settings so SNW, SSW, SNE and SSE are tested despite
faults in the port pair.
We must show that PE1E, SEW and PE2W can be tested despite faults in R.
Consider the situation immediately before T6 in the algorithm, and let P
be the path PE1E SEW PE2W.
Case 1 - In the previous testing steps T1 - T5, we found both ports to be
good. We need only test SEW. Path P tests SEW since the other two
components in the path are good.
Case 2 - One port is good and the other could not be accessed by either
of the paths attempted. Assume PE1E is the port known to be good.
Now, path P tests SEW and PE2W simultaneously since we have tried both
other access routes into PE2W (i.e. SNE and SSE). If this one fails then
PE2W is faulty and SEW is a connectivity fault.
Case 3 - Neither port has been accessed by either of the paths
attempted. A test along P tests all three components simultaneously.
SEW is good only if both PE1E and PE2W are good. This is the last access
route into either PE so this is the last chance to be able to
communicate with either port. Either all three components are good or
all three are bad.
QED
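The marking rule of the port pair test can be simulated in a few lines. The Python sketch below is an illustration under stated assumptions rather than the author's procedure: tests T2, T3 and T5 are not legible in this copy and are assumed to mirror T4, the segments P' S' and P" S" are assumed good (as a hook pair guarantees), and a test is modeled as passing exactly when every component on its path is intrinsically good. All identifiers are hypothetical.

```python
# Components of the port pair R: six settings of switch S plus the two ports.
SETTINGS = ["NS", "NW", "SW", "NE", "SE", "EW"]
PORTS = ["PE1E", "PE2W"]

# Components inside R exercised by each test. T2, T3 and T5 are assumed.
TESTS = {
    "T1": ["NS"],
    "T2": ["NW", "PE1E"],
    "T3": ["SW", "PE1E"],
    "T4": ["NE", "PE2W"],
    "T5": ["SE", "PE2W"],
    "T6": ["PE1E", "EW", "PE2W"],   # self test between the two ports
}


def port_pair_test(intrinsically_good):
    """Mark everything FAULTY, then mark GOOD whatever lies on a passing test path."""
    status = {component: "FAULTY" for component in SETTINGS + PORTS}
    for path in TESTS.values():
        if all(component in intrinsically_good for component in path):
            for component in path:
                status[component] = "GOOD"
    return status


everything = set(SETTINGS + PORTS)
print(port_pair_test(everything))
print(port_pair_test(everything - {"PE1E"}))
```

With PE1E removed from the good set, the settings NW, SW and EW are reported FAULTY along with the port itself. That matches the connectivity fault convention, while NS, NE, SE and PE2W are still reported GOOD.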
Now consider testing the entire region surrounding a PE - a PE square
(see Figure 6.3.2). The "internal" settings of the square are tested by
combining four port pair tests as in Figure 6.3.5, thus forming a cross test.
Theorem - If each pair of corner switches, C1 - C4 (see Figure 6.3.6), is a
hook pair for the intervening port pair and P1 - P4 are functional paths
to the gateway for the corresponding corner switches which
a) do not intersect the PE square
b) do not pairwise intersect
then the internal settings of a PE square can be completely tested.
Figure 6.3.2 - Testing a PE Square
Proof - Apply the port pair test algorithm to the four port pairs. By the
previous theorem and the assumption that there are hook pairs for each
port pair, this tests all four ports of the PE and the four associated switches.
All that remains to test is the inside settings of the four corner switches C1 -
C4. By symmetry, we need consider only one of these. Choose C1SE. If there
are functional test paths from both the West side of switch S2 and the North
side of S1, C1SE can be tested via these paths. If neither switch has a test
path, C1SE is untestable and hence faulty. Finally, assume there is no test
path from one switch. Let it be S1. As a result, it is impossible to test any
setting of C1 incident upon the North side of S1. So C1SE is faulty.
How are the test paths from the switches found? Consider switch S2. If
there is a good connection from S2 to C2, path P2 suffices. Otherwise one of
the PEs to the North or South of switch S2 must be the terminating point of
a path. If neither of these is functional then the path from C1 runs into a
dead end and so is untestable and faulty. Similarly for S1. There are three
possible paths from each of S1 and S2 so at most nine paths need be tested.
QED
Note that when PE square tests of adjacent squares are composed, the
untested switch settings are precisely those that are tested by the adjacent
PE squares. The "external" switch settings of a PE square are precisely the
"internal" settings of a neighboring square. Consequently, the cross tests
can be composed leaving only the settings on the outer edge untested. But it
is the outermost edge of switches which is accessible to the external testing





Theorem - Given any lattice, if all the corner switches are hook pairs for
the four neighboring PE pairs with non-intersecting test paths to a
gateway, the lattice can be completely tested, despite faults.
Proof - Consider a single square. By the above theorem, the square is
completely tested (despite faults) except for the corner switches.
Consider the four neighboring squares which form a 5 by 5 lattice.
Perform cross tests independently at each of the four squares. The
corner switch SW at the center of the lattice has all right angle settings
tested since it is a member of all four squares. We must test SWNS and SWEW.
Consider SWNS. If there are (non-intersecting) test paths from SWN and
SWS then SWNS can obviously be tested. Otherwise, there is no test path
from at least one direction, North or South. Let it be North. Then SWN is
dead by the definition of the testability of a switch and SWNS is a
connectivity fault.
Similarly for SWEW. Hence SW can be completely tested. Consequently,
when composing groups of four squares, all components are completely
tested except for the corner switches on the edge of the region.
We could similarly show that composing four of the 5 by 5 regions yields
a 9 by 9 region with all components completely tested except for the
corner switches on the edge of the region. By induction, we can show
this holds for any 4n+1 by 4n+1 lattice segment.
Clearly, the corner switches on the edge of a chip can be tested by the
external testing device and the neighboring switches (which are already




If a lattice is of dimension m < 4n+1 for some n, it is clear that it can
be tested in the same manner we would test a 4n+1 by 4n+1 lattice but
with the external testing device filling in for the PEs of the larger lattice
which fall outside the boundaries of the smaller m by m lattice.
Conclusion - a lattice of any size can be tested.
QED
So far, we have shown that if we have hook pairs then the lattice can be
tested. How do we determine that S' and S" are a hook pair? Just as testing
ports and their adjacent switches are mutually coupled, so are checking for
a hook pair and testing the port pair. The existence of functional paths P'
and P" can be determined independently of the status of the port pair.
However, the connection from S' and S" to S must obviously involve S.
Additionally, completing the connection from S to the ports requires that all
components of the port pair are involved in the hook pair test. If portions of
the port pair are faulty, we may not know whether or not we have a hook
pair. This makes it impossible to know if the fault is at the S' - S or S" - S
connection or within the port pair. In conclusion, testing for a hook pair and
testing the port pair are inseparable.
Algorithm - Locating a Hook Pair
Input - a port pair with P' and P" candidate paths to the lattice edge
(see Figure 6.3.1).
Output - S' - S" a hook pair? YES/NO returned.







C1: if T1 is successful then YES
Otherwise,
T2: test along the path P' S' SUL PEE
T3: test along the path P' S' SUR PEW
C2: if neither T2 nor T3 succeeds then NO
T4: test along the path P" S" SLL PEE
T5: test along the path P" S" SLR PEW
C3: if T4 or T5 succeeds then YES else NO
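The control flow of statements C1 - C3 can be written as a short decision procedure. The Python sketch below is an illustration only. Test T1 is missing from this copy of the algorithm, so its path is an assumption based on how the proof describes it (a path from P' through S', S and S" to P"). The oracle function is a hypothetical stand-in for the testing device, and the component labels are invented for the example.

```python
# Components on each test path. The T1 entry is an assumption; T2 - T5
# follow the paths listed in the algorithm.
TEST_PATHS = {
    "T1": ["P'", "S'", "S.NS", 'S"', 'P"'],
    "T2": ["P'", "S'", "S.UL", "PE_E"],
    "T3": ["P'", "S'", "S.UR", "PE_W"],
    "T4": ['P"', 'S"', "S.LL", "PE_E"],
    "T5": ['P"', 'S"', "S.LR", "PE_W"],
}


def locate_hook_pair(run_test):
    """Return True (YES) or False (NO); run_test(name) reports one test's outcome."""
    if run_test("T1"):                            # C1
        return True
    if not (run_test("T2") or run_test("T3")):    # C2
        return False
    return run_test("T4") or run_test("T5")       # C3


def oracle(faulty):
    """Hypothetical testing device: a test passes if its path avoids `faulty`."""
    return lambda name: not any(c in faulty for c in TEST_PATHS[name])


print(locate_hook_pair(oracle(set())))        # no faults at all
print(locate_hook_pair(oracle({"S.NS"})))     # straight-through setting broken
print(locate_hook_pair(oracle({'P"'})))       # second candidate path broken
```

A broken NS setting still yields YES because each of S' and S" can reach a port through a right angle setting, whereas a broken candidate path yields NO.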
Theorem - Given a port pair with candidate test paths P' and P" which
do not intersect, the Hook Pair algorithm is a decision procedure for
the predicate
Q = (P' good) & (S' and S" are a hook pair for R) & (P" good)
Proof - A. We must show that if the algorithm returns YES then Q is true.
Consider statement C1 of the algorithm. If T1 is successful then we know
P' and P" are good and we have verified that both S' and S" have a good
setting which connects the test path (P' or P") to SW. Consequently, S'
and S" are hooks for R. By definition P' and P" do not intersect so S' and
S" are a hook pair for R.
Consider statement C2. If either T2 or T3 succeeds then we know P' is
good, and we have verified that the setting of S' connecting P' and SW is
good. Consequently, S' is a hook for R.
Similarly for T4 and T5.
If we reach statement C3 and either T4 or T5 succeeds then both S' and
S" are hooks for R. Since P' and P" do not intersect, S' and S" are a hook
pair for R.
B. We must show that if Q is true then the algorithm returns YES.
Assume Q is true. We then know P' and P" are good and S' and S" are
each hooks for R. Consider S'. There must be a good setting of SW which
completes a path to either S" or a good port. There are three settings
of SW incident upon S'. The algorithm attempts paths with all three so
it will locate the complete path and one of the tests T1, T2 or T3 will
succeed. Similarly for S" so either T1, T4 or T5 will succeed.
Consequently, the algorithm must terminate at either C1 or C3. Both of
these statements report YES.
QED
What have we accomplished so far? We have reduced the problem of
testing the lattice in the presence of faults to locating pairs of hooks. The
above theorem reduces this problem to finding pairs of non-intersecting test
paths.
test lattice < locate hook pairs < locate test paths
The first reduction is not strictly true since we have considered only the
subproblem of testing the lattice when all corner switches are hook pairs for
the neighboring PE pairs. Testing a square with an incomplete set of hook
pairs will be considered in a separate section of this paper.
We next examine the problem of locating all possible test paths from a
given lattice element.
Theorem - Given a lattice element, there are only a finite number of
candidate test paths from the element.
Proof Outline - Paths do not have cycles.
At each lattice element along a path, there are only 3 choices for the
successor.
The number of lattice elements is finite.
=> the number of possible test paths < 3 ** (number of lattice
elements)
QED
In addition to being finite, the set of all candidate test paths from a
given lattice element can be listed.
Algorithm - Enumerate all candidate test paths
Outline of Method - Tree Traversal Algorithm
At each component along a path, there are three possibilities for its
successor. Faulty components or components already on the path are
not legal successors. A path terminates at any port or a switch on the
lattice edge.
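The tree traversal outlined above can be sketched as a depth-first search. The Python code below is a simplified illustration, not the author's implementation: it models the lattice as a plain grid of cells, treats every edge cell as a terminating switch, and takes the sets of faulty cells and port cells as inputs. The function name candidate_paths and its parameters are assumptions, and the optional max_len argument corresponds to limiting the maximum length of a test path.

```python
def candidate_paths(start, rows, cols, faulty=frozenset(), ports=frozenset(),
                    max_len=None):
    """Yield every cycle-free path from `start` that ends at a port or an edge cell."""
    if max_len is None:
        max_len = rows * cols   # a simple path can never be longer than this

    def is_terminal(cell):
        r, c = cell
        return cell in ports or r in (0, rows - 1) or c in (0, cols - 1)

    def extend(path):
        cell = path[-1]
        if len(path) > 1 and is_terminal(cell):
            yield list(path)    # copy, because `path` is mutated while backtracking
            return
        if len(path) >= max_len:
            return
        r, c = cell
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            in_lattice = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            # Faulty cells and cells already on the path are not legal successors.
            if in_lattice and nxt not in faulty and nxt not in path:
                path.append(nxt)
                yield from extend(path)
                path.pop()

    yield from extend([start])


for p in candidate_paths((1, 1), rows=3, cols=3, faulty={(0, 1)}):
    print(p)
```

Because the predecessor is already on the path, each interior cell offers at most three successors, which is the branching factor used in the bound of the preceding theorem.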
The key to efficient testing algorithms is quickly enumerating
candidate test paths. This can be done by:
1. Testing from the edges of the lattice inward.
2. Limiting the maximum length of a test path.
We will show algorithms for testing without considering their efficiency.
The Hook Pair test applies to a given pair of test paths. If we
enumerate all good test paths from SW' and SW" and apply the Hook Pair
algorithm to all pairs of good test paths, we can determine if SW' and SW"
are a hook pair for R.
Algorithm - Complete PE Pair test
Given sets of good test paths from SW' (S1) and SW" (S2) not
intersecting R,
for every path in S1 do
    for every path in S2 do
        if the paths do not intersect each other then
            execute Hook Pair alg
            if algorithm returns YES then
                S' and S" are a hook pair for R
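The nested loop above translates directly into code. The Python sketch below assumes that each path is a list of lattice element identifiers and that hook_pair_test is a caller-supplied function implementing the Hook Pair algorithm for one pair of paths; both conventions, and the function name itself, are illustrative assumptions.

```python
def complete_pe_pair_test(paths_1, paths_2, hook_pair_test):
    """Return the first non-intersecting pair of paths that forms a hook pair, else None."""
    for p1 in paths_1:
        for p2 in paths_2:
            # Only pairs of paths with no lattice element in common qualify.
            if set(p1).isdisjoint(p2) and hook_pair_test(p1, p2):
                return p1, p2
    return None


# Toy usage with a stand-in for the Hook Pair algorithm.
paths_from_first = [["a", "x"], ["b"]]
paths_from_second = [["x", "c"], ["d"]]
found = complete_pe_pair_test(paths_from_first, paths_from_second,
                              lambda p1, p2: p1 == ["b"])
print(found)
```

The search stops at the first successful pair. In the worst case it runs the Hook Pair algorithm once for every pair of paths, which is why the text sets efficiency aside at this point.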








1. Summary of Results
The key problem in the implementation of wafer scale integration is
structuring the wafer so that only the functional PEs are connected
together. A methodology, the two level hierarchy, that efficiently and
economically solves the structuring problem for CHiP processors has been
presented. The principal elements are the use of column exclusion with high
yield building blocks that contain redundant components. This approach
limits the performance degradation due to structuring and allows the
structuring problem to be solved with tractable computational effort.
Since the yield of building blocks must be high for the two level
hierarchy to be a practical approach, yield phenomena were investigated in
detail. A model of the integrated circuit manufacturing process was
developed that predicts circuit yield and the probability distribution of
manufacturing defects. These results were applied to the analysis of
parallel processors in which several PEs occupy a single chip. In addition,
they were used to design the building blocks meeting the requirements of







It was shown that these building blocks can be assembled into a wafer
scale CHiP processor. With current technology, it is possible to fabricate a
wafer scale system with 250 to 300 PEs. This represents a truly large
parallel machine. Furthermore, this machine is highly robust to faults
occurring during the machine's lifetime, consumes a manageable amount of
power and can be efficiently tested.
Although the techniques for implementing wafer scale integration were
developed for CHiP processors, they can be applied to other systems
composed of uniform parts. This generalization is discussed in the following
section. Furthermore, building blocks are useful on their own; they need
not be assembled into a wafer scale system. A generalization of the design
methodology used for building blocks is shown (section 3) to increase the
maximum allowable chip area and thus increase the number of components
per chip.
2. Implementation of General Wafer Scale Integration
The techniques described above for implementing wafer scale
integration are not restricted to CHiP processors. The methodology benefits
from the fact that the mechanism needed for structuring, the switch lattice,
is an integral part of the CHiP architecture. Although this simplifies the
work, it is not necessary. The method is entirely general. It can be applied
to other systems composed of uniform parts.
As long as a system can be subdivided into modular and independent
parts, the switch lattice can provide the flexible interconnection network
required to route around faulty components. The settings of the switches
can be fixed. Switches can be used solely for connecting the functional
processing elements. Thus a parallel processor with a fixed interconnection
structure can be fabricated. A wafer scale processor with a mesh, perfect
shuffle, etc. interconnection topology can be implemented by embedding it
within a wafer scale CHiP processor. The switch lattice simply remains in a
static configuration.
Furthermore, the processing elements can be replaced by other
components to implement a wafer scale system other than a parallel
processor. For example, by replacing each PE by a 4K static RAM, a 3 Mbit
wafer scale memory can be fabricated with existing technology [Egaw79,
Lea79]. Additionally, the problems of address decoding, bit line driving, etc.
must be solved, but the basic mechanism for connecting the individual
storage modules can be based on the methodology for wafer scale CHiP
processors.
3. Restructurable Design Methodology
Previously (section 2.5b) it was shown that redundancy can
substantially reduce the manufacturing cost of a chip by increasing its
yield. This suggests that building blocks with redundant components are
useful on their own. A wafer can be scribed into the individual blocks which
can be used as components of a larger system. The yield increase due to
redundancy makes this a cost effective approach.
An alternate usage of redundancy is, without changes in the fabrication
technology, to increase the maximum number of gates per chip. With fixed
transistor size, wire width, etc., the integration level can be increased
through the use of redundancy and restructurable circuitry. Furthermore,
this design methodology (which was used for building blocks) can easily be
generalized to apply to any system that can be divided into independent
modules. These generalizations will be explored below.
There are three ways of increasing the number of components that can
be fabricated on a single chip: increase chip area, improve circuit design, or
reduce the size of the individual components. This work uses the first
approach. The design methodology presented allows chips of larger area to
be manufactured with acceptable yield.
What limits the size of a chip? Economics. It is prohibitively costly to
manufacture very large chips. The manufacturing cost of a chip has three
primary components.
total chip cost = processing cost + packaging cost + testing cost
As a first approximation, packaging and assembly costs are independent of
the function performed by the chip, although they will increase slightly as
the number of external connections to the chip increases. Similarly, test
costs increase much more slowly than the complexity of the chip being
tested, although sophisticated and high speed test equipment may be
required. Thus, for larger and more complex chips, the packaging and test
costs are approximately constant [Noyc77].
The cost of processing a wafer is independent of the number or type of
chips patterned on it, so the processing cost per chip is inversely
proportional to the number of good chips that share the wafer cost. The
cost of a chip then depends
primarily on its yield. A typical yield curve (Figure 2.2.1) shows that yield
declines quickly with increases in area. For large chips, the number of good
chips drops rapidly pushing up their cost.
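
The cost argument above can be made concrete with a small numerical sketch. It assumes a Price-style yield of the form Y = (1 + A*D)^(-k) for chip area A, defect density D and k defect classes. Equation 2.2 is not reproduced in this section, so this form, the function names and every numeric value below are illustrative assumptions rather than figures from this work.

```python
def price_yield(area, defect_density, num_classes=1):
    """Assumed Price-style yield: Y = (1 + A*D)^(-k)."""
    return (1.0 + area * defect_density) ** (-num_classes)


def chip_cost(area, wafer_area, wafer_cost, defect_density,
              num_classes=1, packaging_cost=1.0, testing_cost=1.0):
    """total chip cost = processing cost + packaging cost + testing cost."""
    chips_per_wafer = wafer_area / area            # ignores edge losses
    good_chips = chips_per_wafer * price_yield(area, defect_density, num_classes)
    processing_cost = wafer_cost / good_chips      # good chips share the wafer cost
    return processing_cost + packaging_cost + testing_cost


if __name__ == "__main__":
    # Hypothetical units: area in cm^2, defect density in defects/cm^2.
    for area in (0.1, 0.25, 0.5, 1.0, 2.0):
        cost = chip_cost(area, wafer_area=100.0, wafer_cost=200.0,
                         defect_density=2.0)
        print(f"area={area:4.2f}  yield={price_yield(area, 2.0):.3f}  cost={cost:.2f}")
```

Under these assumptions the packaging and testing terms stay fixed while the processing term grows faster than linearly in area, which is the behavior that produces a yield limit.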
The exact yield at which point it is no longer feasible to manufacture a
chip depends on the actual packaging, test and wafer processing costs. But
for any fabrication process this point does exist, and it corresponds to a
specific chip area. This is the yield limit of the technology. It is not
economically feasible to fabricate chips of area larger than the yield limit.
The fact that the yield declines quickly as a function of area causes a strict
bound to be placed on the maximum allowable chip area. Exceeding this
bound results in rapidly escalating chip cost. By reducing the rate of decline
of Y, the yield limit will be extended allowing chips of larger area.
The cause of the rapid decline in yield is that a single defect renders
the chip unusable. A defect may be introduced by any of the critical
fabrication steps. It makes no difference in which step the defect is
introduced, the end result is the same - a faulty chip. Consequently, in the
yield equation (equation 2.2), there is a multiplicative effect of multiple
processing steps; each step eliminates a fraction of the chips. The situation
is analogous to tight rope walking - one slip and the game is over.
The slope of the yield curve can be lessened by decreasing the
defect density or the number of defect classes, k. In effect this introduces a
more error free manufacturing process or reduces the number of
fabrication steps. However, we have assumed a fixed technology. These
modifications are not permitted. An alternative is to design fault tolerant
chips. By introducing redundancy into the chip design, one or more defects
can be absorbed, and the chip will still be functional.
What can be gained by designing chips with redundant modules? The
maximum number of components per chip (which is determined by the
maximum chip area since we have assumed fixed technology) is determined
by the yield limit of the particular fabrication process. By adding redundant
modules to this chip of maximum size, its yield can be increased resulting in
lower cost (see Figure 7.3.1). Alternatively, by keeping cost constant, a more
complex device can be fabricated. A device with yield below the yield limit
can, through redundancy, have its yield increased to an acceptable level. In
effect,
• use of redundancy allows the technology imposed yield limit to be
surpassed.
The size and complexity of semiconductor devices spans a vast
spectrum from SSI chips containing a few gates to wafer scale devices
occupying vast amounts of silicon real estate (see Figure 7.3.2). Devices
whose complexity and area surpass the yield limit are termed Ultra Large
Area Chips or ULACs for short. They are not characterized by any absolute
size since the position of the yield limit in the spectrum is technology
dependent. The demarcation between conventional chips and ULACs is the
requirement of fault tolerance to meet acceptable chip yield and cost.
(Note that the concept of "acceptable" yield is inherently imprecise.
Low yield (and hence high cost) may be acceptable for a new product
commanding a premium price. Mature products facing competitive


















Figure 7.3.3 - Elements of the Restructurable Design Methodology
What are the design requirements in order to utilize redundancy?
Redundancy necessitates a modular design. The system must be divided
into separate and independent modules that can be replicated on the chip.
Furthermore, only a small number of different module types are allowed.
There must be spare copies of each different type of module. With many
different types, the redundancy overhead becomes excessive, and the
complexity of interconnecting the modules increases.
Since the occurrence of defects is a random process, it can not be
known in advance which modules will be good and which will be bad. The
pattern and number of faulty modules will vary from chip to chip. But it is
necessary to connect together only the good modules. This requires a
flexible means of interconnecting the modules. Furthermore, the
interconnections between modules must be customized after the modules
are completely fabricated and tested. In short, the circuitry must be
reconfigurable. Mechanisms for implementing reconfiguration will be
considered in the following section.
Modularity and reconfigurability are the key elements that enable
redundancy to be utilized (see Figure 7.3.3). Through their combination,
chips of larger area and hence greater complexity can be reliably and
economically fabricated. These ultra large area chips offer substantial
increases in integration level above the inherent limitations of fabrication
technology.
Chips designed with the restructurable design methodology require
overhead in the form of redundant modules and the wiring necessary to
reconfigure the components. For this design methodology to be practical,
this overhead must be limited. How can the overhead be kept to a
reasonable level? First, it was noted (see Chapter 2) that higher module
yield results in greater yield gains from redundancy. Thus modules with
small area make more efficient use of silicon area and require lower
overhead due to redundancy.
Second, since the reconfigurable wiring must at a bare minimum be
capable of routing around a module, the wiring area is proportional to the
square of the number of individual connections between modules. To reduce
wiring overhead, it is necessary to limit the number of intermodule
connections. Furthermore, to reduce the complexity of the wiring, a simple
and regular pattern of connections between the good modules is required.
Note that the requirements of small modules with restricted and
regular information flow are precisely those for designing algorithms for
VLSI systems [Kung79]. The principles for integrated circuit design are the
same as those required for the efficient implementation of restructurable,
fault tolerant chips. There is a strong consonance between the
restructurable design methodology and the general principles of good
integrated circuit design. In fact, the restructurable design methodology
may be considered to be a specialization and extension of the general design
principles which has the added benefits of increasing the level of integration
or, alternatively, reducing cost.
As a result, well designed chips can be relatively easily redesigned to
employ reconfigurability and redundancy. Highly irregular circuitry will not
naturally adapt to the requirement of modularity, and excessively complex
designs may inherently require a large overhead for restructurable wiring.
But simple, modular circuits can easily be extended for the addition of
restructurable wiring between modules.
4. Future Research
This work gives rise to further questions concerning the performance
and implementation of wafer scale CHiP processors. Some of the issues are:
the design of a low impedance switch, the implementation of programmable
power down capability, CMOS layout of PEs, etc. Perhaps of more general
interest are the questions of larger scope concerning the extension of this
work to restructurable circuitry and ultra large area chips. Two topics of
particular interest are presented below.
a) Penalties for Restructurable Circuitry
The use of redundancy to increase the manufacturing yield of circuits is
dependent on restructurable circuitry to provide flexible interconnections
between modules. This yield increase is achieved at the expense of
• more modules per chip
• addition of extra interconnections
• an increase in signal delay




The first of these has been examined in some detail. The relationship
between yield, redundancy and area was explored in Chapter 2. Secondly,
additional wiring must be added to a chip to provide restructuring
capability. Given that faults may occur in both the modules and the
restructurable wiring, how much wiring area must be provided to insure a high
probability of restructuring? In addition to consuming chip area,
restructurable wiring introduces longer wires between modules with
resulting performance penalties. How much performance loss can be
expected? What average wire lengths will exist between modules? In
complex designs with many modules, choosing the best interconnection (or
even finding an interconnection) may be a computationally difficult problem
[Mann77, Aubu73]. Algorithms for restructuring homogeneous VLSI arrays
also require further investigation.
b) Modular PE Design
The results of the analysis of redundancy (see Chapter 2) show that the
highest leverage is obtained from the initial increments of redundancy. The
first extra PE causes a large marginal increase in yield whereas successive
redundant PEs cause smaller yield increases. Clearly, it is most area
efficient to have a small degree of redundancy rather than a large amount.
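
The diminishing return from successive spares can be illustrated with a simplified model. The sketch below assumes independent PE failures (a binomial model, which is not the defect-based yield model of Chapter 2) and a hypothetical block that needs `required` working PEs. The function names and the sample values are assumptions chosen only for illustration.

```python
from math import comb


def block_yield(n_pes, required, pe_yield):
    """Probability that at least `required` of `n_pes` PEs are good,
    assuming each PE is good independently with probability `pe_yield`."""
    return sum(comb(n_pes, good) * pe_yield**good * (1 - pe_yield)**(n_pes - good)
               for good in range(required, n_pes + 1))


def marginal_gains(required, pe_yield, max_spares):
    """Yield increase contributed by each successive spare PE."""
    yields = [block_yield(required + spares, required, pe_yield)
              for spares in range(max_spares + 1)]
    return [after - before for before, after in zip(yields, yields[1:])]


if __name__ == "__main__":
    # Hypothetical block: 4 required PEs, 65% PE yield, up to 8 spares.
    for spare, gain in enumerate(marginal_gains(4, 0.65, 8), start=1):
        print(f"spare #{spare}: yield gain {gain:.4f}")
```

For these sample values each additional spare contributes less than the one before it, which is the qualitative behavior described above.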
In the wafer scale CHiP machine, switches and PEs are regarded as
"black boxes" with no internal structure, and faulty building blocks are
eliminated by brute force - column exclusion. All redundancy is within the
building blocks, and the requirement for very high block yield forces a high
degree of redundancy. Examining Figure 2.5.2 shows that N = 12 PEs lies on a
very flat portion of the recovery curve. The addition of the 10th, 11th, and
12th PEs has increased recovery a total of only 1.7% (see Table 2.5.3).
A more efficient approach may be to have an extended hierarchy with
additional levels and redundancy at more than one level. With a modest
amount of redundancy introduced at several levels, very high yield for the
topmost member of the hierarchy may be achieved with less area
expenditure.
For example, one approach is to extend the hierarchy upwards.
Building blocks are coalesced into super building blocks (SBBs). There are
some redundant PEs and switches within each BB, and each SBB contains
redundant building blocks. This combined redundancy can result in 99%
yield of the SBBs which can then be composed using column exclusion.
The problem with this approach is that higher up in the hierarchy the
number of connections between units increases. For example, in the wafer
scale CHiP processor, there are ten connections between a pair of switches,
but connecting two building blocks requires 90 wires. Since blocks within
each SBB must be flexibly interconnected, a switching structure to connect
blocks must be provided. With switch area proportional to the square of the
number of connections, a single switch routing 90 wires occupies a large
area and consequently has low yield. Instead of a single large switch,
routing can be implemented with a large number of small switches.
However, this substantially increases the number of switching levels between
PEs resulting in reduced performance. In short, there is no practical





An alternate solution is to extend the hierarchy downward. Instead of
treating PEs as individual units, impose a modular and reconfigurable
structure on the individual PEs. By dividing them into independent
modules, placing redundant modules within each PE and reconfigurable
wiring between modules, PE yield can be substantially increased. Increasing
PE yield reduces the redundancy required within each block. Increasing PE
yield from the current 65% for the "standard" PE to 80% reduces the number
of PEs per block from 12 to 8 while still maintaining 99% block yield.
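
The trade between PE yield and block redundancy can be sketched numerically. The example assumes independent PE failures (a binomial simplification, not the defect-based model of Chapter 2) and a hypothetical requirement of 4 working PEs per block; the number of required PEs is not stated in this section and is chosen purely for illustration.

```python
from math import comb


def block_yield(n_pes, required, pe_yield):
    """Probability that at least `required` of `n_pes` PEs are good,
    assuming independent PEs that are each good with probability `pe_yield`."""
    return sum(comb(n_pes, good) * pe_yield**good * (1 - pe_yield)**(n_pes - good)
               for good in range(required, n_pes + 1))


if __name__ == "__main__":
    REQUIRED = 4  # hypothetical number of working PEs needed per block
    for n_pes, pe_yield in ((12, 0.65), (8, 0.80)):
        y = block_yield(n_pes, REQUIRED, pe_yield)
        print(f"{n_pes} PEs at {pe_yield:.0%} PE yield -> block yield {y:.4f}")
```

Under these assumptions both configurations land close to the 99% level, so the smaller block with the higher-yield PE gives a comparable block yield with fewer PEs.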
Memory redundancy is easily incorporated into each PE using standard
techniques with spare rows (or columns) in the memory array [Smit81,
Kokk81, Mano80]. There are two ways of dividing the datapath of the PE into
modules: slice "horizontally" dividing into bit slices or slicing "vertically"
creating pipelined segments.
The bit slice modularization is easy to design; each module is a
miniature version of the original datapath. Pipelining provides the potential
for increased performance by each PE but is more difficult to design. Since
one module may be substituted for a faulty module, all modules must have
identical hardware. But each stage in an arithmetic pipeline performs a
different operation so the modules must be microcoded to specialize them
for a particular position in the pipeline.
A topic for future research is to design PE modules which are flexible,
powerful and of acceptable size. For a particular processing element,
comparison of the bit slice and pipelined approaches will shed light on the





The restructurable wiring within each PE will introduce delays into the
basic cycle time of the PE. A programmable switching structure may
introduce an unacceptable performance penalty. An alternative is to use
permanent links to reconfigure the modules [Smit81, Kokk81, Logu80]. The
loss of flexibility is balanced by a decrease in connection impedance. The
feasibility of the modular approach depends in part on the performance loss







Aubu73 Aubusson,R. AND Catt,I. "Wafer-Scale Integration - A Fault-Tolerant
Procedure," IEEE J. Solid-State Circuits SC-13, 3 (June
1973), 339-344.
Aubu78 Aubusson,R.C. AND Gledhill,R.J. "Wafer-Scale Integration - Some
Approaches to the Interconnection Problem," Microelectronics V-9,
1 (Jan. 1978), 5-10.
Batc79 Batcher,K. "MPP - A Massively Parallel Processor," Proc. of 1979
International Conf. on Parallel Processing, (Aug. 1979), 249.
Blak75 Blakeslee,T.R. Digital Design with Standard MSI and LSI, Wiley,
New York, 1975.
Budz82 Budzinski,R., Linn,J. AND Thatte,S. "A Restructurable Integrated
Circuit for Implementing Programmable Digital Systems," IEEE
Computer V-15, 3 (March 1982), 43-54.
Calh72 Calhoun,D.F. AND McNamee,L.P. "A Means of Reducing Custom LSI
Interconnection Requirements," IEEE J. Solid-State Circuits SC-7,
5 (Oct. 1972), 395-404.
Cenk79 Cenker,R.P., et al. "A Fault-Tolerant 64K Dynamic Random-Access
Memory," IEEE Trans. Electron Devices ED-26, 6 (June 1979),
853-860.
Chap Chapman,G. private communication, Lincoln Lab.
Cuny82 Cuny,J. AND Snyder,L. "Testing Coordination for "Homogeneous"
Parallel Algorithms," Proc. of 1982 International Conf. on Parallel
Processing, (Aug. 1982).
DasG78 DasGupta,S., Eichelberger,E. AND Williams,T.W. "LSI Chip Design
for Testability," ISSCC 1978, 216-217.
DeSi79 DeSimone,R.H., Donofrio,N.M., Flur,B.L., Kruggel,R.H., Leung,H.H.,
AND Schnadt,R. "FET RAMs," 1979 IEEE ISSCC Digest of Tech.
Papers 22, (1979), 154-155.
Eato81 Eaton,S.S. "A 100ns 64K Dynamic RAM Using Redundancy," ISSCC





Egaw79 Egawa,Y., Tsuda,N. AND Masuda,K. "A 1Mb Full Wafer MOS RAM,"
ISSCC Dig. of Tech. Papers (1979), 16-19.
Elme77 Elmer,B.R. et al. "Fault Tolerant 92160 Bit Multiphase CCD
Memory," ISSCC Dig. of Tech. Papers (1977), 116-117.
Fitz80 Fitzgerald,B.F. AND Thoma,E.P. "Circuit Implementation of Fusible
Addresses on RAMs for Productivity Enhancement," IBM J. of Res.
and Dev. V-24, 3 (May 1980), 291-296.
Fitz81 Fitzpatrick,D.T., et al. "VLSI Implementation of a Reduced
Instruction Set Computer," CMU Conf. on VLSI (1981), 327-336.
Fuss82 Fussell,D. AND Varman,P. "Fault-Tolerant Wafer-Scale
Architectures for VLSI," Proc. 9th Annual Symp. on Computer Architecture
(1982), 190-196.
Gann81 Gannon,D. AND Snyder,L. "Linear Recurrence Systems for VLSI: The
Configurable, Highly Parallel Approach," Proc. of the 1981
International Conf. on Parallel Processing, (Aug. 1981), 259-260.
Gare79 Garey,M.R. AND Johnson,D.S. Computers and Intractability: A
Guide to the Theory of NP-Completeness, W.H. Freeman,
San Francisco, 1979.
Glas79 Glaser,A.B. AND Subak-Sharpe,G.E. Integrated Circuit Engineering,
Addison-Wesley, Reading, 1979.
Gupt72 Gupta,A. AND Lathrop,J.W. "Yield Analysis of Large
Integrated-Circuit Chips," IEEE J. Solid-State Circuits SC-7, 5 (Oct. 1972),
389-395.
Haye80 Hayes,J.P. AND McCluskey,E.J. "Testability Considerations in
Microprocessor-Based Design," IEEE Computer V-13, 3 (March
1980), 17-26.
Hedl81a Hedlund,K.S. "Design of a Prototype Blue CHiP Processing
Element," Tech. Report 3UB, Comp. Sci. Dept., Purdue Univ., June
1981.
Hedl81b Hedlund,K.S. AND Snyder,L. "A Model for Wafer Scale Testing,"
Tech. Report 389, Comp. Sci. Dept., Purdue Univ., Sept. 1981.
Hedl82a Hedlund,K.S. AND Snyder,L. "Wafer Scale Integration of
Configurable, Highly Parallel CHiP Processors," Tech. Report 407,
Comp. Sci. Dept., Purdue Univ., April 1982.
Henn81 Hennessy,J., Jouppi,N., Baskett,F. AND Gill,J. "MIPS: A VLSI
Processor Architecture," CMU Conf. on VLSI (1981), 337-346.
Hon80 Hon,R.W. AND Sequin,C.H. A Guide to LSI Implementation, 2nd
Edition, Xerox (Jan. 1980).
Hsia82 Hsiao,C. "Highly Parallel Processing of Relational Databases," Ph.D.
Dissertation, Comp. Sci. Dept., Purdue U., Aug. 1982.
IEEE82 International Electrical and Electronics Engineers "Whatever
Happened to Wafer-Scale Integration?" IEEE Spectrum, V-19, 6 (June
1982), 18.
Klei81 Kleitman,D. et al. "New Layouts for the Shuffle-Exchange Graph
(Extended Abstract)," Symp. on Theory of Computing (May 1981),
278-292.
Kokk81 Kokkonen,K., et al. "Redundancy Techniques for Fast Static
RAMs," ISSCC Dig. of Tech. Papers (1981), 80-81.
Koni82 Konishi,S. "A 64Kb CMOS RAM," ISSCC Dig. of Tech. Papers (1982),
258-259.
Kore81 Koren,I. "A Reconfigurable and Fault-Tolerant VLSI
Multiprocessor Array," Proc. 8th Annual Symp. on Computer Architecture
(1981), 425-442.
Kuhn75 Kuhn,L. "Experimental Study of Laser Formed Connections for LSI
Wafer Personalization," IEEE J. of Solid-State Circuits SC-10, 4
(Aug. 1975), 219-228.
Knut70 Knuth,D.E. "An Empirical Study of FORTRAN Programs," Software -
Practice and Experience, V-1, 11 (Nov. 1970), 105-133.
Kung79 Kung,H.T. "Let's Design Algorithms for VLSI Systems," Proc. of
Caltech Conf. on Very Large Scale Integration, (Jan. 1979), 65-90.







LaPa78a LaPaugh,A.S. "The Subgraph Homeomorphism Problem," Tech.
Memo 99, Lab. for Comp. Sci., MIT, Feb. 1978.
LaPa78b LaPaugh,A.S. AND Rivest,R.L. "The Subgraph Homeomorphism
Problem," Proc. 10th Annual Symp. on Theory of Computing
(1978), 40-50.
Laws66 Lawson,T.R. "A Prediction of the Photoresist Influence on
Integrated Circuit Yield," Semicond. Prod. Solid State Technol. V-9,
7 (July 1966), 22-25.
Lea79 Lea,R.M. AND Streetharan,M. "WSI Distributed Logic Memories,"
Proc. Caltech Conf. on Very Large Scale Integration (Jan. 1979),
187-197.
Logu80 Logue,J.C. et al. "Techniques for Improving Engineering
Productivity of VLSI Designs," Proc. of IEEE International Conf. on
Circuits and Computers (1980), 248-251.
Mann77 Manning,F. "An Approach to Highly Integrated,
Computer-Maintained Cellular Arrays," IEEE Trans. Comput. C-26, 6 (June
1977), 536-552.
Mano80 Mano,T. "A 256K RAM Fabricated With Molybdenum-Polysilicon
Technology," ISSCC Dig. of Tech. Papers (1980), 234-235.
Mead80 Mead,C. AND Conway,L. Introduction to VLSI Systems,
Addison-Wesley, Reading, 1980.
Mina81 Minato,O., Masuhara,T., Sasaki,T., Sakai,Y. AND Yoshzaki,K. "A
High-Speed Hi-CMOSII 4K Static RAM," IEEE J. Solid-State Circuits
SC-16, 5 (Oct. 1981), 419-451.
Mina82 Minato, et al. "A Hi-CMOSII 8K x 8b Static RAM," ISSCC Dig. of
Tech. Papers, (1982), 250-257.
Mina80 Minato,O., Masuhara,T. AND Sakai,Y. "HI-CMOS 4K Static RAM,"
ISSCC Dig. of Tech. Papers (1980), 234-5.
Moor79 Moore,G.E. "Are We Really Ready for VLSI?" Proc. of Caltech Conf.
on Very Large Scale Integration (Jan. 1979), 3-14.
Nair78 Nair,R., Thatte,S.M. AND Abraham,J.A. "Efficient Algorithms for
Testing Semiconductor Random-Access Memories," IEEE Trans.
Comput. C-27, 6 (June 1978), 572-576.
Noyc77 Noyce,R.N. "Large Scale Integration: What is Yet to Come?"
Science, (March 18, 1977), 1102-1108.
Owen81 Owens,M.R. "Compound Algorithms for Digit Online Algorithms,"
Tech. Report CS-81-1, Comp. Sci. Dept., Penn. State U., Jan. 1981.
Parz60 Parzen,E. Modern Probability Theory and Its Applications, Wiley,
New York, 1960.
Patt81 Patterson,D.A. AND Sequin,C.H. "RISC I: A Reduced Instruction Set
VLSI Computer," Proc. 8th Annual Symp. on Computer
Architecture (1981), 443-457.
Petr67 Petritz,R.L. "Current Status of Large Scale Integration
Technology," IEEE J. Solid-State Circuits SC-2, 4 (Dec. 1967), 130-147.
Peut77a Peuto,B.L. AND Shustek,L.J. "Current Issues in the Architecture of
Microprocessors," IEEE Computer, V-10, 2 (Feb. 1977), 20-25.
Peut77b Peuto,B.L. AND Shustek,L.J. "An Instruction Timing Model of CPU
Performance," Proc. 4th Annual Symp. on Computer Architecture,
(1977), 165-178.
Phis79 Phister,M. "Technology and Economics: Integrated Circuit
Manufacturing Costs," Computer Design, V-18, 10 (Oct. 1979), 34-42.
Pric70 Price,J.E. "A New Look at Yield of Integrated Circuits," Proc.
IEEE, 20 (May 1976), 228-234.
Raff79 Raffel,J.I. "On the Use of Nonvolatile Programmable Links for
Restructurable VLSI," Proc. of Caltech Conf. on Very Large Scale
Integration, (Jan. 1979), 95-104.
Rees81 Reese,E.A. et al. "A 4K x 8 Dynamic RAM With Self Refresh,"
ISSCC Dig. of Tech. Papers (1981), 88-89.
Ross76 Ross,S. A First Course in Probability, Macmillan, New York, 1976.
Radi82 Radin,G. "The 801 Minicomputer," Symp. on Arch. Support for
Prog. Langs. and Operating Sys. (March 1982), 39-47.
Rung81 Rung,R.D. "Determining IC Layout Rules for Cost Minimization,"








Sait82 Saito,K. AND Arai,E. "Experimental Analysis and New Modeling of
MOS LSI Yield Associated with the Number of Elements," IEEE J.
Solid-State Circuits SC-17, 1 (Feb. 1982), 28-33.
Seit79 Seitz,C.L. "Self-Timed VLSI Systems," Proc. Conf. on Very Large
Scale Integration: Architecture, Design and Fabrication (1979).
Smit81 Smith,R.T. et al. "Laser Programmable Redundancy and Yield
Improvements in a 64K DRAM," IEEE J. of Solid-State Circuits
SC-16, 5 (Oct. 1981), 506-514.
Snyd82a Snyder,L. "Introduction to the Configurable, Highly Parallel
Computer," IEEE Computer V-15, 1 (Jan. 1982), 47-56.
Snyd82b Snyder,L. "Configurable, Highly Parallel (CHiP) Approach for Signal
Processing Applications," Proc. Tech. Symp. East '82, SPIE, 1982.
Stap73 Stapper,C.H. "Defect Density Distribution for LSI Yield
Calculations," IEEE Trans. Electron Devices ED-20, 7 (July 1973), 655-657.
Stap75 Stapper,C.H. "On a Composite Model to the IC Yield Problem," IEEE
J. Solid-State Circuits SC-10, 6 (Dec. 1975), 537-539.
Stap76 Stapper,C.H. "LSI Yield Modeling and Process Monitoring," IBM J.
Res. Dev. V-20, 3 (May 1976), 228-234.
Stap80 Stapper,C.H., McLaren,A.N. AND Dreckmann,M. "Yield Model for
Productivity Optimization of VLSI Memory Chips with Redundancy
and Partially Good Product," IBM J. Res. and Dev. V-24, 3 (May
1980), 396-409.
Stap82 Stapper,C.H. "Yield Models for 256K RAMs and Beyond," ISSCC Dig.
of Tech. Papers (1982), 12-13.
Stee81 Steele,T.S. "Terminal and Cooling Requirements for LSI Packages,"
IEEE Trans. on Components, Hybrids and Manufacturing Tech.,
(June 1981), 187-191.
Suth77 Sutherland,I.E. AND Mead,C.A. "Microelectronics and Computer
Science," Scientific American V-237, 3 (Sept. 1977), 210-220.










Williams,M.J.Y. AND Angell,J.B. "Enhancing Testability of Large-Scale
Integrated Circuits via Test Points and Additional Logic,"
IEEE Trans. Comput. C-22, 1 (Jan. 1973), 46-60.
Williams,T.W. AND Parker,K.P. "Testing Logic Networks and Designing
for Testability," IEEE Computer V-12, 10 (Oct. 1979), 9-21.
Wu,W. "Automated Welding Customizes Programmable Logic
Arrays," Electronics V-55, 14 (July 14, 1982), 159-162.
Yu,K., Chwang,J.C., Bohr,M., Warkentin,P., Stern,S. AND
Berglund,C.N., "HMOS-CMOS - A Low-Power High-Performance Tech-






SUMMATION OF RANDOM VARIABLES
In this appendix we derive the probability
P' = P'(nf, i, Np) = Pr(i defects occupy nf or fewer of Np PEs)
where Np is the total number of PEs in a sublattice which contains nf
redundant PEs and where i > nf. The i defects all fall in a set of Np PEs. P' is
the probability that the defects occupy a subset of size nf or smaller. The
form of P' varies depending on the assumptions which are made about the
processing technology. As the assumptions are made more realistic, the
analytical form of P' can become very cumbersome. P' will first be derived
under a simple set of assumptions, and the results will be progressively
refined.
The Price model assumes distinguishable classes of indistinguishable
defects. For the first approximation, assume only one class so that all
defects are indistinguishable. This corresponds to lumping the effect of all
processing steps and regarding the wafer to be manufactured in a single
step. We do not differentiate between defects introduced at different stages






It is simple to derive and is a useful first approximation.
1. Lumped Approximation
It is tempting to try to evaluate P' by

    P' = \sum_{k=1}^{n_f} \binom{N_p}{k} \Pr(i \text{ defects occupy } k \text{ PEs})    (1.1)

However, this is somewhat ambiguous and leads to difficulties. For instance,
consider the number of different possible assignments of 4 defects to 3 PEs.
This includes some assignments in which 2 of the PEs each contain 2 defects
and the third PE is defect free. Only 2 of the 3 PEs contain any defects at
all. This assignment is already counted when placing 4 defects in 2 PEs.
Therefore, equation 1.1 double counts many assignments. To avoid double
counting, we will be more precise in our terminology. We will say i defects
fall in k PEs if the defects occupy k or fewer PEs; some of the k PEs may be
defect-free. i defects cover k PEs if the defects fall in k PEs, and every PE
contains at least one defect.
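
The distinction between "fall in" and "cover" can be checked by direct enumeration for the 4 defect, 3 PE example above. This is a small sketch only; the PE labels and helper names are hypothetical and are not part of the derivation.

```python
from collections import Counter
from itertools import combinations_with_replacement


def placements(num_defects, pes):
    """Every way to place indistinguishable defects on the named PEs."""
    return [Counter(p) for p in combinations_with_replacement(pes, num_defects)]


def covers(placement, pes):
    """True when every PE in `pes` holds at least one defect."""
    return all(placement[pe] >= 1 for pe in pes)


def occupied(placement, pes):
    """Number of PEs in `pes` that hold at least one defect."""
    return sum(1 for pe in pes if placement[pe] >= 1)


pes = ["A", "B", "C"]                      # hypothetical PE labels
fall_in_3 = placements(4, pes)             # 4 defects fall in 3 PEs
cover_3 = [p for p in fall_in_3 if covers(p, pes)]
use_only_2 = [p for p in fall_in_3 if occupied(p, pes) == 2]

if __name__ == "__main__":
    print(len(fall_in_3), "placements fall in the 3 PEs")
    print(len(cover_3), "of them cover all 3 PEs")
    print(len(use_only_2), "of them occupy only 2 PEs")
```

The placements that occupy only 2 PEs are the ones that would be counted again when placing 4 defects in 2 PEs, which is why the sum must be taken over placements that cover each subset.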
We can correctly restate equation 1.1





    P' = \sum_{k=1}^{n_f} \binom{N_p}{k} \frac{\text{number of placements of } i \text{ defects which cover } k \text{ PEs}}{\text{total number of placements of } i \text{ defects in } N_p \text{ PEs}}    (1.2)

Since there are \binom{i+r-1}{i} different ways of placing i indistinguishable defects
among r PEs, there are \binom{i+N_p-1}{i}
different placements of i defects in Np PEs.
For any particular subset of k PEs, how many of these placements cover
the subset? First, take k of the i defects and assign one to each PE of the
subset. This insures that the subset is covered. The remaining i-k defects
can be assigned to any of the k PEs. There are

    \binom{(i-k)+k-1}{i-k} = \binom{i-1}{i-k}

ways of doing this. This completes the lumped approximation

    P' = \sum_{k=1}^{n_f} \binom{N_p}{k} \binom{i-1}{i-k} \Big/ \binom{i+N_p-1}{i}    (1.3)

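
The lumped approximation can be evaluated directly and compared against exhaustive enumeration. The sketch below assumes the counting argument above: C(i+Np-1, i) placements in total and C(i-1, i-k) placements that cover a given subset of k PEs. The function names and the sample arguments are illustrative assumptions.

```python
from itertools import combinations_with_replacement
from math import comb


def lumped_p(nf, i, np_):
    """P'(nf, i, Np): probability that i indistinguishable defects occupy
    nf or fewer of Np PEs, with every placement taken as equally likely."""
    total = comb(i + np_ - 1, i)
    favorable = sum(comb(np_, k) * comb(i - 1, i - k)
                    for k in range(1, min(nf, i) + 1))
    return favorable / total


def lumped_p_bruteforce(nf, i, np_):
    """Same quantity by enumerating every placement (small cases only)."""
    hits = total = 0
    for placement in combinations_with_replacement(range(np_), i):
        total += 1
        hits += len(set(placement)) <= nf
    return hits / total


if __name__ == "__main__":
    # Hypothetical small case: 6 defects, 8 PEs, at most 4 PEs hit.
    print(lumped_p(4, 6, 8), lumped_p_bruteforce(4, 6, 8))
```

The brute-force version exists only as a cross-check; its cost grows with the number of placements, so the closed form is the one to use for sizes such as Np = 16.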
A more accurate approximation can be derived by modeling more than
one fabrication step [Glas79]. This introduces multiple, distinguishable
classes of indistinguishable defects. Each individual class follows a lumped
approximation, but the fact that i defects can be partitioned into multiple
classes in many different ways must be accounted for.
The first results derived will be for 2 classes of defects. A more realistic
model for Blue CHiP applications is a four class model. The 3 and 4 class
formulae will be derived in a manner similar to the 2 class derivation.
Figure A1.1 shows P'(4,i,16), the probability that i defects all fall in 4 or




2. Two Class Approximation.
In refining the lumped approximation, the following assumptions will be
made:
1) There are two distinguishable classes of indistinguishable defects.
Each class represents a separate fabrication step.
2) The fabrication steps are independent.
3) The total number of defects is the sum of the defects introduced at
each step.
4) There is an equal probability of a defect belonging to either class.
Given that there are exactly i1 defects of class 1 and i2 of class 2,
consider the probability that the total number of defects, i = i1 + i2, fall in nf
or fewer of Np PEs. This quantity is denoted by Q". To evaluate Q", we
condition on k, the number of PEs covered by the defects of the two classes
together.

    Q'' = \sum_{k=1}^{n_f} \binom{N_p}{k} \Pr(i \text{ defects cover a set of } k \text{ PEs})    (2.1)

For any particular set of k PEs,
Pr(i defects cover set) = (numb placements of it and i2 that cover set)/
(total numb placements of i1 and i2 in Np PEs) (2.2)
Consider the denominator of the above equation. Since the fabrication steps
are independent,
total number of placements of i1 and i2 in Np PEs =
(number of placements of i1 in Np PEs) *
(number of placements of i2 in Np PEs) =
    \binom{N_p + i_1 - 1}{i_1} \binom{N_p + i_2 - 1}{i_2}
This quantity will be denoted by Place(i1, i2; Np) with the obvious extension
to Place(i1, ..., iN; Np) following from the independence of all processing
steps.
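A minimal Python sketch of the Place function follows. It assumes that the number of placements of one class of i_j indistinguishable defects in Np PEs is the multiset count C(Np + i_j - 1, i_j), and multiplies these counts across the independent classes. The name place and its argument order are illustrative, not taken from the report.

```python
from math import comb

def place(counts, n_pe):
    """Place(i1, ..., iN; Np): number of joint placements of N independent
    classes of indistinguishable defects in n_pe PEs.

    counts -- iterable of per-class defect counts (i1, ..., iN)
    n_pe   -- number of PEs, assumed to be at least 1
    """
    total = 1
    for i_j in counts:
        # Assumed per-class count: multisets of size i_j over n_pe PEs.
        total *= comb(n_pe + i_j - 1, i_j)
    return total

# Two defects of class 1 and one of class 2 in 3 PEs: 6 * 3 = 18 placements.
print(place((2, 1), 3))
```

Because the classes are treated as independent, adding a class with zero defects leaves the count unchanged.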
To evaluate the numerator of equation 2.2 we condition on c1, the number
of different PEs in the set of size k occupied by class 1 defects.
    numerator = \sum_{c_1} \binom{k}{c_1} (number of placements of i1 that cover c1 PEs)
        (number of placements of i2 that occupy the k - c1 remaining PEs)
For any subset of size c1, select c1 of the class 1 defects and place one
defect in each PE of the subset. This ensures that all members of the subset
are occupied. The remaining i1 - c1 defects can be distributed over the c1
PEs in
    \binom{(i_1 - c_1) + c_1 - 1}{i_1 - c_1} = \binom{i_1 - 1}{i_1 - c_1}
different ways. There are k - c1 members of the set not covered by defects of
the first class. Therefore, these PEs must be occupied by class 2 defects.
We take k - c1 of the i2 class 2 defects and put one in each of the PEs not
covered by class 1. This ensures that the entire set of k PEs is covered. The
remaining i2 - (k - c1) class 2 defects can be assigned to any of the k PEs, so
there are
    \binom{(i_2 - (k - c_1)) + k - 1}{i_2 - (k - c_1)} = \binom{i_2 + c_1 - 1}{i_2 - k + c_1}
ways of placing the class 2 defects to ensure that all the k PEs contain at
least one defect. Consequently, there are
    \binom{i_1 - 1}{i_1 - c_1} \binom{i_2 + c_1 - 1}{i_2 - k + c_1}
different ways of placing the i1 and i2 to cover the subset. We will denote this
quantity by Cover(i1, i2; c1, k).
This completes the evaluation of the numerator of equation 2.2.
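The numerator can be cross-checked against direct enumeration. The Python sketch below is an illustration under stated assumptions: the helper names (ways, cover, numerator, brute_numerator) are hypothetical, each class is placed as a multiset of PE indices, and terms outside the valid range of c1 are made to vanish explicitly instead of relying on summation limits.

```python
from itertools import combinations_with_replacement
from math import comb

def ways(i, bins, required):
    """Placements of i indistinguishable defects in `bins` PEs such that
    `required` designated PEs each receive at least one defect."""
    extra = i - required  # defects left after seeding the required PEs
    if extra < 0:
        return 0
    if bins == 0:
        return 1 if extra == 0 else 0
    return comb(extra + bins - 1, extra)

def cover(i1, i2, c1, k):
    """Cover(i1, i2; c1, k) for one fixed choice of the c1 PEs hit by class 1."""
    return ways(i1, c1, c1) * ways(i2, k, k - c1)

def numerator(i1, i2, k):
    """Sum over c1 of C(k, c1) * Cover(i1, i2; c1, k)."""
    return sum(comb(k, c1) * cover(i1, i2, c1, k) for c1 in range(k + 1))

def brute_numerator(i1, i2, k):
    """Enumerate all joint placements in k PEs and count those covering all k."""
    count = 0
    for p1 in combinations_with_replacement(range(k), i1):
        for p2 in combinations_with_replacement(range(k), i2):
            if len(set(p1) | set(p2)) == k:
                count += 1
    return count

for i1 in range(4):
    for i2 in range(4):
        for k in range(1, 4):
            assert numerator(i1, i2, k) == brute_numerator(i1, i2, k)
```

With i1 = 2, i2 = 1 and k = 2, both routes give 4 covering placements.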
To evaluate the limits of the summation,¹ note that the class 1 defects can
cover at most i1 of the k PEs. Furthermore, the class 1 defects must cover
at least 1 PE (unless there are no class 1 defects). The class 2 defects must
occupy the remaining k - c1 PEs not covered by the class 1 defects, so k - c1
cannot exceed i2. So
    \max(k - i_2, \min(1, i_1)) \le c_1 \le \min(i_1, k)

¹ We assume \binom{a}{b} = 1 for a < b or a < 0 or b < 0.
This completes the evaluation of equation 2.2 and
    \Pr(i \text{ defects cover set}) = \frac{\sum_{c_1} \binom{k}{c_1} Cover(i_1, i_2; c_1, k)}{Place(i_1, i_2; N_p)}
with the limits for c1 as above.
Now, Q'' assumes there are exactly i1 and i2 defects of each class. We
can use Q'' to evaluate
    P'' = \Pr(i \text{ defects fall in } n_f \text{ or fewer of } N_p \text{ PEs})
        = \sum_{i_1 + i_2 = i} Q'' \Pr(i \text{ defects are partitioned with } i_1 \text{ and } i_2 \text{ in each class})
        = \sum_{i_1 + i_2 = i} Q'' \, Part(i; i_1, i_2)    (2.3)
To evaluate the partition function, Part, let I1 and I2 be random
variables representing the number of defects in each class and i be the total
number of defects. Consider the partitioning of defects into two classes to
be an experiment of i trials with each trial deciding which class a defect will be
in. The partitioning of a fixed number of defects into two classes then
follows a binomial distribution [Ross76].
Since it is equally likely that a defect will be in either class (by assumption 4),
    \Pr(I_1 = i_1) = \frac{1}{2^i} \binom{i}{i_1}
Since I1 and I2 must sum to i,
    Part(i; i_1, i_2) = \Pr(I_1 = i_1) = \frac{1}{2^i} \binom{i}{i_1}
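The pieces of the two class approximation can be assembled numerically. The Python sketch below is an illustration, not the report's implementation: the names ways, q2, part2 and p2 are hypothetical, the per-class placement count is assumed to be C(Np + i_j - 1, i_j), and exact rational arithmetic is used so that the results can be compared with hand calculations.

```python
from fractions import Fraction
from math import comb

def ways(i, bins, required):
    """Placements of i indistinguishable defects in `bins` PEs such that
    `required` designated PEs each receive at least one defect."""
    extra = i - required
    if extra < 0:
        return 0
    if bins == 0:
        return 1 if extra == 0 else 0
    return comb(extra + bins - 1, extra)

def q2(i1, i2, nf, n_pe):
    """Q'': probability that i1 class 1 and i2 class 2 defects fall in nf
    or fewer of n_pe PEs (equations 2.1 and 2.2)."""
    denom = comb(n_pe + i1 - 1, i1) * comb(n_pe + i2 - 1, i2)  # assumed Place
    total = Fraction(0)
    for k in range(1, nf + 1):
        num = sum(comb(k, c1) * ways(i1, c1, c1) * ways(i2, k, k - c1)
                  for c1 in range(k + 1))
        total += Fraction(comb(n_pe, k) * num, denom)
    return total

def part2(i, i1):
    """Part(i; i1, i2) with i2 = i - i1: binomial with probability 1/2."""
    return Fraction(comb(i, i1), 2 ** i)

def p2(i, nf, n_pe):
    """P'': weight Q'' by the partition probabilities (equation 2.3)."""
    return sum(part2(i, i1) * q2(i1, i - i1, nf, n_pe) for i1 in range(i + 1))

print(p2(2, 1, 2))          # two defects, both in the same one of 2 PEs
print(float(p2(5, 4, 16)))  # five defects in 4 or fewer of 16 PEs
```

Two checks follow from the definitions: p2(i, Np, Np) is 1 for any i of at least 1, and p2 is non-decreasing in nf.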
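This completes the evaluation of the two class approximation with equation
2.3 becoming
    P'' = \sum_{i_1 + i_2 = i} Part(i; i_1, i_2) \sum_{k=1}^{n_f} \binom{N_p}{k} \frac{\sum_{c_1} \binom{k}{c_1} Cover(i_1, i_2; c_1, k)}{Place(i_1, i_2; N_p)}    (2.4)
3. Extension to Three Classes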
The derivation under the assumption of three distinguishable classes of
defects is similar to the 2-class case. P''' will denote the probability under
this assumption that i defects fall in nf or fewer of Np PEs:
    P''' = \sum_{i_1 + i_2 + i_3 = i} Part(i; i_1, i_2, i_3) \sum_{k=1}^{n_f} \binom{N_p}{k} \Pr(i_1, i_2 \text{ and } i_3 \text{ cover the set})    (3.1)
and we can decompose this last probability for a specific set of k PEs.
Pr(i1, i2 and i3 cover the set) = (number of placements of i1, i2 and i3
that cover the set)/ Place(i1, i2, i3; Np) (3.2)
where the three argument versions of Part and Place are simple extensions
of the two argument functions:
A) Place. By the independence of the processing steps
    Place(i_1, i_2, i_3; N_p) = \binom{N_p + i_1 - 1}{i_1} \binom{N_p + i_2 - 1}{i_2} \binom{N_p + i_3 - 1}{i_3}
B) Part. We define
Part(i; i1, i2, i3) = probability that i defects are partitioned with i1 in class 1,
i2 in class 2 AND i3 in class 3
= Pr(I1 = i1) Pr(I2 = i2 | I1 = i1)
where I1, I2 and I3 are random variables representing the number of defects
in each class. Note that the number of defects in class 3 need not be
explicitly accounted for. Since i = i1 + i2 + i3, choosing i1 and i2 determines
i3.
It is equally likely that a defect will be in any one of the three classes, so
    \Pr(I_1 = i_1) = \binom{i}{i_1} \left(\frac{1}{3}\right)^{i_1} \left(\frac{2}{3}\right)^{i - i_1}    (3.3)
Now, the conditional portion of Pr(I2 = i2 | I1 = i1) constrains the remaining
i - i1 defects to fall in either class 2 or class 3. Both are equally probable, so
    \Pr(I_2 = i_2 \mid I_1 = i_1) = \frac{1}{2^{i - i_1}} \binom{i - i_1}{i_2}    (3.4)
Combining equations 3.3 and 3.4 gives
    Part(i; i_1, i_2, i_3) = \frac{1}{3^i} \binom{i}{i_1} \binom{i - i_1}{i_2}
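The three class partition function can be sketched in Python as the product of the two factors described above. This is an illustration under an explicit assumption: Pr(I1 = i1) is taken to be binomial with success probability 1/3, which is what the equal-likelihood statement implies. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def pr_i1(i, i1):
    """Pr(I1 = i1): each defect is in class 1 with probability 1/3 (assumed)."""
    return Fraction(comb(i, i1) * 2 ** (i - i1), 3 ** i)

def pr_i2_given_i1(i, i1, i2):
    """Pr(I2 = i2 | I1 = i1): the other i - i1 defects split evenly
    between class 2 and class 3."""
    return Fraction(comb(i - i1, i2), 2 ** (i - i1))

def part3(i, i1, i2, i3):
    """Part(i; i1, i2, i3) as the product of the two factors above."""
    if i1 + i2 + i3 != i:
        raise ValueError("class counts must sum to i")
    return pr_i1(i, i1) * pr_i2_given_i1(i, i1, i2)

# The product equals the multinomial coefficient divided by 3**i,
# and the partition probabilities sum to one.
i = 5
total = Fraction(0)
for i1 in range(i + 1):
    for i2 in range(i - i1 + 1):
        i3 = i - i1 - i2
        multinomial = factorial(i) // (factorial(i1) * factorial(i2) * factorial(i3))
        assert part3(i, i1, i2, i3) == Fraction(multinomial, 3 ** i)
        total += part3(i, i1, i2, i3)
assert total == 1
```

The loop confirms that, for i = 5, the combined expression is a proper probability distribution over the partitions.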
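The evaluation of P''' is now complete except for the numerator of
equation 3.2 which is evaluated as in the 2-class situation, but with an
additional summation required due to the additional class.
numerator = number of placements of i1, i2, i3 which cover a set of k PEs =
    \sum_{c_1} \binom{k}{c_1} (number of placements of i1 that cover c1 PEs)
        (number of placements of i2 and i3 which occupy k - c1 remaining PEs)    (3.5)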
Given a particular subset of size c1, we calculate as follows the number
of placements of i2 and i3 that ensure the set of k PEs is covered. Condition
on c2, the number of previously defect free PEs occupied by class 2 defects.
number of placements of i2 and i3 which occupy k - c1 remaining PEs =
    \sum_{c_2} \binom{k - c_1}{c_2} (number of placements of i2 which occupy c2 previously defect free PEs)
        (number of placements of i3 which occupy k - c1 - c2 remaining PEs)    (3.6)
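Select c2 of the i2 class 2 defects and place each in a PE not already
occupied by a class 1 defect. This ensures that exactly c1 and c2 different
PEs are covered by the first two classes. The remaining i2 - c2 class 2 defects can
be assigned to any of the c1 and c2 PEs already covered. There are
    \binom{(i_2 - c_2) + (c_1 + c_2) - 1}{i_2 - c_2} = \binom{i_2 + c_1 - 1}{i_2 - c_2}    (3.7)
different ways of making the class 2 assignments.
Similarly, k - c1 - c2 class 3 defects are required to complete the
covering of the set of k PEs. The remaining i3 - (k - c1 - c2) class 3 defects can
be assigned to any of the k PEs in
    \binom{(i_3 - (k - c_1 - c_2)) + k - 1}{i_3 - (k - c_1 - c_2)} = \binom{i_3 + c_1 + c_2 - 1}{i_3 - k + c_1 + c_2}    (3.8)
different ways.
Substituting equations 3.7 and 3.8 back into equation 3.6, the number of
placements of i2 and i3 which cover k - c1 PEs is
    \sum_{c_2} \binom{k - c_1}{c_2} \binom{i_2 + c_1 - 1}{i_2 - c_2} \binom{i_3 + c_1 + c_2 - 1}{i_3 - k + c_1 + c_2}    (3.9)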
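The class 2 and class 3 count can be compared with direct enumeration for small cases. In the Python sketch below the names ways, inner_count and brute_inner_count are hypothetical, the class 1 defects are assumed to occupy PEs 0 through c1 - 1 of the k-PE set, and out-of-range terms are zeroed explicitly instead of being excluded by summation limits.

```python
from itertools import combinations_with_replacement
from math import comb

def ways(i, bins, required):
    """Placements of i indistinguishable defects in `bins` PEs such that
    `required` designated PEs each receive at least one defect."""
    extra = i - required
    if extra < 0:
        return 0
    if bins == 0:
        return 1 if extra == 0 else 0
    return comb(extra + bins - 1, extra)

def inner_count(i2, i3, c1, k):
    """Placements of i2 and i3 that occupy the k - c1 PEs left empty by
    class 1, summed over c2 as in equation 3.9."""
    total = 0
    for c2 in range(k - c1 + 1):
        class2 = ways(i2, c1 + c2, c2)      # class 2 assignments
        class3 = ways(i3, k, k - c1 - c2)   # class 3 assignments
        total += comb(k - c1, c2) * class2 * class3
    return total

def brute_inner_count(i2, i3, c1, k):
    """Enumerate class 2 and class 3 placements in k PEs, with PEs
    0..c1-1 already occupied by class 1, and count full covers."""
    occupied = set(range(c1))
    count = 0
    for p2 in combinations_with_replacement(range(k), i2):
        for p3 in combinations_with_replacement(range(k), i3):
            if len(occupied | set(p2) | set(p3)) == k:
                count += 1
    return count

for k in range(1, 4):
    for c1 in range(k + 1):
        for i2 in range(3):
            for i3 in range(3):
                assert inner_count(i2, i3, c1, k) == brute_inner_count(i2, i3, c1, k)
```

For k = 2, c1 = 1 and one defect in each of classes 2 and 3, both routes give 3 placements.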
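To determine the limits of the summation, note that the class 2 defects
must occupy at least 1 PE (unless there are no class 2 defects).
Furthermore, the class 3 defects must cover the remaining k - c1 - c2 PEs not
covered by class 1 or class 2 defects.
To simplify notation, we introduce a three argument version of the
Cover function
    Cover(i_1, i_2, i_3; c_1, c_2, k) = \binom{i_1 - 1}{i_1 - c_1} \binom{i_2 + c_1 - 1}{i_2 - c_2} \binom{i_3 + c_1 + c_2 - 1}{i_3 - k + c_1 + c_2}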





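So, for a specific set of k PEs, equation 3.2 can be rewritten
    \Pr(i_1, i_2 \text{ and } i_3 \text{ cover the set}) = \frac{1}{Place(i_1, i_2, i_3; N_p)} \sum_{c_1} \sum_{c_2} \binom{k}{c_1} \binom{k - c_1}{c_2} Cover(i_1, i_2, i_3; c_1, c_2, k)
where the limits for c1 are derived similarly to c2. Finally, we can write P'''
as
    P''' = \Pr(i \text{ defects occupy } n_f \text{ or fewer of } N_p \text{ PEs})
         = \sum_{i_1 + i_2 + i_3 = i} Part(i; i_1, i_2, i_3) \sum_{k=1}^{n_f} \binom{N_p}{k} \Pr(i_1, i_2 \text{ and } i_3 \text{ cover the set})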
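The three class approximation can be evaluated end to end in the same way as the two class case. The Python sketch below is illustrative: ways and p3 are hypothetical names, the per-class placement count is assumed to be C(Np + i_j - 1, i_j), Part is taken as the multinomial coefficient over 3^i, and the sums over c1 and c2 run over all values with invalid terms contributing zero.

```python
from fractions import Fraction
from math import comb

def ways(i, bins, required):
    """Placements of i indistinguishable defects in `bins` PEs such that
    `required` designated PEs each receive at least one defect."""
    extra = i - required
    if extra < 0:
        return 0
    if bins == 0:
        return 1 if extra == 0 else 0
    return comb(extra + bins - 1, extra)

def p3(i, nf, n_pe):
    """P''': probability that i defects, split over three equally likely
    classes, fall in nf or fewer of n_pe PEs."""
    total = Fraction(0)
    for i1 in range(i + 1):
        for i2 in range(i - i1 + 1):
            i3 = i - i1 - i2
            part = Fraction(comb(i, i1) * comb(i - i1, i2), 3 ** i)
            denom = (comb(n_pe + i1 - 1, i1) * comb(n_pe + i2 - 1, i2)
                     * comb(n_pe + i3 - 1, i3))  # assumed Place
            for k in range(1, nf + 1):
                num = 0
                for c1 in range(k + 1):
                    for c2 in range(k - c1 + 1):
                        num += (comb(k, c1) * comb(k - c1, c2)
                                * ways(i1, c1, c1)
                                * ways(i2, c1 + c2, c2)
                                * ways(i3, k, k - c1 - c2))
                total += part * Fraction(comb(n_pe, k) * num, denom)
    return total

print(p3(2, 1, 2))          # two defects landing in the same one of 2 PEs
print(float(p3(5, 4, 16)))  # five defects in 4 or fewer of 16 PEs
```

As with the two class sketch, p3(i, Np, Np) evaluates to 1 for any i of at least 1, which is a useful consistency check on the counting.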





































































Figure A1.1 - Probability of i Defects Clustering (horizontal axis: i = Number of Defects)







Kye Sherrick Hedlund was born in Yonkers, New York on December 2,
1953. He graduated cum laude with Distinction in Mathematics from Boston
University with a Bachelor of Arts in Mathematics in 1975. Mr. Hedlund has
worked at IBM's Thomas J. Watson Research Center, Argonne National
Laboratory, MIT Artificial Intelligence Laboratory and Can Data, Inc. Purdue
University awarded him a Master of Science in Computer Science in 1979
and a Doctor of Philosophy in 1982. He is currently an Assistant Professor in
the Computer Sciences Department of the University of North Carolina at
Chapel Hill. While at Purdue he was employed as a teaching assistant and a
research assistant. His high score at PACMAN is 196,400.
