Design and analysis of fault-tolerant systolic array with built-in self-test by Ip, Joseph Chun-Shing
Lehigh University
Lehigh Preserve
Theses and Dissertations
1992
Design and analysis of fault-tolerant systolic array
with built-in self-test
Joseph Chun-Shing Ip
Lehigh University
Recommended Citation
Ip, Joseph Chun-Shing, "Design and analysis of fault-tolerant systolic array with built-in self-test" (1992). Theses and Dissertations.
Paper 102.
Author: Ip, Joseph Chun-Shing
Title: Design and Analysis of Fault-Tolerant Systolic Array with Built-In Self-Test
Date: October 11, 1992
Design and Analysis of Fault-Tolerant
Systolic Array with Built-In Self-Test
by
Joseph Chun-Shing Ip
A Thesis
Presented to the Graduate Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science
in
Electrical Engineering
Lehigh University
September 1992

Acknowledgments
Many people have been so helpful during my thesis work and I would like to express
my gratitude to them all.
Special thanks to my supervisor at Personal Computer Products Division in Advanced
Micro Devices (AMD), Bill Boothby, and to my managers Mike Lowe and Larry
Monks. They allowed me to use all the available resources during my stay at AMD.
My strong appreciation also goes to my thesis advisor, Professor Frank Hielscher, for
his interest in my topic, especially for his support throughout last year.
I would like to take this opportunity to thank every member of the Brahma development
team at AMD. We had so many valuable discussions that made my daily life at AMD
such a rewarding experience, in particular, Barry Ho, Dr. Guanghui Hu, Sally Kang,
Dr. Yi Liu, Rao Meranmenda, Qing Zhang, and Songxin Zhuang.
Last but not least, my sincere thanks go to Daniel Chan, Tim Ip, Gary Yang, and Dr.
Ker Zhang, who have provided essential assistance and encouragement at every step
along my way.
Table of Contents
Abstract
1. Introduction
2. Systolic Algorithm and Structure
2.1 Methods on the Transformation of a Matrix Algorithm to a Systolic Array
2.1.1 Descriptions of Two Graph-Based Data-Dependency Methods
2.1.1.1 S. Y. Kung's Method
2.1.1.2 J. H. Moreno's Method
2.2 The Gauss-Jordan Diagonalization Algorithm
2.3 Limitations in Our Implementation
2.4 Our Systolic Architecture for Computing $BA^{-1}$
2.5 Block Diagrams of the Array
3. Designs and Simulations of Computational Units
3.1 Design of a Two's Complementer and its Simulation Results
3.2 Design of a Two's Complement Adder and its Simulation Results
3.3 Design of a Baugh-Wooley Two's Complement Multiplier and its Simulation Results
3.4 Design of a Two's Complement Parallel Array Divider and its Simulation Results
4. Design of Built-In Self-Test Structures and their Simulation Results
4.1 Design of a Shift Register and its Simulation Results
4.1.1 Review of the Level Sensitive Scan Design
4.1.2 Design of a Shift Register Latch Chain and its Simulation Results
4.2 Design of a Pseudorandom Pattern Generator and its Simulation Results
4.2.1 Review of Pseudorandom Sequences and Linear Feedback Shift Registers
4.2.2 Design of a Linear Feedback Shift Register
4.2.3 Design of a Test Pattern Generator and its Simulation Results
4.2.4 Linear Dependencies of the m-Sequence Used as Test Stimuli
4.3 Design of a Signature Analyzer and its Simulation Results
4.3.1 Review of Signature Analysis
4.3.2 Design of a Signature Analyzer and its Simulation Results
4.3.3 Built-In Self-Checking with Zero Signature
4.4 Pseudorandom Testing of the Full Adder
4.4.1 Review of Pseudorandom Testing
4.4.2 Application of the Results on the Full Adder
4.5 Aliasing in the Linear Feedback Signature Register
4.5.1 Review of Aliasing Concept
4.5.2 Application of Ivanov's Results to Our Signature Analyzer
4.6 Designs and Simulations of the Two Systolic Cells
5. Reconfiguration Algorithms of a Two-Dimensional Systolic Array
6. Summary and Further Research
List of References
Appendix - Detailed Schematic Diagrams of Processing Elements
Abstract
The increasing complexity of integrated circuits has made it more difficult to
test their internal circuits and to obtain a reasonable yield. Thus, fault-tolerant
techniques are sometimes added to the circuits to enhance the yield, whereas
built-in self-test increases the controllability and observability of the circuit to
make it more easily testable. The matrix $BA^{-1}$, which can be implemented with
systolic techniques, is one of the generic operations of adaptive processing in
block type. A systolic array is a class of parallel processors which is made up of
a few types of processing elements with local communication.
This thesis discusses incorporating fault-tolerant techniques and built-in
self-test into a systolic array which is intended for computing the matrix $BA^{-1}$.
In particular, scan path design, pseudorandom pattern generation, and aliasing in
signature analysis are studied in detail. This thesis also describes the design of
two easily testable processing elements for the matrix $BA^{-1}$.
1. Introduction
Recent advances in processing technology of Very-Large Scale Integrated (VLSI)
circuits concentrate on decreasing the minimum feature size of devices while achieving
an increase in the number of transistors. As the cost of mask lithography increases
exponentially with the decrease of the minimum feature size [27], increasing the die
sizes will be the only way to fabricate denser systems within a reasonable increase of
the manufacturing cost. This trend should prompt the microelectronics industry toward
the realization of Wafer-Scale Integration (WSI) in future years. Much research has
been done on WSI in the past decades to support the development of this upcoming era.
Although WSI is still unusual at this time, defect-tolerant memories have existed for
several years [31]. Repair techniques, such as tailored metalization between the
working cells and usage of laser and electricity to create links, have been tried.
Fabrication defects, reliability issues, and production testing are often the major
concerns of manufacturing conventional VLSI circuits [47]. As WSI pushes the
physical dimensions to their natural limit - the size of wafers [12], more problems are
expected to arise. In order to enhance the yield and to improve the reliability,
redundancy must be introduced to the chips. Otherwise, chips with defects are
discarded under the current wafer-sort procedures. Since the accessibility of the internal
circuits from the pads becomes more difficult as the die sizes grow, additional circuitry
to aid testing seems to be inevitable.
In general, two kinds of redundancy methods are proposed for yield and
performance enhancement: static and dynamic reconfiguration [20]. Static redundancy
schemes refer to techniques in which interconnections are physically and often
permanently changed to connect the good parts of the die. A good example of this
reconfiguration method is the Restructurable VLSI (RVLSI) program at MIT's Lincoln
Laboratory [3]. Laser links are created between the defect-free modules after fabrication for
defect avoidance and customization. As a result, the yield and utilization of the fault-
free modules are very high. Dynamic reconfiguration schemes are required during
normal system operations where faulty parts are bypassed and no intermediate inputs or
outputs should be allowed to flow through those faulty modules. The Diogenes
approach used by Rosenberg [39] is one example of how to reconfigure a faulty
array into a fault-free array by introducing redundancy in communication links. If faults
occur, altering the setting of switches changes the array configuration in order to
achieve a functional array. Static schemes utilize a high proportion of modules but
consume operator time, whereas dynamic methods are controlled by the system itself
but require additional hardware. Hence, static reconfiguration is for defect tolerance
after production, and dynamic reconfiguration is for fault-tolerance during field
operation.
By moving some of the tester functions to the chip, the complexity of testing may be
greatly reduced. This technique is called Built-In Self-Test (BIST). BIST requires that
structures which can generate test stimuli and evaluate output responses must be added
to the hardware. Test patterns are applied to inputs of the circuit and output data are
evaluated by a compactor. Hence, BIST consists of three parts, namely test pattern
generator, circuit under test (CUT), and output response analyzer. A fixed signature
may be used to diagnose the fault at the completion of each test. Test patterns can either
be generated by a program (an off-line approach) or be calculated while the patterns are
being applied (a concurrent approach) [29]. On-chip storage of all fault-free outputs (a
fault dictionary) requires too much memory to be a practical technique. If correct
outputs are not available for direct comparison, output responses are often compressed
by a compactor to reduce the amount of output data. The output of the compactor is
called a signature. Data compaction leads to loss of information in the signatures, which
results in an effect called aliasing. This happens when the signature of a faulty circuit is
identical to the fault-free signature after compaction has been done. A scan path may also
be added to the circuit to enhance the controllability and observability of the CUT. Also,
snapshots of normal operation can be taken by shifting out the contents of the chosen
registers along the scan chain.
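The three parts of BIST and the aliasing effect can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the structures designed later in this thesis: the 4-bit shift register and its feedback taps, the choice of a full adder as the CUT, the stuck-at-0 fault on the carry output, and all function names are hypothetical choices made for the example.

```python
def lfsr_step(state, bit_in=0):
    """Advance a 4-bit shift register; feedback is bit0 XOR bit1 XOR bit_in (assumed taps)."""
    feedback = (state ^ (state >> 1) ^ bit_in) & 1
    return (state >> 1) | (feedback << 3)

def test_patterns(seed=0b0001, count=15):
    """Test pattern generator: successive register states with no external input."""
    patterns, state = [], seed
    for _ in range(count):
        patterns.append(state)
        state = lfsr_step(state)
    return patterns

def full_adder_cut(pattern, carry_stuck_at_0=False):
    """Circuit under test: a full adder fed by the three low bits of the pattern."""
    a, b, c_in = pattern & 1, (pattern >> 1) & 1, (pattern >> 2) & 1
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    if carry_stuck_at_0:          # hypothetical fault used for illustration
        c_out = 0
    return [s, c_out]

def response_stream(faulty=False):
    """Concatenate the CUT outputs for all test patterns into one bit stream."""
    stream = []
    for pattern in test_patterns():
        stream.extend(full_adder_cut(pattern, faulty))
    return stream

def signature(stream):
    """Output response analyzer: compact the stream into a 4-bit signature."""
    sig = 0
    for bit in stream:
        sig = lfsr_step(sig, bit)
    return sig

good = response_stream()
bad = response_stream(faulty=True)
print("fault detected:", signature(good) != signature(bad))

# Aliasing: a different response stream that compacts to the same signature.
aliased = list(good)
aliased[0] ^= 1
aliased[15] ^= 1
print("aliased stream differs:", aliased != good)
print("aliased signature equal:", signature(aliased) == signature(good))
```

Because this compactor is linear and its autonomous period is 15, two response errors that are 15 positions apart cancel in the signature, which is one concrete instance of the aliasing effect described above.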
A systolic array consists of a set of interconnected processing elements (PEs) which
rhythmically compute and pass data through the system [21]. The array features
simple, regular, and modular designs, a high degree of pipelining and local
interconnection, as well as balanced computation with I/O. Similar to the case of WSI
memories which are made up of interconnected cells, systolic arrays are feasible for the
WSI implementation due to their modularity and regularity. The only difference
between a memory cell and a PE in a systolic array is that a systolic system allows more
than one type of cell to exist. Other features of systolic arrays include multiple usage
of inputs, extensive concurrency, as well as simple and regular data and control flow.
The characteristics of systolic arrays imply that speeding up compute-bound
computations may often be accomplished by the systolic approach. Since matrix
computations for signal processing are compute-intensive, the systolic array
implementation is well suited for these kinds of operations.
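To make the idea of rhythmic computation with local communication concrete, the following Python sketch simulates a generic linear systolic array for a matrix-vector product. It is a textbook-style illustration, not the array designed in this thesis: the function name, the choice of keeping one vector element stationary in each PE, and the clocking scheme are assumptions made for the example.

```python
def systolic_matvec(A, x):
    """Simulate a linear systolic array computing y = A x.

    PE j holds x[j]. Partial sums enter PE 0 and move one PE to the right
    on every clock tick; PE j works on row i of A at tick t = i + j.
    """
    m, n = len(A), len(x)
    pipe = [None] * n            # pipe[j]: (row, partial sum) latched at the output of PE j
    y = [0] * m
    for t in range(m + n - 1):   # m + n - 1 ticks drain the pipeline
        new_pipe = [None] * n
        for j in range(n):
            i = t - j            # row handled by PE j at this tick
            if 0 <= i < m:
                incoming = 0 if j == 0 else pipe[j - 1][1]   # only the left neighbor is read
                new_pipe[j] = (i, incoming + A[i][j] * x[j])
        pipe = new_pipe
        if pipe[n - 1] is not None:
            row, total = pipe[n - 1]
            y[row] = total       # finished sum leaves the last PE
    return y

print(systolic_matvec([[1, 2], [3, 4], [5, 6]], [10, 1]))
```

Each PE reads only the value latched by its left neighbor on the previous tick, so all communication is local, and after the pipeline fills one result leaves the array on every tick.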
Detection and estimation are two essential operations in signal processing. One of
the operations needed very often is the multiplication of two matrices, in particular, the
multiplication of matrix $B$ with the inverse of $A$ (i.e., $A^{-1}$) to obtain the product $BA^{-1}$.
Calculation of $BA^{-1}$ is an intermediate step of computing a linear optimal solution in the
least-squares sense. To obtain the linear optimal solution, another matrix, $R$, is needed
to multiply the product $BA^{-1}$. In this thesis, the systolic array was chosen for these
computations. This thesis explores a special design for testability (DFT) systolic array
structure and a reconfiguration algorithm to enhance the yield after
production and to allow regular checks to determine whether reconfigurations are
needed during system operation.
This thesis is organized into six chapters. In chapter two, we first review two
graph-based methods for deriving the systolic array, and then discuss the algorithm
used to compute $BA^{-1}$. Physical limitations on the implementation and block diagrams
are presented as well. Chapter three describes the algorithms and the actual designs of
three computational units, namely divider, multiplier, and adder. Simulation results are
also discussed. Chapter four reviews the general BIST and scan techniques, and then
discusses the ones used in the array. Test pattern generation and signature analysis are
explained in detail for the specific structures that have been designed. Simulations of
our designs and some results relevant to the BIST structures, in particular,
pseudorandom testing of the full adder and aliasing in the linear feedback shift register,
are also presented. Chapter five discusses the available reconfiguration techniques
which are suitable for this particular array. However, only a dynamic reconfiguration
scheme is presented. We conclude this thesis in chapter six.
2. Systolic Algorithm and Structure
2.1 Methods on the Transformation of a Matrix Algorithm to a Systolic Array
Many different methods have been proposed on how to transform matrix algorithms
to systolic architectures. Using algorithm dependencies as a transformation tool seems
to be more appropriate because they show the parallelism between variables and restrict
the communication requirement. However, the effective use of dependencies relies
heavily on how the dependencies are expressed and manipulated. These dependencies
should be capable of regularizing and deriving arrays while taking into consideration
implementation constraints, tradeoffs among implementation parameters, and cost
and performance, and while preserving the dependencies in the algorithm. In general,
mapping algorithms onto systolic arrays comprises two stages, namely regularization
of the algorithms and derivation of arrays from the regularized algorithms [32].
Two kinds of dependencies are suggested, index-dependencies and data-
dependencies. Index-dependencies refer to ones in which dependencies are described
by relations in the index space of an algorithm. For a matrix algorithm with for loops,
regular iterative algorithms (RIA) can first be generated by obtaining a single
assignment algorithm, and then by index matching and localization of dependencies.
Dependencies among variables are related to the distances between those variables in the
index-spaces, and are represented as expressions with those indices [38]. For example,
suppose a two-dimensional filtering algorithm has been given [38],
$$ y_{ij} = \sum_{k=1}^{n} a_k\, y_{i-k,\,j-k} + \sum_{k=0}^{n} b_k\, u_{i-k,\,j-k} $$
For the above algorithm, an RIA can be written in the following form:
for i = 0 to n do
    for j, k = 0 to N do
        x(i, j+1, k+1) = f_{x,i}(x(i, j, k), y(i, j, k), w(i, j, k))
        y(i+1, j, k) = f_{y,i}(x(i, j, k), y(i, j, k), w(i, j, k))
        w(i-1, j, k) = f_{w,i}(x(i, j, k), w(i, j, k))
where $f_{x,i}$, $f_{y,i}$, and $f_{w,i}$ are linear functions that are determined by a synthesis procedure.
An index-dependency graph has one dimension for each index in the RIA, potentially
leading to a graph with many dimensions. Moreover, there is no systematic approach to
obtain an RIA for a given algorithm, and this process is carried out heuristically. In other
words, no formal guidelines on how to achieve an optimal RIA have been given
because there are too many different optimality criteria which can be applied to the
algorithm. Also, transforming an algorithm into an RIA may add more computing load
to the array with extra variables and operations. Data-dependencies are described by a
graph which is obtained by following the data flow in the algorithm. Under this
criterion, no additional variables or operations are added to the original algorithm. As a
result, no extra hardware is added during the transformation. The resulting graph is
then modified to become the actual design of an array. This method is more systematic
and easier to use than index-dependencies. For the rest of this section, we will only
concentrate on the discussion of data-dependency techniques.
2.1.1 Descriptions of Two Graph-Based Data-Dependency Methods
2.1.1.1 S. Y. Kung's Method [23]
Kung's method for the regularization stage involves identifying a suitable single
assignment representation of an algorithm and generating a dependency graph (DG)
according to the space-time indices in the algorithm. A DG, the graphical representation of
a single assignment algorithm, is a graph that shows the dependencies of the
computations in the algorithm. No systematic approaches are given by Kung on how to
modify a DG. Localizing and reindexing of the DG are recommended to obtain a better
design, so the process is carried out as an ad hoc transformation. The DG is then
transformed to a signal flow graph (SFG) by following two steps: processor
assignment and scheduling. Details of the two steps are not reviewed here; see [23].
An SFG represents an intermediate stage of hardware-level description. A node denotes
either an arithmetic or logic function performed with zero delay, whereas an edge
represents delays or data dependency between two nodes. After an SFG is formed, one
can apply pipeline retiming techniques, such as cut-set retiming procedures, to the
SFG to create a systolic array. Figure 2.1.1 shows the design flow of his method.
Algorithm → Single Assignment Algorithm → DG → SFG → Systolic Array
Fig. 2.1.1 Design flow of Kung's method
2.1.1.2 J. H. Moreno's Method [32]
Algorithm dependencies are described here by a direct mapping of the algorithm
onto a graph. A fully-parallel data-dependency graph (FPG) denotes a single
assignment representation of an algorithm. A node in the graph represents an operation
whereas an edge corresponds to a dependency between two operations. By replacing
broadcasting with transmittent data, removing bi-directional data flow by moving nodes
to one side of broadcasting, and adding delay nodes so that dependencies are strictly
between neighboring nodes, a multi-mesh dependency graph (MMG) is finally formed.
Then one needs to collapse the MMG onto a two-dimensional G graph by grouping
primitive nodes of MMG onto one node of G graph. As a result, grouping along any of
the three-dimensional axes (x, y, and z) may produce different arrays with different
performances. Figure 2.1.2 shows the design flow of his method.
Algorithm → FPG → MMG → G Graph → Systolic Array
Fig. 2.1.2 Design flow of Moreno's method
Comparing the two methods mentioned above, the MMG method is more effective
and systematic than the SFG for the transformation of an algorithm to a graph. The
MMG scheme directly maps a matrix algorithm to a graph whereas the SFG method
requires some ad hoc manipulations of a DG. Moreover, generating a DG from the
single assignment algorithm and an SFG from a DG are rather ad hoc processes. In
other words, there are no formal guidelines on how to transform a DG to an efficient
SFG and subsequently to an array.
2.2 The Gauss-Jordan Diagonalization Algorithm [8]
Let $A$ and $B$ be an $n \times n$ matrix and a $p \times n$ matrix respectively, and let $C$ be an
$(n+p) \times n$ matrix,
$$ C = \begin{bmatrix} A \\ B \end{bmatrix}. $$
We reduce $C$ by the combination of rows. The way to proceed is to postmultiply $C$ by
$n$ elementary $n \times n$ matrices, $J_i$, in order to obtain $BA^{-1}$ after $n$ steps.
Let $C_k = C_{k-1}J_k$. Start with $C_0 = C$, and let $\tilde{C}_k$ be an $(n+p-k) \times n$ matrix which is
the last $n+p-k$ rows of $C_k$. We choose $J_k$ so that the first $k$ rows of $C_k$ are those of $I_n$,
an identity matrix of order $n$, i.e.,
$$ C_k = \begin{bmatrix} \text{the first } k \text{ rows of } I_n \\ \tilde{C}_k \end{bmatrix}. $$
Consequently, after $n$ iterations, $\tilde{C}_n$ is the product $BA^{-1}$. Let $J_k$ be denoted by its
$k$th row, the only row in which it differs from $I_n$,
$$ \begin{bmatrix} c_{k1}^{(k)} & \cdots & c_{kn}^{(k)} \end{bmatrix}. $$
The coefficients of $\tilde{C}_k$ are represented by the variables $c_{ij}^{(k)}$, where $k+1 \le i \le n+p$ and
$1 \le j \le n$.
The values of $c_{ij}^{(k)}$ are computed iteratively by the following algorithm, starting from
$c_{ij}^{(0)} = c_{ij}$.
for k := 1 to n
begin
    c_kk^(k) := 1 / c_kk^(k-1);
    for j := 1 to n, j ≠ k
        c_kj^(k) := -c_kk^(k) * c_kj^(k-1);
    for i := k+1 to n+p
    begin
        for j := 1 to n, j ≠ k
            c_ij^(k) := c_ij^(k-1) + c_ik^(k-1) * c_kj^(k);
        c_ik^(k) := c_ik^(k-1) * c_kk^(k);
    end;
end;
Fig. 2.2.1 Algorithm to compute $BA^{-1}$.
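The algorithm of Fig. 2.2.1 can be sketched in Python as follows. This is a sequential software rendering for checking the recurrence, not the systolic implementation; the function name `ba_inverse`, the use of exact `Fraction` arithmetic, and the absence of pivoting (a zero pivot simply raises an error) are assumptions made for the example.

```python
from fractions import Fraction

def ba_inverse(A, B):
    """Return B * A^{-1} using the Gauss-Jordan recurrence of Fig. 2.2.1.

    A is n x n and B is p x n, both given as lists of rows. No pivoting is
    done, so every pivot c_kk must be non-zero.
    """
    n, p = len(A), len(B)
    C = [[Fraction(v) for v in row] for row in A + B]   # C = [A; B]
    for k in range(n):
        if C[k][k] == 0:
            raise ZeroDivisionError("zero pivot; pivoting is not implemented")
        ckk = 1 / C[k][k]                                # c_kk^(k) := 1 / c_kk^(k-1)
        # k-th row of the elementary matrix J_k
        row_k = [ckk if j == k else -ckk * C[k][j] for j in range(n)]
        for i in range(k + 1, n + p):
            cik = C[i][k]                                # c_ik^(k-1)
            for j in range(n):
                if j != k:
                    C[i][j] += cik * row_k[j]            # c_ij += c_ik * c_kj
            C[i][k] = cik * ckk                          # c_ik := c_ik * c_kk
    return C[n:]                                         # last p rows hold B A^{-1}

A = [[4, 1], [2, 3]]
B = [[1, 0], [0, 1]]
X = ba_inverse(A, B)
# Check: X * A should reproduce B.
print([[sum(X[i][k] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)] == B)
```

Only rows below the current pivot row are updated, because the first k rows of C_k are known to equal those of the identity and are never read again.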
2.3 Limitations in Our Implementation
To simplify our implementation, $a_{11}$ has to be the largest entry of the first column to
avoid multiplication by a permutation matrix $P$, an identity matrix with its rows
reordered. If $A$ is a matrix, then $PA$ is a row-permuted version of $A$ [14]. For example,
suppose
$$ A = \begin{bmatrix} 3 & 17 & 10 \\ 2 & 4 & -2 \\ 6 & 18 & -12 \end{bmatrix}, \quad P = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}; $$
then
$$ PA = \begin{bmatrix} 6 & 18 & -12 \\ 2 & 4 & -2 \\ 3 & 17 & 10 \end{bmatrix}. $$
This particular row interchange of a matrix is called partial pivoting. Any kind of
pivoting usually degrades the performance. To avoid pivoting, we suppose that $A$ is a
diagonally dominant matrix. To satisfy the condition of being diagonally dominant, the
magnitude of each diagonal entry, $|a_{ii}|$, has to be greater than the sum of the magnitudes
of the other entries $a_{ij}$ in the same row:
$$ |a_{ii}| > \sum_{j=1,\ j \neq i}^{n} |a_{ij}|. $$
It is well known that partial pivoting cannot be implemented on systolic arrays because
the scanning for a pivot would break down the regularity of the design. In addition, we
suppose that $A$ is a positive definite matrix, and hence non-singular. This matrix property
guarantees that there is a solution for $BA^{-1}$. With the properties of positive definiteness
and diagonal dominance, the Gauss-Jordan algorithm is numerically stable.
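The diagonal dominance condition is easy to check before a matrix is fed to the array. The following Python sketch shows one way to do so; the function name and the sample matrices are hypothetical and serve only as an illustration.

```python
def is_strictly_diagonally_dominant(A):
    """True if |a_ii| exceeds the sum of |a_ij| over j != i in every row."""
    for i, row in enumerate(A):
        off_diagonal = sum(abs(v) for j, v in enumerate(row) if j != i)
        if abs(row[i]) <= off_diagonal:
            return False
    return True

print(is_strictly_diagonally_dominant([[4, 1, 1], [1, 5, 2], [0, 2, 3]]))
print(is_strictly_diagonally_dominant([[1, 2], [3, 4]]))
```

The first sample matrix satisfies the condition in every row, while the second fails already in its first row because 1 is not greater than 2.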
2.4 Our Systolic Architecture for Computing $BA^{-1}$
Two systolic structures [8, 33] are suggested in the literature. Both structures have
their own advantages. According to Moreno's [33] analysis, the following table is
obtained by using the multi-mesh dependency graph (MMG) method, projecting along y
and x axes. Grouping along the Y-axis corresponds to the systolic array created by
Comon and Robert [8], whereas grouping along X-axis is another array derived by
Moreno and Lang [33]. Table 2.4.1 shows the analysis results of two systolic arrays.
                    Y-axis                        X-axis
Computation time    4n + p - 2                    4n + p - 2
Throughput          1/(n+p)                       1/(n+1)
Number of cells     n(n+1)                        n(n+1)/2 + pn
I/O ports           2n                            n + 2p
Utilization         n(n+2p+1) / [2(n+1)(n+p)]     n^2 / [n(n+1)]
Cells complexity    3 types                       2 types
Table 2.4.1 Performance Measures of Systolic Arrays for Computing $BA^{-1}$ [33].
Both arrays compute $BA^{-1}$ with the same number of steps. With $p < n/2$, Moreno's
array requires fewer I/O ports and fewer PEs. In addition, his version of the array has
a higher throughput and utilization of cells because both of these measures are
independent of $p$. However, Comon's array is more versatile. Not only can $BA^{-1}$ be
computed, but the product $A^{-1}B$ can also be calculated without any changes in the array.
In this case, the $C$ matrix needs to be changed to the following form,
$$ C = \begin{bmatrix} A^T \\ B^T \end{bmatrix}, $$
where $X^T$ is the transpose of a matrix $X$. Since $B^T(A^{-1})^T = (A^{-1}B)^T$, the results of the
array require another transpose to obtain correct results. Another advantage of
Comon's array is that it can compute $BA^{-1}$ for the successive matrices $B_1, B_2, \ldots, B_n$
with the same array. For $n=4$ and $p=2$, table 2.4.2 is obtained:
                    Y-axis     X-axis
Computation time    16         16
Throughput          1/6        1/5
Number of cells     20         18
I/O ports           8          8
Utilization         3/5        4/5
Cells complexity    3 types    2 types
Table 2.4.2 Performance Measures of Systolic Arrays for Computing $BA^{-1}$
(n=4, p=2).
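The entries of Table 2.4.2 follow from substituting n = 4 and p = 2 into the expressions of Table 2.4.1. The short Python sketch below repeats that substitution so that other array sizes can be compared; the function name and dictionary keys are hypothetical, and the formulas are simply those read from Table 2.4.1.

```python
from fractions import Fraction

def array_measures(n, p):
    """Evaluate the Table 2.4.1 expressions for both groupings."""
    y_axis = {
        "computation_time": 4 * n + p - 2,
        "throughput": Fraction(1, n + p),
        "cells": n * (n + 1),
        "io_ports": 2 * n,
        "utilization": Fraction(n * (n + 2 * p + 1), 2 * (n + 1) * (n + p)),
    }
    x_axis = {
        "computation_time": 4 * n + p - 2,
        "throughput": Fraction(1, n + 1),
        "cells": n * (n + 1) // 2 + p * n,   # n(n+1) is always even
        "io_ports": n + 2 * p,
        "utilization": Fraction(n * n, n * (n + 1)),
    }
    return y_axis, x_axis

y_axis, x_axis = array_measures(4, 2)
for name in y_axis:
    print(name, y_axis[name], x_axis[name])
```

For n = 4 and p = 2 the two groupings tie on computation time and I/O ports, so the choice rests on throughput, cell count, utilization, and the number of cell types.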
According to the values obtained above, if $A$ is a $4 \times 4$ matrix and $B$ is a $2 \times 4$ matrix, it
will be advantageous for us to employ Moreno's design to compute $BA^{-1}$. In fact, that
is exactly what we have chosen to implement. In our design, there are two types of
PEs. In the next section, the block diagrams demonstrate how different arithmetic
functions are interconnected and performed, and how the intermediate results are
computed. Each cell inside the two PEs has inputs from the left, the top, and the
diagonal direction of the upper left-hand corner of each cell. Outputs are connected to
the bottom, the right, and the diagonal direction of the lower right-hand corner of each
cell. Inputs and outputs on the y- and z-axes of each PE are connected to another PE
according to the direction of the arrow in figure 2.5.3. Projecting the three-dimensional
structure down to the planar diagram, the block diagrams in figures 2.5.1 to 2.5.3 are obtained.
2.5 Block Diagrams of the Array
Fig. 2.5.1 A type 1 PE
Fig. 2.5.2 A type 2 PE
[Figure residue: cell labels 1/z, -ab, c+ab, ab, and delay; array inputs a11 through a44 and b11 through b24]
Fig. 2.5.3 The systolic array for computing $BA^{-1}$ [33].
3. Designs and Simulations of Computational Units
Our systolic array employs the fixed-point two's complement number system. The
two's complement number representation is a fixed-radix arithmetic system with a radix
r=2 and a digit set {0,1}, and each number is uniquely represented [15]. Suppose a
number
$$ A = a_{n-1}a_{n-2} \cdots a_1a_0, $$
where
$$ a_{n-1} = \begin{cases} 0, & \text{if } A \ge 0 \\ 1, & \text{if } A < 0. \end{cases} $$
Therefore, $a_{n-1}$ is the sign digit, and the remaining digits are either true magnitude or
two's complemented magnitude. We have chosen all the operands to be five-bit in
two's complement representation. An eight-bit number representation was
not selected because it requires a huge database and results in very slow
simulations. Since it takes approximately 100 Mbytes of disk storage to store the
schematics and simulation results for the five-bit operands, eight-bit operands are
impractical for the purpose of illustrating our design methodology. All the schematics
were drawn using NETED while the logic simulation was done with QUICKSIM.
Both software packages are available as part of the Mentor Graphics Design Tools. The
radix point is chosen as follows:
For analysis purposes, all the time delays in this thesis are in terms of Δ's. Table
3.0.1 shows the time delays for different gates.
Gate function    Number of Δ's
NAND
NOR
NOT
XOR
Table 3.0.1 Time Delays of Typical Gate Functions [15].
All the simulations in this chapter were done exhaustively. Since there are too many
output patterns for some of the simulation files, only a few selected outputs are listed in
the following sections. All schematics in this chapter are shown in the Appendix
following the list of references.
3.1 Design of a Two's Complementer [15] and its Simulation
Results
A bit-scanning technique is used to perform the desired complementation. Let
$$ A = a_4a_3a_2a_1a_0. $$
The bit-scanning process starts from the right end, $a_0$, and proceeds to the left until the
very first '1' is found, say $a_i = 1$ for the smallest $i$ such that $0 \le i \le 4$. Then, every input
bit to the right of $a_i$, including $a_i$ itself, remains unchanged and every input bit to the left of
$a_i$ is complemented. This complementing technique may be written in the following
form, where $+$ denotes the logical OR:
$$ C_{-1} = 0 $$
$$ C_i = A_i + C_{i-1} $$
$$ A_i^* = A_i \oplus E\,C_{i-1} $$
The control line, $E$, enables the complementation process. If $E$ is equal to zero, no
complementation will be performed by the complementer and the output ($A^*$) is identical
to the input ($A$). The time delay of this structure is 3Δ due to the XOR gate. Table
3.1.1 contains the simulation results of the two's complementer.
E A(4:0) A*(4:0)
0 00000 00000
0 00001 00001
0 00010 00010
0 00011 00011
0 00100 00100
0 00101 00101
0 00110 00110
0 00111 00111
0 01000 01000
0 01001 01001
0 01010 01010
0 01011 01011
0 01100 01100
0 01101 01101
0 01110 01110
0 01111 01111
0 10000 10000
0 10001 10001
0 10010 10010
0 10011 10011
0 10100 10100
0 10101 10101
0 10110 10110
0 10111 10111
0 11000 11000
0 11001 11001
0 11010 11010
0 11011 11011
0 11100 11100
0 11101 11101
0 11110 11110
0 11111 11111
1 00000 00000
1 00001 11111
1 00010 11110
1 00011 11101
1 00100 11100
1 00101 11011
1 00110 11010
1 00111 11001
1 01000 11000
1 01001 10111
1 01010 10110
1 01011 10101
1 01100 10100
1 01101 10011
1 01110 10010
1 01111 10001
1 10000 10000 Invalid
1 10001 01111
1 10010 01110
1 10011 01101
1 10100 01100
1 10101 01011
1 10110 01010
1 10111 01001
1 11000 01000
1 11001 00111
1 11010 00110
1 11011 00101
1 11100 00100
1 11101 00011
1 11110 00010
1 11111 00001
Table 3.1.1 Simulation results of the two's complementer.
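The bit-scanning rule can be checked in software with a short Python sketch. It is a behavioral model, not the gate-level schematic: the function name is hypothetical, and the scan signal entering bit 0 is taken to be zero, which is an assumption about how the recurrence is indexed.

```python
def twos_complementer(a, enable, width=5):
    """Bit-scanning complementer: C_i = A_i OR C_{i-1}, A*_i = A_i XOR (E AND C_{i-1})."""
    c_prev = 0                  # assumed scan input to the least significant bit
    result = 0
    for i in range(width):
        a_i = (a >> i) & 1
        result |= (a_i ^ (enable & c_prev)) << i   # complement only left of the first '1'
        c_prev = a_i | c_prev                      # remember that a '1' has been seen
    return result

for a in (0b00001, 0b00110, 0b10000):
    print(format(a, "05b"), format(twos_complementer(a, 1), "05b"))
```

With the enable input at 1 the model agrees with arithmetic negation modulo 32 for every five-bit input, including the pattern 10000, which maps to itself and is therefore flagged as invalid in the table.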
3.2 Design of a Two's Complement Adder and its Simulation
Results
A two's complement addition is equivalent to a magnitude addition or subtraction,
and the result should be transformed to a correct complemented form. Addition
involves two operands, say $A$ and $B$, and the sum, $S = A+B$. Let $A = A_{n-1} \cdots A_1A_0$ and
$B = B_{n-1} \cdots B_1B_0$ be the two operands of addition, and let the sum be of the form
$S = S_nS_{n-1} \cdots S_0$. We consider the addition in three cases [15]:
Case 1: If both $A$ and $B$ are positive, then $S_{n-1} = 0$. If $|S| < 2^{n-1}$, then $S$ is correct.
Otherwise, overflow occurs when $|S| \ge 2^{n-1}$.
Case 2: If one of the operands, $B$, is negative and the other operand, $A$, is positive, then
$$ S = 2^n + (|A| - |B|) $$
is correct, and no overflow will occur. This is also true if $A$ and $B$ exchange their
signs.
Case 3: If both $A$ and $B$ are negative, then $S_{n-1} = 1$ and
$$ S = 2^{n+1} - (|A| + |B|). $$
If $|S| \le 2^{n-1}$, then $S$ is correct; otherwise, overflow occurs when $|S| > 2^{n-1}$.
Among all the available addition algorithms, the carry lookahead adder (CLA) provides
the fastest response time. For the CLA technique, we need the carry propagate, $P_i$, and
the carry generate, $G_i$, to generate the corresponding carry.
$$ G_i = A_iB_i $$
$$ P_i = A_i \oplus B_i $$
Let $S_i$ and $C_i$ be the sum and carry outputs of the $i$th stage, then
$$ S_i = P_i \oplus C_{i-1} $$
$$ C_i = G_i + P_iC_{i-1} $$
Carries can be generated recursively by the $C_i$ formula, and the first four carry formulae are
shown as follows:
$$ C_0 = G_0 + P_0C_{in} $$
$$ C_1 = G_1 + G_0P_1 + P_1P_0C_{in} $$
$$ C_2 = G_2 + G_1P_2 + G_0P_2P_1 + P_2P_1P_0C_{in} $$
$$ C_3 = G_3 + G_2P_3 + G_1P_3P_2 + G_0P_3P_2P_1 + P_3P_2P_1P_0C_{in} $$
Instead of implementing all four relations shown above, we have selected a modular
approach for implementation with two extra intermediate signals, namely block
propagate, $BP^*$, and block generate, $BG^*$, so that the operands can be extended to
any width without redesigning the CLA unit:
$$ BP^* = P_3P_2P_1P_0 $$
$$ BG^* = G_3 + G_2P_3 + G_1P_3P_2 + G_0P_3P_2P_1 $$
$$ C_3 = BG^* + BP^*C_{in} $$
This technique is called block carry lookahead (BCLA) [15]. However, the modular
approach requires more hardware and delay time. A CLA adder with one level of
BCLA unit requires 12Δ whereas a CLA adder without the BCLA unit only takes 8Δ.
For our purposes, Cin is tied to ground all the time. Two more signals, inva and oflow,
are introduced to deal with an invalid number representation and with overflow
situations.
$$ \mathit{inva} = a_4\bar{a}_3\bar{a}_2\bar{a}_1\bar{a}_0 + b_4\bar{b}_3\bar{b}_2\bar{b}_1\bar{b}_0 $$
$$ \mathit{oflow} = C_3 \oplus C_4 $$
The number '10000' is invalid in the two's complement representation because there is
no zero with a negative sign. The signal inva is used to check whether the particular
invalid representation has been inputted to the adder or not. In other words, the invalid
number check is completed at the beginning of the addition. This is also true for
multiplication and division. Due to the size of the large simulation output file, only a
few selected outputs are shown in table 3.2.1:
A      B      S      oflow  inva
00011  01101  10000  1      0
00101  00100  01001  0      0
01101  10111  00100  0      0
10010  00011  10101  0      0
11011  11001  10100  0      0
11100  10011  01111  1      0
10000  00110  10110  0      1
Table 3.2.1 Simulation results of the two's complement CLA adder.
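The rows of Table 3.2.1 can be reproduced with a behavioral Python sketch of the adder equations. It is a software model under stated assumptions, not the schematic: the function names are hypothetical, the carry recurrence is evaluated sequentially (the hardware expands it so that the carries are available in parallel), and the carry input is fixed at zero in `cla_add` as in the text.

```python
def carry_chain(a, b, width=5, cin=0):
    """Return (P, G, C) as bit lists, where C[i] is the carry out of stage i."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]      # G_i = A_i B_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]    # P_i = A_i XOR B_i
    c, prev = [], cin
    for i in range(width):
        prev = g[i] | (p[i] & prev)                          # C_i = G_i + P_i C_{i-1}
        c.append(prev)
    return p, g, c

def block_carry(a, b, cin=0):
    """C_3 computed from the block generate and block propagate signals."""
    p, g, _ = carry_chain(a, b, 4, cin)
    bp = p[3] & p[2] & p[1] & p[0]
    bg = g[3] | (g[2] & p[3]) | (g[1] & p[3] & p[2]) | (g[0] & p[3] & p[2] & p[1])
    return bg | (bp & cin)

def cla_add(a, b, width=5):
    """Return (S, oflow, inva) for two width-bit two's complement operands."""
    p, _, c = carry_chain(a, b, width, 0)
    carry_in = [0] + c[:-1]                                  # carry entering each stage
    s = 0
    for i in range(width):
        s |= (p[i] ^ carry_in[i]) << i                       # S_i = P_i XOR C_{i-1}
    oflow = c[width - 2] ^ c[width - 1]                      # carry into vs. out of sign bit
    invalid = 1 << (width - 1)                               # the pattern 10000
    inva = int(a == invalid or b == invalid)
    return s, oflow, inva

s, oflow, inva = cla_add(0b00011, 0b01101)
print(format(s, "05b"), oflow, inva)
```

Comparing `block_carry` with the fourth entry of the sequential carry list for all operand pairs is a quick way to confirm that the block signals give the same carry as the expanded formula.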
3.3 Design of a Baugh.Wooley Two's Complement Multiplier
[6] and its Simulation Results
According to Baugh-Wooley's two's complement parallel array multiplication
algorithm, let A =(lin-lan-2 ... aoh and B =(bn-lbn-2 . " boh. then
n-2
IA1= -an_12n-1 + L ai2i
i=O
n-2
IB I= -bn_12n-1 + L bi2i
i=O
The product is, P =AB =(P2n-lP2n-2 ... Poh, i.e.,
n-2 n-2 n-2 n-2
IPI =an_lbn_122n-2 + L L aibpi+j - L aibn_12n-l+i - L an_lbi2n-l+i
i=O j=o i=O i=O
The original idea of the algorithm is to separate the partial product bits with negative
signs (the last two terms of the above equation) from the positive ones (the first two
terms above). The product can then be computed by first adding the positive partial
products and then subtracting the negative partial products. The authors further
simplify the subtraction process by negating the partial products so that the
multiplication is only a series of repeated shift-and-add processes. The following two
equations illustrate how the subtractions in the above equation are converted to
additions.
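\[ -\sum_{i=0}^{n-2} a_i b_{n-1} 2^{n-1+i} = 2^{n-1}\Bigl(-2^n + 2^{n-1} + \bar{b}_{n-1}2^{n-1} + b_{n-1} + \sum_{i=0}^{n-2} \bar{a}_i b_{n-1} 2^i\Bigr) \]
\[ -\sum_{i=0}^{n-2} a_{n-1} b_i 2^{n-1+i} = 2^{n-1}\Bigl(-2^n + 2^{n-1} + \bar{a}_{n-1}2^{n-1} + a_{n-1} + \sum_{i=0}^{n-2} a_{n-1} \bar{b}_i 2^i\Bigr) \]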
In other words, no subtraction is necessary. Therefore, only the original full adders
(FA) are used. Hence, this algorithm is suitable for modular design in VLSI. Let S be
the sum and C be the carry; with carry-in $C_{i-1}$, the equations for the full adder (FA)
are the standard ones:
\[ S = A \oplus B \oplus C_{i-1} \]
\[ C = AB + AC_{i-1} + BC_{i-1} \]
Normally, the FA is designed with two XOR gates and three NAND gates. Our design
has five levels of gates and the transistor count is 34. Since it takes approximately the
same number of transistors to build a regular FA with 6Δ's, the approach in [26] has
been adopted, which has only 5Δ of time delay. Among all the FAs in [26], this
particular implementation is the most practical one for VLSI implementation because
only NOR, NAND, and inverter gates have been used. Table 3.3.1 shows the
simulation results of the FA.
A B Ci-1 S Co
0 0 0 0 0
0 0 1 1 0
0 1 0 1 0
0 1 1 0 1
1 0 0 1 0
1 0 1 0 1
1 1 0 0 1
1 1 1 1 1
Table 3.3.1 Simulation results of the full adder (FA).
Similar to the case of addition, inva is added to detect an invalid number entering the
multiplier. Overflow is also detected, but the Boolean equation is different from the one
used for addition.
For an $n \times n$ multiplier, $(n^2 - n + 3)$ FAs and 4Δ's of time delay are required. Table
3.3.2 contains some of the selected simulation outputs of the multiplier.
A      B      P           oflow  inva
00110  00100  0000011000  0      0
01010  01001  0001011010  1      0
01001  11101  1111100101  0      0
01000  10010  1110010000  1      0
11011  00011  1111110001  0      0
10101  00111  1110110011  1      0
11110  10001  0000011110  0      0
11010  10010  0001010100  1      0
01011  10000  1101010000  0      1
10000  00011  1111010000  0      1
Table 3.3.2 Simulation results of Baugh-Wooley's two's complement multiplier.
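
The separation of positive and negative partial products can be illustrated numerically. The Python sketch below evaluates the four sums of the product equation directly (it does not model the adder array, the overflow logic, or the delay), and the function names are hypothetical. Only the product column of Table 3.3.2 is reproduced.

```python
def bits_lsb_first(word: str) -> list[int]:
    """Return the bits of a binary string, least significant bit first."""
    return [int(c) for c in reversed(word)]


def baugh_wooley_product(a: str, b: str) -> str:
    """Evaluate the two's complement product from its partial-product sums."""
    n = len(a)
    x, y = bits_lsb_first(a), bits_lsb_first(b)
    # Positive terms: sign-by-sign product and the magnitude-bit products.
    positive = x[n - 1] * y[n - 1] * 2 ** (2 * n - 2)
    positive += sum(
        x[i] * y[j] * 2 ** (i + j) for i in range(n - 1) for j in range(n - 1)
    )
    # Negative terms: products that involve exactly one sign bit.
    negative = sum(x[i] * y[n - 1] * 2 ** (n - 1 + i) for i in range(n - 1))
    negative += sum(x[n - 1] * y[i] * 2 ** (n - 1 + i) for i in range(n - 1))
    value = positive - negative
    # Encode the signed value as a 2n-bit two's complement word.
    return format(value % (1 << (2 * n)), f"0{2 * n}b")


if __name__ == "__main__":
    print(baugh_wooley_product("01001", "11101"))  # 9 * (-3)
```

In the hardware array the two negative sums are replaced by additions of complemented bits, as described above; the sketch keeps the subtraction so that the origin of each term stays visible.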
3.4 Design of a Two's Complement Parallel Array Divider and
its Simulation Results
Division is basically a repeated shift-and-subtract process and can be represented in
the following form [15]:
\[ R^{(j+1)} = 2R^{(j)} - q_{j+1}D \]
where $R^{(j+1)}$ is the partial remainder after the determination of the (j+1)th quotient digit,
$2R^{(j)}$ is the partial dividend before the determination of the (j+1)th quotient digit,
$q_{j+1}$ is the (j+1)th quotient digit,
and D is the divisor.
The division algorithms are divided into two main classes, namely restoring and
non-restoring algorithms [15]. For non-restoring division, the quotient can take on the
value {-1,1} which is out of the digit set range. Incorporating the extra member of the
set, -1, will increase the complexity of the algorithm and thus the hardware. Hence,
only the restoring division is considered for the design of the divider.
For restoring division, the quotient digit is selected by repeatedly subtracting the
divisor D from the current partial dividend $2R^{(j)}$ until the difference becomes negative.
The partial dividend is obtained by shifting the partial remainder, $R^{(j)}$, one bit to the left.
The number of subtractions performed before the remainder turns negative determines
the value of the quotient digit. One addition of the divisor to the negative difference is
required to restore the new partial remainder, $R^{(j+1)}$. Thus, restoring division requires
two operations: repeated subtractions and one restoring addition. The quotient digit is
selected under the situation where
\[ q_{j+1} = \begin{cases} 0, & \text{if } 2R^{(j)} < D \\ 1, & \text{if } 2R^{(j)} \ge D \end{cases} \]
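
The recurrence and the digit-selection rule can be traced with a few lines of Python. This is a minimal sketch for unsigned operands with 0 <= R < D; it does not model the two's complement handling or the radix-point shift of the array divider, and the function name is hypothetical.

```python
def restoring_divide(r: int, d: int, digits: int) -> tuple[list[int], int]:
    """Generate `digits` fractional quotient bits of r/d for 0 <= r < d.

    Returns (quotient bits, final partial remainder).
    """
    if not 0 <= r < d:
        raise ValueError("this sketch requires 0 <= r < d")
    quotient = []
    for _ in range(digits):
        partial_dividend = 2 * r              # shift the partial remainder left
        q = 1 if partial_dividend >= d else 0  # digit-selection rule
        r = partial_dividend - q * d           # next partial remainder
        quotient.append(q)
    return quotient, r


if __name__ == "__main__":
    print(restoring_divide(2, 13, 5))
```

After k digits the invariant R * 2^k = Q * D + (final remainder) holds, where Q is the integer formed by the quotient bits, which gives a simple check on any trace.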
Instead of performing trial subtractions and a restoring addition, Dean [10] proposes a
different approach to eliminate the addition, so the algorithm can be carried out with
successive shift-and-subtract. If the current trial subtraction is successful, the quotient
bit is a '1' and the difference is equal to the new partial remainder; otherwise, the
current partial remainder becomes the new partial remainder. This array uses only one
type of cell, called the controlled subtract (CS) cell. The CS cell is a subtractor
where C is the borrow-in and P is the borrow-out. The selection of the partial
remainder is controlled by the control signal D. The next partial remainder digit is S =
A-(B+C) when D=O and S=A when D=l. The logic equations for CS cells are
specified as follows:
\[ P = \bar{A}B + \bar{A}C + BC \]
\[ S = AD + ABC + A\bar{B}\bar{C} + \bar{A}B\bar{C}\bar{D} + \bar{A}\bar{B}C\bar{D} \]
The simulation results of the CS cell are shown in table 3.4.1.
A B C P S
0 0 0 0 0
0 0 1 1 1
0 1 0 1 1
0 1 1 1 0
1 0 0 0 1
1 0 1 0 0
1 1 0 0 0
1 1 1 1 1
Table 3.4.1 Simulation results of the CS cell.
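
The CS cell equations can be checked exhaustively against the behavioral description (S = A - (B + C) modulo 2 when D = 0, S = A when D = 1, and P as the borrow-out). The Python sketch below assumes the complemented literals as reconstructed from that description; both function names are hypothetical.

```python
from itertools import product


def cs_cell(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """Sum-of-products form of the controlled subtract cell: returns (P, S)."""
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    p = (na & b) | (na & c) | (b & c)
    s = (
        (a & d)
        | (a & b & c)
        | (a & nb & nc)
        | (na & b & nc & nd)
        | (na & nb & c & nd)
    )
    return p, s


def cs_reference(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """Behavioral reference: subtract with borrow, bypassed when d = 1."""
    diff = a - b - c
    borrow_out = int(diff < 0)
    s = a if d else diff & 1
    return borrow_out, s


if __name__ == "__main__":
    for a, b, c in product((0, 1), repeat=3):
        print(a, b, c, *cs_cell(a, b, c, 0))
```

With D = 0 the printed rows coincide with Table 3.4.1.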
Due to the necessity of preserving the uniform operand representation, only five bits of
the quotient can be obtained, which limits the accuracy. In order to
implement Dean's algorithm, the radix point of the dividend is shifted one bit to the
right to deal with situations where R > D because the algorithm can only be applied
when R ≤ D. Since only a reciprocal of a number is needed, shifting one bit to the right
does not affect the results of the type 1 PE. For an $n \times n$ non-restoring divider, $n^2$ CS's
and $(3n^2 - 1)$Δ of time delay are required. Some of the selected outputs of the divider
are shown in table 3.4.2:
R      D      Q
00010  00010  01000
00010  01101  00001
00010  11010  11110
00010  11101  11011
Table 3.4.2 Simulation results of the two's complement parallel divider.
4. Design of Built-In Self-Test Structures and their
Simulation Results
4.1 Design of a Shift Register and its Simulation Results
4.1.1 Review of the Level Sensitive Scan Design (LSSD)
As the size of VLSI circuits grows rapidly, the difficulty of testing them increases as
well. Recently, design efforts have focused on techniques to ensure that a device is
testable, and these schemes are called design for testability (DFT) [2]. In general, testing
of sequential logic is much more difficult than that of combinational logic, so a lot of
research effort has been spent on making sequential logic more controllable
and observable. Two terms are introduced in the literature to define the testability
measure, controllability and observability. Controllability is the ability to establish a
specific signal value at each node in a circuit by setting values on the circuit inputs,
while observability is the ability to determine the signal value at any node in a circuit by
controlling the circuit inputs and observing its outputs [2]. If all the latches can be
set to any specific values, and if they can be observed with a very
straightforward operation, then test generation for sequential circuits can be reduced
to the effort involved in dealing with a combinational logic network. This is the main
idea behind the structured DFT technique [49].
[Figure: Primary Inputs, Combinational Logic, Primary Outputs; SRLs S1 ... Sn chained from Scan-In to Scan-Out.]
Fig. 4.1.1 LSSD sequential circuit model [4].
Among all the structured design methods, the Level-Sensitive Scan Design (LSSD),
IBM's discipline for structural DFT, is considered here. For LSSD, there are four
design rules for the designers to follow [4]:
1. The latches are controlled by clock signals so that the data in the latches cannot be
changed when the clocks are off. This rule ensures the data isolation of the latches.
2. The latches are controlled by two or more nonoverlapping clocks to eliminate the
system dependency on minimum circuit delay.
3. Different nonoverlapping clocks are used by different latches in the same shift
register latch to ensure that data at the input to a latch will not change while the latch
clock is on.
4. All latches are contained in a shift register latch and all shift register latches are
interconnected into one or more shift registers.
The output of a latch follows the input exactly while the clock is active. On the other
hand, the output of a flip-flop (FF) takes on the value present at the data input during
the active transition of the clock, as for example in the case of edge-triggered FF's.
Subsequent changes of the data input have no effect until the next active transition of the
clock [30]. Hence, separation of the clocks in the LSSD implementation is necessary.
Otherwise, a race condition will occur which violates the motivation of the LSSD design
(to guarantee race-free testing). Test patterns are applied to the combinational circuits
through the scan-in primary input, and test responses are captured through the scan-out
primary output [4].
For the LSSD implementation, two different configurations are available, namely
single-latch and double-latch. Figure 4.1.2 is the double-latch design. Both L1 and L2
are used as system latches and the system output is taken from L2. A and B are shift
clocks for L1 and L2 respectively and C1 is the system clock. C1 and B are
nonoverlapping. The scan path is presented with bold lines. L1 and L2 together are
called a shift register latch (SRL). For the single-latch design, L1 is used as the system latch
whereas L2 is dedicated to shifting the test data into and out of the system.
[Figure: combinational network with primary inputs and primary outputs; signals X1 ... Xn and Y1 ... Yn; SRLs built from L1 and L2 latches; clocks C1, A, and B; Scan-In and Scan-Out.]
Fig. 4.1.2 Level Sensitive Scan Design (LSSD) double-latch design [9].
4.1.2 Design of a Shift Register Latch (SRL) Chain and its
Simulation Results
The operations in a systolic array are synchronized by a system clock, so registers
are required to latch the inputs and outputs of each cell. Instead of using regular D FFs
for data storage, SRLs are employed to enhance the controllability and observability in
our design. Considering the overhead involved with the two latches, the double-latch
design clearly requires less overhead [35]. Thus, we implement the double-latch design
in our systolic array. To comply with the LSSD design rules, two non-overlapping
clocks are used for the SRL. During the scan mode, all SRLs are connected to form a
serial chain and are clocked by the test clock, TCK. However, all SRLs become single
registers during normal operations and they are clocked by the system clock, CLK.
To select the proper data for a cell, an SRL requires an additional multiplexer to be
added before the double-latch. The multiplexer is enabled by three control signals,
namely mctrl (mode control), bist (BIST mode), and scan (scan mode). The mode
control indicates that either test or system functions are being performed. If mctrl
equals '1', then the latches become shift registers (SRs). Otherwise, they become
system latches. Table 4.1.1 clearly indicates the relationships between the three control
signals.
MCTRL  BIST  SCAN  Mode
0      x     x     Normal mode
1      0     0     Invalid
1      0     1     Scan mode
1      1     0     BIST mode
1      1     1     Invalid
Table 4.1.1 Three modes of the shift register latch (SRL).
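
The decoding of the three control signals in Table 4.1.1 can be written as a small function. This Python sketch only restates the table; the function name and the returned strings are hypothetical and do not correspond to signals in the design.

```python
def srl_mode(mctrl: int, bist: int, scan: int) -> str:
    """Decode the SRL operating mode from the three control signals."""
    if mctrl == 0:
        return "normal"      # bist and scan are don't-cares
    if (bist, scan) == (0, 1):
        return "scan"
    if (bist, scan) == (1, 0):
        return "bist"
    return "invalid"         # 1 0 0 and 1 1 1 are not allowed


if __name__ == "__main__":
    for bits in [(0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]:
        print(bits, srl_mode(*bits))
```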
The SRL chain is similar to the Built-In Logic Block Observer (BILBO) technique
[19]. In the normal mode, the SRLs act as latches. In the BIST mode, data is clocked
serially into the SRLs where the outputs of each stage are connected to the inputs of the
combinational circuit. In the scan mode, the test data is clocked serially from the scan-in
port to the scan-out port. The only difference between our design and the BILBO is
that a multiple-input signature register is built into the BILBO. In our case, however,
the signature analyzer is a separate entity which outputs the sequence serially into the
SRL chain. Figure 4.1.3 demonstrates the configuration of each mode listed above.
[Figure: chain of five SRLs with outputs Q1 to Q5, shown in three panels: BIST mode, Scan mode, and Normal mode.]
Fig. 4.1.3 Configurations of the SRL.
To illustrate the correctness of the design, simulations have been run. Table 4.1.2
shows various simulation outputs.
CYCLE DIN SIN MCTRL BIST SCAN SOUT DOUT QOUT
1 10101 1 0 0 0 1 10101 10101
5 10101 1 0 0 1 1 10101 10101
10 10101 1 0 1 0 1 10101 10101
15 10101 1 0 1 1 1 10101 10101
20 10101 1 1 0 0 1 10101 10101
21 10101 1 1 0 1 1 10101 10101
22 10101 1 1 0 1 0 01011 01011
23 10101 1 1 0 1 1 10111 10111
24 10101 1 1 0 1 0 01111 01111
25 10101 1 1 0 1 1 11111 11111
26 10101 1 1 1 0 1 11111 11111
27 10101 1 1 1 0 1 10101 10101
31 10101 1 1 1 1 1 10101 10101
Table 4.1.2 Simulation results of a SRL chain.
To shorten the list of simulation results, cycles in which nothing changes are skipped. Only
those cycles in which the outputs change are shown.
4.2 Design of a Pseudorandom Pattern Generator and its
Simulation Results
4.2.1 Review of Pseudorandom Sequences and Linear Feedback
Shift Registers
A linear circuit is a logic network constructed from unit delays, modulo-2 adders,
and modulo-2 scalar multipliers. The responses of a linear sequential machine, which is
constructed from a linear combination of input elements, obey the principle of
superposition: the response of a linear network to a linear combination of stimuli is the
linear combination of the responses of the network to the individual stimuli [4].
Suppose a shift register contains n stages. One way to keep the register active at all
times is to feed back certain stages into the first stage. Let a linear sequence $\{a_m\}$, with
feedback, be
\[ a_m = \sum_{i=1}^{n} c_i a_{m-i} \]
where the feedback coefficients, $c_i = 1$ or 0, depend only on the feedback. Such a
relationship is called a linear feedback recurrence [13]. Given a shift register sequence
$\{a_m\}$, a generating function G(x) can be written as:
\[ G(x) = \sum_{m=0}^{\infty} a_m x^m \]
Suppose the initial state of the shift register is $a_{-1}, a_{-2}, \ldots, a_{-n}$, and the feedback coefficients
are $c_1, c_2, \ldots, c_n$; then G(x) can be represented as
\[ G(x) = \frac{\sum_{i=1}^{n} c_i x^i \left(a_{-i}x^{-i} + \ldots + a_{-1}x^{-1}\right)}{1 - \sum_{i=1}^{n} c_i x^i} \]
With the initial conditions $a_{-1} = a_{-2} = \ldots = a_{1-n} = 0$, $a_{-n} = 1$, G(x) becomes
\[ G(x) = \frac{c_n}{1 - \sum_{i=1}^{n} c_i x^i} \]
The denominator of G(x) is referred to as the characteristic polynomial, f(x), of the
sequence $\{a_m\}$ and of the shift register which produced it.
\[ f(x) = 1 - \sum_{i=1}^{n} c_i x^i \]
A shift register with a linear feedback network is called a linear feedback shift
register (LFSR), which is a finite state machine. Each state in an LFSR is uniquely
determined from the previous state by the feedback connections.
[Figure: three-stage feedback shift register (SR, SR, SR) producing the output sequence 11001011100 ...; the successive register states are listed below.]
0 1 1 Initial State
0 0 1
1 0 0
0 1 0
1 0 1
1 1 0
1 1 1
0 1 1 Beginning of the next cycle
Fig. 4.2.1 Feedback shift register [2].
If an LFSR is initialized with a nonzero state, it cycles through at most $2^n - 1$ states. If
the sequence generated by an n-stage LFSR has period $2^n - 1$, it is called a maximum-length
sequence or m-sequence. The characteristic polynomial of a maximum-length
sequence is called a primitive polynomial, which is irreducible (i.e. unfactorable). A run
is a maximal block of consecutive 1's (or a single 1) or of consecutive 0's (or a single 0),
bounded by transitions on both sides. Its length is the number of
consecutive 1's or 0's it contains. Some characteristics of a maximum-length
LFSR sequence $\{a_n\}$ with a primitive polynomial f(x) of degree n are [4]:
1. Starting from a nonzero state, the LFSR that generates $\{a_n\}$ goes through all $2^n - 1$
states before repeating, that is, $a_{p+i} = a_i$ where p is the period.
2. The number of 1's in an m-sequence differs from the number of 0's by one.
3. In every period of an m-sequence, one-half the runs have length 1, one-fourth
have length 2, one-eighth have length 3, and so forth, as long as the fractions result
in an integral number of runs. The runs of 1's and 0's terminate with runs of length n
and n-1 respectively. Except for these terminating lengths, there are equally many
runs of 1's and 0's. The total number of runs of 1's equals the total number of runs
of 0's. The number of runs in a period is $2^{n-1}$.
4. The number of transitions between 1 and 0 that an m-sequence makes in one period is
$2^{n-1}$.
5. The sum of any sequence $A_i$ and a cyclic shift of itself is another cyclic shift of $A_i$.
4.2.2 Design of a Linear Feedback Shift Register
Since all numbers used in the systolic array have five bits, the degree of the
characteristic polynomial is chosen to be five as well. The primitive polynomial is
$f(x) = 1 + x^2 + x^5$. Figure 4.2.2 shows the implementation of f(x).
[Figure: five SRL stages, numbered 1 to 5, with linear feedback.]
Fig. 4.2.2 Implementation of the primitive polynomial f(x).
With the initial conditions $a_{-1} = a_{-2} = a_{-3} = a_{-4} = 0$ and $a_{-5} = 1$, table 4.2.1 lists the
simulation results of the above structure, where D and q are the inputs to the five
stages and the output from stage five respectively:
Cycle  D      q
1      00001  1
2      10000  0
3      01000  0
4      10100  0
5      01010  0
6      10101  1
7      11010  0
8      11101  1
9      01110  0
10     10111  1
11     11011  1
12     01101  1
13     00110  0
14     00011  1
15     10001  1
16     11000  0
17     11100  0
18     11110  0
19     11111  1
20     01111  1
21     00111  1
22     10011  1
23     11001  1
24     01100  0
25     10110  0
26     01011  1
27     00101  1
28     10010  0
29     01001  1
30     00100  0
31     00010  0
32     00001  1   Beginning of next cycle
Table 4.2.1 Simulation results of m-sequences of the LFSR.
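
The state sequence of Table 4.2.1 can be reproduced with a short software model. The Python sketch below assumes, as inferred from the tabulated states, that the register shifts from stage 1 toward stage 5 and that the input of stage 1 is the modulo-2 sum of stages 2 and 5; the function name and its parameters are hypothetical.

```python
def lfsr_states(
    seed: str = "00001", taps: tuple[int, ...] = (2, 5), count: int = 32
) -> list[str]:
    """Return `count` successive LFSR states, stage 1 written leftmost."""
    state = [int(c) for c in seed]   # state[0] is stage 1
    states = []
    for _ in range(count):
        states.append("".join(map(str, state)))
        feedback = 0
        for t in taps:               # modulo-2 sum of the tapped stages
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]  # shift toward stage 5
    return states


if __name__ == "__main__":
    for cycle, s in enumerate(lfsr_states(), start=1):
        print(cycle, s, s[-1])       # s[-1] is q, the output of stage five
```

The 32nd state equals the seed, and the 31 states before it are all distinct, as expected for a primitive polynomial of degree five.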
In order to confirm that the chosen polynomial yields a pseudorandom pattern generator
(PRPG), some properties of the sequence have been checked. No all-zero state exists in
the pseudorandom sequence because the successor of the all-zero state is again all-zero. Hence,
the m-sequence cycles through only thirty-one ($2^5 - 1$) states before it starts repeating. The
sequence goes through these states in a fixed order repeatedly, but it
does have some random properties. For example, in every period of the sequence,
eight runs have length 1, four runs have length 2, two runs have length 3, one run has
length 4, and another run has length 5. Hence, the total number of runs is equal to
sixteen. There are sixteen 1's and fifteen 0's in the sequence. The total number of runs
of 1's is equal to eight, which is the same as the total number of runs of 0's. Moreover,
the number of transitions between 1's and 0's is sixteen in one period.
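
These counts can be verified mechanically. The following Python sketch regenerates one period of the output of stage five (under the same assumed feedback from stages 2 and 5) and counts the runs cyclically; the helper names are hypothetical.

```python
from collections import Counter
from itertools import groupby


def m_sequence(n: int = 5, taps: tuple[int, ...] = (2, 5)) -> list[int]:
    """One period of the stage-n output, starting from state 0...01."""
    state = [0] * (n - 1) + [1]
    out = []
    for _ in range(2 ** n - 1):
        out.append(state[-1])
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]
    return out


def cyclic_runs(seq: list[int]) -> list[tuple[int, int]]:
    """Return (bit, length) for every run, treating the sequence as periodic."""
    # Rotate so that the sequence starts at a transition.
    start = next(i for i in range(len(seq)) if seq[i] != seq[i - 1])
    rotated = seq[start:] + seq[:start]
    return [(bit, len(list(group))) for bit, group in groupby(rotated)]


if __name__ == "__main__":
    seq = m_sequence()
    runs = cyclic_runs(seq)
    print("ones:", sum(seq), "zeros:", len(seq) - sum(seq))
    print("runs:", len(runs), Counter(length for _, length in runs))
```

Because the sequence is periodic, the number of transitions in one period equals the number of runs.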
4.2.3 Design of a Test Pattern Generator (TPG) and its Simulation
Results
An autonomous test pattern generator consists of two parts, a linear feedback shift
register (LFSR) and a shift register (SR).
[Figure: PRPG feeding an m-sequence into a chain of SRs whose outputs drive the CUT.]
Fig. 4.2.3 Test pattern generator.
The shift register latches used for system functions differ from the ones used for the
TPG. The differences between the regular shift register latches (SRLs) and the latches
used for the TPG are the 3-input multiplexer and an extra input from the pseudorandom
pattern generator (PRPG). Similar to the SRL in section 4.1, this latch allows three
operations. In normal operation, the SRLs are actually latches. In the BIST mode, data
from the PRPG are clocked serially into each SRL, where each output of the latch
connects to the next input of the latch. In the scan mode, data enter the SRL chain at the
scan-in port and leave at the scan-out port. Simplified simulation results are
listed in table 4.2.2.
CYCLE  DIN    SIN  MCTRL  BIST  SCAN  DOUT   LOUT   SOUT
1      00000  1    1      1     0     xxxxx  xxxx1  x
2      00000  1    1      1     0     xxxx1  xxx10  x
3      00000  1    1      1     0     xxx10  xx100  x
4      00000  1    1      1     0     xx100  x1000  x
5      00000  1    1      1     0     x1000  10000  x
6      00000  1    1      1     0     10000  00001  1
7      00000  1    1      1     0     00001  00010  0
8      00000  1    1      1     0     00010  00101  0
9      00000  1    1      1     0     00101  01010  0
10     00000  1    1      1     0     01010  10101  0
11     00000  1    1      1     0     10101  01011  1
12     00000  1    1      1     0     01011  10111  0
13     00000  1    1      1     0     10111  01110  1
14     00000  1    1      1     0     01110  11101  0
15     00000  1    1      1     0     11101  11011  1
16     00000  1    1      1     0     11011  10110  1
17     00000  1    1      1     0     10110  01100  1
18     00000  1    1      1     0     01100  11000  0
19     00000  1    1      1     0     11000  10001  1
20     00000  1    1      1     0     10001  00011  1
21     00000  1    1      1     0     00011  00111  0
22     00000  1    1      1     0     00111  01111  0
23     00000  1    1      1     0     01111  11111  0
24     00000  1    1      1     0     11111  11110  1
25     00000  1    1      1     0     11110  11100  1
26     00000  1    1      1     0     11100  11001  1
27     00000  1    1      1     0     11001  10011  1
28     00000  1    1      1     0     10011  00110  1
29     00000  1    1      1     0     00110  01101  0
30     00000  1    1      1     0     01101  11010  0
31     00000  1    1      1     0     11010  10100  1
32     00000  1    1      1     0     10100  01001  1
33     00000  1    1      1     0     01001  10010  0
34     00000  1    1      1     0     10010  00100  1
35     00000  1    1      1     0     00100  01000  0
36     00000  1    1      1     0     01000  10000  0
66     00000  0    1      0     1     00100  01000  0
67     00000  1    1      0     1     01000  10000  0
68     00000  0    1      0     1     10001  00011  1
69     00000  1    1      0     1     00010  00100  0
70     00000  0    1      0     1     00101  01011  0
71     00000  1    1      0     1     01010  10100  0
72     00000  0    1      0     1     10101  01011  1
73     00000  1    1      0     1     01010  10101  0
74     00000  0    1      0     1     10101  01011  1
75     00000  1    1      0     1     01010  10100  0
76     00000  0    1      0     1     10101  01011  1
77     00000  0    1      0     1     01010  10101  0
78     00000  0    1      0     1     10100  01000  1
79     00000  0    1      0     1     01000  10000  0
80     00000  0    1      0     1     10000  00000  1
81     00000  0    1      0     1     00000  00001  0
87     00001  0    0      0     1     00000  00000  0
88     00010  0    0      0     1     00001  00011  0
89     00011  0    0      0     1     00010  00101  0
90     00100  0    0      0     1     00011  00110  0
91     01010  0    0      1     0     00100  01001  0
92     01011  0    0      1     0     01010  10100  0
93     01100  0    0      1     0     01011  10110  0
94     01101  0    0      1     0     01100  11001  0
Table 4.2.2 Simulation results of the test pattern generator (TPG).
During the first five clock cycles, the SRLs are being loaded with the outputs of the
pseudorandom pattern generator (PRPG). After that, the TPG starts to output the
m-sequence through the scan port, and different states are present at the LOUT port for
testing. At clock cycle 36, another cycle of the m-sequence begins. At clock cycle 66,
the SR changes to scan mode. Between clock cycles 66 and 70, the data present in the
SRLs (00100) are shifted out. Data entered at the scan-in port do not begin to shift out
until clock cycle 70. Consequently, data are clocked serially from scan-in to
scan-out between clock cycles 71 and 81. The rest of the clock cycles represent parallel
loading of the latches during system operation.
4.2.4 Linear Dependencies of the m-Sequence Used as Test
Stimuli [5]
Let R(x) be an m-bit binary sequence represented in polynomial form,
\[ R(x) = r_{m-1}x^{m-1} + r_{m-2}x^{m-2} + \ldots + r_1 x + r_0 \]
Inputs to the circuit under test (CUT) are expressed as a polynomial, S(x). All the
stages in our case are inputs to the CUT, so
\[ S(x) = x^5 + x^4 + x^3 + x^2 + x \]
S'(x) is defined to be the sampling polynomial associated with the connections of the
CUT, and is obtained by reducing the S(x) to its irreducible form. S'(x) describes how
the CUT samples the m-sequence, p(x), generated by the LFSR.
A dependency polynomial is defined as
\[ h(x) \equiv 0 \pmod{p(x)} \]
that is, h(x) leaves a zero remainder under modulo-2 division by p(x). The set polynomial, H(x),
associated with the dependency polynomial, is the product of h(x) and all of its
subpolynomials. Therefore,
\[ \begin{aligned}
H(x) = {} & (x^4 + x^3 + x^2 + x + 1)(x^4 + x^3 + x^2 + x)(x^4 + x^3 + x + 1) \\
& (x^4 + x^2 + x + 1)(x^4 + x^3 + x^2)(x^4 + x^3 + x)(x^4 + x^3 + 1) \\
& (x^4 + x^2 + x)(x^4 + x^2 + 1)(x^4 + x + 1)(x^4 + x^3)(x^4 + x^2) \\
& (x^4 + x)(x^4 + 1)(x^3 + x^2)(x^3 + x)(x^3 + 1)(x^2 + x)(x^2 + 1)(x + 1) \\
= {} & (x^4 + x^3 + x^2 + x + 1)(x^3 + x^2 + x + 1)(x^4 + x^3 + x + 1) \\
& (x^4 + x^2 + x + 1)(x^2 + x + 1)(x^3 + x^2 + 1)(x^4 + x^3 + 1) \\
& (x^3 + x + 1)(x^4 + x^2 + 1)(x^4 + x + 1)(x + 1)^4 (x^2 + 1)^3 \\
& (x^3 + 1)^2 (x^4 + 1)\, x^{15}
\end{aligned} \]
The residue, $d_k$, of a polynomial can be computed by
\[ d_k = x^k \bmod p(x) \]
Add (in a modulo-2 way) the residues of the terms to determine whether a polynomial is
a dependency polynomial. If the sum is zero, the polynomial defines a linear
dependency. The analysis procedure includes the following steps:
1. Compute the residue table.
2. Determine the sampling polynomial.
3. Form the set polynomial associated with the sampling polynomial.
4. Test each factor of the set polynomial for linear dependency.
Bardell [5] has computed the residue table for the polynomial $p(x) = x^5 + x^2 + 1$,
which is listed below:
k   0  1  2  3  4   Power of x
0   1  0  0  0  0
1   0  1  0  0  0
2   0  0  1  0  0
3   0  0  0  1  0
4   0  0  0  0  1
Table 4.2.3 Residues of $p(x) = x^5 + x^2 + 1$.
Since there is only one 1 in each column of the residue table, no modulo-2 sum of
these residues can be zero. Therefore, we conclude that there does
not exist any linear dependency when a five-stage linear feedback shift register (LFSR)
feeds a five-input shift register (SR). Intuitively, if the LFSR is cycled through its
entire sequence, 31 different test patterns appear at the output of the shift register latches
(SRLs). There is, therefore, no linear dependency between patterns. As a result,
linear dependencies exist only when the number of inputs exceeds the length of the
LFSR.
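
The residue test can be sketched in Python by storing a polynomial over GF(2) as an integer whose bit i holds the coefficient of $x^i$. This representation and the function names are illustrative choices, not part of the procedure in [5]. The sketch searches for subsets of powers of x whose residues add to zero modulo 2.

```python
from itertools import combinations

P = 0b100101  # p(x) = x^5 + x^2 + 1; bit i holds the coefficient of x^i


def residue(k: int, poly: int = P) -> int:
    """Return x^k mod poly over GF(2)."""
    degree = poly.bit_length() - 1
    r = 1
    for _ in range(k):
        r <<= 1                     # multiply by x
        if (r >> degree) & 1:
            r ^= poly               # reduce modulo poly
    return r


def dependent_subsets(powers, poly: int = P) -> list[tuple[int, ...]]:
    """List every nonempty subset of powers whose residues sum to zero."""
    found = []
    powers = list(powers)
    for size in range(1, len(powers) + 1):
        for subset in combinations(powers, size):
            total = 0
            for k in subset:
                total ^= residue(k, poly)
            if total == 0:
                found.append(subset)
    return found


if __name__ == "__main__":
    print([format(residue(k), "05b") for k in range(5)])
    print(dependent_subsets(range(5)))   # five sampled stages
    print(dependent_subsets(range(6)))   # six sampled stages
```

With five sampled powers no dependency is found, whereas adding a sixth power exposes the dependency $x^5 + x^2 + 1 \equiv 0$, consistent with the statement that dependencies arise only when the number of inputs exceeds the length of the LFSR.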
4.3 Design of a Signature Analyzer and its Simulation Results
4.3.1 Review of Signature Analysis
Built-In Self-Test (BIST) requires a method for checking the output responses of
the circuit under test (CUT). The conventional bit-by-bit comparison of the actual
outputs with the computed correct values requires a significant amount of memory
storage for the fault-free responses associated with all possible test vectors. Instead, a
more viable method is to compare some statistics of the output responses rather than
using bit-by-bit comparison. For the compression approach, the information saved is a
compressed form of the observed outputs, called a signature. The process of reducing
an output response to a signature is called compression. There are five compression
techniques available, namely ones counting, transition counting, parity checking,
syndrome checking, and signature analysis. Signature analysis is the most popular
testing method because it is sensitive to the number of 1's in the data stream as well as
to their positions in the sequence of 1's and 0's [4]. Figure 4.3.1 explains how
signature analysis is performed.
[Figure: the input test sequence drives the CUT; the output response sequence enters the signature analyzer; a comparator checks the resulting signature against the fault-free signature and raises Error.]
Fig. 4.3.1 Testing using signature analyzer.
One of the most popular coding schemes for signature analysis is called cyclic
redundancy checking (CRC) [37] and it can be implemented with a linear feedback shift
register (LFSR). A cyclic code generator is shown in figure 4.3.2:
[Figure: input sequence entering a shift register with stages 1, 2, ..., m-1, m.]
Fig. 4.3.2 Cyclic code generator [4].
Let R(x) be a binary sequence expressed as a polynomial.
\[ \frac{P(x)}{G(x)} = Q(x) + \frac{R(x)}{G(x)} \]
In the CRC coding scheme, the remainder R(x) is used as a check word where P(x) is a
message word and G(x) is a divisor polynomial. The transmitted code consists of a
message followed by a check word during the encoding stage. When the received code
is decoded, the message word is divided by the divisor polynomial and errors are
represented as mismatches between the received check word and the newly calculated
remainder. Signature analysis makes use of the CRC encoding as the data compressor,
and of the remainder R(x) as the signature of the output response from the CUT.
The division process can be automated by using an LFSR. For example, suppose the
divisor polynomial is $G(x) = x^5 + x^3 + x + 1$. Figure 4.3.3 shows the implementation of
G(x).
Fig. 4.3.3 LFSR implementing division by $x^5 + x^3 + x + 1$ [4].
If the LFSR is initialized to zero and the polynomial $P(x) = x^7 + x^3 + x$ enters the
LFSR, high-order coefficient first, the content of the LFSR after the last message bit
has entered is the remainder of this division. The quotient bits are shifted out at the
right end of the shift register (SR). After all the bits of P have been shifted in, the
remainder left in the SR is $R(x) = 1 + x^2 + x^3$. Table 4.3.1 shows how the LFSR
performs the division.
Clock  Input  1  x  x^2  x^3  x^4
0      -      0  0  0    0    0    Initial State
1      1      1  0  0    0    0
2      0      0  1  0    0    0
3      0      0  0  1    0    0
4      0      0  0  0    1    0
5      1      1  0  0    0    1
6      0      1  0  0    1    0
7      1      1  1  0    0    1
8      0      1  0  1    1    0    Remainder, R(x)
Table 4.3.1 LFSR states during division of P(x).
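
The register states of Table 4.3.1 can be traced with a small software model of the divider. In this Python sketch the divisor is given by the exponents of its nonzero terms and the state is printed with the coefficient of $x^0$ leftmost; the function name and argument format are hypothetical.

```python
def lfsr_divide(message_bits: str, divisor_exponents: tuple[int, ...]) -> list[str]:
    """Return the LFSR contents after each clock, starting from all zeros.

    message_bits is applied high-order coefficient first. Each state string
    lists the coefficients of x^0, x^1, ... from left to right.
    """
    degree = max(divisor_exponents)
    state = [0] * degree
    history = ["".join(map(str, state))]
    for bit in message_bits:
        feedback = state[-1]                 # bit leaving the last stage
        shifted = [int(bit)] + state[:-1]    # shift toward higher powers
        for e in divisor_exponents:
            if e < degree:                   # the leading term is the feedback
                shifted[e] ^= feedback
        state = shifted
        history.append("".join(map(str, state)))
    return history


if __name__ == "__main__":
    # P(x) = x^7 + x^3 + x divided by G(x) = x^5 + x^3 + x + 1
    for clock, s in enumerate(lfsr_divide("10001010", (5, 3, 1, 0))):
        print(clock, s)
```

The final state 10110 corresponds to $R(x) = 1 + x^2 + x^3$, in agreement with the last row of the table.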
It is possible that a signature representing an erroneous sequence from a faulty
circuit after data compression is identical to a fault-free signature. This effect is called
fault masking, which means loss of information in the signature due to compression of
the output responses. The possible error patterns can be expressed as an error
polynomial, E(x). Each nonzero coefficient in E(x) represents an error in the
corresponding bit position of the output response.
\[ P'(x) = P(x) + E(x) \]
With the initial state of all zeros,
\[ P(x) = Q(x)G(x) + R(x) \]
When errors occur and the signature is unchanged,
\[ P'(x) = Q'(x)G(x) + R(x) \]
Adding the last two equations modulo 2 gives $E(x) = [Q(x) + Q'(x)]G(x)$.
Hence, fault masking occurs if E(x) is a multiple of G(x).
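
The masking condition can be illustrated with polynomial arithmetic over GF(2): an error polynomial that is a multiple of the divisor G(x) leaves the signature unchanged, while other errors alter it. In the Python sketch below, polynomials are stored as integers with bit i holding the coefficient of $x^i$; the helper names and the chosen error patterns are illustrative assumptions.

```python
def poly_mod(p: int, g: int) -> int:
    """Remainder of p(x) divided by g(x) over GF(2)."""
    while p.bit_length() >= g.bit_length():
        p ^= g << (p.bit_length() - g.bit_length())
    return p


def poly_mul(a: int, b: int) -> int:
    """Product of a(x) and b(x) over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


G = 0b101011      # G(x) = x^5 + x^3 + x + 1
P = 0b10001010    # P(x) = x^7 + x^3 + x

if __name__ == "__main__":
    good = poly_mod(P, G)
    masked_error = poly_mul(G, 0b11)   # E(x) = (x + 1) G(x), a multiple of G(x)
    single_error = 0b100               # E(x) = x^2, not a multiple of G(x)
    print(format(good, "05b"))
    print(poly_mod(P ^ masked_error, G) == good)   # error is masked
    print(poly_mod(P ^ single_error, G) == good)   # error is detected
```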
4.3.2 Design of a Signature Analyzer and its Simulation Results
To illustrate the design methodology of a signature analyzer, a divisor of degree
four is selected for implementation. It is well known that primitive polynomials have
less fault masking than non-primitive polynomials [50]. Thus, the primitive polynomial
of degree four for the divisor is
\[ G(x) = x^4 + x + 1 \]
Between the two implementations of a LFSR, a true polynomial divider is chosen
because it generates both the correct quotients and remainders. The other type of LFSR
will not be discussed here [4]. Figure 4.3.4 shows the implementation of G(x).
[Figure: four-stage LFSR with serial input; stages numbered 1 to 4.]
Fig. 4.3.4 LFSR implementation of polynomial division of $G(x) = x^4 + x + 1$.
Table 4.3.2 demonstrates that the above structure acts as a polynomial divider when
all the initial states are zeros. The input enters the linear feedback shift register (LFSR)
with the high-order bit first (the leftmost bit), where the rightmost bit represents the
coefficient of $x^0$, but the order of the powers of x in the remainder is reversed (the leftmost bit of
the remainder represents the coefficient of $x^0$).
Input Quotient Remainder
10000 1 1100
10001 1 0100
10010 1 1000
10011 1 0000
10100 1 1110
10101 1 0110
10110 1 1010
10111 1 0010
11000 1 1101
11001 1 0101
11010 1 1001
11011 1 0001
11100 1 1111
11101 1 0111
11110 1 1011
11111 1 0011
Table 4.3.2 Remainders of various dividends.
Suppose the polynomial is $P(x) = x^4$; then the remainder is $R(x) = x + 1$. This division is
shown in the first row of table 4.3.2.
4.3.3 Built-In Self-Checking with Zero Signature [28]
The comparison step of the test process can be eliminated by selecting a proper initial
state of the linear feedback shift register (LFSR) so that the final test signature is a
constant. If the final test signature is all-zeros, only an OR tree is necessary to check
whether the output is correct. The method to obtain the all-zeros signature is
summarized as follows:
Let Sf be the final desired test signature.
1. Initialize the shift register (SR) to 0's and compute the final fault-free test signature
S.
2. Compute S1 = S ⊕ Sf. (If Sf = 0's, then S1 = S.)
3. Seed the SR which implements the reciprocal polynomial with S1 and run it forward
autonomously for L (test length) cycles; the final state of the register is S2.
4. Initialize the SR with S2 and apply Z (the test sequence); the final signature will be
Sf.
A reciprocal polynomial, f*(x), of f(x) is defined as
\[ f^*(x) = x^m f(1/x) \]
where m is the degree of f(x). In our case, since $G(x) = x^4 + x + 1$, $G^*(x) =
x^4 + x^3 + 1$.
[Figure: four-stage LFSR with serial input implementing the reciprocal polynomial.]
Fig. 4.3.5 LFSR implementation of polynomial division of $G^*(x) = x^4 + x^3 + 1$.
Suppose that the LFSR for G(x) is initialized with $x^3$ (0001). Table 4.3.3 shows the
simulation results:
CYCLE  Q1 (G(x))  Q2 (G*(x))
1      0001       1000
2      1100       0100
3      0110       0010
4      0011       0001
5      1101       1001
6      1010       1101
7      0101       1111
8      1110       1110
9      0111       0111
10     1111       1010
11     1011       0101
12     1001       1011
13     1000       1000
14     0100       0110
15     0010       0011
16     0001       1000       Beginning of next cycle
17     1100       0100
18     0110       0010
Table 4.3.3 Implementing G(x) and G*(x).
The sequence generated by the reciprocal polynomial, G*(x), is the reverse of that generated by G(x).
The bit-reversed states of the LFSR implementing G*(x) occur in exactly the reverse order of
the states of the LFSR implementing G(x). If we reverse all the states of G*(x) listed above, we
obtain table 4.3.4:
CYCLE  Reversed states of G*(x)
1      0001
2      0010
3      0100
4      1000
5      1001
6      1011
7      1111
8      0111
9      1110
10     0101
11     1010
12     1101
13     0001
14     0110
15     1100
16     0001
Table 4.3.4 Reversed states of G*(x).
For illustration purposes, several input test patterns have been simulated to determine
the correctness of the algorithm. Table 4.3.5 shows the simulation results:
Input test patterns (Z)  S1    S2
10000                    1100  1001
10001                    0100  1110
10010                    1000  0111
10011                    0000  0000
10100                    1110  0110
10101                    0110  0001
10110                    1010  1000
10111                    0010  1111
Table 4.3.5 Intermediate signatures for obtaining the final all-zeros signature.
To further explain the method, suppose that the input test pattern is '10100', i.e. $P(x) =
x^4 + x^2$, so the content of the LFSR after five clock cycles is '1110', i.e. $R(x) = x^2 + x + 1$.
Then run the LFSR which corresponds to G*(x) for five cycles with the input equal to
zero all the time; at that point the resulting content of the LFSR is '0110', i.e. $R^*(x) =
x^2 + x$. The successive states of the LFSR for G*(x) are listed in table 4.3.6, where the leftmost bit
corresponds to the leftmost SRL:
CYCLE
o
1
2
3
4
5
Content of theLFSR
0111
1010
0101
1011
1100
0110
Table 4.3.6 LFSR states for searching the signature S2.
58
Finally, seed the LFSR corresponding to G(x) with '0110' and input the test pattern
'10100'; the final test signature is '0000'.
CYCLE   Input   Content of the LFSR
0               0110
1       1       1011
2       0       1001
3       1       0000
4       0       0000
5       0       0000
Table 4.3.7 LFSR states for the desired signature Sf.
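The four steps of the method can be sketched in software as follows. This is a minimal model under stated assumptions, not the thesis hardware: the tap masks, the input entering the leftmost stage, and the bit reversal applied when a value moves between the G(x) and G*(x) registers are all inferred from Tables 4.3.3 through 4.3.7, and every function name is hypothetical.

```python
G_TAPS = 0b1100       # assumed taps for G(x) = x^4 + x + 1
G_STAR_TAPS = 0b1001  # assumed taps for G*(x) = x^4 + x^3 + 1
INPUT_TAP = 0b1000    # assumption: serial input enters the leftmost stage


def step(state, taps, bit_in=0):
    """Clock a 4-stage LFSR once (bit 3 of the integer = leftmost SRL)."""
    out = state & 1
    nxt = state >> 1
    if out:
        nxt ^= taps
    if bit_in:
        nxt ^= INPUT_TAP
    return nxt


def signature(z, seed=0):
    """Shift the test sequence z (bit string, first bit first) into the G(x) register."""
    s = seed
    for ch in z:
        s = step(s, G_TAPS, int(ch))
    return s


def reverse4(s):
    return int(format(s, "04b")[::-1], 2)


def zero_signature_seed(z, s_f=0):
    """Return (S1, content of the G*(x) register after L cycles, seed for G(x))."""
    s1 = signature(z) ^ s_f        # steps 1 and 2
    r = reverse4(s1)               # load bit-reversed S1 into the G*(x) register
    for _ in range(len(z)):        # step 3: L cycles with zero input
        r = step(r, G_STAR_TAPS)
    return s1, r, reverse4(r)


for z in ["10000", "10001", "10100", "10111"]:
    s1, s2, seed = zero_signature_seed(z)
    final = signature(z, seed)     # step 4
    print(z, format(s1, "04b"), format(s2, "04b"), format(final, "04b"))
```

For the worked pattern '10100' the model gives S1 = 1110 and S2 = 0110, and reseeding G(x) ends in 0000, in agreement with Tables 4.3.5 through 4.3.7. The seed '0110' happens to read the same in both directions, which is why the reversal is not visible in that example.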
In order for a reconfiguration algorithm to work, a fault diagnosis method is
required. The zero signature approach provides a very effective and efficient method to
detect faults. Any 1 in the final signature indicates a fault.
4.4 Pseudorandom Testing of the Full Adder
4.4.1 Review of Pseudorandom Testing [48, 7]
To achieve a desired level of test confidence, the coverage of a given test length
should be evaluated and the number of random test patterns should be estimated before
testing. The patterns generated by the linear feedback shift register (LFSR) are
considered to be pseudorandom because the sequence can be shown to pass the statistical
tests required of truly random sequences [13].
Several methods proposed in the past assume the input sequences to be truly
random, and calculate the test coverage based on those assumptions. Two methods
have been employed to determine the detection probability of a particular fault, given an
assumed input probability distribution. The first method treats the output
probabilities as the probabilities of Boolean expressions being true. All the inputs are
assumed to be independent of each other [36]. The signal probabilities computed by this
method are exact, but the calculation becomes very complex when the Boolean equations
representing the combinational circuit are complicated. The second method is
called the cutting algorithm, which computes the output probabilities in terms of upper and
lower bounds. The complexity of this algorithm is linear (i.e., the results can be
computed in linear time), but the bounds can only provide a rough idea of how easily
the fault may be detected [42]. The resulting test length estimates are often much higher
than the number of all input combinations. Statistically, random testing uses the approach of
sampling with replacement (if there are n inputs to the circuit, there will be n^n possible
input patterns). All the input patterns are assumed equally likely. In general, the test
length L is bounded by the total number of possible inputs N (= 2^n, where n is the
number of inputs). However, no such limit exists for random testing. Hence, our
concern is only for pseudorandom testing.
The detectability of a fault is the number of different input vectors that cause a circuit
output error. The detectability profile is H, expressed as a vector [h1, h2, ... , hN],
where hk is the number of detectable faults in the circuit that have detectability k. M is
the total number of possible circuit faults of the circuit under test (CUT).
The expected fault coverage E[C_L] is the expected fraction of faults that can be detected
with a test length L. The probability that a fault of detectability k is first detected by
the t-th vector is
\[ p_t = \frac{\binom{N-t}{k-1}}{\binom{N}{k}} \]
where t ≥ 1 and N is the total number of all possible test patterns. The probability
assumes that the sequence creates all possible patterns, including the all-zeros pattern.
Thus, the LFSR should be modified to include the all-zeros pattern, and the expected
fault coverage is
\[ E[C_L] = 1 - \sum_{k=1}^{N-L} \frac{\binom{N-L}{k}}{\binom{N}{k}} \, \frac{h_k}{M} \]
The upper limit of the summation is N-L since faults with detectabilities N-L+1 through
N must have been detected in a test length L. The test confidence, C_L, is the probability
that a particular fault is detected in a test length L. The fault chosen for the test length
calculation, in this case, is either the worst case fault (fault with minimum detectability
in the CUT) or the upper bound fault (fault with detectability k=1), so
\[ C_L = \sum_{t=1}^{L} p_t = 1 - \frac{\binom{N-L}{k}}{\binom{N}{k}} \]
Test length calculations based on the worst case faults are rather pessimistic, so only the
effect of faults with larger detectabilities should be considered.
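The expected test length, E[L_i], for a particular fault F_i of detectability k is
\[ E[L_i] = \sum_{t=1}^{N} t \, p_t = \frac{N+1}{k+1} \]
and the expected test length over all M faults is
\[ E[L] = \frac{1}{M} \sum_{i=1}^{M} E[L_i] = \frac{N+1}{M} \sum_{k=1}^{N} \frac{h_k}{k} \]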
The expected test length does not provide information on how the test length affects
measures of test quality, whereas the expected fault coverage can do so.
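The single-fault quantities of this section can be evaluated exactly with integer arithmetic. The sketch below is illustrative and its function names are hypothetical. It assumes the first-detection probability takes the form C(N-t, k-1) / C(N, k), which is the difference of two successive values of the test confidence, and then checks numerically that these probabilities sum to C_L and that the expected test length of a single fault equals (N+1)/(k+1).

```python
from fractions import Fraction
from math import comb


def p_first_detect(t, k, n_patterns):
    """Probability that a fault of detectability k is first detected by vector t."""
    return Fraction(comb(n_patterns - t, k - 1), comb(n_patterns, k))


def confidence(length, k, n_patterns):
    """Test confidence C_L for one fault of detectability k."""
    return 1 - Fraction(comb(n_patterns - length, k), comb(n_patterns, k))


def expected_length(k, n_patterns):
    """Expected test length E[L_i] as the sum of t * p_t over t = 1..N."""
    return sum(t * p_first_detect(t, k, n_patterns)
               for t in range(1, n_patterns + 1))


N = 8  # number of input patterns of a three-input circuit
for k in (1, 2, 4):
    running = sum(p_first_detect(t, k, N) for t in range(1, 5))
    print(k, running == confidence(4, k, N), expected_length(k, N))
```

Using `Fraction` avoids rounding, so the identities hold exactly rather than to within floating-point error.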
4.4.2 Application of the Results on the Full Adder
In order to obtain the detectability profile, we run a fault simulation on all possible
faults of the full adder (FA). The fault model used here considers only single stuck-at
faults. Then a fault detection table is created for each fault [17]. The detectability of a
particular fault is obtained by counting the entries of 1's in the detection table. Hence,
the detectability profile is obtained: H = [h1, h2, h3, h4] = [3, 15, 0, 18], M = 36, and
N = 8. In addition, there are four hard faults (faults which are not detected by any
combination of inputs), whose detectabilities k are equal to zero. When X8 and
X9 are stuck-at 1 or 0, the outputs of the sum and carry bits will not change. Please
refer to the appendix 18 for the schematic. The simulation outputs are too long to be
shown here.
The expected fault coverages, computed from
\[ E[C_L] = 1 - \sum_{k=1}^{N-L} \frac{\binom{N-L}{k}}{\binom{N}{k}} \, \frac{h_k}{M} \]
are listed in Table 4.4.1.
L       E[C_L]
1       0.69673
2       0.76964
4       0.99286
5-8     1.00000
Table 4.4.1 Expected fault coverage of the FA.
[Plot: expected fault coverage (0 to 1) versus test length L (0 to 8).]
Fig. 4.4.1 Expected fault coverage of the FA.
If we assume a 95% test confidence, then solving the test confidence equation for L gives a
test length of 4.
The expected test length is E[L] = 3.75, using the equation
\[ E[L] = \frac{N+1}{M} \sum_{k=1}^{N} \frac{h_k}{k} \]
If we round the number off to the nearest integer, the expected test length will be 4,
which agrees with the test length calculation.
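Both figures quoted above can be reproduced from the detectability profile. The sketch below is a hedged illustration with hypothetical names. The text does not say which detectability was used for the 95% calculation, so k = 4 (the largest detectability in the profile) is an assumption made here, and the expected test length sum is evaluated exactly as the equation is printed in this section, with h_k divided by k.

```python
from fractions import Fraction
from math import comb

N, M = 8, 36
H = {1: 3, 2: 15, 3: 0, 4: 18}  # detectability profile quoted in the text


def confidence(length, k, n_patterns=N):
    """Test confidence C_L for one fault of detectability k."""
    return 1 - Fraction(comb(n_patterns - length, k), comb(n_patterns, k))


def min_length(target, k):
    """Smallest test length whose confidence reaches the target."""
    for length in range(1, N + 1):
        if confidence(length, k) >= target:
            return length
    return None


K_ASSUMED = 4  # assumption: the fault class used for the 95% calculation
length_95 = min_length(Fraction(95, 100), K_ASSUMED)

# Sum taken as printed in the section (h_k / k).
expected_length = Fraction(N + 1, M) * sum(Fraction(h, k) for k, h in H.items())

print(length_95, float(expected_length))  # 4 3.75
```

With the upper bound fault (k = 1) the same search returns 8, that is, the exhaustive test, which illustrates why worst case faults give pessimistic test lengths.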
4.5 Aliasing in the Linear Feedback Signature Register
4.5.1 Review of Aliasing Concept [16]
Fault masking or aliasing in signature analysis is measured in terms of the aliasing
probability, the probability that a given fault produces a fault-free signature. Under
steady-state conditions (for a long test length), assuming that all the error sequences are
equally likely, the aliasing probability is equal to 2^-k, where k is the length of the LFSR
[44]. However, it is more complicated to calculate the aliasing probability in the
non-steady-state condition. Let the binary error sequence and the feedback sequence
be E(l) and F(l) respectively, and let M(l) denote their modulo-2 sum sequence in the
linear feedback shift register (LFSR):
\[ M(l) = E(l) \oplus F(l) \]
Each element M_i of the sequence M(l) can be expressed as a sum of the independent
error bits E_j:
\[ M_i = \sum_{j=1}^{l} c_{ij} E_j \]
The above expression can be represented with an l x l matrix C, where M = CE.
For l ≥ k, C_kl is the k x l submatrix of C obtained by taking the last k rows of C. Let W
be a k x k submatrix of C_kl. There are a total of l-k+1 possible windows, and only ⌊l/k⌋
windows are non-overlapping, where the result of l/k is rounded down to an integer.
The k rows of a window W correspond to k consecutive states of the
LFSR. Assume Z_l is the upper bound of the probability that the register is in the all-zeros
state at time l.
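\[
\begin{aligned}
P_{al} &= P[\text{all-zeros state} \rightarrow \text{all-zeros state}] - P[\text{the register never leaves the all-zeros state}] \\
       &\le Z_l - (1-p)^l \\
       &\le \left[ \frac{1 + |1-2p|^{\lfloor l/k \rfloor}}{2} \right]^{k} - (1-p)^l
\end{aligned}
\]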
4.5.2 Application of Ivanov's Results to Our Signature Analyzer
The primitive polynomial for our signature analyzer is f(x) = x^4 + x + 1. If the test
length is equal to five,
\[
C = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 1
\end{bmatrix},
\qquad
C_{kl} = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 1
\end{bmatrix}
\]
According to Ivanov's [16] analysis, there are two different windows for this particular
signature analyzer for the test length of five bits. Only one non-overlapping window
exists for this particular matrix C_kl, which is the matrix itself.
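The window counts quoted here follow directly from the expressions of Section 4.5.1 with l = 5 and k = 4; the arithmetic is written out below as a small worked check:

```latex
l - k + 1 = 5 - 4 + 1 = 2 ,
\qquad
\left\lfloor \frac{l}{k} \right\rfloor = \left\lfloor \frac{5}{4} \right\rfloor = 1 .
```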
Using Ivanov's bound for primitive polynomials, the greater the degree of the
polynomial, the smaller the aliasing probability. When the sequence lengths are about
200, all the aliasing probabilities reach steady-state conditions. In other words, fault
masking decreases as the test length increases. Figure 4.5.1 illustrates the
above points as k increases from four to thirty-two.
[Plots: aliasing probability (Pal) versus sequence length (l), panels (a) k = 4 through (d) k = 32.]
Fig. 4.5.1 Bounds on aliasing probabilities for different degrees of polynomials.
There are different peaks for the aliasing probabilities, where the highest peak occurs
with the shortest shift register. In this case, the test length is assumed to be
very long (l = 32000). Incidentally, the aliasing probability function levels off at a
detection probability of 0.0004, as can be seen in the following graph of aliasing
probability for shift register lengths ranging from four to thirty-two.
[Plot: aliasing probability versus detection probability (p), one curve per shift register length from four to thirty-two.]
Fig. 4.5.2 Bounds on aliasing probabilities with different degrees of polynomials.
4.6 Designs and Simulations of the Two Systolic Cells
With the implementation of shift register latches (SRLs) as shift registers between
computational units, type1 and type2 cells are constructed. For a type1 cell, see
Fig. 4.6.1.
Fig. 4.6.1 Block diagram of a type1 cell.
For a type2 cell, see Fig. 4.6.2.
Fig. 4.6.2 Block diagram of a type2 cell.
In the normal mode, the cells calculate the values according to the assigned operations.
In the scan mode, however, the contents of the shift register (SR) nearest to the scan-out
port are shifted out first, then those of the second closest, and so forth. The input
sequence entering the scan-in port is finally shifted out after the contents of the SRs in
the scan path have been displaced.
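The scan-mode ordering can be illustrated with a toy model of a scan path. This is a sketch only: the four-register chain, its contents, and the function name are hypothetical, and the list is ordered so that the last element is the register nearest to the scan-out port.

```python
from collections import deque


def scan_shift(chain, scan_in_bits):
    """Shift bits through a scan path; chain[-1] is nearest to the scan-out port."""
    regs = deque(chain)
    shifted_out = []
    for bit in scan_in_bits:
        shifted_out.append(regs.pop())  # nearest register leaves first
        regs.appendleft(bit)            # new bit enters at the scan-in port
    return shifted_out, list(regs)


captured = [1, 0, 1, 1]  # hypothetical register contents, scan-in side first
pattern = [0, 0, 1, 0]   # hypothetical sequence applied at the scan-in port

out1, regs = scan_shift(captured, pattern)
out2, _ = scan_shift(regs, [0, 0, 0, 0])

print(out1)  # captured contents, nearest register first
print(out2)  # the scan-in sequence, emerging once the old contents are displaced
```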
The systolic array is a three-dimensional structure where four pseudorandom pattern
generators (PRPGs) and four signature analyzers (SAs) are connected to all the inputs
on the z-axis and all the outputs on the x-axis respectively. Finally, by applying the
built-in self-checking technique on the SAs, one can determine if faults have occurred.
Since none of the signals on the x-axis are used in any other axis, this dimension is the
most suitable for testing, and the resulting signal from the OR tree may control the
reconfiguration of the array. Four PRPGs with different initial states guarantee that
inputs on the z-axis at any given time will not be the same. This method should
increase the detection probability at all times. Further research is required on how to
initialize the LFSR for optimizing the process of test pattern generation. Because four
fault-free signatures are needed in one clock cycle, more fault simulations may need to be
done in a short period of time. In order to avoid another level of
pipelining in the SRs, the LFSRs may be replaced with multiple-input signature
registers (MISRs) to perform signature analysis where parallel output responses enter
the SA at the same moment.
5. Reconfiguration Algorithms of a Two-Dimensional Systolic Array
To achieve the desired performances of a fault-tolerant array, the following
properties should be taken into consideration when a reconfiguration algorithm is being
generated [11]:
1. Minimum area and performance overhead.
2. High utilization of the surviving processing elements (PEs) or high array yield.
3. Ease of testability and reconfiguration in the presence of faults.
Reconfiguration of an array is required to restore its functions when faults are
detected. The main purpose of this process is to transform a physical architecture with
faults into a working target architecture without faults. The physical array is the array
which may contain faults caused by manufacturing defects or by operational
failures, and the logical array represents the desired array structure specified by the
application [24].
Basically, reconfiguration algorithms can be classified into two classes: arrays with a
redundant number of PEs, or redundancies inside the PEs. Abraham et al. [1] suggest
the use of time redundancy for concurrent error detection and correction. Although no
extra PEs and interconnection switches are required to implement the algorithm, it
suffers from the possibility of 100% degradation in the throughput. Extra time is also
required to redesign the original algorithm into a specific reconfiguration algorithm. In
addition, loss of information often occurs during error encoding and decoding. For
example, an extra row and column are needed for calculating the checksums of the rows
and columns in a matrix-matrix multiplication. Figure 5.1 shows a 4x4 matrix
multiplication with the checksum row and column included in dotted boxes.
Fig. 5.1 Checksum matrix multiplication on a mesh array.
An element in the extra row or column is the sum of the elements of the corresponding
column or row, respectively. If there is an error in the matrix entry (i, j), it will be
identified by checking the equality of the sum of the row elements with the checksum for
row i, and of the sum of the column elements with the checksum for column j. Certainly, this
class of algorithm is only suitable for arrays with very complex PEs. Failure of either
error detection or correction in any PE will cause the array to become inoperable.
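The row and column check described above can be sketched on a small example. The code is an illustration in software rather than the array implementation: it uses 2x2 operands instead of the 4x4 case of Fig. 5.1, the helper names are hypothetical, and a single erroneous entry is assumed.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_checksum_row(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_checksum_col(b):
    """Append to each row the sum of its elements."""
    return [row + [sum(row)] for row in b]


def locate_error(c):
    """Return (i, j) of a single bad entry in a full-checksum matrix, or None."""
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m) if sum(c[i][j] for i in range(n)) != c[n][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(with_checksum_row(a), with_checksum_col(b))
clean = locate_error(c)          # None: every checksum agrees

c[1][0] += 5                     # inject an error into entry (1, 0)
where = locate_error(c)          # (1, 0)
i, j = where
c[i][j] += c[i][-1] - sum(c[i][:-1])   # correct the entry from the row checksum
print(clean, where, c[i][j])
```

The product of a column-checksum matrix and a row-checksum matrix carries both checksums, which is what lets the faulty entry be located by intersecting the failing row and the failing column.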
The second approach is based on reconfigurable interconnections and spare PEs.
One of the methods is called interstitial redundancy [45]. Spare processors at interstitial
sites in the array are connected to their four nearest neighbors. With a regular placement
of spare PEs and planar interconnection, a desired target array can be realized. The
spare assignment is obtained by applying the matching of a bipartite graph where the
nodes on the two sides represent operational spares and failed primary PEs of the array.
An edge connects a failed primary PE to a spare PE if and only if the spare can replace
the failed PE. Then, a matching that covers the set of failed primary processors is the
desired assignment of spares to the failed processors.
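One standard way to find such a matching is the augmenting-path method, sketched below. The thesis does not state which matching algorithm is used, so this choice is an assumption made for illustration, and the PE and spare labels are hypothetical.

```python
def assign_spares(can_replace):
    """can_replace maps each failed primary PE to the spares able to replace it.

    Returns {failed PE: spare} covering every failed PE, or None if no such
    matching exists.
    """
    owner = {}  # spare -> failed PE currently assigned to it

    def try_assign(pe, seen):
        for spare in can_replace[pe]:
            if spare in seen:
                continue
            seen.add(spare)
            # Take a free spare, or move its current owner to another spare.
            if spare not in owner or try_assign(owner[spare], seen):
                owner[spare] = pe
                return True
        return False

    for pe in can_replace:
        if not try_assign(pe, set()):
            return None
    return {pe: spare for spare, pe in owner.items()}


# Hypothetical fault pattern: P1 can use S1 or S2, P2 can only use S1.
feasible = assign_spares({"P1": ["S1", "S2"], "P2": ["S1"]})
infeasible = assign_spares({"P1": ["S1"], "P2": ["S1"]})
print(feasible, infeasible)
```

In the first pattern P1 initially takes S1 and is then moved to S2 so that P2 can be covered; in the second pattern two failed PEs compete for one spare and no covering matching exists.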
Fig. 5.2 Bipartite graph for finding a matching between faulty and fault-free PEs.
Interstitial redundancy has the advantages of short fixed interconnections, high
utilization of fault-free PEs, and low area overhead. However, this approach is not
suitable for systolic array implementation without modification. If the spare replaces
one of its four neighboring primary PEs, the timing of the systolic array has to be
readjusted to allow the proper data flow. With the fixed location and interconnection of
the spares, only a fixed level of redundancy can be achieved. Figure 5.3 shows an array
with 50% redundancy.
Fig. 5.3 Interstitial redundancy with 50% spare processors.
Hence, it is not very flexible for reconfiguration when variable fault distributions occur.
The earliest reconfiguration algorithm using programmable interconnection switches
is the CHiP architecture where different modules of a computer are fabricated as cells
[46]. The modules are embedded to form a desired architecture that performs the
intended applications. This method enjoys the flexibility and versatility of connecting
the modules in the array. However, a lot of time is required to evaluate the PEs and to
find the graph embedding for the final target architecture. This algorithm is not practical
for the array structure. If the dimensions of the array are large, it takes a long time to
compute the graph algorithm to embed the fault-free PEs. In addition, different fault
distributions may lead to different embeddings, so no fixed fatal criteria can be defined
before the host computer evaluates the graph. Four different reconfiguration techniques
using interconnection switches are discussed below to determine if they are appropriate
for our application purposes.
The first approach is called the 'fault stealing' algorithm [41, 34]. The physical and
logical positions of PEs are represented as physical and logical indices. If a fault is
detected, reconfiguration in this algorithm consists of two steps: assigning a direction to
each fault, and then mapping logical indices onto physical indices for all working cells.
For example, consider an array with a spare column in which spare cells are stolen from
adjacent rows when two faults are present in a row. Let i be the row in which there are
two faulty cells (i, k) and (i, h) where k < h, for instance, i=3, k=2, and h=4. The
leftmost fault in the row invokes reconfiguration along the same row. Additional faults
transfer their logical indices to cells in the same column. All working PEs (i, j) with j ≥
k are associated with logical indices (i', j') = (i, j-1), except for j=h. On the other
hand, PE (i, h) steals the position of PE (i-1, h), with which we associate logical
indices (i', j') = (i, h). The stolen PE behaves as if it were the leftmost fault in row i-1,
and thus it reconfigures to its right. Figure 5.4 illustrates the above fault stealing
example.
Fig. 5.4 Basic principles of the fault stealing algorithm.
The second approach is called divide-and-conquer [25]. Assuming that there is an
N-cell wafer, this approach consists of two stages: recursively partition the array
vertically and horizontally until each subproblem reaches Θ(lg N), where Θ represents the
average number of operations, and then connect the disconnected subarrays together.
With probability 1 - O(1/N), a two-dimensional array can be constructed from all the
live cells on an N-cell wafer using wires of length O(lg N lg lg N) and channels of
width O(lg lg N), where O represents the worst case of operations.
[Diagram: 8x8 grid of faulty and live cells with horizontal and vertical partitioning lines.]
Fig. 5.5 Partitioning of an 8x8 array by the divide-and-conquer method.
The above diagram illustrates how to partition an 8x8 array. For an 8x8
array, N=64, the worst channel width and wire length become 1.58 and 4.75
respectively. Realistically, it needs 2 tracks between the PEs for the channel width and
5 time steps for the worst delay. For such a small array, the channel width is already
too high for implementation. As the size of the array increases, the channel width
and wire length grow as well. Since the algorithm was originally designed
for yield enhancement only, Leighton et al. [25] do not consider the case when
dynamic reconfiguration is required. Although the algorithm is theoretically optimal, it
is impractical for actual implementation to serve our purposes of both yield enhancement
and fault tolerance.
The third reconfiguration algorithm is called a single-track model [24]. If faults
occur in nonspare PEs, the physical indices of an array will change to accommodate the
change in routing. If the physical indices of PEs are not equal to their logical indices,
the path which is created to reconfigure the array is called the compensation path. A
near-miss occurs when two compensation paths in neighboring rows or columns
overlap in opposite directions. Given an (n+2) x (m+2) physical array, it is
reconfigurable into an nxm logical array using one-track routing if
1. there exists a set of continuous and straight compensation paths covering all the
faulty PEs, and
2. there is neither intersection nor near-miss among the compensation paths.
Fig. 5.6 Compensation path: (a) a near-miss situation; (b) a non near-miss situation.
To specify the placement of the compensation paths, the algorithm reformulates the
reconfigurability problem as a maximum independent set problem in graph theory. The
maximum independent set problem is NP-complete and the best known algorithms take
exponential time. Given an undirected graph G, an independent set is a set of vertices
of G such that no two vertices of the set are adjacent. A maximal independent set is an
independent set that is not contained in any other independent set. If Q is the family of
maximal independent sets, then the number max |S| (S ∈ Q) is called the independence number
of the graph and the set S* from which it is derived is called the maximum independent
set. If the independence number of the contradiction graph is F, there exists a valid
routing scheme to support an embedded nxm logical array. Given a fault pattern, a
contradiction graph is an undirected graph G(V, E) where V denotes the set of all
vertices, which are the members of the compensation paths, and E denotes the set of edges.
An edge exists between u, v ∈ V if and only if u and v cannot coexist, that is, if
1. both u and v are for the same faulty PE, or
2. the coexistence of both u and v violates the non-intersecting and non near-miss
conditions.
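For a small contradiction graph the independence number can be found by exhaustive search, which also makes the exponential cost visible. The sketch below is illustrative only: the vertex labels and the conflict edges are hypothetical, and it assumes that F denotes the number of faulty PEs, which this section does not spell out.

```python
from itertools import combinations


def max_independent_set(vertices, edges):
    """Largest set of vertices with no edge between any two of them (brute force)."""
    adjacent = {frozenset(e) for e in edges}
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(frozenset(pair) not in adjacent
                   for pair in combinations(subset, 2)):
                return list(subset)
    return []


# Hypothetical fault pattern: two faulty PEs, each with two candidate paths.
vertices = ["f1:E", "f1:W", "f2:E", "f2:N"]
edges = [
    ("f1:E", "f1:W"),  # same faulty PE
    ("f2:E", "f2:N"),  # same faulty PE
    ("f1:E", "f2:E"),  # assumed intersection or near-miss
]

chosen = max_independent_set(vertices, edges)
num_faults = 2
print(chosen, len(chosen) == num_faults)
```

Because paths of the same faulty PE are always joined by an edge, an independent set can hold at most one path per fault, so reaching the number of faults means every fault has received a compatible compensation path.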
Figure 5.7 shows the reconfiguration layout and how different connection states can be
achieved.
[Diagram: array of PEs and spare PEs with single-track switches, showing switch states A through D.]
Fig. 5.7 The physical array based on single-track switches.
The algorithm guarantees that the maximum length of links (both horizontal and vertical) is
at most 2W_PE + 3W_T, where W_PE and W_T are the widths of a PE and a track, respectively,
if links start and end at the boundary of PEs.
Roychowdhury et al. [40] have solved the reconfigurability problem with a greedy
algorithm that has polynomial complexity. They called their method an augmented
model, where the compensation paths are not necessarily straight and can have bends.
Even though it is practical in theory, their method is rather heuristic.
In fact, no formal method is available for this single-track switch model.
The fourth method is an application-based approach, and is called the easily testable and
reconfigurable (ETAR) array [18]. The advantages of this algorithm over the
previously mentioned ones are that multiple PEs can be tested simultaneously to reduce
the test time significantly, and the throughput remains the same with the introduction of
delay registers. The array is a mux-based switching network and all the control of the
muxes remains local inside the PEs, to reduce the propagation delay which may be
caused by broadcasting the control signals across the array.
This algorithm is a row-oriented scheme. Columns are organized through rows
with local connections only, taking only one column element from each and every row
except the bypassed rows. Given an (N + SR) x (N + SC) physical array, the aim is
to reconfigure the array into a fault-free N x N logical array, where SR and SC are the
number of spare rows and spare columns respectively. A row which contains more
than SC faulty PEs is declared a must-be-bypassed row. All PEs in the must-be-bypassed
rows and faulty PEs found during the testing procedure must be bypassed.
The algorithm tries to connect PE[i, j-1] to PE[i+1, j-1] or PE[i+1, j] or PE[i+1, j+1],
in that order. For example, the physical 4 x 4 array for matrix-matrix multiplication is
shown in figure 5.8.
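The must-be-bypassed rule can be expressed directly from a fault map. The sketch below covers only that first step and a simple row-count check, not the column connection procedure of [18]; the fault map and the function names are hypothetical, and the row-count test is a necessary condition rather than a guarantee that reconfiguration succeeds.

```python
def must_be_bypassed_rows(faulty, spare_cols):
    """Rows with more than SC faulty PEs; faulty[i][j] is True for a faulty PE."""
    return [i for i, row in enumerate(faulty) if sum(row) > spare_cols]


def has_enough_rows(faulty, n, spare_cols):
    """Necessary condition: at least N rows remain after bypassing."""
    usable = len(faulty) - len(must_be_bypassed_rows(faulty, spare_cols))
    return usable >= n


# Hypothetical 4 x 4 physical array for a 3 x 3 logical array (SR = SC = 1).
faulty = [
    [False, False, False, False],
    [True,  False, True,  False],   # two faults exceed SC = 1
    [False, False, False, True],
    [False, False, False, False],
]

bypassed = must_be_bypassed_rows(faulty, spare_cols=1)
feasible = has_enough_rows(faulty, n=3, spare_cols=1)
print(bypassed, feasible)
```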
[Diagram: reconfigured array at T = 1 with the b operands entering in skewed order; faulty PEs are marked X and PEs with a delay register are marked D.]
Fig. 5.8 The reconfiguration scheme of ETAR.
Only the PEs to the right of the faulty PE in a row of a two-dimensional array are
delayed by one time step due to the use of bypass registers inside the PEs. To retime the
vertical data path, delays are also introduced to the computational unit of the PEs. This is
required to maintain the proper systolic pulse [22]. The maximum wire length between
two vertically connected PEs is two units, but the wire length between two horizontally
connected and logically adjacent PEs is not increased at all. Hence, the worst increase
of wire links is 50%. If only spare columns are available, this algorithm is O(N^2),
where the number of spare columns is much smaller than the dimension of the array
(SC << N). In addition, this algorithm does not require evaluation of the graph before a
reconfiguration can be done. In conclusion, it seems to be the most appropriate choice
for our purposes, namely both yield enhancement and reliability improvement.
6. Summary and Further Research
We have addressed the design and analysis of fault-tolerant processing elements for
computing the matrix BA^-1. We first reviewed various techniques for transforming
a matrix algorithm into a systolic algorithm, and then introduced the systolic structure
which was designed by Moreno and Lang. We then designed the five-bit two's
complement computational units and verified the correctness of their operations by logic
simulations. The adder was a carry look-ahead implementation, the multiplier was a
direct implementation, and the divider used the restoring algorithm.
After the computational units were designed and simulated, built-in self-test
techniques were examined. Since there has been a great demand for fault-tolerant
arrays, design-for-testability techniques were investigated. Pseudorandom testing of the
PEs was reviewed. We designed and simulated a level-sensitive shift register latch
(SRL) according to the LSSD design rules. A linear feedback shift register (LFSR) was
used to generate pseudorandom sequences for test pattern generation. The properties of
an LFSR were studied, and then we implemented the primitive polynomial of degree five.
No linear dependency was found in test pattern generation because no two test patterns
in a five-stage LFSR are the same when the degree of the primitive polynomial
is five as well. The outputs were evaluated by signature analysis. For the signature
analyzer, an LFSR implementation of a primitive polynomial of degree four was
designed. Another LFSR implementing a reciprocal polynomial of degree four was
also designed and studied to demonstrate that a final constant signature was useful for
fault diagnosis and possible reconfiguration of the array. Bounds on the aliasing probability
of the signature analyzer were also examined by performing eight experiments on
Ivanov's results. The more stages in the LFSR, the smaller the aliasing
probability. With the assumption of a very large test length, fewer stages of the LFSR
produced a larger aliasing probability. Because our implementation of the
full adder (FA) differs, test qualities involving expected test coverage and test length were
also analyzed according to McCluskey's results. Finally, various reconfiguration algorithms
were reviewed and the possibility of implementing them in our array was examined in
detail as well.
Further research in this area may be divided into the following fields:
1. Different approaches to when the pseudorandom sequences enter the array and how
we arrange the test schedule.
2. More research effort should be concentrated on more accurate bounds of aliasing
probability of the signature analyzer.
3. Various reconfiguration algorithms for mesh arrays should be studied with respect to
the shapes and the applications of the systolic array.
List of References
[1] J. A. Abraham, P. Banerjee, C. Chen, W. K. Fuchs, S. Kuo, and A. L. N.
Reddy, "Fault Tolerance Techniques for Systolic Arrays," IEEE Computer,
vol. 20, no. 7, pp. 65-75, July 1987.
[2] M. Abramovici, M. A. Breuer, and A. D. Friedman, Digital Systems Testing
and Testable Design. Computer Science Press, New York, 1990.
[3] A. H. Anderson, J. I. Raffel, and P. W. Wyatt, "Wafer-Scale Integration
Using Restructurable VLSI," IEEE Computer, vol. 25, no. 4, pp. 41-47,
Apr. 1992.
[4] P. H. Bardell, W. H. McAnney, and J. Savir, Built-In Test for VLSI:
Pseudorandom Techniques. John Wiley & Sons, New York, 1987.
[5] P. H. Bardell, "Calculating the Effects of Linear Dependencies in m-Sequence
Used as Test Stimuli," IEEE Trans. CAD, vol. 11, no. 1, pp. 83-86, Jan.
1992.
[6] C. R. Baugh and B. A. Wooley, "A Two's Complement Parallel Array
Multiplication Algorithm," IEEE Trans. Comput., vol. C-22, no. 12, pp.
1045-1047, Dec. 1973.
[7] C. K. Chin and E. J. McCluskey, "Test Length for Pseudorandom Testing,"
IEEE Trans. Comput., vol. C-36, no. 2, pp. 252-256, Feb. 1987.
[8] P. Comon and Y. Robert, "A Systolic Array for Computing BA^-1," IEEE
Trans. Acoust., Speech, Signal Processing, vol. ASSP-35, no. 6, pp. 717-
723, June 1987.
[9] S. DasGupta, P. Goel, B. G. Walther, and T. W. Williams, "A Variation of
LSSD and Its Implications on Design and Test Pattern Generation in VLSI,"
Proc. 1982 International Test Conference, pp. 63-66, 1982.
[10] K. J. Dean, "Binary Division Using a Data-Dependent Iterative Array,"
Electronics Letters, vol. 4, no. 14, pp. 283-284, July 1968.
[11] P. Franzon, "Interconnect Strategies for Fault Tolerant 2D VLSI Array,"
Proc. IEEE Int'l Conf. Computer Design: VLSI in Computers, pp. 230-233,
Oct. 1986.
[12] W. K. Fuchs and E. S. Swartzlander, "Wafer-Scale Integration: Architectures
and Algorithms," IEEE Computer, vol. 25, no. 4, pp. 6-8, Apr. 1992.
[13] S. W. Golomb, Shift Register Sequences. Aegean Park Press, CA, 1982.
[14] G. H. Golub and C. F. Van Loan, Matrix Computations. 2nd ed., The Johns
Hopkins University Press, Baltimore, 1988.
[15] K. Hwang, Computer Arithmetic, Principles, Architecture, and Design.
John Wiley & Sons, New York, 1979.
[16] A. Ivanov and V. Agarwal, "An Analysis of the Probabilistic Behavior of
Linear Feedback Signature Registers," IEEE Trans. CAD, vol. 8, no. 10, pp.
1074-1088, Oct. 1989.
[17] B. W. Johnson, Design and Analysis of Fault Tolerant Digital Systems.
Addison-Wesley, Reading, MA, 1989.
[18] J. H. Kim and S. M. Reddy, "On the Design of Fault-Tolerant Two-
Dimensional Systolic Arrays for Yield Enhancement," IEEE Trans. Comput.,
vol. C-38, no. 4, pp. 515-525, Apr. 1989.
[19] B. Konemann, J. Mucha, and G. Zwiehoff, "Built-In Logic Block
Observation Techniques," Proc. 1979 International Test Conference, pp. 37-
41, 1979.
[20] I. Koren and D. K. Pradhan, "Yield and Performance Enhancement Through
Redundancy in VLSI and WSI Multiprocessor Systems," Proc. IEEE, vol.
74, no. 5, pp. 699-711, May 1986.
[21] H. T. Kung, "Why Systolic Architectures?," IEEE Computer, vol. 15, no. 1,
pp. 37-46, Jan. 1982.
[22] H. T. Kung and M. S. Lam, "Wafer-Scale Integration and Two-Level
Pipelined Implementations of Systolic Arrays," Journal of Parallel and
Distributed Computing, vol. 1, no. 1, pp. 32-63, Aug. 1984.
[23] S. Y. Kung, VLSI Array Processors. Prentice-Hall, Englewood Cliffs, NJ,
1988.
[24] S. Y. Kung, S. Jean, and C. Chang, "Fault-Tolerant Array Processors Using
Single-Track Switches," IEEE Trans. Comput., vol. C-38, no. 4, pp. 501-
514, Apr. 1989.
[25] T. Leighton and C. E. Leiserson, "Wafer-Scale Integration of Systolic
Arrays," IEEE Trans. Comput., vol. C-34, no. 5, pp. 448-461, May 1985.
[26] T. Liu, K. R. Hohulin, L. Shiau, and S. Muroga, "Optimal One-Bit Full
Adders with Different Types of Gates," IEEE Trans. Comput., vol. C-23, no.
1, pp. 63-70, Jan. 1974.
[27] W. Maly, "Prospects for WSI: A Manufacturing Perspective," IEEE
Computer, vol. 25, no. 4, pp. 58-65, Apr. 1992.
[28] W. H. McAnney and J. Savir, "Built-In Checking of the Correct Self-Test
Signature," IEEE Trans. Comput., vol. 37, no. 9, pp. 1142-1145, Sept.
1988.
[29] E. J. McCluskey, "Built-In Self-Test Techniques," IEEE Design and Test,
vol. 2, no. 2, pp. 21-28, Apr. 1985.
[30] E. J. McCluskey, "Built-In Self-Test Structures," IEEE Design and Test, vol.
2, no. 2, pp. 29-36, Apr. 1985.
[31] W. R. Moore, "A Review of Fault-Tolerant Techniques for the Enhancement
of Integrated Circuit Yield," Proc. IEEE, vol. 74, no. 5, pp. 684-698, May
1986.
[32] J. H. Moreno and J. J. Vidal, "Matrix Computations on Mesh Arrays," Tech.
Rep. CSD-890046, Dept. of Comput. Sci., UCLA, Aug. 1989.
[33] J. H. Moreno and T. Lang, "Comments on 'A Systolic Array for Computing
BA^-1'," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-37,
no. 11, pp. 1786-1789, Nov. 1989.
[34] R. Negrini, M. G. Sami, and R. Stefanelli, Fault-Tolerance Through
Reconfiguration of VLSI and WSI Arrays. MIT Press, Cambridge, MA,
1989.
[35] M. J. Ohletz, T. W. Williams, and J. P. Mucha, "Overhead in Scan and Self-
Testing Designs," Proc. 1987 International Test Conference, pp. 460-470,
1987.
[36] K. P. Parker and E. J. McCluskey, "Probabilistic Treatments of General
Combinational Networks," IEEE Trans. Comput., vol. C-24, no. 6, pp. 668-
670, Jun. 1975.
[37] W. W. Peterson and E. J. Weldon, Error Correcting Codes. 2nd ed., MIT
Press, Cambridge, MA, 1972.
[38] S. K. Rao and T. Kailath, "Regular Iterative Algorithms and their
Implementation on Processor Arrays," Proc. IEEE, vol. 76, no. 3, pp. 259-
269, Mar. 1988.
[39] A. L. Rosenberg, "The Diogenes Approach to Testable Fault-Tolerant Arrays
of Processors," IEEE Trans. Comput., vol. C-32, no. 10, pp. 902-910, Oct.
1983.
[40] V. P. Roychowdhury, J. Bruck, and T. Kailath, "Efficient Algorithms for
Reconfiguration in VLSI/WSI Arrays," IEEE Trans. Comput., vol. C-39, no.
4, pp. 480-489, Apr. 1990.
[41] M. Sami and R. Stefanelli, "Reconfigurable Architectures for VLSI
Processing Arrays," Proc. IEEE, vol. 74, no. 5, pp. 712-722, May 1986.
[42] J. Savir, G. S. Ditlow, and P. H. Bardell, "Random Pattern Testability,"
IEEE Trans. Comput., vol. C-33, no. 1, pp. 79-90, Jan. 1984.
[43] N. R. Saxena, P. Franco, and E. J. McCluskey, "Refined Bounds on
Signature Analysis Aliasing for Random Testing," Proc. 1991 International
Test Conference, pp. 818-827, 1991.
[44] J. E. Smith, "Measures of the Effectiveness of Fault Signature Analysis,"
IEEE Trans. Comput., vol. C-29, no. 6, pp. 510-514, June 1980.
[45] A. D. Singh, "An Area Efficient Redundancy Scheme for Wafer Scale
Processor Arrays," Proc. IEEE Int'l Conf. Computer Design: VLSI in
Computers, pp. 505-509, Oct. 1985.
[46] L. Snyder, "Introduction to the Configurable, Highly Parallel Computer,"
IEEE Computer, vol. 15, no. 1, pp. 47-56, Jan. 1982.
[47] S. K. Tewksbury, Wafer-Level Integrated Systems: Implementation Issues.
Kluwer Academic Publishers, Boston, 1989.
[48] K. D. Wagner, C. K. Chin, and E. J. McCluskey, "Pseudorandom Testing,"
IEEE Trans. Comput., vol. C-36, no. 3, pp. 332-343, Mar. 1987.
[49] T. W. Williams and K. P. Parker, "Design for Testability - A Survey," Proc.
IEEE, vol. 71, no. 1, pp. 98-112, Jan. 1983.
[50] T. W. Williams, W. Daehn, M. Gruetzner, and C. W. Starke, "Aliasing
Errors in Signature Analysis Registers," IEEE Design and Test, pp. 39-45,
Apr. 1987.
Appendix - Detailed Schematic Diagrams of Processing
Elements
List of Figures
Page
1. A two's complementer A1
2. A block carry lookahead unit A2
3. A two's complement adder A3
4. A full adder (FA) cell A4
5. A two's complement multiplier - sheet1 A5
6. A two's complement multiplier - sheet2 A6
7. A controlled subtract (CS) cell A7
8. A two's complement divider - sheet1 A8
9. A two's complement divider - sheet2 A9
10. A shift register latch (SRL) A10
11. A pseudorandom pattern generator (PRPG) A11
12. A SRL for connecting the PRPG A12
13. A test pattern generator (TPG) A13
14. A signature analyzer (SA) implementing a primitive polynomial A14
15. A signature analyzer (SA) implementing a reciprocal polynomial A15
16. A type1 processing element (PE) A16
17. A type2 PE A17
18. A schematic for computing the detection probability A18
[Figures 1-18: schematic diagrams, pages A1-A18 (images not reproduced).]

