Design and implementation of a systolic array to solve the Algebraic Path Problem with the specific instance of the transistive and reflexive closure of a binary relation by McCall, David Gene




Design and implementation of a systolic array to
solve the Algebraic Path Problem with the specific
instance of the transistive and reflexive closure of a
binary relation
David Gene McCall
Recommended Citation
McCall, David Gene, "Design and implementation of a systolic array to solve the Algebraic Path Problem with the specific instance of
the transistive and reflexive closure of a binary relation" (1990). Thesis. Rochester Institute of Technology. Accessed from
Approved by: 
Design and Implementation
of a Systolic Array
to Solve the Algebraic Path Problem
with the Specific Instance
of the Transistive and Reflexive
Closure of a Binary Relation
by
David Gene McCall




Requirements for the Degree of






Prof. ____________________________________
<Department Head> 
DEPARTMENT OF COMPUTER ENGINEERING 
COLLEGE OF ENGINEERING 
ROCHESTER INSTITUTE OF TECHNOLOGY 
ROCHESTER, NEW YORK 
MAY 1990 
TITLE OF THESIS: The Design and Implementation of 
Systolic Array to Solve the Algebraic Path 
Problem with Specific Instance to the 
Transistive and Reflexive Closure of a 
Binary Relation. 
I, David Gene McCall, hereby grant permission to the 
Wallace Memorial Library of RIT to reproduce my thesis 
in whole or in part. Any reproduction will not be for 




List of Tables
List of Figures
1.0 Abstract
2.0 Introduction
3.0 Formal Description of Algebraic Path Problem
4.0 Corresponding Systolic Array
5.0 Systolic Array Function
6.0 Mapping of Algorithm onto a Systolic Array
6.1 Description of Lines
7.0 Design and Implementation of Systolic Array
7.1 Logic Equations
7.1.1 Circle Processor
7.1.2 Double Square Processor
7.1.3 Square Processor
7.2 Circuit Design
7.2.1 Circle Processor
7.2.2 Double Square Processor
7.2.3 Square Processor
8.0 Processor Testing
8.1 Circle Processor
8.2 Double Square Processor
8.3 Square Processor
9.0 Array Design and Testing
9.1 2x2 Array Testing
9.2 3x3 Array Testing
9.3 4x4 Array Testing
10.0 Conversion of Design to MOSIS CMOS3 Standard Cells
11.0 Final Simulation and Testing
11.1 Fanout Delay
11.2 Back Annotated and Wiring Delay
11.3 Minimum Clock Period
11.4 Data Timing Requirements
12.0 GDS2_OUTPUT
13.0 Future Endeavors
13.1 Fabrication and Testing






1 Comparisons of Systolic Arrays for Solving APP
2 Circle Processor Operation
3 Double Square Processor Operation (R=0)
4 Double Square Processor Operation (R=1)
5 Propagation Delays for Processing Cells
6 Processor Propagation Delays




1 Applications for the Algebraic Path Problem
2 Systolic Array for APP (n=4)
3 Circle Processor
4 Square Processor
5 Double Square Processor
6 Input to First Row of the Array
7 Output of First Row of the Array
8 Algorithm to Solve APP
9 The Mapping of Line #3 onto the Systolic Array
10 The Mapping of Line #5 onto the Systolic Array
11 The Mapping of Line #9 onto the Systolic Array
12 The Mapping of Line #10 onto the Systolic Array
13 Simulation Output for Double Square Processor (R=0)
14 Simulation Output for Circle Processor
15 Simulation Output for Double Square Processor (R=1)
16 Simulation Output for Square Processor
17 Test 1 for 2x2 Array
18 Test 2 for 2x2 Array
19 Test 3 for 2x2 Array
20 Simulation Output for Test 1 on 2x2 Array
21 Simulation Output for Test 2 on 2x2 Array
22 Simulation Output for Test 3 on 2x2 Array
23 2x2 Systolic Array
24 Test 1 for 3x3 Array
25 Test 2 for 3x3 Array
26 4x4 Systolic Array
27 Test 1 for 4x4 Array
28 Test 2 for 4x4 Array
29 Test 3 for 4x4 Array
30 Simulation Output for Test 1 on 4x4 Array
31 Simulation Output for Test 2 on 4x4 Array
32 Simulation Output for Test 3 on 4x4 Array
33 Steps to Convert to MOSIS3 Standard Cells and Produce Fabrication File
34 Circle Processor Using MOSIS3 Standard Cells
35 Square Processor Using MOSIS3 Standard Cells
36 Double Square Processor Using MOSIS3 Standard Cells
37 4x4 Systolic Array Using MOSIS3 Standard Cells
38 Clock Skew after the Fanout Delay is Added (1-0)
39 Clock Skew after the Fanout Delay is Added (0-1)
40 Clock Skew after Back Annotation and Wiring Delays are Added (0-1)
41 Clock Skew after Back Annotation and Wiring Delays are Added (1-0)
42 Simulation for Modified Test 1 on 4x4 Array with Fanout Delays Added
43 Simulation for Modified Test 1 on 4x4 Array with Back Annotation and Wiring Delays Added
44 Simulation for Test 3 on 4x4 Array with Back Annotation and Wiring Delays Added
45 Simulation for Test 2 on 4x4 Array with Back Annotation and Wiring Delays Added
46 Simulation for Modified Test 1 on 4x4 Array at Maximum Frequency (8.3 MHz)
47 Simulation for Test 3 on 4x4 Array at Maximum Frequency (8.3 MHz)
48 Simulation for Test 2 on 4x4 Array at Maximum Frequency (8.3 MHz)
49 Clock Wave Form after Skewing




The Algebraic Path Problem (APP) has many practical instances
to be solved. The general solution by Robert and Trystram
(1986) will be discussed along with the mapping and operation
of the algorithm on a systolic array. The specific instance of
the APP, the transitive and reflexive closure of a binary
relation, will be implemented with a discussion of the different
stages ranging from the logic equations to a method of the




2.0 Introduction:
A systolic array has been defined by Will Moore as "a regular
array of processing elements all doing the same calculation and
passing results on to their nearest neighbors every cycle."
(Robert 1986) Although this definition is not strictly adhered
to by many systolic architectures, it is still the basic
underlying theme.
The systolic array that will be presented here is a regular
array of three different processing elements performing similar
functions. The interconnections are only to the closest
horizontal or vertical neighbors. It will solve the general
Algebraic Path Problem (APP).
The discussion will begin with the theory of the APP followed
by the presentation of Yves Robert and Denis Trystram's (1986)
algorithm for its solution. Their mapping of the algorithm and
its operation as implemented on a systolic array will be
explained and clarified.
A specific instance of the APP, the transitive and reflexive
closure of a binary relation, will be selected and implemented.
The complete stages will be stepped through from circuit
design and testing to the integrated circuit layout of MOSIS
CMOS3 (MOSIS3) standard cells. At the end of the discussion a
fabrication file will be completed allowing the chip to be




3.0 Formal Description of Algebraic Path Problem:
The Algebraic Path Problem (APP) deals with the problem of
finding the distances between all vertices of a weighted graph
over a specified semiring. To begin the discussion several
definitions will be given along with some corresponding
examples to clarify the concepts. The first definition is that
of a semigroup, the building block of a semiring.
Definition 1:
A pair $(G, \odot)$ is a semigroup if and only if:
a) $G$ is a (nonvoid) set.
b) $\odot : G \times G \to G$.
c) $\odot$ is associative.
d) there exists a unique neutral element $(e) \in G$
such that $g \odot (e) = (e) \odot g = g$ for every $g \in G$.
e) the group is commutative iff:
$g_1 \odot g_2 = g_2 \odot g_1$ for all $g_1, g_2 \in G$.
(Hartnett 1963, p. 160)
The semigroup is equivalent to a monoid, which has a variation
on the notation, that is $(G, \odot) \equiv \langle G, \odot, (e) \rangle$
(Kuich 1986, p. 5). The definition of a semiring follows.






A triple $(H, \oplus, \otimes)$ is a semiring if and only if:
a) $(H, \oplus)$ is a commutative semigroup with neutral
element $(0)$.
b) $(H, \otimes)$ is a semigroup with neutral element $(1)$.
c) for all $h_1, h_2, h_3 \in H$,
$h_1 \otimes (h_2 \oplus h_3) = (h_1 \otimes h_2) \oplus (h_1 \otimes h_3)$
$(h_1 \oplus h_2) \otimes h_3 = (h_1 \otimes h_3) \oplus (h_2 \otimes h_3)$
d) for every $h \in H$,
$(0) \otimes h = h \otimes (0) = (0)$
(Kuich 1986, pp. 5-6)
For those who understand the concept of a ring, notice that a
semiring would be a ring if the additional property of an
"inverse" for $\oplus$ were added (Hartnett 1963, p. 173). Below are
two examples of semirings.
Example 1:
The set of real numbers having ordinary addition with the
natural neutral element 0 and multiplication with neutral
element 1, $H_1 = (\mathbb{R}, +, \times)$, is a semiring. This can be shown by
first proving that $(\mathbb{R}, +)$ is a commutative semigroup.




$\in \mathbb{R}$, $r_1 \oplus 0 = r_1 + 0 = r_1$. Likewise it can be shown that
the real numbers are commutative and associative. The same
holds true for $(\mathbb{R}, \times)$ and it is known that real numbers are
distributive and zero absorptive, i.e. part (d) of the
definition. (Rote 1985, p. 194)
Example 2:
$H_2 = (\mathbb{R} \cup \{-\infty, +\infty\}, \min, +)$ is a semiring with zero $(0) = +\infty$
and unity $(1) = 0$, if $(+\infty) + r = (+\infty)$, $(-\infty) + r = (-\infty)$, for all
$r \in \mathbb{R}$, and $(-\infty) + (+\infty) = (+\infty)$ are defined. Here it is not as
obvious that $H_2$ is a semiring. However, the min operation as
normally defined is commutative, $\min(r_1, r_2) = \min(r_2, r_1)$, and
likewise associative. Since $(-\infty)$ and $(+\infty)$ are included with
the above definitions, + is also commutative and associative.
Parts (c) and (d) of the semiring definition also hold.
If $r_1, r_2, r_3 \in (\mathbb{R} \cup \{-\infty, +\infty\})$
(c) $r_1 + \min(r_2, r_3) = r_1 + r_2$ or $r_1 + r_3$, depending on which
of $r_2$ and $r_3$ is smaller
$= \min((r_1 + r_2), (r_1 + r_3))$
(d) $(+\infty) + r_1 = r_1 + (+\infty)$




(Rote 1985, p. 194)
The definition of the APP as given by Robert and Trystram
(1986) follows, which includes the definition of a weighted
graph.
Given a weighted graph $G = (V, E, w)$ where $V$ is a finite vertex
set, $E$ an arc set, and $w: E \to H$ a function with weights from a
semiring $(H, \oplus, \otimes)$ with zero $(0)$ and unity $(1)$, find for
all pairs of vertices $(i, j)$ the quantities
$$d_{ij} = \bigoplus_{p \in M_{ij}} w(p),$$
where $M_{ij}$ denotes the set of all paths from $i$ to $j$.
Associated with the weighted graph described above there is a
weight matrix $A = (a_{ij})$, where $a_{ij} = w(i, j)$ if $(i, j) \in E$ and
$a_{ij} = (0)$ otherwise. The set of all paths $M_{ij}$ is modified to
$M_{ij}^{(k)}$, the set of all paths from $i$ to $j$ which contain only
vertices $x$ with $1 \le x \le k$ as intermediate vertices. This will
also modify the weight matrix calculations to
$$a_{ij}^{(k)} = \bigoplus_{p \in M_{ij}^{(k)}} w(p)$$




4.0 Corresponding Systolic Array:
A general algorithm was developed by Robert and Trystram (1986)
that solves the APP for any particular instance by using only
the semiring operations $\oplus$, $\otimes$, and $*$, where
$$c^* := \bigoplus_{i \ge 0} c^i = (1) \oplus c \oplus (c \otimes c) \oplus (c \otimes c \otimes c) \oplus \cdots$$
The above definition is the generalized infinite geometric
series for $1/(1-c)$, giving the semiring an "inverse". The
algorithm is equivalent to Rote's (1985) algorithm with an
improvement in time (Robert 1986, pp. 173, 179). The algorithm
is given below:






for i <- 1 to n, i ≠ k
  $a_{ik}^{(k)} \leftarrow a_{ik}^{(k-1)} \otimes a_{kk}^{(k)}$
for j <- 1 to n, j ≠ k
begin
  for i <- 1 to n, i ≠ k
    $a_{ij}^{(k)} \leftarrow a_{ij}^{(k-1)} \oplus a_{ik}^{(k)} \otimes a_{kj}^{(k-1)}$






The algorithm can solve a variety of applications depending
upon the definitions of $\oplus$, $\otimes$, and $*$ in the semiring. A
chart of several applications is given in figure 1. Four
additional applications were briefly described in Robert and
Trystram (1986) and three are expanded upon here:
1. The determination of the inverse of a real matrix A can be
performed by defining $\oplus$ and $\otimes$ in the usual manner in $\mathbb{R}$,
and we have seen that $(\mathbb{R}, +, \times)$ is indeed a semiring. Since
the definition of $*$ is a sum of increasing powers, it can be
defined, if $c \ne 1$, to be $c^* = 1/(1-c)$. This definition may
seem odd since the series for $1/(1-c)$ is not convergent for
abs(c) > 1, but the solution in this case would not exist, which
is a perfectly adequate answer. The algorithm actually computes
$(I-A)^{-1}$ in this case, but a simple modification can be made
to permit $A^{-1}$ to be computed directly (Rote 1985).
2. The shortest distance in a weighted graph is defined as
follows: the $a_{ij}$ are the weights taken in
$H = \mathbb{R} \cup \{-\infty, +\infty\}$, $\oplus$ is the minimum operation with zero
$(0) = +\infty$, $\otimes$ is addition in $\mathbb{R}$ extended to $H$ (with
$-\infty \otimes +\infty = +\infty$) with $(1) = 0$, and $*$ is defined by: if $c \ge 0$ then
$c^* = 0$ else $c^* = -\infty$. Here the semiring is not as obvious
however, in example 2 above this












Application          Set H          (+)    (x)    (0)    (1)
connectivity         {0,1}          max    min    0      1
number of arcs       N ∪ {∞}        min    +      ∞      0
shortest path        R ∪ {∞}        min    +      ∞      0
counting of paths    N              +      x      0      1
Figure 1. Applications of the APP
(Gondran 1984)
3. The reflexive and transitive closure of a binary relation
can easily be computed with the algorithm. The weight
matrix's components $a_{ij}$ are defined to be boolean, $\oplus$ and
$\otimes$ are respectively the OR and AND operations, and $*$ is
defined by $c^* = $ true for all $c$ (see the definition of $c^*$).
The weight matrix is the relation matrix for the binary
relation. Clearly from the concepts of boolean algebra this
is a semiring with the zero $(0) = 0$ and unity $(1) = 1$.
Application 3 described above will be the one designed and
implemented on a single chip for the CMOS3 process. A practical
use for the reflexive and transitive closure is determining if
there is a path between any two vertices within a graph. The
graph could correspond to cities, nodes in a circuit, etc.
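As a software illustration of how one algorithm serves several instances, the sequential algorithm can be sketched with the semiring passed in as a parameter. This is a minimal sketch, not the thesis program: the names `Semiring`, `solve_app`, `BOOLEAN`, and `MIN_PLUS` are hypothetical, indices are 0-based, and the update order follows the loop structure of Robert and Trystram's algorithm as described in this document.

```python
from dataclasses import dataclass
from math import inf
from typing import Any, Callable, List


@dataclass(frozen=True)
class Semiring:
    """Hypothetical container for the three operations the algorithm uses."""
    add: Callable[[Any, Any], Any]   # the (+) operation
    mul: Callable[[Any, Any], Any]   # the (x) operation
    star: Callable[[Any], Any]       # the * (closure) operation


def solve_app(weights: List[List[Any]], s: Semiring) -> List[List[Any]]:
    """Sequential sketch of the APP algorithm on an n x n weight matrix."""
    n = len(weights)
    a = [row[:] for row in weights]          # work on a copy
    for k in range(n):
        a[k][k] = s.star(a[k][k])            # diagonal element of phase k
        for i in range(n):
            if i != k:
                a[i][k] = s.mul(a[i][k], a[k][k])
        for j in range(n):
            if j != k:
                for i in range(n):
                    if i != k:
                        a[i][j] = s.add(a[i][j], s.mul(a[i][k], a[k][j]))
                # Row k is updated last so the inner loop sees a[k][j]
                # from the previous phase.
                a[k][j] = s.mul(a[k][k], a[k][j])
    return a


# Reflexive and transitive closure: OR, AND, and c* = true for all c.
BOOLEAN = Semiring(add=lambda x, y: x or y,
                   mul=lambda x, y: x and y,
                   star=lambda c: True)

# Shortest paths: min, +, and c* = 0 if c >= 0 else -infinity.
# Caveat: Python floats give nan for (-inf) + (+inf), whereas the text
# defines that sum as +inf, so this sketch assumes no -inf weights.
MIN_PLUS = Semiring(add=min,
                    mul=lambda x, y: x + y,
                    star=lambda c: 0 if c >= 0 else -inf)

if __name__ == "__main__":
    relation = [[False, True, False],
                [False, False, True],
                [False, False, False]]
    for row in solve_app(relation, BOOLEAN):
        print([int(v) for v in row])
```

Only the three operations change between the two instances; the loop structure is shared, which is the property the systolic array exploits in hardware.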
5.0 Systolic Array Function:
The algorithm described above is implemented on a
two-dimensional array of n by n+1 orthogonally connected
processors (see figure 2). Each row k of the array has n+1
processors labelled $P_{k,1}, \ldots, P_{k,n+1}$. The weight matrix A
followed by $I_n$, which together will be represented by C, is fed
into the array one row at a time in a staggered fashion (see
figure 6).
There are three types of processors which perform different
functions. The circle processors implement the *-operation on
the first input and afterwards act as delay cells (see figure 3
parts (c) and (d)). The square processors initialize their
registers with the modification of the first input and
afterward act as multiply-and-add cells (see figure 4 parts (c)
and (d)). The double square processors are similar to the
square processors except the register value is initialized
differently (see figure 5 parts (c) and (d)). (Robert 1986, p.
174; Robert 1987, p. 187)
Each row k of the array performs the k-th phase of the
algorithm. Processors $P_{k,1}$ transmit the input data arriving
from the top to the right. As the data travels to the right,
the square processors merely pass this







Figure 2. Systolic Array for APP

Figure 3. Circle Processor
(b) circuit implementation
IF Init = TRUE THEN
Bout := Ain*
Init := FALSE
ELSE
Bout := Ain

Figure 4. Square Processor
IF Init = TRUE THEN
R := Ain and Bin
Init := FALSE
Bout := Bin
Aout := NIL
ELSE
Aout := Ain or R and Bin

Figure 5. Double Square Processor
IF Init = TRUE THEN
R := Bin
Init := FALSE
Aout := NIL
ELSE
Aout := R and Bin




and store the first data arriving from the top but modify and
pass downward the following data arriving from the top.
Likewise the $P_{k,n+1}$ processors store the first input data and
modify and pass the following data downward. As seen in figures
3, 4, and 5, the control of the processor's current operation
is dependent upon the variable Init.
It can be seen that after 2n input data have gone through the
array, the length has been shortened to n, a row of output
data. The shortening takes place because each cell that passes
data downward does so with its first datum at half the rate in
which all data arrives. As the data flows through the array, it
is reordered at each row. The element $a_{kk}^{(k)}$ is computed in the
circle processor $P_{k,1}$ and moves rightward to be stored in the
double-square processor $P_{k,n+1}$. The non-diagonal elements
$a_{ik}^{(k)}$, $i \ne k$, are computed by the square processors $P_{k,2}, \ldots,
P_{k,n}$, where element $a_{ik}^{(k)}$ is stored in processor $P_{k,(i-k+1) \bmod n}$.
Therefore row k of the array outputs matrix $C^{(k)}$ in row
order with the leftmost row k+1. (Robert 1986, pp. 174-177;
Robert 1987, pp. 211-213)
The operation of row 1 will be detailed through several steps





time step   $P_{11}$   $P_{12}$   $P_{13}$   $P_{14}$   $P_{15}$
1   $a_{11}^{(1)} = (a_{11}^{(0)})^*$

$P_{12}$ computes $a_{21}^{(1)} = a_{21}^{(0)} \otimes a_{11}^{(1)}$ and passes $a_{11}^{(1)}$
to the right.
3   $a_{13}^{(0)}$   $a_{22}^{(1)}$   $a_{31}^{(1)}$
remark: $P_{11}$ passes $a_{13}^{(0)}$,
$P_{12}$ computes $a_{22}^{(1)} = a_{22}^{(0)} \oplus a_{21}^{(1)} \otimes a_{12}^{(0)}$ and passes
it down while passing $a_{12}^{(0)}$ to the right
$P_{13}$ computes $a_{31}^{(1)} = a_{31}^{(0)} \otimes a_{11}^{(1)}$ and passes $a_{11}^{(1)}$
to the right
4   $a_{14}^{(0)}$   $a_{23}^{(1)}$   $a_{32}^{(1)}$   $a_{41}^{(1)}$

$a_{23}^{(1)} = a_{23}^{(0)} \oplus a_{21}^{(1)} \otimes a_{13}^{(0)}$

$a_{32}^{(1)} = a_{32}^{(0)} \oplus a_{31}^{(1)} \otimes a_{12}^{(0)}$ and passes
$a_{12}^{(0)}$ to the right
$P_{14}$ computes $a_{41}^{(1)}$

5   1   $a_{24}^{(1)}$   $a_{33}^{(1)}$   $a_{42}^{(1)}$   $a_{11}^{(1)}$
remark: $P_{11}$ passes the first 1 of the identity matrix to the
right
$P_{12}$ computes $a_{24}^{(1)}$ passing it downward while passing
$a_{14}^{(0)}$ to the right
$P_{13}$ computes $a_{33}^{(1)}$ passing it downward and passes
$a_{13}^{(0)}$ to the right
$P_{14}$ computes $a_{42}^{(1)} = a_{42}^{(0)} \oplus a_{41}^{(1)} \otimes a_{12}^{(0)}$ and passes
it down while passing $a_{12}^{(0)}$ to the right
$P_{15}$ stores $a_{11}^{(1)}$
6   0   $a_{21}^{(1)}$   $a_{34}^{(1)}$   $a_{43}^{(1)}$   $a_{12}^{(1)}$
remark: $P_{11}$ passes 0 from the identity matrix
$P_{12}$ computes $a_{21}^{(1)} = 0 \oplus a_{21}^{(1)} \otimes 1$ passing it downward

$a_{14}^{(0)}$ to the right
$P_{14}$ computes $a_{43}^{(1)}$ passing it downward and passes
$a_{13}^{(0)}$ to the right
$P_{15}$ computes $a_{12}^{(1)} = a_{11}^{(1)} \otimes a_{12}^{(0)}$ and passes it
downward
The above description is summarized in figures 6 and 7, where






The time necessary for the array to process the input data is
5n+2 where n is the array size (Robert 1986, p. 174). As
previously described, the systolic array has n(n+1) processors,
which gives a size of order $n^2$. The time and size can be
compared to other algorithms and array sizes; see table 1 below.
From the comparisons in table 1, it can be seen that the
present algorithm is one of the best in terms of time and array
size. In addition, it is a general solution of the APP. The
particular instance that will be implemented with this
algorithm, the transitive and reflexive closure, has the same
order of time as the array specifically designed for it. In
fact the last entry in the table, Kung-Lo-Lewis, is a better
implementation because it can begin a new solution every n time
steps whereas Robert-





                 Application          Area    Time
Guibas           transitive closure   n^2     6n
Kung-Lo          shortest paths       n^2     7n
Kramer-Leeuwen   matrix inversion     n^2     6n
Table 1: Comparisons of Systolic Arrays for Solving APP
(Benaini 1989, p. 74)











































































6.0 Mapping of Algorithm onto a Systolic Array:
The mapping of Robert and Trystram's algorithm for solving the
Algebraic Path Problem (APP) for the specific instance of
reflexive and transitive closure of a binary relation is done by
the use of quasi-dependence graphs, which look similar to the
systolic array depicted in figure 2 except the double square
processor is denoted as a single square.
Robert and Trystram's Algorithm to Solve APP
1: for k <- 1 to n
2: begin
3:   $a_{kk}^{(k)} \leftarrow (a_{kk}^{(k-1)})^*$
4:   for i <- 1 to n, i ≠ k
5:     $a_{ik}^{(k)} \leftarrow a_{ik}^{(k-1)} \otimes a_{kk}^{(k)}$
6:   for j <- 1 to n, j ≠ k
7:   begin
8:     for i <- 1 to n, i ≠ k
9:       $a_{ij}^{(k)} \leftarrow a_{ij}^{(k-1)} \oplus a_{ik}^{(k)} \otimes a_{kj}^{(k-1)}$
10:    $a_{kj}^{(k)} \leftarrow a_{kk}^{(k)} \otimes a_{kj}^{(k-1)}$
11:  end
12: end
FIGURE 8
Figure 9. The mapping of line #3 onto the Systolic Array
Figure 10. The mapping of line #5 onto the Systolic Array
The algorithm which is listed above in figure 8 has several
nested loops. Each relevant instruction line will be discussed
separately. In the following description, a relation with four
elements will be used. This leads to a 4x4 array. The size 4
was chosen partly because it is the largest size which can be
easily explained, as in the previous section. It also limits
the number of pins needed on an integrated circuit chip to a
reasonable number when the array is fabricated.
6.1 Description of lines:
The first instruction, line #3, has the quasi-dependence graph
illustrated in Figure 9. The value $a_{kk}^{(k)}$ only depends on
$a_{kk}^{(k-1)}$ but there need to be n of these because of the for
loop in line #1. In Figure 9, the relevant nodes that perform
this function are in bold and depicted on the final systolic
array to give a feeling of how the mapping will proceed and
where the data is flowing.




with one output $a_{ik}^{(k)}$. This relation needs to be repeated
for all i between 1 and n except for i=k. The quasi-dependence
graph is depicted in Figure 10. Once again there are n
repetitions because of line #1, and the nodes that perform the
operation are bold. Here $a_{ik}^{(k)}$ is shown within the node to
indicate that it will be stored there for future use in the
final systolic array.
Figure 11. The mapping of line #9 onto the Systolic Array
Figure 12. The mapping of line #10 onto the Systolic Array
The final loop, line #8, encompasses an instruction that when
mapped has three inputs, $a_{ij}^{(k-1)}$, $a_{ik}^{(k)}$, $a_{kj}^{(k-1)}$, and one
output, $a_{ij}^{(k)}$. The quasi-dependence graph notation is shown in
Figure 11. Here the value that was stored in the previous loop
is used as one of the inputs. As seen by the for statement,
line #8, there are n-1 nodes. The other loops are taken care of
by the data flow.
The last instruction, line #10, is depicted in Figure 12. Here
one of the two inputs has been stored within the node from
previous data flow in the array. This line is in fact taking



7.0 Design and Implementation of Systolic Array:
The transitive and reflexive closure of a binary relation was
chosen as the instance for the systolic array to solve because
of its ease of implementation and the ability to describe its
functioning in simple terms. There are other algorithms specific
to the stated instance; however, their time and size are of the
same order as this solution, O(n) and O(n^2), respectively (see
Table 1).
In the following subsections, the necessary equations and
circuits are derived and constructed for the systolic array. All
the input and output data paths of the processors in figure 2
are actually 2 bits wide, as can be seen in figures 3 to 5 part
(c). One of the bits is the data, which is labeled either Axx
or Bxx for vertically or horizontally flowing data, while the
other, which is labeled Dxx, signals whether the data is valid.
The xx is replaced by in or out for input and output,
respectively.
7.1 Logic Equations:
The logic equations for the three different types of cells were
derived from the functions each cell needed to perform. In





is given at the bottom for the circle, square, and double
square processors respectively. The necessary logic equations
are derived below in order of processor complexity.
7.1.1 Circle Processor:
The circle processor first performs the *-operation and then
acts as a delay cell. These functions are summarized in the
truth table at the upper left corner of figure 3(a). When Init
is true Bout should pass 1, the value of the *-operation,
otherwise it should pass Ain. The value of Init should only
remain true if there has not been any data and it is presently
true. Valid data is signalled by Din and therefore Dout should
directly follow Din. From these truth tables and the previous
description, the following logic equations are derived for the
outputs of the circle processor.
Bout = Ain or Init [1]
Dout = Din [2]
Init = -Din and Init [3]
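Equations [1] to [3] can be exercised in software before any gates are drawn. The following is a minimal behavioural sketch, assuming Init starts true after reset and ignoring the input latches and their one-clock delay; the names `circle_step` and `run_circle` are hypothetical and not part of the original design files.

```python
def circle_step(ain: bool, din: bool, init: bool):
    """One evaluation of the circle processor logic equations."""
    bout = ain or init                 # [1] passes 1 (the * value) while Init is true
    dout = din                         # [2] data-valid flag follows Din
    init_next = (not din) and init     # [3] Init clears once valid data has arrived
    return bout, dout, init_next


def run_circle(inputs):
    """Apply a sequence of (Ain, Din) pairs, starting from the reset state."""
    init = True
    trace = []
    for ain, din in inputs:
        bout, dout, init = circle_step(bool(ain), bool(din), init)
        trace.append((int(bout), int(dout)))
    return trace


if __name__ == "__main__":
    stimulus = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]
    for (ain, din), (bout, dout) in zip(stimulus, run_circle(stimulus)):
        print(f"Ain={ain} Din={din} -> Bout={bout} Dout={dout}")
```

With an input sequence such as the one used for the circle processor test in section 8, the Bout and Dout values this sketch produces can be set beside the simulated waveforms as a quick sanity check.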
7.1.2 Double Square Processor:
The double square processor's functioning is depicted at the
bottom of figure 5(d). Initially Bin is stored within the
processor in R for future computations and no valid value is
passed through Aout. Afterwards, Aout is computed and passed.
Since no data is passed the first time, Dout does not directly
follow Din; however, Init's function remains the same. The truth
table summarizing these actions is given in the upper left
corner of figure 5(a). The value of Aout is only given definite
values when Init and Din are false and true, respectively. The
rest are don't cares, X. The reason for this is that Din
signals when there is valid data. The stored first received
value R should remain the same when Init is false, but should
latch the value of Bin when Init and Din are both true.
Karnaugh maps were used to reduce terms in the equations and
are given below the truth table in figure 5(a). The following
logic equations were derived:
Aout = R and Bin [4]
Dout = -Init and Din [5]
R = (Init and Bin) or (-Init and R) [6]
Init = Init and -Din [7]
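The same kind of behavioural check applies to equations [4] to [7]. This is a sketch under stated assumptions: R is reset to false and Init to true, outputs are computed from the current state, and the input latches are not modelled. `double_square_step` and `run_double_square` are hypothetical names.

```python
def double_square_step(b_in: bool, din: bool, init: bool, r: bool):
    """One evaluation of the double square processor logic equations."""
    aout = r and b_in                               # [4]
    dout = (not init) and din                       # [5] nothing valid is passed during Init
    r_next = (init and b_in) or ((not init) and r)  # [6] R tracks Bin until Init clears
    init_next = init and (not din)                  # [7]
    return aout, dout, r_next, init_next


def run_double_square(inputs):
    """Apply a sequence of (Bin, Din) pairs, starting from the reset state."""
    init, r = True, False
    trace = []
    for b_in, din in inputs:
        aout, dout, r, init = double_square_step(bool(b_in), bool(din), init, r)
        trace.append((int(aout), int(dout)))
    return trace


if __name__ == "__main__":
    # First valid Bin is 0, so R stays 0 and Aout never rises.
    print(run_double_square([(0, 1), (0, 0), (1, 1), (1, 0), (0, 1), (0, 0)]))
    # First valid Bin is 1, so R latches 1 and Aout follows Bin afterwards.
    print(run_double_square([(1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]))
```

Note that equation [6] as written does not gate R on Din, so in this sketch R follows Bin on every cycle while Init is still true and then holds its value.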
7.1.3 Square Processor:
This is the most complex processor since there are two inputs
and two outputs. As shown at the bottom of figure 4(d), the pro




in R, while at the same time passing Bin to the right and
nothing down. After the first values have been processed, Bin is
still passed to the right through Bout and the value passed
downward, Aout, is the sum and product expression for Ain, Bin,
and R. The Init value depends on two data valid flags, DAin and
DBin, and should only remain true when it is true and either of
the latter values is false. Since the Aout value is delayed one
time step, its data valid flag DAout is exactly the same as that
for the double square processor. Bin is always passed unaffected
and therefore Bout and DBout directly follow their input
counterparts Bin and DBin. These conclusions are summarized in
the three truth tables on the left side of figure 4(a). Once
again Karnaugh maps were used to minimize terms for the
complicated equations, R and Aout. There are two equations for R
given below its Karnaugh map. The first is the equation
directly derived from the mapping while the second has the
additional data valid flags added to ensure R only changes when
valid data is available. The equations are summarized below:
Aout = Ain or (Bin and R) [8]
DAout = -Init and DAin [9]
Bout = Bin [10]
DBout = DBin [11]
R = (Init and DBin and DAin and Ain and Bin)
or (-Init and R) [12]
Init = Init and (-DBin or -DAin) [13]
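Equations [8] to [13] translate directly into a one-step function. As with the other cells, this is a behavioural sketch that assumes R resets to false and Init to true and leaves out the input latches; the name `square_step` and the tuple layout of its result are hypothetical.

```python
def square_step(ain: bool, dain: bool, b_in: bool, dbin: bool,
                init: bool, r: bool):
    """One evaluation of the square processor logic equations.

    Returns (Aout, DAout, Bout, DBout, R_next, Init_next).
    """
    aout = ain or (b_in and r)                       # [8] OR-AND "multiply-and-add"
    daout = (not init) and dain                      # [9] nothing valid goes down during Init
    bout = b_in                                      # [10] Bin always passes to the right
    dbout = dbin                                     # [11]
    r_next = ((init and dbin and dain and ain and b_in)
              or ((not init) and r))                 # [12] latch Ain AND Bin on first valid pair
    init_next = init and ((not dbin) or (not dain))  # [13]
    return aout, daout, bout, dbout, r_next, init_next


if __name__ == "__main__":
    init, r = True, False
    # First valid pair initialises R = Ain and Bin; later pairs are combined with R.
    for ain, b_in in [(True, True), (False, True), (False, False)]:
        aout, daout, bout, dbout, r, init = square_step(ain, True, b_in, True, init, r)
        print(f"Ain={int(ain)} Bin={int(b_in)} -> Aout={int(aout)} "
              f"DAout={int(daout)} Bout={int(bout)} R={int(r)} Init={int(init)}")
```

In the first step DAout is false, so the Aout value computed there is not treated as valid data by the cell below.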
7.2 Circuit Design:
The circuits for the three different processors were designed
directly from the derived logic equations. A mixed logic
approach was used with NAND and NOR gates. NAND and NOR gates
were used exclusively because the design was going to be
implemented in CMOS technology, where these gates are more
easily made. The inputs were all latched using D flip-flops.
This was necessary for proper operation, and it also gave both
the input and its inverted value. D flip-flops were also used to
store the internal processor variables, Init and R. A reset
line has been included to set all input latches and R to the
false state while Init is set to true.
7.2.1 Circle Processor:
The three equations for the circle processor when converted to
mixed logic notation become:
[1] Bout,h = Ain,l or Init,l [14]
[2] Dout,h = Din,h [15]
[3] Init,h = -Din,l and Init,l [16]
In the circuit diagram at the upper




these equations are directly implemented.
7.2.2 Double Square Processor:
Similarly, the equations for the double square processor are
converted to mixed logic notation for direct implementation
using NAND and NOR gates.
[4] Aout,h = R,l and Bin,l [17]
[5] Dout,h = -Init,l and Din,l [18]
[6] R,h = (Init,h and Bin,h),l or
(-Init,h and R,h),l [19]
[7] Init,h = Init,l and -Din,l [20]
Refer to the upper right corner of figure 5(b) for the circuit
diagram.
7.2.3 Square Processor:
The final equations for the square processor are converted
below. However, in the circuit diagram in the upper right
corner of figure 4(b), an 8-input NAND gate is used because TTL
parts were used during this phase of the design. The 8-input
NAND was used as a 5-input NAND by tying together the proper
lines.
[8] Aout,h = Ain,l or
(Bin,h and R,h),l [21]
[9] DAout,h

[10] Bout,h = Bin,h [23]
[11] DBout,h = DBin,h [24]
[12] R,h = (Init,h and DBin,h and
DAin,h and Ain,h and Bin,h),l
or (-Init,h and R,h),l [25]





8.0 Processor Testing:
Each of the processor cells was simulated by an exhaustive
test of the complete truth table for the particular cell. This
was done by running Quicksim simulations on the gate-level
circuit designs.
8.1 Circle Processor:
Since there are only two inputs and outputs for this cell, a
relatively simple set of input forcing functions produced the
small truth table given below, table 2. The input forcing
function event times in nanoseconds are also given to be
compared to the Quicksim simulation output in figure 14.
time   Ain   Din   Bout   Dout   -Init
2000   1     0     1      0      0
3000   1     1     1      1      0
4000   0     0     0      0      1
5000   0     1     0      1      1
6000   1     0     1      0      1
7000   1     1     1      1      1
Table 2: Circle Processor Operation































































Figure 13. Simulation output for the Double Square Processor (R=0)
Figure 14. Simulation output for the Circle Processor
The -Init value is the inverted (-Q) output of the D flip-flop
and should be one time step behind.
This simulation test also proved the proper functioning of the
-Init signal because it only changed when there was valid
data available. In other tests not shown, the other two
variations on inputs Ain and Din when Init is true were tested
and verified. Please refer to figure 3 for the originally
created truth tables.
8.2 Double Square Processor:
Similarly to the circle processor, the double square processor
has two inputs and outputs. The truth table, table 3, given
below has the same format as the one in table 2. The values can
be compared to the Quicksim simulation output in figure 13. The
Init and R values are the Q outputs of the D flip-flops and
should be one time step behind the other values.
Similar tests were done to make R true and are given below,
table 4, and in figure 15.
These tests also proved the proper functioning of the Init and
R signals. The described tables, 3 and 4, can also be compared
to the original truth tables created in figure 5.
time    Bin   Din   Aout   Dout   Init   R
2000    0     1     0      0      1      0
3000    0     0     0      0      0      0
4000    1     1     0      1      0      0
5000    1     0     0      0      0      0
6000    0     1     0      1      0      0
7000    0     0     0      0      0      0
Table 3: Double Square Processor Operation (R=0)
time    Bin   Din   Aout   Dout   Init   R
12000   1     0     0      0      1      0
13000   1     1     1      0      1      1
14000   0     0     0      0      0      1
15000   0     1     0      1      0      1
16000   1     0     1      0      0      1
17000   1     1     1      1      0      1
Table 4: Double Square Processor Operation (R=1)
Figure 15. Simulation output for the Double Square Processor (R=1)




8.3 Square Processor:
As with the previous processors, simulation tests were
conducted on the square processor. The simulation output is
shown in figure 16. A detailed truth table will not be
presented here because the listing would be too long and the







9.0 Array Design and Testing:
The processing cells were connected to form several different
array sizes for testing. Initially a 2x2 array was created,
followed by a 3x3 and a 4x4. The testing was done using
Quicksim for the logic simulations and with a computer program,
listed in Appendix A, that strictly followed the program
implementation. These two outputs were compared against one
another and against hand calculations for correctness. During
this phase of the testing it was discovered that there was an
error in one of my main texts (Robert 1986). Fortunately, the
error was the switching of the AND and OR operations for the
transitive and reflexive closure, which had already been
suspected before any implementation began.
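The hand calculations mentioned above can also be reproduced mechanically. The following is a minimal Python sketch and not the program of Appendix A: it assumes a Warshall-style Boolean iteration (OR accumulates paths, AND chains two path segments), and the function name transitive_reflexive_closure is hypothetical. It only provides a reference result against which an array output could be compared.

```python
def transitive_reflexive_closure(matrix):
    """Return the transitive and reflexive closure of a 0/1 relation matrix.

    Assumption: a Warshall-style iteration is used here as a reference;
    it is not the systolic algorithm itself.
    """
    n = len(matrix)
    # Copy the relation and force the diagonal to 1 (reflexive part).
    closure = [[1 if i == j else int(bool(matrix[i][j])) for j in range(n)]
               for i in range(n)]
    for k in range(n):
        for i in range(n):
            if closure[i][k]:
                for j in range(n):
                    # OR accumulates paths, AND chains i->k with k->j.
                    closure[i][j] |= closure[i][k] & closure[k][j]
    return closure


if __name__ == "__main__":
    # A 4x4 relation in which element 1 is related to every element.
    given = [[1, 1, 1, 1],
             [0, 0, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
    for row in transitive_reflexive_closure(given):
        print(*row)
```

Swapping the roles of AND and OR in the inner update would no longer compute a closure, which is the kind of error that a comparison against such a reference exposes.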
The clock period was set at 1000 ns to avoid any problems that
might occur because of propagation delays or rise and fall
times.
9.1 2x2 Array Testing:
Three different input matrices were used for testing the 2x2
array. The matrices, computer program output, and directed
graphs are given in figures 17 to 19. The computer program
















Figure 17. Test #1 for 2x2 Array
Program Input




















Figure 18. Test #2 for 2x2 Array
Program Input















Transitive & Reflexive Closure
1 0
1 1























Figure 20. Simulation output for test #1 of 2x2 Array













































Figure 21. Simulation output for test #2 for 2x2 Array







































































Figure 24. Test #1 for 3x3 Array
Program Input









Starts '.[ -1 -1
L G -1
i 1 G
- I 1 G










Figure 25. Test #2 for 3x3 Array
showed the data flow into and out of the array and simulated
invalid data as a -1. This can be compared with figures 6 and 7
to see the skewed data input and output.
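The skewing of the input can be sketched as follows. This is a Python paraphrase of the SetUpDataStream procedure listed in Appendix A, under the assumption that the matrix appended behind the data is the identity matrix; the names build_input_stream and NIL are chosen here for illustration only.

```python
NIL = -1  # marker for invalid data, as printed by the computer program


def build_input_stream(matrix):
    """Build the skewed input stream for an n x n relation matrix.

    Column j of the stream carries row j of the matrix, delayed by j
    time steps, followed by column j of the identity matrix
    (assumption: the appended "I matrix" is the identity).
    """
    n = len(matrix)
    rows = 3 * n - 1  # time steps 0 .. 3n-2
    stream = [[NIL] * n for _ in range(rows)]
    for i in range(n):
        for j in range(n):
            stream[i + j][j] = matrix[j][i]
            stream[i + j + n][j] = 1 if i == j else 0
    return stream


if __name__ == "__main__":
    given = [[1, 0, 1, 0],
             [0, 1, 0, 0],
             [1, 0, 1, 0],
             [0, 1, 0, 1]]
    for row in build_input_stream(given):
        print(" ".join(f"{value:2d}" for value in row))
```

Each column starts one time step later than the column to its left, which is the skew visible in the printed program input.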
The Quicksim simulation outputs are given in figures 20 to 22,
respectively. These simulations show the values of all inputs,
outputs, and intermediate points. The matrices in figures 17 to
19 were compared to the simulations.
The symbolic diagram of the array is depicted in figure 23. At
this stage all lines were tested for correct operation. The
computer program was able to print the state of the array at
each time interval and was used to compare against the Quicksim
output for correctness. Thorough testing on this 2x2 array
prevented major problems with any larger sizes.
9.2 3x3 Array Testing:
Two input matrices were tested against a 3x3 array, see figures
24 to 25. The Quicksim simulations are not included for the sake
of brevity; however, they were done and compared with the
computer simulations and hand calculations.
































































Program Input
 1 -1 -1 -1
 0  0 -1 -1
 1  1  1 -1
 0  0  0  0
 1  0  1  1
 0  0  0  0
 0  1  0  1
 0  0  0  0
-1  0  1  0
-1 -1  0  0
-1 -1 -1  1
Program Output
Start   1 -1 -1 -1
        0  0 -1 -1
        1  1  1 -1
        0  0  0  0
       -1  0  1  1
       -1 -1  0  0
End    -1 -1 -1  1
Given Matrix
1 0 1 0
0 1 0 0
1 0 1 0
0 1 0 1
Transitive & Reflexive Closure
1 0 1 0
0 1 0 0
1 0 1 0
0 1 0 1
Figure 27. Test #1 for 4x4 Array
Program Input
Start   0 -1 -1 -1
        1  1 -1 -1




0 1 0 0




-1 -1 0 0






0 1 0 1
1 0 1 0
0 1 0 1
1 0 1 0
Transitive & Reflexive Closure
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1


































1 -1 -1 -1
1 0 -1 -1
1 1 0 -1
1 0 0 0
1 0 1 0
1 -1 0 0
1 -1 -1 1
Given Matrix
1 1 1 1
0 0 0 0
0 0 0 0
0 0 0 0
Transitive & Reflexive Closure
1 1 1 1
0 1 0 0
0 0 1 0
0 0 0 1
































































Figure 31. Simulation output for test #2 on 4x4 Array



















Figure 32. Simulation output for test #3 on 4x4 Array
The circuit diagram for a 4x4 array is depicted in figure 26.
It is merely the logical expansion of the 2x2 array. During the
array testing, the intermediate points were not displayed in
the Quicksim simulations because they were checked during the
2x2 testing and the information would be too overwhelming for
easy and accurate verification. The three input matrices tested
are depicted in figures 27 to 29 along with the Quicksim





10.0 Conversion of Design to MOSIS3 Standard Cells:
The 4x4 array can be made into an integrated circuit using
MOSIS CMOS3 (MOSIS3) standard cells. Mentor Graphics
Corporation's Cell Station design automation tools running on
the Apollo workstations were used to lay out the design. The
chip can be fabricated with these standard cells using the
MOSIS facility. The TTL component design therefore must be
converted to standard cell formats, see figure 33.
The NAND and NOR gates were directly converted except that the
mixed logic symbols were not available, see figures 34 to 36.
In the square processor, the 8-input NAND gate which was being
used as a 5-input NAND gate had to be converted to a two stage
implementation using two NANDs and one NOR, see figures 3 and
35.
Most of the D-flip/flops were changed to ones that had only the
necessary outputs and inputs, i.e. only Q and/or only R.
Latches were going to be used instead of the D-flip/flops, but
after checking propagation delays and other timing requirements
it was discovered that the flip/flops were as fast or faster
than the corresponding latch






Command [Description]
1  MOSIS_NETED file [NETED version for MOSIS3]
2  MOSIS_EXPAND_COMP file CMOS3 [Builds net list]
3  MOSIS_DESIGN_CHECKER file CMOS3 [Electrical check]
4  MOSIS_EXPAND_DESIGN file CMOS3 [Flattens design]
5  QUICKSIM file [Simulates circuit]
6  MOSIS_ADD_DELAY file CMOS3 -FO [Adds fanout delay]
7  QUICKSIM file [Simulates circuit]
8  MOSIS_LAYOUT LOGIC_ENTRY file CMOS3 [Creates physical design]
9  MOSIS_LAYOUT CELLFLOOR file CMOS3 [Generates floorplan]
10 MOSIS_LAYOUT EDIT_PARMS file CMOS3 [Edit floorplan]
11 MOSIS_LAYOUT CELLFLOOR file CMOS3 [Must rerun after edit]
12 MOSIS_LAYOUT CELLPLACE file CMOS3 [Placement of cells]
13 MOSIS_LAYOUT CELLPOWER file CMOS3 [Routes power networks]
14 MOSIS_LAYOUT CELLROUTE file CMOS3 [Routes signal networks]
15 MOSIS_LAYOUT CELLSQUEEZE file CMOS3 [Removes excess in routing channels]
16 MOSIS_LAYOUT MINROUTE file CMOS3 [Minimizes use of poly]
17 MOSIS_ADD_DELAY file CMOS3 -BA -LO [Adds back annotation and wiring delays]
18 QUICKSIM file [Simulates circuit]
19 MOSIS_LAYOUT PREGRAPH file CMOS3 [Generates working file]
20 MOSIS_LAYOUT CELLGRAPH file CMOS3 [Allows manual editing]
21 MOSIS_LAYOUT CELLVERIFY file CMOS3 [Final validity check]
22 MOSIS_LAYOUT GDS2_OUPUT file


































































[Square processor figure: truth table and Karnaugh maps for inputs Init, Ain, Bin, R and outputs Aout, Bout, R, Init]
R = Init.Ain.Bin + -Init.R
R = Init.DBin.DAin.Ain.Bin + -Init.R
Aout = Ain + Bin.R




































IF Init TRUE THEN


















[Double Square processor figure: truth table and Karnaugh maps for inputs Init, Bin, Din, R and outputs Aout, Dout, R, Init]
Init.Bin + -Init.R











IF Init = TRUE THEN




Aout = R and Bin
END IF
The I/O inputs for the TTL implementation had to be changed to
I/O pads for the chip layout. Additionally, Vdd and Vss pads
needed to be included within the circuit diagram for proper
layout. The circuit diagram is given in figure 37 and was
developed using Mentor Graphics' schematic editor, NETED, with
CMOS3 circuit elements. The editor can be called into operation
using the command MOSIS_NETED. A complete listing of the
additional commands and steps necessary for the conversion to a
MOSIS3 standard cell layout is given in figure 33.
The 4x4 array was chosen for the chip layout because the 20
pins required will fit within the standard 24 pin chip. The
next array size which would be useful, an 8x8, would require 70
pins.
The array's outputs were not latched. This may give glitches on
these lines during the zero state of the clock. The reasons for
not latching the outputs are: (1) more than one chip could be
connected together, in which case the latching would be
redundant; (2) the latching can be done externally, especially
for the additional design which would be discussed later where
several

























































































Figure 37. 4x4 Systolic Array using MOSIS3 Standard Cells
11.0 Final Simulation and Testing:
To perform the final simulations and testing, the Cell Station
commands in steps 1 through 5 of figure 33 were completed
first. The design was converted to use MOSIS3 standard cells,
expanded, and checked. During checking it was discovered that
additional buffers were needed to drive the clock and reset
lines because the fanout was over thirty. Compare figures 26
and 37 for the changes.
The same tests that were run before on the 4x4 array were
repeated, and the results agreed exactly. At this point
propagation delays were measured for the individual processors.
The table below gives the worst-case results.
Processor      Aout   DAout  Bout   DBout
Circle         --     --     31ns   20ns
Square         31ns   32ns   20ns   20ns
Double Square  32ns   32ns   --     --
Table 5: Propagation Delays for Processors
These values were determined using Quicksim's ability to list






Since the array was functioning properly logically, the fanout
delays were added next, using steps 6 to 7 of figure 33. The
clock period remained at 1000 ns to avoid any problems with
propagation delays and rise and fall times. The previously
mentioned delays and times were measured to ensure no problems
were actually occurring. The propagation delays are summarized
below.
Processor      Aout    DAout   Bout    DBout
Circle         --      --      31.6ns  20.7ns
Square         31.7ns  34ns    21.1ns  23.9ns
Double Square  33.0ns  34.1ns  --      --
Table 6: Processor Propagation Delays
The above values in table 6 were once again calculated from
Quicksim's listing output.
The rise and fall times of the clock signal determined by Cell
Station tools were checked during this step and subsequent
steps to ensure proper signal timing. The other signals were
not critically checked.
The typical fanout was three. Whereas









Figure 38. Clock skew after the fanout delay is added (0-1).





Figure 39. Clock skew after the fanout delay is added (1-0).




Once the fanout delay was added, the rise time of the clock
signal that drove the processors changed from 4ns to 13.933ns
and its fall time from 4.19ns to 12.656ns. The clock skew which
occurred because of the input pad and additional driving buffer
was also considered. A pictorial representation of the skewing
is given in figures 38 and 39. The input pad delayed the zero
to one transition by 6ns. This, compounded with the drive
buffer, produced a total delay of 20.1ns. Similarly, the one to
zero transition was delayed 20.5ns. All of these restrictions
on the clock signal will be considered when the maximum clock
frequency is calculated after all delays are added.
A simulation output for a single test case is given in figure
42. The test case is similar to that depicted in figure 27
except that the extra relation 2->4 was added, i.e. 2 is mapped
to 4.
11.2 Back Annotated and Wiring Delay:
The standard cells making up the array's integrated circuit
were placed and then routed on the chip floor plan. The delays
due to the routing wires were back annotated into the
simulation properties. These wiring







Figure 40. Clock Skew after wiring delay and back annotation
are added (0-1). N$633 is the pad output. N$634




Figure 41. Clock Skew after wiring delay and back annotation
are added (1-0). N$633 is the pad output. N$634

















































Figure 42. Simulation output for modified test #1 on 4x4 array with fanout delays added.
Figure 43. Simulation for modified test #1 on 4x4 array with wiring delays and back annotation added.
to be added for simulation in steps 8 to 18 of figure 33. The
clock skew had increased to 22.8ns and 23.8ns for one to zero
and zero to one transitions, respectively, see figures 40 to
41. The rise and fall times also increased, to 17.106ns and
15.362ns, respectively. The new propagation delays for the
processors are given below.
Processor      Aout    DAout   Bout    DBout
Circle         --      --      31.7ns  21.3ns
Square         32.0ns  34.7ns  21.4ns  21.3ns
Double Square  33.3ns  34.5ns  --      --
Table 7: Final Propagation Delays
Simulation outputs for the three tests run are given in figures
43, 44, and 45, which correspond to the matrices given in
figures 27 (with 2->4 added), 29, and 28, respectively. The
1000 ns clock period was used during these tests also. In the
following section, the minimum clock period will be derived and
used.
11.3 Minimum Clock Period:
The minimum clock period needs to accommodate the propagation
delays of the processors along with the rise and fall times of
Figure 44. Simulation for test #3 on 4x4 array with wiring delays




















































































Figure 45. Simulation for test #2 on 4x4 array with wiring delays and back annotation added.
the clock signal. The one state of the clock should be at least
40ns with a fall time of 20ns to completely cover the maximum
propagation delay of the processors and the new fall time.
Likewise, the zero state should be at least 30ns (the minimum
allowed is 29ns (Heinbuch 1988)) with a 20ns rise time. This
would yield a 110ns period, refer to figure 49. These values
will ensure the proper operation of the array. Simulations were
run with a period of 120ns, in which the zero state was
increased to 40ns, see figures 46 to 48.
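The arithmetic behind these periods can be checked with a short Python sketch. The four phase durations are the values quoted above; the function names clock_period_ns and frequency_mhz are hypothetical and serve only this illustration.

```python
def clock_period_ns(high_ns, fall_ns, low_ns, rise_ns):
    """Sum the four phases of one clock cycle (all values in ns)."""
    return high_ns + fall_ns + low_ns + rise_ns


def frequency_mhz(period_ns):
    """Convert a period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


# Minimum period: 40ns one state, 20ns fall, 30ns zero state, 20ns rise.
minimum_period = clock_period_ns(40, 20, 30, 20)

# Simulated period: the zero state is lengthened from 30ns to 40ns.
simulated_period = clock_period_ns(40, 20, 40, 20)

if __name__ == "__main__":
    print(minimum_period, "ns minimum period")
    print(simulated_period, "ns simulated period")
    print(round(frequency_mhz(simulated_period), 1), "MHz")
```

The 120ns period used in the simulations therefore leaves a 10ns margin over the 110ns minimum and corresponds to roughly 8.3MHz.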
11.4 Data Timing Requirements:
The clock signal has been skewed by the input pad and drive
buffer. This requires the data to be placed on the lines at a
specific point within the clock period. From examination of
figure 49, one can see that the best place is 20ns after the
zero to one transition.
Figure 46. Simulation for modified test #1 on 4x4 array at maximum frequency (8.3 MHz)
Figure 47. Simulation for test #3 on 4x4 array at maximum frequency (8.3 MHz)


















Figure 48. Simulation output for test #2 on 4x4 array at maximum frequency (8.3 MHz)
Figure 49. Clock wave form after skewing
12.0 GDS2 OUTPUT:
After fully simulating the systolic array and verifying proper
functioning, the final steps were taken to obtain a file to
have the chip fabricated. Steps 19 through 22 of figure 33 were
completed, producing the chip layout in figure 50 and the
GDS2_OUPUT file for fabrication.
13.0 Future Endeavors:
There are several more steps which could be completed on this
thesis.
13.1 Fabrication and Testing:
Ideally the chip would be fabricated and tested. The test
vectors that could be used are those that have already been
presented along with matrices of all 0's and all 1's. These
additional matrices would test if the R registers are stuck at
either 1 or 0.
13.2 Three Chip Design:
The systolic array could be expanded to use three chips with
each chip enclosing a 4x4 section of a larger array. This would
allow any array size to be built up. The first chip would have
all the processors except for the rightmost, double square
processors. The second chip will be an array of only square







14.0 Conclusion:
Starting from the theory of the Algebraic Path Problem, this
thesis has presented a general algorithm by Robert and Trystram
(1986) for its solution. The approach and operation of the
algorithm were explained and clarified, which led to the
discussion of the specific systolic array implementation. From
this systolic array, a single instance of the APP, the
transitive and reflexive closure of a binary relation, was
designed and laid out for a CMOS3 standard cell chip design. A
minimum clock period of 120ns, or maximum frequency of 8.3MHz,
was determined.




BENAINI A., ROBERT Y., TOURANCHEAU B. 1989, A New Systolic
Architecture for the Algebraic Path Problem, Proc.
International Conference on Systolic Arrays, Killarney, Co.
Kerry, Ireland, in Systolic Array Processors, ed. McCanny
J., McWhirter J., Swartzlander Jr. E., Prentice Hall, 1989,
pp. 73 - 82.
GONDRAN M., MINOUX M., VAJDA S. 1984, Graphs and Algorithms,
John Wiley & Sons, New York, 1984.
HARTNETT W. 1963, Principles of Modern Mathematics, Harper &
Row, New York, 1963.
HEINBUCH D. 1988 ed., CMOS3 Cell Library, Addison-Wesley
Publishing Company, Reading, Massachusetts, 1988.
KUICH W., SALOMAA A. 1986, Semirings, Automata, Languages,
Springer-Verlag, Berlin, Germany, 1986.
KUNG S. Y. 1988, VLSI Array Processors, Prentice Hall, New
Jersey, 1988.
ROBERT Y. 1987, Systolic Algorithms and Architectures, in
Automata Networks in Computer Science (Theory and
Applications), ed. Soulie F- P., Robert Y., Tchuente M.,
Princeton University Press, Princeton, New Jersey, 1987, pp.
187 - 228.
ROBERT Y., TRYSTRAM D. 1986, Systolic Solution of the Algebraic
Path Problem, Proc. First International Workshop on Systolic
Arrays, Oxford, 2-4 July 1986, in Systolic Arrays, ed. Moore
W., McCabe A., Urquhart R., Adam Hilger, 1987, pp. 171 - 180.
ROTE G. 1985, A Systolic Array Algorithm for the Algebraic Path














































Appendix A
REMark Flag to output to printer
: REMark Flag to print individual steps
REMark Flag to pick which algorithm
: REMark 0 circuit, 1 program implem.
: REMark Array size
: REMark array
: REMark Initialization
e%-1,Size%-1), Ain%(Size%-1,Size%),
Aout%(Size%-1,Size%), Bin%(Size%-1,Size%),
Bout%(Size%-1,Size%), R%(Size%-1,Size%),
Init%(Size%-1,Size%), DataStreamIn%(3*Size%-2,Size%-1),
DataStreamOut%(2*Size%-2,Size%-1), II%(Size%-1,Size%-1)
DAin%(Size%-1,Size%), DAout%(Size%-1,Size%),
DBin%(Size%-1,Size%), DBout%(Size%-1,Size%)
IF Print% THEN OPEN#3,prt
RESTORE 6
: REMark Setup matrix input stream
FOR i=0 TO Size%-1
FOR j=0 TO Size%-1
IF i=j THEN













IF Print% THEN CLOSE#3
STOP : REMark Program end
REMark Sets up the input data stream
DEFine PROCedure SetUpDataStream
LOCal i,j
: REMark Insert NIL's
FOR i=0 TO Size%-2
FOR j=0 TO Size%-1
DataStreamIn%(i,j)=-1
DataStreamIn%(2*Size%+i,j)=-1
END FOR j
END FOR i
: REMark Insert Data & I matrix
FOR i=0 TO Size%-1
FOR j=0 TO Size%-1
DataStreamIn%(i+j,j)=Input%(j,i)
DataStreamIn%(i+j+Size%,j)=II%(i,j)
END FOR j
END FOR i
END DEFine SetUpDataStream
:
:
DEFine PROCedure PrintDataStreamIn
LOCal i,j
CLS#1
FOR i=0 TO 3*Size%-2
FOR j=0 TO Size%-1
IF Print% THEN PRINT#3,TO j*3;DataStreamIn%(i,j);
AT#1,i,j*3:PRINT#1,DataStreamIn%(i,j)
END FOR j
IF Print% THEN PRINT#3
END FOR i
IF Print% THEN PRINT#3,\\\\
END DEFine PrintDataStreamIn
:
:
DEFine PROCedure PrintDataStreamOut
LOCal i,j
CLS#1
FOR i=0 TO 2*Size%-2
FOR j=0 TO Size%-1
IF Print% THEN PRINT#3,TO j*3;DataStreamOut%(i,j);
AT#1,i,j*3:PRINT#1,DataStreamOut%(i,j)
END FOR j
IF Print% THEN PRINT#3
END FOR i









TransferOutIn Time%
IF StepThrough% THEN





FOR i=0 TO Size%-1
FOR j=0 TO Size%
UpdateCell i,j
END FOR j
END FOR i
Time%=Time%+1
IF Time%>5*Size%-2 THEN EXIT Time%
END REPeat Time%
END DEFine Systolic
:
:
DEFine PROCedure TransferOutIn(T%)
LOCal i,j
FOR j=0 TO Size%
IF j<>Size% THEN
IF T%<=3*Size%-2 THEN
IF DataStreamIn%(T%,j)=-1 THEN
Ain%(0,j)=0:DAin%(0,j)=0
ELSE
Ain%(0,j)=DataStreamIn%(T%,j):DAin%(0,j)=1
END IF
ELSE
Ain%(0,j)=0:DAin%(0,j)=0
END IF
END IF
IF j THEN
Bin%(0,j)=Bout%(0,j-1)
DBin%(0,j)=DBout%(0,j-1)
END IF
END FOR j
FOR i=1 TO Size%-2
FOR j=0 TO Size%
IF j<>Size% THEN
Ain%(i,j)=Aout%(i-1,j+1)
DAin%(i,j)=DAout%(i-1,j+1)
END IF
IF j THEN
Bin%(i,j)=Bout%(i,j-1)
DBin%(i,j)=DBout%(i,j-1)
END IF
END FOR j
END FOR i
FOR j=0 TO Size%
IF j<>Size% THEN
Ain%(Size%-1,j)=Aout%(Size%-2,j+1)
DAin%(Size%-1,j)=DAout%(Size%-2,j+1)
END IF
IF j THEN
Bin%(Size%-1,j)=Bout%(Size%-1,j-1)
DBin%(Size%-1,j)=DBout%(Size%-1,j-1)
IF T%>=3*Size% AND T%<=5*Size%-2
DataStreamOut%(T%-3*Size%,j-1)=




END FOR j
END DEFine TransferOutIn
:
:
DEFine PROCedure SetUpCells
LOCal i,j
FOR i=0 TO Size%-1
FOR j=0 TO Size%
Ain%(i,j)=0:Aout%(i,j)=0
DAin%(i,j)=0:DAout%(i,j)=0
Bin%(i,j)=0:Bout%(i,j)=0
DBin%(i,j)=0:DBout%(i,j)=0
Init%(i,j)=1:R%(i,j)=0
END FOR j
END FOR i
END DEFine SetUpCells
:
:
DEFine PROCedure UpdateCell(X%,Y%)
LOCal y
y=Y%
SELect ON y
ON y=0
Circles X%,Y%





END DEFine UpdateCell
:
:
DEFine PROCedure Circles(X%,Y%)
IF Algo% THEN

















DAout%(X%,Y%)=0:DBout%(X%,Y%)=0
END IF
ELSE
Bout%(X%,Y%)=Ain%(X%,Y%)||Init%(X%,Y%)
DBout%(X%,Y%)=DAin%(X%,Y%)
Init%(X%,Y%)=NOT(DAin%(X%,Y%))&&Init%(X%,Y%)
END IF
END DEFine Circles
:
:
DEFine PROCedure Dsquare(X%,Y%)
IF Algo% THEN
IF DBin%(X%,Y%) THEN
IF Init%(X%,Y%) THEN









DAout%(X%,Y%)=0:DBout%(X%,Y%)=0
END IF
ELSE
Aout%(X%,Y%)=Bin%(X%,Y%)&&R%(X%,Y%)
DAout%(X%,Y%)=NOT(Init%(X%,Y%))&&DBin%(X%,Y%)
R%(X%,Y%)=(Init%(X%,Y%)&&DBin%(X%,Y%))||(NOT(Init%(X%,Y%))&&R%(X%,Y%))
Init%(X%,Y%)=Init%(X%,Y%)&&NOT(DBin%(X%,Y%))
END IF
END DEFine Dsquare
DEFine PROCedure Square(X%,Y%)
IF Algo% THEN
IF DBin%(X%,Y%) AND DAin%(X%,Y%) THEN
IF Init%(X%,Y%) THEN
R%(X%,Y%)=Bin%(X%,Y%)&&Ain%(X%,Y%)
Init%(X%,Y%)=0
Aout%(X%,Y%)=0:DAout%(X%,Y%)=0





Aout%(X%,Y%)=Ain%(X%,Y%)||R%(X%,Y%)&&Bin%(X%,Y%)
DAout%(X%,Y%)=1




DBout%(X%,Y%)=0:DAout%(X%,Y%)=0
END IF
ELSE
Aout%(X%,Y%)=(Bin%(X%,Y%)&&R%(X%,Y%))||Ain%(X%,Y%)
DAout%(X%,Y%)=NOT(Init%(X%,Y%))&&DAin%(X%,Y%)
DBout%(X%,Y%)=DBin%(X%,Y%)
Bout%(X%,Y%)=Bin%(X%,Y%)
R%(X%,Y%)=(Init%(X%,Y%)&&DBin%(X%,Y%)&&DAin%(X%,Y%)&&Ain%(X%,Y%)&&Bin%(X%,Y%))||(NOT(Init%(X%,Y%))&&R%(X%,Y%))
Init%(X%,Y%)=Init%(X%,Y%)&&(NOT(DBin%(X%,Y%))||NOT(DAin%(X%,Y%)))
END IF
END DEFine Square
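For readers without access to a SuperBASIC interpreter, the circuit branches (Algo% = 0) of the three cell procedures above can be paraphrased in Python. This is a sketch under stated assumptions: all signals are 0/1 integers, outputs and next-state values are computed from the state held before the clock edge, the operator joining the two NOT terms in the square cell's Init update is taken to be OR, and the function names are hypothetical.

```python
def circle_cell(ain, dain, init):
    """Circle cell: returns (bout, dbout, next_init)."""
    bout = ain | init
    dbout = dain
    next_init = (1 - dain) & init  # Init clears once valid data arrives
    return bout, dbout, next_init


def double_square_cell(bin_, dbin, init, r):
    """Double square cell: returns (aout, daout, next_r, next_init)."""
    aout = bin_ & r
    daout = (1 - init) & dbin
    next_r = (init & dbin) | ((1 - init) & r)  # as printed in the listing
    next_init = init & (1 - dbin)
    return aout, daout, next_r, next_init


def square_cell(ain, dain, bin_, dbin, init, r):
    """Square cell: returns (aout, daout, bout, dbout, next_r, next_init)."""
    aout = (bin_ & r) | ain
    daout = (1 - init) & dain
    bout, dbout = bin_, dbin
    next_r = (init & dbin & dain & ain & bin_) | ((1 - init) & r)
    # Assumption: OR joins the two NOT terms, so Init stays set until
    # both inputs are valid in the same time step.
    next_init = init & ((1 - dbin) | (1 - dain))
    return aout, daout, bout, dbout, next_r, next_init


if __name__ == "__main__":
    # Load the square cell's R register, then pass one data value through.
    print(square_cell(1, 1, 1, 1, init=1, r=0))
    print(square_cell(0, 1, 1, 1, init=0, r=1))
```

Returning the next state instead of updating it in place keeps the register behaviour explicit: every output in a time step depends only on the Init and R values latched in the previous step.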
:
:
DEFine PROCedure DisplayArray
LOCal i,j
CLS#1
FOR j=0 TO Size%-1
INK#1,4+3*DAin%(0,j)
AT#1,0,10*j:PRINT#1,'A: ';Ain%(0,j);DAin%(0,j)
END FOR j
FOR i=0 TO Size%-1
FOR j=0 TO Size%
INK#1,4+3*DAout%(i,j)





;Aout%(i,j);DAout%(i,j)
INK#1,4+3*DBout%(i,j)
AT#1,5*i+3,10*j:PRINT#1,'B: ';Bout%(i,j);DBout%(i,j)
INK#1,7
AT#1,5*i+4,10*j:PRINT#1,'R: ';R%(i,j)
INK#1,4+3*Init%(i,j)




;Init%(i,j)
END FOR j
END FOR i
FOR j=1 TO Size%
INK#1,4+3*DAout%(Size%-1,j)








;Aout%(Size%-1,j);DAout%(Size%-1,j)
END FOR j
INK#1,7
PRINT#1,\'Time: ';Time%
END DEFine DisplayArray
