On some equivalent configurations of systolic arrays by Yao, K. & Chang, C. Y.
ON SOME EQUIVALENT CONFIGURATIONS
OF SYSTOLIC ARRAYS
C.Y. Chang and K. Yao
Electrical Engineering Dept
University of California
Los Angeles, CA 90024
ABSTRACT
A systematic approach is presented for
designing systolic arrays and their equivalent
configurations for certain general classes of
recursively formulated algorithms. A new method is
also introduced to reduce the input bandwidth and
storage requirements of the systolic arrays through
the study of dependence among the input data. Many
well known systolic arrays can be rederived and
also many new systolic arrays can be discovered by
this approach.
I. INTRODUCTION
A systolic array is a network of processors
that rhythmically process and pass data among
themselves. It provides pipelining, parallelism,
and simple adjacent neighbor cell interconnection
structure so that it is suitable for VLSI
implementation. While most of the earlier systolic
array algorithms were discovered heuristically
[1-3], there has been various work on systematic
approaches to the design of systolic array
algorithms [4-6]. In this paper, we shall present
a systematic approach for designing systolic arrays
and especially focus on their equivalent
configurations for certain general classes of
recursively formulated algorithms. In order to
reduce the input bandwidth and storage requirements
of the systolic arrays, the dependence among the
input data is also investigated in detail. It is
shown that many well known systolic arrays can be
rederived and also many new systolic arrays can be
discovered by this systematic approach. For
simplicity of illustration, we mainly consider the
linear systolic array in this paper. The same idea
can also be generalized to the two dimensional
mesh-connected systolic arrays.
II. IMPLEMENTATION OF RECURSIVELY
FORMULATED ALGORITHMS
Consider two simple but important data flow
patterns in a linear systolic array, as shown in
Figures 1 and 2. In these two figures, $P_i$, $Q_j$,
and $b_{ij}$ are three given input data sequences and
$R_j$ is to be the output data sequence, where
$0 \le i \le m-1$ and $0 \le j \le n-1$. For the systolic
array shown in Figure 1, $Q_j$ and $R_j$ are stored in
the $j$th processor, where $R_j$ is updated while
$P_i$ moves to the right and $b_{ij}$ moves down. For
the systolic array shown in Figure 2, $P_i$ is stored
in the $i$th processor and $R_j$ is updated as it
moves to the right with $Q_j$ while $b_{ij}$ moves
down. All of the data movements are synchronized.
The $R_j$'s will successively hold the required
output data after $m$ steps. For convenience,
according to the behavior of the $R_j$'s in these two
systolic arrays, they are named the R-stay and
R-move linear systolic arrays, respectively. There
is great similarity between these two systolic
arrays. It can be shown that a large class of
interesting real-world problems can be implemented
by these two types of linear systolic arrays.
Moreover, various different but equivalent
configurations of linear systolic arrays can also be
derived from them.
Procedure 1 : Given any problem which can be
formulated so that it has $P_i$, $Q_j$, and $b_{ij}$ as
three input data sequences and $R_j$ as the output
data sequence, where $0 \le i \le m-1$ and
$0 \le j \le n-1$, if $R_j$ can be generated through the
following recurrence equation

$$R_j^{[i]} = f(P_i, Q_j, b_{ij}; R_j^{[i-1]}) \tag{1}$$

where $R_j^{[-1]}$ contains some initial value, $f$ is
any function of four variables $P_i$, $Q_j$, $b_{ij}$,
and $R_j^{[i-1]}$, and $R_j^{[m-1]}$ is the required output
data $R_j$, then this problem can be implemented by
the R-stay linear systolic array of $n$ processors
and the R-move linear systolic array of $m$
processors. □
The complexity and the configuration of the
systolic array depend on the complexity of the
function $f$ and the generation procedure of the
$b_{ij}$'s. Some regularity and dependence among the
$b_{ij}$'s may greatly simplify the whole system.
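The two data flow patterns can be checked with a short cycle-by-cycle simulation. The Python sketch below is an illustration only and is not taken from the paper: the function names (`reference`, `r_stay`, `r_move`) and the timing convention (the moving datum with index $i$ or $j$ enters the first cell at clock $i$ or $j$, and $b_{ij}$ reaches its cell from above at clock $i+j$) are assumptions chosen to match the description of Figures 1 and 2.

```python
def reference(f, P, Q, b, R_init):
    """Evaluate recurrence (1) directly: R_j <- f(P_i, Q_j, b_ij; R_j), i = 0..m-1."""
    R = list(R_init)
    for j in range(len(Q)):
        for i in range(len(P)):
            R[j] = f(P[i], Q[j], b[i][j], R[j])
    return R


def r_stay(f, P, Q, b, R_init):
    """R-stay array (assumed timing): n cells hold Q_j and R_j; P_i moves right."""
    m, n = len(P), len(Q)
    R = list(R_init)
    lane = [None] * n  # lane[j] = (i, P_i) currently inside cell j
    for t in range(m + n - 1):
        # Shift the P stream one cell to the right and feed the next P_i.
        lane = [(t, P[t]) if t < m else None] + lane[:-1]
        for j, packet in enumerate(lane):
            if packet is not None:
                i, p = packet
                R[j] = f(p, Q[j], b[i][j], R[j])  # b_ij arrives from above
    return R


def r_move(f, P, Q, b, R_init):
    """R-move array (assumed timing): m cells hold P_i; (Q_j, R_j) move right."""
    m, n = len(P), len(Q)
    out = [None] * n
    lane = [None] * m  # lane[i] = (j, Q_j, partial R_j) currently inside cell i
    for t in range(m + n - 1):
        lane = [(t, Q[t], R_init[t]) if t < n else None] + lane[:-1]
        for i, packet in enumerate(lane):
            if packet is not None:
                j, q, r = packet
                lane[i] = (j, q, f(P[i], q, b[i][j], r))
        if lane[-1] is not None:  # a finished R_j leaves the last cell
            j, _, r = lane[-1]
            out[j] = r
    return out
```

With `f = lambda P, Q, b, R: R + P * b` and a zero initial `R_init`, both arrays compute a matrix vector product; any other `f` of the same four arguments can be substituted without changing the data flow.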
III. MAPPING INTO FAN-IN TYPE
LINEAR SYSTOLIC ARRAY
Note that for the two linear systolic arrays
shown in Figures 1 and 2, the input bandwidth and
storage requirements are large in comparison to the
number of processors in the array, which may be
either infeasible or inefficient for many
applications of interest. This is mainly because
the dependence among the $b_{ij}$'s is not efficiently
utilized, so that each processor needs its own
external input connection to receive all the
$b_{ij}$'s. It is expected that under certain
circumstances not all of these external input
connections are required. In this paper, we are
also interested in reducing the input bandwidth and
storage requirements by showing under what
conditions these external input connections can be
removed so that only the very first processor has
such a connection, i.e., the input sequences can
only be fanned in through the systolic array. It is
shown that the existence of certain patterns of
dependence among the $b_{ij}$'s allows them to be
generated in a fan-in manner by slightly modifying
the operations involved in each processor without
losing the adjacent neighbor interconnection
structure. These conditions are given in the
following two procedures.
Procedure 2 : For the R-stay linear systolic
array, if $b_{ij}$ can be determined through the
following dependence equation

$$b_{ij} = T(b_{i-1,j}, b_{i,j-1}, u_i, v_j) \tag{2}$$

where $u_i$ is a variable which depends only on $i$,
$v_j$ is a variable which depends only on $j$, and $T$
is a function of four variables, then $b_{ij}$ can be
generated by the fan-in scheme systolic array as
shown in Figure 3 rather than being broadcast as
shown in Figure 1. Also note that $b_{-1,j}$ as well as
$v_j$, which depends only on $j$, can be preloaded in
the $j$th processor, and $b_{i,-1}$ as well as $u_i$,
which depends only on $i$, can be used as a fanned-in
input sequence. □
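Procedure 2 can be exercised with a small simulation of the fan-in R-stay array. The Python sketch below is illustrative and not part of the paper: the names `direct` and `r_stay_fan_in`, the argument order of `T` (upper neighbor value, left neighbor value, $u_i$, $v_j$), and the one-cell-per-clock timing are assumptions. Each cell keeps a register `B` preloaded with the boundary value for $i = -1$ and receives the boundary value for $j = -1$, or its left neighbor's `B`, together with $P_i$ and $u_i$.

```python
def direct(f, T, P, Q, u, v, b_left, b_top, R_init):
    """Build every b_ij from dependence (2), then apply recurrence (1)."""
    m, n = len(P), len(Q)
    b = [[None] * n for _ in range(m)]
    R = list(R_init)
    for i in range(m):
        for j in range(n):
            up = b[i - 1][j] if i > 0 else b_top[j]      # b_{i-1,j}
            left = b[i][j - 1] if j > 0 else b_left[i]   # b_{i,j-1}
            b[i][j] = T(up, left, u[i], v[j])
            R[j] = f(P[i], Q[j], b[i][j], R[j])
    return R


def r_stay_fan_in(f, T, P, Q, u, v, b_left, b_top, R_init):
    """Fan-in R-stay array: b_left[i] = b_{i,-1} is fanned in with P_i and u_i;
    b_top[j] = b_{-1,j} and v[j] are preloaded in cell j (assumed layout)."""
    m, n = len(P), len(Q)
    B = list(b_top)   # register B of each cell
    R = list(R_init)  # register R of each cell
    wire = [None] * n  # wire[j] = (P_in, u_in, b_in) presented to cell j
    for t in range(m + n - 1):
        head = (P[t], u[t], b_left[t]) if t < m else None
        wire = [head] + wire[:-1]          # every packet moves one cell right
        for j in range(n):
            if wire[j] is None:
                continue
            p, ui, b_in = wire[j]
            B[j] = T(B[j], b_in, ui, v[j])  # B <- T(B, b_in, u_in, V_p)
            R[j] = f(p, Q[j], B[j], R[j])   # R <- f(P_in, Q_p, B; R)
            wire[j] = (p, ui, B[j])         # b_out <- B
    return R
```

Only the first cell reads external data in `r_stay_fan_in`; every other cell reads its left neighbor's outputs, which is the property the procedure is meant to secure.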
Note that for the R-stay linear systolic array
shown in Figure 1, if $b_{ij}$ is the current input to
the $j$th processor, then $b_{i-1,j}$ is the previous
input to the $j$th processor and $b_{i,j-1}$ is the
previous input to the $(j-1)$th processor. It is
understandable that, in order to avoid violating
the adjacent neighbor interconnection structure,
$b_{ij}$ can only depend on $b_{i-1,j}$ and $b_{i,j-1}$ as
well as the data that can be preloaded and the data
that can be fanned in, which is what Procedure 2 is
about. In general, the systolic array shown in
Figure 3 has two sets of input data. One of them
consists of three fanned-in data sequences, $P_i$,
$u_i$, and $b_{i,-1}$, which depend only on the $i$
index, and the other set consists of three preloaded
data sequences, $Q_j$, $v_j$, and $b_{-1,j}$, which depend
only on the $j$ index, where $u_i$, $v_j$, $b_{i,-1}$,
and $b_{-1,j}$ are used to generate all the $b_{ij}$'s.
For each processor, four registers are required,
namely $Q_p$, $V_p$, $B$, and $R$, where registers $Q_p$
and $V_p$ are used to store the preloaded data $Q_j$ and
$v_j$ respectively. Initially, register $B$ is loaded
with $b_{-1,j}$ and register $R$ is set to $R_j^{[-1]}$,
both of which will be updated as the systolic array
starts operation. The reason for including so many
data sequences is to cover the general case.
However, it is expected that in many applications
not all of these fanned-in and preloaded data
sequences are required. It is often the case that
the fan-in generation process of $b_{ij}$ simply
depends on two or three data sequences which can
either be fanned in or preloaded. Very similar
results can be obtained for the R-move linear
systolic array, as follows.
Procedure 3 : For the R-move linear systolic
array, if $b_{ij}$ can be determined through the
following dependence equation

$$b_{ij} = T(b_{i-1,j}, b_{i,j-1}, u_i, v_j) \tag{3}$$

where $u_i$ is a variable which depends only on $i$,
$v_j$ is a variable which depends only on $j$, and $T$
is a function of four variables, then $b_{ij}$ can be
generated by the fan-in scheme systolic array as
shown in Figure 4 rather than being broadcast as
shown in Figure 2. Also note that $b_{i,-1}$ as well as
$u_i$, which depends only on $i$, can be preloaded in
the $i$th processor, and $b_{-1,j}$ as well as $v_j$,
which depends only on $j$, can be used as a fanned-in
input sequence. □
Note that for the R-move linear systolic array
shown in Figure 2, if $b_{ij}$ is the current input to
the $i$th processor, then $b_{i,j-1}$ is the previous
input to the $i$th processor and $b_{i-1,j}$ is the
previous input to the $(i-1)$th processor. Procedure 3
simply restates the fact that, in order to avoid
violating the adjacent neighbor interconnection
structure, $b_{ij}$ can only depend on $b_{i-1,j}$ and
$b_{i,j-1}$ as well as the data that can be preloaded
and the data that can be fanned in. In general, the
systolic array shown in Figure 4 has two sets of
input data. One of them consists of three fanned-in
data sequences, $Q_j$, $v_j$, and $b_{-1,j}$, which depend
only on the $j$ index, and the other set consists of
three preloaded data sequences, $P_i$, $u_i$, and
$b_{i,-1}$, which depend only on the $i$ index, where
$u_i$, $v_j$, $b_{i,-1}$, and $b_{-1,j}$ are used to
generate all the $b_{ij}$'s. For each processor, three
registers are required, namely $U_p$, $B$, and $P$,
where registers $P$ and $U_p$ are used to store the
preloaded data $P_i$ and $u_i$. Initially, register $B$
is loaded with $b_{i,-1}$ and the output data $R_j$ is
set to $R_j^{[-1]}$, both of which will be updated as
the systolic array starts operation.
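The R-move counterpart can be sketched in the same way. This Python fragment is an illustration under assumptions, not the authors' implementation: the names `direct` and `r_move_fan_in`, the argument order of `T` (upper neighbor value, left neighbor value, $u_i$, $v_j$), and the timing are hypothetical choices. Here each cell holds $P_i$, $u_i$, and a register `B` preloaded with the boundary value for $j = -1$, while $Q_j$, $v_j$, $R_j$, and the $b$ values travel from cell to cell.

```python
def direct(f, T, P, Q, u, v, b_left, b_top, R_init):
    """Build every b_ij from dependence (3), then apply recurrence (1)."""
    m, n = len(P), len(Q)
    b = [[None] * n for _ in range(m)]
    R = list(R_init)
    for i in range(m):
        for j in range(n):
            up = b[i - 1][j] if i > 0 else b_top[j]      # b_{i-1,j}
            left = b[i][j - 1] if j > 0 else b_left[i]   # b_{i,j-1}
            b[i][j] = T(up, left, u[i], v[j])
            R[j] = f(P[i], Q[j], b[i][j], R[j])
    return R


def r_move_fan_in(f, T, P, Q, u, v, b_left, b_top, R_init):
    """Fan-in R-move array: b_top[j] = b_{-1,j} is fanned in with Q_j, v_j, R_j;
    b_left[i] = b_{i,-1} and u[i] are preloaded in cell i (assumed layout)."""
    m, n = len(P), len(Q)
    B = list(b_left)   # register B of each cell
    out = [None] * n
    wire = [None] * m  # wire[i] = (j, Q_in, v_in, b_in, R_in) presented to cell i
    for t in range(m + n - 1):
        head = (t, Q[t], v[t], b_top[t], R_init[t]) if t < n else None
        wire = [head] + wire[:-1]           # every packet moves one cell right
        for i in range(m):
            if wire[i] is None:
                continue
            j, q, vj, b_in, r = wire[i]
            B[i] = T(b_in, B[i], u[i], vj)  # B <- T(b_in, B, U_p, v_in)
            r = f(P[i], q, B[i], r)         # R_out <- f(P, Q_in, B; R_in)
            wire[i] = (j, q, vj, B[i], r)   # b_out <- B
        if wire[-1] is not None:            # a finished R_j leaves the last cell
            out[wire[-1][0]] = wire[-1][4]
    return out
```

The index `j` carried in each packet is bookkeeping for the simulation only; a hardware cell would not need it.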
The previous three procedures provide a rather
systematic approach to designing the systolic array
architecture for the implementation of a given
problem. First, by checking the existence of the
recurrence relationship shown in equation (1), we
can determine whether there exist any systolic
arrays of the kind shown in Figures 1 and 2. Next,
by checking the dependence among the $b_{ij}$'s as
shown in equations (2) and (3), we can determine the
existence of the fan-in type systolic arrays shown
in Figures 3 and 4, for which only small input
bandwidth and storage are required. The key issue
is how to search for the recurrence function $f$
and the dependence function $T$. Several different
forms of these functions may exist, owing to the
different possible ways of formulating a given
problem. The various forms of these functions
simply yield many different but equivalent
configurations of systolic arrays. Also note that in
the previous discussion $P$, $Q$, $b$, $u$, and $v$
are treated as single variables; however, it is
clear that they can be sets of variables and the
same results still hold. This approach can be
applied to design systolic arrays for many
interesting real-world problems, and various new
configurations of systolic arrays can be derived.
In the next section, we shall illustrate this design
approach by considering the DFT algorithm.
IV. SYSTOLIC ARRAY ARCHITECTURE
FOR DISCRETE FOURIER TRANSFORM
Given $n$ discrete data $a_i$ in the time domain,
where $0 \le i \le n-1$, and $n$ discrete frequencies
$W_j = (e^{-2\pi\sqrt{-1}/n})^j$ in the frequency domain,
where $0 \le j \le n-1$, the discrete Fourier transform
(DFT) is to compute

$$y_j = a_{n-1} W_j^{n-1} + a_{n-2} W_j^{n-2} + \cdots + a_1 W_j + a_0.$$

Let

$$f(P, Q, b; R) = (R \times b) + P.$$

By induction, it can be shown that by letting

$$y_j^{[i]} = (y_j^{[i-1]} \times W_j) + a_{n-i-2} \tag{4}$$

and $y_j^{[-1]} = a_{n-1}$, then $y_j^{[n-2]} = y_j$ is the
required output. The existence of a recurrence
function $f$ and the satisfaction of the recurrence
relationship guarantee that there exist systolic
arrays for the implementation of the discrete
Fourier transform, as shown in Figures 5 and 6.
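The polynomial-evaluation form of the recurrence is easy to verify numerically. The following Python sketch is illustrative: the function names are hypothetical, and the forward-transform sign convention $W_j = e^{-2\pi\sqrt{-1}\,j/n}$ is an assumption. It compares the recurrence with initial value $a_{n-1}$ against the defining sum.

```python
import cmath


def dft_direct(a):
    """Defining sum: y_j = sum_i a_i * W_j**i (assumed sign convention)."""
    n = len(a)
    W = [cmath.exp(-2j * cmath.pi * j / n) for j in range(n)]
    return [sum(a[i] * W[j] ** i for i in range(n)) for j in range(n)]


def dft_horner(a):
    """Recurrence with f(P, Q, b; R) = (R * b) + P, P_i = a_{n-i-2}, b_ij = W_j."""
    n = len(a)
    W = [cmath.exp(-2j * cmath.pi * j / n) for j in range(n)]
    y = [a[n - 1]] * n                 # initial value y_j^{[-1]} = a_{n-1}
    for i in range(n - 1):             # i = 0 .. n-2
        for j in range(n):
            y[j] = y[j] * W[j] + a[n - i - 2]
    return y
```

For `a = [1, 2, 3, 4]` both functions return values within rounding error of `[10, -2+2j, -2, -2-2j]`.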
It can be seen from Figures 5 and 6 that the
$b_{ij}$'s are not totally independent. Note that
$P_i = a_{n-i-2}$ and $b_{ij} = W_j$. In order to see if
$b_{ij}$ can be fanned-in generated, let us examine the
data dependence among the $b_{ij}$'s. Many different
forms of the dependence function $T$ exist. For
example,

$$b_{ij} = v_j \tag{5}$$

where $v_j = W_j$. The pair of systolic arrays based
on equations (4) and (5) are shown in Figures 7 and
8. The systolic array shown in Figure 8 is the
well known systolic DFT [2], whose discovery
appears to have been heuristic rather than
systematic as in our approach. For another example
of a $T$ function, note that

$$b_{ij} = W_j = W_{j-1} \times W_1 = b_{i,j-1} \times W_1,$$

i.e.,

$$b_{ij} = b_{i,j-1} \times u_i \tag{6}$$

where $u_i = W_1$ and $b_{i,-1} = W_{-1}$, which can be
either used as fanned-in sequences of the R-stay
linear systolic array or preloaded in the $i$th
processor of the R-move linear systolic array. The
pair of systolic arrays based on equations (4) and
(6) are shown in Figures 9 and 10.
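As a sketch of how the second dependence function removes the per-cell frequency inputs, the Python fragment below simulates an R-stay chain in which only $W_1$ and $W_{-1}$ are fanned in and each cell forms its own frequency by one multiplication. The function name, the packet layout, and the sign convention $W_1 = e^{-2\pi\sqrt{-1}/n}$ are assumptions made for illustration.

```python
import cmath


def dft_r_stay_fan_in(a):
    """R-stay DFT with the frequency generated cell by cell from W_1 and W_{-1}."""
    n = len(a)
    w1 = cmath.exp(-2j * cmath.pi / n)   # u_i = W_1, the same for every i
    w_minus1 = 1 / w1                    # boundary value b_{i,-1} = W_{-1}
    y = [a[n - 1]] * n                   # register y of cell j starts at a_{n-1}
    lane = [None] * n                    # lane[j] = (a_in, W_in1, W_in2) at cell j
    for t in range(2 * n - 2):           # m + n - 1 clocks with m = n - 1
        head = (a[n - t - 2], w1, w_minus1) if t < n - 1 else None
        lane = [head] + lane[:-1]        # every packet moves one cell right
        for j, packet in enumerate(lane):
            if packet is not None:
                a_in, u, b_in = packet
                b = u * b_in             # b_ij = b_{i,j-1} * u_i, equal to W_j
                y[j] = y[j] * b + a_in   # y <- (y * W_j) + a_in
                lane[j] = (a_in, u, b)   # pass W_1 and the new product onward
    return y
```

Cell 0 receives $W_{-1}$ and forms $W_0 = 1$, cell 1 forms $W_1$, and so on, so no cell other than the first needs an external frequency input.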
Another interesting issue is that the type of
function $f$ used in this example does not belong to
the class of general matrix vector multiplication.
This confirms that the class of problems covered by
Procedure 1 contains more than the class of general
matrix vector multiplication.
As is well known, there are two different ways to
consider the discrete Fourier transform. One shows
that the DFT is a special case of the evaluation of
a polynomial and the other shows that the DFT is a
special case of general matrix vector
multiplication. The first way was just considered
in this example. Let us see what can be obtained
by following the second way. Let

$$f(P, Q, b; R) = R + (P \times b).$$

By induction, it can be shown that by letting

$$y_j^{[i]} = y_j^{[i-1]} + (a_i \times W_j^i) \tag{7}$$

and $y_j^{[-1]} = 0$, then $y_j^{[n-1]} = y_j$ is the required
output. The existence of a new recurrence function
$f$ and the satisfaction of the recurrence
relationship guarantee that there exist systolic
arrays for the implementation of the DFT, as shown in
Figures 11 and 12.
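The matrix vector formulation maps onto the R-move data flow as follows. This Python sketch is illustrative only: the function name, the packet bookkeeping, and the sign convention $W_j = e^{-2\pi\sqrt{-1}\,j/n}$ are assumptions. Cell $i$ holds $a_i$, each partial sum $y_j$ starts at zero and moves right, and the value $W_j^i$ is supplied to cell $i$ when $y_j$ passes through it.

```python
import cmath


def dft_r_move(a):
    """R-move DFT with f(P, Q, b; R) = R + (P * b), P_i = a_i, b_ij = W_j**i."""
    n = len(a)
    W = [cmath.exp(-2j * cmath.pi * j / n) for j in range(n)]
    out = [None] * n
    lane = [None] * n                    # lane[i] = (j, partial y_j) inside cell i
    for t in range(2 * n - 1):           # m + n - 1 clocks with m = n
        lane = [(t, 0) if t < n else None] + lane[:-1]
        for i, packet in enumerate(lane):
            if packet is not None:
                j, y = packet
                lane[i] = (j, y + a[i] * W[j] ** i)   # y_out <- y_in + (a * W_in)
        if lane[-1] is not None:         # a finished y_j leaves the last cell
            j, y = lane[-1]
            out[j] = y
    return out
```

For `a = [1, 2, 3, 4]` the result agrees with `[10, -2+2j, -2, -2-2j]` to within rounding error, the same output as the polynomial-evaluation formulation.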
From Figures 11 and 12 it can also be seen that
the $b_{ij}$'s are not totally independent. Note that
$P_i = a_i$ and $b_{ij} = W_j^i$. Let us examine the data
dependence among the $b_{ij}$'s. Note that

$$b_{ij} = W_j^i = W_i^j = W_i^{j-1} W_i = b_{i,j-1} W_i,$$

i.e.,

$$b_{ij} = b_{i,j-1} \times u_i \tag{8}$$

where $u_i = W_i$ and $b_{i,-1} = W_i^{-1}$, which can be
either used as fanned-in sequences of the R-stay
linear systolic array or preloaded in the $i$th
processor of the R-move linear systolic array. The
pair of systolic arrays based on equations (7) and
(8) are shown in Figures 13 and 14. Also note that

$$b_{ij} = W_j^i = W_j^{i-1} W_j = b_{i-1,j} W_j,$$

i.e.,

$$b_{ij} = b_{i-1,j} \times v_j \tag{9}$$

where $v_j = W_j$ and $b_{-1,j} = W_j^{-1}$, which can be
either preloaded in the $j$th processor of the R-stay
linear systolic array or used as fanned-in sequences
of the R-move linear systolic array. The pair of
systolic arrays based on equations (7) and (9) are
shown in Figures 15 and 16.
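Both dependence relations can be checked by regenerating the whole table of powers from the boundary values alone. The Python sketch below is an illustration with hypothetical function names and an assumed sign convention $W_k = e^{-2\pi\sqrt{-1}\,k/n}$; it fills the table once along $j$ and once along $i$.

```python
import cmath


def roots(n):
    """W_k = exp(-2*pi*sqrt(-1)*k/n) for k = 0..n-1 (assumed sign convention)."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]


def table_along_j(n):
    """b_ij = b_{i,j-1} * u_i with u_i = W_i and b_{i,-1} = 1 / W_i."""
    W = roots(n)
    b = [[None] * n for _ in range(n)]
    for i in range(n):
        prev = 1 / W[i]              # boundary value b_{i,-1}
        for j in range(n):
            prev = prev * W[i]
            b[i][j] = prev
    return b


def table_along_i(n):
    """b_ij = b_{i-1,j} * v_j with v_j = W_j and b_{-1,j} = 1 / W_j."""
    W = roots(n)
    b = [[None] * n for _ in range(n)]
    for j in range(n):
        prev = 1 / W[j]              # boundary value b_{-1,j}
        for i in range(n):
            prev = prev * W[j]
            b[i][j] = prev
    return b
```

Either table agrees, up to rounding, with `cmath.exp(-2j * cmath.pi * i * j / n)` in position `(i, j)`, which is why the two relations lead to equivalent arrays that differ only in which sequence is preloaded and which is fanned in.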
This DFT example shows that under certain
circumstances it is possible to formulate a given
problem in several different ways to implement with
various different but equivalent configurations of
systolic arrays.
V. CONCLUDING REMARKS
A systematic approach is presented for
designing systolic arrays and deriving their
equivalent configurations for certain general
classes of recursively formulated algorithms. This
approach can be considered as a two-stage design
procedure. In the first stage, the existence of
recursiveness is investigated. If it exists, then
according to the same formulation the input data
are classified into three parts: two of them, $P_i$
and $Q_j$, depend on only one index, and the third,
$b_{ij}$, depends on both indices $i$ and $j$, so that
the systolic arrays shown in Figures 1 and 2 apply.
However, for certain applications, it is either
infeasible or inefficient to store all of the
$b_{ij}$'s. In the second stage, the dependence among
the $b_{ij}$'s is investigated to see if it can be
used to fan-in generate the $b_{ij}$'s through data
sequences that can either be preloaded or fanned in.
For a given problem, various formulations of the
recursive property and of the dependence among the
$b_{ij}$'s are possible, which lead to many different
but equivalent configurations of systolic arrays.
So far we have mainly dealt with the linear systolic
arrays. However, the same technique can be easily
generalized to the two dimensional mesh-connected
systolic arrays, since the mesh-connected systolic
arrays can be simply treated as the concatenation
of many linear systolic arrays.
VI. ACKNOWLEDGEMENT
This work was partially supported by the
NASA/Ames research contract NAG-2-304.
VII. REFERENCES
1. H. T. Kung and C. E. Leiserson, 'Systolic
Arrays (for VLSI),' Proc. Symp. Sparse Matrix
Computations and Their Applications, Nov. 2-3,
1978, pp. 256-282.
2. H. T. Kung, 'Why Systolic Architectures,'
Computer, Jan. 1982, pp. 37-45.
3. H. T. Kung, 'Let's design algorithms for VLSI
systems,' in Proc. Caltech Conf. on VLSI, pp.
65-90, Jan. 1979.
4. P. R. Cappello and K. Steiglitz, 'Unifying VLSI
Array Design with Geometric Transformations,'
Proc. Int. Conf. on Parallel Processing, pp.
448-457, Bellaire, Michigan, Aug. 1983.
5. D. I. Moldovan, 'On the Design of Algorithms
for VLSI Systolic Arrays,' Proc. IEEE, Vol. 71,
No. 1, pp. 113-120, Jan. 1983.
6. W. L. Miranker and A. Winkler, 'Spacetime
Representations of Computational Structures,'
Computing, Vol. 32, 1984.
Figure 1: The R-stay linear systolic array. Each
cell performs $P_{out} \leftarrow P_{in}$ and
$R \leftarrow f(P_{in}, Q, b_{in}; R)$.
Figure 3: The fan-in scheme of R-stay linear
systolic array. Each cell performs
$B \leftarrow T(B, b_{in}, u_{in}, V_p)$,
$R \leftarrow f(P_{in}, Q_p, B; R)$,
$u_{out} \leftarrow u_{in}$, $b_{out} \leftarrow B$, and
$P_{out} \leftarrow P_{in}$. Note that the register $B$ in
the $j$th processor is initially loaded with $b_{-1,j}$.
Figure 5: R-stay linear systolic array of
discrete Fourier transform based on
equation (4).
Figure 2: The R-move linear systolic array. Each
cell performs $R_{out} \leftarrow f(P, Q_{in}, b_{in}; R_{in})$
and $Q_{out} \leftarrow Q_{in}$.
Figure 4: The fan-in scheme of R-move linear
systolic array. Each cell performs
$B \leftarrow T(b_{in}, B, U_p, v_{in})$,
$R_{out} \leftarrow f(P, Q_{in}, B; R_{in})$,
$v_{out} \leftarrow v_{in}$, $b_{out} \leftarrow B$, and
$Q_{out} \leftarrow Q_{in}$. Note that the register $B$ in
the $i$th processor is initially loaded with $b_{i,-1}$.
Figure 6: R-move linear systolic array of
discrete Fourier transform based on
equation (4).
Figure 7: R-stay linear systolic array of
discrete Fourier transform based on
equations (4) and (5). Each cell performs
$a_{out} \leftarrow a_{in}$ and
$y \leftarrow (y \times W_p) + a_{in}$.
Figure 8: R-move linear systolic array of
discrete Fourier transform based on
equations (4) and (5). Each cell performs
$W_{out} \leftarrow W_{in}$ and
$y_{out} \leftarrow (y_{in} \times W_{in}) + a$.
Figure 9: R-stay linear systolic array of
discrete Fourier transform based on
equations (4) and (6). Each cell performs
$W_{out1} \leftarrow W_{in1}$,
$W_{out2} \leftarrow W_{in1} \times W_{in2}$,
$y \leftarrow (y \times W_{out2}) + a_{in}$, and
$a_{out} \leftarrow a_{in}$.
Figure 10: R-move linear systolic array
of discrete Fourier transform based on
equations (4) and (6). Each cell performs
$B \leftarrow U_p \times B$ and
$y_{out} \leftarrow (y_{in} \times B) + a$. Note that register
$U_p$ is preloaded with $W_1$ and register
$B$ is initially loaded with $W_{-1}$.
Figure 11: R-stay linear systolic array
of discrete Fourier transform based on
equation (7). Each cell performs
$a_{out} \leftarrow a_{in}$ and
$y \leftarrow y + (a_{in} \times W_{in})$.
Figure 12: R-move linear systolic array
of discrete Fourier transform based on
equation (7). Each cell performs
$y_{out} \leftarrow y_{in} + (a \times W_{in})$.
Figure 13: R-stay linear systolic array
of discrete Fourier transform based on
equations (7) and (8). Each cell performs
$W_{out1} \leftarrow W_{in1}$,
$W_{out2} \leftarrow W_{in1} \times W_{in2}$,
$y \leftarrow y + (a_{in} \times W_{out2})$, and
$a_{out} \leftarrow a_{in}$.
Figure 14: R-move linear systolic array
of discrete Fourier transform based on
equations (7) and (8). Note that in the
$i$th processor, register $U_p$ is preloaded
with $W_i$ and register $B$ is initially
loaded with $W_i^{-1}$.
Figure 15: R-stay linear systolic array
of discrete Fourier transform based on
equations (7) and (9). Each cell performs
$a_{out} \leftarrow a_{in}$, $B \leftarrow V_p \times B$, and
$y \leftarrow (a_{in} \times B) + y$. Note that in the
$j$th processor, register $V_p$ is preloaded
with $W_j$ and register $B$ is initially
loaded with $W_j^{-1}$.
Figure 16: R-move linear systolic array
of discrete Fourier transform based on
equations (7) and (9).
