Synthesizing Non-Uniform Systolic Designs by Guerra, Concettina & Melhem, Rami
Purdue University 
Purdue e-Pubs 
Department of Computer Science Technical 
Reports Department of Computer Science 
1986 





Guerra, Concettina and Melhem, Rami, "Synthesizing Non-Uniform Systolic Designs" (1986). Department 
of Computer Science Technical Reports. Paper 539. 
https://docs.lib.purdue.edu/cstech/539 
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. 
Please contact epubs@purdue.edu for additional information. 





SYNTHESIZING NON-UNlFOfuVI SYSTOLIC DESIGNS
Concenina Guerra
Rami Melhem
Depamnem of Computer Sciences, Purdue University
West Lafayene, Indiana
ABSTRACT
In this paper we propose a method to derive systolic designs with non-uniform data flow.
One of the major difficulties in systematic design is in transforming the original sequential
specification of a problem into a Conn suitable [0 VLSI implementation. Our approach [Q
auromatically restructuring a problem is based on a subset of me data dependencies eXITacted
from the original problem specification. By using such dependencies we are able to identify
chains of dependem computations which are then convened into recurrence equations. The map-
ping of the new specification imo hardware is also based on data dependencies. We illusrrate
the methodology by applying it [0 algorithms using dynamic programming.
-2-
I. INIRODUCTION
The development of efficient VLSI algorithms is relevant to many applications including
signal and image processing, which justifies the considerable interest for this topic. In recent
years, the problem of synthesizing VLSI design - and systolic design, in panicular - has also
received much attention [IJ-[4], [8], [1O]-[l1l, [13J-[24]. Systematic methodologies for the
derivation of systolic algorithms have proved useful both in finding new designs and in verify-
ing the correcmess of old ones. Moreover. the possibiliry of automatically generating a number of
viable algorithms for the solution of a given problem enables the selection of an optimal alga-
rirhm among a wider seL of candidates. The optimality can be based on such parameters as com-
pletion time T I number of processors P, chip area, etc.[l8]. Most of the exisiing s}'lUhesis
methods tend to minimjze the completion time.
The first and probably most challenging part of the systematic design process is in the
transformation of the high-level problem specification into a fonn berrer suited to VLSI imple-
mematioIL Such a fonn can be, for example, a system of first-order recurrence equations, a uni-
form recurrence equation, or a nested loop with constant data dependencies. Often, this transfor-
mation is obtained by using techniques similar lO those used for software compilers: buffering of
variables. addition of new variables, etc. However, it is not well WlderslOod how to select a good
transformation especially for some complex nonnumerical problems. Some authors assume that
the problem is already given in the required form. and concentrate on how to map it into
hardware.
A fundamental distinction between different approaches to !.he mapping problem is whether
they use dam dependence based tr.lIlSfonnations [1]-[4], [71, [14], [16], [201-[24], or delay
operators applied to the mathematical expression of !he algoritJun [lOHll]. For a survey of the
various methods see [6].
The transformational approach based on data dependencies has been successfully applied [0
a variety of algoritJuns characterized by constant dam dependencies. For such algoritluns linear
- 3 -
time and space rransformations of the computations into VLSI arrays have been derived {14J.
(20]-[23]. The resulting designs have cOl1Stam and regular dam flow. However, a number of more
complex problems with non-constant data dependencies require non linear transformations. A
typical example is the optimal paremhesization problem. which llses dynamic programming. In
[9] a systolic algorithm for dynamic programming was presented that can be cast in a triangular
array of processing elements. The design is quite complex: the data flow in non-consram
throughOut the array and the action of a processing element varies at different clock cycles.
Attempts have been made to synthesize non-unifonn design<;:. A non-linear transformation
for dynamic programming was indicated in [22]. However, the paper does not precisely describe
how the rransfor::::nation is obtained from L1e algOriUIffi specification. Chen [1] presents a melho-
dology for m1pping algorithms expressed as systems of first order recurrence relations into sys-
tolic arrays. The synthesis procedure allows designs with non-uniform communication patterns.
The mapping is done inductively, starting at the boundary conditions. By using Utis tectmique.
various designs, corresponding [Q different communication patterns in the systolic array, are
derived for dynamic programming. The procedure, based on a point-by-point mapping, appears [Q
be quite lengthy. In [19] a ma!hematical model, based on a sequence noration, is used to
represent index transformations. The model provides a precise specification of systolic designs. It
allows arbiuary interconnections in systolic networks and is not restriCted to linear rransforma-
tions. The paper does not give a constructive methodology to find the transformations.
1bis paper presentS a systematic method for the design of VLSI algorithms. The method
consistS of transforming the high-level problem specification into a set of mutually dependem
recurrence equations. To select the appropriate transformations we propose a two-step refinement
procedure. The procedure first determines a coarse timing function of the computations in the
original specification of the problem, based on a subset of constant data dependencies; then uses
this function to guide the search for an index: transformation. Ordered chains of dependent com-
putations are identified in the algorithm, according !o such function, and an index: transformation
-4-
is applied to each singie chain. The index transformation must be compatible with the timing
function, Le. all the computations whose operands are available first will be performed first. Sub-
sequently, the mapping of the obtained system of recurrence equations imo a VLSI nerwork is
performed by applying linear time and space uansformations separately to each recurrence sub-
ject ro global constraints. The resulting design may have non-uniform data flow.
This paper is organized as follows. Section IT presents the computational models for VLSI
arrays and algoril.hms and reviews the transformational approach to derive linear timc·space
transformations for mapping an algorithm imo a VLSI array. Section III presents a method
which helps deriving from an absuact algorilhm a set of murually dependent recurrence equii-
tiens. Section IV illustrates me method by applying it ro d)'namic programming. In section V
time-space transformations are used to map lhe obtained system of mutually dependem
recurrence equations inra a VLSI array. Finally, section VI shows how w derive a new design for
dynamic programming mat uses fewer processing elemems man me one in (9].
II. COMPUTATIONAL MODELS
A. Definitions
We will review me transformational approach [20]-(23] by referring to a specific algo-
rithmic model. We consider a structured set of computations written as a recurrence relation or a
nested loop. The set includes input srat.emems, outpm st<Hemems, assigrunen[ statements, and
conditional <lSsignmem S[a[ements, where me latter are of the form "IF predicate THEN assign-
ment". Conditional assignments are resuicted ra those where the predicate depends only on the
values of the loop indeces and nor on the values of lhe variables. The recurrence relation or
nested loop defines an index set /"={(i I, ...• ill) I n$.i (S.lf, ... , 1"15i,, 5.1,,,2} , subset of me set
2" of the n -tuples of integers. We assume mat the following condilions are satisfied:
- 5 -
CAl - each variable in the algorithm is associated with an index VeCWf (i1.i2•...• i,,), Le. is an ele-
ment of an n-dimensional array. In other words, mere is a one-to-one correspondence between
the n-ruples in 1" and me dimensions of any array used in the algorirnm.
CAl· if S is an assigrunem sratemem indexed by (i 1oi2, ... ,i,,), and a variable on me right side of
S is ind~xed by (j IJ2•... ,jl1), then each it may depend only on h:.
CA3 . The dependence vector of a variable is defined as me difference of the index vectors of
computations where the variable is used and generated [8J_ The data dependence vectors
db .... d:.n associated [0 the variables of a recurrence can be represented as a marrix
D=[J t •.• amJ. whose columns are labelled by the variable names. A variable may have many
dependencies each corresponding to a different index vecror. In me canonic form we assume mat
data dependence vectors are constant, Le. independent of me index vectors.
CA4 - a variable is used exactly once after it is generated.
We will refer to the above specification of an algorithm as to the canonic fonn of the algorithm.
The canonic form does not explicitly specify any ordering among the computations; the
lexicographical ordering in r is arbitrary and therefore irrelevant to our purposes. Rather, an
implicit partial ordering of the computations is given by the data dependencies. A panial ordering
>D defined by the data dependencies is such mat 7 >0 l'if 7= 7'+ aj for some ajED.
B. Mapping a canonic form into a VLSl array
It is possible, under certain conditions. to automatically map a recurrence into a VLSI array
[20]-[23]. Consider the following simplified model of a VLSI array. Each processor of the array
- 6 -
is assigned a labell eLn-ic 2,,-1 . The connection pattern of me array is described by the macrix
.1=fOI'Oz, ... ,O.fJ which specifies the links among the processors. Precisely, Sj is the difference
vector of the integer labels of adjacent cells in the network.
A linear time-space transformation of the compmations indexed by Til (me me VLSI array
is defined by:
IT=I;I
where T is a mapping from r ~ Z and S is a mapping from I" ~ U- l . The Lime function T
transforms D imo TD. Thus. to ensure a correct execution ordering, T must satisfy the following
condition:
TCdi ) > a for each (liED (I)
System (1) may have no solution or several solutions. In tllis laner case, me one which minimizes
me total execution time (defined as the difference between the maKimum and minimum value of
T ) is chosen.
The space mapping S is a mapping from me set of computations to the set of processing
elemenrs such that for each 7.7 E In
SUpS I]) implies T[J)~T(j) (2)
i.e. concurrem computations cannot be mapped into the same processor. Detennining a solution
for S which satisfies constraint (2) is equivalent [0 solving the diophantine equations:
SD = t>K (3)
for which the mauix jJ J is non-singular. K is an integer matrix with positive clements. The
equations may have no solution or several solmions. If no feasible solution is found. the design
·7·
procedure is repeated by starting wilh a different timing function or else a different imerconnec-
tion network. If several solutions are possible. the one which is optimal according to some given
criterion is chosen.
C. Algorithm transjonnations
To de!ivc a canonic fonn from L'1e high·level problem specification. transformations similar
to those used for software compilers are generally used (14]. The goal of such transformations is
to enhance pipelining and local communication in an algorithm. TIlls is accomplished by (i)
adding indeces to existing variables in the algorirnm. (ii) renaming variables, or (ii.i) imroducing
new variables. Such transformations do not change me algorictun fundamentally, but only the
data dependencies among the variables. The new data dependencies in the resulting specification
of the problem are therefore not characteristic of the problem itself but of its parallel implemen-
tation. Indeed. as is well known. there are in general several ways to transform a given problem
specification, according to the previous rules, each way producing a different set of data depen-
dencies. However, not all the possible transformations lead to a feasible VLSI design. Moreover,
different transformations may result in different performances. Consider the twO following
examples:
Example 1. The convolmion problem is defined by : given a sequence Xlo ... ,x" and a set of
weights w" .... w,p detennine me sequence Y \•... ,y" such that;
Broadcasting of x and w can be eliminated by adding onc more index to all variables x. w. and
y . At least two different index transfonnations can be applied which produce the two
recurrences in canonic fonn below. The first is a backward recurrence, where the variable Yi,k is
- 8 -




Yi.,k = Yi,k-l + Wi,k' Xi.k
with initial conditions:
y;.o = 0; WO,k = w.l:; Xi-l.a = xi; XO,k-l = 0
(4)





Yo;:' =Y;,.I;+I +Wi,.l:" xi);
Yi ,s"'! = 0; WO,..!: = w,/;-; Xi-l,a = Xi; XO,k-l = 0
(5)
Data dependencies corresponding to the two recurrences ina-oduce two differem partial orderings
in {'-{(i,k) I 1:5:i:91; IQ.ss}, which translare imo two different schedules for the computations
and consequently into different systolic designs. We first derive a systOlic design for rec'JITe~ce
(4). Dam dependencies for variables y, x and w in (4) are respectively:
dl~(O 1)' d, = (l I)' d, = (10)'.
The coefficients T I and T2 of a linear rransfonnation T must satisfy (1), that is:
Several solution to the atove equations are possible; the one which minimizes the execution time
is given by:
The resulting timing function is:
TU.k)~i+k.
- 9 -
A space mapping of the recurrence into a linear array of processing elements obtained from (3) is
given by:
S(i,k)=k.
The resulting design, identical EO me one presented by KWlg [12], is summarized in table
1. Similarly, by applying the mapping procedure to recurrence (5) we obtain two other designs
presented in [12]. These lal1er are summarized in tlble 2. Notice that design W2 cannot be gen.
erated srarring from recurrence (5) and. viceversa, designs R2 and WI cannot be generated from
recurrence (4). Fmally, we mention that other designs for convolution are possible [3], [12], bur
they cannor be generated from recurrences (4) or (5).
Design Output (y) Input (x) Weights (w)
W2 Move in the same direction Stay
at different speeds
Table 1. Systolic design for recurrence (4)
Design Output (y) Input (x) Weights (w)
WI Move in opposite directions Stay
R2 Stay Move in the same direction
at different speeds
Table 2. Systolic designs for recurrence (5)
Example 2. Recursive convolution
.'
-10 -
This problem can be expressed as: given a sequence of weigh[S w l."'ws ' determine the sequence
Y1""Y" such that:
Yi = L.i:=l,.r W.t . Yi-k
Of the two recurrences which can be derived by using transformations similar [0 mose used in
Example 1. only me forward recurrence has to be considered for a sysrolic implementation. The
backward recurrence does lead to a any reasonable design since it CaJInor overlap compmations
ofYi,1:: for different values of index k.
For more complex: problems. such as optimal parenthesization using dynamic program-
ming, selecting a good initial transformation is nm an obvious laSk and 10 fact srraighcforward
tnII1Sformations such as !:hose applied to the convolution problem faiL It is clear that this part of
the synthesis procedure sometimes requires creativity and programming experience. We suggest a
way [Q assist the human designer in this crucial task. Obviously, it is nm always possible to
transform a given problem into canonic fonn. For problems which do not have such represema-
tion we attempt to rewrite them into a form of many modules each being in canonic form. The
method we propose here relies on identifying a set of constant data dependencies in me original
formulation of the problem.
III. DERN!NG A VLSI ALGORITHM FROM THE HIGH-LEVEL SPECIFICATION OF TIlE
PROBLfuVl
We assume Lhat the high level specification of a problem is written in the form of a nested
loop with the index set I" defined by:
1'_{(O 0) 11\<· <12 11<. <12l
- 1,•...• 1" \_I\_\ •...• "_l,,_,,.
We assume that, unlike the canonic form, the loop contains a variable which is an s -dimensional
array. More specifically. we let s=n-l and assume thal me loop contains an assignmem
- 11 -
statement of the form:
(6)
where: is=(i l, £2•... , is) is an s -tuple of indices of the loop;
and where ajJ are integer consmms. In other words. the index i, on the left side of (6) is replaced
on the right side by the index il!.' Each vecrof dJ, U=l,..,m), represents a non-constant dam
dependence for variable c. since its I-rh component is a function of the two indices ir and i". We
can expand dj inlO a number of data dependence vecmI'S, each corresponding ro a specific value
of index in in the t-rh component. Then. we associate' to each patr
(variable ,index vecror)=(c. Is)
expanding dt .. ,d"";..
the set D~ of all the data dependence vectors obtained by
i'
OUf srrategy to select the initial index cransformation consists of a two-step procedure. Let
/ ' - [(' ') 1"<' <I' 1'<' <I'J F' d' ' fro th hia"l 1 'fi' f11>""'s -1 _ll- 1 ' ...• of _l,r _ s . lfSt. e.errmne m e ;u- eve speer cauon 0
the problem a come dIne function T; P ~ Z. 1his function will be used in !.he subsequem step
of me procedure [0 guide me search for a schedule of computations indexed by I". We derive T
from a subset of the data dependencies of the algorithm, namely, the subset of constant data
dependencies, thus, T provides only a lower boW1d for an acrual timing function for IS . Funher·
more, T depends only on the implicit dependencies of the problem since it is derived before any
arbitrary ordering of the computations is introduced. The set D~ of data dependencies is non-
i'
constant in the computation space. However, the set DC defined as the intersection ofD~ for ali
,
i"i E P contains only constant data dependencies. For a nested loop with constant data depen-
dencies, a linear (or quasi-affine) timing [unction can be determincd by applying the transforma-
tional method described in section I. Thus, if DC is nor empty we can derive a linear timing
• 12 .
function T : [S---:; Z which is compatible with the set Dt:. i.e. is >D< 7'9 implies T(iS) > T(I'S).
The function T must satisfy (I), mat is:
(7)
If system (7) has several solutions, the one which minimizes the roral execution time is chosen.
It obvious mat if 't is an actual timing function for r then it must be 't(i""i) 2: T(iS) for each
is els . Moreover. because of the rnonotcnocity of an expanded data dependence vector in D~ I it,.
muSt be 'tcj"i) > 't(l'S) if T CiS) > T (7'S).
The partial ordering defined by T on the set of computations IS deJines a partial ordering in
a subset of r based on the availabili[y of operands. Consider the set J" c 1" consisting of the n-
tuples Cis, ill) for any given is E [S and for t"l-;;i" '5./,,'2. For any two such n-tuples, we introduce the
relation >T defined by:
Max{TU'-d;)..... TU'-d-;')} > Max{TU'-d;\... TU'-d:::)}
where tI; and dj' U=l,...m) are the data dependence vectors in (6) corresponding to values i~ and
i",", respectively, of index i". Obviously, the relation >r is a panial ordering in J". Thus, J" can
be decomposed into a number of chains, i.e. subsets whose elements are linearly ordered (relative
to each other). Of course, there will be only one chain if the relation >T is linear to starr with.
Each chain consists of indexed computations which have to be carried out one after the other in a
specific linear ordering. Computations belonging to different chains can be carried out indcpen-
dently. There can be many decompositions of a partially ordered set into chains. Minimal chain
decompositions can be found by network flow techniques [5]. We are interested in a simpler
problem, namely in finding a chain decomposition of >r such that !.he computations in a chain
are also sorted ( eilher in increasing or decreasing order) according to lhe index. i" .
-13 -
At this point, we are ready to resrrucrure the given algoritlun. Let us denore by s !he
number of chains. We partition the computations indexed by 1" into s separate recurrences,
each corresponding [0 a distinct chain. In each recurrence the ordering for index ill is chosen
according to its ordering in me chain. Then, we transfonn each recurrence into canonic form
using the three previous rules: 1) adding missing indices. 2) adding new variables. and 3) renam-
ing old variables. In addition we also add statemems between recurrences to correlate variables
in distinct recurrences. In fact. a recurrence may use a variable generated in another recurrence.
TItis last Step may inrroduce non-constant data dependencies in the system.
In conclusion, the obtained new specification is expressed as a system of s modules, each
module being a recurrence equation in canonic form. Non....:::onstant data dependencies may occur
between variables of different modules.
IV. AN APPLICATION TO DYNAI,fiCPROGRfuVliVlING
Consider the dynamic programming technique applied co such problems as optimal




C;,i+l = Ci l~i:91
for some function f. Note that no explicit ordering for the indices i ,j, and k is specified. If we
choose the normal lexicographical ordering for the three indices and we apply some index
transfonnation we can derive a canonic fonn from the above recurrence; however, the sys[Qlic
implementation of it will have execution time 0 (n:!) since it does not overlap the computations
of Ci J for different k_
- 14-
Let /J=[(i,j,k) j 15.i,j91;i<k<j} be the index set of the recurrence above; and let
l~{(i,j) I 19. j5.n} be the index set defined by variable c. Each pair (c. (i, j)) is associated
with a distinct set ofdata dependencies. here represented in mauicial form:
o i-k
D ' -(i j) -
j-k 0
or in tb.e expanded form below where each column corresponds to specific value of the index k:
o o 0 -1 -2 i-j+I
j-i+l 210 o o




We derive a time function T : 12~ Z compatible wiili DC. Since DC contains only cOP.$[aIlt
dam dependencies, we can apply me methodology discussed in section I [Q determine a linear
time function. Conditions (7) applied to the dara dependence vectors in DC give:
The least integer values that satisfy the above equations are:
Thus, the optimal time lIansfonnation is:
TCi,j)=j-i.
- 15 -
Next we consider the partial ordering >T in f 3={(i ,jt) Ii,} are fixed and j <k <j} defined
by:
(i.j .k'l >T (i.j.k") ¢::::>
Ma:c{ T(i.k), T(k',j)} > Ma:c{ T(i.n. T(k ",j)}
Notice that me minimal elemems with respect to >T are:
(i,), (i+j)/2) ifi+} is even or
(i,j, (i+j-l)/2)and(i,j, (i+j+l)/2) ifi+j is odd.
By repeatedly finding minimal elements after removing the previous minimal elements
from the set we obtain a decomposition of J3 in the two chains below (here we only write the
third componem of the index vectors):
{if (i+j) is even}
(i+j)/2, (i+j)/2-1 • ...• i+l;
(i+j)/2-+1, U+j)/2+2, ... , j-1.
{if (i+j) is odd}
(i+j-l)/2. (i+j-l)/2-1 • ...• i+l;
(i+j+l)l2, (i+j+l)/2+1 • ... , j-l;
We can now rewrite (8) into a form where lhe execution ordering of index k is specified
according to the above chains. The new specification of the problem consists of two recurrences:
the first recurrence is a forward recurrence where the index k varies from (i +j)/2 to i +1 (or from
(i+j-l)/2 to i+l if i+j is odd); and the second is a backward recurrence where k varies from
(i+j)/2 La j-l (or from (i+j+l)/2 to j-l if i+j is odd) _To transfonn each recurrence into
- 16 -
canonic form some further manipulation is necessary. We first add missing indices to variables on
both sides of (8) and then inrroduce new recurrence variables to eliminate broadcast of a variable
to different destinations. We use different sets of variables for me two recurrences. Each
recurrence initializes some of its variables to values generated by the orner recurrence. Now
equations (8) can be convened imo !:he following system of mutually dependent recurrence equa-
tions:
- 17-
for j :=1 to n-l do at ;+1, £+1:= Ci. i+'; Ci. i+l. ;+1;= Ci. ;+1;
for1:=2 ton-Ida
for i:= 1 to n do begin
j:=i+l;
if (i+j)=even) then begin
k:=(i+j)/2;
Al' ai,i,le= ai,j_l,k
A2' ifk= i+l then bj : i . .l: :=Cj+l.j,j elsebi',j,k. := bi~l.i.k. ;





ai,J,k := ai,i-1.le ;
ifk=i+lLi'J.enb:,i,k. :=Cj+i,J,j elseb;',j,k :=bi~l.i.1c






bi':i.k := bi~l.j,k ;
ci,i,k '- f(ai~j.ic' bi~j,k);
end






ifk=i+l rhenbi:i, •..-,:=Ci+l,i,jelsebi:j,k :=bi~l,j,ic;
c;',i,ic := h (C;'.j,ic+l,j (ai:j,k ,bi',i,k »;
end; J
module 1
for k:= LU+j+I)/2 +IJ to j-l do begin
if k= )-1 then ai: i. k := Ci. j-I, j-I else a/:i. k
bt},k := bi:,,},k ;
ctj,It;:= h (ctj,It;-I.! (a/:j,k ,btj.;; )):
end;





From the above specification we extract distinct sers D I and D 2 of local data dependencies for the two
modules.
e' a' b' e a b"
0 0 -1 0 0 -1
0 1 = 0 1 0 O2 = 0 1 0
-1 0 0 1 0 0
Non~constant dependencies between variables of distinct recurrences are defined by statements Al to AS
in the algorithm. We will refer to such dependencies as global dependencies.
v. Mlu'PING TIlE ALGORITHM INTO HARDWARE
A. Time function
Once the algorithm has been convened into the new form, the mapping imo a VLSI array can be
accomplished in two steps:
1) Finding for each individual module in the algorithm representation a separate time function
which is compatible with me local data dependencies and also satisfies the consrraints imposed by the
global dependencies.
2) Finding a space mapping for each module into me processors of a physical network which is
compatible with the time function and satisfies global constrnints.
Referring again to the dynamic programming algorithm. we first seek linear time c.ransformations A.
and /l for module I and 2, respectively. Furthermore. le[ cr be me time function for computation in AS
outside the two modules. The linearity requires mm for the dependence vectors of the modules i[ mus[
be:
'- (d) > 0 for aED I and 11 (d) >0 for aED,
- 19 -
that is:
J.L1 :5 -1 J.lz;;:: 1 J.l3;;:: 1.
Coosuaints introduced by AI-AS lead to the following equations:
1.(1, j, U+j)/2) > IlU, j-I, U+j)/2)
'A.(i. j, i+l) > a(i+l, j. j)
IlU, j, j-l) > erU, j-I, j-I)
IlU, j, U+j+I)/2) > 1.(1+1, j, (i+j+I)/2)
er(i, j, j) ~ max[A(I, j, I+1),Il(i, j, j-I)]
It easy to check that an optimal solution to the above system is given by:
A, =-1 ).,,2 = 2 Aj =-1
J.l1 =-2 )J.2 = 1 III = 1
<11 =-2 0'2 = 1 0'3 = 1.
Hence:
AU,} ,k) = -i+2j-k
Il(i ,j ,k) = -2i +j+k
aCt ,j ,j) = -2i+2j.




The automatic procedure for detennining !.he mapping of computations into the cells of a systolic
array is analogous to the one for the time funCtiOIl Again, we look for separate solutions to the different
modules in the algoriLhm subject [0 global consrraints. We consider a 2·D VLSI array of processing ele·
mems modelled by the pair [L 2AJ, where L 2 is the set oflabels (x.y) assigned to processing elements and
d is a matrix describing the imerconnection network between processing elements. Differem lntercon-
nection patterns may result in different classes of designs. In the following. we generate the optimal
design when.1. is chosen [0 be:
010
lOG -1 I
Ll corresponds to a nerwork with unidirectional links. as shown in fig. 1.
Let 5', 5", and S be the space functions for module 1, module 2. and statement AS, respectively.






, 5;1 S~'2 S;:' 521 Sll S,--,
In addition LO satisfy (3), [he coefficiems of S', S", and S must satisfy me constraims imposed by global
dependencies. Precisely, if a global dependence involves two variables belonging to different modules
which are computed at times c and c' wiili c-c'=d, men me dismnce of me cells where the two variables
will be mapped cannot be more than d. By distance we mean !:he lengtll of a pam consisting of intercon-
nection links between !:he two cells. From Al we have:
S· [i j (i+j)/2[' = S" [i j-l (i+j)/2 [' + d l ;
since :\.(i ,j ,U +j)/2) - cr(i,j -1,(i +j)/2) = 1 and, consequently, the two compmations must occur either
in the same cell or in adjacent cells. Similarly, from A2·A4 we obtain:
- 21 .
,
I' jl' + a,; a2 ED..;s Ii j i+l = S li+l j
S" 1i j j-I I' ~ S Ii j-l j-ll/+Gj; G3 e 6.;
S" Ii j U+J+l)/2 II = S' li+1 j (i+j+I)/2I t + a,,;
S Ii j j II = S Ii j i+ll f +Cis;
One solution to above system of equations is:
s~ =S~ =0; S~l = 1
for !he first recurrence and:
for the second recurrence. Thus
s'u ,j ,k) ~ S"U.j ,k) = SU ,j.j) = U.iJ.
The resulting design is identical IO me one first inrroduced in [9J. The corresponding systolic array and
the action of a cell at differem times is depicted in figure 1.
VI. A NEW DESIGN FOR DYNA,vlIC PROGRA,\1MlNG
As shown in [8], a new design for dynamic programming can be amomatically generated if we
choose a differem interconnection pattern for the network. This design uses fewer processing elements
than the one in [9]; precisely 3/8n 2 insteJd of n 2/2. Consider an array of processing elements whose








Cells in the am.y are connected by bidirectional horizomallinks as well as by diagonal and venicallinks,
as shown in fig. 2. An optimal design for dynamic programming is generated for such an array using
the same mapping procedure. Again we solve equations (4) subject to global constraints (10). We derive
for the first recurrence:
and for me second recurrence:
Thus we p.2ve:
s'U ,j.l:) ~ (k ,i) and S'U ,j ,k) = U+j-k ,i),
These transformations lead to me sysm!ic design of figure 2. The array consists of 3fSn1 cells. All cells
are identical. However, the action of a cell varies from time to time. It does computation relative to
module 1 or module 2 depending on the values of indeces i. j, and k. Also. the direction of dara streams
varies for me [wo modules. The transformation of data dependence vecrors D 1 imo commlUlication vec-
tors n is derived from S' . From:
c' a' b'
IS u' S 12' S 13, 0 0 -J -1 0 0
521 ' S12' S"
, 0 1 0 0 0 -J~
-J 0 0
we derive that variables c;'J,.! move to the left along the horizontal links. variables a/J.k do nO[ move
along the array but stay inside lhe cells. where they are updaled. Finally. variables bi'J.k move up. except
at the boundary, where lhey move along the diagonal (from (10». The direction of variables in module 2
is derived from the mapping S". Variables a/J.k move to the right along lhe horizom.a1links. variables
·23·
bj'J.k move up [Q the left along the diagonal links. Variables Cj'j.k move wiIh the same pattern as in the
other module. The action of a cell at each time is illustrated in figure 2.
VlI. CONCLUSION
We have suggested a methodology to assist a human designer imo the difficult task of designirlg a
VLSI algorirnm starting from a sequential specification of a problem. Among the many possible transfor-
mations which can make parallelism explicit in a program the method helps selecting a feasible and
optimal sequence of t::ralt';formations. The methodology appe3I'S useful for complex nonnumerical prob-
lems for which standard resrrucruring techniques seem inadequate. The class of obtainable designs is not
resrricted to designs having constaD[ data flow.
REFERENCES
[lJ M. Chen, "Synthesizing Systolic Designs ", in Proc. Int. Symp. on VLSI Technology, Sysretr'.$ and
Applications I Taipei-Taiwan. 1985.
[2] M. Chen, "The Generation of a Oass of Multipliers: A Symhesis Approach to me Design of Highly
Parallel Algorithms in VLSI", IEEE Inc. Can! on Compueer Design, 1985.
[3] P. R. Cappello, and K. SLCiglitz, "Unifying VLSI Array Design wim Geomeuic Transformations",
Int. Con! on Parallel Processing, pp. 448-457,1983.
(4] J. M. Delosme, and I. Ipsen, "An lliustr.:J.tion of a Methodology foe the Construction of Efficient
Systolic Aechitectures in VLSI", Proc. Second 1m. Sym. on VISI Tech. Syse. and Appl., pp. 268-23,
1985.
-24-
[5] L. Ford, and D. Fulkerson. Flows in Nenvorks, Princemfi Univ. Press. 1962.
[6J J. A. B. Fones. K. S. FUt and B. W. Wah, "Systematic Approaches [0 the Design of Algoritlunically
Specified Systolic Arrays", Tech. Rep., ElecIT. Eng.• Purdue Universi[y.
[7J J. A. B. Fones, and D. L Moldovan, "Parallelism Derection and Algoritlun Transformation Tech-
niques Useful for VLSI Archirecture Design" J. Parallel Disrrib. Compur., May 1985.
[8J C. Guerra, nA Uriifying Framework for Sysmlic Designs",AWOC Symposiwn, Greece, 1986.
[9J L. J. Guibas, H. T. Kung, and C. D. Thompson, "Direcr VLSI Implementation of Combinatorial
Algorithms ", froc. ojCaltech Conf on VLS/, 1979.
[10] L. Jolmsson, and D. Cohen. "A Mathematical Approach to Modelling Flow of Data and Cancrol in
Computational Nerworks", VLSI Systems and Compurarions, H.T. Kung, B. Sproull and G. S[~le
eds., Computer Science Press, pp. 213-225, 1981
[11] L. U. Jobnsson, U. Weiser, D Cohen, and A. L. Davis, "Towards a Formal Treatment of VLSI
Arrays", 2nd Caltech Conf. on VLSI, pp. 375-398, 1981.
[12] H. T. Kung, "Why Systolic .A.rchitecrures?", Computer, pp. 37-45,1982.
[13] H. T. Kung, and W. Lin, "An Algebra for \'l...SI Algorithm Design", Proc. Conlon Elliptic Prob-
lem Solvers, 1983.
[14] R. H. Kunh, "Transforming Algorithms for Single-Stage and VLSI Architectures", Workshop on
Imerconflecrion Nerworks for Parallel and Distribured Processing, 1980.
[15] O. H. Ibarra, S. M. Kim, and M. A. Palis, "Designing Systolic Algorithms Using Sequential
Machines", IEEE Trans. on Camp., C-35, pp. 531-542,1986.
[16] M. Lam, and J. Mostow, "A Transformational Model ofVLSI Systolic Design", Compucer, pp. 42-
52.1985.
[17] C. Leiserson, and J. Saxe, "Optimizing Synchronous Systems", Proc. 22nd Annual Sym. Founda-
tions of Compucer Science, 1981.
- 25-
[18] G. Li. and B. Wah, "The design of Oplimal SyStolic Amlys",lEEE Trans. on Computers, C-34, pp.
66-77, 1985.
[l9] R. Melhem. and C. Guerra, "The Application of a Sequence Notation to the Design of Systolic
Computations", Teelm. Rep. 568, Depl Compo SC. t Purdue Universi[)'.
[20] W. L. Miranker. and A. Winkler, "Spacetime Representations of Computational Strucrures", Com~
pucing, vol. 32. 1984.
[21] D. Moldovan. "On the JuJ.alysis and Synthesis of VLSI Algorithms". IEEE Trans. on Computers,
C-3!, pp. 1121,1126. 1982.
[22] D. Moldovan, "On the Design Df J>Jgorithms for VLSI Systolic Arra)'s". Proc. IEEE, vol. 71, pp.
113-120. Jan 1983.
[23] P. Quinton, "Automatic Synthesis of Sysmlic Arrays from Unifonn Recurrent Equations ", Proc.
ll-rh Annual Symp. on Computer Archirecrure. pp. 208-214, 1984.
[24] LV. Ramakrishnan, D. S. Fussell, and A. Silberschatz, "Mapping Homegeneous Graphs on Linear




12 13 14 15 1 6
I
23 24 25 26
J I





for k < r(i+j)121 for k > r(i+j- 1)121
Fig. 1. A systolic array for dynamic programming
1b; i k
"








Fig. 2. A new design for dynamic programming
Fig. I. A systolic array for dynamic programming
Fig. 2. A new design for dynamic programming
Index terms - VLSI algorirnms, VLSI architectures. sYStolic algorithms. algorilhm transformations.
synthesis. mapping.
,
,
!
.
