A systematic methodology to synthesize systolic designs is described and used to derive a new design for dynamic programming. This latter design uses fewer processing elements than previously considered ones. The synthesis method consists of two pans: 1) deriving from the high-level problem specification a form more suitable to VLSI implementation; 2) mapping the new specification into physical hardware. The method also provides a Wlifying framewOIK for existing systolic algorithms.
I. INTRODUCTION
In recent years, a large number ofVLSI algorithms have been designed for such problems as matrix multiplication, linear systems, convolution, etc. The mapping of an algorithm into an array of processing elements has been usually done in an ad-hoc manner. TItis is a difficult and error-prone task, especially for complex nonnwnerical problems. A synthesis procedure can save a considerable amount of design effort and, at the same time, guarantee the correctness of the result Recently, there have been various attempts to fonnalize the procedure for mapping algorithms into physical hardware. Their appraisal is in [2] .
Among all the existing approaches, the transfonnational approach based on data dependencies proved to be very powerful [I, 7-13] . However, such an approach can be applied only to a restricted class of algorithms, namely, algorithms with highly regular communication patterns. Such algorithms are generally expressed by a uniform recurrence relation, or a nested loop with constant data dependencies. Many algorithms involve nonuniform recurrences or many nested loops with additional statements between the loops. These algorithms are sometimes implemented on VLSI networks with variable·speed data flow. An example is dynamic programming as applied to the optimal parenthesization problem.
In [5] , a systolic algorithm for dynamic programming was proposed which can be cast in a triangular array of processing elements. The systolic design is quite complex: the data flow is non constant throughout the array and the action of a processing element varies at different clock cycles. Anempts have been made to synthesize non-unifonn designs, concentrating on various aspects of the design process [1, 4, 9] . The crucial Step towards the automatization of the design is in the abili[)l to rewrite the algorithm in a form better suited to VLSI implementation. The synthesis procedure described here anemprs to solve this problem. It operates~th at the algoritlunic level and implementation level. Thus, it consists of two pans: 1) transforming the given problem array. The high-level problem specification is rewritten as a system of muruall)' dependent recurrence relations. This is accomplished by a two--step procedure. Thc procedure first detennines a coarse timing function for me computations in the high~level specification and then uses it to identify chains of linearly dcpendent computations. Each such chain can be convened into a recurrence with constant data dependencies. Statcmcnts between recurrences are introduced to correlate variables in distinct recurrences. Next the mapping of the ncw form into VLSI hardware is obtained by applying linear time and space transformations 10 the individual recurrences, subject to global COnstrainlS. Such transformations are again based on data dependencies.
Using this procedure we transform the dynamic programming algorilhm into a system of twO dependent recurrence relations. In addition to rederiving the design presented in [5] , we are able to automatically generate another design which obtains a better utilization of the processing elements and therefore requires fewer cells in the array; precisely, 3/8n 2 instead of n 2 /2. This paper is organized as follows. In section I the synthesis method is illustrated for the derivation of a systolic algorithm for dynamic programming. Next, time and space transformations to map the new fonn into hardware are described in sections 2. The new design for dynamic programming is presented in section 3. It is also shown that the application of the proposed methodology to obtain other syslOlic designs such as convolution and palindrome recognition exploits interesting similarities among such problems.
I. DERNATION OF A SYSTOLIC ALGORITHM FOR DYNAMIC PROGRAMMING
The dynamic programming can be expressed by a recurrence of the form: 
Our aim is to rransform. expression (1) into a ne~form which consists of possibly many recurrences each characterized by constant data dependencies; non-constant data dependencies may only occur at the bOWldaries between the recurrences. For each such recurrence it will be possible to determine a linear time-space rransforrnation into a systolic array, by applying the transformational method described in [11, 13] . We briefly review the tranformational method for Wliform recurrenceS.
Mapping a uniform recurrencc into a VLSI array
Consider a recurrence with index set I" defined by:
wherc f is a given function and I is a vector of 1". The We consider the following simplified model of a VLSI array.__~ach processor of the array is assigned a labelleL,,-Ie Zn-I . The connection pattern of the array is described by the matrix 8=[Oj,~....
• os] which specifies the links among the processors. Precisely, OJ is the difference vector of the integer labels of adjacent cells in the network.
A linear time-space transformation of the indexed computations intO the VLSI array is defined by:
.
where T is a mapping form J" ---7 Z and S is a mapping from 1" ---7 L II-I. The time function T transforms D into TD. Thus. to ensure a correct execution ordering, T must satisfy the following condition:
System (2) may have no solution or several solutions. In this laner case, the one which minimizes the total execution time (defined as the difference between the maximum and minimum value of T ) is chosen.
The space mapp'ing S is a mapping from the set of computations to the set of processing elements S :
such that for each 7,J E 1"
i.e. concurrent computations cannot be mapped into the same processor. Detennining a solution for S which satisfies constraint (3) is equivalent to solve diophantine equations:
for which the matrix r:Ji S non-singular. K is an integer matrix with positive elements. The equations may have no solution or several solutions. If no feasible solution is found, the design procedure is repeated by staning with a different timing function or else a different interconnection nelWorX. If several solutions are possible, the one which is optimal according to some criterion is chosen.
A IWo-:;rep procedure/or dynmnic programming
To adapt expression (1) to systolic implementation, we first eliminate broadcasting by 1) adding missing indeces to variables on both sides of the expression, 2) renaming variables, 3) introducing new variables. How. ever, as is well known, there are many ways to perfonn such transformations some of which may not lead La any feasible systolic design. The selection of a good transformation is crucial to the entire design process; for complex problems such as dynamic programming this is not a straightforward task. In order to add the index k to variable CiJ on the left side of (1), we need to specify an appropriate ordering for the computations CjJ.i for given i, j and for i < k < j, in such a way to introduce as much parallelism as possible. If we choose, say, the lexicographical ordering relative to index k, we cannot overlap computations of CiJ,k. for different values of Our strategy to select the appropriate transfonnaLion consistS of identifying among the computations The least integer values that satisfy the above equations are:
Thus. the optimal time transfonnation is:
The function T will guide the search for a schedule of computations indexed by J3 according to the availability of the variables ci-.*-and Ckj on the right side of (1). Because of the monotonociry of data dependencies in D(ij) ,it
According to function T, we inrroduce a partial ordering >T in J' defined by:
Notice that the minimal elements with respect to >T are:
A partial ordering produces a decomposition of the set into chains of linearly ordered computations. Among all the possible chain decompositions of J 3 , we select the one in which the index vectors of a chain are also soned (either in increasing or decreasing order) according to the third index k.
To obtain such a decomposition we repeatedly find minimal elements after removing the previous minimal elements from the ordered set For the set J3 we obtain a decomposition in twO chains (here we only write the 
. ,j-I;
Wc arc now ablc to reSlllIcture (1) imo a system of [wo recurrences or modules, each corresponding 10 a chain" The execution ordering of compulations in each recurrence is specified according 10 the ordering in the chain. Thus, the fiCS[ recurrence is a forward recurrence where the index k varies from (i+j)f2 [0 i+l (or from (i+)-I)f2 to i +1 if i+) is odd); and the second is a backward recurrence where k varies from (i+))f2 to )-1 (or from (i +) + l)f2 to ) -1 if i +) is odd) " The two recurrences have different sets of variables; boundary condilions relate variables in the twD recurrences. Now equations (1) can be convened into the fDllowing form. The two-step mapping procedure can be generalized 10 from which we obtain the system of equations:
O"a > 0 foraeD z ; Global dependencies specified by AI-A51ead to the additional equations:
aU, j. j-I)]
I[ easy to check that an optimal solution to the above system. i.e. one which minimizes the execution lime, is given by:
Hence, we obtain the timing functions:
Spacefuncrion
The automatic procedure for determining the mapping of computations into the cells of a systolic array is analogous to the one for the timing function. Again, we look for separate solutions to the different modules in the algorithm subject to global constraints. We consider a 2-D amy of processing elements modelled by the pair [L 2,.6.], where L 2 is the set of labels (x;y) assigned to processing elements and ,f!. is a matrix describing the interconnection network between processing elements. Different interconnection patterns may result in different classes of designs. In the following, we generare the optimal design when !1 is chosen to be: One solution [Q above system of equations is:
S;l =S;3 =0; S;2 = I S~=S~=0; S;I = I for the first recurrence and:
S;'I =S;3 =0; S;2 = I for the second recurrence. Thus
The resulting design is identical to the one first inlroduced in [5] . The corresponding systolic array and the action of a cell at different times is depicted in figure 1.
A NEW DESIGN FOR DYNAMIC PROGRAMMING
Consider an array of processing elements whose communication pattern is described by:
Cells in the array are connected by bidirectional horizontal links as well as by diagonal and vertical links, as shown in fig. 2 . An optimal design for dynamic programming is generated for such an array using the same mapping procedure. Again we solve equations (4) subject to global conslraints. We derive:
S;1 =S;2 =O;S~3 =1 S~=S~=0;S;1 =1
for the first recurrence and: because of its similarity with !.he design of a sYSlolic palindrome recognizer [6) . A linear array of processing elements with [Wo-way pipelining was described in [6] for this laner problem. The alT3Y deLerrnines for each char. actcr in a suing whelhcr the suing inpUl up to that character is a palindrome. for some function! .
The data flow -of input variables S and output variables C along the linear array is the same as the data flow --of variables C and a. resPectively. along any row of the two-dimensional array of fig.2 for dynamic programming. Indeed, both problems have similar data dependence patterns. . . . as j[ can be easily observed if we rewrite recurrence (I) in the following form: (6) In (6) index j in (1) has been replaced by the index l=j-i. From the above expression it is apparent that, as for the palindrome recognizer, variables with indeces I and l-k, 1<J:. <I, have to collide in the same cell at the same time.
It is also easy to show that the automatic mapping procedure described here transforms the palindrome recognition problem into a form similar to the one for convolution and therefore applies the same time-space tranSformations to both.
