In this paper, we describe a methodology for mapping normal linear recurrence equations onto a spectrum of systolic architectures. First, we provide a method to map a system of directed recurrence equations, a subclass of linear recurrence equations, onto a very general architecture referred to as basic systolic architecture and establish correctness of the implementation. We also show how e cient transformations/implementations of programs for di erent systolic architectures can be obtained through transformations such as projections and translations. Next, we show that the method can be applied for the class of normal linear recurrence equations using the method for the class of directed recurrence equations. Finally, we provide a completely automated procedure called cubization to achieve better performance while mapping such equations. The method is illustrated with examples and a comparative evaluation is made with other works.
1. Introduction. Designing systolic arrays has been studied extensively in the past. Most of the early systolic arrays were designed in an ad-hoc manner. Signi cant contributions 1, 3, 6, 7, 17, 19, 22, 23, 24, 25] were made on unifying techniques and theories for designing systolic arrays. In most cases, the initial speci cation was expressed as a recurrence equation. However, a majority of the works cited above have worked with uniform recurrence equations (URE), a subclass of linear recurrence equations (LRE). For these recurrences, the dependency vectors of points are constant vectors in Z n and the dependency graph is a lattice in Z n . The solution to the problem of designing systolic arrays is then obtained by using elementary geometric transformations on this lattice. However, the class of problems that can be expressed in terms of UREs is restricted and expressing most problems using URE is quite tedious.
The problem of mapping LREs onto systolic architectures has been addressed in the past in 23, 25] . However, both the methods depend heavily on nding an a ne timing function. The problem with nding an a ne timing function is that one needs to solve a system of linear recurrence equations 7, 23, 25] . For some linear recurrence equations (In nite Impulse Response ltering, Gauss-Jordan diagonalization), the standard linear programming techniques will show that there is no feasible solution (though the functions are explicit 7, 23] ). However, Jagadish et al. 7] , Quinton and Dongen 23] arrive at a valid schedule for the above examples by studying the constraint matrix and altering the problem or by ad-hoc methods. But, altering the problem to arrive at a valid schedule is not automatic and involves intuitive tactics. Such an approach is not possible in general. In contrast, our approach is capable of mapping LREs onto systolic architectures in an automatic fashion, even for which nding appropriate timing function might fail in other approaches.
In this paper, we develop a simple method for deriving systolic programs for a large class of linear recurrence equations, referred to as directed recurrence equations (DRE) 9], which properly includes the class of uniform recurrence equations. The starting point for our method is a graphical abstraction of the DRE. Using a set of rules that can handle multi-step algorithms 23] (cf. rule 3), we transform the dependency graph of the recurrence equations to a graph called modi ed dependency graph (MDG) where the level of each node yields the time steps for the actions. In other words, this transformation yields the timing function. In a sense, this is a generalization of the free schedule rst proposed by Karp et al. 8] . The advantage of such an approach is that one need not solve a system of linear inequalities for nding appropriate timing functions 22] . Using the MDG, we design a systolic program for a very general architecture referred to as basic systolic architecture (BSA). Using two procedures, folding and staggering, we eliminate non-local communications by transforming the programs in each processor of the BSA. We then map the BSA onto a given systolic architecture using a set of semantics-preserving transformations. Our mapping considers mapping of channels and timing functions unlike the allocation function of 22] . As a result, we can transform the program in each processor of the BSA to an equivalent program for the corresponding processor in the given systolic architecture. The transformations are semantics-preserving and hence, the correctness of the solution comes for free. We then go on to extend our method to handle the class of normalised linear recurrence equations 10]. A rst attempt would be to express a system of normal linear recurrence equations in terms of an equivalent set of directed recurrences. The method previously mentioned would then be su cient to handle them. Unfortunately, the transformation from LREs to DREs introduces extraneous computation points that unnecessarily enlarge the domain size and makes the implementation in-2 e cient. We circumvent the problem by introducing a procedure, cubization, which eliminates the need to use the intermediate DRE representation, thereby optimising the implementation. We also show that the out-degree of every node in the MDG that results from the application of the cubization procedure is bounded above by a constant (independent of domain size). The rest of the paper is organized as follows. In section 2, we introduce some notations and de nitions. The method of obtaining MDGs from DGs is explained in section 3. In section 4, we give a method to derive MDGs from LREs. In section 5, we give a procedure called cubization to design MDG in an e cient manner. In section 6, we give a method to construct a basic systolic architecture. In section 7, we give a set of transformations to map the BSA onto other systolic architectures and an outline of the implementation is given in subsection 6.2 (pseudo code of the implementation is given in the appendix). In section 8, we give a comparison of our method with other methods. This section is followed by conclusions in section 9.
2. De nitions and Notations. In this section, we provide the background de nitions and notations. where (a) A ik is a constant n i n k matrix over Z, (b) b ik is a constant n i 1 matrix over Z, 4. a k (p) does not appear on the r.h.s of equation (2.1) and 5. f k is a function from R s to R. Depending on the function ik , the linear recurrence equation (2.1) can be further classi ed into di erent subclasses. Identi cation of these subclasses has been of great help in the development of the design procedure. The major subclasses of (2.1) are normal recurrence equations 23], and uniform recurrence equations 8].
A linear recurrence equation is called normal if all the transformations ik , for 1 i s, are from L n to L n for a xed n. This in turn would imply that A ik is a square matrix of order n. Such a constraint guarantees that all the nodes in the dependency graph have the same dimension and introduces a weak regularity in the dependency graph. Remark 3.5. In a DG all the nodes are distinct and all the outgoing edges from a node are labeled with the same label.
3.2. Deriving a timing function. For deriving the timing function, we should arrive at a deterministic mapping which yields the operations that can be evaluated at any given time. We arrive at such a function by identifying operations that could be computed concurrently without any interference with one another. For instance, the labels on the edges indicate shared information. As such, two children of the same node with the same label will not be computed at the same time; i.e. they will be ordered such that one of them is computed before the other. Such a condition guarantees that one does not require global memory. On the other hand, to ensure that parallelism is not precluded, we constrain the ordering to those children that belong to the same child-set (cf. De nition 3.7). In the following, we de ne transformations on the DG that lead to a timing function. The transformations are de ned after two auxiliary relations on DG are de ned. Notation:
1. Let lat be the lexicographic ordering de ned on Z n .
2. Let child b p((a i ; p); (a j ; q)) i (a j ; q) 2 child b (a i ; p).
3. Let level(a i ; p) be the maximal depth of (a i ; p) in the DG. Definition 3.7. (child-set) The child-set of (a i ; p) under transformation ij , denoted by child-i j (a i ; p), is the set of all (a j ; q)s, p = A ij q + b ij that depend directly on (a i ; p). i.e. child-i j (a i ; p) = f(a j ; q) j child b p((a i ; p), (a j ; q)) true^p = A ij q + b ij g, where b = (a i ; p). We also de ne child-level(l) = fchild-i j (a i ; p) j level(a i ; p) = lg.
The timing function is obtained by transforming the DG into another graph referred to as the modi ed dependency graph (MDG). In the following, we rst describe an algorithm for transforming a DG into MDG and then show how a timing function can be derived.
3.2.1. Algorithm to Transform DG into MDG. In this subsection, we describe the steps involved in transforming a DG into an MDG.
Steps of the Transformation; I. Preprocessing Steps 1. All children of a given node are partitioned into classes of child-sets (cf.
De nition 3.7 ). 2. The edges leading into classes of child-sets are relabeled such that all edges leading to child nodes in the same partition have identical labels, while those leading to child nodes in di erent partitions have distinct labels.
II. Iterative Rewrite Rules
We express the procedure in terms of rewrite rules of the form ; with the interpretation that the pattern in is replaced by the pattern in .
The procedure consists of applying the following four rules repeatedly till no rule can be applied further. The rule is depicted by rule 4 in Figure ( 2).
Note: Rule 4 introduces extra nodes referred to as dummy nodes.
Remark 3.8. After application of rules 1-4 on a given child-set we obtain a path labeled < variableName; sourceIndex > :transformation. We relabel the edge between the parent and the smallest child by < variableName; sourceIndex > and keep the label for the rest of the path same. This is done only when the child-set is not a singleton. Proposition 3.9. The transformations de ned by rule 1-4 are terminating.
Proof: Proof follows from the following observations: 1. Application of rule 1 -3 reduces the out-degree of a node at the lowest level with multiple children with the same label by one and every node in the DG has a nite number of children. 2. Each application of rule 4 introduces one node if the level di erence between the parent and child is more than one and after applying, the level di erence reduces by one. 2 We note that modi ed dependency graph of any DG is well de ned by Proposition (3.9). The MDG corresponding to the DG of Example (3.4) is shown in Figure (4 Proof: From the rewrite rules 1-3, we see that whenever an edge between two nodes is deleted, a path from the parent node to the child node is created. Let us consider rule 4 which introduces new nodes. Again, the deleted edge is replaced by a path.
More explicitly, let MDG t be the graph obtained from DG after applying the rewrite rules t times. Let (a j ; q) depend directly on (a i ; p) in MDG t . Now, in MDG t+1 , either the dependency is maintained or there could be a node, say (d aj ; q) between (a i ; p) and (a j ; q) and directed edges from (a i ; p) to (d aj ; q) and from (d aj ; q) to (a j ; q). Thus, it immediately follows that the dependency between (a j ; q) and (a i ; p) is maintained. 2. Proof: Proof follows from the following facts : 1. The DG consists of a set of total order chains (De nition (3.2)) and 2. The preprocessing steps of transformations guarantee that the children of a given node are partitioned according to the labels. Application of rules 1-3 orders all the children of a given node that have the same label and the order is total. Hence, all nodes in MDG can only have children with distinct labels. 2.
Timing Function. The basic idea of systolization or parallelization lies
in showing that the actions performed simultaneously do not interfere with one another. To make arguments clear, we de ne the notion of interference-freedom as exists in the programming methodology literature without going into the speci cs of the structure of assertions. In this paper, we consider a more restricted de nition of interference-freedom, as we do not even allow sharing of variables. For the sake of convenience, we use interference-free and independent interchangeably. Lemma 3.18 . The order (MDG; ) guarantees interference-freedom. Proof: Any two nodes that share a variable in a DG will occur in a chain in the MDG. Now, each chain is a distinct total order from Proposition (3.15). Therefore it follows that (MDG; ) is interference-free. 2 1. For every node in the DG, there exists at least one node in the MDG. This is because the transformation steps never delete a node. 2. For every directed edge in the DG, there exists an equivalent directed edge or path in the MDG. This follows from the de nition of the rewrite rules 1-3, which simply reroute data between two nodes and rule 4, which at most changes an edge into a path. 3. All the arguments needed for a computation in any given node are available at the time of its computation. Rule 4 ensures that a node appears at the same depth in all the chains of the MDG that contain the node. The above facts combined with Proposition (3.12) imply the result. 2. We now de ne the timing function formally: Informally, the timing function associates a computation f k (: : :), with a time at which it can be evaluated. It also guarantees that the arguments of f k (: : :) have already been computed. The following proposition de nes the timing function for the MDG. 4. Obtaining MDGs from LREs. In this section, we extend our approach to the class of LRE where the dependency vector of a point is a linear function of the point. First, we show that every linear recurrence equation can be expressed as a system of directed recurrence equations (from where our method for DREs is applicable). In the next section, we provide a procedure called cubization to achieve better performance while mapping such equations into MDGs.
Directed Recurrence Equations as a Basic Tool
In the following, we show that every normalized linear recurrence can be expressed as a system of directed recurrence equations. 9
Lemma 4.1. Every normalised linear recurrence equation can be expressed as a system of directed recurrence equations.
Proof: Consider the following system of normal linear recurrence equations. ) = a i ( ik (p)) (4.6) Repeating the same procedure over all transformations ik , 1 < i; k < s in equation (4.2), we obtain a system of directed recurrence equations. This is because, at each stage the dependency vector will be either a constant vector or a vector with only one non-zero component. Therefore, the transformation will either be non-singular or have a one-dimensional kernel. 2 Note: It must however be noted that using the above approach we expand the domain and may introduce unnecessary nodes. A more e cient procedure called cubization, to map any LRE onto a systolic array is described in the next section.
5. Procedure for designing MDGs . The linearization procedure (re-write rules), when applied to LREs directly, would severely limit the parallelism in the implementation. Alternatively, we could rst convert LREs into a systems of equivalent DREs. But, in transforming an LRE into a DRE, we would in general introduce extraneous computation points solely for the purpose of re-routing, thereby increasing the domain size. Instead, we would like to allow existing computation points 10
to make copies of the data, where necessary, for re-routing along distinct channels. This in turn would let the implementation exploit the inherent parallelism in the problem to a greater extent and would not require the introduction of too many additional computation points. In the following we describe a procedure called cubization which implements the above strategy. First, we describe the preprocessing steps needed for the application of the cubization procedure.
Steps of the transformation:
I. Preprocessing Steps end Procedure cubization operates on nodes of the DG and their child-sets. It calls two procedures split, to redirect the data along distinct chains, and connect, to connect the node to source these chains. The depth of the DG is re-evaluated at each iteration of for loop in the procedure cubization. The for loop terminates when all parent nodes in the DG have been processed. The procedure call rule-4 applies rule 4 (see Section (3.2.1)-rewrite rules) on the DG ensuring that a node appears at the same level in all the chains that it appears.
procedure SPLIT(i; cluster; (a j ; p)) begin universe=PARTITION(cluster; i; (a j ; p));
i-set = ;; foreach partition in universe do begin if (size(partition) > 1) then i-set = SPLIT(i + 1; partition; (a j ; p)) i-set; else i-set = partition i-set; 11 return(REWRITE(i-set; (a j ; p); ++l) ) end end Procedure split which calls procedure partition, divides the child-set into clusters based on the i th index. This division goes on recursively till the partition contains only one element. The variable i-set at recursion level n contains one element per partition at the same recursion level. Using procedure rewrite the elements of i-set are ordered. Procedure split returns the smallest child after application of the rewrite procedure. procedure PARTITION The procedure rewrite orders these partitions (using rewrite rules) and connects them with an edge label < nodename > :w l , where l is updated with each recursive call of procedure split (hence, all chains will have distinct labels).
procedure CONNECT((a i ; q); (a j ; p)) begin if (A jk (p ? q) = 0) then RELABEL-CHAIN((a i ; q); (a j ; p)) else RELABEL((a i ; q); (a j ; p)) end Procedure connect is used to check whether the node (a i ; q) has to feed the data directly to each chain originating from it (i.e. when the point q belongs to the null space of the transformation A jk ) or to send it to the smallest child to be forwarded to the chains. (a) The procedure partition terminates. Procedure partition, partitions a cluster based on the i th index of p.
Since the elements of a cluster is nite and p 2 L n , the procedure terminates. (b) The procedure rewrite terminates.
The procedure rewrite orders i-set using rewrite rules. As in Proposition (3.9), at each level the application of rewrite rules terminates. (c) The depth of recursion is bounded.
The maximum depth of recursion is bounded by n, the dimension of L n . 3. The procedure connect terminates.
(a) relabel-chain terminates.
The operation performed by procedure relabel-chain is bounded by the out-degree of the smallest child which is bounded by the dimension of L n . (b) relabel terminates. This is trivial as there is only one operation to be performed. From (a) and (b), it follows that connect terminates. 4. The procedure terminates.
This follows because, each application of rule 4 introduces one node and the level di erence reduces by one. From the above observations (1)-(4), we see that the cubization procedure terminates. 2 Lemma 5.2. The cubization procedure preserves the dependency relation.
Proof: The cubization procedure involves calling the procedures split, connect and rule-4. The procedures which change the ordering in the DG are procedure rewrite, which is called by procedure split and procedure rule-4. The procedure rewrite de nes ordering among the elements of child-set using rules 1-3. The procedure rule-4, using rule 4, introduces dummy nodes. From Proposition (3.12), the application of the rewrite rules 1-4 preserves the dependency relation. Hence, the lemma follows. 2 Proposition 5.3. The MDG obtained from application of the cubization procedure on DG, is a set of distinct total order chains.
Proof: The proof follows from the following observations:
1. The DG consists of a set of total order chains from De nition (3.2) and 13 2. The preprocessing steps of transformation and the application of the cubization procedure guarantee that all the children of a given node that have same labels get partitioned into child-sets, where each child-set has a distinct label ( cf. Preprocessing Steps 1-2). Next, the procedure rewrite orders the partitions (using rewrite rules) and connects them with an edge labeled < nodename > :w l , where l is updated with each recursive call of procedure split. Hence all chains will have distinct labels. 2 The MDG obtained from application of cubization satis es properties similar to the MDG in the case of DRE. So, we just write the analogous results in this case without giving any proof. Proof: Every node will at worst be part of s (cf. equation (2.1)) di erent transformations ik (or A ik ) and the cubization procedure will limit the out-degree of a node for each transformation to at most n, thereby the out-degree of a node is bounded above by n s. 2 Lemma 5.9 . The cubization procedure is scalable with respect to domain size.
Proof: The cubization procedure recursively divides the child-set of (a i ; q) into clusters, de nes an ordering among these clusters and connects them by labeled edges (using procedures split, partition, rewrite and connect). If the domain size is increased, then as the ordering de ned by the cubization procedure is based on the recurrence equation and the index of the points p; q 2 L n , the already existing ordering will not be disturbed. Figure (5) gives the dependency graph for Gauss-Jordan diagonalization for N = 3. The cubization procedure is applied on it. The procedure iterates over every node and its child-set for each level in the DG starting from level 0. For each node selected, it reroutes the data from the parent node to elements of the child-set using the procedures split and connect. Finally, it uses procedure rule-4 to arrive at a consistent timing function for each node in the MDG. The procedure split calls the procedure partition to partition the child-set recursively into clusters based on the index and orders these partitions using procedure rewrite. Procedure connect connects the smallest child returned by the top-level invocations of procedure split to the current node. In the next section, we provide a method for constructing a basic systolic architecture (BSA) using the MDG and a procedure to map the BSA obtained onto speci c target architectures.
The problem of Gauss-Jordan diagonalization has been addressed by Quinton and Dongen in 23]. They presented an ad-hoc method for solving the problem because the problem is a multi-step algorithm and their approach does not generate a timing function for it. However, using our approach, we are able to obtain a timing function and map the problem onto systolic architectures without having to resort to problem speci c processing. 6 . Basic systolic architecture. Having obtained the computation dependencies, our next task is to map this dependency abstraction onto speci c target architectures. As a rst step towards mapping the computation points obtained onto a speci c target architecture, we use a very general intermediate host architecture called the basic systolic architecture (BSA). The BSA consists of simple and primitive processor nodes mapped onto a subset of the integer lattice with each 15 node having nearest neighbor connectivity. The use of the BSA a ords us a degree of architecture-independence while generating a solution. Thus, to obtain a BSA from an MDG, we have to embed all the chains of the MDG in Z k for some k. In the embedded structure, all the chains of the MDG can be identi ed by traversing the paths between lattice points. The process requires identifying the sub-sequences of chains that can be overlaid on other chains. The process of embedding an MDG in Z k is described in the sequel. It is based on classifying the chains into parallel and coplanar chains. 
where diff-level(i; j) = abs(level(i) ? level(j)) then U is parallel to W and V is parallel to X. 2. If U is parallel to V and V is parallel to W, then U is parallel to W. After classifying the chains into coplanar and parallel classes, we embed all the chains in MDG in R k for some k and then apply procedure folding to get an embedding in Z k .
Method to Embed MDG in R k
Step 1: Obtain the coplanar and parallel classes of chains of the given MDG. Let us denote the respective partitions by C = fC 1 ; ; C r g and P = fP 1 ; ; P s g.
Step 2: Embedding the planes: a) Choosing the initial plane. In this case, we embed all the chains of C in the planes connecting these common chains, noting that more than one plane may be necessary for this purpose; we still insist that minimum number of planes are used. We also note that a chain may have to be folded in two or more planes for this embedding.
2. planes embedded := planes embedded fCg. planes not embedded := planes not embedded -fCg.
Choose another plane, say C k , such that one of the chains of C k also belongs to some plane in planes embedded and C k is the plane with the least num-of-dummies() among such planes. Let C := C k .
end until planes not embedded = ;
For the cases (c) and (d) above, for each plane embedded, we choose a i and b i to be the common chains themselves if the common chains are not parallel; if they are parallel, we choose the one with least number of dummies as one axis a i and the chain in C with minimum number of dummies that intersects a i as the other axis b i .
Step 3: Embedding chains in each plane. 2. Layout all chains in the same parallel class as a i and b i parallel to z i1 and z i2 respectively. 3. Other chains in C m are chains that can be expressed in terms of the basis chains. The following two steps are performed for embedding the nonparallel (or inclined) chains: folding: The inclined chains are folded in the plane corresponding to C m by expressing the chain in terms of the vectors corresponding to the basis vectors. The procedure is de ned by the procedure folding given below. staggering: The folding of the non-parallel chains leads to multiple edges between lattice nodes. Hence, the timing function has to be re ned so that proper data is sent to the proper node at the proper time. The procedure is de ned in procedure staggering given below.
end until all planes in C have been considered.
The procedures for folding and staggering are described below.
Procedure for folding chains
The purpose of folding chains along the coordinate axes chosen is to see that links need only be made to adjacent nodes. The procedure ensures that any nonlocal communication that the algorithm required is re-routed through alternate paths and also the MDG gets embedded in Z k .
Steps of procedure folding:
The outline of the steps is given below:
1. We have the following two cases: 4. Now select any two real nodes on this chain and compute the di erence between their "x-intercepts", dx i . Similarly compute the di erence of the "y-intercepts", dy i . 5. Then (D min + 1) dy i =(dy i + dx i ) gives the length the chain will have to traverse along the "y-axis" before being folded along the "x-axis" and 18
(D min +1) dx i =(dx i +dy i ) gives the length the chain will have to traverse along the "x-axis". This pattern is repeated n-times. 6. Repeat 2 -5 over all planes in 1(a) and then do the same for planes in 1(b). For planes in 1(b), the axes themselves are folded and we fold all the other chains also accordingly by 2-5. We continue 1-5 till no more chain needs to be folded. At the end of the application of procedure folding, each chain is folded along the co-ordinate axes. Hence, after application of the procedure folding, we obtain a subset of Z k . Proposition 6.3. Using the procedure folding it is always possible to fold an inclined chain along the basis chains.
Proof: Follows from the fact that the Manhattan path travelled to reach a node from another node is the di erence of the levels between the nodes and that a node appears at the same level in all chains if it appears at all in more than one chain (by rule 4). 2 In the process of folding a chain along the basis chains, there could be more than one datum to be propagated to its neighbors. Hence, there is a need to modify the timing function which was so far considered as the level number of a node in the chain. This stems from the fact that only one datum can be sent over a link in one time step.
Steps of procedure staggering Obviously a node must wait for all its input data to arrive before it can commence computation. This constraint ensures that each node executes only as many time units, after its parent node executes, as the number of data values it receives from it. Formalizing this observation, we de ne the modi ed timing function h, for a node q by induction on level(q):
where m-level(q)= max(number of variables being staggered at level q).
Note that the max value over a level is being used to keep the processors in lock step. Associate the timing function thus obtained to each real node. Each node will compute only at the time determined by the timing function and will transmit data available with it in the remaining cycles. When more than one variable is required to be sent on a given channel, the processor will choose to send them according to some predetermined order (lexicographic is an obvious choice). This makes sure that the recipient knows the order in which to expect variables over a channel. The program for each node in Z k onto which a node from MDG is mapped is given in the Appendix.
Proposition 6.5. It is always possible to embed the chains in Z k . Proof: Let us assume the contrary. i.e. it is not possible to embed the chains in Z k . In that case, while embedding the chains in the space one would come across the case shown in the following gure: .4)). Adjacency of the nodes in the MDG is preserved in Z k , as can be seen from the procedure of laying out the chains.
The laying out of chains, however, may violate the assumption of one datum per channel. This would therefore necessitate modi cation of the timing function. Procedure staggering ensures that all the input variables required for a computation are available at the required time at the right node. That is: level(q) ? level(parent(q)) + stagger(parent(q); q) h(q) ? h(parent(q)); where stagger(p; q) = the number of data staggered on the channel (p; q). This follows from the equation (6.1) and de nition of level.
Procedure staggering would involve the use of some local memory to store incoming data until required for computation. Hence, during procedure staggering no data will be lost. So, the laying out of chains is done in a consistent manner.
Hence, MDG is embedded in Z k . 2 The embedding of MDG in Z k is referred to as the BSA and we say that the MDG is embedded in the BSA.
Proof: It follows from Lemma (6.6) that all the chains in the MDG have been embedded in Z k and the embedding is order and data-ow preserving, as can be seen from our construction of the embedding. Hence, the Corollary follows. 2 Theorem 6.8. BSA exactly computes the given DRE/LRE. Proof: The proof follows from the following observations:
1. The MDG is embedded in the BSA. From Lemma (6.6). 2. The MDG is computationally equivalent to the DG.
From Proposition (3.19) / Proposition (5.6). 6.2. Implementation Overview. In the pseudo code, apart from the procedures that were described earlier, one can note the initialization of data structures essential for implementation purposes through the procedure initialize. We then rename some variables as explained earlier (recording the changes vide procedure rename-var). We then go on to transform the DG into a modi ed dependency graph (MDG) using the transformations de ned earlier to extract a timing function which essentially de nes the time at which computations on various processors can take place while still satisfying the partial order (MDG; ).
Each chain in the MDG corresponds to logical channels in the systolic implementation. We now proceed to map logical channels onto physical channels by building the basic systolic architecture (BSA). In this process, more than one logical channel might get mapped onto a physical channel, implementation details of which are taken care of in procedure merge-node-node.
Using the given speci cation and the timing function, we obtain a very general program, whose generic form is given in program sample-prog for a BSA from which one can obtain concrete programs for speci c architectures. The Figure (14) shows procedure dependencies for the pseudo code.
7. Mapping onto other systolic architectures. In an earlier section, we have described a method for arriving at a systolic architecture for a given DRE/LRE. In this section, we show how the BSA can be mapped onto a given standard systolic architecture. In a BSA, the number of processors is O(n m ) (m = number of axes), where each processor computes at most one assignment statement. By applying di erent mappings on the BSA, we map a number of assignment statements in the BSA to a single processor. As a result, we obtain programs for di erent systolic architectures.
In the following, we de ne mappings from a BSA to speci c architectures. As in the concrete architecture one uses the notion of channels for communication between processors, we use links (used in BSA) and channels of the given architecture interchangeably.
A mapping from a BSA to another network G 0 is a triple M= (e; d; g), where e; d and g are as de ned below:
e is a projection scheme. By a projection scheme we mean`projection along an axis' or`translation along an axis' de ned as follows:
1. Projection along an axis.
For any point q = (x 1 ; : : :; x n ) 2 BSA, we de ne the projection along the i-axis as follows:
e : (x 1 ; : : :; x n ) 7 ! (x 0 1 ; : : :; x 0 n?1 ) 3 (8 1 j < i; x 0 j = x j )^(8i j < n ; x 0 j = x j+1 ) where point (x 1 ; : : :; x n ) associates to processor l = (x 0 1 ; : : :; x 0 n?1 ) in G 0 .
Translation along an axis
For any point q = (x 1 ; : : :; x n ) 2 BSA, we de ne the translation along The timing function g maps the execution time associated with each computation in a BSA to the execution time associated with each computation in G 0 and can be computed as outlined below:
{ Each node will potentially be assigned multiple functions to be computed at di erent or identical times. The new timing function will be computed in essentially the same way that was employed for the BSA while applying the procedure of staggering. In other words, if one node was mapped onto another node, the timing function associated with the original node was incremented at least by one. That is, if one considers the start-time and end-time of passing the data across this link, it would take at least two time units (one for each data item). This is essentially the basis for computing the new timing function on the given systolic architecture. The steps for obtaining the new timing function, g, are described below:
1. Initially, we set start-time(1) = 1: 2. Let n p (t j ) denote the number of functions that were originally assigned time t j to a given node p. Then, end-time p (t j ) = start-time(t j ) + n p (t j ); where end-time p (t j ) denotes the new completion time of the computation at node p that was to have been completed originally at t j . 3. Let end-time(t j ) denote the time of completion of the computations that were originally scheduled to complete at t j over all the nodes. It is obtained by taking the maximum of end-time p (t j ) across all the nodes p. That is, end-time(t j ) = max(end-time p (t j ); p any node in BSA): 22
4. Now, if start-time(t j ) is the new starting time for those computations that were originally scheduled at t j at node p, then start-time(t j ) = end-time(t j ? 1) + max-staggering(pred(p));
where max-staggering(pred(p)) is the number of variables staggered between p and predecessors of p, p running over nodes where some computation was originally scheduled at time t j . Now the function g is de ned as a function of three parameters -time t j , a node p and computation elements a tj;i : g(t j ; p; a tj;p ) = k; where start-time(t j ) k end-time p (t j ) ; g(t j ; p; a tj;p ) = g(t j ; p; a 0 tj;p ) i a tj;p = a 0 tj;p :
If in G 0 , the input channel and the output channel are locally connected, the communication can be replaced by a simple memory assignment and need not be considered when computing max-staggering(pred(t j )).
The timing function so obtained should be used to update the data structures associated with each node for each individual computation that it is to carry out. The program associated to each node will resemble a collection of programs for a node in the BSA to be executed sequentially based on their respective execution times. 3. All arguments needed for a computation in any given node are available at the time of its computation.
The computation performed at time t in processors p 1 ; : : :; p j in the BSA will be performed after start-time(t) = end-time(t ? 1) + max-staggering(pred(p)) in p 2 G 0 . That is, a computation is scheduled only after a node has received all arguments from its predecessor. From the above, the Proposition follows. 2 Theorem 7.3. Given a system of LRE, there exists a map from the LRE onto a speci c target architecture that can compute the given LRE.
Proof: It follows from Theorem (6.8) and Proposition (7.2). 2 For a BSA in Z 3 , projection along z-axis gives us a mesh-connected network. Projection along z-axis followed by a projection along x-axis gives us a linear array, while translation along x-axis followed by translation along y-axis and projection along z-axis gives us a hexagonal network 4, 11] . . Since uniform recurrence equations can be mapped onto systolic arrays 22], to map linear recurrence equations onto systolic architectures, they rst transform the linear recurrence equation into a uniform one (uniformization). The drawback of the method is having to choose uniformization vectors in such a way that the resulting uniform system of equations have an a ne timing function. The choice of uniformization vectors is in uenced by several factors and further, to nd an a ne timing function, one needs to obtain an integral solution to a system of linear programming problems. In contrast, our method does not hinge on nding an a ne timing function. Consequently, the choice of uniformization vectors does not arise either. This makes automating our approach easier. Their approach does not address non-local communication and staggering (multiple data transfer on a channel). In addition, the problem of multi-step algorithms can be handled formally in our approach.
Rajopadhye et al., 24, 25] , present a technique for synthesizing systolic arrays from linear recurrence equations. They characterize a class of transformations to e ect data pipelining. Here too, an a ne timing function is a must, and multi-step algorithms are not addressed. The choice of a bad timing function could result in broadcasting, leading to poor systolic architectures. Our approach, not only handles data pipelining but also handles a system of linear recurrence equations.
In 2], Fort and Moldovan discuss whether and how data broadcasts in array processors with a given interconnection structure can be either eliminated or reduced by choosing an adequate linear schedule. Their constraints being strict, do not deal with broadcasts in their generality. Here again, the method requires the extraction of an explicit linear timing function.
Guerra and Melhem 5] have presented a method to derive systolic arrays given a speci cation in a canonical form. They apply a time-space index transformation on a coarse timing function obtained by analysing the dependence vectors. Their initial speci cation, though more general than UREs does not fully include LREs.
Our approach handles the class of normalised LREs.
Starting from a speci cation (program), Huang and Lengauer 6], derive a partial order on the operations in the program (traces). Using a step function, a ow function and a pattern function, they embed a program into a given systolic architecture. Their method is suitable for mapping speci c problems onto systolic designs. Our method on the other hand starts with a normalised LRE and derives the program and then the mapping onto a systolic architecture.
In 21], Prasanna Kumar and Yu-Chen Tsai present a technique to map the computation of a two dimensional systolic array into a one-dimensional systolic array. Their design technique consists of three parts -a) the activation of operation, b) the alignment of operand (using fast and slow channel) and c) the transportation of operands (using transportation channels and mechanisms to switch data). In contrast, we address the problem of mapping a problem expressed as a normalised LRE onto systolic arrays and are not restricted to xed dimensions. Further our solutions are intended to conform to existing systolic architectures.
Lengauer and Sanders 16], present a scheme to transform systolic programs with two dimensional structure to one dimensional ones. In comparison, our method allows us to derive a systolic program for a sub-class of linear recurrence equations and then transform it into one that can be embedded in many existing systolic architectures. 9 . Conclusion. We have proposed a general method for mapping a system of linear recurrence equations onto speci c systolic architectures. Our approach differs primarily from that of others in that we generate solutions suitable for existing target architectures as opposed to designing new ones for a given problem. This assumes importance from a practical point of view when systolic architectures are programmable. The method essentially consists of mapping normalised linear recurrence equations onto a generic architecture called the basic systolic architecture and then applying correctness preserving transformations to embed this intermediate solution onto speci c target architectures. The approach presented is scalable with respect to domain size, preserves correctness and lends itself well to automation. The paper also presents/studies two procedures, folding and staggering that are used to eliminate non-local communications and organize data transfer between communicating neighbors. In addition, we also handle multi-step algorithms as well as program transformations to arrive at equivalent programs for given architectures. We have used this method to map a few typical recurrences onto standard architectures successfully. To sum up, our method is general enough for arriving at systolic architectures from normal LREs and furthermore, the method is scalable. /*Initialises the data structures associated to each node of the dependency graph */ 
Illustration of Rule 4
It is this rule that introduces extra nodes called dummy nodes to ensure that nodes have the same depth on every path they are on. a341  a331  a311  a321  a241  a231  a221  a211  a141  a121  a111  a131   a342  a242  a232  a222  a142  a132  a122   a133  a233  a243  a333  a343  a143   a322 a332 a110  a120  a130  a140  a210  a220  a230  a240  a310  a320  a330  a340   a141  a111  a211  a221  a231  a241  a311  a321  a331  a341  a131 a121 
