The class of systems of uniform recurrence equations (UREs) is closed under uni-modular transformations. As a result, every systolic array described by a unimodular mapping can be specified by a system of space-time UREs, in which the time and space coordinates are made explicit. As non-unimodular mappings are frequently used in systolic designs, this paper presents a method that derives space-time equations for systolic arrays described by non-unimodular mappings. The space-time equations for non-unimodular mappings are known elsewhere as sparse UREs (SUREs) because the domains of their variables are sparse and their data dependences are uniform. Our method is compositional in that space-time SUREs can be further transformed by unimodular and non-unimodular mappings, allowing a straightforward implementation in systems like ALPHA. Specifying a systolic design by space-time equations has two advantages. First, the space-time equations exhibit all useful properties about the design, allowing the design to be formally verified. Second, depending on the application area and performance requirement, the space-time equations can be realised as custom VLSI systems, FPGAs, or programs to be run on a parallel computer.
INTRODUCTION
The problem of synthesising systolic arrays from uniform recurrence equations (UREs) [4] is well-understood and various design methodologies have been proposed for its solution over the last two decades [11, 13, 14, 16, 20] . The main technique consists of finding a non-singular mapping to transform the original index space to a space-time domain, i.e., assigning a time and place to each point in the original index space. Such a space-time mapping (or mapping for short) *Corresponding author. ISSN: 0020-7160 # 2002 Taylor and Francis Ltd DOI: 10.1080=00207160290018310 must satisfy several design constraints to be valid, including the causality constraint ensuring that the original data dependences are respected [10, 12] .
The class of systems of UREs is closed under unimodular mappings. As a result, every systolic array described by a unimodular mapping can be readily specified by a system of space-time UREs, in which the time or space coordinates are explicit. Section 3 recalls how such a system of space-time equations can be obtained from the original program by applying a so-called domain morphism in Crystal [3] . Applying this idea directly to a non-unimodular mapping, however, does not yield a system of well-formed equations because the domains of equations are sparse and cannot be specified by dense convex polyhedra.
This paper presents a method for deriving space-time equations for systolic arrays designed by non-unimodular mappings. The basic idea is to decompose a non-unimodular mapping into a unimodular domain morphism followed by a scaling transformation. Our method is compositional in that space-time equations can be further transformed by unimodular and non-unimodular mappings, allowing a straightforward implementation in transformational systems like AL-PHA [7] . In addition, our method handles unimodular mappings as a special case.
This work is useful for three reasons. First, non-unimodular mappings are frequently used in the synthesis of systolic designs. For example, the mapping that describes Kung-Leiserson's array for matrix multiplication is non-unimodular. In addition, all multirate arrays as defined in [16] are described by nonunimodular mappings. Our method can derive space-time equations for all systolic arrays in the traditional sense [10, 12] and all multirate arrays as defined in [16] . Second, space-time equations provide a precise specification of systolic designs, allowing the designs to be realised as custom VLSI systems, as FPGAs, or as programs to be run on a general purpose parallel architecture. In particular, since wavefront arrays are the asynchronous version of multirate arrays according to S. Y. Kung [6, p. 244] , the space-time equations for a multirate array can also be translated to a program to be run in a wavefront array. Third, the space-time equations for a design exhibit all useful properties about the design, allowing the design to be formally verified and further transformed. Two important properties about a non-unimodular design are the period of the array [16] and the phase in which a processor is active [8] .
The objective of this work is to derive space-time equations for non-unimodular mappings. It suffices to use the familiar matrix multiplication as an example for illustrations.
The rest of the paper is organised as follows. Section 2 introduces sparse UREs and Crystal's domain morphism. Section 3 reviews how to derive space-time equations for unimodular mappings. Section 4 describes our method to deal with non-unimodular mappings. The space-time equations for Kung-Leiserson's array are given. Section 5 applies our method to derive space-time equations for two multirate array realisations of matrix multiplication. Section 6 describes the related work. Section 7 concludes the paper.
SURES, URES AND DOMAIN MORPHISM
For a classification of various forms of recurrence equations used in systolic designs, we refer to [8] . For the purposes of this paper, it suffices to consider a generic system of sparse UREs (SUREs), P sure , such that each variable X is defined by an input equation (IE), a computation equation (CE) and an output equation (OE) as follows:
where z 2 Z n is an index point, M 2 Z nÂn is a non-singular integer matrix, D i ; D c and D i are the domains of the three equations, respectively, W 2 Z n is a constant dependence vector, g and h are functions from Z n to Z m for some m and f is a strict, single-valued function. In the computation equation, Y is a variable not necessarily distinct from X and the dots ''. . .'' indicate the arguments of the same syntax. The index space of the entire system P sure is defined to be the union of the domains of all computation equations.
The domains of the three equations are (dense) convex polyhedra of the form:
The domain of variable X , i.e., fMz j z 2 D c g, is sparse when M is nonunimodular. P sure is sparse when the domains of some variables are sparse. P sure becomes a system of UREs when M is the identity matrix in all its equations. Then, we write P sure as P ure morphism. In this paper, it is only necessary to consider the domain morphisms that are non-singular linear transformations on systems of SUREs (including UREs as a special case).
Let T 2 Z nÂn be a non-singular mapping from the original index space of P sure to a new index space. Let domðT ; P sure Þ be the equivalent system of SUREs obtained from applying the domain morphism T to the original program P sure . We have:
domðT ; P sure Þ:
where T ðD x Þ ¼ fTzj z 2 D x g; 8x 2 fi; c; og In the case of P ure , we have:
domðT ; P ure Þ:
where T ðD x Þ ¼ fTzj z 2 D x g; 8x 2 fi; c; og
UNIMODULAR MAPPINGS
Before being mapped to systolic arrays, affine recurrence equations (AREs) must be first transformed into UREs [13] . So we are only concerned with deriving space-time equations for systolic arrays synthesised from UREs. The synthesis of a systolic array from a system of UREs consists of finding two separate functions. The timing function tðzÞ ¼ lz, where l 2 Z n , specifies that the index point z is computed at the time step lz. The allocation function, defined usually by a projection direction u ¼ ðu 1 ; . . . ; u n Þ 2 Z n such that gcdðu 1 . . . ; u n Þ ¼ 1, specifies that two index points z and z 0 are mapped to the same PE iff z ¼ z 0 AE au, where a 2 Z n .
The two functions can be collectively specified as a single space-time mapping:
where P 2 Z ðnÀ1ÞÂn has full-row rank and satisfies Pu ¼ 0. P, the allocation matrix specifies that the index point z is executed at the PE Pz. Two different allocation matrices P 1 and P 2 such that P 1 u ¼ P 2 u ¼ 0 define the same array; they do not change the topology of the array but only modify the processor coordinates (i.e. labels) assigned to the PEs. T satisfies all data dependences of the program if lW ! 1 for every dependence vector in the program. This assumes that evaluation of an index point takes one unit time. A relaxation of this assumption will allow multirate arrays to be designed, as discussed in Section 5. T must also satisfy the condition det ðT Þ 6 ¼ 0, i.e., lu 6 ¼ 0, ensuring that two index points mapped to the same time step are not mapped to the same PE.
A unimodular mapping T transforms the original program P ure to domðT ; P ure Þ. The domains of equations in (5) are all convex polyhedra and can then be specified as follows:
Thus, the space-time equations in domðT ; P ure Þ are well-formed. They are the space-time equations for the systolic array described by T. Let us consider matrix multiplication defined by the following UREs: j ¼ 0; 1 i; k N ! Aði; j; kÞ ¼ aði; kÞ 1 i; j; k N ! Aði; j; kÞ ¼ Aði; j À 1; kÞ i ¼ 0; 1 j; k N ! Bði; j; kÞ ¼ bðk; jÞ 1 i; j; k N ! Bði; j; kÞ ¼ Bði À 1; j; kÞ k ¼ 0; 1 i; j N ! Cði; j; kÞ ¼ 0 1 i; j; k N ! Cði; j; kÞ ¼ Cði; j; k À 1Þ þ Aði; j À 1; kÞBði À 1; j; kÞ k ¼ N ; 1 i; j N ! cði; jÞ ¼ Cði; j; kÞ
The dependence graph of the UREs is depicted in Figure 1 .
S.Y. Kung's array shown in Figure 2 is described by the unimodular mapping:
The processor structure is obtained by projecting the dependence graph along u ¼ ð0; 0; 1Þ.
The space-time equations for the array are obtained directly from (5) by substitutions:
Many useful properties about the array can be extracted from these space-time equations. For example, we find that the array consists of N 2 PEs and runs in ð3N À 2Þd sys time, where d sys is the length of the global clock.
NON-UNIMODULAR MAPPINGS
When T is non-unimodular, the space-time equations in domðT ; P ure Þ from (5) are not well-formed if the domain of equations are defined as in (7) . This is because T ðD x Þ is not dense, i.e., T ðD x Þ contains index points that do not correspond to any index points in the original index space.
Our approach to deriving space-time equations for a non-unimodular mapping is to decompose it into a unimodular domain morphism followed by a non-singular scaling transformation that scales the domains of all variables while leaving the domains of all equations unchanged.
A scaling transformation S 2 Z nÂn maps P sure to the following program:
In the case of P ure , we have: The main result of this paper is summarised in the following theorem.
(a) It is always possible to decompose T such that T ¼ SU, where S is a upper triangular, nonnegative matrix with j lu j as its top-left element and U is unimodular. In the special case when P is e-unimodular, S has the following form:
(b) The space-time equations for the array are derived from P ure according to
by first applying U as a domain morphism and then S as a scaling transformation. We obtain:
scaleðS; domðU ; P ure ÞÞ:
In the space-time equations, the first subscript function of a variable represents time and the remaining subscript functions, which are time-invariant since S is upper triangular, represent processor coordinates.
Proof Let us prove (a). P 2 Z ðnÀ1ÞÂn has full-row rank. By Theorem 1, there must exist a unimodular matrix U À1 2 Z nÂn such that PU À1 ¼ ½0; H, where H 2 Z ðnÀ1ÞÂðnÀ1Þ is a non-singular upper triangular matrix. Hence, we have:
To prove that the top-left element of S ¼ lU À1 0 H is lu, it suffices to show that the first column of U À1 is the projection vector AEu. Since Pu ¼ ½0; HUu ¼ 0, where H is a non-singular upper triangular square matrix, we must have Uu ¼ ðAE1; 0; . . . ; 0Þ and U ðnÀ1ÞÂn u ¼ 0. That is, u ¼ U À1 ðAE1; 0; . . . ; 0Þ T . Hence, the first column of U À1 is AEu. Let us prove (b). In the proof of (a), we have already shown that U ðnÀ1ÞÂn u ¼ 0. That is, both T and U share the same projection vector u. Therefore, domðU ; P ure Þ define exactly the same processor space as domðT ; P ure Þ. Finally, T ¼ SU . By scaling domðU ; P ure Þ to get scale ðS; domðU ; P ure ÞÞ, we ensure that the timing function is applied correctly in the final space-time equations. & This theorem is also correct when T is unimodular, in which case U ¼ T and S is the identity matrix. Thus, both (5) and (10) are identical.
The space-time equations derived by our method exhibit all useful properties about a design. In addition to those that are also relevant to a unimodular mapping, four properties particularly pertinent to a non-unimodular mapping are discussed below.
Let the first subscript function in the variables X ðSzÞ be written explicitly as:
where ðjlu j ; s 2 ; . . . ; s n Þ is the top row of S. Let t min and t max be the time steps for the first input and last output, respectively.
1. The period of the array is jlu j, i.e., the coefficient of z 1 That is, a PE is active every j luj clock cycles. 2. A PE at the location S ðnÀ1ÞÂn z is active in the following time steps: fjluj t þ s 2 z 2 þ Á Á Á þ s n z n j9t 2 Z: t min jlu j t þ s 2 z 2 þ Á Á Á þ s n z n t max g 3. The PEs in the array can be divided into jlu j groups such that all PEs in the same group can be simultaneously active. The kth group G k , where 0 k < j luj , contains the following PEs:
These groups are active periodically according to the order:
We say that the array has j lu j phases and the PEs in G k are active in phase k.
4. The equivalence of all mappings that have the same l and u is obvious. Let a systolic array be designed using l ¼ ð2; 1; 0Þ and u ¼ ð1; 0; 0Þ. Consider three equivalent mappings and their decompositions: The same array is viewed differently from the perspective of each mapping. Let us use the processor structure described by T 1 as a reference. T 2 changes its shape (but not its topology) by applying a skewing transformation 1 1 0 1
i.e., the bottom-left ðn À 1Þ Â ðn À 1Þ submatrix of U 2 to its processor space.
T 3 also modifies the shape of the array by applying 0 À1 1 1 ¼ À1 0 0 1 0 1 1 0 1 1 0 1 , i.e., the bottom-left ðn À 1Þ Â ðn À 1Þ submatrix of U 3 . This consists of applying the same skewing transformation as in T 2 followed by interchanging the two processor axes and then reversing the first (new) processor axis. The lower ðn À 1Þ Â ðn À 1Þ submatrix of S 3 , i.e., 1 1 0 1 corresponds to a relabeling of the processor coordinates. Let us construct the space-time equations for Kung-Leiserson's hexagonally mesh-connected array [5] , depicted in Figure 3 . Unlike S. Y. Kung's array, this array is designed using the following non-unimodular mapping:
The processor structure is obtained by projecting the dependence graph along u ¼ ð1; 1; 1Þ.
Applying Theorem 2, we decompose T as follows:
To illustrate our method, we derive the space-time equations in two steps. In the first step, we apply the following domain morphism: to obtain the following equations:
y þ 1 þ Aðt; x; y À 1ÞBðt; x À 1; yÞ 
SPACE-TIME EQUATIONS FOR SYSTOLIC DESIGNS
In the second step, we apply S to scale the domains of three variables A; B and C :
From these equations, we can see clearly that the period of Kung-Leiserson's array is j lu j ¼ 3. This means that the array operates in 3 phases. The PEs that can be simultaneously active are shaded identically in Figure 3 . The array consists of 3N 2 À 3N þ 1 PEs. If we assume that all I=O are performed at border PEs, the latency of the array can be calculated as follows. Following [16] , the index space is extended to find out the point(s) mapped to the first time step and the point(s) mapped to the last time step. By extending the index space as shown in Figure 4 , we find that (1, 1, ÀN þ 2) and (N, N, 2NÀ1) are mapped to the first and last time steps, respectively. Thus,
The latency of the array is ð5N À 4Þd sys . PEðx; yÞ is active in the time steps:
Our method for deriving space-time equations is compositional in the sense that the space-time equations for a mapping can be further transformed by applying unimodular and non-unimodular mappings. The following theorem describes how a non-singular mapping can be applied to the space-time equations given in (10) . THEOREM 3 Let T 1 and T 2 be two non-singular mappings such that their decompositions are By this theorem, the space-time equations obtained by applying T 2 T 1 as one compound mapping are the same as those obtained by applying T 1 and T 2 individually in that order.
SPACE-TIME EQUATIONS FOR MULTIRATE ARRAYS
In the synthesis of systolic arrays, the evaluation of every equation (i.e., operation) is assumed to take one unit time. As a result, the clock cycle has to be the maximum of these operation times. A multirate array is a generalised systolic array, allowing different equations to take different time units to complete by making use of a finer clock cycle. Some discussions about multirate arrays can be found in [6, 16] In this section, we apply our method to derive space-time equations for a multirate array once the corresponding mapping is known.
In the case of matrix multiplication, we assume that all PEs are implemented as serial multiply-accumulators. We further assume that the computation equations for A and B take one unit each and the computation equation for C takes 16 time units. Then Rao formulated the problem of finding efficiency-maximal multirate arrays as follows [16] :
Minimise lu
Subject to lð1; 0; 0Þ T ! 1 lð0; 1; 0Þ T ! 1
There are three dependence vectors in the program. The first three constraints ensure that all three dependence vectors (1, 0, 0), (0,1, 0) and (0, 0,1) are respected. The last constraint ensures that the time interval for computing two consecutive index points in the same PE is at least 16 time units.
In general, one of the constraints in designing multirate arrays is:
where T max is the maximum of the times for evaluating all operations. Thus, all mappings for multirate arrays are non-unimodular. Let us consider two multirate arrays, one designed using the projection vector for S. Y. Kung's array and one using the projection vector for Kung-Leiserson's array.
Projection Vector u ¼ ð0; 0; 1Þ
In this case, l ¼ ð1; 1; 16Þ is a solution to (12) that achieves 100% efficiency. The array can be described by the mapping:
An application of Theorem 2 will decompose T to: The space-time equations can be derived as:
x À 1; yÞ
By extending the index space, we find that the latency of the array is ðð1; 1; 16ÞððN ; N ; N Þ À ð1; 1; 1ÞÞ þ 16Þd mul ¼ ð18N À 2Þd mul where d mul is the length of the array's clock. This array is much faster than S. Y. Kung's array since d sys ¼ 16d mul . This is because, by allowing different operations to consume different time units, the computations of all elements of C in the multirate array can begin as soon as the respective elements of A and B are available. Due to the projection (0, 0, 1) used, a PE is responsible for executing one vertical line of the points in the dependence graph (Fig. 1) . The dependence vector for C in the space-time equations is (16, 0, 0). Since the computation equation for C takes 16d mul time units to evaluate, a PE is active in every clock cycle from the time when it begins to compute the first point. Hence, the array has 100% efficiency.
Projection Vector u ¼ ð1; 1; 1Þ
In this case, there are no multirate arrays achieving 100% efficiency. One solution is to choose the same timing function l ¼ ð1; 1; 16Þ as before. This gives rise to the following equivalent space-time mapping:
The efficiency of the array is 16 lu % ¼ 89%. Applying Theorem 2 decomposes T to:
T ¼ SU ¼ Due to the similarities between this mapping and the one for Kung-Leiserson's array, we obtain the following space-time equations, which are the same as those for Kung-Leiserson's array except that the coefficient of t is 18 instead of 3:
Again by extending the index space, we find that the latency of this multirate array is ðð1; 1; 16ÞððN ; N ; 2N À 1Þ À ð1; 1; ÀN þ 2ÞÞ þ 16Þd mul ¼ ð48N À 34Þd mul , which is faster than Kung-Leiserson's array since d sys ¼ 16d mul However, it is well-known that we can maximise the throughput of Kung-Leiserson's array by interleaving the execution of multiple instances of matrix multiplication. The time for executing three instances can be calculated to be ð5N À 2Þd sys The pipelined execution of three instances in the multirate array given above will take three times longer than the execution of a single instance.
Therefore, the multirate version has a lower latency but Kung-Leiserson's array can achieve better throughput by interleaving the execution of multiple instances in the array.
Note that the first subscript function of the rhs C in the computation equation is 18t þ x þ y À 16. This means that a PE wastes two cycles when evaluating the computation equation for C for two consecutive points allocated to the PE. Hence, the efficiency of the array is 16 18 % ¼ 89%.
RELATED WORK
Loop transformation and systolic design are two closely related fields. In both fields, a matrix transformation is sought that specifies precisely the parallel code to be generated or the systolic array to be designed. The major focuses in loop transformations are on increasing parallelism, improving data locality and redu-cing communication overheads. The emphases in systolic design are on minimising the latency, throughput, and processor count of a design. This work is related to the code generation problem arising in loop transformation, which consists of producing the new loop code to execute the iterations in the original loop code according to the order specified by a given loop transformation. If the loop transformation is unimodular, the new index space is convex since the original index space is convex. The code generation is simple. The new loop code can be generated as described in [1, 18] . If the loop transformation is non-unimodular, the new index space is not convex. In this case, the generation of the new loop code is solved in several papers [9, 15, 19] . The basic idea is to obtain the Hermite normal form from the loop transformation matrix and derive from it the new loop bounds and step sizes. The non-unity step sizes serve to skip the holes in the new non-convex index space.
In systolic design, many researchers have focused on finding the scheduling vector l and the projection vector u to describe a systolic array [12] [13] [14] [15] [16] . Although both l and u can be collectively specified using a single non-singular matrix as in (6) , the generation of new space-time equations has been discussed only for unimodular mappings [2, 7, 12, 16] . To the best of our knowledge, no systematic methods for deriving space-time equations under non-unimodular mappings have been reported in the literature. This paper provides a systematic method for solving this problem. Theorem 2 shows that the space-time equations can be obtained from the original equations by first applying a unimodular domain morphism and then a non-unimodular scaling transformation. Theorem 3 shows that our method is compositional so that the space-time equations can be further transformed by unimodular and non-unimodular mappings.
CONCLUSION
In this paper, we have presented a method for deriving space-time equations for systolic arrays described by non-unimodular space-time mappings. Our method allows both unimodular and non-unimodular mappings to be treated in a unified manner. The space-time equations provide a precise specification of systolic designs, allowing them to be formally manipulated. Depending on the application area and performance requirement, these space-time equations can be realised in a variety of ways, as VLSI custom systems, as FPGAs and as programs to be run on general purpose parallel computers.
In presenting our method, we have assumed all variables are scheduled by the same linear timing function. In the general case, every variable V is scheduled by an affine timing function of the form t v ðzÞ ¼ lz þ a v . Since the effect of the affine constant a v is to perform a translation on the domain of variable V , our method applies in this case as well. For a similar reason, our method also works in the case when the allocation matrix is affine.
