I. INTRODUCTION
YSTOLIC ARRAY [ 11 has been proposed over a decade.
S Roughly speaking, it is a special purpose parallel device composed of several processing elements (PE's) whose interconnections have the properties of regularity and locality. With these properties, systolic architecture is very suitable for VLSI technology.
In spacetime mapping design methodology, the first step is regularization [2] , [3] . By regularization we mean to rewrite the sequential algorithm to a regular iterative algorithm, RIA [2] (or uniform recurrent equations, URE 's [4] ). Dependence graph (DG) is a graphical representation of a RIA and the length of the longest path in this graph is the minimal execution time needed for executing the RIA.
The next step of designing systolic array is to determine a proper schedule vector and its corresponding compatible processor assignment matrix to meet the constraints of data availability and processor availability. That is, a proper schedule must satisfy the precedence constraints imposed by the DG and must also ensure that no two different computations are executed at the same processor at the same time.
The well-known schedules for array processors are linear schedule, uniform affine schedule and affine schedule [2] . A linear schedule is that all variables are with the same schedule vector. In addition to the same schedule vector, the uniform affine schedule can have different translation parts for variables. The affine schedule is more general. It can have different schedule vectors and translation parts for variables. If two indexes are executed at the same time, they are on the Manuscript received May 24, 1991; revised July 30, 1993 and June 30, 1994 . This work was supported by the National Science Council of the R.O.C. under Contract NSC-8 1 -0408-E-009-568.
The authors are with the Institute of Computer Science and Information Engineering, College of Engineering, National Chiao Tung University, Hsinchu, Taiwan 30050, Republic of China.
IEEE Log Number 940933 1.
same hyperplane. For the entire index space, we may draw several parallel iso-temporal hyperplanes. The schedule vector is in the direction normal to these hyperplanes. We use a transformation matrix (T) to represent the spacetime mapping. The first row of this matrix is the schedule vector ( A T ) and the remaining rows are the processor assignment matrix ( S T ) . Comparing these designs for matrix multiplication, we have: first, the mesh array has the good property of local connection, but its execution time (3N -2) is greater than that (2n' -1) of the cylindrical array, where ?i is the size of the matrices. Second, based on the Winograd algorithm, we have :jLV/2 execution time [16] , but this two-layered mesh array is not so regular.
In Section 111-A, we design mesh arrays with 2 5 -1 execution time. In Section 111-B, we have a mesh array with r(3N -l)/21 execution time. All of the above new designs of 2-D arrays are based on the sequential matrix multiplication
For the problem of transitive closure, the orthogonal array designed by S. Y. Kung et al. [18] has the execution time of 5N -4. which is optimal in terms of pipelining rate, block pipelining rate, and the number of VO connections. According to the dependence graph (DG-3 in his paper) used by S. Y. Kung, the longest path in this graph is 5 N -4. Therefore, for this DG the design is optimal in execution time. It is interested that whether there is a DG for the transitive closure problem with a shorter length of the longest path. We will show in Section IV that it is true and present a new design of 2-D array for transitive closure, which has 4 N -2 execution time. Finally, the concluding remarks are presented in Section V.
PRELIMINARY DEFINITIONS
In this section, we give some preliminary definitions as a basis for following sections. The following theorem shows the relationship between these two sequences. 
The meaning behind this theorem is that if we want to know which position (say j ) the number k appears in the left-shift sequence L ( i ; 1, N ) , we can read j from the value of ri(IC). For example, if we want to know which position the number 3 appears in the left-shift sequence L( 1; 1 , 3 ) , we have the position j = q ( 3 ) = 2.
There are many criteria (e.g., execution time, pipelining period, array size, and YO channels) to measure performance of array processors. Since execution time is the most important criterion on designing real-time signalhmaging processing system, we pay our attention on execution time in this paper.
Dejnition 2.3:
The execution time (t,) of a systolic array is defined as the time interval between the time when the first operation is executed and the time when the last result is calculated.
By computation domain (0) we mean the set of finite indexes used by a RIA. Let I and I' be two indexes in the computation domain 0 of a RIA A.AT is a linear schedule in the first row of transformation matrix T . Assuming that there is a unitary time increment, the execution time t , of a is t , = maxI,I/Ee {AT(I-I')}+l [19] . The actual meanings of the execution time of a RIA is the number of hyperplanes sweeping the index space.
The multiple fan-in ( Fig. l(a) ) and multiple fan-out ( Fig. 1 (b) ) data dependence vectors are called broadcast vectors. All broadcast vectors can be systematically transformed into propagation vectors (e.g., Fig. l(c) ). We use the term broadcast point (the dark node in Fig. l(c) ) to denote the starting position for data propagation. A broadcast line is composed of several broadcast points. By aggregating broadcast lines, we obtain a broadcast plane.
DESIGN OF MESH ARRAYS FOR MATRIX MULTIPLICATION
The problem of matrix multiplication is to calculate matrix C = A * B , where A and B are both matrices. Without loss of generality, we assume that A , B , and C are all N x N matrices. The matrix multiplication can be carried out in N recursions as depicted in Algorithm 3.1.
[Algorithm 3.11
For i , j , k = 1 to 
0
The broadcast data U i , k and b k , j in Algorithm 3.1 can be removed by introducing propagation variables u ( i , j , k ) and b ( i , j , IC), respectively. Now we have Algorithm 3.2 [2]. The corresponding DG is shown in Fig. 2 , where the bold line denotes the longest path in this graph.
[Algorithm 3.21
with c ( i , j , 1) = 0 
A. Mesh Arrays with 2 N -1 Execution Eme
The execution time of a systolic array can be decomposed into three parts. with N of another data. We know t , = t , + t , Eliminating or lowering any part of these times can decrease the execution time of a systolic array. That is, if we do operation in each PE as soon as possible, we can obtain a systolic array with less execution time. It can be carried out by moving the broadcast planes of variables a and b to (i = j)-plane. By using this broadcast plane, the algorithm of matrix multiplication can be decomposed into two phases. [:
Hence, we have a new design of mesh array for matrix multiplication by composing these two phases in broadcast plane, as shown in Fig. 6 . We call it Design m2. Notice that, we assume that the data can be inputted on the diagonal PE's. If the array restricts that data must be fed on the boundary PE's, then the data preloading is necessary. This new mesh array not only has the same execution time as the cylindrical array [13] but also eliminates the spiral arcs in the cylindrical array. In this design, two phases have different schedule vectors. We call it two-phase schedule and this schedule can be generalized into an m-phase schedule. There are three different types of m-phase schedule. They are m-phase linear schedule, m-phase uniform affine schedule and m-phase affine schedule. In this paper, the first two schedules are used to design 2-D arrays for multiphase RIA'S. Therefore, we give formal definitions to these two types of m-phase schedule as follows: For j = 1 to [N/21
For i = 1 to [N/21 Hence, we have an another new design of mesh array for matrix multiplication by composing these four phases in broadcast planes as shown in Fig. 8 . We call it Design m3. We use four-phase uniform affine schedule to let these four phases begin execution at the same time. Note that here we let every variable with the same translation part in each phase. in the waiting time increasing to r( N -1) /2] , though we halve the operating time. Therefore, it has the same execution time as Design m2. The interesting problem is how we can decrease the operating time without increasing the waiting time. This problem will be tackled by adding delays to some PE's.
B. Mesh Array with t , = r(3N -1)/21 Execution Eme
In this section, we will show how to eliminate the waiting time in a mesh array by adding delays to some PE's. We know that the relative data must meet on the same place at the same time in a systolic array. For example, in Fig. 8 , a2,1 and b1,3 must meet at the same time at PE23, so b1,3 should wait one time step to meet u2,1. Nevertheless, adding one delay to PE23 for variable b will eliminate this waiting time. In other words, b1,3 goes into self loop in PE23 and in [-1 01 direction to PE13 at the same time, then b1,3 can operate simultaneously with u2,1 at PE23 and with a1,l at PE13 at the next time step. Using this method, we can eliminate the waiting time in Design m3 and get an another new design of mesh array for matrix multiplication with execution time of r(3N -1)/21. We call this mesh array Design m4 as shown in Fig. 9 . It in Algorithm 4.2 and it is computed at node ( i , j , k ) in Fig. 10 .
[Algorithm 4.21 Now we will show how to get a spherical array with execution time of 4N -2 (if N is even, 4N -3 if N is odd) by moving respectively the broadcast planes of variables a and b to the center of the DG in the j-and i-directions. The DG is shown in Fig. 11 . And we have the following algorithm.
[Algorithm 4.31
For j = 1 to 
Intraphase dependencies between 1) phase 2 and phase 1 of the m-phase schedule to help us on mapping multiphase algorithms onto array processors. They are the m-phase linear schedule and the m-phase uniform affine schedule. Furthermore, it is easily to extend the definitions of these two types of m-phase schedule to the m-phase affine schedule and may apply it to map nonsystolic RIA'S onto array processors. These new m-phase schedules can actually map an algorithm that fails by other schedules and design 2-D arrays that can not be obtained by other methodologies we know.
The idea of-moving broadcast planes and decomposing algorithms by the broadcast planes can help us on designing Since (see the second equation at the bottom of the page).
a DG with a shorter length of the longest path. Its penalty is that the control complexity will be increased somewhat. We can apply the method proposed in this paper to many other problems to design some even faster VLSI array processors to cater for the requirements of real-time applications.
V. CONCLUSIONS
In this paper, we have presented several new 2-D arrays for the problems of matrix multiplication and transitive closure. Besides these new designs, we have also proposed two types 
