Abstract: In this paper, some combinatorial characteristics of matrix multiplication on regular two-dimensional arrays are studied. From these studies, we are able to design many efficient varieties of the cylindrical array and the two-layered mesh array for matrix multiplication. A systematic design procedure is proposed for designing a cylindrical array for matrix multiplication; in this procedure, the Latin square (a special type of matrix) plays an important role. To design a two-layered mesh array, we show that a transformation procedure converts a cylindrical array into a two-layered mesh array.
I. INTRODUCTION
A systolic array, as defined by Kung [3], is a synchronously operating parallel processing array consisting of simple processing elements interconnected in a local and regular manner. A systolic array has a characteristic that often goes unnoticed: its architecture should be planar; in other words, the communication lines among the processing elements should not cross each other. If we relax this restriction, it is possible to obtain designs that are more efficient than systolic designs for a certain class of problems. Usually, for this class of problems, the operators used in the algorithm satisfy the associative and commutative properties. Typical examples in this class include matrix multiplication [4], [5], [7]-[9]. We study some combinatorial characteristics of the dependence graph (DG) and extract their essential information, then formalize it in two tables: the timing-level table and the processor assignment table. Special characteristics of these tables enable us to design many efficient varieties of matrix multiplication algorithms which are executable on the cylindrical array [8] or the two-layered mesh array [1]. Fig. 1 and Fig. 2 are examples of these arrays, respectively.
DESIGN OF CYLINDRICAL ARRAYS FOR MATRIX MULTIPLICATION
The problem of matrix multiplication is to compute the product $C$ of two matrices, $A$ and $B$. An equation to compute the product is $C = A_1 B_1 + A_2 B_2 + \cdots + A_n B_n$, where $A_i$ and $B_i$ are the $i$th column of $A$ and the $i$th row of $B$, respectively, and the product $A_i B_i$ denotes an "outer product." Therefore, the matrix multiplication can be carried out in $n$ recursions (each executes an outer product $A_k B_k$) [5]. ... data [5], their data flow direction can be reversed. Besides, since the addition operator is associative and commutative, the data flow direction of $c_{ij}$ can be reversed too. Consequently, the products $a_{i1}b_{1j}, a_{i2}b_{2j}, \ldots, a_{in}b_{nj}$ can be added to $c_{ij}$ in any order. Therefore, the recurrence equation can be generalized as ...

In designing a parallel algorithm, there are two things we must determine: the timing schedule and the processor assignment of the algorithm. We determine the timing schedule of the algorithm first, then determine a processor assignment (or a projection direction) which does not conflict with the timing schedule. The timing schedule and the processor assignment can be represented by a timing-level table (TLT) [7] and a processor assignment table (PAT), respectively. For example, the original systolic design (Fig. 4) proposed by Kung [4] to compute the product of two $n \times n$ ($n = 3$) matrices $A$ and $B$ can be described by the TLT and the PAT shown in Table I. In this design, [0 0 1] is the projection direction, so the indexes $i$, $j$, and $k$ in the tables correspond to a row, a column, and a layer, respectively. In general, we use the indexes $r$, $s$, and $q$ to represent a row, a column, and a layer in the tables, respectively. Assume that the index space is ... In other words, all the nodes $\{(r, s, q) \mid q = 1, 2, \ldots, n\}$ of a DG are projected onto the same processor $(i, j)$. For example, Table I specifies that the projection direction is [0 0 1] and that the operation $c_{23} = c_{23} + a_{22} b_{23}$ is performed at time step $t_{rsq} = t_{ijk} = t_{232} = 5$ by the processor $(i, j) = (2, 3)$.
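The outer-product formulation and the order independence of the accumulation can be checked numerically. The following is a minimal sketch, not taken from the paper: the function name `outer_product_matmul` and the sample matrices are assumptions made for illustration. It accumulates the $n$ outer products $A_k B_k$ in every possible order and confirms that the result does not change, which is what associativity and commutativity of addition imply.

```python
from itertools import permutations

def outer_product_matmul(A, B, order=None):
    """Accumulate C = sum_k (column k of A) x (row k of B) in the given order of k."""
    n = len(A)
    if order is None:
        order = range(n)
    C = [[0] * n for _ in range(n)]
    for k in order:                      # one recursion per outer product A_k B_k
        for i in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

# Hypothetical 3 x 3 integer matrices, chosen only for the demonstration.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

reference = outer_product_matmul(A, B)

# Every ordering of the n recursions yields the same product (exact integer arithmetic).
same_for_all_orders = all(
    outer_product_matmul(A, B, order) == reference
    for order in permutations(range(3))
)
```

With floating-point entries the different orders may differ by rounding error, so the exact equality test above is meaningful only for integer data.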
In the following, we shall describe the problem of how to select appropriate values of $t_{rsq}$ and $p_{i,j}$ so that varieties of parallel 
The above conditions a) and b) are illustrated in Fig. 5 , where all elements of M , ( M , ) on each solid-line (dash-line) paths are equal.
Next, we shall describe the design of cylindrical arrays [8] for matrix multiplication. First, we determine the timing schedule (TLT); then we determine a processor assignment (PAT) which is compatible (that is, not conflicting) with the TLT. The following lemma gives the conditions to be satisfied for a PAT to be compatible. In the following, we assume, without loss of generality, that $t_{rs1} \le t_{rs2}$, $1 \le r, s \le n$.

Proof: For a cylindrical array, since the data are fed into the array from the first row of the array, the processors in the first row (i.e., the processors at positions $(1, j)$, $j = 1, 2, \ldots, n$) start computing one step earlier than the processors in the second row. In general, the processors in the $i$th row start one step later than the processors in the $(i-1)$th row, for $i = 2, 3, \ldots, n$. Therefore, the processors in the $u$th row start computing at time step $u$. If $t_{rs1} = u$, then the node $(r, s, 1)$ should be executed at time step $u$. Hence, the element $(r, s)$ should be placed in the $u$th row of the PAT. In other words, ...

An immediate consequence of this lemma is that $t_{rs1} = u$ if ...

Based on the lemma, we describe a design procedure for constructing ...

1) Construct a TLT: Elements of the TLT should satisfy the ... in the TLT is an (O) (ordered) Latin square.
Note that the constructed PAT is a z-square of type (F, R), (F, L), (S, R), or (S, L).
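Since the design procedure relies on Latin squares, it may help to recall the defining property: a Latin square of order $n$ is an $n \times n$ array in which each of $n$ symbols occurs exactly once in every row and in every column. The sketch below checks that property. The function names are hypothetical, and the cyclic table `(r + s) mod n + 1` is only a convenient example; it is not claimed to be the ordered Latin square that the design procedure requires.

```python
def is_latin_square(T):
    """Return True if T is an n x n table over the symbols 1..n with each
    symbol appearing exactly once in every row and every column."""
    n = len(T)
    symbols = set(range(1, n + 1))
    # Check rows first so that malformed rows never reach the column check.
    if not all(len(row) == n and set(row) == symbols for row in T):
        return False
    return all({T[r][s] for r in range(n)} == symbols for s in range(n))

def cyclic_table(n):
    """Illustrative (assumed) example: entry (r, s) is (r + s) mod n + 1."""
    return [[(r + s) % n + 1 for s in range(n)] for r in range(n)]

T = cyclic_table(3)          # [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
not_latin = [[1, 2, 3], [1, 2, 3], [2, 3, 1]]   # columns repeat symbols
```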
Constructing a TLT is not difficult, but constructing a PAT is not easy and needs further description. We use a TLT whose 
... arrive at the $(u+1)$th row of the array.

Furthermore, by (1), we find that the data staying in the $(u+1)$th row of the array are ...

It follows from (5)-(7) that $a_{(q-1-u+j) \bmod n,\, q}$, $b_{q,\,(q+j-1) \bmod n}$, and $c_{(q-1-u+j) \bmod n,\,(q+j-1) \bmod n}$ ...

... be updated accordingly) to be exchanged, then, based on the following lemmas, other cylindrical arrays can be designed. Now we prove it.
Lemma 3: Assume that $P$ and $T$ are, respectively, a PAT and a TLT which have been constructed by the design procedure. If $T'$ is a TLT which is obtained by interchanging rows $u$ and $v$ ($1 \le u, v \le n$) of $T$, then it is possible to find a PAT $P'$ compatible with $T'$ such that $P'$ is a z-square.
Proof: Since $T'$ is obtained by interchanging rows $u$ and $v$ of $T$, we have $t'_{u,s,1} = t_{v,s,1}$ and $t'_{v,s,1} = t_{u,s,1}$ for every column $s$.

Without loss of generality, assume that $P$ is a z-square of (F, R) type. If we construct $P'$ from $P$ by interchanging all $p_{i,j}(1)$ which have value $u$ with all $p_{i,j}(1)$ which have value $v$, then we have (the interchanging operations do not alter $p_{i,j}(2)$) $p'_{i,j}(1) = u$ if $p_{i,j}(1) = v$; $p'_{i,j}(1) = v$ if $p_{i,j}(1) = u$; and $p'_{i,j}(2) = p_{i,j}(2)$.

Since $P$ is compatible with $T$, by Lemma 1, we have ... In other words, ...

Moreover, all elements of $[p'_{i,j}]$ are distinct. Therefore, $P'$ is compatible with $T'$. Moreover, since $p'_{i,j}(2) = p_{i,j}(2)$ and all the $p'_{i,j}(1)$ on a southeast diagonal path are constant, we conclude that $P'$ is a z-square of (F, R) type. If $P$ is a z-square of another type, the proof is similar.
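The mechanics of this proof can be sketched in code. The sketch rests on stated assumptions: a PAT entry is modeled as a pair `(p1, p2)`, the function names are hypothetical, and the sample table `P` is a made-up $3 \times 3$ arrangement of distinct pairs rather than a PAT from the paper. The code shows only the two operations involved, namely interchanging two rows of a TLT and interchanging two values in the first components of a PAT while leaving the second components untouched. It does not implement the compatibility condition of Lemma 1.

```python
def interchange_rows(T, u, v):
    """Return a copy of table T with rows u and v (1-based) interchanged."""
    T2 = [row[:] for row in T]
    T2[u - 1], T2[v - 1] = T2[v - 1], T2[u - 1]
    return T2

def relabel_first_component(P, u, v):
    """Interchange the values u and v in the first component of every entry;
    second components are left unchanged."""
    swap = {u: v, v: u}
    return [[(swap.get(p1, p1), p2) for (p1, p2) in row] for row in P]

# Illustrative (assumed) cyclic Latin square used as a stand-in for a TLT layer.
T = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
T_prime = interchange_rows(T, 1, 3)

# Made-up table of nine distinct pairs, standing in for a PAT.
P = [[(1, 1), (2, 2), (3, 3)],
     [(3, 1), (1, 2), (2, 3)],
     [(2, 1), (3, 2), (1, 3)]]
P_prime = relabel_first_component(P, 1, 3)

# The relabeling is a bijection on first components, so entries stay distinct.
entries_distinct = len({e for row in P_prime for e in row}) == 9
```

Because the relabeling only permutes first-component values, any set of positions on which the first component was constant in `P` (for example, a diagonal path) remains constant in `P_prime`.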
Lemma 4: Assume that $P$ and $T$ are, respectively, a PAT and a TLT which have been constructed by the design procedure. If $T'$ is a TLT which is obtained by interchanging columns $u$ and $v$ ($1 \le u, v \le n$) of $T$, then it is possible to find a PAT $P'$ compatible with $T'$ such that $P'$ is a z-square.
Proof: ... (O) or (P) type, it is possible to construct a TLT $T$ and find a PAT $P$ compatible with $T$ such that $P$ is a z-square. Thus, by Lemma 2, a cylindrical array for matrix multiplication corresponding to $P$ and $T$ can be constructed.

Now, we give some illustrative examples. In Table II ... is selected (i.e., ...), then it has a corresponding parallel algorithm that runs on the cylindrical array of Fig. 1. In this design, $c_{ij}$ stays in the array. If the projection direction [0 1 0] is selected, then a cylindrical array in which ... stays in the array can be obtained. There are other compatible z-squares which can be used for the PAT; Table III lists two examples. Table III(a) is an (F, R) z-square with $p_{1,1} = (3, 2)$. Table III(b) is an (S, L) z-square with $p_{1,1} = (1, 1)$. Since three different projection directions can be selected, each z-square corresponds to three different designs of parallel algorithms.
... z-square of (F, R) type. The procedure to transform a z-square $S$ to an x-square is given below (we call the leftmost column the first column):

For $n \ge 4$, insert the first column of $S$ between the $(n-2)$th and $(n-1)$th columns, and insert the second column between the $(n-3)$th and $(n-2)$th columns, ... until ... the (even-odd) transposition network should be selected (see Table IV). After all rows are processed, we obtain an x-square.
For example, after we apply the transformation procedure, the z-square of Table II is transformed to the x-square of Table V.
When we choose the projection direction [0 0 1], it has a corresponding design of the two-layered mesh array [2] shown in Fig. 2.
Another example is given in Table VI, which includes an (S, L) z-square that is a PAT compatible with the TLT. Table VI(c) is an x-square derived from the z-square. The x-square is a feasible PAT for designing a two-layered mesh array.

TABLE V: A PROCESSOR ASSIGNMENT TABLE FOR THE TWO-LAYERED MESH ARRAY OF FIG. 2
If $[t_{rs1}]$ in the TLT is neither an ordered Latin square nor obtainable from an ordered Latin square by interchanging rows or columns, then we cannot find its corresponding z-square or x-square. Table VII is an example of this type.
IV. CONCLUSION
Some combinatorial characteristics of parallel algorithms for matrix multiplication on regular two-dimensional arrays are studied. By studying these characteristics, we are able to design different parallel arrays, such as the cylindrical array or the two-layered mesh array. Intuitively, we conjecture that the design procedure can be used to construct all the cylindrical arrays (of the form shown in Fig. 1) for matrix multiplication. From a given cylindrical array, we have described a transformation procedure which can be used to transform the cylindrical array to a two-layered mesh array. Finally, it is worth noting that almost all the matrix multiplication algorithms designed in this paper use nonlinear timing schedules. This indicates that ...

... is shown in Fig. 1. As shown in [6], the performance degradation of a partial-multiple-bus is not significant. For a two-group partial-multiple-bus system of size 16 (i.e., $N = M = 16$, where $N$ is the number of processors and $M$ the number of memory modules), the decrease in performance (system bandwidth) is below 6%. For the sake of simplicity and consistency, we shall call this structure the memory-oriented partial-multiple-bus, or MPMB. A different partial multiple-bus structure is proposed in this paper as an alternative to the one proposed by Lang, and as one which provides higher system bandwidth and faster arbitration at lower or equal cost. Derived also from the conventional multiple-bus structure, this structure, called the processor-oriented partial multiple-bus, or PPMB, divides processors and buses into identical groups while maintaining the connection of each memory module to every bus.
A notable difference between this structure and the one by Lang is that in it, a memory module has a maximum of $B$ potential paths (where $B$ is the number of buses) to processors while, in Lang's, a memory module has a maximum of only $B/g$ potential paths to processors (where $g$ is the number of groups of buses). This structural difference gives rise to a distinguishing feature of the PPMB structure, namely, its potential for load-balancing arbitration. Load balancing, aimed at fully exploiting the potential for higher bandwidth inherent in the structure, is able to provide a substantial improvement in system performance. As a matter of fact, analytical and simulation results have both shown a maximum of 20% increase in system bandwidth of the PPMB over the MPMB. Meanwhile, the cost of a PPMB system has been shown in general to be less than or equal to that of an MPMB of the same size. Note that while the partial-multiple-bus structure proposed by Lang was motivated by the goal of reducing cost and arbitration time without reducing system bandwidth significantly, we have shown as well that the PPMB structure can lead to a substantial improvement in cost-effectiveness when the system size is very large.
In the section that follows, details of the PPMB structure and its load-balancing feature are discussed. Section III introduces probabilistic models for evaluating the synchronous-system bandwidth of the structures under study, and comparisons are made between PPMB and MPMB. The numerical results produced by the models all lie within ±3% of the simulation results, implying a high level of confidence in the models. Finally, some concluding remarks are given in Section IV.
PROCESSOR-ORIENTED PARTIAL MULTIPLE-BUS STRUCTURE (PPMB)
A. The Structure
In PPMB, shown in Fig. 2, $N$ processors are divided into $g$ groups with each group of $(N/g)$ processors fully connected to a set of $(B/g)$ buses, whereas all $M$ memory modules are connected to all $B$ buses. This is to be contrasted with MPMB, in which the $M$ memory modules are divided into $g$ groups where each group of $(M/g)$ memory modules is fully connected to a set of $(B/g)$ buses, and all of the $N$ processors are fully connected to all buses. For both MPMB and PPMB, $g$ is assumed to be a factor of both $B$ and $M$ (or $N$).
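The two connection patterns can be made concrete with a short sketch. It is an illustration under stated assumptions, not the authors' model: the function names are hypothetical, indices are 0-based, each group is assumed to occupy a contiguous block of indices, and the value $B = 8$ in the example is chosen arbitrarily (the text only mentions a two-group system with $N = M = 16$). The sketch builds the processor-bus and memory-bus links for both structures and counts how many buses reach a given memory module.

```python
def ppmb_connections(N, M, B, g):
    """PPMB: processors and buses are grouped; every memory sees every bus."""
    assert N % g == 0 and B % g == 0, "g must divide N and B"
    NG, BG = N // g, B // g
    proc_bus = {(p, b) for p in range(N)
                for b in range((p // NG) * BG, (p // NG + 1) * BG)}
    mem_bus = {(m, b) for m in range(M) for b in range(B)}
    return proc_bus, mem_bus

def mpmb_connections(N, M, B, g):
    """MPMB: memories and buses are grouped; every processor sees every bus."""
    assert M % g == 0 and B % g == 0, "g must divide M and B"
    MG, BG = M // g, B // g
    proc_bus = {(p, b) for p in range(N) for b in range(B)}
    mem_bus = {(m, b) for m in range(M)
               for b in range((m // MG) * BG, (m // MG + 1) * BG)}
    return proc_bus, mem_bus

def buses_per_memory(mem_bus, m):
    """Number of buses attached to memory module m."""
    return sum(1 for (mm, _) in mem_bus if mm == m)

# Example with assumed sizes: N = M = 16, B = 8, g = 2.
ppmb_proc, ppmb_mem = ppmb_connections(16, 16, 8, 2)
mpmb_proc, mpmb_mem = mpmb_connections(16, 16, 8, 2)

ppmb_paths = buses_per_memory(ppmb_mem, 0)   # B buses reach each memory
mpmb_paths = buses_per_memory(mpmb_mem, 0)   # only B/g buses reach each memory
ppmb_links = len(ppmb_proc) + len(ppmb_mem)
mpmb_links = len(mpmb_proc) + len(mpmb_mem)
```

In this example the total number of links is the same for both structures because $N = M$; the link count is only a rough proxy and is not the cost measure used in the paper.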
In the rest of this paper, we will refer to an $N \times M \times B/g$ system as a partial multiple-bus system that has $B$ buses, $M$ memory modules, and $N$ processors, and is divided into $g$ groups. In addition, we will replace the notation $M/g$, $N/g$, and $B/g$ with $M_G$, $N_G$, and $B_G$, respectively.
0018-9340/92$03.00 © 1992 IEEE
