We present a parallel method for matrix multiplication on distributedmemory MIMD architectures based on Strassen's method. Our timing tests, performed on an Intel Paragon, demonstrate that our method realizes the potential of the Strassen's method with a complexity of 4:7M 2:807 at the system level rather than the node level at which several earlier works have been focused. The parallel e ciency is nearly perfect when the processor number is divisible by 7. The parallelized Strassen's method is always faster than the traditional matrix multiplication methods whose complexity is 2M 3 coupled with the BMR method and the Ring method at the system level. The speed gain depends on matrix order M : 20% for M 1000 and more than 100% for M 5000.
Introduction
As the heart of many linear algebra algorithms, matrix multiplication (MM) has been made more and more e cient during the past decades. The complexity of MM for a matrix of order M has dropped 9,13] from O(M 3 ) for the traditional method (referred to as the T-method, hereafter), to O(M 2:807 ) for Strassen's method 15] (referred to as the S-method, hereafter), to O(M 2:376 ) for the Winograd method 5, 16] . This complexity reduction gives rise to many e cient algorithms in other areas of linear algebra: inverting a matrix, solving a system of linear equations, computing the eigenvalues of a matrix 17], and calculating the determinant of a matrix, etc.
Of course, the full potential of these e cient methods can be realized only on large matrices, which require large machines such as parallel computers. Thus, designing e cient parallel algorithms for these methods becomes essential. The parallelization of the general linear algebra routines on distributed-memory MIMD architectures has achieved reasonable success 8]. But, due to the complication of these dedicated MM methods, the progress has not been compatible. In fact, Manber 12] in 1989 claimed that S-method cannot be easily parallelized. In addition, the Winograd method has not been parallelized so far. Indeed, the attempts for the parallelization of the S-method have been mainly focused on shared-memory machines, such as Bailey's work 2] done on the Cray-2 and Cray Y-MP with no more than 8 processors, and the data-parallel SIMD architectures, such as work 3] done on the MasPar MP-1. Some distributed-memory MIMD system vendors, such as Intel for its Paragon 14] , supply node-level library routines that use the S-method. There is, thus far, no reported work 1 on parallelizing the S-method on message-passing parallel systems at the system level. However, there are many e orts and resulting routines for parallelizing the T-method. The report, PUMMA: Parallel Universal Matrix Multiplication Algorithms 4], documented three methods including the block scattered decomposition, a variation of the Broadcast-Multiply-Roll (BMR) method 6, 11] . A Ring method, another variation of the BMR method, is also used to parallelize the T-method.
In this paper, we introduce a method to parallelize the S-method. For a comparison, we study six di erent implementations obtained by mixing three global methods (the S-method, BMR method, and the Ring method at the system level) and two implementations locally at the node level by using the T-method (one implementation uses the routine sgemm 1] coded in i860 assembler by Intel and the other is a homegrown routine for matrix multiplication coded in Fortran.) The assembler-coded 
BMR and Ring methods and parallel e ciency
Many have proposed to merge the local S-method with simpler global methods like BMR or Ring method. The advantage of this idea lies in its ease of implementation, while the shortcomings are that: (1) small node memory limits the strength of the S-method (which needs large matrix to be useful) and (2) these parallel algorithms do not scale, i.e., the parallel e ciency drops as the number of processors increases. Now, let us study the parallel e ciency of these two global methods (B-and R-) coupled with S-method as the \core" algorithm, locally on each node. Let A and B be two M M square matrices whose product is another matrix C, and P be the number of processors used to perform the MM. In this section, we brie y describe how to parallelize the T-method and analyze the performance.
The BMR method
The BMR method has been discussed extensively in other literatures (See, for example, 6].) In this section, we concentrate on analyzing its e ciency.
On one processor, the time is T 1 (M) = (2M 3 ? M 2 )t comp 2M 3 t comp where t comp is the time for one oating-point number operation which may be addition or multiplication. This formula can be written in a general form as
where = 2 and = 3 for the T-method. On P processors, each being given a submatrix of order m = M= Asymptotically, for large matrices on large parallel systems, the parallel e ciency approaches E P (M ! 1) ! 1 P (3? )=2 : If the \core" algorithm is the T-method, then = 3 and we get E P (M ! 1) ! 1; which means a parallel-e cient algorithm can be derived this way.
4
But, if the \core" algorithm is the S-method, then = 2:807 and we get E P (M ! 1) ! 1 P (3?2:807)=2 < 1: If we choose P = 49 (as we explain later), then E P = 69%. It means, the maximum parallel e ciency for BMR method coupled to S-method is 69% for up to 49 processors.
The Ring method
We suppose M is divisible by P and thus we partition m = M=P rows of matrices A and B to each processor, initially. (Note, the divisibility condition can be relaxed with small loss of e ciency and the requirement of square matrix can be removed without any loss of performance.) The natural way to distribute the data is to give the rst m rows of A and B to processor 1, the second m rows of A and B to processor 2, ..., and the last m rows of A and B to processor P. Each processor then slices the two sets of rows, one from A and the other from B, into P blocks of m m submatrices. This distribution scheme determines that processor 1 is responsible for producing the rst m rows of the resulting matrix C, processors 2 for the second, ..., and the last processor for the last rows. At step I, all P processors start multiplying the diagonal submatrices of A with their own partition of submatrices of B. At the end of this step, processor 1 has created one term for each submatrix in the rst m rows, a 11 12 to the position of a 11 , processor 2 moves the submatrix a 23 to the position of a 22 , and so on.) Of course, this shift is done only within individual processors and there is no need for inter-processor communication. However, at the same step, all P processors must roll up the entire submatrices of B to one row higher in the matrix diagram (e.g., to move submatrices b 21 ; b 22 ; :::b 2P , to the positions of submatrices b 11 ; b 12 ; :::b 1P respectively.) The processor 1 will roll \up" its submatrices to processor P. We repeat steps I and II until all rows of matrix B reach all processors in the fashion of ring and all submatrices in A become diagonal submatrices once. Now, we study its e ciency. Similarly, on one processor the time is
where = 2 and = 3 for the T-method. On P processors, the time consists of two parts: T roll , time required to move rows of submatrices from one processor to another, and T comp , time needed to perform submatrix multiplications. Thus, T roll = (P ? 1) M 2 P t comm M 2 t comm and using Equation (1), we get T comp P 2 M P t comp : Therefore, the total time is T P (M) = T roll + T comp M 2 t comm + P 2 M P t comp : The parallel e ciency becomes E P (M) = S P (M)=P = 1 P 3? + P M ?2 tcomm tcomp :
If we x P and increase M, the parallel e ciency goes as E P (M ! 1) ! 1 P 3? : For the T-method where = 3, then
as long as m = M=P is kept large enough. Therefore, the Ring method, if used for realistic problems in which m is always large, is well parallelized, but rooted in the T-method with an O(M 3 ) complexity.
For the S-method where = 2:807 and if, again, we choose P = 49, E P (M ! 1) ! 0:47. It means that the best parallel e ciency is only 47%, which is low. If we increase P, E P becomes even lower. For additional comparison, if we combine the Ring method with the Winograd method in which = 2:376, we get E P (M ! 1) ! 0:09, and the situation is much worse.
In summary, we have the following observations: (1) the S-method is superior to the T-method in performance, for large matrices at node-level, (2) as a parallel method, the BMR method is alway superior to the Ring method, (3) coupling global BMR method or global Ring method with the local T-method can produce parallele cient algorithms, but overall performance is low due to low node performance using the T-method, and (4) coupling global BMR method or global Ring method with the local S-method can not produce parallel-e cient algorithms. Approaches (3) and (4) are not desirable.
Obviously, the ideal combination is a global S-method with a local S-method. For ease of comparison, we design a scheme to use S-method at the system level and the T-method (for practical reason, one can easily use the S-method as well, the di erence is small) at the node level.
Strassen's method
We rst brie y describe the conventional S-method in x3.1 and then introduce our parallelization of the S-method in x3.2. Finally, in x3.3, we discuss the implementation of our method.
Serial
Let A and B be matrices of order M = m2 k+1 and let C be their product Therefore, we have identi ed 7 7 = 49 MMs and naturally we will either use 7 or 49 processors to perform the MMs. In either case, the MMs distributed to each processor can be performed by the S-method, which leads to a perfect parallelization of the S-method.
After nishing these 49 MMs, we need to combine the resulting P ij 8(i; j) 7 S 1 = f1; 4; 5; 7g; S 2 = f2; 4g; S 3 = f3; 5g; and S 4 = f1; 2; 3; 6g:
The 4 4 blocks of submatrices forming the product matrix C can be written as c 11 = P i2S 1 i (P i1 + P i4 ? P i5 + P i7 ); c 12 = P i2S 1 i (P i3 + P i5 ); c 13 = P i2S 3 P i1 + P i4 ? P i5 + P i7 ; c 14 = P i2S 3 P i3 + P i5 ; c 21 = P i2S 1 i (P i2 + P i4 ); c 22 = P i2S 1 i (P i1 + P i3 ? P i2 + P i6 ); c 23 = P i2S 3 P i2 + P i4 ; c 24 = P i2S 3 P i1 + P i3 ? P i2 + P i6 ; c 31 = P i2S 2 P i1 + P i4 ? P i5 + P i7 ; c 32 = P i2S 2 P i3 + P i5 ; c 33 = P i2S 4 i (P i1 + P i4 ? P i5 + P i7 ); c 34 = P i2S 4 i (P i3 + P i5 ); c 41 = P i2S 2 P i2 + P i4 ; c 42 = P i2S 2 P i1 + P i3 ? P i2 + P i6 ; c 43 = P i2S 4 i (P i2 + P i4 ); c 44 = P i2S 4 i (P i1 + P i3 ? P i2 + P i6 ):
We have e ectively parallelized the MMs. In fact, we can also group the matrix additions to avoid repetitions in computing matrix sums and in communicating \raw" submatrices. This will be discussed in the next section.
Implementation On 7 processors
If we use 7 processors, the implementation is easy. We need to distribute 8 submatrices to 7 processors and each processor will contain no more than three submatrices (the same memory requirement as in the Ring method) so as to minimize the submatrix movement among processors. There are several ways to do this e ciently and Figure 1 illustrates one simple way for the distribution.
On 49 processors
It is more complex to implement our method on 49 processors. In this case, we need to distribute 32 submatrices to 49 processors with 3 submatrices each. We observe from Equation (4) and its expanded 49 MM expressions that submatrices P 1 , P 6 and P 7 have the same pattern (needing sums of two submatrices from A and two from B); P 2 and P 4 are similar; P 3 and P 5 are similar. Thus, we divide the 49 MMs into three groups, P 1 , P 6 and P 7 in Group 1; P 2 and P 4 in Group 2; P 3 and P 5 in Group 3. Figures 2, 3 , and 4 depict the submatrix distribution and the multiplication processor for the three groups respectively. For Groups 1, 2, and 3, we need 16, 12, and 12 independent submatrices from a and b, respectively. The grouping has another advantage, i.e., no communication occurs among all these groups; every 7 processors form a cluster and there is no need for communication with any of the other 49 ? 7 = 42 processors. In addition, within each group, we perform additions before communication whenever possible. The group size can be further reduced to localize the access of submatrices and therefore reduce the communication.
Performance comparison
We have conducted timing tests for six methods, SL, SH, BL, BH, RL, and RH discussed above, on a 56-node Paragon, a distributed-memory MIMD parallel computer. For each of these six methods, we collect timing results for 1-, 7-, and 49-processors. The timing results are tabulated in Tables 2, 3, 4 .
In our tests, we rst distribute the submatrix to all participating processors and each processor then performs its local matrix multiplication with the T-method. All three methods, SL, BL, and RL, share one identical local routine supplied by the vendor (which is tailored to deliver maximum ops from the i860 processor) while the other three methods, SH, BH, and RH, share another identical local routine we created with Fortran. The entire test is done on single precision.
With these data from the three tables ( Tables 2, 3 , 4), we make three sets of plots: Figures 5, 6 , and 7. These gures show the matrix order vs. performance time for processor numbers P = 1; 7; 49 in log-log plot.
Two points are very clear from examining the gures: (1) for small matrices, there exist large \start up" costs for distributing the submatrices and the timing does not follow any pattern and (2) when the matrices are su ciently large (typical order: M 1000 for 7 processors and M 2000 for 49 processors), the timing does follow
This suggests that a tting of our su ciently many data points for large matrices to the above form should capture the timing behavior. The resulting lines in log-log plot are very straight and the coe cients ( and ) for these lines are tabulated in Table  5 .
The six sets of coe cients, three for Ring method and three for our method, are tabulated in Table 5 . Using Equation (5), we compute the timing ratios of our method to the Ring method at typical matrix orders on 1, 7, and 49 processors. The results are tabulated in Table 6 . With the coe cients in Table 6 and the Equation 5, we can compute the exact costs of multiplying certain matrix on P = 7; 49 processors by using any one of the six methods. The rst half of the table shows the costs of multiplying matrices of orders M = 1000; 2000; 3000 on P = 7 while the next half the costs on orders M = 4000; 5000; 6000 on P = 49. In the rst row (for method SL), we list the exact time in units of seconds, while in the remaining ve rows list the times \in units of" the times in the corresponding rst row. From this table, we notice (1) SL method is always 15{30% faster BL, but about 30% (for P = 7) and about 100% (for P = 49) faster than RL. speed in using the S-method indeed justi es its value in applications where speed is important. We are in the process of reducing the problems in (1) and (2) . In the case of 49 processors, we have 32 submatrices. If we create three submatrix slots on each processor, each submatrix is repeatedly distributed for 3 49=32 = 4:6 times. We are developing a new scheme to distribute these submatrices based on their occurrence \frequency" in the 49 formulae for P ij 8(i; j) 7 . The more they appear, the more often we distribute them. This may yet reduce the cost of moving submatrices around.
In addition, we are applying the ideas presented in this paper to the Winograd method.
