Abstract
Introduction
Matrix multiplication is a building block in many matrix operations and for solving graph theory problems. Due to its fundamental importance in sciences and engineering, much effort has been devoted to finding and implementing fast matrix multiplication algorithms. The standard sequential algorithm has ON 3 time complexity in multiplying two N N matrices. Since Strassen's ON 2:8074 algorithm was discovered [39] , successive progress has been made to develop fast sequential matrix multiplication algorithms with time complexity ON , where 2 3. The current best value of is less than 2.3755 [7] . As for parallel implementation of matrix multiplication algorithms, the standard algorithm can be easily parallelized, and Strassen's algorithm has been parallelized on shared memory multiprocessors [5] . Furthermore, any known sequential ON matrix multiplication algorithm can be parallelized on a CREW PRAM in Olog N time by using ON = log N processors [4, 33, 35] . This significant result has been used extensively in many other parallel algorithms. Unfortunately, the PRAM model as well as large scale shared memory systems are practically limited due to their difficulty in realization.
On distributed memory multicomputers (which are considered more practical), research has essentially focused on the parallelization of the standard algorithm. It was shown that matrix multiplication can be done in ON 3 =p + logp=N 2 time on a hypercube with p processors, where N 2 p N 3 [8] . However, it is not clear how non-standard algorithms can be implemented efficiently on commonly used electronic static networks such as hypercubes and meshes, which have limited network connectivity and communication capability. In fact, none of the ON sequential algorithms with 3 has been fully parallelized on any distributed memory systems with static interconnection networks, since these networks do not have sufficient capability to support complicated communication efficiently. In [8] , it was shown that matrix multiplication can be done in ON =p ,1=2 time on a hypercube with p processors, where 1 p N 2 , and is the exponent of the best available ON sequential algorithm. This implementation is valid only in a small interval of p, and is not cost-optimal (as compared to the ON sequential algorithm). When p = N 2 , we get the shortest execution time ON, which is very slow. The reason is that the ON algorithm is invoked sequentially to calculate submatrix products, and not parallelized at all.
To fully parallelize non-standard sequential matrix multiplication algorithms on distributed memory systems, more powerful support for communication patterns used in efficient parallelization of these algorithms is required. One promising approach is to use fiber optical buses as interconnections [21] . For instance, on Linear Arrays with Reconfigurable Pipelined Bus Systems (LARPBS, a kind of distributed memory systems employing optical interconnections), one recent breakthrough was reported in [22] .
(R1) Matrix multiplication (where we assume that matrix elements are integers of bounded magnitude, or floating-point values of bounded precision and magnitude) can be performed by an LARPBS in Olog N time by using ON 3 =1:1428 log N processors, where 0 1 [22] . This implies that matrix multiplication can be done in O1 time using N 3 processors, and in Olog N time using ON 2:8074 processors.
This was then the fastest parallel implementation of non-standard matrix multiplication algorithms on distributed memory systems.
In reality, it is not feasible to use, say, N processors where 2, for large scale matrix multiplications. Typically, the number of processors available is far less than what is required. Therefore, it is necessary to design parallel algorithms that can be executed on p processors, where p is arbitrarily chosen in certain
range. An important issue in parallel computing is the scalability of a parallel algorithm on a parallel system. Though there is no unified definition of scalability, scalability essentially measures the ability to maintain speedup that is linearly proportional to the number of processors. To be more specific, when the number of processors is less than the maximum required, the parallel time complexity can gen- It is clear that (R3) is a special case of (R4), where the time complexity is substantially reduced as compared with that in [25] . Also, the processor complexity is substantially reduced as compared with that in [22] . This is currently the fastest and most processor efficient parallelization of the best known sequential matrix multiplication algorithm on a distributed memory system. In this paper, we further strengthen result (R4). In fact, we consider a quite general model, called Distributed Memory Parallel Computers (DMPC), which includes a number of important models of distributed memory systems such as LARPBS as special cases. of PRAM. The significance of result (R5) is three-fold. First, it strengthens result (R4) by making our parallel implementation from high scalability to full scalability. Second, result (R5) is applicable to a wider range of systems than LARPBS, since the DMPC model is weaker than LARPBS. Third, result (R5) unifies all known algorithms for matrix multiplication on DMPC, standard or nonstandard, sequential or parallel. Such a unification has rarely been seen for other problems. Moreover, our methods and results can be extended to other parallel systems (see x6 for more details). All the above claims result in significant progress in scalable parallel matrix multiplication (as well as solving many other important problems) on distributed memory systems, both theoretically and practically.
Distributed Memory Parallel Computers
A distributed memory parallel computer (DMPC) consists of p processors P0, P1, P2, ..., Pp,1 (see Figure 1) . Each processor has its own local memory, and there is no global shared memory. Processors communicate with each other via message passing. The computations and communications in a DMPC are globally synchronized into steps. A step is either a computation step or a communication step. In a computation step, each processor performs a local arithmetic/logic operation, or, is idle. A computation step takes constant amount of time.
In a communication step, processors send and receive messages via an interconnection network. A communication step can be specified as 0; v 0; 1; v 1; 2; v 2; :::; p , 1; v p,1; where for all 0 j p , 1, processor Pj sends a value vj to processor P j , and is a mapping : f0; 1; 2; :::; p , 1g ! f,1; 0; 1; 2; :::; p , 1g. If processor Pj does not send anything during a communication step, then j = ,1 and vj is undefined. It is required that for all 0 i p , 1, there is at most one j such that j = i, i.e., each processor can receive at most one message in a communication step. Such a communication step is essentially one-to-one communication. It is assumed that the interconnection network connecting the p processors in a DMPC can realize any one-to-one communication in constant amount of time. In a busiest communication step, every processor sends a message to another processor, and 0; 1; 2; :::; p,1 is a permutation of 0; 1; 2; :::; p , 1.
Based on the above discussion, the time complexity of a parallel computation on a DMPC is the sum of the numbers of computation and communication steps.
The DMPC model is functionally identical to the Module Parallel Computer (MPC) model which was used to study the capability of distributed memory systems in simulating Parallel Random Access Machines (PRAM) [27] . Using electronic interconnections, the interconnection network in a DMPC can be implemented using a static completely connected network (as in MPC), or a dynamic crossbar network, both can support a one-to-one communication step and arbitrary permutation in constant time. However, such implementations using electronic interconnections are very expensive and unrealistic when p is large.
Due to recent advances in optical interconnection technologies, the interconnection network in a DMPC can be implemented using optical interconnection networks. For instance, the Optical Model of Computation (OMC) was proposed in [10] (cf. [2, 11, 12] for more study on this model). In such a system, there is a processor layer and a deflection layer, both embedded in Euclidean planes. Interprocessor communications are performed using free space optical beams. An optical beam can carry and transmit information in constant time, independent of the distance covered. There are various techniques to implement the deflection layer (see [10] for detailed discussions). The optical beams function properly as long as each processor receives at most one message in a communication step, thus supporting arbitrary one-to-one communication.
Recently, fiber optical buses have emerged as promising networks [3, 6, 9, 13, 16, 26, 30, 37, 40] . Pipelined optical buses can support massive volume of data transfer simultaneously, and can implement various communication patterns. Furthermore, a system with optical buses can be reconfigured into independent subsystems, which can be used simultaneously to solve subproblems in parallel [30] . It is now feasible to build distributed memory systems that are no less powerful and flexible than shared memory systems in solving many problems, such as Boolean matrix multiplication [17] and sorting [24] . Numerous parallel algorithms using optical interconnection networks have been developed recently [1, 14, 17, 18, 20, 22, 23, 24, 28, 29, 31, 36, 38] .
Under such circumstances, a new model called Linear Array with Reconfigurable Pipelined Bus Systems (LARPBS) was proposed in [29, 30] . An LARPBS consists of p processors connected by a pipelined optical bus system, which uses optical waveguides instead of electrical signals to transfer messages among electronic processors. In addition to the high propagation speed of light, there are two important properties of optical signal transmission on an optical bus, namely, unidirectional propagation and predictable propagation delay. These advantages of using waveguides enable synchronized concurrent accesses of an optical bus in a pipelined fashion [6, 16, 37] . Such pipelined optical bus systems can support a massive volume of communications simultaneously, and are particularly appropriate for applications that involve intensive communication operations such as broadcasting, one-to-one communication, multicasting, global aggregation, and irregular communication patterns. In addition to the tremendous communication capabilities, an LARPBS can also be partitioned into k 2 independent subarrays LARPBS1, LARPBS2, ..., LARPBS k , such that LARPBSj contains processors Pi j,1 +1, Pi j,1 +2, ..., Pi j , where ,1 = i0 i1 i2 i k = p , 1. The subarrays can operate as regular linear arrays with pipelined optical bus systems, and all subarrays can be used independently for different computations without interference (see [30] for an elaborated exposition). As in many other parallel computing systems, a computation on LARPBS is a sequence of alternate global communication and local computation steps. The time complexity of an algorithm is measured in terms of the total number of bus cycles in all the communication steps, as long as the time of the local computation steps between successive communication steps is bounded by a constant and independent of the problem size. In addition to one-to-one communication, an LARPBS can also support broadcast, multicast, multiple multicast, and even global aggregation in constant time. Hence, LARPBS is a more powerful model than DMPC.
Bilinear Algorithms
All existing sequential matrix multiplication algorithms over an arbitrary ring fall into the category of bilinear algorithms (see [4, pages 315-316] and [33, 34, 35] ). Let A = Aij, B = B jk , and C = C ik be N N matrices, where N = m n . Assume that m is a fixed constant, and n ! 1. Each of these matrices are partitioned into m 2 submatrices Aij, B jk , C ik of size m n,1 m n,1 . A bilinear algorithm for computing C = A B
Step (B) (Basis) If n = 0 , i.e., the matrices are of size 11, compute the product C = A B directly, and return; otherwise, do the following steps.
Step ( Step (R) (Recursion) Calculate the R matrix products Mu = Lu L u , for all 1 u R. 
Implementation on DMPC
The recursive bilinear algorithm can be unrolled into an iterative algorithm. Let us label Steps (D), (R), and (C) in the lth level of the recursion as Step (D l ), (R l ), and (C l ) respectively, where 1 l n. The computation proceeds in the following way:
Step
Step (C 1 : The recursive step (R l ), 1 l n, is actually
Given matrices Lu1; u 2; :::; u l and L u1; u 2; :::; u l of size N=m l N=m l , Step (R l ) obtains their product Mu1; u 2; :::; u l = Lu1; u 2; :::; u l L u1; u 2; :::; u l ; for all 1 u1; u 2; :::; u l R. Therefore, we can imagine that the entire bilinear algorithm is actually Step (R0) which, given two matrices L = A and L = B, computes their product
To compute the product Mu1; u 2; :::; u l in Step (R l ), where 0 l n , 1, we first execute Step (D l+1 ), i.e., partitioning Lu1; u 2; :::; u l and L u1; u 2; :::; u l into submatrices We now discuss how the above iterative bilinear algorithm can be implemented on a DMPC. Our main result of this section is the following claim. The rest of the section is devoted to the proof Theorem 1.
Theorem 1. Multiplying two N N matrices can be performed in Olog N time by a DMPC with N processors.
We use the notation P s : t to represent the group of t processors with consecutive indices starting from s, i.e., Ps, Ps+1, Ps+2, ..., Ps+t,1. All the matrices and their submatrices are of size m l , where 0 l n. A matrix X of size m l m l is stored in a processor group P s : m 2l in the shuffled-row-major order. That is, X = Xij is partitioned into m 2 submatrices of size m l,1 m l,1 , and Xij is stored in the processor group P s + i , 1m + j , 1m 2l,1 : m 2l,1 in the shuffled-row-major order, where 1 i; j m.
Based on the assumption that the interconnection network in a DMPC can support one-to-one communication in one step, it is not difficult to see the following lemmas. 
Our algorithm for Step (D l ) is given in Figure 3 . 
3.
The processor group specified by Eq. (2) broadcast Liju1; u 2; :::; u l,1
4.
to the processor groups specified by Eq. (1), for all 1 u l R.
end do 6.
for all 1 u l R do in parallel 7.
The processor group specified by Eq. (1) Olog N.
The size of the matrix Lu1; u 2; :::; u l is N=m l N=m l , where 1 l n. Since there are R l Lu1; u 2; :::; u l matrices, the total number of elements in the Lu1; u 2; :::; u l 's is R l N=m l 2 = N 2 R=m 2 l = N 2 m ,2l , which is the number of processors used in Steps (D l ) and (C l ). As the computation proceeds from Steps (D1) to (Dn), the number of processors required increases, which reaches its maximum, i.e., N , when l = n.
Hence, the total number of processors required in Steps (D1) to
The computation of the L u1; u 2; :::; u l 's in Step (D l ) can be done in a similar way, which requires the same amount of execution time and the same number of processors. It is clear that Step (B) takes one local computation step.
In
Step (C l ), where 1 l n, we calculate Mu1; u 2; :::; u l,1 based on the Mu1; u 2; :::; u l 's, for all 1 u1; u 2; :::; u l,1 R, using the following equation, M ik u1; u 2; :::; u l,1 = R X u l =1 f k; i; u l Mu1; u 2; :::; u l,1 ; u l : (4) Our algorithm for Step (C l ) is given in Figure 4 . Again, for all 0 l n, the Mu1; u 2; :::; u l 's are arranged in the lexicographical order, and Mu1; u 2; :::; u l is stored in the processor group specified in Eq. Summarizing all the above discussions, we obtain Theorem 1.
1. for all u1; u 2; :::; u l,1 , 1 u1; u 2; :::; u l,1 R, do in parallel 2.
The processor groups specified by Eq. (1), where 1 u l R, compute
3.
M ik u1; u 2; :::; u l,1 using Eq. (4), and save the result in 
Processor Reduction and Scalability
In this section, we show that the number of processors in The- In the following, we show that our parallelized bilinear algorithm on a DMPC can be made fully scalable, that is, for all 1 p N = log N, multiplying two N N matrices can be performed by a DMPC with p processors in ON =p time, i.e., linear speedup can be achieved in the range 1::N = log N . (As indicated in Section 3, for matrix multiplication using bilinear algorithms, we have T N = Olog N.)
It is clear that in the implementation mentioned in Theorem 2, processors perform computation and communication at the matrix element level. That is, processors calculate one element level addition or multiplication during a computation step in Steps (D1) to (Dn) and Steps (C1) to (Cn); calculate one element level multiplication during a computation step in Step (B); send/receive one element during a communication step in Steps (D1) to (Dn) and Steps (C1) to (Cn). When there are fewer processors, it is necessary for processors to perform computation and communication at the block level. The idea is to partition the input/output matrices A, B, and C into blocks. When a matrix X = Xij of size N N is divided into blocks, i.e., the Xij's, of size s s, X is treated as a super-matrix of size, such that N = qs, and the blocks are treated as super-elements. Since the problem size is changed from N to q, the number of processors required in Theorem 2 is changed from N = log N to q = log q. Let r = dlog m qe. There are r levels of recursion. Then, processors perform computation and communication at the super-element (block) level, i.e., processors calculate one super-element level addition or a scalar-block multiplication during a computation step in Steps (D1) to (Dr) and Steps (C1) to (Cr); calculate one super-element level multiplication (i.e., multiplying matrices of size s s) during a computation step in
Step (B); send/receive one super-element during a communication step in Steps (D1) to (Dr) and Steps (C1) to (Cr). For super-matrices with super-elements, the running times in The first statement in Theorem 3 summarizes all sequential algorithms and parallelized bilinear algorithms for matrix multiplication on DMPC.
Extension to Other Parallel Systems
It is well known that multiplying two N N Theorem 3 can also be extended to distributed memory systems with other interconnection networks that cannot support one-toone communications in constant time.
Multiplying twosuper-matrices, where super-elements are ss submatrices, can be performed by a DMPC with q = log q processors in Os log q + Tps 2 log q time, assuming that the parallel system can support one-to-one communication in
OTp time. Let q be the largest integer such that q = log q p.
This implies that q = p log q, q = p log q 1= , and log q = log p. Hence, the overall running time is
Os log q + Tps 2 log q For example, it is well known that the Beneš network connecting p processors can support any one-to-one communication in Tp = Olog p time, provided that a mapping : f0; 1; 2; :::; p , 1g ! f,1; 0; 1; 2; :::; p , 1g is known in advance, which is really the case for matrix multiplication, where all communication steps are well defined. The Beneš network belongs to the class of hypercubic networks, including the hypercube network and many of its constant-degree derivatives such as cubeconnected cycles, shuffle-exchange, generalized shuffle-exchange, de Bruijn, k-ary de Bruijn, butterfly, wrapped butterfly, k-ary butterfly, Omega, Flip, Baseline, Banyan, and Delta networks. All these networks are computationally equivalent in the sense that they can simulate each other with only constant factor slow down [15] . Thus, all these hypercubic networks can support one-to-one communication in Tp = Olog p time. The value p given by the above Corollary is displayed in Figure 5 , assuming that = 2 :3755. Clearly, when d is a constant, our parallelized bilinear algorithm is not highly scalable for k-ary d-cube networks.
Closing Remarks
We have shown that any known sequential algorithm for matrix multiplication over an arbitrary ring with time complexity ON , where 2 3, can be parallelized on a distributed memory parallel computer (DMPC) in Olog N time by using N = log N processors. Such a parallel computation is cost optimal and matches the performance of PRAM. Furthermore, our parallelization on a DMPC can be made fully scalable, that is, for all 1 p N = log N, multiplying two N N matrices can be performed by a DMPC with p processors in ON =p time, i.e., linear speedup and cost optimality can be achieved in the range 1::N = log N . This unifies all known algorithms for matrix multiplication on DMPC, standard or non-standard, sequential or parallel. Extensions of our methods and results to other parallel systems are also presented. We believe that we have made significant progress in scalable parallel matrix multiplication (as well as solving many other important problems) on distributed memory systems, both theoretically and practically.
