Abstract
Introduction
Many sequential algorithms have been proposed for matrix multiplication, one of the most fundamental problems in sciences and engineering. The standard sequential algorithm takes ON 3 operations to multiply two N N matrices. Since Strassen's remarkable discovery of his ON 2:8074 algorithm [17] , successive progress has been made to develop fast sequential matrix multiplication algorithms with time complexity ON , where 2 3. The current best exponent is 2:3755 [3] .
The standard algorithm for matrix multiplication, Strassen's algorithm of [17] with 2:8074, and its Winograd's variation are used extensively in practice [5] . Other known asymptotically fast algorithms are not practically useful because of large overhead constants hidden in the big-O notation (this applies to all the known algorithms with 2:78), or because of the requirement of increasing memory space [6] . It is still plausible that faster practical algorithms for matrix multiplication will appear, which should motivate theoretical importance of parallelization of matrix multiplication algorithms, and in the case of Strassen's algorithm and its Winograd variation, such a parallelization should already have practical value.
On shared memory multiprocessors, Strassen's algorithm has been parallelized [2] . Furthermore, it is well known that two N N matrices can be multiplied under a CREW PRAM in Olog N time by using ON +"
processors, for any fixed positive ", as soon as we have an algorithm using ON arithmetic time for N N matrix multiplication [1, 15] . As a matter of fact, based on any of the known algorithms for N N matrix multiplication running in ON time, we yield Olog N parallel time using ON = log N arithmetic processors under the PRAM model. [4] . It was also reported that matrix multiplication can be done in constant time on a reconfigurable mesh with N 4 processors [16] . Such an implementation, though very fast, is far from cost-optimal.
To have fast and processor efficient parallel algorithms, it is necessary to consider non-standard algorithms. To the best of the authors' knowledge, all ON sequential algorithms with 3
have not been fully parallelized on any distributed memory systems, since these systems do not have sufficient capability to support complicated communication efficiently. In [4] , it was shown that matrix multiplication can be done in ON =p ,1=2 time on a hypercube with p processors, where 1 p N 2 . This algorithm is valid only in a small interval of p, which is not costoptimal (as compared to the ON sequential algorithm), and the shortest execution time is reached when p = N 2 is ON, which is very slow. The reason is that the ON algorithm is invoked sequentially to calculate submatrix products, and not parallelized at all.
To fully parallelize the fast sequential matrix multiplication algorithm on distributed memory systems, more powerful communication mechanism is required. It is clear that all existing realistic static networks with electronic connections have limited communication capability in supporting fast parallelization of an ON algorithm, where
3.
Recently, fiber optical buses have emerged as promising networks [11] . Pipelined optical buses can support massive volume of data transfer simultaneously, and can implement various communication patterns. Furthermore, a system with optical buses can be reconfigured into independent subsystems, which can be used simultaneously to solve subproblems in parallel [12] . It is now feasible to build distributed memory systems that are no less powerful and flexible than shared memory systems in solving many problems, such as Boolean matrix multiplication [7] and sorting [9] . Numerous parallel algorithms using optical interconnection networks have been developed recently [7, 8, 9, 10] .
On a linear array with a reconfigurable pipelined bus system (LARPBS) proposed in [12] . Furthermore, multiplying two N N matrices can be performed by an LARPBS with ON processors in Olog N time. This matches the performance of PRAM. Also, it is clear that the processor complexity is substantially reduced as compared with that in [8] .
The LARPBS Computing Model
A pipelined optical bus system uses optical waveguides instead of electrical signals to transfer messages among electronic processors. In addition to the high propagation speed of light, there are two important properties of optical signal (pulse) transmission on an optical bus, namely, unidirectional propagation and predictable propagation delay. These advantages of using waveguides enable synchronized concurrent accesses of an optical bus in a pipelined fashion. Such pipelined optical bus systems can support a massive volume of communications simultaneously, and are particularly appropriate for applications that involve intensive communication operations such as broadcasting, one-to-one communication, multicasting, global aggregation, and irregular communication patterns.
A linear array with a reconfigurable pipelined bus system (LARPBS) consists of N processors P1, P2, P3, ..., PN, connected by an optical bus. In addition to the tremendous communication capabilities, an LARPBS can also be partitioned into k 2 independent subarrays LARPBS1, LARPBS2, ..., LARPBS k , such that LARPBSj contains processors Pi j,1 +1, Pi j,1 +2, ..., Pi j , where 0 = i0 i1 i2 i k = N. The subarrays can operate as regular linear arrays with pipelined optical bus systems, and all subarrays can be used independently for different computations without interference (see [12] for an elaborated exposition). As in many other parallel computing systems, a computation on LARPBS is a sequence of alternate global communication and local computation steps. The time complexity of an algorithm is measured in terms of the total number of bus cycles in all the communication steps, as long as the time of the local computation steps between successive communication steps is bounded by a constant and independent of the problem size.
Perhaps the best way to understand the LARPBS computing model is to inspect the primitive operations that it can efficiently support. A number of basic communication, data movement, and aggregation operations on the LARPBS model implemented using the coincident pulse processor addressing technique have been developed [8, 12] . Each of these primitive operations can be performed in a constant number of bus cycles.
One-to-One Communication. Assume that processors Pi 1 , Pi 2 , ..., Pi q are senders, and processors Pj 1 , Pj 2 , ..., Pj q are receivers. In particular, processor Pi k sends a value xi k to Pj k , for all 1 k q simultaneously.
Broadcasting. Here, we have a source processor Pi, who sends a value x to all the N processors P1; P 2; P 3 ; :::; PN.
Multicasting. In a multicasting operation, we have a source processor Pi, who sends a value x to a subset of the N processors Pj 1 , Pj 2 , ..., Pj m . Multiple Multicasting. Assume that we have g disjoint groups of destination processors, G k = fPj k;1 ; P j k;2 ; P j k;3 ; :::g, 1 k g, and there are g senders Pi 1 , Pi 2 , ..., Pi g . Processor Pi k has value xi k to be broadcast to all the processors in G k , where 1 k g. Global Summation. Suppose that every processor Pj holds a numerical value vj, 1 j N, where vj is an integer or a floating-point value with finite magnitude and precision, we need to calculate the summation v1 + v2 + v3 + + vN.
The summation is finally saved in P1.
It has been a common practice in algorithm analysis to assume that a single manipulation (e.g., an arithmetic operation) takes constant time. This essentially implies that numerical values have finite magnitude and precision; otherwise, a manipulation either takes longer time, or requires extra hardware support. Therefore, our assumption in the global summation is quite reasonable. However, since in this paper, we are dealing with matrix multiplication on an arbitrary ring, we need the following general aggregation operation.
Global Aggregation. Suppose that every processor Pj holds a value vj, 1 j N, where vj is in an arbitrary set S with a binary associative operator , we need to calculate v1 v2 v3 vN. The result of the aggregation is finally saved in P1. We say that the size of the aggregation is N.
It may not be the case that all such kind of global aggregations can be implemented using optical signals in constant number of bus cycles. However, we can still use the ordinary binary tree method to find the aggregation in Olog N time, where the data communications can be easily supported by an optical bus. If N is a constant, such an aggregation requires constant amount of time. Fortunately, in this paper, we only use aggregations of constant sizes on subarrays whose sizes are independent of matrix sizes, and hence, their constant execution time is independent of the data set S and the definition of . We will mention the size of each aggregation explicitly.
The Strategy
In this section, we show how to parallelize any fast sequential matrix multiplication algorithm, with one limitation, namely, the execution time is bounded from below by Olog N. This limitation is due to the recursive nature of the class of bilinear algorithms, not to the communication constraints on a parallel system. Assume that the known sequential algorithm for multiplying two N N matrices has time complexity ON .
We follow the approach [1, pages 315-316] and [14, 15] . With no loss of generality, we will consider the class of recursive bilinear algorithms for the evaluation of the matrix product C = AB, where A = aij, B = b jk , and C = c ik are m m matrices. First of all, there is a basis bilinear computation that has three steps.
Step (a). In the first step, the values of 2R linear functions are computed, Lu = P 1i;jm fi; j; uaij, and L u = P 1j;km f j; k; ub jk , where u = 1 ; 2; :::; R.
Step (b). Then, in the second step, we compute the R products LuL u , for all 1 u R.
Step (c). Finally, in the third step, we calculate the m 2 outputs, c ik = P m j=1 aijb jk = P R u=1 f k; i; uLuL u ; for all 1 i; k m.
In the above computation, all the fi; j; u's, f j; k; u's, and f k; i; u's are constants. The value R is called the rank of the algorithm. For any positive " and any existent or plausible algorithm for N N matrix multiplication running in ON arithmetic time, we may fix a natural m and a basis bilinear computation with R m +" [13] . For all the known algorithms for N N matrix multiplication running in ON time for 3 (including the standard algorithm for = 3, Strassen's algorithm for = log 2 7 2:8074 [17] , and the ones of [3] for 2:3755, which are currently asymptotically fastest), we may yield R = m in the associated basis bilinear construction. (Note that for fixed m and , R is finite.)
The above bilinear algorithm can be used recursively. Let A = Aij, B = B jk , and C = C ik be N N matrices, where N = m n . Assume that m is fixed, and n ! 1 . Each of these matrices are partitioned into m 2 submatrices Aij, B jk , C ik of size m n,1 m n,1 . Then, the above computation is still applicable when all the aij's, b jk 's, and c ik 's are replaced by submatrices.
The recursive algorithm that computes C = A B is described below.
Step (0). If the matrices are of size m m, compute the product C = A B directly using the method above (i.e., Steps (a)-(c)), and return; otherwise, do Steps (1)-(3).
Step (1). Calculate 2R linear combinations of submatrices Lu = P 1i;jm fi; j; uAij; and L u = P 1j;km f j; k; uB jk ; for all 1 u R, where Lu and L u are m n,1 m n,1 matrices.
Step (2). Calculate the R matrix products Lu L u , for all 1 u R.
Step (3). Compute C ik = P m j=1 AijB jk = P R u=1 f k; i; uLu L u ; for all 1 i; k m. Step (1)). After the recursion in Step (2), all the C ik 's are also computed in parallel (see Step 
Implementation Details
In this section, we examine the implementation details of a bilinear algorithm on an LARPBS with m 2 R n processors. Our algorithm is called Fastn, which stands for the fastest bilinear algorithm parallelized on an LARPBS with pn = m 2 R n processors P1, P2, ..., Pp n , where pn is the total number of processors required in the above bilinear algorithm.
The number of processors required to achieve the above maximum parallelism is analyzed as follows. When n = 1 , we have the base case.
Step ( R processors, and Step (2) needs pn,1R processors. That is, pn = maxm 2n R;pn,1R. It can be proven by induction on n that pn = m 2 R n , for all n 1. Therefore, for a fixed m, and n ! 1 , the total number of processors used for multiplying two N N matrices, where N = n m , is pn = OR n = OR log m N = ON log m R = ON .
The Base Case
We first examine Fast1, that is, the base case, which calcu- Once processor Pu receives Lu and L u , it can compute the product LuL u using local computation. After Step (b) is finished, processor Pu has the values LuL u , for all 1 u R.
In Step (c), processors are called P i;k;u , where 1 i; k m, and 1 u R. The subarray P i;k; , which contains processors P i;k;1 , P i;k;2 , ..., P i;k;R , is used to compute c ik , for all 1 i; k m.
First, for all 1 u R, processor Pu sends the product LuL u to processor P i;k;u , for all 1 i; k m, via a multiple multicasting operation. Second, processor P i;k;u computes the value f k; i; uLuL u locally. Third, the summation c ik = P R u=1 f k; i; uLuL u is assembled by the processors in P i;k; using an aggregation operation of size R. The result c ik is then stored in P i;k;1 , for all 1 i; k m. System reconfiguration is required in this step. Finally, we need to perform a one-to-one communication to bring c ik from P i;k;1 to P i,1m+k , for all 1 i; k m, to meet the output data layout requirement.
It is clear that Steps (a), (b), and (c) can all be implemented using primitive data movement and communication operations, aggregations of constant sizes, and local operations, in constant number of bus cycles. Hence, we reach the following conclusion. 
The General Case
Now, let us look at algorithm Fastn in the general case where n 1. As mentioned before, the number of processors used by algorithm Fastn is pn = m 2 R n . Basically, we will show how each level of the recursion can be implemented in constant amount of time.
We assume that the input matrices A and B, as well as the output matrix C are all arranged in the shuffled-row-major order in the first N 2 processors of an LARPBS. Essentially, this means that if A = Aij of size m n m n is divided into blocks Aij of size m n,1 m n,1 , then the blocks are arranged in the row-major order. Furthermore, the arrangement of the blocks are defined recursively, and the base matrices of size mm are in the row-major order.
In
Step (0), we reach the base case. Therefore, algorithm
Fast1 is executed in constant time using p1 processors to multiply two base matrices of size m m.
Step (1) The matrix L u can be obtained in a similar way. Let us assume that after Step (1), Lu and L u are hold by super-processor Pu;1;1 in the shuffled-row-major order, for all 1 u R.
In
Step (2) , all the pn processors are employed. These pn processors will be divided into R subsystems LARPBS1, LARPBS2, ..., LARPBSR, and each subsystem LARPBSu has pn,1 processors used to calculate the matrix product Lu L u , for all 1 u R. For convenience, processors are also grouped into super-processors.
Super-processor Pu;1;1 sends Lu and L u to the first superprocessor in LARPBSu, for all 1 u R in parallel, using two one-to-one communications.
LARPBSu computes the product Lu L u by recursively invoking algorithm Fastn,1 and using pn,1 processors, where 1 u R.
After
Step (2) is finished, the first super-processor in LARPBSu has the matrices Lu L u , for all 1 u R.
In Step (3), again, only p 0 n processors are used. Once more, the p 0 n processors are grouped into super-processors P i;k;u , where 1 i; k m, and 1 u R, and each super-processor has m 2n,1 processors. These m 2 R super-processors are divided into m 2 subarrays P i;k; , and P i;k; is used to compute one submatrix C ik , where 1 i; k m. Each subarray P i;k; has R super-processors P i;k;1 , P i;k;2 , ..., P i;k;R , where P i;k;u holds the matrix product Lu L u , for all 1 u R.
For all 1 u R, the first super-processor in LARPBSu of
Step (2) 
Fewer Processors: 1 p N
In the last section, we provide the implementation details of parallelization of the fastest sequential matrix multiplication algorithm when there are sufficiently many processors. In reality, the number of processors available is not always enough for an application. In this section, we examine the case where the number of processors p is arbitrarily chosen in the range 1::N . Notice that Corollary 1 is well known on the PRAM model [15] .
Cost-Optimality
We now show that algorithm Fastn; p achieves linear speedup and is cost-optimal in a wide range of p for 
Concluding Remarks
We have developed an efficient parallelization of the fastest sequential matrix multiplication algorithm on a linear array with a reconfigurable pipelined optical bus system. The algorithm has linear speedup and cost-optimality in a wide range of choice for the number of processors. It turns out that for parallel matrix multiplication, a distributed memory system with optical interconnections like LARPBS compares favorably with shared memory systems, where the concurrent read capability is replaced by highly efficient communications with predictable senders and receivers.
As indicated in Section 3, under the PRAM model, ON = log N processors are enough to achieve Olog N parallel execution time. While such a parallelization is not difficult under the PRAM model, where all the processors share infinite common memory, it is more involved to implement such a parallelization in a distributed memory model. However, we believe that by carefully designing data distribution and scheduling the computation and communication, it is possible to parallelize a sequential ON matrix multiplication algorithm on an LARPBS in Olog N time by using ON = log N processors. This will be our next step of investigation.
