Abstract-We present efficient parallel matrix multiplication algorithms for linear arrays with reconfigurable pipelined bus systems (LARPBS). Such systems are able to support a large volume of parallel communication of various patterns in constant time. An LARPBS can also be reconfigured into many independent subsystems and, thus, is able to support parallel implementations of divide-and-conquer computations like Strassen's algorithm. The main contributions of the paper are as follows: We develop five matrix multiplication algorithms with varying degrees of parallelism on the LARPBS computing model, namely, MM 1
INTRODUCTION
ATRIX multiplication is one of the most fundamental and important problems in science and engineering. Many other important matrix problems can be solved via matrix multiplication, e.g., finding the Nth power, the inverse, the determinant, eigenvalues, an LU-factorization, and the characteristic polynomial of a matrix, and the product of a sequence of matrices [20] . Many graph theory problems are also reduced to matrix multiplication. Examples are finding the transitive closure, all-pairs shortest paths, the minimum-weight spanning tree, topological sort, and critical paths of a graph [6] , [12] . Therefore, fast and processor efficient parallel algorithms for matrix multiplication is definitely of fundamental importance.
The standard matrix multiplication algorithm takes O(N 3 ) operations. Most existing parallel algorithms are parallelizations of the standard method. For instance, matrix multiplication can be performed in O(N) time on a mesh with wraparound connections and N × N processors [7] ; in O(log N) time on a three-dimensional mesh of trees with N 3 processors [20] ;
in O(log N) time on a hypercube or shuffle-exchange network with N 3 processors [12] . It is clear that all implementations of the standard method have cost, i.e., time-processor product, of at least Ω(N 3 ). Therefore, it is interesting to develop highly parallel and processor efficient algorithms that have o(N 3 ) cost. It turns out that such an algorithm should not be a parallelization of the standard matrix multiplication algorithm.
Extensive research has been done to develop algorithms that require O(N β ) operations, such that β < 3. Strassen was the first one who was able to find an O(N log7 ) time algorithm, where log 7 < 2.8074, that recursively reduces the calculation of the product of two matrices of size N × N to the calculation of the products of seven submatrices of size N/2 × N/2 [46] . Chandra implemented Strassen's algorithm on shared memory multiprocessors in O(log N) time using N 2.8074 /log N processors [8] . It is well known that matrix product can be computed by a CREW PRAM in O(log N) time, using M(N) processors [38] , where M(N) = O(N β ) is the number of operations done by the best sequential algorithm. Since Strassen's initial work, the value β has been reduced a number of times. Currently, the best matrix multiplication algorithm takes O(N β ) operations, where β < 2.375477 [11] . However, these algorithms are highly sophisticated even for sequential execution (the constant involved in the O(N β ) term is so large that the break-even point with the standard matrix multiplication algorithm is way above practically interesting matrix sizes); parallelization of these methods are only of theoretical interest, and far from practical (see, e.g., [37, p. 624] and [20, p. 312] ). The reader is referred to [4] for more information on parallel matrix algorithms. A parallel algorithm on the PRAM model only reveals the maximum parallelism of a problem. Large scale parallel computing on a shared memory system is impractical, at least using the current technology. As for distributed memory systems, to the best of the authors' knowledge, even Strassen's algorithm has not been parallelized on any popular static network with time and processor complexities O(log N) and O(N 2.8074 ), respectively. (Note that Ω(log N) time is inherent in Strassen's recursive method and is independent of any implementation on any system.) It is unlikely that Strassen's algorithm can be parallelized on static networks with the above-mentioned performance, due to its large amount of data transfer and mismatch between its communication patterns and the structure of most of the commonly used networks.
Efficient communication is the most difficult issue in designing a high performance parallel system which includes architecture and technology that support high communication bandwidth and an algorithm that involves less communication. The communication capacity of static networks, such as meshes and hypercubes, is limited by their finite node degrees, while communication latency is in proportional to network diameters. Hence, increasing the size of a network may not result in further decrease in the time complexity of a parallel algorithm running on a static network. As a matter of fact, the time complexity of a nontrivial parallel computation is bounded from below by the network diameter.
One way to overcome this difficulty is to use electronic/optical buses for communication since they provide direct connection between any two processors in the system. Processor arrays with various bus systems have become increasingly interesting due to recent advances in VLSI and fiber optic technologies. Arrays of processors with global buses [5] , multiple buses [10] , hyperbuses [17] , multidimensional bus architectures [29] , and reconfigurable buses [30] have been proposed for efficient parallel computations. The time taken by a datum to go from one end of an electronic bus to the other end is a function of the length L of the bus. These functions can be L 2 , L, or log L, depending on the physical properties of the technology used to implement the bus [2] . Similarly, the time for a signal to traverse a given section of the bus is also unpredictable. Other difficulties associated with electronic buses are low bandwidth (since they cannot support concurrent accesses), temperature sensitivity, capacitive loading, and cross talk caused by mutual inductance.
On the other hand, fiber optic communication technologies offer a combination of high bandwidth, predictable message delay, low interference and error probability, and gigabit transmission capacity, and have been used extensively in wide-area networks. Clearly, the ability to control optical delays precisely can also be used in many ways to support high bandwidth multiprocessor interconnection. For instance, precise delays can be used for the buffers required in temporal rearrangement of TDM signals, for collision resolution in packet switched networks, and for synchronization of incoming packets on separate channels. Depending upon the material through which the signals propagate, one millimeter corresponds to three to seven picoseconds. With the precision available in mechanical layout, sub-picosecond time precision is achievable.
Based on the characteristics of fiber optical communications, several researchers have proposed using optical interconnections to connect processors in a parallel computer system [3] , [13] , [15] , [44] , [47] , [50] . In such a system, messages can be transmitted concurrently on a pipelined optical bus without partitioning the bus into several segments, while the time delay between the furthest processors is only the end-to-end propagation delay of light over a waveguided bus. This design integrates the advantages of both optical transmission and electronic computation, and leads to an entirely new realm of techniques for parallel algorithm development. Many parallel algorithms, such as the Hough transform, singular value decomposition, order statistics, sorting, and some numerical algorithms, have been proposed for systems with optical interconnections [14] , [16] , [26] , [31] , [32] , [33] , [40] , [41] , [43] , [45] . These works indicate that arrays with pipelined optical buses are very efficient for parallel computation due to the high bandwidth within a pipelined bus system.
More recently, linear arrays with reconfigurable pipelined bus systems (LARPBS) and arrays with reconfigurable optical buses (AROB) have been independently proposed by Pan et al. [34] , [35] , and Pavel and Akl [41] . In these systems, messages can be transmitted concurrently on a bus in a pipelined fashion and the bus can be reconfigured dynamically under program control to support algorithmic requirements. Many algorithms have been designed for these two models, including selection, quicksort, matrix operations, computational geometry, and PRAM simulation on the LARPBS model [22] , [23] , [24] , [28] , [34] , [36] , and on the AROB model [40] , [41] , [42] . A fundamental difference between the two similar models is that counting is not allowed during a bus cycle in the LARPBS model, while it is permitted in the AROB model. In fact, the LARPBS does not allow any processor involvement during a bus cycle, except setting switches at the beginning of a bus cycle. Recently, a linear array with a different pipelined optical bus system has been proposed in [27] , [50] . This system can implement most algorithms designed for LARPBS with much less hardware. Other simplified but equivalent models have also been developed [49] .
In this paper, we present efficient parallel matrix multiplication algorithms for linear arrays with reconfigurable pipelined bus systems. These kinds of systems are able to support a large volume of parallel communication of various patterns in constant time. An LARPBS can also be reconfigured into many independent subsystems and, thus, is able to support parallel implementations of divide-and-conquer computations, like Strassen's algorithm. The main contributions of the paper are as follows: We develop five matrix multiplication algorithms with varying degrees of parallelism on the LARPBS computing model:
• A linear time algorithm MM 1 (applicable to an arbitrary ring or a closed semiring) using N 2 processors, where each processor needs constant amount of memory; We have not seen parallel computing systems with conventional interconnection networks that solve the matrix multiplication problem in constant running time or support efficient implementation of Strassen's algorithm and/or provide such a wide range of cost and performance combinations. Our results have the following important implication, i.e., due to its high communication bandwidth, the versatile communication patterns it supports, and its ability of utilizing communication reconfigurability as an integral part of a parallel computation, the LARPBS is a powerful architecture for exploiting large degree of parallelism in a computational problem that most other machine models cannot achieve.
The rest of the paper is organized as follows. In Section 2, we explain pipelined optical buses, and linear arrays with reconfigurable pipelined bus systems. In Section 3, we describe the implementation of several primitive communication operations on LARPBS, which are building blocks of our algorithms, and also discuss time complexity measure in LARPBS computations. In Sections 4-6, we present three simple matrix multiplication algorithms MM 1 , MM 2 , and MM 3 . In Section 7, we develop two compound algorithms (combinations of simple algorithms), & 1 (⑀) and & 2 (δ). Section 8 concludes the paper.
LINEAR ARRAYS WITH RECONFIGURABLE PIPELINED BUS SYSTEMS
A pipelined optical bus system uses optical waveguides instead of electrical signals to transfer messages among (electronic) processors. In addition to the high propagation speed of light, there are two important properties of optical signal (pulse) transmission on an optical bus, namely, unidirectional propagation and predictable propagation delay. (Notice that electronic buses lack these characteristics.) These advantages of using waveguides enable synchronized concurrent accesses of an optical bus in a pipelined fashion [9] , [21] . Such pipelined optical bus systems can support a massive volume of communications simultaneously and are particularly appropriate for applications that involve intensive communication operations such as broadcasting, one-to-one communication, multicasting, global aggregation, and irregular communication patterns. Fig. 1 shows a linear array of N processors connected via an optical bus. Each processor is connected to the bus with two directional couplers, one for transmitting on the upper segment and the other for receiving from the lower segment of the bus [44] . Optical signals propagate unidirectionally from left to right on the upper segment and from right to left on the lower segment.
An optical bus contains three identical waveguides, i.e., the message waveguide for carrying data, the reference waveguide, and the select waveguide for carrying address information, as shown in Fig. 2 . For simplicity, the message waveguide, which resembles the reference waveguide, has been omitted from the figure. An optical pulse p can represent a binary bit 1, and the absence of a pulse represents bit 0. Messages are organized as fixed-length message frames, each frame containing at most b pulses. Therefore, if b = 8, the integer data 165 can be represented as a message frame (p, −, p, −, −, p, −, p), where − stands for the absence of a pulse.
The most important hardware parameters of an optical bus system are given as follows: τ : pipeline cycle, i.e., the time for an optical pulse to travel the section between two consecutive processors;
T: bus cycle, i.e., the time (2N − 1)τ for an optical pulse to propagate through the entire bus (
, plus (N − 1)ω delays introduced for reference and select pulses for the purpose of pipelined data transmission, plus T′ for message processing;
b: the length of a message frame. (It is assumed that a message frame is longer enough to hold one data item such as an array element. For processors to send their indices, b is at least log N.)
Messages transmitted by different processors may overlap with each other even if they propagate unidirectionally on the bus. We call these message overlappings "transmission conflicts." To ensure that there is no transmission conflict, it is required that τ > bω, so that each message frame can fit into a pipeline cycle τ, and that in a bus cycle T, up to N messages can be transmitted simultaneously without collisions on the bus, as long as all processors are synchronized at the beginning of each bus cycle.
To operate an optical bus in a pipelined fashion so that multiple data transfer can be performed in parallel, extra timing capabilities are necessary. First, one unit delay ∆ (shown as a loop in Fig. 2 ) is added between two consecutive processors on the receiving segments of the reference waveguide and of the message waveguide. Each loop is an extra section of a fiber and the amount of delay added can be accurately chosen based on the length of the segment. As a result, the propagation delays on the receiving segments of the select waveguide and the reference waveguide are no longer the same. Second, a conditional delay ∆ is added between any two consecutive processors P i−1 and P i , where 1 ≤ i ≤ N − 1 on the transmitting segment of the select waveguide (cf. Fig. 2 ). The switch between processors P i−1 and P i is called S(i) and is local to processor P i . Thus, every processor has its own switch except processor P 0 . The conditional delays can be implemented using 2 × 2 optical switches such as the Ti:LiNbO3 switches used in an optical computer [3] . Each switch S(i) can be set by the local processor P i to one of the two possible states, straight or cross, as shown in Fig. 3 . When a switch is set to straight, it takes τ time for an optical signal on the transmitting segment of the select waveguide to propagate from one processor to its successor. When a switch is set to FURVV, a delay ω is introduced, and such propagation takes τ + ω time. Clearly, the maximum delay that the S(i) switches can introduce is (N − 1)ω.
The coincident pulse addressing technique [9] , [21] , [44] , i.e., a time-division switching method, can be applied to route messages on an optical bus system. Using this approach, a source processor P i determines the relative time delay of a select pulse and a reference pulse so that they will coincide and produce a double-height pulse only at a destination processor P j . By properly adjusting the detecting threshold of the detector at processor P j , this double-height pulse can be detected, thereby addressing P j . The details of processor addressing and data communication are given in Section 3.
The capability of a linear array with a pipelined optical bus can be significantly enhanced by including switches for reconfiguring the bus system. The enhanced model, called a linear array with a reconfigurable pipelined bus system (LARPBS), originated in [34] . (See [35] for a more elaborated exposition.) In the LARPBS shown in Fig. 4 , we insert two optical switches, namely, RST(i), a 1 × 2 optical switch, on the section between P i and P i+1 of the transmitting segment, and RSR(i), a 2 × 1 optical switch, on the section between P i and P i+1 of the receiving segment, where 0 ≤ i < N. Both RSR(i) and RST(i) switches are controlled by processor P i . These switches are able to reconfigure a bus system into several independent subsystems that can operate in parallel. When all switches are set to straight, the bus system operates as a regular pipelined bus system. When RSR(i) and RST(i) are set to cross, the bus system is split into two separate systems, one consisting of processors P 0 , P 1 , ..., and P i , and the other consisting of P i+1 , P i+2 , ..., P N−1 . The total delay for a signal to pass the optical fiber between RST(i) and RSR(i) is made to be equal to τ. Hence, the subarray with processors P 0 to P i can operate as a regular linear array with a pipelined bus system; so does the subarray with processors P i+1 to P N−1 . Fig. 4 demonstrates the LARPBS model with N = 6 processors, where the array is split into two subarrays, with the first one having four processors and the second one having two processors. In the figure, only one waveguide is shown. Conditional switches are omitted in the figure to avoid confusion. Notice that the time for setting up switches for one reconfiguration is a constant. 
THE LARPBS COMPUTATIONAL MODEL
An LARPBS is a distributed memory system which consists of processors linearly connected by a reconfigurable pipelined optical bus system. Each processor can perform ordinary arithmetic and logic computations and interprocessor communication. All computations and communications are synchronized by bus cycles, which is similar to an SIMD machine. However, due to reconfigurability, an LARPBS can be divided into subsystems and these subsystems can operate independently to solve different subproblems, which results in an MIMD machine. Input/output can be handled in a way similar to an ordinary SIMD/MIMD machine; further discussion on this issue is beyond the scope of the paper since we are mainly interested in algorithm development and analysis.
In this section, we show in detail how to perform several basic communication, data movement, and global aggregation operations on the LARPBS model using the coincident pulse processor addressing technique. We are going to justify that all these primitive operations can be performed in a constant number of bus cycles. These powerful primitive operations that support massive parallel communications, plus the reconfigurability of the LARPBS model, make the LARPBS very attractive in solving problems that are both computation and communication intensive, such as matrix multiplication. We will see that optical buses are not only communication channels among the processors, but also active components and agents of certain computations, e.g., global data aggregations. The unique characteristics of optical buses have led to an entirely new computational model, and opened up rich avenues of algorithmic design, and novel and interesting algorithms (see chapter 10 of [2] ).
Time Complexity Measure
Let us first discuss the issue of measuring the time complexity of an LARPBS computation. As in many other parallel computing systems, a computation on LARPBS is a sequence of alternate communication and computation steps:
The time complexity of an algorithm is measured in terms of the total number of bus cycles in all the (m 1 + m 2 + L) communication steps, as long as the time of the n i local computation steps is bounded by a constant and independent of N. Notice that the time for setting up switches for one reconfiguration is a constant, and the number of system reconfiguration does not exceed, actually is far less than, the number of communications steps. Therefore, this part of the system overhead is negligible or can be merged into the big-O notation. The above complexity measure implies that a bus cycle takes constant time, which obviously needs more justifications. Recall that the bus cycle length T = (2N − 1)τ + (N − 1)ω + T′ is the end-to-end message transmission time over a bus (2N − 1)τ + (N − 1)ω plus the message processing time T′ for message generation, preparation, buffering, encoding, and decoding, etc., in the source and destination processors [15] , [42] . The ratio r = T′/τ of message processing time (which is comparable to processor speed) to message transmission time between successive processors is at least in the order of 10 3 , which implies that a bus cycle can be regarded as a constant for systems of sizes O(r). It has been observed that in parallel computing, the time to access a memory location in a global shared memory system, the time for a message to traverse a link in a direct network, and the time for a message to move to the next stage in an indirect network are all assumed to be constants [2] . Strictly speaking, all these times are in proportion to system size. Thus, the assumption that T is a constant has been adopted widely in the literature, and is consistent to other models [2] , [5] , [10] , [15] , [16] , [17] , [30] , [40] , [41] , [42] , [45] . As pointed out in the last section of the paper, our algorithms are highly scalable. Therefore, in a realistic system whose size is much less than the problem size, we still can run the algorithms such that linear speedup can be obtained. 
Primitive Operations
The following primitive operations on LARPBS are used in this paper, and our algorithms are developed by using these operations as building blocks. Implementation details of these operations on optical buses are given in Section 3.4.
One-to-One Communication
Assume that processors P P P
, , , K are senders and
, , , K are receivers. In particular, processor P i k sends its value in its register R(i k ) to the register R(j k )
in P j k . The operation is represented as
(Note that we use R(i) to denote both the name and the content of register R(i).)
Broadcasting
Here, we have a source processor P i which sends a value in its register R(i) to all the N processors:
Multiple Multicasting.
In a multicasting operation, we have a source processor P i which sends a value in its register R(i) to a subset of the N processors P P P
Assume that we have g disjoint groups of destination processors,
and there are g senders P P P
value R(i k ) to be broadcast to all the processors in G k , where 1 ≤ k ≤ g. Since there are g simultaneous multicastings, we have a multiple multicasting operation, which is denoted as follows:
It is important to point out that due to relative slow speed of the electronic processors, each processor can only read one message frame during one bus cycle. Thus, the g groups of destination processors must be disjoint.
Element Pair-Wise Operations
Let P i , P i+1 , ..., P j be a consecutive group of processors. We use R[i. 
Binary Prefix Sums
Assume that every processor P i , where 0 ≤ i ≤ N − 1, has a register R(i) which holds a binary value. We need to calculate s j = R(0) + R(1) + L + R(j), and save the result in R(j). The operation is represented as
Extraction and Compression
In this operation, every processor P j has a value x j , and we wish to extract those x j s that have certain property and compact them to the beginning of the linear array. In
, , , K are those values that have the property, where
after the extraction and compression operation.
Binary Value Aggregation
Assume that every processor P i , where 0 ≤ i ≤ N − 1 has a register R(i) which holds a binary value. We need to calculate R(0) + R(1) + L + R(N − 1) and save the result in R(0). The operation is represented as
Boolean Value Aggregation
Assume that every processor P i , where 0 ≤ i ≤ N − 1 has a register R(i) which holds a truth value. We need to calculate R(0) and R(1) and L and R(N − 1),
and save the result in R(0).
Integer Aggregation
The binary value aggregation operation can be generalized to calculate the summation of N integers.
Floating-Point Value Aggregation
The integer aggregation operation can be further generalized to calculate the summation of N reals.
Coincident Pulse Processor Addressing Technique
Before we go to the implementation of primitive operations, let us look at the coincident pulse technique. Assume that all the S(i) switches on the transmitting segments are set to straight, where 1 ≤ i ≤ N − 1, so that no delay is introduced for the select pulses. Suppose that a source processor P i wants to send a message to a destination processor P j .
Processor P i sends a reference pulse on the reference waveguide at time t ref , i.e., the beginning of a bus cycle, and a message frame at time t ref on the message waveguide, which propagates synchronously with the reference pulse sent by P i . Processor P i also sends a select pulse at time t sel (j) on the select waveguide. Whenever a processor P j detects a coincidence of a reference pulse and a select pulse, it reads the message frame. Thus, in order for processor P i to send a message to processor P j , we need to have the two pulses coincide at processor P j . This happens if and only if
where 0 ≤ i, j < N. Fig. 5 gives an address frame relative to a reference pulse for addressing processor P j . An address frame is a sequence of N consecutive time slots s N−1 , s N−2 , ..., s 1 , s 0 , at the beginning of a bus cycle. Each time slot s j has duration ω. The starting time of slot s j is t sel (j) = t ref + (N − j − 1)ω. Within an address frame, processor P i has the chance to address one or more destinations. Obviously, processor P j is expected to receive the message sent by P i if and only if there is a select pulse at s j . In Fig. 5 , the select pulse is delayed for (N − j − 1)ω seconds relative to the reference pulse. Hence, the two pulses meet at processor P j after the reference pulse goes through N − j − 1 delays, and processor P j receives the message at time
where (N − 1 − i)τ is the amount of time for the reference pulse and the message frame to travel from P i to P N−1 on the transmitting segment, (N − j)τ is the amount of time for the reference pulse and the message frame to travel from P N−1 to P j on the receiving segment (excluding the extra loop delays), and (N − 1 − j)ω is the amount of time due to loop delays. Since the select pulse is initiated at time t sel (j), and takes (2N − i − j − 1)τ time to reach P j on the receiving segment, the select pulse meets with the reference pulse and the message frame at the right time t ij at the right place P j .
Implementation of Primitive Operations
We now present in detail how the 10 primitive operations defined in the last subsection are implemented on reconfigurable pipelined optical buses. Readers only interested in algorithmic aspects can skip this section (which is on optical engineering implementation details) without loss of continuity.
One-to-One Communication
One-to-one communication can be implemented in one bus cycle as follows. All the P s i k send a reference pulse and a message frame containing R(i k ) at time t ref , and a select pulse at time t sel (j k ). Whenever a processor P j k detects a coincidence of a reference pulse and a select pulse, it reads the message frame to its R(j k ).
Broadcasting
To implement the broadcast operation, we set all the S(i) switches on the transmitting segments to straight. The source processor P i sends a reference pulse at the beginning of its address frame. It is also clear that to broadcast a message, P i needs to select all the N processors. Thus, if the source processor P i sends N consecutive select pulses in its address frame on the select waveguide, as shown in Fig. 6 , every processor on the bus detects a double-height pulse and thus reads the message.
Multiple Multicasting
The implementation of this operation is similar to that of broadcast, that is, processor P i sends select pulses at times t sel (j 1 ), t sel (j 2 ), ..., t sel (j m ). It is clear that the multiple multicasting operation can also be implemented in one bus cycle.
Element Pair-Wise Operations
It is clear that element pair-wise operations are similar to one-to-one communications. We first perform a one-to-one communication, that is, P n+k−1 sends R(n + k − 1) to P m+k−1 for all 1 ≤ k ≤ N 2 , and then the P m+k−1 s conduct a local operation ⊕.
Binary Prefix Sums
< L < i 1 and all other values are 0 (we assume that R(0) = 0 for the moment). We divide the N processors into (m + 1)
and we assume that i m+1 = 0 and i 0 = N, for convenience.
Thus, the goal is to let processor P j know the value (m + 1 − k)
if P j ∈ G k . The binary prefix sums operation takes four bus cycles.
• In the first cycle, each processor P i sends a message frame that contains its index i, a reference pulse, and a select pulse at time slot s N−1 . In other words, all processors try to send its index to P N−1 . Furthermore, processor P i sets S(i) to cross if R(i) = 1, and straight if R(i) = 0. Such a switch setting changes the destination of some of the N messages due to the delays introduced by the switches S(i), where i ∈ {i 1 , i 2 , ..., i m }. In particular, the messages sent by processors in G k are received by P N−k , where 1 ≤ k ≤ m + 1. However, processor P N−k only reads the first message that arrives, i.e., index i k−1 − 1.
• In the second bus cycle, processor P N−k , which received something during the first cycle, sends the value (k − 1) to processor P i k − − 1 1 for all 1 ≤ k ≤ m + 1. This is a one-to-one communication.
• In the third bus cycle, processors
, broadcasts the value (k − 1) it received in the second bus cycle to all other processors in its subarray.
• In the fourth bus cycle, processor P 0 broadcasts m to all processors and all P j save
(Notice that every P j knows that it belongs to G k from the value k − 1 it received in the third bus cycle.)
The above solution works only when R(0) = 0 initially. The reason is that processor P 0 does not have an S(0) switch that causes a delay for its select pulse (cf. Fig. 2 ). To solve this issue, we need one extra bus cycle in which processor P 0 broadcasts its initial value to all processors, and all processors add the value to their final result.
Extraction and Compression
To implement this operation, we first invoke the binary prefix sums operation, where R(j) = 1 if and only if x j has the property. In particular, P i k knows the value k and its destination P k−1 for all 1 ≤ k ≤ m. Then, one more one-to-one communication brings all the x j s to the right processors.
Binary Value Aggregation
Assume that
and all other values are 0. Thus, the goal is to let processor P 0 know the value m. It is clear that this operation is simpler than the binary prefix sums operation. The aggregation operation takes two bus cycles.
• In the first cycle, processor P 0 sends a dummy message z, a reference pulse, and a select pulse at time slot s N−1 . In other words, processor P 0 tries to send a message to P N−1 . Furthermore, processor P i sets S(i) to cross if R(i) = 1, and straight if R(i) = 0. Such a switch setting changes the destination of message z due to the delays introduced by the switches S(i), where i ∈ {i 1 , i 2 , ..., i m }. In particular, it is easy to see that processor P N−1−m receives the message z.
• In the second bus cycle, processor P N−1−m , which received something during the first cycle, sends the value m back to processor P 0 .
After the two bus cycles, processor P 0 obtains the value m. The above solution works only when m ≤ N − 1 or R(0) = 0 initially. To solve this problem, we let processor P 0 add 1 to the value received in the second bus cycle if R(0) = 1 initially.
Boolean Value Aggregation
It is clear that the above algorithm is also applicable to boolean values, where 1 means true, and 0 false. The logical DQG of the R(i)s is true if and only if the sum m is N. The logical RU of the R(i)s is true if and only if the sum m is not zero.
Integer Aggregation
The binary summation operation can be generalized to calculate the summation of N integers. Assume that each 
via local calculation. The entire integer aggregation operation takes O(d) time, which is constant time as long as d is a constant. Multiple independent aggregation is also possible by reconfiguring the system. For unbounded integer values we can conduct log M simultaneous binary value aggregations using log MN processors, where M is the largest magnitude. In particular, we assume that there are N values x 0 , x 1 , ..., x N−1 to be aggregated, where
K is a log M-bit integer, and is in processor P j initially for all 0 ≤ j ≤ N − 1. The log MN processors are divided into log M groups G k = {P kN , P kN+1 , ..., P (k+1)N−1 }, where k = 0, 1, ..., log M − 1.
• First, we perform N multicasting simultaneously such that x j is available to all P P P
, for all 0 ≤ j ≤ N − 1.
• Second, the linear array with log MN processors is re-
Processors in G k extract the kth bit x j,k of the x j s via local computation and then perform a binary value aggregation, so that σ k is calculated and put into P kN , where 0 ≤ k ≤ log M − 1.
• Third, processors P kN send the σ k s to P P P M
by a one-to-one communication.
After the above three steps-each takes constant time-we reduce the problem of aggregating N integers into aggregating log M integers, which can be added using the ordinary binary tree method in O(log log M) time. As a matter of fact, the number of processors can be reduced to Nφ(M), where
such that there are φ(M) groups of processors, and that each group calculates log log M σ k s. This increases the time complexity of the second and the third steps from constant to O(log log M), which is still within the overall execution time. . ,
Floating-Point Value Aggregation
where r is an implicit radix, and the exponents are in an excess-2 q−1 code. For instance, in the IEEE 754 floatingpoint standard found in virtually every computer invented since 1980, we have p = 23, r = 2, and q = 8 for 32-bit, single precision numbers, and p = 52, r = 2, and q = 11 for 64-bit, double precision numbers [18] . Notice that the precision is 2 −(p+q−1) and the magnitude is in the order of , where x j is stored in P j , for all 0 ≤ j ≤ N − 1. We assume that r = 2.
• First, the maximum exponent e max = max(e 0 , e 1 , ..., e N−1 ) is found. This is accomplished by applying the extraction and compression operation for q times. Let S 0 be the set of all e j s initially. In the kth repetition, 1 ≤ k ≤ q, we extract from S k−1 the set of e j s with e j,k = 1, and compact these e j s into S k ; if S k = ∅, then we reset S k = S k−1 . In O(q) time, we will find e max in S q .
• Second, e max is made available via broadcasting to all processors, and each P j normalizes e j by shifting m j . This step takes a small amount of local computation time.
• Third, the shifted m j s are added as if they are integers, and this requires O(p) bus cycles.
• Finally, processor P 0 performs normalization of the summation via local computation. 
AN O(N) ALGORITHM USING N 2 PROCESSORS
We start with a linear time algorithm MM 1 that calculates the product of two N × N matrices by using 
We find that Cannon's algorithm [7] , [19] can be easily implemented on the LARPBS model. For convenience, we label the N 2 processors using pairs (i, j), 1 ≤ i, j ≤ N, and processors are linearly ordered in terms of lexicographical order. In other words, processor P(i, j) corresponds to P (i−1)N+(j−1) in the linear array of processors shown in Fig. 1 . Each processor P(i, j) has three registers A(i, j), B(i, j), and C(i, j). Elements of arrays A and B are initially stored in A and B registers in the row-major order, i.e., a ij in A(i, j), and b ij in B(i, j). Finally, array C is stored in C registers in the row-major order, i.e., element c ij is found in
The algorithm involves an initial alignment of rows of A and columns of B, that are actually two one-to-one communications: After such shifts, we have the following data distribution: 7  0 5  2 7 2 7  2  7  2 7 2 7  0 5  2 7 2 7  2  7 2 7 2 7 0 5
Then we have a sequence of N alternate local computation and row/column cyclic shifts: 
AN O(1) ALGORITHM USING N 3 PROCESSORS
In this section, we present a constant time algorithm MM 2 to calculate the product of two N × N integer matrices using N 3 processors in the LARPBS model. This algorithm is applicable to integers and floating-point values of bounded precision and magnitude, and Boolean values. For convenience, we label the N 3 processors using trip-
Processors P(i, j, k) are ordered in the linear array using the lexicographical order. We use P(i, j, * ) to denote the group of consecutive processors P(i, j, 1), P(i, j, 2), ..., P(i, j, N). It is noticed that the original system can be reconfigured into N 2 subsystems, namely, the
The input and output of algorithm MM 2 are specified as follows: Initially, elements a ij and b ji are stored in registers A (1, i, j) and B(1, i, j) , respectively, for all 1 ≤ i, j ≤ N. If we partition A into N row vectors A 1 , A 2 , ..., A N , and B into N  column vectors B 1 , B 2 , ..., B N ,
then the initial data distribution is as follows:
(Other processors are not shown here for simplicity, since they do not carry any data initially.) In other words, the two input matrices A and B are put in the beginning of the linear array in the row-major and the column-major orders, respectively. This is a quite simple and easy-to-manage way of preparing the input data. When the computation is done, c ij is found in C (1, i, j) .
then the output layout is
Again, such an arrangement makes it easy for C to be used as an input to another computation. The algorithm proceeds as follows: First, we change the placement of matrix A in such a way that element a ij is stored in A(i, 1, j) for all 1 ≤ i, j ≤ N. This can be accomplished by a one-to-one communication:
After such replacement, we have the following data distribution,
where the linear array of N 3 processors are logically arranged as a two dimensional N × N array, and each element in the array stands for a group of N processors P(i, j, * ). The symbol "−" means that the A and B registers are still undefined. Next, we distribute the rows of A and columns of B to the right processors, such that processors P(i, j, * ) hold A i and B j , for all 1 ≤ i, j ≤ N. This can be performed using multiple multicasting operations:
After multicasting, the data distribution is as follows: 1  1  1  2  1   2  1  2  2  2   1  2 , , , 
At this point, processors are ready to compute. In particular,
Then, the values C(i, j, k), 1 ≤ k ≤ N, are assembled using integer aggregation operations, and the result c ij is in C(i, j, 1), as shown below,
Here, for the purpose of multiple aggregations, the original system is reconfigured into N 2 subsystems P(i, j, * )s, for 1 ≤ i, j ≤ N. After calculation, the data distribution is as follows: , , ,
Finally, one more data movement via one-to-one communication brings the c ij s to the right processors:
It is clear that each step in the algorithm involves a simple local computation, or a primitive communication operation and, hence, takes O (1) It is observed that in virtually all sequential and parallel algorithm design and analysis, it is assumed that an arithmetic operation can be performed by a processor in O(1) time, regardless of the magnitude/precision of the operands. In other words, the maximum magnitude M of integers and p and q for floating-point number representations are all fixed. Therefore, Theorem 2 is already enough for most applications and is theoretically very reasonable. However, algorithm MM 2 can also be extended to handle matrices containing integers/reals with unbounded magnitude and precision. We present the following results and leave out the details. 
AN O(LOG N) ALGORITHM USING O(N 2.8074 ) PROCESSORS
The matrix multiplication algorithm MM 3 described in the present section is a parallelization of Strassen's recursive method, which is applicable to an arbitrary ring. First of all, let us describe Strassen's strategy. Without loss of generality, we assume that N is a power of two, i.e., N = 2 n , for some n > 1. Given two matrices, . The reconfigurability of the LARPBS model is crucial in such an implementation. Also, a pipelined optical bus architecture can efficiently support all necessary data transfer in the parallelized Strassen's algorithm.
Let the p = N log7 processors be labeled as P 0 , P 1 , ..., P p−1 .
Each processor P i has six registers A(i), B(i), C(i), X(i), Y(i), and M(i), for each level of the recursion. Hence, O(log N) amount of memory is required in each processor. Let Π(i, j) denote a partition of an LARPBS system consisting of consecutive processors P i , P i+1 , ..., P j . The original system is Π(0, p − 1). When we mention a partition Π(i, j), it is assumed that the original system has been reconfigured such that Π(i, j) is a complete and independent LARPBS system. We use R[i.
.j] as an abbreviation of the registers R(i), R(i + 1), ..., R(j). Algorithm MM 3 is recursively defined. Our first call to the algorithm is
which means calculating the product C of matrices A and B of size N using all the N log7 processors. Later on, Algorithm MM 3 will be invoked to calculate submatrices of C of smaller sizes using partitions of the system. Thus, our description of
is general, where j − i + 1 = N log 7 .
The input arrays A and B are stored in the A and B registers of the first N 2 processors of the partition, both in the row-major order. The output array C is also stored in the C registers of the first N 2 processors of the partition, in the row-major order.
The processors P i , P i+1 , ..., P j are divided into seven groups of equal size or the partition Π(i, j) is reconfigured into seven independent subpartitions during recursion:
where 1 ≤ g ≤ 7, and
We wish the input arrays being stored within the range of Π 1 . This is possible only when N > 8, as shown in Table 2 .
Thus, if N ≤ 8, we reach the base case of the recursion and calculate C directly using any method, e.g., the sequential algorithm. This takes constant time.
In general, when N > 8, we calculate X g , Y g , and M g using the partition Π g , and all seven computations are done in parallel. To this end, we first make A and B available to all seven partitions via multiple multicasting, though a partition may only need parts of A and B. and subtraction is involved in each partition Π g . Then, the M g s are obtained recursively on Π g , for all 1 ≤ g ≤ 7, in parallel: ). We would like to point out that since no global aggregation (which is useful only for certain types of data) is conducted, our implementation of Strassen's algorithm is applicable to all rings. The time complexity O(log N) is due to the nature of Strassen's algorithm, not the physical limitation of LARPBS systems. It is unlikely that Strassen's algorithm can be easily parallelized on static networks such as meshes and hypercubes with the same performance as that of MM 3 , nor on ordinary bus systems with limited bandwidth.
COMPOUND ALGORITHMS
A compound algorithm is a combination of multiple existing algorithms solving the same problem. Two compound algorithms for matrix multiplication on LARPBS, i.e., & 1 (⑀) and & 2 (δ), are given in this section. As a matter of fact, & 1 (⑀) and & 2 (δ) are algorithm schemes with a parameter, such that they have adjustable time complexity in different levels. Without loss of generality, we assume that N is a power of two in the following discussion.
We ).
The above discussion can be summarized as Since the set of Boolean values with operations DQG/RU is a closed semiring (which is not a ring), Strassen's algorithm is not directly applicable. However, Boolean matrix multiplication can be treated as integer matrix multiplication over the ring Z N+1 (cf. Section 6.6 in Recently, the well-known Four Russians' algorithm has been parallelized on the LARPBS model, which has constant execution time and uses O(N 3 /log N) processors [22] .
CONCLUDING REMARKS
We developed a number of matrix multiplication algorithms on linear arrays with reconfigurable pipelined bus systems. 2 and & 2 (δ) to unbounded integers and real values were also discussed in this paper. Many other important matrix problems and graph theory problems can be solved using a matrix multiplication algorithm as a subroutine [24] . Hence, the algorithms developed in this paper can be used as subroutines for solving a large class of problems. We have not seen parallel computing systems with conventional interconnection networks that solve the matrix multiplication problem in constant running time or support efficient implementation of Strassen's algorithm, and/or provide such a wide range of cost and performance combinations. Our results have the following important implication, namely, due to its high communication bandwidth, the versatile communication patterns it supports, and its ability of utilizing communication reconfigurability as an integral part of a parallel computation, the LARPBS is a powerful architecture for exploiting large degree of parallelism in a computational problem that most other machine models cannot achieve.
To conclude the paper, we would like to briefly discuss the scalability of our algorithms. While the algorithms presented in this paper are fast and processor efficient in the theoretical sense, it is less practical to build an LARPBS system with O(N c ) processors, where c > 1 for large scale matrix computations. For a system with fixed size, a large matrix should be partitioned into submatrices such that processors can perform computations and communications at the submatrix level. In other words, our primitive operations should be extended to handle blocks of data, not just a single data item. Some initial work in this direction has been reported in [48] . When the system size is far less than the problem size, the parallel time complexity can generally be represented
as O(T(N)/p + T comm (N, p) ), where N is the problem size, p is the number of processors available, T(N) is the time complexity of the sequential algorithm being parallelized, and T comm (N, p) is the overall communication overhead of a parallel implementation. We say that a parallel implementation is scalable in the range [1. .P] if linear speedup can be achieved for all 1 ≤ p ≤ P. The implementation is highly scalable if P is as large as Θ(T(N)/(T * (N)(log N) k )), where k ≥ 0 is some constant, and T * (N) is the best possible parallel time. The implementation is fully scalable if k = 0. A fully scalable parallel implementation means that the sequential algorithm can be fully parallelized, and communication overhead in parallelization is negligible. We have obtained the following results, namely, there is a fully scalable implementation of the standard matrix multiplication algorithm on LARPBS, and there is a highly scalable parallelization of Strassen's algorithm on LARPBS with k = 2.4771 ... . Due to space limitation, we will report this part of the research in a separate paper [25] .
