The Reconfigurable Array with Spanning Optical Buses (RASOB) architecture provides flexible reconfiguration and strong connectivity with low hardware and control complexity. We use parallel implementations of matrix transposition and matrix multiplication as examples to show how the architectural capabilities can be exploited in designing efficient parallel algorithms.
Introduction
Reconfigurable architectures are attractive because they provide alternatives to completely connected systems at lower implementation costs. Since optical interconnects can offer many advantages over their electronic counterparts, including high connection density and a relaxed bandwidth-distance product, they will soon be a viable alternative for multiprocessor interconnections [1, 2, 3, 4]. This paper describes the Reconfigurable Array with Spanning Optical Buses (RASOB) architecture, which provides flexible reconfiguration as well as rich connectivity at low hardware and control complexity. Some of the issues related to the implementation of optical bus systems can be found in [5].
A unique feature of the RASOB architecture, which distinguishes it from other two-dimensional arrays with either optical or electronic buses [6, 7, 8], is that there is a direct connection between any two processors, not just between those having the same row or column number. Specifically, a direct connection between a source and a destination located at different rows and columns can be established by setting an electro-optical switch [9, 10] that interconnects the source row and the destination column. We will refer to the operation of setting switches as hardware reconfiguration in a RASOB.
As a part of the reconfiguration process, the RASOB architecture also takes advantage of two important properties of optical transmission, namely, unidirectional propagation and predictable unit propagation delay. More specifically, the processors in a RASOB can be programmed to send and receive messages under synchronized control, such that a connection between a source and a destination is established by letting the source send a message at a specific point of time and letting the destination receive the message at another specific point of time [6, 11]. We refer to this type of reconfiguration as software reconfiguration.
Because some of the reconfiguration is done in software, the complexity of both the hardware and the control required for reconfiguration can be kept low. Despite its low control and hardware complexity, the proposed RASOB architecture provides flexible reconfiguration that leverages the high communication bandwidth available in optical interconnects and is thus promising for efficient parallel implementation of many communication-intensive algorithms.
The rest of the paper is organized as follows. Section 2 describes the RASOB architecture, in which both software and hardware reconfiguration are discussed. In Section 3, parallel matrix transposition and multiplication algorithms are developed, which serve as examples of how the architectural capabilities of the RASOB can be exploited. It is shown that matrix transposition can be done in just two communication phases. Depending on whether or not high-speed broadcasting is supported, matrix multiplication can be done with 2 and $(2n-2)$ communication phases, respectively. Finally, we conclude the paper in Section 4.
Architectural Model
The RASOB architecture is similar to the array structure described in [12]. A main difference is that in the proposed architecture, messages are sent and received according to specific timing requirements. This makes the proposed architecture suitable for SIMD applications. On the other hand, the structure in [12] employs an addressing mechanism which supports MIMD applications at higher hardware and control complexity. Figure 1 illustrates the architecture of a RASOB. As shown in Figure 1(a), there are n folded row buses and n folded column buses interconnecting the processor array. Each processor has a transmitting interface to the upper segment of a row bus, and two receiving interfaces to the lower segment of the row bus and the right segment of a column bus, respectively. Note that each processor may well be replaced with a module of a number of processors that are electronically interconnected in a desirable way. The processors in a module would then share the same transmitting and receiving interfaces of the module via either wavelength or time division multiplexing.
A distinct architectural feature of the RASOB is that a $2 \times 2$ electro-optical switch is placed at the intersection of a row bus and a column bus, as shown in Figure 1(b). When the switch is set to "straight", a message arriving along a row bus will continue propagating on the row bus; otherwise, the message will be switched to the column bus instead. During a specific period when all the switches at a given row are in the straight state, messages will propagate only on the row bus. As a result, processors at a row communicate only with other processors at the same row. This type of communication is referred to as "Row communications" and the period during which row communications are accomplished is referred to as a Row phase.
A processor may also communicate with a processor at a different row, which may or may not be at a different column. This type of communication is referred to as "Column communications" and is accomplished by switching the message from a row bus to the desired column bus during a period called a Column phase. In doing so, the switches are set to "cross" for the duration of the message and then changed back to the straight state.
Changing the state of the switches in a Column phase is an example of what we call hardware reconfiguration. In contrast, software reconfiguration in this paper refers to the programming of the processors so that they will send and receive messages at specific points of time. In the following sections, we first illustrate software and hardware reconfiguration and then discuss both the hardware and control complexity of a RASOB as well as its connectivity.
Software Reconfiguration
In a Row phase, each row bus operates independently of the others, so it is sufficient to describe just one row bus (e.g. row bus r), as shown in Figure 1(c). In the following presentation, we will denote the processors at row r from left to right by $p(r,1)$, $p(r,2)$, ... and $p(r,n)$, respectively.
There are two important optical transmission properties, namely, unidirectional propagation and predictable propagation delay of the optical signals, that make concurrent access to an optical bus possible. More specifically, with an appropriate spatial separation between neighboring processors, message collision can be avoided even when the processors are transmitting messages concurrently [6, 11]. In the following discussions, we assume that each processor on a row bus is separated in time by $D = bw + \delta$ (seconds) from its neighbors, where b is the maximal length of a packet in bits, w is the optical pulse width (or bit duration) in seconds, and $\delta > 0$ is a guard band used to tolerate synchronization error to a certain degree. This temporal separation can be achieved by separating two neighboring transmitter interfaces on the upper segment, as well as two neighboring receiver interfaces on the lower segment of a row bus, with a fiber of length $Dc$, where c is the speed of light in the fiber, as shown in Figure 1(c). Without loss of generality, we assume that the length of the folded part, which is the separation between the transmitter and receiver interfaces of $p(r,1)$, is also made equivalent to D.
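As a rough numerical illustration of this spacing, the sketch below evaluates the separation and the corresponding fiber length. The 32-bit packet and the 0.4 ns bit time (2.5 Gb/s) are the figures used in the concluding remarks; the 1 ns guard band and the fiber propagation speed of 2.0e8 m/s are assumptions chosen only for the example, and the function names are hypothetical.

```python
def neighbor_separation(bits, bit_time_s, guard_s):
    """Temporal separation between neighbors: packet duration plus guard band."""
    return bits * bit_time_s + guard_s


def fiber_length_m(delay_s, speed_m_per_s=2.0e8):
    """Fiber length that produces the given delay.

    The default speed (about two thirds of the vacuum speed of light)
    is an assumed typical value, not a figure from the paper.
    """
    return delay_s * speed_m_per_s


# 32-bit packets at 2.5 Gb/s (0.4 ns per bit), with an assumed 1 ns guard band.
D = neighbor_separation(bits=32, bit_time_s=0.4e-9, guard_s=1.0e-9)
spacing = fiber_length_m(D)
print(f"D = {D * 1e9:.1f} ns, fiber between neighbors = {spacing:.2f} m")
```

Under these assumed values, neighboring interfaces would be separated by a few meters of fiber, which in practice could be coiled between the processors.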
We may use the train loading/unloading model to describe the operations in a row phase. Let us imagine that at the beginning of a row phase, a train (or motorcade) of n cars is originated at the rightmost end of the upper segment of the row bus. Each car can be regarded as an empty packet slot with a duration of D and is numbered 1 through n from left to right.
During a Row phase, the switches that connect the row bus with the column buses are in the "straight" state so that the train will run through the lower segment of the row bus. A simple assignment of the cars is to let processor $p(r,1)$ use car 1 for sending its packet, let $p(r,2)$ use car 2 for sending its packet, and so on.
With this assignment of the cars, the time when $p(r,i)$ may transmit its packet to $p(r,j)$ for any j, relative to the beginning of the Row phase, is given by
$$\mathrm{RowSend}[(r,i) \to (r,j)] = (n-1)D \tag{1}$$
As a result, all processors will be transmitting simultaneously because the transmitting time does not depend on i. In addition, a receiving processor can determine the exact time when the car carrying the packet will arrive at its receiver interface. More specifically, if processor $p(r,i)$ is expecting a packet sent by $p(r,j)$, it can calculate the time at which it should pick up the packet as
$$\mathrm{RowRec}[(r,i) \leftarrow (r,j)] = (n-1)D + (i+j-1)D = (n+i+j-2)D \tag{2}$$
By placing all the processors under synchronized control and letting each processor send and receive at specific points of time as in Eqs. 1 and 2, the row bus can be reconfigured into a variety of interconnection patterns.
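The row timing can be sketched in a few lines of Python. The sketch takes the common send time to be the $(n-1)D$ term of Eq. 2, which is an assumption consistent with the statement that the send time does not depend on i. The helper `wrap` and the cyclic-shift pattern are hypothetical names introduced only to show one interconnection pattern that the row bus can be programmed into.

```python
def wrap(x, n):
    """Map any integer onto the index range 1..n (cyclic indexing)."""
    return 1 + (x - 1) % n


def row_send(n, D=1.0):
    """Common send time of every processor in a Row phase (assumed (n - 1) * D)."""
    return (n - 1) * D


def row_rec(n, i, j, D=1.0):
    """Eq. 2: time at which p(r, i) picks up the packet sent by p(r, j)."""
    return (n + i + j - 2) * D


def cyclic_shift_schedule(n, shift, D=1.0):
    """Receive time of each p(r, i) when it listens to p(r, [i - shift]_n)."""
    return {i: row_rec(n, i, wrap(i - shift, n), D) for i in range(1, n + 1)}


schedule = cyclic_shift_schedule(n=4, shift=1)
for receiver, t in schedule.items():
    print(f"p(r,{receiver}) receives at t = {t} D")
```

Choosing a different source for each receiver only changes the pick-up times, so any one-to-one pattern within a row can be set up without touching the switches.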
Hardware Reconfiguration
If a processor needs to communicate with another processor at a different row, it has to send a packet in a Column phase. The train loading/unloading model used previously is also useful in illustrating the principles involved in column communications. More specifically, we will let car 1 of the train make a turn, from the lower segment of a row bus, onto column bus n, car 2 make a turn onto column bus $(n-1)$, and so on. For simplicity, we assume that the switches are placed near the receiver interfaces so that the propagation delay between a switch and its nearby receiver is almost negligible. This also implies that the switches are placed D apart from each other.
Similar to Eq. 2, we can determine the time at which car k arrives at switch $(n-k+1)$ to be
$$\mathrm{SwitchArvl}[(r,n-k+1) \leftarrow (r,k)] = (2n-1)D \tag{3}$$
Since the right side of the equation does not contain k, every car arrives at its turning point at the same time. Therefore, one may set the switches on a row bus to "cross" simultaneously and, by doing this, the n packets in the train are switched onto their respective destination columns, one packet per column. This arrangement implies that during a Column phase, two or more processors at the same row cannot send packets destined to the same column.
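A quick check of this property, as a sketch: evaluating the receive-time expression of Eq. 2 at the turning point of each car (receiver position $n-k+1$, car k) should give the same value $(2n-1)D$ for every k. The function names are hypothetical.

```python
def row_rec(n, i, j, D=1.0):
    """Eq. 2: arrival time at receiver position i of the packet loaded at position j."""
    return (n + i + j - 2) * D


def switch_arrival_times(n, D=1.0):
    """Arrival time of car k at switch n - k + 1, for every car k in the train."""
    return {k: row_rec(n, n - k + 1, k, D) for k in range(1, n + 1)}


times = switch_arrival_times(n=5)
print(times)  # every car reaches its turning point at the same instant
```

Because all entries coincide, a single switching instant per row suffices for the whole train.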
If $p(i,j)$ needs to communicate with $p(r,k)$ where $r \ne i$, $p(i,j)$ will have to transmit (or load) a packet into car $(n-k+1)$. The time at which $p(i,j)$ must transmit its packet, $\mathrm{ColSend}[(i,j) \to (r,k)]$, is given by Eq. 4.
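One way to obtain this send time from Eqs. 2 and 3 is sketched below. It assumes, as in the Row phase, that a packet needs $(j+k-1)D$ to travel from the transmitter of $p(i,j)$ to the receiver position (and hence the switch) of column k on the same row bus, and that the switch of column k is crossed at the common instant $(2n-1)D$. The result is a reconstruction under these assumptions rather than a quoted formula.

```latex
\begin{align*}
\mathrm{ColSend}[(i,j) \to (r,k)]
  &= \mathrm{SwitchArvl} - (j + k - 1)D \\
  &= (2n-1)D - (j+k-1)D \\
  &= (2n - j - k)D .
\end{align*}
```

The same value follows from the train model: car $(n-k+1)$ passes transmitter position $n-k+1$ at time $(n-1)D$ and reaches position j exactly $(j-(n-k+1))D$ earlier. The send time is non-negative for all $1 \le j, k \le n$ and, unlike in a Row phase, depends on both the sender's column and the destination column.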
By separating the adjacent row buses by D, a column bus will look like a row bus that is turned 90° anti-clockwise after the packets are switched. More specifically, the fact that every row bus switches a packet onto a column bus at the same time is similar to the case where packets are transmitted onto a row bus simultaneously. With a separation of D between every two row buses, there will again be a train of n cars, each carrying a packet, formed on the left segment of a column bus. Hence, we can determine the time $\mathrm{ColRec}[(r,k) \leftarrow (i,j)]$ at which $p(r,k)$ picks up, at its receiver interface on the column bus, the packet sent by $p(i,j)$ (Eq. 5).
Software reconfiguration can be performed with little control overhead because each of the above equations (1 to 5) involves only simple arithmetic calculations. In addition, the hardware complexity of the proposed architecture is low because each processor uses only one two-state $2 \times 2$ switch and has only one transmitter. Although two receiver interfaces are needed by each processor, a single high-speed receiving circuit may be shared between these two interfaces. As a comparison, most mesh-based reconfigurable architectures would require at least an equal number of switches having four or more states and four I/O interfaces per processor [13, 14].
Despite the low control and hardware complexity, the RASOB provides strong connectivity due to the following characteristics. First, a direct connection between any two processors can be established. The existence of such a direct "all-optical" path is important, because the conversions between optical and electronic signals required for buffering and address decoding at intermediate nodes are costly. Secondly, reconfiguration is flexible, as one may interleave Row and Column phases in many ways to provide the communication bandwidth required by an application. Finally, because only a portion of the optical power is tapped off at each receiver interface, multicasting can be supported simply by programming multiple receivers to receive at different points of time during the same phase. Noting that one-to-one and broadcast are special cases of multicasting, we may summarize the communication capability of the RASOB below:
(1) All processors at row i can multicast to the processors at the same row at the same time, and such row-to-row multicasting can be performed at all n rows simultaneously.
(2) All processors at column j can multicast to the processors at the same column at the same time, and such column-to-column multicasting can be performed at all n columns simultaneously.
(3) $p(i,j)$ can multicast to several processors at any column k, and such processor-to-column multicasting can be performed by all the $n^2$ processors at the same time, with the restriction that two or more processors at the same row i cannot multicast to the same column k at the same time.
Note that while the first two items on the list alone mean that the RASOB is at least as powerful as any mesh with row and column buses, the third item clearly shows that the RASOB has a stronger connectivity than other meshes with row and column buses.
In addition, we note that the RASOB has many other capabilities that are not included in this list but may be deduced from its basic features described in the previous two subsections. For example, simulcasting, in which a processor sends different messages to different processors simultaneously, can be supported by assigning more than one car in a train to a source processor. However, a high-speed transmitter interface that can transmit one packet every packet slot time is then required at each processor. Similarly, many-to-one communications can be supported with high-speed receiver interfaces that can receive one packet every packet slot time.
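The restriction on processor-to-column multicasting lends itself to a simple feasibility test. The sketch below assumes that each source sends a single packet per phase (no simulcasting), so a source may target only one column; the function name and the tuple encoding of processors as (row, column) are hypothetical.

```python
def fits_one_column_phase(connections):
    """Check whether a set of connections can share one Column phase.

    connections: iterable of (source, destination) pairs, each a (row, column)
    tuple. Assumes one packet per source, so a multicast from one source to
    several processors at the same column counts as a single packet.
    """
    column_of_source = {}   # source processor -> its single destination column
    owner_of_slot = {}      # (source row, destination column) -> source processor
    for src, dst in connections:
        if column_of_source.setdefault(src, dst[1]) != dst[1]:
            return False    # one packet cannot be switched onto two columns
        if owner_of_slot.setdefault((src[0], dst[1]), src) != src:
            return False    # two sources at one row target the same column
    return True


# A multicast from p(1, 2) to two processors at column 3 is allowed.
print(fits_one_column_phase([((1, 2), (2, 3)), ((1, 2), (3, 3))]))
# p(1, 2) and p(1, 3) at the same row both targeting column 1 is not.
print(fits_one_column_phase([((1, 2), (2, 1)), ((1, 3), (3, 1))]))
```

A test of this kind could serve as the acceptance check when partitioning the connections of an application into phases.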
Algorithm Development
Although the RASOB has a strong connectivity, it has, like many practically scalable architectures, a weaker connectivity than a completely connected network. The capabilities as well as the restrictions of the architecture make it an interesting yet challenging task to design efficient algorithms.
In designing algorithms on a RASOB, one may use the idea proposed in [15] to partition the set of connections required by an application into subsets such that the connections in each subset can be established in a Row or Column phase. However, such a partition may not result in the optimal (i.e. minimal) number of communication phases, and therefore a customized design may be necessary. As an example of how one can take advantage of the capabilities while overcoming the restrictions of the RASOB architecture, we describe parallel matrix transposition and multiplication algorithms, which are fundamental to many other numerical algorithms. The parallel matrix transposition algorithm developed requires only two phases (a Row and a Column phase), which is minimal in the RASOB. The matrix multiplication algorithm to be described assumes that one of the matrices has already been transposed. It is intended to illustrate the impact of the bandwidth limitation versus the availability of high-speed receiver interfaces on the algorithm design.
In the following discussions, we assume, for simplicity, that each matrix is $n \times n$ and that processor $p(i,j)$ thus stores the element of the matrix at row i and column j. For a larger matrix of size $\hat{n} \times \hat{n}$ where $\hat{n} > n$, we may first treat the $\hat{n} \times \hat{n}$ matrix as a $kn \times kn$ matrix where $kn \ge \hat{n}$ for some integer k. We then divide this matrix into blocks, each being a $k \times k$ submatrix, and assign each submatrix to a processor in the RASOB. This would be natural if each processor is in fact a module consisting of a number of processors, as mentioned earlier.
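The block assignment can be sketched as follows. The sketch assumes the smallest admissible k (the ceiling of the ratio between the matrix dimension and n) and a contiguous block layout; both choices, and the names `block_size` and `block_owner`, are illustrative assumptions rather than part of the described architecture.

```python
import math


def block_size(n_hat, n):
    """Smallest integer k such that k * n >= n_hat."""
    return math.ceil(n_hat / n)


def block_owner(row, col, n_hat, n):
    """Processor (1-based row, column) that stores matrix element (row, col)."""
    k = block_size(n_hat, n)
    return ((row - 1) // k + 1, (col - 1) // k + 1)


# A 10 x 10 matrix on a 4 x 4 array is treated as 12 x 12 with 3 x 3 blocks.
print(block_size(10, 4))            # 3
print(block_owner(4, 7, 10, 4))     # (2, 3)
print(block_owner(10, 10, 10, 4))   # (4, 4)
```

Elements of the padded region (rows or columns beyond the original dimension) can simply be treated as zeros.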
Matrix Transposition
Consider matrix B, whose elements are denoted by $b(i,j)$. To transpose matrix B, one needs to route element $b(i,j)$ from processor $p(i,j)$ to processor $p(j,i)$, where $1 \le i, j \le n$ and $i \ne j$. In the following discussion, we will use the notation $[x]_n$ to represent the operation $1 + (x-1) \bmod n$ in calculating the indices, which are numbered from 1 to n.
Our matrix transposition algorithm accomplishes data movement from $p(i,j)$ to $p(j,i)$ in two steps: (1) route an element from $p(i,j)$ (where $j \ne i$) to an intermediate location, which is $p(j,[i+j-2]_n)$, and (2) route the element from its intermediate location to its final destination $p(j,i)$. Part (b) of Lemma 1 means that no two elements originating from the same row will end up at the same column and thus, step one routing requires only one Column phase in the RASOB. Note that after the first step is completed, element $b(i,j)$ is at $p(j,[i+j-2]_n)$, which is at the same row as the final destination $p(j,i)$.
In particular, since $[i+j-2]_n = i$ when $j = 2$, $b(i,2)$ is already at its destination $p(2,i)$ for $1 \le i \le n$, and hence only the $b(i,j)$'s with $j \ne 2$ need the second (and final) step routing.
To formalize the routing function used in the second step, we define $f_2$ as the function that maps the intermediate location $p(j,[i+j-2]_n)$ of element $b(i,j)$ to its final destination $p(j,i)$. $f_2$ is also a one-to-one mapping function. In addition, since $f_2$ routes an element within the same row, the second step routing can be accomplished with Row communications in a RASOB. Specifically, we have

Lemma 2. Step two routing can be completed in one Row phase.

To summarize, these two routing steps can be carried out by a short pseudo program in which the detailed timing for each processor to send and receive an element is derived from Eqs. 1, 2, 4 and 5.

We note that in the RASOB, it is impossible to relocate $b(r,1)$ and $b(r,2)$ from the same row r to $p(1,r)$ and $p(2,r)$ at the same column r, respectively, in one step. Therefore, it is clear that the matrix transposition algorithm, which requires only two phases, is optimal. It is also noted that during each of the two phases, the algorithm requires that a processor send and receive at most one packet, and thus only a subset of the architectural capabilities has been utilized.
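The two routing steps can be checked with a small simulation. This is a sketch that models only where each element sits after each step, not the send and receive times; it asserts that step one never sends two elements of one row to the same column and never delivers two elements to the same processor, and that step two stays within a row. The names `wrap` and `transpose_two_phase` are hypothetical.

```python
def wrap(x, n):
    """The [x]_n operation: 1 + (x - 1) mod n, giving an index in 1..n."""
    return 1 + (x - 1) % n


def transpose_two_phase(B):
    """Transpose an n x n list-of-lists via the intermediate location
    p(j, [i + j - 2]_n), checking the Column and Row phase constraints."""
    n = len(B)
    intermediate = {}                     # (row, col) -> (i, j, value)
    for i in range(1, n + 1):
        columns_used = set()              # destination columns used by row i
        for j in range(1, n + 1):
            if i == j:
                continue                  # diagonal elements stay in place
            dst = (j, wrap(i + j - 2, n))
            assert dst[1] not in columns_used, "row would reuse a column"
            assert dst not in intermediate, "processor would get two packets"
            columns_used.add(dst[1])
            intermediate[dst] = (i, j, B[i - 1][j - 1])

    T = [[None] * n for _ in range(n)]
    for d in range(n):
        T[d][d] = B[d][d]
    for (row, _col), (i, j, value) in intermediate.items():
        assert row == j                   # step two is a move within row j
        T[j - 1][i - 1] = value           # final destination p(j, i)
    return T


print(transpose_two_phase([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

The assertions hold for any n because, for a fixed row i, the column $[i+j-2]_n$ takes a different value for each j, and for a fixed destination row j it takes a different value for each i.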
Matrix Multiplication
We now describe a matrix multiplication algorithm that multiplies two matrices, A and B. Both matrices are assumed to be $n \times n$, although the algorithm can be easily adapted to the cases where A and B are $k \times l$ and $l \times m$, respectively. In addition, we assume that matrix B has already been transposed and that the resulting matrix C will be stored in the same way as A. More specifically, given that $a(i,j)$ and $b(i,j)$ are initially stored in $p(i,j)$ and $p(j,i)$, respectively, $c(i,j) = \sum_{k=1}^{n} a(i,k)\, b(k,j)$ will be stored in $p(i,j)$. Figure 3(a) shows the locations of the elements of A and B before matrix multiplication begins, assuming $n = 3$.

   $p(i,j)$ sends its $c\{[j+k]_n, j\}$ at time $t_r[k] + \mathrm{RowSend}[(i,j) \to (i,[j+k]_n)]$;
   $p(i,j)$ receives its $c\{j, [j-k]_n\}$ at time $t_r[k] + \mathrm{RowRec}[(i,j) \leftarrow (i,[j-k]_n)]$; }
4. $p(i,j)$ sets $c(i,j) = c(i,j) + c\{j,k\}$ for all $1 \le k \le n$; /* $n-1$ additions */

More specifically, during the distribution step, processor $p(i,j)$ distributes element $b(j,i)$ to all the other processors at column j, which store $a(k,j)$ for all $k \ne i$.
Assuming that each processor has a high-speed receiver interface to its column bus, this step can be accomplished in one Column phase by letting every processor in the RASOB broadcast to the processors at the same column, as illustrated in Figure 3 (a) for column two.
After the above distribution step, processor $p(i,j)$ performs n multiplications to get products of the form $c\{k,j\}(i,k) = a(i,j)\, b(j,k)$ for all k. The processor then retains product $c\{j,j\}(i,j)$ for its own use and initiates the second communication step, namely forwarding, to forward the rest of the products of the form $c\{k,j\}$, where $k \ne j$, to other processors. The case for $p(1,2)$ is illustrated in Figure 3. Since each processor has $(n-1)$ products of the form $c\{k,j\}$ to send to, and $(n-1)$ products of the form $c\{j,k\}$ to receive from, the other $(n-1)$ processors at the same row, this forwarding step can be accomplished in $(n-1)$ Row phases, as shown in the above pseudo program. Note that the number of phases needed is independent of the availability of high-speed receiver or transmitter interfaces.
After forwarding, each processor $p(i,j)$ calculates $c(i,j) = \sum_{k=1}^{n} c\{j,k\}(i,j)$, which takes $(n-1)$ additions. At this point, the matrix multiplication of A and B is completed. Note that, similar to the matrix transposition algorithm, Eqs. 1, 2, 4 and 5 can be used to derive the timing specified by functions RowSend, RowRec, ColSend and ColRec, respectively, in both steps of the above matrix multiplication algorithm. Figure 4 illustrates the timing diagram of the matrix transposition and multiplication algorithms. As can be seen, the matrix multiplication algorithm needs one Column phase and $(n-1)$ Row phases in the distribution and forwarding steps, respectively. This results in a total of n communication phases for the entire matrix multiplication algorithm described above.
We note first that, if high-speed receiver interfaces are not available, the distribution step, in which $p(i,j)$ needs to broadcast $b(j,i)$ to all other processors at the same column, would take $(n-1)$ communication phases, making it similar to the forwarding step. More specifically, it would need $(n-1)$ Column phases and during the k-th phase, $1 \le k \le n-1$, $p(i,j)$ would send $b(j,i)$ to $p([i+k]_n, j)$. As a result, the total number of communication phases needed for matrix multiplication would be $(2n-2)$.
Secondly, the algorithm can be improved assuming that each processor has high-speed receiver interfaces to both the row and column buses. More specifically, we can first invoke the transposition algorithm to transpose B back, which would take two phases. Then, we can perform a row and a column broadcast, in which each processor would broadcast its A element and its B element, respectively, to all other processors at the same row and column. This would take two phases, each of which would be similar to the distribution step described above. At the end of these two phases, each $p(i,j)$ would have a copy of all $a(i,k)$ from the row broadcast and all $b(k,j)$ from the column broadcast, for any $1 \le k \le n$. As a result, $c(i,j)$ can be determined through a sequence of local computations of the inner product, involving n multiplications and $(n-1)$ additions. Therefore, matrix multiplication can be done with 4 communication phases. Note that the above discussion also implies that, had B not been transposed as assumed, matrix multiplication would have taken only 2 communication phases if high-speed receiver interfaces are available and $(2n-2)$ communication phases otherwise. In the latter case, the number of phases needed, $(2n-2)$, is the same as for the algorithm described previously, which deals with the transposed B using the distribution and forwarding steps.
Finally, we note that computations and communications may be overlapped as shown in the figure, as long as each processor has separate units for computation and communications. More specifically, since $p(i,j)$ holds $a(i,j)$ and $b(j,i)$ after the transposition of B takes place, multiplication can start before the distribution finishes. Moreover, once the product $a(i,j)\, b(j,k)$, where $k \ne j$, is available, $p(i,j)$ can start the forwarding step before it multiplies $a(i,j)$ with another element of B. Finally, additions can start as soon as $p(i,j)$ receives two or more products of the form $a(i,k)\, b(k,j)$.
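The distribution and forwarding steps can be mimicked with a short simulation, given here as a sketch. It models data placement only: one Column phase in which every processor at column j learns all $b(j,k)$, followed by $n-1$ Row phases in which $p(i,j)$ forwards the product destined to $p(i,[j+k]_n)$. The dictionaries standing in for processor memories and the function name `rasob_matmul` are hypothetical.

```python
def wrap(x, n):
    """The [x]_n operation: 1 + (x - 1) mod n."""
    return 1 + (x - 1) % n


def rasob_matmul(A, B):
    """Multiply n x n matrices following the distribution/forwarding steps.

    Returns (C, number_of_communication_phases).
    """
    n = len(A)
    idx = range(1, n + 1)
    a = {(i, j): A[i - 1][j - 1] for i in idx for j in idx}
    # B is stored transposed: p(i, j) initially holds b(j, i).
    held_b = {(i, j): B[j - 1][i - 1] for i in idx for j in idx}

    # Distribution: one Column phase, every p(k, j) broadcasts b(j, k)
    # down column j, so p(i, j) ends up knowing b(j, k) for all k.
    known_b = {(i, j): {k: held_b[(k, j)] for k in idx} for i in idx for j in idx}
    phases = 1

    # Local products: p(i, j) computes a(i, j) * b(j, k), a term of c(i, k).
    prod = {(i, j): {k: a[(i, j)] * known_b[(i, j)][k] for k in idx}
            for i in idx for j in idx}
    c = {(i, j): prod[(i, j)][j] for i in idx for j in idx}   # retained term

    # Forwarding: n - 1 Row phases, one product per processor per phase.
    for k in range(1, n):
        for i in idx:
            for j in idx:
                dst_col = wrap(j + k, n)
                c[(i, dst_col)] += prod[(i, j)][dst_col]
        phases += 1

    return [[c[(i, j)] for j in idx] for i in idx], phases


C, phases = rasob_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, phases)
```

In each forwarding phase every processor sends exactly one product and receives exactly one, which is why the phase count does not depend on high-speed interfaces.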
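The schedule for the case without high-speed receivers can be generated and checked as below. This is a sketch with hypothetical names; it verifies that in every phase each processor receives exactly one packet and that after $n-1$ phases every processor has heard from every other processor at its column.

```python
def wrap(x, n):
    """The [x]_n operation: 1 + (x - 1) mod n."""
    return 1 + (x - 1) % n


def column_distribution_schedule(n):
    """Phase k (1 <= k <= n - 1): p(i, j) sends b(j, i) to p([i + k]_n, j).

    Returns a list of phases, each mapping sender -> receiver.
    """
    return [
        {(i, j): (wrap(i + k, n), j)
         for i in range(1, n + 1) for j in range(1, n + 1)}
        for k in range(1, n)
    ]


n = 4
schedule = column_distribution_schedule(n)
total_phases = len(schedule) + (n - 1)    # distribution plus forwarding
print(len(schedule), total_phases)        # 3 Column phases, 6 phases in total
```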
Although the total computation complexity is O(n), the total time needed by the algorithm would be less than the sum of the time for communications and the time for local computations.
Concluding Remarks
It is noted that a communication phase (a Row or Column phase) is at most $4nD$ (seconds) long, where n is the number of processors per row or column and D is the separation between two adjacent processors. At a transmission speed of 2.5 Gb/s and with a packet length of 32 bits, a phase in a $100 \times 100$ processor system will take at most 6 μs, which is comparable to an electronic bus cycle. Therefore, the matrix transposition algorithm described may be considered as having a time complexity equivalent to two bus cycles while, as a comparison, we note that matrix transposition on a hypercube and a mesh would require $O(\log n)$ and $O(n)$ cycles, respectively. The advantage of the RASOB implementation mainly comes from the fact that in an optical bus system with high-speed transmitter and receiver interfaces, one can send and receive up to n packets within one bus cycle whose length equals the end-to-end propagation delay. This is not possible in an electronic bus with the same bus cycle length, simply because it requires exclusive access. One has to realize, however, that it is in general difficult to measure the communication time of an algorithm in optically interconnected multiprocessor systems in terms of the electronic processing speed. Therefore, care must be taken in interpreting this comparison.
We also note that communication phases may be overlapped with each other. In other words, one may start a new phase (Row or Column) before the current phase ends. For example, in the matrix multiplication algorithm, the last $n-1$ Row phases can be overlapped such that a new Row phase can start every $nD$ (seconds). Using the terminology of the train loading/unloading model, this overlapping of the communication phases has the effect of dispatching one train right after another and thus creating multiple consecutive trains on a bus. Note that data dependency may prevent a phase from being able to overlap with another phase, as in the case of the matrix transposition algorithm.
In this paper, parallel matrix transposition and multiplication algorithms have been developed as examples of how communication-intensive algorithms can be efficiently designed for the RASOB architecture. In addition to its own importance to high-performance matrix operations, this work is a proof of concept that will hopefully lead to the efficient development of other algorithms (for example in computational geometry and image processing) by utilizing many other capabilities of the RASOB that have not been utilized in the matrix algorithms presented. Finally, we also note that there are applications requiring asynchronous communications, in which a receiver may not know the source or arrival time of a message. In such cases, hardware reconfiguration via switch setting that alternates the Row and Column phases can still be performed. However, each sender is required to include the destination address in its packet. These packets will then be self-routed to their destinations via an all-optical addressing mechanism [12, 16, 17].
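The 6 μs figure can be reproduced with a few lines of arithmetic. The sketch below uses the bound $4nD$ with D equal to the packet duration plus a guard band; the 2 ns guard band in the second call is an assumed value chosen only to show how much slack the quoted figure leaves, and the function name is hypothetical.

```python
def phase_upper_bound_s(n, bits, rate_bps, guard_s=0.0):
    """Upper bound 4 * n * D on a phase, with D = bits / rate + guard band."""
    D = bits / rate_bps + guard_s
    return 4 * n * D


no_guard = phase_upper_bound_s(n=100, bits=32, rate_bps=2.5e9)
with_guard = phase_upper_bound_s(n=100, bits=32, rate_bps=2.5e9, guard_s=2.0e-9)
print(f"{no_guard * 1e6:.2f} us without guard band")
print(f"{with_guard * 1e6:.2f} us with an assumed 2 ns guard band")
```

With 32-bit packets at 2.5 Gb/s the packet duration is 12.8 ns, so the bound is 5.12 μs before any guard band is added and stays below 6 μs for guard bands up to about 2.2 ns.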
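The benefit of overlapping can be estimated as follows. This is a sketch under the simplifying assumption that every phase lasts the full upper bound $4nD$ and that time is measured in units of D; the function names are hypothetical.

```python
def sequential_time(num_phases, phase_length):
    """Total time when each phase starts only after the previous one ends."""
    return num_phases * phase_length


def overlapped_time(num_phases, start_interval, phase_length):
    """Total time when a new phase is dispatched every start_interval."""
    return (num_phases - 1) * start_interval + phase_length


n = 100                       # processors per row, time measured in units of D
row_phases = n - 1            # forwarding step of the matrix multiplication
print(sequential_time(row_phases, 4 * n))        # 39600 D
print(overlapped_time(row_phases, n, 4 * n))     # 10200 D
```

Under these assumptions, overlapping shortens the forwarding step by roughly a factor of four for large n, since a new train leaves every $nD$ instead of every $4nD$.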
