Singular value decomposition (SVD) is used in many applications such as real-time signal processing where fast computation of these problems is needed. In this paper, parallel algorithms for solving the singular value decomposition problem are discussed. The algorithms are designed for optically interconnected multiprocessor systems where pipelined optical buses are used to connect processors. In a pipelined bus system, messages can be transmitted concurrently in a pipelined fashion. However, certain restrictions may apply in a pipelined bus system. For example, a processor can send at most one message and receive one message during a bus cycle. Pipelined bus interconnection requires us to rethink how we write parallel algorithms. Fully exploring the properties of concurrent message transmissions requires careful mapping of data, an efficient addressing mechanism, and a set of efficient basic data movement operations. In this paper, these issues are addressed in detail. Analysis of the parallel computation times of the SVD algorithms shows that they are asymptotically equivalent to those implemented on the hypercube while using substantially less hardware. The results obtained in this paper further demonstrate that optically interconnected multiprocessor systems are very promising as a new multiprocessor architecture.
Introduction
One of the important problems in mathematical science and engineering is singular value decomposition (SVD). It is often used in many applications such as real-time signal processing, where the solution to these problems is needed in the shortest possible time. SVD is one of the most important factorization of a real m-by-n (m≥n) matrix, and is a very computationally intensive problem. Given the growing availability of parallel computers and the need for faster solutions to the SVD problem, there has been great interest in the development of parallel implementations of the singular value decomposition.
A singular value decomposition of a real m-by-n (m≥n) matrix A is its factorization of this matrix into the product of three matrices:
where U is an m-by-n matrix with orthogonal columns, D is an n-by-n non-negative diagonal matrix, and V is an n-by-n orthogonal matrix. For more details, please refer to Golub and Reisch [1] . The most common method used for the solution of the SVD problem is incorporated in the Golub-Kahan SVD algorithm [1, 2] , which requires
Singular value decomposition
There is a wealth of serial SVD algorithms available in the literature. Most of these algorithms can be transformed into parallel algorithms to be run on parallel computers. Each of these serial algorithms exhibits a certain degree of parallelism. Thus, in choosing the serial algorithm to be transformed into a parallel algorithm, we have to take the potential of inherent parallelism within the algorithm into account. For this reason, we choose the Hestenes method of solving the SVD problem [10] to implement it on the array of processors with optical pipelined buses because of its high potential of parallelism and its flexibility in being transformed efficiently into a parallel algorithm suited for our architecture, even though its sequential version is less efficient than that of the Golub-Kahan-Reinsch SVD method [1, 2] .
The basic idea of the decomposition is to generate an orthogonal matrix V such that the transformed matrix AV=W has orthogonal columns. Normalizing the Euclidean length of each non-null column to unity, we get the relation
where U′ is a matrix whose non-null columns form an orthonormal set of vectors, and D is a non-negative diagonal matrix. An SVD of A is then given using the following relation:
As a null column of U′ is always associated with a zero diagonal element of D, (1) and (2) are essentially identical. Hestenes [10] uses plane rotations to construct V. A sequence of matrices {A k } is generated by
where A 1 =A, and each R k is a plane rotation. 
The rotation angle should be chosen so that the two new columns are orthogonal. For this we can use the formulas given by Rutishauser [20] . Defining:
We set =0, if =0. Otherwise, we compute 
Directional Coupler Processor
The rotation angle always satisfies | |≤( /4). In this way, we orthogonalize the pth and the qth columns in the kth step. The process in which all column pairs (i, j) for i<j are orthogonalized exactly once is called a sweep. We repeat the sweep until A converges to W, which has orthogonal columns. In actual computation, we can select a small positive number and stop the computation when the inner product of every column pair is less than . We can use as a threshold, and orthogonalize a column pair only when its inner product is larger than . This threshold approach accelerates the convergence.
For a matrix with n=2 g columns, n(n−1)/2 column pairs have to be orthogonalized in a sweep. More than one column pair can be orthogonalized simultaneously in a parallel system. In the orthogonalization, the elements of the new columns can be computed in parallel. The inner product operation to compute the rotation parameters , , and can also be carried out in parallel. The critical issue here is how to compute the rotation parameters efficiently, perform orthogonalization of column pairs, and move columns to create all column pairs, using the pipelined optical buses available in the system.
SVD on 1-D arrays with a pipelined optical bus
Before we discuss the detailed implementation of the SVD algorithm on the 1-D array of processors (PEs) with a pipelined optical bus, it is appropriate to give a brief introduction to this novel architecture and discuss its functionality. Figure 1 shows a linear array of processors connected with an optical pipelined bus. Each processor is connected to the bus with two directional couplers (for the conversion of optical signals to electronic signals and vice versa), one for transmitting messages on the lower bus segment and the other for transmitting messages on the upper bus segment [15, 17, [21] [22] [23] . Because of the properties of the optical bus (waveguides), messages propagate unidirectionally from right to left on the lower bus segment and from left to right on the upper bus segment. That is why two buses are needed for such a computer system, unlike traditional electronic buses where electronic signals can propagate in either direction on the same bus. Each processor uses a set of control registers to store information needed to control the transmission and reception of messages by that processor which will be explained later. The above computer system can operate in MIMD mode where each processor would have its own controller, or in SIMD mode where a single controller is used for all the processors. In this paper, this parallel computer system is assumed to be operating in SIMD mode, and thus our algorithms are designed with this mode in mind. For convenience, we let the optical path length between any two couplers on the bus be d 0 units. Denote the propagation time of a light signal between two couplers by tc; that is, tc=d 0 /c g , where c g is the velocity of light on the bus. Let a bus cycle, Tc, be Ntc. With a concurrent bus accessing method, often referred to as bus pipelinings [21, 22] , processors can transmit their messages simultaneously at the beginning of each bus cycle. As a result of a simultaneous transmission by all processors, a train of N messages is formed on the lower segment of the bus and will propagate from left to right on the upper segment. Processors that wish to receive any message from among the train of messages transmitted must wait for the message to arrive at its receiver [23] . These properties are a result of the unique characteristics of optical buses (waveguides). For more details, readers are referred to Melhem et al. [22] .
Messages transmitted by different processors may overlap with each other even if they propagate unidirectionally on the bus. We call these message overlappings transmission conflicts. Assume each message has b binary bits, where each bit is represented by an optical pulse, with the existence of a pulse for 1 and the absence of a pulse for 0. Let w be the width (or duration) of each pulse in seconds. To ensure free of transmission conflicts, the following condition has to be satisfied:
We call tc a pipeline cycle; thus, the term pipelined bus is self-explanatory. The above condition ensures that each message will be 'fit' into a pipeline cycle such that, in a bus cycle, up to N messages can be transmitted by processors simultaneously without collisions on the bus. This is the biggest difference between an optical bus and electronic bus, where the former can handle up to N simultaneous messages when connected to n processors while the latter can handle just one message regardless of the number of processors. In a parallel array, messages normally have very short length (b is very small). Thus, in the remaining discussions, we assume that the condition given by eqn (7) above is always satisfied and no transmission conflicts are possible as long as all processors are synchronized at the beginning of each bus cycle.
In SIMD environments, each processor knows the source (processor id) of the message it is receiving and when it is sent relative to the beginning of a bus cycle. Therefore, the time (relative to the beginning of a bus cycle) can be easily calculated by the processor receiving the message. Let the processor that wishes to receive a message be processor i. Assume that the message is sent by processor j at the beginning of bus cycle C. We use a receiving function, rec(i, j, C), to determine the time that processor i has to wait, relative to the beginning of bus cycle C, before receiving the message from processor j. The value of the function, in the number of pipeline cycles, is given in the following equation:
By calculating the receiving function as given by eqn (8), each processor would be able to selectively receive any message from a train of messages in each bus cycle. These values of receiving functions can be precomputed and stored in a set of receiving control registers for fast retrieval during the execution of the algorithm. The precomputing of these values is done, obviously, according to the design of the given algorithm. Readers may refer to Melhem et al. [22] for details. Now, it is time to present the SVD algorithm on the 1-D array with an optical bus of size n/2, in which every PE has a local memory large enough to store a pair of columns of data elements. Suppose that the two columns are stored in arrays A(i) and B(i), 0≤i<n/2, respectively. Every PE continuously receives commands from the controller to perform the orthogonalization of the column pair concurrently with other PEs according to eqns (3) (4) (5) (6) . A procedure called 1-D-ORTH is used to orthogonalize a column pair as follows. 1-D-ORTH consists of the following steps, i.e. to update , , and using (5), to calculate L, t, c, and s using (6), and to orthogonalize the column pair based on (4). Since the length of a column is m, it takes O(m) to update (5) . All other steps in 1-D-ORTH can be done in constant time. Hence, the time to perform
1-D-ORTH is O(m).
After performing the orthogonalization of the column pair, each PE exchanges a column of data with another PE. Clearly, any exchange of column data between two PEs can be done in m bus cycles since each PE can put one message onto the bus and receive one message from the bus in one bus cycle and a column contains m data elements. Here, the exchange sequence described in Dekel et al. [24] is used to move data column A(i) in PE [i] , for all i, to all other PEs sequentially to produce all (A, B) column pairs, i.e. the set {(A p , B q )|p=0, 1, 2, . . . , n/2−1, q=0, 1, 2, . . . , n/2−1}. In order to obtain all possible column pairs, one still has to consider the pairing among the A columns and among the B columns. This can be achieved by iteratively applying the process of exchanging the A and the B columns between neighboring nodes to create subarrays and then moving the A columns within the subarrays. The entire algorithm is given in procedure 1-D-SVD. In the above algorithm, f(g−k, p) is the pth number in the exchange sequence [24] of length 2
is the number whose binary representation is i q , . . . , i h , . . . , i 0 . R rc (i ) is the receiving control register used to store the receiving function value in processor i. Thus, when R rc (i ) is assigned the value of 1+i+i (h) , processor i receives a message automatically from processor i (h) .
A[ j](i)
is the jth element of the data column stored in processor i. Column pairs are generated within a subarray by loop p using the exchange sequence [25] . Lines 13 to 24 iteratively exchange the A and B columns between neighboring nodes. Obviously, lines 8, 9, and 11 dominate the time complexity of the algorithm since they are statements within the double loop. The time spent in procedure 1-D-ORTH on line 11 is O(m) since each PE needs to process a column pair whose length is m. The number of times line 11 is executed is given by:
Hence, the time spent in line 11 is O(mn). Similarly, it can be shown that the number of times lines 8 and 9 are executed is
≤mn Thus, the time spent in lines 8 and 9 is O(mn). Therefore, the total time of algorithm 1-D-SVD is O(mn), which is on the same order of magnitude as that on a hypercube of the same size [5] . While a PE in a linear array of size n with a pipelined optical bus has only two communication ports, a PE in a hypercube of size n has log n ports. 
SVD on 2-D arrays with pipelined optical buses
The 1-D array with a single pipelined optical bus has certain limitations [23] . First, the system size cannot be very large since it is limited by the minimum power that can be detected by the PE at the very end of the bus (similar to electronic buses). Second, when the number of PEs, N, is very large, the end-to-end transmission delay increases since the condition given by eqn (3) has to be held and the number of pipeline cycles is proportional to N. Moreover, many problems, such as SVD, are 2-D in nature and are more efficient when implemented on a 2-D array of processors. To overcome the shortcomings of the linear array of processors, we consider a 2-D array of processors where each row of processors and each column of processors are connected by a pipelined optical bus. Thus, a PE in a 2-D array is coupled to four waveguides: two for transmitting messages and two for receiving messages (see Fig. 2 ). Communication between PEs in the same row or in the same column in a 2-D array can be carried the same way as in the linear case. If two PEs in different rows or different columns need to communicate, then two bus cycles are needed. That is, at the end of the first bus cycle a message has to be buffered at a node which is in the same row (or column) as the destination [15, [21] [22] [23] . It is expected that parallel algorithms can be run more efficiently on a 2-D array than on a 1-D array as can be deduced from our SVD algorithms implemented on a 1-D array and a 2-D array. Nevertheless, a 1-D array with a pipelined optical bus is a legitimate architecture with its own merits [22] . Now we consider the implementation of the SVD problem on the 2-D array of processors with pipelined optical buses of size m×(n/2), where n=2 g+1 . Each PE in a row has an index which is between [0, m−1] and each PE in a column has an index which is between [0, n/2−1]. Hence, each PE can be uniquely identified by its row index and column index. Each PE(i, j) has several registers; registers A(i, j) and B(i, j) are used to store matrix elements. The given matrix A is initially stored as follows:
where a ij is the (i, j) element of matrix A. Thus, the n columns of the matrix are partitioned into n/2 pairs and each column pair is stored in a column of m processors. A column of m processors orthogonalizes the column pair stored in it by using the procedure 2-D-ORTH explained below. The procedure 2-D-ORTH is similar to 1-D-ORTH except that a column pair of data is stored in a column of processors instead of one processor. 2-D-ORTH is divided into several distinct phases and each phase corresponds to its counterpart in 1-D-ORTH. In the first phase, the local products
are calculated in the m processors concurrently. In the second phase, the sums
(i, j), and
are computed by combining the partial sums using the reporting operation described in Hamdi and Pan [18] , and the results are stored in PE(0, j).
In the third phase, the values L( j), t( j), c( j), and s( j) are computed locally in P(0, j). In the fourth phase, the rotation parameters c(0, j)=c( j) and s(0, j)=s( j)
are broadcast over the column buses to all processors in the same columns. The broadcast operation described in Hamdi and Pan [18] can be used here. In the fifth phase, elements of the two new matrix columns are computed locally using the rotation parameters received. The time complexity of the 2-D-ORTH can be derived as follows. The first phase takes a constant time since all operations are local. The second phase needs O(log m) time since these are reporting operations over columns of m PEs [18] . The third phase uses a constant time. In the fourth phase, one bus cycle is needed since it is a broadcasting operation over columns [18] . The fifth phase requires a constant time again. Hence, the time taken in 2-D-ORTH is O(log m). After the procedure 2-D-ORTH is performed in parallel on all n/2 columns of processors, half of the newly computed matrix columns are exchanged in parallel between neighboring columns of processors. The procedure 2-D-ORTH is then carried out on all columns of processors in parallel again. Clearly, any column exchange between two PEs can be done in one bus cycle since this is a one-to-one operation. Here again the exchange sequence described in Dekel et al. [24] is used to move column
, for all i and j, to all other PEs sequentially to produce all (A, B) column pairs in a column of PEs, i.e. the set { (A p , B q )|p=0, 1, 2, . . . , n/2−1, q=0, 1,  2 , . . . , n/2−1}. In order to obtain all possible column pairs, one still has to consider the pairing among the A columns and among the B columns. This can be achieved by iteratively applying the process of exchanging the A and the B columns between neighboring columns of PEs to create subsystems and then moving the A columns within the subsystems. The entire algorithm is very similar to its 1-D counterpart and is specified in 2-D-SVD: procedure 2-D-SVD 1 While not (all | |< ) do 2 2-D-ORTH; 3 for k :=0 step 1 until g−1 do 4 for p :=1 step 1 until 2 . Since lines 5 and 6 require a constant time and each data transfer in lines 7 and 8 takes one bus cycle, the time is dominated by line 9. Similar to the analysis of the 1-D-SVD algorithm, the number of times line 9 is executed is at most n. Therefore the total time of 2-D-SVD is O(n log m), which is on the same order of magnitude as that on a hypercube of the same size [14] . While a PE in a 2-D array with a pipelined optical bus has only four communication ports, a PE in a hypercube of size mn/2 has log(mn/2) ports.
Large SVD problems
Finally, let us discuss the time complexity of the parallel SVD algorithm on an array where the number of processors is much smaller than the size of the problem. Suppose we have a 2-D v×w array with optical pipelined buses. Without loss of generality, suppose v≤m and w≤n. Also assume that m and n are multiples of v and w, respectively. We partition the m-by-n matrix A into h groups of w columns each: C 1 , C 2 , . . . , C h , as shown in Fig. 3 for h=3 . In this example, all column pairs which are orthogonalized in a sweep can be obtained by pairing the columns in the following way: (C 1 , C 1 ), (C 1 , C 2 ), (C 1 , C 3 ), (C 2 , C 2 ), (C 2 , C 3 ). Therefore, the whole computation consists of processing 
Conclusions
Parallel computers using static networks such as the mesh and hypercube have the problem of large diameters due to their limited connectivities. The time complexities of parallel algorithms on these networks are lower bounded by their diameters. Adding buses to these networks is one way to solve the problem. However, messages cannot be transmitted concurrently on an electronic bus. A pipelined optical bus, on the other hand, not only has a shorter end-to-end propagation delay, but also can be accessed concurrently by messages. It has been established that much higher bandwidth can be achieved with pipelined optical bus communication than with exclusive electronic bus access communication [21] . Thus, many parallel algorithms can be implemented on an array with a pipelined optical bus efficiently. In this paper, we have shown that the time complexities used in the algorithms on arrays with optical interconnects are asymptotically equivalent to those implemented on the hypercube while they use substantially less hardware.
In this paper, we have concentrated on algorithm design on the optical bus system. During the description of the singular value decomposition algorithms and the optical bus system, many assumptions from Qiao and Melhem [23] are used. Many technical details such as clock distribution, frame synchronization, electronically controlled optical switching, conversion between optical and electronic signals, and packet size limitation are either discussed briefly or completely omitted. Interested readers may refer to references [15, 17, 21, 23, 25] for details.
Recently, the authors have proposed another new computational model called Linear Array with a Reconfigurable Pipelined Bus System (LARPBS) based on optical waveguides [26] . In this model, messages can be transmitted in a pipelined fashion on an optical bus system and the bus can be dynamically reconfigured into independent segments to satisfy different communication requirements during a computation. Using conditional optical delays, many basic operations such as binary prefix summation, split, compression, and multiple broadcasting can be done in one bus cycle [26] . It is clear that LARPBS is much more powerful than the model used in this paper although only several switches are added to a pipelined optical bus system. We believe that the SVD problem can be solved more efficiently on the LARPBS and we are doing research in this direction. 
