This paper briefly reviews some of the more popular parallel-computer models that employ optical interconnect: pipelined optical bus models and OTIS interconnect models. The interconnect topology and some simple algorithms for each model are also described.
Introduction
The emerging feasibility of optical interconnects in very large parallel computers introduces new challenges in the development of efficient parallel algorithms. Single-mode waveguides are unidirectional and light pulses travel down the waveguide with a highly predictable delay [12]. As a result, waveguides support pipelining (i.e., at any instant, several messages encoded as light pulses can be traveling down a waveguide, one message behind the other). This means that several processors can simultaneously write different messages to the same optical bus; all messages can flow along the bus without interference. By contrast, when electronic buses are used, only one message can be present on the bus at any given time. Therefore, when two or more processors attempt to write to a shared electronic bus in the same cycle, we get a write conflict. Several pipelined bus models for parallel computers have been proposed and studied. We review some of the more popular models here.
When designing a very large parallel computer system, processors are spread over several levels of the packaging hierarchy. For example, we might have several processors on a single chip, several chips on a wafer, several wafers on a printed circuit board, and so on. This means that, necessarily, the interprocessor distance cannot be kept small for all pairs of processors. When a connection must be run between processors on two different chips (for example), the connect distance is usually larger than when we connect two processors on the same chip. It is also known [1, 4] that optical connects provide power, speed, and crosstalk advantages over electronic interconnects when the connect distance is more than a few millimeters. Therefore, minimum overall delay is achieved when shorter interconnects are realized using electronics and longer interconnects are realized using optics. This realization leads to the concept of optoelectronic computers, that is, computers which have a mix of optical and electronic interconnects (or, more generally, components). It is important to develop interconnection topologies that maximize the benefits of the two technologies and are manageable from the perspective of efficient algorithm design. The OTIS family of optoelectronic computers is a step in this direction.
Static Bus Models

A Unidirectional Bus
An $n$-processor unidirectional-bus computer is a synchronous computer in which $n$ processors are connected to an optical bus (or waveguide). Figure 1 shows a 4-processor unidirectional-bus computer. The processors are labeled 0 through 3 and the optical bus is shown as a thick arrow. The processors are evenly spaced and the time required for a (fixed-length) optical message (or signal) to travel the distance between two adjacent processors is denoted by $\tau$. An optical bus is unidirectional: optical messages may travel in only one direction along the bus, that is, in the direction indicated by the arrowhead. Each processor has a read/write connection to the optical bus; this is shown as a thin line in Figure 1.
If processor 0 writes a message to the bus, this message will arrive at processor 1 after $\tau$ time units and at processors 2 and 3 after $2\tau$ and $3\tau$ time units, respectively. If processors 0 through 2 write messages $m_0$, $m_1$, and $m_2$, respectively, to the bus at time 0, then message $m_0$ arrives at processors 1, 2, and 3 at times $\tau$, $2\tau$, and $3\tau$, respectively; message $m_1$ arrives at processors 2 and 3 at times $\tau$ and $2\tau$; and message $m_2$ arrives at processor 3 at time $\tau$.
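The arrival pattern above can be tabulated with a few lines of Python. This is only a sketch: the function name arrival_times and the parameter tau (the delay between adjacent processors, set to 1 by default) are our own illustrative choices, and we assume that processors are numbered in the direction of message flow.

```python
def arrival_times(writer, n, tau=1.0):
    """Map each downstream processor to the time at which it sees the
    message that `writer` put on the bus at time 0."""
    # Messages travel only toward higher-numbered processors, and each
    # hop between adjacent processors costs tau time units.
    return {reader: (reader - writer) * tau for reader in range(writer + 1, n)}


if __name__ == "__main__":
    for writer in range(3):
        print(writer, arrival_times(writer, n=4))
```

Note that messages written in the same slot by different processors never collide: at any processor they pass by in distinct slots, which is what makes pipelining on the bus possible.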
The cycle time of an $n$-processor unidirectional-bus computer is defined to be $n\tau$ [2]. The cycle time for the 4-processor configuration of Figure 1 is $4\tau$. A cycle may be regarded as composed of $n$ slots, each of duration $\tau$. It is generally assumed that a processor can write in only one slot in a cycle and can read from only one slot in a cycle. However, some models permit reading from several slots of a cycle. Several mechanisms have been proposed for how a processor knows which slot to write and which to read. Two of these mechanisms are:
1. Use of a slot counter. All processors write in slot 0 of a cycle. If processor $i$, $i > j$, wants to read processor $j$'s message, it starts a slot counter that is incremented by 1 every $\tau$ time units; when the counter reaches $i - j$, processor $i$ reads from the bus. The drawback of this scheme [9] is that it requires an electronic counter that is at least as fast as the optical waveguide. This drawback is eliminated by replacing the arithmetic counter by the system clock, which is advanced every $\tau$ units.
The coincident pulse method requires a clock that advances every time unit (a time unit equals the delay introduced between pairs of adjacent processors on the message and reference buses); this time unit is less than $\tau$.
The complexity of algorithms for optical bus computers is usually measured in cycles. Although the cycle time for an $n$-processor bus varies with $n$, for $n$ up to a few thousand, the cycle time is no more than the time required to perform a CPU operation (such as an add or a compare) [10]. Typically, an algorithm will involve the CPU for $\Theta(1)$ steps between successive bus cycles. So, the number of bus cycles times the CPU operation time is a good measure of complexity for $n$ up to a few thousand.
One-Dimensional Array
A unidirectional bus isn't of much use because there is no way for processor $i$ to send a message to processor $j$ when $j < i$; that is, you cannot send a message to a processor on your left. This difficulty is overcome by adding an additional bus in which messages flow from right to left (Figure 3). The resulting parallel computer model is called the one-dimensional array with pipelined buses (1D APPB) [2]. To see the power of a 1D APPB, suppose you want to permute the data in the $n$ processors according to the permutation $p()$; that is, processor $i$ is to receive data from processor $p(i)$. Data from processor $p(i)$ will get to processor $i$ in $|i - p(i)|$ slots (or $|i - p(i)|\tau$ time) provided it is written to the proper bus. So, processor $i$ computes the number of slots $wait(i) = |i - p(i)|$ for which it must wait for its data. Following this computation, all processors write their data to the upper (i.e., the left-to-right) and lower buses at the start of a bus cycle. Processor $i$ reads the desired data from the bus when its wait time is up. Amazingly, we can perform any data permutation in just one bus cycle. This is not possible using an electronic bus. For an electronic bus, define the bus cycle to be the time needed for an electronic signal written by a processor on a bus to become available at all processors on the bus. When an electronic bus is used, only one distinct data item can be transported on the bus in a bus cycle. Therefore, $n$ cycles are required to perform a permutation.
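The one-cycle permutation can be checked with a small slot-by-slot simulation. The sketch below is ours, not taken from [2]: the name permute_1d_appb is hypothetical, time is measured in slots, and we assume that every processor writes its datum to both buses at slot 0 and that processor i reads from the left-to-right bus when its source lies to its left and from the right-to-left bus otherwise.

```python
def permute_1d_appb(data, p):
    """Simulate one bus cycle of a 1D APPB: processor i ends up with data[p[i]]."""
    n = len(data)
    # Preprocessing: each processor computes how many slots it must wait.
    wait = [abs(i - p[i]) for i in range(n)]
    received = [None] * n
    for slot in range(n):  # one bus cycle consists of n slots
        for i in range(n):
            if wait[i] != slot:
                continue  # processor i is not reading in this slot
            if p[i] <= i:
                # On the left-to-right bus, the message passing processor i
                # in this slot was written by processor i - slot.
                source = i - slot
            else:
                # On the right-to-left bus it was written by processor i + slot.
                source = i + slot
            received[i] = data[source]
    return received


if __name__ == "__main__":
    print(permute_1d_appb(["a", "b", "c", "d"], [2, 0, 3, 1]))
```

Every processor reads exactly once, and all reads complete within the n slots of a single cycle, regardless of the permutation chosen.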
An alternative one-dimensional model uses a folded message bus as in Figure 4 [12]. In this model, all writes are done to the upper bus segment and all reads are done from the lower bus segment. The cycle length for the folded bus is $2n\tau$. When a folded bus is used, it takes $(2n - 2 - i - p(i))\tau$ time for data to get from $p(i)$ to $i$. To perform the permutation $p()$, processor $i$ computes $wait(i) = 2n - 2 - i - p(i)$ before the start of the bus cycle; all processors write to the upper bus segment at the start of a bus cycle; and processor $i$ reads from the lower bus segment when its wait time is over. Again, the permutation is complete within one bus cycle.
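The folded-bus wait time can be illustrated by unfolding the bus into a single straight path. The sketch below rests on assumptions of ours: the upper segment carries messages from processor 0 toward processor n - 1, the fold itself adds no delay, and the lower segment carries them back toward processor 0, so that a message travels 2n - 2 - i - j slots from writer j to reader i. The function name folded_bus_permute is hypothetical.

```python
def folded_bus_permute(data, p):
    """Simulate one cycle of a folded bus: processor i ends up with data[p[i]]."""
    n = len(data)
    received = [None] * n
    for i in range(n):
        # Slots processor i must wait before reading from the lower segment.
        wait = 2 * n - 2 - i - p[i]
        # Position of processor i's read tap on the unfolded path; writer j
        # injects its message at position j of the same path at slot 0.
        reader_position = 2 * n - 2 - i
        # The message passing the read tap after `wait` slots started here.
        source = reader_position - wait
        received[i] = data[source]
    return received


if __name__ == "__main__":
    print(folded_bus_permute(["a", "b", "c", "d"], [2, 0, 3, 1]))
```

Under these assumptions the longest wait is 2n - 2 slots, which fits inside the 2n slots of a folded-bus cycle.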
Two-Dimensional Array
The 1D APPB model is easily generalized to higher dimensions. Figure 5 shows a two-dimensional version, the 2D APPB. The 2D APPB is quite similar to meshes with buses [16]: both have row and column buses. The essential difference is that the buses in a 2D APPB are optical whereas those in a mesh with buses are electronic; an electronic bus can carry only one distinct message at a time, whereas an optical bus can carry a distinct message in each slot.
A significant advantage afforded by two-dimensional arrays is the ability to build rather large computers while keeping the number of processors on an individual bus (and hence the bus cycle time) reasonable. If we limit the number of processors on a bus to a few thousand (see above), then the two-dimensional array allows us to build computers with up to a few million processors. The 2D APPB is harder to program than the 1D APPB. For example, to perform a data broadcast, we must first broadcast along a row bus (say) and then along all of the column buses. Even though an arbitrary permutation can be done in $O(1)$ bus cycles, the preprocessing needed by the permutation routing algorithm is excessive [2].
A two-dimensional array that uses folded row and column buses may also be developed.
Reconfigurable Bus Models
One-Dimensional Array
Given the success of reconfigurable architectures that employ electronic buses [7, 8], it is not surprising that reconfigurable optical bus architectures abound. In a one-dimensional reconfigurable bus, for example, processor $i$, $i > 0$, controls a bus control switch that enables it to break the optical bus at processor $i$. When the bus is broken at processors $i_1$, $i_2$, and $i_3$, $i_1 < i_2 < i_3$, for example, we get four independently operating one-dimensional bus computers. The first comprises processors 0 through $i_1 - 1$, the second processors $i_1$ through $i_2 - 1$, the third processors $i_2$ through $i_3 - 1$, and the fourth processors $i_3$ through $n - 1$. This breaking of the bus into four segments is done by processors $i_1$, $i_2$, and $i_3$ opening their bus control switches. Processors can open and close their switches dynamically while a program is executing. Hence, the computer may be reconfigured, as computation proceeds, into a varying number of subcomputers.
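The segmentation induced by a set of open switches is easy to compute. In the sketch below, the function name bus_segments and its arguments are illustrative only: n is the number of processors and breaks lists the processors (all greater than 0) that have opened their bus control switches.

```python
def bus_segments(n, breaks):
    """Return the (first, last) processor of each independent subbus."""
    # A new subbus starts at processor 0 and at every processor that has
    # opened its switch; each subbus ends just before the next one starts.
    starts = [0] + sorted(breaks)
    ends = starts[1:] + [n]
    return [(start, end - 1) for start, end in zip(starts, ends)]


if __name__ == "__main__":
    print(bus_segments(10, [3, 5, 8]))
```

With three open switches on a 10-processor bus, this yields four subbuses, each of which can run its own bus cycle independently until the switches are changed.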
The 1DAROB (one-dimensional array with reconfigurable optical bus) model of [10] and the LARPBS (linear array with a reconfigurable pipelined bus system) of [9] include a conditional delay unit between every pair of adjacent processors. This delay unit is on the upper segment of the select waveguide of the bus; processor $i$, $i > 0$, controls the delay unit to its left.
The conditional delay unit is useful in both the static and reconfigurable bus models. For example, it may be used to find the binary prefix sum in $O(1)$ bus cycles [9, 11]. Suppose that $b_i$ is a binary value that is stored in processor $i$ and that processor $i$ is to compute $\sum_{j=0}^{i} b_j$. Processor $i$, $i > 0$, turns its delay unit on (i.e., sets the unit so that the select pulse will be delayed by one unit) iff $b_i = 1$. Then, the leader (i.e., processor 0 unless the bus has been broken into subbuses) writes a select signal at the start of a bus cycle; the leader also writes a reference signal in each slot. Processor $i$ receives a reference signal in each slot beginning with slot $i$; it receives the select signal with a delay of $\sum_{j=1}^{i} b_j$. So, the two signals are coincident at processor $i$ in slot $\sum_{j=1}^{i} b_j$. All that remains is for processor $i$ to add $b_0$ to the slot number at which the signals were coincident. For this, processor 0 broadcasts $b_0$ using a bus cycle.
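The effect of the delay units can be mimicked in software. The sketch below is our own abstraction and not the algorithm of [9, 11] verbatim: it ignores the optics and tracks only the accumulated delay of the select pulse, assuming that the delay unit to the left of processor i is on exactly when bit i is 1, and that slots are counted from the first reference pulse seen by each processor. The function name binary_prefix_sums is hypothetical.

```python
def binary_prefix_sums(bits):
    """Return the prefix sums bits[0] + ... + bits[i] for every processor i."""
    n = len(bits)
    coincident_slot = [0] * n
    delay = 0  # total delay picked up so far by the select pulse
    for i in range(1, n):
        # The select pulse crosses the delay unit to the left of processor i;
        # the unit adds one slot of delay iff bits[i] == 1.
        delay += bits[i]
        # Processor i sees select and reference pulses coincide in this slot.
        coincident_slot[i] = delay
    # Second bus cycle: processor 0 broadcasts its own bit, and every
    # processor adds it to its coincidence slot number.
    b0 = bits[0]
    return [slot + b0 for slot in coincident_slot]


if __name__ == "__main__":
    print(binary_prefix_sums([1, 0, 1, 1, 0]))
```

The loop here is sequential only because it is a simulation; on the bus, all processors observe their coincidence slots within the same cycle.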
Two-Dimensional Array
The 2DAROB of [10] is a reconfigurable mesh [7, 8] in which the electronic buses have been replaced by optical ones. Figure 6 shows a 2DAROB. Each arrangement of circular arcs denotes a switch; each line segment denotes a bidirectional optical bus segment; and there is one processor (not shown) at the center of each arrangement of circular arcs. The permissible switch settings are shown in Figure 7. Each processor can set its switch dynamically and thereby determine the bus topology. The switch settings at any time result in a set of disjoint buses. Each of these disjoint buses is required to be a unidirectional chain. The first processor on a configured bus is called the bus leader.
An $n^2$-processor 2DAROB can simulate an $n \times n$ reconfigurable mesh with a constant factor slowdown [11]. Since a 2DAROB can perform a column permutation in 1 cycle, whereas an $n \times n$ reconfigurable mesh requires $O(n)$ cycles to do this, a 2DAROB is more powerful (in the asymptotic sense) than a reconfigurable mesh. Some fundamental 2DAROB algorithms are developed in [13].
An alternative two-dimensional reconfigurable model, the array with synchronous switches (ASOS), was proposed in [12] (Figure 8). This model uses folded row and column buses; each processor can write only to the upper segment of its row bus, and each processor can read (concurrently) from the lower segment of its row bus and the right segment of its column bus. The switches shown can be in one of two states. In the straight state, messages move along row buses; in the cross state, messages move from a row bus onto a column bus. Although [12] requires that all switches be set to the same state at any given time, we could per-
OTIS Models
OTIS Topology
The OTIS (optical transpose interconnect system) family of parallel computer models was proposed in [6, 3, 19] . In an OTIS parallel computer, the processors are divided into groups and each group of processors is realized using an electronic package (such as a high density chip or wafer). Intragroup connections are electronic and intergroup connections are realized using free space optics. Thus an OTIS system is an optoelectronic system. By contrast, all interprocessor connections in a pipelined bus system are optical.
Since optical connects provide power, speed, and crosstalk advantages over electronic interconnects when the connect distance is more than a few millimeters [1, 4], an OTIS system attempts to get the best of both worlds: electronic interconnect is used for the short-distance intragroup (or intrapackage) connections and optical interconnect is used for the longer-distance interpackage connections.
The bandwidth of an OTIS system is maximized and power consumption is minimized when the number of groups equals the number of processors in a group [5]. This means that an optimal $N^2$-processor OTIS system has $N$ groups of $N$ processors each. Let $(G, P)$ denote processor $P$ of group $G$ (the processors in each group and the groups are numbered from 0 to $N - 1$). In an OTIS system, processor $(G, P)$ is connected via an optical link to processor $(P, G)$, for all $G$ and $P$. If you regard $(G, P)$ as a matrix index, then the matrix transpose operation moves element $(G, P)$ to position $(P, G)$; hence the name optical transpose interconnect system. Figure 9 shows the topology of a 16-processor OTIS computer; processors are shown as small shaded boxes; the $(G, P)$ index of each processor is given; each group of 4 processors is enclosed by a large box; and the OTIS (i.e., the optical transpose connections) are shown as bidirectional arrows.
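The transpose connection pattern can be enumerated directly. The following sketch uses names of our own choosing (otis_partner, otis_links) and the convention that each bidirectional optical link is listed once, as an unordered pair of (group, processor) indices.

```python
def otis_partner(group, proc):
    """Return the processor at the other end of the optical link."""
    return (proc, group)


def otis_links(n):
    """Enumerate the bidirectional optical links of an OTIS system with
    n groups of n processors each."""
    links = set()
    for group in range(n):
        for proc in range(n):
            if group != proc:
                # frozenset makes (G, P)-(P, G) and (P, G)-(G, P) one link.
                links.add(frozenset({(group, proc), otis_partner(group, proc)}))
            # A processor with group == proc is its own transpose partner,
            # so it contributes no link to another processor.
    return links


if __name__ == "__main__":
    print(len(otis_links(4)))
```

For the 16-processor case, the 12 off-diagonal processors pair up into 6 optical links.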
For the intragroup interconnect topology, we may choose any of the electronic topologies proposed for parallel computers: mesh, hypercube, mesh of trees, etc. The selection of the intragroup topology identifies the specific OTIS model within the family of OTIS models. For example, an OTIS-mesh is an OTIS computer in which the intragroup interconnections correspond to a square mesh, and in an OTIS-hypercube, the processors within each group are connected using the hypercube topology. When analyzing the complexity of an OTIS computer, we count the number of OTIS (i.e., intergroup or optical) data move steps and the number of electronic (or intragroup) data move steps. Figure 10 shows a 16-processor OTIS-Mesh computer.
OTIS-Mesh
An $N^2$-processor OTIS-Mesh can simulate each move of a four-dimensional $\sqrt{N} \times \sqrt{N} \times \sqrt{N} \times \sqrt{N}$ mesh using either one intragroup (i.e., electronic) move or one intragroup and two intergroup (i.e., OTIS) moves [19]. For the simulation, processor $(i, j, k, l)$ of the 4D mesh is mapped on to processor $(G, P)$ of the OTIS-Mesh, where $G = i\sqrt{N} + j$ and $P = k\sqrt{N} + l$.
It is easy to see that the 4D mesh moves $(i, j, k \pm 1, l)$ and $(i, j, k, l \pm 1)$ can be done with one intragroup move of the OTIS-Mesh. Moves of the form $(i \pm 1, j, k, l)$ and $(i, j \pm 1, k, l)$ can be done with two OTIS moves and one electronic move as follows. First, an OTIS move is made to get data from $(i, j, k, l)$ to $(k, l, i, j)$; then an electronic move gets the data to $(k, l, i + 1, j)$ (say); and a final OTIS move gets it to $(i + 1, j, k, l)$.
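The three-step simulation of a 4D mesh move can be traced in code. In this sketch, side stands for the side length of each square intragroup mesh (the square root of the group size), a 4D index (i, j, k, l) is mapped to group i * side + j and processor k * side + l, and all function names are ours rather than taken from [19].

```python
def to_otis(i, j, k, l, side):
    """Map a 4D mesh index to an OTIS-Mesh (group, processor) pair."""
    return (i * side + j, k * side + l)


def from_otis(group, proc, side):
    """Inverse of to_otis."""
    i, j = divmod(group, side)
    k, l = divmod(proc, side)
    return (i, j, k, l)


def otis_move(group, proc):
    """One optical move: follow the transpose link."""
    return (proc, group)


def electronic_move(group, proc, dk, dl, side):
    """One intragroup mesh move by (dk, dl) inside the same group."""
    k, l = divmod(proc, side)
    return (group, (k + dk) * side + (l + dl))


def move_i_plus_one(i, j, k, l, side):
    """Simulate the 4D move (i, j, k, l) -> (i + 1, j, k, l)."""
    group, proc = to_otis(i, j, k, l, side)
    group, proc = otis_move(group, proc)                    # now at (k, l, i, j)
    group, proc = electronic_move(group, proc, 1, 0, side)  # now at (k, l, i + 1, j)
    group, proc = otis_move(group, proc)                    # now at (i + 1, j, k, l)
    return from_otis(group, proc, side)


if __name__ == "__main__":
    print(move_i_plus_one(0, 1, 1, 0, side=2))
```

The sketch does no bounds checking; a caller is assumed to request only moves that stay inside the mesh.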
Many OTIS-mesh and OTIS-hypercube algorithms appear in [14, 15, 17, 18] .
