In this paper we present deterministic algorithms for integer sorting and on-line packet routing on arrays with reconfigurable optical buses. The main objective is to identify the mechanisms specific to this type of architecture that allow us to build efficient integer sorting, partial permutation routing and h-relations algorithms. The consequences of these results on the PRAM simulation complexity are also investigated.
Introduction
In large-scale general purpose parallel machines based on connection networks, efficient communication capabilities are essential in order to solve most of the problems of interest in a timely manner. Interprocessor communication networks are often the main bottlenecks in parallel machines. One important limitation of these networks concerns the exclusive access to the bus resources, which limits throughput to a function of the end-to-end propagation time. Optical communications have emerged as a solution to this problem. Unlike electronic buses, on which signal propagation is bidirectional, optical channels are inherently unidirectional and have a predictable delay per unit length. As shown in [11, 21], this allows a pipeline of signals to be created by the synchronized directional coupling of each signal at specific locations along the channel. Using this kind of spatial parallelism, the end-to-end propagation latency can be amortized over the number of parallel messages active at the same time on the bus. A number of arrays with optical pipelined communications are proposed in [11, 13, 21, 24, 25, 29, 31]. We introduce in [24, 25] a model which incorporates some of the advantages and characteristics of two existing models, namely the classical reconfigurable networks [20] and the arrays with optical pipelined buses. The new Array with Reconfigurable Optical Buses, AROB, is used in this paper as the model for which we study the possibility of implementing integer sorting and routing algorithms. When using optics, techniques which are unique and/or suitable to optics must be developed. Some of these techniques are either revised or introduced in this paper. The aim is to identify those techniques which allow for simple and efficient algorithms.
Integer sorting is the first problem studied. The performance of the integer sorting algorithms given in this paper is an immediate consequence of the particular communication and computation mechanisms used. Under certain conditions (to be specified later), sorting n integers takes O(1) steps on a linear AROB (LAROB) with n processors. This is to be compared with the complexity of the algorithm given in [32], which takes O(k) steps when n k-bit integers are sorted on a LAROB with the same number of processors. Note that sorting n bits on the CRCW PRAM needs $\Omega(\log n / \log\log n)$ time, for any polynomial number of processors [6].

[A PRAM consists of n processors and m memory locations, where each processor is a random-access machine, see for example [1]. All processors communicate via the shared memory. There are two cases for the write operation: exclusive write (EW), when only one processor is allowed to write into a given location, and concurrent write (CW), where two or more processors can write into a given location. Similarly, for the read operation we distinguish between concurrent read (CR) and exclusive read (ER). When several processors attempt to write into the same memory location of a CW variant, the model must specify what value ends up stored in that memory location. Examples of these rules, given in the decreasing order of their power, are: Priority, Common and Combining CW. Obviously, the concurrent (CW, CR) models are more powerful than the exclusive ones (EW, ER).]
It is also shown in this paper that the RotateSort algorithm for $n^2$ values $a_{i,j}$, $0 \le a_{i,j} \le n-1$, takes O(1) steps on the $n \times n$ AROB. The more general case of sorting n general values has been considered in [2, 12, 17, 24]. Of particular interest for us is the possibility of using integer sorting in order to implement efficient algorithms for different packet routing problems.
The routing problem requires each processor in a network, which initially stores one packet, to send it to its destination processor without interfering with other packets. The task is to send all packets as quickly as possible with the smallest possible queue size. The queue size is the maximum number of packets that have to be stored simultaneously in a single processor during any step of the routing algorithm. Two types of routings are studied in this paper, namely the partial permutation routing and the h-relations.
Consider that each processor of a network stores at most one data packet which must be routed to an arbitrary destination processor. The partial permutation routing (or 1-relation) problem requires moving the packets to their destinations, each processor being the destination of at most one packet [36]. The partial permutation routing of n packets on an n-processor LAROB can be done on-line in one step [11, 24]. It is easy to prove that the same permutation routing algorithm can also be used when each processor has O(1) packets [24]. If off-line preprocessing is allowed, it is shown in [11, 24] that an arbitrary partial permutation routing on a two-dimensional array with optical pipelined buses can also be implemented in O(1) steps. In [26] it is shown that the barrel shift permutation routing can be implemented on-line in O(1) steps on the 2D AROB. The subset of permutations called bit permute and complement can also be implemented in O(1) steps on the 2D AROB [33].
The basic computation and communication mechanisms, particular to this type of optical networks, allow us to give in this paper O(1)-step deterministic algorithms for on-line arbitrary partial permutation routing on the 2D and 3D AROBs.
The h-relation is the communication problem which requires that the (up to) h packets stored by each processor be moved to their arbitrary destinations, with the condition that each processor is the destination of at most h packets [36]. It has been shown in [4] that a special case of (log n)-relations can be routed in Õ(log n) steps (i.e. in O(log n) steps, with high probability) on an n-node Optical Communication Parallel Computer (OCPC).
In an OCPC any processor can communicate with any other processor in one unit of time, provided there are no conflicts. In case of conflict, no message reaches the intended destination [4, 14, 36]. The best time of Õ(h + log log n) has been obtained in [14] for the general OCPC h-relation problem. It is shown in [32] that: 1) the h-relations on an n-node LAROB take Õ(h) steps; 2) arbitrary h-relations can be routed in Õ(h + log log n) steps on a $\sqrt{n} \times \sqrt{n}$ AROB; and 3) there is an O(h log n)-step deterministic algorithm for any h-relation on a LAROB as well as a 2D AROB. We show that, in general, for any h-relation simulated on either linear or 2D AROBs, there are on-line deterministic algorithms which take O(h) steps. It takes $\Omega(h)$ steps for a processor to read/write h data and therefore the h-relation algorithms are optimal.
It has been shown in [24] that an n-processor Priority-CRCW PRAM can be simulated in O(1) steps by a linear array processor with pipelined buses. Also, one step of a CRCW PRAM can be simulated on an $n^{1/2} \times n^{1/2}$ array processor with pipelined buses in O(log n) steps with high probability [24]. The on-line partial permutation routing and h-relations algorithms presented in this paper allow us to extend the study of PRAM simulation on the AROB. It is shown that a p-processor linear or two-dimensional AROB, where each processor has O(n/p) memory, can simulate in O(n/p) steps any operation performed on an n-processor EREW PRAM, with $p \le n$. When p = n the EREW PRAM can be simulated in O(1) steps on the 3D AROB. As an immediate consequence, one step of the n-processor Priority-CRCW PRAM can be simulated deterministically in O((n/p) log n) steps on a p-processor 2D AROB, $p \le n$, and in O(log n) steps on a 3D AROB with n processors.
The Model

Linear arrays with optical pipelined buses
Consider a linear array of n processors connected to an optical bus as in Fig. 1. Each processor is connected to the bus through two directional couplers. One is used to write data on the bus and the other to read data from the bus. Each processor sends data on the upper (transmitting) segment of the bus and reads messages from the lower (receiving) segment of the bus. The optical bus is constructed from three identical waveguides. One waveguide is used for message transmission, the message waveguide, and two are used for carrying address related information, the reference (Ref) and select (Sel) waveguides. During a write cycle, the data written by a processor into the bus propagate as indicated with arrows in Fig. 1, and may be read by any subsequent node on the bus. Due to the directionality of the signal propagation and the predictable delay of the signal, the same bus may be used to transmit messages between other nodes at the same time. Let us consider that each message is b bits long. Each bit is represented by an optical signal of width w seconds for a binary value of 1. The absence of such a signal represents a 0. We call the position occupied by a message a time slot.
Two conditions are essential for this kind of transmission: (1) all transmissions are synchronized and (2)

$$d > b\,w\,c_g, \qquad (1)$$

where d is the optical distance between two adjacent processors and $c_g$ is the velocity of light in the waveguide. These two conditions ensure that the signals corresponding to two different messages, which travel in a waveguide in the same direction at the same time, do not physically overlap at any point on the waveguide, i.e., are space multiplexed. The end-to-end propagation time is $\tau_b = 2n\tau$ seconds, where n is the number of processors in the linear array and $\tau = d/c_g$ is the time taken for a pulse to traverse the optical distance d. A $\tau$-time period is called a petit cycle. The principles of this linear Array Processor with Pipelined Buses (APPB) have been introduced in [11, 21].
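The two timing quantities and the non-overlap condition of equation (1) can be checked numerically. The following is a minimal sketch, writing tau for the petit cycle d/c_g and tau_b for the end-to-end propagation time 2n·tau; the function name and the sample values (in particular the waveguide velocity and the spacing d) are illustrative assumptions, not parameters taken from the model.

```python
def bus_timing(n, b, w, d, c_g):
    """Return (tau, tau_b) for a linear pipelined bus, after checking equation (1).

    n   : number of processors
    b   : message length in bits
    w   : pulse width in seconds
    d   : optical distance between adjacent processors (metres)
    c_g : velocity of light in the waveguide (metres/second)
    """
    # Equation (1): a b-bit message must fit between two adjacent processors.
    if not d > b * w * c_g:
        raise ValueError("messages would overlap: need d > b*w*c_g")
    tau = d / c_g            # petit cycle
    tau_b = 2 * n * tau      # end-to-end propagation time
    return tau, tau_b


# Hypothetical values: 64 processors, 32-bit messages, 50 ps pulses,
# 0.5 m between processors, 2e8 m/s in the waveguide (assumed).
tau, tau_b = bus_timing(n=64, b=32, w=50e-12, d=0.5, c_g=2e8)
```

With these assumed values a message occupies 0.32 m of waveguide, so the 0.5 m spacing satisfies the condition; shrinking d to 0.1 m would violate it.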
Several approaches can be used to route messages in a linear APPB structure from one processor to another. In [11, 21] a time waiting function is introduced. In [29] a time-division multiplexing scheme is given for communications in arrays with optical pipelined buses. This is used in [24, 25] to implement a part of the synchronous communications needed by the AROB. The address information can also be encoded using the coincident pulse technique and unit time delays as shown in [10, 19, 29, 30]. This technique can be used when the source processor knows the address (index) of the destination processor. A unit time delay is represented by the duration of a pulse w and can be obtained using a waveguide (or fiber) segment of length equal to the spatial length of the optical pulse, that is $w c_g$ [29]. For example, a processor can use the coincident pulse addressing mechanism in order to broadcast a value to all other processors [23]. Many other communication patterns are described in [10, 19, 30] using the same coincident pulse technique. The coincident pulse mechanism seems to avoid the traditional bottleneck which characterizes most of the on-line address decoding techniques [29].
It is easy to verify that these communication mechanisms allow for optimal simulation of the OCPC on arrays with optical pipelined buses.
A bus cycle length is given by the time taken by the end-to-end propagation of the messages together with the time required to process a message at the source and destination processors. The latter operations include message generation and detection, synchronization, buffering, etc. [11]. The message processing is done in parallel by all source processors, at the beginning of the bus cycle, and/or by all destination processors after the message propagation ends. The ratio between the message processing time and the time required for communication over a distance d is estimated to be on the order of 10 to 1000 [11]. In order for the message processing time to be of the same order of magnitude as the end-to-end propagation time $\tau_b$, the number of processors n should be sufficiently large. In this case the bus cycle length can be considered to be $O(\tau_b)$. Let a step be either an internal operation or a communication step. It has been argued that counting the number of steps of an algorithm simulated on an array with optical pipelined buses allows for a better comparison between algorithms [12]. This method has the advantage that it abstracts from the details introduced by technology dependent parameters, such as $\tau_b$. Another argument is given by the fact that although the signal propagation time is proportional to the length of the connection, for sufficiently long wires this is the best achievable performance [8]. This is a natural physical limitation of which we must be aware but which does not need to be included in the time complexity expression given for the AROB algorithms. We adopt in this paper the method of specifying the time complexity as a number of steps. For more details on the time complexity issue see [11, 13, 17, 21, 23, 24].
The AROB Model
Two-dimensional arrays with optical pipelined buses have been proposed in order to reduce the number of processors connected to a linear optical pipelined bus [11, 13, 31, 29]. The Array with Reconfigurable Optical Buses (AROB) we propose in [24, 25] uses the basic architectural and functional structure of a classical reconfigurable network, RN [5]. Consider $n^2$ processors connected in a two-dimensional reconfigurable structure as depicted in Fig. 2.a. The switching system of a processor is able to connect the four ports (N, S, E, W) in one of the configurations shown in Fig. 2.b. The interconnection network is assumed to use optical waveguides and optical switches for communication and reconfiguration, respectively. The communication system of this network is modified in order to allow the implementation of the pipelined optical communication mechanisms as they are introduced in [11, 21] and also presented in this section. It is shown in [24] that the local switch setting provides a mechanism for building arbitrary APPB-like linear structures within the AROB. A processor can be connected to up to four such buses at a time. The functional and architectural characteristics of the AROB are described in more detail in [24, 25].

The description of 2D AROBs can be extended to higher dimensions k > 2. For example, a 3D AROB processor has two extra ports for the communications with the neighbor processors in the third dimension. The switch configurations allow any two pairs of these 6 ports to be connected internally, with any port connected to only one of the other ports at the same time.
Basic Algorithms
Integer prefix sums
In this section we consider the special case of computing the integer prefix sums of n numbers $v_i$ with $0 \le v_i \le n$, $0 \le i \le n-1$, and $\sum_{i=0}^{n-1} v_i \le n$. Assume that an n-processor LAROB is available and that each processor $p_i$ stores one value $v_i$. The prefix sums are given by the sequence $(v_0,\ v_0 + v_1,\ v_0 + v_1 + v_2,\ \dots,\ v_0 + v_1 + \dots + v_{n-1})$. We identify two possible techniques which can be used to find the integer prefix sums.
Technique (i): The first technique is similar to the one used in [24] to compute the binary prefix sums (BPS) of n one-bit values. In the first step each $p_i$ introduces a delay of $v_i$ units on the receiving segment of the selection waveguide. In the next step the leader $p_0$ synchronously sends a pair of light pulses, one on the reference waveguide and one on the selection waveguide. At each $p_i$ the selection pulse is delayed by a further $v_i$ pulse positions relative to the reference pulse. The relative delay is decoded off-line by each $p_i$ and represents the prefix sum $v_0 + v_1 + \dots + v_i$. Note that the total delay is at most n, since $\sum_{i=0}^{n-1} v_i \le n$, and that the condition imposed by equation (1) is satisfied. The programmable delays could be implemented using unit-delay fiber segments and switches as in Fig. 4.a. When a processor has to introduce a delay of L units, L delay stages are coupled as in Fig. 4.b. Another possibility is to use (log n) stages, with the delay at stage i being $2^i$. This reduces the number of switches needed to implement the programmable delay lines. Although very simple, this technique has the main drawback that each processor must have n unit delays and, more importantly, that a considerable number of switches is required.
Alternative solutions can be found in [7] and [28], for example.
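The binary-weighted variant of the programmable delay line amounts to writing the required delay L in binary and switching in stage i exactly when bit i of L is set. The sketch below illustrates this under our own naming (the helper functions are hypothetical); it uses ⌊log2 n⌋ + 1 stages so that a delay of exactly n units is also representable, which is a small assumption beyond the (log n) stages mentioned above.

```python
def delay_stage_settings(L, num_stages):
    """Return a list of booleans: True if stage i (delay 2**i units) is switched in."""
    if not 0 <= L < 2 ** num_stages:
        raise ValueError("delay not representable with this many stages")
    return [((L >> i) & 1) == 1 for i in range(num_stages)]


def total_delay(settings):
    """Total delay, in unit delays, produced by a given switch setting."""
    return sum(2 ** i for i, switched_in in enumerate(settings) if switched_in)


n = 8
num_stages = n.bit_length()          # floor(log2 n) + 1 stages: delays 1, 2, 4, 8
settings = delay_stage_settings(5, num_stages)   # 5 = 1 + 4
```

Here a delay of 5 units uses two of the four stages, whereas the unary scheme would couple five unit-delay stages.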
Technique (ii): For the second variant we note that at most n relative delays are required in order to find the prefix sums. This allows the unit delay to increase from w to $\tau$, while keeping the bus cycle length in $O(\tau_b)$. The timing mechanisms of each processor can be used for the required operations, rather than using fiber and switches to construct the delays. Consider a bus cycle which starts with $p_0$ sending a pulse on the transmitting segment of the bus. After one petit cycle, the pulse reaches $p_0$ on the receiving segment. Let us denote this moment as $t_0$. The internal timing circuit is triggered by this pulse and after $v_0$ petit cycles, $p_0$ sends a pulse to $p_1$. In general, if at $t_k$ a pulse is detected at $p_k$, another pulse is sent on the bus after $v_k$ petit cycles. In order for this mechanism to function properly, each processor which stores a value greater than zero must interrupt the receiving segment of the bus by setting (off-line) the switch in the cross state. For the zero value no bus segmentation is performed as no delay is required. The value $((t_k - t_0)/\tau - (k-1)) = v_0 + \dots + v_{k-1}$, computed by $p_k$, represents the delay, in number of petit cycles, introduced by $p_i$, for all i, $0 \le i \le k-1$. The total delay is no more than n and the bus cycle takes $O(\tau_b)$. That is, the bus cycle is within the limits imposed by the model. Equation (1) is also satisfied. Therefore,

Claim 1. The prefix sums of n integers $v_i$ with $0 \le v_i \le n$, $0 \le i \le n-1$, and $\sum_{i=0}^{n-1} v_i \le n$, can be computed in O(1) steps on an n-processor LAROB.
Observation. The second technique (ii) can be interpreted as a modified version of the skewed-time clock distribution model presented in [29].
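The timing mechanism of technique (ii) can be imitated sequentially. The following sketch assumes a simplified timing model of our own: one petit cycle of propagation between adjacent processors, plus a hold of v_k petit cycles at p_k before the pulse is re-emitted. Under this model p_k recovers v_0 + ... + v_{k-1} by subtracting the k hops from the elapsed time; the exact constant to subtract depends on how hops and indices are counted and may differ by one from the convention used in the text.

```python
def arrival_times(values, tau=1.0, t0=0.0):
    """times[k] is the moment the pulse is detected at p_k (assumed timing model)."""
    times = []
    t = t0
    for v in values:
        times.append(t)
        # p_k holds the pulse for v petit cycles, then one hop to p_{k+1}.
        t += v * tau + tau
    return times


def decoded_prefix_sums(times, tau=1.0):
    """Each p_k converts its detection time into v_0 + ... + v_{k-1}."""
    t0 = times[0]
    return [round((t - t0) / tau) - k for k, t in enumerate(times)]


values = [2, 0, 1, 3]
exclusive = decoded_prefix_sums(arrival_times(values))
# Each processor adds its own value locally to obtain the inclusive prefix sum.
inclusive = [s + v for s, v in zip(exclusive, values)]
```

A processor holding zero adds only the propagation hop, which corresponds to leaving the bus unsegmented at that processor.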
Bit polling operation
The bit polling operation calls for reading (detecting) the values of the k-th bits (pulses) of all (or a subset of) the n messages which propagate on the bus during a bus cycle, with $0 \le k \le b$. The objective is to perform this operation during one bus cycle, i.e., with the same speed as the communication. It is assumed that each processor has an optical rotate-shift register RSR integrated in the communication subsystem, Fig. 5. This can easily be constructed from a waveguide (fiber) loop of length equal to the optical distance between two adjacent processors, d. Therefore, a light pulse which propagates inside the RSR loop has a period equal to a petit cycle, $\tau$. The output of the RSR is connected to a coincident-pulse detection circuit CDC, which can be an adapted variant of the one described in [19], for example. The second input of the CDC is connected to the bus, Fig. 5. As implied by its name, the CDC generates a 1 signal at the output if and only if two light pulses reach its inputs at the same time.
For the k-th bit polling operation, the RSR is loaded with $2^k$ at the beginning of a bus cycle. When a b-bit message arrives at $p_i$, a signal is generated at the output of the CDC if and only if the k-th bit of the message on the bus is 1. This information can be further processed in different ways. Due to the applications to follow, we are interested in determining the number of k-th pulses which are 1. Therefore, either a counter or a shift register can be used to accumulate this information. The choice depends on the actual technological constraints, which require that the update period $t_u$ of this device satisfy $t_u \le \tau$. We assume without loss of generality that the latter relation is satisfied. Two consecutive pulses with index k arrive $\tau$ time apart and their number is not larger than n. Thus,

Claim 2. With the resources specified above, the bit polling operation can be performed in O(1) steps on an n-processor LAROB.
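Functionally, bit polling counts how many of the messages passing a processor have their k-th bit set. The sketch below is a sequential stand-in for that behavior only: the mask plays the role of the pattern $2^k$ loaded into the RSR, the bitwise AND plays the role of the coincidence detector, and the running sum plays the role of the counter. The function name is ours.

```python
def bit_poll(messages, k):
    """Count the messages whose k-th bit (bit 0 = least significant) is 1."""
    mask = 1 << k                      # pattern 2**k circulating in the RSR
    count = 0
    for message in messages:           # messages arrive one petit cycle apart
        if message & mask:             # coincidence: both pulses present
            count += 1                 # counter (or shift register) update
    return count


messages = [0b101, 0b011, 0b100, 0b000]
ones_at_bit_0 = bit_poll(messages, 0)
```

The optical mechanism performs all n comparisons within a single bus cycle, which this loop does not attempt to model.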
Observation. In case a shift register is used, it is assumed that the off-line decoding of the unary value stored by the shift register takes O(1) time. This assumption is similar to that allowing a memory access to be counted as a unit time operation, even though the circuit depth required to decode an address is proportional to the logarithm of the memory size.

Variations of two different sorting techniques are presented in this paper. The first variant is a stable sorting algorithm which is based on computing for each input key the rank corresponding to its sorted order, followed by a partial permutation routing. The second technique is based on the radix-sort principle, i.e. if n numbers in the range $[0..I]$ can be stably sorted using p processors in time t, then n numbers in $[0..I^{O(1)}]$ can also be stably sorted in O(t) time with the same number of processors. The motivation for this approach is that the rank-based stable sorting seems to use efficiently the mechanisms specific to the LAROB model when the integer keys to be sorted are restricted to values at most equal to n. The rank-based stable sorting is further used as a building block for the radix-sort algorithm, which allows the range of the input keys to be increased.
Integer sorting on the (basic) LAROB
The first sorting algorithm presented is for n values $v_i$, with $v_i$ represented with (log log n) bits, i.e. $0 \le v_i \le \log n$ for $0 \le i \le n-1$.

Proof: The value B(k) obtained by $p_k$ at the end of the first step is the number of processors $p_i$, with $i \le n-k$ and $v_i = v_k$. In step 2 the processor $p_i \in P_a$ which has computed $|S_a|$ is marked as selected, $0 \le a \le \log n$. The $|S_a|$ value is stored in A(a) during the third step, $0 \le a \le \log n$. In the fourth step the entries of the vector B are moved to their correct position in the LAROB. The rank of each element is computed in step 5 and the permutation routing in step 6 moves the elements to their sorted order. The relative positions of equal keys are given in vector B. The same relative position is also maintained in the final output sorted order. Therefore the sorting operation is stable.

Using the radix sort,

Corollary 1. An n-processor LAROB can sort in a constant number of steps n integers in the range $[0..(\log n)^{O(1)}]$.

The result in Claim 3 can be used to give an algorithm for sorting integers in the range $[0..n^{O(1)}]$ which are represented with k = c log n bits, for some constant $c \ge 1$. Without loss of generality assume that k is a multiple of (log log n). We divide the k bits into (k / log log n) sub-groups, each of (log log n) bits. The sorting algorithm proceeds in (k / log log n) stages. During stage i the values are stably sorted with respect to the i-th least significant group of bits. In view of Claim 3, each of the (k / log log n) stages takes O(1) steps. Therefore,

Claim 4. A LAROB with n processors can sort n k-bit values in O(k / log log n) steps.
The result in Claim 4 compares favorably with the sorting algorithm given in [32], which takes O(k) steps under the same conditions.
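The staged structure behind Claim 4 is ordinary least-significant-group-first radix sort, with a stable sort on one group of bits per stage. The sequential sketch below illustrates only that structure; the bucket-based `stable_sort_by_group` is our stand-in for the O(1)-step stable sort of Claim 3, and `group_bits` plays the role of (log log n).

```python
def stable_sort_by_group(keys, shift, group_bits):
    """Stably sort keys by the group_bits-wide field starting at bit position shift."""
    mask = (1 << group_bits) - 1
    buckets = [[] for _ in range(1 << group_bits)]
    for key in keys:                       # input order is preserved inside a bucket
        buckets[(key >> shift) & mask].append(key)
    return [key for bucket in buckets for key in bucket]


def radix_sort(keys, k, group_bits):
    """Sort k-bit keys in ceil(k / group_bits) stable stages, least significant group first."""
    for shift in range(0, k, group_bits):
        keys = stable_sort_by_group(keys, shift, group_bits)
    return keys


keys = [13, 2, 7, 2, 10, 0, 15, 8]
result = radix_sort(keys, k=4, group_bits=2)   # two stages of 2 bits each
```

The number of stages is k divided by the group width, which is where the O(k / log log n) bound comes from once each stage costs O(1) steps.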
Integer sorting on the (extended) LAROB
One objective of this paper is to identify low level communication and computational mechanisms specific to this type of architecture and to use them in different, more general applications. Low level refers to the mechanisms which allow very simple logic and/or arithmetic operations to be performed on-line. Two examples of low level operations are the bit polling and binary prefix sums. The assumption used so far was that the system is designed to support the simultaneous transmission of up to n messages with b bits each.
Another restriction was that no on-line switch setting is performed. We show next that, by relaxing either of these conditions, sorting n integers can be performed in O(1) steps on the (extended) LAROB.
Multiple Binary Prefix Sums Algorithm
Assume that we have an n-processor LAROB storing a set S of n one-bit values $v_i$, with $s_j$, which is read by $p_k$ during this bus cycle, is the actual binary prefix sum of the elements in $S_j$ with indices smaller than k. The definition of the (L)AROB does not allow a processor to read a value from the bus, to use it in an internal operation and then to write it back on the bus, all in $\tau$ time. However, the updating operation needed to be performed by $p_k$ is very simple and requires only that the value stored by $s_j$ be incremented if $v_k = 1$. If $v_k = 0$ no operation is performed. We suggest two possible techniques specific to this type of interconnection network which can be used to implement the updating operation.
Technique (i): Consider the framework of the coincident pulse technique [10, 19, 24, 29, 30]. The address messages on the reference and selection waveguides have n bits. Additionally, each processor can introduce a unitary delay on the selection waveguide at a well specified moment of time. The unit delay can be connected to the bus through a switch as in Fig. 6.a. At the beginning of a bus cycle each processor writes a pair of pulses on the transmitting segments of the reference and the selection waveguides, respectively. Processor $p_k \in P_j$ performs the update operation by introducing a unit delay after $wait_1(k, j) = k + j + 1$ petit cycles from the beginning of the bus cycle if $v_k = 1$. If $v_k = 0$, no delay is introduced, see Fig. 6.a. The update operation performed by $p_k$ (if any) should affect only the value carried by $s_j$ and thus, the switch state is reversed after another petit cycle, i.e. at $wait_2(k, j) = wait_1(k, j) + 1$. Practical optical switches have a finite switching time $t_s > 0$, which currently is on the order of hundreds of picoseconds. To ensure that the switching process does not affect the messages on the bus, the effect of the switching time must be taken into consideration. A switch must change its state before and after a message reaches that processor, and the general non-overlapping condition in equation (1) becomes $bw + t_s < \tau$, see Fig. 6.b. There is no need to assume an extra $t_s$ time gap after each message because this actually coincides with the one from the beginning of the next message. If $v_k = 1$, the effect of the update operation performed when $s_j$ reaches $p_k$ is to increment by one unit the selection pulse delay relative to the reference pulse. The relative position of the two pulses in the address frame read by $p_k$ during $s_j$ encodes the sum of the values from the same subset $S_j$ which are stored by processors with indices smaller than k. The disadvantage of this technique is that it requires accommodating up to n messages on the bus, each of which has $\max\{b, n\}$ pulse slots. For big values of n this might imply long and therefore inefficient buses.
Technique (ii): The second technique presented here uses only a single n-pulse message. The n pulses are initially positioned $\delta$ time apart from each other, with the value of $\delta$ to be specified later, see Fig. 7.a. For simplicity and without loss of generality assume that $\delta$ is a multiple of w. Any processor can introduce a unit time delay on the selection waveguide. At the beginning of the bus cycle the switch is in the straight state and no delay is introduced.

Assume that $p_k$ stores $v_k = 1$. During the bus cycle the message propagates on the selection waveguide from $p_0$ to $p_{n-1}$. When the j-th pulse of the message is detected at

Besides the switching time $t_s$, $\delta$ should also include the time required by the mechanism used by each processor in order to detect the presence of the j-th pulse. Similar to the case of the bit polling operation, this mechanism can be implemented using either a (log n)-bit counter or an n-bit shift register. The maximum time length of the message on the bus is in this case $(n + 1)(\delta + 2w)$. When $\delta + 2w \le bwc$, for some constant c, the overall updating operation takes O(1) steps. For example, typical values for the switching time and the pulse width are $t_s$ = 100 picoseconds [34] and w = 50 picoseconds [29], respectively.
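Independently of which optical technique implements the update, the input/output behavior of the multiple binary prefix sums can be stated sequentially: each subset has a work slot holding a running count, a processor first reads the slot of its own subset, and then increments it if its bit is 1. The sketch below assumes the subsets are given as a list `subset_of` mapping each processor index to its subset index; that representation and the function name are ours.

```python
def multiple_binary_prefix_sums(bits, subset_of):
    """For each processor k, return the number of processors i < k in the same
    subset as k whose bit is 1 (the value read from the work slot of that subset)."""
    slots = {}                                   # work slot s_j per subset j
    read_values = []
    for v, j in zip(bits, subset_of):            # processors in index order
        read_values.append(slots.get(j, 0))      # p_k reads s_j
        if v == 1:
            slots[j] = slots.get(j, 0) + 1       # update: increment s_j
    return read_values


bits = [1, 1, 0, 1, 1]
subset_of = [0, 1, 0, 0, 1]
sums = multiple_binary_prefix_sums(bits, subset_of)
```

In this example processors 0, 2 and 3 form one subset and processors 1 and 4 the other, so processor 3 reads 1 (only processor 0 precedes it with a set bit in its subset).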
LAROB stable sort algorithm
Consider n values $v_i$, with $0 \le v_i \le n-1$, $0 \le i \le n-1$. In what follows we present an algorithm to sort these values in O(1) steps on the LAROB which allows on-line switch setting. Let $S_a = \{v_i \mid v_i = a,\ 0 \le i \le n-1\}$ and $P_a = \{p_i \mid v_i \in S_a,\ 0 \le i \le n-1\}$ for any a, $0 \le a \le n-1$.

LAROB integer stable-sort Algorithm 2

1. Apply the MBPS algorithm and find for each $v_i$ its relative position in $S_a$. The one-bit value used by each processor in the MBPS steps is 1 and any subset has no more than n elements. In order to implement the MBPS algorithm, each subset $S_a$ has associated the work slot $s_a$, $0 \le a \le n-1$. Let $r_i$ be the rank computed by $p_i$.

2. During the MBPS step the last element in $S_a$, i.e. $p_i$ with $i = \max_k \{k \mid p_k \in P_a,\ 0 \le k \le n-1\}$, has computed $m_a = |S_a|$, with $0 \le m_a \le n$ and $\sum_{a=0}^{n-1} m_a \le n$. The multiple binary prefix OR algorithm [24] can be used to determine in parallel the $p_i$ with $i = \max_k \{k \mid p_k \in P_a,\ 0 \le k \le n-1\}$. Let us call these processors active.

3. Apply a permutation routing step during which each active processor in $P_a$ sends $m_a$ to $p_a$, $0 \le a \le n-1$.

4. Apply the integer prefix sums algorithm and determine the prefix sums $ps_a$ of the $m_a$ values. In another bus cycle all $p_i \in P_a$ read the prefix sum $ps_a$ computed at $p_a$, $0 \le a \le n-1$.

Sorting n integers in the range $[0..n^{O(1)}]$ can be performed in O(1) steps using an n-processor LAROB.
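The rank-based structure of the stable sort can be followed in a short sequential sketch. It mirrors the numbered steps above with plain loops instead of bus operations, and it makes one explicit assumption for the closing steps: the final rank of $v_i$ is taken to be $ps_a + r_i$, where $ps_a$ counts the keys smaller than a, followed by a permutation that places each key at its rank. All names are ours.

```python
from itertools import accumulate


def rank_based_stable_sort(values):
    """Sort n keys in 0..n-1 stably; returns (key, original_index) pairs in sorted order."""
    n = len(values)
    if any(not 0 <= v <= n - 1 for v in values):
        raise ValueError("keys must lie in 0..n-1")

    # Step 1: r[i] = relative position of values[i] among equal keys (MBPS with all bits 1).
    m = [0] * n
    r = []
    for v in values:
        r.append(m[v])
        m[v] += 1
    # Steps 2-3: m[a] = |S_a| is now available at position a.

    # Step 4: ps[a] = m[0] + ... + m[a-1], the number of keys smaller than a.
    ps = [0] + list(accumulate(m))[:-1]

    # Assumed closing steps: rank = ps[a] + r[i], then a permutation routing.
    out = [None] * n
    for i, v in enumerate(values):
        out[ps[v] + r[i]] = (v, i)
    return out


sorted_pairs = rank_based_stable_sort([3, 1, 3, 0, 1])
```

Keeping the original index alongside each key makes the stability visible: equal keys appear in increasing order of their source processor.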
Partial Permutation Routing on a 2D AROB
We show in this section that the two-dimensional partial permutation routing can be performed in O(1) steps on a 2D AROB. We give next a short description of a 2D integer sorting algorithm which is also used in the partial permutation routing algorithm.
The LAROB sorting algorithm can be used to construct the 2D AROB sorting algorithm.

h-relations algorithms on the (L)AROB

h-relations on the LAROB

Consider a LAROB with n processors each storing up to h packets. The h-relation is the communication problem which requires moving the up to h packets in each processor to their destinations. The destinations can be arbitrary except that each processor is the destination of at most h packets [36]. There are at most nh packets to be moved. Imagine the packets as being stored in an $h \times n$ array. Each column of this array represents a processor and each row represents a level (or memory location) on which each processor stores one packet. Therefore, the h-relation problem can be seen as a modified partial permutation routing in the $h \times n$ array using only an n-processor LAROB. The difference is that the actual level on the destination processor is no longer significant, as we want to move the packets to their destination processors regardless of the memory locations they occupy.
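The defining constraints of an h-relation are easy to state as a check on a list of packets. The sketch below assumes packets are given as (source, destination) pairs of processor indices in 0..n-1, a representation chosen here only for illustration; it verifies that no processor sends or receives more than h packets and does not perform any routing.

```python
from collections import Counter


def is_h_relation(packets, n, h):
    """True if every processor is the source of at most h packets and
    the destination of at most h packets."""
    if any(not (0 <= s < n and 0 <= d < n) for s, d in packets):
        return False
    sent = Counter(s for s, _ in packets)
    received = Counter(d for _, d in packets)
    return all(c <= h for c in sent.values()) and all(c <= h for c in received.values())


packets = [(0, 1), (0, 2), (1, 2), (2, 0)]
ok_for_h2 = is_h_relation(packets, n=3, h=2)
```

In the example processor 0 sends two packets and processor 2 receives two, so the set is a 2-relation but not a 1-relation (partial permutation).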
The algorithm for the h-relation on the LAROB is a simple adaptation of the 2D partial permutation algorithm. Of interest for practical parallel processing is the possibility to simulate efficiently, i.e. with minimum slowdown, an n-processor machine $M$ on a p-processor machine $M'$, with $p \le n$. Note that the trivial lower bound for this simulation, $\Omega(n/p)$, is given by the number of steps required by a LAROB processor in order to simulate the internal operation of the n/p associated PRAM processors. Thus the result in Corollary 4 is optimal.
h-relations on the AROB
Let us consider an $n \times n$ AROB with each processor storing at most h packets. The two-dimensional h-relation problem calls for moving these packets to their arbitrary destinations, where each processor can receive at most h packets. There are at most $n^2 h$ packets, which can be seen as arranged in an $n \times n \times h$ array, with the base array $(i, j, 0)$,

$$(n_k(c) - n_{k,i}(c)) \le h. \qquad (4)$$

That is, we have an upper bound on the difference between the number of packets when each level k of $(i, *, *)$ is assumed to have the maximum $n_k(c)$ packets (that is, in the worst case scenario) and the real number of packets. For all matrices $(i, *, *)$, for a fixed c and taking into consideration equations (3) and (4).

Equivalently, if we assume for some i that after phase (i) there are $n_k(c)$ packets with the same destination column c stored at each level k of array $(i, *, *)$, then there are at most 2h such packets in the entire array $(i, *, *)$. This is an upper bound on the number of packets with the destination column c which can be found after phase (i) in an array $(i, *, *)$, for any i and c.
Phases (ii) and (iii) are actually $(2h)$-relation algorithms which take $O(h)$ steps and are implemented on the associated LAROBs. Note that in order to be able to perform these two phases the processors should have queues of length $2h$. Furthermore,
Corollary 5 A $p$-processor 2D AROB, where each processor has $O(n/p)$ memory, can simulate in $O(n/p)$ steps any operation performed by an $n$-processor EREW PRAM, with $p \le n$.
It is trivial to give a partial permutation routing algorithm for a 3D AROB using the result in Claim 9. After applying the three phases of the h-relations AROB algorithm with $h = n$, an extra permutation routing is needed in each column $(i, j, *)$ and all the packets are moved to their destination processors. Thus,
Claim 10 The partial permutation routing can be performed in $O(1)$ steps on the three-dimensional AROB. Also, any operation performed by an $n$-processor EREW PRAM can be simulated optimally, in $O(1)$ steps, on the $n$-processor 3D AROB.
It is shown in [18] that one step of the Priority-CRCW PRAM with $n$ processors can be simulated by an $n$-processor EREW PRAM in $O(\log n)$ steps. Therefore, one step of the $n$-processor Priority-CRCW PRAM can be simulated deterministically in $O((n/p) \log n)$ steps on a $p$-processor 2D AROB, $p \le n$, and in $O(\log n)$ steps on a 3D AROB with $n$ processors.
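To make the semantics being simulated concrete, the following sequential sketch resolves one Priority-CRCW write step by sorting the requests, which is the usual way such a step is reduced to exclusive accesses. Whether [18] uses exactly this construction is not stated here. The convention that the lowest processor index has the highest priority, the request encoding, and the name `priority_crcw_write` are assumptions.

```python
def priority_crcw_write(memory, requests):
    """Resolve one concurrent-write step under the Priority rule.

    requests is a list of (processor_id, address, value) triples.
    Assumed convention: for each address the lowest processor_id wins.
    """
    # Sort by address, then by processor id, so the winner of each
    # address group comes first in its group.
    ordered = sorted(requests, key=lambda r: (r[1], r[0]))
    result = list(memory)
    previous_address = None
    for processor_id, address, value in ordered:
        if address != previous_address:
            # First request of a new address group: this is the single
            # write that actually reaches memory (an exclusive write).
            result[address] = value
            previous_address = address
    return result
```

After the sort, each memory cell is written by exactly one request, so the remaining work uses only exclusive accesses. In a parallel setting the sort is what accounts for the logarithmic factor in this style of simulation.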
Conclusions
We have identified in this paper a number of communication and computation mechanisms which allow us to build efficient integer sorting, partial permutation routing and h-relations algorithms on the AROB. It is shown that if on-line switch setting is allowed, sorting $n$ integers in the range $[0 \,..\, n^{O(1)}]$ takes $O(1)$ steps on a LAROB with $n$ processors. An $O(1)$-step deterministic algorithm for the partial permutation routing is also presented. In general, the h-relations algorithms designed for the (L)AROB take $O(h)$ steps.
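One way to see why a polynomial key range is convenient for integer sorting is that every key then has a constant number of digits in radix n. The sequential sketch below is not the O(1)-step LAROB algorithm; it only shows the constant number of stable passes. The function name `sort_polynomial_range` and the explicit exponent parameter `c` are assumptions.

```python
def sort_polynomial_range(keys, c):
    """Sort n integers from [0, n**c) using c stable passes in radix n."""
    n = len(keys)
    if n <= 1:
        return list(keys)
    if any(k < 0 or k >= n ** c for k in keys):
        raise ValueError("keys must lie in [0, n**c)")
    out = list(keys)
    for digit in range(c):
        divisor = n ** digit
        # Stable bucket pass on the current radix-n digit,
        # starting from the least significant one.
        buckets = [[] for _ in range(n)]
        for k in out:
            buckets[(k // divisor) % n].append(k)
        out = [k for bucket in buckets for k in bucket]
    return out
```

Here the number of passes depends only on the exponent `c`, not on `n`, so an architecture that performs one stable pass in constant time would sort the whole input in a constant number of steps.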
The implications of the results obtained for the communication primitives on the complexity of PRAM simulation are also determined. This work was motivated by the need to relate the architectural models which use optical pipelined interconnections to computational models which have support features for high-level, general-purpose programming languages. The PRAM and the Bulk-Synchronous Parallel (BSP) computer [36, 37] are parallel computation models which allow concurrent memory access, virtual processors, barrier synchronization, and automatic and explicit memory allocation, features which are essential in order to support high-level programming languages. It is well known that, given enough slackness, the OCPC can efficiently simulate the BSP [36, 37]. It is trivial to verify that the OCPC can be simulated optimally on an array with optical pipelined buses. Thus, the BSP can also be simulated on an array with optical pipelined buses. (We conjecture here that a more efficient and direct simulation of the BSP by the (L)AROB is possible.)
The model of computation used in this paper has electronic digital processors, fiber interconnections and opto-electronic couplers. Some of the issues not analyzed in this paper are the synchronization of processors, clock distribution, pulse positioning, optical fanout, delay lines, offline setup, etc. These problems have been investigated in [9, 27, 29]. The current opto-electronic technology is still immature and many problems remain to be solved. One example is given by the increasing optical power consumption along an optical bus, making it necessary to use optical amplifiers on buses connecting more than a few hundred processors. The use of on-line switching is of particular importance for the AROB because signal amplitude restoration and re-synchronization can be accomplished by the technique of switching in a fresh copy of the system clock, see [16]. However, although expensive, all the opto-electronic devices assumed by our model exist. Present commercial devices limit the operating speed to a few hundred MHz. Demonstrations of optical transmissions at 10-250 Gb/s have already been reported [28, 35]. It is expected that future devices could increase this speed considerably [16].
