We show that the diameter of an $N^2$ processor OTIS-
Introduction
Marsden et al. [6], Hendrick et al. [2], and Zane et al. [9] have proposed large scale parallel computer architectures in which the processors are divided into groups. Each group of processors is realized using one or more high density chips or modules with electronic interprocessor connections. Interconnections between processors in different groups are realized via free space optics. These optoelectronic architectures use electronic interconnects over short distances (i.e., for the local intra group interconnects) and free space optics for the longer inter group interconnects. This choice is motivated by the fact [1], [3] that free space optical interconnects provide power, speed, and crosstalk advantages over electronic interconnects when the connect distance is more than a few millimeters. Further, when optical interconnects are used, the per group I/O bandwidth increases in proportion to the group's layout area, whereas when electronic interconnects are used, this bandwidth grows in proportion to the layout perimeter.
Krishnamoorthy et al. [4] have shown that the bandwidth and power consumption are minimized when the number of processors in a group equals the number of groups, that is, when an $N^2$ processor system is partitioned into $N$ groups each containing $N$ processors. As a result, we limit our study of optoelectronic architectures to those partitioned in this manner. The specific optoelectronic architecture proposed by Marsden et al. [6] employs the optical transpose interconnection system (OTIS). In this, processor $i$ of group $j$ is connected to processor $j$ of group $i$ via an optical link. Figure 1 shows a 16 processor OTIS architecture. The processor indices are represented as pairs of the form $(G, P)$ where $G$ is the group index and $P$ the processor index (within the group).
This work was supported, in part, by the Army Research Office under grant DAA H04-95-1-0111.
Some of the suggested topologies for the intra group electronic connections are mesh, hypercube, and mesh of trees [6]. These, respectively, result in the OTIS-Mesh, OTIS-Hypercube, and OTIS-MT optoelectronic architectures. Figure 2 shows a 16 processor OTIS-Mesh computer.
The processors of groups 0 and 2 are labeled using two dimensional local mesh coordinates while the processors in groups 1 and 3 are labeled in row major fashion.
Zane et al. [9] have shown that the OTIS-Mesh can simulate each move of a $\sqrt{N} \times \sqrt{N} \times \sqrt{N} \times \sqrt{N}$ four dimensional mesh by using either an electronic move local to a group or one local electronic move and two inter group moves using the OTIS (or optical) interconnection. We shall refer to the latter as OTIS moves.
In this paper, we study the OTIS-Mesh architecture and obtain properties and basic permutation routing algorithms. These algorithms can in turn be used in the development of efficient application codes. We begin, in Section 2, by deriving the diameter (i.e., maximum distance between any two processors) of the OTIS-Mesh. In Section 3, we consider the problem of embedding a two dimensional $N \times N$ mesh into an $N^2$ processor OTIS-Mesh. Algorithms to perform popular data rearrangements such as transpose, reversal, and shuffle are developed in Section 4. An algorithm for general BPC (bit permute complement) permutations is presented in Section 5.
2. OTIS-Mesh Diameter
Let $(G_1, P_1)$ and $(G_2, P_2)$ be two OTIS-Mesh processors. The shortest path between these two processors is of one of the forms: So, we may assume that the path is of the above form with exactly two OTIS moves.
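The path involves an odd number of OTIS moves. In this case, it must involve exactly one OTIS move (as otherwise it may be compressed into a shorter path with just one OTIS move as in (b)) and may be assumed to be of the form $(G_1, P_1) \xrightarrow{E^*} (G_1, G_2) \xrightarrow{O} (G_2, G_1) \xrightarrow{E^*} (G_2, P_2)$, where $E^*$ denotes a sequence of electronic moves within a group and $O$ denotes an OTIS move.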
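The path analysis can be checked numerically for small $N$. The following is a minimal sketch, not taken from the paper: it assumes that the processors of a group are numbered in row major order on a $\sqrt{N} \times \sqrt{N}$ mesh, builds the OTIS-Mesh graph, and finds the diameter by breadth-first search from every processor. The function names are illustrative.

```python
from collections import deque
from math import isqrt

def otis_mesh_neighbors(g, p, n):
    """Yield the neighbors of processor (g, p) in an OTIS-Mesh with n groups of n processors."""
    s = isqrt(n)                 # each group is an s x s mesh (row major numbering assumed)
    r, c = divmod(p, s)
    if r > 0:
        yield (g, p - s)         # electronic link to the processor above
    if r < s - 1:
        yield (g, p + s)         # electronic link to the processor below
    if c > 0:
        yield (g, p - 1)         # electronic link to the left
    if c < s - 1:
        yield (g, p + 1)         # electronic link to the right
    if g != p:
        yield (p, g)             # OTIS link: (G, P) <-> (P, G)

def diameter(n):
    """Largest shortest-path distance over all pairs of processors."""
    best = 0
    for src in [(g, p) for g in range(n) for p in range(n)]:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in otis_mesh_neighbors(u[0], u[1], n):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best
```

For $N = 4$ and $N = 9$ the search returns 5 and 9, which agrees with a diameter of $4\sqrt{N} - 3$.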
Embedding Of An N x N Mesh
Zane et al. [9] have shown that a $\sqrt{N} \times \sqrt{N} \times \sqrt{N} \times \sqrt{N}$ mesh may be embedded into an $N^2$ processor OTIS-Mesh so that each mesh move may be performed by either one electronic OTIS-Mesh move or one electronic and two optical OTIS-Mesh moves. The embedding is rather straightforward with processor $(i, j, k, l)$ of the mesh being identified with processor $(G, P)$, For example, the move $(i, j + 1, k, l)$ may be done by the
The above efficient embedding of a 4D mesh implies that 4D mesh algorithms can be run on the OTIS-Mesh with a constant factor ( at most 3 ) slowdown [9]. Unfortunately, the body of known 4D mesh algorithms is very small compared to that of 2D mesh algorithms. So, it is desirable to consider a 2D mesh embedding. Such an embedding will enable one to run 2D mesh algorithms on the OTIS-Mesh. Naturally, one would do this only for problems for which no 4D algorithm is known or for which the known 4D mesh algorithms are not faster than the 2D algorithms.
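As an illustration of the one electronic plus two OTIS move simulation, here is a small sketch. It assumes the identification $G = i\sqrt{N} + j$ and $P = k\sqrt{N} + l$ (row major numbering of groups and of processors within a group); this mapping and the helper names are assumptions made for the example, since the formula is not reproduced in the text above.

```python
def to_otis(i, j, k, l, s):
    """Map 4D mesh coordinates to (group, processor); s = sqrt(N). Assumed mapping."""
    return (i * s + j, k * s + l)

def otis_move(g, p):
    """Optical move: processor p of group g sends to processor g of group p."""
    return (p, g)

def electronic_move(g, p, s, drow, dcol):
    """Move one step inside the s x s mesh of group g."""
    r, c = divmod(p, s)
    r, c = r + drow, c + dcol
    if not (0 <= r < s and 0 <= c < s):
        raise ValueError("move leaves the group mesh")
    return (g, r * s + c)

def move_j_plus_1(i, j, k, l, s):
    """Simulate the 4D mesh move (i, j, k, l) -> (i, j + 1, k, l)."""
    g, p = to_otis(i, j, k, l, s)
    g, p = otis_move(g, p)                  # the old group index is now the local index
    g, p = electronic_move(g, p, s, 0, 1)   # one step right changes j to j + 1
    g, p = otis_move(g, p)                  # return to the (group, processor) orientation
    return (g, p)
```

With $\sqrt{N} = 3$, `move_j_plus_1(1, 0, 2, 1, 3)` ends at the same processor as `to_otis(1, 1, 2, 1, 3)`, using two OTIS moves and one electronic move.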
There are at least two intuitively appealing ways to embed an $N \times N$ mesh into the OTIS-Mesh. One is the group row mapping (GRM) in which each group of the OTIS-Mesh represents a row of the 2D mesh. The mapping of the mesh row onto a group of OTIS processors is done in a snake-like fashion as in Figure 3 can be moved to processor $(j, i)$ with one OTIS and zero electronic moves.
The second way to embed an $N \times N$ mesh is to use the group submesh mapping (GSM). In this, the $N \times N$ mesh is partitioned into $N$ submeshes, each of size $\sqrt{N} \times \sqrt{N}$. Each of these is mapped in the natural way onto a group of OTIS-Mesh processors. Figure 3(b) shows GSM of a $4 \times 4$ mesh. Moving all elements of row or column $i$ over by one is now considerably more expensive. For example, a row shift by +1 would be accomplished by the following data movements (a boundary processor is one on the right boundary of a group):
Step 1: Shift data in non-boundary processors right by one using an electronic move.
Step 2: Perform an OTIS move on boundary processor data. So, data from $(G, P)$ moves to $(P, G)$.
Step 3: Shift the data moved in Step 2 right by one using an electronic move. Now, data from $(G, P)$ is in $(P, G + 1)$.
Step 4: Perform an OTIS move on this data. Now data originally in $(G, P)$ is in $(G + 1, P)$.
Step 5: Shift this data left by $\sqrt{N} - 1$ using $\sqrt{N} - 1$ electronic moves. Now, the boundary data originally in $(G, P)$ is in the processor to its right but in the next group.
The above five step process takes $\sqrt{N}$ electronic and two OTIS moves.
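The five steps can be traced with a short simulation. This is a sketch under stated assumptions: groups and processors are numbered in row major order, data in the last mesh column (which has no right neighbor) is left out, and the function names are made up for the example. It only checks where each data item ends up and does not tally moves.

```python
def gsm(r, c, s):
    """Group submesh mapping of mesh position (r, c) to (group, processor); s = sqrt(N)."""
    return ((r // s) * s + c // s, (r % s) * s + c % s)

def gsm_row_shift(s):
    """Apply the five step row shift by +1; return {mesh position: final (group, processor)}."""
    n = s * s
    pos = {(r, c): gsm(r, c, s) for r in range(n) for c in range(n - 1)}
    boundary = {key for key, (g, p) in pos.items() if p % s == s - 1}
    # Step 1: data in non-boundary processors moves right by one (electronic).
    for key in pos:
        if key not in boundary:
            g, p = pos[key]
            pos[key] = (g, p + 1)
    # Step 2: OTIS move on boundary data, (G, P) -> (P, G).
    for key in boundary:
        g, p = pos[key]
        pos[key] = (p, g)
    # Step 3: shift that data right by one (electronic), reaching (P, G + 1).
    for key in boundary:
        g, p = pos[key]
        pos[key] = (g, p + 1)
    # Step 4: OTIS move, reaching (G + 1, P).
    for key in boundary:
        g, p = pos[key]
        pos[key] = (p, g)
    # Step 5: shift left by s - 1 positions (s - 1 electronic moves).
    for key in boundary:
        g, p = pos[key]
        pos[key] = (g, p - (s - 1))
    return pos
```

Every item that started at mesh position $(r, c)$ finishes in the processor that GSM assigns to $(r, c + 1)$.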
GSM is also inferior on the transpose operation which now requires $8(\sqrt{N} - 1)$ electronic and 2 OTIS moves.
Theorem 3
The transpose operation of an $N \times N$ mesh requires $8(\sqrt{N} - 1)$ electronic and 2 OTIS moves when the GSM is used.
Proof See [8]. □
Based on the relative performance of GRM and GSM with respect to the common mesh operations of shift by row or column and transpose, we recommend the use of GRM.
Common Data Rearrangements
Since an $N^2$ input three stage OTIS multistage interconnection network (MIN) composed of $N$ $N$-input $N$-output switches in each stage is rearrangeable [6] and since the $N$ processor mesh in each group of an $N^2$ processor OTIS-Mesh is able to realize every permutation of $N$ elements, an $N^2$ processor OTIS-Mesh can realize any permutation of $N^2$ data (one to each processor) using at most two OTIS moves. However, additional OTIS moves are needed to determine the local group data rearrangements that must be made. Another way to accomplish arbitrary data permutations is to use sorting. The optimal 4D mesh sorting algorithm of [5] may be run on an OTIS-Mesh using the simulation described in [9]. This yields an OTIS-Mesh sorting algorithm that takes $12\sqrt{N}$ OTIS and $14\sqrt{N}$ electronic moves. Nassimi and Sahni [7] have developed optimal 4D mesh algorithms for several frequently arising permutations. These may be simulated using the method of [9] to obtain algorithms for the OTIS-Mesh. Table 1 gives the number of 4D mesh moves used by the optimal 4D mesh algorithms, a breakdown of the number of moves in the first two and last two dimensions, and the number of electronic and OTIS moves required by the simulation.
Assume that the $N^2$ OTIS-Mesh processors are indexed 0 through $N^2 - 1$ such that in the binary representation of a processor index the left half bits give the group number and the right half give the processor number local to a group. So, a processor index $I$ is of the form $I = GP$ where $I$, $G$ and $P$ are represented in binary and $G$ and $P$ have the same number of bits. $G$ and $P$ may be decomposed into halves to get $G = G_xG_y$ and $P = P_xP_y$ such that $G_x$ and $G_y$ give the group location by row and column in an array layout of groups (as in Figure 1) and $P_x$ and $P_y$ locate processor $P$ of a group by its row and column coordinates. The permutations of Table 1 are members of the BPC (bit permute complement) class of permutations defined in [7]. In a BPC permutation, the destination processor of each data item is given by a rearrangement of the bits in the source processor index. For the case of our $N^2$ processor 4D mesh, $p = 2\log_2 N$ and a BPC permutation is specified by a permutation vector $A = [A_{p-1}, A_{p-2}, \ldots, A_0]$. The destination for the data in any processor may be computed in the following manner. Let $m_{p-1}m_{p-2} \ldots m_0$ be the binary representation of the processor's index. Let $d_{p-1}d_{p-2} \ldots d_0$ be that of the destination processor's index.
Then,
$$d_{|A_i|} = \begin{cases} m_i & \text{if } A_i \ge 0, \\ 1 - m_i & \text{if } A_i < 0. \end{cases}$$
In this definition, $-0$ is to be regarded as $< 0$, while $+0$ is $\ge 0$.
The permutation vector $A$ for each of the permutations of Table 1 is provided after each subsection title.
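The destination rule can be written as a few lines of code. In this sketch each vector entry is stored as a pair (position, complemented) so that $-0$ and $+0$ can be told apart; that representation and the function name are choices made for the example, not notation from [7].

```python
def apply_bpc(index, vector):
    """Return the destination index for a BPC permutation.

    vector lists A_{p-1}, ..., A_0 from left to right. Each entry is a pair
    (position, complemented): position is |A_i| and complemented is True when
    A_i is negative (including -0).
    """
    p = len(vector)
    dest = 0
    for i in range(p):
        position, complemented = vector[p - 1 - i]   # this is A_i
        bit = (index >> i) & 1                       # this is m_i
        if complemented:
            bit ^= 1                                 # 1 - m_i
        dest |= bit << position                      # d_{|A_i|}
    return dest
```

For $p = 4$, the vector `[(1, False), (0, False), (3, False), (2, False)]` swaps the two halves of the index, so index 9 (binary 1001) is sent to 6 (binary 0110).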
Transpose $[p/2 - 1, \ldots, 0, p - 1, \ldots, p/2]$
The transpose operation may be accomplished via a single OTIS move and zero electronic moves.
Perfect Shuffle
Let $b_{G(i)}$ and $b_{P(i)}$, respectively, denote the bits in position $G(i)$ and $P(i)$ of $G$ and $P$. So $b_{G(p/2-1)}$ and $b_{P(p/2-1)}$ are the most significant bits of $G$ and $P$ while $b_{G(0)}$ and $b_{P(0)}$ are the least. Let $G = b_{G(p/2-1)}G'$ and $P = b_{P(p/2-1)}P'$. A perfect shuffle may be performed as below:
Step 1: Perform a local perfect shuffle in each group.
Step 2: This step involves processors in groups $G$ such that $b_{G(p/2-1)} = 0$ only. In these groups, odd processors exchange data with corresponding even processors (note that the processors exchanging data differ only in bit zero).
Step 3: Perform an OTIS move on all processors.
Step 4: Perform a local shuffle in each group.
Step 5: This step involves only processors in even groups. In these groups, odd processors exchange their data with the corresponding even processors.
Step 6: Perform an OTIS move on all processors.
Step 7: Same as Step 5.
Steps 1 and 4 perform perfect shuffles in $\sqrt{N} \times \sqrt{N}$ meshes. Each of these can be done optimally in $2\sqrt{N}$ electronic moves using the algorithm of [7]. Steps 2, 5, and 7 require exchanging data between mesh neighbors. Each exchange moves data in opposite directions on the same link and takes two electronic moves. Steps 3 and 6 take one OTIS move each. So, the total number of moves is $4\sqrt{N} + 6$ electronic and 2 OTIS.
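The seven steps can be followed at the level of index bits. The sketch below is illustrative only: it takes a perfect shuffle to be a one position left rotation of the $p$-bit index $GP$ and takes Step 1 to be a local perfect shuffle (as the move count for Steps 1 and 4 indicates). It tracks where each data item sits and ignores how the local shuffles are routed on the mesh.

```python
def perfect_shuffle_otis(p):
    """Return {source index: final index} after the seven step shuffle on p-bit indices."""
    h = p // 2                     # bits in G and in P
    mask = (1 << h) - 1

    def local_shuffle(x):
        # Rotate an h-bit local index left by one position.
        return ((x << 1) & mask) | (x >> (h - 1))

    result = {}
    for src in range(1 << p):
        g, q = src >> h, src & mask
        q = local_shuffle(q)               # Step 1: local perfect shuffle
        if (g >> (h - 1)) == 0:            # Step 2: groups whose top bit is 0
            q ^= 1                         #         odd/even processors exchange
        g, q = q, g                        # Step 3: OTIS move
        q = local_shuffle(q)               # Step 4: local perfect shuffle
        if g & 1 == 0:                     # Step 5: even groups only
            q ^= 1
        g, q = q, g                        # Step 6: OTIS move
        if g & 1 == 0:                     # Step 7: same as Step 5
            q ^= 1
        result[src] = (g << h) | q
    return result
```

For $p = 4$, the item that starts at index 1001 (group 10, processor 01) finishes at 0011, which is the left rotation of its source index.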
Unshuffle $[p - 2, p - 3, \ldots, 0, p - 1]$
This is the inverse of a perfect shuffle and may be done by running the seven step shuffle algorithm backwards (i.e., beginning with Step 7) and replacing the local shuffles of Steps 1 and 4 by local unshuffles. The number of data moves is the same as for a perfect shuffle.
Bit Reversal $[0, 1, \ldots, p - 1]$
A bit reversal can be done using one OTIS and $8(\sqrt{N} - 1)$ electronic moves as below.
Step 1: Do a local bit reversal in each group.
Step 2: Perform an OTIS move of all data.
Step 3: Do a local bit reversal in each group.
Steps 1 and 3 are done optimally in $4(\sqrt{N} - 1)$ electronic moves each using the optimal 2D mesh bit reversal algorithm of [7].
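A quick way to see why the three steps give a bit reversal is to track one index through them. This is a small sketch with illustrative function names; it models positions only, not the mesh routing inside a group.

```python
def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (x & 1)
        x >>= 1
    return out

def otis_bit_reversal(src, p):
    """Follow a p-bit index GP through the three step bit reversal."""
    h = p // 2
    g, q = src >> h, src & ((1 << h) - 1)
    q = bit_reverse(q, h)      # Step 1: local bit reversal
    g, q = q, g                # Step 2: OTIS move
    q = bit_reverse(q, h)      # Step 3: local bit reversal
    return (g << h) | q
```

The result equals `bit_reverse(src, p)` for every source index, since reversing $GP$ gives the reversal of $P$ followed by the reversal of $G$.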
Vector Reversal $[-(p - 1), -(p - 2), \ldots, -0]$
A vector reversal can be done using $8(\sqrt{N} - 1)$ electronic and two OTIS moves. The steps are:
Step 1: Perform a local vector reversal in each group.
Step 2: Do an OTIS move of all data.
Step 3: Perform a local vector reversal in each group.
Step 4: Do an OTIS move of all data.
Note that Step 1 moves data from $GP$ to $G\bar{P}$ (where $\bar{P}$ is the complement of $P$). Step 2 moves this data from $G\bar{P}$ to $\bar{P}G$. Next, Step 3 sends that data to $\bar{P}\bar{G}$ and finally Step 4 sends it to $\bar{G}\bar{P}$, completing the vector reversal. The number of data moves is easily obtained by noting that the optimal way to perform the local vector reversals takes $4(\sqrt{N} - 1)$ electronic moves [7].
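The four steps of the vector reversal can be traced the same way. In this sketch (function name chosen for illustration) a local vector reversal is modeled as complementing the local index, since position $P$ of an $N$ element group maps to $N - 1 - P$.

```python
def otis_vector_reversal(src, p):
    """Follow a p-bit index GP through the four step vector reversal."""
    h = p // 2
    mask = (1 << h) - 1
    g, q = src >> h, src & mask
    q ^= mask                  # Step 1: local vector reversal (complement the local index)
    g, q = q, g                # Step 2: OTIS move
    q ^= mask                  # Step 3: local vector reversal
    g, q = q, g                # Step 4: OTIS move
    return (g << h) | q
```

Each index ends at its bitwise complement, that is, at position $N^2 - 1 - I$.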
Bit Shuffle
Our algorithm to perform this permutation employs a $G_yP_x$ Swap permutation in which data from processor $G_xG_yP_xP_y$ is routed to processor $G_xP_xG_yP_y$. So, let us first see how to perform this permutation.
$G_yP_x$ Swap $[p-1, \ldots, 3p/4, p/2-1, \ldots, p/4, 3p/4-1, \ldots, p/2, p/4-1, \ldots, 0]$
We have two algorithms for this. The first uses $4(\sqrt{N} - 1)$ electronic and $\log_2 N$ OTIS moves. The second uses $6(\sqrt{N} - 1)$ electronic and 2 OTIS moves. While the second algorithm uses a larger number of moves, it is to be preferred when the cost of an OTIS move is considerably larger than that of an electronic move. Details of the first algorithm are provided below. The second is given in [8].
The first algorithm performs a series of bit exchange permutations of the form $B(i) = [B_{p-1}, \ldots, B_0]$, $0 \le i < p/4$, where
$$B_j = \begin{cases} p/2 + i & \text{if } j = p/4 + i, \\ p/4 + i & \text{if } j = p/2 + i, \\ j & \text{otherwise.} \end{cases}$$
The permutation $B(i)$ may be realized as below:
Step 1
Step 2: Perform an OTIS move.
Step 3: The data moved in Step 1 is routed from their current processors to corresponding processors that differ only in bit $i$. This requires data moves left and right along rows of $\sqrt{N} \times \sqrt{N}$ meshes. The distance is $2^i$ in each direction.
Step 4: Perform an OTIS move.
The total number of moves is $2^{i+2}$ electronic and two OTIS.
To perform a $G_yP_x$ Swap permutation, we simply perform $B(i)$ permutations for $0 \le i < p/4$. This takes $p/2 = \log_2 N$ OTIS moves and $\sum_{i=0}^{p/4-1} 2^{i+2} = 4(2^{p/4} - 1) = 4(\sqrt{N} - 1)$ electronic moves.
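The swap can be simulated on index bits. The sketch below rests on an assumption about Step 1, whose text is not available here: data whose bit $i$ of $G_y$ and bit $i$ of $P_x$ differ is first moved inside its group to the processor that differs in bit $i$ of $P_x$. Under that reading, Steps 1 through 4 exchange the two bits. The function names are illustrative.

```python
def bit_exchange(g, q, i, quarter):
    """One B(i): exchange bit i of Gy (bit i of g) with bit i of Px (bit quarter + i of q)."""
    differ = ((g >> i) & 1) != ((q >> (quarter + i)) & 1)
    if differ:
        q ^= 1 << (quarter + i)    # Step 1 (assumed): move within the group
    g, q = q, g                    # Step 2: OTIS move
    if differ:
        q ^= 1 << i                # Step 3: move to the processor differing in bit i
    g, q = q, g                    # Step 4: OTIS move
    return g, q

def gypx_swap(src, p):
    """Route index GxGyPxPy to GxPxGyPy using B(0), ..., B(p/4 - 1)."""
    h, quarter = p // 2, p // 4
    g, q = src >> h, src & ((1 << h) - 1)
    for i in range(quarter):
        g, q = bit_exchange(g, q, i, quarter)
    return (g << h) | q

def swap_move_counts(p):
    """(electronic, OTIS) totals: 2^(i+2) electronic and 2 OTIS moves per B(i)."""
    quarter = p // 4
    return sum(2 ** (i + 2) for i in range(quarter)), 2 * quarter
```

For $p = 8$ ($N = 16$), `swap_move_counts(8)` gives 12 electronic and 4 OTIS moves, matching $4(\sqrt{N} - 1)$ and $\log_2 N$.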
Bit Shuffle
A bit shuffle may be performed following these steps:
Step 1: Perform a $G_yP_x$ Swap.
Step 2: Do a local bit shuffle in each group.
Step 3: Do an OTIS move.
Step 4: Do a local bit shuffle in each group.
Step 5: Do an OTIS move.
Using the $4(\sqrt{N} - 1)$ electronic and $\log_2 N$ OTIS move algorithm for the $G_yP_x$ Swap and the optimal mesh bit shuffle algorithm of [7], the number of OTIS moves becomes $\log_2 N + 2$; the electronic move count is the $4(\sqrt{N} - 1)$ of the swap plus the cost of the two local bit shuffles.
Shuffled Row-major
This is the inverse of a bit shuffle and may be done in the same number of moves by running the bit shuffle algorithm backwards. Of course, Steps 2 and 4 are to be changed to shuffled row-major operations.
BPC Permutations
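Every BPC permutation, $A$, may be realized by a sequence of bit exchange permutations of the form $B(i, j) = [B_{p-1}, \ldots, B_0]$, $p/2 \le i < p$, $0 \le j < p/2$, each of which exchanges bit $i$ (a bit of $G$) with bit $j$ (a bit of $P$), and a BPC permutation $C = \Pi_G\Pi_P$ in which $\Pi_G$ and $\Pi_P$ rearrange bits within $G$ and within $P$, respectively. The transpose permutation may be realized by the sequence $B(p/2 + j, j)$, $0 \le j < p/2$; bit reversal is equivalent to the sequence $B(p - 1 - j, j)$, $0 \le j < p/2$; vector reversal can be realized by performing no bit exchanges and using a $C$ that complements every bit; perfect shuffle may be decomposed into $B(p/2, 0)$ and a suitable $C$. Each bit exchange $B(i, j)$ can be done with 2 OTIS moves following a process similar to that used for $B(i)$ in Section 4.6.1.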
Our algorithm for general BPC permutations is:
Step 1: Decompose the BPC permutation $A$ into the bit exchange permutations $B_1(i_1, j_1), B_2(i_2, j_2), \ldots, B_k(i_k, j_k)$ and the BPC permutation $C = \Pi_G\Pi_P$ as above. Do this such that $i_1 > i_2 > \cdots > i_k$ and $j_1 > j_2 > \cdots > j_k$.
Step 2: If $k = 0$, do the following:
Step 2.1: Do the BPC permutation $\Pi_P$ in each group using the optimal algorithm of [7].
Step 2.2: Do an OTIS move.
Step 2.3: Do the BPC permutation $\Pi_G$ in each group using the algorithm of [7].
Step 2.4: Do an OTIS move.
Step 3: If $k = p/2$, do the following:
Step 3.1: Do the BPC permutation $\Pi_G$ in each group.
Step 3.2: Do an OTIS move.
Step 3.3: Do the BPC permutation $\Pi_P$ in each group.
Step 4: If k < p/4, do the following:
Step 4.1: Perform the bit exchange permutations $B_1, \ldots, B_k$.
Step 4.2: Do Steps 2.1 through 2.4.
Step 5: If $k \ge p/4$, do the following:
Step 5.1: Perform a sequence of $p/2 - k$ bit exchanges involving bits other than those in $B_1, \ldots, B_k$ in the same orderly fashion described in Step 1. Recompute $\Pi_G$ and $\Pi_P$. Swap $\Pi_G$ and $\Pi_P$.
Step 5.2: Do Steps 3.1 through 3.3.
The local BPC permutations determined by $\Pi_G$ and $\Pi_P$ take at most $4(\sqrt{N} - 1)$ electronic moves each [7]; the bit exchanges cumulatively take at most $4(\sqrt{N} - 1)$ electronic and $\log_2 N$ OTIS moves. So, the total number of moves is at most $12(\sqrt{N} - 1)$ electronic and $\log_2 N + 2$ OTIS.
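The two cases that need no bit exchanges (Steps 2 and 3) can be sketched as follows. The local permutations are passed in as hypothetical Python callables `pi_g` and `pi_p` acting on $p/2$-bit local indices, and Step 3.1 is read as applying $\Pi_G$ to the local processor index; both are assumptions made for this illustration.

```python
def route_no_exchange(src, p, pi_g, pi_p):
    """Step 2 (k = 0): GP -> pi_g(G) pi_p(P) with two OTIS moves."""
    h = p // 2
    g, q = src >> h, src & ((1 << h) - 1)
    q = pi_p(q)        # Step 2.1: local permutation on P
    g, q = q, g        # Step 2.2: OTIS move
    q = pi_g(q)        # Step 2.3: local permutation on the old group index
    g, q = q, g        # Step 2.4: OTIS move
    return (g << h) | q

def route_full_exchange(src, p, pi_g, pi_p):
    """Step 3 (k = p/2): GP -> pi_g(P) pi_p(G) with one OTIS move."""
    h = p // 2
    g, q = src >> h, src & ((1 << h) - 1)
    q = pi_g(q)        # Step 3.1: local permutation that forms the new group index
    g, q = q, g        # Step 3.2: OTIS move
    q = pi_p(q)        # Step 3.3: local permutation that forms the new processor index
    return (g << h) | q
```

With both callables set to local complement, `route_no_exchange` reproduces vector reversal; with both set to the identity, `route_full_exchange` reproduces transpose; with both set to local bit reversal, it reproduces bit reversal.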
