All-to-all personalized communication is a class of permutations in which each processor sends a unique message to every other processor. We present optimal algorithms for concurrent communication on all channels in Boolean cube networks, both for the case of a single permutation and for the case where multiple permutations are to be performed on the same local data set, but on different sets of processors. For K elements per processor our algorithms give the optimal number of element transfers, K/2. For a succession of all-to-all personalized communications on disjoint subcubes of β dimensions each, our best algorithm yields K/2 + σ − β element exchanges in sequence, where σ is the total number of processor dimensions in the permutation.
Introduction
Ching-Tien Ho, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, ho@ibm.com
We give simple, yet optimal, schedules for all-to-all personalized communication on Boolean cubes with concurrent communication on all channels of every processor. An example of an architecture that allows for such communication is the Connection Machine. The schedules avoid indirect addressing in the data exchanges by performing a local alignment of the data in each processor prior to, and after, the data interchanges between processors. In addition to optimal utilization of communication channels we are also concerned with the duration of the orbits of the elements, i.e., the time from the first motion of an element until it reaches its destination. The orbit length is important for pipelining several all-to-all personalized communications (AAPC). With K elements per processor and AAPC in β-cubes our first algorithm requires K/2 + β − (K/2 mod β) element transfers in sequence for a single AAPC, except if K/2 mod β = 0. Then, the number of element transfers is optimal, K/2. The orbit length for all pairs of elements is β. One or the other element in a pair is always exchanged. Our second algorithm has the minimal number of element exchanges for any K and β, but the maximum orbit length is β + (K/2 mod β).
Our third algorithm illustrates how pipelining can be combined with exchange sequences starting in arbitrary dimensions to yield an optimal number of element exchanges in sequence. However, the orbit length is 4. Our last algorithm requires K/2 element transfers in sequence for any β, and has a maximum orbit length
All-to-all personalized communication is a frequently used class of permutations in multi-processor systems. Examples of all-to-all personalized communication are bit-reversal, vector reversal, matrix transposition, shuffle permutations, and conversion between cyclic and consecutive mapping [5] for allocations of the arrays such that a number of storage dimensions are exchanged with the same number of processor dimensions. All-to-all personalized communication can also be combined with code conversion, such as conversion between binary code and binary-reflected Gray code [8, 6], which is often used for encoding of arrays. An example of the bit-reversal permutation is (a₅a₄a₃|a₂a₁a₀) → (a₀a₁a₂|a₃a₄a₅), and an example of the matrix transposition is (a₅a₄a₃|a₂a₁a₀) → (a₂a₁a₀|a₅a₄a₃).
Both are examples of AAPC.
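As a small illustration of why both permutations are AAPC, the sketch below applies them to 6-bit addresses whose high three bits are taken as the processor address and whose low three bits are the local address. The function names and the 3 + 3 split are assumptions made for this example only. It checks that every processor sends exactly one element to each processor.

```python
def bit_reverse(addr, nbits):
    """Reverse the order of the nbits low-order bits of addr."""
    out = 0
    for _ in range(nbits):
        out = (out << 1) | (addr & 1)
        addr >>= 1
    return out


def swap_halves(addr, nbits):
    """Exchange the processor (high) and local (low) halves of an address."""
    half = nbits // 2
    mask = (1 << half) - 1
    return ((addr & mask) << half) | (addr >> half)


def destinations(perm, nbits):
    """For each source processor, the sorted list of destination processors."""
    half = nbits // 2
    size = 1 << half
    table = []
    for proc in range(size):
        # Apply the permutation to every local element and keep the processor field.
        dests = [perm((proc << half) | local, nbits) >> half for local in range(size)]
        table.append(sorted(dests))
    return table
```

Calling `destinations(bit_reverse, 6)` or `destinations(swap_halves, 6)` yields, for every source processor, the full list of eight destination processors, which is the defining property of AAPC on a 3-cube.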
Let |S| = β ≤ min(k, n). If β < k then the permutation is repeated 2^(k−β) times. For instance, if there are eight bits for the local memory, k = 8, and three are part of the permutation, β = 3, then the permutation is repeated 2^(8−3) = 32 times. Similarly, if there are processor dimensions not included in the permutation, then the permutation consists of a number of permutations in disjoint subcubes. Each such subcube is identified by the address bits not included in the permutation.
The relative address of a local memory address i in processor j with respect to its destination is j ⊕ i, where ⊕ denotes the bit-wise exclusive-or. A homogeneous scheduling has communication schedules for each node that depend only on the relative addresses. This implies that node i sends data to node i ⊕ j in the same dimension and step as node 0 sends data to node j.
The message path from node i to node i ⊕ j is a translation with respect to node i (i.e., exclusive-or) of the path from node 0 to node j. We only consider homogeneous schedules. For such schedules it is sufficient to consider schedules for node 0.
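The translation property can be checked with a few lines of code. The sketch below is illustrative only: the helper names are hypothetical, and the route from node 0 (crossing the set bits of j in ascending order) is just one possible choice, not a schedule taken from the paper.

```python
def path_nodes(source, dims):
    """Nodes visited when starting at source and crossing the cube dimensions in dims."""
    nodes = [source]
    for d in dims:
        nodes.append(nodes[-1] ^ (1 << d))
    return nodes


def dims_for(j):
    """One possible route from node 0 to node j: cross the set bits of j in ascending order."""
    return [d for d in range(j.bit_length()) if (j >> d) & 1]


i, j = 0b110, 0b101
base = path_nodes(0, dims_for(j))        # path from node 0 to node j
translated = path_nodes(i, dims_for(j))  # same dimensions, starting at node i
```

Here `translated` equals every node of `base` combined with i by exclusive-or, and its last node is i ⊕ j, so a schedule for node 0 determines the schedule for every other node.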
Definition 2
The span for message j, span(j), is the number of exchanges from the step during which the message starts its path through the cube until it arrives at its destination in a single AAPC. Formally, if the message leaves the source during step t₁ and arrives at the destination during step t₂ > t₁, then span(j) = t₂ − t₁ + 1. The span of the communication for a single AAPC is the maximum span for any message in the communication, i.e., span = max_j span(j).
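A minimal sketch of the definition, assuming a schedule is represented as a dictionary from message index to the list of (1-based) steps in which that message moves. This representation and the sample values are hypothetical and serve only to make the arithmetic concrete.

```python
def span(steps):
    """Span of one message: last step minus first step, plus one."""
    return max(steps) - min(steps) + 1


def communication_span(schedule):
    """Span of the whole communication: the maximum span over all messages."""
    return max(span(steps) for steps in schedule.values())


# Hypothetical schedule: message 5 moves in steps 2 and 4, so it is in orbit for 3 steps.
example = {1: [1], 2: [2], 3: [1, 2], 5: [2, 4], 7: [1, 2, 3]}
```

A message that is idle in an intermediate step still counts that step towards its span, which is why message 5 above has span 3 although it moves only twice.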
In the following, we consider AAPC of the form (j|i) → (i|j). Based on this, algorithms for the general
Routing for the permutation (j|i) → (i|j)
The permutation (j|i) → (i|j) is equivalent to the transposition of a matrix stored with one row per processor to the storage of one column per processor. The permutation can also be viewed as changing the allocation of a one-dimensional array from consecutive storage to cyclic storage [5].
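The second interpretation can be made concrete with a short sketch. It assumes 2^β processors with 2^β elements each and a global element index g; the function names are hypothetical.

```python
def consecutive_location(g, beta):
    """(processor, local address) of element g when each processor holds a contiguous block."""
    block = 1 << beta
    return g // block, g % block


def cyclic_location(g, beta):
    """(processor, local address) of element g when elements are dealt out round-robin."""
    procs = 1 << beta
    return g % procs, g // procs
```

For every g, the cyclic location is the consecutive location with its two fields interchanged, which is exactly the exchange of the processor field and the local field described above.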
The routing algorithms we present use only direct addressing in the data exchanges between processors. All processors access the same (local) address during the same step. To accomplish this property a local alignment precedes the exchanges, and a realignment follows them. Depending on the relative times for the alignment and realignment phases and the exchanges with direct and indirect addressing, avoiding indirect addressing in the exchanges may be desirable. On the Connection Machine avoiding indirect addressing in the exchange phase results in a significant speed-up.
Phase 1: Alignment. Sort the local data by relative address, i.e., perform the operation (j|a) → (j|a ⊕ j), where a is the local memory address and j is the processor address.
Phase 2: Interprocessor exchange. The interprocessor exchange phase implements (j|a ⊕ j) → (a|a ⊕ j).
Phase 3: Realignment. The realignment to restore the local memory order is identical to phase 1.
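The three phases can be simulated in a few lines. The sketch below is a sequential model under stated assumptions: 2^β processors with 2^β elements each, memory held as nested Python lists, and the cube dimensions processed one after another rather than concurrently. It shows only that alignment, exchange with direct addressing, and realignment together produce the transposition; it does not model the channel-optimal schedules.

```python
def aapc_three_phase(data, beta):
    """Simulate the exchange of processor and local address on a beta-cube."""
    size = 1 << beta
    # Phase 1: alignment. The element at local address a of processor j
    # is moved to its relative address a ^ j.
    mem = [[data[j][r ^ j] for r in range(size)] for j in range(size)]
    # Phase 2: exchange. In dimension d, every processor swaps with its
    # neighbor exactly those locations whose address has bit d set, so all
    # processors touch the same local addresses (direct addressing).
    for d in range(beta):
        bit = 1 << d
        for j in range(size):
            nbr = j ^ bit
            if j < nbr:  # handle each processor pair once
                for r in range(size):
                    if r & bit:
                        mem[j][r], mem[nbr][r] = mem[nbr][r], mem[j][r]
    # Phase 3: realignment, the same local operation as phase 1.
    return [[mem[j][a ^ j] for a in range(size)] for j in range(size)]
```

With `data[j][a] = (j, a)`, the result holds `(a, j)` at local address a of processor j, i.e., the matrix with one row per processor has been transposed.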
The alignment is performed on the storage dimensions involved in the exchange. The treatment of other storage dimensions is arbitrary.
Pairing
A memory location is subject to an exchange operation in the processor dimensions that correspond to nonzero bits in its (relative) address. Hence, local memory location zero retains its content throughout phase 2, whereas memory location (11…1) exchanges content with neighboring nodes in every dimension. Similarly, memory location (00…01) only exchanges its content with the neighboring node in the least significant processor dimension, while local memory location (11…10) exchanges content with all neighboring nodes except the neighbor in the least significant processor dimension.
… does not fully use all dimensions. This property causes the algorithm to be slightly non-optimal with respect to channel utilization. However, the schedule is easily modified to an optimal schedule by having one subphase include β + (K/2 mod β) pairs. Consider the case β = 3. There are a total of eight memory locations, or four pairs. By scheduling the four pairs as shown in Table 1, the optimal number of element transfers is achieved. The scheme is easily generalized to arbitrary β. We refer to this algorithm as Algorithm 2.
There are many schedules that yield an optimal utilization of communication channels. For instance, the schedule in Table 2, Algorithm 3, is also optimal with respect to channel utilization. However, the last β − 1 pairs are in orbit for 2β − 1 exchanges for an AAPC on β bits. The schedule in [2] has a similar property.
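The pairing of complementary addresses and the transfer count quoted for the first algorithm can be sketched as follows. The description of that algorithm's subphases is not fully preserved in this section, so the rotation used inside each block of β pairs is an assumption, chosen because it reproduces the stated count K/2 + β − (K/2 mod β); all function names are hypothetical.

```python
def complement_pairs(beta):
    """Pair every beta-bit relative address with its bit-complement."""
    mask = (1 << beta) - 1
    return [(a, a ^ mask) for a in range(1 << (beta - 1))]


def mover(pair, dim):
    """The member of a complement pair that is exchanged in dimension dim."""
    a, b = pair
    return a if (a >> dim) & 1 else b


def rotated_schedule(num_pairs, beta):
    """steps[t] maps each busy dimension to the index of the pair using it in step t."""
    steps = []
    for start in range(0, num_pairs, beta):
        block = range(start, min(start + beta, num_pairs))
        for t in range(beta):
            # Assumed rotation: the pairs of a block use distinct dimensions in each step.
            steps.append({(p - start + t) % beta: p for p in block})
    return steps


def stated_transfers(num_pairs, beta):
    """Transfer count quoted in the text for the first algorithm (num_pairs = K/2)."""
    rem = num_pairs % beta
    return num_pairs if rem == 0 else num_pairs + beta - rem
```

For β = 3 and four pairs the sketch needs six steps, because the second block contains a single pair and leaves two of the three dimensions idle in each of its steps. This is the under-utilization that the modified schedule removes.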
The long orbits are a disadvantage when several AAPCs on different processor address bits are to be performed.
… (0001) and (0011) are non-cyclic. The address with the smallest value in the necklace is a distinguished address. A q-necklace is a necklace in which each address has q bits equal to 1.
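The necklace terminology can be illustrated with a short sketch. The definitions of necklace and cyclic address are not fully preserved here, so the code assumes the usual reading: a necklace is the set of addresses obtained by rotating the bits of an address, and an address is cyclic when its necklace has fewer members than the address has bits. The function names are hypothetical.

```python
def rotate_left(addr, beta, r=1):
    """Rotate the beta-bit address addr left by r positions."""
    r %= beta
    mask = (1 << beta) - 1
    return ((addr << r) | (addr >> (beta - r))) & mask


def necklace(addr, beta):
    """Distinct addresses obtained by rotating addr, in order of rotation."""
    members = []
    cur = addr
    while cur not in members:
        members.append(cur)
        cur = rotate_left(cur, beta)
    return members


def is_cyclic(addr, beta):
    """Assumed definition: the necklace has fewer than beta members."""
    return len(necklace(addr, beta)) < beta


def distinguished(addr, beta):
    """The smallest address in the necklace of addr."""
    return min(necklace(addr, beta))
```

Under this reading, (0001) and (0011) are non-cyclic on four bits, while (0101) is cyclic because rotating it by two positions reproduces it. The distinguished address of the necklace containing (1100) is (0011).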
Non-cyclic addresses
The exchanges for the three local memory locations (001), (010), and (100) obviously can be done concurrently. Similarly, locations (011), (110), and (101) can be scheduled concurrently with respect to the least significant bit of (011), the second least significant bit of (110), and the most significant bit of (101). The communication for the remaining bit that is one in each of these addresses can also be scheduled concurrently.
… has period one, and must be matched with other locations in order to achieve maximum utilization of the communication channels. For every cyclic address the address formed through bit-complementation is also cyclic. Hence, pairs of cyclic addresses can be formed through bit-complementation (as in Algorithms 1 through 3). If the number of such pairs is p, …
Complement pairs of cyclic addresses share a row. The table entries specify the exchange dimension for the addresses. One or the other address in a complement pair must be exchanged in every dimension. Every row contains all dimensions. Which element in a complement pair is exchanged depends upon which address in the pair has the address bit set to one for the considered dimension.
… respectively. The value of the pth line entry defines the dimension in which the content of the pth address is exchanged during the time step given by the column in which the pth entry occurs. The number of times an address is scheduled is equal to the number of lines, i.e., q. It is easy to see that the first time an address is scheduled, it is scheduled in the dimension that corresponds to dimension zero in the distinguished address. For each address the next higher dimension, modulo the number of dimensions, is scheduled the next time the address is scheduled for an exchange.
… (11110), (11101), (11011) …
Proof: The q-necklace defined by addresses with q consecutive one bits in cyclic order is non-cyclic for q < β.
Deriving the schedule from the table guarantees that all dimensions are scheduled every time step, precisely one address is scheduled for a dimension every time step, each complement pair is scheduled in every dimension, and every address in the full necklace is scheduled in each of the q dimensions with a bit set, and in no other dimension.
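For a full necklace of non-cyclic addresses taken on its own, the same guarantees can be obtained by a direct construction, sketched below. This is not the table-based derivation of the proof, whose tables are not reproduced in this section. It is an assumed simplification in which, at the step associated with the one-bit at position b of the distinguished address, the address obtained by rotating r positions to the left is exchanged in dimension (b + r) mod β. The function names are hypothetical.

```python
def rotate_left(addr, beta, r):
    """Rotate the beta-bit address addr left by r positions."""
    r %= beta
    mask = (1 << beta) - 1
    return ((addr << r) | (addr >> (beta - r))) & mask


def necklace_schedule(distinguished, beta):
    """steps[t] maps each dimension to the address exchanged in it during step t.

    Assumes distinguished belongs to a non-cyclic necklace, i.e., its beta
    rotations are all distinct.
    """
    ones = [d for d in range(beta) if (distinguished >> d) & 1]
    steps = []
    for b in ones:
        # Rotating by r moves the one-bit at position b to position (b + r) mod beta.
        steps.append({(b + r) % beta: rotate_left(distinguished, beta, r)
                      for r in range(beta)})
    return steps
```

For the distinguished address (011) on three bits, the first step exchanges (011) in dimension 0, (110) in dimension 1, and (101) in dimension 2, matching the example given earlier for these locations. The second step handles the remaining one-bit of each address. Every step uses all β dimensions, and a q-necklace is completed in q steps.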
In summary, memory addresses are first partitioned into cyclic and non-cyclic addresses. The number of cyclic addresses is two if β is prime, otherwise it is of order O(*) [4]. Cyclic addresses are paired through bit-complementation and divided into blocks of pairs. The remaining β − q pairs are scheduled together with β non-cyclic addresses forming a q-necklace. All non-cyclic addresses can be scheduled individually, except the addresses scheduled with the cyclic addresses. Non-cyclic addresses can also be scheduled as complement pairs, except if there are no complement pairs forming distinct full necklaces. There are at most two such sets, one set forming a β/2-necklace, if β is even, and one set of addresses forming the complement of the addresses scheduled with cyclic addresses. For the case where the succession of AAPCs is to be performed on non-overlapping subcubes of different order, refer to [7].
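The statement that only two addresses are cyclic when the number of dimensions is prime can be checked by enumeration. The sketch below assumes that an address is cyclic when its rotation period is smaller than the number of bits (called `beta` in the code); the function names are hypothetical.

```python
def period(addr, beta):
    """Smallest number of single-bit left rotations that maps addr onto itself."""
    mask = (1 << beta) - 1
    cur = addr
    for p in range(1, beta + 1):
        cur = ((cur << 1) | (cur >> (beta - 1))) & mask
        if cur == addr:
            return p


def count_cyclic(beta):
    """Number of beta-bit addresses whose rotation period is smaller than beta."""
    return sum(1 for a in range(1 << beta) if period(a, beta) < beta)
```

For β = 3, 5, and 7 only the all-zeros and all-ones addresses are counted. For composite β the count is larger, for example 4 for β = 4 and 10 for β = 6.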
Summary and Discussion
We have presented four schedules for a single all-to-all personalized communication, three of which are simple to implement. One algorithm (Algorithm 4) requires the optimal number of element exchanges in sequence, K/2, with a span of β. The other three algorithms have either minimal maximum orbit lengths, or maximum channel utilization, but not both. Algorithm 1 has been implemented on the Connection Machine model CM-2. The exchanges require 40 μsec per element compared to 62 μsec for the schedule in [2]. The total expense for alignment and realignment is about 0.9 μsec per element (four bytes). Hence, the simplified algorithms presented here may yield a speed-up of up to 50% for a single AAPC, and a considerably reduced pipeline delay for multiple AAPC. It should be noted that for a single AAPC on a β-cube, Algorithm 4 can be easily blocked to β communication steps with the optimal number of element transfers preserved [9]. The blocking procedure can also be used to generate optimal schedules for channel widths greater than the width of the data item [9].
Acknowledgment
The Connection Machine implementation of Algorithm 1 was performed by Michel Jacquemin of Yale University, Department of Computer Science, in a collaborative effort between Thinking Machines Corporation and INRIA, Centre de Sophia-Antipolis. Valuable comments on a draft were given by Roland Sweet of the University of Colorado at Denver.
