Abstract. In a generalized shuffle permutation an address $(a_{q-1} a_{q-2} \ldots a_0)$ receives its content from an address obtained through a cyclic shift on a subset of the $q$ dimensions used for the encoding of the addresses. Bit-complementation may be combined with the shift. We give an algorithm that requires $\frac{K}{2} + 2$ exchanges for $K$ elements per processor, when storage dimensions are part of the permutation, and concurrent communication on all ports of every processor is possible. The number of element exchanges in sequence is independent of the number of processor dimensions $\sigma_r$ in the permutation. With no storage dimensions in the permutation our best algorithm requires $(\sigma_r + 1)\lceil \frac{K}{2\sigma_r} \rceil$ element exchanges. We also give an algorithm for $\sigma_r = 2$, or for a real shuffle consisting of a number of cycles of length two, that requires $\frac{K}{2} + 1$ element exchanges in sequence when there is no bit complementation. The lower bound is $\frac{K}{2}$ for both real and mixed shuffles with no bit complementation. The minimum number of communication start-ups is $\sigma_r$ for both cases, which is also the lower bound. The data transfer time for communication restricted to one port per processor is $\sigma_r \frac{K}{2}$, and the minimum number of start-ups is $\sigma_r$. The analysis is verified by experimental results on the Intel iPSC/1, and for one case also on the Connection Machine.
Introduction
The main contributions of this paper are optimal algorithms for dimension permutations on Boolean cube configured distributed memory multi-processors, and lower bounds for such permutations with concurrent communication on all channels. Communication systems (packet or circuit switched) only allowing communication on one channel at a time per processor are treated briefly. The Connection Machine is an example of a computer allowing concurrent communication on all channels. Some computers allow concurrent communication on several, but not all channels of every processor. The techniques for concurrent communication on all channels may be adapted to such architectures.
A dimension permutation is defined by permuting and/or complementing the bits of the logic address field. There are $M(\log_2 M)!$ possible dimension permutations, where $M$ is the number of elements in the address field. We consider stable permutations [2], i.e., permutations for which the data occupy the same machine address space before and after the data rearrangement. The machine address space consists of two parts: a processor address field and a local storage address field. Real dimension permutations are restricted to the processor address field, whereas mixed dimension permutations include parts of both the local storage and processor address fields. Virtual dimension permutations that only include local storage addresses require no communication, and are not considered here.
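The count of dimension permutations can be checked by brute force for a small address field. The following is a minimal Python sketch, not code from the paper: the names `dimension_permutation`, `perm`, and `mask` are illustrative, and the convention that bit `i` moves to position `perm[i]` before complementation is an assumption of the sketch.

```python
from itertools import permutations
from math import factorial

def dimension_permutation(a, perm, mask):
    """Move bit i of address a to position perm[i], then complement the bits set in mask.

    perm and mask are hypothetical parameter names used only in this sketch.
    """
    out = 0
    for i, target in enumerate(perm):
        out |= ((a >> i) & 1) << target
    return out ^ mask

def count_dimension_permutations(m):
    """Count the distinct address maps on M = 2**m addresses by enumeration."""
    M = 1 << m
    maps = set()
    for perm in permutations(range(m)):
        for mask in range(M):
            maps.add(tuple(dimension_permutation(a, perm, mask) for a in range(M)))
    return len(maps)

if __name__ == "__main__":
    m = 3
    # Compare the enumerated count with M * (log2 M)!
    print(count_dimension_permutations(m), (1 << m) * factorial(m))
```

Every map produced this way is a bijection on the address field, and distinct (`perm`, `mask`) choices give distinct maps, which is why the enumeration agrees with the closed-form count.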
The reason for distinguishing between processor and local storage dimensions is that access to local storage is usually considerably faster than communication between processors. The width of the data paths to local memory is often equal to the word width of the architecture (32 or 64 bits), while the width of network channels is typically in the range 1-16 bits. Furthermore, the contention for network channels is often a more serious issue than the contention for local memory, and the techniques for reducing network contention are quite different from the techniques for handling contention for local memory.
The difference in the width of the data paths to local memory and inter-processor communication channels, and the different protocols used for inter-processor communication and local memory references, often allow many local memory references to be performed in the time required for one inter-processor communication. For instance, in a Connection Machine model CM-2 each processor has a single 32-bit wide data path to memory, while inter-processor communication channels are 1 bit wide. The width of the communication channels is determined by a trade-off between many demands for off-chip communication. Pins on chips are a highly critical resource in many technologies. A memory read of 32 bits is a one-cycle operation, while a store is slightly longer. An exchange of 32 bits between a pair of processors requires about 185 cycles. On a fully configured Connection Machine model CM-2 22 exchanges can be performed concurrently. In the n-port communication model local memory references on a Connection Machine account for at most about 25% of the total time.
... which is a shuffle repeated p times, or an unshuffle performed q times. In the consecutive allocation scheme successive elements are allocated to the same processor, whereas in cyclic allocation successive elements are assigned to successive processors in a wraparound fashion. From the illustration below, it is clear that conversion between the two is a dimension permutation, which can be defined as an n-shuffle, or n-unshuffle.
Nassimi and Sahni [10, 11] consider stable, real dimension permutations on multi-processors configured as meshes and hypercubes, while Flanders [1] focuses on stable, both real and mixed, dimension permutations on two-dimensional mesh-connected multi-processors. Swarztrauber [15] considers dimension permutations on Boolean cubes. Swarztrauber does not consider bit complementation, which is not required for the FFT, the main focus of [15]. In all of the previous work only one communication channel per processor is used in any step of the algorithms, or for lower bounds. We consider concurrent communication on all channels of all processors. Such communication is possible on the Connection Machine. The emphasis is on mixed, stable permutations. Real, stable permutations are treated briefly. Algorithms for real, stable dimension permutations with one-port communication can be found in [10, 11, 15].
The notation and definitions used throughout the paper are introduced in Section 2. In Section 3 we discuss lower bounds. Algorithms are described in Section 4, and results from implementations on the Intel iPSC/1 and the Connection Machine are described in Section 5. We conclude with a few remarks in Section 6.
Preliminaries
In one-port communication the communication is restricted to one exchange operation per processor. In n-port communication exchange operations can be performed concurrently on all ports of every processor. With the processors configured as a Boolean n-cube each processor has n channels (edges) to n distinct processors (nodes). With a binary encoding there is an adjacent processor for every bit of the address, or dimension. The number of node-disjoint minimum length paths between any pair of nodes at real distance d is d.

(00|00) (01|00) (10|00) (11|00)       (00|00) (01|00) (00|10) (01|10)
(00|01) (01|01) (10|01) (11|01)   ⇒   (00|01) (01|01) (00|11) (01|11)
(00|10) (01|10) (10|10) (11|10)       (10|00) (11|00) (10|10) (11|10)
(00|11) (01|11) (10|11) (11|11)       (10|01) (11|01) (10|11) (11|11)

Figure 1: Shuffle permutation of order two.

... {0, 1, ..., - 1}.
In a dimension permutation the content of address $a$ is assigned to the address obtained by applying the permutation to $a$. A GSH corresponds to a single cycle on the bits of the address field. We let the GSH correspond to a left cyclic shift on the address field. For example, the shuffle $(a_4 a_3 a_2 a_1 a_0) \leftarrow (a_3 a_2 a_1 a_0 a_4)$ has the index set $\{4, 3, 2, 1, 0\}$. For a 2-shuffle on the same dimensions $J = \{4, 2, 0, 3, 1\}$. In general, an SDP defines several cycles on the set $J$.
Definition 3. A sub-cube permutation (SCP) is an algorithm for dimension permutation in which the routing is confined to the set of processors to which the data set is allocated. An extended-cube permutation (ECP) is an algorithm for dimension permutation in which the routing is extended to an $n_e$-cube in which the $n$-cube holding data is embedded.
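The two index sets in the shuffle example can be checked with a few lines of Python. This is a sketch under an assumed convention, namely that bit position `J[k]` receives the bit previously held at position `J[k+1]`, cyclically; the function names `gsh` and `rotl` are illustrative and do not come from the paper.

```python
def gsh(a, J):
    """Apply a generalized shuffle with index set J (a cycle of bit positions) to address a.

    Assumed convention for this sketch: position J[k] receives the bit that was
    at position J[(k + 1) % len(J)]. Bits outside J are left unchanged.
    """
    out = a
    for k, pos in enumerate(J):
        src = J[(k + 1) % len(J)]
        bit = (a >> src) & 1
        out = (out & ~(1 << pos)) | (bit << pos)
    return out

def rotl(a, s, q):
    """Left cyclic shift of the q-bit value a by s positions."""
    s %= q
    return ((a << s) | (a >> (q - s))) & ((1 << q) - 1)

if __name__ == "__main__":
    # The shuffle on five bits is a left cyclic shift by one position,
    # and the 2-shuffle is a left cyclic shift by two positions.
    print(all(gsh(a, [4, 3, 2, 1, 0]) == rotl(a, 1, 5) for a in range(32)))
    print(all(gsh(a, [4, 2, 0, 3, 1]) == rotl(a, 2, 5) for a in range(32)))
```

Under this convention the order in which the elements of `J` are listed carries the cycle structure, which is why the 2-shuffle is written with the positions interleaved.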
Time complexity
The transmission time for each element is $t_c$, and the start-up time, or overhead, for each packet of $B$ elements is $\tau$. In a circuit-switched system $\tau$ corresponds to the time to set one switch in a path. In the time complexity $T(\mathit{ports}, \sigma_r, n, K)$ for a lower bound, or an algorithm, the first argument is the number of ports per processor used concurrently, the second argument the real order of the GSH, the third argument the number of processor dimensions used for data allocation, and the last argument the size of the local data set.
Lemma 1. The time complexity of an SDP of real order $\sigma_r < n$ cannot be improved by communication in the $n - \sigma_r$ processor dimensions not included in the index set, if a dimension permutation is required within all $\sigma_r$-cubes, and the SDP algorithm uses full bandwidth within each $\sigma_r$-cube.
Proof: We prove the lemma by contradiction. Let the lower bound for the SDP of real order $\sigma_r < n$ on an $n$-cube be $T = \frac{W}{L}$, where $W$ is the total bandwidth required and $L$ the available bandwidth per unit time. Now, map $2^{n-\sigma_r}$ nodes of the $n$-cube to a single node of a $\sigma_r$-cube by identifying all nodes with the same values of the address bits for dimensions in the set $J$ of the SDP. Furthermore, increase the communication bandwidth of each channel in the $\sigma_r$-cube by a factor of $2^{n-\sigma_r}$. Then, every algorithm for the SDP performed on the $n$-cube can be converted to an algorithm on the $\sigma_r$-cube with a running time $T''$ that is at most the same. Hence, $T'' = \frac{2^{n-\sigma_r} W}{2^{n-\sigma_r} L} = T$.
Lemma 2. Lower bounds for a sub-cube GSH of real order $\sigma_r > 0$ are $T^{lb}_{gsh}(1, \sigma_r, n, K) =$
Proof: The minimum number of start-ups, or switch settings in a path, is equal to the maximum number of channels that must be traversed, which is the maximum over $a \in A$ of the real Hamming distance $\mathrm{Hamming}_r$ between $a$ and its destination address.
The minimum data transfer time is bounded from below by the required bandwidth divided by the available bandwidth. By Lemma 1 it suffices to consider a $\sigma_r$-cube. For each dimension j ∈ J ∩ D U p such that j i, where i ∈ J ∩ D U p, only nodes for which $a_i \neq a_j$ need to send elements across dimension j. If instead i ∈ J ∩ D U s then all nodes send half of their data. Therefore, the bandwidth requirement for each $\sigma_r$-cube is $\sigma_r 2^{\sigma_r} \frac{K}{2}$. The available bandwidth per routing cycle is $2^{\sigma_r}$ for one-port and $\sigma_r 2^{\sigma_r}$ for n-port communication.
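Dividing the bandwidth requirement by the available bandwidth gives the data transfer bounds in units of element transfers. The short derivation below only combines the quantities stated in the proof, reading the real order as $\sigma_r$; the results agree with the one-port and n-port data transfer bounds quoted in the abstract.

```latex
\[
\frac{\sigma_r \, 2^{\sigma_r} \, \frac{K}{2}}{2^{\sigma_r}} = \sigma_r \frac{K}{2}
\quad \text{(one-port)},
\qquad
\frac{\sigma_r \, 2^{\sigma_r} \, \frac{K}{2}}{\sigma_r \, 2^{\sigma_r}} = \frac{K}{2}
\quad \text{(n-port)}.
\]
```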
The lower bounds are not tight when some dimensions in the permutation are complemented. For instance, consider the case ( 0 1 ) ( 1 0 ). In this case the permutation is a cyclic shift in a loop of length four. All non-minimum length paths are of length three. There is only one minimum length path between any pair of processors, and only one non-minimum length path as well. For a routing time of $\frac{K}{2}$ element transfers in sequence at most $\frac{K}{2}$ elements can be routed along minimum length paths. The total routing requirement for elements routed along non-minimum length paths is at least $4 \cdot 3 \cdot \frac{K}{2}$. With four available links a total routing time of at most $\frac{K}{2}$ is impossible.
Corollary 1.
Proof: For the lower bound data transfer time in Lemma 2, all channels are used evenly in every routing cycle, and all elements are routed through a shortest path. But, during the first and last routing cycles, at least half of the channels are not used for a real GSH. For one-port communication, the same argument applies to the first and last routing cycles for any cube dimension used by all processors.
A dimension permutation on a data set allocated to an $n$-dimensional subcube of an $n_e$-cube can be realized by subcube expansion from the $n$-cube to the $n_e$-cube, full cube permutation, and compression to the original $n$-cube. The permutation is performed with local data sets reduced by a factor of $2^{n_e - n}$. The subcube expansion (compression) is of type one-to-all (all-to-one) personalized communication [6].
Corollary 2. Lower bounds for a GSH of order extended from an $n$-cube to an $n_e$-cube are
max (
Algorithms
The essential ideas used to achieve concurrency in communication are: 1) pipelining, 2) starting the cyclic shifts in several dimensions, 3) factoring of cycles into several cycles. Independent concurrent exchange sequences are created by partitioning the local data set, and defining one sequence per partition. By properly choosing the sequences a uniform load on the communication system is achieved. Dimension permutations can also be performed as recursive matrix transpositions [4, 13, 14], or by performing all-to-all personalized communication twice [6, 14].
The main focus below is on algorithms for mixed, stable dimension permutations. We first consider the case with a single mixed GSH, then consider shuffle permutations that can be factored into several mixed GSH. We conclude by considering two algorithms for real GSH.
Mixed, generalized shuffle algorithms
A single GSH
A mixed GSH with an index set J consisting of a single block of processor dimensions followed by one virtual dimension can be performed in $\sigma_r$ exchanges by using 0 as the fixed exchange dimension, where the GSH is defined by ( r r ?1 : : : 1 j 0 ). If the index set J consists of one block of processor dimensions and several storage dimensions, then the GSH is factored into two cycles: one on the block of processor dimensions and the memory dimension immediately to the right of it, and one on all memory dimensions. For instance, a cyclic shift on the set ( ?1 ?2 : : : ? r j ? r ?1 ? r ?2 : : : 0 ) is factored into a cyclic shift on ( ?1 ?2 : : : ? r j ? r ?1 ) followed by a cyclic shift on ( ? r ?1 ? r ?2 : : : 0 ). The second GSH is virtual. For n-port communication we will first describe a pipelined algorithm in which all data items start their exchange sequence in the first processor dimension, then an algorithm for which some of the exchange sequences are initiated in other processor dimensions than the first.
Pipelining
The mixed GSH with a single local storage dimension and $K \geq 2$ elements per processor requires that $\frac{K}{2}$ elements be exchanged in each dimension. For n-port communication the data transfers can be pipelined. The number of element transfers in sequence is $\frac{K}{2} + \sigma_r - 1$. Figure 2 shows the scheduling of local memory locations paired with respect to the exchange dimension. The table entries indicate the processor dimension in which one element in a local pair is subject to exchange. Figure 3 shows an example of memory locations exchanging data in the case of eight processors. Two memory locations marked by the same symbol, for instance X0, are exchanged with each other. For more than two local memory locations the exchange pattern is repeated for all local pairs. The exchange algorithm for two local memory locations is as follows.

[Figure 3: Memory locations exchanging data in the case of eight processors, from the initial allocation through the exchange steps to the final allocation of elements 0 through 15.]

A local memory reordering before and after the exchange sequence allows the sending and receiving local addresses in an exchange to be the same. Indirect addressing can be avoided for all data interchanges. Half of the processors use the lower address in a local pair, half of the processors the upper address. The initial and final reordering, and the pairs of memory locations involved in the exchanges, are illustrated in Figure 4.
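The pipelined step count can be illustrated with a small scheduling sketch in Python. It is not the paper's exchange algorithm: it only assumes that each of the K/2 local pairs must pass through the processor dimensions in a fixed order, one dimension per step, and that each port carries one element exchange per step. The names `pipelined_schedule` and `is_conflict_free`, and the labelling of the dimensions as 0 through r - 1, are hypothetical.

```python
def pipelined_schedule(K, r):
    """Schedule local pair k (0 <= k < K // 2) in processor dimension d (0 <= d < r) at step k + d.

    Returns a dict mapping step number to a list of (pair, dimension) entries.
    """
    steps = {}
    for k in range(K // 2):
        for d in range(r):
            steps.setdefault(k + d, []).append((k, d))
    return steps

def is_conflict_free(steps):
    """Check that no port serves two pairs, and no pair uses two ports, in the same step."""
    for entries in steps.values():
        pairs = [k for k, _ in entries]
        dims = [d for _, d in entries]
        if len(set(pairs)) != len(pairs) or len(set(dims)) != len(dims):
            return False
    return True

if __name__ == "__main__":
    K, r = 8, 3
    schedule = pipelined_schedule(K, r)
    # The number of steps equals K/2 + r - 1 for this staggered schedule.
    print(len(schedule), K // 2 + r - 1, is_conflict_free(schedule))
```

Staggering the pairs by one step each is what fills the pipeline; the r - 1 extra steps are the filling time.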
The correctness follows from the following consideration.
Step 1: ... Step $\sigma_r + 2$: ... Note that precisely one local storage location in a pair defined by the exchange dimension is subject to an exchange in any step. Moreover, after the first exchange even and odd data are separated into two distinct subcubes identified by the first exchange dimension.
Pipelining with arbitrary starting dimension
With multiple local elements it is possible to initiate exchange sequences in different dimensions. All such sequences can be executed concurrently with n-port communication. ... element exchanges in sequence. The minimum number of start-ups is at least $\sigma_r$, and at most $\sigma_r + 3$.
For $K \leq 8$, or $\sigma_r \leq 3$, all elements should have their first exchange in the first processor dimension of the GSH. If $K > 8$, but less than $2(\sigma_r + 1)$, then some elements should start their exchange sequence in a dimension other than the first processor dimension in the GSH. The time complexity is determined by elements that start their exchange in a dimension other than 1 . The partitioning of the $K/\sigma_r$ space with respect to scheduling algorithms is illustrated in
Remarks:
The same local memory dimension can be used for different concurrent exchange sequences, but memory locations must be distinct. Four memory locations are needed per sequence, if the starting dimension is different from the first in the cycle. The alignment is made in the fixed exchange dimension, and is controlled by the parity of all processor dimensions in the shuffle, except the fixed exchange dimension. Hence, the alignment is made on the local memory dimension in the shuffle, if the exchange sequence starts in the first real dimension in the shuffle. Otherwise, the alignment is made on the extra memory dimension used for the exchanges. The alignment and exchanges are controlled entirely by the dimensions in the shuffle. The local storage is partitioned into two blocks with respect to exchange schedules: one for exchange sequences starting in a dimension other than the first real dimension, and consisting of $2(\sigma_r - 1)$ locations for $\sigma_r$ odd, or $2(\sigma_r - 2)$ locations for $\sigma_r$ even, and one block for exchange sequences starting in the first real dimension and consisting of the remainder of the local storage. The first block is aligned on dimension v, and the second on dimension 0 .
Multiple GSH
In general, the index set J for a GSH can be factored into a number of mixed GSH, each for a block of unique processor dimensions and one unique storage dimension, and one virtual GSH, as illustrated in Figure 7 . We refer to the mixed GSH's in the factored GSH as constituting GSH.
The number of such GSH is . All constituting, mixed GSH can be performed concurrently. For each location in local storage the exchange dimensions for different constituting GSH can be interleaved in any order. Only the order within each GSH is fixed. The techniques for scheduling of exchanges described for a single mixed GSH can be applied for each constituting GSH. The scheduling algorithms presented below are optimal within two element exchanges. The set of exchanges defined by one dimension exchange for all constituting GSH consists of a number of independent 2-cycles. This permutation is equivalent to matrix transposition, or bit-reversal. We refer to it as all-to-all personalized communication [6]. Each processor holds a unique piece of data for every other processor. Any algorithm for all-to-all personalized communication (AAPC) can be used repeatedly to accomplish the required permutation. If all constituting GSH are of the same order, then the use of any AAPC algorithm is straightforward.
In [8] AAPC algorithms are presented that allow for a pipeline delay of cycles between each new AAPC application for mixed GSH.
The difference between the AAPC based algorithms and the concurrent/pipelined algorithms is in the scheduling of exchanges for the local storage, or in the ordering of the loops. This difference results in a difference in the pipeline filling time, which is slightly longer for the AAPC based schedules. The alignment and realignment is the same for both types of algorithms.
Alignment and realignment
Let the constituting GSH be ( j
The lemma follows from the fact that the exchanges for each constituting GSH take place within subcubes for which all other processor address bits are the same. However, by performing the alignment for different GSH on the same storage dimension the local addresses are not the
Pairing of local memory locations
In the exchanges in any dimension only half of the local data is exchanged. For a single GSH one local memory bit (v or 0 ) is used to control the exchange, and either all the local data with the bit set or not set is exchanged. It is convenient to view storage locations in pairs, where one location or the other in a pair is exchanged. When more than one local storage dimension is used to control the exchanges, then the pairing of locations shall be made with respect to all such dimensions, i.e., the pairing shall be made on v, 0 0 , 1 0 , : : :, and ?1 0 . For locations with a first exchange for each constituting GSH in its first processor dimension v need not be included in the alignment and the pairing. The values of the local storage dimensions in a pair are complements of each other. For instance, for three control dimensions the pairs are (000, 111), (001, 110), (010, 101), and (011, 100). If there are additional local storage dimensions the pairing is simply repeated as many times as necessary.
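The complement pairing is easy to generate mechanically. The sketch below is illustrative only; the function name `complement_pairs` and the parameter `c` (the number of control dimensions) are not from the paper.

```python
def complement_pairs(c):
    """Pair each c-bit value with its bitwise complement, listing each pair once.

    c is the number of local storage dimensions used to control the exchanges
    (a name chosen for this sketch). Values are returned as bit strings.
    """
    mask = (1 << c) - 1
    return [
        (format(a, f"0{c}b"), format(a ^ mask, f"0{c}b"))
        for a in range(1 << (c - 1))
    ]

if __name__ == "__main__":
    for low, high in complement_pairs(3):
        print(low, high)
```

Enumerating only the values whose leading control bit is zero lists every pair exactly once, since the complement of such a value always has the leading bit set.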
Pipelined Algorithms
Concurrent pipelined single GSH algorithms
The algorithm for a single mixed GSH can be generalized to multiple GSH. With n-port communication and sufficiently many local data elements, the different constituting GSH can be initiated and performed concurrently. ... require that all constituting GSH are of the same order.
Theorem 2. Any generalized shuffle of real order $\sigma_r > 0$ can be realized in at most $T =$
Table 2: The number of element transfers in sequence for concurrent/pipelined algorithms for multiple mixed GSH.
Comparison of algorithms
The pipelined GSH algorithm is always preferable over the pipelined AAPC algorithm. Similarly, the strategy to combine concurrent exchange sequences with pipelining always yields a better result for the algorithms completing each constituting GSH for each pair of data elements before initiating another constituting GSH, than the same strategy applied to AAPC based algorithms. The pipeline filling time is longer for the latter algorithms. The complexity estimates are summarized in Table 2. For all constituting GSH of the same order we have included the effects of limited communication buffers and a start-up overhead for each communication action.
Real shu e algorithms
A mixed GSH has at least one local storage dimension as part of the permutation. In a real GSH no local storage dimension is part of the permutation. But, if there are at least two data elements per processor, then using an algorithm with the storage dimension as a fixed exchange dimension allows each exchange to involve only one processor dimension, as in the algorithms above for mixed GSH.

    000   001   010   011   100   101   110   111
0   0|000 0|001 0|010 0|011 0|100 0|101 0|110 0|111
1   1|000 1|001 1|010 1|011 1|100 1|101 1|110 1|111
                        ⇓
0   0|000 1|000 0|010 1|010 0|100 1|100 0|110 1|110
1   0|001 1|001 0|011 1|011 0|101 1|101 0|111 1|111
                        ⇓
0   0|000 1|000 0|001 1|001 0|100 1|100 0|101 1|101
1   0|010 1|010 0|011 1|011 0|110 1|110 0|111 1|111
                        ⇓
0   0|000 1|000 0|001 1|001 0|010 1|010 0|011 1|011
1   0|100 1|100 0|101 1|101 0|110 1|110 0|111 1|111
                        ⇓
0   0|000 0|100 0|001 0|101 0|010 0|110 0|011 0|111
1   1|000 1|100 1|001 1|101 1|010 1|110 1|011 1|111

Figure 10: A shuffle permutation of order $\sigma_r$ using $\sigma_r + 1$ communications.
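The exchange sequence of Figure 10 can be reproduced with a short simulation. The sketch below is a reading of the figure rather than the paper's own code: it assumes that each step swaps the storage bit with one processor bit, using the processor dimensions in the order 0, 1, ..., r - 1, 0, and that addresses are laid out with the storage bit above the r processor bits. The names `swap_bits`, `real_shuffle_with_storage_bit`, and `rotl` are illustrative.

```python
def swap_bits(x, i, j):
    """Return x with bits i and j exchanged."""
    if ((x >> i) ^ (x >> j)) & 1:
        x ^= (1 << i) | (1 << j)
    return x

def real_shuffle_with_storage_bit(r):
    """Track every element through r + 1 exchanges on the fixed storage dimension.

    Assumed layout for this sketch: bit r is the storage dimension and
    bits r-1..0 are the processor dimensions.
    Returns ({source address: final address}, number of exchanges).
    """
    storage = r
    schedule = list(range(r)) + [0]
    position = {a: a for a in range(1 << (r + 1))}
    for d in schedule:
        position = {src: swap_bits(cur, storage, d) for src, cur in position.items()}
    return position, len(schedule)

def rotl(p, r):
    """Left cyclic shift of the r-bit processor address p by one position."""
    return ((p << 1) | (p >> (r - 1))) & ((1 << r) - 1)

if __name__ == "__main__":
    r = 3
    final, exchanges = real_shuffle_with_storage_bit(r)
    mask = (1 << r) - 1
    # Each element keeps its storage bit and lands on the left-shifted processor address.
    print(exchanges, all(final[a] == (a & ~mask) | rotl(a & mask, r) for a in final))
```

In this simulation the last exchange repeats processor dimension 0 to restore the storage bit, which accounts for the one exchange beyond the order of the shuffle.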
... element exchanges in sequence. The pipelined algorithm always requires a larger number of element transfers in sequence, and more communication start-ups, than the algorithm based on concurrent exchange sequences. The algorithm outlined above using an extra local memory dimension for all exchanges is non-optimal by one exchange. For architectures with a high start-up time relative to the data transfer time, it may be desirable to find an algorithm with the optimal number of start-ups. In algorithm A1 half of the data is at its final destination after $\sigma_r$ exchanges. If the initial content in location zero in processors four through seven, and in location one in processors zero through three did not matter, then the permutation would be completed in $\sigma_r$ steps. In general, at the expense of doubling the memory requirements, and the data transfer time, the minimal number of communication start-ups can be achieved. The first step of the modified algorithm can be accomplished through the exchange
This is algorithm A2.
The case with $\sigma_r = 2$ represents a special case for algorithm A2. Each channel is only used in one direction. Splitting the data set $K$ into two parts allows both minimum length paths to be used. For two equal parts the complexity is $(\frac{K}{2} + B) t_c + (\lceil \frac{K}{2B} \rceil + 1)\tau$. This is algorithm $A2_2$.
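The trade-off between packet size and start-ups in this estimate can be explored numerically. The sketch below reads the complexity as (K/2 + B) t_c + (ceil(K/(2B)) + 1) tau, with tau the start-up time per packet; that reading of the formula, the function names `a2_2_time` and `best_packet_size`, and the restriction of B to 1..K/2 are assumptions of the sketch.

```python
import math

def a2_2_time(K, B, t_c, tau):
    """Evaluate (K/2 + B) * t_c + (ceil(K / (2B)) + 1) * tau for packet size B."""
    return (K / 2 + B) * t_c + (math.ceil(K / (2 * B)) + 1) * tau

def best_packet_size(K, t_c, tau):
    """Scan packet sizes 1..K/2 and return (B, time) with the smallest estimated time."""
    candidates = ((B, a2_2_time(K, B, t_c, tau)) for B in range(1, K // 2 + 1))
    return min(candidates, key=lambda bt: bt[1])

if __name__ == "__main__":
    # With a negligible start-up time the scan prefers the smallest packet size.
    print(best_packet_size(64, 1.0, 0.0))
    # With a large start-up time it prefers the largest packet size.
    print(best_packet_size(64, 1.0, 1000.0))
```

With tau set to zero the estimate reduces to K/2 + B element transfers, so the smallest packet size gives one transfer more than K/2.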
With an insignificant communication start-up, such as on the Connection Machine, the optimal value of B is one, and only one element exchange in excess of the lower bound is required. The result generalizes to the case where the real shuffle consists of a number of independent cycles
Experiments
We have implemented the one-port version of algorithm A1 for real GSH on the Intel iPSC/1, and algorithm $A2_2$ on the Connection Machine (in the context of a bit-reversal routine). On the Intel iPSC/1 we also performed shuffle permutations by using the routing logic, by using all-to-all personalized communication twice, and by an algorithm that requires precisely $\sigma_r$ start-ups for a real shuffle of order $\sigma_r$ (algorithm A1 in [5]). Algorithm A2 presented above should behave similarly for small values of K, and be superior for large values of K, but was not known at the time of the implementation. For a real, sub-cube, shuffle permutation with a message size less than a few hundred bytes, algorithm A2 is the fastest, but for larger message sizes algorithm A1 is preferable (Figure 11). The best time of either algorithm A1 or A2 is 5-10 times less than that of the router. All one-port algorithms have a complexity that is linear in the number of dimensions for shuffle permutations with a real order equal to the number of cube dimensions. The deviation from the linear dependence exhibited in Figure 11 is due to a hybrid implementation that optimizes the sum of start-up time and the time for local data movement [4]. For the Connection Machine implementation of algorithm $A2_2$ the measured time complexity is 270 + (3.2 + 3.9 )K sec, where is the number of 2-cycles in the real GSH.
Summary and conclusions
The algorithms for mixed shuffle permutations are optimal within two element exchanges for concurrent communication on all channels of every processor. The number of communication start-ups for an unlimited buffer size is either exactly optimal, or requires three excess start-ups depending upon the size of the local data set relative to the number of processor dimensions in the generalized shuffle, and the number of blocks of contiguous processor dimensions. The communication complexity for n-port communication is summarized in Table 2. For one-port communication the minimum number of start-ups is $\sigma_r$, and the minimum number of element transfers in sequence is $\sigma_r \frac{K}{2}$. The mixed shuffle algorithms are also valid when bit-complementation is required by appropriately modifying which data in a pair is exchanged.
We also show that by performing a local alignment before and after the data exchanges for mixed GSH, all data exchanged have the same relative address.
The communication complexities of a real GSH of order $\sigma_r$ are summarized in Table 3. The second and last columns contain the values of the data transfer times and start-up times relative to the respective lower bound. With an unbounded buffer size the number of start-ups is either exactly optimal, or suboptimal by one start-up. The number of element transfers in sequence exceeds the lower bound by a factor of $\frac{1}{\sigma_r}$ for the best algorithm, except if $\sigma_r = 2$, or there are a number of 2-cycles. Then, only one element transfer in excess of the lower bound is required. This result is not valid if bit-complementation is required.
Performing a GSH through all-to-all personalized communication [6, 14] is always inferior to the pipelined/concurrent algorithms presented here. Likewise, performing the permutation by recursively applying an optimal matrix transposition algorithm [4] yields higher complexity.
Finally we note that the control is identical for all processors, and can easily be distributed.
