Transposing an N × N array that is distributed row-or column-wise across P = N processors is a fundamental communication task that requires time-consuming interprocessor communication. It is the underlying communication task for the fast Fourier transform of long sequences and multi-dimensional arrays. It is also the key communication task for certain weather and climate models. A parallel transposition algorithm is presented for hypercube and mesh connected multicomputers with programmable networks. The optimal scheduling of network transmissions is not unique and known to be non-trivial. Here, scheduling is determined by a single de Bruijn sequence of N bits. The elements in each processor are first preordered and then, in groups of log 2 P adjacent elements, either transmitted or not transmitted, depending on the corresponding bit in the de Bruijn sequence. The algorithm is optimal both in overall time and the time that any individual element is in the network. The results are extended to other communication tasks including shuffles, bit reversal, index reversal, and general index-digit permutation. The case P ≠ N and rectangular arrays with nonpower-of-two dimensions are also discussed. Algorithms for mesh connected multicomputers are developed by embedding the hypercube in the mesh. The optimal implementation of the algorithms requires certain architectural features that are not currently available in the marketplace.
Introduction.
The spectral transform method for weather and climate models includes multiple fast Fourier transforms (FFTs) in the longitudinal direction and multiple Legendre transforms in the latitudinal direction [18] . Consider the following two methods for parallel computation. First, a fixed or static distribution could be selected. For example the arrays could be distributed in longitude but not latitude. The Legendre transforms could then be performed in-processor without interprocessor communication. However, the Fourier transforms would then require a distributed or parallel FFT algorithm [16] . If the arrays are distributed in latitude a parallel Legendre transform is required. Alternately, the arrays could be dynamically redistributed (transposed) between the computational modules, and both the Legendre and Fourier transforms could be performed in-processor. This is called the transpose method or 1-D decomposition method [3] because the arrays are distributed in only one dimension.
Dynamic rather than static distribution of arrays may be preferred even though it requires the global redistribution of the array. First, following the development of efficient transpose software, it is relatively straightforward to port an existing uniprocessor code to a multicomputer. Second, the computations can be performed in-processor using sequential algorithms that usually require less computation than their parallel counterparts. Third, the communication requirements of the transposition are less than the communication requirements of the distributed algorithm. For example, assume an N ×N array is distributed columnwise on P = N processors but we wish to Fourier transform the rows. If we transpose the array then each processor contains a complete row and the FFT can be performed using efficient uniprocessor software. It will be shown that the communication requirements for this approach are O (N ) for the transposition. However, if the rows are transformed in-place with elements of the row in different processors then the communication requirements for each row are O (logN ) and therefore O (N logN ) for N rows. Where possible, it is preferable to move data into a processor with a communication requirement proportional to the amount of data rather than the computational complexity of the distributed algorithm.
The transpose method is applicable to a number of fundamental problems. For example, the transposition is used to compute the 2-D FFT of an array that is distributed column-or row-wise [16] . It can also be used to compute the 1-D FFT of a long sequence. Let a sequence with length N 2 be uniformly distributed over N = P
processors. An ordered-to-ordered FFT can be performed with three transpositions and an unordered-to-unordered FFT can be performed with a single transposition [16] . This algorithm requires O (N ) communication which is less than O (N logN ) computation. Therefore communication is negligible for sufficiently large N. Transposition is also used to implement the alternating direction implicit (ADI) method for solving elliptic partial differential equations, e.g. [7] . Section 3 contains additional common communication tasks that are related to the transposition. Transposition is included in the class of communication tasks called complete exchange [14] , total exchange [21] , and all-to-all personal communication (AAPC) [9] .
Perhaps the earliest relevant work was published prior to the advent of contemporary multicomputers. Fraser [6] presented an efficient FFT for arrays distributed on external storage and introduced index-digit permutations which has subsequently been used in the development of parallel algorithms for a broad class of communication tasks including transpositions. Early parallel communication algorithms were developed using the single-port model in which only a single communication port is active at any given time. Nassimi and Sahni [12] developed optimal parallel algorithms for bit-permute-complement (BPC) communication tasks which include index-digit permutations. They assume a single element per processor and provide optimal algorithms for both hypercubes and mesh connected multicomputers. Flanders [5] developed BPC algorithms for mesh connected multicomputers and used them to develop parallel FFTs. Saad and Schultz [13] developed a transpose algorithm that runs in the time required to transmit log 2 P packets with N ⁄2 elements. Swarztrauber [16] developed an algorithm based on index-digit permutations that requires the same time. He also showed that any index-digit permutation requires no more time than the transmission of 1.5log 2 P packets with N ⁄2 elements. McBryan and Van de Velde [11] developed a single-port divide-and-conquer transpose algorithm.
postordered by inverting the preorder in (a). Preordering in (a) differs from processor to processor but the schedule of transmissions in part (b) is the same for all processors.
(a) Preordering
We must first determine the set of channels that each element must traverse to reach its destination processor. This set is given by the relative address of its origination processor p and destination processors The relative address determines the channels that an individual element must traverse but not the order in which the channels must be traversed to avoid contention and maximize the total network bandwidth. To this end we first reorder the relative addresses as shown in column 4 Table 2.1. Column five is the same reorder of column 1 so the bits in column 4 determine the channels that must be traversed by the corresponding elements in column 5. Column 4 is a de Bruijn ordering of the integers 0 through 7 that is characterized by the fact that each bit is equal to the one above and to its right (with wraparound from top to bottom). Therefore the bits are constant on any anti-diagonal through the fourth column. Preordering is determined in the same manner for the other processors using the same de Bruijn ordering. Therefore column 4 is the same for all processors. However, the orderings in column 5 differ from processor to processor because the relative addresses in column 3 are different.
By construction, all of the bits in column 4 are determined by the least significant bits on the right side of column four. From top to bottom this sequence of bits is 11101000. This is called a de Bruijn sequence which is defined as a sequence of N = 2 r bits such that any number between 0 and N −1 has a binary representation as r consecutive bits with wrap-around [10] . For example, the bits 11101000 form a de Bruijn sequence for N = 8. Proceeding from left to right (with wrap-around) three consecutive bits correspond to 1, 3, 7, 6, 5, 2, 4, and 0. It will be shown that a single de Bruijn sequence of N bits determines the the schedule for all processors. Note however that de Bruijn sequences, and hence the schedules, are not unique. 111 010 000
_ __________________________________________________
In column 4 an anti-diagonal with bits set to 1 specifies that the corresponding three consecutive elements are to be transmitted simultaneously on channels 0, 1 and 2 respectively. By inspection, a nonblocking schedule can now be given that utilizes all channels on all communication cycles and implements the transmissions specified by the 1 bits in column 4 of 6, 5 , and x 2,5 on channels 0, 1, and 2 respectively. Cycle 2: Transmit elements x 6, 5 (1) , x 2,5 (1) , and x 3,5 on channels 0, 1, and 2 respectively.
, and x 0,5 on channels 0, 1, and 2 respectively.
Cycle 4: Transmit elements x 0,5 (1) , x 7, 5 , and x 1,5 on channels 0, 1, and 2 respectively.
The contents of processor 5, following each communication cycle, are listed in Table  2 .2. Superscripts denote an element that was received on a previous cycle and stored in the location vacated by the element that was transmitted in the opposite direction on the same channel (channels are assumed to be bidirectional). These elements can be determined as elements from the original sequence from Table 2 .2. For example, on Cycle 2 the element x 6,5
(1) = x 4,7 which replaced x 6,5 , is transmitted on channel 0. The routing information in column 4 remains unchanged because the replacement elements have the same routing information. Therefore column 4 remains unchanged from cycle to cycle and from processor to processor. From column 3 or 4, at least 4 transmissions are required on the same channel and therefore the algorithm is optimal. It can be verified that each element in column 5 is transmitted on the channels specified by the 1 in column 4. In general, an N × N transposition can be performed in N ⁄ 2 cycles; that is, the array can be transposed in a time equal to the time required to transmit N ⁄2 elements on a single channel. 
_ ________________________________
Postordering is the inverse of de Bruijn preordering and performed in-processor. Now consider the algorithm for packet-based communication systems. It is assumed that the time required to transmit l elements in a single packet is γ l + β where beta is the latency and γ −1 is the channel bandwidth. From columns 4 and 5 in Table 2 .1 it is evident that x 0,5 could also be transmitted with x 4,5 on the first cycle. Indeed, on the first cycle, the packets (x 4,5 , x 0,5 ), (x 6,5 , x 7,5 ), and (x 2,5 , x 1,5 ) could be transmitted simultaneously on wires 0, 1, and 2 respectively. The addresses of elements in a given packet must differ by at least log 2 P ; otherwise the element with the larger address would appear in two different packets, which contradicts the requirement that elements must be transmitted sequentially on the channels prescribed by their relative address. The packet lengths vary from cycle to cycle because the pattern of zeros and ones in the de Bruijn sequence is irregular. Packet lengths for N = 1024 and the corresponding de Bruijn sequence in the appendix are 26, 67, 64, 52, 45, 27, 64, 68, 54, and 45. This differs from [9] where the packets, with the possible exception of the last, have the same length. Nevertheless the average packet length is N ⁄(2log 2 N ) and hence the total number of transmitted elements is the same as the element-based algorithm. Therefore the time required to transpose an N ×N array on P = N processors with a packet-based communication system is
which is known to be optimal. The contents of processor 10 (chosen at random) are listed in Table 2 .3 for a 16×16 transposition. The leftmost column contains the 10th column of the original array and the right most column, after 4 communication cycles, contains the 10th row. The Table is based on the de Bruijn sequence 1111011001010000 that generates packet lengths of 1, 3, 2, and 2.
_ ____________________________________ 
Related communications and mesh architectures.
In this section the transposition algorithms are extended to several related communication tasks. Column permutations and index reversal, or bit complementation, are discussed in subsection 3.1. The general index-digit permutation, including perfect shuffles and bit reversal, is discussed in subsection 3.2. Rectangular arrays and the case P ≠N are also discussed. Non-power-of-two dimensions are discussed in subsection 3.3 and the general communication problem is discussed briefly in subsection 3.4. All of the hypercube communication algorithms are extended to mesh connected architectures in subsection (3.5).
Column permutations and index reversal (bit complement).
Consider now the problem of reversing the order of a long sequence, i.e. implementing the program syntax y (i ) = x (N −i −1) for i = 0, . . . , N −1. Again we assume the sequence has N 2 elements that are distributed N = P elements per processor. If we construct a table like (2.1) for index reversal, all of the bits in every relative address are 1. Therefore de Bruijn ordering is not necessary. The resulting packet-based algorithm requires log 2 P packet transmissions each with N ⁄log 2 N elements which is double the length of the packets for the transpose algorithm.
Now consider a second problem; namely, permuting the columns of an N × N array that is distributed column-wise across N = P processors. The columns can be permuted by (a) transposing, (b) permuting in processor, and (c) transposing. This algorithm requires two transpositions and can also be used to solve the index reversal problem. First the order of the columns is reversed using the column permutation algorithm. Next, the elements in each processor are reversed. Although communication is the same as the algorithm in the previous paragraph, the ordering overhead is greater. Nevertheless algorithm (a), (b), and (c) may be convenient and competitive if efficient transposition software is available. All bits are complemented in the index reversal discussed in this subsection. Individual bits can be complemented by combining the results in this and the next subsections.
3.2 Index-digit permutations including rectangular arrays and the case P ≠ N .
Index-digit permutations include matrix transposition, shuffles and the bit reversed orderings that are used, for example, in the FFT. Index-digit permutations have been used prior to the advent of parallel processing to design efficient algorithms for ordering large files on external storage devices [6] . Later they were used to design parallel algorithms such as the FFT on mesh [5] and more recently hypercube [16] interconnected multicomputers. Consider now the implementation of the following sample index-digit permutation of a 256 element sequence on a 32 processor hypercube; namely, x (i 2 i 6 i 1 i 4 i 0 |i 7 i 5 i 3 ). The algorithm is almost identical to the transposition algorithm in section 2 except channels 0, 1, and 2 are replaced by channels 0, 2, and 4. The packet size is therefore N ⁄[2(r −d )]. In general, the minimum of d and r −d indices can be exchanged simultaneously across the partition. The number of transmissions remains unchanged but the packet size is reduced by the number of channels which, for r <2d , is less than the number available.
Arrays with other than power-of-two dimensions.
Here we assume that the prime decomposition of N contains a few small primes other than 2. This generalization is known to be of use; indeed, this was a fundamental generalization of the FFT. However, it is still assumed that the number of processors is P = 2 d . The results in this subsection will be illustrated by data structures and mappings that are encountered in weather and climate models. Initially we assume a one degree horizontal resolution and 32 vertical levels. 
The computations for a weather or climate model can be partitioned into three phases, each of which require only a single axis to be in the processor. However we first assume P = 32 processors and therefore each processor contains two axes or a complete horizontal level. Three processor mappings are given below. The first (3.1) corresponds to a mapping in which the levels are distributed. The third (3.3) corresponds to a mapping in which the vertical axis are in-processor. The second (3.2) is an intermediate mapping that does not require communication between processors and facilitates the conversion between (3.1) and (3.3).
The conversion between (3.1) and (3.2) is performed in-processor. If we define i1 = i 5 i 4 i 3 , i0 = i 2 i 1 i 0 , j1 = j 4 j 3 , and j0 = j 2 j 1 j 0 , (3.2) can be obtained from (3.1) using the FORTRAN index syntax y (i1,j1,i0,j0) = x (i1,i0,j1,j0) with dimensions y (8, 4, 45, 45 ) and x (8, 45, 4, 45) .
The conversion between (3.2) and (3.3) proceeds as described in section 2 except that the packet lengths are increased by the product 45 2 of primes other than 2. The transposition between the data structures (3.1) and (3.2) requires 5 transmissions of packets with average length 45 2 . 32⁄(2 . 5) = 6480 elements. The non-binary digits remain in-processor and are not exchanged across the partition. Increased resolution can be implemented by additional binary digits that can be exchanged across the partition. The key observation is that the non-binary digits remain in-processor.
The general one-to-one communication problem.
The transposition algorithms and extensions developed here provide communication for a large class of computations and it is therefore reasonable to speculate on the ultimate extent of this class. Consider a sequence with N 2 elements that contain the integers 0, . . . , N −1 each replicated exactly N times. Now randomly permute and distribute this sequence across P = N processors. Each processor contains N elements with random processor numbers although, as a linear sequence with N 2 elements, each processor number occurs N = P times. The goal is to find an efficient way to position each element in the processor designated by the contents of the element.
The following experiment was conducted. Every element in each processor is exchanged with the element located at the contents of the element. The array is then transposed and again the exchanges are performed. This process is repeated until all elements are in the processor that corresponds to their contents. This algorithm was applied to 2,000 random permutations of an array with over a million elements (N = 1024). A maximum of six transpositions was required to transmit all elements to their destination.
Mesh connected architectures.
In this subsection the hypercube algorithms are extended to the mesh connected architecture without distinguishing between a mesh and torus. All hypercube algorithms are extended to the mesh by embedding the hypercube in the mesh. The resulting performance is comparable to the hypercube if the bandwidth of the mesh exceeds the bandwidth of the hypercube by an amount that is typical in the current marketplace. The embedding is defined by logical hypercube channels that consist of one or more mesh channels. Such embeddings are not unique, as demonstrated by Stricker [15] who observed that the performance of hypercube-induced mesh algorithms for sorting were comparable to algorithms designed specifically for the mesh architecture. Flanders [5] describes single-port transposition algorithms designed specifically for the mesh architecture. Varvarigos and Bertsekas [21] present algorithms for both hypercube and wraparound mesh (torus) architectures.
The hypercube can be embedded in a 3-D mesh interconnect using a 3-D version of the two-dimensional embedding given in [2] . Consider first the mesh as a physical 3-D cube of processors with P 1⁄3 processors on an edge. A hypercube interconnect can be defined as follows. First separate the processors into two halves and connect the processors one to one across the two halves. Next, apply this algorithm to the two halves and then recursively to the resulting quarters, eights and so forth until only a single processor remains. ⁄2 times faster than the hypercube channels, the performance of the two computers is comparable. For P = 512 the mesh channels should be four times faster and for P = 4096 the mesh channels should be eight times faster. The latter factor currently exists in the marketplace where mesh channels are built significantly faster than hypercube channels.
Summary and comments.
Individual channel transmissions have been scheduled using de Bruijn sequences for the transposition of an N ×N array that is distributed row-or column-wise. On a P = N processor hypercube, with a packet-based communication system, the time required is γ N ⁄2 + βlog 2 N where β is packet latency and γ −1 is the bandwidth of a single channel.
The time required for an element-based system is τN ⁄ 2 where τ is the the time required to transmit a single element on a single channel. These times are known to be optimal [1, 4, 9, 14, 20] . The span or the maximum time that any element spends in the network is τlog 2 P which is also known to be optimal [9, 20] .
The central contribution of this paper is therefore a scheduling algorithm that uses a single de Bruijn sequence of N bits to schedule all processors. Once the elements in each processor are preordered then, for i = 0,...,N −1, log 2 N consecutive elements, beginning at location i , are either transmitted simultaneously (or not) depending on whether the i th bit in the de Bruijn sequence is 1 or 0. Several related communication tasks are also discussed including column permutations, index reversal, index-digit permutations, transposing rectangular arrays, and N with prime factors other than 2.
The results are also extended to mesh architectures and some comments are included regarding the general one-to-one communication problem.
All of the algorithms require an architecture with a programmable network and, for optimum performance, the ability to switch incoming transmissions back onto the network in a single cycle. On a hypercube this would be possible if each processor were equipped with a small 2log 2 P × 2log 2 P crossbar switch that is programmable and could be switched on each cycle. This switch interfaces the network and a local vector register. For and 8×8 transposition, the evolution of the contents of the local vector register in processor p = 5 in shown in Table 2 .2. The register fills at an average rate of two elements per cycle which is double the rate associated with traditional vector registers. This prompts the thought that perhaps multicomputers can be vectorized. This in turn leads to the vector parallel paradigm in which all arrays are uniformly distributed across the processors. Additional vectors and arrays that are generated at run time by index syntax are also distributed at the average rate of one element per cycle to all of the vector registers [17, 19] .
Traditional architectures are currently moving toward coarse grain computing with a modest number of relatively powerful processors. This has been in response to the high cost of interprocessor communication required by massive numbers of processors. Consequently, the once hoped for advantage of computers with low priced commodity components has not yet been realized. However the optimal algorithms presented here and elsewhere, together with architectural features that permit optimal implementation, suggest the viability of the original concept of providing performance via massive numbers of processors. This becomes particularly interesting when we compare optimal communication rates of one or two elements per cycle with existing technologies requiring hundreds or even thousands of cycles to transmit elements, in parallel, to their destination processors. These observations have led to the development of a recently patented architecture [20] . Historically the programmer has had two fundamental tools to speed computations on a single processor; namely, algorithms and hardware that supported their optimal implementation. The algorithms presented here and by others, together with the hardware features in [20] , provide the same tools for minimizing multicomputer communication.
APPENDIX
The de Bruijn sequences given below were computed in O (N ) time using algorithm Ecycle in [10] . De Bruijn sequences are not unique. 
