We describe an implementation of the Cooley Tukey complex-to-complex FFT on the Connection Machine. The implementation is designed to make e ective use of the communications bandwidth of the architecture, its memory bandwidth, and storage with precomputed twiddle factors. The peak data motion rate that is achieved for the interprocessor communication stages is in excess of 7 Gbytes/s for a Connection Machine system CM-200 with 2048 oating-point processors. The peak rate of FFT computations local to a processor is 12.9 G ops/s in 32-bit precision, and 10.7 G ops/s in 64-bit precision. The same FFT routine is used to perform both one-and multi-dimensional FFT without any explicit data rearrangement. The peak performance for a one-dimensional FFT on data distributed over all processors is 5.4 G ops/s in 32-bit precision and 3.2 G ops/s in 64-bit precision. The peak performance for square, two-dimensional transforms, is 3.1 G ops/s in 32-bit precision, and for cubic, three dimensional transforms, the peak is 2.0 G ops/s in 64-bit precision. Certain oblong shapes yield better performance. The number of twiddle factors stored in each processor is P 2N + log 2 N for an FFT on P complex points uniformly distributed among N processors. To achieve this level of storage e ciency we show that a decimation-in-time FFT is required for normal order input, and a decimation-in-frequency FFT is required for bit-reversed input order.
The Discrete Fourier Transform (DFT) is de ned by X(l) = P 1 X j=0 ! lj P x(j); 8l 2 0; P 1]; ! P = e 2 i P : and the Inverse Discrete Fourier Transform (IDFT) is de ned bỹ x(j) = 1 P P 1 X l=0 ! lj P X(l); 8j 2 0; P 1]; ! P = e 2 i P :
The coe cients w lj P are known as twiddle factors. We consider both decimation-in-time, DIT, and decimation-in-frequency, DIF, versions of the Cooley-Tukey FFT 1] . The decimationin-time FFT is de ned by the splitting formula The data interactions for the two splitting formulas, also known as butter ies, are shown in Figures 1 and 2 . The splitting formulas above de ne radix-2 FFTs. Higher radix FFTs such as radix-4 and radix-8 FFT are de ned similarly, see e.g . 13] . Details of a Connection Machine implementation are reported in 12]. Both the DIF and the DIT FFT reorder the data from normal to bit reversed order (or the converse). The most important di erence between the DIF and DIT FFTs with respect to distributed memory systems is that the DIF and DIT FFTs use the twiddle factors in opposite order. This di erence causes a di erence in the demand for twiddle factor storage by a factor of log 2 N for an N processor system, as discussed in section 4. Index binary x (15) 1111 x (14) 1110 x(13) 1101 x(12) 1100 x(11) 1011 x(10) 1010
x(9) 1001 x(8) 1000 x(7) 0111 x(6) 0110 x(5) 0101 x(4) 0100 x(3) 0011 x(2) 0010 x(1) 0001 x(0) 0000
Index binary 1111 X (15) 0111 X (7) 1011 X(11) 0011 X(3) 1101 X (13) 0101 X(5) 1001 X(9) 0001 X(1) 1110 X (14) 0110 X (6) 1010 X(10) 0010 X (2) 1100 X (12) 0100 X(4) 1000 X (8) 0000 X(0) Figure 1 : Decimation-in-time FFT.
The Connection Machine systems CM-2 and CM-200 have up to 2048 oating-point processors that support operations in both 32-bit and 64-bit precision. The memory is distributed among the processing units, with a maximum of 4 Mbytes of memory per unit, and a total memory of 8 Gbytes. Each processing unit has a single 32-bit wide data path to its memory. Data paths internal to the oating-point unit are 64-bits wide. The processing units are interconnected as an 11-dimensional Boolean cube, with two channels between every pair of nodes. Data may be sent on all 22 (11 2) channels concurrently.
The data layout has a signi cant impact on performance for most distributed memory architectures. In order to maximize the potential for parallelism data arrays are distributed evenly over the processor memories. Two common schemes for assigning data to processors are the consecutive (block) allocation, and cyclic (scatter) allocation 6, 3] . The two types of data allocation are illustrated in Figure 3 for a 32 element array distributed among 8 memory units. All Connection Machine compilers use the consecutive allocation scheme. However, for the FFT a cyclic data allocation would yield a lower communication time in some cases 11]. For multi-dimensional data arrays, the ability to con gure the Connection Machine processors as a multi-dimensional array with axes lengths being arbitrary powers of two can be exploited for enhanced performance. The processor con guration is controlled by compiler directives. The focus of the algorithm development presented here is an unordered FFT. The presented algorithms fully utilize the communication system. Reordering for ordered transforms is performed explicitly. Reordering algorithms that fully use communications systems allowing concurrent communication on multiple channels are presented in, for example, 7, 8, 9] . An implementation on the Connection Machine is presented in 2]. For reference we include performance data for some ordered transforms. Interleaving the reordering with the FFT computation o ers no advantage on a system allowing concurrent communication on all channels, unlike the case with communication restricted to one channel at a time 15, 16] . We also restrict the algorithmic considerations to data allocated to processors using the standard binary encoding, which is advantageous for the Cooley-Tukey FFT. For multiprocessors con gured as Boolean cube networks, a binary-re ected Gray code 14] is often used. In such an encoding successive array elements are allocated either in local memory, or in an adjacent processor. This type of data allocation is advantageous for algorithms requiring nearest neighbor communication, such as for instance relaxation algorithms. Algorithms that compute a Cooley-Tukey FFT on such a data allocation without loss of e ciency are presented in 10]. The outline of this paper is as follows. We rst review the mapping of the radix-2 FFT to Boolean cube networks. Particular emphasis is placed on e ective use of the communication system 11]. We then brie y review the issues in e cient use of the data path between a oating-point unit and its memory, followed by a detailed account of twiddle factor computation and storage in the distributed memory system. We focus on the consecutive data allocation, since all Connection Machine compilers use this allocation scheme. Whenever a cyclic data allocation would result in signi cantly di erent conclusions we will state the di erence. A few details of the implementation are given in Section 6, and performance data are reported in Section 7. Potential improvements of the current implementation are reviewed in Section 8.
Radix-FFTs on Boolean Cubes
For the radix-2 Cooley-Tukey FFT the data interaction in stage q, 0 q < p, with stages labeled from input to output, is between data with indices i and i 2 p 1 q . denotes the bit-wise exclusive-or function. For example, in Figures 1 and 2 a 16-point transform is computed with the data interaction in the rst stage being between data elements that di er by 8 in their indices. In the second stage the interaction is between elements that di er by 4 in their indices, etc. In a Boolean cube network nodes can be labeled such that there is one adjacent node for every bit in the node address, i.e., for every node with address i in a Boolean cube of n dimensions and N = 2 n nodes, there are n neighbors with addresses i 2 j , 0 j < n. The data interaction for the radix-2 Cooley-Tukey FFT matches well with the connectivity of the Boolean cube. For a precise assessment of the communication requirements we consider in some detail the consecutive allocation of data to memory units. In the data index space the allocation can be represented as follows, where x i denotes one bit in the encoding of the indices. The total number of bits required for the encoding of the data indices is p for a data set of size P = 2 p . Bits assigned to the processor address eld are labeled rp, and bits in the encoding of indices assigned to the local memory address eld are labeled vp. With consecutive data allocation the n most signi cant bits identify the processor, and the least signi cant p n bits identify the local memory addresses. Thus, there are P N elements per processor. For instance, a data array of 1024 points require 10 bits for its binary encoding. With a consecutive mapping to 128 processors, there are 8 data points per processor. Seven bits are required for the encoding of processor addresses, and three for the local addresses. With the consecutive mapping elements 0 though 7 are mapped to the rst processor, elements 8 through 15 to the second processor, etc. The three least signi cant bits of the data index encoding are used for local memory addresses. For the consecutive data allocation the rst n steps require inter-processor communication, while the last p n steps are local to a processor. If the data is allocated in bit-reversed order, then the most signi cant bit in the data index corresponds to address bit 0, and the least signi cant index bit to address bit p 1. The order of the inter-processor communication and the local reference phases are reversed. With the data in normal input order, the rst stage in the Cooley-Tukey FFT requires communication in the most signi cant cube dimension. The following n 1 stages require communication in successively lower dimensions. By exchanging all P N local data elements between pairs of processors the computations can be split between a pair of processors; one computes the \top" of a butter y, one the \bottom". A direct implementation of the above scheme yields poor utilization of the communication system. In each of the rst n stages only one cube dimension per stage is used. The communications e ciency is 1 n . By pipelining the FFT stages the e ciency can be improved to almost 100%. Table 1 illustrates the idea of pipelining FFT stages. Columns represent processors, and rows local memory addresses. The rst ve memory locations are shown. The data allocation is assumed consecutive, and the entries in the table show the cube dimensions in which data are exchanged during the rst four time steps. After the pipeline startup of n 1 stages, all n dimensions are used for every time step until the pipeline shut down takes place. Once the pipeline is lled, n data items are exchanged in all processors in every stage, until the pipeline shut down phase is reached. The total number of element exchanges in sequence is P N +n 1, which is an improvement over the non-pipelined algorithm by a factor of approximately n. Our FFT implementation on the Connection Machine uses the above scheme to achieve full utilization of the communication capabilities of the architecture. The two channels between each pair of nodes are used to exchange concurrently the real and imaginary parts of a data point.
Time Memory

Processor
Step location 0 1 2 3 4 5 6 7  0  2 2 2 2 2 2 2 2  1 - 
Minimizing local data motion
The most e ective use of a memory hierarchy with respect to data transfer bandwidth 4, 5] is to use a high radix FFT algorithm. For each level in the hierarchy the radix shall be equal to the size of the maximum data set that ts in the memory at that level. With a single data path between a oating-point unit and its memory, the radix shall be chosen according to the number of registers in the oating-point unit. Table 2 summarizes the arithmetic and data motion requirements for radix-2, 4, and 8 FFT. Only the most signi cant terms are accounted for. The reduction in the need for memory bandwidth is much more signi cant than the reduction in the number of arithmetic operations. An FFT based on radix-8 kernels requires about 1/3 as many memory references as a radix-2 based FFT. The reduction in the required number of arithmetic operations is about 20%. The Connection Machine FFT uses a combination of radix-2, 4, and 8 kernels as required to t the size of the local data set. A higher radix than 8 is not used because of shortage of registers in the oating-point unit. 
Arithmetic
Twiddle Factors
The total number of twiddle factors needed for a radix-2 FFT of size P = 2 p is P 2 1, and for a radix-R FFT it is (R 1) P R . With precomputed twiddle factors stored in a distributed memory, it is important to minimize the need either for redundant storage of twiddle factors, or for communication of twiddle factors. Below we discuss the issues related to computation and allocation of twiddle factors for a radix-2 FFT on a distributed memory architecture. Computation and storage of twiddle factors for high radix FFT are presented in 12]. In the analysis below the order of the input data is assumed to be normal.
Decimation-in-time (DIT) FFT
For a radix-2 DIT FFT in-place algorithm, the exponents of the twiddle factors for stage q are ! (j p q 1 )(jp qjp q+1 :::j p 1 )2 p 1 q P , q 2 0; p 1], and q = 0 for the rst stage. Note that the address is bit-reversed and shifted for the proper exponent. The twiddle factors for stage 0 are all ! 0 P . If the FFT of size P = 2 p is computed on a Boolean p-cube, then node P 1 requires p 1 distinct twiddle factors. For a P element transform performed on an N processor Boolean cube with consecutive data allocation, stages 1 through n 1 each requires one twiddle factor per stage and processor. All P N local elements have the same twiddle factor in a given processor. The last p n stages are local, and the maximum total number of twiddle factors required per processor is P N 1. We will now prove the claims made above. In the consecutive data allocation the assignment of data indices to processors is (j p 1 j p 2 : : : j p n jfj p n 1 j p n 2 : : : j 0 g), where fj p n 1 j p n 2 : : : j 0 g denotes the set of values assumed by the bits within the braces. Clearly, for q < n none of the data index bits mapped into local memory enters into the twiddle factor exponent. Hence, all local data elements have the same twiddle factors for the stages requiring inter-processor communication. The last stage, which is local, requires P 2N twiddle factors, since the set of exponents are computed from (f(j 0 )(j 1 j 2 : : : j p n 1 gjj p n : : : j p 1 ). In processor zero the twiddle factors for stages n < q < p 1 are always a subset of the twiddle factors for stage p 1, whereas in processor N 1 successive stages have unique twiddle factors. Hence, a maximum total of P p n k=1 P 2 k N = P N 1 di erent twiddle factors per processor is needed for the local stages, and the second part of the claim has been veri ed.
Allocating twiddle factor storage uniformly across all processors yields a total twiddle factor storage of P +(n 2)N, which for P N is about twice the storage required on a sequential or shared memory computer. For P = N uniform twiddle factor storage across processors yields a total storage of (n 1)N, which exceeds the sequential storage by a factor of approximately 2(n 1). With cyclic allocation the p n most signi cant bits are mapped into local memory. The last n steps require communication, and di erent local elements often have di erent twiddle factors. For instance, consider processor (011 : : : 1), which in the cyclic allocation is assigned indices (fj p 1 j p 2 : : : j n gj011 : : : 1). The last stage requires twiddle factors with indices (1 : : : 110jfj n j n+1 : : : j p 1 g), or P N twiddle factors. The second to last stage requires another P N twiddle factors in processor (011 : : : 1), since the index set (1 : : : 110fj n jj n+1 : : : j p 1 g0) is disjoint from the set for the last stage. In general, at least (n 1) P N twiddle factors are required in a processor for the cyclic data allocation and a DIT FFT.
Decimation-in-frequency (DIF) FFT
For an in-place radix-2 DIF FFT the twiddle factors required for the butter y computation in stage q is ! (j p q 1 ) (j p q 2 j p q 3 :::j 0 )2 q P . The exponent of the twiddle factor for a pair of complex elements in a butter y computation is formed from address bits 0 through p q 2 shifted left q steps with an end-o shift. With a consecutive data allocation at least one processor needs P N twiddle factors for each of the rst n 1 stages. The sets of twiddle factors for di erent stages are disjoint. For instance, consider the processor with address (j p 1 j p 2 : : : j p n+1 j p n ) = (11 : : : 10). This processor contains the data indices (11 : : : 10jfj p n 1 j p n 2 : : : j 0 g), with the consecutive allocation.
Shifting this set of addresses left by one step yields a new set of addresses if n > 2. This property holds for n 1 left shifts, i.e., the rst n 1 stages. For n = 1 it is easily seen that P N twiddle factors are needed, and for n = 2 processor (11) requires 3 P 2N twiddles. Hence, at least (n 1) P N twiddles are required in some processor. For a cyclic allocation the rst n stages are local to a processor. The sets of twiddle factor indices required for the stages local to a processor are (fj p 2 j p 3 : : : j n gjj n 1 : : : j 0 ), (fj p 3 j p 4 : : : j n gjj n 1 : : : j 0 )2, : : : : : : : : :, (j n 1 : : : j 0 )2 p n 1 , for stages 0, 1, etc., or a maximum total of P p n m=1 2 p n m = P N 1 for any processor. After the rst p n stages the remaining n stages consist of P N independent FFTs of size N, each with one element per processor. All P N FFTs have the same twiddle factor for a given butter y stage and processor. A maximum of n 1 twiddle factors per processor is needed for the inter-processor communication stages (one for each butter y stage, except the last). Hence, for cyclic data allocation and a radix-2 DIF FFT of size 2 p computed on N processors, n < p, the maximum number of distinct twiddle factors needed in a processor is P N + n 2, the same as for a DIT FFT with consecutive data allocation.
Bit-reversed input
With the input in bit-reversed order the traversal of the bits in the address eld is from the least signi cant bit to the most signi cant bit. The indices used for twiddle factor computation for the DIF FFT are obtained by using bit-reversed addresses, instead of addresses in normal order. Analogously, the DIT FFT uses addresses in normal order for the index computation, instead of bit-reversed addresses for normal order input. With the input data in bit-reversed order the DIF FFT requires the least twiddle factor storage for the consecutive data allocation, while the DIT FFT requires the least storage for the cyclic data allocation. The preferred combinations of data allocation and FFT type are the opposite to those preferred with normal order input.
Reduced twiddle factor storage
For the consecutive data allocation, normal order input, and a radix-2 DIT FFT, the set of twiddle factor indices in the last stage is (fj 1 j 2 : : : j n 1 gjj n : : : j p 1 ). The highest order bit j 1 corresponds to bit position p 2. Hence, (f1j 2 : : : j n 1 gjj n : : : j p 1 ) = P 4 + (f0j 2 : : : j n 1 gjj n : : : j p 1 ). But, ! j+ P 4 P = ! j P e 2 i P P 4 = i ! i P . Notice that multiplication by i, which is associated with a 90-degree rotation, simply involves exchanging the real and imaginary parts of ! i P and negating the imaginary, which requires only one negation. Therefore, half of the twiddle factors can be obtained from the other half with no arithmetic. This property is true for all on-processor stages. The same property is true for decimation-in-frequency FFT, cyclic data allocation, and normal input order, decimation-in-time FFT, cyclic data allocation, and bit-reversed input order, decimation-in-frequency FFT, consecutive data allocation, and bit-reversed input order.
The observation can be generalized to bits p 3; p 4; : : : in the twiddle factor index, but complex arithmetic is required for all of them. Bit p 3 is associated with a 45-degree rotation, i.e., 1+i p 2 . However, these rotations do require complex arithmetic.
Summary
We conclude that the preferred combinations of data allocation, input order, and FFT type with respect to twiddle factor storage are: normal input order, consecutive data allocation, decimation-in-time FFT normal input order, cyclic data allocation, decimation-in-frequency FFT FFT Data
Twiddle
Max. number alloc.
index of twiddles stage q per proc. Normal input order DIT consec. fj p q j p q+1 : : : j p n 1 gjj p n : : : j p 1 2 p 1 q P N + n 2 cyclic j p q j p q+1 : : : j n 1 jfj n : : : j p 1 g2 p 1 q (n 1) P N DIF consec. j p q 2 j p q 3 : : : j p n jfj p n 1 : : : j 0 g2 q (n 1) P N cyclic fj p q 2 j p q 3 : : : j n gjj n 1 : : : j 0 2 q P N + n 2 Bit-reverse input order DIT consec. j q 1 j q 2 : : : j p n jfj p n 1 : : : j 0 g2 p 1 q (n 1) P N cyclic fj q 1 j q 2 : : : j n gjj n 1 : : : j 0 2 p 1 q P N + n 2 DIF consec. j q+1 j q+2 : : : j n 1 jfj n : : : j p 1 g2 q P N + n 2 cyclic fj q+1 j q+2 : : : j p n 1 gjj p n : : : j p 1 2 q (n 1) P N Table 3 : Radix-2 twiddle factor storage for DIF and DIT FFT for di erent data allocation schemes and input orders.
bit-reversed input order, consecutive data allocation, decimation-in-frequency FFT bit-reversed input order, cyclic data allocation, decimation-in-time FFT
The storage requirements and the formula for the twiddle factor index computations are summarized in Table 3 . With 90-degree rotations performed \on-the-y", the storage requirements for on-processor twiddles is reduced by a factor of two compared to what is stated in Table 3 .
FFT on multi-dimensional arrays
Computing an FFT along a single axis of a multi-dimensional array implies a number of independent one-dimensional FFTs. The number of FFTs is determined by the product of the length of the axes on which the FFT is not performed. Using the pipelined type of FFT algorithm described above, there is no need for data rearrangement between FFTs on di erent axes. Pipelining can be performed over all inter-processor dimensions being part of any FFT, as long as successive axes have no local components. If the axes have local components, such that the inter-processor dimensions are interleaved with local dimensions, then the communication pipeline must be broken for the local butter y stages. Each axis in a multi-dimensional FFT has its own set of twiddle factors. The twiddle factors for an axis is a subset of the twiddle factors for the longest axis. With axes of length P 1 P 2 : : : P k the minimum number of twiddle factors is max`(P`) 2 . With separate storage of the twiddle factors for each axis the total storage is P`P2 .
Implementation
All compilers for the Connection Machine systems use the consecutive data allocation scheme. In order to minimize the twiddle factor storage with precomputed twiddle factors, a decimationin-time FFT is used for data in normal input order, and a decimation-in-frequency FFT is used for bit-reversed input order. 90-degree rotations are performed \on-the-y". The inverse Discrete Fourier Transform uses conjugated twiddle factors, and requires no additional storage. The inter-processor communication stages are pipelined, as described above, but for multi-dimensional FFTs each axis is treated independently. In order to simplify the implementation no sharing of twiddle factors between axes takes place. Likewise, each stage of each axis has its own set of twiddle factors. For the FFT stages requiring inter-processor communication the following implementation detail deserves to be mentioned. The Connection Machine systems CM-2 and CM-200 are of the SIMD variety. In the butter y computations one processor in a pair performs a complex addition and the other a complex subtraction. By integrating the negation of one of the operands into the communication both operations can be performed concurrently with no measurable loss in e ciency. For a data array axis with d bits assigned to the processor address eld, d data elements are exchanged in each communication, except during the start-up and shut down of the communications pipeline. d butter y computations can be performed after each communication. The butter y computations belong to di erent stages, and require di erent twiddles. A local data element is updated d times in sequence. Then, it will not be updated until the local phase, if any. In order to reduce the number of loads and stores to local memory, the local data items are cached in the register set of the oating-point unit. Twiddle factors are (re)read from memory. The data caching scheme is used for up to 10 dimensions. For 11 dimensions there are insu cient registers in the oating-point unit, resulting in a performance loss, as can be seen from the timings in the next section. For the local stages of the FFT a mix of radix-2, 4, and 8 kernels are used. Each kernel comes in several varieties depending upon the number of twiddle factors required, and whether or not the twiddles are cached in the register set. The radix-8 kernels yield the best performance, and for an FFT of a given size as many stages as possible are performed using these kernels. When the FFT cannot be computed using only these kernels, then one radix-2 or radix-4 stage is used. Details of the implementation are given in 12]. The scheduling of the radix-2 kernels attempts to minimize the loading of twiddle factors. Consider the DIF FFT in Figure 2 . In the rst stage all twiddle factors are di erent, but in the second stage each twiddle factor that is used is repeated twice. In the third stage each twiddle factor is repeated four times, etc. Hence, by performing the butter y computations in suitable order a twiddle factor needs to be retrieved only once from memory. It is easily veri ed that the same property is true for a DIT FFT. Furthermore, the 90-degree rotation property is used to further reduce the number of twiddles needed. It is possible to load a number and perform a unary operation (in this case, negation) on it in the same cycle, which eliminates the need to perform any arithmetic whatsoever to use the rotated twiddle. The radix-4 and radix-8 kernels reload their twiddle factors each time, due to lack of su cient registers. In these cases, however, the twiddle factor loading can be overlapped with computation e ciently. The kernels for the local DIT and DIF FFT are very similar, but di er slightly in how the pipelines in the oating-point unit are organized. This di erence result in a performance di erence of up to about 7% in 32-bit precision, and up to about 11% in 64-bit precision, as seen from the tables in the next section. The twiddles for all radix-2 stages are computed by the following algorithm.
1. extract the p 1 highest order bits of the data element index, i.e., bits 1 through p 1 into a word t with bit locations 0 to p 2. 2. bit-reverse the extracted word t. 3. perform p q 1 steps of end-o left shifts of t with bits p q 2 to 0 set to zero, q = f0; 1; : : : ; n 1g.
Each value of q corresponds to a di erent butter y stage. The computation is performed by all oating-point units concurrently. The computations are completely uniform. The twiddle factor computation for radix-4 and 8 stages is analogous, but somewhat more complex 12]. In our implementation there is a separate table for each stage. Each table is organized such that a sequence of twiddle factors are accessed with a stride of one. Within each table the twiddle factors are stored in bit-reversed order. The same set of tables are used for both the DIF and DIT FFT, and for both forward and inverse FFTs.
Performance measurements
The performance measurements below have been made on a Connection Machine system CM-200 with 2048 64-bit oating-point units. All performance data refer to a complexto-complex FFT, CCFFT, implemented as described above, and included as part of the Connection Machine Scienti c Software Library version 3.0. The order of the input data is normal in all cases unless otherwise speci ed. We provide data for both ordered and unordered transforms. All oating-point rates are based on 5N log 2 N arithmetic operations for a transform of size N. The performance of the local kernels for di erent sizes is given in Table 4 and Figure 4 . The peaks in Figure 4 correspond to data sizes for which only radix 8 FFT's are used. The performance for 64-bit precision is about 75-80% of the performance for 32-bit precision; the di erence is entirely due to the fact that the data path between each oating-point unit and its memory is 32-bits wide. Data paths internal to the oating-point unit are 64-bits wide. The performance of the DIT kernels is 90 -95% of the DIF kernel performance for most sizes; the di erence here is entirely due to implementation details of the pipelines. 10% slower than unordered transforms; for transforms of size 1024 the ordered transform is about 20% slower than the unordered transform. The performance for the inter-processor communication stages is considerably less than for the local stages. The communication time is constant regardless of the number of processors over which the data set is distributed. With communication in d dimensions, d butter ies can be computed for each communication. The construction of the arithmetic pipelines for the DIT and DIF CCFFT for the inter-processor stages is almost identical, except for the case where 11 processor dimensions are used. In this case the DIF pipeline is about 5% less e cient in 32-bit precision, and 2.5% less e cient in 64-bit precision. The e ciency of the arithmetic pipeline is constant for d in the range 2 { 10. With a communication time that is almost independent of d, it follows that the overall e ciency of the inter-processor butter y stages increases with the number of processors over which the data set is distributed. The peak inter-processor data motion rate exceeds 7 Gbytes/s for 11 processor dimensions, as is apparent from Table 6 and Figure 5 . The oating-point rates are scaled to a 2048 processor CM-200. For instance, in calculating the oating-point rates for FFT on a data set spread across 512 processors, it is assumed that 4 such computations are performed concurrently on distinct data sets and processors. The performance for one-dimensional DIF CCFFT computations for a few di erent sizes of the data set is given in Table 7 and Figure 6 . The performance di erence between the DIT and DIF versions is about 2 { 3% in 32-bit precision, and about 1% in 64-bit precision. The di erence is due to the di erence in e ciency of the local kernels. The performance for large one-dimensional FFTs is largely determined by the inter-processor butter y stages. These stages smooth out the variations in the performance of the local kernels, and a monotonic Table 6 : Performance data for pure inter-processor unordered CCFFT on a 2048 processor CM-200, 8192 elements/processor. increase in performance is observed with increased transform size. Some sample speedup gures for the computations as a function of machine size are given in Table 8 . The speedup is almost perfect for both the FFT's of one million and 16 million points. The loss of e ciency for the largest con guration is due to the decrease in e ciency for the inter-processor communication stages, when 11 inter-processor dimensions are involved in the FFT. The marginally better than perfect speedup that is achieved in some cases is due to the e ciency variations of the local FFT stages. For instance, a 1 million point data set has 8k points per processor for a 128 processor con guration, but 4k points per processor in a 256 processor con guration. The latter uses only radix-8 kernels, which are the most e cient. Timings for two-and three-dimensional CCFFT are given in Table 9 , and shown in Figure 7 .
The signi cant increase in performance for the two-dimensional CCFFT between the 1024 1024 array and the 2048 2048 array is due to one of the axis being local to a processor for the larger array. The subsequent minor decrease in performance for the next larger array is due the fact that the axis distributed over all processors also has a local component of length two. This part of the axis requires a radix-2 kernel, which is less e cient than the radix-4, and the radix-8 kernels normally used. For reference, performance data for ordered two and three-dimensional transforms are given in Table 10 . The execution time increases by 50 -100% for our examples, considerably more than for entirely local transforms.
Optimizing the con guration of the address space
With pipelining of communications the number of element transfers in sequence is P N +d 1, where P N is the number of elements per processor, and d the number of inter-processor dimensions over which an axis subject to transformation is spread. Hence, the number of element transfers in sequence is approximately independent of the number of axis d, except if d = 0 in which case no communication is required. The number of arithmetic operations are clearly independent of the data layout. Hence, if load balance can be achieved by allocating all axes subject to transformation to local memory, then the best possible performance for a given array is achieved. If such an allocation is not feasible, either because of poor load balance, or because the data set is too large to t in local memory, then optimizing performance represents a trade-o between the e ciencies of the local kernels, and the inter-processor stages. The variation in e ciency is small. We consider the one and multidimensional cases in some detail below. Table 9 : Performance data for two and three-dimensional, unordered, CCFFT on a 2048 processor CM-200. 
One-dimensional FFT
The execution time for a single one-dimensional FFT is minimized by spreading the data over as many oating-point units as possible. For multiple one-dimensional FFTs we distinguish between two cases. If the number of instances of the one-dimensional FFT is at least half the number of processors, and the length of the axis on which the transform is performed ts in local memory, then maximum performance is attained if the data is layed out with the FFT axis local to a processor. The other case is where there are fewer instances, or the axis subject to transformation will not t in local memory. In this case the performance is fairly independent of the layout. The optimum layout is mostly determined by the variation in the e ciency of the local FFT kernels as a function of the local axis length. The e ciency of the arithmetic for the interprocessor communication stages is constant, except for d = 1 and d = 11, for which cases the e ciency is lower. The communication time is constant with d. Table 11 shows performance data for one-dimensional FFT on a 4096 4096 data array for various data layouts. The peak performance is achieved for the axis subject to transformation entirely local. The performance for the FFT axis spread over more than one processor is fairly constant, with a noticeable drop o for the FFT axis spread over all processors. The performance variation for the FFT axis spread over 2 to 1024 processors is about 10% in both 32-bit and 64-bit precision. For d = 11 the performance decreases by another 10%.
Multi-dimensional FFT
Optimal e ciency is attained by maximizing the number of axes that has no non-local component. The performance variation once an axis is distributed across processors is minor, as can be seen in Table 12 . For a two-dimensional FFT of shape 4096 4096 the worst performance once an axis is distributed across processors is at most 5% below the peak in 32-bit precision, and at most 3.5% below peak in 64-bit precision. The di erence between a distributed axis, and a local axis is about 20% in 32-bit precision and close to 30% in 64-bit precision. Smaller and high-dimensional FFTs cannot always be layed out in such a way as to have only one axis with any non-local component. Figures 10 and 11 present various two and three dimensional FFTs layed out optimally on di erent size machines. Notice the discontinuity as the machine size exceeds the axis length in both the two and three dimensional transforms. The performance degradation is caused by an axis changing from being entirely local to becoming distributed over at least two processors for the larger machine size. Other performance uctuations are minor. For example, the point at which the FFT axis length equals the machine size runs relatively slightly faster than the other transforms in the series, and the point at which the FFT axis length is exactly twice the machine size runs relatively slightly slower. The reason is that in the former case the looping is exceptionally simple.
In the latter case one local stage is needed for the axis with a non-local extent. This local stage is performed with a radix-2 kernel, which is less e cient than the radix-4 and radix-8 kernels used normally. Very careful inspection will show other similar e ects due to minor di erences in the kernels. 
Summary and Discussion
We have presented a multi-processor FFT adapted for e cient use of both the communication system and the memory bandwidth. We have also shown that the data layout and twiddle factor usage demands that both decimation-in-time FFT and decimation-in-frequency FFT are supported to accommodate data in either normal or bit-reversed order, when the twiddle factors are precomputed. Supporting only one type would increase the twiddle factor storage by a factor of log 2 N for N processors. Multi-dimensional FFTs are performed with no explicit data rearrangement. The peak performance for the entirely local complex-to-complex FFT is about 12.9 G ops/s in 32-bit precision and about 10.7 G ops/s in 64-bit precision. The peak performance for one-dimensional FFT in the range one million to half a billion points is in the range 4.2 -5.2 G ops/s in 32-bit precision, and in the range 2.5 -3.2 G ops/s in 64-bit precision. In 32-bit precision, the peak rates for two-dimensional complex-to-complex FFTs are in the range 1.8 -5.0 G ops/s, and for cubic three-dimensional complex-to-complex FFTs the range is 2.2 -3.2 G ops/s. In 64-bit precision the corresponding numbers are 1.1 -3.2 G ops/s and 1.3 -2.0 G ops/s, respectively. The peak data motion rate between processors is in excess of 7 Gbytes/s. The performance for the corresponding ordered transforms is about 10 -20% less for transforms on data entirely local to each processor, and 2/3 -1/2 for transforms on data sets spread over several processors. Despite the fact that cubic three-dimensional FFTs do not perform as well on large machines as two-dimensional FFTs, certain three-dimensional shapes perform very well, particularly on smaller machines. This is due to the fact that communication pipelining is done only within a given array axis, but not across axes.
Once an axis subject to transformation is distributed across processors, the performance is quite insensitive to the actual con guration of the set of processors. The performance di erence as a function of the number of processors over which an instance of the array axis subject to transformation is distributed is in the 5 -10% range. The load balance for the FFT stages requiring inter-processor communication can be improved. Only one processor in a pair computing a butter y performs a useful complex multiplication. A partitioning type of algorithm, in which all stages are performed locally, as described in 11, 15, 16] , results in perfect arithmetic load balance. In many cases, however, the e ciency for such an algorithm on the Connection Machine systems will not be higher. It may even be lower, as we will explain below. The partitioning type of FFT also has the advantage that only half the data set is communicated in each inter-processor communication stage, as opposed to all of the data in the direct pipelined algorithm used for the Connection Machine implementation. However, with the consecutive data allocation at least one of the processor dimensions must be used twice for this type of FFT 11, 15, 16] . With concurrent communication on all channels the number of element transfers in sequence is the same as for the direct pipelined algorithm used in our implementation. Hence, for normal order input and consecutive data allocation the partitioning type FFT does not o er any reduction in the communication requirements. Indeed, because of low level features of the Connection Machine architecture, each communication in such an FFT implementation would require longer time, and the savings in time due to improved arithmetic load balance would be nulli ed in most cases. However, for cyclic data allocation the partitioning type FFT would o er a moderate improvement, since in this case no inter-processor dimension needs to be used more than once for an unordered FFT 11] .
