Implementations of various fast Fourier transform (FFT) algorithms are presented for distributed-memory multiprocessors. These algorithms use data redistribution to localize the computation. The goal is to reduce communication cost by using a minimum number of redistribution steps. Both analytical and experimental performance results on the Intel iPSC/860 system are presented.
Introduction
This paper presents distributed-memory parallel programs for the fast Fourier transform (FFT) [1, 2] which use data redistribution to localize the computation. In a distributed-memory multiprocessor like the Intel iPSC/860 and the Thinking Machines CM5, shared data is distributed across the local memories of interconnected processors. The most commonly used distributions are the block, cyclic, and block-cyclic distributions. These distributions are also proposed in High Performance Fortran (HPF) [3]. Communication is needed when a processor requires data from another processor's local memory. In most distributed-memory multiprocessors the cost of communicating a data value is considerably larger than the cost of a primitive arithmetic operation on that value. It is therefore important to reduce the communication overhead to achieve high performance.
In order to achieve good scalable performance, programs may have to use different approaches to solve the same problem for different ranges of problem size. In this paper we present FFT algorithms based on redistribution primitives, which perform better than FFT programs based on point-to-point message passing when the problem size is large. The overall communication cost in the FFT programs with redistribution decreases due to a reduction in the total communication volume, even though the number of transmitted messages increases. Furthermore, using redistribution primitives to eliminate explicit message passing in the computation leads to a more portable implementation, since details of the target machine, such as the network topology, are abstracted away. Both analytical and experimental performance results on an Intel iPSC/860 system show that real benefits can be achieved by such a scheme. For simplicity, only radix-2 decimation-in-time FFT algorithms are developed.
The paper is organized as follows. Section 2 describes the semantics of block-cyclic distributions. In Section 3, we present a distributed-memory node program for the Cooley-Tukey FFT. Section 4 and Section 5 present implementations of the Pease FFT and the Stockham FFT. Performance results are given in Section 6. Conclusions are included in Section 7.
Semantics of Data Distribution
Various data distributions of an array correspond to mapping certain bits of the address (the binary representation of the array index) of an element to the processor address [4]. One such data distribution is the block-cyclic distribution. A block-cyclic distribution partitions an array into equal-sized blocks of consecutive elements and maps them to the processors in a cyclic manner. The elements mapped to a processor are stored in increasing order of their indices in its local memory. We will use the following convention to express the block-cyclic distribution of a linear array $A(0 : N-1)$ on $P$ processors:

$A(\mathrm{cyclic}(b))$: the block size is $b$ and element $i$, $0 \le i < N$, is on processor $\lfloor i/b \rfloor \bmod P$.
The local array $A'(0 : \lceil N/P \rceil - 1)$ will contain the elements of array $A$ mapped to a processor.
$A(\mathrm{block})$ and $A(\mathrm{cyclic})$ will be used to express the block and cyclic distributions of array $A$. These distributions are equivalent to $\mathrm{cyclic}(\lceil N/P \rceil)$ and $\mathrm{cyclic}(1)$, respectively.
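To make the convention concrete, the following minimal sketch (0-based indices; the helper names are ours, not from the paper) computes the owner and local index of an element under $\mathrm{cyclic}(b)$:

```python
# Owner and local index of element i of A(0 : N-1) under A(cyclic(b))
# on P processors, using 0-based indices.

def proc_of(i, b, P):
    """Processor owning element i: the block number i // b, taken cyclically."""
    return (i // b) % P

def local_of(i, b, P):
    """Local index of element i: full local blocks stored so far, plus the
    offset of i within its block."""
    return b * (i // (b * P)) + (i % b)

if __name__ == "__main__":
    N, P = 16, 4
    blk = -(-N // P)  # ceil(N/P): the block distribution as cyclic(ceil(N/P))
    print([proc_of(i, blk, P) for i in range(N)])  # 0,0,0,0,1,1,1,1,... (block)
    print([proc_of(i, 1, P) for i in range(N)])    # 0,1,2,3,0,1,2,3,... (cyclic)
```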
Suppose an array of size $2^n$ is block-cyclically distributed on $2^p$ processors. Then the $p$ bits required for the processor address are chosen from the $n$ bits used for array indexing. If $b_n \cdots b_1$ is the index of an array element, then $\mathrm{cyclic}(2^b)$ corresponds to choosing $b_{b+p} \cdots b_{b+1}$ as its processor address. As the array elements are stored in increasing order of their indices in each processor's local memory, the local address of element $b_n \cdots b_1$ on processor $b_{b+p} \cdots b_{b+1}$ is $b_n \cdots b_{b+p+1}\, b_b \cdots b_1$. We will mark any bit $b_i$ of the global address that is used for processor indexing as $\hat{b}_i$. For example, the global address for the index $b_n \cdots b_1$ would be $\hat{b}_n \cdots \hat{b}_{n-p+1}\, b_{n-p} \cdots b_1$ for a linear array of size $2^n$ distributed as block. It would be $b_n \cdots b_{p+1}\, \hat{b}_p \cdots \hat{b}_1$ when it is distributed as cyclic. The local address and the processor address can be extracted from a global address by the functions local and proc.
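When the array size, processor count, and block size are all powers of two, the proc and local functions reduce to pure bit selection, matching the hatted-bit notation above. A sketch under those assumptions (bit $b_1$ is the least significant bit, i.e., bit position 0):

```python
# For an array of size 2**n on 2**p processors distributed as cyclic(2**b):
# the processor address is bits b_{b+p} .. b_{b+1} of the global index, and
# the local address is the remaining bits b_n .. b_{b+p+1} b_b .. b_1.

def proc(i, b, p):
    """Extract the p processor-address bits b_{b+p} .. b_{b+1}."""
    return (i >> b) & ((1 << p) - 1)

def local(i, b, p):
    """Concatenate bits b_n .. b_{b+p+1} with bits b_b .. b_1."""
    low = i & ((1 << b) - 1)   # bits b_b .. b_1, kept in place
    high = i >> (b + p)        # bits b_n .. b_{b+p+1}, shifted down
    return (high << b) | low
```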
Cooley-Tukey FFT
For simplicity, we will assume that the input is in bit-reversed order for the Cooley-Tukey FFT and the Pease FFT. The twiddle factors will be assumed to be replicated on all processors. Furthermore, we will present only radix-2 decimation-in-time FFT algorithms. However, the techniques presented in this paper are also applicable to mixed-radix and decimation-in-frequency FFTs [2]. We will use $N (= 2^n)$ to denote the size of the FFT being performed and $P (= 2^p$, $p \le n)$ to denote the number of processors being used to perform the FFT. In a $2^n$-point Cooley-Tukey FFT [5], there are $n$ steps of computation. At step $i$, $1 \le i \le n$, a butterfly computation is performed on the pair of elements at addresses $b_n \cdots b_{i+1}\,0\,b_{i-1} \cdots b_1$ and $b_n \cdots b_{i+1}\,1\,b_{i-1} \cdots b_1$: the pair is combined with the appropriate twiddle factor, and the two results replace the operands. This computation is performed on elements whose addresses differ only in bit $b_i$. This butterfly computation will be denoted by $b_n \cdots b_{i+1}\, B_i\, b_{i-1} \cdots b_1$.
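The butterfly pattern is easiest to see in a short sequential sketch. The twiddle-factor indexing below is the standard one for a radix-2 decimation-in-time FFT with bit-reversed input; the paper states the butterfly only schematically, so the details are an assumption:

```python
import cmath

def fft_dit_bitrev(a):
    """In-place radix-2 DIT FFT on a list of length N = 2**n whose elements
    are already in bit-reversed order. Step i uses half = 2**(i-1), so each
    butterfly combines the pair of addresses differing only in bit b_i."""
    N = len(a)
    half = 1
    while half < N:
        w_step = cmath.exp(-2j * cmath.pi / (2 * half))  # principal 2*half-th root
        for base in range(0, N, 2 * half):               # fixed bits b_n .. b_{i+1}
            w = 1 + 0j
            for j in range(base, base + half):           # fixed bits b_{i-1} .. b_1
                t = w * a[j + half]
                a[j], a[j + half] = a[j] + t, a[j] - t   # the B_i butterfly
                w *= w_step
        half *= 2
    return a
```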
A simple way to implement the Cooley-Tukey FFT on a distributed-memory multiprocessor is to use a block distribution for the input array $A$. The communication required for such a program can be determined from its trace in terms of the global address. For $i \le n-p$, the computation can be summarized as $\hat{b}_n \cdots \hat{b}_{n-p+1}\, b_{n-p} \cdots b_{i+1}\, B_i\, b_{i-1} \cdots b_1$, which is local, whereas for $i > n-p$ the butterfly operates on one of the processor bits and therefore requires communication. If $n \ge 2p$ ($N \ge P^2$), then by choosing the most significant $p$ bits for the first $n-p$ steps and the least significant $p$ bits for the last $p$ steps to represent the processor address, we can localize the computation for all $n$ steps. This implies that the input array $A$ is initially distributed using a block distribution and then, after performing $n-p$ steps of computation, its distribution is changed to a cyclic distribution. The computation performed on each node before the redistribution is the sequence of local butterflies $B_i$, $1 \le i \le n-p$, on its local array $A'(0 : 2^{n-p}-1)$; after the redistribution, the remaining $p$ butterfly steps are likewise local. In general, $\lceil p/(n-p) \rceil$ redistribution steps would be required for any $p < n$, with one more redistribution if the initial distribution is to be restored at the end of the computation. For example, Fig. 2 shows the data flow diagram (after bit-reversal) for a 32-point Cooley-Tukey FFT on four processors using redistribution. The input array initially has a block distribution, which is changed to a cyclic distribution after the third computation step.
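As a concrete illustration of the single-redistribution scheme, the following sketch simulates the data movement only; it is not the iPSC/860 message-passing code, and it assumes $P$ divides $N$. Each processor buckets its $N/P$ block-distributed elements by their owner under the cyclic distribution, yielding the all-to-all exchange of $N/P^2$-element messages discussed in Section 6:

```python
def block_to_cyclic(local_arrays):
    """Simulate the block -> cyclic redistribution (an all-to-all
    personalized communication). local_arrays[q] holds processor q's
    N/P consecutive elements under the block distribution."""
    P = len(local_arrays)
    blk = len(local_arrays[0])                 # N/P elements per processor
    outgoing = [[[] for _ in range(P)] for _ in range(P)]
    for q in range(P):
        for j, val in enumerate(local_arrays[q]):
            i = q * blk + j                    # global index of the element
            owner = i % P                      # its processor under cyclic
            outgoing[q][owner].append((i // P, val))  # (new local index, value)
    # "Deliver" the messages: under cyclic, element i sits at local slot i // P.
    new_local = [[None] * blk for _ in range(P)]
    for q in range(P):
        for r in range(P):
            for loc, val in outgoing[q][r]:
                new_local[r][loc] = val
    return new_local
```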
Pease FFT
Pease proposed an FFT algorithm suitable for parallel processing [6]. In this algorithm, the array of intermediate results is permuted (using the perfect shuffle) after the butterfly computation at each step, so that the two data elements involved in a butterfly computation at the next step are adjacent to each other. At each step $i$, $1 \le i \le n$, a butterfly $b_n \cdots b_2\, B_1$ is performed on adjacent pairs of elements (those whose addresses differ in bit $b_1$), followed by the shuffle permutation $A(b_1\, b_n \cdots b_2) = B(b_n \cdots b_1)$. A direct implementation of the Pease FFT algorithm on a distributed-memory multiprocessor would be to distribute array $A$ according to some block-cyclic distribution. Suppose array $A$ is initially distributed using a block distribution. The shuffle permutation at each step, in terms of the global address, can then be expressed as:
$A(\hat{b}_1\, \hat{b}_n \cdots \hat{b}_{n-p+2}\, b_{n-p+1} \cdots b_2) = B(\hat{b}_n \cdots \hat{b}_{n-p+1}\, b_{n-p} \cdots b_1)$.
Note that $\mathrm{proc}(\hat{b}_n \cdots \hat{b}_{n-p+1}\, b_{n-p} \cdots b_1) \ne \mathrm{proc}(\hat{b}_1\, \hat{b}_n \cdots \hat{b}_{n-p+2}\, b_{n-p+1} \cdots b_2)$, which implies that communication is needed at every step. This is true for any distribution of the input array. To avoid communication at each step, we need to modify the Pease FFT algorithm. We now present a modified Pease FFT algorithm which requires a data redistribution after every $n-p$ steps. When $n \ge 2p$, only a single redistribution is required. Fig. 3 shows a trace of the modified algorithm. After a block-to-cyclic redistribution, each node performs the butterfly and a local shuffle on its local array $A'(0 : 2^{n-p}-1)$, so all remaining steps are communication-free. Steps ia and ib represent the computation phase and the permutation phase, respectively, at step $i$.
[Trace of the modified Pease FFT: for each step, the global address, processor address, and local address.] For example, Fig. 4 shows the data flow diagram (after bit-reversal) for a 32-point Pease FFT on four processors, which uses a single redistribution. The input array has a block distribution for the first three steps and a cyclic distribution for the remaining two steps. Communication is needed only to change the distribution from block to cyclic after the third step. The Korn-Lambiotte FFT [7], which was developed for vector processing, can be implemented similarly on a distributed-memory vector multiprocessor. The Korn-Lambiotte FFT performs a shuffle permutation at the beginning of each step so that a fixed vector length of $2^{n-1}$ is achieved. Using a technique similar to the one used for the Pease FFT, the Korn-Lambiotte FFT can be modified so that only a single redistribution is required when $n \ge 2p$.
Further, a fixed vector length of $2^{n-p-1}$ is achieved on each node.
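The shuffle itself is a one-bit rotation of the global address, so its effect on the processor bits can be checked directly. A small sequential sketch (function names ours), confirming why the direct Pease FFT communicates at every step under a block distribution:

```python
def shuffle_dest(i, n):
    """Perfect shuffle on n-bit addresses: the element at b_n .. b_1 moves to
    b_1 b_n .. b_2, i.e., the address is rotated right by one bit."""
    return ((i & 1) << (n - 1)) | (i >> 1)

# Under a block distribution on 2**p processors, the owner is the top p bits;
# most elements change owner under the shuffle.
n, p = 5, 2
moved = sum((i >> (n - p)) != (shuffle_dest(i, n) >> (n - p))
            for i in range(1 << n))
print(moved, "of", 1 << n, "elements change processors")  # prints: 24 of 32
```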
Stockham FFT
To implement the Stockham FFT on a $2^p$-processor distributed-memory multiprocessor, it can be determined that if we choose the least significant $p$ bits for the processor address, i.e., a cyclic distribution, then the first $n-p$ steps are communication-free. This is because the least significant $p$ bits are not affected by the permutation performed in these steps. However, communication will be required for the remaining $p$ steps. Redistributing the input array to any other distribution would also require $p$ steps of communication. Now, assuming that $n \ge 2p$, a trace of the modified algorithm is shown below. Steps ia and ib represent the permutation phase and the computation phase, respectively, at step $i$.
[Trace of the modified Stockham FFT: for each step, the global address pattern.] The twiddle exponent at step $i$ is determined by bit $b_{n-i+1}$ and bits $b_n \cdots b_{n-i+2}$. After performing this computation, each node copies $A'(j)$ to $B'(j)$, $0 \le j < 2^{n-p}$. The distribution of array $B$ is then changed to $\mathrm{cyclic}(2^p)$ (step redist. in the trace above). Subsequently, an address-bit permutation, which corresponds to exchanging bits $b_{n-p+1 \to n}$ with $b_p \cdots b_1$, is performed (step perm. in the trace above). This permutation is performed by copying $B(b_{p+1 \to n-p}\, \hat{b}_{n-p+1 \to n}\, b_p \cdots b_1)$ to $A(b_{p+1 \to n-p}\, b_p \cdots b_1\, \hat{b}_{n-p+1 \to n})$.
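For reference, here is a sequential radix-2 decimation-in-time Stockham FFT in the standard textbook formulation (natural-order input and output, no bit-reversal pass); this is a sketch of the underlying algorithm, not the paper's node program:

```python
import cmath

def stockham_fft(x):
    """Radix-2 DIT Stockham autosort FFT. Intermediate results ping-pong
    between two buffers, so no separate bit-reversal permutation is needed.
    len(x) must be a power of two."""
    N = len(x)
    a, b = list(x), [0j] * N
    L = 2                                  # current transform size
    while L <= N:
        half, r = L // 2, N // L           # r sub-transforms remain
        for j in range(half):
            w = cmath.exp(-2j * cmath.pi * j / L)
            for k in range(r):
                u = a[k * half + j]
                t = w * a[(k + r) * half + j]
                b[k * L + j] = u + t       # results land in sorted positions
                b[k * L + half + j] = u - t
        a, b = b, a
        L *= 2
    return a
```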
Performance Results
We now present analytical and experimental performance results for the Cooley-Tukey, Pease, and Stockham FFTs on the Intel iPSC/860 distributed-memory hypercube. The Cooley-Tukey FFT requires communication only during the redistribution step, where the distribution of array $A$ is changed from block to cyclic. If the initial distribution of array $A$ is to be preserved, then another redistribution will be required at the end of the program. Hence, two versions of the Cooley-Tukey FFT were implemented: one using only a single redistribution step (CT-1R) and another performing an additional redistribution at the end to restore the block distribution (CT-2R); the programs are summarized in Table 1. Parallel programs which use point-to-point message passing were implemented for the Cooley-Tukey FFT (CT-PP) and the Stockham FFT (ST-PP). These programs require $\log(P)$ communication steps [9]. All the communication in CT-PP is nearest-neighbor. PS-PP was not implemented, as its performance is certain to be worse than that of CT-PP due to the $\log(N)$ communication steps needed. In all programs except ST-1R and ST-PP, there is an initial bit-reversal permutation step which requires one additional redistribution. We now compare the communication requirements of the CT-PP and CT-1R algorithms. CT-PP requires $\log(P)$ messages, each of size $N/P$. CT-1R requires a block-to-cyclic redistribution involving an all-to-all personalized communication, which can be performed using the pairwise exchange algorithm [10]. The pairwise exchange algorithm is contention-free on hypercubes. In each of the $P-1$ steps of this algorithm, each processor sends and receives a message of size $N/P^2$. Not including the bit-reversal, the total communication volume for CT-1R is $P \cdot (P-1) \cdot N/P^2 \approx N$, whereas for CT-PP it is $P \cdot (N/P) \cdot \log(P) = N \log(P)$. If $t_s$ is the message setup time and $t_p$ is the link transmission time per data element, then the time spent in communication per processor can be estimated as $T_{CT\text{-}PP} = \log(P)\, t_s + (\log(P) \cdot N/P)\, t_p$ and $T_{CT\text{-}1R} = (P-1)\, t_s + ((P-1) \cdot N/P^2)\, t_p$. Substituting the measured iPSC/860 parameters, a setup time of 164, a transmission time of 3.184 per complex element (because one complex number has eight bytes), and a cost of 29.9 per hop of message distance, and noting that for CT-PP all the messages are nearest-neighbor, the total communication time for CT-PP is:
$T_{CT\text{-}PP} = 164 \log(P) + 3.184\, (\log(P) \cdot N/P) + 29.9 \log(P)$.
For CT-1R, the average distance a message travels is $\log(P)/2$; therefore, the communication time is:
$T_{CT\text{-}1R} = 164\, (P-1) + 3.184\, ((P-1) \cdot N/P^2) + (29.9 \log(P)\, (P-1))/2$. Table 2 shows the estimated communication times for CT-PP and CT-1R on 32 processors.
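The two models are straightforward to evaluate. A small script using the constants above (times in the same units as $t_s$ and $t_p$; the problem sizes chosen here are illustrative, not the paper's Table 2 entries):

```python
from math import log2

T_S, T_P, HOP = 164.0, 3.184, 29.9  # setup, per-complex-element, per-hop costs

def t_ct_pp(N, P):
    """log(P) nearest-neighbor messages of N/P elements each."""
    return T_S * log2(P) + T_P * (log2(P) * N / P) + HOP * log2(P)

def t_ct_1r(N, P):
    """P-1 pairwise exchanges of N/P**2 elements, average distance log(P)/2."""
    return T_S * (P - 1) + T_P * ((P - 1) * N / P**2) + HOP * log2(P) * (P - 1) / 2

P = 32
for n in (10, 14, 18, 22):
    N = 2 ** n
    print(f"N = 2^{n:2d}: CT-PP {t_ct_pp(N, P):12.0f}   CT-1R {t_ct_1r(N, P):12.0f}")
```

With these constants the two curves cross near $N \approx 2^{14}$ on 32 processors; beyond that point CT-1R's smaller communication volume dominates its larger message count.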
For large values of $N$, it is evident that the communication time for CT-1R is much smaller than that for CT-PP, even though more messages are exchanged in CT-1R. A similar analysis can be performed for the Pease and Stockham FFTs. From Figure 7, it can be observed that CT-2R performs nearly as well as or better than CT-PP. CT-1R performs better than both CT-2R and CT-PP. PS-1R and PS-2R perform better than all the Cooley-Tukey programs. ST-1R performs better than ST-PP and has the best performance among all the FFT implementations. These results indicate that efficient portable programs can be implemented using redistribution primitives.
Conclusions
We have presented FFT programs in which communication is expressed through data redistribution primitives. The resulting programs have a reduced communication volume compared to point-to-point message-passing programs when the problem size is large.
