The development of the fast Fourier transform (FFT) and its numerous variants in the past 30 years has led to very efficient software and hardware implementations of the transform on uniprocessor Computers. In recent years, many researchers have recognized the practical importante of minimizing computing time by parallelizing sequential FFT algorithms in various ways for today's high-performance multiprocessor computers. This Paper presents many FFT variants already proposed by others in a common framework, which illuminates the progress made in parallelizing them to this date. In addition, three new parallel FFT algorithms along with communication complexity results are presented. The proposed algorithms show alternative ways of designing parallel FFT algorithms which feature reduced communication tost and further flexibility in the choices of data mappings.
Introduction
The discrete Fourier transform (DFT) has played an important role in control, Signals and image processing over a long period of time, and the practical importante of fast Fourier transform algorithms (FFTs) has also been well recognized. In the past 30 years, the development of the FFT and its numerous var-iants has led to very efficient Software and hardware implernentations of the transform on uniprocessor Computers. Since multiprocessor Computers became commercially available a decade ago, the parallelization of sequential FFTs on high-performance multiprocessor Computers has received the attention of numerous researchers. The variations of sequential FFTs, together with different parallelizing techniques, different multiprocessor architectures, and different network topologies, has led to a large number of parallel FFT variants, each with its own "twist" on things. A Survey of the FFT literature suggests that a useful contribution would be to put the various ideas in a common framework. To this end, the concept of index-digit permutation [5, 6] and the associated binary address based notation [6, 15, 18] have been borrowed, modified, and applied in a uniform manner to explain the various ideas which have appeared in the FFT literature. Throughout this Paper, examples from the radix-2 FFTs are used to introduce and demonstrate how this notation tan bring all FFT variants, sequentiul or parallel, into a common framework which facilitates the understanding, implementation, and design of the various FFT algorithms. The notation extends to the entire class of radix-2" FFTs, including mixed-radix and Split-radix FFTs, in a straightforward way, but is omitted to save space. Three new distributed-memory parallel FFTs along with communication complexity results are also presented. The proposed algorithms show alternative ways of designing parallel FFT algorithms which feature reduced communication tost and further flexibility in the choices of data mappings.
The discrete Fourier transform (DFT)
The DFT refers to the transformation of N discrete samples xo,xl, . . . ,x,+~ according to the formula N-l x,=~x,c& r=0,1,..., N-l, I=O
(1)
where the x,'s are sampled at equally spaced intervals in one period from a complex time series, the Xr's are complex numbers which represent a spectrum of the time series resulting from the transform, and ON is the Nth primitive root of unity; i.e., ON = e-2Zj'N with j E fl.
Although Eq.
(1) defines X, for all r, one only needs to consider values for r = 0 to r = N -1, since o.$' = 1 for h 3 1, which implies that X,,, = X,. When N=4, the DFT of XO, XI, x2 and x3 is the product of the matrix Q and the vector x in the equation below.
FFT variants and the divide-and-tonquer paradigm
The divide-and-tonquer paradigm is fundamental to the FFT algorithm. The three major Steps of the divide-and-tonquer paradigm are
Step 1. Divide the Problem into two or more subproblems of smaller size.
Step 2. Solve each subproblem recursively by the same algorithm. Apply the boundary condition to terminate the recursion when the sizes of the subproblems are small enough.
Step 3. Obtain the Solution for the original Problem by combining the solutions to the subproblems.
The radix-2 FFT is traditionally introduced as a recursive algorithm obtained from dividing the given Problem (and each subproblem) into two subProblems of half the size. Within this framework, there are two commonly used FFT variants which differ in the way the two half-size subproblems are defined. They are referred to as the DIF (decimation in fiequency) FFT and the DIT (decimation in time) FFT. To conserve space, only the DIF Version is considered here. However, the notation and the approach in this Paper apply equally well to the DIT case.
The remainder of this article consists of the following sections. The derivation of the radix-2 DIF FFT algorithm is presented in Section 2. A binary address based notation to assist in describing algorithms is introduced in Section 3. Various data mapping strategies for parallelizing FFTs on multiprocessor Computers are discussed in Section 4. Specific parallel FFTs in the literature are explained and presented in a unified manner in Section 5. Three new distributed-memory parallel FFTs along with communication complexity results are presented in Section 6. Section 7 contains concluding remarks.
The radix-2 decimation-in-frequency (DIF) FFT
As its name implies, the radix-2 DIF FFT algorithm is obtained by decimating the output frequency series into an even-indexed set {J&lk = 0, . . . , N/2 -l}, and an odd-indexed set {XZk+rlk = 0,. . . ,N/2 -1). To define the two half-size subproblems, Eq. 
For r even, using Eq. (4) Note that because X, = & in (6) and X 2k+i = & in (8) no more computation is needed to obtain the Solution for the original Problems after the two subproblems are solved. Therefore, in the implementation of the DIF FFT, the bulk of the work is done during the subdivision Step; i.e., the Set-up of appropriate subproblems, and there is no combination Step. Consequently, the computation of yr = xI + xIY~~t and zI = (x[ -Xl+N,2)Wb completes the first (subdivision) Step.
: :!
The computation of y/ and z/ in the subdivision Step as defined above is referred to as the Gentleman-Sande butterfly in the literature, and is depicted by the annotated butterfly Symbol in Fig. 1 .
Let T(N) be the arithmetic tost of computing the radix-2 DIF FFT of size N, and, for some constant c, let cN be the tost of computing y,'s and z,'s in the ifN=2" > 1.
Assuming that N= 2" for the radix-2 FFT, the Solution to the equation above is
3. A binary address based notation for butterfly operations
In the abstract, the equations introduced in the previous sections completely describe the computations. However, in practise, the computations are normally performed in pluce in a one-dimensional array, with new values overwriting old values as implied by the butterfly in Fig. 1 . Throughout this Paper, it is assumed that the input data XI, 0 6 1 < N are stored in the one-dimensional array a in natura1 Order. 2 Since these computations are performed repeatedly on subProblems of various sizes, determining the location of X, in the array at the end of the computation is not immediately obvious. The butterfly notations do not explicitly relate the locations of the input data to the locations of the output data; for example, the Gentleman-Sande butterfly represents only the first subdivision Step of the recursive DIF FFT algorithm. Thus, there is a gap between the elegant (but somewhat implicit) butterfly notations and the detailed specification of the positions of the Outputs X, contained in the repeatedly modified array. The purpose of this section is to describe a simple notation to assist in this specification. It will also facilitate the adaptation of these FFT algorithms to parallel processing, which is the focus of this Paper. The notation is not entirely new, but rather a modest embellishment and extension of the ideas that have been used in the past [6, 15, 18] . The radix-2 DIF FFT introduced in Section 2 is used as a vehicle for introducing the notation.
Iterative form of the radix-2 DIF FFT
Recall that the Gentleman-Sande butterfly represents the first subdivision
Step. Because there is no combination Step, it is straightforward to derive an iterative algorithm by simply subdividing each resulting subproblem iteratively 2 The input data xo,xl, ,x,+l are said to be in "natura1 Order" if XI and XI+, are stored in consecutive locations in o for all 0 < l< N-2. Similarly, the output data Xo,Xl , , xN_ 1 are said to be in natura1 Order if X, and X,+l are stored in consecutive locations in o for all 0 < r Q N-2.
until the Problem size is one, as shown in the pseudo-code algorithm below. It is assumed that the twiddle factors are pre-computed and stored in an array W, with w[i] = w;, 0 < i < N/2. For N = 32 = 25, the DIF FFT algorithm as described above consists of five stages involving the sequence k = 4, 3, 2, 1, 0, with butterflies being applied to all pairs of array elements whose binary addresses differ in bit ik. Since the algorithm tan be expressed in terms of binary addresses, it is convenient to adopt binary notation for the subscripts of a and X. Thus, in what follows the notation xi,i,izi,iO will mean xr, where i&i2ilio is the binary representation of r. (6) and (8) are interpreted as implied by the butterfly in Fig. 1 , the top half of II will contain the even-numbered X's after all stages are completed, and the bottom half will contain the odd-numbered As. Thus, if i4 = 0, then a[i&izil io] is in the top half of u and will ultimately contain an even-numbered X, and if i4 = 1, then a[i&i2ilio] is in the bottom half of 4 and will ultimately contain an odd-number\d X. This in turn means that after the butterfly Step identified by Operation i&izilio is completed, it is evident that a[i&i2iliO] will ultimately contain output X,., where the rightmost bit of the binary representation of r is id. By exactly the Same argument, applied recursively to both the top and bottom halves of the array, we tan conclude that the second from the rightmost bit of r will be 6. Repeating the argument three more times on successively smaller portions of the array u yields the conclusion that a[i&i2iliO] will finally contain Xioi,i2iji4. In other words, XiOi,ftijid will OCCUPY the Position originally occupied by xilisi2i,i,,. That is, the output X is in bit-reversed Order. Fig. 2 displays the initial contents of a, the five stages of butterfly computation together with the associated twiddle factors, and the resulting bit-reversed sequence. To help identify the pairs of subproblems resulting from every Stage of butterfly computation, the subproblems involving a [O] , which initially contains the input element xo, are highlighted. Note that for an input sequence of N elements, there are exactly N/2 butterflies in each of the log ZN stages of butterfly computations. 
The eflect of permutations
The algorithm described in the previous section is correct only because Xi,,i3izi,i0 is initially contained in a [i&i$lio] . However, the following program is correct, regardless of where xi4isi2i,i,, is found in a. Thus, the input data could be permuted arbitrarily, and as long as the butterflies are applied correctly to the data, the correct answers would be obtained. Moreover, the element of Q that initially contained xi4i3i2ili,,, would contain X;Oi,itiSi4 at the end of the computation.
For example, suppose the objective is to find an initial ordering of the input so that the resulting output is in natura1 Order. The Observation above implies that initially, u[ioili&i4] should contain x,i3i,i,i,,, since at the end of the computation one wants u[i,$li$&] to contain XiOi1i2i3i4. That is, the input should be placed in bit-reversed Order before the computation begins.
Again for the case N= 32, a shorthand notation describing the 5-step process, together with the initial permutation to bit-reversed Order, is the sequence Here the sequence begins with i&i$lio, which is intended to imply that xi4i3i2ili0 is assumed to be in a[i&izilio] as before; the notation i,$li&z is intended to imply that xi4i3izili0 has been permuted to a[i,$lizi&] before the first butterfly computation is performed. That is, i&i$lio always represents the binary representation of the subscripts of x; the Order in which the bits appear, or are permuted during the computation, refer to movements that Xi4i,i,i,i0 or its derivatives undergo in a during the computation.
Arranging that input and output are both in natura1 Order
When input x and output X are both in natura1 Order, the algorithm is referred to as an "ordered" FFT in the literature. The key to understanding what is required is to view each butterfly computation as consisting of one permutation step followed by one in-place computation Step. These permutation Steps recorder the initial input as well as the input to each subsequent subproblem, and the notation introduced above tan be used to describe this process in a natural way. Again using the example above, with the input in the natura1 Order, the first in-place butterfly is denoted by ci&ilio. If this butterfly Operation is preceded by per-muting the data in u [i&ilio] to a [i&ili&] , it is natura1 to use i&iliozT to denote the combined effect. Since the permutation step cannot be done in-place, two arrays of size N are used to alternately store the data; this doubles the storage requirement. As usual, assuming that x is initially contained in u in the natura1 Order, a second array b would alternately contain the data. The entire computation process, along with the use of two arrays, is depicted below.
Since the Gentleman-Sande butterfly does not permute the input when the Problem size is 2, no Permutation
Step precedes the last butterfly Operation, so there is no relocation of data from a to b in the last
Step of the process. Another class of ordered FFTs perform in-place Permutation and consequently do not need a second array; they are the so-called "self-sorting inplace" algorithms. This class contains variants of the Prime-factor algorithms [1,12,16] and a radix-2 FFT [8] . Recently, this class has been further extended to include self-sorting in-place radix-3, radix4, and radix-5 FFTs [ 171. To keep our discussion brief, only the radix-2 algorithm is reviewed here. Using the notation developed earlier, the process of applying the self-sorting in-place radix-2 FFT to array u is depicted below for N= 32.
Observe that the permutation always involves bits in symmetric positions: e.g., in
Step 1, the leftmost bit i4 switches with the rightmost bit i0 and in step 2, bit i3, the second bit from the left end, switches with bit i,, the second bit from the right end. Accordingly, the ordering of the bits is "reversed" after only two Steps, and the permutation tan be implemented using "pairwise" interchanges. Step 2. Since each pairwise interChange tan be done using a Single temporary location, the array b is not needed.
Mapping data to processors
For purposes of this Paper, it is assumed that the multiprocessor available has P = 2d processors, and each processor has its own local memory. That is, the machine in question is a distributed-memory multiprocessor, where each processor is connected to the others via a communication network with a prescribed topology. A common topology is a hypercube, but others such as a regular grid or a ring are also commonly used.
A key step in parallelizing the FFT on such multiprocessor Computers is the mapping of the elements of x to the processors. Since N = 2" and P = 2d, it is natura1 to select d of the n bits in the binar-y representation of the subscripts of x to create a unique d-bit binary processor ID number specifying the data-to-processor mapping. While any set of d bits could be Chosen, there seems to be no compelling reason not to choose consecutive bits. Different choices of bits yield different mappings. Then n -d + 1 cyclic block mappings (CBMs) are shown in Fig . . ik-d+l have been Chosen to specify the data-to-processor allocation. Esch processor is always assigned NIP data elements; i.e., this class of mappings ensures even data distribution. Esch processor tan always compute the butterflies involving the NIP local data elements independently, because these data have the same d processor address bits. Thus, no inter-processor communication is required to compute the butterflies involving the bits specified by the braces below.
In general, since any d bits tan be used to form the processor ID number, it is easier to recognize the generic communication Pattern if one concatenates the bits representing the ID into one group called "PID", and refers to the remaining II -dbits, which are concatenated to form the local array address, as "Local w'. For the class of CBM mappings, one tan use the following equivalent notation, where the leading d bits are always used to identify the processor ID number.
In either case, the d consecutive bits are marked by the Symbol "1" at both ends. The two notations are equivalent and both are used in the text.
Computing the butterflies involving the address bits used to define the processor ID number will involve exchange of data between processors whose ID numbers differ in exactly one bit. Of course these processors may or may not be physically adjacent, depending on the network topology. For example, if the p processors form a hypercube network, data communication between such a pair is always between neighbouring processors. If the p processors form a linear array or a two-dimensional grid, such a pair of processors tan be physically many hops apart, and the simultaneous data exchange between all pairs tan Cause traffit congestion in the network.
CBM mappings have received considerable study in the literature dealing with parallelizing FFTs [2] [3] [4] 7, 9, 14, 15, 18, 19] . These works vary in the choice of the blocksize, whether DIF or DIT transforms are used, and whether the input and/ or output is in unordered (reverse-binary) or in natura1 Order, and so on. The purpose of this Paper is to bring all of these treatments into a common framework.
Parallel FF'Ts
Recall from Section 3 that one may apply the algorithm to the naturally ordered input and obtain the output in bit-reversed Order, or one may permute the input into bit-reversed Order in advance and obtain the X values in natura1 Order, or one may apply permutations during the computation so that naturally ordered output is obtained from naturally ordered input. A fourth Option that was not considered is to use naturally ordered input, and then bit-reverse the output after the computation is complete in Order to obtain naturally ordered output. The task of bit-reversing a vector is an interesting Problem in its own right, with many variations and intricacies [lO] . However, the Problem is outside the scope of this Paper. The second and last choices are simply special cases of the third, where all the permutations are applied either in advance of the computation or after all computation has been completed. Moreover, only one of a large number of Permutation variations was considered. Indeed, given naturally ordered input, any binary Permutation of X tan be obtained, and in numerous different ways. For example, suppose after the computation one requires that u[i2i3iOi4il]
contain Xi,,i,i2igi4. Two of several possible schemes are shown below, using the notation developed earlier.
Of course it should also be evident that assuming that the input is in natura1 Order is simply a matter of expositional convenience. Any binary permutation of the input tan be handled using the same shorthand notation. As noted earlier, the term "ordered" FFT in the literature refers to an algorithm which transforms naturally ordered x to naturally ordered X; the algorithms resulting from any other permutation of either input elements or output elements are all referred to as "unordered" FFTs. Therefore, the FFT with bit-reversed input or bit-reversed output is also an unordered FFT.
Parallel FFTs without inter-processor permutations
In general, in the parallel context, if permutations are allowed, it may turn out that part of a processor's complement of data may migrate to another machine. This subsection deals with the case where no permutations are performed. Using the notation and the example of the previous section, computations involving the four mappings are depicted below. The notation +==+ indicates that N/P data elements must be exchanged between processors in advance of the butterlly computation. The Symbol ik identifies two things: first, it indicates that the incoming data from another l%ocessor are the elements whose subscripts differ from a processor's own data in bit ik; second, it indicates that all pairs of processors whose binar-y ID number differ in bit ik send each other a copy of their own data.
Since the input sequence is in natura1 Order and the output is in bit-reversed Order, the processor initially allocated xidgiZi,i,, will finally have Xi0;,iZiji4 if interprocessor Permutation is not allowed. For example, if the initial mapping is li& li2ili0, processor Po initially contains xo,xi ,x2, . . . , x7 in their natura1 Order as depicted in Fig. 3 ; when the parallel FFT ends, processor PO contains Xo,X,s,Xs,X24,X4,X20,~~2,~2s, which are the first eight elements in the output array in Fig. 2 . In this case, x and X are Said to be comparably mapped to the processors [ll], p.160. Note that the initial mapping is a CBM mapping of naturally ordered x, but the final mapping is not a CBM mapping of naturally ordered X.
Butterfly computations will Cause communication between processors if the two input elements are stored in different processors. Since both input elements are needed to update each of them, the two processors involved must exchange the N/P data elements for each other to update their local data. As shown above, butterflies require data to be exchanged in exactly d = log, P Stages, regardless of the blocksize used in the mapping. (In the example, d= 2.) Algorithms of this type are described in Refs. [2, 4, 9] . This is also Version 1 of the distributed-memory FFTs in Ref.
[ 111, pp. 156162.
A consequence of this scheme is that one half of the processors update their local data according to a formula not involving the twiddle factor; i.e., they each update N/P elements according to yr = (xl + x~+~I~). The other half of the processors update their local data according to a formula involving the multiplication of a pre-computed twiddle factor; i.e., they each update N/P elements according to ZI = (x[ -x,+N,~)c+. 1 Thus, the arithmetic workload is not evenly divided among all processors unless each processor computes both yi and zI.
To balance the arithmetic workload, Walton [19] proposed a variant of this type of communication which allows each pair of processors to exchange i (N/P) elements twice: once before the butterfly computation, and once after the buttertly computation. The second communication is needed for each processor to get back the 5 (N/P) e ements updated by the other processor. The same 1 idea is employed in version 2 of the two-processor distributed-memory FFT in Ref. [l 13 , p. 158. Compared to the scheme above, the amount of data exchanged between each pair of processors remains the Same, but the number of message exchanges is doubled and the notation x=+ now denotes two exchanges of i (N/P) elements between all pairs of processors. 3 Unfortunately, the gain from balancing the arithmetic workload may be offset by doubling the number of exchanges if the startup tost for each message exchange is high. The arithmetic workload cay1 be balanced, and the amount of data exchanged tan be halved without doubling the number of exchanges if inter-processor permutations are allowed, as disscussed in the next section.
Parallel FFTs with inter-processor permutations
As suggested in Refs. [3, 7, 15, 18] , if processors are not required to keep and update only the same data throughout the entire computation, then every other concurrent message exchange tan be eliminated from Walton's scheme [19] . To introduce this concept, consider the butterfly computations depicted by
The notation t+ denotes one concurrent message exchange of i (N/P) data elements between all pairs of processors, which is shown to occur in the butterfly stages involving bits which form the processor ID number. After data are distributed to individual processors according to the initial mapping li4i3li2i,io, the element x,i,i,i,i, in a[i&ilio] tan be found in A[i$lio] in processor Q,. When bit i4 in the PID and bit i2 in the local M switch their positions, the mapping is changed to (i2i3ji4ili0, which means that the data in a[i4i3i2ilio] tan now be found in A [ibi, io] in Pi,i, . To identify the one half of data each processor must send out, the Symbol A is used to label two different bits: the bit ik, which has just been permuted from PID into Local M, and the bit il, whichhas just was in PID before the switch, ik = 1 in one processor, and ik = 0 in the other processor. On the other hand, because i, was in Local M before the switch, il = 0 for half of the data, and il = 1 for another half of the data. Consequently, the value of ik, the PID bit, is equal to il, the local M bit, for half of the data elements in each processor, and the notation which represents the switch of these two bits identifies both the PID of the other processor as well as the data to be sent out or received. To depict exactly what happens, the data exchange between two processors and the butterfly computation represented by r li& li,&il io A A (discussed above) is shown in Fig. 4 . While such a detailed schematic diagram tan be drawn to trace the entire parallel FFT process Stage by Stage, there is no need to do so because all information tan be deciphered from the simple notation introduced above. Note that in the example above, element x,ijizi,i,, is initially contained in A[i2iIiO] in processor PiJi,, and it is finally located in processor Pisj4 at the same relative location. Note also that the number of concurrent exchanges 4 is d + 1. A second example, using the "IPIDI Local M" notation with a different mapping, suggests the general Situation.
Note that in this example, the Order of the data within each processor again has been preserved, but the data in the aggregate has migrated from processor fi,i, to processor fi,i,, and the number of concurrent exchanges remains d + 1.
It is straightforward to develop a communication scheme so that this is true for all CBM mappings: the data within each processor remains in the same relative Order, but its processor location is changed from a processor with ID = Mk-1 . . . ik_d+l to another processor with ID = ik_d+likik-1 . . . ik-d+2; i.e., the d bits are cyclic-shifted to the right by one position. In all cases, the number of concurrent exchanges remains d + 1. The comrnunication schemes described in Ref. [3, 7, 15, 18] may be viewed as variations of this basic scheme, and tan all be treated within the same framework. Note also that in the first example, the PID bit in question is always exchanged with the leftmost bit in the Local M, which is referred to as the "pivot" in Refs. [15, 18] . However, in the second example, the PID bit in question is always exchanged with the rightmost bit in the Local M. Clearly, the so-called "Pivot" could be arbitrarily Chosen, if one so desires, from the bits of the Local M. Since the ID number is formed by consecutive bits, whenever a PID bit is permuted into the local Pivot Position, it will be exchanged with the next PID bit and occupy the latter's Position back in the PID field. After d exchanges, we have the following scenario: the rightmost PID bit is in the Local M, and the Pivot i, or rp from Local M is still in the leftmost Position in PID. Therefore, one more permutation involving these two bits will get i, or rp back into its original Position in the Local M, and the rightmost PID bit would be cyclicshifted into the leftmost Position in the PID as shown below. Observe that as long as the PID is not formed by the leftmost d bits, there would be at least one zp bit available when the buttertly computation reaches any PID bit. One thus has the Option of using a rp (instead of ip) bit as the pivot. In this case, the rp bit may stay as the leftmost bit in the PID if the local data is not required to remain together in one processor, and one concurrent message exchange tan be saved, with the final mapping determined by l~p~k.. . The PID in the mapping above is no longer formed by consecutive bits. In all cases, the final "IPIDJ Local M" is a Permutation of the n bits, which uniquely determines the data mapping of X&i,;2...i,_l. In certain contexts, it is important to have the output array X mapped to the processors in a specific way to facilitate subsequent computations. For example, the final distribution of X is required to be identical to the initial one for x in the Solution of partial differential equations using spectral techniques [3] . This has motivated the development of a number of algorithms which are now reviewed.
Parallel FFTs in the literature
In this section some parallel FFTs in the literature are reviewed. To facilitate the discussion, the same example (when it is possible) is used to provide a summary of some distributed-memory parallel FFT literature in Table 1 . The basic parallel FFT without inter-processor Permutation underlies the work in Refs. [2, 4, 9, 19, 20] , and the communication tost is either d concurrent exchanges of NIP elements or 2d concurrent exchanges of $N/P) elements. The final mapping is identical to the initial mapping in all of the algorithms in Refs. [2, 4, 9, 19] . However, because the output element &,,i,i2i3i4 overwrites xi4ijii;,i,, in its initial location in the same processor, the resulting mapping is not a CBM of X. These algorithms parallelize an unordered sequential FFT, and thus are referred to as unordered parallel FFTs in the literature, and they are so labelled in Table 1 . The re-coding of PID (by Gray Code) in Ref. [2] and the use of radix4 FFT in Ref. [4] do not affect the communication scheme, because all processors are identical and the data are accessed in a similar pattern by all radix-T FFTs. Furthermore, the choice of a different local FFT in Ref. [4] is independent of the choice of the communication algorithm.
In Ref.
[20], the PID part of the initial mapping is preserved in the final mapping as dictated by the communication scheme, but the bits which form the Local M are bit-reversed in the final mapping because each processor applies an where the PID preserves its initial arrangement: id-lid-2 . . . ilio. Since the Local M is formed by the rightmost n -d bits of Xs subscript, &,Xr . . . ,XNp are stored in natura1 Order in one processor, which is determined to be PO; and the next N/P elements, namely XN,~,XN~P+r, . . . , X2(Np)-I, are stored in natura1 Order in one processor, which is determined to be Pp/* by simply reversing the leftmost d bits of X's subscript; and so on. Since the mapping is equivalent to a CBM block mapping of the naturally ordered X, we use the term "block-equivalent" in the relevant table entries. To use this method, one simply distributes x before the computation using the cyclic CBM, and apply this algorithm to obtain naturally ordered X. Note that the input data must be distributed among the processors one way or the other, and the communication tost for data distribution is the same regardless of the initial mapping. Therefore, the cyclic initial mapping used by this method does not Cause extra communication, and this is one way to obtain the block-equivalent mapping of naturally ordered X. It will be shown later that the communication tost tan be halved in an improved algorithm which achieves equivalent results. The FFTs in Refs. [3, 7, 15, 18] all contain inter-processor-permutations which may be viewed as switching a PID bit with an address bit, the Pivot. In Ref. [7] the PID is formed using the rightmost d bits as shown in the two initial mappings given in Table 1 . To ensure that the stride for data in buttertly operations is always one (for good locality), the current rightmost bit in the Local M is used as the Pivot for either a local or an inter-processor permutation, which precedes each butterfly Operation, as depicted by the following example.
lili0/i& jili0/22zj ; r lili0@

I~i0ln~
Im IG a a +_ +4
Here the sequence begins with lilioli&, whicf) is the result of a cyclic mapping of naturally ordered x; the following lilioli2i3 is intended to imply that i3 has been permuted into the Pivot location befoFeAthe butterfly computation begins. Since the initial mapping (ilioli& may be obtained by permuting the local data resulting from a cyclic mapping, it is referred to as a cyclic-variant in Table 1 . Given the initial mapping and the fixed-Position of the Pivot, which is a rp (instead of ip) bit, the communication tost and the final mapping tan be easily predicted for given N and P as explained in the previous section. However, the resulting mapping is neither a CBM nor its equivalent for X. The work in Ref. [3] deals with all possible initial CBM mappings for given N and P on a hypercube. The initial CBM for x and the final CBM for X are required to be identical. Under the condition, the authors show that in addition to d concurrent exchanges between all pairs of processors with IDs different in one bit, their generalized subroutine could, in the worst case, require each processor to send data to all the other processors. This requirement may Cause severe data contention depending on the network topology. A new general algorithm is proposed in Section 6 which requires only 1.5d more concurrent exchanges in the worst case. Thus the algorithm proposed in Section 6 appears to deal with the question in Ref. [3] conceming the communications requirement for solving the data rearrangement Problems arising in FFT or other similiar algorithms.
The work in Refs. [15, 18] concems two specific CBM mappings; in each case, the initial mapping for input X,i,izi,iO is required to be maintained for output Xi,,i,i2iXi4. These two parallel algorithms are depicted in Table 2 using given N and P values. For the natural-order parallel FFT in Ref. [ 151, the communication tost of 2d + 1 = 5 exchanges for N = 2" = 32 and P = 2d = 4 are marked by five occurrences of t+ in the first column. The algorithm differs depending on whether d = n/2, d < n/2, d = n -1, or n/2 < d < n -1. Consequently, the communication costs range from 1.5d + 2 to 2d + 1 concurrent exchanges as indicated by the following theorem.
Theorem [15] . Note that a hypercube of dimension d has P = 2d processors, and that the condition n/2 < d is equivalent to N/P < P, and (n + 1)/2 < d is equivalent to N/P < P/2, so they represent fine-grain cases. Consequently, for more common medium-grain and large-grain cases, 2d + 1 parallel exchanges will be required .
The cyclic-Order parallel FFT in Ref. [18] refers to an FFT with naturally ordered input x and output X both CBM mapped to the processors using one element per block. That is, the initial cyclic mapping of x is maintained for X. The case N/P < P was considered in Ref. [18] for the massively parallel Connection Machine (CM). The communication tost was derived in proving the following lemma.
Lemma [18] . Since the condition n/2 -C d is equivalent to N/P < P, this lemma again applies to the fine-grain cases, and the number of parallel message exchanges is 
2d -n/2 > d (if n is even) or 2d-(n -1)/2 > d (if n is odd). In the finest-grain case, N/P = 1 if n is even: i.e., n = d, and the number of parallel transmissions is given by 2d -n/2 = 1Sd. If n id odd, N/P = 2 or n -1 = d for the finestgrain case, and the number of parallel message exchanges is given by 2d -(n -1)/2 = 1.5d. In the example cited in Table 2 , the condition n/2 < d is satisfied, and there are 2d -n/2 = 6 concurrent exchanges as identified by six occurrences of t+ in column 2. In both algorithms, the Pivot is always the current leftmost bit in the Local A4, but reordering of the address bits in Local M is also performed by exchanging the Pivot with another bit in the Local M. This implies that other bits from Local M tan effectively serve as the Pivot bit, although they must first be permuted into the fixed Pivot location. In Refs. [15, 18] , the permutation of any other bit with the fixed Pivot bit is defined as a Single i-cycle. The algorithms were developed by first decomposing the required permutation (or final mapping) into disjoint cycles, and each disjoint cycle tan then be implemented by a sequence of i-cycles. Some of the i-cycles are followed by butterfly computations, and other i-cycles are used only for the purpose of rearranging local data or permuting data between processors. The pseudocode (similar to CM Fortran) FFT algorithm given in Ref. [18] uses an i-cycle subroutine assuming that N/P = 2, where F could represent the number of virtual processors when N/P > 2. In the latter case, 13 > P, and the tost of 1.52 concurrent exchanges, where d = log, j, includes the communication between virtual processors. In the next section alternative ways are proposed having the Same or lower communication tost and without the restriction to fixed-Pivot i-cycles. In addition, the "NIP" ratio is not restricted to a specific value.
New parallel FFTs and a generalization
In this section three new parallel FFT algorithms are proposed, including a generalized algorithm and communication complexity results for other CBM mappings which are believed to be new. For easy comparison with the algorithms reviewed in the last section, the first two algorithms are depicted in Table 3 and a general algorithm is depicted in Table 4 below. Since the result of bit-reversing two bits is the Same as the result of cyclic-shifting them one position to the right, examples with 8 or 32 processors are now used to demonstrate the general cases better.
Algorithm I
The elaboration of algorithm 1 is as follows. Consider a general case, in which the initial cyclic mapping of naturally ordered xi._,,,,i,iO is transformed to the final mapping of X$,i,,,.i,_, given by dPID bitsl a1
where PID = i&_lid_2 . . . izii; i.e., all bits in the initial PID are cyclic-shifted to the right by one Position. Since the rightmost n -d bits of X's subscript form (cyclic-equivalent if $ > P).
2P
the local A4, the final mapping is equivalent to a block CBM of naturally ordered X. To obtain the final local M which bit-reverses the initial local IV, all that is required is to apply an ordered FFT to the local data. As explained in Section 5, the final PID is obtained from per-muting each PID bit into the Pivot Position and back into the PID field in sequence. The combination of a specific initial mapping, inter-processor permutations, and an ordered local FFT produces an ordered parallel FFT. Compared to other parallel FFTs reviewed in Section 5, the communication tost is halved because only d + 1 concurrent exchanges of 4 (N/P) elements are needed. Of course, the same algorithm tan also transform block mapped x to cyclic-equivalent mapped X, and vice versa.
Algorithm 11
Observe first that the initial and final PIDs use the rightmost d bits from the subscripts of x and X, respectively. Because Xs subscript bit-reverses x's subscript, the final PID is formed by bit-reversing the leftmost d bits from x's subscript. The algorithm is designed to perform the first d stages of local computation without reordering the local data, followed by n -2d more stages of local computation using an ordered sequential FFT. The initial mapping is thus changed to
The objective is to apply inter-processor permutations in the remaining d stages so that the final mapping becomes Note that the final PID bit-reverses the rightmost d bits of Xs subscript. In order to describe how Algorithm 11 is generalized to include this case, the condition n -d < d is reflected in the initial mapping given by IInitial rightmost d bits 1 n. (Note that the bits braced from z,,_i to r,_d are reversed, and the bits braced from tk-1 to r, are reversed. since the bit-reversing operations tan be incorporated into the computation, the internal reordering does not necessarily increase processing time in an efficient implementation.) The final n -d interprocessor permutations in Algorithm 11 switches the remaining n -d bits in PID into the local A4, resulting in the final mapping This is a cyclic-equivalent mapping for Iy10i,i2,,,i,_, because the Local M is formed by the leftmost R -d bits of X. Note that the rightmost d bits of Xs subscript tan be obtained by reversing the PID bits, and vice versa.
A general algorithm and communication complexity results
This algorithm allows the choice of any initial CBM for naturally ordered x and any final CBM for naturally ordered X. For N = 128 and P = 32, the four possible initial CBMs of x are shown in the first four entries in column 1 in Table 4 ; in each case, the initial CBM is required to be maintained for X as shown in column 2. In column 3, both actual and the maximum (worse-case) communication complexities are presented for N= 2" and P = 2d, assuming only that NIP 2 2. That is, the specific relationships between n and d in different cases are not exploited in the algorithm.
The algorithm uses the ideas developed in Section 5. Recall that by using d + 1 inter-processor permutations, a parallel FFT tan be completed with the initial PID bits cyclic-shifted-to-the-right by one Position. Apparently more inter-processor permutations may be added afterwards to switch the desired rp bits into the PID and arrange them in final Order, and that tan be followed by local reordering so that the bits in the Local M are also arranged as desired. To determine the maximum communication tost, consider the two possibilities.
Case (i). If the initial PID does not overlap the final PID, then exactly d more inter-processor permutations are needed to switch in the final PID bits. In total, 2d + 1 concurrent exchanges are needed.
Case (ii). If the initial PID overlaps the final PID, then one may first switch in the m zp bits currently in Local M so that these m PID bits are put in their final positions. Repeat this process until no desired PID bits exist in the local M. Now the only remaining task is to reorder the v = d -k potentially out-of-Order bits in the PID. Because the final PID will no longer involve bits in the Local M, one need only consider a fixed (arbitrary) Pivot. How many inter-processor permutations are needed to reorder v PID bits via a Single Pivot? Observe that in the worst case, the non-PID Pivot bit is permuted back in Local M after two PID bits happen to switch their positions; i.e., three permutations could be required for placing every two bits. Therefore, at most 1.5~ inter-processor permutations are required to reorder the v PID bits. Since v tan be as large as d, in total 2.5d + 1 concurrent exchanges are needed.
Although the two cases above were considered separately to help explain the algorithm, note that case (ii) reduces to case (i) if m = d. The actual communication tost for given N and P tan also be predicted for the entire class of CBM mappings as shown in Table 4 . Since the reordering Phase requires only 2d<2.5d + 1 2P
1.5d more concurrent exchanges in the worst case, this algorithm appears to have dealt with the question in Ref. [3] concerning the communication requirement for solving the data rearrangement Problems arising in FFT or other similar algorithms.
In addition, since the reordering Phase is independent of the computation Phase, the algorithm and the communication complexity results are applicable even if the initial and final CBM maps are different. The last three entries in Table 4 are such examples. Thus, the proposed algorithm deals with cases that do not appear to be handled by the algorithm proposed in Ref. [3] .
Conclusion
In this Paper the various ideas which have appeared in the FFT literature were presented in a common framework. To this end, the concept of index-digit Permutation and the associated binary address based notation were borrowed, modified, and applied in a uniform manner to explain and further explore the various algorithm design and implementation issues related to both sequential and parallel FFTs. Although the study is limited to variants of a radix-2 DIF FFT algorithm, the notations and results may be generalized to cover the entire family of radix-2" DIF and DIT FFTs, including the mixed-radix and the splitradix variants. These possible extensions as well as the extension to two-dimen-sional (2-D) FFTs are omitted for lack of space and will be presented elsewhere. In addition, three new parallel FFT algorithms along with communication results were presented. The proposed algorithms show alternative ways of designing parallel FFT algorithms which feature reduced communication tost and further flexibility in the choices of initial and final data mappings.
