FFT algorithms and their adaptation to parallel processing  by Chu, Eleanor & George, Alan
LINEAR ALGEBRA 
AND ITS 
APPLICATIONS 
EISEVIER Linear Algebra and its Applications 284 (1998) 95-124 
FFT algorithms and their adaptation to 
parallel processing 
Eleanor Chu a3*, Alan George b31 
a Department of Computing and Information Science and Department of Mathematics and Statistics, 
University of Guelph, Guelph. Ont., Canada NIG 2 WI 
b Department of Computer Science, University of Waterloo, Waterloo, Ont., Canada N2L 3GI 
Received 22 September 1997; accepted 5 May 1998 
Submitted by D.P. O’Leary 
Abstract 
The development of the fast Fourier transform (FFT) and its numerous variants in 
the past 30 years has led to very efficient software and hardware implementations of 
the transform on uniprocessor Computers. In recent years, many researchers have recog- 
nized the practical importante of minimizing computing time by parallelizing sequential 
FFT algorithms in various ways for today’s high-performance multiprocessor comput- 
ers. This Paper presents many FFT variants already proposed by others in a common 
framework, which illuminates the progress made in parallelizing them to this date. In 
addition, three new parallel FFT algorithms along with communication complexity re- 
sults are presented. The proposed algorithms show alternative ways of designing parallel 
FFT algorithms which feature reduced communication tost and further flexibility in the 
choices of data mappings. 0 1998 Eisevier Science Inc. All rights reserved. 
1. Introduction 
The discrete Fourier transform (DFT) has played an important role in con- 
trol, Signals and image processing over a long period of time, and the practical 
importante of fast Fourier transform algorithms (FFTs) has also been well rec- 
ognized. In the past 30 years, the development of the FFT and its numerous var- 
??Corresponding author. E-mail: echu@msnet.nw.uoguelph.ca. 
’ E-mail: jageorge@sparse.uwaterloo.ca. 
0024-3795/98/$19.00 0 1998 Elsevier Science Inc. All rights reserved. 
PII:SOO24-3795(98)10086-1 
96 E. Chu, A. George I Linear Algebra and its Applications 284 (1998) 95-124 
iants has led to very efficient Software and hardware implernentations of the 
transform on uniprocessor Computers. Since multiprocessor Computers became 
commercially available a decade ago, the parallelization of sequential FFTs on 
high-performance multiprocessor Computers has received the attention of nu- 
merous researchers. The variations of sequential FFTs, together with different 
parallelizing techniques, different multiprocessor architectures, and different 
network topologies, has led to a large number of parallel FFT variants, each 
with its own “twist” on things. A Survey of the FFT literature suggests that a 
useful contribution would be to put the various ideas in a common framework. 
To this end, the concept of index-digit permutation [5,6] and the associated bi- 
nary address based notation [6,15,18] have been borrowed, modified, and ap- 
plied in a uniform manner to explain the various ideas which have appeared 
in the FFT literature. Throughout this Paper, examples from the radix-2 FFTs 
are used to introduce and demonstrate how this notation tan bring all FFT var- 
iants, sequentiul or parallel, into a common framework which facilitates the un- 
derstanding, implementation, and design of the various FFT algorithms. The 
notation extends to the entire class of radix-2” FFTs, including mixed-radix 
and Split-radix FFTs, in a straightforward way, but is omitted to save space. 
Three new distributed-memory parallel FFTs along with communication com- 
plexity results are also presented. The proposed algorithms show alternative 
ways of designing parallel FFT algorithms which feature reduced communica- 
tion tost and further flexibility in the choices of data mappings. 
1.1. The discrete Fourier transform (DFT) 
The DFT refers to the transformation of N discrete samples xo,xl, . . . ,x,+~ 
according to the formula 
N-l 
x,=~x,c& r=0,1,..., N-l, 
I=O 
(1) 
where the x,‘s are sampled at equally spaced intervals in one period from a 
complex time series, the Xr’s are complex numbers which represent a spectrum 
of the time series resulting from the transform, and ON is the Nth primitive root 
of unity; i.e., ON = e-2Zj’N with j E fl. Although Eq. (1) defines X, for all r, 
one only needs to consider values for r = 0 to r = N - 1, since o.$’ = 1 for 
h 3 1, which implies that X,,, = X,. When N=4, the DFT of XO, XI, x2 
and x3 is the product of the matrix Q and the vector x in the equation below. 
E. Chu, A. George l Linear Algebra and its Applications 284 (1998) 95-124 97 
A straightforward implementation of the DFT defined by Eq. (1) requires 
# complex multiplications and (N - 1)2 complex additions in addition to 
the tost of generating the matrix R. The efficient computation of the N x N 
matrix R is an important Problem in its own right, but beyond the scope of this 
Paper; we shall assume that all w; values, which are commonly referred to as 
twiddle factors, are pre-computed and available to the Computer programs im- 
plementing the various transform algorithms. One could argue that this is a 
reasonable assumption to make, since FFT Codes are usually applied sequen- 
tially to large numbers of vectors, and thus pre-computation of the twiddle fac- 
tors is an efficient strategy. The argument appears to be valid for Single 
processor machines, and for shared-memory multiprocessors, since access to 
the twiddle factors is straightforward. As pointed out by Chamberlain [2], this 
is also an efficient strategy for distributed-memory machine if multiple trans- 
forms are performed by each processor all at once. 
However, for transforming a Single large vector on distributed-memory ma- 
chines, the best strategy is not at all obvious. If every processor has a copy of 
all the twiddle factors, then they may consume more store than the data being 
transformed - a somewhat incongruous circumstance. If this represents an un- 
acceptable amount of storage, then the FFT algorithm must arrange that the 
(pre-computed) twiddle factors are conveyed among the processors so that they 
are available when needed. A final Option is to compute the twiddle factors “On 
the fly” when they are needed, which relieves the storage and communication 
burdens at the expense of additional computation. 
The choice of strategy depends on a number of factors: the relative Speeds of 
communication and computation, the amount of memory available compared 
to the size of the Problem being solved, and the algorithm itself. Although spe- 
cific strategies were considered and compared under special circumstances in 
Refs. [2,18], their results do not seem to generalize. Due to the large number 
of algorithms surveyed in this Paper, consideration of the twiddle factor distri- 
bution for all of them would exceed the space permitted; however, the issue is 
important, and will be dealt with elsewhere. 
1.2. FFT variants and the divide-and-tonquer paradigm 
The divide-and-tonquer paradigm is fundamental to the FFT algorithm. 
The three major Steps of the divide-and-tonquer paradigm are 
Step 1. Divide the Problem into two or more subproblems of smaller size. 
Step 2. Solve each subproblem recursively by the same algorithm. Apply the 
boundary condition to terminate the recursion when the sizes of the subprob- 
lems are small enough. 
Step 3. Obtain the Solution for the original Problem by combining the solu- 
tions to the subproblems. 
98 E. Chu. A. George l Linear Algebra and its Applications 284 (1998) 95-124 
The radix-2 FFT is traditionally introduced as a recursive algorithm ob- 
tained from dividing the given Problem (and each subproblem) into two sub- 
Problems of half the size. Within this framework, there are two commonly 
used FFT variants which differ in the way the two half-size subproblems are 
defined. They are referred to as the DIF (decimation in fiequency) FFT and 
the DIT (decimation in time) FFT. To conserve space, only the DIF Version 
is considered here. However, the notation and the approach in this Paper apply 
equally well to the DIT case. 
The remainder of this article consists of the following sections. The deri- 
vation of the radix-2 DIF FFT algorithm is presented in Section 2. A binary 
address based notation to assist in describing algorithms is introduced in 
Section 3. Various data mapping strategies for parallelizing FFTs on multi- 
processor Computers are discussed in Section 4. Specific parallel FFTs in the 
literature are explained and presented in a unified manner in Section 5. 
Three new distributed-memory parallel FFTs along with communication 
complexity results are presented in Section 6. Section 7 contains concluding 
remarks. 
2. The radix-2 decimation-in-frequency (DIF) FFT 
As its name implies, the radix-2 DIF FFT algorithm is obtained by decimat- 
ing the output frequency series into an even-indexed set {J&lk = 0, . . . , 
N/2 - l}, and an odd-indexed set {XZk+rlk = 0,. . . ,N/2 - 1). To define the 
two half-size subproblems, Eq. (1) is rewritten as 
N/2-1 N-l 
_& = c x,co; + c n,w; 
1=0 I=N/2 
N/2-1 N/2-1 
= 
c x,o; + c 
r(l+N/2) 
xI,N/2@N 
I=O I=O 
N/2-1 
ZZ 
C( xl+x,+N,24?2)(3& 
/=o 
r=O,l,...,N- 1 (3) 
The following identities involving the twiddle factors are used in what follows. 
(mN)N’2 = - 1, ~N,2=41, N wN = 1 . (4) 
For r even, using Eq. (4) in Eq. (3) yields 
E. Chu. A. George / Linear Algebra and its Applications 284 (1998) 95-124 99 
N/2-1 
x2k = c (XI + XI+N,2434 
I=O 
N/2- 1 
ZZ c bI+XI+N,2)4,2> k=O,l,..., N/2- 1. 
I=O 
Defining Yk = &k and yi = XI + XI+N/2 yields the half-size subproblem 
N/2-1 
yk = -5’74,27 k=O,l,.*., N/2- 1. 
I=O 
Similarly, for r odd, using (4) in (3) yields 
N/2-1 
x 2k+l = 
U 
XI 
(2k+l)N/2 
+ Xi+N/ZmN 
I=O 
> 
(2k+l)l 
WN 
N/2-1 
= c ((n, - Xj+N,2)4,)4,2, k = 0, 1,. . . ,N/2 - 1 
I=O 
(5) 
(6) 
Defining zk = X,+i and zI = (x, - X[+N/2)0.$ yields the second half-size prob- 
lem 
N/2-I 
zk= &DN/~, k=O,l,..., N/2-1. 
I=O 
Note that because X, = & in (6) and X 2k+i = & in (8) no more computa- 
tion is needed to obtain the Solution for the original Problems after the two 
subproblems are solved. Therefore, in the implementation of the DIF FFT, 
the bulk of the work is done during the subdivision Step; i.e., the Set-up of ap- 
propriate subproblems, and there is no combination Step. Consequently, the 
computation of yr = xI + xIY~~t and zI = (x[ - Xl+N,2)Wb completes the first 
(subdivision) Step. : :! 
The computation of y/ and z/ in the subdivision Step as defined above is re- 
ferred to as the Gentleman-Sande butterfly in the literature, and is depicted by 
the annotated butterfly Symbol in Fig. 1. 
Let T(N) be the arithmetic tost of computing the radix-2 DIF FFT of size N, 
and, for some constant c, let cN be the tost of computing y,‘s and z,‘s in the 
Fig. 1. The Gentleman-Sande butter@ 
100 E. Chu. A. George l Linear Algebra and its Applications 284 (1998) 9S-124 
subdivision step. Then T(N) is the sum of the subdivision tost and the tost of 
computing (6) and (8), which implies 
T(N) = 
0 ifN= 1, 
2T(N/2) + cN ifN=2” > 1. 
Assuming that N= 2” for the radix-2 FFT, the Solution to the equation above 
is 
T(N) = cN log,N. (9) 
3. A binary address based notation for butterfly operations 
In the abstract, the equations introduced in the previous sections completely 
describe the computations. However, in practise, the computations are normal- 
ly performed in pluce in a one-dimensional array, with new values overwriting 
old values as implied by the butterfly in Fig. 1. Throughout this Paper, it is as- 
sumed that the input data XI, 0 6 1 < N are stored in the one-dimensional array 
a in natura1 Order. 2 Since these computations are performed repeatedly on sub- 
Problems of various sizes, determining the location of X, in the array at the end 
of the computation is not immediately obvious. The butterfly notations do not 
explicitly relate the locations of the input data to the locations of the output 
data; for example, the Gentleman-Sande butterfly represents only the first sub- 
division Step of the recursive DIF FFT algorithm. Thus, there is a gap between 
the elegant (but somewhat implicit) butterfly notations and the detailed speci- 
fication of the positions of the Outputs X, contained in the repeatedly modified 
array. The purpose of this section is to describe a simple notation to assist in 
this specification. It will also facilitate the adaptation of these FFT algorithms 
to parallel processing, which is the focus of this Paper. The notation is not en- 
tirely new, but rather a modest embellishment and extension of the ideas that 
have been used in the past [6,15,18]. The radix-2 DIF FFT introduced in Sec- 
tion 2 is used as a vehicle for introducing the notation. 
3.1. Iterative form of the radix-2 DIF FFT 
Recall that the Gentleman-Sande butterfly represents the first subdivision 
Step. Because there is no combination Step, it is straightforward to derive an 
iterative algorithm by simply subdividing each resulting subproblem iteratively 
2 The input data xo,xl, ,x,+l are said to be in “natura1 Order” if XI and XI+, are stored in 
consecutive locations in o for all 0 < l< N-2. Similarly, the output data Xo,Xl , , xN_ 1 are said to 
be in natura1 Order if X, and X,+l are stored in consecutive locations in o for all 0 < r Q N-2. 
E. Chu, A. George / Linear Algebra and its Applications 284 (1998) 95-124 101 
until the Problem size is one, as shown in the pseudo-code algorithm below. It 
is assumed that the twiddle factors are pre-computed and stored in an array W, 
with w[i] = w;, 0 < i < N/2. 
Algorithm 3.1. The iterative DIF FFT algorithm in pseudo-code 
begin 
NumOfSubProblems : = 1 Initially: One Problem of size N 
SubProblemSize : = N 
while SubProblemSize > 1 do Subdivide each subproblem 
HalfSize : = SubProblemSizel2 
for K : =0 to NumOfSubProblems - 1 do 
JFirst : = K x SubProblemSize 
JLast : = JFirst + HalfSize - 1 
Jtwiddle : = 0 
for J : = JFirst to JLast do 
W : = w[Jtwiddle]; Temp : =a[Jj 
a[Jj : = Temp + a[J + HalfSize] 
a[J + HalfSize] : = W x (Temp - a[J + HalfSize]) 
Jtwiddle : = Jtwiddle + NumOfSubProblems 
end for 
end for 
NumOfSubProblems : = 2 x NumOfSubProblems 
SubProblemSize : = HaljYSize 
end while 
end 
Since the initial SubproblemSize = N = 2”, the variable HalfSize takes on the 
values 2”-’ 2nV2 > >“‘, 2,l; the distance between ado a[Jj and a[J + HalfSize] is 
always a power of 2. Thus, denoted as binary numbers i,_, ‘. . ilio, J and 
J + HalfSize differ only in bit ik when HaljSize = 2k and the algorithm tan 
be expressed in terms of binary addresses as shown below. 
Algorithm 3.2. The iterative DIF FFT algorithm in terms of binary addresses 
begin 
k:=n-1 Initial Problem size N = 2” 
while k > 0 do Subdivide each subproblem 
Apply Gentleman-Sande butter-y computation 
to all pairs of array elements whose 
binary addresses d@er in bit ik 
k:=k- 1 
end while 
end 
102 E. Chu, A. George / Linear Algebra and its Applications 284 (1998) 95-124 
For N = 32 = 25, the DIF FFT algorithm as described above consists of five 
stages involving the sequence k = 4, 3, 2, 1, 0, with butterflies being applied to 
all pairs of array elements whose binary addresses differ in bit ik. Since the al- 
gorithm tan be expressed in terms of binary addresses, it is convenient to adopt 
binary notation for the subscripts of a and X. Thus, in what follows the nota- 
tion xi,i,izi,iO will mean xr, where i&i2ilio is the binary representation of r. Sim- 
ilarly, a[i&i2ilio] refers to the element a[m], where the binar-y representation of 
m is i&i2ilio. Then a shorthand notation for describing this 5-Step process be- 
ing applied to a[m] is the sequence 
7 r v v v 
i&i2ilio 74 i&ilio z4zjizilio qr372ilio z4z3z2eli0, 
where ik indicates that the butterfly computation involving the pairs in u 
different in bit ik is being performed in the current Stage, and rk indicates that 
the corresponding butterfly operations were performed in a previous Stage. 
The transformation is completed when butterflies in all stages have been per- 
formed. 
Which element of X will be found in a[i4i3i2il io]? If Eqs. (6) and (8) are inter- 
preted as implied by the butterfly in Fig. 1, the top half of II will contain the 
even-numbered X’s after all stages are completed, and the bottom half will con- 
tain the odd-numbered As. Thus, if i4 = 0, then a[i&izil io] is in the top half of u 
and will ultimately contain an even-numbered X, and if i4 = 1, then a[i&i2ilio] 
is in the bottom half of 4 and will ultimately contain an odd-number\d X. This 
in turn means that after the butterfly Step identified by Operation i&izilio is 
completed, it is evident that a[i&i2iliO] will ultimately contain output X,., where 
the rightmost bit of the binary representation of r is id. By exactly the Same ar- 
gument, applied recursively to both the top and bottom halves of the array, we 
tan conclude that the second from the rightmost bit of r will be 6. Repeating 
the argument three more times on successively smaller portions of the array u 
yields the conclusion that a[i&i2iliO] will finally contain Xioi,i2iji4. In other 
words, XiOi,ftijid will OCCUPY the Position originally occupied by xilisi2i,i,,. That 
is, the output X is in bit-reversed Order. 
Fig. 2 displays the initial contents of a, the five stages of butterfly computa- 
tion together with the associated twiddle factors, and the resulting bit-reversed 
sequence. To help identify the pairs of subproblems resulting from every Stage 
of butterfly computation, the subproblems involving a[O], which initially con- 
tains the input element xo, are highlighted. Note that for an input sequence 
of N elements, there are exactly N/2 butterflies in each of the log ZN stages 
of butterfly computations. 
E. Chu, A. George / Linear Algebra and its Applications 284 (1998) 95-124 
Fig. 2. The DIF FFT with naturally ordered input and bit-reversed output. 
3.2. The eflect of permutations 
The algorithm described in the previous section is correct only because 
Xi,,i3izi,i0 s initially contained in a[i&i$lio]. However, the following program 
is correct, regardless of where xi4isi2i,i,, is found in a. 
104 E. Chu. A. George I Linear Algebra and its Applications 284 (1998) 95-124 
Algorithm 3.3. The iterative DIF FFT applied to the x elements 
begin 
k:=n- 1 Initial Problem size N= 2” 
while k B 0 do Subdivide each subproblem 
Apply Gentleman-Sande butterfly 
computation to all pairs of elements of 
x whose (binary) subscripts differ in bit ik 
k:=k- 1 
end while 
end 
Thus, the input data could be permuted arbitrarily, and as long as the but- 
terflies are applied correctly to the data, the correct answers would be obtained. 
Moreover, the element of Q that initially contained xi4i3i2ili,,, would contain 
X;Oi,itiSi4 at the end of the computation. 
For example, suppose the objective is to find an initial ordering of the input 
so that the resulting output is in natura1 Order. The Observation above implies 
that initially, u[ioili&i4] should contain x,i3i,i,i,,, since at the end of the compu- 
tation one wants u[i,$li$&] to contain XiOi1i2i3i4. That is, the input should be 
placed in bit-reversed Order before the computation begins. 
Again for the case N= 32, a shorthand notation describing the 5-step pro- 
cess, together with the initial permutation to bit-reversed Order, is the sequence 
Here the sequence begins with i&i$lio, which is intended to imply that 
xi4i3i2ili0 s assumed to be in a[i&izilio] as before; the notation i,$li&z is intend- 
ed to imply that xi4i3izili0 has been permuted to a[i,$lizi&] before the first butter- 
fly computation is performed. That is, i&i$lio always represents the binary 
representation of the subscripts of x; the Order in which the bits appear, or 
are permuted during the computation, refer to movements that Xi4i,i,i,i0 or its 
derivatives undergo in a during the computation. 
3.3. Arranging that input and output are both in natura1 Order 
When input x and output X are both in natura1 Order, the algorithm is re- 
ferred to as an “ordered” FFT in the literature. The key to understanding what 
is required is to view each butterfly computation as consisting of one permuta- 
tion step followed by one in-place computation Step. These permutation Steps 
recorder the initial input as well as the input to each subsequent subproblem, 
and the notation introduced above tan be used to describe this process in a nat- 
ural way. Again using the example above, with the input in the natura1 Order, 
E. Chu, A. George / Linear Algebra and its Applications 284 (1998) 95-124 105 
the first in-place butterfly is denoted by ci&ilio. If this butterfly Operation is 
preceded by per-muting the data in u[i&ilio] to a[i&ili&], it is natura1 to 
use i&iliozT to denote the combined effect. Since the permutation step cannot 
be done in-place, two arrays of size N are used to alternately store the data; this 
doubles the storage requirement. As usual, assuming that x is initially con- 
tained in u in the natura1 Order, a second array b would alternately contain 
the data. The entire computation process, along with the use of two arrays, 
is depicted below. 
V V V V V 
i4i&ilio i&ilio& i2il iOi3z4 iOili2z3z4 iOi1 z2z3z4 i0zlT27324 
a b a b ll P 
Since the Gentleman-Sande butterfly does not permute the input when the 
Problem size is 2, no Permutation Step precedes the last butterfly Operation, 
so there is no relocation of data from a to b in the last Step of the process. 
Another class of ordered FFTs perform in-place Permutation and conse- 
quently do not need a second array; they are the so-called “self-sorting in- 
place” algorithms. This class contains variants of the Prime-factor algorithms 
[1,12,16] and a radix-2 FFT [8]. Recently, this class has been further extended 
to include self-sorting in-place radix-3, radix4, and radix-5 FFTs [ 171. To keep 
our discussion brief, only the radix-2 algorithm is reviewed here. Using the no- 
tation developed earlier, the process of applying the self-sorting in-place radix- 
2 FFT to array u is depicted below for N= 32. 
v V V V V 
i4i3i2iliO iOi3i2ili4 iOil i2i3z4 iOil i2T3z4 iOi1 z27374 i0~I~2~3~4 
* <I a II ll n 
Observe that the permutation always involves bits in symmetric positions: e.g., 
in Step 1, the leftmost bit i4 switches with the rightmost bit i0 and in step 2, bit i3, 
the second bit from the left end, switches with bit i,, the second bit from the 
right end. Accordingly, the ordering of the bits is “reversed” after only two 
Steps, and the permutation tan be implemented using “pairwise” interchanges. 
The contents in a[Oi&il l] and a[li&ilO] are switched in Step 1 and the contents 
in a[ioOiz1~4] and a[ioli20r4] are switched in Step 2. Since each pairwise inter- 
Change tan be done using a Single temporary location, the array b is not needed. 
4. Mapping data to processors 
For purposes of this Paper, it is assumed that the multiprocessor available 
has P = 2d processors, and each processor has its own local memory. That is, 
the machine in question is a distributed-memory multiprocessor, where each 
processor is connected to the others via a communication network with a pre- 
scribed topology. A common topology is a hypercube, but others such as a reg- 
ular grid or a ring are also commonly used. 
106 E. Chu, A. George I Linear Algebra and its Applications 284 (1998) 95-124 
A key step in parallelizing the FFT on such multiprocessor Computers is the 
mapping of the elements of x to the processors. Since N = 2” and P = 2d, it is 
natura1 to select d of the n bits in the binar-y representation of the subscripts 
of x to create a unique d-bit binary processor ID number specifying the da- 
ta-to-processor mapping. While any set of d bits could be Chosen, there seems 
to be no compelling reason not to choose consecutive bits. Different choices of 
bits yield different mappings. Then n - d + 1 cyclic block mappings (CBMs) are 
shown in Fig. 3, where n = 5 and d = 2, and the array locations mapped to pro- 
i.&&irJ fl,;, %i, piiix Pi,& 
00000 
ilooo1 
00010 
00011 i 
T 
i Blocksiee =1 
Bloeksize =2 
Blocksize =4 4 
00100 
00101 
00110 
00111 
01000 
Blocksize =8 
Blocksize =1 
Blocksize =l 
01001 PI 
01010 Pl 
01011 Pl 
01100 Pl’ 
01101 ! 
Pl 
01110 4 
01111 Pl 
10000 Pz 
10001 Pz 
10010 I_i 
10011 1Pz 
10100 Pz 
10101 pz: 
10110 
10111 
11000 
: Pz 
pzl 
& 
11001 +g 
11010 8 
11011 Ps 
11100 
11101 
11110 
11111 f 
P3 
Ps 
Ps 
Ps 
Pl 
Pl 
p2 
p2 1 p2 IP2 6 p3 p3 i Bloeksize =2 i Pl p2 1 p2 p3 Blocksize =1 
f 
Blocksize =4 
tp2 
p2 1 PZ PS S 
p3 tl p3 
i 
Blocksize =2 
i 
i 
Blocksiee =2 
i 
Blocksize =l 
Pl 
; 
s 
9 1 
Bloeksize =l 
Blocksiee =1 
Fig. 3. The n - d + 1 cyclic block mappings for N = 2” = 32 and P z 2d = 4. 
E. Chu, A. George / Linear Algebra and its Applications 284 (1998) 95-124 107 
cessor PO are shaded to highlight the cyclic nature with various block sizes. 
When the block size is NIP or one, the mappings are respectively referred to 
as block or cyclic mappings in the literatme. 
In what follows the notation i,_l . . . ik+l lik . . . ik_dfl lik-d. io is used to de- 
note that bits ik . . . ik-d+l have been Chosen to specify the data-to-processor al- 
location. Esch processor is always assigned NIP data elements; i.e., this class of 
mappings ensures even data distribution. Esch processor tan always compute 
the butterflies involving the NIP local data elements independently, because 
these data have the same d processor address bits. Thus, no inter-processor 
communication is required to compute the butterflies involving the bits speci- 
fied by the braces below. 
/ \ / * Y 
in-1 . . . ik+2ik+l lik . . . ik-d+l 1 ik-d . . . iiio. 
In general, since any d bits tan be used to form the processor ID number, it is 
easier to recognize the generic communication Pattern if one concatenates the 
bits representing the ID into one group called “PID”, and refers to the remaining 
II - dbits, which are concatenated to form the local array address, as “Local w’. 
For the class of CBM mappings, one tan use the following equivalent notation, 
where the leading d bits are always used to identify the processor ID number. 
JPIDJ LOCd li’f = lik . . ik_d+l(~~ 
In either case, the d consecutive bits are marked by the Symbol “1” at both ends. 
The two notations are equivalent and both are used in the text. 
Computing the butterflies involving the address bits used to define the pro- 
cessor ID number will involve exchange of data between processors whose ID 
numbers differ in exactly one bit. Of course these processors may or may not be 
physically adjacent, depending on the network topology. For example, if the p 
processors form a hypercube network, data communication between such a 
pair is always between neighbouring processors. If the p processors form a lin- 
ear array or a two-dimensional grid, such a pair of processors tan be physically 
many hops apart, and the simultaneous data exchange between all pairs tan 
Cause traffit congestion in the network. 
CBM mappings have received considerable study in the literature dealing with 
parallelizing FFTs [2-4,7,9,14,15,18,19]. These works vary in the choice of the 
blocksize, whether DIF or DIT transforms are used, and whether the input and/ 
or output is in unordered (reverse-binary) or in natura1 Order, and so on. The pur- 
pose of this Paper is to bring all of these treatments into a common framework. 
5. Parallel FF’Ts 
Recall from Section 3 that one may apply the algorithm to the naturally or- 
dered input and obtain the output in bit-reversed Order, or one may permute 
108 E. Chu, A. George I Linear Algebra and its Applications 284 (1998) 95-124 
the input into bit-reversed Order in advance and obtain the X values in natura1 
Order, or one may apply permutations during the computation so that natural- 
ly ordered output is obtained from naturally ordered input. A fourth Option 
that was not considered is to use naturally ordered input, and then bit-reverse 
the output after the computation is complete in Order to obtain naturally or- 
dered output. The task of bit-reversing a vector is an interesting Problem in 
its own right, with many variations and intricacies [lO]. However, the Problem 
is outside the scope of this Paper. The second and last choices are simply special 
cases of the third, where all the permutations are applied either in advance of 
the computation or after all computation has been completed. Moreover, only 
one of a large number of Permutation variations was considered. Indeed, given 
naturally ordered input, any binary Permutation of X tan be obtained, and in 
numerous different ways. For example, suppose after the computation one re- 
quires that u[i2i3iOi4il] contain Xi,,i,i2igi4. Two of several possible schemes are 
shown below, using the notation developed earlier. 
7 7 V V V 
i4i&ilio iji$li& i2il iOi3z4 iliOi273z4 iOz273z4il z273i0T4zI 
* b a b <I b 
7 v v 7 7 
i4i3i2iliO i3i2ili4iO i2i3iOz4il i2iOz3z4il z2z3iOz4ik ~2~3iOT4~1 
n b <I b n n 
Of course it should also be evident that assuming that the input is in natura1 
Order is simply a matter of expositional convenience. Any binary permutation 
of the input tan be handled using the same shorthand notation. As noted ear- 
lier, the term “ordered” FFT in the literature refers to an algorithm which 
transforms naturally ordered x to naturally ordered X; the algorithms resulting 
from any other permutation of either input elements or output elements are all 
referred to as “unordered” FFTs. Therefore, the FFT with bit-reversed input 
or bit-reversed output is also an unordered FFT. 
5.1. Parallel FFTs without inter-processor permutations 
In general, in the parallel context, if permutations are allowed, it may turn 
out that part of a processor’s complement of data may migrate to another ma- 
chine. This subsection deals with the case where no permutations are perform- 
ed. Using the notation and the example of the previous section, computations 
involving the four mappings are depicted below. The notation +==+ indicates 
that N/P data elements must be exchanged between processors in advance of 
the butterlly computation. The Symbol ik identifies two things: first, it indicates 
that the incoming data from another l%ocessor are the elements whose sub- 
scripts differ from a processor’s own data in bit ik; second, it indicates that 
all pairs of processors whose binar-y ID number differ in bit ik send each other 
a copy of their own data. 
E. Chu, A. George I Linear Algebra and its Applications 284 (1998) 95-124 109 
Since the input sequence is in natura1 Order and the output is in bit-reversed 
Order, the processor initially allocated xidgiZi,i,, will finally have Xi0;,iZiji4 if inter- 
processor Permutation is not allowed. For example, if the initial mapping is 
li& li2ili0, processor Po initially contains xo,xi ,x2, . . . , x7 in their natura1 Order 
as depicted in Fig. 3; when the parallel FFT ends, processor PO contains 
Xo,X,s,Xs,X24,X4,X20,~~2,~2s, which are the first eight elements in the output 
array in Fig. 2. In this case, x and X are Said to be comparably mapped to 
the processors [ll], p.160. Note that the initial mapping is a CBM mapping 
of naturally ordered x, but the final mapping is not a CBM mapping of natu- 
rally ordered X. 
Butterfly computations will Cause communication between processors if the 
two input elements are stored in different processors. Since both input elements 
are needed to update each of them, the two processors involved must exchange 
the N/P data elements for each other to update their local data. As shown 
above, butterflies require data to be exchanged in exactly d = log, P Stages, re- 
gardless of the blocksize used in the mapping. (In the example, d= 2.) Algo- 
rithms of this type are described in Refs. [2,4,9]. This is also Version 1 of the 
distributed-memory FFTs in Ref. [ 111, pp. 156162. 
A consequence of this scheme is that one half of the processors update their 
local data according to a formula not involving the twiddle factor; i.e., they 
each update N/P elements according to yr = (xl + x~+~I~). The other half of 
the processors update their local data according to a formula involving the 
multiplication of a pre-computed twiddle factor; i.e., they each update N/P 
elements according to ZI = (x[ - x,+N,~)c+. 1 Thus, the arithmetic workload is 
not evenly divided among all processors unless each processor computes both 
yi and zI. 
To balance the arithmetic workload, Walton [19] proposed a variant of this 
type of communication which allows each pair of processors to exchange i (N/P) 
elements twice: once before the butterfly computation, and once after the 
110 E. Chu, A. George l Linear Algebra and its Applications 284 (1998) 95-124 
buttertly computation. The second communication is needed for each proces- 
sor to get back the 5 (N/P) e ements updated by the other processor. The same 1 
idea is employed in version 2 of the two-processor distributed-memory FFT in 
Ref. [l 13, p. 158. Compared to the scheme above, the amount of data ex- 
changed between each pair of processors remains the Same, but the number 
of message exchanges is doubled and the notation x=+ now denotes two ex- 
changes of i (N/P) elements between all pairs of processors. 3 Unfortunately, 
the gain from balancing the arithmetic workload may be offset by doubling 
the number of exchanges if the startup tost for each message exchange is high. 
The arithmetic workload cay1 be balanced, and the amount of data exchanged 
tan be halved without doubling the number of exchanges if inter-processor per- 
mutations are allowed, as disscussed in the next section. 
5.2. Parallel FFTs with inter-processor permutations 
As suggested in Refs. [3,7,15,18], if processors are not required to keep and 
update only the same data throughout the entire computation, then every other 
concurrent message exchange tan be eliminated from Walton’s scheme [19]. To 
introduce this concept, consider the butterfly computations depicted by 
(i4i3li2ilio lzjijl$lio li2r41i:ili0 lyT4l{ili0 15374lc2i:i0 15jt4)72r1iO. 
A A A 
c+ t- t-t 
The notation t+ denotes one concurrent message exchange of i (N/P) data 
elements between all pairs of processors, which is shown to occur in the butter- 
fly stages involving bits which form the processor ID number. After data are 
distributed to individual processors according to the initial mapping 
li4i3li2i,io, the element x,i,i,i,i, in a[i&ilio] tan be found in A[i$lio] in processor 
Q,. When bit i4 in the PID and bit i2 in the local M switch their positions, the 
mapping is changed to (i2i3ji4ili0, which means that the data in a[i4i3i2ilio] tan 
now be found in A [ibi, io] in Pi,i, . To identify the one half of data each processor 
must send out, the Symbol A is used to label two different bits: the bit ik, which 
has just been permuted from PID into Local M, and the bit il, whichhas just 
been permuted from Local M to the PID. In the example abote, id and i2 have 
switched their respective positions in the PID and the Local M, i3 and ~“4 have 
switched their respective positions in the PID and the Local M, and i2 and 23 
have switched their respective positions in the PID and the Local M. Because fk 
3 While the Symbol jk in the PID still identifies the pair of processors, this Symbol alone is not 
sufficient to identify tie one half of data each processor needs to send out. To avoid redundancy, 
this notation will be modified in the next section when this idea is further exploited. 
E. Chu. A. George / Linear Algebra and its Applications 284 (1998) 95-124 111 
was in PID before the switch, ik = 1 in one processor, and ik = 0 in the other 
processor. On the other hand, because i, was in Local M before the switch, 
il = 0 for half of the data, and il = 1 for another half of the data. Consequently, 
the value of ik, the PID bit, is equal to il, the local M bit, for half of the data 
elements in each processor, and the notation which represents the switch of 
these two bits identifies both the PID of the other processor as well as the data 
to be sent out or received. To depict exactly what happens, the data exchange 
between two processors and the butterfly computation represented by 
r 
li& li,&il io 
A A 
(discussed above) is shown in Fig. 4. While such a detailed schematic diagram 
tan be drawn to trace the entire parallel FFT process Stage by Stage, there is no 
need to do so because all information tan be deciphered from the simple nota- 
tion introduced above. 
Note that in the example above, element x,ijizi,i,, is initially contained in 
A[i2iIiO] in processor PiJi,, and it is finally located in processor Pisj4 at the same 
relative location. Note also that the number of concurrent exchanges 4 is d + 1. 
A second example, using the “IPIDI Local M” notation with a different map- 
ping, suggests the general Situation. 
Note that in this example, the Order of the data within each processor again has 
been preserved, but the data in the aggregate has migrated from processor fi,i, 
to processor fi,i,, and the number of concurrent exchanges remains d + 1. 
It is straightforward to develop a communication scheme so that this is true 
for all CBM mappings: the data within each processor remains in the same rel- 
ative Order, but its processor location is changed from a processor with 
ID = Mk-1 . . . ik_d+l to another processor with ID = ik_d+likik-1 . . . ik-d+2; i.e., 
the d bits are cyclic-shifted to the right by one position. In all cases, the number 
of concurrent exchanges remains d + 1. The comrnunication schemes described 
in Ref. [3,7,15,18] may be viewed as variations of this basic scheme, and tan all 
be treated within the same framework. 
4 They are also called parallel transmissioas in the literatme [ 15,181. In Ref. [3] the authors caution 
that the actual protocol for message exchange may vary among machines. For example, on the Intel 
iPSC/860 [13], the duplex bandwidth of a pair-wise data exchange tan be utilized only if both nodes 
Start sending the data simultaneously. If one node Starts before a second one, the second node must 
wait until the first one completes its transfer. Thus the communication is reduced to simplex mode. 
In Ref. [l l] the communication overhead is modeled assuming that all exchanges are asynchronous 
and they are thus performed in the simplex mode. 
112 E. Chu, A. George l Linear Algebra and its Applicatiom 284 (1998) 95-124 
I?ift.sr ex hnnee: llollol k! ___.__ ..~____ 
IPIDllocal M ~10~110 H 
= 
nents 
anged 
(PIDllocal M 
)iZiSli&iO 1 
A ~0~000 
1001001 
100(010 
(0O)Oll 
1001100 
100~101 
(00~110 
100~111 
Lza 
AIMI 
API 
-Ul 
API 
AL31 
AI41 
451 
461 
471 
PrOCW301 
1 Globdn 
i&i&l 
00000 
00001 
00010 
00011 
10000 
10001 
10010 
10011 
(PIDllocal M 
La- 
AP4 
ANI 
API 
421 
431 
~4141 
At51 
‘4[61 
API 
Fig. 4. Butterfly computation (Stage 1) with data migration hetween Po and P2. 
Note also that in the first example, the PID bit in question is always ex- 
changed with the leftmost bit in the Local M, which is referred to as the “piv- 
ot” in Refs. [15,18]. However, in the second example, the PID bit in question is 
always exchanged with the rightmost bit in the Local M. Clearly, the so-called 
“Pivot” could be arbitrarily Chosen, if one so desires, from the bits of the Local 
M. Since the ID number is formed by consecutive bits, whenever a PID bit is 
permuted into the local Pivot Position, it will be exchanged with the next PID 
bit and occupy the latter’s Position back in the PID field. After d exchanges, we 
have the following scenario: the rightmost PID bit is in the Local M, and the 
Pivot i, or rp from Local M is still in the leftmost Position in PID. Therefore, 
one more permutation involving these two bits will get i, or rp back into its 
original Position in the Local M, and the rightmost PID bit would be cyclic- 
shifted into the leftmost Position in the PID as shown below. 
Ir&& . . . 
/ . 
‘r-d+2 1 w ik-d . . . ip+lipip-lip-1 . . . i,iO 
a A 
E. Chu, A. George I Linear Algebra and its Applications 284 (1998) 95-124 113 
or 
Iq-& . . . / W’ 
Y 
Tk_d+2 1 ‘t,-1 . . . ~p+l~p~p-l . . . rk+ZTk+l ik_d . . . ili0 . 
A A 
Observe that as long as the PID is not formed by the leftmost d bits, there 
would be at least one zp bit available when the buttertly computation reaches 
any PID bit. One thus has the Option of using a rp (instead of ip) bit as the piv- 
ot. In this case, the rp bit may stay as the leftmost bit in the PID if the local 
data is not required to remain together in one processor, and one concurrent 
message exchange tan be saved, with the final mapping determined by 
l~p~k.. . , w- Tk-dt2 ( tn-1 . . . Zptlrk-d$1 . . . zkt2Tktl lk-d . . . llz0. 
n A 
The PID in the mapping above is no longer formed by consecutive bits. 
In all cases, the final “IPIDJ Local M” is a Permutation of the n bits, which 
uniquely determines the data mapping of X&i,;2...i,_l. In certain contexts, it is im- 
portant to have the output array X mapped to the processors in a specific way 
to facilitate subsequent computations. For example, the final distribution of X 
is required to be identical to the initial one for x in the Solution of partial dif- 
ferential equations using spectral techniques [3]. This has motivated the devel- 
opment of a number of algorithms which are now reviewed. 
5.3. Parallel FFTs in the literature 
In this section some parallel FFTs in the literature are reviewed. To facilitate 
the discussion, the same example (when it is possible) is used to provide a sum- 
mary of some distributed-memory parallel FFT literature in Table 1. The basic 
parallel FFT without inter-processor Permutation underlies the work in Refs. 
[2,4,9,19,20], and the communication tost is either d concurrent exchanges of 
NIP elements or 2d concurrent exchanges of $N/P) elements. The final map- 
ping is identical to the initial mapping in all of the algorithms in Refs. 
[2,4,9,19]. However, because the output element &,,i,i2i3i4 overwrites xi4ijii;,i,, in 
its initial location in the same processor, the resulting mapping is not a 
CBM of X. These algorithms parallelize an unordered sequential FFT, and 
thus are referred to as unordered parallel FFTs in the literature, and they 
are so labelled in Table 1. The re-coding of PID (by Gray Code) in Ref. [2] 
and the use of radix4 FFT in Ref. [4] do not affect the communication scheme, 
because all processors are identical and the data are accessed in a similar pat- 
tern by all radix-T FFTs. Furthermore, the choice of a different local FFT in 
Ref. [4] is independent of the choice of the communication algorithm. 
In Ref. [20], the PID part of the initial mapping is preserved in the final map- 
ping as dictated by the communication scheme, but the bits which form the Lo- 
cal M are bit-reversed in the final mapping because each processor applies an 
Ta
bl
e 
1 
Pa
ra
lle
l 
FF
Ts
 
in
 t
he
 
lit
er
at
m
e:
 
So
m
e 
on
e-
di
m
en
si
on
al
 
(1
-D
) 
pa
ra
lle
l 
FF
Ts
 
us
in
g 
P=
 
2d
 =
 4
, 
8,
 o
r 
16
 p
ro
ce
ss
or
s 
an
d 
in
pu
t 
da
ta
: 
N
 =
 2
” 
= 
32
 
R
ef
er
en
ce
s 
fo
r 
th
e 
ex
am
pl
es
 
In
iti
al
 
m
ap
pi
ng
 
of
 .
Q
);~
~,
;~
 
Fi
na
l 
m
ap
pi
ng
 
of
 &
hi
,i,
 
# 
C
on
cu
rr
en
t 
ex
ch
an
ge
s 
$ h 
M
es
sa
ge
 
le
ng
th
 
D
 
.?
 
W
al
to
n 
[1
9]
 
Ja
m
ie
so
n 
et
 a
l. 
[7
] 
Sw
ar
zt
ra
ub
er
 
[ 1
51
 
C
ha
m
be
rla
in
 
[2
] 
To
ng
 
an
d 
Sw
ar
zt
ra
ub
er
 
11
81
 
Jo
hn
ss
on
 
an
d 
K
ra
w
itz
 
[9
] 
D
ub
ey
 
et
 a
l. 
[3
3 
Y
an
g 
[2
0]
 
Fa
bb
re
tti
 
et
 a
l. 
[4
] 
(r
ad
ix
-4
 
an
d 
lo
ca
l 
Sp
lit
-r
ad
ix
) 
li
4i
3l
i2
il
io
 (b
lo
ck
) 
Ji
3i
2i
Ii
&
 
(c
yc
lic
) 
ji
2i
li
o]
i&
 (
cy
cl
ic
-v
ar
ia
nt
) 
li
4i
31
i2
il
io
 (b
lo
ck
) 
]G
ra
y[
i.&
i2
]]
ili
o 
(b
lo
ck
 
W
. 
PI
D
 
in
 G
ra
y 
C
od
e)
 
)i
zi
j io
]i&
(c
yc
Iic
) 
(i
&
 l
iz
il
 io
 (b
lo
ck
) 
li
&
ji
zi
li
o 
(b
lo
ck
) 
i.+
)i
&
li
li
o 
i&
li
zi
l 
li
o 
i&
i2
 l
il
 io
l (
cy
cl
ic
) 
li
l io
 li
&
i2
 (
cy
cl
ic
) 
li
&
li
+l
io
 
(b
lo
ck
) 
~~
~~
~~
~~
~~
~~
 
(u
no
rd
er
ed
) 
2d
 
f$
 
JT
~T
~_
s,
 Jt
0 
(u
no
rd
er
ed
) 
d 
1 
N
 
ZP
 
~~
37
22
~~
74
~0
 
(u
no
rd
er
ed
) 
d 
f$
 
I~
o?
I lT
2z
37
4 
(b
lo
ck
) 
2d
+l
($
>P
)L
5d
+2
($
‘=
1)
 
+$
 
lG
ra
y[
~4
73
~2
11
~~
~0
 
d 
! 
(u
no
rd
er
ed
) 
Fr
om
dt
o 
1.
5d
($
<P
) 
ff
 
d 
- 
1 
+ 
$j
 (o
n 
al
l 
ch
an
ne
ls
) 
1 
d 
ex
ch
an
ge
s 
pl
us
 
al
l 
to
 
al
l 
in
 t
he
 
i$
 
fo
r 
d 
&
 v
ar
ie
d 
w
or
st 
ca
se
 
le
ng
th
s 
fo
r 
al
l 
to
 a
ll 
lz,
 
~0
1~
2~
37
4 
(b
lo
ck
-e
qu
iv
al
en
t) 
d 
P 
IT
~T
~I
Q
T~
T~
 (u
no
rd
er
ed
) 
d 
5 
E. Chu, A. George I Linear Algebra and its Applications 284 (1998) 9%124 115 
ordered (sequential) FFT to its local data. Since the initial mapping is restricted 
to be cyclic, the resulting mapping is, in general, given by 
Id PID bits 1 m, 
where the PID preserves its initial arrangement: id-lid-2 . . . ilio. Since the Local 
M is formed by the rightmost n - d bits of Xs subscript, &,Xr . . . ,XNp are 
stored in natura1 Order in one processor, which is determined to be PO; and 
the next N/P elements, namely XN,~,XN~P+r, . . . , X2(Np)-I, are stored in natura1 
Order in one processor, which is determined to be Pp/* by simply reversing the 
leftmost d bits of X’s subscript; and so on. Since the mapping is equivalent to a 
CBM block mapping of the naturally ordered X, we use the term “block-equiv- 
alent” in the relevant table entries. To use this method, one simply distributes x 
before the computation using the cyclic CBM, and apply this algorithm to ob- 
tain naturally ordered X. Note that the input data must be distributed among 
the processors one way or the other, and the communication tost for data dis- 
tribution is the same regardless of the initial mapping. Therefore, the cyclic ini- 
tial mapping used by this method does not Cause extra communication, and 
this is one way to obtain the block-equivalent mapping of naturally ordered 
X. It will be shown later that the communication tost tan be halved in an im- 
proved algorithm which achieves equivalent results. 
The FFTs in Refs. [3,7,15,18] all contain inter-processor-permutations 
which may be viewed as switching a PID bit with an address bit, the Pivot. 
In Ref. [7] the PID is formed using the rightmost d bits as shown in the two 
initial mappings given in Table 1. To ensure that the stride for data in buttertly 
operations is always one (for good locality), the current rightmost bit in the Lo- 
cal M is used as the Pivot for either a local or an inter-processor permutation, 
which precedes each butterfly Operation, as depicted by the following example. 
lili0/i& jili0/22zj ; r lili0@ I~i0ln~ Im IG 
a a 
+_ +4 
Here the sequence begins with lilioli&, whicf) is the result of a cyclic mapping 
of naturally ordered x; the following lilioli2i3 is intended to imply that i3 has 
been permuted into the Pivot location befoFeAthe butterfly computation begins. 
Since the initial mapping (ilioli& may be obtained by permuting the local data 
resulting from a cyclic mapping, it is referred to as a cyclic-variant in Table 1. 
Given the initial mapping and the fixed-Position of the Pivot, which is a rp (in- 
stead of ip) bit, the communication tost and the final mapping tan be easily 
predicted for given N and P as explained in the previous section. However, 
the resulting mapping is neither a CBM nor its equivalent for X. 
The work in Ref. [3] deals with all possible initial CBM mappings for given 
N and P on a hypercube. The initial CBM for x and the final CBM for X are 
required to be identical. Under the condition, the authors show that in addition 
116 E. Chu. A. George I Linear Algebra and its Applications 284 (1998) 95-124 
to d concurrent exchanges between all pairs of processors with IDs different in 
one bit, their generalized subroutine could, in the worst case, require each pro- 
cessor to send data to all the other processors. This requirement may Cause se- 
vere data contention depending on the network topology. A new general 
algorithm is proposed in Section 6 which requires only 1.5d more concurrent 
exchanges in the worst case. Thus the algorithm proposed in Section 6 appears 
to deal with the question in Ref. [3] conceming the communications require- 
ment for solving the data rearrangement Problems arising in FFT or other si- 
miliar algorithms. 
The work in Refs. [15,18] concems two specific CBM mappings; in each 
case, the initial mapping for input X,i,izi,iO is required to be maintained for out- 
put Xi,,i,i2iXi4. These two parallel algorithms are depicted in Table 2 using given 
N and P values. For the natural-order parallel FFT in Ref. [ 151, the communi- 
cation tost of 2d + 1 = 5 exchanges for N = 2” = 32 and P = 2d = 4 are 
marked by five occurrences of t+ in the first column. The algorithm differs 
depending on whether d = n/2, d < n/2, d = n - 1, or n/2 < d < n - 1. Con- 
sequently, the communication costs range from 1.5d + 2 to 2d + 1 concurrent 
exchanges as indicated by the following theorem. 
Theorem [15]. An ordered FFT of length N= 2” tan be implemented on a 
hypercube of dimension d with n/2 + d + 1 parallel transmissions ifn/2 < d and 
n is even. If n is odd and (n + 1)/2 < d, then (n + 1)/2 + d -t- 1 parallel 
transmissions are required. For the remaining cases the ordered FFT tan be 
implemented with 2d + 1 parallel transmissions. 
Note that a hypercube of dimension d has P = 2d processors, and that the 
condition n/2 < d is equivalent to N/P < P, and (n + 1)/2 < d is equivalent 
to N/P < P/2, so they represent fine-grain cases. Consequently, for more com- 
mon medium-grain and large-grain cases, 2d + 1 parallel exchanges will be re- 
quired . 
The cyclic-Order parallel FFT in Ref. [18] refers to an FFT with naturally 
ordered input x and output X both CBM mapped to the processors using 
one element per block. That is, the initial cyclic mapping of x is maintained 
for X. The case N/P < P was considered in Ref. [18] for the massively parallel 
Connection Machine (CM). The communication tost was derived in proving 
the following lemma. 
Lemma [18]. A cyclic-Order FFT of length N = 2” tan be implemented on a 
hypercube of dimension d (where n/2 < d) with 2d - n/2 parallel transmissions 
if n is even and 2d - (n - 1)/2 parallel transmissions if n is odd. 
Since the condition n/2 -C d is equivalent to N/P < P, this lemma again ap- 
plies to the fine-grain cases, and the number of parallel message exchanges is 
E. Chu, A. George I Linear Algebra and its Applicatiom 284 (1998) 95-124 117 
Table 2 
Some ordered parallel FFTs in the literature 
Swarztrauber [15] Tong and Swarztrauber [18] 
Example: N=2”=32, P=2d=4 ($ 2P) 
li&[i2i,io (block map of X) 
. 
lilioli27374 
Ah 
v 
bziolil7374 
n * 
t- 
. 
17271 li07374 
a a 
-- 
1~~7, l727374 (block map of X) 
a n c- 
Example [18]: N=2”=256, P=2d =32 ($ < P) 
li4i3i*ili&~i& (cyclic map of x) 
. 
li4i3izilio[i677ij 
nn 
li&i2iliOJi:777f, 
n n 
. 
l7&i*ili&7776 
n a 
c- 
T 
17574i2ili01ij777f, 
b D c- 
17374i2hi01757776 
n n 
-- 
. 
[73747576iO(i17772 
0 n 
-- 
/73747576iOl777,72 
*n 
1~3~4~55~~71~~~~~2 (cyclic map of X) 
n ‘5 L- 
2d - n/2 > d (if n is even) or 2d-(n - 1)/2 > d (if n is odd). In the finest-grain 
case, N/P = 1 if n is even: i.e., n = d, and the number of parallel transmissions 
is given by 2d - n/2 = 1Sd. If n id odd, N/P = 2 or n - 1 = d for the finest- 
grain case, and the number of parallel message exchanges is given by 
2d - (n - 1)/2 = 1.5d. In the example cited in Table 2, the condition 
118 E. Chu. A. George l Linear Algebra and its Applications 284 (1998) 95-124 
n/2 < d is satisfied, and there are 2d - n/2 = 6 concurrent exchanges as identi- 
fied by six occurrences of t+ in column 2. 
In both algorithms, the Pivot is always the current leftmost bit in the Local 
A4, but reordering of the address bits in Local M is also performed by ex- 
changing the Pivot with another bit in the Local M. This implies that other 
bits from Local M tan effectively serve as the Pivot bit, although they must 
first be permuted into the fixed Pivot location. In Refs. [15,18], the permuta- 
tion of any other bit with the fixed Pivot bit is defined as a Single i-cycle. The 
algorithms were developed by first decomposing the required permutation (or 
final mapping) into disjoint cycles, and each disjoint cycle tan then be imple- 
mented by a sequence of i-cycles. Some of the i-cycles are followed by butter- 
fly computations, and other i-cycles are used only for the purpose of 
rearranging local data or permuting data between processors. The pseudo- 
code (similar to CM Fortran) FFT algorithm given in Ref. [18] uses an i-cycle 
subroutine assuming that N/P = 2, where F could represent the number of 
virtual processors when N/P > 2. In the latter case, 13 > P, and the tost of 
1.52 concurrent exchanges, where d = log, j, includes the communication be- 
tween virtual processors. In the next section alternative ways are proposed 
having the Same or lower communication tost and without the restriction 
to fixed-Pivot i-cycles. In addition, the “NIP” ratio is not restricted to a spe- 
cific value. 
6. New parallel FFTs and a generalization 
In this section three new parallel FFT algorithms are proposed, including a 
generalized algorithm and communication complexity results for other CBM 
mappings which are believed to be new. For easy comparison with the algo- 
rithms reviewed in the last section, the first two algorithms are depicted in Ta- 
ble 3 and a general algorithm is depicted in Table 4 below. Since the result of 
bit-reversing two bits is the Same as the result of cyclic-shifting them one posi- 
tion to the right, examples with 8 or 32 processors are now used to demonstrate 
the general cases better. 
6.1. Algorithm I 
The elaboration of algorithm 1 is as follows. Consider a general case, in 
which the initial cyclic mapping of naturally ordered xi._,,,,i,iO is transformed 
to the final mapping of X$,i,,,.i,_, given by 
dPID bitsl a1 
where PID = i&_lid_2 . . . izii; i.e., all bits in the initial PID are cyclic-shifted to 
the right by one Position. Since the rightmost n - d bits of X’s subscript form 
E. Chu. A. George l Linear Algebra and its Applications 284 (1998) 95-I24 119 
Table 3 
New distributed-memory parallel FFTs: new parallel FFTs using P = 2d = 8 or 32 processors and 
input data: N= 2” = 256 
New algorithms Initial mapping of Final mapping of 
Xi,r615,rl,r>,,a x 101, ,*iJl4iSl& 
# Concurrent Message 
exchanges length 
Algorithm 1 li2ilioJi7i&i4ij (cyclic) 1~0~2T,1~3~455%~7 d+l 1N -- 
(block-equivalent) 2P 
Algorithm 11 li2ili01i7i&i& (cyclic) 
1%%5,1~0~,72T3~4 
d 1N 
(cyclic if 5 > P) 
2P 
li&i~ilioli7i& (cyclic) l%%T4T3l~OT,72 d IN 
(cyclic-equivalent if $ > P). 2P 
the local A4, the final mapping is equivalent to a block CBM of naturally or- 
dered X. To obtain the final local M which bit-reverses the initial local IV, all 
that is required is to apply an ordered FFT to the local data. As explained in 
Section 5, the final PID is obtained from per-muting each PID bit into the Pivot 
Position and back into the PID field in sequence. The combination of a specific 
initial mapping, inter-processor permutations, and an ordered local FFT pro- 
duces an ordered parallel FFT. Compared to other parallel FFTs reviewed in 
Section 5, the communication tost is halved because only d + 1 concurrent ex- 
changes of 4 (N/P) elements are needed. Of course, the same algorithm tan also 
transform block mapped x to cyclic-equivalent mapped X, and vice versa. 
6.2. Algorithm 11 
Observe first that the initial and final PIDs use the rightmost d bits from the 
subscripts of x and X, respectively. Because Xs subscript bit-reverses x’s sub- 
script, the final PID is formed by bit-reversing the leftmost d bits from x’s sub- 
script. The algorithm is designed to perform the first d stages of local 
computation without reordering the local data, followed by n - 2d more stages 
of local computation using an ordered sequential FFT. The initial mapping is 
thus changed to 
(Initial rightmost d bitsl initial aeversedn 
The objective is to apply inter-processor permutations in the remaining d stages 
so that the final mapping becomes 
120 E. Chu, A. George I Linear Algebra and its Applications 284 (1998) 95-124 
1%d%d+l . . 
-- 
. ~,-2~,_lI TOT1 . , , zd_2rd-1 zdzd+l . . . Zn-d-1 
For simplicity, assume 12 - d > d (which implies N/P 2 P) so that the right- 
most d bits do not overlap the leftmost d bits; the case n - d < d will be dealt 
with later. Under this assumption, the remaining d inter-processor permuta- 
tions are performed to switch id-l with ‘t,-d, id-2 with r,-d+l, . . . , il with ~~-2, 
and io with z,-i, That is, the PID bit is switched with a dzferent Pivot euch time, 
and the Pivots are simply Chosen in the Order they appear in thejinal PID. Since 
all potential Pivot bits are zp instead of i,, only d concurrent exchanges of 
i (N/P) elements are required. 
For the case n - d < d (which implies N/P < P), local reordering of data 
may be added to Algorithm 11 to produce a cyclic-equivalent mapping for el- 
ements of X. That is, the final mapping of X&,i,...i,_, is given by 
I , Ir,_,74 . . . zd+l~&_l . . . Zn-d1 r,,zI . . . Zn-d-1 
Note that the final PID bit-reverses the rightmost d bits of Xs subscript. In or- 
der to describe how Algorithm 11 is generalized to include this case, the condi- 
tion n - d < d is reflected in the initial mapping given by 
IInitial rightmost d bits 1 n. 
P . 
= lid_li&2.. . in-&-d-l . . iliO\ in_1 . id+,& 
After the first 12 - d stages of local computations (without reordering of the lo- 
cal data), the mapping becomes 
/ . lid-lid-2.. . in-din_&l . . . iliO\ 7n-l . . . zd+l’&j. 
The next d - (n - d) = 2d - n inter-processor permutations switch the left- 
most 2d - n bits in PID with the bits in the Local M in the Order from left 
to right; if it reaches the end of Local M, begin from the leftmost bit in Local 
M until all 2d - n bits in PID are processed. The following are mapping results: 
-- 
Now the local reordering tan be applied to rearrange the bits in Local M to 
obtain 
jr,-12,_2.. . 5k+lq&d_l . . . -- iliol Zn-d.. .7,-2’&-1 7,. . . rk-2q-1 
(Note that the bits braced from z,,_i to r,_d are reversed, and the bits braced 
from tk-1 to r, are reversed. since the bit-reversing operations tan be incorpo- 
rated into the computation, the internal reordering does not necessarily in- 
crease processing time in an efficient implementation.) The final n - d inter- 
processor permutations in Algorithm 11 switches the remaining n - d bits in 
PID into the local A4, resulting in the final mapping 
E. Chu. A. George I Linear Algebra and its Applications 284 (1998) 95-124 121 
/ . 
/&-,&_2.. ~k+l~k~k_l~k-2.. . ~,7,_~‘&,_2.. . Tn_dl TOT1 . . . z,-d-1. 
This is a cyclic-equivalent mapping for Iy10i,i2,,,i,_, because the Local M is formed 
by the leftmost R - d bits of X. Note that the rightmost d bits of Xs subscript 
tan be obtained by reversing the PID bits, and vice versa. 
6.3. A general algorithm and communication complexity results 
This algorithm allows the choice of any initial CBM for naturally ordered x 
and any final CBM for naturally ordered X. For N = 128 and P = 32, the four 
possible initial CBMs of x are shown in the first four entries in column 1 in Ta- 
ble 4; in each case, the initial CBM is required to be maintained for X as shown 
in column 2. In column 3, both actual and the maximum (worse-case) commu- 
nication complexities are presented for N= 2” and P = 2d, assuming only that 
NIP 2 2. That is, the specific relationships between n and d in different cases 
are not exploited in the algorithm. 
The algorithm uses the ideas developed in Section 5. Recall that by using 
d + 1 inter-processor permutations, a parallel FFT tan be completed with 
the initial PID bits cyclic-shifted-to-the-right by one Position. Apparently more 
inter-processor permutations may be added afterwards to switch the desired rp 
bits into the PID and arrange them in final Order, and that tan be followed by 
local reordering so that the bits in the Local M are also arranged as desired. To 
determine the maximum communication tost, consider the two possibilities. 
Case (i). If the initial PID does not overlap the final PID, then exactly d 
more inter-processor permutations are needed to switch in the final PID bits. 
In total, 2d + 1 concurrent exchanges are needed. 
Case (ii). If the initial PID overlaps the final PID, then one may first 
switch in the m zp bits currently in Local M so that these m PID bits are 
put in their final positions. Repeat this process until no desired PID bits exist 
in the local M. Now the only remaining task is to reorder the v = d - k po- 
tentially out-of-Order bits in the PID. Because the final PID will no longer in- 
volve bits in the Local M, one need only consider a fixed (arbitrary) Pivot. 
How many inter-processor permutations are needed to reorder v PID bits 
via a Single Pivot? Observe that in the worst case, the non-PID Pivot bit is 
permuted back in Local M after two PID bits happen to switch their posi- 
tions; i.e., three permutations could be required for placing every two bits. 
Therefore, at most 1.5~ inter-processor permutations are required to reorder 
the v PID bits. Since v tan be as large as d, in total 2.5d + 1 concurrent ex- 
changes are needed. 
Although the two cases above were considered separately to help explain 
the algorithm, note that case (ii) reduces to case (i) if m = d. The actual com- 
munication tost for given N and P tan also be predicted for the entire class of 
CBM mappings as shown in Table 4. Since the reordering Phase requires only 
122 E. Chu, A. George / Linear Algebra and its Applications 284 (1998) 95-124 
Table 4 
General communication results for CBM mappings: A general algorithm for arbitrary initial and 
final CBM mappings. Assumption: N= 2”, P = 2*, and NIP > 2 
Any initial CBM of Any final CBM of # Concurrent Message length 
Xi61ri4,,,~lllo x. ~O~lQW4WO exchanges 
li&i&i,io IzOr,r2r3154rSr6(block) 2d+ 1<2Sd+ 1 IN 
(block) 2P 
1N 
if,(isi&i&lio ~oI~,WX&% 2d+2<2Sd+ 1 2P 
1N 
i&li&izij lio TOT] 1WWsl% 2d+ 2<2Sd+ 1 2P 
IN 
i&i&&iliOI (cyclic) ~OTl+3T4%% (CYCliC) 2d + 1 < 2.5d + 1 2P 
IN 
i&i41i&ilioj (cyclic) ITot,z2r3lT4TsT6(block) 2d<2.5d + 1 2P 
1N 
i6i& liJi2il iO 1 (cyclic) WI I~253W&6 2d+ 2<2.5d+ 1 2P 
1N 
li&i&ji2ilio (block) ~O~l~z~w4ts% (CYclic) 2d<2.5d + 1 2P 
1.5d more concurrent exchanges in the worst case, this algorithm appears to 
have dealt with the question in Ref. [3] concerning the communication re- 
quirement for solving the data rearrangement Problems arising in FFT or oth- 
er similar algorithms. In addition, since the reordering Phase is independent of 
the computation Phase, the algorithm and the communication complexity re- 
sults are applicable even if the initial and final CBM maps are different. The 
last three entries in Table 4 are such examples. Thus, the proposed algorithm 
deals with cases that do not appear to be handled by the algorithm proposed 
in Ref. [3]. 
7. Conclusion 
In this Paper the various ideas which have appeared in the FFT literature 
were presented in a common framework. To this end, the concept of index-digit 
Permutation and the associated binary address based notation were borrowed, 
modified, and applied in a uniform manner to explain and further explore the 
various algorithm design and implementation issues related to both sequential 
and parallel FFTs. Although the study is limited to variants of a radix-2 DIF 
FFT algorithm, the notations and results may be generalized to cover the entire 
family of radix-2” DIF and DIT FFTs, including the mixed-radix and the split- 
radix variants. These possible extensions as well as the extension to two-dimen- 
E. Chu, A. George / Linear Algebra and its Applications 284 (1998) 95-124 123 
sional (2-D) FFTs are omitted for lack of space and will be presented else- 
where. In addition, three new parallel FFT algorithms along with communica- 
tion results were presented. The proposed algorithms show alternative ways of 
designing parallel FFT algorithms which feature reduced communication tost 
and further flexibility in the choices of initial and final data mappings. 
Acknowledgements 
The author are grateful to the anonymous referee who obviously read the orig- 
inal manuscript with great care. In particular, the referee provided valuable com- 
mentary about issues related to accessing twiddle factors on distributed-memory 
Computers; those Points have been included. The referee also drew the authors’ 
attention to the literature on self-sorting in-place FFT algorithms, which are 
briefly reviewed at the end of Section 3. The work of first author was supported 
in part by Canadian Natura1 Sciences and Engineering Research Council grant 
0GP0121352. The work of second author was supported in part by Canadian 
Natura1 Sciences and Engineering Research Council grant OGP0008 11. 
References 
[l] C.S. Burrus, P.W. Eschenbacher, An in-place, in-Order Prime factor FFT algorithm, IEEE 
Trans. Acoust. Speech Signal Process. ASSP/29 (1981) 806817. 
[2] R.M. Chamberlain, Gray Codes, fast Fourier transforms and hypercubes, Parallel Comput. 6 
(1988) 2255233. 
[3] A. Dubey, M. Zubair, C.E. Grosch, A general purpose subroutine for fast Fourier transform 
on a distributed memory parallel machine, Parallel Computing 20 (1994) 1697-1710. 
[4] G. Fabbretti, A. Farina, D. Laforenza, F. Vinelli, Mapping the synthetic aperture radar Signal 
processor on a distributed-memory MIMD architecture, Parallel Comput. 22 (1996) 761-784. 
[5] P.M. Flanders, A unified approach to a class of data movements on an array processor, IEEE 
Trans. Comput. 31 (1982) 809-819. 
[6] D. Fraser, Array Permutation by index-digit permutation, J. ACM 33 (1976) 298-309. 
[7] L.H. Jamieson, P.T. Mueller Jr., H.J. Siegel, FFT algorithms for SIMD parallel processing 
Systems, J. Parallel Distributed Comput. 3 (1986) 48871. 
[8] H.W. Johnson, C.S. Burrus, An in-place, in-Order radix-2 FFT, in: Proceedings of IEEE 
International Conference on Acoustics, Speech and Signa1 Processing, 1984, pp. 28A.2.14. 
[9] S.L. Johnsson, R.L. Krawitz, Cooley-Tukey FFT on the Connection machine, Parallel 
Comput. 18 (1992) 1201-1221. 
[IO] A.H. Karp, Bit reversal on uniprocessors, SIAM Review 38 (1996) 1-26. 
[l I] C.Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM, Philadel- 
phia, PA, 1992. 
[12] J.H. Rothweiler, Implementation of the in-Order Prime factor transform for variable sizes, 
IEEE Trans. Acoust. Speech Signal Process. ASSP/30 (1982) 105-107. 
[13] S.R. Seidel, M.-H. Lee, S. Fotedar, Concurrent bidirectional communication on the Intel 
iPSC/860 and iPSC/2, Technical Report CS-TR 90-06, Dept. of Computer Science, Michigan 
Tech. Univ., November 1990. 
124 E. Chu, A. George / Linear Algebra and its Applications 284 (1998) 95-124 
[14] H.S. Stone, Parallel processing with the perfett shuffle, IEEE Trans. Comput. c/20 (1971) 1533 
161. 
[15] P.N. Swarztrauber, Multiprocessor FFTs, Parallel Comput. 5 (1987) 197-210. 
[16] C. Temperton, Implementation of a self-sorting in-place Prime factor FFT algorithm, Journal 
of Computational Physics 58 (1985) 283-299. 
[17] C. Temperton, Self-sorting in-place fast Fourier transforms, SIAM J. Sei. Comput. 12 (1991) 
808-823. 
[18] C. Tong, P.N. Swarztrauber, Ordered fast Fourier transfonns on a massively parallel 
hypercube multiprocessor, J. Parallel Distributed Comput. 12 (1991) 50-59. 
[19] S.R. Walton, Fast Fourier transforms on the hypercube, Technical Report, Ametek Computer 
Research Division, 610 N. Santa Anita Ave., Arcadia, CA 91006, September 1986. 
[20] W. Yang, Parallel ordered FFT algorithms on distributed-memory multiprocessors, M.Sc. 
Thesis, Department of Mathematics and Statistics, University of Guelph, Guelph, Ontario, 
1996. 
