Efficient Parallel FFTs for Different Computational Models by Shalaby, Nadia
Efficient Parallel FFTs for
Different Computational Models
The Harvard community has made this
article openly available.  Please share  how
this access benefits you. Your story matters
Citation Shalaby, Nadia. 1996. Efficient Parallel FFTs for Different
Computational Models. Harvard Computer Science Group Technical
Report TR-17-96.
Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:25691718
Terms of Use This article was downloaded from Harvard University’s DASH
repository, and is made available under the terms and conditions
applicable to Other Posted Material, as set forth at http://
nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-
use#LAA
Ecient Parallel FFTs
for Dierent Computational Models
Nadia Shalaby
TR-17-96
December 1996
Center for Research in Computing Technology
Harvard University
Cambridge, Massachusetts
Ecient Parallel FFTs
for Dierent Computational Models

Nadia Shalaby
y
December 25, 1996
Abstract
We select the Fast Fourier Transfrom (FFT) to demonstrate a methodology for
deriving the optimal parallel algorithm according to predetermined performance
metrics, within a computational model. Following the vector space framework for
parallel permutations, we provide a specication language to capture the algorithm,
derive the optimal parallel FFT specication, compute the arithmetic, memory,
communication and load{balance complexity metrics, apply the analytical performance
evaluation to PRAM, LPRAM, BSP and LogP computational models, and compare
with actual performance results.
1 Introduction
The FFT algorithm, popularized by Cooley and Tukey [2], performs the discrete Fourier
Transform (DFT) in O(N logN) time, as opposed to the naive O(N
2
), and is ubiquitously
used in most areas of applied sciences and engineering [8]. As more of these applications
migrate to the platform of parallel computing, the exigency of a highly ecient and portable
parallel FFT kernel becomes evident. Considerable valuable research was conducted on this
topic, such as [5, 6, 7, 9, 15, 16]. However, studies tended to be geared towards specic
architectures, or did not fully analyze the system and problem parameters at hand, or
gave underspecied algorithms, and overall did not present a general approach for the
parallelization of other orthogonal transforms.
The above perspective motivates a number of goals. Our rst goal, addressed in x 2,
is to develop a methodology and algorithmic specication that can be extended to other
parallel orthogonal transforms, such as the discrete cosine transform [11] and the Legendre
transform [13] and dene a set of complexity metrics suitable for this class of algorithms.
The second goal, to derive a parallel FFT algorithm and prove that the choices made are
optimal, is covered in x 3 and 4. Third, in x 5, we demonstrate that this computational
framework is suciently general to render abstraction to computational models such as the
PRAM [4], LPRAM [1], BSP [17] and LogP [3]. On the other hand, x 6 illustrates that
the specication and analysis are rened enough for immediate implementation, and we
present our comparative implementations on the CM{5 as an example.
2 Methodology
Computational Framework: We denote the problem size by N = 2
r
, the number of
processors by P = 2
p
and the virtual processor ratio as V = N=P . Thus, p = logP ,

Partial support by Oce of Naval Research grant N00014-93-1-0192, Air Force Oce of Scientic
Research grant F49620-93-1-0480 and Los Alamos National Laboratory contract 8283K0014-9U.
y
Aiken Computation Lab., Harvard University, Cambridge, MA 02138
1
2v = logV and r = logN = p + v represent the number of bits in processor, local memory
and data addresses, respectively. All logarithms are base two.
We dene a permutation  on an ordered set A to be a bijective map  : A  ! A and
denote the composition of two permutations as 
1

2
. For simplicity, we assume N  P
(by setting u = r in [12]), and the machine's physical memory location address is modeled
as a vector space Z
r
over F
2
, where jZ
r
j = 2
r
, and for P = 2
p
processors, a vector z 2 Z
r
is denoted by z
r 1
: : :z
v+1
z
v
j z
v 1
: : : z
1
z
0
where the left and right parts represent the
processor and memory addresses respectively (thus arbitrarily selecting block allocation).
Similarly, the data element index is modeled by a vector space A
r
over F
2
, where jA
r
j = 2
r
,
and a vector a 2 A
r
is denoted by a
r 1
a
r 2
: : : a
1
a
0
= (a
i
)
0
i=r 1
. We assume a uniform
mapping between the set of 2
r
indices and the set of all 2
p
processors, where each processor
is assigned 2
v
indices, and assignment is made to local memory addresses 0 to 2
v
  1.
The set of permutations relevant to the FFT can now be dened within this vector
space framework. The bit{reversal permutation, 
m;l
bitrev
, is a permutation on A
r
such that

m;l
bitrev

(a
i
)
m+1
i=r 1
(a
i
)
l
i=m
(a
i
)
0
i=l 1

= (a
i
)
m+1
i=r 1
(a
i
)
m
i=l
(a
i
)
0
i=l 1
.
Let Q = 2
q
, where q = 1; 2; : : :br=2c and let l;m 2 I such that 0  l;m  r  1,
m  l  q, m+ q   1  r  1. Then the radix{Q exchange permutation, 
m;l
xch;Q
, is a per-
mutation on A
r
dened as: 
m;l
xch;Q

(a
i
)
m+q
i=r 1
(a
i
)
m
i=m+q 1
(a
i
)
l+q
i=m 1
(a
i
)
l
i=l+q 1
(a
i
)
0
i=l 1

=
(a
i
)
m+q
i=r 1
(a
i
)
l
i=l+q 1
(a
i
)
l+q
i=m 1
(a
i
)
m
i=m+q 1
(a
i
)
0
i=l 1
.
Let a 2 A
r
, and let 0  l;m  r   1 and Q = 2
q
. Then the radix{Q shue
permutation, 
m;l
shf;Q
(a), is dened as: 
m;l
shf;Q

(a
i
)
m+q
i=r 1
(a
i
)
m q+1
i=m
(a
i
)
l
i=m q
(a
i
)
0
i=l 1

=
(a
i
)
m+q
i=r 1
(a
i
)
l
i=m q
(a
i
)
m q+1
i=m
(a
i
)
0
i=l 1
.
We dene the radix{2
q
unshue permutation, 
m;l
unshf;2
q
(a), in terms of the shue
permutation as follows: 
m;l
unshf;2
q
(a) = 
m;l
shf;2
n q
(a), where n = m  l + 1.
The buttery permutation, 
bfly;Q
, where logQ 2 I , the identity mapping, is a
permutation on A
r
such that 
bfly;Q
(a) = a Q, where  is the addition on F
2
.
Which FFT? The complex{to{complex forward DFT, expressed as a mapping between
two N{point complex sequences, h = (h
0
; : : : ; h
N 1
) 2 C
N
and
~
h = (
~
h
0
; : : : ;
~
h
N 1
) 2 C
N
is
given by: F(h
k
) =
~
h
k
=
P
N 1
j=0
h
j
e
2i
N
jk
; k = 0; 1; : : : ; N 1, where i is
p
 1, !
N
= e
2i
N
is the principle N
th
root of unity, and !
jk
N
are the corresponding twiddle factors. However,
the term parallel FFT is overloaded in its many variations, each of which may aect the form
of the output, and the computational, storage or communication complexities, while still
conforming to the DFT denition [5, 6, 15]. For our purposes, it is sucient to demonstrate
our methodology for any such variation. Thus, arbitrarily and for the sake of clarity, we
select the forward, unscaled, ordered, radix{2, one{dimensional FFT with precomputed
twiddle factors. Likewise, we arbitrarily choose the decimation in time (DIT) bit{reversed
FFT, for which we assume a block data allocation scheme, already selected by the dened
vector space computational framework, summarized above.
Specication: The theoretical framework of [12] enables an algorithmic specication
for orthogonal transforms that facilitates algorithmic representation, derivation and anal-
ysis. We refer to the integral part of the FFT computation, the DIT buttery operation
on all data points from 0 to N   1, with radix{Q, as O
N 1;0
t bf ly;Q
, an instance of which is
portrayed in Figure 1(b) for two data points. We dene !
twi(a)
N
to be the twiddle factor for
the pair of elements x
a
and x
a+Q
in O
N 1;0
t bf ly;Q
, where twi(a) denotes their twiddle index.
Each instance of such a buttery operation consists of one complex multiplication, addi-
3tion and subtraction, as well as buttery permutation, 
bfly;Q
. Performing m consecutive
operations at stage s of the algorithm is expressed in the form of an ordered sequence of m
such operations,

O
f
1
(s)
O
f
2
(s)
 : : : O
f
m
(s)

=

O
f
i
(s)

m
i=1
, where f
i
(s) is some discrete
function of s. Hence, a stage{operation pair precisely describes the operations performed
at stage s in the form

s;

O
f
i
(s)

m
i=1

. A sequence (with enforced ordering) of n similar
stage{operation pairs will be denoted by

g(j);

O
f
i
(g(j))

i

n
j=1
, where g(j) is some discrete
function on j. Finally, the entire algorithm is specied as an ordered sequence of such
stage{operation pairs enclosed in curly braces.
Complexity Metrics: We group and denote the complexity metrics as A, M, C
and I for arithmetic, memory, communication and load imbalance (for any metric relative
imbalance) complexities. We isolate various types within each class, say A
type
, and further
rene the metrics to A
type nmax
and A
type navg
, which stand for the maximum, and average
number of units of A
type
per node, whereas A
type all
is for all nodes. In class A, we
compute the real arithmetic operations, and assume complex adds take 2 real adds, and
a complex multiply comprises 3 real adds and 3 real multiplys. A
ssteps
is the number
of supersteps (sequence of computation steps not interruptible by communication) for the
entire algorithm, and A
steps
gives the total number of computational steps. M
active 
and
M
precomp 
denote the number of real active data (read{write) and precomputed data (read-
only) in memory, respectively. C
steps
are the total number of communication steps for the
entire algorithm; while C
ssteps
gives the total number of communication supersteps (sequence
of communication steps between computation supersteps) for the entire algorithm. C
trans 
is the number of real element transfers performed across all communication steps, but is
also decompose into real element transfers for the particular communication patterns, such
as: C
bfly 
, C
bitrev 
, C
xch 
and C
shf 
. The imbalance ratio I ranges from 0 (for perfect
load{balance) and is less than 1. Assuming that real multiplys costs  times real adds,
I
A
= ((A
add nmax
 A
add navg
) + (A
mul nmax
  A
mul navg
)) =(A
add all
+ A
mul all
),
I
M
type
= (M
type nmax
 M
type navg
) =M
type all
, for type 2 factive; precompg,
I
C
= (C
trans nmax
  C
trans navg
) =C
trans all
.
3 Direct Parallelization
This refers to the straight forward approach of evenly distributing the input data among
the processor nodes, which is specied as:


0; 
r 1;0
bitrev
O
N 1;0
t bf ly;2
0

;

s; O
N 1;0
t bf ly;2
s

r 1
s=1

.
Figure 2 depicts the direct (DR) parallelization of a DIT, bit{reversed FFT of size 8 among
4 processors. Let the a
th
data element, 0  a  N 1 reside in physical location z at stage
(a) Flow graph 
      conventions
x+y
x-y
x
y
y
xx
-x
cx
x
x
x c
y
x x+cy
x-cyc
(y=x+Q)
(b) O t_bfly,Q
Fig. 1.
bit-reversal stage  0 stage 1 stage  2
P
0
P
1
P
2
P
3
0
1
2
3
4
5
6
7
0
3
4
7
2
6
1
5
0
0
0
3
4
7
2
6
1
5
0
2
2
0
0
0
0
2
1
3
Fig. 2. Direct Parallelization: N = 8, P = 4
s. Then, the twiddle index, twi(z) = twi

(
r 1;0
bitrev
)
 1
(a)

= (a
r 1 s
)  (a
r s
: : : a
r 2
a
r 1
) 
42
r 1 s
, which is for our bit{reversed DIT FFT. The rst v stages are entirely local [12]
(comprising a single computational superstep), whereas all subsequent p stages invoke
communication (each constituting a computational superstep). The complexity metrics for
direct parallelization are tabulated in Figures 4, 5, 6 and 7, which follow from the properties
of 
r 1;0
bitrev
and 
bfly;Q
proved in [12]. These metrics illustrate that, for DR parallelization,
while the communication is almost perfectly load{balanced among all the processor nodes
(there is, in fact, a slight imbalance for v < br=2c), it is not the case with arithmetic and
precomputed memory. Figure 2 demonstrates the source of this imbalance to stem from the
multiplications at every stage, precisely what the load{balanced approach needs to remedy.
4 Load{balanced Parallelization
The goal is to achieve I
A
= I
M
active
= I
M
precomp
= I
C
= 0. In order to load balance the
computation (and hence the precomputed storage requirements), we compel the buttery
operations to be local to the processor nodes at every stage of the algorithm [5, 15]. This
is attained by placing the index bit currently operated on by the buttery operation into
the memory part of the index [12], thus guaranteeing V=2 multiplys per node. Such a load{
balancing permutation (LBP) enforces this locality for one or more stages, necessitating
another LBP whenever the locality \runs out". As a result, the output is scrambled,
and an order restoring permutation (ORP) is required at the end. Figure 3 illustrates
LBP
14
12
10
15
0
1
6
7
2
3
4
5
9
11
13
8
11
15
13
14
10
12
0
3
4
7
2
6
1
5
8
9
ORPbit-reversal stage  0 stage  1 stage  2
P
0
P
1
P
2
P
3
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
stage  3
11
15
13
14
10
12
0
3
4
7
2
6
1
5
8
9
0
0
0
0
0
0
0
0
0
0
0
0
4
4
4
4
0
0
2
2
4
4
6
6
0
1
2
3
4
5
6
7
Fig. 3. Radix{4 Load{balanced Parallelization: Case 1 (V  P ), N = 16, P = 4
the load{balanced (LB) approach for V = 4, where it is evident that the LBP is, in
fact, a radix{4 exchange permutation [12]. However, a sequence of two radix{2 exchange
permutations would produce exactly the same outcome. A result in [12] proves that the
optimal solution is to perform one exchange permutation of the maximum possible radix
rather than a sequence with lower radices. Clearly, the maximum attainable radix for a
LBP is minfP; V g.
In deriving the algorithm, the degrees of freedom (more than one permutation produces
the desired result) are resolved according to the following criteria: (1) Memory bits not
exchanged do not change their position within the index, thereby avoiding memory moves
for elements remaining local to the processor after the LBP. (2) The block of bits to
be exchanged moves into the least signicant positions of the memory index, thereby
minimizing the strides of the sequence of buttery operations following the LBP. This
5maximizes the probability for both operands of a buttery operation to fall within the
same cache fetch, which would be the case if the buttery stride is less than half the cache
size [3]. (3) An algorithmic pattern with a monotonically changing step repeats for as
long as possible, which reduces index recalculation (indices usually serve as variables in a
program loop), facilitates programming, and improves performance in most cases, especially
if pipelining is in place [5].
For a rigorous derivation of the LB algorithm, refer to [10], which relies on the
permutation results proved in [12]. Here, our exposition is limited to the derived
specication. Since the twiddle index twi(z) for any location z holding a data element
is the same as for DR parallelization for stages 0 s v   1, we only specify it for stages
vsr   1. The specication of the LB algorithm is broken up into three cases:
Case 1 (V  P ):


0; 
r 1;0
bitrev
O
N 1;0
t bf ly;2
0

;

s; O
N 1;0
t bf ly;2
s

v 1
s=1
;

v; 
v;0
xch;2
p
O
N 1;0
t bf ly;2
0

;

v + j; O
N 1;0
t bf ly;2
j

p 1
j=1
;

r; 
v;0
xch;2
p


.
For stages v + j, 0jp 1, twi(z) = twi

(
r 1;0
bitrev

v;0
xch;2
p
)
 1
(a)

=
(a
j
)  (a
j 1
: : :a
0
a
v 1
: : :a
p
a
r 1
: : : a
v
)  2
p 1 j
, where the bits a
j 1
: : :a
0
= 1 for j = 0.
Case 2 (V < P , p mod v = 0):


0; 
r 1;0
bitrev
O
N 1;0
t bf ly;2
0

;

s; O
N 1;0
t bf ly;2
s

v 1
s=1
;


kv; 
kv; 0
xch;2
v
O
N 1;0
t bf ly;2
0

;

kv + j; O
N 1;0
t bf ly;2
j

v 1
j=1

p=v
k=1
;

r; 
r 1;0
unshf;2
v

)
.
For stages kv + j, 0jp 1, 1kp=v, twi(z) =
twi



r 1;0
bitrev
 (
p=v
k=1

kv;0
xch;2
v
)

 1
(a)

= (a
j
) (a
j 1
: : :a
0
a
(k+1)v 1
: : : a
v
) 2
p 1 j
, where
the bits a
j 1
: : : a
0
= 1 for j = 0.
Case 3: (V < P , p mod v 6= 0):
(

0;
r 1;0
bitrev
O
N 1;0
t bf ly;2
0

;

s; O
N 1;0
t bf ly;2
s

v 1
s=1
;


kv;
kv; 0
xch;2
v
O
N 1;0
t bf ly;2
0

;

kv+j; O
N 1;0
t bf ly;2
j

v 1
j=1

bp=vc
k=1
;

(bp=vc+ 1)v; 
r pmodv; 0
xch;2
pmodv
O
N 1;0
t bf ly;2
0

;

(bp=vc+ 1)v + j; O
N 1;0
t bf ly;2
j

pmodv  1
j=1
;

r; 
v 1;0
unshf;2
pmodv

r 1;0
unshf;2
v
o
.
For stages kv + j, 0jp  1, 1kbp=vc, twi(z) =
twi



r 1;0
bitrev
 (
bp=vc
k=1

kv;0
xch;2
v
)

 1
(a)

= (a
j
)(a
j 1
: : : a
0
a
(k+1)v 1
: : :a
v
)2
p 1 j
, where
the bits a
j 1
: : : a
0
= 1 for j = 0.
For stages (bp=vc+ 1)v + j, 0jp mod v   1, 1kbp=vc,
twi(z) = twi



r 1;0
bitrev
 (
bp=vc
k=1

kv;0
xch;2
v
)
r pmodv;0
xch;2
pmodv

 1
(a)

=
(a
j
)  (a
j 1
: : :a
0
a
v 1
: : : a
pmodv
a
r 1
: : :a
bp=vc+1)v
a
(bp=vc+1)v 1
: : : a
v
)  2
p 1 j
, where the
bits a
j 1
: : : a
0
= 1 for j = 0.
The complexity metrics of the LB approach are tabulated in Figures 4, 5, 6 and 7,
where we achieve I
A
= I
M
active
= I
M
precomp
= I
C
= 0 (except for the case V < P , where
a slight imbalance occurs in the bit-reversal, for both algorithms). For the performance
metrics of case 3 (V < P , p mod v 6= 0), the subcase v  b
r
2
c, as well as for a detailed
derivation of the entire set of performance metrics, refer to [10].
6Arithmetic Direct Load{bal
A
steps
2 logN 2 logN
A
super steps
logP + 1 d
logN
log V
e
A
add nmax
1
2
V (7 logN + 3 logP )
7
2
V logN
A
add navg
7
2
V logN
7
2
V logN
A
mul nmax
3
2
V (logN + logP )
3
2
V logN
A
mul navg
3
2
V logN
3
2
V logN
I
A
3(1+) log P
(7+3) P logN
0
Fig. 4. Parallel FFT Arithmetic Complexity
Memory Direct Load{bal
M
active nmax
2V 2V
M
active navg
2V 2V
I
M
active
0 0
M
precomp nmax
V (logN + logP ) V logN
M
precomp navg
V logN V logN
I
M
precomp
log P
P logN
0
Fig. 5. Parallel FFT Memory Complexity
Commun. Direct Load{balanced
C
ssteps
logP + 1 3
C
bfly nmax
V logP 0
C
bfly navg
V logP 0
C
bitrev nmax
V (1  1=P ) V (1  1=P )
C
bitrev navg
V (1  1=P ) V (1  1=P )
C
xch nmax
0 2V (1  1=P )
C
xch navg
0 2V (1  1=P )
C
shf nmax
0 0
C
shf navg
0 0
C
trans nmax
V (p+ 1  1=P ) 3V (1  1=P )
C
trans navg
V (p+ 1  1=P ) 3V (1  1=P
I
C
0 0
Fig. 6. Parallel FFT Communication
Complexity for V  P
Commun. Direct Load{balanced
C
ssteps
logP + 1 r=v+ 1
C
bfly nmax
V logP 0
C
bfly navg
V logP 0
C
bitrev nmax
V V
C
bitrev navg
V   2
d
r
2
e
=P V   2
d
r
2
e
=P
C
xch nmax
0
p
v
(V   1)
C
xch navg
0
p
v
(V   1)
C
shf nmax
0 V
C
shf navg
0 N   V
C
trans nmax
V (p+ 1) (2 +
p
v
)V  
p
v
C
trans navg
V (p+ 1) (2 +
p
v
)V
 
1
P
2
d
r
2
e
 
p
v
 
1
p
(2
d
r
2
e
+ 1)
I
C
2
d
r
2
e
P (N(1+p) 2
d
r
2
e
)
1
p
(2
d
r
2
e
+1)
(2+
p
v
)N 2
d
r
2
e
 V 
p
v
P
Fig. 7. Parallel FFT Communication
Complexity for V < P and v  b
r
2
c
5 Performance Analysis within Dierent Computational Models
We claim that our methodology for algorithmic derivation and specication adroitly delivers
a rened set of complexity metrics applicable to a wide range of computational models
and implementation environments. For instance, if all communication operations incur the
same cost, regardless of their pattern, then the goal becomes minimizing C
ssteps
. Consulting
Figures 6 and 7 shows that the LB approach wins for systems where P > 4 for the case
V  P , and whenever V  4 for the case V < P . On the other hand, if the communication
cost is incurred by the total number of transfers between nodes, then our criteria becomes
C
trans all
, revealing that, for the case V  P , the LB approach is superior whenever P  4.
To illustrate how these metrics translate into some well known formal models, let
the time taken by an algorithm be T
M
time units, for a machine/model M. First,
consider the PRAM model [4]. Since all memory accesses are considered uniform,
T
PRAM
= A
add nmax
+ A
mul nmax
, yielding (7 + 3)
V
2
logN + (1 + )
V
2
logP and
(7 + 3)
V
2
logN for the DR and LB methods, respectively, thus favoring the latter for its
arithmetic performance. An alternative model, the LPRAM [1] introduces communication
via the machine latency parameter l for accessing global memory. Considering the more
general case of V  P for conciseness, T
LPRAM
= T
PRAM
+ l  C
trans nmax
, yielding
(7+3)
V
2
logN+(1+)
V
2
logP +l 3V (logP +1 1=P ) and (7+3)
V
2
logN+l 3V (1 1=P )
for the DR and LB methods, respectively, conrming that the LB method wins.
Within the realm of the BSP [17], we let the four model parameters be N and P
as before, g the ratio between communication and computation operations, and L the
synchronization cost. In such a scenario, for the case V  P , and following the guidelines
in [14], T
BSP
= A
add nmax
+ A
mul nmax
+ g C
bitrev nmax
+L + g 
P
r 1
s=v
C
bfly nmax
(s) +
7C
ssteps
 L for the DR method, and T
BSP
= A
add nmax
+ A
mul nmax
+ g C
bitrev nmax
+ g 
P
2
s=1
C
xch nmax
(s) + C
ssteps
L for the LB method. A simple analysis suggests that in terms
of communication cost, the LB outperforms the DR whenever V  P=2, which is always
true for V  P . The more rened LogP model [3] assumes message passing as the general
paradigm, providing exibility of performance analysis for very specialized implementations.
In our case, assuming the natural barrier synchronization after every superstep, we may
bundle the overhead per message parameter o with the gap g. Thus, for our purposes,
T
LOGP
= T
BSP
jg g+o
, resulting in the same conclusion as the BSP analysis.
6 A Comparative Implementation
A data parallel CM{Fortran implementation was carried out on the CM{5 (where P is
the number of vector units), for the purpose of comparing the two methods. The intent
was neither to compete with highly tuned library codes, (implemented at a lower level and
performing higher radix{4 FFTs locally to boost performance), nor to contrast this against
other paradigms, such as message passing (with access to ner grain machinery). Rather,
emphasis was made on a fair comparison with uniform assumptions and optimizations for
each case. Purely radix{2 FFTs were computed, with a customization of the rst two stages
to minimize arithmetic. For B disjoint buttery blocks, we control the geometry by aliasing
the N{size 1D data array h(N = P V ), as a 2D array h(P 
B
P
; 1
N
B
) for B  P , and as
h(B  1;
B
P
 V ) for B < P , to perform ecient buttery operations that only invoke one
cshift. This gives a tremendous advantage to the DR version, setting the cost of 
m;l
xch;S
and

m;l
shf;S
, which are performed via a global send through the router, at ve to ten times that
of 
bfly;Q
, depending on Q. Moreover, unlike cshift, all permutations implemented via the
general send, including 
m;l
bitrev
, require an address calculation phase.
To analyze the relative performance, the code is highly instrumented, which not only
introduces substantial time dilation, but also breaks up the pipelining capability of the
compiler, preventing the code from achieving near peak FLOP rate performance.
0
2
4
6
8
10
12
14
16
16 18 20 22 24 26 28
To
ta
l F
FT
 T
im
e 
(se
c)
logN (logarithmic scale)
Total Time: DR and LB FFT
direct, P=128
loadb, P=128
direct, P=256
loadb, P=256
direct, P=512
loadb, P=512
direct, P=1024
loadb, P=1024
direct, P=2048
loadb, P=2048
Fig. 8. Total Times
0
5
10
15
20
25
30
35
40
6 8 10 12 14 16 18 20 22
Ti
m
e 
G
ai
n 
LB
 o
ve
r D
R 
(%
)
logN (logarithmic scale)
Relative Time Gain
P=128
P=256
P=512
Fig. 9. Relative Gain
0
10
20
30
40
50
60
70
80
5 10 15 20
Fr
ac
tio
n 
of
 T
ot
al
 T
im
e 
(%
)
logN (logarithmic scale)
Fraction of Total Time: DR and LB FFT (P=128)
DR comm
LB comm
DR addr calc
LB addr calc
DR arithm
LB arithm
Fig. 10. Fractions of Time
As prophecied by M
precomp nmax
, Figure 8 illustrates that the LB version computes
problem sizes at least twice or four times as large as the DR version, for the entire set of
processor values. Moreover, despite the advantages of the DR implementation discussed
above, the LB algorithm consistently outperforms its direct counterpart.
Figure 9 is a partial magnication of Figure 8, showing the relative time gain of the LB
over the DR algorithm for P = 128; 256, and 512. If t
DR
and t
LB
are the total times of the
DR and LB approaches, respectively, the relative ratio (t
DR
  t
LB
)=t
LB
is depicted. Note
that the LB version is identical to the DR for V = 1, when the relative gain is zero. Then,
8we observe a transitional period, where the relative gain oscillates from approximately 5%
to 35%, which results from the uneven performance of the LB method for cases 2, V < P ,
p mod v = 0 and 3, V < P , p mod v 6= 0. This transitional period lasts from logN sizes of
8 to 14, 9 to 16, and 10 to 18 for the curves P = 128; 256, and 512, respectively. After that,
case 1, V  P , is entered, where the relative performance gain settles around 15{20%.
In Figure 10, we plot three dierent curves per algorthm, which account for the fraction
of the total time spent on communication, address calculation and arithmetic, for P = 128.
The implementational disadvantage of the LB algorithm versus the DR, only calculating
addresses for 
r 1;0
bitrev
, is illustrated by the higher fraction of time that the LB spends in
address calculation, the dierence being 10% on the average. Despite this disadvantage,
the LB algorithm still spends less of its time communicating, about 50% as opposed to 60%
for the DR. Moreover, the LB approach spends a higher fraction of its time in arithmetic
computation, with a 10% lead over the DR, proving that the LB algorithm is more ecient.
We have illustrated that the LB approach is a signicant win for an FFT kernel, from
all perspectives: it is faster, requires less memory, and despite the advantage of the DR
approach within this implementation, spends a smaller fraction of its time communicating.
These observations agree closely with the analytical performance metrics.
References
[1] A. Aggarwal, A. K. Chandra, and M. Snir, Communication complexity of PRAMs, Theoretical
Computer Science, 71 (1990), pp. 3{28.
[2] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex fourier
series, Mathematics of Computation, 19 (1965), pp. 297{301.
[3] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and
T. von Eicken, LogP: Towards a realistic model of parallel computation, in 1993 Procedings of
the Fourth ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, 1993.
[4] S. Fortune and J. Wyllie, Parallelism in random access machines, in Proceedings of the 10th
Annual Symposiym on Theory of Computing, 1978, pp. 114{118.
[5] S. L. Johnsson, M. Jacquemin, and R. L. Krawitz, Communication ecient multi{processor
FFT, Journal of Computational Physics, 102 (1992), pp. 381{397.
[6] S. L. Johnsson and R. L. Krawitz, Cooley-Tukey FFT on the Connection Machine, Parallel
Computing, 18 (1992), pp. 1201{1221.
[7] S. L. Johnsson, R. L. Krawitz, D. MacDonald, and R. Frye, A radix-2 FFT on the Connection
Machine, in Supercomputing 89, ACM, November 1989, pp. 809{819.
[8] H. J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms, Springer-Verlag, 1982.
[9] R. B. Pelz, Parallel compact FFTs for real sequences, SIAM Journal of Scientic Computing,
14 (1993), pp. 914{935.
[10] N. Shalaby, Ecient parallel FFTs: A generalized approach, Tech. Rep. TR{33{95, Harvard
University, Nov. 1995.
[11] , Parallel discrete cosine transforms: Theory and practice, Tech. Rep. TR{34{95, Harvard
University, Dec. 1995.
[12] N. Shalaby and S. L. Johnsson, A vector space framework for parallel stable permutations,
Tech. Rep. TR{32{95, Harvard University, Nov. 1995. (submitted for publication).
[13] , Hierarchical load balancing for parallel fast Legendre transforms, in Eighth SIAM
Conference for Parallel Processing for Scientic Computing, 1997. (to appear).
[14] D. Skillicorn, J. M. Hill, and W. McColl, Questions and answers about bsp, Tech. Rep. PRG{
TR{15-96, Oxford Unviverstiy Computing Laboratory, Parks Road, Oxford OX1 3QD, 1996.
[15] P. Swarztrauber, Multiprocessor FFTs, Parallel Computing, (1987), pp. 197{210.
[16] C. Temperton, Self sorting in{place fast Fourier Transforms, SIAM Journal of Sci. Stat.
Computation, 12 (1991), pp. 808{823.
[17] L. G. Valiant, A bridging model for parallel computation, Communications of the ACM, 33
(1990), pp. 103{111.
