A Practical Hierarchical Model of Parallel Computation II: Binary Tree and FFT Algorithms by Heywood, Todd & Ranka, Sanjay
Syracuse University 
SURFACE 
Electrical Engineering and Computer Science - 
Technical Reports College of Engineering and Computer Science 
10-1991 
Recommended Citation 
Heywood, Todd and Ranka, Sanjay, "A Practical Hierarchical Model of Parallel Computation II: Binary Tree
and FFT Algorithms" (1991). Electrical Engineering and Computer Science - Technical Reports. 122.
https://surface.syr.edu/eecs_techreports/122 
SU-CIS-91-07 
A Practical Hierarchical Model 
of Parallel Computation II: 
Binary Tree and FFT Algorithms 
Todd Heywood and Sanjay Ranka 
October 1991 
School of Computer and Information Science 
Syracuse University 
Suite 4-116, Center for Science and Technology 
Syracuse, New York 13244-4100 
Abstract 
A companion paper has introduced the Hierarchical PRAM (H-PRAM) model of parallel computation,
which achieves a good balance between simplicity of usage and reflectivity of realistic
parallel computers. In this paper, we demonstrate the usage of the model by designing and analyzing
various algorithms for computing the complete binary tree and the FFT/butterfly graph. By
concentrating on two problems, we are able to demonstrate the results of different combinations
of organizational strategies and different types of sub-models of the H-PRAM. The philosophy in
algorithm design is to maximize the number of processors P that are efficiently usable with respect
to an input size N, and to minimize the inefficiency when efficiency is not possible (when P is too
large with respect to N). This can be done because of the H-PRAM's representation of general
locality, i.e. both strict and neighborhood locality, and results in algorithms that can efficiently
employ more processors (and are thus faster) than algorithms for models that only represent strict
locality.
1 Introduction 
In the companion paper [8] we introduced the Hierarchical PRAM (H-PRAM) model of parallel 
computation, and discussed its properties and benefits in detail. In this paper we demonstrate 
the usage of the H-PRAM via the design and analysis of various algorithms for computing both 
complete binary trees and FFT/butterfly graphs. By concentrating on two problems, we are able
to demonstrate the results of different combinations of organizational strategies and different types 
of sub-models of the H-PRAM, e.g. EREW PRAM, LPRAM [1], and BPRAM [2]. 
It is assumed that the reader is familiar with the H-PRAM and its properties, as defined in [8]. 
General locality has been defined to mean both strict and neighborhood locality. We can 
employ general communication and synchronization locality on the private H-PRAM, or general 
synchronization locality (and strict communication locality) on the shared H-PRAM. Since the 
private H-PRAM provides the tightest simultaneous control over the four types of complexities 
(computation, communication, synchronization, and conceptual, as discussed in Section 2.2 of [8]), 
reduces the responsibilities of overlaying programming models (Section 2.2 of [8]), and provides a 
framework that may allow reduced communication bandwidth requirements on architectures as they 
scale up (Section 3.2 of [8]), we feel that algorithms should initially attempt to employ the private 
H-PRAM. However, problems that only submit to non-oblivious (data dependent) communication, 
or possess irregular communication patterns, may require a switch to the shared variant, which is 
the main reason for its existence. The private variant is assumed in this paper. 
Section 2 of the paper considers some preliminary algorithmic issues. In Section 3, we give
two H-PRAM algorithms for computing a complete binary tree. The first serves as an introduction 
to the use of the H-PRAM, and the second introduces simple memory management, which results 
in improved performance for certain latencies. 
In Section 4 we conduct a case study of the H-PRAM by designing and analyzing various 
H-PRAM algorithms for computing the FFT (or butterfly) graph. These algorithms differ both 
in design and in the types of sub-models which are employed, thus demonstrating the results 
of different combinations of organizational strategies and different types of sub-models of the
H-PRAM. Section 5 concludes the paper with a discussion of general algorithmic issues.
2 Preliminaries 
We always assume that $P \le N$ for input size $N$ and number of H-PRAM processors $P$. For
simplicity, we also consider the size of the H-PRAM memory to be equal to the size of the data set
under consideration (normally the input size, but some algorithms may use space $> N$). The reason
for this is that, in the private H-PRAM, partition steps partition memory proportionately to the
processors, and in algorithm design what we are really doing is partitioning the input data. This
assumption is realistic since a mapping of the H-PRAM to an architectural model can simply assign
$N/P$ memory locations of the H-PRAM memory to the local memory of each architecture processor.
The H-PRAM memory can be seen as $P$ blocks of $N/P$ memory locations, where the blocks are
the units that may be grouped and partitioned. Note that for any $P'$-processor sub-PRAM with
$N'$ memory, we have $N'/P' = N/P$ and $P' \le N'$.
We employ the streamlined complexity analysis procedure introduced and justified in
Section 2.3.2 of [8].
Recall from [8] that the latency $l(P')$ is the cost of a communication in a $P'$-processor
sub-PRAM, and $\mathrm{sp}(Q, P')$ is the cost for a $P'$-processor sub-PRAM to $\beta$-synchronize $Q$ sub-sub-PRAMs.
Since latency is not necessarily the diameter of a $P'$-processor sub-network in an underlying
architecture, $\mathrm{diam}(P')$ is used when diameter is meant.
In H-PRAM algorithm analysis, it is often useful to fix the latency parameter $l$ in order to
obtain more informative and concrete results. Since an exhaustive enumeration of results for all
possible values of $l$ is out of the question, we wish to choose a couple of "representative" functions
to fix $l$ to. In this paper we will consider the cases $l(P') = \log P'$ and $l(P') = \sqrt{P'}$, where $P'$ is the
number of processors being communicated amongst, since these functions (and functions of these
functions) seem to cover a wide range of the latencies of existing and likely-to-be-built architectures.
It is also useful to fix the $\beta$-synchronization parameter $\mathrm{sp}$ along with $l$. Recall that when
$\beta$-synchronization is solely a function of the number of sub-PRAMs being synchronized,
$\mathrm{diam}(Q) \le \mathrm{sp}(Q, P') \le \mathrm{diam}(Q) \log Q$. As noted in Section 3.1 of [8], most hierarchical architectures have
$\beta$-synchronization costs which are solely functions of $Q$. Thus, when $l(P') = \mathrm{diam}(P') = \log P'$
we redenote $\mathrm{sp}(Q, P')$ as $\mathrm{sp}(Q)$. Although the range of $\mathrm{sp}(Q)$ is $\log Q \le \mathrm{sp}(Q) \le \log^2 Q$, for
simplicity we fix $\mathrm{sp}(Q) = \log Q$ in this paper. This is very reasonable; for example, the hierarchical,
logarithmic-diameter hypercube network can synchronize $Q$ sub-cubes in $O(\log Q)$ time. If
latency is not defined as diameter then we are at worst overestimating $\beta$-synchronization cost, since
$\mathrm{diam}(Q) \le l(Q)$.
The two-dimensional mesh may have a $\beta$-synchronization cost that is also a function of the
number of processors $P'$ in the PRAM that is synchronizing the $Q$ sub-PRAMs (see Section 3.1
of [8]). Although $\mathrm{sp}(Q, P')$ has been defined to be $\mathrm{diam}(P') \le \mathrm{sp}(Q, P') \le \mathrm{diam}(P') \log Q$ in this
case, since the two-dimensional mesh is the only $\sqrt{P'}$-diameter architecture and it is known that
$P'$ mesh processors can synchronize in time $O(\sqrt{P'})$, when $l(P') = \sqrt{P'}$ we (redenote $\mathrm{sp}(Q, P')$ as
$\mathrm{sp}(P')$ and) fix $\mathrm{sp}(P') = \sqrt{P'}$. Again, if latency is not defined as diameter then we are at worst
overestimating $\beta$-synchronization cost.
The following fact, discussed in Section 2.3.2 of [8], will be used when $\beta$-synchronization is under
consideration. The $\beta$-synchronization complexity of a sub-PRAM sub-algorithm will not dominate
the sub-algorithm's complexity if the following two conditions are met:
1. $\mathrm{sp}(Q, P') \le l(P')$ (note $Q \le P'$ always).
2. The number of partition steps, thus the number of $\beta$-synchronizations, in the sub-algorithm
is of the order of the number of communication steps in the sub-algorithm. Or equivalently,
that it is of the order of $N'/P'$ times the number of permutation steps on $N'$ elements, which
is the same as saying that the number of partition steps is of the order of the number of
permutation steps on $N'$ elements for all possible $P'$ and $N'$ (since $P' \le N'$ always holds).
It is hard to conceive of any architecture where $\beta$-synchronization could not be done at the latency
$l(P')$ of the sub-PRAM executing the partition step, so we assume Condition 1 to be true. Note that
the cases we consider are (a) $l(P') = \log P'$ and $\mathrm{sp}(Q, P') = \mathrm{sp}(Q) = \log Q$, and (b) $l(P') = \sqrt{P'}$
and $\mathrm{sp}(Q, P') = \mathrm{sp}(P') = \sqrt{P'}$, so Condition 1 certainly holds for our purposes. Given this, what
Condition 2 means is that $\beta$-synchronization will not dominate the complexity of any sub-PRAM
algorithm that performs memory management, i.e. that executes permutation steps on the $N'$
elements in the sub-PRAM memory (in time $(N'/P')\,l(P')$) "paired" with partition steps, since
it will be subsumed by the cost of the permuting. If $\beta$-synchronization does not dominate the
complexity of any of the sub-algorithms in a hierarchy, then clearly we can conclude that it does
not dominate the complexity of the H-PRAM algorithm that they comprise.
Due to the numerous definitions of "optimal" in the literature, we need to make the following 
explicit definition. 
Definition 2.1 An optimal H-PRAM algorithm is one which is efficient, i.e. whose processor-time 
product is of the order of (within a constant factor of) the running time of the best known sequential 
algorithm for the problem at hand. 
We often refer to the "optimality range"; by this we mean the range of values that $P$ may
take with respect to $N$ (or vice versa) such that optimality is achieved. We want to maximize the
number of processors $P$ that can efficiently be used on a problem of size $N$; the perfect (optimal!)
optimality range is $P \le N$.
Results are mainly compared to LPRAM [1] results for the same problem, thus comparing 
neighborhood and strict locality utilization (the LPRAM is a PRAM where, in addition to the 
global shared memory, processors have unbounded local memory). 
3 Complete binary tree 
Consider the computation of an N -leaf, height log N binary tree where the inputs to the problem 
reside at the leaves, and an internal node corresponds to the result of some computation $\oplus$ on the
node's two children. Computation starts with the parents of the leaves and flows upward until 
the root is computed. This is a good introductory problem for the H-PRAM because of the tree 
hierarchy relation that it uses. The results of this section also hold for the parallel prefix problem 
by making two passes through the tree, first from the leaves to the root then from the root to the 
leaves [5]. 
We give two algorithms for computing a height $\log N$ ($\log N + 1$ levels) binary tree. The first is
straightforward and intuitive, and serves as the introduction to usage of the H-PRAM; the second
is an adjustment of the first that improves performance in certain cases (latencies), and serves as
an introduction to simple memory management. Assume we have a $P$-processor H-PRAM and let
$k$ be an arbitrary value, $2 \le k \le P$, to be fixed later, where both $P$ and $k$ are powers of two
(clearly $N$ is a power of two). All sub-PRAMs are EREW PRAMs. We divide the $N$-leaf binary
tree into $(\log P / \log k) + 1$ layers, where layers $\lambda$, $1 \le \lambda \le \log P / \log k$, have height $\log k$ and layer
$(\log P / \log k) + 1$ has height $\log(N/P)$. The number of sub-trees in layer $\lambda$, $1 \le \lambda \le (\log P / \log k) + 1$,
is $k^{\lambda-1}$. If we represent each sub-tree in the partitioned binary tree as a node, then we obtain a
$k$-ary tree of height $\log_k P = \log P / \log k$ whose $P$ leaf nodes involve the computation of an $(N/P)$-leaf
binary tree, and whose internal nodes involve the computation of a $k$-leaf binary tree.
The algorithm configures the H-PRAM to compute this $k$-ary tree. We create a $k$-ary hierarchy
of sub-PRAMs, where each of the $\log_k P + 1$ levels of the hierarchy corresponds to a layer in the
original binary tree, and the sub-PRAMs in a given level of the hierarchy compute the sub-trees in the
corresponding layer of the binary tree. In other words, each sub-PRAM in levels $\lambda$, $1 \le \lambda \le \log_k P$,
computes a $k$-leaf binary tree, and each sub-PRAM in level $\log_k P + 1$ computes an $(N/P)$-leaf binary
tree, with computation starting at the bottom of the hierarchy and proceeding upward.
The parameter k can be seen as a "partitioning parameter", dictating the sizes and numbers of 
sub-PRAMs used in the H-PRAM algorithm. The idea is that, given N and P, we will choose k 
to optimize the performance of the algorithm; this will be returned to subsequently. 
In level $\lambda$, $1 \le \lambda \le \log_k P + 1$, there are $k^{\lambda-1}$ sub-PRAMs, each of which has $P/k^{\lambda-1}$ processors.
Note that this means that there is a single $P$-processor (sub-)PRAM in level 1 (as necessary) and
$P$ 1-processor sub-PRAMs in level $\log_k P + 1$. Thus each $P'$-processor sub-PRAM in level $\lambda$,
$1 \le \lambda \le \log_k P$, creates $k$ sub-sub-PRAMs of $P'/k$ processors each, or equivalently, all of the
sub-PRAMs in level $\lambda$ collectively create $k^{\lambda}$ level $\lambda + 1$ sub-sub-PRAMs that have $P/k^{\lambda}$ processors each;
in other words $P'/k = P/k^{\lambda}$. Since by the definition of the private H-PRAM, memory is partitioned
proportionately with the processors, each of the 1-processor sub-PRAMs in level $\log_k P + 1$ initially
has $N/P$ leaves of the binary tree in its memory. The H-PRAM algorithm consists of the recursive
use of one EREW PRAM sub-algorithm, BT(P', λ), which is given below.
BT(P', λ)
    if λ = log_k P + 1 then
        <compute (N/P)-leaf binary tree with P' = 1 processor>
    else
        partition{k, P'/k, BT(P'/k, λ + 1)}
        <compute k-leaf binary tree with k processors; note that k ≤ P'>
    end-if-then-else
end-BT
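The recursion in BT can be mimicked sequentially. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the names `bt` and `reduce_tree` are hypothetical, the operation $\oplus$ is passed in as `op`, and a partition step is simulated by slicing a list rather than by creating sub-PRAMs.

```python
from operator import add


def reduce_tree(values, op):
    """Compute a complete binary tree level by level and return its root.

    len(values) must be a power of two. Each pass combines adjacent pairs,
    which corresponds to one parallel step of the k-leaf computation.
    """
    while len(values) > 1:
        values = [op(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    return values[0]


def bt(leaves, procs, k, op):
    """Sequential simulation of BT(P', lambda) on the leaves owned by one sub-PRAM.

    procs is P' (assumed to be a power of k); leaves holds P' * (N/P) inputs.
    """
    if procs == 1:
        # Bottom level: one processor computes its (N/P)-leaf tree sequentially.
        return reduce_tree(leaves, op)
    # Partition step (simulated): k sub-sub-PRAMs, each with P'/k processors
    # and a proportional share of the memory.
    block = len(leaves) // k
    roots = [bt(leaves[i * block:(i + 1) * block], procs // k, k, op)
             for i in range(k)]
    # Back in the parent: compute the k-leaf tree over the k returned roots.
    return reduce_tree(roots, op)


if __name__ == "__main__":
    # N = 128, P = 16, k = 4, with addition standing in for the operation.
    print(bt(list(range(128)), 16, 4, add))
```

With addition as the operation the root is simply the sum of the inputs, which makes the sketch easy to check by hand; a non-commutative operation such as string concatenation shows that leaf order is preserved.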
The H-PRAM algorithm is invoked by BT(P, 1). The part of BT that computes an NIP-leaf 
binary tree is the standard sequential method. The part of BT that computes a k-leaf binary tree 
is the straightforward method of computing each level of the tree in parallel, where a processor 
computing an internal node (the leaves are in memory) reads the node's two children from memory,
[Figure 1: Hierarchy arising from BT(P, 1) when P = 16, N = 128, and k = 4. Bottom-row label: 1-processor sub-PRAM at level (log P / log k) + 1 = 3.]
processors, rather than k, are actually used to compute a k leaf binary tree. The quantity k is used 
in order to (slightly) simplify analysis and the adjusted algorithm discussed below. 
Figure 1 shows the hierarchy that arises from BT(P, 1) for $P = 16$, $N = 128$ (thus $N/P = 8$),
and k = 4. Each box in the figure denotes a sub-PRAM, and the sub-tree within each box is the 
sub-tree computed by the sub-algorithm running on the corresponding sub-PRAM. 
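The shape of this hierarchy follows directly from the level formulas ($k^{\lambda-1}$ sub-PRAMs of $P/k^{\lambda-1}$ processors each in level $\lambda$). The sketch below tabulates it; the function names `log_base` and `bt_hierarchy` and the dictionary keys are hypothetical, chosen only for illustration.

```python
def log_base(x, base):
    """Exact integer logarithm of x to the given base (x must be a power of base)."""
    count = 0
    while x > 1:
        if x % base:
            raise ValueError(f"{x} is not a power of {base}")
        x //= base
        count += 1
    return count


def bt_hierarchy(n, p, k):
    """Return one dict per level lambda = 1 .. log_k(P) + 1 of the BT hierarchy."""
    depth = log_base(p, k)
    levels = []
    for lam in range(1, depth + 2):
        sub_prams = k ** (lam - 1)
        levels.append({
            "level": lam,
            "sub_prams": sub_prams,
            "processors_each": p // sub_prams,
            # Internal levels compute k-leaf trees; the bottom level computes
            # (N/P)-leaf trees.
            "leaves_per_subtree": k if lam <= depth else n // p,
        })
    return levels


if __name__ == "__main__":
    for row in bt_hierarchy(128, 16, 4):
        print(row)
```

For $P = 16$, $N = 128$, $k = 4$ this gives three levels with 1, 4, and 16 sub-PRAMs of 16, 4, and 1 processors respectively.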
Lemma 3.1 $\beta$-synchronization does not dominate the complexity of the BT(P, 1) algorithm.
Proof: Clearly, the number of partition steps (1), thus the number of $\beta$-synchronization steps,
in a BT(P', λ) sub-algorithm is $\le$ the number of communication steps. Thus the conditions for
$\beta$-synchronization non-domination are satisfied for all sub-algorithms in the hierarchy, and we
conclude that $\beta$-synchronization does not dominate the complexity of the BT(P, 1) H-PRAM algorithm.
□
Although we know that $\beta$-synchronization will not dominate BT(P, 1), and thus can be dropped
from further analysis, for this problem we keep it under consideration for demonstration purposes.
Theorem 3.1 BT(P, 1) can be computed in time
$$O\left(\log P + \log k \cdot \sum_{\lambda=1}^{\log_k P} l(k^{\lambda}) + N/P + \sum_{\lambda=1}^{\log_k P} \mathrm{sp}(k, k^{\lambda})\right)$$
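Proof: Clearly BT(1, $\log_k P + 1$) running on single-processor, level $\log_k P + 1$ sub-PRAMs will
take time $O(N/P)$ as $l(1) = 1$. BT(P', λ) sub-algorithms running on $P' = P/k^{\lambda-1}$ processor
sub-PRAMs in level $\lambda$, $1 \le \lambda \le \log_k P$, have $\log k$ (parallel) computation steps and $3 \log k$ (parallel)
communication steps, each communication step costing $l(P/k^{\lambda-1})$. Summed over these levels, the time is
$$O\left(\sum_{\lambda=1}^{\log_k P} \left(\log k + \log k \cdot l(P/k^{\lambda-1})\right)\right) = O\left(\log P + \log k \cdot \sum_{\lambda=1}^{\log_k P} l(P/k^{\lambda-1})\right)$$
By reversing the sum, we get
$$O\left(\log P + \log k \cdot \sum_{\lambda=1}^{\log_k P} l(k^{\lambda})\right)$$
Adding the $O(N/P)$ time for level $\log_k P + 1$ sub-PRAMs to this gives the total time required for
computation and communication. The final term in the complexity is that of $\beta$-synchronization,
which is obtained by observing that $P' = P/k^{\lambda-1}$ processor sub-PRAMs in level $\lambda$, $1 \le \lambda \le \log_k P$,
have one $\beta$-synchronization step on $k$ sub-sub-PRAMs, for a $\beta$-synchronization complexity
of $\sum_{\lambda=1}^{\log_k P} \mathrm{sp}(k, P/k^{\lambda-1})$. Again reversing the sum, we get $\sum_{\lambda=1}^{\log_k P} \mathrm{sp}(k, k^{\lambda})$. □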
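The sum reversal used in the proof only relies on the fact that the processor counts $P/k^{\lambda-1}$, for $\lambda = 1, \ldots, \log_k P$, are the same numbers as $k^{\lambda}$ listed in the opposite order. A small numerical check, with hypothetical helper names and assuming $P$ is a power of $k$, could look like this:

```python
import math


def level_sizes_top_down(p, k):
    """Processor counts P / k**(lam - 1) for lam = 1 .. log_k(P)."""
    sizes = []
    size = p
    while size > 1:
        sizes.append(size)
        size //= k
    return sizes


def level_sizes_reversed(p, k):
    """The same counts written as k**lam for lam = 1 .. log_k(P)."""
    sizes = []
    size = k
    while size <= p:
        sizes.append(size)
        size *= k
    return sizes


def communication_sum(sizes, latency):
    """Sum of the latency function over the given sub-PRAM sizes."""
    return sum(latency(s) for s in sizes)


if __name__ == "__main__":
    top_down = level_sizes_top_down(4096, 4)
    reversed_form = level_sizes_reversed(4096, 4)
    print(top_down, reversed_form)
    print(communication_sum(top_down, math.log2),
          communication_sum(reversed_form, math.log2))
```

Because both lists contain the same sizes, the two sums agree for any latency function, which is all the reversal step claims.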
The form of this result is typical of H-PRAM algorithms which use hierarchies whose height is a 
function of the number of processors. We see that it is useful to consider specific latency functions 
in order to remove the sums and obtain a more concrete result (which can be compared with results 
for other models). 
Theorem 3.2 For $l(P) = \log P$, BT(P, 1) can be computed in time
$$O(\log^2 P + N/P)$$
Proof: Setting $l(x) = \log x$ in Theorem 3.1 simplifies the sum: $\sum_{\lambda=1}^{\log_k P} \log(k^{\lambda}) = O(\log k \cdot \log_k^2 P)$.
So without $\beta$-synchronization complexity added in, the time is
$$O(\log P + \log k (\log k \cdot \log_k^2 P) + N/P) = O(\log P + \log^2 P + N/P)$$
regardless of the choice of $k$. Since $\mathrm{sp}(k, k^{\lambda}) = \mathrm{sp}(k) = \log k$ in this case, the $\beta$-synchronization
component is $\sum_{\lambda=1}^{\log_k P} \log k = \log P$. Thus the complexity of BT(P, 1) is
$O(\log P + \log^2 P + N/P) = O(\log^2 P + N/P)$. □
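The claim that the bound does not depend on $k$ can be checked numerically. The sketch below evaluates the terms of Theorem 3.1 with $l(x) = \log_2 x$ and $\mathrm{sp}(k, x) = \log_2 k$, dropping constant factors; the function name `bt_time_log_latency` is hypothetical, and $N$, $P$, $k$ are assumed to be powers of two with $P$ a power of $k$.

```python
import math


def bt_time_log_latency(n, p, k):
    """Evaluate the Theorem 3.1 terms for l(x) = log2(x), sp(k, x) = log2(k).

    Constant factors are dropped, as in the O-notation.
    """
    log_p = int(math.log2(p))
    log_k = int(math.log2(k))
    depth = log_p // log_k                          # log_k(P)
    # Sum over lam = 1 .. log_k(P) of log2(k**lam) = lam * log2(k).
    latency_sum = sum(lam * log_k for lam in range(1, depth + 1))
    beta_sync = depth * log_k                       # sum of sp(k) = log2(P)
    return log_p + log_k * latency_sum + n // p + beta_sync


if __name__ == "__main__":
    n, p = 2 ** 20, 2 ** 12
    for k in (2, 4, 8, 16, 64, 4096):
        print(k, bt_time_log_latency(n, p, k))
```

The communication term works out to $(\log^2 P + \log P \cdot \log k)/2$, which always lies between $\log^2 P / 2$ and $\log^2 P$, so varying $k$ changes the total by at most a constant factor.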
Corollary 3.1 For $l(P) = \log P$, BT(P, 1) is optimal for $\epsilon P \log^2 P \le N$, constant $\epsilon$.
Proof: Since the best sequential time for this problem is $O(N)$, BT(P, 1) is optimal if $\log^2 P = O(N/P)$,
i.e. if $P \log^2 P \le c \cdot N$ for some constant $c$, and $\epsilon = 1/c$. □
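To get a feel for what such an optimality range means in practice, one can search for the largest power-of-two $P$ that satisfies the inequality for a given $N$. This is an illustrative sketch only: the constant is taken as 1, and the names `max_efficient_processors` and `log_squared` are hypothetical.

```python
import math


def max_efficient_processors(n, overhead):
    """Largest power of two P with P * overhead(P) <= N (constant taken as 1)."""
    best = 1
    p = 1
    while p <= n:
        if p * overhead(p) <= n:
            best = p
        p *= 2
    return best


def log_squared(p):
    """Per-processor overhead log^2 P, matching the l(P) = log P bound for BT."""
    return math.log2(p) ** 2


if __name__ == "__main__":
    print(max_efficient_processors(2 ** 20, log_squared))
```

The `overhead` argument is a parameter so that other bounds of the form $P \cdot g(P) \le N$ can be explored with the same search.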
On the LPRAM, the algorithm for this problem takes time $O(\log^2 P + (N/P) \log P)$ [1] when
$l(P) = \log P$. On the BPRAM [2], Phase PRAM [6] [7], and the model of Papadimitriou and Yannakakis
[11] (which is essentially an arbitrary-pipelined LPRAM), which are all pipelined models,
algorithms take time $O(\log^2 P / \log\log P + N/P)$ when $l(P) = \log P$ (and, for the Phase PRAM,
which charges for (a) synchronization, when the synchronization cost is O(l)). If pipelining capability
is taken away from these models, it adds a factor of $\log P$ to the times (although this comparison
is somewhat unfair, as the algorithms are designed to take advantage of the pipelining). Note that
the problem can be solved directly on the $\log P$ diameter hypercube in time $O(\log P + N/P)$ by
embedding the tree in it.
Theorem 3.3 For $l(P) = \sqrt{P}$, BT(P, 1) can be computed in time
$$O(\sqrt{P} + N/P)$$
Proof: Setting $l(P) = \sqrt{P}$ in Theorem 3.1 simplifies the sum: $\sum_{\lambda=1}^{\log_k P} \sqrt{k^{\lambda}} = O(\sqrt{P})$. So without
$\beta$-synchronization complexity added in, the time is
$$O(\log P + \log k \cdot \sqrt{P} + N/P)$$
Since $\mathrm{sp}(k, k^{\lambda}) = \mathrm{sp}(k^{\lambda}) = \sqrt{k^{\lambda}}$ in this case, the $\beta$-synchronization component is
$\sum_{\lambda=1}^{\log_k P} \sqrt{k^{\lambda}} = O(\sqrt{P})$. The $\log k$ term in the complexity indicates that $k$ should be fixed at its minimum legal
value, which is 2. Therefore we get
$$O(\log P + \sqrt{P} + N/P) = O(\sqrt{P} + N/P)$$
□
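The sum in this proof is geometric, so it is dominated by its last term $\sqrt{P}$. The sketch below computes it directly; the function name `sqrt_latency_sum` is hypothetical and $P$ is assumed to be a power of $k$.

```python
import math


def sqrt_latency_sum(p, k):
    """Sum of sqrt(k**lam) for lam = 1 .. log_k(P); p must be a power of k."""
    total = 0.0
    size = k
    while size <= p:
        total += math.sqrt(size)
        size *= k
    return total


if __name__ == "__main__":
    for p, k in ((2 ** 20, 2), (4 ** 10, 4)):
        print(k, sqrt_latency_sum(p, k) / math.sqrt(p))
```

The ratio to $\sqrt{P}$ stays below $1/(1 - 1/\sqrt{k})$, which is about 3.41 for $k = 2$ and 2 for $k = 4$, so the sum itself is $O(\sqrt{P})$ for every legal $k$ and only the $\log k$ multiplier favors small $k$.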
Corollary 3.2 For $l(P) = \sqrt{P}$, BT(P, 1) is optimal for $\epsilon \cdot P^{3/2} \le N$, constant $\epsilon$.
Proof: The algorithm is optimal for $\sqrt{P} = O(N/P)$, which translates to $P^{3/2} \le c \cdot N$, and $\epsilon = 1/c$.
□
If the underlying latency $l(P) = \sqrt{P}$ architecture is a diameter-$\sqrt{P}$ two-dimensional mesh, it
may be best to fix $k = 4$ such that $P'$-processor sub-PRAMs will map to $\sqrt{P'} \times \sqrt{P'}$ sub-meshes.
The complexity is affected by a factor of 2.
We note that $O(\sqrt{P} + N/P)$ is the fastest possible time for solving this problem on an
architecture with $l(P) = \sqrt{P}$, and matches the complexity of the equivalent algorithms for the
abovementioned pipelined models without using pipelining (all sub-PRAMs are EREW). On the
LPRAM, the algorithm for this problem takes time $O(\sqrt{P}(\log P + N/P))$.
BT requires no explicit memory management, as it consists of only partitioning on the way 
down into the hierarchy, and computation as it proceeds back up where all data (leaves) needed 
by each sub-PRAM algorithm exists in the sub-PRAM memory. If we take responsibility for some
memory management, we can adjust BT to obtain an H-PRAM algorithm that performs better for
certain latencies, specifically $l(P) = \log P$ but not $l(P) = \sqrt{P}$ (for which BT achieves the fastest
possible time anyway). The remainder of this section is devoted to this new algorithm, and serves 
as an introduction to the abstract control over memory management that the H-PRAM provides. 
In BT, $k$ processors are (actively) used by $P'$-processor sub-PRAMs to compute $k$-leaf binary
trees in levels $\lambda$, $1 \le \lambda \le \log_k P$. In level $\log_k P$, $P' = k$; and in levels $< \log_k P$, $P' > k$. So it can
be seen that we are unnecessarily charging $l(P')$ for communication even though only $k$ processors
are actually used. The adjustment to BT which results in the new algorithm, BT-Pack, is that
we use $k$-processor sub-PRAMs to compute the $k$-leaf binary trees. The new algorithm works as
before, except that as it proceeds back up through the hierarchy, a sub-PRAM with $> k$ processors
(i.e. those in levels $\lambda$, $1 \le \lambda \le \log_k P - 1$) does not directly compute its $k$-leaf binary tree. Instead,
it moves ("packs") the $k$ leaves in its memory into a contiguous block of memory and performs a
partition step such that this block becomes the private block of shared memory of a $k$-processor
sub-sub-PRAM. The algorithm assigned to this sub-sub-PRAM computes the $k$-leaf binary tree
with its $k$ processors.
There are three sub-algorithms for this H-PRAM algorithm. Null-Alg is an algorithm that
does absolutely nothing. Par-BT(x, y) computes an $x$-leaf binary tree with $y$ processors (we will
set $x = y = k$) via the straightforward method of computing each level of the tree in parallel,
where a processor computing an internal node (the leaves are in memory) reads the node's two
children from memory, performs the operation $\oplus$ on them, and writes the result to memory. The
last sub-algorithm is BT-Pack(P', λ), which is given below.
Recall from the beginning of this section that the input size $N$ is equated to the H-PRAM's
memory size, such that a partition step conceptually partitions data, and that the units of H-PRAM
memory that are grouped and partitioned are blocks of $N/P$ memory locations. A $P'$-processor
sub-PRAM has memory (data) size $N' = P' \cdot N/P$. What the last partition step does is create
two sub-sub-PRAMs: one with $k$ processors and memory size $k \cdot N/P$, which will compute a $k$-leaf
binary tree; and one with $P' - k$ processors and memory size $(P' - k) \cdot N/P$. Therefore, before the
partition step we must make sure that all $k$ leaves in the $P'$-processor sub-PRAM are moved into
the block of memory that contains the first $k \cdot N/P$ memory locations, $0, \ldots, (kN/P) - 1$, of the
sub-PRAM; specifying locations $i \cdot (N/P)$, $0 \le i \le k - 1$, just evenly spaces the leaves in memory.
The partition step creates two sub-PRAMs; the $k$-processor one computes the $k$-leaf binary tree and
the $(P' - k)$-processor one does nothing (Null-Alg).
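The packing step described above can be illustrated on a flat list standing in for the sub-PRAM memory. This is a sketch under an assumption the text does not state: each of the $k$ sub-sub-PRAMs is taken to have left the root of its sub-tree in the first cell of its own memory region. The function name `pack_roots` and its parameters are hypothetical.

```python
def pack_roots(memory, k, block):
    """Move the k sub-tree roots of a sub-PRAM into its first k memory blocks.

    memory : list of N' = P' * (N/P) cells owned by a P'-processor sub-PRAM
    k      : number of sub-sub-PRAMs created by the earlier partition step
    block  : N/P, the size of one memory block
    Assumption (not stated in the text): sub-sub-PRAM i left its root in the
    first cell of its own region, i.e. at index i * (N'/k).
    """
    region = len(memory) // k
    # Read phase: k processors each read one root (an exclusive read).
    roots = [memory[i * region] for i in range(k)]
    # Write phase: processor i writes its root to location i * (N/P).
    for i, value in enumerate(roots):
        memory[i * block] = value
    # The first k * (N/P) cells now hold every leaf of the k-leaf tree, so they
    # can become the private memory of a k-processor sub-sub-PRAM; the rest
    # goes to the (P' - k)-processor sub-sub-PRAM that runs Null-Alg.
    return memory[:k * block], memory[k * block:]


if __name__ == "__main__":
    # P' = 16, k = 4, N/P = 2, so N' = 32 and each region has 8 cells.
    mem = [0] * 32
    for i in range(4):
        mem[i * 8] = 100 + i
    packed, rest = pack_roots(mem, 4, 2)
    print(packed, len(rest))
```

Reading all roots before writing any of them keeps the step exclusive-read, exclusive-write and avoids overwriting a root that has not been read yet.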
The H-PRAM algorithm is invoked by BT-Pack(P, 1). 
Lemma 3.2 $\beta$-synchronization does not dominate the complexity of the BT-Pack(P, 1) algorithm.
Proof: Identical to that of Lemma 3.1. □
BT-Pack(P', λ)
    if λ = log_k P + 1 then
        <compute (N/P)-leaf binary tree with P' = 1 processor>
    else if λ = log_k P then
        partition{k, P'/k, BT-Pack(P'/k, λ + 1)}
        <compute k-leaf binary tree with k processors; note that k = P'>
    else
        partition{k, P'/k, BT-Pack(P'/k, λ + 1)}
        <pack the k leaves: k processors read the leaves, and then write them to
         locations i · (N/P), 0 ≤ i ≤ k − 1>
        partition{
            k: Par-BT(k, k);
            P' − k: Null-Alg}
    end-if-then-else
end-BT-Pack
Again, although we know that $\beta$-synchronization does not dominate, we keep it under
consideration in the following analysis for demonstration purposes.
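Theorem 3.4 BT-Pack(P, 1) can be computed in time
$$O\left(\log P + l(k) \log P + \sum_{\lambda=1}^{(\log_k P)-1} l(k^{\lambda}) + N/P + \sum_{\lambda=1}^{\log_k P} \mathrm{sp}(k, k^{\lambda}) + \sum_{\lambda=1}^{(\log_k P)-1} \mathrm{sp}(2, k^{\lambda})\right)$$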
Proof: Clearly BT-Pack(1, $\log_k P + 1$) running on single-processor, level $\log_k P + 1$ sub-PRAMs
will take time $O(N/P)$ as $l(1) = 1$. BT-Pack(P', λ) sub-algorithms running on $P' = k$ processor
sub-PRAMs in level $\log_k P$ have $\log k$ computation steps and $3 \log k$ communication steps, thus
taking time $O(\log k + l(k) \log k)$. BT-Pack(P', λ) sub-algorithms running on $P' = P/k^{\lambda-1}$ processor
sub-PRAMs in level $\lambda$, $1 \le \lambda \le (\log_k P) - 1$, have 2 communication steps, thus taking time
$O\left(\sum_{\lambda=1}^{(\log_k P)-1} l(P/k^{\lambda-1})\right)$. Therefore the total cost of computation and communication steps in
levels $\le \log_k P$ by the BT-Pack(P', λ) sub-algorithm is
$$O\left(\log k + l(k) \log k + \sum_{\lambda=1}^{(\log_k P)-1} l(P/k^{\lambda-1})\right)$$
Reversing the sum of the third term:
$$O\left(\log k + l(k) \log k + \sum_{\lambda=1}^{(\log_k P)-1} l(k^{\lambda})\right)$$
The Par-BT and Null-Alg algorithms run in levels $\lambda$, $2 \le \lambda \le \log_k P$, concurrently with each
other. Since Null-Alg does nothing, the time of a partition step which invokes the two algorithms
is dominated by Par-BT, which has $3 \log k$ communication steps and $\log k$ computation steps on $k$
processors. So the total time spent in Par-BT is $O((\log_k P - 1)(\log k + l(k) \log k))$. Putting the
times for BT-Pack(P', λ) and Par-BT together, we get
$$O\left(\log_k P \,(\log k + l(k) \log k) + \sum_{\lambda=1}^{(\log_k P)-1} l(k^{\lambda}) + N/P\right) = O\left(\log P + l(k) \log P + \sum_{\lambda=1}^{(\log_k P)-1} l(k^{\lambda}) + N/P\right)$$
The final term in the complexity is that of $\beta$-synchronization. The $\beta$-synchronization from the
$k$-ary hierarchy is the same as before, i.e. $\sum_{\lambda=1}^{\log_k P} \mathrm{sp}(k, k^{\lambda})$, and we have the additional cost of
$\beta$-synchronizing 2 sub-PRAMs in each level $\lambda$, $1 \le \lambda \le \log_k P - 1$, which is $\sum_{\lambda=1}^{(\log_k P)-1} \mathrm{sp}(2, k^{\lambda})$. □
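Theorem 3.5 For $l(P) = \log P$, BT-Pack(P, 1) can be computed in time
$$O(\log^{3/2} P + N/P)$$
Proof: Setting $l(P) = \log P$ in Theorem 3.4 simplifies the sum:
$\sum_{\lambda=1}^{(\log_k P)-1} \log(k^{\lambda}) = O(\log^2 P / \log k - \log P + \log k)$. So without $\beta$-synchronization complexity added in, the time is
$$\begin{aligned}
& O\left(\log P + \log k \cdot \log P + \left(\frac{\log^2 P}{\log k} - \log P + \log k\right) + N/P\right) \\
&= O\left(\log k \cdot \log P + \frac{\log^2 P}{\log k} + \log k + N/P\right) \\
&= O\left(\log k \cdot \log P + \frac{\log^2 P}{\log k} + N/P\right)
\end{aligned}$$
In this case we have $\mathrm{sp}(k, k^{\lambda}) = \mathrm{sp}(k) = \log k$ and $\mathrm{sp}(2, k^{\lambda}) = \mathrm{sp}(2) = \log 2 = 1$. Therefore the
$\beta$-synchronization component is subsumed by the other components of the complexity, because
$\sum_{\lambda=1}^{\log_k P} \log k = \log P$ and $\sum_{\lambda=1}^{(\log_k P)-1} \mathrm{sp}(2) = \log_k P - 1$.
We wish to pick $k$ to minimize the complexity
$$O\left(\log k \cdot \log P + \frac{\log^2 P}{\log k} + N/P\right)$$
Choosing $k = 2^{\sqrt{\log P}}$, and noticing that $k$ is legal (i.e. a power of two such that $2 \le k \le P$), gives
the result $O(\log^{3/2} P + N/P)$. □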
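The choice of $k$ balances the two $k$-dependent terms $\log k \cdot \log P$ and $\log^2 P / \log k$, which are equal when $\log k = \sqrt{\log P}$. A brute-force check over the legal values of $\log k$ illustrates this; the function names are hypothetical and constants are dropped.

```python
import math


def bt_pack_cost(log_p, log_k):
    """The k-dependent part of the BT-Pack bound for l(P) = log P."""
    return log_k * log_p + log_p ** 2 / log_k


def best_log_k(log_p):
    """Exhaustively search log k in 1 .. log P for the smallest cost."""
    return min(range(1, log_p + 1), key=lambda log_k: bt_pack_cost(log_p, log_k))


if __name__ == "__main__":
    for log_p in (16, 64):
        # Compare the searched minimizer with the integer square root of log P.
        print(log_p, best_log_k(log_p), math.isqrt(log_p))
```

When $\log P$ is a perfect square the searched minimizer coincides with $\sqrt{\log P}$, and the cost at that point is $2 \log^{3/2} P$.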
Corollary 3.3 For $l(P) = \log P$, BT-Pack(P, 1) is optimal for $\epsilon P \log^{3/2} P \le N$, constant $\epsilon$.
Proof: The algorithm is optimal if $P \log^{3/2} P = O(N)$, i.e. $P \log^{3/2} P \le c \cdot N$, constant $c$, and
$\epsilon = 1/c$. □
In comparison to the $O(\log^{3/2} P + N/P)$ time for BT-Pack(P, 1), the LPRAM algorithm for
this problem takes time $O(\log^2 P + (N/P) \log P)$ [1] when $l(P) = \log P$. The algorithms for the
pipelined models of [2] [6] [7] [11] take time $O(\log^2 P / \log\log P + N/P)$ when $l(P) = \log P$.
BT-Pack(P, 1) gives no improvement over BT(P, 1) for the case $l(P) = \sqrt{P}$.
4 Case study: FFT graph 
In this section we conduct a case study of the utility of the private H-PRAM by designing and 
analyzing algorithms for computing the FFT, or butterfly, graph. We give algorithms that employ 
both two-level and multi-level (logk P) hierarchies, with different combinations of EREW PRAM, 
LPRAM, and BPRAM sub-models for these cases. (Although it has been noted in Section 2 of [8] 
that it may not be a good idea in practice to use extended PRAMs such as the LPRAM or BPRAM 
as sub-models, it is important to include them in a case study.) In addition to demonstrating 
different instances and contexts in which the H-PRAM can be used, a key goal of this section is 
to show how the H-PRAM can exploit general locality. Strict locality can sometimes be used to
obtain optimal algorithms, but the optimality only holds when N is significantly larger than
P. We show that, in the context of the FFT graph problem, when N is not-so-large with respect 
to P, we can switch to employing neighborhood locality to achieve optimality. Furthermore, the 
configuration of the H-PRAM, representing the degree of locality employed, can be parameterized 
by N and P so that the "best" configuration is used, resulting in optimal algorithms for a wider 
range of N and P with respect to each other than achievable with strict locality alone. 
The problem of computing the FFT graph is an ideal problem for a case study of the H-PRAM 
due to its regular structure, and high communication and synchronization requirements. 
A directed acyclic graph, or "dag", can be used to represent a computational problem, where 
nodes represent inputs (if in-degree is 0), outputs (if out-degree is 0), or computations (if in-degree 
and out-degree are positive), and whose edges represent data dependencies, or communications. 
The time required to compute a dag is the time taken by an algorithm to compute all of the output 
(out-degree 0) nodes. The binary tree of the previous section is a dag whose leaves are input nodes, 
and root is the output node. The FFT graph, also known as the butterfly graph, is a dag whose 
computation can solve several problems; the Fast Fourier Transform and bitonic merge are two 
simple examples. 
For $N = 2^m$, an $N$-point, height $\log N$ FFT graph has $\log N + 1$ levels of $N$ nodes each, and
can be represented algorithmically as follows. Let $x_{i,j}$ denote the $j$th node, $0 \le j \le N - 1$, in level
$i$, $0 \le i \le m$. Then the graph is defined by
• Inputs: $x_{0,0}, x_{0,1}, \ldots, x_{0,N-1}$
• Outputs: $x_{m,0}, x_{m,1}, \ldots, x_{m,N-1}$
• Computations: For $1 \le i \le m$, $x_{i,q} = f(x_{i-1,q}, x_{i-1,r})$
where $f$ is a binary function computed in constant time, and $q$ and $r$ have binary representations
that are identical except in the $(m - i)$th position. Figure 2 shows a 16-point FFT graph.
Figure 2: A 16-point FFT graph
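The definition translates almost line for line into a level-by-level evaluation. The following sketch is illustrative only; the function name `fft_graph` is hypothetical, and it assumes that bit positions are numbered from 0 at the least-significant bit, so that level $i$ pairs indices differing in bit $m - i$.

```python
def fft_graph(inputs, f):
    """Evaluate an N-point FFT (butterfly) graph level by level.

    inputs : the N = 2**m values x[0][0..N-1]
    f      : the binary node function
    Assumption: bit positions are counted from 0 at the least-significant bit,
    so level i pairs indices that differ in bit (m - i).
    """
    n = len(inputs)
    m = n.bit_length() - 1
    if n == 0 or 1 << m != n:
        raise ValueError("the number of inputs must be a power of two")
    level = list(inputs)
    for i in range(1, m + 1):
        bit = 1 << (m - i)
        # x[i][q] = f(x[i-1][q], x[i-1][r]), where r is q with bit (m - i) flipped.
        level = [f(level[q], level[q ^ bit]) for q in range(n)]
    return level


if __name__ == "__main__":
    # With addition as f, every output depends on (and sums) all inputs.
    print(fft_graph(list(range(8)), lambda a, b: a + b))
```

Using addition for `f` is a convenient sanity check, since each output node is reachable from every input and therefore ends up holding the sum of all inputs.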
We assume again in this section that P is a power of two. 
In Section 4.1 we give a simple extension of the BT algorithm to solve the FFT graph problem. 
The following sections consider more sophisticated H-PRAM algorithms for computing the FFT 
graph that undertake memory management; Section 4.2 considers algorithms that employ two-level 
hierarchies, while Section 4.3 addresses algorithms which use (logk P)-level hierarchies. 
4.1 Binary tree extension 
As with a binary tree, note that the $N$-point, height $\log N$ FFT graph can be divided into $\log_k P + 1$
stages, where stages $\lambda$, $1 \le \lambda \le \log_k P$, have height $\log k$, and stage $\log_k P + 1$ has height $\log(N/P)$.
Furthermore, the result of this is that the FFT graph is partitioned into disjoint sub-graphs, where
a sub-graph in level $\lambda$, $1 \le \lambda \le \log_k P$, consists of the first $\log k$ levels of an $(N/k^{\lambda-1})$-point FFT
graph, and a sub-graph in level $\log_k P + 1$ is an $(N/P)$-point FFT graph. Therefore we see that we can
adjust the binary tree algorithm, BT(P, 1), to obtain an algorithm for computing the FFT graph,
where stages correspond to levels of hierarchy, and sub-PRAMs to sub-graphs. The adjustments
are that computation is done on the way down into the hierarchy, rather than on the way back up,
and that FFT sub-graphs are computed rather than binary sub-trees.
Note that no memory management is required; as long as we store the $j$th point in a level
of an FFT sub-graph in the $j$th memory location of the sub-PRAM that is computing (the first
$\log k$ levels of) it, the memory (points) partition as required. The H-PRAM algorithm consists
of one sub-algorithm FFT-BT-Ext(P', λ) that is used recursively, given below, and is invoked by
FFT-BT-Ext(P, 1).
FFT-BT-Ext(P', λ)
  if λ = log_k P + 1 then
    < compute (N/P)-point FFT with P' = 1 processor >
  else
    < compute first log k levels of (N/k^(λ-1))-point FFT
      with P' = P/k^(λ-1) processors >
    partition{k, P'/k, FFT-BT-Ext(P'/k, λ+1)}
  end-if-then-else
end-FFT-BT-Ext
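The recursive structure of FFT-BT-Ext can be sketched as a sequential Python simulation. This is only an illustration under stated assumptions: $N$, $P$ and $k$ are powers of two, $P$ is a power of $k$ with $P \le N$, sub-PRAMs are modelled as list slices, and the names `fft_graph` and `fft_bt_ext` are hypothetical. No latency or synchronization cost is modelled.

```python
def fft_graph(points, f, levels=None):
    """Evaluate the first `levels` levels (default: all) of an FFT graph."""
    n = len(points)
    m = n.bit_length() - 1
    if levels is None:
        levels = m
    cur = list(points)
    for i in range(1, levels + 1):
        bit = 1 << (m - i)
        cur = [f(cur[q], cur[q ^ bit]) for q in range(n)]
    return cur


def fft_bt_ext(points, k, p, f):
    """Sequential simulation of the FFT-BT-Ext recursion with p processors."""
    if p == 1:
        return fft_graph(points, f)            # (N/P)-point FFT, one processor
    log_k = k.bit_length() - 1
    cur = fft_graph(points, f, levels=log_k)   # first log k levels of this sub-graph
    size = len(cur) // k                       # k contiguous sub-graphs remain
    out = []
    for b in range(k):                         # stands in for partition{k, P'/k, ...}
        out.extend(fft_bt_ext(cur[b * size:(b + 1) * size], k, p // k, f))
    return out


pair = lambda a, b: (a, b)
pts = list(range(16))
same = fft_bt_ext(pts, 2, 4, pair) == fft_graph(pts, pair)
```

The contiguous slicing reflects the remark that no memory management is needed: after the first $\log k$ levels, the remaining levels only combine points whose indices agree in the top $\log k$ bits.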
Theorem 4.1 FFT-BT-Ext(P, 1) has complexity
$$O\left( \frac{N}{P}\left( \log P + \log k \cdot \sum_{\lambda=1}^{\log_k P} l(k^{\lambda}) \right) + \frac{N}{P}\log(N/P) + \sum_{\lambda=1}^{\log_k P} sp(k, k^{\lambda}) \right)$$
Proof: The proof follows from that of BT (Theorem 3.1) after noting two differences between
FFT-BT-Ext(P', λ) and BT(P', λ). First, in level $\log_k P + 1$ the FFT-BT-Ext sub-algorithms take time
$O((N/P)\log(N/P))$ to compute an $(N/P)$-point FFT graph, rather than $O(N/P)$ for BT. Second,
FFT-BT-Ext sub-algorithms operating in levels $\le \log_k P$ take time that is an additional factor of
$N/P$ over that of BT sub-algorithms. Specifically, the time required for a level $\lambda$, $1 \le \lambda \le \log_k P$,
FFT-BT-Ext sub-algorithm, which computes $\log k - 1$ levels (since one of the $\log k$ levels is the
"input level") of $N/k^{\lambda-1}$ points using $P/k^{\lambda-1}$ processors, is
$$O\left( \frac{N/k^{\lambda-1}}{P/k^{\lambda-1}} \cdot \left(\log k + \log k \cdot l(P/k^{\lambda-1})\right) \right) = O\left( \frac{N}{P}\left(\log k + \log k \cdot l(P/k^{\lambda-1})\right) \right)$$
The rest of the proof follows that of Theorem 3.1. □
Theorem 4.2 For $l(P) = \log P$, FFT-BT-Ext(P, 1) can be computed in time
$$O\left( \frac{N}{P}\left(\log^2 P + \log(N/P)\right) \right)$$
Proof: Follows from Theorem 3.2 and Theorem 4.1. □
Corollary 4.1 For $l(P) = \log P$, FFT-BT-Ext(P, 1) is optimal for $P^{\epsilon \cdot \log P} \le N$, constant $\epsilon$.
Proof: Since the best sequential time for this problem is $O(N \log N)$, FFT-BT-Ext(P, 1) is optimal
if $\log^2 P = O(\log N)$, or $\log^2 P \le c \cdot \log N$ for some constant $c$. This solves to $P^{\epsilon \cdot \log P} \le N$, where
$\epsilon = 1/c$. □
Theorem 4.3 For $l(P) = \sqrt{P}$, FFT-BT-Ext(P, 1) can be computed in time
$$O\left( \frac{N}{P}\left(\sqrt{P} + \log(N/P)\right) \right)$$
Proof: Follows from Theorem 3.3 and Theorem 4.1 above. □
Corollary 4.2 For $l(P) = \sqrt{P}$, FFT-BT-Ext(P, 1) is optimal for $2^{\epsilon \cdot \sqrt{P}} \le N$, constant $\epsilon$.
Proof: The algorithm is optimal when $\sqrt{P} = O(\log N)$, or $\sqrt{P} \le c \cdot \log N$, constant $c$. This solves
to $2^{\epsilon \cdot \sqrt{P}} \le N$, where $\epsilon = 1/c$. □
The algorithm for computing the FFT graph on the LPRAM [1] is optimal for $P \cdot 2^{\epsilon \cdot l(P)} \le N$,
so we see that the H-PRAM allows a larger number of processors to be efficiently used than on
the LPRAM for $l(P) = \sqrt{P}$, but not for $l(P) = \log P$ (since $P \cdot 2^{\epsilon \cdot \log P} = P^{\epsilon+1} < P^{\epsilon \cdot \log P}$). The range
$P^{\epsilon \cdot \log P} \le N$ for $l(P) = \log P$ is quite restrictive, and motivates the development of the
additional algorithms of the following two sections.
4.2 Two-level algorithms 
We begin by noticing (in more detail than in the previous section) how the FFT graph can be
partitioned, so that we can partition the H-PRAM accordingly in order for sub-PRAMs to compute
sub-graphs. Let $k$ be an arbitrary power of two, whose bounds will be fixed subsequently.
Then the $N$-point, height $\log N$ FFT graph can be partitioned into $\log N / \log k = \log_k N$
stages of height $\log k$, where each stage consists of $N/k$ independent $k$-point FFT graphs. Within a
stage, the $k$-point FFT graphs are generally "intertwined", i.e. have intersecting edges, but share
no common nodes, and are thus disjoint from each other. The output of one stage is the input of the
next.
Figure 3 shows a 16-point FFT graph with $k = 4$. It is thus partitioned into $\log 16 / \log 4 = 2$
stages of height $\log 4 = 2$. In each stage, one independent ($k = 4$)-point FFT sub-graph is outlined
in bold.
Figure 3: A partitioned (with k = 4) 16-point FFT graph. Within each stage, one independent
(k = 4)-point FFT sub-graph is outlined in bold.
The idea behind a two-level H-PRAM algorithm is to compute the FFT graph stage by stage,
with the level 1 P-processor PRAM acting as the coordinator and executing $\log_k N$ partition steps,
and employing $N/k$ level 2 sub-PRAMs to compute the k-point FFT (sub)-graphs. However, since
the k-point FFT graphs within a stage are generally intertwined, meaning that the points of a
sub-graph are not contiguous in (level 1) memory, the k-point sub-graphs must be "untangled" so
that level 2 sub-PRAMs can compute them. This means that memory management is required;
but if we do this, then we can obtain parameterized (through $k$) control over the sizes of all FFT
(sub)-graphs that are computed in the algorithm. In other words, we can parameterize general
locality, and $k$ can be set to obtain the optimum degree of locality for certain $l$, $P$, and $N$.
The memory management required is as follows. At the beginning and end of every stage, the
$j$th point of the corresponding level is stored in the $j$th memory location, $0 \le j \le N-1$, in the
level 1 PRAM. Then untangling the k-point FFT graphs at each stage consists of permuting the 
N points, i.e. permuting the H-PRAM memory, such that the points of each k-point FFT graph 
are in contiguous memory locations. Then blocks of k memory locations (points) will become the 
shared memories of level 2 sub-PRAMs. After the level 2 sub-PRAMs compute the k-point FFTs 
and exit, the memory is "unpermuted" so that points are returned to their correct order in the 
current level in the N -point FFT graph. 
The required permutation is the k-shuffle, and the "unpermutation" the k-unshuffle. The
k-shuffle (k-unshuffle) is equivalent to performing $\log k$ perfect shuffles (unshuffles). The permutations
are "segmented", i.e. the permutations are applied independently to multiple blocks of
contiguous points, whose number and size depend on the stage number. To be more specific, in
stage $\lambda$, $1 \le \lambda \le \log_k N$, the k-shuffle is applied to each of $k^{\lambda-1}$ blocks of $N/k^{\lambda-1}$ points prior
to computing the k-point FFTs of that stage, and the k-unshuffle is applied to the same blocks
following the k-point FFT computations. (Note that in stage $\log_k N$ the permutations are identity
permutations and thus are not necessary.) The fact that the memory management permutations
are "segmented" ones is used in Section 4.3 to design multi-level H-PRAM algorithms.
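The segmented k-shuffle described above can be sketched in Python as repeated perfect shuffles on each block. The helper names (`perfect_shuffle`, `segmented`, `k_shuffle`, `k_unshuffle`) are assumptions for illustration, and the sketch assumes every block has a power-of-two length of at least $k$.

```python
def perfect_shuffle(seg):
    """Interleave the two halves of seg: [a0, b0, a1, b1, ...]."""
    half = len(seg) // 2
    out = [None] * len(seg)
    out[0::2] = seg[:half]
    out[1::2] = seg[half:]
    return out


def perfect_unshuffle(seg):
    """Inverse of perfect_shuffle: even positions first, then odd positions."""
    return seg[0::2] + seg[1::2]


def segmented(points, k, blocks, step):
    """Apply `step` log2(k) times to each of `blocks` contiguous segments."""
    size = len(points) // blocks
    out = []
    for start in range(0, len(points), size):
        seg = points[start:start + size]
        for _ in range(k.bit_length() - 1):   # log2(k) rounds, k a power of two
            seg = step(seg)
        out.extend(seg)
    return out


def k_shuffle(points, k, blocks=1):
    return segmented(points, k, blocks, perfect_shuffle)


def k_unshuffle(points, k, blocks=1):
    return segmented(points, k, blocks, perfect_unshuffle)


points = list(range(16))
stage1 = k_shuffle(points, 4)             # stage 1: one block of 16 points
stage2 = k_shuffle(points, 4, blocks=4)   # stage 2: four blocks of 4 points
```

For $N = 16$ and $k = 4$, `stage1` groups the points as {0, 4, 8, 12}, {1, 5, 9, 13}, and so on, so each 4-point sub-graph of stage 1 occupies 4 contiguous locations; `stage2` leaves the order unchanged, matching the remark that the last stage needs no permutation.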
The general two-level, P-processor H-PRAM algorithm for computing the $N$-point FFT graph
consists of two sub-algorithms: the coordinator sub-algorithm that runs on the level 1 PRAM,
performing the memory management, and the sub-algorithm that computes a k-point FFT on a
level 2 sub-PRAM. Since, in each stage, there are $N/k$ FFT graphs to be computed, each partition
step in the level 1 sub-algorithm creates $N/k$ level 2 sub-PRAMs. Note that the minimum number
of sub-PRAMs created must be one ($N/k \ge 1$) and the maximum must be $P$ ($N/k \le P$), therefore
the bounds on the value of $k$ are $N/P \le k \le N$. The number of processors in a level 2 sub-PRAM
is $P' = \frac{P}{N/k} = \frac{k}{N/P}$.
The specific H-PRAM algorithms are those that we have when the types (PRAM, LPRAM, 
BPRAM) of sub-models are fixed, thus fixing the specifics of the sub-algorithms that run on those 
sub-models. We give a general, high-level description of the two-level H-PRAM algorithm, and then 
consider the specific instances of it that occur when various combinations of sub-model types are 
fixed. Let Direct-FFT(P') be the sub-algorithm that is assigned to level 2 sub-PRAMs to compute
k-point FFT graphs with P' processors. The sub-algorithm running on the level 1 PRAM is
FFT-Two-Level(P'), and is given below; the H-PRAM algorithm is initiated by FFT-Two-Level(P).
FFT-Two-Level(P')
  for λ = 1 to log_k N do
    < k-shuffle on k^(λ-1) blocks of points/memory >
    partition{N/k, k/(N/P), Direct-FFT(k/(N/P))}
    < k-unshuffle on k^(λ-1) blocks of points/memory >
  end-for-do
end-FFT-Two-Level
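To make the stage-by-stage structure concrete, here is a hedged sequential simulation in Python. It assumes $N$ is a power of $k$ (so $\log_k N$ is integral), models each level 2 sub-PRAM as a contiguous slice of $k$ points, and ignores processor counts and latency entirely. All function names are hypothetical.

```python
def fft_graph(points, f):
    """Direct level-by-level evaluation of an FFT graph (size a power of two)."""
    n, cur = len(points), list(points)
    m = n.bit_length() - 1
    for i in range(1, m + 1):
        bit = 1 << (m - i)
        cur = [f(cur[q], cur[q ^ bit]) for q in range(n)]
    return cur


def permute(points, k, blocks, step):
    """Apply `step` log2(k) times to each of `blocks` contiguous segments."""
    size = len(points) // blocks
    out = []
    for start in range(0, len(points), size):
        seg = points[start:start + size]
        for _ in range(k.bit_length() - 1):
            seg = step(seg)
        out.extend(seg)
    return out


def shuffle(seg):
    """Perfect shuffle: interleave the two halves."""
    half = len(seg) // 2
    return [x for ab in zip(seg[:half], seg[half:]) for x in ab]


def unshuffle(seg):
    """Perfect unshuffle: inverse of shuffle."""
    return seg[0::2] + seg[1::2]


def fft_two_level(points, k, f):
    n = len(points)
    stages = (n.bit_length() - 1) // (k.bit_length() - 1)   # log_k N, assumed integral
    cur = list(points)
    for lam in range(1, stages + 1):
        blocks = k ** (lam - 1)
        cur = permute(cur, k, blocks, shuffle)               # untangle
        cur = [y for s in range(0, n, k)                     # N/k simulated sub-PRAMs
               for y in fft_graph(cur[s:s + k], f)]
        cur = permute(cur, k, blocks, unshuffle)             # retangle
    return cur


pair = lambda a, b: (a, b)
pts = list(range(16))
agrees = fft_two_level(pts, 4, pair) == fft_graph(pts, pair)
```

Using tuple pairing for $f$ makes the comparison structural: the simulation agrees with the direct evaluation only if every node combines exactly the same two predecessors in the same order.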
We use the notation TYPE1-TYPE2 to refer to the specific H-PRAM algorithms that use a
TYPE1 sub-model in level 1 (on which FFT-Two-Level runs) and TYPE2 sub-models in level 2 (on
which Direct-FFT algorithms run). We consider three types of sub-models: EREW PRAM (EREW
for short), LPRAM, and BPRAM. There are three specific H-PRAM algorithms presented: EREW-EREW,
EREW-LPRAM, and BPRAM-EREW. The reason that the LPRAM is not considered as a TYPE1
sub-model is that it has no advantages over the EREW PRAM for performing permutations. As
noted in Section 3.2 of [8], whether to classify the BPRAM as a computational or architectural model
is unclear. We explore its use as a TYPE1 sub-model since its algorithm for "rational permutations"
can implement our memory management, and we want to see the results of using block pipelining to
move data around (permute it) for memory management. While the EREW-EREW and EREW-LPRAM
algorithm results are presented in detail, we only briefly discuss the BPRAM-EREW results, since
the rather complicated complexity of the BPRAM rational permutation (sub)-algorithm combined
with the dynamic configurability of the H-PRAM leads to tedious and complicated analysis. We do
not consider algorithms that employ the BPRAM as a TYPE2, variable-sized (parameterized by $k$)
sub-model.
We first consider the β-synchronization complexity of FFT-Two-Level algorithms, as it is
independent of the types of sub-models used.
Theorem 4.4 The β-synchronization component in FFT-Two-Level algorithm complexities will
never dominate.
Proof: FFT-Two-Level runs in level 1 on all $P$ H-PRAM processors, executing $\log_k N = \log N / \log k$
partition steps. Since there are $2\log_k N$ permutation steps, the conditions for β-synchronization
non-domination are met. □
Thus, we drop consideration of ,8-synchronization complexity from the subsequent analyses of 
the FFT-Two-Level algorithms. This results in a significantly cleaner and simpler process. 
We first concentrate on the EREW-EREW FFT-Two-Level algorithm. 
4.2.1 EREW-EREW FFT-Two-Level 
Theorem 4.5 The EREW-EREW FFT-Two-Level algorithm can be computed in time
$$O\left( \frac{N\log N}{P} \cdot \left( l\left(\frac{k}{N/P}\right) + \frac{l(P)}{\log k} \right) \right)$$
Proof: FFT-Two-Level, running on the level 1, P-processor sub-PRAM, executes $2\log_k N$
permutations and $\log_k N$ partition steps which create $N/k$ sub-PRAMs. Permuting $N$ elements takes time
$O((N/P) \cdot l(P))$, therefore the time taken by the FFT-Two-Level sub-algorithm is
$O(\log_k N \cdot (N/P) \cdot l(P))$. A Direct-FFT sub-algorithm operating in level 2 on $P' = \frac{k}{N/P}$ processors
computes a k-point FFT graph in the straightforward way in time
$$O\left( \frac{k\log k}{P'} + \frac{k\log k}{P'} \cdot l(P') \right) = O\left( \frac{N\log k}{P} + \frac{N\log k}{P} \cdot l\left(\frac{k}{N/P}\right) \right) = O\left( \frac{N\log k}{P} \cdot l\left(\frac{k}{N/P}\right) \right)$$
and level 2 Direct-FFT sub-algorithms are invoked $\log_k N$ times. Therefore the total time taken by
the H-PRAM algorithm is
$$O\left( \log_k N \cdot \left( \frac{N\log k}{P} \cdot l\left(\frac{k}{N/P}\right) + \frac{N}{P} \cdot l(P) \right) \right)$$
which, by changing $\log_k N$ to $\log N / \log k$ and rearranging, gives the result. □
Again, $k$ must satisfy $N/P \le k \le N$.
Theorem 4.6 The EREW-EREW FFT-Two-Level algorithm is optimal for $P \cdot 2^{\epsilon \cdot l(P)} \le N$, constant
$\epsilon$.
Proof: The optimal parallel time is $O((N/P)\log N)$, so we see that the EREW-EREW FFT-Two-Level
algorithm is optimal when
$$l\left(\frac{k}{N/P}\right) + \frac{l(P)}{\log k} = O(1)$$
In order for this to hold, we need $l\left(\frac{k}{N/P}\right) = O(1)$, which means that $k$ must be $O(N/P)$.
With $k = N/P$, the remaining condition is
$$\frac{l(P)}{\log(N/P)} = O(1)$$
or equivalently $l(P) \le c \cdot \log(N/P)$, constant $c$. This solves to $P \cdot 2^{\epsilon \cdot l(P)} \le N$, where $\epsilon = 1/c$. □
The range $P \cdot 2^{\epsilon \cdot l(P)} \le N$ is the same as that of the FFT graph algorithm for the LPRAM
[1]. This is expected, since $k = N/P$ means that there are 1-processor sub-PRAMs in level 2, and
the LPRAM is an instance of the H-PRAM corresponding to a two-level private H-PRAM with
1-processor sub-PRAMs in level 2.
Note that this is an improvement in the optimality range over that of FFT-BT-Ext for $l(P) = \log P$
but not for $l(P) = \sqrt{P}$.
The fact that $k = N/P$ means that only strict locality is employable to achieve optimality for
FFT-Two-Level. If $k > N/P$, sub-PRAMs have more than one processor; this would be using neighborhood
locality. Although choosing $k > N/P$ will not achieve optimality, it may result in an improved
complexity (over that from choosing $k = N/P$) when the given $N$ and $P$ are not within the
optimality range. In other words, neighborhood locality may be useful for reducing inefficiency
when $N$ is not much larger than $P$.
Following these lines, we now consider what we call the term minimization strategy for
configuring the H-PRAM (i.e. for choosing $k$, given $N$ and $P$). Very simply, it consists of minimizing
the non-optimal factor
$$l\left(\frac{k}{N/P}\right) + \frac{l(P)}{\log k}$$
by choosing $k$ such that the two terms are equal to each other. To do this, we need to fix the
latency function.
Lemma 4.1 When $l(P) = \log P$, the equality
$$l\left(\frac{k}{N/P}\right) = \frac{l(P)}{\log k}$$
holds when $k$ is chosen such that
$$\log k = \frac{1}{2}\left( \log(N/P) + \sqrt{\log^2(N/P) + 4\log P} \right)$$
and furthermore, this $k$ is within the required bounds $N/P \le k \le N$.
Proof: $k$ is chosen such that
$$\log\left(\frac{k}{N/P}\right) = \frac{\log P}{\log k}$$
which simplifies to
$$\log^2 k - \log(N/P)\log k - \log P = 0$$
Applying the quadratic formula gives the solution for $\log k$:
$$\log k = \frac{1}{2}\left( \log(N/P) + \sqrt{\log^2(N/P) + 4\log P} \right)$$
It needs to be checked that $N/P \le k \le N$, or equivalently that $\log(N/P) \le \log k \le \log N$. We
first check that $\log(N/P) \le \log k$:
$$\log(N/P) \le \frac{1}{2}\left( \log(N/P) + \sqrt{\log^2(N/P) + 4\log P} \right)$$
$$\log(N/P) \le \sqrt{\log^2(N/P) + 4\log P}$$
$$\log^2(N/P) \le \log^2(N/P) + 4\log P$$
which clearly holds. We now check that $\log k \le \log N$:
$$\frac{1}{2}\left( \log(N/P) + \sqrt{\log^2(N/P) + 4\log P} \right) \le \log N$$
$$\sqrt{\log^2(N/P) + 4\log P} \le \log N + \log P$$
$$(\log N - \log P)^2 + 4\log P \le (\log N + \log P)^2$$
$$4\log P \le 4\log N \log P$$
which clearly holds. Finally, we note that $k$ is a power of two, as required. □
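As a numerical sanity check of Lemma 4.1 (not part of the original proof), the following Python sketch evaluates the closed form for $\log k$ and confirms that the two terms of the non-optimal factor coincide and that the bounds hold. The function name `balanced_log_k` and the sample values of $N$ and $P$ are assumptions chosen for illustration; the resulting $\log k$ is generally not an integer.

```python
import math


def balanced_log_k(n, p):
    """log2(k) from the quadratic formula in Lemma 4.1, for l(P) = log P."""
    a = math.log2(n / p)
    return 0.5 * (a + math.sqrt(a * a + 4 * math.log2(p)))


checks = []
for n, p in [(2**20, 2**10), (2**16, 2**12), (2**30, 2**6)]:
    log_k = balanced_log_k(n, p)
    left = log_k - math.log2(n / p)        # l(k/(N/P)) = log(k/(N/P))
    right = math.log2(p) / log_k           # l(P) / log k
    in_bounds = math.log2(n / p) <= log_k <= math.log2(n)
    checks.append((math.isclose(left, right), in_bounds))
```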
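Theorem 4.7 For $l(P) = \log P$, the EREW-EREW FFT-Two-Level algorithm can be computed in
time
$$O\left( \frac{N\log N}{P}\left( \sqrt{\log^2(N/P) + 4\log P} - \log(N/P) \right) \right)$$
Proof: With $k$ chosen as in Lemma 4.1, the two terms of the non-optimal factor are equal, so the
factor is
$$l\left(\frac{k}{N/P}\right) + \frac{l(P)}{\log k} = 2\left(\log k - \log(N/P)\right)$$
$$= 2\left( \frac{1}{2}\left(\log(N/P) + \sqrt{\log^2(N/P) + 4\log P}\right) - \log(N/P) \right)$$
$$= \sqrt{\log^2(N/P) + 4\log P} - \log(N/P)$$
and the Theorem follows. □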
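Corollary 4.3 For $l(P) = \log P$, the EREW-EREW FFT-Two-Level algorithm can be computed in
time
$$O\left( \frac{N\log N}{P} \cdot \log(N/P) \right)$$
for $P \cdot 2^{2\sqrt{\log P}} \le N$, and in time
$$O\left( \frac{N\log N}{P} \cdot \sqrt{\log P} \right)$$
for $P \cdot 2^{2\sqrt{\log P}} \ge N$.
Proof: If $4\log P \le \log^2(N/P)$, then
$$\sqrt{\log^2(N/P) + 4\log P} - \log(N/P) \le \sqrt{2\log^2(N/P)} - \log(N/P) \approx 0.41 \cdot \log(N/P)$$
and $4\log P \le \log^2(N/P)$ simplifies to
$$\sqrt{4\log P} \le \log(N/P)$$
$$2^{2\sqrt{\log P}} \le N/P$$
$$P \cdot 2^{2\sqrt{\log P}} \le N$$
Similarly, if $4\log P \ge \log^2(N/P)$ (i.e. if $P \cdot 2^{2\sqrt{\log P}} \ge N$), then
$$\sqrt{\log^2(N/P) + 4\log P} - \log(N/P) \le \sqrt{8\log P} - \log(N/P) < \sqrt{8\log P} \approx 2.83 \cdot \sqrt{\log P}$$
The Corollary follows from substituting $O(\log(N/P))$ or $O(\sqrt{\log P})$ for
$\sqrt{\log^2(N/P) + 4\log P} - \log(N/P)$ in Theorem 4.7. □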
Compare the times in Corollary 4.3 to the complexity which was obtained when $k = N/P$:
$$O\left( \frac{N\log N}{P} \cdot \frac{\log P}{\log(N/P)} \right)$$
which is also the complexity of the FFT graph algorithm on the LPRAM [1]. We see that
neighborhood locality can indeed be used to improve performance for $N$ not so "significantly" larger
than $P$ (i.e. when $N$ and $P$ are such that optimality is not achievable).
We now apply the term minimization strategy to the EREW-EREW FFT-Two-Level algorithm for
the other latency we consider in this paper, $l(P) = \sqrt{P}$.
Lemma 4.2 When $l(P) = \sqrt{P}$,
$$l\left(\frac{k}{N/P}\right) \approx \frac{l(P)}{\log k}$$
when $k$ is chosen such that $k = N/\log^2 N$. However, in order for this $k$ to be legal ($N/P \le k \le N$),
it is necessary that $N \le 2^{\sqrt{P}}$.
Proof: We want to choose $k$ such that
$$\sqrt{\frac{k}{N/P}} = \frac{\sqrt{P}}{\log k}$$
which simplifies to $k\log^2 k = N$. Since $k = O(N/\log^2 N)$ when $k\log^2 k = N$, we choose to fix
$k = N/\log^2 N$. Now it needs to be checked that $N/P \le k \le N$. Clearly $N/\log^2 N \le N$, so we
check that $N/P \le N/\log^2 N$. This simplifies to $N \le 2^{\sqrt{P}}$, which must hold in order to apply the
term minimization strategy. It is also necessary that $k$ be a power of two; if $N/\log^2 N$ is not, then
we pick the nearest power of two to it. □
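Theorem 4.8 For $l(P) = \sqrt{P}$, the EREW-EREW FFT-Two-Level algorithm can be computed in
time
$$O\left( \frac{N}{P}\sqrt{P} \right)$$
Proof: Choosing $k$ as in Lemma 4.2 under its constraint $N \le 2^{\sqrt{P}}$, we know that
$\sqrt{\frac{k}{N/P}} \approx \frac{\sqrt{P}}{\log k}$. Since
$$\frac{\sqrt{P}}{\log k} = \frac{\sqrt{P}}{\log(N/\log^2 N)}$$
is the slightly larger term, the non-optimal factor in the complexity is
$$O\left( \frac{\sqrt{P}}{\log(N/\log^2 N)} \right)$$
so the complexity is
$$O\left( \frac{N\log N}{P} \cdot \frac{\sqrt{P}}{\log(N/\log^2 N)} \right) = O\left( \frac{N}{P}\sqrt{P} \right)$$
□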
When $l(P) = \sqrt{P}$, then, the term minimization strategy can be employed to get an $O((N/P)\sqrt{P})$
time algorithm for $N \le 2^{\sqrt{P}}$. In comparison, the complexity when $k$ was chosen to be $N/P$ was
$$O\left( \frac{N}{P}\sqrt{P} \cdot \frac{\log N}{\log(N/P)} \right)$$
which is also the complexity of the FFT graph algorithm on the LPRAM.
4.2.2 EREW-LPRAM FFT-Two-Level 
We turn to the EREW-LPRAM FFT-Two-Level algorithm, where the level 1 sub-model remains an
EREW PRAM, but the level 2 sub-models that are used to compute k-point FFT graphs are
now LPRAMs. Recall that the level 2 sub-PRAMs have $P' = \frac{k}{N/P}$ processors each in
FFT-Two-Level. The EREW PRAM algorithm for computing a k-point FFT graph with $P'$ processors uses
$O((k/P')\log k) = O((N/P)\log k)$ computation and communication steps. The LPRAM algorithm
in [1] uses $O((k/P')\log k) = O((N/P)\log k)$ computation steps and
$O\left(\frac{k\log k}{P'\log(k/P')}\right) = O\left(\frac{N\log k}{P\log(N/P)}\right)$
communication steps. In the following we consider the Direct-FFT sub-algorithm in the general
FFT-Two-Level algorithm to be this LPRAM algorithm. Using the LPRAM as a sub-model of the
H-PRAM allows one to exploit strict and neighborhood locality simultaneously.
Theorem 4.9 The EREW-LPRAM FFT-Two-Level algorithm can be computed in time
$$O\left( \frac{N\log N}{P}\left( \frac{l(k/(N/P))}{\log(N/P)} + \frac{l(P)}{\log k} \right) \right)$$
Proof: The FFT-Two-Level sub-algorithm, running on the level 1, P-processor sub-PRAM, takes
the same time as it did for the EREW-EREW algorithm:
$$O\left( \log_k N \cdot \frac{N}{P} \cdot l(P) \right) = O\left( \frac{N\log N}{P} \cdot \frac{l(P)}{\log k} \right)$$
From the above discussion of the number of steps in the LPRAM FFT algorithm, it is
straightforward that a Direct-FFT sub-algorithm operating in level 2 on $P' = \frac{k}{N/P}$ processors computes a
k-point FFT graph in time
$$O\left( \frac{N}{P}\log k + \frac{N\log k}{P\log(N/P)} \cdot l\left(\frac{k}{N/P}\right) \right)$$
and level 2 Direct-FFT sub-algorithms are invoked $\log_k N = \log N / \log k$ times. Therefore the total
time taken up by Direct-FFT in the FFT-Two-Level algorithm is
$$O\left( \frac{N\log N}{P}\left( 1 + \frac{l(k/(N/P))}{\log(N/P)} \right) \right) = O\left( \frac{N\log N}{P} \cdot \frac{l(k/(N/P))}{\log(N/P)} \right)$$
Adding this and the above time for the level 1 sub-algorithm together gives the result. □
In the following, we consider the circumstances under which the EREW-LPRAM FFT-Two-level 
algorithm is optimal. 
Lemma 4.3 The EREW-LPRAM FFT-Two-Level algorithm is optimal if $k$ can be chosen such that
the following bounds hold simultaneously.
1. $l\left(\frac{k}{N/P}\right) \le c_1 \cdot \log(N/P)$
2. $k \ge 2^{\epsilon_2 \cdot l(P)}$
where $c_1$ and $c_2$ are positive constants in the running time, $c_1$ corresponding to the Direct-FFT
sub-algorithm and $c_2$ corresponding to the FFT-Two-Level sub-algorithm, and $\epsilon_2 = 1/c_2$.
Theorem 4.10 For $l(P) = \log P$, the EREW-LPRAM FFT-Two-Level algorithm is optimal for
$$P^{\epsilon/(c+1)+1} \le N$$
where $c$ and $\epsilon$ are positive constants such that $\epsilon \le 1 + 1/c$. ($c = c_1$ and $\epsilon = \epsilon_2 = 1/c_2$, where $c_1$ and
$c_2$ are the constants in the running time used in Lemma 4.3.)
Proof: We know from Lemma 4.3 that the algorithm is optimal when $k$ can be chosen such that
(1) $\log\left(\frac{k}{N/P}\right) \le c_1\log(N/P)$, which solves to $k \le (N/P)^{c_1+1} = (N/P)^{c+1}$, and (2) $k \ge 2^{\epsilon_2\log P}$,
which solves to $k \ge P^{\epsilon_2} = P^{\epsilon}$. Remembering that $k$ must be within the bounds $N/P \le k \le N$, we
conclude that the algorithm is optimal if $k$ can be chosen such that
$$\max(N/P, P^{\epsilon}) \le k \le \min(N, (N/P)^{c+1})$$
Since $N$ and $P$ are powers of two, we can assume that $k$ also is, as required.
Therefore, optimality exists when $\max(N/P, P^{\epsilon}) \le \min(N, (N/P)^{c+1})$. Clearly, $N/P \le N$
and $N/P \le (N/P)^{c+1}$, so we need to show the conditions under which we know that
$P^{\epsilon} \le \min(N, (N/P)^{c+1})$ holds. Let $\min(N, (N/P)^{c+1}) = (N/P)^{c+1}$. Then $P^{\epsilon} \le (N/P)^{c+1}$ when
$$P^{\epsilon+c+1} \le N^{c+1}$$
$$P^{\epsilon/(c+1)+1} \le N$$
which gives the claimed optimality range. Now consider that $\min(N, (N/P)^{c+1}) = N$. In this case,
$$N \le (N/P)^{c+1}$$
$$P^{c+1} \le N^{c}$$
$$P^{1+1/c} \le N$$
Then $P^{\epsilon} \le \min(N, (N/P)^{c+1}) = N$ if $P^{\epsilon} \le P^{1+1/c}$, i.e. if $\epsilon \le 1 + 1/c$, which gives the constraints
on the constants, and the Theorem follows. □
Note that the smaller $\epsilon$ is and the larger $c$ is, the better the optimality range
$$P^{\epsilon/(c+1)+1} \le N$$
Also, $\epsilon$ stands for $1/c_2$ and $c$ for $c_1$, where $c_1$ and $c_2$ are constants in the running time (Lemma 4.3).
Therefore we see that the larger the constant factors in the running time are, the better the
optimality range.
Assume that $c_1 = c_2$ and denote the common constant by $c$ (so $\epsilon = 1/c$); the optimality
range of the EREW-LPRAM FFT-Two-Level algorithm becomes $P^{1/(c(c+1))+1} \le N$. Assume $c = 1$
for demonstration purposes; then the range is $P^{3/2} \le N$. Now let $c = 2$; here the EREW-LPRAM
FFT-Two-Level algorithm's optimality range is $P^{7/6} \le N$ (remember that the perfect optimality
range would be $P \le N$). We think that this is good evidence for the utility of neighborhood
locality, at least for the problem of computing the FFT graph with an underlying logarithmic
latency architecture.
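The arithmetic behind these ranges can be checked with a short Python sketch. The helper names `optimality_exponent` and `feasible_log_k` are hypothetical, and the sample values of $N$ and $P$ are chosen only for illustration; the second helper works with base-2 logarithms of $N$, $P$ and $k$ to keep the arithmetic exact.

```python
from fractions import Fraction


def optimality_exponent(c, eps):
    """Exponent e such that the range of Theorem 4.10 reads P**e <= N."""
    return Fraction(eps) / (c + 1) + 1


def feasible_log_k(log_n, log_p, c, eps):
    """Bounds on log2(k): max(N/P, P^eps) <= k <= min(N, (N/P)^(c+1))."""
    lo = max(log_n - log_p, Fraction(eps) * log_p)
    hi = min(log_n, (c + 1) * (log_n - log_p))
    return lo, hi


e1 = optimality_exponent(1, Fraction(1, 1))    # c = 1, eps = 1/c
e2 = optimality_exponent(2, Fraction(1, 2))    # c = 2, eps = 1/c

# With c = 1, eps = 1 and P = 2^10: N = 2^15 = P^(3/2) sits on the boundary,
# while N = 2^14 leaves no admissible k.
lo_ok, hi_ok = feasible_log_k(15, 10, 1, 1)
lo_bad, hi_bad = feasible_log_k(14, 10, 1, 1)
```

The exponents reproduce the values $3/2$ and $7/6$ quoted in the text.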
The other latency function we consider in this paper is $l(P) = \sqrt{P}$, so attention is now turned
to how the EREW-LPRAM FFT-Two-Level algorithm behaves in this case.
Theorem 4.11 For $l(P) = \sqrt{P}$, the EREW-LPRAM FFT-Two-Level algorithm is optimal for
$$2^{\epsilon\sqrt{P}} \le N$$
where $\epsilon$ is a positive constant such that, for a positive constant $c$, $\epsilon \ge 1 + 1/c$. ($c = c_1$ and
$\epsilon = \epsilon_2 = 1/c_2$, where $c_1$ and $c_2$ are the constants in the running time used in Lemma 4.3.)
Proof: We know from Lemma 4.3 that the algorithm is optimal when $k$ can be chosen such that
(1) $\sqrt{\frac{k}{N/P}} \le c_1 \cdot \log(N/P)$, which solves to
$$k \le c_1^2 \cdot (N/P)\log^2(N/P) = c^2 (N/P)\log^2(N/P)$$
and (2) $k \ge 2^{\epsilon_2\sqrt{P}} = 2^{\epsilon\sqrt{P}}$. Remembering that $k$ must be within the bounds $N/P \le k \le N$, we
conclude that the algorithm is optimal if $k$ can be chosen such that
$$\max(N/P, 2^{\epsilon\sqrt{P}}) \le k \le \min(N, c^2 (N/P)\log^2(N/P))$$
Since $N$ and $P$ are powers of two, we can assume that $k$ also is, as required.
Therefore, optimality exists when
$$\max(N/P, 2^{\epsilon\sqrt{P}}) \le \min(N, c^2 (N/P)\log^2(N/P))$$
Clearly, $N/P \le N$ and $N/P \le c^2 \cdot (N/P)\log^2(N/P)$, so we need to show the conditions under
which we know that
$$2^{\epsilon\sqrt{P}} \le \min(N, c^2 (N/P)\log^2(N/P))$$
holds. Clearly, we need that $2^{\epsilon\sqrt{P}} \le N$; this establishes the optimality range stated in the Theorem.
Now consider the other case, where
$$\min(N, c^2 \cdot (N/P)\log^2(N/P)) = c^2 \cdot (N/P)\log^2(N/P)$$
We need to show that $2^{\epsilon\sqrt{P}} \le c^2 \cdot (N/P)\log^2(N/P)$. Operating under the knowledge that $2^{\epsilon\sqrt{P}} \le N$,
we know that
$$c^2 \cdot \frac{2^{\epsilon\sqrt{P}}}{P}\left( \log\frac{2^{\epsilon\sqrt{P}}}{P} \right)^2 \le c^2 \cdot (N/P)\log^2(N/P)$$
Then $2^{\epsilon\sqrt{P}} \le c^2 (N/P)\log^2(N/P)$ if
$$2^{\epsilon\sqrt{P}} \le c^2 \cdot \frac{2^{\epsilon\sqrt{P}}}{P}\log^2\left(\frac{2^{\epsilon\sqrt{P}}}{P}\right)$$
$$\frac{1}{c}\sqrt{P} \le \log\left(\frac{2^{\epsilon\sqrt{P}}}{P}\right)$$
$$\frac{1}{c}\sqrt{P} \le \epsilon\sqrt{P} - \log P$$
$$\log P \le \sqrt{P}\left(\epsilon - \frac{1}{c}\right)$$
This holds if $\epsilon \ge 1 + 1/c$, which is the constraint on the constants, and the Theorem follows. □
As before, notice that the smaller $\epsilon$ is (and thus the larger $c$ is, since it is necessary that
$\epsilon \ge 1 + 1/c$), the better the optimality range. Remember that $\epsilon$ stands for $1/c_2$ and $c$ for $c_1$, where
$c_1$ and $c_2$ are constants in the running time (Lemma 4.3).
This optimality range, $2^{\epsilon\sqrt{P}} \le N$, matches that of FFT-BT-Ext for $l(P) = \sqrt{P}$, and is an
improvement over the one for the LPRAM FFT graph algorithm [1] (and, equivalently, the
EREW-EREW FFT-Two-Level algorithm where $k = N/P$), which is $P \cdot 2^{\epsilon\sqrt{P}} \le N$.
4.2.3 BPRAM-EREW FFT-Two-Level 
For the reasons noted at the beginning of this section, we only briefly discuss the results of the 
BPRAM-EREW FFT-Two-Level algorithm. These results are analogous to those obtained for the 
EREW-EREW FFT-Two-Level algorithm. Optimality is only obtainable for the $k = N/P$ case of
single-processor sub-PRAMs in level 2, and occurs for the range $P \cdot l(P) \le N$. This matches
the result for computing the FFT graph on the BPRAM [2], which is to be expected because the
BPRAM is an instance of a two-level private H-PRAM corresponding to a BPRAM sub-model in
level 1 with single-processor sub-PRAMs in level 2. Term minimization may be used to obtain
slightly improved performance for $N < P \cdot l(P)$, but is overly complicated.
4.3 Multi-level algorithms 
Although the general two-level algorithm is simple and avoids the sums in its analyses that arise
from multi-level hierarchies, it does not exploit all of the general locality inherent in the problem
of computing the $N$-point FFT graph. The general multi-level ($\log_k P$) algorithm presented in
this section, although more involved, does allow the exploitation of all inherent general locality; it
can be seen as a combination of the two-level algorithm and the binary tree extension algorithm
FFT-BT-Ext. As in the previous section, we will consider specific instances of the general algorithm
which are obtained by fixing the types of sub-models of the H-PRAM used.
The previously unexploited locality can be seen in the memory management of FFT-Two-Level,
specifically in the permutations on memory/points that "untangle" and "retangle" the k-point FFT
graphs prior to and following each of the $\log_k N$ stages. Recall that these permutations were
performed in the level 1 P-processor (sub)-PRAM, and thus that communication steps implementing
the permutation were charged at the full latency $l(P)$ of the H-PRAM. However, the permutations
were "segmented", i.e. multiple independent permutations were applied to multiple independent
blocks of contiguous points/memory, where block sizes grew successively smaller with successive
stages. The idea behind the multi-level algorithm is to partition the H-PRAM in correspondence
with the blocks that permutations are applied to, so that each block, and only one block,
is within a sub-PRAM's private memory when a permutation is applied. Then permuting can be
done at the latency of the sub-PRAM rather than that of the P-processor PRAM at level 1 in the
hierarchy.
A better way of explaining this is in terms of the structure of the $N$-point FFT graph first noticed
in Section 4.1. Again consider the graph to be partitioned into $\log_k N = \log N / \log k$ stages of
height $\log k$. Then note that after stage $\lambda$, $1 \le \lambda \le \log_k N$, has been computed, the remainder of
the $N$-point FFT graph that is yet to be computed consists of $k^{\lambda}$ distinct (disjoint) FFT graphs,
each of which has $N/k^{\lambda}$ points. Broadly speaking, the general multi-level algorithm will compute
a stage and then call itself recursively on each of the remaining FFT sub-graphs.
As an example, consider Figure 3. Here we have $N = 16$ and $k = 4$. Note that after stage 1
has been computed, there are $k^1 = 4$ disjoint FFT sub-graphs of $N/k^1 = 4$ points. After stage 2
has been computed, there are $k^2 = 16$ disjoint FFT sub-graphs of $N/k^2 = 1$ point.
The algorithm computes $N/k$ k-point FFT graphs as before, but also partitions the H-PRAM
into a k-ary hierarchy. At each level in the hierarchy, sub-PRAM algorithms will permute (untangle),
compute k-point FFT graphs (via a partition step, as before), and unpermute (retangle). Each
stage $\lambda$ of the $N$-point graph will be computed by sub-PRAMs in level $\lambda$ of the hierarchy.
Since the P-processor H-PRAM is being partitioned into a k-ary hierarchy, and $P \le N$, we have
to partition the $N$-point FFT graph slightly differently. The straightforward way is to say that it
is partitioned into $\log_k P + 1 = \log P / \log k + 1$ stages, where stage $\lambda$, $1 \le \lambda \le \log_k P$, has height
$\log k$, and stage $\log_k P + 1$ has height $\log(N/P)$. Here, in level $\lambda$ of the hierarchy, $1 \le \lambda \le \log_k P$,
k-point FFT graphs will be computed, and in level $\log_k P + 1$, 1-processor sub-PRAMs will compute
$(N/P)$-point FFT graphs.
However, we wish to allow for the case that $k > P$ (although $k \le N$), since maximizing the
partitioning flexibility (parameterized by $k$) maximizes the H-PRAM's ability to adapt to different
$N$, $P$, and cost functions. Define $\log_k x = 1$ when $k > x$. Then we say that the $N$-point FFT graph
is partitioned into $\lceil \log_k P \rceil + 1$ stages, where stage $\lambda$, $1 \le \lambda \le \lceil \log_k P \rceil$, has height $\log k$, and stage
$\lceil \log_k P \rceil + 1$ has height $\log N - \lceil \log_k P \rceil \log k$. If $k > P$, there will be two stages, the first having
height $\log k$ and the second having height $\log N - \log k = \log(N/k)$.
We give a pseudo-code outline of the general multi-level algorithm below, followed by further
discussion; with these in hand, the FFT-Multi-Level algorithm and its correctness should
be immediate when it is given. The sub-algorithm Seq-FFT(x) is a standard sequential algorithm
that computes an x-point FFT graph on a 1-processor sub-PRAM.
Pseudo-FFT-Multi-Level(P', λ)
  < Compute N/k^λ k-point FFT graphs >
  < Comment: note that there are k^(λ-1) sub-PRAMs in level λ,
    and k^(λ-1) · N/k^λ · k = N >
  if λ < log_k P then
    partition{k, P'/k, Pseudo-FFT-Multi-Level(P'/k, λ+1)}
  else
    if k ≤ P then
      < Comment: note that P'/k = 1 and N/k^λ = N/P >
      partition{k, 1, Seq-FFT(N/P)}
    else
      < Comment: note that N/k^λ = N/k >
      partition{P, 1, Seq-FFT(N/k)}
    end-if-then-else
  end-if-then-else
end-Pseudo-FFT-Multi-Level
The (pseudo) H-PRAM algorithm is invoked by Pseudo-FFT-Multi-Level(P, 1). As for FFT-BT-Ext,
there are $k^{\lambda-1}$ sub-PRAMs in levels $\lambda$, $1 \le \lambda \le \log_k P$. If $k \le P$, there are $k^{\lambda-1} = P$
sub-PRAMs in level $\lambda = \log_k P + 1$. If $k > P$, there will be a two-level hierarchy with one
P-processor sub-PRAM in level 1 and $P$ 1-processor sub-PRAMs in level 2. Consider the point
where $\lambda = \log_k P$. Note that if $k \le P$ we have computed $\log P$ levels of the FFT graph and have
yet to compute the remaining $\log(N/P)$ levels. $k^{\lambda} = P$ sub-PRAMs in level $\lambda + 1$ then compute these
remaining levels by independently computing $N/k^{\lambda} = N/P$ point FFT graphs. If $k > P$, then when
$\lambda = \log_k P = 1$ we have computed $\log k$ levels and have yet to compute the remaining $\log(N/k)$
levels. There are $k > P$ independent $N/k < N/P$ point FFT graphs to be computed. Since $P$ is
the maximum number of sub-PRAMs usable, $P$ of them are created, each of which computes an
$(N/k)$-point FFT graph.
We now turn attention to the computation of k-point FFT graphs in levels $1 \le \lambda \le \log_k P$.
Sub-PRAMs in level $\lambda$ collectively compute stage $\lambda$ of the $N$-point FFT graph in the same way
that the two-level algorithm did: by permuting, partitioning, and unpermuting. There are $k^{\lambda-1}$
sub-PRAMs running Pseudo-FFT-Multi-Level in level $\lambda$; each is responsible for computing $N/k^{\lambda}$
k-point FFT graphs. Therefore each stage is computed as $N/k$ independent k-point FFT graphs
since $(N/k^{\lambda}) \cdot k^{\lambda-1} = N/k$.
The general FFT-Multi-Level algorithm is given below. Note that when a $P' = P/k^{\lambda-1}$ processor
sub-PRAM partitions itself to create $N/k^{\lambda}$ sub-sub-PRAMs for computing k-point FFT graphs,
the sub-sub-PRAMs have
$$\frac{P'}{N/k^{\lambda}} = \frac{P/k^{\lambda-1}}{N/k^{\lambda}} = \frac{Pk}{N} = \frac{k}{N/P}$$
processors, as in the two-level algorithm. As before, the Direct-FFT(x) sub-algorithm computes a
k-point FFT graph using x processors.
FFT-Multi-Level(P', λ)
  < k-shuffle on the N/k^(λ-1) points in memory >
  partition{N/k^λ, k/(N/P), Direct-FFT(k/(N/P))}
  < k-unshuffle on the N/k^(λ-1) points in memory >
  if λ < log_k P then
    partition{k, P'/k, FFT-Multi-Level(P'/k, λ+1)}
  else
    partition{min(k, P), 1, Seq-FFT(N/k^λ)}
  end-if-then-else
end-FFT-Multi-Level
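The recursion can be sketched as a sequential Python simulation. This is an illustration under explicit assumptions rather than the report's algorithm: it covers only the case $k \le P$ with $P$ a power of $k$ and $P \le N$, it models each sub-PRAM's private memory as a list slice, and it ignores processor counts and latency. All function names are hypothetical.

```python
def fft_graph(points, f):
    """Direct level-by-level evaluation of an FFT graph (stands in for Seq-FFT)."""
    n, cur = len(points), list(points)
    m = n.bit_length() - 1
    for i in range(1, m + 1):
        bit = 1 << (m - i)
        cur = [f(cur[q], cur[q ^ bit]) for q in range(n)]
    return cur


def shuffle(seg):
    half = len(seg) // 2
    return [x for ab in zip(seg[:half], seg[half:]) for x in ab]


def unshuffle(seg):
    return seg[0::2] + seg[1::2]


def repeat(seg, k, step):
    """Apply `step` log2(k) times: a k-shuffle or k-unshuffle of one block."""
    for _ in range(k.bit_length() - 1):
        seg = step(seg)
    return seg


def fft_multi_level(points, k, p, f):
    """Simulate FFT-Multi-Level on one sub-PRAM holding `points`, p processors."""
    n = len(points)
    cur = repeat(list(points), k, shuffle)                  # k-shuffle local memory
    cur = [y for s in range(0, n, k)                        # Direct-FFT on each
           for y in fft_graph(cur[s:s + k], f)]             # contiguous k-block
    cur = repeat(cur, k, unshuffle)                         # k-unshuffle
    size = n // k
    out = []
    for b in range(k):                                      # partition into k parts
        block = cur[b * size:(b + 1) * size]
        if p > k:                                           # λ < log_k P
            out.extend(fft_multi_level(block, k, p // k, f))
        else:                                               # last level: Seq-FFT
            out.extend(fft_graph(block, f))
    return out


pair = lambda a, b: (a, b)
pts = list(range(64))
agrees = fft_multi_level(pts, 4, 16, pair) == fft_graph(pts, pair)
```

In contrast to the two-level sketch, each permutation here touches only the points held by one simulated sub-PRAM, which is the locality the multi-level algorithm is designed to exploit.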
We again use the notation TYPE1-TYPE2 to refer to specific instances of the general algorithm, 
where FFT-Multi-Level runs on TYPE1 sub-models, and Direct-FFT runs on TYPE2 sub-models. Two 
specific algorithms are considered: EREW-EREW and EREW-LPRAM. 
We first consider the $\beta$-synchronization complexity of FFT-Multi-Level since it is independent
of the types of sub-models used.
Theorem 4.12 The $\beta$-synchronization complexity of FFT-Multi-Level will never dominate.
Proof: The FFT-Multi-Level(P', $\lambda$) sub-algorithms in the hierarchy each have two partition steps
and two permutation steps. Therefore the conditions for $\beta$-synchronization non-domination are met
for each sub-algorithm, and we conclude that $\beta$-synchronization does not dominate the complexity
of the FFT-Multi-Level(P, 1) H-PRAM algorithm. $\Box$
Therefore we again drop consideration of $\beta$-synchronization from the subsequent analyses of the
FFT-Multi-Level algorithms.
4.3.1 EREW-EREW FFT-Multi-Level 
Theorem 4.13 The EREW-EREW FFT-Multi-Level algorithm can be computed in time
$$O\left(\frac{N}{P} \cdot \left(\log P \cdot l\left(\frac{k}{N/P}\right) + \sum_{\lambda=1}^{\log_k P} l(k^{\lambda}) + \log(N/P)\right)\right)$$
Proof: Clearly the Seq-FFT sub-algorithm running on 1-processor sub-PRAMs in level $\log_k P + 1$
will take time $O((N/P)\log(N/P))$ or $O((N/k)\log(N/k))$ depending on whether $k \le P$ or $k > P$,
respectively. Since $N/P > N/k$ when $k > P$ we just use the $O((N/P)\log(N/P))$ form. The
EREW PRAM Direct-FFT sub-algorithm is executed $\log_k P$ times (in levels $2, \ldots, \log_k P + 1$ of
the hierarchy), each time employing $P' = \frac{k}{N/P}$ processors computing a k-point FFT graph in the
straightforward way. Thus the total time taken by Direct-FFT sub-algorithms is
$$O\left(\log_k P \cdot \left(\frac{k \log k}{P'} + \frac{k \log k}{P'} \cdot l\left(\frac{k}{N/P}\right)\right)\right)$$
$$= O\left(\frac{\log P}{\log k} \cdot \frac{N}{P} \log k \cdot l\left(\frac{k}{N/P}\right)\right)$$
$$= O\left(\frac{N \log P}{P} \cdot l\left(\frac{k}{N/P}\right)\right)$$
Lastly, the FFT-Multi-Level sub-algorithm runs in levels $\lambda$, $1 \le \lambda \le \log_k P + 1$, and executes
permutations (in addition to partition steps, already accounted for). Permuting $N'$ elements with
$P'$ ($\le N'$) processors takes time $O((N'/P')\, l(P'))$ on an EREW sub-model. In level $\lambda$ there are
$k^{\lambda-1}$ sub-PRAMs running FFT-Multi-Level; each has $P/k^{\lambda-1}$ processors and $N/k^{\lambda-1}$ points in its
memory. Thus the total time taken by FFT-Multi-Level sub-algorithms in the EREW-EREW
FFT-Multi-Level algorithm is
$$O\left(\sum_{\lambda=1}^{\log_k P} \frac{N/k^{\lambda-1}}{P/k^{\lambda-1}} \cdot l(P/k^{\lambda-1})\right) = O\left(\frac{N}{P} \sum_{\lambda=1}^{\log_k P} l(P/k^{\lambda-1})\right)$$
By reversing the sum, we get
$$O\left(\frac{N}{P} \sum_{\lambda=1}^{\log_k P} l(k^{\lambda})\right)$$
Adding the total times of the three sub-algorithms gives the result. $\Box$
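As an illustration of the proof, the sketch below evaluates the three contributions (Direct-FFT, permutations, Seq-FFT) with all hidden constants set to 1 and confirms that reversing the permutation sum leaves its value unchanged. The function names are hypothetical, the latency function is passed in as a plain Python callable, and P is assumed to be a power of k with N/P <= k.

```python
import math


def log_latency(p):
    return math.log2(p)


def sqrt_latency(p):
    return math.sqrt(p)


def permutation_latency_sum(P, k, latency):
    """Sum of l(P / k^(lam-1)) for lam = 1 .. log_k P (top level first)."""
    total, procs = 0.0, P
    while procs > 1:
        total += latency(procs)
        procs //= k
    return total


def reversed_latency_sum(P, k, latency):
    """Sum of l(k^lam) for lam = 1 .. log_k P: the same terms, reversed."""
    total, size = 0.0, k
    while size <= P:
        total += latency(size)
        size *= k
    return total


def erew_erew_cost(N, P, k, latency):
    """Sum of the three totals from the proof, constants dropped."""
    direct = math.log2(P) * latency(k * P // N)      # Direct-FFT sub-algorithms
    permute = reversed_latency_sum(P, k, latency)    # FFT-Multi-Level permutations
    sequential = math.log2(N // P)                   # Seq-FFT at the leaves
    return (N / P) * (direct + permute + sequential)


print(permutation_latency_sum(256, 4, sqrt_latency),
      reversed_latency_sum(256, 4, sqrt_latency))
print(erew_erew_cost(4096, 256, 16, log_latency))
```

Because the constants are dropped, the returned numbers are only useful for comparing choices of k against each other, not as predicted running times.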
Lemma 4.4 The EREW-EREW FFT-Multi-Level algorithm is optimal if k can be chosen such that
the following bounds hold simultaneously.
1. $l\left(\frac{k}{N/P}\right) \le c_1 \cdot \frac{\log N}{\log P}$
2. $\sum_{\lambda=1}^{\log_k P} l(k^{\lambda}) \le c_2 \cdot \log N$
where $c_1$ and $c_2$ are positive constants in the running time, $c_1$ corresponding to the Direct-FFT
sub-algorithm and $c_2$ corresponding to the FFT-Multi-Level sub-algorithm.
We again consider the explicit latency functions $l(P) = \log P$ and $l(P) = \sqrt{P}$ in order to
remove the sums from the analyses and get simpler and more informative results.
Theorem 4.14 Let $c = c_1$ and $\epsilon = 1/c_2$, where $c_1$ and $c_2$ are the constants in the running time
used in Lemma 4.4, and let $a = (\log P + c)/\log P$. Then, for $l(P) = \log P$ the EREW-EREW
FFT-Multi-Level algorithm is optimal for
$$P^{(1+\sqrt{1+4a\epsilon})/2a} \le N$$
under the restriction that $\epsilon \le 1$ ($c_2 \ge 1$).
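Proof: We know from Lemma 4.4 that the algorithm is optimal if k can be chosen such that (1)
$\log\left(\frac{k}{N/P}\right) \le c_1 \cdot \log N / \log P$, which, letting $c = c_1$, solves to
$$\log k \le c \cdot \frac{\log N}{\log P} + \log(N/P)$$
$$k \le \frac{N}{P} \cdot N^{c/\log P}$$
and (2)
$$\sum_{\lambda=1}^{\log_k P} \log(k^{\lambda}) \le c_2 \cdot \log N$$
which, letting $\epsilon = 1/c_2$ and noting that
$$\sum_{\lambda=1}^{\log_k P} \log(k^{\lambda}) = O\left(\frac{\log^2 P}{\log k}\right)$$
solves as
$$\frac{\log^2 P}{\log k} \le c_2 \cdot \log N$$
$$\epsilon \cdot \frac{\log^2 P}{\log N} \le \log k$$
$$P^{\epsilon \cdot \log P/\log N} \le k$$
Remembering that k must be within the bounds $N/P \le k \le N$, we conclude that the algorithm is
optimal if k can be chosen such that
$$\max\left(N/P,\; P^{\epsilon \cdot \log P/\log N}\right) \le k \le \min\left(N,\; \frac{N}{P} \cdot N^{c/\log P}\right)$$
Since N and P are powers of two, we can assume that k also is, as required. Thus, the algorithm
is optimal when
$$\max\left(N/P,\; P^{\epsilon \cdot \log P/\log N}\right) \le \min\left(N,\; \frac{N}{P} \cdot N^{c/\log P}\right)$$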
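Clearly, $N/P \le N$ and $N/P \le \frac{N}{P} \cdot N^{c/\log P}$. Also $P^{\epsilon \cdot \log P/\log N} \le N$ when
$$\epsilon \cdot \frac{\log^2 P}{\log N} \le \log N$$
$$\sqrt{\epsilon} \cdot \log P \le \log N$$
which holds for all $P \le N$ when $\epsilon \le 1$ (or $c_2 \ge 1$ since $\epsilon = 1/c_2$), which gives the restriction in the
Theorem.
The final case that must hold is
$$P^{\epsilon \cdot \log P/\log N} \le \frac{N}{P} \cdot N^{c/\log P}$$
which we now proceed to simplify to the optimality range stated in the Theorem.
$$P^{\epsilon \cdot (\log P/\log N) + 1} \le N^{(c/\log P) + 1}$$
$$\log P \left(\epsilon \cdot \frac{\log P}{\log N} + 1\right) \le \log N \left(\frac{c}{\log P} + 1\right)$$
$$\epsilon \cdot \frac{\log P}{\log N} + 1 \le \frac{\log N}{\log P} \cdot \frac{\log P + c}{\log P}$$
For presentational purposes, let $x = \log N/\log P$ and let $a = (\log P + c)/\log P$. Then we have
$$\frac{\epsilon}{x} + 1 \le xa$$
$$\epsilon + x \le x^2 a$$
$$x^2 a - x - \epsilon \ge 0$$
Applying the quadratic formula to the equation gives
$$x = \frac{\log N}{\log P} \ge \frac{1 + \sqrt{1 + 4a\epsilon}}{2a}$$
Therefore, with $a = (\log P + c)/\log P$, k is choosable such that the algorithm is optimal when
$$\log P \cdot \frac{1 + \sqrt{1 + 4a\epsilon}}{2a} \le \log N$$
$$P^{(1+\sqrt{1+4a\epsilon})/2a} \le N$$
and the Theorem follows. $\Box$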
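The optimality range of Theorem 4.14 is easy to evaluate numerically. The sketch below is an illustration only, with hypothetical function names: it computes the exponent $(1+\sqrt{1+4a\epsilon})/2a$ and, separately, the interval for log k that the proof requires to be non-empty. Logarithms are taken base 2 and the constants c1 and c2 are treated as free parameters; log k is treated as a real number, so the sketch does not check whether a power of two falls inside the interval.

```python
import math


def erew_erew_log_exponent(P, c1=1.0, c2=1.0):
    """Exponent e such that the range of Theorem 4.14 reads P^e <= N."""
    a = (math.log2(P) + c1) / math.log2(P)
    eps = 1.0 / c2
    return (1.0 + math.sqrt(1.0 + 4.0 * a * eps)) / (2.0 * a)


def feasible_log_k_range(N, P, c1=1.0, c2=1.0):
    """Interval (lo, hi) for log2(k) from the proof, or None if it is empty."""
    log_n, log_p = math.log2(N), math.log2(P)
    eps = 1.0 / c2
    lo = max(log_n - log_p, eps * log_p * log_p / log_n)
    hi = min(log_n, (log_n - log_p) + c1 * log_n / log_p)
    return (lo, hi) if lo <= hi else None


P = 2 ** 16
e = erew_erew_log_exponent(P)
print("exponent:", e, "so log2(N) must be at least", e * math.log2(P))
print(feasible_log_k_range(2 ** 25, P))   # inside the range
print(feasible_log_k_range(2 ** 24, P))   # just outside the range
```

The two functions agree with each other: for P = 2^16 and c1 = c2 = 1 the interval is non-empty exactly when log2(N) is at least the exponent times 16.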
Now the behavior of the EREW-EREW FFT-Multi-Level algorithm when $l(P) = \sqrt{P}$ is considered.
Theorem 4.15 For $l(P) = \sqrt{P}$, the EREW-EREW FFT-Multi-Level algorithm is optimal for $2^{\epsilon\sqrt{P}} \le N$,
where $\epsilon = 1/c_2$ and $c_2$ is a constant in the running time (corresponding to the $c_2$ in Lemma 4.4).
Proof: We know from Lemma 4.4 that the algorithm is optimal if k can be chosen such that (1)
$\sqrt{\frac{k}{N/P}} \le c_1 \cdot \log N/\log P$, which solves to
$$k \le (c_1)^2 \cdot \frac{N \log^2 N}{P \log^2 P}$$
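and (2)
$$\sum_{\lambda=1}^{\log_k P} \sqrt{k^{\lambda}} \le c_2 \cdot \log N$$
which, noting that
$$\sum_{\lambda=1}^{\log_k P} \sqrt{k^{\lambda}} = O(\sqrt{P})$$
solves to
$$\sqrt{P} \le c_2 \cdot \log N$$
$$2^{\epsilon\sqrt{P}} \le N$$
where $\epsilon = 1/c_2$. Remembering that k is constrained by $N/P \le k \le N$, we see that the algorithm
is optimal for $2^{\epsilon\sqrt{P}} \le N$ if k can be chosen such that
$$N/P \le k \le \min\left(N,\; (c_1)^2 \cdot \frac{N \log^2 N}{P \log^2 P}\right)$$
This is clearly possible since $N/P$ is $\le$ both terms of the min function. Since N and P are powers
of two, we can assume that k also is, as required. $\Box$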
We again note that $2^{\epsilon\sqrt{P}} \le N$ is a substantial improvement in the optimality range, for $l(P) = \sqrt{P}$,
over the $P \cdot 2^{\epsilon\sqrt{P}} \le N$ optimality range of both the EREW-EREW FFT-Two-Level algorithm,
and the LPRAM FFT graph algorithm [1]. It matches the FFT-BT-Ext and EREW-LPRAM
FFT-Two-Level ranges for $l(P) = \sqrt{P}$.
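To get a feel for the size of this improvement, the sketch below finds the largest power-of-two processor count satisfying each range for a given N. It is an illustration under stated assumptions: the function names are hypothetical, logarithms are base 2, and the hidden constant is folded into a single parameter `c2` (so the exponent is sqrt(P)/c2).

```python
import math


def max_procs_multi_level(N, c2=1.0):
    """Largest power-of-two P <= N with 2^(sqrt(P)/c2) <= N."""
    P = 1
    while 2 * P <= N and math.sqrt(2 * P) / c2 <= math.log2(N):
        P *= 2
    return P


def max_procs_strict_locality(N, c2=1.0):
    """Largest power-of-two P <= N with P * 2^(sqrt(P)/c2) <= N."""
    P = 1
    while 2 * P <= N and math.log2(2 * P) + math.sqrt(2 * P) / c2 <= math.log2(N):
        P *= 2
    return P


for exp in (20, 30, 40):
    N = 2 ** exp
    print(exp, max_procs_multi_level(N), max_procs_strict_locality(N))
```

Both conditions are compared after taking logarithms, which avoids forming the very large powers of two directly.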
The term minimization strategy can be employed to choose k for the EREW-EREW FFT-Multi-Level
algorithm similarly to the way it was for the EREW-EREW FFT-Two-Level algorithm, giving
very similar results. We only summarize them here. For $l(P) = \log P$, the complexities are
O ( NlogP~g(N/P)) 
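for $P \cdot 2^{2\sqrt{\log P}} < N$, and
$$O\left(\frac{N \log P \sqrt{\log P}}{P}\right)$$
for $P \cdot 2^{2\sqrt{\log P}} \ge N$. For $l(P) = \sqrt{P}$ the complexity is
$$O\left(\frac{N}{P}\left(\sqrt{P} + \log(N/P)\right)\right)$$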
We again note that term minimization is meant to be used when N and P are values such that 
optimal time O((N/P)logN) cannot be achieved. 
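One simple way to carry out term minimization in practice is to evaluate the cost expression for every admissible k and keep the smallest. The following sketch does this by brute force for the EREW-EREW bound. It is not the authors' procedure: the names `cost_factor` and `best_k` are hypothetical, constants are dropped, the number of levels is rounded up when P is not a power of k, and sub-PRAM sizes are capped at P (both are assumptions made only to keep the sketch well defined for every k).

```python
import math


def log_latency(p):
    return math.log2(p)


def sqrt_latency(p):
    return math.sqrt(p)


def cost_factor(N, P, k, latency):
    """Bracketed factor of the EREW-EREW bound for one k, constants dropped."""
    levels = math.ceil(math.log2(P) / math.log2(k))   # stands in for log_k P
    direct = math.log2(P) * latency(k * P / N)        # Direct-FFT term
    # permutation term; sub-PRAM sizes are capped at P (assumption)
    permute = sum(latency(min(k ** lam, P)) for lam in range(1, levels + 1))
    return direct + permute + math.log2(N / P)


def best_k(N, P, latency):
    """Try every power of two k with N/P <= k <= N and keep the cheapest."""
    candidates = []
    k = max(2, N // P)
    while k <= N:
        candidates.append(k)
        k *= 2
    return min(candidates, key=lambda cand: cost_factor(N, P, cand, latency))


N, P = 2 ** 20, 2 ** 16
print("log latency :", best_k(N, P, log_latency))
print("sqrt latency:", best_k(N, P, sqrt_latency))
```

An exhaustive scan is cheap here because only about log N values of k are admissible.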
4.3.2 EREW-LPRAM FFT-Multi-Level 
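Theorem 4.16 The EREW-LPRAM FFT-Multi-Level algorithm can be computed in time
$$O\left(\frac{N}{P} \cdot \left(\log P + \frac{\log P}{\log(N/P)} \cdot l\left(\frac{k}{N/P}\right) + \sum_{\lambda=1}^{\log_k P} l(k^{\lambda}) + \log(N/P)\right)\right)$$
Proof: The only difference from the EREW-EREW algorithm is that the sub-models that k-point
FFT graphs are computed on are now LPRAMs rather than EREW PRAMs. The total time taken
by the FFT-Multi-Level sub-algorithms and Seq-FFT sub-algorithms is the same, which we know
from Theorem 4.13 to be
$$O\left(\frac{N}{P} \cdot \left(\sum_{\lambda=1}^{\log_k P} l(k^{\lambda}) + \log(N/P)\right)\right)$$
The LPRAM Direct-FFT sub-algorithm is executed $\log_k P$ times (in levels $2, \ldots, \log_k P + 1$ of the
hierarchy), each time employing $P' = \frac{k}{N/P}$ processors computing a k-point FFT graph. The LPRAM
algorithm in [1] uses $O((k/P')\log k) = O((N/P)\log k)$ computation steps and
$O\left(\frac{k}{P'} \cdot \frac{\log k}{\log(k/P')}\right) = O\left(\frac{N}{P} \cdot \frac{\log k}{\log(N/P)}\right)$
communication steps. Thus the total time taken by Direct-FFT sub-algorithms is
$$O\left(\log_k P \cdot \left(\frac{k \log k}{P'} + \frac{k \log k}{P' \log(k/P')} \cdot l\left(\frac{k}{N/P}\right)\right)\right)$$
$$= O\left(\frac{\log P}{\log k} \cdot \left(\frac{N}{P} \log k + \frac{N \log k}{P \log(N/P)} \cdot l\left(\frac{k}{N/P}\right)\right)\right)$$
$$= O\left(\frac{N}{P} \cdot \left(\log P + \frac{\log P}{\log(N/P)} \cdot l\left(\frac{k}{N/P}\right)\right)\right)$$
Adding the total times of the three sub-algorithms gives the result. $\Box$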
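The only term that changes between the EREW-EREW and EREW-LPRAM variants is the Direct-FFT total, where the latency is paid on a fraction 1/log(N/P) of the steps instead of on every step. The sketch below compares the two totals as they appear in the first line of each derivation, with constants dropped. Function names are hypothetical and N/P > 1 is assumed so that log(N/P) is nonzero.

```python
import math


def log_latency(p):
    return math.log2(p)


def direct_fft_total_erew(N, P, k, latency):
    """log_k P executions, each paying latency on every one of its steps."""
    lat = latency(k * P / N)                 # l(k / (N/P))
    return (N / P) * math.log2(P) * (1.0 + lat)


def direct_fft_total_lpram(N, P, k, latency):
    """Same computation steps, but latency only on the communication steps."""
    lat = latency(k * P / N)
    return (N / P) * math.log2(P) * (1.0 + lat / math.log2(N / P))


N, P, k = 2 ** 20, 2 ** 10, 2 ** 12
print(direct_fft_total_erew(N, P, k, log_latency))
print(direct_fft_total_lpram(N, P, k, log_latency))
```

The gap between the two totals widens as N/P grows, which is why the LPRAM sub-model relaxes the first condition of the optimality lemma by a factor of log(N/P).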
Lemma 4.5 The EREW-LPRAM FFT-Multi-Level algorithm is optimal if k can be chosen such that
the following bounds hold simultaneously.
1. $l\left(\frac{k}{N/P}\right) \le c_1 \cdot \frac{\log N \log(N/P)}{\log P}$
2. $\sum_{\lambda=1}^{\log_k P} l(k^{\lambda}) \le c_2 \cdot \log N$
where $c_1$ and $c_2$ are positive constants in the running time, $c_1$ corresponding to the Direct-FFT
sub-algorithm and $c_2$ corresponding to the FFT-Multi-Level sub-algorithm.
We turn to considering the explicit latency functions $l(P) = \log P$ and $l(P) = \sqrt{P}$, as usual,
in order to remove the sums from the analyses and get simpler and more informative results.
For $l(P) = \log P$, finding the optimality range $f(P) \le N$ requires solving a cubic equation.
Because of the difficulty of this, we define the range in terms of the cubic equation in the following
Theorem, and subsequently calculate a few explicit range results.
Theorem 4.17 Let $c = c_1$ and $\epsilon = 1/c_2$, where $c_1$ and $c_2$ are the constants in the running time
used in Lemma 4.5, let $x = \log N/\log P$, and let z denote the positive solution to the cubic equation
$$cx^3 + (1 - c)x^2 - x - \epsilon \ge 0$$
Then, for $l(P) = \log P$, the EREW-LPRAM FFT-Multi-Level algorithm is optimal for $P^z \le N$, under
the restriction that $\epsilon \le 1$ ($c_2 \ge 1$).
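Proof: We know from Lemma 4.5 that the algorithm is optimal if k can be chosen such that (1)
$\log\left(\frac{k}{N/P}\right) \le c_1 \cdot \frac{\log N \log(N/P)}{\log P}$, which, letting $c = c_1$, solves to
$$\log k \le c \cdot \frac{\log N \log(N/P)}{\log P} + \log(N/P)$$
$$= \log(N/P)\left(c \cdot \frac{\log N}{\log P} + 1\right)$$
$$k \le \left(\frac{N}{P}\right)^{c \cdot \log N/\log P + 1}$$
$$= \frac{N}{P} \cdot \frac{N^{c \cdot \log N/\log P}}{P^{c \cdot \log N/\log P}}$$
$$= \frac{N}{P} \cdot \frac{N^{c \cdot \log N/\log P}}{N^{c}}$$
$$k \le \frac{N}{P} \cdot N^{c \cdot (\log N/\log P - 1)}$$
and (2)
$$\sum_{\lambda=1}^{\log_k P} \log(k^{\lambda}) \le c_2 \cdot \log N$$
which, letting $\epsilon = 1/c_2$ and noting that
$$\sum_{\lambda=1}^{\log_k P} \log(k^{\lambda}) = O\left(\frac{\log^2 P}{\log k}\right)$$
solves as
$$\frac{\log^2 P}{\log k} \le c_2 \cdot \log N$$
$$\epsilon \cdot \frac{\log^2 P}{\log N} \le \log k$$
$$P^{\epsilon \cdot \log P/\log N} \le k$$
Remembering that k must be within the bounds $N/P \le k \le N$, we conclude that the algorithm is
optimal if k can be chosen such that
$$\max\left(N/P,\; P^{\epsilon \cdot \log P/\log N}\right) \le k \le \min\left(N,\; \frac{N}{P} \cdot N^{c \cdot (\log N/\log P - 1)}\right)$$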
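Since N and P are powers of two, we can assume that k also is, as required. Thus, the algorithm
is optimal when
$$\max\left(N/P,\; P^{\epsilon \cdot \log P/\log N}\right) \le \min\left(N,\; \frac{N}{P} \cdot N^{c \cdot (\log N/\log P - 1)}\right)$$
Clearly, $N/P \le N$ and $N/P \le \frac{N}{P} \cdot N^{c \cdot (\log N/\log P - 1)}$. Also $P^{\epsilon \cdot \log P/\log N} \le N$ when
$$\epsilon \cdot \frac{\log^2 P}{\log N} \le \log N$$
$$\sqrt{\epsilon} \cdot \log P \le \log N$$
which holds for all $P \le N$ when $\epsilon \le 1$ (or $c_2 \ge 1$ since $\epsilon = 1/c_2$), which gives the restriction in the
Theorem.
The final case that must hold is
$$P^{\epsilon \cdot \log P/\log N} \le \frac{N}{P} \cdot N^{c \cdot (\log N/\log P - 1)}$$
which we now proceed to simplify to the optimality range stated in the Theorem.
$$P^{\epsilon \cdot (\log P/\log N) + 1} \le N^{c \cdot (\log N/\log P - 1) + 1}$$
$$\log P \left(\epsilon \cdot \frac{\log P}{\log N} + 1\right) \le \log N \left(c \cdot \left(\frac{\log N}{\log P} - 1\right) + 1\right)$$
$$\epsilon \cdot \frac{\log P}{\log N} + 1 \le \frac{\log N}{\log P} \left(c \cdot \left(\frac{\log N}{\log P} - 1\right) + 1\right)$$
For presentational purposes, let $x = \log N/\log P$. Then we have
$$\frac{\epsilon}{x} + 1 \le x \cdot (c(x - 1) + 1)$$
$$\epsilon + x \le x^2 \cdot (cx - c + 1)$$
$$cx^3 + (1 - c)x^2 - x - \epsilon \ge 0$$
which is the cubic equation stated in the Theorem. Let z be the positive solution of the equation.
Then $x = \frac{\log N}{\log P} \ge z$, which solves to
$$z \cdot \log P \le \log N$$
$$P^z \le N$$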
and the Theorem follows. A very loose upper bound can be established for z. We are looking for
a solution $x > 1$ to the cubic equation. Reformulate it as
$$cx^3 - cx^2 + x^2 - x \ge \epsilon$$
and note that this holds if
$$cx^3 - cx^2 - x + 1 \ge \epsilon$$
holds (for $x > 1$). Then we get
$$cx^2(x - 1) - 1 \cdot (x - 1) \ge \epsilon$$
$$(x - 1)(cx^2 - 1) \ge \epsilon$$
Assume that $c \ge 1$ (clearly reasonable since c is a constant in the running time). Then the last
equation holds if
$$(x - 1)(cx^2 - c) \ge \epsilon$$
which reformulates as
$$(x - 1)(x^2 - 1) \ge \epsilon/c$$
$$(x - 1)(x - 1)(x + 1) \ge \epsilon/c$$
This last equation holds if
$$(x - 1)^3 \ge \epsilon/c$$
$$x \ge 1 + (\epsilon/c)^{1/3}$$
Thus $x = 1 + (\epsilon/c)^{1/3}$ satisfies the cubic inequality, so $z \le 1 + (\epsilon/c)^{1/3}$, which provides a very
loose (due to the simplifying assumptions) upper bound on z. $\Box$
By calculating solutions to the cubic equation for certain $c_1$ and $c_2$ (constants in the running
time; see Lemma 4.5), it can be seen that the optimality range for the EREW-LPRAM FFT-Multi-Level
algorithm is better than that of the EREW-LPRAM FFT-Two-Level algorithm for $l(P) = \log P$.
For example, for $c_1 = c_2 = 1$ we get the range $P^{4/3} \le N$ (compare to $P^{3/2} \le N$ for the two-level
algorithm), and for $c_1 = c_2 = 2$ we get the range $P^{8/7} \le N$ (compare to $P^{7/6} \le N$ for the
two-level algorithm). Again we see that the larger the constants in the running time, the better the
optimality range.
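The explicit ranges quoted above can be reproduced numerically. The sketch below locates the positive root z of the cubic by bisection; this is a generic root finder chosen for illustration, not a method taken from the report, and the function names are hypothetical. It relies on the cubic being negative at x = 1 (its value there is -ε) and eventually positive for large x.

```python
def cubic(x, c, eps):
    """Left-hand side of the cubic: c x^3 + (1 - c) x^2 - x - eps."""
    return c * x ** 3 + (1.0 - c) * x ** 2 - x - eps


def positive_root(c, eps, tol=1e-12):
    """Bisection for the root z > 1 of the cubic (c > 0, eps > 0 assumed)."""
    lo, hi = 1.0, 2.0
    while cubic(hi, c, eps) < 0:      # widen until the sign changes
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cubic(mid, c, eps) < 0:
            lo = mid
        else:
            hi = mid
    return hi


for c1, c2, quoted in ((1.0, 1.0, 4 / 3), (2.0, 2.0, 8 / 7)):
    z = positive_root(c1, 1.0 / c2)
    loose = 1.0 + ((1.0 / c2) / c1) ** (1.0 / 3.0)
    print(f"c1=c2={c1}: z={z:.4f}, quoted exponent={quoted:.4f}, loose bound={loose:.4f}")
```

In both cases the computed root lies slightly below the quoted exponent (4/3 and 8/7 respectively) and well below the loose bound $1 + (\epsilon/c)^{1/3}$, consistent with the text.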
Now we turn attention to the behavior of the EREW-LPRAM FFT-Multi-Level algorithm for
$l(P) = \sqrt{P}$.
Theorem 4.18 For $l(P) = \sqrt{P}$, the EREW-LPRAM FFT-Multi-Level algorithm is optimal for $2^{\epsilon\sqrt{P}} \le N$,
where $\epsilon = 1/c_2$ and $c_2$ is a constant in the running time (corresponding to the $c_2$ in Lemma 4.5).
Proof: We know from Lemma 4.5 that the algorithm is optimal if k can be chosen such that (1)
$\sqrt{\frac{k}{N/P}} \le c_1 \cdot \frac{\log N \log(N/P)}{\log P}$, which solves to
$$k \le (c_1)^2 \cdot \frac{N \log^2 N \log^2(N/P)}{P \log^2 P}$$
and (2)
$$\sum_{\lambda=1}^{\log_k P} \sqrt{k^{\lambda}} \le c_2 \cdot \log N$$
which, noting that
$$\sum_{\lambda=1}^{\log_k P} \sqrt{k^{\lambda}} = O(\sqrt{P})$$
solves to
$$\sqrt{P} \le c_2 \cdot \log N$$
$$2^{\epsilon\sqrt{P}} \le N$$
where $\epsilon = 1/c_2$. Remembering that k is constrained by $N/P \le k \le N$, we see that the algorithm
is optimal for $2^{\epsilon\sqrt{P}} \le N$ if k can be chosen such that
$$N/P \le k \le \min\left(N,\; (c_1)^2 \cdot \frac{N \log^2 N \log^2(N/P)}{P \log^2 P}\right)$$
This is clearly possible since $N/P$ is $\le$ both terms of the min function. Since N and P are powers
of two, we can assume that k also is, as required. $\Box$
Note that this optimality range is the same as for the FFT-BT-Ext, EREW-LPRAM FFT-Two-
Level, and EREW-EREW FFT-Multi-Level algorithms. 
5 Conclusions and discussion 
Although a key emphasis of this paper was to demonstrate usage of the H-PRAM, and the results
of using different combinations of sub-model types and hierarchical structures, it is informative
to summarize a few of the "best" complexity results here. For the binary tree problem: when
$l(P) = \log P$, time $O(\log^{3/2} P + N/P)$ (optimal for $\epsilon P \log^{3/2} P \le N$, $\epsilon = 1/c$, c a constant in
the running time). When $l(P) = \sqrt{P}$, time $O(\sqrt{P} + N/P)$ (optimal for $\epsilon P^{3/2} \le N$, $\epsilon = 1/c$, c a
constant in the running time).
For the FFT graph problem: when $l(P) = \sqrt{P}$, optimal time $O((N/P)\log N)$ is achieved
for $2^{\epsilon\sqrt{P}} \le N$, $\epsilon = 1/c$, c a constant in the running time. This is an improvement over the
$P \cdot 2^{\epsilon\sqrt{P}} \le N$ optimality range of the case where only strict locality is used (the LPRAM [1]).
When $l(P) = \log P$ and $\beta$-synchronization cost is logarithmic in the number of sub-PRAMs being
synchronized, optimality is achieved for $P^z \le N$, where z approaches 1 as two constants $c_1$ and $c_2$
in the running time grow. When $c_1 = c_2 = 2$ the optimality range is $P^{8/7} \le N$.
The algorithms presented in the preceding sections demonstrate a methodology of
building on existing algorithms to obtain new results. In other words, the H-PRAM provides 
a means of organizing various existing algorithms such that communication and synchronization 
overhead is "minimized", which can result in a composite algorithm that performs better than its 
individual sub-algorithms do alone. Although we give H-PRAM algorithms for basic problems for 
study and comparison purposes, the H-PRAM should allow conceptually simple construction of 
"larger", more complex parallel algorithms comprised of many simple sub-algorithms. It allows us 
to build on the work which already exists with respect to the PRAM. In other words, we really do 
not want to design H-PRAM versions of common (and optimal) PRAM algorithms, i.e. reinvent 
the wheel, even if those PRAM algorithms are not optimal on the H-PRAM, but to use them 
as building blocks for designing "larger" (and optimal) H-PRAM algorithms. We anticipate that 
basic algorithms, such as for prefix and list ranking, would be provided as primitives (employing 
network topology) in any implementation of an H-PRAM to architecture mapping. The point here 
is that the H-PRAM allows the modular construction of large, (controlled) asynchronous, complex 
systems from simple synchronous PRAM algorithms, and we wish to make full use of the large 
body of work that exists on PRAM computations. 
We want to stress that algorithm design and analysis seems relatively simple; it appears that the 
implicit hierarchical organization controls any additional conceptual complexity over the PRAM 
model. What can be difficult, as seen in the preceding sections, is choosing the best configuration of
the H-PRAM given an H-PRAM algorithm, input size N, and number of processors P. However, 
in a computer system that supports the H-PRAM, automated tools could do this. The general 
philosophy in (private) H-PRAM algorithm design should be to obtain algorithms that have the 
greatest possible partitioning flexibility. This means defining the loosest possible upper and lower 
bounds on the values that the "partitioning parameter" k can take on. The algorithm complexities 
will be in terms of k, N, and P. Then, given a value for N (the input size that a user wants to
compute on) and a value for P (the number of processors available to the user), the value for k 
that minimizes the complexity can be chosen; as stated, potentially by automated tools. 
In theoretical work one can attempt to find a value, or range of values, for k (as a function of 
N and P) such that good (optimal) performance is obtained for the widest possible range of values 
for N and P. This is what we have done in the preceding sections, in effect to maximize the number of
processors that are efficiently usable with respect to an input size N, and to minimize the inefficiency 
when optimality is not possible (when P is too large with respect to N). This is possible because
of the H-PRAM's representation of general locality, i.e. both strict and neighborhood locality. 
When N and P are such that optimality ranges hold for multiple fixed instances of the latency 
parameter (e.g. $\log P$, $\sqrt{P}$), then the H-PRAM algorithm is architecture independent without loss
of efficiency across architectures with those latencies. Similarly, when optimality is not possible 
but the inefficiency is within a certain bound for multiple latencies, the algorithm could be consid-
ered architecture independent with bounded inefficiency across architectures with those latencies. 
Kruskal, Rudolph, and Snir [9] have proposed a complexity theory of parallel algorithms based on
preservation of efficiency across multiple architectures. 
There is the potential that general H-PRAM algorithms can be designed, with unspecified 
types of sub-models and unspecific sub-algorithms (one defines what the sub-algorithms do but not 
how they go about it), in order to gain an additional degree of architecture independence. Once 
a target architecture is known, one can choose the type(s) of sub-model(s) that most reflect it
(e.g. PRAM, LPRAM, BPRAM), then choose specific sub-algorithms that have been designed for 
that/those sub-model(s). In other words, one general H-PRAM algorithm may have various specific 
instances of it, as demonstrated by the FFT graph algorithms of the preceding sections.
The private H-PRAM provides a memory management paradigm that lies between the extremes 
of totally automated (PRAM) and totally manual (networks). This is one reason for our belief, as 
stated in the introduction, that the H-PRAM provides a good balance between simplicity of usage 
and reflectivity of realistic architectures. The paradigm is one where memory is seen as a linear 
block of memory locations, and there is responsibility for organizing that block into groups (by 
permuting the memory), but not for the details of implementing the organizing (i.e. for routing 
data in a network). 
Clearly, the private H-PRAM with a tree hierarchy relation is naturally suited for problems 
that can be solved by divide-and-conquer. Preparata and Vuillemin [10] have defined ASCEND
and DESCEND classes of algorithms, which operate in a divide-and-conquer manner, and noticed 
that quite a few algorithms either belong to, or are comprised of sub-algorithms which belong to, 
these classes. Cypher [4] has generalized these classes to a class (the "bit-block" class) that is less
restrictive of the operations in a divide-and-conquer algorithm. 
Chan [3] has pointed out that many problems in scientific and numerical computation have 
natural hierarchical solutions, and advocated the development of hierarchical parallel algorithms 
and architectures for this domain. 
Not all problems will submit to solution on the private H-PRAM; there may be difficulties in 
designing algorithms for this variant such that data required by processors are always in the private 
shared memory of the sub-PRAMs that the processors belong to. Problems that only submit to 
non-oblivious (data dependent) communication may require a switch to the shared variant, which 
is one reason for its existence. We conjecture that H-PRAM algorithms will generally be "control 
oblivious", i.e. the partitioning will not be data dependent, but will instead depend on the cost 
parameters, number of processors P, and input size N. 
We are continuing to investigate algorithms for the private variant of the H-PRAM. 
References 
[1] A. Aggarwal, A.K. Chandra and M. Snir, Communication complexity of PRAMs, Theoretical 
Computer Science, Vol. 71, 1990, pp. 3-28. 
[2] A. Aggarwal, A.K. Chandra and M. Snir, On communication latency in PRAM computations, 
Tech. Rep. RC 14973 (#66882) 9/27/89, IBM T.J. Watson Research Center, Yorktown Heights,
NY. 
[3] T.F. Chan, Hierarchical algorithms and architectures for parallel scientific computing, Proc. 
ACM Intl. Conference on Supercomputing, 1990, pp. 318-329.
[4] R. Cypher, Efficient communication in massively parallel computers, Ph.D. Thesis, Dept. of 
Computer Science, Univ. of Washington, 1989. 
[5] E. Dekel and S. Sahni, Binary trees and parallel scheduling algorithms, IEEE Trans. on
Computers, March 1982, pp. 307-315.
[6] P.B. Gibbons, A more practical PRAM model, Proc. 1st Annual ACM Symposium on Parallel 
Algorithms and Architectures, 1989, pp. 158-168. 
[7] P.B. Gibbons, The asynchronous PRAM: a semi-synchronous model for shared memory MIMD 
machines, Ph.D. thesis, Computer Science Division, University of California, Berkeley, Cali-
fornia, Dec. 1989. 
[8] T. Heywood and S. Ranka, A practical hierarchical model of parallel computation I: The 
model, Technical Report SU-CIS-91-06 School of Computer and Information Science, Syracuse 
University, Feb. 1991, Revised: Oct. 1991. 
[9] C.P. Kruskal, L. Rudolph and M. Snir, A complexity theory of efficient parallel algorithms, 
Theoretical Computer Science, Vol. 71, 1990, pp. 95-132. 
[10] F.P. Preparata and J. Vuillemin, The Cube-Connected Cycles: a versatile network for parallel
computation, Commun. of the ACM, May 1981, pp. 300-309.
[11] C.H. Papadimitriou and M. Yannakakis, Towards an architecture independent analysis of 
parallel algorithms, SIAM Journal on Computing, April 1990, pp. 322-328.
