Balanced computation of two-dimensional transforms on a tree machine  by Gorin, A.L. et al.
Appl. Math. Let:. Vol. 3, No. 3, pp. 4346, 1990 
Printed in Great Britain. All rights resewed 
089~S659/SO 83.00 + 0.00 
Copyright@ 1990 Pergamon Press plc 
Balanced Computation of Two-Dimensional Transforms on a Tree Machine 
‘A.L. GORIN, 2L. AUSLANDER, AND 2A. SILBERGER 
‘AT&T Bell Laboratories 
2CUNY Center for Large-Scale Computation 
(Received September 1989) 
1. INTRODUCTION 
This paper describes the idea of a balanced parallel algorithm, and then applies it to the 
design and analysis of 2D transforms on a tree-structured parallel computer. A balanced 
algorithm is one that is written as a sequence of separate computation and communication 
stages, which is furthermore sized so that these stages require equal time. An advantage of 
writing an algorithm this way is simplicity of representation, which implies ease of program- 
ming. This representation furthermore allows one to analyze separately the complexity of 
computation and interprocessor communications in a parallel algorithm, greatly simplifying 
the performance modeling problem. If, based on such performance models, the stages are 
sized to require equal time, then one can pipeline and overlap multiple tasks to achieve 
(theoretical) 100% utilization of both the processors and communications network. 
We apply these ideas to the design and analysis of 2D transforms on a tree machine. In 
particular, we consider transforms on square matrices which are defined via application of 
an identical operator to each row, followed by application of the same operator to each of 
the resultant columns. A classic example is the 2D Fast Fourier Transform (FFT), which 
finds application in image processing [S] and radar [l]. We also consider a binary tree of 
processors, in which each processing element (PE) is connected to a parent and two chil- 
dren, excepting for the root and leaves of the tree [2], and whose external communications 
is through the root PE. Such an interconnection is scalable and has low overhead commu- 
nications, which yields implementations well suited to VLSI and advanced packaging [5, 31. 
A tree is also universal, in the sense that it can be embedded within any lattice of PEs 
in reconfigurable networks [7, 91, such that the communications diameter of the network is 
preserved. 
This paper proceeds as follows. We first describe the computing paradigm and software 
constructs which underly balanced 2D transform algorithms on a tree machine. We then 
describe the parallel algorithm and derive formulas expressing processing and communication 
times as a function of problem and device parameters. We then derive a criterion for 
a balanced transform, and as an example predict execution time for a balanced 1024 x 
1024 2D FFT on a tree with a 20 Mhz communications skeleton that compares favorably 
to measured performance on a CRAY Y-MP. Although this is a comparison of projected 
performance on a theoretical machine with measured performance on a real machine, it 
illustrates that supercomputer speeds are achievable with intelligent exploitation of parallel 
processing. 
2. COMPUTING PARADIGM 
Consider parallel algorithms that can be written as a sequence of separate processing and 
communication stages. Such a parallel program is host-orchestrated, with a barrier syn- 
chronization between each stage. Further consider a processing paradigm in which identical 
44 A.L. GORIN, L. AUSLANDER, A. SILBERGER 
programs are executed simultaneously in each PE on different data sets, denoted a sliced 
procedure. The multiple executions of this single program can follow different instruction 
streams (still within the program, of course) depending on the data. These potentially dif- 
ferent streams are initiated by broadcasting a jump address to all PEs, which are forced 
to converge and synchronize at the completion of the sliced procedure. This concept can 
be described as single program multiple data (SPMD), which can be alternatively viewed 
as coarse-grain SIMD or data-driven MIMD. We define two global burst communication 
functions which will support the 2D transform algorithm. Broadcast transmits data from 
the host to enabled PEs, and Report transmits data from a distinguished PE to the host. 
For a tree machine with I\’ PEs (of depth log, K) in which interprocessor communications 
between adjacent PEs requires time c per data element, then a broadcast or report of D 
data elements requires time CD + log, K. For large data blocks, this simplifies to time CD. 
3. PARALLEL ALGORITHM 
Consider a 2D transform on N x N points, where the array of values is initially stored row 
by row in the host memory. After processing, the tree will return to the host the transformed 
array of values in which column follows column. We assume a tree with Ii PEs, where if 
L = N/K then each PE has memory sufficient to store and compute L one-dimensional 
transforms. The parallel algorithm is as follows. 
(1) Load the N rows into the machine, putting the I<th set of L rows into the PE #Ii. 
This can be accomplished via the proposed software architecture by, for each I<, 
broadcasting the block of LN data elements to PE #I<. 
(2) Execute a sliced procedure concurrently in each PE that computes the N-point trans- 
form on each of the L rows resident in that PE. 
(3) Perform a global transposition so that the Kth set of L columns is now in the Kth 
PE. This can be accomplished via the proposed software architecture by, for each I<, 
first reporting the transformed rows from PE #K to the host; second broadcasting 
this block of LN data elements to all PEs, whence each PE then selects and saves 
the relevant partial columns from that block. 
(4) Execute a sliced procedure concurrently in each PE that computes the N-point trans- 
form on each of the L columns resident in that PE. 
(5) Transmit the transformed columns back to the host by sequentially reporting blocks 
from each PE. 
4. PERFORMANCE MODELS 
We treat separately processing (steps 2 and 4) and communications (1, 3, and 5). If 
a one-dimensional transform on N points requires time F(N), then steps 2 and 4 each 
require time LF(N). In particular, if the FFT on N points requires time PA log, N, where 
p is the butterfly rate of the PE, then steps 2 and 4 each require time p(N’/I\‘) log* N. 
For the analysis of communications complexity, we assume that block sizes (N’/K) are 
sufficiently large that startup and shutdown times are negligible in comparison to the block 
transfer times. Step 1 requires time cN2, as does step 5. Step 3 requires twice that, time 
2cN2, since the data is first reported and then broadcast. Thus, total execution time is 
PF(N)N/K + 4cN2 for a generic 2D transform, and time 2p(N?/K) log? N + 4cN2 for the 
2D FFT. 
5. BALANCED EXECUTION 
If the algorithm is sized, via adjusting the number of PEs, so that 
2cN2 = p(N2/K) log, N, then the five stages require times in ratio 1:2:2:2:1. This balancing 
criteria implies .that the number of PEs is, for the 2D FFT, K = (p/2c) log2 N. Given a 
continuous channel which provides a sequence of transforms to be computed, then we have 
produced a balanced algorithm with equally sized processing and communication stages, (the 
final communication stage of one block plus the first communication stage of the subsequent 
are abutted.) This one channel utilizes both the processing and communications at 50% 
Balanced Computation of Two-Dimensional Transforms 45 
cb 1. blk 1 
c, P, c , P I= 
I I I I 
cb 2. blk 1 
e, P , c , B lC 
I I I I : . 
chl,blkZ 
‘C, b, c, p ,c 
I:I I I 
ch 2, blk 2 
‘C, p , e , i ,c 
I I I:I 
: ; 
chl,blk3 
ch 2, blk 3 
;c, p , c , p ,c 
I I I I 
Figure 1. Pipelined execution of twodimensional FFT yielding 
100% utilization of processing and communications. Notation: c 
denotes a communications stage; p denotes a processing stage; (ch 
i, blk j) denotes block j of channel i. 
capacity in a highly regular manner, computing a new transform at time intervals 8cN*. 
Thus, as shown in Figure 1, the processing of two such channels can be overlapped to 
achieve 100% utilization of both the processing and communications network, computing 
a new transform at time intervals 4cN 2. This assumes that each PE can simultaneously 
process and communicate. 
6. AN EXAMPLE 
Consider a tree machine whose PE comprise a typical high-performance microprocessor 
such as the AT&T DSP32C, which is a C-programmable digital signal processor. As reported 
by Shear [8], the time to execute a 1024-point complex FFT is 3.2 ms. Assume a moderate- 
bandwidth communications skeleton that achieves interprocessor communication times of 
c = 100 ns per complex data element, (e.g., a 32-bit wide port clocked at 20 Mhz.) Based 
on the balanced algorithm analysis, such a machine could compute a stream of 1024 x 1024 
2D FFTs at time intervals of 400 ms using 16 PEs. For reference, it is worth comparing 
this predicted time to measured experimental performance on a CRAY Y-MP8/832, which 
is 484 ms [4]. Although this is a comparison of projected performance on a theoretical 
machine with measured performance on a real machine, it does serve to emphasize that 
supercomputer speeds are achievable with intelligent exploitation of parallel processing. 
7. CONCLUSIONS 
We have described the idea of a balanced parallel algorithm and applied it to the design 
and analysis of a 2D transforms on a tree machine. The analysis shows that intelligent use 
of parallel processing can provide supercomputer performance, predicting execution time for 
a 1024 x 1024 2D FFT comparable to that of a CRAY Y-MP. 
REFERENCES 
1. D.A. Aushennan, et al., Developments in Radar Imaging, IEEE Trans. on Aerospace and Electronic 
Syslcme AES-20(4), 363-400 (July 1984). 
2. S.A. Browning, The Tree Machine: A Highly Concurrent Computing Environmcni, Ph.D. Thesis, 
CalTech, (Jan. 1980). 
46 A.L. GORIN, L. AVSLANDER, A. SILBERCER 
3. A.L. Gorin and RR. Shively, The ASPEN Parallel Computer, Speech Recognition, and Parallel Dy- 
namic Propammi ng, In Proceeding of The International Conference on Acoustics, Speech and Signal 
Procasing (ICASSP), (April 1987). . 
4. C.M. Grass1 and C. Kerr, private communication. 
5. C.E. Leisemon, Arm-Eficient VLSI Computation, Ph.D. Thesis, Department of Computer Science, 
Carnegie-Mellon University, (1981). 
6. W.K. Pratt, Digital Image Processing, Wiley, (1978). 
7. D.K. Pradhan, Dynamica.Uy Restructurable Fault-tolerant Processor Network Architectures, IEEE 
Trans. on Computers c-34 (5) (May 1985). 
8. D. Shear, EDN’s DSP Benchmarks, EDN, 126-148 (Sept. 1988). 
9. R.R. Shively and A.L. Gorin, A Reconfigurable, Fault-Tolerant Systolic Signal Processor, Proc. ZCASSP 
(May 1989). 
‘AT&T Bell Laboratories, Murray Hill, New Jersey, USA 
‘CUNY Center for Large-Scale Computation, New York, New York, USA 
