Factored-matrix representation of distributed fast transforms by Bainbridge, Richard Lee
Calhoun: The NPS Institutional Archive
Theses and Dissertations Thesis Collection
1987-03















Thesis Advisor D.E. Kirk
Approved for public release; distribution is unlimited
T230088

CuRiTy Classification Of This page
REPORT DOCUMENTATION PAGE




3 DECLASSIFICATION /DOWNGRADING SCHEDULE
3 DISTRIBUTION"/ AVAILABILITY OF REPORT
Approved for public release;
distribution is unlimited.
PERFORMING ORGANIZATION REPORT NUMBER(S) 5 MONITORING ORGANIZATION REPORT NUM8ER(S)





7a NAME OF MONITORING ORGANIZATION
Naval Postgraduate School
: ADORESS (Cry. Sfafe. and ZIP Code)
onterey, California 93943-5000
7b ADDRESS (dry. State, and ZIP Code)
Monterey, California 93943-5000




9 PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER









i Title (include Security Claudication)
FACTORED-MATRIX REPRESENTATION OF DISTRIBUTED FAST TRANSFORMS
I PERSONAL auThQR(S)
Bainbridge. Richard Lee











'8 SUBJECT TERMS (Continue on reverie it necessary and identify by bjock number)
Fast Fourier Transform; Fast Hartley Transform;
Walsh-Hadamard Transform; Parallel Processing;
Distributed Signal Processor; Butterfly Network
I ABSTRACT (Continue on reverie if neceuary and identify by block number)
Parallel implementations of Fast Fourier Transforms (FFTs) and other
fast transforms are represented using factored, partitioned matrices.
The factored matrix description of a distributed FFT is introduced
using a decimation-in-time (DIT) FFT algorithm suitable for
implementation on a distributed signal processor.
) S"R'3UTiON/AVAiLA8lLiTY OF A8STRACT
fljNCLASSlFiEQ/UNLiMlTED D SAME AS RPT D DTlC USERS
21 ABSTRACT SECURITY CLASSIFICATION
TTMTT AQqTTTTTVn
a NAME OF RESPONSIBLE 1ND1V1OUAL
Professor D.E. Kirk pdwymwrtlu.de
Area Code) 22c OFFICE SYMBOL
62Ki
3 FORM 1473, 84 mar 83 APR edition may be used until exhausted
All other editions aie obsolete
SECURITY CLASSIFICATION OF ThiS PAGE
SECURITY CLASSIFICATION OF THIS PAGE (Whmn Dmm En<»r*<>
Block 19: Abstract (continued):
The heart of the matrix representation of distributed
fast transforms is the use of permutations of an NxN identity
matrix to describe the required inter-processor data
transfers on the Butterfly Network. The properties of
these "transfer matrices" and the resulting output ordering
are discussed in detail. The f actored-matrix representation
is then used to show that the Fast Hartley Transform (FHT)
and the Walsh-Hadamard Transform (WHT) are supported by
the Butterfly Network.
S N 0102- LF- 014- 6601
SECURITY CLASSIFICATION OF THIS P kdt.(Whan Dalm Bnffd)
Approved for public release; distribution is unlimited
Factored-Mat r ix Representation of Distributed Fast Transforms
by
Richard Lee Bainbridge
Lieutenant, United States Navy
3.E.E., Auburn University, 1980
Submitted in partial fulfillment of the
requirements for the degree of





Parallel implementations of Fast Fourier Transforms
(FFTs) and other fast transforms are represented using
factored, partitioned matrices. The factored matrix
description of a distributed FFT is introduced using a
decimation-in-time (DIT) FFT algorithm suitable for
implementation on a distributed signal processor.
The heart of the matrix representation of distributed
fast transforms is the use of permutatons of an NxN identity-
matrix to describe the required inter-processor data
transfers on the Butterfly Network. The properties of
these "transfer matrices" and the resulting output ordering
are discussed in detail. The f actored-matrix representation
is then used to show that the Fast Hartley Transform (FHT)





B. RESEARCH OBJECTIVES 10
II. FFT MATRIX FACTORIZATIONS 13
A. SHORTHAND NOTATION 16
B. FACTORED-MATRIX REPRESENTATION OF
SINGLE-PROCESSOR FFT (SPFFT) 20
C. FACTORED-MATRIX REPRESENTATION OF
MULTI-PROCESSOR FFT (MPFFT) 30
D. PROPERTIES OF TRANSFER MATRICES 38
1. Transfer Matrix Patterns 38
2. Algorithm for Generating
Transfer Matrices 44
3. Orthogonality of Transfer Matrices 47
4. Pseudo-normal Ordering 49
5. Selection of Transfer Matrices 54
E. GENERALIZED MULTI-PROCESSOR FFTS 57
1. DIT MPFFT with Bit-reversed Input 59
2. DIT MPFFT with PN Input 68
3. Pseudo-bit-reversed Ordering 73
4. DIF MPFFT Algorithms 76
III. OTHER FAST TRANSFORMS ON THE BUTTERFLY NETWORK- 80
A. FAST HARTLEY TRANSFORM (FHT) 80
1. Single-Processor FHT (SPFHT)
Algorithm 81
2. Multi-processor FHT (MPFHT) Algorithm— 88
5
3. Transfer Patterns and Output Ordering
—
101
B. FAST WALSH-HADAMARD TRANSFORM (WHT) 107
1 . Single-Processor Walsh Ordered
Transform (WTHT)w 107
2. Multi-Processor Walsh Ordered
Transform (MPWHT)w 114
3. Single-Processor Hadamard Ordered
Transform (WTHT)h 122
4. Multi-Processor Hadamard Ordered
Transform (MPWHT)h 124
IV. COMBINED ALGORITHM FOR GENERATING TRANSFER
MATRICES AND OUTPUT ORDERING 133
V. CONCLUSIONS 138
LIST OF REFERENCES— 140
3IBLI0GRAPHY 141
INITIAL DISTRIBUTION LIST 142
I. INTRODUCTION
The Fast Fourier Transform (FFT) has become popular
because it computes the Discrete Fourier Transform (DFT)
much faster than direct evaluation of the DFT definition.
While evaluation of the DFT from the definition requires
N
2
complex multiplications, the FFT algorithm produces
the same result with the number of multiplications
proportional to N log
2
N. The DFT can be expressed as a
vector-matrix equation:
X(k) = Wnk x(n), n,k-0,l N-l (1.1)
where W = exp [
-j 27T/N ] . (1.2)
The Nxl column vectors x(n) and X(k) are the input sequence
and its DFT, respectively. The record length, N, is assumed
nk
to be a power of 2. The NxN matrix W represents the DFT
operation on the input data sequence that yields the
transformed output sequence. This thesis uses matrix
factorization of the DFT matrix to describe a parallel
multi-processor implementation of the FFT. After a detailed
investigation into the properties of the necessary inter-
processor data transfers, the Fast Hartley Transform (FHT)
and the Walsh-Hadamard Transform (WHT) are also investigated
for compatibility with the parallel-processing network.
A. BACKGROUND
The Cooley-Tukey FFT algorithm can be conveniently
viewed as a factorization of the DFT matrix which it
implements. Many authors have used this approach to
illustrate the computational savings afforded by the
FFT [Refs. 1-5]. Each factor of the f ac tored-matr ix
representation corresponds to one butterfly stage of the
FFT. The product of matrix factors is equivalent to
evaluating the DFT by its definition, though the rows or
columns of the FFT matrix product are reordered in
comparison to the DFT matrix. The row (column) reordering
corresponds to permutation of the output (input) sequence
ordering, respectively. The keys to the speed advantage
of the FFT are that the matrix factors are sparse
[Ref. 3: p. 328], and that the complex weighting factors
are periodic [Ref. 4: p. 358]. These properties permit
the calculation of the DFT coefficients to be carried out
i teratively .
A more recent approach to enhance signal processing
of large data sets is parallel processing. This concept
was used to develop the Distributed Signal Processor (DISP)
at MIT Lincoln Laboratory [Ref. 6]. The DISP uses nodal
processors interconnected by a Butterfly Network to perform
real-time signal processing. The input data set is first
divided equally among the processors. Then all processors
simultaneously perform the same operations on different
segments of data. Finally, all processors simultaneously
exchange partially-processed data through the Butterfly
Network. The last two steps are repeated until the task is
completed. Since many signal processing applications involve
convolution and FFT algorithms, the inter-processor network
was selected for efficient handling of these algorithms.
The Butterfly Network allows all nodal processors to
simultaneously transfer the necessary data among themselves
without conflict [Ref. 6: p. 1]. References 6-10 discuss
in detail the architecture of the DISP and the properties
of the Butterfly Network. It is assumed that an FFT
algorithm is carried out in parallel by P active processors
to compute the DFT of a set of N data points. It is
assumed that N and P are integer powers of 2, N>2P
[Ref. 7: p. 3]. Therefore, each processor contains
M=N/P points, where M is also a power of 2.
A new algorithm for the computation of power spectra
is the Fast Hartley Transform (FHT) introduced by
Bracewell. Requiring only real arithmetic computations,
the FHT performs even faster than the FFT [Refs. 11-12].
The Walsh-Hadamard Transform (WHT) , on the other hand,
uses Walsh functions (square waves) as its basis
functions. Since the Walsh functions are square waves,
they take on only two values, namely +1 or -1 [Ref. 13: p. 225]
Both the FHT and WHT are studied for parallel implementation
using the Butterfly Network.
B. RESEARCH OBJECTIVES
The primary objective of this research is to investigate
the suitability of the Butterfly Network for multi-processor
implementation of fast transform algorithms. This thesis
develops a matrix representation of the multi-processor
FFT (MPFFT) suitable for implementation on the Distributed
Signal Processor. The MPFFT algorithm is based on a radix-2,
in-place, decimation-in-time (BIT) algorithm with input in
bit-reversed order [Ref. 6: p. 28], It is felt that the
matrix representation provides insight into the output
reordering caused by the required transfers on the Butterfly
Network. It will be shown that the transfer patterns can
be represented by permutations of an NxN identity matrix.
This makes the "transfer matrices" easy to generate, and
leads to a description of the output reordering as
a function of the number of active processors. After the
transfer patterns and properties of the transfer matrices
are examined in detail, the FHT and WHT algorithms are
investigated for potential support by the Butterfly Network.
To permit concentration on the matrix manipulations, all
normalizing factors of (1/N) are neglected. With the DFT
definition used here, a factor of (1/N) is required in the
definition of the inverse DFT. Other fast transforms also
10
require normalizing factors in either the forward or
inverse transform definitions. This factor, in either
direction, can easily be inserted, as necessary, as a
gain multiplying each output element. In the MPFFT matrix
representations of Section II, the column indices are
repeatedly referred to as time indices. While this
terminology is correct for the first stage, it is simply
a convenience when referring to the partially-processed
data at the second and successive computation stages. This
terminology is used because it is felt that it helps rather
than hinders the discussion.
The multi-processor FFT is presented in Section II as
a partitioning of the single-processor DIT FFT. The
required inter-procesor data transfers are described by
matrices, and their properties are examined in detail. An
algorithm is presented for generating the transfer matrices
as permutations of identity matrices. The output reordering
caused by the transfers is then discussed, and an algorithm
is developed for determining output ordering as a function
of N and P. Alternative transfer patterns are investigated,
and the pattern chosen in reference 6 is justified. The
decimation-in-f requency (DIF) FFT is examined, and a
general matrix equation for the MPFFT is developed.
Section III investigates several other fast transforms
for compatibility with the Butterfly Network. The
11
relatively new Fast Hartley Transform (FHT) is partitioned
for parallel processing using the Distributed Signal
Processor. Finally, the Walsh and Hadamard orderings of
the Walsh-Hadamard Transform (WT) are studied. Section
IV presents a unified approach to the algorithms for
transfer matrix generation and corresponding output
reordering, and also shows how the two separate algorithms
from Section II can be combined into a single algorithm.
12
II. FFT MATRIX FACTORIZATIONS
This section introduces shorthand notation to simplify
the matrix manipulations to follow, and presents a repre-
sentation of a distributed FFT algorithm using partitioned,
factored matrices. The matrix version of the distributed
FFT will be shown to involve permutations of identity
matrices to describe the necessary data transfers. These
"transfer matrices" possess many interesting properties
that will be discussed in detail.
Many authors have used matrices to describe FFT
algorithms [Ref. 1: pp. 148-149 and 2: pp. 63-69]. In fact,
the initial derivation of the FFT depended on the fact that
an NxN matrix, none of whose elements is zero, can be
factored into sparse matrices with many zeros [Ref. 3: p. 297]
A Discrete Fourier Transform defined by
N-l
X(k) = I x(n) exp[-j(2ir/N)nk] k = , 1 , . . . , N-l (2.1)
n =
can be represented as a vector matrix equation
E
X(k) = W~ x(n), (2.2)
where W = exp(
-j 2 ir/N ) , and E is a matrix of exponents. The
input sequence is the vector x(n), given by
13
(n) = [x(0) x(l) ... x(N-l)] (2.3)
and the transformed output sequence X(k) is
X(k) = [X(0) X(l) ... X(N-1)]\ (2.4)
The vectors x(n) and X(k) may represent sequences in either
normal order, or in digit-reversed order. In several forms of
the radix-2 FFT algorithms a normal-ordered input
data sequence produces output in bit-reversed order.
Similarly, a bit-reversed input data sequence yields a
transformed output sequence in normal order.
E
The matrix W" is a two-dimensional array representing
the transform operation. It has row indices k = , 1 , . . . , N-l ,
and column indices n=0 , 1 , . . . , N-l . Since the "twiddle factor"




, a matrix of exponents E can be written, whose
elements are functions of k and n. For example, for N=8
,
(2.1) gives
X(0) = [W° W° W° W° W° W° W° W°] x(n)
X(l) = [W° W 1 W 2 W 3 W 4 W 5 W 6 W 7 ] x(n)
X(2) = [W° W 2 W 4 W 6 W° W 2 W 4 W 6 ] x(n)
(2.5)



































2 4 6 4 6







































































Note that each element in the matrix of exponents, E, is the
product of the row label, k , and the column label, n, computed
modulo N (i.e., k'n mod N = remainder of k-n/N). The compu-
nktation of the kn products mod N is due to the fact that W








The periodicity of W is one of the keys to the FFT
[Ref. 4: p. 358]. In general, for an N-point DFT, the E














One way to derive FFT algorithms is to factor the DFT
E
matrix W into a product of matrices. This method is also
well-suited to the description of distributed FFT algorithms,
and will be the method used here.
Elliott and Rao have developed a shorthand notation that
makes the matrix factorization easy to follow [Ref. 2:


































This can be manipulated into a decomposition-in-time (DIT)
E

























1 1 1 1 x(0)




1 -1 j -j x(3)
(2.11)
where W = exp(-j2 7r/4) = -j , and the n index values of the
input sequence have been correspondingly reordered as well
E






where W and W will represent the first and second
stages, respectively, of butterfly computations in the
DIT FFT. Let



































where the shorthand notation of a dot (no entry) in the
E
T
matrices of exponents corresponds to an element value
E
~L
of zero in the W matrices. Performing the matrix
multiplication of (2.12) yields
18













T7 + „0 + T7 + Y7 +W w w w
I7 + ..0 + 2 „l+0Www .1 + 2
„0+0 V7 + V7 2 + T7 2 +w w w w
t7
+
T7 + 2 T7 3 + T7 3 + 2W W W w







Note Chat W~ is factored such that only one nonzero entry
hper row of W is multiplied by only one nonzero entry of
h
any given column of W . This result is true for all
factored FFT matrices [Ref. 2: p. 48]. Since W is a
common factor, the exponents simply add (modulo N) when the
o r\ pa -4- r\
matrix multiplication is performed, i.e., W • W =W
Therefore, an easier way to carry out the matrix multiplica-
tion of (2.12) is by the addition of appropriate exponents
of (2.13). Let the shorthand notation
n J—. ,- i—
. -i (2.16)
19
mean the matrix of exponents derived from the sum of
appropriate elements of the factored matrices of exponents
of (2.13), where the elements to be added are defined by the
row-times-column rule of matrix multiplication. Thus,





0+0 0+0 0+0 0+0
0+0 0+2 1+0 1+2
0+0 0+0 2+0 2+0










and the result is the same as the exponents of (2.15). It is
usually much simpler to work with the matrices of exponents,
than to actually perform the matrix multiplication. Since
all the necessary information is contained in the matrices
of exponents, this section will deal mainly with this
representation .
B. FACTORED-MATRIX REPRESENTATION OF SINGLE-PROCESSOR FFT
(SPFFT)
E
Factoring the DFT matrix W~ of (2.2) and reordering the
rows and/or columns leads to various FFT algorithms. The
20
factored FFT matrices represent individual stages of the
particular algorithms. The concept of matrix factorization
will be developed in this section to describe a
single-processor, radix-2, DIT FFT.
To compute an N-point FFT (for N a power of 2), there
are log~N "computation" matrix factors, each having 2
nonzero elements per row. Each computation matrix represents
the butterfly computations carried out at a given stage.
The number of nonzero elements per row depends on the radix
base of the algorithm. For example, a radix-3 FFT compu-
tation matrix contains 3 nonzero elements per row, radix-4
matrices have 4 nonzero elements per row, etc. A flow
diagram for an 8-point radix-2 DIT FFT is shown in Figure 2.1
The vector-matrix equation describing a single-processor
implementation of the 8-point DIT FFT is:
k E ~ E ry E -
X(k) = W~x(n) = W W W x(n) k=0 , 1 , . . . ,
8
n=0, 1 , ... ,8
(2.18)
where 3 butterfly (computation) stages are necessary to
-1 ~2
accomplish the radix-2 FFT. The matrix factors W , W ,
h
and W represent the computations performed at stages
1, 2, and 3, respectively. Let X (n) represent the output
of the first stage of butterflies of Figure 2.1. Then,
21
o 1-1 w n «* 0) CO
^^^



















































X^O) = X°(0) + W°X°(1)
X
X (1) = X°(0) - W°X°(1)
X*(2) = X°(2) + W°X°(3)
X1 (3) = X°(2) - W°X°(3)

































where W 4 = exp [ -j ( 2tt /r ) ( 4 ) =-1 , W°=l , and X (n)=x(n) is the
bit-reversed input sequence, i.e., the input to the first
butterfly stage. In matrix form,
rw ° w°


















Similarly, the output of the second stage of butterfly
computations is:
X
2 (0) = X^O) + W°X 1 (2) = W°X 1 (0
X




2 (2) = X X (0) - W°X 1 (2) = W°X 1 (0
2 1 2 1 1
X
Z (3) = X X (1) - W Z X X (3) = W U X (1
X
2 (4) = X X (4) + W°X 1 (6) = W°X 1 (4
X




2 (6) = X X (4) - W°X 1 (6) = W°X 1 (4
X
2 (7) = X X (5) - W 2 X 1 (7) = W°X 1 (5




























since W = -W , and W = -W . In matrix form,
X















The output of the last stage is:
3 2 2 2
X
J (0) = X
Z (0)+W U X (4) = W U X (0
3 2 12 2
X°(l) = X^(1)+W X (5) = W U X (1
3 2 2 2 2
X
J (2) = X Z (2)+W Z X Z (6) = W U X Z (2
o o o 9 2
X
J (3) = X"(3)+W J X~(7) = W U X~(3
X
3 (4) = X 2 (0)-W°X 2 (4) = W°X 2 (0
X
3 (5) = X 2 (1)-W 1 X 2 (5) = W°X 2 (1
X
3 (6) = X 2 (2)-W 2 X 2 (6) = W°X 2 (2
3 2 3 2 2
X








































since W 4 = -W° , W 5 = -W 1
,
W 6 = -W 2
, and W 7 = -W 3 .
Again, in matrix form,




















Combining (2.20), (2.22), and (2.24) gives the overall matrix
representation for the 8-point DIT FFT:
25
§o o ?<5 ?o i ?Q Wo E, n
X(k) = W J X Z (n) = W J W Z X i (n) = W JW Z W X (n) (2.25)
where X(k) is the normal-ordered output sequence, and X (n)
is the bit-reversed input sequence. The factored matrices




4 4 4 4
• • • •
1 4 • • • •
2 • •
3 4 • •
4 • •
5 • • 4
6 • • • •
























where once again a dot or no entrv in the exponent matrices
corresponds to an entry of zero in the W matrices.
The column indices are discussed in a subsequent paragraph.
Performing the matrix mult iDlication using the shorthand
notation described earlier gives the result shown in
Figure 2.2. Notice that the addition of exponents in (2.27)
is carried out mod N (N=8). Also observe the sparseness of
the matrix factors E, , E~ , and E~. Each matrix factor
contains only 2N entries, and only the final overall matrix
2
of exponents, E, contains numerical entries in all NxN=N^
elements .
A few comments about the nature of the DIT FFT algorithm
are in order. When an N-point DFT is divided into two DFT's
of one-half the original size, the time sequence separation
of the input of either of the smaller DFTs is increased by 2
This corresponds to dropping (decimating) alternating inputs
27
h V B i
KJO 0. 2 2 2 2_ tt
1 4 4 4 4_ &° 4 2 6 4 2 8































41 2 6| 1
©I 4 4[
3 • 01 • 61 1 3
.
-i° ii- 1 3 4 ! 6 2 1 1
4
1 1





1 • 2 5
!
!o
«i 5 | 4 | 2 6
5 1 1 • 1 4 6 1 T "
1 |







• S 7 1 1 1 o 4 4 04 62
E 2 *E 1
i
1




























1 5| 3 7
oj" 4 4
j






. oj . .1.3 3 4 1 6 21 3 41 62| 371 15
4
-j • -1 4 . I . . 4 1 1 1 4 01 0| 4 41 44






04 2 6 5 4j 2 5l 5 l| 7 3
6




0~[ 4 4 6 00' 441 56 1 22
7 - . 1 . 0| •
1 ,
•1-7 7 1 | 41 6 2 7 4J 6 2] 7 3J 5 1
(2.27)
Figure 2.2 8-Point/ 1 -Processor DIT FFT
28
of the original DFT ; hence the name decimation in time. The
idea behind the FFT is thus to break the original N-point
sequence into two N/2-point sequences, the DFT's of which
can be combined to give the DFT of the original N-point
sequence. Next, each N/2-point sequence is decomposed
into two N/4-point sequences, and so on, until 2-point
sequences result. The DFT of a one-point sequence is simply
the sample itself [Ref. 5: p. 49]. The input sequence
indices for an N-point DFT are aeparated by 1 unit, based
on a normalized analysis period, (record length), P = 1
second. This yields transformed output sequence frequencies
separated by 1 Hz [Ref. 2: p. 60]. Thus, the column labels,
n, of the factored matrices of (2.26) are separated by larger
increments of time going from right-to-left in Figure 2.1.
That is, the last stage, E~, has column labels separated by
1 unit, E~ has indices separated by N/4=2units, and the E
1
column labels are separated by N/2=4 units. Performing the
matrix multiplications of (2.27) results in a time sequence
mixing operation, in which the time labels in the matrices
of exponents add . The matrices E~ and E„ have sample periods
and 1 or and 2 units, respectively (based on an analysis
period of P = 1 second). These periods mix so that E~ * E~
has periods 0, 2, 1, and 3 seconds. Likewise, the time
sequence periods and 4 seconds in E, combine to yield
periods 0, 4, 2, 6, 1, 5, 3, and 7 seconds in the overall
matrix, E. The overall DIT FFT matrix of exponents, E in
29
(2.27) is the same as the DFT matrix of exponents (2.7),
except for a reordering of the columns. Therefore, the DIT
FFT matrix can be viewed as a DFT matrix with reordered
columns
.
C. FACTORED-MATRIX REPRESENTATION OF MULTI-PROCESSOR
FFT (MPFFT)
In this section the concept of matrix factorization will
be extended to the description of distributed FFT algorithms,
with consideration of the multi-processor FFT implemented on
the Butterfly Network at M.I.T. Lincoln Laboratory [Ref. 6].
Distributed FFT algorithms can also be described in terms
of row or column permutations of an overall DFT matrix,
with the inclusion of matrices describing the transfer(s)
of data among the several processors. Figure 2.1 can be
partitioned to describe an 8-point FFT implemented using 4
processors. The result of this partitioning is shown in
Figure 2.3.
Since the DIT FFT algorithm combines data separated by
1 , 2 , 4 , . . . , N/2 memory locations progressing through the
butterfly stages of the algorithm, the required separation
will eventually exceed the size (M=N/P) of the data block
held within each processor. (N=number of input data points,
P-number of active processors; both assumed to be a power of
2.) At this point, and for each subsequent computation stage,
an inter-processor data transfer must take place. Kirk and
















































































"post-transfer" butterflies to describe these multi-processor
operations [Ref. 7: pp. 5-6]. The butterfly computations
that are accomplished prior to any data transfers are called
pre-transfer butterflies. Those which are carried out after
data transfers are called post- transfer butterflies.
Partitioning the N data points equally among P processors
leads to L";:" = logil (M=N/P) stages of pre-transfer butterflies,
and log^P post-transfer butterfly stages. Since each post-
transfer butterfly stage requires one data transfer, there
are also log ? P transfer stages. Note that the number of
transfers depends only on the number of processors, and not
on the input data record length.
The pattern of transfers used in reference 7 consists
of an exchange of the lower half of the data in one processor
with thee upper half of the data in another. (In this context,
lower and upper refer to local memory addresses, and not to
relative memory positions.) The destination processor in a
given transfer is determined by the. cube, function, which
complements the it_h bit of the source processor's binary
address [Ref. 8: p. 224]. For example, consider processor
Pp. in a four processor (P = 4) network. The processor's binary
address is 00. The cube
n
function is 01, and the cube.
function is 10. The destination processors in the first
data transfer are determined using the cube„ function, those
in the second transfer by the cube, function. In general,
the destination processors in the it_h transfer are determined
by the cube,. .. function [Ref. 9: p. 74].
( l-l )
32
To determine whether a processor transfers the upper or
lower half of its data, processors are designated as
"upper-half" or "lower-half" processors at each transfer
stage. At the itji transfer stage, the (i-l)s_t bit of the
processor's binary address is inspected. If this bit is
zero, the processor is designated as an upper-half processor,
and it transfers the upper half of its data at this stage.
A processor whose (i-l)s_t bit is one transfers the lower
half of its data at the i_th transfer stage [Ref. 9: p. 74].
For example, in Figure 2.3, at the first transfer stage (i=l),
the Qth bits of processors P~ and P« are zero. Therefore,
P„ and P~ are upper-half processors; similarly, P, and P~
are lower-half processors at this stage.
The transfer pattern chosen in references 7 and 9
is not unique, but leads to a regular structure that can be
described analytically. When these data transfers are
represented using matrices, the "transfer matrices" possess
many interesting properties. As will be shown later, the
data transfers can be described as permutations of an NxN
identity matrix. Although single-processor algorithms
generate output sequences in normal order when the input
data is in bit-reversed order, the distributed-processor
algorithm produces data in what reference 7 refers to as
"pseudo-normal" order.
33
The distributed implementation of an 8-point DIT FFT with
bit-reversed input data shown in Figure 2.3 can be described
by the vector-matrix equation:
?3 I? ?2 ~1 ?1X(k)=WWWWWx(n) (2.28)
k=0, 1, . .
.
,8
n = 0, 1 , ... ,8
where E, , E„ and E~ are matrices of exponents describing
the butterfly computations at the first, second, and third
stages, respectively. Matrices of this type are referred
to as "computation matrices". T, and T~ are matrices
describing the first and second data transfer stages, and
are called "transfer matrices". To illustrate the generation











































where the superscripts indicate stage number (prime denoting
a transfer stage), the subscripts denote the processor number,
and the argument specifies the multi-processor memory




• • • •
• • •
(2.30)
sxnce w = 1 .
Using the shorthand notation to carry out the matrix
multiplication of (2.28) gives the result shown in Figure 2.4
Note the ordering of the output sequence X ( k ) , indicated by
the row label, k. One-half of the data in a given processor
can be read out in natural order before it is necessary to
move on to the next processor to read the next element of
the output sequence. Two passes through the processor
network are thus necessary to read out the entire output
sequence. This is always true, regardless of the length
of the output sequence or number of processors employed.
35
Bi Ti*Bi






































° 2 1 1













































































































































































Figure 2.4 8-Point /4-Processor DIT FFT(BR Input)
36
The partitioned computation matrices E T are identical to
the single-processor case for the stages prior to any data
transfers (the pre-transfer butterflies). For the post-
transfer butterfly stages, the rows of the computation
matrices have been permuted vis-a-vis their single-processor
counterparts. (For example, compare the row ordering of
E of Figure 2.4 with E n of Figure 2.2.) Notice that the
*- z " Z
exponents within any given row are identical, though the
ordering of the rows is changed. This reordering is caused
by the data transfers, and will be examined in detail in the
next section. The entire purpose for data transfers is to
get data points that need to be combined in subsequent
stage butterflies into the same processor. This is necessary
when the memory separation between points to be combined in
a butterfly exceeds that of the data stored in a given
processor. The most economical way to affect the necessary
transfers is to transfer only one-half the data at any given
transfer stage. There are several other constraints on
potential transfer patterns; these will also be discussed
in the next section. Finally, observe once again that the
exponent entries in any computation matrix, E T , can be
'"Li
determined by the residue of kn modulo N, where k is the row
label, and n is the column label.
37
D. PROPERTIES OF TRANSFER MATRICES
At the heart of the multi-processor FFT (MPFFT)
algorithm are the inter-processor data transfers that must
occur. There are log^P inter-processor transfers required
to compute an FFT implemented using P active processors.
When the transfers are represented by matrices, the "transfer
matrices" possess many interesting properties. These
properties will be examined in detail in this section. The
required data transfers permute the ordering of the output
sequence, hence the rows of the FFT matrix. When the FFT
is implemented on a single processor, the output is produced
in normal order when the input data is in bit-reversed (BR)
order. The bit-reversal is a consequence of the decimation-
in-time FFT algorithm for computing DFTs . The multi-
processor FFT algorithm produces "pseudo-normal" (PN)
ordered output when the input sequence is in bit-reversed
order. Thus, PN ordering is a characteristic of MPFFT
implementations. This section will show how and why the PN
ordering comes about
.
1 . Transfer Matrix Patterns
An eight-port butterfly network to interconnect
eight processors is shown in Figure 2.5. Reference 7
describes in detail the topology and properties of the
Buttterfly Network (BN). The transfers that take place on
the BN are included in the vector-matrix FFT representation

















































The transfer matrices represent the input-output connectivity
of the BN , where each switch is regarded as a two-port
network. The inputs to the BN come from output ports of the
processors P ~ , P.,, ..., P p _, ; the network's outputs in turn
are connected to the input ports of these same processors
[Ref. 10: p. 1]. It is convenient to view the transfer
matrices as permutations of an NxN identity matrix. An
identity transfer matrix results in effectively no transfer,
leaving all the data in unchanged locations. When
representing the actual inter-processor data transfers
,
elements move off the main diagonal in a very regular pattern.
One-half of the elements remain on the main diagonal. (Recall
that only half the data is transferred at any given transfer
stage.) The remaining elements move right (or equivalent ly
,
down) off the main diagonal or left (up) off the main diagonal
The elements move so as to maintain symmetry about the main
diagonal. This is a convenient property, the consequences
of which will be discussed later.
The only parameters that must be known to determine the
exact form of a particular transfer stage matrix are the
record length, N, the number of active processors, P, and the
transfer stage number, R. The number of active processors
sets the number of transfer stages, log„P. The ratio
N/P=M (the size of the data block in each processor) determines
the number of elements that remain on the main diagonal, M/2.
To illustrate the exact pattern of a transfer matrix, consider
40
a 16-point DIT FFT implemented using 4 active processors.
The four-processor system requires log~P=2 transfer stages.
The transfer matrices describing the inter-processor transfer
are shown in Figure 2.6. The NxN matrices are partitioned
into blocks of dimension MxM . The first MxM block corresponds
to processor P
n ,
the next block to P
1
, etc. Labels are
added indicating processors corresponding to given rows and
columns. The local processor memory addresses are listed
next to the processor addresses. The MxM blocks corresponding
to individual processors are highlighted. Numerical entries
not falling into a highlighted processor block (along the
diagonal) describe the inter-processor transfers. The
destination processor is indicated by the element's row
position and the source processor is given by the column
position. "Upper-half" or "lower-half" processors are
determined by which column(s) contain the entries that have
moved outside the processor's block. If the right-most
columns (larger local memory addresses) corresponding to a
source processor contain the transferring entries, the
processor is an "upper-half" processor at that stage.
Similarly, left-most column transfer entires correspond to
"lower-half" processors. Recall the determination of
"upper-half" and "lower-half" processors discussed in the
previous section. Processors with the same (i-l)s_t bit of
their binary addresses exhibit the same pattern at the i th




















































































































Second Transfer Stage (T2)
Figure 2.6 Transfer Matrices of 16-point /4-processor
DIT MPFFT
42
at the it_h transfer stage differ only in the (i-l)s_t bit of
their binary addresses. To illustrate these ideas, consider
the transfer matrix T, of Figure 2.6. The entries in rows 4
and 5 occur in columns 2 and 3, respectively, outside the
4x4 block corresponding to P~ . Thus, during the first
transfer stage, processor P
n
transfers the upper-half (memory
locations 2 and 3) of its data to P, (processor corresponding
to rows where the entries appear). Recall that upper and
lower refer to the local memory addresses, and not to relative
position in memory. Mote also that the Ot_h bit is the only
bit where P^'s and Pi's binary addresses differ. (Bit value=0
corresponds to upper-half processors, and a 1 indicates
lower-half processors.)
The distance moved off the main diagonal for those
elements shifting is given by the difference between single-
processor memory separation of points combining in butterflies
and the separation available in the local processors. Points
participating in a butterfly in the single-processor DIT FFT





, 8 , . . . , N /2 memory locations at successive
stages. In general, the separation between points par t icipatini
in a butterfly at stage L is 2 L-l The maximum separation
available in the MPFFT is M/2. Therefore, elements move
(2 ) - (M/2) places off the main diagonal when representing
the transfer prior to the Lt_h butterfly stage. In the 16-point/
4-processor example, the first transfer occurs prior to the




diagonal move (2 - (4/2))= 2 places right or left/down
or up. In summary, all the information about an
inter-processor transfer stage can be determined by
inspecting the corresponding transfer matrix.
2 . Algorithm for Generating Transfer Matrices
The regular patterns exhibited by the transfer
matrices make them easy to generate. The row/column positions
can be expressed as an ordered-pair (i,j). For an identity
matrix (no transfer), all elements reside on the main
diagonal and i=j . When actual inter-processor transfers are
represented, the column index is permuted so that i^j . The
algorithm for obtaining the column indices as permutations
of the row indices is as follows. First, the rows and
columns are numbered from to N-l . Next, the row indices
are represented as binary numbers. The column indices of
elements descri bing inter-processor transfers are obtained
by switching 2 appropriate bits in the binary representation
of the row indices. When computing, an N-point MPFFT, the
matrices are of dimension NxN. Thus, there are log„N bits
necessary to represent the row/column indices in binary.





N ) -1 . The
bit whose label is (L*-l) is switched with the bit whose
label depends on the transfer stage number. ( L'::" = log^M=number
of pre-transfer butterfly stages.) At the first transfer
stage, for example, bit number (L"x"-1) is switched with the
bit indexed one number greater. At the second transfer
44
stage, bit (L*-l) is switched with the bit whose label is two
greater, etc. This procedure generates the column position,
j, in the ordered-pair (i,j) from the row position. For
example, an 8-point DIT MPFFT implemented on 4 processors
requires 3 bits to represent the row/column indices
0,1,... ,7. These bits are labelled b ? ,b-,,b n . Then
L*=log^M=log
9
( N/P)=l . To determine the column positions
of the transfer matrix entries, the row indices are permuted
as described above. That is, for the first transfer stage,
bits b
n
(L*-l) and b, are switched. At the second transfer
stage, bits b
n
and b„ are switched. Thus, the transfer
matrices for this example are obtained a shown in Figure 2.7.
Note that switching the appropriate two bits at a given
stage has no effect on the elements that remain on the main
diagonal. For elements not transferring, the two appropriate
bits are the same for the row index at the stage of interest.
Consider now the powers of 2 that the bits represent. The
distance shifted off the main diagonal is the difference
represented by the bit-switching operation. For example,
when switching b„ and b. , the difference is 2 -2 =1. At
this transfer stage, elements shifting move one place off
the main diagonal. This is simply the difference between i
and j when i^j . If this algorithm results in a larger
column index ((i,j) j i), the element in the transfer
matrix shifts right off the main diagonal. If the row
index permutation results in i>j , the corresponding element


























O * CN cO rH lO
j
o O o o o I-t l-H |
i-i
J3 O o rH i-H o o
J
(N
J3 o rH o rH o l-( 1
o o 1-1 o 1—1 o I-1 1
1—1
-0 o o I-* l-H o o
J
o o o o 1-1 1-t .























l-H o O 1
o rH o i-H o rH '
o o o O i-H rH
1
o f-i o r* o rH |
o o l-H i-H o o
[
o o o o rH l-H |







,] j. . !
1 °l1
<0 | • ol
i • -j o
* • ol •
CO I o •
1



























CO • ' o 1
1















































to "upper-half" processors, and left shifts correspond to
"lower-half" processors. The bit-switching algorithm holds
for any number of points and any number of processors. It
always involves switching the bit labelled (L";:"-L) with a
bit determined by the transfer stage number. Although this
algorithm works for either DIT or DIF MPFFTs with bit-reversed
input data sequences , it must be slightly modified to
accommodate pseudo-normal input. This modification will be
developed in a later section.
3 . Orthogonality of Transfer Matrices
All transfer matrices are symmetric. At each
transfer stage, each processor retains half its data. The
corresponding elements of the transfer matrix remain on the
main diagonal . Each processor that transfers the upper-half
of its data (transfer matrix elements shifting right) receives
data from another processor's lower half (elements shifting
left). At each stage, elements shifting off the main diagonal
all move the same number of places. This results in a symmetric
permutation of the NxN identity matrix.
Recall that entries in the transfer matrices only
take on numerical values of 1(=W ) or 0(.=no entry). This
property, coupled with the symmetry property, results in
all transfer matrices being orthogonal. That is,



















= diag [0] (2.34)
For example, consider again T\ of the 8-point . 4-processor


































These convenient properties will be put to use later. For
now, it is sufficient to note that all transfer matrices,
are orthogonal .
4 . Pseudo-Normal Ordering
It was previously shown that single-processor FFT
algorithms can be viewed as row or column permutations of a
DFT matrix. This section will demonstrate that multi-
processor FFT implementations can be viewed as further
row/column permutations of the same basic DFT matrix. The
additional row/column permutations arise from the inter-
processor data transfers required in the MPFFT algorithm.
The inter-processor transfers also cause a reordering of the
data sequences involved in the MPFFT.
Inter-processor data transfers arise from the need
to move partially-processed data points that will be
combined in succeeding butterfly stages in the same processor
When computing a DIT MPFFT, this permutes the rows of
succeeding computation matrix factors. It will be shown
later that the inter-processor transfers similarly reorder
the columns of the decimation-in-f requency (DIF) computation
matrices. The exponent weighting factors within a given row
or column remain unchanged; only the row or column ordering
is changed. As would be expected, the row/column
permutations also exhibit a regular pattern, based on the
pattern of the transfer matrices.
49
The rows and columns of the matrix factors are
labelled with values of k and n, respectively. The column
labels, n, correspond to the ordering of the input data
sequence, x(n). For now, only bit-reversed (BR) input
sequences will be considered. The row labels, k, represent
the ordering of the output sequence, X(k). Starting with
the normal-ordered output that is produced by the SPFFT
algorithm using BR input, this section describes how the
normal ordering is transformed into "pseudo-normal" (PN)
ordering by the inter-procesosr transfers on the Butterfly
Network.
Consider a 16-point DIT MPFFT implemented on an
8-processor Butterfly Network (BN). Three transfer stages
(log
? 8) are necessary in this computation. The row labels,





































































Figure 2.8 Normal to Pseudo-normal Ordering
(16 points/8 processors)
Figure 2.8 illustrates which data points are combined in the
butterfly computations at each stage. During the first
(and only) pre-transfer butterfly stage, points x(0) and
x(l) are combined in a butterfly in processor P~. The second
( 2-1
)
butterfly stage requires separation between points of 2 =2
memory locations. This separation is not available within a
given processor, so an inter-processor transfer must occur.
After the first transfer, partially-processed data points
x(0) and x(2) reside together in P~ , and another butterfly
computation is performed. This procedure continues until
the computation is completed. The PN output sequence X(k)
that results is in the order shown after the final transfer
stage, T~. This pseudo-normal ordering depends only on the
record length, N, number of active processors, P, and their
ratio, M=N/P. The record length determines the total
number of butterfly stages, log 2 N. The number of transfer
stages (log ? P) is set by the number of processors. The size
of the data block in each processor, M, sets the number of
pre-transfer butterfly stages (log2M=L#) that can be computed
prior to any transfers. The PN ordering is a function of the
51
number of transfers, hence processors. This is analogous to
the SPFFT case, where bit-reversal is dependent on the record
length, N. Note also from Figure 2.8 that one-half the points
in each processor are output in normal order , after the final
transfer. Thus, the output sequence can be rearranged by
additional network transfers, or the processor local memories
can be read cyclically to obtain the output data in normal
order [Ref. 7: p. 18].
The reordering of the output sequence, from normal
to PN ordering, can be derived by a bit switching algorithm
similar to that for generating transfer matrix patterns.
The bits to be switched at each stage are different in this
case, however. The binary representation of the row labels,
k , again requires log N bits to describe the N-point output
sequence. The bits switched this time always involve two
adjacent bits, beginning again with the bit labelled (L";c"-1)
at the first transfer stage. The labels of the bits to be
switched are then each incremented by one at each successive
transfer stage. Figure 2.9 illustrates the algorithm for 2
different values of P and M. Note in Figure 2.9 that only M/2
values of k in each processor are changed at any given stage.
Also observe that the starting bit to be switched at the
first transfer stage is different for different values of M.
This is because fewer active processors compute more pre-
transfer butterfly stages, and require fewer transfers (less
reordering). Consider again the powers of 2 represented by
52




1 1 1 1 1 8
2 1 1 1 1 1
3 1 1 1 1 1 1 1 1 9
4 1 1 1 1 2
5 1 1 1 1 1 1 1 1 10
6 1 1 1 1 1 1 1 1 3
7 1 1 1 1 1 1 1 1 1 1 1 1 11
8 1 1 1 1 4
9 1 1 1 1 1 1 1 1 12
10 1 1 o - 1 1 1 1 1 1 5
11 1 1 1 1 1 1 1 1 1 1 1 1 13
12 1 1 1 1 1 1 1 1 6
13 1 1 1 1 1 1 1 1 1 1 1 1 14
14 1 1 1 1 1 1 1 1 1 1 1 1 7
15 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 15
Figure 2.9a Normal to Pseudo-Normal Ordering
16-points/8-processors DIT MPFFT (L*-1)=0
b3 b2b1 b
Tx (b1<^b 2) T2 (b2<H>b 3)
1 1 1 1 1
2 1 1 1 8
3 1 1 .1 1 1 1 9
4 1 1 1 2
5 1 1 1 1 1 1 3
6 1 1 1 1 1 1 10
7 1 1 1 1 1 1 1 1 1 11
8 1 1 1 4
9 1 1 1 1 1 1 5
10 1 1 1 1 1 1 12
11 1 1 1 1 1 1 1 1 1 13
12 1 1 1 1 1 1 6
13 1 1 1 1 1 1 1 1 1 7
14 1 1 1 1 1 1 1 1 1 14
15 1 1 1 1 1 1 1 1 1 1 1 1 15
Figu re 2 .9b Normal to Pseudo-No:rmal
16-points/4-processors DIT MPFFT (L*-l)=l
Figure 2.9 Permutation (by stage) Normal to
Pseudo-Normal Ordering
53
the bits. Since the first transfer in Figure 2.9a switches
bits b„ and b, , the k values change by 2 -2 =1. This
explains why the bits to be switched are adjacent in this
case. During each transfer, the network is preparing for
butterflies requiring separation between points twice that
available. Thus* bits representing successively higher
powers of 2 are switched at succeeding stages.
5 . Selection of Transfer Matrices
The transfer patterns used in references 7 and 9
are not unique. Other patterns of transfers were investigated,
but had the disadvantage of structures for which the general
pattern could not be easily discerned and described analytically
[Ref. 9: p. 74]. As would be expected, the same situation
exists regarding the transfer matrix factors describing dis-
tributed FFTs on the BN . This section discusses the
selection of appropriate transfer matrices to represent the
inter-processor transfers on the BN
.
Several constraints on the make-up of potential
transfer matrices can be identified. These constraints are:
1) Entries take on values of 1 or only.
2) Only one entry per row or column is permitted.
3) The toplogy of the Butterfly Network must be satisfied.
4) Data points to be combined in the next butterfly stage
must reside in the same processor after transer.
5) One-half the entries remain on the main diagonal.
54
These constraints greatly reduce the number of NxN matrices
that qualify as transfer matrices. The constraints also
lead to the transfer matrices being symmetric, hence
orthogonal. This convenient property will be exploited in
the next section.
Beginning with the NxN identity matrix, the elements
are permuted according to the constraints listed above to
determine "legal" transfer patterns. For the 8-point/
4-processor DIT MPFFT , this procedure yields only two
matrices for each transfer stage that meet all the constraints.
These matrices are shown in Figure 2.10. The transfer matrices
actually used are shown in Figure 2.10a. The other matrices
that satisfy all the constraints are shown in Figure 2.10b.
When the transfer patterns of Figure 2.10b are used, two
undesirable properties result. First, since the higher-
numbered data points occupy the lower addresses in each
processor memory, the butterflies that follow must be
inverted, with the twiddle factor paterns upside-down.
Additionally, following the second transfer, the weighting
factor patern is scrambled from the order expected. The
second undesirable result involves the ordering of the
output sequence. Instead of normal, bit-reversed, or
pseudo-normal ordering, all of which can be unravelled, the
output is in indecipherable order: X(5) X(l) X(4) X(0) X(7)


































































































































































































CM ol • 1 1 CM * • lo • 1 1 •H
_ __ -4-- —1— —





















OrHCMcO^lO CD N O ri N CO ^ lOCDN
56
the transfer patterns used in reference 7 also lead to
a consistent pattern in the twiddle factors in every
butterfly stage. The PN ordering of the output sequence
is easy to describe and unravel. Note also that the
alternate transfer matrices are more difficult to generate .
That is, the bit-switching algorithm presented in Section
II. D. 2 does not work for the alternate transfer matrices.
E. GENERALIZED MULTI-PROCESSOR FFTs
Single-processor FFT (SPFFT) algorithms permute the
row or column ordering of the DFT matrix which they imple-
ment. This is turn reorders the output (permuted rows) or
input (permuted columns) data sequence. The reordering
caused by SFFT algorithms changes normal-ordered input or
output sequences into bit-reversed (BR) order. The multi-
processor implementation of the decimation-in-time (DIT)
FFT algorithm described in Section II. D. 4 has been shown
to further reorder the rows of the basic DFT matrix. This
further permutes the ordering of the transformed output
sequence. As was shown, when using bit-reversed input data,
the DIT multi-processor FFT produces a pseudo-normal (PN)
ordered output sequence. Pseudo-normal ordering is thus a
consequence of this multi-processor FFT (MPFFT) algorithm,
as bit-reversed ordering is associated with SPFFT
algorithms. This section describes the generalized
factorization of MPFFT matrices to obtain the stage-by-stage
57
matrix factors. The DIT MPFFT (BR input/PN output) examined
previously will be extended to represent the DIT MPFFT
with PN input ordering, and to decimation-in-f requency (DIF)
MPFFT algorithms using both BR- and PN-ordered input
sequences .
The overall MPFFT matrix of twiddle factor exponents,
E
,
is obtained by forming the modulo N product of the
appropriately ordered row and column labels [Ref. 2: p. 43].
The row and column labels represent the ordering of the
output and input sequences, respectively. A BR input
sequence to the MPFFT implemented using the Butterfly
Network produces PN-ordered output. Similarly, a PN input
data sequence produces transformed output in BR order. This
is true whether a DIT or DIF MPFFT is being computed,
although the matrix factors representing individual computa-
tion stages differ. Thus, after calculating the BR and PN
sequence ordering for the given number of data points and
active processors, it is easy to write down the overall
MPFFT matrix.
Once the overall MPFFT matrix is obtained, it must
be factored to represent the individual computation stages.
As discussed previously, an N-point FFT requires a total
of log^N butterfly stages. When implemented on the multi-
processor Butterfly Network, there are L*=log 9 M pre-transfer
butterfly stages. (Recall that N=record length, P=number
of active processors, and M=N/P=number of points in each
58
processor.) Therefore, each computation stage is represented
by one matrix factor, with one transfer matrix prior to each
post-transfer butterfly computation matrix factor. Each
matrix factor successively pre-multiplies the factor(s)
representing the previous stage(s) when the matrix product









T (logP)-l...W l W l0gM ...W^ l x(n)





where all logarithms in the matrix indices are base two.
The remainder of this section discusses how to form each
of the matrix factors in (2.37).
1 . PIT MPFFT with Bit-Reversed Input
The MPFFT matrix, E , is formed by reordering the
row and column indices as described previously, and then
forming the modulo N products to give the individual
elements. This matrix must then be factored to obtain
the representation of each computation and transfer stage.
The most illustrative case to consider is when L""" = log^M=l .
In this case, there is only one pre-transfer butterfly
stage, and the maximum number of transfers and post-transfer
butterfly stages. Each processor contains M=2 data points,
and is thus capable of performing a 2-point DFT internally.
At successive stages, each processor can compute one-half
of a 4-point DFT, one-fourth of an 8-point DFT, etc.,
59
provided that appropriate data points are transferred to
reside within each individual processor. Other situations
involving more than 2 points per processor are easily
described in relation to the L*=l case.
Each NxN matrix factor in (2.37) is first partitioned
into blocks of dimension MxM. Each MxM block along the main
diagonal represents an individual processor. The computation
stage matrices E
T
contain entries only in the submatrix
blocks along the diagonal. The entries in these blocks
describe the butterfly computations taking place within the
individual processors. The transfer matrices T contain
~ R
entries outside the blocks along the diagonal to describe
the inter-processor data transfers.
Consider once again the 8-point/4-processor DIT
MPFFT using BR input of Figure 2.4. Note that there are a
total of 3(log
9 8) computation matrix factors. Ji, represents
the first ( pre-transf er ) butterfly stage, and there are
2 (log ? 4) matrix factors representing the post-transfer
butterfly stages, each preceded by a transfer matrix.
Recall that the DIT algorithm decomposes an 8-point DFT into
two 4-point DFTs , then into four 2-point DFTs . See Figure 2.11
This decomposition partitions the input time sequence into
subsequences having even and odd indices. At each
successive decomposition, the butterfly inputs are separated
by larger increments in time. The first butterfly stage
combines samples from the time sequence separated by 4
60
o 1-1 <N CO TJ» LO CO N\~s ^ v_/ ^/ v_^ V—
/
>< ><


















































































































/-\ /-N ^v r~\ s-\ S~>\ s-\ /—
s
H.
o "^ 03 CO i-H LO CO N bC
>»_• ^^ v^ ^/ N-X ^-^ ^^ v^ •H




units (N/2). Thus, the first matrix factor , E, , has column
(time) indices and 4. The ordering of the row indices, k,
at the first stage is the normal ordering characteristic
of the single-processor DIT FFT with bit-reversed input data
The row ordering is then permuted by each inter-processor
transfer stage until the output sequence ends up in the PN
ordering characteristic of the MPFFT on the Butterfly
Network. The second butterfly stage in the DIT FFT combines
points separated by two memory locations. In this example,
this separation is not available within a given processor,
so an inter-processor transfer must take place. This
reorders the rows of the next computaton stage, and all
successive stages. The algorithm for reordering the row
indices was presented in Section II. D. 4, The time
separation between points combining in the second butterfly
stage is 2 units, so the column labels for this matrix
factor (E ) are and 2. Note that each twiddle factor
exponent in the blocks along the diagonal is still obtained
by the modulo N product of the row and column indices. The
third (final) computation stage marix factor, E~ , is derived
similarly. The row ordering is further permuted by the
second inter-processor transfer. The row indices of the
final computation stage are now in the PN order in which
the transformed output sequence appears. The column labels
are and 1
,
corresponding to the butterfly inputs '
separation of 1 unit. It can be verified by comparing
62
Figure 2.4 with Figure 2.2 that identical butterfly computa-
tions take place at each stage. That is, each row in the
computation matrix factors contains identical entries at
any stage, only the ordering of the rows is changed in the
MPFFT by the inter-processor transfers.
An explanation of the coiumn labels of the transfer
matrices is also necessary. A column label of indicates
that the element in that column remains on the main diagonal
(doesn't represent a transfer at that stage). The negatively-
valued labels indicate elements moving down off the main
diagonal, and correspond to lower-half processors at the
transfer stage in question. Similarly, positively-valued
labels indicate elements moving up from the main diagonal
(upper-half processors). The magnitude of the transfer
matrices ' column labels in the DIT algorithm corresponds to
the time separation between points that are to be combined
in the butterflies at the next computation stage. The row
indices are permuted by previous transfers the same as the
two indices of the computation matrix factors.
The matrix factors in (2.37) are called sparse
matrices because of the numerous zero entries in any row
or column. The sparse matrices cut down the number of
arithmetic operations required to compute the DFT [Ref . 2:
p. 67]. Multiplying out the matrix factors gives the
permuted DFT matrix E as the result. Taking the products
can be regarded as a mixing operation in which the column
63
time indices of the matrix factors add. Successive
decomposition of the input time sequence results in E,
having only time values of and 4 units, E« having only
time values of and 2, and E~ having only time values of
and 1. When the matrix factors are multiplied, the time
values mix so that E„";:"T^';:"E„ has time values of 0,4,2, and 6







results in the product having time values of 0,4,2,6,1,5,3,7
This represents the ordering of the input sequence which was
successively decomposed at each stage in the factorization
of E to give the DIT algorithm. Thus, carrying out the
DIT MPFFT (multiplying the matrix factors) can be viewed
as a recombination of the input sequence that was decomposed
in setting up the algorithm. The row indices of the E
matrix are determined by the row ordering at the final
computation stage, and represent the ordering of the
transformed output sequence, X(k).
The 8-point /4-processor example (L";c~=l) can be
extended in several ways as follows. Since M=N/P, there are
really only two variables in (2.37), the record length, N,
and the number of active processors, P. The total number
of butterfly stages must equal the number of pre-transfer











Two situations can result in fewer transfer stages than the
L*=l example just cited. Extending the input record
length to a higher power of 2 while keeping the number of
processors constant results in one or more additional pre-
transfer butterfly stages. This situation will also increase
the total number of butterfly stages, log
?
N. Also, decreasing
the number of active processors (again, required to be a
power of 2) for a constant input sequence length results
in more pre-transfer butterfly stages. This situation can
result when processors suffer faults, and there are no spares,
and leaves the total number of butterfly stages unchanged.
Both of these situations increase the number of pre-transfer
butterfly stages by increasing the size of the data block
within each processor, M. This makes each processor capable
of performing more than one pre-transfer butterfly stage.
To illustrate, consider again the 8-point DIT MPFFT, this
time using only 2 processors, as shown in Figure 2.12.
There is still a total of 3 (log«8) butterfly stages. Now,
however, there are 2 (log„(8/2)) pre-transfer butterfly
stages
,
and only 1 transfer stage and 1 post-transfer
butterfly stage. Referring to the algorithm of Section
II. D. 4 to calculate the PN ordering for L"x" = 2 , the E matrix
can be written down. Each of the matrix factors E T is
divided into submatrix blocks of dimension MxM. Now,
however, the blcoks will be 4x4. Each processor contains
enough data to compute 2 pre-transfer butterfly stages.
65
S2*Bl
lc\_0 2 2 2 2 kv 4 4 4
•i°
1 2t- 2-l -
2 •1 4





































































4| 6 2 L
4
4 2 6


















































































Figure 2.12 8-Point /2-Processor DIT FFT (BR Input)
66
Since during the pre-transfer stage(s) each processor is
functioning independently (i.e., as a single-processor),
the pre-transfer matrices are identical to the SPFFT case.
Note also that all processors contain the same weighting
factor pattern prior to any transers, regardless at which
stage the first transfer occurs.
When the submatrix blocks representing the
individual processors are of dimension 4x4 or greater,
there are dots, representing no entry, within each block.
This is analagous to the single-processor case, in that
some or all of the required memory separation between
points is available within the processors' local memories.
When the stage is reached where inter-processor transfers
are necessary, rows of the succeeding stage matrices are
permuted, as always. The column labels representing
time values of each matrix factor are such that each
processor block contains an equal number of zero and non-zero
time labels. Regarding the transfer matrices, one or several
of the transfer stages required in the L*=l case are
"bypassed" as the N/P=M ratio increases. The algorithm
for generating the transfer matrices presented in Section
II. D. 2 still applies, but the number represented by L"x"-1
(first bit to be switched) increases.
67
2. PIT MPFFT with PN Input
The DIT FFT algorithm can be used on a single
processor with normal-ordered input, to produce bit-
reversed output. Similarly, the DIT multi-processor
algorithm can also be used with a pseudo-normal input
sequence. As would be expected, the transformed output is
produced in bit-reversed order. The MPFFT matrix E is the
transpose of the E matrix of the BR input case. This
reflects the change in th input-output relationship of
the PN- and BR-ordered sequences. The elements of the E
matrix are thus still obtained from the modulo N product
of the row and column indices representing the output and
input sequence ordering. Unfortunately, the matrix factors
describing individual stages are not simply the transposes of
the matrix factors from the BR input algorithm. This is
due to the different ordering of the input sequence. The
inter-processor transfers occur in reverse order, and for
the case where L";:"^l (fewer than maximum number of transfers),
even occur at different stages. Specifically, for an
8-point /2-processor DIT algorithm, L*=log ~M=2 . When using
PN input data, there is one transfer stage (log„P) as
expected. However, the one transfer is required after the
first butterfly stage, and not after two pre-transfer
stages, as in the BR input case. There are two post-transfer
butterfly stages, but only one transfer. That is, there is
no second transfer required between the two post-transfer
butterfly stages. (Compare Figure 2.13 with Figure 2.12).
68











































* 1 01 • 4













































































































Jo j 4 j 4
4
2




5 1 1 4 j 4 _ 6 5 !_0 5 1 4 21! 4 2
1
\~ T 1





4 5 12 3
j
6 7
5 ?! 4j__2J 6. 5 51 4 1 j_2 7 J.6 3
3 '4 1 6 j 2 3 03|47J6 1|25
7 0| 4 j 6 | 2_ 7 _0 7 1 4 3 | 6 5 1 2 1_
(2.40)
Figure 2.13 8-Point /2-Processor DIT FFT (PN Input)
69
Again, the easiest case to consider is when L"x*=l .
There is only one pre-transfer butterfly stage, and each
post-transfer stage is preceded by an inter-processor
transfer stage. In this case, though, since the Butterfly
Network is transforming PN input into BR-ordered output
,
the transfer pattern is the reverse of that in the BR
input case. The individual transfer matrices are the same,
since they are symmetric, but they occur in reverse order.
(Compare T, of Figure 2.1.4 and T
2
of Figure 2.4.) The column
labels once again correspond to the time decomposition of
the input sequence. For an 8-point DIT, the labels are
and 4 at the first stage, and 2 at the second stage
and and 1 at the final stage. The column indices of the
transfer matrices are also the same as before. The non-zero
magnitudes still indicate time separation at the next
butterfly stage. Values of zero indicate non-transferring
elements, and the sign gives direction of movement off the
main diagonal. The row indices for the first computation
matrix factor are in what will be called "pseudo-bit-
reversed" (PBR) order. This is the output ordering that
would result in a SPFFT using pseudo-normal-ordered input.
Recall that the exact PN ordering is a function of the
number of required network transfers, log„P. The same
is true of PBR ordering, since the transfers ultimately
permute the PBR ordering into bit-reversed-ordered
output. Thus, the elements within each matrix factor can
70
TV E. T1 '-E 1














1 " J_ i°
1
4 1 . i
i
1 • •


















































































































T * * E2*±1































T 2'*E 2 ,T 1 '*E 1














































Figure 2.14 8-Point /4-Processor DIT FFT (PN Input)
71
still be calculated from the modulo N product of the row
and column indices. The row ordering is permuted by each
successive transer , as before. At the final computation
stage, the row indices, k, are in bit-reversed order
corresponding to the ordering of the transformed output
sequence. The column labels of the E matrix can again
be found by summing the time indices of each of the
individual matrix factors.
The situation becomes slightly more complicated
when L*^ 1 . For PN input, there is only one pre-transfer
butterfly stage before inter-processor transfers become
necessary. There will be the same number of transfer
stages as before (BR input case), but they occur in
reverse order. This leaves log 9 M=L"5:" computation factors
at the end without any transfer matrices between them.
This is analagous to the BR input case, when there were
L * computation stages prior to any transfers.
Recall the discussion on P«N ordering from Section
II. D. 4. Each processor contains two subsequences of M/2
points in normal order, each subsequence separated by
N/2. Since the DIT algorithm combines points separated
by N/2 at the first butterfly stage, the PN input case
calculates M/2 2-point DFTs in each processor at the first
computation stage. There are (H/2)-l dots (no entry) in each
block submatrix corresponding to the local memory
separation between points combining in the butterflies.
72
Refer again to Figure 2.13. The butterfly width remains
constant for successive stages preceded by inter-processor
transfers, and decreases when computation stages are no
longer preceded by transfer. Again, recall the BR input
case where butterfly widths grew to a maximum, determined
by local memory contents, then remained constant as
transfer stages occurred. Since the PN input sequence
does not have points separated by N/4 (second stage DIT
butterfly width) residing in any processor, the first
transfer stage occurs after the first butterfly stage.
Transfers occur until log^P stages have been carried out.
The final L* butterfly stages do not require transfers
between them, as mentioned above. Each transfer
successively permutes the pseudo-bit-reversed row labels,
until the bit-reversed output sequence emerges.
3 . Pseudo-Bit-Reversed Ordering
Further discussion of the nature of pseudo-bit-
reversed (PBR) ordering is necessary. Just as normal
ordering is permuted to PN ordering by the DIT MPFFT using
BR input, PBR ordering serves as the starting point for
PN input algorithms. Also, PBR order is the ordering of
the output that would result from a SPFFT DIT algorithm
using PN input ordering. The terms "PN" and "PBR order"
are really meaningless when talking about single-processor
implementations, since each is a function of the number of
inter-processor transfers. However, if a particular
73
configuration is assumed, say 8 points/4 processors, then
that particular PN ordering, when input to a SPFFT, would
result in the corresponding (8/4) PBR-ordered output. In
any case, the algorithm of Section II. D. 4, which generates
PN from normal ordering , can also be used to generate PBR
ordering from bit-reversed order. To use the algorithm
as presented, the bits must first be reordered to bit-
reversed ordering. That is, instead of descending order,
the bits are reordered in ascending order. For example,
a 16-point data sequence requires 4 bits to describe the





b~, b~. Now the algorithm may be applied
directly to generate pseudo-bit-reversed order, for a
given number of processors. Recall that the algorithm
switches bits L*-l and L* at the first transfer stage,
then increments both bits at successive stages. Thus, at
the second stage, bits L* and L*+l are switched, and so on.
The transfer matrices used for the PN input case occur
in reverse order from the BR input case. Thus, this
algorithm can be used in reverse, to give the stage-by-stage
row indices, k, and ultimately, the BR-ordering of the
output sequence. Refer to Table 1 for a display of these
relationships .
74
TABLE 1: SEQUENCE REORDERING (DIT MPFFT)





















b b b b12 3
Note: Tg^^
























































8 Points/4 Processors (ET =1)
Normal : b-b b.
T
l
: b b l
V b l







T2 : b b l





8 Points/2 Processors (L =2)
Normal
:
bo b i bn
Tl : b,





BR: bQb 1 b2
Tl : b l b2
PBR: b b b
2 1
75
4. DIF MPFFT Algorithms
As in single-processor algorithms, the multi-
processor DIF FFT exhibits much duality in relation to the
multi-processor DIT FFT. The MPFFT matrix E can again
be viewed as a DFT matrix with rows and columns reordered
corresponding to the output and input sequences. The E
matrix of the DIF with PN input/BR output is the transpose
of the E matrix of the DIT with BR input/PN output. The
transpose of a product of matrices is the product of the
transposed matrices, in reverse order. The stage matrix
factors of the DIF with PN input are thus the transpose
of the factors, in reverse order, of the DIT with BR
input. The DIF algorithm partitions the output sequence
into even and odd values of k, the frequency index. The
row indices now represent frequency separation at the
individual computation stages. That is, the first matrix
factor, E, , has frequencies and 4. The units of
frequency depend on the analysis period, or record length.
The column indices start out in the order of the input
sequence for the first stage factor. They are then
reordered by each transfer stage, in the same manner that
the row indices are permuted in the DIT algorithm. Thus,
the column indices, n, represent the points of the input
sequence that reside in a given processor at any given stage
The transfer matrices are the same as for the DIT, and occur
in the same order as the DIT with the same input ordering.
76
That is, the DIF with PN input has the same transfer
pattern as the DIT with PN input. The pattern is the
reverse of the DIT with BR input. Since the transfer
matrices are symmetric, the transpose operation leaves
them unchanged. Thus, the individual factors of the DIF
with PN input/BR output are each the transpose, in reverse
order, of the factors of the DIT with BR input/PN output.
Since the ordering is reversed, the magnitudes of the
transfer matrices' row labels are given by the frequency
separation at the previous stage. Recall that the
magnitude of the column labels on the transfer matrices for
the DIT are given by the time separation for the next
stage. (See Figure 2.15 and compare with Figure 2.4.) Similar
duality exists between the DIF with BR input/PN output and
the DIT using PN input/BR output. The E matrices are the
transpose of one another, and the stage factors of the DIF
are the transpose of the factors of the DIT, taken in
reverse order. (See Figure 2.16 and compare with Figure 2.14.)
This section has presented the factorization of
FFT matrices, to obtain their multi-processor representations.
Following sections will investigate the implementation of
other fast transforms on the Butterfly Network.
77
il Si I1-S1



































































2 . o . .
i
!



























£' 2 13 48 57
r~ i —





2 1 j 4 4
1
1 j |~0 . 2
1 | I
*0 0[ "J4 4T
2 [2 2| J 6 6 2 '2 2 1 1 6 6




1 5 1 3 7
1
3











- 1 *- 1




1 3 7 | | 1 5_
JO 4 15 26 3 7




1 14 4 1
4
2 012 2 |4 4 16 6
2
_








1 1 5 | 13 7
4 | } 6 2~]
5
3
0415 1 12 6173
~i i —
r





i d 7 _0 4 j 7 3 1 6 2 J 5 1_
(2.42)
Figure 2.15 8-Point /4-processor
78
DIF FFT (PN Input)
Il



















k£_0 4 2 6 15 3 7
]£_P 2 .45 ,13 ,



























































































































1 041 26| 1 5137
5 4j_2 6_j 5 1(_7 3
2 0| 4
4J
2 2| 5 6
6
_0_ 0I_ 4
_4j_ _6_6 1 _2_ 2
4| 62J37| 153
7
_0 4| 6 2| 73 5 1_
(2.43)
F
Figure 2.16 8-Point /4-Processor DIF FFT (BR Input)
79
III. OTHER FAST TRANSFORMS ON THE BUTTERFLY NETWORK
This section uses the tools developed previously to
investigate other fast transform algorithms for potential
implementation on the Butterfly Network (BN). Section II
used matrix representations to show that the radix-2 FFT
algorithm is supported by the BN . That is, the radix-2
FFT can be partitioned so that multiple processors can
carry out the computation in parallel. The BN supports
the required inter-processor transfers of partially-
processed data, producing transformed output in a new, yet
easily decipherable, order. This section examines several
other algorithms to determine their adaptability to
parallel processing on the BN . The analysis will focus
on the required inter-processor transfer patterns, and
the output ordering produced from the Butterfly Network.
A. FAST HARTLEY TRANSFORM (FHT)
Bracewell has articulated a real, fast transform
algorithm that serves all the purposes of the complex
Fourier transform [Ref. 11]. It is called the Fast
Hartley Transform (FHT) in honor of Hartley, whose 1942
formulation of a real integral transform is the basis for
the FHT. The FHT algorithm is faster than the FFT, and
if desired, may even be used, with a little additional cost,
to compute the DFT.
80
As shown previously with the DFT , the discrete Hartley
transform may be viewed as a square matrix operating on an
N-dimensional vector. The DHT matrix can then be factored
to obtain the FHT algorithm, where each factor represents
one stage of the algorithm. The step from the Hartlev
transform to the Fourier transform can also be expressed
as a matrix multiplication and so the factorization of
the Hartley matrix leads directly to a new factorization
of the Fourier matrix [Ref. 11: p. 64].
1 . Single-Processor FHT CSPHFT) Algorithm
The Discrete Harley Transform (DHT) is defined
[Ref. 11: p. 27] as the sum of the cosine and sine transforms
N-l
X(k) = ± 2 x (n) cas(2 7rnk/N)
,
k=0 , 1 , . . . , N-l
n =
(3.1)
where cas 6 = cos 6 + slnd . The inverse transform is:
N-l
x(n) = X X(k) cas (2^nk/N), n=0 , 1 , . . . , N-l (3.2)
k =
Note that only real arithmetic is used, and except for the
factor N in (3.1), the forward and inverse transforms
are identical. Thus, for real input data, the DHT is also
real valued, and the original sequence can be obtained by
using the same transformation formula a second time.
Equation (3.1) may be represented as a vector-matrix
equation :
81
X(k) = H x(n) (3.3)
The DHT matrix H can be factored to give the FHT , similar
to the way FFTs are obtained from the factorization of a
DFT matrix. The input sequence is again assumed to have
length equal to a power of 2. Thus, the DHT matrix H is
factored as :
5 * tG £g-i feii (3.4)
where G = log«N. The factors L , s=l , 2 , . . . , log^N are
called stage matrices, and represent the computations at
each stage of the algorithm. The input sequence x(n) is
in bit-reversed order, and yields normal-ordered output.
Each of the factors in (3.4) are sparse matrices,
contributing to the speed of the FHT algorithm. The FHT
employs a "divide and conquer" strategy similar to the
FFT . An N-point DHT is obtained as a linear combination
of the outputs of two N/2-point DHTs , each in turn
obtained as combinations of the outputs of two N/4-point
DHTs, etc. The decomposition proceeds until 2-point
sequences result. The DHT of a 2-point sequence is
obtained simply as the sum and difference of the 2 data
elements. To illustrate, applying the definition (3.1)
for N=2 yields:
82
X(0) = i[x(0)(cos O+sin 0)+x(l)(cos O+sin 0)]
X(l) = i[x(0)(cos O+sin )+x( 1 ) ( cos ^ +sin * )] (3.5)
X(0) = i[x(0)+x(l)
]











Reference 11 defines a general decomposition formula that
produces the DHT by successive halving of the input
sequence. Given the input sequence x ( n ) , n=0,l,...,N-l,
let the N-element subsequences [x(0) x(2) . . . £ and
{x(l) x(3) . .










This structure corresponds to the factorization of the
DHT matrix, H [Ref. 11: p. 85].
The second stages compute one or more A-point
DHTs, using two 2-point DHTs as inputs. The cosine and
sine multipliers will be either 1 , 0, or -1 , since
83
(2-n-nk/N) = (27rnk/4) = (nk7r/2), and only integer multiples
of t/2 are thus possible. Applying (3.1) for N=4 and





1 1 1 1
1 1 -1 -1
1 -1 1 -1







The columns are rearranged to correspond to bit-reversed
input, yielding:
X(0) 1 1 1 1 x(0)
X(l) 1 -1 1 -1 x(2)
X(2) 1 1 -1 -1 x(l)
1(3) 1 -1 -1 1 x(3)
(3.10)
Equation (3.10) can be factored, since half the terms can
















It is easily verified by carrying out the matrix
multiplication that (3.11) is equivalent to (3.10).
(In this development, and in the future, the N factor
is neglected for clarity. It can be inserted at the
end, if necessary for normalization.)
The third and subsequent stage factors, representing
8- (or more) element FHT stage matrices are more complicated
The inputs to the 8-point FHT are the outputs of two
4-point FHTs . This recursive property for generating
higher-order FHTs from the combination of lower-order
FHTs never changes. What does change is the number of
non-zero elements per row in the stage matrices. For the
third and subsequent stages, there are three non-zero
elements in each row. This means that three elements
enter, with appropriate coefficients, into each output




















where C = cos(27r n /8), S = sin(2?rn/8) and K =
n n n
[ cos( 2irn/8)+sin( 27rn/8) ] . Note that the cosine elements
and the l's are on diagonals parallel to the main diagonal,
but the sine elements run the other way. As a result,
the independent variable of the operand element that
multiplies the sine coefficient S decreases as nr
n
increases. This property is referred to as retrograde
indexing [Ref. 11: p. 73]. The arguments of the sine and
cosine terms are integer multiples of vlk, taking on a
numerical value of either, 0, +1, or + 1 /V JT Putting
numerical values in the L^ matrix and carrying out the
matrix multiplication gives the result shown in Figure 3.1.
Again it is easy to verify that the product in Figure 3.1
gives the same result as the DHT definition of (3.1).
Conversion of the DHT to the DFT is accomplished simply
by associating the even and odd parts of the DHT with the
real and imaginary parts of the DFT. For a discussion of
the mechanics of this operation, see reference 11, pp. 75-77
and reference 12, p. 150. However, if the power spectrum
is desired, it can be computed directly from the DHT by
2 2
evaluating [X(k)] +[X(N-1)] . Finally, note that since
the FHT is a real transform compared to FFT , which
is complex, the real and imaginary parts of any
complex input data can be processed in parallel,


























































01 23 45 67
ii |i | "























2 3 4 5
i i i
l -ii l











L3 -I- 2 -L 1
_0 4 . 2 6





1 -11 1 -1, s -si
1 1 r l -lj 1 1 r l -1
1 — li— 1 1 I ' s -s
:t _:.
1 1 I 1 1 [-1 -in -i
i-ii-ibs s
i i -l -iUi -l! l l
i
i






Figure 3.1 8-Point /1-Processor DIT FHT (BR Input)
87
2. Multi-Processor FHT (MPFHT) Algorithm
The Fast Hartley Transform can be partitioned among
several processors for parallel implementation on the
Butterfly Network. The number of points per processor,
M=N/P, must be greater than or equal to 4, since the
third and successive computation stages ( N_>8 ) combine
3 elements at a time. Thus, there are at least two
(log
?
M) pre-transfer stages for any N-point FHT. The
recursive nature of the FHT yields inter-processor transfer
requirements similar to the MPFFT. That is, after
L"::"=log
?
M pre-transfer stages, there are transfer stages
interspersed between the subsequent post- transfer computation
stages. As always, the transfer stages permute the row
ordering, and thus the ordering of the output sequence.
Processors again retain half their data, and transfer half
at any given stage . When more than one transfer stage is
required (4 or more active processors), the patterns of
the output ordering are apparently not so convenient as in
the MPFFT. More desirable patterns than those to be
presented may well exist. The difficulty lies with the
fact that the amount of data transferred is always a power
of 2 (one-half each processor's memory), while 3 points
are combined in the third and successive computation stages.
The easiest case to consider is an 8-point FHT
implemented using 2 processors. There are 3 (log^8)
88
computation stages: 2 pre-transfer computation stages
(L* = log
2
(N/P)=2) , 1 transfer stage (log
2
P=l) and
1 post-transfer computation stage. For this case:
X(k) = L~ T, L„ L- x(n)
-*, ~j ~i ~*£ —i ~ (3.14)
The product is computed in Figure 3.2 and can be compared
with Figure 3.1. The transfer matrix is symmetric, and
was chosen in keeping with the invertability property of
the DHT . Note that this transfer pattern causes bit-
reversed output, when the input is also in bit-reversed
order. Again, exactly one-half of each processor's memory
is exchanged. This time, however processors exchange
alternating memory locations, instead of the upper- or
lower-half as in the MPFFT. Processor P retains its
o
even-labelled memory locations, and transfers its odd-
labelled memory locations in exchange for processor P, 's
even-numbered data. The order of computation after
transfer is somewhat subjective, and was chosen to minimize
the amount of buffering required within each processor.
In this case, the requirement of only 2 additional memory
locations per processor for buffering appears to be
optimum
.
Extending the discussion to N=16 illustrates the
features of the generalized multi-processor FHT. For a
16-point data sequence transformed using the same 2




01 23 45 67 K0 4 2 6 15 3 7K 1— i i i — ]£_0 1 , 2 3 , 4 5 , 6 7_,i|i!
! 111 1 1 1 1 1 1 1 1 1
1








2 1 1 |-i -ii
3











5 1 I i _!l
1. 1 ^
5 .
Il -ll l -l
6 i i l-i 6 1 1 111 6 l ,111-1 -i
7
-II 1 ! "L 7 1 1 1 i _i1 1 1 l X 7 _ ! ! 1 -i-1 i
II L2 - Lx VAfa






01 23 45 67
-i i ! 1 1
j j




2 pi T" ""]
1
2
l -11 l -li I
-1 r —
r




- L -1 1 l|_l 1.
llj-l-lj
|
1 11 lj-1 -1



















1 1 Ll -1
5
3
4 r-1 -y -i
1 -ll -1 ll 1
7
_ ! !
i J, 7 | | 1 -|-1 1_ 7 _ ! !i -ii-i.!
1-3 fcWi L
>h
_0 1,234567 s°_P 1, 2 3, 4 5, 6 7 ^_0 4, 2 6, 15, 3 1,
1 a
!
1 l" 1 l' |
| |














i 1 1 i i i-i -i-i -i

















1 -]| 1 -lj s -d
5 1 1-r 1 -r 5
- J il "l| l -l 5 _1_ - {_ L - Sr3 _s_j _Q_ o
6 1 r U -r
| 1 1
3 l -i -l l 3 1 -lj— 1 1 j 1 s -s






Figure 3.2 8-Point /2-Processor DIT FHT (BR Input)
90
X(k) - L 4 lx L 3 L 2 Lj x(n) (3.16)
The recursive property of the FHT keeps the first 3 stage
matrices L, , L n , and L~ of (3.13) the same, with their
patterns repeating along the main diagonal. Thus, only
the transfer stage matrix T, and the final 16-point
computation stage L, need be discussed. The transferr
~4
pattern is chosen again so that T-i is symmetric. The
symmetry property, along with the property of only one
non-zero element (whose value is unity) per row and
column, makes the transfer matrices presented herein
orthogonal. Processor .P~ once again transfers its odd-
labelled memory locations (this time, 4 elements) to P.
,
in exchange for P, T s even-numbered element . The order
of computation obviously determines output ordering.
This will be discussed further in the next section.
Matrices T, and L, are shown, as is their product, in
~1 ~4
Figure 3.3. The fourth stage compuation matrix L, of
the single-processor product L, • T 1 is equivalent to the
-~4 ~ 1
single-processor L, matrix with reordered rows. Since,
for any given row, the column entries are the same
both implementations perform the same computation,
although the ordering of their outputs is different.
Consider now the 16-point FHT implemented on a
4-processor Butterfly Network. This illustrates the
"maximum transfers" case for the FHT since more than 3
91
n
1 2 3 4 5 a 7 8
1
9 10 1112 1314 15^
1 1 1
1
3 1 -1 1
2 1 a
1
10 1 -a H





























IS 1 -d 1 b_




















K_0 1 2 4 4 5 a 7 8 9 10 1112 1314 15
1 1
8 1 -1
2 1 a a
10 1 -a -a
4 1 1
12 1 -1









9 1 -b -d
3 1 d b
11 1 d -b
5 1 b -d







NOTE : = (2)~
1/2
,b=coa<7T78) , d=ain(7T/8)
Figure 3.3 Fourth Stage of 1 6-Point /2-Processor MPFHT
92
points per processor are required. This is not really a
problem, since the multi-processor's advantage lies in
the computations that can be done in parallel, whether
prior to or after the required inter-processor transfers.
Reference 11 also contains a program called FASTPERMUTE
that speeds up the required reordering of the input
sequence, from normal to bit-reversed order. This thesis
assumes that a bit-reversed sequence is made available
as the input to the multi-processor algorithms presented.
The vector-matrix equation for the 16-point/
4-processor MPFHT is:
X(k) = L, T L; L,T, L L, x(n) (3.17)
^ ~4 ~2 ~4 ~4 I ~z ~1 ~
where part of the fourth computation stage is performed
prior to the second transfer, for reasons to be explained
below. The pre-transfer stage matrices, L, and L are
identical to the single-processor versions, and duplicate
the patterns of the 2- and 4-point FHTs as submatrices
along the main diagonal. The first transfer stage T,
occurs at the same place as, and duplicates the pattern
of, the 8-point/2-processor MPFHT. This should be






and it is this value that drives inter-processor transfer
requirements. The first post-transfer computation stage
is also identical to the 8-point/2-processor case, again
with the element pattern repeated along the main diagonal.
93
These patterns were chosen in keeping with the recursive
property of the single-processor FHT algorithm. That is,
an 8-point FHT can be made up from the outputs of two
4-point FHTs , each in turn derived from the outputs of
two 2-point FHTs. This property is responsible for the
lower-order computation matrices repeating as
submatrices along the diagonal in the early stages of
higher-order FHTs.
The fact that the FHT combines 3 elements into
one output element at the third and successive computation
stages causes problems when more than 2 processors are
used. Recall that the number of active processors, P,
sets the number of inter-processor transfer stages ( 1 o g 9 P
)
Thus, for 4 or more active processors, there are 2 or more
transfer stages. Since the transfers occur after all the
pre-transfer computations (for N=16 or a higher power
of 2) the transfers occur between stages involving 3
points in the computations. Therefore, an inherent
incompatibility exists between the transfers, involving
a number of points equal to a power of 2 (one-half of
each processor's local memory), and calculations involving
3 elements. Figure 3.4 illustrates that all elements that
are other than 1 (multiples of sin( 7r /8) and cos(?r/8)) fall
into the right-half of matrix L. (columns 8-15). The
non-unity coefficients multiply elements x (8) through
x (15), which are outputs from stager . Note in L Q
94
u


























































Superscripts represent stage number,
arguments are single-processor memory locations
Figure 3.4 Fourth Stage of 16-Point/l-Processor FHT
95
of Figure 3.5 that all these elements occur in processors
P ? and P~. Thus, one way around the problem is to
partially pre-compute the fourth stage prior to transfer.
After the inter-processor transfer, stage L, finishes the
16-point MPFHT computation, using only 2 elements in the
calculation of each output point.
Note that the partial pre-calculation (stage L'
in Figure 3.5 and Equation 3.17) leaves processors P
n
and
P, idle at that stage. This inefficiency pays dividends
in the final stage 4 computation (L,), since this stage
now only involves sums and differences of 2 elements. No
synchronization problems should be generated, since the
next stage is a transfer stage, involving handshaking.
Partial precomputaton is only necessary when 2 or more
transfers are required, and arises in part due to the
decisions made involving patterns of the first transfer
and order of computation of the first post- transfer
computation stage. The products of the third and fourth
stages of the 16-point /4-processor MPFHT are shown in
Figure 3.6. It can be verified by comparing Figure 3.6a
with Figure 3.4 that the multi-processor implementation
performs identical computations as the single-processor
algorithm. The output is reordered in the multi-processor
version, as indicated by the row reordering caused by





















































































































































10 a. a. |
•.4 1 a. -a.]





















Figure 3.5 16-Point /4-Processor MPFHT
(Stages 3 and 4)
97
i£_0 4 2 6 15 3 7 8 12 10 14 9 13 11 15
x&o) 1
x|ci) 3 | 1
X|(2) 2 1
X|(3) 10

















































£ 8 2 10 1 9 3 11 4 12 8 14j 5 13 7 15
Xl,(0) 1 1 x$(o) X (0)
X*o(l) 8 1 -1 X&l) X (8)





} ±L X^)(3) X $Q
X (1)1 1 1
"1
S(0)
xi(i)X*Kl) 9 1 1 -1 X (9)
X\(2) 3 1 1 1 X^(2) X (3)




X l X|(0) X (4)
X*2(D 12 |1 -1 X (2)








X (5)r 1 1 1 X|(0)
X3(l) 13
1




. ! 1 ! x|(2) x (7)
XJ3(3_L 15 1 -1 X|C?1 x.*2.
NOTE: L=(2) •1/2 b=co3 (7T/8) ,d=3in(7r/8)
NOTE: Subscripts represent processor addresses,
superscripts represent stage numbers, prime
denoting transfer stages, asterisks indicating
partial pre-computation stage. Argument indicates
local/single-processor memory locations.
Figure 3 . 5 16-Point/4-Processor MPFHT (Stages
3 and 4) (Continued)
98















I ! - - I
I I I
». , — 1
1
-r i-a "° i i - -°
—
i i
n | ^ ja | I "9 is
3 ! J f "° T










' ' ' • 1




J Ufl ... Id °£ O «9 - w 2 "* " -• * -" * "*
- Ill













2 1 . 1 1 -o J
3 1-3-3
r- — ; - r-~~fi ' '
3 4 ,1




















n 1 1 — i
e» "" -. 1 1 1





CI "* — 1
-- —























































| 1 a a
1 -a -a
I a 1 -a
I
-a 1 a




































































) 1 2 3
(







i i -i i i
T
1 >





















Figure 3.6b 1 6-Point /4-Processor MPFHT(Stage 3)
100
are reordered as well. This reflects the fact that the
input to the partial pre-coraputation stage (!!) comes from
the output of stage L~ . This sequence was permuted by
the first transfer in the 4-processor implementation, T, .
This is verified by comparing Figure 3.6b with Figure 3.1,
which shows identical elements in any given row, with the
rows of Figure 3.6b reordered due to transfer T. .
Figure 3.6b can also be compared with Figure 3.2. Note
that each factor in Figure 3.6b is a duplicate of the
factors in Figure 3.2, as is the final product. Thus,
the MPFHT retains the recursive property of the FHT , in
that the initial stages of higher-order FHTs are duplicates
of the outputs of lower-order FHTs, as desired.
3 . Transfer Patterns and Output Ordering
The single-processor FHT (SPFHT) accepts bit-
reversed input, and generates a normal-ordered output
sequence. The multi-processor FHT (MPFHT) implemented
on the Butterfly Network also accepts input in bit-reversed
order. The transformed output sequence, however, is
permuted by the inter-processor transfers required. The
permuted output is not the pseudo-normal (PN) ordering
characteristic of multi-processor FFTs , for reasons that
will be explained shortly. Normal-ordered output may
again be obtained by additional network transfers, or by
cyclically reading the processors' local memories.
101
Recall the earlier statement that the "maximum
transfers" case of the MPFFT won't work with the MPFHT.
That is, 4 or more points must be contained in each
processor. The situation involving the maximum number
of transfer stages in the MPFFT occurs when only 2 points
are contained in each processor. Thus, the first transfer
of the "maximum transfers" MPFFT case (L*=log„M=l) is
effectively bypassed in the MPFHT "maximum transfers"
case (L"x" = 2). With this in mind, the algorithm of
Section II. D. 2 for generating transfer matrices can be
modified to generate the MPFHT transfer patterns. The
modification to the "bit-switching" algorithm of
Section II. D. 2 is necessary because of which memory
locations are involved in the MPFHT transfers. Recall
that the MPFFT transfers involved the upper- or lower-half
of a processor's local memory. The MPFHT algorithm
transfers the even- or odd-numbered local memory
locatons , instead of adjacent upper- or lower-half
memory locations. This all relates to the properties
of counting in binary. That is, every even decimal
number has a binary representation ending in 0, while
every odd decimal number's last bit is 1 when expressed
in binary. The modified algorithm for generating MPFHT
transfer matrices is described below.
102
As in the MPFFT case, the MPFHT transfer matrices
are viewed as permutations of NxN identity matrices.
This procedure determines the column indices where l's
appear, by switching appropriate bits in the binary
representation of the row indices. The bits being
switched differ from those of the MPFFT transfer matrix
algorithm. Again, this is due to the fact that the half
of each processor's memory transferring is now based on
even- and odd-numbered local memory locations. Not
surprisingly then, one of the bits to be switched is
always the last, or Oth bit, since this bit determines




- . . . b~b , b„
,
where L=( log~N ) -1
.
) The first transfer stage swaps bit
b„ with bit b "x", where L*;$' = log^M=log
7
(N/P) . Successive
transfer stage switch b~ with bT#+ . , then with b, ^ «
,
etc. For example, in the 1 6-point /4-processor
implementation of Figure 3.5, log~N=4 and L*=2 . Thus, the
binary row indices have bits labelled b^b^b-jbp.. The
first stage transfer matrix T determines column
position of elements by switching b~ with b„ in the row
indices. The second transfer matrix T switches b n and b„.
Source and destination processors are identical to the
MPFFT case, and can be calculated using the cube,._,>.
functions to find a destination, given a source.
Additionally, at the ith transfer stage, the (i-l)st
103
bit of the processor's binary address determines whether
a processor transfers its even or odd memory locations.
If the (i-l)st bit is zero, the processor transfers its
odd-numbered memory locations, and if the bit is one,
the processor transfers even-numbered memory locations.
As in the MPFFT with "upper-half" and "lower-half"
processors, those processor's transferring their odd-
numbered memory locations receive the even-numbered
locations of another processor. This convenient
property again makes all transfer matrices symmetric
and orthogonal. See Figure 3.7 for a graphical display
of this example.
Discussion of the reordering of the output
sequence can be simplified by distinguishing between bit
postions and bit labels. Let the normal-ordered output





occupying bit locations (positions) b^b^b^bj-.
(N=16). Now the algorithm j ust. discussed generates
column indices from row indices, when applied to bit
locations. That is, each transfer stage starts anew
assuming bit locations b„b„b,b
r)
for row indices, then
switches the two appropriate bits for that transfer stage
to give column positions where l's occur. Since the
transfer stage permutes the output ordering cumulatively,




























































4 5 8 7
P
2
8 . 10 11
P
3











3 3 " I •








8 • • 1 •















12 • 1 • •
1
P
13 • i • •



















Second Transfer Stage (T2)
Figure 3 . 7 Transfer Matrices of 16-point/
4-Processor MPFHT
105
stages. For example, the 16-point /4-processor case begins
with bit labels b^b^b^b^ in bit locations b^b-b^p.. The
first transfer T, switches b and b to give bit labels
~1 o 2 &
(output sequence) and bit locations (column indices)
bob^b^bo. The second transfer T again starts with row
ordering in bit locations b„b„b-,b while the bit labels
are left partially-reordered ( bobrjb, b«) • Then, switching
b„ and b„ gives column indices ordered b
n
b 7 b,bo> while
the output sequence is b~b~b
-i b., . When the bits of the
respective binary representations are permuted according
to this rule, the correct column positions and output
sequence order result. This procedure was used when
labelling the figures in this section. The column
indices correspond to input ordering at a given stage, and
row indices give output ordering from that stage.
Therefore, the row labels of the final computation stage
give output ordering in all cases.
To summarize, the Fast Hartley Transform is
realizable on the multi-processor Buttrerfly Network.
Several pertinent features of the implementation have been
presented, but details of the features and properties of
this new algorithm have of necessity been only partially
covered. The interested reader is referred to reference 11
for a more detailed discussion of this algorithm that has
great potential for signal processing applicatons.
106
B. FAST WALSH-HADAMARD TRANSFORM (WHT)
While the DFT uses sinusoids of harmonic frequencies
as its basis functions, other transforms can be defined
using other bases. The Walsh functions, named for
J.L. Walsh who introduced them in 1923, are the basis
functions for the Walsh-Hadamard transform (WHT)
[Ref. 2: p. 301]. Walsh functions can be generated
recursively, are orthogonal, and form a closed set.
Since they are square waves, Walsh functions take on
only two values, namely +1 or -1 [Ref. 13: p. 225].
This property enables the WHT to be calculated using real
number operations, involving only additions and
subtractions. This section presents two of the several
possible ordering schemes for calculating the WHT.
After introduction of the single-processor implementation,
the WHT will be examined for compatibility with the
multi-processor Butterfly Network. Again, the emphasis
will be on inter-processor transfer requirements and
output ordering.
1 . Single-Processor Walsh Ordered Transform (WHT)w
Walsh functions can be rearranged to form
different ordering schemes such as Walsh or sequency
order, Hadamard or natural order (shown in Figure 3.8),
Paley or dyadic order, and cal-sal order [Ref. 2: p. 303],








































walw (k , t) represents the kth Walsh function
in sequency order
Figure 3.8a Walsh Functions in Walsh or Sequency Order





























walu(k,t) represents the kth Walsh function
in natural order
Figure 3.8b Walsh Functions in Hadamard or Natural Order
[Ref. 2: p. 305]
109
Walsh functions are described in terms of sequency,
which combines the number of sign changes and frequency.









where Z.C. is the average number of zero crossings per
unit interval [Ref. 2: p. 305].
Sampling of the Walsh functions at equally spaced
sample points generates the Walsh-Hadamard matrices in
corresponding ordering. Periodic sampling of the




2 3 4 5 6 7
1
S equency
1 1 1 -1 -1 -1 -1 1
2 -1 -1 -1 -1 1 1
3 _ i _ i _ ^ -1 2
[Hw(3)] =
4 — 1 — 1 — 1 — 1 1 2
5 -1 -1 1 -1 -1 3
6 -1 -1 -1 -1 1 3




where [Hw(3)] is thus the 2 x2 Walsh-Hadamard marix in
Walsh or sequency order. The rows represent Walsh
functions whose sequencies are listed. Whereas W~
r
defines the DFT matrix for N=2
,, [Hw(G)] defines the
(WHT)w matrix [Ref. 2: p. 310]. The (WHT)w is thus
defined as
:
X(k) = (l/N)[Hw(G)]x(n) (3.20)
while the input data sequence can be recovered from the
inverse transform:
x(n) = [Hw(G)]X(k) (3.21)
The same algorithm can be used, then, to generate the
transform or its inverse, except for the (1/N) factor.
One fast algorithm that results from the marix factorization
of (3.19) is shown in Figure 3.9. This factorization
2reduces computation requirments from N to Nlog^N additions
and subtractions, where N is again the length of the input
sequence. The algorithm represented in Figure 3.9 accepts
bit-reversed input, and generates a transformed output
sequence in normal order. Note also that this algorithm
can be computed in-place, thus minimizing extra memory-
required for buffering. The f actored-mat rix equation
corresponding to Figure 3.9 is shown in Figure 3.10.



















o Tt< CM CD 10 CO t>>»• v-^ >^ s_^
^-^
^-/ S-^ N«-<





















































1 11 ,1 11
1 ll ll 1 ..iL-ij-iJ-i
2 111 11 2 i -i ii -i
3 1 i 1 ! i 3 _ ij -i|_ i± -i




















1 U f-1 hi
7 i ii i -i 7 L i! ij -i! -i
*3
































































-1 1 | 1-1
j-l 1 1-1 1
1-1-1 1-1-1
(3.22)
Figure 3.10 8-Point/ 1 -Processor Fast WHTw (BR Input)
113
computation stage. The final product, W» is equivalent
to [Hw(3)] in (3.19), with the columns reordered to
correspond to the bit-reversed input . The rows of W
are in the normal order in which the output appears.
2 . Multi-Processor Walsh Ordered Transform (MPWHT)w
The Walsh or sequency ordered transform (WHT)w
can be partitioned among several processors for
implementation on the Butterfly Network. An alternate
(equivalent) signal flowgraph to enable partitioning is
shown in Figure 3.1.1. It is easily verified that the
flowgraph of Figure 3.11 represents identical computations
to those of Figure 3.9. Note that multi-processor
implementation of Figure 3.9 directly is not possible
due to the ordering of points participating in the first
butterfly stage. The flowgraph of Figure 3.11 allows
points participating in first stage butterfly computations
to be input to the same processors when implemented using
more than one processor. The input data is normal-
ordered, however, and the algorithm produces bit-reversed
output. Thus the WHT matrix W is the transpose of the
W matrix of the BR input /normal-ordered output case.
The stages perform the same computations, but in
different order. Thus the matrix factorization represented
by the signal flowgraph of Figure 3.11 represents an
alternate factorization of the WHT matrix. This
114
o ^ OJ CO 1-1 IC 00 ^
^s ^ ^w ^-/ V—s s_^ v_^




































alternate factorization is shown in Figure 3.12, and
is the basis for the multi-processor implementation of
the WHT described in the following paragraphs.
The factorization of (3.23) shown in Figure 3.12
allows the multi-processor Walsh ordered transform
(MPWHT)w to parallel the development of the MPFFT
presented in Section II. The length of the input sequence
and the number of active processor are again assumed to
be powers of 2. There are log„N computation matrix factors
each representing a butterfly computation stage.
(N=input record length.) These computation matrices are
again divided into L*x" = log 9 M pre-transfer stages and
log^P post-transfer stages, with log~P inter-processor
transfer stages. (P=number of active processors,
M=N/P=number of points in each processor.) The
"maximum transfers" case again occurs when L*=l and
each processor contains only 2 data points. For the
8-point WHT on 4 processors, the vector-matrix equation
factors as :
X(k) = W„ T W„ T, W- x(n) (3.24)
where the W matrices represent butterfly stages, the T
matrices describe transfer stages, x(n) is the input
sequence, and X(k) is the transformed output sequence.
The factorization of (3.24) is shown in Figure 3.13 and
116
*2
£?0 1 2 3 4 5 6 7



















































1-1 ki 1 1
r "+ ~i "1 1 '-l -1 1
S
1 1
-1 1 ' 6 1-1! 1-1
j_
1 l 1 , , 1
1 II 1 ! j-x i-Ti-
1
5


















K l 01 23 45 67
^ 1 . 1 — *\ 01 23 45 671— 1 , 1 —
1









1-1 1-1 1 j 1-1 1-1 1
1 1 pl-1 1-1-1 j 1 1
5 1-1 1 1-1
1
6 1.-1 1 1-1 1-1 1 1-1 1
1
*" T "•
1 j 1 1 1 1 1- 1
1 T ""*





1 1 1 i-i-i
5
3
1-1! -1 ll - 1 1! 1-1









Figure 3.12 8-Point/l -Processor Fast WHTw (Normal Input)
117




























































































































































1 1 1 1 1











































1 1 I 1 1
VT


















1 2 3 4 5 6 7




-l'-l 11 1-1 ! -l 1
I










Figure 3.13 8-Point /4-Processor Fast WHTw (Normal Input)
118
represents the multi-processor implementation of (3.23)
from Figure 3.12. Note the effect of the multi-processor
implementation on the ordering of the output sequence.
The single-processor WHTw produces bit-reversed output
when the input is normal-ordered (Figure 3.12). The
transfers in the multi-processor algorithm reorder the
output to pseudo-bit-reversed (PBR) order. Note also
that the transfer matrices are the same as the 8-point/
4-processor MPFFT with BR input (Figure 2.4). This
convenient property means that the algorithm of
Section II. D. 4, which transforms normal to pseudo-normal
ordering, may be used to permute BR (single-processor
output) into PBR ordering. A slight modification must be
made to account for BR order as the starting point,
rather than normal ordering. Whereas the bit positions
are labelled b 9 b,b n for the 8-point normal-to-PN





to correspond to BR-to-PBR reordering.
This is the same idea as shown in Table 1 , and results
in the appropraite reordering when the "bit-switching"
algorithm is applied.
Just as with the MPFFT, the MPWHTw can also
be implemented using 2 processors. In this case the
equation is:
X(k) = W„ T, W„ W- x(n) (3.26)
"s* ~*>o <^»JL *^Z >vl <**>
119
and, as in the MPFFT, there are 2 pre-transfer butterfly
stages, a transfer stage, and 1 post-transfer butterfly
stage. Transfer matrix T-, is again identical to T, of
the 8-point /2-processor DIT MPFFT. The output is again
in PBR order, though different from the 8-point /4-processor
PBR order. Since there is only one transfer in the
2-processor case, its output is less-reordered from the
single-processor BR ordering. This is analogous to the
PN ordering of the MPFFT, which was shown to also be a
function of the number of transfers required.
The MPWHTw may also be implemented using PBR input,
to produce normal-ordered output. Using the signal
flowgraph of Figure 3.11, now with PBR-ordered input,
the 8-point /4-processor MPWHTw matrix factors as:
X(k) = W~ T ' W 9 T, ! W, x(n) (3.27)
where T ,'=T of the normal input algorithm and T '=T,
of (3.24) and Figure 3.13. TheWHTw matrix W is the
transpose of the product W from Figure 3.13. The
individual stage factors are different, however. They
represent the same computations at each stage, but in a
different order, since the input ordering is changed.
These ideas are illustrated in Figure 3.14. The
situation is slightly more complicated when only 2
processors, requiring only 1 transfer stage, are used
120
-V fl Il'-Il
\gO 1 4 5 2 3 6 7
JJ.































































4 6 13 5 7_
in!

































































1 1 1 i-i-ii
1
2
1 1 1-1-1 I 1 1 1-1-1
r -j r








1 1 [_ 1 i'-i-iJ-i-i
1-1
j
1-1 i-i 1 ]-i 1
5
. J 1 ~ l L- 1-1 I 5 l-ij-i 1 j-i ij_ i-i
3 1-lj jl-lj 6 1-1 j-i 1 1 1-1 J-i 1
7 ! 1 -ii | 1 -j. 7 _i-i| 1-1 1 1-1 1 i-i_
(3.28)
Figure 3.14 8-Point /4-Processor Fast WHTw (PBR Input)
121
with a PBR-ordered input sequence. The vector-matrix
equation in this case is:
X(k) = W W T. W. x(n)
~w xj
~~Z ~1 ~I ~ (3.29)
where the output is again produced in normal order. The
transfer matrix T, is identical to T-. of the 8-point/
2-processor MPWHTw with normal-ordered input , but occurs
after only one pre-transfer butterfly stage. The
product of (3.29) is again the transpose of the W matrix
of the normal input case, and the same calculations are
performed at each stage, again in different order. Thus,
for PBR input, there are log~P butterfly stages with
log^P transfers between each, then log 9 M post-transfer
butterfly stages. The Walsh ordered transform is easily
implementable on the multi-processor Butterfly Network,
and exhibits many of the same properties as the multi-
processor FFT. It can be used with normal-ordered input,
to yield pseudo-bit-reversed output, or vice versa.
3 . Single-Processor Hadamard Ordered Transform (WHT)h
Walsh functions can be rearranged to form
Hadamard or natural ordering. The periodic sampling of
these Hadamard-ordered Walsh functions yields Hadamard
matrices, which can be generated recursively [Ref. 2: p. 314,
13: p. 226] as follows:
122
[H
h (0)] = [1]
[Hh (k+1)]
















N. The inverse (WHT)h is defined as:
x(n) = [H (G)]X(k)
<>» n /-w
(3.31b)
The Hadamard ordered transform is also referred to as the
binary Fourier representation (BIFORE) transform. Fast
algorithms can also be developed for the (WHT)h based on
matrix factorizations. The Hadamard matrix for an 8-point





















This matrix can be factored into 3(log„N) matrix factors,
each representing a computation stage of a fast WHTh
algorithm. One particular factorization, using normal
input and producing normal output, is shown in Figure 3.15.
4 . Multi-processor Hadamard Ordered Transform (MPWHT)w
Another method of factoring (3.32) is shown in
Figure 3.16. Again, this factorization is not unique,
but allows for efficient implementation of (3.32). The
corresponding single-processor signal flowgraph is shown in
Figure 3.17. This factorization accepts bit-reversed input,
and generates bit-reversed output. This method is
compatible with a multi-processor implementation, in
that points participating in the first butterfly stage
are adjacent in the input sequence. It is easy to verify
that W from Figure 3.16 is equivalent to (3.32), with
rows and columns reordered to reflect the changes in
output and input ordering, respectively.
The algorithm of (3.34) and Figure 3.17 can be
partitioned for implementation on the multi-processor
Butterfly Network. For an 8-point WHTh implemented using
4 processors, the vector-matrix equation is:
X(k) = W, T W 9 T- W, x(n) (3.35)
where the input sequence, x(n), is in BR ordering. The
output is permuted into PBR order by the necessary
124
ff«
























i ; i i i













kf_0 1 2 3 4 5 6 7 tf
1
1 23 45 67
h" ' i ' —
i
l ll | l ll 1 11 1 1 1 1 1 1 1 1
1
„j1-_lJ__iL_j 1 i-i i i-i ! i-ii i-i
i lj-i-i | i i j-i-i2 i |-i i i |-i 2




i 1 1 |-i i-i 4
t —
-T ~
1 1| 1 1 j-1-1 1-1-1
5
.__i4__l«..-ij._zi. 5 i-i! i-i i-i i_-i i
5 i i-i pi 1
1
6 i ii -i-i i-i-i i i i
i i
_
ij -i] -ij ij 7 i-i i -i i-i i| i-i
(3.33)
Figure 3.15 8-Point /l -Processor Fast WHTh (Normal Input)
125
*2
tf_0 4 2 6 15 3 7
I I



















































































































Figure 3.16 8-Point /1-Processor Fast WHTh (BR Input)
126
o Tf <N CD iH 10 CO N
^-/ ^-y s_^ s_^ \s ^-y s_x v^



























inter-processor transfers. The matrix product of (3.35)
is carried out in Figure 3.18. Note that the transfer
matrices T-, and T are again identical to the 8-point/
4-processor DIT MPFFT with BR input. The product ^W is
equivalent to the W matrix of the WHTh single-processor
implementation with rows reordered, reflecting the
reordering of the output.
An 8-point WHTh implemented on 2 processors can
be expressed as:
X(k) = W„ T. W w\ x(n) (3.36)
where T, is the same as T, of the Walsh-ordered MPWHT
,
*~ l ~i
again identical to T, of the 8-point /2-processor DIT
MPFFT with BR input. In this case also, the output is in
PBR order, as would be expected. Thus, the algorithms
from Section II for transfer matrix generation and output
reordering may be used in this case also.
Another alternative factorization of the
Hadamard transform is to use pseudo-normal input data.
This option is desirable, since the output is produced
in normal order, as the Walsh ordered transform produces
normal output when the input sequence is in PBR order.
The 8-point/4-processor MPWHTw using PN input is
expressed as
:
X(k) = W~ T ' W T, f W, x(n) - (3.37)
128
II Wi Il-Wi









1 1 1 1
1 i

























6 1 ll 6 ii -ii 6
1 1 -11 '
1
-1 r t









3, 1 1 _____











































l 04 2615 37
































2 4 6 1
l"j 1 1-j

































i i-i !-i i
T2 -W2-I1
2,46 1




























2 1 1 i-i-i 1 1
3
._ 1 L_L_L_.




























Figure 3.18 8-Point /4-Processor Fast WHTh (BR Input)
129
and the product is carried out in Figure 3.19. Similar
to the Walsh ordered transform when PBR input is used,
the transfers of (3.37) occur in reverse order from
the BR input case of (3.35). That is, T ' of (3.37) is
identical to T of (3.35), as areT ' and T, . When fewer
processors are used, the transfer matrices are also the
same, and also occur in reverse order. This has the
effect of reversing the relationship between pre- and
post-transfer butterfly stages. Now there are log
?
P
butterfly and transfer stages, followed by log^M
post-transfer butterfly stages with no transfers between
the post-transfer butterflies.
In this section all normalizing factors of (1/N)
r
for the forward transforms have been omitted to allow
concentration on the transfer patterns and reordering
of output sequences. They are easily included at the
end, if necessary.
The ideas presented in this section are easily
extended to larger data lengths and more processors as
follows. For input record length N, there are always
log„N total butterfly stages. For P active
processors, there are log„P butterfly stages with inter-
processor transfer stages between each. These occur after
L'K' = log 9 M pre-transfer butterfly stages (BR input), or





















































































































































n I 1 1
a I
5





























































































































-i 1 1 i-i
(3.38)
Figure 3.19 8-Point /4-Processor Fast WHTh (PN Input)
131
patterns and output reordering for the Walsh transform
(Walsh or Hadaraard order) parallel those of the multi-
processor FFT. Thus, algorithms for FFT transfer
matrix generation are applicable to the corresponding
WHT transfer patterns. Reordering of output sequences
from the single-processor WHT are likewise obtainable
from the corresponding MPFT algorithm.
Finally, the Walsh transform is very compatible
with the multi-processor Butterfly Network. Both Walsh
and Hadamard orderings have many of the same properties as
the MPFFT when implemented on the BN . Thus, ideas from
Section II are also applicable to the WHT.
132
IV. COMBINED ALGORITHM FOR TRANSFER MATRIX
GENERATION AND MULTI-PROCESSOR OUTPUT
ORDERING
The algorithms presented in Section II for generating
transfer matrices and output reordering can be combined into
a single algorithm. To do this, a distinction must be
drawn between column labels and output labels. Recall
the algorithm of Section II. D. 2 for generating transfer
matrices as permutations of identity matrices. In an identity-
matrix, l f s occur only in those positions where the row and
column indices are the same. For the chosen transfer
patterns on the Butterfly Network, processors transfer
one-half of their data at any given transfer stage. Thus,
half the elements move off the main diagonal, and half
remain. The algorithm for generating transfer matrices
gives the column indices where l's appear, as a permutation
of the binary representation of the (normal-ordered) row
indices. For half the row indices, this involves switching
a 1 and a 0, and the corresponding element moves off the
main diagonal (representing transfer of a data point). For
the other half of the row indices, the algorithm switches a
1 and a 1 , or a and a 0, and the corresponding elements'
row and column indices remain equal. These elements remain
on the main diagonal, and represent data points not
133
transferring at that stage. The transfer raarices are
independent, in the sense that each successive transfer
matrix is generated by permuting the normal-ordered row
indices to obtain the columns where 1's appear at that
stage
.
The effect of the transfers on the output ordering is
cumulative, however. Recall that the multi-processor fast
transform output can be viewed as a reordering of the
corresponding single-processor output. For example, a
single-processor DIF FFT with bit-reversed input produces
normal-ordered output. Section II. C showed that the DIT
multi-processor FFT output is produced in pseudo-normal
order. Recall also that pseudo-normal order is a function
of the number of active processors, P, hence the number of
transfer stages. That is, the pseudo-normal ordering for
a 16-point /8-processor DIT FFT is different from that of
a 16-point /4-processor DIT FFT. In the first case,
3(log 9 8) transfers are necessary, while the 4-processor
case only requires 2 transfer stages. Thus, the pseudo-
normal ordering of the 8-processor case is permuted
"farther" in some sense from the single-processor
normal-ordered output.
With the above background in mind, the algorithm for
generating transfer matrices is extended to also give
output reordering as follows. A distinction is drawn
between column indices and output ordering indices.
134
These are both represented as binary numbers, and then
permuted according to the algorithm of Section . II . D . 2
.
Since the column indices are generated from the normal-
ordered row indices of each transfer stage, normal-order
is the starting point for generating each transfer matrix.
The output ordering begins in the appropriate single-
processor output ordering for the fast transform being
computed, then is cumulatively permuted by successive
transfers
.
Several examples to illustrate the combined algorithm
are shown in Table 2. This table illustrates the transfer
matrix/output reordering of the DIT multi-processor FFT.
Recall that the single-processor DIT FFT with bit-reversed
input produces normal-ordered output. Likewise, for
normal-ordered input, the single-processor DIT FFT generates
output in bit-reversed order. These orderings are the
starting points for the output reorderings shown in
Table 2. Note that the bit-switching algorithm is applied
literally to the output indices' bit labels, regardless of
what position the bit labels occupy at a given stage. Also
note that due to the cumulative effect of transfers oh the
output ordering, the first transfer of any example is the
only time the column indices and the output indices are the
same. It is also obvious from Table 2 that pseudo-normal
and pseudo-bit-reversed ordering depend on the number of
transfers necessary. The first transfer stage of the
135
TABLE 2: COMBINED ALGORITHM FOR GENERATING TRANSFER MATRICES
AND OUTPUT ORDERING (DIT MPFFT)
16 Points/8 Processors :
Transfer Matr ix
Row Indices Column Ie.dices Output Ordering
b b b b
3 2 10 Normal
:
b3 b2 bl b BR: b b1 b2 b3
Tl (b " b i) b3 b2 b bl Tl (b " • b i)b3 b2 b b1 T3'(b - b 1)b 1 b b2b3
T 2 (b " b2) b3b b 1 b2 T2 (b -• b 2)b3bob2b 1 T2(b - b 2) b1 b2 b b3
T3 (bo - b3) bo b2blb3 T3 (b - b 3) b b3b2b 1
PN: b b3 b2 b1
Tl'(bo - b 3) bib2 b3 b
PBR: blb2b3b
16 Points/4 Processors :
Transfer Matr ix
Row Indices Column In dices Output Ordering
b3 b2 bl b Normal : b3b2b 1 b BR: b b 1 b2 b3
Ti (bi - bs) b3b l b2b Tx (bi -- b2 ) b3b 1 b2b T2 (b x - b 2) b b2b 1 b3
T2 (bi - b3) bib2 b3bo T2 (bi -- b 3) b 1b 3b 2b
PN: b^b^o










Row Indices Column Indices Output Ordering
b3 b2blbo Normal : b3b2b 1 b BR: b b 1 b2b3
Tl (b2 " b3) b2b3b lb Tl (b 2 -- b3 ) b2b3b 1 b
PN: b2 b3 b1 b
Tl'(b2 - b 3) bQb 1b3b2
PBR: bob-jb3 b2
8 Points/4 Processors :
Transfer Matrix
Row Indices Column Indices OutDUt Ordering
b2 b1 b Normal
:
b2b l b BR: bQbib2
Tl (b - b x) b2b b l Tl (b - b-j) b2b b 1 T2 (b " b l) b 1 b b2
T 2 (b " b2 ) b b l b 2 T2 (b - b2 ) bQ b2 b1
PN: bQ b2 b1
Ti'Oo - b 2) V2b
PBR: bib2 bo
8 Points/2 Processors :
Transfer Matrix
Row Indices Column Itidices Output Ordering
bob,b
rt2 10 Normal b2b 1 b BR: bQ b1 b2







Tl(b 2 - b 2) b b 2b;l
PBR: bobob-i
136
16-point /8-processor example is not required when using only
4 or 2 processors, and is effectively bypassed in the latter
two cases. It can also be verified, by comparing Table 2
with Table 1, that the output ordering produced is the same
when using the combined algorithm.
Care must be taken in applying this algorithm, depending
on the type of fast transform being computed. The algorithm
from Section II. D . 2 and generalized here, was developed
for DIT FFT transfer patterns. It can be used, with slight
modifications to the sequence of bit-switching, for all
other fast transforms presented herein. The modifications




Matrix factorization is a useful method of representing
distributed fast transforms. While not suggested as a
method of actually computing the transforms , this
representation provides insight into which points combine
during computation stages, and into the transfer patterns of
the Butterfly Network. Additional insight is gained in
terms of the output reordering caused by the inter-processor
data transfers
.
Matrix partitioning and factorization have been widely
used to describe single-processor fast transform
algorithms. This method was shown to also be useful in
describing distributed fast transforms. Additional matrix
factors are required to represent the necessary inter-
processor transfer stages. These "transfer matrices" may
be generated by permutations of identity matrices for
selected transfer patterns on the Butterfly Network. The
transfer structure was shown to determine the ordering of
the transformed output sequence. The output ordering can
be described in terms of a permutaton of the single-processor
output produced by a given fast transform algorithm.
The Butterfly Network also supports the Fast Hartley
Transform, and the Walsh-Hadamard Transform. Transfer
138
patterns similar to those required for the multi-processor
FFT can also be used for these other fast transforms. The
output orderings produced are unique, but easily decipherable.
Many equivalent signal flowgraphs can be constructed to
represent any given fast transform algorithm. This
flexibility also permits many equivalent matrix factorizations
to be developed. Constraints include the need to minimize
additional memory requirements for buffering, or computing
"in-place" as much as possible, and the desire to produce
an orderly and decipherable output sequence. Additionally,
for distributed algorithms, an orderly pattern of inter-
processor data transfers is also desirable. The Butterfly
Network was shown to meet these constraints for the
computation of distributed FFTs , and to conveniently
support several other fast transform algorithms.
Further work would be desirable on developing a general
method for finding the matrix factors of the distributed form
of a given fast transform algorithm. An optimal implementation
of the FHT (possibly the one given in Section III. A. 2) on
the Butterfly Network is also worthy of further research,
as the FHT appears -to have a quite promising future.
139
LIST OF REFERENCES
1. Brigham, E.O., The Fast Fourier Transform , Prentice-Hall,
Inc., 1974.
2. Elliott, D.F., Rao, K.R., Fast Transforms: Algorithms,
Academic Press, 1982.
3. Cooley, J.W., Tukey , J.W., "An Algorithm for the Machine
Calcuation of Complex Fourier Series", Math Comp . 19
,
pp. 297-301, April 1965.




5. G-AE Subcommittee on Measurement Concepts, "What is
the Fast Fourier Transform?", IEEE Trans. Audio and
Electroacoustics
,
Vol. AU-15, pp. 45-55, June 1967.
Also Proc. IEEE
,
Vol. 55, pp. 1664-1674, October 1967.
6. Filip, E., Arthur, J.S., Drinan, J.D., Huntoon, A.H.,
Kirk, D.E. and Verily, J.G., Technical Report TR-637,




7 . Kirk, D.E., and Verly, J.G., An Algorithm for
Distributed Computation of FFT ' s , to appear in the
Journal of Compuers and Electrical Engineering.
8. Siegel, H.J., and Smith, S.D., Study of Multistage
Interconnection Networks
,
1978 Symposium on Computer
Architecture, pp. 223-229, April 1978.
9. Verly, J.G., and Kirk, D.E., A Distributed Implementation
of Fast Fourier Transforms , 18th Asilomar Conference
on Circuits, Systems, and Computers, November 1984.
10. Kirk, D.E., MIT Lincoln Laboratory Internal Memorandum
,
July 1981.
11. Bracewell, R.N., The Hartley Transform , Oxford University
Press, 1986.
12. Hou, H.S., "The Fast Hartley Transform Algorithm", IEEE
Trans, on Computers
,
Vol. C-36, No. 2, pp. 147-156,
February 1987.
13. Ahmed, N., Rao, K.R., Abdussattar, A .L . , "BIFORE or
Hadamard Transform", IEEE Transactions on Audio and
Electroacoustics
,




Ahmed, N., Cheng, S.M., "On Matrix Partitioning and a Class
of Algorithms", IEEE Trans, on Education
,
Vol. E-13,
pp. 103-105, August 1970.
Andrews, H.C., Caspari , K.L., "A Generalized Technique for
Spectral Analysis", IEEE Trans, on Computers
,
Vol. C-19,
pp. 16-25, January 1970.
Bergland, G.D., Wilson, D.E., "A Fast Fourier Transform
Algorithm for a Global, Highly Parallel Processor", IEEE
Trans, on Audio and Electroacoustics
,
Vol. AU-17, No. 2,
pp. 125-127, June 1969.
Cooley, J.W., Lewis, P.A.W., Welch, P.D., "Historical Notes
on the Fast Fourier Transform" , IEEE Trans, on Audio and
Electroacoustics
,
Vol, AU-15, No. 2, pp. 76-79, June 1967.
Filip, A., Kelly, G.L., Kirk, D.E., Distributed FFT
Algorithms with Data Transfers Overlapping Computation
,
16th Asilomar Conference on Circuits, Systems, and Computers,
November 1982.
Fishburn, J. P., Analysis of Speedup in Distributed
Algorithms
,
UMI Research Press, 1984.
Gold, B., Bially, T., "Parallelism in Fast Fourier Transform
Hardware", IEEE Trans. Audio Electroacoustics
,
Vol. AU-21,
No. 1, pp. 357-368, February 1973.
Strum, R.D. and Kirk, D.E., First Principles of Discrete




Larsen, R.D., Madych, W.R., "Walsh-Like Expansions and
Hadamard Matrices", IEEE Trans. Acoustics, Speech, and
Signal Processing
,
Vol. ASS0-24, No. 1, pp. 71-75,
February 1976.
Parker, D.S., Jr., "Notes on Shuffle/Exchange-Type Switching
Networks", IEEE Trans, on Computers
,
Vol. C-29, pp. 213-222,
March 1980.
Pease, M.C., "An Adaptation of the Fast Fourier Transform
for Parallel Processing", J. A. CM., Vol. 15, No. 2,
pp. 252-264, April 1968.
Theilheimer, F., "A Matrix Version of the Fast Fourier
Transform", IEEE Trans, on Audio and Electroacoustics
,




1 . Defense Technical Information Center 2
Cameron Station
Alexandria, Virginia 22304-6145
2. Library, Code 0142 2
Naval Postgraduate School
Monterey, California 93943-5002
3. Professor D.E. Kirk, Code 62Ki 3
Naval Postgraduate School
Monterey, California 93943-5000





5. Professor H. H. Loomis , Code 62Lm 1
Naval Postgraduate School
Monterey, California 93943-5000
6. LT R. L. Bainbridge, USN 1
Attack Squadron FORTY-TWO
Naval Air Station, Oceana
Virginia Beach, Virginia 23460-5210




8. Dr. H. S. Hou 1
Electronics and Optics Division
The Aerospace Corporation
El Segundo, California 90245
9. Associate Professor C. W. Therrien, Code 62Ti 1
Naval Postgraduate School
Monterey, California 93943-5000
10. Chairman, Code 62 1







DUDLEY KNOX r.ron />p.v










c .l Factored-matrix repre-
sentation of distributed
fast transforms.

