Parallel FFT algorithms and architectures by Pan, Xiaofeng
Lehigh University
Lehigh Preserve
Theses and Dissertations
1993
Parallel FFT algorithms and architectures
Xiaofeng Pan
Lehigh University
Follow this and additional works at: https://preserve.lehigh.edu/etd
Part of the Computer Sciences Commons, and the Electrical and Computer Engineering
Commons
This Thesis is brought to you for free and open access by Lehigh Preserve. It has been accepted for inclusion in Theses and Dissertations by an
authorized administrator of Lehigh Preserve. For more information, please contact preserve@lehigh.edu.
Recommended Citation
Pan, Xiaofeng, "Parallel FFT algorithms and architectures" (1993). Theses and Dissertations. 238.
https://preserve.lehigh.edu/etd/238
u
an, iaofen
TI
arall I Ig rit s
and rchitectures
DA· "E: January 16, 1994
,.5 . J.
'c .
. '., ,0 '-. ,~,~~~ ,-,10'-' .
Parallel FFT A1g.ori!hms and Architectures
by
Xiaofeng Pan
A Thesis
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Master of Science
10
Computer Engineering
Department of Electrical Engineering and Computer Science
Lehigh University
. December 10, 1993

Acknowledgements
I would like to thank my advisor, Dr. Barada, for suggesting the thesis topic
and giving me valuable insight and assistance in preparing for this work. Throughout
the course of my graduate studies, Dr. Barada has been a constant source of support
and his kindness is fully appreciated, especially in times of difficulty.
I am deeply indebted to my parents for their sacrifice in order for me to
receive higher education. The support and encouragement from my family has been
indispensable.
Finally, I want to thank my wife. Without her- love and friendship, all of this
would have been impossible.
111
Abstract
.
Table of Contents
••••••••••••• 0" ••••••••••••••••••••••••••••••
1. Introduction 2
1. 1 Historical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 2
1.2 Fourier Series and Fourier Transfonn 3
1.3 Discrete FT and Fast FT 4
1.4 Parallel FFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 6
2. Systolic Arrays and VLSI Implementations. . . . . . . . . . . . . . . . . . . . .. 8
2.1 Systolic Arrays 8
2.2 Systolic Arrays Configurations . . . . . . . . . . . .. . . . . . . . . .. 10
2.2.1 Meshes 10
2.2.2 Perfect Shuffle . . . . . . . . . . . . . . . . . . . . . . . . . .. 11
2.2.3 Cube Connected Cycle . . . . . . . .. 12
2.3 VLSI Complexity . . . . . . . . . . .. ' . . . . . . . . . . . . . . . . . .. 14
3. One-Dimensional FFT 16
3. 1 The Pise Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 16-
3.2 Time-Area Analysis of 1-D Designs. . . . . . . . . . . . . . . . . . .. 18
3.3 Flow of Vectors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 19
4. Multi-Dimensional FFT 22
4.1 DefInitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 22
4.2 Row-Column Method 23
4.3 Matrix-Vector Multiplication . . . . . . . . . . . . . . . . . . . . . . .. 25
4.4 An Optimal Hybrid Design . . . . . . . . . . . . . . . . . . . . . . . .. 26
4.5 Mapping To Higher-Dimensional DFf . . . . . . . . . . . . . . . . .. 28
5. Other Algorithms . . . . . . . . . . . . . . . . . .. 30
5.1 Prime Factor Algorithms 30
5.2 The Number Theoretic Algorithms . . . . . . . . . . . . . . . . . . . .. 31
5.3 VLSI Implementation of Small N Algorithms . . . . . . . . . . . . .. 33
5.4 The Goertzel Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . .. 38
5.5 The Chirp Transform Algorithm . . . . . . . . . . . . . . . . . . . . .. 41
6. Other Architectures _ . .. 43
6.1 TIliac IV : The First SIMD Supercomputer. . . . . . . . . . . . . . .. 43
6.2 Parallel Associative Processors . . . . . . . . . . . . . . . . . . . . . .. 44
6.3 Hypercube 45
IV
7. Fault Tolerant FFf
8, Conclusions ,.."..,...".,..,......,...,....,..,..
References .'.....,..,..,...,...,.........,..,...,'..
Vita " .... ,."., ... , .. ",., .. " ... ,." .. , .. "" .. , ..
.~
\.
v
51
54
57
62
List of Figures
Figure 1. A Systolic Array 9
Figure 2. Chain and Cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 10
Figure 3. Perfect Shuffle ' . . . . . . . . . . . . . . . . . . . . . . . . . .. . 12
Figure 4. Cube Connect Cycle 13
Figure 5. A Cell of a Systolic Array for I-D DFT . . . . . . . . . . . . . . . .. 17
Figure 6. Structure of a I-D DFT Systolic Array . . . . . . . . . . . . . . . . .. 17
Figure 7. FFT Systolic Array for Flow-of-Vectors ;............ 20
Figure 8. Systolic Array for Matrix Transposition 24
Figure 9. A Mesh Connected Systolic Array for 2-D DFT 24
Figure 10. Matrix-Vector Multiplication 25
Figure 11. The Rotation Process for a 3-D Cube 27
Figure 12. A Hybrid Design with a Rotator and 1-D DFT Array 27
Figure 13. Summation Unit '.' . . . . . . . . . .. 36
Figure 14. Modified Summation Unit 37
Figure 15. Multiplication Unit '. . . . . . . . . . . . . . . . .. 37
Figure 16. Arithmetic Unit 38-
Figure 17. Flow Graph of a First-Order Recursion . . . . . . . . . . . . . . . . .. 40
Figure 18. Goertzel Algorithm 40
VI
Abstract
Fast Fourier transfonn is an important tool in many fields of engineering and
applied mathematics. Much effort has been spent on constructing FFf algorithms
with higher computation efficiency. Although many variations of the original Cooley-
Tukey algorithm have been devised to take advantage of characteristics of specific
problem domains, serial algorithms have a lower bound on the computation
requirement of O(N log N). The exception to this is the class of algorithms that are
.
based on number theoretic decomposition of the circular convolution. It has been
shown that the number of multiplications can be reduced to O( N ). To boost
computation speed significantly, one needs to parallelize the algorithms, and in
general the computation time is less than the serial version by a factor of N.
There are many implementations of the various parallel FFf algorithms.
Among all the medium of implementation, VLSI is often the choice due to the
significant computation speed boost afforded at reasonable cost. The systolic principle
has been applied in a number of VLSI systems that implement one-dimensional and
multi-dimensional parallel FFf's. Implementations can also be found in other parallel
systems. Among them, the architectures that are versatile and easy to embed other
structures are often chosen. Hypercube is one such example.
1
1. Introduction
1.1 Historical Background'
The ubiquitous mathematical tool in signal processing, Fourier Transform, is
named after the French mathematician and physicist Jean Baptiste Joseph Fourier
(1768-1830). Fourier's masterpiece, Theorie Analytique de fa Chaleur [13], was one
of the most important mathematical books published in the 19th century. It not only
set forth his theory of heat conduction, but marked an epoch in the history of pure
and applied mathematics. Fourier developed the theory of the series known by his
name and applied it to the solution of boundary-value problems in partial differential
equations. This work brought to a conclusion a long controversy, and thereafter it
was generally agreed that almost any function of a real variable can be represented by
a series involving sines and cosines of integral multiples of the variable.
Emerging from efforts to analyze empirical data in physics, Fourier series
have had profound effects on the development of such fields of mathematics as
mtegral and differential equations, and analytic functions. In modem .engineering
disciplines such as automatic control, communication systems, and signal processing,
the theory of Fourier has become the basis of frequency-domain analysis and the
foundation of such computational methods as FFf.
Fourier's work was not received without criticism. The main weakness in his
analysis was his failure to provide a. general proof that the Fourier series actually
2
converges to the value of the function concerned. It was not until nearly a century
later that the difficulty was resolved in the development of Lebesgue integral [29].
1.2 Fourier Series and Fourier Transform
Consider a .periodic functionf{t) with a period T. This function can be
expressed as a sum of sinusoids of frequencies f (f = liT) and all its integral
multiples [30]:
f( t) :;:;ao+L ancos (nw t) +bnsin (nw t)
. n=l
if the functionf{t) satisfies the following DiriGhlet conditions:
1. I If(t) I dt is finite;
w:;:;21tf (1)
2. f(t) may have at m<?st a finite number of maxima and minima over any finite
time period;
3. f(t) may have at most a flnite number of discontinuities over any finite time
period. Furthennore, each of these discontinuities must be finite.
All p~ctical signals satisfy these requirements.
The series in equation (1) is called the trigonometric Fourier series. Because a
sinusoidal signal of angular frequency nw can be expressed in tenns of the exponential
signals t!"" and e-Jnu, a periodic signal j(l) with period r can also be represented by a
'l>
series consisting of exponential components:
3
f( t) =L Fnejnwt. w=2rtf=2rt/T
n=-oo
which is called the exponential Fourier series. The coefficients Fn are generally
complex and are given by
F =J:f f( t) e-jnwt dt
n T T
While periodic signals can be represented by Fourier series, almost all
(2 )
(3 )
practical signals are non-periodical. For non-periodic signals, an integral is used in
place of the summation in equation (2). The resulting representation is given by
f( t) =~fF(jW)e jwt dt
2rt
-00
(4)
The "coefficient" F(jw) is a function of angular frequency w, and is in a form similar
to equation (3):
Equation (5) is known as the Fourier· transform of the functionj(t).
1.3 Discrete Ff and Fast Ff
(5)
Since the 1940's, the growing need for more sophisticated signal processing
created great interest in the study of discrete-time signals. Coupled with the
increasing availability of digital computers since the 1950's, tremendous efforts were
4
spent on the research of discrete versions of Fourier transfonn (DFf) that can be
implemented on a digital comlmter. The discrete Fourier transfonn is fonnulated as a
convolution sum of a flnite length input sequence.
Consider a sequence x[n] of length N samples such that x[n] = 0 outside the
range 0 ~ n ~ N -1.
The DFf of this sequence is given by
N-l
X[k] =E x[n] w~
n=O
W =e -j (2lt/N)
N
(6)
Notice that the computation complexity of equation (6) is O(n2).
In 1965, Cooley and Tukey [10] published their classic paper in which they
described an efficient algorithm known as the fast Fourier transfonn (FFT). It
decomposes the input sequence into a series of short DFf's. In the special case
where the length of the input sequence is a power of 2, we have the familiar radix-2
decimation in time FFf. The time complexity of this class of algorithms is o(nlogn).
This is a significant saving in computation requirement, especially when n becomes
large.
With the constant need for faster algorithms that can be used in real-time
application, numerous other FFf algorithms have been devised. The prime factor
algorithm, the Winograd Fourier transform algorithm, the chirp transfonn algorithm,
and the Goertzel algorithm are just a few of them [37]. Most of these methods take
advantage of known characteristics of a particular type of signal, and can perfonn
better than the Cooley-Tukey in special cases.
5
1.4 Parallel FFf
To further speed up the computation of Fourier transfonn significantly, the
only practical solution is to parallelize the algorithms. A large number of such
parallel FFT algorithms exist. Most of the literature has been devoted to the systolic
array and VLSI implementations of parallel FFT. This paper will attempt to discuss
some of the representative algorithms with VLSI/systolic implementations as well as
other implementation methods, and issues related to the theme of parallel FFT.
Chapter 2 contains a general discussion of systolic arrays, configurations
commonly used for its processing element interconnections, and a brief discussion of
VLSI complexity with respect to systolic arrays and systolic implementations of FFT.
!
Chapter 3 focuses on one-dimensional FFT algorithms, and in the case of
VLSI implementations, the relevant area-time complexities are presented and, where
appropriate, comparisons are made. Chapter 4 extends the discussion to multi- '
dimensional FFT's."
Aspects of small N algorithms and number theoretic algorithms are explored in
chapter 5, and the hypercube and other architectures are discussed in chapter 6.
. .
Chapter 7 touches upon the issue of fault-tolerance in FFT implementations.
Since there are a vast number of parallel FFT algorithms and their
implementations, to repeat or even mention all of them in this paper would be both
impractical and undesirable. The approach taken in this paper is to provide a general
discussion fIrst and then to include a few examples to illustrate the salient points.
6
Brief references to variations of the algorithms and implementations are included
wherever possible and appropriate.
7
2. Systolic Arrays and VLSI Implementations
2.1 Systolic Arrays
In 1978, H.T. Kung and C.B. Leiserson published a paper [27] in which they
described a VLSI processor array, called systolic array, as "... a network of
processors which rhythmically compute and pass data. The analogy is to the rhythmic
contraction of the heart which pulses blood through the circulatory system of the
body. Each processor in a systolic network can be thought of as a heart 'that pumps
multiple streams through itself. The regular beating of these parallel processors keep
up constant flow of data throughout the entire network."
More formally we can defme a systolic array as follows,
Definition 1. A systolic array is a network of cells connected by channels.
The number of neighbors for each cell is independent of the problem size and is
bounded. An interior cell is one whose neighbors· are either boundary cells or other
interior cells. Each interior cell runs its own instructions, exchanges data and control
information with its neighbors. The boundary cells are those connected with the
array's system memory and feed data to/from t.he interior cells without
transformation. The execution of all the cells is synchronous.
The systolic processing principle mUes it possible to solve many standard
problems on multicellular devices with autonomous cells. The host only needs to
interact with the systolic array to initiate a computation, to collect results, and to
8
j I' , I,':' ;1 " l ~. -II' ,;
•
~
L
a
f
u;;
r
t
~
"~!Jj
----------~
HosL CO" ',",'J",r 1 ---1
\~'\ ,I'; ;//
i \ \ / \ /11
\ \L-,;-L,<I.--o---rl--o'v'\~~--'/ '-' \ --'
\ \ '. ,._0-0-0---0-\-0-0-0
\ ..\---cr-o-o--o-\cr-o-o,
---0-0--0 ---o---b--o---o-+--,
I---\-o--o-o---o-o--cr-o-I---I
0-0-0---0-0--J-O
0-0-0--0--0---0-0
o-o-o--o-cr-o-o
~
LQ
f,
:>
E
"r,
~
III
Figure 1 A Systolic Array
change control parameters for new modes of operation. Individual cells process their
own data and pass on the intennediate results to their neighbors. This allows a highly
regular computing structure and alleviates the control and communication problems in'
a complex system. Figure 1 depicts a systolic array with system memory connected
to the host computer as well as the boundary cells.
The idea of systolic processing was the focus pf much research in the 80's,
and many systems were constructed or proposed on this principle. Examples include
specialized digital signal processors such as FFT chips.
9
2.2 Systolic Arrays Configurations
Mathematically, we can represent the interconnection network of a systolic
array by a graph. There are three types of interconnection structures used for systolic
arrays: meshes, trees, and shuffles. Generally, the closer the configuration of the
array is to the problem to be solved, the more efficient is the architecture for that
particular problem. Therefore it is not surprising that the most widely used
configuration for FFT is the perfect shuffle structure. Mesh structures are sometimes
used to compute two-dimensional FFT. Here we will describe three class of
configurations that are used in algorithms discussed in subsequent sections.
2.2.1 Meshes
The chain and cyCle depicted in Figure 2 are both one dimensional meshes.
Figure 2 Chain and Cycle
10·
The nodes of an mXn mesh are defmed by integer points in a rectangle {(i, )),
o =:;; i < m, 0 =:;; n}. A node (i, D is connected with nodes (i ± i, j) and (i, j ± 1),
if they are in the rectangle. For the graph of this two-dimensional mesh, the degree
is 4, diameter is m + n -2, and area is O( m n). If we add diagonal edges ( (i, i), (i
± 1, j ± 1)) to the two-dimensional mesh, then we get the hexagonal mesh. If we
add edges ((0, j), (m - 1, j) ) and ((i, 0), (i, n-1)) to an mXn mesh, then we get a
two-dimensional torus. If we add to the two-dimensional mesh twisting edges
( ( i, 0), (( i + 1 ) mod m, n -1 ) ), =:;; i < m,
( ( 0, ) ), (m - 1, (j - 1 ) mod n ) ), 0 =:;;) < n,
then we get the twisted torus.
2.2.2 Perfect Shuffle
Nodes of the perfect shuffle network are numbered k-bit integers. The i-th
node is connected with the nodes numbered by Sri) and E(i), where S is the shuffle
operation and E is the exchange operation:
S(i) =2imod(2 d-l) , E(i) =iel. (7)
Sri) is a l-bit circular left shift of the binary representation of i. E(i) inverts the last
bit of i. Therefore, the edges of the perfect shuffle are as follows:
(i,2i), (i+NI2,"2i+1), (2i, 2i+1)
where N = ]i, i = 0, 1, ... , NI2 - 1. The perfect shuffle has two loops (0,0) and
(N-l, N-l) and the i-th node is connected with three nodes
11
000 o
000 0 0 0 0 0
Figure 3 Perfect Shuffle
S(i), E(i), and S(i)-l = 21-1 i mod (? - 1).
Thus the degree of the graph is 3, and the diameter is d. It can be shown that there
exists a VLSI realization of the perfect shuffle in the region with area O(22d/cP).
Figure 3 shows a perfect shuffle connection between two rows of processors.
2.2.3 Cube Connected Cycle
Another architecture proposed for I-D Ff is the Cube Connected Cycle
(CCC). Preparata et aZ. [40] constructed the CCC from the hypercube by replacing
each node by a cycle of length d, where d is the degree of the hypercube. Figure 4
shows a realization of a CCC in a planar region, where the area is O(N2 / Zag2 N),
12
Cn+1
-
r-=- ,....-- r--- r--
-
r--
-
\
-
-.
c c ,
n n
,-
- r--- r-- r--
r-- C
3
r--
r---
-
'--
-
'--
-
'-- '--- '--
Figure 4 Cube Connected Cycle
and N = ? is the number of nodes in the CCC. Cj is cube connected cycle of
dimension 1, C3 is of dimensional 3, etc.
A node of a CCC is a pair of integers:
(1, p), 0 -~ 1 ~ ? - 1, 0 < P ~ d.
The node (1, p) is connected with three other nodes:
(1, (p + 1) mod d), (1, (P-1) mod d), and (1 EB 2P, p).
The degree of the CCC graph is 3, the diameter is 2d, and the area is O(N2).
13
2.3 VLSI Complexity
The designer of VLSI is constantly faced with the challenge to satisfy many
conflicting design goals. Often the person has to carefully weigh the tradeoffs
between different design choices. In order to quantify the tradeoffs, we need to
construct a measure that indicates the major characteristics of the VLSI design. A
measure that is widely used is Ar, where A is the layout area, and T is the
computation time [15].
To estimate the complexity of VLSI design that realizes a certain function, we
can make use of the notion of information content of the function.
Definition 2. Let f:xm.....yn be an m inputs n outputs function with domain X
and range Y. For simplicity, assume that n and m are even. The information content
of the function f is the greatest integer p., such that for every partition of inputs on
two equal groups ("0, XI) and for each group YI of 0/2 outputs, there exists a value XI'
E XW2 of inputs of the second group such that the number of different values of the
function fl("o) = prylf("o,xl*) is at least 2".
The value
(8)
is the informationflow from inputs "0 to outputs YI and denoted as !/>("o.....YI). The
information content is
14
(9)
which is the minimum of infonnation flow from inputs Xo to outputs Yl on all
partitions of inputs and outputs onto sets of the same cardinality.
Using this notion, we can fmd the lower bounds for AT2, which are based on
the fact that all bisections of a VLSI design realizing a function with p. are at least
Cp., where C is constant depending on the VLSI technology.
Theorem 1. When a VLSI realization off is laid out in a planner region R
with the area A, then
(10)
where A is the minimum width of the wire, T is the global clock period, and JI is the
number of metal layers in the chip. The proof of this theorem can be found in [14].
Theorem 2. The information content of N-point DFf of K-bit numbers is at
least K*N /4 and, hence, in bit serial input mode the lower bound A'P=O((NKYJ
holds true for any VLSI DFf design. The proof can be found in [50].
15
3. One-Dimensional FFT
3.1 The Pise Algorithm
The discrete Fourier transfonn in equation (6) (see Chapter 1) can be rewritten
as the multiplication of an N element vector by a matrix FN• D. J. Rose proved the
following identity [44]:
where N = 2n , Q is a pennutation matrix, Ik is a kxk unit matrix, and
(11)
d -wpqk- SI' k=ps + q, (12)
is a rs x rs diagonal matrix. (The tensor product of mx m matrix A and of nx n
matrix B is a mnXmn matrix A®B with ((i-1)n+p, (j-1)n+q)-th element equals alj 0
The Pise algorithm is based on equation (11). The following .is a systolic
realization of the Pise algorithm. The configuration of the systolic array is a perfect
shuffle and each cell perfonns the butterfly operation ( Figure 5 and 6 ).
Algorithm 1. Pise Algorithm
Initial Conditions:
for ( i = 0; i < = N/2 -1; + +i )
{
u[ 0, i] = x[ 2*i ];
u[O, i+NI2] = x[2*i + 1];
w[ 0 i] = W i.
, N'
16
x y
b
f--
W
f--
V
x' y'
x' = x + w.y; y' = x - w.y;
IF (b (mod 2) = 1) THEN w = wlv;
v = v· v; b = b/2
TO
T1
T2
T.3
Figure 6 The Structure of a I-D DFT Systolic Array
17
v[ 0, i] = WN;
b[ 0, i] = i;
}
Iterations:
for ( t = 1; t < = n; + + i )
{
for ( i = 0; i < = N/2 -1; + + i )
{
u[ t + 1, i] = u[ t, 2*i ] + w[ t. i 1* u[ t, 2*i + 1];
u[ t +1, i +N/2] = u[ t, 2*i 1- w[ t, i] * u[ t*2+1];
if ( mod( b[ t, i], 2) == 1)
w[ t+ 1, i] = w[ t, i 1/ v[ t, i ];
v[ t + 1, i] = v[ t, i] * v[ t. i ];
b[ t+1, i ] = b[ t, i] / 2;
}
}
3.2 Time-Area Analysis of I-D Designs
A good layout of a perfect shuffle [25] allows us to get a systolic array with
an area O(}f / loiN). Processing time is simply T = log N, and we have
AT=O(N), which is the optimal for the one-dirnensional DFf, cf. Theorem 2 in
Chapter 2.
Other architectures proposed for 1-0 FI' include the CCC and, the butterfly,
and the mesh [56]. All these parallel designs ~b the area-time theoretical lower
bound of AT = O( N ), and all ofthem are bit-serial. The I/O operations are
affected by the layout of the design. NO( all the designs allow free access for
connection I/O lines to every processor. 'Therefore I/O have to be done in pipelines.
The butterfly layout consists of NlogN processing elements each performing a
butterfly operation. The alignment of input PE's allows us to connect an input to
18
each of the input PE's and an output from each of the output PE's. Thus, the I/O
operations can be performed in one step in the bufferfly design. The CCC and the
perfect shuffle, on the other hand, requires more complicated I/O layout, which
consists of L = O(N / log N) cycles and necklaces. The layout allows wiring access
only to the edge PE's, requiring I/O operations to be done in pipelines. It takes
O(NIL) = O( log N) steps to complete an I/O on the CCC and perfect shuffle.
3.3 Flow of Vectors
~ ..JnJnany~eal·time_FELapplications,-input-sample-vectors-are-fed-into-the-~----_·_~·_~~--
systolic array at a steady clock rate. The systolic array operates in a pipelined
fashion and produces an FFr vector at every clock in steady state. The startup time,
or the latency between the fIrst input vector and the fIrst output vector of the FFr is
simply
(13).
A straightforward extension of Algorithm 1 leads to the flow-of-vectors version which
performs FFf on k sequential vectors of length N. The cell and the network are both
depicted in Figure 7.
Algorithm 2. Flow of Vectors
Initial Conditions:
for ( i = 1; i < = n; ++i )
{
for (j = 0; j < = NI2 -1; ++j )
{
19
Elo
x' y'
X' =X + w*y
- 0_-- - - - - - - - - - - - - - - - - - - -- - - -------y'-= X - wy--
Figure 7 FFT systolic array for flow-of-vectors
p = 2i ;
W [ j, i ] = Wpi;
}
}
Boundary Conditions:
for ( t = 0; t < k; + +t)"
{
for ( i = 0; i < N/2; ++i )
{
z[ t + 1, i, 0] = x[ t + 1, i ];
z[ t + 1, i + NI2, 0] = x[ t + 1, i.+ N/2 ];
}
}
20
Iterations:
for ( t = 1; t < = n + k; + +t )
{
for (j = 1; j < = n; + + j )
{
for (s = 0; s < (2 < < (i-I)); ++s)
{
for ( i = 0; i < (2 < < (n-j)); + +i )
{
z[ t + 1, i + (s < < (n-j)), j ] =
z[ t , i + (s < < (n-j)), j - 1 ] +
w[i,j]*z[t, i+(2 < <(n-j-I))+(s < <(n-j),j];
z[ t + 1, i + (s < < (n-j)), j] =
z[ t , i + (s < < (n-j)), j - 1 ] -
w[iJ]*z[t, i+(2< < (n-j-1))+(s< < (n-j)J],'
}
}
}
}
The area of this systolic array is O( NI log N ), and the processing time is (k-
+ log n). The SA produces FFf of the input vectors with reversed binary indices.
The output z[n+ 1, i, n] is the r-th component of the vector z1 = F Xl' where the
binary representation of r is the reverse of the binary representation of i.
21
4. Multi-Dimensional FFT
4.1 Defmitions
It is often necessary to perfonn multi-dimensional Fourier transfonn in areas
of signal processing such as image processing. In the field of computer vision and
pattern analysis, there is a growing interest in simultaneous time-spatial frequency
representati~ignalS for better understanding of dynamic system behavior.
However, the multi-dimensional Ff requires an enonnous amount of
computation~ and ~apping the multi-dimensional transfonn into a 2-D silicon area
requires a great deal of interconnections and routing.
Definition 3. Given a d-dimensional data with indices nt E [0, Nt], n2 E
[0,N2], ... , ~ E [0, NJ, the d-dimensional DFf is given by
(14)
where
(15)
/'
To simplify the analysis, it is assumed that 1) N1 = Nz = ... = Nd = N , and n = }I
is the total number of elements in the input data; 2) N is a power of 2. The following
are two methods that are conceptually simple and suitable for VLSI implementation.
22
4.2 Row-Column Method
A multi-dimensional DFT can be computed as a sequence of one-dimensional
DFTs. This method is often called the row-column algorithm because in the two-
dimensional case, the 2-D DFT can be viewed as a one-dimensional DFT on the rows
followed by a I-D DFT on the columns.
For an MxN input data array, the two-step DFT matrix can be written as
(16)
The computation involves the following steps:
a) Represent the input as an MxN array.
b) Perform an FFT on the rows of the array ( multiplication by the matrix ffiFN).
c) Transpose the resulting array ( multiplication by the matrix PNM).
d) Perform an FFT on the rows of the resulting array ( multiplicatiop by the
e) Transpose the resulting array (multiplication by the matrix PM~.
There different systolic arrays for the purpose of matrix transposition. In [36] a
systolic array is proposed that incl~des !VZ switches, N2-bit buffers, and it transposes
an NxN matrix in time 3N -1 ( see Figure 8 )..
c. N. Zhang [55] suggested a mesh connected systolic array for computing 1-
D DFT. I. Gertner and M. Shamash [16] modified the architecture for computing 2-
D DFT. The algorithm begins by computing the DFT of columns. The bottom row
of cells is used to feed the transformed array back into the mesh.
23
INPUT FROM 1-0 F.T ARRAYS
WRITE
ENABLE
1------ 0(N1oq nJ -----4
o
M
T~=H=~:::::j+=::=++=~---!-1 U ot NI
~XI
WRITE
ENABLE
READ n INPUT
ENABLE~ DATA
•OUTPUT DATA
-4 ~
0009")
OUTPUT
DATA
TO 1-0
FT.
ARRAYS
Figure 8 Systolic Array for Matrix Transposition
o .0,1., I ---+--,t----+--+---1f--------'
0 .. O:""'~~~. W".f --H--h----+---1-+-I-r--
Figure 9 A Mesh Connected Systolic Array for 2-D DFT
A
14 A.z3 ~2 A41 -
~ ~ ~ ~ -, ~
A13 A.z2 - ~1 - -
~ ~ l ~ l
A12 A.z1 - - -
~ ~ ~ ~ ~
A
- -
.-
-r ~ ~ l ~
Y4-- Y3~ Y2-- Y,- 81
f-- 82
f-- 83
~ 84 -- 85
Figure 10 Matrix-Vector Multiplication
The next step computes the DFf of the rows ( see Figure 9). The mesh is
implemented with N2 cells requiring O(N log n) area. The computation time
required is O( N log n). The Ay2 is O( }/ lot n).
4.3 Matrix-Vector Multiplication
We can represent the multi-dimensional DFf by a matrix-vector multiplication
just as in the one-dimensional case [52]. If we denote the DFf matrix in one-
dimension by F/, then for a d-dimensionaJ. transform the DFf matrix is given by
25
d 1 1 1F =FJ:l fi}FN fi} ... fi}FN1 2 d (17)
where ® is again the tensor product of matrices. The advantage of the method is that
it can be implemented in VLSI in a straightforward way. However, the main
weakness of this approach is that every output requires n2 = N2d multiplications.
A matrix-vector multiplication can be performed on a systolic array as
described in [27]. The design implementing the band matrix by vector product uses
O(n) cells requiring O(n log n) area and O(n log n) time. The AT2 lower bound is
O( ni loIn). This desIgn does not achieve the theoretical lowl bound on AT.
4.4 An Optimal Hybrid Design
A hybrid design that uses a rotator with I-D butterfly arrays was shown to
achieve the optimal lower bound [16]. The rotator is a network or an array that
permutes data. The rotation permutation R defmed on a d-dimensional data cube
permutes the data such that its indices are circularly shifted to the left. For example,
(18)
The rotation for a two-dimensional data is simply a matrix transposition. For a three-
dimensional data, it corresponds to a cube rotation ( see Figure 11 and 12 ).
First the array for computing I-D DFf is connected to the data cube along
dimension nz• When the transform is completed along this dimension, the cube is
rotat.ed so that dimension n2 replaces nz and connects to the I-D transformer. This
design has a high degree of parallelism and high area consumption. For ~pplication
26
where low cost is crucial, the mesh systolic array may be used for 2-D FT. The
penalty in performance is a factor of N. For higher dimensions, the matrix-vector
product systolic array can be used to trade performance for lower cost.
.
.
.
.
nl_·__--J
R
.
:'
I
n3 •
n3
I
R~
R
Figure 11 The Rotation Process for a 3-D Cube
from inpul
• memOty-- /vi
U
X
'N' \-1) fourier
tronsform
orroys
o
/vi
U
X
to output .
buffer mem:lry
L..:::====~ ROTATOR
Figure 12 A Hybrid Design with a Rotator and l-D DFT Array
4.5 Mapping To Higher-Dimensional DFf
Often times it is desirable to tum a one-dimensional Ff into a multi-
dimensional one. In general, a higher throughput can be achieved at the expense of
more processing elements and communication between them by mapping a DFf from
a lower dimension to a higher one. The resulting structures are more regular and
modular with higher perfonnance.
The multilinear fonn approach is systematic and can be used in a variety of
linear algorithms such as filtering, convolution. and pattern matching. In this
approach, a transfer function in the lower dimension is decomposed into transfer
functions in a higher dimension. The resulting system can be mapped directly to
VLSIIULSI/WSI structures. The architecture is usually semi-systolic and solves the
I/O bottleneck problem. However as the number of processors increases, the system
clock rate has to slow down.
Many existing schemes are based on mapping indices of input and output for a
prime-factor or 2"-point decomposed sequences of DFT. A different method based on
the Goertzel algorithm via Homer's rule was proposed in [31]. This technique places
an optimized dependency graph for J!-point OfT in d-dimensional index space.
The index n in DFf (equation (6) of Chapter 1) can be decomposed according
to the following general index mapping equatiOn:
28
(19)
where
Substitute n into equation (6), we get
(20)
1 1
X[k] =E E
nl=o n2=O (21)
Equation (21) can be realized to p-dimensional systolic array.
The scheme proposed in [31] is system can lends itself to logic synthesis. The
dependency graph based approach has a simply realization and a reduced computation
overhead.
29
5. Other Algorithms
The algorithms described in Chapter 3 and 4 are based on either the matrix-
vector multiplication or the data shuffling similar to the Cooley-Tukey FFT. A
different class of algorithms are based on the circular convolution in the time domain.
The motivation behind these algorithms is to minimize the number of multiplications
by using certain properties of the convolution and the type of signal itself. This is
often done at the expense of an increased number of additions and somewhat less
regular computation structures.
The algorithm discussed in this chapter, with the exception of the Goertzel
algorithm and the chirp transfonn algorithm, are sometimes referred to as small N
algorithms, due to the fact that they all attempt to reduce the number of
multiplications in the circular convolution used to computed the DFT.
5.1 Prime Factor Algorithms
The key idea behind the Cooley-Tukey's FFr is splitting up the input sequence
f
in a certain way that can reduce the computation requirement, primarily the number
of multiplications. These schemes are often referred to as decimation-in-time and
decimation-in-frequency methods.
\
Similar to this idea, the prime factor algorithms [17] decompbse a one-
dimensional DFT of length N into a two-dimensional DFT of an NJ xN2 matrix,
30
where N = N] XNz, N] and Nz are relatively prime. The task of decomposition
involves fmding an index mapping of the form:
n=An1 +Bn2 ,
k=Ck1 +Dk2 ,
o~nl ~Nl -1,
O~kl<=Nl-1,
O~n2~N2-1
O~k2~N2-1. (22 )
It can be shown that for any values of NI , Nz, the mapping where A = Nz,
B= C = 1, and D = NI , produces the so-called Cooley-Tukey decomposition of the
OFf. The same can be said of the mapping where A ;:: D = 1, B = N1, and C=Nz.
However, if NI and Nz are relatively prime, an appropriate choice of the four
constants can completely eliminated the "twiddle" factors in the CoOley-Tukey
decomposition. The requirements for the constants are that AC = Nz, BD = NI , and
BC = O. Finding a set of constants that satisfies these constraints involves the use of-
the Chinese remainder theorem. Burrus [8] found a set of constants that satisfies the
conditions: A = Nz, B = NI , C = NlNz1 mod NI), and D = NlN/ mod NJJ.
Since the twiddle factors have been eliminated, the two-dimensional OFf can
be performed in either row first or column first order. The number of complex
multipliGations is reduced by approximately N.
. -
5.2 The Number Theoretic Algorithms
S. Winograd proposed an algorithm in 1978 [52], known as the Winograd
Fourier transform algorithm (WFfA), which achieves its efficiency by expressing the
OFf in terms of polynomial multiplication. The WFfA uses the same indexing
scheme as the prime factor algorithm and decomposes the DFT into a number of
31
short-length DFT's where the lengths are relatively prime. The short DFT's are then
converted into circular convolutions. Rader [42] proposed a scheme for converting a
DFT into a convolution when the length of the input sequence is prime.
Winograd combined these ideas with efficient algorithms for computing
circular convolutions into a new method for computing the DFT. The techniques for
deriving the efficient algorithms for computing short convolutions are based on
number theoretic concepts such as the Chinese remainder theorem for polynomials.
With the WFTA approach, the number of multiplications required for an N.·
point DFT is O(N) compared to O(NlogN) for the Cooley-Tukey FFT. Although this
,
approach leads to algorithms that are optimal in terms of minimizing multiplications,
the number of additions is significantly increased in comparison with FFT.
Therefore, WFfA is advantageous when multiplication is much slower or more
expensive than addition. Other difficulties with the WFTA are that the decomposition
is more complicated, in-place computation is not possible, and each different value of
N (size of the input sequence) requires its own computation structures.
Agarwal and Cooley [1], and Kolba and Parks[26] also proposed similar
algorithms. As in the case of WFfA, they are all based upon of Rader's theorem on
DFT [42] with prime transform length to construct their algorithms for the
computation of DFT.
. Reed and Truong [43] proposed a technique for the computation of the DFT
based on Winograd's method in combination with Mersenned prime number-theoretic
transforms. This hybrid algorithm requires fewer multiplications than either the
32
standard FFf or Winograd's more conventional algorithm. However, a very large
number of additions are required and the number of multiplications per point is still
relatively large.
W. C. Siu [47] presented an algorithm that further reduced the number of
multiplications by using number theoretic transforms. By noting some properties of
number theory and the DFT, the total number of real multiplications for an N point
DFT is reduced to (N-l). For a proper choice of transform length and number
theoretic transform, the number of shift additions per point is approximately the same
as the number of additions required for FFf algorithms.
Nussbaumer [35] defmed a type of pol ynomial transform over the field of
polynomials which could be used to compute two-dimensional convolutions
efficiently. This leads to the development of fast algorithms for the computation of
multi-dimensional DFT.
,
5.3 VLSI Implementation of Small N Algorithms
Compared to the implementations discussed in Chapter 3 and 4, the small N
algorithms are a fast design. The difficulty in the VLSI implementation of these
algorithm is that regular computation structure is not apparent. Furthermore, the
optimal primitive small N algorithms have been derived for only N = 2, 3, 4, 5, 7,
8,9,16.
33
In [39] it was shown that architectures· for the small N algorithms with optimal·
area-time lower bound can be constructed with only nearest neighbor
interconnections.
For the sake of describing the implementation, the small N algorithms can be
summarized by the following equation
y=S'C'T'X
where T is a predefmed L xN matrix whose elements are either 0, 1, or -1
(23 )
representing input additions and subtractions, C is a predefmed L xL diagonal matrix
representing the multiplications, and S is a predefmed NXL matrix whose elements
are either 0, 1, -1, representing the output additions and subtractions. The number of
multiplications L is O( N ).
In general the length of an input sequence can be written as
M-l
N= II N.
i=O 1
(24 )
where each ~ is a number for which the optimal primitive algorithm is known, and
any two ~'s are relatively prime. Then for the composite algorithm, the three
matrices in equation (24) are correspondingly
34
(25)
(26)
(27 )
where ® is the Kronecker product.
The Winograd algorithm discussed in the previous section can be fonnulated
as follows
(28)
where @ is the element-wise multiplication, and
(29 )
The arithmetic unit proposed in [39] computes the WFTA by using a
summation unit and a multiplication unit. The process described by equation (28)
involves three basic operations: 1) Z = BX; 2) Z = (EX?; 3) Z = (H@}(). The
adder and summation unit depicted in Figure 13 perfonns the fIrst operation. The
modifIed adder and summation unit in Figure 14 perfonns the second operation. The
multiplier and multiplication unit in Figure 15 perfonns the third operation. Figure
16 combines these components into an arithmetic unit which computes the WFTA.
From the generality of equations (23) - (27), it is apparent that many different
number theoretic algorithms can be computed on this arithmetic unit. In fact, Owens
et al [39] showed that the prime factor algorithm described previously can also be
perfonned on this system.
35
\X" i/.11 > 0
1.., = 1,. + o. II.\{ = 0
-.\" '1.\1 < U
o 0
Is _. I---~-B-
I I
'" ...
II 1- - -@-G=J-
Ia 1- - -GJ~c£J-
:/ 1 I
'-----"-'
I
I Ic!J1=±J
X I •I X 1.a
X a.a
I I
Zo., Z,.:
1 o.~ Z 1,1
ZO.I Z l.a
Z 0.0
Figure 13 Summation Unit
36·
o
-IAS_II.v.11
I
-~
-~
1
: IB
z....,.,.o ...
1
.1/ If .•\ .. > 0
la' =1... ';' II If,l .. =()
-.11 If .\.. < 0
1 1 1
Z).o Z:.I .• , ZO."'o-1
Z :.0 Z 1.1
Z 1.0 Z0.1
Z 0.0
Figure 14 Modified Summation Unit
X O,3 X I•2 X"'1l- I .O
X O,2 X I•1
X O•I X I•O
XO,O
X in I
1
p
t 1
H in Zout 1 11 1
H 0,0 Z0,3 Z 1.2 Z"'O-I,O
H 0,1 ZO,2 H 1,0 Z 1,1
H 0.2 Z 0,1 H 1.1 Z 1,0
H0,3 Z 0,0 H 1.2 H"'O-I.O
1L..-- _
Figure 15 Multiplication Unit
37'
t t r l
I/0. lIb, I. 0 .•., I/O~.,
I
I t I fIt I t
I M. I M, I M~., M.~.,
4 I i 4
Figure 16 The Arithmetic Unit
5.4 The Goertzel Algorithm
The Goertzel algorithm [17] takes advantage of the periodicity of the
coefficients in the circular co~nvolution of the DFT to reduce computation. The
coefficients in equation (6) of Chapter 1 has the following property
(30)
If we multiply both sides of equation (6) by the left hand side of equation (30), we
have
38
N-l N-l
X[k] "'WNL w:"'E x[n] WNk(N-nl
n=O n=O
Now if we defme another sequence
""
Yk [n] = L x [r] WNk(n-rl U [n- r] I
r=-oo
where urn) is the unit step function, then it follows that
X[k] "'Yk [n] In=N'
(31)
(32)
(33)
Equation (32) can be regarded as a convolution of the fmite-duration sequence x[n],
o~ n ~ N - 1, with the sequence W/lIu[n]. Therefore, Yin] can be viewed as the
response of a system with impulse response W/lr1u[n] to a fmite duration input
sequence x[nj. The k-the DFf of the input signal is the output of Yk when n = N.
A signal flow diagram of this system is shown in Figure 17, The notation of
z-transform is used here and the branch labelled Zl represents a delay element. The
advantage of this system is that no computation or storage of the coefficients WNnA: is
needed since these quantities are implicitly computed in the recursion of Figure 17.
To reduce the number of multiplications, we rewrite the system function of
Figure 17 as follows
39
x[n]
. -1
. Z
Figure 17 Flow Graph of First-Order Recursion
x(n] yk[n]
z-1
2 cos (2~k)
-WkN
z-,
-1
Figure 18 Goertzel Algorithm
40'
(34 )
Figure 18 depicts a system corresponding to equation (34).
If the input is complex, only two real multiplications per sample are required.
As' in the case of the fIrst-order system, for complex input, four real additions per
sample are required to implement the system. Note that in the figure, the complex
multiplication by -WNk need not be performed until the N-th iteration. Thus, the total
number of multiplication is 2(N+2) and the number of addition is 4(N+l). Similar
to the first-order system, the second-order system requires only two coefficients to be
computed and stored.
Another advantage of the Goertzel algorithm is that a subset of the output
sequence can be computed independently. If we are only interested in a certain
frequency band of the signal, this can be a convenient method to reduce computation. -
The complexity is O(NM) where M is the number of output points desired. When M
is small, this algorithm can be more efficient than the Cooley-Tukey FFT.
5.5 The Chirp Transform Algorithm
~other algorithm based on expressing the DFT as a convolution is the chirp
transform algorithm [41] [48]. This algorithm is not optimal in terms of computation
complexity, but it is useful in applications where the technology used for
implementation is well suited to doing convolution with a fIxed, predefmed impulse
response.
The charge-coupled devices (CCD) and surface acoustic wave devices (SAW)
are particularly useful for implementing convolution with a fixed, prespecified
41
impulse response. Such devices can be used for implementation of FIR fIlters, with
the fIlter impulse response being specified at the time of fabrication by a geometric
pattern of electrodes. A similar approach was used by Hewes in implementing the
chirp transform algorithm with CCD devices [21].
The chirp transform algorithm is also more flexible than the FFf, since it can
be used to compute any set of equally spaced samples of the Fourier transform. In
general, we do not require N = M, where N is the length of the input sequence and
M is the length of the output sequence. Furthermore, M and N need not be compo'site
numbers. In fact, they can be prime numbers if desired.
42
6. Other Architectures
There are many parallel or distributed architectures, designs, and
implementations. Although a high degree of concentration is found in the
systolic/VLSI implementations of the fast Fourier transform, there exist various FFr
algorithms for different parallel systems.
6.1 Illiac IV : The First SIMD Supercomputer
The Illiac IV was the first large scale array computer which became
operational in 1975. Initially, it was intended as a grand experiment in computer
engineering to overcome existing limitations on the computing speed and to design a
system with a high level of parallelism and pipelining to achieve performance never
before heard of [22].
Sponsored by the DARPA, the llliac IV was designed by Professor Daniel
Slotnick and manufactured by the Westinghouse Electric Company. In this design,
there is one single instruction stream and a single control unit which decodes the
instructions and broadcasts control signals to all me 64 processing elements. The
processing elements perform the same insuuctionJ on different data stored in their
local memories. Thus, the Illiac IV became me first SIMD supercomputer.
43
An FFf algorithm designed for the liliac IV performs a two-dimensional FFT.
The program is called TWDFFr. The complex data that are supplied to the program
are stored one to each 64-bit word with the components encoded as 32-bit floating
point numbers. Both the real and inlaginary parts are stored in the same 64-bit word.
The input matrix elements are stored in a row-major fashion. Two parameters are
used to specify the input data size, another to specify whether a FFr or inverse FFT
.is to be performed. Yet another parameter contains error status.
The FFf algorithm used is essentially a row-column method as discussed in
Section 4.2. It fIrst computes the transform "down" the PE's, and when it is time to
compute "across" the PE's, a transposition of the matrix is performed. The across-
the-PE FFf takes O(N log N), while the transposition takes O(N) and down-the PE
FFf takes O(log N). Therefore both methods have the same complexity, but in most
cases, switching data locations is faster than multiplications. Furthermore, the liliac
IV control unit can perform housekeeping while the PE's are Performing arithmetic
operations. Properly taking advantage of this feature can optimize the results.
6.2 Parallel Associative Processors
In the classical model of the von Neuman machine, the bus connecting the
processor and the memory often becomes the bottleneck. To further increase the
computing speed, the concept of associative memory (AM) has been advanced. An
AM is basically an intelligent memory that is capable of storing information,
£omparing it with other information and discovering agreement or disagreement[ 12).
44
The AM can alleviate the problem of the heavy traffic on the bus thus increasing the
overall throughput of the computer. Based on the concept of AM, a parallel
associative processor (PAP) is simply an associative memory capable of a parallel
recording of information into all the words that report agreement.
The Cooley-Tukey algorithm can be implemented on the PAP in a
straightforward way provided that a suitable interconnection network is in place. The
Goodyear Aerospace's associative computer STARAN employs such an
interconnection network which is similar to the perfect shuffle structure.
Both the STARAN and Illiac IV computers can be classified as SIMD
machines, and common fast DFf algorithms can be implemented with ease. This is
because the computation involved in the DFT is highly regular and structured. Each
cell carries out exactly the same instruction on different data. At each stage, all the
processors involved can be synchronized by a single system clock. In the case of the
Illiac IV, a matrix-vector multiplication scheme was used, while the STARAN used
the data-shuffling method. The difference is primarily due to the different
interconnection architectures.
6.3 Hypercube
The hypercube has a very versatile structure. The FFf algorithm can be
implemented on the hypercube by embedding a mesh or a shuffle into it. The
algorithms discussed in chapters 2-5 can therefore be easily modified to run on the
hypercube.
45
.~
\
Theorem 3. There exists an isomorphic embedding of the '2:" x? mesh into a
hypercube of degree (m +n) [15].
Proof We start with the embedding of a chain of length '2:" into an m-
dimensional hypercube. For m = 1, the embedding is obvious. The index of a node
in the hypercube is simply the binary representation of the index in the chain.
Suppose we have an embedding of a chain of length '2:".1 in an (m-l)-dimensional
cube. Then an m-dimensional cube can be seen as two (m-l)-dimensional cube each
containing a chain of length '2:".1. Connect the free ends of the two embedded chains
by an edge of the cube. Now let us have embeddings of two chains of lengths '2:" and
? into m-dimensional and n-dimensional cubes. Since a '2:" x 2:' mesh is a direct
product of these chains, a product of embeddings of these chains gives us an
embedding of the mesh into a product of the cubes, ie. the (m+n)-dimensional
cube. II
A parallel decomposition of the FFr algorithm directly onto a MIMD
hypercube was described in [14]. The N-sequence is divided equally among M
processors. As the FFr progresses, each processor contains fewer and fewer points
until eventually each processor contains only one point to compute. At this point the
processors need to exchange data.
At fIrst the even and odd points are stored in adjacent processors. At the next
stage, a point is stored in every other processor, and even and odd points are stored
two processors apart. Proceeding further, the points move further away by twice the
distance in the previous stage. In other words, the processors need to communicate
46
with processors whose numbers differ by one, two, four, eight, etc. This directly
corresponds to the hypercube structure.
The following is the pseudocode showing the major steps of the algorithm
described in [14].
Algorithm 3. Parallel FFr on the Hypercube
/*fmd M point FFf of distributed complex data array, x */
function fft_2 ( M )
{
if ( M = = 1 ) then
return;
else if (M < = number of processors ) then
{
/* only one element of x is stored locally */
/* fmd Ml2 point FFf */
fft_2( Ml2 );
/* combine local value of x with value from complementary proc. */
fft_combine_comm( );
return;
}
else
{
/* fmd Ml2 point FFf of even elements of x */
fft_2( Ml2 );
/* fmd Ml2 point FFf of odd elements of x */
fft_2( Ml2 );
/* combine even and odd elements of x within the same proc. */
fft combine 2( );
- -
return;
}
}
/* combine the even and odd elements of x within the same processor */
function fft_combine_2( )
{
for (each pair of elements in this processor)
{
j = the global index of the pair;
x_even = the even element of the pair;
47
x_odd = the odd element of the pair;
pennute( );
exp_fac = exp( 27rij/M );
tenn = x odd * exp fac;
- -
x_even = x_even + tenn;
x_odd = x_even - tenn;
}
}
/* combine the single element x with complementary element from another processor
*/
function fft_combine_comm( )
{
p_comp = proc. containing the complementary element;
send x to p_comp;
receive z from p_comp;
j = global index of x;
pennute( );
exp_fac = exp ( 27rij'/M );
if ( x is an odd element )
{
x_even = z;
x odd = x·
- ,
x_odd = x_even - x_odd * expJac;
}
else
{
x_even = x;
x odd = z'_. ,
x_even = x_even + x_odd * expJac;
}
}
The same scheme can be readily extended to multi-dimensional FFT. The
communication structure on the hypercube can be thought of as a product of cubes in
lower dimensions. In a two-dimensional case. for example, the sum of the
dimensions of the smaller cubes d1 and d: is equal to the dimension of the whole
hypercube. When computing the inner-most summation, we use the communication
48
channels in the fIrst subcube. When computing the outer summation, we use the
channels associated with the second subcube.
Another method corresponding to the row-column decomposition discussed in
section 4.2' is to store an entire row of the two-dimensional matrix in a single
processor. When the FFf of the row vectors ~ complete, a matrix transposition is
performed, and then the column vectors are stored in the processors. This second
pass completes the two-dimensional FFf.
The scalability aspect of FFf on the hypercube is studied in [19]. The
scalability is a measure of the speedup as a function of the increase in the number of
processors. It is a very important concept because a common mistake is attempting to
extrapolate performance of a small system to a large system by automatically
assuming a linear scalability.
It was found that a linearly increasing speedup is possible with only a small
increase in problem size. The improvement in efficiency, however, is limited by the
ratio of the CPU speed and the communication bandwidth. With store-and-forward
routing, efficiencies beyond this threshold can be achieved if the problem size
increased very rapidly. Alternate formulation using transposition can also overcome
this limit if the hardware supports cut-through routing. Below the threshold, the
regular FFr is more scalable while above the threshold the transposition formulation
is more scalable.
An analysis of the scalability of the FFf on the mesh structure is also given in
[19] . It was found that the FFf algorithm cannot make efficient use of large mesh
49
architectures unless the communication bandwidth is increased together with the
number of the processors. In summary, it was shown that if the cost of a
communication network is proportional to the total number of communication links,
then it is more cost-effective to implement the FFf algorithm on a hypercube than on
a mesh structure although large meshes are often cheaper than large hypercubes.
50
7. Fault Tolerant FFT
As the problem size increases, the complexity of the FFT implementations in
hardware beComes significant, and the concern for fault-tolerance capabilities
increases. The fIrst architecture dedicated to the fault-tolerant computation of FFT
can be found in [23]. Fault diagnosis for a FFT processor is proposed in [33]. The
method described in [33] relies on a time redundancy approach for a one-point
butterfly-based architecture by recomputing the transform on alternate paths.
The time redundancy approach requires all computations to be duplicated.
Comparison of results is used to detect errors and diagnose faults in the butterfly
during normal operation. A permanent fault can be located within log N + 5 cycles
for an N-point FFT. The issue of gracefully degraded reconflguration is also
addressed: an additional stage of one-point bufferfly units is introduced to achieve this
objective. The hardware overhead accounts for O( 11 log N) with some additional
switches and comparators.
Another approach to fault tolerance for FFr computation is given in [24]. A
concurrent error detection (CED) scheme is proposed; this scheme is applicable to
(N12) log N two-point' butterfly modules and is used to detect a single error in either
the complex multiplier, or one input...or output set of lines. The fault model is strictly
\
functional and is based on the addition of two extra stages to the basic system: the
fIrst stage consists of N decoding multipliers, while the second stage is based on the
51
cascade of N summation adders whose outputs is compared with the output of an
additional multiplier.
The scheme utilizes a coding relationship in the computations of the FFI' to
achieve fault-tolerant results and distinguish round-off errors and functional errors
with a hardware overhead of O(2/iogN). Fault location is accomplished using a time-
redundancy method, in which a retry p~ocedure is executed by modified input
sequences.
An .improvement to the architecture was presented in [49]. A detailed round-
off error analysis is provided and a novel CED scheme, which is suitable for handling
round-off errors, is proposed. It is proved that 100% fault-coverage is possible with
the same overhead and throughput as in [33].
Yamashita et ai. [53] proposed a constant-geometry processor, which was built
using wafer-scale integration (WSI), whose operation can be tested using built-in test
circuits. Testing is implemented in two steps: initially, a building block test is
executed. At the successful completion of this step, the whole wafer is then tested.
It has been proven that test circuits occupy only 8% of the total active area of the
wafer. However, this processor has no error detection, or fault-tole~t capabilities
once it is placed in operation.
Lombardi [32] proposed a fault detection and location scheme for the above
processor using a combination of concurrent and reconfiguration techniques. Faults
are considered at the component level and a single fault (permanent or transient) per
module is assumed. The component based approach for fault detection represents a
52
more efficient utilization of the hardware than a method which uses full module
replacement The process requires a very accurate fault location technique, which is
achieved in the proposed architecture by exploiting the functional relationship in the
computations of the Fourier transform.
It was shown that system proposed by Lombardi has a higher reliability than
the graceful degradable structure of [33] and the FFf array system of [24].
Throughput of the system is kept at a normal value after reconfiguration is performed.
Overhead due to sparing is O(lIN) as switching and routing do not account for a
significant overhead as compared with the complexity of the multipliers and. adders iIi
the cell.
53
8. Conclusions
Discrete Fourier transfonn is a powerful tool in many fields of engineering
and applied mathematics. Yet it is a computation-intensive operation. Achieving the
highest possible computational efficiency, therefore, has been the focus of efforts of
many researchers.
Ever since Cooley-Tukey published their fast Fourier transfonn algorithm,
numerous improvements have been added to the literature. Among the most widely I'
used algorithms, the Cooley-Tukey 3;9d the prime factor are both based on the
decomposition of the input sequence into a fonn suitable for fast computation. In
general, the algorithms in this category have a sequential computation complexity of
O( N log N). Another approach is based on the number theoretic manipulation of the
DFI' in the fonn of circular convolution. The Winograd Fourier transfonn is an
example in this class. In general, the number of multiplications is reduced to O( N )
at the expense of more additions.
The parallel implementation of the DFI' often falls into one of three
categories. The first method is matrix-vector multiplication. This is analogous to the
regular sequential DFI' in which no attempt is made to reduce the number of
multiplications. Although the parallel version with O( N) processors speeds up the
computation by approximate a factor of N, it is still a slow design compared to the
other two categories. The advantage of this method, however, is a straightforward
54
long wires.
The third class of parallel DFTs are based on the number theoretic algorithms.
The number of multiplications is often O( N ) and extremely efficient implementations _
layout in the VLSI design due to its regularity and simplicity. An architecture well
suited for this method is the mesh structure. Systolic arrays based on the mesh can
be easily conceived.
The second category is centered around the parallelization of the Cooley-Tukey
FFr algorithm. Again, a reduction of computation time by a factor of N from the
serial version is achieved. The butterfly network is usually the architecture of choice.
Compared with the matrix-vector product method, this is a fast algorithm with well
understood VLSI implementation. The drawback is that the VLSI layout requires
\
~
~)can be found. The perceived major difficulty is the irregularity in the computation
structure. However, it has been shown that it is-possible to design a regular structure
that achieves both the small number of multiplications and nearest-neighbor
interconnections.
With all the FFr algorithms, the most popular medium of implementation is
VLSI. This is mostly due to the tremendous speed benefits afforded by VLSI at
reasonable costs. The systolic principle aPJbrs to be the pillar in most VLSI
designs. On the other hand, general purpose parallel computers are often chosen for
their flexibility. Architectures such as the hypercube, whose versatility allows
embedding of other structures, are particularly suitable and convenient for
implementing a variety of FFr algorithms.
55
As the need for more sophisticated and reliable computing system grows, the
issue of fault-tolerance has attracted increa'sing attention. Several fault-tolerant FFf
processors have been proposed and many error detection and location techniques have
been presented in the literature.
56
, ,
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
References
R.C. Agarwal and J. W. Cooley, "New algorithms for digital convolution,"
IEEE Trans. ASSP-25, p. 106-124, 1977.
G. Alia, F. Barsi, E. Martinelli, "A fast near optimal VLSI implementation of
FFf using residue number systems," Integration, vol. 2, p. 133-147, 1984. (
G. M. Baudet and F. P. Preparata, "Area-time optimal VLSI circuits for
convolution," IEEE Trans. Comput., vol. 32, no. 7, p. 684-688,1983 ..
G. Bilardi and M. Sarrafzadeh, "Optimal discrete Fourier transfonn in VLSI,~
VLSI, p79-89, 1984.
G.E.I. Bold, "A comparison of the time involved in computing fast Hartley
and Jast Fourier transfonns" , Proc. IEEE, vol. 73, no. 12, pp.1863-1864,
1985.
G. Bongiovanni, "Two VLSI structures for the discrete Fourier transfonn,"
IEEE Trans. Comput. , vol. C-32, pp. 750-753, 1983.
G. Bongiovanni, "A VLSI network for variable size FFfs," IEEE Trans.
Comput., vol.32, no. 8, p. 756-760, 1983.
C. S. Burrus, "Index mappings for multidimensional fonnulation of the DFI'
and convolution," IEEE Trans. Acoustics, Speech, and Signal Processing, vol.
ASSP-25, p. 239-242, 1977.
,
~.~F. Catthoor and H.I. de Man, "Applicatoin-specific architecturalmethodologies for high' t~ughput digi~ signal and image processing," I 'ETrans. Accoust., Speech, SIgnal Processzng, vol. 38, no. 2, pp. 339-349, (
1990. ~
I. W. Cooley and I. W. Tukey, "An algorithm for the machine calculation of
complex Fourier series," Math. Comp., vol 19, pp. 297-301, 1965.
[11] A. L. DeCegama, Parallel Processing Architectures and VLSI Hardware,
Prentice Hall, 1989.
57
"\
[12] C. C. Foster, Content Addressable Parallel Processors, Van Nostrand
Reinhold, 1976.
[13] J. B. J. Fourier, The Analytic Theory of Heat, Cambridge, The University
Press, 1878.
[14] G. C. Fox, M. A. Johnson, G. A Lyzenga, S. W. Otto, J. K. Salmon, and D.
W. Walker, Solving Problems on Concurrent Processors, Prentice Hall, 1988.
[15] M. A. Frumkin, Systolic Computations. Kluwer Academic Publishers, 1990.
[16]' 1. Gertner and M. ,Shamash, "VLSI Architectures for multidimensional Fourier
transform processing," IEEE Trans. CompUl., vol. C-36, no. 11, 1987.
[17] G. Goertzel, "An algorithm for the evaluation of ftnite trigonometric series,"
American Math. Monthly, vol. 65, p. 34-35. 1958.
[18] I. J. Good, "The interactive algorithm and practical Fourier analysis," J.
Royal Stat. Soc., vol. B-20, p. 361-372. 1958.
[19] A. Gupta and V. Kumar, "The scalability of FFT on parallel computers,"
IEEE Trans. Parallel and DistribUled Systems, vol. 4, no. 8, 1993.
[20] M. Hasegawa and Y. Shigei, "AT2 =. O(N 10tN), T = O(log N) fast Fourier
transform in a light connected 3-dimensional VLSI," Computer Architecture
News., vol. 14, no. 2, p. 252-260, 1986.
[21] C. R. Hewes, R. W. Broderson, and D. D. Buss, "Applicaitons of CCD and
switched capacitor fJ.lter technology: Proc. IEEE, vol. 67, no. 10, p. 1403-
1415, 1979.
[22] R. M. Hord, Parallel Supercompuring in SIMD Architectures, CRC Press,
1990,,-
[23] K. H. Huang and J. A. Abraham, .. Algorithm-based fault tolerance for matrix
operations," IEEE Trans. COmpUl.. vol. C-33", no. 6, p. 518-528, 1984.
[24] J. J. Jou and J. A. Abraham, "Fault-colerant FFT networks," Proc. IEEE
FTCS-15 , p. 338-343, 1985.
[25] D. Kleitman, F. T. Leighton, M. Lepley, and G. L. Miller, "New layouts for
the shuffle-exchange graph," Proc. X111 ACM Symp. Theory of Comput., p.
334-341, 1981.
58
[26] D. P.'Kolba and T. W. Parks, "A prime factor FFf algorithm using high-
speed convolution", IEEE Trans. ASSP-25, p. 91-103, 1977.
[27] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI processor arrays,"
Sparse Matrix Proceed., SIAM Press, p. 256-282, 1978.
[28] H. T. Kung, "Why systolic architectures?" IEEE Com., vol. 15, pp. 37-46,
1982.
[29] C. Lanczos, Discourse on Fourier Series, Edinbugh, London, Oliver & Boyd,
1966.
[30] B. P. Lathi, Signals and Systems, Berkeley-Cambridge Press, 1987.
[31] M. H. Lee, "High speed multidimensional systolic arrays for discrete Fourier
transform," IEEE Trans. Circuit and Systems-II., vol. 39, no. 12, 1992.
[32] F. Lombardi, "Concurrent error detection and fault location in an FFf
architecture," IEEE Journal of Solid-State Circuits, vol. 27, no. 5, 1992.
[33] M. Malek and Y. H. Choi, "A fault-tolerant FFf processor," Proc. IEEE
FTCS-15 , p. 266-271, 1985.
[34] J. Miklosko, M. Vajtersic, I. Vrto, and R. Kle~, Fast Algorithms and Their
Implementation on Specialized Parallel Computers , Veda Publishing House of
the Slovak Academy of Sciences, 1989.
[35] H. J. Nussbaumer, "Fast computation of discrete Fourier transforms using
polynomial transforms," IEEE Trans. Acoust., Speech, Signal Processing, vol.
ASSP-27, pp. 169-181, 1979.
[36] D. P. O'Leary, "Systolic array~ for matrix transpose and other reorderings,"
IEEE Trans. Comput., vol. 36, no. 12, p.117-122, 1987.
[37] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing,
Prentice Hall, 1989.
[38] R. M. Owen and J. Ja'Ja', "A VLSI chip for the Winograd/prime factor
algorithm to compute the discrete Fourier transform," IEEE Trans. Acoust..
Speech, Signal Processing, vol. 34, pp. 979-988, 1986.
[39] R. M. Owens and M. J. Irwin, "The arithmetic cube," IEEE Trans. CompUl..
vol. C-36, no. 11, 1987.
59
[40] F. P. Preparata and J. E. Vuillemin, "The cube-connected cycles: A versatile
network for parallel computation," Commun. ACM., vol. 24, no. 5, p. 300-
309, 1981.
[41] L. R. Rabiner, R. W. Schafer, and C. M. Rader, "The chirp z-transform
algorithm," IEEE Trans. Audio Electroacoustics, vol. AU-17, p. 86-92, 1969.
[42] C. M. Rader, "Discrete Fourier transforms when the number of data samples
is prime," IEEE Proc., vol. 56, P 1107-1108, 1968.
[43] I. S. Reed and T. K. Truong, "A new hybrid algorithm for computing a fast
discrete Fourier transform," IEEE Trans., C-28, p. 487-492, 1979.
[44] D. J. Rose, "Matrix identities of the FFf," Linear Algebra and its
Applications, vol. 29, no. I, p. 423-443, 1980.
[45] K. Sapiecha and R. Jarocki, "Modular architecture for high performance
implementation of the FFf," IEEE Trans. Comput., vol. 39, pp. 1464-1468,
1990.
[46] J. E. Savage and S. Swamy, "Space-time trade-offs of FFf algorithm," IEEE
Trans. Information Theory, vol. 24, no. 9, p. 563-568, 1978.
[47] W. C. Siu, "Very fast discrete Fourier transform using number theoretic
transform," lEE Proc., vol. 130, G, no. 5, p. 201-203, 1983.
[48] M. 1. Skolnik, Introduction to Radar Systems, 2nd edition, McGraw-Hill Book
Company, 1980.
[49] D. L. Tao, C. R. P. Hartmann, and Y. S. Chen, "A novel concurrent error
detection scheme for FFf networks," Proc. IEEE FTCS, p. 114-119, 1990.
[50] C. D. Thompson, "Fourier transform in VLSI," IEEE Trans. Comput., vol.
32, no. 11, p. 1047-1057, 1983.
[51] A. Trew and G. Wilson, Past, Present, Parallel: A Survey ofAvailable
Parallel Computer Systems, Springer-Verlag, 1991.
[52] S. Winograd, "On computing the discrete Fourier transform," Math. Comput.,
vol. 32, no. 141, pp. 175-199, 1978.
[53] K. Yamashita, "A wafer scale 170,OOO-gate FFf processor with built-in test
circuits," IEEE J. Solid-State Circuits, vol. 23, no. 2, p. 336-341, 1988.
60
[54] E.L. zapata and F. Arguello, "A VLSI Constant Geometry Architecture for
the Fast Hartley and Fourier Transforms," IEEE Trans. Parallel and
Distributed Systems, vol. 3, no. 1, pp. 58-70, 1992
[55] C. N. Zhang, "Multi-dimensional systolic networks for discrete Fourier
transform," Proc. 11th Annu. Int. Symp. Computer Arch., p. 21-27, 1984.
[56] __, "Fourier transforms in VLSI," IEEE Trans. Comput., vol. C-32, pp.
1047-1057, 1983.
61
Vita
Date of Birth September 16, 1967.
Place of Birth Shanghai, China.
Education Lehigh University, Bethlehem, Pennsylvania.
M.S. Computer Engineering, degree expected December 1993.
Lafayette College, Easton, Pennsylvania.
B.S. Electrical Engineering, B.S. Mathematics, May 1992.
Dana Scholar.
Experience
1992-Present
1991-1992
1989-1991
Summer 1991
Summer
1990, 1989
Computer Engineer, Communication, Automation & Control, Inc., Allentown,
Pennsylvania.
Dana Research Assistant, Department of Electrical Engineering, Lafayette
College.
I
System Management: Academic Computing Center, Lafayette College.
WHOI Summer Research Fellow, Woods Hole Oceanographic Institution,
Woods Hole, Massachusetts.
Dana Summer Research Assistant, Department of Mathematics, Lafayette
College.
62

