A single chip low power asynchronous implementation of an FFT algorithm for space applications by Stevens, Kenneth & Hunt, B. W.
A Single Chip Low Power Asynchronous Implementation of an FFT
Algorithm for Space Applications

B. W. Hunt    K. S. Stevens†
B. W. Suter    D. S. Gelosh
Department of Electrical and Computer Engineering
Air Force Institute of Technology
WPAFB, OH 45433
Abstract
A fully asynchronous fixed point FFT processor is introduced for
low power space applications. The architecture is based on an
algorithm developed by Suter and Stevens specifically for a low
power implementation. The novelty of this architecture lies in its
high localization of components and pipelining with no need to
share a global memory. High throughput is attained using large
numbers of small, local components working in parallel. A
derivation of the algorithm from the discrete Fourier transform is
presented, followed by a discussion of circuit design parameters,
specifically those relevant to space applications. A survey of this
application specific architecture is included with a detailed look
at the design of the complex-valued Booth multiplier to
demonstrate the design methodology of this project. Finally,
simulation results based on layout extractions are presented and
an outline for future work is given.
1 Introduction
This paper discusses the background, design, and implementation
methodology of an asynchronous fixed point FFT processor. The
algorithm used to perform the FFT was specifically tailored for
low power hardware realizations using asynchronous communication.
This project has been motivated by several factors. First, a
formal mathematical approach to develop low power high performance
numerical applications is being investigated by Suter and
Stevens [8]. An implementation of this circuit is being designed
and will be fabricated to accurately evaluate and validate the
power benefits of this formal approach. The effort to take an
academic paper design to a complete integrated circuit is also
motivated by the desire to understand interface issues more
accurately as well as validate the design methodology that will be
presented later. The need for low power signal processing in space
applications is particularly vexing, and presented an excellent
target for this design.

* This work is supported by a grant from the Space Technology
Directorate of Phillips Laboratory, Kirtland Air Force Base, New
Mexico.
† Now with Intel Corp.

A brief mathematical derivation of the architecture based on
wavelet theory is presented. Background information regarding
micro-electronic radiation hardening and how it affects low power
designs is discussed. A brief description of the overall
architecture is presented, followed by the design methodology
using a simple component of the design as an example. The paper
concludes with some simulation power and speed results as well as
expected results from future test chips.
2 Background
2.1 The Suter/Stevens FFT Algorithm
The Discrete Fourier Transform, or DFT, is a com-
mon signal processing algorithm that supports a vari-
ety of uses. The FFT or Fast Fourier Transform reduces
the complexity, in terms of multiplication and additions,

Briey, a wavelet approach was applied by Suter and
Stevens to create a hardware realization that parallelizes
and localizes the computations of an FFT in such a
way as to reduce the overall power consumption. We

$= 0, 1, \ldots, N_2 - 1$. If you are given a se-





+k) where $k = 0, 1, \ldots, M - 1$. Now, the original $N$ point
FFT problem can be divided into equiv-

). Using this polyphase nota-

notation. The motivation behind manipulating the FFT into the form
of Equation 14, and how this maps to a low power implementation,
will be clearer following a discussion of the general architecture
in Section 3.1.
General Notes on the Suter/Stevens Algorithm





blocks to be a smaller instantiation of the algorithm. For
instance, if $N = 1024$ where $N_1 = 16$, and $N_2$

Four point FFTs can be implemented without multiplication, so a
hierarchical decomposition which maps the leaf nodes to four-point
FFTs is the most efficient realization. Our test chip design
implements a simple




= 4 so that all the compo-
nents of a larger FFT circuit are included. From this,
it is easy to see that FFT point sizes that are an even
power of the base FFT point size work best.
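The observation that a four-point FFT needs no multiplier can be checked with a short sketch. The Python below is illustrative only and is not taken from the test chip: the name `fft4` is an assumption, and Python complex numbers stand in for the fixed point data words. The twiddle factors of a four-point transform are 1, -j, -1, and j, so every product reduces to a sign change or a swap of the real and imaginary parts.

```python
def fft4(x):
    """4-point DFT of a 4-element sequence using only additions and subtractions."""

    def times_minus_j(z):
        # (a + jb) * (-j) = b - ja: a swap and one sign flip, no multiplier.
        return complex(z.imag, -z.real)

    s02, d02 = x[0] + x[2], x[0] - x[2]
    s13, d13 = x[1] + x[3], x[1] - x[3]
    r = times_minus_j(d13)
    # X(0), X(1), X(2), X(3)
    return [s02 + s13, d02 + r, s02 - s13, d02 - r]


# A unit impulse delayed by one sample yields the four twiddle factors.
print(fft4([0, 1, 0, 0]))
```

Each output costs two additions on top of the shared first-rank sums and differences, which is consistent with using four-point blocks as the leaves of the hierarchy.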
Although a synchronous implementation of this algorithm is
realizable, the multirate algorithm based on concurrent execution
of each decimated sequence maps very well to an asynchronous
implementation. The shifts between time domains occur naturally
and with minimal energy expense.
2.2 Designing for Space Applications
In the harsh environment of space, there are many radiation
hazards that must be eliminated or minimized for a circuit to
perform reliably. Several types of radiation are harmful to CMOS
circuits and have differing effects, including neutron radiation,
ionizing radiation, and total-dose radiation. Each one of these
forms of harmful radiation can cause a single event effect (SEE),
which is defined as either a hard or soft error. Hard errors
include latchup, burnout, gate rupture, frozen bits, noise in
CCDs, and snapback. Soft errors include bit upsets in memories or
registers or transient signals in logic circuits. These soft
errors are commonly referred to as single event upsets, or
SEUs [1].
Ionizing radiation, for example, is caused mainly by gamma rays
and x-rays as well as other minor sources and primarily affects
the oxide layers of a CMOS circuit [2]. Upon irradiation,
electron-hole pairs are generated and evenly distributed
throughout the SiO$_2$ layer. Many of these pairs recombine within
100sec, but some free electrons are swept out by the electric
field in the gate insulator. The trapped holes that remain in the
insulator cause a negative shift in the MOSFET threshold voltage.
Over time, the holes slowly migrate toward the most negative
potential within the SiO$_2$. If this most negative potential is
the channel, the holes will tend to migrate toward the
insulator-channel interface, decreasing $V_t$ from its pre-rad or
initial value. If this most negative potential is the gate, then
the holes migrate toward the insulator-gate interface, decreasing
$V_t$ from the initial value. After a period of time, the holes are




to shift back to-
ward its initial value [3, 7].
The key parameter change caused by ionizing radiation is the
threshold voltage shift. For P-FETs, $V_t$ is shifted negatively
at all dose levels because the trapped holes in the oxide and the
interface states work together. For N-FETs, $V_t$ is shifted
negatively at low dose levels and positively at high dose
levels [3].
A thin gate oxide allows fewer electron-hole pairs to be formed,
mitigating the effects of ionizing radiation. A reentrant form of
the N-FET also keeps the area of the field oxide at a minimum,
further reducing the amount of oxide that can be "ionized".
The following section discusses how rad hardening affects low
power CMOS designs.
2.3 Designing Low Power Space Applications
Solar panels and nuclear generators are the only ways a satellite
can acquire energy. Therefore both peak and standby power must be
kept to a minimum.
A common method of reducing power consumption in integrated
circuits is to lower the supply voltage, yielding a quadratic
improvement in power at a linear cost in performance. However,
scaling the voltage of a CMOS circuit makes it more susceptible to
SEU because the noise margin between a logic high and a logic low
is reduced. SEU possibilities are further exacerbated by the
threshold shifts that occur under radiation. Therefore, voltage
scaling must be used judiciously and in general has more
restrictions than in earthbound electronics.
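The quadratic dependence on supply voltage can be made concrete with the usual first-order switching power model, P = C V^2 f. The sketch below is only that model, not a result from this design; the function name and parameters are assumptions chosen for illustration, and it ignores leakage, short-circuit current, and the threshold shifts discussed above.

```python
def dynamic_power_ratio(v_new, v_old, f_new=1.0, f_old=1.0):
    """Ratio of switching power under the first-order model P = C * V^2 * f.

    The capacitance C is assumed unchanged, so it cancels out of the ratio.
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Dropping the supply from 5.0 V to 3.3 V at an unchanged switching rate.
print(round(dynamic_power_ratio(3.3, 5.0), 3))
```

Any reduction in operating frequency that accompanies the lower supply multiplies this ratio further, which is the linear performance cost noted above.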
The power and complexity required to implement many CMOS functions
can be reduced using circuit techniques such as dynamic,
pre-charge, and pass-gate logic. Unfortunately these techniques
are also to be avoided since the single event effects can prey
very easily on these structures. Design is largely limited to
static logic gates.
Fortunately standard CMOS processes can be used. A radiation
tolerant¹ cell library developed jointly by Mission Research
Corporation (MRC) and Air Force Research Laboratory [4]
specifically for the HP 0.8 µm process via MOSIS has been used for
our test chip. However, the radiation requirements result in
devices much larger than would be used otherwise, resulting in
additional power consumption. For example, the minimum
size inverter width in this cell library is 50  for the N-
FET and 90  for the P-FET! With the additional rad
tolerant characteristics, the total size of the inverter cell
is 42  119 .
¹ Rad tolerant implies the ability to withstand 100 krad(Si) and
maintain functionality, whereas rad hard implies the ability to
withstand 1 Mrad(Si).
Architecture becomes the primary means of reducing power in space
applications, due to constraints on voltage scaling, circuit
structure, and device size. This FFT architecture implemented
using asynchronous circuits significantly reduces the power
compared to other space worthy designs. The most significant
contributions to low power in this architecture are twofold.
First, the algorithm has been designed to maximize locality,
point-to-point data pipelining, and hierarchy. The only shared
structures in the design are the decimator inputs, expander
outputs, and pipelined crossbar switch (all discussed in
Section 3.1). Second, the frequency is greatly reduced by
decimation, allowing devices and drivers to be undersized. This
can significantly reduce the capacitance of transmitting data
signals. For example, assume a 100 MHz sample rate of a 256-point
FFT. This design requires additions at a low rate of one every
320 ns in the leaf FFT cells. The pipelined crossbar transmits one
data word across each row and column every 160 ns. This allows
ample latency to size devices optimally. The asynchronous
implementation technology allows the common frequency changes to
be supported




There are six major components required to implement the FFT of
Equation 14. The block diagram of Figure 4 shows how each of the
components fits into the computation. First of all, the data is
decimated in time into $N_2$ sequences of length $N_1$. Then, the
FFT of each sequence is computed (the interior sum-







). After the complex multiply, the partially transformed data is
interleaved, as required by the FFT, through a pipelined crossbar
switch. This pipes a data stream to the $N_2$ point FFT blocks.
Finally the data, which is decimated in frequency, goes to an
expander to correctly sequence each fully transformed element in
the output sequence as $X(m)$ and regenerate the input frequency.
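The data flow just described (decimate, small FFTs, complex multiply, crossbar, small FFTs, expand) can be sketched functionally. The Python below uses the standard two-factor decomposition with N = N1 * N2; it is assumed to match the block diagram in structure, but it is not a transcription of Equation 14, and the names `dft` and `decomposed_fft` are illustrative. Floating point complex values stand in for the fixed point hardware words.

```python
import cmath


def dft(x):
    """Direct DFT, used both as the reference and as the small FFT blocks."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]


def decomposed_fft(x, n1, n2):
    """N = n1 * n2 point DFT built from n2 transforms of length n1 and n1 of length n2."""
    n = n1 * n2
    assert len(x) == n
    # Decimator: n2 interleaved sequences, each of length n1.
    seqs = [x[r::n2] for r in range(n2)]
    # First rank: an n1-point transform of every decimated sequence.
    inner = [dft(s) for s in seqs]
    # Complex multiply by the constants W_N^(r * m1); for r = 0 they are all 1 + j0.
    tw = [[inner[r][m1] * cmath.exp(-2j * cmath.pi * r * m1 / n)
           for m1 in range(n1)] for r in range(n2)]
    out = [0j] * n
    for m1 in range(n1):
        # Crossbar: gather one value from every sequence for each n2-point block.
        col = dft([tw[r][m1] for r in range(n2)])
        # Expander: restore natural output order, X(m1 + n1 * m2).
        for m2 in range(n2):
            out[m1 + n1 * m2] = col[m2]
    return out


x = [complex(i, -0.5 * i) for i in range(16)]
err = max(abs(a - b) for a, b in zip(decomposed_fft(x, 4, 4), dft(x)))
print(err < 1e-9)
```

Every stage touches only its own sequence until the crossbar regroups the data, which is the locality property the architecture relies on.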
3.2 Specic Architecture for This Project
For test chip, the minimum iteration necessary to
demonstrate the functionality of the algorithm was de-

blocks, drastically saving on area with only a minimum of control
overhead.
Control operates in data-flow pipelines, using data bundling and a
four-phase handshaking protocol between each stage. Control of
every major component is implemented using one-hot encoded state
machines [6]. There are 16 unique burst-mode asynchronous finite
state machines (AFSMs) used in the control structures. Some of the
AFSMs are very simple, like the ones used to control register
locking, which have 3 states with 2 inputs and 2 outputs. Others
are more complex, like the multiplier controller, which has 9
states with 8 inputs and 6 outputs.
² Patent pending.
3.3 Formal Specications
Table 1 displays the CCS specications for a 2:1 four-
cycle decimator and expander controller. Larger deci-
mators and expanders can be specied by increasing the
number of interfaces DIFC or EIFC. While we have sev-
eral implementations of these circuits for our test chip,
we will leave the design and synthesis of these cells as
a challenge for the synthesis tools in the asynchronous
community.
Table 1: CCS Specications for Decimator and Ex-
pander Cells
agent DIFC = tn.r.a.r.a.DIFC
agent DIN = req.t1.ack.req.ack.req.t2.ack.req.ack.DIN
agent DECIMATOR = (DIN j DIFC[t1/tn, r1/r, a1/a] j
DIFC[t2/tn, r2/r, a2/a]) \ f t1, t2 g
agent EIFC = r.tn.a.r.a.EIFC
agent EOUT = t1.req.ack.req.ack.t2.req.ack.req.ack.EOUT
agent EXPANDER = (EOUT j EIFC[t1/tn, r1/r, a1/a] j
EIFC[t2/tn, r2/r, a2/a] \ f t1, t2 g
4 Design Process
4.1 Design Methodology
Our design methodology uses VHDL as the central simulation and
specification language. Designs are refined using VHDL, and when
we are satisfied with the simulation results we then implement the
blocks. Although there are many tools available today for
asynchronous design, the necessary tools are not integrated into a
tool flow, so the simulation, synthesis, and implementation steps
were disjoint.
Generic timing diagrams and Petri nets were devised to demonstrate
the flow of control and data through the computation and to help
understand interface timing and sequencing. Burst mode AFSM
specifications were derived and synthesized by the 3D [10] and
MEAT [5] tools. The synthesized equations were written in
behavioral VHDL, then analyzed and simulated with a test bench
modeling the timing diagrams to validate the burst specifications.
Once the VHDL was confirmed, the static logic equations were laid
out using the MRC cell library. Some of the asynchronous designs
were testable with IRSIM (v9.02 from the Berkeley tool suite) and
others would only function using SPICE (Avanti Corporation HSPICE
v95.1). Occasionally, it would be found that a design was too big
to be practical or its function could not deal with the simulated
real hazards of the circuit, and the design would need to be
repartitioned or redesigned.
4.2 An Example of Iterative Component
Improvement
The design of the complex multiply block of Figure 4 will be used
as a design example. This is the most time-critical component in
the design, since a complex multiply must occur every 160 ns to
meet the intended 10 ns sample period. A radix-4 Booth
multiplication algorithm was chosen.
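For readers unfamiliar with radix-4 Booth recoding, a functional sketch follows. It models only the arithmetic, not the shift-and-add data path or its control; the function name is hypothetical, and the 16-bit operand width is an assumption made for illustration. Each step inspects an overlapping group of three multiplier bits and adds -2, -1, 0, +1, or +2 times the multiplicand, so a 16-bit multiplier is consumed in eight steps.

```python
def booth_radix4_multiply(multiplicand, multiplier, bits=16):
    """Signed multiply by radix-4 Booth recoding of a `bits`-wide multiplier."""
    assert bits % 2 == 0
    assert -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1))
    # Booth digit selected by each 3-bit group (b[2i+1], b[2i], b[2i-1]).
    digit = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
             0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    # Two's-complement bit pattern with the implicit b[-1] = 0 appended below bit 0.
    y = (multiplier & ((1 << bits) - 1)) << 1
    product = 0
    for step in range(bits // 2):
        group = (y >> (2 * step)) & 0b111
        # Each step is weighted by 4**step, i.e. a two-bit shift per step.
        product += (digit[group] * multiplicand) << (2 * step)
    return product


print(booth_radix4_multiply(1234, -567) == 1234 * -567)
```

Because the digit depends only on the multiplier bits, the same decoded instruction can drive more than one data path at once.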
The shift-and-add control was implemented with a sequence of nine
one-hot cells (eight for shifting, and the ninth to indicate a
"done" condition). On each successive pulse, a different
combination of three bits is enabled onto the three decoding lines
going to the Booth

Figure 1: Dual Fixed Point Multiplier
Note that two multiplications are performed in parallel using a
single control unit. As can be seen from Figure 4, the FFT-4
produces a pipelined stream of real and imaginary data words to
the complex product block. This data is multiplied in the product
block by a complex constant. The FFT-4 first produces the real
data component followed by the imaginary data component. Since
both the real and imaginary constants are always available (stored
locally in a static register bank), time and area are conserved by
decoding the Booth instructions from the FFT-4 input and producing
two partial products simultaneously. Although all the control in
the multiply unit can be shared, only the control for the ALUs is
illustrated to simplify the figure. Since the Booth instruction
will be the same for both multiplies, the AREQ signal is routed to
both ALUs. A C-element is required to synchronize the AACK signals
from each ALU because the ALU operating time is data dependent and
the real and imaginary constant additions will likely complete at
different times. Each data path has its own ALU, shift register,
and constant storage.
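The role of the C-element in joining the two AACK signals can be illustrated with a behavioral model. This is a generic two-input Muller C-element, not the cell used on the test chip, and the class name is an assumption: the output changes only when both inputs agree and otherwise holds its previous value.

```python
class CElement:
    """Two-input Muller C-element: the output follows the inputs only when they agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a   # both inputs agree: output takes their common value
        return self.out    # inputs differ: previous output is held


c = CElement()
print(c.update(1, 0))  # one ALU has acknowledged, the other has not
print(c.update(1, 1))  # both have acknowledged
print(c.update(0, 1))  # one acknowledge released, output still held
print(c.update(0, 0))  # both released
```

The combined acknowledge therefore rises only after the slower of the two ALUs finishes, and falls only after both have returned to zero, which suits a four-phase handshake.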
A complex multiply requires four integer multiplications, an
integer addition, and an integer subtraction. We use the
"multiplier²" block with two adders and registers to complete a
complex multiply. Table 2 gives the step by step procedure of the
complex multiply operation corresponding to Figure 2 parts (a)
to (d).
Step  Operation                                              Figure
1     Receive Re{X} from FFT-4                               2(a)
2     Multiply Re{X} by Re{Y} and Im{Y}, creating the two
      partial products XrYr and j·XrYi                       2(b)
3     Receive Im{X} from FFT-4                               2(b)
4     Multiply Im{X} by Re{Y} and Im{Y}, creating the two
      partial products j·XiYr and XiYi                       2(c)
5     Subtract XiYi from XrYr to produce Re{Z} and
      add j·XrYi and j·XiYr to produce Im{Z}                 2(d)
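The five steps above can be written out as plain integer arithmetic. The sketch below follows the ordering of the table but is otherwise an illustration: the function name is hypothetical, and fixed point scaling, rounding, and the latching between steps are not modeled.

```python
def complex_multiply_steps(x_re, x_im, y_re, y_im):
    """Z = X * Y using four real multiplies, one subtraction, and one addition."""
    # Steps 1-2: Re{X} arrives; the dual multiplier forms both partial products.
    xr_yr = x_re * y_re
    xr_yi = x_re * y_im
    # Steps 3-4: Im{X} arrives; the second pair of partial products is formed.
    xi_yr = x_im * y_re
    xi_yi = x_im * y_im
    # Step 5: combine the latched and the current products.
    z_re = xr_yr - xi_yi
    z_im = xr_yi + xi_yr
    return z_re, z_im


# (3 + 4j) * (5 - 2j)
print(complex_multiply_steps(3, 4, 5, -2))
```

Both products in each pair share the same incoming operand, which is why a single Booth decode can serve two multiplications.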

Figure 2: Complex Multiplier Data Path Operation
As shown in Figure 2, each output of the multiplier² block
connects to one A and B input of an adder and subtracter. Each A
input contains a latch to store the integer product of the first
two multiplications as described in step 2 of Table 2. The second
multiplication result can be passed directly to the
adder/subtracter and used with the latched value to produce the
complex valued result. Figure 2(a) shows the circuit after the
arrival of Re{X}. The first two partial products have been
computed and latched in Figure 2(b). Since the FFT-4 will likely
produce its outputs faster than the multiplier can use them, Im{X}
will probably arrive early. Despite this, Im{X} will not be used
until after it is latched in step 3 of Table 2. The second
multiplication pair has completed in Figure 2(c) and is held
statically on the data lines. The final step of Table 2 occurs
when all four integer multiplication products are present, and the
final complex products can be computed and latched into the
crossbar switch, as shown in Figure 2(d).
5 Results
We are able to obtain some preliminary power and timing results
based on SPICE simulations on extracted layouts. The MRC cell
library is designed to be maximally radiation tolerant when Vdd is
5.0 Volts. However, 2.2V is customary for many of today's low
power designs due to the benefits of voltage scaling. Our
preliminary numbers use a middle-ground Vdd of 3.3 Volts. VHDL
simulations at this voltage have been projected to the system
timing chart of Figure 3. We have included results with Vdd of
5.0V and 2.2V along with the baseline of 3.3 Volts to examine the
MRC operating

[Figure 3 timing chart: horizontal axis Time (ns), ticks at 0, 320, 640, 960, 1280, 1600]
Figure 3: FFT-16 System Timing @ Vdd=3.3V
Notice how the duty cycles of all the major components overlap.
The actual amount of overlap (pipelining) will vary depending on
the data, but this figure gives a good timing estimate. The
asterisk by the Mult(0) line in Figure 3 is there to indicate that
this is not an actual multiplier. The constants for the 0th
sequence are all equal to 1 + j0, so no multiplication is
required. In place of a multiplier, a holding register is used so
a full 32-bit complex value is sent to the crossbar switch. Based
on empirical SPICE data, we are able to extrapolate system and
component timing to Vdd = 5.0 Volts and Vdd = 2.2 Volts. Table 3
shows the timing comparison between the three Vdd levels.
Table 3: Latency by Element (ns)
Vdd (V)  Dec   FFT-4  Mult   C-bar  Exp   FFT-16
5.0      240   480    480    510    390   1200
3.3      320   640    640    680    520   1600
2.2      531   1062   1062   1129   863   2656
Using the actual SPICE power numbers for each com-
ponent running at the frequencies in Table 3, we are
able to compile projected power consumption for the
FFT-16. The component and system power consump-
tion numbers are given in Table 4.
Table 4: Power Consumption by Element (mW)
Vdd (V)  Dec   FFT-4  Mult   C-bar  Exp   FFT-16
5.0      5.8   182    350    114    74    1076
3.3      1.1   45     86     25     18    264
2.2      0.6   11     22     9      5     67
Then, using the scalability properties of each segment, energy
consumption for larger point sizes can be determined as in
Table 5.
Table 5: Extrapolation of Energy Consumption (µJ)
Vdd     FFT-16  FFT-32  FFT-128  FFT-256  FFT-1024
5.0 V   1.29    3.29    26.2     68.0     494.9
3.3 V   0.422   1.07    8.4      21.6     153.25
2.2 V   0.178   0.455   3.7      9.9      77.5
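The FFT-16 column of Table 5 appears to be the product of the FFT-16 power in Table 4 and the FFT-16 latency in Table 3. That relationship is inferred here from the published numbers rather than stated in the text, and the names in the sketch are illustrative. The check below multiplies the two table entries and converts to microjoules.

```python
# Vdd in volts -> (FFT-16 power in mW from Table 4, FFT-16 latency in ns from Table 3)
fft16 = {5.0: (1076, 1200), 3.3: (264, 1600), 2.2: (67, 2656)}


def energy_uj(power_mw, latency_ns):
    """Energy in microjoules: 1 mW * 1 ns = 1e-12 J = 1e-6 uJ."""
    return power_mw * latency_ns * 1e-6


for vdd, (power_mw, latency_ns) in fft16.items():
    print(vdd, round(energy_uj(power_mw, latency_ns), 3))
```

The three products agree with the FFT-16 entries of Table 5 to the precision given there, which also indicates that the table is expressed in microjoules.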
Since these numbers are very rough estimates, direct comparisons
against current FFT chips are not that conclusive. These
comparisons are still drawn to show that, despite using a power
hungry cell library and disregarding many known power reduction
techniques, similar power efficiency numbers can be achieved with
architecture and asynchronous design techniques.
Table 6: Energy Efficiency Comparison
Chip                 Vdd (V)  Power (mW)  Time (µs)  Energy/unit-transform (nJ)
This Work            5.0      2578        192        483
Plessey PDSP16510A   5.0      3000        98         287
This Work            3.3      599         256        149
Spiffee1             3.3      845         30         24.7
This Work            2.2      182         425        75
Spiffee1             2.5      339         42         13.9
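The last column of Table 6 can be reproduced from the power and time columns if the transform is taken to be 1024 points and the energy is divided by the point count. Both of those are assumptions inferred from the numbers, not definitions given in the text, and the function name is illustrative.

```python
def energy_per_point_nj(power_mw, time_us, points=1024):
    """Energy per transform point in nJ (1 mW * 1 us = 1 nJ), for an assumed point count."""
    return power_mw * time_us / points


rows = [
    ("This Work, 5.0 V", 2578, 192),
    ("Plessey PDSP16510A, 5.0 V", 3000, 98),
    ("This Work, 3.3 V", 599, 256),
    ("Spiffee1, 3.3 V", 845, 30),
    ("This Work, 2.2 V", 182, 425),
    ("Spiffee1, 2.5 V", 339, 42),
]
for name, power_mw, time_us in rows:
    print(name, round(energy_per_point_nj(power_mw, time_us), 1))
```

For the 5.0 V row of this work, power times time is also close to the FFT-1024 energy listed in Table 5, which supports reading the time column in microseconds.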
Table 6 gives a rough comparison to the SPIFFEE project at
Stanford University and a commercial FFT processor (the Plessey
PDSP16510A). It is fair to point out that the Plessey DSP chip
uses a block-float data format instead of fixed point, which
accounts for some of the additional energy required. The
figure-of-merit we are using, that of energy consumed per unit
transform, compares the energy efficiency of the architecture in
generating a result and is independent of the frequency. We must
also point out that the sample frequency of our asynchronous
design is considerably faster than that of any of these
processors. The numbers for this project will remain fairly
constant for larger point sizes due to the hierarchical nature of
our FFT algorithm. However, as the point-size grows, additional
hierarchical layers are required which will result in increased
power consumption.
The 2.2 Volt Vdd entry of "This Work" in Table 6 probably could
not be used in space because of the single event effects discussed
in Section 2.2. It is presented here to show how the efficiency
FOM scales between the different Vdd levels.
6 Conclusion
This paper shows a formal mathematical approach
that has been directed towards architecting low power
FFT circuits. The architecture relies on localization
techniques, pipelining, and frequency shifts of deci-
mation and expansion. Asynchronous implementation
techniques are a particularly appealing low power imple-
mentation methodology for such an architecture due to
the ability to shift between various frequencies at no ad-
ditional cost in power or complexity. The asynchronous
pipeline supports both rapid computation and minimal
energy dissipation during idle periods.
It is prudent to say that the methodology we are using is
effective for this project. Seldom will the best design arise from
the first specification. Typically a design will go from
specification to implementation before many important design
considerations become evident. Our methodology based on VHDL, as
both a specification language and simulator, was very helpful in
directing our top level design because handshake protocol
inconsistencies are detected very easily due to the way VHDL
performs value resolution. However, VHDL is a bit unwieldy at
times because it does not recognize dynamic values on a bus, as
well as a few other minor problems. The hand layout was fairly
easy because the MRC library cells fit together in a gate-array
format. However, accurate circuit simulations become difficult as
the design grows, requiring modular approaches to circuit
analysis.
The numbers attained from the extracted SPICE simulations show
surprisingly low power for the device sizes required by the
rad-tolerant library. We surmise that a significant effect is due
to the synergy between asynchronous design and the architecture.
Our preliminary results also give significant motivation for
future design exploration and test chips, particularly for
radiation tolerant applications. Should this project be extended
to an implementation that is not radiation tolerant, we could
expect an additional reduction in power consumption beyond what is
presented here, achieved through smaller transistor widths,
pre-charge logic, and other circuit level optimizations.
Acknowledgments
We would like to thank several people for making this project
possible. First of all, our sponsor, Charles Brothers, who is the
Chief of the Applied Technologies Branch of the Space Technology
Directorate, Phillips Laboratory, Kirtland Air Force Base, New
Mexico. Additionally, we would like to thank further personnel at
the Air Force Institute of Technology, namely David Gallagher,
Jeff Butler, and Sam SanGregory. Each has contributed direct
support on this project or provided the groundwork for it to
begin.
References
[1] Joseph Azarewicz and Sheldon Jurist. Short course slides.
Introduction to Single Event Effects. Physitron, Inc., Huntsville,
AL, May 1995.
[2] Joseph Azarewicz and Sheldon Jurist. Short course slides.
Ionizing Radiation Effects in Electronics. Physitron, Inc.,
Huntsville, AL, May 1995.
[3] Charles P. Brothers Jr. Rapid and Accurate Timing Simulation
of Radiation-Hardened Digital Microelectronics Using VHDL. PhD
thesis, Air Force Institute of Technology (AU), March 1994.
[4] Charles P. Brothers Jr., Joseph R. Chaves, David R. Alexander,
and David G. Mavis. Radiation Hardening Techniques for
Commercially Produced Microelectronics for Space Guidance and
Control Applications. In 20th Annual American Astronautical
Society Guidance and Control Conference, February 1997.
[5] William S. Coates, Al L. Davis, and Kenneth S. Stevens.
Automatic Synthesis of Fast Compact Self-Timed Control Circuits.
In IFIP Working Conference on Design Methodologies, pages 193-208,
April 1993.
[6] Lee A. Hollaar. Direct Implementation of Asynchronous Control
Units. IEEE Transactions on Computers, Vol. C-31, No. 12,
pp. 1133-41, December 1982.
[7] George C. Messenger and Milton S. Ash. The Effects of
Radiation on Electronic Systems. New York, NY: Van Nostrand
Reinhold, 1992.
[8] Bruce W. Suter and Kenneth S. Stevens. Low Power, High
Performance FFT Design. In A. Sydow, editor, Proceedings of IMACS
World Congress on Scientific Computation, Modeling, and Applied
Mathematics, number 1, pages 99-104, June 1997.
[9] Bruce W. Suter. Multirate and Wavelet Signal Processing. San
Diego, CA: Academic Press, 1997.
[10] Kenneth Yi Yun. Synthesis of Asynchronous Controllers for
Heterogeneous Systems. PhD thesis,

Figure 4: General Block Diagram of the Suter/Stevens
FFT Algorithm
