Design and Implementation of Scalable FFT Processor for Wireless Applications by Revanna, Deepak
DEEPAK REVANNA
Design and Implementation of Scalable FFT Processor for Wireless Applications
Master of Science Thesis
Examiners:
Prof. Jari Nurmi
M.Sc. Omer Anjum
Examiners and topic were approved in the Computing and Electrical Engineering
Faculty Council meeting 15.Aug.2012.
ABSTRACT
TAMPERE UNIVERSITY OF TECHNOLOGY
Master's Degree Programme in Information Technology
DEEPAK REVANNA: Design and Implementation of Scalable FFT Processor
for Wireless Applications
Master of Science Thesis, 63 Pages, 5 Appendix Pages
March 2013
Major: Electronics and Communications Engineering
Examiners: Prof. Jari Nurmi, M.Sc. Omer Anjum
Keywords: FFT, FPGA, OFDM, SDR, Processor
In the recent past, communication has shifted predominantly from wired to wireless.
A radio signal transmitted over a wireless channel is generally subject to more
distortion, more interference and more noise than a signal over a wired channel. In
other words, the SNR of a signal received over a wireless channel is lower than that
of a signal received over a wired channel. Hence, wireless communication systems have
to be more robust and efficient in recovering the original data from the received
signal. Wireless communication systems these days adopt efficient multi-carrier
transmission techniques such as OFDM in their transceivers, and the majority of
commercial wireless standards are OFDM based.
OFDM based wireless standards demand highly efficient baseband hardware in
communication systems. The baseband hardware needs to meet stringent design
requirements: high speed, low power, low area, low cost, high flexibility and high
scalability. Modern wireless systems support multiple standards to meet the demands
of end user applications, and a system that supports multiple standards must satisfy
the performance requirements of each of them. Wireless transceivers based on an SDR
platform support multiple wireless standards, and meeting the performance
requirements of all of them is a challenge when designing baseband hardware. To
design efficient OFDM baseband hardware, it is necessary to design its performance
critical components efficiently. FFT computation is one of the most performance
critical components in an OFDM system. Designing FFT hardware that supports multiple
wireless standards while meeting the above requirements is a challenging task.
In this thesis work, a novel N-point scalable FFT processor architecture was
proposed. A radix-2, fixed point, 16-bit, N-point scalable FFT processor was designed
in VHDL and prototyped on an Altera Stratix V FPGA device 5SGSMD5K2F40C2. The
processor was implemented targeting SDR platforms supporting multiple OFDM based
wireless standards. The processor operates at a maximum frequency of 200 MHz and uses
less than 1% of the hardware resources on the FPGA. It meets the performance
requirements of OFDM based wireless standards such as IEEE 802.11a/g, IEEE 802.16e,
3GPP-LTE, DAB and DVB-T. The FFT processor based on the proposed architecture
performs better in terms of speed, flexibility and scalability than existing fixed as
well as variable length FFT processors.
PREFACE
When I wanted to start work on my thesis, I approached Prof. Jari Nurmi for an
opportunity, and he responded positively. He provided me with all the facilities and
the environment to work in the department. He was a mentor and a guide, and he
consistently supported my work during bad times as well as good times. Hence, I would
like to express my sincere gratitude to Prof. Jari Nurmi for all his support.
Mr. Omer Anjum provided me with the research topic, shared his knowledge and guided
me throughout my thesis work. I thank him for guiding, supporting and supervising me
throughout my work. Mr. Roberto Airoldi was also my mentor; he shared his technical
knowledge and helped me during my work. He also taught me how to write research
papers for publication in international conferences. I would like to thank him for
all his guidance, support and shared knowledge, and I cherish the moments of sharing
office space with him. I express my thanks to Mr. Manuele Cucchi for his technical
help and discussions during the course of my work. I would also like to thank
Ms. Leyla Ghazanfari for being a very supportive and encouraging colleague. My family
has loved, cared for and supported me during tough times as well as good times. I am
eternally grateful to my father Revanna, my mother Mangalamma and my brother Nagendra
Prasad for showering me with their unconditional love and support.
Tampere, 22.Mar.2013
Deepak Revanna
CONTENTS
1. Introduction
2. Background
2.1 Wireless OFDM Systems
2.2 OFDM
2.3 OFDM Based Wireless Standards
2.4 Fast Fourier Transform
2.5 Research Work On FFT Processors
3. Scalable FFT Processor
3.1 Internal Architecture
3.2 Butterfly Unit
3.2.1 Bit Parallel Multiplier
3.3 Data Memory (RAM)
3.4 Twiddle Factor Memory (ROM)
3.5 Interconnect
3.6 Address Generation Unit
3.7 Control Unit
3.8 Dataflow Algorithm
4. Implementation
4.1 Simulation
4.1.1 Pre-simulation
4.1.2 Running Simulation
4.1.3 Post Simulation
4.1.4 VCD File Generation
4.2 Synthesis and Power Analysis
5. Results and Evaluation
6. Conclusions
References
A. Appendix
LIST OF FIGURES
2.1 An OFDM based simplex communication system [8].
2.2 OFDM modulation at transmitter [7].
2.3 OFDM demodulation at receiver [7].
2.4 Radix-2 DIT FFT butterfly diagram [1].
2.5 Single radix-2 DIT butterfly operation.
3.1 Scalable FFT processor block diagram.
3.2 FFT processor core pin details.
3.3 FFT processor pipelined internal architecture.
3.4 Pipelined butterfly unit [10].
3.5 Butterfly unit pin details.
3.6 Butterfly unit waveform for 16-point FFT.
3.7 Data format stored in memory.
3.8 Order of input sample at the beginning and the order of final output of 16-point FFT computation.
3.9 RAM memory bank pin details.
3.10 SetA RAM memory bank (RAM0) waveforms.
3.11 SetB RAM memory bank (RAM4) waveforms.
3.12 Order of twiddle factors stored in ROM.
3.13 ROM memory pin details.
3.14 ROM memory waveforms.
3.15 Interconnect internal architecture and external interface.
3.16 Interconnect pin details.
3.17 InterconnectA waveforms.
3.18 InterconnectB waveforms.
3.19 Address generation unit internal logic.
3.20 Address generation unit pin details.
3.21 Address generation unit waveforms.
3.22 Control unit state diagram.
3.23 Control unit pin details.
3.24 Control unit waveforms.
3.25 16-point dataflow butterfly diagram for FFT processor architecture.
5.1 FFT computation time as a function of N.
5.2 FFT total energy consumption as a function of N.
A.1 Synthesized FFT core in RTL viewer.
A.2 Synthesized butterfly unit in RTL viewer.
A.3 Synthesized complex multiplier in RTL viewer.
A.4 Synthesized interconnect in RTL viewer.
A.5 Synthesized address generation unit in RTL viewer.
A.6 Synthesized control unit in RTL viewer.
A.7 Location of FFT core on the FPGA chip, courtesy: Chip planner tool.
A.8 Partition of FFT core components on the FPGA chip, courtesy: Design partition planner tool.
LIST OF TABLES
2.1 FFT computation time for OFDM based wireless standards.
5.1 FFT Core Resource Utilization.
5.2 FFT Computation Time.
5.3 Power Analysis Summary.
5.4 Comparison With Existing FFT Processors.
ABBREVIATIONS
ALUT Adaptive Look Up Table
CDM Code Division Multiplexing
CentOS Community ENTerprise Operating System
DAB Digital Audio Broadcasting
DAC Digital to Analog Converter
DFT Discrete Fourier Transform
DIF Decimation In Frequency
DIT Decimation In Time
DSP Digital Signal Processing
DVB-T Digital Video Broadcasting-Terrestrial
EDA Electronic Design Automation
FDM Frequency Division Multiplexing
FFT Fast Fourier Transform
FPGA Field Programmable Gate Array
ICI Inter Carrier Interference
IEEE Institute of Electrical and Electronics Engineers
IFFT Inverse Fast Fourier Transform
ISI Inter Symbol Interference
LOS Line Of Sight
LPF Low Pass Filter
OFDM Orthogonal Frequency Division Multiplexing
PSK Phase Shift Keying
QAM Quadrature Amplitude Modulation
RAM Random Access Memory
RF Radio Frequency
ROM Read Only Memory
RTL Register Transfer Level
SDC Synopsys Design Constraint
SDR Software Deﬁned Radio
SNR Signal-to-Noise Ratio
USB Universal Serial Bus
VCD Value Change Dump
VHDL Very high speed integrated circuit Hardware Description Language
3GPP-LTE 3rd Generation Partnership Project-Long Term Evolution
1. INTRODUCTION
Wired telecommunication networks offer low-bit-rate as well as high-bit-rate
services. Voice services require low bit rates, while broadband multimedia services
require high bit rates. Wireless communication networks provide the same services,
but some of the high-bit-rate services are limited by various performance
constraints. Over time, there has been a growing demand for high-bit-rate services in
wireless communication systems. Providing such services over wireless channels is
challenging because mobile radio channels are more contaminated by noise, distortion
and interference than wired channels.
The main characteristic of a mobile radio channel is multipath reception of the
transmitted signal. The received signal contains not only the Line-Of-Sight (LOS)
signal but also reflected signals, which are delayed and distorted versions of the
transmitted signal. The transmitted signal is reflected by terrain features such as
trees, buildings, vehicles, hills and mountains. The delayed signals cause Inter
Symbol Interference (ISI) with the LOS signal at the receiver. ISI degrades the
performance of the transceiver, and a suitable equalization technique has to be
adopted to improve it. A broadband multimedia wireless communication system requires
high-bit-rate transmission of at least several megabits per second. Designing
wireless transceivers that support such data rates with compact and low-cost hardware
is a challenging task. The Orthogonal Frequency Division Multiplexing (OFDM)
transmission scheme is used to cope with the multipath fading environment while
achieving high data rates.
OFDM is a parallel data transmission technique which minimizes the influence of
multipath fading through a simpler equalization technique. OFDM is widely adopted in
modern wireless communication systems. It has been adopted by major wireless
standards such as Institute of Electrical and Electronics Engineers (IEEE)
802.11a/g, IEEE 802.16e, 3rd Generation Partnership Project-Long Term Evolution
(3GPP-LTE), Digital Audio Broadcasting (DAB) and Digital Video
Broadcasting-Terrestrial (DVB-T). A multi-standard transceiver based on a Software
Defined Radio (SDR) platform allows switching between wireless standards at run time.
Modern wireless transceivers based on an SDR platform may support one or multiple
standards. In any case, the transceiver should comply with the performance
requirements of all the standards it supports. The standards specify strict
performance requirements in terms of high speed, low power, low cost, flexibility and
scalability.
Meeting stringent performance requirements while supporting multiple wireless
standards is the need of the hour. Since OFDM based communication is commercially
adopted in major wireless standards, there is considerable research interest in OFDM
baseband digital signal processing. Efficient OFDM baseband hardware requires
efficient components. In OFDM baseband hardware, FFT computation is one of the most
computationally intensive operations and strongly influences the performance of the
system. The baseband hardware has to compute the FFT within the time constraints of
every wireless standard it supports. It should be scalable so that it supports
multiple wireless standards, and it should also meet performance constraints such as
high speed, low area and low power consumption. Hence, the baseband hardware requires
a scalable FFT module which meets the performance constraints of multiple wireless
standards.
The scope of the thesis work was to propose a novel N-point scalable FFT processor
architecture, to implement a radix-2, fixed point, 16-bit, N-point scalable FFT
processor based on the proposed architecture using Very high speed integrated circuit
Hardware Description Language (VHDL), and to synthesize the processor on a Field
Programmable Gate Array (FPGA). The processor implementation was simulated using the
ModelSim simulation tool from Mentor Graphics Corporation to measure its performance
in terms of speed and scalability. The processor was also synthesized on an FPGA to
measure performance parameters such as maximum operating frequency, area and power
consumption. The synthesis tool used was Quartus II version 12.1 from Altera
Corporation and the FPGA was an Altera Stratix V 5SGSMD5K2F40C2.
The structure of this thesis is as follows: Chapter 2 explains the background related
to OFDM and research work on FFT processors, Chapter 3 explains the FFT processor
architecture and its components in detail, Chapter 4 covers implementation details
such as simulation and synthesis, Chapter 5 discusses the results and their
evaluation, and finally Chapter 6 draws conclusions based on the results achieved.
2. BACKGROUND
OFDM is an efficient multi-carrier transmission technique which is predominantly used
in wireless transceivers. OFDM offers better spectral utilization and better
performance in recovering the original signal from the received signal than other
transmission techniques. Since major wireless standards are based on OFDM
transmission, there is a lot of research interest in this domain. Designing the OFDM
baseband hardware of a transceiver is a challenging task. One of the most performance
critical modules of an OFDM transceiver is the FFT computation, so one of the steps
in creating an efficient OFDM transceiver is to create an efficient FFT module. In
this regard, the basics of OFDM based communication systems and the FFT algorithm are
described in detail below.
2.1 Wireless OFDM Systems
A simplex communication system based on OFDM is shown in Figure 2.1.
[Figure 2.1 block diagram. Transmitter blocks: data source, channel
coding/interleaving, symbol mapping (modulation), OFDM modulation (IFFT), guard
interval/windowing, DAC, I/Q modulation and up-conversion. Channel: multipath radio
channel. Receiver blocks: down-conversion and I/Q demodulation, ADC, guard interval
removal, OFDM demodulation (FFT), symbol de-mapping (detection),
decoding/de-interleaving, data sink, together with channel estimation, time
synchronization and carrier synchronization. Signal labels: N complex data
constellations {x_{i,k}}, transmitted baseband signal s(t), s_RF(t), r_RF(t), received
signal r(t), received data {y_{i,k}}. The legend distinguishes digital and analog
signals.]
Figure 2.1: An OFDM based simplex communication system [8].
A communication system in general has a transmitter and a receiver which can be put
together to form a transceiver. The transmitter modulates the baseband digital
signal, converts it into an analog signal using a Digital to Analog Converter (DAC)
and up-converts it to a Radio Frequency (RF) signal. The transmitted RF signal
undergoes multipath fading, is contaminated with thermal noise and distortions in the
radio channel, and also undergoes Inter Symbol Interference (ISI). To recover the
original transmitted signal, the received signal has to undergo equalization. Several
equalization techniques are available; a suitable technique is chosen based on
Signal-to-Noise Ratio (SNR) requirements and other design constraints.
Inverse Fast Fourier Transform (IFFT) and FFT are used for modulating and
demodulating the data constellations on orthogonal sub-carriers. These two signal
processing algorithms are used instead of I/Q modulators and demodulators. The input
of the IFFT is x_{i,k}, an N-point data constellation, where N is the number of
IFFT/FFT points, i is the sub-carrier index and k is the OFDM symbol index. N is
chosen as a power of two, which allows efficient implementation of the IFFT and FFT
algorithms for modulation and demodulation respectively. The described communication
system contains the OFDM module, which is our area of interest in the baseband
domain. Hence, OFDM, OFDM modulation and demodulation are discussed in detail below.
2.2 OFDM
OFDM is a multi-carrier transmission technique used in most of the available
commercial wireless communication standards. As described in [7], an OFDM signal is
made up of a number of sub-carriers or sub-channels which are orthogonal to each
other. The bandwidth of an OFDM signal includes all the sub-carriers; in other words,
the sub-carriers share the available bandwidth. The sub-carriers carry data
individually and are modulated in amplitude as well as phase. The sub-carriers can be
multiplexed using Frequency Division Multiplexing (FDM) or Code Division Multiplexing
(CDM). The orthogonality of the sub-carriers increases spectral utilization of the
transmitted signal while reducing Inter Carrier Interference (ICI). OFDM is an
improvement over FDM. An advantage of OFDM is that it requires a single filter for
all the sub-carriers, while FDM requires a filter for each sub-carrier. A
disadvantage of OFDM is that it requires highly accurate frequency synchronization in
order to avoid ICI. OFDM can be used to modulate a number of sub-carriers to carry
data individually as detailed below.
Figure 2.2 describes modulation in the baseband domain of an OFDM based communication
system. Input data bits are split among different sub-carriers with the help of a
serial-to-parallel converter. The sub-carriers are assigned a range of frequencies
and share the available bandwidth of the OFDM signal.
[Figure 2.2 block diagram: input data bits, serial-to-parallel converter, symbol
mapping, IFFT, Re and Im outputs to two DACs, mixers driven by the 0 and 90 degree
phases of the carrier fc, adder, OFDM carrier signal.]
Figure 2.2: OFDM modulation at transmitter [7].
Each sub-carrier is modulated by data using Phase Shift Keying (PSK) or Quadrature
Amplitude Modulation (QAM). The IFFT transforms the sub-carriers to the time domain
and combines them into an OFDM signal. The resulting digital OFDM signal is converted
to an analog signal by a Digital to Analog Converter (DAC). To transmit the data
bearing OFDM signal over a radio channel, an RF carrier is modulated by the OFDM
signal. OFDM modulation at the transmitter is matched by OFDM demodulation at the
receiver, which recovers the data from the radio signal.
OFDM demodulation at the receiver side is illustrated in Figure 2.3.
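[Figure 2.3 block diagram: OFDM carrier signal, mixers driven by the 0 and 90 degree
phases of the carrier fc, low-pass filters (LPF), two ADCs (quantization), Re and Im
inputs to the FFT, symbol detection (symbols to bits), parallel-to-serial converter,
output data bits.]
Figure 2.3: OFDM demodulation at receiver [7].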
The RF signal received from the radio channel is down-converted to separate the data
bearing OFDM signal from the carrier. The OFDM signal is complex valued, consisting
of real and imaginary parts. The resulting baseband OFDM signal is low-pass filtered
to eliminate unwanted harmonics around the baseband signal frequency.
The analog OFDM signal is converted into a digital signal by Analog to Digital
Converters (ADC). The time domain digital OFDM signal is converted into the frequency
domain by the FFT operation, which also separates the OFDM signal into its
sub-carriers. The individual sub-carriers are demodulated separately to extract data
from them. Symbols from the sub-carriers are converted to bit streams using symbol
detectors. The demodulator or symbol detector is in synchronization with the
modulator, which maps bit streams to symbols.
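The modulation and demodulation steps described above can be illustrated with a small
numerical round trip. The following is only a sketch under stated assumptions, not the
thesis implementation: it assumes QPSK mapping, N = 8 sub-carriers and an ideal
noiseless channel, and it uses a direct O(N^2) transform for brevity. The names QPSK,
dft, ofdm_modulate and ofdm_demodulate are hypothetical.

```python
import cmath

# Assumed QPSK constellation (2 bits per sub-carrier); illustrative only.
QPSK = {(0, 0): 1 + 1j, (0, 1): -1 + 1j, (1, 1): -1 - 1j, (1, 0): 1 - 1j}


def dft(x, inverse=False):
    """Direct DFT, or inverse DFT with 1/N scaling, of a list of complex samples."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def ofdm_modulate(bits):
    """Map bit pairs to QPSK symbols (one per sub-carrier) and go to the time domain."""
    symbols = [QPSK[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]
    return dft(symbols, inverse=True)


def ofdm_demodulate(samples):
    """Return to the frequency domain and pick the nearest constellation point."""
    bits = []
    for s in dft(samples):
        nearest = min(QPSK, key=lambda b: abs(QPSK[b] - s))
        bits.extend(nearest)
    return bits


tx_bits = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1]
time_samples = ofdm_modulate(tx_bits)      # 8 complex time domain samples
rx_bits = ofdm_demodulate(time_samples)    # equals tx_bits on an ideal channel
```

With a real channel, noise, guard intervals and equalization would sit between the two
calls; they are left out here to keep the transform pair in focus.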
OFDM modulators and demodulators are used in wireless transceivers, and wireless
transceivers may support multiple wireless standards. OFDM transmission is widely
popular and is adopted commercially by a number of wireless standards.
2.3 OFDM Based Wireless Standards
Most of the wireless standards available are OFDM based. Some of the OFDM based
wireless standards are IEEE 802.11a/g, IEEE 802.16e, 3GPP-LTE, DAB and DVB-T. The
standards specify different time constraints for FFT computation, as shown in
Table 2.1.
Standard          FFT Size    FFT Period [µs]
DAB               2048        1000
DAB               1024        500
DAB               512         250
DAB               256         125
DVB-T             8192        896
DVB-T             2048        224
IEEE 802.11a/g    64          3.2
IEEE 802.16e      256         8
3GPP-LTE          128         66.7
3GPP-LTE          256         66.7
3GPP-LTE          512         66.7
3GPP-LTE          1024        66.7
3GPP-LTE          2048        66.7
Table 2.1: FFT computation time for OFDM based wireless standards.
To support a specific standard, a communication system should meet the performance
constraints set by that standard. The FFT computation time is specified in
microseconds for different FFT sizes. IEEE 802.11a/g and IEEE 802.16e each use a
single FFT size, while DAB, DVB-T and 3GPP-LTE support several FFT sizes. For DAB and
DVB-T, the FFT computation time scales in proportion to the FFT size. In the case of
3GPP-LTE, however, the FFT computation time is the same irrespective of the FFT size.
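To get a feeling for how tight these deadlines are, the FFT periods of Table 2.1 can
be turned into a clock cycle budget. The sketch below is a back-of-the-envelope
estimate, not a result from the thesis: it assumes the 200 MHz maximum clock quoted in
the abstract, counts (N/2) log2(N) radix-2 butterflies per transform, and ignores
pipeline latency and the time needed to load and unload samples. The function names
are hypothetical.

```python
import math

F_CLK_MHZ = 200.0  # maximum clock frequency quoted in the abstract

# (standard, FFT size N, FFT period in microseconds), taken from Table 2.1
REQUIREMENTS = [
    ("IEEE 802.11a/g", 64, 3.2),
    ("IEEE 802.16e", 256, 8.0),
    ("3GPP-LTE", 2048, 66.7),
    ("DAB", 2048, 1000.0),
    ("DVB-T", 8192, 896.0),
]


def butterflies(n):
    """Radix-2 butterfly count of an N-point FFT: (N/2) * log2(N)."""
    return (n // 2) * int(math.log2(n))


def cycle_budget(period_us, f_clk_mhz=F_CLK_MHZ):
    """Clock cycles that fit into the FFT period (MHz times microseconds)."""
    return period_us * f_clk_mhz


def required_rate(n, period_us, f_clk_mhz=F_CLK_MHZ):
    """Average butterflies per clock cycle needed to meet the deadline."""
    return butterflies(n) / cycle_budget(period_us, f_clk_mhz)


for name, n, period in REQUIREMENTS:
    print(f"{name:15s} N={n:5d} budget={cycle_budget(period):9.0f} cycles "
          f"rate={required_rate(n, period):.3f} butterflies/cycle")
```

Under these assumptions IEEE 802.16e needs 0.64 butterflies per cycle on average and
3GPP-LTE with 2048 points about 0.84, which suggests why more than one butterfly per
cycle is attractive once memory access and pipeline overheads are added.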
The FFT operation is computationally intensive and must be performed within the time
constraints specified by the various wireless standards. Hence, the FFT is studied in
more detail before its implementation in hardware.
2.4 Fast Fourier Transform
The FFT is a fast algorithm for computing the Discrete Fourier Transform (DFT).
Computing the DFT/FFT of a time domain digital signal x(n) converts it into a
frequency domain signal. Analysis and processing of a discrete signal is often more
efficient in the frequency domain than in the time domain. The FFT algorithm was
developed and presented by Cooley and Tukey in [5] in order to reduce the number of
complex multiplications and additions in the DFT. An N-point DFT is given by,
\[ X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-i \frac{2\pi n k}{N}} \tag{2.1} \]
where k = 0, 1, 2, ..., N − 1.
According to equation 2.1, DFT computation requires N^2 − N complex additions and N^2
complex multiplications. An N-point FFT equation is given by,
\[ X(k) = \sum_{n=0}^{\frac{N}{2}-1} x(2n)\, e^{-i \frac{2\pi n k}{N/2}}
        + W_N^k \sum_{n=0}^{\frac{N}{2}-1} x(2n+1)\, e^{-i \frac{2\pi n k}{N/2}} \tag{2.2} \]
where W_N^k = e^{-i 2\pi k / N}, k = 0, 1, 2, ..., N − 1.
According to equation 2.2, the number of multiplications and additions is reduced to
(N/2) log2(N) and N log2(N) respectively. Preferring the FFT over the direct DFT for
hardware implementation means increased speed, reduced power consumption, reduced
area and reduced cost.
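The relationship between equations 2.1 and 2.2 can be checked numerically. The sketch
below is illustrative floating point Python, not the 16-bit fixed point hardware of
the thesis: dft evaluates equation 2.1 directly, fft_dit applies the even/odd split of
equation 2.2 recursively, and the two helper functions return the multiplication
counts quoted above. All names are hypothetical.

```python
import cmath


def dft(x):
    """Direct evaluation of equation 2.1."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def fft_dit(x):
    """Recursive radix-2 DIT FFT following the even/odd split of equation 2.2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])   # N/2-point transform of x(2n)
    odd = fft_dit(x[1::2])    # N/2-point transform of x(2n+1)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)    # twiddle factor W_N^k
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]   # uses W_N^(k+N/2) = -W_N^k
    return out


def dft_multiplications(n):
    """Complex multiplications of the direct DFT: N^2."""
    return n * n


def fft_multiplications(n):
    """Complex multiplications of the radix-2 FFT: (N/2) * log2(N)."""
    return (n // 2) * (n.bit_length() - 1)


samples = [complex(i % 5, -(i % 3)) for i in range(16)]
max_error = max(abs(a - b) for a, b in zip(dft(samples), fft_dit(samples)))
```

For N = 1024 the two counts are 1,048,576 and 5,120 complex multiplications, which is
the saving that motivates the FFT in hardware.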
The FFT can be computed in two different ways, Decimation In Time (DIT) and
Decimation In Frequency (DIF). In the DIT algorithm, the inputs are in bit reversed
order and the outputs are in natural order. In the DIF algorithm, the inputs are in
natural order and the outputs are in bit reversed order. According to Tran-Thong et
al. [12], the DIT algorithm provides a better signal-to-noise ratio than the DIF
algorithm for a finite word length. Based on the number of inputs to each butterfly,
the algorithm can be radix-2, radix-4, radix-8 or split-radix. In a radix-2 algorithm
the FFT size is a power of two, in radix-4 it is a power of four and in radix-8 it is
a power of eight. The split-radix type mixes the specified radices.
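The bit reversed input order of the DIT algorithm mentioned above can be generated
with a few lines of code. This is a generic sketch of the standard permutation, with
hypothetical function names, and says nothing about how the thesis hardware stores its
samples.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Input ordering for an N-point radix-2 DIT FFT, N being a power of two."""
    bits = n.bit_length() - 1   # log2(N)
    return [bit_reverse(i, bits) for i in range(n)]


order_16 = bit_reversed_order(16)
# order_16 == [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
```

The permutation is its own inverse, so applying it twice restores natural order.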
A radix-2 DIT FFT algorithm can be depicted as a butterfly diagram as shown in
Figure 2.4. The figure shows a 16-point FFT butterfly diagram where x(n) and X(k) are
the 16-point complex inputs and outputs respectively.
[Figure 2.4 diagram: 16-point radix-2 DIT butterfly network with four stages, twiddle
factors W^0 to W^7 and outputs X(0) to X(15).]
Figure 2.4: Radix-2 DIT FFT butterfly diagram [1].
Since it is a DIT algorithm, the inputs to the butterfly diagram are in bit reversed
order and the outputs are in natural order. The upper half of the butterfly diagram
is symmetric with its lower half until the last stage. During the last stage, the
upper half of the samples is combined with the lower half. This particular property
of the DIT algorithm is the basis of our address generation algorithm, dataflow
algorithm and input data storage described later in this document. An N-point FFT has
log2(N) stages and each stage requires N/2 butterfly operations.
The butterfly operation is the basic entity of a butterfly diagram. It is depicted in
Figure 2.5.
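[Figure 2.5 diagram: inputs P and Q, twiddle factor W applied to Q through a
multiplier, an adder producing X and a subtractor producing Y.]
Figure 2.5: Single radix-2 DIT butterfly operation.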
The butterfly operation can be expressed as,
X = P + WQ and Y = P − WQ (2.3)
where P and Q are complex input values, W is an input twiddle factor and X and Y are
complex output values. The output X is the result of the addition while Y is the
result of the subtraction.
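Equation 2.3 and the stage structure described earlier (log2(N) stages of N/2
butterflies each) can be combined into an iterative in-place FFT. The sketch below
uses floating point complex numbers and hypothetical names; it is a textbook
formulation rather than the dataflow of the thesis processor, and it returns the
butterfly count so that the N/2 * log2(N) figure can be checked.

```python
import cmath


def butterfly(p, q, w):
    """Equation 2.3: returns (X, Y) = (P + W*Q, P - W*Q)."""
    t = w * q
    return p + t, p - t


def fft_in_place(x):
    """Iterative radix-2 DIT FFT; returns (spectrum, number of butterflies)."""
    n = len(x)
    stages = n.bit_length() - 1          # log2(N) for N a power of two
    # Load the samples in bit reversed order, as required by the DIT algorithm.
    data = [x[int(format(i, f"0{stages}b")[::-1], 2)] for i in range(n)]
    count = 0
    for s in range(1, stages + 1):
        span = 1 << s                    # size of each sub-transform at this stage
        half = span >> 1
        for start in range(0, n, span):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / span)   # twiddle factor
                a, b = start + j, start + j + half
                data[a], data[b] = butterfly(data[a], data[b], w)
                count += 1
    return data, count


tone = [cmath.exp(2j * cmath.pi * 3 * m / 16) for m in range(16)]
spectrum, n_butterflies = fft_in_place(tone)   # energy concentrates in bin 3
```

For the 16-point case the function performs 4 stages of 8 butterflies, 32 in total.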
Studying existing research on FFT processors provides information on its merits and
demerits. Knowing the demerits of existing designs enables us to address them and
improve on them according to our requirements. In the next section, existing research
work on FFT processors is evaluated and summarized.
2.5 Research Work On FFT Processors
In recent times, OFDM wireless communication systems have been researched
extensively. In particular, the FFT algorithm and its hardware implementation are an
active research topic. The focus of research is to optimize the FFT algorithm and to
find efficient hardware solutions. Since SDR platforms based on OFDM support multiple
wireless standards, OFDM communication systems need to support multiple standards.
Since the FFT is an integral component of an OFDM transceiver, FFT hardware should be
capable of supporting multiple wireless standards. Hence, we have investigated the
existing research literature on variable length as well as fixed length FFT
processors.
According to [14], the FFT is one of the most power consuming and computationally
intensive operations in OFDM based wireless transceivers. This is the main motivation
behind research on FFT processor architectures. Fixed length FFT processors are
proposed by Derafshi et al. in [2], H. Jiang et al. in [4], S.-N. Tang et al. in [11]
and K. George et al. in [3]; they focus on a specific FFT size and a specific
standard. The processor proposed by Y.-T. Lin et al. in [6] focuses on lower power
consumption, while the processor presented by B. Wang et al. in [13] focuses on higher
speed and supports only 64-point FFT computation. The main focus of the paper
presented by Q. Zhang et al. in [15] is to reduce area consumption. Hence, some of the
existing research works are based on a specific FFT size targeting a specific standard
and are optimized for a specific design parameter. Fixed length FFT processors can
support only a specific wireless standard. They are not scalable across multiple
standards, but they are optimized in terms of power, area, speed and cost. Variable
length FFT processors supporting multiple standards, on the other hand, have to
compromise on speed, power and area. Finding a reasonable balance between scalability
and performance constraints is a design challenge. Hence, we have attempted to find a
reasonable balance between low power, low area, low cost, high speed, flexibility and
scalability of the FFT processor.
The scalable FFT processor designed and implemented as part of the thesis work is
based on a holistic approach that balances scalability against strict performance
constraints. The processor is design time configurable to support a maximum FFT size
Nmax. Since the processor is based on the radix-2 FFT algorithm, Nmax can only be a
power of two. During runtime, the processor can support FFT computation of sizes from
16-point up to Nmax-point. Hence, the proposed processor architecture is configurable
at design time and scalable at runtime. The proposed architecture can be extended to
radix-4/8 FFT computations to achieve higher performance. In addition, the FFT
processor can be used in non-OFDM systems where scalability is required.
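The design time and runtime constraints stated above (Nmax a power of two, runtime N
between 16 and Nmax) can be written down as a small software model. This is an
illustrative sketch with hypothetical names; it does not reflect how the VHDL design
checks or encodes its parameters.

```python
N_MIN = 16  # smallest runtime FFT size stated in the text


def is_power_of_two(n):
    """True for 1, 2, 4, 8, ..."""
    return n > 0 and (n & (n - 1)) == 0


def check_n_max(n_max):
    """Design time check: N_max must be a power of two and at least N_MIN."""
    if not is_power_of_two(n_max) or n_max < N_MIN:
        raise ValueError(f"N_max={n_max} must be a power of two >= {N_MIN}")
    return n_max


def supported_sizes(n_max):
    """All runtime FFT sizes available for a given design time N_max."""
    check_n_max(n_max)
    sizes = []
    n = N_MIN
    while n <= n_max:
        sizes.append(n)
        n *= 2
    return sizes


def stages_for(n, n_max):
    """Runtime check: return the number of radix-2 stages for a supported N."""
    if n not in supported_sizes(n_max):
        raise ValueError(f"N={n} is not supported for N_max={n_max}")
    return n.bit_length() - 1


sizes_2048 = supported_sizes(2048)   # [16, 32, 64, 128, 256, 512, 1024, 2048]
```

A core configured with N_max = 8192 would, under this model, cover every FFT size
listed in Table 2.1.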
3. SCALABLE FFT PROCESSOR
The scalable FFT processor supports N-point complex valued radix-2 fixed point FFT
computation. The processor is configurable at design time to the required Nmax-point
size (powers of two only), after which it can perform FFT computations from 16-point
to Nmax-point at runtime. At design time, the data memory and twiddle factor memory
are sized to support Nmax-point computation. The major components of the FFT
processor are listed below, and its block diagram is shown in Figure 3.1.
• Butterﬂy unit
• Data memory (RAM)
• Twiddle factor memory (ROM)
• Interconnect
• Address generation unit
• Control unit
[Block diagram showing the control unit, address generation unit, butterfly unit 0, butterfly unit 1, ROM, SetA memory, SetB memory, interconnect A and interconnect B.]
Figure 3.1: Scalable FFT processor block diagram.
The FFT processor uses two butterfly units which operate in parallel and compute
two outputs per clock cycle. Two sets of data memory, named SetA and SetB, are used;
each set contains four memory banks so that four samples can be accessed
simultaneously. One twiddle factor memory stores the Nmax/2 twiddle factors needed
to support Nmax-point FFT computation. Two interconnects, interconnectA and
interconnectB, link the butterfly units to memory sets SetA and SetB respectively.
The address generation unit generates the addresses required to read input samples
and twiddle factors for the butterfly units. The control unit generates control
signals with the required timing to coordinate and synchronize the other components.
Figure 3.2 shows input/output ports of FFT core.
[Pin diagram of the FFT core. Generics: N_WIDTH, ADDR_WIDTH. Ports: clk, rst, N, f_start, f_done. RAM read-write: f_wren_A, f_wren_B. RAM0 to RAM7 ports: f_addr_k, f_data_in_k, f_data_out_k for k = 0, ..., 7. ROM ports: f_addr_8, f_data_in_8, f_addr_9, f_data_in_9.]
Figure 3.2: FFT processor core pin details.
Ports drawn with a thin line are single-bit pins, while ports drawn with a thick
line are multi-bit pins. Ports named in capital letters are generics, which allow
design-time configuration of modules. The address width (ADDR_WIDTH) and N width
(N_WIDTH) are configurable at design time. The core has a clock port (clk), an
active-low reset (rst), a start port (f_start) to trigger the beginning of
computation, a done port (f_done) to signal the end of computation, and a port to
specify the FFT size (N). In addition, the core has address (f_addr_0, ..., f_addr_7),
data in (f_data_in_0, ..., f_data_in_7) and data out (f_data_out_0, ..., f_data_out_7)
ports for each of the RAM memory banks. Read and write operations on the SetA and
SetB memories are controlled via the write enable signals f_wren_A and f_wren_B
respectively. The twiddle factor ROM is accessed through address (f_addr_8 and
f_addr_9) and data in (f_data_in_8 and f_data_in_9) ports.
3.1 Internal Architecture
When the FFT processor is in operation, the overall dataflow through the processor
follows a ping-pong scheme. In even-numbered stages, input data is read from SetA
and output data is stored in SetB. In odd-numbered stages, input data is read from
SetB and output data is stored in SetA. The pipelined internal architecture of the
scalable FFT processor is shown in Figure 3.3.
[Pipelined datapath diagram showing the address generation unit, the control unit, SetA (RAM0 to RAM3), SetB (RAM4 to RAM7), interconnect A, interconnect B, the ROM, butterfly unit 0, butterfly unit 1, and the multiplexers and registers between them. Buses are annotated as 16 or 32 bits wide.]
Figure 3.3: FFT processor pipelined internal architecture.
There are nine pipeline stages in the FFT core, covering address generation, memory
read, the datapath from interconnectA to interconnectB, and memory write. They are:
an address generation stage, a memory read stage, two stages before the butterfly
units, two stages inside the butterfly units, two stages after the butterfly units
and a memory write stage. The two pipeline stages of the butterfly unit are shown in
Figure 3.4. At the beginning of each FFT stage, the first ten clock cycles are used
to fill the pipeline, because memory read/write latency is one clock cycle, address
generation latency is one clock cycle, and one extra clock cycle is required at the
beginning of each stage. Hence, ten clock cycles are required to compute the first
output samples in each stage, after which two outputs are computed every clock
cycle. The number of clock cycles required by each stage of the pipelined
architecture for an N-point FFT is given by,
$$\mathrm{cycles\_per\_stage} = 10 + \frac{N}{2} \qquad (3.1)$$
The total number of clock cycles required to compute the FFT is given by,
$$\mathrm{cycles\_FFT} = (\mathrm{cycles\_per\_stage} \times \log_2 N) + 2 \qquad (3.2)$$
The FFT processor architecture is described in detail in terms of its components in
the following sections.
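As a quick check of equations 3.1 and 3.2, the following Python sketch evaluates the cycle counts for a given FFT size. The function and constant names are illustrative and not taken from the thesis implementation.

```python
PIPELINE_FILL_CYCLES = 10  # cycles spent filling the pipeline in each stage


def cycles_per_stage(n: int) -> int:
    """Equation 3.1: pipeline fill plus N/2 cycles at two outputs per cycle."""
    return PIPELINE_FILL_CYCLES + n // 2


def cycles_fft(n: int) -> int:
    """Equation 3.2: log2(N) stages plus two additional cycles."""
    if n < 16 or n & (n - 1):
        raise ValueError("N must be a power of two and at least 16")
    stages = n.bit_length() - 1  # exact log2 for a power of two
    return cycles_per_stage(n) * stages + 2


if __name__ == "__main__":
    for size in (16, 64, 1024):
        print(size, cycles_per_stage(size), cycles_fft(size))
```

For N = 16 this gives 18 cycles per stage and 74 cycles in total.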
3.2 Butterﬂy Unit
The butterfly unit shown in Figure 3.4 is adopted from the implementation by
J. Takala et al. [10].
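[Schematic of the pipelined butterfly unit: shared inputs QR, PR and QI, PI; twiddle factor input WR, WI; registers (R); two multipliers; adders and add/subtract units; multiplexers (M); and outputs YR, XR and YI, XI.]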
Figure 3.4: Pipelined butterﬂy unit [10].
The butterfly unit is designed to support the radix-2 DIT butterfly operation.
Since the FFT processor is required to be a fixed-point processor, the butterfly
unit implementation was modified to support Q-14 fixed-point computation. There are
three input ports and two output ports, all carrying 16-bit fixed-point values. P
and Q are complex input samples, W is the twiddle factor, and X and Y are complex
output samples. Two of the three input ports are shared: one carries the real parts
QR and PR, and the other carries the imaginary parts QI and PI. The third input
port is for the twiddle factor and is shared between WR and WI. In Figure 3.4 the
dotted lines indicate imaginary data and the thick lines indicate real data. The
inputs and outputs of the butterfly unit are registered.
The butterfly unit consists of a complex multiplier and two adders as computational
units. The complex multiplier contains two bit-parallel real multipliers and two
adders. Complex multiplication is computed in a two-stage pipeline to reduce the
number of real multipliers from four to two. This reduction in the number of
multipliers through pipelined operation is a reasonable compromise between high
throughput, area and power efficiency.
When the butterfly unit is in operation, Q and P are read on alternate clock cycles,
and the same holds for WR and WI. Two clock cycles are required to fill the pipeline
at the beginning, and thereafter one output (X or Y) of the butterfly operation is
produced every clock cycle. Thus, each butterfly unit completes one butterfly
operation every two clock cycles. The FFT processor employs two such butterfly units
operating in parallel, as shown in Figure 3.3.
The complex multiplier inside the butterfly unit forms the critical path of the FFT
processor. The critical path includes a multiplier and an adder. The butterfly unit
is controlled by the control unit, which issues register load and multiplexer select
signals at the appropriate times. Two butterfly units operate in parallel in the FFT
processor to compute two output samples per clock cycle, which increases the
throughput of the processor.
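As a rough illustration of the radix-2 DIT butterfly in Q-14 arithmetic, the Python sketch below computes X = P + W·Q and Y = P − W·Q on integer operands. The helper names, the truncating right shift after each product, the absence of overflow handling, and the assignment of the sum to X and the difference to Y are all assumptions made for illustration; the text above does not specify them.

```python
Q_FRACTION_BITS = 14  # Q-14: 14 fractional bits in a 16-bit word


def to_q14(value: float) -> int:
    """Convert a real number to a Q-14 integer (assumed round-to-nearest)."""
    return int(round(value * (1 << Q_FRACTION_BITS)))


def q14_mul(a: int, b: int) -> int:
    """Multiply two Q-14 integers; the product is truncated back to Q-14.

    Assumption: truncation by arithmetic right shift, no saturation.
    """
    return (a * b) >> Q_FRACTION_BITS


def butterfly_q14(p, q, w):
    """Radix-2 DIT butterfly on (real, imag) tuples of Q-14 integers.

    Returns (x, y) with x = p + w*q and y = p - w*q (assumed naming).
    """
    pr, pi = p
    qr, qi = q
    wr, wi = w
    # Complex product t = w * q
    tr = q14_mul(wr, qr) - q14_mul(wi, qi)
    ti = q14_mul(wr, qi) + q14_mul(wi, qr)
    return (pr + tr, pi + ti), (pr - tr, pi - ti)


if __name__ == "__main__":
    minus_j = (0, to_q14(-1.0))
    print(butterfly_q14((100, 200), (30, 40), minus_j))
```

With W = −j, the product W·Q for Q = 30 + 40j is 40 − 30j, so the sketch returns X = 140 + 170j and Y = 60 + 230j.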
Figure 3.5 shows the input/output pins of a butterfly unit. The butterfly unit has a
clock port clk and an active-low reset port rst. The imaginary parts of inputs P and
Q are multiplexed into an input port PQI, while the real parts of P and Q are
multiplexed into another port PQ_R. The real and imaginary parts of the twiddle
factor W are multiplexed into WRI. load_P and load_P2 are load signals for the P
input registers, load_Q is the load signal for the Q input register and load_W is
the load signal for the W input register.
[Pin diagram of the butterfly unit. Ports: clk, rst. Input ports: PQI, PQ_R, WRI. Input register load signals: load_P, load_P2, load_Q, load_W. Complex multiplier register loads: load, load1. Multiplexer select: sel. Output add/sub select: add_sub. Output ports: realout, imagout.]
Figure 3.5: Butterﬂy unit pin details.
The load signals load and load1 are for the multiplier output registers and the
complex multiplier adder output registers respectively. The multiplexers are
controlled via the sel control signal. The addition and subtraction operations of
the butterfly unit are selected using the add_sub port. The output ports realout and
imagout carry the real and imaginary parts of the output data respectively.
Waveforms of the input/output ports of the butterfly unit are shown in Figure 3.6
for a 16-point FFT. The figure shows the timing of the control signals to the
butterfly unit at the different stages of the FFT.
Figure 3.6: Butterﬂy unit waveform for 16-point FFT.
There are four stages in a 16-point FFT, and the timing of the pipelined butterfly
unit control signals follows a similar pattern in every stage. The signals c_start,
begin_stage and s_done are not part of the butterfly unit; they are shown to mark
the beginning of FFT computation, the progress of the computation and its end.
c_start indicates the start of computation, begin_stage indicates the beginning of a
stage and s_done indicates the end of FFT computation. Input data latency is four
clock cycles and output data latency is seven clock cycles with respect to the
beginning of a stage.
In a butterﬂy unit, multiplication is the most power, time and area consuming
portion. Hence, multiplier operation is explained in more detail below.
3.2.1 Bit Parallel Multiplier
The multiplication of two complex numbers W and Q is given by,
$$WQ = (W_R + jW_I)(Q_R + jQ_I) = (W_R Q_R - W_I Q_I) + j(W_R Q_I + W_I Q_R) \qquad (3.3)$$
where $W_R$, $Q_R$ are the real parts and $W_I$, $Q_I$ are the imaginary parts. A
direct implementation of equation 3.3 without optimization requires four multipliers
and two adders. The critical path is determined by the multiplier and adder
combination. Complex multiplication can be optimized to use three multipliers
instead of four by following equation 3.4 below.
$$WQ = W_I(Q_R - Q_I) + Q_R(W_R - W_I) + j[W_I(Q_R - Q_I) + Q_I(W_R + W_I)] \qquad (3.4)$$
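The equivalence of equations 3.3 and 3.4 can be checked numerically. The sketch below is illustrative only (the function names are not from the thesis); it computes the product both ways and shows that the term W_I(Q_R − Q_I) is shared between the real and imaginary parts, which is why three multiplications suffice.

```python
def cmul_direct(wr, wi, qr, qi):
    """Equation 3.3: four real multiplications and two additions."""
    return wr * qr - wi * qi, wr * qi + wi * qr


def cmul_three_mult(wr, wi, qr, qi):
    """Equation 3.4: three real multiplications, one product shared."""
    shared = wi * (qr - qi)          # used by both real and imaginary parts
    real = shared + qr * (wr - wi)
    imag = shared + qi * (wr + wi)
    return real, imag


if __name__ == "__main__":
    operands = (3, 4, 5, 6)  # W = 3 + 4j, Q = 5 + 6j
    print(cmul_direct(*operands), cmul_three_mult(*operands))
```

Both functions return (−9, 38) for W = 3 + 4j and Q = 5 + 6j.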
Area and power consumption can be reduced if two clock cycles are used for complex
multiplication. As shown in Figure 3.4, this is implemented by sharing a common bus
for the input samples P and Q, so that Q is read first, followed by P. The complex
multiplication is pipelined over two clock cycles and is given by,
$$A_1 = Q_R W_R; \quad B_1 = Q_I W_R \qquad (\text{cycle 1})$$
$$A_2 = Q_R W_I; \quad B_2 = Q_I W_I \qquad (\text{cycle 2})$$
$$WQ = (A_1 - B_2) + j(B_1 + A_2) \qquad (3.5)$$
This approach uses only two multipliers and two adders. As shown above, the real
and imaginary parts of the twiddle factor are not needed in the same clock cycle.
Hence, a single bus can be shared to read the real and imaginary parts of the
twiddle factor in consecutive clock cycles. The outputs of the real-valued
multipliers are registered, thereby reducing the critical path of the complex
multiplier and in turn the critical path of the butterfly unit. The outputs of the
adders of the complex multiplier are registered as well. The butterfly unit is
implemented by pairing the complex multiplier with the adders required for the
butterfly operation. The inputs of the butterfly unit are registered, and two sets
of input registers are needed for operand P, since the next P is read while the
current W*Q is being computed.
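The two-cycle schedule of equation 3.5 can be sketched as follows. This is a behavioural illustration in Python, not the hardware description: the two multiplications inside each "cycle" stand for the two real multipliers working in parallel, and the function name is hypothetical.

```python
def cmul_two_cycle(wr, wi, qr, qi):
    """Equation 3.5: complex product W*Q using two multipliers over two cycles."""
    # Cycle 1: W_R is on the shared twiddle bus
    a1 = qr * wr
    b1 = qi * wr
    # Cycle 2: W_I is on the shared twiddle bus
    a2 = qr * wi
    b2 = qi * wi
    # The two adders of the complex multiplier combine the partial products
    return a1 - b2, b1 + a2


if __name__ == "__main__":
    print(cmul_two_cycle(3, 4, 5, 6))
```

For W = 3 + 4j and Q = 5 + 6j the partial products are A1 = 15, B1 = 18, A2 = 20 and B2 = 24, giving (−9, 38), the same result as equation 3.3.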
3.3 Data Memory (RAM)
The input samples, intermediate samples and output samples of the FFT computation
are stored in data memory. The data memory is RAM based and consists of two sets
called SetA and SetB. A butterfly operation reads its input samples from one set of
memory and writes its output samples to the other set, so two sets are needed to
hold samples at different stages of the computation. SetA includes four memory
banks, RAM0, RAM1, RAM2 and RAM3, while SetB includes four memory banks, RAM4, RAM5,
RAM6 and RAM7. Four memory banks per set are used because the two butterfly units
require four input samples per clock cycle. The maximum size of each memory bank is
determined by the maximum FFT size Nmax selected at design time, and is given by,
$$\mathrm{ram\_bank\_size} = \frac{N_{max}}{4} \qquad (3.6)$$
where the size is measured in 32-bit words. The size of a memory bank determines
its address bus width, which is log2(ram_bank_size).
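A small worked example of equation 3.6 and the resulting address width is sketched below; the function names are illustrative and not part of the thesis design.

```python
def ram_bank_size(n_max: int) -> int:
    """Equation 3.6: words per memory bank for a design-time maximum Nmax."""
    if n_max < 16 or n_max & (n_max - 1):
        raise ValueError("Nmax must be a power of two and at least 16")
    return n_max // 4


def ram_addr_width(n_max: int) -> int:
    """Address bus width of one bank: log2(ram_bank_size)."""
    return ram_bank_size(n_max).bit_length() - 1


if __name__ == "__main__":
    for n_max in (16, 1024, 4096):
        print(n_max, ram_bank_size(n_max), ram_addr_width(n_max))
```

For Nmax = 1024, each bank holds 256 words of 32 bits and needs an 8-bit address.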
In memory, a complex data sample is stored as a 32-bit word: the upper 16 bits hold
the real part and the lower 16 bits hold the imaginary part. The data format is
illustrated in Figure 3.7.
[Bit layout of a stored sample X = XR + jXI: bits X31 to X16 hold XR (XR15 ... XR0) and bits X15 to X0 hold XI (XI15 ... XI0).]
Figure 3.7: Data format stored in memory.
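The word layout of Figure 3.7 can be illustrated with a pack/unpack pair. The sketch assumes that the 16-bit parts are signed two's-complement values, which the text does not state explicitly; the function names are hypothetical.

```python
def pack_sample(real: int, imag: int) -> int:
    """Pack 16-bit real and imaginary parts into one 32-bit word.

    The real part occupies bits 31..16 and the imaginary part bits 15..0.
    Assumption: both parts are signed two's-complement values.
    """
    for part in (real, imag):
        if not -32768 <= part <= 32767:
            raise ValueError("part does not fit in 16 bits")
    return ((real & 0xFFFF) << 16) | (imag & 0xFFFF)


def unpack_sample(word: int):
    """Split a 32-bit word back into signed (real, imag) parts."""
    def to_signed16(value: int) -> int:
        return value - 0x10000 if value & 0x8000 else value

    return to_signed16((word >> 16) & 0xFFFF), to_signed16(word & 0xFFFF)


if __name__ == "__main__":
    word = pack_sample(1, -1)
    print(hex(word), unpack_sample(word))
```

Packing real = 1 and imag = −1 yields the word 0x0001FFFF, and unpacking it returns (1, −1).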
At the beginning of FFT computation the input samples are always stored in SetA.
The input samples are bit-reversed, as required by the DIT FFT, and split equally
among the four memory banks. Figure 3.8 shows the order of the input samples stored
in memory at the beginning of FFT computation.
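[The figure shows, for a 16-point FFT, the input samples in bit-reversed order, their placement in RAM0 to RAM3 of SetA, and the final outputs of the FFT computation in RAM0 to RAM3 of SetA. The three sample sequences recoverable from the figure, in extraction order, are:
(1) X(0), X(4), X(2), X(6), X(8), X(12), X(10), X(14), X(1), X(5), X(3), X(7), X(9), X(13), X(11), X(15)
(2) X(0), X(4), X(8), X(12), X(2), X(10), X(6), X(14), X(1), X(9), X(5), X(13), X(3), X(11), X(7), X(15)
(3) X(0), X(6), X(13), X(11), X(3), X(5), X(14), X(8), X(1), X(7), X(12), X(10), X(2), X(4), X(15), X(9)]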
Figure 3.8: Order of input sample at the beginning and the order of ﬁnal output of 16-point
FFT computation.
The bit-reversed input samples are split into two equal halves: the upper half is
stored in RAM0 and RAM1, and the lower half in RAM2 and RAM3. The order of the input
samples is chosen to provide conflict-free access for the butterfly operation.
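The initial placement can be sketched in Python as follows. Bit reversal and the split into an upper half (RAM0, RAM1) and a lower half (RAM2, RAM3) follow the text; the way each half is divided between its two banks (alternating samples) is an assumption, chosen because it reproduces one of the sample sequences recoverable from Figure 3.8. All function names are illustrative.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n: int):
    """Sample indices of an N-point input in bit-reversed order."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


def split_into_banks(n: int):
    """Distribute bit-reversed samples over the four SetA banks.

    Upper half goes to RAM0/RAM1 and lower half to RAM2/RAM3 (from the text).
    Assumption: within each half, samples alternate between the two banks.
    """
    order = bit_reversed_order(n)
    upper, lower = order[: n // 2], order[n // 2:]
    return {
        "RAM0": upper[0::2],
        "RAM1": upper[1::2],
        "RAM2": lower[0::2],
        "RAM3": lower[1::2],
    }


if __name__ == "__main__":
    print(split_into_banks(16))
```

Under this assumption, the two inputs of each first-stage butterfly, such as X(0) and X(8), sit at the same address in different banks and can be read in the same clock cycle.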
The final output of an N-point FFT computation is available in SetA or SetB
depending on the number of stages. If the number of stages is even, the final
outputs are stored in SetA; if it is odd, they are stored in SetB. Since a 16-point
FFT has an even number of stages, its final outputs are available in SetA, in the
order shown in Figure 3.8. The order of the outputs is determined by the dataflow
algorithm described in detail later in this document.
The memory banks are clocked dual-port memories, which enable access from within
the FFT core as well as from outside it. Memory access (read-write) latency is one
clock cycle.
Figure 3.9 shows the input/output ports of a RAM memory bank. The generic FILE_NAME
specifies the memory initialization file and ADDR_WIDTH specifies the address port
width at design time. The RAM bank is a dual-port memory and hence has two sets of
data and address ports.
[Pin diagram of a RAM memory bank. Generics: FILE_NAME, ADDR_WIDTH. Clock: clk. External interface: data_A, addr_A, wren_A, q_A. FFT core interface: data_B, addr_B, wren_B, q_B.]
Figure 3.9: RAM memory bank pin details.
A RAM memory bank has input data ports (data_A, data_B), address ports (addr_A,
addr_B), read-write enable signals (wren_A, wren_B) and output data ports (q_A,
q_B). Ports with suffix 'A' are external interface ports and ports with suffix 'B'
are FFT core interface ports.
Waveforms of SetA RAM bank (RAM0) ports are shown in Figure 3.10 for 16-point
FFT. Write operation happens when wren_B is `1' and read operation happens
when it is `0'. Memory read-write access latency is single clock cycle.
Figure 3.10: SetA RAM memory bank (RAM0) waveforms.
In the ﬁrst stage data is read from SetA memory and in the next stage data is writ-
ten to SetA memory. Likewise, read-write operations are switched between SetA
and SetB memory in alternate stages.
Waveforms of SetB RAM bank (RAM4) ports are shown in Figure 3.11 for 16-point
FFT.
Figure 3.11: SetB RAM memory bank (RAM4) waveforms.
In the ﬁrst stage data is written to SetB memory and in the next stage data is read
from it. Read-write operations are switched between SetB and SetA memory in
alternate stages.
Data samples are stored in data memory. Twiddle factors are stored in a separate
twiddle factor memory, which is described in detail in the next section.
3.4 Twiddle Factor Memory (ROM)
The twiddle factor memory stores the twiddle factors required by butterfly unit0
and butterfly unit1 during FFT computation. It is a clocked dual-port ROM, which
allows two twiddle factors to be read simultaneously in each clock cycle, as shown
in Figure 3.3. An N-point FFT requires at most N/2 twiddle factors. Hence, Nmax/2
twiddle factors are stored in ROM at design time in order to support FFT computation
from 16 points up to Nmax points. The size of the ROM is therefore given by,
$$\mathrm{rom\_size} = \frac{N_{max}}{2} \qquad (3.7)$$
where the size is measured in 32-bit words. Since the ROM address bus width depends
on the maximum size of the ROM, it is log2(rom_size).
Twiddle factors are stored in ROM in natural order, as illustrated for a 16-point
FFT in Figure 3.12. Each twiddle factor is a complex value whose 16-bit real and
imaginary parts are packed into a 32-bit word as shown in Figure 3.7.
[Twiddle factor memory (ROM) contents for a 16-point FFT: $W_{16}^{0}, W_{16}^{1}, W_{16}^{2}, \ldots, W_{16}^{5}, W_{16}^{6}, W_{16}^{7}$.]
Figure 3.12: Order of twiddle factors stored in ROM.
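A ROM image of this kind could be generated along the following lines. The sketch assumes the forward-FFT convention W_N^k = exp(−j2πk/N), Q-14 quantization with round-to-nearest, and the real-upper/imaginary-lower packing of Figure 3.7; none of these details, nor the function name, is confirmed by the text.

```python
import math

Q_FRACTION_BITS = 14  # assumed Q-14 quantization of the twiddle factors


def twiddle_rom(n_max: int):
    """Return Nmax/2 packed 32-bit twiddle words in natural order.

    Entry k holds W_Nmax^k = exp(-j*2*pi*k/Nmax) (assumed sign convention),
    real part in bits 31..16 and imaginary part in bits 15..0.
    """
    scale = 1 << Q_FRACTION_BITS
    rom = []
    for k in range(n_max // 2):
        angle = -2.0 * math.pi * k / n_max
        real = int(round(math.cos(angle) * scale))
        imag = int(round(math.sin(angle) * scale))
        rom.append(((real & 0xFFFF) << 16) | (imag & 0xFFFF))
    return rom


if __name__ == "__main__":
    for address, word in enumerate(twiddle_rom(16)):
        print(address, f"{word:08X}")
```

For Nmax = 16 the table has eight entries (equation 3.7); entry 0 is 0x40000000 (W = 1) and entry 4 is 0x0000C000 (W = −j).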
The 32-bit twiddle factor is read from memory and unpacked into its real and
imaginary parts before being fed to the butterfly units, as shown in Figure 3.3.
The real and imaginary parts are fed through a common input port to the butterfly
unit in alternate clock cycles, as shown in Figure 3.4.
Figure 3.13 shows input/output ports of a ROM memory.
[Pin diagram of the ROM memory. Generics: FILE_NAME, ADDR_WIDTH. Clock: clk. FFT core interface for twiddle factor0: addr_A, wren_A, q_A. FFT core interface for twiddle factor1: addr_B, wren_B, q_B.]
Figure 3.13: ROM memory pin details.
The generic FILE_NAME specifies the memory initialization file and ADDR_WIDTH
specifies the address port width at design time. The ROM is a dual-port memory and
hence has two sets of data and address ports. A ROM memory consists of input data
ports (data_A, data_B), address ports (addr_A, addr_B), read enable signals
(wren_A, wren_B) and output data ports (q_A, q_B). The address, data and read
enable ports interface with the FFT core. Ports with suffix 'A' serve butterfly
unit0 (twiddle factor0), while ports with suffix 'B' serve butterfly unit1 (twiddle
factor1) inside the FFT core.
Waveforms of ROM memory ports are shown in Figure 3.14 for 16-point FFT.
Figure 3.14: ROM memory waveforms.
Since two twiddle factors are required at a time, one per butterfly unit, a
dual-port ROM is used to store the twiddle factors. A read operation happens when
wren_A and wren_B are set to '0'.
The previous sections described how twiddle factors are stored in ROM and how data
samples are stored in data memory. The following section explains how data samples
are routed to the butterfly units after they are read from data memory.
3.5 Interconnect
The function of an interconnect is to route data between memory and the butterfly
units, and to route addresses between the address generation unit and memory, as
shown in Figure 3.3. An interconnect consists of a number of multiplexers whose
outputs are registered. The multiplexers act as switches, linking inputs to the
appropriate outputs based on select signals from the control unit.
An interconnect is described as in Figure 3.15.
[Interconnect block containing multiplexers followed by registers, driven by control signals. Inputs: address, data in from memory, data in from butterfly. Outputs: address out to memory, data out to memory, data out to butterfly.]
Figure 3.15: Interconnect internal architecture and external interface.
The address generation unit sends read/write addresses to the interconnect, which
routes them to the appropriate memory banks based on control signals from the
control unit. The interconnect also receives data samples from the memory banks and
routes them to the appropriate butterfly units depending on the control signals.
After the butterfly operation, data received from the butterfly units is routed to
memory for storage. Two such interconnects, interconnectA and interconnectB, are
used in the FFT processor architecture, as shown in Figure 3.3. InterconnectA
connects the butterfly units to memory SetA, while interconnectB connects the
butterfly units to memory SetB. Both interconnects also link the address generation
unit to the memory sets SetA and SetB.
Figure 3.16 shows the pin details of an interconnect module. The generic N_WIDTH
allows a scalable value for N, while ADDR_WIDTH scales the address width to match N.
The interconnect has clock (clk) and active-low reset (rst) ports. The port i_RW
indicates a read or write operation for the SetA or SetB memory. The last stage of
computation is indicated by the i_last_stage port. The i_input_flip port indicates
whether data read from memory has to be flipped. The memory read address inputs,
i_read_data and i_read_data1, are received from the address generation unit;
i_read_data1 is a version of i_read_data delayed by nine clock cycles, because the
write operation begins nine cycles after the beginning of each stage. The memory
store addresses, i_store_add and i_store_sub, are also received from the address
generation unit. The output memory addresses i_addr_RAM0, ..., i_addr_RAM3 are the
RAM bank addresses. Ports i_bfy0_in_first, i_bfy0_in_second, i_bfy1_in_first and
i_bfy1_in_second carry the outputs of the butterfly units into the interconnect.
[Pin diagram of the interconnect. Generics: N_WIDTH, ADDR_WIDTH. Ports: clk, rst, i_RW, i_last_stage, i_input_flip. Input read addresses: i_read_data, i_read_data1. Store addresses: i_store_add, i_store_sub. Butterfly outputs: i_bfy0_in_first, i_bfy0_in_second, i_bfy1_in_first, i_bfy1_in_second. Butterfly inputs: i_bfy0_out_first, i_bfy0_out_second, i_bfy1_out_first, i_bfy1_out_second. Input data read from memory: i_data_in_RAM0, ..., i_data_in_RAM3. Output data to be stored in memory: i_data_out_RAM0, ..., i_data_out_RAM3. Memory read-write addresses: i_addr_RAM0, ..., i_addr_RAM3.]
Figure 3.16: Interconnect pin details.
Ports i_bfy0_out_first, i_bfy0_out_second, i_bfy1_out_first and i_bfy1_out_second
are outputs of the interconnect and inputs to the butterfly units. Data from the RAM
banks is received through ports i_data_in_RAM0, ..., i_data_in_RAM3, and data to the
RAM banks is sent through ports i_data_out_RAM0, ..., i_data_out_RAM3.
Figure 3.17 shows waveforms for interconnectA for 16-point FFT computation.
Figure 3.17: InterconnectA waveforms.
Since interconnectA is linked to the SetA memory, memory reads occur in every
even-numbered stage and memory writes occur in every odd-numbered stage, as shown in
the waveforms.
Figure 3.18 shows waveforms for interconnectB for 16-point FFT computation.
Figure 3.18: InterconnectB waveforms.
Since interconnectB is linked to the SetB memory, memory reads occur in every
odd-numbered stage and memory writes occur in every even-numbered stage, as shown in
the waveforms.
Interconnects link the data memory and the address generation unit. The data memory
was covered in the previous sections; the following section deals with the address
generation unit in detail.
3.6 Address Generation Unit
The address generation unit generates addresses for the butterfly inputs, twiddle
factors and butterfly outputs. To support the scalable architecture, it must be
capable of supporting N-point FFT computation. A novel address generation algorithm
was developed for this purpose: it generates addresses for inputs, twiddle factors
and outputs, supports N-point FFT computation, and provides conflict-free access to
data samples during computation. The algorithm requires two m-bit counters, where
m = log2(N/4). The algorithm is described in detail below.
1. Address to read inputs from memory:
A simple m-bit counter is used to generate the read address.
$$\mathrm{read\_data} = [a_{m-1}\, a_{m-2} \ldots a_1\, a_0] \qquad (3.8)$$
2. Address to store outputs to memory:
Another simple m-bit counter is used to generate the store address.
$$\mathrm{store\_add} = [b_{m-1}\, b_{m-2} \ldots b_1\, b_0] \qquad (3.9)$$
$$\mathrm{store\_sub} = [b'_{m-1}\, b'_{m-2} \ldots b'_1\, b'_0] \qquad (3.10)$$
3. Address to read twiddle factors from ROM:
For an N-point FFT there are log2(N) stages. The stage index s is incremented at
the end of each stage, so s takes the values s = 0, 1, 2, ..., log2(N)-1.
* Twiddle factor addresses for stages except the last stage (s = 0, 1, ..., log2(N)-2):
The following steps are executed in sequence to generate the twiddle factor
addresses in the current stage.
(i) $[d_{m-1} \ldots d_0] = [a_{m-1} \ldots a_0] \;\mathrm{XOR}\; [0\, a_{m-1} \ldots a_1]$
(ii) $\mathrm{count\_gray} = [0\, d_{m-1}\, d_{m-2} \ldots d_1\, d_0] = [e_m\, e_{m-1} \ldots e_1\, e_0]$
(iii) $\mathrm{coefx} = [e_m \ldots e_{m-s}\, 0_{m-s}\, 0_{m-s-1} \ldots 0_1\, 0_0] = [f_{m+1}\, f_m \ldots f_1\, f_0]$
(iv) $\mathrm{Coef0Addr} = [f_m\, f_{m-1} \ldots f_1\, f_0]$
(v) $\mathrm{Coef1Addr} = \mathrm{Coef0Addr}$
* Twiddle factor addresses for the last stage (s = log2(N)-1):
The following steps are executed in sequence to generate the twiddle factor
addresses for the last stage. Note: $\mathrm{coefy} = [0_{m+1}\, 0_m \ldots 0_1\, 1]$ at the beginning of the stage.
(i) $\mathrm{coefy} = [f_{m+1}\, f_m \ldots f_1\, f_0]$
(ii) $\mathrm{sum} = [a_{m-1}\, a_{m-2} \ldots a_1\, a_0] + [f_m\, f_{m-1} \ldots f_2\, f_1] = [g_{m-1}\, g_{m-2} \ldots g_1\, g_0]$
(iii) $\mathrm{coefy} = [0\, g_{m-1}\, g_{m-2} \ldots g_1\, g_0\, f'_0] = [f_{m+1}\, f_m \ldots f_0]$
$f_m$ = '0' for the first N/4 butterfly computations.
$f_m$ = '1' for the next N/4 butterfly computations.
(iv) $\mathrm{Coef0Addr} = [f_m\, f_{m-1} \ldots f_1\, f_0]$
(v) $\mathrm{Coef1Addr} = [f_m\, f_{m-1} \ldots f_1\, f'_0]$
To read inputs from the SetA or SetB memory, the read_data address from step 1 is
required; it is generated using an m-bit counter. To store the outputs of the
butterfly units, the store addresses from step 2 are required. Each butterfly
produces two outputs, one the result of an addition and the other the result of a
subtraction, so two store addresses, store_add and store_sub, are needed. The
outputs must be stored such that there are no access conflicts, either while
storing them or when they are read in the next stage. Hence, store_add is used to
store the result of the addition and store_sub to store the result of the
subtraction. The store_add address is generated using an m-bit counter, while
store_sub is the one's complement of store_add. Twiddle factor addresses are
generated in step 3, where Coef0Addr corresponds to butterfly unit0 and Coef1Addr to
butterfly unit1. The logic for the last stage differs from that for the other
stages, and in both cases the steps have to be followed in the specified order. For
all stages except the last, a Gray code value is generated from read_data and
appended with a suitable number of zeros; the number of zeros depends on the stage
index s. Both twiddle factor addresses are the same in these stages. In the last
stage, the twiddle factor address is generated by adding read_data to the previous
value of Coef0Addr; the bit f_m of the resulting address is set to '0' for the first
half of the stage and to '1' for the second half. Coef1Addr is formed by inverting
the least significant bit of Coef0Addr.
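The store addresses and the twiddle factor address for stages other than the last can be sketched in Python as a literal transcription of steps 2 and 3(i) to 3(iv). The function names are hypothetical, the counter width m is derived from the run-time size N, and the last-stage logic is deliberately left out; this is an illustration of the bit manipulations, not the thesis implementation.

```python
def counter_width(n: int) -> int:
    """m = log2(N/4) for a power-of-two N of at least 16."""
    return (n // 4).bit_length() - 1


def store_addresses(b: int, m: int):
    """Step 2: store_add is the m-bit counter value, store_sub its one's complement."""
    mask = (1 << m) - 1
    return b & mask, ~b & mask


def twiddle_address(a: int, m: int, s: int) -> int:
    """Step 3 for stages s = 0 .. log2(N)-2: Coef0Addr (Coef1Addr is identical)."""
    if not 0 <= s <= m:
        raise ValueError("s must be a non-last stage index (0 <= s <= m)")
    # (i), (ii): Gray code of the read counter, with a leading zero (m+1 bits)
    count_gray = a ^ (a >> 1)
    # (iii): keep the top s+1 bits e_m .. e_(m-s), then append m-s+1 zeros
    top_bits = count_gray >> (m - s)
    coefx = top_bits << (m - s + 1)
    # (iv): Coef0Addr is the lower m+1 bits of coefx
    return coefx & ((1 << (m + 1)) - 1)


if __name__ == "__main__":
    m = counter_width(16)
    for stage in range(m + 1):
        print(stage, [twiddle_address(a, m, stage) for a in range(1 << m)])
```

For N = 16 (m = 2) the sketch yields address 0 throughout stage 0, addresses 0, 0, 4, 4 in stage 1 and addresses 0, 2, 6, 4 in stage 2, that is, the number of distinct twiddle factors doubles with each stage.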
Address generation unit internal logic is pictorially presented in Figure 3.19. It
consists of two m-bit counters one each for generating read_data and store_add.
[Logic diagram containing: two counters with reset, producing read_data (bits a0 to am-1) and store_add (bits b0 to bm-1), together with store_sub; shift-right-by-one (>> 1) blocks, a zero value module and an adder; the intermediate signals count_gray, coefx, coefy and sum; and multiplexers controlled by last_stage and half_stage that produce Coef0Addr and Coef1Addr.]
Figure 3.19: Address generation unit internal logic.
The twiddle factor address generation uses one logic path for the last stage and
another for the remaining stages. For the remaining stages, a Gray code is generated
from read_data and appended with zeros supplied by the zero value module; Coef0Addr
and Coef1Addr are identical in these stages, as described before. For the last
stage, the previous Coef0Addr value is added to read_data. The bit f_m of the
resulting address is set to '0' for the first half of the last stage and to '1' for
the second half. Coef1Addr is formed by inverting the least significant bit of
Coef0Addr. The select signal last_stage differentiates the last stage from the other
stages, and the select signal half_stage differentiates the first half of the last
stage from its second half.
Figure 3.20 shows pin details of address generation unit.
[Pin diagram of the address generation unit. Generics: ADDR_WIDTH, N_WIDTH. Ports: clk, rst, N, a_start, a_begin_stage. Input read addresses: a_read_data, a_read_data1. Output store addresses: a_store_add, a_store_sub. Twiddle factor addresses: a_Coef0Addr, a_Coef1Addr.]
Figure 3.20: Address generation unit pin details.
The address generation unit has generics for specifying the address width
(ADDR_WIDTH) and the N width (N_WIDTH). The clock port is clk, the active-low reset
port is rst and the FFT size port is N. The port a_start starts the computation,
while the port a_begin_stage indicates the beginning of each FFT stage. The read
address ports are a_read_data and a_read_data1. a_read_data1 is delayed by nine
clock cycles with respect to the beginning of each stage and is required by the
interconnects to route data to memory. The store address ports are a_store_add and
a_store_sub. The twiddle factor address ports are a_Coef0Addr and a_Coef1Addr.
Figure 3.21 shows the waveforms of the address generation unit for a 16-point FFT.
Figure 3.21: Address generation unit waveforms.
Address generation starts immediately after the begin stage signal is received from
the control unit, and a new address is generated every two clock cycles.
The address generation unit generates addresses, but a separate module is needed to
coordinate the activities of all the components in the processor. This module is the
control unit, which is discussed in detail in the following section.
3.7 Control Unit
The control unit controls, coordinates and synchronizes the activities of the other
components, as shown in Figure 3.3. The timing of the control signals it issues has
to be accurate in order to synchronize the activities of the different components.
The control unit is implemented as a Moore state machine, shown in Figure 3.22.
[State diagram of the control unit with four states: S0 (initial state), S1 (SetA read and SetB write), S2 (transition state) and S3 (SetB read and SetA write). Transitions are labelled "Current stage is complete" and "Last stage is complete".]
Figure 3.22: Control unit state diagram.
The control unit begins in the initial state S0, also called the reset state. In
the initial state, all signals and variables are initialized to appropriate values,
after which the state changes to S1. In S1, the control unit generates the timing of
the control signals for the following activities: address generation, SetA memory
read, routing of inputs via interconnectA to the butterfly units, the butterfly
operation, routing of outputs to the SetB memory via interconnectB, and storing of
outputs to the SetB memory. It moves to the transition state S2 if S1 does not
correspond to the last stage of the FFT. Otherwise, it moves to the initial state
S0, indicating the end of FFT computation. In the transition state S2, the necessary
control signals are re-initialized in preparation for generating the control signals
of the next FFT stage.
After re-initializing the required signals and variables in the transition state
S2, the control unit moves to S3. In S3, it generates the timing of the control
signals for the following activities: address generation, SetB memory read, routing
of inputs via interconnectB to the butterfly units, the butterfly operation, routing
of outputs to the SetA memory via interconnectA, and storing of outputs to the SetA
memory. It moves to the transition state S2 if S3 does not correspond to the last
stage of the FFT. Otherwise, it moves to the initial state S0, indicating the end of
FFT computation. As before, the transition state S2 re-initializes the necessary
control signals in preparation for the next FFT stage.
In this way, the state transitions S1 → S2 → S3 and S3 → S2 → S1 are repeated in alternate FFT stages until the FFT computation is complete. The control unit starts in state S0 and returns to S0 after all stages are complete. Note that state S1 generates the control signals for even-numbered stages, while state S3 generates them for odd-numbered stages.
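The state sequence described above can be sketched in a few lines of Python. This is only a behavioral illustration, not the VHDL of the control unit: the function name control_unit_states is hypothetical, each state is recorded once per visit rather than once per clock cycle, and the number of stages is assumed to be log2(N) for a radix-2 FFT.

```python
def control_unit_states(n_points):
    """Return the sequence of states visited for one N-point FFT.

    S0: initial/reset state
    S1: SetA read and SetB write (even-numbered stages)
    S2: transition state between stages
    S3: SetB read and SetA write (odd-numbered stages)
    Hypothetical helper: one list entry per state visit, not per clock cycle.
    """
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two")
    stages = n_points.bit_length() - 1  # log2(N) stages assumed for radix-2
    trace = ["S0"]
    for stage in range(stages):
        trace.append("S1" if stage % 2 == 0 else "S3")
        if stage != stages - 1:
            trace.append("S2")  # re-initialize signals before the next stage
    trace.append("S0")  # last stage complete: return to the initial state
    return trace


print(control_unit_states(16))
```

For a 16-point FFT (four stages), the sketch visits S0, S1, S2, S3, S2, S1, S2, S3 and finally S0, so the last stage is handled by S3 without passing through S2 again.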
Figure 3.23 shows pin details of control unit.
Figure 3.23: Control unit pin details.
[Pin diagram of the control unit. Ports listed: clk, rst, N_WIDTH, N, c_start, c_done, c_begin_stage, c_input_flip, c_add_sub, c_load, c_load1, c_loadP, c_loadP2, c_loadQ, c_loadW, c_sel, c_SetA_RW, c_SetB_RW, c_wren_A, c_wren_B, c_last_stage, and for each butterfly unit x = 0, 1: c_bfyx_ip0_reg_load, c_bfyx_ip1_reg_load, c_bfyx_mux_sel, c_bfyx_tw_reg_load, c_bfyx_tw_sel, c_bfyx_add_op_reg_load, c_bfyx_sub_op_reg_load, c_bfyx_tw_addr_reg_load. Port groups: generic, address generation unit port, butterfly unit ports, interconnect ports, RAM ports, butterfly0 and butterfly1 input/output register/mux control ports.]
The control unit signals coordinate and synchronize the activities of the remaining components of the processor. The generic N configures the size of the FFT, clk is the clock port, rst is the reset port, and c_start triggers the beginning of a computation. The beginning of each stage is synchronized with the address generation unit through port c_begin_stage.
The ports corresponding to the butterfly unit are: c_add_sub, which selects addition or subtraction at the butterfly output; c_load and c_load1, the load signals of the complex multiplier registers; c_loadP, c_loadP2, c_loadQ and c_loadW, the input register load signals; and c_sel, the multiplexer select signal.
The ports corresponding to the interconnects are: c_SetA_RW and c_SetB_RW, which signal the read-write operation of the two memory sets in each stage; c_input_flip, which indicates whether a flip is required for the inputs currently read from memory; and c_last_stage, which signals the last stage of the FFT.
The ports specific to the RAM memory banks are: c_wren_A, the SetA read-write signal, and c_wren_B, the SetB read-write signal.
The register/multiplexer control ports of butterfly unit0 are: the input register load ports c_bfy0_ip0_reg_load and c_bfy0_ip1_reg_load, the input mux select port c_bfy0_mux_sel, the twiddle factor input register load port c_bfy0_tw_reg_load, the input twiddle factor select port c_bfy0_tw_sel, the twiddle factor address register load port c_bfy0_tw_addr_reg_load, and the butterfly output register load ports c_bfy0_add_op_reg_load and c_bfy0_sub_op_reg_load.
The register/multiplexer control ports of butterfly unit1 are: the input register load ports c_bfy1_ip0_reg_load and c_bfy1_ip1_reg_load, the input mux select port c_bfy1_mux_sel, the twiddle factor input register load port c_bfy1_tw_reg_load, the input twiddle factor select port c_bfy1_tw_sel, the twiddle factor address register load port c_bfy1_tw_addr_reg_load, and the butterfly output register load ports c_bfy1_add_op_reg_load and c_bfy1_sub_op_reg_load.
Figure 3.24 shows waveforms of control unit for a 16-point FFT computation.
Figure 3.24: Control unit waveforms.
The state transitions across the different FFT stages are evident in the waveforms and agree with the description above. The butterfly unit control signals are symmetric across FFT stages.
With each component of the FFT processor described in detail, the operation of the processor as a whole is best understood by following the flow of data through the components during a computation. To this end, a novel dataflow algorithm is described in the following section.
3.8 Dataflow Algorithm
The flow of data across the components of the processor is the basis of the FFT processor architecture. The dataflow algorithm describes how data moves through the memory, the interconnects and the butterfly units. It also specifies the order in which inputs are read from memory, outputs are stored to memory, inputs are routed to the butterfly units, twiddle factors are accessed, and butterfly outputs are routed back to memory. The novel dataflow algorithm is illustrated in Figure 3.25, in which thick lines denote butterfly unit0 and dotted lines denote butterfly unit1.
Figure 3.25: 16-point dataflow butterfly diagram for FFT processor architecture.
[Butterfly diagram for stages 0 to 3 of a 16-point FFT. The inputs X(0) to X(15) are held in SetA (RAM0 to RAM3); the intermediate results Y, Z and U and the final outputs X alternate between SetB (RAM4 to RAM7) and SetA from stage to stage. Each butterfly is marked with its twiddle factor W and its add (+) and subtract (-) outputs.]
The dataflow algorithm presented below is in accordance with Figure 3.25.
In even-numbered stages, the inputs are read from SetA memory and routed to the butterfly units via interconnectA. The twiddle factors are read from ROM and sent to the butterfly units. After the butterfly operation, the outputs are routed via interconnectB and stored in SetB memory.
In odd-numbered stages, the inputs are read from SetB memory and routed to the butterfly units via interconnectB. The twiddle factors are read from ROM and sent to the butterfly units. After the butterfly operation, the outputs are routed via interconnectA and stored in SetA memory.
Dataflow with respect to butterfly unit0:
Even-numbered stages:
• During all stages except the last stage:
Read inputs from RAM0 and RAM1. After the butterfly computation, the results of the addition and subtraction are stored in RAM4 and RAM5.
• During the last stage:
Read inputs from RAM0 and RAM2. After the butterfly computation, the results of the addition and subtraction are stored in RAM4 and RAM5.
Odd-numbered stages:
• During all stages except the last stage:
Read inputs from RAM4 and RAM5. After the butterfly computation, the results of the addition and subtraction are stored in RAM0 and RAM1.
• During the last stage:
Read inputs from RAM4 and RAM6. After the butterfly computation, the results of the addition and subtraction are stored in RAM0 and RAM1.
Dataflow with respect to butterfly unit1:
Even-numbered stages:
• During all stages except the last stage:
Read inputs from RAM2 and RAM3. After the butterfly computation, the results of the addition and subtraction are stored in RAM6 and RAM7.
• During the last stage:
Read inputs from RAM1 and RAM3. After the butterfly computation, the results of the addition and subtraction are stored in RAM6 and RAM7.
Odd-numbered stages:
• During all stages except the last stage:
Read inputs from RAM6 and RAM7. After the butterfly computation, the results of the addition and subtraction are stored in RAM2 and RAM3.
• During the last stage:
Read inputs from RAM5 and RAM7. After the butterfly computation, the results of the addition and subtraction are stored in RAM2 and RAM3.
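The bank selection rules listed above can be collected into a small lookup table. The following Python sketch merely transcribes those rules; the names DATAFLOW and banks_for are hypothetical, and the stage numbering is assumed to start at 0 as in Figure 3.25.

```python
# (butterfly unit, stage parity, is last stage) -> (banks read, banks written).
# Transcribed from the dataflow description; the table name is hypothetical.
DATAFLOW = {
    (0, "even", False): (("RAM0", "RAM1"), ("RAM4", "RAM5")),
    (0, "even", True):  (("RAM0", "RAM2"), ("RAM4", "RAM5")),
    (0, "odd", False):  (("RAM4", "RAM5"), ("RAM0", "RAM1")),
    (0, "odd", True):   (("RAM4", "RAM6"), ("RAM0", "RAM1")),
    (1, "even", False): (("RAM2", "RAM3"), ("RAM6", "RAM7")),
    (1, "even", True):  (("RAM1", "RAM3"), ("RAM6", "RAM7")),
    (1, "odd", False):  (("RAM6", "RAM7"), ("RAM2", "RAM3")),
    (1, "odd", True):   (("RAM5", "RAM7"), ("RAM2", "RAM3")),
}


def banks_for(unit, stage, total_stages):
    """Return (read banks, write banks) for a butterfly unit in a given stage.

    Stages are assumed to be numbered from 0, so the last stage is
    total_stages - 1.
    """
    parity = "even" if stage % 2 == 0 else "odd"
    is_last = stage == total_stages - 1
    return DATAFLOW[(unit, parity, is_last)]


# 16-point FFT: four stages, the last one (stage 3) is odd-numbered.
for stage in range(4):
    print(stage, banks_for(0, stage, 4), banks_for(1, stage, 4))
```

In every case the write banks depend only on the stage parity; only the read banks change in the last stage.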
The simulation and synthesis of the FFT processor are described in the following chapter.
4. IMPLEMENTATION
The FFT processor was implemented in VHDL, written in a Linux (CentOS) environment using the Emacs editor. Once the FFT processor architecture had been designed, its components were implemented separately. The functionality of each component was verified by running test benches in ModelSim (a simulation tool from Mentor Graphics Inc.). After the individual modules had been functionally verified, they were integrated to form the complete system. The complete FFT processor was then functionally verified using a system-level test bench simulated in ModelSim. Initially, a 16-point FFT computation was verified. Later, the processor was configured for different radix-2 FFT sizes to verify their functionality. The simulation results were analyzed to determine the computation time for the different FFT sizes.
Following simulation, a synthesizable version of the FFT processor was created from its simulation version. The processor was then synthesized for an Altera Stratix V FPGA device (5SGSMD5K2F40C2) using Quartus II version 12.1. The synthesis results were analyzed to determine the area, maximum operating frequency and power consumption. The implementation is discussed in detail in the following sections.
4.1 Simulation
The FFT processor was simulated before it was synthesized for an FPGA device. Simulation was carried out to verify the functionality of the processor and to validate its results. The RTL simulation was also used to capture the signal activity of the ports and signals in a Value Change Dump (VCD) file. The VCD file was then used for power analysis on the post-fit netlist obtained after synthesis. The simulation phase is described in detail below.
4.1.1 Pre-simulation
Before the FFT processor could be simulated, the necessary environment had to be set up. Simulation of the FFT processor required RAM and ROM contents, which were provided to the simulation as text files. The RAM stores the data samples, whereas the ROM stores the twiddle factors required for the FFT computation. Data samples and twiddle factor values were represented in Q-14 fixed-point format. Since the datapath of the processor is 16 bits wide, the lower fourteen bits hold the fractional part while the upper two bits hold the integer part and the sign. A test application was written in C to generate the RAM and ROM text files containing the input samples and the twiddle factors, respectively. For an N-point FFT, the test application generated Q-14 complex data, packed the real and imaginary parts of each sample into a 32-bit value and distributed the values among the four SetA memory text files. In addition, the test application generated N/2 twiddle factors, packed the real and imaginary parts of each twiddle factor into a 32-bit value and stored the values in the ROM text file. The RAM and ROM text files were used in simulation to initialize the RAM and ROM memories of the FFT processor, respectively.
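The Q-14 conversion and 32-bit packing described above can be illustrated with a short Python sketch. It is not the C test application used in this work. The helper names are hypothetical, and several details are assumptions not stated in the text: the real part is placed in the upper 16 bits, values are rounded to nearest and saturated, the twiddle factors are taken as W_N^k = exp(-j2πk/N), and each word is printed as eight hexadecimal digits.

```python
import cmath

Q_BITS = 14  # number of fractional bits in Q-14


def to_q14(value):
    """Convert a float to a 16-bit two's-complement Q-14 word (saturating)."""
    scaled = int(round(value * (1 << Q_BITS)))
    scaled = max(-32768, min(32767, scaled))  # clamp to the signed 16-bit range
    return scaled & 0xFFFF


def pack_complex(re, im):
    """Pack real and imaginary parts into one 32-bit word.

    Assumption: real part in the upper half-word, imaginary part in the lower.
    """
    return (to_q14(re) << 16) | to_q14(im)


def twiddle_words(n_points):
    """Return the N/2 packed twiddle factors W_N^k = exp(-j*2*pi*k/N)."""
    words = []
    for k in range(n_points // 2):
        w = cmath.exp(-2j * cmath.pi * k / n_points)
        words.append(pack_complex(w.real, w.imag))
    return words


# Hypothetical ROM text-file content for N = 16: one hexadecimal word per line.
for word in twiddle_words(16):
    print(f"{word:08X}")
```

With these assumptions, the value 1.0 is represented as 0x4000 and -1.0 as 0xC000, so both fit in the 16-bit datapath with two integer/sign bits.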
All the VHDL files, along with the RAM and ROM text files, were grouped inside a user-defined project directory. The simulation tool was ModelSim version 6.5c from Mentor Graphics Inc., used in a Linux (CentOS) environment. To launch ModelSim from the command line, the following command was executed from the project directory.
• vsim &
Before the first simulation run, a work folder was created inside the project directory by executing the following command inside it.
• vlib work
4.1.2 Running Simulation
To run a simulation, the VHDL source files first had to be compiled. Since there were many VHDL files, a script file was created to compile all of them at once and to save time during frequent recompilation. The following command was used to compile the VHDL files for simulation.
• vcom file_name1.vhd file_name2.vhd ... file_nameN.vhd
The following steps were carried out in sequence to run a simulation.
1. Compile VHDL files.
The script file was executed using the following command to compile the VHDL files.
• sh script_file_name.sh
2. Load the design.
After compilation, a design file with the name of the top-level entity was created inside the work directory. Before running the simulation, the design was loaded using the following command.
• vsim work.top_level_entity_name
3. Add ports and signals to wave window.
To observe the state and values of various ports and signals, they were loaded into the wave window by executing the following command.
• add wave signal1 signal2 ... signalN port1 port2 ... portN
4. Run simulation.
The simulation was run for a finite time using the following command.
• run time_in_nano_seconds
4.1.3 Post Simulation
A fixed-point FFT algorithm was implemented in C and in MATLAB to calculate an N-point FFT. The MATLAB results were taken as the accuracy reference, so the C implementation results were first verified against the MATLAB results for an N-point FFT. Because the FFT computation performed by the processor is comparable to the C implementation, the simulation results were then validated against the C implementation results. The simulation results were further analyzed to determine the computation time for different FFT sizes.
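The idea of validating a fixed-point FFT against a more accurate reference can be sketched as follows. This is not the C or MATLAB code used in this work: it is a minimal Python model that assumes a radix-2 decimation-in-time FFT, Q-14 twiddle factors, truncation by arithmetic right shift after each multiplication and no per-stage scaling, and it uses a direct floating-point DFT as the reference. All function names are hypothetical.

```python
import cmath

Q_BITS = 14
ONE = 1 << Q_BITS  # the value 1.0 in Q-14


def to_q14(value):
    """Quantize a float to a Q-14 integer (no saturation in this sketch)."""
    return int(round(value * ONE))


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_q14(samples):
    """Radix-2 decimation-in-time FFT on (re, im) pairs of Q-14 integers."""
    n = len(samples)
    bits = n.bit_length() - 1
    data = [samples[bit_reverse(i, bits)] for i in range(n)]
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(-1j * cmath.pi * k / half)
                wr, wi = to_q14(w.real), to_q14(w.imag)
                ar, ai = data[start + k]
                br, bi = data[start + k + half]
                # A Q-14 by Q-14 product is Q-28; shift back to Q-14.
                tr = (br * wr - bi * wi) >> Q_BITS
                ti = (br * wi + bi * wr) >> Q_BITS
                data[start + k] = (ar + tr, ai + ti)
                data[start + k + half] = (ar - tr, ai - ti)
        half *= 2
    return data


def dft(values):
    """Direct floating-point DFT used as the accuracy reference."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def max_error(n_points):
    """Largest deviation of the fixed-point result from the reference."""
    x = [0.05 * cmath.exp(2j * cmath.pi * 3 * t / n_points)
         for t in range(n_points)]
    fixed = fft_q14([(to_q14(v.real), to_q14(v.imag)) for v in x])
    reference = dft(x)
    return max(abs(complex(re, im) / ONE - ref)
               for (re, im), ref in zip(fixed, reference))


print(f"max |fixed - reference| for N = 16: {max_error(16):.6f}")
```

A small maximum error indicates that the fixed-point model tracks the reference; the acceptable tolerance depends on the word length and rounding scheme and is not specified here.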
4.1.4 VCD File Generation
To perform power analysis using the PowerPlay power analyzer, a VCD file containing the signal activity of the processor had to be generated. Hence, VCD files for different FFT sizes were generated during RTL functional simulation. The following steps were executed in sequence to generate the VCD files.
1. Compile VHDL files.
The script file was executed using the following command to compile the VHDL files.
• sh script_file.sh
2. Load the design.
After compilation, a design file with the name of the top-level entity was created inside the work directory. The design was loaded for simulation using the following command.
• vsim work.top_level_entity_name
3. Add ports and signals to wave window.
To observe the state and values of various ports and signals, they were loaded into the wave window by executing the following command.
• add wave signal1 signal2 ... signalN port1 port2 ... portN
4. Create VCD file.
During simulation, signal activity was captured in a VCD file by using the following command.
• vcd add -file file_name.vcd -internal -ports -r comp_instance_name/*
The command recursively captures the activity of the ports and internal signals of all hierarchical components. The -internal switch enables capturing of internal signal activity, while the -ports switch enables capturing of the input/output port activity of the components.
5. Run simulation.
The simulation was run for a finite time using the following command.
• run time_in_nano_seconds
6. Dump signal activities.
The following command dumps all the signal activity and value changes into the VCD file.
• vcd checkpoint
7. Exit simulation.
Writing to the VCD file finishes only when the simulation is exited. The following command was used to exit the simulation.
• quit -f
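Since the same command sequence is repeated for every FFT size, it can be convenient to generate it programmatically. The Python sketch below only assembles the simulator commands listed in steps 2 to 7 as text lines; the function name vcd_do_script and the entity, instance and file names in the example are hypothetical, and the compilation step is left to the shell script.

```python
def vcd_do_script(top_entity, instance, vcd_file, run_ns, signals=()):
    """Assemble the simulator commands of steps 2 to 7 as a list of lines."""
    lines = [f"vsim work.{top_entity}"]
    if signals:
        lines.append("add wave " + " ".join(signals))
    lines += [
        # Recursively capture ports and internal signals below the instance.
        f"vcd add -file {vcd_file} -internal -ports -r {instance}/*",
        f"run {run_ns} ns",
        "vcd checkpoint",
        "quit -f",
    ]
    return lines


# Hypothetical names, for illustration only.
script = vcd_do_script("fft_top", "/fft_top/u_core", "fft_16.vcd", 1000,
                       signals=("clk", "rst", "c_start", "c_done"))
print("\n".join(script))
```

The resulting lines could be saved to a script file, one per FFT size, so that each VCD file is produced with an identical procedure.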
4.2 Synthesis and Power Analysis
Synthesis was carried out with Quartus II version 12.1 from Altera Corp., installed on a Windows 8 machine. The FFT processor core was synthesized for the Altera Stratix V FPGA device 5SGSMD5K2F40C2. The following procedure was followed step by step to carry out synthesis on the FPGA.
1. Create project.
A new Quartus project was created for compiling the VHDL files of the FFT processor, and all the VHDL files were added to it during project creation.
2. Set top level entity.
The top-level entity of the project, which serves as the entry point of the compilation, was specified.
3. Compile.
The project was compiled, and the compilation errors and warnings encountered were fixed. The compilation process consists of analysis and synthesis, followed by placement and routing, timing analysis, and the EDA netlist write operation.
4. Program FPGA.
After successful compilation, the FPGA device was programmed via the USB port.
5. Analyze results.
The synthesis results, fitter summary and timing analyzer summary were analyzed to obtain design parameters such as area, timing and maximum operating frequency.
After synthesis was complete, power consumption was analyzed with the PowerPlay Power Analyzer tool available in Quartus II. The PowerPlay Power Analyzer needs a VCD file containing signal activity as input to estimate the power consumption. The tool also requires a Synopsys Design Constraint (SDC) file to set the frequency at which the power analysis is carried out. The following sequence of steps was followed to carry out power analysis.
1. Launch PowerPlay Power Analyzer.
The PowerPlay Power Analyzer tool was launched in Quartus II after successful compilation.
2. Input SDC file.
The frequency of operation for power analysis was specified using the SDC file.
3. Input VCD file.
The VCD file generated during RTL functional simulation was provided as an input to the PowerPlay Power Analyzer tool.
4. Set default toggle rate.
The default toggle rate was set according to the frequency at which power analysis was done. Toggle rates of a design module generally vary between 8% and 12%; hence, the default toggle rate was chosen as 12% of the frequency.
5. Start power analysis.
The power analysis was started after the environment had been set up, and the errors and warnings encountered were fixed.
6. Analyze results.
The dynamic power dissipation, static power dissipation and routing dynamic power dissipation were noted and analyzed for different FFT sizes.
5. RESULTS AND EVALUATION
The FFT processor was functionally verified through RTL simulation using ModelSim version 6.5c. During simulation, the FFT computation time was measured for different FFT sizes; the results are tabulated below. Before synthesis, the FFT core was configured to support a maximum FFT size of Nmax = 2048. The resulting FFT core (excluding memory components) was synthesized for an Altera Stratix V FPGA device (5SGSMD5K2F40C2) using Quartus II version 12.1. The area consumption of the FFT core is tabulated and discussed below.
The resource usage of each FFT component and of the FFT core as a whole is given in Table 5.1.
Table 5.1: FFT Core Resource Utilization.
Components               Combinational ALUTs   Registers   DSP Blocks
Butterfly Units          136                   352         4
Interconnect             170                   752         0
Address Generation Unit  482                   199         0
Control Unit             291                   111         0
FFT Core                 1143                  1754        4
The FFT core utilizes 1143 of the 345200 available ALUTs (<1%) and 4 of the 1590 available DSP blocks (<1%) on the FPGA device.
The FFT processor was simulated in ModelSim for functional verification and to measure the computation time. The FFT computation time for different FFT sizes is given in Table 5.2. The simulation was carried out at a clock frequency of fmax = 200 MHz.
Table 5.2: FFT Computation Time.
N      Clock Cycles   Total Time [µs]
16     74             0.37
32     132            0.66
64     254            1.27
128    520            2.6
256    1106           5.53
512    2396           11.98
1024   5222           26.11
2048   11376          56.88
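The total time in Table 5.2 follows directly from the clock cycle count and the 200 MHz clock frequency. As a quick check, the Python sketch below recomputes the time column; the names CYCLES and computation_time_us are hypothetical, and the cycle counts are copied from the table.

```python
F_CLK_HZ = 200e6  # simulation clock frequency from the text

# Clock cycle counts copied from Table 5.2.
CYCLES = {16: 74, 32: 132, 64: 254, 128: 520,
          256: 1106, 512: 2396, 1024: 5222, 2048: 11376}


def computation_time_us(cycles, f_clk_hz=F_CLK_HZ):
    """Computation time in microseconds: cycles divided by clock frequency."""
    return cycles / f_clk_hz * 1e6


for n, cycles in CYCLES.items():
    print(f"N = {n:4d}: {cycles:5d} cycles, {computation_time_us(cycles):.2f} us")
```

At 200 MHz one clock cycle lasts 5 ns, so for example 74 cycles correspond to 0.37 µs.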
The plot of computation time against FFT size is given in Figure 5.1.
Figure 5.1: FFT computation time as a function of N.
[Plot: FFT computation time in µs (vertical axis, 0 to 60) against FFT size N (horizontal axis, 16 to 2048).]
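Computation time grows slightly faster than linearly with FFT size. According to Table 5.2, each doubling of N multiplies the computation time by a factor of about 1.8 for the smallest sizes and about 2.2 for the largest, which approaches the growth expected from the N log2 N operation count of a radix-2 FFT. The curve in Figure 5.1 rises steeply beyond 256 points because the FFT sizes on its horizontal axis double at each tick; the growth is not exponential in N.
According to Table 5.2, the computation time (in microseconds) for the specified FFT sizes meets the timing constraints of various wireless standards such as IEEE 802.11a/g, IEEE 802.16e, 3GPP-LTE, DAB and DVB.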
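The growth of computation time with N can be examined by comparing the measured time ratio for each doubling of N with the ratio implied by an N log2 N operation count. The Python sketch below does this with the values of Table 5.2; the function names are hypothetical, and the N log2 N model is a textbook assumption for a radix-2 FFT rather than a cycle-accurate model of this processor.

```python
import math

# Total time in microseconds, copied from Table 5.2.
TIMES_US = {16: 0.37, 32: 0.66, 64: 1.27, 128: 2.6,
            256: 5.53, 512: 11.98, 1024: 26.11, 2048: 56.88}


def growth_ratios(times):
    """Measured time(N) / time(N/2) for every size except the smallest."""
    sizes = sorted(times)
    return {n: times[n] / times[n // 2] for n in sizes[1:]}


def nlogn_ratio(n):
    """Ratio of N*log2(N) to (N/2)*log2(N/2), the idealized growth factor."""
    return (n * math.log2(n)) / ((n / 2) * math.log2(n / 2))


for n, measured in growth_ratios(TIMES_US).items():
    print(f"N = {n:4d}: measured x{measured:.2f}, N log2 N model x{nlogn_ratio(n):.2f}")
```

The measured factors stay close to 2 for all sizes, which is what distinguishes this behavior from exponential growth.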
The power dissipation of the FFT core was determined using the PowerPlay Power Analyzer tool available in Quartus II. The PowerPlay tool performs power analysis on the post-fit netlist obtained after synthesis. The tool requires a VCD file containing the signal activity of the FFT core. As explained earlier, signal activity at RTL was captured during simulation in ModelSim. In addition, the frequency at which power analysis was to be done was specified through the SDC file. The default toggle rate was set to 12% of the frequency for those signals that had no signal activity in the VCD file. The operating voltage was set to 0.9 V and the ambient temperature to 25 degrees Celsius. The power analysis results are presented in Table 5.3. Total power dissipation (in milliwatts) consists of the dynamic power, static power and routing dynamic power dissipation of the FFT core. Energy consumption, expressed in microjoules, was calculated as Energy = Power × Time.
Table 5.3: Power Analysis Summary.
N      Total Power Dissipation [mW]   Energy [µJ]
16     399.05                         0.147
32     412.46                         0.271
64     423.46                         0.537
128    407.83                         1.06
256    434.20                         2.4
512    428.57                         5.1
1024   432.01                         11.279
2048   434.44                         24.71
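The energy column of Table 5.3 is the product of total power dissipation and computation time, with a unit conversion: milliwatts times microseconds gives nanojoules, so the product is divided by 1000 to obtain microjoules. The Python sketch below recomputes the column from the tabulated power and time values; the names are hypothetical, and small differences from the table are due to rounding.

```python
# (total power in mW, computation time in us) per FFT size,
# copied from Table 5.3 and Table 5.2.
MEASUREMENTS = {
    16: (399.05, 0.37), 32: (412.46, 0.66), 64: (423.46, 1.27),
    128: (407.83, 2.6), 256: (434.20, 5.53), 512: (428.57, 11.98),
    1024: (432.01, 26.11), 2048: (434.44, 56.88),
}


def energy_uj(power_mw, time_us):
    """Energy in microjoules: mW * us yields nJ, so divide by 1000."""
    return power_mw * time_us / 1000.0


for n, (power_mw, time_us) in MEASUREMENTS.items():
    print(f"N = {n:4d}: {energy_uj(power_mw, time_us):.3f} uJ")
```

Because the power stays near 420 mW for all sizes, the energy per transform is governed almost entirely by the computation time.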
According to Table 5.3, power dissipation varies only slightly about an average value of approximately 420 mW across the different FFT sizes. The variation of energy consumption with FFT size is illustrated in Figure 5.2.
Figure 5.2: FFT total energy consumption as a function of N.
[Plot: FFT total energy consumption in µJ (vertical axis, 0 to 30) against FFT size N (horizontal axis, 16 to 2048).]
The dynamic power dissipation tends to increase with FFT size because of increased signal activity and increased area, although the trend in Table 5.3 is not strictly monotonic. Energy consumption, in contrast, grows steeply with N, because it is the product of a nearly constant power and a computation time that grows with N.
The proposed FFT processor was compared with existing research work in terms of computation time and scalability. The comparison is given in Table 5.4.
Table 5.4: Comparison With Existing FFT Processors.
Design     64 [µs]   128 [µs]   256 [µs]   512 [µs]   1024 [µs]   2048 [µs]   f [MHz]   Scalable
[13]       2.1       -          -          -          -           -           31.69     No
[2]        -         -          -          -          26          -           100       No
[3]        -         40.34      47.30      52.30      61.14       -           470       Yes
[6]        -         -          -          -          57                      17.86     Yes
Proposed   1.27      2.6        5.53       11.98      26.11       56.88       200       Yes
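For the FFT sizes reported in Table 5.4, the FFT processor based on our novel architecture achieves shorter or comparable computation times relative to both the fixed-length and the variable-length FFT processors considered.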
6. CONCLUSIONS
A scalable FFT processor architecture was proposed. To realize it, a radix-2, fixed-point, 16-bit, N-point FFT processor was implemented in VHDL. The FFT processor was simulated in ModelSim to verify its functionality and to determine the computation time. The FFT core was synthesized for an Altera Stratix V FPGA device (5SGSMD5K2F40C2) to determine the area, maximum operating frequency and power consumption. Based on the simulation and synthesis results, the proposed architecture meets the timing constraints of a wide range of wireless standards such as IEEE 802.11a/g, IEEE 802.16e, 3GPP-LTE, DAB and DVB. Hence, the proposed architecture can be adopted in SDR platforms supporting these wireless standards.
According to [9], higher-radix FFT computation is faster than lower-radix FFT computation for a large number of FFT points. The proposed architecture can therefore be extended to implement radix-4, radix-8 or split-radix FFT processors with additional modifications. To extend the architecture to a higher radix, a butterfly of suitable radix should be used, the address generation algorithm should be extended to support the required data access during computation, additional control signals should be added to the control unit, and the interconnect should be extended to accommodate the additional data and address signals. In addition, the processor architecture can be adopted by non-OFDM applications that require a reasonable balance between the specified design parameters.
REFERENCES
[1] C. Sydney Burrus. Signal Flow Graphs of Cooley-Tukey FFTs. http://cnx.org/content/m16352/latest/. Accessed: 21-Mar-2013.
[2] Z.H. Derafshi, J. Frounchi, and H. Taghipour. A high speed FPGA implementation of a 1024-point complex FFT processor. In Second International Conference on Computer and Network Technology (ICCNT), 2010, pages 312-315, April 2010.
[3] K. George and C.-I.H. Chen. Configurable and expandable FFT processor for wideband communication. In IEEE Instrumentation and Measurement Technology Conference Proceedings, 2007 (IMTC 2007), pages 1-6, May 2007.
[4] Haining Jiang, Hanwen Luo, Jifeng Tian, and Wentao Song. Design of an efficient FFT processor for OFDM systems. IEEE Transactions on Consumer Electronics, 51(4):1099-1103, Nov. 2005.
[5] J.W. Cooley and J.W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297-301, 1965.
[6] Y.-T. Lin, P.-Y. Tsai, and T.-D. Chiueh. Low-power variable-length fast Fourier transform processor. IEE Proceedings - Computers and Digital Techniques, 152(4):499-506, July 2005.
[7] Marcus Majo. Design and implementation of an OFDM-based communication system for the GNU radio platform. Master's thesis, IKR, University of Stuttgart, 2009.
[8] Ramjee Prasad. OFDM for Wireless Communications Systems. Artech House, 31 Mar 2004.
[9] K.L.S. Swee and Lo Hai Hiung. Performance comparison review of radix-based multiplier designs. In 4th International Conference on Intelligent and Advanced Systems (ICIAS), 2012, volume 2, pages 854-859, June 2012.
[10] J. Takala and K. Punkka. Butterfly unit supporting radix-4 and radix-2 FFT. In Proc. Int. Workshop Spectral Methods and Multirate Signal Process., Riga, Latvia, pages 47-53, June 20-22 2005.
[11] Song-Nien Tang, Chi-Hsiang Liao, and Tsin-Yuan Chang. An area- and energy-efficient multimode FFT processor for WPAN/WLAN/WMAN systems. IEEE Journal of Solid-State Circuits, 47(6):1419-1435, June 2012.
[12] Tran-Thong and Bede Liu. Fixed-point fast Fourier transform error analysis. IEEE Transactions on Acoustics, Speech and Signal Processing, 24(6):563-573, Dec. 1976.
[13] Bingrui Wang, Qihui Zhang, Tianyong Ao, and Mingju Huang. Design of pipelined FFT processor based on FPGA. In Second International Conference on Computer Modeling and Simulation, 2010 (ICCMS '10), volume 4, pages 432-435, Jan. 2010.
[14] Ning Zhang and R.W. Brodersen. Architectural evaluation of flexible digital signal processing for wireless receivers. In Conference Record of the Thirty-Fourth Asilomar Conference on Signals, Systems and Computers, 2000, volume 1, pages 78-83, Oct. 29 - Nov. 1, 2000.
[15] Qihui Zhang and Nan Meng. A low area pipelined FFT processor for OFDM-based systems. In 5th International Conference on Wireless Communications, Networking and Mobile Computing (WiCom '09), pages 1-4, Sept. 2009.
A. APPENDIX
The FFT core was synthesized for an Altera Stratix V FPGA (5SGSMD5K2F40C2) using Quartus II software. After synthesis, the synthesized FFT core and its components can be viewed using the Quartus II RTL viewer tool, as shown in the screenshots below.
Figure A.1: Synthesized FFT core in RTL viewer.
Figure A.2: Synthesized butterﬂy unit in RTL viewer.
Figure A.3: Synthesized complex multiplier in RTL viewer.
Figure A.4: Synthesized interconnect in RTL viewer.
Figure A.5: Synthesized address generation unit in RTL viewer.
Figure A.6: Synthesized control unit in RTL viewer.
The chip planner tool in Quartus II software provides a pictorial view of the location of the synthesized FFT core on the FPGA chip. The dark blue region in Figure A.7 is the location of the FFT core. The core was placed in a localized region, which indicates efficient placement and routing of the FFT core.
Figure A.7: Location of FFT core on the FPGA chip, courtesy: Chip planner tool.
Placing the synthesized design in a localized region is efficient in the sense that it reduces static and dynamic power consumption, reduces clock skew issues and allows economical use of the available FPGA chip area.
The design partition planner tool in Quartus II shows the design partitions of the FFT core components, as in Figure A.8.
Figure A.8: Partition of FFT core components on the FPGA chip, courtesy: Design partition planner tool.
The major design partitions are mapped as follows.
U8: Address Generation Unit
U1: Control Unit
U6: Butterfly Unit0
U7: Butterfly Unit1
INTER_CONNECT_A: InterconnectA
INTER_CONNECT_B: InterconnectB
