Pipelined Fast Fourier Transform Processor by Ali, Muazam
  
 
 
 
 
 
 
 
MUAZAM ALI 
PIPELINED FAST FOURIER TRANSFORM PROCESSOR 
 
Master of Science Thesis 
 
 
 
 
 
 
Examiners: Prof. Jarmo Takala and 
Dr. Fahad Qureshi 
Examiner and topic approved by the 
Faculty Council of the Faculty of 
Computing and Electrical Engineering 
on January 4th, 2017 
 
i 
 
ABSTRACT 
Muazam Ali: Pipelined Fast Fourier Transform Processor 
Tampere University of Technology 
Master of Science Thesis, 52 pages, 2 Appendix pages 
August 2017 
Master’s Degree Programme in Information Technology 
Major: Pervasive Systems 
Examiners: Prof. Jarmo Takala and Dr. Fahad Qureshi 
 
Keywords: FFT, SDF, VHDL, MATLAB, OFDM, Constant Multiplication 
 
In some cases, signal processing is easier in frequency-domain and Discrete Fourier 
Transform (DFT) is a useful tool to convert signals between time and frequency do-
mains. Fast Fourier Transform (FFT) is an efficient algorithm for computing DFT. FFT 
has a number of applications, where frequency-domain of a signal needs to be analyzed. 
It gained attention in communication systems because FFT is a crucial processing oper-
ation in the Orthogonal Frequency Division Multiplexing (OFDM) systems. The re-
search work carried out in this thesis presents hardware implementation of FFT proces-
sors. The processors’ architectures are designed based on Single-path Delay Feedback 
(SDF) FFT architecture scheme, which implements the radix-22 and identical radix-22 
FFT algorithms for 2048-point FFT. The designed identical radix-22 FFT algorithm has 
similar number of non-trivial complex twiddle factor multiplication stages, identical to 
radix-22 FFT algorithm but less complex operations. The lower complexity makes it an 
area efficient, memory efficient, reduce the multiplication cost and power consumption. 
Moreover, the computational complexity of the FFT processor can be reduced by re-
placing the general complex multipliers by constant multipliers (shift-and-add) circuits. 
W8 and W16 constant multipliers circuits are implemented in this thesis, which can re-
place three complex multipliers in the FFT processor. Finally, the FFT processors and 
constant multiplier circuits are implemented on Virtex-7 FPGA. 
 
 
ii 
PREFACE 
The research work in this thesis was carried out in the Department of Pervasive Compu-
ting, Tampere University of Technology, Finland. I would like to sincerely express my 
gratitude to the all the knowledgeable people who helped me the most throughout my 
thesis. I deeply respect and appreciate their encouragement, support and prayers. 
Foremost, I express my sincere and highest gratitude to Prof. Jarmo Takala who gave 
me opportunity to work with him in his research group. I could not have envisioned a 
better professor and mentor to conduct my thesis work. In addition to it, I will always be 
highly indebted to him for teaching me two core courses of my M.Sc. majors, DSP Im-
plementations and Computer Architecture. His immense knowledge and professional-
ism in teaching the courses helped me understand the core details of these subjects. The 
practical work conducted in these courses helped me a lot to understand the ideas even 
better and improved my skillset. I am extremely thankful to him for believing in my 
capabilities, giving me finances to support my work and giving me all the freedom I 
needed to conduct my thesis work. 
Besides my professor, I gratefully thank my advisor Dr. Fahad Qureshi for his priceless 
supervision, guidance, encouragement, inspiration and support. The most knowledgea-
ble efforts to polish my skillset and steer my research work towards completion are at-
tributed to him. He was always available to help me whenever I had trouble or a ques-
tion about my thesis work or writing. I strongly admit that his persisting support not 
only boosted confidence in me but also helped me to carry out quality research work. 
I want to pay a kind gratitude to all of my teachers who put their efforts to grow me in 
my career. I deeply respect the valuable contribution of Prof. Timo Hämäläinen, Dr. 
Matti Rintala, Arto Perttula, Tero Ahtee, and Dr. Waqar Hussain in developing my 
skillset in the field of computer systems. 
I am highly thankful to my proud father Ghulam Rasool and my dearest mother Bismil-
lah BiBi for their support to pursue my Master’s in Information Technology in Finland. 
I acknowledge their sacrifices for my best education since I was a child, which I can 
never pay back my whole life. My parents always taught me to support democracy, re-
spect different cultures, strongly believing in love and toleration for everyone. Moreo-
ver, I am extremely thankful to my elder brother Muhammad Amjad as well as my sis-
ters, Maria, Saima, Javeria and Sadia Rasool for their prayers and support. 
A special thank of mine goes to my colleagues Jingui Li, Renjie Xie and Jakub Zadnik 
who always motivated me, exchange their interesting ideas and created a friendly work 
environment to conduct my research work. 
iii 
Finally yet importantly, I thank all of my friends for their prayers and moral support in 
both academic and social life. 
This research work has been a great learning experience for me; it strengthened my 
skillset in hardware design and implementation through different approaches. I grateful-
ly thank again Prof. Jarmo Takala and Dr. Fahad Qureshi for this opportunity, believing 
in my capabilities and their persisting support as well as giving me all the freedom I 
needed to conduct my thesis. 
Tampere, 12.07.2017 
Muazam Ali 
iv 
CONTENTS 
1. INTRODUCTION ................................................................................................ 1 
1.1 Thesis Objective ......................................................................................... 4 
1.2 Thesis Organization .................................................................................... 4 
2. FAST FOURIER TRANSFORM AND ITS ALGORITHMS ................................ 5 
2.1 Decimation-in-Time FFT ............................................................................ 6 
2.2 Decimation-in-Frequency FFT .................................................................. 10 
2.3 FFT Algorithms ........................................................................................ 13 
2.3.1 Radix-2 FFT ............................................................................... 13 
2.3.2 Radix-4 FFT ............................................................................... 14 
2.3.3 Radix-2k FFT .............................................................................. 15 
2.3.4 Identical Radix-22 FFT ................................................................ 17 
3. FFT ARCHITECTURES..................................................................................... 18 
3.1 Direct Implementation .............................................................................. 18 
3.2 Memory-Based FFT Architectures ............................................................ 18 
3.3 Pipelined FFT Architectures ..................................................................... 21 
3.3.1 Feedback FFT Architectures ....................................................... 21 
3.3.2 Feedforward FFT Architectures .................................................. 23 
3.4 Mapping of FFT SFG to Hardware ........................................................... 23 
3.5 Input/Output Order or Bit-Reversal ........................................................... 25 
4. TWIDDLE FACTOR MULTIPLICATIONS ...................................................... 26 
4.1 Implementation Techniques ...................................................................... 26 
4.1.1 General Multiplier ....................................................................... 26 
4.1.2 Constant Multiplier ..................................................................... 27 
5. NUMBER REPRESENTATION ......................................................................... 30 
5.1 Binary Number Representation ................................................................. 31 
5.2 Two’s Complement Representation .......................................................... 32 
5.3 Canonical Signed Digit Representation (CSD) .......................................... 33 
5.4 Effects of Finite Word Length................................................................... 34 
5.4.1 Overflow ..................................................................................... 34 
5.4.2 Scaling: An Overflow Handling Technique ................................. 34 
5.4.3 Round–Off and Truncation.......................................................... 35 
6. PIPELINED FFT PROCESSOR.......................................................................... 38 
6.1 Radix-2 FFT Architecture ......................................................................... 38 
6.2 Radix-22 FFT Architecture ........................................................................ 39 
6.2.1 Radix-22 FFT Architecture .......................................................... 39 
6.2.2 Internal Structures of BF I and BF II ........................................... 40 
6.2.3 Improved radix-22 FFT Architecture ........................................... 42 
6.3 Identical Radix-22 FFT Architecture ......................................................... 43 
6.4 Implementation of Constant Multipliers .................................................... 44 
v 
6.4.1 W8 Constant Multiplier ............................................................... 46 
6.4.2 W16 Constant Multiplier .............................................................. 46 
7. SYNTHESIS RESULTS ..................................................................................... 48 
7.1 Verification .............................................................................................. 48 
7.2 Synthesis .................................................................................................. 48 
7.2.1 Comparison of FFT Architectures ............................................... 49 
7.2.2 FFT Processors Synthesis Results ............................................... 49 
7.2.3 Constant Multipliers Synthesis Results ........................................ 50 
7.2.4 Results Analysis .......................................................................... 50 
8. CONCLUSIONS AND FUTURE WORK ........................................................... 52 
 
APPENDIX A: Simulation Waveforms 
vi 
LIST OF SYMBOLS AND ABBREVIATIONS 
 
DSP  Digital Signal Processing 
 
DFT  Discrete Fourier Transform 
 
IDFT Inverse Discrete Fourier Transform  
 
FFT Fast Fourier Transform 
 
SFG Signal Flow Graph 
 
DIT Decimation-in-Time 
 
DIF Decimation-in-Frequency 
 
PE Processing Element 
 
BF Butterfly 
 
SDF  Single path Delay Feedback 
 
MDF Multiple path Delay Feedback 
 
FF Feed Forward  
 
MDC Multiple path Delay Commutator 
 
FIFO First in First out 
 
SCM Single Constant Multiplication 
 
MCM Multiple Constant Multiplication 
 
CSD Canonical Signed Digit 
 
CC  Clock Cycle 
 
BF I 2-point Butterfly 
 
BF II 2-point Butterfly with additional circuitry of W4 Multiplier 
1 
1. INTRODUCTION 
Advancement of digital systems is continuously replacing the old analog systems. Digi-
tal systems are the backbone of modern day organizations, products, processes and ser-
vices and their quality is increasingly subject to these systems. Applications of present 
day digital systems include communication systems, traffic systems, control systems, 
weather forecasting systems, internet and so forth. 
Microelectronics hardware and software have turned into a fundamental vital material 
for leading gadgets, products, processes and services. Modern day microelectronic tech-
nology implements complete complex information processing system on a single chip. 
The transformation from the multi-chip systems to systems-on-a-single-chip (SoC) 
opens new potential outcomes and generates new challenges. Advances in microelec-
tronic technology are exceptionally fast. 
These days, digital systems are set up to confront the requirements of the most demand-
ing digital signal processing (DSP) applications, which impose solid conditions such as, 
small silicon area, clock frequency, high throughput, reduced power consumption, la-
tency and real-time computations. In order to fulfil these requirements, digital systems 
are being implemented on Application Specific Integrated Circuits (ASICs) and Field 
Programmable Gate Arrays (FPGAs) devices. These devices operate at high clock fre-
quencies and achieve high performance in computations of digital signal processing 
(DSP) algorithms. An algorithm can have wide range of hardware implementations, 
which implies algorithms does not provide any information about the hardware architec-
ture that processes it. 
Fast Fourier transform (FFT) is an efficient algorithm for fast computation of discrete 
Fourier transform (DFT) and its inverse. FFT is highly efficient procedure for fast com-
putation of DFT finite series and requires less number of computations than the direct 
evaluation of DFT. It takes benefit of the fact that the computation of coefficients can be 
carried out iteratively. FFT reduces the computational complexity of DFT from O(N2) to 
O(NlogN).  
Cooley-Tukey’s algorithm [4] is the most popular among all the other FFT algorithms. 
Decimation-in-Time (DIT) and Decimation-in-Frequency (DIF) are two basic forms of 
Cooley-Tukey’s algorithm. Radix-2 DIT approach decomposes input samples into even 
and odd samples iteratively but radix-2 DIF approach practice same procedure from 
output samples. In FFT algorithms, decompositions are applied recursively up to re-
quired small point DFTs called as radix. Small point DFTs can be radix-2, radix-4 and 
2 
mixed radix. After Cooley-Tukey’s algorithm, various FFT algorithms were proposed 
like, radix-4, radix-8 and higher radices to decrease the complexity of twiddle factor 
multiplications. But, these FFT algorithms have complex butterfly and cannot be ap-
plied for all power-of-2 FFT sizes. For such FFTs, a mixed radix FFT algorithm (radix-
2 and radix-4) was proposed, which are not power-of-4 at cost of complex. 
After three decades of Cooley-Tukey’s FFT algorithm, a new radix-22 FFT algorithm 
was proposed, which uses simple radix-2 butterfly structure to take advantage of radix-
4. The number of complex twiddle factor multiplications in radix-22 is almost similar to 
radix-4. This algorithm is applicable for any power-of-2 FFT sizes. Based on this algo-
rithm, later radix-23, 24 and radix-2k FFT algorithms were proposed. However, binary 
tree representation is used to present the basic forms of Cooley-Tukey’s algorithms as 
well as FFT algorithms mentioned previously are generated through this representation. 
It is worth mentioning that these algorithms differ from each other only in twiddle fac-
tor multiplications. The flow of data and butterfly structure remain constant. 
FFT has various applications where frequency-domain of a signal is necessary to be 
analyzed. It gained attention in communication systems because FFT is a crucial pro-
cessing operation in orthogonal frequency division multiplexing (OFDM) systems. 
OFDM is a leading modulation technique, which adequately broaden channel usage and 
abridge inter-symbol interference (ISI) along with inter-carrier interference (ICI) gener-
ated by multipath effect. It is mainly used in digital audio broadcasting (DAB), digital 
video broadcasting-terrestrial (DVB-T) and digital video broadcasting-handheld (DVB-
H). In addition to it, there are many communication applications based on OFDM where 
FFT is most important processing operation such as Ultra Wide Band (UWB), 3GPP 
LTE, Digital Subscriber Line (DSL), WLAN, WiMax etc. 
In designing communication systems, the principal factor dominating performance is 
throughput of the system. Because of immense data rate requirements, efficient FFT 
modules with high throughput but smaller chip area as well as with reduced power con-
sumption are in demand. 
In general, there are two types of FFT architectures, pipelined architectures and 
memory-based architectures. There are two principal types of pipelined architectures: 
Feedback FFT architectures and Feedforward FFT architectures. According to the num-
ber of data lines, these architectures can be further divided into single-path and multi-
path pipelined FFT architectures. Multi-path pipelined architectures, where data is fed 
to the FFT processor using several paths, are used where the throughput has to be in-
creased for a given clock frequency of FFT processor. This research work is particularly 
focused on hardware implementation of single path delay feedback (SDF) pipelined 
FFT architecture, where data is fed to the processor using a single path. 
3 
Lately, IC design resources in chip technology have increased, which allows the imple-
mentation of complex algorithms on a single chip, such as FFT. Especially, to fulfill the 
requirements of OFDM standards, it is necessary to design an efficient FFT processor. 
Thus, the research work carried out in this thesis presents an efficient hardware imple-
mentation of an FFT processor. 
SDF pipelined FFT architectures are often used for higher throughputs and reduced 
chip-area. In SDF architectures, FFT is computed in stages where each stage consists of 
an FFT butterfly structure, a FIFO and a complex multiplier. One complex multiplica-
tion contains four real multiplications and two real additions, which is important for 
performance and chip-area. A lot of research has done in order to reduce the number of 
complex operations.  
Previous work includes FFT architectures for radix – 2, radix – 22 and radix – 23 algo-
rithms. We propose identical radix-22 FFT algorithm, which requires less number of 
complex multiplications than the previous work. Designed FFT algorithm is given the 
name identical radix – 22 due to the fact that the number of non-trivial complex multi-
plier stages are identical to radix-22. The designed FFT algorithm mapped on SDF pipe-
lined hardware architecture for 2048-point FFT. Less complex multiplications makes it 
an area-efficient, memory efficient, reduce the multiplication cost, and power consump-
tion. 
FFT architectures differ in twiddle factor multiplications but flow and processing of 
data remains the same. This architecture uses radix-2 butterflies to take advantages of 
higher radix FFT i.e. radix – 22. In general in this case, a traditional SDF architecture 
requires ten complex multiplier stages. This approach extensively reduces the number of 
complex multipliers from 10 to two, which reduces the memory requirements to store 
the twiddle factors. Five trivial complex multipliers (multiplication by –j) are handled 
within the butterfly structure by swapping and sign changing. Furthermore, three com-
plex multipliers can be replaced by constant multipliers i.e. shifters and adders. Con-
stant multipliers implementation is done as well in this work. Through integration of 
those constant multipliers in the implemented pipelined FFT processor, the number if 
complex multipliers can be further reduced from five to two. The total required memory 
in this approach is less than the previous works and it decreases the chip-area as well as 
power consumption. 
Another greater advantage of using this approach appears in reduction of round-off error 
as compared to previous work. The main reasons for reduction in round-off error are 
lower number of non-trivial multiplications and their position in the later processing 
stages. 
4 
1.1 Thesis Objective 
The main objective of this research work was to design an efficient FFT algorithm with 
less number of complex multiplications, map the designed algorithm on SDF pipelined 
FFT architecture and implement on the FPGA. 
1.2 Thesis Organization 
The thesis is organized as follows. Chapter 2 introduces the DFT and FFT. Moreover, 
derivation of different FFT algorithm is provided and main concepts of FFT are ex-
plained in this chapter such as, difference between DIT and DIF decompositions of 
FFT. Lastly, improved FFT algorithms are discussed through binary tree representation. 
Chapter 3 introduces FFT hardware architectures. A brief overview of memory based 
FFT architectures and pipelined FFT architectures is provided. However, pipelined FFT 
architectures are discusses in more detail including both feedback FFT architectures and 
feedforward FFT architectures. Finally, mapping of FFT signal flow graph (SFG) on 
hardware is discussed. 
Chapter 4 deals with FFT twiddle factor multiplications and its different hardware im-
plementation techniques. Twiddle factor multiplications’ implementation approaches 
include generic complex multiplier and constant multiplier (shift-and-add). 
Chapter 5 analyses the basic concepts of different number representations such as, Bina-
ry, Two’s Complement and Canonical Signed Digit (CSD). Moreover, it discusses the 
effects of finite word length and the concepts of overflow handling. 
Chapter 6 focuses on designing the hardware architecture for the proposed approach that 
is identical radix-22 pipelined FFT processor. Hardware architectures with general com-
plex multipliers and constant multipliers are discussed. 
Chapter 7 discusses the synthesis results of the proposed FFT architecture and make 
comparison with radix-22 FFT architecture synthesis results. 
Finally, Chapter 8 addresses the main conclusions and future works. 
5 
2. FAST FOURIER TRANSFORM AND ITS ALGO-
RITHMS 
The discrete Fourier transform (DFT) is the Fourier representation of finite length se-
quences and is a complex function of frequency [1], [2]. It is a widespread algorithm 
and can be exploited for the implementation of the convolution of two linear sequences 
as well as spectral analysis of signals [3].  Therefore, it is of greater importance to ex-
plore the efficient computation methods of DFT. 
N-point DFT is defined as 
𝑋[𝑘] = ∑ 𝑥[𝑛]𝑊𝑁
𝑛𝑘𝑁−1
𝑛=0
, k = 0, 1, …, N-1,     (2.1) 
where x[n] and X[k] are the input and output sequences respectively. The inverse dis-
crete Fourier transform (IDFT) is defined as 
𝑥[𝑛] =
1
𝑁
∑ 𝑋[𝑘]𝑊𝑁
−𝑛𝑘𝑁−1
𝑘=0
, n = 0, 1, …, N-1,    (2.2) 
where x[n] and X[k] are both complex sequences in time domain and frequency domain 
respectively. Coefficient 𝑊𝑁  is called the twiddle factor, which is defined as 
𝑊𝑁= 𝑒
−𝑗2𝜋/𝑁.          (2.3) 
In addition, 𝑒−𝑗2𝜋/𝑁 can be defined as 
𝑒−𝑗2𝜋/𝑁 =  cos (
2𝜋
𝑁
) − 𝑗sin (
2𝜋
𝑁
).       (2.4) 
The Fast Fourier Transform1 (FFT) [1], [2], [3] is an efficient algorithm for DFT 
computation. It is based on the successive decomposition of N-point DFT into smaller 
size DFTs and taking the benefit from the symmetry and periodicity properties of the 
twiddle factor in equation (2.3). Cooley-Tukey’s algorithm [4] is the most popular 
among all the other FFT algorithms. FFT algorithm reduces the computational complex-
ity of DFT from O(𝑁2) to O(NlogN). Two basic forms of Cooley-Tukey algorithm are 
Decimation-in-Time (DIT) and Decimation-in-Frequency (DIF). Cooley-Tukey FFT 
algorithm is based on divide and conquer approach where N-point DFT is decomposed 
                                               
1 In FFT algorithms, we will be using an N-point sequence or sequence of length N interchangeably. 
Terms sample and point will also be used as interchangeably. 
6 
recursively up to required small point DFT which defines the radix. The next sections 
illustrate radix-2 DIT and DIF FFT algorithms. 
2.1 Decimation-in-Time FFT 
Algorithms, which decompose the x[n] sequences into successive smaller sequences, are 
called Decimation-in-Time (DIT) algorithms [1]. The main principle of DIT FFT algo-
rithm is illustrated as follows, when N =2𝑣. 
We compute X[k] by dividing x[n] into two (N/2) sequences i.e. even and odd indexed 
sequences. Rewrite the equation (2.1) as follows 
𝑋[𝑘] = ∑ 𝑥[𝑛]𝑊𝑁
𝑛𝑘
𝑒𝑣𝑒𝑛  + ∑ 𝑥[𝑛]𝑊𝑁
𝑛𝑘
𝑜𝑑𝑑 .       (2.5) 
Now substitute n = 2m for even and n = 2m + 1 for odd indices of x[n] sequence as 
𝑋[𝑘] =  ∑ 𝑥[2𝑚]𝑊𝑁
2𝑚𝑘
𝑁
2
−1
𝑚=0
+ ∑ 𝑥[2𝑚 + 1]𝑊𝑁
(2𝑚+1)𝑘
𝑁
2
−1
𝑚=0
 .   (2.6) 
Equation (2.6) is rewritten as 
𝑋[𝑘] =  ∑ 𝑥[2𝑚](𝑊𝑁
2)𝑚𝑘
𝑁
2
−1
𝑚=0
+ 𝑊𝑁
𝑘 ∑ 𝑥[2𝑚 + 1](𝑊𝑁
2)𝑚𝑘
𝑁
2
−1
𝑚=0
 .   (2.7) 
As [1] says 𝑊𝑁
2 =  𝑊𝑁/2 because, 
𝑊𝑁
2 = 𝑒−2𝑗(
2𝜋
𝑁
)
 = 𝑒−𝑗2𝜋/(
𝑁
2
)
 = 𝑊𝑁/2 .       
Finally, rewrite the equation (2.7) as 
𝑋[𝑘] =  ∑ 𝑥[2𝑚]
𝑁
2
−1
𝑚=0 𝑊𝑁/2
𝑚𝑘+ 𝑊𝑁
𝑘 ∑ 𝑥[2𝑚 + 1]𝑊𝑁/2
𝑚𝑘  
𝑁
2
−1
𝑚=0
     (2.8) 
Now simplify the above equation as 
X[k] = G[k] + 𝑊𝑁
𝑘𝐻[𝑘],  k = 0, 1, …, N-1,          (2.9) 
where G and H are both N/2 sample points even and odd DFTs respectively. 
Flow of data in equation (2.9) is shown in the Fig. 2.1. In this figure, each of the N/2 
points DFT the computational complexity is O((𝑁/2)2) i.e. number of complex multi-
plications and additions. Then multiply the second part of (2.9) with 𝑊𝑁
𝑘  and we need N 
complex additions to get summation of both parts of (2.9). Hence, for once divided DFT 
N + (𝑁/2)2 complex multiplications and additions are needed, which is less than the 
7 
computational complexity of the traditional DFT having O(𝑁2) computational com-
plexity. 
Equation (2.9) represents two N/2 point DFTs where N is a power-of-2. Now, further 
consider N/2 point DFT decomposition into two N/4 points DFTs for each N/2. Then 
add N/4 points DFTs together to get N/2-point DFT. 
𝐺[𝑘] =  ∑ 𝑔[2𝑚]
𝑁
4
−1
𝑚=0 𝑊𝑁/4
𝑚𝑘+ 𝑊𝑁/2
𝑘 ∑ 𝑔[2𝑚 + 1]𝑊𝑁/4
𝑚𝑘  
𝑁
4
−1
𝑚=0
     (2.10) 
Similarly, write H[k] as 
𝐻[𝑘] =  ∑ ℎ[2𝑚]
𝑁
4
−1
𝑚=0 𝑊𝑁/4
𝑚𝑘+ 𝑊𝑁/2
𝑘 ∑ ℎ[2𝑚 + 1]𝑊𝑁/4
𝑚𝑘 
𝑁
4
−1
𝑚=0
     (2.11) 
 
Figure 2.1 Flow graph of DIT decomposition for computation of N-point DFT into two 
(N/2- point) DFTs, N = 8 
The computations for equations (2.10) and (2.11) are shown by green color in the Fig. 
2.2.  
Now, computations have become 2-point DFTs.  Now, the signal flow graph of 2-point 
DFT is illustrated in the Fig. 2.3. This 2-point DFT structure is called as radix-2 butter-
fly and signal flow graph (SFG) is called as butterfly structure. 
Finally, the complete SFG of 8-point DFT is illustrated in Fig. 2.4, which is replacing 
radix-2 butterfly structure, by N/4-point DFTs in Fig. 2.2. 
x[0]
x[4]
x[2]
x[6]
x[1]
x[5]
x[3]
X[0]
X[1]
X[2]
X[3]
X[4]
X[5]
X[6]
X[7]
N/2-point 
DFT
x[7]
𝐖𝐍
𝟎 
𝐖𝐍
𝟏 
𝐖𝐍
𝟐 
𝐖𝐍
𝟑 
𝐖𝐍
𝟒 
𝐖𝐍
𝟓 
𝐖𝐍
𝟔 
𝐖𝐍
𝟕 
G[0]
G[1]
G[2]
G[3]
H[0]
H[1]
H[2]
H[3]
N/2-point 
DFT
8 
 
Figure 2.2 Green color represents DIT decomposition of N/2-point DFT into two N/4-
point DFTs computations, N = 8 
To generalize the idea for large value of N where N would still be a power of two, fur-
ther decompose N/4-point DFTs2 into N/8-point DFTs and so on recursively decompose 
into small DFTs until radix-2. For any value of N, which is a power-of-2, this process 
requires 𝑙𝑜𝑔2N computing stages for computing an N-point DFT [1].   
 
Figure 2.3 Radix-2 butterfly for 2-point DFT computation 
The overall computational complexity for N-point DFT would be N + N + 4(𝑁/4)2 
number of complex multiplications and complex additions. To generalize it for any 
number N that is a power-of-2, 𝑙𝑜𝑔2N number of stages are needed for N-point DFT 
computations. Hence, total number of complex multiplications and additions are 
N𝑙𝑜𝑔2N. 
                                               
2 Equations (2.10) and (2.11) represent the N/4 -point DFTs. 
x[0]
x[4]
x[2]
x[6]
x[1]
x[5]
x[3]
X[0]
X[1]
X[2]
X[3]
X[4]
X[5]
X[6]
X[7]
N/4-point 
DFT
x[7]
N/4-point 
DFT
N/4-point 
DFT
N/4-point 
DFT
𝐖𝐍
𝟎 
𝐖𝐍
𝟏 
𝐖𝐍
𝟐 
𝐖𝐍
𝟑 
𝐖𝐍
𝟒 
𝐖𝐍
𝟓 
𝐖𝐍
𝟔 
𝐖𝐍
𝟕 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
𝐖𝐍
𝟒 
𝐖𝐍
𝟔 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
𝐖𝐍
𝟒 
𝐖𝐍
𝟔 
𝐖𝐍
𝟎 =  𝐖𝟐
𝟎 = 𝟏 
 
nth stage
𝐖𝐍
𝟒 =  𝐖𝟐
𝟏 =  𝐖𝐍
𝐍/𝟐
 = -1 
 
 
(n – 1) stage
9 
 
Figure 2.4 Signal flow graph of 8-point DIT FFT 
The computational complexity is reduced by using the periodicity and symmetry prop-
erties of twiddle factor, which is defined in Eqn. (2.3). For example, simplified structure 
of radix-2 butterfly in Fig. 2.3 is redrawn in Fig. 2.5. This structure requires only one 
complex multiplication as comped to Fig. 2.3 structure, which demands two complex 
multiplications. Because of exploiting the following DFT properties: 
Symmetry property: 𝑊𝑁
𝑘+
𝑁
2  = - 𝑊𝑁
𝑘 . 
Periodicity property: 𝑊𝑁
𝑘+𝑁 = 𝑊𝑁
𝑘 . 
For example, to compute 2048-point DFT, N = 2048 =211. This implies 𝑁2= 4194304 
and N𝑙𝑜𝑔2N = 22528. Notice that the computational complexity is significantly reduced. 
 
Figure 2.5 Simplified structure of radix-2 butterfly 
The signal flow graph of 8-point DIT DFT using simplified structure of radix-2 butter-
fly is illustrated in Fig. 2.6. Now, the computational complexity of DFT illustrated by 
Fig. 2.6 is reduced about 50%, i.e., the number of complex multiplications are reduced 
by the factor 2 as compared to SFG illustrated in Fig. 2.4. 
X[0]
X[1]
X[2]
X[3]
X[4]
X[5]
X[6]
X[7]
x[0]
x[4]
x[2]
x[6]
x[1]
x[5]
x[3]
x[7]
𝐖𝐍
𝟎 
𝐖𝐍
𝟏 
𝐖𝐍
𝟐 
𝐖𝐍
𝟑 
𝐖𝐍
𝟒 
𝐖𝐍
𝟓 
𝐖𝐍
𝟔 
𝐖𝐍
𝟕 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
𝐖𝐍
𝟒 
𝐖𝐍
𝟔 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
𝐖𝐍
𝟒 
𝐖𝐍
𝟔 
𝐖𝐍
𝟎 
𝐖𝐍
𝟒 
𝐖𝐍
𝟎 
𝐖𝐍
𝟒 
𝐖𝐍
𝟎 
𝐖𝐍
𝟒 
𝐖𝐍
𝟎 
𝐖𝐍
𝟒 
nth stage
(n – 1) stage
𝐖𝐍
𝟎 =  𝐖𝟐
𝟎 = 𝟏 
 
-1
10 
 
Figure 2.6 8-point DFT by using simplified radix-2 butterfly structure 
2.2 Decimation-in-Frequency FFT 
Decimation-in-Frequency (DIF) algorithms are built on structuring the DFT computa-
tion by decomposing X[k] i.e. output sequence into smaller subsequences [1]. 
For DIF algorithms, reconsider N as a power-of-2 and individual computation of even 
and odd indexed samples. Since, the DFT X[k] for an input sequence x[n] is defined by 
the following equation. 
𝑋[𝑘] = ∑ 𝑥[𝑛]𝑊𝑁
𝑛𝑘𝑁−1
𝑛=0
, k = 0, 1, …, N-1     (2.12) 
and even indexed samples are 
𝑋[2𝑘] =  ∑ 𝑥[𝑛]𝑊𝑁
𝑛(2𝑘)
𝑁−1
𝑛=0
, k = 0, 1, …, N/2-1.     (2.13) 
Rewrite equation (2.13) as follows 
𝑋[2𝑘] =  ∑ 𝑥[𝑛]𝑊𝑁
𝑛(2𝑘)
𝑁
2
−1
𝑛=0
 + ∑ 𝑥[𝑛]𝑊𝑁
𝑛(2𝑘)
𝑁−1
𝑛=𝑁/2
 .    (2.14) 
The previous equation can also be expressed as 
𝑋[2𝑘] =  ∑ 𝑥[𝑛]𝑊𝑁
𝑛(2𝑘)
𝑁
2
−1
𝑛=0
 + ∑ 𝑥 [𝑛 +
𝑁
2
] 𝑊𝑁
(𝑛+
𝑁
2
)(2𝑘)
𝑁
2
−1
𝑛=0
 .   (2.14) 
Since 𝑊𝑁
2𝑛𝑘   is periodic, we have 
X[0]
X[1]
X[2]
X[3]
X[4]
X[5]
X[6]
X[7]
x[0]
x[4]
x[2]
x[6]
x[1]
x[5]
x[3]
x[7]
𝐖𝐍
𝟎 
𝐖𝐍
𝟎 
𝐖𝐍
𝟎 
𝐖𝐍
𝟎 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
𝐖𝐍
𝟎 
𝐖𝐍
𝟏 
𝐖𝐍
𝟐 
𝐖𝐍
𝟑 
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
11 
𝑊𝑁
2𝑘(𝑛+
𝑁
2
)
 = 𝑊𝑁
2𝑘𝑛𝑊𝑁
𝑘𝑁 =  𝑊𝑁
2𝑛𝑘         (2.15) 
and 
𝑊𝑁
2 = 𝑊𝑁/2.           (2.16) 
By using Eqns (2.15) and (2.16), we can write the equation (2.14) as 
X[2k] = ∑ (𝑥[𝑛] + 𝑥 [𝑛 +
𝑁
2
])𝑊𝑁/2
𝑛𝑘
𝑁
2
−1
𝑛=0
,   k = 0,1, …, N/2 -1.   (2.17) 
Eqn (2.17) represents the N/2-point DFT by summation of first and second halves of the 
input sequence.  
Now, consider the odd indexed samples: 
X[2k + 1] = ∑ 𝑥[𝑛]𝑊𝑁
𝑛(2𝑘+1)
𝑁−1
𝑛=0
, k = 0, 1, …, N/2 -1 .    (2.18) 
We can rewrite Eqn. (1.18) as follows 
X[2k + 1] = ∑ 𝑥[𝑛]𝑊𝑁
𝑛(2𝑘+1)
𝑁
2
−1
𝑛=0
  +  ∑ 𝑥[𝑛]𝑊𝑁
𝑛(2𝑘+1)
𝑁−1
𝑛=
𝑁
2
 .    (2.19) 
Then the second half of Eqn. (2.19) can be expressed as 
∑ 𝑥[𝑛]𝑊𝑁
𝑛(2𝑘+1)
𝑁−1
𝑛=
𝑁
2
 = ∑ 𝑥 [𝑛 +
𝑁
2
] 𝑊𝑁
(𝑛+
𝑁
2
)(2𝑘+1)
𝑁
2
−1
𝑛=0
      (2.20) 
= 𝑊𝑁
𝑁
2
(2𝑘+1)
∑ 𝑥 [𝑛 +
𝑁
2
] 𝑊𝑁
𝑛(2𝑘+1)
𝑁
2
−1
𝑛=0
. 
Since, 𝑊𝑁
(
𝑁
2
)2𝑘
= 1 and 𝑊𝑁
𝑁
2  = -1 we have 
𝑊𝑁
𝑁
2
(2𝑘+1)
∑ 𝑥 [𝑛 +
𝑁
2
] 𝑊𝑁
𝑛(2𝑘+1)
𝑁
2
−1
𝑛=0
  = - ∑ 𝑥 [𝑛 +
𝑁
2
] 𝑊𝑁
𝑛(2𝑘+1)
𝑁
2
−1
𝑛=0
.  (2.21) 
By using Eqn. (2.21) in Eqn. (2.19), we get 
X[2k + 1] =  ∑ (𝑥[𝑛] −  𝑥 [𝑛 +
𝑁
2
]) 𝑊𝑁
𝑛(2𝑘+1)
𝑁
2
−1
𝑛=0
,  k = 0, 1, …, N-1   (2.22) 
and 
12 
X[2k + 1] =  ∑ (𝑥[𝑛] −  𝑥 [𝑛 +
𝑁
2
]) 𝑊𝑁
𝑛
𝑁
2
−1
𝑛=0
𝑊𝑁/2
𝑛𝑘 ,  k = 0, 1, …, N-1 .  (2.23) 
Eqn. (2.23) represents the N/2-point DFT by subtracting the second half of the input 
sequence x[n] from the second half of the input sequence and finally multiplying the 
result with𝑊𝑁
𝑛.  Fig. 2.7 illustrates the mathematical expressions of equations (2.17) and 
(2.23). 
 
Figure 2.7 DIF decomposition of N-point DFT into two N/2-point DFTs, N = 8 
Now, similar to DIT decomposition, N is a power-of-2, which makes N/2 even as well. 
Therefore, even and odd number of samples in Eqns (2.17) and (2.23) computes N/2-
point DFT. Furthermore, N/4-point DFTs is computed in a similar way. 
 
Figure 2.8 DIF decomposition of 8-point DFT into four 2-point DFTs 
x[0]
x[1]
x[2]
x[3]
x[4]
x[5]
x[6]
x[7]
X[0]
X[4]
X[2]
X[6]
X[1]
X[5]
X[3]
X[7]
N/2-point DFT
N/2-point DFT
-1
-1
-1
-1
𝐖𝐍
𝟎 
𝐖𝐍
𝟏 
𝐖𝐍
𝟐 
𝐖𝐍
𝟑 
x[0]
x[1]
x[2]
x[3]
x[4]
x[5]
x[6]
x[7]
-1
-1
-1
-1
N/4-point 
DFT
N/4-point 
DFT
N/4-point 
DFT
N/4-point 
DFT
X[0]
X[4]
X[2]
X[6]
X[1]
X[5]
X[3]
X[7]
-1
-1
-1
-1
𝐖𝐍
𝟎 
𝐖𝐍
𝟏 
𝐖𝐍
𝟐 
𝐖𝐍
𝟑 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
13 
Fig. 2.8 illustrates that the computations of 8-point DFT has been reduced to 2-point 
DFTs. Now, replace the 2-point DFTs in Fig. 2.8 by 2-point butterfly structure illustrat-
ed in Fig. 2.9. Consequently, SFG of 8-point DFT is illustrated in Fig. 2.10. 
 
Figure 2.9 Radix-2 DIF butterfly structure 
To generalize N, where N is a power-of-2, the computational complexity for DIF is 
same as DIT algorithm i.e.  N𝑙𝑜𝑔2N complex multiplications and (N/2)𝑙𝑜𝑔2N complex 
additions [1]. Twiddle factor multiplication stages are the major difference between DIT 
and DIF approaches. 
 
Figure 2.10 SFG of 8-point DFT using radix-2 DIF decomposition 
2.3 FFT Algorithms 
FFT-The Fast Fourier Transform [1], [2], [3] is an efficient algorithm for DFT computa-
tion. In FFT algorithms, decompositions are applied recursively up to required small 
point DFTs called as radix. Small point DFTs can be radix-2, radix-4 and mixed radix. 
The approach to develop an FFT algorithm have enormous effect on the computational 
complexity of DFT. Some of the FFT algorithms are introduced here. 
2.3.1 Radix-2 FFT 
N-point DFT X[k] of an input sequence x[n] is defined by equation (2.1) as 
(n – 1) stage
-1 𝐖𝐍
𝟎 =  𝐖𝟐
𝟎 = 𝟏 
 
nth stage
x[0]
x[1]
x[2]
x[3]
x[4]
x[5]
x[6]
x[7]
-1
-1
-1
-1
X[0]
X[4]
X[2]
X[6]
X[1]
X[5]
X[3]
X[7]
-1
-1
-1
-1
-1
-1
-1
-1
𝐖𝐍
𝟎 
𝐖𝐍
𝟏 
𝐖𝐍
𝟐 
𝐖𝐍
𝟑 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
𝐖𝐍
𝟎 
𝐖𝐍
𝟐 
𝐖𝐍
𝟎 
𝐖𝐍
𝟎 
𝐖𝐍
𝟎 
𝐖𝐍
𝟎 
14 
𝑋[𝑘] = ∑ 𝑥[𝑛]𝑊𝑁
𝑛𝑘𝑁−1
𝑛=0
, k = 0, 1, …, N-1, 
where 𝑊𝑁= 𝑒
−𝑗2𝜋/𝑁. 
Cooley and Tukey proposed radix-2 algorithm in 1965 [4] which is very beneficial for 
data vector of size N, where N is a power-of-2: 
                                     N = 𝑟𝑣,        (2.24) 
where r and v both are integers and r is called the radix. When r =2, algorithm is called 
as radix-2 FFT algorithm. Radix-2 FFT algorithm proceeds by decomposing N-point 
DFT into two N/2 point DFTs, recursively iterates up to 2-point DFTs. DIT, and DIF 
decompositions of DFT in sections 2.2 and 2.3 are based on radix-2 FFT. In radix-2, 
FFT the basic unit of computation is 2-point butterfly, shown in the Fig. 2.11. 
 
Figure 2.11 Radix-2 DIF butterfly structure 
2.3.2  Radix-4 FFT 
Radix-4 FFT algorithm is suitable for N =4𝑣. Although we can also use radix-2 FFT 
algorithm for these data points but for this particular case we are considering radix-4 
FFT algorithm. Similar to radix-2 FFT algorithm, radix-4 FFT algorithm also has both 
DIT and DIF decompositions of DFT. Radix-4 FFT processes 4 input samples at the 
same time rather than radix-2 FFT which processes 2 input samples at the same time. 
Hence, number of FFT stages are reduced in radix-4 algorithm than radix-2 FFT. For 
example, in 16-point FFT computation radix-4 needs only two FFT stages than 4 in ra-
dix-2 FFT. The concept of radix-4 algorithm resembles with the radix-2 FFT algorithm. 
As in radix-2 we recursively decompose N-point DFT computation into N/2-point DFT 
computations until 2-point DFTs which defines the radix of the FFT algorithm. Similar-
ly, radix-4 divides the N-point DFT into N/4-point DFTs and recursively iterates until 4-
point DFTs which defines the radix. The N-point input sequence is decomposed into 
four subsequences i.e. x[n], x[n + N/4], x[n + N/2] and x[n + 3N/4] where N = 0, 1, …, 
N/4-1 [5]. 
The computations carried out by one radix-4 butterfly are equal to four radix-2 butter-
flies. The basic radix-4 butterfly is shown in the Fig. 2.12. 
(n – 1) stage
-1 𝐖𝐍
𝟎 =  𝐖𝟐
𝟎 = 𝟏 
 
nth stage
15 
 
Figure 2.12 Basic butterfly for radix-4 FFT algorithm 
2.3.3 Radix-2k FFT 
The radix of the FFT algorithm extremely effects the FFT processor’s architecture. 
Small radix gains importance because of simple butterfly structure. However, high radix 
results in reduction in FFT twiddle factor multiplications. The concept of radix-rk is to 
earn both advantages simultaneously, simple butterfly structure and minimum number 
of twiddle factor multiplications [6].  
As the number of stages in FFT algorithms can be reduced by using higher radix but the 
complexity of FFT butterfly increases in high radix algorithms. One of the most popular 
FFT algorithms is radix-2k, which uses simple 2-point butterfly structure to take ad-
vantage of higher radices. For instance, radix-22, which has similar number of non–
trivial multiplications as radix – 4 FFT algorithm [6]. Fig. 2.14 shows the binary tree of 
radix-22 FFT algorithm. Furthermore, Figs. 2.13 and 2.14 depict the binary tree repre-
sentation of radix 2, 4, and 22. 
The binary trees [7 – 10] of radix – 2 DIT, radix -2 DIF and radix – 4 FFT algorithms 
are shown in the Fig. 2.13. In the binary tree representation, a weight is assigned to each 
of the nodes to represent a set of 2𝑛-point DFTs. Let n(u) be the weight of a node u. If a 
2𝑙+𝑟-point DFT is decomposed into 2𝑙-point and 2𝑟-point DFTs then the node (l + r) 
has left and right children of weight l and r respectively. Now the left child represents 
the set of 2𝑟 inner DFTs of 2𝑙 points and the right child represents the set of 2𝑙 outer 
DFTs of 2𝑟 points. Thus, the weight of the parent node is always equal to the sum of the 
weights of its children. After the binary tree structure is finalized then all the leaf nodes 
represent the BFs. Internal nodes represent the twiddle factor multiplication between the 
two sets of child DFTs. In comparison, radix – 4 algorithm has half stages than radix – 2 
FFT algorithm [10, 11]. 
x(0)
x(1)
x(2)
x(3)
-j
-1
j
-1
-1
j
-1
-j
X(0)
X(1)
X(2)
X(3)
16 
 
Figure 2.13 Binary trees of FFT algorithms: (a) Radix – 2 DIT, (b) radix – 2 DIF, (c) 
radix – 4 DIF 
 
Figure 2.14 Binary tree of radix -  22 FFT algorithm 
We can notice from the figure that every alternate stage has W4 twiddle factor multi-
plication where 𝑊4 is a trivial twiddle factor multiplication. Moreover, we can use 
radix-23, radix-24, …, radix-2𝑣 to implement radix–8, radix–16, …, radix-2𝑣 butter-
flies by using radix – 2. The twiddle factor stages in radix - 23, radix - 24, radix -  
2𝑣 are same as radix–8, radix–16, …, radix-2𝑣 FFT algorithms. Twiddle factor mul-
tiplication stages of radix–2, radix-22, radix-23, radix-24 are written in Table 2.1. 
 
Radix 
Twiddle Factor Multiplication Stages 
1 2 3 4 5 … s 
2 𝑊𝑁  𝑊𝑁
2
 𝑊𝑁
4
 𝑊𝑁
8
 𝑊 𝑁
16
 … 𝑊4 
22 𝑊4 𝑊𝑁  𝑊4 𝑊𝑁
4
 𝑊4 … 𝑊4 
n
n – 1
n – 2
2
11
1
1
n
n – 2
n – 4
4
2
2
2
2
n
n – 1
n – 2
2
1 1
1
1
n
n – 2
n – 4
4
2
2
2
2
1 11 1
1 1
1 1
17 
23 𝑊4 𝑊8 𝑊𝑁  𝑊4 𝑊8 … 𝑊8 
24 𝑊4 𝑊8 𝑊16 𝑊𝑁  𝑊4 … 𝑊16 
Table 2.1 Twiddle factor multiplication stages for radix–2, radix-22 , radix-23, radix-24 
FFT algorithms 
2.3.4 Identical Radix-22 FFT 
We have designed this identical radix-22 FFT algorithm for the work presented in this 
thesis. Fig. 2.15 represent the binary tree of identical radix-22 for 2,048-point FFT. This 
algorithm is given name an identical radix-22 because the number of non-trivial com-
plex multiplier stages are identical to radix-22 and alternate stages of both algorithms 
have W4 complex multipliers. The overall complexity is reduced in this algorithm as 
compared to radix - 22.  Table 2.1 presents the comparison between complex multiplier 
stages of radix-22 and identical radix-22. 
 
Figure 2.15 Binary tree of identical radix - 22 FFT algorithm 
 
Radix 
Number of twiddle factor multiplication stages 
1 2 3 4 5 6 7 8 9 10 
22 W4 W2048 W4 W512 W4 W128 W4 W32 W4 W8 
Identical 
22 
W4 W16 W4 W256 W4 W16 W4 W2048 W4 W8 
Table 2.1 Complex multiplier stages’ comparison between radix-22 and identical radix-
22, N=2,048 
11
2 2
1 11 1
2 2
1 11 1
2
1
1 1
8 3
4 4
18 
3. FFT ARCHITECTURES 
The fundamental principle to achieve very high performance in FFT hardware is to ar-
range the processing elements in such a manner that the architecture matches the struc-
ture of the algorithm [10]. 
There are three most common approaches to design the hardware architectures for FFT 
algorithms’ implementations, named as direct implementation, Memory-based FFT ar-
chitectures and Pipeline-based architectures. Main units of an FFT processor are a but-
terfly-processing unit, address generation unit, twiddle factors multiplication circuit, 
memory units for storing twiddle factors as well as other FFT data and a control unit to 
generate control signals for the FFT processor [12]. 
3.1 Direct Implementation 
The direct implementation of FFT architecture is simply an identical mapping of signal 
flow graph of an FFT algorithm on hardware. The number of processing elements (PEs) 
is exactly equals to the number of operations. Therefore, this hardware architecture may 
not be an efficient implementation for most of the FFT based applications [12]. This 
architecture is suitable for smaller size FFTs implementations or when the requirement 
of throughput is very high. The utilization of butterflies in this architecture is 100% and 
the hardware cost is proportional to O(NlogN) [10, 12]. 
3.2 Memory-Based FFT Architectures 
Memory-based FFT hardware architectures [10, 12 – 14] are also known as in-place 
architectures or iterative architectures. The main goal of the memory-based architec-
tures is to improve the utilization rate of butterfly processing unit. These architectures 
use one or more butterfly processing units for computation of all the butterflies and 
twiddle factor multiplication of the algorithm. As the memory based architecture, con-
sist of one or more butterfly processing units, so one or more memory units are required 
to store the data [10]. Memory-based FFT architectures depends on the utilization of 
memory for computations. Usually, the memory-based architectures are divided into 
two different categories [12]: 
 Single memory FFT architectures. 
 Dual memory FFT architectures. 
19 
Single memory FFT architectures consist of one memory unit and an FFT processing 
core, which are, connected to each other though data bus. FFT computation is achieved 
through fetching the input data from the memory, processing it in the FFT core and stor-
ing the results back in the same memory. FFT core includes circuits of butterfly pro-
cessing element and twiddle factor multiplication. This process is iterated until all the 
FFT computations are done, giving it the name, iterative FFT architectures [10, 12 – 
14]. 
 
Figure 3.1 Memory-based FFT architecture 
Fig. 3.1 shows a very simple memory-based FFT architecture. As noticed in the Fig. 3.1 
that FFT data are stored in the memory at every iteration, which makes necessary to 
compute whole FFT before it receives new samples of data. Therefore, single memory-
based FFT architecture is unable to compute the FFT when input data is arriving con-
tinuously [10]. This problem can be fixed by adding an extra memory block to store the 
input data while FFT is being computed. Dual memory architectures fix this problem as 
illustrated in Fig. 3.2. 
 
Figure 3.2 Dual memory FFT architecture 
Dual memory FFT architectures [12] use two memory units and one FFT processing 
core. In every clock cycle, an input sample is fetched from the first memory and FFT 
computed sample is stored into the second memory. After the completion of one FFT 
stage, the second memory becomes input memory and the first memory becomes the 
output memory. Memories continue switching roles at every stage until the FFT compu-
tation is done. This architecture allows simultaneous reading from and writing samples 
into memory in one clock cycle. The FFT output can be read from either of these mem-
ories depending on the number of FFT stages. 
These architectures are very hardware efficient because of low area requirements for 
large size FFTs. However, there is a tradeoff between the hardware cost and the pro-
Memory FFT Core
0
1
Input
Output
MEM_1 FFT Core MEM_2
Input/Output
20 
cessing speed of the FFT processor because in one clock cycle the processing unit can 
only perform one butterfly operation, which results in, higher latency and lower 
throughput or decreasing the overall speed of the FFT processor. 
High-radix processing elements can optimize the throughput and latency of the 
memory-based FFT architectures. However, memory conflict problems occur when 
memory-based and high-radix used together because of the improper memory accesses. 
Using multi bank memory can fix these memory conflict problems [15]. Parallel pro-
cessing elements is another approach to optimize the throughput and latency of the 
memory-based architectures. The column FFT architectures are used for implementation 
of parallel processing approach [10, 16 – 17], where entire stage of FFT signal flow 
graph is computed in parallel. Fig. 3.2 shows the column FFT architecture.  
 
Figure 3.2 Memory-based architectures with parallel PEs 
Such architectures have better throughput but more hardware cost than single pro-
cessing element architectures. 
P.E
P.E
P.E
P.E
Input Output
S
w
it
ch
in
g
 N
et
w
o
rk
21 
3.3 Pipelined FFT Architectures 
Pipelined FFT architectures [10, 18 – 20] are also known as streaming architectures and 
take advantage of parallel processing between the pipeline stages. Pipelined architec-
tures consist of well-organized structure, area efficient, comparatively simple control 
unit and have high throughput. One of their considerable advantages is to process con-
tinuous flow of data without any additional circuitry in comparison to memory-based 
architectures, which makes it cost efficient as well. Moreover, it is also possible to in-
crease the clock frequency in pipelined architectures with addition of register, which 
shorten the critical path and increase the throughput. Pipelined architectures are more 
flexible when transforms of variable lengths are to be computed with the same circuit. 
In general, pipelined architectures have a butterfly-processing element between the 
commutators at each stage. Processing element computes the butterfly operation, then 
twiddle factors multiplication is performed and commutator (used in feedforward FFT 
pipelined FFT architectures) has functionality is to rearrange the data to make the suc-
cessive computations easier. As a result, pipelined architectures are best suited for real-
time applications because of high throughput rates.  
In pipelined architectures, one stage of signal flow graph of FFT algorithm is computed 
by exactly one stage of FFT architecture. At every stage, for the continuous flow of data 
and FFT computations the first half of input samples has to be stored in memory until 
the second half arrives, i.e., x(n) and x(n + N/2) should be available in order to start the 
FFT computation [10]. The general pipelined FFT architecture for N-point FFT and 
𝑙𝑜𝑔𝑟(𝑁) number of pipeline stages is shown in Fig. 3.3., where r is the radix of FFT 
algorithm.  
 
Figure 3.3 General architecture of a pipelined FFT processor 
There are two main types of pipelined FFT architectures as follows: 
 Feedback FFT Architectures. 
 Feedforward FFT Architectures. 
3.3.1 Feedback FFT Architectures 
There are two types of feedback architectures, named as Single path Delay Feedback 
(SDF) architectures and Multi-path Delay Feedback architectures (MDF). However, we 
have only focused on the SDF architectures because of the target applications for this 
BF BF BF
Twiddle Factors Multiplication Twiddle Factors Multiplication
Input Data Output Data
First Stage Second Stage Nth Stage
22 
work. In order to increase the throughput as well as to reduce the chip-area, SDF pipe-
lined FFT architectures are often used. 
SDF FFT architectures have a feedback loop at every pipeline stage [21 – 22]. SDF ar-
chitectures consist of a butterfly processing unit, a feedback memory, i.e., FIFO, 
memory element at each stage to store the twiddle factor coefficients and a complex 
multiplier in each stage. Fig. 3.4 shows SDF architecture for N points FFT and 𝑙𝑜𝑔𝑟(𝑁) 
number of pipeline stages. 
 
Figure 3.4 Radix-r N-point FFT SDF architecture 
The utilization of butterflies in SDF architectures is 50%. These architectures get one 
continuous stream of input data which is one sample per clock cycle (CC), i.e., one 
sample per CC is fed to the system and one output sample appears at the output per CC. 
Fig. 3.5 shows a 16-point radix-2 FFT signal flow graph and its SDF architecture. 
 
 
Radix-r BF
FIFO
Radix-r BF
FIFO
Radix-r BF
FIFO
Twiddle Factors Multiplication Twiddle Factors Multiplication
Input Data Output Data
First Stage Second Stage Nth Stage
0
0
0
0
0
0
0
0
0
1
2
3
4
5
6
7
Radix-2 butterfly stage
Twiddle factor 
multiplication
x[0]
x[1]
x[2]
x[3]
x[4]
x[5]
x[6]
x[7]
x[8]
x[9]
x[10]
x[11]
x[12]
x[13]
x[14]
x[15]
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
0
0
0
0
0
2
4
6
0
0
0
0
0
2
4
6
0
0
0
4
0
0
0
4
0
0
0
4
0
0
0
4
Twiddle factor 
multiplication
Twiddle factor 
multiplication
X[0]
X[8]
X[4]
X[12]
X[2]
X[10]
X[6]
X[14]
X[1]
X[9]
X[5]
X[13]
X[3]
X[11]
X[7]
X[15]
Radix-2 butterfly stage Radix-2 butterfly stage Radix-2 butterfly stage
Radix-2 BF
8
Radix-2 BF
4
Radix-2 BF
2
Twiddle Factors Multiplication Twiddle Factors Multiplication
Input Data
First Stage Second Stage Third Stage
Radix-2 BF
1
Twiddle Factors Multiplication
Output Data
Fourth Stage
23 
Figure 3.5 Radix-2 16-points FFT Signal Flow graph and its SDF architecture 
MDF pipelined FFT architectures are also named as parallel feedback architectures be-
cause their structure consists of parallel SDF pipelined FFT architectures to process 
several samples in parallel [10, 49, 50, 51]. Butterfly utilization ratio of MDF FFT ar-
chitectures is similar to the SDF pipelined FFT architectures, i.e., 50%. Thus, we cannot 
optimize the utilization ratio through parallelism. As MDF, architectures can process the 
several samples in parallel, which can improve the throughput.  
3.3.2 Feedforward FFT Architectures 
Feedforward FFT architectures are also referred as; Multi-path Delay Commutator 
(MDC) pipelined FFT architectures [20]. MDC FFT architecture is the most straight-
forward implementation of FFT algorithm. These architectures do not have any feed-
back loops; therefore, the FFT data is processed by the FFT butterflies and rotators, and 
then fed to the next stage of MDC architecture [10]. Thus, utilization ratio of butterflies 
in this architecture is 100% [20]. This architecture can process several samples in paral-
lel, which means MDC architectures have higher throughput than SDF architectures. 
Fig. 3.6 shows radix-r MDC pipelined FFT architecture. 
 
Figure 3.6 Radix-r MDC pipelined FFT architecture 
3.4 Mapping of FFT SFG to Hardware 
In this thesis, we have implemented SDF pipelined FFT architecture, so we only discuss 
the hardware design of SDF pipelined FFT processor. Referring to Fig. 3.5, the 16–
point radix–2 DIF SFG is mapped to SDF pipelined FFT architecture.  
We need 𝑙𝑜𝑔2(N) pipeline stages, which is four in case of a 16–point FFT. We need 
𝑙𝑜𝑔2(N) – 1 number of twiddle factor multiplication stages i.e., three.  This is SDF ar-
chitecture which gets one continuous stream of data which is one sample per CC, i.e., 
one sample per CC is being fed to this processor and one sample per CC is being ap-
peared at the output. We can divide the operation of the architecture into four stages as 
follows 
1. First Stage 
There are eight 2–point butterflies to compute the FFT in the SFG but in hard-
ware, we need only one butterfly PE per stage because this is a pipelined proces-
R
a
d
ix
-r
B
u
ff
er
 a
n
d
 
C
o
m
m
u
ta
to
r
Input
Stage 1
R
a
d
ix
-r
B
u
ff
er
 a
n
d
 
C
o
m
m
u
ta
to
r
Stage 2
R
a
d
ix
-r
B
u
ff
er
 a
n
d
 
C
o
m
m
u
ta
to
r
Stage s-1
R
a
d
ix
-r
Stage s
Output
24 
sor. From SFG it can be noticed that first butterfly operation takes 1st and 9th 
samples as input. Second butterfly operation requires 2nd and 10th samples to 
perform the butterfly operation and so on. Thus, to start the butterfly operation 
first we will store the first 8 input samples to the FIFO and when the 9th sample 
appears at the input port then PE start performing butterfly operation at 9th CC 
on 1st and 9th input samples i.e. it takes 1st sample form the FIFO and 9th sample 
from the input stream. On the output side, on the same CC two outputs are gen-
erated, 1st output is passed to the second stage of the architecture while 9th output 
is stored back in the FIFO. At 10th CC the butterfly PE repeats the same proce-
dure on 2nd sample i.e. coming from FIFO and 10th sample i.e. coming from the 
input stream. PE perform butterfly operation from 9th to 16th CC to compute all 
the eight 2–point butterflies in the first stage of SFG. On the same CC the PE 
sends first eight outputs to the second stage and last eight outputs are stored 
back in the FIFO. Butterfly PE go to inactive state for next 8 CCs because other 
eight outputs stored in the FIFO are now being send to the second stage. Control 
unit generates the control signals to control the functionality. Control unit of 
SDF architectures is very sample and can be designed using sample counters. 
2. Second Stage 
Firstly, the outputs coming from the first stage are being multiplied with the FFT 
twiddle factors as shown in the figure and then fed to the second stage PE or 
FIFO. As SFG represents that, second stage PE requires 1st and 5th sample to 
perform the butterfly operation, so PE will be in inactive state for first 4 CCs 
and feed the input samples to the FIFO. In Fifth CC when the 5th samples arrives 
at the input of second stage PE, it will take 1st sample from the FIFO and per-
form the butterfly operation. On the output side, two outputs are generated, 1st is 
send to the third stage and 5th is stored back in the FIFO i.e. behavior is similar 
to first stage PE. As first stage generate 1st output at 9th CC, so second stage 
generates 1st output at 13th CC and so on. 
3. Third Stage 
PE’s behavior is similar to the first and second stage but it stores the first two 
samples in the FIFO and then start performing butterfly operation. PE of the 
third stage goes to active and inactive state for 2 CC each state. First actual out-
put after third stage appears at 15th CC. 
4. Fourth Stage 
Behavior of PE is similar to last three stages but it stores one sample to the FIFO 
and then perform the butterfly operation in the second CC. PE switches between 
active and inactive state after every CC. First actual output after fourth stage i.e. 
first FFT output of a 16–point radix–2 SDF pipelined processor appears at 16th 
CC and so on. 
The butterflies’ utilization ration in SDF pipelined FFT architectures is 50% as butterfly 
PE perform computations for 50% of the time and the other 50% it stays inactive. 
25 
3.5 Input/Output Order or Bit-Reversal 
Bit-reversal is an algorithmic technique, which is very important for radix-2 Cooley-
Tukey FFT algorithms [4] and has been of strong interest. This algorithm generates in-
dexed data through reversal of bits. The output frequencies of FFT are usually in bit-
reversed order and we use this algorithm to sort out those frequencies in order [10].  
For N-point FFT, where N is a power-of-2, algorithm rearranges the data samples 
through reversal of the bits of the index. This is simply an inversion of the bits of the 
indices. For example any data samples with indices I = 𝑏𝑛−1, …, 𝑏1, 𝑏0 are moved to 
the place BR(I) = 𝑏0, 𝑏1, … , 𝑏𝑛−1.  
 
26 
4. TWIDDLE FACTOR MULTIPLICATIONS 
Two key modules for mathematical operation of the FFT are the butterfly operation and 
the twiddle factor multiplication. Twiddle factor has defined by equation (2.3) and (2.4) 
in chapter 2. The butterfly block of the FFT consist of an adder and a subtractor, which 
perform addition and subtraction on two input data samples of FFT. Twiddle factor 
block usually requires a complex multiplier and a memory to store the twiddle factor 
coefficients. The butterfly operation of FFT is usually followed by the twiddle factor 
multiplication in each stage of FFT architecture for computation [23]. In this chapter, 
we have discussed different techniques to implement FFT twiddle factor multiplication. 
4.1 Implementation Techniques 
There are several techniques to implement FFT twiddle factor but we have discussed the 
following two techniques: 
 General Multiplier. 
 Constant Multiplier. 
4.1.1 General Multiplier 
A general twiddle factor multiplication consists of a memory to store twiddle factor 
coefficients and a complex multiplier as represented in the Fig. 4.1. The most common 
way of implementing the FFT twiddle factor multiplication is by using a complex mul-
tiplier that consist of four real multiplications and two real additions as shown in the 
following example.  
For example, the complex a + jb is the complex number. 
 (a + jb) (c + jd) = a.c + ja.d + jb.c + j2b.d 
    = (a.c – b.d) + j(a.d + b.c) 
    = X + jY, 
where X and Y are the real and imaginary parts of the resultant complex numbers, re-
spectively. There are different ways to implement this complex multiplication. To im-
plement general multiplier in pipelined FFT architectures, real and imaginary parts of 
the twiddle factors are precomputed and stored in the memory.  In pipelined FFT archi-
27 
tectures, each stage performs N twiddle factor multiplications. Therefore, we need to 
store N twiddle factor coefficients in the memory.  
 
Figure 4.1 General complex multiplier 
The multiplication sign in the figure represents the twiddle factor multiplication being 
performed. This multiplication operation sign has four real multiplications and two real 
additions inside. An address generation unit is required to address the twiddle factor 
coefficients’ memory. Twiddle factor coefficients are input to the complex multiplier in 
order to compute the twiddle factor multiplication of the input sample. 
 
4.1.2 Constant Multiplier 
A constant multiplier is usually used when one of the operands is already known; for 
example, twiddle factor multiplication in FFT processors where twiddle factor coeffi-
cients are already known.  
A constant multiplier is implemented by using shift-and-add operations [7]. We replace 
the general multiplier with shifts, adders, and subtractors and is called as constant mul-
tiplier [24 – 25]. This implementation is sometimes also referred as multiplierless im-
plementation [8 – 9, 25 – 28]. This implementation is cost-efficient for hardware be-
cause it replaces the costly multipliers with shifts, adders, and subtractions. In addition 
to it, it also benefits software implementations, such as embedded processors [25]. For 
example, y = 9x can be computed as y = (x << 3) + x. 
Such a solution is known as a multiplier block. In the case of adders and subtractors, the 
complexity is similar which allows us to refer both of them as adders. Therefore, the 
term adder can be used to denote cost of the total number of adders and subtractors in 
the entire circuit of implementation [25]. 
Memory
x + jy X + jY
28 
Constant multiplier block is implemented for the known operands and the challenge is 
to find the optimal implementation of the constant multiplier block, which means block 
with minimum number of adders and shifters. Shifters does not have much effect on the 
hardware cost of the constant multiplier block and can be assumed to be cost free. 
4.1.2.1 Single-Constant Multiplication 
Single-constant multiplication (SCM) [24 – 25] is a simple method for multiplication of 
a given constant c using adders and shift can only be read off from the bit representation 
of c. This method is called as binary method.  
Canonical Signed-Digit (CSD) [29], a well-known fixed-point number representation, 
can reduce the number of non-zero bits. Therefore, this method is based on CSD record-
ing, which is explained in section 5.3. However, there are many other ways to imple-
ment multipliers more efficiently, which require fewer adders because shifts are as-
sumed free. The main challenge is to reduce the number of adders in the multiplier less 
circuits. 
For example, a constant 45, has CSD representation of 101̅01̅01. The implementation of 
this pattern is shown in Fig. 4.3, where 45 is computed as 64 – 16 – 4 + 1. Left shift by 
1 bit is equivalent to multiplication by 2. Fig. 4.4 optimize the implementation in Fig. 
4.3 and reduces the number of adders from 3 to 2. Moreover, the number of actual shifts 
are also reduced from 6 to 5.  Thus, 45 is implemented as 9 × 5 = (1 + 4)(1 + 8). 
 
Figure 4.3 Circuit for multiplication of an input X with a constant 45 using 6 actual 
shifts and 3 adders 
 
Figure 4.4 Optimized circuit for multiplication of an input X with a constant 45 through 
5actual shifts and 2 adders 
29 
4.1.2.2 Multiple-Constant Multiplication 
Some applications require one signal to be multiplied with a set of given constants 𝑐1, 
𝑐2, … , 𝑐𝑛. The challenge is to find the multiplier block with parallel multiplications 
with the given set of constants using minimum number of adders [25]. Fig. 4.5 explains 
the idea of multiple-constant multiplication (MCM). 
 
Figure 4.5 The principle of multiple-constant multiplication 
 
30 
5. NUMBER REPRESENTATION 
Number representation gives an exact numeric value to any kind of data. In the field of 
computers and digital electronics, we always deal with numbers and their representa-
tions. Therefore, it is very important to understand the different ways to represent nu-
meric data systematically. Numbers can be represented by two symbols, i.e., 0 and 1; 
the system based on this representation is called as binary number system. However, it 
would be wise to discuss the decimal number representation before going through the 
details of the binary representation because it would much easier to understand the bina-
ry number system along their decimal representations (ten possible digits). For example, 
we can represent the positive one hundred and twenty-five as a decimal number as fol-
lows:  
 12510 =  1 × 100 + 2 × 10 + 5 × 1 = 1 × 10
2 + 2 × 101 + 5 × 100, 
where subscript 10 denotes the base of the decimal number system. To represent a num-
ber in decimal system the right most digit is multiplied by 100, the next digit is multi-
plied by 101 and so on. Moving from right to left, each digit has a multiplier, which is 
10 times of the previous one. In the decimal number system, to multiply a number by 10 
is simply achieved by sifting the number to the left by one digit and adding a zero at the 
place of right most digit. Similarly, to divide a number by 10 can be implemented by 
shifting the decimal number to the right by one digit or move the decimal place by one 
digit to the left. To find, how many digits we need to represent a number in decimal 
number system, we can take logarithm of the absolute value of the number with base 10 
and add 1 to it. The integer part of the result would be the number of needed digits [30].  
Fig. 5.1 represents an example for calculating required number of digits to represents a 
decimal number. 
31 
 
Figure 5.1 MATLAB program for calculating required number of digits to represent a 
decimal number 
As a result, six digits are needed to represent the above number 155345. We can repre-
sent 10𝑛 unique numbers with n number of digits. i.e. 0 to 10𝑛 – 1. Lastly, negative 
numbers can be represented by just adding a minus - sign before the number [10]. 
5.1 Binary Number Representation 
In a binary number representation system, we have two possible digits (1 and 0) to rep-
resent a number. Binary representation of positive numbers can be understood in the 
similar way to their decimal representation. For example: 
8610 = 1 × 64 + 0 × 32 + 1 × 16 + 0 × 8 + 1 × 4 + 1 × 2 + 0 × 1 
8610 = 1 × 2
6 + 0 × 25 + 1 × 24 + 0 × 23 + 1 × 22 + 1 × 21 + 0 × 20 
8610 = (1010110)2 , 
where the subscript denotes the binary number and each digit in the binary number rep-
resentation is called a bit [30]. Therefore, seven bits represent 1010110. We can repre-
sent any number in the binary format by breaking down the number as above example 
and finding all the powers of two, which add up to the required number. 
Multiplication of a number by two in the binary system is simply shifting the number to 
the left by one bit and filling the right most bit with zero. Similarly, for dividing by two 
means shifting to the right by one bit. We can find out the number of binary digits need-
ed for the binary representation of a number by taking logarithm of the number with 
base two and add one to it. For instance, Fig. 5.2 represents an example MATLAB pro-
gram for binary representation and calculating the required number of digits for repre-
sentation. 
32 
 
Figure 5.2 MATLAB program for binary representation and calculating required num-
ber of digits 
As a result, seven digits or bits are needed to represent 86 in the binary format because 
the integer part is seven [30]. 
5.2 Two’s Complement Representation 
Two’s complement representation is a convention, which represents the signed binary 
integers. As in binary number representation each binary digit has a weight which a 
power of two. This weight increases from right to the left: 
8610 = 1 × 2
6 + 0 × 25 + 1 × 24 + 0 × 23 + 1 × 22 + 1 × 21 + 0 × 20 = (1010110)2. 
Two’s complement of a positive number is a simple binary. Two’s complement notation 
represents the negative numbers. To represent a negative number through two’s com-
plement, first we represent the number in binary, then we invert all the bits i.e. 0 be-
comes 1 and 1 becomes 0. Lastly, we add 1 to the inverted number. To represent the 
number -86, first we write simple binary of 86 as follows 
8610  = (1010110)2. 
Then we invert all the bits, which results(0101001)2. Now we add 1 to the inverted 
number, which results in tow’s compliment binary of -86, (0101001)2 + 1 = 01010102 
= 1010012. 
This is the representation -86 in the 6-bit binary representation through two’s comple-
ment scheme. The rightmost digit in the two’s complement representation indicates the 
sign of the integer. 1 stands for a negative number and 0 stands for a positive number 
[31]. 
33 
5.3 Canonical Signed Digit Representation (CSD) 
Canonical signed digit (CSD) is a special technique to encode the binary values in 
signed digit representation. CSD is not unique because it allows a number to be repre-
sented in many different ways. CSD representation allows the encoding of a number 
such that it contains the fewest number of non-zero bits. It is very useful technique to 
represent the multiplier constants which can significantly reduce the area of hardware 
implementation [29]. The important properties of CSD representation are as follows 
 Mapping the number to a ternary system i.e. CSD representation of a number 
consists of numbers 0, 1 and -1, 
 The CSD representation of a number is unique, 
 The number of non-zero digits are minimal, and 
 No two consecutive digits in a CSD number are zero. 
Let us take an example to understand the conversion of a number into its CSD represen-
tation. For example, we write the CSD representation of 287,  
Step #01  
First, we find the binary of 287. 
28710 = 256 + 16 + 8 + 4 + 2 + 1 = (100011111)2 
Step #02 
Starting from the left, the LSB (least significant bit) and moving towards right, the MSB 
(most significant bit), if we find more than one non-zero digits i.e. 1 or -1, take all of 
them plus the next zero. If there is no zero, then create a zero. 
(011111)2 
Step # 03 
Add 1 to the above number and force the right most bit to be -1. 
(011111)2 => (1000002 – 1). 
Now, our original number looks like this 
(100100002 – 1) 
Since, we can notice that there are no consecutive non-zero digits; our conversion to 
CSD is complete. Hence, the CSD representation of 287 is (100100002 – 1)  
34 
5.4 Effects of Finite Word Length 
Finite word length effects occur when the word length or memory is less than the preci-
sion needed to store. These word length effects introduce ´noise and non-ideal system 
responses. 
In FFTs, Finite word length data and arithmetic are used to map FFT algorithm on 
hardware. This leads to the quantization of internal signals and coefficients in FFT pro-
cessors’ architectures. This determine the hardware resources and input dynamic range 
precision.  
This section discusses these effects in FFTs including round – off/truncation error and 
overflow. 
5.4.1 Overflow 
Overflow occurs when the available data range becomes higher than the available finite 
word length [10]. In FFTs, overflow occurs because of the butterfly operations i.e. addi-
tion/subtraction. In general, it is assumed that the input data being fed to the FFT pro-
cessor is within the range [-1, 1] and is represented using certain number of bits. Thus, 
results of butterfly operations are not under range [-1, 1] which causes the overflow. 
Therefore, certain techniques are used i.e. scaling the data to avoid the overflow, per-
forming truncation to convert data to the required word length and rounding the data to 
minimize the error. 
5.4.2 Scaling: An Overflow Handling Technique 
Scaling is a technique to handle overflows in fixed-point representation of data while at 
the same time keeping the signal level as high as possible to reduce the round – off 
noise [10]. Range of data is adjusted by introducing scaling multipliers.  However, the 
scaling should not disturb the functionality of the system of the transfer function. There 
are two types of scaling, named as static data scaling and dynamic data scaling. 
 
Figure 5.3 Butterfly with scaling 
Static data scaling is very common in designing the hardware architectures to avoid 
overflows. In designing FFT architectures, static data scaling is often used and it intro-
+
+
x(0)
x(1)
1/2
1/2
X(0)
X(1)
-1
35 
duces a scaling factor after each butterfly operation. In FFT processors, to make sure 
that overflow will not occur, results after butterfly operations are multiplied by a scaling 
factor ½ as presented in Fig. 5.3. The reason behind this ½ scaling factor is the FFT 
butterfly operation i.e. addition and subtraction of data samples. As addition introduces 
at most one extra bit, so scaling factor of ½ can avoid it. 
The basic idea about the dynamic scaling is to scale down the data when required and 
scale up the data when possible. The static scaling is often too prohibiting as it is statis-
tically rather unlikely, two data signals will add up to a large value, if they do so then 
the other butterfly output will be small. Thus, it is of a greater interest to find ways to 
adapt the scaling dynamically to the signals being processed. Many approaches have 
been presented to perform this dynamic scaling for FFT processors [10]. 
Dynamic data scaling scales down the data when the butterfly results are large, scales 
up the data when the butterfly results are smaller and leave the data as it is if it is within 
the proper range. We do not even need to store the scaling factor if it is same for all da-
ta. Before making the scaling factor decision, all data must be processed in a stage. In 
memory based FFT processors, it is very easier to have memory with slightly longer 
word length as compared to that of arithmetic operator inputs and some additional logic 
to keep track of the stored data. For pipelined FFT processors it increases the memory 
requirements because most of the pipelined architectures do not have access to all in-
termediate data at the same time [10]. 
5.4.3 Round–Off and Truncation 
In fixed-point representation, not all data within the range can be represented exactly. 
When data type and scaling cannot represent data exactly, then it must be rounded-off to 
a representable number. 
If we multiply two w-bit (w is the word length) fixed-point numbers then the product is 
(2w-1)-bit. Eventually, we must truncate this product to w bits by rounding.  
 
Figure 5.4 1st order IIR filter (W = 8) 
36 
For example, let us consider the first order IIR filter as shown in the Fig. 5.4. Assume 
that input has word length of 8-bits and the multiplier uses 8-bit coefficients. Now, to 
maintain the full precision at the output, per iteration we have to increase the output 
word-length by 8 bits, which is infeasible. Thus, we as an alternate solution we can 
quantize the output word-length by rounding off and truncation to its nearest 8-bit rep-
resentation [32]. 
In pipelined FFTs, we need to perform rounding and truncation to quantize the output 
data to its nearest w-bit (word length of the FFT processor) representation at two points 
in every butterfly stage, after the butterfly operation, after the twiddle factors multiplica-
tions. Fig. 5.3 represents the FFT butterfly operation being performed, output word 
length becomes (w+1)-bit because addition requires 1-bit extra. We quantize the output 
to w-bit word length by the truncating the extra bit and rounding the data to the nearest. 
Twiddle factors’ multiplication doubles the output word length for signed fixed-point 
representation, we need to truncate output data back to its nearest w-bit representation 
by truncating the extra bits and then rounding the data to the nearest. Finally, we pass 
w-bit results to the next FFT butterfly stage. 
Such quantization results in introducing round-off noise, which is important to analyze. 
The main purpose behind the analysis of round off noise is to determine its effect at the 
output. If the noise variance at the output is not negligible as compared to the output 
signal level then we must increase the word length. That is why we compute SNR (sig-
nal-t-noise) at the output. It is important to notice that the FFT twiddle factors’ multipli-
ers are sources for round off noise in FFT processors [32]. 
 
Figure 5.5 MATLAB program for fixed-point scaling, truncation and rounding 
37 
The script presented in Fig. 5.5 is a model of a butterfly stage of pipelined FFT proces-
sor, which explains, how MATLAB does scaling after performing butterfly operation, 
how it rounds the data to nearest and truncates the data back to w-bit word length. 
Range of input data is always [-1, 1]. This script is using 36-bit fixed-point representa-
tion of input data, with 18-bits of real and imaginary parts. The MSB represents the 
sign-bit, lower 17-bits represent the fraction. Twiddle factor is represented 24-bit fixed-
point representation, with 12-bits of real and 12-bits of imaginary part. 
38 
6. PIPELINED FFT PROCESSOR 
This chapter discusses the implementation of proposed approach for a pipelined FFT 
architecture. This architecture consists of modified radix – 2 SDF FFT, which processes 
length of 2,048-point FFTs. The following pipelined FFT processor architectures are 
implemented in this work. 
 Radix-22 FFT architecture 
 Identical radix-22 FFT architecture 
This chapter further discusses the implementation of W8 and W16 constant multipliers. 
identical radix-22 requires less number of complex multiplications and smaller memo-
ries to store the twiddle factor coefficients than the radix-22 FFT processor. W8 and W16 
constant multipliers can be integrated in the identical radix-22 for reducing the complex 
multiplications cost even more, which can remove three complex multipliers and the 
twiddle memories associated to them. Finally, the chapter provides the comparison be-
tween radix-22 FFT architecture and identical radix-22 FFT architecture. 
6.1 Radix-2 FFT Architecture 
 Fig. 6.1 depicts the radix–2 DIF FFT algorithm based SDF pipelined processor’s archi-
tecture. This processor computes FFT length of 2,048-point DFT. The number of stages 
required to compute FFT length of 2048-point is n=𝑙𝑜𝑔2(2048), which is 11. Total 
number of complex multiplier stages required for twiddle factor multiplication is 
𝑙𝑜𝑔2(2048)–1, which is 10. Furthermore, 10 memory blocks are required to store the 
twiddle factor coefficients. A complex multiplier to perform the twiddle factor multipli-
cation as shown in the Fig. 6.1 follows each of the butterfly stage. 
39 
 
Figure 6.1 2048 points radix–2 SDF FFT architecture 
6.2 Radix-22 FFT Architecture 
This section discusses designs of radix-22 SDF architecture using radix-2 (BF I) FFT 
butterflies and BF II structures, which is an extended radix-2 butterfly structure with W4 
complex multiplier. W4 performs trivial complex multiplication (multiplication by –j), 
which can be implemented by sign change and exchanging the real and imaginary parts 
of the data. Therefore, BF I extended structure with W4 is designed, which is called as 
BF II structure. 
6.2.1 Radix-22 FFT Architecture 
The Fig. 6.2 represents the radix-22 SDF FFT architecture, which is same as radix-2. 
The difference comes in the stages of twiddle factor multiplication. W4 is a trivial com-
plex multiplier and can be implemented within radix-2 butterfly (BF I), which is called 
as BF II. The next section discusses the internal structures of BF II and I. 
Input Data
Output Data
Radix-2
16
Radix-2
32
Radix-2
64
7th Stage 6th Stage 5th Stage
W32 W64 W128
Radix-2
4
Radix-2
2
Radix-2
1
9th Stage 11th Stage
W8 W4
Radix-2
8
8th Stage
W16
10th Stage
Radix-2
1024
Radix-2
512
Radix-2
256
1st Stage 3rd stage
W2048 W1024 W512
Radix-2
128
4th Stage2nd Stage
W256
40 
 
Figure 6.2 2048- point radix-22 SDF FFT Architecture 
6.2.2 Internal Structures of BF I and BF II 
Radix–2 butterfly (BF I) is basic 2-point FFT butterfly, which computes 2-point DFT.  
6.2.2.1 Internal Structure of BF I 
Fig. 6.3 depicts the internal structure of radix–2 butterfly, which is named as BF I. It 
simply takes two input samples and computes 2-point FFT. Butterfly operation means 
performing addition and subtraction operations on the input samples as shown in Fig. 
6.3. 
S represents the selection line of the multiplexers to choose between performing the 
butterfly operation on the inputs and to keep the butterfly to inactive mode. 
Input Data
Output Data
Radix-2
16
Radix-2
32
Radix-2
64
7th Stage 6th Stage 5th Stage
W4 W128 W4
Radix-2
4
Radix-2
2
Radix-2
1
9th Stage 11th Stage
W4 W8
Radix-2
8
8th Stage
W32
10th Stage
Radix-2
1024
Radix-2
512
Radix-2
256
1st Stage 3rd stage
W4 W2048 W4
Radix-2
128
4th Stage2nd Stage
W512
41 
 
Figure 6.3 Internal structure of a radix–2 butterfly or BF I 
6.2.2.2 Internal Structure of BF II 
BFII is similar to BF I structure with additional circuitry of W4. Fig. 6.4 depicts the in-
ternal structure of BF II. W4 multiplication is nothing but the multiplication with a com-
plex number ‘-j’. Thus, multiplication of a complex number with ‘-j’ results in effecting 
the signs of real and imaginary parts of the complex number and swapping the real and 
imaginary parts. 
 
Figure 6.4 Internal Structure of BF II 
For example, we have a complex number x + jy then its multiplication with ‘-j’ will 
result as follows 
-j * (x + jy) = -j * x + (-j) * jy 
                  = -jx + (−𝑗2) * y 
                                                                 = -jx + (-(-1)) * y 
+
0
1
+
0
1
+
0
1_
+
0
1_
X1, Re
X1, Im
X2, Re
X2, Im
Y2, Re
Y2, Im
Y1, Re
Y1, Im
S
1/2
1/2
1/2
1/2
+
0
1
+
0
1
+
0
1_
+
0
1_
X1, Re
X1, Im
X2, Re
Y2, Re
Y2, Im
Y1, Re
Y1, Im
S
1/2
1/2
1/2
1/2
0
1
0
1
-1
T
X2, Im
42 
               = -jx + y 
               = y – jx. 
There, the implementation of this W4 logic can be seen as additional circuitry to BF I in 
the Fig. 6.4. The signals S and T are the control signals of the BF II structure to select 
between BF II and I behaviors. 
 
6.2.3 Improved radix-22 FFT Architecture 
Fig. 6.5 shows the improved radix-22 SDF FFT architecture, which uses BF II and I 
structures. This resulted in removing all of the W4 multipliers from the architecture. This 
improved the architecture because of reduction in twiddle factor coefficients’ memories 
and smaller chip-size as well as reduction in power consumption. 
  
Figure 6.5 Improved 2048-point radix-22 FFT architecture 
Improved processor’s architecture only requires five complex multipliers instead of ten 
as depicted in Fig. 6.2. 
Finally, Fig. 6.6 depicts improved architecture for N–point radix-22 FFT. 
Input Data
Output Data
Radix-2 
BF_I
16
Radix-2 
BF_II
32
Radix-2 
BF_I
64
7th Stage 6th Stage 5th Stage
W128
Radix-2 
BF_I
4
Radix-2 
BF_II
2
Radix-2 
BF_I
1
9th Stage 11th Stage
W8
Radix-2 
BF_II
8
8th Stage
W32
10th Stage
Radix-2 
BF_I
1024
Radix-2 
BF_II
512
Radix-2 
BF_I
256
1st Stage 3rd stage
W2048
Radix-2 
BF_II
128
4th Stage2nd Stage
W512
43 
 
Figure 6.6 N–point SDF FFT Architecture using BF II 
The control unit controls the processing and flow of FFT data in the processor as shown 
in the above figure. 
6.3 Identical Radix-22 FFT Architecture 
This section discusses the identical radix-22 SDF FFT architecture. Fig. 6.7 illustrates 
the SDF architecture for 2048-point FFT for identical radix-22 algorithm. The number 
of complex multipliers are still the same in identical radix-22 architecture but the total 
number of complex twiddle factors’ operations are reduced in comparison to radix-22 
depicted in Fig. 6.5. 
Input Data
Output Data
Radix-2 
BF_I
2
Radix-2 
BF_II
1
Radix-2 
BF_I
N/2
Radix-2 
BF_II
N/4
Radix-2 
BF_I
N/8
Radix-2 
BF_II
N/16
Twiddle N/4
Control 
Unit
Twiddle N
Twiddle 16
44 
 
Figure 6.7 Identical radix-22 pipelined FFT architecture 
 
6.4 Implementation of Constant Multipliers 
Shift-and-add operations implement multiplication by a constant efficiently (section 
5.1.2). Shifts, adders, and subtractors [7] replace the general complex multiplier to 
avoid the multiplier cost. However, we are referring subtractors as adders because of 
similar hardware complexity. Shifts do not have much effect on the hardware cost and 
assumed to be cost free. For FFT computation, precomputation of twiddle factors is 
carried out. A complex multiplier to perform the twiddle factor multiplication follows 
each stage of the FFT architecture. Constant multiplication circuits replace the complex 
multipliers to reduce the hardware cost of the FFT architectures. In implementation of 
constant multipliers, the main goal is to obtain the coefficients with minimum error and 
smallest number of adders. 
FFT computes the twiddle factors by 𝑊𝐿
𝑖
 = 𝑒−𝑗.2𝜋𝑖/𝐿 , i = 0, …, L – 1.  The coefficient 
sets W4, W8, and W16 are very common twiddle factors in the FFT architectures, where 
W4 is trivial complex multiplier which is very simple to implement. Radix-2 FFTs of 
size greater than or equal to 16, demand W8 and W16 twiddle factors. For example, 2048-
point identical radix-22 FFT requires W16 at two stages and W8 at one stage. Twiddle 
factors are computed from specific sets of angles, which are generated by dividing the 
circumference in L equal parts. This division of circumference results in multiple sym-
metries in the complex plane, which defines the concept of octave symmetry. According 
Input Data
Output Data
Radix-2 
BF_I
16
Radix-2 
BF_II
32
Radix-2 
BF_I
64
7th Stage 6th Stage 5th Stage
W16
Radix-2 
BF_I
4
Radix-2 
BF_II
2
Radix-2 
BF_I
1
9th Stage 11th Stage
W8
Radix-2 
BF_II
8
8th Stage
W2048
10th Stage
Radix-2 
BF_I
1024
Radix-2 
BF_II
512
Radix-2 
BF_I
256
1st Stage 3rd stage
W16
Radix-2 
BF_II
128
4th Stage2nd Stage
W256
45 
to octave symmetry for an L-point twiddle factor, only M = L/8 + 1 angles in the range 
[0, π/4] are required [28]. The idea of octave symmetry is illustrated in Fig. 6.8. 
 
Figure 6.8 W8, W16 and octave symmetry in (a), (b) and (c), respectively 
Only [0, π/4] angles are considered and rest of the twiddle factors can be computed from 
those in range [0, π/4] through swapping of real and imaginary parts of input and output 
data samples as well changing their signs [28]. Table 6.1 provides the twiddle factors 
designed with low complexity with range [0, π/4]. 
TF  Coefficients Properties 
0 π/8 π/4 WL Adders 
W8 577 - 408 + j408 11 6 
W16 349093 322520 + j133592 246846 + j246846 20 10 
Table 6.1 Designed W8 and W16 twiddle factors for FFT with low complexity [28] 
In this thesis work, we have implemented W8 and W16 constant multipliers [28] that can 
replace the complex multipliers. Figs. 6.9 and 6.10 depicts the schematics of W8 and 
W16 multipliers. The schematics of both W8 and W16 designed using shifters, adders and 
multiplexers. The output configurations are represented by  ,  and . This constant 
(c) 
(a) 
(b) 
46 
multiplication implementation also removes the memories to store the twiddle factor 
coefficients for W8 and W16 complex multipliers, which makes the FFT architecture 
memory efficient as well as cost efficient. 
6.4.1 W8 Constant Multiplier 
Fig. 6.8 represents the implementation of W8 constant multiplier. This circuit considers 
the angles 0 and π/4 and implements the kernel [577, 408 + j408]. Configuration of the 
multiplexers decides the multiplication of the input data sample by 577 or 408 + j408 
[28]. 
 
 
Figure 6.9 W8 constant multiplier, kernel [577, 408 + j408], [28] 
6.4.2 W16 Constant Multiplier 
The Fig. 6.10 shows the implementation of W16 constant multiplier, which considers the 
angles 0, π/8 and π/4 [28]. The kernel implemented is [349093, 322520 + j133592, 
246846 + j246846]. 
x
y
<<1
+
<<9
<<6
+
<<4
<<3 -
<<1
+
<<9
<<6
+
<<4
<<3 +
X = 577x
X = 408x – 408y
Y = 577y
Y = 408y + 408x
47 
 
Figure 6.10 W16 Constant multiplier, kernel [349093, 322520 + j133592, 246846 + 
j246846], [28] 
x
y
<<4
+
<<2
<<7
+/-
<<6
<<2
+/-
<<7
<<5
<<13 <<4
+/-
<<16
<<3
<<3
+/-
<<4
+
<<2
<<7
+/-
<<6
<<2
+/-
<<7
<<5
<<13 <<4
+/-
<<3
<<3
+
<<16
X = 349093x
X = 246846x – 246846y
X = 322520x – 133592y
Y = 349093y
Y = 246846x + 246846y
Y = 322520y + 133592x
48 
7. SYNTHESIS RESULTS 
This chapter talks about the verification, synthesis and presents the FPGA implementa-
tion results. FFT processors were modeled in MATLAB. VHDL was used for the im-
plementation of the FFT processor using Vivado (Xilinx) 2017.1 tool. Vivado’s simula-
tor was used for waveform monitoring and verification of functionality. Finally, the 
FFT processor were implemented on Virtex-7 FPGA in synthesis. 
7.1 Verification 
Verification is perhaps the most challenging aspect of any design, there is no excuse for 
omitting the verification because a design without verification cannot be claimed cor-
rectly designed. It is extremely important to verify the functionality of the design. There 
are different verification methodologies being used in the industry for verification of 
digital designs e.g. UVM, OVM etc. Another way to verify your design is to use test 
benches. We have used VHDL test benches for the functional verification of our de-
signed FFT processors.  
TB is a piece of code meant to verify the functional correctness of VHDL model. Test 
bench instantiate the design under test (DUT), feed the stimuli to the DUT, generate the 
output waveforms and make comparison with the reference outputs. Finally, test bench 
provides a pass or fail indication. 
We Modeled the FFT processor in MATLAB, generated the stimuli and reference in 
MATLAB using two’s compliment binary in fixed-point representation. Data word-
length used was 36-bit to represent a complex data sample, MSB 18-bits were used to 
represent the real part and LSB 18-bits for imaginary part. Data was in range [-1, 1], 
where no integer bits were involved, i.e., MSB represents the sign and 17 bits represent 
the fraction. 
Stimulus and expected response data samples are vectors stored as constants in an array 
in TB. TB read the stimulus vector and feed the data samples to the FFT processor as 
well as it reads the expected response vector and compare with processor’s response. 
7.2 Synthesis 
The main goal of writing VHDL to design digital systems is to generate synthesizable 
description. The quality of the design is much affected by the coding styles. We must be 
49 
able to select coding structures that synthesize the best on FPGA device. We used Vir-
tex-7 (xc7vx1140tflg1930-1) FPGA device for synthesis in this thesis work. 
7.2.1 Comparison of FFT Architectures 
Table 7.1 presents the comparison between twiddle factor multiplication resources of 
radix-22 and identical radix-22 architectures illustrated in Figs. 6.5 and 6.7 respectively. 
The main goal was to reduce the number of general complex multipliers and memories 
to store the twiddle factor coefficients. Both radix-22 and identical radix-22 FFT algo-
rithms were mapped on the radix-2 SDF pipelined FFT architecture. When we are using 
general complex multipliers, identical radix-22 FFT algorithm has reduced twiddle fac-
tors memory than radix-22, thus less number of complex operations for 2048-point FFT 
computations. Further, general complex multiplier can be replaced by using the W8 and 
W16 constant multipliers. Even, twiddle factor memories is reduced further because the 
constant multipliers do not require any coefficient memory. As a result, the twiddle fac-
tor memory can be reduced up to 15% as shown in Table 7.1.  
 
Resources 
Radix-22 
Fig. 6.5 
Identical Radix-22 
General Multipliers 
Fig. 6.7 
Const. Multipliers 
W8 and W16 
# of complex multipliers 5 5 2 
# of constant multipliers 0 0 3 
Twiddle factor coefficients 2728 2344 2304 
Constant Multiplier Name # of Constant Coefficients 
W8 2 
W16 3 
Table 7.1: Comparison of radix-22 and identical radix-22 FFT architectures 
7.2.2  FFT Processors Synthesis Results 
The following table provides the comparison between synthesis results of radix-22 and 
identical radix-22 architectures illustrated in Figs. 6.5 and 6.7 respectively. 
Resource Type Available Radix-22 
Fig. 6.5 
Identical Radix-22  
Fig. 6.7 
50 
Slice LUTs 
 LUT as Logic 
 LUT as Memory 
o LUT as Shift Registers 
Slice Registers 
 Registers as Flip Flop 
F7 Muxes 
F8 Muxes 
Block RAMs 
 RAMB36/FIFO (36 Kb) 
 RAMB18 (18 Kb) 
DSP48E1 
712000 
712000 
283200 
 
1424000 
1424000 
437600 
218800 
1880 
1880 
3760 
3360 
15911 
14687 
1224 
1224 
38329 
38329 
4896 
2448 
90 Kb 
36 Kb 
54 Kb 
20 
15896 
14672 
1224 
1224 
38345 
38345 
4896 
2448 
72 Kb 
36 Kb 
36 Kb 
20 
Table 7.2: Radix-22 and identical radix-22 synthesis results 
7.2.3 Constant Multipliers Synthesis Results 
The following table provides the synthesis results of W8 and W16 constant multipliers. 
Resource Type Available W8 
Fig. 6.9 
W16 
Fig. 6.10 
Slice LUTs 
 LUT as Logic 
712000 
712000 
231 
231 
776 
776 
Table 7.3: W8 and W16 constant multipliers synthesis results 
7.2.4 Results Analysis 
Referring to Figs. 6.5 and 6.7 of radix-22 and identical radix-22 FFT algorithms, both 
algorithms are mapped on the same SDF architecture. The non-trivial complex multipli-
er stages are identical in both architectures. Without using the constant multipliers, the 
main difference in the both FFT processors is the number of complex operations, size of 
memories to store the twiddle factor coefficients and the memory address generation 
logic. Table 7.1 presents the comparison between the architectures of both FFT proces-
sors. The reduction in twiddle factor memories is clearly shown in the Table, which also 
means the less number of complex operations for computation of 2048-point FFT.  
In both implementations, DSP48 blocks are used for general multipliers. Generally, one 
multiplier requires one DSP48 block, which means four DSP48 blocks are required for 
one general complex multiplier. These architectures require five general complex multi-
pliers, so 20 DSP48 blocks are used for both implementations as shown in Table 7.2. 
51 
There were two different sizes of block RAMs available on Virtex-7 FPGA used, i.e. 36 
Kb and 18 Kb. Radix-22 used one block RAM of 36 Kb and three block RAMs of 18 
Kb. On the other hand, identical radix-22 used one block RAM of 36 Kb similar to ra-
dix-22 but less number of 18Kb block RAMs, which is two as compared to three in ra-
dix-22. Overall, 20% reduction in the memories is achieved in the identical radix-22 as 
compared to radix-22. In addition to it, difference can be seen in the slice LUTs, which 
is because of the memory address generation logic in the FFT processors.  
Moreover, Table 7.3 represents the synthesis results of W8 and W16 constant multipliers. 
These can be used in identical radix-22 for further reduction in memory and general 
complex multiplier as shown in Table 7.1. 
 
52 
8. CONCLUSIONS AND FUTURE WORK 
Signal processing is easier in the frequency domain and DFT is a method that provides 
transforms between time and frequency domains of signals. FFT is an efficient algo-
rithm for faster computation of DFT. 
In this thesis work, we have designed an efficient FFT algorithm named as identical 
radix-22 FFT algorithm that has same number of non-trivial complex multiplier stages 
as radix-22 but less number of complex multiplications. Both radix-22 and identical ra-
dix-22 FFT algorithms were mapped on SDF pipelined FFT architectures. 
FFT processors were modeled in MATLAB. VHDL was used for implementations of 
the FFT processor using Vivado (Xilinx) 2017.1 tool. Functional correctness of proces-
sors was verified by writing testbenches, where Vivado’s simulator was used for the 
simulations and generated output samples were compared with the expected output 
samples of the MATLAB models. Finally, the designs were implemented on Virtex-7 
(xc7vx1140tflg1930-1) FPGA device in synthesis. In synthesis, without using the con-
stant multipliers, 20% of reduction in memories is achieved in the identical radix-22 
than radix-22. identical radix-22 used less number of block RAMs and slice LUTs as 
compared to radix-22. Reduction in memories and area can be achieved even more by 
replacing general complex multipliers with W8 and W16 constant multipliers. 
Radix-22 and identical radix-22 FFT architectures are implemented in this thesis work. 
In comparison, identical radix-22 FFT processor performs less number of complex mul-
tiplications than radix-22 for computation of 2048-point FFT. In addition to it, identical 
radix-22 requires smaller memories to store the twiddle factors than radix-22. Further-
more, W8 and W16 constant multipliers are implemented that can replace the two W16 
and one W8 general complex multipliers in the identical radix-2
2 FFT processor. Integra-
tion of constant multipliers in place of general complex multipliers can further reduce 
the cost of complex twiddle factor multiplications in the FFT processor as well make 
the processor even more memory efficient by removing twiddle factor coefficients’ 
memories for W8 and W16 multipliers. 
As a result, the work done in this thesis sets a very strong base for the implementation 
of identical radix-22 FFT processor with constant multipliers in extension of this work 
in future. 
53 
BIBLIOGRAPHY  
[1] A. V. Oppenheim, R. W. Schafer, “Discrete – Time Signal Processing”, Published 
by Prentice-Hall, 1999. 
[2] L. Wanhammar, “DSP Integrated Circuits”, Published by Academic Press, March 
4,1999. 
[3] S. K. Mitra, “DIGITAL SIGNAL PROCESSING: A COMPUTER-BASED AP-
PROACH”, Published by Mcgraw Hill Higher Education, 4th International edition, 
2010. 
[4] J. W. Cooley, J. W. Tukey, “An algorithm for machine computation of complex 
Fourier series,” in proc. Mathematic of Computation, April 1965, vol. 19, no. 90, 
pp. 297-301. 
[5] Prof. S. S. Belsare, A. S. Padekar, “Radix-4 FFT Architecture”, in proc. Interna-
tional Journal of Advance Research in Computer Science and Software Engineer-
ing, 05 May. 2014, vol. 4, no. 5, pp. 337 – 340. 
[6] A. Cortes, I. Velez, J. F. Sevillano, "Radix r^{k} FFTs: Matricial Representation 
and SDC/SDF Pipeline Implementation," in IEEE Transactions on Signal Pro-
cessing, vol. 57, no. 7, pp. 2824-2839, July 2009. 
[7] F. Qureshi, O. Gustafsson, "Generation of all radix-2 fast Fourier transform algo-
rithms using binary trees", in proc. 20th European Conference on Circuit Theory 
and Design (ECCTD), Linkoping, IEEE, 29 – 31 Aug. 2011, pp. 677-680. 13 Oct. 
2011. 
[8] M. Garrido, R. Andersson, F. Qureshi, O. Gustafsson, "Multiplierless Unity-Gain 
SDF FFTs", in proc. IEEE Transactions on Very Large Scale Integration (VLSI) 
Systems, Sept. 2016, vol. 24, no. 9, pp. 3003-3007. 
[9] F. Qureshi, O. Gustafsson, "Low-complexity reconfigurable complex constant 
multiplication for FFTs", in proc. IEEE International Symposium on Circuits and 
Systems, Taipei, 24 – 27 May 2009, pp. 1137-1140. 
[10] F. Qureshi, “Optimization of Rotations in FFTs”, Linköping Studies in Science 
and Technology, Dissertations, no. 1423, Linköping, 2012. 
[11] H. Y. Lee, I. C. Park, "Balanced Binary-Tree Decomposition for Area-Efficient 
Pipelined FFT Processing", in proc. IEEE Transactions on Circuits and Systems I: 
Regular Papers, vol. 54, no. 4, pp. 889-900, April 2007. 
54 
[12] S. M. Joshi, “FFT Architecture: A Review”, in proc. International Journal of 
Computer Applications (0975-8887), Apr. 2015, vol. 116, no. 7, pp. 33 – 36. 
[13] Sang-Chul Moon, In-Cheol Park, "Area-efficient memory-based architecture for 
FFT processing", in proc. International Symposium on Circuits and Systems, 25 – 
28 May 2003. ISCAS '03, IEEE, vol. 5, pp. V-101-V-104. 
[14] C. L. Wey, W. C. Tang, S. Y. Lin, "Efficient VLSI Implementation of Memory-
Based FFT Processors for DVB-T Applications", in proc. IEEE Computer Society 
Annual Symposium on VLSI (ISVLSI '07), Porto Alegre, 9 – 11 Mar. 2007, pp. 
98-106. 
[15] B. G. Jo, M. H. Sunwoo, "New continuous-flow mixed-radix (CFMR) FFT Pro-
cessor using novel in-place strategy", in proc. IEEE Transactions on Circuits and 
Systems I: Regular Papers, May 2005, vol. 52, no. 5, pp. 911-919. 
[16] H. S. Stone, "Parallel Processing with the Perfect Shuffle", in proc. IEEE Transac-
tions on Computers, Feb. 1971, vol. C-20, no. 2, pp. 153-161. 
[17] P. Philipov, V. Lazarov, Z. Zlatev, M. Ivanova, "A Parallel Architecture for Ra-
dix-2 Fast Fourier Transform", in proc. IEEE John Vincent Atanasoff 2006 Inter-
national Symposium on Modern Computing (JVA'06), Sofia, 3 – 6 Oct. 2006, pp. 
229-234. 
[18] P. P. Boopal, M. Garrido, O. Gustafsson, "A reconfigurable FFT architecture for 
variable-length and multi-streaming OFDM standards", in proc. IEEE Internation-
al Symposium on Circuits and Systems (ISCAS2013), Beijing, 19 – 23 May 2013, 
pp. 2066-2070. 
[19] M. Garrido, K. K. Parhi, J. Grajal, "A Pipelined FFT Architecture for Real-Valued 
Signals", in proc. IEEE Transactions on Circuits and Systems I: Regular Papers, 
Dec. 2009, vol. 56, no. 12, pp. 2634-2643. 
[20] M. Garrido, J. Grajal, M. A. Sanchez, O. Gustafsson, "Pipelined Radix – 2(k) 
Feedforward FFT Architectures", in proc.  IEEE Transactions on Very Large 
Scale Integration (VLSI) Systems, Jan. 2013, vol. 21, no. 1, pp. 23-32. 
[21] F. Qureshi, O. Gustafsson, "Analysis of twiddle factor memory complexity of 
radix-2i pipelined FFTs", in proc. Conference Record of the Forty-Third Asilomar 
Conference on Signals, Systems and Computers, Pacific Grove, CA, 1 – 4 Nov. 
2009, pp. 217-220. 
[22] F. Qureshi, S. A. Alam, O. Gustafsson, "4k-point FFT algorithms based on opti-
mized twiddle factor multiplication for FPGAs", in proc. Asia Pacific Conference 
55 
on Postgraduate Research in Microelectronics and Electronics (PrimeAsia), 
Shanghai, 22 – 24 Sept. 2010, pp. 225-228. 
[23] H. J. Kang, J. Y. Lee, J. H. Kim, "Low-complexity twiddle factor generation for 
FFT processor", in proc. Electronics Letters, 07 Nov. 2013, IEEE, vol. 49, no. 23, 
pp. 1443-1445. 
[24] J. Thong, N. Nicolici, "An Optimal and Practical Approach to Single Constant 
Multiplication", in proc. IEEE Transactions on Computer-Aided Design of Inte-
grated Circuits and Systems, Sept. 2011, vol. 30, no. 9, pp. 1373-1386. 
[25] “Multiplerless Constant Multiplication”, Access date: 20 Oct. 2016, Available 
online: http://www.spiral.net/hardware/multless.html 
[26] O. Gustafsson, F. Qureshi, “Addition Aware Quantization for Low Complexity 
and High Precision Constant Multiplication”, in proc. IEEE Signal Processing 
Letters, Feb. 2010, vol. 17, no. 2, pp. 173 – 176. 
[27] F. Qureshi, O. Gustafsson, “Low-Complexity Constant Multiplication Based on 
Trigonometric Identities with Applications to FFTs”, in proc. IEICE Transactions 
on Fundamentals of Electronics, Communication and Computer Sciences” Nov. 
2011, vol. 94, no. 2, pp. 2361 – 2368. 
[28] M. Garrido, F. Qureshi, O. Gustafsson, "Low-Complexity Multiplierless Constant 
Rotators Based on Combined Coefficient Selection and Shift-and-Add Implemen-
tation (CCSSI)", in proc. IEEE Transactions on Circuits and Systems I: Regular 
Papers, Jul. 2014, vol. 61, no. 7, pp. 2002-2012. 
[29] R. Guo, L. S. DeBrunner, "A novel fast canonical-signed-digit conversion tech-
nique for multiplication" in proc. IEEE International Conference on Acoustics, 
Speech and Signal Processing (ICASSP), Prague, 22 – 27 May 2011, pp. 1637-
1640. 
[30] “Representation of Numbers”, Access date: 10 Oct. 2016. Available online: 
http://www.swarthmore.edu/NatSci/echeeve1/Ref/BinaryMath/NumSys.html 
[31] “Two’s Complement” Access date: 10 Oct. 2016. Available online: 
http://www.tfinley.net/notes/cps104/twoscomp.html 
[32] K. K. Parhi, “VLSI digital Signal Processing Systems – Design and Implementa-
tion”, A Wiley-Interscience Publication, John Wiley & Sons, Inc. 1999, pp. 380 – 
382. 
 
56 
APPENDIX A: SIMULATION WAVEFORMS 
 2048-Point Radix-22 FFT Simulation Waveforms 
 
 2048-Point Identical Radix-22 Simulation Waveforms 
 
 W8 Constant Multiplier Simulation Waveforms 
 
 
 
 
 
 
57 
 W16 Constant Multiplier Simulation Waveforms 
 
