Deterministic analysis of the accuracy in FFT hardware architectures by Guinart Platero, David
Institutionen för systemteknik
Department of Electrical Engineering
Examensarbete
Deterministic Analysis of the Accuracy in FFT
Hardware Architectures
Degree project carried out in Automatic Control
at the Institute of Technology at Linköping University
by
David Guinart Platero
LiTH-ISY-EX--TG/ZD20--SE
Linköping 2012
Department of Electrical Engineering
Linköpings tekniska högskola, Linköpings universitet
SE-581 83 Linköping, Sweden

Supervisor: Mario Garrido, ISY, LiU
Examiner: Oscar Gustafsson, ISY, LiU
Linköping, 7 June, 2012

Avdelning, Institution
Division, Department
Electronics Systems
Department of Electrical Engineering
Linköpings universitet
SE-581 83 Linköping, Sweden
Datum
Date
2012-06-07
Språk
Language
Svenska/Swedish
Engelska/English
Rapporttyp
Report category
Licentiatavhandling
Examensarbete
C-uppsats
D-uppsats
Övrig rapport
URL för elektronisk version
http://www.es.isy.liu.se/
http://www.ep.liu.se
ISBN
—
ISRN
LiTH-ISY-EX--TG/ZD20--SE
Serietitel och serienummer
Title of series, numbering
ISSN
—
Titel
Title
Svensk titel
Deterministic Analysis of the Accuracy in FFT Hardware Architectures
Författare
Author
David Guinart Platero
Sammanfattning
Abstract
This master's thesis studies the different quantization effects that appear in hardware architectures due to the use of finite wordlength. It gives a deterministic analysis of the accuracy and presents a relationship between the input bits and the coefficient bits that minimizes the resources and obtains the best trade-off with the accuracy. Furthermore, the objective of this master's thesis is to find a direct relation between the input bits and the coefficient bits. This can be used as a guide for the design of FFT hardware architectures.
Nyckelord
Keywords


Acknowledgments
I would like to thank LiU for giving me the opportunity to finish my studies at this institution.
Of course, I thank my supervisor, Dr. Mario Garrido, for all the time he invested in this work; without his assistance this thesis would not have been possible. I also thank Dr. Oscar Gustafsson for giving me the opportunity to do my master's thesis in the Electronics Systems department.
Finally, I would also like to give my deepest thanks to my colleagues, friends, and family for their support.

Contents
1 Introduction
2 Introduction of the FFT
2.1 The FFT algorithm
2.2 Hardware architectures of the FFT
2.2.1 In-place architectures
2.2.2 Pipelined architectures
2.2.3 Hardware components of FFT architectures
3 Proposed Model
3.1 Model of computation of the FFT in hardware
3.2 Truncation in the butterflies
3.3 Quantization of the coefficients (QC)
3.4 Truncation in the rotations (TR)
3.5 Output of the proposed model
3.6 Description of the increment profile
3.6.1 Increment profile in the data at every stage
3.6.2 Increment profile in the coefficients at every stage
4 Experimental Results
4.1 Methodology
4.2 Effect of the truncation in the rotations
4.3 Comparison between rounding and truncation in the operations
4.4 Influence of the data increment profile
4.5 Relation between the wordlength of the samples and coefficients
4.6 Experimental results for pipeline feedback architecture
4.6.1 Relation for every δ between buffer size and the data wordlength
4.6.2 Relation for every β between area and the coefficient wordlength
4.7 Effect of the different decompositions DIT-DIF
4.7.1 Difference between DIT and DIF for the β used
4.8 Hardware implementation results
4.8.1 DIT cases
4.8.2 DIF cases
4.9 Error in the different channels of the FFT
4.10 Conclusions of the experimental results
5 Guidelines for the designer
6 Conclusions
Bibliography
Chapter 1
Introduction
The fast Fourier transform (FFT) is one of the most important algorithms in the field of digital signal processing. It is used to calculate the discrete Fourier transform (DFT) efficiently. In order to meet the high performance requirements of modern applications, much research on FFT hardware architectures has been done in the last decades [1–13]. These works present very efficient designs in terms of area, throughput and power consumption.
Apart from these figures of merit, in a hardware implementation of the FFT it is necessary to take into account the accuracy of the output results. Contrary to the ideal FFT algorithm, in a hardware FFT architecture data have finite wordlength. Accordingly, different quantization effects take place in the computation of the FFT. The accuracy of the FFT depends on several parameters: the wordlength of the data and of the coefficients, the FFT size, the decomposition (DIT or DIF) and the radix used. We present a software tool in which it is possible to change all the parameters of the FFT, as well as the different quantization effects. This software offers an easy parametrization and a wide space of analysis, and it produces results faster than hardware simulation. Furthermore, it provides exactly the same results that are obtained in hardware. The goal is to find good relations among the different parameters and the accuracy of the FFT.
This thesis analyzes three different errors caused by truncation and quantization. Firstly, each addition increases the wordlength of its output by one bit. Accordingly, the output must be truncated in order to maintain the same wordlength as in the input. Secondly, the number of bits also increases after the multipliers. The idea is to keep the same number of bits as at the input of the rotation. Consequently, the output of the multiplier is also truncated. Finally, the coefficients must be stored in a memory of finite size. Thus, the sine and cosine components of the rotation angles, i.e., the coefficients, must be quantized to a finite number of bits.
Some approaches in the literature have studied the effects of the quantization in the additions and rotations [14] in relation to the area and the speed by using statistical models.
Other studies [15], [16] present a deterministic analysis of the quantization of the coefficients. More concretely, [15] presents different options for scaling the coefficients. In contrast, the model presented in this work includes all the quantization effects mentioned above. The present thesis gives a relationship between the input bits and the coefficient bits, with the objective of finding a good relation between both that minimizes the resources.
Other models study the quantization in the additions and rotations. The work in [17] analyzes the relation between the input bits and the coefficient bits by means of an exhaustive search for optimal wordlengths. That study does not generalize: it does not find a general relation and only works with concrete cases. Moreover, it relies on an approximate statistical model.
The present study gives a deterministic analysis of the accuracy of the FFT in hardware architectures. Our approach is based on simulating the results instead of using statistical models. Besides, our thesis performs a search over all the possible cases. Therefore, we can present the optimum cases.
The rest of the thesis is organized as follows. In Chapter 2, the FFT algorithm and the different hardware architectures are reviewed. In Chapter 3, the proposed model is presented. It includes all the quantization effects, which are modelled mathematically. Chapter 4 presents the experimental results on the effects of the different truncations and quantizations in relation to the accuracy. The main objective is to find the relation between the number of bits in the input and in the coefficients that gives the best output result for the area of the FFT. Chapter 5 presents guidelines for the designer. Finally, Chapter 6 presents the conclusions.
Chapter 2
Introduction of the FFT
This chapter explains the FFT algorithm and gives a short introduction to the different hardware architectures for the design of the FFT. Further explanation is provided in [3].
2.1 The FFT algorithm
The Fourier transform is one of the most important advances in signal processing. It allows general functions to be represented by sums of trigonometric functions, which helps to solve different problems in mathematics and physics. The Fourier transform converts a signal in the time domain into a signal in the frequency domain. The formula is:
$$X(\Omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\Omega t}\, dt \tag{2.1}$$
where $\Omega$ is related to the frequency $f$ by $f = \Omega/2\pi$, $X(\Omega)$ is the signal in the frequency domain, $t$ is the time variable and $x(t)$ is the signal in the time domain. In order to recover the original signal, the inverse Fourier transform is calculated by:
$$x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\Omega)\, e^{j\Omega t}\, d\Omega \tag{2.2}$$
The discrete version of the Fourier transform is the discrete Fourier transform (DFT). This operation is used in digital systems. The objective is to obtain the samples of the spectrum of the Fourier transform from samples in the time domain according to [18]:
$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \tag{2.3}$$
where $W_N^{nk} = e^{-j\frac{2\pi}{N}nk}$, $N$ is the number of points of the DFT, $x[n]$ are the samples in the time domain, $k$ represents the frequency, and $X[k]$ is the signal in the frequency domain.
For recovering the original sequence, the inverse discrete Fourier transform (IDFT) is used:
$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, e^{j\frac{2\pi}{N}nk}, \quad n = 0, 1, \ldots, N-1 \tag{2.4}$$
The fast Fourier transform (FFT) is an efficient way of computing the DFT. The term FFT refers to a collection of algorithms that reduce the number of operations of the DFT. The Cooley-Tukey algorithm is the most widely used among them. Using this algorithm it is possible to reduce the number of operations from order $O(N^2)$ in the DFT to order $O(N \log N)$ in the FFT. The Cooley-Tukey algorithm decomposes the DFT into $n = \log_r N$ stages, where $r$ is the radix used.
There are two main methods to do the decomposition: Decimation In Time (DIT) [19] and Decimation In Frequency (DIF) [19]. For radix-2, the DIT decomposition separates the sequence $x[n]$ into its even and odd samples. The first iteration of DIT is shown by:
$$X[k] = \sum_{i=0}^{N/2-1} x[2i]\, e^{-j\frac{2\pi}{N/2}ik} + e^{-j\frac{2\pi}{N}k} \sum_{i=0}^{N/2-1} x[2i+1]\, e^{-j\frac{2\pi}{N/2}ik} \tag{2.5}$$
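As a quick numerical check of equation (2.5), the following minimal Python sketch compares a direct DFT with the even/odd split. The function names `dft` and `dit_first_iteration` and the test vector are illustrative choices, not part of the thesis software.

```python
import cmath


def dft(x):
    """Direct DFT as in equation (2.3); O(N^2) operations."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def dit_first_iteration(x):
    """Equation (2.5): combine the N/2-point DFTs of even and odd samples."""
    n_points = len(x)
    half = n_points // 2
    even = dft(x[0::2])  # DFT of the samples x[2i]
    odd = dft(x[1::2])   # DFT of the samples x[2i+1]
    out = []
    for k in range(n_points):
        twiddle = cmath.exp(-2j * cmath.pi * k / n_points)
        # The N/2-point DFTs are periodic in k with period N/2.
        out.append(even[k % half] + twiddle * odd[k % half])
    return out


x = [1, 2 - 1j, 0, -3, 4j, 1, 1, -2]
max_diff = max(abs(a - b) for a, b in zip(dft(x), dit_first_iteration(x)))
```

For this floating-point check, `max_diff` stays at the level of rounding noise, which is the behaviour expected when both expressions compute the same transform.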
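Conversely, the DIF decomposition separates the even and odd output frequencies, leading to:
$$X[2r] = \sum_{n=0}^{N-1} x[n]\, e^{-j\frac{2\pi}{N}2rn}, \quad r = 0, 1, \ldots, N/2-1 \tag{2.6}$$
$$X[2r+1] = \sum_{n=0}^{N-1} x[n]\, e^{-j\frac{2\pi}{N}(2r+1)n}, \quad r = 0, 1, \ldots, N/2-1 \tag{2.7}$$
Splitting each sum into the samples $n$ and $n + N/2$, and knowing that $e^{-j\frac{2\pi}{N}2r(N/2)} = 1$ and $e^{-j\frac{2\pi}{N}(2r+1)(N/2)} = -1$, we have:
$$X[2r] = \sum_{n=0}^{N/2-1} \left( x[n] + x[n+N/2] \right) e^{-j\frac{2\pi}{N/2}rn}, \quad r = 0, 1, \ldots, N/2-1 \tag{2.8}$$
$$X[2r+1] = \sum_{n=0}^{N/2-1} \left( x[n] - x[n+N/2] \right) e^{-j\frac{2\pi}{N}n}\, e^{-j\frac{2\pi}{N/2}rn}, \quad r = 0, 1, \ldots, N/2-1 \tag{2.9}$$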
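Equations (2.8) and (2.9) can be checked in the same spirit. The sketch below is only an illustration: the names `dft` and `dif_first_iteration` and the input vector are assumptions made for this example.

```python
import cmath


def dft(x):
    """Direct DFT as in equation (2.3)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def dif_first_iteration(x):
    """Equations (2.8) and (2.9): even and odd outputs from two N/2-point DFTs."""
    n_points = len(x)
    half = n_points // 2
    # Butterfly sums feed the even output frequencies, equation (2.8).
    sums = [x[n] + x[n + half] for n in range(half)]
    # Butterfly differences are rotated by W_N^n before the DFT, equation (2.9).
    diffs = [
        (x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / n_points)
        for n in range(half)
    ]
    out = [0] * n_points
    out[0::2] = dft(sums)   # X[2r]
    out[1::2] = dft(diffs)  # X[2r+1]
    return out


x = [3, -1, 2j, 1 + 1j, 0, 5, -2, 1 - 3j]
max_diff = max(abs(a - b) for a, b in zip(dft(x), dif_first_iteration(x)))
```

Note that the rotation is applied after the butterfly here, whereas in the DIT split it is applied before it.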
Figure 2.1 shows the flow graph of a 16-point radix-2 FFT decomposed using decimation in time (DIT). At each stage of the graph, $s \in \{1, \ldots, n\}$, butterflies and rotations have to be calculated. The butterfly operation is shown in Figure 2.2. This figure represents a radix-2 butterfly, which calculates:
$$X[0] = x[0] + x[1], \qquad X[1] = x[0] - x[1] \tag{2.10}$$
Figure 2.1. Flow graph of a 16-point radix-2 DIT FFT.
Note that in Figure 2.1 the multiplications by $-1$ in the lower edges of the butterflies are not depicted in order to simplify the graph.
In Figure 2.1 the numbers at the input represent the indexes of the input sequence, whereas those at the output are the frequencies, $k$, of the output signal $X[k]$. Finally, each number, $\phi$, in between the stages indicates a rotation by:
$$W_N^{\phi} = e^{-j\frac{2\pi}{N}\phi} = \cos(2\pi\phi/N) - j \sin(2\pi\phi/N) \tag{2.11}$$
As a consequence, samples for which $\phi = 0$ do not need to be rotated. Likewise, if $\phi \in \{0, N/4, N/2, 3N/4\}$ the samples must be rotated by 0°, 270°, 180° and 90°, which correspond to complex multiplications by $1$, $-j$, $-1$ and $j$, respectively. These rotations are considered trivial, because they can be performed by interchanging the real and imaginary components and/or changing the sign of the data.
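The trivial rotations can be illustrated with a short Python sketch that uses only swaps and sign changes on integer data. The function name `trivial_rotation` and its argument order are hypothetical.

```python
def trivial_rotation(re, im, phi, n_points):
    """Rotate re + j*im by W_N^phi for phi in {0, N/4, N/2, 3N/4}.

    Only component swaps and sign changes are used, so no multiplier
    is needed and no quantization error is introduced.
    """
    quarter = n_points // 4
    if phi % quarter != 0:
        raise ValueError("phi is not a trivial rotation")
    turns = (phi // quarter) % 4
    if turns == 0:
        return re, im        # multiplication by 1
    if turns == 1:
        return im, -re       # multiplication by -j
    if turns == 2:
        return -re, -im      # multiplication by -1
    return -im, re           # multiplication by j


rotated = trivial_rotation(3, 5, 4, 16)  # (3 + 5j) * (-j) = 5 - 3j
```

Any other value of `phi` needs a general rotator, which is where the quantization effects of the coefficients and of the multiplier output appear.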
Figure 2.2. Radix-2 butterfly.
Figure 2.3 shows the flow graph of a 16-point DIF FFT. The difference between the DIT and DIF decompositions is that the order of the rotations in the stages is inverted.
2.2 Hardware architectures of the FFT
This section reviews the different hardware architectures for the implementation of the FFT. The main operations consist of a set of additions and multiplications, together with different options for data management at every stage. Furthermore, the design of the FFT is complicated because many factors influence it, such as the number of inputs ($N$), the number of bits in the inputs, the number of bits in the coefficients, the type of architecture, the number of memory blocks, the types of memory, the number of rotations and the way of calculating these data. All of these variables have an important influence on the performance of the circuit in terms of speed, throughput, power consumption and accuracy.
As has been said, the FFT decomposes the DFT into $n = \log_r N$ stages, where $N$ is the number of inputs and $r$ is the radix. The butterflies shown in Figure 2.2 consist of a complex adder and a complex subtractor. The rotators rotate complex data. They usually consist of four real multipliers and several adders. Finally, the data management usually consists of memories or buffers and control logic.
There exist many different hardware architectures for the FFT. The choice of architecture depends on the performance demanded. The most common hardware architectures are in-place and pipelined architectures. On the one hand, an in-place FFT is an architecture in which all the stages of the FFT are calculated iteratively with the same processing element. On the other hand, a pipelined FFT divides the FFT into different stages, and every stage is calculated by its own processing element.
2.2.1 In-place architectures
In-place or memory-based architectures [20–26] consist of one or more memories and one or more processing units that calculate the butterflies and the rotations of the algorithm. The FFT is computed iteratively by loading data from the memory, processing them in the processing unit and storing the results again in the memory. This process is repeated until all the computations of the FFT are carried out. The simplest case consists of using one memory, one butterfly and one rotator, but it is not recommended for real-time operation, because the number of clock cycles needed to calculate the FFT may be too large.
Figure 2.3. Flow graph of a 16-point radix-2 DIF FFT.
Figure 2.4. In-place FFT architecture.
2.2.2 Pipelined architectures
Pipelined architectures [1–3, 5, 8, 27–33] consist of different stages of processing elements connected in series. In pipelined FFTs each stage of the FFT flow graph is calculated by one stage of the architecture. A great advantage of pipelined architectures is that they can process a continuous flow of data. This makes them suitable for real-time operation. There are two main types of pipelined architectures: feedback and feedforward architectures.
The main feature of feedback architectures is that the data calculated by a butterfly are stored in the memory of the same stage until they are required by the following stage. Figure 2.5 shows a pipelined radix-2 feedback FFT architecture for $N = 256$. As has been said, the number of stages is $\log_r N$. Each stage includes a radix-2 butterfly (R2). The rotators are represented by (⊗), and the diamond-shaped rotators only calculate trivial rotations.
Figure 2.5. 256-point radix-2 feedback pipeline architecture.
Figure 2.6 shows the radix-2 butterfly, the buffer and the controller, which uses multiplexers, together with their connections in the feedback pipelined architecture.
Figure 2.6. Module of the feedback pipelined architecture.
The other important pipelined architecture is the feedforward one [1], [2]. This architecture passes the data to the following stage once they have been processed by the butterflies and rotators. Feedforward architectures can compute several samples in parallel. Thus, they can achieve very high throughput rates and are very suitable for processing parallel streams of data. Figure 2.7 shows a radix-2 feedforward pipelined architecture for 16 points. It can be observed that this architecture processes two samples of a continuous dataflow in parallel.
Figure 2.7. 16-point radix-2 feedforward pipelined architecture.
2.2.3 Hardware components of FFT architectures
Butterfly
As has been said, the butterflies consist of a complex adder and a complex subtractor. The internal structure of the butterfly is shown in Figure 2.8. The input wordlength is represented by $WL_I$. This means that both the real and imaginary components of the inputs have $WL_I$ bits, which leads to a total of $2WL_I$ bits.
Figure 2.8. Radix-2 butterfly module.
Memory
There exist two different memories: the data memory and the coefficient memory.
Rotators
The function of the rotator is to rotate the output of the butterfly at every stage by the corresponding rotation angle. Rotators can be implemented by a complex multiplier, which consists of four real multipliers and two adders [15], or by the CORDIC algorithm [34, 35], which carries out the rotation by means of shifts and additions.

Chapter 3
Proposed Model
Once the FFT algorithm is known, we want to implement it in hardware, but the hardware has to work with a finite wordlength. This generates an error, so the results are not ideal. This thesis studies the accuracy of the FFT in hardware. The accuracy depends on different parameters (FFT size ($N$), wordlengths, decomposition, radix, ...). We want to vary these parameters in order to search for the best configuration. There exist different approaches to study the accuracy. The first alternative is to run simulations directly in hardware in order to search for the configuration with the best accuracy. This requires many hardware implementations and therefore leads to long implementation and simulation times.
The approach presented in this chapter is a new way to model the computations that are carried out in hardware FFT architectures. An FFT hardware architecture consists of a set of additions, rotations and circuits for data management. The proposed model defines the mathematical operations that are carried out in hardware architectures, including all the quantization effects. Therefore, the results of the FFT according to the proposed model are the same results that are obtained in hardware. The model has been programmed in software. This provides a fast and easy way to calculate the results that the hardware architectures would provide. Besides, it provides an easy parametrization of the FFT, which allows a large number of architectures to be analyzed with different numbers of points, radix, decomposition, wordlength, quantization effects and so on. As a result, this model can be used to analyze the influence of these different parameters on the performance of the FFT. Furthermore, from a design perspective, the results obtained by the model can be used to choose the best FFT configuration that meets the requirements of a given application.
3.1 Model of computation of the FFT in hardware
Figure 3.1 shows the model for the computation of the FFT in hardware architectures. This model includes all the quantization effects that happen in hardware: truncation in the butterflies (TB), truncation in the rotators (TR) and quantization of the coefficients (QC). Note that TB and TR affect the dataflow directly and happen during the computation of the FFT, whereas QC affects the result indirectly and happens a priori when the coefficients are determined. As can be observed, the model can be applied to the butterflies and rotators of every stage of the FFT flow graphs in Figure 2.1 and Figure 2.3 in order to obtain the behaviour expected in hardware. Apart from the quantization effects that may take place, all the FFT architectures that compute a given FFT algorithm calculate the same mathematical operations. Although the different architectures may differ in the dataflow and in the order of certain operations, the output result must be the same. Therefore, the proposed model can be applied to obtain the accuracy of any FFT hardware architecture.
Figure 3.1. Computation of the FFT in hardware.
3.2 Truncation in the butterflies
A hardware implementation of the FFT may either include truncation in the butterflies or increase the wordlength by one bit at each addition. On the one hand, Figure 3.2 shows the adder/subtractor module without truncation. In the figure, $X_I$ and $Y_I$ are the inputs of the butterfly, which have a wordlength of $WL_I$ bits. The output result, $X_B$, is calculated as:
$$X_B = X_I + Y_I \tag{3.1}$$
and has one more bit than the inputs, i.e., $WL_B = WL_I + 1$. Note that data are complex and, thus, both real and imaginary parts of the data have the wordlengths shown in the figure. Finally, in this scenario there is no loss of accuracy in the computations, as the circuit simply adds the two input values without any quantization effect.
Figure 3.2. Example of an adder/subtractor without truncation.
On the other hand, Figure 3.3 shows the behaviour of the adders/subtractors when the output is truncated. In this case, the output is desired to have the same number of bits as the inputs, i.e., $WL_B = WL_I$. As the addition adds one extra bit to the result, the least significant bit (LSB) is removed, whereas the $WL_I$ most significant bits (MSBs) provide the output. Overflow effects are avoided by removing the LSB. However, there is a loss of accuracy in the output result, which becomes:
$$X_B = \left\lfloor \frac{X_I + Y_I}{2} \right\rfloor \tag{3.2}$$
where $\lfloor \cdot \rfloor$ represents a floor operation on complex numbers, i.e., $\forall a \in \mathbb{C}$, $\lfloor a \rfloor = \lfloor \mathrm{Re}(a) \rfloor + j \lfloor \mathrm{Im}(a) \rfloor$.
Figure 3.3. Model for the truncation in the butterflies.
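The two butterfly options of equations (3.1) and (3.2) can be sketched on integer data as follows. This is a minimal illustration, not the thesis software: the name `butterfly_sum` and the representation of complex samples as `(re, im)` integer tuples are assumptions.

```python
def butterfly_sum(x, y, truncate):
    """Sum output of a radix-2 butterfly on complex integer samples.

    x and y are (re, im) tuples of integers. Without truncation the
    result follows equation (3.1) and needs one more bit; with
    truncation the LSB is dropped as in equation (3.2).
    """
    s_re = x[0] + y[0]
    s_im = x[1] + y[1]
    if truncate:
        # On Python integers, >> 1 is a floor division by 2,
        # also for negative values, which matches the floor in (3.2).
        return s_re >> 1, s_im >> 1
    return s_re, s_im


full = butterfly_sum((5, -3), (2, -4), truncate=False)       # (7, -7)
truncated = butterfly_sum((5, -3), (2, -4), truncate=True)   # (3, -4)
```

The example shows that the floor is applied to the real and imaginary parts separately and that negative values are rounded towards minus infinity.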
3.3 Quantization of the coefficients (QC)
Figure 3.4 shows the model for the rotation operation. It includes a complex multiplier whose inputs are the signal $X_B$ and the coefficient $C + jS$. As for the adder, the wordlengths shown in the figure apply to both real and imaginary parts of the data.
The quantization of the coefficients is due to the finite wordlength of the memory. If the memory has $b$ bits for each of the real and imaginary parts, the coefficients for a given rotation $\phi$ are generally calculated as:
$$C = \left[ 2^{b-2} \cos(2\pi\phi/N) \right], \qquad S = \left[ 2^{b-2} \sin(2\pi\phi/N) \right] \tag{3.3}$$
where $[\,\cdot\,]$ represents a rounding operation, i.e., $\forall a \in \mathbb{C}$, $[a] = [\mathrm{Re}(a)] + j[\mathrm{Im}(a)]$. Note that in order to get $C(\phi)$ and $S(\phi)$ the cosine and sine of the angles are scaled by $2^{b-2}$ instead of $2^{b-1}$. This is due to the fact that the number $2^{b-1}$ cannot be represented using $b$ bits in 2's complement. This approach is called non-scaled because, although the coefficients are actually scaled by a factor $2^{b-2}$, this factor can be compensated in hardware at no cost by a right shift of the data. Different approaches for obtaining the coefficients are discussed in [15].
Figure 3.4. Example of rotation with truncation in the output of the multiplier.
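A minimal sketch of the coefficient quantization of equation (3.3) is given below. The function name `quantize_coefficient` is hypothetical, and Python's built-in `round` is used as the rounding operation, which is an assumption: it rounds exact ties to the nearest even integer, and the rounding rule of the thesis software may differ for such ties.

```python
import math


def quantize_coefficient(phi, n_points, b):
    """Return the integers (C, S) of equation (3.3) for rotation phi.

    The cosine and sine are scaled by 2^(b-2) so that the value
    2^(b-2) itself still fits in b bits in 2's complement.
    """
    scale = 1 << (b - 2)
    angle = 2 * math.pi * phi / n_points
    c = round(scale * math.cos(angle))
    s = round(scale * math.sin(angle))
    return c, s


c0, s0 = quantize_coefficient(0, 16, 8)  # (64, 0)
c2, s2 = quantize_coefficient(2, 16, 8)  # 64 * cos(pi/4) = 45.25..., so (45, 45)
```

With `b = 8` the largest coefficient magnitude is 64, which fits in the range of an 8-bit 2's complement word, whereas $2^{b-1} = 128$ would not.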
3.4 Truncation in the rotations (TR)
Figure 3.4 also shows the truncation in the rotation. It happens at the output of the multiplier. For an input wordlength $WL_B$ and a coefficient wordlength $b$, the output wordlength of the multiplier is $WL_B + b - 1$. The output of the multiplier then has to be truncated. Otherwise, the data wordlength would increase significantly after several stages of the FFT. The truncation is usually done in order to have the same number of bits at the output of the rotator as at the input, i.e., $WL_O = WL_B$. On the one hand, the MSB is removed, as this bit is never used due to the values that the coefficient can take. On the other hand, the lowest $b - 2$ bits of the output are truncated. This compensates the scaling by $2^{b-2}$ that is used to obtain the coefficients according to equation (3.3).
By combining the quantization of the coefficients and the truncation of the rotator, the output of the rotator can be modelled by:
$$X_O = \left\lfloor \frac{X_B \cdot (C + jS)}{2^{b-2}} \right\rfloor = \left\lfloor \frac{X_B \cdot \left[ W_N^{\phi}\, 2^{b-2} \right]}{2^{b-2}} \right\rfloor \tag{3.4}$$
Note that if $W_N^{\phi} \in \{1, -j, -1, j\}$, equation (3.4) can be simplified to $X_O = X_B \cdot W_N^{\phi}$. Thus, trivial rotations do not generate any quantization effect, neither in the coefficients nor in the truncation after the rotator.
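The left-hand form of equation (3.4) can be sketched with integer arithmetic only. The name `rotate_truncated` and the tuple representation are assumptions of this example; the quantized coefficient parts `c` and `s` are taken as inputs, so the sketch does not fix how they were obtained.

```python
def rotate_truncated(xb, coeff, b):
    """Complex multiplication followed by truncation, as in equation (3.4).

    xb is the (re, im) integer input, coeff is the quantized integer
    coefficient (C, S), and b is the coefficient wordlength.
    """
    xb_re, xb_im = xb
    c, s = coeff
    prod_re = xb_re * c - xb_im * s   # real part of xb * (C + jS)
    prod_im = xb_re * s + xb_im * c   # imaginary part of xb * (C + jS)
    shift = b - 2
    # Arithmetic right shift = floor division by 2^(b-2) on each part.
    return prod_re >> shift, prod_im >> shift


general = rotate_truncated((100, -50), (45, -45), 8)   # (35, -106)
trivial = rotate_truncated((100, -50), (64, 0), 8)     # (100, -50)
```

The second call uses the coefficient $2^{b-2} \cdot 1$, so the shift undoes the scaling exactly and the data pass through unchanged, in line with the remark on trivial rotations.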
3.5 Output of the proposed model
The formulas that model the operations calculated in a stage of an FFT hardware architecture according to Figure 3.1 can be derived from the previous explanation. If there is no truncation in the butterflies, the combination of equations (3.1) and (3.4) gives:
$$X_O = \left\lfloor \frac{(X_I + Y_I) \left[ W_N^{\phi}\, 2^{b-2} \right]}{2^{b-2}} \right\rfloor \tag{3.5}$$
Likewise, according to (3.2) and (3.4), when there is truncation in the butterflies the operations that are calculated in an FFT stage are:
$$X_O = \left\lfloor \frac{\left\lfloor \dfrac{X_I + Y_I}{2} \right\rfloor \left[ W_N^{\phi}\, 2^{b-2} \right]}{2^{b-2}} \right\rfloor \tag{3.6}$$
It can be observed that the final equations are independent of the number of bits of the input signal. They depend on the input values, $X_I$ and $Y_I$, the coefficients, $W_N^{\phi}$, and their wordlength, $b$. Finally, note that in hardware the outputs of butterflies and rotators can be rounded instead of truncated. In case of rounding, the floor operations in equations (3.5) and (3.6) should be substituted by rounding ones. Apart from this, the model is the same when rounding is used. The influence of truncation and rounding on the accuracy of the FFT is analyzed in Chapter 4.
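One stage of the model, with and without butterfly truncation and with optional rounding, could be sketched as below. The names `shift_right` and `fft_stage_output` are hypothetical, and the rounding rule used here (adding half an LSB before the shift, i.e., round half up) is an assumption, since the text does not specify which rounding the hardware uses.

```python
def shift_right(value, shift, rounding):
    """Divide an integer by 2^shift with truncation (floor) or rounding."""
    if shift == 0:
        return value
    if rounding:
        value += 1 << (shift - 1)  # assumed rule: add half an LSB, then floor
    return value >> shift


def fft_stage_output(xi, yi, coeff, b, truncate_butterfly, rounding=False):
    """Butterfly sum followed by rotation for one FFT stage.

    xi, yi and coeff are (re, im) integer tuples; coeff is the quantized
    coefficient C + jS and b its wordlength.
    """
    s_re, s_im = xi[0] + yi[0], xi[1] + yi[1]
    if truncate_butterfly:
        s_re = shift_right(s_re, 1, rounding)
        s_im = shift_right(s_im, 1, rounding)
    c, s = coeff
    p_re = s_re * c - s_im * s
    p_im = s_re * s + s_im * c
    return shift_right(p_re, b - 2, rounding), shift_right(p_im, b - 2, rounding)


xi, yi, coeff = (60, -20), (41, -31), (45, -45)
no_tb = fft_stage_output(xi, yi, coeff, 8, truncate_butterfly=False)    # (35, -107)
with_tb = fft_stage_output(xi, yi, coeff, 8, truncate_butterfly=True)   # (16, -54)
rounded = fft_stage_output(xi, yi, coeff, 8, True, rounding=True)       # (18, -53)
```

Comparing `with_tb` and `rounded` shows how the choice between truncation and rounding changes the output of a single stage for the same inputs.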
3.6 Description of the increment profile
The previous sections have explained how to model the computations in a hardware architecture when there exist quantization effects. The model studied represents one stage of the FFT. The FFT has a specific number of stages, which depends on the number of points, $N$. Consequently, the module explained before is concatenated as many times as the FFT has stages. As has been said, at every stage it is possible to truncate the output of the butterfly or not, and it is also possible to use a different number of bits in the coefficients, $b$, at every stage. The target of this section is to define the configuration of the FFT, i.e., the butterfly truncations together with the number of bits of the data samples and the number of bits of the coefficients at every stage.
3.6.1 Increment profile in the data at every stage
In order to define the total increment of the data wordlength in the FFT we use the expression:
$$\Delta WL = WL_O - WL_I \tag{3.7}$$
where $\Delta WL$ is the total data wordlength increment, which is set by the butterfly truncations in the FFT, $WL_O$ is the number of bits at the output of the FFT and $WL_I$ is the number of bits at the input of the FFT. We represent $\Delta WL$ as:
$$\Delta WL = \sum_{s=1}^{n} \delta_s \tag{3.8}$$
where $n$ represents the number of stages of the FFT and $\delta_s$ represents the increment of the data wordlength at stage $s$. Thus, $\delta_s = 1$ represents an increment of one bit in the data wordlength at stage $s$, whereas if $\delta_s = 0$ the data wordlength at stage $s$ is the same as at the previous stage $(s-1)$. Accordingly, the increment profile, $\delta$, is defined as a vector that includes all the wordlength increments at the stages of the FFT:
$$\delta_s = WL_s - WL_{s-1} \tag{3.9}$$
To represent the number of bits of the data samples at a given stage we use the following expression:
$$WL_s = WL_I + \sum_{i=1}^{s} \delta_i \tag{3.10}$$
where $WL_s$ is the data wordlength at stage $s$ and $\delta_i$ is the increment at every stage $i$ up to $s$, which reflects the butterfly truncations up to stage $s$.
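Equation (3.10) is a running sum over the increment profile, which can be sketched in a few lines of Python. The function name `data_wordlengths` and the example profile are illustrative assumptions.

```python
from itertools import accumulate


def data_wordlengths(wl_in, delta):
    """Return [WL_1, ..., WL_n] from WL_I and the profile delta, eq. (3.10)."""
    return [wl_in + total for total in accumulate(delta)]


delta = [1, 0, 1, 0]                    # hypothetical profile for n = 4 stages
stages = data_wordlengths(16, delta)    # [17, 17, 18, 18]
total_increment = stages[-1] - 16       # equals sum(delta), equations (3.7)-(3.8)
```

In this example the output wordlength is 18 bits for a 16-bit input, so the total increment is 2, which is the sum of the profile.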
3.6.2 Increment profile in the coefficients at every stage
The coefficient wordlength can be different at different stages. We select the coefficient wordlength at every stage using an increment profile ($\beta$). Accordingly, the increment profile, $\beta$, is defined as a vector that includes the wordlength increments of the coefficient memory for every stage. From stage $(s-1)$ to stage $s$, this profile can increase by 1 bit ($\beta_s = 1$), decrease by 1 bit ($\beta_s = \bar{1}$) or remain constant ($\beta_s = 0$). Figure 2.1 shows the rotations at every stage, and it can be observed that there is no rotation in the last stage. Consequently, the profile of the coefficient bits always has one stage less than the data sample profile.
Following the idea of the data increment profile, we define the total wordlength increment in the coefficients as:
$$\Delta b = b_{n-1} - b_1 = \sum_{s=2}^{n-1} \beta_s \tag{3.11}$$
where $b_s$ represents the coefficient wordlength at stage $s$, $n$ represents the number of stages of the FFT and $\beta_s$ represents the profile at stage $s$, which is given by:
$$\beta_s = b_s - b_{s-1} \tag{3.12}$$
To represent the number of coefficient bits at a given stage we use the following expression:
$$b_s = b_1 + \sum_{i=2}^{s} \beta_i \tag{3.13}$$
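As a small worked example, consider a hypothetical FFT with $n = 5$ stages, so that rotations exist in stages 1 to 4. The values $b_1 = 8$ and $\beta = (1, 0, \bar{1})$ are chosen only for illustration, and the example assumes that the profile entries describe the changes relative to $b_1$, i.e., that they are $\beta_2$, $\beta_3$ and $\beta_4$ with $\beta_s = b_s - b_{s-1}$.

```latex
\begin{align*}
b_1 &= 8 \\
b_2 &= b_1 + \beta_2 = 8 + 1 = 9 \\
b_3 &= b_2 + \beta_3 = 9 + 0 = 9 \\
b_4 &= b_3 + \beta_4 = 9 - 1 = 8 \\
\Delta b &= b_4 - b_1 = \beta_2 + \beta_3 + \beta_4 = 0
\end{align*}
```

Under this assumption the coefficient wordlengths per stage are 8, 9, 9 and 8 bits, and the total increment is zero even though the intermediate stages use one more bit.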

Chapter 4
Experimental Results
In Chapter 3, the quantization effects in hardware FFT architectures have been analyzed and a model that describes all the operations that happen in hardware has been presented. This chapter studies the impact of the different quantization effects.
The works based on statistical analysis consider an architecture and study how any input signal would be affected by that architecture. In our approach we propose an alternative study in which we take a single signal and modify the FFT configuration in order to observe how the output is affected. As these results may depend on the characteristics of the signal, different types of input signal have been considered (pulse, noise, sinusoids), in order to draw general conclusions for these cases. Running the simulations in software adds versatility, as many different configurations can be explored simultaneously, and provides the results in a short time. The validity of the model has been verified by implementing a parameterizable FFT in hardware and checking that the output results are the same in both cases.
4.1 Methodology
In order to measure the quantization effects we perform $M$ FFTs, where $M = N$ is the number of input points, and we apply to each FFT a pulse at a different input channel, as shown in Table 4.1. As has been said, $WL_I$ is the number of input bits.
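The pulse inputs of Table 4.1 form a diagonal matrix, which the following sketch builds with plain Python lists. The function name `pulse_inputs` and the row/column convention (rows are channels, columns are FFT numbers) are assumptions made for this illustration.

```python
def pulse_inputs(n_points, wl_in):
    """Build the N x N pulse input matrix described by Table 4.1.

    Entry [channel][fft_index] holds the input of that channel for
    that FFT: 2^(WL_I - 1) on the diagonal and 0 elsewhere.
    """
    amplitude = 1 << (wl_in - 1)
    return [
        [amplitude if channel == fft_index else 0
         for fft_index in range(n_points)]
        for channel in range(n_points)
    ]


inputs = pulse_inputs(4, 8)
first_fft = [row[0] for row in inputs]  # [128, 0, 0, 0]
```

Each column is the input vector of one FFT, so feeding all columns through the model gives one column of the transfer matrix per FFT.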
Otherwise, we have generated a noise at the different inputs channels of the
FFT (table 4.2) with the goal that the following study is valid for any input signal.
Where rand is:
rand = {ra ∈ < : (−1 ≤ Re(ra) ≤ 1) + (−1j ≤ Im(ra) ≤ 1j)} . (4.1)
and M = 1024.
We have also generated a sinusoidal for demonstrate that the study is for any
19
20 Experimentals Results
Table 4.1. Input data FFT (Pulse)
Channel Kn Number of FFT from 1 to M
0 2WLI−1 0 0 . . . 0
1 0 2WLI−1 0 . . . 0
2 0 0 2WLI−1 . . . 0
. . . . . . . .
. . . . . . . .
. . . . . . . .
N 0 0 0 . . . 2WLI−1
Table 4.2. Input data FFT (Noise)
Channel Kn Number of FFT from 1 to M
0 2WLI−1(rand1) 2WLI−1(randN+1) . . . . .
1 2WLI−1(rand2) 2WLI−1(randN+2) . . . . .
2 . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
N . . . . . . 2WLI−1(randN∗M )
input signal. Input data FFT (sinusoidal):
2WLI−1 · P · [NxN ] (4.2)
where M = N ,
P = (C + jS) (4.3)
C = [cos(2pif/N)]
S = [sin(2pif/N)] (4.4)
where f = 0, 1, 2 · · ·N ,
P ∈ < : (−1 ≤ Re(C) ≤ 1) + (−1j ≤ Im(S) ≤ 1j) (4.5)
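The pulse and noise stimuli of Tables 4.1 and 4.2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the channels are indexed from 0 to N−1, the noise is assumed to be uniformly distributed (the text only gives its range), and the noise samples are left unquantized.

```python
import random


def pulse_inputs(n_points, wli):
    """Return M = N input vectors; vector k has one pulse of 2^(WLI-1) on channel k."""
    amplitude = 2 ** (wli - 1)
    return [[amplitude if channel == k else 0 for channel in range(n_points)]
            for k in range(n_points)]


def noise_inputs(n_points, wli, n_ffts=1024, seed=0):
    """Return n_ffts input vectors of complex noise scaled by 2^(WLI-1).

    Assumption: real and imaginary parts are drawn uniformly from [-1, 1].
    """
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    amplitude = 2 ** (wli - 1)
    return [[amplitude * complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
             for _ in range(n_points)]
            for _ in range(n_ffts)]


# Hypothetical usage: stimuli for an 8-point FFT with 8 input bits.
pulses = pulse_inputs(8, 8)
noise = noise_inputs(8, 8, n_ffts=4)
```

Applying one pulse per channel means that the set of outputs forms the transfer matrix of the FFT under test.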
In order to get the experimental results, the matrices that define the transfer
function of the FFT have been obtained from the input data shown in
Table 4.1 and have been compared to the ideal transfer function, according
to:
$$\mathrm{Error} = E = 10 \cdot \log_{10}\left( \frac{1}{N \cdot M} \sum_{i=0}^{N-1} \sum_{j=0}^{M-1} \left| Y_{i,j} - \hat{Y}_{i,j} \right|^2 \right) \ \text{(dB)}, \tag{4.6}$$
where Yi,j is the ideal matrix, Ŷi,j is the approximated one obtained with quantized
coefficients, and N is the FFT size. We use this expression because it is the most
global way to calculate the accuracy of the FFT.
In Chapter 3 it was explained that it is possible to work with different input bits (WLI)
and different coefficient bits (b) for every stage, which gives a different value of the
coefficient bits at every stage (bs). It is also possible to select the butterfly truncation (TB) for
every stage, which gives a different data sample wordlength in
every stage (WLs). Accordingly, we can design the FFT depending on WLI, b,
∆WL and ∆b. The goal of this chapter is to find the relation between the Error of the
FFT and these values (WLI, b, ∆WL, ∆b). Later, in order to find the
cases with the best accuracy, we will carry out a study for different
values of WLI, b, ∆WL and ∆b and compare the results with the area consumed. This
way we can discard the worse cases and keep only the best cases with
relation to the area and Error.
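Equation 4.6 can be sketched in Python as shown below. The function name is illustrative, the matrices are assumed to be lists of rows of complex numbers, and returning minus infinity for two identical matrices is a choice of this sketch rather than part of the definition.

```python
import math


def error_db(ideal, approx):
    """Mean squared difference between two equally sized matrices, in dB (Equation 4.6)."""
    n_rows = len(ideal)
    n_cols = len(ideal[0])
    total = 0.0
    for row_ideal, row_approx in zip(ideal, approx):
        for y, y_hat in zip(row_ideal, row_approx):
            total += abs(y - y_hat) ** 2
    if total == 0.0:
        return float("-inf")  # identical matrices: no error at all
    return 10.0 * math.log10(total / (n_rows * n_cols))


# Hypothetical usage: every element of a 2x2 matrix deviates by 0.1.
ideal = [[1.0 + 0j, 0.0], [0.0, 1.0 + 0j]]
approx = [[1.1 + 0j, 0.1], [0.1, 1.1 + 0j]]
e = error_db(ideal, approx)  # close to -20 dB, since the mean squared error is 0.01
```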
In the next sections the cases are valid for different architectures (feedback, feedforward,
in-place, ...) because these experiments study the result of the
different truncations and quantizations, and these results do not depend
on the architecture used. In Section 4.6 we will study the results in relation
to the area, and for those cases it is important to know the architecture used.
4.2 Effect of the truncation in the rotations
We now explain the effect of the truncation in the rotators. This truncation
appears in all FFT architectures, since otherwise the number of bits at the output would
be too large. Figure 4.1 shows this effect.
[Figure 4.1: E (dB) versus log2(N), one curve for each WLI from 8 to 16.]
Figure 4.1. Relationship between FFT size (N) and Error for different wordlengths of
the input data - pulse in the input signal
Table 4.3. Parameters of the study
Parameter                Value
Input signal             pulse and noise
FFT size (N)             from 8 to 1024 points
Radix                    2
Decomposition            DIT
βs                       0, ∀s
δs                       1, ∀s
WLI                      from 8 to 16 bits
b                        64 bits
Butterfly quantization   truncation
In this case the FFT works with the parameters shown in Table 4.3. The FFT
works without truncation at the output of the adder/subtracter (Figure 3.2)
and without truncation in the coefficients. To achieve this, the FFT
uses 64 bits in the coefficients, so the effect of scaling the coefficients
is negligible. Therefore, we can say that only the truncation in the rotations
affects the result. In this case the mathematical model can be simplified to:
$$X_O \approx \left\lfloor (X_I - Y_I)\, W_N^{\phi} \right\rfloor \tag{4.7}$$
[Figure 4.2: E (dB) versus log2(N), one curve for each WLI from 8 to 16.]
Figure 4.2. Relationship between FFT size (N) and Error for different wordlengths of
the input data - noise in the input signal
Figures 4.1 and 4.2 show the relationship between the FFT size (N) and the
resulting Error in dB. The goal is to show the effect of the
rotations. The result is better when
the input wordlength (WLI) is larger and the number of rotations is smaller (smaller N).
For example, for the same number of input bits the accuracy is worse if
the number of inputs is larger, because the number of rotations is
also larger. It can be observed that the difference between consecutive numbers of
input bits (WLI) is 6 dB. Hence, every additional bit in the input yields
1 more accurate bit at the output of the FFT.
Note that the results in the graphs only consider the truncation in the rotator,
whereas other quantization effects are negligible. Therefore, as the truncation
in the rotations must always happen, the results in Figures 4.1 and 4.2 represent
an upper bound of the accuracy of FFT architectures for a given FFT size and
wordlength.
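The simplified model of Equation 4.7 can be illustrated with a single rotator in Python. This is a sketch under assumptions that are not stated in this section: the twiddle factor is taken as W = exp(−j2πφ/N), the floor is applied separately to the real and imaginary parts, and the twiddle factor is kept in double precision to mimic coefficients with a very large wordlength. The function name is hypothetical.

```python
import cmath
import math


def rotate_and_truncate(x_in, y_in, phi, n_points):
    """floor((X_I - Y_I) * W_N^phi), with floor applied to each component."""
    twiddle = cmath.exp(-2j * math.pi * phi / n_points)  # assumed sign convention
    product = (x_in - y_in) * twiddle
    return complex(math.floor(product.real), math.floor(product.imag))


# Hypothetical usage: a rotation by one eighth of a turn in an 8-point FFT.
x_out = rotate_and_truncate(100 + 0j, 0j, 1, 8)
```

Each truncation loses less than one least significant bit per component, and every extra input bit doubles the signal amplitude relative to that loss, which corresponds to 20·log10(2) ≈ 6 dB per bit.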
4.3 Comparison between rounding and truncation
in the operations
In this case we present the effect of quantizing the butterfly output using
floor (truncation) and round (rounding).
[Figure 4.3: E (dB) versus log2(N), floor case and round case, for each WLI from 8 to 16.]
Figure 4.3. Relationship between FFT size (N) and Error for different wordlengths of
the input data with truncation or rounding in the butterfly quantization - pulses in the
input signal
Table 4.4 shows the parameters used for the design of the FFT in this case.
Figure 4.3 shows the Error obtained with the floor operation and with the round
operation as a function of the FFT size, N. The difference between both operations
is that the floor operation (truncation) always rounds down, whereas
the round operation rounds to the nearest value. The result is better
with the round operation when working with a low number of bits and with
a low number of stages. Figure 4.3 shows that if the FFT has more than 6
stages the result is better with the floor operation. The difference between
floor and round is approximately 1 dB. This case works with 64 bits in
the coefficients.
Table 4.4. Parameters of the study
Parameter                Value
Input signal             pulse
FFT size (N)             from 8 to 1024 points
Radix                    2
Decomposition            DIT
βs                       0, ∀s
δs                       1, ∀s
WLI                      from 8 to 16 bits
b                        64 bits
Butterfly quantization   truncation and rounding
The error generated by each type of quantization is [19]:
Truncation:
$$-2^{-b} < \mathrm{Error}_T \le 0 \tag{4.8}$$
Rounding:
$$-\tfrac{1}{2}\, 2^{-b} < \mathrm{Error}_R \le \tfrac{1}{2}\, 2^{-b} \tag{4.9}$$
where b is the number of bits desired.
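The two bounds can be checked numerically with a short Python sketch. The helper names are illustrative, b is interpreted here as the number of fractional bits that are kept, the error is taken as the quantized value minus the exact value, and ties in the rounding are assumed to go upwards.

```python
import math


def quantize_floor(value, b):
    """Truncate value to b fractional bits (always rounds down)."""
    step = 2.0 ** -b
    return math.floor(value / step) * step


def quantize_round(value, b):
    """Round value to the nearest multiple of 2^-b (ties go up)."""
    step = 2.0 ** -b
    return math.floor(value / step + 0.5) * step


b = 4
step = 2.0 ** -b
for i in range(-2000, 2001):
    value = i / 1000.0
    error_t = quantize_floor(value, b) - value
    error_r = quantize_round(value, b) - value
    assert -step < error_t <= 0.0            # truncation: always negative or zero
    assert -step / 2 < error_r <= step / 2   # rounding: centred around zero
```

The truncation error always has the same sign, so it accumulates as a bias through the stages, whereas the rounding error is centred around zero.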
4.4 Influence of the data increment profile
Section 4.2 showed the effect of the truncation in the rotations.
Now we study the effect of the butterfly truncations together with the rotation
truncations for all possible data increment profiles (δ). To
do so the FFT works with 64 bits in the coefficients, so that the
effect of scaling the coefficients (Equation 3.3) is negligible.
Table 4.5. Parameters of the study
Parameter                Value
Input signal             pulse, sinusoidal and noise
FFT size (N)             256 points
Radix                    2
Decomposition            DIT
βs                       0, ∀s
δs                       from 0, ∀s to 1, ∀s
WLI                      from 2 to 32 bits
b                        64 bits
Butterfly quantization   truncation
Table 4.5 shows the parameters used for the design of the FFT in this case.
Therefore, there are two effects: the truncation in the
rotations and the truncation in the adders/subtracters. In this case the mathematical model is:
$$X_O \approx \left\lfloor \left\lfloor \frac{X_I - Y_I}{2} \right\rfloor W_N^{\phi} \right\rfloor \tag{4.10}$$
[Figure 4.4: E (dB) versus input bits (WLI); the curves range from the case with butterfly truncation in all stages to the case without butterfly truncation in any stage.]
Figure 4.4. Relationship between Error and input bits (WLI) for different data
wordlength increment profiles - pulse in the input signal
Figures 4.4, 4.5 and 4.6 represent the relation between the Error and the input
bits (WLI) for every increment profile from δs = 0, ∀s to δs = 1, ∀s.
This case works with 256 input points (N), so it has 8 stages (log2(256)).
In every stage it is possible to truncate at the output of the butterfly or not.
We study the effect for every case. There are two limits, ∆WL = 0 and
∆WL = 8 (minimum and maximum), which correspond to butterfly truncation in all
stages and to no butterfly truncation in any stage, respectively.
Figures 4.4, 4.5 and 4.6 show the Error as a function of the number of input
bits. There are 8 stages, so there are 2^8 = 256 possible
data increment profiles. Figure 4.4 shows all possible profiles. The upper line
(∆WL = 0) corresponds to truncation in every stage and the bottom line (∆WL = 8) to no
truncation in any stage.
Between both lines lie the other 254 profiles, with total increments from ∆WL =
7 to ∆WL = 1.
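The set of data increment profiles can be enumerated with a few lines of Python. The sketch assumes that δs = 1 means that the butterfly of stage s keeps its extra bit and δs = 0 means that it truncates, so that the wordlength of a stage is the previous one plus δs. The names are illustrative.

```python
from itertools import product


def stage_wordlengths(wli, delta):
    """Wordlength at the butterfly output of every stage for a profile delta."""
    wordlengths = []
    current = wli
    for growth in delta:
        current += growth  # 1: keep the extra bit, 0: truncate it
        wordlengths.append(current)
    return wordlengths


N_STAGES = 8  # log2(256)
profiles = list(product((0, 1), repeat=N_STAGES))
# Profiles strictly between the two limits (total increment from 1 to 7).
intermediate = [p for p in profiles if 0 < sum(p) < N_STAGES]
```

With WLI = 8, the profile of all ones gives the wordlengths 9, 10, ..., 16 and the profile of all zeros keeps 8 bits in every stage.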
[Figure 4.5: E (dB) versus input bits (WLI); the curves range from the case with butterfly truncation in all stages to the case without butterfly truncation in any stage.]
Figure 4.5. Relationship between Error and input bits (WLI) for different data
wordlength increment profiles - noise in the input signal
The evolution of every profile is linear: the Error decreases about 6 dB for every
additional input bit in every increment profile. The difference between the profiles
with ∆WL = 8 and ∆WL = 0 is approximately 40 dB in Figure 4.4, 30 dB in
Figure 4.5 and 25 dB in Figure 4.6.
It can be observed that it is better to work with a higher increment
profile because it allows working with a lower number of bits in the different stages.
In Figure 4.4, an Error of -50 dB is obtained with 10 input bits
for ∆WL = 8 (increasing 1 bit in every stage) and with 17 input bits
for ∆WL = 0 (constant data sample wordlength in every stage). Therefore, we
can say that it is preferable to choose cases in which the wordlength increases
through the stages.
Table 4.6 shows which data increment profiles
at the output of the butterfly give the best result (Error) for the case with
a pulse in the input signal. The best result is obtained without truncation
in the butterfly in any stage, and the next best profiles are those with only
one truncation in the butterfly, in one stage. The truncations should be placed in the last
stages. Consequently, it is better to let the wordlength grow in the first stages (without
truncation) and to truncate in the last stages. With noise and sinusoidal inputs the
results follow the same pattern.
[Figure 4.6: E (dB) versus input bits (WLI); the curves range from the case with butterfly truncation in all stages to the case without butterfly truncation in any stage.]
Figure 4.6. Relationship between Error and input bits (WLI) for different data
wordlength increment profiles - sinusoidal input signal
Table 4.6. Error for different data increment profiles (δ) with WLI = 8 bits
(number of bits at the output of the butterfly in every stage)
WLI   s1   s2   s3   s4   s5   s6   s7   s8   ∆WL   Error (dB)
8     9    10   11   12   13   14   15   16   8     −39.55
8     9    10   11   12   13   14   15   15   7     −37.61
8     9    10   11   12   13   14   14   15   7     −36.89
8     9    10   11   12   13   13   14   15   7     −36.02
8     9    10   11   12   12   13   14   15   7     −35.12
8     9    10   11   11   12   13   14   15   7     −34.42
8     9    10   10   11   12   13   14   15   7     −33.94
8     9    9    10   11   12   13   14   15   7     −33.50
8     8    9    10   11   12   13   14   15   7     −33.29
8     9    10   11   12   13   14   14   14   6     −31.92
...   ...  ...  ...  ...  ...  ...  ...  ...  ...   ...
8     8    8    8    8    8    8    8    8    0     3.25
4.5 Relation between the wordlength of the samples and coefficients
In this section we study the effect of the quantization of the coefficients
and of the truncation in the rotations, and the relation between them. Firstly, we work
without truncation in the butterfly (∆WL = 8), i.e., we select the bottom line of
Figures 4.4, 4.5 and 4.6, and modify the coefficient bits. Secondly, we work
with truncation in the butterfly (∆WL = 0), i.e., we select the upper line of
Figure 4.4, and modify the coefficient bits. Thus, we use different
input bits (WLI) and coefficient bits (b) and find the relation among
WLI, b and Error.
Table 4.7. Parameters of the study
Parameter                Value
Input signal             pulse, sinusoidal and noise
FFT size (N)             256 points
Radix                    2
Decomposition            DIT
βs                       0, ∀s
δs                       1, ∀s
WLI                      from 5 to 24 bits
b                        from 5 to 24 bits
Butterfly quantization   truncation
Table 4.7 shows all the design parameters. One of the goals of this
section is to show that it is not a good idea to work with a high number of
coefficient bits and input bits; instead, we need the lowest number of bits
that achieves the desired Error. Therefore, we have to approximate this value. The
mathematical model for this case is Equation 3.6.
Figures 4.7, 4.8 and 4.9 show the relationship
between the number of bits in the inputs and the number of bits in the coefficients.
Figure 4.7 is divided into different regions. Each region indicates a range of 6
dB for the Error, as can be observed in Figures 4.7, 4.8 and 4.9. The
conclusion obtained from Figures 4.7, 4.8 and 4.9 is the following:
b should be equal to WLI or 1 bit more than WLI, but never
higher, because with more coefficient bits the result is always the same (4.11).
$$b \le WL_I + 1 \tag{4.11}$$
For example, for the case of 14 input bits in Figure 4.7, the best
Error lies between -72 and -78 dB. It can be
observed (Figure 4.7) that this result is obtained with 15 bits in the coefficients
and that for any higher number of coefficient bits the result is the same.
Now we work with truncation at all stages (∆WL = 0) and
with different coefficient bits.
Table 4.8 shows all the design parameters.
[Figure 4.7: Error (dB, scale from −126 to −24 in steps of 6 dB) as a function of coefficient bits (b) and input bits (WLI), both from 6 to 24.]
Figure 4.7. Relationship between input and coefficient wordlengths and Error for
∆WL = 8 and ∆b = 0 - pulse in the input signal
Table 4.8. Parameters of the study
Parameter                Value
Input signal             pulse
FFT size (N)             256 points
Radix                    2
Decomposition            DIT
βs                       0, ∀s
δs                       0, ∀s
WLI                      from 5 to 24 bits
b                        from 5 to 24 bits
Butterfly quantization   truncation
From Figure 4.10 a relationship between the coefficient bits and
the input bits can be obtained. The conclusion is the following: b should satisfy
Equation 4.12 and should never be higher, because the
result would always be the same.
$$b \le WL_I - 7 \tag{4.12}$$
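The two limits found in this section (Equations 4.11 and 4.12) can be collected in a small Python helper. The function name is hypothetical, and the limits are only those observed for the 256-point FFT studied here, so the helper should not be read as a general rule for other sizes or profiles.

```python
def max_useful_coefficient_bits(wli, butterfly_truncation):
    """Largest coefficient wordlength b that still improves the Error (256 points)."""
    if butterfly_truncation:
        return wli - 7   # delta_s = 0 for all stages, Equation 4.12
    return wli + 1       # delta_s = 1 for all stages, Equation 4.11


# Hypothetical usage: 14 input bits without butterfly truncation.
b_max = max_useful_coefficient_bits(14, butterfly_truncation=False)
```

For 14 input bits without butterfly truncation the helper returns 15 bits, which matches the example read from Figure 4.7.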
[Figure 4.8: Error (dB, scale from −110 to −5 in steps of 5 dB) as a function of coefficient bits (b) and input bits (WLI), both from 6 to 24.]
Figure 4.8. Relationship between input and coefficient wordlengths and Error for
∆WL = 8 and ∆b = 0 - noise in the input signal
[Figure 4.10: Error (dB, scale from −90 to 18 in steps of 6 dB) as a function of coefficient bits (b) and input bits (WLI), both from 6 to 24.]
Figure 4.10. Relationship between input and coefficient wordlengths and Error for
∆WL = 0 and ∆b = 0
[Figure 4.9: Error (dB, scale starting at −110 in steps of 5 dB) as a function of coefficient bits (b) and input bits (WLI), both from 6 to 24.]
Figure 4.9. Relationship between input and coefficient wordlengths and Error for
∆WL = 8 and ∆b = 0 - sinusoidal input signal
For example, for the case of 18 input bits (WLI), there are 7 working
regions, at -24, -30, -36, -42, -48, -54 and -60 dB, for 5, 6, 7, 8, 9, 10 and 11 bits in the
coefficients (b), respectively; the best result is in the region between -60 and -66 dB.
4.6 Experimental results for pipeline feedback architecture
In this section we need to select a concrete architecture because we study
the effect of the truncations in relation to the area. The goal is to select the best
increment profile for the data (δ) and for the coefficient memories (β) in every
stage. This means that we select, for every stage, whether to truncate at the output of
the butterfly and whether to decrease or increase the number
of coefficient bits. The objective is to find the profile
(δ, β) with the best trade-off between area and Error. For this purpose, we
use a 256-point DIT FFT with a pipeline feedback architecture.
4.6.1 Relation for every δ between buffer size and the data
wordlength
To study the relation between the buffer size (BS) and the input bits in order to
find the best data increment profile (δ), we have studied the sizes of the buffers in
every stage (Figure 4.11). This size is proportional to the number of bits in every
stage (WLs). Equation 4.13 shows how this value is calculated.
Figure 4.11. 256-point pipeline feedback FFT - study model of the area with input bits
The total number of bits required by the architecture is calculated
as:
$$BS_{DIT} = BS_{DIF} = \sum_{s=1}^{n} 2^{(n-s)} \cdot WL_s, \qquad n = 8 \tag{4.13}$$
The size of the buffers is the same for DIT and DIF.
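Equation 4.13 translates directly into Python. The sketch below assumes that the list holds WL1, ..., WLn in stage order; the function name is illustrative.

```python
def buffer_size(stage_wordlengths):
    """Total buffer bits: the buffer of stage s has 2^(n-s) words of WL_s bits."""
    n = len(stage_wordlengths)
    return sum(2 ** (n - s) * wl
               for s, wl in enumerate(stage_wordlengths, start=1))


# Hypothetical usage for a 256-point FFT (n = 8) with 8 input bits.
bs_constant = buffer_size([8] * 8)             # same wordlength in all stages
bs_growing = buffer_size(list(range(9, 17)))   # wordlengths 9, 10, ..., 16
```

The first buffers are the largest ones, so one extra bit in an early stage costs far more memory than one extra bit in a late stage.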
Table 4.9. Parameters of the study
Parameter                Value
Input signal             pulse, sinusoidal and noise
FFT size (N)             256 points
Radix                    2
Decomposition            DIT
βs                       0, ∀s
δs                       from 0, ∀s to 1, ∀s
WLI                      from 8 to 16 bits
b                        64 bits
Butterfly quantization   truncation
Now we study which are the best cases as a function of the BS and the Error
for every number of input bits (WLI) and for every profile δ. Table 4.9 shows all
the parameters used in the study of the FFT for this case.
[Figure 4.12: E (dB) versus buffer size (BS), one set of points for each number of input bits (WLI) from 8 to 16.]
Figure 4.12. Relationship between the buffer size and Error for all cases (from ∆WL =
0 to ∆WL = 8) - pulse in the input signal
In order to get the best Error-BS relation, we work with 8 to 16
bits in the inputs (WLI) and 64 bits in the coefficients in every stage (b). We
use 64 bits in order to follow Equation 4.7 and to make the error
generated by the coefficients negligible. Therefore, we have the effect of the truncations in
the rotations and the effect of the truncation in the butterfly.
Figure 4.12 shows the Error for the different data
increment profiles from δs = 0, ∀s to δs = 1, ∀s. We select the best cases in
terms of the Error-BS relation, i.e., those with lower BS and better Error. To
carry out this simplification we use the algorithm of Figure 4.13, where t is the total
number of coefficients for a given number of input bits (for example, from 8 to 16
coefficient bits for 8 input bits) and n and m are control counters.
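The diagram of Figure 4.13 is not reproduced in this text, so the following Python fragment is only one possible reading of the selection step: it keeps the cases for which no other case has both a smaller or equal memory and a lower Error. The function name and the sample values are hypothetical and are not measured results.

```python
def select_best_cases(cases):
    """Keep the (memory_bits, error_db) pairs that no other pair improves on."""
    best = []
    lowest_error = float("inf")
    # Visit the cases from the smallest to the largest memory.
    for size, error in sorted(cases):
        if error < lowest_error:  # more memory is only kept if it lowers the Error
            best.append((size, error))
            lowest_error = error
    return best


# Illustrative values only.
cases = [(100, -10.0), (120, -30.0), (110, -5.0), (130, -30.0), (150, -45.0)]
best = select_best_cases(cases)
```

In this example the cases with 110 and 130 bits are discarded, because a case with less memory already reaches the same or a lower Error.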
Figure 4.13. The diagram represents the technique used to get the best relation
between memory used and Error
[Figure 4.14: E (dB) versus buffer size (BS), one set of points for each number of input bits (WLI) from 8 to 16.]
Figure 4.14. Relationship between buffer size and Error for different data wordlength
increment profiles - pulse in the input signal
[Figure 4.15: E (dB) versus buffer size (BS), one set of points for each number of input bits (WLI) from 8 to 16.]
Figure 4.15. Relationship between buffer size and Error for different data wordlength
increment profiles - noise in the input signal
[Figure 4.16: E (dB) versus buffer size (BS), one set of points for each number of input bits (WLI) from 8 to 16.]
Figure 4.16. Relationship between buffer size and Error for different data wordlength
increment profiles - sinusoidal input signal
Figures 4.14, 4.15 and 4.16 show the Error-BS relation after the simplification.
In Figures 4.14, 4.15 and 4.16 only the most
important results for every number of input bits remain. For example, if we are working with 8 and
9 bits in the inputs, we will always select the samples with WLI = 8 bits if their
ranges include the desired Error. It can be seen that, after finding
the best Error-BS relationship, this case can be simplified further.
[Figure 4.17: E (dB) versus buffer size (BS); coefficient bits (b) from 8 to 16, with groups of points labelled WLI = 8 to WLI = 16.]
Figure 4.17. Relationship between the buffer size and Error using a profile ∆WL = 7
and ∆WL = 8 and constant coefficients (β) at every stage - pulse in the input signal
Figures 4.14, 4.15 and 4.16 show which are the best
cases (red cases). These cases have the best Error-BS relationship and they
have features in common. The total increment at the output of the adders
(butterfly) for all of these cases is ∆WL = 7 or ∆WL = 8. The cases with other values of
∆WL do not fall within the red cases and are not selected.
Now that we know the best cases and their data increment profiles (δ),
we select the red cases and work with different numbers of bits in
the coefficients (from 8 to 16 bits). These coefficient wordlengths are constant in every
stage (∆b = 0). Table 4.10 shows all the parameters used in the study.
Table 4.10. Parameters of the study
Parameter                Value
Input signal             pulse
FFT size (N)             256 points
Radix                    2
Decomposition            DIT
βs                       0, ∀s
δs                       from 0, ∀s to 1, ∀s
WLI                      from 8 to 16 bits
b                        from 8 to 16 bits
Butterfly quantization   truncation
Table 4.11. Limits between input bits and coefficient bits - pulse in the input signal
WLI   Limit (dB)           b    Limit (dB)
8     −38 < Error < −32    8    −37 < Error
9     −44 < Error < −38    9    −36 < Error
10    −50 < Error < −44    10   −53 < Error
11    −56 < Error < −50    11   −61 < Error
12    −62 < Error < −56    12   −67 < Error
13    −68 < Error < −62    13   −71 < Error
14    −74 < Error < −68    14   −75 < Error
15    −80 < Error < −74    15   −81 < Error
16    −86 < Error < −80    16   −85 < Error
Table 4.12. Selection of the data increment profile (the best cases) for input bits - for
pulse, sinusoidal and noise in the input signal
The best profiles of growth in the inputs
δ1   δ2   δ3   δ4   δ5   δ6   δ7   δ8   Difference (approx.)
1    0    1    1    1    1    1    1    0 dB
1    1    0    1    1    1    1    1    −1 dB
1    1    1    0    1    1    1    1    −1 dB
1    1    1    1    0    1    1    1    −1 dB
1    1    1    1    1    0    1    1    −1 dB
1    1    1    1    1    1    0    1    −1 dB
1    1    1    1    1    1    1    0    −1 dB
1    1    1    1    1    1    1    1    −1 dB
Figure 4.17 shows the limits for
every number of coefficient bits. For example, if we work with 8 bits
in the coefficients, we will never get a result better than approximately -38 dB, and if we
work with 8 bits in the samples the results lie between
-31 dB with ∆WL = 7 and -38 dB with ∆WL = 8. The results from -31 dB up to, but not
including, -38 dB correspond to cases with ∆WL = 7 in the different stages. Table 4.11 shows
the limits between input data and coefficients for the case with a pulse in the
input signal.
Figure 4.17 shows that the best result with
constant coefficients is obtained with one bit more than the input bits
(WLI) (Equation 4.11). For example, if WLI = 9 then we should
use 10 bits in the coefficients, because with 11 or more the result is
the same. Table 4.12 lists the best data increment profiles (δ) obtained
with respect to the Error-BS relation. It also shows the difference in Error
from the first case (δ = 10111111) to the last case (δ = 11111111) for all the cases
(pulse, sinusoidal, noise) in the input signal.
4.6.2 Relation for every β between area and the coefficient
wordlength
In this section we study the relation between the Error and
the total memory used, i.e., the coefficient memory (area-memory) plus the buffers
(area-buffer) that were studied in Section 4.6.1. We have studied the sizes of the memory in every
stage. Table 4.13 shows all the parameters used in this study.
Figure 4.18 shows the width of the memories, which is given by
the number of coefficient bits in every stage (bs). In every stage it is
possible to use a different number of coefficient bits, following the profile
β.
Table 4.13. Parameters of the study
Parameter                Value
Input signal             pulse, sinusoidal and noise
FFT size (N)             256 points
Radix                    2
Decomposition            DIT and DIF
βs                       0, ∀s
δs                       from −1, ∀s to 1, ∀s
WLI                      from 8 to 16 bits
b                        from 4 to 18 bits
Butterfly quantization   truncation
In the previous section we found the best profiles (∆WL) at the output of the
butterfly in every stage with respect to the Error-BS (buffer size) relation. These cases
have a total data increment of ∆WL = 7 or ∆WL = 8, and the latter
gives the best result. So, in this case we work without truncation
in the butterfly. In this way we study the relation between the number of coefficient bits
and the Error-TotalMemory trade-off. The objective of this section is the same
as in the previous section: to find a good coefficient
increment profile (β), which indicates the coefficient wordlength (bs) in
every stage, in order to minimize the area.
To find the best number of coefficient bits in every stage,
we work with a coefficient increment profile (β) in which it is possible to increase or
decrease the wordlength by 1 bit in every stage, or to keep the same number of bits as in
the previous stage. Once the profile has been defined, we work with 8 to 16 input
bits and 4 to 18 coefficient bits. The first study is for the DIT decomposition.
The idea is the same as in Section 4.6.1: we simulate every
number of input and coefficient bits with all coefficient increment profiles
and without truncation in the butterfly.
Relation for every β between total memory size and the coefficient
wordlength for DIT
In this section we present the best coefficient increment profile (β) with respect to the
Error-TotalMemory relation for the DIT decomposition.
Figure 4.18. 256-point DIT FFT pipeline-feedback study model of buffer size with
input bits and coefficient bits
The number of positions in the memory of stage s is 2^s for
the DIT decomposition. Consequently, Figure 4.18 starts with 2 positions in the
first stage and ends with 128 positions in the last memory for 256 points. The size used is
proportional to the number of bits used in every stage (bs).
Table 4.14. Parameters of the study
Parameter                Value
Input signal             pulse, sinusoidal and noise
FFT size (N)             256 points
Radix                    2
Decomposition            DIT
βs                       from −1, ∀s to 1, ∀s
δs                       1, ∀s
WLI                      from 8 to 16 bits
b                        from 4 to 18 bits
Butterfly quantization   truncation
Once we have simulated all the cases (Figure 4.19), we need to simplify
and keep the best results with respect to the Error-TotalMemory relation. For this
purpose we use the algorithm of Figure 4.13, which selects the best cases with
respect to Error-TotalMemory. Table 4.14 shows all the parameters of the study.
Figure 4.19. Relationship between the total memory and Error, from 8 to 16 input bits
(WLI) and from 4 to 18 coefficient bits (b) with different coefficient increment profiles (β)
- pulse in the input signal
We calculate the area as a function of the memory size as:
$$MS_{DIT} = \sum_{s=1}^{n-1} 2^{s} \cdot b_s \tag{4.14}$$
$$TotalMemory_{DIT} = BS_{DIT} + MS_{DIT} \tag{4.15}$$
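Equations 4.14 and 4.15 can be evaluated with the following Python sketch. The function names are illustrative; the buffer size is computed as in Equation 4.13, and both lists are assumed to be given in stage order, with n wordlengths for the data and n−1 wordlengths for the coefficients.

```python
def buffer_size(stage_wordlengths):
    """Equation 4.13: the buffer of stage s has 2^(n-s) words of WL_s bits."""
    n = len(stage_wordlengths)
    return sum(2 ** (n - s) * wl
               for s, wl in enumerate(stage_wordlengths, start=1))


def coefficient_memory_dit(coeff_wordlengths):
    """Equation 4.14: the memory of stage s has 2^s positions of b_s bits."""
    return sum(2 ** s * b
               for s, b in enumerate(coeff_wordlengths, start=1))


def total_memory_dit(stage_wordlengths, coeff_wordlengths):
    """Equation 4.15: buffers plus coefficient memories."""
    return buffer_size(stage_wordlengths) + coefficient_memory_dit(coeff_wordlengths)


# Hypothetical usage: 256 points, WLI = 8 without butterfly truncation, b = 8.
total = total_memory_dit(list(range(9, 17)), [8] * 7)
```

In DIT the largest coefficient memories belong to the last rotators, so the wordlengths of the late stages dominate the memory cost.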
Once we have the simulation we select the best cases, i.e., those with minimum
TotalMemory and the best Error, following Figure 4.13. When the simplification
is finished we obtain Figure 4.20.
Figures 4.20, 4.21 and 4.22 show all the ranges that remain after the
simplification, for the cases of pulse, noise and sinusoidal input signals. These are the cases
with the best Error-TotalMemory relation. Figures 4.20, 4.21 and 4.22 represent the
case with WLI = 8. It can be observed that the best result is obtained
starting with 6 bits in the coefficients, but we need to know which increment
profile (β) in the coefficients allows this result to be obtained. We now
take the best result obtained in Figures 4.20, 4.21 and 4.22 and represent
the number of coefficient bits used in every stage. The objective
is to determine exactly the number of coefficient bits needed to obtain the desired Error.
We have found the limits of the coefficient bits in every stage (bs), beyond which it is not
necessary to use more bits because the result is the same.
[Figure 4.20: E (dB) versus TotalMemory, one set of points for each number of coefficient bits (b) from 4 to 18.]
Figure 4.20. The relation between the total memory and Error with 8 input bits and
from 4 to 18 coefficient bits with different coefficient increment profiles (β) - pulse in the
input signal
To obtain the best result while minimizing the TotalMemory used for every WLI,
we must start in stage 1 (s = 1) with the following rule (4.16):
$$b_1 = WL_I - 1 \tag{4.16}$$
and follow the profile β = 110000. It is important to note that in stage 1
of the DIT decomposition the rotation is trivial. Therefore, it is possible to use
the minimum number of bits, but in stage 2 (s = 2) we have to recover
the original value (4.17):
$$b_2 = WL_I \tag{4.17}$$
[Figure 4.21: E (dB) versus TotalMemory, one set of points for each number of coefficient bits (b) from 4 to 18.]
Figure 4.21. The relation between the total memory and Error with 8 input bits and
from 4 to 18 coefficient bits with different coefficient increment profiles (β) - noise in the
input signal
[Figure 4.22: E (dB) versus TotalMemory, one set of points for each number of coefficient bits (b); the legend lists 4 to 16.]
Figure 4.22. The relation between the total memory and Error with 8 input bits and
from 4 to 18 coefficient bits with different coefficient increment profiles (β) - sinusoidal
input signal
Table 4.15. Selection of the coefficient increment profile (β) (the best cases)
Stages with rotations   β2   β3   β4   β5   β6   β7
DIT                     1    1    0    0    0    0
Table 4.15 shows the coefficient increment profile that gives the best
Error-TotalMemory relation for DIT.
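As an illustration of the rules (4.16) and (4.17) together with the profile of Table 4.15, the coefficient wordlength of every stage can be derived from WLI as sketched below. The names are hypothetical, and the sketch assumes that every entry of the profile is the increment with respect to the previous stage.

```python
BEST_BETA_DIT = (1, 1, 0, 0, 0, 0)  # beta_2 ... beta_7 taken from Table 4.15


def best_dit_coefficient_wordlengths(wli, beta=BEST_BETA_DIT):
    """Return [b_1, ..., b_7] for the 256-point DIT FFT."""
    wordlengths = [wli - 1]  # Equation 4.16: b_1 = WLI - 1
    for increment in beta:
        wordlengths.append(wordlengths[-1] + increment)
    return wordlengths


# Hypothetical usage for 8 input bits.
b = best_dit_coefficient_wordlengths(8)
```

For WLI = 8 this yields 7 bits in stage 1, 8 bits in stage 2 as required by (4.17), and 9 bits from stage 3 onwards, which agrees with the limit b ≤ WLI + 1 of Equation 4.11.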
Relation for every β between total memory size and the coefficient
wordlength for DIF
The objective of this section is to find the best Error-TotalMemory result
following the same procedure as in the previous section. The main difference between DIF
and DIT is that the order of the rotations is inverted. Figure 4.23 shows
the sizes of the memories.
Therefore, the TotalMemory is calculated as:
$$MS_{DIF} = \sum_{s=1}^{n-1} 2^{(n-s+1)} \cdot b_s \tag{4.18}$$
$$TotalMemory_{DIF} = BS_{DIF} + MS_{DIF} \tag{4.19}$$
Figure 4.23. 256-point DIF FFT pipeline-feedback study model of total memory with
input bits and coefficient bits
Table 4.16 shows all the parameters of the study.
Figure 4.24 shows the results for all input bits and coefficient bits with
all the coefficient increment profiles (β).
4.6 Experimental results for pipeline feedback architecture 45
Table 4.16. Parameters for studying
Parameters Values
Input signal pulse, sinusoidal and noise
FFT size (N) 256 points
Radix 2
Decomposition DIF
βs from 1¯∀s to 1,∀s
δs 1 ∀s
WLI from 8 to 16 bits
b from 4 to 18 bits
Butterfly quantization truncation
[Plot: E (dB) versus TotalMemory; curves for coefficient bits (b) from 4 to 18 and for
WLI from 8 to 16.]
Figure 4.24. Relationship between the total memory and the Error, from 8 to 16 input bits
and from 4 to 18 coefficient bits, for different coefficient increment profiles (β); pulse in
the input signal
[Plot: FN (dB) versus TotalMemory; one curve per number of coefficient bits (b).]
Figure 4.25. Relation between the total memory and the Error with 8 input bits and
from 4 to 18 coefficient bits, for different coefficient increment profiles (β); pulse in the
input signal
[Plot: E (dB) versus TotalMemory; one curve per number of coefficient bits (b).]
Figure 4.26. Relation between the total memory and the Error with 8 input bits and
from 4 to 18 coefficient bits, for different coefficient increment profiles (β); noise in the
input signal
[Plot: E (dB) versus TotalMemory; one curve per number of coefficient bits (b).]
Figure 4.27. Relation between the total memory and the Error with 8 input bits and
from 4 to 16 coefficient bits, for different coefficient increment profiles (β); sinusoid in
the input signal
Now we simplify the results and select only the best cases with respect to the
Error-TotalMemory relation (Figure 4.25).
Figures 4.25, 4.26 and 4.27 show that the best result is obtained when starting
with 8 bits in the coefficients, but we still need to know which coefficient
increment profile β yields this result.
We therefore take the best results obtained in Figures 4.25, 4.26 and 4.27 and
represent the number of bits used in the coefficients at every stage. The
objective is to find exactly the number of coefficient bits needed to obtain the
desired Error. We have found the limits of the coefficient bits for every stage
(bs), beyond which more bits are unnecessary because the result stays the
same.
To obtain the best result while minimizing the TotalMemory used, we have to
start at stage 1 (s = 1) with the rule:
b1 = WLI + 1 (4.20)
and follow the profile β = 100000. Note that in stage 7 of the DIF decomposition
the rotation is trivial, so the minimum number of bits can be used there.
Table 4.17 gives the best coefficient increment profile obtained for every WLI.
Table 4.17. Selection of the coefficient increment profile (β) (the best cases)
                Rotation stages
Decomposition   β2  β3  β4  β5  β6  β7
DIF             1   0   0   0   0   0
4.7 Effect of the different decompositions DIT-DIF
The goal of this section is to compare DIT and DIF and quantify their
difference. If the difference is significant, we can select the better
decomposition. Table 4.18 shows all the parameters used in the study.
Table 4.18. Parameters of the study
Parameter               Value
Input signal            pulse and noise
FFT size (N)            256 points
Radix                   2
Decomposition           DIF and DIT
βs                      0 ∀s
δs                      0 ∀s and 1 ∀s
WLI                     from 2 to 32 bits
b                       64 bits
Butterfly quantization  truncation
Figure 4.29 shows the difference between the DIT and DIF decompositions for a
pulse in the input signal. Here the coefficients use 64 bits, so the effect of
coefficient scaling is negligible. For these cases the mathematical models are
equations 4.7 and 4.10, with butterfly truncation (∆WL = 0) and without it
(∆WL = 8), respectively. For ∆WL = 0, DIT and DIF give the same result: the
error generated by the truncation in the butterflies dominates over the
difference between the decompositions. For ∆WL = 8, with no truncation in the
butterflies at any stage, there is a small difference between DIT and DIF of
approximately 2 dB. This is not a significant difference, so the results for DIT
and DIF can be considered the same, although DIF is slightly better.
[Plot: FN (dB) versus input bits (WLI); curves DIT ∆WL=0, DIT ∆WL=8, DIF ∆WL=0,
DIF ∆WL=8.]
Figure 4.28. 256-point DIT-DIF FFT: relationship between the Error and the input bits,
with 64 bits in the coefficients, for ∆WL = 0 and ∆WL = 8; noise in the input signal
[Plot: E (dB) versus input bits (WLI); curves DIT ∆WL=0, DIT ∆WL=8, DIF ∆WL=0,
DIF ∆WL=8; annotated difference of approximately 2 dB.]
Figure 4.29. 256-point DIT-DIF FFT: relationship between the Error and the input bits,
with 64 bits in the coefficients, for ∆WL = 0 and ∆WL = 8; pulse in the input signal
On the other hand, for noise in the input signal (Figure 4.28) the difference
between DIT and DIF is insignificant. Therefore, we can say that the difference
between the two decompositions is negligible for any input signal.
4.7.1 Difference between DIT and DIF for the β used
In the feedback pipeline architecture the main difference between the two
architectures is the position of the rotations, which is inverted. Therefore, the
sizes of the individual memories differ. Figure 4.18 shows the architecture
for DIT and Figure 4.23 shows the architecture for DIF.
If the number of coefficient bits (bs) is constant at every stage, the Error
obtained is the same for DIT and DIF, and so is the area used. Figure 4.30
shows these results. Table 4.19 shows all the parameters used in this study.
Table 4.19. Parameters of the study
Parameter               Value
Input signal            pulse
Points (N)              256
Radix                   2
Decomposition           DIT and DIF
βs                      1 ∀s
δs                      0 ∀s
WLI                     64
b                       from 8 to 16 bits
Butterfly quantization  truncation
Next, we study the result when the coefficient bits are not constant. In
Figure 4.31 the number of coefficient bits increases by 1 bit at every stage,
following the profile β = 1111111, and the result is better for DIT than for
DIF. In the feedback pipeline architecture for DIT (Figure 4.18) the number of
rotations increases at every stage, whereas for DIF it decreases at every
stage. With an increasing profile, the DIT result is therefore expected to be
better than the DIF result. Table 4.20 shows all the parameters used in this
study.
Table 4.20. Parameters of the study
Parameter               Value
Input signal            pulse
Points (N)              256
Radix                   2
Decomposition           DIT and DIF
βs                      1 ∀s
δs                      1 ∀s
WLI                     64
b                       from 8 to 16 bits
Butterfly quantization  truncation
[Plot: E (dB) versus coefficient bits; curves DIT and DIF.]
Figure 4.30. Relation between DIT and DIF with constant coefficient bits
[Plot: E (dB) versus coefficient bits; curves DIT and DIF.]
Figure 4.31. Relationship between DIT and DIF with the coefficient bits increasing at
every stage
Conversely, in Figure 4.32 the coefficient bits decrease by 1 bit at every
stage, following the profile β1:7 = 1̄1̄1̄1̄1̄1̄ (where 1̄ denotes −1). The result is
better for DIF than for DIT because DIF has more rotations in the first stages
than DIT. Table 4.21 shows all the parameters used in this study.
Table 4.21. Parameters of the study
Parameter               Value
Input signal            pulse
Points (N)              256
Radix                   2
Decomposition           DIT and DIF
βs                      1 ∀s
δs                      −1 ∀s
WLI                     64
b                       from 8 to 16 bits
Butterfly quantization  truncation
[Plot: E (dB) versus coefficient bits; curves DIF and DIT.]
Figure 4.32. Relation between DIT and DIF with the coefficient bits decreasing at every
stage
4.8 Hardware implementation results
The presented architectures have been implemented for field-programmable
gate arrays (FPGAs). The designs are parameterizable in the number of points (N),
the wordlength of the input bits, the wordlength of the coefficient bits, the
increment profile at the output of the butterflies (∆WL) and the increment profile
in the coefficient memory (∆b). The target FPGA is a Virtex-5, XC5VSX240T-2
FF1759. The objective of this section is to compare the typical design options
with the best results obtained in the previous sections, in order to show that
the hardware results follow the simulated results.
4.8.1 DIT-cases
As stated above, the limiting coefficient increment profile (β) that gives the
best Error-area relation for the DIT decomposition is β = 110000. In addition,
there is a useful range with a small deterioration of the Error (at most 3 dB)
but a small improvement (at most 6%) of the TotalMemory used. These cases are
included among the proposed cases. The range satisfies the following
expressions:
b2 = WLI (4.21)
WLI ≤ b(3:7) ≤ WLI + 1 (4.22)
The value of b1 is not included: as stated above, this rotation is trivial, so
its wordlength is not important. The data increment profile used is
δ = 11111111.
Table 4.22 and Figures 4.33 and 4.34 show the typical and proposed cases for
different numbers of input bits (WLI) with the DIT decomposition, together with
the resulting Error and the area used. Table 4.23 compares typical versus
proposed cases and proposed versus proposed cases, and indicates the
improvement in area and Error.
Table 4.22. Different cases with different design parameters of the FFT for the DIT
decomposition
DIT cases
case  WLI  ∆WL  δ         b   ∆b  β       Error (dB), pulse  Error (dB), noise  Area (slices)
1     4    0    00000000  4   0   000000   23.56              31.68              546
2     8    0    00000000  8   0   000000   3.18               7.89               1255
3     12   0    00000000  12  0   000000  −20.46             −16.24              2299
4     16   0    00000000  16  0   000000  −44.67             −40.38              3348
5     4    8    11111111  4   8   111111  −12.58              1.76               1284
6     8    8    11111111  8   8   111111  −37.60             −23.23              2244
7     12   8    11111111  12  8   111111  −61.61             −47.25              3418
8     16   8    11111111  16  8   111111  −85.88             −71.46              4908
9     4    8    11111111  4   0   000000  −11.66              5.6                883
10    8    8    11111111  8   0   000000  −34.53             −16.6               1613
11    12   8    11111111  12  0   000000  −60.53             −44.4               2664
12    16   8    11111111  16  0   000000  −84.61             −67.67              3810
13    4    8    11111111  3   2   110000  −12.67              2.04               933
14    8    8    11111111  7   2   110000  −37.55             −21.80              1736
15    12   8    11111111  11  2   110000  −61.18             −46.65              2881
16    16   8    11111111  15  2   110000  −85.49             −70.89              4084
[Plot: E (dB) versus Area (slices); one curve per combination of ∆WL and β, with points
labelled WLI = 4, 8, 12, 16.]
Figure 4.33. 256-point DIT FFT: relation between Error and Area (slices) using different
profiles (∆WL and ∆b); pulse in the input signal
[Plot: E (dB) versus Area (slices); one curve per combination of ∆WL and ∆b, with points
labelled WLI = 4, 8, 12, 16.]
Figure 4.34. 256-point DIT FFT: relation between Error and Area (slices) using different
profiles (∆WL and ∆b); noise in the input signal
Table 4.23. Comparison of different cases of the DIT decomposition
cases   Error (dB) improvement (approx.)   Area improvement (%) (approx.)
9-13    +2                                 −5.66
5-13    −0.1                               +37.62
10-14   +4                                 −7.08
6-14    −0.5                               +29.3
11-15   +2                                 −7.53
7-15    −0.6                               +18.63
12-16   +2.2                               −6.70
8-16    −0.5                               +20.17
Figures 4.33 and 4.34 show that the cases (∆WL = 8, ∆b = 2) and (∆WL = 8, ∆b = 0)
have the best Error-area relation. Comparing the two, the case ∆WL = 8, ∆b = 2
has an Error approximately 3 dB lower, while the case ∆WL = 8, ∆b = 0 uses
approximately 6% less area.
4.8.2 DIF-cases
As stated above, the limiting coefficient increment profile (β) that gives the
best Error-area relation for the DIF decomposition is β = 100000. In addition,
there is a useful range with a small deterioration of the Error (at most 3 dB)
but a small improvement (at most 15%) of the TotalMemory used. These cases are
included among the proposed cases. The range satisfies the following
expressions:
WLI ≤ b1 ≤ WLI + 1 (4.23)
WLI ≤ b(2:6) ≤ WLI + 2 (4.24)
The value of b7 is not included: as stated above, this rotation is trivial, so
its wordlength is not important. The data increment profile used is
δ = 11111111.
Table 4.24. Different cases with different design parameters of the FFT for the DIF
decomposition
DIF cases
case  WLI  ∆WL  δ         b   ∆b  β       Error (dB), pulse  Error (dB), noise  Area (slices)
1     4    0    00000000  4   0   000000   23.81              31.06              551
2     8    0    00000000  8   0   000000   2.64               6.78               1257
3     12   0    00000000  12  0   000000  −21.48             −17.24              2289
4     16   0    00000000  16  0   000000  −45.78             −41.07              3312
5     4    8    11111111  4   7   111111  −15.08              2.07               1272
6     8    8    11111111  8   7   111111  −39.72             −21.90              2276
7     12   8    11111111  12  7   111111  −63.76             −45.80              3366
8     16   8    11111111  16  7   111111  −87.73             −69.97              4886
9     4    8    11111111  4   0   000000  −12.96              5.67               828
10    8    8    11111111  8   0   000000  −35.55             −16.408             1634
11    12   8    11111111  12  0   000000  −62.37             −43.68              2674
12    16   8    11111111  16  0   000000  −85.79             −67.11              3774
13    4    8    11111111  4   2   100000  −15.01              2.03               1071
14    8    8    11111111  8   2   100000  −39.54             −21.81              1996
15    12   8    11111111  12  2   100000  −63.78             −45.40              3073
16    16   8    11111111  16  2   100000  −87.73             −69.84              4201
[Plot: E (dB) versus Area (slices); one curve per combination of ∆WL and β, with points
labelled WLI = 4, 8, 12, 16.]
Figure 4.35. 256-point DIF FFT: relation between Error and Area (slices) using different
profiles (∆WL and ∆b); pulse in the input signal
[Plot: E (dB) versus Area (slices); one curve per combination of ∆WL and ∆b, with points
labelled WLI = 4, 8, 12, 16.]
Figure 4.36. 256-point DIF FFT: relation between Error and Area (slices) using different
profiles (∆WL and ∆b); noise in the input signal
Table 4.24 lists the typical and proposed cases for different numbers of input
bits (WLI) with the DIF decomposition, together with the resulting Error and
the area used. Table 4.25 compares typical versus proposed cases and proposed
versus proposed cases, and indicates the improvement in area and Error.
Table 4.25. Comparison of different cases of the DIF decomposition
cases   Error (dB) improvement   Area improvement (%)
9-13    +2                       −22.68
5-13    −0.07                    +15.8
10-14   +3.99                    −22.15
6-14    −0.18                    +14.02
11-15   +1.42                    −14.92
7-15    −0.02                    +9.56
12-16   +1.94                    −11.38
8-16    −0.003                   +16.35
Figures 4.35 and 4.36 show that the cases (∆WL = 8, ∆b = 2) and (∆WL = 8, ∆b = 0)
have the best Error-area relation. Comparing the two, the case ∆WL = 8, ∆b = 2
has an Error approximately 2.7 dB lower, while the case ∆WL = 8, ∆b = 0 uses
approximately 19% less area.
The Error results of DIT and DIF are the same or very similar for the best
cases (∆WL = 8, ∆b = 2). The area used by DIT, however, is lower than that of
DIF by approximately 8%. This difference becomes smaller as the number of bits
in the input data (WLI) increases.
4.9 Error in the different channels of the FFT
This section presents the distribution of the error at the output of the FFT
over the different channels, which tells us which channels are better or worse.
Figure 4.37 shows the error in dB for all the channels of the FFT. We have
divided the error into 6 cases, from best to worst, as shown in Figure 4.37.
[Plot: error (dB) versus channel (k); one curve per number of input bits (WLI) and
coefficient bits (b) from 8 to 16; error groups annotated 1, 2, 3, 4 and 5:6.]
Figure 4.37. 256-point DIT FFT: relation between the error at the output of the FFT and
the channels using ∆WL = 8, ∆b = 0; pulse in the input signal
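The error is calculated as:
\[ \mathrm{error} = 10 \cdot \log_{10}\left( \frac{1}{N_m} \sum_{i=0}^{N_m-1} \left| H_k(f) - \hat{H}_k(f) \right|^2 \right) \ \mathrm{(dB)}, \tag{4.25} \]
where H_k(f) and Ĥ_k(f) are the ideal and the quantized outputs in the
frequency domain, respectively.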
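A minimal Python sketch of the error measure (4.25) follows. It assumes that the ideal and quantized outputs are available as two equal-length sequences of complex samples; the function name error_db and the sample values are hypothetical and serve only as an illustration.

```python
import math


def error_db(ideal, quantized):
    """Mean squared difference between two output sequences, in dB, as in (4.25)."""
    if len(ideal) != len(quantized):
        raise ValueError("sequences must have the same length")
    mse = sum(abs(h - hq) ** 2 for h, hq in zip(ideal, quantized)) / len(ideal)
    return 10 * math.log10(mse)  # undefined when both sequences are identical


# Hypothetical values: an ideal output and a copy with a constant error of 0.01
ideal = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
quantized = [h + 0.01 for h in ideal]
print(round(error_db(ideal, quantized), 2))  # -40.0
```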
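Distribution of the errors (Figure 4.37) over the different channels (k):
\[
\begin{aligned}
\text{Channels } (k) \text{ with Error(1):} \quad & k = 32 \cdot (2i+1), & i &= 0, \dots, N/64 - 1 \\
\text{Channels } (k) \text{ with Error(2):} \quad & k = 16 \cdot (2i+1), & i &= 0, \dots, N/32 - 1 \\
\text{Channels } (k) \text{ with Error(3):} \quad & k = 8 \cdot (2i+1), & i &= 0, \dots, N/16 - 1 \\
\text{Channels } (k) \text{ with Error(4):} \quad & k = 4 \cdot (2i+1), & i &= 0, \dots, N/8 - 1 \\
\text{Channels } (k) \text{ with Error(5):} \quad & k = 2 \cdot (2i+1), & i &= 0, \dots, N/4 - 1 \\
\text{Channels } (k) \text{ with Error(6):} \quad & k = 2i+1, & i &= 0, \dots, N/2 - 1
\end{aligned}
\tag{4.26}
\]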
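The grouping in (4.26) depends only on how many times a channel index can be divided by 2. The sketch below, with the hypothetical function name error_class, returns the Error group of a channel k; channels that (4.26) does not list, such as k = 0 or k = 64, are reported as None.

```python
def error_class(k):
    """Return the Error group (1..6) of channel k according to (4.26).

    Channels of the form 2**j * (odd number) belong to group 6 - j for
    j = 0..5; channels not listed in (4.26) give None.
    """
    if k <= 0:
        return None
    j = 0
    while k % 2 == 0:
        k //= 2
        j += 1
    return 6 - j if j <= 5 else None


print(error_class(32), error_class(62), error_class(1))  # 1 5 6
```

For N = 256 this yields 4 channels in group 1 (k = 32, 96, 160, 224) and 128 channels in group 6 (all odd k), matching the index ranges in (4.26).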
To observe the effect of the filters in the different channels, we use a
sinusoidal input signal with frequency f = 0, m, 2m, ..., N − 1, where m = 0.25.
Figures 4.38 and 4.39 show this effect for channel 32 (Error(1)) and channel 62
(Error(5)), respectively, with δ = 11111111 and β = 000000.
[Plot: G (dB) versus frequency; one curve per number of input bits (WLI) and coefficient
bits (b) from 5 to 10.]
Figure 4.38. Shape of the filter of channel 32
[Plot: G (dB) versus frequency; one curve per number of input bits (WLI) and coefficient
bits (b) from 5 to 10.]
Figure 4.39. Shape of the filter of channel 62
4.10 Conclusions of the experimental results
The previous results lead to some conclusions about the relation among
∆WL, WLI and δ for the data samples and b, ∆b, β or bs for the coefficient bits at
every stage. On the one hand, regarding WLI and δ, we have found that the most
efficient relation for both the DIT and DIF decompositions is obtained with the
data increment profile:
δ = 11111111 (4.27)
On the other hand, regarding the relation between the coefficient bits at every
stage (bs) and WLI: to stay within the range between the limit (β = 110000 for
DIT and β = 100000 for DIF) and 3 dB more, we modify the number of bits in the
coefficients, that is, the coefficient increment profile (β). In this way we can
move inside the red region of Figures 4.14 and 4.15 instead of selecting only
the limit (the best case). For the DIT decomposition the following rule applies:
b2 = WLI (4.28)
WLI ≤ b(3:7) ≤ WLI + 1 (4.29)
For DIT the value of b1 is not included: as stated above, this rotation is
trivial, so its wordlength is not important.
For the DIF decomposition:
WLI ≤ b1 ≤ WLI + 1 (4.30)
WLI ≤ b(2:6) ≤ WLI + 2 (4.31)
The value of b7 is not included: as stated above, this rotation is trivial, so
its wordlength is not important.
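As an illustration, rules (4.28) to (4.31) can be written as a small check on a list of per-stage coefficient wordlengths. The sketch below assumes a 256-point radix-2 FFT with seven rotation stages, so that b = [b1, ..., b7]; the function name in_useful_range is hypothetical.

```python
def in_useful_range(decomposition, wli, b):
    """Check rules (4.28)-(4.31) for the wordlengths b = [b_1, ..., b_7]."""
    if len(b) != 7:
        raise ValueError("expected seven coefficient wordlengths")
    if decomposition == "DIT":
        # b_1 is free (trivial rotation); b_2 = WLI; WLI <= b_3..b_7 <= WLI + 1
        return b[1] == wli and all(wli <= x <= wli + 1 for x in b[2:7])
    if decomposition == "DIF":
        # WLI <= b_1 <= WLI + 1; WLI <= b_2..b_6 <= WLI + 2; b_7 is free
        return wli <= b[0] <= wli + 1 and all(wli <= x <= wli + 2 for x in b[1:6])
    raise ValueError("decomposition must be 'DIT' or 'DIF'")


print(in_useful_range("DIT", 8, [7, 8, 9, 9, 9, 9, 9]))        # True
print(in_useful_range("DIF", 8, [9, 10, 10, 10, 10, 10, 10]))  # True
```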
Chapter 5
Guidelines for the designer
In this chapter we present some guidelines for the designer that lead to a good
tradeoff between area and accuracy. The procedure consists of several steps.
First, we need to choose the decomposition, DIT or DIF, bearing in mind that
DIT has a small advantage in the relation between Error and area used, of
approximately 8%. Once the decomposition is chosen, we follow these steps:
1. Select the number of bits in the input (WLI). Each additional input bit
improves the Error by about 6 dB but worsens the area: the TotalMemory
used increases by approximately 8.5%.
2. Select the data wordlength for all the stages of the FFT, that is, the
truncation in the butterfly at every stage. As stated above, a good relation
between the Error and the area is obtained when the data increment profile
has no truncation at any stage of the FFT (∆WL = 8). Therefore, the data
increment profile for both DIT and DIF can be chosen as:
δ = 11111111 (5.1)
3. Determine the number of bits of the coefficients.
For the DIF decomposition the wordlength of the coefficients at the different
stages can be obtained as:
b1 = WLI + 1
β = 100000 (5.2)
Whereas for DIT:
b1 = WLI − 1
β = 110000 (5.3)
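The steps above can be summarised in a short Python sketch. It only restates (5.1) to (5.3) and the 6 dB per input bit figure of step 1 as a linear estimate; the function names recommended_parameters and estimated_error_gain_db are hypothetical, and the result is a starting point for a 256-point radix-2 FFT rather than a complete design flow.

```python
def recommended_parameters(decomposition, wli):
    """Starting parameters suggested by steps 2 and 3 of the guidelines."""
    if decomposition == "DIF":
        b1, beta = wli + 1, "100000"  # (5.2)
    elif decomposition == "DIT":
        b1, beta = wli - 1, "110000"  # (5.3)
    else:
        raise ValueError("decomposition must be 'DIT' or 'DIF'")
    return {"delta": "11111111", "b1": b1, "beta": beta}  # delta from (5.1)


def estimated_error_gain_db(wli_from, wli_to):
    """Rough Error improvement when changing WLI, at about 6 dB per bit (step 1)."""
    return 6.0 * (wli_to - wli_from)


print(recommended_parameters("DIT", 8))
print(estimated_error_gain_db(8, 12))  # 24.0
```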

Chapter 6
Conclusions
In this master thesis we have defined a model that describes the different
hardware operations together with the different quantizations in the FFT.
To study this model we have developed software that simulates all the effects
in hardware. We have compared the results of our software with the hardware
results and found that they are the same. We chose to use software because it
allows a wide design space to be analysed and produces results faster.
We have analysed the accuracy as a function of the parameters of the FFT. Once
the study of the accuracy was completed, we searched for and obtained the best
configurations of the FFT. As a result, we can offer some guidelines for the
designer.

Bibliography
[1] M.A. Sánchez Mario Garrido, J. Grajal and O. Gustafsson. Pipelined radix-
2k feedforward fft architectures. In IEEE Transactions on Very Large Scale
Integration Systems, pages 1–10, 2011.
[2] T. Ahmed Marrio Garrido and O. Gustafsson. A 512-point 8-parallel pipelined
feedforward fft for wpan. In IEEE Asilomar Conference on Signals, Systems,
and Computers, Pacific Grove,CA (United States), pages 1–4, 2010.
[3] Mario Garrido. Efficient hardware architectures for the computation of the
fft and other related signal processing algorithms in real time. In PhD thesis,
Technical University of Madrid (UPM), 2009.
[4] M. Garrido, Keshab K. Parhi, and J. Grajal. ’A Pipelined FFT Architec-
ture for Real-Valued Signals’. IEEE Transactions on Circuits and Systems I:
Regular Papers, 56(12), December 2009.
[5] Liang Yang, Kewei Zhang, Hongxia Liu, Jin Huang, and Shitan Huang. ’An
efficient locally pipelined FFT processor’. IEEE Transactions on Circuits and
Systems II: Express Briefs, 53(7):585–589, July 2006.
[6] A.M. Despain. ’Fourier Transform Computers Using CORDIC Iterations’.
Computers, IEEE Transactions on, C-23:993–1001, October 1974.
[7] S. He and M. Torkelson. ’Design and implementation of a 1024-point pipeline
FFT processor’. Custom Integrated Circuits Conference, Santa Clara, CA,
pages 131–134, May 1998.
[8] M.A. Sánchez, M. Garrido, M.L. López, and J. Grajal. ’Implementing FFT-
based digital channelized receivers on FPGA platforms’. IEEE Transactions
on Aerospace and Electronic Systems, 44(4):1567–1585, October 2008.
[9] L. Liu, J. Ren, X. Wang, and F. Ye. ’Design of Low-Power, 1GS/s Throughput
FFT Processor for MIMO-OFDM UWB Communication System’. ISCAS
2007, pages 2594–2597, May 2007.
[10] C. Lee and Y. Lin. High-throughput Pipelined FFT Processor. Patent US
20060282764, 14.12. 2006.
67
68 Bibliography
[11] J.A. Johnston. ’Parallel pipeline fast fourier transformer’. Communications,
Radar and Signal Processing, IEE Proceedings F, 130(6):564–572, October
1983.
[12] E.E. Swartzlander, W.K.W. Young, and S.J. Joseph. ’A Radix 4 Delay Com-
mutator for Fast Fourier Transform Processor Implementation’. IEEE Journal
of Solid-State Circuits, 19(5):702–709, October 1984.
[13] Y.-W. Lin and C.-Y. Lee. ’Design of an FFT/IFFT Processor for MIMO
OFDM Systems’. IEEE Transactions on Circuits and Systems I: Regular
Papers, 54(4):807–815, April 2007.
[14] Jing-Yang Jou Cheng-Yeh Wang, Chih-Bin Kuo. Hybrid word-lenght opti-
mization methods of pipeline fft processors. 56(8):1105–1117, August 2007.
[15] M. Garrido, O. Gustafsson, and J. Grajal. Accurate rotations based on coef-
ficient scaling. 58(10):662–666, October 2011.
[16] L. Lundheim. Some quantizing effects in discrete fourier transform compu-
tations. In Proc 5th IEEE Nordic Signal Processing Symposium (NORSIG-
2002), pages 1–6, 2002.
[17] Chia-Wei Chen Pei-Yun Tsai and Meng-Yuan Huang. Automatic ip generation
of fft/ifft processors with word-length optimization for mimo-ofdm systems.
In EURASIP Journal on Advances in Signal Processing, pages 1–15, 2010.
[18] J.W. Cooley and J.W. Tukey. ’An algorithm for the machine calculation of
complex Fourier series’. Math. Comput., 19:297–301, 1965.
[19] A.V.Oppenheim and R.W.Schafer. Discrete-Time Signal Processing. Prentice
Hall, 1989.
[20] G. Szedo, V. Yang, and C. Dick. ’High-performance FFT Processing Us-
ing Reconfigurable Logic’. 35 Asilomar Conference on Signals, Systems and
Computers, pages 1353–1356, 2001.
[21] Sang-Chul Moon and In-Cheol Park. ’Area-efficient memory-based architec-
ture for FFT processing’. Proceedings of the 2003 International Symposium
on Circuits and Systems, ISCAS ’03, 5:V–101 – V–104, May 2003.
[22] G. Zhang and F. Chen. ’Parallel FFT with CORDIC for ultra wide band’.
Personal, Indoor and Mobile Radio Communications, 15th IEEE International
Symposium on, 2:1173–1177, September 2004.
[23] P. B. Denyer. VLSI Signal Processing: A Bit-Serial Approach. Addison-
Wesley, 1985.
[24] H.S. Stone. ’Parallel Processing with the Perfect Shuﬄe’. IEEE Transactions
on Computers, C-20(2):153–161, February 1971.
Bibliography 69
[25] F. Argüello, J.D. Bruguera, R. Doallo, and E.L. Zapata. ’Parallel architecture
for fast transforms with trigonometric kernel’. IEEE Transactions on Parallel
and Distributed Systems, 5(10):1091–1099, October 1994.
[26] S.F. Gorman and J.M. Wills. ’Partial column FFT pipelines’. IEEE Transac-
tions on Circuits and Systems II: Analog and Digital Signal, 42(6):414–423,
June 1995.
[27] Lawrence R. Rabiner and Bernard Gold. Theory and Application of Digital
Signal Processing. Prentice Hall, 1975.
[28] E.H. Wold and A.M. Despain. ’Pipeline and Parallel-Pipeline FFT Processors
for VLSI Implementations’. IEEE Transactions on Computers, (5):414–426,
May 1984.
[29] M. A. Sánchez, M. Garrido, M. L. López, and J. Grajal. ’Implementing the
FFT Algorithm on FPGA Platforms: A Comparative Study of Parallel Archi-
tectures’. XIX International Conference on Design of Circuits and Integrated
Sistems (DCSI 2004), (Bourdeaux, France), November 2004.
[30] M.A. Sánchez, M. Garrido, M.L. López, J. Grajal, and C. López-Barrio. ’Dig-
ital Channelised Receivers on FPGAs Platforms’. 2005 IEEE International
Radar Conference, pages 816–821, May 2005.
[31] A. Cortes, J.F. Sevillano, I. Velez, and A. Irizar. ’An FFT Core for DVB-
T/DVB-H Receivers’. 13th IEEE International Conference on Electronics,
Circuits and Systems, pages 102–105, December 2006.
[32] M. Shin and H. Lee. ’A High-Speed Four-Parallel Radix-24 FFT/IFFT Pro-
cessor for UWB Applications’. IEEE International Symposium on Circuits
and Systems, ISCAS 2008, pages 960–963, May 2008.
[33] Yun-Nan Chang and Keshab K. Parhi. ’An efficient pipelined FFT archi-
tecture’. IEEE Transactions on Circuits and Systems II: Analog and Digital
Signal Processing, 50(6):322–325, June 2003.
[34] Jack E. Volder. ’The CORDIC Trigonometric Computing Technique’. IRE
Trans. on Electronic Computing, September 1959.
[35] M. Garrido and J. Grajal. ’Efficient Memoryless CORDIC for FFT Com-
putation’. IEEE International Conference on Acoustics, Speech, and Signal
Processing, ICASSP’07, 2:II–113 – II–116, April 2007.

