Cognitive Radio on a reconfigurable platform by Walters, K.H.G.
University of Twente
Faculty of Electrical Engineering, Mathematics and
Computer Science
Computer Architecture for Embedded Systems
Cognitive Radio on a reconfigurable
platform
M.Sc. thesis by
Karel H.G. Walters
Supervisor:
dr. ir. André B.J. Kokkeler
Graduation committee:
prof. dr. ir. Gerard J.M. Smit
dr. ir. André B.J. Kokkeler
MSc. Qiwei Zhang
Enschede, The Netherlands, August 2007
Summary
This report was written as part of the MSc project of Karel Walters, which
started in February 2007 and was completed in August of the same year.
During this project a specific sparse FFT, transform decomposition, was
implemented on the Montium as part of the development of an OFDM receiver
for cognitive radio. A cognitive radio uses only certain parts of the
frequency spectrum. Which carriers are used depends on the current usage of
the spectrum, so these carriers vary and may well be few in number. In such
a scenario calculating a complete FFT is a waste of computational power and
bus/NoC bandwidth. For that purpose transform decomposition is used, in
which only a certain number of samples within an FFT is calculated, thus
saving resources. This report shows how this algorithm is implemented and
how much is actually saved by this implementation. It also shows a way to
reconfigure the Montium at run time, using a radix-2 FFT. The last chapter
demonstrates all of this on the BCVP platform, together with a user
interface implemented on a PC.
Contents
1 Introduction
2 Cognitive Radio
2.1 Introduction
2.2 OFDM
3 Discrete Fourier Transform
3.1 Introduction
3.2 Fast Fourier Transform
3.2.1 Butterfly
3.3 Transform decomposition
4 Montium tile processor
4.1 Introduction
4.2 Coarse grain reconfiguration
4.3 Architecture
4.3.1 Sequencer
4.3.2 AGU
4.3.3 Interconnect
4.3.4 ALU
4.3.5 Hydra
4.4 Application Development
4.4.1 CDL Programming Language
4.4.2 Simulator
4.4.3 Matlab
5 Related Work
5.1 Introduction
5.2 Goertzel
5.3 Pruning
6 Problem description
7 Algorithm Mapping
7.1 Introduction
7.2 Fast Fourier Transform
7.2.1 Complex multiplication
7.2.2 Input stage
7.2.3 Middle stages
7.2.4 Last stage
7.3 Transform decomposition
7.3.1 Input mapping
7.3.2 Recombination stage
7.4 Type conversion
7.5 Run time reconfiguration
7.5.1 FFT
7.5.2 Transform decomposition
7.6 Results
7.6.1 Reconfiguration
7.7 Adjacent sub-carriers
7.8 Conclusion
8 BCVP
8.1 Introduction
8.2 Differences in relation to the simulator
8.3 Implementation
8.3.1 BasOS
8.3.2 libusb
8.3.3 Trolltech Qt
8.4 Results
8.5 Conclusions
8.6 Further work
9 Conclusion and Recommendations
9.1 Conclusion
9.2 Recommendations
9.2.1 Reconfiguration
9.2.2 Programming
Bibliography
A Pseudo-code
B The Rough Guide to run-time reconfigurable code for the Montium
Acknowledgments
Chapter 1
Introduction
This thesis report was written as part of the master assignment of Karel
H.G. Walters. The assignment was carried out within the AAF project at the
CAES¹ chair, in which cognitive radio is researched as a solution for the
scarce radio spectrum available for data transmission [1]. Within
cognitive radio transceivers the FFT and IFFT are the most computationally
intensive components. A sparse FFT/IFFT that reduces this computational
load would therefore be a great improvement. This thesis describes the
mapping of a specific sparse FFT algorithm onto the Montium tile processor
architecture. The Montium tile processor is a low-power processor with a
lot of computational power. Chapter 3 describes the FFT that is normally
implemented as well as the sparse FFT algorithm. Chapter 4 describes the
Montium tile processor and the available development tools. Two other
algorithms are described in chapter 5 as related work. Chapter 6 gives the
problem description, in which the main assignment is described. Chapters 3
and 4 serve primarily as a reference and provide background information
for chapter 7, in which the mapping of the actual algorithm is described.
Chapter 7 also concludes whether or not the implementation is a success.
Chapter 8 shows an implementation on the BCVP platform and a user
interface implemented on a PC, which mainly serves as a technology
demonstration. This report ends with recommendations for further
development of the Montium tile processor in chapter 9.
¹ Computer Architecture for Embedded Systems
Chapter 2
Cognitive Radio
Data communication requires a significant amount of radio spectrum, while
spectrum is scarce. Today's approach divides the spectrum into small
pieces, each for a specific purpose. The applications use their spectrum
only to a limited extent. This leads to the unwanted situation of
under-utilization of this scarce public resource. Regulatory bodies (e.g.
the FCC) recognise that this approach is nearing its end, while radio-based
communication grows constantly.
2.1 Introduction
The FCC actively pursues Cognitive Radio as a new paradigm in spectrum
utilization. A Cognitive Radio utilizes empty spectrum. It searches for
under-utilised spectrum and adapts its transmission without interfering
with other users. As spectrum use is by definition dynamic, it adapts very
rapidly to changes in spectral usage.
Adaptation to spectral utilisation can be done in multiple ways. Primarily
the frequency domain, but also the time and space domains, are means to
this end. Adaptive loading of carriers in combination with adaptive power
control can further optimise the communication. Finally, interference
cancellation and multi-user detection can be added. Cognitive Radio is a
new technology with open issues in both technology and legislation. The
FCC investigates rule-making aimed at Cognitive Radio. In Europe,
initiatives in this direction are still at a very early stage.
The AAF project [1] researches how adaptive OFDM, with its huge freedom,
is best applied, considering interference, system stability, capacity and
complexity. The project also addresses meta-communications, including
neighbour discovery with beacons that fulfill the Cognitive Radio rules,
and the communication of transmission parameters.
2.2 OFDM
Frequency division multiplexing (FDM) is a technology that transmits
multiple signals simultaneously over a single transmission path, such as a
cable or a wireless system. Each signal travels within its own unique
frequency range (carrier), which is modulated by the data (text, voice,
video, etc.).
Orthogonal FDM (OFDM) is a spread spectrum technique that distributes the
data over a large number of carriers that are spaced apart at precise
frequencies. This spacing provides the "orthogonality" in this technique,
which prevents the demodulators from seeing frequencies other than their
own. The benefits of OFDM are high spectral efficiency, resiliency to RF
interference, and lower multi-path distortion.
One of the major components in OFDM is the FFT/IFFT (chapter 3), as seen
in figure 2.1, which depicts an ideal OFDM receiver. The FFT accounts for
the major part of the computational complexity of the OFDM scheme. This is
why OFDM for cognitive radio could benefit so much from a sparse FFT. For
more information on OFDM, [2] and [3] are good starting points.
Figure 2.1: Diagram of an ideal OFDM receiver
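The orthogonality mentioned above can be illustrated with a small numerical sketch. It is not taken from the thesis: the helper names `subcarrier` and `inner_product` and the choice of 8 samples per symbol are assumptions made for illustration only. Two carriers whose frequencies are different integer multiples of the symbol rate have an inner product of zero over one symbol, which is why a demodulator tuned to one carrier does not see the others.

```python
import cmath

def subcarrier(k, n_points):
    """Samples of the k-th complex exponential carrier over one symbol."""
    return [cmath.exp(2j * cmath.pi * k * n / n_points) for n in range(n_points)]

def inner_product(a, b):
    """Correlation of two sampled carriers."""
    return sum(x * y.conjugate() for x, y in zip(a, b))

N = 8  # illustrative symbol length, not a value from the thesis
same = inner_product(subcarrier(3, N), subcarrier(3, N))   # energy of carrier 3
other = inner_product(subcarrier(3, N), subcarrier(5, N))  # leakage from carrier 5
```

Here `same` evaluates to the symbol length and `other` is zero up to rounding error.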
Chapter 3
Discrete Fourier Transform
3.1 Introduction
The discrete Fourier transform is a mathematical procedure to determine
the harmonic, or frequency, content of a discrete signal in the time
domain. This is exactly what is done in the OFDM scheme and it is
therefore the field of interest. This chapter describes the specific
implementation of the FFT that is most commonly used on the Montium
platform. It also describes a more specialised implementation that can
reduce the computational intensity of the algorithm. This chapter does not
give a complete overview or the mathematical background of the Fourier
transform. For a complete explanation of the DFT and its FFT derivative,
see chapters 3 and 4 of [4], which give an easy to understand explanation.
This chapter gives only a limited insight into the different aspects of
the FFT.
3.2 Fast Fourier Transform
The fast Fourier transform or FFT for short is the most common way for
computers to transform a set of input values from the time domain to the
frequency domain. Equation 3.1 shows the DFT equation, in which W is
the so called twiddle factor by which the input values are multiplied.
\[
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad k = 0, 1, \ldots, N-1,
\qquad W_N^{nk} = e^{-j 2\pi nk / N}
\tag{3.1}
\]
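As a reference point for the rest of this chapter, the DFT definition can be evaluated directly. The following is a minimal sketch, assuming the usual twiddle factor with a factor 2π in the exponent; the function names `twiddle` and `dft` are illustrative and do not come from the thesis. It needs N complex multiplications per output, which is the cost the FFT reduces.

```python
import cmath

def twiddle(n_points, exponent):
    """Twiddle factor W_N^exponent = exp(-j*2*pi*exponent/N)."""
    return cmath.exp(-2j * cmath.pi * exponent / n_points)

def dft(x):
    """Direct evaluation of the DFT sum: N multiplications per output."""
    n_points = len(x)
    return [sum(x[n] * twiddle(n_points, n * k) for n in range(n_points))
            for k in range(n_points)]

impulse_spectrum = dft([1, 0, 0, 0])   # a unit impulse has a flat spectrum
constant_spectrum = dft([1, 1, 1, 1])  # a constant signal only fills bin 0
```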
Although different implementations exist, the most common is the divide
and conquer method first introduced by Cooley and Tukey, described in [5].
Its main benefit is the reduction of the number of complex multiplications
and additions needed to obtain the result of equation 3.1. Implementations
are often optimized towards a certain platform or to suit a certain type
of input values. The FFT currently implemented on the Montium is the
power-of-two, decimation-in-time, radix-2 algorithm. These terms are
explained later in this chapter. That this type of FFT is used does not
mean that it is impossible to implement another, but since most
applications can make use of this FFT and it is reasonably easy to
implement, it is the better candidate.
Power of two
In an N-point FFT, N stands for the number of points in the FFT. The term
power of two restricts N to a power of two. The power-of-two type is by
far the most common type of FFT currently used because of its efficiency
and simplicity. Other types of FFTs are the Prime Factor Algorithm and the
Winograd Fourier Transform. Although these often require fewer operations,
overall this does not warrant the extra complexity that is introduced.
Radix
The radix specifies into how many parts the FFT is split in every stage.
With a radix-2 type the DFT is split in two in every stage; with radix-4
it is split into four. The benefit of a radix-4 algorithm is that fewer
multiplications need to be executed; the downside is that more inputs are
needed, which puts further constraints on the hardware. Mixed variations
also exist, called split-radix, which often combine radix-2 and radix-4
algorithms. The graphical representation of a radix-2 structure is given
in figure 3.1. Both structures in that figure show the so-called
butterflies of the multiplications and additions that take place.
Decimation in time
The decimation explains in which direction the FFT is split up: either
from the frequency domain side or from the time domain side. The
difference can easily be seen in figure 3.1, in which both are depicted
side by side. Figure 3.1 can be a bit misleading, because inputs and
outputs can be bit reversed, in which case the butterflies are ordered
differently. Whether such an FFT is DIT or DIF can be deduced from the
twiddle factors: if the largest number of different twiddle factors is at
the start of the FFT it is DIF, otherwise it is DIT.
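The terms power of two, radix-2 and decimation in time can be made concrete with a short recursive sketch. This is a textbook formulation written for illustration, not the in-place implementation used on the Montium, and the function name `fft_dit` is an assumption. The input is split into even and odd time samples (decimation in time), each half is transformed, and the halves are combined with one butterfly per output pair.

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return list(x)
    even = fft_dit(x[0::2])  # transform of the even-indexed time samples
    odd = fft_dit(x[1::2])   # transform of the odd-indexed time samples
    half = n_points // 2
    out = [0j] * n_points
    for k in range(half):
        # One complex multiplication serves two outputs (the butterfly).
        t = cmath.exp(-2j * cmath.pi * k / n_points) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out

spectrum = fft_dit([1, 2, 3, 4, 0, 0, 0, 0])
```

In this recursive form the bit reversal is hidden in the repeated even/odd slicing; an in-place implementation has to reorder its inputs or outputs explicitly.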
Bit reversal
The output values in figure 3.1 are in bit-reversed order. This is caused
by the algorithm and there is not much that can be done about it. Bit
reversal is a simple process and is not much of a problem in current
hardware and/or software. For example in figure 3.1, where the number of
input points is 8, the bit reversal scheme is as in table 3.1. It is
important to notice the number of bits that is reversed. For different
implementations the number of bits that need to be reversed can differ, as
will be shown in chapter 7.
Figure 3.1: Decimation in time on the left side and decimation in
frequency on the right
In-order index | In-order index in binary | Bit-reversed in binary | Bit-reversed index
0 | 000 | 000 | 0
1 | 001 | 100 | 4
2 | 010 | 010 | 2
3 | 011 | 110 | 6
4 | 100 | 001 | 1
5 | 101 | 101 | 5
6 | 110 | 011 | 3
7 | 111 | 111 | 7
Table 3.1: Bit reversal example with 3 bits
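The mapping in table 3.1 can be reproduced with a few lines of code. The sketch below is illustrative only; the function name `bit_reverse` is an assumption. The number of bits is passed explicitly because, as noted above, it can differ between implementations.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index

# Reproduces the last column of table 3.1 for an 8-point FFT (3 bits).
table = [bit_reverse(i, 3) for i in range(8)]  # [0, 4, 2, 6, 1, 5, 3, 7]
```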
3.2.1 Butterfly
Figure 3.3 shows the butterfly calculations needed to perform an 8-point
FFT. It is built up from smaller 2-point FFT butterflies. A single 2-point
FFT butterfly looks like the one in figure 3.2. The twiddle factors are
the multiplication factors in each butterfly calculation. Equation 3.1
shows how to determine the twiddle factors. An important property of the
twiddle factors can be seen in figure 3.4: the twiddle factors overlap
each other, so the twiddle factors for an 8-point FFT can also be used for
a 4-point FFT. The twiddle factors are related in another way as well:
\[
W_N^{k+N/2} = W_N^{k}\, W_N^{N/2} = W_N^{k}\, e^{-j 2\pi (N/2)/N}
= W_N^{k}\,(-1) = -W_N^{k}
\tag{3.2}
\]
From equation 3.2 it can be seen that, for instance, $W_8^3$ and $W_8^7$
are the same except for the sign. This is also where computational
intensity is reduced in the DFT. This reduction cuts the required complex
multiplications in half, and the number of twiddle factors needed is also
cut in half. This means that a single butterfly needs only one complex
multiplication together with an addition and a subtraction.
Figure 3.2: Butterfly calculation
Figure 3.3: Butterfly calculations for an 8-point FFT together with the
required twiddle factors
Figure 3.4: Twiddle factors for an 8-point FFT in the complex plane.
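Both twiddle factor properties described in this section can be checked numerically. The sketch below is an illustration under the usual definition of the twiddle factor; the names `twiddle` and `butterfly` are assumptions. It checks the sign symmetry of equation 3.2 for N = 8, checks that every second 8-point factor equals a 4-point factor, and shows a butterfly that reuses a single product for both outputs.

```python
import cmath

def twiddle(n_points, k):
    """Twiddle factor W_N^k = exp(-j*2*pi*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / n_points)

def butterfly(a, b, w):
    """Radix-2 butterfly: one complex multiplication, one add, one subtract."""
    t = w * b
    return a + t, a - t

sign_flip_error = abs(twiddle(8, 7) + twiddle(8, 3))  # W_8^7 = -W_8^3
reuse_error = abs(twiddle(8, 2) - twiddle(4, 1))      # W_8^2 = W_4^1
upper, lower = butterfly(1 + 0j, 1 + 0j, twiddle(8, 0))
```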
3.3 Transform decomposition
The transform decomposition algorithm was first described in [6] and an
optimisation was given in [7]. The algorithm can be used when it is known
beforehand that certain input or output values will be zero. In a normal
FFT these values would still be part of the transform, since an FFT takes
all input values and calculates all butterflies. The calculations leading
to a zero in the output can be considered useless. That is where transform
decomposition provides a solution. A graphical representation of the
following is given in figure 3.5. Transform decomposition is described in
[7] as follows. Starting from the DFT as described in equation 3.1, $L$
represents the number of non-zero outputs. Let $N$ be factorized into two
integers $N_1$ and $N_2$, so $N = N_1 N_2$. The index $n$ can be written
as:
\[
n = N_2 n_1 + n_2, \qquad n_1 = 0, 1, \ldots, N_1 - 1, \qquad
n_2 = 0, 1, \ldots, N_2 - 1
\tag{3.3}
\]
Then n in 3.1 and 3.3 is substituted and the DFT can be rewritten as:
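\[
\begin{aligned}
X(k) &= \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1}
        x(N_2 n_1 + n_2)\, W_N^{(N_2 n_1 + n_2)k} \\
     &= \sum_{n_2=0}^{N_2-1} \left[ \sum_{n_1=0}^{N_1-1}
        x(N_2 n_1 + n_2)\, W_N^{N_2 n_1 k} \right] W_N^{n_2 k}
\end{aligned}
\tag{3.4}
\]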
Defining:
\[
X_{n_2}(\langle k \rangle_{N_1})
= \sum_{n_1=0}^{N_1-1} x(N_2 n_1 + n_2)\, W_{N_1}^{n_1 k}
= \sum_{n_1=0}^{N_1-1} x_{n_2}(n_1)\, W_{N_1}^{n_1 k}
\tag{3.5}
\]
in which 〈〉N1 denotes modulo N1. So 3.4 can be written as:
\[
X(k) = \sum_{n_2=0}^{N_2-1} X_{n_2}(\langle k \rangle_{N_1})\, W_N^{n_2 k}
\tag{3.6}
\]
Now the original $N$-point DFT with $L$ non-zero outputs is decomposed
into two major parts: the $N_2$ DFTs of $N_1$ points each in 3.5, which
can be implemented as $N_2$ $N_1$-point FFTs, and the multiplications and
recombinations in 3.6. The reduction in computational intensity comes from
the fact that 3.6 only has to be evaluated for the $L$ values of $k$ that
correspond to non-zero outputs. Therefore only $L$ twiddle factors are
multiplied with each $X_{n_2}(\langle k \rangle_{N_1})$ for
$n_2 = 0, 1, \ldots, N_2 - 1$.
The values of the variables $N_1$ and $N_2$ are selected as follows:
\[
N_1 = \lceil L \rceil, \qquad N_2 = \frac{N}{N_1}
\]
in which $\lceil L \rceil$ denotes $L$ rounded upwards to the closest
power of two. During the input mapping, the normal order of the inputs is
split over the different FFTs: $x(0)$ translates to $x_0(0)$, $x(1)$ to
$x_1(0)$ and so on. These FFTs are then calculated as normal. In the last
part of the algorithm they are put together again: for every required
index $k$ the corresponding FFT outputs are selected, multiplied with
their respective twiddle factors $W_N^{n_2 k}$ and added together. The
index into the FFT outputs is calculated as $k \bmod N_1$.
Figure 3.5: The different stages of Transform decomposition
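The three steps described above (input mapping, small transforms, recombination) can be sketched as follows. This is a minimal illustration of equations 3.5 and 3.6 and not the Montium implementation: the names `dft` and `transform_decomposition` are assumptions, and the small transforms are evaluated with a direct DFT instead of an FFT to keep the example short.

```python
import cmath

def dft(x):
    """Direct DFT, standing in here for the N1-point FFTs."""
    n_points = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points)
                for n in range(n_points)) for k in range(n_points)]

def transform_decomposition(x, wanted, n1):
    """Compute X(k) only for the indices in `wanted`, with N = n1 * N2."""
    n_points = len(x)
    n2_count = n_points // n1
    # Input mapping: x_{n2}(n1) = x(N2*n1 + n2), i.e. every N2-th sample.
    sub = [dft(x[n2::n2_count]) for n2 in range(n2_count)]  # equation 3.5
    result = {}
    for k in wanted:                                        # equation 3.6
        result[k] = sum(sub[n2][k % n1] *
                        cmath.exp(-2j * cmath.pi * n2 * k / n_points)
                        for n2 in range(n2_count))
    return result

samples = [complex(n, n % 3) for n in range(16)]
# L = 3 wanted outputs, so N1 = 4 (next power of two) and N2 = 16 / 4 = 4.
partial = transform_decomposition(samples, wanted=[3, 5, 10], n1=4)
```

The entries of `partial` agree with the same bins of a full 16-point DFT of `samples`, while the recombination loop runs only once per wanted output.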
Chapter 4
Montium tile processor
4.1 Introduction
The Montium tile processor is an embedded processor primarily used for
streaming applications. It was developed as part of the Ph.D. thesis of
Paul Heysters [8]. It is part of a system on chip template called
Chameleon, in which several tiles of different types are combined in a
network on chip. Currently the Montium processor only exists as a VHDL
description and has been implemented on several FPGAs. This chapter
describes the aspects of the Montium that play a major role in the
implementation of the sparse FFT. It is foremost a reference: it provides
background information needed to understand the following chapters.
4.2 Coarse grain reconfiguration
Coarse grain reconfigurability is a term that addresses the amount of
reconfigurability the hardware has. This ranges from a completely rigid
architecture, typically ASIC-like, to a completely flexible one; GPPs
reside in the latter area. The Montium architecture is coarse grain
reconfigurable. The Montium tile processor is a domain specific
accelerator for the Chameleon System-on-Chip template. The template
consists of several different tiles connected through a Network-on-Chip.
An instantiation of such a Chameleon template looks like the one in
figure 4.1.
4.3 Architecture
The Montium consists of 5 ALUs connected to 10 memories through an
interconnect network, see figure 4.2. Through east-west connections, the
ALUs are also connected to each other. The Montium is characterised by its
low power consumption and high efficiency. For example, it is capable of
doing a complex multiplication in a single clock cycle. The Montium is a
16-bit fixed-point architecture.
Figure 4.1: An instantiation of the Chameleon template
Figure 4.2: The Montium tile including Hydra
4.3.1 Sequencer
The Montium is controlled through the sequencer. The sequencer implements
a state machine that determines the instructions for the different
components of the Montium. A program consists of one or more states that
are repeated. The sequencer takes care of this process. The number of
sequencer instructions is limited but their diversity is enormous. This
diversity comes from the fact that all the hardware components can be
configured in different ways, combined in different combinations and then
used in the sequencer.
4.3.2 AGU
The AGU (address generation unit) of the Montium architecture acts like a
pointer in C. The AGU points to the location in the memory from which can
be read or to which can be written. Like a C pointer, the AGU can be
assigned to a single location in the memory, agu p1m1 = 1¹, or it can be
altered by normal operators. The limiting factor is the number of
different AGU instructions. Each memory has its own AGU and each AGU can
have only 8 different instructions. An instruction can be repeated almost
indefinitely, and with operators like ++ it is easy to reach all memory
addresses with only 1 instruction.
An AGU also allows for more complex instructions. For example:
agu p1m1 +=1 & 3 <-> 2 |= 2
This means that the AGU of memory p1m1 is increased by 1, masked with 3,
bit reversed over 2 bits, and that its base address is 2. If this AGU is
initialised with 0, the new AGU value points to 4: the initial value is 0,
adding 1 gives 1, masking with 3 leaves 1, and bit reversing this over 2
bits gives 2. The base address was already set to 2, which combined with
the previous outcome leads to 4.
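The worked example can be followed step by step in a small model. This is only a sketch of the arithmetic described in the text, not a description of the actual AGU hardware or of CDL semantics: the function `agu_step`, its parameter names, the assumption that the base address is added to the bit-reversed offset, and the assumption that the offset is stored before bit reversal are all hypothetical and chosen so that the example reproduces the value 4 from the text.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def agu_step(offset, increment, mask, reverse_bits, base):
    """Return (new_offset, address) for one hypothetical AGU update."""
    new_offset = (offset + increment) & mask                 # += 1, then & 3
    address = base + bit_reverse(new_offset, reverse_bits)   # <-> 2, base 2
    return new_offset, address

# Example from the text: start at 0, add 1, mask 3, reverse 2 bits, base 2.
offset, address = agu_step(0, increment=1, mask=3, reverse_bits=2, base=2)
```

With these assumptions `address` is 4, matching the outcome described above.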
Another feature of the AGU is that it can get the value from the output of
an ALU. This means that any calculation from the ALU can be used as an
input of the AGU. This is mainly used for look-up-table purposes.
4.3.3 Interconnect
The interconnect is the network which connects the memories to the
registers of the ALUs. It is a transport network which can be configured
in different ways to accommodate the source and destination that are
needed.
4.3.4 ALU
The ALU of the Montium allows for many different calculations on the five
possible input values. The ALU consists of two parts. The first part can
make different combinations of four different functional units. The
functional units can do binary operations as well as boolean operations.
Their result can then be fed into a multiplier, which resides in the
second part of the ALU. The adder and subtractor are also in this part.
Next to the initial inputs, the ALU can take an input from the east input
connection and provide an output to the west output. For a more
comprehensive description of the ALU please read section 5.3 of [8].
¹ The AGU of memory p1m1 points to memory location 1
Figure 4.3: An ALU of the Montium. The functional units are all
capable of various logical operations. For a complete listing please
refer to [8]
4.3.5 Hydra
The Hydra provides the connection to the outside world. It is the only
means by which input- and configuration-data can reach the Montium and
output data can leave the Montium. Therefore it is also the only way
to configure the Montium. To distinguish between data, addresses and
configuration-data all the values are tagged. The tag together with the data
is called a flit. For further information on these flits and their make-up,
please read [9].
4.4 Application Development
Application development for the Montium is done on a normal PC, from which
the compiled code can be fed into either the simulator or an actual
Montium. Most of the development is done using the simulator since that is
the fastest. The compiled code can also be read by a tool called the
Montium Configuration tool. This tool is a graphical front end to all the
configuration bits in the Montium. It allows for bit-wise configuration,
which is a very time consuming job that should be avoided if possible. It
is however the only tool that makes run-time reconfiguration of the
Montium possible and is therefore a necessity.
4.4.1 CDL Programming Language
The CDL programming language is the basic assembler-like language in which
the Montium is programmed. There is currently no C or other higher level
language available. Although CDL is assembler-like, there are some major
differences. For example, clock cycles are explicitly declared, which
makes it possible to define explicitly parallel calculations. The CDL
language comes with a comprehensive pre-processor. It allows for basic
structures like for loops, which make it easier to generate the Montium
code. For example, when 100 clock cycles need to be programmed it is
sufficient to program 1 in a for loop and let the pre-processor generate
the other 99. For a complete overview of the CDL language and its
capabilities please read [10].
4.4.2 Simulator
There is a simulator available for the Montium. It is capable of
simulating the Montium including or excluding the Hydra. For input and
output it uses files. These files contain the configuration data as well
as the input data. All the current register values of the Montium can be
read in a tree-like fashion, similar to the way directories are browsed at
a *nix or old DOS prompt. This makes it difficult to view certain values
side by side, since quite a few directory changes could be needed. The
simulator can single-step through the generated code cycle by cycle, but
there is no way of going back or of breaking at a certain point. For more
information on the simulator please read [11].
At the time of writing this thesis a second simulator is in development
that does have these features. Although it does not show as many
parameters of the Montium, it is far friendlier to use if only the
functionality of the algorithm needs to be tested. It has a graphical Java
front end, can step forward and backward, and allows for a single
breakpoint in the code.
4.4.3 Matlab
Matlab is used as a tool to format the input files and read the output
files. When the Hydra is not simulated, the compiled code is read directly
by the simulator and Matlab's only task is to format the input data in the
correct order. When the Hydra is part of the simulation, Matlab packs the
configuration data together with the input data and tags all the
configuration and input data according to the specification [9].
Chapter 5
Related Work
5.1 Introduction
This chapter describes two other algorithms that improve the FFT's
performance. These do not map that well onto the Montium platform but are
worth mentioning, because other hardware platforms might be available that
can use these algorithms. Most of these methods were developed with an x86
or similar architecture in mind, in which operations are done
sequentially. The methods are described briefly, with a short explanation
of why they do not map well onto the Montium.
5.2 Goertzel
The Goertzel algorithm is primarily used as a tone detector, which means
it is used to detect whether or not a single frequency is present within a
certain set of samples. The most well known example is the touch tone
(DTMF) dial phone. A specific set of frequencies that is known beforehand
can be detected in a set of samples, and based upon those frequencies it
can be determined which number was pressed.
Algorithm
The Goertzel algorithm is implemented as a 2nd order IIR filter, depicted
in figure 5.1, with the following z-domain transfer function:
\[
H_G(z) = \frac{Y(z)}{X(z)}
= \frac{1 - e^{-j 2\pi m/N} z^{-1}}{1 - 2\cos(2\pi m/N)\, z^{-1} + z^{-2}}
\]
Figure 5.1: Flowgraph showing IIR filter implementation of the Goertzel
algorithm
This might seem a good alternative, since only N + 2 complex
multiplications (N passes through the filter plus the two delays),
together with some additions/subtractions, are needed to get the
corresponding power. This can also be seen from the following pseudo-code.
// frequency is the normalised target frequency m/N
coeff = 2*cos(2*PI*frequency);
s_prev = 0;
s_prev2 = 0;
for each sample, x[n],
    s = x[n] + coeff*s_prev - s_prev2;
    s_prev2 = s_prev;
    s_prev = s;
end
power = s_prev2*s_prev2 + s_prev*s_prev - coeff*s_prev2*s_prev;
The main drawback is now also more apparent. As long as the number of
detection frequencies (M) is small, this algorithm has a great advantage
over others. When M becomes larger, the number of operations rises fast.
Often the break-even point is already reached at
$M < \frac{5}{6} \log_2 N$.
More on the algorithm can be found in section 13.17 of [4] and in chapter
6 of [12].
From a practical point of view this could be implemented on the Montium
without many problems.
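The pseudo-code of this section can be turned into a runnable sketch and compared against a directly computed DFT bin. This is an illustration in floating point for real-valued input, not a Montium mapping; the function names `goertzel_power` and `dft_bin_power` and the 16-sample test tone are assumptions made for the example.

```python
import cmath
import math

def goertzel_power(samples, m):
    """Power of DFT bin m of real-valued `samples` via the Goertzel recursion."""
    n_points = len(samples)
    coeff = 2.0 * math.cos(2.0 * math.pi * m / n_points)
    s_prev = 0.0
    s_prev2 = 0.0
    for sample in samples:
        s = sample + coeff * s_prev - s_prev2  # second-order recursion
        s_prev2 = s_prev
        s_prev = s
    return s_prev2 * s_prev2 + s_prev * s_prev - coeff * s_prev2 * s_prev

def dft_bin_power(samples, m):
    """Reference: squared magnitude of one directly computed DFT bin."""
    n_points = len(samples)
    value = sum(x * cmath.exp(-2j * cmath.pi * m * n / n_points)
                for n, x in enumerate(samples))
    return abs(value) ** 2

# A cosine that falls exactly on bin 3 of a 16-point transform.
tone = [math.cos(2.0 * math.pi * 3 * n / 16) for n in range(16)]
power_at_tone = goertzel_power(tone, 3)
power_elsewhere = goertzel_power(tone, 5)
```

Each additional detection frequency requires another full pass over the samples, which is the cost that grows with M in the discussion above.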
5.3 Pruning
Pruning algorithms filter out unnecessary multiplications during all the
stages of an FFT. When only a selection of output points is needed, it is
predetermined which butterflies in all the stages actually contribute to
those output points. The butterflies that do not contribute are not
calculated.
Algorithm
Most implementations are based on the fact that it is known beforehand
which calculations need to be done. This information is stored in a matrix
of dimensions N x log2 N, in which each element is either a 1 or a 0. Each
zero represents a multiplication that can be dropped and each one
represents a multiplication that needs to be done. Other implementations
exist which optimize for calculation order and storage space, but all of
them come down to the same thing.
The drawback for an implementation on the Montium platform is the memory
space such a matrix requires and the irregularity it introduces.
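How such a matrix could be derived is sketched below. The sketch is hypothetical and not taken from any of the cited implementations: it assumes an in-place radix-2 FFT with natural-order outputs in which the butterfly partner of node i in stage s is i XOR 2^s, and it marks the nodes whose values are needed rather than individual multiplications. The name `pruning_matrix` is an assumption.

```python
def pruning_matrix(n_points, wanted_outputs):
    """N x log2(N) matrix: entry [i][s] is 1 if node i of stage s is needed."""
    stages = n_points.bit_length() - 1
    matrix = [[0] * stages for _ in range(n_points)]
    needed = set(wanted_outputs)
    # Walk from the last stage back to the first.
    for stage in reversed(range(stages)):
        for node in needed:
            matrix[node][stage] = 1
        span = 1 << stage
        # Each stage output depends on itself and its butterfly partner.
        needed = needed | {node ^ span for node in needed}
    return matrix

matrix = pruning_matrix(8, [5])
per_stage = [sum(row[s] for row in matrix) for s in range(3)]  # [4, 2, 1]
```

For a single wanted output of an 8-point FFT only 7 of the 24 stage nodes are marked, but the pattern of ones changes with every choice of outputs, which illustrates the irregularity mentioned above.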
Chapter 6
Problem description
The main assignment of this thesis is to map the Transform Decomposition
algorithm (explained in section 3.3) onto the Montium (described in
chapter 4). The reasoning behind this mapping is that the Montium is well
suited to executing complex multiplications. That is one of the reasons
the normal FFT maps so well onto the Montium platform. It is therefore
expected that an algorithm such as transform decomposition, which also
relies mainly on complex multiplications, maps well onto the Montium
platform. Within cognitive radio, the position of the sub-carriers can
change as well as the number of sub-carriers. These two factors can be
dealt with by the algorithm, but they are expected to influence the
performance of the Montium, since they require some additional control,
which is something the Montium is not really equipped to deal with.
However, the Montium has always been promoted as being reconfigurable and
should be capable of dealing with this issue.
Mapping the algorithm is not a straightforward operation because of the
limited number of instructions that can be used in the Montium. The
algorithm has some irregularities which are particularly difficult to
implement on the Montium and which involve changing the number and
position of the sub-carriers. This is one of the issues that needs to be
resolved, and its influence on performance needs to be determined.
Changing the number and position of the sub-carriers at run time puts
further strain on mapping the algorithm. Although the Montium is capable
of run-time reconfiguration, there are no easy-to-use tools at hand that
support this feature. It is however one of the requirements to fully
support the algorithm in a cognitive radio context. Run-time
reconfiguration and the algorithm might put a strain on the memory usage
and therefore limit the size of the FFT that can be done.
Chapter 7
Algorithm Mapping
7.1 Introduction
This chapter describes the different aspects of the algorithm and how they
are implemented on the Montium platform. First the normal FFT
implementation is explained, together with how certain features of the
Montium architecture are exploited. The sparse FFT makes use of the normal
FFT, as explained earlier in chapter 3. Section 7.3 explains how the
additions to the normal FFT are implemented.
7.2 Fast Fourier Transform
The FFT normally implemented on a Montium platform is the decimation in
time fast Fourier transform. The FFT on the Montium can be split up into
three different parts: the input stage, the middle stages and the output
stage. The reason for this is that all the middle stages are the same
and only the input and output stages differ from them. Each of these
stages can in turn be split up into three parts: the retrieval of the
data, the retrieval of the twiddle factors and the storage of the
results. It is important to keep in mind that for any Montium
implementation the number of different instructions needs to be
minimized. This keeps the amount of required sequencer memory to a
minimum. The following sections describe an 8 point FFT example on the
Montium platform. The imaginary part of the values is not always
mentioned, to keep the explanation simple.
7.2.1 Complex multiplication
The butterfly itself requires a complex multiplication together with a sub-
traction and addition. From figure 7.2 it can be seen that the outcome of the
multiplication can be used twice, within the subtraction and addition parts.
Figure 7.1: All the butterflies and stages needed for an 8 point FFT as
implemented on the Montium
The multiplication itself is complex so it uses four normal multiplications
together with an addition and subtraction as can be seen in equation 7.1.
Figure 7.2: Butterfly calculation
$$
\begin{aligned}
W &= c + id \\
y &= a + ib \\
(a + ib)(c + id) &= ac + ibc + iad - bd \\
&= (ac - bd) + i(ad + bc)
\end{aligned}
\tag{7.1}
$$
Putting together equation 7.1 and figure 7.2 results in the mapping de-
picted in figure 7.3.
Figure 7.3: Butterfly mapped onto the Montium
This results in the following CDL code.
alu p2a1 fmul p2c1 -> p2ws //Wim * Yim
alu p1d1 ssub (p1a1 fmul p1c1 ssub p1es) -> p1o1 //Xre - (Wre * Yre - ALU 2)
alu p1d1 sadd (p1a1 fmul p1c1 ssub p1es) -> p1o2 //Xre + (Wre * Yre - ALU 2)
alu p4a1 fmul p4c1 -> p4ws //Wim * Yre
alu p3d1 ssub (p3a1 fmul p3c1 sadd p3es) -> p3o1 //Xim - (Wre * Yim + ALU 4)
alu p3d1 sadd (p3a1 fmul p3c1 sadd p3es) -> p3o2 //Xim + (Wre * Yim + ALU 4)
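The arithmetic behind this mapping can be checked with a short Python
sketch. It is not Montium code: the function name and the use of Python
complex numbers are assumptions made for illustration, and floating point
is used instead of the 16-bit fixed point of the Montium. It only shows
that four real multiplications plus additions and subtractions produce
both butterfly outputs, in the same order as p1o1/p3o1 (difference) and
p1o2/p3o2 (sum).

```python
def butterfly(x, y, w):
    """Radix-2 butterfly built from four real multiplications.

    Returns (x - w*y, x + w*y), with y = a + ib and w = c + id as in
    equation 7.1.
    """
    a, b = y.real, y.imag
    c, d = w.real, w.imag
    prod_re = a * c - b * d  # real part of w*y: ac - bd
    prod_im = a * d + b * c  # imaginary part of w*y: ad + bc
    diff = complex(x.real - prod_re, x.imag - prod_im)
    summ = complex(x.real + prod_re, x.imag + prod_im)
    return diff, summ


diff, summ = butterfly(1 + 2j, 3 - 1j, 0.5 + 0.5j)
print(diff, summ)
```

The product w*y is computed once and reused for both outputs, which is
the reuse that figure 7.2 points out.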
The complex multiplication and addition can lead to values larger than
the initial input value. Such values would saturate towards the largest
possible value and would therefore not be correct. To prevent saturation
from happening, the output value should be scaled. The Montium CDL
language provides for this with the scale instruction. Accuracy will
decrease significantly¹ when scaling is employed in every stage. This
feature therefore needs to be employed with care.
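The trade-off between saturation and scaling can be illustrated with a
small Python sketch. The helper names are hypothetical and the model is
a simplification: values are plain integers standing for 16-bit
fixed-point numbers, and scaling is modelled as a division by two (an
arithmetic shift right by one bit), in line with the footnote.

```python
Q15_MAX = 2**15 - 1   # largest 16-bit two's complement value
Q15_MIN = -(2**15)    # smallest 16-bit two's complement value


def saturate(value):
    """Clamp a result to the 16-bit range, as a saturating ALU would."""
    return max(Q15_MIN, min(Q15_MAX, value))


def add_saturating(a, b):
    """Add without scaling: an overflowing sum sticks at the limit."""
    return saturate(a + b)


def add_scaled(a, b):
    """Add and halve the sum: it always fits, but one bit is lost."""
    return (a + b) >> 1


a, b = 24576, 16384          # 0.75 and 0.5 in <1,15> format
print(add_saturating(a, b))  # clamped, no longer equal to a + b
print(add_scaled(a, b))      # (0.75 + 0.5) / 2 = 0.625 in <1,15> format
```

With scaling in every stage one bit is lost per stage, which is why the
text recommends using it only where overflow can actually occur.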
7.2.2 Input stage
During the input stage the values are distributed sequentially over four
memories in the Montium, as depicted in figure 7.4. Only two memories
are depicted, but the imaginary part is distributed in the same way.
Figure 7.4: Input value distribution in the Montium for an 8 point
FFT.
Each clock cycle, data input values need to be available on the input
registers of the ALUs. This can be dealt with because the data input
values are in sequence, so an agu p1m1 p3m1++ instruction can be used.
The input memories need to be switched halfway through the first stage.
Not only the data values but also the twiddle factors are needed for the
multiplication. The twiddle factors are stored in the order of the
original calculation, so for an 8-FFT this results in the sequence
$W_8^0$, $W_8^1$, $W_8^2$ and $W_8^3$. To get the correct twiddle factor
for the multiplication, the AGU of the twiddle factor memory needs to be
set. At first this could seem a bit troublesome, since the twiddle
factors are not stored in the order in which they are needed, except for
the last stage, and every stage requires more and different twiddle
factors. The first stage, however, is rather simple since only a single
twiddle factor is required during the whole stage. The second and later
stages, except for the last stage, are explained in section 7.2.3.
After calculation the results need to be stored. To keep the number of
different instructions to a minimum, the output of this stage should
provide sequentially readable input for the next stage. The agu p1m1
p3m1++ instruction does not have to change in that case.
The first butterfly of the first stage provides input for two
butterflies in the second stage. To make the sequential read possible in
the second stage, the first results of the first stage need to be stored
at indices 0 and 1 respectively. The first butterfly provides the input
samples 0 and 4, but in the second stage the first butterfly needs the
input samples 0 and 2 and the second butterfly requires 6 and 4, as
depicted in figure 7.5. This is why input sample 4 is stored at index 1.
The reason for it to be in the other memory and not in the same one,
which might be expected, is that a memory is not capable of storing two
values at the same time: it has no dual-port capability.
¹ Scaling divides the answer by two, which effectively loses one bit of
the answer.
The next butterfly result from stage 1 should be stored not one but two
places further. Storing the values in this order provides for a
sequential read in the next stage. Figure 7.5 shows the locations at
which values are stored for an 8-FFT. Together with figure 7.1 it can be
seen that this storage order makes the sequential read possible. The AGU
instruction is therefore agu p4m1 p2m1+=2. As with reading the values,
the target memory for writing switches halfway through the stage.
Figure 7.5: Output of a stage provides for a sequentially readable
input of the next stage. Only the real part of the complex values is
depicted here.
7.2.3 Middle stages
Stages in the middle of the FFT differ from the first stage because of
the twiddle factors and input ordering. The number of twiddle factors
doubles with each stage and the addressing is not sequential. To
overcome this problem the AGU provides two important features,
bit-reversed addressing and masking, both of which are explained in
section 4.3.2. In the second stage two twiddle factors are needed,
$W_8^0$ and $W_8^2$ respectively. The CDL code in the following example
solves this. The code also makes clear that the real and imaginary
values of the twiddle factors are stored in memories p5m1 and p5m2
respectively.
//init
agu p5m1 p5m2 =0 & mask <-> (log2 N) -1
//next
agu p5m1 p5m2 ++ & mask <-> (log2 N) -1
The bit masking could be neglected in this stage; it is, however, useful
in the last stage. In the second stage of the 8-FFT, incrementing the
AGU address would lead to AGU addresses 00 and 01, but because the
address is bit-reversed over 2 bits² this results in 00 and 10, which
are the correct binary addresses. The only thing to worry about is to
reset the addressing to 0 at the correct time during the stage.
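The effect of bit-reversed addressing can be reproduced with a few lines
of Python. This is only a software model of the address sequence; the
function names are hypothetical and nothing here reflects how the AGU is
implemented in hardware.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reversed_sequence(count, bits):
    """Addresses produced by a counter 0..count-1, bit-reversed over `bits`."""
    return [bit_reverse(i, bits) for i in range(count)]


# Second stage of an 8-FFT: counter values 00 and 01, reversed over 2 bits.
print(reversed_sequence(2, 2))
# A sequential counter reversed over 3 bits, as used for an 8 point output.
print(reversed_sequence(8, 3))
```

The first call yields the addresses 0 and 2 (binary 00 and 10), which
select $W_8^0$ and $W_8^2$ from the sequentially stored twiddle factors.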
The memory is limited on the Montium. Therefore, as explained in
the previous section, it would be preferred to not change the instructions.
However the output of the second stage would be wrong if the input and
output scheme were to be unchanged.
The output from stage 1 is sequentially readable, but the write
procedure used in this stage introduced a side effect. The values at
indices 1 and 3 are switched around, as can be seen in figures 7.5 and
7.6 in combination with the CDL code for the complex multiplication. The
operands for the complex multiplication are always read from the same
registers, but the values are switched around. It is the same as if
a − b were calculated with the values for a and b switched around. To
get the calculation needed, the values need to be switched back. This is
not a serious problem, but it needs to be taken care of.
Figure 7.6: Stage 2 and 3 input
In order to solve this problem, the inputs on the ALUs are switched
around for every other butterfly. This causes the outputs to be switched
around as well, which makes the whole calculation consistent again. The
inputs of the ALUs can easily be switched around by using a different
move instruction. Where, for example, the move would first be
mov p1m1 -> p1c1³, it is now changed to mov p1m1 -> p1d1. Looking at
figure 7.3, this causes the operands to be switched around, which in
turn causes the outputs to be switched around while keeping the ALU
instruction the same. On ALU 1 Xre and Yre are interchanged, and on
ALU 3 Xim and Yim. There are other options to solve this problem, but
almost all of them would introduce more instructions.
² $\log_2(8) - 1 = 2$
³ Moves a value from memory p1m1 to register c1 on ALU 1.
7.2.4 Last stage
Since the FFT algorithm started with a sequential input of data and not
a bit-reversed one, it follows from the explanation in section 3.2 that
the output is in bit-reversed order. The last stage is like any of the
middle stages, with a small difference in the write cycle. By writing
sequentially but bit-reversing the address, all the results end up in
normal sequential order in the memory.
7.3 Transform decomposition
The transform decomposition algorithm can use most of the FFT mapping,
but several additions need to be made for the algorithm to work on the
Montium architecture. From the explanation in section 3.3 it can be seen
that an N1-point FFT needs to be executed N2 times. This means that the
FFT itself needs to be repeated, a recombination stage needs to be added
and the input mapping needs to be changed. A small difference to the
algorithm described in section 3.3 is that N1 needs to be a power of
two. To accomplish this, N1 is determined by rounding L upwards to the
closest power of two. So L = 5 results in an N1 of 8.
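The rounding rule can be written down in a few lines of Python. The
function names are hypothetical, and the sketch assumes that N is itself
a power of two and that N2 follows as N / N1, which matches the 512-FFT
example (8 times a 64-FFT) used later in this chapter.

```python
def next_power_of_two(count):
    """Round `count` upwards to the closest power of two."""
    if count < 1:
        raise ValueError("at least one non-zero output is required")
    return 1 << (count - 1).bit_length()


def decomposition_sizes(n, l):
    """Return (N1, N2) for an N-point transform with L non-zero outputs."""
    n1 = next_power_of_two(l)
    if n1 > n:
        raise ValueError("more outputs requested than the FFT provides")
    return n1, n // n1


print(next_power_of_two(5))
print(decomposition_sizes(512, 56))
```

For L = 5 this gives N1 = 8, and for a 512-FFT with 56 non-zero outputs
it gives N1 = 64 and N2 = 8.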
7.3.1 Input mapping
Since the input is fed through the Hydra interface, only the addresses
of the input data need to be changed to create the correct input
pattern, as shown in figure 7.7. The FFT itself needs to be repeated
with different input values. At first glance it might seem that storing
the input values consecutively is the best way to proceed; however, this
would have a serious impact on the original FFT implementation. The AGU
masks would no longer work when the smaller FFT passes the set mask
value. To overcome this problem, the base addresses of the AGU can be
increased every time the FFT completes. This overcomes the problem of
the masks and has only a minor impact on the original FFT code. The base
address can only be changed in steps of at least 64. This means that for
FFTs of size 128⁴ and smaller the base address always needs to be
changed by 64, although this is inefficient and leaves gaps in the
memory.
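The resulting address ranges can be listed with a short Python sketch.
The function name is hypothetical; the sketch assumes that each memory
holds half of the N1 samples of one FFT and that the base address moves
by exactly 64 between consecutive FFTs, so it only covers FFT sizes up
to 128.

```python
BASE_STEP = 64  # minimum step of the AGU base address, as stated above


def input_ranges(n2, n1):
    """Address range (first, last) per memory for each of the N2 FFTs."""
    per_memory = n1 // 2  # half of the samples of one FFT per memory
    if per_memory > BASE_STEP:
        raise ValueError("sketch only covers FFT sizes up to 128")
    return [(i * BASE_STEP, i * BASE_STEP + per_memory - 1)
            for i in range(n2)]


for first, last in input_ranges(8, 64):
    print(first, "-", last)
```

For 8 times a 64-FFT this gives 0 - 31, 64 - 95 and so on up to
448 - 479, with a gap of 32 unused addresses after every block.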
⁴ Only half of the input values are stored in each memory, see section
7.2.
Figure 7.7: Input mapping for transform decomposition using 2 times
an 8-FFT. The memory addresses are 64 apart from each other for each
FFT within the transform decomposition. Only the real part is
depicted.
7.3.2 Recombination stage
The recombination stage is the last stage in the transform
decomposition. It multiplies all the different parts of the non-zero
outputs with their respective twiddle factors and adds them together.
The complex multiplication with the twiddle factor results in a MAC
operation. Several issues need to be resolved to make this last stage
work. The correct indices need to be chosen from the different FFTs that
were done. The indices needed are also stored in the Montium memory,
although only those of the first FFT. The other indices can be
calculated by making use of the 5th ALU. Figure 7.8 describes this.
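As a floating-point reference for what the recombination stage computes,
the following Python sketch evaluates a transform decomposition and can
be compared against a direct DFT. It is a sketch under assumptions:
section 3.3 is not reproduced here, so the usual formulation is assumed
in which FFT number r operates on the input samples r, r + N2, r + 2 N2
and so on, and output k is the sum over r of $W_N^{rk}$ times output
k mod N1 of FFT r. A naive DFT stands in for the N1-point FFT to keep
the example short, and all names are hypothetical.

```python
import cmath


def dft(x):
    """Naive DFT, used here in place of the N1-point FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n)
                for t in range(n))
            for k in range(n)]


def transform_decomposition(x, n1, wanted):
    """Compute only the outputs listed in `wanted` of the len(x)-point DFT."""
    n = len(x)
    n2 = n // n1
    # N2 short transforms, each over every N2-th input sample.
    subs = [dft(x[r::n2]) for r in range(n2)]
    out = {}
    for k in wanted:
        acc = 0j
        for r in range(n2):
            # One multiply-accumulate per short transform.
            acc += cmath.exp(-2j * cmath.pi * r * k / n) * subs[r][k % n1]
        out[k] = acc
    return out


samples = [complex(i % 5, -(i % 3)) for i in range(16)]
partial = transform_decomposition(samples, 4, [1, 5, 14])
full = dft(samples)
print(all(abs(partial[k] - full[k]) < 1e-9 for k in partial))
```

Each requested output costs N2 multiply-accumulate operations in the
inner loop, which is the part that is mapped onto the MAC operation
described above.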
Figure 7.8: The three clock cycles that are needed to get the correct
values for the recombination stage. During the first clock cycle the
first index is loaded into the register of ALU 5 and calculated. During
the second clock cycle the result of ALU 5 is used as an index in the
memories. In the third clock cycle the values at those indices are
loaded into the registers of the appropriate ALUs.
For example, when all the 2nd values need to be used, index value 2 is
loaded from memory and the 5th ALU adds 64 to this value each time to
create the correct memory address. The AGU can use the output of the ALU
as a memory address, similar to a look-up table. However, for indices
that are larger than half the FFT size this is not a complete solution.
During the whole calculation the FFT samples are split over four
different memories. The real parts are split over two memories and the
imaginary parts are split over two as well. Therefore the results are
also split up. So for a 64-FFT, the first 32 results are stored in one
memory while the other 32 are stored in another, and likewise for their
imaginary parts. This is described in more detail in section 7.2.2.
When, for example, index 45 is loaded from memory for a 64-FFT, the
index would be wrong: the correct index is 13 (45 − 32), in a different
source memory. This problem is also solved in the 5th ALU, by first
masking the value with 32. This determines which memory has to be taken
as the source, by checking whether the result is larger than 0. The
value is then masked again with 31 to get the correct index in the
memory. Figure 7.9 illustrates this. Afterwards the value is incremented
by 64 each time, as before.
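The two masking steps can be followed in a small Python sketch. The
function and constant names are hypothetical and zero-based addressing
is assumed; under that assumption the masks map index 45 to offset 13
in the second memory.

```python
HALF = 32       # half of the 64-FFT, one memory holds 32 results
BASE_STEP = 64  # distance between consecutive FFTs in memory


def locate(k, fft_number=0):
    """Return (memory_select, address) for output index k of a 64-FFT.

    memory_select is 0 for the memory holding results 0..31 and 1 for
    the memory holding results 32..63.
    """
    memory_select = 1 if (k & HALF) > 0 else 0  # mask with 32
    offset = k & (HALF - 1)                     # mask with 31
    return memory_select, offset + fft_number * BASE_STEP


print(locate(45))     # first FFT
print(locate(45, 2))  # third FFT, two steps of 64 further
print(locate(2, 1))   # index 2 in the second FFT
```

The first call selects the second memory at offset 13; the later calls
show the same offset shifted by multiples of 64.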
Figure 7.9: Index value address calculation
The values of the respective twiddle factors are stored in the 9th and
10th memory (p5m1 p5m2). The twiddle factors are stored consecutively
since they are known beforehand. For example, the twiddle factors for
indices 2, 6 and 7 can be stored next to each other. This makes reading
them easy, since the AGU only has to increment the address by one.
All the non-zero values are stored consecutively. This makes reading
them from memory easy, but it has to be known which frequency they
represent, since the frequencies are not consecutive anymore.
Data                        Indices      Memory location
Real part input samples     0 - 31       p1m1 p2m1
                            64 - 95
                            128 - 159
                            192 - 223
                            256 - 287
                            320 - 351
                            384 - 415
                            448 - 479
Imag part input samples     0 - 31       p3m1 p4m1
                            64 - 95
                            128 - 159
                            192 - 223
                            256 - 287
                            320 - 351
                            384 - 415
                            448 - 479
Twiddle factors 512-FFT     0 - 255      p5m1 p5m2
Twiddle recombination       256 - 319    p5m1 p5m2
                            320 - 383
                            384 - 447
                            448 - 511
                            512 - 575
                            576 - 639
                            640 - 703
                            704 - 767
k-indices                   512 - 575    p2m2
Results                     512 - 575    p1m2 p3m2
Table 7.1: Memory usage for 512-FFT with transform decomposition
and maximum of 64 non-zero outputs
7.4 Type conversion
On a normal x86 architecture, or even the 64 bit architectures of these
days, the accuracy seems almost infinite. This is not the case on the
Montium. The Montium is a 16 bit fixed point architecture. This means
that all the values for the Montium are either a full 16 bit two's
complement integer <16,0> or they are in a <1,15> format, in which the
binary point is directly after the sign bit. This representation puts
all the values between $-1$ and $1 - 2^{-15}$. It is used for all the
FFT implementations since fractions are needed. The direct consequence
is that all the input values first need to be remapped to this input
range. The effect on the values is that the larger the original range,
the larger the error in the end. The closer the original values are
together, the better the remapping. Quantization effects are introduced,
similar to the effects introduced by AD converters.
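The remapping and its quantization error can be sketched in Python as
follows. The helper names are hypothetical, and scaling by the largest
absolute input value is one possible remapping chosen for illustration;
the text does not state which remapping was actually used.

```python
def to_q15(x):
    """Quantize a value in [-1, 1) to the <1,15> format, with saturation."""
    value = int(round(x * 2**15))
    return max(-(2**15), min(2**15 - 1, value))


def from_q15(value):
    """Interpret a 16-bit integer as a <1,15> fraction."""
    return value / 2**15


def remap_to_q15(samples):
    """Scale samples by their peak magnitude, then quantize to <1,15>."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return [0 for _ in samples]
    return [to_q15(s / peak) for s in samples]


print(remap_to_q15([2.0, -4.0, 1.0]))
print(abs(from_q15(to_q15(0.3)) - 0.3))  # quantization error of one sample
```

The rounding error per sample is at most half a quantization step
($2^{-16}$) in the remapped range, so in terms of the original values it
grows with the range that has to be covered.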
7.5 Run time reconfiguration
One of the characteristics of Cognitive Radio is that factors can change
during the course of execution of the algorithm. The number of carriers,
the position of carriers, etc. can change over time. Normally this would
require a rewrite and recompilation of the code. The Montium, however,
is reconfigurable at run-time. This results in a significant decrease in
the amount of time needed to alter one or more of these variables.
Although the Montium is capable of run-time reconfiguration, it is
currently anything but an easy operation to deal with. The main reason
for this is that there are currently no tools that support alteration of
the code at run-time.
7.5.1 FFT
The source code for the algorithm is reasonably easy to read and
comprehend. The FFT itself is a regular structure of butterflies that
are repeated until a certain stage is reached. This repetition is
implemented as a number of loops in the sequencer of the Montium. By
adjusting the number of times a certain loop is taken, different
butterfly structures can be created. This is what is done during the
reconfiguration of the algorithm presented in this thesis. Since code
generation itself is difficult, the code for the maximum size FFT needed
is generated by the compiler. This size is limited by the Montium memory
space. By reducing the number of times a loop is taken, smaller FFTs are
created, while it is still easy to increase the loop counter in order to
go back to a larger FFT. This approach requires, for example, that two
different last stages of an FFT are generated, because the source and
target memories for FFTs that have an even number of stages differ from
those that have an odd number of stages.
The downside to this approach is that it creates an overhead on the
sequencer memory usage. Not only the loop counters and jumps need to be
altered; the memory masks need to be changed as well. This is currently
done by means of a tool that can read the output of the compiler and
represent it in a graphical way. The Montium configuration tool, as it
is called, displays the configuration data. The configuration data
values are changed by hand and the new configuration is stored. With a
tool that shows the differences, e.g. diff, the differences are
extracted, and these are the configuration alterations that need to be
sent to the Montium to reconfigure it. This procedure is described in
more detail in appendix B.
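The difference extraction can be pictured with a minimal Python sketch.
It assumes, purely for illustration, that a configuration is a flat list
of parameter values and that an alteration is an (index, new value)
pair; the real Montium configuration format and the procedure of
appendix B are not reproduced here, and the function names are
hypothetical.

```python
def config_delta(old, new):
    """List the (index, value) pairs that differ between two configurations."""
    if len(old) != len(new):
        raise ValueError("configurations must have the same length")
    return [(i, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]


def apply_delta(config, delta):
    """Return a copy of `config` with the alterations applied."""
    patched = list(config)
    for index, value in delta:
        patched[index] = value
    return patched


base = [0] * 896       # same length as the full program, contents made up
smaller_fft = list(base)
smaller_fft[10] = 5    # e.g. a loop counter
smaller_fft[700] = 9   # e.g. a memory mask
delta = config_delta(base, smaller_fft)
print(delta)
print(apply_delta(base, delta) == smaller_fft)
```

Only the entries in the delta have to be transferred, which is what
keeps the reconfiguration small compared to loading a complete
configuration.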
7.5.2 Transform decomposition
The transform decomposition is altered in the same way. The number of
carriers can easily be changed, like the number of butterflies, by
adjusting a loop counter. As with the FFT, the transform decomposition
also has two different last stages, depending on an odd or even number
of stages. Since all the code is generated beforehand, it is also
possible to switch from a transform decomposition implementation to a
normal FFT one. The normal FFT is part of the transform decomposition,
so changing some loop counters and skipping the recombination stage
allows for a normal FFT. Of course, the other way around also works. A
step-by-step procedure for this can be found in appendix B.
7.6 Results
To check whether the algorithm is correctly implemented, a comparison is
made between the results calculated by the Montium and the results
calculated by Matlab. The input sets are the same for the Montium and
for Matlab. There are two issues with the Montium implementation that
need to be taken into account when interpreting the results. First, the
results are scaled during the stages of the FFT. During scaling,
accuracy is lost. Second, the Matlab implementation uses 32-bit floating
point while the Montium architecture is based on 16-bit fixed point.
This also skews the results. By comparing both factors independently of
each other it became clear that scaling the results during the stages
influenced the results far more than switching from 32-bit floating
point to 16-bit fixed point. However, the influence of the 16-bit fixed
point depends on the range of the input values. The bigger the range of
the input values, the larger the influence it has on the outcome.⁵
Figure 7.10 shows the results of a 512 point transform decomposition
FFT with 56 non-zero outputs. The results were scaled during each stage
of the FFT, which means they are scaled six⁶ times. This is the worst
case scenario; scaling in all the stages is far from necessary.
These results do show that the transform decomposition is correctly
implemented. However, they do not show whether transform decomposition
has any positive effect on the number of clock cycles needed to compute
the FFT. Figure 7.11 shows the number of clock cycles needed on the
Montium to calculate the FFT with and without transform decomposition.
These results might not be exactly what would be expected from the
algorithm description in section 3.3. This is because the algorithm
optimizes for the number of multiplications. The number of
multiplications indeed decreases, but the number of clock cycles already
surpasses that of a normal FFT when more than 63 sub-carriers are used
in a 512-FFT. The reason for this is the number of clock cycles it costs
to load the k index and calculate the memory locations during the
recombination stage, as described in section 7.3.2. For each non-zero
output it costs 4 clock cycles: 3 to calculate the memory index and 1
for the actual calculation. This means that the number of clock cycles
it would cost to calculate a 512-FFT with all the sub-carriers enabled
far exceeds that of a normal calculation of an FFT.
The number of clock cycles with 63 sub-carriers enabled equals the number
⁵ The complete range of input values needs to be remapped to the range
$-1$ to $1 - 2^{-15}$.
⁶ 512-FFT = 8 ∗ 64-FFT → $\log_2(64) = 6$
[Figure 7.10 plots. Panel (a): "FFT Absolute Error" and "FFT Relative
Error (scaled with maximum output value)", both against freq, with
series Re, Im and Abs and a vertical scale of x 10^-3. Panel (b): "FFT
Montium results" and "FFT Matlab results", both against freq, with
series Re and Im. Panel (c): "Input sample" against time, with series Re
and Im.]
Figure 7.10: Comparison of results. Figure a shows the absolute and
scaled difference between the results calculated by the Montium and
those by Matlab. Figure b shows the results of the algorithm cal-
culated by the Montium and by Matlab. Figure c shows the input
sample that was used by the Montium and Matlab.
Figure 7.11: Figure a shows the number of clock cycles needed for a
full 512-FFT, (N/2) + 2 ∗ log2(N), compared to that of the transform
decomposition, N2 ∗ N1-FFT + (L ∗ (4 + N2)). Figure b shows the benefit
of transform decomposition as a percentage.
of clock cycles of a normal 512-FFT. The number of clock cycles needed
for an FFT on the Montium can be calculated as (N/2) + 2 ∗ log2(N). The
number of clock cycles for the transform decomposition is a bit more
difficult to derive. The FFT within the transform decomposition is
calculated as a "normal" FFT, and its cycle count is multiplied by the
number of times this FFT is executed. Then the number of clock cycles
for the recombination stage has to be added, which depends on the number
of non-zero carriers. The total number of cycles therefore equals
N2 ∗ N1-FFT + (L ∗ (4 + N2)).
7.6.1 Reconfiguration
The total program consists of 896 configuration parameters. For
reconfiguration to any other FFT size, or for switching between
transform decomposition and a normal FFT, about 40 configuration
parameters need to be changed in the Montium. Table 7.2 lists the number
of configuration parameters that need to be sent to switch from one FFT
size to another. For reconfigurable code this is significantly less than
for static code. The speed at which the reconfiguration can be done is
dominated by the speed of the network on chip. There are four lanes
available on the Hydra interface through which data can be transported;
however, configuration data can only go through one of the lanes. This
restriction means the configuration data cannot be parallelized over the
interface. The reconfiguration has been implemented and tested and is
significantly faster than a recompilation cycle. In the current
implementation, where the FFT code resides in the Montium together with
the transform decomposition code, it is relatively easy to reconfigure
between FFT sizes and to switch between "normal" FFTs and transform
decomposition. The direct cost of this implementation is that both
implementations reside in Montium configuration memory, while only one
of them can be used at a time. The total configuration space for
run-time reconfigurable code and the transform decomposition is roughly
double that of the normal FFT implementation on the Montium. If
additional operations were needed on the Montium, these would have
significantly less space, and a normal FFT implementation might be
preferred. Because it is now relatively easy to switch between FFT
sizes, this implementation is also an ideal candidate for spectrum
sensing in cognitive radio.
                            reconfigurable code    static code
Initialisation              896                    0
to 512-point                52                     569
to 256-point                41                     527
to 128-point                39                     485
to 64-point                 40                     443
to 512-TD (8 * 64 point)    39                     –
Table 7.2: The number of configuration parameters that need to be
sent to switch from one FFT to the other. The first column in case of
reconfigurable code and the second in case of static code.
7.7 Adjacent sub-carriers
Loading the indices from memory is what makes this implementation
perform in the way it does. There is a way to improve this for specific
cases. When sub-carriers are next to each other there is no need to load
the next index, since it is known: it is simply the next one. This would
require only the first index of every block of sub-carriers to be
loaded; any subsequent carrier would just be the next in line. Every gap
in the sub-carriers would make this scenario act more and more like the
previous implementation.
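The grouping of sub-carriers into blocks, which a program outside the
Montium would have to perform, can be sketched in Python as follows. The
function name and the (first index, length) representation are
assumptions made for illustration, and the indices are assumed to be
unique.

```python
def group_adjacent(indices):
    """Group sub-carrier indices into (first_index, length) blocks."""
    blocks = []
    for k in sorted(indices):
        if blocks and k == blocks[-1][0] + blocks[-1][1]:
            blocks[-1][1] += 1     # k directly follows the current block
        else:
            blocks.append([k, 1])  # a gap starts a new block
    return [tuple(block) for block in blocks]


print(group_adjacent([2, 3, 4, 10, 11, 20]))
```

Only the first index of each block would have to be loaded from memory.
With many gaps the number of blocks approaches the number of
sub-carriers, which is the case where this scheme behaves like the
previous implementation.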
There are some issues with this implementation. The main problem is that
the results from the normal FFT are stored in two different memories, as
described in section 7.2. The new implementation would have to take care
of sub-carriers transitioning from one memory to the other without using
the masking solution, since that would require extra clock cycles. The
worst case would be when two such transitions need to be made, and the
code has to be able to make these in any direction.
A solution would be to implement all four scenarios: a single transition
both ways as well as a double transition both ways. When loading the
first sub-carrier in a block, a decision then needs to be made on which
of these scenarios to use. This depends on the starting position and on
how many sub-carriers follow. Deciding which scenario to use is not a
task for the Montium; a program on another platform will need to take
care of this. It will need to decide for each first sub-carrier which
scenario to use and how the loop counters for the MAC operations within
these scenarios need to be set.
If this were implemented, the number of scenarios would even double,
since the implementation would need to support an even as well as an odd
number of stages. This influences the source memory, as described
earlier.
This section started out with a simple assumption on the input values,
but the effect on the implementation is big. The extra scenarios will
probably fit on the Montium. The amount of control code that is needed
to set the correct jump addresses and loop counters for such an
implementation is rather large. Whether or not this is worth
implementing depends on whether the performance is needed and on whether
the control code can be implemented on another platform, which would
need to be available.
7.8 Conclusion
Loading the indices from memory takes three clock cycles. This puts a
huge strain on the efficiency of the algorithm: loading the indices
takes more clock cycles than the actual calculation, which only takes a
single clock cycle. In [6] only the number of multiplications is
compared. The reduction in the number of multiplications still holds,
but the indices take three clock cycles to load, which increases the
total number of clock cycles dramatically. However, this does not mean
the implementation is a failure. The fact remains that for certain
numbers of carriers in certain FFTs the algorithm can offer an
improvement over a normal FFT. This improvement costs next to nothing,
only a small amount of memory, which makes it still worth the effort of
implementing.
An additional benefit is the reduction in data coming from the Montium
processor. In the example described in this chapter, in which 56
non-zero values are calculated, only 56 values are sent from the Montium
where normally 512 values would be sent. This means a reduction of 89%
in data from the Montium, which could certainly be worth the effort in
certain cases. It might even be the case that an increase in clock
cycles on the Montium is worth the reduction in data that comes from the
Montium and therefore travels over the connected bus or network on chip.
Chapter 8
BCVP
The BCVP is a hardware platform on which an ARM and an FPGA reside. The
FPGA is loaded with the hardware description of three Montiums connected
through a network on chip. This platform is used to test various Montium
applications together with different operating systems on the ARM. This
chapter describes the implementation that has been made and what this
implementation demonstrates.
8.1 Introduction
During the last month of this master project an implementation was made
for the BCVP platform. It was meant to show that reconfiguration of the
Montium works together with BasOS on the current platform. BasOS is an
operating system developed within the CAES chair. The idea was to also
switch the FFT task dynamically from the ARM platform to the Montium
platform at run-time. This has not been done for various reasons, mainly
because of the limitations of BasOS at the time of implementation.
8.2 Differences in relation to the simulator
The switch from the simulator environment to the hardware implementation
requires changes to the format of the code as well as extra code that
needs to be developed. The format difference is due to the fact that the
simulator reads the data and configuration differently from the OS that
runs on the ARM. The simulator reads a flit type followed by the data,
e.g. 0x3 0x8001. The OS expects data to be formatted as 0x38001. This is
a minor transition and can easily be changed in the Matlab code that
generates these files.
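The format change amounts to packing the flit type and the data into one
word. A minimal Python sketch of this conversion is given below; the
function names are hypothetical, and the 16-bit width of the data field
is inferred from the single example above rather than from a format
specification.

```python
def parse_simulator_line(line):
    """Split a simulator line such as '0x3 0x8001' into (flit_type, data)."""
    flit_type, data = (int(token, 16) for token in line.split())
    return flit_type, data


def to_os_word(flit_type, data):
    """Pack the flit type above a 16-bit data field."""
    return (flit_type << 16) | (data & 0xFFFF)


def convert(line):
    """Convert one simulator line to the single-word format for the OS."""
    return hex(to_os_word(*parse_simulator_line(line)))


print(convert("0x3 0x8001"))
```

For the example from the text this prints 0x38001.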
The other difference is that a BasOS application needed to be developed.
This requires significant effort. The application takes care of loading
the configuration and reconfiguration data into the Montium, as well as
loading the data samples into the Montium.
8.3 Implementation
The final implementation needed to show the correctness of the
implementation as well as the capability of run-time reconfiguration.
The best way to show results is by transferring data from the BCVP to a
PC and showing a plot of the data. Since it is foremost a technology
demo, a graphical interface would be the most intuitive and appealing
way to show the run-time reconfigurability of the Montium. The decision
was made to incorporate both and make some clickable buttons on the PC.
8.3.1 BasOS
BasOS is one of two OSes that currently run on the ARM on the BCVP. The
other is eCOS in the Osyris framework. For this project BasOS was chosen
for various practical reasons. For example, the development is done very
close by, so any problems could be addressed directly. BasOS is still in
an early stage of development and far from stable. This makes
application development somewhat troublesome and therefore slow. The
application needed to take care of loading the configuration parameters
and loading the data. In addition, it needed to check whether the
results coming from the Montium were actually the correct values.
The correctness of the program is checked by taking the known results
from the simulator and comparing those with the ones coming from the
Montium. The data that is sent to the Montium is always the same in the
current implementation. This makes checking for mistakes simple and
makes the application less complex. The reconfiguration data comes from
the PC through the USB. A user specifies which FFT is wanted and that
configuration is loaded.
To make the BasOS application aware that the data from the USB is actually
reconfiguration data, a specific first data word is sent, which looks
like 0x400000. The first digit indicates that an actual reconfiguration
is being done. The second digit specifies which configuration to switch to and
the last four digits determine how much reconfiguration data is sent. For
specifics please consult the explanation in the application code.
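As an illustration of this header layout, the word can be split into its three fields with shifts and masks. This is a sketch only: it assumes the fields are hexadecimal digits of a 24-bit word, as the example 0x400000 suggests, and the names used are hypothetical rather than taken from the application code.

```python
from typing import NamedTuple


class ReconfigHeader(NamedTuple):
    marker: int     # first hex digit, 0x4 in the example word
    config_id: int  # second hex digit: configuration to switch to
    length: int     # last four hex digits: amount of reconfiguration data


RECONFIG_MARKER = 0x4  # taken from the example word 0x400000


def parse_header(word: int) -> ReconfigHeader:
    """Split a 24-bit header word into marker, configuration id and length."""
    return ReconfigHeader(
        marker=(word >> 20) & 0xF,
        config_id=(word >> 16) & 0xF,
        length=word & 0xFFFF,
    )


def is_reconfiguration(word: int) -> bool:
    """True when the word announces reconfiguration data."""
    return parse_header(word).marker == RECONFIG_MARKER


if __name__ == "__main__":
    print(parse_header(0x400000))
```

The unit of the length field (words, flits or bytes) is not specified in the text and is left open here.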
Although this application would be enough to show correctness of the code
and the capability of reconfiguration, it was decided that the data from the
Montium is also sent to the PC through the USB for a graphical representation.
8.3.2 libusb
BasOS comes with a USB driver to send data to and receive data from the PC. The
open-source libusb [13] library provides an easy way to interact with the
BCVP. The main reason for using it is that it provides generic access
to any USB device from the user space of the operating system. This saves
time since development of an actual driver in kernel space is not necessary.
8.3.3 Trolltech Qt
There was only one example program available for the BCVP, which consisted
of reading data from the BCVP and writing it to a file. Data from the
Montium, however, represents the results of an FFT and can best be shown
in a plot. Several C libraries exist that are capable of representing data in
a plot. For this project the Qwt library (Qt Widgets
for Technical Applications) [14] was chosen, mainly due to the number of examples that
are widely available on the internet for Qt applications and because it can
be used on a variety of operating systems. Trolltech has extensive documentation
for its product with several examples [15]. The application was
developed in version 4 of the library.
A Qt - libusb interface needed to be developed that runs
apart from the main program thread and takes care of the communication:
receiving the data from BasOS as well as sending the reconfiguration data.
Since there is no need for a high throughput but only to show the results,
one FFT is done each time new data is sent. The interface that was
developed is simplistic, but it is easy to handle and its code is
easy to understand.
Furthermore, the application takes care of scaling back the values from the
FFT and putting them in the correct order. The normal FFT results are bit-reversed
and the transform decomposition results need to be put in the
correct place. This is done by taking the original mask and matching the
enabled sub-carriers with the output values.
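Both reordering steps can be sketched in a few lines. The following Python is an illustration, not the Qt application's code: the function names are hypothetical, and it assumes that the transform decomposition outputs arrive in ascending order of enabled sub-carrier index.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def reorder_fft(values):
    """Put bit-reversed FFT output back into natural order."""
    n = len(values)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    return [values[bit_reverse(k, bits)] for k in range(n)]


def place_sparse(values, mask):
    """Place sparse outputs on the sub-carriers that are enabled in `mask`."""
    if sum(1 for enabled in mask if enabled) != len(values):
        raise ValueError("number of values must match enabled sub-carriers")
    remaining = iter(values)
    return [next(remaining) if enabled else 0 for enabled in mask]


if __name__ == "__main__":
    print(reorder_fft([0, 4, 2, 6, 1, 5, 3, 7]))
    print(place_sparse([10, 20, 30], [0, 1, 1, 0, 1]))
```

Disabled sub-carriers are filled with zero here so that the result can be plotted on the same axis as a full FFT.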
8.4 Results
The Qt application together with the BasOS application and the Montium
implementation of the algorithm shows that it all works as intended. It shows
that the implementation on the Montium can be reconfigured at run-time
and provides some easy methods to check for correctness. The data flow
can be seen in figure 8.1.
The application works as a demo and should always be used that way.
Throughput has not been considered during development. By taking out
the Qt application and altering the code of the BasOS application it could
be used as part of a complete OFDM implementation.
Figure 8.1: Data flow of the algorithm implemented on the BCVP
The demo also shows that the traffic on the NOC and the memory needed are
reduced when using the transform decomposition implementation on the
Montium.1 During development, quite a few problems popped up with
the use of BasOS. This is not surprising since the OS is still in development.
Because of the general lack of proper documentation and the complexity of
the source code of the OS, it is nearly impossible to fix bugs when they are
encountered without the help of the developer. Certain failures to run the
application took weeks to work around, and it is still not known how
they should be properly corrected. For example, the stack size can be set for
a task, but when it is set to any value other than 0 the application will not work
and the OS will simply crash.
8.5 Conclusions
First of all, the demo application shows the run-time reconfiguration
capabilities of the Montium. Next to that it proves that it is possible to have
a run-time reconfigurable FFT and sparse FFT and that it is possible to switch
between these on real reconfigurable hardware. The Qt application can
be used for different algorithms on the Montium and represents the data
in a plot. When transferring the code from the Montium simulator
to the actual Montium hardware very few discrepancies were encountered.
Most of the problems during development came from working with BasOS,
which in the end might not have been the correct OS choice. Qt has been an
easy library to work with. It is very well documented and the community
is very helpful.
8.6 Further work
Currently there are no real benchmarks. There are no measurements of
how fast the actual implementation is. This is mainly due to the lack of benchmark
capabilities on BasOS. Theoretical values for the Montium implementation
can be derived since it is known how many clock cycles it costs to
process data and at what clock speed the Montium runs. The overhead of
BasOS is unknown. Very rough estimates can be made, but with the current
implementation, in which the FFT is only run a single time, it is impossible
to find out where the overhead comes from. It could be anything from
bus speed and NOC speed to memcopy operations and context switches.
If a complete implementation of OFDM were to be made, the eCos
/ Osyris platform should certainly be taken into consideration.
1 Can only be seen by looking at the source code.
Figure 8.2: Impressions of the demo application: (a) shows the
results from a 512-FFT done on randomly generated data, (b)
shows a 64-FFT on the same random data and (c) shows the result
from a 512-FFT in which only 56 sub-carriers are enabled.
Chapter 9
Conclusion and
Recommendations
9.1 Conclusion
This thesis shows that the algorithm can be mapped to the Montium and
can be reconfigured at run-time. The technology demo on the BCVP shows
that it can also be done on an actual hardware implementation. The algorithm
can be used for cognitive radio and, next to the mapping of the
complete algorithm, it provides a run-time reconfigurable radix-2 FFT,
which can be used for spectrum pooling. The mapping might not show the
improvement in clock cycles that was expected at the start of this project,
but it does show a significant decrease in the amount of data traffic.
The fully run-time reconfigurable radix-2 FFT and the decrease in data
traffic, together with the technology demo on the BCVP, bring a
full implementation of OFDM for cognitive radio closer to
reality. The capability of the Montium to be reconfigured at run-time is also
exploited to a large extent, and demonstrated in a way previous
applications did not.
9.2 Recommendations
9.2.1 Reconfiguration
The run-time reconfiguration itself is a success but the steps to get there
are very tedious, to say the least. A future programming language for the
Montium should have some support for this feature. A compiler that
outputs only the configuration data corresponding to those parameters in the
code that need to be changed at run-time would be a
great improvement. This would make it possible to output only the code
that needs to be changed at run-time, which would in turn save the
developer a lot of time. The fact that it is currently a very time-consuming
process to create the pieces of code that need to be reconfigured probably
prevents people from even considering this as a viable option. This is a
pity since it is such a powerful option which could really benefit people.
9.2.2 Programming
When development of the algorithm started, debugging was a very
time-consuming process since the original simulator has no option to display
multiple memory locations at the same time. Towards the end of the
project a second simulator became available which does offer such an option
and more. This had a positive influence on the time it took to develop
new parts of code. The fact remains that the development of new algorithms
for the Montium is still a bit slow. This is primarily due to the fact
that the Montium differs so much from a normal CPU. It offers five ALUs
that can work at the same time, which can make it difficult for people to
program since parallel programming is not common knowledge. Algorithms
for the Montium are optimised not so much towards ALU operations as
towards memory access, which is the real constraint of the Montium.
Bibliography
[1] Freeband AAF project. “AAF: Data communications
in emergency situations through Cognitive Radio”.
http://www.freeband.nl/project.cfm?id=488. 1, 2.1
[2] Various. “Orthogonal frequency-division multiplexing”.
http://en.wikipedia.org/wiki/OFDM. 2.2
[3] Jochen H. Schiller. “Mobile Communications”. Addison-Wesley 2nd edition
(2004). 2.2
[4] Richard G. Lyons. “Understanding Digital Signal Processing”. Prentice Hall
2nd edition (2004). 3.1, 5.2
[5] James W. Cooley and John W. Tukey. “An Algorithm for the Machine
Calculation of Complex Fourier Series”. Math. Comput. 19, 297–301 (1965).
3.2
[6] Henrik V. Sorensen and C. Sidney Burrus. “Efficient Computation of the DFT
with Only a Subset of Input or Output Points”. IEEE Transactions on Signal
Processing 41, 1184–1200 (1993). 3.3, 7.8
[7] Qiwei Zhang, Andre B.J. Kokkeler and Gerard J.M. Smit. “An Efficient FFT
Implementation For OFDM Based Cognitive Radio On A Reconfigurable
Architecture”. IEEE International Conference on Communications 2007 Cognet
Workshop 1, 24–28 (2007). 3.3
[8] Paul M. Heysters. “Coarse-Grained Reconfigurable Processors”. PhD thesis
University of Twente (2004). 4.1, 4.3.4, 4.3, B, B
[9] M.D. van de Burgwal. “Hydra Protocol Specification”. Technical report
University of Twente (2007). 4.3.5, 4.4.3, B
[10] Recore. “Simsation Compiler Getting Started Guide”. Recore (2007). 4.4.1, B
[11] Recore. “Simsation Simulator Quick Reference Guide”. Recore (2007). 4.4.2,
B
[12] Alan V. Oppenheim and Ronald W. Schafer. “Digital Signal Processing”.
Prentice Hall International (1975). 5.2
[13] Various. “Libusb”. http://libusb.sourceforge.net. 8.3.2
[14] Various. “Qt Widgets for Technical Applications”.
http://qwt.sourceforge.net. 8.3.3
[15] Various. “Trolltech Qt”. http://doc.trolltech.com. 8.3.3
Appendix A
Pseudo-code
In this appendix parts of the FFT code are listed as a reference for this report.
For a full copy of the code please contact the author or the CAES chair of
the University of Twente.
FFT calculation
proc compute_butterfly
alu scale p1d1 sadd (p1a1 fmul p1c1 ssub (p2a1 fmul p2c1)) -> p1o1,
scale p1d1 ssub (p1a1 fmul p1c1 ssub (p2a1 fmul p2c1)) -> p1o2,
scale p3d1 sadd (p3a1 fmul p3c1 sadd (p4a1 fmul p4c1)) -> p3o1,
scale p3d1 ssub (p3a1 fmul p3c1 sadd (p4a1 fmul p4c1)) -> p3o2
end
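For readers unfamiliar with CDL, the four ALU expressions above compute one scaled radix-2 butterfly on real and imaginary parts. The following Python sketch mirrors that arithmetic. It is an interpretation and not part of the Montium code: the mapping of registers to operands and the scale factor of 0.5 are assumptions made for illustration.

```python
def butterfly(a: complex, b: complex, w: complex, scale: float = 0.5):
    """Radix-2 butterfly: returns (scale*(a + w*b), scale*(a - w*b))."""
    t = w * b
    return scale * (a + t), scale * (a - t)


def butterfly_parts(a_re, a_im, b_re, b_im, w_re, w_im, scale=0.5):
    """Same butterfly written out per ALU expression, as in the CDL listing."""
    t_re = b_re * w_re - b_im * w_im  # fmul ... ssub (fmul ...): real product
    t_im = b_im * w_re + b_re * w_im  # fmul ... sadd (fmul ...): imaginary product
    return (
        scale * (a_re + t_re),  # -> p1o1
        scale * (a_re - t_re),  # -> p1o2
        scale * (a_im + t_im),  # -> p3o1
        scale * (a_im - t_im),  # -> p3o2
    )


if __name__ == "__main__":
    print(butterfly(3 + 1j, 1 - 1j, 1j, scale=1.0))
```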
Recombination calculation
proc c_mac
//R+jI = (a+jb)(c+jd) = (ac - bd) + j(bc+ad)
//Re
//b*d
alu p2a1 fmul p2c1 -> p2ws
//acc L(i) + (a*c - bd)
alu p1d1 sadd (p1a1 fmul p1c1 ssub p1es) -> p1o1
//Im
//a*d
alu p4a1 fmul p4c1 -> p4ws
//acc L(i) + b*c + ad
alu p3d1 sadd (p3a1 fmul p3c1 sadd p3es) -> p3o1
end
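The complex multiply-accumulate described in the comments above can be checked against ordinary complex arithmetic. A minimal Python sketch follows; the function names are hypothetical and the accumulation over blocks only illustrates how the recombination sums partial products.

```python
def c_mac(acc_re, acc_im, a, b, c, d):
    """Accumulate (a + jb)(c + jd) = (ac - bd) + j(bc + ad) onto acc."""
    return acc_re + (a * c - b * d), acc_im + (b * c + a * d)


def recombine(samples, twiddles):
    """Sum sample * twiddle over all blocks for one output value."""
    acc_re, acc_im = 0.0, 0.0
    for sample, twiddle in zip(samples, twiddles):
        acc_re, acc_im = c_mac(acc_re, acc_im,
                               sample.real, sample.imag,
                               twiddle.real, twiddle.imag)
    return complex(acc_re, acc_im)


if __name__ == "__main__":
    print(c_mac(0, 0, 1, 2, 3, 4))
```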
Complete Butterfly
proc fft_but rmode rswap wmode wswap tmode tmask
weak agu p5m1 p5m2 =0 & N/2-1 <-> (2 log N)-1
call read rmode rswap
call write wmode wswap
call read_twiddle tmode tmask
call move_twiddle
call compute_butterfly
end
Middle FFT stage
stage_.s._l4: clock
print "This is stage: ".stage_.s._l4."\n"
weak agu p5m1 p5m2 =0 & N/2-1 <-> (2 log N)-1
call read init !swap
llc lc1 lc1_a // outer loop
clock
call fft_but next !swap init !swap next N/2-1
llc lc2 lc2_a // inner loop
stage_.s._l1: clock
call fft_but next swap next !swap next N/2-1
clock
call fft_but next !swap next !swap next N/2-1
loop lc2 stage_.s._l1
clock
call fft_but next swap next !swap init N/2-1
llc lc2 lc2_a
clock
call fft_but next !swap next !swap next N/2-1
loop lc1 stage_.s._l1
clock
call fft_but next swap init swap next N/2-1
clock
call fft_but next !swap next swap next N/2-1
llc lc1 lc1_b
clock
call fft_but next swap next swap next N/2-1
llc lc2 lc2_b-2
stage_.s._l2: clock
call fft_but next !swap next swap next N/2-1
clock
call fft_but next swap next swap next N/2-1
loop lc2 stage_.s._l2
clock
call fft_but next !swap next swap next N/2-1
llc lc2 lc2_b
clock
call fft_but next swap next swap init N/2-1
loop lc1 stage_.s._l2
clock
call write next swap
weak agu p5m1 p5m2 =0 & N/2-1 <-> (2 log N)-1
undef lc1_a lc2_a lc1_b lc2_b
end
jmp stage_last
undef s
Recombination loop
recomb: clock //MULT AND RECOM STAGE!!!
agu p2m1 = 0 |= 512 //set load address for K
agu p5m1 p5m2 = 0 |=256 //set load address for recomb twiddle
agu p1m1 p3m1 = 0 |= 512
llc lc1 50 //amount of (k-1) values
alu 0 -> p5o1,
0 -> p1o1,
0 -> p3o1
k_values: clock
// 0 into mac
mov p5o1 -> p5b1
call c_mov
//p2m1 k value to p5a1 p5d1
mov p2m1 -> p5a1 p5d1
//p5a1 k value, p5b1 mac storage.
//k + 0 , k AND mask (1100)
alu (p5d1 and p5c1) gt 0 -> p5sb //calc read addr and mask bit to check if 2nd part of mem
k_calc: clock
alu p5a1 add p5b1 -> p5o1, p5d1 and p5c2 -> p5o2 //mask with n/2 -1
jcc p5sb k_right
k_left: clock
llc lc2 7 //(blocks)
//calc k with left mem as source
agu p1m2 p3m2 int load
//k + 0
mov p5o1 -> p1m2 p3m2 p5d1 //mov address to memory and mac place
//k + 64
alu p5a1 add p5b2 -> p5o1,0->p1o1,0->p3o1 //calc next (+64)
k_left_l: clock
//mov and calc complex mult
//-----------------------
call c_mov
//b
mov p3m2 -> p2a1 p3a1
//d
mov p5m2 -> p2c1 p4c1
//a
mov p1m2 -> p1a1 p4a1
//c
mov p5m1 -> p1c1 p3c1
call c_mac
//--------------
agu p1m2 p3m2 int load
//k+64
mov p5o1 -> p1m2 p3m2 p5a1
alu p5a1 add p5b2 -> p5o1 //calc next (+64)
agu p5m1 p5m2 |+= 64
loop lc2 k_left_l
clock
//store results
mov p1o1 -> p1m1
mov p3o1 -> p3m1
jmp loop_k
Appendix B
The Rough Guide to run-time
reconfigurable code for the
Montium
Introduction
So far not much has been done with the run-time reconfigurable
potential of the Montium. This is a pity since it is a powerful tool for current
DSP development needs. This chapter tries to generate some enthusiasm
for writing such code and to help in developing it. It explains
which tools to use and how these can be applied to develop run-time
reconfigurable code. Furthermore it describes some small, simple examples and
works up to a fully reconfigurable radix-2 FFT. It is expected that the reader
is familiar with the Montium and Hydra and how code is developed for them.
If you need to brush up on your knowledge, [8] is a great place to start and
for the simulator and compiler, [10] and [11] can help you out.
Tools
To develop run-time reconfigurable code a few tools are necessary. Obviously
we will need the compiler for the Montium. It takes the normal Montium
CDL code and transforms it into a .cfg file. Furthermore we will
need a platform to run the code on. The simulator is the simplest method,
but if you feel confident you can try the BCVP with an OS like eCos or BasOS,
although this is discouraged if you are not comfortable using them.
You will also need the Montium Editor. This tool reads a .cfg file and displays
the contents of the Montium configuration memory. This tool is not
used in normal application development for the Montium but is currently
essential in developing reconfigurable code.
Next to these tools we will need Matlab with the conversion script cfg2li0
to transform the .cfg files into files that can be read by the CCU/Hydra.
Finally you will need a tool that can easily display the differences between
files: either the *nix diff tool or something like WinMerge.
Hello world
It might seem a bit silly since the Montium cannot display any text, but by
guiding you through this simple example you will get an understanding
of how reconfigurable code is created.
CDL code
We will start with the following CDL code:
proc addone
alu p3d1 add p3c1 -> p3o2
end
init: clock
agu p1m1 p1m2 p2m2 p3m1 p3m2 p4m1 p4m2 p5m1 p5m2=0 |=0
llc lc4 98
clock
mov p1m1 -> p3d1 p3c1
start: clock
call addone
mov p3o2 -> p3c1
loop lc4 start
clock
nop
clock
frz
Assume that a 1 is loaded at location zero of p1m1. Furthermore it is important
to understand that every clock instruction will cause a new sequencer
instruction. With this in mind it is easy to understand what is going on. In
the first sequencer instruction 98 is loaded into loop counter 4. In the second
sequencer instruction the value 1 is loaded into registers d1 and c1 of ALU
3. During the third sequencer instruction ALU 3 is instructed to add
registers d1 and c1 together and a move instruction causes the result to
be moved into input register c1. The sequencer is instructed to loop this 98
times more.
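The behaviour of this small program can be mimicked in a few lines of Python. This is a simplified behavioural model and not a Montium simulation: it assumes that the result of the add is available to the move within the same loop pass, and it ignores any pipeline effects of the real hardware.

```python
def run_addone(loop_counter: int, initial: int = 1) -> int:
    """Model of the Hello world loop: returns the final content of c1."""
    d1 = c1 = initial  # mov p1m1 -> p3d1 p3c1
    # A loop counter value of n results in n + 1 passes through the body.
    for _ in range(loop_counter + 1):
        o2 = d1 + c1   # alu p3d1 add p3c1 -> p3o2
        c1 = o2        # mov p3o2 -> p3c1
    return c1


if __name__ == "__main__":
    print(run_addone(98), run_addone(49))
```

Under this model the value in c1 grows by one per pass, so changing the loop counter directly changes the end result, which makes the effect of a reconfiguration easy to observe.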
Now if you compile this code and load it into the Montium Editor you will
see something like in figure B.1.
Not surprisingly you see 5 sequencer instructions. In the first instruc-
tion the loopcounter is loaded and in the third instruction the sequencer
will loop on this counter.
Figure B.1: Example loaded into Montium Editor
Now what if you needed only 50 loops of this operation? You could
rewrite the application and upload it completely to the Montium, which
would take you through a whole development cycle, or you could choose
the path of run-time reconfiguration.
New configuration
In order to get 50 loops of the operation the loop counter needs to be loaded
with 49. In the Montium Editor the value is manually changed to 49 and the
configuration is exported through File -> Export MemoryMap. The new
configuration is still a complete configuration but contains the alteration of
the loop counter. Loading the Montium with this new configuration would
be the same as recompiling the code, which we wanted to avoid in the first
place. To find out what actually changed in the configuration we load
the initial configuration and the altered configuration into a diff tool.
The output will be something like figure B.2.
Figure B.2: Difference in loopcounter
As you can see, at address 0x0200 the initial configuration loaded 0x62
while in the new configuration this address is loaded with 0x31, just as
expected. The whole procedure with the diff tool could be avoided
by just reading the address in the Montium Editor, but when there are a lot
of differences it is easier with a diff tool. The diff tool can output the
differences to a separate file which can be converted by the Matlab script to a
CCU/Hydra file. Now the new value needs to be loaded into the Montium
at the correct address.
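Conceptually, the diff step reduces two complete configurations to the set of addresses whose contents changed. The following Python sketch shows the idea on address-to-value mappings. This representation is an assumption for illustration; the actual format of the exported memory map is not described here, and the helper name is hypothetical.

```python
from typing import Dict


def diff_memory_maps(old: Dict[int, int], new: Dict[int, int]) -> Dict[int, int]:
    """Return {address: new value} for every address that differs from `old`."""
    return {
        address: value
        for address, value in sorted(new.items())
        if old.get(address) != value
    }


if __name__ == "__main__":
    # Illustrative contents only; 0x0200 holds loop counter 4 in the example.
    initial = {0x01FF: 0x0000, 0x0200: 0x0062, 0x0201: 0x0005}
    altered = {0x01FF: 0x0000, 0x0200: 0x0031, 0x0201: 0x0005}
    for address, value in diff_memory_maps(initial, altered).items():
        print(f"0x{address:04X}: 0x{value:04X}")
```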
Loading new value
The new value needs to be loaded through the Hydra. The following
implementation depends on which platform you intend to develop for, but
all platforms follow the same pattern. In this simple example there is no data
transfer through the Hydra, but that does not matter. You can choose to
send the reconfiguration data along with the normal data stream or start
a task that solely takes care of reconfiguration. We have to make the
Hydra aware that new configuration data is coming and where to load it.
Next we need to reset the Montium and start our application from the
beginning. The reason for resetting the Montium after reconfiguration
is that it is impossible to know at which execution point the Montium
is, so continuing would most definitely cause corrupted data output. The reset
proposed here is the software reset. It only causes the program counter to
be reset, after which the Montium goes to idle mode.
The following code is for the simulator. When the output of the diff tool is
converted by the Matlab script you will get something like this:
0x3 0x2000 //configuration flit & load configuration data
0x0 0x0031 //data flit & new data value
0x2 0x0000 //tail flit
As said earlier, this code will update the Montium but there is no way of
knowing what state it will end up in. So reset and run commands need
to be added. The correct code to send would be:
0x3 0x2000 //configuration flit & load configuration data
0x1 0x0200 //address flit & address location of counter 4
0x0 0x0031 //data flit & new data value
0x2 0x0000 //tail flit
0x3 0xC000 //configuration flit & reset the Montium
0x3 0x8001 //configuration flit & start running
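The sequence above can also be generated from a set of changed addresses. The following Python sketch builds it using the flit types and command words shown in the listing. The function names are hypothetical, and emitting a separate address flit before every changed value is an assumption, since the example only covers a single change.

```python
FLIT_DATA, FLIT_ADDRESS, FLIT_TAIL, FLIT_CONFIG = 0x0, 0x1, 0x2, 0x3
CMD_LOAD_CONFIG, CMD_RESET, CMD_RUN = 0x2000, 0xC000, 0x8001


def build_reconfiguration(changes):
    """Build the flit list for {address: value} changes, then reset and run."""
    flits = [(FLIT_CONFIG, CMD_LOAD_CONFIG)]
    for address, value in sorted(changes.items()):
        flits.append((FLIT_ADDRESS, address))  # assumed: one address per value
        flits.append((FLIT_DATA, value))
    flits.append((FLIT_TAIL, 0x0000))
    flits.append((FLIT_CONFIG, CMD_RESET))
    flits.append((FLIT_CONFIG, CMD_RUN))
    return flits


def to_simulator_text(flits):
    """Format flits the way the simulator reads them: type, then payload."""
    return "\n".join(f"0x{kind:X} 0x{payload:04X}" for kind, payload in flits)


if __name__ == "__main__":
    print(to_simulator_text(build_reconfiguration({0x0200: 0x0031})))
```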
This is all that needs to be done to reconfigure at run-time. Implementing
this on another platform might go a little differently. Inform
yourself about what the target OS and/or driver actually does when
uploading configurations. It might need to be set in a different mode since
you will need to send 18 bits of data and not 16, which is the normal data
size on a lane. It might be the case that the OS adds the 0x38001 to a
configuration to make it ‘easier’ for you, in which case you do not have to
add it yourself. For more on the special configuration possibilities please
consult [9].
Radix-2 FFT
In the example only a loop counter is adjusted. More complex cases are
masks of AGUs or ALU instructions. The implications might be more difficult
to understand but the principle is the same. Do keep in mind that
creating code at run-time is difficult and should be avoided. Adjusting
existing code is a lot easier. This means that for the radix-2 FFT we want the
largest size FFT loaded into the Montium and make it smaller if needed.
Since the FFT is such a regular structure it is not that difficult to create a
smaller FFT from the code of a large FFT. If you are not familiar with how the
FFT is implemented on the Montium please read section 6.2 of [8].
Additional Code
The standard Recore FFT generator generates an FFT of the size specified
on the command line. The generated code cannot be reconfigured well at
run-time. The reason for this is that in a radix-2 FFT on the Montium the
side from which the input samples are read alternates between stages. So in
stage 1 the samples are read from the left hand side of each memory and in
stage 2 the values are read from the right hand side. As you can imagine,
the last stage of the FFT needs to read from either the left hand side
or the right hand side, depending on whether the number of stages is even or odd.
To counter this problem we just need to implement an alternate last stage.
Then, depending on the size of the FFT, take the original last stage and
jump over the alternate stage, or jump over the original stage and take the
alternate stage.
Smaller FFT
The FFT can be made smaller by skipping stages, lowering the loop counter
values and adjusting the memory masks. You first determine the number
of stages that you need. With a 64-FFT you need log2(64) = 6 stages. This
means that in the first stages you need to correct the loop counters to match
a 64-FFT and at the end of stage 5 you need to jump to the appropriate
last stage. Also in the last stage, the loop counters need to be corrected to
match a 64-FFT. Finally the masks and bit reversal width of the AGUs need
to be adjusted. When you started out with a 512-FFT they will be set to
255 and 8, while for a 64-FFT you will need a mask of 31 and a bit reversal
width of 5. Make sure you do not change the masks and bit reversal width of
memories p5m1 and p5m2. These do not need to be changed since the twiddle
factors are still those of the largest FFT and can be used for any smaller
radix-2 FFT. If you are not sure which values to use you can create a 64-FFT
with the generator, load it in the Montium Editor and see what the
correct values should be. When you are finished with the reconfiguration
you can save the configuration, load the original and new configuration
in a diff tool and save the values that need to be reconfigured.
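The numbers quoted above follow a simple pattern: the mask is N/2 - 1 and the bit reversal width is log2(N) - 1, which matches the AGU line in the pseudo-code of appendix A. A small Python helper (hypothetical, for checking values only) makes this explicit:

```python
def fft_parameters(n: int) -> dict:
    """Stage count, AGU mask and bit reversal width for an n-point radix-2 FFT."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    stages = n.bit_length() - 1  # log2(n)
    return {
        "stages": stages,
        "mask": n // 2 - 1,
        "bit_reversal_width": stages - 1,
    }


if __name__ == "__main__":
    for size in (512, 64):
        print(size, fft_parameters(size))
```

The loop counter values per stage are not covered by this helper; as described above, they are best read from a generated FFT of the target size.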
Loading new values
This is exactly the same as in the Hello World example. The only difference
is that there are many more values to change, though still only a fraction of
the original code.
Sparse FFT
This section describes the exact steps that need to be taken to reconfigure
the code that was developed for the master thesis this appendix is part of.
Initial code generation
If you compile the code from the prompt, the pre-processor asks for
how many points the FFT should be generated. The number that is entered
determines the largest FFT that can be done. This section deals with a
largest FFT of 512 points. The code that is generated does not provide a
normal 512-FFT. It generates all the stages that would be needed
to create any variant up to a 512-FFT. So the code needs to be adjusted by
hand for a normal 512 streaming FFT. There is one line that needs to be
adjusted to make this work.
At sequencer line [98] the JNC [0] needs to be changed to JCC [106].
This will cause a jump from the end of stage 8 to the start of the streaming
output stage 9. Now the code will function as a normal 512-FFT with a
streaming output stage of which the output is bit-reversed.
Reconfigure code
The following table describes the key sequencer lines with which FFTs and
Sparse FFTs can be created.
Sequencer line   Description
[1]              Counter corresponding to N2 − 1
[2]              Start stage 1
[8]              Start stage 2
[16]             Start stage 3
[24]             Start stage 4
[38]             Start stage 5
[52]             Start stage 6
[66]             Start stage 7
[80]             Start stage 8
[94]             End stage for TD (log2(N) = odd)
[100]            End stage for TD (log2(N) = even)
[106]            Streaming end stage (log2(N) = odd)
[112]            Streaming end stage (log2(N) = even)
[119]            Jump address to recombination stage
[120]            Start recombination stage (log2(N1) = odd)
[120]            Counter corresponding to L − 1
[123] & [126]    Counters corresponding to N2 − 1
[131]            Start recombination stage (log2(N1) = even)
[131]            Counter corresponding to L − 1
[134] & [138]    Counters corresponding to N2 − 1
[141]            Last sequencer line, jcc [0]
Table B.1: Key sequencer lines
With this information it is relatively easy to generate any power of two
radix-2 FFT smaller or equal to 512 and generate any transform decompo-
sition FFT.
Switch from 512-FFT to 512-FFT with 56 non-zero outputs
The basic difference between these scenarios is the amount of stages and
the end stage. For 56 non-zero outputs we need 8 times a 64-FFT and a
recombination end stage.
• [1] loaded with 7
• Stages 1 to 5 loaded with correct loopcounters.
• End of stage 5 [51] jump to [100]
• Counter in end stage for TD loaded with correct loopcounters
• [119] set jump target address to [131]
• [131] set counter to 55
• [134] and [138] set counters to 7
Now the sequencer is switched to the correct scenario. The only thing
that needs to be done now is to adjust the masks and memory offsets. These
are still set for a 512-FFT. To see what the correct values for these
configuration registers and for the loop counters are, you can generate
a 64-FFT from the base code and inspect it. Remember that we adjusted only the
configuration; the correct input data as well as the masks in the registers of
ALU 5 are still needed.
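The sequencer changes for this scenario can be derived from table B.1. The following Python sketch reproduces the values listed above for N = 512, N1 = 64 and 56 non-zero outputs. It is an illustration under assumptions: the function name is hypothetical, the parity is taken from log2(N1), and the generalisation to other sizes (in particular the line from which the jump to the end stage is made) is extrapolated from the single worked example. Loop counters inside the stages, masks and memory offsets are not covered.

```python
STAGE_START = {1: 2, 2: 8, 3: 16, 4: 24, 5: 38, 6: 52, 7: 66, 8: 80}
TD_END_STAGE = {"odd": 94, "even": 100}
RECOMB_START = {"odd": 120, "even": 131}
RECOMB_N2_COUNTERS = {"odd": (123, 126), "even": (134, 138)}


def td_patch(n: int, n1: int, nonzero: int) -> dict:
    """Sequencer counters and jumps for transform decomposition (sketch)."""
    if n1 < 4 or n1 > 128 or n1 & (n1 - 1) or n % n1:
        raise ValueError("n1 must be a power of two in 4..128 that divides n")
    stages = n1.bit_length() - 1              # log2(n1)
    parity = "even" if stages % 2 == 0 else "odd"
    n2 = n // n1                              # number of n1-point FFTs
    counters = {1: n2 - 1, RECOMB_START[parity]: nonzero - 1}
    for line in RECOMB_N2_COUNTERS[parity]:
        counters[line] = n2 - 1
    jumps = {
        # Assumed: the last line of the preceding stage jumps to the TD end stage.
        STAGE_START[stages] - 1: TD_END_STAGE[parity],
        119: RECOMB_START[parity],
    }
    return {"counters": counters, "jumps": jumps}


if __name__ == "__main__":
    print(td_patch(512, 64, 56))
```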
There is a small note to make about the current code. The generated code
can handle any N1-FFT smaller than or equal to 128. If you want to make
this larger, which is not recommended since the efficiency decreases greatly,
you need to change the base address addition that is done. Currently the
base address of the AGUs is increased by 64 between each of the N1-FFTs.
You will need to edit the code manually and compile it to see which
configuration register changes, and add this to the reconfiguration code. This is
only needed when you want to do transform decomposition in which the
number of non-zero values is between 128 and 256. Again, this scenario
is NOT recommended and a normal 256 or 512 FFT should be done in this
case.
Acknowledgments
This project would not have been possible without the help of many people
to whom I am very grateful. First of all I would like to thank André
Kokkeler and Gerard Smit for their help in getting me this project in the CAES
group. I would like to thank Qiwei Zhang for his help in getting me to really
understand the project. Both André and Qiwei were a great help during
the project by asking questions which made me really think about solutions
and whether they were as great as I hoped them to be. Gerard Rouwerda
and Marcel van de Burgwal were of great importance to my work by helping
me out with questions I had about the Montium architecture and its
programming language. Together with the people at Recore Systems they
offered fast solutions to problems I encountered with the simulator and
compiler, which was of great importance to my work. I would like to thank
Albert Molderink for his help on BasOS, which made it possible for me to
create a technology demo on the actual hardware. Finally I would like to
thank all the CAES group members and students for a great time.
