Design and Implementation of MIMO OFDM IEEE802.11n Receiver Blocks on Heterogeneous Multicore Architecture by Hosseinvand, Mohammad
MOHAMMAD HOSSEINVAND
DESIGN AND IMPLEMENTATION OF MIMO OFDM IEEE802.11N
RECEIVER BLOCKS ON HETEROGENEOUS MULTICORE
ARCHITECTURE
Master of Science Thesis
Examiners: Prof. Jari Nurmi
M.Sc. Sajjad Nouri
Examiners and topic approved by the
Faculty Council of the Faculty of
Computing and Electrical Engineering
on 30th of May 2018
ABSTRACT
MOHAMMAD HOSSEINVAND: Design and Implementation of MIMO OFDM
IEEE802.11n Receiver Blocks on Heterogeneous Multicore Architecture
Tampere University of Technology
Master of Science Thesis, 61 pages
October 2018
Master's Degree Programme in Communications And Computer Networks Engineering
Major: Communication Systems and Networks
Examiners: Prof. Jari Nurmi
M.Sc. Sajjad Nouri
Keywords: Software-Defined Radio, WLAN, OFDM, MIMO, Heterogeneous,
Application-specific Accelerator, Multicore, FFT, HARP, RISC Processor,
Network-on-Chip, CGRA, COFFEE, FPGA, Reconfiguration, Time Synchronization,
Frequency Offset Estimation, Channel Estimation, Symbols Demapping
In this thesis, the technical capability of a heterogeneous multicore platform is
evaluated. The choice of such an architecture is generally based on a diverse set of
applications, which can be parallel or serial in nature. Applications are typically
evaluated using performance metrics such as resource utilization and execution time.
Wireless communication systems increasingly accelerate the execution of their
functions in both software and hardware. Embedded systems that involve several types
of communication systems perform a large number of computations, which require short
execution times and minimal power consumption. There is also a growing demand for
application-specific accelerators that assist general-purpose processors. One
feasible approach is to use heterogeneous multicore platforms. Furthermore, in such
platforms many application-specific accelerators are loosely connected with each
other.
In this study, the implementation of a Multiple-Input Multiple-Output (MIMO)
Orthogonal Frequency Division Multiplexing (OFDM) receiver on a Heterogeneous
Multicore Architecture (HMA) is evaluated. The MIMO OFDM receiver is composed of
both computationally intensive and general-purpose processing tasks and therefore
provides broad coverage for evaluating the HMA. The receiver blocks are designed by
crafting template-based Coarse-Grained Reconfigurable Array (CGRA) devices. In this
case study, four streams (antennas) are processed simultaneously over the CGRAs. The
HMA nodes are reconfigured at run-time for the different blocks of the receiver. In
this experimental work, the performance of each CGRA, the collective performance of
the entire platform and the Network-on-Chip (NoC) traffic are recorded in terms of
the number of clock cycles and several high-level performance criteria. For the
implementation of the OFDM receiver, the CGRAs are scaled to various dimensions.
Data can also be exchanged independently between the nodes of the NoC by using
direct memory access (DMA) devices.
PREFACE
The thesis work presented here was carried out in the Laboratory of Electronics
and Communications Engineering at Tampere University of Technology, Tampere,
Finland, in pursuit of a Master of Science degree in the Information Technology
Program in 2018.
First, I would like to thank my mother Soudabeh Memar, my father Asadollah
Hosseinvand and my sister Maryam Hosseinvand for their patience, constant effort,
passionate love and tremendous support at every moment of my life, which brought me
to the stage at which I was able to conduct and write this thesis. It is clear to
me that I owe all my accomplishments to them; without their support, none of this
would have been possible.
I would like to express my deepest gratitude and respect to my supervisor Prof.
Jari Nurmi, who made this research work at Tampere University of Technology
possible and welcomed me into his research group. His expertise, motivation and
patience were of great value to my Master's thesis. In addition, I appreciate his
support, which helped me complete it.
I am also thankful to M.Sc. Sajjad Nouri, who guided me through the first steps,
was a helpful colleague, introduced me to a new thesis topic and collaborated in
this research work under his supervision. His support and advice made this thesis
possible. It was a great pleasure to work under his supervision, and I had the
chance to learn a lot from him.
I am truly thankful to Prof. Roberto Garello, my supervisor at Politecnico di
Torino, Turin, Italy, for giving me the opportunity to do my thesis abroad, for
believing in my abilities, for his enthusiastic support and for always guiding me
towards the best decision. I also thank Dr. Daniel Riviello for providing valuable
comments throughout this research.
I would like to express my warmest thanks to my dear friend Yekta Lajevardi for
her endless support and for helping me to stay positive and focused. I am thankful
to her for being an honest and lovely friend.
I am also grateful to my friend Javad Malek Shahkoohi for his long-standing support
and the advice he gave me through all the ups and downs of my studies. He has been
a very caring and amazing friend.

TABLE OF CONTENTS
1. Introduction
1.1 Objective and Scope of Thesis
1.2 Thesis Outline
2. Literature review
2.1 Processor/Co-processor Models
2.2 Reconfigurable Devices
2.2.1 Fine-Grained Devices
2.2.2 Middle-Grained Devices
2.2.3 Coarse-Grained Devices
2.3 Multi-core Platforms
2.3.1 MORPHEUS
2.3.2 P2012
2.3.3 NineSilica
2.3.4 RAW
3. OFDM WLAN Overview
3.1 MAC Frame Structure for WLAN Standards
3.2 Physical Layer Specifications for WLAN Standards
3.2.1 Time Synchronization
3.2.2 Frequency Offset Estimation
3.2.3 FFT
3.2.4 Channel Estimation
3.2.5 Symbols Demapping
4. Platform Architectures
4.1 Coarse-Grained Reconfigurable Arrays
4.1.1 CGRA Execution Flow
4.2 Heterogeneous Accelerator-Rich Platform
4.2.1 Internal Structure of NoC
5. Design and Implementation of IEEE 802.11n on Template-Based CGRA
5.1 Time Synchronization
5.2 Frequency Offset Estimation
5.3 Fast Fourier Transform
5.4 Channel Estimation
5.5 Symbols Demapping
6. Integration of Baseband Processing Blocks on HARP
7. Measurements and Estimation
8. Conclusions and Future Work
Bibliography
LIST OF FIGURES
2.1 MORPHEUS architecture
3.1 IEEE 802.11n PPDU formats in Legacy, Mixed and Green-field
3.2 PLCP Preamble for OFDM training structure
3.3 Subcarrier frequency allocation for 40.0 MHz with 128 subcarriers
3.4 Block diagram of IEEE 802.11n transmitter [44]
3.5 Block diagram of IEEE 802.11n receiver [10]
3.6 Natural order and Gray coding of QAM modulation
3.7 Cyclic Prefix (CP) in OFDM Symbol
3.8 Transmit spectrum of OFDM (PDS) based on IEEE 802.11n standard
3.9 Block diagram of correlation algorithm for time synchronization
3.10 Cyclic prefix (CP) correlation along with SNR 20 dB
3.11 Linear interpolation algorithm to perform the channel estimation
4.1 The scalable template-based CGRA architecture
4.2 Heterogeneous Accelerator-Rich Platform (HARP) [33]
4.3 A view of master and slave node of HARP [78]
5.1 Second context for the calculation of the correlations
5.2 Third context for the calculation of the correlations
5.3 The context for the multiplication between a signal and its complex conjugation
5.4 Fast Fourier Transform
5.5 The first context includes four radix-2 butterflies
5.6 A radix-4 butterfly for the second context
5.7 Linear Interpolation algorithm based on pilot-assisted for Channel Estimation
5.8 Second context of the channel estimation
5.9 First context of the Linear Interpolation
5.10 Second context of the Linear Interpolation
5.11 First context of the Newton-Raphson method
5.12 Second context of the Newton-Raphson method
5.13 Sixth context of the channel estimation
5.14 Seventh context of the channel estimation
5.15 Decision regions of 64-QAM Gray-coded constellation
6.1 Abridged general view of IEEE 802.11n MIMO receiver on HARP platform
LIST OF TABLES
3.1 Standards for OFDM WLANs [42]
3.2 Pilot specific values for 40.0 MHz [44]
3.3 IEEE 802.11n OFDM parameter values [45]
5.1 Different types and lengths of FFT and their complexity in number of stages and in number of operations per butterfly [10]
5.2 Clock cycles (cc) based on the type of FFT accelerator and length [10]
5.3 64-QAM constellation mapping with Gray coding
6.1 The required clock cycles at different stages for data transfer and processing. In the table, D. Mem, Trans and Exe. refer to Data memory, Transfer and Execution respectively, while clock cycles marked with a * sign indicate data transfer from CGRA to the node's data memory
7.1 Summary of resource utilization based on a node-by-node breakdown for the Stratix-V (5SGSED8N3F45I3YY) FPGA device
7.2 Dynamic power of each CGRA node and the NoC
LIST OF ABBREVIATIONS AND SYMBOLS
ACK ACKnowledgment
ADC Analog-to-Digital Converter
AGC Automatic Gain Control
ALM Adaptive Logic Module
ALU Arithmetic Logic Unit
ALUT Advanced Look-Up Table
ASIC Application Speciﬁc Integrated Circuit
ASK Amplitude Shift Keying
ATM Asynchronous Transfer Mode
AWGN Additive White Gaussian Noise
BER Bit Error Rate
BPSK Binary Phase Shift Keying
CC Clock Cycle
CCB Core Conﬁguration Block
CE Channel Estimation
CFO Carrier Frequency Oﬀset
CGRA Coarse Grain Reconﬁgurable Array
CIR Channel Impulse Response
CISC Complex Instruction Set Computing
CP Cyclic Preﬁx
CPU Central Processing Unit
CREMA Coarse grain REconﬁgurable array with Mapping Adaptiveness
DAC Digital-to-Analog Converter
DC Delay and Correlate
DFT Discrete Fourier Transform
DMA Direct Memory Access
DSP Digital Signal Processor
FEC Forward Error Correction Code
FireTool FIeld programming and REconﬁguration management Tool
FPU Floating Point Unit
FF Flip Flop
FFT Fast Fourier Transform
FOE Frequency Oﬀset Estimation
FIR Finite Impulse Response
FPGA Field Programmable Gate Array
FSK Frequency Shift Keying
FSM Finite State Machine
FU Functional Unit
Gbps Giga Bit Per Second
GCC GNU Compiler Collection
GI Guard Interval
GOPS Giga Operation Per Second
GPP General Purpose Processor
GUI Graphical User Interface
HDD Hard Decision-based Detection
HDL Hardware Description Language
HMA Heterogeneous Multicore Architecture
HW HardWare
I In-Phase
IEEE Institute of Electrical and Electronics Engineers
IDFT Inverse Discrete Fourier Transform
IFFT Inverse Fast Fourier Transform
I/O Input/Output
IP Intellectual Property
I/Q In-phase and Quadrature-phase
ISI Inter-Symbol Interference
LTS Long Training Symbols
LUT Look Up Table
MAC Medium Access Control
MCM Multi-Carrier Modulation
MIMD Multiple-Instruction Multiple-Data
MIMO Multiple-Input Multiple-Output
ML Maximum Likelihood
MPSoC Multi Processor System-on-Chip
MT Mobile Terminal
MVM Matrix Vector Multiplication
NOP No-OPeration
NoC Network-on-Chip
OFDM Orthogonal Frequency Division Multiplexing
PAPR Peak-to-Average Power Ratio
PCB Peripheral Control Block
PE Processing Elements
PN Pseudo Noise
PPDU Physical Protocol Data Unit
PSDU Physical layer Service Data Unit
PSK Phase Shift Keying
PTS Partial Transmit Sequences
Q Quadrature-phase
QAM Quadrature Amplitude Modulation
QPSK Quadrature Phase Shift Keying
RF Radio Frequency
RISC Reduced Instruction Set Computing
RPU Reconﬁgurable Processing Unit
RTL Register Transfer Level
RTOS Real-Time Operating System
SDD Soft Decision-based Detection
SDR Software Deﬁned Radio
SER Symbol Error Rate
SIMD Single-Instruction Multiple-Data
SISO Single-Input Single-Output
SLM SeLected Mapping
SNR Signal-to-Noise Ratio
SoC System-on-Chip
STS Short Training Symbols
TUT Tampere University of Technology
TS Time Synchronization
VC Virtual Carrier
VHDL Very-high-speed integrated circuit Hardware Description Language
VLIW Very Long Instruction Word
WiFi Wireless Fidelity
WLAN Wireless Local Area Network
1. INTRODUCTION
Today's society relies on computers more than ever before. Their use in everyday
life has led to scientific advancements, events and new fields of science, which
have shaped the modern world. Communication is an integral part of modern life and
has evolved greatly over the years. Mobile phones, one of the most recent means of
communication, are categorized as embedded systems. Embedded processors are dominant
in communication systems such as mobile phones, in which short execution time is
highly desired, and they are expected to run many concurrent applications. Since
many embedded applications are computationally intensive, reconfigurability is one
way to achieve short execution times. At the same time, power dissipation must be
limited. For years, system performance was increased by scaling the clock frequency
of single-core architectures, which led to a considerable increase in power
dissipation. This issue caused vendors to offer multicore systems. However,
Dark-Silicon [37] is a critical issue in multicore systems. The Dark-Silicon
challenge can be addressed by introducing many application-specific accelerators
that provide high performance at low frequencies.
This thesis focuses on template-based Coarse-Grained Reconfigurable Arrays (CGRAs)
as a means of generating special-purpose accelerators. These platforms have
relatively low power consumption: CGRAs operate at very low frequencies while still
yielding large performance improvements. CGRAs are reconfigurable and are programmed
in a high-level language such as C. They also contain many specialized accelerators
and are designed to perform massively parallel tasks in critical applications.
Moreover, there are many heterogeneous platforms with almost identical design
philosophies, e.g. NineSilica [1], Platform 2012 [2] and MORPHEUS ([3], [4]).
Following the guidelines published in [5], this study exercises a heterogeneous
platform under well-defined test conditions in order to identify possible
architectural problems. The test cases are Orthogonal Frequency Division
Multiplexing (OFDM) receiver blocks, which are computationally complex tasks; Time
Synchronization (TS) and the Fast Fourier Transform (FFT) are examples of such
computationally intensive tasks. In addition, the OFDM application was used to
evaluate the platform's general-purpose processing capability with algorithms such
as CORDIC and the Taylor series. The RISC processor performs the general-purpose
tasks while the OFDM-specific tasks are carried out by the CGRAs, and the two are
interconnected by a Network-on-Chip (NoC) to provide complete OFDM functionality.
The Heterogeneous Multicore Architecture (HMA) platform used in this thesis has been
designed to maximize computing resources and thereby enhance the performance of many
algorithms of various types. The Heterogeneous Accelerator-Rich Platform (HARP) is a
specific instance of an HMA platform. It has nine nodes placed in a 3×3 topology.
The HARP platform includes a RISC core that performs two specific tasks: it acts as
the system controller, and it distributes the configuration streams and the data to
all other nodes, which are interconnected through the NoC. After configuration and
data distribution, the RISC core performs general computations and enforces
synchronization between the nodes. All OFDM receiver block accelerators are designed
on the basis of template-based CGRAs. Lastly, the OFDM receiver is evaluated to
establish that the platform supports many communication-system algorithms.
1.1 Objective and Scope of Thesis
This thesis studies the feasibility of designing and implementing scalable CGRAs for
MIMO OFDM on a heterogeneous accelerator-rich platform. The research work explores
general issues as well as the generation of application-specific accelerators for
Software-Defined Radio (SDR) baseband processing using the CGRA template. The
accelerators are integrated with each other and with a RISC core over a NoC. The
work extends to the measurement, estimation and mapping of intensive signal
processing algorithms. The main objective of the thesis is to design and implement
specific accelerators for the baseband of an OFDM receiver in a MIMO setup. For each
OFDM receiver block, the performance of the designed accelerator is reported in
terms of clock cycles, resource utilization and the maximum operating frequency
obtained by synthesis for the Altera Stratix-IV FPGA family.
1.2 Thesis Outline
The thesis is organized as follows. Chapter 2 reviews the literature. Chapter 3
presents the main parts of an OFDM system based on the IEEE 802.11n standard and
explains various approaches for each OFDM receiver block. Chapter 4 describes the
platform architecture and the template-based CGRAs, together with the overall
operation of the HARP platform and the nodes of its NoC. Chapter 5 explains the
design and implementation of the MIMO OFDM receiver blocks on template-based CGRAs.
Chapter 6 then explains the integration of the baseband processing blocks on HARP
and the distribution of data between the different nodes. Chapter 7 covers the
measurement and estimation of HARP performance metrics at different levels. Finally,
Chapter 8 presents the conclusions and future work.
2. LITERATURE REVIEW
Nowadays, computationally intensive tasks are allocated to a Multi-Processor
System-on-Chip (MPSoC) or to accelerators based on the processor/co-processor model.
The Coarse-Grained Reconfigurable Array (CGRA) is one of the most powerful classes
of accelerators and suits signal processing applications by providing high
throughput and parallelism. CGRAs contain a large number of gates, which is
justified when they are in use most of the time [7]. BUTTER [8], Morphosys [12],
ADRES [13] and PACT-XPP [14] are examples of CGRAs. In order to run different
applications at the same time, CGRAs working as co-processors should be combined
into a heterogeneous multicore platform [15].
2.1 Processor/Co-processor Models
This section briefly discusses processor and co-processor models. Historically,
single-core processors were used as a general-purpose solution for numerous
applications. They were also used in some proprietary accelerators, e.g. for audio
and video, as well as in computationally intensive applications. Over the years,
different processors have appeared for various sets of requirements. Very Long
Instruction Word (VLIW) machines have gained ground in large-scale parallel
applications [17]. Also, to support high-end mobile communication applications, a
combination of VLIW and Digital Signal Processor (DSP) architectures was developed
that supports multiple applications simultaneously [18].
Loosely Coupled (LC) accelerators communicate with the processor over a
low-bandwidth connection, which allows multiple accelerators to be connected to the
same processor. Such accelerators can be attached to a system bus, on a local node
or over a remote network. In this case, multiple accelerators can exchange data with
each other and can also work concurrently. The P2012 platform [2] is an example of a
loosely coupled architecture on a NoC.
In the Tightly Coupled (TC) model, accelerators have a high-bandwidth connection to
the processor, which enables faster data transmission as well as synchronization. In
this model, viable solutions are to use a dedicated co-processor bus or to integrate
the accelerator directly into the processor data-path. A CGRA is an example of an
accelerator that can be tightly coupled to a processor, and it can utilize a network
of switched interconnections.
2.2 Reconﬁgurable Devices
In recent years, reconfigurable architectures have become popular platforms because
of their capability to perform computational tasks. They offer different levels of
parallelism, and their functionality can be modified at run-time for various
applications. The most significant characteristics of reconfigurable computing
systems are as follows [20].
• Reconﬁgurability: This refers to altering the internal architecture for the
purpose of running various applications at a high degree of performance.
• Computation Model: The computational models such as Single-Instruction
Multiple-Data (SIMD) or Multiple-Instruction Multiple-Data (MIMD) can be
used. Moreover, some systems may follow the Very Long Instruction Word
(VLIW) model.
• Granularity: Refers to the data size for operations of Reconﬁgurable Proces-
sing Unit (RPU) of a system.
Reconfigurable devices come in a wide range of models and can be categorized
according to their granularity level into three classes: Fine-Grained,
Middle-Grained and Coarse-Grained [3].
2.2.1 Fine-Grained Devices
FPGAs have been on the market for a few decades and are well suited as fine-grained
reconfigurable architectures. The Logic Element (LE) is the smallest processing unit
of an FPGA and is composed of a Look-Up Table (LUT), a few Flip-Flops (FFs), 2-to-1
multiplexers and some logic gates. Two notable FPGA vendors are Xilinx [22] and
Altera [21]. Altera's tools aim at higher synthesis frequencies, while Xilinx
focuses on resource utilization [23]. The MOLEN model is another fine-grained
device, which operates as a co-processor to a General-Purpose Processor (GPP)
([24], [25]).
2.2.2 Middle-Grained Devices
The word length of middle-grained devices is less than or equal to 8 bits. On the
other hand, computations with irregular sub-word lengths complicate the mapping of
an algorithm when the processing width is increased. Nevertheless, this model offers
a good compromise between area and performance. One middle-grained device is
PiCoGA-III, which includes a matrix of Reconfigurable Data-path Units (RDUs), each
composed of a 4-bit LUT, a 4-bit ALU and a 4-bit integer and Galois field
multiplier [27], [28].
2.2.3 Coarse-Grained Devices
CGRAs are among the most successful platforms in academic research and in industry,
owing to their coarse granularity and to the number of diverse applications that can
be targeted on them without difficulty. CGRAs have a record of processing numerous
data-parallel applications in academic research, e.g. image and video processing
([8], [29]), Finite Impulse Response (FIR) filtering [30], Wideband Code Division
Multiple Access (WCDMA) cell search [31] and Turbo Codes [32]. CGRAs provide access
to a large bandwidth and high throughput. However, they can occupy an area of a few
million gates and have potentially high transient power dissipation [26]. XPP-III is
one example of a coarse-grained device. Another CGRA platform is the Architecture
for Dynamically Reconfigurable Embedded Systems (ADRES), a reconfigurable array of
8×8 elements that is tightly integrated with a VLIW processor [13]. Each processing
element in ADRES includes Functional Units (FUs) and Register Files (RFs) and is
connected in a mesh topology. Particular instances of ADRES can be generated using
an XML-based architecture specification language.
2.3 Multi-core Platforms
This section focuses on multi-core platforms, covering both homogeneous and
heterogeneous models. In a homogeneous platform all cores are identical, whereas in
a heterogeneous platform they are of different types. Both kinds of platform are
C-programmable, but the cores of a heterogeneous platform may need customized tools
with data-flow-level support in addition to a plain C compiler. The following
subsections introduce some state-of-the-art platforms in detail.

2.3.2 P2012
P2012 is another homogeneous multi-core platform. It includes 16 general-purpose
processors divided into four clusters that communicate with each other over a NoC
([2], [34]). The processors within a cluster are locally synchronous, while the
clusters are globally asynchronous with respect to each other [2]. The P2012
platform has been tested with signal processing algorithms.
2.3.3 NineSilica
NineSilica [1] is a homogeneous Multi-Processor System-on-Chip (MPSoC) platform. It
consists of nine processing nodes (PNs) arranged in a 3×3 mesh topology. The PNs are
interconnected by a hierarchical Network-on-Chip (NoC), and each node of the NoC
includes a 32-bit COFFEE RISC processor. All nodes can exchange data using packet
switching [26]. NineSilica is programmable in C. The NineSilica architecture shows
that an MPSoC can achieve high parallelization efficiency [35]. Many software-defined
radio kernels, such as correlations and the FFT, can be designed and implemented on
the NineSilica platform. The HARP platform used in this study is a heterogeneous
derivative of NineSilica.
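As a rough illustration of what a 3×3 mesh implies for data exchange between nodes, the sketch below lists the direct neighbours of a node and counts the minimum number of hops between two nodes. The row-major node numbering (0 to 8) and the function names are assumptions made only for this example; they are not taken from NineSilica or HARP, and the actual routing scheme may differ.

```python
def mesh_neighbors(node, rows=3, cols=3):
    """Return the nodes directly linked to `node` in a rows x cols mesh.

    Nodes are numbered row by row (an assumed convention for this sketch).
    """
    r, c = divmod(node, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return sorted(rr * cols + cc
                  for rr, cc in candidates
                  if 0 <= rr < rows and 0 <= cc < cols)


def hop_distance(a, b, cols=3):
    """Minimum number of mesh links between nodes a and b (Manhattan distance)."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    return abs(ra - rb) + abs(ca - cb)


centre_links = mesh_neighbors(4)   # centre node of the 3x3 mesh
corner_links = mesh_neighbors(0)   # corner node
far_corner_hops = hop_distance(0, 8)
```

In such a mesh the centre node has four direct links while a corner node has only two, which is one reason the placement of a controller or a heavily used accelerator can affect NoC traffic.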
2.3.4 RAW
The Reconfigurable Architecture Workstation (RAW) is composed of 16 slices of 32-bit
MIPS2000 processors arranged in a 4×4 array ([36], [37]). A NoC facilitates
communication between the processors. RAW provides both a static network (routes
determined at compile-time) and a dynamic network (routes determined at run-time,
using wormhole routing for data forwarding) [38]. Its characteristics are similar to
those of a reconfigurable fabric. Furthermore, the programmable NoC in RAW employs
only one communication resource, which removes the wire selection problem from
routing [38].
3. OFDM WLAN OVERVIEW
Orthogonal Frequency-Division Multiplexing (OFDM) is a multi-carrier transmission
technique that divides the available bandwidth into several narrow-band channels
with orthogonal carriers, i.e. it multiplexes data by frequency division.
Orthogonality here has a precise mathematical meaning: two subcarriers are
orthogonal if the integral of their product over one symbol period is zero. OFDM is
a form of digital multi-carrier modulation that allows data rates close to the
Shannon limit [42]. Its advantage is that the data are sent in parallel and that
frequency-selective fading can be overcome, because each part of the data is carried
over a narrow frequency band. Within such a narrow band the fading appears
practically flat and can be compensated before the signal is finally extracted.
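The orthogonality condition can be written out as a short worked equation. This is a sketch under the assumption of complex-exponential subcarriers with integer indices k and l, spaced by 1/T, where T is the symbol period; the symbols k, l and T are introduced here only for illustration.

```latex
\frac{1}{T}\int_{0}^{T} e^{j 2\pi k t / T}\, e^{-j 2\pi l t / T}\, dt
  = \frac{1}{T}\int_{0}^{T} e^{j 2\pi (k-l) t / T}\, dt
  = \begin{cases} 1, & k = l \\ 0, & k \neq l \end{cases}
```

For k ≠ l the integrand completes |k − l| full cycles within T, so its integral vanishes, and a receiver correlating with subcarrier l sees no contribution from subcarrier k.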
The main benefits of OFDM are its robustness against frequency-selective fading
caused by multi-path propagation in wireless communication systems, its robustness
against narrowband interference, and the reduction of Inter-Symbol Interference
(ISI). The ISI is decreased because several symbols are transmitted in parallel,
which increases the symbol duration [43]. OFDM splits a high-bit-rate encoded data
stream into several lower-bit-rate streams and transmits them in parallel on
different subcarriers, all of which are orthogonal to each other [39]. Note that, in
order to maintain orthogonality, both the transmitter and the receiver must use the
same modulation method.
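The parallel transmission on orthogonal subcarriers can be sketched with a naive inverse DFT at the transmitter and a DFT at the receiver. This is a minimal illustration over an ideal channel, assuming 8 subcarriers and QPSK symbols chosen arbitrarily; the function names are hypothetical, and a real IEEE 802.11n implementation would use an FFT with the subcarrier layout defined by the standard.

```python
import cmath


def idft(freq_symbols):
    """Naive inverse DFT: builds one time-domain OFDM symbol from N subcarrier values."""
    n = len(freq_symbols)
    return [
        sum(freq_symbols[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def dft(time_samples):
    """Naive DFT: recovers the value carried by each subcarrier."""
    n = len(time_samples)
    return [
        sum(time_samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# One low-rate QPSK symbol per subcarrier (illustrative values).
qpsk = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j, 1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
time_domain = idft(qpsk)      # transmitter side
recovered = dft(time_domain)  # receiver side, ideal channel
```

Because the subcarriers are orthogonal, each recovered value matches the transmitted one up to floating-point rounding, even though all subcarriers overlap in time.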
The benefits of OFDM are high spectral efficiency, adaptive modulation and
robustness against narrow-band co-channel interference [40]. Its disadvantages
include the loss of efficiency due to the Cyclic Prefix (CP) and sensitivity to
Doppler shift [40]. The following section describes the OFDM structure, consisting
of transmitter, channel and receiver, based on the IEEE 802.11n specifications.
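To put the cyclic-prefix overhead into numbers, the short sketch below computes the fraction of each OFDM symbol that carries data. The useful symbol duration of 3.2 µs and the guard intervals of 0.8 µs and 0.4 µs are assumed here as the usual IEEE 802.11n values; they are not taken from this section, and the function name is illustrative.

```python
def cp_efficiency(useful_us, guard_us):
    """Fraction of the total OFDM symbol time that carries data."""
    return useful_us / (useful_us + guard_us)


long_gi = cp_efficiency(3.2, 0.8)   # assumed long guard interval
short_gi = cp_efficiency(3.2, 0.4)  # assumed short guard interval
```

Under these assumptions about one fifth of the air time is spent on the cyclic prefix with the long guard interval, and about one ninth with the short one.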

3.1 MAC Frame Structure for WLAN Standards
Based on this standard, the receiver generally carries out Time Synchronization
(TS), Frequency Offset Estimation (FOE), Channel Estimation (CE), FFT and Symbols
Demapping. The predefined samples in the preamble are known to the receiver [33].
The short training symbols can be used for packet detection, frequency offset
estimation and timing synchronization, while the long training symbols are used for
channel estimation. The following subsections give a detailed explanation of these
operations [33].
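Packet detection from the repeated short training symbols is commonly done with a delay-and-correlate scheme, which the sketch below illustrates. It assumes a repetition period of 16 samples (the short training symbol length at a 20 MHz sampling rate), a 32-sample window and a threshold of 0.5; these parameters, the toy input and the function names are illustrative assumptions and do not describe the implementation developed later in this thesis.

```python
import cmath


def delay_and_correlate(samples, delay=16, window=32):
    """Normalized delay-and-correlate metric |C(n)|^2 / P(n)^2 per start index n.

    C(n) correlates the signal with a copy of itself delayed by `delay` samples;
    P(n) is the energy of the delayed window, used for normalization.
    """
    metrics = []
    for n in range(len(samples) - delay - window + 1):
        c = sum(samples[n + k] * samples[n + k + delay].conjugate()
                for k in range(window))
        p = sum(abs(samples[n + k + delay]) ** 2 for k in range(window))
        metrics.append(abs(c) ** 2 / (p * p) if p > 0 else 0.0)
    return metrics


def detect_packet(metrics, threshold=0.5):
    """Index of the first metric above the threshold, or None if there is none."""
    for n, m in enumerate(metrics):
        if m > threshold:
            return n
    return None


# Toy input: 48 silent samples followed by ten repetitions of a 16-sample pattern.
pattern = [cmath.exp(2j * cmath.pi * k * k / 16) for k in range(16)]
signal = [0j] * 48 + pattern * 10
metrics = delay_and_correlate(signal)
start = detect_packet(metrics)
```

The metric stays at zero while the window contains only silence and rises to one once both the window and its delayed copy lie inside the periodic preamble; with noise present it would fluctuate around a low value before the packet arrives.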
The structure of the IEEE 802.11n MAC frame is illustrated in Figure 3.1. The
IEEE 802.11n standard supports the legacy IEEE 802.11a/b/g Physical Protocol
Data Unit (PPDU) formats. In the IEEE 802.11n specifications, the PPDU can
have several formats [44], depending on the capabilities of the transmitting
device:
• Non-High Throughput (Non-HT) Legacy mode: The preamble is composed of
short and long training symbols. Support for this format is compulsory in the
IEEE 802.11n standard. It may occur with either a 20.0 MHz or a 40.0 MHz
bandwidth.
◦ 20.0 MHz: The signal has 64 subcarriers with 4 pilots. Pilots are inserted
in subcarriers ±21 and ±7. In the legacy mode, the signal is transmitted
on subcarriers -26 to -1 and +1 to +26.
◦ 40.0 MHz: In this mode, two adjacent 20.0 MHz channels are employed.
The signal has 128 subcarriers with 6 pilots. Pilots are inserted in
subcarriers ±53, ±25 and ±11. The signal is transmitted on subcarriers
-58 to -2 and +2 to +58.
• HT-Mixed mode: The preamble consists of the Non-HT short and long training
symbols, which can be decoded by legacy IEEE 802.11a/g devices. The HT-Mixed
mode is compatible with IEEE 802.11a/g PLCP headers. Transmissions in this
mode can occur in both 20.0 MHz and 40.0 MHz channels.
• HT-Greenfield mode: The HT packet is transmitted without a legacy-compatible
part. As a result, the maximum data throughput is much higher.
The various parameters in the headers are [75]:
• Rate (4 bits): Signiﬁes the type of modulation (8 combinations) and data
Forward Error Correction (FEC) coding.

and 3.2 [44].
$$
\begin{aligned}
S_{\text{L-STF}}(-26,26) = \sqrt{\tfrac{1}{2}}\,\{&0, 0, 1+j, 0, 0, 0, -1-j, 0, 0, 0, 1+j, 0, 0, 0, -1-j, 0, 0, 0,\\
&-1-j, 0, 0, 0, 1+j, 0, 0, 0, 0, 0, 0, 0, -1-j, 0, 0, 0, -1-j,\\
&0, 0, 0, 1+j, 0, 0, 0, 1+j, 0, 0, 0, 1+j, 0, 0, 0, 1+j, 0, 0\}
\end{aligned}
\tag{3.1}
$$
$$
S_{\text{L-STF}}(-58,58) = \{S_{\text{L-STF}}(-26,26), 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, S_{\text{L-STF}}(-26,26)\}
\tag{3.2}
$$
The Long Training Sequence (LTS) for the 40.0 MHz bandwidth mode is another preamble
sequence. Like the sequence in Figure 3.2, it has two long symbols, T1 and T2,
preceded by a guard interval (GI2) of 1.6 µs that copes with Inter-Symbol
Interference (ISI). T1 and T2 each take 3.2 µs, so the duration of the long
training sequence is 8.0 µs:
$$ T_{Long} = 1.6 + 2 \times 3.2 = 8.0\ \mu s $$
Furthermore, T1 and T2 are defined in the frequency domain with 115 modulated
subcarriers, instead of the 57 subcarriers used in 20.0 MHz (based on the
definitions of the legacy long training). The resulting sequence is as follows [45]:
$$
\begin{aligned}
S_{\text{L-LTF}}(-58,58) = \{&1, 1, -1, -1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1,\\
&-1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1, -1, -1, -1, -1, -1,\\
&1, 1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1, -1, -1, -1, 1, 0, 0, 0, -1, 1,\\
&1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1,\\
&1, -1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1, -1, -1, -1, -1, -1, 1,\\
&1, -1, -1, 1, -1, 1, -1, 1, 1, 1, 1\}
\end{aligned}
\tag{3.3}
$$
In the IEEE 802.11n specification, Channel Estimation (CE) uses the LTF fields at the
beginning of each packet. The LTF fields are also used for more precise time
synchronization and frequency offset estimation [43].
The next field in the PLCP header is the signal field, which contains information about
the TXVECTOR length and coding rate. In legacy mode, the signal field informs
the receiver about the type of modulation used in the system, the coding rate,
and the packet data length. The L-SIG field is composed of an OFDM

Table 3.2 Pilot specific values for 40.0 MHz [44]
(columns Ψ0 to Ψ5 denote Ψ^(NSTS)_(iSTS,0) to Ψ^(NSTS)_(iSTS,5))
NSTS  iSTS   Ψ0   Ψ1   Ψ2   Ψ3   Ψ4   Ψ5
1     1       1    1    1   -1   -1    1
2     1       1    1   -1   -1   -1   -1
2     2       1    1    1   -1    1    1
3     1       1    1   -1   -1   -1   -1
3     2       1    1    1   -1    1    1
3     3       1   -1    1   -1   -1    1
4     1       1    1   -1   -1   -1   -1
4     2       1    1    1   -1    1    1
4     3       1   -1    1   -1   -1    1
4     4      -1    1    1    1   -1    1
transform of sequence P−58,58 based on Equation 3.5.
$$
\begin{aligned}
P_{-58,58} = \{&0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,\\
&0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\\
&0, 0, 0, 0, 0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, 0, 0,\\
&0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0\}
\end{aligned}
\tag{3.5}
$$
Table 3.3 IEEE 802.11n OFDM parameter values [45]
Parameter                        Non-HT (20.0 MHz)            HT (20.0 MHz)        HT (40.0 MHz)
FFT length                       64                           64                   128
NSD: No. of data carriers        48                           52                   108
NSP: No. of pilot carriers       4                            4                    6
NST: Total no. of carriers       52                           56                   114
∆f: Carrier spacing              312.5 kHz                    312.5 kHz            312.5 kHz
TDFT: IDFT/DFT period            3.2 µs                       3.2 µs               3.2 µs
TGI: Guard interval              0.8 µs = TDFT/4              0.8 µs               0.8 µs
TGI2: Double GI                  1.6 µs                       1.6 µs               1.6 µs
TL-STF: L-STF duration           8.0 µs = 10 × TDFT/4         8.0 µs               8.0 µs
TL-LTF: L-LTF duration           8.0 µs = 2 × TDFT + TGI2     8.0 µs               8.0 µs
TSYM: OFDM symbol duration       4.0 µs = TDFT + TGI          4.0 µs               4.0 µs
TL-SIG: L-SIG field duration     4.0 µs = TSYM                4.0 µs               4.0 µs
THT-SIG: HT-SIG field duration   N/A                          8.0 µs = 2 TSYM      8.0 µs
THT-STF: HT-STF field duration   N/A                          4.0 µs               4.0 µs
THT-LTF: HT-LTF field duration   N/A                          4.0 or 8.0 µs        4.0 or 8.0 µs
The data can be modulated with BPSK, QPSK, 16-QAM or 64-QAM. The modulation
is the same for each burst, as is the IFFT length per symbol, which is
128. Based on the 40.0MHz transmission for all 128 subcarriers, the frequency range



3.2. Physical Layer Speciﬁcations for WLAN Standards 19
coordinated by a GI or CP to resist Inter-Symbol Interference (ISI) as well as time
synchronization errors [45]. ISI is essentially created by receiving multiple
copies of the transmitted signal due to multi-path effects and channel dispersion [47].
To clarify the issue, assume that there are two consecutive OFDM symbols: on
reception, the last part of the first OFDM symbol interferes with the first part of
the second OFDM symbol. Under these conditions, the amplitude and phase of the
subcarriers may deviate. The cyclic prefix solves this problem: the delayed portion
of the first OFDM symbol is absorbed by the cyclic prefix of the second OFDM
symbol [48]. Windowing is optionally used to smooth the edges of each symbol in
order to increase the spectral decay. After the CP is added, the preambles, which
consist of short and long training symbols, are produced. Before the signal is
transmitted over the air interface by the antennas, it must be converted from
digital to analog by a digital-to-analog converter (DAC). After the samples pass
through the DAC, a reconstruction filter is needed to eliminate the replicas of the
spectrum, which makes the design of this model much simpler.
In cellular wireless communications, the transmission channel causes various
unwanted changes in the information signal as a result of reflections and
diffractions. These changes may cause noise, interference, cancellation and
distortion in the system. The channel can be described as a linear time-invariant
transfer function with Additive White Gaussian Noise (AWGN). The received signal is
y(t) = x(t) + n(t), because the noise n(t) is added to the original transmitted
signal x(t). The signal-to-noise ratio (SNR), the ratio of signal strength to noise,
is measured in decibels (dB). Based on Equation 3.9, the SNR is the ratio between
the power of the transmitted signal and the power of the undesirable noise.
$$
SNR_{dB} = 10 \log_{10}\left(\frac{P_{signal}}{P_{noise}}\right)
= 10 \log_{10}(P_{signal}) - 10 \log_{10}(P_{noise})
\tag{3.9}
$$
Figure 3.8 shows the Power Spectral Density (PSD) of the OFDM system for a given
SNR according to IEEE 802.11n. The PSD is the frequency response of a random or
periodic signal. It tells us where the average power is distributed as a function of
frequency. In other words, the PSD is the distribution of power over frequency, and
it can be calculated as the Fourier Transform of the auto-correlation function of
the signal. The quality of the signal improves as the SNR increases.
The ﬁrst parts of the receiver are mostly used to detect the synchronization, estimate

transmitted in each stream with various polarities, making them orthogonal to each
other, so that the receiver can perform channel estimation for each subcarrier
separately. If the pilots are the same in all streams, they are not orthogonal to
each other. Furthermore, the channel at the pilot subcarriers cannot be estimated
from the VHT-LTF in the same way as for the other subcarriers. As an alternative,
interpolation between the surrounding subcarriers is used to estimate the channel
at the pilot subcarriers.
In the following receiver block, a linear equalization algorithm applies the inverse
of the channel frequency response to the received signal using a Zero-Forcing
equalizer [45]. This equalizer eliminates the ICI completely and is ideal when the
channel is noiseless. When the channel is noisy, the zero-forcing equalizer strongly
amplifies the noise at frequencies where the channel response has a low magnitude.
The equalizer converts a number of multiplexed received chains into a number of
equalized space-time streams, where each space-time stream provides a pilot tracker.
In the next block, the demodulator extracts the original transmitted bits from
the received modulated constellations, and the deinterleaver reverses the process of
interleaving. The interleaver block interleaves the bits of each spatial stream to
prevent long sequences of adjacent noisy bits; the block deinterleaver performs the
inverse operation. The demodulated bits are then passed through the stream
deparser. In the last stages, the data bits are depunctured by inserting dummy
zeros in the locations where encoded bits were punctured, and the symbols are
transformed into a bit stream.
3.2.1 Time Synchronization
In OFDM systems, the time estimation block is used for two specific tasks: packet
detection and symbol timing synchronization. Packet detection is required when
there is no information about the starting point of the received packet. Time
synchronization is needed to find the exact start point of the OFDM symbols, which
determines the true position of the FFT window [55]. A correlation algorithm
measures the similarity between two signals. There are two types of correlation:
auto-correlation and cross-correlation.
An auto-correlation algorithm computes the correlation between a signal and its
delayed or shifted version. A cross-correlation algorithm computes the correlation
between two different signals. Also, correlation algorithm is intensive when it is


3.2.2 Frequency Offset Estimation
FOE is discussed in this section according to the IEEE 802.11n standard. An OFDM
waveform is made of multiple sinusoidal components. Before transmission, the signal
is upconverted to the carrier frequency. At the receiver, the received signal is
downconverted from the same carrier frequency prior to demodulation. Sensitivity to
carrier frequency offset, which arises from device impairments, is one of the
drawbacks of OFDM [42]. Based on Equation 3.15, f∆ is the difference between the
carrier frequencies on the transmitter and receiver side.
$$ f_\Delta = f_{Tx} - f_{Rx} \tag{3.15} $$
A Carrier Frequency Offset (CFO) in OFDM systems arises either from a mismatch
between the oscillator frequencies of the transceivers or from the Doppler
spread [56]. As a result of the CFO, a rotation of the demodulated symbols in the
constellation or ISI [53] can be observed. Frequency synchronization should be
performed very precisely at the receiver to avoid losing orthogonality between the
subcarriers. The frequency offset can be measured in the time domain by applying a
maximum likelihood estimator. For this purpose, the short training sequences, with
a duration of 0.8 µs each, can be used. Let x_n be the transmitted signal; then the
passband signal y_n can be modeled from the complex baseband one as
$$ y_n = x_n e^{j 2\pi f_{Tx} n T_s} \tag{3.16} $$
Here f_Tx is the carrier frequency of the transmitter. As mentioned before, the
received signal is downconverted with a carrier frequency f_Rx to the baseband
signal r_n given by Equation 3.17, where s_n denotes the transmitted baseband
signal (x_n in Equation 3.16) and f∆ is the frequency offset.
$$ r_n = s_n e^{j 2\pi f_\Delta n T_s} \tag{3.17} $$
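The frequency offset, obtained from Equation 3.18, is calculated by the same
delay-and-correlate method.
$$
\begin{aligned}
y_{\hat{\tau}} &= \sum_{n=0}^{L-1} r_n r^*_{n+D}
= \sum_{n=0}^{L-1} s_n s^*_{n+D}\, e^{j 2\pi f_\Delta n T_s} e^{-j 2\pi f_\Delta (n+D) T_s} \\
&= e^{-j 2\pi f_\Delta D T_s} \sum_{n=0}^{L-1} |s_n|^2
\end{aligned}
\tag{3.18}
$$
The last step uses the periodicity of the short training symbols, s_{n+D} = s_n.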
Here D is 32 (0.8 µs × 40 MHz sampling frequency f_s) according to the IEEE 802.11n
standard. The FOE is obtained from Equation 3.19 once the received signal has been
multiplied by the complex conjugate of its delayed version.
$$ \hat{f}_\Delta = -\frac{1}{2\pi D T_s} \angle y_{\hat{\tau}} \tag{3.19} $$
Here T_s is the sampling period and ∠ denotes the angle of y_τ̂, the correlation
output of the previous equation. Based on Equation 3.20, the frequency offset
correction is obtained from the Frequency Offset Estimation (FOE) and multiplied by
the received signal. Here r'_n is the corrected signal, n is the sample index and
N is the number of samples in a symbol.
$$ r'_n = r_n \times e^{-j 2\pi f_\Delta \frac{n}{N}} \tag{3.20} $$
3.2.3 FFT
The most time-consuming and computationally intensive block is the FFT. According
to the IEEE 802.11n standard, the 128-point FFT has to be computed within 4 µs [12].
The DFT is given by Equation 3.21 [57].
$$ X[k] = \sum_{n=0}^{N-1} x[n] W_N^{nk} \tag{3.21} $$
where k = 0, 1, 2, ..., N − 1 and $W_N^{nk} = e^{-j 2\pi \frac{nk}{N}}$ is the twiddle
factor. The complexity of the DFT is $O(N^2)$, whereas that of the FFT is
$O(\frac{N}{2} \log_r N)$. For radix-2, the FFT block is expressed by Equation 3.22.
$$
X[k] = W_N^{k} \sum_{m=0}^{\frac{N}{2}-1} x[2m+1] W_N^{2mk}
+ \sum_{m=0}^{\frac{N}{2}-1} x[2m] W_N^{2mk}
\tag{3.22}
$$
Also, Equation 3.23 is used for radix-4.
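$$
\begin{aligned}
X[k] ={}& \sum_{n=0}^{\frac{N}{4}-1} x[n] W_N^{nk}
+ W_N^{\frac{Nk}{4}} \sum_{n=0}^{\frac{N}{4}-1} x\left[n+\tfrac{N}{4}\right] W_N^{nk} \\
&+ W_N^{\frac{Nk}{2}} \sum_{n=0}^{\frac{N}{4}-1} x\left[n+\tfrac{N}{2}\right] W_N^{nk}
+ W_N^{\frac{3Nk}{4}} \sum_{n=0}^{\frac{N}{4}-1} x\left[n+\tfrac{3N}{4}\right] W_N^{nk}
\end{aligned}
\tag{3.23}
$$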

Here P_Rx refers to the received pilot, which may be noisy, H̃_j represents the
channel impulse response (CIR) estimate at the received pilot, and j stands for the
pilot number. One method for estimating the channel is the linear interpolation
algorithm, which can be seen in Figure 3.11 [60]. Linear interpolation approximates
the value at any position between two samples, and it is used at the receiver
because pilot overhead limits the number of pilots. Two consecutive known pilot
subcarriers are used to determine the channel response for the data subcarriers
between them. The intermediate estimates are obtained as a linear combination of
the known elements on both sides. The channel estimate at data subcarrier k is
expressed as follows [61]:
$$ mL < k < (m+1)L \tag{3.27} $$
Here mL and (m + 1)L are the two neighboring pilot positions. The linear
interpolation method in [61] can be expressed as:
$$
H(k) = H(mL + l) = \left(H_j(m+1) - H_j(m)\right) \times \frac{l}{L} + H_j(m),
\qquad 0 < l < L
\tag{3.28}
$$
Since linear interpolation is one of the simplest methods and connects the samples
with straight lines, Equation 3.28 can be extended to Equation 3.29, following the
interp function in MATLAB [51], where Ĥ_n refers to the channel frequency response
of all subcarriers.
$$
\hat{H}_n = \sum_{a=1}^{N_p-1} \sum_{b=1}^{P_l}
\left(\left(\tilde{H}_j(a+1) - \tilde{H}_j(a)\right) \times \frac{b-1}{P_l}\right)
+ \tilde{H}_j(a)
\tag{3.29}
$$
N_p is the number of pilots and P_l is the data length between two successive
pilots. To correct OFDM symbols carrying noisy data by channel equalization, the
channel frequency response must be estimated. The equalized symbols are then
similar to the X_n data. Equalization is implemented by dividing the received
signal by its channel frequency response, as shown in Equation 3.30.
$$ \hat{Y}_n = \frac{Y_n}{\hat{H}_n} \tag{3.30} $$
3.2.5 Symbols Demapping
The last part of the OFDM receiver is the demapping of the symbols. The demapping
step runs after all the synchronization and demodulation operations have been
performed, and it determines the true value of the received data bits. The main
purpose of demapping is to transform the received data symbols into data bits
without loss of accuracy. In this thesis work, 64-QAM modulation is used based on
the IEEE 802.11n specification. Decisions about the received data bits are taken
according to the system modulation and fall into two categories, soft decisions
and hard decisions [42].
• Hard Decision: A hard decision demodulator is used if the number of
transferred data bits is identical to the number of received data bits. If
the received data bits are noisy, they form a Gaussian cloud around the
constellation points. The difficulty in this part is to decide which transferred
data symbol a received value belongs to. Under the maximum-likelihood decision
rule, the hard decision allocates the bits of the constellation point that is
closest to the received value.
• Soft Decision: A soft decision uses additional information about the forwarded
symbols, which gives acceptable performance relative to the execution
complexity [62].
When the received data symbols are demapped to data bits, the quality of the OFDM
system is measured in terms of the bit error rate (BER). The BER is the number of
erroneous bits divided by the total number of transmitted bits, and it varies with
the SNR: it decreases as the SNR increases [64]. Moreover, for equal SNR the BER
depends on the modulation type.
4. PLATFORM ARCHITECTURES
The HARP platform is an experimental platform that allows up to nine NoC nodes to
be integrated, with the node in the center combined with a RISC core named
COFFEE [65]. Besides being useful for general-purpose processing, the RISC core is
needed because it is programmable and provides constant supervision of the
platform. Moreover, AVATAR accelerators can be combined with the other nodes to
speed up computationally intensive tasks.
4.1 Coarse-Grained Reconﬁgurable Arrays
As illustrated in Figure 4.1, the CREMA and AVATAR accelerators have similar
architectural attributes; the only difference between them is their size, which is
adapted to the target applications. CREMA consists of R rows × 8 columns of PEs,
whereas AVATAR, the extended version of CREMA, has R rows × 16 columns of PEs.
CREMA contains 32-bit local memories of size 16 × 256, while each local memory of
AVATAR is of size 32 × 512. I/O buffers route the data between the local memories
and the PEs. In CREMA, the I/O buffers contain sixteen 16 × 1 multiplexers and
sixteen 32-bit registers; in AVATAR these numbers are doubled. During an operation,
the data held in the local memories is loaded onto the PE array and stored back
sequentially, and a Direct Memory Access (DMA) device is used to fill the local
memories [66].
AVATAR is a highly parallel template-based CGRA [66]. The computationally intensive
kernels are run by the accelerators generated from AVATAR, while the general-purpose
processing is carried out by COFFEE, which can be programmed in C. The DMA is
interconnected with COFFEE, the I/O peripherals and the system memory through a
switched interconnection matrix. Each PE operates on two 32-bit operands and
performs integer and floating-point computations. The PE core components can be
divided into two major parts: the Functional Units (FU) and the configuration
control blocks. A PE includes a LookUp Table (LUT), adder, subtractor, shifter,
multiplier, immediate register and floating-point logic, which are selectable at
design time based on the processing needs of the application. Figure

4.2. Heterogeneous Accelerator-Rich Platform 31
[Figure 4.2: 3×3 mesh of NoC nodes N0 to N8. One node hosts the RISC core; each of
the other eight nodes hosts a DMA device and a CGRA template.]
Figure 4.2 Heterogeneous Accelerator-Rich Platform (HARP) [33].
1. At system start-up, the DMA device loads the configuration data into the CGRA.
2. The data to be processed is loaded into the local memories of the CGRA.
3. The functionality of the PEs and the interconnection among them are configured
by enabling a context.
4. The data is processed over the PE array.
5. As needed, the CGRA is reconfigured by changing the context.
6. A new set of data is processed, starting again from step 3.
To complete its execution, an algorithm repeats these phases and then transfers
the result to the local memory of another CGRA (RISC processor).
4.2 Heterogeneous Accelerator-Rich Platform
This thesis work employs HARP, depicted in Figure 4.2. HARP is built with nine nodes
over a NoC in a 3×3 mesh topology and is written in parametric VHDL. As can be seen
from Figure 4.2, the COFFEE RISC core is integrated with the node in the center,
and each of the remaining nodes can include a template-based CGRA accelerator, a
data memory and a DMA device. The CGRAs and the RISC core communicate with each
other over the NoC. HARP comprises multiple CGRAs with particular dimensions
(rows × columns) of PEs. Together, these CGRAs form a test case that can be used to
examine the overall design. The COFFEE RISC core supervises control and
communication. Since HARP is a template-based architecture, a CGRA of a particular
size can be integrated in any of the existing nodes. Alternatively, a node can
serve purely as a data routing resource.
4.2.1 Internal Structure of NoC
Figure 4.3 illustrates in detail how the master and slave nodes are utilized. Each
NoC node is connected to one master and two slaves. In the central node, the master
is in charge of communication within the node as well as publishing the data on the
network, hence it is combined with the RISC core. In the other nodes, the master is
linked to the master of the DMA device, and the slaves are the local memory and the
slave of the DMA device. The supervisor node consists of a RISC processor, which is
in charge of data transmission between its own data memory and the data memories of
the other slave nodes. RISC cores provide synchronized data transmission between
any two nodes by reserving a shared space in their memory for setting and resetting
read and write flags, which is accessible by the other nodes.
While the system is starting up, the supervisor node (N4) begins the data transfer
by sending the configuration stream. The data is also processed between its own and
the other data memories. A packet has two parts: routing information in the header,
and data and configuration words in the rest. First, the packet is received by the
initiator and then forwarded to the request switch of the destination node. Since
the node arbiter resides between the request and response switches to set up
connections through the various modules, the initiator module notifies the node.
The request switch then selects the targeted slave device based on the address
field of the routed packet. Through the target module, data can be written to the
NoC. However, the RISC core needs to read and write data from and to its
instruction or data memory, so the request switch has to interact with a node's
master. As soon as the transport route is determined, the DMA devices load the
processed data and the configuration stream into the local memory banks of the
template-based CGRA in the slave nodes. The RISC core can carry out the same
operations for other nodes. Also, shared memory space of which size is

5. DESIGN AND IMPLEMENTATION OF IEEE
802.11N ON TEMPLATE-BASED CGRA
The previous chapters explained the concepts and basic structure of OFDM systems
and the platform architecture used in this work. This chapter describes the
execution of the OFDM blocks in detail and focuses on the design of CGRAs on the
HARP platform. In other words, it elaborates the design of application-specific
accelerators using AVATAR for implementing the IEEE 802.11n specification. In this
process, a baseband receiver executes digital signal processing algorithms in order
to recover the received data bits with high accuracy. MATLAB is used to test the
functionality of the baseband receiver and of all implemented transceiver
algorithms. As a first step, a MATLAB script generates random data symbols using a
specific constellation based on the IEEE 802.11n specification. Subsequently, an
accelerator was designed and implemented for each block, namely TS, FOE, FFT and
CE, and each output was compared to the corresponding MATLAB result. The script
generates OFDM data symbols based on 64-QAM modulation, and the channel is modeled
with AWGN, assuming a CFO of 40 kHz and an SNR of 20 dB. The following sections
explain the process of designing and implementing the application-specific
accelerator for each receiver block. COFFEE is used for measuring the number of
clock cycles and the execution time of each of them. ModelSim [79] is used for
simulation and for testing the functionality of each accelerator. Finally, the
designed accelerators are synthesized for Altera Stratix FPGAs.
5.1 Time Synchronization
After the analog-to-digital conversion, time synchronization is the first block of
the OFDM receiver. As mentioned earlier, there are two known methods to achieve
time synchronization: using special symbols or using the Cyclic Prefix (CP). In
this thesis, the cyclic prefix method is employed [54]. A correlation is computed
between the received signal and its delayed version. In accordance with the IEEE
802.11n specification, the delay D of the delay element z^{-D} is equal to the
length of the CP, L = 32.

Outputs c_n and z_n of the correlation algorithm are expressed by Equations 5.1 and
5.2, respectively. The symbol ∗ denotes the complex conjugate.
$$ c_n = y_n y^*_{n-D} \tag{5.1} $$
$$ z_n = \sum_{i=0}^{L-1} c_{i+n} \tag{5.2} $$
Equations 5.1 and 5.2 have to be mapped onto the PE array of AVATAR. For this
specific design, AVATAR is scaled up to a 5×16 PE array. Equation 5.1 can be
expanded into Equation 5.3, which allows more effective placement and routing.
R stands for the real part and I for the imaginary part of the received signal.
$$
c_n = \underbrace{\left(y_{n(R)} \times y_{n-D(R)}\right) + \left(y_{n(I)} \times y_{n-D(I)}\right)}_{\text{Real}}
+ j\,\underbrace{\left(\left(y_{n(I)} \times y_{n-D(R)}\right) - \left(y_{n(R)} \times y_{n-D(I)}\right)\right)}_{\text{Imaginary}}
\tag{5.3}
$$
Figures 5.1 and 5.2 show the mapping of a 160-point correlation algorithm. The two
contexts must be used consecutively. The first context (not illustrated) loads
immediate values into the PEs so that a shift can be performed after every
multiplication. With regard to the interconnections between the PEs and the
operations that the PEs perform, the contexts are completely identical; the sole
difference is their I/O buffers. First, the received data symbols must be loaded
into the first local memory of AVATAR. As can be observed in Figure 5.1, two
different tasks are performed in this context. The first task is the multiplication
of the received data symbols by the complex conjugate of their delayed version,
based on Equation 5.3. The data with indexes from 0 to 159 belong to the 160-point
correlation; performing time synchronization for 160 data symbols requires 160
correlations.
The second task in this context is the distribution of the data to the other
columns of the local memory, so as to maximize the parallel usage of resources.
This context can only be used to distribute data and compute the first correlation.
To execute four correlations simultaneously, the results stored in the second local
memory are fed back by reversing the data flow direction towards the first local
memory.
As Figure 5.2 shows, in the third context the last row of PEs accumulates the
results of the complex multiplications based on Equation 5.2. The delayed version
then undergoes a shift and the procedure is repeated. The Unregistered-Feed
Through (URF) operation is used for this purpose. According to Figure 5.2, in the
third context four URFs shift the delayed version of the data symbols by four units
during each run. When all 160 correlations are completed, the RISC processor (N3)
searches for the maximum value, which is obtained after the execution of the C
code. The index of this value corresponds to the time offset, which marks the
first FFT window according to Equation 5.4. Program 5.1, which looks for the
largest value among the correlation results, is executed by the COFFEE RISC
processor.
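/* Return the 1-based position of the largest of the 160 correlation results. */
int find_largest(const int output[160])
{
    int z;
    int max_val = output[0];   /* start from the first result */
    int position = 1;          /* 1-based position of the current maximum */

    for (z = 1; z < 160; z++)
    {
        if (output[z] > max_val)
        {
            max_val = output[z];
            position = z + 1;
        }
    }
    return position;
}
Program 5.1 C code for the seek of largest value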
The Square Modulus (SM) is used because the correlation results are complex. SM
uses Equation 5.4 to calculate the magnitude of the complex numbers.
$$
\hat{\tau}_s = \arg\max_n |z_n|
= \arg\max_n \left| z_{n(R)} \times z_{n(R)} + z_{n(I)} \times z_{n(I)} \right|
\tag{5.4}
$$
Here τ̂_s identifies the index of the maximum among the 160 correlation results z_n,
and R and I denote the real and imaginary parts. The squared magnitude has its
maximum at the same index as the magnitude, so no square root is needed. Finally,
once the index of the time offset has been found, the data symbols are passed to
the next block, the FOE, for further processing, as discussed in detail in Section
5.2.
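For comparison with the accelerator mapping, Equations 5.1, 5.2 and 5.4 can be written as a plain software reference model. The following C sketch is such a model under stated assumptions: the function name and buffer layout are hypothetical, floating-point arithmetic is used instead of the fixed-point PE operations, and the input array is offset by D samples so that no negative index is needed.

```c
#include <complex.h>

/* Software reference model of Equations 5.1, 5.2 and 5.4.
 * y:      received samples; y[k + delay] plays the role of y_k, so the
 *         buffer needs n_corr + window - 1 + delay entries
 * n_corr: number of correlation results z_n to evaluate
 * window: summation length L
 * delay:  correlation delay D
 * Returns the 0-based index n that maximizes |z_n|^2. */
int find_timing_offset(const double complex *y, int n_corr,
                       int window, int delay)
{
    int best_n = 0;
    double best_val = -1.0;
    for (int n = 0; n < n_corr; n++) {
        double complex z = 0.0;
        for (int i = 0; i < window; i++) {
            /* c_{n+i} = y_{n+i} * conj(y_{n+i-D}) */
            z += y[n + i + delay] * conj(y[n + i]);
        }
        /* Square modulus: no square root is needed for the argmax. */
        double sm = creal(z) * creal(z) + cimag(z) * cimag(z);
        if (sm > best_val) {
            best_val = sm;
            best_n = n;
        }
    }
    return best_n;
}
```

With the parameters quoted in this section, the call would use `n_corr = 160`, `window = 32` and `delay = 32`. Note that this sketch returns a 0-based index, whereas Program 5.1 reports a 1-based position.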
5.2 Frequency Oﬀset Estimation
As explained previously, the Carrier Frequency Offset (CFO) in OFDM systems is the
consequence of a mismatch between the transmitter and receiver oscillators. The CFO
can be estimated by adding training symbols on the transmitter side. For this
purpose, the delay-and-correlate algorithm of Equation 3.18 is used with a delay of
32. That is, to obtain the phase difference between the received training symbols,
they are multiplied by the complex conjugate of their delayed version. This
requires 160 complex multiplications, equal to the length of the short training
sequence.

rations. Accordingly, they can be done in shorter running times by using processor
software at the cost of more power and energy. CORDIC algorithms are most
beneficial when no dedicated hardware multiplier exists, as they only use addition,
subtraction, bit-shift and lookup-table operations [69]. Once the division has been
done according to Equation 5.5, the phase angle of the complex number is computed
from the result of that division. As previously mentioned, there are no predefined
functions on the COFFEE RISC processor, so the ATAN function has to be computed
with another method such as the Taylor series [71]. Finally, the phase angle is
calculated in software by expanding the arctangent in a Taylor series as in
Equation 5.6.
$$ \arctan x = \sum_{n=0}^{N} \frac{(-1)^n}{2n+1} x^{2n+1} \tag{5.6} $$
Here the value of N depends on the required precision and is equal to 4 in this
particular case. Once the phase angles have been found from the received data
symbols, the carrier frequency offset is estimated by means of Equation 3.19.
Afterward, using Equation 3.20, the data symbols are corrected individually on the
basis of the estimated frequency offset, which requires the exponential function.
It is computed with the Taylor series of the following equation.
$$ e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} \tag{5.7} $$
The equation above is stated for a real argument x. For a complex argument
z = x + iy, Equation 5.7 is rewritten as
\[ e^{z} = e^{x} \left( \cos(y) + i \sin(y) \right), \qquad (5.8) \]
where z consists of the real part x and the imaginary part y. The cos and sin
functions can be expanded as Taylor series, as shown in Equations 5.9 and 5.10,
respectively.
\[ \cos y = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n)!}\, y^{2n} \qquad (5.9) \]
\[ \sin y = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!}\, y^{2n+1} \qquad (5.10) \]
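The series in Equations 5.6, 5.9 and 5.10 can be evaluated in software with truncated sums. The following C sketch is only an illustration under stated assumptions: the function names and the use of `double` are hypothetical (the thesis targets the COFFEE RISC core and does not list its code for this step), and each series is truncated after the term with index `n_max`. Note that the arctangent series converges slowly as |x| approaches 1.

```c
/* Truncated Taylor series of arctan(x), Equation 5.6, terms n = 0..n_max. */
double atan_taylor(double x, int n_max)
{
    double power = x;          /* holds x^(2n+1) */
    double sum = 0.0;
    for (int n = 0; n <= n_max; n++) {
        double term = power / (2 * n + 1);
        sum += (n % 2 == 0) ? term : -term;   /* alternating sign (-1)^n */
        power *= x * x;
    }
    return sum;
}

/* Truncated Taylor series of cos(y), Equation 5.9, terms n = 0..n_max. */
double cos_taylor(double y, int n_max)
{
    double term = 1.0;         /* (-1)^n y^(2n) / (2n)! for n = 0 */
    double sum = 0.0;
    for (int n = 0; n <= n_max; n++) {
        sum += term;
        term *= -(y * y) / ((2.0 * n + 1.0) * (2.0 * n + 2.0));
    }
    return sum;
}

/* Truncated Taylor series of sin(y), Equation 5.10, terms n = 0..n_max. */
double sin_taylor(double y, int n_max)
{
    double term = y;           /* (-1)^n y^(2n+1) / (2n+1)! for n = 0 */
    double sum = 0.0;
    for (int n = 0; n <= n_max; n++) {
        sum += term;
        term *= -(y * y) / ((2.0 * n + 2.0) * (2.0 * n + 3.0));
    }
    return sum;
}
```

With N = 4 as in the text, `atan_taylor(0.5, 4)` evaluates five terms of Equation 5.6. Each cos and sin term is derived from the previous one, so no factorial or power has to be computed separately.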
Finally, the received signal is multiplied by the correction factor calculated
above. Once all of the steps above, including the complex multiplication, have
been carried out in processor software, the data can be transferred back to the
local CGRA memory. This can be done by using almost the same context as depicted
in Figure 5.3. After the frequency offset estimation, the local memories hold
data symbols that can be demodulated directly. Note that the cyclic prefix is
removed before the data symbols are demodulated.
5.3 Fast Fourier Transform
In this section, the Fast Fourier Transform (FFT) block is explained in detail.
After the received signal has been corrected for the frequency offset, the data
symbols must be converted from the time domain to the frequency domain; this
step is called demodulation. It is carried out by the FFT, an efficient
algorithm for computing the Discrete Fourier Transform (DFT). Compared with the
other blocks on the receiver side, the FFT block is one of the most time
consuming. More details about the implementation of the radix-N algorithms can
be found in [58] and [10]. The DFT is defined as
\[ X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk} \qquad (5.11) \]
where W_N^{nk} = exp(−j2πnk/N) is a twiddle factor, n is the index of the sample
being processed and N is the number of samples. The DFT is computed by radix-2^m
FFT structures [26], where m ∈ Z+, whose structural unit is a butterfly. As m
increases, the arithmetic resources required by the butterfly and the complexity
of the FFT structure increase, but the number of stages needed for FFT
processing is significantly reduced. According to the IEEE 802.11n specification
for a frequency bandwidth of 40.0 MHz, demodulation is done by a 128-point FFT
within 4 µs.
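To make the notion of a butterfly concrete, the sketch below shows radix-2 and radix-4 butterflies without the inter-stage twiddle multiplications. It is an illustrative assumption, not the AVATAR context from the thesis: the type `cint` and both function names are hypothetical. A radix-4 butterfly on its own computes a 4-point DFT as defined in Equation 5.11, because the twiddle factors for N = 4 are only 1, −j, −1 and j.

```c
/* Hypothetical integer complex type (not from the thesis). */
typedef struct { int re; int im; } cint;

/* Radix-2 butterfly: out[0] = a + b, out[1] = a - b. */
void butterfly_radix2(cint a, cint b, cint out[2])
{
    out[0].re = a.re + b.re;  out[0].im = a.im + b.im;
    out[1].re = a.re - b.re;  out[1].im = a.im - b.im;
}

/* Radix-4 butterfly: 4-point DFT built from two levels of radix-2 butterflies. */
void butterfly_radix4(const cint x[4], cint X[4])
{
    cint even[2], odd[2];
    butterfly_radix2(x[0], x[2], even);   /* even[0] = x0 + x2, even[1] = x0 - x2 */
    butterfly_radix2(x[1], x[3], odd);    /* odd[0]  = x1 + x3, odd[1]  = x1 - x3 */

    X[0].re = even[0].re + odd[0].re;  X[0].im = even[0].im + odd[0].im;
    X[2].re = even[0].re - odd[0].re;  X[2].im = even[0].im - odd[0].im;

    /* Multiplying by -j maps (re, im) to (im, -re); no multiplier is needed. */
    X[1].re = even[1].re + odd[1].im;  X[1].im = even[1].im - odd[1].re;
    X[3].re = even[1].re - odd[1].im;  X[3].im = even[1].im + odd[1].re;
}
```

For the real input sequence 1, 2, 3, 4 this gives X = 10, −2 + 2j, −2, −2 − 2j.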
In addition, a 128-point FFT cannot be processed by radix-4 butterflies alone,
since 128 is not a power of four, while radix-2 needs seven costly stages to
process it. If a mixed radix is used, however, the 128-point FFT can be
implemented in four stages. In the first stage, a radix-2 butterfly is used,
which divides the structure of the 128-point FFT into two halves of 64 points
each. These halves can then be processed by a radix-4 butterfly in three stages.
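The stage count in the paragraph above follows from simple arithmetic on log2(N): every radix-4 stage consumes two bits of the index and a single radix-2 stage covers a leftover bit. The following C sketch illustrates this; the function name `mixed_radix_stages` is hypothetical, and the order in which the stages are executed is not modelled.

```c
#include <assert.h>

/*
 * Splits an N-point FFT (N a power of two) into radix-4 stages plus at most
 * one radix-2 stage and returns the total number of stages.
 */
int mixed_radix_stages(unsigned n, int *radix2_stages, int *radix4_stages)
{
    assert(n != 0u && (n & (n - 1u)) == 0u);   /* n must be a power of two */

    int log2n = 0;
    while ((n >> log2n) > 1u) {
        log2n++;
    }
    *radix4_stages = log2n / 2;   /* each radix-4 stage handles two index bits */
    *radix2_stages = log2n % 2;   /* one radix-2 stage if log2(n) is odd       */
    return *radix2_stages + *radix4_stages;
}
```

For n = 128, log2(n) = 7, so the result is one radix-2 stage and three radix-4 stages, four stages in total, compared with seven stages for a pure radix-2 FFT.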
Mapping four diﬀerent contexts, a mixed-radix accelerator was designed utilizing
AVATAR, two of which comprise radix-2 and radix-4 butterﬂies and the remaining






\[ x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}, \qquad (5.12) \]
where i = 0, 1, 2, ... is the iteration index and x_0 is the initial guess for
the root. The function f'(x_i) is the derivative of the function f at x_i.
Applied to the division operation, the iteration takes the form of Equation
5.13, which can also be adapted to other purposes.
\[ x_{i+1} = x_i \left( 2 - D x_i \right) \qquad (5.13) \]
Assuming that 1/D is to be found, we need a function f(x) whose value is zero at
x = 1/D. A suitable function is f(x) = 1/x − D, which leads to Equation 5.13,
where x_i is the current estimate, starting from the initial guess, and D stands
for the denominator. Because of the noisy channel, the denominator is a complex
number. Therefore, so that it can be processed by the Newton-Raphson method and
mapped on the CGRA, the division is rewritten with a real-valued denominator as
\[ \frac{x + jy}{a + jb} \times \frac{a - jb}{a - jb} = \frac{(x + jy) \times (a - jb)}{a^2 + b^2} \qquad (5.14) \]
where x + jy stands for a received data symbol from the FFT block, a + jb
represents the estimated channel response and a − jb is the complex conjugate of
the channel response. In the first step, the Newton-Raphson method is mapped on
AVATAR to calculate the value of 1/(a^2 + b^2) needed for the channel
equalization. In terms of Equation 5.13, D is equivalent to (a^2 + b^2), where a
is the real part and b the imaginary part of the channel frequency response. The
mapping of the Newton-Raphson method is shown in Figures 5.11 and 5.12.
The mapping employs eleven columns of the first context and ten columns of the
second context. In the first context, the first row of PEs performs
multiplications to compute the squares of the real and imaginary parts of the
channel frequency response. These values are then added to each other on the
second row. In the next step, the computed result is multiplied by the initial
guess to form Dx_n. Then, following Equation 5.13, the constant 2 is loaded into
the local memory together with the results obtained from the previous context.
The local memories in CGRAs are line-readable in order to be faster and simpler.
The values of X and 2 are loaded along every column. The first row of PEs
preprocesses the data, and the remaining rows of PEs execute the required shift
operations, subtractions and multiplications. Finally, specific DMA operations
transfer the result of the division, 1/(a^2 + b^2), back to the main memory.
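The equalization path described above (Equations 5.13 and 5.14) can be sketched in floating-point C as follows. This is a software illustration under assumptions, not the fixed-point CGRA mapping of Figures 5.11 and 5.12: the type `cplx_t`, the function names, and the choice of initial guess and iteration count are hypothetical. The iteration converges only if the initial guess x0 satisfies 0 < x0 < 2/D.

```c
/* Hypothetical complex type (not from the thesis). */
typedef struct { double re; double im; } cplx_t;

/* Newton-Raphson reciprocal, Equation 5.13: x_{i+1} = x_i * (2 - D * x_i). */
double newton_reciprocal(double d, double x0, int iterations)
{
    double x = x0;
    for (int i = 0; i < iterations; i++) {
        x = x * (2.0 - d * x);
    }
    return x;
}

/*
 * Channel equalization, Equation 5.14:
 * (x + jy) / (a + jb) = (x + jy)(a - jb) * 1 / (a^2 + b^2),
 * with the reciprocal obtained by Newton-Raphson instead of a divider.
 */
cplx_t equalize(cplx_t y, cplx_t h, double x0, int iterations)
{
    const double d = h.re * h.re + h.im * h.im;          /* D = a^2 + b^2 */
    const double inv = newton_reciprocal(d, x0, iterations);
    cplx_t out;
    out.re = (y.re * h.re + y.im * h.im) * inv;          /* Re{y * conj(h)} / D */
    out.im = (y.im * h.re - y.re * h.im) * inv;          /* Im{y * conj(h)} / D */
    return out;
}
```

For example, with a symbol 1 + 2j and a channel response 3 + 4j, D = 25 and the exact quotient is 0.44 + 0.08j. Starting from x0 = 0.01, the relative error of the reciprocal is squared in every iteration, so eight iterations are more than enough in double precision.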



Figure 5.15 shows that, after the complex plane is divided into In-phase and
Quadrature regions, there are eight regions for each group of three bits
(leftmost and rightmost). The first three bits (Bit1 Bit2 Bit3) are constant
within each region of the I-axis, whereas the last three bits (Bit4 Bit5 Bit6)
are constant within each region of the Q-axis. Table 5.3 shows how each
constellation point is represented by a 6-bit symbol composed of three bits from
the In-phase axis and three bits from the Quadrature axis.
Example. Assume that, on the receiver side, '4.8 + 2.8i' is received as a
demodulated data symbol instead of '5 + 3i' because of the Additive White
Gaussian Noise (AWGN) channel. In the first step, decision boundaries are set
for the real and imaginary parts. Each part can fall into one of eight states
(≥ 6, < 6 & ≥ 4, < 4 & ≥ 2, < 2 & ≥ 0, < 0 & ≥ −2, < −2 & ≥ −4, < −4 & ≥ −6,
< −6). As '4.8' is at least '4' and less than '6', '101' is assigned to the
first three data bits. Figure 5.15 shows that the received data bits can then
only take one of eight values ('101000', '101001', '101011', '101010', '101110',
'101111', '101101' and '101100'), which are located within the same zone of the
I-axis. In the final step, the imaginary part is detected in the same way, which
gives '011' in this case.
Table 5.3 64-QAM constellation mapping with Gray coding
In-phase   Bit1  Bit2  Bit3    Quadrature   Bit4  Bit5  Bit6
   -7       0     0     0         +7         0     0     0
   -5       0     0     1         +5         0     0     1
   -3       0     1     1         +3         0     1     1
   -1       0     1     0         +1         0     1     0
   +1       1     1     0         -1         1     1     0
   +3       1     1     1         -3         1     1     1
   +5       1     0     1         -5         1     0     1
   +7       1     0     0         -7         1     0     0
    /* OUTPUT_REAL and OUTPUT_IMAGE are assumed to give the real and imaginary
       parts of the j-th demodulated symbol; data is the number of symbols.
       Each else-if is reached only when the previous test failed, so the
       upper bound of every interval is implied and need not be re-tested. */
    for (j = 0; j < data; j++)
    {
        // The real section: Bit1 Bit2 Bit3 from the In-phase value
        if (OUTPUT_REAL >= 6) {
            bit1[j] = 1; bit2[j] = 0; bit3[j] = 0;
        } else if (OUTPUT_REAL >= 4) {
            bit1[j] = 1; bit2[j] = 0; bit3[j] = 1;
        } else if (OUTPUT_REAL >= 2) {
            bit1[j] = 1; bit2[j] = 1; bit3[j] = 1;
        } else if (OUTPUT_REAL >= 0) {
            bit1[j] = 1; bit2[j] = 1; bit3[j] = 0;
        } else if (OUTPUT_REAL >= -2) {
            bit1[j] = 0; bit2[j] = 1; bit3[j] = 0;
        } else if (OUTPUT_REAL >= -4) {
            bit1[j] = 0; bit2[j] = 1; bit3[j] = 1;
        } else if (OUTPUT_REAL >= -6) {
            bit1[j] = 0; bit2[j] = 0; bit3[j] = 1;
        } else {
            bit1[j] = 0; bit2[j] = 0; bit3[j] = 0;
        }

        // The imaginary section: Bit4 Bit5 Bit6 from the Quadrature value
        if (OUTPUT_IMAGE >= 6) {
            bit4[j] = 0; bit5[j] = 0; bit6[j] = 0;
        } else if (OUTPUT_IMAGE >= 4) {
            bit4[j] = 0; bit5[j] = 0; bit6[j] = 1;
        } else if (OUTPUT_IMAGE >= 2) {
            bit4[j] = 0; bit5[j] = 1; bit6[j] = 1;
        } else if (OUTPUT_IMAGE >= 0) {
            bit4[j] = 0; bit5[j] = 1; bit6[j] = 0;
        } else if (OUTPUT_IMAGE >= -2) {
            bit4[j] = 1; bit5[j] = 1; bit6[j] = 0;
        } else if (OUTPUT_IMAGE >= -4) {
            bit4[j] = 1; bit5[j] = 1; bit6[j] = 1;
        } else if (OUTPUT_IMAGE >= -6) {
            bit4[j] = 1; bit5[j] = 0; bit6[j] = 1;
        } else {
            bit4[j] = 1; bit5[j] = 0; bit6[j] = 0;
        }
    }
Program 5.2 C code of Data Symbols Demapping
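The decision chain of Program 5.2 can also be expressed with a lookup table, since both axes use the same 3-bit Gray sequence of Table 5.3, read in opposite directions. The sketch below is an alternative formulation offered for illustration, not the code used in the thesis; `GRAY3`, `region_index` and `demap_64qam` are hypothetical names, and the 6-bit result packs Bit1 as the most significant bit.

```c
/* Gray sequence of Table 5.3 for levels -7..+7: 000 001 011 010 110 111 101 100. */
const unsigned char GRAY3[8] = {0u, 1u, 3u, 2u, 6u, 7u, 5u, 4u};

/* Counts how many decision thresholds (-6, -4, ..., 6) are <= v; result is 0..7. */
unsigned region_index(double v)
{
    unsigned idx = 0u;
    for (int t = -6; t <= 6; t += 2) {
        if (v >= t) {
            idx++;
        }
    }
    return idx;
}

/* Returns the 6-bit symbol Bit1..Bit6 for one demodulated 64-QAM sample. */
unsigned demap_64qam(double re, double im)
{
    unsigned i_bits = GRAY3[region_index(re)];        /* In-phase: -7 maps to 000   */
    unsigned q_bits = GRAY3[7u - region_index(im)];   /* Quadrature: +7 maps to 000 */
    return (i_bits << 3) | q_bits;
}
```

For the worked example, `demap_64qam(4.8, 2.8)` yields binary 101011, that is '101' for the real part and '011' for the imaginary part.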
[Figure: 3×3 NoC with nodes N0 to N8 and a RISC core; the surrounding nodes hold DMA-attached CGRAs of 5×16 PEs each, labelled TS/FOE/FFT or CE and grouped into Stream #1 to Stream #4.]
Figure 6.1 Abridged general view of IEEE 802.11n MIMO receiver on HARP platform.
6. INTEGRATION OF BASEBAND
PROCESSING BLOCKS ON HARP
Figure 6.1 shows the overall architecture for processing MIMO OFDM after
integration over HARP. The CGRAs are used to create application-specific
accelerators for the receiver blocks. The size of each template-based CGRA on
HARP is also shown in Figure 6.1. In the proposed platform, the designer can
write the distributed control for the data transfer. The RISC core transfers the
configuration stream and the data to be processed to the CGRAs; the RISC core at
node N4 is responsible for the CGRAs at N1, N3, N5 and N7. With MIMO OFDM as the
case study, the designed platform is divided into four parts, streams 1 to 4.
The RISC core instantiated on N4 sends the data to all the streams, and N1, N5,
N7 and N3 are responsible for streams 1 to 4, respectively. To perform the TS,
FOE and FFT blocks, the received data is loaded and transferred to N1 prior to
the system start-up and the transfer of the configuration stream by the RISC
core.
Table 6.1 shows the Clock Cycles (CC) needed to transfer data from the data
memory of one node to that of another and to the CGRA's local memory at
different processing stages. The CE block is performed using the local memory of
the other CGRA of Stream 1, and data is transferred directly from one local
memory to another. To execute the CE block, the output of the FFT, stored in the
local memory of N1, is transferred directly to the local memory of N0 by the
DMA. To complete the CE task, the results are first transferred back to the data
memory of N1 and, at a later stage, to the data memory of N4. Table 6.1 shows
that all data transfers from the N4 RISC to the N1 CGRA require 4,595 CC, and
the correlation itself is done in 2,371 CC. Once the correlation algorithm has
been executed by N1, the results return to N4 for calculating the time offset
index; transferring this data from the CGRA to the data memory needs 13,738 CC.
The calculation of the SM and the search for the time offset index are carried
out in 4,179 CC by the RISC core.
After that, the short training symbols must be loaded to the local data memory
of N1 in order to execute the complex multiplication. It takes 4,665 CC to load
the data from the data memory of N4 to N1. Some parts of the FOE block must be
performed by the RISC core, which requires exchanging data between N1 and N4
twice: the result of the complex multiplication must first come back to the data
memory of N4. The execution time of FOE thus divides into 12,634 CC on the N4
RISC core and 74 CC on the template-based CGRA mapped on N1. After Equation 3.20
has been completed by the CGRA located at N1 in the last stage, the CGRA is
reconfigured at run-time to perform a 128-point FFT (radix-(2, 4)) on the
results in 571 CC. When the FFT execution is completed, the DMA of N1 transfers
the results from the local memory of its CGRA to the local memory of N0, where
the channel estimation is executed; this transfer takes 2,585 CC. Once the
channel estimation has been completed by N0 in 479 CC, the final results are
transferred back to the data memory of N4 to perform symbols demapping in the
last stage. Furthermore, the configuration streams of all the slave nodes are
loaded in parallel to speed up the execution of the entire platform.
Table 6.1 The required clock cycles at different stages for data transfer and
processing. In the table, D. Mem, Trans. and Exe. refer to Data memory, Transfer
and Execution, respectively, while clock cycles marked with * indicate data
transfer from the CGRA to the node's data memory.
Node-to-Node           D. Mem to D. Mem   D. Mem to CGRA   Trans. Total   Exe. Total
N4-N1 (Correlation)    2,680              1,915            4,595          2,371
N1-N4 (SM)             -                  13,738*          -              4,179
N4-N1                  47                 -                -              -
N4-N1 (FOE)            1,961              2,704            4,665          12,634 (+74)
N4-N1 (FFT)            2,042              1,033            3,075          571
N1-N0 (CE)             1,753              832              2,585          479
N0-N1-N4               -                  815*             -              4,378

Moreover, Table 6.1 shows the number of CC needed to run each block and to
perform three types of data transfer. The first type is the transfer of data
from the RISC core's data memory to the data memory of the CGRA node, the second
is the transfer from the data memory to the local memory banks, and the third is
the transfer within a slave node by means of a DMA device. The total run-times
of the TS, FOE, FFT and CE blocks are 6,550, 12,708, 571 and 479 CC,
respectively. Also, the RISC core performs symbols demapping in 4,378 CC. The
CGRAs are designed from the algebraic equations of the MIMO OFDM receiver
blocks, and most of the PEs are used in each context. A suitable mapping of an
application at design time is essential for performance, power, development time
and area utilization, which may require scaling the CGRA up or down.
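The block totals quoted above are sums of the CGRA and RISC execution cycles listed in Table 6.1, and they can be converted to time once a clock frequency is fixed. The C sketch below illustrates the arithmetic; the struct and function names are hypothetical, and the 200 MHz figure is taken from the measurement chapter as an assumption about the operating point.

```c
/* Hypothetical record for one receiver block (names are not from the thesis). */
typedef struct {
    const char *name;
    unsigned cgra_cc;   /* cycles executed on the CGRA      */
    unsigned risc_cc;   /* cycles executed on the RISC core */
} block_cycles;

/* Values taken from Table 6.1. */
const block_cycles TS_BLOCK  = {"TS",  2371u, 4179u};   /* correlation + SM */
const block_cycles FOE_BLOCK = {"FOE", 74u,   12634u};

/* Total execution cycles of a block. */
unsigned total_cycles(block_cycles b)
{
    return b.cgra_cc + b.risc_cc;
}

/* Converts clock cycles to microseconds for a clock given in MHz. */
double cycles_to_us(unsigned cycles, double f_mhz)
{
    return cycles / f_mhz;
}
```

For example, `total_cycles(TS_BLOCK)` gives 6,550 CC, which corresponds to 32.75 µs at 200 MHz.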
7. MEASUREMENTS AND ESTIMATION
For prototyping purposes, the entire platform is implemented on a Stratix-V
(5SGSED8N3F45I3YY) FPGA device. The junction temperature points considered are
−40 °C (low) and 100 °C (high). Under these conditions, the operating
frequencies after placement and routing are 197.82 MHz at −40 °C and 187.72 MHz
at 100 °C for the slow timing model (900 mV). The fast timing model (900 mV)
leads to higher frequencies of 321.75 MHz at −40 °C and 284.17 MHz at 100 °C.
It is worth mentioning that a single clock source has been used throughout the
work.
Table 7.1 shows the resource utilization of the proposed platform in terms of
Adaptive Logic Modules (ALMs), Memory Bits, Registers and DSP elements. About 98
percent of the ALMs are used, and a total of 1,164 (59 percent) 18-bit DSP
resources are employed. The DSP count follows from the number of 32-bit
multipliers, since each 32-bit multiplier defined in a PE consists of two 18-bit
DSPs so that it can be synthesized on the FPGA. The number of 32-bit multipliers
in each receiver block, as well as in the RISC on a NoC node, is also provided
in Table 7.1.
The platform's power dissipation is estimated from post placement and routing
information with the PowerPlay Power Analyzer Tool of Quartus II 15.0 at a
temperature of 25 °C and a frequency of 200.0 MHz. The estimates are obtained
from the gate-level netlist of the platform, and ModelSim was used to produce
the Value Change Dump (VCD) file that includes the signal transition information
while the OFDM receiver is executed. The overall estimation result is therefore
considered highly reliable. Signal switching activity contributes to the dynamic
power, and powering on the FPGA contributes the static power. Quartus II
computes the dynamic power, static power and I/O thermal power dissipation as
5,506.4 mW, 1,541.82 mW and 56.68 mW, respectively. Accordingly, the total power
dissipation of the FPGA for the MIMO OFDM platform is 7.1 W.
Table 7.1 Summary of resource utilization based on the breakdown of node-by-node
for Stratix-V (5SGSED8N3F45I3YY) FPGA device
Node               ALMs      Registers   Memory Bits   (32-bit Multipliers) DSPs
N0 (TS/FOE/FFT)    32,237    21,252      2,634,456     (80) 160
N1 (CE)            30,101    16,625      2,632,672     (64) 128
N2 (TS/FOE/FFT)    32,234    21,249      2,634,456     (80) 160
N3 (CE)            30,099    16,621      2,632,672     (64) 128
N4 (RISC)          6,604     5,752       4,194,304     (6) 12
N5 (CE)            30,101    16,623      2,632,672     (64) 128
N6 (TS/FOE/FFT)    32,240    21,254      2,634,456     (80) 160
N7 (CE)            30,100    16,631      2,632,672     (64) 128
N8 (TS/FOE/FFT)    32,239    21,267      2,634,456     (80) 160
NoC                2,592     4,218       -             -
Total              258,547   161,492     25,262,816    (582) 1,164
Total (%)          98.53%    30.77%      48.05%        59.3%

Table 7.2 Dynamic power of each CGRA node and the NoC.
Node                Dynamic Power (mW)
N0                  518.51
N1                  532.96
N2                  521.12
N3                  534.18
N4                  358.79
N5                  537.08
N6                  520.17
N7                  531.83
N8                  515.93
NoC                 101.68
Integration Logic   834.17
Total               5,506.4

Table 7.2 shows that the dynamic power rises along with the size of the CGRA. It
should be noted that the static power remains almost the same when the CGRA is
scaled up. Also, the energy consumption of each node is computed as the product
of its power dissipation and runtime.
In this case study, the current instance of the platform contains 640 PEs. At an
operating frequency of 200 MHz and a total power dissipation of 7.1 W, this HARP
instance can provide 128 Giga Operations Per Second (GOPS) and 0.018 GOPS/mW on
Altera Stratix-V chips in 28 nm.
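The figures above can be reproduced with a few lines of arithmetic. The C sketch below is illustrative only: the function names are hypothetical, and it assumes that every PE completes one operation per clock cycle, which is the reading under which 640 PEs at 200 MHz give 128 GOPS. It also shows the energy computation as power multiplied by runtime.

```c
/* Peak throughput in GOPS, assuming one operation per PE per clock cycle. */
double peak_gops(unsigned pes, double f_mhz)
{
    return pes * f_mhz / 1000.0;
}

/* Total power in mW as the sum of dynamic, static and I/O contributions. */
double total_power_mw(double dynamic_mw, double static_mw, double io_mw)
{
    return dynamic_mw + static_mw + io_mw;
}

/* Energy in nJ: power (mW) times runtime (us), with runtime = cycles / f (MHz). */
double energy_nj(double power_mw, unsigned cycles, double f_mhz)
{
    return power_mw * cycles / f_mhz;
}
```

With the reported values, `peak_gops(640, 200.0)` is 128 GOPS and `total_power_mw(5506.4, 1541.82, 56.68)` is about 7,104.9 mW, so the efficiency is roughly 0.018 GOPS/mW.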
8. CONCLUSIONS AND FUTURE WORK
In this thesis work, OFDM receiver blocks are implemented to examine the
functional features and architectural capabilities of an HMA such as the HARP
platform. The computational and processing power required by the parallel and
serial algorithms of the MIMO OFDM receiver is intensive enough to make them
suitable candidates for exercising all HARP design features. The implemented
algorithms are TS, FOE, FFT, CE and symbols demapping for four parallel streams.
The designed receiver using the HARP platform has been prototyped on an FPGA
device which operates at a frequency of 200.0 MHz at a room temperature of
25 °C. Furthermore, the HARP platform offers a performance of 128 GOPS and 0.018
GOPS/mW.
The designed application-speciﬁc accelerators satisfy the IEEE 802.11n standard.
Moreover, the simulation results and comparisons with other modern platforms prove
the advantages of maximizing the number of reconﬁgurable processing resources over
a platform since the integrated CGRAs can be scaled to numerous dimensions.
There are two motivations for the design of HARP, one that deals with the dark-
silicon problem by replacing the underutilized section of the chip with special-
purpose accelerators, and the other which maximizes the number of PEs available
on the chip so that the demand for throughput can be met.
In this thesis, a MIMO OFDM receiver according to the IEEE 802.11n standard
specification has been designed by crafting template-based CGRAs. This makes the
whole design process easier for application developers who are not familiar with
VHDL. Moreover, the computationally intensive tasks have been parallelized,
which led to a considerable level of performance. According to the experimental
results, the total power dissipation is 7.1 W for the whole platform. In terms
of resource utilization, 98.53% of the ALMs, 30.77% of the Registers, 48.05% of
the Memory Bits and 59.3% of the 18-bit DSPs have been used. In terms of the
clock cycles required to execute each task on the designed CGRAs, we achieved
2,371 CC for the TS block, 12,708 CC for the FOE block, 571 CC for the FFT block
and 479 CC for the CE block.
As future work, the HARP architecture can be used to implement Massive-MIMO OFDM
as a candidate for 5G. Moreover, the power dissipation of the design implemented
in this thesis can be reduced by applying self-aware computing models.
BIBLIOGRAPHY
[1] R. Airoldi, F. Garzia, O. Anjum and J. Nurmi, Homogeneous MPSoC as baseband signal processing engine for OFDM systems, International Symposium on System on Chip (SoC), 2010, pp. 26-30, Sept. 2010, doi: 10.1109/ISSOC.2010.5625562.
[2] D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley, G. Haugou, F. Clermidy and D. Dutoit, Platform 2012, a Many-Core Computing Accelerator for Embedded SoCs: Performance Evaluation of Visual Analytics Applications, in Proc. 49th Annual Design Automation Conference (DAC '12). ACM, pp. 1137-1142, New York, NY, USA.
[3] N. S. Voros, M. Hübner, J. Becker, M. Kühnle, F. Thomaitiv, A. Grasset, P. Brelet, P. Bonnot, F. Campi, E. Schler, H. Sahlbach, S. Whitty, R. Ernst, E. Billich, C. Tischendorf, U. Heinkel, F. Ieromnimon, D. Kritharidis, A. Schneider, J. Knaeblein and W. Putzke-Rming, MORPHEUS: A Heterogeneous Dynamically Reconfigurable Platform for Designing Highly Complex Embedded Systems, ACM Trans. Embed. Comput. Syst. 12, 3, Article 70, 33 pages, April 2013.
[4] F. Thoma, M. Kuhnle, P. Bonnot, E. M. Panainte, K. Bertels, S. Goller, A. Schneider, S. Guyetant, E. Schuler, K. D. Muller-Glaser and J. Becker, MORPHEUS: Heterogeneous Reconfigurable Computing, International Conference on Field Programmable Logic and Applications, FPL 2007, pp. 409-414, August 2007.
[5] M.B. Taylor, Is dark silicon useful?: harnessing the four horsemen of the coming dark silicon apocalypse, In proceedings of the 49th Annual Design Automation Conference (DAC 2012), pp. 1131-1136, San Francisco, CA, USA.
[6] A. Shrivastava, J. Pager, R. Jeyapaul, M. Hamzeh and S. Vrudhula, Enabling Multithreading on CGRAs, 2011 International Conference on Parallel Processing.
[7] W. Hussain, R. Airoldi, H. Hoffmann, T. Ahonen and J. Nurmi, Design of an accelerator-rich architecture by integrating multiple heterogeneous coarse grain reconfigurable arrays over a network-on-chip, in Proc. IEEE 25th Int. Conf. Appl.-Specific Syst. Archit. Processors, 18-20 Jun 2014, pp. 131-138.
[8] C. Brunelli, F. Garzia and J. Nurmi, A Coarse-Grain Reconfigurable Architecture for Multimedia Applications Featuring Subword Computation Capabilities, in Journal of Real-Time Image Processing, Springer Verlag, 2008, 3 (1-2): 21-32. doi:10.1007/s11554-008-0071-3.
[9] F. Garzia, W. Hussain and J. Nurmi, CREMA, A Coarse-Grain Reconfigurable Array with Mapping Adaptiveness, in Proc. 19th International Conference on Field Programmable Logic and Applications (FPL 2009). Prague, Czech Republic: IEEE, September 2009.
[10] W. Hussain, F. Garzia, T. Ahonen and J. Nurmi, Designing Fast Fourier Transform Accelerators for Orthogonal Frequency-Division Multiplexing Systems, Journal of Signal Processing Systems for Signal, Image and Video Technology, Springer, Vol. 69, pp. 161-171, November 2012.
[11] W. Hussain, X. Chen, G. Ascheid and J. Nurmi, A Reconfigurable Application-specific Instruction-set Processor for Fast Fourier Transform processing, 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 339-345, 5-7 June 2013, Washington, D.C., USA.
[12] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh and E. M. C. Filho, Morphosys: An integrated reconfigurable system for data-parallel and computation-intensive applications, IEEE Trans. Computers, vol. 49, no. 5, pp. 465-481, 2000.
[13] B. Mei, S. Vernalde, D. Verkest, H. D. Man and R. Lauwereins, ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix, Field-Programmable Logic and Applications, vol. 2778, pp. 61-70, September 2003, ISBN: 978-3-540-40822-2.
[14] J. M. P. Cardoso and M. Weinhardt, XPP-VC: A C Compiler with Temporal Partitioning for the PACT-XPP Architecture, in Field-Programmable Logic and Applications: Reconfigurable Computing Is Going Mainstream, Editors: M. Glesner and P. Zipf and M. Renovell, Lecture Notes in Computer Science, Springer Berlin Heidelberg, pp. 864-874, Vol. 2438, ISBN: 978-3-540-44108-3, 2002.
[15] W. Hussain, T. Ahonen, F. Garzia and J. Nurmi, Application-Driven Dimensioning of a Coarse-Grain Reconfigurable Array, 2011 IEEE NASA/ESA Conference on Adaptive Hardware and Systems (AHS-2011), pp. 234-239, San Diego, California, USA.
[16] P. Bonnot, F. Lemonnier, G. Edelin, G. Gaillat, O. Ruch and P. Gauget, Definition and SIMD implementation of a multi-processing architecture approach on FPGA. In Proc. of Design, Automation and Test in Europe (DATE '08). ACM, pp. 610-615, New York, NY, USA.
[17] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd edn., Elsevier Morgan Kaufmann, San Francisco, 1990.
[18] C. Panis, VLIW DSP Processor for High-End Mobile Communication Applications, In Processor Design: System-on-Chip Computing for ASICs and FPGAs, J. Nurmi, Ed. Kluwer Academic Publishers / Springer Publishers, June 2007, ch. 5, pp. 83-100, ISBN-10: 1402055293, ISBN-13: 978-1-4020-5529-4.
[19] D. Vassiliadis, N. Kavvadias, G. Theodoridis and S. Nikolaidis, A RISC architecture extended by an efficient tightly coupled reconfigurable unit. In Proc. ARC, 2005.
[20] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh and E. M. C. Filho, Morphosys: An integrated reconfigurable system for data-parallel and computation-intensive applications, IEEE Trans. Computers, vol. 49, no. 5, pp. 465-481, 2000.
[21] www.altera.com
[22] www.xilinx.com
[23] Altera (2006) Stratix II vs. Virtex-4 Density Comparison. Consulted on 16 February 2007. Altera white paper at http://www.altera.com
[24] S. Vassiliadis, S. Wong, G. N. Gaydadjiev, K. Bertels, G. Kuzmanov and E. M. Panainte, The Molen Polymorphic Processor, IEEE Transactions on Computers, vol. 53, pp. 1363-1375, November 2004.
[25] E. M. Panainte, K. Bertels and S. Vassiliadis, The Molen Compiler for Reconfigurable Processors, ACM Trans. Embed. Comput. Syst., Vol. 6, ISSN: 1539-9087, New York, USA, February 2007.
[26] W. Hussain, Design and development from single core reconfigurable accelerators to a heterogeneous accelerator-rich platform, Tampere University of Technology, p. 128, ISBN: 978-952-15-3406-5, Tampere, Finland, 27 Nov 2014.
[27] A. Lodi, C. Mucci, M. Bocchi, A. Cappelli, M. De Dominicis and L. Ciccarelli, A Multi-Context Pipelined Array for Embedded Systems, International Conference on Field Programmable Logic and Applications, 2006. FPL06, pp. 18, Aug. 2006.
[28] A. Lodi, M. Toma, F. Campi, A. Cappelli, R. Canegallo and R. Guerrieri, A VLIW Processor with Reconfigurable Instruction Set for Embedded Applications, IEEE Journal of Solid-State Circuits, Vol. 38, pp. 1876-1886, Nov. 2003, doi: 10.1109/JSSC.2003.818292.
[29] Chia-Cheng Lo, Shang-Ta Tsai and Ming-Der Shieh, A reconfigurable architecture for entropy decoding and IDCT in H.264, VLSI Design, Automation and Test, 2009. VLSI-DAT 09. International Symposium on, pp. 279-282, April 2009, doi: 10.1109/VDAT.2009.5158149, ISBN: 978-1-4244-2781-9.
[30] P. Kunjan and C. Bleakley, Systolic Algorithm Mapping for Coarse Grained Reconfigurable Array Architectures, Reconfigurable Computing: Architectures, Tools and Applications, Lecture Notes in Computer Science, 2010, Springer Berlin Heidelberg, pp. 351-357, Vol. 5992, doi: 10.1007/978-3-642-12133-3 33.
[31] F. Garzia, W. Hussain, R. Airoldi and J. Nurmi, A Reconfigurable SoC tailored to Software Defined Radio Applications, in Proc of 27th Norchip Conference, Trondheim (NO), 2009.
[32] Y. Kishimoto, S. Haruyama and H. Amano, Design and Implementation of Adaptive Viterbi Decoder for Using A Dynamic Reconfigurable Processor, in Proc. Reconfigurable Computing and FPGAs, Dec. 2008, pp. 247-252, doi: 10.1109/ReConFig.2008.39, ISBN: 978-1-4244-3748-1.
[33] S. Nouri, W. Hussain and J. Nurmi, Evaluation of a Heterogeneous Multicore Architecture by Design and Test of an OFDM Receiver, IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 11, November 2017.
[34] F. Conti, C. Pilkington, A. Marongiu and L. Benini, He-P2012: architectural heterogeneity exploration on a scalable many-core platform, in Proc. of the 24th edition of the great lakes symposium on VLSI (GLS-VLSI '14), pp. 231-232, ACM, New York, NY, USA.
[35] Roberto Airoldi, Design and Implementation of Software Defined Radios on a Homogeneous Multi-Processor Architecture, Tampere University of Technology, Vol. 1136, ISBN: 978-952-15-3078-4, May 2013, Tampere, Finland.
[36] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, Jae-Wook Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe and A. Agarwal, The Raw microprocessor: a computational fabric for software circuits and general-purpose programs, Micro, IEEE, vol. 22, no. 2, pp. 25-35, 2002.
[37] M. B. Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe and A. Agarwal, Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams, SIGARCH Comput. Archit. News 32, March 2004.
[38] R. Hartenstein, Coarse grain reconfigurable architecture, ASP-DAC '01 Proceedings of the 2001 Asia and South Pacific Design Automation Conference, pp. 564-570, Yokohama, Japan.
[39] R. Airoldi, F. Garzia, O. Anjum and J. Nurmi, Homogeneous MPSoC as baseband signal processing engine for OFDM systems, International Symposium on System on Chip (SoC), 2010, pp. 26-30, Sept. 2010, doi: 10.1109/ISSOC.2010.5625562.
[40] Man-On Pun, M. Morelli and C-C Jay Kuo, Multi-carrier techniques for broadband wireless communications: a signal processing perspective, Imperial College Press, December 2007.
[41] S. Afrasiabi Gorgani, Peak Power Reduction In Multicarrier Waveforms, Master Thesis, pp. 77, May 2014, Tampere University of Technology, Tampere, Finland.
[42] J. Heiskala and J. Terry, OFDM Wireless LANs: A Theoretical and Practical Guide, 336 pages, SAMS, 2001, ISBN: 0672321572, Indianapolis, Indiana, USA.
[43] E. Perahia and R. Stacy, Next Generation Wireless LANs 802.11n and 802.11ac, 2nd edition, p. 452, 2013, Cambridge University Press.
[44] Supplement to IEEE Standard for Information Technology - Telecommunications and Information Exchange Between Systems - Local and Metropolitan Area Networks - Specific Requirements. Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: High-Speed Physical Layer in the 5 GHz Band, IEEE Std 802.11a-1999, 1999, New York, NY, USA.
[45] J. R. Hampton, Introduction to MIMO Communications, Published in the United States of America by Cambridge University Press, New York, ISBN: 978-1-107-04283-4, First published 2014.
[46] N. Al-Ghazu, A Study of the Next WLAN Standard IEEE 802.11ac Physical Layer, KTH School of Electrical Engineering (EE) Signal Processing, p. 68, 2013, XR-EE-SB 2013:001.
[47] A. F. Molisch, Wireless Communication, 2nd Edition, John Wiley and Sons
Ltd., p. 884, December 2010, ISBN: 978-0-470-74186-3.
[48] A. Peled and A. Ruiz, Frequency domain data transmission using reduced
computational complexity algorithms, Acoustics, Speech, and Signal Proces-
sing, IEEE International Conference on ICASSP '80, Vol. 5, IEEE, April 1980,
doi: 10.1109/ICASSP.1980.1171076.
[49] M. Valkama and M. Renfors, COMMUNICATION THEORY, http://www.cs.
tut.ﬁ/kurssit/TLT-5206/general.html
[50] L. Liang, J. Shi, L. Chen and S. Xu, Implementation of Automatic Gain
Control in OFDM digital receiver on FPGA, 2010 International Conference on
Computer Design and Applications (ICCDA), vol. 4, pp. 446-449, June 2010.
[51] http://www.mathworks.com
[52] A. Mueen, A. Nath and J. Liu, Fast approximate correlation for massive time-
series data, Proceedings of the 2010 ACM SIGMOD International Conference
on Management of data, pp. 171-182, Indianapolis, Indiana, USA, June 2010.
[53] L. Harju and J. Nurmi, Hardware platform for software-defined
WCDMA/OFDM baseband receiver implementation, IET Computers and Digital
Techniques, vol. 1, no. 5, pp. 640-652, September 2007.
[54] J.-J. van de Beek, P.O. Borjesson, M.-L. Boucheret, D. Landstrom, J.M. Arenas,
P. Odling, C. Ostberg, M. Wahlqvist and S.K. Wilson, A time and frequency
synchronization scheme for multiuser OFDM, IEEE Journal on Selected Areas
in Communications, vol. 17, no. 11, pp. 1900-1914, 1999.
[55] J. Rinne, Multicarrier Techniques, http://www.cs.tut.fi/kurssit/TLT-5706/
[56] T. Hwang, C. Yang, G. Wu, S. Li and G.Y. Li, OFDM and Its Wireless
Applications: A Survey, IEEE Transactions on Vehicular Technology, vol. 58, no. 4,
pp. 1673-1694, May 2009.
[57] R. G. Lyons, Understanding Digital Signal Processing, Boston, MA, USA:
Addison-Wesley, 1999.
[58] W. Hussain, F. Garzia and J. Nurmi, Evaluation of Radix-2 and Radix-4 FFT
Processing on a Reconfigurable Platform, in Proc. IEEE International Symposium
on Design and Diagnostics of Electronic Circuits and Systems, pp. 249-254,
Vienna, Austria, April 2010, doi: 10.1109/DDECS.2010.5491775.
[59] S. Takaoka and F. Adachi, Pilot-assisted adaptive interpolation channel
estimation for OFDM signal reception, IEEE 59th Vehicular Technology Conference
(VTC 2004-Spring), vol. 3, pp. 1777-1781, May 2004.
[60] R. Hajizadeh, K. Mohamedpor and M.R. Tarihi, Channel Estimation in OFDM
system Based on the Linear Interpolation, FFT and Decision Feed-back, 18th
Telecommunication forum TELFOR, Serbia, Belgrade, November 2010.
[61] S. Coleri, M. Ergen, A. Puri and A. Bahai, Channel Estimation Techniques
Based on Pilot Arrangement in OFDM Systems, IEEE Transactions on
Broadcasting, vol. 48, no. 3, pp. 223-229, 2002.
[62] M. Renfors and M. Valkama, Digital Transmission,
http://www.cs.tut.fi/kurssit/TLT-5406/
[63] Filippo Tosato and Paola Bisaglia, Simplified Soft-Output Demapper for Binary
Interleaved COFDM with Application to HIPERLAN/2, IEEE International
Conference on Communications, HPL-2001-246, October 2001, New York, NY,
USA.
[64] R.V. Nee and R. Prasad, OFDM for Wireless Multimedia Communications,
Artech House, Inc., Norwood, MA, USA, 2000.
[65] J. Kylliainen, T. Ahonen and J. Nurmi, General-purpose embedded processor
cores - the COFFEE RISC example, Processor Des.: Syst.-Chip Comput.
ASICs FPGAs, J. Nurmi, Ed. Kluwer Academic Publishers Springer Publishers,
ch. 5, pp. 83-100, 2007, doi: 10.1007/978-1-4020-5530-0_5.
[66] C. Brunelli, F. Garzia, C. Giliberto and J. Nurmi, A dedicated DMA Logic
addressing a time multiplexed memory to reduce the effects of the system bus
bottleneck, in Proc. 18th Int. Conf. Field Program. Logic Appl., 2008, pp.
487-490.
[67] F. Garzia, C. Brunelli and J. Nurmi, A pipelined infrastructure for the
distribution of the configuration bitstream in a coarse-grain reconfigurable
array, in Proceedings of the 4th International Workshop on Reconfigurable
Communication-centric System-on-Chip (ReCoSoC '08). Univ Montpellier II,
July 2008, pp. 188-191, ISBN:978-84-691-3603-4.
[68] J. E. Volder, The CORDIC Trigonometric Computing Technique, IRE
Transactions on Electronic Computers, pp. 330-334, September 1959.
[69] P. K. Meher, J. Valls, T-B Juang, K. Sridharan and K. Maharatna, 50 Years of
CORDIC: Algorithms, Architectures and Applications, IEEE Transactions on
Circuits and Systems-I: Regular Papers, vol. 56, no. 9, pp. 1893-1907, September
2009.
[70] J. E. Meggitt, Pseudo Division and Pseudo Multiplication Processes,
IBM Journal of Research and Development, pp. 210-226, April 1962, doi:
10.1147/rd.62.0210.
[71] M. D. Greenberg, Foundations of Applied Mathematics, Dover Publications,
Inc. Mineola, New York, USA, p. 656, 1978, ISBN: 9780486492797.
[72] V. S. Ryabenkii and S. V. Tsynkov, A Theoretical Introduction to Numerical
Analysis, p. 537, Boca Raton, FL, USA: CRC Press, ISBN: 1584886072, 2006.
[73] Design Implementation and Optimization, Quartus II Handbook Version 13.1,
San Jose, CA, USA, Altera Corporation, May 2013.
[74] W. Hussain, R. Airoldi, H. Hoffmann, T. Ahonen and J. Nurmi, HARP2:
An X-scale reconfigurable accelerator-rich platform for massively-parallel signal
processing algorithms, Springer's J. Signal Process. Syst., vol. 85, no. 3, pp.
341-353, 2015.
[75] D. Westcott, D. Coleman, P. Mackenzie and B. Miller, CWAP Certified
Wireless Analysis Professional Official Study Guide, Sybex, p. 696, March 2011,
ISBN: 978-0-470-76903-4.
[76] I. Kuon and J. Rose, Measuring the gap between FPGAs and ASICs, IEEE
Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 2, pp. 203-215,
2007.
[77] F. Garzia, From run-time reconfigurable coarse-grain arrays to
application-specific accelerators design, Ph.D. dissertation, Tampere University of
Technology (TUT), Tampere, Finland, 2009, p. 125, TUT Publications 860, ISBN:
978-952-15-2280-2.
[78] W. Hussain, J. Nurmi, J. Isoaho and F. Garzia, Computing Platforms
for Software-Defined Radio, p. 240, Springer, ISBN: 978-3-319-49678-8, doi:
10.1007/978-3-319-49679-5, 2017.
[79] https://www.mentor.com/products/fv/modelsim
