Implementing carrier recovery for LTE 20 MHz on transport triggered architecture by Mubashir, Ali
MUBASHIR ALI
IMPLEMENTING CARRIER RECOVERY FOR LTE 20 MHZ ON
TRANSPORT TRIGGERED ARCHITECTURE
MASTER OF SCIENCE THESIS
                                     EXAMINERS:   PROF. JARI       NURMI
        PROF. MIKKO VALKAMA, MSC OMER ANJUM
Examiner and topic approved by Faculty
Council of Computing and Electrical
Engineering, December 2011.
IABSTRACT
Degree Programme in Electrical Engineering
Mubashir Ali: Implementing Carrier Recovery for LTE 20 MHz on Transport
Triggered Architecture
Master of Science Thesis: 49 pages
March 2013
Major subject: Radio Frequency Electronics
Examiners: Professor Jari Nurmi, Professor Mikko Valkama, MSc Omer Anjum
Keywords: Long Term Evolution, Orthogonal Frequency Division Multiplexing, Carrier
Frequnecy Offset, Transport Triggered Architecture, Software Defined Radio.
Synchronization is a critical function in digital communications. Its failure may
cause catastrophic effects on the transmission system performance. It is very important
that the receiver is synchronised with the transmitter because it is not possible to correct
frequencies/phases without any control mechanisms. Synchronization is different in
Third Generation Partnership Project (3GPP) Long Term Evolution (LTE) for uplink
and downlink because of the choice of multiple access scheme. Multiple access scheme
for LTE downlink is Orthogonal Frequency Division Multiple Access (OFDMA) and
Single Carrier-Frequency Division Multiple Access (SC-FDMA) for the uplink. OFD-
MA is susceptible to Carrier Frequency Offset (CFO). In case of a typical LTE system
with a carrier frequency of 2.1 GHz, a frequency drift of 10ppm (10×10-6) of the local
oscillator can cause an offset of 21 kHz. LTE system employs a fixed subcarrier spacing
of 15 kHz. This offset caused by the local oscillator corresponds to 1.40 subcarrier spac-
ings. The receiver extracts the information from the received signal to synchronise and
compensate for any carrier frequency/phase offset. Increasing demand for data driven
applications has put stress on communication systems to provide high data rates and
increased bandwidth. This demand has ever been increasing and requires new standards
to evolve and efficient hardware. It has been difficult to develop hardware at the pace
new communication standards are developing. It also increases the cost of deployment
of a technology for a brief period of time without covering the huge capital invested in
the network. In order to meet the pace of evolving standards and covering the huge net-
work costs, industry needs Software-Defined Radio (SDR). SDR is a radio communica-
tion technology that is based on software defined wireless communication protocols
instead of hardwired implementations. System components that are usually implement-
ed in hardware are implemented by means of software on a computer or embedded sys-
tem.
LTE carrier recovery algorithm for LTE downlink with 20 MHz system bandwidth
has been implemented in this thesis. The architecture chosen for implementation is
Transport Triggered Architecture (TTA) with the goal to achieve real time constraints
II
along with a certain flexibility and power consumption needed for an SDR platform.
The target programming language is C with TTA specific extensions instead of hand
optimized assembly with the aim to reduce the whole design time and still achieve the
required optimizations and throughput. This design cycle time is also one of the im-
portant aspects for product development in the industry.
III
PREFACE
This M.Sc. thesis was completed in Department of Computer Systems of Tampere
University of Technology (TUT) as a part of the International Masters Degree Program
in RF Electronics.
I would like to express my sincere gratitude to my thesis supervisors, MSc Omer
Anjum and Prof. Jari Nurmi, for giving me a chance to develop as a C programmer and
for their patience during this period. I am extremely grateful to them for introducning
me the new concepts and research topics in computer architecture, software definded
radio and for their patience in answering my questions. Special thanks to Department of
Computer Systems of TUT for giving me the opportunity and space for learning and
research during my thesis work. I would also like to thank my colleagues and friends for
the inspired and motivated work atmosphere they have provided.
I  am also  extremely  grateful  to  my parents  for  their  continuous  support  especially
my father and my elder brother Mr. Haider Ali for providing me moral and financial
support and courage during the whole period. Finally, I would like to remember my
relatives who passed away while I was away from home.
Tampere, March 22, 2013
Mubashir Ali
IV
CONTENTS
ABSTRACT .................................................................................................................I
PREFACE .................................................................................................................III
CONTENTS .............................................................................................................. IV
List of Acronyms ....................................................................................................... VI
List of Symbols and Definitions ................................................................................ IX
1. Introduction ......................................................................................................... 1
1.1. Evolution of Wireless Technology .................................................................. 1
1.2. OFDM & OFDMA ........................................................................................ 3
1.3. LTE (4G) Multiple Access Schemes ............................................................... 5
1.4. Related Work ................................................................................................. 6
1.5. Organization of the Thesis .................................................................................. 8
2. LTE Downlink physical layer ............................................................................. 9
2.1. LTE Physical Layer Frame Structure ............................................................. 9
2.2. Downlink Frame Structure .......................................................................... 11
2.3. LTE Downlink Physical Layer Signals ......................................................... 12
2.3.1. Reference Signals .................................................................................... 12
2.3.2. Reference Signal Sequence Generation .................................................... 13
2.3.3. Reference Signal Mapping to Resource Elements .................................... 14
2.3.4. Synchronization Signals........................................................................... 16
2.3.5. Primary Synchronization Signal Sequence Generation and Resource
Element Mapping ................................................................................................ 17
2.3.6. Secondary Synchronization Signal Sequence Generation and Resource
Element Mapping ................................................................................................ 18
2.4. Downlink Multi-Antenna Transmission ........................................................ 19
3. LTE Carrier Recovery ...................................................................................... 21
V3.1. Carrier Recovery ......................................................................................... 22
3.2. LTE Downlink Carrier Recovery Algorithm ................................................. 23
3.2.1. Carrier Frequency Offset Estimation Method ........................................... 23
3.3. LTE Uplink Carrier Recovery ...................................................................... 26
4. TCE Framework ............................................................................................... 27
4.1. Transport Triggered Architecture (TTA) ...................................................... 27
4.2. Hardware Aspects of TTA ............................................................................ 28
4.3. TTA Programming Model ............................................................................ 29
4.4. TCE (TTA-based Co-design Environment) ................................................... 31
4.5. TTA Design with TCE .................................................................................. 32
5. Implementation ................................................................................................. 36
5.1. Implementation in C .................................................................................... 36
5.1.1. Fractional Carrier Frequency Offset ......................................................... 36
5.1.2. Integer Carrier Frequency Offset ............................................................. 36
5.1.3. Residual Carrier Frequency Offset ........................................................... 36
5.2. Channel Characteristics and Synchronization Perfromance ........................ 37
5.3. TTA Processor for Carrier Recovery ........................................................... 37
5.4. TTA with Custom Functional Units .............................................................. 38
5.5. CORDIC Functional Unit ............................................................................ 39
5.6. Complex Multiplier Functional Unit ............................................................ 40
5.7. FFT Functional Unit ................................................................................... 41
5.8. Evaluation ................................................................................................... 42
6. Summary and Conclusion ................................................................................. 44
6.1. Summary ..................................................................................................... 44
References ................................................................................................................. 46
VI
LIST OF ACRONYMS
1G, 2G, 3G, 4G First  Generation,  2nd  Generation,  3rd  Generation,  4th
Generation
3GPP 3rd Generation Partnership Project
ADF Architecture Definition File
AMPS Advanced Mobile Phone Service
CDMA Code Division Multiple Access
CFO Carrier Frequency Offset
CORDIC Coordinate Rotational Digital Computer
CP Cyclic Prefix
DA Data-Aided
DD Data Directed
DL Downlink
DRE Dead Result Elimination
DSE Design Space Exploration
EDGE Enhanced Data rates for GSM Evolution
FDD Frequency Division Duplex
FDMA Frequency Division Multiple Access
FFO Fractional Frequency Offset
FFSK Fast Frequency Shift Keying
FFT Fast Fourier Transform
FPGA Field Programmable Gate Array
FU Functional Unit
GPRS General Packet Radio Service
GSM Global Systems for Mobile Communications
GUI Graphical User Interface
HDB Hardware Database
HLL High Level Language
IC Interconnection
VII
ICI Inter-Carrier Interference
IDF Implementation Definition File
IFO Integer Frequency Offset
IMT International Mobile Telecommunications
IS Interim Standard
ISI Inter-Symbol Interference
ISS Instruction Set Simulator
ITU International Telecommunication Union
LO Local Oscillator
LSU Load-Store Unit
LTE Long Term Evolution
MIMO Multiple Input Multiple Output
ML Maximum Likelihood
MPSoC Multi-Processor System on Chip
MU-MIMO Multi-User MIMO
NDA Non-Data Aided
NMT Nordic Mobile Telephony
OFDMA Orthogonal Frequency Division Multiple Access
OTA Operation-Triggered Architecture
PAPR Peak to Average Power Ratio
PCI Physical-Layer Cell Identity
PDC Personal Digital Cellular
PHS Personal Handy Phone System
PPM Parts Per Million
P-SCH Primary Synchronization Signal
QAM Quadrature Amplitude Modulation
QoS Quality of Service
QPSK Quadrature Phase Shift Keying
RB Resource Block
RE Resource Element
RF Radio Frequency/Register File
RFO Residual Frequency Offset
RISC Reduced Instruction Set Computer
RS Reference Symbol
VIII
RTL Register Transfer Level
SC-FDMA Single-Carrier Frequency-Division Multiple Access
SDR Software Defined Radio
SFU Special Functional Unit
SIMD Single Instruction, Multiple Data
SODA Signal Processing on Demand Architecture
S-SCH Secondary Synchronization Signal
SU-MIMO Single-User MIMO
TACS Total Access Communication System
TCE TTA-based Co-design Environment
TDMA Time Division Multiple Access
TPEF TTA Program Exchange Format
TTA Transport Triggered Architecture
WLAN Wireless Local Area Network
IX
LIST OF SYMBOLS AND DEFINITIONS
)(
,
p
lka Value of resource element ),( lk  [for antenna port p ]
h Channel impulse response
I Inter-carrier interference caused by neighboring subcarriers
),( lk Resource element with frequency-domain index k  and time-
domain index l
Kp Joint set of OFDM symbol number and subcarrier indices for
reference symbols
KSCH Subcarrier indices that contain the synchronization signals
l OFDM symbol index in one subframe
N FFT size
cell
IDN Physical layer cell identity
DL
RBN Downlink bandwidth configuration, expressed in multiples
of RBscN
DLmax,
RBN Largest downlink bandwidth configuration, expressed in
multiples of RBscN
RB
scN Resource block size in the frequency domain
Ng CP length
Ns Number of OFDM symbols in one slotNୖ୆ୈ୐ Maximum number of downlink Resource Blocks
DL
symbN Number of OFDM symbols in a downlink slot
n Time index within one OFDM symbol
R Received reference symbols
RPSCH Received primary synchronization signal
RSSCH Received secondary synchronization signal
r Received time-domain signal
Tu Subcarrier modulation-symbol time
Xv Additive Gaussian noise
X Transmitted reference symbols
XPSCH Transmitted Primary synchronization signal
XSSCH Transmitted secondary synchronization signal
x Transmitted OFDM signal
ǻf Subcarrier spacing
ǻȦ Constant carrier frequnecy error
ˢCFO Frequency mismatch/offset between transmitter and receiver
ˢFFO Fractional frequency mismatch/offset
ˢIFO Integer frequency mismatch/offset
ˢRFO Residual frequency mismatch/offset
ĳ Constant carrier phase error
Ȗ Degradation on the desired subcarrier caused by the CFO
11. INTRODUCTION
Wireless technology has evolved very rapidly during the last couple of decades.
New wireless communication methods, services and efficient hardware have been
adopted throughout the world. The mobile radio communication industry has grown by
orders of magnitude and this growth has been further accelerated by advancement in
digital and radio frequency (RF) circuit fabrication, new large scale circuit integration,
and other miniaturization technologies which make portable radio equipment smaller,
cheaper, and more reliable. Digital switching techniques have facilitated the large scale
deployment of affordable, easy to use radio communication networks. These trends will
continue at even greater pace during the coming years. [1]
1.1. Evolution of Wireless Technology
The evolution of wireless communication networks is spread across four genera-
tions. These are first generation (1G) mobile networks, second generation (2G) mobile
networks, third generation (3G) mobile networks, and the latest commercially available
fourth generation (4G) mobile networks respectively.
Figure 1.1. Wireless networks generations and technologies
Japan took the lead in the development of wireless technology, deploying the first
cellular networks in Tokyo. Within a couple of years Nordic Mobile Telephony (NMT)
started cellular operations in Europe. At the same time Advanced Mobile Phone Service
(AMPS) started in the USA, while Total Access Communication System (TACS) start-
ed in the UK. These systems were called ‘First Generation Mobile Systems’, providing
2speech services and were based on analogue transmission techniques. 1G network sup-
ported very low data rates. AMPS modem call data rate under good conditions is as high
as 14.4 kbps and under poor conditions is as low as 4.8 kbps. NMT FFSK modem data
rate was 12kbps.
In the early 1990s, digital transmission technology came into force, introducing the
‘Second Generation Mobile System’. Key 2G systems in this generation included Glob-
al Systems for Mobile Communications (GSM), TDMA IS-136, CDMA IS-95, Personal
Digital Cellular (PDC) and Personal Handy Phone System (PHS). IS 54 and IS 136
(where IS stands for Interim Standard) were the second generation mobile systems that
constituted the D-AMPS. This was the digital advancement of the then existing AMPS
in America. [2]
The third generation cellular networks were developed with the aim of offering high
speed data and multimedia connectivity to subscribers. The International Telecommuni-
cation Union (ITU) under the initiative IMT-2000 defined 3G systems as being capable
of supporting high speed data ranges of 144 kbps to greater than 2 Mbps. A few tech-
nologies are able to fulfill the International Mobile Telecommunications (IMT) stand-
ards, such as CDMA and UMTS. [2]
The three previous generations of mobile networks had their merits and demerits but
none of them had the ability to completely replace the other. Even IMT-2000, a world-
wide standard, was not able to break the bottleneck of high data rate and capacity, and
this led to the formation of a new generation, referred to as 3GPP Long Term Evolution
(LTE) or 4G. Orthogonal Frequency Division Multiple Access (OFDMA) and Frequen-
cy  Division  Multiple  Access  (FDMA)  are  used  as  the  modulation  access  schemes  for
downlink and uplink respectively in LTE. It was thought that instead of developing new
radio interfaces and new technology it would be better to integrate existing and newly
developed wireless systems like the GPRS, EDGE, Bluetooth, WLAN and Hiper-LAN.
[2]
Wireless networks evolution in terms of data rates, mobility and different technolo-
gies is depicted in the following figure. The figure is based on research carried out on
existing and emerging wireless networks at IMEC Belgium.
3Figure 1.2. Wireless networks supported data rates and mobility [(C) Copyright IMEC]
1.2. OFDM & OFDMA
OFDM is a special case of Frequency Division Multiplexing (FDM). OFDM is both
a modulation and multiplexing technique. The concept of OFDM has existed since
around 1966. [5]
A single carrier system modulates information onto one carrier using frequency,
phase, or amplitude adjustment of the carrier. For digital signals, the information is in
the form of bits, or collection of bits called symbols, that are modulated onto the carrier.
As higher bandwidths (data rates) are used, the duration of one bit or symbol of infor-
mation becomes smaller. [32]
OFDM system breaks the available bandwidth into many narrower sub-carriers and
transmits the data in parallel streams. OFDM symbols are much longer than symbols on
single carrier systems of equivalent data rate. [32]
Each sub-carrier is modulated using varying levels of QAM modulation, e.g.,
QPSK, QAM, 64QAM or possibly higher orders depending on signal quality. There-
fore, each OFDM symbol is a linear combination of the instantaneous signals on each of
the sub-carriers in the channel.
There are two important aspects of OFDM. First, each OFDM symbol is preceded
by a cyclic prefix (CP). This CP is effective in eliminating Inter-Symbol Interference
(ISI). Second, the sub-carriers are very tightly spaced to make efficient use of band-
4width, yet there is virtually no interference among adjacent sub-carriers. These two fea-
tures are in fact closely related.
Figure 1.3. OFDM symbol with CP
OFDM symbol consists of two major components: CP and a Fast Fourier Transform
(FFT) period. The duration of CP is determined by the highest anticipated degree of
delay spread for the targeted application. When transmitted symbols arrive at the re-
ceiver by two paths of differing length, they are spread in time. However it is possible to
have distortion from preceding symbol within the CP. With a CP of sufficient duration,
preceding symbols do not spill over into the FFT period.
Once the signal is received and digitized, the receiver removes CP. The result is a
rectangular pulse within each subcarrier, which has constant amplitude over the FFT
period. [6]
The OFDM pulse shaping for simple rectangular pulse is shown in below:
Figure 1.4. Pulse shaping and sub-carrier spacing
5Figure 1.4 (c) shows tight frequency-domain packing of the subcarriers with a sub-
carrier spacing ǻf = 1/Tu, where Tu is the per-subcarrier modulation-symbol time (see
Figure 1.4). In other words the subcarrier spacing is thus equal to the per-subcarrier
modulation rate 1/Tu, which gives orthogonal sub-carriers.
OFDMA is the multiple access scheme for LTE downlink. There is a difference be-
tween OFDM and OFDMA in the scheduling of the users. In an OFDM system, differ-
ent users scheduled in time domain are allocated the full bandwidth. In OFDMA users
are scheduled in both frequency and time domain in such a way that the available
bandwidth is shared by the users. A very important advantage of OFDMA system is its
scheduling property. This property allows scheduling decisions in the frequency do-
main. For instance, some sub-carriers can be modulated simultaneously with 64-QAM
while others may be modulated with some other digital modulation technique like
QPSK in order to take frequency dependencies into account in the radio link quality.
Figure 1.5. OFDM and OFDMA (Adapted from [36])
1.3. LTE (4G) Multiple Access Schemes
Actual work towards 3GPP Long Term Evolution (LTE) started in 2004 with the
definition of the targets. After initial consolidation of proposals, the candidate scheme
for the downlink was OFDMA, while the candidate scheme for the uplink was SC-
FDMA. The choice of multiple-access schemes was made in December 2005, with
OFDMA being selected for the downlink, and SC-FDMA for the uplink. Both of these
schemes open up the frequency domain as a new dimension of flexibility in the system.
6Figure 1.6. LTE multiple access technologies
Problem: Increasing demand for data driven applications has put stress on commu-
nication systems to provide high data rates and increased bandwidth. This demand has
ever been increasing and requires new standards to evolve and efficient hardware. It has
been difficult to develop hardware at the pace new communication standards are devel-
oping. It also increases the cost of deployment of a technology for a brief period of time
without covering the huge capital invested in the network.
Proposed Solution: In  order  to  meet  the  pace  of  evolving  standards  and  covering
the huge network costs industry needs SDR (Software-Defined Radio). [11] SDR is a
radio communication technology that is based on software defined wireless communica-
tion protocols instead of hardwired implementations. System components that are usual-
ly implemented in hardware are implemented by means of software on a computer or
embedded system. In other words, frequency band, air interface protocol and functional-
ity can be upgraded with software download or configuring hardware instead of com-
plete hardware replacement.
1.4. Related Work
SDR architectures have been proposed by CoreSonic [8], Sandbridge [9] and SODA
(Signal Processing on Demand Architecture) in [10]. Multi-threaded processor architec-
ture for SDR by Sandbridge is shown in the Figure 1.7. The design includes a unique
combination of modern techniques such as a SIMD Vector/DSP unit, a parallel reduc-
tion unit, and a RISC-based integer unit. Each processor core provides support for con-
current execution for up to eight threads of execution. The platform allows the freedom
to choose a particular set of operating modes (such as GSM, GPRS, WCDMA, etc.) and
the desired local area connectivity (Bluetooth, Wireless LAN, etc). [9]
7 Some of the key focuses in this architecture are support for high-level programming
language like C and compiler opimization for DSP. The need for compiler design in
parallel with the DSP architecture design is particularly emphasized in their design cy-
cle for the whole system. The proposed compiler analyzes the C code and extracts the
DSP operations itself. [11]
Figure 1.7. Multi-threaded processor SDR architecture [9]
A tradeoff exists between operating frequency and power consumption. The availa-
ble state of the art SDR architectures have a drawback of either higher operating operat-
ing frequency or higher power consumtion. Compiler design optimization is another
contributing factor as well. Considering these facts TTA (transport triggered architec-
ture) is considered to be a possible solution.
 LTE carrier recovery algorithm for LTE with 20 MHz system bandwidth has been
implemented in this thesis. The architecture chosen for implementation is TTA with the
goal to achieve real time constraints along with a certain flexibility and power consump-
tion needed for an SDR platform. The target programming language is C with TTA spe-
cific extensions instead of hand optimized assembly with the aim to reduce the whole
design time and still achieving the required optimizations and throughput. This design
cycle time is also one of the important aspects for product development in the industry.
81.5. Organization of the Thesis
The thesis is organized as follows:
Chapter 1: discusses the evolution of the wireless technology, the technologies availa-
ble and the challenges faced by future wireless technology.
Chapter 2: is about the LTE physical layer. LTE frame structure and the physical layer
downlink signals used for carrier recovery.
Chapter 3: discusses the carrier recovery offset according to the mathematical model
presented in [16].
Chapter 4: is about TTA framework, TTA based Co-Design Environment (TCE) tool-
set for designing and optimizing a TTA processor.
Chapter 5: describes the TTA design process and optimizations for a Carrier Recovery
TTA and the Special Functional Units (SFUs) designed for the Carrier Recovery TTA.
Results are discussed and comparison has been made between Carrier Recovery TTA
and other architectures.
Chapter 6: summarizes and concludes the thesis.
92. LTE DOWNLINK PHYSICAL LAYER
The physical layer offers data transport services to higher layers in a system. LTE
physical layer deals with signals in the OFDM format on the downlink (DL). The physi-
cal layer is expected to perform functions like modulation and demodulation of physical
channels, frequency and time synchronization, multiple input multiple output (MIMO)
antenna processing, transmit diversity (TX diversity) and RF processing.
As mentioned in the previous chapter multiple access schemes for the LTE physical
layer are based on OFDMA with a cyclic CP in the downlink, and on SC-FDMA with a
cyclic prefix in the uplink. Furthermore LTE frame structure and access schemes will be
discussed in more detail in this chapter.
2.1. LTE Physical Layer Frame Structure
The Layer 1 is defined in a bandwidth agnostic way based on resource blocks, al-
lowing the LTE Layer 1 to adapt to various spectrum allocations. Downlink and uplink
transmissions are organized into radio frames with ms10307200 sf  u TT duration.
Two radio frame structures are supported:
 Type 1, applicable to FDD
 Type 2, applicable to TDD
10 subframes are available for downlink transmission and 10 subframes are availa-
ble for uplink transmissions in each 10 ms interval for the FDD mode. Uplink and
downlink transmissions are separated in the frequency domain. In half-duplex FDD op-
eration, the user terminal cannot transmit and receive at the same time while there are
no such restrictions in full-duplex FDD.
Type 1 frame structure with FDD mode has been focused in this work. The smallest
time-frequency unit for downlink transmission is called a resource element (RE), shown
in Figure 2.1 in the resource grid identified by the index pair (k,l) in a slot where
1,...,0 RBsc
DL
RB  NNk  and 1,...,0 DLsymb  Nl are the indices in the frequency and time
domains, respectively. DLRBN is downlink bandwidth configuration, expressed in multi-
10
ples of RBscN , RBscN is resource block size in frequency domain and
DL
symbN  is number of
downlink OFDM symbols in a time slot.
A group of contiguous sub-carriers and symbols form a resource block (RB). Data is
allocated  to  each  user  in  terms  of  RB.  For  a  frame  structure  using  normal  CP,  a  RB
spans 12 consecutive sub-carrier spacings of 15 kHz, and 7 consecutive symbols over
slot duration of 0.5 ms. In other words, resource blocks are actually two dimensional
(time-frequency) units with a size of 12 sub-carriers times 0.5ms slots.
The following table shows the number of resource blocks for LTE system operating
at different bandwidths in the frequency domain;
Table 2-1 System Bandwidth and number of resource blocks
System
Bandwidth
[MHz]
1.4 3 5 10 15 20
Number of
RBs
6 15 25 50 75 100
Signal
Bandwidth
[MHz]
1.08 2.7 4.5 9 13.5 18
The following table shows the number of OFDM symbols in a slot in time domain;
Table 2-2 OFDM symbols in a slot
Cyclic Prefix Sub-carrier Spacing
Number of OFDM sym-
bols in a time slot
Normal 15 kHz 7
Extended
15 kHz 6
7.5 kHz 3
11
Figure 2.1. Downink resource grid. Relationship between a slot, OFDM symbols and Resource
Blocks. [ 33]
The  transmitted  signal  in  each  slot  is  described  by  one  or  several  resource  grids  of
RB
sc
DL
RB NN subcarriers and DLsymbN OFDM symbols. The number of OFDM symbols in a
slot depends on the cyclic prefix length and subcarrier spacing.
2.2. Downlink Frame Structure
Frame structure type 1 is applicable to both full duplex and half duplex FDD. Each
radio frame is ms10307200 sf   TT  long and consists of 20 slots of length
ms5.0T15360 sslot   T ,  numbered  from  0  to  19.  A  subframe  is  defined  as  two
consecutive slots where subframe i consists of slots i2 and 12 i .
12
Figure 2.2. FDD Frame. Timing and symbol allocations shown for FDD with normal cyclic prefix
(CP).  [33]
2.3. LTE Downlink Physical Layer Signals
The LTE frame carries physical channels and signals. Channels carry information
received from higher layers. Signals originate at the physical layer. The framing struc-
ture is common to uplink and downlink, but the physical signals and physical channels
are different. [12]
The signals used in the carrier recovery algorithm are reference, primary and second-
ary synchronization signals.
2.3.1. Reference Signals
Reference signals are mapped to the resource elements in the frequency domain as
shown in Fig. 2.3. Whenever there is one antenna port transmitting a reference signal on
one resource element, all other antenna ports transmit a ‘zero’ symbol at this position.
Thus, interference of the reference symbols transmitted from different antennas is
avoided and channel estimation is simplified. Reference symbols are used in estimating
the Residual Frequency Offset  (RFO). The position of reference signal in OFDM sys-
tems is different based on the channel conditions and required Quality of Service (QoS).
These may be located in block type, comb type, rectangular grid, parallelogram-shaped
grid and hexagonal grid as depicted in the Figure. 2.3. The channel is assumed to be
slow fading and reference symbols are in hexagonal configuration as shown in the Fig-
ure. 2.3 (e). There is only one antenna port in our case and all other ports are ignored.
Reference symbols are transmitted in symbols (0 and 4) of each slot in case of normal
CP and the first and the fourth in case of extended CP. The location of reference sym-
bols varies on the subcarriers. Hexagonal configuration depicted in Figure. 2.3 (e) has
been used in this work. Reference signals are transmitted every sixth symbol in hexago-
nal configuration. The hexagonal symbol arrangement reduces the reference symbol
density and thus improves the spectral efficiency. Reference signals are transmitted on
one or several of antenna ports 0 to 3.
13
Figure 2.3. Illustration of several Reference signals pattern: (a) block type, (b) comb type, (c) rec-
tangular grid, (d) parallelogram-shaped grid, (e) hexagonal grid (Adapted from [35])
2.3.2. Reference Signal Sequence Generation
The reference-signal sequence )(
s, mr nl  is defined by,
    12,...,1,0,)12(21
2
1)2(21
2
1)( DLmax,RB, s   Nmmcjmcmr nl
                          (2.1)
14
where sn  is the slot number within a radio frame,
DLmax,
RBN is the largest downlink band-
width and l  is the OFDM symbol number within the slot.
Pseudo-Random Sequence: c  is the pseudo-random sequence. Pseudo-random se-
quences are defined by a length-31 Gold sequence. The output sequence )(nc  of length
PNM , where 1,...,1,0 PN  Mn , is defined by,
  2mod)()()( 21 CC NnxNnxnc  ,    (2.2)
  2mod)()3()31( 111 nxnxnx   ,    (2.3)
  2mod)()1()2()3()31( 22222 nxnxnxnxnx   ,             (2.4)
where 1600 CN  and the first m-sequence is initialized with
30,...,2,1,0)(
and,1)0(
1
1
  
 
nnx
x
 The initialization of the second m-sequence is denoted by,
¦   300 2 2)(i iini ixc
The pseudo-random sequence generator is initialised with
     CPcellIDcellIDs10init 2121172 NNNlnc            (2.5)
 at the start of each OFDM symbol where , cellIDN  is the physical layer cell ID and
¯
®
­ 
CPextendedfor0
CPnormalfor1
CPN
2.3.3. Reference Signal Mapping to Resource Elements
The reference signal sequence )(
s, mr nl  is mapped to complex-valued modulation
symbols )( ,plka  used as reference symbols for antenna port p  in slot sn  according to:
)'(
s,
)(
, mra nl
p
lk  , (2.6)
where
15
 
^ `
^ `
DL
RB
DLmax,
RB
DL
RB
DL
symb
shift
12,...,1,0
3,2if1
1,0if3,0
6mod6
NNmm
Nm
p
pN
l
vvmk
 c
 
¯
®
­

 
 
The variables v  and shiftv  define the position in the frequency domain for the different
reference signals where v  is given by:
°°
°
°
¯
°°
°
°
®
­
 
 
z 
  
z 
  
 
3if)2mod(33
2if)2mod(3
0and1if0
0and1if3
0and0if3
0and0if0
s
s
pn
pn
lp
lp
lp
lp
v
The cell-specific frequency shift is given by 6modcellIDshift Nv  . Resource elements
 lk ,  used for transmission of cell-specific reference signals on any of the antenna ports
in a slot  shall  not be used for any transmission on any other antenna port  in the same
slot and set to zero.
Figure 2.4 illustrates the resource elements used for reference signal transmission
according to the above definition. The colored resource elements denote resource ele-
ments being used for reference signal transmission on antenna port.
16
Figure 2.4. Reference signals in LTE downlink. (Adapted from [12])
2.3.4. Synchronization Signals
LTE standard  has  defined  two synchronization  signals:  these  are  primary  (P-SCH)
and secondary (S-SCH). The two signals are identified with different colors in the Fig.
2.5. These signals are located on 62 subcarriers within 72 reserved subcarriers symmet-
rically arranged around DC-carrier in the first slot in the sixth and seventh OFDM sym-
bols of the first and sixth subframes [15]. The synchronization signals are transmitted
twice per 10 ms on predefined slots as depicted in Figure 2.5. Signal configuration is
shown in the following figure:
Figure 2.5. Primary and Secondary synchronization signals (Adapted from [12])
17
Physical-Layer Cell Identity (PCI): PCI is used in the sequence generation and re-
source element mapping of primary and secondary synchronization signals.
There are 504 unique physical-layer cell identities. In the LTE air interface, Physical
Layer Cell Identity (PCI or CellID) is used for cell identification. The physical-layer
cell identities are grouped into 168 unique physical-layer cell-identity groups, each
group containing three unique identities. The grouping is such that each physical-layer
cell identity is part of one and only one physical-layer cell-identity group. The PCI is
expressed as:
(2)
ID
(1)
ID
cell
ID 3 NNN  ,                                   (2.7)
A physical-layer cell identity ‘ (1)IDN ’, is uniquely defined by a number in the range of
0 to 167, representing the physical-layer cell-identity group, and a number (2)IDN  in the
range of 0 to 2, representing the physical-layer identity within the physical-layer cell-
identity group.
2.3.5. Primary Synchronization Signal Sequence Generation and Resource Ele-
ment Mapping
The primary synchronization signal carries the physical layer identity 0, 1, or 2. The
sequence )(nd  used for the primary synchronization signal is generated from a frequen-
cy-domain Zadoff-Chu sequence according to the following expression:
°¯
°®
­
 
  

61,...,32,31
30,...,1,0)(
63
)2)(1(
63
)1(
ne
nend nnuj
nunj
u S
S
,                      (2.8)
where the Zadoff-Chu root sequence index u  is given by Table 2.3.
Table  2-3 Root indices for Primary Synchronization Signal
(2)
IDN Root index u
0 25
1 29
2 34
It is assumed that the primary synchronization signal is not transmitted on the same
antenna port as any of the downlink reference signals. Furthermore it is also assumed
that any transmission instance of the primary synchronization signal is not transmitted
on the same antenna port, or ports, used for any other transmission instance of the pri-
18
mary synchronization signal. The sequence  nd  shall be mapped to the resource ele-
ments according to:
  ,, nda lk                                          (2.9)
2
31and61,...,0 where
RB
sc
DL
RB NNnkn   
The primary synchronization signal is mapped to the last OFDM symbol in slots 0 and
10.
2.3.6. Secondary Synchronization Signal Sequence Generation and Resource
Element Mapping
The secondary synchronization signal carries the physical layer cell identity group.
The sequence )61(),...0( dd is used for the second synchronization signal. It is
an interleaved concatenation of two length-31 binary sequences. The concatenated se-
quence is scrambled with a scrambling sequence given by the primary synchronization
signal.
The combination of two length-31 sequences defining the secondary synchroniza-
tion signal differs between subframe 0 and subframe 5 according to:
 
 ¯®
­ 
5subframein)(
0subframein)(
)2(
0
)(
1
0
)(
0
1
0
ncns
ncns
nd m
m
,                           (2.10)
   
   ¯®
­ 
5subframein)(
0subframein)(
)12( )(
11
)(
0
)(
11
)(
1
10
01
nzncns
nzncns
nd mm
mm
,                     (2.11)
where 300 dd n . The indices 0m  and 1m  are derived from the physical-layer cell-identity
group (1)IDN  expressed as:
¬ ¼ 
¬ ¼30,30
2)1(,2)1(
31mod131
31mod
(1)
ID
(1)
ID(1)
ID
01
0
NqqqNqqqNm
mmm
mm
 c»¼
»
«¬
« cc  c
c 
c 
Same antenna port as for the primary synchronization signal is used for the second-
ary synchronization signal. The sequence in equation 2.9 is used for mapping secondary
synchronization signals to resource elements but expression for k differs:
19
66,...63,62,1,...,4,5
10and0slotsin2
2
31
DL
symb
RB
sc
DL
RB
 
 
 
n
Nl
NNnk
The index values of n are reserved and not used for transmission of the secondary syn-
chronization signal.
2.4. Downlink Multi-Antenna Transmission
Multiple input and multiple output (MIMO) antenna technology is a key feature for
the LTE downlink. Different downlink MIMO modes are specified for LTE which can
be adjusted and flexibly configured in accordance with channel conditions and data rate
requirements. “It is noteworthy that full selection of MIMO options is available only for
the downlink in LTE.” [34]
Spatial multiplexing is an important MIMO technique. It is also referred to as “true
MIMO”. It is the technique to transmit different data streams simultaneously over the
same set of radio resources. If spatially multiplexed data streams belong to one user, it
is referred to as single-user MIMO (SU-MIMO). In case spatially multiplexed data
streams  belong  to  different  users,  it  is  multi-user  MIMO  (MU-MIMO).  Both  SU-
MIMO and MU-MIMO are supported in LTE downlink. SU-MIMO directly impacts
the data rate of the user and MU-MIMO helps to increase the capacity of the network by
accommodating multiple users.
Multi-antenna transmission with up to 8 transmit antennas is supported. The maxi-
mum number of codeword is two irrespective to the number of antennas with fixed
mapping between code words to layers. [33] A codeword is a block of information bits
that can be encoded, scrambled, and modulated separately before it is transmitted in a
subframe over the air interface. [34]
MIMO and the principle of spatial multiplexing are displayed in Figure 2.6.
20
 Figure 2.6. MIMO and Spatial Multiplexing (Adapted from [34])
Other important MIMO techniques are transmit diversity and beamforming.
In Transmit diversity antennas at the transmitter transmit the same data streams
simultaneously or in other words replicas of the same signal are transmitted. The infor-
mation content of each stream remains the same but each antenna uses a different cod-
ing. Transmit diversity does not increase the data rate but it increases the robustness of
the transmission against fading effects by using the diversity effect.
Beamforming is the shaping of the antenna radiation pattern. The shapped antenna
beam is used to focus the transmission energy in the direction of the receiver. It is also
used to mitigate possible interferers. Beamforming is done with an antenna array that
contains multiple antenna elements which can be individually modified in gain and
phase.
There are different MIMO modes mentioned in [34], starting from Mode 1 upto
Mode 7. “Each MIMO mode requires a different type and amount of control signaling
to be conveyed to the terminal”. [34]
21
3. LTE CARRIER RECOVERY
Synchronization is a critical function in digital communications. Its failure may
cause catastrophic effects on the transmission system performance. It is very important
that the receiver is synchronised with the transmitter because it is not possible to correct
frequencies/phases without any control mechanisms.
Synchronization is different in 3GPP LTE for uplink and downlink because of the
choice of multiple access schemes. Multiple access scheme for LTE downlink is OFD-
MA and SC-FDMA for the uplink. OFDMA is susceptible to Carrier Frequency Offset
(CFO). In case of a typical LTE system with a carrier frequency of 2.1 GHz, a frequen-
cy drift of 10ppm (10×10-6) of the local oscillator can cause an offset of 21 kHz. LTE
system employs a fixed subcarrier spacing of 15 kHz. This offset caused by the local
oscillator corresponds to 1.40 subcarrier spacings.
The receiver extracts the information from the received signal to synchronise and
compensate  for  any  carrier  phase  offset.  In  Figure  3.1  it  can  be  seen  that  before  the
OFDM symbols are demodulated the carrier frequency offset must be compensated.
Figure 3.1 OFDM Receiver
22
3.1. Carrier Recovery
Information is sent over a carrier by doing some changes in its fundamental charac-
teristics like phase, frequency and amplitude, in a digital communication system. These
fundamental characteristics modifications must be detected by the receiver in order to
recover the data correctly.
Carrier recovery is an important aspect in synchronizing digital communication sys-
tems because errors can occur in the carrier frequency/phase due to many reasons. Error
may be caused by the transmitter local oscillator during up-conversion and modulation,
receiver local oscillator used for down-conversion or demodulation, up-conversion or
down-conversion in repeaters, and Doppler shift in mobile channels. The traditional
solution is to adaptively adjust the local oscillator to match the frequency and phase of
the received signal. Carrier recovery can be performed in a number of ways in different
systems:
 Sufficiently accurate frequency adjustment is enough for coherent and non-
coherent systems.
 Coherent detection is required for accurate phase recovery
 Quadrature carriers require accurate phases for complex symbols or alpha-
bets.
It is assumed that the receiver is able to generate a reference carrier whose phase
and frequency are identical to those of the carrier’s at the transmitter. When the receiver
exploits the knowledge of the carrier’s phase to detect the signal, the process is called
coherent detection. Carrier signal is generated by the local oscillator (LO). Generally the
LO at the receiver is not synchronous with the LO at the transmitter. To match the re-
ceiver’s local oscillator carrier frequency and phase to that of the transmitter’s, carrier
recovery is done and strictly required in coherent detection system such as OFDM sys-
tems. Generally carrier recovery is performed in two steps: first step is coarse frequency
adjustement; second step is fine frequency adjustment and phase recovery.
Carrier synchronization error can affect the carrier phase and carrier frequency.
There might be rotation in the constellation resulting in constant carrier phase error ĳ.
#௞ ൌ݁௝ఝ Ǥ ܣ௞               (3.1)
There might be a time-varying rotation of the constellation resulting in constant car-
rier frequency error ǻȦ.
#௞ ൌ ݁௝௞்οఠ Ǥ ܣ௞               (3.2)
23
In the past analog and hybrid methods utilizing both analog and digital solutions had
been used for carrier recovery. Two salient features of these solutions are:
 Directly adjusting the demodulator local oscillator
 Phase-locked loops and related circuitory
There is an increase in utilization of digital synchronization techniques during the
last decade. Digital synchronization principles can be classified into three categories:
Data-Aided (DA): By transmitting a separate sequence known to the receiver (refer-
ence symbols, pilot symbols, preambles/midambles)
Data-Directed (DD): By utilizing information extracted from the detected symbol.
Non-Data-Aided (NDA): This method does not depend on the detected symbol.
It is evident that DA exhibits the best synchronization performance, but there is
some overhead in terms of some part of the bandwidth lost for pilot signals or training
sequences. There are different carrier recovery methods. A well-known carrier recovery
is the Maximum Likelihood (ML) method. This method is applicable to all the above
mentioned DA, DD and NDA configurations. ML theory has been used in this work.
3.2. LTE Downlink Carrier Recovery Algorithm
General OFDM systems can be synchronized by splitting the carrier frequency off-
set (CFO) into Fractional Frequency Offset (FFO), Integer Frequency Offset (IFO), and
Residual Frequency Offset. These offsets can be estimated individually [16;17;18]. The
entire implementation of this work is based on standardized synchronization signals and
reference symbols of LTE as explained in the previous chapter, following the approach
presented in [16].
As discussed earlier in the previous chapter the LTE standard defines two kinds of
signals utilized for synchronization at the receiver. First one is the dedicated synchroni-
zation signals and the second is the cell-specific reference signals. Their exact details
used in the implementation are discussed in the following subsection:
3.2.1. Carrier Frequency Offset Estimation Method
The three stage CFO compensation scheme for 3GPP LTE is explained here. It is
assumed  that  the  channel  is  slow  fading,  so  that  the  variation  of  the  channel  impulse
response within one subframe of duration 1 ms is negligible.
The following equation defines the system model in time domain:
24
ݎ = {ݔ כ ݄ ൅ ݒ} ή ݁௜ଶగఢ಴ಷೀሺ௡ା௟ሺேାே೒))/ே,                                    (3.3)
‘r’ the received time-domain signal, ‘x’ is the transmitted OFDM signal, ‘h’ is the chan-
nel impulse response, ‘v’ is the additive Gaussian noise, l is the OFDM symbol index in
one  subframe,  ‘N’ denotes  FFT size,  ‘n’  is  the  time index  within  one  OFDM symbol,
‘Ng’ the CP length and İCFO is frequency mismatch between transmitter and receiver.
When discrete Fourier transform is applied to the eq. 3.3, we have the corresponding
frequency response.
ܴ ൌ ߛǤ ܺܪ ൅ ܫ ൅ ܸ,                                               (3.4)
where,
ߛ = ୱ୧୬ሺగఌᇲ)
ே௦௜௡ሺగఌᇲ ேΤ ) Ǥ ݁௜గఌᇲሺேିଵ)/ே Ǥ ݁௜ଶగఢ಴ಷೀ௟ሺேାே೒)/ே ,                            (3.5)
ܫ = σ ܺܪ. ୱ୧୬ሺగሺ௣ାఌ಴ಷೀି௞))
ேୱ୧୬ሺగሺ௣ାఌ಴ಷೀି௞)/ே) Ǥ ݁௜గሺ௣ାఌ಴ಷೀି௞)(ேିଵ)/ேǤ ݁௜ଶగఢ಴ಷೀ௟ሺேାே೒)/ே௣ஷ௞ᇲ ,
      (3.6)
The factor ‘ˠ’ is the degradation on the desired subcarrier caused by the CFO. ‘I’ is
the inter-carrier interference caused from neighboring subcarriers. Expressions for ‘ˠ’
and ‘I’ are derived in [16]. The total CFO is split into FFO, IFO and RFO:
ߝ஼ிை ൌ ߝிிை ൅ ߝூிை ൅ ߝோிை ,                                          (3.7)
FFO is the major source of inter-carrier interference (ICI); it removes the orthogo-
nality between the subcarriers. Therefore, it must be compensated before the FFT opera-
tion in the OFDM receiver.
The expression for the FFO mismatch has been derived in [16] for the LTE subframe;
ߝிிை ൌ െ
ଵ
ଶగ
ܽݎ݃ ቄσ σ ݎ௟ǡ௡ݎ௟ǡ௡ାே
כே೒
௡ୀଵ
ே೑
௟ୀଵ ቅ,                                 (3.8)
where, ‘r’  is  the  received  symbols,  ‘Nf’ is the number of received OFDM symbols in
one subframe, and ‘Ng’ is the CP length. The estimation range for this stage is -0.5fs to
0.5fs, where ‘fs’ is the subcarrier spacing. CP is used to estimate the FFO within the es-
timation range according to [16]. FFO estimation is independent of the presence of IFO
25
and RFO. FFO can be further reduced by using more antennas at the receiver, by in-
creasing the CP length or by taking more OFDM symbols into account.
IFO is the carrier offset by an integer multiple of subcarrier spacing. IFO is estimat-
ed after the fractional part of the CFO has been corrected. IFO has two effects on the
received signal. First one is the FFT index shift i.e. frequency translation of a subcarrier
by an integer multiple of subcarrier spacing if LO at transmitter and receiver are at dif-
ferent frequencies. The second one is the phase rotation of the symbols if the two oscil-
lators are not in phase with each other. These two effects can be compensated by using a
ML estimator proposed in [19]. The standardized synchronization signals are used for
the frequency and phase correction. The IFO estimator based on the idea presented in
[19] is represented by the following equation:
ߝூிை ൌ ܽݎ݃max௜ሼܴ݁ ൤݁ೕమഏ೔ಿ೒ಿ .σ (ܴௌௌ஼ுכ ܴ௉ௌ஼ு)( ௌܺௌ஼ுכ ܺ௉ௌ஼ு)כ௞א௄ೄ಴ಹ ൨},   (3.9)
where ‘RSSCH’, ‘RPSCH’, ‘XSSCH’ and ‘XPSCH’ are the received and transmitted  secondary
and primary synchronization signals respectively. The set ‘KSCH’ represents the subcar-
rier indices that contain the synchronization signals. ‘i’ ranges from -31 to 31 and corre-
sponds to the set of integer offsets that can be estimated. In LTE synchronization signals
exist in every fifth subframe only, therefore estimation of ‘ˢIFO’ can only done in these
subframes.
RFO, extent of RFO depends on the magnitude of the estimation error of FFO. Its
impact on the current OFDM symbols may not be clearly visible, but the estimation
error results in an increase in the phase shift for the OFDM symbols to be received sub-
sequently. It is necessary to improve the estimation on the subframe basis by utilizing
the reference symbols in the frequency domain to correct the RFO after the FFT.
It is assumed that ‘ˢFFO’ and ‘ˢIFO’ have been corrected to an extent that the magnitude
of the interference is negligible. The resulting RFO estimator expressed in [16] is:
ߝோிை ൌ െ
ଵ
ଶగ
ே
ேೞ൫ேାே೒൯
. arg{σ ൫ܴ௟ǡ௞ܴ௟ାேೞ ǡ௞כ ൯ሺ ௟ܺǡ௞ ௟ܺାேೞǡೖכ )כሺ௞ǡ௟ሻఢ௄೛ },      (3.10)
where ‘R’ and ‘X’ are the received and transmitted reference symbols respectively, ‘Ns’
is the number of OFDM symbols in one slot and ‘Kp’ is the joint set of OFDM symbol
number and subcarrier indices for reference symbols.
26
3.3. LTE Uplink Carrier Recovery
OFDMA properties are less favourable for the uplink. This is mainly due to weaker
peak-to-average power ratio (PAPR) properties of OFDMA signal, resulting in worse
uplink coverage. [34]
Synchronization is different in the uplink because of a slightly different multiple ac-
cess scheme in the uplink, SC-FDMA. Both OFDMA and SC-FDMA are multiple ac-
cess versions of OFDM. SC-FDMA signals have better PAPR properties in comparison
to an OFDMA signal. This is one of the main reasons for selecting SC-FDMA as LTE
uplink multiple access scheme. Two types of uplink reference signals are supported in
the uplink:
 Demodulation reference signal
 Sounding reference signal
“The same set of base sequences is used for demodulation and sounding reference
signals in the uplink” [34], but different sequences were used for the reference and syn-
chronization signals in the downlink.
Another  aspect  of  uplink  carrier  recovery  is  the  uplink  MIMO  scheme.  Uplink
MIMO schemes for LTE differ from downlink MIMO schemes in order to limit termi-
nal complexity. Many user terminals may transmit simultaneously on the same resource
block, also known as Spatial Division Multiple Access (SDMA). Only one transmit an-
tenna is required on the terminal end in this scheme. The terminals sharing the same
resource block must apply mutually orthogonal pilot patterns, i.e. demodulation refer-
ence signals.
MU-MIMO can be used in the uplink. [34]
27
4. TCE FRAMEWORK
Move Processor and Transport Triggered Architecture are relevant terms, used in-
terchangeably referring to the same architecture. The Innovative and exciting concept
was introduced by Prof. Henk Corporaal at the Delft University of Technology, The
Netherlands in 1990’s. [20]
The concept is based on recent research performed within the MOVE project at the
Delft University of Technology. Unlike most traditional architectures, TTAs function
through programmed operations that are triggered by internal data transports. As a result
the new architecture alleviates bottle-necks, allows for new code generation optimiza-
tions, and exploits hardware more efficiently. TTAs can be used as a template for auto-
matic generation of application specific processors and are well-suited for embedded
system design. [21]
4.1. Transport Triggered Architecture (TTA)
TTA concept is relatively new in the field of computer architecture. The main difference
between TTAs and conventional Operation-triggered architectures (OTAs) is based on
the way they are programmed. In case of OTAs operations are specified and then pro-
grammed, and data transports are handled and scheduled by the hardware. In case of
TTAs the compiler is responsible for translating the operations into data operations. [23]
A TTA processor consists of functional units (FUs), special functional units (SFUs),
register files (RFs), interconnection (IC) network, control logic, and separate instruction
and data memories as shown in Fig. 4.1. The IC is responsible for transferring data be-
tween the FUs and the RFs. It is composed of data transport buses and sockets, which
are connecting the FUs and the RFs to each others. A common approach to enhance the
performance of TTA processors is to increase the number of functional units which op-
erate concurrently.
28
Figure 4.1. Organiztion of TTA
A TTA is programmed by specifying the required data transports. The number of
data transports or moves depends on the number of transport buses. Each move specifies
a transport from a FU output to one of the FU inputs or the corresponding general pur-
pose  register.  The  FU  interconnection  network  is  not  limited  in  any  sense  as  long  as
point to point transports can be specified. The network is not required to be fully con-
nected, although full connectivity would ease the code generation process. It is the task
of the compiler to optimize the required number of transports, given certain connectivi-
ty, such that the number of execution cycles is minimized. Limiting bus connections is
an important tool in reducing bus loads, yielding faster transport times. [23]
4.2. Hardware Aspects of TTA
TTAs have several characteristics which make them very interesting from a hardware
design point of view:
Modularity:  TTAs are constructed using a limited number of building blocks. They
are built by proper connection of FUs and bus-connections. FUs are completely inde-
pendent of each other and of the interconnection network. FUs can therefore be de-
signed separately, and pipelined independently. The modularity of TTAs allows the
29
hardware design process to be automated. Different TTAs can easily be configured by
assembling different combinations of these blocks.
Flexibility and Scalability: TTAs have large flexibility because the interconnection
network is separate from the FUs and both can be designed independently. TTAs have
virtually no constraints on how to design and pipeline the network, and different FUs as
long as both obey the interface specification. FUs can implement any functionality even
if it requires more than two operands and multiple results.
Processor Cycle Time: The processor can be optimized for operation throughput,
instead of latency if needed. This requires extensive pipelining those FUs which con-
strain the achievable cycle time. The result is cycle time which is limited by the time
needed for inter FU data transports only. The implementation can be tuned to optimize
this time by using advanced bus implementation techniques.
Hardware Efficiency: There are several aspects contributing to an economic usage
of hardware resource: TTAs basically support only one operation format. The compiler
is in control of every data transport. Many of these transports can be optimized away.
Many values produced during the course of a program do not need to be allocated in
general purpose registers; the register file traffic rate reduces. FU logic can be split into
independent parts, used for different functionality without a large impact on the inter-
connection network and register file. These parts may be used concurrently after split-
ting which results in a larger efficiency. [23]
4.3. TTA Programming Model
TTA  has  only  one  instruction,  the move instruction which implements operand
transports. The need to support only a single instruction combined with the static opera-
tion scheduling makes the control logic of the TTA very simple.
The operand transports of TTA are visible to the programmer. In a traditional opera-
tion based assembly, an addition r3 = r1 + r2 could be executed as:
add r3, r1, r2
In TTA assembly, addition is executed by defining the operand transports:
r1 - > add.o1
r2 -> add.t
add.r -> r3
Or, if there are multiple transport busses available, the input operands could be trans-
ferred simultaneously:
30
r1 -> add.o1, r2 -> add.t
add.r -> r3
It is notable that the operation latency is also visible to the programmer. In the above
example the latency of the addition operation is 1 clock edge. The operation result can
be read from the output on the next instruction after the triggering move.
One of the advantages of the operand transport paradigm is software bypassing [24] and
Dead Result Elimination (DRE). For example the following code in operation based
assembly:
add r3, r1, r2
shift r6, r3, r4
store r3, r6
could be executed on a TTA with two transport buses as:
r1 -> add.o1, r2 -> add.t
add.r -> shift.o1, r4 -> shift.t
add.r -> store.o1, shift.r -> store.t
As the example demonstrates, the addition result can be bypassed to the shift and
store operations. Likewise, the shift result is bypassed to store. Furthermore, as the re-
sults of addition and shift can be bypassed to the next instructions, dead result elimina-
tion removes the unnecessary result writes to the register file. In the example, these op-
timizations decreased register file utilizations by removing two register writes and
reads.
TTA also makes it easy to schedule code for custom MIMO operations. The greatest
advantage in case of MIMO operations is that all the input and output operands do not
have to be transported in a single instruction cycle. The input port registers of a function
unit hold their values until they are overwritten and the operation results stay in the out-
put port registers as long as the next operation is triggered and new results are written to
the outputs. This makes it possible to divide the operand transports to multiple instruc-
tion  cycles  which  puts  less  pressure  on  the  register  file.  For  example,  if  there  was  an
operation with eight inputs and all the operands were stored in a register file, there
would have to be eight read ports in a register file in order to provide all the operands in
a single instruction cycle. However, in the case of TTA, the input operands can be
transported in multiple instruction cycles depending on the available register file read
ports. For example, if there are two read ports, the input operands can be transported in
four instruction cycles. [25]
31
4.4. TCE (TTA-based Co-design Environment)
TCE is an open source toolset for designing application specific processors based on
transport triggered architecture. TCE provides a complete co-design flow from C pro-
grams down to synthesizable VHDL and parallel program binaries. Processor customi-
zation points include the register files, function units, supported operations, and the in-
terconnection network. TCE has been developed in Tampere University of Technology
since early 2003. [26]
The most essential tools in TCE are the processor design tool ProDe, a retargetable
high level language compiler tcecc, the retargetable Instruction-Set Simulators (ISS)
ttasim, proxim (graphical user interface version) and the processor generator ProGe.
These tools allow high level language (HLL) to RTL design flow. The retargetability of
the tools means that they automatically adapt to the processor architecture at the
runtime.
TCE allows the designer to customize TTA processors, for example the designer can
change the number of function units in the architecture, modify which functions are
included  in  a  function  unit  and  even  create  new  operations.  The  register  files  can  be
customized  as  well.  The  designer  can  change  the  register  width,  the  register  file  size,
and the register file port count. The number of register files is also customizable. The
interconnection  network  can  also  be  tailored.  The  number  of  transport  buses  and  con-
nections between FUs, RFs and transport buses are also customizable. [25]
The TCE design flow is depicted in the following state flow diagram:
32
Figure 4.2. TCE design flow [25]
4.5. TTA Design with TCE
The goal of TCE design flow depicted in Fig. 4.2 is to produce a TTA processor which
is able to execute specific application while complying with the restrictions set by the
design requirements. The design flow inputs are HLL (high level language) source code
of the desired application and the design requirements. These requirements can define
the amount of FPGA resources the implementation can use, the target execution time or
performance, the minimum clock frequency or the maximum allowed energy consump-
tion.  At  the  beginning  of  the  design  flow,  the  designer  uses ProDe to create starting
point architecture. ProDe stores the  processor architecture description in XML-format
to an architecture definition file (ADF). Or alternatively a pre-existing ADF can be used
as the starting point.
The next step is to compile the source code for the starting point architecture with
the tcecc compiler which oututs a TTA Program Exchange Format (TPEF) binary file.
The retargetable instruction set simulator ttasim gets the ADF and TPEF files as imputs
and produces the simulation results, such as execution cycle count, processor resource
utilization data and optionally execution traces which can be utilized to extract profiling
data. The simulation results are the feedback from the flow. The designer analyses the
feedback and modifies the architecture accordingly with ProDe. Then a new iteration is
started and the new feedback shows how the modifications affected the results. The iter-
ation process is also known as manual processor Design Space Exploaration (DSE).
TCE also uses an experimental explorer tool which can be used to automate the proces-
sor design space exploration. In the automated DSE, the designer sets goals for explora-
33
tion and then the explorer modifies the processor architecture until the given goals are
reached. [27]
Processor design space exploration is continued until the design requirements are
met or the designer determines that the results are adequate. It should be noticed that the
maximum clock frequency of the processor at this point is unknown. Thus, the actual
run  time of  the  application  is  also  unknown because  the  simulation  results  only  show
the instruction cycle count. If the design requirements set a target clock frequency, the
execution time and the FPGA resource usage, the processor must be implemented and
synthesized. The RTL implementation of the processor is generated with the ProGe
tool. Before generating the RTL, processor architecture resources such as the FUs and
RFs must be mapped to their actual RTL implementations.The FU and RF implementa-
tion are stored in hardware databases (HDB), and the resource mapping is done by
simply defining which HDB entry is used for each architecture resource. The mapping
information is written to an Implementation Definition File (IDF). When the mapping is
done, ProGe uses the ADF and IDF to create the processor RTL implementation by
generating the IC network and connecting the FUs and RFs together.
Only the processor core is implemented at this point. The core must be interfaced
with the target platform in order to execute the applications. The processor can be syn-
thesized with a third party synthesis tool after the integration process. The synthesis
produces an FPGA device programming file which is used to configure the design onto
the FPGA for execution. The synthesis results are valuable feedback for the design
flow. The results determine the actual resource usage and maximum clock frequency of
the processor. The designer compares these results with the design requirements to de-
termine whether the iterative design process can be finished.
TCE also allows the designer to exploit  custom hardware operations which can be
used to accelerate the application. TCE allows the custom operations to be tested and
evaluated without having the RTL implementation of the custom operation, due to the
separation of the processor architecture and implementation. The custom operation de-
sign flow is illustrated in Fig.  4.3. The flow begins by searching for a custom operation
candidate. Application profiling can be helpful for this purpose. When a candidate is
found, the designer creates a custom operation compiler definition by using the Opera-
tion Set Editor tool OSEd. The compiler definition simply describes the name of the
operation and the number of input and output operands. In order to simulate the opera-
tion, a simulation model is needed. This model implements the operation behavior using
C/C++. Usually the software function in the accelerated program can be exploited in
defining the simulation model. For example, if the custom operation replaces a function,
the function code can be utilized as the simulation model.
34
Using new custom operation requires modifications to the processor architecture and
the HLL source code. The designer must add a function unit containing the custom op-
eration to the architecture. At this point, the custom operation latency must be defined.
Sometimes it can be difficult to determine the operation latency before the operation
hardware is implemented. The designer can either take a suitable guess or change the
latency between iterations to find out which latencies would be feasible.
For the software to be able to use the custom operation, the designer needs to modi-
fy the source code to utilize the new operation. This is done by calling the operation via
TCE-specific operation macros or intrinsics. In case the custom operation implements a
function from the original code, these function calls are replaced with calls to the opera-
tion macros. When the modifications are ready, the application can be compiled and
simulated. Feedback from the simulation will reveal how the custom operation affected
the execution cycle count. If the results do not satisfy, another custom operation can be
tested by starting a new iteration.
If the custom operation speedup was adequate and the designer chooses to include
the custom operation to the architecture, the custom operation must be implemented.
This is the only step in the TCE design flow where the designer is required to write RTL
code. TTA Unit Tester verifies that the RTL implementation of the custom operation is
equal to its simulation model. When custom operation implementation is ready, the cus-
tom FU is added to a hardware database with the hdbeditor tool. This allows the FU to
be reused in later designs. [25]
35
Figure 4.3. TCE custom operation design flow. [25]
36
5.  IMPLEMENTATION
The implementation details of the work are discussed in this chapter.
5.1.  Implementation in C
The algorithm is implemented in C language using data in fixed point format (least
significant 12 bits for fractional part and the most significant 4 bits for the fixed part)
and making the code as efficient as possible by getting rid of the unnecessary computa-
tions, such as divisions and multiplications. C extensions have been used for operations
like FFT, Cordic and complex multiplications to speed up the process.
5.1.1. Fractional Carrier Frequency Offset
Fractional carrier frequency offset (FCFO) is implemented by extracting the indices
of the complex CPs from the received OFDM symbol stream. The number of CP points
varies with the system bandwidth. For a 10 MHz bandwidth there are 1024 CP points
and for 20 MHz it is 2048 CP points.The received CP points are stored in two different
arrays with a shift of 2K FFT points in case of 20 MHz system bandwidth. These two
arrays are then multiplied and their sum is taken to find a maximum value. Fractional
estimate is computed from this maximum value by taking it phase.  This estimate is then
used to correct the received symbol stream.
5.1.2. Integer Carrier Frequency Offset
Integer carrier frequency offset (ICFO) is independent of the system bandwidth. It is
computed with the help of primary and secondary synchronization symbols. It is based
on the correlation of the received primary and secondary synchronization symbols with
the locally generated primary and secondary symbols at the receiver and then finding
the peak value which results in a perfect ineteger value indicating that the receiver is
able to locate the centre frequency of a carrier. The received symbol stream is then cor-
rected based on this integer value.
5.1.3. Residual Carrier Frequency Offset
The residual offset is corrected with Residual Carrier Frequency Offset (RCFO). RCFO
is computed with the help of reference symbols. The number of reference symbols de-
pends on the system bandwidth. Residual carrier offset is computed by correlating the
37
reference symbols and summing the results of the correlations. Finally phase is calculat-
ed based on the summation of the correlations.
5.2. Channel Characteristics and Synchronization Parameters
As mentioned earlier received symbol stream is in fixed point format (least significant
12 bits for fractional part and the most significant 4 bits for the fixed part). A floating
point standard compliant LTE simulation Matlab-model for Downlink Physical Layer
has been used as a reference to verify the results and performance of the system. Results
matched the floating point simulation. The simulation parameters shown in the Table 5-
1 have been used and the results have been verified:
Table 5-1
Channel AWGN Block Fading
Channel Type Typical Urban (TU)
SNR 15 dB
Number of Faders 15
Number of Users 1
Bandwidth 20 MHz
Transmission Mode Open Loop Spatial Multiplexing
Introduced CFO 3.1457
FCFO 0.1457
ICFO 3.0000
RCFO 0
Data Format 12 bit fixed point
5.3. TTA Processor for Carrier Recovery
The TTA processor used in the carrier recovery algorithm is shown in Fig. 5.2. The
architecture is not fully connected but still the compiler has a lot of scheduling freedom
because of many connections still present in the interconnect network. There is a signif-
icant amount of power consumption in this architecture that can be saved by removing
those connections which are not utilized at all or the removal of which does not affect
the performance of the application beyond a certain acceptable limit to meet the real
time constraints. Unnecessary capacitance is removed by the removal of those intercon-
nections from the bus and decreasing the size of the instruction word. The TCE provides
a Connection Sweeping Tool which removes the unneceassry connections and gives
different cycle count for different sets of TTA connections for the designed TTA. The
most optimum TTA is then selected from this set. Fig. 5.1 shows different sets of con-
38
nections for the same TTA and the cycle count that is required for performing RCFO
only. A considerable amount of power is saved by implementing these measures.
Figure 5.1. Bus connections and cycle count for performing RCFO
SFUs can also be added to the architecture for certain portions of the application
program to improve the performance of the system. Addition of SFUs can help in reduc-
tion of the number of instructions in the instruction memory and thus reducing the num-
ber of fetches. SFUs can also help to accelerate the program as they realize the software
portion of the code as a special hardware unit. The control mechanism of the TTA
makes  automated  clock  gating  of  the  FUs possible,  and  this  property  is  used  to  make
SFUs inactive when not used. It further reduces the power consumption.
5.4.  TTA with Custom Functional Units
The architecture for carrier recovery TTA is shown in Figure. 5.2. It consists of a to-
tal of nine functional units and three register files of 16x32 bit registers each, and inter-
connection network consisting of 10 buses. The instruction memory, dual-port data
39
memory and control unit are not displayed in Fig. 5.2. Special functional units include
Complex multiplier, Cordic for calculating sine/cosine value and a functional unit for
FFT which is in fact a specialized TTA for calculating 2K point FFT. In this context the
architecture can be regarded as a multi-TTA design.
Figure 5.2. Starting point Fully Connected Carrier Recovery TTA
5.5. CORDIC Functional Unit
Sine and cosine values need to be computed in certain parts of the carrier offset
compensation  algorithm.  CORDIC  algorithm  was  chosen  for  this  task.  It  is  similar  to
shift-and-add algorithms. It is a simple and efficient algorithm for calculating the sines
and cosines of a value using basic arithmetic operations like addition, subtraction, com-
parisons and shifts. A precomputed lookup table with as many entries as the desired
accuracy in bits is required. The acronym CORDIC stands for Coordinate Rotational
Digital Computer. The cordic algorithm is not accurate, but is simple to implement with
limited hardware without multipliers and small storage space. [28]
Since it has been called many times in the algorithm a special function unit has been
dedicated for this operation. The algorithm calculates the values in an iterative process,
fifteen iterations to calculate an output. Since the CORDIC core is implemented using a
pipelined approach, all the iterations are performed in parallel. The valid range of
CORDIC algorithm is between ߨ/2 to –ߨ/2, so some preprocessing of the input is need-
ed to bring the values in the valid range. To calculate the sine and cosine 'x' is some
constant value, 'y' is zero and 'z' is fed with the angle. After the calculations are done the
outputs must be placed in the correct quadrants by the post processor. There is input
port  for  reading  the  operand  (angle)  and  ports  for  the  outputs  (values  of  sine  and  co-
sine).  The addition of special functional unit for CORDIC calculations has a huge im-
40
pact on the cycle count as without this functional unit the cycles taken to complete the
task are 3518937 which is almost 11 times worse than the cycle count measured when
the architecture has FU for CORDIC.
Figure 5.3. Cordic FU
5.6. Complex Multiplier Functional Unit
The algorithm operates on complex values and there are considerable amount of
complex multiplications. It would be better to have a separate functional unit dedicated
to  complex  multiplications.  The  operands  of  complex  multiplier  are  16  bits  each.  The
latency of complex multiplier is two clock cycles. Its data path is shown in Figure. 5.4.
In each cycle an output can be read because of the pipelined implementation but this
complex multiplication operation depends on the scheduling done by the compiler.
There are four input ports for reading the real and imaginary parts of the operands and
two output ports for writing the real and imaginary parts of the result.
41
Figure 5.4. Complex Multiplier Architecture
5.7. FFT Functional Unit
 The estimators for fractional and residual carrier frequency offsets operate on data
in frequency domain so it is important to transform the time domain received data to
frequency domain. For this reason 2K point FFT needs to be taken overall six times in
the algorithm. FFT itself is computationally very intensive algorithm. To meet the real
time constraints and low power requirements a TTA designed for FFT as shown in Fig-
ure. 5.5 has been used in the architecture as a special functional unit.
The algorithm followed to calculate the FFT is Cooley and Tukey [29] because of its
reduced computational complexity as compared to calculating DFT in a straightforward
way using the eq. 5.1.
ܺ௠ = ¦
 

1
0
/2
N
n
Nnmj
nex
S
,
(5.1)
Cooley and Tukey algorithm decomposes the whole DFT into smaller DFTs and re-
duces the complexity from N2 to Nlog(N). N is  the  number  of  points  of  FFT  and  it  is
equal to Rr, where R is the radix of FFT algorithm with usually a value power of two.
The most common used FFT is radix-2 in which N must be a power of two. In this unit
mixed radix 4/2 is used, to obtain performance boost from radix4 and still maintaining
the support of FFT sizes of power of two, supported sizes are from 32 to 2048.
+
A[32..0]
B[32..0]
ADDER
+
A[31..0]
B[31..0]
ADDER
x
A[15..0]
B[15..0]
SIGNED MULTIPLIER
x
A[15..0]
B[15..0]
SIGNED MULTIPLIER
x
A[15..0]
B[15..0]
SIGNED MULTIPLIER
x
A[15..0]
B[15..0]
SIGNED MULTIPLIER
D Q
PRE
ENA
CLR
D Q
PRE
ENA
CLR
D Q
PRE
ENA
CLR
D Q
PRE
ENA
CLR
D Q
PRE
ENA
CLR
D Q
PRE
ENA
CLR
Mult1
clk
A_re[15..0]
A_im[15..0]
B_re[15..0]
B_im[15..0]
re[31..0]
rst
Mult0 re_mult1[31..0]
im_mult2[31..0]
Add1
re_r[31..12]Add0
1' h1 --
1' h1 --
re_mult2[31..0]
Mult3
im[31..0]
im_r[31..12]
im_mult1[31..0]Mult2
42
Figure 5.5. FFT TTA [37]
Since the FFT unit requires large memory bandwidth, i.e. two memory accesses re-
quired in every cycle the unit is operating, a local memory is allocated for the FFT unit.
With aid of the local memory the special unit can be completely parallel to the rest of
the system. In order to reduce overhead caused by data transport to and from the FFT
FU, two memory arrays are used for the unit. By using two arrays result can be read and
new data can be written to the unit, while the previous operation is still calculating. The
unit is controlled by simple control flow; host processor will trigger the operation by
defining which memory array the unit is ought to use and the host can read the results
after the computations are complete, for which the latency is fixed.
5.8. Evaluation
It took 331733 cycles on a fully connected TTA for implementing Carrier recovery
for  LTE  with  20  MHz  Bandwidth.  The  architecture  must  be  able  to  run  around 335
MHz which  could  be  achieved  on  a 65 nm ASIC processor in order to meet the real
time constraint of 1 ms.  Operating frequency of around 84 MHz was  achieved  on
Stratix-II FPGA for the fully connected design. Some of the connections were removed
from some of the RFs in order to increase the frequency. This resulted in an increase in
frequency to around 120 MHz on Stratix-II without any significant change in the cycle
count. But there are still many connections that can be removed which may result in
further increase in the frequency. According to a rough estimate the real time constraint
might be met if implementation is done on an ASIC.
As a consequence the architecture was then synthesized on 130 nm and 1.5 V ASIC
technology at 200MHz with a power consumption of 46.38mW. The total energy con-
sumption for task completion is 0.0968mJ.
As mentioned earlier the system has data in complex format as input, 16 bits for the
real part and 16 bits for the imaginary part, stored in separate memory locations. Both
the real and imaginary parts can be stored on the same 32 bit memory location on upper
and lower 16 bits. In this way data can be read/written at the same time, thus saving a
lot of read and write operations.
43
Packing the real and imaginary part of data into the same memory location should
not make much difference in terms of throughput and power consumption at the logic
level. Because the same address goes to both real and imaginary part of data and data
can be read or written by using two halves of 32 bit data bus. But this packed memory
will fit into a slightly smaller die area due to less clearance around the memory mod-
ule(s) and will require a single address decoder in comparison to two decoders in case
real and imaginary parts are stored in separate memory locations.
The  carrier  recovery  program  was  then  run  on  a  different  architecture  in  order  to
evaluate the impact of SFUs. This architecture was fully connected containing four real
multipliers, three register files of 16x32 bit registers each, interconnection network con-
sisting of 10 buses, instruction memory, dual-port data memory and a control unit. The
difference between this architecture and the one shown in Figure. 5.1 is that it does not
have special functional units for CORDIC, FFT and Complex Multiplication. The ef-
fect on throughput of the system was very evident in this case as it took 4451425 cycles
to complete the task which is almost 13.42 times slower than the architecture using the
special functional units.
Carrier recovery program was then run on COFFEE RISC architecture [31] in or-
der  to  compare  its  performance  with  a  TTA  without  functional  units.  COFFEE  RISC
took 6377378 to complete the task. COFFEE needs a very high clock frequency to meet
the real time constraints.
The  same  task  was  also  performed  on TI's famous fixed point VLIW DSP
TMS320C6416. It took around 3928880 cycles to complete. A single TMS320C6416
core fabricated on 130nm technology must consume at least 200mW at 500MHz and
1.2V.  This single DSP core is not enough to meet the real time constraints.
Figure 5.5. Cycle Count for different Architectures to complete the task
0
1
2
3
4
5
6
7
TTA with SFUs TTA without SFUs Coffee RISC TMS320C6416 (DSP)
M
ill
io
ns
Cycles
44
6. SUMMARY AND CONCLUSION
There is still room for further improvement as the code has been written in C but if
it is written in assembly language program has full control over scheduling the instruc-
tions and hence improvements can be made in terms of cycle count and power con-
sumption  but  writing  assembly  code  needs  much more  effort  and  time.  The  other  ap-
proach to increase the performance of architecture is to increase the available resources
and exploit them using OpenCL but the support of Opencl in TCE is still under devel-
opment. Further improvements can be made by using an MPSoC containing multiple
TTAs and distributing the carrier recovery application code on the TTA cores inside the
MPSoC, by compromising chip area and power consumption.
6.1. Summary
LTE carrier recovery for 20 MHz system for the downlink has been studied and im-
plemented in this work. The carrier recovery algorithm has been coded in C language. A
single TTA core has been synthesized for this purpose. TCE toolset designed at TUT
has been used for this purpose. TCE provides a very good framework giving control to
the designer in different aspects of design space exploration with its GUI and command
prompt interface thus making it possible to develop TTAs for general or specific appli-
cation needs. Specialized functional units (SFUs) were designed during the TTA design
process for computationally intensive and recurring operations like CORDIC for calcu-
lating sine and cosine values, CMUL for complex multiplication, and FFT_TTA for
FFT computation. Special function units have been discussed briefly and it has been
shown that how these SFUs affect the overall performance of the system. A short com-
parison of TTA and other architectures e.g. RISC and DSP has been made which shows
much better performance of TTA as compared to other architectures.
A fully connected TTA took 331733 cycles  to  Carrier  recovery  for  LTE with  20
MHz Bandwidth on 65 nm technology running around 335 MHz. Operating frequency
was 84 MHz on Stratix-II for the fully connected design, 200 MHz on 1.5 V ASIC
130 nm technology. The Carrier Recovery algorithm execution took 4451425 cycles on
architecture which does not have special functional units for CORDIC and Complex
Multiplication, which is almost 13 times slower than a fully connected TTA. COFFEE
RISC core took 6377378 to complete the task. TI's famous fixed point DSP
TMS320C6416 core at 500MHz fabricated on 130nm technology consuming 200mW
and 1.2V took around 3928880 cycles to complete the task.
45
Architectures compared with TTA have to run either at a very high clock frequency
and higher power. There must be some multi-core solution available which is able to
run at comparatively lower frequency and consumes lower power, in order to meet the
real time constraints.
46
REFERENCES
[1]. Theodore S. Rappaport, Wireless Communications: Principles and Practice.
[2]. Ajay R. Mishra, Advanced Cellular Network Planning and Optimiztion:
2G/2.5G/3G…Evolution to 4G.
[3]. Copyright IMEC.
[4]. Stefania Sesia, Issam Toufik, and Mathew Baker, LTE, The UMTS Long Term
Evolution, FROM THEORY TO PRACTICE.
[5]. Ajay R. Mishra, Cellular Technology in Emerging Markets: 2G, 3G and beyond,
Nokia Siemens Networks.
[6]. Jim Zyren,  Dr. Wes McCoy, Overview of the 3GPP Long Term Evolution Physical
Layer: White Paper.
 [7]. Erik Dahlman, Stefan Parkvall, and John Sköld, 4G LTE/LTE-Advanced for Mo-
bile Broadband, [8]. Dake Liu, Coresonic AB and Linköping University, Anders Nils-
son and Eric Tell, Coresonic AB, Di Wu and Johan Eilert, Linköping University, Bridg-
ing Dream and Reality:Programmable Baseband Processors for Software-Defined Ra-
dio.
[9]. John Glossner, Tanuj Raja, Erdem Hoknek, and Mayan Moudgill, A Multi-threaded
Processor Architecture for SDR, Sandbridge Technologies, Inc.
[10]. Yuan Lin, Hyunseok Lee, Mark Woh, Yoav Harel, Scott Mahlke, Trevor Mudge,
Chaitali Chakrabarti, and Krisztián Flautner, SODA: A High-Performance DSP Archi-
tecture for Software-Defined Radio, IEEE Micro (Micro's Top Picks in Computer Ar-
chitecture for 2006), Vol. 27, No. 1, Jan./Feb. 2007, pp. 114-123.
47
 [11]. Omer Anjum, Tapani Ahonen, Fabio Garzia, Jari Nurmi, Claudio Brunelli and
Heikki  Berg,  State  of  the  art  baseband DSP platforms  for  Software  Defined  Radio:  A
survey, 6 Jun, 2011.
[12]. LTE Resource Guide, Anritsu
[13]. John B. Anderson, Digital Transmission Engineering, Second Edition.
[14]. John G. Proakis, Digital Communications, Fourth Edition.
[15]. B.P. Lathi, Modern Digital and Analog Communication Systems, Third Edition.
[16]. Qi Wang, Christian Mehlf¨uhrer, and Markus Rupp, Carrier Frequency Synchroni-
zation in the Downlink of 3GPP LTE, The 21st Annual IEEE International Symposium
on Personal, Indoor and Mobile Radio Communications (PIMRC'10)
[17]. C. R. N. Athaudage and K. Sathananthan, Cramer-Rao Lower Bound on
Frequnecy Offset Estimation Error in OFDM Ssytems With Timing Error Feedback
Compensation, 2005
[18]. Paul H. Moose, A Technique for Orthogonal Frequency Division Multiplexing
Frequency Offset Correction, IEEE Transactions on Communication. Vol. 42, 10, Octo-
ber 1994
[19]. D. Toumpakaris, J. Lee, and H. Lou, Estimation of integer carrier frequency offset
in OFDM systems based on the maximum likelihood principle, IEEE transactions on
broadcasting, vol. 55, March 2009
[20]. Prof. Jari Nurmi, Springer, Processor Design, System on Chip Computing for
ASICs and FPGAs, 2007.
[21]. Henk Corporaal, Microprocessor Architectures from VLIW to TTA, Delft Univer-
sity of Technology, The Netherlands, John Wiley, 1998.
[22]. Henk Corporaal, Jan Hoogerbrugge, Kluwer, Code generation for transport trig-
gered architecture, 1995.
48
[23].  TTAs:  Missing  the  ILP  complexity  wall,  Henk  Corporal,  Delft  University  of
Technology, The Netherlands.
[24]. V. Guzma, P. Jääskeläinen, P. Kellomäki, and J. Takala, Impact of Software By-
passing on instruction level parallelism and register and register file traffic. In Proc. Int.
Workshop Embedded Computer Syst.: Architectures, Modeling and Simulation, pages
23-32, 2008.
[25]. Otto Esko, Mar 2010, ASIP Integration and Verification Flow for FPGA, Master’s
Thesis, Tampere University of Technology. 15. 17. 19 p.
[26]. TTA-based Co-design Environment v 1.6 User Manual, Tampere University of
Technology, Department of computer systems, Version 28, 18 May, 2012.
[27]. J. Mäntyneva, Automated Design Space Exploration of Transport Triggered Ar-
chitectures. Master’s thesis, Tampere University of Technology, Finland, Apr. 2007.
[28]. Jörg Arndt, Matters Computational, Ideas, Algorithms, Source Code, Nurnberg
Germany, June 2010.
[29]. Cooley, James W., and John W. Tukey, "An algorithm for the machine calculation
of complex Fourier series," Math. Comput. 19, 297–301 (1965).
[30]. Kylliäinen  J,  Nurmi  J,  Kuulusa  M  (2003)    COFFEE    –    A Core   for   Free.
In  Proc.  International  Symposium  on  SoC, pp 17–22.
[31]. http://focus.ti.com/lit/an/spra811c/spra811c.pdf
[32]. Louis Litwin and Michael Pugel, The Principles of OFDM, January 2001.
[33]. Evolved Universal Terrestrial Radio Access (E-UTRA) and Evolved Universal
Terrestrial Radio Access Network (E-UTRAN), Physical Channels and Modulation,
Release 8, 3GPP TS 36.211, 10, 52, 53 p.
[34]. Christina Geȕer, Long Term Evolution, A concise introduction to LTE and its
measurement requirements, Rohde & Schwarz, Ist Edition 2011.
49
[35]. Tzi-Dar Chiueh and Pei-Yun Tsai, OFDM Baseband Receiver Design for Wireless
Communications,  John Wiley & Sons (Asia) Pte Ltd, Dec 4 2007.
[36]. Christina Gessner and Andreas Roessler, LTE technology and LTE test, a deskside
chat, Rohde & Schwarz, April 2009.
[37]. Teemu Pitkänen, Risto Mäkinen, Jari Heikkinen, Tero Patanen, and Jarmo Takala,
Low-Power, High Performance TTA Processor for 1024-Point Fast Fourier Transform,
Volume LNCS 4599, pp. 233–240. Springer-Verlag, Berlin, Germany, 2007, 3 p.
