Study and implementation of a parallel turbo-decoder on FPGA for 3GPP-LTE by Τσιόκανος, Ιωάννης
  
University of Thessaly  
Department of Electrical and Computer Engineering  
 
 
Study and Implementation of a Parallel Decoder in
Reprogrammable Logic for 4th-Generation Networks
 
  
 
 
Study and Implementation of a Parallel Turbo-Decoder on FPGA for 
3GPP-LTE 
 
 
 
 
Diploma Thesis by 
Tsiokanos Ioannis  
 
 
   Supervisors:  
                Georgios Stamoulis   Antonios Argyriou 
                Professor    Assistant Professor  
 
                   
    
 
 
 
Volos, July 2016
  
 
 
 
 University of Thessaly 
Department of Electrical and Computer Engineering 
 
 
“Study and Implementation of a Parallel Decoder
in Reprogrammable Logic
for 4th-Generation Networks”
 
 
       “Study and Implementation of a Parallel Turbo-Decoder on 
FPGA for 3GPP-LTE”  
 
 
 
   By 
Tsiokanos Ioannis 
      
Graduate Thesis for the degree of  
Diploma of Science in Computer and Communication Engineering  
 
 
     Approved by the two-member inquiry committee on the 8th of July
 
 
 
 
  ……………………..... . .              ………………………. .  
  Dr.  Georgios Stamoulis                                    Dr.  Antonios Argyriou  
  
       
 
  
  
Declaration of Authorship 
 
I, Tsiokanos Ioannis, declare that this thesis titled, ‘Study and Implementation of a 
Parallel Turbo Decoder’ and the work presented in it are my own. The research was 
carried out wholly or mainly while in candidature for the graduate degree of Diploma 
of Science in Computer and Communication Engineering, at the University of 
Thessaly, Department of Electrical and Computer Engineering, Volos, Greece.  
 
 
 
…………………..  
Tsiokanos Ioannis   
 
 
 
 
 
 
 
 
 
 
 
Copyright © Tsiokanos Ioannis, 2016
All rights reserved.
 
  
 
 
 
 
 
 
 
To my family and my friends
 
 
Acknowledgements 
 
Upon completion of my thesis, I would like to thank my supervisor
Dr. Georgios Stamoulis and my co-supervisor Dr. Antonios Argyriou
for their trust and the excellent cooperation we had during this
thesis and my studies.

I would also like to thank my friends and colleagues at the VLSI and
EDA Tools Laboratory, and especially Ph.D. candidate Charalampos
Antoniadis, for their assistance and guidance on this work.

Finally, I have to thank my family for the endless and invaluable
moral support that they offered me throughout all these academic years.
 
        
Tsiokanos Ioannis,  
    Volos 2016 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
Contents 
 
 
List of Tables
List of Figures
List of Acronyms
Abstract
1 Introduction
    1.1 Motivation
    1.2 Thesis goal
    1.3 Thesis structure
2 Turbo-Decoding for LTE
    2.1 Introduction
    2.2 Turbo-Decoding Algorithm
    2.3 Radix-4 Max-Log BCJR Algorithm
    2.4 LTE Interleaver
3 Parallel Turbo-Decoder Architecture
    3.1 High-Level Architecture
    3.2 Memory Architecture
    3.3 Implementation Tradeoffs
4 LTE Interleaver Architecture
    4.1 Contention-Free Interleaving for LTE
    4.2 Master-Slave Batcher Network
5 Radix-4 Max-Log BCJR Architecture
    5.1 VLSI Architecture
    5.2 Radix-4 ACS Units with Modulo-Normalization
    5.3 LLR Computation Unit
6 Implementation Results
    6.1 AXI4-Stream IP
    6.2 Verilog Implementation
    6.3 Error-Rate Performance and key characteristics
7 Conclusion
    7.1 Future work
Bibliography
 
List of Tables  
 
Table 2.1   Matlab Simulator Profiling for SISO Receiver
Table 2.2   Turbo codes Interleaver parameters (Part 1 of 2)
Table 2.3   Turbo codes Interleaver parameters (Part 2 of 2)

Table 6.1   Signals associated with slave interface
Table 6.2   Signals associated with master interface
Table 6.3   Construction of Tdata_s
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
List of Figures 
 
 
Figure 1.1  Evolution of wireless standards in the last two decades
Figure 1.2  LTE SISO Processing Chain

Figure 2.1  Parallel-concatenated turbo-encoder and block diagram of
            a turbo-decoder
Figure 2.1  Basic Trellis Diagram
Figure 2.2  Basic structure of an iterative Turbo decoder. Iterative
            decoding based on MAP decoders. Forward/backward
            recursions on the trellis diagram
Figure 2.3  Example of the calculation of the forward and backward
            state metrics for radix-2 recursions
Figure 2.4  Radix-2 and radix-4 recursions

Figure 3.1  High-level architecture of the parallel turbo-decoder

Figure 4.1  Architecture of the contention-free interleaver
Figure 4.2  The Master-Slave Batcher Network architecture

Figure 5.1  Architecture of the implemented radix-4 max-log BCJR core
Figure 5.2  Radix-2 and Radix-4 architectures

Figure 6.1  ASIC/FPGA Design process
Figure 6.2  Synthesis report (part 1 of 3)
Figure 6.3  Synthesis report (part 2 of 3)
Figure 6.4  Synthesis report (part 3 of 3)
 
 
List of Acronyms  
 
LTE     Long Term Evolution
3GPP    Third Generation Partnership Project
SISO    Soft-Input Soft-Output
LLR     Log-Likelihood Ratio
SD      SISO Decoder
CE      Convolutional Encoder
QPP     Quadratic Polynomial Permutation
ARP     Almost Regular Permutation
HSDPA   High-Speed Downlink Packet Access
$a_k(s)$          forward state metric
$\beta_k(s)$      backward state metric
$\gamma_k(s', s)$ branch metric
$L_k^s$           Systematic Log-Likelihood Ratio
$L_k^{p2}$        Parity LLR from the second CE
$L_k^{p1}$        Parity LLR from the first CE
$L_k^A$           A-priori LLR
$L_k^D$           Intrinsic LLR
$L_k^E$           Extrinsic LLR
BCJR    Bahl, Cocke, Jelinek and Raviv
ACS     Add-Compare-Select
OFDM    Orthogonal Frequency-Division Multiplexing
AWGN    Additive White Gaussian Noise
FFT     Fast Fourier Transform
HDL     Hardware Description Language
AMBA    Advanced Microcontroller Bus Architecture
FPGA    Field-Programmable Gate Array
IP      Internet Protocol
MAP     Maximum A Posteriori
BER     Bit Error Rate
Rx      Receiver
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Abstract 
 
LTE (Long Term Evolution) and LTE-Advanced are the latest
mobile communications standards developed by the Third Generation
Partnership Project (3GPP). These standards represent a
transformative change in the evolution of mobile technology. Within
the present decade, network infrastructures and mobile terminals
have been designed and upgraded to support the LTE standards. As
these systems are deployed in every corner of the globe, the LTE
standards have finally realized the dream of providing a truly global
broadband mobile access technology.

The turbo decoder is the most challenging component in a digital
HSDPA receiver in terms of computational requirements and power
consumption, because the large block size and the recursive algorithm
prevent pipelining.

This thesis addresses hardware implementation aspects of a parallel
turbo-decoder on FPGA that reaches an LTE data rate of more than
150 Mb/s using multiple soft-input soft-output (SISO) decoders that
operate in parallel. To improve efficiency, we employ a radix-4-based
8x-parallel turbo-decoder. The turbo code rate is set to 1/3.
 
 
 
 
 
Keywords: 
LTE, mobile communication standards, HSDPA receiver, hardware
implementation, parallel Turbo-Decoder, FPGA
 
 
  
 
Chapter 1 
 
Introduction 
 
Turbo coding was introduced in 1993 by Berrou, Glavieux, and
Thitimajshima [1], [2], who reported extremely impressive results
for a code with a long frame length. Since its invention, turbo
coding has evolved at an unprecedented rate and reached a state
of maturity within just a few years, due to the intensive research
efforts of the turbo coding community. The excellent performance of
turbo codes, however, comes at the expense of significant
computational complexity and consequently high power consumption
at the receiver for proper decoding. Indeed, the computational burden
of the turbo decoder far exceeds that of any other component in a
receiver, especially at high data rates.
 
 
1.1 Motivation 
 
In the past two decades we have seen the introduction of various
mobile standards, from 2G to 3G to the present 4G, and we expect
the trend to continue. The primary mandate of the 2G standards was
the support of mobile telephony and voice applications. The 3G
standards marked the beginning of the packet-based data revolution
and the support of Internet applications such as email, Web browsing,
text messaging, and other client-server services. The 4G standards
feature all-IP packet-based networks and support the explosive
demand for bandwidth-hungry applications such as mobile
video-on-demand services. The rapid increase in wireless data
traffic is now beginning to strain network capacity, and operators are
looking for novel technologies that enable even higher data rates than
those of the past. The channel coding scheme for LTE [3] is turbo
coding. Turbo codes achieve close to Shannon capacity [4], and the
turbo decoder is typically one of the major blocks in an LTE wireless
receiver. Turbo decoders suffer from high decoding latency due to
the iterative decoding process, the forward-backward recursion in the
maximum a posteriori (MAP) decoding algorithm, and the
interleaving/de-interleaving between iterations.
 
 
 
 
1.2 Thesis goal 
 
In this work, we present the implementation of a power-efficient,
high-throughput parallel turbo-decoder architecture for LTE,
proposed in [5]. We detail an 8x-parallel radix-4-based SISO
decoder. We used the Verilog Hardware Description Language
(HDL) to develop the hardware modules, and we verified them by
comparing the HDL simulation results with the corresponding
results from Matlab.

Another goal of this thesis is to integrate the hardware
implementation of the turbo-decoder into an LTE-compliant Single-In
Single-Out model in order to accelerate the receiver's (Rx)
baseband processing (Figure 1.2).
 
 
 
 
Figure 1.2  LTE SISO Processing Chain  
 
 
1.3 Thesis Structure 
 
The remainder of the thesis is organized as follows. Chapter 2 reviews
the principle of turbo decoding and details the algorithm used for
SISO decoding. The parallel turbo-decoder architecture is presented
in Chapter 3, where the corresponding throughput/area tradeoffs are
studied. The interleaver architecture is detailed in Chapter 4, and
Chapter 5 describes the architecture of the SISO decoder. Chapter 6
provides the implementation results, and we conclude in Chapter 7.
  
 
Chapter 2 
 
Turbo-Decoding for LTE 
 
2.1 Introduction 
 
The components of the receiver [6] shown in Figure 1.2 are briefly
described below:

• OFDM (including Demapper)
  o Subdivides the information transmitted in the frequency
    domain and aligns data symbols with subcarriers
  o Cyclic prefix removal
  o FFT (Fast Fourier Transform) operation to recover the
    received data and reference signals at each subcarrier

• Channel Estimation and Equalizer
  o Estimates the channel frequency response based on
    transmitted known data or symbols
  o Recovers the best estimate of the transmitted signal
    using a low-complexity frequency-domain equalizer

• Demodulator
  o Demodulates the payload symbols according to the chosen
    constellation grid

• Descrambler
  o Inverts the scrambling operation that the transmitter
    applied to the transmitted signal

• Turbo Decoder
  o Is used in conjunction with a turbo convolutional
    encoder to provide an extremely effective way of
    transmitting data reliably over noisy data channels
  o Is designed to meet the LTE specification

According to the Matlab simulator (Table 2.1), the turbo decoder is by
far the most time-consuming component of the receiver.
 
 
 
Component             Time (sec)
OFDM                  0.008512
Demapper              0.004483
Channel Estimation    0.026690
Equalizer             0.001541
Demodulator           0.055949
Descrambler           0.015564
Turbo Decoder         0.153524

Table 2.1  Matlab Simulator Profiling for SISO Receiver
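
The relative weight of each component can be checked with a few lines of Python. The sketch below only re-adds the seven timings of Table 2.1 and expresses each one as a share of their sum; it assumes that these seven entries cover the whole profiled receiver run, which the text does not state explicitly. On these numbers the turbo decoder accounts for well over half of the total.

```python
# Per-component run times in seconds, copied from Table 2.1.
profile = {
    "OFDM": 0.008512,
    "Demapper": 0.004483,
    "Channel Estimation": 0.026690,
    "Equalizer": 0.001541,
    "Demodulator": 0.055949,
    "Descrambler": 0.015564,
    "Turbo Decoder": 0.153524,
}

# Assumption: the seven entries add up to the full receiver run time.
total = sum(profile.values())
share = {name: t / total for name, t in profile.items()}

# Print the components from slowest to fastest with their share of the total.
for name, t in sorted(profile.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20s}{t:10.6f} s{100 * share[name]:7.1f} %")
```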
 
 
 
 
2.2 Turbo-Decoding Algorithm 
 
The turbo encoder is illustrated on the left-hand side of Figure 2.1.
The first component encoder receives uncoded (systematic) data bits in
natural order and outputs a set of parity bits. The second component
encoder receives a permutation of the data bits from a block
interleaver and outputs a second set of parity bits. The systematic
bits and the two sets of parity bits are then transmitted over the
wireless channel.
 
 
 
Figure 2.1  Parallel-concatenated turbo-encoder and block
diagram of a turbo-decoder.
 
  
However, since this signal is usually distorted by noise and
interference, the demodulator can only obtain estimates of the
systematic bits and the two sets of parity bits. These estimates are
provided to the subsequent turbo decoder in the form of log-likelihood
ratios (LLRs) $L_k^s$, $L_k^{p1}$ and $L_k^{p2}$, which express the
ratio between the probabilities of the transmitted bits being 0 and
being 1. The turbo decoder inverts the operations performed by the
turbo encoder. It is based on two SISO decoders (SDs) and two
interleavers in a feedback loop; Figure 2.1 depicts the main idea. The
first and second SD decode the convolutional code generated by the
first and the second convolutional encoder (CE), respectively. One
pass through both the first and the second SD is referred to as a full
iteration; the operation performed by a single SD is a half-iteration.
In this work, 11 half-iterations are used to produce the final decoded
bits.
 
Each SD computes intrinsic a-posteriori LLRs, $L_k^{D1}$ and
$L_k^{D2}$, for the transmitted bits, based on the systematic LLRs in
natural order $L_k^s$ or interleaved order $L_{\pi(k)}^s$, on the
parity LLRs $L_k^{p1}$ or $L_k^{p2}$, and on the so-called a-priori
LLRs $L_k^{A1}$ or $L_k^{A2}$. For the first iteration the a-priori
LLRs are set to 0. In subsequent iterations, each SD uses as a-priori
information the extrinsic LLRs computed by the other SD, where SD $i$
produces $L_k^{Ei} = L_k^{Di} - (L_k^s + L_k^{Ai})$.
Due to the interleaving used at the encoder, care must be taken to
properly interleave and de-interleave the LLRs that represent the
soft values of the bits. Furthermore, because of the iterative nature
of the decoding, care must be taken not to reuse the
same information more than once at each decoding step.
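
The extrinsic update can be written out in a few lines. The following Python sketch is illustrative only: the function name, the three-element LLR vectors and the toy permutation `pi` are hypothetical and are not taken from the thesis. It subtracts the systematic and a-priori LLRs from the a-posteriori output of one SD and reorders the result so that it can serve as a-priori input to the other SD.

```python
def extrinsic_llr(l_d, l_s, l_a):
    """Element-wise L^E = L^D - (L^s + L^A) over one code block."""
    return [d - (s + a) for d, s, a in zip(l_d, l_s, l_a)]


# Hypothetical values for a 3-bit block (first iteration: a-priori LLRs are 0).
l_s = [2.0, -1.5, 0.5]      # systematic LLRs
l_a1 = [0.0, 0.0, 0.0]      # a-priori LLRs of the first SD
l_d1 = [3.5, -4.0, -0.25]   # a-posteriori output of the first SD

l_e1 = extrinsic_llr(l_d1, l_s, l_a1)

# Toy permutation standing in for the interleaver: the second SD reads
# the extrinsic values of the first SD in interleaved order.
pi = [2, 0, 1]
l_a2 = [l_e1[pi[k]] for k in range(len(pi))]

print(l_e1)
print(l_a2)
```

Removing the systematic and a-priori terms before the exchange is what keeps a decoder from being fed back information it already used.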
 
 
A soft-in soft-out decoder is a type of soft-decision decoder used
with error correcting codes. "Soft-in" refers to the fact that the
incoming data may take on values other than 0 or 1, in order to
indicate reliability. "Soft-out" refers to the fact that each bit in
the decoded output also takes on a value indicating reliability.
 
The soft inputs and outputs of the component decoders are
typically represented as so-called log-likelihood ratios (LLRs):
the sign of an LLR gives the hard decision on the bit, and its
magnitude the reliability of that decision. As their name implies,
LLRs are simply the logarithm of the ratio of two
probabilities. For example, the LLR $L(u_k)$ for the value of a decoded
bit $u_k$ is given by
 
 
$$L(u_k) = \ln\left(\frac{P(u_k = +1)}{P(u_k = -1)}\right) \qquad (2.1)$$
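
As a small numerical illustration of (2.1), the Python sketch below converts a probability into an LLR and back. The helper names are hypothetical; the only assumption is that the two probabilities in (2.1) sum to one.

```python
import math


def llr(p_plus):
    """LLR of a bit as in (2.1), given P(u_k = +1) with 0 < p_plus < 1."""
    return math.log(p_plus / (1.0 - p_plus))


def prob_plus(l):
    """Inverse of llr(): recover P(u_k = +1) from the LLR."""
    return 1.0 / (1.0 + math.exp(-l))


def hard_decision(l):
    """The sign of the LLR gives the decided bit value (+1 or -1)."""
    return +1 if l >= 0 else -1


for p in (0.5, 0.9, 0.1):
    value = llr(p)
    print(p, value, hard_decision(value))
```

An LLR of zero means both values are equally likely, and the further the LLR is from zero, the more reliable the decision.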
 
 
We summarize below what is meant by the terms a-priori,
a-posteriori, and extrinsic information.

a-priori: The a-priori information about a bit is information
known before decoding starts, from a source other than
the received sequence or the code constraints. It is also
sometimes referred to as intrinsic information to contrast
with the extrinsic information described next.

extrinsic: The extrinsic information about a bit $u_k$ is the
information provided by a decoder based on the received
sequence and on a-priori information, excluding the
received systematic bit and the a-priori information for
the bit itself. Typically, the component decoder provides this
information using the constraints imposed on the
transmitted sequence by the code used. It processes the
received bits and a-priori information surrounding the
systematic bit, and uses this information and the code
constraints to provide information about the value of $u_k$.

a-posteriori: The a-posteriori information about a bit is the
information that the decoder gives taking into account
all available sources of information about $u_k$. It is the
a-posteriori LLR that the MAP algorithm gives as its
output.
  
 
 
2.3 Radix-4 Max-Log BCJR Algorithm 
 
In 1974 an algorithm, known as the Maximum A-Posteriori (MAP)
algorithm, was proposed by Bahl, Cocke, Jelinek and Raviv [7] for
estimating the a-posteriori probabilities of the states and the
transitions of an observed Markov source, when subjected to
memoryless noise. This algorithm has also become known as the
BCJR algorithm, named after its inventors. They showed how the
algorithm could be used for decoding both algebraic and
convolutional codes. The MAP algorithm examines every possible
path through the convolutional decoder trellis and therefore initially
seemed to be unfeasibly complex for application in most systems.
Hence, it was not widely used before the discovery of turbo codes.
The MAP algorithm provides not only the estimated bit sequence, but
also the probability that each bit has been decoded correctly. This is
essential for the iterative decoding of turbo codes and makes the MAP
algorithm very suitable for turbo decoders.
 
The BCJR algorithm resembles the Viterbi algorithm [8] and
traverses a trellis representing the convolutional code to compute the
intrinsic LLRs.
Unlike block codes, trellis codes do not operate on independent blocks
of source data. A trellis encoder maps an arbitrarily long
input data stream to an arbitrarily long output code stream, so trellis
codes can encode data continuously. A trellis encoder is a finite state
machine: its output depends on the inputs at that time
and on the current state of the encoder. The rate of the encoder is
k/n, as in block codes, because it gives n outputs for k inputs. In
trellis-coded modulation the receiver's decision is based on the
entire sequence of symbols rather than on a symbol-by-symbol
calculation.
 
Figure 2.1  Basic Trellis Diagram
 
 
 
Figure 2.2  Basic structure of an iterative Turbo
decoder. Iterative decoding based on MAP decoders.
Forward/backward recursions on the trellis diagram.
 
 
The max-log approximation is applied to the forward state-metric
recursion:

$$a_k(s) = \max\{\, a_{k-1}(s'_0) + \gamma_k(s'_0, s),\; a_{k-1}(s'_2) + \gamma_k(s'_2, s) \,\} \qquad (2.2)$$
 
where $s'_0$ and $s'_2$ correspond to the two possible predecessor states
of $s$ (see Fig. 2). The backward state metrics $\beta_k(s)$ are computed
similarly to (2.2) in the opposite direction. Both recursions can be
performed efficiently using hardware-friendly add-compare-select
(ACS) operations. The $\gamma_k$ term above is the branch transition
probability, which depends on the trellis diagram and is usually
referred to as the branch metric (see [9] for details).
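
One step of the recursion (2.2) amounts to an add-compare-select per state. The Python sketch below shows this on a hypothetical two-state trellis; the predecessor table, the branch metrics and the previous state metrics are made-up numbers chosen only to exercise the code and do not correspond to the LTE code.

```python
def forward_step(alpha_prev, predecessors, gamma):
    """One max-log forward step as in (2.2).

    alpha_prev   -- state metrics of the previous trellis step
    predecessors -- predecessors[s] lists the states that lead into s
    gamma        -- gamma[(s_prev, s)] is the branch metric of s_prev -> s
    """
    return [
        # Add the branch metric to each predecessor metric, then keep the
        # larger candidate (compare and select).
        max(alpha_prev[sp] + gamma[(sp, s)] for sp in predecessors[s])
        for s in range(len(alpha_prev))
    ]


# Hypothetical two-state trellis in which every state has two predecessors.
predecessors = [[0, 1], [0, 1]]
gamma = {(0, 0): 1.0, (1, 0): 4.0, (0, 1): -1.0, (1, 1): 2.0}
alpha_prev = [0.0, -2.0]

alpha = forward_step(alpha_prev, predecessors, gamma)
print(alpha)
```

The backward metrics can be obtained with the same routine by swapping the roles of predecessors and successors and walking the trellis from its end.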
 
 
 
Figure 2.3  Example of the calculation of the forward and
backward state metrics for radix-2 recursions
 
 
 
Once all $a_k$ and $\beta_k$ have been obtained, the a-posteriori output
of the max-log-MAP decoder can be computed. To this end, the decoder
considers the state transitions $s' \to s$ associated with $x^s = 0$ and
those associated with $x^s = 1$ separately, and then computes:

$$L_k^{D1,D2} = \max_{(s',s):\,x^s=0}\{\, a_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \,\} - \max_{(s',s):\,x^s=1}\{\, a_{k-1}(s') + \gamma_k(s', s) + \beta_k(s) \,\} \qquad (2.3)$$
 
In this work, a radix-4 max-log turbo decoder (see Figure 2.4) is used
to enhance throughput. The max-log-MAP core processes two received
symbols per clock cycle using a radix-4 architecture, doubling the
throughput at a given clock rate over a similar radix-2 architecture.
Specifically, the radix-4 forward state metrics (Figure 2.4) are
computed from the four admissible predecessor states $s''_0$, $s''_1$,
$s''_2$ and $s''_3$ (at step k-2) as follows:
 
$$a_k(s) = \max\{\, a_{k-2}(s''_0) + \gamma_k(s''_0, s),\; a_{k-2}(s''_1) + \gamma_k(s''_1, s),\; a_{k-2}(s''_2) + \gamma_k(s''_2, s),\; a_{k-2}(s''_3) + \gamma_k(s''_3, s) \,\} \qquad (2.4)$$
         
For the first trellis step (k=0) we initialize $a_k(s_0) = 1$,
$a_k(s_1) = 0$, $a_k(s_2) = 0$ and $a_k(s_3) = 0$. The radix-4 branch
metrics required in (2.4) are computed according to:

$$\gamma_k(s''_i, s) = \gamma_{k-1}(s''_i, s'_j) + \gamma_k(s'_j, s) \qquad (2.5)$$

using the six branch metrics associated with trellis steps k and
k-1 that are required in the radix-2 recursion.
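
That one radix-4 step gives the same state metrics as two consecutive radix-2 steps can be checked numerically. The Python sketch below does so on a toy 4-state shift-register trellis, which is an assumption made for brevity and is not the LTE trellis; the branch metrics are random integers, so the comparison is exact.

```python
import random

NUM_STATES = 4  # toy trellis for illustration, not the LTE code


def next_state(state, bit):
    """Shift-register trellis: shift the input bit into a 2-bit state."""
    return ((state << 1) | bit) & (NUM_STATES - 1)


def random_gamma(rng):
    """Random integer branch metrics for every valid transition."""
    return {(s, next_state(s, b)): rng.randint(-8, 8)
            for s in range(NUM_STATES) for b in (0, 1)}


def radix2_step(alpha_prev, gamma):
    """One max-log forward step over a single trellis stage."""
    alpha = [float("-inf")] * NUM_STATES
    for s_prev in range(NUM_STATES):
        for bit in (0, 1):
            s = next_state(s_prev, bit)
            alpha[s] = max(alpha[s], alpha_prev[s_prev] + gamma[(s_prev, s)])
    return alpha


def radix4_step(alpha_prev2, gamma_km1, gamma_k):
    """One max-log forward step over two trellis stages at once."""
    alpha = [float("-inf")] * NUM_STATES
    for s2 in range(NUM_STATES):
        for b1 in (0, 1):
            s1 = next_state(s2, b1)
            for b2 in (0, 1):
                s = next_state(s1, b2)
                # Radix-4 branch metric: sum of the two radix-2 metrics.
                gamma4 = gamma_km1[(s2, s1)] + gamma_k[(s1, s)]
                alpha[s] = max(alpha[s], alpha_prev2[s2] + gamma4)
    return alpha


rng = random.Random(0)
alpha0 = [rng.randint(-8, 8) for _ in range(NUM_STATES)]
g1, g2 = random_gamma(rng), random_gamma(rng)

two_radix2 = radix2_step(radix2_step(alpha0, g1), g2)
one_radix4 = radix4_step(alpha0, g1, g2)
print(two_radix2 == one_radix4)
```

In this toy trellis every state has exactly four predecessors two steps back, each reached through a single intermediate state, which mirrors the four-way maximum in (2.4).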
 
Since the backward recursion progresses from the end of the trellis
diagram to its beginning, we initially set $\beta_k(s) = 1/N$ for every
state, where N is the number of states in the turbo encoder. Then we
use the radix-2 recursion (2.2) to calculate $\beta_{k-1}(s')$.
 
 
 
Figure 2.4  Radix-2 and radix-4 recursions  
 
 
2.4 LTE Interleaver 
 
Interleavers for turbo codes scramble the data in a pseudo-random
order to minimize the correlation between the outputs of the component
encoders. The interleaver is an essential part of a turbo code and is
responsible for its excellent bit error rate (BER) performance.
Although parallelism can be obtained using multiple hardware
instances of a single decoder, this solution increases the memory
requirements (each decoder requires separate memory) and also
incurs a long latency. Recognizing these deficiencies, the LTE
working group decided upon an approach that enables internal
parallelism within a fast serial decoder.
 
Generally, the task of an interleaver is to permute the soft values
generated by the MAP decoder and write them into random or
pseudo-random positions. Interleaver architectures are well studied in
the literature [10], [11], and recent wireless communication standards
have incorporated QPP interleavers (as in LTE) and ARP
interleavers [12].
 
In this work, a contention-free QPP interleaver architecture is used in
the turbo decoder design. The recursive architecture of the QPP
interleaver has a simple design and can easily be used in the
parallel architecture of the turbo decoder to achieve higher throughput.
Moreover, the QPP interleaver can be configured to calculate
interleaved addresses for any value of the block length K. For example,
the 3GPP-LTE wireless standard uses 188 different values of K, ranging
from 40 bits to 6144 bits. Specifically, the address computation for QPP
interleavers is carried out as:

$$\pi(k) = (f_1 k + f_2 k^2) \bmod K \qquad (2.6)$$

where $f_1$ and $f_2$ are suitably chosen interleaver parameters that
depend on the code-block length K.
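
Expression (2.6) is short enough to evaluate directly. The Python sketch below assumes K = 40 with f1 = 3 and f2 = 10, the pair that the 3GPP interleaver parameter table lists for the smallest block size; the function name is hypothetical.

```python
def qpp_interleave(k, K, f1, f2):
    """Interleaved address of position k according to (2.6)."""
    return (f1 * k + f2 * k * k) % K


# Assumed parameters for the smallest LTE block size.
K, F1, F2 = 40, 3, 10

perm = [qpp_interleave(k, K, F1, F2) for k in range(K)]

# A valid interleaver visits every address exactly once.
print(sorted(perm) == list(range(K)))
print(perm[:8])
```

Because the result is a permutation of 0..K-1, the de-interleaver can be obtained simply by inverting the table.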
 
 
 
 
 
Table 2.2  Turbo codes Interleaver parameters (Part  1 of 2)   
 
 
 
 
 
 
Table 2.3  Turbo codes Interleaver parameters (Part  2 of 2)  
 
 
The QPP interleaver can be configured to produce contention-free
interleaved addresses for any of these values by changing the values
of f1 and f2 in expression (2.6). Expression (2.6) can be
implemented efficiently in hardware because only addition,
multiplication and modulo operations are involved. Furthermore, QPP
interleavers map even addresses to even addresses and odd addresses
to odd addresses.
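
One possible way to avoid the multiplications altogether is to generate the addresses recursively, using the identity π(k+1) - π(k) = f1 + f2 + 2·f2·k (mod K), which follows directly from (2.6). The Python sketch below is an illustration of that idea and not necessarily the recursion used in the thesis hardware; it again assumes K = 40, f1 = 3 and f2 = 10, and also checks the even/odd property mentioned above.

```python
def qpp_direct(k, K, f1, f2):
    """Reference address computation as in (2.6)."""
    return (f1 * k + f2 * k * k) % K


def qpp_recursive(K, f1, f2):
    """Generate pi(0), pi(1), ..., pi(K-1) with additions only."""
    addresses = []
    addr = 0                    # pi(0) = 0
    step = (f1 + f2) % K        # pi(1) - pi(0)
    increment = (2 * f2) % K    # growth of the step from k to k + 1
    for _ in range(K):
        addresses.append(addr)
        addr = (addr + step) % K
        step = (step + increment) % K
    return addresses


# Assumed parameters for the smallest LTE block size.
K, F1, F2 = 40, 3, 10

recursive = qpp_recursive(K, F1, F2)
direct = [qpp_direct(k, K, F1, F2) for k in range(K)]

print(recursive == direct)
# Even positions map to even addresses, odd positions to odd addresses.
print(all((addr - k) % 2 == 0 for k, addr in enumerate(recursive)))
```

In hardware, each modulo operation in the loop reduces to a compare and a conditional subtraction, because both operands are already smaller than K.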
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Chapter 3  
 
Parallel Turbo-Decoder Architecture 
 
In the conventional (non-parallel) BCJR algorithm, computing the
forward state, backward state and branch metrics for all trellis
stages results in a huge memory requirement and imposes a large
decoding delay. The major steps of parallel turbo-decoding that
relate to the state metrics are presented as follows.
 
Initialization: Assuming that the encoder is reset, the forward state
metrics are initialized as $a_0(s_i) = 1$ for $i = 0$ and
$a_0(s_i) = 0$ for all $i \neq 0$.

Forward recursion: During this process, the forward state metrics of
each state for successive trellis stages are computed as in (2.4).

Backward recursion and estimation of backward state metrics: If
N represents the total number of states in each trellis stage, the
backward state metrics are initialized as $\beta_k(s_i) = 1/N$ for
every state $s_i$, and during the backward recursion the radix-2
recursion (2.2) is used to compute $\beta_{k-1}(s_i)$.
 
To increase throughput, a promising solution is to instantiate
N BCJR units and perform N-fold parallel decoding of the trellis. This
approach increases the turbo-decoding throughput by a factor of N
compared to a non-parallel turbo-decoder.
 
 
3.1 High-Level Architecture 
 
This work contains N=16 max-log BCJR instances, input memories
for the storage of the systematic and parity LLRs, and one intermediate
memory for the storage of the extrinsic LLRs. Because the radix-4
technique is used, two trellis steps are processed per clock cycle. It
is noteworthy that the use of radix-4 recursions entails a 2x increase
in memory bandwidth, since the LLRs associated with both even- and
odd-numbered trellis steps are required in every clock cycle.
 
 
[Block diagram: input RAMs for the systematic, parity-1 and parity-2
LLRs; BCJR decoders 1 to N and "Intrl" BCJR decoders 1 to N;
intermediate RAM for the extrinsic LLRs; interleavers, de-interleaver
and address generator.]

Figure 3.1  High-level architecture of the parallel turbo-decoder
 
 
 
 
 
 
3.2 Memory Architecture 
 
With low power and high throughput in mind, the turbo-decoder is based
on the architecture in Figure 3.1. In this design, taking into
consideration the fact that the radix-4 recursion is used, four input
block RAMs store one block of the LLRs of the systematic and both sets
of parity bits. Two input block RAMs are associated with the systematic
LLRs: one stores the systematic LLRs of the even-numbered trellis steps
and the other those of the odd-numbered trellis steps. In addition, two
input block RAMs are used to store the parity-1 and parity-2 LLRs.
Furthermore, two block RAMs store the intermediate extrinsic LLRs, one
for the odd and one for the even trellis steps, and two block RAMs
serve the de-interleaver unit. In total, 8 block RAMs are used, and the
4 block RAMs for the systematic and extrinsic LLRs each require half
the storage of the parity block RAMs. Each memory contains N LLR
values per address. This partitioning enables 2xN LLRs (N is the
number of parallel decoders) to be read per clock cycle.
 
     
3.3 Implementation Tradeoffs 
 
Typically, the throughput of digital circuits can be increased by
architectural and circuit-level transformations such as pipelining or
parallel processing. For turbo decoders, the applicability of
pipelining is limited due to the presence of feedback loops, and the
accompanying extra registers increase the energy consumption.

A comparative study of BER performance has shown that the parallel
turbo decoder achieves adequate BER performance. Recently, VLSI
implementations of parallel turbo decoders with N=8 [13],
N=16 [14], N=32 [15] and N=64 [16] have been reported for higher
data-rate applications. One of the key aspects of this work is the use
of radix-4 recursions in order to achieve high throughput. Although
the use of radix-4 increases the area occupied by the BCJR decoders,
the area of the rather large input and intermediate memories remains
the same. Clearly, the throughput improvement has
to be paid for by an increase in complexity.
 
 
 
 
 
 
 
Chapter 4 
 
 
LTE Interleaver Architecture 
 
Interleaving means the permutation of the order of the data bits in a
code block. Turbo codes require specific interleavers which minimize
the correlation between the SISO decoder inputs of subsequent
half-iterations to achieve the best decoding error-rate performance.
However, the rules for the generation of the interleaved pattern are
highly complex.

In turbo decoder implementations the interleaver is a sub-block of
the address generator, which generates the addresses for the
memories in natural or permuted order. Thus, depending on the turbo
decoder half-iteration, the SISO decoder inputs can be read from the
input and from the intermediate memory in natural or interleaved
order. After decoding, the LLR outputs of the SISO decoder block
are written back in natural or interleaved order to the same address
in the intermediate memory, depending on the specific turbo decoder
half-iteration.

For most interleavers, parallel and interleaved memory access
leads to an interleaver bottleneck, which is caused by access
contentions. Thus, an interleaver that alleviates this
bottleneck is of primary importance for parallel turbo decoding.
 
 
4.1 Contention-Free Interleaving for LTE
 
The LTE interleaver exhibits two properties that allow the memories to
be accessed in both interleaved and natural order. The first property,
which solves the memory access contention problem, is that the
interleaver is contention-free. Contention-free
interleavers [17] allow instant access and trivial mapping for the LLR
values that are required by the N parallel SISO decoders. For
example, if K is the block length and N divides K without
remainder, the LLR values in interleaved or natural order can
always be read from N memories. The second property is that the
interleaver is maximally vectorizable [18]: the address distance
between each of the N interleaved addresses is always an integer
multiple of the trellis-segment length S.
 
 
Figure 4.1  Architecture of the contention-free interleaver
 
 
As stated earlier, radix-4 is used in this work, and therefore even-
and odd-numbered systematic and extrinsic LLRs are stored in separate
RAMs with S/2 addresses each. Figure 4.1 shows the storage of the K
LLRs of one code block (of length K) in a folded memory. The
folded memory has S addresses and each address contains N LLRs;
therefore, K = NxS LLR values can be stored. Figure 4.1 uses
N=8 and S=5. LLRs are written column-wise and each column
corresponds to one SISO decoder. As illustrated in Figure 4.1, the
address distance between each of the N LLRs in the same row is an
integer multiple of 5 (the trellis-segment length S), which is due to
the maximally-vectorizable interleaver.
 
In the natural-order phase, the folded memory is read at incrementing
addresses starting from address 0, and the N LLRs stored at each
address are passed directly to the N BCJR instances: the nth value
goes to the nth BCJR instance.
 
Since the LTE interleaver is maximally vectorizable, the N
interleaved addresses always point to the same row of the folded
memory. As illustrated in Figure 4.1, the 8 interleaved addresses
(6, 31, 36, 21, 26, 11, 16, 1) associated with address 1 of the
folded memory all lie in the same row. In the interleaved phase,
address decoding generates the sorting order that is required to
assign the LLRs from the folded memory to the corresponding SISO
decoders. A permutation according to the extracted sorting order is
then applied to the N LLR values, which are passed to the
corresponding BCJR instances. This enables N-fold parallel access
to the folded memory.
 
 
4.2 Master-Slave Batcher Network 
 
Address decoding and permutation for the maximally-vectorizable
contention-free interleaver, based on [5], are performed by the
Master-Slave Batcher Network depicted in Figure 4.2.
 
The address decoding described in Section 4.1 is carried out in the
master network, and the slave network performs the permutation by
applying the inverse sorting order to the N LLRs. The master
network consists of a number of 2-input sorter (SO) units and the
slave network of 2-input switches (SW). The permutation signals
from the master network control the switches in the slave network.
 
 
    
 
Figure 4.2  The Master-Slave Batcher Network architecture
 
This network is a hardware-efficient way to perform address
decoding and permutation because only multiplexers (MUXs) with 2
inputs and 1 output are required. The LTE interleaver is of primary
significance for parallel turbo decoders.
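
The master-slave principle can be sketched in software. The example
below is a behavioural model, not the implemented circuit: it
assumes that the comparator arrangement is Batcher's odd-even merge
sort for N = 8 and that the slave network undoes the sort by
applying the recorded switch settings in reverse order. The
function names are illustrative.

```python
# Sketch: master-slave sorting network built from Batcher's
# odd-even merge sort (N must be a power of two).
def merge(lo, hi, r):
    """Comparators that merge two sorted halves of lo..hi."""
    step = 2 * r
    if step < hi - lo:
        yield from merge(lo, hi, step)
        yield from merge(lo + r, hi, step)
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)


def batcher(lo, hi):
    """Comparator list sorting indices lo..hi (both inclusive)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from batcher(lo, mid)
        yield from batcher(mid + 1, hi)
        yield from merge(lo, hi, 1)


def master(addrs):
    """Sort the addresses; record one swap bit per 2-input sorter."""
    values = list(addrs)
    network = list(batcher(0, len(values) - 1))
    swaps = []
    for i, j in network:
        swap = values[i] > values[j]
        if swap:
            values[i], values[j] = values[j], values[i]
        swaps.append(swap)
    return values, network, swaps


def slave(llrs, network, swaps):
    """Undo the sort on the LLRs: same switches, reverse order."""
    out = list(llrs)
    for (i, j), swap in zip(reversed(network), reversed(swaps)):
        if swap:
            out[i], out[j] = out[j], out[i]
    return out


addrs = [6, 31, 36, 21, 26, 11, 16, 1]
sorted_addrs, network, swaps = master(addrs)
row = [f"LLR{a}" for a in sorted_addrs]   # row as stored in memory
print(len(network), slave(row, network, swaps))
```

Because every element of the slave network is a plain 2-input
switch driven by one stored bit, the model is consistent with the
observation that only 2-to-1 multiplexers are needed.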
 
 
 
 
 
Chapter 5 
 
Radix-4 Max-Log BCJR Architecture 
 
In this design, the 16 radix-4 max-log BCJR instances dominate the
circuit area and the power consumption. Consequently, an area- and
power-efficient implementation of the radix-4 max-log BCJR core is
very important.
 
 
5.1 VLSI Architecture 
 
The architecture of the radix-4 max-log BCJR is presented in Figure
5.1.
 
 
 
Figure 5.1  Architecture of the implemented radix-4 max-log BCJR core
 
In this design, two trellis steps are computed per clock cycle. The
computation is performed by two parallel units: the forward
state-metric recursion unit and the backward state-metric recursion
unit. The difficulty with this approach is that the backward (or
forward) state metrics required at the beginning of the backward
(or forward) recursion are unknown. In the very first iteration,
uniform state metrics can be used for initialization. The forward
state metrics are initialized as a_{k=0}(s_i) = 1 for i = 0 and
a_{k=0}(s_i) = 0 for all i ≠ 0, and in every clock cycle (2.4) is
used to compute the forward state metrics of the current trellis
step. In the backward state-metric recursion unit, in every step
the backward metrics are initialized as β_k(s_i) = 1/N for all
states s_i, where N is the number of trellis states (in this work
we have 4 states).
 
The branch-metric unit first works out the radix-2 branch metrics
and then computes the radix-4 branch metrics according to (2.5).
The results of the forward state-metric recursion unit pass through
flip-flops before they are used to compute the intrinsic LLRs. This
one-cycle delay is needed because the LLR computation unit requires
the forward metrics of the previous cycle. For example, the
computation of L^D_{k-1} requires a_{k-2}; see (2.3).
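
To illustrate why two trellis steps can be merged into one radix-4
step, the following sketch compares both forms of the max-log
forward recursion. It is a toy model under stated assumptions: a
hypothetical 4-state shift-register trellis rather than the LTE
trellis, random integer branch metrics, and log-domain
initialization (0 for the start state, a large negative value
elsewhere) in place of the probability-domain values given above.

```python
# Sketch: one radix-4 step versus two radix-2 steps of the max-log
# forward recursion on a toy 4-state shift-register trellis.
import random

STATES = range(4)


def preds(s):
    """The two predecessor states of s in the toy trellis."""
    return (s >> 1, (s >> 1) | 2)


def radix2_step(alpha, gamma):
    """alpha_k(s) = max over s' of alpha_{k-1}(s') + gamma(s', s)."""
    return [max(alpha[p] + gamma[p, s] for p in preds(s))
            for s in STATES]


def radix4_step(alpha, gamma1, gamma2):
    """Advance two trellis steps at once (four candidates per state)."""
    out = []
    for s in STATES:
        cands = [alpha[q] + gamma1[q, p] + gamma2[p, s]  # radix-4 branch
                 for p in preds(s) for q in preds(p)]
        out.append(max(cands))
    return out


def random_gammas(rng):
    """Hypothetical integer branch metrics for one trellis step."""
    return {(p, s): rng.randint(-15, 15)
            for s in STATES for p in preds(s)}


rng = random.Random(0)
alpha0 = [0, -64, -64, -64]          # state 0 is the known start state
g1, g2 = random_gammas(rng), random_gammas(rng)
two_steps = radix2_step(radix2_step(alpha0, g1), g2)
one_step = radix4_step(alpha0, g1, g2)
print(two_steps == one_step)
```

The radix-4 form evaluates four candidate paths per state, which is
why the corresponding ACS unit needs four adders and a 4-to-1
selection.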
 
 
5.2 Radix-4 ACS Units with Modulo-
Normalization 
 
The recursive state-metric computation cannot be pipelined or
parallelized due to the presence of the feedback loop. Hence, we
focus on measures that reduce the complexity of the state-metric
recursions in order to shorten the critical path and to reduce area
and power consumption. The normalization technique used in this
thesis is aimed at achieving high-speed operation of the turbo
decoder from an implementation perspective. In addition, the
radix-2 and radix-4 ACS units depicted in Figure 5.2 are hardware
friendly.
 
The comparison (CMP) circuit for modulo normalization [19] achieves
renormalization through a controlled overflow in the data path and
requires only a 3-input XOR gate. The parallel radix-4 ACS unit
uses 4 adders, 6 CMP circuits and a 4-to-1 MUX (4 inputs, 1
output). The select signal is derived from the outputs of the six
parallel CMP circuits, with the selection logic simplified by
Karnaugh-map minimization. The radix-2 ACS unit requires only 2
adders, one CMP circuit and a MUX whose select signal is the output
of the CMP circuit.
 
 
Figure 5.2 Radix-2 and Radix-4 architectures 
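
The principle of modulo normalization can be sketched as follows.
This is a behavioural model under assumptions, not the implemented
circuit: metrics are W = 10 bits wide and are allowed to wrap
around, the comparison is formed as the XOR of the two most
significant bits and the borrow of the remaining bits, and the
selection logic simply searches for the candidate that wins all of
its pairwise comparisons. The result is only valid while the true
metrics differ by less than 2^(W-1).

```python
# Sketch: modulo-normalized comparison and a radix-4 ACS step.
W = 10                      # state-metric width in bits
MASK = (1 << W) - 1
HALF = 1 << (W - 1)


def cmp_less(m1, m2):
    """True if wrapped metric m1 is smaller than wrapped metric m2.

    3-input XOR of the two MSBs and the borrow of the lower bits;
    valid while the true metrics differ by less than 2**(W-1).
    """
    borrow = (m1 & (HALF - 1)) < (m2 & (HALF - 1))
    return bool((m1 >> (W - 1)) ^ (m2 >> (W - 1)) ^ borrow)


def acs_radix4(metrics, gammas):
    """Add, compare and select among four candidate paths."""
    sums = [(m + g) & MASK for m, g in zip(metrics, gammas)]  # 4 adders
    pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
    less = {p: cmp_less(sums[p[0]], sums[p[1]]) for p in pairs}  # 6 CMPs
    for i in range(4):                      # 4-to-1 MUX select logic
        beats_later = all(not less[i, j] for j in range(i + 1, 4))
        beats_earlier = all(less[j, i] for j in range(i))
        if beats_later and beats_earlier:
            return sums[i], i
    raise ValueError("metric spread exceeds the modulo range")


print(cmp_less(1020, 6))        # 6 stands for 1030 after wrap-around
print(acs_radix4([1000, 1010, 990, 1005], [20, 20, 5, 3]))
```

In the second call the largest sum (1030) overflows to 6, yet it is
still selected, which is the point of the controlled overflow: no
explicit subtraction of a common offset is needed.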
 
 
5.3  LLR Computation Unit 
 
The LLR computation unit presented in Figure 5.1 calculates the
intrinsic and extrinsic LLRs for trellis steps k-1 and k in each
clock cycle. For step k-1, the quantities a_{k-2}, β_{k-1} and
γ_{k-1} are required: a_{k-2} comes from the flip-flops after the
forward state-metric recursion unit, γ_{k-1} from the branch-metric
unit and β_{k-1} from the backward state-metric recursion unit. For
step k, a_{k-1}, β_k and γ_k are required: a_{k-1} is derived from
a_{k-2} with the aid of a radix-2 ACS unit, γ_k comes from the
branch-metric unit and β_k from the backward state-metric recursion
unit.
 
For the computation of the intrinsic LLRs (2.3), the maximum of
a_{k-1}(s') + γ_k(s', s) + β_k(s) must be calculated over the state
transitions s' → s associated with x_s = 0, and likewise over those
associated with x_s = 1. This is computed with a design similar to
the radix-4 ACS unit, with the difference that the adders have 3
inputs (α, β, γ).
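
The two maxima and their difference can be sketched as follows. The
example is illustrative only: it uses a hypothetical 4-state
shift-register trellis and made-up metric values, and it assumes
the sign convention "maximum over x_s = 0 minus maximum over
x_s = 1"; the convention actually used is the one defined in (2.3).

```python
# Sketch: max-log LLR of one trellis step on a toy 4-state trellis.
STATES = range(4)


def transitions():
    """Yield (s_prev, s_next, bit) for a toy shift-register trellis."""
    for s_prev in STATES:
        for bit in (0, 1):
            yield s_prev, ((s_prev << 1) | bit) & 3, bit


def max_log_llr(alpha_prev, gamma, beta):
    """max over bit-0 transitions minus max over bit-1 transitions."""
    best = {0: None, 1: None}
    for s_prev, s_next, bit in transitions():
        # 3-input adder: forward + branch + backward metric
        total = alpha_prev[s_prev] + gamma[s_prev, s_next] + beta[s_next]
        if best[bit] is None or total > best[bit]:
            best[bit] = total
    return best[0] - best[1]


# Hypothetical metrics chosen only to exercise the function.
alpha_prev = [0, -3, -5, -2]
beta = [-1, 0, -4, -2]
gamma = {(s, t): (4 if b == 0 else -4) for s, t, b in transitions()}
print(max_log_llr(alpha_prev, gamma, beta))
```

Each of the two maxima is a four-way selection over three-input
sums, which mirrors the structure of the radix-4 ACS unit.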
 
 
⁡ 
 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Chapter 6 
 
Implementation Results 
 
In this chapter, simulation and synthesis results are presented and
the key characteristics of the implemented 8x parallel turbo
decoder are summarized.
 
 
6.1 AXI4-Stream IP
 
AXI4-Stream is a subset of the Advanced Microcontroller Bus
Architecture (AMBA) AXI4 protocol. It is designed for high-speed
streaming data. To simplify interoperability, Xilinx IP cores that
require streaming interfaces use a strict subset of the AXI4-Stream
protocol. An AXI4-Stream IP core is easy to use, flexible and
offers high performance. The turbo-decoder IP must have both a
slave and a master interface because it has to receive and send
data. Table 6.1 and Table 6.2 list and define the signals of the
slave and master interfaces.
 
 
 
 
 
Pin       Direction  Width   Description
                     (bits)

Aclk      Input      1       Clock: signals are sampled on the
                             rising edge.
Arst      Input      1       Reset: active-low reset. When asserted
                             low, the decoder is reset.
En        Input      1       Enable: clock-enable signal.
Tvalid_s  Input      1       Tvalid: indicates that the master is
                             driving a valid transfer. A transfer
                             takes place when both Tvalid and
                             Tready are asserted.
Tready_s  Output     1       Tready: indicates that the slave can
                             accept a transfer in the current
                             cycle.
Tlast_s   Input      1       Tlast: indicates the boundary of a
                             packet.
Tdata_s   Input      32      Tdata: the primary payload, used to
                             carry the data passing across the
                             interface. The width of the data
                             payload is an integer number of bytes.

Table 6.1  Signals associated with the slave interface
 
 
 
 
 
Pin       Direction  Width   Description
                     (bits)

Aclk      Input      1       Clock: signals are sampled on the
                             rising edge.
Arst      Input      1       Reset: active-low reset. When asserted
                             low, the decoder is reset.
En        Input      1       Enable: clock-enable signal.
Tvalid_m  Output     1       Tvalid: indicates that the master is
                             driving a valid transfer. A transfer
                             takes place when both Tvalid and
                             Tready are asserted.
Tready_m  Input      1       Tready: indicates that the slave can
                             accept a transfer in the current
                             cycle.
Tlast_m   Output     1       Tlast: indicates the boundary of a
                             packet.
Tdata_m   Output     64      Tdata: the primary payload, used to
                             carry the data passing across the
                             interface. The width of the data
                             payload is an integer number of bytes.

Table 6.2  Signals associated with the master interface
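
The handshake rule stated in both tables (a transfer takes place
only when Tvalid and Tready are asserted in the same cycle) can be
illustrated with a small cycle-level model. This is a sketch for
explanation only; the function name and the way Tlast is derived
are assumptions, not a description of the implemented interface
logic.

```python
# Sketch: cycle-by-cycle model of the Tvalid/Tready handshake.
def stream_transfers(words, ready_pattern):
    """Return the (Tdata, Tlast) pairs accepted by the slave.

    words         -- payloads the master wants to send, in order
    ready_pattern -- slave Tready value (0 or 1) for each clock cycle
    """
    accepted = []
    pending = list(words)
    for tready in ready_pattern:
        tvalid = 1 if pending else 0          # master has data to send
        if tvalid and tready:                 # transfer on this edge
            tdata = pending.pop(0)
            tlast = 1 if not pending else 0   # last word of the packet
            accepted.append((tdata, tlast))
    return accepted


print(stream_transfers([0xA, 0xB, 0xC], [0, 1, 1, 0, 1, 1]))
```

Cycles in which the slave is not ready simply stall the master; no
word is lost or duplicated.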
 
Tdata_s is the primary input of this work and its width is 32 bits
(see Table 6.1). Each word contains the systematic LLR, the parity
1 and parity 2 LLRs, the number of iterations the decoder must
perform and the code-block length, and is organized as follows:
 
Bits    31-19         18-15       14-10       9-5       4-0
Field   Block length  Iterations  Systematic  Parity 1  Parity 2
                                  LLR         LLR       LLR

Table 6.3  Construction of Tdata_s
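
The bit layout of Table 6.3 can be exercised with a short packing
and unpacking sketch. The field positions follow the table; the
field names, the helper functions and the interpretation of the
three LLR fields as two's-complement numbers are assumptions made
for this example.

```python
# Sketch: packing and unpacking one 32-bit Tdata_s word using the
# bit fields of Table 6.3 (names are illustrative).
FIELDS = [                 # (name, least-significant bit, width)
    ("block_length", 19, 13),
    ("iterations", 15, 4),
    ("systematic", 10, 5),
    ("parity1", 5, 5),
    ("parity2", 0, 5),
]
SIGNED = ("systematic", "parity1", "parity2")   # assumed signed LLRs


def pack(**values):
    """Build the word; each value is masked to its field width."""
    word = 0
    for name, lsb, width in FIELDS:
        word |= (values[name] & ((1 << width) - 1)) << lsb
    return word


def unpack(word):
    """Split the word; LLR fields are read as two's complement."""
    out = {}
    for name, lsb, width in FIELDS:
        value = (word >> lsb) & ((1 << width) - 1)
        if name in SIGNED and value >= 1 << (width - 1):
            value -= 1 << width
        out[name] = value
    return out


word = pack(block_length=6144, iterations=6, systematic=-3,
            parity1=7, parity2=-16)
print(hex(word), unpack(word))
```

The 13-bit block-length field is wide enough for the largest LTE
code block (6144), and the 5-bit LLR fields match the input
quantization reported in Section 6.3.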
 
  
6.2 Verilog Implementation   
 
It has been shown that the max-log BCJR algorithm offers a suitable
trade-off between implementation complexity and decoding
performance. We now investigate how to implement the turbo decoder
on a Field-Programmable Gate Array (FPGA). Verilog is used as the
hardware description language for design entry and behavioral
simulation. A basic Application-Specific Integrated Circuit (ASIC)
/ FPGA design process is depicted in Figure 6.1.
 
 
 
Figure 6.1  ASIC/FPGA Design process.  
 
1) Design entry  
In this step, the system interfaces and functionalities are
defined. The detailed design is captured in Verilog, which provides
useful programming features for structured design techniques. With
these techniques, a complex design can be decomposed into simpler
implementation modules. Each module has its own definition of
functionality and interface.
 
2) Test bench development  
The functionality of the Verilog design must be verified before
proceeding to synthesis. For this purpose a test bench is
developed, also written in Verilog, which provides the design
entity with the stimulus and verifies its outputs.
 
3) Functionality verification
In this step, combinations of inputs (stimulus) are fed into the
design entity and the outputs are verified. Usually the stimulus
and the expected results are generated and saved into files before
the Verilog simulation. The test bench reads in the stimulus, feeds
it into the design entity, obtains the outputs of the design entity
and compares them with the expected outputs. A proper
design-verification program should take into account the
mathematical limitations of a realistic hardware design, including
the finite resolution and limited dynamic range of the data
representation.
  
4) Synthesis  
Synthesis is the process of transforming a design specification
into an implementation, i.e. converting an abstract design
description into a hardware-level description. This process is
performed by synthesis tools, based on a synthesis technology
library provided by the FPGA manufacturer.
 
5) Device mapping 
This process tries to find suitable devices from a library based on
the synthesis result. In this phase, a timing-model generation
program provided by the device vendor or by a third-party
simulation-model supplier can be used to generate an accurate
timing model of the design.
 
6) Timing Simulation 
The timing model generated during device mapping is combined with
the test bench and the verification is performed again. When the
design behaves correctly with the timing model, it is ready to be
manufactured. However, if the design fails with this timing model,
the designer has to go back to the first step, modify the design
and go through all the steps again until the design passes the
timing simulation.
 
In this thesis, a parallel turbo decoder and the corresponding test
bench are implemented in Verilog. Functional verification is
performed by comparing the decoding performance of the Verilog
implementation with a MATLAB simulation. The parallel turbo decoder
for LTE is described at register-transfer level (RTL), and the
design description follows a proper coding style to make it
synthesizable. The Xilinx Vivado tool is used for implementation,
simulation and synthesis, and the test platform is the Zynq-7000
ZC706 evaluation board [20].
 
Synthesis results are presented below:  
 
 
 
 Figure 6.2   Synthesis report  (part  1 of 3)  
 
 
Figure 6.3   Synthesis report  (part  2 of 3)  
 
 
 
       Figure 6.4   Synthesis report  (part  3 of 3)  
  
 
6.3 Error-Rate Performance and key 
characteristics 
 
To achieve good error-rate performance, the input LLRs are
quantized to 5 bits, the extrinsic LLRs to 6 bits, and all state
metrics in the radix-4 ACS units require 10 bits. The turbo decoder
performs 5.5 full iterations to produce the decoded bits.
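
The effect of the stated word lengths can be illustrated with a
simple saturating quantizer. This is a sketch under assumptions:
two's-complement representation, rounding to the nearest integer
and a hypothetical scale parameter; the thesis does not specify the
scaling applied before quantization.

```python
# Sketch: saturating fixed-point quantization of LLR values.
def quantize(llr, bits, scale=1.0):
    """Round llr * scale and saturate to a signed `bits`-bit range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = int(round(llr * scale))
    return max(lo, min(hi, q))


channel_llrs = [-40.2, -3.6, 0.4, 7.3, 22.0]
print([quantize(x, 5) for x in channel_llrs])   # 5-bit input LLRs
print([quantize(x, 6) for x in channel_llrs])   # 6-bit extrinsic LLRs
```

With 5 bits the representable range is -16 to 15, so strongly
reliable inputs are clipped, while the extra bit of the extrinsic
LLRs doubles the available range.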
 
The majority of the chip area is occupied by the BCJR instances.
The maximum measured clock frequency is 300 MHz, at which a
throughput of 200 Mb/s has been measured.
 
 
 
 
 
  
 
Chapter 7 
 
Conclusions 
  
In recent years, high-throughput design and implementation have
become a dominant requirement in the VLSI design of
wireless-communication systems. Data rates of next-generation
wireless communication are rising rapidly, and this will lead to
more complex algorithms and VLSI architectures in the next few
decades. Against this background, this thesis combines a study of
turbo codes with the implementation of a high-throughput parallel
turbo decoder on an FPGA. The work details a parallel turbo decoder
for the 3GPP-LTE standard. The use of radix-4 processing in
combination with 8 parallel SISO decoders is of paramount
importance for achieving high throughput and an area-efficient
design.
 
 
7.1 Future Work 
 
As future work, the proposed VLSI architecture of the
high-throughput parallel turbo decoder can be redesigned into a
more area-efficient architecture. Similarly, power-reduction
techniques could be incorporated to obtain a high-throughput
architecture for low-power applications. Possible extensions of
this project include the following:
 
• Windowing
To significantly reduce the large memory requirements,
windowing can be employed. In this approach the trellis is
processed in small windows.
 
• Early termination
Decoders for turbo codes are iterative in nature. There are
techniques that can be used to reduce the average number of
iterations. Some stopping rules are based on comparing a
metric of the bit reliabilities (soft bit decisions) with a
threshold. If the metric is smaller than the threshold, the
decoder continues with a new iteration; otherwise, it stops.
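
The early-termination rule described above can be sketched as
follows. The choice of metric (the smallest LLR magnitude in the
block), the function names and the stand-in for a decoder iteration
are assumptions made for illustration; other reliability metrics
fit the same structure.

```python
# Sketch: threshold-based stopping rule for iterative decoding.
def should_stop(llrs, threshold):
    """Stop when even the least reliable bit reaches the threshold."""
    metric = min(abs(x) for x in llrs)       # reliability metric
    return metric >= threshold


def decode_with_early_stop(iterate, llrs, max_iters, threshold):
    """Run up to max_iters iterations; return (llrs, iterations used)."""
    for it in range(1, max_iters + 1):
        llrs = iterate(llrs)
        if should_stop(llrs, threshold):
            return llrs, it
    return llrs, max_iters


def grow(llrs):
    """Stand-in for a real decoder iteration: reliabilities double."""
    return [2 * x for x in llrs]


print(decode_with_early_stop(grow, [1, -2, 3], max_iters=6, threshold=10))
```

In this toy run the decoder stops after four of the six allowed
iterations, which is the kind of saving in average iteration count
that early termination aims for.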
 
  
 
Bibliography 
 
 
[1]  C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon
limit error-correcting coding and decoding: Turbo codes,” in Proc.
Int. Conf. Communication, May 1993, pp. 1064-1070.

[2]  C. Berrou, A. Glavieux, and P. Thitimajshima, “Near optimum
error correcting coding and decoding: Turbo-codes,” IEEE Trans.
Commun., vol. 44, no. 10, pp. 1261-1271, 1996.

[3]  3rd Generation Partnership Project; Technical Specification
Group Radio Access Network; Evolved Universal Terrestrial Radio
Access (EUTRA); Multiplexing and channel coding (Release 9), 3GPP
Organizational Partners TS 36.212, Rev. 8.3.0, May 2008.

[4]  C. E. Shannon and W. Weaver, The Mathematical Theory of
Communication. Urbana, IL: Univ. Illinois Press, 1949.

[5]  C. Studer, C. Benkeser, S. Belfanti, and Q. Huang, “Design and
implementation of a parallel turbo-decoder ASIC for 3GPP-LTE,” IEEE
Journal of Solid-State Circuits, vol. 46, no. 1, pp. 8–17, Jan.
2011.

[6]  H. Zarrinkoub, Understanding LTE with MATLAB: From
Mathematical Modeling to Simulation and Prototyping. United
Kingdom: John Wiley & Sons Ltd, 2014.

[7]  L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding
of linear codes for minimizing symbol error rate,” IEEE Trans. Inf.
Th., vol. 20, no. 2, pp. 284–287, Mar. 1974.

[8]  A. J. Viterbi, “Error bounds for convolutional codes and an
asymptotically optimum decoding algorithm,” IEEE Trans. Inf. Th.,
vol. 13, no. 2, pp. 260–269, Apr. 1967.

[9]  J. P. Woodard and L. Hanzo, “Comparative study of turbo
decoding techniques: an overview,” IEEE Trans. Vehicular Tech.,
vol. 49, no. 6, pp. 2208–2233, Nov. 2000.

[10] S. Vafi and T. Wysocki, “Performance of convolutional
interleavers with different spacing parameters in turbo codes,”
Proceedings of Australian Communication Theory Workshop, pp. 8-12,
2005.

[11] S. Lee, C. Wang and W. Sheen, “Architecture Design of QPP
Interleaver for Parallel Turbo Decoding,” Proceedings of IEEE
Vehicular Technology Conference (VTC), pp. 1-5, 2010.
 
 
[12] A. Nimbalker, Y. Blankenship, B. Classon, and T. K.
Blankenship, “ARP and QPP interleavers for LTE turbo coding,” in
Proc. IEEE WCNC, Las Vegas, NV, USA, Mar. 2008, pp. 1032–1037.

[13] C-C. Wong and H-C. Chang, “Reconfigurable Turbo Decoder With
Parallel Architecture for 3GPP LTE System,” IEEE Transactions on
Circuits and Systems II: Express Briefs, vol. 57, pp. 566-570, July
2010.

[14] C-C. Wong, M-W. Lai, C-C. Lin, H-C. Chang and C-Y. Lee, “Turbo
Decoder Using Contention-Free Interleaver and Parallel
Architecture,” IEEE Journal of Solid-State Circuits, vol. 45, no.
2, pp. 422-432, February 2010.

[15] S. M. Karim and I. Chakrabarti, “High Throughput Turbo Decoder
Using Pipelined Parallel Architecture and Collision Free
Interleaver,” IET Communications, vol. 6, pp. 1416-1424, 2012.

[16] Y. Sun and J. R. Cavallaro, “Efficient Hardware Implementation
of a Highly-Parallel 3GPP LTE/LTE-Advance Turbo Decoder,”
INTEGRATION, the VLSI Journal, vol. 44, pp. 305-315, 2011.

[17] O. Y. Takeshita, “On maximum contention-free interleavers and
permutation polynomials over integer rings,” IEEE Trans. Inf. Th.,
vol. 52, no. 3, pp. 1249–1253, Mar. 2006.

[18] J. Sun and O. Y. Takeshita, “Interleavers for turbo codes
using permutation polynomials over integer rings,” IEEE Trans. Inf.
Th., vol. 51, no. 1, pp. 101–119, Jan. 2005.

[19] C. B. Shung, P. H. Siegel, G. Ungerboeck, and H. K. Thapar,
“VLSI architectures for metric normalization in the Viterbi
algorithm,” in Proc. IEEE ICC, vol. 4, Atlanta, GA, USA, Apr. 1990,
pp. 1723–1728.

[20] ZC706 Evaluation Board for the Zynq-7000 XC7Z045 All
Programmable SoC User Guide.
  
 
