Efficient hardware implementation of a highly-parallel 3GPP LTE/LTE-advance turbo decoder by Sun, Yang & Cavallaro, Joseph R.
INTEGRATION, the VLSI journal 44 (2011) 305–315
doi:10.1016/j.vlsi.2010.07.001
journal homepage: www.elsevier.com/locate/vlsi

Yang Sun, Joseph R. Cavallaro
Department of Electrical and Computer Engineering, Rice University, 6100 Main Street, Houston, TX 77005, USA

Article info
Available online 17 July 2010
Keywords:
QPP interleaver
Quadratic permutation polynomial
Turbo decoder
MAP decoder
VLSI
ASIC
3GPP LTE

E-mail addresses: ysun@rice.edu (Y. Sun), cavallar@rice.edu (J.R. Cavallaro).

Abstract
We present an efﬁcient VLSI architecture for 3GPP LTE/LTE-Advance Turbo decoder by utilizing the
algebraic-geometric properties of the quadratic permutation polynomial (QPP) interleaver. The high-
throughput 3GPP LTE/LTE-Advance Turbo codes require a highly-parallel decoder architecture. Turbo
interleaver is known to be the main obstacle to the decoder parallelism due to the collisions it
introduces in accesses to memory. The QPP interleaver solves the memory contention issues when
several MAP decoders are used in parallel to improve Turbo decoding throughput. In this paper, we
propose a low-complexity QPP interleaving address generator and a multi-bank memory architecture to
enable parallel Turbo decoding. Design trade-offs in terms of area and throughput efﬁciency are
explored to find the optimal architecture. The proposed parallel Turbo decoder has been synthesized,
placed and routed in a 65-nm CMOS technology with a core area of 8.3 mm² and a maximum clock
frequency of 400 MHz. This parallel decoder, comprising 64 MAP decoder cores, can achieve a
maximum decoding throughput of 1.28 Gbps at 6 iterations.
© 2010 Elsevier B.V. All rights reserved.

1. Introduction
3GPP Long Term Evolution (LTE) [1], which is a set of
enhancements to the 3G Universal Mobile Telecommunications
System (UMTS) [2], has received tremendous attention recently
and is considered to be a very promising 4G wireless technology.
For example, Verizon Wireless has decided to deploy LTE in their
next generation 4G evolution. One of the main advantages of
3GPP LTE is high throughput. For example, it provides a peak data
rate of 326.4 Mbps for a 4×4 antenna system, and 172.8 Mbps for
a 2×2 antenna system for every 20 MHz of spectrum. Furthermore,
LTE-Advance [3], the further evolution of LTE, promises to
provide up to 1 Gbps peak data rate.
The channel coding scheme for LTE is Turbo coding [4]. The Turbo
decoder is typically one of the major blocks in an LTE wireless receiver.
Turbo decoders suffer from high decoding latency due to the iterative
decoding process, the forward–backward recursion in the maximum
a posteriori (MAP) decoding algorithm and the interleaving/de-
interleaving between iterations [5–7]. Generally, the task of an
interleaver is to permute the soft values generated by the MAP
decoder and write them into random or pseudo-random positions.
A high throughput Turbo decoder can be realized by parallelizing
several MAP decoders, where each MAP decoder operates on a
segment of the received codeword [8]. Due to the randomness of
the Turbo interleaver, two or more MAP decoders may access the
same memory in the same clock cycle, which leads to a memory
collision. As a result, the decoder has to be stalled, which
delays the decoding process. The interleaver structures
in the current 3G standards, such as CDMA/W-CDMA/UMTS,
do not have a parallel structure. Although the memory stalls caused
by the interleaver can be partially reduced by using write buffers
[9], the memory stalls become more and more frequent as
the parallelism degree increases. To solve this problem, the high
data rate 3GPP LTE standard has adopted a contention-free, parallel
interleaver which is called quadratic permutation polynomial
(QPP) Turbo interleaver [4]. From an algebraic-geometric perspec-
tive, the QPP interleaver allows analytical designs and simpliﬁes
hardware implementation of a parallel Turbo decoder [10]. Based
on the permutation polynomials over integer rings, every factor of
the interleaver length can serve as a contention-free parallelism
degree for the decoder [10].
In the literature, many decoder architectures have been
extensively investigated for 3G or 3G-like Turbo codes [11–18].
Recently, several high speed Turbo decoders have been developed
for 3GPP LTE standard [19–22]. As a 4G candidate system, the 3GPP
LTE-Advance system is pushing for 1Gbps data rate. Thus, it is very
important and challenging to design a Turbo decoder to support
such a high data rate. In this paper, we propose an efﬁcient
hardware architecture for 3GPP LTE/LTE-Advance Turbo decoder. A
low-complexity circuit is designed to generate the QPP interleaving
addresses on the ﬂy. By utilizing the QPP contention-free property,
memory systems are partitioned into multiple banks to allow
concurrent accesses by multiple MAP decoders. More than 1Gbps
data rate is feasible with the proposed decoding scheme.
The rest of this paper is organized as follows. Section 2 reviews the fundamentals of Turbo codes. Section 3 describes the basic structure of the QPP interleaver and several of its algebraic properties. Then we propose an online address generator for the QPP interleaver. In Section 4, two types of low-latency MAP decoder architectures are introduced and compared. By employing multiple MAP decoder cores and multiple QPP interleavers, we present a parallel Turbo decoder architecture in Section 5. Then the VLSI implementation results are summarized and compared with existing Turbo decoders.

Fig. 2. Basic structure of an iterative Turbo decoder. (a) Iterative decoding based on MAP decoders. (b) Forward/backward recursions on the trellis diagram.

2. Fundamentals of turbo codes
In order to explain the proposed parallel Turbo decoder
architecture, the fundamentals of Turbo codes are brieﬂy
described in this section.
2.1. Turbo encoder structure
As shown in Fig. 1, the Turbo encoding scheme in the LTE standard
is a parallel concatenated convolutional code with two 8-state
constituent encoders and one quadratic permutation polynomial
(QPP) interleaver [4]. The function of the QPP interleaver is to take a
block of N-bit data and produce a permutation of the input data block.
From the coding theory perspective, the performance of a Turbo code
depends critically on the interleaver structure [23]. The basic LTE
Turbo coding rate is 1/3. It encodes an N-bit information data block
into a codeword with 3N+12 data bits, where 12 tail bits are used for
trellis termination. The initial value of the shift registers of the 8-state
constituent encoders shall be all zeros when starting to encode the
input information bits. LTE has defined 188 different block sizes,
40 ≤ N ≤ 6144.
2.2. Turbo decoder structure
The basic structure of a Turbo decoder is functionally
illustrated in Fig. 2. A Turbo decoder consists of two maximum
a posteriori (MAP) decoders [24,25] separated by an interleaver
that permutes the input sequence. The decoding is an iterative
process in which the so-called extrinsic information is exchanged
between MAP decoders. Each Turbo iteration is divided into two half iterations. During the first half iteration, MAP decoder 1 is enabled. It receives the soft channel information (soft value $L_s$ for the systematic bit and soft value $L_{p1}$ for the parity bit) and the a priori information $L_a^1$ from the other constituent MAP decoder through deinterleaving ($\Pi^{-1}$) to generate the extrinsic information $L_e^1$ at its output. Likewise, during the second half iteration, MAP decoder 2 is enabled, and it receives the soft channel information (soft value $L_s$ for a permuted version of the systematic bit and soft value $L_{p2}$ for the parity bit) and the a priori information $L_a^2$ from MAP decoder 1 through interleaving ($\Pi$) to generate the extrinsic information $L_e^2$ at its output. This iterative process repeats until the decoding has converged or the maximum number of iterations has been reached.

Fig. 1. Structure of rate 1/3 Turbo encoder in LTE.
The MAP algorithm at each constituent MAP decoder computes the log-likelihood ratios (LLRs) of the a posteriori probabilities (APPs) for information bit $u_k$ as follows [7,26]:

$$\mathrm{LLR}(\hat{u}_k) = \max^{*}_{u:\,u_k=1}\{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1},s_k) + \beta_k(s_k)\} - \max^{*}_{u:\,u_k=0}\{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1},s_k) + \beta_k(s_k)\}, \tag{1}$$

where $\alpha_k$ and $\beta_k$ denote the forward and backward state metrics, and are recursively computed as follows:

$$\alpha_k(s_k) = \max^{*}_{s_{k-1}}\{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1},s_k)\}, \tag{2}$$

$$\beta_k(s_k) = \max^{*}_{s_{k+1}}\{\beta_{k+1}(s_{k+1}) + \gamma_k(s_k,s_{k+1})\}. \tag{3}$$

The $\gamma_k$ term above is the branch transition probability that depends on the trellis diagram, and is usually referred to as the branch metric. The max-star operator employed in the above descriptions is defined as follows [26]:

$$\max^{*}(a,b) = \log(e^{a} + e^{b}) = \max(a,b) + \log(1 + e^{-|a-b|}). \tag{4}$$
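As an illustration of (2) and (4), the following minimal Python sketch implements the max-star operator and one step of the forward recursion. The function names, the dictionary-based trellis description, and the 2-state toy trellis with its metrics are assumptions made for this example only; they are not the 8-state LTE trellis and not part of the proposed hardware.

```python
import math


def max_star(a: float, b: float) -> float:
    """Jacobian logarithm of Eq. (4): log(exp(a) + exp(b))."""
    # max(a, b) plus a correction term that is at most log(2)
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def forward_step(alpha_prev, gamma):
    """One step of the forward recursion in Eq. (2).

    alpha_prev: list with one metric per state s_{k-1}
    gamma: dict mapping a branch (s_prev, s) to its branch metric;
           only branches present in the dict are treated as valid
    """
    alpha = {}
    for (s_prev, s), g in gamma.items():
        candidate = alpha_prev[s_prev] + g
        alpha[s] = candidate if s not in alpha else max_star(alpha[s], candidate)
    return [alpha[s] for s in sorted(alpha)]


# Hypothetical 2-state, fully connected toy trellis with made-up metrics.
toy_gamma = {(0, 0): 0.5, (1, 0): -0.5, (0, 1): -0.5, (1, 1): 0.5}
alpha_next = forward_step([0.0, 0.0], toy_gamma)
```

The backward recursion (3) has the same shape with the roles of the previous and next states exchanged.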
3. QPP interleaver
Interleaving/deinterleaving of extrinsic information is a key
issue that needs to be addressed to enable parallel decoding
because memory access contention may occur when MAP
decoders fetch/write extrinsic information from/to memory. The
QPP interleaver deﬁned in the new 3GPP LTE standard differs from
previous 3G interleavers in that it is based on algebraic
constructions via permutation polynomials over integer rings. It
is known that permutation polynomials generate contention-free
interleavers [27,10], i.e. every factor of the interleaver length
becomes a possible parallelism degree.
Fig. 3. An example of the contention-free interleaving, where a data block is divided into P = 4 segments (SEG 0–SEG 3) with equal length of L = N/P. The contention-free property requires that for a fixed offset x at each segment, the segment indices for the interleaving addresses ⌊f(x + iL)/L⌋ (0 ≤ i ≤ P−1) are unique so that they can be physically mapped to different memory modules (MEM 0–MEM 3).

Table 1
QPP interleaver parallelism.

N       f(x)              Parallelism (factors of N)
40      10x^2 + 3x        1, 2, 4, 5, 8, 10, 20
48      12x^2 + 7x        1, 2, 3, 4, 6, 8, 12, 16, 24
64      42x^2 + 19x       1, 2, 4, 8, 16, 32
...     ...               ...
6016    94x^2 + 23x       1, 2, 4, 8, 16, 32, 47, 64
6080    190x^2 + 47x      1, 2, 4, 5, 8, 10, 16, 19, 20, 32, 38, 40, 64
6144    480x^2 + 263x     1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64

3.1. Algebraic description of QPP interleaver

The QPP interleaver can be expressed via a simple mathematical formula. Given an information block length N, the x-th interleaving output position is specified by the quadratic expression [4]:

$$f(x) = (f_2 x^2 + f_1 x) \bmod N, \tag{5}$$

where parameters f1 and f2 are integers and depend on the block size N (0 ≤ x, f1, f2 < N). For each block size, a different set of parameters f1 and f2 is defined. In LTE, all the block sizes are even numbers and are divisible by 4 and 8. Moreover, the block size N is always divisible by 16, 32, and 64 when N ≥ 512, 1024, and 2048, respectively. By definition, parameter f1 is always an odd number whereas f2 is always an even number. Through further inspection, we can list the following algebraic properties for the QPP interleaver.
QPP interleaver algebraic property 1: f(x) has the same even/odd parity as x:

$$f(2k) \bmod 2 = 0, \qquad f(2k+1) \bmod 2 = 1.$$

QPP interleaver algebraic property 2: The remainders of f(x)/4, f(x+1)/4, f(x+2)/4, and f(x+3)/4 are unique:

$$\begin{aligned}
f(4k) \bmod 4 &= 0,\\
f(4k+1) \bmod 4 &= \begin{cases} 1 & \text{when } (f_1+f_2) \bmod 4 = 1\\ 3 & \text{when } (f_1+f_2) \bmod 4 = 3, \end{cases}\\
f(4k+2) \bmod 4 &= 2,\\
f(4k+3) \bmod 4 &= \begin{cases} 3 & \text{when } (f_1+f_2) \bmod 4 = 1\\ 1 & \text{when } (f_1+f_2) \bmod 4 = 3. \end{cases}
\end{aligned}$$

QPP interleaver algebraic property 3:

$$f(x) \bmod n = f(x+m) \bmod n, \quad \forall m : m \bmod n = 0.$$

Property 1 can be easily verified since parameter f2 is always even and parameter f1 is always odd by definition. Property 2 can be shown through the following equations:

$$\begin{aligned}
f(4k) &= 4(4 f_2 k^2 + f_1 k),\\
f(4k+1) &= 4(4 f_2 k^2 + 2 f_2 k + f_1 k) + f_2 + f_1,\\
f(4k+2) &= 4(4 f_2 k^2 + 4 f_2 k + f_1 k + f_2) + 2 f_1,\\
f(4k+3) &= 4(4 f_2 k^2 + 6 f_2 k + f_1 k + 2 f_2) + f_2 + 3 f_1.
\end{aligned}$$

Property 3 can be verified by

$$f(x+m) = f(x) + m(2 f_2 x + f_2 m + f_1).$$

We will explain later that these algebraic properties are very useful in designing memory systems for a parallel Turbo decoder.
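The three properties can also be checked numerically. The sketch below is a brute-force check over three (N, f1, f2) rows read off Table 1; the function names and the choice of modulus q = 8 for property 3 are assumptions made for this example (q is assumed to divide N, which holds for these block sizes).

```python
def qpp(x, n, f1, f2):
    """f(x) = (f2*x^2 + f1*x) mod N, Eq. (5)."""
    return (f2 * x * x + f1 * x) % n


# (N, f1, f2) read off three rows of Table 1.
TABLE1_ROWS = [(40, 3, 10), (6016, 23, 94), (6144, 263, 480)]


def check_properties(n, f1, f2, q=8):
    """Return (property1, property2, property3) as booleans for one block size."""
    values = [qpp(x, n, f1, f2) for x in range(n)]
    # Property 1: f(x) has the same parity as x.
    prop1 = all(values[x] % 2 == x % 2 for x in range(n))
    # Property 2: residues mod 4 follow the pattern 0, r, 2, 4 - r.
    r = (f1 + f2) % 4  # 1 or 3, since f1 is odd and f2 is even
    expected = {0: 0, 1: r, 2: 2, 3: 4 - r}
    prop2 = all(values[x] % 4 == expected[x % 4] for x in range(n))
    # Property 3 for modulus q and shifts m = q, 2q, 3q.
    prop3 = all(values[x] % q == qpp(x + m, n, f1, f2) % q
                for x in range(n) for m in (q, 2 * q, 3 * q))
    return prop1, prop2, prop3


results = {n: check_properties(n, f1, f2) for n, f1, f2 in TABLE1_ROWS}
```

A brute-force check of this kind does not replace the algebraic argument above, but it is a cheap sanity test when entering the (f1, f2) table into a design.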
3.2. QPP contention-free property
In general, a Turbo interleaver/de-interleaver f(x) is said to be contention-free for a window size of L if and only if it satisfies the following constraint [10,28,29]:

$$\left\lfloor \frac{f(x+iL)}{L} \right\rfloor \neq \left\lfloor \frac{f(x+jL)}{L} \right\rfloor, \tag{6}$$

where 0 ≤ x < L, 0 ≤ i, j < P (= N/L), and i ≠ j. The terms in (6) are essentially the memory indices that are concurrently accessed by the P MAP decoder cores. If these memory indices are unique during each read and write operation, then there are no contentions in memory accesses. Fig. 3 shows an example of the contention-free memory access scheme.

It has been shown in [27,10] that every factor of the interleaver length N becomes a possible interleaver parallelism that satisfies the contention-free requirement in (6). Table 1 summarizes the parallelism degrees (up to 64) for some of the LTE QPP interleavers.
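Constraint (6) can be tested exhaustively for a given block size and parallelism degree. The following sketch does this for two rows of Table 1; the function names are assumptions made for this example, and the memory index is simply the integer quotient f(x + iL) / L as in (6).

```python
def qpp(x, n, f1, f2):
    """f(x) = (f2*x^2 + f1*x) mod N, Eq. (5)."""
    return (f2 * x * x + f1 * x) % n


def bank_indices(n, f1, f2, p, x):
    """Memory indices floor(f(x + i*L) / L) hit by the P decoders at offset x."""
    seg_len = n // p
    return [qpp(x + i * seg_len, n, f1, f2) // seg_len for i in range(p)]


def is_contention_free(n, f1, f2, p):
    """Check Eq. (6) for every offset 0 <= x < L, where L = N / P."""
    if n % p != 0:
        raise ValueError("P must be a factor of N")
    seg_len = n // p
    return all(len(set(bank_indices(n, f1, f2, p, x))) == p
               for x in range(seg_len))


# N = 40 with P = 4 (L = 10): the four decoders hit four different memories.
example = bank_indices(40, 3, 10, 4, 0)

# N = 6144: every parallelism degree listed in Table 1.
degrees_6144 = [1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64]
all_ok = all(is_contention_free(6144, 263, 480, p) for p in degrees_6144)
```

For N = 40, P = 4 and x = 0, the addresses f(0), f(10), f(20), f(30) are 0, 30, 20, 10, so the memory indices are 0, 3, 2, 1.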
3.3. Hardware implementation of QPP interleaver
Based on the algebraic analysis in [27], the QPP interleaver is guaranteed to always generate a unique address, which greatly simplifies the hardware implementation. In MAP trellis decoding, the QPP interleaving addresses are usually generated in consecutive order (with a step size of d). By taking advantage of this fact, the QPP interleaving address can be computed in a recursive manner. Suppose the interleaver starts at x0; we first pre-compute f(x0) as

$$f(x_0) = (f_2 x_0^2 + f_1 x_0) \bmod N. \tag{7}$$

In the following cycles, as x is incremented by d, f(x+d) is computed recursively as follows:

$$f(x+d) = (f_2 (x+d)^2 + f_1 (x+d)) \bmod N \tag{8}$$

$$\phantom{f(x+d)} = (f(x) + g(x)) \bmod N, \tag{9}$$

where g(x) is defined as

$$g(x) = (2 d f_2 x + d^2 f_2 + d f_1) \bmod N. \tag{10}$$

Note that g(x) can also be computed in a recursive manner:

$$g(x+d) = (g(x) + 2 d^2 f_2) \bmod N \tag{11}$$

$$\phantom{g(x+d)} = (g(x) + (2 d^2 f_2 \bmod N)) \bmod N. \tag{12}$$

The initial value g(x0) needs to be pre-computed as

$$g(x_0) = (2 d f_2 x_0 + d^2 f_2 + d f_1) \bmod N. \tag{13}$$
The modulo operations in (9) and (12) can be difficult to implement in hardware if the operands are not known in advance. However, by definition both f(x) and g(x) are less than N, so each sum is less than 2N and (9) and (12) can be realized with an addition followed by a conditional subtraction of N. In the proposed method, three numbers need to be pre-computed: (2d²f2) mod N, f(x0), and g(x0). Fig. 4 shows a hardware architecture to compute the interleaving address f(x), where x starts from x0 and is incremented by d on every clock cycle. For example, by setting d to 1, this circuit can generate interleaving addresses at each step of 1. If n consecutive interleaving addresses are required at each clock cycle, this circuit can be replicated n times with n different initial values: x0, x0+1, …, and x0+n−1.

Fig. 4. Forward QPP address generator circuit diagram, step size = d.

Fig. 5. Backward QPP address generator circuit diagram, step size = d.
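The forward recursion (7)–(13) can be modelled in a few lines of software. The sketch below is a behavioural model, not a description of the circuit in Fig. 4: the function names are assumptions, and the `if value >= n` tests stand in for the subtract-and-select step that replaces the modulo operation.

```python
def qpp(x, n, f1, f2):
    """Direct evaluation of Eq. (5), used here only as a reference."""
    return (f2 * x * x + f1 * x) % n


def forward_addresses(n, f1, f2, x0, d, count):
    """Return f(x0), f(x0 + d), ... using one add and one conditional
    subtract per value and per step, following Eqs. (9) and (12)."""
    step = (2 * d * d * f2) % n                      # pre-computed constant
    f = qpp(x0, n, f1, f2)                           # pre-computed f(x0), Eq. (7)
    g = (2 * d * f2 * x0 + d * d * f2 + d * f1) % n  # pre-computed g(x0), Eq. (13)
    out = []
    for _ in range(count):
        out.append(f)
        f += g                # Eq. (9); f < N and g < N, so f + g < 2N
        if f >= n:
            f -= n
        g += step             # Eq. (12)
        if g >= n:
            g -= n
    return out


# N = 40, f(x) = 10x^2 + 3x (Table 1), step size d = 1.
first_four = forward_addresses(40, 3, 10, 0, 1, 4)
```

With these parameters the first four generated addresses are 0, 13, 6 and 19, the same values as direct evaluation of (5).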
The circuit in Fig. 4 can also generate interleaving addresses in descending order by setting d to a negative number, e.g. d = −1, but g(x0) then needs to be recomputed for the negative d. To be able to generate both forward and backward addresses using the same f(x) and g(x) functions, we now describe a method to generate the QPP interleaving address in descending order. By substituting x with x−d in (9) and rearranging, we get

$$f(x-d) = (f(x) - g(x-d)) \bmod N. \tag{14}$$

Similarly, by substituting x with x−d in (12) and rearranging, we get

$$g(x-d) = (g(x) - (2 d^2 f_2 \bmod N)) \bmod N. \tag{15}$$

Based on (14) and (15), Fig. 5 shows a hardware architecture to compute the QPP address f(x) in descending order (backward generation), where x starts from x0 and is decremented by d on every clock cycle. The three pre-computed values are the same as those in the forward QPP address generator (cf. Fig. 4).
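A behavioural model of the backward recursion (14)–(15) is sketched below. As with any software model of this kind, the function names are assumptions and the `if value < 0` tests stand in for the add-back of N in hardware. Note the order of the updates: g is stepped back first, because (14) uses g(x−d) rather than g(x).

```python
def qpp(x, n, f1, f2):
    """Direct evaluation of Eq. (5), used here only as a reference."""
    return (f2 * x * x + f1 * x) % n


def backward_addresses(n, f1, f2, x0, d, count):
    """Return f(x0), f(x0 - d), ... following Eqs. (14) and (15)."""
    step = (2 * d * d * f2) % n                      # same constant as forward
    f = qpp(x0, n, f1, f2)                           # same f(x0) as forward
    g = (2 * d * f2 * x0 + d * d * f2 + d * f1) % n  # same g(x0) as forward
    out = []
    for _ in range(count):
        out.append(f)
        g -= step             # Eq. (15): g(x - d)
        if g < 0:
            g += n
        f -= g                # Eq. (14): f(x - d) = f(x) - g(x - d)
        if f < 0:
            f += n
    return out


# N = 40, f(x) = 10x^2 + 3x (Table 1), starting at x0 = 39 with d = 1.
first_two = backward_addresses(40, 3, 10, 39, 1, 2)
```

For N = 40 and x0 = 39 this yields f(39) = 7 followed by f(38) = 34.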
As can be seen from Figs. 4 and 5, the proposed QPP interleaver pattern generator consumes very few resources. The complexity of this circuit is an order of magnitude smaller than that of the previous 3G interleavers. For example, a circuit with about 30K gate count is reported in [30] to generate the interleaving addresses for Turbo codes in the previous 3G standard (3GPP Release-4), and a UMTS hardware interleaver with 10.5K gate count is presented in [31].

4. MAP decoder architecture for LTE turbo codes

MAP decoder architectures have been studied by many researchers [24,25,32–35]. Several factors, such as interleaver structure and sliding window scheme, must be considered when choosing an appropriate MAP decoder for LTE Turbo decoding. In this section we modify two low-latency MAP decoder architectures and propose a low-complexity QPP interleaving address generator that operates at full speed with the MAP decoder.
Due to the double recursion in the MAP decoding algorithm
[7], the MAP decoder suffers from high decoding latency. To
reduce the decoding latency, the sliding window algorithm is
often used [36]. However, the problem of the sliding window
approach is the unknown backward (or forward) state metrics
which are required in the beginning of the backward (or forward)
recursion. We refer to the state metrics at sliding window length
distance as stakes. These stakes can be estimated by using a
training calculation [36], which will result in an additional
decoding delay depending on the training length. For LTE Turbo
codes, we do not recommend this traditional sliding window
method when the Turbo coding rate is high. Because many parity
bits will be removed after the base Turbo code is punctured to a
higher code rate, the training length has to be increased to
accurately estimate the state metrics at those stakes which
consequently delays the decoding process.
For LTE Turbo decoding, we suggest using a low-latency
decoding method, referred to as the state metric propagation (SMP)
method, where the state metrics at stakes are initialized with
stakes from the previous iteration [37]. In the very first iteration,
uniform state metrics can be used for initialization. This method
avoids the training calculation by propagating the state metrics to
the next iteration. This method is especially useful when the
Turbo coding rate is high. Based on our simulation results,
the performance degradation caused by the window truncation in
the SMP method is smaller than that in the traditional training
based sliding window method in the case of high Turbo code rate.
To compare the decoding performance of these two sliding window algorithms for high-rate LTE Turbo codes, we perform floating-point simulations using BPSK modulation over an AWGN channel. The LTE rate matching algorithm [4] is used for code puncturing. Fig. 6 shows the floating-point simulation result for a rate-0.95 Turbo code. Because of the high code rate, the maximum number of iterations is set to 10. In the figure, we show the block error rate (BLER) curves for the SMP based sliding window algorithm and the traditional training based sliding window algorithm. In the traditional training algorithm, we assume the training length is equal to the window length. As can be seen, the BLER performance of the SMP algorithm with window length W = 64 is better than that of the training algorithm with window length W = 64, and is close to that of the training algorithm with W = 96. The SMP algorithm with W = 96 and the training algorithm with W = 128 perform close to the optimal case when there is no window effect. Because of the good decoding performance and low decoding delay, we adopted the SMP algorithm in our Turbo decoder design.
The SMP based sliding window (SW) MAP algorithm (SW-MAP) has a window overhead of W (cf. Fig. 7(a)), which leads to additional decoding delay. To eliminate this window overhead, we also consider a non-sliding window (NSW) based MAP algorithm (NSW-MAP), which is shown in Fig. 7(b). To be more general, we consider the case of decoding a segment of the code block where the segment length is L = N/P. In the SW algorithm, a sliding window is applied to the backward recursion where the stakes are initialized from the previous Turbo iteration. If the window length is W, (L/W) × 2 stakes need to be saved (note that MAP 1 can only be initialized with stakes from MAP 1, not from MAP 2, resulting in twice the amount of stake memory). In the NSW algorithm, no sliding window is applied to the backward recursions, so only the stakes at the end of the recursion need to be saved. It should be noted that the memory bandwidth of the NSW-MAP algorithm is higher than that of the SW-MAP algorithm since two LLRs are read and two LLRs are written in one cycle. When the decoder parallelism is high, i.e. P is large, the NSW-MAP algorithm has a throughput advantage over the SW-MAP algorithm. There are many other varieties of the MAP algorithm; see [38] for a thorough analysis of MAP decoder architectures. In this paper, we primarily focus on these two simple but effective MAP algorithms, and we will present QPP interleaving address generator architectures for them.

Fig. 6. Simulation result for a rate of 0.95 LTE Turbo code using two different sliding window algorithms. (Plot of block error rate (BLER) versus Eb/N0 (dB); curves: Training W = 64, SMP W = 64, Training W = 96, SMP W = 96, Training W = 128, No window.)

Fig. 7. Two recommended MAP decoding algorithms for LTE Turbo codes. (a) SW-MAP decoding algorithm. (b) NSW-MAP decoding algorithm.

Fig. 8. SW-MAP decoder architecture.

Fig. 9. (a) An example of the interleaver addressing scheme for the SW-MAP decoder, where W = 4, x0 = 0 (read index x: 0 1 2 3 4 5 6 7; write index y: 3 2 1 0 7 6 5 4). (b) Architecture for generating QPP interleaving read/write addresses.

4.1. QPP interleaving address generator for SW-MAP decoder
Fig. 8 shows the recommended SW-MAP decoder architecture. The SW-MAP decoder requires one set of α unit, β unit, branch unit, and LLRC unit because of the single-flow structure. It employs fully parallel add-compare-select-add (ACSA) [39] units to calculate the state metrics in the α and β recursion processes. An SMP buffer is used to save the stakes for use in the next Turbo iteration. In the SW algorithm, the channel LLRs (systematic $L_s$ and parity $L_p$) are loaded from the symbol memory in sequential order. The a priori information LLR(in) is loaded from the LLR memory in sequential order for the first half iteration, and in interleaved order for the second half iteration. The soft information LLR(out) is written to the LLR memory in backward sequential order during the first half iteration, and in backward interleaved order during the second half iteration. To avoid loading interleaved systematic LLRs from the symbol memory during the second half iteration, we have modified the MAP algorithm to combine the systematic LLR with the extrinsic LLR in the first half iteration.
In this algorithm, the interleaving addresses must be generated during the second half iteration to provide read and write addresses to the LLR memory. In the SW algorithm, the read operation is in the forward direction, whereas the write operation is in the backward direction and always lags behind the read operation. Fig. 9(a) shows an example of the addressing scheme for W = 4 and x0 = 0. Fig. 9(b) shows a hardware architecture for generating interleaving read/write addresses by using one forward QPP generator (cf. Fig. 4) and one last-in first-out (LIFO) buffer.
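The ordering of the read and write streams in this scheme can be reproduced with a short model. The sketch below only models the order of the addresses, not their timing: in hardware the write-back of one window overlaps with the reads of the next, whereas here each window is drained before the next one starts. The function names are assumptions, and direct evaluation of f(x) stands in for the forward QPP generator.

```python
def qpp(x, n, f1, f2):
    """f(x) = (f2*x^2 + f1*x) mod N, Eq. (5)."""
    return (f2 * x * x + f1 * x) % n


def sw_map_address_streams(n, f1, f2, x0, window, num_windows):
    """Return (reads, writes) as lists of (index, address) pairs.

    Reads run forward; each window of read addresses is pushed onto a
    LIFO and popped again to obtain the backward write addresses.
    """
    reads, writes = [], []
    for w in range(num_windows):
        lifo = []
        start = x0 + w * window
        for x in range(start, start + window):
            entry = (x, qpp(x, n, f1, f2))
            reads.append(entry)
            lifo.append(entry)       # push in forward order
        while lifo:
            writes.append(lifo.pop())  # pop in backward order
    return reads, writes


# W = 4, x0 = 0, two windows, with N = 40 and f(x) = 10x^2 + 3x from Table 1.
reads, writes = sw_map_address_streams(40, 3, 10, 0, 4, 2)
read_index = [x for x, _ in reads]
write_index = [y for y, _ in writes]
```

Here `read_index` is 0 1 2 3 4 5 6 7 and `write_index` is 3 2 1 0 7 6 5 4, matching the index pattern of Fig. 9(a). The LIFO depth grows with W, which is the cost that motivates replacing it with a backward generator for long windows.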
When the sliding window length is large, using a LIFO can be costly. We now propose another method to generate the interleaving write addresses. As depicted in Fig. 10(b), a forward QPP address generator and a backward QPP address generator are used to recursively generate the read addresses f(x) and the write addresses f(y), respectively. The initial values f(x0) and g(x0) for the forward QPP generator need to be pre-computed, whereas the initial values for the backward QPP address generator are obtained from (synchronized with) the forward QPP address generator every W cycles, and then a backward recursion is performed over the next W−1 cycles to generate the next W−1 write addresses. Fig. 10(a) gives an example of this algorithm for W = 4 and x0 = 0.

4.2. QPP address generator for Radix-4 SW-MAP decoder
Radix-4 MAP decoding [13,34] is a commonly used technique to achieve a higher trellis processing speed. For binary Turbo codes, e.g. LTE Turbo codes, the number of trellis cycles can be reduced by 50% with Radix-4 processing. In Radix-4 processing, during the second half iteration two LLRs for the information bit vector {u_x, u_{x+1}} need to be fetched from and written to the LLR memory at addresses f(x) and f(x+1). Thus, two read and two write interleaving addresses need to be generated in each clock cycle. Fig. 11(a) shows an example of the read/write addressing scheme where a sequence is partitioned into even and odd sub-sequences. Fig. 11(b) shows a hardware architecture to generate the interleaving read and write addresses for the Radix-4 SW-MAP decoder. Two forward QPP address generators (with step d = 2) are used to generate the interleaving read addresses, and two backward QPP address generators (with step d = 2) are used to generate the interleaving write addresses. Based on QPP algebraic property 1, the LLR memory can be partitioned into even and odd indexed banks to avoid collisions.

Fig. 10. (a) An example of the forward/backward data flow in the SW-MAP algorithm, where W = 4 (read index x: 0 1 2 3 4 5 6 7; write index y: 3 2 1 0 7 6 5 4; the backward generator is synchronized with f(3), g(3) and then with f(7), g(7)). (b) A hardware architecture to generate interleaving read and write addresses for the SW-MAP decoder.

Fig. 11. (a) An example of the forward/backward data flow in the Radix-4 SW-MAP algorithm, where W = 4 (two read indices per cycle, x = 2k and x = 2k+1; two write indices per cycle, y = 2j and y = 2j+1). (b) A hardware architecture to generate read/write interleaving addresses for the Radix-4 SW-MAP decoder.

4.3. QPP address generator for NSW-MAP decoder
In the NSW algorithm, forward and backward recursions are performed simultaneously by processing data from both ends of the sub-trellis. After the middle point, soft LLRs are calculated in both forward and backward directions. Fig. 12 shows the NSW-MAP decoder architecture. Note that the NSW-MAP decoder requires two branch metric calculation units and two LLR calculation (LLRC) units because of the double-direction data processing. Fig. 13(a) shows the forward/backward data flow in the NSW-MAP decoding process. Because both the forward and the backward processes need to access memory, we propose a two-phase memory accessing scheme to support double-direction data processing. As shown in Fig. 13(b), in phase 0, the forward MAP process is allowed to read two data at addresses f(x) and f(x+1) from the LLR memory. In the next clock cycle (phase 1), the backward MAP process is allowed to read two data at addresses f(y) and f(y−1) from the LLR memory. This process then repeats. The write operation follows the same pattern, and the write address is simply a delayed version of the read address. The number of delay cycles depends on the pipeline delay of the LLRC unit in the MAP decoder, which is typically several clock cycles. Fig. 13(c) shows a hardware architecture to implement this two-phase memory accessing algorithm, where the LLR memory is partitioned into even and odd indexed banks to avoid collisions.

Fig. 12. NSW-MAP decoder architecture.

Fig. 13. (a) Forward/backward data flow in the NSW-MAP decoding process. (b) Two-phase memory accessing scheme (clock cycles 0 to 7, alternating phase 0 and phase 1; indices read per cycle: 0,1 / 7,6 / 2,3 / 5,4 / 4,5 / 3,2 / 6,7 / 1,0). (c) A hardware architecture for generating interleaving addresses for the NSW-MAP decoder.

Fig. 15. Area of an NSW-MAP decoder and an SW-MAP decoder. (Plot of area (mm²) versus P.)
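The two-phase schedule can be written out explicitly. The sketch below lists, cycle by cycle, which trellis indices are read and which interleaved addresses they map to; it assumes an even segment length, ignores the write-side delays, and uses direct evaluation of f(x) in place of the address generators. The function names are assumptions made for this example.

```python
def qpp(x, n, f1, f2):
    """f(x) = (f2*x^2 + f1*x) mod N, Eq. (5)."""
    return (f2 * x * x + f1 * x) % n


def two_phase_schedule(n, f1, f2, seg_start, seg_len):
    """Return a list of (cycle, phase, index_pair, address_pair).

    Phase 0: the forward process reads indices x and x + 1.
    Phase 1: the backward process reads indices y and y - 1.
    """
    schedule = []
    x = seg_start                    # forward pointer
    y = seg_start + seg_len - 1      # backward pointer
    for cycle in range(seg_len):
        phase = cycle % 2
        if phase == 0:
            pair = (x, x + 1)
            x += 2
        else:
            pair = (y, y - 1)
            y -= 2
        addrs = tuple(qpp(i, n, f1, f2) for i in pair)
        schedule.append((cycle, phase, pair, addrs))
    return schedule


# Segment of length 8 starting at index 0, N = 40, f(x) = 10x^2 + 3x (Table 1).
schedule = two_phase_schedule(40, 3, 10, 0, 8)
index_pairs = [pair for _, _, pair, _ in schedule]
# By property 1, each cycle touches one even and one odd address.
conflict_free = all(a % 2 != b % 2 for _, _, _, (a, b) in schedule)
```

The resulting index pairs are (0,1), (7,6), (2,3), (5,4), (4,5), (3,2), (6,7), (1,0), and in every cycle the two addresses have opposite parity, so one falls in the even bank and one in the odd bank.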
4.4. QPP address generator for Radix-4 NSW-MAP decoder
The two-phase memory accessing scheme shown in Fig. 13(b) can be extended to support Radix-4 NSW-MAP decoding as well, where four addresses f(x), f(x+1), f(x+2), and f(x+3) need to be generated in each clock cycle. Based on QPP algebraic property 2, the memory can be partitioned into four banks to allow concurrent memory accesses in each clock cycle without any collisions. Fig. 14 shows a hardware architecture for generating interleaving addresses for the Radix-4 NSW-MAP decoder.
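A possible four-bank mapping is sketched below. Selecting the bank by the address modulo 4 follows directly from property 2; using the integer quotient as the address inside the bank is an assumption made for this example, as are the function names.

```python
def qpp(x, n, f1, f2):
    """f(x) = (f2*x^2 + f1*x) mod N, Eq. (5)."""
    return (f2 * x * x + f1 * x) % n


def bank_and_offset(addr, num_banks=4):
    """Split an address into (bank, offset within the bank).

    The bank is addr mod 4; the offset addr // 4 is an assumed layout.
    """
    return addr % num_banks, addr // num_banks


def radix4_accesses(n, f1, f2, x):
    """(bank, offset) pairs for the four addresses f(x) ... f(x + 3)."""
    return [bank_and_offset(qpp(x + i, n, f1, f2)) for i in range(4)]


# N = 6144, f(x) = 480x^2 + 263x (Table 1): any four consecutive indices
# are checked to map to four different banks.
no_collisions = all(
    len({bank for bank, _ in radix4_accesses(6144, 263, 480, x)}) == 4
    for x in range(6144)
)
```

The same check applies to the backward addresses f(y), f(y−1), f(y−2), f(y−3), since they are also four consecutive indices.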
4.5. MAP decoder comparison
Table 2 compares the resource usage and decoding latency of an SW-MAP decoder and an NSW-MAP decoder, in which W is the sliding window length in the SW algorithm, L is the segment length L = N/P, and B_α and B_γ are the total bit widths for the α state metrics (8 states in total) and the γ branch metrics, respectively.

To compare the area of these two types of MAP decoder architectures, we have synthesized them in a TSMC 65-nm CMOS technology for a 400 MHz clock frequency. The fixed-point word lengths for the channel LLRs, extrinsic LLRs, and state metrics are 6, 7, and 10 bits, respectively [40]. For the SW-MAP architecture, the sliding window length W is assumed to be 64. Considering the decoding of a segment of a code block where the code length is N = 6144 and the segment length is L = N/P, Fig. 15 shows the area cost for these two types of MAP decoders. As can be seen, as the decoder parallelism P increases, the area cost of the NSW-MAP decoder drops quickly and approaches the area cost of the SW-MAP decoder.

Fig. 14. A hardware architecture for generating interleaving addresses for the Radix-4 NSW-MAP decoder.

Table 2
MAP decoder architecture comparison.

                            SW-MAP         NSW-MAP
α unit                      1              1
β unit                      1              1
Branch unit                 1              2
LLRC                        1              2
QPP address generator       2              2
State-buffer (bit)          B_α × W        B_α × L
γ-buffer (bit)              B_γ × W        0
SMP-buffer (bit)            B_α × 2L/W     B_α × 4
Processing time (cycles)    W + L          L

Fig. 16. AT complexity of an SW-MAP decoder and an NSW-MAP decoder. (Plot of AT complexity (mm² × μs, logarithmic scale) versus P; curves: NSW architecture, SW architecture.)
To compare the efficiency of these two architectures, we define
an efficiency metric as area × time, or AT, where area is the area of
one MAP decoder and time is the processing time of one sub-trellis
for a half Turbo iteration. Fig. 16 plots the AT complexity for different
values of P, with the AT value displayed on a logarithmic scale. Clearly,
when the parallelism degree P is small, the NSW-MAP
architecture has a higher AT complexity than the SW-MAP
architecture because a large number of state metrics have to be
buffered. On the other hand, as P increases, the NSW-MAP
architecture becomes more efficient because the
double-flow NSW-MAP decoding has no sliding window
overhead, whereas the single-flow SW-MAP decoding has a
sliding window overhead of W/(N/P + W). As a design tradeoff,
we adopted the SW-MAP architecture in our final hardware
implementation to save area while still achieving 1Gbps
throughput.
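To make the sliding window overhead concrete, the short sketch below evaluates W/(N/P + W) and the Table 2 processing times for the settings used in this section (N = 6144, W = 64). It is only an illustration; the function names are hypothetical.

```python
def sw_overhead(n, p, w):
    """Sliding window overhead of SW-MAP decoding: W / (N/P + W)."""
    segment = n / p
    return w / (segment + w)


def processing_cycles(n, p, w):
    """Processing time per sub-trellis in cycles as (SW-MAP, NSW-MAP).

    Follows the Table 2 entries W + L and L, with L = N/P.
    """
    segment = n // p
    return segment + w, segment


if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16, 32, 64):
        sw, nsw = processing_cycles(6144, p, 64)
        print(f"P={p:2d}  SW={sw:5d}  NSW={nsw:5d}  "
              f"overhead={sw_overhead(6144, p, 64):.3f}")
```

The overhead is about 1% at P = 1 but reaches 40% at P = 64, where a 96-symbol segment needs 160 cycles with a sliding window and 96 cycles without one.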
Fig. 17. AT complexity of a Radix-4 SW-MAP decoder and a Radix-4 NSW-MAP
decoder. (Plot: AT complexity in mm² × μs on a logarithmic scale versus P;
curves for the NSW Radix-4 architecture and the SW Radix-4 architecture.)
Fig. 18. An example of a multi-MAP parallel decoding approach with P = 4. (a)
Parallel SW-MAP algorithm with state metric propagation. (b) Parallel NSW-MAP
algorithm with state metric propagation. (Diagram: trellis block versus time
for the four segments 0 to N/4−1, N/4 to N/2−1, N/2 to 3N/4−1, and 3N/4 to
N−1; the legend marks stakes initialized from the previous iteration and
stakes propagated to the next iteration.)
Fig. 17 compares the AT complexities of a Radix-4 SW-MAP
decoder and a Radix-4 NSW-MAP decoder for a 250MHz clock
frequency. One observation is that the Radix-4 transform can
effectively reduce the AT complexity of the NSW-MAP decoder
when P is small. However, the Radix-4 transform does not necessarily
reduce the AT complexity of the SW-MAP decoder. This is because
the Radix-2 decoder can run at a higher clock
frequency and has a lower complexity than the Radix-4 decoder
(assuming a full LogMAP implementation). We compare the
Radix-2 and the Radix-4 architectures in more detail in the next
section.
Fig. 19. The proposed parallel decoder architecture with P SW-MAP decoders. P
memories are used to support contention-free memory accessing. Crossbar
interconnects are used to permute the memory read/write data. (Diagram: MAP
cores 0 to P−1, each with its own symbol LLR memory and a QPP interleaver
address generator producing f(x + jL) for core j, connected through a
P-input, P-output crossbar to LLR memory modules 0 to P−1.)

5. Parallel turbo decoder architecture
Decoder parallelism is necessary to meet the LTE/LTE-Advance
throughput requirement of up to 1Gbps. To
increase the throughput by a factor of P, an information
block can be divided into P segments of equal length L, and
each segment is then processed independently by a dedicated MAP
decoder [14,19,33,40–47]. In this scheme, each of the P MAP cores
processes its data sequentially, and all cores fetch and write data
simultaneously at the same offset x within their segments.
The interleaver structures in the current and previous 3G
standards do not have a parallel structure, which makes it difficult
to parallelize the MAP decoders. Expensive write
buffers have to be used to reduce the memory collisions caused by
the interleaver [9,48]. However, as the parallelism degree
increases, the collisions can no longer be resolved effectively with
write buffers. The LTE QPP interleaver, in contrast, has an inherently
parallel structure that supports contention-free memory accesses,
which results in a large design space for selecting an
appropriate level of decoder parallelism.
In this section, we present a scalable parallel Turbo
decoder architecture and analyze its complexity and
throughput. Fig. 18 illustrates the proposed parallel decoding
algorithm, in which multiple MAP decoders are used to improve the
throughput. Fig. 19 shows a hardware architecture for
implementing the proposed parallel SW-MAP algorithm. In this
architecture, P sets of QPP interleavers are used to generate the
interleaving addresses f(x), f(x+L), ..., and f(x+(P−1)L)
concurrently, where L is the segment length (L = N/P). Based on
the QPP contention-free property, these P addresses are
mapped to different memory modules 0 to P−1 without any
collisions. Thus, no write buffers are required. A crossbar network
is used to permute the data between the MAP decoders and the
memory modules.
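The contention-free mapping can be checked with a few lines of code. This is a sketch under stated assumptions, not the hardware of Fig. 19: it uses the TS 36.212 parameters for N = 6144 (f1 = 263, f2 = 480) and assumes that address a is stored in memory module floor(a / L), which is the usual definition of contention-free access. The function names are hypothetical.

```python
# Illustrative check of contention-free parallel access with P MAP cores.
N, F1, F2 = 6144, 263, 480  # assumed LTE QPP parameters for N = 6144


def qpp(x):
    """QPP interleaver address f(x) = (F1*x + F2*x^2) mod N."""
    return (F1 * x + F2 * x * x) % N


def module_indices(x, p):
    """Memory module (assumed: address // L) accessed by each core j at offset x."""
    seg = N // p
    return [qpp(x + j * seg) // seg for j in range(p)]


def contention_free(p):
    """True if the P cores hit P distinct modules at every offset x."""
    seg = N // p
    return all(len(set(module_indices(x, p))) == p for x in range(seg))


if __name__ == "__main__":
    print(module_indices(0, 4))
    print(all(contention_free(p) for p in (2, 4, 8, 16, 32, 64)))
```

The list returned by `module_indices` is the permutation that the crossbar has to realize at offset x. Since f(x + jL) − f(x) is always a multiple of L, all P addresses also share the same offset inside their modules.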
Furthermore, based on QPP interleaver algebraic property
3, this architecture can be modified to support the Radix-4 SW
and NSW MAP decoding algorithms under the following
constraints. To support Radix-4 SW-MAP decoding, L needs to
be divisible by 2, and each memory module needs to be
partitioned into even- and odd-indexed banks. To support
Radix-4 NSW-MAP decoding, L needs to be divisible by 4, and each
memory module needs to be partitioned into four banks.

Fig. 21. VLSI layout view which shows the core area of the decoder. (Labeled
blocks: extrinsic LLR memory, MAP decoder ×64 cores, channel LLR memory,
crossbar + interleaver, in/out buffer, control and miscellaneous logic.)

5.1. Throughput-area tradeoff analysis
High throughput is achieved by using multiple MAP decoders
and multiple memory modules/banks. In this section, we will
analyze the impact of parallelism on throughput and area. The
maximum throughput is measured as
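\[
\text{SW Throughput} = \frac{N}{\text{Decoding time}} \approx \frac{N \cdot f}{I \cdot (\tilde{N}/P + \tilde{W})},
\]
\[
\text{NSW Throughput} = \frac{N}{\text{Decoding time}} \approx \frac{N \cdot f}{I \cdot (\tilde{N}/P)},
\]
where $\tilde{N} = N$ and $\tilde{W} = W$ in the case of Radix-2 decoding, and
$\tilde{N} = N/2$ and $\tilde{W} = W/2$ in the case of Radix-4 decoding. I is the total
number of half iterations performed by the Turbo decoder, and f is the
operating clock frequency.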
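The throughput expressions are easy to evaluate directly. The helper below is a minimal sketch with a hypothetical function name; it takes W = 0 to represent the NSW case and halves N and W for Radix-4 decoding, as defined above.

```python
def max_throughput_bps(n, p, f_hz, half_iters, w=0, radix=2):
    """Maximum throughput N*f / (I * (N~/P + W~)) in bits per second.

    Use w > 0 for the SW-MAP expression and w = 0 for the NSW-MAP one.
    """
    if radix not in (2, 4):
        raise ValueError("radix must be 2 or 4")
    scale = radix // 2               # 1 for Radix-2, 2 for Radix-4
    n_tilde = n / scale
    w_tilde = w / scale
    return n * f_hz / (half_iters * (n_tilde / p + w_tilde))


if __name__ == "__main__":
    # N = 6144, I = 12 half iterations (6 full iterations), W = 64.
    print(max_throughput_bps(6144, 64, 400e6, 12, w=64) / 1e9)
    print(max_throughput_bps(6144, 64, 310e6, 12, w=64) / 1e9)
    print(max_throughput_bps(6144, 32, 250e6, 12, w=64, radix=4) / 1e9)
```

With N = 6144, I = 12, and W = 64, the expression gives 1.28Gbps for 64 Radix-2 SW-MAP cores at 400MHz, about 0.99Gbps at 310MHz, and 1.0Gbps for 32 Radix-4 SW-MAP cores at 250MHz.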
To analyze the area and throughput performance for different
QPP parallelism degrees, we described a Radix-2 and a Radix-4 SW
parallel Turbo decoder in Verilog HDL and synthesized them for a
65nm CMOS technology using Synopsys Design Compiler. The
result of the tradeoff analysis is given in Fig. 20, which plots the area
and the throughput for different parallelism degrees and clock rates.
As can be seen, a 1Gbps throughput is achievable with 64 Radix-2
MAP decoder cores running at a 310MHz clock frequency or with 32
Radix-4 MAP decoder cores running at a 250MHz clock frequency.
Fig. 20. Area-throughput tradeoff analysis for different parallelism and clock
rates (N = 6144, I = 12, W = 64). (a) Radix-2 SW-MAP parallel Turbo decoder.
(b) Radix-4 SW-MAP parallel Turbo decoder. (Plots: area in mm² versus
throughput in Mbps for P = 1 to 64; clock rates of 100 to 400MHz in (a) and
100 to 250MHz in (b).)

For a parallel Turbo decoder that consists of multiple MAP
units, the MAP units tend to dominate the silicon area, especially
when the parallelism is high. From Fig. 20, we can see that, for
the same throughput target, the Radix-2 architecture has a
lower area cost than the Radix-4 architecture in most cases,
especially when P is large. This is mainly because
the Radix-2 MAP unit can run at a higher clock frequency and has
a lower complexity than the Radix-4 MAP unit (assuming a full
LogMAP implementation). However, it should be noted that the
Radix-2 decoder may need a higher degree of code block partitioning
than the Radix-4 decoder to achieve the same throughput target.
As a design tradeoff, we adopted the Radix-2 architecture in our
final hardware implementation to save area while still meeting
the 1Gbps throughput target.
Table 3
Implementation result and architecture comparison with existing Turbo decoders.

                                       This work            [42]       [43]       [19]       [22]
Maximum block size                     6144                 432        5120       6144       6144
MAP cores                              64                   7          6          8          8
Maximum iterations                     Programmable         6          6          8          8
Technology                             65nm                 180nm      180nm      90nm       130nm
Supply voltage                         0.9V                 1.8V       N/A        1.0V       1.2V
Clock frequency                        400MHz               160MHz     166MHz     275MHz     250MHz
Core area                              8.3mm²               7.16mm²    13mm²      2.1mm²     10.7mm²
Gate equivalent (GE)                   5.8M                 587K (a)   1.3M (b)   740K (c)   800K
GE (excluding memory macros)           4.9M                 373K       N/A        N/A        500K
Throughput (at max. iteration)         1.28Gbps (@6 iter.)  75.6Mbps   60Mbps     129Mbps    186Mbps
Power consumption                      845mW                N/A        N/A        219mW      N/A
Energy efficiency (nJ/bit/iteration)   0.11                 1.45       1.65       0.21       0.61

(a) The gate count is estimated based on the chip data in the paper.
(b) The unit cell area is assumed to be 10.00μm² for 180nm technology.
(c) The unit cell area is assumed to be 2.82μm² for 90nm technology.
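The energy efficiency row of Table 3 can be reproduced for the two designs that report power. The sketch below assumes that the metric is power divided by throughput and by the number of iterations; this definition is an assumption inferred from the table values, and the function name is hypothetical.

```python
def energy_nj_per_bit_per_iter(power_w, throughput_bps, iterations):
    """Energy efficiency in nJ/bit/iteration (assumed: P / throughput / iterations)."""
    return power_w / throughput_bps / iterations * 1e9


if __name__ == "__main__":
    # This work: 845 mW, 1.28 Gbps at 6 iterations.
    print(round(energy_nj_per_bit_per_iter(0.845, 1.28e9, 6), 2))
    # Ref. [19]: 219 mW, 129 Mbps at 8 iterations.
    print(round(energy_nj_per_bit_per_iter(0.219, 129e6, 8), 2))
```

Under this assumption the computation yields 0.11 for this work and 0.21 for [19], matching the table entries.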
5.2. VLSI implementation result
A highly-parallel 3GPP LTE/LTE-Advance Turbo decoder, which
consists of 64 Radix-2 SW-MAP decoder cores, has been synthesized,
placed and routed in a 1.0V, 8-metal-layer TSMC 65nm CMOS
technology. The decoder has scalable parallelism: it can
employ 64, 32, and 16 MAP units when the block size is N ≥ 2048,
N ≥ 1024, and N ≥ 512, respectively. For small block sizes (N < 496), the
decoder can use up to 8 MAP cores. Fig. 21 shows the top-level layout
view of the ASIC, including the core area of the decoder. The
fixed-point bit precisions are as follows: the channel symbol LLRs for
systematic and parity bits are represented with 6-bit signed
numbers, the internal α and β state metrics are represented with
10-bit unsigned numbers (modulo normalization), and the extrinsic
LLRs are represented with 8-bit signed numbers. Based on the
fixed-point simulation result, the finite word-length implementation
leads to negligible BER performance degradation compared with a
floating-point representation. The maximum achievable clock frequency is
400MHz based on the post-layout simulation. The corresponding
maximum throughput is 1.28Gbps (at 6 iterations) with a core area
of 8.3mm².
5.3. Comparison with existing turbo decoders
In this section, we compare the proposed Turbo decoder with
existing Turbo decoders from [42], [43], [19], and [22]. In [42], a
parallel Turbo decoder based on 7 MAP decoders is presented. In
order to avoid memory contention, a custom-designed interleaver,
which is not standard compliant, is used. In [43], a 3G-compliant
parallel Turbo decoder based on the row-column permutation
interleaver is introduced. In [19], a 188-mode Turbo decoder chip
for the 3GPP LTE standard is presented. In this decoder, 8 MAP units
are used to achieve a maximum decoding throughput of 129Mbps
(at 8 iterations). In [22], a Radix-4 Turbo decoder is proposed for
the 3GPP LTE and WiMax standards. A maximum throughput of
186Mbps is supported by employing 8 MAP units (at 8 iterations).
Table 3 summarizes the implementation result of the proposed
decoder and the hardware comparison with existing decoders. As
can be seen, the proposed decoder supports the 3GPP LTE-Advance
throughput requirement (1Gbps) at a small area cost and
achieves good energy efficiency.

6. Conclusion
We have presented a highly-parallel architecture for the
decoding of 3GPP LTE/LTE-Advance Turbo codes. Based on its
algebraic construction, the QPP interleaver offers contention-free
memory access, which enables parallel Turbo
decoding with multiple MAP decoders working concurrently.
We proposed a low-complexity recursive architecture for generating
the QPP interleaver addresses on the fly. The QPP
interleavers are designed to operate at full speed with the MAP
decoders. The proposed architecture has scalable parallelism and
can be tailored to different throughput requirements. With this
architecture, a throughput of 1.28Gbps is achievable with a core
area of 8.3mm² in a 65-nm CMOS technology.

Acknowledgements
The authors would like to thank Nokia, Nokia Siemens
Networks (NSN), Xilinx, and the US National Science Foundation
(under Grants CCF-0541363, CNS-0551692, CNS-0619767,
EECS-0925942 and CNS-0923479) for their support of this research.
References
[1] Evolved Universal Terrestrial Radio Access (EUTRA) and Evolved Universal
Terrestrial Radio Access Network (EUTRAN), 3GPP TS 36.300.
[2] General UMTS Architecture, 3GPP TS 23.101 version 7.0.0, June 2007.
[3] S. Parkvall, E. Dahlman, A. Furuskar, Y. Jading, M. Olsson, S. Wanstedt, K.
Zangi, LTE-advanced—evolving LTE towards IMT—advanced, in: IEEE
Vehicular Technology Conference, September 2008, pp. 1–5.
[4] Multiplexing and channel coding, 3GPP TS 36.212 version 8.4.0, September
2008.
[5] C. Berrou, A. Glavieux, P. Thitimajshima, Near Shannon limit error-correcting
coding and decoding: turbo-codes, in: IEEE International Conference on
Communication, May 1993, pp. 1064–1070.
[6] C. Berrou, A. Glavieux, Near optimum error correcting coding and decoding:
turbo-codes, IEEE Transactions on Communications 44 (October) (1996)
1261–1271.
[7] L. Bahl, J. Cocke, F. Jelinek, J. Raviv, Optimal decoding of linear codes for
minimizing symbol error rate, IEEE Transactions on Information Theory IT-20
(March) (1974) 284–287.
[8] T.K. Blankenship, B. Classon, V. Desai, High-throughput turbo decoding
techniques for 4G, in: International Conference on Third Generation Wireless
and Beyond, May 2002, pp. 137–142.
[9] P. Salmela, R. Gu, S.S. Bhattacharyya, J. Takala, Efﬁcient parallel memory
organization for turbo decoders, in: Proceedings of European Signal
Processing Conference, September 2007, pp. 831–835.
[10] O.Y. Takeshita, On maximum contention-free interleavers and permutation
polynomials over integer rings, IEEE Trans. Inform. Theory 52 (March) (2006)
1249–1253.
[11] D. Garrett, B. Xu, C. Nicol, Energy efﬁcient turbo decoding for 3G mobile, in:
International Symposium on Low Power Electronics and Design, ACM, 2001,
pp. 328–333.
[12] C. Chaikalis, J.M. Noras, Reconﬁgurable turbo decoding for 3G applications,
Elsevier Signal Processing 84 (October) (2004) 1957–1972.
[13] M. Bickerstaff, L. Davis, C. Thomas, D. Garrett, C. Nicol, A 24Mb/s radix-4
logMAP turbo decoder for 3GPP-HSDPA mobile wireless, in: IEEE Interna-
tional Solid-State Circuit Conference (ISSCC), February 2003.
[14] M. Martina, M. Nicola, G. Masera, A flexible UMTS-WiMax turbo decoder
architecture, IEEE Trans. Circuits and Syst. II 55 (April) (2008) 273–369.
[15] K.K. Loo, T. Alukaidey, S.A. Jimaa, High performance parallelised 3GPP turbo
decoder, in: IEEE Personal Mobile Communications Conference, April 2003,
pp. 337–342.
[16] M.C. Shin, I.C. Park, SIMD processor-based turbo decoder supporting multiple
third-generation wireless standards, IEEE Trans. on VLSI 15 (June) (2007)
801–810.
[17] Y. Lin, S. Mahlke, T. Mudge, C. Chakrabarti, A. Reid, K. Flautner, Design and
implementation of turbo decoders for software deﬁned radio, in: IEEE
Workshop on Signal Processing Design and Implementation (SIPS), October
2006, pp. 22–27.
[18] P. Salmela, H. Sorokin, J. Takala, A programmable Max-Log-MAP turbo
decoder implementation, Hindawi VLSI Design 2008 (2008) 636–640.
[19] C.-C. Wong, Y.-Y. Lee, H.-C. Chang, A 188-size 2.1mm2 reconﬁgurable turbo
decoder chip with parallel architecture for 3GPP LTE system, in: 2009
Symposium on VLSI Circuits, June 2009, pp. 288–289.
[20] D.-S. Cho, H.-J. Park, H.-C. Park, Implementation of an efﬁcient UE decoder for
3G LTE system, in: International Conference on Telecommunications, June
2008, pp. 1–5.
[21] J. Berkmann, C. Carbonelli, F. Dietrich, C. Drewes, W. Xu, On 3G LTE terminal
implementation—standard, algorithms, complexities and challenges, in:
International Wireless Communications and Mobile Computing Conference,
August 2008, pp. 970–975.
[22] J.-H. Kim, I.-C. Park, A uniﬁed parallel radix-4 turbo decoder for mobile
WiMAX and 3GPP-LTE, in: IEEE Custom Integrated Circuits Conference,
September 2009, pp. 487–490.
[23] H.R. Sadjadpour, N.J.A. Sloane, M. Salehi, G. Nebe, Interleaver design for turbo
codes, IEEE J. Sel. Areas Commun. 19 (May) (2001) 831–837.
[24] C. Schurgers, F. Catthoor, M. Engels, Optimized MAP turbo decoder, in: IEEE
Signal Processing Systems (SiPS), October 2000, pp. 245–254.
[25] S.S. Pietrobon, Implementation and performance of a turbo/MAP decoder, Int.
J. Satell. Commun. 16 (December) (1998) 23–46.
[26] P. Robertson, E. Villebrun, P. Hoeher, A comparison of optimal and sub-
optimal MAP decoding algorithm operating in the log domain, in: IEEE
International Conference on Communication, 1995, pp. 1009–1013.
[27] J. Sun, O.Y. Takeshita, Interleavers for turbo codes using permutation
polynomials over integer rings, IEEE Trans. Inform. Theory 51 (January)
(2005) 101–119.
[28] A. Nimbalker, K.T. Blankenship, B. Classon, T.E. Fuja, D.J. Costello, Contention-
free interleavers for high-throughput turbo decoding, IEEE Trans. Commun.
56 (8) (2008) 1258–1267.
[29] A. Nimbalker, Y.W. Blankenship, B.K. Classon, K.T. Blankenship, ARP and QPP
interleavers for LTE turbo coding, in: IEEE Wireless Communications and
Networking Conference, April 2008, pp. 1032–1037.
[30] P. Ampadu, K. Kornegay, An efﬁcient hardware interleaver for 3G
turbo decoding, in: IEEE Radio and Wireless Conference, August 2003,
pp. 199–201.
[31] G. Masera, M. Mazza, G. Piccinini, F. Viglione, M. Zamboni, Low-cost IP-blocks
for UMTS turbo decoders, in: 27th European Solid-State Circuits Conference,
September 2001, pp. 470–473.
[32] C. Schurgers, F. Catthoor, M. Engels, Memory optimization of MAP turbo
decoder algorithms, IEEE Trans. VLSI Syst. 9 (2) (2001) 305–312.
[33] S.-J. Lee, N.R. Shanbhag, A.C. Singer, Area-efﬁcient high-throughput MAP
decoder architectures, IEEE Trans. VLSI Syst. 13 (August) (2005) 921–933.
[34] Y. Zhang, K.K. Parhi, High-throughput radix-4 logMAP turbo decoder
architecture, in: Asilomar Conference on Signals, Systems and Computers,
October 2006, pp. 1711–1715.
[35] R. Ratnayake, A. Kavcic, G.-Y. Wei, A high-throughput maximum a posteriori
probability detector, IEEE J. Solid-State Circuits 43 (2008) 1846–1858.
[36] A.J. Viterbi, An intuitive justiﬁcation and a simpliﬁed implementation of the
MAP decoder for convolutional codes, IEEE J. Sel. Areas Commun. 16
(February) (1998) 260–264.
[37] J. Dielissen, J. Huisken, State vector reduction for initialization of sliding
windows MAP, in: Second International Symposium on Turbo Codes and
Related Topics, September 2000.
[38] M.M. Mansour, N.R. Shanbhag, VLSI architectures for SISO-APP decoders, IEEE
Trans. VLSI Syst. 11 (4) (2003) 627–650.
[39] G. Masera, G. Piccinini, M. Roch, M. Zamboni, VLSI architecture for turbo
codes, IEEE Trans. VLSI Syst. 7 (1999) 369–379.
[40] Y. Sun, Y. Zhu, M. Goel, J.R. Cavallaro, Conﬁgurable and scalable high
throughput turbo decoder architecture for multiple 4G wireless standards, in:
IEEE International Conference on Application-Speciﬁc Systems, Architectures
and Processors (ASAP), July 2008, pp. 209–214.
[41] Z. Wang, Z. Chi, K.K. Parhi, Area-efﬁcient high-speed decoding schemes for
turbo decoders, IEEE Trans. VLSI Syst. 10 (December) (2002) 902–912.
[42] B. Bougard, A. Giulietti, V. Derudder, J.-W. Weijers, S. Dupont, L. Hollevoet,
F. Catthoor, L. Van der Perre, H. De Man, R. Lauwereins, A scalable 8.7-nJ/bit
75.6-Mb/s parallel concatenated convolutional (turbo-) codec, in: IEEE
International Solid-State Circuit Conference (ISSCC), February 2003.
[43] M.J. Thul, F. Gilbert, T. Vogt, G. Kreiselmaier, N. Wehn, A scalable system
architecture for high-throughput turbo-decoders, The J. VLSI Signal Process.
(2005) 63–77.
[44] G. Prescher, T. Gemmeke, T.G. Noll, A parametrizable low-power high-
throughput turbo-decoder, IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), vol. 5, March 2005, pp. 25–28.
[45] R. Dobkin, M. Peleg, R. Ginosar, Parallel interleaver design and VLSI
architecture for low-latency MAP turbo decoders, IEEE Trans. VLSI Syst. 13
(4) (2005) 427–438.
[46] M. May, C. Neeb, N. Wehn, Evaluation of high throughput turbo-decoder
architectures, in: IEEE International Symposium on Circuits and Systems
(ISCAS), May 2007, pp. 2770–2773.
[47] A. Tarable, L. Dinoi, S. Benedetto, Design of prunable interleavers for parallel
turbo decoder architectures, IEEE Commun. Lett. 11 (February) (2007) 167–169.
[48] R. Asghar, D. Wu, J. Eilert, D. Liu, Memory conﬂict analysis and implementa-
tion of a re-conﬁgurable interleaver architecture supporting uniﬁed parallel
turbo decoding, J. VLSI Signal Process. (July 2009).

Yang Sun received the B.S. degree in Testing Technology
and Instrumentation in 2000, and the M.S. degree
in Instrument Science and Technology in 2003, both
from Zhejiang University, Hangzhou, China. From
2003 to 2004, he worked at S3 Graphics Co. Ltd. as
an ASIC design engineer, developing Graphics Processing
Unit (GPU) cores for graphics chipsets. From 2004
to 2005, he worked at Conexant Systems Inc. as an
ASIC design engineer, developing video decoder cores
for set-top box (STB) chipsets. He is currently a Ph.D.
student in the Department of Electrical and Computer
Engineering at Rice University, Houston, Texas. His
research interests include parallel algorithms and VLSI
architectures for wireless communication systems, especially forward-error
correction (FEC) systems. He received the 2008 IEEE SoC Conference Best Paper
Award, the 2008 IEEE Workshop on Signal Processing Systems Best Paper Award
(Bob Owens Memory Paper Award), and the 2009 ACM GLSVLSI Best Student Paper
Award.

Joseph R. Cavallaro, Professor
Rice University, Department of Electrical and Computer Engineering
Center for Multimedia Communication
6100 S. Main Street, MS 380, Houston, TX 77005, USA
Tel: +17133484719
Fax: +17133486196
Office: 3042 Duncan Hall
Internet: cavallar@ece.rice.edu
WWW: http://www.ece.rice.edu/cavallar

Joseph R. Cavallaro received the B.S. degree from the
University of Pennsylvania, Philadelphia, PA, in 1981,
the M.S. degree from Princeton University, Princeton,
NJ, in 1982, and the Ph.D. degree from Cornell
University, Ithaca, NY, in 1988, all in electrical
engineering. From 1981 to 1983, he was with AT&T
Bell Laboratories, Holmdel, NJ. In 1988, he joined the
faculty of Rice University, Houston, TX, where he is
currently a Professor of electrical and computer
engineering. His research interests include computer
arithmetic, VLSI design and microlithography, and DSP
and VLSI architectures for applications in wireless
communications. During the 1996–1997 academic
year, he served at the National Science Foundation as Director of the Prototyping
Tools and Methodology Program. He was a Nokia Foundation Fellow and a Visiting
Professor at the University of Oulu, Finland, in 2005 and continues his affiliation
there as an Adjunct Professor. He is currently the Associate Director of the Center
for Multimedia Communication at Rice University. He is a Senior Member of the
IEEE. He was Co-chair of the 2004 Signal Processing for Communications
Symposium at the IEEE Global Communications Conference and General Co-chair
of the 2004 IEEE 15th International Conference on Application-Specific Systems,
Architectures and Processors (ASAP).
