Configurable and Scalable High Throughput Turbo Decoder Architecture for Multiple 4G Wireless Standards by Sun, Yang et al.
Configurable and Scalable High Throughput Turbo Decoder Architecture for
Multiple 4G Wireless Standards∗
Yang Sun †, Yuming Zhu ‡, Manish Goel ‡, and Joseph R. Cavallaro †
†ECE Department, Rice University. 6100 Main, Houston, TX 77005
‡DSPS R&D Center, Texas Instruments. 12500 TI Blvd MS 8649, Dallas, TX 75243
Email: ysun@rice.edu, y-zhu@ti.com, goel@ti.com, cavallar@rice.edu
Abstract
In this paper, we propose a novel multi-code turbo decoder architecture for 4G wireless systems. To support various 4G standards, a configurable multi-mode MAP (maximum a posteriori) decoder is designed for both binary and duo-binary turbo codes with small resource overhead (less than 10%) compared to the single-mode architecture. To achieve high data rates in 4G, we present a parallel turbo decoder architecture with scalable parallelism tailored to the given throughput requirements. High-level parallelism is achieved by employing contention-free interleavers. A multi-banked memory structure and a routing network among the memories and MAP decoders are designed to operate at full speed with the parallel interleavers. We design a very low-complexity recursive on-line address generator that supports multiple interleaving patterns and avoids the interleaver address memory. Design trade-offs in terms of area and power efficiency are explored to find the optimal architectures. A 711 Mbps data rate is feasible with 32 Radix-4 MAP decoders running at a 200 MHz clock rate.
1. Introduction
The approaching fourth-generation (4G) wireless systems are projected to provide 100 Mbps to 1 Gbps speeds by 2010, which consequently leads to orders-of-magnitude increases in the expenditure of computing resources. High-throughput turbo codes [1] are required in many 4G communication applications; 3GPP Long Term Evolution (LTE) and IEEE 802.16e WiMax are two examples. As some 4G systems use different types of turbo coding schemes (e.g. binary codes in 3GPP LTE and duo-binary codes in WiMax), a general solution to supporting multiple code types is to use programmable processors. For example, a 2 Mbps turbo decoder implemented on a DSP processor is proposed in [2]. Also, the authors of [3] and [4] develop multi-code turbo decoders based on SIMD processors, where a 5.48 Mbps data rate is achieved in [3] and a 100 Mbps data rate is achieved in [4] at a cost of 16 processors. While these programmable SIMD/VLIW processors offer great flexibility, they have several disadvantages, notably higher power consumption and lower throughput than ASIC solutions. A turbo decoder is typically one of the most computation-intensive parts of a 4G receiver; therefore it is essential to design an area- and power-efficient turbo decoder in ASIC. However, to the best of our knowledge, an ASIC architecture supporting multiple codes (both simple binary and double binary codes) is still lacking in the literature.

∗This work was supported by the Summer Student Program 2007, Texas Instruments Incorporated.
In this work, we propose an efficient VLSI architecture for turbo decoding. This architecture can be configured to support both simple and double binary turbo codes up to eight states. We address the memory collision problems by applying new highly-parallel interleavers. The MAP decoder, memory structure and routing network are designed to operate at full speed with the parallel interleaver. The proposed architecture meets the challenge of multi-standard turbo decoding at very high data rates.
Figure 1. Turbo encoder/decoder structure
2. MAP decoding algorithm
The turbo decoding concept is functionally illustrated in
Fig. 1. The decoding algorithm is called the maximum a
1-4244-1898-5/08/$20.00 ©2008 IEEE 209
Authorized licensed use limited to: Rice University. Downloaded on June 18, 2009 at 13:02 from IEEE Xplore.  Restrictions apply.
posteriori (MAP) algorithm and is usually calculated in the log domain [5]. During the decoding process, each SISO decoder receives the intrinsic log-likelihood ratios (LLRs) from the channel and the extrinsic LLRs from the other constituent SISO decoder through interleaving ($\Pi$) or deinterleaving ($\Pi^{-1}$). Consider the decoding of simple binary turbo codes, and let $s_k$ be the trellis state at time $k$. The MAP decoder then computes the LLR of the a posteriori probability (APP) of each information bit $u_k$ by
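$$
\begin{aligned}
\Lambda(\hat{u}_k) ={} & \operatorname*{max^*}_{u:\,u_k=1} \{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\} \\
& - \operatorname*{max^*}_{u:\,u_k=0} \{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\}
\end{aligned}
$$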
where αk and βk represent forward and backward state met-
rics, respectively, and are computed recursively as:
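$$\alpha_k(s_k) = \operatorname*{max^*}_{s_{k-1}} \{\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k)\} \tag{1}$$

$$\beta_k(s_k) = \operatorname*{max^*}_{s_{k+1}} \{\beta_{k+1}(s_{k+1}) + \gamma_k(s_k, s_{k+1})\} \tag{2}$$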
The second term $\gamma_k$ is the branch transition probability and is usually referred to as a branch metric (BM). The max∗ function, defined as $\max^*_e(f(e)) = \ln\left(\sum_e \exp(f(e))\right)$, is typically implemented as a max followed by a correction term [5]. To extract the extrinsic information, $\Lambda(\hat{u}_k)$ is split into three terms, extrinsic $L_e(\hat{u}_k)$, a priori $L_a(u_k)$ and systematic $L_c(y^s_k)$, as $\Lambda(\hat{u}_k) = L_e(\hat{u}_k) + L_a(u_k) + L_c(y^s_k)$.
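As an illustration of the "max followed by a correction term" form, the following Python sketch evaluates max∗ with the exact correction $\ln(1 + e^{-|x-y|})$. It is a floating-point model only, not the fixed-point hardware operator; the function names `max_star` and `max_star_n` are assumptions introduced here.

```python
import math


def max_star(x: float, y: float) -> float:
    """Two-input max*: ln(exp(x) + exp(y)) written as max plus a correction term."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def max_star_n(values):
    """Multi-input max*, folded from the two-input operator (max* is associative)."""
    it = iter(values)
    acc = next(it)
    for v in it:
        acc = max_star(acc, v)
    return acc
```

In a hardware realization the correction term would typically be read from a small look-up table indexed by the difference of the two inputs rather than computed with a logarithm.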
2.1. Unified Radix-4 decoding algorithm
For binary turbo codes, the number of trellis cycles can be reduced by 50% by applying the one-level look-ahead recursion [6][7], as illustrated in Fig. 2. The Radix-4 $\alpha$ recursion is then given by:
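$$
\begin{aligned}
\alpha_k(s_k) &= \operatorname*{max^*}_{s_{k-1}} \Big\{ \operatorname*{max^*}_{s_{k-2}} \{\alpha_{k-2}(s_{k-2}) + \gamma_{k-1}(s_{k-2}, s_{k-1})\} + \gamma_k(s_{k-1}, s_k) \Big\} \\
&= \operatorname*{max^*}_{s_{k-2},\,s_{k-1}} \{\alpha_{k-2}(s_{k-2}) + \gamma_k(s_{k-2}, s_k)\}
\end{aligned}
\tag{3}
$$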
where $\gamma_k(s_{k-2}, s_k)$ is the new branch metric for the two-bit symbol $\{u_{k-1}, u_k\}$ connecting states $s_{k-2}$ and $s_k$:

$$\gamma_k(s_{k-2}, s_k) = \gamma_{k-1}(s_{k-2}, s_{k-1}) + \gamma_k(s_{k-1}, s_k) \tag{4}$$
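The equivalence between two Radix-2 steps of Eq. (1) and one Radix-4 step of Eqs. (3) and (4) can be checked numerically. The Python sketch below does so on a toy 4-state binary trellis with random branch metrics. The shift-register trellis connectivity, the helper names and the random metrics are all assumptions made for illustration and do not correspond to any particular standard.

```python
import math
import random

NEG_INF = float("-inf")


def max_star_n(values):
    """Stable ln(sum(exp(v))); entries equal to -inf mark invalid transitions."""
    vals = [v for v in values if v != NEG_INF]
    if not vals:
        return NEG_INF
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))


def random_gamma(n_states, rng):
    """Branch metrics gamma[s_prev][s_next] for a toy binary shift-register trellis."""
    g = [[NEG_INF] * n_states for _ in range(n_states)]
    for sp in range(n_states):
        for u in (0, 1):
            s = (sp >> 1) | (u * (n_states >> 1))
            g[sp][s] = rng.uniform(-2.0, 2.0)
    return g


def radix2_step(alpha_prev, gamma):
    """Eq. (1): one trellis step of the forward recursion."""
    n = len(alpha_prev)
    return [max_star_n(alpha_prev[sp] + gamma[sp][s] for sp in range(n))
            for s in range(n)]


def radix4_step(alpha_prev2, gamma_a, gamma_b):
    """Eq. (3): two trellis steps at once, using the merged metric of Eq. (4)."""
    n = len(alpha_prev2)
    return [max_star_n(alpha_prev2[s2] + (gamma_a[s2][s1] + gamma_b[s1][s])
                       for s2 in range(n) for s1 in range(n))
            for s in range(n)]
```

Applying `radix2_step` twice and `radix4_step` once to the same starting metrics should give the same $\alpha_k$ up to floating-point rounding, which is the property that lets the Radix-4 decoder halve the number of trellis cycles.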
Similarly, the Radix-4 $\beta$ recursion is computed as:

$$\beta_k(s_k) = \operatorname*{max^*}_{s_{k+2},\,s_{k+1}} \{\beta_{k+2}(s_{k+2}) + \gamma_k(s_k, s_{k+2})\} \tag{5}$$
Since the Radix-4 algorithm is based on the symbol level, we must define a symbol probability (or reliability):

$$L(\psi_{ij}) = \operatorname*{max^*}_{s_{k-2},\,s_k} \{\alpha_{k-2}(s_{k-2}) + \gamma^{ij}_k + \beta_k(s_k)\} \tag{6}$$
Figure 2. 4-state trellis merge example (Radix-2 trellis compressed in time into a Radix-4 trellis)

Figure 3. Duo-binary convolutional encoder (circular trellis: end state = start state)
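where $\gamma^{ij}_k$ is the branch transition probability with $u_{k-1} = i$ and $u_k = j$, for $i, j \in \{0, 1\}$. The bit LLRs for $u_{k-1}$ and $u_k$ are then computed as:

$$\Lambda(\hat{u}_{k-1}) = {\max}^*(L(\psi_{10}), L(\psi_{11})) - {\max}^*(L(\psi_{00}), L(\psi_{01}))$$

$$\Lambda(\hat{u}_k) = {\max}^*(L(\psi_{01}), L(\psi_{11})) - {\max}^*(L(\psi_{00}), L(\psi_{10}))$$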
Duo-binary codes differ from binary codes mainly in the trellis termination scheme and the symbol-wise decoding [8]. The duo-binary trellis is closed as a circle with the start state equal to the end state; this is also referred to as a tail-biting scheme and is shown in Fig. 3. The decoding algorithm for duo-binary codes is inherently based on the Radix-4 algorithm [8], hence the same Radix-4 $\alpha$, $\beta$ and $L(\psi)$ function units (3)(5)(6) can be applied to the duo-binary codes in a straightforward manner. The unique parts are the branch metric $\gamma^{ij}$ calculations and the tail-biting handling. Moreover, three LLRs must be calculated for duo-binary codes: $\Lambda_1(\hat{u}_k) = L_k(\psi_{01}) - L_k(\psi_{00})$, $\Lambda_2(\hat{u}_k) = L_k(\psi_{10}) - L_k(\psi_{00})$ and $\Lambda_3(\hat{u}_k) = L_k(\psi_{11}) - L_k(\psi_{00})$.
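The two output modes can be contrasted with a short Python sketch that takes the four symbol reliabilities $L(\psi_{ij})$ of Eq. (6) and produces either the two bit LLRs (binary mode) or the three symbol LLRs (duo-binary mode). The dictionary layout keyed by $(i, j)$ and the function names are assumptions chosen for illustration.

```python
import math


def max_star(x: float, y: float) -> float:
    """Two-input max* with the exact correction term."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def binary_bit_llrs(L):
    """Bit LLRs for (u_{k-1}, u_k); L maps (i, j) to the reliability L(psi_ij)."""
    llr_prev = max_star(L[1, 0], L[1, 1]) - max_star(L[0, 0], L[0, 1])
    llr_curr = max_star(L[0, 1], L[1, 1]) - max_star(L[0, 0], L[1, 0])
    return llr_prev, llr_curr


def duo_binary_symbol_llrs(L):
    """Three symbol LLRs, each taken relative to the all-zero symbol 00."""
    return (L[0, 1] - L[0, 0], L[1, 0] - L[0, 0], L[1, 1] - L[0, 0])
```

Both functions consume the same four reliabilities, which mirrors why a single $L(\psi)$ function unit can serve both code types, with only the final LLR generation logic differing.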
3. Unified Radix-4 MAP decoder architecture
Based on the justification in Section 2.1 that both binary and duo-binary codes can be decoded in a unified way, we propose a configurable Radix-4 Log-MAP decoder architecture to perform both types of decoding operations. To efficiently implement the Log-MAP algorithm, an overlapping sliding window technique [9] is employed.

Figure 4. Sliding window tile chart: (a) binary tailing; (b) duo-binary tail-biting
We generalize the two types of decoding operations into one unified flow, which is shown in Fig. 4. This decoding operation is based on three recursion units: two used for the backward recursions (dummy $\beta_0$ and effective $\beta_1$) and one for the forward recursion ($\alpha$). Each recursion unit contains fully parallel ACSA (Add-Compare-Select-Add) operators. To reduce decoding latency, data in a sliding window is fed into the decoder in reverse order; the $\alpha$ unit works in natural order, while the dummy $\beta_0$, effective $\beta_1$ and $\Lambda$ units work in reverse order, as shown in Fig. 4. This leads to a decoding latency of $2W$ for binary codes and $3W$ for duo-binary codes, where $W$ is the sliding window length. Duo-binary codes have a longer latency because the start trellis states are not initialized to 0 but are equal to the end states; hence an additional acquisition is needed to obtain the initial states' $\alpha$ metrics (see Fig. 4(b)). Similarly, an additional acquisition is also required for the end states' $\beta$ metrics, which, however, does not cause any additional delay.
Fig. 5 shows the proposed Radix-4 Log-MAP decoder architecture. Three scratch RAMs (each with a depth of $W$) are used to store the input LLRs (systematic, parity and a priori). Three branch metric calculation (BMC) units compute the branch metrics for the $\alpha$, $\beta_0$ and $\beta_1$ function units. To support multiple codes, the decoder employs configurable BMCs and configurable $\alpha$ and $\beta$ function units, which can support multiple transfer functions by configuring the routing (multiplexor) blocks (top right of Fig. 5). Each $\alpha$ and $\beta$ unit consists of 8 ACSA units, so it can support turbo decoding with up to 8 states. The Radix-4 ACSA unit is implemented with max∗ trees. Since we want to support both binary and duo-binary decoding, the extrinsic $\Lambda$ function unit, depicted in the bottom part of Fig. 5, contains both bit LLR and symbol LLR generation logic. After a latency of $2W$ or $3W$, the $\Lambda$ unit begins to generate soft LLRs $\Lambda(\hat{u})$ and extrinsic information $L_e(\hat{u}_k)$ in real time.

Figure 5. Unified MAP decoder architecture. Legend: $\alpha$: forward; $\beta_0$: dummy backward; $\beta_1$: effective backward; $\Lambda$: LLR and extrinsics; $\max^*(x, y) = \max(x, y) + \mathrm{LUT}(x - y)$
In this architecture, many parts of the logic circuits can be shared. For example, the $\alpha$, $\beta$ and $L(\psi)$ units and the scratch and $\alpha$ RAMs can easily be shared between the two decoding operations. Table 1 compares the resource usage of a multi-code architecture and a single-code architecture, in which $M$ is the number of states in the trellis, and $B_m$, $B_b$, $B_c$ and $B_e$ are the precisions of the state metrics, branch metrics, channel LLRs and extrinsic LLRs, respectively. From Table 1, we see that the overhead for adding flexibility is very small (about 7%).
Table 1. Hardware complexity comparison
Resource          Multi-code             Single-code
Storage (bits)    (9Be + 12Bc + MBm)W    (9Be + 12Bc + MBm)W
Bm-bit max∗ †     (25/2)M + 4            (25/2)M
1-bit adder       16MBm + 10MBb          16MBm + 10MBb
1-bit flip-flop   5MBm + 2MBb            5MBm + 2MBb
1-bit mux         16MBm + 16MBb          3MBm
Normalized area   1.0                    0.93
† One four-input max∗ is counted as three two-input max∗; one eight-input max∗ is counted as seven two-input max∗.
The fixed-point word lengths chosen for this decoder are $B_c = 6$, $B_e = 7$ and $B_m = B_b = 10$. The sliding window length $W$ is programmable and can be up to 64. Table 2 summarizes the synthesis results on a 65 nm technology at a 200 MHz clock. Compared to the Radix-4 MAP decoder in [6], which contains 410K gates of logic, our architecture achieves better flexibility by supporting multiple code types at low hardware overhead.
Table 2. Area distribution of the MAP decoder
Block                                  Gate count / size
α-processor (including α-BMU)          30.8K gates
β-processor x2 (including β-BMUs)      66.2K gates
Λ-processor                            37.3K gates
α-RAM                                  2.560K bits
Scratch RAMs x3                        4.224K bits
Control logic                          13.4K gates
4. Low complexity interleaver architecture
One of the main obstacles to parallel turbo decoding is interleaver parallelism, owing to memory access collisions. For example, in [10] memory collisions are resolved by adding write buffers. Recently, however, new contention-free interleavers have been adopted by 4G standards, such as the QPP interleaver [11] in 3GPP LTE and the ARP interleaver [12] in WiMax. The simplest approach to implementing an interleaver is to store all the interleaving patterns in ROMs. However, this approach becomes impractical for a turbo decoder supporting multiple block sizes and multiple standards. In this section, we propose a unified on-line address generator with two recursion units supporting both QPP and ARP interleavers.
Figure 6. Low-complexity unified interleaver: (a) QPP interleaver architecture, computing $\Pi(x+1) = (\Pi(x) + \Gamma(x)) \bmod N$ and $\Gamma(x+1) = (\Gamma(x) + g) \bmod N$; (b) ARP interleaver architecture, computing $\lambda(x+1) = (\lambda(x) + P_0) \bmod N$ and $\Pi(x) = (\lambda(x) + Q_x) \bmod N$
Figure 7. Parallel decoder architecture: (a) multi-MAP turbo decoder architecture; (b) parallel MAP decoding

We first look at the QPP interleaver. Given an information block length $N$, the $x$-th interleaved output position is given by $\Pi(x) = (f_2 x^2 + f_1 x) \bmod N$, $0 \le x, f_1, f_2 < N$. This can be computed recursively as:

$$
\begin{aligned}
\Pi(x+1) &= \big((f_2 x^2 + f_1 x) + (2 f_2 x + f_1 + f_2)\big) \bmod N \\
&= (\Pi(x) + \Gamma(x)) \bmod N
\end{aligned}
\tag{7}
$$

where $\Gamma(x) = (2 f_2 x + f_1 + f_2) \bmod N$, which can also be computed recursively as $\Gamma(x+1) = (\Gamma(x) + g) \bmod N$ with $g = 2 f_2$. Since $\Pi(x)$, $\Gamma(x)$ and $g$ are all smaller than $N$, the QPP interleaver can be efficiently implemented by cascading two Add-Compare-Choose (ACC) units, as shown in Fig. 6(a), which avoids expensive multipliers and dividers.
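The QPP recursion can be modelled in software with two cascaded add-compare-choose steps, as in the following Python sketch. This is a behavioural illustration rather than the hardware design itself; the names `acc` and `qpp_addresses` are assumptions, and the sketch reduces $g$ and $\Gamma(0)$ modulo $N$ before the loop so that every ACC input is below $N$.

```python
def acc(a: int, b: int, n: int) -> int:
    """Add-Compare-Choose: (a + b) mod n for 0 <= a, b < n, with no divider."""
    s = a + b
    d = s - n              # the sign of d acts as the multiplexor select
    return s if d < 0 else d


def qpp_addresses(n: int, f1: int, f2: int):
    """Yield Pi(0), ..., Pi(n-1) using the recursion of Eq. (7)."""
    pi = 0                     # Pi(0) = 0
    gamma = (f1 + f2) % n      # Gamma(0) = (f1 + f2) mod n
    g = (2 * f2) % n           # increment of the Gamma recursion
    for _ in range(n):
        yield pi
        pi = acc(pi, gamma, n)
        gamma = acc(gamma, g, n)
```

For example, `list(qpp_addresses(40, 3, 10))` can be compared against the direct evaluation of $(f_2 x^2 + f_1 x) \bmod N$; the parameter values here are used purely for illustration.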
The ARP interleaver employs a two-step interleaving. In the first step, it switches the alternate couples, $[B_x, A_x] = [A_x, B_x]$ if $x \bmod 2 = 1$. In the second step, it computes $\Pi(x) = (P_0 \cdot x + Q_x) \bmod N$, where $Q_x$ is equal to $1$, $1 + N/2 + P_1$, $1 + P_2$ or $1 + N/2 + P_3$ when $x \bmod 4 = 0$, 1, 2 or 3, respectively. $P_0$, $P_1$, $P_2$ and $P_3$ are constants depending on $N$. Denoting $\lambda(x) = P_0 \cdot x$, the ARP interleaver can be efficiently implemented in a similar manner by reusing the same two ACC units, as shown in Fig. 6(b).
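The second ARP step can be sketched in Python with the same add-compare-choose primitive: one ACC accumulates $\lambda(x)$ and the other adds the offset $Q_x$ selected by $x \bmod 4$. The sketch covers only the address computation, not the couple switching of the first step. The function names are assumptions, and the four offsets are reduced modulo $N$ up front so that both ACC inputs stay below $N$.

```python
def acc(a: int, b: int, n: int) -> int:
    """Add-Compare-Choose: (a + b) mod n for 0 <= a, b < n."""
    s = a + b
    return s if s < n else s - n


def arp_addresses(n: int, p0: int, p1: int, p2: int, p3: int):
    """Yield Pi(0), ..., Pi(n-1) for Pi(x) = (P0 * x + Q_x) mod n."""
    # Offsets Q_x for x mod 4 = 0, 1, 2, 3, pre-reduced modulo n.
    q = [1 % n, (1 + n // 2 + p1) % n, (1 + p2) % n, (1 + n // 2 + p3) % n]
    lam = 0                # lambda(0) = 0
    step = p0 % n
    for x in range(n):
        yield acc(lam, q[x % 4], n)   # Pi(x) = (lambda(x) + Q_x) mod n
        lam = acc(lam, step, n)       # lambda(x + 1) = (lambda(x) + P0) mod n
```

The parameter set used to exercise this sketch ($N = 48$, $P_0 = 13$, $P_1 = 24$, $P_2 = 0$, $P_3 = 24$) is an illustrative assumption, not a value taken from this paper.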
This architecture requires only a few adders and multiplexors, which leads to very low complexity, and it can support all QPP/ARP turbo interleaving patterns. Compared to the latest work on a row-column based interleaver [13], which needs complex state machines and RAMs/ROMs, our QPP/ARP interleaver architecture has a lower gate count and more flexibility.
5. Parallel turbo decoder architecture
To increase the throughput, parallel MAP decoding can be employed by dividing the whole information block into several sub-blocks, each of which is decoded separately by a dedicated MAP decoder [14][15][16][17]. For example, in [15] a 75.6 Mbps data rate is achieved using 7 SISO decoders running at 160 MHz.
Figure 8. Architecture tradeoff for different parallelism and clock rates: (a) normalized area versus throughput (Mbps), with reference points for [4], [14], [15] and [16]; (b) normalized power versus throughput (Mbps). Both panels use N=6144, I=6, W=32, with curves for P = 1, 2, 4, 8, 16 and 32 and clock frequencies from 80 MHz to 200 MHz in steps of 20 MHz.
In this section, we propose a scalable decoder architecture, shown in Fig. 7(a), based on the QPP/ARP contention-free 4G interleavers. The parallelism is achieved by dividing the whole block of length $N$ into $P$ sub-blocks (SBs) and assigning $P$ MAP decoders working in parallel, which reduces the latency to $O(N/P)$. The memory structure is designed to support concurrent access to LLRs by multiple ($P$) MAP decoders in both linear addressing and interleaved addressing modes. This is achieved by partitioning the memory into $Q$ banks ($Q \ge P$), with each bank being independently accessible. In this paper, we use $Q = P$. Taking the QPP interleaver as an example, the $P$ MAP decoders always access data simultaneously at a particular offset $x$, which guarantees no memory access conflicts due to the contention-free property $\lfloor \Pi(x + jM)/M \rfloor \ne \lfloor \Pi(x + kM)/M \rfloor$, where $x$ is the offset within sub-blocks $j$ and $k$ ($0 \le j < k < P$) and $M$ is the sub-block length $N/P$. Since any MAP decoder may write to any memory bank during the decoding process, a full crossbar is designed for routing data among them. Fig. 7(b) depicts a parallel decoding example for duo-binary codes (the general concepts also hold for binary codes).
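The contention-free property can be checked by brute force for a given QPP interleaver. The Python sketch below verifies that, at every offset $x$, the $P$ interleaved addresses fall into $P$ distinct banks of size $M = N/P$. The function names and the parameter sets used to exercise it are assumptions for illustration.

```python
def qpp(x: int, n: int, f1: int, f2: int) -> int:
    """Direct evaluation of the QPP permutation Pi(x) = (f2*x^2 + f1*x) mod n."""
    return (f2 * x * x + f1 * x) % n


def is_contention_free(n: int, p: int, f1: int, f2: int) -> bool:
    """True if floor(Pi(x + j*M) / M) is distinct over j for every offset x."""
    if n % p != 0:
        raise ValueError("the number of sub-blocks must divide the block length")
    m = n // p                                   # sub-block (and bank) length M
    for x in range(m):
        banks = {qpp(x + j * m, n, f1, f2) // m for j in range(p)}
        if len(banks) != p:                      # two decoders hit the same bank
            return False
    return True
```

A call such as `is_contention_free(6144, 32, 263, 480)` models 32 decoders operating on a block of 6144 with one memory bank per decoder, the $Q = P$ case used in this paper.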
5.1. Architecture trade-off analysis
High throughput is achieved at the cost of multiple MAP decoders and multiple memory banks. In this section, we analyze the impact of parallelism on throughput, area and power consumption. The maximum throughput is estimated as:

$$\text{Throughput} = \frac{N}{\text{Decoding time}} \approx \frac{N \cdot f}{2 \cdot I \cdot \left(\frac{\tilde{N}}{P} + 3\tilde{W}\right)} \tag{8}$$
where $\tilde{N} = N/2$ and $\tilde{W} = W/2$ for Radix-4 decoding, $I$ is the number of iterations (each containing two half iterations), $3\tilde{W}$ is the decoding latency of one MAP decoder and $f$ is the clock frequency. The area is estimated as:

$$\text{Area} \approx P \cdot A_{map}(f) + A_{mem}(P, N) + A_{route}(P, f) \tag{9}$$

where $A_{map}$ is the area of one MAP decoder, which increases with $f$; $A_{mem}$ is the total memory area, which increases with $N$ and also with $P$ due to the sub-banking overhead; and $A_{route}$ is the routing cost (crossbars plus interleavers), which increases with both $P$ and $f$. Note that the complexity of the full crossbar actually increases with $P^2$. We describe the decoder in Verilog and synthesize it on a 65 nm technology using Synopsys Design Compiler. The area tradeoff analysis is given in Fig. 8(a), which plots the normalized area versus throughput for different parallelism levels and clock rates (80-200 MHz in steps of 20 MHz). The power estimate is obtained from the product of area and frequency (actual power estimation is still in progress). This estimate is based on the dynamic power consumption $P_{dyn} = C f V^2$, where $V$ is assumed to be the same for all cases and $C$ is assumed to be proportional to the silicon area. Fig. 8(b) shows the power tradeoff analysis based on the area × $f$ metric.
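As a worked instance of Eq. (8), the following Python sketch evaluates the throughput estimate for the configuration quoted in the abstract and in Fig. 8 ($N = 6144$, $I = 6$, $W = 32$, $P = 32$, $f = 200$ MHz). The function name and argument names are assumptions introduced here.

```python
def turbo_throughput_bps(n: int, p: int, w: int, iterations: int, f_hz: float) -> float:
    """Throughput estimate of Eq. (8) for Radix-4 decoding, in bits per second."""
    n_r4 = n / 2           # N~ = N/2 trellis steps per block in Radix-4
    w_r4 = w / 2           # W~ = W/2 trellis steps per sliding window
    cycles = 2 * iterations * (n_r4 / p + 3 * w_r4)   # two half iterations each
    return n * f_hz / cycles


if __name__ == "__main__":
    rate = turbo_throughput_bps(n=6144, p=32, w=32, iterations=6, f_hz=200e6)
    print(f"{rate / 1e6:.1f} Mbps")
```

With these values each half iteration takes $3072/32 + 48 = 144$ cycles, so the estimate is $6144 \cdot 200 \times 10^6 / (12 \cdot 144) \approx 711$ Mbps, consistent with the figure reported for 32 MAP decoders.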
5.2. Comparison with related works
Table 3 compares the flexibility and performance with existing state-of-the-art turbo decoders. Fig. 8(a) compares the silicon area with [4][14][15][16], where the area is scaled and normalized to a 65 nm technology and the block size is scaled to 6144. In [2][4][18], programmable processors are proposed to provide multi-code flexibility. In [14][15], efficient ASIC architectures are presented for decoding simple binary turbo codes using the LogMAP algorithm. In [16], a high-speed turbo decoder architecture based on sub-optimal Constant-LogMAP SISO decoders (Radix-2) is discussed. Though the decoder in [16] achieves high throughput at a low silicon area, it has some limitations, e.g. the interleaver is not standard-compliant and it cannot support WiMax duo-binary turbo codes. As can be seen, our architecture exhibits both flexibility and efficiency in supporting multiple turbo codes (simple binary + double binary) and achieving high throughput.

Table 3. Comparison with existing turbo decoder architectures
Work       Architecture    Flexibility  MAP algorithm      Parallelism   Iterations  Frequency  Throughput
[2]        32-wide SIMD    Multi-code   Max-LogMAP         4 window      5           400 MHz    2.08 Mbps
[18]       Clustered VLIW  Multi-code   LogMAP/Max-LogMAP  Dual cluster  5           80 MHz     10 Mbps
[4]        ASIP-SIMD       Multi-code   Max-LogMAP         16 ASIP       6           335 MHz    100 Mbps
[14]       ASIC            Single-code  LogMAP             6 SISO        6           166 MHz    59.6 Mbps
[15]       ASIC            Single-code  LogMAP             7 SISO        6           160 MHz    75.6 Mbps
[16]       ASIC            Single-code  Constant-LogMAP    64 SISO       6           256 MHz    758 Mbps
This work  ASIC            Multi-code   LogMAP             32 SISO       6           200 MHz    711 Mbps
6. Conclusion
A flexible multi-code turbo decoder architecture is presented together with a low-complexity contention-free interleaver. The proposed architecture is capable of decoding simple and double binary turbo codes with a limited complexity overhead. A data rate of 10-711 Mbps is achievable with scalable parallelism. The architecture is capable of supporting multiple proposed 4G wireless standards.
References
[1] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo-Codes. In IEEE Int. Conf. Commun., pages 1064–1070, May 1993.
[2] Y. Lin et al. Design and implementation of turbo decoders for software defined radio. In IEEE Workshop on Signal Processing Design and Implementation (SIPS), pages 22–27, Oct. 2006.
[3] M.C. Shin and I.C. Park. SIMD processor-based turbo decoder supporting multiple third-generation wireless standards. IEEE Trans. on VLSI, vol. 15, pages 801–810, Jun. 2007.
[4] O. Muller, A. Baghdadi, and M. Jezequel. ASIP-Based multiprocessor SoC design for simple and double binary turbo decoding. In DATE'06, volume 1, pages 1–6, Mar. 2006.
[5] P. Robertson, E. Villebrun, and P. Hoeher. A comparison of optimal and sub-optimal MAP decoding algorithm operating in the log domain. In IEEE Int. Conf. Commun., pages 1009–1013, 1995.
[6] M. Bickerstaff et al. A 24Mb/s radix-4 logMAP turbo decoder for 3GPP-HSDPA mobile wireless. In IEEE Int. Solid-State Circuit Conf. (ISSCC), Feb. 2003.
[7] Y. Zhang and K.K. Parhi. High-throughput radix-4 logMAP turbo decoder architecture. In Asilomar Conf. on Signals, Syst. and Computers, pages 1711–1715, Oct. 2006.
[8] C. Zhan et al. An efficient decoder scheme for double binary circular turbo codes. In IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), pages 229–232, 2006.
[9] G. Masera, G. Piccinini, M. Roch, and M. Zamboni. VLSI architecture for turbo codes. IEEE Trans. VLSI Syst., volume 7, pages 369–379, 1999.
[10] P. Salmela, R. Gu, S.S. Bhattacharyya, and J. Takala. Efficient parallel memory organization for turbo decoders. In Proc. European Signal Processing Conf., pages 831–835, Sep. 2007.
[11] J. Sun and O.Y. Takeshita. Interleavers for turbo codes using permutation polynomials over integer rings. IEEE Trans. Inform. Theory, vol. 51, Jan. 2005.
[12] C. Berrou et al. Designing good permutations for turbo codes: towards a single model. In IEEE Conf. Commun., volume 1, pages 341–345, June 2004.
[13] Z. Wang and Q. Li. Very low-complexity hardware interleaver for turbo decoding. IEEE Trans. Circuit and Syst., vol. 54, pages 636–640, Jul. 2007.
[14] M. J. Thul et al. A scalable system architecture for high-throughput turbo-decoders. The Journal of VLSI Signal Processing, pages 63–77, 2005.
[15] B. Bougard et al. A scalable 8.7-nJ/bit 75.6-Mb/s parallel concatenated convolutional (turbo-) codec. In IEEE Int. Solid-State Circuit Conf. (ISSCC), Feb. 2003.
[16] G. Prescher, T. Gemmeke, and T.G. Noll. A parametrizable low-power high-throughput turbo-decoder. In IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages 25–28, Mar. 2005.
[17] S-J. Lee, N.R. Shanbhag, and A.C. Singer. Area-efficient high-throughput MAP decoder architectures. IEEE Trans. VLSI Syst., vol. 13, pages 921–933, Aug. 2005.
[18] P. Ituero and M. Lopez-Vallejo. New schemes in clustered VLIW processors applied to turbo decoding. In Proc. Int. Conf. on Application-specific Syst., Architectures and Processors (ASAP), pages 291–296, Sep. 2006.
