A Flexible UMTS-WiMax Turbo Decoder Architecture by Martina, Maurizio et al.
Politecnico di Torino
Porto Institutional Repository
[Article] A Flexible UMTS-WiMax Turbo Decoder Architecture
Original Citation:
Martina M.; Nicola M; Masera G (2008). A Flexible UMTS-WiMax Turbo Decoder Architecture. In:
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS. II, EXPRESS BRIEFS, vol. 55 n. 4, pp.
369-373. - ISSN 1549-7747
Availability:
This version is available at : http://porto.polito.it/1909067/ since: February 2009
Publisher:
IEEE
Published version:
DOI:10.1109/TCSII.2008.919510
Terms of use:
This article is made available under terms and conditions applicable to Open Access Policy Article
("Public - All rights reserved") , as described at http://porto.polito.it/terms_and_conditions.
html
Porto, the institutional repository of the Politecnico di Torino, is provided by the University Library
and the IT-Services. The aim is to enable open access to all the world. Please share with us how
this access benefits you. Your story matters.
(Article begins on next page)
1A Flexible UMTS-WiMax Turbo Decoder
Architecture
Maurizio Martina, Member IEEE, Mario Nicola, Guido Masera, Senior Member IEEE
Abstract—This work proposes a VLSI decoding architecture
for concatenated convolutional codes. The novelty of this ar-
chitecture is twofold: (i) the possibility to switch on-the-fly
from the UMTS turbo decoder to the WiMax duo-binary turbo
decoder with a limited resources overhead compared to a single
mode WiMax architecture, and (ii) the design of a parallel,
collision free WiMax decoder architecture. Compared to two
single mode solutions, the proposed architecture achieves a
complexity reduction of 17.1% and 27.3% in terms of logic
and memory respectively. The proposed, flexible architecture has
been characterized in terms of performance and complexity on
a 0.13 µm standard cell technology, and sustains a maximum
throughput of more than 70 Mb/s.
Index Terms—VLSI, Turbo Decoder, UMTS, WiMax
I. INTRODUCTION
In the last years several standards have been proposed for
reliable transmission of data over wireless channels (e.g. [1],
[2]). Besides this, in order to cope with severe transmission
environments, typical of wireless systems, channel codes ought
to be adopted. Turbo codes [3] are among the most performing
channel codes, and are still a major topic of interest in the
scientific literature. Recent works dealing with turbo decoder
implementation mainly focus on three aspects. 1) The design
of VLSI architectures for duo-binary turbo codes [4], [5], [6].
2) The design of flexible architectures able to support multiple
codes [7], [8], [9]. 3) The design of parallel decoders to sustain
very high throughput (tens or hundreds of Mb/s), where the
interleaver parallelization is particularly challenging, due to
the problem of collisions in memory access [10], [11], [12].
Though current scaled CMOS technologies allow to reach
clock frequencies of several hundreds of MHz, parallelization
is still an effective methodology to achieve high throughputs
and to approach the long term objective of 1 Gb/s in wire-
less communications. Furthermore, in high throughput ASIC
design, the adoption of lower frequency parallel architectures
instead of higher frequency serial ones is an effective method
to combat unreliability and reduce nonrecurrent costs.
This work presents a high performance turbo decoder archi-
tecture, which faces the parallelization, the flexibility and the
duo-binary implementation issues while keeping the complex-
ity as reduced as possible, and achieves a throughput of several
tens of Mb/s. Implementation results on a 0.13 μm standard
cell technology show that the complexity overhead required
to support both UMTS and WiMax is limited, compared to
The authors are with Dipartimento di Elettronica - Politecnico di Torino
- Italy. This work is partially supported by the MEADOW (MEsh ADaptive
hOme Wireless nets) project, founded by the italian government. Copyright (c)
2007 IEEE. Personal use of this material is permitted. However, permission
to use this material for any other purposes must be obtained from the IEEE
by sending an email to pubs-permissions@ieee.org.
a single mode WiMax decoder architecture. Moreover, the
proposed architecture yields noteworthy complexity reduction
figures compared to a dual mode architecture, where no
sharing technique is employed.
The rest of the paper is organized as follows. In section II
the decoding algorithm is briefly recalled. Section III presents
a reference architecture. In section IV the design of the
low complexity interleaver employed in our architecture is
addressed, whereas section V deals with the employed design
strategies. Finally, section VI shows the experimental results
and section VII draws some conclusions.
II. DECODING ALGORITHMS
Both the UMTS and the WiMax turbo codes are based on
the parallel concatenation of two 8-state convolutional codes
(CCs). However, the constituent code used in UMTS is a single
binary systematic CC, whereas that used in WiMax is a duo-
binary circular recursive systematic CC. At the decoder side,
the SISO (Soft In Soft Out) module [13] executes the BCJR
algorithm [14], usually in its logarithmic form [15]. Each SISO
module receives the intrinsic log-likelihood ratios (LLRs) of
coded symbols c from the channel and outputs the LLRs
of information symbols u. The two SISO modules exchange
extrinsic LLRs (λk[u]) by means of interleaving memories Π
and Π−1 (Fig. 1 (a)). The output extrinsic LLRs of symbol u
at the k-th step (λk[u;O]) are computed as:
λk[u;O] =
∗
max
e:u(e)=u
{b(e)} − ∗max
e:u(e)=u˜
{b(e)} − λk[u; I] (1)
where u˜ is an input symbol taken as a reference (usually u˜ =
0), e represents a certain transition on the trellis and u(e) is
the uncoded symbol u associated to e. The ∗max{xi} function
[15] is implemented as a max followed by a correction term
stored in a small Look-Up-Table (LUT) [16]. The correction
term, usually adopted when decoding binary codes, can be
omitted for duo-binary turbo codes [4].
It is worth pointing out that in binary turbo codes, at each
trellis step the SISO outputs only one extrinsic LLR, whereas
in duo-binary turbo codes the SISO produces three extrinsic
LLRs; thus, in general, the terms λk[u;O] and λk[u; I] are
vectors. The term b(e) in (1) is defined as:
b(e) = αk−1[sS(e)] + γk[e] + βk[sE(e)] (2)
αk[s] =
∗
max
e:sE(e)=s
{
αk−1[sS(e)] + γk[e]
} (3)
βk[s] =
∗
max
e:sS(e)=s
{
βk+1[sE(e)] + γk[e]
} (4)
γk[e] = πk[u(e); I] + πk[c(e); I] (5)
2α
k
α
β β
k−1 k
k−1 k
e
u(e),c(e)
s (e)
s (e)
S
E
u CC1
CC2
c
c1
c2
u1
u2
(a) (b)
λ[u1;O] λ[u2;I]
−1[u1;I]λ
[c2;I]λ
λ[u2;O]
λ[c1;I]
SISO1 SISO2
Π
Π
Π
Figure 1. Parallel Concatenated Convolutional Codes: coder and iterative
SISO based decoder (a), notation for the trellis section in the SISO (b)
where sS(e) and sE(e) are the starting and the ending states
of e, αk[sS(e)] and βk[sE(e)] are the forward and backward
metrics associated to sS(e) and sE(e) respectively (see Fig.
1 (b)). The πk[c(e); I] term in (5) is computed as a weighted
sum of the λk[c; I] produced by the soft demodulator as
πk[c(e); I] =
nc∑
i
ci(e)λk[ci(e); I] (6)
where ci(e) is one of the coded symbols associated to e and
nc is the number of bits forming a coded symbol. On the
other hand, we can write πk[u(e); I] = u(e)λk[u(e)] for a
binary turbo code, whereas for a duo-binary turbo code the
πk[u(e); I] terms are piece wise functions:
πk[u(e); I] =
⎧⎪⎪⎨
⎪⎪⎩
0 if u(e) = (‘0’, ‘0’)
λABk [u(e), I] if u(e) = (‘0’, ‘1’)
λABk [u(e), I] if u(e) = (‘1’, ‘0’)
λABk [u(e), I] if u(e) = (‘1’, ‘1’)
(7)
For further details on the theoretical aspects, the reader can
refer to [13].
III. REFERENCE ARCHITECTURE
The throughput of a turbo decoder (T ), defined as the
number of decoded bits (N ) over the time required to perform
the decoding operations (D), can be roughly estimated as
T =
N
D
= P
Nfclk
2I(N˜ + SISOlP )
(8)
where P is the number of SISOs instanced into the decoder,
I is the number of iterations, SISOl is the SISO latency, fclk
is the clock frequency and N˜ is equal to N or N/2 for binary
and duo-binary turbo decoders respectively. In a windowed
architecture, the SISO latency is directly related to the window
size W , as W clock cycles are required for computing both
the α and β values. If boundary metrics calculated at previous
iteration on the neighboring windows are used to initialize α
and β recursion, as suggested in [17], the SISO latency can be
estimated as SISOl=2W . Considering W=32 (typical value),
fclk=200MHz and I=8, a throughput T  12 Mb/s is obtained
with P = 1. This value is more than sufficient to achieve
the UMTS maximum throughput of 2Mb/s (block length
N = 5114). On the other hand the WiMax standard requires
a throughput close to 70 Mb/s for the maximum block length
Nc = N/2 = 2400. Considering the same parameters listed
above for UMTS, we obtain P ≥ 4. As a consequence, the
WiMax turbo decoder ought to be implemented as a parallel
architecture, for the same clock frequency f clk=200MHz.
A large part of the decoder area is devoted to the interleaver
memory and the SISO modules, so these blocks can be
effectively shared between the two decoders. The general SISO
architecture is shown in Fig. 3. The α processor implements
(3) and the β processor implements (4) on two consecutive
windows of data. Both the α and β processors compute in
parallel all the new State Metrics (SMs) (see the SM proc.
block in Fig. 3). Since the α processor works in direct order
on the input data, whereas the β processor elaborates them in
reverse order, two Branch Metrics Units (BMUs), are placed
before the α and β processors. Each BMU is devoted to com-
bine λk[u; I] and λk[c; I] and obtains in parallel the Branch
Metrics (BMs) γk associated to the k-th trellis section. As a
consequence a local buffer (BMU-MEM) is required to store
W extrinsic information and channel transition information
values. The λ − O processor generates the λk[u;O] values
according to (1) receiving the βk values directly from the β
processor and loading the αk values from a local buffer (α-
MEM).
It is known that in binary turbo code decoders LLRs
are commonly used. On the contrary, in duo-binary turbo
decoders the use of logarithmic probabilities (LP) instead of
LLRs allows to save a certain amount of logic in the SISO
architecture [5]. However, the use of 4 LPs instead of 3 LLRs
has a negative impact on both the interleaver memory and
the BMU-MEM footprint. In order to select the most suitable
approach, we implemented both the LLR based SISO (SISO-
LLR) and the logarithmic probabilities based SISO (SISO-LP)
in VHDL and synthesized them on a 0.13μm standard cell
technology. Moreover, we generated the dual port SRAMs to
implement the interleaver memory both for the SISO-LLR (2p-
LLR) and the SISO-LP (2p-LP) and the single port SRAMs to
implement the BMU-MEM as a “ping-pong” buffer for both
the cases (1p-LLR and 1p-LP). Fig. 2 shows the complexity
4 5 6 7 8
0
5
10
15
x 105
bit
μm
2
SISO−LLR
SISO−LP
2p−LLR
2p−LP
1p−LLR
1p−LP
decoder−LLR
decoder−LP
Figure 2. Complexity growth [μm2] of a duo-binary turbo decoder building
blocks as a function of the number of bits (bit) to represent the input data
growth of SISO-LLR, SISO-LP, 2p-LLR, 2p-LP, 1p-LLR and
3γ γ
k
β
k
k−1α k+1 β
k−1
k−1α
β
k+1
k
k+1α
γ
α/β
k
k α/βk
γk
α/βk α/β
[u;I]
λk[u;O]
AB
λk[u;O]
AB
λk
AB[u;O]
uk
n
o
rm
n
o
rm
αk−1
βk
8 9 10 117654γ γ γ γ3210 γ γ γ γ γ γ γ γ 12 13 14 15γ γ γ γ
λ [u,I]
AB
k λ [u,I]
AB
k λ [u,I]
AB
kλ
A
[c,I]k
λ
B
[c,I]k
λ
Y
[c,I]k
λ
W
k [c,I]
processor
u k
λ k[u;O] βout
αk−1 βk
βprv
β −LOC−MEM
max∗max
i[s ]i[s ]
[e](1) i [e](0) i[e]
(2)
i[e]
(3)
i
(3)
i[s ]
(2)
i[s ]
(1)
i[s ]
(0)
i[s ]
max
UMTS_WiMax_n
[0][0] [7][7]
−Oλ
γ k
λk[T]
AB
λk[T]
AB
λk
AB[T]
λk[u;I]
AB
λk[u;I]
AB
λk
AB
k k k k k k k k k k k
processor
α β
BMU−MEM
processor
−BMUα β
−Oλ
−MEMα
γγ
β −EXT−MEM−EXT−MEMα
α
λ k[u;I]
λ k[c;I]
α k k
kα
out
in
βk−1
βin
processor
PE0 PE7
(α/β,γ) (α/β,γ)k k
−BMU
BMU SM
proc.
PE
UMTS_WiMax_n
max
*
*
0
00
01
10
11
UMTS_WiMax_n
0
k k k k k
max
tree
max
tree
max
tree
max
tree
Figure 3. Flexible UMTS/WiMax SISO architecture: in the center the general SISO architecture and its building blocks, on the left the BMU structure, on
the top the SM processor structure, on the right the λ− O processor and the detail of the SM processor processing element (PE)
1p-LP in μm2 as a function of the number of bits (bit) used
to represent the LLRs or the LPs. The range explored in this
analysis (bit ∈ [4, 8]) shows that the SISO-LLR complexity is
slightly larger than that of the SISO-LP. However, the amount
of memory required by a SISO-LP based decoder increases
more than that of a SISO-LLR based one. In Fig. 2 the
complexity of a single SISO decoder, including the interleaver,
is also shown. Further experiments show that increasing P ,
the overhead required by the LP based decoder with respect
to the LLR based one decreases from 7.6% (P = 1) to
2.2% (P = 8). However, the LLR based decoder is still less
complex. As a consequence, in the following the LLR based
decoder architecture is described according to the formalism
introduced in section II. This choice implies that the BMU-
MEM contains W words, each word being made of 4 channel
LLRs (λk[c; I]) represented on nλc bits and 3 extrinsic LLRs
(λk[u; I]) represented on nλu bits.
IV. LOW COMPLEXITY INTERLEAVER DESIGN
Since the proposed architecture achieves the throughput
required by UMTS with a single SISO, the UMTS interleaver
parallelization is not addressed in this work. In order to reduce
as much as possible the complexity of the UMTS permutation
generator, we implemented the two step architecture detailed
in [18], which is very similar to that proposed in [19].
On the other hand a parallel decoder (P = 4) is required
to achieve the WiMax throughput with the assumed clock
frequency, fclk=200 MHz; as a consequence we designed the
parallel interleaver shown in Fig. 4. The permutation algorithm
specified in the WiMax standard is structured in two steps. The
first step switches λABk [u(e)] and λABk [u(e)] stored at odd
addresses leaving λABk [u(e)] un-moved (where λk[u(e)] can
be either λk[u(e), I] or λk[u(e), O]). The second step provides
the interleaved address i of the j-th λj [u(e)] triplet as
i = (P0 · j + P ′j ) mod Nc (9)
where P ′j ∈ {1, (1+Nc/2+P1), (1 +P2), (1 +Nc/2+P3)}
and Pj are constants depending on Nc, defined in [2] and [18].
The interleaver architecture can be simplified by rewriting (9)
as i = [(P0 · j) mod Nc + (P ′j mod Nc)] mod Nc as detailed
in [18].
In this work we consider that the throughput sustained by
the decoder scales with Nc, namely for short block lengths
a single SISO is active (e.g. T = 6.8Mb/s with Nc = 24).
When 192 ≤ Nc < 480, two SISOs are active (e.g. T =
30Mb/s with Nc = 192) and when 960 ≤ Nc ≤ 2400 all
the four SISOs are active (e.g. T = 78.95Mb/s with Nc =
960). Given P SISOs and P memories to interleave extrinsic
information, two different SISOs should not read from or write
to the same memory at the same time to avoid collisions. As
detailed in [18], the resulting parallel interleaver with variable
parallelism degree is a circular shifting interleaver [20], whose
implementation requires nearly the same complexity as the
non-parallel version. In fact, the collision free characteristic is
achieved by making the SISOs accessing at the same time the
same location of different memories.
V. FLEXIBLE UMTS/WIMAX SISO ARCHITECTURE
In the following paragraphs the solutions employed to share
the SISO architecture between UMTS and WiMax are detailed.
In Fig. 3 the logic employed in the UMTS mode is highlighted
with bold lines into the WiMax SISO building blocks.
a) BMU sharing - left side Fig. 3: due to the trellis sym-
metry, the WiMax BMU computes 16 possible BMs, whereas
the UMTS BMU only 4. Thus, starting from the equations
required to compute the 8 WiMax αk and βk SMs, we can
reduce the number of multiplexers to implement the UMTS
SMs. Since in the UMTS mode (UMTS WiMax n = ‘1’) 12
outputs of the BMU would not be used, we properly replicate
the 4 possible UMTS BMs so that the 2×16 input adders in
the α and β processor do not require to be multiplexed.
b) SMs processors sharing - bottom right side Fig. 3:
α
(j)
k [si] and β
(j)
k [si] are the j-th SMs connected to the i-th
state (si) at the k-th trellis step and γ (j)k [ei] the corresponding
BMs. The UMTS mode at the k-th trellis step requires the
4α out
β out β in
α in α out
β out β in
k+NWλ [u;O]P
uk λ k[u;O]
uk+3NW
λk+2NW λk+3NW
rdata−switch SISO−3
SISO−2
SISO−1
SISO−0EI−MEM0
EI−MEM1
EI−MEM2
EI−MEM3
wdata−switch
switch
couple
radx−switch
0 2 31
LIFO
generator
parallel
address
SISO−0 SISO−2 SISO−3
last_SISO
last_SISO last_SISO
UMTS_WiMax_n
SISO−1
[u;I]λk+NW[u;I]λk
uk+2NW
λ k+2NW [u;O]P
P
λ k+3NW
uk+NWP P P [u;O]P
P[u;I] [u;I]
adx
P
t
idxktNc
α init
β init
α in α out
β out β in
α in α out
β out β in
α in
Figure 4. WiMax parallel interleaver architecture, in the shaded part SISO
connection and inter SISO SM exchange network are highlighted
i-th PE to combine 2 SMs and 2 BMs to produce a new
SM, whereas 4 SMs and 4 BMs are required in the WiMax
mode. The ∗max function shared by the UMTS and the WiMax
decoder is implemented as a programmable two input ∗max.
Since the UMTS turbo code achieves excellent performance
even with a 1 bit correction (3 positions LUT [16]), the 1
bit correction can be exploited to substitute the last adder in
the standard Add-Compare-Select-Offset implementation with
a simpler, programmable increment.
c) λ − O processor sharing - top right side Fig. 3: the
λ − O processor input stage is made of two normalization
blocks (norm) devoted to subtracting the 0-state (s0) from the
others, αk(si)− αk(s0) and βk(si)− βk(s0).
The normalized SMs, combined with γk, become the inputs
of the ∗max trees (two ∗max trees for UMTS and four ∗max trees
for WiMax). The ∗max tree output referred to u˜ is subtracted
from the others; then, subtracting the corresponding λ k[u; I],
the output extrinsic LLRs are obtained. To ease the decoded
bits generation the hard decision circuit is embedded into
the λ − O processor. For a binary turbo decoder it can be
implemented taking the sign of λk[T ] = λk[u; I] + λk[u;O].
On the other hand for a duo-binary turbo decoder the hard
decision is selected as the couple with maximum LLR in
{0, λABk [T ], λABk [T ], λABk [T ]}, as shown in Fig. 3.
d) SM exchange network sharing: to grant a windowed
elaboration, the α-MEM contains W words, each word being
made of 8 SMs, represented on nSM bits. As stated in section
III the SISO complexity and latency can be reduced [17]
implementing a β metrics inheritance strategy at the expenses
of additional memory. Given the number of windows per
SISO (NWP ), a NWP -1 words local memory (β-LOC-MEM)
stores the SMs at the boundary of two consecutive windows
(βprv). Each word is made of 8 SMs, each of which is
represented on nSM bits. Moreover, every SISO requires two
8 SMs values to initialize its trellis portion (αin and βin).
This architecture is suited to a single SISO decoder, where
only intra SISO SMs inheritance is required. However, in a
Table I
TURBO DECODER ARCHITECTURES COMPARISON: SYMBOL MODE (S) CAN
BE BINARY (B) OR DUO-BINARY (D), CMOS TECHNOLOGY PROCESS
(TP), LOGIC (L), MEMORY (M), CLOCK FREQUENCY (fclk ) AND
THROUGHPUT (T )
Architecture S TP L M fclk T
[μm] [kgate] [kbit] [MHz] [Mb/s]
[24] B 0.18 410 450 145 24
[25] B 0.18 121 - 285 27.6
[22] B 0.5 75 390 95 3.8
This work B 0.13 75 70.9 200 12
[23] D - ∼480 ∼713 ∼200 -
This work D 0.13 171 133.6 200 90.36
[7]-I B/D 0.09 97 - 335 7.4
[7]-II B/D 0.09 1552 - 335 100
This work B/D 0.13 204 148.6 200 90.36
parallel SISO decoder, inter SISO SMs inheritance is required
to properly initialize trellis slices of different SISOs. This can
be achieved by inserting two 2-position shift registers (α-EXT-
MEM and β-EXT-MEM) to exchange the αout and βout SMs
with the neighboring SISOs. As depicted in Fig. 4, a simple
network allows to properly exchange the boundary SMs among
the different SISOs considering that in the UMTS mode the
trellis starting and termination SMs are fixed (αinit and βinit)
whereas in the WiMax mode they are estimated as explained
in section III. Depending on which is the last SISO active
(last SISO) the SMs ought to be inherited from a different
SISO.
VI. IMPLEMENTATION, THROUGHPUT AND LATENCY
According to the literature [16], [21], nλc= 6, nλu= 8 and
nSM = 12 have been chosen as a significant, conservative
case for both UMTS and WiMax. Synthesis results on a
0.13 μm standard cell technology show that the proposed,
flexible UMTS/WiMax architecture requires about 204 kgates.
The single mode WiMax architecture requires about 171
kgates, the UMTS one described in [22], similar to the one
employed in this work, requires 75 kgates. So the combination
of the two single mode decoders leads to 246 kgates: the
proposed solution is 17.1% less complex. As stated in section
III, memory sharing and on-the-fly generation of scrambled
addresses grant a large area saving. This is confirmed by the
actual memory requirements: the P = 4 WiMax decoder
requires 133.6 kbits, whereas the UMTS decoder requires
70.9 kbits. As a consequence the two architectures require
204.5 kbits. The proposed solution with memory sharing
requires only 148.6 kbits, thus it grants a memory saving of
27.3% and a total area saving of 27.7% compared to the two
single mode architectures. In Table I the proposed architecture
is compared to some binary and duo-binary turbo decoder
architectures. The proposed dual mode architecture shows
excellent performance and complexity figures compared both
to a fixed implementation [23] and to a programmable solution
[7] ([7]-I refers to the single processor solution, whereas [7]-II
is related to the 16 processor architecture).
As it can be inferred from Table I the proposed architecture
achieves a throughput higher than specified by the WiMax
standard. This implies that enough processing power is avail-
able for the concurrent execution of the UMTS and WiMax
5Table II
THROUGHPUT/LATENCY
Architecture T [Mb/s] D [μs]
Single standard UMTS 12 414
Single standard WiMax 90.36 53
Concurrent UMTS 2 2400
Concurrent WiMax 75 2400
decoding. Of course external buffers must be available to
receive an UMTS frame while WiMax decoding is in progress
and viceversa. In the following we prove that the proposed
architecture can support the concurrent execution of the UMTS
and WiMax decoding. The time required to decode K blocks
of N bits with the proposed architecture is
Θx(K) = K
N
Tx
(10)
where Tx is the throughput of the proposed architecture and x
can be either U for UMTS or W for WiMax. As a consequence
Θtot(K) = ΘW (K) + ΘU (1) (11)
is the total time required to decode the hybrid sequence of K
WiMax blocks and 1 UMTS block. Concurrent decoding can
be sustained only if the throughputs required (Φ) by UMTS
and WiMax are achieved for maximum block length:
K
NW
Θtot
≥ ΦW NUΘtot ≥ ΦU (12)
where NW =4800 (WiMax) and NU=5114 (UMTS). Substitut-
ing (10) and (11) in (12) we obtain⌈
ΦWTW
TU (TW − ΦW )
⌉
≤ K ≤
⌊
TW
ΦU
− TW
TU
⌋
= KM (13)
The final choice for K has been made taking ΦU=2 Mb/s and
solving (13) for the maximum ΦW ; this results in KM=37 and
ΦW =75 Mb/s. The concurrent decoding also affects latency.
The proposed architecture latency can be obtained from (8) as
D =
2I(N˜ + SISOlP )
Pfclk
(14)
namely DU 414 μs and DW 53 μs. Thus, the total latency
to decode KM WiMax blocks and 1 UMTS block is
Dtot =
2I
fclk
[
KMNc
P
+ N + SISOl(1 + KM )
]
(15)
In the worst case (Nc=2400 and N=5114) we obtain D tot 
2.4 ms, which is a small percentage of the global latency
specified by both UMTS and WiMax standards. The single
and multi-standard figures of throughput and latency offered
by the presented architecture are summarized in Table II.
VII. CONCLUSION
In this paper a flexible UMTS/WiMax turbo decoder ar-
chitecture has been presented together with a parallel WiMax
interleaver architecture. Compared to a single mode, paral-
lel WiMax architecture the proposed one exhibits a limited
complexity overhead. Moreover, compared to a separated dual
mode UMTS/WiMax turbo decoder architecture, it achieves
the 17.1% logic reduction and the 27.3% memory reduction.
REFERENCES
[1] “Universal mobile telecommunications system (UMTS), multiplexing
and channel coding (TDD),” Mar. 2007, 3GPP TS 25.222 version 7.2.0.
[2] “IEEE Std 802.16, part 16: air interface for fixed broadband wireless
access systems,” Oct. 2004.
[3] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error
correcting coding and decoding: Turbo codes,” in IEEE ICC, 1993, pp.
1064–1070.
[4] C. Berrou, M. Jezequel, C. Douillard, and S. Kerouedan, “The advan-
tages of non-binary turbo codes,” in IEEE Inf. Theory Workshop, 2001,
pp. 61–63.
[5] C. Zhan, T. Arslan, A. T. Erdogan, and S. MacDougall, “An efficient
decoder scheme for double binary circular turbo codes,” in IEEE
ICASSP, 2006, pp. 229–232.
[6] S. Papaharalabos, P. Sweeney, and B. Evans, “Constant log-MAP
decoding algorithm for duo-binary turbo codes,” IET Electronics Letters,
vol. 42, no. 12, pp. 709–710, Jun 2006.
[7] O. Muller, A. Baghdadi, and M. Jezequel, “ASIP-baser multiprocessor
SOC design for simple and double binary turbo decoding,” in DATE,
2006, pp. 1330–1335.
[8] T. Vogt and N. Wehn, “A reconfigurable applcation specific instruction
set processor for Viterbi and Log-MAP decoding,” in IEEE Workshop on
Signal Proc. Systems Design and Implementation, 2006, pp. 142–147.
[9] M. C. Shin and I. C. Park, “SIMD processor-based turbo decoder
supporting multiple third-generation wireless standards,” IEEE Trans.
on VLSI, vol. 15, no. 7, pp. 801–810, Jul 2007.
[10] M. J. Thul, N. Wehn, and L. P. Rao, “Enabling high-speed turbo-
decoding through concurrent interleaving,” in IEEE ISCAS, 2002, pp.
897–900.
[11] F. Speziali and J. Zory, “Scalable and area efficient concurrent interleaver
for high throughput turbo-decoders,” in IEEE Euromicro Symposium on
Digital System Design, 2004, pp. 334–341.
[12] A. Tarable, L. Dinoi, and S. Benedetto, “Design of prunable interleavers
for parallel turbo decoder architectures,” IEEE Comm. Letters, vol. 11,
no. 2, pp. 167–169, Feb 2007.
[13] S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, “Soft-input soft-
output modules for the construction and distributed iterative decoding
of code networks,” European Trans. on Telecom., vol. 9, no. 2, pp. 155–
172, Mar/Apr 1998.
[14] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear
codes for minimizing symbol error rate,” IEEE Trans. on Inf. Theory,
vol. 20, no. 3, pp. 284–287, Mar 1974.
[15] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of optimal and
sub-optimal MAP decoding algorithms operating in the Log domain,”
in IEEE ICC, 1995, pp. 1009–1013.
[16] G. Montorsi and S. Benedetto, “Design of fixed-point iterative decoders
for concatenated codes with interleavers,” IEEE Journal on Selected
Areas in Comm., vol. 19, no. 5, pp. 871–882, May 2001.
[17] A. Abbasfar and K. Yao, “An efficient and practical architecture for high
speed turbo decoders,” in IEEE VTC, 2003, pp. 337–341.
[18] M. Martina, M. Nicola, and G. Masera, “Low complexity
UMTS and WiMax interleavers design,” Politecnico di Torino,
Dipartimento di Elettronica, Tech. Rep., 2007, available at:
http://www.vlsilab.polito.it/∼martina.
[19] Z. Wang and Q. Li, “Very low-complexity hardware interleaver for turbo
decoding,” IEEE Trans. on Circuits and Systems II, vol. 54, no. 7, pp.
636–640, Jul 2007.
[20] S. Dolinar and D. Divsalar, “Weight distributions for turbo codes using
random and nonrandom permutations,” TDA Progress Report, vol. 42-
122, pp. 56–65, Aug 1995.
[21] A. P. Hekstra, “An alternative to metric rescaling in Viterbi decoders,”
IEEE Trans. on Comm., vol. 37, no. 11, pp. 1220–1222, Nov 1989.
[22] G. Masera, G. Piccinini, M. Ruo-Roch, and M. Zamboni, “VLSI
architectures for turbo codes,” IEEE Trans. on VLSI, vol. 7, no. 3, pp.
369–379, Sep 1999.
[23] A. Bartolazzi, G. Cardarilli, A. Del-Re, D. Giancristofaro, and M. Re,
“Implementation of DVB-RCS turbo decoder for satellite on-board
processing,” in IEEE Intern. Conf. on Circuits and Systems for Comm.,
2002, pp. 142–145.
[24] M. Bickerstaff, L. Davis, C. Thomas, D. Garrett, and C. Nicol, “A
24Mb/s radix-4 LogMAP turbo decoder for 3GPP-HSDPA mobile
wireless,” in IEEE Intern. Solid State Circuits Conf., 2003, pp. 150–
151.
[25] S. J. Lee, N. R. Shanbhag, and A. C. Singer, “A 285-MHz pipelined
MAP decoder in 0.18-μm CMOS,” IEEE Jour. of Solid-State Circuits,
vol. 40, no. 8, pp. 1718–1725, Aug 2005.
