Hardware Design of a Low Complexity, Parallel Interleaver for WiMax Duo-Binary Turbo Decoding by Martina, Maurizio et al.
Maurizio Martina, Member IEEE, Mario Nicola, Guido Masera, Senior Member IEEE
Abstract—A low complexity, parallel, collision-free interleaver
architecture for the WiMax duo-binary turbo decoder is pre-
sented. The proposed architecture dynamically adapts to different
block sizes and it features reduced complexity resorting to
parallel circular shifting interleavers. Moreover, it sustains a peak
throughput of nearly 90 Mb/s with a 200 MHz clock frequency,
when synthesized on a 0.13 µm standard cell technology.
Index Terms—Parallel interleaver, hardware, VLSI
I. INTRODUCTION
Turbo codes are employed in several standards for wireless
communications, such as WiMax [1]. When high transmission
throughputs are required, parallel decoder architectures are
needed to meet application speed constraints while keeping
the clock frequency limited to a few hundred MHz. A
parallel turbo decoder is basically structured as M processing
elements (PE) and M memories. Each PE plays the role of
a Soft In Soft Out (SISO) module on a given window of
data, whereas memories are used for exchanging extrinsic
information among SISOs. Since WiMax resorts to a duo-
binary turbo decoder [2], the encoder processes Nc couples of
information bits u = {A,B} with A,B ∈ {0, 1}, whereas the
decoder works on Nc triplets of logarithmic likelihood ratios
(LLRs) $\lambda[u] = (\lambda^{A\bar{B}}[u], \lambda^{\bar{A}B}[u], \lambda^{AB}[u])$, where $\tilde{u} = \{\bar{A},\bar{B}\}$
is taken as the reference symbol. The decoding process is
iterative: in the in-order half iteration, the n-th SISO accesses
only the n-th memory, whereas in the scrambled half iteration
the n-th SISO reads from and writes to different memories. A
collision occurs when two or more SISOs try to simultaneously
access the same memory.
Parallel circular shifting interleavers are intrinsically
collision-free [3] so they do not require further memory
and logic to avoid collisions. In this letter, we propose a
collision-free, low complexity, parallel architecture supporting
the interleaving laws specified for the WiMax duo-binary
turbo code [1]. Collisions are avoided by means of a variable
parallelism architecture where M is chosen to guarantee that the
resulting parallel interleaver is circular shifting. To the best
of our knowledge this is the first work concerning the VLSI
implementation of a collision-free parallel interleaver for the
WiMax turbo decoder.
II. PROPOSED ARCHITECTURE
The permutation algorithm specified in [1] is structured in
two steps. The first step switches $\lambda^{A\bar{B}}[u]$ and $\lambda^{\bar{A}B}[u]$ stored
at odd addresses, leaving $\lambda^{AB}[u]$ unmoved.

The authors are with Dipartimento di Elettronica, Politecnico di Torino,
Italy. This work is partially supported by the MEADOW (MEsh ADaptive
hOme Wireless nets) project, funded by the Italian government.

Figure 1. WiMax parallel interleaver address generator architecture

The second step
provides the interleaved address of the j-th triplet (λj [u]) as
$$\Pi(j) = (P_0 \cdot j + K_j) \bmod N_c, \qquad j = 0, 1, \ldots, N_c - 1 \tag{1}$$
where
$$K_j = \begin{cases} 1 & \text{if } j \bmod 4 = 0 \\ 1 + N_c/2 + P_1 & \text{if } j \bmod 4 = 1 \\ 1 + P_2 & \text{if } j \bmod 4 = 2 \\ 1 + N_c/2 + P_3 & \text{if } j \bmod 4 = 3 \end{cases}$$
and P0, P1, P2, P3 are constants
that depend only on the number of couples Nc [1]. Since the
two steps can be swapped, the first step can be performed
on the fly using Π(j) least significant bit (LSB[Π(j)]) as a
selector. The implementation of (1) can be derived as follows:
if x ∈ [0, 2 ·Nc−1], x mod Nc can be implemented by means
of a subtracter and a multiplexer. Unfortunately, P0 · j + Kj
is not guaranteed to belong to [0, 2 · Nc − 1]. As a consequence,
several x mod Nc blocks ought to be cascaded to obtain Π(j).
However, (1) can be rewritten as
$$\Pi(j) = \{[(P_0 \cdot j) \bmod N_c] + (K_j \bmod N_c)\} \bmod N_c \tag{2}$$
Given that a small Look-Up-Table (LUT) is employed to store
P0 and the Kj mod Nc terms, (2) can be implemented by
two parts as depicted in Fig. 1 (shaded gray box). The first
part accumulates P0 to implement the P0 · j term and the
mod Nc block produces the correct modulo Nc result. Since
j is a counter (j−cnt in Fig. 1), (P0 ·j) mod Nc is generated
in one clock cycle adding P0 to [P0 · (j − 1)] mod Nc and
performing the modulo operation. The second part employs
the two least significant bits of the j−cnt counter to select the
proper Kj mod Nc value, which is added to the (P0 · j) mod
Nc term. A further modulo Nc operation is performed at the
output. Since in this architecture both the first and the second
part work on data belonging to [0, 2 ·Nc− 1], all the modNc
operations are implemented by means of a subtracter and a
multiplexer (dark-shaded gray box in Fig. 1).
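The accumulate-and-correct scheme described above can be illustrated with a short Python sketch. It is a behavioural model only, not the authors' VHDL. The function names are invented for this example, and the parameter tuple (Nc, P0, P1, P2, P3) = (24, 5, 0, 0, 0) is an assumption that should be checked against the table in [1]. The sketch generates Π(j) with additions and subtract-and-select stages only, and compares the result with a direct evaluation of (1).

```python
def mod_sub_mux(x, nc):
    """Modulo Nc for x in [0, 2*Nc - 1]: one subtraction and one selection."""
    assert 0 <= x < 2 * nc
    diff = x - nc
    return diff if diff >= 0 else x      # the sign of diff drives the multiplexer


def k_table(nc, p1, p2, p3):
    """K_j mod Nc for j mod 4 = 0..3, as it would be stored in the LUT."""
    return [k % nc for k in (1, 1 + nc // 2 + p1, 1 + p2, 1 + nc // 2 + p3)]


def serial_interleaver(nc, p0, p1, p2, p3):
    """Yield Pi(j), j = 0..Nc-1, following (2) without any multiplication."""
    assert 0 < p0 < nc
    k_lut = k_table(nc, p1, p2, p3)
    acc = 0                               # holds (P0 * j) mod Nc
    for j in range(nc):
        # the two LSBs of the j counter select the K_j mod Nc entry
        yield mod_sub_mux(acc + k_lut[j & 3], nc)
        acc = mod_sub_mux(acc + p0, nc)   # (P0 * (j + 1)) mod Nc for the next cycle


def direct_interleaver(nc, p0, p1, p2, p3):
    """Reference model: equation (1) with a full multiplication and modulo."""
    k = (1, 1 + nc // 2 + p1, 1 + p2, 1 + nc // 2 + p3)
    return [(p0 * j + k[j % 4]) % nc for j in range(nc)]


# (Nc, P0, P1, P2, P3): assumed values for illustration, to be checked against [1]
PARAMS_24 = (24, 5, 0, 0, 0)

if __name__ == "__main__":
    pi = list(serial_interleaver(*PARAMS_24))
    print("matches (1):", pi == direct_interleaver(*PARAMS_24))
    print("is a permutation:", sorted(pi) == list(range(PARAMS_24[0])))
```

Both modulo stages receive operands smaller than 2·Nc because the accumulator, P0 and the stored Kj mod Nc values are each smaller than Nc, which is what makes the single subtraction sufficient.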
In the WiMax HUMAN-OFDM profile for 10 MHz channelization [1]
the worst case downlink throughput is $\hat{T}_{dl} \simeq 65$
Mb/s. The decoder throughput can be estimated as the number
of decoded bits (2Nc) over the time required to perform the
decoding operations:
$$T = \frac{2N_c \cdot f_{clk}}{2I\left(\frac{N_c}{M} + SISO_l\right)} \tag{3}$$
where 2I is the number of half iterations, fclk is the clock
frequency and SISOl is the SISO latency. We adopt a
sliding window based approach where boundary metrics are
inherited from one iteration to the next one as proposed in
[4]. This yields SISOl = 2W (W is the window size).
Assuming W=32 [5], I=8 and fclk=200 MHz, we estimate the
throughput of the decoder for the 17 possible values of Nc [1].
As shown in Fig. 2, M=3 achieves $\hat{T}_{dl}$ (horizontal
solid line) only for Nc ≥1440, whereas with M=4 it can be
reached for Nc >500 (i.e. Nc ≥960, the next specified size).
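The estimate in (3) is easy to reproduce numerically. The following Python sketch is an illustration under the assumptions stated in the text (SISOl = 2W, I = 8, fclk = 200 MHz); the function name and the default arguments are choices made for this example.

```python
def throughput_mbps(nc, m, w, iterations=8, fclk_mhz=200.0):
    """Equation (3) with SISO latency 2*W; fclk in MHz gives the result in Mb/s."""
    siso_latency = 2 * w
    cycles = 2 * iterations * (nc / m + siso_latency)   # 2I half iterations
    return 2 * nc * fclk_mhz / cycles


# The 17 block sizes Nc listed in Table I
NC_VALUES = (24, 36, 48, 72, 96, 108, 120, 144, 180,
             192, 216, 240, 480, 960, 1440, 1920, 2400)

if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        row = [round(throughput_mbps(nc, m, w=32), 1) for nc in NC_VALUES]
        print(f"M={m}:", row)
```

With W=32, the sketch gives 62.5 Mb/s for M=3 at Nc=960 and about 66.2 Mb/s at Nc=1440, in line with the statement that M=3 reaches the target only for Nc ≥ 1440.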
Figure 2. Parallel decoder throughput T [Mb/s] as a function of Nc for different
parallelism degree values M (curves for M=1, M=2, M=3, M=4 and the proposed
solution; W=32 for dashed lines, W as in Table I for solid curve)
To satisfy the throughput requirement and to avoid high cost
inter-SISO communication structures, a parallel collision-free
interleaver is advisable. According to [3], a circular shifting
interleaver is defined as Π(j) = (a · j + r) mod N where N
is the block size, r < N is an offset and a < N is a step size
that is relatively prime to N . Comparing this definition with
(1), it is clear that the WiMax interleaver could be circular
shifting with a = P0, r = Kj and N = Nc. As detailed
in [6], parallel collision-free, circular shifting interleavers are
obtained imposing
$$\Pi\left(j + k \cdot \frac{N_c}{M}\right) = \left[\Pi(j) \pm k \cdot \frac{N_c}{M}\right] \bmod N_c \tag{4}$$
with j = 0, 1, ..., Nc/M − 1 and k = 0, 1, ..., M − 1. Substituting
(1) into (4) we obtain the conditions required to ensure that
the WiMax interleaver is collision-free for a given parallelism
degree M. Given that Nc/M ∈ ℕ, the first condition is defined
by (5). Further conditions must be added depending on Nc/M
parity: if Nc/M mod 4 ≠ 0 the conditions required are (5) and
(6). Finally, if also Nc/M mod 2 = 1, (5), (6) and (7) must be
simultaneously satisfied:
$$(P_0 \mp 1) \bmod M = 0 \tag{5}$$
$$P_2 \bmod N_c = 0, \qquad |P_3 - P_1| \bmod N_c = 0 \tag{6}$$
$$F(P_1, P_3, |P_2 - P_1|, |P_3 - P_1|) = [0\ 0\ 0\ 0] \tag{7}$$
where $F(x_0, \ldots, x_3) = [f(x_0) \ldots f(x_3)]$ with $f(x) = (2x + N_c) \bmod 2N_c$.
Let us introduce I as the set of the 17 possible Nc values
specified by the WiMax standard [1]. Given Nc ∈ I and the
corresponding P0, P1, P2, P3, we find which M ∈ {2, 3, 4}
yields a parallel, collision-free interleaver. It is
worth pointing out that all the possible P0 specified in [1]
satisfy (5) with M ∈ {2, 3, 4}. As a consequence, all the
configurations where Nc/M mod 4 = 0 with M ∈ {2, 3, 4}
and Nc ∈ I correspond to a parallel collision-free interleaver.
This is the case of M=3, which leads to parallel collision-free
interleavers for every Nc ∈ I.
When M=2, we have Nc/M mod 4 ≠ 0 for Nc ∈ I' = {36, 108, 180}.
In this case Nc/M mod 2 = 0, so that only (6)
must be checked. The only value Nc ∈ I' not satisfying (6)
is Nc=108, which leads to collisions.
When M=4, we have Nc/M mod 4 ≠ 0 for Nc ∈ I'' = {24, 72, 120, 216}
and Nc ∈ I''' = {36, 108, 180}, namely Nc/M mod 2 = 0 with
Nc ∈ I'' and Nc/M mod 2 = 1 with Nc ∈ I'''. Since (6) is verified
with M=4 and Nc ∈ I'', these
configurations lead to parallel collision-free interleavers. On
the other hand, when M=4 and Nc ∈ I''' both (6) and (7) must
be satisfied to obtain parallel collision-free interleavers. The
only Nc ∈ I''' leading to collisions is Nc=108, which satisfies
neither (6) nor (7).
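The outcome of this analysis can be cross-checked by brute force. The Python sketch below assumes the usual mapping in which SISO-k processes the k-th window of Nc/M consecutive indices and memory n stores addresses n·Nc/M to (n+1)·Nc/M − 1. The parameter tuples for Nc=24 and Nc=108 are assumed to match the constants in [1] and should be verified against the standard. The function names are invented for this example.

```python
def interleaver(nc, p0, p1, p2, p3):
    """Pi(j) of equation (1) for j = 0..Nc-1."""
    k = (1, 1 + nc // 2 + p1, 1 + p2, 1 + nc // 2 + p3)
    return [(p0 * j + k[j % 4]) % nc for j in range(nc)]


def collision_free(pi, m):
    """True if, at every step j, the M SISOs address M distinct memories."""
    nc = len(pi)
    if nc % m:
        return False                      # Nc/M must be an integer
    win = nc // m
    for j in range(win):
        # memory index targeted by SISO-k, which works on index j + k*win
        banks = {pi[j + k * win] // win for k in range(m)}
        if len(banks) != m:
            return False
    return True


# Nc -> (P0, P1, P2, P3): assumed values, to be checked against [1]
PARAMS = {24: (5, 0, 0, 0), 108: (11, 54, 56, 2)}

if __name__ == "__main__":
    for nc, p in PARAMS.items():
        pi = interleaver(nc, *p)
        print(nc, {m: collision_free(pi, m) for m in (2, 3, 4)})
```

Under these assumptions the check reports collisions for Nc=108 with M=2 and M=4 and none with M=3, which agrees with the discussion above.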
In this work a parallel, collision-free interleaver is obtained
by selecting M as a function of Nc, and in particular
M ≠ 2 and M ≠ 4 when Nc=108. Since the resulting
interleaver is a parallel circular shifting interleaver, we can
write
$$\left[\Pi(j) \pm k \cdot \frac{N_c}{M}\right] \bmod N_c = idx_j^k \cdot \frac{N_c}{M} + adx_j ,$$
where $adx_j = \Pi(j) \bmod \frac{N_c}{M}$, $idx_j^k = (idx_j^0 \pm k) \bmod M$ and
$idx_j^0 = (\Pi(j) - adx_j)/(N_c/M)$, as follows from setting k = 0 in the
identity above. This results in a simplified address
generation with all SISOs simultaneously accessing the same
location ($adx_j$) of different memories, where $idx_j^k$ is the
memory accessed by SISO-k at time j during a scrambled half
iteration. Thus, the parallel interleaver is obtained by cascading a
simple block to the serial interleaver architecture. This block
extracts $adx_j$ and $idx_j^k$ from the first Nc/M values of Π(j):
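$$adx_j = \begin{cases} \Pi(j) & \text{if } \Pi(j) \in \left[0, \frac{N_c}{M} - 1\right] \\ \Pi(j) - \frac{N_c}{M} & \text{if } \Pi(j) \in \left[\frac{N_c}{M}, 2\frac{N_c}{M} - 1\right] \\ \ldots \\ \Pi(j) - (M-1)\frac{N_c}{M} & \text{if } \Pi(j) \in \left[(M-1)\frac{N_c}{M}, N_c - 1\right] \end{cases} \tag{8}$$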
whose straightforward implementation needs to calculate
Nc/M and to allocate M − 2 multipliers, M − 1 subtracters,
an M-way multiplexer and some logic for selecting the proper
adxj value. The Nc/M division can be simplified by choosing
the possible M values as powers of two. Thus, in the proposed
throughput/parallelism scalable decoder, we employ: M=1
when Nc ≤ 108, M=2 when 120 ≤ Nc ≤ 480 and M=4
when 960 ≤ Nc ≤ 2400.

Table I. Parallelism degree (M), window size (W), number of windows
per SISO (NMW) and throughput (T) in Mb/s achieved by the
WiMax turbo decoder parallel architecture for the 17 Nc values

Nc     24    36    48    72    96   108   120   144   180
M       1     1     1     1     1     1     2     2     2
W      24    36    48    36    48    36    60    36    45
NMW     1     1     1     2     2     3     1     2     2
T     8.3   8.3   8.3  12.5  12.5    15  16.7    25    25

Nc    192   216   240   480   960  1440  1920  2400
M       2     2     2     2     4     4     4     4
W      48    36    40    48    40    40    40    40
NMW     2     3     3     5     6     9    12    15
T      25    30    30  35.7    75  81.8  85.7  88.2

As it can be inferred from Fig. 1,
multiplications are avoided resorting to simple shift operations
(x >> i = x/2^i). The sign of the subtractions (dashed
lines in Fig. 1) allows not only to select the correct $adx_j$ but
also to find $idx_j^0$. Then, the $idx_j^k$ values are generated with
M − 1 modulo M adders/subtracters. Moreover, the proposed
decoder guarantees that Nc/M is even, so that LSB[$adx_j$] is
used to switch $\lambda^{A\bar{B}}[u]$ and $\lambda^{\bar{A}B}[u]$ stored at odd addresses
(see the switch couple signal in Fig. 3). We also impose
Nc/(M ·W ) ∈ N, which implies that each SISO processes the
same number (NMW ) of windows in a data frame. Of course,
proper W and M values must be selected for each frame size,
as reported in Table I, where NMW and the resulting throughput
T , obtained from post synthesis simulations with Modelsim
(bold line in Fig. 2), are also given. This solution requires the
component decoder to support different sliding window sizes
and NMW values. However, resorting to the SISO architecture
detailed in [7], different W and NMW values are supported with
a negligible complexity overhead using two programmable
counters in the SISOs control unit.
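The address split of (8) and the generation of the M memory indices can be sketched in Python as follows. This is a behavioural illustration, not the circuit of Fig. 1. The helper names are invented, the sign of the ±k term is derived here from condition (5) as an assumption of the sketch, and the parameters (Nc, P0, P1, P2, P3) = (24, 5, 0, 0, 0) are assumed values to be checked against [1]. Note that the proposed decoder uses M=1 for this block size; M=4 is used here only to keep the example small.

```python
def interleaver(nc, p0, p1, p2, p3):
    """Pi(j) of equation (1) for j = 0..Nc-1."""
    k = (1, 1 + nc // 2 + p1, 1 + p2, 1 + nc // 2 + p3)
    return [(p0 * j + k[j % 4]) % nc for j in range(nc)]


def split_address(pi_j, nc, m):
    """Equation (8) by compare-and-subtract: returns (adx_j, idx0_j)."""
    win = nc // m                    # Nc/M; a shift when M is a power of two
    adx, idx0 = pi_j, 0
    for i in range(1, m):
        diff = pi_j - i * win        # one of the M-1 subtractions
        if diff >= 0:                # the last non-negative result is selected
            adx, idx0 = diff, i
    return adx, idx0


def parallel_addresses(pi_j, nc, m, p0):
    """Return adx_j and the list of idx_j^k for k = 0..M-1."""
    assert (p0 - 1) % m == 0 or (p0 + 1) % m == 0   # condition (5)
    sign = 1 if (p0 - 1) % m == 0 else -1           # assumption of this sketch
    adx, idx0 = split_address(pi_j, nc, m)
    return adx, [(idx0 + sign * k) % m for k in range(m)]


if __name__ == "__main__":
    nc, p0, m = 24, 5, 4             # assumed parameters, see lead-in
    pi = interleaver(nc, p0, 0, 0, 0)
    win = nc // m
    for j in range(win):
        adx, idx = parallel_addresses(pi[j], nc, m, p0)
        ok = all(pi[j + k * win] == idx[k] * win + adx for k in range(m))
        print(j, adx, idx, ok)
```

Only the first Nc/M values of Π(j) are needed: the addresses used by the other SISOs follow from the shared offset and the rotated memory indices.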
III. RESULTS AND CONCLUSIONS
The architecture detailed in the previous paragraphs simul-
taneously produces M addresses per cycle and is employed to
implement the interleaver reading part. Since $idx_j^k$ identifies
the memory accessed by SISO-k at time j, the parallel
interleaver architecture ought to signal to the memory which
SISO is requiring the data. This operation is accomplished
by a 4 × 4 crossbar switch (radx-switch) controlled by $idx_j^k$
with 2 bit wide fixed inputs, as shown in Fig. 3. When
the $idx^k$ memory (EI-MEM $idx^k$) is read, it sends back the
corresponding λ[u] triplet to SISO-k, through a 4 × 4 crossbar
switch (rdata-switch). This crossbar switch is controlled by the
output of the radx-switch.
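The read path described above can be modelled in a few lines of Python. This is a simplified behavioural sketch with invented function names; it assumes that the fixed 2-bit inputs of the radx-switch are the SISO identifiers 0 to 3 and that the memory indices form a permutation, as guaranteed by the collision-free property.

```python
def radx_switch(idx):
    """Route the fixed SISO identifiers: memory port idx[k] receives k."""
    requester = [None] * len(idx)
    for k, mem in enumerate(idx):
        requester[mem] = k
    return requester


def rdata_switch(mem_out, requester):
    """Send the word read from each memory back to the SISO that asked for it."""
    siso_in = [None] * len(mem_out)
    for mem, k in enumerate(requester):
        siso_in[k] = mem_out[mem]
    return siso_in


if __name__ == "__main__":
    idx = [3, 0, 1, 2]                       # memory requested by SISO-0..SISO-3
    mem_out = ["t0", "t1", "t2", "t3"]       # triplets read at the common address
    requester = radx_switch(idx)
    print(requester)
    print(rdata_switch(mem_out, requester))
```

In this model SISO-k always receives the word read from memory idx[k], so the rdata-switch needs no control information beyond the radx-switch output.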
Since each SISO outputs its data in reverse order, during the
reading operation $idx_j^k$ and $adx_j$ are stored into a LIFO; $idx_j^k$
and $adx_j$ are read from the LIFO during the writing operation
to configure a 4 × 4 crossbar switch (wdata-switch).
Figure 3. WiMax parallel turbo decoder architecture

The proposed parallel interleaver has been described in
VHDL and synthesized on a 0.13 µm standard cell technology
with Synopsys Design Compiler. Power consumption values
have been obtained with Modelsim and Synopsys Power
Compiler. The 17×37 bits LUT, synthesized as combinational
logic, requires 330 equivalent gates. Thus, the serial interleaver
is made of only 1340 equivalent gates and the total address
generator needs 1474 equivalent gates, where 24 equivalent
gates are used to implement the 17×2 bits LUT devoted
to obtain M from Nc. Further 50 and 577 equivalent gates
are required respectively by the radx-switch and the data-
switch (either rdata-switch or wdata-switch), where 8 bits are
employed for each LLR of the λ[u] triplet. The complete
parallel interleaver architecture needs about 2200 equivalent
gates for the reading part and about 600 equivalent gates
for the writing part, leading to a total complexity of about
2800 equivalent gates and an average power consumption of
about 1.3 mW with a 200 MHz clock frequency (memories
not included). Similar complexity results were obtained on
FPGA: the complete parallel interleaver architecture requires
876 LUTs and 103 Flip-Flops for a 200 MHz clock frequency
on a Xilinx Virtex 5 XC5VLX30 with Xilinx ISE. The complete
decoder requires 133.6 kbits [7] of memory, where 57.6 kbits
are devoted to the EI-MEM and 2.2 kbits to the LIFO.
A parallel decoder made of M SISOs requires M addresses to be
produced simultaneously in each cycle. Thus, for a fair
comparison, we consider a parallel interleaver obtained using
four instances of the single address per cycle interleaver in
[5]: this solution requires about 4880 equivalent gates (1220
equivalent gates for each address generator) not including the
switches and the memories, with an average power consump-
tion of about 3.2 mW (0.8 mW for each address generator) at
200 MHz. As can be observed, the proposed interleaver is
more than 50% simpler than placing four instances of [5].
REFERENCES
[1] “IEEE Std 802.16, part 16: air interface for fixed broadband wireless
access systems,” Oct. 2004.
[2] C. Berrou et al., “The advantages of non-binary turbo codes,” in IEEE
Inf. Theory Workshop, 2001, pp. 61–63.
[3] S. Dolinar and D. Divsalar, “Weight distributions for turbo codes using
random and nonrandom permutations,” TDA Progress Report, vol. 42-
122, pp. 56–65, Aug 1995.
[4] C. Zhan et al., “An efficient decoder scheme for double binary circular
turbo codes,” in IEEE ICASSP, 2006, pp. 229–232.
[5] J. H. Kim and I. C. Park, “Double-binary circular turbo decoding based
on border metric encoding,” IEEE Trans. on Circuits and Systems II,
vol. 55, no. 1, pp. 79–83, Jan 2008.
[6] J. Kwak and K. Lee, “Design of dividable interleaver for parallel
decoding in turbo codes,” IET Electronics Letters, vol. 38, no. 22, pp.
1362–1364, Oct 2002.
[7] M. Martina, M. Nicola, and G. Masera, “A flexible UMTS-WiMax turbo
decoder architecture,” IEEE Trans. on Circuits and Systems II, vol. 55,
no. 4, pp. 369–373, Apr 2008.
