VLSI Architectures for WIMAX Channel Decoders by Martina, Maurizio & Masera, Guido
Politecnico di Torino 
Italy 
 
1. Introduction 
 
WIMAX has gained wide popularity due to the growing interest in, and diffusion of, 
broadband wireless access systems. In order to be flexible and reliable, WIMAX adopts 
several different channel codes, namely convolutional codes (CC), convolutional turbo 
codes (CTC), block turbo codes (BTC) and low-density parity-check (LDPC) codes, which are 
able to cope with different channel conditions and application needs. 
On the other hand, high-performance digital CMOS technologies have matured to the point 
that very complex algorithms can be implemented in low-cost chips. 
Moreover, embedded processors, digital signal processors, programmable devices such as 
FPGAs, application-specific instruction-set processors and VLSI technologies have come to 
the point where the computing power and the memory required to execute several real-time 
applications can be incorporated even in cheap portable devices. 
Among the several application fields that have been strongly reinforced by this technological 
progress, channel decoding is one of the most significant and interesting ones. In fact, the 
design of efficient architectures to implement such channel decoders is known to be a 
hard task, made harder by the high throughput required by WIMAX systems, which is up to 
about 75 Mb/s per channel. In particular, CTC and LDPC codes, whose decoding 
algorithms are iterative, are still a major topic of interest in the scientific literature, and the 
design of efficient architectures is still fostering several research efforts both in industry and 
academia. 
In this chapter, the design of VLSI architectures for WIMAX channel decoders is 
analyzed with emphasis on three main aspects: performance, complexity and flexibility. The 
chapter is divided into two main parts. The first part deals with the impact of 
system requirements on the decoder design, with emphasis on memory requirements, the 
structure of the key components of the decoders and the need for parallel architectures. To 
that purpose, a quantitative approach is adopted to derive key architectural choices from 
system specifications; the most important architectures available in the literature are also 
described and compared. 
The second part concentrates on a significant case study: the design of a complete CTC 
decoder architecture for WIMAX, including hardware units for the depuncturing (bit 
deselection) and external deinterleaving (sub-block deinterleaver) functions. 
WIMAX, New Developments
 
2. From system specifications to architectural choices 
 
The system specifications and in particular the requirement of a peak throughput of about 
75 Mb/s per channel imposed by the WIMAX standard have a significant impact on the 
decoder architecture. In the following sections we analyze the most significant architectures 
proposed in the literature to implement CC decoders (Viterbi decoders), BTC, CTC and 
LDPC decoders.  
 
2.1 Viterbi decoders 
The most widely used algorithm to decode CCs is the Viterbi algorithm [Viterbi, 1967], 
which is based on finding the shortest path along a graph that represents the CC trellis. As 
an example, Fig. 1 shows a binary 4-state CC as a feedback shift register (a), together 
with the corresponding state diagram (b) and trellis (c) representations. 
 
[Figure: (a) shift register with input u, flip-flops e1 and e2 and outputs c1, c2; (b) state diagram over states 00, 01, 10, 11 with edges labelled u/c2c1; (c) one trellis step over the same states]
Fig. 1. Binary 4-state CC example: shift register (a), state diagram (b) and trellis (c) 
representations 
 
In the given example, the feedback shift register implementation of the encoder generates 
two output bits, c1 and c2, for each information bit u it receives; c1 is the systematic bit. The 
state diagram is essentially a Mealy finite state machine that describes the encoder behaviour in a 
time-independent way: each node corresponds to a valid encoder state, represented by 
the flip-flop contents e1 and e2, while edges are labelled with input and output bits. 
The trellis representation also provides time information, explicitly showing the evolution 
from one state to another across time steps (a single step is drawn in the picture). 
At each trellis step n, the Viterbi algorithm associates with each trellis state S a state metric Γ_S^n, 
calculated along the shortest path, and stores a decision d_S^n, which identifies the 
entering transition on the shortest path. First, the decoder computes the branch metrics γ^n, 
which are the distances between the symbols labelling each edge of the trellis and the soft 
symbols actually received. In the case of a binary CC with rate 0.5, the soft symbols are λ_1^n and 
λ_2^n and the branch metrics are γ^n(c2,c1) (see Fig. 2 (a)). Starting from these values, the state 
metrics are updated by selecting the largest metric among those related to the 
incoming edges of a trellis state and storing the corresponding decision d_S^n. Finally, decoded 
bits are obtained by means of a recursive procedure usually referred to as trace-back. In 
order to estimate the sequence of bits that were encoded for transmission, a state is first 
selected at the end of the trellis portion to be decoded; the decoder then iteratively goes 
backward through the state history memory where the decisions d_S^n were previously 
stored. This allows the decoder to select, for the current state, a new state, which is listed in the state 
history trace as the predecessor of that state. Different implementation methods are 
available to make the initial state choice and to size the portion of trellis on which the 
trace-back operation is performed: these methods affect both decoder complexity and 
error-correcting capability. For further details on the algorithm the reader can refer to [Viterbi, 
1967]; [Forney, 1973]. Looking at the global architecture, the main blocks required in a 
Viterbi decoder are the branch metric unit (BMU), which computes γ^n, the state metric 
unit (SMU), which calculates Γ_S^n, and the trace-back unit (TBU), which produces the decoded sequence. 
The BMU is made of adders and subtracters that properly combine the input soft symbols (see 
Fig. 2 (a)). The SMU is based on the so-called add-compare-select (ACS) structure shown 
in Fig. 2 (b). Let i denote the i-th starting state, connected to an arriving state S by an edge 
whose branch metric is γ_i^{n-1}; then Γ_S^n is calculated as in (1). 
 
$$\Gamma_S^n = \max_i \{\Gamma_i^{n-1} + \gamma_i^{n-1}\} \qquad (1)$$
 
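[Figure: (a) BMU built from adders and subtracters that combine λ_1^n and λ_2^n into γ^n(00), γ^n(01), γ^n(10) and γ^n(11); (b) ACS adding Γ_i^{n-1} + γ_i^{n-1} and Γ_j^{n-1} + γ_j^{n-1}, then comparing and selecting to produce Γ_S^n and d_S^n]
Fig. 2. BMU and ACS architectures for a rate 0.5 CC 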
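
The BMU, ACS and trace-back steps described above can be sketched in software as follows. This is a minimal illustration, not the architecture of Fig. 2: it assumes a hypothetical non-recursive 4-state code with generators (7,5) in octal rather than the code of Fig. 1, soft symbols where +1 stands for bit 0 and -1 for bit 1, correlation branch metrics (so that the largest metric identifies the most likely path, as in (1)), and a trace-back over the whole block instead of a window. All function names are illustrative.

```python
# Hypothetical 4-state, rate-1/2 code with generators (7, 5) octal.
# It is NOT the code of Fig. 1; it is only a convenient small example.
G = (0b111, 0b101)
N_STATES = 4
NEG_INF = float("-inf")

def step(state, u):
    """Return (next_state, (c1, c2)) when input bit u enters from `state`."""
    reg = (u << 2) | state                      # newest bit in the MSB
    out = tuple(bin(reg & g).count("1") & 1 for g in G)
    return reg >> 1, out

def encode(bits):
    state, coded = 0, []
    for u in bits:
        state, out = step(state, u)
        coded.extend(out)
    return coded

def viterbi(soft):
    """Decode a flat list of soft symbols (two per trellis step)."""
    metrics = [0.0] + [NEG_INF] * (N_STATES - 1)    # encoder starts in state 0
    decisions = []                                   # decision memory
    for n in range(len(soft) // 2):
        l1, l2 = soft[2 * n], soft[2 * n + 1]
        new = [NEG_INF] * N_STATES
        dec = [None] * N_STATES
        for s in range(N_STATES):
            if metrics[s] == NEG_INF:
                continue                             # state not reachable yet
            for u in (0, 1):
                nxt, (c1, c2) = step(s, u)
                # BMU: correlate the edge label with the received symbols
                gamma = (1 - 2 * c1) * l1 + (1 - 2 * c2) * l2
                cand = metrics[s] + gamma            # add
                if cand > new[nxt]:                  # compare
                    new[nxt] = cand                  # select
                    dec[nxt] = (s, u)                # decision for the TBU
        metrics = new
        decisions.append(dec)
    # TBU: start from the best final state and follow the stored decisions
    state = max(range(N_STATES), key=lambda s: metrics[s])
    bits = []
    for dec in reversed(decisions):
        state, u = dec[state]
        bits.append(u)
    return bits[::-1]

message = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
received = [1.0 - 2.0 * c for c in encode(message)]
received[3] = -received[3]                           # one channel error
decoded = viterbi(received)
```

In this sketch the inner loop over states is sequential; a parallel hardware decoder would instead evaluate all ACS operations of a trellis step at once.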
 
As can be inferred from (1), Γ_S^n is obtained by adding branch metrics to state metrics, 
then comparing the results and selecting the highest metric, which represents the shortest incoming path. The 
corresponding decision d_S^n is stored in a memory that is later read by the TBU to reconstruct 
the surviving path. Due to the recursive form of (1), the number of bits needed 
to represent Γ_S^n tends to grow as n increases. This problem can be solved by normalizing the state 
metrics at each step. However, this solution requires an additional normalization stage, which increases 
both the SMU complexity and its critical path. An effective technique based on two's 
complement representation helps limit the growth of state metrics, as described in 
[Hekstra, 1989]. 
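
The idea of letting state metrics wrap around in two's complement arithmetic, instead of normalizing them, can be illustrated with the short sketch below. It is only a software model under stated assumptions: the width W = 8 bits is hypothetical, and the comparison is correct only as long as the true metrics being compared differ by less than 2^(W-1). The function names are illustrative and are not taken from [Hekstra, 1989].

```python
W = 8                       # hypothetical state-metric width in bits
MOD = 1 << W

def wrap(x):
    """Keep only W bits, as an adder that ignores overflow would."""
    return x & (MOD - 1)

def mod_greater(a, b):
    """True if a > b, assuming the true metrics differ by less than 2**(W-1)."""
    diff = wrap(a - b)
    return diff != 0 and diff < MOD // 2     # sign bit of (a - b) is clear

def acs_mod(metric_i, gamma_i, metric_j, gamma_j):
    """Add-compare-select on wrapped metrics; returns (new_metric, decision)."""
    a = wrap(metric_i + gamma_i)
    b = wrap(metric_j + gamma_j)
    if a == b or mod_greater(a, b):
        return a, 0
    return b, 1

# True metrics 250 and 260: the second one has already wrapped to 4.
new_metric, decision = acs_mod(wrap(250), 10, wrap(260), -5)
```

Here the true sums are 260 and 255, so the first path must win even though its wrapped value (4) is numerically smaller than the wrapped value of the second one (255).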
 
[Figure: shift register with input u and outputs c1, c2]
Fig. 3. WIMAX binary 64-state CC with rate 0.5 shift register representation 
 
The WIMAX standard specifies a binary 64-state CC with rate 0.5, whose shift register 
representation is shown in Fig. 3. Viterbi decoder architectures usually exploit the intrinsic 
parallelism of the trellis to compute all the branch metrics and update all the state metrics 
simultaneously at each trellis step. Thus, denoting by n the number of states of a CC, a parallel 
architecture employs one BMU and n ACS modules. Moreover, to reduce the decoding latency, 
the trace-back is performed as a sliding-window process [Radar, 1981] on portions of the trellis 
of width W. This approach reduces not only the latency but also the size of the decision 
memory, which usually requires 3W or 4W cells depending on the TBU radix [Black & Meng, 
1992]. 
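
As an illustration of the 64-state code mentioned above, the following sketch models a rate 0.5 shift register encoder with six memory elements. The generator polynomials 171 and 133 (octal) and the bit ordering are assumptions of this example, since they are not stated in this section; the encoder also starts from the all-zero state, so any tail-biting or termination rule of the standard is not modelled.

```python
# Assumed generators (octal) for a constraint-length-7 code; not given in the text.
G1, G2 = 0o171, 0o133
K = 7                                   # constraint length -> 2**(K-1) = 64 states
N_STATES = 1 << (K - 1)

def parity(x):
    return bin(x).count("1") & 1

def cc_encode(bits):
    """Rate-1/2 encoding from the all-zero state (tail-biting not modelled)."""
    sr = 0                                   # six memory bits
    out = []
    for u in bits:
        window = (u << (K - 1)) | sr         # current input bit in the MSB
        out.append(parity(window & G1))      # first coded bit
        out.append(parity(window & G2))      # second coded bit
        sr = window >> 1                     # shift the register
    return out

impulse_response = cc_encode([1, 0, 0, 0, 0, 0, 0])
```

Feeding a single 1 followed by zeros reads out the taps of the two assumed generators, which is a quick way to check the wiring of such an encoder model.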
To improve the decoder throughput, two [Black & Meng, 1992] or more [Fettweis & Meyr, 
1989]; [Kong & Parhi, 2004]; [Cheng & Parhi, 2008] trellis steps can be processed 
concurrently. These solutions lead to the so-called higher-radix or M-look-ahead step 
architectures. According to [Kong & Parhi, 2004], the throughput sustained by an 
M-look-ahead step architecture, defined as the number of decoded bits over the decoding time, is 
 
$$T = \frac{k \cdot N_T \cdot f_{clk}}{N_T/M + W} \approx k \cdot M \cdot f_{clk} \qquad (2)$$
 
where f_clk is the clock frequency, N_T is the number of trellis steps, k=1 for a binary CC and k=2 
for a double binary CC; the rightmost expression is obtained under the condition W << 
N_T, which is a reasonable assumption in real cases. 
Thus, to achieve the throughput required by the WIMAX standard with a clock frequency 
limited to between tens and a few thousands of MHz, M=1 (radix-2) or M=2 (radix-4) is a reasonable 
choice. 
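
The throughput relation (2) can be evaluated numerically with a few lines of code. The sketch below assumes that (2) has the form T = k·N_T·f_clk/(N_T/M + W), approximated by k·M·f_clk when W << N_T; the block length and window width used in the example are hypothetical values chosen only for illustration.

```python
def viterbi_throughput(f_clk, n_t, m, w, k=1):
    """Decoded bits per second of an M-look-ahead architecture, as in (2)."""
    return k * n_t * f_clk / (n_t / m + w)

def approx_throughput(f_clk, m, k=1):
    """Right-most expression of (2), valid when W << N_T."""
    return k * m * f_clk

def min_clock(target, m, k=1):
    """Clock frequency needed to reach `target` b/s under the approximation."""
    return target / (k * m)

# Hypothetical block of 1000 trellis steps and window of 50 steps, f_clk = 100 MHz.
exact_radix4 = viterbi_throughput(100e6, 1000, 2, 50)
ideal_radix4 = approx_throughput(100e6, 2)
clock_radix2 = min_clock(75e6, 1)       # 75 MHz for M = 1
clock_radix4 = min_clock(75e6, 2)       # 37.5 MHz for M = 2
```

Under the approximation, the 75 Mb/s target needs roughly 75 MHz with M=1 and 37.5 MHz with M=2, while the window W makes the exact figure slightly lower than the ideal k·M·f_clk.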
However, since CCs are widely used in many communication systems, some recent works 
such as [Batcha & Shameri, 2007] and [Kamuf et al., 2008] address the design of flexible Viterbi 
decoders that are able to support different CCs. As a further step, [Vogt & When, 2008] 
proposed a multi-code decoder architecture able to support both CCs and CTCs. 
 
2.2 BTC decoders 
Block Turbo Codes, or product codes, are serially concatenated block codes. Given two block 
codes C1=(n1,k1,δ1) and C2=(n2,k2,δ2), where ni, ki and δi represent the code-word length, the 
number of information bits and the minimum Hamming distance respectively, the 
corresponding product code is obtained according to [Pyndiah, 1998] as an array with k1 
rows and k2 columns containing the information bits. Coding is then performed on the k1 
rows with C2 and on the n2 resulting columns with C1. The decoding of BTC codes can be 
performed iteratively, row-wise and column-wise, by using the sub-optimal algorithm 
detailed in [Pyndiah, 1998]. The basic idea relies on the Chase search [Chase, 1972], a 
near-maximum-likelihood (near-ML) searching strategy, to find a list of code-words and an 
ML decided code-word d={d_0,…,d_{n-1}} with d_j ∈ {-1,+1}. According to the notation used in 
[Vanstraceele et al., 2008], decision reliabilities are computed as 
 
$$\lambda(d_j) = \frac{\|r - c^{-1(j)}\|^2 - \|r - c^{+1(j)}\|^2}{4} \qquad (3)$$
 
where r={r_0,…,r_{n-1}} is the received code-word and c^{-1(j)} and c^{+1(j)} are the code-words in the 
Chase list at minimum Euclidean distance from r such that the j-th bit of the code-word is -1 
and +1 respectively. Then one decoder sends to the other the extrinsic information 
 
$$w_j^{out} = \lambda(d_j) - r_j \qquad (4)$$
 
If the Chase search fails the extrinsic information is approximated as 
 
$$w_j^{out} = \beta \cdot d_j \qquad (5)$$
 
where β is a weight factor increasing with the number of iterations. 
The decoder that receives the extrinsic information uses an updated version of r obtained as 
 
$$r_j^{new} = r_j^{old} + \alpha \cdot w_j^{in} \qquad (6)$$
 
where α is a weight factor increasing with the number of iterations. A scheme of the 
elementary block turbo decoder is shown in Fig. 4, where the block named “decoder” is a 
Soft-In-Soft-Out (SISO) module that performs the Chase search and implements (3), (4) and 
(5). An effective solution to implement the SISO module is based on a three-stage pipelined 
architecture, where the three stages are identified as the reception, processing and transmission 
units [Kerouedan & Adde, 2000]. As detailed in [LeBidan et al., 2008], during each stage the 
N soft values of the received word r are processed sequentially in N clock periods. The 
reception stage is devoted to finding the least reliable bits in the received code-word. The 
processing stage performs the Chase search, and the transmission stage calculates λ(d_j), w_j 
and r_j^new. Another solution is proposed in [Goubier et al. 2008], where the elementary 
decoder is implemented as a pipeline resorting to the mini-maxi algorithm, namely by using 
mini-maxi arrays to store the best metrics of all decoded code-words in the Chase list. 
 
[Figure: block diagram with a “decoder” block, two delay elements and the weights α and β, relating r, r_j^old, w_j^in, r_j^new and w_j^out]
Fig. 4. Elementary block turbo decoder scheme 
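
The soft-output computation of (3)-(6) can be sketched as follows, assuming that a list of candidate code-words has already been produced by the Chase search. In this sketch a "failed" search for bit j is interpreted as the absence of a competing code-word with the opposite value in position j, which is an assumption of the example; the function names and the numerical values are illustrative only.

```python
def dist2(r, c):
    """Squared Euclidean distance between received word r and code-word c."""
    return sum((rj - cj) ** 2 for rj, cj in zip(r, c))

def soft_output(r, candidates, beta):
    """Return (decided code-word, extrinsic values) following (3), (4) and (5).

    r          : received soft values
    candidates : candidate code-words with entries in {-1, +1}
    beta       : weight used when no competing code-word exists for a bit
    """
    d = min(candidates, key=lambda c: dist2(r, c))       # decided code-word
    w_out = []
    for j, rj in enumerate(r):
        plus = [dist2(r, c) for c in candidates if c[j] == +1]
        minus = [dist2(r, c) for c in candidates if c[j] == -1]
        if plus and minus:
            lam = (min(minus) - min(plus)) / 4           # reliability, eq. (3)
            w_out.append(lam - rj)                       # extrinsic, eq. (4)
        else:
            w_out.append(beta * d[j])                    # fallback, eq. (5)
    return d, w_out

def update_input(r_old, w_in, alpha):
    """Input of the next half iteration, eq. (6)."""
    return [ro + alpha * w for ro, w in zip(r_old, w_in)]

r = [0.9, -0.2, 1.1]
candidates = [(1, 1, 1), (1, -1, -1)]                    # hypothetical list
decided, w = soft_output(r, candidates, beta=0.5)
r_next = update_input(r, w, alpha=0.4)
```

In this example the first bit has no competitor with value -1 in the list, so its extrinsic value falls back to β·d_j, whereas the other two bits use the distance difference of (3).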
 
Several works in the literature deal with BTC complexity reduction. As an example, [Adde & 
Pyndiah, 2000] suggests computing β in (5) on a per-code-word basis, whereas in [Chi et al., 
2004] the dependency on α in (6) is removed by replacing the term α·w_j with tanh(w_j/2). In [Le 
et al. 2005] both α in (6) and β in (5) are avoided by exploiting a Euclidean distance property. 
Due to its row-column structure, the block turbo decoder can be parallelized by 
instantiating several elementary decoders that concurrently process several rows or columns, 
thus increasing the throughput. As a significant example, a fully parallel 
BTC decoder is proposed in [Jego et al., 2006]. This solution instantiates n1+n2 decoders that work concurrently. 
Moreover, by properly managing the scheduling of the decoders and interconnecting them 
through an Omega network, intermediate results (row decoded data or column decoded 
data) need not be stored. 
A detailed analysis of throughput and complexity of BTC decoder architectures can be 
found in [Goubier et al. 2008] and [LeBidan et al., 2008]. In particular, according to [Goubier 
et al. 2008], a simple one-block decoder architecture that performs the row/column decoding 
sequentially (interleaved architecture) requires 2(n1+n2) cycles to complete an iteration; as a 
consequence it achieves a throughput 
 
$$T = \frac{k_1 \cdot k_2 \cdot f_{clk}}{2 \cdot I \cdot (n_1 + n_2)} \qquad (7)$$
 
where I is the number of iterations and f_clk is the clock frequency. The BTC specified for 
WIMAX is obtained by using twice a binary extended Hamming code chosen among the ones shown in 
Table 1. 
 
n     k
15    11
31    26
63    57
Table 1. WIMAX binary extended Hamming codes (H(n,k)) used for BTC 
 
Considering the interleaved architecture described in [Goubier et al. 2008], where a fully 
decoded block is output every 4.5 half iterations, a throughput of 75 Mb/s can be obtained 
with a clock frequency of 84 MHz, 31 MHz and 14 MHz for H(15,11), H(31,26) and H(63,57) 
respectively. 
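
The clock frequencies quoted above can be cross-checked against (7) with the short calculation below. Two assumptions are made in this sketch: the 4.5 half iterations are taken to correspond to I = 2.25 in (7), and the resulting frequency is rounded up to the next integer number of MHz. With these assumptions the formula reproduces the three quoted values.

```python
import math

def btc_throughput(f_clk, n1, k1, n2, k2, iterations):
    """Throughput of the interleaved architecture, as in (7)."""
    return k1 * k2 * f_clk / (2 * iterations * (n1 + n2))

def required_clock_mhz(target_mbps, n, k, iterations):
    """Clock (MHz) needed for `target_mbps` when the same H(n,k) is used twice."""
    return target_mbps * 2 * iterations * (n + n) / (k * k)

ITERATIONS = 4.5 / 2                      # assumption: 4.5 half iterations
codes = [(15, 11), (31, 26), (63, 57)]    # (n, k) pairs of Table 1
clocks = [math.ceil(required_clock_mhz(75, n, k, ITERATIONS)) for n, k in codes]
```

The unrounded values are about 83.7 MHz, 31.0 MHz and 13.1 MHz, which shows how quickly the required clock drops as the component code rate k/n grows.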
 
2.3 CTC decoders 
Convolutional turbo codes were proposed in 1993 by Berrou, Glavieux and Thitimajshima 
[Berrou et al., 1993] as a coding scheme based on the parallel concatenation of two CCs by 
means of an interleaver (Π), as shown in Fig. 5 (a). The decoding algorithm is iterative 
and is based on the BCJR algorithm [Bahl et al., 1974] applied to the trellis representation of 
each constituent CC (Fig. 5 (b)). The key idea is that the extrinsic information 
produced when decoding one CC is used as an updated version of the input a-priori information when 
decoding the other CC. As a consequence, each iteration is made of two half iterations: in one half 
iteration the data are processed according to the interleaver (Π) and in the other half 
iteration according to the deinterleaver (Π^-1). The same result can be obtained by 
implementing an in-order read/write half iteration and a scrambled (interleaved) 
read/write half iteration. The basic block in a turbo decoder is a SISO module that 
implements the BCJR algorithm in its logarithmic likelihood ratio (LLR) form. If we consider 
a Recursive Systematic CC (RSC code), the extrinsic information λ_k(u;O) of an uncoded 
symbol u at trellis step k output by a SISO is 
 
$$\lambda_k(u;O) = \max^*_{e:u(e)=u}\{b(e)\} - \max^*_{e:u(e)=\tilde{u}}\{b(e)\} - \lambda_k(u;I) - \pi_k(c_u;I) \qquad (8)$$
 
where ũ is an uncoded symbol taken as a reference (usually ũ=0), e represents a certain 
transition on the trellis and u(e) is the uncoded symbol u associated with e. The max* function is 
usually implemented as a max followed by a correction term [Robertson et al., 1995]; [Gross 
& Gulak, 1998]; [Cheng & Ottosson, 2000]; [Classon et al., 2002]; [Wang et al., 2006]; 
[Talakoub et al. 2007]. A scaling factor can also be applied to further improve the max or 
max* approximation [Vogt & Finger, 2000]. The correction term, usually adopted when 
decoding binary codes, can be omitted for double binary turbo codes [Berrou et al. 2001] 
with minor error rate performance degradation. The term b(e) in (8) is defined as  
 
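$$b(e) = \alpha_{k-1}[s^S(e)] + \gamma_k[e] + \beta_k[s^E(e)] \qquad (9)$$
$$\alpha_k[s] = \max^*_{e:s^E(e)=s}\{\alpha_{k-1}[s^S(e)] + \gamma_k[e]\} \qquad (10)$$
$$\beta_k[s] = \max^*_{e:s^S(e)=s}\{\beta_{k+1}[s^E(e)] + \gamma_{k+1}[e]\} \qquad (11)$$
$$\gamma_k[e] = \pi_k[u(e);I] + \pi_k[c(e);I] \qquad (12)$$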
 
where s^S(e) and s^E(e) are the starting and the ending states of e, α_{k-1}[s^S(e)] and β_k[s^E(e)] are the 
forward and backward state metrics associated with s^S(e) and s^E(e) respectively (see Fig. 5 (b)) 
and γ_k[e] is the branch metric associated with e. The π_k[c(e);I] term is computed as a weighted 
sum of the λ_k[c;I] produced by the soft demodulator as 
 
$$\pi_k[c(e);I] = \sum_{i=1}^{n_c} c_i(e) \cdot \lambda_k[c_i(e);I] \qquad (13)$$
 
where c_i(e) is one of the coded bits associated with e and n_c is the number of bits forming a 
coded symbol c. The term π_k[c_u(e);I] in (8) is obtained as π_k[c(e);I], considering only the systematic 
bits corresponding to the uncoded symbol u out of the n_c coded bits. The π_k[u(e);I] term is 
obtained from the input a-priori information λ_k(u;I) and for a double binary code can 
be written as in (14), where A and B represent the two bits forming an uncoded symbol u. 
The CTC specified in the WIMAX standard is based on a double binary 8-state constituent 
CC, as shown in Fig. 6, where each CC receives two uncoded bits (A,B) and produces four 
coded bits: two systematic bits (A,B) and two parity bits (Y,W). As a consequence, at each 
trellis step four transitions connect a starting state to four possible ending states. Due to the 
trellis symmetry, only 16 branch metrics out of the possible 32 are required at 
each trellis step. As pointed out in [Muller et al. 2006], high throughput can be achieved by 
VLSI Architectures for WIMAX Channel Decoders 7
 
2004] the dependency on  in (6) is solved by replacing the term ·wj  with tanh(wj/2). In [Le 
et al. 2005] both  in (6) and β in (5) are avoided by exploiting Euclidean distance property.  
Due to its row-column structure, the block turbo decoder can be parallelized by 
instantiating several elementary decoders to concurrently process more rows or columns, 
thus increasing the throughput. As a significant example in [Jego et al., 2006] a fully parallel 
BTC decoder is proposed. This solution instantiates n1+n2 decoders that work concurrently. 
Moreover, by properly managing the scheduling of the decoders and interconnecting them 
through an Omega network intermediate results (row decoded data or column decoded 
data) are not stored.  
A detailed analysis of throughput and complexity of BTC decoder architectures can be 
found in [Goubier et al. 2008] and [LeBidan et al., 2008]. In particular, according to [Goubier 
et al. 2008] a simple one block decoder architecture that performs the row/column decoding 
sequentially (interleaved architecture) requires 2(n1+n2) cycles to complete an iteration; as a 
consequence it achieves a throughput  
 
)(2 21
21
nnI
fkk
T clk
  (7) 
 
where I is the number of iterations and fclk is the clock frequency. The BTC specified for 
WIMAX is obtained using twice a binary extended Hamming code out of the ones show in 
Table 1 
 
N k 
15 11 
31 26 
63 57 
Table 1. WIMAX binary extended Hamming codes (H(n,k)) used for BTC 
 
Considering the interleaved architecture described in [Goubier et al. 2008] where a fully 
decoded block is output every 4.5 half iterations, we obtain that 75 Mb/s can be obtained 
with a clock frequency of 84 MHz, 31 MHz and 14 MHz for H(15,11), H(31,26) and H(63,57) 
respectively. 
 
2.3 CTC decoders 
Convolutional turbo codes were proposed in 1993 by Berrou, Glavieux and Thitimajshima 
[Berrou et al., 1993] as a coding scheme based on the parallel concatenation of two CCs by 
the means of an interleaver (Π) as shown in Fig. 5 (a). The decoding algorithm is iterative 
and is based on the BCJR algorithm [Bahl et al., 1974] applied on the trellis representation of 
each constituent CC (Fig. 5 (b)). The key idea relies on the fact that the extrinsic information 
output by one CC is used as an updated version of the input a-priori information by the 
other CC. As a consequence, each iteration is made of two half iterations, in one half 
iteration the data are processed according to the interleaver (Π) and in the other half 
iteration according to the deinterleaver (Π-1). The same result can be obtained by 
implementing an in-order read/write half iteration and a scrambled (interleaved) 
read/write half iteration. The basic block in a turbo decoder is a SISO module that 
 
implements the BCJR algorithm in its logarithmic likelihood ratio (LLR) form. If we consider 
a Recursive Systematic CC (RSC code), the extrinsic information λk(u;O) of an uncoded 
symbol u at trellis step k output by a SISO is  
 
);();()}({max)}({max);(
~)(:
*
)(:
* IcIuebebOu ukk
ueueueue
k     (8) 
 
where ũ is an uncoded symbol taken as a reference (usually ũ=0), e represents a certain 
transition on the trellis and u(e) is the uncoded symbol u associated to e. The max* function is 
usually implemented as a max followed by a correction term [Robertson et al., 1995]; [Gross 
& Gulak, 1998]; [Cheng & Ottosson, 2000]; [Classon et al., 2002]; [Wang et al., 2006]; 
[Talakoub et al. 2007]. A scaling factor can also be applied to further improve the max or 
max* approximation [Vogt & Finger, 2000]. The correction term, usually adopted when 
decoding binary codes, can be omitted for double binary turbo codes [Berrou et al. 2001] 
with minor error rate performance degradation. The term b(e) in (8) is defined as  
 
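$$b(e) = \alpha_{k-1}[s^{S}(e)] + \gamma_k[e] + \beta_k[s^{E}(e)] \tag{9}$$

$$\alpha_k[s] = \max^{*}_{e:s^{E}(e)=s}\{\alpha_{k-1}[s^{S}(e)] + \gamma_k[e]\} \tag{10}$$

$$\beta_{k-1}[s] = \max^{*}_{e:s^{S}(e)=s}\{\beta_k[s^{E}(e)] + \gamma_k[e]\} \tag{11}$$

$$\gamma_k[e] = \pi_k[u(e);I] + \pi_k[c(e);I] \tag{12}$$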
 
where sS(e) and sE(e) are the starting and the ending states of e, αk−1[sS(e)] and βk[sE(e)] are the 
forward and backward state metrics associated to sS(e) and sE(e) respectively (see Fig. 5 (b)) 
and γk[e] is the branch metric associated to e. The πk[c(e);I] term is computed as a weighted 
sum of the λk[c;I] produced by the soft demodulator as 
 
$$\pi_k[c(e);I] = \sum_{i=1}^{n_c} c_i(e) \cdot \lambda_k[c_i(e);I] \tag{13}$$
 
where ci(e) is one of the coded bits associated to e and nc is the number of bits forming a 
coded symbol c and πk[cu(e);I] in (8) is obtained as πk[c(e);I] considering only the systematic 
bits corresponding to the uncoded symbol u out of the nc coded bits. The πk[u(e);I] term is 
obtained combining the input a-priori information λk(u;I) and for a double binary code can 
be written as in (14), where A and B represent the two bits forming an uncoded symbol u. 
The CTC specified in the WIMAX standard is based on a double binary 8-state constituent 
CC as shown in Fig. 6, where each CC receives two uncoded bits (A, B) and produces four 
coded bits, two systematic bits (A,B) and two parity bits (Y,W). As a consequence, at each 
trellis step four transitions connect a starting state to four possible ending states. Due to the 
trellis symmetry only 16 branch metrics out of the possible 32 branch metrics are required at 
each trellis step. As pointed out in [Muller et al. 2006] high throughput can be achieved by 
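exploiting the trellis parallelism, namely computing concurrently all the branch and state 
metrics. 
 
$$\pi_k[u(e);I] = \begin{cases} 0 & \text{if } u(e) = (A,B) = ('0','0') \\ \lambda_k^{\bar{A}B} & \text{if } u(e) = (A,B) = ('0','1') \\ \lambda_k^{A\bar{B}} & \text{if } u(e) = (A,B) = ('1','0') \\ \lambda_k^{AB} & \text{if } u(e) = (A,B) = ('1','1') \end{cases} \tag{14}$$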
 
Fig. 5. Convolutional turbo code: coder and iterative SISO based decoder (a), notation for a 
trellis step in the SISO (b) 
 
The 16 branch metrics are computed by a BMU that implements (12) as shown in Fig. 7. To 
reduce the latency of the SISO, the decoding is usually based on a sliding-window approach 
[Benedetto et al., 1996]. As a consequence, at least two BMUs are required to compute the 
two recursions (forward and backward) according to the BCJR algorithm. However, since β 
metrics need to be trained between successive windows, a further BMU is usually 
required. A solution based on the inheritance of the border metrics of each window 
[Abbasfar & Yao 2003] requires only two BMUs. Furthermore, this strategy reduces the SISO 
latency to the sliding window width W. The state metrics are updated according to (10) and 
(11) by two state metric processors, each of which is made of a proper number of processing 
elements (PE). As shown in Fig. 7, 8 PEs are required for the WIMAX CTC. It is worth 
pointing out that the constituent codes of the WIMAX CTC use the circulation state 
tailbiting strategy proposed in [Weiss et al. 2001], which ensures that the ending state of the 
last trellis step is equal to the starting state of the first trellis step. However, this technique 
requires estimating the circulation state at the decoder side. Since training operations to 
estimate the circulation state would increase the SISO latency, an effective alternative [Zhan 
et al. 2006] is to inherit these metrics from the previous iteration.  
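
The state metric update performed by one PE can be sketched in Python as follows. The sketch assumes that each state has a small list of incoming branches (four in a double binary trellis) and shows both the plain max and the max* with its correction term; `max_star`, `pe_update` and the flag `use_correction` are hypothetical names introduced only for this illustration.

```python
import math


def max_star(a, b):
    """Jacobian logarithm: ln(exp(a) + exp(b)) = max(a, b) + correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def pe_update(prev_metrics, gammas, use_correction=False):
    """Update one state metric as in (10) or (11).

    prev_metrics[i] is the state metric at the other end of the i-th branch
    connected to the state, gammas[i] is the branch metric of that branch.
    """
    candidates = [m + g for m, g in zip(prev_metrics, gammas)]
    if not use_correction:
        # Plain max approximation, no correction term.
        return max(candidates)
    acc = candidates[0]
    for c in candidates[1:]:
        acc = max_star(acc, c)
    return acc


# Hypothetical metrics of the four branches entering one state.
alpha_max = pe_update([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0])
alpha_star = pe_update([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0], use_correction=True)
```

The difference between `alpha_star` and `alpha_max` is the contribution of the correction term, which is the quantity that is dropped when the max approximation is adopted.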
 
 
Fig. 6. WIMAX CTC: encoder and constituent CC structures 
 
As in Viterbi decoder architectures, the state metrics in CTC decoders are often computed by 
means of the “wrapping” representation technique proposed in [Hekstra, 1989]. This 
solution requires a normalization stage, depicted in Fig. 7, when combining α, β and γ 
metrics to compute the extrinsic information as in (8). The last stage of the output processor, 
which computes the output extrinsic information, is a tree of max blocks for each component 
of the extrinsic information and a few adders to implement (8). As highlighted in Fig. 7, this 
scheduling requires a buffer to store the input LLRs that are used to compute the backward 
recursion (BMU-MEM). Since the output extrinsic information is computed during the 
backward recursion, forward recursion metrics are stored in a buffer (α-MEM). Further 
memory is required to implement the border metric inheritance: α-EXT-MEM, β-EXT-MEM 
and β-LOC-MEM. 
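
The idea behind the wrapping representation can be illustrated with a short Python sketch: metrics are stored modulo 2^bits in two's complement and are allowed to overflow, and two metrics are compared through the sign of their wrapped difference. This is a minimal sketch and not the circuit of [Hekstra, 1989]; the function names and the 8-bit width are assumptions, and the comparison is only valid as long as the true difference between the two metrics is smaller than 2^(bits-1).

```python
def wrap(value, bits):
    """Keep only `bits` bits and reinterpret them as a two's complement number."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


def wrapped_greater(a, b, bits):
    """True if metric a is larger than metric b, both stored modulo 2**bits."""
    return wrap(a - b, bits) > 0


BITS = 8  # hypothetical state metric width
true_a, true_b = 130, 120  # metrics as they would be without overflow
a, b = wrap(true_a, BITS), wrap(true_b, BITS)  # a overflows and becomes negative
still_ordered = wrapped_greater(a, b, BITS)
```

Here `a` wraps around to a negative value, so a plain comparison `a > b` would give the wrong answer, whereas the wrapped difference still reports the correct ordering.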
The throughput sustained by the CTC decoder, defined as the number of decoded bits over 
the time required for their computation, is 
 
$$T = \frac{k \cdot N_T \cdot f_{clk}}{2I \cdot (N_{cyc}^{SISO} + N_{cyc}^{ID})} = \frac{k \cdot N_T \cdot f_{clk}}{2I \cdot N_{cyc}^{dec}} \tag{15}$$
 
where f_clk is the clock frequency, N_T is the number of trellis steps, k=1 for a binary CTC, k=2 
for a double binary CTC, 2I is the number of half iterations, and N_cyc^SISO and N_cyc^ID represent the 
number of clock cycles required by one SISO and by the interleaving/deinterleaving 
structure respectively. Since both N_cyc^SISO and N_cyc^ID are a function of N_T, they can be rewritten as 
N_cyc^SISO = N_T·SP + SISO_cyc^lat and N_cyc^ID = N_T·SP + ID_cyc^oh, where SP is the sending period, namely 
the number of clock cycles between two consecutive valid output data (SP=1 means 
that new valid output data are ready at each clock cycle), SISO_cyc^lat is the decoder latency, 
namely the number of clock cycles spent to produce the first valid output data, and ID_cyc^oh is 
the interleaver/deinterleaver architecture overhead expressed in clock cycles. Usually, 
resorting to pipelining, N_cyc^SISO and N_cyc^ID can be partially overlapped; thus, the number of 
cycles required by one SISO decoder is N_cyc^dec = N_T·SP + SISO_cyc^lat + ID_cyc^oh. Using the sliding 
window technique with the border metric inheritance strategy [Abbasfar & Yao 2003]; [Zhan 
et al. 2006] we obtain SISO_cyc^lat ≈ SP·W, and so (15) can be rewritten as (16), where the 
rightmost expression is obtained considering W<<N_T and ID_cyc^oh<<SP·N_T, which is a reasonable 
assumption in real cases.  
$$T = \frac{k \cdot N_T \cdot f_{clk}}{2I \cdot [SP \cdot (N_T + W) + ID_{cyc}^{oh}]} \approx \frac{k \cdot f_{clk}}{2I \cdot SP} \tag{16}$$
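
Equation (16) can be evaluated numerically with a few lines of Python. The sketch below is only illustrative: the function and parameter names are hypothetical, and the frame length of 2400 trellis steps, the window width of 32 and the 600 MHz clock are assumed values chosen to show how close the approximation is to the full expression.

```python
def ctc_throughput(f_clk, n_t, iterations, k=2, sp=1, window=0, id_overhead=0):
    """Throughput in bit/s according to the full expression in (16)."""
    cycles_per_half_iteration = sp * (n_t + window) + id_overhead
    return k * n_t * f_clk / (2 * iterations * cycles_per_half_iteration)


def ctc_throughput_approx(f_clk, iterations, k=2, sp=1):
    """Rightmost expression of (16), valid when the window width and the
    interleaver overhead are negligible with respect to the frame length."""
    return k * f_clk / (2 * iterations * sp)


# Assumed values: double binary code, 8 iterations, SP=1, 600 MHz clock.
exact = ctc_throughput(600e6, n_t=2400, iterations=8, window=32)
approx = ctc_throughput_approx(600e6, iterations=8)
```

With these assumptions the approximation gives 75 Mb/s, while the full expression is slightly lower because of the W additional cycles spent in each half iteration.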
 
Fig. 7. WIMAX SISO block scheme 
 
Usually optimized architectures [Masera et al., 1999]; [Bickerstaff et al., 2003]; [Kim & Park, 
2008] are obtained with SP=1, whereas flexible architectures have higher SP values [Vogt & 
Wehn, 2008]; [Muller et al., 2009]. However, even with SP=1, a double binary turbo decoder 
architecture that achieves the throughput imposed by WIMAX with eight iterations (I=8) 
would require f_clk=600 MHz. A possible solution, which improves the throughput by a factor 
ranging in [1.2, 1.9], is based on decoder level parallelism [Muller et al. 2006] and is 
usually referred to as “shuffling” [Zhang & Fossorier, 2005]. However, to further improve 
the throughput, a parallel decoder made of P SISOs working concurrently is required. As a 
consequence, a parallel architecture achieves a throughput 
 
$$T = \frac{k \cdot N_T \cdot f_{clk}}{2I \cdot [SP \cdot (N_T/P + W) + ID_{cyc}^{oh}]} \approx \frac{k \cdot P \cdot f_{clk}}{2I \cdot SP} \tag{17}$$
 
Thus, setting P=4, I=8 and SP=1, the WIMAX throughput is obtained with f_clk=150 MHz. It is 
worth pointing out that a P-parallel CTC decoder is made of P SISOs connected to P 
memories devoted to storing the extrinsic information. However, in a parallel decoder 
collisions can occur during the scrambled half iteration, namely several SISOs could need to access 
the same memory during the same cycle. Since the collision phenomenon increases ID_cyc^oh, 
several algorithmic approaches to design collision free interleavers [Giulietti et al. 2002]; 
[Kwak & Lee, 2002]; [Gnaedig et al., 2003]; [Tarable et al., 2004] have been proposed. On the 
other hand, architectures to manage collisions in a parallel turbo decoder have also been 
proposed in the literature [Thul et al., 2002]; [Gilbert et al., 2003]; [Thul et al., 2003]; [Speziali 
& Zory, 2004]; [Martina et al. 2008-a]; [Martina et al., 2008-b]. In particular, [Martina et al. 
2008-b] deals with the parallelization of the WIMAX CTC interleaver and avoids collisions by 
means of a throughput/parallelism scalable architecture that features ID_cyc^oh=0. 
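
The collision phenomenon can be made concrete with a small Python sketch that counts the conflicting accesses of a scrambled half iteration. The memory organization is an assumption made for the example: the frame is split into P contiguous windows, window s is stored in memory s and is processed by SISO s, and at cycle t SISO s reads the address given by the interleaver for position s·(N/P)+t. The function name and the toy permutations are hypothetical and are not the WIMAX CTC interleaver.

```python
from collections import Counter


def count_collisions(permutation, p):
    """Count the accesses that hit a memory already used in the same cycle.

    `permutation[i]` is the address read for position i; the frame length
    is assumed to be a multiple of the number of SISOs p.
    """
    size = len(permutation) // p
    collisions = 0
    for t in range(size):
        # Memory index requested by each of the p SISOs at cycle t.
        banks = Counter(permutation[s * size + t] // size for s in range(p))
        collisions += sum(count - 1 for count in banks.values())
    return collisions


in_order = count_collisions([0, 1, 2, 3], p=2)   # in-order half iteration
scrambled = count_collisions([0, 2, 1, 3], p=2)  # toy scrambled half iteration
```

In the in-order case every SISO stays within its own memory, whereas the toy permutation makes both SISOs address the same memory in each of the two cycles.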
It is worth pointing out that parallel architectures increase not only the throughput but also 
the complexity of the decoder, so that some recent works aim at reducing the amount of 
memory required to implement SISO local buffers. In [Liu et al., 2007] and [Kim & Park, 
2008] saturation of forward state metrics and quantization of border backward state metrics 
are proposed. Further studies have been performed to reduce the extrinsic information bit 
width by using adaptive quantization [Singh et al., 2008], pseudo-floating point 
representation [Park et al., 2008] and bit level representation [Kim & Park, 2009]. 
 
2.4 LDPC code decoders 
LDPC codes were originally introduced in 1962 by Gallager [Gallager, 1962] and 
rediscovered in 1996 by MacKay and Neal [MacKay, 1996]. Like turbo codes, they achieve 
near optimum error correction performance and are decoded by means of high complexity 
iterative algorithms.  
An LDPC code is a linear block code defined by a C×B parity check matrix H, characterized 
by a low density of ones: B is the number of bits in the code (block length), while C is the 
number of parity checks. A one in a given cell of the H matrix indicates that the bit 
corresponding to the cell column is used for the calculation of the parity check associated to 
the row. A popular description of an LDPC code is the bipartite (or Tanner) graph shown in 
Figure 8 for a small example, where B variable nodes (VN) are connected to C check nodes 
(CN) through edges corresponding to the positions of the ones in H. 
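
The correspondence between H and the Tanner graph can be sketched in Python as two adjacency lists. The 3×3 matrix used below is a hypothetical example with six edges and is not meant to reproduce the graph of Figure 8; the function name is also an assumption.

```python
def tanner_graph(h):
    """Return the adjacency lists of the Tanner graph of parity check matrix h.

    c_sets[j] lists the check nodes connected to variable node j (column j),
    r_sets[i] lists the variable nodes connected to check node i (row i).
    """
    rows, cols = len(h), len(h[0])
    r_sets = [[j for j in range(cols) if h[i][j]] for i in range(rows)]
    c_sets = [[i for i in range(rows) if h[i][j]] for j in range(cols)]
    return c_sets, r_sets


# Hypothetical parity check matrix: 3 checks (rows) on 3 bits (columns).
H = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]
c_sets, r_sets = tanner_graph(H)
edges = sum(len(vns) for vns in r_sets)
```

Each one in H produces exactly one edge, so the number of edges equals the number of ones in the matrix.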
LDPC codes are usually decoded by means of an iterative algorithm variously known as 
sum-product, belief propagation or message passing, and reformulated in a version that 
processes logarithmic likelihood ratios instead of probabilities. In the first half of each 
iteration, variable nodes receive data from adjacent check nodes and from the channel and use them 
to obtain updated information sent to the check nodes; in the second half, check nodes take 
the updated information received from connected bit nodes and generate new messages to 
be sent back to variable nodes.  
In message passing decoders, messages are exchanged along the edges of the Tanner graph, 
and computations are performed at the nodes. To avoid multiplications and divisions, the 
decoder usually works in the logarithmic domain. 
 
 
Fig. 8. Example Tanner graph 
 
The message passing algorithm is described in the following equations, where k represents 
the current iteration, Qji is the message generated by VN j and directed to CN i, Rij is the 
message computed by CN i and sent to VN j. C[j] is the whole set of incoming messages for 
VN j and R[i] is the whole set of the incoming messages for CN i.  
Each variable node is initialized with the log-likelihood ratio (LLR) λj associated to the 
received bit. Next, messages are propagated from the variable nodes to the check nodes 
along the edges of the Tanner graph. At the first iteration, only the λj are delivered, while 
starting from the second iteration VNs sum up all the messages Rij coming from CNs and 
combine them with λj according to 
 
$$Q_{ji}^{k} = \lambda_j + \sum_{\alpha \in C[j] \setminus i} R_{\alpha j}^{k-1} \tag{18}$$
 
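The check node computes new check to variable messages as   
 
$$R_{ij}^{k} = \delta_{ij} \cdot \Phi^{-1}\left\{ \sum_{\beta \in R[i] \setminus j} \Phi\left(|Q_{\beta i}^{k}|\right) \right\} \quad \text{with} \quad \delta_{ij} = (-1)^{|R[i]|} \prod_{\beta \in R[i] \setminus j} \mathrm{sgn}\left(Q_{\beta i}^{k}\right) \tag{19}$$
 
where |R[i]| is the cardinality of the CN and Φ(x) is a non linear function defined as 
 
$$\Phi(x) = -\ln\left(\tanh\left(\frac{x}{2}\right)\right) \tag{20}$$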
 
After a number of iterations that strongly depends on the addressed application and code 
rate (typically 5 to 40), variable nodes compute an overall estimation of the decoded bit in 
the form 
 
$$\Lambda_j^{k} = \lambda_j + \sum_{\alpha \in C[j]} R_{\alpha j}^{k-1} \tag{21}$$
 
where the sign of Λj can be understood as the hard decision on the decoded bit. 
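
The two-phase message passing described by (18), (19) and (21) can be sketched in plain Python as follows. This is a minimal floating point sketch and not a hardware model. It assumes that a positive LLR means that the bit is more likely a one, uses Φ(x) = −ln(tanh(x/2)), which is its own inverse, clamps the argument of Φ to avoid overflows, takes the hard decision right after each check node update and stops as soon as all parity checks are satisfied. The stopping rule, the function names and the small parity check matrix used in the example are assumptions made for illustration.

```python
import math


def phi(x):
    """Phi(x) = -ln(tanh(x/2)); the argument is clamped for numerical safety."""
    x = min(max(x, 1e-9), 30.0)
    return -math.log(math.tanh(x / 2.0))


def decode(h, llr, max_iter=20):
    """Two-phase sum-product decoding; returns the hard decisions."""
    rows, cols = len(h), len(h[0])
    r_sets = [[j for j in range(cols) if h[i][j]] for i in range(rows)]
    c_sets = [[i for i in range(rows) if h[i][j]] for j in range(cols)]
    r_msg = {(i, j): 0.0 for i in range(rows) for j in r_sets[i]}
    bits = [1 if value > 0 else 0 for value in llr]
    for _ in range(max_iter):
        # Variable node phase, as in (18): exclude the message of the target CN.
        q_msg = {(j, i): llr[j] + sum(r_msg[(a, j)] for a in c_sets[j] if a != i)
                 for j in range(cols) for i in c_sets[j]}
        # Check node phase, as in (19): sign and magnitude handled separately.
        for i in range(rows):
            for j in r_sets[i]:
                others = [q_msg[(b, i)] for b in r_sets[i] if b != j]
                sign = (-1) ** len(r_sets[i])
                for q in others:
                    if q < 0:
                        sign = -sign
                r_msg[(i, j)] = sign * phi(sum(phi(abs(q)) for q in others))
        # Overall estimation, as in (21), computed with the newest CN messages.
        total = [llr[j] + sum(r_msg[(a, j)] for a in c_sets[j]) for j in range(cols)]
        bits = [1 if value > 0 else 0 for value in total]
        if all(sum(bits[j] for j in r_sets[i]) % 2 == 0 for i in range(rows)):
            break  # assumed stopping rule: every parity check is satisfied
    return bits


# Hypothetical example: a small parity check matrix, all-zero codeword,
# first bit received with the wrong sign and low reliability.
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
decoded = decode(H, [0.5, -2.0, -2.0, -2.0, -2.0, -2.0, -2.0])
```

In this example the check node connected to the first bit returns a negative message whose magnitude is larger than the weak channel LLR, so the wrong hard decision is corrected.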
 
A large implementation complexity is associated to (19), which is simplified in different 
ways. First of all, function Φ(x) can be obtained by means of reduced complexity estimations 
[Masera et al., 2005]. Moreover, sub-optimal, low complexity algorithms have been 
successfully proposed to simplify (19), such as the normalized Min-Sum 
algorithm [Chen et al., 2005] where only the two smallest magnitudes are used. 
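
A check node update based on the two smallest magnitudes can be sketched in Python as shown below. The sketch assumes the same sign convention adopted for the sum-product formulation (positive LLR for a bit equal to one, hence the (−1)^|R[i]| factor) and at least two incoming messages; the scaling factor 0.75 and the function name are hypothetical choices made for the example and are not taken from [Chen et al., 2005].

```python
def min_sum_check_node(q_in, scale=0.75):
    """Normalized Min-Sum update of one check node.

    q_in holds the incoming messages of the check node; the returned list
    holds the outgoing message for each connected variable node.
    """
    mags = [abs(q) for q in q_in]
    idx_min = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx_min]
    min2 = min(m for k, m in enumerate(mags) if k != idx_min)
    # Sign of the product over all inputs, including the (-1)^|R[i]| factor.
    total_sign = (-1) ** len(q_in)
    for q in q_in:
        if q < 0:
            total_sign = -total_sign
    out = []
    for k, q in enumerate(q_in):
        # The smallest magnitude among the other inputs is min1,
        # unless this input is the smallest one itself.
        mag = min2 if k == idx_min else min1
        # Remove the contribution of the input's own sign.
        sign = -total_sign if q < 0 else total_sign
        out.append(scale * sign * mag)
    return out


messages = min_sum_check_node([-2.0, 1.0, -3.0, 4.0])
```

Only the two minima, the position of the first one and the overall sign have to be stored for each check node, which is the main reason why this approximation is attractive in hardware.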
A further change is usually applied to the scheduling of variable and check nodes in order to 
improve communications performance. In the two-phase scheduling, the updating of 
variable and check nodes is accomplished in two separate phases. On the contrary, the turbo 
decoding message passing (TDMP) [Mansour & Shanbhag, 2003], also known as layered or 
shuffled decoding, allows for overlapped update operations: messages calculated by a 
subset of check nodes are immediately used to update variable nodes. This scheduling has 
been proved to be able to reduce the number of iterations by up to 50% at a fixed 
communications performance.  
The required number of functional units in a decoder can be estimated based on the concept 
of processing power Pc [Gouillod et al., 2007], which can be evaluated on the basis of the rate 
Rc of the code, the number K of information bits transmitted per codeword, the block size 
N=K/Rc, the required information throughput D, the operating clock frequency f_clk, the 
maximum number of iterations iMAX and the total number of edges to be processed per 
iteration χ. This relation is expressed as 
 
$$P_c = \frac{\chi \cdot D \cdot i_{MAX}}{K \cdot f_{clk}} \tag{22}$$
 
As two messages are associated with each edge (to be sent from the CN to the VN and vice 
versa), 2Pc gives the number of messages that must be concurrently processed at each 
clock cycle in order to achieve the target throughput D. Equation (22) does not 
consider the message exchange overhead: it assumes that all messages dispatched 
during a cycle are delivered simultaneously during the same cycle. The Pc value must then 
be assumed as a lower bound, and the actual degree of parallelism strongly depends on both 
the structure of the H matrix [Dinoi et al., 2006] and the adopted interconnect architecture 
among processing units [Quaglio et al., 2006] [Masera et al., 2007]. 
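
A numerical reading of (22) is given by the following Python sketch. All the values are illustrative assumptions and are not taken from the text: a code with 7296 edges and K=1152 information bits, ten iterations, a 200 MHz clock and a target throughput of 75 Mb/s. The function and parameter names are hypothetical.

```python
import math


def processing_power(edges, throughput, i_max, k_bits, f_clk):
    """Processing power Pc of (22): edges to be processed per clock cycle."""
    return edges * throughput * i_max / (k_bits * f_clk)


# Assumed example values (not from the text).
pc = processing_power(edges=7296, throughput=75e6, i_max=10, k_bits=1152, f_clk=200e6)
# Two messages per edge; a real datapath needs an integer number of units.
messages_per_cycle = math.ceil(2 * pc)
```

Since (22) neglects the message exchange overhead, the resulting figure should be read as a lower bound on the parallelism actually needed.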
Actually, most of the implementation concerns come from the communication structure that 
must be allocated to support message passing from bit to check nodes and vice versa. 
Several hardware realizations that have been proposed in the literature focus on how 
to pass messages efficiently between the two types of processing units. 
Three approaches can be followed in the high level organization of the decoder, leading to 
three kinds of architectures. 
- Serial architectures: bit and check processors are allocated as single instances, each 
serving multiple nodes sequentially; messages are exchanged by means of a memory. 
- Fully parallel architectures: processing units are allocated for each single bit and check 
node and all messages are passed in parallel on dedicated routes. 
- Partially parallel architectures: more processing units work in parallel, serving all bit 
and check nodes within a number of cycles; suitable organization and hardware 
support is required to exchange messages. 
For most codes and applications, the first approach results in slow implementations, while 
the second one has an excessive cost. As a result the only general viable solution is the third 
partially parallel approach, which on the other hand introduces the collision problem, 
already known in the implementation of parallel turbo decoders. Two main approaches 
have been proposed to deal with collisions: 
- To design collision free codes [Mansour & Shanbhag, 2003], [Hocevar, 2003], 
- To design decoder architectures able to avoid or at least mitigate collision effects [Kienle 
et al., 2003], [Tarable et al., 2004]. 
Even if the first approach has proven to be effective, it significantly limits the supported 
code classes. The second approach, on the other hand, is well suited for flexible and general 
architectures. An even more challenging task is the design of LDPC decoders that are 
flexible in terms of supported block sizes and code rates [Masera et al., 2007].  
In partially parallel structures, permutation networks are used to establish the correct 
connections between functional units. However, structured LDPC codes, such as those 
specified in WIMAX, allow for replacing permutation networks by low complexity barrel 
shifters [Boutillon et al., 2000]; [Mansour & Shanbhag, 2003]. 
Early termination schemes can be adopted to improve the decoding efficiency by dynamically 
adjusting the number of iterations according to the SNR. The simplest approach requires 
that decoding decisions are stored and compared across two consecutive iterations: if no 
changes are detected, the decoding is terminated, otherwise it is continued up to a 
maximum number of iterations. More sophisticated iteration control schemes are able to 
reduce the mean number of iterations, thus saving both latency and energy [Kienle & Wehn, 
2005]; [Shin et al., 2007]. 
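
The simplest stopping rule described above can be sketched as follows. This is only an 
illustration: the function name, the `step` callback standing in for one full decoding 
iteration, and the sign convention of the hard decision are assumptions of this example. 

```python
from typing import Callable, List, Optional, Sequence, Tuple


def decode_with_early_stop(
    step: Callable[[Sequence[float]], Sequence[float]],
    llrs: Sequence[float],
    max_iters: int,
) -> Tuple[List[int], int]:
    """Iterate `step` until the hard decisions repeat or max_iters is reached.

    `step` is a hypothetical stand-in for one full decoding iteration.
    Returns the final hard decisions and the number of iterations run.
    """
    previous: Optional[List[int]] = None
    decisions: List[int] = []
    for iteration in range(1, max_iters + 1):
        llrs = step(llrs)
        # Hard decision: negative LLR -> bit 1 (sign convention assumed here).
        decisions = [1 if value < 0 else 0 for value in llrs]
        if decisions == previous:
            return decisions, iteration  # unchanged across two iterations
        previous = decisions
    return decisions, max_iters


# Toy "decoder" that only pushes every LLR upwards by one per iteration.
bits, used = decode_with_early_stop(lambda v: [x + 1.0 for x in v], [-1.5, 2.0], 8)
print(bits, used)
```

The rule needs one extra buffer for the previous decisions and always spends at least one 
iteration confirming that nothing changed. 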
 
3. Case study: complete WIMAX CTC decoder design 
 
The WIMAX CTC decoder is made of three main blocks: symbol deselection (SD), subblock 
deinterleaver and CTC decoder, as highlighted in Fig. 9, where N represents the number of 
couples included in a data frame. The SD, subblock deinterleaver and CTC decoder blocks are 
connected by means of memory buffers, so that the non iterative 
part of the decoder (namely SD and subblock deinterleaver) and the decoding loop work 
simultaneously on consecutive data frames. Since the maximum decoder throughput is 
about 75 Mb/s and the native CTC rate is 1/3 (two uncoded bits produce six coded bits), the 
maximum throughput at the input of the decoding loop can rise up to 3 × 75 = 225 million LLRs 
per second. The same throughput has to be sustained by the subblock deinterleaver, 
whereas an even higher throughput has to be sustained by the SD unit in case of repetition. 
 
3.1 Symbol deselection 
Depending on the amount of data sent by the encoder (puncturing or repetition), the 
throughput sustained by the symbol deselection (SD) can rise up to 900 million LLRs per 
second (repetition 4). When the encoder performs repetition, the same symbol is sent more 
than once. Thus, the decoder combines the LLRs referring to the same symbol to improve the 
reliability of that symbol. As shown in Fig. 9, this can be achieved by partitioning the symbol 
deselection input buffer into four memories, each containing up to 6N LLRs. 
Since the symbol deselection architecture can read up to four LLRs per clock cycle, it reduces 
the incoming throughput to 225 million LLRs per second. However, the symbol 
deselection has to compute the starting location and the number of LLRs to be written into 
the output buffer. The number of LLRs and the starting location are obtained as in (23) and 
(24) respectively, where $N_{SCHk}$, $m_k$ and $SPID_k$ are parameters specified by the WIMAX 
standard for the k-th subpacket when HARQ is enabled: $N_{SCHk}$ is the number of 
concatenated slots, $m_k$ is the modulation order and $SPID_k$ is the subpacket ID. 
 
$$L_k = 48 \cdot m_k \cdot N_{SCHk} \qquad (23)$$ 
$$F_k = (SPID_k \cdot L_k) \bmod 6N \qquad (24)$$ 
 
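Since $N_{SCHk} \in [1, 480]$ and $m_k \in \{2, 4, 6\}$ we can rewrite (23) as 
 
$$L_k = \begin{cases}
(2 \cdot N_{SCHk} + N_{SCHk}) \cdot 2^5 & \text{when } m_k = 2 \\
(2 \cdot N_{SCHk} + N_{SCHk}) \cdot 2^6 & \text{when } m_k = 4 \\
(8 \cdot N_{SCHk} + N_{SCHk}) \cdot 2^5 & \text{when } m_k = 6
\end{cases} \qquad (25)$$ 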
 
An efficient implementation of (25) is obtained with an adder whose inputs are $N_{SCHk}$ and 
the selection between two hardwired left shifted versions of $N_{SCHk}$ (one position and three 
positions), followed by a programmable left shifter (five or six positions). Similarly, since 
$SPID_k \in \{0, 1, 2, 3\}$, the multiplication in (24) is avoided as 
 
$$F_k = \begin{cases}
0 & \text{when } SPID_k = 0 \\
L_k \bmod 6N & \text{when } SPID_k = 1 \\
(2 \cdot L_k) \bmod 6N & \text{when } SPID_k = 2 \\
(2 \cdot L_k + L_k) \bmod 6N & \text{when } SPID_k = 3
\end{cases} \qquad (26)$$ 
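
A small software model of (25) and (26) may help to check the shift-and-add decomposition. 
The function names below are hypothetical, and the final modulo is computed with Python's 
`%` operator, since the text does not describe how the mod 6N block is built. 

```python
def num_llrs(n_sch: int, m_k: int) -> int:
    """L_k = 48 * m_k * N_SCHk using one adder and shifters, as in (25)."""
    if m_k not in (2, 4, 6):
        raise ValueError("m_k must be 2, 4 or 6")
    # Selection between the two hardwired shifted versions (<<3 or <<1).
    shifted = n_sch << 3 if m_k == 6 else n_sch << 1
    total = shifted + n_sch                    # single adder
    return total << (6 if m_k == 4 else 5)     # programmable shift by 5 or 6


def start_location(spid: int, l_k: int, n: int) -> int:
    """F_k = (SPID_k * L_k) mod 6N without a multiplier, as in (26)."""
    if spid not in (0, 1, 2, 3):
        raise ValueError("SPID_k must be 0, 1, 2 or 3")
    multiple = (0, l_k, l_k << 1, (l_k << 1) + l_k)[spid]
    return multiple % (6 * n)                  # modulo block not modelled


print(num_llrs(10, 6), start_location(3, 96, 24))
```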
 
[Figure: block scheme with three cascaded stages: symbol deselection, subblock deinterleaver, CTC decoder] 
Fig. 9. Complete CTC decoder block scheme 
 
A block scheme of the architecture employed to compute $F_k$ and $L_k$ is depicted in Fig. 10 (a). 
Furthermore, in order to support the puncturing mode, the output memory locations 
corresponding to unsent bits must be set to zero. To ease the SD architecture 
implementation, all the output memory locations are set to zero while $L_k$ and $F_k$ are 
computed. As a consequence, about two clock cycles per sample are required to complete 
the symbol deselection, namely 6N LLRs are output in 12N clock cycles. The symbol 
deselection throughput can therefore be estimated as 
 
$$T_{SD} = \frac{6N}{12N} \cdot f_{clk} = \frac{f_{clk}}{2} \qquad (27)$$ 
 
As can be observed, sustaining 225 million LLRs per second requires a clock frequency of 450 
MHz. To overcome this problem, the input buffer is not only partitioned 
into four memories, but the memory parallelism is also increased, so that each memory 
location contains p LLRs. Thus, we can rewrite (27) as (28); by setting p to a conservative 
value, p=4, the SD architecture simultaneously processes up to sixteen LLRs with $f_{clk}$=113 
MHz. 
 
$$T_{SD} = p \cdot \frac{6N}{12N} \cdot f_{clk} = \frac{p \cdot f_{clk}}{2} \qquad (28)$$ 
 
3.2 Subblock deinterleaver 
The received LLRs belong to six possible subblocks depending on the coded bits they 
refer to (A, B, Y1, W1, Y2, W2), and each subblock is made of N LLRs. The subblock 
deinterleaver treats each subblock separately and scrambles its LLRs according to Algorithm 
1, given below, where m and J are constants specified by the WIMAX standard and $BRO_m(y)$ 
is the bit-reversed m-bit value of y. 
 
  1: k←0 
  2: i←0 
  3: while i<N do 
  4:    T_k←2^m·(k mod J)+BRO_m(⌊k/J⌋) 
  5:    if T_k<N then 
  6:        i←i+1     
  7:    else 
  8:        discard T_k 
  9:    end if 
 10:    k←k+1 
 11: end while 
Algorithm 1. Subblock deinterleaver address generator 
 
As a consequence, the number of tentative addresses generated, $N_M$, can be greater than N. 
Exhaustive simulations, performed on the values of N specified by the standard, show that 
the worst case is $N_M$=191, which occurs with N=144. Since 191/144=1.326, a conservative 
approximation is $N_M$=4N/3. The whole subblock deinterleaver architecture is obtained with 
one single address generator implementing Algorithm 1 to simultaneously write one LLR 
from each of the six subblock memories. In particular, as imposed by the WIMAX standard, 
the interleaved LLRs belonging to the A and B subblocks are stored separately, whereas the 
interleaved LLRs belonging to Y1 and Y2 are stored as a symbol-by-symbol multiplexed 
sequence, creating a “macro-subblock” made of 2N LLRs. Similarly, a macro-subblock made 
of 2N LLRs is generated by storing a symbol-by-symbol multiplexed sequence of the interleaved 
W1 and W2 subblocks. 
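
Algorithm 1 can be modelled in a few lines to count the tentative addresses. The function 
names are hypothetical, and the pair m=6, J=3 used for N=144 is an assumption of this 
example rather than a value given in the text; with it, the sketch reproduces the worst 
case of 191 tentative addresses quoted above. 

```python
from typing import List, Tuple


def bit_reverse(y: int, m: int) -> int:
    """BRO_m(y): reverse the m least significant bits of y."""
    result = 0
    for _ in range(m):
        result = (result << 1) | (y & 1)
        y >>= 1
    return result


def subblock_addresses(n: int, m: int, j: int) -> Tuple[List[int], int]:
    """Return the N valid addresses T_k and the tentative address count N_M."""
    if n > (j << m):
        raise ValueError("N must not exceed J * 2^m")
    addresses: List[int] = []
    k = 0
    while len(addresses) < n:
        t_k = ((k % j) << m) + bit_reverse(k // j, m)
        if t_k < n:
            addresses.append(t_k)   # valid address, i is incremented
        k += 1                      # invalid addresses are simply discarded
    return addresses, k


addresses, n_m = subblock_addresses(144, 6, 3)   # m and J assumed for N=144
print(n_m, n_m / 144)
```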
Since all the subblocks can be processed simultaneously, this architecture deinterleaves six 
LLRs per clock cycle. As a consequence, the subblock deinterleaver sustains a throughput 
 
$$T_{SubDein} = \frac{6N}{\frac{4}{3}N} \cdot f_{clk} = 4.5 \cdot f_{clk} \qquad (29)$$ 
 
Thus, a throughput of 225 million LLRs per second is sustained using $f_{clk}$=50 MHz. 
To implement lines 4 and 5 of Algorithm 1, three steps are required: the calculation of 
k mod J and ⌊k/J⌋, the calculation of $2^m$(k mod J) and $BRO_m$(⌊k/J⌋), and the generation of $T_k$ while 
checking $T_k$<N. It is worth pointing out that k mod J can be efficiently implemented as an 
up-counter followed by a mod J block. Moreover, each time the mod J block detects k=J, a 
second counter is incremented: the final value in the second counter is ⌊k/J⌋. Since $m \in [3, 10]$, 
the $2^m$(k mod J) term is implemented as a programmable shifter in the range [0, 7] followed 
by a hardwired three position left shifter. The $BRO_m$(⌊k/J⌋) term is obtained by multiplexing 
eight hardwired bit reversal networks. Finally, a valid $T_k$ address is obtained with an adder 
and is validated by a comparator. The address generation architecture is shown in Fig. 10 
(b). 
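
The datapath described above avoids any division: two counters replace k mod J and ⌊k/J⌋, 
and the power of two becomes a pair of shifts. The generator below is a behavioural 
sketch under that reading, not the authors' RTL; its name and the example parameters 
(N=144, m=6, J=3) are assumptions. 

```python
from typing import Iterator, Tuple


def address_stream(n: int, m: int, j: int) -> Iterator[Tuple[int, bool]]:
    """Yield (T_k, valid) pairs using two counters instead of mod and divide."""
    if not 3 <= m <= 10:
        raise ValueError("m is expected in [3, 10]")
    if n > (j << m):
        raise ValueError("N must not exceed J * 2^m")
    rem = 0        # first counter, holds k mod J
    quo = 0        # second counter, holds floor(k / J)
    accepted = 0
    while accepted < n:
        scaled = (rem << (m - 3)) << 3    # programmable shift, then hardwired <<3
        bro = int(format(quo, "0{}b".format(m))[::-1], 2)   # m-bit reversal
        t_k = scaled + bro                # adder
        valid = t_k < n                   # comparator
        accepted += valid
        yield t_k, valid
        rem += 1
        if rem == j:                      # wrap detected by the mod J block
            rem = 0
            quo += 1


stream = list(address_stream(144, 6, 3))
print(len(stream), sum(valid for _, valid in stream))
```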
 
3.3 CTC decoder 
As detailed in section 2.3, a parallel decoder architecture is required to sustain the 
throughput required by the WIMAX standard. To that purpose we set SP=1, I=8, and $f_{clk}$=200 
MHz; then, from (17), we analyze the throughput as a function of N for W=32. As shown in 
Fig. 11, only P=4 achieves the target throughput (horizontal solid line) for N≥480. 
Moreover, the window width impacts both the decoder throughput and the depth of the 
SISO local buffers, so a proper W value must be selected for each frame size. In 
particular, if $N/(P \cdot W) \in \mathbb{N}$, SISO synchronization is simplified. However, the choice of P 
should minimize collisions in memory access. 
Exhaustive simulations show that collisions occur for P=2 and P=4 only with N=108. As a 
consequence, we select P as a function of N to simultaneously obtain a throughput that 
increases monotonically with N and to avoid collisions. It is worth pointing out 
that, when collisions are avoided, the resulting parallel interleaver is a circular shifting 
interleaver: the address generation is simplified, with all SISOs simultaneously accessing the 
same location of different memories. 
Denoting by $idx_0^t$ the memory accessed by SISO-0 at time t during a scrambled half iteration, the 
memory concurrently accessed by SISO-k is $idx_k^t = (idx_0^t \pm k) \bmod P$. 
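
The circular shifting property can be illustrated with a one-line model: starting from the 
memory used by SISO-0, the other SISOs take consecutive memories modulo P, so no two SISOs 
ever address the same memory. The function name and the `sign` argument (selecting the + 
or − variant) are assumptions of this sketch. 

```python
from typing import List


def memory_ids(idx0: int, p: int, sign: int = 1) -> List[int]:
    """Memory index used by each SISO-k: (idx0 + sign * k) mod P, k = 0..P-1."""
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    return [(idx0 + sign * k) % p for k in range(p)]


print(memory_ids(3, 4))        # SISO-0 uses memory 3, the others wrap around
print(memory_ids(1, 4, -1))    # same rule with the minus sign
```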
 
[Figure: (a) datapath computing L_k and F_k from N_SCHk, m_k, SPID_k and N with hardwired shifts and a left shifter; (b) address generator with up-counters, mod J block, m-J LUT, shifter, BRO block and <N comparator] 
Fig. 10. Symbol deselection starting address and number of elements generation block 
scheme (a), subblock deinterleaver address generation block scheme (b). 
 
[Figure: throughput T [Mb/s], from 0 to 100, versus block size N, from 0 to 2400; curves for P=1, P=2, P=3, P=4 and Proposed] 
Fig. 11. Parallel CTC decoder throughput as a function of the block size N for different 
parallelism degree values P. The horizontal line represents the target throughput. 
 
Thus, the parallel CTC interleaver-deinterleaver system is obtained as a cascaded two stage 
architecture (see Fig. 12). The first stage efficiently implements the WIMAX interleaver 
algorithm, whereas the second one extracts the common memory address $adx^t$ and the 
memory identifiers $idx_k^t$ from the scrambled address i. 
The CTC interleaver algorithm specified in the WIMAX standard is structured in two steps. 
The first step switches the LLRs referring to A and B that are stored at odd addresses. The 
second step provides the interleaved address i of the j-th couple as 
 
$$i = (P_0 \cdot j + P'_j) \bmod N, \qquad j = 0, \ldots, N-1 \qquad (30)$$ 
 
where $P_0$ and $P'_j$ are constants that depend only on N and are specified by the standard. It is 
worth pointing out that the two steps can be swapped; as a consequence, the first step can be 
performed on-the-fly, avoiding the use of an intermediate buffer to store switched LLRs. A 
simple architecture to implement (30) can be derived by rewriting (30) as 
 
$$i = \left( i'_j + (P'_j \bmod N) \right) \bmod N \qquad (31)$$ 
 
where 
 
$$i'_j = \begin{cases}
0 & \text{when } j = 0 \\
\left( i'_{j-1} + (P_0 \bmod N) \right) \bmod N & \text{when } j = 1, \ldots, N-1
\end{cases} \qquad (32)$$ 
 
A small Look-Up-Table (LUT) is employed to store the $P_0 \bmod N$ and $P'_j \bmod N$ terms; then (31) 
is implemented in two parts as depicted in Fig. 12. The first part accumulates $P_0$ to 
implement the $P_0 \cdot j$ term, and the mod N block produces the correct modulo N result. The 
second part employs the two least significant bits of a counter (j−cnt) to select the proper $P'_j$ 
mod N value, which is added to the $(P_0 \cdot j) \bmod N$ term. A further modulo N operation is 
performed at the output. Since in this architecture both the first and the second part work on 
data belonging to [0, 2N−1], all the mod N operations are implemented by means of a 
subtracter and a multiplexer. 
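
The recursion in (31) and (32) can be checked against the direct formula (30) with a short 
model. The function names are hypothetical; the four-entry table for the second term 
reflects the selection by the two least significant bits of j described above, and the 
numerical values (N=24, first constant 5, table 1, 13, 1, 13) are illustrative assumptions, 
not values quoted from the standard. 

```python
from typing import Iterator, Sequence


def mod_n(x: int, n: int) -> int:
    """Reduce x in [0, 2N-1] modulo N with one subtracter and a multiplexer."""
    diff = x - n
    return x if diff < 0 else diff


def interleaved_addresses(n: int, p0: int, p_prime: Sequence[int]) -> Iterator[int]:
    """Yield i for j = 0..N-1 following (31) and (32)."""
    p0_mod = p0 % n                           # LUT content
    pp_mod = [value % n for value in p_prime]  # LUT content, four entries
    acc = 0                                   # i'_0 = 0
    for j in range(n):
        yield mod_n(acc + pp_mod[j & 3], n)   # (31), selection by two LSBs of j
        acc = mod_n(acc + p0_mod, n)          # (32), accumulation of P0


table = (1, 13, 1, 13)                        # illustrative values only
incremental = list(interleaved_addresses(24, 5, table))
direct = [(5 * j + table[j % 4]) % 24 for j in range(24)]
print(incremental == direct)
```

Because both operands of every addition are already reduced modulo N, each sum stays below 
2N and a single conditional subtraction is enough. 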
The second stage of the parallel CTC interleaver-deinterleaver architecture works as follows. 
Since $adx^t \in [0, N/P-1]$, it can be obtained from the scrambled address i produced by the first 
stage as 
 
$$adx^t = \begin{cases}
i & \text{when } i \in \left[ 0, \frac{N}{P} - 1 \right] \\
i - \frac{N}{P} & \text{when } i \in \left[ \frac{N}{P}, 2\frac{N}{P} - 1 \right] \\
\vdots & \\
i - (P-1)\frac{N}{P} & \text{when } i \in \left[ (P-1)\frac{N}{P}, N - 1 \right]
\end{cases} \qquad (33)$$ 
 
The straightforward implementation of (33) needs to calculate N/P and to allocate P−2 
multipliers, P−1 subtracters, a P-way multiplexer and a little logic for selecting the proper $adx^t$ 
value. The N/P division can be simplified by restricting the possible P values to powers of 
two. Thus, we obtain a CTC decoder architecture that exploits throughput/parallelism 
scalability to avoid collisions, namely we employ: P=1 when N≤180, P=2 when 192≤N≤240 
and P=4 when 480≤N≤2400. Moreover, as can be inferred from Fig. 12, multiplications are 
avoided by resorting to simple shift operations ($x \gg i = x/2^i$). The sign of the subtractions (dashed 
lines in Fig. 12) allows not only selecting the proper $adx^t$ but also finding $idx_0^t$. Then, with P−1 
modulo P adders the other $idx_k^t$ values are straightforwardly generated. As it can be 
observed, choosing P as a power of two reduces the modulo P adders to simpler, binary 
adders. The actual throughput sustained by the described throughput/parallelism scalable 
architecture is represented by the bold line in Fig. 11. 
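
The second stage can be sketched in software as follows. The selection of P mirrors the 
ranges listed above, and the address split follows (33) with thresholds built from shifts 
of N only. The function names are hypothetical, and taking the memory identifier of SISO-0 
as the index of the segment that contains i is this sketch's reading of how the signs of 
the subtractions are used. 

```python
from typing import List, Tuple


def parallelism(n: int) -> int:
    """Parallelism degree P selected from N, using the ranges given in the text."""
    if n <= 180:
        return 1
    if 192 <= n <= 240:
        return 2
    if 480 <= n <= 2400:
        return 4
    raise ValueError("N outside the ranges listed in the text")


def split_address(i: int, n: int, p: int) -> Tuple[int, int]:
    """Return (adx, idx0) for scrambled address i, following (33)."""
    if p not in (1, 2, 4) or n % p or not 0 <= i < n:
        raise ValueError("unsupported P, N not a multiple of P, or i out of range")
    thresholds: List[int] = []
    if p == 2:
        thresholds = [n >> 1]
    elif p == 4:
        # Multiples of N/4 obtained with shifts and one addition, no multiplier.
        thresholds = [n >> 2, n >> 1, (n >> 1) + (n >> 2)]
    adx, idx0 = i, 0
    for q, threshold in enumerate(thresholds, start=1):
        diff = i - threshold        # one subtracter per threshold
        if diff >= 0:               # the sign drives the multiplexer
            adx, idx0 = diff, q
    return adx, idx0


p = parallelism(480)
print(p, split_address(250, 480, p))
```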
 
[Figure: serial WiMax interleaver (P0 accumulator, j−cnt counter, 17x37 bits LUT, mod N blocks) followed by the stage that derives adx and the idx values from i using >>1 and >>2 shifted versions of N] 
Fig. 12. Parallel CTC address generator 
 
The global architecture of the designed parallel SISO is given in Fig. 13, where each SISO 
contains the processors devoted to computing the different metrics required by the BCJR 
algorithm, as detailed in section 2.3. A simple network is used to properly connect the SISOs 
according to the current value of P by setting the signal last_SISO. Furthermore, one address 
crossbar-switch (radx-switch) is used to implement the reading operation, a LIFO stores the 
addresses and makes them available for the writing phase, and two data crossbar-switches 
(rdata-switch/wdata-switch) are used to properly send (receive) the data to (from) the memory 
(EI-MEM) according to the parallel interleaver $idx_k^t$ values. 
 
 
[Figure: four SISOs (SISO-0 to SISO-3) chained through their α and β state metric ports and last_SISO signals, four memories (EI-MEM0 to EI-MEM3), rdata-switch, wdata-switch, radx-switch, LIFO and parallel address generator] 
Fig. 13. Parallel CTC decoder architecture 
 
In Table 2 the complexity of all the blocks for a 130 nm standard cell technology is reported. 
The bit-width is: 6 bit for λ[c;I], 8 bit for λ[u;I], and 12 bit for the state metrics. For further 
details the reader can refer to [Martina et al., 2009]. 
 
Architecture     SD     Subblock Deinterl.   SISOx1   Parallel Interl. 
Logic [kgate]    11     1.7                  37       2.8 
Memory [kbit]    0      0                    14.2     59 
Table 2. Complexity of the whole receiver 
 
4. Acknowledgements 
 
This work is partially supported by the WIMAGIC project funded by the European 
Community. 
 
 
 
 
5. References 
 
Abbasfar, A. & Yao, K. (2003). An Efficient and Practical Architecture for High Speed Turbo 
Decoders, Proceedings of the IEEE Vehicular Technology Conference, pp. 337-341, 
Orlando, USA, Oct. 2003 
Adde, P. & Pyndiah, R. (2000). Recent Simplifications and Improvements in Block Turbo 
Codes, Proceedings of the 2nd International Symposium on Turbo Codes & Related Topics, 
pp. 133-136, Brest, France, Sep. 2000 
Bahl, L. R.; Cocke, J.; Jelinek, F. & Raviv, J. (1974). Optimal Decoding of Linear Codes for 
Minimizing Symbol Error Rate, IEEE Trans. on Information Theory, Vol. 20, No. 2, 
Mar. 1974, pp. 284-287 
Batcha, M. F. N. & Shameri, A. Z. (2007). Configurable, Adaptive Viterbi Decoder for GPRS, 
EDGE and WIMAX, Proceedings of the IEEE International Conference on 
Telecommunications and Malaysia International Conference on Communications, 
pp. 237-241, Penang, Malaysia, May 2007 
Benedetto, S.; Montorsi, G.; Divsalar, D. & Pollara, F. (1996). Algorithm for Continuous 
Decoding of Turbo Codes, IET Electronics Letters, Vol. 32, No. 4, Apr. 1996, 
pp. 314-315 
Berrou, C.; Glavieux, A. & Thitimajshima, P. (1993). Near Shannon Limit Error-Correcting 
Codes: Turbo Codes, Proceedings of the IEEE International Conference on 
Communications, pp. 1064-1070, Geneva, Switzerland, May 1993 
Berrou, C.; Jezequel, M.; Douillard, C. & Kerouedan, S. (2001). The Advantages of 
non-binary turbo codes, Proceedings of the IEEE Information Theory Workshop, pp. 61-63, 
Cairns, Australia, Sep. 2001 
Bickerstaff, M.; Davis, L.; Thomas, C.; Garrett, D. & Nicol, C. (2003). A 24 Mb/s Radix-4 
LogMAP Turbo Decoder for 3GPP-HSDPA Mobile Wireless, Proceedings of the IEEE 
International Solid State Circuits Conference, Section 8 – paper 8.5, San Francisco, 
USA, Feb. 2003 
Black, P. J. & Meng, T. H. (1992). A 140-Mb/s, 32-State, Radix-4 Viterbi Decoder, IEEE Journal 
of Solid State Circuits, Vol. 27, No. 12, Dec. 1992, pp. 1877-1885 
Boutillon, E.; Castura, J. & Kschischang, F. (2000). Decoder-first code design, Proceedings of 
the 2nd International Symposium on Turbo Codes & Related Topics, pp. 459–462, Brest, 
France, Sep. 2000 
Chase, D. (1972). A Class of Algorithms for Decoding Block Codes with Channel 
Measurement Information, IEEE Trans. on Information Theory, Vol. IT-18, No. 1, Jan. 
1972, pp. 170-182 
Chen, J.; Dholakia, A.; Eleftheriou, E.; Fossorier, M. P. C. & Hu, X. Y. (2005). Reduced-
Complexity Decoding of LDPC Codes, IEEE Trans. on Communications, Vol. 53, No. 
8, Aug. 2005, pp. 1288–1299 
Cheng, C. & Parhi, K. K. (2008). Hardware Efficient Low-Latency Architecture for High-
Throughput Rate Viterbi Decoders, IEEE Trans. on Circuits and Systems II, Vol. 55, 
No. 12, Dec. 2008, pp. 1254-1258 
Cheng, J. F. & Ottosson, T. (2000). Linearly Approximated Log-MAP Algorithm for Turbo 
Decoding, Proceedings of the IEEE Vehicular Technology Conference, Tokyo, Japan, May 
2000, pp. 2252-2256 
Chi, Z.; Song, L. & Parhi, K. K. (2004). On the Performance/Complexity Tradeoff in Block 
Turbo Decoder Design, IEEE Trans. on Communications, Vol. 52, No. 2, Feb. 2004, pp. 
173-175 
Classon, B.; Blankenship, K. & Desai, V. (2002). Channel Coding for 4G Systems with 
Adaptive Modulation and Coding, IEEE Wireless Communications Magazine, Vol. 9, 
No. 2, Apr. 2002, pp. 8-13 
Dinoi, L.; Martini, R.; Masera, G.; Quaglio, F. & Vacca, F. (2006). ASIP design for partially 
structured LDPC codes, IET Electronics Letters, Vol. 42, No. 18, Aug. 2006, 
pp. 1048-1049 
Fettweis, G. & Meyr, H. (1989). Parallel Viterbi algorithm implementation: Breaking the 
ACS-bottleneck, IEEE Trans. on Communications, Vol. 37, No. 8, Aug. 1989, pp. 785-
790 
Gallager, R. G. (1962). Low-Density Parity-Check Codes, IRE Trans. on Information Theory, 
Vol. 8, No. 1, Jan. 1962, pp. 21–28 
Gilbert, F.; Thul, M. J. & Wehn, N. (2003). Communication Centric Architectures for Turbo-
Decoding on Embedded Multiprocessors, Proceedings of Design Automation and Test 
in Europe Conference and Exhibition, pp. 356-361, Munich, Germany, Mar. 2003 
Giulietti, A.; van der Perre, L. & Strum, M. (2002). Parallel Turbo Coding Interleavers: 
Avoiding Collisions in Accesses to Storage Elements, IET Electronics Letters, Vol. 38, 
No. 5, Feb. 2002, pp. 232-234 
Gnaedig, D.; Boutillon, E.; Jezequel, M.; Gaudet, V. C. & Gulak, P. G. (2003). Multiple Slice 
Turbo Codes, Proceedings of the 3rd International Symposium on Turbo Codes & Related 
Topics, pp. 343-346, Brest, France, Sep. 2003 
Goubier, T.; Dezan, C.; Pottier, B. & Jego, C. (2008). Fine Grain Parallel Decoding of Product 
Turbo Codes: Algorithm and Architecture, Proceedings of the 5th International 
Symposium on Turbo Codes & Related Topics, pp. 90-95, Lausanne, Switzerland, Sep. 
2008 
Gross, W. J. & Gulak, P. G. (1998). Simplified MAP Algorithm Suitable for Implementation 
of Turbo Decoders, IET Electronics Letters, Vol. 34, No. 16, Aug. 1998, pp. 1577-1578 
Guilloud, F.; Boutillon, E.; Tousch, J. & Danger, J. (2007). Generic description and synthesis 
of LDPC decoders, IEEE Trans. on Communications, Vol. 55, No. 11, Nov. 2007, pp. 
2084-2091 
Hekstra, A. P. (1989). An Alternative to Metric Rescaling in Viterbi Decoders, IEEE Trans. on 
Communications, Vol. 37, No. 11, Nov. 1989, pp. 1220-1222 
Hocevar, D. E. (2003). LDPC code construction with flexible hardware implementation, 
Proceedings of the IEEE International Conference on Communications, pp. 2708-2712, 
Anchorage, USA, May 2003 
Jego, C.; Adde, P. & Leroux, C. (2006). Full-Parallel Architecture for Turbo Decoding of 
Product Codes, IET Electronics Letters, Vol. 42, No. 18, Aug. 2006, pp. 1052-1053 
Kamuf, M.; Owall, V. & Anderson, J. B. (2008). Optimization and Implementation of a 
Viterbi Decoder Under Flexibility Constraints, IEEE Trans. on Circuits and Systems I, 
Vol. 55, No. 9, Sep. 2008, pp. 2411-2422 
Kerouedan, S. & Adde, P. (2000). Implementation of a Block Turbo Decoder on a Single 
Chip, Proceedings of the 2nd International Symposium on Turbo Codes & Related Topics, 
pp. 133-136, Brest, France, Sep. 2000 
VLSI Architectures for WIMAX Channel Decoders 23
 
5. References 
 
Abbasfar, A. ; Yao, K. (2003). An Efficient and Practival Architecture for High Speed Turbo 
Decoders, Proceedings of the IEEE Vehicular Technology Conference, pp 337-341, 
Orlando, USA, Oct. 2003 
Adde, P & Pyndiah, R. (2000). Recent Simplifications and Improvements in Block Turbo 
Codes, Proceedings of the 2nd International Symposium on Turbo Codes & Related Topics, 
pp. 133-136, Brest, France, Sep. 2000 
Bahl, L. R.; Cocke, J.; Jelinek, F. & Raviv, J. (1974). Optimal Decoding of Linear Codes for 
Minimizing Symbol Error Rate, IEEE Trans. on Information Theory, Vol. 20, No. 2, 
Mar. 1974, pp. 284-287 
Batcha, M. F. N. & Shameri, A. Z. (2007). Configurable, Adaptive Viterbi Decoder for GPRS, 
EDGA and WIMAX, Proceedings of the IEEE International Conference on 
Telecommunications and Malaysia International Conference on Communications, pp. 237-
241, Pennang, Malaysia, May 2007 
Benedetto, S.; Montorsi G.; Divsalar, D. & Pollara, F. (1996). Algorithm for Continuous 
Decoding of Turbo Codes, IET Electronics Letters, Vol. 32, No. 4, Apr. 1996, pp 314-
315   
Berrou, C.; Glavieux, A. & Thitimajshima P. (1993). Near Shannon Limit Error-Correcting 
Codes : Turbo Codes, Proceedings of the IEEE International Conference on 
Communications, pp. 1064-1070, Geneva, Switzerland, May 1993 
Berrou, C.; Jezequel, M.; Douillard, C. & Kerouedan  S. (2001). The Advantages of non-
binary turbo codes, Proceedings of the IEEE Information Theory Workshop, pp. 61-63, 
Cairns, Australia, Sep. 2001 
Bickerstaff, M.; Davis, L.; Thomas, C.; Garret, D. & Nicol, C. (2003). A 24 Mb/s Radix/4 
LogMAP Turbo Decoder for 3GPP-HSDPA Mobile Wireless, Proceedings of the IEEE 
International Solid State Circuits Conference, Section 8 – paper 8.5, San Francisco, 
USA, Feb. 2003 
Black, P. J. & Meng, T. H. (1992). A 140-Mb/s, 32-State, Radix4 Viterbi Decoder, IEEE Journal 
of Solid State Circuits, Vol. 27, No. 12, Dec. 1992, pp. 1877-1885 
Boutillon, E.; Castura, J. & Kschischang, F. (2000). Decoder-first code design, Proceedings of 
the 2nd International Symposium on Turbo Codes & Related Topics, pp. 459–462, Brest, 
France, Sep. 2000 
Chase, D. (1972). A Class of Algorithms for Decoding Block Codes with Channel 
Measurement Information, IEEE Trans. on Information Theory, Vol. IT-19, No. 1, Dec. 
1972, pp.170-182  
Chen, J.; Dholakia, A.; Eleftheriou, E.; Fossorier, M. P. C. & Hu, X. Y. (2005). Reduced-
Complexity Decoding of LDPC Codes, IEEE Trans. on Communications, Vol. 53, No. 
8, Aug. 2005, pp. 1288–1299 
Cheng, C. & Parhi, K. K. (2008). Hardware Efficient Low-Latency Architecture for High-
Throughput Rate Viterbi Decoders, IEEE Trans. on Circuits and Sytems II, Vol. 55, 
No. 12, Dec. 2008, pp. 1254-1258 
Cheng, J. F. & Ottosson, T. (2000). Linearly Approximated Log-MAP Algorithm for Turbo 
Decoding, Proceedings of the IEEE Vehicular Technology Conference, Tokio, Japan, May 
2000, pp. 2252-2256 
 
Chi, Z.; Song, L. & Parhi, K. K. (2004). On the Performance/Complexity Tradeoff in Block 
Turbo Decoder Design, IEEE Trans. on Communications, Vol. 52, No. 2, Feb. 2004, pp. 
173-175 
Classon, B.; Blankenship, K. & Desai, V. (2002). Channel Coding for 4G Systems with 
Adaptive Modulation and Coding, IEEE Wireless Communications Magazine, Vol. 9, 
No. 2, Apr. 2002, pp. 8-13 
Dinoi, L.; Martini, R.; Masera, G.; Quaglio, F. & Vacca, F. (2006). ASIP design for partially 
structured LDPC codes, Electronics Letters, Vol. 42, No. 18,  Aug. 2006, pp.1048 - 
1049 
Fettweis, G. & Meyr, H. (1989). Parallel Viterbi algorithm implementation: Breaking the 
ACS-bottleneck, IEEE Trans. on Communications, Vol. 37, No. 8, Aug. 1989, pp. 785-
790 
Gallager R. G. (1962), Low-Density Parity-Check Codes, IRE Trans. on Information Theory, 
Vol. 8, No. 1, Jan. 1962, pp. 21–28 
Gilbert, F.; Thul, M. J. & Wehn, N. (2003). Communication Centric Architectures for Turbo-
Decoding on Embedded Multiprocessors, Proceedings of Design Automation and Test 
in Europe Conference and Exhibition, pp. 356-361, Munich, Germany, Mar. 2003 
Giulietti, A.; van der Perre, L. & Strum, M. (2002). Parallel Turbo Coding Interleavers: 
Avoiding Collisions in Accesses to Storage Elements, IET Electronics Letters, Vol. 38, 
No. 5, Feb. 2002, pp. 232-234 
Gnaedig, D.; Boutillon, E.; Jezequel, M.; Gaudet. V. C. & Gulak, P. G. (2003). Multiple Slice 
Turbo Codes, Proceedings of the 3rd International Symposium on Turbo Codes & Related 
Topics, pp. 343-346, Brest, France, Sep. 2003  
Goubier, T.; Dezan, C.; Pottier, B. & Jego, C. (2008). Fine Grain Parallel Decoding of Product 
Turbo Codes: Algorithm and Architecture, Proceedings of the 5th International 
Symposium on Turbo Codes & Related Topics, pp. 90-95, Lausanne, Switzerland, Sep. 
2008 
Gross, W. J. & Gulak, P.G. (1998). Simplified MAP Algorithm Suitable for  Implementation 
of Turbo Decoders, IET Electronics Letters, Vol. 34, No. 16, Aug. 1998, pp. 1577-1578 
Guilloud, F.; Boutillon, E.; Tousch, J. & Danger, J. (2007). Generic description and synthesis 
of LDPC decoders, IEEE Trans. on Communications, Vol. 55, No. 11, Nov. 2007, pp. 
2084-2091 
Hekstra, A. P. (1989). An Alternative to Metric Rescaling in Viterbi Decoders, IEEE Trans. on 
Communications, Vol. 37, No. 11, Nov. 1989, pp. 1220-1222 
Hocevar, D. E. (2003). LDPC code construction with flexible hardware implementation,  
Proceedings of the IEEE International Conference on Communications, pp. 2708-2712, 
Anchorage, USA, May 2003 
Jego, C.; Adde, P. & Leroux, C. (2006). Full-Parallel Architecture for  Turbo Decoding of 
Product Codes, IET Electronics Letters, Vol. 42, No. 18, Aug. 2006, pp. 1052-1053 
Kamuf, M.; Owall, V. & Anderson, J. B. (2008). Optimization and Implementation of a 
Viterbi Decoder Under Flexibility Constraints, IEEE Trans. on Circuits and Systems I, 
Vol. 55, No. 9, Sep. 2008, pp. 2411-2422 
Kerouedan, S. & Adde, P. (2000). Implementation of a Block Turbo Decoder on a Single 
Chip, Proceedings of the 2nd International Symposium on Turbo Codes & Related Topics, 
pp. 133-136, Brest, France, Sep. 2000 
WIMAX, New Developments24
 
Kienle, F.; Thul, M. J. & Wehn, N. (2003). Implementation issue of scalable LDPC-decoders, 
Proceedings of the 3rd International Symposium on Turbo Codes & Related Topics, pp. 
291-294, Brest, France, Sep. 2003 
Kienle, F. & Wehn, N. (2005). Low Complexity Stopping Criterion for LDPC Code Decoders, 
Proceedings of the IEEE Vehicular Technology Conference, pp. 606-609, Stockholm, 
Sweden, May 2005 
Kim, J. H. & Park, I. C. (2008). Double Binary Circular Turbo Decoding Based on Border 
Metric Encoding, IEEE Trans. on Circuits and Systems II, Vol. 55, No. 1, Jan. 2008, pp. 
79-83 
Kim, J. H. & Park, I. C. (2009). Bit-Level Extrinsic Information Exchange Method for Double-
Binary Turbo Codes, IEEE Trans. on Circuits and Systems II, Vol. 56, No. 1, Jan. 2009, 
pp. 81-85 
Kong, J. J. & Parhi, K. K. (2004). Low-Latency Architectures for High-Throughput Rate 
Viterbi Decoders, IEEE Trans. on VLSI, Vol. 12, No. 6, Jun. 2004, pp. 642-651 
Kwak, J. & Lee, K. (2002). Design of Dividable Interleaver for Parallel Decoding in Turbo 
Codes, IET Electronics Letters, Vol. 38, No. 22, Oct. 2002, pp. 1362-1364 
Le, N.; Soleymani, M. R. & Shayan, Y. R. (2005). Distance-Based Decoding of Block Turbo 
Codes, IEEE Communications Letters, Vol. 9, No. 11, Nov. 2005, pp. 1006-1008 
LeBidan, R.; Leroux, C.; Jego, C.; Adde, P. & Pyndiah, R. (2008). Reed-Solomon Turbo 
Product Codes for Optical Communications: From Code Optimization to Decoder 
Design, Journal on Wireless Communications and Networking, Vol. 2008, Article ID 
658042, 14 pages 
Liu, H.; Diguet, J. P.; Jego, C.; Jezequel, M. & Boutillon, E. (2007). Energy Efficient Turbo 
Decoder with Reduced State Metric Quantization, Proceedings of the IEEE Workshop 
on Signal Processing Systems, pp. 237-242, Shanghai, China, Oct. 2007 
Mansour, M. M. & Shanbhag, N. R. (2003). High throughput LDPC decoders, IEEE Trans. on 
VLSI, Vol. 11, No. 6, Dec. 2003, pp. 976–996 
Martina, M.; Nicola, M. & Masera, G. (2008-a). A Flexible UMTS-WiMax Turbo Decoder 
Architecture, IEEE Trans. on Circuits and Systems II, Vol. 55, No. 4, Apr. 2008, pp. 
369-373 
Martina, M.; Nicola, M. & Masera, G. (2008-b). Hardware Design of a Low Complexity, 
Parallel Interleaver for WiMax Duo-Binary Turbo Decoding, IEEE Communications 
Letters, Vol. 12, No. 11, Nov. 2008, pp. 846-848 
Martina, M.; Nicola, M. & Masera, G. (2009). VLSI Implementation of WiMax Convolutional 
Turbo Code Encoder and Decoder, Journal of Circuits, Systems, and Computers, 
Vol. 18, No. 3, May 2009, pp. 535-564 
Masera, G.; Piccinini, G.; Ruo Roch, M. & Zamboni, M. (1999). VLSI Architectures for Turbo 
Codes, IEEE Trans. on VLSI, Vol. 7, No. 3, Sep. 1999, pp. 369-379 
Masera, G.; Quaglio, F. & Vacca, F. (2005). Finite precision implementation of LDPC decoders, 
IEE Proceedings - Communications, Vol. 152, No. 6, Dec. 2005, pp. 1098-1102 
Masera, G.; Quaglio, F. & Vacca, F. (2007). Implementation of a Flexible LDPC Decoder, 
IEEE Trans. on Circuits and Systems II, Vol. 54, No. 6, Jun. 2007, pp. 542-546 
Muller, O.; Baghdadi, A. & Jezequel, M. (2006). Exploiting Parallel Processing Levels for 
Convolutional Turbo Decoding, Proceedings of the IEEE International Conference on 
Information and Communication Technologies, pp. 2353-2358, Damascus, Syria, Apr. 
2006 
Muller, O.; Baghdadi, A. & Jezequel, M. (2009). From Parallelism Levels to a Multi-ASIP 
Architecture for Turbo Decoding, IEEE Trans. on VLSI, Vol. 17, No. 1, Jan. 2009, pp. 
92-102 
Park, S. M.; Kwak, J. & Lee, K. (2008). Extrinsic Information Memory Reduced Architecture 
for Non-Binary Turbo Decoder Implementation, Proceedings of the IEEE Vehicular 
Technology Conference, pp. 539-543, Marina Bay, Singapore, May 2008 
Pyndiah, R. M. (1998). Near-Optimum Decoding of Product Codes: Block Turbo Codes, 
IEEE Trans. on Communications, Vol. 46, No. 8, Aug. 1998, pp. 1003-1010 
Quaglio, F.; Vacca, F.; Castellano, C.; Tarable, A. & Masera, G. (2006). Interconnection 
framework for high-throughput, flexible LDPC decoders, Proceedings of Design, 
Automation and Test in Europe Conference and Exhibition, pp. 1-6, Munich, Germany, 
Mar. 2006 
Radar, C. M. (1981). Memory Management in a Viterbi Decoder, IEEE Trans. on 
Communications, Vol. COM-29, No. 9, Sep. 1981, pp. 1399-1401 
Robertson, P.; Villebrun, E. & Hoeher, P. (1995). A comparison of optimal and sub-optimal 
MAP decoding algorithms operating in the log domain, Proceedings of the IEEE 
International Conference on Communications, pp. 1009-1013, Seattle, USA, Jun. 1995 
Shin, D.; Heo, K.; Oh, S. & Ha, J. (2007). A Stopping Criterion for Low-Density Parity-Check 
Codes, Proceedings of the IEEE Vehicular Technology Conference, pp. 1529-1533, 
Dublin, Ireland, Apr. 2007 
Singh, A.; Boutillon, E. & Masera, G. (2008). Bit-width Optimization of Extrinsic Information 
in Turbo Decoder, Proceedings of the 5th International Symposium on Turbo Codes & 
Related Topics, pp. 134-138, Lausanne, Switzerland, Sep. 2008  
Speziali, F. & Zory, J. (2004). Scalable and Area Efficient Concurrent Interleaver for High 
Throughput Turbo-Decoders, Proceedings of Euromicro Symposium on Digital 
System Design, pp. 334-341, Rennes, France, Sep. 2004 
Talakoub, S.; Sabeti, L.; Shahrrava, B. & Ahmadi, M. (2007). An Improved Max-Log-MAP 
Algorithm for Turbo Decoding and Turbo Equalization, IEEE Trans. on 
Instrumentation and Measurement, Vol. 56, No. 3, Jun. 2007, pp. 1058-1063 
Tarable, A.; Benedetto, S. & Montorsi, G. (2004). Mapping interleaver laws to parallel turbo 
and LDPC decoders architectures, IEEE Trans. on Information Theory, Vol. 50, No. 9, 
Sep. 2004, pp. 2002–2009 
Thul, M. J.; Wehn, N. & Rao, L. P. (2002). Enabling High-Speed Turbo-Decoding through 
Concurrent Interleaving, Proceedings of the IEEE International Symposium on Circuits 
and Systems, pp. 897-900, Scottsdale, USA, May 2002 
Thul, M. J.; Gilbert, F. & Wehn, N. (2003). Concurrent Interleaving Architectures for High-
Throughput Channel Coding, Proceedings of the IEEE International Conference on 
Acoustics, Speech and Signal Processing, pp. 613-616, Hong Kong, Apr. 2003 
Vanstraceele, C.; Geller, B.; Brossier, J. M. & Barbot, J. P. (2008). A Low Complexity Block Turbo 
Decoder Architecture, IEEE Trans. on Communications, Vol. 56, No. 12, Dec. 2008, 
pp. 1985-1987 
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum 
decoding algorithm, IEEE Trans. on Information Theory, Vol. IT-13, Apr. 1967, pp. 
260-269 
Vogt, J. & Finger, A. (2000). Improving the Max-Log-MAP Turbo Decoder, IET Electronics 
Letters, Vol. 36, No. 23, Nov. 2000, pp. 1937-1939 
VLSI Architectures for WIMAX Channel Decoders 25
 
Kienle, F.; Thul, M. J. & Wehn, N. (2003). Implementation issue of scalable LDPC-decoders, 
Proceedings of the  3rd International Symposium on Turbo Codes & Related Topics, pp. 
291-294, Brest, France, Sep. 2003 
Kienle, F. & Wehn, N. (2005). Low Complexity Stopping Criterion for LDPC Code Decoders, 
Proceedings of the IEEE Vehicular Technology Conference , pp. 606-609, Stockholm, 
Sweden, May 2005 
Kim, J. H. & Park, I. C. (2008). Double Binary Circular Turbo Decoding Based on Border 
Metric Encoding, IEEE Trans. on Circuits and Systems II, Vol. 55, No. 1, Jan. 2008, pp. 
79-83 
Kim, J. H. & Park, I. C. (2009). Bit-Level Extrinsic Information Exchange Method for Double-
Binary Turbo Codes, IEEE Trans. on Circuits and Systems II, Vol. 56, No. 1, Jan. 2009, 
pp. 81-85 
Kong, J. J. & Parhi, K. K. (2004). Low-Latency Architectures for High-Throughput Rate 
Viterbi Decoders, IEEE Trans. on VLSI, Vol. 12, No. 6, Jun. 2004, pp. 642-651 
Kwak, J. & Lee, K. (2002). Design of Dividable Interleaver for Parallel Decoding in Turbo 
Codes, IET Electronics Letters, Vol. 38, No. 22, Oct. 2002, pp. 1362-1364 
Le, N.; Soleymani, M. R. & Shayan, Y. R. (2005). Distance-Based Decoding of Block Turbo 
Codes, IEEE Communications Letters, Vol. 9, No. 11, Nov. 2005, pp. 1006-1008 
LeBidan, R.; Leroux, C.; Jego, C.; Adde, P. & Pyndiah, R. (2008). Reed-Solomon Turbo 
Product Codes for Optical Communications: From Code Optimization to Decoder 
Design, Journal on Wireless Communications and Networking, Vol. 2008, Article ID 
658042, 14 pages 
Liu, H.; Diguet, J. P.; Jego, C. ; Jezequel, M. & Boutillon, E. (2007). Energy Efficient Turbo 
Decoder with Reduced State Metric Quantization, Proceedings of the IEEE Workshop 
on Signal Processing Systems, pp. 237-242, Shanghai, China, Oct. 2007 
Mansour, M. M. & Shanbhag, N. R. (2003). High throughput LDPC decoders, IEEE Trans. on 
VLSI., Vol. 11, No. 6, Dec. 2003, pp. 976–996 
Martina, M.; Nicola, M. & Masera, G. (2008-a). A Flexible UMTS-WiMax Turbo Decoder 
Architecture, IEEE Trans. on Circuits and Systems II, Vol. 55, No. 4, Apr. 2008, pp. 
369-373 
Martina, M.; Nicola, M. & Masera G. (2008-b). Hardware Design of a Low Complexity, 
Parallel Interleaver for WiMax Duo-Binary Turbo Decoding, IEEE Communications 
Letters, Vol 12, No. 11, Nov. 2008, pp. 846-848 
Martina, M.; Nicola, M. & Masera, G. (2009). VLSI Implementation of WiMax Convolutional 
Turbo Code Encoder and Decoder, Journal of Circuits, Systems, and Computers, 
Vol. 18, No. 3, May 2009, pp. 535-564 
Masera, G.; Piccinini, G.; Ruo Roch, M. & Zamboni, M. (1999). VLSI Architectures for Turbo 
Codes, IEEE Trans. on VLSI, Vol. 7, No. 3, Sep. 1999, pp. 369-379 
Masera, G.; Quaglio, F. & Vacca, F. (2005). Finite precision implementation of LDPC decoders, 
IEE Proceedings - Communications, Vol. 152,  No. 6,  Dec.  2005, pp. 1098 - 1102 
Masera,  G.; Quaglio, F.& Vacca, F. (2007). Implementation of a Flexible LDPC Decoder, 
IEEE Trans. on Circuits and Systems II, Vol. 54,  No. 6,  Jun 2007, pp. 542 – 546 
Muller, O.; Baghdadi, A. & Jezequel, M. (2006). Exploiting Parallel Processing Levels for 
Convolutional Turbo Decoding, Proceedings of the IEEE International Conference on 
Information and Communication Technologies, pp. 2353-2358, Damascus, Syria, Apr. 
2006  
 
Muller, O.; Baghdadi, A. & Jezequel, M. (2009). From Parallelism Levels to a Multi-ASIP 
Architecture for Turbo Decoding, IEEE Trans. on VLSI, Vol. 17, No. 1, Jan. 2009, pp. 
92-102 
Park, S. M.; Kwak, J. & Lee, K. (2008). Extrinsic Information Memory Reduced Architecture 
for Non-Binary Turbo Decoder Implementation, Proceedings of the IEEE Vehicular 
Technology Conference, pp. 539-543, Marina Bay, Singapore, May 2008 
Pyndiah, R. M. (1998). Near-Optimum Decoding of Product Codes: Block Turbo Codes, 
IEEE Trans. on Communications, Vol. 46, No. 8, Aug. 1998, pp. 1003-1010 
Quaglio, F.; Vacca, F.; Castellano, C.; Tarable, A. & Masera, G. (2006). Interconnection 
framework for high-throughput, flexible LDPC decoders, Proceedings of Design, 
Automation and Test in Europe Conference and Exhibition, pp. 1-6, Munich, Germany, 
Mar. 2006, 
Radar, C. M. (1981), Memory Management in a Viterbi Decoder, IEEE Trans. on 
Communications, Vol. COM-29, No. 9, Sep. 1981, pp. 1399-1401 
Robertson, P.; Villebrun E. & Hoeher P. (1995). A comparison of optimal and sub-optimal 
MAP decoding algorithms operating in the log domain, Proceedings of the IEEE 
International Conference on Communications, pp. 1009-1013,  Seattle, USA, Jun. 1995 
Shin, D.; Heo, K.; Oh, S.; & Ha, J. (2007). A Stopping Criterion for Low-Density Parity-Check 
Codes, Proceedings of the IEEE Vehicular Technology Conference, pp. 1529-1533, 
Dublin, Ireland, Apr. 2007.  
Singh, A.; Boutillon, E. & Masera, G. (2008). Bit-width Optimization of Extrinsic Information 
in Turbo Decoder, Proceedings of the 5th International Symposium on Turbo Codes & 
Related Topics, pp. 134-138, Lausanne, Switzerland, Sep. 2008  
Speziali, F. & Zory, J. (2004). Scalable and Area Efficient Concurrent Interleaver for High 
Throughput Turbo-Decoders, Proceedings of Euromicro Symposium on Digital 
SystemDesign, pp. 334-341, Rennes, France, Sep. 2004 
Talakoub, S.; Sabeti, L.; Shahrrava, B. & Ahmadi, M. (2007). An Improved Max-Log-MAP 
Algorithm for Turbo Decoding and Turbo Equalization, IEEE Trans. on 
Intrumentation and Measurement, Vol. 56, No. 3, Jun. 2007, pp 1058-1063 
Tarable, A.; Benedetto, S. & Montorsi, G. (2004). Mapping interleaver laws to parallel turbo 
and LDPC decoders architectures, IEEE Trans. on Information Theory, Vol. 50, No. 9, 
Sep. 2004, pp. 2002–2009 
Thul, M. J.; Wehn, N, & Rao L. P. (2002). Enabling High-Speed Turbo-Decoding through 
Concurrent Interleaving, Proceedings of the IEEE International Symposium on Circuits 
and Systems,  pp. 897-900, Scottsdale, USA, May 2002 
Thul, M. J.; Gilbert, F. & Wehn, N. (2003). Concurrent Interleaving Architectures for High-
Throughput Channel Coding, Proceedings of the IEEE International Conference on 
Acoustic, Speech and Signal Processing, pp. 613-616, Hong Kong, Apr. 2003 
Vanstraceele, C.; Geller, B.; Brossier, J. M. & Barbot, J. P., A Low Complexity Block Turbo 
Decoder Architecture, IEEE Trans. on Communications, Vol. 56, No. 12, Dec. 2008, 
pp. 1985-1987 
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimum 
decoding algorithm, IEEE Trans. on Information Theory, Vol. IT-13, Apr. 1967, pp. 
260-269 
Vogt, J. & Finger, A. (2000). Improving the Max-Log-MAP Turbo Decoder. IET Electronics 
Letters, Vol. 36, No. 23, Nov. 2000, pp. 1937-1939 
WIMAX, New Developments26
 
Vogt, T. & Wehn, N. (2008). A Reconfigurable ASIP for Convolutional and Turbo Decoding 
in an SDR Environment, IEEE Trans. on VLSI, Vol. 16, No. 10, Oct. 2008, 
pp. 1309-1320 
Wang, H.; Yang H. & Yang, D. (2006). Improved Log-MAP Decoding Algorithm for Turbo-
like Codes, IEEE Communications Letters, Vol. 10, No. 3, Mar. 2006, pp. 186-188 
Weiss, C.; Bettstetter, C. & Riedel, S. (2001). Code Construction and Decoding of Parallel 
Concatenated Tailbiting Codes, IEEE Trans. on Information Theory, Vol. 47, No. 1, 
Jan. 2001, pp. 366-368 
Zhan, C.; Arslan, T.; Erdogan, A. T. & MacDougall, S. (2006). An Efficient Decoder Scheme 
for Double Binary Circular Turbo Codes, Proceedings of the IEEE International 
Conference on Acoustics, Speech and Signal Processing, pp. 229-232, Toulouse, France, 
May 2006 
Zhang, J. & Fossorier, M. P. C. (2005). Shuffled iterative decoding, IEEE Trans. on 
Communications, Vol. 53, No. 2, Feb. 2005, pp. 209-213 
