A fully-parallel turbo decoding algorithm by Maunder, R.G.
SUBMITTED TO IEEE TRANSACTIONS ON COMMUNICATIONS 1
A Fully-Parallel Turbo Decoding Algorithm
Robert G. Maunder, Senior Member, IEEE
Abstract—This paper proposes a novel alternative to the
Logarithmic Bahl-Cocke-Jelinek-Raviv (Log-BCJR) algorithm
for turbo decoding, yielding signiﬁcantly improved processing
throughput and latency. While the Log-BCJR processes turbo-
encoded bits in a serial forwards-backwards manner, the pro-
posed algorithm operates in a fully-parallel manner, processing
all bits in both components of the turbo code at the same
time. The proposed algorithm is compatible with all turbo
codes, including those of the LTE and WiMAX standards. These
standardized codes employ odd-even interleavers, facilitating a
novel technique for reducing the complexity of the proposed
algorithm by 50%. More speciﬁcally, odd-even interleavers allow
the proposed algorithm to alternate between processing the odd-
indexed bits of the ﬁrst component code at the same time as
the even-indexed bits of the second component, and vice-versa.
Furthermore, the proposed fully-parallel algorithm is shown to
converge to the same error correction performance as the state-
of-the-art turbo decoding algorithm. Owing to its signiﬁcantly
increased parallelism, the proposed algorithm facilitates through-
puts and latencies that are up to 6.86 times superior to those of
the state-of-the-art algorithm, when employed for the LTE and
WiMAX turbo codes. However, this is achieved at the cost of a
moderately increased computational complexity.
Index Terms—Turbo codes, Iterative decoding, Parallel algo-
rithms, Throughput, WiMAX
I. INTRODUCTION
DURING the past two decades, wireless communication
has been revolutionized by channel codes that beneﬁt
from iterative decoding algorithms. For example, the Long
Term Evolution (LTE) [1] and WiMAX [2] cellular telephony
standards employ turbo codes [3]–[8], which comprise a
concatenation of two convolutional codes. Conventionally, the
Logarithmic Bahl-Cocke-Jelinek-Raviv (Log-BCJR) algorithm
[9] is employed for the iterative decoding of these convolu-
tional codes. Meanwhile, the WiFi standard for Wireless Local
Area Networks (WLANs) [10] has adopted Low Density Parity
Check (LDPC) codes [11], which may operate on the basis
of the min-sum algorithm [12]. Owing to their strong error
correction capability, these sophisticated channel codes have
facilitated reliable communication at transmission throughputs
that closely approach the capacity of the wireless channel.
However, the achievable transmission throughput is limited by
the processing throughput of the iterative decoding algorithm,
if realtime operation is required. Furthermore, the iterative
decoding algorithm’s processing latency imposes a limit upon
the end-to-end latency. This is particularly relevant, since
multi-gigabit transmission throughputs and ultra-low end-to-end
latencies can be expected to be targets for next-generation
wireless communication standards [13]. Therefore, there is a
demand for iterative decoding algorithms having multi-gigabit
processing throughputs and ultra-low processing latencies.
The author is with Electronics and Computer Science, University of
Southampton, SO17 1BJ, United Kingdom, e-mail: rm@ecs.soton.ac.uk
The financial support of the EPSRC, Swindon UK under the grants
EP/J015520/1 and EP/L010550/1, as well as that of the TSB, Swindon UK
under the auspices of grant TS/L009390/1 is gratefully acknowledged.
This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may
no longer be accessible.
Owing to the inherent parallelism of the min-sum algorithm,
it may be operated in a fully-parallel manner, facilitating
LDPC decoders having processing throughputs of up to 16.2
Gbit/s [14]. By contrast, the processing throughput of the
state-of-the-art LTE turbo decoder [15] is limited to 2.15
Gbit/s. This may be attributed to the inherently serial nature
of the Log-BCJR algorithm, which is imposed by the data
dependencies of its forward and backward recursions [9].
More speciﬁcally, the turbo-encoded bits corresponding to
each convolutional code must be processed serially, spread
over numerous consecutive time periods. Furthermore, the
Log-BCJR algorithm is applied to the two convolutional codes
alternately, until a sufﬁcient number of decoding iterations
have been performed. As a result, thousands of time periods
are required to complete the iterative decoding process of the
state-of-the-art turbo decoder.
This motivates the novel turbo decoder algorithm that is
proposed in this paper. In contrast to the Log-BCJR algorithm,
the novel algorithm does not have any data dependencies,
facilitating fully-parallel turbo decoding. More speciﬁcally,
the proposed fully-parallel turbo decoder algorithm is capable
of processing all bits corresponding to both convolutional
codes at the same time. The proposed fully-parallel algorithm
is compatible with all turbo codes, including those of the
LTE and WiMAX standards. These standardized turbo codes
employ odd-even interleavers, facilitating a novel technique for
reducing the complexity of the proposed algorithm by 50%.
More speciﬁcally, odd-even interleavers allow the proposed
algorithm to alternate between processing the odd-indexed bits
of the ﬁrst component code at the same time as the even-
indexed bits of the second component, and vice-versa. This
process is repeated iteratively, until a sufﬁcient number of
decoding iterations have been performed. Owing to this, the
iterative decoding process can be completed using just tens
of time periods, which is signiﬁcantly lower than the number
required by the state-of-the-art turbo decoder of [15].
Note that a number of fully-parallel turbo decoders have
been previously proposed, although these suffer from signif-
icant disadvantages that are not manifested in the proposed
algorithm. In [16], the min-sum algorithm is employed to
perform turbo decoding. However, this approach only works
for a very limited set of turbo code designs, which does not
include those employed by any standards. A fully-parallel
turbo decoder implementation that represents the soft informa-
tion using analogue currents was proposed in [17], however
it only supports very short message lengths N. Similarly,
[18] proposes a fully-parallel turbo decoder algorithm that
operates on the basis of stochastic bit sequences. However,
this algorithm requires significantly more time periods than the
Log-BCJR algorithm, therefore having a significantly lower
processing throughput.
Fig. 1. Schematics of (a) a turbo encoder, (b) the proposed fully-parallel turbo decoder and (c) a Log-BCJR turbo decoder.
The rest of this paper is structured as follows. Section II
provides background information on turbo encoding and in-
troduces the notation that will be employed throughout this
paper. The proposed fully-parallel turbo decoding algorithm
is described for generalized turbo codes in Section III, before
being applied to the LTE and WiMAX turbo codes, where the
above-described 50% reduction in complexity can be afforded.
In Section IV, the proposed fully-parallel turbo decoding al-
gorithm is compared with the state-of-the-art algorithm. More
speciﬁcally, the proposed fully-parallel algorithm is shown to
converge to the same error correction performance as the state-
of-the-art turbo decoding algorithm, regardless of which turbo
code it is applied for. Owing to its signiﬁcantly increased
parallelism, the proposed algorithm facilitates throughputs and
latencies that are up to 6.86 times superior to those of the
state-of-the-art algorithm, when employed for the LTE and
WiMAX turbo codes. However, this is achieved at the cost
of a moderately increased computational complexity. Finally,
some conclusions are offered in Section V.
II. TURBO ENCODER
This section provides background information on turbo
encoding and introduces the notation that will be employed
throughout the remainder of this paper. Section II-A describes
a simpliﬁed turbo encoder, which facilitates a simpliﬁed
introduction of the proposed fully-parallel turbo decoder in
Section III. Sections II-B and II-C discuss the differences
between the simpliﬁed turbo encoder of Section II-A and those
of LTE and WiMAX, respectively.
A. Simpliﬁed turbo encoder
Figure 1(a) depicts a simplified turbo encoder, which encodes
a message frame $b^u_1 = [b^u_{1,k}]_{k=1}^{N}$ comprising $N$ number
of bits, each having a binary value $b^u_{1,k} \in \{0,1\}$. This message
frame is provided to an upper convolutional encoder, as shown
in Figure 1(a). This encoder uses the process described below
to generate two $N$-bit encoded frames, namely a parity frame
$b^u_2 = [b^u_{2,k}]_{k=1}^{N}$ and a systematic frame
$b^u_3 = [b^u_{3,k}]_{k=1}^{N}$.
Meanwhile, the message frame $b^u_1$ is interleaved, in order to
obtain the $N$-bit interleaved message frame
$b^l_1 = [b^l_{1,k}]_{k=1}^{N}$, as
shown in Figure 1(a). This is provided to a lower convolutional
encoder, which also uses the process described below to
generate two more $N$-bit encoded frames, namely a parity
frame $b^l_2 = [b^l_{2,k}]_{k=1}^{N}$ and a systematic frame
$b^l_3 = [b^l_{3,k}]_{k=1}^{N}$.
Here, the superscripts ‘u’ and ‘l’ indicate relevance to the upper
and lower convolutional encoders, respectively. However,
throughout the remainder of this paper, these superscripts are
only used when necessary to explicitly distinguish between
the two convolutional encoders and are omitted when the
discussion applies equally to both. Note that the turbo encoder
represents the $N$ bits of the message frame $b^u_1$ using four
encoded frames, comprising a total of $4N$ bits and resulting
in a turbo coding rate of $R = N/(4N) = 1/4$. Following
turbo encoding, the encoded frames may be modulated onto a
wireless channel and transmitted to a receiver.
Both convolutional encoders operate in the same manner,
on the basis of a state transition diagram like the $M = 8$-state
example of Figure 2. The upper convolutional encoder begins
from an initial state of $S_0 = 0$ and successively transitions
into each subsequent state $S_k \in \{0, 1, 2, \ldots, M-1\}$ by
considering the corresponding message bit $b_{1,k}$. Since there
are two possible values for the message bit $b_{1,k} \in \{0,1\}$,
there are $K = 2$ possible values for the state $S_k$ that can
be reached by transitioning from the previous state $S_{k-1}$. In
Figure 2 for example, a previous state of $S_{k-1} = 0$ implies
that the subsequent state is selected from $S_k \in \{0, 4\}$. This
example can also be expressed using the notation $c(0,0) = 1$
and $c(0,4) = 1$, where $c(S_{k-1}, S_k) = 1$ indicates that it is
possible for the convolutional encoder to transition from $S_{k-1}$
into $S_k$, whereas $c(S_{k-1}, S_k) = 0$ indicates that this transition
is impossible. Of the $K = 2$ options, the value for the state
$S_k$ is selected such that $b_1(S_{k-1}, S_k) = b_{1,k}$. For example,
$S_{k-1} = 0$ and $b_{1,k} = 0$ gives $S_k = 0$, while $S_{k-1} = 0$ and
$b_{1,k} = 1$ gives $S_k = 4$ in Figure 2. In turn, binary values are
selected for the corresponding bit in the parity frame $b_2$ and
the systematic frame $b_3$, according to $b_{2,k} = b_2(S_{k-1}, S_k)$ and
$b_{3,k} = b_3(S_{k-1}, S_k)$. In the example of Figure 2, $S_{k-1} = 0$
and $S_k = 0$ gives $b_{2,k} = 0$ and $b_{3,k} = 0$, while $S_{k-1} = 0$ and
$S_k = 4$ gives $b_{2,k} = 1$ and $b_{3,k} = 1$.
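The state-by-state encoding process described above can be sketched in a few lines of Python. This is only an illustrative sketch: the feedback polynomial 1 + D^2 + D^3 and the feedforward polynomial 1 + D + D^3 are assumed from the LTE specification rather than taken from this section, and the state numbering (most recent register content as the most significant bit) is an assumption chosen so that the transitions out of state 0 match the example given for Figure 2. The function names `rsc_step` and `encode` are hypothetical.

```python
def rsc_step(state, bit):
    """One trellis transition: returns (next_state, parity_bit, systematic_bit)."""
    # Unpack the 3-bit state; s1 holds the most recent register content.
    s1, s2, s3 = state // 4 % 2, state // 2 % 2, state % 2
    a = bit ^ s2 ^ s3        # feedback taps, assumed polynomial 1 + D^2 + D^3
    parity = a ^ s1 ^ s3     # feedforward taps, assumed polynomial 1 + D + D^3
    next_state = a * 4 + s1 * 2 + s2
    return next_state, parity, bit


def encode(message, state=0):
    """Encode a message frame b1 into a parity frame b2 and a systematic frame b3."""
    b2, b3 = [], []
    for bit in message:
        state, parity, systematic = rsc_step(state, bit)
        b2.append(parity)
        b3.append(systematic)
    return b2, b3, state


b2, b3, final_state = encode([1, 0, 1, 1])
print(b2, b3, final_state)
```

Under these assumptions, each of the M = 8 states has K = 2 successors, one per value of the message bit, which is the structure captured by c(S_{k-1}, S_k) in the text.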
Fig. 2. State transition diagram of the LTE turbo code.
B. LTE turbo encoder
The LTE turbo encoder [1] employs the state transition
diagram of Figure 2, which has $M = 8$ states and $K = 2$
transitions per state. Furthermore, the LTE turbo encoder
employs an odd-even interleaver [19] that supports various
frame lengths $N$ in the range 40 to 6144 bits. However, in
contrast to the simplified turbo encoder of Figure 1(a), the
LTE turbo encoder [1] employs twelve additional termination
bits to force each convolutional encoder into the final state
$S_{N+3} = 0$. More specifically, the upper encoder generates the
three message termination bits $b^u_{1,N+1}$, $b^u_{1,N+2}$ and
$b^u_{1,N+3}$, as well as the three parity termination bits
$b^u_{2,N+1}$, $b^u_{2,N+2}$ and $b^u_{2,N+3}$. The lower
convolutional encoder operates in a similar manner, generating
corresponding sets of three message termination bits
$b^l_{1,N+1}$, $b^l_{1,N+2}$ and $b^l_{1,N+3}$, as well as three
parity termination bits $b^l_{2,N+1}$, $b^l_{2,N+2}$ and
$b^l_{2,N+3}$. However,
in contrast to the systematic frame $b^u_3$ that is produced by the
upper convolutional encoder, that of the lower convolutional
encoder $b^l_3$ is not output by the LTE turbo encoder. Owing to
this, the LTE turbo encoder uses a total of $(3N + 12)$ bits to
represent the $N$ bits of the message frame $b^u_1$, giving a coding
rate of $R = N/(3N + 12)$.
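As a worked illustration of the coding rate expression above, the following sketch evaluates R = N/(3N + 12) at the two extremes of the supported frame length range. The function name `lte_coding_rate` is hypothetical and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction


def lte_coding_rate(n):
    """Coding rate R = N / (3N + 12) of the LTE turbo encoder, as an exact fraction."""
    return Fraction(n, 3 * n + 12)


for n in (40, 6144):
    rate = lte_coding_rate(n)
    print(f"N = {n}: R = {rate} (about {float(rate):.4f})")
```

The twelve termination bits weigh most heavily on short frames, so the rate is lowest at N = 40 and approaches, but never reaches, 1/3 as N grows.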
C. WiMAX turbo encoder
Like the LTE turbo encoder, the WiMAX turbo encoder
[2] employs an odd-even interleaver, supporting various frame
lengths $N$ in the range 24 to 2400 bits. However, in contrast
to the LTE turbo encoder, the WiMAX turbo encoder is duo-binary
[2]. More specifically, the upper WiMAX convolutional
encoder encodes two $N$-bit message frames at once, namely
$b^u_1$ and $b^u_2$.
In response, it produces four $N$-bit encoded frames, namely
two parity frames $b^u_3$ and $b^u_4$, as well as two systematic
frames $b^u_5$ and $b^u_6$. Meanwhile, the message frames $b^u_1$ and
$b^u_2$ are interleaved, in order to obtain the two $N$-bit interleaved
message frames $b^l_1$ and $b^l_2$. These are encoded by the lower
convolutional encoder, in order to generate two parity frames
$b^l_3$ and $b^l_4$. As in the LTE turbo encoder however, the lower
encoder’s $N$-bit systematic frames $b^l_5$ and $b^l_6$ are not output
by the WiMAX turbo encoder. Therefore, the WiMAX turbo
encoder represents the $2N$ bits of the message frames $b^u_1$ and
$b^u_2$ using six encoded frames, comprising a total of $6N$ bits
and resulting in a coding rate of $R = (2N)/(6N) = 1/3$. In
the WiMAX turbo encoder, the upper and lower convolutional
encoders operate on the basis of a state transition diagram
having $K = 4$ transitions from each of $M = 8$ states, in
correspondence to the four possible combinations of the two
message bits. Rather than employing termination, WiMAX
employs tailbiting to ensure that $S_N = S_0$, which may require
$S_N$ and $S_0$ to have non-zero values.
III. THE PROPOSED FULLY-PARALLEL TURBO DECODER
This section describes the operation of the proposed fully-
parallel turbo decoding algorithm, which is compatible with all
turbo codes. Section III-A considers the generalized applica-
bility of the proposed algorithm, using an example of a parallel
turbo decoder that corresponds to the simpliﬁed turbo encoder
of Section II-A. Following this, Sections III-B and III-C discuss
how the proposed fully-parallel turbo decoding algorithm may
be applied to the LTE and WiMAX turbo codes, respectively.
A. Simpliﬁed turbo decoder
Following their transmission over a wireless channel, the
four encoded frames $b^u_2$, $b^u_3$, $b^l_2$ and $b^l_3$ may be demodulated
and provided to the turbo decoder of Figure 1(b). However,
owing to the effect of noise in the wireless channel, the
demodulator will be uncertain of the bit values in these
encoded frames. Therefore, instead of providing frames comprising
$N$ hard-valued bits, the demodulator provides four
frames each comprising $N$ soft-valued a priori Logarithmic
Likelihood Ratios (LLRs)
$\bar{b}^{u,a}_2 = [\bar{b}^{u,a}_{2,k}]_{k=1}^{N}$,
$\bar{b}^{u,a}_3 = [\bar{b}^{u,a}_{3,k}]_{k=1}^{N}$,
$\bar{b}^{l,a}_2 = [\bar{b}^{l,a}_{2,k}]_{k=1}^{N}$ and
$\bar{b}^{l,a}_3 = [\bar{b}^{l,a}_{3,k}]_{k=1}^{N}$. Here, an LLR pertaining
to the bit $b_{j,k}$ is defined by
$$\bar{b}_{j,k} = \ln \frac{\Pr(b_{j,k} = 1)}{\Pr(b_{j,k} = 0)}, \tag{1}$$
where the superscripts ‘a’, ‘e’ or ‘p’ may be appended to
indicate an a priori, extrinsic or a posteriori LLR, respectively.
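The definition in (1) can be illustrated with a short sketch that converts between a bit probability and its LLR. The function names `llr` and `probability_of_one` are hypothetical and are introduced here only for illustration.

```python
import math


def llr(p_one):
    """LLR of a bit given Pr(bit = 1) = p_one, following (1)."""
    return math.log(p_one / (1.0 - p_one))


def probability_of_one(value):
    """Inverse mapping: recover Pr(bit = 1) from an LLR."""
    return 1.0 / (1.0 + math.exp(-value))


for p in (0.1, 0.5, 0.9):
    print(p, llr(p))
```

A positive LLR therefore expresses a belief that the bit is 1, a negative LLR expresses a belief that it is 0, and the magnitude expresses the confidence in that belief.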
The demodulator provides these a priori LLRs to the fully-parallel
turbo decoder’s $2N$ algorithmic blocks, which are
shown in Figure 1(b) arranged in two rows. More specifically,
the a priori parity LLR $\bar{b}^{u,a}_{2,k}$ and the a priori systematic LLR
$\bar{b}^{u,a}_{3,k}$ are provided to the $k$th algorithmic block in the upper
row shown in Figure 1(b). Furthermore, the interleaver of
Figure 1(b) provides the $k$th algorithmic block in the upper
row with the a priori message LLR $\bar{b}^{u,a}_{1,k}$, as will be detailed
below. Meanwhile, the $k$th algorithmic block in the lower row
is correspondingly provided with the a priori LLRs
$\bar{b}^{l,a}_{1,k}$, $\bar{b}^{l,a}_{2,k}$ and $\bar{b}^{l,a}_{3,k}$.
In addition to this, the $k$th algorithmic block in
each row is also provided with a vector of a priori forward
state metrics
$\bar{\alpha}_{k-1} = [\bar{\alpha}_{k-1}(S_{k-1})]_{S_{k-1}=0}^{M-1}$
and a vector of a priori backward state metrics
$\bar{\beta}_k = [\bar{\beta}_k(S_k)]_{S_k=0}^{M-1}$, as will be
detailed below. All algorithmic blocks operate in an identical
manner, using the equations provided in (2) – (5).

The proposed fully-parallel turbo decoding algorithm.
$$\bar{\delta}_k(S_{k-1}, S_k) = \left[ \sum_{j=1}^{L} b_j(S_{k-1}, S_k) \cdot \bar{b}^a_{j,k} \right] + \bar{\alpha}_{k-1}(S_{k-1}) + \bar{\beta}_k(S_k) \tag{2}$$
$$\bar{\alpha}_k(S_k) = \left[ \operatorname*{max*}_{\{S_{k-1} \mid c(S_{k-1}, S_k) = 1\}} \bar{\delta}_k(S_{k-1}, S_k) \right] - \bar{\beta}_k(S_k) \tag{3}$$
$$\bar{\beta}_{k-1}(S_{k-1}) = \left[ \operatorname*{max*}_{\{S_k \mid c(S_{k-1}, S_k) = 1\}} \bar{\delta}_k(S_{k-1}, S_k) \right] - \bar{\alpha}_{k-1}(S_{k-1}) \tag{4}$$
$$\bar{b}^e_{j,k} = \left[ \operatorname*{max*}_{\{(S_{k-1}, S_k) \mid b_j(S_{k-1}, S_k) = 1\}} \bar{\delta}_k(S_{k-1}, S_k) \right] - \left[ \operatorname*{max*}_{\{(S_{k-1}, S_k) \mid b_j(S_{k-1}, S_k) = 0\}} \bar{\delta}_k(S_{k-1}, S_k) \right] - \bar{b}^a_{j,k} \tag{5}$$

Note that
these equations are stated in a fully generalized manner,
allowing them to be applied to any turbo code, having any
state transition diagram and any number $L$ of a priori LLRs
per algorithmic block. More specifically, (2) is employed in
order to combine the $L = 3$ a priori LLRs $\bar{b}^a_{1,k}$, $\bar{b}^a_{2,k}$ and
$\bar{b}^a_{3,k}$, as well as the a priori state metrics of $\bar{\alpha}_{k-1}$ and $\bar{\beta}_k$.
This produces an a posteriori metric $\bar{\delta}_k(S_{k-1}, S_k)$ for each
transition in the state transition diagram, namely for each pair
of states $S_{k-1}$ and $S_k$ for which $c(S_{k-1}, S_k) = 1$. These a
posteriori transition metrics are then combined by (3), (4) and
(5), in order to produce the vector of extrinsic forward state
metrics $\bar{\alpha}_k = [\bar{\alpha}_k(S_k)]_{S_k=0}^{M-1}$, the vector of extrinsic backward
state metrics $\bar{\beta}_{k-1} = [\bar{\beta}_{k-1}(S_{k-1})]_{S_{k-1}=0}^{M-1}$ and the extrinsic
message LLR $\bar{b}^e_{1,k}$, respectively. These equations employ the
Jacobian logarithm, which is defined for two operands as
$$\operatorname{max*}(\bar{\delta}_1, \bar{\delta}_2) = \max(\bar{\delta}_1, \bar{\delta}_2) + \ln\left(1 + e^{-|\bar{\delta}_1 - \bar{\delta}_2|}\right) \tag{6}$$
and may be extended to more operands by exploiting its
associativity property. Alternatively, the exact max* operator
of (6) may be optionally replaced with the approximation [9]
$$\operatorname{max*}(\bar{\delta}_1, \bar{\delta}_2) \approx \max(\bar{\delta}_1, \bar{\delta}_2), \tag{7}$$
in order to reduce the complexity of the proposed fully-parallel
turbo decoder, at the cost of slightly degrading its
error correction performance.
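The operation of a single algorithmic block, as described by (2) – (5) together with the max* operator of (6) and (7), may be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation considered in this paper: the names `max_star`, `max_star_all`, `remove_own`, `algorithmic_block` and `TOY_TRELLIS` are hypothetical, the trellis is a two-state toy example rather than that of Figure 2, and the guard that keeps a metric at minus infinity for unreachable states is an implementation choice added here to avoid undefined arithmetic.

```python
import math
from functools import reduce


def max_star(a, b, exact=True):
    """Jacobian logarithm of (6); the max approximation of (7) when exact is False."""
    if not exact or math.isinf(a) or math.isinf(b):
        return max(a, b)
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_all(values, exact=True):
    """Extend max* to any number of operands by associativity (-inf for no operands)."""
    return reduce(lambda x, y: max_star(x, y, exact), values, -math.inf)


def remove_own(total, own):
    """Subtract a block's own a priori term, keeping -inf for unreachable states."""
    return total if math.isinf(total) else total - own


def algorithmic_block(transitions, llrs_a, alpha_prev, beta_next, exact=True):
    """Apply (2)-(5) once; transitions lists (s_prev, s_next, bits) with c(s_prev, s_next) = 1."""
    num_states = len(alpha_prev)
    # (2): a posteriori metric of every permitted transition
    delta = [sum(b * llr for b, llr in zip(bits, llrs_a)) + alpha_prev[sp] + beta_next[sn]
             for sp, sn, bits in transitions]
    # (3): extrinsic forward state metrics alpha_k
    alpha = [remove_own(max_star_all([d for d, t in zip(delta, transitions) if t[1] == s], exact),
                        beta_next[s]) for s in range(num_states)]
    # (4): extrinsic backward state metrics beta_{k-1}
    beta = [remove_own(max_star_all([d for d, t in zip(delta, transitions) if t[0] == s], exact),
                       alpha_prev[s]) for s in range(num_states)]
    # (5): extrinsic LLR of each of the L bits
    extrinsic = []
    for j, llr in enumerate(llrs_a):
        ones = max_star_all([d for d, t in zip(delta, transitions) if t[2][j] == 1], exact)
        zeros = max_star_all([d for d, t in zip(delta, transitions) if t[2][j] == 0], exact)
        extrinsic.append(ones - zeros - llr)
    return alpha, beta, extrinsic


# Toy 2-state trellis (an assumption for illustration): bits = (b1, b2, b3),
# next state = state XOR b1, parity b2 = next state, systematic b3 = b1.
TOY_TRELLIS = [(0, 0, (0, 0, 0)), (0, 1, (1, 1, 1)), (1, 1, (0, 1, 0)), (1, 0, (1, 0, 1))]

alpha, beta, extrinsic = algorithmic_block(TOY_TRELLIS, [0.0, 0.0, 4.0], [0.0, 0.0], [0.0, 0.0])
print(extrinsic)
```

In this toy example only the systematic LLR carries information. The extrinsic message LLR inherits that information, while the extrinsic systematic LLR returns to approximately zero, since (5) subtracts each bit's own a priori LLR.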
The proposed fully-parallel turbo decoder is operated iteratively,
where each of the $I$ iterations comprises the operation of
all algorithmic blocks shown in Figure 1(b). The turbo decoder
may be considered to be fully-parallel, since each iteration is
completed within just $T = 1$ time period, by operating all
$2N$ of the algorithmic blocks simultaneously. In general, the
extrinsic information produced by each algorithmic block in
Figure 1(b) is exchanged with that provided by the connected
algorithmic blocks, to be used as a priori information in the
next decoding iteration. More specifically, the $k$th algorithmic
block in each row passes the extrinsic message LLR $\bar{b}^e_{1,k}$
through the interleaver, to be used as an a priori LLR by
the connected block in the other row during the next decoding
iteration. Meanwhile, this block in the other row provides an
extrinsic message LLR which is used as the a priori message
LLR $\bar{b}^a_{1,k}$ during the next decoding iteration. Furthermore,
the $k$th algorithmic block in each row provides the vectors
of extrinsic state metrics $\bar{\alpha}_k$ and $\bar{\beta}_{k-1}$ for the neighboring
algorithmic blocks to employ in the next decoding iteration.
At the start of the first decoding iteration however, no extrinsic
information is available. In this case, the $k$th algorithmic
block in each row employs zero values for $\bar{b}^a_{1,k}$, $\bar{\alpha}_{k-1}$ and
$\bar{\beta}_k$. As an exception to this however, the first algorithmic
block in each row employs
$\bar{\alpha}_0 = [0, -\infty, -\infty, \ldots, -\infty]$
throughout all decoding iterations, since the convolutional
encoders always begin from an initial state of $S_0 = 0$. Similarly,
the last algorithmic block in each row employs
$\bar{\beta}_N = [0, 0, 0, \ldots, 0]$ throughout all decoding iterations, since
the final state of the convolutional encoders $S_N$ is not
known in advance to the receiver, when termination is not
employed. Following the completion of the final decoding
iteration, an a posteriori LLR pertaining to the $k$th message
bit $b^u_{1,k}$ may be obtained as
$\bar{b}^{u,p}_{1,k} = \bar{b}^{u,a}_{1,k} + \bar{b}^{u,e}_{1,k}$. An estimation
of the message bit $b^u_{1,k}$ may then be obtained as the result of
the binary test $\bar{b}^{u,p}_{1,k} > 0$.
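The boundary conditions and the final decision described above can be sketched as follows. The function names `initial_state_metrics` and `decide` are hypothetical, and the sketch covers only the simplified, unterminated case of this section.

```python
import math


def initial_state_metrics(num_states):
    """Boundary state metrics that are held fixed throughout all decoding iterations."""
    alpha_0 = [0.0] + [-math.inf] * (num_states - 1)  # encoder starts in S_0 = 0
    beta_n = [0.0] * num_states                       # S_N unknown without termination
    return alpha_0, beta_n


def decide(llrs_a, llrs_e):
    """Form a posteriori LLRs (a priori + extrinsic) and apply the binary test > 0."""
    llrs_p = [a + e for a, e in zip(llrs_a, llrs_e)]
    return [1 if p > 0 else 0 for p in llrs_p]


alpha_0, beta_n = initial_state_metrics(8)
print(decide([1.5, -0.2, 0.3], [-0.5, -1.0, 0.4]))
```

All other a priori state metrics and message LLRs start from zero and are overwritten by the extrinsic values exchanged after each iteration.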
B. LTE turbo decoder
As described in Section II-B, the LTE turbo code employs
an odd-even interleaver [19]. More explicitly, the LTE inter-
leaver only connects algorithmic blocks from the upper row
having an odd index k to blocks from the lower row that also
have an odd index k. Similarly, blocks having even indices k in
the upper row are only connected to blocks having even indices
k in the lower row. Owing to this, the 2N algorithmic blocks
of Figure 1(b) can be grouped into two sets, where all blocks
within a particular set are independent, having no connections
to each other. The ﬁrst set comprises all algorithmic blocks
from the upper row having an odd index k, as well as all
blocks from the lower row having an even index k, which
are depicted with light shading in Figure 1(b). Meanwhile,
the second set is complementary to the ﬁrst, comprising the
algorithmic blocks having dark shading in Figure 1(b). In this
way, the iterative exchange of extrinsic information between
2N algorithmic blocks can be instead thought of as an iterative
exchange of extrinsic information between the two sets.
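The grouping into two independent sets can be checked numerically. The sketch below is illustrative only: it assumes the quadratic permutation polynomial form of the LTE interleaver and the parameters f1 = 3 and f2 = 10 for N = 40, which are taken from the LTE specification rather than from this paper, and the function names `qpp_interleaver`, `is_odd_even` and `block_set` are hypothetical.

```python
def qpp_interleaver(n, f1, f2):
    """Assumed LTE interleaver form, 0-indexed: pi(i) = (f1*i + f2*i^2) mod n."""
    return [(f1 * i + f2 * i * i) % n for i in range(n)]


def is_odd_even(perm):
    """True if every position is mapped to a position of the same parity."""
    return all((i - p) % 2 == 0 for i, p in enumerate(perm))


def block_set(row, k):
    """Set index of the k-th block (1-indexed) in row 'u' or 'l'.

    Set 0 holds upper-odd and lower-even blocks; set 1 holds the complement.
    """
    return (k + (0 if row == "u" else 1) + 1) % 2


perm = qpp_interleaver(40, 3, 10)

# Interleaver connections: lower block i+1 is paired with upper block perm[i]+1.
across_rows = all(block_set("l", i + 1) != block_set("u", p + 1) for i, p in enumerate(perm))
# State metric connections: neighbouring blocks k and k+1 within the same row.
within_rows = all(block_set(r, k) != block_set(r, k + 1) for r in "ul" for k in range(1, 40))
print(is_odd_even(perm), across_rows, within_rows)
```

Every connection of a block, whether through the interleaver or to a neighbor in the same row, leads to a block in the other set, which is what allows the two sets to be operated in alternate time periods.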
In the general case where the interleaver design prevents
grouping into sets of independent algorithmic blocks, the
approach described in Section III-A is recommended, where
all algorithmic blocks are operated in every time period,
corresponding to T = 1 time period per decoding iteration.
However, in the case of an odd-even interleaver, the simul-
taneous operation of both sets of independent algorithmic
blocks is analogous to juggling two balls, which are simul-
taneously thrown between two hands, but remain independent
of each other. In the proposed fully-parallel turbo decoder, this
corresponds to two independent iterative decoding processes,
which have no inﬂuence on each other. Therefore, one of these
independent iterative decoding processes can be considered to
be redundant and may be discarded. This can be achieved
by operating the algorithmic blocks of only one set in each
time period, with consecutive time periods alternating between
the two sets. With this approach, each decoding iteration
can be considered to comprise T = 2 consecutive time
periods. Although this is double the number required by the
T = 1 approach described in Section III-A, this T = 2
approach requires half as many decoding iterations in order to
achieve the same error correction performance. Therefore, the
T = 2 approach maintains the same processing throughput and
latency as the T = 1 approach, but achieves a 50% reduction
in complexity per message frame. Furthermore, while one
set of algorithmic blocks is being used in a particular time
period to decode a particular message frame, the other set of
blocks can be used to decode a different message frame. In
this way, the two sets of algorithmic blocks may be operated
concurrently, alternating between the concurrent decoding of
two different message frames and facilitating a 100% increase
in the overall processing throughput. Note however that this
concurrent decoding technique is not assumed throughout the
remainder of this paper, since it does not facilitate a fair
comparison with the considered benchmarkers, which are only
capable of decoding one message frame at a time.
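The 50% complexity reduction claimed above follows from simple counting, as the sketch below illustrates. The function name `block_operations` is hypothetical, and the iteration count of 96 for the T = 1 approach is an assumed value, chosen only so that the T = 2 approach uses the 48 iterations quoted for LTE in Table I.

```python
def block_operations(n, iterations, time_periods_per_iteration):
    """Count the algorithmic-block operations needed to decode one message frame."""
    if time_periods_per_iteration == 1:
        blocks_per_period = 2 * n   # both sets operate in every time period
    else:
        blocks_per_period = n       # only one of the two sets operates per period
    return blocks_per_period * time_periods_per_iteration * iterations


n = 6144
iterations_t1 = 96                     # assumed for illustration
iterations_t2 = iterations_t1 // 2     # T = 2 needs half as many iterations

ops_t1 = block_operations(n, iterations_t1, 1)
ops_t2 = block_operations(n, iterations_t2, 2)
print(ops_t2 / ops_t1)                             # ratio of block operations
print(iterations_t1 * 1, iterations_t2 * 2)        # total time periods for each approach
```

Both approaches occupy the same number of time periods, so throughput and latency are unchanged, while the T = 2 approach performs half as many block operations.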
As described in Section II-B, the LTE turbo code employs
twelve termination bits to force each of its convolutional
encoders into the final state $S_{N+3} = 0$. In the receiver,
the demodulator provides the corresponding LLRs
$\bar{b}^{u,a}_{1,N+1}$, $\bar{b}^{u,a}_{1,N+2}$, $\bar{b}^{u,a}_{1,N+3}$,
$\bar{b}^{u,a}_{2,N+1}$, $\bar{b}^{u,a}_{2,N+2}$ and $\bar{b}^{u,a}_{2,N+3}$ to the upper row,
while the lower row is provided with
$\bar{b}^{l,a}_{1,N+1}$, $\bar{b}^{l,a}_{1,N+2}$, $\bar{b}^{l,a}_{1,N+3}$,
$\bar{b}^{l,a}_{2,N+1}$, $\bar{b}^{l,a}_{2,N+2}$ and $\bar{b}^{l,a}_{2,N+3}$.
As shown in Figure 3, these LLRs
can be provided to three additional algorithmic blocks, which
are positioned at the end of each row in the proposed fully-parallel
turbo decoder.
Fig. 3. Schematic of the proposed fully-parallel algorithm, when employing
the termination technique of the LTE turbo code.
The three additional algorithmic blocks at the end of each
row do not need to be operated iteratively, within the iterative
decoding process. Instead, they can be operated just once, before
the iterative decoding process begins, using a backwards
recursion. More specifically, the algorithmic blocks with the
index $k = N + 3$ may employ Equations (8) and (10) in order
to process the $L = 2$ LLRs $\bar{b}^a_{1,N+3}$ and $\bar{b}^a_{2,N+3}$. Here, the state
metrics $\bar{\beta}_{N+3} = [0, -\infty, -\infty, \ldots, -\infty]$ should be employed
since a final state of $S_{N+3} = 0$ is guaranteed. The resultant
state metrics $\bar{\beta}_{N+2}$ can then be provided to the algorithmic
block having the index $k = N + 2$. In turn, this uses the
same process in order to obtain $\bar{\beta}_{N+1}$, which is then provided
to the block where $k = N + 1$ in order to obtain $\bar{\beta}_N$ in the
same way. The resultant values of $\bar{\beta}_N$ may then be employed
throughout the iterative decoding process, without any need
to operate the three additional algorithmic blocks again. Note
that there is no penalty associated with adopting this approach,
since Equations (8) and (10) reveal that the values of $\bar{\beta}_N$
are independent of all values that are updated as the iterative
decoding process proceeds.
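The one-off backward recursion over the three termination blocks may be sketched as follows. Equations (8) and (10) are not reproduced in this part of the paper, so the sketch assumes the standard Log-BCJR backward recursion with the max approximation of (7). The trellis polynomials and state numbering are likewise assumptions, chosen to agree with the Figure 2 example, and the function names are hypothetical.

```python
import math


def lte_transitions():
    """All (s_prev, s_next, b1, b2) of an assumed 8-state LTE-like trellis."""
    out = []
    for s in range(8):
        s1, s2, s3 = s // 4 % 2, s // 2 % 2, s % 2
        for b1 in (0, 1):
            a = b1 ^ s2 ^ s3                      # assumed feedback taps 1 + D^2 + D^3
            out.append((s, a * 4 + s1 * 2 + s2, b1, a ^ s1 ^ s3))
    return out


def backward_step(beta_next, llr_b1, llr_b2):
    """Assumed recursion: beta_{k-1}(s_prev) = max over s_next of
    [b1 * llr_b1 + b2 * llr_b2 + beta_k(s_next)]."""
    beta_prev = [-math.inf] * 8
    for sp, sn, b1, b2 in lte_transitions():
        metric = b1 * llr_b1 + b2 * llr_b2 + beta_next[sn]
        beta_prev[sp] = max(beta_prev[sp], metric)
    return beta_prev


def termination_beta_n(term_llrs):
    """term_llrs holds (llr_b1, llr_b2) for k = N+1, N+2, N+3; returns beta_N."""
    beta = [0.0] + [-math.inf] * 7                # beta_{N+3}: S_{N+3} = 0 is guaranteed
    for llr_b1, llr_b2 in reversed(term_llrs):
        beta = backward_step(beta, llr_b1, llr_b2)
    return beta


print(termination_beta_n([(0.0, 0.0)] * 3))
print(termination_beta_n([(2.0, -1.0), (0.5, 0.5), (-3.0, 1.0)]))
```

Because the recursion depends only on the termination LLRs supplied by the demodulator, its result can be computed once and reused in every decoding iteration.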
Note that since the LTE turbo encoder does not output
the systematic frame $b^l_3$ produced by the lower convolutional
encoder, the $k$th algorithmic block in the lower row uses (2)
to consider only the $L = 2$ a priori LLRs $\bar{b}^{l,a}_{1,k}$ and $\bar{b}^{l,a}_{2,k}$. By
contrast, the algorithmic block in the upper row having the
index $k \in [1, N]$ considers the $L = 3$ a priori LLRs
$\bar{b}^{u,a}_{1,k}$, $\bar{b}^{u,a}_{2,k}$ and $\bar{b}^{u,a}_{3,k}$.
This is shown in Figure 3 for the algorithmic blocks
having the index $k = N$.
C. WiMAX turbo decoder
Like the LTE turbo code, the WiMAX turbo code employs
an odd-even interleaver [19], allowing it to beneﬁt from a 50%
reduction in the complexity of the fully-parallel turbo decoder,
as described in Section III-B. Furthermore, the concurrent de-
coding of two message frames is supported, facilitating a 100%
increase in overall processing throughput. However, the use of
this concurrent decoding technique is not assumed throughout
the remainder of this paper, as described in Section III-B.
The algorithm of (2) – (5) supports the duo-binary nature
of the WiMAX turbo code. Here, the algorithmic blocks in
the upper row consider $L = 6$ a priori LLRs, while those
in the lower row consider $L = 4$ LLRs, since the systematic
frames $b^l_5$ and $b^l_6$ produced by the lower convolutional code
are not output. More specifically, the $k$th algorithmic block in
the upper row is provided with six a priori LLRs
$\bar{b}^{u,a}_{1,k}$, $\bar{b}^{u,a}_{2,k}$, $\bar{b}^{u,a}_{3,k}$,
$\bar{b}^{u,a}_{4,k}$, $\bar{b}^{u,a}_{5,k}$ and $\bar{b}^{u,a}_{6,k}$,
using these to generate two extrinsic
LLRs $\bar{b}^{u,e}_{1,k}$ and $\bar{b}^{u,e}_{2,k}$. By contrast,
$\bar{b}^{l,a}_{1,k}$, $\bar{b}^{l,a}_{2,k}$, $\bar{b}^{l,a}_{3,k}$ and $\bar{b}^{l,a}_{4,k}$ are
provided to the $k$th algorithmic block in the lower row, which
generates $\bar{b}^{l,e}_{1,k}$ and $\bar{b}^{l,e}_{2,k}$ in response. Tailbiting can be achieved
by employing $\bar{\alpha}_0 = [0, 0, 0, \ldots, 0]$ and $\bar{\beta}_N = [0, 0, 0, \ldots, 0]$
in the first iteration. In all subsequent iterations, the most-recently
obtained values of $\bar{\alpha}_N$ and $\bar{\beta}_0$ can be employed for
$\bar{\alpha}_0$ and $\bar{\beta}_N$, respectively.

TABLE I
VARIOUS CHARACTERISTICS OF THE PROPOSED FULLY-PARALLEL ALGORITHM, THE LOG-BCJR ALGORITHM AND THE STATE-OF-THE-ART ALGORITHM
OF [15], WHEN IMPLEMENTING THE LTE AND WIMAX TURBO DECODERS USING THE APPROXIMATE max* OPERATOR OF (7).

| Characteristic | Fully-parallel LTE, $N \in [40, 6144]$ | Fully-parallel WiMAX, $N \in [24, 2400]$ | Log-BCJR LTE, $N \in [40, 6144]$ | Log-BCJR WiMAX, $N \in [24, 2400]$ | State-of-the-art LTE, $N \in [2048, 6144]$ | State-of-the-art WiMAX, $N \in [1440, 2400]$ |
|---|---|---|---|---|---|---|
| Time periods per decoding iteration $T$ | 2 | 2 | $4N$ | $4N$ | $N/32$ | $N/16$ |
| Time period duration $D$ | 8 (7) | 9 | 9 | 11 | 3 | 3 |
| Complexity per decoding iteration $C$ | $155N$ | $348N$ | $171N$ | $436N$ | $275N$ | $550N$ |
| Decoding iterations required $I$ | 48 | 32 | 8 | 8 | 8 | 8 |
| Overall throughput $1/(T \cdot D \cdot I)$ | $1/768$ ($1/672$) | $1/576$ | $1/(288N)$ | $1/(352N)$ | $4/(3N)$ | $2/(3N)$ |
| Overall latency $T \cdot D \cdot I$ | 768 (672) | 576 | $288N$ | $352N$ | $3N/4$ | $3N/2$ |
| Overall complexity $C \cdot I$ | $7440N$ | $11136N$ | $1368N$ | $3488N$ | $2200N$ | $4400N$ |
| Memory required | $3N + 12$ ($4N + 12$) | $6N$ | $28N + 12$ | $48N$ | $14N/3 + 12$ | $28N/3$ |
IV. COMPARISON WITH THE LOG-BCJR TURBO DECODER
This section compares the proposed fully-parallel turbo
decoder with the Log-BCJR turbo decoder, as well as with
the state-of-the-art turbo decoder of [15]. For each of these
schemes, Sections IV-A – IV-E quantify the number of time
periods required per decoding iteration, the memory require-
ments, the computational complexity per decoding iteration,
the time period duration and the number of decoding iterations
required to achieve a particular error correction performance,
respectively. In Section IV-F, these characteristics are com-
bined in order to quantify the overall throughput, latency and
computational complexity of the various algorithms, when
employed for both LTE and WiMAX turbo decoding. The
comparisons are summarized in Table I.
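The way in which the characteristics of Table I combine can be illustrated with a short calculation. The sketch below uses the LTE entries of Table I at the longest frame length N = 6144. The function name `overall` is hypothetical, and the use of the parenthesised time period duration D = 7 for the fully-parallel algorithm is an assumption made here, since it is this entry that reproduces the quoted factor of 6.86.

```python
from fractions import Fraction


def overall(t, d, i, c):
    """Return (latency T*D*I, throughput 1/(T*D*I), complexity C*I) for one scheme."""
    latency = Fraction(t) * d * i
    return latency, 1 / latency, c * i


n = 6144                                                    # longest LTE frame length
fp_lte = overall(t=2, d=7, i=48, c=155 * n)                 # fully-parallel, D = 7 entry
soa_lte = overall(t=Fraction(n, 32), d=3, i=8, c=275 * n)   # state-of-the-art of [15]

speedup = soa_lte[0] / fp_lte[0]
complexity_ratio = Fraction(fp_lte[2], soa_lte[2])
print(float(speedup), float(complexity_ratio))
```

The same calculation shows the price paid for this speedup: the overall complexity of the fully-parallel algorithm is 7440N against 2200N for the state-of-the-art algorithm, a ratio of roughly 3.4.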
A. Operation
Figure 1(c) depicts a simplified Log-BCJR turbo decoder,
which corresponds to the simplified turbo encoder of Figure 1(a).
Like the fully-parallel turbo decoder of Figure 1(b),
the Log-BCJR turbo decoder is operated iteratively, where
each of the I iterations comprises the operation of all algorithmic
blocks shown in Figure 1(c). However, as shown in
Table I, T = 4N consecutive time periods are required to
complete each decoding iteration, so that the 4N algorithmic
blocks can be operated sequentially, in the order indicated
by the bold arrows of Figure 1(c). These arrows indicate the
data dependencies of the Log-BCJR algorithm, which impose
the forward and backward recursions shown in Figure 1(c).
Therefore, when implementing the LTE or WiMAX turbo
decoders, the number of time periods per iteration required by
the Log-BCJR algorithm is 2N times higher than the T = 2
time periods required by the proposed fully-parallel algorithm.
During the forward and backward recursions of the Log-BCJR
algorithm, the kth pair of algorithmic blocks in the
upper and lower rows operate on the basis of Equations (8)
– (12) [9]. During the forward recursion, the corresponding
kth algorithmic block in the upper and lower rows employs
(8) to combine the L = 3 a priori LLRs $\bar{b}^{a}_{1,k}$,
$\bar{b}^{a}_{2,k}$ and $\bar{b}^{a}_{3,k}$,
in order to obtain an a priori metric $\gamma_k(S_{k-1}, S_k)$ for each
transition in the state transition diagram. Following this, (9)
combines these a priori transition metrics with the a priori
forward state metrics of $\alpha_{k-1}$, in order to obtain the extrinsic
forward state metrics of $\alpha_k$. These extrinsic state metrics are
then passed to the (k + 1)th algorithmic block, to be employed
as a priori state metrics in the next time period. During the
backward recursion, the corresponding kth algorithmic block
in the upper and lower rows employs (10) to combine the a
priori metric $\gamma_k(S_{k-1}, S_k)$ of each transition with the a priori
backward state metrics of $\beta_k$. This produces the extrinsic
backward state metrics of $\beta_{k-1}$, which may be passed to
the (k − 1)th algorithmic block, to be employed as a priori
state metrics in the next time period. Furthermore, the kth
algorithmic block in the backward recursion of the upper rows
employs (11) to obtain an a posteriori metric $\delta_k(S_{k-1}, S_k)$
for each transition in the state transition diagram. Finally, the
extrinsic message LLR $\bar{b}^{e}_{1,k}$ is obtained using (12). As in the
proposed fully-parallel turbo decoder of Section III-A, zero
values are employed for the a priori message LLRs in the first
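decoding iteration of the Log-BCJR turbo decoder.

The Log-BCJR turbo decoding algorithm.

\begin{align}
\gamma_k(S_{k-1}, S_k) &= \sum_{j=1}^{L} \left[ b_j(S_{k-1}, S_k) \cdot \bar{b}^{a}_{j,k} \right] \tag{8} \\
\alpha_k(S_k) &= \operatorname*{max^{*}}_{\{S_{k-1} \mid c(S_{k-1}, S_k) = 1\}} \left[ \gamma_k(S_{k-1}, S_k) + \alpha_{k-1}(S_{k-1}) \right] \tag{9} \\
\beta_{k-1}(S_{k-1}) &= \operatorname*{max^{*}}_{\{S_k \mid c(S_{k-1}, S_k) = 1\}} \left[ \gamma_k(S_{k-1}, S_k) + \beta_k(S_k) \right] \tag{10} \\
\delta_k(S_{k-1}, S_k) &= \gamma_k(S_{k-1}, S_k) + \alpha_{k-1}(S_{k-1}) + \beta_k(S_k) \tag{11} \\
\bar{b}^{e}_{j,k} &= \left[ \operatorname*{max^{*}}_{\{(S_{k-1}, S_k) \mid b_j(S_{k-1}, S_k) = 1\}} \delta_k(S_{k-1}, S_k) \right] - \left[ \operatorname*{max^{*}}_{\{(S_{k-1}, S_k) \mid b_j(S_{k-1}, S_k) = 0\}} \delta_k(S_{k-1}, S_k) \right] - \bar{b}^{a}_{j,k} \tag{12}
\end{align}

In addition to supporting the simplified turbo decoder of Figure 1(c), the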
Log-BCJR algorithm of (8) – (12) supports the algorithmic
blocks of the LTE turbo decoder having L = 3 and L = 2 a
priori LLRs, as well as the blocks of the WiMAX turbo code
having L = 6 and L = 4. Depending on whether termination
or tailbiting is employed, the values described in Section III for
$\alpha_0$ and $\beta_N$ can be employed in the Log-BCJR turbo decoder.
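
As a rough illustration of Equations (8) – (12), the following Python sketch runs a single Log-BCJR pass over a toy trellis. The two-state accumulator trellis, the function names and the example LLR values are hypothetical choices made for this sketch and are not taken from the LTE or WiMAX codes. The approximate max* operator is assumed here to be a plain maximum, and a positive LLR is taken to favour a bit value of 1, in line with the form of (8).

```python
NEG_INF = float("-inf")

def max_star(values):
    """Approximate max* (assumed here to be a plain maximum)."""
    return max(values)

# Hypothetical toy trellis: 2 states, every transition allowed (c = 1).
# Message bit b1 = S_prev XOR S_next, parity bit b2 = S_next (an accumulator).
STATES = (0, 1)
TRANSITIONS = [(sp, sn) for sp in STATES for sn in STATES]

def bits(sp, sn):
    return (sp ^ sn, sn)  # (b1, b2) implied by the transition

def log_bcjr(apriori):
    """apriori[k] = (LLR of b1, LLR of b2); returns the extrinsic LLRs of b1."""
    n = len(apriori)
    # Eq. (8): a priori transition metrics.
    gamma = [{t: sum(b * llr for b, llr in zip(bits(*t), apriori[k]))
              for t in TRANSITIONS} for k in range(n)]
    # Eq. (9): forward recursion, starting from the known state 0.
    alpha = [[0.0, NEG_INF]]
    for k in range(n):
        alpha.append([max_star([gamma[k][(sp, sn)] + alpha[k][sp] for sp in STATES])
                      for sn in STATES])
    # Eq. (10): backward recursion, with an unknown final state.
    beta = [None] * n + [[0.0, 0.0]]
    for k in range(n - 1, -1, -1):
        beta[k] = [max_star([gamma[k][(sp, sn)] + beta[k + 1][sn] for sn in STATES])
                   for sp in STATES]
    extrinsic = []
    for k in range(n):
        # Eq. (11): a posteriori transition metrics.
        delta = {(sp, sn): gamma[k][(sp, sn)] + alpha[k][sp] + beta[k + 1][sn]
                 for (sp, sn) in TRANSITIONS}
        # Eq. (12): extrinsic LLR of the message bit b1.
        ones = max_star([d for t, d in delta.items() if bits(*t)[0] == 1])
        zeros = max_star([d for t, d in delta.items() if bits(*t)[0] == 0])
        extrinsic.append(ones - zeros - apriori[k][0])
    return extrinsic

# Message 1,0,1,1 gives parity 1,1,0,1; parity LLRs are +/-5, message LLRs are 0.
llrs = log_bcjr([(0.0, 5.0), (0.0, 5.0), (0.0, -5.0), (0.0, 5.0)])
decoded = [1 if x > 0 else 0 for x in llrs]
print(llrs, decoded)
```

In this sketch the forward loop must finish before any extrinsic LLR can be produced, which is the serial data dependency that the bold arrows of Figure 1(c) describe.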
As may be expected, the proposed fully-parallel turbo
decoding algorithm of (2) – (5) is related to that of the
Log-BCJR turbo decoder (8) – (12). More explicitly, (2) can
be derived by substituting (8) into (11). Using the identity
$\max^*(\delta_1 - \delta_3, \delta_2 - \delta_3) = \max^*(\delta_1, \delta_2) - \delta_3$, (3) and (4) can
be derived by rearranging (11) and substituting it into (9) and
(10), respectively.
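
The substitution described above can be written out step by step. The following is a sketch that relies only on (9), (10), (11) and the stated identity; the last line of each chain is the form that this derivation yields, while Equations (3) and (4) themselves are stated in Section III. The identity applies because $\beta_k(S_k)$ does not depend on $S_{k-1}$ in the first chain, and $\alpha_{k-1}(S_{k-1})$ does not depend on $S_k$ in the second.

```latex
\begin{align*}
\alpha_k(S_k)
  &= \operatorname*{max^{*}}_{\{S_{k-1} \mid c(S_{k-1}, S_k) = 1\}}
     \left[ \gamma_k(S_{k-1}, S_k) + \alpha_{k-1}(S_{k-1}) \right]
     && \text{by (9)} \\
  &= \operatorname*{max^{*}}_{\{S_{k-1} \mid c(S_{k-1}, S_k) = 1\}}
     \left[ \delta_k(S_{k-1}, S_k) - \beta_k(S_k) \right]
     && \text{by (11), rearranged} \\
  &= \left[ \operatorname*{max^{*}}_{\{S_{k-1} \mid c(S_{k-1}, S_k) = 1\}}
     \delta_k(S_{k-1}, S_k) \right] - \beta_k(S_k)
     && \text{by the identity} \\[1ex]
\beta_{k-1}(S_{k-1})
  &= \operatorname*{max^{*}}_{\{S_k \mid c(S_{k-1}, S_k) = 1\}}
     \left[ \gamma_k(S_{k-1}, S_k) + \beta_k(S_k) \right]
     && \text{by (10)} \\
  &= \left[ \operatorname*{max^{*}}_{\{S_k \mid c(S_{k-1}, S_k) = 1\}}
     \delta_k(S_{k-1}, S_k) \right] - \alpha_{k-1}(S_{k-1})
     && \text{by (11) and the identity}
\end{align*}
```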
Note that while the simpliﬁed Log-BCJR turbo decoder of
Figure 1(c) requires T = 4N time periods to complete each
decoding iteration, several techniques have been proposed for
signiﬁcantly reducing this. For example, the Non-Sliding Win-
dow (NSW) technique [15] may be employed to decompose
the algorithmic blocks of Figure 1(c) into 64 windows, each
comprising an equal number of consecutive blocks. Here, the
data dependencies between adjacent windows are eliminated
by initializing each window’s recursions using results provided
by the adjacent windows in the previous decoding iteration,
rather than in the current one. Furthermore, within each win-
dow, the NSW technique performs the forward and backward
recursions simultaneously, only performing Equations (11) and
(12) once these recursions have crossed over. Additionally,
the Radix-4 transform [15] allows the number of algorithmic
blocks employed in Figure 1(c) to be halved, along with
the number of time periods required to process them. Here,
each algorithmic block corresponds to the merger of two state
transition diagrams into one, effectively doubling the number
of a priori LLRs L considered by each algorithmic block.
By combining the NSW technique and the Radix-4 transform,
the state-of-the-art LTE turbo decoder [15] can complete each
decoding iteration using just $T = N/32$ time periods, provided
that the frame length satisfies $N \in [2048, 6144]$, as shown
in Table I. Note however that this number is $N/64$ times
higher than that of the proposed fully-parallel turbo decoder
of Section III-A, which requires only T = 2 time periods per
decoding iteration. When employing the maximum LTE frame
length of N = 6144 bits, the number of time periods per
decoding iteration required by the state-of-the-art LTE turbo
decoder is nearly two orders-of-magnitude above the number
required by the proposed fully-parallel algorithm.
As described above, the state-of-the-art LTE turbo decoding
algorithm of [15] employs the Radix-4 transform to double
the number of a priori LLRs considered by each algorithmic
block, resulting in L = 6 for the blocks in the upper row
and L = 4 for those in the lower row. Owing to this, the
state-of-the-art algorithm can also be employed for WiMAX
turbo decoding, since this naturally requires algorithmic blocks
that consider L = 6 and L = 4 a priori LLRs, as described
in Section III-C. Note however that in this application, the
turbo decoder does not benefit from halving the number of
algorithmic blocks required, as is achieved when applying
the Radix-4 transform to an LTE turbo decoder. On the
other hand, the WiMAX turbo decoder can benefit from the
NSW technique of the state-of-the-art algorithm, provided that
$N \in [1440, 2400]$, resulting in $T = N/16$ time periods per
decoding iteration, as shown in Table I. This number is $N/32$
times higher than that of the proposed fully-parallel turbo
decoder of Section III-A.
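
The ratios quoted in this subsection can be checked with a few lines of Python. This is only a sketch of the arithmetic: the values of T are those listed in Table I, while the function name and the scheme labels are hypothetical.

```python
from fractions import Fraction

def time_periods(scheme, code, n):
    """Time periods per decoding iteration T, as listed in Table I."""
    if scheme == "fully-parallel":
        return Fraction(2)
    if scheme == "log-bcjr":
        return Fraction(4 * n)
    if scheme == "state-of-the-art":
        # N/32 for LTE (NSW plus Radix-4), N/16 for WiMAX (NSW only).
        return Fraction(n, 32 if code == "lte" else 16)
    raise ValueError(scheme)

for code, n in (("lte", 6144), ("wimax", 2400)):
    t_fp = time_periods("fully-parallel", code, n)
    for scheme in ("log-bcjr", "state-of-the-art"):
        ratio = time_periods(scheme, code, n) / t_fp
        print(code, n, scheme, float(ratio))
```

For N = 6144 the state-of-the-art LTE ratio evaluates to 96, which is the "nearly two orders-of-magnitude" figure mentioned above.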
B. Memory requirements
Note that in the proposed fully-parallel turbo decoder algo-
rithm of Figure 1(b), the outputs produced by each algorithmic
block in any particular time period are consumed by the
connected blocks in the next time period. Owing to this,
the proposed fully-parallel turbo decoder requires very little
memory for storing variables for longer than the interval between
consecutive time periods. More specifically, memory is only required
for storing the a priori LLRs provided by the demodulator,
which are required throughout the iterative decoding process.
In the case of the LTE turbo decoder, memory is required for
storing the 3N + 12 a priori LLRs that are provided by the
demodulator, while 6N a priori LLRs need to be stored in the
WiMAX turbo decoder, as shown in Table I.
By contrast, the Log-BCJR turbo decoder algorithm of Fig-
ure 1(c) has signiﬁcantly higher memory requirements. Like
the proposed fully-parallel turbo decoder algorithm, the Log-
BCJR turbo decoder algorithm requires memory for storing
the a priori LLRs provided by the demodulator. Furthermore,
memory is required for storing the $M \times K \times N$ a priori
transition metrics that are produced by Equation (8) during
the forward recursion, so that they can be used by (10) and
(11) during the backward recursion. Likewise, memory is
required for storing the $M \times N$ extrinsic state metrics that
are produced by Equation (9) during the forward recursion, so
that they can be used by (11) during the backward recursion.
Finally, in the case of the LTE turbo decoder, memory is
required for storing the N extrinsic LLRs that are produced by
Equation (12), while 2N extrinsic LLRs need to be stored in
the WiMAX turbo decoder. Note that the additional memory
required by the Log-BCJR turbo decoder algorithm can be
reused by both the upper and lower decoder of Figure 1(c),
since they are not operated concurrently. As shown in Table I,
the amount of memory required by the Log-BCJR algorithm
is 9.33 and 8 times higher than that of the proposed algorithm,
when implementing the LTE and WiMAX turbo decoders,
respectively.
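
The Log-BCJR memory totals of Table I can be reproduced by summing the four contributions listed above. The sketch below assumes M = 8 trellis states, with K = 2 transitions leaving each state for LTE and K = 4 for WiMAX; these values are not stated in this section and are adopted here because they reproduce the totals of 28N + 12 and 48N. The function names are hypothetical.

```python
def fully_parallel_memory(n, code):
    """A priori LLRs from the demodulator (Table I)."""
    return 3 * n + 12 if code == "lte" else 6 * n

def log_bcjr_memory(n, code):
    """Number of stored values in the Log-BCJR decoder for frame length n."""
    m = 8                                   # assumed number of trellis states
    k = 2 if code == "lte" else 4           # assumed transitions leaving each state
    apriori = fully_parallel_memory(n, code)
    transition_metrics = m * k * n          # outputs of Eq. (8)
    state_metrics = m * n                   # outputs of Eq. (9)
    extrinsic = n if code == "lte" else 2 * n   # outputs of Eq. (12)
    return apriori + transition_metrics + state_metrics + extrinsic

for code, n in (("lte", 6144), ("wimax", 2400)):
    ratio = log_bcjr_memory(n, code) / fully_parallel_memory(n, code)
    print(code, log_bcjr_memory(n, code), round(ratio, 2))
```

For the longest frames this gives ratios of about 9.33 (LTE) and exactly 8 (WiMAX), matching the figures quoted above.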
Note that the state-of-the-art LTE turbo decoder [15] em-
ploys the Radix-4 transform, which halves the number of
extrinsic state metrics that must be stored. Furthermore, the
state-of-the-art LTE turbo decoder uses a re-computation
technique [15] to further reduce the memory requirements.
Rather than storing the a priori transition metrics during
the forward recursion, so that they can be reused during
the backward recursion, the re-computation technique simply
recalculates these metrics during the backward recursion. In
addition to this, the state-of-the-art LTE turbo decoder stores
only 1/6 of the extrinsic state metrics during the forward
recursion and recalculates the other 5/6 of these metrics during
the backward recursion. Owing to its employment of these
techniques, the amount of memory required by the state-of-the-art
LTE turbo decoder is 1.56 times higher than that
of the proposed algorithm, as shown in Table I. However,
Section IV-D will show that additionally storing the sum of
the a priori parity LLRs $\bar{b}^{u,a}_{2}$ and the a priori systematic LLRs
$\bar{b}^{u,a}_{3}$ is beneficial to the algorithmic blocks in the upper row
of the proposed fully-parallel algorithm, when employed for
LTE turbo decoding. This increases the memory requirement
of the proposed algorithm to 4N + 12, as shown in brackets
in Table I. As a result, the amount of memory required by
the state-of-the-art LTE turbo decoder can be considered to
be 1.17 times higher than that of the proposed algorithm, as
shown in Table I.
Likewise, when the state-of-the-art algorithm is applied to
WiMAX turbo decoding, the required memory is also 1.56
times higher than that of the proposed algorithm, as shown
in Table I. Note that this ratio is maintained even though
the WiMAX turbo decoder does not beneﬁt from the Radix-4
transform, which halves the number of algorithmic blocks that
are required, as well as the number of extrinsic state metrics
that must be stored. This is because in addition to requiring
twice as much storage for extrinsic state metrics, the WiMAX
turbo code also requires twice as much storage for LLRs, since
it is duo-binary.
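
The ratios quoted for the state-of-the-art decoder follow from the memory expressions of Table I. As a worked check, taking the large-N limit in which the constant 12 becomes negligible (an approximation made here for brevity):

```latex
\begin{align*}
\text{LTE:} \quad
  & \frac{14N/3 + 12}{3N + 12} \approx \frac{14}{9} \approx 1.56,
  \qquad
  \frac{14N/3 + 12}{4N + 12} \approx \frac{14}{12} \approx 1.17 \\
\text{WiMAX:} \quad
  & \frac{28N/3}{6N} = \frac{28}{18} \approx 1.56
\end{align*}
```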
C. Computational complexity
The number of additions, subtractions and max* opera-
tions that are employed within each algorithmic block of
the proposed fully-parallel and the Log-BCJR algorithms are
quantiﬁed in Table II, for both the LTE and WiMAX turbo
decoder. A number of techniques have been employed to
minimize the number of operations that are listed in Table II.
More specifically, the a priori metrics $\gamma_k(S_{k-1}, S_k)$ of some
particular transitions are equal to each other, allowing them
to be computed once and then reused. Furthermore, some
a priori metrics $\gamma_k(S_{k-1}, S_k)$ are zero-valued and so there
is no need to add them into the corresponding $\delta_k(S_{k-1}, S_k)$,
$\alpha_k(S_k)$ or $\beta_{k-1}(S_{k-1})$ calculations. Finally, when computing
the extrinsic LLR $\bar{b}^{e}_{1,k}$ in the WiMAX turbo decoder, the
results of some max* operations can be reused to compute
$\bar{b}^{e}_{2,k}$. Note that the algorithmic blocks in the upper row of the
LTE and WiMAX turbo decoders consider a higher number
of a priori LLRs L than those of the lower row, resulting in
a slightly higher complexity. Therefore, Table II presents the
average of the number of operations that are employed by the
algorithmic blocks in the upper and lower rows, resulting in
some non-integer values.
For both the LTE and WiMAX turbo decoders, the pro-
posed fully-parallel turbo decoding algorithm requires fewer
additions and subtractions than the Log-BCJR algorithm, as
well as an equal number of max* operations. When the
approximation of (7) is employed, max* operations can be
considered to have a similar computational complexity to
additions and subtractions [?]. Table I quantiﬁes the total
number of operations performed per iteration C, among all
of the algorithmic blocks in the upper and lower rows. As
shown in Table I, the computational complexity per decoding
iteration C of the Log-BCJR algorithm is 1.1 and 1.25 times
higher than that of the proposed algorithm, when implementing
the LTE and WiMAX turbo decoders, respectively.
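
The per-iteration complexities of Table I can be recovered from the per-block totals of Table II. The sketch below assumes that each decoding iteration operates 2N algorithmic blocks (N in each row) and that every operation counts equally, as argued above for the approximation of (7); the dictionary and function names are hypothetical.

```python
# Operations per algorithmic block from Table II: (additions/subtractions, max*).
OPS_PER_BLOCK = {
    ("fully-parallel", "lte"): (47.5, 30),
    ("fully-parallel", "wimax"): (94, 80),
    ("log-bcjr", "lte"): (55.5, 30),
    ("log-bcjr", "wimax"): (138, 80),
}

def complexity_per_iteration(scheme, code):
    """Coefficient of N in C, assuming 2N algorithmic blocks per iteration."""
    adds, max_stars = OPS_PER_BLOCK[(scheme, code)]
    return 2 * (adds + max_stars)

for code in ("lte", "wimax"):
    c_fp = complexity_per_iteration("fully-parallel", code)
    c_bcjr = complexity_per_iteration("log-bcjr", code)
    print(code, c_fp, c_bcjr, round(c_bcjr / c_fp, 2))
```

This reproduces the coefficients 155, 348, 171 and 436 of Table I, together with the ratios of roughly 1.1 and 1.25.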
Note that the state-of-the-art LTE turbo decoder [15] em-
ploys the Radix-4 transform, as well as the approximation of
(7). When employing the Radix-4 transform, the Log-BCJR
LTE turbo decoder has the same complexity per algorithmic
block as that presented in Table II for the Log-BCJR WiMAX
turbo decoder. However, it should be noted that the Radix-
4 transform halves the number of algorithmic blocks that
are required, as discussed in Section IV-A. Furthermore, as
described in Section IV-B, the state-of-the-art LTE turbo
decoder recalculates the a priori transition metrics of (8) and
5/6 of the extrinsic state metrics of (9) during the backward
recursion. Therefore, the state-of-the-art LTE turbo decoder
has a complexity per decoding iteration C that is 1.77 times
higher than that of the proposed algorithm, as shown in Table I.
When applying the state-of-the-art algorithm’s recalculation
technique to the WiMAX turbo code, its complexity per
decoding iteration C is 1.58 times higher than
that of the proposed algorithm, as shown in Table I.
D. Time period duration
TABLE II
THE NUMBER OF OPERATIONS PER ALGORITHMIC BLOCK. THE NUMBER OF THESE OPERATIONS THAT CONTRIBUTE TO THE CRITICAL PATH OF THE ALGORITHMIC BLOCK IS PROVIDED IN CURLY BRACKETS.

| Turbo code | Operation | Fully-parallel: Eq. (2) | Fully-parallel: Eq. (3) | Fully-parallel: Eq. (4) | Fully-parallel: Eq. (5) | Fully-parallel: Total | Log-BCJR, forward recursion: Eq. (8) | Log-BCJR, forward recursion: Eq. (9) | Log-BCJR, backward recursion: Eq. (10) | Log-BCJR, backward recursion: Eq. (11) | Log-BCJR, backward recursion: Eq. (12) | Log-BCJR: Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LTE | + or − | 29.5 {3} | 8 | 8 | 2 {2} | 47.5 {5} | 1.5 | 12 | 12 {1} | 28 {2} | 2 {2} | 55.5 {5} |
| LTE | max* | 0 | 8 | 8 | 14 {3} | 30 {3} | 0 | 8 | 8 {1} | 0 | 14 {3} | 30 {4} |
| WiMAX | + or − | 74 {3} | 8 | 8 | 4 {2} | 94 {5} | 12 | 30 | 30 {1} | 62 {2} | 4 {2} | 138 {5} |
| WiMAX | max* | 0 | 24 | 24 | 32 {4} | 80 {4} | 0 | 24 | 24 {2} | 0 | 32 {4} | 80 {6} |

As described in Sections III and IV-A, it may be assumed
that the operation of an algorithmic block in the schematics
of Figure 1 can be completed within a single time period.
However, the amount of time D that is required for each
time period depends on the computational requirements of the
algorithmic blocks. More speciﬁcally, the required duration D
depends on the critical path through the data dependencies
that are imposed by the computational requirements of the
algorithmic blocks. For example, in the proposed fully-parallel
algorithm, Equations (3), (4) and (5) are independent of
each other, but they all depend upon (2). As a result, the
computation of (2) must be completed ﬁrst, but then (3), (4)
and (5) can be computed in parallel. Of these three equations,
it is (5) that requires the most time for computation, since it is
a function of more variables than (3) and (4). Therefore, the
critical path of the algorithmic blocks in the proposed fully-
parallel algorithm depends on the computational requirements
of (2) and (5).
Equation (2) is employed to obtain an a posteriori metric
$\delta_k(S_{k-1}, S_k)$ for each transition in the state transition diagram.
However, these can all be calculated in parallel, using an
addition of five variables in the case of the algorithmic blocks
in the upper row of the LTE turbo decoder, which consider
L = 3 a priori LLRs, for example. By contrast, an addition
of just four variables is required in the case of the algorithmic
blocks in the lower row, for which L = 2. A summation of
$v$ variables requires $v - 1$ additions, some of which
can be performed in parallel. More specifically, the variables
can be added together in pairs and then in a second step, the
resultant sums can be added together in pairs. This process
can continue until only a single sum remains, requiring a
total of $\lceil \log_2(v) \rceil$ steps. Accordingly, Equation (2) contributes
three additions to the critical path of the algorithmic blocks
in the upper row of the proposed fully-parallel LTE turbo
decoder, as well as two additions for the blocks in the lower
row. The maximum of these two critical path contributions is
presented in the corresponding curly brackets of Table II, since
it imposes the greatest limitation on the time period duration.
A similar analysis can be employed to determine each of the
other critical path contributions that are provided in the curly
brackets of Table II.
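
The pairwise summation argument above can be illustrated with a short Python sketch. The function name and the example inputs are hypothetical; the code simply adds values in pairs, step by step, and counts both the parallel steps and the total number of additions.

```python
import math

def adder_tree_steps(values):
    """Sum values by pairwise addition; return (total, parallel steps, additions)."""
    level = list(values)
    steps = additions = 0
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        additions += len(pairs)
        if len(level) % 2:          # an unpaired value is carried to the next step
            pairs.append(level[-1])
        level = pairs
        steps += 1
    return level[0], steps, additions

for v in (5, 4):   # upper row (L = 3) and lower row (L = 2) of the LTE decoder
    total, steps, additions = adder_tree_steps(range(1, v + 1))
    assert steps == math.ceil(math.log2(v)) and additions == v - 1
    print(v, total, steps, additions)
```

Five variables need three steps and four need two, which matches the critical path contributions of Equation (2) described above.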
As shown in Table II, the critical path of the Log-BCJR
algorithm is longer than that of the proposed fully-parallel
algorithm, requiring time periods having a longer duration D
and resulting in slower operation. When the approximation
of (7) is employed, max* operations can be considered to
make similar contributions to the critical path as additions
and subtractions. As shown in Table I, the critical path and
hence the required time period duration D of the Log-BCJR
algorithm is therefore 1.13 and 1.22 times higher than that
of the proposed algorithm, when implementing the LTE and
WiMAX turbo decoders, respectively.
Note however that the state-of-the-art LTE turbo decoder
[15] employs the Radix-4 transform, as well as the approx-
imation of (7). When employing the Radix-4 transform, the
Log-BCJR LTE turbo decoder has the same critical path as
that presented in Table II for the Log-BCJR WiMAX turbo
decoder. However, the state-of-the-art LTE turbo decoder em-
ploys pipelining [15] to spread the computation of Equations
(8) – (12) over several consecutive time periods. This reduces
the critical path to that of Equation (10) alone, namely one
addition and two max* operations. By contrast, the proposed
fully-parallel algorithm has a critical path comprising ﬁve
additions and three max* operations, as shown in Table I. Note
however that the contribution of one addition can be eliminated
from this total by employing a technique similar to pipelining.
More specifically, the sum of the a priori parity LLRs $\bar{b}^{u,a}_{2}$ and
the a priori systematic LLRs $\bar{b}^{u,a}_{3}$ may be computed before
iterative decoding commences. As described in Section IV-B,
the result may be stored and used throughout the iterative
decoding process by the algorithmic blocks in the upper row
of the proposed fully-parallel LTE turbo decoder. This reduces
the critical path contribution of Equation (2) in the upper row
to two additions, which is equal to that of the lower row. As a
result, the critical path of the proposed fully-parallel LTE
turbo decoder is reduced to 7 operations, as shown in brackets in Table I.
Therefore, the critical path and time period duration D of the
state-of-the-art LTE turbo decoder can be considered to be 0.43
times that of the proposed algorithm. Similarly, when applying
the state-of-the-art algorithm to WiMAX turbo decoding, the
result is the same critical path of one addition and two max*
operations. As shown in Table I, this critical path is 0.33 times
that of the proposed algorithm, which requires ﬁve additions
and four max* operations.
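
Treating each addition, subtraction and max* operation as one unit of delay, as assumed above, the time period durations of Table I and the ratios quoted in this subsection can be recovered from the critical path totals given in the curly brackets of Table II. The subscripts below are shorthand introduced only for this worked check.

```latex
\begin{align*}
D_{\text{FP,LTE}} &= 5 + 3 = 8 \quad (4 + 3 = 7 \text{ with the pre-computed LLR sum}), &
D_{\text{FP,WiMAX}} &= 5 + 4 = 9, \\
D_{\text{BCJR,LTE}} &= 5 + 4 = 9, &
D_{\text{BCJR,WiMAX}} &= 5 + 6 = 11, \\
\frac{D_{\text{BCJR,LTE}}}{D_{\text{FP,LTE}}} &= \frac{9}{8} \approx 1.13, &
\frac{D_{\text{BCJR,WiMAX}}}{D_{\text{FP,WiMAX}}} &= \frac{11}{9} \approx 1.22, \\
\frac{D_{\text{SoA,LTE}}}{D_{\text{FP,LTE}}} &= \frac{1 + 2}{7} \approx 0.43, &
\frac{D_{\text{SoA,WiMAX}}}{D_{\text{FP,WiMAX}}} &= \frac{1 + 2}{9} \approx 0.33
\end{align*}
```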
E. Error correction performance
[Figure 4: three panels of BER (from 10^0 down to 10^-6) versus Eb/N0 (from 0 to 7), each titled "Exact max* LTE turbo decoder", for (a) N = 4800, (b) N = 480 and (c) N = 48. Each panel shows Fully-parallel and Log-BCJR curves, labelled by the number of decoding iterations I.]

Fig. 4. The error correction performance of the LTE turbo decoder when using the exact max* operator of (6) to decode frames comprising (a) N = 4800, (b) N = 480 and (c) N = 48 bits. Here, BPSK modulation is employed for transmission over an uncorrelated narrowband Rayleigh fading channel. Plots are provided for the case where $I \in \{1, 2, 4, 8, 16, 32, 64, 128\}$ decoding iterations are performed using the proposed fully-parallel algorithm, as well as $I \in \{1, 2, 4, 8, 16\}$ decoding iterations using the conventional Log-BCJR algorithm.

[Figure 5: three panels of BER (from 10^0 down to 10^-6) versus Eb/N0 (from 0 to 7). Panel (a) "Exact max* WiMAX turbo decoder" and panel (b) "Approx max* WiMAX turbo decoder" each compare Fully-parallel I = 32 with Log-BCJR I = 8; panel (c) "Approx max* LTE turbo decoder" compares Fully-parallel I = 48 with Log-BCJR I = 8. Curves are labelled by the frame length N.]

Fig. 5. The error correction performance of (a) the WiMAX turbo decoder when using the exact max* operator of (6), (b) the WiMAX turbo decoder when using the approximate max* operator of (7) and (c) the LTE turbo decoder when using the approximate max* operator. Here, BPSK modulation is employed for transmission over an uncorrelated narrowband Rayleigh fading channel. Plots are provided for the case where I = 32 or I = 48 decoding iterations are performed using the proposed fully-parallel algorithm, as well as I = 8 decoding iterations using the conventional Log-BCJR algorithm. Frame lengths of $N \in \{48, 480, 4800\}$ are adopted for the LTE turbo code, while $N \in \{24, 240, 2400\}$ for the WiMAX turbo code.

The Bit Error Ratio (BER) of the proposed fully-parallel
turbo decoding algorithm is compared with that of the Log-BCJR
algorithm in Figures 4 and 5. In each case, BPSK
modulation is employed for transmission over an uncorrelated
narrowband Rayleigh fading channel having a range of Signal
to Noise Ratio (SNR) per bit $E_b/N_0$, where $E_b/N_0$ [dB] =
SNR [dB] $- 10 \log_{10}(R)$ in this case. In Figure 4, the
algorithms are compared for the case of LTE turbo decoding
using the exact max* operator of (6), for frame lengths of
$N \in \{48, 480, 4800\}$ and for various numbers of decoding
iterations I. Figure 4 shows that regardless of the frame
length N, the proposed fully-parallel algorithm can converge
to the same error correction performance as the Log-BCJR
algorithm. However, the proposed fully-parallel algorithm can
be seen to converge relatively slowly, requiring significantly
more decoding iterations I than the Log-BCJR algorithm. Note
that this is not unexpected, since LDPC decoders employing a
parallel scheduling are known to require more decoding iterations
than those employing a serial scheduling [20]. Figure 4
suggests that the number of decoding iterations I required
by the Log-BCJR algorithm to achieve a particular BER is
consistently around 1/7 times that of the proposed algorithm,
for the case of LTE turbo decoding using the exact max*
operator of (6). However, when employing the approximate
max* operator of (7), this number changes to 1/6 times
that of the proposed algorithm, as shown in Figure 5(c) and
Table I. More specifically, Figure 5(c) shows that regardless
of the frame length $N \in \{48, 480, 4800\}$, the Log-BCJR
algorithm employing I = 8 decoding iterations achieves the
same BER as the proposed fully-parallel algorithm employing
I = 48 iterations. In the case of the WiMAX turbo code,
Figures 5(a) and (b) reveal that the number of decoding
iterations I required by the Log-BCJR algorithm is 1/4 times
that of the proposed algorithm, regardless of the frame length
N and whether the exact or the approximate max* operator is
employed. Note that the error correction performance of the
state-of-the-art algorithm of [15] is slightly degraded by its
employment of the NSW technique, although this degradation
can be considered to be insignificant. Therefore as shown in
Table I, the number of decoding iterations I required by the
state-of-the-art algorithm can also be considered to be 1/6
and 1/4 times that of the proposed algorithm, for the LTE
and WiMAX turbo codes, respectively.
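
The conversion between SNR and SNR per bit that is stated at the start of this subsection can be expressed as a one-line helper. In this sketch R is taken to be the coding rate, which is an assumption since R is not defined in this section, and the rate of 1/3 used in the example is purely illustrative.

```python
import math

def ebn0_db(snr_db, rate):
    """Eb/N0 [dB] = SNR [dB] - 10 log10(R), as stated for BPSK in this section."""
    return snr_db - 10 * math.log10(rate)

# Hypothetical example: a rate-1/3 code at an SNR of 0 dB.
print(round(ebn0_db(0.0, 1 / 3), 2))
```

Because R is below one, the Eb/N0 value is always higher than the SNR value when both are expressed in decibels.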
F. Overall characteristics
The latency $D \times T \times I$ of a turbo decoder is given by the
product of the time period duration D, the number of time
periods per decoding iteration T and the required number of
decoding iterations I. Meanwhile, the processing throughput is
inversely proportional to the latency $D \times T \times I$. For both LTE
and WiMAX turbo decoding, Table I quantiﬁes the latency
and throughput of the proposed fully-parallel algorithm, the
Log-BCJR algorithm and the state-of-the-art algorithm of [15].
In the case of an LTE turbo code employing the longest
supported frame length of N = 6144 bits, the latency and
throughput of the proposed fully-parallel algorithm are more
than three orders-of-magnitude superior to those of the Log-
BCJR algorithm. Furthermore, when compared with the state-
of-the-art algorithm of [15], the proposed fully-parallel algo-
rithm has a latency and throughput that is 6.86 times superior.
Note however that the advantage offered by the proposed
fully-parallel algorithm is mitigated if the frame length N is
reduced. In the case of the shortest frame length N = 2048
that is supported by the considered parametrization of the
state-of-the-art algorithm’s NSW technique, the superiority
of the proposed fully-parallel algorithm is reduced to 2.29
times. When applying the state-of-the-art algorithm to the
WiMAX turbo decoding of frames having lengths in the
range $N \in [1440, 2400]$, the superiority of the proposed
fully-parallel algorithm ranges from 3.75 times, up to 6.25
times. Compared to the Log-BCJR algorithm for WiMAX
turbo decoding, the proposed fully-parallel algorithm is more
than three orders-of-magnitude superior, when employing the
maximum frame length of N = 2400.
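
The latency comparisons above can be reproduced from the T, D and I values of Table I. The following Python sketch is only a restatement of that arithmetic; the dictionary layout and function names are hypothetical, and the fully-parallel LTE entry uses the bracketed duration D = 7.

```python
from fractions import Fraction

# (T, D, I) from Table I as functions of the frame length N.
TABLE_I = {
    ("fully-parallel", "lte"): lambda n: (2, 7, 48),    # D = 7 with the stored LLR sum
    ("fully-parallel", "wimax"): lambda n: (2, 9, 32),
    ("log-bcjr", "lte"): lambda n: (4 * n, 9, 8),
    ("log-bcjr", "wimax"): lambda n: (4 * n, 11, 8),
    ("state-of-the-art", "lte"): lambda n: (Fraction(n, 32), 3, 8),
    ("state-of-the-art", "wimax"): lambda n: (Fraction(n, 16), 3, 8),
}

def latency(scheme, code, n):
    """Overall latency T x D x I, in units of one operation on the critical path."""
    t, d, i = TABLE_I[(scheme, code)](n)
    return t * d * i

def speedup(scheme, code, n):
    """Latency of `scheme` divided by that of the fully-parallel algorithm."""
    return float(latency(scheme, code, n) / latency("fully-parallel", code, n))

for code, n in (("lte", 6144), ("lte", 2048), ("wimax", 2400), ("wimax", 1440)):
    print(code, n, round(speedup("state-of-the-art", code, n), 2),
          round(speedup("log-bcjr", code, n)))
```

The state-of-the-art ratios evaluate to 6.86, 2.29, 6.25 and 3.75, while the Log-BCJR ratios at the maximum frame lengths exceed 1000 for both codes.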
The state-of-the-art LTE turbo decoder of [15] achieves a
processing throughput of 2.15 Gbit/s and a latency of 2.85
µs, when decoding frames comprising N = 6144 bits. This
is achieved using a clock frequency of 450 MHz, which
corresponds to a time period duration of 2.22 ns. The results
of Table I suggest that the proposed fully-parallel algorithm
could achieve a processing throughput of 14.7 Gbit/s and a
latency of 0.42 µs, using a clock frequency of 194 MHz.
Furthermore, it may be assumed that the state-of-the-art turbo
decoder of [15] could maintain a processing throughput of
2.15 Gbit/s when applied for WiMAX decoding. If so, then
this suggests that the proposed fully-parallel algorithm could
achieve a processing throughput of 13.4 Gbit/s and a latency of
0.36 µs, when decoding frames having a length of N = 2400
bits. Note that these multi-gigabit throughputs are comparable
to those that are offered by fully-parallel LDPC decoders [14].
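
One reading that is consistent with the LTE figures quoted above is that the reported results of [15] are simply scaled by the ratios derived from Table I. The sketch below follows that reading: it assumes that the throughput scales by 6.86, that the clock frequency scales by the duration ratio of 0.43, and that the latency is the frame length divided by the throughput. The function and parameter names are hypothetical.

```python
def scale_reference(ref_gbps, ref_clock_mhz, throughput_gain, duration_ratio, frame_bits):
    """Scale the reported figures of [15] by ratios taken from Table I."""
    gbps = ref_gbps * throughput_gain
    # The proposed time periods are longer, so the clock runs slower.
    clock_mhz = ref_clock_mhz * duration_ratio
    # 1 Gbit/s corresponds to 1000 bits per microsecond.
    latency_us = frame_bits / (gbps * 1e3)
    return gbps, clock_mhz, latency_us

gbps, clock_mhz, latency_us = scale_reference(2.15, 450, 6.86, 0.43, 6144)
print(round(gbps, 1), round(latency_us, 2))
```

Under these assumptions the throughput comes to about 14.7 Gbit/s, the latency to about 0.42 µs and the clock frequency to between 193 and 194 MHz.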
While the proposed fully-parallel algorithm offers significant
improvements to processing throughput and latency, this
is achieved at the cost of requiring an increased parallelism and
computational complexity. The overall computational complexity
$C \times I$ is given as the product of the computational
complexity per decoding iteration C and the required number
of decoding iterations I. For both LTE and WiMAX turbo
decoding, Table I quantiﬁes the overall computational com-
plexity of the proposed fully-parallel algorithm, the Log-BCJR
algorithm and the state-of-the-art algorithm of [15]. As shown
in Table I, the computational complexity of the proposed fully-
parallel algorithm can be more than ﬁve times higher than
that of the Log-BCJR algorithm. Compared to the state-of-
the-art algorithm of [15] however, the proposed fully-parallel
algorithm has a computational complexity that is about three
times higher.
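
The complexity ratios summarized above follow from the overall complexity row of Table I. As a worked check:

```latex
\begin{align*}
\text{LTE:} \quad
  & \frac{7440N}{1368N} \approx 5.44 \ \text{(versus Log-BCJR)},
  \qquad
  \frac{7440N}{2200N} \approx 3.38 \ \text{(versus state-of-the-art)} \\
\text{WiMAX:} \quad
  & \frac{11136N}{3488N} \approx 3.19 \ \text{(versus Log-BCJR)},
  \qquad
  \frac{11136N}{4400N} \approx 2.53 \ \text{(versus state-of-the-art)}
\end{align*}
```

The "more than five times" figure therefore corresponds to the LTE comparison with the Log-BCJR algorithm, while "about three times" sits between the LTE and WiMAX comparisons with the state-of-the-art algorithm.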
V. CONCLUSIONS
This paper has proposed a novel turbo decoding algorithm,
which eliminates the data dependencies of the state-of-the-art
algorithm and facilitates fully-parallel operation. The proposed
fully-parallel algorithm is compatible with all turbo codes,
including those of the LTE and WiMAX standards. These
standardized turbo codes employ odd-even interleavers, fa-
cilitating a novel technique for reducing the complexity of
the proposed algorithm by 50%. More speciﬁcally, odd-even
interleavers allow the proposed algorithm to alternate between
processing the odd-indexed bits of the ﬁrst component code
at the same time as the even-indexed bits of the second
component, and vice-versa. Furthermore, the proposed fully-parallel
algorithm was shown to converge to the same error
correction performance as the state-of-the-art turbo decoding
algorithm, regardless of which turbo code it is applied to.
Owing to its significantly increased parallelism, the proposed
algorithm facilitates throughputs and latencies that are up to
6.86 times superior to those of the state-of-the-art algorithm,
when employed for the LTE and WiMAX turbo codes. In these
applications, it may be speculated that the proposed algorithm
facilitates processing throughputs of up to 14.7 Gbit/s, as well
as latencies as small as 0.42 µs. However, this is achieved
at the cost of a computational complexity that is about three
times higher than that of the state-of-the-art algorithm. Our
future work will consider the practical Application Speciﬁc
Integrated Circuit (ASIC) and Field Programmable Gate Ar-
ray (FPGA) implementation of the proposed fully-parallel
algorithm, in order to determine the processing throughputs
and latencies that may be achieved in practice. Furthermore,
this study will reveal how the proposed algorithm’s increased
parallelism and complexity translate into chip area and energy
consumption. It may be expected that as the iterative decoding
process converges in a ﬁxed-point implementation [21] of the
proposed fully-parallel algorithm, the calculation circuits will
tend to have inputs and outputs that do not change between
consecutive clock cycles. Owing to this, a signiﬁcantly lower
amount of switching activity may be expected, compared to
the state-of-the-art turbo decoder implementation, which uses
the same calculation circuits for different parts of the algorithm
in consecutive clock cycles. Therefore, the three times higher
computational complexity of the proposed fully-parallel algo-
rithm may be expected to translate into a signiﬁcantly lower
increase in energy consumption, or possibly even a reduction
compared to the state-of-the-art turbo decoder implementation.
REFERENCES
[1] ETSI TS 136 212 LTE; Evolved Universal Terrestrial Radio Access (E-
UTRA); Multiplexing and Channel Coding, V12.0.0 ed., 2013.
[2] IEEE 802.16-2012 Standard for Local and Metropolitan Area Networks
- Part 16: Air Interface for Broadband Wireless Access Systems, 2012.
[3] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit
error-correcting coding and decoding: Turbo-codes (1),” in Proc. IEEE
Int. Conf. on Communications, vol. 2, Geneva, Switzerland, May 1993,
pp. 1064–1070.
[4] C. Berrou and A. Glavieux, “Near optimum error correcting coding and
decoding: Turbo-codes,” IEEE Trans. Commun., vol. 44, no. 10, pp.
1261–1271, Oct. 1996.
[5] L. Hanzo, T. H. Liew, B. L. Yeap, and R. Y. S. Tee, Turbo Coding, Turbo
Equalisation and Space Time Coding: EXIT-Chart-Aided Near-Capacity
Designs for Wireless Channels. Chichester, UK: Wiley, 2011.
[6] A. Nimbalker, T. K. Blankenship, B. Classon, T. E. Fuja, and D. J.
Costello, “Contention-free interleavers for high-throughput turbo decod-
ing,” IEEE Trans. Commun., vol. 56, no. 8, pp. 1258–1267, Aug. 2008.
[7] E. Rosnes, “On the minimum distance of turbo codes with quadratic
permutation polynomial interleavers,” IEEE Trans. Inform. Theory,
vol. 58, no. 7, pp. 4781–4795, July 2012.
[8] T. Breddermann and P. Vary, “Rate-compatible insertion convolutional
turbo codes: Analysis and application to LTE,” IEEE Trans. Wireless
Commun., vol. 13, no. 3, pp. 1356–1366, Mar. 2014.
[9] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of optimal and
sub-optimal MAP decoding algorithms operating in the log domain,” in
Proc. IEEE Int. Conf. on Communications, vol. 2, Seattle, WA, USA,
June 1995, pp. 1009–1013.
[10] IEEE 802.11n-2009 Standard for Information Technology - Telecom-
munications and Information Exchange between Systems - Local and
Metropolitan Area Networks - Specific Requirements - Part 11: Wireless
LAN Medium Access Control (MAC) and Physical Layer (PHY), 2009.
[11] D. J. C. MacKay and R. M. Neal, “Near Shannon limit performance
of low density parity check codes,” Electron. Lett., vol. 32, no. 18, pp.
457–458, Aug. 1996.
[12] M. Fossorier, “Reduced complexity decoding of low-density parity check
codes based on belief propagation,” IEEE Trans. Commun., vol. 47,
no. 5, pp. 673–680, May 1999.
[13] “5G Radio Access,” Ericsson White Paper, Tech. Rep., June 2013.
[14] V. A. Chandrasetty and S. M. Aziz, “FPGA implementation of a
LDPC decoder using a reduced complexity message passing algorithm,”
Journal of Networks, vol. 6, no. 1, pp. 36–45, Jan. 2011.
[15] T. Ilnseher, F. Kienle, C. Weis, and N. Wehn, “A 2.15GBit/s turbo code
decoder for LTE Advanced base station applications,” in Proc. Int. Symp.
on Turbo Codes and Iterative Information Processing, Gothenburg,
Sweden, Aug. 2012, pp. 21–25.
[16] L. Fanucci, P. Ciao, and G. Colavolpe, “VLSI design of a fully-
parallel high-throughput decoder for turbo Gallager codes,” IEICE Trans.
Fundamentals, vol. E89-A, no. 7, pp. 1976–1986, July 2006.
[17] D. Vogrig, A. Gerosa, A. Neviani, A. Graell I Amat, G. Montorsi, and
S. Benedetto, “A 0.35-μm CMOS analog turbo decoder for the 40-bit
rate 1/3 UMTS channel code,” IEEE J. Solid-State Circuits, vol. 40,
no. 3, pp. 753–762, 2005.
[18] Q. T. Dong, M. Arzel, C. J. Jego, and W. J. Gross, “Stochastic decoding
of turbo codes,” IEEE Trans. Signal Processing, vol. 58, no. 12, pp.
6421–6425, Dec. 2010.
[19] A. Nimbalker, Y. Blankenship, B. Classon, and T. K. Blankenship, “ARP
and QPP interleavers for LTE turbo coding,” in Proc. IEEE Wireless
Commun. Networking Conf., Las Vegas, NV, USA, Mar. 2008, pp. 1032–
1037.
[20] P. Radosavljevic, A. de Baynast, and J. R. Cavallaro, “Optimized
message passing schedules for LDPC decoding,” in Asilomar Conf.
Signals Systems and Computers, no. 1, Pacific Grove, CA, USA, Oct.
2005, pp. 591–595.
[21] W. Sulek, “On the overflow problem in finite precision turbo decoding
message passing,” IEEE Trans. Commun., vol. 60, no. 5, pp. 1253–1259,
May 2012.