Implementation of a 3GPP LTE Turbo Decoder Accelerator on GPU by Wu, Michael & Cavallaro, Joseph R.
Implementation of a 3GPP LTE Turbo Decoder
Accelerator on GPU
Michael Wu, Yang Sun, and Joseph R. Cavallaro
Electrical and Computer Engineering
Rice University, Houston, Texas 77005
{mbw2, ysun, cavallar}@rice.edu
Abstract—This paper presents a 3GPP LTE compliant turbo
decoder accelerator on GPU. The challenge of implementing a
turbo decoder is ﬁnding an efﬁcient mapping of the decoder
algorithm on GPU, e.g. ﬁnding a good way to parallelize workload
across cores and allocate and use fast on-die memory to improve
throughput. In our implementation, we increase throughput through
1) distributing the decoding workload for a codeword across
multiple cores, 2) decoding multiple codewords simultaneously
to increase concurrency and 3) employing memory optimization
techniques to reduce memory bandwidth requirements. In addition,
we analyze how different MAP algorithm approximations affect
both throughput and bit error rate (BER) performance of this
decoder.
I. INTRODUCTION
Turbo codes [1] have become one of the most important
research topics in coding theory for wireless communication
systems. As a practical code that approaches channel capacity,
turbo codes are widely used in many 3G and 4G wireless
standards such as CDMA2000, WCDMA/UMTS, IEEE 802.16e
WiMax, and 3GPP LTE (long term evolution). However, low
BER performance comes at a price – the inherently large
decoding latency and a complex iterative decoding algorithm
have made it very difﬁcult to achieve high throughput in general
purpose processors or digital signal processors. As a result, turbo
decoders are often implemented in ASIC or FPGA [2–8].
The Graphics Processing Unit (GPU) is another alternative
as it provides high computational power while maintaining
ﬂexibility. GPUs deliver extremely high computation throughput
by employing many cores running in parallel. Similar to general
purpose processors, GPUs are ﬂexible enough to handle general
purpose computations. In fact, a number of processing intensive
communication algorithms have been implemented on GPU.
GPU implementations of LDPC decoders are capable of real
time throughput [9]. In addition, both a hard decision MIMO
detector [10] as well as a soft decision MIMO detector [11]
have been implemented on GPU.
In this paper, we aim to provide an alternative – a turbo
decoder deﬁned entirely in software on GPU that reduces the
design cycle and delivers good throughput. Particularly, we
partition the decoding workload across cores and pre-fetch data
to reduce memory stalls. However, parallelization of the decoding
algorithm can improve throughput of a decoder at the expense
of decoder BER performance. In this paper, we also provide
both throughput and BER performance of the decoder and show
that we can parallelize the workload on GPU while maintaining
reasonable BER performance. Although ASIC and FPGA designs
are more power efﬁcient and can offer higher throughput than our
GPU design [12], this work will allow us to accelerate simulation
as well as to implement a complete iterative MIMO receiver in
software on a wireless test-bed platform such as WARPLAB [13].
The rest of the paper is organized as follows: In section II
and section III, we give an overview of the CUDA architecture
and turbo decoding algorithm. In section IV, we will discuss the
implementation aspects on GPU. Finally, we will present BER
performance and throughput results and analyses in section V
and conclude in section VI.
II. COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
Compute Uniﬁed Device Architecture [14] is a software pro-
gramming model that allows the programmer to harness the
massive computation potential offered by the programmable
GPU. The programming model is explicitly parallel. The pro-
grammer explicitly speciﬁes the parallelism, i.e. how operations
are applied to a set of data, in a kernel. At runtime, multiple
threads are spawned, where each thread runs the operations
deﬁned by the kernel on a data set. In this programming model,
threads are completely independent. However, threads within a
block can share computation through barrier synchronization and
shared memory. Thread blocks are completely independent and
can only be synchronized by writing to global memory
and terminating the kernel.
Compared to traditional general purpose processors, a pro-
grammable GPU has much higher peak computation throughput.
The computation power is enabled by many cores on the GPU.
There are multiple streaming multiprocessors (SMs), where each SM
is an 8-ALU single instruction multiple data (SIMD) core. A
kernel is mapped onto the device by mapping each thread block
to an SM. CUDA divides threads within a thread block into
groups of 32 threads. When all 32 threads are doing the same
set of operations, these 32 threads, also known as a WARP,
are executed as a group on an SM over 4 cycles. Otherwise,
threads are executed serially. There are a number of reasons for
stalls to occur. As data is not cached, an SM can stall waiting
for data. Furthermore, the ﬂoating point pipeline is long and
register to register dependency can cause a stall in the pipeline.
To keep cores utilized, multiple thread blocks, or concurrent
thread blocks, are mapped onto an SM and executed on an SM
at the same time. Since the GPU can switch between WARP
instructions with zero overhead, the GPU can minimize stalls by
switching over to another independent WARP instruction on a
stall.
978-1-4244-8933-6/10/$26.00 ©2010 IEEE SiPS 2010
Computation throughput can still become I/O limited if mem-
ory bandwidth is low. Fortunately, fast on-chip resources, such
as registers, shared memory and constant memory, can be used
in place of off-chip device memory to keep the computation
throughput high. Shared memory is especially useful. It can
reduce memory access time by keeping data on-chip and reduce
redundant calculations by allowing data sharing among independent
threads. However, shared memory on each SM has 16 access
ports. An access takes one cycle if 16 consecutive threads access the same
port (broadcast) or if no two threads access the same port (one
to one). Any other layout, with some broadcast and
some one-to-one accesses, is serialized and causes a stall.
There are several other limitations with shared memory. First,
only threads within a block can share data among themselves
and threads in different blocks cannot share data through shared
memory. Second, there is only 16 KB of shared memory on
each streaming multiprocessor, and shared memory is divided among
the concurrent thread blocks on an SM. Using too much shared
memory can reduce the number of concurrent thread blocks
mapped onto an SM.
As a result, it is a challenging task to implement an algorithm
that keeps the GPU cores from idling: we need to partition the
workload across cores, while effectively using shared memory,
and ensuring a sufﬁcient number of concurrently executing thread
blocks.
III. MAP DECODING ALGORITHM
Turbo decoding is based on the BCJR, or MAP
(maximum a posteriori), algorithm [15]. The structure of a turbo
decoder built from two MAP decoders is shown in Figure 1. One iteration of the decoding
process consists of one pass through both decoders. Although
both decoders perform the same set of computations, the two
decoders have different inputs. The inputs of the ﬁrst decoder
are the deinterleaved extrinsic log-likelihood ratios (LLRs) from
the second decoder and the input LLRs from the channel. The
inputs of the second decoder are the interleaved extrinsic LLRs
from the ﬁrst decoder and the input LLRs from the channel.
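[Figure 1: block diagram of Decoder 0 and Decoder 1 connected through the interleaver Π and the deinterleaver Π^-1, with channel inputs Lc(ys), Lc(yp0), Lc(yp1), a priori LLRs La and extrinsic LLRs Le]
Fig. 1: Overview of turbo decoding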
To decode a codeword with N information bits, each decoder
performs a forward traversal followed by a backward traversal
through an N -stage trellis to compute an extrinsic LLR for each
bit. The trellis structure, or the connections between two stages
of the trellis, is deﬁned by the encoder. Figure 2 shows the trellis
structure for the 3GPP LTE turbo code, where each state has two
incoming paths, one path for ub = 0 and one path for ub = 1.
Let sk be a state at stage k, the branch metric (or transition
probability) is deﬁned as:
$$\gamma_k(s_{k-1}, s_k) = \left(L_c(y^s_k) + L_a(y^s_k)\right) u_k + L_c(y^p_k)\, p_k, \tag{1}$$
where $u_k$, the information bit, and $p_k$, the parity bit, are dependent
on the path taken $(s_{k-1}, s_k)$. $L_c(y^s_k)$ is the systematic
channel LLR, $L_a(y^s_k)$ is the a priori LLR, and $L_c(y^p_k)$ is the
parity bit channel LLR at stage k.
Fig. 2: 3GPP LTE turbo code trellis with 8 states
The decoder first performs
a forward traversal to compute αk, the forward state metrics for
the trellis state in stage k. The state metrics αk are computed
recursively as the computation depends on αk−1. The forward
state metric for a state sk at stage k, αk(sk), is deﬁned as:
$$\alpha_k(s_k) = \max^*_{s_{k-1} \in K}\left(\alpha_{k-1}(s_{k-1}) + \gamma(s_{k-1}, s_k)\right), \tag{2}$$
where K is the set of paths that connect a state in stage k − 1
to state sk in stage k.
After the decoder performs a forward traversal, the decoder
performs a backward traversal to compute βk, the backward state
metrics for the trellis state in stage k. The backward state metric
for state sk at stage k, βk(sk), is deﬁned as:
$$\beta_k(s_k) = \max^*_{s_{k+1} \in K}\left(\beta_{k+1}(s_{k+1}) + \gamma(s_{k+1}, s_k)\right). \tag{3}$$
Although the computation is the same as the computation for
αk, the state transitions are different. In this case, K is the set
of paths that connect a state in stage k + 1 to state sk in stage
k.
After computing βk, the state metrics for all states in stage k,
we compute two LLRs per trellis state. We compute one state
LLR per state sk, Λ(sk|uk = 0), for the incoming path that is
connected to state sk which corresponds to uk = 0. In addition,
we also compute one state LLR per state sk, Λ(sk|ub = 1), for
the incoming path that is connected to state sk which corresponds
to uk = 1. The state LLR, Λ(sk|ub = 0), is deﬁned as:
$$\Lambda(s_k|u_b = 0) = \alpha_{k-1}(s_{k-1}) + \gamma(s_{k-1}, s_k) + \beta_k(s_k), \tag{4}$$
where the path from sk−1 to sk with ub = 0 is used in the
computation. Similarly, the state LLR, Λ(sk|ub = 1), is deﬁned
as:
$$\Lambda(s_k|u_b = 1) = \alpha_{k-1}(s_{k-1}) + \gamma(s_{k-1}, s_k) + \beta_k(s_k), \tag{5}$$
where the path from sk−1 to sk with ub = 1 is used in the
computation.
To compute the extrinsic LLR for uk, we perform the follow-
ing computation:
$$L_e(k) = \max^*_{s_k \in K}\left(\Lambda(s_k|u_b = 0) - \Lambda(s_k|u_b = 1)\right) - L_a(y^s_k) - L_c(y^s_k), \tag{6}$$
where K is the set of all possible states and max*() is defined
as $\max^*(S) = \ln\left(\sum_{s \in S} e^{s}\right)$.
In the next section, we will describe how this algorithm is
mapped onto GPU in detail.
IV. IMPLEMENTATION OF MAP DECODER ON GPU
A straight-forward implementation of the decoding algorithm
requires the completion of N stages of αk computation before
the start of βk computation. Throughput of such a decoder would
be low on a GPU. First, the parallelism of this decoder would be
low, since we would spawn only one thread block with 8 threads
to traverse the trellis in parallel. Second, the memory required
to save N stages of αk is signiﬁcantly larger than the shared
memory size. Finally, a traversal from stage 0 to stage N − 1
takes many cycles to complete and leads to very long decoding
delay.
Figure 3 provides an overview of our implementation. At the
beginning of the decoding process, the inputs of the decoder,
LLRs from the channel, are copied from the host memory to
device memory. Instead of spawning only one thread block per
codeword to perform decoding, we split a codeword into P sub-blocks
and decode them with P independent thread blocks in parallel. We
still assign 8 threads to each thread block as there are only
8 trellis states. However, both the amount of shared memory
required and the decoding latency are reduced as a thread
block only needs to traverse N/P stages. After each half
decoding iteration, thread blocks are synchronized by writing
extrinsic LLRs to device memory and terminating the kernel.
In the device memory, we allocate memory for both extrinsic
LLRs from the ﬁrst half iteration and extrinsic LLRs from the
second half iteration. During the first half iteration, the P thread
blocks read the extrinsic LLRs produced by the second half iteration.
During the second half iteration, the direction is reversed.
Although a sliding window with training sequence [16] can be
used to improve the BER performance of the decoder, it is not
supported by the current design. As the length of sub-blocks is
very small with large P , a sliding window would add signiﬁcant
overhead. However, the next iteration initialization technique is
used to improve BER performance. The α and β values between
neighboring thread-blocks are exchanged between iterations.
Only one MAP kernel is needed as each half iteration of the
MAP decoding algorithm performs the same sequence of compu-
tations. However, since the input changes and the output changes
between each half iteration, the kernel needs to be reconﬁgurable.
Speciﬁcally, the ﬁrst half iteration reads a priori LLRs and writes
extrinsic LLRs without any interleaving or deinterleaving. The
second half iteration reads a priori LLRs interleaved and writes
extrinsic LLRs deinterleaved. The kernel handles reconﬁguration
easily with a couple of simple conditional reads and writes
at the beginning and the end of the kernel. Therefore, this
kernel executes twice per iteration. The implementation details
of the reconfigurable MAP kernel are described in the following
subsections.
[Figure 3: sub-block decoders D0, D1, D2, ..., DP-1 exchange data with device memory; the channel LLRs Lc(ys), Lc(p0), Lc(p1) are copied from host memory to device memory; 2 cycles = 1 iteration]
Fig. 3: Overview of our MAP decoder implementation
A. Shared Memory Allocation
To increase locality of the data, our implementation attempts to
prefetch data from device memory into shared memory and keep
intermediate results on die. Since the backward traversal depends
on the results from the forward traversal, we save N/P stages of αk
values in shared memory during the forward traversal. Since there
are 8 threads, one per trellis state, each thread block requires
8N/P floats for α. Similarly, we need to save βk to compute βk−1,
which requires 8 floats. In order to increase thread utilization
during extrinsic LLR computation, we save up to 8 stages of
Λk(sk|ub = 0) and Λk(sk|ub = 1), which requires 128 floats.
In addition, at the start of the kernel, we prefetch N/P LLRs
from the channel and N/P a priori LLRs into shared memory
for more efficient access. A total of 10N/P + 196 floats is allocated
per thread block. Since we only have 16 KB of shared memory,
which is divided among concurrently executing thread blocks,
a small P increases the amount of shared memory required per
thread block, which significantly reduces the number of
concurrently executing thread blocks.
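As a rough illustration of this trade-off, the sketch below evaluates the 10N/P + 196 float count for N = 6144 and derives how many thread blocks could share the 16 KB of one SM if shared memory were the only limit. The function names are hypothetical, P is assumed to divide N, and other limits (registers, hardware caps on resident blocks) are ignored.

```python
# Shared-memory budget per thread block, using the paper's count of
# 10*N/P + 196 single-precision floats per thread block.
FLOAT_BYTES = 4                  # size of a single-precision float
SHARED_BYTES_PER_SM = 16 * 1024  # 16 KB of shared memory per SM (Section II)


def floats_per_block(n: int, p: int) -> int:
    """Floats of shared memory one thread block needs for N bits, P sub-blocks."""
    assert n % p == 0, "this sketch assumes P divides N"
    return 10 * (n // p) + 196


def max_blocks_by_shared_memory(n: int, p: int) -> int:
    """Upper bound on concurrent thread blocks per SM from shared memory alone."""
    return SHARED_BYTES_PER_SM // (floats_per_block(n, p) * FLOAT_BYTES)


if __name__ == "__main__":
    for p in (32, 64, 96, 128):
        print(p, floats_per_block(6144, p), max_blocks_by_shared_memory(6144, p))
```

Under these assumptions, P = 32 leaves room for a single thread block per SM, while P = 128 leaves room for six.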
B. Forward Traversal
During the forward traversal, each thread block ﬁrst traverses
through the trellis to compute α. We assign one thread to
each trellis state; each thread evaluates two incoming paths and
updates αk(sj) for the current trellis stage using αk−1, the
forward metrics from the previous trellis stage k−1. The decoder
uses Equation (2) to compute αk. The computation, however,
depends on the path taken (sk−1, sk). The two incoming paths
are known a priori since the connections are deﬁned by the trellis
structure as shown in Figure 2. Table I summarizes operands
needed for α computation. The indices of the αk are stored in
constant memory. Each thread loads the indices and the values
pk|ub = 0 and pk|ub = 1 at the start of the kernel. The
pseudo-code for one iteration of αk computation is shown in
Algorithm 1. The memory access pattern is very regular for the
forward traversal. Threads access values of αk−1 in different
memory banks. Since all threads access the same a priori LLR
and parity LLR in each iteration, memory accesses are broadcast
reads. Therefore, there are no shared memory conflicts in either
case; that is, memory reads and writes are handled efficiently by
shared memory.
TABLE I: Operands for αk computation
                 ub = 0         ub = 1
Thread id (i)    sk−1   pk      sk−1   pk
0                0      0       1      1
1                3      1       2      0
2                4      1       5      0
3                7      0       6      1
4                1      0       0      1
5                2      1       3      0
6                5      1       4      0
7                6      0       7      1
Algorithm 1 thread i computes αk(i)
  $a_0 \leftarrow \alpha_{k-1}(s_{k-1}|u_b = 0) + L_c(y^p_k) \cdot (p_k|u_b = 0)$
  $a_1 \leftarrow \alpha_{k-1}(s_{k-1}|u_b = 1) + (L_c(y^s_k) + L_a(k)) + L_c(y^p_k) \cdot (p_k|u_b = 1)$
  $\alpha_k(i) = \max^*(a_0, a_1)$
  SYNC
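The per-thread update can be emulated on the host to check the table lookups. The following Python sketch loops over the 8 states in place of the 8 GPU threads, takes the predecessor states and parity bits from Table I, and forms the branch metrics as in Equation (1). The names `forward_step`, `PREV_U0` and `PREV_U1` are illustrative and not taken from the paper's kernel.

```python
import math

# Table I: for destination state i, (previous state, parity bit) of the
# incoming path with u_b = 0 and of the incoming path with u_b = 1.
PREV_U0 = [(0, 0), (3, 1), (4, 1), (7, 0), (1, 0), (2, 1), (5, 1), (6, 0)]
PREV_U1 = [(1, 1), (2, 0), (5, 0), (6, 1), (0, 1), (3, 0), (4, 0), (7, 1)]


def max_star(a, b):
    """Full-log-MAP max*: ln(e^a + e^b), written as max plus a correction."""
    return max(a, b) + math.log1p(math.exp(-abs(b - a)))


def forward_step(alpha_prev, lc_sys, la, lc_par):
    """Compute the 8 forward metrics of one stage from the previous stage.

    lc_sys, la and lc_par are the systematic channel LLR, the a priori LLR
    and the parity channel LLR of the current stage.
    """
    alpha = []
    for i in range(8):  # one loop iteration stands in for one GPU thread
        s0, p0 = PREV_U0[i]
        s1, p1 = PREV_U1[i]
        a0 = alpha_prev[s0] + lc_par * p0                  # path with u_b = 0
        a1 = alpha_prev[s1] + (lc_sys + la) + lc_par * p1  # path with u_b = 1
        alpha.append(max_star(a0, a1))
    return alpha


if __name__ == "__main__":
    # Start in state 0; a large negative finite value marks the other states
    # as unreachable (an assumption of this sketch).
    start = [0.0] + [-1e30] * 7
    print(forward_step(start, lc_sys=2.0, la=0.5, lc_par=-1.0))
```

On the GPU the loop body would run once per thread, followed by the barrier (SYNC) before the next stage reads the new metrics.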
C. Backward Traversal and LLR Computation
After the forward traversal, each thread block traverses through
the trellis backward to compute β. We assign one thread to each
trellis state to compute β, followed by computing Λ0 and Λ1
as shown in Algorithm 2. The indices of βk+1 and values of pk are
summarized in Table II. Similar to the forward traversal, there
are no shared memory bank conﬂicts since each thread accesses
an element of α or β in a different bank.
TABLE II: Operands for βk computation
                 ub = 0         ub = 1
Thread id (i)    sk+1   pk      sk+1   pk
0                0      0       4      1
1                4      0       0      1
2                5      1       1      0
3                1      1       5      0
4                2      1       6      0
5                6      1       2      0
6                7      0       3      1
7                3      0       7      1
Algorithm 2 thread i computes βk(i), Λ0(i) and Λ1(i)
  $b_0 \leftarrow \beta_{k+1}(s_{k+1}|u_b = 0) + L_c(y^p_k) \cdot (p_k|u_b = 0)$
  $b_1 \leftarrow \beta_{k+1}(s_{k+1}|u_b = 1) + (L_c(y^s_k) + L_a(k)) + L_c(y^p_k) \cdot (p_k|u_b = 1)$
  $\beta_k(i) = \max^*(b_0, b_1)$
  SYNC
  $\Lambda_0(i) = \alpha_k(i) + L_p(i)\, p_k + \beta_{k+1}(i)$
  $\Lambda_1(i) = \alpha_k(i) + (L_c(k) + L_a(k)) + L_p(s_k)\, p_k + \beta_k(i)$
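Tables I and II describe the same trellis from two directions, which can be checked mechanically. The sketch below, in Python with illustrative names, builds the set of edges implied by each table, and adds a host-side backward update that follows the recursion of Equation (3) with β read from the next stage. It is a sanity check under those assumptions, not the paper's kernel.

```python
import math

# Table I: incoming paths of state i as (previous state, parity bit).
PREV_U0 = [(0, 0), (3, 1), (4, 1), (7, 0), (1, 0), (2, 1), (5, 1), (6, 0)]
PREV_U1 = [(1, 1), (2, 0), (5, 0), (6, 1), (0, 1), (3, 0), (4, 0), (7, 1)]
# Table II: outgoing paths of state i as (next state, parity bit).
NEXT_U0 = [(0, 0), (4, 0), (5, 1), (1, 1), (2, 1), (6, 1), (7, 0), (3, 0)]
NEXT_U1 = [(4, 1), (0, 1), (1, 0), (5, 0), (6, 0), (2, 0), (3, 1), (7, 1)]


def edges_from_table_i():
    """All trellis edges (from_state, to_state, u, p) implied by Table I."""
    edges = set()
    for i in range(8):
        edges.add((PREV_U0[i][0], i, 0, PREV_U0[i][1]))
        edges.add((PREV_U1[i][0], i, 1, PREV_U1[i][1]))
    return edges


def edges_from_table_ii():
    """All trellis edges (from_state, to_state, u, p) implied by Table II."""
    edges = set()
    for i in range(8):
        edges.add((i, NEXT_U0[i][0], 0, NEXT_U0[i][1]))
        edges.add((i, NEXT_U1[i][0], 1, NEXT_U1[i][1]))
    return edges


def max_star(a, b):
    """Full-log-MAP max*: ln(e^a + e^b)."""
    return max(a, b) + math.log1p(math.exp(-abs(b - a)))


def backward_step(beta_next, lc_sys, la, lc_par):
    """Compute the 8 backward metrics of one stage from the following stage."""
    beta = []
    for i in range(8):  # one loop iteration stands in for one GPU thread
        s0, p0 = NEXT_U0[i]
        s1, p1 = NEXT_U1[i]
        b0 = beta_next[s0] + lc_par * p0                  # path with u_b = 0
        b1 = beta_next[s1] + (lc_sys + la) + lc_par * p1  # path with u_b = 1
        beta.append(max_star(b0, b1))
    return beta


if __name__ == "__main__":
    print(edges_from_table_i() == edges_from_table_ii())
```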
After computing Λ0 and Λ1 for stage k, we can compute the
extrinsic LLR for stage k. However, there are 8 threads avail-
able to compute the single LLR, which introduces parallelism
overhead. Instead of computing one extrinsic LLR for stage k
as soon as the decoder computes βk, we allow the threads to
traverse through the trellis and save 8 stages of Λ0 and Λ1
before performing extrinsic LLR computations. By saving eight
stages of Λ0 and Λ1, we allow all 8 threads to compute LLRs in
parallel efﬁciently. Each thread handles one stage of Λ0 and Λ1
to compute an LLR. Although this increases thread utilization,
threads need to avoid accessing the same bank when computing
extrinsic LLR. For example, 8 elements of Λ0 for each stage are
stored in 8 consecutive addresses. Since there are 16 memory
banks, elements of even stages of Λ0 or Λ1 with the same index
share the same memory bank. Likewise, this is true for
odd stages. Hence, sequential accesses to Λ0 or Λ1 to
compute extrinsic LLRs will result in four-way memory bank
conﬂicts. To alleviate this problem, we permute the access pattern
based on thread ID as shown in Algorithm 3.
Algorithm 3 thread i computes Le(i)
  λ0 = Λ0(i)
  λ1 = Λ1(i)
  for j = 1 to 7 do
    index = (i + j) & 7
    λ0 = max*(λ0, Λ0(index))
    λ1 = max*(λ1, Λ1(index))
  end for
  Le = λ1 − λ0
  Compute write address
  Write Le to device memory
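The effect of the permuted index can be checked by counting how many threads land on the same bank in each access step. The Python sketch below assumes that the 8 values of stage i are stored at word addresses 8i to 8i + 7 and that the bank is the word address modulo 16; the helper names are illustrative.

```python
from collections import Counter

NUM_BANKS = 16   # shared-memory banks per SM (Section II)
NUM_STATES = 8   # Lambda values stored per trellis stage
NUM_THREADS = 8  # thread i works on stored stage i


def worst_bank_conflict(index_for):
    """Largest number of threads hitting one bank in any single access step.

    index_for(i, j) gives the state index that thread i reads in step j.
    Assumed layout: Lambda[stage][index] sits at word address 8*stage + index,
    and bank = word address mod 16.
    """
    worst = 0
    for j in range(NUM_STATES):
        banks = Counter(
            (NUM_STATES * i + index_for(i, j)) % NUM_BANKS
            for i in range(NUM_THREADS)
        )
        worst = max(worst, max(banks.values()))
    return worst


def sequential(i, j):
    """Every thread reads index j in step j."""
    return j


def permuted(i, j):
    """Access pattern of Algorithm 3: thread i starts at index i and wraps."""
    return (i + j) & 7


if __name__ == "__main__":
    print(worst_bank_conflict(sequential), worst_bank_conflict(permuted))
```

With this layout the sequential pattern produces a four-way conflict in every step, while the permuted pattern gives each thread its own bank.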
D. Interleaver
The interleaver is used in the second half iteration of the MAP
decoding algorithm. Our implementation uses the quadratic permutation
polynomial (QPP) interleaver [17] specified in
the 3GPP LTE standard. The QPP interleaver
is contention free: it guarantees conflict-free memory
access, where each sub-block accesses a different memory bank.
However, the memory access pattern is still random. Since the
inputs are shared in device memory, memory accesses are not
necessarily coalesced. We reduce latency by pre-fetching data
into the shared memory. The QPP interleaver is deﬁned as:
$$\Pi(x) = f_1 x + f_2 x^2 \pmod{N}. \tag{7}$$
Direct computation of Π(x) using Equation (7) can cause overflow.
For example, for N = 6144 and x = 6143, x^2 is already about 3.8 × 10^7, so the
product f2·x^2 can exceed the range of a 32-bit
integer. The following equation is used to compute Π(x) instead:
$$\Pi(x) = \left(f_1 + f_2 x \pmod{N}\right) \cdot x \pmod{N} \tag{8}$$
Another alternative is to compute Π(x) recursively [6], which
requires Π(x) to be computed before we can compute Π(x+1).
This is not efﬁcient for our design as we need to compute several
interleaved addresses in parallel. For example, during the second
half of the iteration to store extrinsic LLR values, 8 threads need
to compute 8 interleaved addresses in parallel. Equation (8) allows
efﬁcient address computation in parallel.
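The equivalence of Equations (7) and (8), and the size of the intermediate values, can be checked with a few lines of Python, where integers do not overflow. The parameters f1 = 263 and f2 = 480 are assumed here to be the LTE values for N = 6144; they are not stated in this paper and should be verified against the standard. The function names are illustrative.

```python
# Assumed QPP parameters for N = 6144 (not given in the paper).
F1, F2, N = 263, 480, 6144


def qpp_direct(x, f1=F1, f2=F2, n=N):
    """Equation (7): needs the full product f2 * x^2 before the modulo."""
    return (f1 * x + f2 * x * x) % n


def qpp_safe(x, f1=F1, f2=F2, n=N):
    """Equation (8): reduces f2 * x modulo N before multiplying by x."""
    return ((f1 + (f2 * x) % n) * x) % n


def fits_uint32(value):
    """True if value is representable as an unsigned 32-bit integer."""
    return 0 <= value < 2 ** 32


if __name__ == "__main__":
    x = N - 1
    print(fits_uint32(F2 * x * x))                # intermediate of Equation (7)
    print(fits_uint32((F1 + (F2 * x) % N) * x))   # intermediate of Equation (8)
    print(all(qpp_direct(v) == qpp_safe(v) for v in range(N)))
```

Because each address depends only on x, every thread can evaluate `qpp_safe` for its own index without waiting for a neighbouring result, which is the property the recursive formulation lacks.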
Although our decoder is conﬁgured for the 3GPP LTE stan-
dard, one can replace the current interleaver function with
another function to support other standards. Furthermore, we
can deﬁne multiple interleavers and switch between them on-
the-ﬂy since the interleaver is deﬁned in software in our GPU
implementation.
E. max∗ Function
Both natural logarithm and natural exponential are supported
on CUDA. We support full-log-MAP as well as max-log-MAP
[18]. We compute full-log-MAP by:
$$\max^*(a, b) = \max(a, b) + \ln\left(1 + e^{-|b-a|}\right) \tag{9}$$
and max-log-MAP is deﬁned as:
$$\max^*(a, b) = \max(a, b). \tag{10}$$
The throughput of full-log-MAP will be lower than the throughput
of max-log-MAP. Not only is the number of instructions required
for full-log-MAP greater than the number of instructions required
for max-log-MAP, but the natural logarithm and natural
exponential instructions also take longer to execute on the GPU
than common floating-point operations, e.g. multiply and add.
An alternative is using a lookup table in constant memory.
However, this is even less efﬁcient as multiple threads access
different entries in the lookup table simultaneously and only the
ﬁrst entry will be a cached read.
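The two variants of max* differ only in the correction term of Equation (9). The Python sketch below, with illustrative function names, implements both and exposes the correction, which is largest (ln 2) when the two arguments are equal and vanishes as they move apart.

```python
import math


def max_star_full(a, b):
    """Equation (9): exact ln(e^a + e^b), as used by full-log-MAP."""
    return max(a, b) + math.log1p(math.exp(-abs(b - a)))


def max_star_maxlog(a, b):
    """Equation (10): max-log-MAP drops the correction term."""
    return max(a, b)


def correction(a, b):
    """What max-log-MAP ignores: ln(1 + e^-|b-a|), between 0 and ln 2."""
    return max_star_full(a, b) - max_star_maxlog(a, b)


if __name__ == "__main__":
    for gap in (0.0, 1.0, 5.0, 10.0):
        print(gap, correction(0.0, gap))
```

The extra logarithm and exponential per call are the instructions that make full-log-MAP more expensive on the GPU.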
V. BER PERFORMANCE AND THROUGHPUT RESULTS
We evaluated accuracy of our decoder by comparing it against
a reference standard C Viterbi implementation. To evaluate the
BER performance and throughput of our turbo decoder, we tested
our turbo decoder on a Linux platform with 8GB DDR2 memory
running at 800 MHz and an Intel Core 2 Quad Q6600 running at
2.4 GHz. The GPU used in our experiment is the Nvidia Tesla
C1060 graphics card, which has 240 stream processors running
at 1.3 GHz with 4 GB of GDDR3 memory running at 1600 MHz.
A. Decoder BER Performance
Since our decoder can change P , which is the number of
sub-blocks to be decoded in parallel, we ﬁrst look at how the
number of parallel sub-blocks affects the overall decoder BER
performance. In our setup, the host computer ﬁrst generates
the random bits and encodes the random bits using a 3GPP
LTE turbo encoder. After passing the input symbols through
the channel with AWGN noise, the host generates LLR values
which are fed into the decoding kernel running on GPU. For this
experiment, we tested our decoder with P = 32, 64, 96, 128 for
a 3GPP LTE turbo code with N = 6144. In addition, we tested
both full-log-MAP as well as max-log-MAP with the decoder
performing 6 decoding iterations.
Figure 4 shows the bit error rate (BER) performance of
our decoder using full-log-MAP, while Figure 5 shows the BER
performance of our decoder using max-log-MAP. In both cases,
BER performance of the decoder degrades as we increase P .
The BER performance of the decoder is signiﬁcantly better
when full-log-MAP is used. Furthermore, we see that even with
parallelism of 96, where each sub-block is only 64 stages long,
the decoder provides BER performance that is within 0.1dB of
the performance of the optimal case (P = 1).
[Figure 4: bit error rate (BER), from 10^-7 to 10^-1, versus Eb/No [dB], from 0 to 0.7, with curves for P = 1, 32, 64, 96 and 128]
Fig. 4: BER performance (BPSK, full-log-MAP)
[Figure 5: bit error rate (BER), from 10^-7 to 10^0, versus Eb/No [dB], from 0 to 0.7, with curves for P = 1, 32, 64, 96 and 128]
Fig. 5: BER performance (BPSK, max-log-MAP)
B. Decoder Throughput
We measure the time it takes for the decoder to decode a batch
of 100 codewords using event management in the CUDA runtime
API. The runtimes measured include both memory transfers
and kernel execution. Since our decoder can support various
code sizes, we can decode N = 64, 1024, 2048, 6144 with
various numbers of decoding iterations and parallelism P . The
throughput of the decoder is only dependent on W = N/P as
decoding time is linearly dependent on the number of trellis
stages traversed. Therefore, we report the decoder throughput
as a function of W which can be used to ﬁnd the throughput
of different decoder conﬁgurations. For example, if N = 6144,
P = 64, and the decoder performs 1 iteration, the throughput of
the decoder is the throughput when W = 96. The throughput
of the decoder is summarized in Table III. We see that the
throughput of the decoder is inversely proportional to the number
of iterations performed. The throughput of the decoder after
m iterations can be approximated as T0/m, where T0 is the
throughput of the decoder after 1 iteration.
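The T0/m approximation can be compared against the measurements directly. The short Python sketch below uses the one-iteration and six-iteration entries of Table III; the helper names are illustrative.

```python
# Table III entries as (max-log-MAP, full-log-MAP) throughput in Mbps.
ONE_ITER = {32: (49.02, 36.59), 64: (34.75, 23.87),
            96: (26.32, 19.50), 128: (17.95, 12.19)}
SIX_ITER = {32: (7.97, 5.99), 64: (5.64, 4.33),
            96: (4.26, 3.18), 128: (2.91, 1.97)}


def sub_block_length(n, p):
    """W = N/P, the number of trellis stages each thread block traverses."""
    return n // p


def estimate(t0, m):
    """Approximate throughput after m iterations from the 1-iteration value."""
    return t0 / m


def worst_relative_error(m=6):
    """Largest relative gap between T0/m and the measured 6-iteration entries."""
    worst = 0.0
    for w, t0_pair in ONE_ITER.items():
        for t0, measured in zip(t0_pair, SIX_ITER[w]):
            worst = max(worst, abs(estimate(t0, m) - measured) / measured)
    return worst


if __name__ == "__main__":
    print(sub_block_length(6144, 64), worst_relative_error())
```

For these entries the estimate stays within 10% of the measured six-iteration throughput.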
Although the throughput of full-log-MAP is lower than that of
max-log-MAP as expected, the difference is small while full-log-MAP
improves the BER performance of the decoder significantly.
Therefore, full-log-MAP is a better choice for this design.
TABLE III: Throughput vs W
Max-log-MAP throughput / full-log-MAP throughput (Mbps)
# Iter   W=32          W=64          W=96          W=128
1        49.02/36.59   34.75/23.87   26.32/19.50   17.95/12.19
2        24.14/18.09   17.09/12.72   12.98/9.62    8.82/5.59
3        16.01/12.00   11.34/8.45    8.57/6.39     5.85/3.97
4        11.98/9.01    8.48/6.51     6.41/4.78     4.37/2.97
5        9.57/7.19     6.77/5.2      5.12/3.82     3.49/2.37
6        7.97/5.99     5.64/4.33     4.26/3.18     2.91/1.97
C. Architecture Comparison
Table IV compares our decoder with other programmable turbo
decoders. Our decoder with W = 64 compares favorably in
terms of throughput and BER performance. We can support both
the full-log-MAP (FLM) algorithm and the simpliﬁed max-log-
MAP (MLM) algorithm while most other solutions only support
the sub-optimal max-log-MAP algorithm.
TABLE IV: Our decoder vs other programmable turbo decoders
Work   Architecture       MAP Algorithm   Throughput          Iter.
[19]   Intel Pentium 3    MLM/FLM         366 Kbps/51 Kbps    1
[20]   Motorola 56603     MLM             48.6 Kbps           5
[20]   STM VLIW DSP       FLM             200 Kbps            5
[21]   TigerSHARC DSP     MLM             2.399 Mbps          4
[22]   TMS320C6201 DSP    MLM             500 Kbps            4
[5]    32-wide SIMD       MLM             2.08 Mbps           5
ours   Nvidia C1060       MLM/FLM         6.77/5.2 Mbps       5
VI. CONCLUSION
In this paper, we presented a 3GPP LTE compliant turbo
decoder implemented on GPU. We partition the workload across
cores on the GPU by dividing the codeword into many sub-blocks
to be decoded in parallel. Furthermore, all computation
is completely parallel for each sub-block. To reduce the memory
bandwidth needed to keep the cores fed, we prefetch data into
shared memory and keep intermediate data in shared memory.
As different sub-block sizes can lead to BER performance
degradation, we presented how both BER performance and
throughput are affected by sub-block size. We show that our
decoder provides higher throughput than the other programmable
turbo decoders compared, even when the full-log-MAP
algorithm is used. As the decoder is implemented in software, we
can easily change the QPP interleaver and trellis structure
to support other codes. Future work includes other partitioning
and memory strategies to improve throughput of the decoder.
Furthermore, we will implement a complete iterative MIMO
receiver by combining this decoder with a MIMO detector on
the GPU.
REFERENCES
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon
limit error-correcting coding and decoding: Turbo-Codes,” in IEEE
International Conference on Communication, May 1993.
[2] D. Garrett, B. Xu, and C. Nicol, “Energy efﬁcient turbo decod-
ing for 3G mobile,” in International symposium on Low power
electronics and design. ACM, 2001, pp. 328–333.
[3] M. Bickerstaff, L. Davis, C. Thomas, D. Garrett, and C. Nicol, “A
24Mb/s radix-4 logMAP turbo decoder for 3GPP-HSDPA mobile
wireless,” in IEEE Int. Solid-State Circuit Conf. (ISSCC), Feb.
2003.
[4] M. Shin and I. Park, “SIMD processor-based turbo decoder sup-
porting multiple third-generation wireless standards,” IEEE Trans.
on VLSI, vol. 15, pp. 801–810, Jun. 2007.
[5] Y. Lin, S. Mahlke, T. Mudge, C. Chakrabarti, A. Reid, and
K. Flautner, “Design and implementation of turbo decoders for
software deﬁned radio,” in IEEE Workshop on Signal Processing
Design and Implementation (SIPS), Oct. 2006.
[6] Y. Sun, Y. Zhu, M. Goel, and J. R. Cavallaro, “Conﬁgurable and
Scalable High Throughput Turbo Decoder Architecture for Mul-
tiple 4G Wireless Standards,” in IEEE International Conference
on Application-Speciﬁc Systems, Architectures and Processors
(ASAP), July 2008, pp. 209–214.
[7] P. Salmela, H. Sorokin, and J. Takala, “A Programmable Max-Log-
MAP Turbo Decoder Implementation,” Hindawi VLSI Design, vol. 2008,
pp. 636–640, 2008.
[8] C.-C. Wong, Y.-Y. Lee, and H.-C. Chang, “A 188-size 2.1mm2
reconﬁgurable turbo decoder chip with parallel architecture for
3GPP LTE system,” in 2009 Symposium on VLSI Circuits, June
2009, pp. 288–289.
[9] G. Falcão, V. Silva, and L. Sousa, “How GPUs Can Outperform
ASICs for Fast LDPC Decoding,” in ICS ’09: Proceedings of the
23rd International Conference on Supercomputing, pp. 390–399.
[10] M. Wu, S. Gupta, Y. Sun, and J. R. Cavallaro, “A GPU Imple-
mentation of A Real-Time MIMO Detector,” in IEEE Workshop
on Signal Processing Systems (SiPS’09), Oct. 2009.
[11] M. Wu, Y. Sun, and J. R. Cavallaro, “Reconﬁgurable Real-time
MIMO Detector on GPU,” in IEEE 43rd Asilomar Conference on
Signals, Systems and Computers (ASILOMAR’09), Nov. 2009.
[12] Xilinx Corporation, 3GPP LTE Turbo Decoder v2.0, 2008.
[Online]. Available: http://www.xilinx.com/products/ipcenter/DO-
DI-TCCDEC-LTE.htm
[13] K. Amiri, Y. Sun, P. Murphy, C. Hunter, J. R. Cavallaro, and
A. Sabharwal, “Warp, a uniﬁed wireless network testbed for edu-
cation and research,” in MSE ’07: Proceedings of the 2007 IEEE
International Conference on Microelectronic Systems Education,
June 2007.
[14] NVIDIA Corporation, CUDA Compute Uniﬁed Device
Architecture Programming Guide, 2008. [Online]. Available:
http://www.nvidia.com/object/cuda_develop.html
[15] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal Decoding of
Linear Codes for Minimizing Symbol Error Rate,” IEEE Transac-
tions on Information Theory, vol. IT-20, pp. 284–287, Mar. 1974.
[16] F. Naessens, B. Bougard, S. Bressinck, L. Hollevoet, P. Raghavan,
L. V. der Perre, and F. Catthoor, “A uniﬁed instruction set
programmable architecture for multi-standard advanced forward
error correction,” in IEEE Workshop on Signal Processing Systems
(SIPS), October 2008.
[17] J. Sun and O. Takeshita, “Interleavers for turbo codes using
permutation polynomials over integer rings,” IEEE Trans. Inform.
Theory, vol. 51, pp. 101–119, Jan. 2005.
[18] P. Robertson, E. Villebrun, and P. Hoeher, “A comparison of
optimal and sub-optimal MAP decoding algorithm operating in the
log domain,” in IEEE Int. Conf. Commun., 1995, pp. 1009–1013.
[19] M. Valenti and J. Sun, “The UMTS Turbo Code and an Efficient
Decoder Implementation Suitable for Software-Deﬁned Radios,”
International Journal of Wireless Information Networks, vol. 8,
no. 4, pp. 203–215, Oct. 2001.
[20] H. Michel, A. Worm, M. Munch, and N. Wehn, “Hardware soft-
ware trade-offs for advanced 3G channel coding,” in Proceedings
of Design, Automation and Test in Europe, 2002.
[21] K. Loo, T. Alukaidey, and S. Jimaa, “High performance paral-
lelised 3GPP turbo decoder,” in IEEE Personal Mobile Communi-
cations Conference, April 2003, pp. 337–342.
[22] Y. Song, G. Liu, and Huiyang, “The implementation of turbo
decoder on DSP in W-CDMA system,” in International Conference
on Wireless Communications, Networking and Mobile Computing,
Dec. 2005, pp. 1281–1283.
