GPU Implementation and Optimization of a Flexible MAP Decoder for
  Synchronization Correction by Briffa, Johann A.
1GPU Implementation and Optimization of a Flexible
MAP Decoder for Synchronization Correction
Johann A. Briffa
Abstract—In this paper we present an optimized parallel
implementation of a flexible MAP decoder for synchronization
error correcting codes, supporting a very wide range of code sizes
and channel conditions. On mid-range GPUs we demonstrate
decoding speedups of more than two orders of magnitude over a
CPU implementation of the same optimized algorithm, and more
than an order of magnitude over our earlier GPU implementa-
tion. The prominent challenge is to maintain high parallelization
efficiency over a wide range of code sizes and channel conditions,
and different execution hardware. We ensure this with a dynamic
strategy for choosing parallel execution parameters at run-time.
We also present a variant that trades off some decoding speed
for significantly reduced memory requirement, with no loss
to the decoder’s error correction performance. The increased
throughput of our implementation and its ability to work with
less memory allow us to analyse larger codes and poorer channel
conditions, and makes practical use of such codes more feasible.
Index Terms—Insertion-Deletion Correction, MAP Decoder,
Forward-Backward Algorithm, CUDA, GPU
I. INTRODUCTION
The problem of correcting synchronization errors has re-
cently seen an increase in interest [1]. We believe this is due
to two factors: recent applications for such codes, where tra-
ditional techniques for synchronization cannot be applied, and
the feasibility of decoding due to improvements in computing
resources. A recent application is for bit-patterned media [2],
[3], where written-in errors can be modeled as synchronization
errors. Bit-patterned media is of great interest to the magnetic
recording industry due to the potential increase in writing den-
sity. Another example is robust digital watermarking, where a
message is embedded into a media file and an attacker seeks to
make the message unreadable. An effective attack is to cause
loss of synchronization; synchronization-correcting codes have
been successfully applied to resist these attacks in speech [4]
and image [5] watermarking.
Most practical decoders for synchronization correction work
by extending the state space of the underlying code to
account for the state of the channel (which represents the
synchronization error). This increases the decoding complexity
significantly, particularly under poor channel conditions where
the state space is necessarily larger. While optimal decoding
is achievable, the complexity involved remains a barrier for
wider adoption. The problem is even more pronounced when
these codes are part of an iteratively decoded construction.
Manuscript submitted February 13, 2014; accepted May 5, 2014 for
publication in the IET Journal of Engineering. Parts of this research have
been carried out using computational facilities procured through the European
Regional Development Fund, Project ERDF-080, and a gift from NVIDIA.
J. A. Briffa is with the Dept. of Computing, University of Surrey,
Guildford GU2 7XH, England. Email: j.briffa@surrey.ac.uk
A key practical synchronization-correcting scheme is the
concatenated construction by Davey and MacKay [6], where
the inner code tracks synchronization on an unbounded ran-
dom insertion and deletion channel. We presented a maximum
a-posteriori (MAP) decoder for a generalized construction of
the inner code in [7], and improved encodings in [8]. In [9] we
presented a parallel implementation of our MAP decoder on
a graphics processing unit (GPU) using NVIDIA’s Compute
Unified Device Architecture (CUDA) [10]. This resulted in a
decoding speedup of up to two orders of magnitude, depending
on code parameters and channel conditions. Since that work
we have also presented a number of additional improvements
to the MAP decoder algorithm [11], resulting in a speedup
of over an order of magnitude in a serial implementation, as
we shall show. Unfortunately, these algorithmic improvements
change the proportion of time spent computing the various
equations, so that a straightforward application of the algo-
rithm improvements to our earlier GPU implementation does
not yield the expected speedup. A more careful parallelization
strategy is required, which we discuss in this paper.
Other GPU implementations of the forward-backward algo-
rithm (on which our MAP decoder is based) have significant
fundamental differences from this work. Recent work on turbo
decoders [12]–[14] is limited to the parallel implementation
of specific binary code constructions. Further, turbo decoders
work with a small state-space of a fixed size, and the branch
metric is trivial to compute, so is generally recomputed as
needed. These factors simplify the problem considerably, mak-
ing it much easier to optimize a parallel implementation. In
contrast, we consider the problem of efficiently parallelizing a
flexible MAP decoder implementation, where: a) the code size
is variable, b) codes are non-binary, and the implementation
works with variable alphabet size, c) the state space is variable
in size, depending on channel conditions, and can easily run
into hundreds of states for poor channels, and d) the branch
metric computation requires a lattice traversal, and represents
a significant fraction of the overall complexity. Due to these
variables, hard optimizations cannot be done, so our challenge
is to plan things to work well with suitable run-time decisions.
Since various optimization decisions are taken at run-time, this
also allows us to automatically cater for different hardware.
In this paper we present an optimized parallel implementa-
tion of the MAP decoder with the algorithmic improvements
of [11]. Two variants of this algorithm are implemented: a
direct implementation where intermediate metrics are stored
for the decoding of a whole frame, and a reduced-memory
implementation where some intermediate metrics are recal-
culated for the backward pass. This considerably reduces
the memory footprint of the decoder, which is a particularly
ar
X
iv
:1
80
2.
08
48
3v
1 
 [c
s.D
C]
  2
3 F
eb
 20
18
2important consideration for a GPU implementation, at the
cost of decoding speed. In optimizing the implementation,
we consider the use of GPU on-chip memory to reduce
memory transfers and to improve access patterns. We also
introduce a dynamic strategy for choosing how each function is
parallelized, including the number of threads to use, in order to
optimize the efficiency and usage of the GPU compute cores.
The approach presented here can be generalized to other
implementations of the forward-backward algorithm, such as
those used in turbo decoding and in applications of hidden
Markov models. The techniques we propose here are particu-
larly useful in similarly flexible implementations, commonly
found in simulators and other research tools. In addition, we
discuss a number of techniques for handling device limitations
and optimizing the parallelization parameters dynamically;
these may be relevant to other researchers working with
CUDA.
This paper starts with definitions and a summary of earlier
work in Section II, including descriptions of the coding
scheme, channel, and the MAP decoder algorithm. Our contri-
butions start in Section III where we consider the challenges
for an efficient parallel implementation of the MAP decoder.
This is followed with an initial parallel implementation and a
reduced-memory variant in Section IV. In Section V we ensure
that the parallel decoder works within hardware constraints
and improve efficiency when code parameters are less than
ideal. We analyze the practical performance on three GPU
systems in Section VI, followed by final results in Section VII
and conclusions in Section VIII.
II. PRELIMINARIES
A. The Coding Scheme
In this paper we are concerned with the coding scheme of
[9], [11], which we summarize below. The encoding is defined
by the sequence C = (C0, . . . , CN−1), which consists of the
constituent encodings Ci : Fq ↪→ Fn2 for i = 0, . . . , N − 1,
where n, q,N ∈ N, 2n ≥ q, and ↪→ denotes an injective
mapping. For any sequence z, denote arbitrary subsequences
as zba = (za, . . . , zb−1), where z
a
a = () is an empty se-
quence. Given a message DN0 = (D0, . . . , DN−1), each
Ci maps the q-ary message symbol Di ∈ Fq to codeword
Ci(Di) of length n. That is, DN0 is encoded as X
nN
0 =
C0(D0)‖ · · · ‖CN−1(DN−1), where y‖z is the juxtaposition
of y and z.
The above encoding is normally used as an inner code
to correct synchronization errors, serially concatenated with
a conventional outer code to correct residual substitution
errors. In such a construction, the MAP decoder’s posterior
probabilities are used to initialize the outer decoder. Such a
construction can be iteratively decoded by setting the prior
symbol probabilities of the MAP decoder with extrinsic infor-
mation from the previous pass of the outer decoder.
B. Channel Model
As in [9], [11] we consider the Binary Substitution, Inser-
tion, and Deletion (BSID) channel, an abstract random channel
with unbounded synchronization and substitution errors. This
channel was originally presented in [15] and more recently
used in [6]–[8], [16], [17] and others. At time t, one bit enters
the channel, and one of three events may happen: insertion
with probability Pi where a random bit is output; deletion with
probability Pd where the input is discarded; or transmission
with probability Pt = 1 − Pi − Pd. A substitution occurs in
a transmitted bit with probability Ps. After an insertion the
channel remains at time t and is subject to the same events
again, otherwise it proceeds to time t + 1, ready for another
input bit.
We define the drift St at time t as the difference between
the number of received bits and the number of transmitted
bits before the events of time t are considered. As in [6], the
channel can be seen as a Markov process with the state being
the drift St. It is helpful to see the sequence of states as a
trellis diagram, observing that there may be more than one
way to achieve each state transition.
C. The MAP Decoder
We summarize here the optimized MAP decoder of [11],
which we are concerned with parallelizing. The decoder uses
the standard forward-backward algorithm for hidden Markov
models. We assume a message sequence DN0 , encoded to the
sequence Xτ0 , where τ = nN . The sequence X
τ
0 is transmitted
over the BSID channel, resulting in the received sequence Yρ0 ,
where in general ρ is not equal to τ . To avoid ambiguity, we
refer to the message sequence as a message of size N and the
encoded sequence as a frame of size τ . We calculate the APP
Li(D) of having encoded symbol D ∈ Fq in position i for
0 ≤ i < N , given the entire received sequence, using
Li(D) =
1
λN (ρ− τ)
∑
m′,m
σi(m
′,m,D), (1)
where λi(m) = αi(m)βi(m), (2)
σi(m
′,m,D) = αi(m′)γi(m′,m,D)βi+1(m), (3)
and αi(m), βi(m), and γi(m′,m,D) are the forward, back-
ward, and state transition metrics respectively. Note that
strictly, the above metrics depend on Yρ0 , but for brevity we do
not indicate this dependence in the notation. The summation in
(1) is taken over the combination of m′,m, being respectively
the drift before and after the symbol at index i. The forward
and backward metrics are obtained recursively using
αi(m) =
∑
m′,D
αi−1(m′)γi−1(m′,m,D), (4)
and βi(m) =
∑
m′,D
βi+1(m
′)γi(m,m′, D). (5)
Initial conditions are given by α0(m) and βN (m), set as the
prior probabilities of the frame boundaries. Finally, the state
transition metric is defined as
γi(m
′,m,D) = Pr {Di = D}R(Yn(i+1)+mni+m′ | Ci(D)) (6)
where Ci(D) is the n-bit sequence encoding D and R(y˙|x) is
the probability of receiving a sequence y˙ given that x was sent
through the channel (we refer to this as the receiver metric).
The a priori probability Pr {Di = D} is determined by the
3source statistics, which we generally assume to be equiprob-
able so that Pr {Di = D} = 1/q. In iterative decoding, the
prior probabilities are set using extrinsic information from the
previous pass of an outer decoder.
Since the set of all possible states is unbounded for the
channel considered, a practical implementation has to take
sums over a finite subset, chosen so that only the least likely
states are omitted. For a transmitted segment of length T bits
we denote the range of states considered by the upper and
lower limits m+T , m
−
T respectively, for a state space of size
MT = m
+
T − m−T + 1. Therefore, the α and β metrics are
defined over a state space of size Mτ = m+τ −m−τ + 1 while
the γ metric is defined over a state space of size Mτ for the
initial drift m′ and a state space of size Mn = m+n −m−n + 1
for the drift change m−m′. The precise determination of the
size of the state space is considered in [11].
The receiver metric R(y˙|x) is computed using a recursion
over a lattice as follows. The required lattice has n+1 rows and
µ˙+1 columns, where µ˙ is the length of y˙, and n is the length of
x. Each horizontal path represents an insertion with probability
1
2Pi, each vertical path is a deletion with probability Pd, while
each diagonal path is a transmission with probability PtPs
if the corresponding elements from x and y˙ are different or
Pt(1− Ps) if they are the same. Let Fi,j represent the lattice
node in row i, column j. Then the lattice computation in the
general case is defined by the recursion
Fi,j =
1
2
PiFi,j−1 + PdFi−1,j + Q˙(y˙j |xi)Fi−1,j−1, (7)
which is valid for i < n, and where Q˙(y|x) can be directly
computed from y, x and the channel parameters:
Q˙(y|x) =
{
PtPs if y 6= x
Pt(1− Ps) if y = x.
(8)
Initial conditions are given by
Fi,j =
{
1 if i = 0, j = 0
0 if i < 0 or j < 0.
(9)
The last row is computed differently as the channel model
does not allow the last event to be an insertion. In this case,
when i = n, the lattice computation is defined by
Fn,j = PdFn−1,j + Q˙(y˙j |xn)Fn−1,j−1. (10)
Finally, the required receiver metric is obtained from this
computation as R(y˙|x) = Fn,µ˙.
Observe that for a given x, the receiver metric R(y˙|x) needs
to be determined for all subsequences y˙ within the drift limit
considered. Therefore, for a given symbol D and initial drift
m′ in (6), the lattice computation is only done once for the
largest drift change m−m′ that needs to be considered. The
required values for the remaining values of m are then also
available in the last row of the lattice.
Note also that the horizontal distance of a lattice node from
the main diagonal is equivalent to the channel drift for the
corresponding transmitted bit. For the transmitted sequence of
n bits considered, we can take advantage of this by limiting the
lattice computation to paths within a fixed corridor of width
Mn around the main diagonal.
Finally, note that the α and β metrics are normalized as
they are computed to avoid exceeding the limits of floating-
point representation. Specifically, for α the computation (4) is
changed to:
αi(m) =
α′i(m)∑
m′ α
′
i(m
′)
, (11)
where α′i(m) =
∑
m′,D
αi−1(m′)γi−1(m′,m,D). (12)
A similar argument applies for the computation of β. Also,
the receiver metric is computed at single precision1, while the
remaining equations use double precision.
D. CUDA Notation
In this paper we follow the usual CUDA notation, which we
summarize here for convenience. For further detail, the reader
is referred to [10]. CUDA defines a general-purpose parallel
programming model for a hybrid system with a host CPU
and an attached GPU device (or more than one). The device
contains the GPU chip, organized as a number of multiproces-
sors with a fixed number of compute cores each, and off-chip
memory. Each multiprocessor also contains a fixed amount of
on-chip shared memory, accessible by all compute cores in
the multiprocessor. Off-chip memory is accessible by all GPU
threads through global memory variables in read/write mode,
or in read-only mode through constant or texture memory
constructs.
Every function executed on the GPU is called a kernel; this
is run as a grid of equally-shaped blocks of parallel threads, as
specified by the execution configuration. To avoid ambiguity,
we avoid using the term block for any other purpose. Each
block of threads executes on the same multiprocessor in
groups of threads called warps. Threads in a given warp start
execution at the same address but are free to branch an execute
independently (i.e. diverge). However, highest efficiency is
achieved when there is no divergence within a warp. Note that
more than one block may be resident in a given multiprocessor
if sufficient resources (i.e. registers and shared memory) are
available. This increases the number of warps available to the
scheduler, and may be used to hide latency.
III. CHALLENGES FOR PARALLEL IMPLEMENTATION
A. Effect of Changes to Receiver Metric Computation
As compared with our previous GPU implementation, the
optimized decoder summarized in Section II-C changes to the
way the receiver metric is computed. Of particular relevance
to this work, in the optimized decoder a) for a given x, we
simultaneously compute R(y˙|x) for all subsequences of y˙,
and b) we replace the trellis-based forward pass of [7] with
a more efficient corridor-constrained lattice implementation.
Together, these changes result in a decrease in complexity
(for computing the receiver metric) of up to three orders of
magnitude, depending on channel conditions.
1We refer to 32-bit floating point as single precision, and 64-bit floating
point as double precision.
4100 101 102 103
N
100
101
102
Sp
ee
du
p
Frame decoding, (n,q) =(10,32)
p=10−3
p=10−2
(a)
2 4 8 16 32 64 128 256 512
q
100
101
102
Sp
ee
du
p
Frame decoding, N=300
p=10−3
p=10−2
(b)
Figure 1. Decoding speedup of a serial CPU implementation for lattice
corridor batch computation (new) over individual trellis computation (old)
of receiver metric, at different channel conditions p := Pi = Pd;Ps = 0 for
a range of (a) message size N and (b) alphabet size q.
To demonstrate the effect on the overall decoding speed, we
repeat the simulations of [9] using a serial CPU implementa-
tion of the improved decoder. Results comparing the overall
decoding speed, for the same codes and on the same computer,
are shown in Fig. 1. Under moderate channel conditions it
can be seen that the improved decoder is more than ten times
faster over a wide range of message size N and alphabet size
q. Speedup improves for poorer channel conditions.
Such a significant change means that the receiver metric
computation can no longer be considered to dominate the
overall complexity. This is particularly so for a GPU imple-
mentation, where we expect a higher speedup in computing
the γ metric since this has greater data parallelism. Therefore,
in parallelizing this improved algorithm we also have to pay
particular attention to an efficient parallel computation of the
α and β metrics, which have internal dependencies.
B. Effect of Implementation Flexibility
The GPU implementation of this MAP decoder also faces
challenges not considered in similar parallel implementations
for turbo codes [12]–[14]. In our case, the flexibility of the
MAP decoder and the nature of the channel we are concerned
with pose additional difficulties for parallelization, as noted
in Section I. The reference CPU implementation is intended
for simulation applications, and consequently supports a wide
range of code parameters. Specifically, the CPU implementa-
tion supports: a) arbitrary codeword length n ≥ 1, b) channel
symbols from an arbitrary finite alphabet, as long as the
alphabet is defined in an additive Abelian group, c) messages
defined over an arbitrary finite alphabet of size q ≥ 2, as long
as each message symbol is representable by a codeword, i.e.
q ≤ sn for a channel symbol alphabet of size s, d) arbitrary
message length N ≥ 1, e) arbitrary state space limits m−n ≤ 0,
m+n ≥ 0, m−τ ≤ m−n , m+τ ≥ m+n , chosen dynamically based
on channel conditions, f) arbitrary received sequence length
ρ, as long as the drift at the end falls within the chosen state
space limits, i.e. m−τ ≤ ρ−τ ≤ m+τ , g) arbitrary specification
of constituent encodings Ci, h) arbitrary thresholds to avoid
pursuing low-probability trellis paths (not considered in this
study), and i) lazy computation of the γ metric (not considered
in this study; potentially useful with path thresholding).
Additionally, the implementation supports the independent
choice of real number types for the computation of the
forward-backward algorithm and the inner lattice traversal. On
the CPU implementation, supported types include single and
double precision IEEE floating point numbers, as well as GNU
multi-precision numbers [18]. This allows the user to trade off
arithmetic speed and storage requirements for accuracy, and
the comparison of results with an exact implementation. The
implementation also allows the user to choose between differ-
ent algorithms to compute the receiver metric. These include
the constrained-corridor lattice algorithm of Section II-C, an
unconstrained version of the algorithm, and the trellis-based
algorithm used in our earlier work.
All these variables pose a number of particular challenges
for an efficient parallel implementation. First of all, it is
important for the parallel implementation to support the same
range of parameters as the CPU implementation. This can
be a problem when functions are parallelized across one of
these variables, as the required range may exceed constraints
of the parallel architecture. For example, in our earlier GPU
implementation, q was limited by hardware constraints, and
results were given for alphabet sizes up to q = 512. Another
expected problem is that the parallelization efficiency depends
on the code parameters. With so many variables and such
a wide range for each, it becomes particularly problematic
to ensure high parallelization efficiency under all conditions.
Unlike our earlier GPU implementation, in this work we
propose specific steps to improve efficiency under suboptimal
conditions. This variability also has an effect on scheduling
kernel issues across multiple streams, as the time taken by
individual kernels depends on the code parameters. On most
current hardware, the ideal issue order for different kernels
that can execute concurrently depends on their relative timings
[19]. When such decisions had to be taken in this work we
have favoured larger codes, where the speedup is of greater
objective benefit.
A more subtle problem is that functions which have many
distinct execution paths tend to require a higher register count.
In a parallel implementation, this can reduce the upper limit
for parallel threads, as each multiprocessor will have a finite
register bank to share. Further, the compiler may reduce
register usage by ‘spilling’ some values to global memory,
for a very significant increase in latency. This problem can
be alleviated in part by using C++ templates [20], [21].
Each template instance is effectively independent, so that the
corresponding execution path decisions are taken at compile
time. This reduces register requirements, which is of great
5benefit in the parallel implementation; while also useful for
the CPU implementation, the difference in execution speed
is minimal. However, this solution is not without its cost: the
use of templates increases the compiler’s work in proportion to
the number of combinations of template parameters. This can
increase object code size and compilation time considerably.
Further, for a CUDA compilation with current tools, this also
presents a constraint due to the limit on constant memory
use within the same compilation unit. The solution we have
applied to this problem was to divide the instantiation of
the various combinations of template parameters into separate
units. That is, the class methods were specified in a sepa-
rate implementation header, imported by a number of source
files, each of which instantiated an independent subset of all
required template parameter combinations.
Finally, it is worth realizing that the requirements of a
CPU implementation can be at odds with those of a GPU
implementation. For example, the sequence of constituent
encodings C is held in memory as a vector of N constituent
encodings Ci, each of which is a vector of q codewords, where
each codeword is in turn a vector of n channel symbols. This
is ideal for a CPU implementation, allowing each symbol to be
accessed by indirect addressing, and also allowing references
to individual codewords or codebooks to be specified naturally.
On the GPU, however, such an organization is problematic
because it requires Nq independent memory allocations and
memory transfers, each of which is an expensive operation. It
is more efficient to reorganize the table on the CPU, copying
the data into a single flat array and using that structure on the
GPU. This avoids changing the canonical data structure on
the CPU, which would have wider repercussions, and is the
solution we have adopted in this work.
IV. INITIAL PARALLEL IMPLEMENTATION
A. Global Storage
The most straighforward implementation of the MAP de-
coder considers it as consisting of four functions, one for
computing each of the γ, α, β, and L metrics. These depend on
each other, dictating the order of computation: specifically, α
and β depend on γ, while L depends on α, β and γ. Therefore,
γ needs to be computed first, followed by α and β (in any
order, as these are independent of each other), and finally L
is computed. This methodology, which we refer to as global
storage, assumes that there is sufficient memory to store the
α, β and γ metrics.
1) γ metric: Starting with the computation of the γ metric
(6), we have already observed in [9] that the computation is
independent for each of i, D, m′, and m. Furthermore, as
noted in Section II-C, for any given i, D, and m′, we only
need to perform the lattice computation for the largest drift
change m−m′ to be considered. It makes sense, therefore, to
compute and store the γ metric for the whole valid range of
m, for any given i, D, and m′.
To make the best use of the available data parallelism, we
initially use block coordinates (i,m′) for grid size N ×Mτ ,
and threads with coordinate D for block size q. This increases
the parallelism by a factor Mτ with respect to our earlier GPU
implementation, where the γ computation kernel had a grid
size N . This increased parallelism is particularly useful when
global storage is not possible, as we shall see in Section IV-B.
Each thread performs an independent lattice computation and
determines the γ metric for the whole valid range of m (i.e.
Mn entries).
As in our earlier GPU implementation, we store the four-
dimensional γ matrix as a flat array in global memory.
However, we change the indexing order, so as to have index
D innermost. This allows the threads in a warp to access a
contiguous range of memory; this access is fully coalesced
as long as the initial address and the transaction size are a
multiple of 128 bytes. Since q is always a power of two, this
is guaranteed for any q ≥ 16 with double precision storage.
This effect is particularly valuable since there is no warp
divergence, as the lattice traversal is identical for all D. This
index is followed by m − m′, so that consecutive accesses
by the same thread are as close to each other as possible,
maximizing cache re-use for small q.
In order to minimize global memory access and avoid
register spilling into local memory, each thread holds the
current lattice row being computed in shared memory. This
requires an array of size n+m+n +1 single precision numbers
per thread, dynamically allocated on kernel launch. The lattice
computation algorithm is re-written accordingly.
2) α metric: The α metric computation can be divided into
two main operations: the main computation (12), followed
by normalization (11). We have already observed in [9] that
the main computation at index i and state m depends on
normalization at index i − 1 for all m′, while normalization
at index i for state m depends on computation at index i for
all m′. Also, for any given i, computation and normalization
are independent for different values of m.
Parallelization strategy is the same as our earlier GPU
implementation, where a separate kernel call is required for
the main computation at each i, followed by a separate kernel
to perform normalization at that i. This is necessary because
the only way to synchronize across a grid is the completion
of a kernel call [10].
The main computation at i uses block coordinate m for a
grid size of Mτ and threads with coordinate D for block size
q, where each thread computes the corresponding partial sum-
mation over m′. The final result is then computed from these
partial sums using a parallel summation across the threads
in the block, using shared memory to communicate between
threads. This requires a shared memory array of q double
precision values per block. Normalization requires two steps:
computing the sum of all α′i, and dividing each α
′
i by this
sum. Both are most easily parallelized across a single block
of Mτ threads. This uses only one multiprocessor but greatly
facilitates implementation in a function which corresponds to
a very small proportion of the overall computation time.
In contrast with our earlier GPU implementation, we extract
the initialization of the α metric at i = 0 to a separate kernel.
Since initialization consists simply of setting each of Mτ
values, this is implemented as a single block of Mτ threads.
While this change may not seem very significant, it slightly
simplifies the main computation, keeping register usage to a
6minimum. In turn, this allows us to maximize occupancy by
allowing more resident kernels.
The α metric is stored as a two-dimensional array in global
memory, with the state index m innermost. This speeds up
access in the alpha metric computation kernel, where each
thread needs to read the metric values at all states for the
previous index i − 1. The speedup is achieved by copying
the row at index i − 1 to shared memory, requiring a shared
memory array of Mτ double precision values per block. This
memory copy can be done in parallel across the block, so
that global memory access is coalesced. The use of a two-
dimensional array ensures this by correctly aligning each row.
3) β metric: A similar argument applies to the computation
of β, except that now the main computation at index i depends
on normalization at index i + 1. Further, α and β can be
computed concurrently as there is no data dependency between
them; this can be achieved using streams on devices that
support concurrent kernel execution.
Recall that the computation of α and β requires a number of
consecutive kernel calls each. Specifically, for α, this sequence
consists of the intialization kernel, normalization at i = 0,
computation at i = 1, normalization at i = 1, and repeating
computation and normalization for increasing i until i = N .
Since each kernel depends on the completion of the preceding
one, this dependency is best expressed by issuing the kernels
in the same stream. For β, the sequence is the same but the
index order is reversed (i.e. initialization and normalization for
i = N , followed by computation and normalization pairs for
decreasing i from i = N − 1 to i = 0). The independence of
the β kernels from the α kernels is expressed by issuing these
in a second stream.
Unfortunately, hardware limitations in Fermi and initial
Kepler devices2 cause additional complication. Since these
devices have only one compute engine queue, if any stream
has more than one kernel scheduled consecutively, the issuer
will stall until the last kernel in the sequence is dispatched
[19]. Since the kernels for α and β at each index have the
same complexity, we avoid this problem by using a breadth-
first launch order, as follows. Issue first the initialization of
αi=0 in stream one and of βi=N in stream two, followed by
the normalization of αi=0 in stream one and of βi=N in stream
two. This is followed by the computation of αi=1 in stream one
and of βi=N−1 in stream two, and the normalization of αi=1
in stream one and of βi=N−1 in stream two. This is repeated,
incrementing i for α and decrementing for β. Concurrent
execution improves device utilization when the grid size for
a single kernel call is small. Note that this problem does not
exist in the latest Kepler architecture (GK110), which has 32
compute engine queues [22].
4) L metric: Finally, the L computation (1) is independent
across i, D. As in our earlier GPU implementation, we
parallelize this across blocks with index i for a grid size N
and threads with index D for block size q. In this work,
however, we also avoid multiple global memory reads and
ensure coalesced memory access by first copying the required
rows at αi and βi+1 to shared memory. This requires two
2GK104 architecture, for compute capability 3.0.
10−4 10−3 10−2 10−1 100
Channel Error Event Probability (Pd =Pi ; Ps =0)
105
106
107
108
109
1010
1011
1012
M
em
or
y 
(b
yt
es
)
Global storage
Metric storage required, N=210
(2,2)
(10,32)
(18,512)
(20,1024)
2 GiB
1 GiB
Figure 2. Metric storage requirements for a moderate message size N =
210 and half-rate codewords (n, q), over a range of channel conditions. It is
assumed that all metrics are stored as double precision values.
shared memory arrays of Mτ double precision values per
block. Each γ value is only read once, and this is done in
an order that ensures coalesced access.
B. Local Storage
Unfortunately, global storage is only possible when there is
sufficient memory to store the α, β and γ metrics. The required
storage capacity increases with increasing N , n, q, and poorer
channel conditions (since this increases the required state
space). To illustrate this, consider a system with a moderate
message size N = 210 and a range of alphabet sizes q; we
plot the required metric storage memory in Fig. 2 over a range
of channel conditions. For ease of comparison, horizontal lines
are included at values of 1 GiB and 2 GiB, corresponding to
common per-CPU core or per-GPU device limits for metric
storage. It can be readily seen that these limits are reached at
moderate to low channel error rates for larger alphabet sizes,
and also at high channel error rates for moderate alphabet
sizes.
This problem can be resolved by dividing the computation
of γi across i and observing that each of αi, βi and Li depend
only on a single index for γ. Specifically, a) αi depends on
αi−1 and γi−1, b) βi depends on βi+1 and γi, and c) Li
depends on αi, βi+1 and γi. Since the order of computation
of αi and βi is enforced by their internal dependencies, each
γi has to be computed at least twice. We avoid computing it a
third time by first completing the α metric computation, then
combining the computation of L with that of β. This is shown
graphically in Fig. 3.
This mode requires us to divide the γ and L kernels to com-
pute values at a single index i. Therefore the γ compute kernel
now uses block coordinate m′ for grid size Mτ . Similarly, the
L compute kernel now only uses a single block. In both cases,
i is passed as a kernel parameter. This significantly reduces
device utilization, particularly under good channel conditions
where Mτ is relatively small.
We can mitigate this problem using concurrent execution
by issuing multiple kernels, as follows. Consider first the
computation of the α metric, requiring kernel calls as shown in
Fig. 3a. In this figure, each row corresponds to the computation
of a particular row of the α metric, with index i. Each kernel
within a given row depends on the previous kernel, while the
7Init.
αi=0
Norm.
αi=0
Norm.
αi=1
Compute
γi=0
Compute
αi=1
Norm.
αi=2
Compute
γi=1
Compute
αi=2
Norm.
αi=N
Compute
γi=N-1
Compute
αi=N
(a)
Init.
βi=N
Norm.
βi=N
Norm.
βi=N-1
Compute
γi=N-1
Compute
βi=N-1
Norm.
βi=N-2
Compute
γi=N-2
Compute
βi=N-2
Norm.
βi=0
Compute
γi=0
Compute
βi=0
Compute
Li=N-1
Compute
Li=N-2
Compute
Li=0
(b)
Figure 3. Graphical representation of the interdependencies and the sequence
of kernel calls required to compute (a) the α metric, and (b) the β and L
metrics in local storage mode.
compute kernel also depends on the normalization kernel of
the previous row. The former dependency can be expressed by
issuing the three kernels of a given row in the same stream; the
latter dependency can be expressed using CUDA events [19].
Such an arrangement allows compute kernels for the γ metric
to execute concurrently with other kernels from previous rows,
increasing device usage.
In this way, kernel issues could be divided over N + 1
different streams. Observe, however, that independent storage
space is required for the γ metric at each index that may be
computed independently. This makes it impractical to use a
separate stream for each possible index, in which case the
amount of space required is the same as for global mode.
Instead, we limit the number of streams to a depth of four,
used on a rotating basis. This limit was chosen empirically;
others have already observed that it is difficult to have more
than four kernels running concurrently [19].
The use of streams raises the question of how best to order
kernel issues, as considered for the concurrent computation of
α and β in Section IV-A. In this case, the problem is somewhat
more complex, as different streams may be executing kernels
Table I
SUMMARY OF THE KERNELS REQUIRED FOR THE INITIAL PARALLEL
IMPLEMENTATION, FOR GLOBAL AND LOCAL STORAGE.
Kernel Storage Mode Grid size Block size Calls
Compute γ
global N ×Mτ q 1
local 1×Mτ q N
Initialize α,β both 1 Mτ 1
Compute α,β both Mτ q N
Normalize α,β both 1 Mτ N + 1
Compute L
global N q 1
local 1 q N
Compute ΦT both 1 Mτ 2
of different size and duration. Additionally, the use of events to
synchronize between streams requires that the event on stream
i is recorded before the wait on that event is issued on stream
i+1. Now for the event to be recorded, all prior kernels on that
stream need to be issued (though not necessarily executed).
Effectively, this means that for stream i we need to issue the
following sequence consecutively: wait for event completion
on stream i−1→ compute αi → normalize αi → record event
on stream i. All this has to be issued before we can repeat the
issue sequence for the next stream. This requirement forbids
us from issuing all kernels across the four streams in breadth-
first order. Instead, we issue the γ computation kernels in
breadth-first, then issue the above sequences for each stream in
the order required. This maximizes concurrency of execution
for the γ computation kernels, but limits concurrency of the
remaining kernels to the combination of computation and
normalization of α (this applies to Fermi and first-generation
Kepler devices).
A similar argument applies for the computation of the β
and L metrics, requiring kernel calls as shown in Fig. 3b.
Other than the order of index traversal, the only difference
in this case is the presence of the computation kernel for
the L metric. The simplest way to deal with this is to issue
this kernel after the normalization kernel in the same row.
This automatically satisfies its dependencies, which are the
same as for the compute kernel for the β metric in the same
row. It also allows the β computation and normalization to
occur first, minimizing delays to subsequent rows. The L
computation kernel has a significant duration but only has
one block; this inefficiency is easily hidden by concurrent
kernels from subsequent rows. In this case, we again issue the
γ computation kernels breadth-first, then issue the remaining
sequence for each stream in the order required. We include the
L computation in this sequence rather than issuing it separately
in breadth-first order. This allows that kernel to be concurrent
with the β kernels rather than only with itself, making more
efficient use of the available multiprocessors.
A summary of the kernels required for the initial parallel
implementation described so far, for global and local storage,
is given in Table I. It can be seen that the block size is equal
to q or Mτ .
810−4 10−3 10−2 10−1 100
101
102
103
104
105
106
M
em
or
y 
(b
yt
es
)
γ kernel
Shared memory required, N=210
(2,2)
(10,32)
(18,512)
(20,1024)
48 KiB
24 KiB
10−4 10−3 10−2 10−1 100
102
103
104
105
M
em
or
y 
(b
yt
es
)
α,β kernels
10−4 10−3 10−2 10−1 100
Channel Error Event Probability (Pd =Pi ; Ps =0)
102
103
104
105
M
em
or
y 
(b
yt
es
)
L kernel
Figure 4. Shared memory requirements per block for the γ, α, β, and L
computation kernels, assuming a block size of q, for a moderate message size
N = 210 and half-rate codewords (n, q), over a range of channel conditions.
It is assumed that internal metrics for the γ kernel are stored as single precision
values, while those for the other kernels are stored as double precision values.
V. ADVANCED CONSIDERATIONS
A. Handling Resource Limits
The initial parallel implementation described previously
assumes that it is possible to issue kernels with the given block
sizes: q for the γ, α, β, and L computation kernels, and Mτ
for the remaining kernels. However, this may not be possible
due to a number of constraints: a) the device limit on the
maximum number of threads per block, b) register pressure,
which depends on both the number of registers needed per
thread and the number of registers available per multiproces-
sor, and c) shared memory pressure, which depends on the
requirements for a given block size and the available shared
memory per multiprocessor. Limits for a device depend on its
compute capability, and can be queried by the implementation
at run-time [10].
To illustrate the shared memory requirements, consider
again a system with N = 210 and a range of alphabet sizes
q; we plot the required shared memory per block in Fig. 4
over a range of channel conditions. For ease of comparison,
horizontal lines are included at values of 48 KiB and 24 KiB.
The former corresponds to the maximum amount of shared
memory per multiprocessor on Fermi and Kepler devices;
this is the limit for a kernel launch to succeed. The latter
corresponds to half this value, allowing two resident blocks
per multiprocessor. Observe how the requirements for the γ
computation kernel exceed device limits for larger q. Other
kernels only approach device limits when channel conditions
are poor; limits are exceeded if N is sufficiently large.
A kernel launch that violates any of these limits will fail
at run-time. For the four computation kernels, we solve this
problem by dynamically determining a suitable block size at
run-time, and adapting the kernel implementation to work with
a block size that is not equal to q. We choose a block size
equal to the smaller of q or the largest allowed multiple of the
warp size, taking into account device limits, register pressure
and the shared memory required for a given thread count. We
denote this block size by Bγx for the γ computation kernel,
Bαβx for the α and β computation kernels, and B
L
x for the L
computation kernel. For the γ, α, and β computation kernels,
if the block size is less than q, each thread loops through
values of D equal to its thread index plus multiples of the
block size. For the L computation kernel we adopt a different
approach: for BLx < q, we extend the grid size by a factor
GLy =
⌈
q/BLx
⌉
, dividing the computation of the range of q
across multiple blocks. This has the advantage of keeping the
same parallelization in a kernel which normally has a small
grid size. In both cases, unless q is an exact multiple of the
block size chosen, some threads will have no work to do. In
the common case where q is a power of two, any large q is
also a multiple of the warp size (32 for current hardware); in
this case there will be no warp divergence, so any effect on
performance should be limited to the loss in latency hiding.
B. Improving Occupancy
At the other end of the scale, we had suggested in [9]
that one may improve performance for small q (and large N )
by computing multiple indexes in a single block. This can
potentially improve multiprocessor occupancy, and therefore
reduce latency in memory-limited kernels. It is worth noting
here that occupancy does not depend only on the block size,
but also on the number of resident blocks per multiprocessor
[23]. In turn, this depends on register pressure and shared
memory pressure. Therefore, it is advantageous to increase
the block size as long as this does not increase register and
shared memory requirements proportionally.
Consider first the γ metric computation. A suitable strategy
for increasing block size is to aggregate the work done in
multiple blocks; rather than aggregating multiple indexes i,
however, we aggregate multiple states m′. This has the advan-
tage of allowing the aggregation to happen in both global and
local storage methodologies. At the limit, one could attempt
to use thread coordinates (D,m′) for block size q × Mτ .
This would require block coordinate i for grid size N in
global storage; there would only be one block in local storage.
However, such a block size is very likely to exceed device
limits due to resource constraints. Instead, for small q, we
use a block size q × Bγy , where Bγy is the smaller of Mτ or
the largest allowed multiple of the warp size. In determining
Bγy , we take into account device limits, register pressure and
the shared memory required for a given thread count. This
results in a grid size N × Gγy in the case of global storage
or 1×Gγy for local storage, where Gγy =
⌈
Mτ/B
γ
y
⌉
. We also
limit Bγy so that the resulting grid size is not smaller than
the number of streaming multiprocessors on the device, NSM.
Specifically, the constraint is NGγy ≥ NSM for global storage,
9100 101 102 103
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
CPU
Frame Decoding, global γ, (n,q) =(10,32), p=10−3
γ
α+β
L
100 101 102 103
N
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
C2075
Figure 5. Time spent in each of the four metric computations as a proportion
of the total time required to decode a frame, for global storage and a moderate
alphabet size q = 32, over a range of message sizes N . Channel conditions
are given by p := Pi = Pd = 10−3;Ps = 0.
and Gγy ≥ 14NSM for local storage (where four such kernels are
issued concurrently, potentially allowing greater occupancy).
This ensures that we do not trade off parallel execution for an
increased occupancy, since the former usually has a greater
impact on speed.
A similar argument applies for the α and β computation
kernels. For small q we use a block size q × Bαβy , where
Bαβy is the smaller of Mτ or the largest allowed multiple of
the warp size. This results in a grid size Gαβx =
⌈
Mτ/B
αβ
y
⌉
.
In addition to considerations for device limits, we limit Bαβy
so that Gαβx ≥ 12NSM for global storage (where α and β
kernels are computed concurrently), and Gαβx ≥ NSM for local
storage.
We do not apply the same technique to the computation
of L since the proportion of time spent in this kernel is not
very significant on the GPU implementation. A summary of
the kernels required for the complete parallel implementation,
including the advanced considerations described above, for
global and local storage, is given in Table II. We also list in
the table the complexity of computations for a single thread.
VI. PERFORMANCE ANALYSIS
We consider the GPU performance of the above imple-
mentation on two Fermi devices and one Kepler device. The
GTX 480 and GTX 680 are the highest-performing single-
GPU devices in the consumer-oriented GeForce range for the
Fermi and Kepler architectures respectively. The C2075 is
the highest-performing processor in the computation-oriented
Tesla range for the Fermi architecture. CPU performance is
considered for a serial reference implementation, on the same
system as in [9]. Hardware specifications are summarized in
Table III.
A. Division of Computation Time
We consider first the time spent in each of the four metric
computations as a proportion of the total time required to
decode a frame. For global storage, this is shown in Fig. 5
for a moderate alphabet size q = 32 over a range of message
sizes N , for the CPU and the C2075 device. Note that we show
a combined timing for the computation of the α and β metrics
4 16 64 256 1024
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
CPU
Frame Decoding, global γ, N=210, p=10−3
γ
α+β
L
4 16 64 256 1024
q
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
C2075
Figure 6. Time spent in each of the four metric computations as a proportion
of the total time required to decode a frame, for global storage and a moderate
message size N = 210, over a range of alphabet sizes q. Channel conditions
are given by p := Pi = Pd = 10−3;Ps = 0.
as these are computed concurrently on the GPU. On the CPU
the division of computation time is almost constant, with the
γ metric taking over 75% of the time for moderate to large
N . On the GPU we observe a number of distinct differences:
a) the computation of the γ metric takes a substantially smaller
proportion of time, b) the α and β metrics seem to make up
for most of the difference for moderate to large N , and c) for
smaller N a substantial proportion of time is unaccounted for.
These differences may be explained as follows. Computation
of γ is more easily parallelized than α and β, which also suffer
from greater kernel call overhead. This makes the computation
of γ considerably more efficient than α and β on the GPU. For
smaller block sizes, the overhead in setting up and transferring
data to and from the GPU is also a more significant contributor.
Note that this timing depends on the processor, mainboard, and
memory speeds of the system where the GPU is fitted. Results
for the other two GPU devices show a similar trend.
It is also instructive to repeat this experiment for a moderate
message size N = 210 over a range of alphabet sizes q.
Results for this are shown in Fig. 6. In this case we can see that
even on the CPU the time required to compute γ takes a more
significant proportion of time as q increases. The time required
by α, β, and L decreases proportionally. This is expected, as
while the computation of α, β, and L scales with N , Mτ , Mn,
and q, the computation of γ also scales with n. For the codes
considered in this experiment, n = 2 log2 q to keep the same
code rate. On the GPU, the change is more pronounced, with
the γ computation again becoming dominant for large q. The
main reason for this is that as q increases, the computation of
α and β becomes considerably more efficient, each kernel call
has more computation to perform, and call overhead becomes
less significant. Again, results for the other two GPU devices
show a similar trend.
B. Computation Efficiency
We consider next the efficiency of computing the γ, α,
and β metrics for the different architectures. For each metric,
this can be visualized by plotting the time taken to compute
the metric, normalized by the expected complexity of com-
putation. This is shown in Fig. 7 for global storage and a
moderate alphabet size q = 32 over a range of message sizes
10
Table II
SUMMARY OF THE KERNELS REQUIRED FOR THE COMPLETE PARALLEL IMPLEMENTATION, INCLUDING CONSIDERATIONS FOR RESOURCE LIMITS AND
MULTIPROCESSOR OCCUPANCY, FOR GLOBAL AND LOCAL STORAGE.
Kernel Storage Mode Grid size Block size Calls Shared Memorya Thread Complexity
Compute γ
global N ×Gγy Bγx ×Bγy 1
(n+m+n + 1) ·Bγx ·Bγy O(nMn − m
−
n (m
−
n−1)
2
)local 1×Gγy Bγx ×Bγy N
Initialize α,β both 1 Mτ 1 O(Mτ )
Compute α,β both Gαβx B
αβ
x ×Bαβy N (q +Mτ ) ·Bαβy O(Mn)
Normalize α,β both 1 Mτ N + 1 O(Mτ )
Compute L
global N ×GLy BLx 1 2Mτ O(MτMn)local 1×GLy BLx N
Compute ΦT both 1 Mτ 2 O(Mτ )
aShared memory is expressed here as the unit size of an array of real numbers; for the γ computation kernel these are single precision values, while for
all other kernels these are double precision values.
Table III
HARDWARE SPECIFICATIONS FOR CPU AND GPU SYSTEMS.
Opteron 2431 GTX 480 C2075 GTX 680
Processors × cores 2× 6 15 × 32 14× 32 8 × 192
Core Speed (GHz) 2.412 1.401 1.147 1.0585
Memory (MiB) 32 768 1 536 5 376 2 048
GFLOPs (float) 2.412 (scalar) 672.48 513.856 1 625.856
GFLOPs (double) 2.412 (scalar) 336.24 256.928 67.744
100 101 102 103
10−10
10−9
10−8
10−7
N
or
m
al
iz
ed
 T
im
e 
(S
ec
on
ds
)
γ-computation
global γ, (n,q) =(10,32), p=10−3
CPU
C2075
GTX480
GTX680
100 101 102 103
N
10−9
10−8
10−7
N
or
m
al
iz
ed
 T
im
e 
(S
ec
on
ds
)
α+β-computation
Figure 7. Time to compute γ, normalized by a factor NqMτ (nMn −
m−n (m−n − 1)/2), and α and β, normalized by a factor NqMτMn, for
global storage and a moderate alphabet size q = 32, over a range of message
sizes N . Channel conditions are given by p := Pi = Pd = 10−3;Ps = 0.
N . Starting with the γ metric, observe that the normalized
time is constant for the CPU; this is expected, and indicates
that the expected complexity is an accurate estimate of actual
complexity. This is most likely due to the tight optimization of
the implementation. For the GPU implementation we can see
that maximum efficiency is reached quickly, slighly beyond
N = 10 on all devices. Efficiency fluctuates beyond this
point, though is generally better for larger N . Comparing with
the equivalent plot in [9, Fig. 1], observe that in this work
maximum efficiency is reached much earlier3. This is due to
3For the GTX 480 this was achieved around N = 100 in [9].
the increased grid size (c.f. Section IV-A1) and the use of
block aggregation to increase occupancy (c.f. Section V-B).
Note that the normalized time units in this paper and those in
[9] cannot be compared directly as the complexity is obtained
for different algorithms.
For the α and β metrics, we can see in Fig. 7 that CPU effi-
ciency is almost constant, decreasing slightly as N increases.
While unexpected, this is not surprising, as the complexity
expression considers only the floating-point arithmetic and
ignores overheads of loop handling and memory access. On
the other hand, GPU efficiency continues to improve as N
increases. This is most likely because an increase in N causes
an increase in Mτ under the same channel conditions; this
increases the computation done in each kernel call, reducing
the significance of kernel call overhead. Furthermore, the effi-
ciency of each kernel call also improves due to the increased
grid size and the use of block aggregation.
The experiment is repeated for a moderate message size
N = 210, over a range of alphabet sizes q. Results are shown
in Fig. 8. In this case, observe that the efficiency of computing
the γ metric remains approximately constant throughout the
range, with a decrease in efficiency only for the smallest
alphabet sizes. We can achieve good efficiency throughout the
range due to the use of block aggregation, which increases
occupancy and allows good device utilitization even for small
q. For the α and β metrics, again efficiency continues to im-
prove as q increases; furthermore, this improvement happens
more quickly than when N is increased (c.f. Fig. 7). This is
because an increase in q has two effects: it directly increases
the block size and indirectly increases the grid size (since Mτ
depends on n, which depends on q if the code rate is fixed).
A similar experiment with larger N shifts the curves to the
left, so that maximum efficiency is reached at a lower q.
C. Multiprocessor Occupancy and Usage
In order to understand the effect of changes in code pa-
rameters on computation efficiency, it is instructive to look
at metrics that directly measure the effectiveness of paral-
lelization. At multiprocessor level, it is important to keep the
hardware busy by having a sufficiently high thread count per
block. Since instructions are issued at warp level, the block
size is ideally a multiple of the warp size, ensuring that each
thread in each warp is doing something useful. Furthermore,
11
2 4 8 16 32 64 128 256 512 1024
10−11
10−10
10−9
10−8
10−7
N
or
m
al
iz
ed
 T
im
e 
(S
ec
on
ds
)
γ-computation
global γ, N=210, p=10−3
CPU
C2075
GTX480
GTX680
2 4 8 16 32 64 128 256 512 1024
q
10−10
10−9
10−8
10−7
N
or
m
al
iz
ed
 T
im
e 
(S
ec
on
ds
)
α+β-computation
Figure 8. Time to compute γ, normalized by a factor NqMτ (nMn −
m−n (m−n−1)/2), and α and β, normalized by a factor NqMτMn, for global
storage and a moderate message size N = 210, over a range of alphabet sizes
q. Channel conditions are given by p := Pi = Pd = 10−3;Ps = 0.
the hardware hides latencies in instruction issue and memory
access by executing warps that are ready. It is therefore useful
to maximize the number of active warps per multiprocessor,
which can be achieved by increasing the thread count per
block or by ensuring that multiple blocks can be resident. The
latter is often limited by register pressure, so that the former is
often the most practical approach. Occupancy, the proportion
of active warps to the maximum supported by the hardware,
is a useful measure to determine how well latency is hidden.
At device level, effectiveness of parallelization depends on
how many of the available multiprocessors are executing a
kernel. We refer to this proportion as the device usage. Clearly,
if only a single kernel is being executed, usage depends on
the grid size, which is ideally a multiple of the number
of multiprocessors NSM. However, this presents a trade-off
between the grid size and block size. Usage can also be
improved by running multiple kernels concurrently. A good
strategy is to maximize block size, but only so far as to allow
a grid size which keeps all multiprocessors busy.
For this analysis, we limit our results to the GTX 480 and
GTX 680 devices. Multiprocessor occupancy and usage for the
C2075 are similar to those of the GTX 480, as the two devices
have multiprocessors of the same architecture and only differ
in the number of multiprocessors available (by one).
1) Global Storage: Starting first with global storage, we
determine occupancy for the γ, α, β, and L computation
kernels, based on the actual block size used and assuming no
hardware overhead4. This is shown in Fig. 9 for a moderate
alphabet size q = 32 over a range of message sizes N .
We annotate in color the limiting factor that stops us from
achieving higher occupancy. For each of the computation
4Note that this is the theoretically achievable occupancy with the chosen
parameters; when measured with the profiler the true value will usually be
marginally lower.
100 101 102 103
0
10
20
30
40
50
Pr
op
or
tio
n 
(%
)
GTX480
Occupancy, global γ, (n,q) =(10,32), p=10−3
γ, occup.
α,β, occup.
L, occup.
100 101 102 103
N
0
10
20
30
40
Pr
op
or
tio
n 
(%
)
GTX680
Grid size
Block size
Registers
Smem
Figure 9. Multiprocessor occupancy for the γ, α, β, and L computation
kernels for global storage and a moderate alphabet size q = 32, over a
range of message sizes N . Channel conditions are given by p := Pi = Pd =
10−3;Ps = 0.
100 101 102 103
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
GTX480
Device Usage, global γ, (n,q) =(10,32), p=10−3
γ, usage
α,β, usage
L, usage
100 101 102 103
N
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
GTX680
Figure 10. Device usage for the γ, α, β, and L computation kernels for global
storage and a moderate alphabet size q = 32, over a range of message sizes
N . Channel conditions are given by p := Pi = Pd = 10−3;Ps = 0.
kernels, the nominal block size (equal to q) is the same as
the warp size. If this was the chosen block size, occupancy
would be limited by the number of resident blocks. For Fermi
devices this would be at most eight, for a maximum occupancy
of 16%. Increasing the block size as explained in Section V-B,
however, allows us to achieve an occupancy of 40% for the γ
kernel on Fermi (and very close to that on Kepler). Observe
that occupancy is limited at the higher end by the shared
memory requirement for the γ kernel. For the α and β kernels
occupancy is limited by the grid size; this depends on Mτ and
there is nothing we can do about it. Comparing these results
with Fig. 7, it is worth realizing that peak efficiency for γ is
reached at the same point when peak occupancy is reached.
Similarly, efficiency for the α and β kernels increases with
the kernel occupancy as N is increased.
The corresponding device usage for each kernel is shown
in Fig. 10. As expected, device usage is optimal for the γ
and L kernels whenever N is a multiple of NSM for the
12
4 16 64 256 1024
0
10
20
30
40
50
60
70
80
Pr
op
or
tio
n 
(%
)
GTX480
Occupancy, global γ, N=210, p=10−3
γ, occup.
α,β, occup.
L, occup.
4 16 64 256 1024
q
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
GTX680
Grid size
Block size
Registers
Smem
Figure 11. Multiprocessor occupancy for the γ, α, β, and L computation
kernels for global storage and a moderate message size N = 210, over a
range of alphabet sizes q. Channel conditions are given by p := Pi = Pd =
10−3;Ps = 0.
4 16 64 256 1024
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
GTX480
Device Usage, global γ, N=210, p=10−3
γ, usage
α,β, usage
L, usage
4 16 64 256 1024
q
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
GTX680
Figure 12. Device usage for the γ, α, β, and L computation kernels for global
storage and a moderate message size N = 210, over a range of alphabet sizes
q. Channel conditions are given by p := Pi = Pd = 10−3;Ps = 0.
given device. Note that for the α and β kernels, device
usage peaks at 50% as we know that these kernels will be
running concurrently. This allows us to reach peak usage
earlier for these kernels. Comparing these results with Fig. 7,
observe that the fluctuations in device usage for γ explain the
corresponding fluctuations in efficiency.
The experiment is repeated for a moderate message size
N = 210, over a range of alphabet sizes q. Multiprocessor
occupancy and device usage are shown respectively in Fig. 11–
12. Consider first the Fermi device. For the γ kernel, observe
that occupancy remains fairly constant, limited at the lower
end by register pressure, and otherwise by shared memory
requirements. Together with the constantly high device usage
for this kernel, this explains the flat efficiency in Fig. 8. For
the α and β kernels, however, occupancy increases until it
reaches a peak limited by register pressure. This is consistent
with point where maximum efficiency is reached in Fig. 8.
A few points are worth noting with respect to the Kepler
4 16 64 256 1024
0
10
20
30
40
50
60
70
80
Pr
op
or
tio
n 
(%
)
GTX480
Occupancy, local γ, N=210, p=10−3
γ, occup.
α,β, occup.
L, occup.
4 16 64 256 1024
q
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
GTX680
Grid size
Block size
Registers
Smem
Figure 13. Multiprocessor occupancy for the γ, α, β, and L computation
kernels for local storage and a moderate message size N = 210, over a
range of alphabet sizes q. Channel conditions are given by p := Pi = Pd =
10−3;Ps = 0.
4 16 64 256 1024
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
GTX480
Device Usage, local γ, N=210, p=10−3
γ, usage
α,β, usage
L, usage
4 16 64 256 1024
q
0
20
40
60
80
100
Pr
op
or
tio
n 
(%
)
GTX680
Figure 14. Device usage for the γ, α, β, and L computation kernels for local
storage and a moderate message size N = 210, over a range of alphabet
sizes q. Channel conditions are given by p := Pi = Pd = 10−3;Ps = 0.
device. This has a higher peak single-precision performance,
so one would expect faster performance for the γ kernel.
However, it is much harder to reach the necessary efficiency,
for two reasons: a) each multiprocessor has six times as many
single-precision cores, so requires more resident warps to hide
latency, and b) the available shared memory per multiprocessor
remains the same, limiting the occupancy that can be reached.
2) Local Storage: We repeat the analysis for local storage
with a moderate message size N = 210, over a range of
alphabet sizes q. Multiprocessor occupancy and device usage
are shown respectively in Fig. 13–14. For local storage, each γ
kernel computes the metric for a single index i; this reduces the
grid size by a factor of N in comparison to global storage. This
places a considerable constraint on the achievable occupancy,
especially if we want to avoid a drop in multiprocessor usage.
For this reason, peak occupancy is only reached for larger q.
There is no direct change to the α and β kernels with local
storage, except that now we do not run these concurrently.
13
Therefore, we now try to keep all multiprocessors busy with
these kernels (rather than only half). This limits the block size
by a factor of two, so that we now reach the same occupancy
when q is doubled.
A similar issue occurs with the L metric, which is also
computed for a single index i, reducing the grid size by a
factor of N . In this case we make no attempt to increase the
grid size, relying instead on the concurrent execution of this
kernel with the γ and β computation kernels. Given the above
observations, we expect the local storage implementation to
be less efficient than global storage, except for large q.
VII. RESULTS
In this section we compare the overall performance of
the proposed implementation on all three GPU systems of
Section VI with the CPU implementation of the same al-
gorithm and with the earlier GPU implementation of [9].
Following this, we consider the limitations imposed by our
implementation and the overall performance achieved under a
range of conditions.
A. Overall Speedup
We first consider the speedup for complete frame decoding
on the different GPU devices as compared with the CPU
implementation using the same storage methodology. This is
shown in Fig. 15 for both global and local storage. When we
set a moderate alphabet size q = 32, the speedup improves
as the message size N is increased. Observe that for small
N , any improvement is very poor; indeed for the smallest
values of N , the CPU implementation is faster than the GPU
implementations. This is due to the increased impact of GPU
overhead when N is small, as we have seen in Section VI-A.
For this value of q, observe also that the speedup for global
storage is better than for local storage, for moderate to large
N . This is due to the decreased occupancy and multiprocessor
usage for small q and low channel error rate, which affects
local storage computation in a more significant way (c.f.
Section VI-C).
Setting a moderate message size N = 210, we find that
speedup improves as the alphabet sizes q is increased. Observe
that for small q, the speedup achieved for local storage is
slightly less than that for global storage. For larger q, the
speedup achieved for local storage is practically the same as
that for global storage. This indicates the increased efficiency
of local storage computation for larger q, as expected.
B. Speedup Over Previous Parallel Implementation
Next, we repeat the simulations of [9] using the improved
GPU implementation. Results comparing the overall decoding
speed, for the same codes and on the same GTX 480 system,
are shown in Fig. 16. The speedup obtained for the improved
implementation compares favorably with the expected speedup
due to algorithmic changes, shown in Fig. 1. For larger
values of N and q this speedup approaches or exceeds the
speedup achieved on the CPU implementation, indicating that
the improved GPU implementation achieves at least the same
100 101 102 103
N
10−1
100
101
102
Sp
ee
du
p 
(G
PU
 to
 C
PU
)
Frame decoding, (n,q) =(10,32), p=10−3
C2075, global
GTX480, global
GTX680, global
C2075, local
GTX480, local
GTX680, local
(a)
2 4 8 16 32 64 128 256 512 1024
q
10−1
100
101
102
103
Sp
ee
du
p 
(G
PU
 to
 C
PU
)
Frame decoding, N=210, p=10−3
C2075, global
GTX480, global
GTX680, global
C2075, local
GTX480, local
GTX680, local
(b)
Figure 15. Ratio of complete frame decoding timings on CPU as compared
with C2075, GTX 480, and GTX 680 GPU devices, for global and local
storage, with (a) a moderate alphabet size q = 32, over a range of message
sizes N , and (b) a moderate message size N = 210, over a range of alphabet
sizes q. Channel conditions are given by p := Pi = Pd = 10−3;Ps = 0.
efficiency. This is significant, as a significant fraction of time
is spent in computing the α and β metrics, which are much
harder to parallelize efficiently. As expected, speedup improves
for poorer channel conditions.
C. Limitations and Overall Performance
The main limits imposed by our implementation are dis-
cussed below. The largest supported state space Mτ is limited
by the block size for the α,β initialization and normalization
kernels and the ΦT computation kernel, Since these kernels are
very simple, this limit corresponds to the maximum number
of threads per block, which on current hardware is 1024. Now
Mτ increases with τ and the channel error probabilities, and
depends on the allowed probability for an actual drift to exceed
these limits. Even for a poor channel with Pi = Pd = 0.2 and
an exclusion probability Pr = 10−10, this allows us to support
τ / 12 000. For a very poor channel having Pi = Pd = 0.4,
this limit drops to τ / 4 000; note that this is beyond the
14
100 101 102 103
N
100
101
102
Sp
ee
du
p
Frame decoding, (n,q) =(10,32)
p=10−3
p=10−2
(a)
2 4 8 16 32 64 128 256 512
q
100
101
102
Sp
ee
du
p
Frame decoding, N=300
p=10−3
p=10−2
(b)
Figure 16. Decoding speedup for the GPU implementation of this paper (new)
over the GPU implementation of [9] (old), at different channel conditions for
a range of (a) message size N and (b) alphabet size q. Channel conditions
are given by p := Pi = Pd;Ps = 0.
decoding ability of any known codes. Larger message lengths
are possible when the target channel error rates are lower.
Another limitation depends on available device memory.
The use of local storage goes a long way to increase the limits
of supported code parameters. By way of example, a code with
N = 840, n = 20, q = 1024 uses a little over 1 GiB of device
memory at Pi = Pd = 0.1 using local storage. In general this
tends to be an issue only with larger alphabet sizes.
While the implementation considered is intended for use in
a simulator, it is interesting to consider whether the achieved
performance is suitable for real-time applications. We show
in Fig. 17 the achieved throughput for our CPU and GPU
implementations for global and local storage, a large message
size N = 840, and a range of alphabet sizes q, under
moderate to good channel conditions. Even with the algorithm
improvements of [11], the CPU implementation is clearly too
slow for real-time use except in low throughput applications.
The GPU implementation, however, reaches 100 kbit/s even
with this relatively large message length. Note that the missing
simulation results for global storage with large q indicate
conditions where available memory was exceed.
VIII. CONCLUSIONS
In this paper we have presented an optimized parallel
implementation of the MAP decoder of [11] with algorithmic
improvements over the equivalent decoder of [7]. This imple-
mentation achieves a speedup of more than 100× over the
CPU implementation of the same algorithm, and more than
10× over the previous GPU implementation of [9], on the
same hardware. This increases our ability to analyse larger
2 4 8 16 32 64 128 256 512 1024
q
101
102
103
104
105
106
Th
ro
ug
hp
ut
 (b
its
/s
)
Frame decoding, N=840, p=10−3
CPU, global
C2075, global
GTX480, global
GTX680, global
CPU, local
C2075, local
GTX480, local
GTX680, local
Figure 17. Decoder throughput on CPU and C2075, GTX 480, and GTX 680
GPU devices, for global and local storage, and a large message size N = 840,
over a range of alphabet sizes q, under moderate to good channel conditions.
Channel conditions are given by p := Pi = Pd = 10−3;Ps = 0.
codes and poorer channel conditions, and makes practical use
of such codes more feasible.
We also present a reduced-memory implementation where
some intermediate metrics are computed twice: once for the
forward pass and once again for the backward pass and
final results. This variant trades off some decoding speed for
significantly reduced memory requirements. This allows the
decoder to work with longer message sizes and poorer channel
condition than would otherwise be possible.
The speed improvements of this implementation are made
possible by a number of specific optimizations. We use shared
memory judiciously to reduce global memory transfers and to
improve memory access patterns. We also introduce a dynamic
strategy for choosing kernel block sizes, ensuring efficient use
of device resources under a wide range of code and channel
parameters. Specifically, we determine settings that maximize
occupancy while avoiding idle time on multiprocessors. Tak-
ing these decisions at run-time also has the advantage that
this automatically caters for different devices. We hope that
other researchers working with CUDA will also find these
techniques relevant to their work.
While this implementation represents a considerable im-
provement on earlier implementations, some aspects bear
further analysis. In particular, efficiency of the parallel imple-
mentation for small alphabets is still significantly lower than
for larger alphabets. Aggregation of work previously done on
multiple blocks into a single block has helped by improving
occupancy. However, not all kernels can be improved equally,
and for smaller alphabets a greater proportion of decoding
time is spent in the α and β metric computations. This
effect is more pronounced under better channel conditions,
when the required state space is much smaller. Another aspect
that may be improved is the optimization of multiprocessor
usage. While our dynamic strategy for choosing kernel block
size ensures that the usage is never low, we do not seek to
optimize the value, prioritizing instead a higher occupancy.
Of course, improving usage is further complicated by the
15
concurrent execution of kernels of different size and duration.
However, it may be possible to increase overall efficiency by
adapting our dynamic strategy to take into account empirical
kernel timings, which may, for example, be obtained during
system initialization. Clearly, neither of the above is a trivial
proposition; both are the subject of further work.
ACKNOWLEDGMENT
The author would like to thank Prof. Ing. V. Buttigieg and
Dr S. Wesemeyer for helpful discussions and comments.
REFERENCES
[1] H. Mercier, V. Bhargava, and V. Tarokh, “A survey of error-correcting
codes for channels with symbol synchronization errors,” IEEE Commun.
Surveys Tuts., vol. 12, no. 1, pp. 87–96, First Quarter 2010.
[2] J. Hu, T. Duman, M. Erden, and A. Kavcic, “Achievable information
rates for channels with insertions, deletions, and intersymbol interference
with i.i.d. inputs,” IEEE Trans. Commun., vol. 58, no. 4, pp. 1102 –1111,
2010.
[3] J. Hu, T. Duman, E. Kurtas, and M. Erden, “Bit-patterned media with
written-in errors: Modeling, detection, and theoretical limits,” IEEE
Trans. Magn., vol. 43, no. 8, pp. 3517–3524, Aug. 2007.
[4] D. J. Coumou and G. Sharma, “Insertion, deletion codes with feature-
based embedding: A new paradigm for watermark synchronization
with applications to speech watermarking,” IEEE Trans. Inf. Forensics
Security, vol. 3, no. 2, pp. 153–165, Jun. 2008.
[5] D. Bardyn, J. A. Briffa, A. Dooms, and P. Schelkens, “Forensic data
hiding optimized for JPEG 2000,” in Proc. IEEE Intern. Symp. on
Circuits and Systems, Rio de Janeiro, Brazil, May 15–18, 2011.
[6] M. C. Davey and D. J. C. MacKay, “Reliable communication over
channels with insertions, deletions, and substitutions,” IEEE Trans. Inf.
Theory, vol. 47, no. 2, pp. 687–698, 2001.
[7] J. A. Briffa, H. G. Schaathun, and S. Wesemeyer, “An improved
decoding algorithm for the Davey-MacKay construction,” in Proc. IEEE
Intern. Conf. on Commun., Cape Town, South Africa, May 23–27, 2010.
[8] V. Buttigieg and J. A. Briffa, “Codebook and marker sequence design for
synchronization-correcting codes,” in Proc. IEEE Intern. Symp. Inform.
Theory, St. Petersburg, Russia, Jul. 31–Aug. 5, 2011.
[9] J. A. Briffa, “A GPU implementation of a MAP decoder for synchro-
nization error correcting codes,” IEEE Commun. Lett., vol. 17, no. 5,
pp. 996–999, May 27th 2013.
[10] NVIDIA CUDA C Programming Guide, NVIDIA Corporation, Oct.
2012, version 5.0.
[11] J. A. Briffa, V. Buttigieg, and S. Wesemeyer, “Time-varying block codes
for synchronization errors: Maximum a posteriori decoder and practical
issues,” IET Journal of Engineering, Jun. 30th 2014.
[12] D. Lee, M. Wolf, and H. Kim, “Design space exploration of the turbo
decoding algorithm on GPUs,” in Proc. Intern. Conf. on Compilers,
Architectures and Synthesis for Embedded Systems. ACM, 2010, pp.
217–226.
[13] M. Wu, Y. Sun, G. Wang, and J. Cavallaro, “Implementation of a high
throughput 3GPP turbo decoder on GPU,” Journal of Signal Processing
Systems, vol. 65, no. 2, pp. 171–183, 2011.
[14] J. Xianjun, C. Canfeng, P. Jaaskelainen, V. Guzma, and H. Berg, “A
122Mb/s turbo decoder using a mid-range GPU,” in Wireless Com-
munications and Mobile Computing Conference (IWCMC), 2013 9th
International, 2013, pp. 1090–1094.
[15] L. R. Bahl and F. Jelinek, “Decoding for channels with insertions,
deletions, and substitutions with applications to speech recognition,”
IEEE Trans. Inf. Theory, vol. 21, no. 4, pp. 404–411, Jul. 4, 1975.
[16] E. A. Ratzer, “Marker codes for channels with insertions and deletions,”
Annals of Telecommunications, vol. 60, pp. 29–44, Jan. 2005.
[17] J. A. Briffa and H. G. Schaathun, “Improvement of the Davey-MacKay
construction,” in Proc. IEEE Intern. Symp. on Inform. Theory and its
Applications, Auckland, New Zealand, Dec. 7–10, 2008, pp. 235–238.
[18] T. Granlund, GNU MP: The GNU Multiple Precision Arithmetic
Library, Free Software Foundation, Dec. 2012, edition 5.1.0. [Online].
Available: http://gmplib.org/gmp-man-5.1.0.pdf
[19] S. Rennich, “CUDA C/C++ streams and concurrency,” in GPU Tech-
nology Conference. NVIDIA, 2011.
[20] B. Eckel, Thinking in C++, Vol. 1, 2nd ed. Pearson Education, 2000.
[21] B. Eckel and C. Allison, Thinking in C++, Vol. 2, 2nd ed. Pearson
Education, 2003.
[22] NVIDIA’s Next Generation CUDA Compute Architecture: Kepler
GK110, NVIDIA Corporation, 2012, version 1.0.
[23] NVIDIA CUDA C Best Practices Guide, NVIDIA Corporation, Oct.
2012, version 5.0.
