Linear-time encoding and decoding of low-density parity-check codes by Simberg, Mikael
Aalto University
School of Science
Degree Programme in Computer Science and Engineering
Mikael Simberg
Linear-time encoding and decoding
of low-density parity-check codes
Master’s Thesis
Espoo, 2015
Supervisor: Professor Petteri Kaski
Advisors: Professor Petteri Kaski
Professor Camilla Hollanti

Aalto University
School of Science
Degree Programme in Computer Science and Engineering
Author: Mikael Simberg
Title: Linear-time encoding and decoding
of low-density parity-check codes
Date: 2015 Pages: 108 + 12
Major: Systems and Operations Research Code: F3008
Supervisor: Professor Petteri Kaski
Advisors: Professor Petteri Kaski
Professor Camilla Hollanti
Abstract:
Low-density parity-check (LDPC) codes had a renaissance when they were rediscovered
in the 1990s. Since then LDPC codes have been an important part of the field
of error-correcting codes, and have been shown to be able to approach the Shannon
capacity, the limit at which we can reliably transmit information over noisy channels.
Following this, many modern communications standards have adopted LDPC codes.
Error-correction is equally important in protecting data from corruption on a hard
drive as it is in deep-space communications. It is most commonly used, for example, for
reliable wireless transmission of data to mobile devices. For practical purposes, both
encoding and decoding need to be of low complexity to achieve high throughput and
low power consumption.
This thesis provides a literature review of the current state-of-the-art in encoding
and decoding of LDPC codes. Message-passing decoders are still capable of achieving
the best error-correcting performance, while more recently considered bit-flipping
decoders are providing a low-complexity alternative, albeit with some loss in
error-correcting performance. An implementation of a low-complexity stochastic bit-flipping
decoder is also presented. It is implemented for Graphics Processing Units (GPUs) in a
parallel fashion, providing a peak throughput of 1.2 Gb/s, which is significantly higher
than previous decoder implementations on GPUs. The error-correcting performance
of a range of decoders has also been tested, showing that the stochastic bit-flipping
decoder provides relatively good error-correcting performance with low complexity.
Finally, a brief comparison of encoding complexities for two code ensembles is also
presented.
Keywords: bit-ipping, coding theory, error-correcting codes, graphics pro-
cessing unit, linear time complexity, low-density parity-check
codes
Language: English

Acknowledgments
First and foremost, I would like to thank my supervisor Petteri Kaski for his great
help and guidance throughout the preparation of this thesis, and for introducing
me to the exciting world of error-correcting codes and the study of decoding
using stochastic bit-ipping with Alexander Mozeika and Pekka Orponen.
I also wish to thank Camilla Hollanti for fruitful discussions and helpful
comments throughout the thesis work, and for introducing me to and providing
insights about information transmission standards.
I gratefully acknowledge the use of computing resources available via project
“Science-IT” at Aalto University School of Science and via CSC–the Finnish IT
Center for Science. As regards the latter I would especially like to thank Maarit
Mantere for her advice that NVIDIA Tesla K40 graphics processing units are
available at CSC.
Finally, I would like to thank my wonderful girlfriend and friends for always
being there, and my family for showing interest in my work and fully supporting
me in what I do.
Mikael Simberg
Espoo, 15.12.2014

Contents
Abstract
Acknowledgments
List of acronyms
List of mathematical symbols
1 Introduction
2 Error-correcting codes
2.1 Preliminaries
2.2 Noisy channels
2.3 Low-density parity check codes
2.3.1 Tanner graph representation
2.4 LDPC code constructions
2.4.1 Regular codes
2.4.2 Irregular codes
2.4.3 Code ensembles and their properties
2.4.4 Quasi-cyclic LDPC codes
2.5 Codes used in practice
2.5.1 Digital Video Broadcasting
2.5.2 WiMAX and WiFi
2.5.3 Ethernet
2.6 Summary
3 Encoding of LDPC codes
3.1 Encoding using the systematic generator matrix
3.2 Encoding with approximate lower-triangular parity-check matrices
3.2.1 Irregular codes with a fixed gap
3.2.2 Quasi-cyclic LDPC codes with approximate lower-triangular parity-check matrices
3.3 Summary
4 Decoding of LDPC codes
4.1 Naïve decoding
4.2 Bit-flipping decoding
4.2.1 Gallager’s bit-flipping decoder
4.2.2 Weighted bit-flipping decoders
4.2.3 Schedules for bit-flipping decoders
4.2.4 Stochastic bit-flipping decoders
4.2.5 Stochastic bit-flipping decoder with hard channel values
4.3 Message-passing decoding
4.3.1 Exact inference on factor graphs and the sum-product algorithm
4.3.2 The sum-product algorithm for decoding
4.3.3 Implementing the sum-product decoder
4.3.4 Binary message-passing decoder
4.3.5 Message-passing schedules
4.4 Turbo codes
4.5 Summary
5 Decoder implementations
5.1 Hardware simulations
5.2 GPU implementations
5.3 A GPU implementation of a stochastic bit-flipping decoder
5.3.1 Architecture and programming of CUDA devices
5.3.2 Decoder implementation
6 Experimental results
6.1 Comparison of decoders
6.1.1 Choosing the decoder parameters
6.1.2 Results
6.2 Complexity of approximate lower triangular encoding
6.2.1 Results
6.3 The stochastic bit-flipping decoder on the GPU
6.3.1 Results
7 Conclusion
Bibliography
A Derivations
A.1 Derivation of the gradient-descent bit-flipping decoder
A.2 Derivation of the tanh-rule for the sum-product decoder
B Parameter searches
B.1 Parameter searches for the stochastic bit-flipping decoder
B.2 Parameter searches for the gradient-descent bit-flipping decoder

List of acronyms
LDPC code Low-density parity-check code
QC-LDPC code Quasi-cyclic low-density parity-check code
BEC Binary erasure channel
BSC Binary symmetric channel
BAWGNC Binary additive white Gaussian noise channel
BER Bit-error rate
FER Frame-error rate
CPU Central processing unit
RAM Random access memory or random access machine
MBRAM Random access machine with addition, subtraction, multiplication,
division and bitwise Boolean operations
GPU Graphics processing unit
CUDA Compute unified device architecture
PTX Parallel thread execution
ISA Instruction set architecture

List of mathematical symbols
H Parity-check matrix
G Generator matrix
$H_{ai}$ The ith element of the ath row of a matrix H
$H^T$ The transpose of a matrix H
x A vector
$x_i$ The ith element of the vector x
x \ $x_i$ The vector x with the ith element removed
$I_n$ Identity matrix of dimension n × n
0 An all-zero vector
$F_2$ The binary field
N(i) The set of neighbors of a node i in a graph
N(i) \ j The set of neighbors of a node i excluding the node j
$P_b$ Bit-error rate
$P_B$ Block-error rate
[P] Iverson’s bracket notation, evaluates to 1 if the predicate P is true and 0 otherwise
$\mathcal{N}(\mu, \sigma^2)$ The Gaussian distribution with mean $\mu$ and variance $\sigma^2$
sgn(x) The sign function
tanh(x) The hyperbolic tangent function
$\tanh^{-1}(x)$ The inverse hyperbolic tangent function

Chapter 1
Introduction
The eld of information theory was started by Shannon in 1948 with his “A mathe-
matical theory of communication” (Shannon, 1948). In it he presented the problem
of transmitting information reliably over noisy channels, together with an initial
solution on how to tackle the problem. Although there is a limit to how much
information we can transmit over a channel—the Shannon capacity—Shannon
also showed that we can come arbitrarily close to this limit using error-correcting
codes. Shannon’s work gave answers to a practical and common problem, and
error-correcting codes are today in use in practically every communications
scenario, from mobile networks to deep-space communications, in addition to
reliable storage.
The classical communications problem is the following: we have a string of
bits that we want to send from one place to another. The problem we are facing
is that when we send the bits, some amount of noise will be added along the way
and the receiving end will not receive what we originally sent. Error-correcting
codes attempt to solve this problem by encoding the given string of bits into a
longer string of bits, adding some form of redundancy to the bits that are sent.
The longer string of bits is then sent over the channel and, on the receiving end,
even with a certain amount of noise added to the bits, one can decode the sent bits
uniquely to the original bits. Figure 1.1 presents the situation graphically. The
problem is usually thought of as involving physical transmission of data from
one point to another. However, the problem applies equally well in a situation
where bits are not physically transferred at all. A hard-drive is one such example,
where errors will accumulate over time.
[Figure: source bits → encoding → noisy channel → decoding → decoded bits]
Figure 1.1: Transmission over a noisy channel: given source bits are encoded before
transmission over a noisy channel; decoding of the encoded noisy bits attempts to recover
the source bits.
How could one start to approach the problem of making sure that the right
bits are received? Say that one would like to send the following string of 8 bits:
01001101.
We want to make sure that the receiver does not mistake the bits for any of the
other $2^8 - 1$ possible bit strings of length 8 in the case that some of the bits are
flipped along the way. One way of dealing with this is to send the following string
of bits instead:
000 111 000 000 111 111 000 111.
What we have done is replaced each bit by three occurrences of the same value.
After receiving the longer string of bits, but with possibly some bits flipped, we
can then decide that we decode the received bits so that for each group of three
bits we set the value of that bit to the majority value. That is, if for example two
or three of the three bits in a group are ones, we decide that the group of three
bits represents a one. With this scheme one of the bits in each group of three bits
can be flipped and we will still decode the correct string of bits.
The above scheme is called a repetition code, and it does not work well in
practice. It does, however, demonstrate the essence of error-correcting codes: to
allow us to recover from as many errors as possible, and to do so efficiently. The
repetition code does the first if we repeat each bit enough times, but loses on efficiency
as we have to repeat each bit many times to achieve reliable transmission.
On the other hand, keeping the number of repetitions low and fixed maintains
efficiency but we cannot recover from many errors. The codes considered in this
thesis, low-density parity-check codes, are one type of codes that can do both
tasks well.
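The repetition code described above can be sketched in a few lines of Python. This is only an illustration of the scheme in the text; the function names `repetition_encode` and `repetition_decode` and the choice of flipped positions are assumptions made for the example.

```python
def repetition_encode(bits, repeats=3):
    """Replace every source bit by `repeats` copies of itself."""
    return [b for b in bits for _ in range(repeats)]


def repetition_decode(received, repeats=3):
    """Decode each group of `repeats` bits to its majority value."""
    decoded = []
    for start in range(0, len(received), repeats):
        group = received[start:start + repeats]
        decoded.append(1 if 2 * sum(group) > len(group) else 0)
    return decoded


source = [0, 1, 0, 0, 1, 1, 0, 1]      # the 8-bit string 01001101
sent = repetition_encode(source)       # 000 111 000 000 111 111 000 111

# Flip one bit in four different groups (positions chosen arbitrarily).
received = list(sent)
for position in (1, 3, 8, 23):
    received[position] ^= 1

recovered = repetition_decode(received)
```

With at most one flipped bit per group the majority vote recovers the source bits, while two flips inside the same group already cause a decoding error, which is the trade-off between reliability and efficiency discussed above.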
Low-density parity-check codes were rst introduced by Gallager (1962).
Despite the current knowledge of their good properties, the codes were largely
forgotten after their discovery until the 1990’s. The codes were rediscovered
independently by MacKay (1995), as well as Sipser and Spielman (1996). It was
quickly realized that the new codes were largely equivalent to the codes Gallager
first presented. Following the rediscovery there has been a considerable amount of
research into low-density parity-check codes. They have been shown to have good
error-correcting properties and have practical algorithms for both encoding and
decoding. Recently, the codes have also been included in standards for wireless
and wired communication, joining and in some cases replacing Turbo codes
(Berrou et al., 2005) which have, together with low-density parity-check codes,
been shown to be able to approach the Shannon capacity.
As the eld of low-density parity-check codes has grown more mature since
their rediscovery there has been more focus on making the codes perform bet-
ter, as well as improving the algorithms used for encoding and decoding of the
2
codes. For example, encoding can as of today not be done in linear time for
general low-density parity-check codes. On the decoding side, most decoders
are linear-time, but there is room for improvement in terms of the constants
involved in implementations. The need for faster practical encoders and, more
importantly, decoders is twofold. First, power consumption is an important factor
in mobile devices, and reducing the complexity of hardware implementations
can have a positive eect on the power consumption. Second, faster encoders
and decoders can allow improvements in a combination of throughput, latency
and error-correction. Linear-time encoders and decoders allow us to scale the
so-called block length of low-density parity-check codes, ultimately leading to
better error-correction. Most importantly, encoders and decoders with small con-
stants give practical improvements to error-correction, throughput and latency.
For this reason it is important to consider practical issues when implementing
encoding and decoding algorithms, and not only the theoretical properties of
the algorithms. With this in mind, we will in this thesis review the current state-
of-the-art in encoding and decoding of low-density parity-check codes together
with experimental results on encoding complexity and the error-correcting per-
formance of decoders. Most importantly, this thesis presents a low-complexity
decoder for GPUs.
We will begin with an overview of definitions relating to coding theory, define
what low-density parity-check codes are, and look into various code constructions
in Chapter 2. Chapter 3 contains a summary of useful encoding methods, one
of which allows linear-time encoding for a limited but useful set of codes. A
majority of the research is focused on the decoding of low-density parity-check
codes. In Chapter 4 we review the advances in decoding algorithm designs,
which aim to reduce the complexity of the decoders to reach higher throughputs,
and to improve the error-correcting properties of the decoders. In Chapter 5
we present existing work on implementing decoders in hardware, where the
focus generally is to achieve high throughput. Partly, the aim of this thesis is
to provide a comprehensive overview of the current state-of-the-art in low-
density parity-check codes. However, the main contribution of this thesis is the
implementation of a simple decoder implemented for GPUs. The need for simpler
algorithm designs becomes apparent as decoders are implemented in hardware.
The implementation aims at achieving high decoding throughputs by using a
simple design which is easily parallelized. In Chapter 6 we will present some
experimental results concerning encoding complexity, compare five decoders in
terms of error-correcting performance, and present results on the performance
of the GPU implementation. We conclude the thesis in Chapter 7.

Chapter 2
Error-correcting codes
In this chapter we set up the terminology and basic concepts concerning error-
correcting codes. In Section 2.1 we will present the basic concepts concerning
error-correcting codes in general and in Section 2.2 we present three noisy chan-
nels. In Section 2.3 we present low-density parity-check codes, some constructions
of low-density parity-check codes and the most important results relating to them.
By the end of this chapter we will have the prerequisites to consider encoding
and decoding of low-density parity-check codes in the following chapters.
2.1 Preliminaries
There are two main types of codes, of which the first adds redundancy to continuous
streams of data and the second does so to blocks of data. We will concern
ourselves with the second type of codes, block codes, as low-density parity-check
codes are of this second type. Block codes work with finite blocks of data, encoding
each block of data into a longer block, and the blocks are independent of
each other. With this, we can define a block code more formally, and since we
will only consider one type of block codes in this thesis we will refer to block
codes simply as codes. We will largely follow the notation and terminology of
Richardson and Urbanke (2008).
Definition 1 (Code)
Let F be a nite eld. A code C of block length n and cardinality M is a set
ofM ≥ 2 elements from Fn.
We call the elements of a code its codewords. In contrast, a word refers to a
vector which is part of Fn, but is not necessarily a codeword. We will denote
vectors, or words, from Fn by small bold letters, and the ith element of a vector x
by xi .
Definition 2 (Linear code)
A linear code C is a code which satisfies
$$\alpha \mathbf{x} + \alpha' \mathbf{x}' \in C, \quad \forall \mathbf{x}, \mathbf{x}' \in C \text{ and } \forall \alpha, \alpha' \in F.$$
Low-density parity-check codes are linear codes. A direct consequence of
linearity is that the all-zero word is always a codeword. A second consequence
of the linearity of a code is that the error-correcting performance of the code is
independent of the sent codeword. Hence we usually assume that the all-zero
codeword was sent when examining decoding performance.
The weight of a vector and the distance between two vectors are useful
concepts for analyzing error-correcting codes in terms of the minimum distance
of a code.
Definition 3 (Weight of a vector and distance between two vectors)
The weight of a vector x, denoted by w (x), is the number of nonzero entries in
x. The distance between two vectors x and x′ is d (x,x′) = w (x − x′).
Definition 4 (Minimum distance of a code)
The minimum distance of a code C is the smallest distance between any two distinct
elements of the code. More precisely, the minimum distance of a code is defined as
$$\min_{\mathbf{x}, \mathbf{x}' \in C,\; \mathbf{x} \neq \mathbf{x}'} d(\mathbf{x}, \mathbf{x}').$$
The essence of an error-correcting code is that we map a set of shorter strings
to a set of longer strings, the codewords. In doing so we can increase the distance
between any two elements in the code resulting in the code becoming more
robust to errors. The extent to which we have improved the error-correcting
properties of a code is partly captured by the minimum distance of the code, as a
larger minimum distance means that the code can tolerate more errors while still
allowing decoding to the correct codeword.
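The definitions of weight, distance and minimum distance can be checked on small binary codes with the following sketch. The function names and the two toy codes are hypothetical, the code assumes the binary field (so that subtraction is XOR), and the brute-force search over all pairs is only practical for tiny codes. The helper `guaranteed_correctable` uses the standard bound of ⌊(d − 1)/2⌋ correctable errors, which is not stated in the text and is included here as an assumption.

```python
from itertools import combinations


def weight(x):
    """Number of nonzero entries of the vector x."""
    return sum(1 for xi in x if xi != 0)


def distance(x, y):
    """w(x - y); over the binary field the difference is the elementwise XOR."""
    return weight([xi ^ yi for xi, yi in zip(x, y)])


def minimum_distance(code):
    """Smallest distance between any two distinct codewords (brute force)."""
    return min(distance(x, y) for x, y in combinations(code, 2))


def guaranteed_correctable(d):
    """Standard bound: a code with minimum distance d corrects (d - 1) // 2 errors."""
    return (d - 1) // 2


repetition_code = [(0, 0, 0), (1, 1, 1)]
small_linear_code = [(0, 0, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]

d_repetition = minimum_distance(repetition_code)   # 3
d_linear = minimum_distance(small_linear_code)     # 2
```

For the three-fold repetition code the minimum distance is 3, which matches the earlier observation that one flipped bit per group can be tolerated.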
Although codes can be formulated on larger finite fields, the binary field $F_2$
consisting of the elements {0,1} is most commonly used. On the binary field the
addition operation is the logical XOR operation. More precisely, 0 + 0 = 1 + 1 = 0
and 1 + 0 = 0 + 1 = 1. The binary field can also be represented by the elements in
the set {1,−1}. Instead of mod-2 addition, we now use multiplication, meaning
1 · 1 = (−1) · (−1) = 1 and (−1) · 1 = 1 · (−1) = −1. The field will be assumed to
be the binary field for the rest of this thesis.
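The correspondence between the two representations of the binary field can be made concrete with a short sketch. It assumes the mapping 0 → 1 and 1 → −1, which is the mapping implied by the multiplication rules above; the function names are hypothetical.

```python
def to_signs(bits):
    """Map the bit 0 to +1 and the bit 1 to -1."""
    return [1 - 2 * b for b in bits]


def to_bits(signs):
    """Inverse mapping: +1 back to 0 and -1 back to 1."""
    return [(1 - s) // 2 for s in signs]


# XOR of bits corresponds to multiplication of the mapped values.
xor_matches_product = all(
    to_signs([a ^ b])[0] == to_signs([a])[0] * to_signs([b])[0]
    for a in (0, 1)
    for b in (0, 1)
)
```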
We say that a binary code with M elements has $\log_2 M$ information bits, as
this is the number of bits of information that we are sending over the channel in
each codeword. The rate of a code is then defined as the ratio of information bits
in a codeword to the total number of bits in a codeword.
Definition 5 (Rate of a code)
The rate of a code C with block length n and cardinality M is the number of information
bits sent over the number of total bits sent. More precisely, the rate r(C)
is
$$r(C) = \frac{\log_2 M}{n}.$$
It will later be convenient to consider ensembles of codes. They are essentially
sets of codes, constructed using some random process, from which codes are
chosen at random. Shannon’s random ensemble is one example.
Example 1 (Shannon’s random ensemble)
Let Shannon(n, M) denote Shannon’s random ensemble where each code has block
length n and M elements. A code is chosen from the ensemble by choosing each of
the M codewords uniformly at random from $F_2^n$.
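Drawing a code from Shannon's random ensemble, together with the rate from Definition 5, might be sketched as follows. The function names, the seed and the parameters n = 8 and M = 16 are assumptions for illustration. Since the codewords are drawn independently, the sketch does not prevent the same word from being drawn twice.

```python
import math
import random


def sample_shannon_code(n, M, rng):
    """Draw M words independently and uniformly at random from F_2^n."""
    return [tuple(rng.randrange(2) for _ in range(n)) for _ in range(M)]


def rate(n, M):
    """Rate of a code with block length n and cardinality M."""
    return math.log2(M) / n


rng = random.Random(0)                      # fixed seed for reproducibility
code = sample_shannon_code(n=8, M=16, rng=rng)
code_rate = rate(n=8, M=16)                 # log2(16) / 8 = 0.5
```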
2.2 Noisy channels
Noisy channels are the fundamental reason that error-correcting codes are used.
For that reason we will take a small detour to look at channels before continuing
to low-density parity-check codes. A binary channel is a mapping {0,1} → S
where S can be a finite or infinite set. We will in general denote the sent codeword
by x and the received codeword after transmission over a noisy channel by y.
[Figure: transition diagram with inputs 0, 1 and outputs 0, erasure, 1; edge labels 1 − ϵ and ϵ]
Figure 2.1: The transition diagram for the BEC. The input can have the value 0 or 1, and
will after transmission have changed to an erasure with probability ϵ and stayed at the
input value with probability 1 − ϵ .
The binary erasure channel (BEC) simply performs the following operation:
with some probability ϵ , a bit becomes an erasure. The set of symbols received
thus has the added erasure symbol. Put differently, on the binary erasure channel
we assume that the receiver knows that a bit has been erased but simply does
not know what value the erased bit had originally. The transition diagram of the
BEC is shown in Figure 2.1.
[Figure: transition diagram with inputs 0, 1 and outputs 0, 1; edge labels 1 − ϵ and ϵ]
Figure 2.2: The transition diagram for the BSC. The input can have the value 0 or 1, and
will after transmission on the BSC have flipped to the other value with probability ϵ.
The binary symmetric channel (BSC) has the same input and output symbols
and is determined by the parameter ϵ , often called the crossover probability. The
crossover probability determines the amount of noise in the channel in the sense
that a bit passing through the channel is simply flipped from 0 to 1 or from 1 to 0
with probability ϵ . The transition diagram of the BSC is shown in Figure 2.2.
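The two channels described so far can be simulated with a few lines of Python. This is a minimal sketch: the function names `bec` and `bsc`, and the use of `None` as the erasure symbol, are choices made for the example and do not come from the thesis.

```python
import random

ERASURE = None  # hypothetical marker for an erased bit


def bec(bits, epsilon, rng):
    """Binary erasure channel: each bit is erased with probability epsilon."""
    return [ERASURE if rng.random() < epsilon else b for b in bits]


def bsc(bits, epsilon, rng):
    """Binary symmetric channel: each bit is flipped with probability epsilon."""
    return [b ^ 1 if rng.random() < epsilon else b for b in bits]


rng = random.Random(0)                 # fixed seed for reproducibility
sent = [0, 1, 0, 0, 1, 1, 0, 1]
received_bec = bec(sent, 0.25, rng)    # some entries may be ERASURE
received_bsc = bsc(sent, 0.25, rng)    # some entries may be flipped
```

On the BEC the receiver can tell which positions were affected, whereas on the BSC a flipped bit is indistinguishable from a correct one.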
The binary additive white Gaussian noise channel (BAWGNC) maps the set of
input symbols to the set of real numbers. For the BAWGNC it is convenient to
use {1,−1} as the set of input symbols. The mapping for the BAWGNC is
$$y_i = x_i + e_i$$
for each bit $x_i$ that we wish to send, where $e_i$ is an independently and normally
distributed random variable with zero mean and variance $\sigma^2$. That is,
$$e_i \sim \mathcal{N}(0, \sigma^2).$$
Two quantities are useful when considering the BAWGNC. The signal-to-noise
ratio (SNR) is the ratio of the energy per transmitted bit $E_s$ to the energy of the
noise $\sigma^2$. That is, $\mathrm{SNR} = E_s / \sigma^2$. Here, $E_s = 1$ because the set of input symbols is
{1,−1}. A related quantity is the ratio of the energy per transmitted information
bit $E_b = E_s / r$ to the double-sided power spectral density $N_0 = 2\sigma^2$. Here we have
simply written r for the rate of a code C. The quantities are usually shown in
dB. That is, they are shown as
$$10 \log_{10}\left(\frac{E_s}{\sigma^2}\right) \quad\text{and}\quad 10 \log_{10}\left(\frac{E_s}{2 r \sigma^2}\right).$$
For codes of rate 1/2 the two quantities are the same. The BAWGNC is useful as a model for real
channels.
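The relations between the noise level, the SNR and the ratio of $E_b$ to $N_0$ can be sketched as below, together with a simulation of the channel itself. The function names are hypothetical and $E_s = 1$ is used as the default, as in the text.

```python
import math
import random


def snr_db(sigma, es=1.0):
    """10 log10(E_s / sigma^2)."""
    return 10 * math.log10(es / sigma ** 2)


def ebn0_db(sigma, rate, es=1.0):
    """10 log10(E_s / (2 r sigma^2))."""
    return 10 * math.log10(es / (2 * rate * sigma ** 2))


def sigma_from_ebn0_db(value_db, rate, es=1.0):
    """Noise standard deviation that gives the requested E_b/N_0 in dB."""
    ebn0 = 10 ** (value_db / 10)
    return math.sqrt(es / (2 * rate * ebn0))


def bawgnc(symbols, sigma, rng):
    """Add independent zero-mean Gaussian noise to symbols from {1, -1}."""
    return [x + rng.gauss(0.0, sigma) for x in symbols]


rng = random.Random(0)                        # fixed seed for reproducibility
sigma = sigma_from_ebn0_db(2.0, rate=0.5)     # noise level at 2 dB for a rate-1/2 code
received = bawgnc([1, -1, 1, 1, -1], sigma, rng)
```

For a code of rate 1/2, `snr_db(sigma)` and `ebn0_db(sigma, 0.5)` agree, as stated above.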
2.3 Low-density parity check codes
A low-density parity-check code, or LDPC code, is defined by a matrix $H \in \{0,1\}^{m \times n}$.
The defining characteristic of LDPC codes is that H is sparse. This means that H
has O(n) nonzero elements. Given a matrix H the set of codewords is defined by
$$C = \{ \mathbf{x} \in \{0,1\}^n \mid H\mathbf{x}^T = \mathbf{0}^T \},$$
where 0 is the zero vector and we assume that all vectors are row vectors. The
matrix H is referred to as the parity-check matrix, since it requires that a codeword
x have even parity in the so-called parity-check equations
$$s_a = \sum_{i=1}^{n} H_{ai} x_i = 0,$$
for all a = 1, 2, ..., m. Each parity-check equation is often called a checksum or
check, and the vector s of sums $s_a$ is called the syndrome. If a check has even
parity we say that the check is satisfied and otherwise we say that the check is
unsatisfied. Assuming that the parity-check matrix of an LDPC code is of full
rank, the cardinality of the code is $M = 2^{n-m}$. The rate of an LDPC code with a
parity-check matrix of full rank is $(n - m)/n$. A parity-check matrix uniquely defines a
code up to row operations, meaning row permutations and additions of one row
to another.
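Computing the syndrome of a word is a direct translation of the parity-check equations. The sketch below uses the (3,6)-regular parity-check matrix (2.1) from Example 2, written as bit strings; the function names are hypothetical and indices are zero-based, unlike in the text.

```python
# Rows of the parity-check matrix (2.1), one bit string per check.
H_ROWS = [
    "00100001001000011010",
    "01110000110100000000",
    "00010100010000101001",
    "00000110000011010100",
    "01001100100010000100",
    "00001010001000100011",
    "11100000110000000100",
    "10000001001100011000",
    "00011011000001100000",
    "10000000000111000011",
]
H = [[int(c) for c in row] for row in H_ROWS]


def syndrome(H, x):
    """Evaluate every parity-check equation modulo 2."""
    return [sum(h * xi for h, xi in zip(row, x)) % 2 for row in H]


def is_codeword(H, x):
    """A word is a codeword exactly when all checks are satisfied."""
    return not any(syndrome(H, x))


all_zero = [0] * 20
one_flip = [1] + [0] * 19           # the all-zero codeword with its first bit flipped
s = syndrome(H, one_flip)           # the checks that contain the first bit are unsatisfied
```

Flipping a single bit of a codeword leaves exactly the checks in which that bit takes part unsatisfied, which is the information that bit-flipping decoders work with.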
2.3.1 Tanner graph representation
It is often convenient to consider a graph representation of a parity-check matrix.
The graph representation of a parity-check matrix is more commonly referred to
as its Tanner graph (Tanner, 1981). The Tanner graph is a bipartite graph with
two sets of nodes. The first set consists of the check nodes, each corresponding
to a row, or check, of H. The second set consists of the variable nodes, each
corresponding to a column of H, or a symbol of a word. There is an edge between
a check node a and a variable node i if and only if $H_{ai} = 1$. Put differently, the
Tanner graph corresponding to a parity-check matrix H has an adjacency matrix
given by
$$\begin{pmatrix} 0 & H \\ H^T & 0 \end{pmatrix}.$$
We denote the neighbors of a node i by N (i ). As a shorthand, the notation N (i ) \ j
means the neighbors of the node i excluding the node j. A parity-check matrix
and a Tanner graph are equivalent up to permutation of rows and columns. In
general, we will also treat a code as equivalent to a parity-check matrix or Tanner
graph, even though a code can be dened by more than one parity-check matrix,
and likewise by more than one Tanner graph.
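The neighbor sets N(i) of the Tanner graph can be read off the parity-check matrix as in the following sketch. The small matrix `H` is a toy example made up for the illustration, the function names are hypothetical, and indices are zero-based.

```python
def tanner_neighbors(H):
    """Return the neighbor lists of the check nodes and of the variable nodes."""
    m, n = len(H), len(H[0])
    check_neighbors = [[i for i in range(n) if H[a][i]] for a in range(m)]
    variable_neighbors = [[a for a in range(m) if H[a][i]] for i in range(n)]
    return check_neighbors, variable_neighbors


def excluding(neighbors, j):
    """The set N(i) \\ j: the neighbors of a node without the node j."""
    return [k for k in neighbors if k != j]


# Toy parity-check matrix with 3 checks and 6 variables (not from the thesis).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
check_neighbors, variable_neighbors = tanner_neighbors(H)
```

Storing the two neighbor lists instead of the full matrix takes space proportional to the number of edges, which is O(n) for a sparse parity-check matrix.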
2.4 LDPC code constructions
The denition of LDPC codes does not state how the codes should be constructed.
Next we will look at some ways of doing so. Most importantly, we will present
regular and irregular code ensembles. We will, in relation to code ensembles, also
state two central theorems regarding the performance of codes. In addition, we
will present so-called quasi-cyclic LDPC codes and briey look at LDPC codes
used in communications standards.
2.4.1 Regular codes
A simple way to construct an LDPC code results in what is called a regular code.
An (l, r)-regular LDPC code is defined by a parity-check matrix H where each
column of H has weight l and each row has weight r. Such a code has, by double
counting, e = mr = nl ones in its parity-check matrix or, equivalently, e edges in
its Tanner graph representation.
[Figure: sockets of the nodes on the left connected to sockets of the nodes on the right through a random permutation]
Figure 2.3: Drawing a random bipartite graph with the configuration model. In the
above example each node on the left has degree 3 and thus 3 “sockets” indicated by the
small gray nodes connected to each node. Likewise, each node on the right has degree 6
and thus 6 sockets. Thus, there are in total 10 · 3 = 5 · 6 = 30 sockets on each side. We can
then, using a random permutation, connect each socket on the left to a unique socket on
the right. The result is a bipartite graph which may not be simple. That is, it may have
more than one edge connecting a pair of nodes.
One way of constructing a regular code is with the help of a random permutation.
For the construction of a regular code it is easiest to think of the code
in terms of its graph representation. As the graph is bipartite we know that the
two sets of nodes will both have e edges incident to them. We then define each
node to have a number of “sockets” equal to the degree of the node. Each socket
will be used to attach one end of an edge to it. If we number the sockets from
1 to e on one side (say, the variable nodes) we can then, given a permutation of
1, 2, ..., e, connect each socket of the check nodes to the socket of a variable node
whose number is given by the permutation. The model for constructing a graph
in such a way is called the configuration model (Bollobás, 2001). In this way, we
can construct a random bipartite graph but the graph is not necessarily simple.
That is, there may be multiple edges between a pair of nodes. However, we can
simply draw a new permutation until we get a simple graph. In other words, we
perform rejection sampling to get a simple bipartite graph. More importantly, the
probability of drawing a simple graph approaches zero exponentially only in the
degrees of the nodes and not in the block length n. The probability of drawing a
simple bipartite graph approaches a nonzero constant in the block length given
fixed degrees (Greenhill et al., 2006).
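The construction above, including the rejection of graphs with duplicate edges, can be sketched as follows. The function name, its parameters and the cap `max_tries` are assumptions made for the example; sockets are numbered from 0, and socket k belongs to node k divided by the node degree, rounded down.

```python
import random


def sample_regular_parity_check(n, column_weight, row_weight, rng, max_tries=100_000):
    """Sample a regular parity-check matrix with the configuration model.

    Permutations are redrawn until the resulting bipartite graph is simple.
    """
    edges = n * column_weight
    if edges % row_weight != 0:
        raise ValueError("n * column_weight must be divisible by row_weight")
    m = edges // row_weight
    for _ in range(max_tries):
        permutation = list(range(edges))
        rng.shuffle(permutation)
        H = [[0] * n for _ in range(m)]
        simple = True
        for check_socket, variable_socket in enumerate(permutation):
            a = check_socket // row_weight          # check node owning this socket
            i = variable_socket // column_weight    # variable node owning that socket
            if H[a][i]:                             # duplicate edge: reject
                simple = False
                break
            H[a][i] = 1
        if simple:
            return H
    raise RuntimeError("no simple graph found within max_tries")


# A (3,6)-regular code of block length 20, the same parameters as in Example 2.
H = sample_regular_parity_check(20, 3, 6, random.Random(2015))
```

Because the acceptance probability depends on the degrees but tends to a constant in n, the expected number of redraws stays bounded as the block length grows for fixed degrees.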
Example 2 (Parity-check matrix of a (3,6)-regular LDPC code)
The following matrix denes a randomly generated (3,6)-regular LDPC code of
block length 20. That is, parity-check matrix has 3 ones in each column and 6 ones
in each row.
H =
*......,
0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 1 0
0 1 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 1
0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 1 0 0
0 1 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0
0 0 0 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1
1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 1 0 0 0
0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1
+//////-
(2.1)
Example 3 (Tanner graph of a (3,6)-regular code)
The Tanner graph corresponding to the parity-check matrix in (2.1) is shown in
Figure 2.4. The convention for drawing Tanner graphs is to draw the variable nodes
as circles on the left-hand side and the check nodes as squares on the right hand side.
2.4.2 Irregular codes
Irregular codes are a more general class of codes, which include the regular case.
The terminology and notation of irregular degree distributions was first presented
by Luby et al. (1997). They also presented the idea of optimizing the properties of
a degree distribution using linear programming. Richardson et al. (2000, 2001)
expanded on the work.
[Figure: Tanner graph with variable nodes 1 to 20 on the left and check nodes 1 to 10 on the right]
Figure 2.4: The Tanner graph of the parity-check matrix (2.1). The nodes on the left
are the variable nodes and the nodes on the right are the check nodes. The edges corresponding
to the nonzero entries of the first column of (2.1) have been highlighted in
red.
For a code of length n, let $\Lambda_i$ be the number of variable nodes of degree i.
Thus $\sum_i \Lambda_i = n$. Likewise, let $P_i$ be the number of check nodes of degree i, such
that $\sum_i P_i = m$. It must hold that the numbers of edges emanating from the two
sides are equal, so $\sum_i i\Lambda_i = \sum_i iP_i = e$. For convenience, we represent the degree
distributions in terms of the following polynomials:
$$\Lambda(x) = \sum_{i=1}^{d_v} \Lambda_i x^i \quad\text{and}\quad P(x) = \sum_{i=1}^{d_c} P_i x^i,$$
where $d_v$ and $d_c$ are the maximum degrees of the variable and check nodes,
respectively. Using this representation we can write
$$\Lambda(1) = n \quad\text{and}\quad P(1) = m.$$
We also define the normalized degree distributions
$$L(x) = \frac{\Lambda(x)}{\Lambda(1)} \quad\text{and}\quad R(x) = \frac{P(x)}{P(1)}.$$
The normalized degree distributions give the fraction of nodes with a given
degree. That is, a term $L_i x^i$ in L(x) means that there are $L_i n$ variable nodes of
degree i, and likewise for the check nodes.
An (l, r)-regular code is a special case of an irregular code. We can define an
(l, r)-regular code with block length n with a degree distribution
$$(\Lambda(x), P(x)) = \left(n x^l, \tfrac{l}{r} n x^r\right).$$
Alternatively, in terms of a normalized degree distribution, the (l, r)-regular
code is defined by $(L(x), R(x)) = (x^l, x^r)$.
As in the case of a regular code, we can generate irregular codes by assigning
unique names to sockets at each variable node, where the number of sockets at
each node is equal to the degree of the node. Using a permutation we then again
assign edges to a pair of variable and check nodes. The only difference to the
regular case is that the number of sockets at each node need not be constant for
the variable and check nodes. The probability of drawing a simple graph again
approaches a nonzero constant in the block length n (Greenhill et al., 2006).
2.4.3 Code ensembles and their properties
We have already seen Shannon’s ensemble as an example of a code ensemble.
With the help of the construction in the previous section, we can also define an
ensemble of irregular LDPC codes. More precisely, an ensemble of irregular LDPC
codes consists of the codes constructed using the configuration model corresponding
to the degree distributions Λ(x) and P(x), where the permutations are drawn
uniformly at random. To simplify the analysis of code ensembles further, we make
the additional simplification that we do not reject codes that have
duplicate edges. Instead we take the number of edges between two nodes modulo
2. This way we may end up with parity-check matrices which are not of full
rank, but the probability of this approaches zero as the block length is increased
(Di et al., 2002). We will denote the ensemble of LDPC codes with block length n
constructed using the configuration model by LDPC(n, Λ, P).
One of the main results is that the codes of an ensemble of irregular codes constructed
using the configuration model are concentrated
around the ensemble average in terms of error-correcting performance. A second, perhaps
more important theorem is the channel coding theorem by Shannon, which states
that there is a limit to the rate at which we can send information over a channel
with a given level of noise. We will present the case for the BSC here.
Before we state the theorems, let us introduce some additional terminology
commonly used when talking about the performance of codes and decoders. The
error-correcting performance is generally stated in terms of the bit-error rate of
an ensemble of codes paired with a decoder. In the case of codes that have, for
example, been defined in standards, we will of course talk about the bit-error
rate of the single code paired with a decoder. The bit-error rate P_b is simply
the probability that a transmitted bit has, after decoding, the wrong value. The
bit-error rate is often abbreviated BER. One often also talks about the block- or
frame-error rate, which we denote by P_B. This is the probability of a decoded
block differing from the transmitted block in at least one bit. This is sometimes
abbreviated FER. Figure 2.5 shows typical behavior of an LDPC code in terms
of the bit-error rate and the block-error rate, as a function of the noise and
the block length. Bits were encoded with codes from the (3,6)-regular ensemble
(without duplicate edges) and transmitted over the BSC. The resulting words were
decoded using the so-called sum-product algorithm. One can see a clear limit
where increasing the block length of the code does not improve error-correction.
This is called the threshold of the code and decoder, which we will define more
precisely with Theorem 2.2. The region where the bit-error rate decreases sharply
towards lower levels of noise is called the waterfall region. Following that, one
can see the so-called error floor. The error floor is the flat region at lower levels
of noise than the waterfall region. It is typical for LDPC codes to exhibit this
error floor, although it is not desirable. The error floor is often caused by some
inherent weakness in the code which leaves some bits particularly prone to being
erroneous (Richardson, 2003).
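The two error rates can be estimated from simulated transmissions as in the following sketch. It assumes that the transmitted and decoded blocks are stored as rows of NumPy arrays. The function names `bsc` and `error_rates` are hypothetical, and no decoder is included, so the example measures the raw channel only.

```python
import numpy as np

def bsc(bits, eps, rng):
    """Flip each bit independently with crossover probability eps."""
    flips = rng.random(bits.shape) < eps
    return bits ^ flips.astype(bits.dtype)

def error_rates(sent, decoded):
    """Return estimates of (P_b, P_B) for arrays of shape (blocks, n)."""
    wrong = sent != decoded
    bit_error_rate = wrong.mean()                 # fraction of wrong bits
    block_error_rate = wrong.any(axis=1).mean()   # blocks with >= 1 wrong bit
    return bit_error_rate, block_error_rate

rng = np.random.default_rng(0)
sent = np.zeros((1000, 64), dtype=np.uint8)   # all-zero codewords
received = bsc(sent, 0.05, rng)
# Without a decoder, the "decoded" word is simply the received word.
p_b, p_B = error_rates(sent, received)
```

Without decoding, the estimated bit-error rate stays close to the crossover probability, while the block-error rate is much larger because a single wrong bit already counts as a block error.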
With the above terminology, we can now state the theorems more precisely.
The theorem regarding concentration around the average code in an ensemble
enables us to choose essentially any code from an ensemble and be nearly certain
that we have picked a code which is representative of the ensemble in general.
We will state the following theorems as presented by Richardson and Urbanke
(2008), omitting the proofs.
Theorem 2.1 (Concentration around the ensemble average)
Let a parity-check matrix H be chosen uniformly at random from an ensemble
LDPC(n, Λ, P) and let transmission occur over the BEC with erasure probability ϵ.
We decode the received word with l iterations of message-passing decoding and
let P_b(H, ϵ, l) denote the final bit-error probability. Then, for a fixed number of
iterations l and for any given δ > 0, there exists an α > 0, α = α(Λ, P, ϵ, δ, l), such
that
\[ \mathbb{P}\left\{ \left| P_b(H,\epsilon,l) - \mathbb{E}_{H' \in \mathrm{LDPC}(n,\Lambda,P)}\left[ P_b(H',\epsilon,l) \right] \right| > \delta \right\} \le e^{-\alpha n}. \]
Figure 2.5: Bit-error rate (left) and block-error rate (right) as a function of the crossover
probability ϵ on the BSC. The following was repeated for each block length 2^i for
i = 10, 11, ..., 20 until 2^30 total bits were transmitted: draw a new code from the
(3,6)-regular ensemble, assume the all-zero codeword and simulate transmission over the BSC,
decode using 20 iterations of the sum-product decoder. The vertical axis is logarithmic.
A bit- or block-error rate of 0 is not plotted.
Shannon’s channel coding theorem gives us limits on how much information
we can hope to send over a channel. In this setting we consider transmission
over the BSC and look at the block error probability after so-called maximum a
posteriori decoding. The binary entropy function is defined by
\[ h(p) = -p \log_2(p) - (1 - p) \log_2(1 - p). \]
Theorem 2.2 (Shannon’s channel coding theorem for the BSC)
Assume transmission over the BSC with crossover probability ϵ. Let P_B(C, ϵ) be
the block-error rate after transmission on the BSC with crossover probability ϵ using
a code C and decoding using maximum a posteriori decoding. If the rate r satisfies
0 < r < 1 − h(ϵ) then
\[ \min_{C \in \mathrm{Shannon}(n,\, 2^{\lfloor rn \rfloor})} P_B(C, \epsilon) \xrightarrow{\; n \to \infty \;} 0. \]
The above theorem says that we can only hope to transmit with a bit-error
rate that approaches zero if the rate of the code is below 1 − h(ϵ) on the BSC. This
quantity is also called the Shannon capacity of the channel. Perhaps more importantly, the
converse also holds: if we transmit at a rate above the capacity of the channel,
the error probability is bounded away from 0 asymptotically in the block length.
It is often more convenient to consider the behavior with a fixed rate r and let
the crossover probability ϵ vary. In the case that the rate is fixed we want to find
the largest ϵ such that the error probability still approaches 0 in the block length.
We will refer to this as the threshold of the channel. That is, the threshold of
the channel gives the limit given any code and decoder with a fixed rate. While
the limit given any code and decoder is useful, we will more often consider the
threshold of a combination of a code and decoder. With a given code and decoder
we then mean by the threshold the largest ϵ such that the error probability
approaches 0 in the block length, using the given code and decoder. Equivalent
results can be shown for the BEC and BAWGNC (Richardson and Urbanke, 2008).
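The threshold of the channel for a fixed rate can be computed numerically from the binary entropy function. The sketch below uses bisection on the interval [0, 1/2], where the capacity 1 − h(ϵ) of the BSC is decreasing. The function names and the tolerance are assumptions made for this example.

```python
import math

def binary_entropy(p):
    """h(p) = -p log2(p) - (1 - p) log2(1 - p), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(eps):
    """Shannon capacity 1 - h(eps) of the BSC."""
    return 1.0 - binary_entropy(eps)

def channel_threshold(rate, tol=1e-12):
    """Largest eps in [0, 1/2] for which the capacity still exceeds the rate."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bsc_capacity(mid) > rate:
            lo = mid   # capacity still above the rate: threshold is further right
        else:
            hi = mid
    return lo

eps_star = channel_threshold(0.5)
```

For rate 1/2, which is the design rate of the (3,6)-regular ensemble, this gives a crossover probability of roughly 0.11. No code and decoder of that rate can have a higher threshold on the BSC.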
Luby et al. (1997, 1998, 2001a,b) as well as Richardson et al. (2001); Richardson
and Urbanke (2001a) presented a majority of the tools used to analyze random
irregular LDPC code ensembles. An important tool is that of density evolution,
which is a method for determining the threshold under message-passing
decoding. An approximation to density evolution was presented by Chung et al.
(2001b). Chung et al. (2001a); Richardson et al. (2000) presented optimized degree
distributions with thresholds approaching the Shannon capacity and Richardson
and Urbanke (2001a) generalized many results to more general channels.
Finally, while Chung et al. (2001a); Richardson et al. (2000) presented
ensembles that approach the Shannon capacity, this was done in a setting where the
number of iterations and the block length tend to infinity. Clearly, neither of
these assumptions is feasible in practice. Degree distributions that approach
the Shannon capacity asymptotically can perform badly with a finite number
of iterations and finite block lengths. For this reason, designing good codes in
the finite setting requires slightly different tools. Richardson et al. (2001) note
that density evolution can be applied to some extent to codes of finite length
which are decoded using a finite number of iterations. The performance of codes
with practical limitations has been studied by Amraoui et al. (2009); Di et al.
(2002); Richardson (2003); Richardson et al. (2002), and some ways to construct
good finite-length codes are presented by Mao and Banihashemi (2001); Yue et al.
(2007). Of particular importance for the error-correcting performance of codes
decoded using message-passing algorithms are so-called stopping sets and short
cycles in the Tanner graph. Since the focus of this thesis is more on encoding
and decoding of given codes, rather than code constructions, we will not go into
detail about what makes a good code. However, the intuition behind the reason
for stopping sets and short cycles being problematic is easy to see. Stopping sets
are subsets of the nodes of a Tanner graph in which it is in some sense difficult
to resolve what the real values of the variable nodes should be. Variable nodes
in a stopping set will therefore often be decoded wrongly, and if the stopping
sets are large the error floor of a code will be higher. We will take a closer look
at message-passing decoders in Chapter 4 and why short cycles in the Tanner
graph are harmful for them.
2.4.4 Quasi-cyclic LDPC codes
Dierent constructions of quasi-cyclic LDPC (QC-LDPC) codes were presented
by Tanner et al. (2004) and Myung et al. (2005). The general structure of the
parity-check matrix of a QC-LDPC code is the following: H consists of several
square submatrices, each of which is either the zero matrix or the identity matrix
with the diagonal shifted cyclically to the right by some amount. More precisely,
for a code of length n, the parity-check matrix consists of submatrices of size
z × z. The code is compactly described by a smaller matrix of size
mM × nM = m
z
× n
z
,
where we assume that z divides bothm and n. The smaller matrix HM is called
the model matrix. Each entry of the model matrix species the kind of submatrix
at that position in H . A non-negative integer entry species an identity matrix
shifted by the given entry, and “−” species a zero matrix.
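Example 4 (WiMAX model matrix)
The rate-3/4 code with block length 2304 and submatrix size 96 in the WiMAX
standard is specified by the 6 × 24 model matrix
\[
H_M = \left(\begin{array}{*{24}{r}}
6 & 38 & 3 & 93 & - & - & - & 30 & 70 & - & 86 & - & 37 & 38 & 4 & 11 & - & 46 & 48 & 0 & - & - & - & - \\
62 & 94 & 19 & 84 & - & 92 & 78 & - & 15 & - & - & 92 & - & 45 & 24 & 32 & 30 & - & - & 0 & 0 & - & - & - \\
71 & - & 55 & - & 12 & 66 & 45 & 79 & - & 78 & - & - & 10 & - & 22 & 55 & 70 & 82 & - & - & 0 & 0 & - & - \\
38 & 61 & 1 & 66 & 9 & 73 & 47 & 64 & - & 39 & 61 & 43 & - & - & - & - & 95 & 32 & 0 & - & - & 0 & 0 & - \\
- & - & - & - & 32 & 52 & 55 & 80 & 95 & 22 & 6 & 51 & 24 & 90 & 44 & 20 & - & - & - & - & - & - & 0 & 0 \\
- & 61 & 31 & 88 & 20 & - & - & - & 6 & 40 & 56 & 16 & 71 & 53 & - & - & 27 & 26 & 48 & - & - & - & - & 0
\end{array}\right). \tag{2.2}
\]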
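The expansion of a model matrix into the full parity-check matrix can be sketched as follows. The sketch assumes that the “−” entry is stored as -1, and the function name `expand_model_matrix` is hypothetical. Any standard-specific rules for adapting the shifts to other submatrix sizes are not modelled, and the toy model matrix is made up for illustration.

```python
import numpy as np

def expand_model_matrix(model, z):
    """Expand a model matrix into a binary parity-check matrix.

    Assumed convention: -1 stands for the "−" entry (zero submatrix),
    a non-negative entry s for the z x z identity shifted right by s.
    """
    model = np.asarray(model)
    rows, cols = model.shape
    H = np.zeros((rows * z, cols * z), dtype=np.uint8)
    identity = np.eye(z, dtype=np.uint8)
    for i in range(rows):
        for j in range(cols):
            shift = int(model[i, j])
            if shift >= 0:
                # Identity with its diagonal shifted cyclically to the right.
                block = np.roll(identity, shift % z, axis=1)
                H[i * z:(i + 1) * z, j * z:(j + 1) * z] = block
    return H

# Hypothetical 2 x 3 model matrix with submatrix size z = 3.
toy_model = [[0, 1, -1],
             [2, -1, 0]]
H = expand_model_matrix(toy_model, z=3)
```

Every row and column of a shifted identity has weight one, so the row and column weights of the expanded matrix equal the number of non-negative entries in the corresponding row and column of the model matrix.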
QC-LDPC codes are convenient because they have a compact description. In
addition, the same model matrix can be used for a range of block lengths. For
example, in the WiMAX standard (IEEE, 2009), the same model matrix defines
codes for block lengths from 576 bits to 2304 bits. The description of QC-LDPC
codes above still leaves room for choosing the model matrix in different ways. One
possibility is presented by Myung et al. (2005). The degrees of the model matrix
are retained when it is expanded to the full parity-check matrix and so Myung
et al. propose to choose an appropriate degree distribution for the model matrix
to obtain a parity-check matrix which has similar properties to an irregular code
with the same degree distribution. The positions of the nonzero block matrices
and the shifts are chosen to maximize the girth, meaning the length of the shortest
cycle, of the resulting parity-check matrix, or more accurately of its Tanner graph.
The parity-check matrix in Example 4 has additional structure on the right-hand
side, which ensures that the code can be encoded in linear time. We will present
the construction in more detail in the following chapter.
2.5 Codes used in practice
LDPC codes have been included in several standards in recent years. We
review some of them here as they are of interest for decoding purposes. A code,
once in a standard, remains fixed for the lifetime of the standard. However,
decoders can be changed independently of the codes as improvements are made
to decoders. Thus it is useful to know the structure of codes used in practice, and
use them as benchmarks for decoding algorithms.
2.5.1 Digital Video Broadcasting
The Digital Video Broadcasting (DVB) standards for terrestrial, satellite and cable
broadcasts (ETSI, 2012, 2013a,b) employ a compact scheme for describing the
codes. The DVB codes are defined by o lists of offsets, each of length o_i. The offsets
determine the positions of the ones in the parity-check matrix. The parity-check
matrix has the following structure:
\[ \begin{pmatrix} H_0 & H_1 & \cdots & H_{s-1} & H_p \end{pmatrix}, \]
where H_p is an m × m matrix with ones only on the full diagonal and in one
position below the diagonal. The matrices H_i for i = 0, 1, ..., s − 1 are each of size
m × 360, where 360 is a constant defined for all LDPC codes in the DVB standards.
The ith list of offsets determines the positions of the ones in H_i. Let b_{i,a} denote the
ath element of the ith list of offsets. Then, for column j of the full parity-check
matrix H, where the column is within a submatrix H_i, the positions of the ones
in that column are determined by
\[ \left( b_{i,a} + (j \bmod 360)\, Q \right) \bmod m, \quad \forall a \in \{1, 2, \ldots, o_i\}. \]
The value Q is another constant defined in the standard which is dependent on
the rate of the code. The right-hand side of each parity-check matrix in the DVB
standards is lower triangular which, as we will see in the next chapter, means
that the code can be encoded in linear time.
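The offset rule above translates directly into code. The following sketch evaluates it for one column. The function name `dvb_column_ones` and all numeric parameters are hypothetical toy values that are not taken from any table in the DVB standards.

```python
def dvb_column_ones(offsets, j, Q, m, group=360):
    """Row positions of the ones in column j of the full matrix H.

    `offsets` is the offset list of the submatrix H_i that contains
    column j; `group` is the constant 360 of the DVB standards.
    """
    return sorted((b + (j % group) * Q) % m for b in offsets)

# Hypothetical toy parameters chosen only to make the arithmetic visible.
toy_offsets = [0, 5]
rows = dvb_column_ones(toy_offsets, j=2, Q=3, m=20)
```

Columns whose indices differ by a multiple of 360 are mapped to the same row positions by this rule, so the list of offsets belonging to the submatrix that contains the column has to be passed in by the caller.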
2.5.2 WiMAX and WiFi
The WiMAX standard defines a few different code types for error-correction.
One of them is LDPC codes (IEEE, 2009). The code type used in the WiMAX
standard is a quasi-cyclic code. It defines codes of lengths between 576 and 2304
bits. Example 4 shows the rate-3/4 code defined in the standard. The standard
defines codes of several rates, including 1/2 and 3/4.
The WiFi standard also uses a QC-LDPC code for error correction. The structure
is the same as in the WiMAX standard. The block lengths are 648, 1296 and
1944 bits with rates of 1/2, 2/3, 3/4 and 5/6.
2.5.3 Ethernet
The Ethernet standard (IEEE, 2012a) also uses LDPC codes for error-correction.
Figure 2.6 shows the parity-check matrix of the rate-0.84 LDPC code defined in
the Ethernet standard.
Figure 2.6: The parity-check matrix of the rate-0.84 LDPC code with block length 2048
defined in the Ethernet standard. Each black dot denotes a 1 in the parity-check matrix.
2.6 Summary
We have now presented the basics of codes and in particular LDPC codes, which
are a class of binary linear codes with sparse parity-check matrices. Particularly
important are the code constructions. The irregular codes based on the
configuration model are the basis for codes that can in general be encoded in linear
time and perform well under message-passing decoding. The tools developed for
irregular codes can also be used for QC-LDPC codes, and these are used in
practice in many standards.
One type of LDPC codes that we have left out are codes based on finite
geometries (Kou et al., 2001). Kou et al. have shown that finite geometry codes
can have good theoretical properties. However, they often contain high degree
variable and check nodes, which increases the complexity significantly (Cho et al.,
2010). To the best of the author’s knowledge, they have also not been included in
any current communications standards.
Chapter 3
Encoding of LDPC codes
Although much of the focus in research of LDPC codes has been on decoding
of the codes, encoding is equally important in terms of time complexity. For
general LDPC codes, no linear-time encoding algorithm is known as of today. The
biggest contribution to date is that of Richardson and Urbanke (2001b). They show
that LDPC codes with a carefully chosen degree distribution can be encoded in
linear time. In addition, they show that these codes are good for message-passing
decoders in the sense that they can approach the Shannon limit using
message-passing decoders. Simpler code constructions can also yield linear time encoding
but may not be as good for error-correction.
Encoding consists of taking k information bits and adding parity bits such
that the information bits and the parity bits together form a codeword. The parity
bits are essentially chosen by first setting the information bits in fixed positions
of the codeword. Once the information bits have been fixed, the parity bits can
be determined by solving a set of linear equations. The naïve method, which we
will present first in Section 3.1, does this by Gaussian elimination. This method
is simple but has quadratic time complexity. In Section 3.2 we will present the
method by Richardson and Urbanke (2001b) where the rows and columns of a
parity-check matrix are only permuted such that the system of linear equations we
are solving is defined by a nearly triangular matrix. Doing this reduces complexity
significantly.
3.1 Encoding using the systematic generator matrix
An LDPC code is defined by its parity-check matrix H of size m × n. By setting H
in an appropriate form we can form what is called a generator matrix of the code.
With the help of the generator matrix G, we can then encode the binary vector u,
also called the information bits, into a codeword x by uG = x. We assume that all
vectors are row vectors. We let k = n − m be the number of information bits.
Let I_m be the identity matrix of size m × m. To begin with, we transform the
parity-check matrix into its systematic form
\[ H = \begin{pmatrix} -P^T & I_m \end{pmatrix}. \]
A parity-check matrix can always be brought into the systematic form using
Gaussian elimination without changing the code. This means that we only use
standard row operations: permutations and additions of rows. We can thus for
the purposes of encoding assume that H is in systematic form. Now, given H, we
denote
\[ G = \begin{pmatrix} I_{n-m} & P \end{pmatrix}. \]
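Any word generated by the generator matrix G is a codeword, which we can see
by checking that the syndrome is the zero vector:
\[
\begin{aligned}
H x^T = H (uG)^T
&= \begin{pmatrix} -P^T & I_m \end{pmatrix} \left( u \begin{pmatrix} I_{n-m} & P \end{pmatrix} \right)^T \\
&= \begin{pmatrix} -P^T & I_m \end{pmatrix} \begin{pmatrix} u & uP \end{pmatrix}^T \\
&= (-uP + uP)^T \\
&= 0^T.
\end{aligned}
\]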
We can also see that the resulting codeword is of the form uG = x = (u x_p) =
(x_s x_p). The first part of the codeword, x_s, has length k and is called the systematic
part of the codeword. It is also exactly the information bits we wish to send. The
second part, x_p, is of length m and contains the parity bits. If we assume that
the parity-check matrix can be brought into systematic form it implies that the
parity-check matrix is of full rank, and if we assume that the parity-check matrix
is of full rank there will be a unique codeword for each word we wish to encode.
We will for the remainder of this thesis assume that the parity-check matrix is
of full rank. As an example, Figure 3.1 shows the generator matrix of the code
specified in the Ethernet standard in systematic form.
While this method can be useful for shorter codes, it becomes impractical for
larger block lengths. For general LDPC codes, the matrix P will be dense, meaning
it will have O(n^2) nonzero elements as a result of the Gaussian elimination.
This leads to O(n^2) time complexity for the encoding due to the matrix-vector
multiplication. Given that an LDPC code by definition only has a linear number
of elements in its parity-check matrix, one could wish that encoding could also
be done in linear time. In the next section we will see how this can be done for
certain codes.
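The systematic encoding procedure of this section can be sketched for a toy code as follows. The sketch works over GF(2), where −P^T = P^T, and it assumes that the rightmost m columns of H are linearly independent so that no column permutations are needed. The function names are hypothetical and the small parity-check matrix is chosen only for illustration.

```python
import numpy as np

def systematic_form(H):
    """Row-reduce H over GF(2) to the form (P^T | I_m).

    Assumption: the last m columns of H are linearly independent.
    """
    H = np.array(H, dtype=int) % 2
    m, n = H.shape
    k = n - m
    for i in range(m):
        col = k + i
        pivots = np.flatnonzero(H[i:, col])
        if pivots.size == 0:
            raise ValueError("a column swap would be needed; not handled here")
        p = i + pivots[0]
        H[[i, p]] = H[[p, i]]            # row permutation
        for r in range(m):
            if r != i and H[r, col]:
                H[r] ^= H[i]             # row addition over GF(2)
    return H

def generator_matrix(H_sys):
    """Build G = (I_k | P) from H = (P^T | I_m)."""
    m, n = H_sys.shape
    k = n - m
    P = H_sys[:, :k].T
    return np.concatenate([np.eye(k, dtype=int), P], axis=1)

H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [0, 1, 1, 0, 1, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
G = generator_matrix(systematic_form(H))
u = np.array([1, 0, 1, 1])
x = u @ G % 2          # codeword (x_s x_p) with x_s = u
```

The multiplication by G is the dense matrix-vector product that causes the quadratic encoding complexity for long codes.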
Figure 3.1: The generator matrix of the rate-0.84 LDPC code with block length 2048
defined in the Ethernet standard. Each black dot denotes a 1 in the matrix.
3.2 Encoding with approximate lower-triangular parity-check matrices
MacKay (1999) introduced the idea of LDPC codes that can be encoded in linear
time with parity-check matrices that are almost lower triangular. In that case
we can perform back-substitution for most of the parity-check bits, but some
additional work still has to be done. Richardson and Urbanke (2001b) expanded
on this idea and generalized the results on when random irregular LDPC codes
can be encoded in linear time. They showed that if the degree distribution of an
ensemble of irregular codes is chosen appropriately, encoding of a code in that
ensemble can in expectation be done in linear time. They also showed that if an
ensemble has a degree distribution for which linear time encoding is possible,
that ensemble of codes will also behave well under message-passing decoding.
This is a particularly interesting result. While the problem of linear time encoding
for general LDPC codes was not solved, the codes that one often wants to use in
practice can be encoded in linear time.
However, with the rising popularity of bit-flipping decoders, the above result
still calls for an answer to the question whether or not all LDPC codes can be
encoded in linear time. Optimal degree distributions for message-passing
algorithms do not necessarily lead to optimal performance for bit-flipping algorithms.
It is also good to note here that the linear time complexity really only applies to
the encoding process, and ignores a preprocessing step which needs to be done
only once for a code.
The encoding method of Richardson and Urbanke has three steps: (i) a
preprocessing step, in which the parity-check matrix is brought into an appropriate
form by only column and row permutations; (ii) a second preprocessing step,
in which a matrix inverse is calculated; and (iii) the actual encoding step, in
which the parity-check matrix from the previous step can be used to encode the
information bits u in linear time. We will begin by looking at the second and
third steps, but it is useful to keep in mind that in the first step we only use row
and column permutations, meaning that the modified parity-check matrix still
has a linear number of nonzero elements.
In the encoding step, we assume that the parity-check matrix has been brought
into the following form:
\[ H = \begin{pmatrix} A & B & T \\ C & D & E \end{pmatrix}. \]
The matrix A is of size (m − g) × k, B is of size (m − g) × g, T is of size (m − g) × (m − g),
C is of size g × k, D is of size g × g and E is of size g × (m − g). The matrix T is a
lower triangular matrix. We say that the parity-check matrix is in an approximate
lower-triangular form. The parameter g is called the gap of the parity-check matrix.
If the gap is zero, encoding can be done simply by back-substitution: first set the
information bits in the first positions of the codeword, after which the parity bits
can be solved by back-substitution. If the gap is of size O(√n), it turns out that
encoding can still be done in linear time. To proceed with the encoding, we first
premultiply H by
\[ \begin{pmatrix} I_{m-g} & 0 \\ -E T^{-1} & I_g \end{pmatrix}, \]
resulting in
\[ \begin{pmatrix} A & B & T \\ -E T^{-1} A + C & -E T^{-1} B + D & 0 \end{pmatrix}. \]
As we are premultiplying by an invertible matrix, the code remains equivalent to
the original code. After this step, encoding corresponds to solving two sets of
linear equations:
\[ A x_s^T + B x_{p_1}^T + T x_{p_2}^T = 0^T, \]
\[ (-E T^{-1} A + C)\, x_s^T + (-E T^{-1} B + D)\, x_{p_1}^T = 0^T. \]
Similarly to the case of systematic encoding, the codeword is x = (x_s x_{p_1} x_{p_2}).
The parity bits are now split up into two parts, x_{p_1} and x_{p_2}, where x_{p_1} is of length
g and x_{p_2} of length m − g.
We then define ϕ = −E T^{-1} B + D and assume that it is invertible. Then, given
ϕ^{-1}, we can solve the first set of parity bits by
\[ x_{p_1}^T = -\phi^{-1} (-E T^{-1} A + C)\, x_s^T. \]
Once we have solved for x_{p_1}, we can determine x_{p_2} by back-substitution from
\[ T x_{p_2}^T = -A x_s^T - B x_{p_1}^T \]
as T is lower triangular.
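The encoding step can be sketched in Python for a toy matrix that is already in approximate lower-triangular form. Over GF(2) all signs vanish. The sketch uses dense NumPy arrays and recomputes ϕ^{-1} on every call for brevity, so it does not run in linear time. It assumes that T has a unit diagonal and that g ≥ 1. The function names and the 3 × 6 example matrix are hypothetical.

```python
import numpy as np

def solve_lower(T, b):
    """Solve T y = b over GF(2) by substitution (T lower triangular, unit diagonal)."""
    y = np.zeros(len(b), dtype=int)
    for i in range(len(b)):
        y[i] = (b[i] + T[i, :i] @ y[:i]) % 2
    return y

def gf2_inv(M):
    """Invert a square matrix over GF(2) by Gauss-Jordan elimination."""
    g = M.shape[0]
    aug = np.concatenate([M % 2, np.eye(g, dtype=int)], axis=1)
    for col in range(g):
        pivots = np.flatnonzero(aug[col:, col])
        if pivots.size == 0:
            raise ValueError("phi is singular over GF(2)")
        p = col + pivots[0]
        aug[[col, p]] = aug[[p, col]]
        for row in range(g):
            if row != col and aug[row, col]:
                aug[row] ^= aug[col]
    return aug[:, g:]

def encode_alt(H, g, u):
    """Encode u with H = (A B T; C D E) in approximate lower-triangular form."""
    H = np.asarray(H, dtype=int) % 2
    m, n = H.shape
    k, t = n - m, m - g
    A, B, T = H[:t, :k], H[:t, k:k + g], H[:t, k + g:]
    C, D, E = H[t:, :k], H[t:, k:k + g], H[t:, k + g:]
    # Preprocessing (done once per code in practice): phi = E T^{-1} B + D.
    TinvB = np.column_stack([solve_lower(T, B[:, j]) for j in range(g)])
    phi_inv = gf2_inv((E @ TinvB + D) % 2)
    s = np.asarray(u, dtype=int) % 2
    # x_p1 = phi^{-1} (E T^{-1} A + C) x_s, using matrix-vector products only.
    v = (E @ solve_lower(T, A @ s % 2) + C @ s) % 2
    p1 = phi_inv @ v % 2
    # x_p2 from T x_p2 = A x_s + B x_p1.
    p2 = solve_lower(T, (A @ s + B @ p1) % 2)
    return np.concatenate([s, p1, p2])

# Toy matrix with k = 3, gap g = 1 and T of size 2 x 2.
H = np.array([[1, 1, 0, 1, 1, 0],
              [0, 1, 1, 0, 1, 1],
              [0, 1, 1, 0, 0, 1]])
x = encode_alt(H, g=1, u=[1, 0, 0])
```

Each step multiplies a matrix by a vector or solves a triangular system, which mirrors the way the method avoids matrix-matrix products during encoding.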
In practice, one performs the premultiplication step by Gaussian elimination.
The matrix ϕ, however, might not be invertible after performing Gaussian
elimination. In that case we permute columns from the left part of the parity-check
matrix into ϕ so that it is invertible. If this is in fact not possible, the parity-check
matrix is not of full rank, but as we mentioned earlier the probability of this
happening decreases with the block length and in practice we need not worry
about this.
On the whole, we now have that if the parity-check matrix is in an approximate
lower-triangular form, and the matrix inverse ϕ^{-1} has been precalculated, we
can perform encoding in O(n + g^2) time. Let us see why this is so. First, inverting
T need not be done explicitly and can be done by back-substitution. Additionally,
since T is sparse this requires O(n) operations. Second, all matrix-vector
multiplications involve sparse matrices with O(n) elements, again requiring O(n)
operations. Finally, we can avoid performing matrix-matrix multiplications and
instead always perform matrix-vector multiplications. The exception to sparse
matrix-vector multiplication is the multiplication by ϕ^{-1}, which is a dense
matrix-vector multiplication. This contributes the g^2 term in the running time.
What remains is to show that the original parity-check matrix can be brought
into the needed approximate lower-triangular form. Richardson and Urbanke
do not show a way to find the minimum gap, but propose a greedy algorithm
which they show is good enough for achieving linear-time encoding complexity
for some codes. The algorithm proceeds in rounds, incrementally building up
the right-hand side lower-triangular matrix T. Following the terminology of
Richardson and Urbanke (2008) we let two parameters, g and t, run during the
algorithm. The current gap is given by g and the iteration number is given by t.
The residual parity-check matrix is at each step the submatrix of H consisting
of rows 1 to m − g − t − 1 and columns 1 to n − t − 1. The residual degree of a
column or row in the residual parity-check matrix is equivalent to the weight of
that column or row in the residual matrix.
The algorithm consists of two main steps, extend and choose. In the extend
step we assume that there exists a column of residual degree one. We choose
one column with residual degree one uniformly at random. Let c be the chosen
column and r the row containing the only nonzero entry of that residual column.
We then swap column c with column n − t − 1 and row r with row m − g − t − 1.
This places the nonzero entry in the lower-right corner of the residual matrix
and extends the diagonal on the right-hand side by one. Finally, t is incremented
by one.
The choose step is performed when there are no columns with residual degree
one. We choose a column uniformly at random from the column or columns with
the minimum residual degree d, and call this column c. Then, choose an arbitrary
row with a nonzero entry in the residual part of column c and call it r. Swap
column c and row r into the lower-right corner of the residual matrix as in the
extend step. Move the remaining d − 1 rows with nonzero entries in column c to
the bottom of the full parity-check matrix. Finally, increment t by one and g by
d − 1.
The algorithm begins by considering the full parity-check matrix H and
setting t = g = 0. The algorithm stops when t + g = m. If there is at least one
column with residual degree one, perform the extend step, and otherwise perform
the choose step. At the end of the algorithm, the resulting parity-check matrix is
in approximate lower-triangular form with gap g.
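The greedy algorithm can be sketched compactly by tracking row and column orderings instead of swapping matrix entries. This sketch is not an efficient implementation. The extend step is treated as the choose step with d = 1, columns of residual degree zero are skipped, and rows that are all-zero in the residual are moved to the gap. The last two choices are assumptions made here, and the function name `greedy_alt` is hypothetical.

```python
import numpy as np

def greedy_alt(H, seed=0):
    """Return (row_perm, col_perm, gap) such that H[row_perm][:, col_perm]
    is in approximate lower-triangular form."""
    rng = np.random.default_rng(seed)
    H = np.asarray(H) % 2
    m, n = H.shape
    res_rows = list(range(m))      # rows of the residual matrix
    res_cols = list(range(n))      # columns of the residual matrix
    diag_rows, diag_cols, gap_rows = [], [], []
    while res_rows:
        sub = H[np.ix_(res_rows, res_cols)]
        deg = sub.sum(axis=0)                      # residual column degrees
        positive = np.flatnonzero(deg > 0)
        if positive.size == 0:
            gap_rows.extend(res_rows)              # assumption, see lead-in
            break
        d = deg[positive].min()
        j = rng.choice(positive[deg[positive] == d])   # uniform among minima
        hit_rows = [res_rows[i] for i in np.flatnonzero(sub[:, j])]
        c, r = res_cols[j], hit_rows[0]
        diag_rows.append(r)                        # (r, c) extends the diagonal
        diag_cols.append(c)
        gap_rows.extend(hit_rows[1:])              # remaining d - 1 rows: gap
        res_cols.remove(c)
        res_rows = [x for x in res_rows if x not in hit_rows]
    # Later diagonal entries sit further up and to the left.
    row_perm = diag_rows[::-1] + gap_rows
    col_perm = res_cols + diag_cols[::-1]
    return row_perm, col_perm, len(gap_rows)

H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]])
row_perm, col_perm, gap = greedy_alt(H, seed=1)
H_alt = H[np.ix_(row_perm, col_perm)]
```

In the permuted matrix, T is the block formed by the first m − g rows and the last m − g columns. Its entries above the diagonal are zero because a chosen column has no ones in the rows that remain in the residual afterwards.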
Finally, for linear time encoding, we would need to show that the gap g will
on average be of size O(√n) if we choose the degree distribution appropriately.
We will not show how to achieve this here because of the lengthy details, but
the general idea is to model the residual degree distributions of the parity-check
matrix and the gap using a set of differential equations along the course of the
greedy upper triangulation algorithm. This way one can arrive at an asymptotic
value for the gap such that elements from the ensemble will have a gap that is
close to the asymptotic value with high probability. Although Richardson and
Urbanke (2001b) give more detailed conditions on when the average gap will be
small enough for linear-time encoding, the intuition is that there needs to exist a
large enough number of variable nodes of degree two. When this holds, there
will more often exist a column of residual degree one in the extend step meaning
that the gap is not increased. If no column of residual degree one exists, the gap
will still often be increased by only one in the choose step because of the large
number of variable nodes with degree two.
3.2.1 Irregular codes with a fixed gap
Richardson and Urbanke (2001b) show that codes with certain irregular
degree distributions can be brought to a form where the parity-check matrix
has a gap which is proportional to √n, and the codes corresponding to those
distributions can therefore be encoded in linear time. Freundlich et al. (2007)
present an alternative approach to achieve this. Instead of letting all configurations
be possible with a given degree distribution pair, they propose to constrain the
construction of the code so that it will have a predetermined gap g. They achieve
this by essentially fixing the diagonal elements of T in the approximate
lower-triangular decomposition to be ones, such that T is of size m − g. After that one
proceeds to again randomly assign edges between variable and check nodes, but
disallowing edges that go above the diagonal in T, enforcing the approximate
lower-triangular form. They show through examples that forcing the gap to be
proportional to √n does not noticeably impact the performance of the code,
while it of course guarantees linear-time encoding. On the other hand, setting
the gap to be 0 or close to 0 does decrease the error-correcting performance of
the codes.
3.2.2 Quasi-cyclic LDPC codes with approximate lower-triangular parity-check matrices
The encoding method of Richardson and Urbanke was applied to quasi-cyclic
codes to achieve linear-time encoding when the cyclic shifts of the submatrices
are chosen appropriately (Myung et al., 2005). The construction leads to the
matrix ϕ being an identity matrix, meaning that the encoding method becomes
a linear-time operation even when one includes preprocessing. For this to hold,
the model matrix specifying the parity-check matrix must be of the form
\[
H_M = \left(\begin{array}{c|cccccc}
 & b_1 & 0 & - & \cdots & - & - \\
 & - & b_2 & 0 & \cdots & - & - \\
 & \vdots & - & b_3 & \cdots & - & - \\
H_I & y & \vdots & \vdots & \ddots & \vdots & \vdots \\
 & \vdots & \vdots & \vdots & \cdots & 0 & - \\
 & - & - & - & \cdots & b_{m-1} & 0 \\
 & x & - & - & \cdots & - & b_m
\end{array}\right),
\]
where the part of the model matrix corresponding to the information bits, H_I, can
be freely chosen. The shift values x, y and b_i for i = 1, ..., m must, however,
fulfill one of the following two criteria:
\[ x \equiv \sum_{i=1}^{m} b_i \bmod L \quad\text{and}\quad y \equiv -\sum_{i=l+1}^{m} b_i \bmod L \tag{3.1} \]
or
\[ \sum_{i=1}^{m} b_i \equiv 0 \bmod L \quad\text{and}\quad x \equiv y + \sum_{i=l+1}^{m} b_i \bmod L. \tag{3.2} \]
We will without proof state that if the shift values fulfill one of the two criteria
(3.1) or (3.2), then ϕ becomes the identity matrix in which case it is trivially
invertible. We also do not need to perform the matrix-vector multiplication which
causes the O(g^2) term in the general encoding algorithm.
The authors performed limited tests with these kinds of codes but showed
that a QC-LDPC code of this kind performed equally well as a non-quasi-cyclic
code with the same degree distribution. The degree distribution of the random
code was an optimized distribution obtained by Richardson and Urbanke (2001b).
The degree distribution for the quasi-cyclic code was chosen to be the same as
for the non-quasi-cyclic code but the coefficients were rounded to fit the block
structure of the code. The LDPC codes in the WiMAX standard are quasi-cyclic
LDPC codes with the structure presented here, allowing them to be encoded in
linear time. In the code presented in (2.2), the set of equations in (3.1) is satisfied
by setting b_1 = x = 48 and y = b_2 = b_3 = ... = b_6 = 0.
3.3 Summary
While the state of encoding of LDPC codes is in some aspects a solved problem
in practice, the lack of a linear-time encoding algorithm for all types of LDPC
codes still leaves something to be desired. The QC-LDPC construction is a
convenient construction that has proven itself to work well enough for inclusion in
communications standards. In addition, it allows for a compact implicit
description of the parity-check matrix. The irregular ensembles of LDPC codes which
can be encoded in linear time have the additional benefit of working well with
message-passing decoders. However, being able to use arbitrary LDPC codes that
can still be encoded in linear time may allow the use of codes which have better
error-correcting capabilities.
Chapter 4
Decoding of LDPC codes
Decoding of LDPC codes has, unlike encoding, essentially been a linear-time
operation since Gallager’s introduction of LDPC codes. More precisely, there
are linear-time algorithms for approximate decoding. Optimal decoding, on the
other hand, is difficult. Berlekamp et al. (1978) showed that a certain decision
problem related to binary linear codes is NP-complete. The approximate schemes
are, however, often good enough in practice and can in some cases approach the
capacity of the channel asymptotically in the block length n. In addition, any
decoding algorithm that is superlinear is bound to be impractical for arbitrarily
large block lengths n, as in most situations the decoding throughput needs to equal
the channel throughput. The reason for wanting to use longer block lengths is the
improvement in error-correcting performance as the block length is increased.
One of the reasons for this is that a longer code is more robust to noise in the
sense that variations in the level of noise are less significant for larger block
lengths. For example on the BSC, we can consider the fraction of bits that are
flipped by the channel to be essentially constant for large enough block lengths.
Another aspect is that correlated noise is more likely to corrupt whole blocks if
the block length is short. A code with a longer block length is robust to longer
bursts of errors.
One should keep in mind that when we talk about decoding we mostly talk
about the process of recovering a codeword given the word received from the channel.
The full process of decoding of course also includes recovering the original
information bits but, as we saw in Chapter 3, doing this is easy as the information
bits will generally be an explicit part of the codeword.
Gallager (1962) introduced two types of decoders: (i) a simple, so-called
bit-flipping decoder, which was then deemed insufficient in its ability to decode;
and (ii) a message-passing decoder which performed much better but was more
complex. The bit-flipping decoders and the message-passing decoders are the two
major classes of decoders being actively researched at this point. A third class of
decoders under active research is based on linear programming, as
first presented by Feldman (2003); Feldman et al. (2005). To the best of the author’s
knowledge, these decoders are to date neither efficient enough nor good enough
at correcting errors compared to the two other classes of decoders. However,
their benefits lie in easy analysis with established tools from linear programming
theory, and work is also being done to reduce the complexity of the decoders
(Burshtein, 2009; Burshtein and Goldenberg, 2011; Goldin and Burshtein, 2013;
Vontobel and Kötter, 2007). In this thesis, however, we will restrict ourselves to
message-passing decoders and bit-flipping decoders.
The message-passing decoders are in general better at correcting errors and
were the main type of decoders considered when LDPC codes were rediscovered
in the 1990s. However, more recently bit-flipping decoders have received much
more attention because of their low complexity, which is beneficial for efficient
hardware implementations. To clarify, when we talk about complexity in the
context of decoders we refer to the arithmetic complexity of the decoders, not the
asymptotic complexity. In addition, the decoders used are in essence linear-time
algorithms by choice, and the error-correcting performance and arithmetic
complexity are then improved given the constraint that the decoding must be
linear-time. The main focus for message-passing decoders has been to decrease
complexity without losing too much in terms of error-correction performance. The
focus for bit-flipping decoders has been the opposite: improve the error-correction
performance while keeping the complexity low. Currently there is essentially a
continuum of decoders all making a trade-off between complexity and
error-correction performance.
To be precise, we are assuming an MBRAM machine model for the computations:
a random access machine (RAM) with addition, subtraction, multiplication,
division and bitwise Boolean instructions, where we assume that all operations
take $O(\log n)$ time, where n is the size of the largest input operand or output
involved in the instructions. In practice, however, we assume that instructions
take $O(1)$ time with fixed-width inputs and outputs (van Leeuwen, 1990).
We will look at the decoders roughly in order of complexity. After a brief look at
naïve decoding, we begin by considering bit-flipping decoders in Section 4.2 because
of their relative simplicity and to get a first idea of how decoding can be done.
We will then move on to message-passing decoders, which require some background in
graphical models. We will review the most important aspects of factor graphs to present
the sum-product algorithm, which is also known as belief propagation. However,
the message-passing decoders can be intuitively understood even without the
factor graph framework.
4.1 Naïve decoding
Before looking at the practical algorithms, we note that we can perform
exact maximum-likelihood decoding of LDPC codes, with the caveat that it has
exponential time complexity. For completeness we state the simple algorithm here.
If we assume the BSC, maximum-likelihood decoding for a code C simply consists
of finding the codeword x′ which is nearest to the received word y. We then hope
that the decoded codeword x′ is the same as the sent codeword x. The decoded
codeword is chosen to be
\[
\mathbf{x}' = \arg\min_{\mathbf{z} \in C} w(\mathbf{z} - \mathbf{y}).
\]
Here we clearly have an exponential number of candidates. Trying to be more
clever and beginning the search with the words closest to the received word
is also of no help, as we want the code to have a minimum distance which
increases linearly with the block length. This way we will still end up looking at
an exponential number of candidates.
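To make the exponential cost concrete, the search can be written as a brute-force loop. The following Python sketch is an illustration only: the (7,4) Hamming parity-check matrix is an assumed toy code (it is not an LDPC code from this thesis), and the function names are hypothetical.

```python
from itertools import product

# Parity-check matrix of the (7,4) Hamming code, used here as a toy example.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def is_codeword(H, z):
    """A word is a codeword if every parity check sums to zero modulo 2."""
    return all(sum(h * b for h, b in zip(row, z)) % 2 == 0 for row in H)


def ml_decode_bsc(H, y):
    """Return a codeword nearest to y in Hamming distance by exhaustive search."""
    n = len(y)
    best, best_dist = None, n + 1
    for z in product((0, 1), repeat=n):      # 2^n candidate words
        if not is_codeword(H, z):
            continue
        dist = sum(zi != yi for zi, yi in zip(z, y))   # weight of z - y
        if dist < best_dist:
            best, best_dist = list(z), dist
    return best


# All-zero codeword with the third bit flipped by the channel.
print(ml_decode_bsc(H, [0, 0, 1, 0, 0, 0, 0]))  # [0, 0, 0, 0, 0, 0, 0]
```

The loop visits all 2^n words, which is harmless for n = 7 but hopeless for the block lengths of practical LDPC codes.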
4.2 Bit-flipping decoding
Bit-flipping decoders take a straightforward approach to decoding. Given the
received word y, one can start flipping bits in y according to some appropriate
order and criterion in the hope that one will eventually arrive at the original
codeword x. In a sense we wish to perform a local greedy search in promising directions
around the received word. At its simplest this approach leads to low-complexity
decoders, which however lack in error-correcting performance compared for
example to the sum-product decoder. Adding more information or noise to the
decisions of a bit-flipping decoder can already lead to much better performance
while keeping the complexity of the decoder fairly low.
4.2.1 Gallager’s bit-flipping decoder
Gallager’s bit-flipping decoder (Gallager, 1962) is the first and perhaps the simplest
bit-flipping decoder for LDPC codes. It proceeds as follows: for each variable node,
count the number of adjacent check nodes that are unsatisfied or, equivalently,
the number of unsatisfied parity-check equations in which the variable is involved;
if the number of unsatisfied checks is greater than or equal to some chosen
constant K, flip the value of the bit. The algorithm proceeds through all variables
for a fixed number of rounds, or until all parity-check equations are satisfied.
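The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Gallager's original formulation: all bits are examined and flipped simultaneously in each round (the description above does not fix the order), the (7,4) Hamming matrix is an assumed toy code, and the function names are hypothetical.

```python
# Toy parity-check matrix (the (7,4) Hamming code); an illustrative assumption.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def syndrome(H, x):
    """s[a] = 1 if parity check a is unsatisfied by the word x."""
    return [sum(h * b for h, b in zip(row, x)) % 2 for row in H]


def gallager_bit_flip(H, y, K, max_rounds=20):
    """Flip every bit that has at least K unsatisfied adjacent checks."""
    x = list(y)
    for _ in range(max_rounds):
        s = syndrome(H, x)
        if not any(s):                      # all parity checks satisfied
            break
        # Number of unsatisfied checks adjacent to each variable node.
        counts = [sum(s[a] for a in range(len(H)) if H[a][i])
                  for i in range(len(x))]
        x = [bit ^ int(c >= K) for bit, c in zip(x, counts)]
    return x


# The last bit (the only one of degree 3) is in error.
print(gallager_bit_flip(H, [0, 0, 0, 0, 0, 0, 1], K=3))  # [0, 0, 0, 0, 0, 0, 0]
```

In this example the erroneous bit is the only one with three unsatisfied checks, so a single round with K = 3 recovers the all-zero codeword.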
There are two main considerations in bit-flipping decoders. The first is how
to decide whether a bit should be flipped or not. The second is in which order bits
are considered, and whether multiple bits are considered at the same time. The second
aspect will be referred to as a schedule in the following.
4.2.2 Weighted bit-flipping decoders
According to Kou et al. (2001), the essence of the weighted bit-flipping decoder
was already introduced by Kolesnik (1971), not long after Gallager’s initial work
on LDPC codes. Kou et al. (2001) then reintroduced the decoder, and it has been
modified and improved in the subsequent years. Strictly speaking, the weighted
bit-flipping (WBF) decoder refers to the most basic version. However, we will collect
all modifications under the same term, and present the decoder below in a general
form which includes most modifications. The main difference from Gallager’s
bit-flipping decoder is that we take soft channel values into account. In other words,
we assume transmission for example over the BAWGNC such that the channel outputs are
real-valued. This allows us to distinguish the variable values we can
be fairly certain about (those that have large $|y_i|$) from those which could likely
have been either −1 or 1 (those that have small $|y_i|$). This is in contrast to the
BSC, where we only know that any bit could have been flipped with the same
probability ϵ.
The quantity used for deciding if a bit should be flipped is in its most general
form the following:
\[
E_i = \sum_{a \in N(i)} (2 s_a - 1) w_a - \alpha v_i,
\]
where $w_a$ for all $a = 1, 2, \ldots, m$ and $v_i$ for all $i = 1, 2, \ldots, n$ depend on the
algorithm. The syndromes $s_a$ are calculated after hard thresholding of the channel
values. The parameter $\alpha \ge 0$ is a free parameter chosen separately for best
error-correcting performance. The sum essentially counts the checksums
adjacent to a variable node i that are unsatisfied, weighted by the quantity $w_a$,
which is a measure of the reliability of the check node a; satisfied checksums enter
with the opposite sign. The term $\alpha v_i$ adjusts for
the reliability of the received message at the variable node. In general, a bit is
flipped if the quantity $E_i$ is high enough.
We begin by noting that Gallager’s bit-flipping decoder fits into this framework,
only it does not make use of the weighting. If we set $w_a \equiv 1$ and $\alpha v_i \equiv 0$, and
flip a bit if $E_i$ is larger than some predetermined threshold, we essentially get
Gallager’s bit-flipping decoder: for a variable node of degree d with b unsatisfied
checks we then have $E_i = 2b - d$, so a threshold on $E_i$ is a threshold on b.
This demonstrates that Gallager’s algorithm is indeed perhaps the simplest
bit-flipping decoder and sets the stage for the changes
and improvements that have been proposed in newer bit-flipping decoders.
Modified weighted bit-flipping decoder
In the modified weighted bit-flipping decoder (Zhang and Fossorier, 2004) the
terms are as follows:
\[
w_a = \min_{i \in N(a)} |y_i|, \tag{4.1}
\]
\[
v_i = |y_i|. \tag{4.2}
\]
Each term $w_a$ now weights the parity of the check node a by the reliability of its
least reliable adjacent variable node. This means that if at least one of the variable
nodes adjacent to a check node has a channel value that is close to zero, we will
trust the parity of the check node less in making a decision on whether to flip a
bit or not. If, however, all variable nodes adjacent to a check node have highly
reliable channel values, we can trust the value of the check node more and it
influences $E_i$ more. The parameter α is left as a free parameter.
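As a worked illustration, the quantities in (4.1) and (4.2) and the resulting decision values can be computed directly from the soft channel values. The sketch below makes several assumptions that are not fixed by the text: bit 0 is mapped to the channel symbol +1 and bit 1 to −1, α = 1, and the (7,4) Hamming matrix serves as a toy code. The function name is hypothetical.

```python
# Toy parity-check matrix (the (7,4) Hamming code); an illustrative assumption.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def mwbf_decision_values(H, y, alpha):
    """Return (E, w): decision values E_i and check weights w_a from (4.1), (4.2)."""
    m, n = len(H), len(y)
    hard = [0 if value > 0 else 1 for value in y]   # assumed mapping: +1 <-> bit 0
    s = [sum(H[a][i] * hard[i] for i in range(n)) % 2 for a in range(m)]
    w = [min(abs(y[i]) for i in range(n) if H[a][i]) for a in range(m)]  # (4.1)
    v = [abs(value) for value in y]                                       # (4.2)
    E = [sum((2 * s[a] - 1) * w[a] for a in range(m) if H[a][i]) - alpha * v[i]
         for i in range(n)]
    return E, w


# All-zero codeword sent as +1 symbols; the third value is weak and has the wrong sign.
y = [0.9, 1.1, -0.2, 0.8, 1.2, 1.0, 0.7]
E, w = mwbf_decision_values(H, y, alpha=1.0)
print(w)                                  # [0.2, 0.2, 0.7]
print(max(range(len(E)), key=lambda i: E[i]))   # 2
```

Both checks that contain the weak third value are unsatisfied, and it is the only position with a positive decision value, so it is the natural candidate for flipping.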
Gradient-descent bit-flipping decoder
The bit-flipping decoder can also be presented as a gradient-descent optimizer
(Wadayama et al., 2007). In that case
\[
w_a = 1, \qquad v_i = x_i y_i.
\]
As $x_i \in \{-1, 1\}$, the term $v_i$ is similar to the form proposed by Zhang and Fossorier
(2004). As long as the value of $x_i$ has not yet been flipped, the signs of $x_i$ and $y_i$ are the
same, so $x_i y_i = |y_i|$. The two formulations differ when $x_i$ does
not have the same sign as the channel value $y_i$. In that case $E_i$ is increased,
since $-\alpha x_i y_i > 0$, which favors flipping the bit back. The gradient-descent
formulation has a slightly more meaningful
interpretation. It maximizes the correlation between the received values y and
the decoded bits x′, penalized by unsatisfied checksums (see appendix A.1 for a
derivation of the terms). The free parameter α was not included in the original
formulation of the gradient-descent bit-flipping decoder, so α = 1 in this case.
Bootstrapped weighted bit-flipping decoder
Bootstrapped weighted bit-flipping (Nouh and Banihashemi, 2002) is a modification
of the normal weighted bit-flipping (WBF) decoder by Kou et al. (2001). The actual
decoding steps are identical to the WBF decoder, but the initialization of the variable
values is modified. A free parameter γ > 0 is chosen as a threshold. All variable
nodes which have a channel value with $|y_j| < \gamma$ are deemed unreliable. All other variable
nodes are reliable. A check node is reliable with respect to a variable node if all its
other adjacent variable nodes are reliable, and unreliable otherwise. Let $N_r(i)$ be the
reliable check node neighbors of a variable node i. The unreliable variable nodes are
then re-initialized with the following values:
\[
y_i := y_i + \sum_{a \in N_r(i)} \min_{j \in N(a) \setminus i} |y_j| \prod_{j \in N(a) \setminus i} \operatorname{sgn}(y_j),
\]
where sgn(x) is the sign function, which takes the value $x/|x|$ if $|x| > 0$, and 0
otherwise. We can see that the term $\prod_{j \in N(a) \setminus i} \operatorname{sgn}(y_j)$ essentially says
what value the variable node i should have according to the check node a. This
is again weighted by the smallest reliability of the other variable nodes adjacent to the
check node. The unreliable check nodes can be thought of as having 0 as the
smallest channel value of an adjacent variable node and are thus excluded from
the sum. The channel value of the variable node i is then adjusted according to
the reliable check nodes. We will see later that this corresponds to one iteration of
min-sum decoding, a more powerful message-passing decoder. The authors found
that performing this bootstrapping step improved the performance compared to
only using bit-flipping, while requiring little additional complexity as the step is
only performed on the first iteration. Kou et al. (2001) proposed a similar scheme
which they termed hybrid decoding, where one begins with the more powerful
sum-product decoder, which we will see in Section 4.3, for a small fixed number
of iterations and then continues with a simpler bit-flipping decoder.
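The bootstrapping step can be sketched as follows. This is an illustration under explicit assumptions rather than the cited authors' implementation: a variable node is taken to be unreliable when the magnitude of its channel value is below γ, a check node is treated as reliable for a variable node i when all of its other neighbors are reliable, and the (7,4) Hamming matrix is an assumed toy code. The function names are hypothetical.

```python
import math

# Toy parity-check matrix (the (7,4) Hamming code); an illustrative assumption.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def sgn(x):
    """Sign function: 1, -1 or 0."""
    return (x > 0) - (x < 0)


def bootstrap(H, y, gamma):
    """Re-initialize the unreliable channel values using the reliable checks."""
    m, n = len(H), len(y)
    reliable = [abs(value) >= gamma for value in y]
    y_new = list(y)
    for i in range(n):
        if reliable[i]:
            continue                       # only unreliable values are updated
        for a in range(m):
            if not H[a][i]:
                continue
            others = [j for j in range(n) if H[a][j] and j != i]
            if not all(reliable[j] for j in others):
                continue                   # check a is unreliable for variable i
            magnitude = min(abs(y[j]) for j in others)
            sign = math.prod(sgn(y[j]) for j in others)
            y_new[i] += magnitude * sign
    return y_new


y = [0.9, 1.1, -0.2, 0.8, 1.2, 1.0, 0.7]
print(bootstrap(H, y, gamma=0.5)[2] > 0)  # True
```

In this example the weak third value, which has the wrong sign, is pulled to a positive value by its two reliable checks before any bit-flipping starts; all other values are left unchanged.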
4.2.3 Schedules for bit-flipping decoders
The most common schedule for actually flipping a bit is the following (as presented
by Kou et al. (2001)):
1. If all checksums are satisfied, stop.
2. For all variable nodes, calculate $E_i$.
3. Flip the value of the variable node for which $E_i$ is highest. That is, flip the
value of the node $\arg\max_{i \in \{1,2,\ldots,n\}} E_i$.
4. Return to 1.
An alternative schedule is, instead of flipping only the node with the highest
$E_i$, to flip all variable nodes which have $E_i$ greater than some threshold θ (Wu et al.,
2007). The two schedules presented here are convenient to perform in parallel,
as one can compute $E_i$ independently for each variable node and then flip each
variable node above the threshold independently. We can note here that if θ = 0
and if $w_a$ and $v_i$ are set as in (4.1) and (4.2) the decoder is called a majority-logic
decoder (Kou et al., 2001).
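The two schedules can be sketched side by side. The following Python illustration is a simplification under stated assumptions: it uses the unweighted special case $w_a = 1$, $\alpha v_i = 0$ on hard values rather than soft channel values, the (7,4) Hamming matrix is an assumed toy code, and the function names and the `schedule` parameter are hypothetical.

```python
# Toy parity-check matrix (the (7,4) Hamming code); an illustrative assumption.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def decision_values(H, x):
    """E_i for the unweighted special case w_a = 1 and alpha * v_i = 0."""
    s = [sum(h * b for h, b in zip(row, x)) % 2 for row in H]
    return [sum(2 * s[a] - 1 for a in range(len(H)) if H[a][i])
            for i in range(len(x))]


def is_codeword(H, x):
    return all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)


def decode(H, y, schedule="single", theta=0, max_iters=50):
    x = list(y)
    for _ in range(max_iters):
        if is_codeword(H, x):                  # step 1: all checksums satisfied
            break
        E = decision_values(H, x)              # step 2: compute all E_i
        if schedule == "single":               # step 3: flip the arg max only
            i = max(range(len(x)), key=lambda k: E[k])
            x[i] ^= 1
        else:                                  # alternative: flip all E_i > theta
            x = [bit ^ int(e > theta) for bit, e in zip(x, E)]
    return x                                   # step 4 is the loop itself


received = [1, 0, 0, 0, 0, 0, 0]               # first bit in error
print(decode(H, received, schedule="single"))              # [0, 0, 0, 0, 0, 0, 0]
print(decode(H, received, schedule="threshold", theta=0))  # [0, 0, 0, 0, 0, 0, 0]
```

In both branches the decision values are computed from the same snapshot of the word, which is what makes these schedules straightforward to parallelize.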
A sequential schedule is such that only one variable is considered at a time,
and flipped immediately if it is to be flipped. One then continues to the next
variable node. This can be done by cycling through all variable nodes for some
number of iterations. A more efficient way of doing this is to only ever consider
those variables which have at least one unsatisfied checksum, with the help of
appropriate data structures. This speeds up decoding in a sequential
implementation, assuming that variables with zero unsatisfied checks should never
be flipped (Vanek and Farkas, 2009).
The sequential schedule is beneficial in speeding up the convergence of the
decoding process. The reason for the improved convergence is that the variable
bits are essentially using new information much more often. In a sequential
schedule, the variable bits can use the updated values immediately after a single
new variable bit has been processed. Compare this with the common approach
of considering all variable bits at the same time, where new information is received
only after all variable bits have been processed once.
Sequential scheduling for a bit-flipping decoder is called shuffled bit-flipping
by Zhang et al. (2007). Shuffling refers to a schedule initially proposed for
message-passing decoders (Zhang and Fossorier, 2002). Although the sequential schedule
is beneficial from the point of view of the absolute number of iterations, it is
not possible to properly parallelize a decoder with a sequential schedule. For
this reason the parallel schedule, where all bits are considered at the same time and
some bits are then flipped, is preferred for high-throughput implementations.
However, one can also choose an intermediate schedule between the two extremes.
One can process the variable bits in groups, such that the bits are processed
independently in parallel within the groups but sequentially between the groups.
This approach was described by Ismail et al. (2013) for bit-flipping decoders. Once
again, the approach was first presented for message-passing decoders, where it
is referred to as parallel shuffled decoding (Zhang and Fossorier, 2002), or more
recently and commonly as layered decoding (Hocevar, 2004). Layered decoding
was also independently presented by Mansour and Shanbhag (2002), who called
it Turbo Decoding Message Passing, due to the relation to the schedule used in
the original Turbo decoder (Berrou et al., 1993).
4.2.4 Stochastic bit-flipping decoders
While the deterministic bit-flipping decoders presented above can perform reasonably
well, they can often get stuck in local optima where there are still unsatisfied
checks and no way to escape from the local optimum. To this end, stochastic
bit-flipping decoders add some amount of noise to the decisions to escape local
optima and to improve the error-correcting performance of the decoder.
Miladinovic and Fossorier (2005) presented a modified version of Gallager’s
original bit-flipping decoder. Instead of flipping a bit with certainty when the
number of bad checks exceeds some threshold, the decoder only flips these
variable nodes with some probability p < 1. This does not lead to a significant
increase in error-correcting performance, but can lead to faster convergence
and can take the decoding process away from local optima and cycles.
In the same vein, Zhou et al. (2007) presented another bit-flipping decoder
only slightly modified from Gallager’s original algorithm, but already with more
promising results. While strictly deterministic, the algorithm is close to the idea
of a stochastic bit-flipping decoder. The algorithm is modified as follows: instead
of setting a fixed threshold K such that a variable bit is flipped only if the number
of bad checks adjacent to it is at least K, we define a sequence of thresholds
$(K_0, K_1, \ldots, K_{T-1})$. A variable bit is then flipped on iteration t if the number of bad
checks is at least the threshold $K_{t \bmod T}$. As an example, the work suggested the
sequence (3, 2), meaning that the threshold alternates between 3 and 2 from one
iteration to the next. In relation to stochastic decoding, this is on average the same as
flipping bits with at least 3 bad checks with certainty, and bits with 2 bad checks with
probability 0.5. The authors noted a clear improvement compared to Gallager’s
bit-flipping decoder, but with performance still far from the sum-product and
min-sum decoders.
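The cycling threshold is a small change to a fixed-threshold decoder. The sketch below is an illustration only, not the cited authors' implementation: it assumes a parallel schedule in which all bits are examined simultaneously, uses the (7,4) Hamming matrix as an assumed toy code, and the function name is hypothetical.

```python
# Toy parity-check matrix (the (7,4) Hamming code); an illustrative assumption.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def multi_threshold_bit_flip(H, y, thresholds=(3, 2), max_iters=20):
    """Bit-flipping where iteration t uses the threshold K_{t mod T}."""
    x = list(y)
    for t in range(max_iters):
        s = [sum(h * b for h, b in zip(row, x)) % 2 for row in H]
        if not any(s):                               # all checks satisfied
            break
        K = thresholds[t % len(thresholds)]          # K_{t mod T}
        bad = [sum(s[a] for a in range(len(H)) if H[a][i])
               for i in range(len(x))]               # bad checks per bit
        x = [bit ^ int(c >= K) for bit, c in zip(x, bad)]
    return x


received = [0, 0, 1, 0, 0, 0, 0]   # third bit (degree 2) in error
print(multi_threshold_bit_flip(H, received, thresholds=(3, 2)))  # [0, 0, 0, 0, 0, 0, 0]
print(multi_threshold_bit_flip(H, received, thresholds=(3,)))    # [0, 0, 1, 0, 0, 0, 0]
```

On this toy input the fixed threshold 3 never flips anything, because the erroneous bit only has two adjacent checks, whereas the alternating sequence reaches the all-zero codeword after three flipping rounds.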
A more recent, and successful, attempt at a stochastic bit-flipping decoder was
presented by Sundararajan et al. (2014). They proposed a stochastic version of
the gradient-descent bit-flipping decoder where a random, normally distributed
term $q_i$ is added to the decision quantity as follows:
\[
E_i = \sum_{a \in N(i)} (2 s_a - 1) w_a - \alpha v_i + q_i,
\]
where
\[
q_i \sim \mathcal{N}(0, \eta^2)
\]
for some variance $\eta^2$.
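A minimal sketch of this noisy decision quantity is given below, using the gradient-descent terms $w_a = 1$ and $v_i = x_i y_i$ with bits represented as values in {−1, 1}. The (7,4) Hamming matrix, the parameter defaults and the function name are assumptions made for the illustration, and a seeded generator is used only so that runs are reproducible.

```python
import random

# Toy parity-check matrix (the (7,4) Hamming code); an illustrative assumption.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def noisy_decision_values(H, x, y, alpha=1.0, eta=0.0, rng=None):
    """E_i = sum_a (2 s_a - 1) - alpha * x_i * y_i + q_i with q_i ~ N(0, eta^2)."""
    rng = rng or random.Random(0)
    m, n = len(H), len(x)
    s = []
    for a in range(m):
        parity = 1
        for i in range(n):
            if H[a][i]:
                parity *= x[i]             # bits are +1 / -1
        s.append(1 if parity == -1 else 0)  # s_a = 1 means check a is unsatisfied
    E = []
    for i in range(n):
        checks = sum(2 * s[a] - 1 for a in range(m) if H[a][i])   # w_a = 1
        q = rng.gauss(0.0, eta)            # standard deviation eta
        E.append(checks - alpha * x[i] * y[i] + q)
    return E


y = [0.9, 1.1, -0.2, 0.8, 1.2, 1.0, 0.7]
x = [1 if value > 0 else -1 for value in y]      # hard decisions in {-1, 1}
E = noisy_decision_values(H, x, y, eta=0.0)      # eta = 0 removes the noise
print(max(range(len(E)), key=lambda i: E[i]))    # 2
```

With η = 0 the sketch reduces to the deterministic gradient-descent quantity; a positive η perturbs each $E_i$ independently, which is what allows the decoder to leave a local optimum.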
The decoder was presented in both a single-bit and a multi-bit version, meaning
that either the bit i with the highest $E_i$ is flipped, or all bits whose $E_i$
exceeds a certain threshold are flipped, in the sign convention used for $E_i$ above.
The authors call the decoder a noisy gradient-descent bit-flipping decoder.
As originally presented by Wadayama et al. (2007), one can also flip a fixed
number of bits at a time. This was also called a multi-bit version. This type of
multi-bit bit-flipping decoder, just like the threshold-based multi-bit decoder,
tends to converge faster toward the optimum. However, it is prone to oscillations
as it approaches the local maximum. The threshold-based multi-bit decoder can
also be prone to this but to a much smaller degree (Sundararajan et al., 2014).
Sundararajan et al. also reported that it can be beneficial to employ a so-called
mode-switching strategy, where the decoder begins decoding with the multi-bit
version, but changes to the single-bit version after some steps to ensure proper
convergence. The noisy gradient-descent bit-flipping decoder performed in some
cases even comparably to the more complex min-sum decoder which we will
present later.
4.2.5 Stochastic bit-flipping decoder with hard channel values
In this thesis we propose a variation of a stochastic bit-flipping decoder which
only operates on hard values. That is, it operates on channel values from the BSC
or channel values from the BAWGNC quantized to one bit. The idea behind the
decoder is similar to the stochastic gradient-descent bit-flipping decoder. The
main differences are that this decoder does not use soft channel values and that
the probability for flipping a bit is parameterized differently.
In the current variation, which we will from now on call the stochastic
bit-flipping decoder, as opposed to the stochastic gradient-descent bit-flipping decoder,
the decision to flip a bit is based on three factors: (i) the degree of the variable
node, if the code is not regular; (ii) the number of unsatisfied checks; and (iii)
the channel value of the bit we are considering. Let us denote the degree of
the variable node by d, the number of adjacent unsatisfied checks by b and the
XOR of the channel value and the current value of the bit by e. The value of e
is then 1 if the channel value and the current value differ, and 0 otherwise.
We then decide to flip the current bit with some probability $p_{e,b,d}$. We will
assume that we never flip a bit if it has 0 adjacent unsatisfied checks, as doing so
will rarely be beneficial. Thus we define $p_{e,b,d}$
for all e = 0, 1 and b = 1, 2, ..., d and for each variable node degree d occurring
in the code. In general, the probabilities $p_{e,b,d}$ should be optimized individually
for best error-correcting performance. In practice, this optimization
can be difficult to perform if the number of probabilities to consider is high.
In addition, while for example degree distributions of irregular codes can be
optimized fairly efficiently using density evolution for message-passing decoders,
no such convenient tools for analyzing the performance of bit-flipping decoders
exist yet. For this reason, evaluating the performance of a set of parameters
simply requires running the decoder, which can be time-consuming. To avoid this
we parameterize the flip probabilities. The probability of flipping a bit is
\[
p_{e,b,d} = \min\left\{ \exp\left( \frac{-2d - 2b + \theta(2e - 1)}{T} \right),\, 1 \right\},
\qquad
\theta = \frac{T}{2} \log\left( \frac{1 - p}{p} \right),
\]
where T > 0 and p > 0 are free parameters. The above parameterization leads to
flipping a bit with higher probability if the variable node has several unsatisfied
checks as opposed to fewer. In addition, if the current value of the bit differs
from the channel value, the bit is flipped with higher probability, so that the
decoder in general trusts the channel value more than the current value
of the bit. Table 4.1 shows an example of the flip probabilities using the above
parameterization with the values T = 0.8 and p = 0.12.
Table 4.1: Flip probabilities for the stochastic bit-flipping decoder when T = 0.8 and
p = 0.12. The values are shown for a variable node of degree 3 with b unsatisfied adjacent
checks, for both values of e. The value of e is 1 if the current value of the variable
is not equal to the channel value. When b = 0 a bit is never flipped.

         e = 1    e = 0
b = 1    0.602    0.011
b = 2    1.000    1.000
b = 3    1.000    1.000
While similar to the stochastic gradient-descent bit-flipping decoder, the
inspiration for the decoder comes from an algorithm for the satisfiability (SAT)
problem presented by Alava et al. (2008). The proposed decoding algorithm can
be run with various schedules. In its simplest form it can be run by sequentially
considering each bit in the word and flipping the bit if necessary. A main insight
in the work by Alava et al. (2008) was that of focusing the search in promising
directions. They call this algorithm focused Metropolis search, or FMS, for SAT. In
the current stochastic decoding algorithm this can be implemented by only ever
considering variable nodes which have at least one adjacent check which
is unsatisfied. If the variable node has zero unsatisfied checks, the bit is never
flipped and so such variable nodes need not be considered. This can speed up
the convergence of the decoder, but only applies straightforwardly to a serial
implementation. The idea is essentially the same as that presented by Vanek and
Farkas (2009). The same downside applies, namely that there is no known convenient
way of parallelizing the implementation while using focusing. In Chapter 5 we
only consider a sequential schedule using an implementation for the GPU. In a
situation where one is constrained to serial execution, the idea of focusing can
be beneficial, but for most practical decoding purposes the implementation must
be parallel in some way to achieve high enough throughputs.
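A serial decoder with focusing can be sketched as follows. This is an illustration under several assumptions and not the implementation of Chapter 5: the next bit is drawn uniformly at random among the variable nodes with at least one unsatisfied check, the flip probabilities are supplied by the caller as a function `flip_prob(e, b, d)` (a hypothetical interface, so no particular parameterization is built in), and the (7,4) Hamming matrix is an assumed toy code.

```python
import random

# Toy parity-check matrix (the (7,4) Hamming code); an illustrative assumption.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def stochastic_bit_flip(H, y, flip_prob, max_steps=1000, seed=0):
    """Serial stochastic bit-flipping on hard values with focusing.

    flip_prob(e, b, d) must return the probability of flipping a bit of
    degree d with b unsatisfied checks and channel disagreement e.
    """
    rng = random.Random(seed)
    m, n = len(H), len(y)
    checks_of = [[a for a in range(m) if H[a][i]] for i in range(n)]
    x = list(y)
    s = [sum(H[a][i] * x[i] for i in range(n)) % 2 for a in range(m)]
    for _ in range(max_steps):
        # Focusing: only bits with at least one unsatisfied check are candidates.
        # (Recomputed from scratch here for clarity; a real implementation
        # would maintain this set incrementally.)
        candidates = [i for i in range(n) if any(s[a] for a in checks_of[i])]
        if not candidates:
            break                              # all checks satisfied
        i = rng.choice(candidates)
        b = sum(s[a] for a in checks_of[i])    # unsatisfied adjacent checks
        d = len(checks_of[i])                  # variable node degree
        e = x[i] ^ y[i]                        # 1 if bit differs from channel value
        if rng.random() < flip_prob(e, b, d):
            x[i] ^= 1
            for a in checks_of[i]:             # update the affected syndromes
                s[a] ^= 1
    return x


# Hypothetical rule for the demonstration: flip only when b >= 3.
rule = lambda e, b, d: 1.0 if b >= 3 else 0.0
print(stochastic_bit_flip(H, [0, 0, 0, 0, 0, 0, 1], rule))  # [0, 0, 0, 0, 0, 0, 0]
```

Updating only the syndromes adjacent to the flipped bit keeps the work per step proportional to the node degree, which is the part of the idea that carries over to larger codes.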
4.3 Message-passing decoding
Message-passing decoding is, despite the improvements in bit-flipping decoders,
still the preferred choice of decoder in hardware implementations, as is evident
from the comparatively larger number of published hardware implementations using
message-passing decoders. A message-passing decoder is also recommended
in the DVB-T standard (ETSI, 2013a). The sum-product decoder, which is a type
of message-passing decoder, is still the decoder that in linear time comes closest
to an optimal decoder in terms of error-correcting performance.
The class of message-passing decoders operates on the simple principle that
messages are sent between the nodes of the Tanner graph, where the messages
represent some kind of belief in what the values of the variable bits are. The
messages are passed from nodes to their neighboring nodes, where new values
are calculated, and new messages are again passed on. Operating in this way, it
is possible to construct good decoders for LDPC codes.
Once again, Gallager presented already in 1962 a version of a message-passing
decoder for the BSC which is in essence the same as those in use today. We will,
however, begin by looking at the sum-product algorithm, which is more generally
an algorithm for performing inference on graphical models. The graphical model
in the case of decoding of linear codes is roughly the Tanner graph of a code,
with some minor additions.
A message-passing decoder most similar to its current form was initially
presented for decoding of Turbo codes (Berrou et al., 1993). It was called iterative
decoding, as it performs several rounds of decoding. The decoding algorithm was
then found to be largely identical (McEliece et al., 1998) to what is called belief
propagation in AI, as presented by Pearl (1982). Belief propagation is the same
algorithm as the sum-product algorithm. Hagenauer and Papke (1994) presented
the so-called Viterbi algorithm (Viterbi, 1967) for turbo codes and later presented
the sum-product algorithm again in a more familiar form (Hagenauer et al., 1996)
for general block codes. Wiberg (1996); Wiberg et al. (1995) give a good summary
of the connections between the different proposed decoding algorithms at the
time when LDPC codes were being rediscovered. Kschischang et al. (2001) present
another thorough overview of the various forms of the same algorithm.
Bahl et al. (1974) proposed an optimal decoding algorithm for general linear
block codes but noted that it is impractical because of the exponential time
and space complexity of the algorithm. Later work has found the same: optimal
decoding is NP-hard when the so-called factor graph describing a linear block
code has cycles (Lauritzen and Spiegelhalter, 1988). Pearl (1982) also assumed an
acyclic graph for belief propagation, which is the case in which belief propagation
is optimal. It is therefore good to remember that all the following message-passing
algorithms are essentially suboptimal, but still perform well despite the existence
of cycles in the graph.
4.3.1 Exact inference on factor graphs and the sum-product algorithm
We will now give a short introduction to factor graphs, where we derive the
general sum-product algorithm for factor graphs that are trees, after which we
consider the special case of decoding of linear block codes with factor graphs.
This will largely follow Kschischang et al. (2001) and Richardson and Urbanke
(2008). Another comprehensive work on probabilistic methods in relation to
coding is by MacKay (2003).
First we introduce some notation. A function
\[
g(x_1, x_2, \ldots, x_n)
\]
is written equivalently as
\[
g(\mathbf{x}),
\]
where x is the vector of variables $(x_1, x_2, \ldots, x_n)$. We will write the sum over the
variables in x as
\[
\sum_{\mathbf{x}} g(\mathbf{x}) = \sum_{x_1, x_2, \ldots, x_n \in S} g(x_1, x_2, \ldots, x_n),
\]
where S is the domain of each $x_i$. The domain is assumed to be the same for all $x_i$.
That is, $\mathbf{x} \in S^n$ where n is the length of x. For example, for binary codes S = {0, 1}
or {−1, 1}. We write the marginal of g(x) with respect to some variable $x_i$ which
appears in x as
\[
\sum_{\mathbf{x} \setminus x_i} g(\mathbf{x}) = \sum_{x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n \in S} g(x_1, x_2, \ldots, x_n),
\]
where with slight abuse of notation we mean by $\mathbf{x} \setminus x_i$ the vector x with $x_i$ removed.
We will sometimes write
\[
\sum_{\mathbf{x} \setminus x_i} g(\mathbf{x}) = \sum_{\mathbf{x} \setminus x_i} g(x_i, \mathbf{x} \setminus x_i)
\]
to make it explicit that the variable with respect to which we are marginalizing
appears in x.
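The marginal defined above can be computed directly, if inefficiently, by enumerating all assignments. The following Python sketch illustrates the notation only; the function name `marginal`, the 0-based index i and the example function g are assumptions made for the illustration.

```python
from itertools import product


def marginal(g, n, i, S=(0, 1)):
    """Marginal of g with respect to x_i (0-based index i).

    Returns a dict mapping each value of x_i to the sum of g over all
    other variables, enumerating all |S|^n assignments.
    """
    result = {value: 0 for value in S}
    for x in product(S, repeat=n):
        result[x[i]] += g(*x)
    return result


# Hypothetical example function of two binary variables.
def g(x1, x2):
    return (1 + x1) * (1 + 2 * x2)


print(marginal(g, n=2, i=0))  # {0: 4, 1: 8}
```

The result is a function of the remaining variable $x_i$ alone, which is why it is returned as one number per value in S.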
Suppose then that we have a function g on the variables $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
which we assume has the factorization
\[
g(\mathbf{x}) = \prod_{a=1}^{m} g_a(\mathbf{x}_a). \tag{4.3}
\]
The $\mathbf{x}_a$ are sub-vectors of x. We can represent the factorization (4.3) of g with the
help of a factor graph. The factor graph is a bipartite graph consisting of factor
nodes and variable nodes. The factor graph has one factor node for each factor
in (4.3), and one variable node for each variable in x. We draw the factor nodes
as squares and the variable nodes as circles. A factor node a corresponding to a
factor $g_a(\mathbf{x}_a)$ is connected to a variable node i corresponding to a variable $x_i$ by
an edge if and only if $x_i$ appears in $\mathbf{x}_a$. In other words, the neighboring variable
nodes of a factor node a in the factor graph correspond exactly to the variables
which appear in $\mathbf{x}_a$.
Example 5 (Factorization of a function)
The factor graph of the function
\[
g(\mathbf{x}) = g(x_1, x_2, x_3, x_4, x_5) = g_1(x_1)\, g_2(x_2)\, g_3(x_1, x_2, x_3)\, g_4(x_3, x_4)\, g_5(x_3, x_5) \tag{4.4}
\]
is shown in Figure 4.1. Note that we have drawn the factor graph laid out as a tree
with $x_1$ chosen as the root.
Figure 4.1: The factor graph of the function g in (4.4) drawn as a rooted tree with $x_1$ as
the root.
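The factor graph of (4.4) can also be written down as a small data structure. The Python sketch below is an illustration only: the string labels and the function names are assumptions, and the tree test relies on the standard fact that a connected graph with one edge fewer than nodes is a tree.

```python
# Variables appearing in each factor of (4.4); the string labels are illustrative.
factors = {
    "g1": ["x1"],
    "g2": ["x2"],
    "g3": ["x1", "x2", "x3"],
    "g4": ["x3", "x4"],
    "g5": ["x3", "x5"],
}


def neighbors(factors):
    """Adjacency of the bipartite factor graph: one edge per (factor, variable)."""
    adj = {}
    for a, variables in factors.items():
        for x in variables:
            adj.setdefault(a, set()).add(x)
            adj.setdefault(x, set()).add(a)
    return adj


def is_tree(adj):
    """A graph is a tree if it is connected and has |V| - 1 edges."""
    edges = sum(len(nbs) for nbs in adj.values()) // 2
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(adj) and edges == len(adj) - 1


adj = neighbors(factors)
print(sorted(adj["x3"]))  # ['g3', 'g4', 'g5']
print(is_tree(adj))       # True
```

The graph has ten nodes and nine edges and is connected, so it is a tree, which is the property the efficient marginalization relies on.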
The general problem we wish to solve in order to perform decoding is to
compute the marginal of g with respect to some variable $x_i$. More precisely, we
wish to compute
\[
\sum_{\mathbf{x} \setminus x_i} g(\mathbf{x}). \tag{4.5}
\]
We will eventually show that so-called maximum a posteriori decoding can be
formulated as computing a marginal as above. In the case of maximum a posteriori
decoding of binary linear block codes each variable takes one of two values, and
naïvely performing the summation would require summing over $2^{n-1}$ terms. For
the case of decoding we will assume that the variables take values in {−1, 1}.
In the case that the factor graph corresponding to g is a tree, that is, a connected
acyclic graph, we can do the marginalization more efficiently. To see the relation
between the marginalization of g with respect to $x_i$ and the factor graph we can
draw the factor graph as a rooted tree with the variable node i as the root, as we
have done in Figure 4.1. Figure 4.2 shows the factor graph of a generic function
g as a rooted tree. We will use the same notation in the derivation beginning
in (4.6) as in Figure 4.2. For a rooted tree with root i, we say that the factor or
variable node j is a child of the factor or variable node k if k is adjacent to j on
the unique path from the root i to j. The node k is then called the parent of j.
The children of the root node i are then exactly the neighbors of i. The children
of a node which is not the root node are the neighbors of the node excluding
its parent node. A node j is a descendant of a node k if k lies on the unique path
from the root i to j. A leaf node is a node which has degree one. We will denote
the vector containing all variable nodes which are descendants of the node i by
$\mathbf{z}_i$. If i is a variable node we include $x_i$ in $\mathbf{z}_i$ as well. We will occasionally refer to
variable nodes i and factor nodes a by the corresponding variables $x_i$ or factors
$g_a$, respectively, to ease the reading.
Example 6 (Root, parents, children, descendants and leaves of a tree)
In the factor graph shown in Figure 4.1 the variable node x_1 is the root of the
tree. The children of the root are the factor nodes g_1 and g_3, since these are the
only factors in (4.4) that depend on x_1. The variable node x_5
is a descendant, but not a child, of the variable node x_3. The factor node g_3 is the
parent of the variable node x_3. Finally, the variable node x_5 is a leaf node.
To begin with, we consider each child a of the root i as corresponding to
a single factor G_a. A factor G_a is a function of the parent variable x_i and all
descendant variables z_a. The function g may then also be written as
\[
  g(x) = \prod_{a \in N(i)} G_a(x_i, z_a). \tag{4.6}
\]
The factorization (4.6) is in general not the same factorization as that in (4.3), but
each factor G_a will consist of several factors appearing in (4.3). Only in the case
that all factors in (4.3) are functions of the variable x_i with respect to which we
are marginalizing are the two factorizations (4.6) and (4.3) equal. Since the factor
graph is a tree, the variables in z_a for all a ∈ N(i) form a partition of the variables
in x \ x_i, meaning they are pairwise disjoint and together contain all the variables
in x \ x_i. We can thus rewrite the marginalization using the distributive law as
\[
  \sum_{x \setminus x_i} g(x_i, x \setminus x_i)
    = \sum_{x \setminus x_i} \prod_{a \in N(i)} G_a(x_i, z_a)
    = \prod_{a \in N(i)} \sum_{z_a} G_a(x_i, z_a). \tag{4.7}
\]
The result is that the marginal of g with respect to x_i is the product of the marginals
of each of the factors G_a.
[Figure 4.2: The factor graph of a generic function g that factorizes into multiple factors.
The variable node neighbors of a are denoted by x_a, the descendant variable nodes of
the factor node a are denoted by z_a, and the descendants of the variable node j, including
j itself, are denoted by z_j.]
Next, let us consider the marginal \sum_{z_a} G_a(x_i, z_a) corresponding to one of
the children a of the root i. Without loss of generality, G_a can be factorized
further. The factor G_a then has a factor which is called the kernel. The kernel
H(x_a) = H(x_i, x_a \ x_i) of G_a is the only function in the factorization of G_a which
depends on the parent x_i. In addition, it depends on the child variables of a,
but not on the descendant variables of a. The kernel is also exactly the factor
corresponding to the child a of the root i in the factorization (4.3), so we can
write it as g_a(x_a). In addition, G_a may consist of another set of factors, one for
each child variable node j of a. The factor corresponding to the child j is denoted
by H_j. Each H_j is a function of the variable x_j, which corresponds to the child j
of a, and all descendant variable nodes z_j of j. The factorization of G_a can then
be written as
\[
  G_a(x_i, z_a) = H(x_i, x_a \setminus x_i) \prod_{j \in N(a) \setminus i} H_j(x_j, z_j)
                = g_a(x_a) \prod_{j \in N(a) \setminus i} H_j(x_j, z_j).
\]
The marginal \sum_{z_a} G_a(x_i, z_a) which we are now considering can then be
rewritten as
\[
  \sum_{z_a} g_a(x_a) \prod_{j \in N(a) \setminus i} H_j(x_j, z_j).
\]
We can then use the distributive law again to rewrite the marginal. Doing so we
get
\[
  \sum_{z_a} G_a(x_i, z_a)
    = \sum_{z_a} g_a(x_a) \prod_{j \in N(a) \setminus i} H_j(x_j, z_j)
    = \sum_{x_a \setminus x_i} g_a(x_i, x_a \setminus x_i)
      \prod_{j \in N(a) \setminus i} \sum_{z_j} H_j(x_j, z_j). \tag{4.8}
\]
Now we see that the terms \sum_{z_j} H_j(x_j, z_j) are of the same form as the one we
started with for marginalizing g in (4.5), so we can apply the above recursively to
each smaller marginal. Applying this until we reach the leaf nodes we reduce the
sums so that each sum is over at most a number of variables equal to the
maximum degree of the graph.
Example 7 (Rearranging the marginal of a function)
We may wish to compute the marginal of g(x) in (4.4) in Example 5 with respect to
the variable x_1. In that case we can rearrange the sum with the help of the distributive
law to get:
\[
  \sum_{x \setminus x_1} g(x)
    = g_1(x_1) \sum_{x_2,x_3} g_3(x_1,x_2,x_3)\, g_2(x_2)
      \sum_{x_4} g_4(x_3,x_4) \sum_{x_5} g_5(x_3,x_5). \tag{4.9}
\]
The factor graph in Figure 4.1 is drawn with x_1 as the root.
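The rearrangement in (4.9) can be checked numerically. The following is a minimal sketch that compares the naïve sum over all variables except x_1 with the rearranged form. The numeric factor tables are hypothetical and chosen only for illustration; only the structure of the factors is taken from (4.4).

```python
import itertools

# Hypothetical factor tables over the alphabet {-1, 1}; any positive
# functions would do for checking the identity.
DOMAIN = (-1, 1)

def g1(x1): return 0.9 if x1 == 1 else 0.1
def g2(x2): return 0.6 if x2 == 1 else 0.4
def g3(x1, x2, x3): return 1.0 if x1 * x2 * x3 == 1 else 0.2
def g4(x3, x4): return 0.7 if x3 == x4 else 0.3
def g5(x3, x5): return 0.8 if x3 == x5 else 0.5

def g(x1, x2, x3, x4, x5):
    """The full function (4.4) as a product of its five factors."""
    return g1(x1) * g2(x2) * g3(x1, x2, x3) * g4(x3, x4) * g5(x3, x5)

def marginal_brute_force(x1):
    """Left-hand side of (4.9): sum g over all 2^4 values of x2..x5."""
    return sum(g(x1, *rest) for rest in itertools.product(DOMAIN, repeat=4))

def marginal_rearranged(x1):
    """Right-hand side of (4.9): sums pushed inside the product."""
    total = 0.0
    for x2 in DOMAIN:
        for x3 in DOMAIN:
            s4 = sum(g4(x3, x4) for x4 in DOMAIN)
            s5 = sum(g5(x3, x5) for x5 in DOMAIN)
            total += g3(x1, x2, x3) * g2(x2) * s4 * s5
    return g1(x1) * total

for x1 in DOMAIN:
    print(x1, marginal_brute_force(x1), marginal_rearranged(x1))
```

Both functions should agree up to floating-point rounding, while the rearranged version never sums over more than two variables at a time.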
The decomposition we have just shown in (4.7) and (4.8) gives rise to a
convenient way to compute the marginals by sending messages between nodes,
namely the sum-product algorithm. We first consider the single marginal of g
with respect to x_i that we have been considering thus far. The intuition behind
the sum-product algorithm is that since we assumed that the factor graph is a
(rooted) tree we can begin at the leaf nodes by sending suitable messages from
the leaf nodes to their parents. Once a non-leaf node has received all messages
from its children it will compute a new message and pass it to its parent node.
Finally, at the root node we simply take the product of the incoming messages.
In the general sum-product algorithm, we send functions f_{a→i}(x) from a factor
node a to a variable node i, and functions v_{i→a}(x) from variable nodes i to factor
nodes a. In practice each message is stored as a table of the function's values over
the domain of x_i. Let us look at how we can compute a marginal by
only passing messages in the following example.
Example 8 (Computing the marginal)
The decomposition of (4.4) is shown in (4.9). We now compute the marginal of
(4.4) with respect to x_1 by instead passing messages towards the root in the
corresponding rooted tree in Figure 4.1. We begin at the leaf variable nodes x_4 and x_5. At
these nodes we simply send the constant function with value 1 to their respective
parent nodes. The factor nodes g_4 and g_5 have now received all messages from
their respective child nodes, so at g_4 and g_5 we compute
the marginals \sum_{x_4} g_4(x_3,x_4) and \sum_{x_5} g_5(x_3,x_5), respectively. That is, at
the factor node g_4 we have computed the marginal of the product of the kernel at g_4 and
the incoming messages from the children of g_4 with respect to x_3. In this case the
message from the single child node was the constant function with value 1. At the
factor node corresponding to g_5 we have done the corresponding operations.
The newly computed marginals at g_4 and g_5 are then sent to the common parent
node x_3. The variable node x_3 has now received messages from all its children. At x_3
we then compute the pointwise product of the messages from its children, that is
\[
  \sum_{x_4} g_4(x_3,x_4) \sum_{x_5} g_5(x_3,x_5).
\]
We can see that we have now computed the rightmost two sums in the decomposition
in (4.9).
We continue with the leaf node g_2. In this case the node is a factor node, so we
simply send the function itself to its parent. The variable node x_2 then receives g_2 and
computes the pointwise product of the messages coming from its child nodes. Since it
only has one child, it simply passes g_2 on to its parent g_3. Now g_3 has received all its
incoming messages. At g_3 we then again take the product of the incoming messages
from its child nodes and the kernel and marginalize the product with respect to the
parent node x_1. At g_1 we again simply send the function itself since it is a factor
node and leaf node. Finally, at x_1 we only need to take the pointwise product of the
two incoming messages from g_1 and g_3 and we will have computed the marginal
of (4.4) with respect to x_1.
Following the intuition shown in Example 8, we see that the correct way to
send a message from a leaf node that is a variable node to its parent node is to
send the constant function with value 1. At a leaf node that is a factor node we
instead send the factor corresponding to the leaf node. At a non-leaf, non-root
variable node j we perform a pointwise multiplication of the incoming functions
from the child factors and send the function to the parent factor a. More precisely,
we send
\[
  v_{j \to a}(x_j) = \prod_{b \in N(j) \setminus a} f_{b \to j}(x_j) \tag{4.10}
\]
from j to a. At a non-leaf factor node a with parent variable x_j we take the product
of the kernel g_a and the incoming messages v_{k→a}, each of which is a function of the
corresponding child variable x_k. We then marginalize the product with respect to x_j,
that is, we sum over all variables in x_a except x_j. More
precisely, we send
\[
  f_{a \to j}(x_j) = \sum_{x_a \setminus x_j} g_a(x_a)
                     \prod_{k \in N(a) \setminus j} v_{k \to a}(x_k)
\]
from a to j. To finish the marginalization we take at the root node i the pointwise
product of all incoming messages. We get
\[
  \sum_{x \setminus x_i} g(x) = \prod_{b \in N(i)} f_{b \to i}(x_i). \tag{4.11}
\]
We can thus, using the rules in (4.10) and (4.11), compute the marginal of a generic
function g(x) with respect to a variable x_i.
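The message-passing rules can be written down almost verbatim as two mutually recursive functions. The sketch below is illustrative only: it uses the graph structure of (4.4), but the numeric factor tables, the dictionary layout `FACTORS`, and all function names are assumptions made for this example. Messages are stored as tables indexed by the values in {−1,1}.

```python
import itertools

DOMAIN = (-1, 1)

# Factor graph of (4.4): factor name -> (variables it touches, function).
# The numeric tables are hypothetical; only the structure is from the text.
FACTORS = {
    "g1": (("x1",), lambda x1: 0.9 if x1 == 1 else 0.1),
    "g2": (("x2",), lambda x2: 0.6 if x2 == 1 else 0.4),
    "g3": (("x1", "x2", "x3"), lambda a, b, c: 1.0 if a * b * c == 1 else 0.2),
    "g4": (("x3", "x4"), lambda a, b: 0.7 if a == b else 0.3),
    "g5": (("x3", "x5"), lambda a, b: 0.8 if a == b else 0.5),
}

def factor_neighbors(var):
    """Names of all factors whose scope contains `var`."""
    return [name for name, (scope, _) in FACTORS.items() if var in scope]

def var_to_factor(var, factor):
    """Rule (4.10): product of messages from all other neighboring factors."""
    msg = {x: 1.0 for x in DOMAIN}  # a leaf variable sends the constant 1
    for other in factor_neighbors(var):
        if other != factor:
            incoming = factor_to_var(other, var)
            for x in DOMAIN:
                msg[x] *= incoming[x]
    return msg

def factor_to_var(factor, var):
    """Kernel times incoming messages, summed over all variables but `var`."""
    scope, func = FACTORS[factor]
    others = [v for v in scope if v != var]
    incoming = {v: var_to_factor(v, factor) for v in others}
    msg = {}
    for x in DOMAIN:
        total = 0.0
        for values in itertools.product(DOMAIN, repeat=len(others)):
            assignment = dict(zip(others, values))
            assignment[var] = x
            term = func(*(assignment[v] for v in scope))
            for v in others:
                term *= incoming[v][assignment[v]]
            total += term
        msg[x] = total
    return msg

def marginal(var):
    """Rule (4.11): product of all incoming factor-to-variable messages."""
    result = {x: 1.0 for x in DOMAIN}
    for factor in factor_neighbors(var):
        incoming = factor_to_var(factor, var)
        for x in DOMAIN:
            result[x] *= incoming[x]
    return result

print(marginal("x1"))
```

The recursion terminates because the graph is a tree. Calling `marginal` for every variable recomputes many messages, which is the redundancy that message reuse avoids.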
In general we often want to compute the marginal of g with respect to all
variables in x in turn. In such a case we could naïvely perform the above
message-passing rules separately for each variable. A better way to do it is to reuse messages,
since many of them will be identical for marginals of g with respect to different
variables. To do this, we will again follow the rules in (4.10) and (4.11) but we
will not have a designated root node. We will begin by passing messages from
each leaf node to its only neighbor. At a non-leaf node we send messages to all
its neighbors. More specifically, at a variable node i we will send a message to
a neighboring factor node a only when all incoming messages from the factor
nodes b ∈ N(i) \ a have arrived at i. The same holds for messages sent from
non-leaf factor nodes to variable nodes. Note, however, that we do not need to
send messages to leaf nodes that are factor nodes since we only perform the final
marginalization at variable nodes. Proceeding this way we send at most
one message in each direction along every edge, after which we can perform the final
marginalization step (4.11) at each variable node separately.
4.3.2 The sum-product algorithm for decoding
Let us return to the actual problem at hand, namely that of decoding of LDPC
codes. What we wish to compute, for each bit in a received word, is the maximum
a posteriori (MAP) estimate. That is, we want to choose
\[
  \hat{x}_i = \arg\max_{x_i} p(x_i \mid y)
            = \arg\max_{x_i} \sum_{x \setminus x_i} p(x \mid y)
            = \arg\max_{x_i} \sum_{x \setminus x_i} p(x)\, p(y \mid x) \tag{4.12}
\]
as the value for the ith bit. The vector y contains the channel values (y_1, y_2, ..., y_n).
The posterior probability p(x|y) follows from rewriting the joint probability
p(x,y) in two ways in terms of conditional probabilities as
\[
  p(x,y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x).
\]
Rearranging the second equality we get the posterior probability:
\[
  p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}.
\]
This is also known as Bayes’ theorem. The important parts of the posterior
are (i) the prior p (x), and (ii) the likelihood p (y|x). The denominator p (y) is a
constant with the given channel values y and serves as a normalization term.
When maximizing (or minimizing) we can ignore constant scaling factors, so
p (y) is left out in the MAP estimate in (4.12). Thus we only need to compute the
prior and the likelihood.
For the prior of a sent word we assume that each codeword is equally likely to
have been sent and a non-codeword is never sent. In that case the prior is equal
to the characteristic function of a code C , again ignoring a constant scaling factor.
We will use Iverson’s bracket notation in the following. An expression [P] is 1 if
the proposition P is true and 0 otherwise. The characteristic function of a code C
is defined for all x ∈ {−1,1}^n by
\[
  \chi_C(x) = \prod_{a=1}^{m} \Bigl[ \prod_{i \in N(a)} x_i = 1 \Bigr] \propto p(x).
\]
The function χ_C(x) is then 1 if and only if every parity check is satisfied, that is,
if the product of the ±1 values in each check is 1 or, equivalently, the modulo-2 sum
of the corresponding bits is 0. We see that the Tanner graph of a code is exactly the
factor graph of the characteristic function of a code, with the convention of square
nodes for factors and circular nodes for variables. The likelihood is
\[
  p(y \mid x) = \prod_{i=1}^{n} p(y_i \mid x_i)
\]
since we assume that the noise of each received channel value y_i is independent.
Note that the likelihood is a function of x only since the channel values y are
given. Finally, the posterior probabilities are given by
\[
  p(x \mid y) \propto p(x)\, p(y \mid x)
    = \prod_{a=1}^{m} \Bigl[ \prod_{i \in N(a)} x_i = 1 \Bigr]
      \cdot \prod_{i=1}^{n} p(y_i \mid x_i).
\]
The decision for a single variable i is then
\[
  \hat{x}_i = \arg\max_{x_i} \sum_{x \setminus x_i}
      \prod_{a=1}^{m} \Bigl[ \prod_{j \in N(a)} x_j = 1 \Bigr]
      \cdot \prod_{j=1}^{n} p(y_j \mid x_j).
\]
This way we choose for each bit the value that is most likely given the channel
values and we minimize the bit error rate. Notice that we now have two
sets of factors: one set corresponding to the characteristic function and one set
corresponding to the likelihoods of the channel values. Thus the factor graph
needs to be slightly modified from the plain Tanner graph. We add a factor node
representing the factor p(y_i|x_i) for all i = 1,2,...,n. We will call the factors
p(y_i|x_i) the channel factors and the factors [\prod_{i \in N(a)} x_i = 1] the check factors. The
factor graph used in decoding of the code in (2.1) is shown in Figure 4.3.
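For a very small code the bitwise MAP decision can be evaluated directly by summing the product of check factors and channel factors over all words. The sketch below does exactly that. The toy parity checks in `CHECKS` are hypothetical and are not the code in (2.1), and the binary symmetric channel with crossover probability 0.1 is an assumption chosen for illustration.

```python
import itertools

# Hypothetical toy code: each tuple lists the variables in one parity check.
CHECKS = [(0, 1, 2), (2, 3, 4)]
N = 5

def chi(x):
    """Characteristic function: 1 if every check has product +1, else 0."""
    for check in CHECKS:
        prod = 1
        for i in check:
            prod *= x[i]
        if prod != 1:
            return 0
    return 1

def likelihood(y_i, x_i, p=0.1):
    """Assumed channel: binary symmetric with crossover probability p."""
    return 1 - p if y_i == x_i else p

def bitwise_map(y):
    """Evaluate the bitwise MAP decision by summing over all 2^N words."""
    estimate = []
    for i in range(N):
        score = {-1: 0.0, 1: 0.0}
        for x in itertools.product((-1, 1), repeat=N):
            weight = chi(x)                      # check factors
            for j in range(N):
                weight *= likelihood(y[j], x[j])  # channel factors
            score[x[i]] += weight
        estimate.append(1 if score[1] >= score[-1] else -1)
    return estimate

received = [1, 1, -1, 1, 1]   # all-ones codeword with bit 2 flipped
print(bitwise_map(received))
```

The cost grows as 2^n, which is why the message-passing formulation is needed for codes of practical length.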
We now know that if the factor graph of the code is a tree we can perform
MAP decoding using the message-passing rules presented in (4.10) and (4.11).
Following the message-passing rules, the variable-to-check messages for decoding
are the same as the generic variable-to-factor messages as stated in (4.10), namely
\[
  v_{i \to a}(x_i) = \prod_{b \in N(i) \setminus a} f_{b \to i}(x_i). \tag{4.13}
\]
[Figure 4.3: The factor graph used in decoding of the code in (2.1). The three groups of
nodes are labeled variable nodes, channel factors, and check factors.]
Note that since the channel factors are leaf nodes, the messages in (4.13) are only
sent from variable nodes to check factors. For the factor-to-variable messages
we make a distinction between the channel factors and check factors, and state
separate message-passing rules for them. The check-to-variable messages are
\[
  f_{a \to i}(x_i) = \sum_{x_a \setminus x_i}
      \Bigl[ \prod_{j \in N(a)} x_j = 1 \Bigr]
      \prod_{j \in N(a) \setminus i} v_{j \to a}(x_j). \tag{4.14}
\]
We will denote a message going from a channel factor to a variable node i simply
by fi (xi ) as each channel factor is connected to only one variable node. We send
the message
fi (xi ) = p (yi |xi )
to a variable node i from the channel factor adjacent to i .
Unfortunately in the case of LDPC codes, the factor graphs are in general not
trees. The cycles in a factor graph which is not a tree prevent us from directly
applying the sum-product update rules meant for trees in (4.10)–(4.11). To do
exact inference on factor graphs with cycles one can for example use the junction
tree algorithm, which produces a new acyclic graph on which we can again use
the sum-product algorithm. This, however, is not practical in general for graphs
with cycles (Lauritzen and Spiegelhalter, 1988). Another option is to ignore the
fact that the sum-product algorithm should be applied only when the graph is
acyclic, and apply the sum-product rules to a cyclic factor graph nonetheless.
This is, in fact, the approach taken in decoding and has been shown to lead to
good results in practice, assuming that the graph does not contain many short
cycles. We have just concluded that the sum-product update rules cannot be
applied directly to a cyclic graph. What we can do, however, is to first initialize
all messages to some value, after which all incoming messages at each node
are defined and we can apply the usual sum-product update rules. This way of
applying the sum-product to a factor graph with cycles is sometimes called loopy
belief propagation. If the Tanner graph has many short cycles, a message sent
along an edge on a short cycle will more often use information that has been
sent along the same edge earlier, meaning that the message uses less extrinsic
information compared to messages that are sent along edges that do not belong
to short cycles.
To use the sum-product algorithm in decoding of a code whose factor graph
has cycles we initialize the messages as follows. First, we set the outgoing
messages at each check factor to the constant function with value 1, which carries
no information (with the log-ratio messages introduced below this corresponds to
zero). Second, we set the outgoing message at each
channel factor to p(y_i|x_i). We do not need to initialize the outgoing messages
at the variable nodes. At this point we have defined all incoming messages at
each variable node. We can then proceed with the normal sum-product update
rules. At each variable node we use the update rule (4.13) to send messages to the
check factors. Following that we update the check-to-variable messages using
(4.14). We then alternate between sending all check-to-variable messages and
all variable-to-check messages for a fixed number of rounds, where each round
consists of updating all check-to-variable and all variable-to-check messages once.
Once we have updated the messages for a fixed number of rounds we perform
the final marginalization step as in (4.11). Finally, as an estimate x′ of the sent
codeword we set
\[
  x'_i = \arg\max_{x_i \in \{1,-1\}} \prod_{b \in N(i)} f_{b \to i}(x_i),
  \quad \forall i = 1,2,\ldots,n. \tag{4.15}
\]
Note that we denote the estimate of the sent codeword by x′ to differentiate it
from the true MAP estimate x̂, as the two will generally not be the same when the
factor graph has cycles. Alternatively, we can calculate (4.15) in each round when
we update the variable-to-check messages. If the estimate x′ forms a codeword
we can stop the decoding process early and return x′ as the final estimate.
In the case of decoding of binary linear codes we can further simplify the
sum-product algorithm. Instead of sending functions as messages, it is enough to
send a single scalar as a message and this will be equivalent to sending functions.
To arrive at this we will use the log-ratios of the function values that we are
sending as messages. More precisely, we will at each channel factor initialize
the outgoing message to the logarithm of the ratio of the two values of the channel
factor p(y_i|x_i) as follows
\[
  f_i = \ln \left( \frac{p(y_i \mid x_i = 1)}{p(y_i \mid x_i = -1)} \right)
\]
such that f_i is now a single scalar. Under the assumption of equally likely bit
values this equals the log-ratio of the posteriors p(x_i = 1|y_i) and p(x_i = −1|y_i).
Using log-ratios also for the variable-to-check
and check-to-variable messages we get what is called the tanh-rule for the
sum-product algorithm. The tanh-rule modifies the check-to-variable messages so that
a message sent from a check factor a to a variable node i is given by
\[
  f_{a \to i} = 2 \tanh^{-1} \left( \prod_{j \in N(a) \setminus i}
      \tanh \left( \frac{v_{j \to a}}{2} \right) \right). \tag{4.16}
\]
A message sent from a variable node i to a check factor a is given by
\[
  v_{i \to a} = f_i + \sum_{b \in N(i) \setminus a} f_{b \to i}. \tag{4.17}
\]
[Figure 4.4: The tanh function.]
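The derivation of the tanh-rule is shown in Appendix A.2. We can also factor out
the signs in the check-to-variable updates (4.16) to get
\[
  f_{a \to i} = 2 \prod_{j \in N(a) \setminus i} \operatorname{sgn}(v_{j \to a})
      \cdot \tanh^{-1} \left( \prod_{j \in N(a) \setminus i}
      \tanh \left( \frac{|v_{j \to a}|}{2} \right) \right). \tag{4.18}
\]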
The form of the tanh-rule with the signs factored out makes it especially conve-
nient to interpret the update rules. In the check-to-variable messages (4.18), the
product of signs carries the message of which sign the receiving variable node
should have according to the check factor sending the message as we are leaving
out the sign of the message from the receiving variable node. The second part of
the check-to-variable rule in (4.18) can be thought of as carrying the reliability of
the adjacent variable nodes, again excluding the one we are sending to. Excluding
the values of the nodes we are sending messages to can be thought of as avoiding
using the values between two nodes too often, and instead relying on extrinsic
information coming from the other adjacent nodes. The variable-to-check updates
(4.17) can be seen as the check nodes voting on what value the variable node
should have, weighted by the message magnitudes or reliabilities.
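As a small illustration, the two forms (4.16) and (4.18) of the check-to-variable update can be compared numerically. This is a sketch only; the function names and the example log-ratios are hypothetical.

```python
import math

def check_update_tanh(incoming):
    """Rule (4.16); `incoming` holds v_{j->a} for all j in N(a) except i."""
    prod = 1.0
    for v in incoming:
        prod *= math.tanh(v / 2.0)
    return 2.0 * math.atanh(prod)

def check_update_sign_magnitude(incoming):
    """Rule (4.18): sign product times a magnitude computed from |v|."""
    sign = 1.0
    prod = 1.0
    for v in incoming:
        sign *= 1.0 if v >= 0 else -1.0
        prod *= math.tanh(abs(v) / 2.0)
    return 2.0 * sign * math.atanh(prod)

messages = [1.5, -0.4, 2.2]   # hypothetical incoming log-ratios
print(check_update_tanh(messages), check_update_sign_magnitude(messages))
```

With one negative input the outgoing message is negative, and its magnitude is smaller than the least reliable input, 0.4.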
When using log-ratios as messages the estimate x′ of the transmitted codeword
is set to
\[
  x'_i =
  \begin{cases}
    1,  & \text{if } f_i + \sum_{b \in N(i)} f_{b \to i} > 0, \\
    -1, & \text{otherwise,}
  \end{cases} \tag{4.19}
\]
for all i = 1,2,...,n. In (4.19) we use all the incoming messages at the variable
node to estimate the transmitted word. Here it is good to remember that even the
exact MAP solution does not necessarily give a codeword as the estimate. This is
because we are minimizing the bit error rate for each bit individually.
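Putting the log-ratio rules together gives a compact decoder. The sketch below is a minimal illustration under stated assumptions: the function name, the representation of the code as a list of index tuples, and the clamping constant are all hypothetical choices, and no attempt is made at efficiency.

```python
import math

def sum_product_decode(checks, channel_llr, max_rounds=20):
    """Loopy sum-product decoding with log-ratio messages.

    checks      -- list of tuples, the variable indices in each check
    channel_llr -- list of f_i values, positive meaning x_i = 1 is likelier
    Returns the estimate x' with entries in {1, -1}.
    """
    # Check-to-variable messages f_{a->i}, initialised to zero.
    f = {(a, i): 0.0 for a, check in enumerate(checks) for i in check}
    total = list(channel_llr)
    estimate = [1 if t > 0 else -1 for t in total]
    for _ in range(max_rounds):
        if all(math.prod(estimate[i] for i in check) == 1 for check in checks):
            break  # every check is satisfied: stop early
        # Variable-to-check rule (4.17): leave out the message from a itself.
        v = {(a, i): total[i] - f[(a, i)] for (a, i) in f}
        # Check-to-variable rule (4.16).
        for a, check in enumerate(checks):
            for i in check:
                prod = 1.0
                for j in check:
                    if j != i:
                        prod *= math.tanh(v[(a, j)] / 2.0)
                # Clamp so atanh stays finite for very reliable messages.
                prod = max(min(prod, 1 - 1e-12), -(1 - 1e-12))
                f[(a, i)] = 2.0 * math.atanh(prod)
        # Decision (4.19): channel value plus all incoming check messages.
        total = list(channel_llr)
        for (a, i), value in f.items():
            total[i] += value
        estimate = [1 if t > 0 else -1 for t in total]
    return estimate

# Hypothetical toy code and channel log-ratios with one unreliable bit.
print(sum_product_decode([(0, 1, 2), (2, 3, 4)], [2.0, 2.0, -0.5, 2.0, 2.0]))
```

The early exit corresponds to stopping as soon as the current estimate forms a codeword.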
Using log-ratios is better in practice as it avoids underflow of floating point
values when many small values are multiplied. There are other variations of
calculating the messages which also simplify the original sum-product update
rules. Chen et al. (2005) present a few formulations of the sum-product decoder
for linear codes.
A similar formulation was derived by Gallager (1963) completely independently
of the factor graph framework, but also using log-ratios as messages. Only
the check-to-variable messages are different from the tanh-rule in (4.16). The
update rule is
\[
  f_{a \to i} = \prod_{j \in N(a) \setminus i} \operatorname{sgn}(v_{j \to a})
      \cdot g \left( \sum_{j \in N(a) \setminus i} g \left( |v_{j \to a}| \right) \right),
\]
where g(x) = \ln \left( \frac{e^x + 1}{e^x - 1} \right) is defined for x > 0. This function
is an involution, meaning that it satisfies g(g(x)) = x.
[Figure 4.5: The function g(x) = ln((e^x + 1)/(e^x − 1)).]
Although the evaluation of g(x) requires
more work than the tanh and tanh^{-1} functions, it is convenient if evaluated with
the help of a look-up table, as one only needs one look-up table. This form of the
check node update rule also follows easily from the derivation for the tanh-rule,
since tanh(x/2) = e^{−g(x)} for x > 0 and 2 tanh^{-1}(e^{−s}) = g(s) for s > 0.
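The relation between Gallager's function and the tanh-rule can be checked numerically. The sketch below uses the identities tanh(x/2) = e^{−g(x)} and 2 tanh^{-1}(e^{−s}) = g(s), under which the magnitude of the tanh-rule equals g applied to the sum of the values g(|v_{j→a}|). The function names and example values are hypothetical, and all incoming messages are assumed to be nonzero since g is undefined at 0.

```python
import math

def g(x):
    """Gallager's function ln((e^x + 1) / (e^x - 1)), defined for x > 0."""
    return math.log((math.exp(x) + 1.0) / (math.exp(x) - 1.0))

def check_update_gallager(incoming):
    """Sign product times g of the sum of g(|v|); inputs must be nonzero."""
    sign = 1.0
    total = 0.0
    for v in incoming:
        sign *= 1.0 if v >= 0 else -1.0
        total += g(abs(v))
    return sign * g(total)

def check_update_tanh(incoming):
    """The tanh-rule (4.16), used here only as a reference."""
    prod = 1.0
    for v in incoming:
        prod *= math.tanh(v / 2.0)
    return 2.0 * math.atanh(prod)

messages = [1.5, -0.4, 2.2]   # hypothetical incoming log-ratios
print(g(g(1.3)))              # involution: close to 1.3
print(check_update_gallager(messages), check_update_tanh(messages))
```

Because g is its own inverse, a single table of g values serves both the inner and the outer evaluation.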
4.3.3 Implementing the sum-product decoder
While research on bit-flipping decoders has focused on improving the
error-correcting performance of the decoders, for message-passing decoders the focus
has been on reducing complexity. For example, the full sum-product decoder
requires multiplication of floating-point values and the evaluation of the tanh
and tanh^{-1} functions, which are all costly operations in hardware compared to
for example simple XOR operations. For this reason approximations have to be
made when implementing these decoders. An additional source of complexity is
that messages are generally stored for each edge of the Tanner graph, as opposed
to storing values only for each variable node with bit-flipping decoders.
Quantized symbol alphabet
The first approximation one is forced to make is the use of finite-precision values
for the messages. While an exact implementation would require infinite precision,
in practice the precision is limited to only 32 or 64 bits at most as a consequence
of the binary32 and binary64 floating point specifications (IEEE, 2008). However,
it turns out that one does not necessarily need more than 4–8 bits of precision
for the messages, as verified by a large number of works on message-passing
decoders, for example by He et al. (2003); Xiao et al. (2008); Zhang et al. (2001);
Zhao et al. (2005). Hardware implementations also generally use far fewer than
32 or 64 bits for the message values. Using fewer message bits already reduces
the complexity of the decoders significantly.
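One simple way to model a finite message alphabet in software is a uniform quantizer with saturation. The sketch below is an assumption made for illustration, not a scheme taken from the cited works; the parameter names `bits` and `step` and their default values are hypothetical.

```python
def quantize(value, bits=6, step=0.25):
    """Uniformly quantize a log-ratio to a signed `bits`-bit level.

    The levels are symmetric around zero, so with 6 bits they run
    from -31 to 31; values outside the range saturate.
    """
    max_level = 2 ** (bits - 1) - 1
    level = round(value / step)
    level = max(-max_level, min(max_level, level))
    return level * step

print(quantize(0.3), quantize(100.0), quantize(-100.0))
```

Applying such a function to every message after each update gives a rough software model of a fixed-point decoder.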
The min-sum decoder
The min-sum decoder is essentially a lower-complexity version of the
sum-product decoder. In some sense it can be seen as the sum-product decoder with
a different set of operators (Richardson and Urbanke, 2008). Alternatively, it
can be seen as an approximation to the sum-product decoder (Fossorier et al.,
1999). The min-sum decoder performs (approximate) blockwise MAP decoding, as
opposed to (approximate) bitwise MAP decoding in the case of the sum-product
decoder. Since the min-sum algorithm can be seen as an approximation to the
sum-product algorithm, which is the same as belief propagation, it is also sometimes
called BP-based decoding. It was presented in the context of decoding by Chung
(2000); Fossorier et al. (1999); Wiberg (1996), but special cases such as the Viterbi
algorithm had been presented earlier. The update rules are:
\[
  v_{i \to a} = f_i + \sum_{b \in N(i) \setminus a} f_{b \to i},
\]
\[
  f_{a \to i} = \prod_{j \in N(a) \setminus i} \operatorname{sgn}(v_{j \to a})
      \cdot \min_{j \in N(a) \setminus i} |v_{j \to a}|.
\]
The only difference to the sum-product decoder is that the magnitude of the
check-to-variable message is determined only by the magnitude of the smallest
incoming message. The variable-to-check messages are identical to the
sum-product updates in (4.17). The final estimate of the transmitted word is as given in
(4.19).
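The min-sum check update is short enough to state directly. The sketch below places it next to the tanh-rule for comparison; the function names and example values are hypothetical.

```python
import math

def check_update_min_sum(incoming):
    """Sign product times the smallest incoming magnitude."""
    sign = 1.0
    for v in incoming:
        sign *= 1.0 if v >= 0 else -1.0
    return sign * min(abs(v) for v in incoming)

def check_update_tanh(incoming):
    """The tanh-rule (4.16), used here only as a reference."""
    prod = 1.0
    for v in incoming:
        prod *= math.tanh(v / 2.0)
    return 2.0 * math.atanh(prod)

messages = [1.5, -0.4, 2.2]   # hypothetical incoming log-ratios
print(check_update_min_sum(messages), check_update_tanh(messages))
```

The two outputs have the same sign, and the min-sum magnitude is never smaller than the sum-product magnitude.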
The offset and normalized min-sum decoders
The min-sum decoder is less complex than the sum-product decoder, but it
comes with a cost in error-correcting performance. Some work has been put in
to improve the performance of the min-sum decoder. Chen et al. (2005); Chen
and Fossorier (2002a,b) presented two improvements to the min-sum decoder,
both of which work to minimize the effect of overestimating the message values.
Yazdani et al. (2004) presented essentially the same modifications but for the
sum-product decoder. Both modifications apply to the check-to-variable updates.
The two modifications are termed offset and normalized BP-based decoding. The
normalization modification is
\[
  f_{a \to i} = \alpha \prod_{j \in N(a) \setminus i} \operatorname{sgn}(v_{j \to a})
      \cdot \min_{j \in N(a) \setminus i} |v_{j \to a}|. \tag{4.20}
\]
The messages are simply scaled by a factor α, which is set to a value which results
in the best decoding performance. The offset modification is equally simple:
\[
  f_{a \to i} = \prod_{j \in N(a) \setminus i} \operatorname{sgn}(v_{j \to a})
      \cdot \max \left\{ \min_{j \in N(a) \setminus i} |v_{j \to a}| - \beta,\; 0 \right\}.
\]
In effect, values smaller than β are set to zero, while all larger values are shifted
down by β.
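Both corrections are one-line changes to the min-sum check update, as the following sketch shows. The default values of `alpha` and `beta` are hypothetical placeholders; in practice they would be tuned for the code and channel at hand.

```python
def _sign_and_min(incoming):
    """Return the product of signs and the smallest magnitude."""
    sign = 1.0
    for v in incoming:
        sign *= 1.0 if v >= 0 else -1.0
    return sign, min(abs(v) for v in incoming)

def check_update_normalized(incoming, alpha=0.8):
    """Normalized min-sum (4.20): scale the magnitude by alpha."""
    sign, magnitude = _sign_and_min(incoming)
    return alpha * sign * magnitude

def check_update_offset(incoming, beta=0.15):
    """Offset min-sum: subtract beta from the magnitude, floor at zero."""
    sign, magnitude = _sign_and_min(incoming)
    return sign * max(magnitude - beta, 0.0)

messages = [1.5, -0.4, 2.2]   # hypothetical incoming log-ratios
print(check_update_normalized(messages), check_update_offset(messages))
```

With an α below 1 and a positive β, both variants shrink the outgoing magnitude, which counteracts the overestimation of plain min-sum.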
λ-min decoder
Boutillon et al. (2003) presented an alternative modification to the min-sum
decoder: the λ-min decoder. Once again, the modification is simple but can
have a significant impact on the performance. This of course comes at a cost in
arithmetic complexity. The modification consists of using the λ smallest messages
for calculating the magnitude of the message instead of only the smallest, as in
the min-sum algorithm, or all messages, as in the sum-product decoder. Let λ > 1
and let N_λ(a) be the set of indices of the λ smallest incoming messages to node a.
The check-to-variable updates are then
\[
  f_{a \to i} = \prod_{j \in N_\lambda(a) \setminus i} \operatorname{sgn}(v_{j \to a})
      \cdot \prod_{j \in N_\lambda(a) \setminus i} |v_{j \to a}|. \tag{4.21}
\]
Successive relaxation
The idea of successive relaxation in message-passing decoders was first considered
for LDPC codes in the context of analog decoders by Hemati and Banihashemi
(2006) and later more thoroughly investigated in terms of decoding performance
by Xiao et al. (2008). The idea behind successive relaxation comes from the fact
that while on cycle-free graphs the sum-product decoder is exact, on graphs with
cycles it is not guaranteed to be optimal. The values of the messages (the beliefs)
are overestimated due to dependencies between messages caused by cycles,
leading to suboptimal performance. The offset and normalization modifications
to the min-sum decoder (see (4.20)) essentially attempt to solve the same
problem. The idea behind successive relaxation is to only gradually change the
message values. The update rules for the sum-product decoder with log-ratio
messages are
\[
  f_{a \to i} := f_{a \to i} + \beta \left( 2 \prod_{j \in N(a) \setminus i}
      \operatorname{sgn}(v_{j \to a}) \cdot \tanh^{-1} \left(
      \prod_{j \in N(a) \setminus i} \tanh \left( \frac{|v_{j \to a}|}{2} \right)
      \right) - f_{a \to i} \right),
\]
\[
  v_{i \to a} := v_{i \to a} + \beta \left( f_i +
      \sum_{b \in N(i) \setminus a} f_{b \to i} - v_{i \to a} \right).
\]
Here β is a free parameter chosen to be between 0 and 1. It is easy to see that
when β = 1 this corresponds to the usual sum-product tanh update rules, which
is also called successive substitution. When β < 1 the messages are only adjusted
by a fraction β towards the usual new values. This works to alleviate the
overestimation of the messages. The result is that for some of the codes tested even
the min-sum decoder with successive relaxation outperforms the standard
sum-product decoder. The sum-product decoder with successive relaxation performs
at least as well as the min-sum decoder with successive relaxation and better
than all decoders with successive substitution.
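The relaxed updates amount to moving each stored message a fraction β of the way towards the value the ordinary rule would assign. A minimal sketch, with hypothetical function names and example values:

```python
import math

def tanh_rule(incoming):
    """Ordinary check-to-variable value according to the tanh-rule."""
    prod = 1.0
    for v in incoming:
        prod *= math.tanh(v / 2.0)
    return 2.0 * math.atanh(prod)

def relaxed_check_update(old_f, incoming_v, beta):
    """Move f_{a->i} a fraction beta towards the tanh-rule value."""
    return old_f + beta * (tanh_rule(incoming_v) - old_f)

def relaxed_variable_update(old_v, f_channel, incoming_f, beta):
    """Move v_{i->a} a fraction beta towards f_i plus the extrinsic sum."""
    return old_v + beta * (f_channel + sum(incoming_f) - old_v)

incoming = [1.5, -0.4, 2.2]   # hypothetical incoming log-ratios
print(relaxed_check_update(0.0, incoming, beta=1.0))   # successive substitution
print(relaxed_check_update(0.0, incoming, beta=0.5))   # half-way step
```

Setting `beta=1.0` discards the old value entirely, which recovers the ordinary update.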
4.3.4 Binary message-passing decoder
A differential decoding with binary message-passing (DD-BMP) decoder was
presented by Mobini et al. (2009), based on the ideas of successive relaxation
applied to a binary message-passing decoder. This approach takes the
message-passing algorithms to the lower end of the complexity spectrum, with one version
of the decoder being similar in complexity to some of the bit-flipping decoders.
The DD-BMP decoder operates by only sending binary messages along the edges, while
the variable nodes have a memory with one or more bits.
The DD-BMP decoder introduces a memory c_{i→a} for each edge going from a variable
node i to a check node a. The memory has multiple bits to represent its value,
while the messages sent consist of only a single bit. The messages take values
from {1,−1}. The check-to-variable update is
\[
  f_{a \to i} = \prod_{j \in N(a) \setminus i} v_{j \to a}.
\]
At the variable nodes, a memory is updated as
\[
  c_{i \to a} = c_{i \to a} + w \cdot \sum_{b \in N(i) \setminus a} f_{b \to i},
\]
where w is now a free weight parameter. Having computed the memory, the
variable-to-check update is then
\[
  v_{i \to a} = \operatorname{sgn}(c_{i \to a}).
\]
The memory c is initially set to the quantized value of the channel output. The
memory serves as the state in which the variable is at each iteration of the decoder.
The final decision for a variable node i is
\[
  x'_i =
  \begin{cases}
    1,  & \text{if } \sum_{a \in N(i)} \operatorname{sgn}(f_{a \to i})
          + \operatorname{sgn}(y_i) > 0, \\
    -1, & \text{otherwise,}
  \end{cases}
\]
for all i = 1,2,...,n. That is, the decision is the sign of the sum of the signs of the
memories and the sign of the channel value. The DD-BMP decoder significantly
reduces the complexity of a hardware implementation as messages require only a
single bit, and updates consist of only incrementing or decrementing the memory
by one.
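The update rules above can be combined into a small software model. The sketch below is an illustration under several assumptions that are not taken from Mobini et al.: the function name, the edge dictionary layout, the convention that a zero memory maps to +1, the use of −1 as the second decision value to match the {1,−1} message alphabet, and the toy code and channel values.

```python
def sgn(value):
    """Sign with values in {1, -1}; a zero maps to +1 (an assumption)."""
    return 1 if value >= 0 else -1

def dd_bmp_decode(checks, channel, w=1, iterations=10):
    """Sketch of DD-BMP decoding.

    checks  -- list of tuples, the variable indices in each check
    channel -- quantized (integer) channel values y_i
    """
    n = len(channel)
    edges = [(a, i) for a, check in enumerate(checks) for i in check]
    c = {(a, i): channel[i] for (a, i) in edges}   # memory c_{i->a}
    f = {}
    for _ in range(iterations):
        # Variable-to-check messages are the signs of the memories.
        v = {e: sgn(c[e]) for e in edges}
        # Check-to-variable messages: product of the other incoming bits.
        for a, check in enumerate(checks):
            for i in check:
                prod = 1
                for j in check:
                    if j != i:
                        prod *= v[(a, j)]
                f[(a, i)] = prod
        # Memory update with the extrinsic check messages.
        for (a, i) in edges:
            extrinsic = sum(f[(b, k)] for (b, k) in edges if k == i and b != a)
            c[(a, i)] += w * extrinsic
    decision = []
    for i in range(n):
        votes = sum(sgn(f[(a, k)]) for (a, k) in edges if k == i)
        votes += sgn(channel[i])
        decision.append(1 if votes > 0 else -1)
    return decision

# Hypothetical toy code with one unreliable, wrongly signed channel value.
print(dd_bmp_decode([(0, 1, 2), (2, 3, 4)], [3, 3, -1, 3, 3]))
```

All quantities are small integers, which reflects why the scheme is attractive for hardware.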
A further reduced complexity decoder was also proposed, the MDD-BMP,
where the M stands for modified. In the MDD-BMP decoder a single memory c_i is
kept for each variable node, instead of one for each edge of the Tanner graph. This
simplification resulted in a larger or smaller loss in performance depending on
the code. In general, it did not perform much worse than the standard DD-BMP
and can be a viable alternative for some applications. The MDD-BMP variable
memories are updated as
\[
  c_i = c_i + w \cdot \sum_{a \in N(i)} f_{a \to i}.
\]
Cushon et al. (2014) presented a hardware simulation of both the DD-BMP
decoder and the MDD-BMP decoder. They also presented a further modification
to the decoder by Mobini et al. (2009). They call this modification the improved
differential binary decoding algorithm. Cushon et al. note that the standard DD-BMP
decoders are sensitive to trapping sets, a concept similar to stopping sets. To
reduce the effects of trapping sets two modifications were proposed: degeneration
and relaunching. Degeneration consists of performing the following update
to the variable node memories:
\[
  c_i = c_i + w \cdot \sum_{a \in N(i)} f_{a \to i} - d \cdot \operatorname{sgn}(c_i).
\]
Here the free parameter d is added to improve performance. The arbitrary constant
g determines the amount of degeneration that occurs. If the sum of the incoming
messages is less than gd the memories move towards zero. The purpose of the
degeneration is to avoid message values staying constant in trapping sets.
The second modification the authors proposed was relaunching. It simply
consists of starting the decoder again. However, as the decoder is deterministic,
the decoder is started with slightly changed initial memories. A full run of the
decoding algorithm is called a phase, and a phase is indexed by p. The decoder is
relaunched on phase p such that the memories are initialized to
\[
  c_i = \operatorname{sgn}(y_i) \cdot \max \left(
      \frac{1 - \operatorname{sgn}(y_i)}{2},\; |y_i| - F(p,i) \right),
\]
where F(p,i) is a non-negative function which depends on the phase p and the
variable node i. In the work by Cushon et al. the decoder was run for 6 phases,
with 45 iterations in each phase. They found that the largest improvement came
from adding degeneration to the decoder, while relaunching still slightly improved
the performance.
4.3.5 Message-passing schedules
Some of the schedules already presented for bit-flipping decoders apply equally
well to message-passing decoders, and were in some cases initially devised for
message-passing decoders. The standard decoding schedule consists of first updating
the variable-to-check messages, after which all the check-to-variable messages
are updated. This is repeated until decoding is successful or a maximum num-
ber of iterations is reached. One iteration with a message-passing algorithm
refers to updating all the variable-to-check and check-to-variable messages once.
Processing the messages in turn like this is called a flooding schedule.
Layered, or group shuffled, decoding was already introduced in one form
in the context of bit-flipping decoders in the previous sections. The initial idea
was, however, presented in the context of message-passing decoders. Updating
the messages on both sides of the Tanner graph in smaller groups is at least as
beneficial for message-passing decoders as it is for bit-flipping decoders for the
same reason: variable and check nodes can utilize updated values much earlier
instead of relying on old messages.
Informed dynamic scheduling
Elidan (2006) presented a scheme for improving the convergence of general belief
propagation algorithms and Vila Casado et al. (2007) presented and analyzed
the improvements in the context of LDPC code decoding. They call the general
method of using a more sophisticated schedule for a message-passing algorithm
informed dynamic scheduling. The idea is to process and propagate messages
which matter the most first. To arrive at a good schedule, they use residual belief
propagation, which essentially involves calculating the absolute value of the
change in the message values from one iteration to the next. The edges are then
held in a priority queue according to their residuals and the message with the
highest residual is first calculated, propagated, and reinserted into the priority
queue in the appropriate position.
The authors find that informed dynamic scheduling improves on the performance
of the sum-product algorithm while requiring fewer iterations (an iteration
in this case is defined as processing a number of messages equal to the number of
edges in the Tanner graph). The downside of this approach is again limited par-
allelizability. The straightforward implementation is inherently sequential. The
authors also mention a parallel implementation of informed dynamic scheduling,
similar to multi-bit bit-flipping decoders or bit-flipping decoders with a threshold.
Instead of only processing the message with the highest residual, one processes
the p messages with the highest residuals. The result is a negligible loss in
performance, but the authors do not mention how large p was chosen to be.
While the method showed promising results in sequential decoding, it is unclear
whether such a schedule can efficiently be implemented in hardware. The need
for a priority queue also increases complexity.
4.4 Turbo codes
We will for completeness give a brief overview of Turbo codes. The reason for
presenting them at this point is that they rely on the same message-passing tools
which we have just presented for LDPC codes. Turbo codes are convolutional
codes. This means that, unlike LDPC codes, they operate on continuous streams
of data instead of blocks of data, but in practice one uses finite streams of data
also for Turbo codes. Turbo codes are defined by a linear feedback system.
The decoding of Turbo codes can be done with what is also essentially the
sum-product decoder. In the simplest case the factor graph one gets for the
maximum a posteriori estimation of the coded bits is a tree. In this case the
sum-product algorithm is also exact. In general, however, Turbo codes consist of a
concatenation of individual convolutional codes either in parallel or serially. In this case the factor
graph will not be a tree but it will, similarly to the factor graph of LDPC codes,
consist of several sets of nodes such that one can do message-passing where
one updates the messages from one set of nodes at a time. This resembles the
conventional schedule for LDPC message-passing decoders where all variable-to-
check messages are updated at once, and then all the check-to-variable messages.
A thorough description of Turbo codes is given by Richardson and Urbanke (2008),
and Berrou et al. (2005) give a more general overview of the field of Turbo codes.
4.5 Summary
There has been a wealth of work on decoding of LDPC codes in the last 10–15
years. The main types of decoders are the message-passing and the bit-flipping
decoders. Despite the recent increased interest in bit-flipping decoders,
message-passing decoders often still come out ahead, at least in terms of error-correcting
performance. This is perhaps a natural consequence of the message-passing
decoders in general using more information than the bit-flipping decoders. Still,
they can be made low-complexity as is for example the case with the MDD-BMP.
Bit-flipping decoders, on the other hand, have seen improvements in the
error-correcting performance without too large costs in arithmetic complexity. However,
they still have not been able to achieve low enough bit-error rates compared to
message-passing decoders. On the other hand, there may be situations where
one can be more relaxed in terms of error-correction and get a corresponding
increase in throughput.
We have not covered here another notable class of message-passing decoders:
stochastic message-passing decoders (Gaudet and Rapley, 2003; Gross et al., 2005;
Huang et al., 2013; Naderi et al., 2011; Noorshams and Iyengar, 2014; Tehrani
et al., 2006, 2008, 2010, 2011). In the standard sum-product decoder messages
are soft values, represented generally by more than one bit. Stochastic message-
passing decoders operate using probabilities (as opposed to log-ratios as with
the tanh-rule) and send instead streams of bits which are Bernoulli distributed
according to the probabilities concerning a particular edge over which the stream
is sent. We have not covered stochastic message-passing decoders here as there
exists an extensive amount of work about them and the general principles are
the same as in the sum-product decoder. However, they are potentially important
as they present yet another way to reduce the complexity of decoding algorithms
without too much loss in error-correcting performance.
Another aspect that will become more clear in the following chapter is that
some degree or form of parallelism is often necessary to achieve high enough
decoding throughputs. Both message-passing decoders and bit-flipping decoders
benefit from sequential schedules as information is propagated faster to other
nodes. In the case of for example informed dynamic scheduling the absolute
number of operations per bit needed for decoding is clearly reduced. The increased
number of operations per bit in a parallel implementation is often, however, offset
by being able to process many or all bits in parallel.
Chapter 5
Decoder implementations
In the two previous chapters we have reviewed algorithms for encoding and
decoding of LDPC codes. In practice, the algorithms must be implemented either
in software for simulating the behavior of codes or algorithms, or in hardware for
use in devices. Conveniently, simulating the behavior can be done exactly. That
is, software encoders and decoders work for encoding and decoding real pieces of
data, but at much lower throughputs than what can be achieved with hardware
implementations. When implementing encoders and decoders in hardware it is
important to consider not only the asymptotic complexity of the algorithms, but
also the complexity of the implementation for example in terms of how much
wiring is required on the chip, or what power consumption it has. Simply put,
constants matter. This is especially true for decoders as essentially all decoders
for LDPC codes are linear-time by design.
LDPC codes have in the recent years been incorporated as part of various
wired and wireless standards, some of which are the DVB (ETSI, 2012, 2013a,b),
WiMAX (IEEE, 2009), WiFi (IEEE, 2012b) and Ethernet (IEEE, 2012a) standards.
The various standards have specified LDPC codes, but also put requirements on
the throughput and error-correcting performance of the decoders. This clearly
limits the actual decoders one can implement. For example, the DVB standards
set the requirements that for each type of code given in the standard and a given
signal-to-noise ratio, the decoder must achieve a bit-error rate of at most $10^{-7}$.
To examine the error-correcting performance of decoders, simulation on a
conventional CPU is often sufficient. For examining the throughput of a decoder,
the ideal situation would be to have an actual hardware implementation of the
decoder. In practice, however, this is not feasible for testing various designs
quickly, so one can instead simulate the hardware and this way get approximate
results not only of how the decoding algorithm behaves but also of how the
hardware design would behave when implemented in terms of throughput and
power consumption. In this chapter, we will review some existing simulations
of specialized hardware decoders. These are presented in Section 5.1. Recent
advances in the programmability of graphics processing units (GPUs) have meant
that they can be used to implement decoders with much higher throughputs
than what is possible with ordinary CPUs. A fair number of works have been
published on implementing message-passing decoders on GPUs and we review
them in Section 5.2. We will then present an implementation of a decoder for
GPUs by the author in Section 5.3.
5.1 Hardware simulations
Essentially all practical implementations of decoders employ some type of par-
allelism to achieve a high enough throughput. There are two main ways of
parallelizing a decoding algorithm. The first is to process the bits of a word
in parallel, which can be done for example with the flooding schedule of the
message-passing decoders. We will call such a parallel implementation bit-parallel.
Alternatively, one can consider several received words simultaneously and perform
the same operations on each word. This can easily be done since the words
are independent of each other. We will call such an implementation block-parallel.
An implementation can of course be both bit- and block-parallel.
A recent and promising work on decoder implementations is by Cushon et al.
(2014). They present a simulated hardware implementation of a binary message-
passing decoder, the MDD-BMP as presented in Section 4.3.4. They also implement
their proposed improvement to the MDD-BMP decoder, the improved differential
binary (IDB) decoding algorithm. The implementation presented is bit-parallel and is
applied to finite geometry LDPC codes, as specified by the Ethernet standard. The
IDB decoder achieves at best a throughput of 170 Gb/s in their implementation, which is
higher than the throughput of 10 Gb/s required by the targeted Ethernet standard.
The decoder, as implemented, is often close to the error-correcting performance
of the offset min-sum decoder, but does not surpass it. The improved differential
binary decoder surpasses in some situations the performance of the regular
min-sum decoder. However, especially the MDD-BMP decoder exhibits high error
floors with certain codes. Codes with low variable node degrees are particularly
troublesome for both binary message-passing decoders implemented, confirming
the observations of Mobini et al. (2009) which show the same problem.
In the work by Mohsenin et al. (2010) an improved split-row min-sum decoder
is implemented as a hardware simulation. The split-row min-sum further simplifies
the check-to-variable updates of the min-sum decoder by using messages
from only a subset of the variable nodes adjacent to a check node sending a
message. This reduces complexity in a hardware implementation by requiring
less wiring, and increases throughput. The decoder achieves a peak throughput
of 90 Gb/s. Results for the implementation are only presented for a bit-error rate
down to $10^{-7}$, so the presence of error floors is not known, unlike in the work
of Cushon et al. (2014). Schläfer et al. (2012) present another high-throughput
decoder design. Their design implements a min-sum decoder with 9 iterations,
where their contribution consists of a fully unrolled decoder. Unrolling means that
each of the 9 iterations is executed on a physically different part of the decoder,
allowing multiple words to be decoded simultaneously in different iterations,
resulting in a reported increase in area and energy efficiency compared with
earlier works. The fully unrolled decoder, however, comes with the downside of
being less flexible. For example, the number of iterations cannot be adjusted for
different levels of noise and a fixed number of iterations will always be performed.
They achieve a peak throughput of 160 Gb/s.
Naderi et al. (2011) present a hardware design of a stochastic message-passing
decoder. They provide designs for a short code from the Ethernet standard—the
same code used in the work by Cushon et al. (2014) and Mohsenin et al. (2010).
They also show a design for a longer code with block length 32768. The design for
the short Ethernet code achieves a peak throughput of 170 Gb/s and the design
for the longer code achieves a peak throughput of 480 Gb/s. The work shows that
stochastic message-passing decoders can be competitive in terms of throughput.
The three hardware decoder designs mentioned above all reach similar maximum
throughputs. However, choosing the best decoder can be difficult. There
are many factors to take into account, some of which are error-correcting perfor-
mance, latency, throughput, power consumption and chip size. As an example,
the fully unrolled decoder cannot be stopped early because of the fixed number
of iterations which may lead to higher power consumption compared to decoders
which can stop earlier when noise levels are low. On the other hand, a fully un-
rolled decoder can achieve higher throughput and, ignoring early stopping, can
in itself be more energy efficient in normal operation compared to non-unrolled
decoders.
5.2 GPU implementations
Using graphics processing units (GPUs) for other uses than their original purpose
is becoming more common. Scientific computation can often benefit from being
implemented on GPUs, and especially highly parallel work is well suited for
GPUs. Compared to common CPUs, GPUs often contain more processors, each
of which contains several cores. As a downside the cores typically run at lower
frequencies than CPUs. However, the higher number of cores generally makes
up for the lower frequency, assuming that the problem one wishes to compute is
sufficiently parallelizable. We will now give a brief overview of existing decoder
implementations using GPUs. Compared to specialized hardware implementa-
tions as presented in the previous section, GPUs generally achieve decoding
throughputs on the order of 100–1000 Mb/s. While slower than the specialized
implementations, this is still 1–2 orders of magnitude faster than implementations
on CPUs, which can achieve throughputs on the order of 10 Mb/s (Grönroos et al.,
2012).
Wang et al. (2013) presented an implementation of a normalized min-sum
decoder on a GPU. They use rate-1/2 codes from the WiMAX and WiFi standards
with block lengths 2304 and 1944, respectively. They use one or four NVIDIA GTX
TITAN GPUs for decoding. They report a peak throughput of approximately 315
Mb/s using one GPU and 10 iterations of the min-sum decoder. Using four GPUs
and 10 iterations they report a throughput of 1.25 Gb/s. Their implementation is
both bit- and block-parallel. This is the highest throughput reported for a GPU
implementation although the hardware used is also the most powerful. Some
earlier work using GPUs for decoding has been presented by the same authors
(Wang et al., 2011a,b).
Abburi (2011) reports a throughput of 160 Mb/s using 5 iterations of the
layered min-sum decoder on a single NVIDIA GeForce 9800 GTX+. Kang and
Moon (2012) implemented a sum-product decoder on an NVIDIA GTX 480, reaching
similar maximum throughputs using 10 iterations. Other works are by Chang
et al. (2011); Grönroos et al. (2012); Martínez-Zaldívar et al. (2011). Some of the
earliest work on using GPUs for decoding of LDPC codes was done by Falcão
et al. (2008).
5.3 A GPU implementation of a stochastic bit-flipping decoder
Programming for a GPU is slightly different from programming for a CPU. Before
we present the implementation of a bit-flipping decoder for GPUs, we will give
a brief overview of the architecture of a GPU and the important aspects of
programming for a GPU. We will in the following only consider NVIDIA’s so-called
Compute Unified Device Architecture, or simply CUDA, platform as the
decoder was implemented for CUDA devices. The decoder was finally run on
two types of CUDA-devices, of which the rst is the NVIDIA Tesla M2090, which
is based on the Fermi microarchitecture, and the second is the NVIDIA Tesla
K40, which is based on the Kepler microarchitecture. The two devices have
different compute capabilities. The compute capability version specifies what kind
of features and hardware is available on the device. Essentially, it specifies the
microarchitecture of the device. The NVIDIA Tesla M2090 has compute capability
2.0 and the NVIDIA Tesla K40 has compute capability 3.5. Details of the two
devices are shown in Table 5.1. Programming for CUDA devices is done using the
CUDA software platform which provides a programming interface based on the C
language. Code written using CUDA C is then compiled to an intermediate form
of assembly language which is generic for all CUDA devices and uses the Parallel
Thread Execution (PTX) Instruction Set Architecture (ISA). PTX assembly is then
compiled to an architecture-specific binary based on the compute capability of a
device.

Table 5.1: Specifications of the two CUDA devices used in this work (NVIDIA, 2011,
2013).
Model                        NVIDIA Tesla M2090   NVIDIA Tesla K40
Compute capability           2.0                  3.5
Multiprocessors              16                   15
Cores                        512                  2880
Processor core clock [GHz]   1.3                  0.745
Global memory [GB]           6                    12
Memory clock [GHz]           1.85                 3.0
Memory bandwidth [GB/s]      176                  288

5.3.1 Architecture and programming of CUDA devices
A GPU generally contains multiple processors. Each processor is called a streaming
multiprocessor. Each streaming multiprocessor, or just multiprocessor, contains
multiple cores and is capable of executing a number of threads simultaneously. In
simple terms, each thread executes the same instruction, resulting in a multipro-
cessor essentially acting like a vector machine. This is called single-instruction,
multiple-data (SIMD) parallelism. Execution on a GPU is divided into threads
which all run the same kernel. A kernel is simply a function which can run on
different threads and is aware of which thread it is running in. This allows one to
perform work in parallel on for example different pieces of data. A warp is a unit
of 32 threads which are executed simultaneously on a single multiprocessor. If
some threads in a warp take a different execution path than other threads in the
warp, each group of threads with the same execution path will be executed in
parallel while threads with different execution paths are executed serially. This is
called thread divergence. Because of this behavior, it is important to try to have
minimal thread divergence within warps to achieve high performance. Ideally,
all threads within a warp should execute the same sequence of instructions.
An important aspect resulting from the microarchitecture of CUDA devices
is that of arithmetic latency. An arithmetic instruction generally takes between
10–20 clock cycles to complete (NVIDIA, 2014). Arithmetic instructions are,
however, pipelined. This means that multiple instructions of the same type can
be executed simultaneously by being in different stages of the pipeline, assuming
that the instructions operate on independent data. Put simply, the latency of
arithmetic instructions on a GPU is high, but the throughput can be higher
than only one instruction per 10–20 clock cycles. If not enough independent
instructions are available to be executed simultaneously, the pipeline stalls until
new instructions can be executed. The pipelining is done automatically, but
requires that a sufficient number of independent instructions are available to
be executed. Ignoring memory accesses, to fully utilize a multiprocessor one
then needs to ensure that the multiprocessor can execute a sufficient number of
independent instructions. One way of achieving this is by running a number of
threads equal to 10–20 times the number of cores available on the multiprocessor.
Alternatively, one can increase the number of independent instructions within a
thread. The typical latency of instructions performed on devices with compute
capability 2.0 is 22 clock cycles, while with 3.5 it is 11 clock cycles. Although hiding arithmetic latency can
be important for achieving maximal performance, hiding memory latencies can
be more important as we will see below.
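To make the rule of thumb concrete, the following back-of-the-envelope calculation multiplies the number of cores per multiprocessor (derived from Table 5.1) by the typical arithmetic latency quoted above. It is only a rough estimate: it assumes one independent instruction per thread and ignores memory accesses as well as any hardware limit on how many threads a multiprocessor can keep resident. The function name is made up for illustration.

```python
def threads_to_hide_latency(cores_per_multiprocessor, latency_cycles):
    """Rule of thumb from the text: one independent instruction per core
    for every clock cycle of arithmetic latency."""
    return cores_per_multiprocessor * latency_cycles


# Cores per multiprocessor, derived from Table 5.1 (cores / multiprocessors).
m2090_cores = 512 // 16    # 32
k40_cores = 2880 // 15     # 192

print(threads_to_hide_latency(m2090_cores, 22))  # 704
print(threads_to_hide_latency(k40_cores, 11))    # 2112
```

If such a thread count cannot be reached in practice, the remaining independent instructions have to come from within each thread, which is the second option mentioned above.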
The memory hierarchy of a GPU is as follows: all multiprocessors share access
to the global memory, which acts similarly to RAM for CPUs and is relatively slow
to access. Each multiprocessor has a shared memory common for all threads within
that multiprocessor, and all multiprocessors typically share one or two levels
of cache for global memory accesses. Additionally, each thread has access to a
maximum number of registers. For example, for GPUs with CUDA compute capability
2.0, such as the NVIDIA Tesla M2090, the maximum number of registers per thread
is 63. In addition, there is a maximum number of registers a multiprocessor has
in total, which is 32 K for compute capability 2.0. For the NVIDIA Tesla K40
with compute capability 3.5 the maximum number of registers per thread is 255
and the total number of registers in a multiprocessor is 64 K. Each thread can
also access a portion of the global memory which is local to each thread. This
is called local memory. Access to local memory is, however, as slow as accesses
to global memory. The memories are, in decreasing order of size and latency:
global and local memory, shared memory and registers. Global memory latencies
are the most noticeable. For compute capability 2.0 the global memory latency
is 400–800 cycles and for compute capability 3.5 the latency is 200–400 cycles
(NVIDIA, 2014).
Because of the high latency in accessing global memory, the most important
aspect of achieving high performance is generally to avoid or hide the global
memory latency. In the case of the current decoder implementation avoiding the
latency is not possible as the received words need to be stored in global memory
because of their size. Hiding the latency can be done by ensuring that enough
memory requests are made to global memory so that the memory bandwidth is
saturated. This is done in practice by ensuring that enough threads are running
or by increasing the number of memory requests within a thread.
When designing a kernel, data that is accessed frequently and is small enough
should be stored in shared memory. While slower than registers, the shared memory
is still significantly faster than global memory to access. For both compute
capability 2.0 and 3.5 the amount of shared memory per multiprocessor is 48 KB.
Finally, registers are used for the remaining variables. If a thread requires more
registers than are available, variables need to be stored in local memory causing
what is called register spilling. When register spilling occurs, accessing spilled
variables from local memory is significantly slower than accessing registers. The
local memory makes use of a cache which can help reduce access times, but
generally register spilling will have a detrimental effect on the performance of a
kernel. While the CUDA compiler automatically optimizes the use of registers
and local memory for performance, manually reducing the number of variables
used in a kernel may also help in reducing register spilling.
5.3.2 Decoder implementation
For this thesis, a stochastic bit-flipping decoder as described in Section 4.2.5 was
implemented and tested on two CUDA devices. The decoder works on hard values,
meaning that we assume either that transmission has occurred over the BSC or
that soft channel values from the BAWGNC are quantized to 1 bit. We assume here
that the channel values are in {0,1}. The decoder employs a sequential schedule,
where the decision to flip is made randomly based on probabilities $p_{e,b,d}$ which
depend on whether the current value of the bit is the same as the channel value, the
number of adjacent unsatisfied checks and the degree of the variable node. The
decoder implementation in the current work is block-parallel. That is, the decoder
decodes multiple words in parallel as all received words are independent from a
decoding perspective, while the decoding of each individual word is done with a
sequential schedule. The decoder is made block-parallel on two levels. First, we
will formulate the decoder mostly as a Boolean circuit. This is important as it
allows the decoder to operate using bitwise Boolean operations on multiple words
at a time. At the same time, formulating the decoder as a Boolean circuit avoids
thread divergence and allows a high throughput. Second, multiple threads on
multiple multiprocessors each decode their own set of words, further increasing
the level of parallelism.
To clarify, the term word will only be used in the sense of coding theory,
meaning the sequence of bits that we wish to decode to a codeword.
This should not be confused with the commonly used word which refers
to a group of bits on which a processor generally performs computations. As
the GPUs we are considering have a 32-bit architecture, the latter use of word
refers to a group of 32 bits on which the GPU operates using an instruction.
To avoid confusion with the former use we will avoid using word in the latter
sense. Additionally, we will often simply use the term bit interchangeably with
variable node or column of the parity-check matrix since the value of the variable
is quantized to a single bit.
Computation
We will, to begin with, only consider the decision to flip a single bit i in a single
word. Having done that, realizing the decoder as a block-parallel decoder is
straightforward. The decoding takes place in four steps:
1. Draw a value r uniformly at random between 0 and 1 using a random
number generator.
2. Count the number of checks adjacent to the current bit i that are unsatisfied.
3. Compare the pseudo-random value r drawn in the first step to the probability
of flipping a bit $p_{e,b,d}$ given the degree d of the bit; the number of
unsatisfied checks b in the second step; and e, the XOR of the current value
and channel value of the bit.
4. Flip the bit if the pseudo-random value is less than the probability of
flipping.
In the first step, the value of r is represented by a 32-bit integer taking values
between 0 and $2^{32} - 1$. The probabilities $p_{e,b,d}$ are also represented as 32-bit integers.
The 32-bit integers representing the probabilities are calculated as $p_{e,b,d} \cdot (2^{32} - 1)$.
The value of r is drawn using a linear-feedback shift register (LFSR). While
consecutive numbers drawn with an LFSR are highly correlated, the simplicity of
an LFSR makes it ideal for a high-performance implementation.
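The first and fourth steps can be sketched as follows. This is only an illustration under assumptions: the text does not say which LFSR variant or feedback polynomial was used, nor how the 32-bit value r is taken from the register. The sketch uses a Galois-form LFSR with a commonly tabulated 32-bit feedback mask and takes the whole register state as r. All names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
# Assumed feedback mask for x^32 + x^22 + x^2 + x + 1; the polynomial
# actually used in the implementation is not stated in the text.
TAPS = 0x80200003


def lfsr_step(state, taps=TAPS):
    """Advance a 32-bit Galois LFSR by one step (state must be nonzero)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps
    return state & MASK32


def to_threshold(p):
    """Represent a probability p in [0, 1] as a 32-bit integer."""
    return int(p * (2**32 - 1))


def should_flip(r, p):
    """Step 4: flip if the drawn integer is below the integer threshold."""
    return r < to_threshold(p)


state = 0xACE1ACE1                 # arbitrary nonzero seed
state = lfsr_step(state)           # here the whole state serves as r
print(should_flip(state, 0.25))
```

Comparing integers instead of floating-point values keeps the decision to a single unsigned comparison per probability.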
The second and third steps are as follows. As we are working with binary
codes, calculating the value of a checksum a adjacent to the bit i is done simply
by XORing the values of the bits adjacent to a. A check is satisfied if the sum is
0 and unsatisfied if it is 1. To count the number of unsatisfied checks a simple
incrementer circuit is sufficient. We only considered parity-check matrices with
maximum variable node degree 6 for the implementation, so an incrementer with
3 bits suffices. The circuit used for counting the number of unsatisfied checks is
shown in Figure 5.1. We denote the number of unsatisfied checks by the bits $b_0$, $b_1$
and $b_2$ such that $b_2 b_1 b_0$ is the binary representation of the number of unsatisfied
checks. That is, $b_2$ is the most significant bit and $b_0$ is the least significant bit.
The decoder initially sets all three incrementer bits to 0. It then processes each
adjacent check in turn and adds the result to the counter bits. When the value of
the checksum of each adjacent check has been added to the incrementer, $b_2 b_1 b_0$
holds the binary representation of the number of unsatisfied adjacent checks.
Figure 5.1: A 3-bit incrementer circuit with modular behavior. The circuit adds the bit a
to the 3-bit number with a binary representation $b_2 b_1 b_0$. Each XOR essentially calculates
the new value of the corresponding bit, and each AND calculates the carry bit for the
next position.
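The incrementer can be written directly with bitwise operations. The sketch below is an illustration and not the code of the implementation: it treats every integer as a bundle of independent lanes, where bit k of each integer belongs to word k, so that one call processes many words at once. The function names are made up.

```python
from functools import reduce
from operator import xor


def checksum(bits):
    """XOR of the bits adjacent to a check; a result of 1 means unsatisfied."""
    return reduce(xor, bits, 0)


def increment(b2, b1, b0, a):
    """Add the bit a to the 3-bit counter b2 b1 b0 (three XORs, two ANDs)."""
    c0 = b0 & a          # carry out of position 0
    c1 = b1 & c0         # carry out of position 1
    return b2 ^ c1, b1 ^ c0, b0 ^ a   # the carry out of b2 is dropped


def count_unsatisfied(checksums):
    """Feed the checksum of each adjacent check into the counter in turn."""
    b2 = b1 = b0 = 0
    for a in checksums:
        b2, b1, b0 = increment(b2, b1, b0, a)
    return b2, b1, b0


# Two lanes: in lane 0 all three checks are unsatisfied, in lane 1 only one.
print(count_unsatisfied([0b11, 0b01, 0b01]))   # (0, 1, 3)
```

Reading the result lane by lane, lane 0 holds the counter value 011 (three unsatisfied checks) and lane 1 holds 001 (one unsatisfied check).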
The third step is also implemented as a Boolean circuit. Let us denote by f the
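decision to flip a bit i, such that f takes the value 1 if the bit i should be flipped
and 0 otherwise. As in Section 4.2.5, let e be 1 if the current value of the bit is
different from the channel value, and 0 otherwise. That is, e is the XOR of the
current value of the bit and the channel value. The probability of flipping a bit
with degree d, b unsatisfied checks and e is denoted by $p_{e,b,d}$. Then, let $r_{e,b,d}$ be
1 if the drawn pseudo-random value r is less than the probability $p_{e,b,d}$, and 0
otherwise. The decision to flip a bit with degree d is then given by
\[
\begin{aligned}
f = {} & (r_{1,1,d} \wedge (\neg b_2 \wedge \neg b_1 \wedge b_0 \wedge e)) \vee {} \\
       & (r_{0,1,d} \wedge (\neg b_2 \wedge \neg b_1 \wedge b_0 \wedge \neg e)) \vee {} \\
       & (r_{1,2,d} \wedge (\neg b_2 \wedge b_1 \wedge \neg b_0 \wedge e)) \vee {} \\
       & (r_{0,2,d} \wedge (\neg b_2 \wedge b_1 \wedge \neg b_0 \wedge \neg e)) \vee {} \\
       & \cdots \\
       & (r_{1,d,d} \wedge ([b_2 b_1 b_0 = d] \wedge e)) \vee {} \\
       & (r_{0,d,d} \wedge ([b_2 b_1 b_0 = d] \wedge \neg e)).
\end{aligned}
\tag{5.1}
\]
Here $[b_2 b_1 b_0 = d]$ denotes the conjunction of the literals $b_k$ or $\neg b_k$ that is
true exactly when the counter holds the binary representation of d.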
Let us look at the first row of (5.1) in more detail. The value of $r_{1,1,d}$ is true if the
bit should be flipped given that it has one unsatisfied check and the value of the
bit we are considering is different from the channel value. The expression on the
right, $(\neg b_2 \wedge \neg b_1 \wedge b_0 \wedge e)$, checks that the current variable node actually has
exactly 1 unsatisfied check and that its value differs from the channel value. Each
row then checks this for different values of the number of unsatisfied checks and
e. The number of unsatisfied checks and the value of e will match on at most one
row, and on exactly one whenever at least one check is unsatisfied. Taking the disjunction between each row ensures that if for the matching
row the value of $r_{e,b,d}$ is also true, the whole expression will be true. Thus f is
true exactly with the probability $p_{e,b,d}$ given b unsatisfied checks and e. By computing
the XOR of f and the current value of the bit we get the new value of the bit. We
should note here that a reason for restricting ourselves to codes with low variable
node degree is that the number of rows in (5.1) grows exponentially with the
maximum variable node degree.
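Equation (5.1) can be evaluated with a short loop over the possible numbers of unsatisfied checks. The sketch below is illustrative only. It assumes 32 lanes per integer, with bit k of every argument belonging to word k, and it takes the values $r_{e,b,d}$ as a dictionary `r` keyed by (e, b) whose entries are lane masks. Both the dictionary layout and the function names are assumptions.

```python
MASK32 = 0xFFFFFFFF


def match_count(b2, b1, b0, b):
    """Lane mask that is 1 exactly where the 3-bit counter equals b."""
    m = MASK32
    for bit, k in ((b2, 2), (b1, 1), (b0, 0)):
        m &= bit if (b >> k) & 1 else ~bit & MASK32
    return m


def flip_decision(r, b2, b1, b0, e, d):
    """Evaluate (5.1): one pair of rows per b = 1, ..., d."""
    f = 0
    for b in range(1, d + 1):
        hit = match_count(b2, b1, b0, b)
        f |= r[(1, b)] & hit & e
        f |= r[(0, b)] & hit & (~e & MASK32)
    return f


# Lane 0: 3 unsatisfied checks, e = 1. Lane 1: 1 unsatisfied check, e = 0.
b2, b1, b0, e = 0b00, 0b01, 0b11, 0b01
r = {(ev, b): 0 for ev in (0, 1) for b in (1, 2, 3)}
r[(1, 3)] = MASK32            # hypothetical draw: flip when e = 1 and b = 3
f = flip_decision(r, b2, b1, b0, e, d=3)
print(bin(f))                 # 0b1: only lane 0 flips

x = 0b10                      # current bit values in the two lanes
print(bin(x ^ f))             # 0b11: new bit values after the flip
```

Because every lane follows the same sequence of bitwise operations, no branch depends on the data being decoded.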
The above steps apply for a single bit in a single word. The full decoder for
a single word then consists of doing the above for each bit sequentially for a
fixed number of rounds. The reason for formulating the decoder with the help
of Boolean circuits was to allow running the decoder easily as a block-parallel
decoder. This can be realized since we can perform each type of Boolean operation,
or logic gate, using a single bitwise Boolean instruction on 32-bit integers on
the GPU. As a result, the decoder can evaluate a logic gate for 32 independent
Boolean circuits simultaneously, each corresponding to an independent received
word.
For generality, we will assume that we can perform bitwise Boolean operations
on w-bit types. This is because CUDA provides the vector types uint1, uint2
and uint4, which consist of 1, 2 and 4 unsigned integers each. That is, the types
contain 32, 64 and 128 bits, respectively. We can then define bitwise Boolean
operations to work on the vector types as well. In practice, the bitwise Boolean
operations on for example a variable of type uint4 must be performed as 4
separate 32-bit bitwise Boolean operations as there is only support for 32-bit
bitwise Boolean operations in CUDA hardware (NVIDIA, 2014). The benefit of
using vector types is that there is support in CUDA for memory loads from
global memory using vector types. This is beneficial for maximizing the use of
global memory bandwidth. The instruction-level parallelism is also increased by
using larger vector types as each of the 32-bit bitwise Boolean operations is
independent. The parallelization of the decoder then comes partly from using
w-bit types in each thread to process w independent words, and partly from
running multiple threads on each multiprocessor.
When decoding w words in each thread, the value of r_{e,b,d} needs to be specified
for each of the w words being decoded by the thread. This was done simply by
using a single LFSR as the pseudo-random number generator in each thread. The
pseudo-random value r was compared to the flip probabilities p_{e,b,d}, and
a full uint1, uint2 or uint4 was set to all ones or all zeros depending on the result
of the comparison. That is, the same random number generator, and therefore the
same random decision, was used for decoding of all words within a thread.
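As a rough illustration of this step, the sketch below advances a 32-bit Galois LFSR and turns one comparison into a w-bit mask of all ones or all zeros. The tap mask `TAPS`, the function names and the scaling of the probability to a 32-bit threshold are assumptions; the thesis does not specify these details in this section.

```python
MASK32 = 0xFFFFFFFF
TAPS = 0x80200003  # assumed tap mask (x^32 + x^22 + x^2 + x + 1), not from the thesis


def lfsr_step(state):
    """Advance a 32-bit Galois LFSR by one step; state must be nonzero."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAPS
    return state


def decision_mask(r, p, w=32):
    """Return w one-bits if the 32-bit random value r falls below p, else zero."""
    threshold = int(p * 2**32)
    return (1 << w) - 1 if r < threshold else 0
```

With w = 128 the returned mask plays the role of a uint4 whose four components are all ones or all zeros.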
Memory layout
The two most important memory-related aspects of the decoder implementation
are the layout and access of both received and current bits, and the layout and
access of the parity-check matrix. We first consider the layout of the words.
Generally, the words will be received sequentially at the decoder. Storing bits
in the order that they are received means that the bits of a single word will be
close to each other in memory, while bits in the same position but in different
words will in general be far from each other. We want the opposite to hold for the
decoder to be efficiently block-parallel. That is, we want bits in the same position
in different words to be close to each other in memory so that each thread in
the decoder can access bits in the same position of different words efficiently to
perform bitwise Boolean operations.
[Figure 5.2 (diagram): the transposed received words. Columns are grouped into sets of
w words assigned to thread 0, thread 1, thread 2, ... within multiprocessor 0,
multiprocessor 1, ...; row i is marked.]
Figure 5.2: The transposed received words. Each column contains one received word and
each row contains the bits in the ith position of each received word. A thread accesses
w consecutive bits from row i to get the values of the ith bits of w words processed by
that thread. In addition, the thread accesses w bits from each row corresponding to the
second neighbors of i. The work is split up so that consecutive threads within the same
multiprocessor process consecutive sets of w words.
We can think of the received bits as a binary matrix where each row contains
one word, and each row is stored sequentially in memory. With this analogy, we
wish to take the matrix transpose of the received bits, such that the bits in a given
position of each word form a row of the binary matrix. A binary matrix transpose
can be done quickly on modern CPUs with vector registers and operations. To be
more precise, the matrix transpose was implemented using instructions which are
part of the Advanced Vector Extensions (AVX) for the Intel 64 architecture (Intel,
2012). The transpose was realized using the instruction to left-shift 128-bit vectors
and the instruction to select most significant bits of each byte in a 128-bit vector.
Once the received bits have been transposed, the transposed bits are transferred
to the GPU where they are now laid out so that they can be accessed efficiently.
Once decoding is finished, the transposed bits are transferred back from the GPU
and transposed again so that the bits are in the order in which they were initially
received. Figure 5.2 shows the transposed memory layout more clearly. When
decoding, a thread is responsible for decoding a fixed set of w sequential words,
or columns, in the transposed matrix. A thread can then efficiently access the ith
bits of a set of w words for which the thread is responsible by reading consecutive
bits in the ith row of the transposed bits. Another benefit of this layout is that for
performance reasons a warp of 32 threads should generally access consecutive
memory locations in global memory. The transposed memory layout allows this,
since we can assign each warp that will be executed a set of 32 consecutive sets
of w words.
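The effect of the transpose can be illustrated with a deliberately naive sketch. It does not use the vector instructions mentioned above; it only shows the layout change, with each word and each transposed row held in a Python integer. The function name and the bit ordering (bit i of a word is its ith position) are assumptions made for the example.

```python
def transpose_words(words, n):
    """Transpose a list of n-bit words stored as integers.

    Bit i of words[k] becomes bit k of rows[i], so rows[i] packs the ith bit
    of every word into one integer.
    """
    rows = [0] * n
    for k, word in enumerate(words):
        for i in range(n):
            rows[i] |= ((word >> i) & 1) << k
    return rows
```

Applying the same function to the rows, with n set to the number of words, restores the original order, which mirrors the second transpose performed after decoding.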
The storage and access of the parity-check matrix is the second aspect that is
important for performance. For this decoder we use a QC-LDPC code because the
parity-check matrix can be stored implicitly, reducing memory use and accesses.
In addition, the positions of the ones in the parity-check matrix can be calculated
quickly from the implicit representation.
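[Figure 5.3 (diagram): Tanner graph neighborhood of the variable node i, with the second
neighbors i_{1,1}, i_{1,2}, i_{1,3}, i_{2,1}, i_{2,2}, i_{2,3}, i_{3,1}, i_{3,2}, i_{3,3}
grouped as N2(i).]
Figure 5.3: The second neighbors of the variable node i, denoted by N2(i), on a Tanner
graph where the variable node has degree 3 and the check nodes degree 4. Note that the
node i itself is not included in the set.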
When considering a bit in position i of a word we need to know which check
nodes are adjacent to i , and which variable nodes are adjacent to each neighbor
a of i to calculate the checksums of the neighboring check nodes, and how many
of them are unsatisfied. We will call these neighbors of neighbors of i the second
neighbors of i , and denote the set of second neighbors by N2(i ). We do not include
i itself as the second neighbor of i . Figure 5.3 shows the set of second neighbors
graphically. Because of the cyclic structure of the sub-matrices of a QC-LDPC
code it is enough to store the information of the second neighbors of a single
column or variable node within each major column of the parity-check matrix. As
before, we let the size of each sub-matrix in the parity-check matrix of a QC-LDPC
code be z×z and by a major column we mean a column in the model matrix of the
QC-LDPC code. Thus, a major column specifies the ones in z columns of the full
parity-check matrix. Given the second neighbors of a variable node, or column,
within the major column, we can calculate the positions of the second neighbors
of the other columns within the major column. For simplicity, we choose to store
the second neighbors of the leftmost column in each major column. Figure 5.4
shows the parity-check matrix of a QC-LDPC code with the neighbors of one
check node highlighted. For generality, we will assume that the code is irregular,
in which case we also need to store the degree of the variable node i as well as the
degrees of the adjacent check nodes. Since we do not include i itself as a second
neighbor, we only need to store the indices of d − 1 neighbors of a check node,
assuming that it has degree d . Figure 5.5 shows a portion of the memory layout of
the parity-check matrix containing the second neighbors of one column within
a major column. Note that each portion corresponding to one major column is
padded to a maximum length so that the address of the first element
in each portion can be calculated by multiplying the index of the major column
with the maximum block length, assuming that the major column indices are
zero-based.
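The padded layout can be sketched as a flat list of integers. The helper names `pack_major_column` and `unpack_major_column`, the use of 0 as the padding value and the example indices are illustrative assumptions; only the ordering (variable degree, then each check degree followed by its d − 1 stored indices) follows the description above.

```python
def pack_major_column(checks, max_len):
    """Flatten one major column as [d, d_1, i_{1,1}, ..., d_2, i_{2,1}, ...].

    checks is a list of lists; each inner list holds the column indices of the
    other variable nodes of one adjacent check node (d_a - 1 entries).
    """
    block = [len(checks)]               # variable node degree d
    for others in checks:
        block.append(len(others) + 1)   # check degree d_a, counting i itself
        block.extend(others)
    if len(block) > max_len:
        raise ValueError("max_len is too small for this major column")
    return block + [0] * (max_len - len(block))  # pad to a fixed length


def unpack_major_column(table, major_col, max_len):
    """Read back the second neighbors stored for major column major_col."""
    pos = major_col * max_len  # zero-based major column index
    d = table[pos]
    pos += 1
    checks = []
    for _ in range(d):
        d_a = table[pos]
        checks.append(table[pos + 1 : pos + d_a])
        pos += d_a
    return checks
```

Because every portion has the same length, locating a major column needs one multiplication and no search.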
[Figure 5.4 (diagram): parity-check matrix of a QC-LDPC code with the columns i, i_{a,1},
i_{a,2} and i_{a,3} and the row a highlighted.]
Figure 5.4: The parity-check matrix of a QC-LDPC code where the positions of the
nonzero entries are shown on one major row. The diagonal lines show the positions of
the ones in the sub-matrices. In addition, the columns whose indices are stored in the
implicit representation for the GPU implementation of the stochastic bit-flipping decoder
are shown in orange. The column marked with i is the first column in the major column
whose second neighbors we wish to store. The blue row marked with a is one of the
check node neighbors of i. The three other orange columns marked with i_{a,1}, i_{a,2} and
i_{a,3} are the neighbors of the check node a. The indices of the three orange columns are
then stored in the implicit representation of the parity-check matrix. This is repeated for
other check node neighbors that i may have, and for each major column.
Let us look more closely at how the memory layout of the parity-check matrix
is used by the decoder, and let us for the moment again only consider decoding
a single word. Calculating the positions of the second neighbors of the other
columns within a major column is especially convenient to do if we consider
the columns in order, beginning from the leftmost column. That is, we store in
the implicit representation the positions of the second neighbors of the leftmost
column within each major column. The implicit representation of the parity-check
matrix we have presented above is too large to fit into registers, but is
small enough to fit in the shared memory of a multiprocessor, so it is stored in
the shared memory. Let the first column within a major column have index i.
We first load the indices of the second neighbors of column i from shared
memory. Once the indices have been loaded, we load from global memory the
bits corresponding to the column i and its second neighbors. We also load the
channel value corresponding to the column i. We then calculate the checksums
of the neighbors of i, count the number of unsatisfied checks, decide if the
ith bit should be flipped, and flip it if needed. We then continue to column i + 1.
For column i + 1 it is enough to increment the index of each second neighbor of
column i by one. Because of the cyclic nature of the sub-matrices, we also need to
check for each incremented index if it is divisible by the sub-matrix size z. If it is,
we subtract z from the index value, effectively ensuring that we follow the cyclic
structure of the sub-matrix. For this we again assume that the column indices are
zero-based. We can then flip the bit corresponding to the second column of the
major column if needed. We can continue in this way, incrementing the indices and
subtracting z when needed, for all columns i, i + 1, ..., i + z − 1 in the major column
to get the correct indices of the second neighbors of each column. For each column, we
use the calculated indices to load the corresponding bits from global memory,
calculate the checksums, count the number of unsatisfied checks and flip the bit
if necessary.
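The index update from one column to the next can be written down directly. The sketch below assumes zero-based column indices, as in the text, and uses a hypothetical function name; it covers only the index arithmetic, not the loads or the flip decision.

```python
def next_column_indices(indices, z):
    """Advance second-neighbor indices from column i to column i + 1."""
    advanced = []
    for idx in indices:
        idx += 1
        if idx % z == 0:   # stepped past the last column of the z-by-z sub-matrix
            idx -= z       # wrap around to its first column
        advanced.append(idx)
    return advanced
```

For example, with z = 4 the indices [3, 5, 8] become [0, 6, 9], and after z steps every index is back at its starting value.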
[Figure 5.5 (memory layout): ... | d | d_1 | i_{1,1} | i_{1,2} | ... | i_{1,d_1−1} | d_2 |
i_{2,1} | ... | i_{d,d_d−2} | i_{d,d_d−1} | ...]
Figure 5.5: A portion of the memory layout of the implicit representation of a QC-LDPC
code corresponding to a single major column. The second neighbors of the first column i
in the major column are stored. The degree of the variable node or column is given by d.
The degrees of the neighboring check nodes are given by d_1, d_2, ..., d_d, and the position
of the jth variable neighbor of the ath check neighbor of i is given by i_{a,j}. Note that we
do not store the column i itself as a neighbor of its neighbors.
First, the indices loaded and calculated in the above procedure can easily
be used within a thread to consider w independent words. Instead of loading a
single bit for each column i and its second neighbors, the thread loads w bits,
corresponding to w different words, for each column i and its second neighbors.
Second, each thread then performs the above calculations independently on its
own set of w words. An important practical detail is that the indices of the second
neighbors can be loaded without conflicts from shared memory. Since each thread
in a warp accesses the same position in shared memory, the result is broadcast
to each thread in the warp with a single memory read request (NVIDIA, 2014).
Another consideration is checking for divisibility efficiently. If the sub-matrix size
z is a power of 2, checking for divisibility by z can be done efficiently using for
example a simple bit-masking operation. In the general case, however, checking
for divisibility requires slightly more work (Warren, 2012).
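For the power-of-two case, the bit-masking check mentioned above can be sketched as follows. The function name is hypothetical and the argument validation is added only for the example.

```python
def is_multiple_of_pow2(idx, z):
    """Check divisibility of idx by z, assuming z is a positive power of two."""
    if z <= 0 or z & (z - 1):
        raise ValueError("z must be a positive power of two")
    # The low log2(z) bits of idx are all zero exactly when z divides idx.
    return (idx & (z - 1)) == 0
```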
To further optimize the decoder kernel, the flip probabilities were generated
before compiling the kernel and the values hard-coded. In addition, code was
generated to specialize on the degrees of the variable and check nodes to be able
to unroll the majority of the loops and so improve performance. This could be
done on the level of a major column of the parity-check matrix as the degrees
are fixed within a major column.
Altogether the GPU implementation of the stochastic bit-flipping decoder
does the following steps:
1. Transpose the received words on the CPU.
2. Copy the transposed words to the GPU. Make an additional copy of the
received words on the GPU for determining e .
3. Each thread decodes w words for a fixed number of iterations.
4. Copy the words back from the GPU.
5. Transpose the decoded words back.
In the next chapter we will look at how fast the decoder is in practice.

Chapter 6
Experimental results
In this chapter, we will present an experimental comparison of the error-correcting
performance of five decoders implemented in software and run with typical
configurations. In addition, we will look at the encoding complexity of codes from a
regular and an irregular ensemble of LDPC codes using the method proposed by
Richardson and Urbanke (2001b). Finally, we will present experimental results on
the decoding throughput of the GPU implementation of the stochastic bit-flipping
decoder. To thoroughly examine the performance of decoders one would need
to take into account details of hardware implementations of the decoders, and
implementing or simulating multiple hardware designs of decoders is beyond the
scope of this thesis. For this reason, we will focus only on the error-correcting
performance, or bit-error rate, of the decoders and ignore the throughput, latency
and other hardware-related details in the decoder comparison. On the other hand,
the performance evaluation of the GPU implementation will only focus on the
throughput and latency of the decoder, and on how well it utilizes the hardware.
6.1 Comparison of decoders
For this thesis, five different decoders were implemented in software to run on
the CPU and were tested with three different codes or ensembles of codes on
two different channels. The five decoders implemented were the sum-product
decoder, the min-sum decoder, the gradient-descent bit-flipping decoder,
Gallager’s bit-flipping decoder and the stochastic bit-flipping decoder presented in
Section 4.2.5. The implementations of the sum-product and min-sum decoders
use 64-bit message values with the assumption that this approximates infinite
precision messages well. Indeed, as we reviewed briefly in Section 4.3.3,
even fewer than 10 bits for the message values is generally enough to
approximate the infinite precision sum-product decoder sufficiently well. Because
of this, we consider 64-bit message values to be sufficient for the performance of
the sum-product decoder to be a good baseline to which other decoders can be
compared. In addition, the channel noise parameter was assumed known to the
sum-product and min-sum decoders for initializing the messages from channel
factors to variable nodes.
The sum-product and min-sum decoders were both run for 20 iterations. The
three bit-flipping decoders were each run for 100 iterations with a sequential
schedule. The numbers of iterations for the decoders were chosen to be typical
values found in the literature. In Section 5.1 on simulations of hardware decoders
we saw implementations which use approximately 10 iterations with min-sum
decoders, and so 20 iterations was chosen to ensure that the error-correcting
performance is generally not worse than it would be in a hardware implementation.
The higher number of 100 iterations for the bit-flipping decoders was chosen as
the bit-flipping decoders are generally less complex and require more iterations
for good performance. The number of iterations for the bit-flipping decoders is
also typical for the literature. For example, in the work by Sundararajan et al.
(2014) where the noisy gradient-descent bit-flipping decoder is presented, the
different variations are run with 100–300 iterations. The random number generator
used for the CPU implementation of the stochastic bit-flipping decoder is identical to
the LFSR used in the GPU implementation.
The decoders were tested with three types of codes. First, codes from the (3,6)-
regular ensemble were used. Second, codes from an irregular ensemble of codes
presented by Richardson et al. (2001) were used. The ensemble has maximum
variable node degree 4 and has been optimized to have a high threshold in the
asymptotic case of infinite block length and an infinite number of iterations with
the sum-product decoder. The degree distribution of the ensemble is given by
the following polynomials
L(x) = 0.54883 x^2 + 0.04042 x^3 + 0.41075 x^4,
R(x) = 0.276153 x^5 + 0.723847 x^6.
Here, the coefficient of a monomial x^i of the polynomial L(x) gives the fraction of
variable nodes with degree i. The polynomial R(x) gives the equivalent fractions
for the check nodes. Codes from the two ensembles were drawn by rejecting codes
which have duplicate edges in their Tanner graph. Finally, the rate-1/2 QC-LDPC
code defined in the WiMAX standard (IEEE, 2009) was used. All three codes or
code ensembles have rate 1/2.
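As a quick consistency check on the degree distribution, the design rate can be computed from the node fractions by counting edges from both sides of the Tanner graph. The sketch below is illustrative only; the function names and the dictionary representation of L(x) and R(x) are assumptions.

```python
def average_degree(dist):
    """dist maps a node degree to the fraction of nodes with that degree."""
    return sum(degree * fraction for degree, fraction in dist.items())


def design_rate(var_dist, check_dist):
    """Design rate 1 - m/n, using n * avg_var_degree = m * avg_check_degree."""
    return 1.0 - average_degree(var_dist) / average_degree(check_dist)


# Node-perspective degree distributions of the irregular ensemble.
L = {2: 0.54883, 3: 0.04042, 4: 0.41075}
R = {5: 0.276153, 6: 0.723847}
```

With these coefficients the average variable node degree is 2.86192 and the average check node degree is 5.723847, so `design_rate(L, R)` is 1/2 up to the rounding of the coefficients.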
A more thorough set of tests was run with the sum-product decoder and
the stochastic bit-flipping decoder. The two decoders were tested with block
lengths 2^i for i = 10, 11, ..., 20 with the (3,6)-regular ensemble and the irregular
ensemble. The decoders were only tested with the block length 2304 with the
rate-1/2 WiMAX code, which is the longest block length defined in the WiMAX
standard. The transmission of a total of 2^30 bits was simulated for each code,
block length, channel and decoder. More precisely, for the two ensembles of
codes, the following was repeated ⌊2^30/n⌋ times for each block length n: draw a new
code, simulate transmission over a channel, and decode the received word. The
same was repeated ⌊2^30/2304⌋ times for the rate-1/2 WiMAX code, with the difference
that the same code was used in each repetition. Notice that encoding is not
simulated because it is time consuming for codes from the (3,6)-regular ensemble
with long block lengths. Instead, it is assumed that the sent codeword is the
all-zeros codeword when transmitting over the BSC as it is a codeword of any
LDPC code. It can be assumed that the all-zero codeword is sent as, by symmetry,
the performance of a code and decoder does not depend on the sent codeword,
but only on the noise in the channel (Richardson and Urbanke, 2008). On the
BAWGNC, the equivalent assumption is that the all-ones codeword is sent. We
will consider encoding complexity separately in Section 6.2.
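The repetition count and the all-zeros shortcut on the BSC can be sketched as follows. The function names are hypothetical, and Python's `random` module stands in for whatever noise source the simulations actually used.

```python
import random


def repetitions(total_bits, n):
    """Number of transmitted words of length n: floor(total_bits / n)."""
    return total_bits // n


def bsc_all_zero(n, epsilon, rng):
    """Received word when the all-zeros codeword is sent over a BSC(epsilon)."""
    # Each bit is flipped independently with probability epsilon.
    return [1 if rng.random() < epsilon else 0 for _ in range(n)]
```

For the WiMAX code this gives `repetitions(2**30, 2304) == 466033` simulated words, and no encoder is needed to produce the transmitted word.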
The three remaining decoders (the min-sum decoder, the gradient-descent
bit-flipping decoder, and Gallager’s bit-flipping decoder) were tested with a smaller
set of tests. They were all run with the block length 2^14 for the two ensembles of
codes, and with the block length 2304 for the rate-1/2 WiMAX code. The tests were
repeated identically to the tests with the sum-product decoder and the stochastic
bit-flipping decoder by simulating the transmission of a total of 2^30 bits.
Finally, all decoders were tested on both the BSC and the BAWGNC, with
the exception of Gallager’s bit-flipping decoder and the stochastic bit-flipping
decoder as they only work on hard channel values. However, the results of the two
bit-flipping decoders on the BSC were translated to the BAWGNC by assuming
hard-thresholding of the soft-channel values. The decoders were run with a range
of noise levels to see the behavior as a function of the noise as well.
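One common way to carry out such a translation is to compute the crossover probability of the BSC obtained by thresholding the BAWGNC output at zero. The sketch below assumes antipodal signalling with unit amplitude and Gaussian noise with standard deviation σ, which gives a crossover probability of Q(1/σ). The function name and this signalling convention are assumptions and are not taken from this section.

```python
import math


def hard_decision_crossover(sigma):
    """Crossover probability Q(1 / sigma) after thresholding +/-1 signalling at zero."""
    return 0.5 * math.erfc(1.0 / (sigma * math.sqrt(2.0)))
```

Under this assumption, a result measured on the BSC at crossover probability ϵ corresponds to the BAWGNC noise level σ for which `hard_decision_crossover(sigma)` equals ϵ.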
6.1.1 Choosing the decoder parameters
The stochastic bit-flipping decoder has two free parameters for the
parameterization of the flip probabilities as presented in Section 4.2.5, and the
gradient-descent bit-flipping decoder with a sequential schedule has one free parameter,
the threshold θ for flipping a bit. The free parameters were chosen by running
a constant-spaced grid search over a set of parameter values, after which the
overall best parameter values were chosen. The results of the grid search are
shown in Appendix B. It is good to note from the results of the grid searches
that the optimal values generally depend on the level of noise. The parameter
values were chosen here to give good results at moderate levels of noise, possibly
leaving larger error floors for low levels of noise. The chosen parameter values are
shown in Tables 6.1 and 6.2. For the comparison here, we only considered fixed
parameter values, but further optimization of the error-correcting performance
could involve dynamically adjusting the parameter values based on the level of
noise or the iteration number. This has already been suggested by for example
Ismail et al. (2013) in the context of bit-flipping decoders.
Gallager’s bit-flipping decoder has a free parameter which adjusts how many
adjacent check nodes must be unsatisfied for a bit to be flipped. For Gallager’s
bit-flipping decoder this value was simply chosen for the two ensembles of codes
and the rate-1/2 WiMAX code. For the (3,6)-regular ensemble, the threshold was
set to 2, meaning that 2 or more check nodes adjacent to a variable node must be
unsatisfied for the bit to be flipped. The threshold was set to 2 for the irregular
ensemble and to 3 for the rate-1/2 WiMAX code.
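The thresholding rule can be illustrated with a small sketch of a bit-flipping decoder using a sequential schedule. It covers only the rule described above, on an adjacency-list representation of the Tanner graph; the function and parameter names are assumptions, and it is not the implementation used for the experiments.

```python
def gallager_bit_flipping(bits, var_to_checks, check_to_vars, threshold, iterations):
    """Sequentially flip each bit with at least `threshold` unsatisfied checks."""
    bits = list(bits)
    for _ in range(iterations):
        for i, checks in enumerate(var_to_checks):
            unsatisfied = 0
            for a in checks:
                # Parity of check a using the current (already updated) bits.
                parity = 0
                for j in check_to_vars[a]:
                    parity ^= bits[j]
                unsatisfied += parity
            if unsatisfied >= threshold:
                bits[i] ^= 1
    return bits
```

For a toy code with checks on the pairs (0,1), (1,2) and (0,2), the received word [1, 0, 0] with threshold 2 is corrected to the all-zeros word in one iteration, since both checks adjacent to bit 0 are unsatisfied.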
Table 6.1: Parameter values chosen for the stochastic bit-flipping decoder.
Parameter    (3,6)-regular    Irregular    WiMAX
T            0.8              0.8          0.9
p            0.12             0.08         0.08
Table 6.2: Parameter values chosen for the gradient-descent bit-flipping decoder.
Parameter      (3,6)-regular    Irregular    WiMAX
θ (BAWGNC)     −0.8             −0.4         −0.6
θ (BSC)        −0.5             −0.5         −0.5
6.1.2 Results
Figure 6.1 shows the bit-error rate P_b of the sum-product decoder on the BSC
as a function of the crossover probability ϵ and the block length n. With the
(3,6)-regular ensemble the bit-error rate clearly shows a typical waterfall region
around ϵ = 0.075 and an error floor at lower levels of noise. On the other hand,
the bit-error rate of the irregular code is far worse than that of the (3,6)-regular
ensemble. Additionally, the irregular ensemble doesn’t show a clear waterfall
region and has a relatively high bit-error rate even with long block lengths and
low levels of noise. The bit-error rate of the rate-1/2 WiMAX code is again better
than the irregular ensemble with block length 2048 and slightly better than the
(3,6)-regular ensemble with the block length 2048. However, with the block length
4096 the (3,6)-regular ensemble performs roughly equally to the rate-1/2 WiMAX
code. Figure 6.2 shows the bit-error rate of the sum-product decoder on the
BAWGNC as a function of the signal-to-noise ratio and the block length n. The
behavior is largely similar to that on the BSC, with the rate-1/2 WiMAX code having
the lowest bit-error rate compared to short block lengths with the (3,6)-regular
ensemble and the irregular ensemble. With longer block lengths, the (3,6)-regular
ensemble shows again typical behavior and the irregular ensemble performs
worse than the (3,6)-regular ensemble. Finally, the bit-error rate of the stochastic
bit-flipping decoder on the BSC is shown in Figure 6.3. The behavior is similar
to the sum-product decoder on the BSC but with slightly worse performance in
general. It is good to note that the error floors are at similar levels using both
decoders, and that the main difference in error-correcting performance with the
current implementations is between the thresholds of the decoders, with that of
the stochastic bit-flipping decoder being lower.
The comparison of the five decoders on the BSC is shown in Figure 6.4. The
results show a clearer picture of how the stochastic bit-flipping decoder compares
to the generally better message-passing decoders and to simpler bit-flipping
decoders. The sum-product decoder is consistently the best decoder with the
(3,6)-regular ensemble, irregular ensemble and the rate-1/2 WiMAX code. The
stochastic bit-flipping decoder behaves similarly to the sum-product decoder with the
difference that the results are shifted to lower levels of noise. With the (3,6)-regular
ensemble the gap in error-correcting performance between the two decoders is
the smallest. In terms of the crossover probability, the results of the stochastic
bit-flipping decoder are shifted to the left from the sum-product decoder by
approximately 0.01. The corresponding gaps are 0.02 with the irregular ensemble
and the rate-1/2 WiMAX code. The two other bit-flipping decoders, Gallager’s
bit-flipping decoder and the gradient-descent bit-flipping decoder, generally perform
the worst. Gallager’s bit-flipping decoder only barely improves on transmission
without any encoding at low levels of noise, and the gradient-descent bit-flipping
decoder performs only slightly better. However, with the (3,6)-regular ensemble
the gradient-descent bit-flipping decoder performs clearly better than Gallager’s
bit-flipping decoder. Interestingly, the min-sum decoder performs badly compared
to the sum-product decoder because of the hard channel values on the BSC.
Figure 6.5 shows the same comparison as above but on the BAWGNC. Compared
to the BSC, the min-sum decoder now performs nearly as well as the sum-product
decoder. The gradient-descent bit-flipping decoder also performs better
relative to the stochastic bit-flipping decoder as it can make use of soft channel
values. As explained earlier, the results of Gallager’s bit-flipping decoder and the
stochastic bit-flipping decoder have been translated from the results on the BSC
by assuming hard-thresholding of the channel values on the BAWGNC. Thus, the
two bit-flipping decoders clearly don’t benefit from moving to the BAWGNC and
so their performance is comparatively worse. However, the stochastic bit-flipping
decoder still performs relatively well. Most importantly, it performs better than
the gradient-descent bit-flipping decoder, showing that despite using only hard
values, adding noise to the decoding process can improve the error-correcting
performance.
[Figure 6.1 (three stacked plots): bit-error rate P_b, logarithmic axis from 10^-10 to 10^0,
against crossover probability ϵ from 0.00 to 0.10.]
Figure 6.1: Bit-error rate P_b of the sum-product decoder on the BSC as a function of
the crossover probability ϵ. The sum-product decoder was run for 20 iterations, using
the (3,6)-regular ensemble with block lengths 2^i for i = 10, 11, ..., 20 (top), the irregular
ensemble with block lengths 2^i for i = 10, 11, ..., 20 (middle), and the rate-1/2 WiMAX
code with block length 2304 (bottom). The transmission of a total of 2^30 bits was simulated
for each code, block length and crossover probability. For the (3,6)-regular ensemble and
the irregular ensemble, the bit-error rate decreases as the block length increases. Note
that the vertical axis is logarithmic, and that if no errors occurred for a particular block
length and crossover probability that particular result is not plotted.
[Figure 6.2 (three stacked plots): bit-error rate P_b, logarithmic axis from 10^-10 to 10^0,
against SNR in dB from 0 to 8.]
Figure 6.2: Bit-error rate P_b of the sum-product decoder on the BAWGNC as a function of
the signal-to-noise ratio (SNR) in dB. The sum-product decoder was run for 20 iterations,
using the (3,6)-regular ensemble with block lengths 2^i for i = 10, 11, ..., 20 (top), the
irregular ensemble with block lengths 2^i for i = 10, 11, ..., 20 (middle), and the rate-1/2
WiMAX code with block length 2304 (bottom). The transmission of a total of 2^30 bits
was simulated for each code, block length and signal-to-noise ratio. For the (3,6)-regular
ensemble and the irregular ensemble, the bit-error rate decreases as the block length
increases. Note that the vertical axis is logarithmic, and that if no errors occurred for a
particular block length and signal-to-noise ratio that particular result is not plotted.
[Figure 6.3 (three stacked plots): bit-error rate P_b, logarithmic axis from 10^-10 to 10^0,
against crossover probability ϵ from 0.00 to 0.10.]
Figure 6.3: Bit-error rate P_b of the stochastic bit-flipping decoder on the BSC as a
function of the crossover probability ϵ. The stochastic bit-flipping decoder was run for
100 iterations, using the (3,6)-regular ensemble with block lengths 2^i for i = 10, 11, ..., 20
(top), the irregular ensemble with block lengths 2^i for i = 10, 11, ..., 20 (middle), and
the rate-1/2 WiMAX code with block length 2304 (bottom). The parameter values of the
stochastic bit-flipping decoder are shown in Table 6.1. The transmission of a total of
2^30 bits was simulated for each code, block length and crossover probability. For the
(3,6)-regular ensemble and the irregular ensemble, the bit-error rate decreases as the
block length increases. Note that the vertical axis is logarithmic, and that if no errors
occurred for a particular block length and crossover probability that particular result is
not plotted.
It is interesting to note the performance of the irregular ensemble. As can be
seen from the figures, the performance of the irregular ensemble is significantly
worse than the (3,6)-regular ensemble on both the BSC and the BAWGNC. At this
point it is important to remember that the degree distribution of the irregular
ensemble we have used here was optimized in the asymptotic setting. The threshold
is indeed better than for the (3,6)-regular ensemble but the threshold alone does
not guarantee good performance. On the other hand, this also does not mean
that irregular codes cannot perform well. The rate-1/2 WiMAX code is an irregular code
and performs better than the (3,6)-regular ensemble at the short block length
of approximately 2000 bits. However, at longer block lengths the (3,6)-regular
ensemble performs better than the short rate-1/2 WiMAX code.
To put the performance of the decoders into perspective, the Shannon limit
for codes with rate 1/2 on the BSC is approximately ϵ = 0.11. The Shannon limit
for rate-1/2 codes on the BAWGNC is 0.19 dB or, expressed in terms of the standard
deviation of the noise, σ = 0.98 (Richardson et al., 2001).
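The BSC value can be reproduced from the capacity formula C(ϵ) = 1 − h(ϵ), where h is the binary entropy function: the Shannon limit for rate 1/2 is the crossover probability at which the capacity equals 1/2. The following sketch finds it by bisection; the function names and the tolerance are choices made for the example.

```python
import math


def binary_entropy(x):
    """Binary entropy function in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def bsc_shannon_limit(rate, tol=1e-12):
    """Largest crossover probability in [0, 1/2] with capacity 1 - h(eps) >= rate."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if 1.0 - binary_entropy(mid) >= rate:
            lo = mid   # capacity still sufficient, move right
        else:
            hi = mid   # capacity too small, move left
    return lo
```

Calling `bsc_shannon_limit(0.5)` gives a value very close to 0.11, in agreement with the figure quoted above.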
6.2 Complexity of approximate lower triangular encoding
The performance of the encoding method by Richardson and Urbanke (2001b)
using an approximate lower triangular form of the parity-check matrix was tested
using codes from the same (3,6)-regular ensemble and irregular ensemble already
used for comparing decoders. The encoding time was recorded for block lengths
2^i for i = 10, 11, ..., 20 with 2^24 total encoded bits for each block length. The
repetitions for each block length n were made similarly to how the decoders were
tested. The following was repeated ⌊2^24/n⌋ times for each block length: draw a new
code and encode a word using the drawn code.
The encoding tests were run on the following hardware: HP ProLiant BL465c
G6 with two 2.6 GHz AMD Opteron 2435 CPUs with six cores each, and 32 GiB
of DDR2-800 main memory. The encoder implementation is single-threaded.
6.2.1 Results
Figure 6.6 shows the encoding time in seconds as a function of the block length
of the regular and irregular code side by side. In addition to showing the total
encoding time, the time taken to perform the greedy approximate lower
triangulation, the time taken to perform Gaussian elimination and invert ϕ, and the
actual encoding time ignoring the preprocessing steps are shown.

[Plot: three stacked panels. Horizontal axis: crossover probability ϵ, 0.00 to 0.10. Vertical axis: bit-error rate $P_b$, $10^{-10}$ to $10^{0}$. Legend (Decoder): Sum-product, Min-sum, GDBF, Gallager’s BF, SBF, Uncoded.]
Figure 6.4: Comparison of decoders on the BSC. The bit-error rate $P_b$ as a function of
the crossover probability ϵ using the (3,6)-regular ensemble with block length $2^{14}$ (top),
the irregular ensemble with block length $2^{14}$ (middle), and the rate-1/2 WiMAX code with
block length 2304 (bottom). Five decoders were compared: the sum-product decoder,
the min-sum decoder, the gradient-descent bit-flipping decoder (GDBF), Gallager’s
bit-flipping decoder (Gallager’s BF), and the stochastic bit-flipping decoder (SBF). Also
shown is uncoded transmission in gray. The sum-product decoder and the min-sum
decoder were run with 20 iterations; the rest were run with 100 iterations. The parameter
values of the stochastic bit-flipping decoder and the gradient-descent bit-flipping decoder
are shown in Table 6.1 and Table 6.2. The transmission of a total of $2^{30}$ bits was simulated
for each code, block length and crossover probability. Note that the vertical axis is
logarithmic, and that if no errors occurred for a particular block length and crossover
probability that particular result is not plotted.

[Plot: three stacked panels. Horizontal axis: SNR [dB], 0 to 8. Vertical axis: bit-error rate $P_b$, $10^{-10}$ to $10^{0}$. Legend (Decoder): Sum-product, Min-sum, GDBF, Gallager’s BF, SBF, Uncoded.]
Figure 6.5: Comparison of decoders on the BAWGNC. The bit-error rate $P_b$ as a function
of the signal-to-noise ratio (SNR) in dB using the (3,6)-regular ensemble with block
length $2^{14}$ (top), the irregular ensemble with block length $2^{14}$ (middle), and the rate-1/2
WiMAX code with block length 2304 (bottom). Five decoders were compared: the
sum-product decoder, the min-sum decoder, the gradient-descent bit-flipping decoder
(GDBF), Gallager’s bit-flipping decoder (Gallager’s BF), and the stochastic bit-flipping
decoder (SBF). Also shown is uncoded transmission in gray. The sum-product decoder and
the min-sum decoder were run with 20 iterations; the rest were run with 100 iterations.
The parameter values of the stochastic bit-flipping decoder and the gradient-descent
bit-flipping decoder are shown in Table 6.1 and Table 6.2. The transmission of a total of
$2^{30}$ bits was simulated for each code, block length and signal-to-noise ratio. Note that the
vertical axis is logarithmic, and that if no errors occurred for a particular block length
and signal-to-noise ratio that particular result is not plotted.

[Plot: two panels side by side. Horizontal axis: block length $n$, $10^{3}$ to $10^{6}$. Vertical axis: encoding time [s], $10^{-4}$ to $10^{4}$. Legend (Encoding step): Encoding with prepr., Greedy triangulation, GE and inversion, Only encoding.]
Figure 6.6: Encoding time $t$ as a function of the block length $n$ for the (3,6)-regular
ensemble (left) and the irregular ensemble (right) using the method by Richardson and
Urbanke (2001b). The block length $n$ was $2^i$ for $i = 10, 11, \ldots, 20$. The graphs show the
total encoding time including preprocessing steps, the time to perform greedy approximate
upper triangulation of the parity-check matrix, the time to perform Gaussian elimination
and invert ϕ, and the time to perform only the actual encoding without preprocessing.
The encoding was repeated so that a total of $2^{20}$ bits were encoded for each block length
and ensemble. Both axes are logarithmic.

It is clear that the main computational cost in the encoding method by Richardson and Urbanke
(2001b) comes from performing the Gaussian elimination and inverting ϕ. For
the regular code this can clearly be attributed to the linear growth of the gap
$g$, as can be seen in Figure 6.7. However, the average gap of codes from the
irregular ensemble is consistently small up to the block length $2^{20}$. This leads to
consistently low total encoding times including preprocessing. In practice, the
preprocessing is done beforehand and does not need to be repeated each time
a new block is encoded. In the current software implementation, the encoding
times excluding preprocessing are similar for both the (3,6)-regular ensemble and
the irregular ensemble. It is interesting to note that the degree distribution for
the irregular ensemble is not significantly different from the (3,6)-regular case.
The two ensembles have similar average variable and check node degrees, but
the reason for the small average gap for the irregular ensemble is essentially the
larger number of variable nodes of degree two.
[Plot: single panel. Horizontal axis: block length $n$, $10^{3}$ to $10^{6}$. Vertical axis: average gap $g$, $10^{0}$ to $10^{5}$. Legend (Code type): Regular, Irregular.]
Figure 6.7: Average gap $g$ of the parity-check matrix after greedy approximate lower
triangulation as a function of the block length $n$ for the (3,6)-regular ensemble and the
irregular ensemble. The block length $n$ was $2^i$ for $i = 10, 11, \ldots, 20$. Note that the
average gap of the irregular code is consistently less than 10 even for the block length
$2^{20}$. Both axes are logarithmic.
6.3 The stochastic bit-flipping decoder on the GPU
As noted in Section 5.3.1, to fully utilize the GPU a sufficient number of threads
needs to be started simultaneously or there needs to be a sufficient number
of independent instructions executed in each thread. In practice, this means
that a sufficient number of received words needs to be decoded simultaneously.
The number of threads to use depends in general on the type of GPU used for
computations. The implementation for the current work was tested on two CUDA
devices: (i) the NVIDIA Tesla M2090 and (ii) the NVIDIA Tesla K40.
The decoder was tested on both devices using two types of codes. The first
is the rate-1/2 code of block length 1536 defined in the WiMAX standard and
the second is a (3,4)-regular QC-LDPC code, also with block length 1536. The
(3,4)-regular code was generated by first generating a (3,4)-regular code of block
length 24 using the configuration model. Then, for each nonzero entry of the
smaller parity-check matrix a shift value was drawn uniformly at random
between 0 and 63. By setting the nonzero entries in the smaller parity-check
matrix to the randomly drawn shift values we form the model matrix of the (3,4)-
regular QC-LDPC code. By using the block length 1536 with a model matrix with
24 columns the sub-matrix size is $2^6 = 64$, meaning that checking for divisibility
by the sub-matrix size can be done quickly.
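The construction described above can be sketched as follows. This is an illustration under stated assumptions, not the code used for the thesis: the base matrix is a toy 3 × 4 all-ones matrix (which happens to be (3,4)-regular) rather than a 24-column matrix drawn from the configuration model, zero blocks are marked with -1 in the model matrix, and a shift of s is taken to mean the identity matrix with its columns cyclically shifted by s. The names `random_shifts` and `expand` are hypothetical.

```python
import random

def random_shifts(base, z, rng):
    """Replace each 1 of the base matrix by a uniform shift in [0, z-1]; zeros become -1."""
    return [[rng.randrange(z) if entry else -1 for entry in row] for row in base]

def expand(model, z):
    """Expand a model matrix into a binary parity-check matrix of z-by-z circulant blocks."""
    assert z > 0 and (z & (z - 1)) == 0, "this sketch assumes z is a power of two"
    rows, cols = len(model), len(model[0])
    H = [[0] * (cols * z) for _ in range(rows * z)]
    for r, row in enumerate(model):
        for c, shift in enumerate(row):
            if shift < 0:
                continue  # all-zero block
            for i in range(z):
                # z is a power of two, so reduction modulo z is a bit mask
                H[r * z + i][c * z + ((i + shift) & (z - 1))] = 1
    return H

base = [[1, 1, 1, 1] for _ in range(3)]  # toy (3,4)-regular base matrix
model = random_shifts(base, 64, random.Random(0))
H = expand(model, 64)
print(len(H), len(H[0]))  # dimensions of the expanded parity-check matrix
```

Every nonzero block is a permutation matrix, so the expanded matrix inherits the column weight 3 and row weight 4 of the base matrix. The bit mask in `expand` is one way in which a power-of-two sub-matrix size makes index arithmetic cheap.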
To reach a high throughput, the decoder was tested using different
configurations. The free parameters were the number of threads per multiprocessor
core and the vector type used for bitwise Boolean operations. Let $w$ be the
number of bits in a vector type, $n_A$ the number of threads used per core and $n_C$ the
total number of cores available on the GPU. The total number of words being
decoded simultaneously on the GPU is then $n_C n_A w$. The decoder was tested on
both GPUs with $n_A = 2^i$ for all $i = 0, 1, \ldots, 5$. Additionally, the decoder was tested
using the vector types uint1, uint2 and uint4 for all bitwise Boolean operations,
meaning that each thread was responsible for decoding $w = 32$, 64 or 128 words,
respectively, with each vector type. The decoder was run for 100 iterations on
both GPUs with all configurations. For each set of parameters, the decoder was
run once to decode $n_C n_A w$ words and the execution time of the decoding process
was recorded.
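The idea of letting one thread handle $w$ words through bitwise Boolean operations can be illustrated with a small host-side sketch. It assumes a bit-sliced layout in which bit $k$ of every machine word belongs to the $k$-th received word. This layout is an assumption made for illustration, and the code is plain Python rather than the CUDA implementation of the thesis. The names `pack` and `syndromes` are hypothetical.

```python
W = 32  # words handled per machine word, as for the uint1 vector type

def pack(words):
    """Transpose W binary words of length n into n integers; bit k belongs to word k."""
    n = len(words[0])
    packed = [0] * n
    for k, word in enumerate(words):
        for j, bit in enumerate(word):
            packed[j] |= bit << k
    return packed

def syndromes(checks, packed):
    """Evaluate every parity check for all W words at once, one XOR per edge."""
    result = []
    for check in checks:
        s = 0
        for j in check:
            s ^= packed[j]
        result.append(s)  # bit k is set if word k violates this check
    return result

# Toy code of length 5 with two checks, given as lists of variable indices.
checks = [[0, 1, 2], [2, 3, 4]]
words = [[0] * 5 for _ in range(W)]
words[3][0] = 1  # word 3 violates the first check only
words[7][2] = 1  # word 7 violates both checks
print([bin(s) for s in syndromes(checks, pack(words))])
```

With this layout the work per edge of the Tanner graph is independent of $w$, which is why wider vector types can raise the throughput until another resource, such as memory bandwidth, becomes the limit.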
6.3.1 Results
The decoding throughput of the decoder on the NVIDIA Tesla M2090 is shown in
Figure 6.8 as a function of the code, the number of threads per core and the number
of words w decoded on each thread. The decoding throughput clearly increases
as the number of threads per core is increased, eventually plateauing at 8 threads
per core. In addition, there is a considerable increase in decoding throughput
when decoding 64 words per thread instead of 32 words per thread, but decoding
128 words per thread no longer increases the throughput considerably. The peak
decoding throughput on the NVIDIA Tesla M2090 was 350 Mb/s with the rate-1/2
WiMAX code tested and 700 Mb/s with the (3,4)-regular code. The global memory
bandwidth used on the NVIDIA Tesla M2090, shown in Figure 6.9, follows a
similar pattern in terms of the number of threads per core and the number of
words per thread. With both codes, use of the global memory bandwidth reaches a
maximum of approximately 100 GB/s, which is approximately 60 % of the maximum
theoretical bandwidth of 176 GB/s.
The decoding throughput and use of global memory bandwidth for the decoder
on the NVIDIA Tesla K40 are shown in Figure 6.10 and Figure 6.11. The main
difference to the NVIDIA Tesla M2090 is that the maximum decoding throughput
and use of global memory bandwidth are approximately a factor of 2 higher on
the NVIDIA Tesla K40. The peak decoding throughput with the NVIDIA Tesla
K40 is approximately 700 Mb/s with the rate-1/2 WiMAX code and approximately
1200 Mb/s with the (3,4)-regular code. The GPU used in the work by Wang et al.
(2013), the NVIDIA GTX TITAN, has similar specifications to the NVIDIA Tesla
K40 and is based on the same microarchitecture. Compared to the work by Wang
et al., the decoder implemented for this thesis achieves approximately twice
the throughput with the rate-1/2 WiMAX code using 100 iterations, while the
min-sum decoder in the work by Wang et al. uses 10 iterations. However, one
has to keep in mind that the message-passing decoders generally perform better
than bit-flipping decoders with a smaller number of iterations. The peak use of
global memory bandwidth is nearly 200 GB/s with the rate-1/2 WiMAX code and
approximately 170 GB/s with the (3,4)-regular code. The maximum theoretical
bandwidth of the global memory on the NVIDIA Tesla K40 is 288 GB/s, meaning
that nearly 70 % of the maximum bandwidth is used with the rate-1/2 WiMAX code.
Compared to the decoder on the NVIDIA Tesla M2090, using only one thread per
core on the NVIDIA Tesla K40 already comes close to the maximum performance.
Using the rate-1/2 WiMAX code, one thread per core, and decoding 128 words per
thread achieves a decoding throughput of nearly 600 Mb/s, while the maximum is
approximately 700 Mb/s. The corresponding values on the NVIDIA Tesla M2090
are 100 Mb/s and 400 Mb/s. The difference is likely a result of the NVIDIA Tesla
K40 having nearly 6 times as many cores as the NVIDIA Tesla M2090, while the
global memory bandwidth is only approximately 2 times higher. This means that
the global memory bandwidth can be saturated on the NVIDIA Tesla K40 with a
smaller number of threads per core compared to the NVIDIA Tesla M2090.
It is important to remember the trade-offs one needs to consider when
implementing decoders. Particularly important are the trade-offs between throughput,
latency and error-correcting performance. We have seen in the decoder
comparison that the current decoder has relatively good error-correcting performance. In
addition, it can achieve high throughputs. However, because of the large number
of words that need to be decoded simultaneously to achieve high throughputs,
the latencies are relatively high. The configuration achieving minimum latency
with maximum throughput on the NVIDIA Tesla M2090 is to run 8 threads for each
core and to decode 64 words in each thread. Using this configuration a total of 400
Mb are decoded simultaneously in one pass of the decoder, and the time taken to
decode this amount is approximately 1 s with the rate-1/2 WiMAX code and 0.6 s
with the (3,4)-regular QC-LDPC code. On the NVIDIA Tesla K40, using the rate-1/2
WiMAX code one can run only 2 threads for each core, decode 128 words in each
thread, and reach the maximum decoding throughput. Using this configuration
approximately 1.1 Gb are decoded simultaneously, which takes 1.5 s to complete.
Using the (3,4)-regular QC-LDPC code, one needs to run 4 threads for each core
and decode 128 words per thread to reach the maximum decoding throughput.
Doing so results in 2.3 Gb being decoded simultaneously with a latency of 1.8 s.
In comparison to the current implementation, the decoder presented by Wang
et al. has decoding latencies on the order of 1 ms, which is significantly lower
than the latencies of the implementation presented here.
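The amounts of data quoted above follow from the product $n_C n_A w$ and the block length. The sketch below reproduces the arithmetic. The core counts (512 for the Tesla M2090 and 2880 for the Tesla K40) are assumptions taken from NVIDIA's published specifications, not from this section, although they are consistent with the figures in the text. The latencies are the approximate values stated above, so the resulting throughputs are rough.

```python
BLOCK_LENGTH = 1536  # bits per codeword for both codes in Section 6.3

# Assumed core counts n_C (public NVIDIA specifications, not stated in the text).
CORES = {"Tesla M2090": 512, "Tesla K40": 2880}

def bits_in_flight(gpu: str, threads_per_core: int, words_per_thread: int) -> int:
    """Bits decoded in one pass: n_C * n_A * w words of BLOCK_LENGTH bits each."""
    return CORES[gpu] * threads_per_core * words_per_thread * BLOCK_LENGTH

def throughput_mbps(bits: int, latency_s: float) -> float:
    """Decoding throughput in Mb/s when one pass takes latency_s seconds."""
    return bits / latency_s / 1e6

# (GPU, n_A, w, approximate latency in seconds, code) as given in the text.
configs = [
    ("Tesla M2090", 8, 64, 1.0, "rate-1/2 WiMAX"),
    ("Tesla M2090", 8, 64, 0.6, "(3,4)-regular QC-LDPC"),
    ("Tesla K40", 2, 128, 1.5, "rate-1/2 WiMAX"),
    ("Tesla K40", 4, 128, 1.8, "(3,4)-regular QC-LDPC"),
]
for gpu, n_a, w, latency, code in configs:
    bits = bits_in_flight(gpu, n_a, w)
    rate = throughput_mbps(bits, latency)
    print(f"{gpu}, {code}: {bits / 1e6:.0f} Mb in flight, about {rate:.0f} Mb/s")
```

The calculation makes the trade-off explicit: the throughput is the amount of data in flight divided by the latency, so with a block-parallel decoder a high throughput is obtained only by keeping a large amount of data in flight.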
[Plot: two panels side by side. Horizontal axis: threads per core $n_A$, $2^0$ to $2^5$. Vertical axis: throughput [Mb/s], 0 to 800. Legend ($w$): 128, 64, 32.]
Figure 6.8: Decoding throughput of the stochastic bit-flipping decoder on the NVIDIA
M2090 with the rate-1/2 WiMAX code of block length 1536 (left) and a (3,4)-regular
QC-LDPC code of block length 1536 (right). The number of threads was $n_A$ multiplied by the
number of cores available on the GPU. The number of words decoded simultaneously by
each thread is given by $w$. The horizontal axis is logarithmic.
[Plot: two panels side by side. Horizontal axis: threads per core $n_A$, $2^0$ to $2^5$. Vertical axis: bandwidth [GB/s], 0 to 120. Legend ($w$): 128, 64, 32.]
Figure 6.9: Global memory bandwidth used by the stochastic bit-flipping decoder on the
NVIDIA M2090 with the rate-1/2 WiMAX code of block length 1536 (left) and a (3,4)-regular
QC-LDPC code of block length 1536 (right). The number of threads per core on the GPU
is denoted by $n_A$. The number of words decoded simultaneously by each thread is given
by $w$. The horizontal axis is logarithmic.
[Plot: two panels side by side. Horizontal axis: threads per core $n_A$, $2^0$ to $2^5$. Vertical axis: throughput [Mb/s], 0 to 1400. Legend ($w$): 128, 64, 32.]
Figure 6.10: Decoding throughput of the stochastic bit-flipping decoder on the NVIDIA
K40 with the rate-1/2 WiMAX code of block length 1536 (left) and a (3,4)-regular QC-LDPC
code of block length 1536 (right). The number of threads was $n_A$ multiplied by the number
of cores available on the GPU. The number of words decoded simultaneously by each
thread is given by $w$. The horizontal axis is logarithmic.
[Plot: two panels side by side. Horizontal axis: threads per core $n_A$, $2^0$ to $2^5$. Vertical axis: bandwidth [GB/s], 0 to 250. Legend ($w$): 128, 64, 32.]
Figure 6.11: Global memory bandwidth used by the stochastic bit-flipping decoder on
the NVIDIA K40 with the rate-1/2 WiMAX code of block length 1536 (left) and a
(3,4)-regular QC-LDPC code of block length 1536 (right). The number of threads per core on
the GPU is denoted by $n_A$. The number of words decoded simultaneously by each thread
is given by $w$. The horizontal axis is logarithmic.

Chapter 7
Conclusion
After their rediscovery in the 1990s, LDPC codes have shown themselves to
be viable codes in theory, with constructions and decoders that approach the
Shannon limit; and in practice, having been included in various communication
standards. The field has matured in the last years, but there is still room for
improvements both in encoding and decoding of LDPC codes. To the best of the
author’s knowledge, encoding still cannot be done in linear time for general
LDPC codes. Resolving whether or not it can be done would be an important
result. Current decoders can still be improved by reducing their complexity to
increase throughput, reduce latency, and allow longer block lengths to be used for
better error-correction. The GPU implementation of the stochastic bit-flipping
decoder presented in this thesis is a high-throughput decoder with relatively
good error-correcting performance.
The comparison of decoders in the previous chapter mainly introduces the
stochastic bit-flipping decoder as a new decoder. The focus of the stochastic
bit-flipping decoder is on low complexity as it uses only hard channel values,
and on examining to what extent adding noise to the decoding process helps
the error-correcting performance. The decoder performs well compared to the
sum-product decoder considering its simplicity, and the gap in error-correcting
performance is especially small on the BSC. An interesting comparison is that
between the stochastic bit-flipping decoder and the gradient-descent bit-flipping
decoder on the BAWGNC. While the gradient-descent bit-flipping decoder makes
use of the soft channel values and the stochastic bit-flipping decoder does not,
the stochastic bit-flipping decoder still has better error-correcting performance
with the codes tested in this thesis.
The important trade-offs when considering decoders are those between
error-correcting performance, throughput and latency. With the GPU implementation
of the stochastic bit-flipping decoder it is clear that its simplicity allows for
high-throughput implementations, while the decoder comparison shows that the
decoder is capable of relatively good error-correcting performance considering its
simplicity. The downside of the current GPU implementation is the high decoding
latency, which is a consequence of the decoder being block-parallel. The current
high latency could be reduced by considering bit-parallel implementations of the
decoder, meaning that multiple bits in the same codeword are processed in parallel.
When using QC-LDPC codes this could be done by processing all columns within
a major column of the parity-check matrix in parallel as the second neighbors
of the columns in a major column are independent. For example, with the codes
used here which have a sub-matrix size of 64 × 64, the potential gain in latency
would be a factor of 64, bringing the latencies to the order of 10–100 ms.
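As a rough check of this estimate, the block-parallel latencies reported in Section 6.3.1 can be divided by the sub-matrix size. The sketch below assumes, optimistically, that the full factor of 64 is realised and that nothing else changes. The dictionary keys are labels chosen for illustration.

```python
SUBMATRIX_SIZE = 64  # columns of one major column processed in parallel

# Approximate block-parallel latencies from Section 6.3.1, in seconds.
latencies_s = {
    ("Tesla M2090", "rate-1/2 WiMAX"): 1.0,
    ("Tesla M2090", "(3,4)-regular QC-LDPC"): 0.6,
    ("Tesla K40", "rate-1/2 WiMAX"): 1.5,
    ("Tesla K40", "(3,4)-regular QC-LDPC"): 1.8,
}

def bit_parallel_latency_ms(latency_s: float, factor: int = SUBMATRIX_SIZE) -> float:
    """Idealised latency in ms if the whole parallelisation factor were gained."""
    return 1e3 * latency_s / factor

estimates_ms = {key: bit_parallel_latency_ms(value) for key, value in latencies_s.items()}
for (gpu, code), ms in estimates_ms.items():
    print(f"{gpu}, {code}: about {ms:.0f} ms")
```

All four estimates fall between roughly 9 ms and 30 ms, which is still an order of magnitude above the latencies of about 1 ms reported by Wang et al.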
In this work we optimized the free parameters of the stochastic bit-flipping
decoder with a simple grid search and chose the parameters depending on the code.
To ensure that the stochastic bit-flipping decoder performs as well as possible,
the parameters should be further optimized by using a finer grid. Additionally,
the grid search shows that the optimal parameters are different at different levels
of noise. Adjusting the parameters based on the level of noise in the channel and
the current iteration may help further improve the error-correcting performance.
The number of iterations used for the decoder is another free parameter which is
important for the error-correcting performance. However, changing the number
of iterations directly affects throughput. Increasing the number of iterations
reduces the bit-error rate, but it needs to be examined more thoroughly what
exactly is a suitable number of iterations such that error-correcting performance
and throughput are well balanced. Having said that, it should be remembered
that an optimal set of parameters is unlikely to exist for every situation. Most
importantly though, a thorough comparison of the bit-error rate, throughput and
latency should be made with dedicated hardware designs of various decoders. In
general, the stochastic bit-flipping decoder should also be compared in complexity
and error-correcting performance to more advanced bit-flipping decoders such as
the noisy gradient-descent decoder. Additionally, the gradient-descent bit-flipping
decoder was optimized with only one free parameter in this work. Future work
might involve adding more free parameters to the gradient-descent bit-flipping
decoder to improve its performance.
In addition to the free parameters of the decoder, code constructions and the
block length need to be considered when talking about error-correcting
performance. First, the block length is an important factor in the error-correcting
performance as increasing it can significantly improve error-correcting performance.
Most importantly, increasing the block length lowers the error floor. We saw in
Chapter 6 that for example the (3,6)-regular ensemble can produce error floors
lower than $10^{-8}$ already at the block length $2^{14}$. Using a low-complexity decoder
and long block lengths should be considered as an option for high-throughput
decoding with low bit-error rates. However, by increasing the block length one
is again giving up on the decoding latency. Second, code constructions should
be looked at more closely. The results in this thesis show that the stochastic
bit-flipping decoder performs well with the (3,6)-regular ensemble, but performs
worse with for example the rate-1/2 WiMAX code. Generally, message-passing
decoders are considered first when designing good codes, which may lead to
worse results with bit-flipping decoders. Thus it may be beneficial to examine
what types of codes work especially well with bit-flipping decoders in more detail.
As the literature review in this thesis shows, there exists a plethora of decoders
for LDPC codes. Future work on the stochastic bit-flipping decoder must show that
it can compete also on error-correcting performance, and not only on throughput,
for it to be a viable decoder alternative. To date, message-passing decoders are
still preferred because of their good error-correcting performance, but bit-flipping
decoders have been improving in error-correcting performance. On the other
hand, message-passing decoders have been reducing in complexity with the
binary message-passing decoder being an extreme example. However, the line
between message-passing decoders and bit-flipping decoders is currently being
blurred with ideas from message-passing decoders being incorporated into
bit-flipping decoders, and vice versa. Future work on decoders will have to weigh the
trade-offs between error-correcting performance and complexity thoroughly, and
ideally attempt to find decoder designs that are clearly better in both respects.
The current state-of-the-art in encoding is the method by Richardson and
Urbanke (2001b) which is based on permuting the rows and columns of the
parity-check matrix so that the resulting parity-check matrix is in approximate lower
triangular form. While the method has linear time complexity for a useful subset
of codes, it does not have linear time complexity for general LDPC codes. Codes
which can be encoded in linear time are good enough for inclusion in standards,
such as the QC-LDPC codes in the WiMAX standard. However, a method for
linear-time encoding of general LDPC codes could allow more freedom in
designing codes that work well with different decoders, as well as allowing scaling of
the block length to larger values. One reason for the difficulty of encoding LDPC
codes is that good codes should in some sense protect all information bits equally
well. How well the code does this is likely captured to some extent by the size of
the gap when doing greedy upper triangulation. It remains an open question to
determine whether alternative encoding methods exist that can circumvent the
problem of large gaps for arbitrary LDPC codes.
In conclusion, this work has presented a stochastic bit-flipping decoder which
has been shown to be easily parallelizable by decoding multiple words at a
time, allowing the decoder to reach high throughputs. In addition, a review of
the current state-of-the-art of encoding and decoding of LDPC codes has been
given. Experimental results show that the stochastic bit-flipping decoder has
relatively good error-correcting performance at low complexity. The prospect
of low-complexity encoders and decoders that would allow scaling of the block
length significantly is especially exciting as this would allow further reductions
in bit-error rates, albeit with a cost in latency. As an extreme example, entire
hard-drives or even multiple hard-drives could be encoded with a single block
making them more robust to errors. While the current work does not directly
allow this, it shows that there is room for improvement in making low-complexity
decoders that still have good error-correcting performance. Resolving whether
or not general LDPC codes can be encoded in linear time is another important
step in the field, and especially for increasing the block lengths significantly.
Bibliography
Abburi, K. (2011). A scalable LDPC decoder on GPU. In 2011 24th International
Conference on VLSI Design (VLSI Design), pages 183–188.
Alava, M., Ardelius, J., Aurell, E., Kaski, P., Krishnamurthy, S., Orponen, P.,
and Seitz, S. (2008). Circumspect descent prevails in solving random con-
straint satisfaction problems. Proceedings of the National Academy of Sciences,
105(40):15253–15257.
Amraoui, A., Montanari, A., Richardson, T., and Urbanke, R. (2009). Finite-Length
scaling for iteratively decoded LDPC ensembles. IEEE Transactions on Informa-
tion Theory, 55(2):473–498.
Bahl, L., Cocke, J., Jelinek, F., and Raviv, J. (1974). Optimal decoding of linear codes
for minimizing symbol error rate (Corresp.). IEEE Transactions on Information
Theory, 20(2):284–287.
Berlekamp, E. R., McEliece, R. J., and van Tilborg, H. C. A. (1978). On the inherent
intractability of certain coding problems. IEEE Transactions on Information
Theory, 24(3):384–386.
Berrou, C., Glavieux, A., and Thitimajshima, P. (1993). Near shannon limit error-
correcting coding and decoding: Turbo-codes. 1. In Conference Record, IEEE
International Conference on Communications, volume 2, pages 1064–1070 vol.2.
Berrou, C., Pyndiah, R., Adde, P., Douillard, C., and Le Bidan, R. (2005). An
overview of turbo codes and their applications. In The European Conference on
Wireless Technology, 2005, pages 1–9.
Bollobás, B. (2001). Random Graphs. Cambridge University Press.
Boutillon, E., Guillou, F., and Danger, J. (2003). lambda-min decoding algorithm
of regular and irregular LDPC codes. In 3rd International Symposium on Turbo
Codes & Related Topics.
Burshtein, D. (2009). Iterative approximate linear programming decoding of
LDPC codes with linear complexity. IEEE Transactions on Information Theory,
55(11):4835–4859.
Burshtein, D. and Goldenberg, I. (2011). Improved linear programming decoding
of LDPC codes and bounds on the minimum and fractional distance. IEEE
Transactions on Information Theory, 57(11):7386–7402.
Chang, C., Chang, Y., Huang, M., and Huang, B. (2011). Accelerating regular
LDPC code decoders on GPUs. IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 4(3):653–659.
Chen, J., Dholakia, A., Eleftheriou, E., Fossorier, M., and Hu, X. (2005). Reduced-
Complexity decoding of LDPC codes. IEEE Transactions on Communications,
53(8):1288–1299.
Chen, J. and Fossorier, M. P. (2002a). Density evolution for two improved BP-
based decoding algorithms of LDPC codes. IEEE Communications Letters,
6(5):208–210.
Chen, J. and Fossorier, M. P. (2002b). Near optimum universal belief propaga-
tion based decoding of low-density parity check codes. IEEE Transactions on
Communications, 50(3):406–414.
Cho, J., Kim, J., and Sung, W. (2010). VLSI implementation of a High-Throughput
Soft-Bit-Flipping decoder for geometric LDPC codes. IEEE Transactions on
Circuits and Systems I: Regular Papers, 57(5):1083–1094.
Chung, S. (2000). On the construction of some capacity-approaching coding schemes.
PhD thesis, Massachusetts Institute of Technology.
Chung, S., Forney Jr, G. D., Richardson, T. J., and Urbanke, R. (2001a). On the
design of low-density parity-check codes within 0.0045 dB of the shannon limit.
IEEE Communications Letters, 5(2):58–60.
Chung, S., Richardson, T. J., and Urbanke, R. L. (2001b). Analysis of sum-product
decoding of low-density parity-check codes using a gaussian approximation.
IEEE Transactions on Information Theory, 47(2):657–670.
Cushon, K., Hemati, S., Leroux, C., Mannor, S., and Gross, W. (2014).
High-Throughput Energy-Efficient LDPC decoders using differential binary message
passing. IEEE Transactions on Signal Processing, 62(3):619–631.
Di, C., Proietti, D., Telatar, I., Richardson, T., and Urbanke, R. (2002). Finite-length
analysis of low-density parity-check codes on the binary erasure channel. IEEE
Transactions on Information Theory, 48(6):1570–1579.
Elidan, G. (2006). Residual belief propagation: Informed scheduling for asyn-
chronous message passing. In Proceedings of the Twenty-second Conference on
Uncertainty in AI.
ETSI (2012). Digital video broadcasting (DVB); frame structure channel coding
and modulation for a second generation digital transmission system for cable
systems (DVB-C2).
ETSI (2013a). Digital video broadcasting (DVB); frame structure channel coding
and modulation for a second generation digital terrestrial television broadcast-
ing system (DVB-T2).
ETSI (2013b). Digital video broadcasting (DVB); second generation framing
structure, channel coding and modulation systems for broadcasting, interactive
services, news gathering and other broadband satellite applications (DVB-S2).
Falcão, G., Sousa, L., and Silva, V. (2008). Massive parallel LDPC decoding on
GPU. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and
practice of parallel programming, pages 83–90.
Feldman, J. (2003). Decoding error-correcting codes via linear programming. PhD
thesis, Massachusetts Institute of Technology.
Feldman, J., Wainwright, M., and Karger, D. (2005). Using linear programming to
decode binary linear codes. IEEE Transactions on Information Theory, 51(3):954–
972.
Fossorier, M. P., Mihaljevic, M., and Imai, H. (1999). Reduced complexity iterative
decoding of low-density parity check codes based on belief propagation. IEEE
Transactions on Communications, 47(5):673–680.
Freundlich, S., Burshtein, D., and Litsyn, S. (2007). Approximately lower triangular
ensembles of LDPC codes with linear encoding complexity. IEEE Transactions
on Information Theory, 53(4):1484–1494.
Gallager, R. G. (1962). Low-density parity-check codes. IRE Transactions on
Information Theory, 8(1):21–28.
Gallager, R. G. (1963). Low-Density Parity-Check Codes, volume 21 of Research
Monograph Series. Cambridge, MA, USA.
Gaudet, V. and Rapley, A. (2003). Iterative decoding using stochastic computation.
Electronics Letters, 39(3):299.
Goldin, D. and Burshtein, D. (2013). Iterative linear programming decoding of
nonbinary LDPC codes with linear complexity. IEEE Transactions on Information
Theory, 59(1):282–300.
Greenhill, C., McKay, B. D., and Wang, X. (2006). Asymptotic enumeration of
sparse 0–1 matrices with irregular row and column sums. Journal of Combina-
torial Theory, Series A, 113(2):291–324.
Grönroos, S., Nybom, K., and Björkqvist, J. (2012). Efficient GPU and CPU-based
LDPC decoders for long codewords. Analog Integrated Circuits and Signal
Processing, 73(2):583–595.
Gross, W., Gaudet, V., and Milner, A. (2005). Stochastic implementation of LDPC
decoders. In Conference Record of the Thirty-Ninth Asilomar Conference on
Signals, Systems and Computers, pages 713–717.
Hagenauer, J., Offer, E., and Papke, L. (1996). Iterative decoding of binary block and
convolutional codes. IEEE Transactions on Information Theory, 42(2):429–445.
Hagenauer, J. and Papke, L. (1994). Decoding "Turbo"-codes with the soft output
viterbi algorithm (SOVA). In IEEE International Symposium on Information
Theory, page 164.
He, Y., Li, H., Sun, S., and Li, L. (2003). Threshold-based design of quantized
decoder for LDPC codes. In IEEE International Symposium on Information
Theory, page 149.
Hemati, S. and Banihashemi, A. (2006). Dynamics and performance analysis of
analog iterative decoding for low-density parity-check (LDPC) codes. IEEE
Transactions on Communications, 54(1):61–70.
Hocevar, D. (2004). A reduced complexity decoder architecture via layered
decoding of LDPC codes. In IEEE Workshop on Signal Processing Systems, pages
107–112.
Huang, K., Gaudet, V., and Salehi, M. (2013). A scaling method for stochastic
LDPC decoding over the binary symmetric channel. In 47th Annual Conference
on Information Sciences and Systems, pages 1–5.
IEEE (2008). IEEE standard for Floating-Point arithmetic. IEEE Std 754-2008.
IEEE (2009). IEEE standard for air interface for broadband wireless access systems.
IEEE Std 802.16-2012.
IEEE (2012a). IEEE Standard for Ethernet. IEEE Std 802.3-2012.
IEEE (2012b). Part 11: Wireless LAN medium access control (MAC) and physical
layer (PHY) specifications. IEEE Std 802.11 2012.
Intel (2012). Intel architecture, instruction set extensions programming, reference.
Ismail, M., Coon, J., Ahmed, I., Armour, S., and McGeehan, J. (2013). Turbo adaptive
threshold bit flipping for LDPC decoding. IEEE Wireless Communications Letters,
2(1):118–121.
Kang, S. and Moon, J. (2012). Parallel LDPC decoder implementation on GPU
based on unbalanced memory coalescing. In IEEE International Conference on
Communications, pages 3692–3697.
Kolesnik, V. D. (1971). Probabilistic decoding of majority codes. Problemy
Peredachi Informatsii, 7(3):3–12.
Kou, Y., Lin, S., and Fossorier, M. (2001). Low-density parity-check codes based
on finite geometries: a rediscovery and new results. IEEE Transactions on
Information Theory, 47(7):2711–2736.
Kschischang, F. R., Frey, B. J., and Loeliger, H. (2001). Factor graphs and the sum-
product algorithm. IEEE Transactions on Information Theory, 47(2):498–519.
Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabili-
ties on graphical structures and their application to expert systems. Journal of
the Royal Statistical Society. Series B (Methodological), 50(2):157–224.
Luby, M. G., Mitzenmacher, M., Shokrollahi, A., and Spielman, D. A. (1998). Anal-
ysis of low density codes and improved designs using irregular graphs. In
Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing,
pages 249–258.
Luby, M. G., Mitzenmacher, M., Shokrollahi, M. A., and Spielman, D. A. (2001a).
Efficient erasure correcting codes. IEEE Transactions on Information Theory,
47(2):569–584.
Luby, M. G., Mitzenmacher, M., Shokrollahi, M. A., and Spielman, D. A. (2001b).
Improved low-density parity-check codes using irregular graphs. IEEE Trans-
actions on Information Theory, 47(2):585–598.
Luby, M. G., Mitzenmacher, M., Shokrollahi, M. A., Spielman, D. A., and Stemann,
V. (1997). Practical loss-resilient codes. In Proceedings of the Twenty-ninth
Annual ACM Symposium on Theory of Computing, pages 150–159.
MacKay, D. J. (1999). Good error-correcting codes based on very sparse matrices.
IEEE Transactions on Information Theory, 45(2):399–431.
MacKay, D. J. (2003). Information theory, inference, and learning algorithms,
volume 7. Cambridge University Press.
MacKay, D. J. C. (1995). Free energy minimisation algorithm for decoding and
cryptanalysis. Electronics Letters, 31(6):446–447.
Mansour, M. and Shanbhag, N. (2002). Memory-efficient turbo decoder architectures
for LDPC codes. In IEEE Workshop on Signal Processing Systems, pages
159–164.
Mao, Y. and Banihashemi, A. (2001). A heuristic search for good low-density
parity-check codes at short block lengths. In IEEE International Conference on
Communications, 2001. ICC 2001, volume 1, pages 41–44.
Martínez-Zaldívar, F. J., Vidal-Maciá, A. M., Gonzalez, A., and Almenar, V. (2011).
Tridimensional block multiword LDPC decoding on GPUs. The Journal of
Supercomputing, 58(3):314–322.
McEliece, R. J., MacKay, D. J. C., and Cheng, J. (1998). Turbo decoding as an
instance of Pearl’s “belief propagation” algorithm. IEEE Journal on Selected
Areas in Communications, 16(2):140–152.
Miladinovic, N. and Fossorier, M. (2005). Improved bit-flipping decoding of
low-density parity-check codes. IEEE Transactions on Information Theory,
51(4):1594–1606.
Mobini, N., Banihashemi, A., and Hemati, S. (2009). A differential binary message-
passing LDPC decoder. IEEE Transactions on Communications, 57(9):2518–2523.
Mohsenin, T., Truong, D., and Baas, B. (2010). A Low-Complexity Message-
Passing algorithm for reduced routing congestion in LDPC decoders. IEEE
Transactions on Circuits and Systems I: Regular Papers, 57(5):1048–1061.
Myung, S., Yang, K., and Kim, J. (2005). Quasi-cyclic LDPC codes for fast encoding.
IEEE Transactions on Information Theory, 51(8):2894–2901.
Naderi, A., Mannor, S., Sawan, M., and Gross, W. (2011). Delayed stochastic
decoding of LDPC codes. IEEE Transactions on Signal Processing, 59(11):5617–
5626.
Noorshams, N. and Iyengar, A. (2014). A novel stochastic decoding of LDPC codes
with quantitative guarantees. preprint arXiv:1405.6353.
Nouh, A. and Banihashemi, A. (2002). Bootstrap decoding of low-density parity-
check codes. IEEE Communications Letters, 6(9):391–393.
NVIDIA (2011). Tesla M2090 dual-slot computing processor module board specification.
NVIDIA (2013). Tesla K40 GPU active accelerator board specification.
NVIDIA (2014). CUDA C Programming Guide v6.5.
Pearl, J. (1982). Reverend Bayes on inference engines: a distributed hierarchical
approach. In Proceedings of the National Conference on Artificial Intelligence,
pages 133–136.
Richardson, T. (2003). Error floors of LDPC codes. In Proceedings of the annual
Allerton conference on communication control and computing, volume 41, pages
1426–1435.
Richardson, T., Shokrollahi, A., and Urbanke, R. (2000). Design of provably good
low-density parity check codes. In Proceedings of IEEE International Symposium
on Information Theory, page 199.
Richardson, T., Shokrollahi, A., and Urbanke, R. (2002). Finite-length analysis of
various low-density parity-check ensembles for the binary erasure channel. In
Proceedings of IEEE International Symposium on Information Theory.
Richardson, T. and Urbanke, R. (2008). Modern Coding Theory. Cambridge
University Press.
Richardson, T. J., Shokrollahi, M. A., and Urbanke, R. L. (2001). Design of capacity-
approaching irregular low-density parity-check codes. IEEE Transactions on
Information Theory, 47(2):619–637.
Richardson, T. J. and Urbanke, R. L. (2001a). The capacity of low-density parity-
check codes under message-passing decoding. IEEE Transactions on Information
Theory, 47(2):599–618.
Richardson, T. J. and Urbanke, R. L. (2001b). Efficient encoding of low-density
parity-check codes. IEEE Transactions on Information Theory, 47(2):638–656.
Schläfer, P., Weis, C., Wehn, N., and Alles, M. (2012). Design space of flexible
multigigabit LDPC decoders. VLSI Design.
Shannon, C. (1948). A mathematical theory of communication. The Bell System
Technical Journal, 27(3):379–423.
Sipser, M. and Spielman, D. A. (1996). Expander codes. IEEE Transactions on
Information Theory, 42(6):1710–1722.
Sundararajan, G., Winstead, C., and Boutillon, E. (2014). Noisy gradient descent
Bit-Flip decoding for LDPC codes. arXiv:1402.2773 [cs, math].
Tanner, R., Sridhara, D., Sridharan, A., Fuja, T., and Costello, D. (2004). LDPC
block and convolutional codes based on circulant matrices. IEEE Transactions
on Information Theory, 50(12):2966–2984.
Tanner, R. M. (1981). A recursive approach to low complexity codes. IEEE
Transactions on Information Theory, 27(5):533–547.
Tehrani, S., Gross, W., and Mannor, S. (2006). Stochastic decoding of LDPC codes.
IEEE Communications Letters, 10(10):716–718.
Tehrani, S., Mannor, S., and Gross, W. (2008). Fully parallel stochastic LDPC
decoders. IEEE Transactions on Signal Processing, 56(11):5692–5703.
Tehrani, S., Naderi, A., Kamendje, G., Hemati, S., Mannor, S., and Gross, W. (2010).
Majority-Based tracking forecast memories for stochastic LDPC decoding. IEEE
Transactions on Signal Processing, 58(9):4883–4896.
Tehrani, S. S., Naderi, A., Kamendje, G., Mannor, S., and Gross, W. J. (2011).
Tracking forecast memories for stochastic decoding. Journal of Signal Processing
Systems, 63(1):117–127.
van Leeuwen, J., editor (1990). Handbook of Theoretical Computer Science (Vol. A):
Algorithms and Complexity. MIT Press, Cambridge, MA, USA.
Vanek, M. and Farkas, P. (2009). Fast parallel weighted bit flipping decoding
algorithm for LDPC codes. In Wireless Telecommunications Symposium, pages
1–4.
Vila Casado, A., Griot, M., and Wesel, R. (2007). Informed dynamic scheduling for
Belief-Propagation decoding of LDPC codes. In IEEE International Conference
on Communications, pages 932–937.
Viterbi, A. (1967). Error bounds for convolutional codes and an asymptotically op-
timum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–
269.
Vontobel, P. O. and Kötter, R. (2007). On low-complexity linear-programming de-
coding of LDPC codes. European Transactions on Telecommunications, 18(5):509–
517.
Wadayama, T., Nakamura, K., Yagita, M., Funahashi, Y., Usami, S., and Takumi,
I. (2007). Gradient descent bit flipping algorithms for decoding LDPC codes.
arXiv:0711.0261 [cs, math].
Wang, G., Wu, M., Sun, Y., and Cavallaro, J. R. (2011a). GPU accelerated scalable
parallel decoding of LDPC codes. In Conference Record of the Forty Fifth Asilomar
Conference on Signals, Systems and Computers, pages 2053–2057.
Wang, G., Wu, M., Sun, Y., and Cavallaro, J. R. (2011b). A massively parallel
implementation of QC-LDPC decoder on GPU. In IEEE 9th Symposium on
Application Specific Processors, pages 82–85.
Wang, G., Wu, M., Yin, B., and Cavallaro, J. R. (2013). High throughput low latency
LDPC decoding on GPU for SDR systems. In Proceedings of the IEEE Global
Conference on Signal and Information Processing.
Warren, H. S. (2012). Hacker’s Delight. Addison-Wesley.
Wiberg, N. (1996). Codes and decoding on general graphs. PhD thesis, Linköping
University.
Wiberg, N., Loeliger, H., and Kötter, R. (1995). Codes and iterative decoding on
general graphs. European Transactions on Telecommunications, 6(5):513–525.
Wu, X., Zhao, C., and You, X. (2007). Parallel weighted Bit-Flipping decoding.
IEEE Communications Letters, 11(8):671–673.
Xiao, H., Tolouei, S., and Banihashemi, A. (2008). Successive relaxation for
decoding of LDPC codes. In 24th Biennial Symposium on Communications,
pages 107–110.
Yazdani, M., Hemati, S., and Banihashemi, A. (2004). Improving belief propagation
on graphs with cycles. IEEE Communications Letters, 8(1):57–59.
Yue, G., Lu, B., and Wang, X. (2007). Analysis and design of Finite-Length LDPC
codes. IEEE Transactions on Vehicular Technology, 56(3):1321–1332.
Zhang, J. and Fossorier, M. (2002). Shuffled belief propagation decoding. In
Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems
and Computers, volume 1, pages 8–15.
Zhang, J. and Fossorier, M. (2004). A modified weighted bit-flipping decoding of
low-density parity-check codes. IEEE Communications Letters, 8(3):165–167.
Zhang, J., Yedidia, J. S., and Fossorier, M. (2007). Low-Latency decoding of EG
LDPC codes. Journal of Lightwave Technology, 25(9):2879–2886.
Zhang, T., Wang, Z., and Parhi, K. K. (2001). On finite precision implementation
of low density parity check codes decoder. In IEEE Symposium on Circuits and
Systems, volume 4, pages 202–205.
Zhao, J., Zarkeshvari, F., and Banihashemi, A. (2005). On implementation of Min-
Sum algorithm and its modifications for decoding Low-Density Parity-Check
(LDPC) codes. IEEE Transactions on Communications, 53(4):549–554.
Zhou, X. S., Cockburn, B., and Bates, S. (2007). Improved iterative bit flipping
decoding algorithms for LDPC convolutional codes. In IEEE Pacific Rim Conference
on Communications, Computers and Signal Processing, pages 541–544.
Appendix A
Derivations
A.1 Derivation of the gradient-descent bit-flipping decoder
The following is the formulation of the bit-flipping decoder as a gradient-descent
decoder, as presented by Wadayama et al. (2007). We assume here that the symbol
alphabet is {1,−1} and that transmission occurs over the BAWGNC. Maximum-likelihood
(or maximum a posteriori) decoding is equivalent to finding the codeword
which maximizes the correlation to the received values, where the correlation is
\[
  \sum_{i=1}^{n} x_i y_i .
\]
For the gradient-descent bit-flipping decoder, we then define an objective function
\[
  f(x) = \sum_{i=1}^{n} x_i y_i + \sum_{a=1}^{m} \prod_{i \in N(a)} x_i .
\]
The rst part of the objective function is simply the correlation of the current
decoded codeword x and the channel values y. The second part is a term which
accounts for satised parity-checks. When all parity-checks are satised it takes
the valuem and when none are satised it takes the value −m. Our goal is then
to maximize f or, equivalently, minimize −f . To do so we can use the gradient-
descent decoder. We rst calculate the partial derivative with of f with respect
to a variable xi :
∂
∂xi
f (x ) = yi +
∑
a∈N (i )
∏
j∈N (a)\i
xj . (A.1)
Note here the strong resemblance to, for example, the sum-product tanh-rule. The
product on the right-hand side of (A.1) is similar to the check-to-variable messages,
but with the log-ratio messages replaced by $x_j$. The sum of products and addition
of $y_i$ on the right-hand side of (A.1) again resembles the final marginalization
step (4.19) in the sum-product decoder.
The rst-order approximation of f in the xi-coordinate is
f (x1, . . . ,xi + s, . . . ,xn ) = f (x) + s
∂
∂xi
f (x),
where s is the step length in the ith coordinate. Since we would like to maximize
f we would like to choose s so that s ∂∂xi f (x) > 0. This happens if we choose s > 0
when ∂∂xi f (x) > 0 and s < 0 when
∂
∂xi
f (x) < 0. Since xi ∈ {1,−1} we can multiply
the gradient by xi and get that the objective function value is increased, according
to the rst-order approximation, if we ip the value of xi when xi ∂∂xi f (x) < 0.
One possible way to choose which bit to flip is to, at each iteration, flip the bit i
that has the smallest
\[
  E_i = x_i \frac{\partial}{\partial x_i} f(x) = x_i y_i + \sum_{a \in N(i)} \prod_{j \in N(a)} x_j .
\]
Alternatively, to keep with the convention that bits are flipped when $E_i$ is large
enough, we can define
\[
  E_i = -x_i \frac{\partial}{\partial x_i} f(x)
\]
and flip the bit i that has the largest $E_i$ or all bits i that have $E_i$ greater than some
threshold θ.
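The single-flip variant described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation evaluated in the thesis: the function name `gdbf_decode`, the list-of-lists representation `checks` of the neighborhoods N(a), the hard-decision initialization and the stopping rule (stop once all checks are satisfied or after `max_iters` flips) are choices made here for the example.

```python
from math import prod


def gdbf_decode(y, checks, max_iters=100):
    """Single-flip gradient-descent bit flipping over the alphabet {+1, -1}.

    y      -- received channel values y_i (floats)
    checks -- checks[a] lists the variable indices in N(a) (assumed format)
    """
    n = len(y)
    # Hard-decision initialization (an assumption of this sketch).
    x = [1 if yi >= 0 else -1 for yi in y]

    # Build N(i): the checks each variable participates in.
    var_to_checks = [[] for _ in range(n)]
    for a, nbrs in enumerate(checks):
        for i in nbrs:
            var_to_checks[i].append(a)

    for _ in range(max_iters):
        # Bipolar syndrome: prod_{j in N(a)} x_j is +1 iff check a is satisfied.
        s = [prod(x[j] for j in nbrs) for nbrs in checks]
        if all(sa == 1 for sa in s):
            break
        # E_i = x_i * y_i + sum_{a in N(i)} prod_{j in N(a)} x_j
        E = [x[i] * y[i] + sum(s[a] for a in var_to_checks[i]) for i in range(n)]
        # Flip the bit with the smallest E_i.
        i_min = min(range(n), key=E.__getitem__)
        x[i_min] = -x[i_min]
    return x


# Toy example: a (7,4) Hamming code, all-(+1) codeword sent, bit 2 received wrong.
checks = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
y = [0.9, 1.1, -0.2, 1.0, 0.8, 1.2, 1.0]
print(gdbf_decode(y, checks))  # -> [1, 1, 1, 1, 1, 1, 1]
```

In the toy example bit 2 has the smallest value (E_2 = 0.2 − 3 = −2.8), because it is weakly correlated with its channel value and participates in three unsatisfied checks, so it is flipped first. The threshold variant would instead flip every bit whose negated E_i exceeds θ.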
A.2 Derivation of the tanh-rule for the sum-product decoder
Recall that in the rst formulation of the sum-product decoder for decoding of
linear block codes presented in Chapter 4 the variable-to-check update rule is
vi→a (xi ) =
∏
b∈N (i )
b,a
fb→i (xi ).
Also, since we are considering binary codes we send in practice the vector
$(v_{i \to a}(1), v_{i \to a}(-1))$ from a variable node i to a check node a. The check-to-variable
update rule is
\[
  f_{a \to i}(x_i) = \sum_{x_a \setminus x_i} \Bigl[ \prod_{j \in N(a)} x_j = 1 \Bigr] \prod_{j \in N(a) \setminus i} v_{j \to a}(x_j)
\]
and we send the vector $(f_{a \to i}(1), f_{a \to i}(-1))$ from a check node a to a variable
node i.
The intuition for being able to use single scalars as messages in the case of
binary codes is that a probability distribution over two states can be described by
a single scalar due to the constraint that probabilities over all states must sum to
1. To simplify the conventional sum-product rules we introduce the ratios
\[
  f^r_{a \to i} = \frac{f_{a \to i}(1)}{f_{a \to i}(-1)}, \tag{A.2}
\]
\[
  f^r_i = \frac{f_i(1)}{f_i(-1)}
\]
and
\[
  v^r_{i \to a} = \frac{v_{i \to a}(1)}{v_{i \to a}(-1)} \tag{A.3}
\]
as the message values. We will use the superscript r on the messages to differentiate
them from the conventional sum-product update rules where each message
consists of two values. We can then write the ratio $v^r_{i \to a}$ as a product of ratios of
incoming messages $f^r_{b \to i}$ from the neighbors b of i, excluding a. More precisely,
we can write
\[
  v^r_{i \to a} = \frac{v_{i \to a}(1)}{v_{i \to a}(-1)}
  = \frac{\prod_{b \in N(i) \setminus a} f_{b \to i}(1)}{\prod_{b \in N(i) \setminus a} f_{b \to i}(-1)}
  = \prod_{b \in N(i) \setminus a} f^r_{b \to i} . \tag{A.4}
\]
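We can likewise write the ratio $f^r_{a \to i}$ at a check node a using the ratios $v^r_{j \to a}$ of
the neighbors j of a, excluding i. Fixing $x_i = 1$ in the numerator and $x_i = -1$ in the
denominator, the parity constraint $\prod_{j \in N(a)} x_j = 1$ becomes a constraint on the
remaining variables, namely $\prod_{j \in N(a) \setminus i} x_j = 1$ and $\prod_{j \in N(a) \setminus i} x_j = -1$, respectively.
Doing so, we get
\begin{align*}
f^r_{a \to i} &= \frac{f_{a \to i}(1)}{f_{a \to i}(-1)} \\
&= \frac{\sum_{x_a \setminus x_i} \bigl[ \prod_{j \in N(a) \setminus i} x_j = 1 \bigr] \prod_{j \in N(a) \setminus i} v_{j \to a}(x_j)}
        {\sum_{x_a \setminus x_i} \bigl[ \prod_{j \in N(a) \setminus i} x_j = -1 \bigr] \prod_{j \in N(a) \setminus i} v_{j \to a}(x_j)} \\
&= \frac{\sum_{x_a \setminus x_i} \bigl[ \prod_{j \in N(a) \setminus i} x_j = 1 \bigr] \prod_{j \in N(a) \setminus i} \frac{v_{j \to a}(x_j)}{v_{j \to a}(-1)}}
        {\sum_{x_a \setminus x_i} \bigl[ \prod_{j \in N(a) \setminus i} x_j = -1 \bigr] \prod_{j \in N(a) \setminus i} \frac{v_{j \to a}(x_j)}{v_{j \to a}(-1)}} \\
&= \frac{\sum_{x_a \setminus x_i} \bigl[ \prod_{j \in N(a) \setminus i} x_j = 1 \bigr] \prod_{j \in N(a) \setminus i} \bigl( v^r_{j \to a} \bigr)^{\frac{1 + x_j}{2}}}
        {\sum_{x_a \setminus x_i} \bigl[ \prod_{j \in N(a) \setminus i} x_j = -1 \bigr] \prod_{j \in N(a) \setminus i} \bigl( v^r_{j \to a} \bigr)^{\frac{1 + x_j}{2}}} \tag{A.5} \\
&= \frac{\prod_{j \in N(a) \setminus i} (v^r_{j \to a} + 1) + \prod_{j \in N(a) \setminus i} (v^r_{j \to a} - 1)}
        {\prod_{j \in N(a) \setminus i} (v^r_{j \to a} + 1) - \prod_{j \in N(a) \setminus i} (v^r_{j \to a} - 1)} \tag{A.6} \\
&= \frac{1 + \frac{\prod_{j \in N(a) \setminus i} (v^r_{j \to a} - 1)}{\prod_{j \in N(a) \setminus i} (v^r_{j \to a} + 1)}}
        {1 - \frac{\prod_{j \in N(a) \setminus i} (v^r_{j \to a} - 1)}{\prod_{j \in N(a) \setminus i} (v^r_{j \to a} + 1)}} \tag{A.7}
\end{align*}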
In (A.5) we have simply used the fact that the ratio $v_{j \to a}(x_j) / v_{j \to a}(-1)$ is 1 when $x_j = -1$
and $v^r_{j \to a}$ otherwise. In (A.6) we have done slightly more work. The product
$\prod_{j \in N(a) \setminus i} (v^r_{j \to a} + 1)$ expands to the sum of the products
\[
  \prod_{j \in M} v^r_{j \to a}, \quad \forall M \subseteq N(a) \setminus i,
\]
where M can be the empty set, in which case we define the product to be 1. The
product $\prod_{j \in N(a) \setminus i} (v^r_{j \to a} - 1)$ expands to the same sum but with a negative sign for
some terms: the term indexed by M is negative exactly when $|N(a) \setminus i| - |M|$ is odd.
Now consider the indicator function in the numerator of (A.5), which (with $x_i = 1$)
requires $\prod_{j \in N(a) \setminus i} x_j = 1$. It selects in the sum all terms where an even number of
the $x_j$ take the value −1. On the other hand, all terms where an odd number of the $x_j$
take the value −1 are ignored. Identifying M with the set of indices j for which
$x_j = 1$, the numerator in (A.6) selects exactly the same terms as the indicator
function does, since the terms with an odd number of −1 values cancel. That is, we have
\[
  \prod_{j \in N(a) \setminus i} (v^r_{j \to a} + 1) + \prod_{j \in N(a) \setminus i} (v^r_{j \to a} - 1)
  = 2 \sum_{x_a \setminus x_i} \Bigl[ \prod_{j \in N(a) \setminus i} x_j = 1 \Bigr] \prod_{j \in N(a) \setminus i} \bigl( v^r_{j \to a} \bigr)^{\frac{1 + x_j}{2}} .
\]
Doing the same analysis for the denominators of (A.5) and (A.6) we get the result
in (A.6), as the 2’s cancel out. Further rearranging (A.7) we get
\[
  \frac{f^r_{a \to i} - 1}{f^r_{a \to i} + 1} = \prod_{j \in N(a) \setminus i} \frac{v^r_{j \to a} - 1}{v^r_{j \to a} + 1} . \tag{A.8}
\]
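As a quick sanity check of (A.6)–(A.8), which is not part of the original derivation, consider a check node a with exactly two neighbors j and k besides i. The shorthand $u = v^r_{j \to a}$ and $w = v^r_{k \to a}$ is introduced here only for readability.

```latex
\begin{align*}
f^r_{a \to i}
  &= \frac{v_{j \to a}(1)\, v_{k \to a}(1) + v_{j \to a}(-1)\, v_{k \to a}(-1)}
          {v_{j \to a}(1)\, v_{k \to a}(-1) + v_{j \to a}(-1)\, v_{k \to a}(1)}
   = \frac{uw + 1}{u + w}, \\
\frac{(u + 1)(w + 1) + (u - 1)(w - 1)}{(u + 1)(w + 1) - (u - 1)(w - 1)}
  &= \frac{2uw + 2}{2u + 2w}
   = \frac{uw + 1}{u + w}, \\
\frac{f^r_{a \to i} - 1}{f^r_{a \to i} + 1}
  &= \frac{uw + 1 - u - w}{uw + 1 + u + w}
   = \frac{(u - 1)(w - 1)}{(u + 1)(w + 1)} .
\end{align*}
```

The first line evaluates the check-to-variable rule directly (dividing numerator and denominator by $v_{j \to a}(-1)\, v_{k \to a}(-1)$), the second evaluates (A.6), and the third reproduces (A.8) for this small case.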
Instead of using the ratios in (A.2)–(A.3), we can use the log-ratios
\[
  f^l_{a \to i} = \ln \bigl( f^r_{a \to i} \bigr),
\]
\[
  f^l_i = \ln \bigl( f^r_i \bigr),
\]
and
\[
  v^l_{i \to a} = \ln \bigl( v^r_{i \to a} \bigr).
\]
We can then rearrange the results using log-ratios. We first write the left-hand
side of (A.8) as
\[
  \frac{f^r_{a \to i} - 1}{f^r_{a \to i} + 1}
  = \frac{e^{f^l_{a \to i}} - 1}{e^{f^l_{a \to i}} + 1}
  = \tanh \left( \frac{f^l_{a \to i}}{2} \right), \tag{A.9}
\]
where we have simply used the definition of the tanh function. Likewise, we can
write the right-hand side of (A.8) as
\[
  \prod_{j \in N(a) \setminus i} \frac{v^r_{j \to a} - 1}{v^r_{j \to a} + 1}
  = \prod_{j \in N(a) \setminus i} \tanh \left( \frac{v^l_{j \to a}}{2} \right) . \tag{A.10}
\]
Rearranging (A.8) with the help of (A.9) and (A.10), we finally get the tanh-rule
for the check-to-variable updates. The rule is
\[
  f^l_{a \to i} = 2 \tanh^{-1} \left( \prod_{j \in N(a) \setminus i} \tanh \left( \frac{v^l_{j \to a}}{2} \right) \right) . \tag{A.11}
\]
If we now initialize the outgoing messages at the channel factors as
\[
  f^l_i = \ln \left( \frac{f_i(1)}{f_i(-1)} \right)
\]
we can send (A.11) as the check-to-variable message at each check node. In
addition, rewriting (A.4) with the help of log-ratios we can send
\[
  v^l_{i \to a} = \sum_{b \in N(i) \setminus a} f^l_{b \to i}
\]
as the variable-to-check message at each variable node. Thus, in the case of binary
linear codes we can send a single scalar over each edge as the message, instead of a
vector containing two scalars as when we directly apply the sum-product decoder.
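The equivalence between the tanh-rule (A.11) and the direct check-to-variable rule can be checked numerically. The following Python sketch is an illustration only; the function names `check_to_variable_llr` and `check_to_variable_bruteforce` and the example message values are hypothetical and chosen here for the comparison.

```python
from itertools import product
from math import atanh, log, prod, tanh


def check_to_variable_llr(v_llrs):
    """Tanh-rule (A.11): v_llrs holds the incoming log-ratios v^l_{j->a}
    of all neighbors j of check a except the target variable i."""
    return 2.0 * atanh(prod(tanh(v / 2.0) for v in v_llrs))


def check_to_variable_bruteforce(v_msgs):
    """Direct rule: v_msgs[j] = (v_{j->a}(1), v_{j->a}(-1)).
    Returns ln(f_{a->i}(1) / f_{a->i}(-1)) by summing over all assignments."""
    f = {1: 0.0, -1: 0.0}
    for xs in product((1, -1), repeat=len(v_msgs)):
        weight = prod(v[0] if x == 1 else v[1] for v, x in zip(v_msgs, xs))
        # The parity check x_i * prod(xs) = 1 forces x_i = prod(xs).
        f[prod(xs)] += weight
    return log(f[1] / f[-1])


# Hypothetical incoming messages from three neighboring variable nodes.
v_msgs = [(0.9, 0.1), (0.3, 0.7), (0.6, 0.4)]
v_llrs = [log(p1 / pm1) for p1, pm1 in v_msgs]

print(check_to_variable_llr(v_llrs))
print(check_to_variable_bruteforce(v_msgs))
```

Both calls print the same value up to floating-point rounding. The brute-force version enumerates all assignments of the neighboring variables and is therefore exponential in the check degree, whereas the tanh-rule needs only one pass over the incoming messages.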
Appendix B
Parameter searches
The following two sections show plots of the bit-error rate as a function of the
two free parameters used in the stochastic bit-flipping decoder and the WBF
decoder. In the stochastic bit-flipping decoder, the probability to flip a bit was
parameterized using two parameters p and T. The gradient-descent bit-flipping
decoder was chosen to have one free parameter, the threshold θ for flipping. The
parameter α in the general formulation of the weighted bit-flipping decoder was
set to 1 as originally presented by Wadayama et al. (2007). The stochastic bit-flipping
decoder was run after transmission over the BSC and the gradient-descent
bit-flipping decoder after transmission over the BSC and BAWGNC.
B.1 Parameter searches for the stochastic bit-flipping decoder
[Figure: three color-mapped panels (top, middle, bottom) of log10 Pb, color scale from −8 to 0, over p (horizontal axis, 0.02 to 0.12) and T (vertical axis, 0.3 to 1.3).]
Figure B.1: Parameter search for the stochastic bit-flipping decoder with the (3,6)-regular
ensemble and block length 2^14. The parameterization uses the parameters T and p. The
bit-error rate Pb was evaluated at the grid intersections. The performance was evaluated
on the BSC with crossover probability 0.03 (top), 0.05 (middle) and 0.07 (bottom). Note
that the color scale is logarithmic.
[Figure: three color-mapped panels (top, middle, bottom) of log10 Pb, color scale from −8 to 0, over p (horizontal axis, 0.02 to 0.12) and T (vertical axis, 0.3 to 1.3).]
Figure B.2: Parameter search for the stochastic bit-flipping decoder with the irregular
ensemble and block length 2^14. The parameterization uses the parameters T and p. The
bit-error rate Pb was evaluated at the grid intersections. The performance was evaluated
on the BSC with crossover probability 0.03 (top), 0.05 (middle) and 0.07 (bottom). Note
that the color scale is logarithmic.
[Figure: three color-mapped panels (top, middle, bottom) of log10 Pb, color scale from −8 to 0, over p (horizontal axis, 0.02 to 0.12) and T (vertical axis, 0.5 to 1.5).]
Figure B.3: Parameter search for the stochastic bit-flipping decoder with the rate-1/2
WiMAX code and block length 2304. The parameterization uses the parameters T and
p. The bit-error rate Pb was evaluated at the grid intersections. The performance was
evaluated on the BSC with crossover probability 0.03 (top), 0.05 (middle) and 0.07 (bottom).
Note that the color scale is logarithmic.
B.2 Parameter searches for the gradient-descent bit-flipping decoder
[Figure: three panels of the bit-error rate Pb (logarithmic vertical axis, 10^-5 to 10^0) against the threshold θ (horizontal axis, −1.5 to 1.5); legend: ϵ = 0.06, 0.04, 0.02.]
Figure B.4: Parameter search for the gradient-descent bit-flipping decoder with the
(3,6)-regular ensemble with block length 2^14 (top), the irregular ensemble with block
length 2^14, and the rate-1/2 WiMAX code with block length 2304. The threshold θ is the
free parameter in the gradient-descent bit-flipping decoder. The bit-error rate Pb was
evaluated on the BSC as a function of the crossover probability ϵ. The vertical axis is
logarithmic.
[Figure: three panels of the bit-error rate Pb (logarithmic vertical axis, 10^-8 to 10^0) against the threshold θ (horizontal axis, −1.6 to 1.4); legend: SNR [dB] = 1.9, 4.4, 8.0.]
Figure B.5: Parameter search for the gradient-descent bit-flipping decoder with the
(3,6)-regular ensemble with block length 2^14 (top), the irregular ensemble with block
length 2^14, and the rate-1/2 WiMAX code with block length 2304. The threshold θ is the
free parameter in the gradient-descent bit-flipping decoder. Pb was evaluated on the
BAWGNC as a function of the signal-to-noise ratio (SNR) in dB. The vertical axis is
logarithmic.