Hardware Design of Decoder for Low-Density Parity Check Codes by Chi, Huageng
HELSINKI UNIVERSITY OF TECHNOLOGY
Faculty of Electronics, Communication and Automation
Department of Communications and Networking
Huageng Chi
HARDWARE DESIGN OF DECODER FOR
LOW-DENSITY PARITY CHECK CODES
Thesis submitted in partial fulﬁlment of the requirement for
the degree of Master of Science in Technology
Espoo, Finland, 3rd August 2009
Supervisor: Prof. Patric Östergård
Instructor: M.Sc. Mika Rautio
HELSINKI UNIVERSITY OF TECHNOLOGY
ABSTRACT of the Master's thesis
Author: Huageng Chi
Name of the thesis: Hardware Design of Decoder for Low-Density Parity Check
Codes
Date: 3rd August 2009 Number of pages: x + 62
Faculty: Faculty of Electronics, Communication and Automation
Professorship: Communications Code: S-72
Supervisor: Prof. Patric Östergård
Instructor: M.Sc. Mika Rautio
A hardware decoder architecture is presented in this thesis
for quasi-cyclic (QC) low-density parity check (LDPC)
codes.
The decoder is real-time configurable and supports 15 codes,
which are combinations of 3 rates and 5 lengths. The partly
parallel architecture implements layered decoding. Each check
node decoder is serial and implements the min-sum algorithm
with correction. The proposed design techniques include
out-of-order memory-write, a two-stage multi-size shifter,
and serial decoding termination.
The decoder consumes about half of the logic resources
of the Xilinx FPGA chip XC2VP50-5F1152. The worst-case
throughput at 20 iterations ranges from 5 Mbits to 60 Mbits
(information bits) per second. Higher throughput can be
obtained with the proposed optimisation. Reuse for similar
codes is possible.
Keywords low-density parity check (LDPC) decoder, FPGA,
multi-rate, multi-length, layered decoding, out-of-order
memory-write, multi-size shifter
Preface
This thesis work was part of a project¹ carried out at VTT Technical Research
Center of Finland in 2007. The work started with the codes and certain
directive information provided by project partners. The thesis was written in
the Department of Communications and Networking at TKK Helsinki University
of Technology.
I would like to thank Professor Patric Östergård at TKK for being my
supervisor and providing detailed guidelines, instructions, advice, and comments.
I would also like to thank the team manager Mika Rautio and the project
manager Jussi Roivainen, both at VTT, for preparing the topic for me,
and also for their support to my work and study. Special thanks go to Mika
Rautio for being my instructor.
I am most grateful to my girlfriend Jing Zhang. This thesis would never
be complete without her love and support. I owe so much to her.
Finally, I would like to express my gratitude to my mother, my father, and
my elder brother.
Otaniemi, Espoo, 3rd August 2009
Huageng Chi
¹ It is a subproject of WINNER II, an EU-funded project: http://www.ist-winner.org/index.html
Contents
1 Introduction
1.1 LDPC decoder hardware implementation
1.2 Scope of thesis
1.3 Thesis organisation
2 Algorithm and architecture
2.1 System model
2.2 Parity check matrices and graphs
2.3 Encoding and decoding
2.4 Message passing algorithms
2.5 Quasi-cyclic (QC) LDPC code
2.6 Approaches to reduce complexity
2.7 A partly parallel architecture
3 Proposed decoder architecture
3.1 Definition of 15 LDPC codes
3.2 Decoding algorithm
3.3 Proposed design techniques
3.4 Decoder core
3.5 Overall architecture with dual buffer
3.6 Fixed-point issues
4 Results
4.1 Introduction to design flow
4.2 Error correction performance
4.3 VHDL design, verification, and synthesis
4.4 Decoding throughput
5 Conclusion
A Base matrices for 15 LDPC codes
B Reference performance
Bibliography
List of symbols and abbreviations
$C_m$  check node $m \in \mathcal{M}$
$C^i_{m'}$  check node with local index $m'$ within CNB $i$
$H_{M \times N}$  parity check matrix of size $M \times N$
$H_{i,j}$  submatrix at block row $i$ and block column $j$ of the parity check matrix of a QC LDPC code
$L_n$  total information of bit $n$, generally in LLR form
$M$  number of rows of the parity check matrix
$\mathcal{M}$  index set $\mathcal{M} = \{0, 1, \cdots, M-1\}$ for check nodes, or parity check constraints, or rows of the parity check matrix
$\mathcal{M}(n)$  all check nodes that check variable node $n$
$\mathcal{M}(n) \setminus m$  all check nodes that check variable node $n$, excluding check node $m \in \mathcal{M}(n)$
$N$  number of columns of the parity check matrix, equal to the codeword length
$\mathcal{N}$  index set $\mathcal{N} = \{0, 1, \cdots, N-1\}$ for variable nodes, or code bits, or columns of the parity check matrix
$\mathcal{N}(m)$  all variable nodes checked by check node $m$
$\mathcal{N}(m) \setminus n$  all variable nodes checked by check node $m$, excluding variable node $n \in \mathcal{N}(m)$
$Q_{n,m}$  V2C message from variable node $n$ to check node $m$, generally in LLR form
$Q^{j,i}_{n',m'}$  V2C message from variable node $V^j_{n'}$ to check node $C^i_{m'}$, generally in LLR form
$R_{m,n}$  C2V message from check node $m$ to variable node $n$, generally in LLR form
$R^{i,j}_{m',n'}$  C2V message from check node $C^i_{m'}$ to variable node $V^j_{n'}$, generally in LLR form
$V_n$  variable node $n \in \mathcal{N}$
$V^j_{n'}$  variable node with local index $n'$ within VNB $j$
$z$  block size of block LDPC code
APP a posteriori probability
C2V check-to-variable
CNB check node block
CND check node decoder
FPGA ﬁeld-programmable gate array
LDPC low-density parity check
LLR log likelihood ratio
QC Quasi-cyclic
TPMP two-phase message passing
TDMP turbo decoding message passing
V2C variable-to-check
VHDL very-high-speed-integrated-circuit hardware description language
VNB variable node block
List of Figures
2.1 System model
2.2 Example: Tanner graph
2.3 Messages and updates
2.4 Message passing viewed from a particular variable node
2.5 Layered decoder: an example architecture
2.6 Example decoder wave form, non-pipelined
2.7 Example decoder wave form, pipelined
2.8 Example decoder wave form, pipelined, out-of-order memory-write
3.1 Example decoder wave form, pipelined, out-of-order memory-write and -read
3.2 Decoder core
3.3 Shifter, top level
3.4 Shifter, horizontal shifter
3.5 Example, log shifter
3.6 Check node decoder, pipelined, sample waveform
3.7 Check node decoder (CND) structure
3.8 Decoder with dual buffer
4.1 Error correction performance
4.2 Decoding throughput
A.1 Base matrix, rate 1/2
A.2 Base matrix, rate 2/3
A.3 Base matrix, rate 3/4
B.1 Codeword error ratio, rate 1/2, BPSK, AWGN
B.2 Codeword error ratio, rate 2/3, BPSK, AWGN
B.3 Codeword error ratio, rate 3/4, BPSK, AWGN
List of Tables
2.1 Example: parity check matrix
2.2 Block LDPC code parity check matrix (trivial example)
2.3 Block LDPC code base matrix (trivial example)
3.1 Illustration: shifter
3.2 Total information organised as columns
3.3 Sum memory
3.4 A message block viewed as table
4.1 Decoding throughput (decoded bits per clock cycle, 20 iterations)
Chapter 1
Introduction
Channel noise causes transmission errors in communication systems. An
effective means to improve reliability is to employ forward error correction
(FEC) codes. After Shannon laid down the theoretical foundation of
communications in his 1948 landmark paper [1], a central objective was to
find practical coding schemes that could approach channel capacity on
well-understood channels such as the additive white Gaussian noise (AWGN)
channel [2]. Turbo codes, invented in 1993 [3], were the first practical
capacity-approaching codes and are widely deployed nowadays.
Following the success of turbo codes, the low-density parity check (LDPC) code,
which was invented by Gallager in the early 1960s [4] but forgotten for three
decades, was rediscovered by MacKay et al. in the mid-1990s [5, 6]. LDPC codes
also perform close to the Shannon limit [7, 8]. Unlike turbo codes, the
structure of an LDPC code is inherently parallel, which enables flexible
implementations for a wide range of applications.
The rediscovery triggered research on LDPC codes. LDPC codes have been
adopted in various areas, including new generation standards like DVB-S2,
IEEE 802.22, 802.16e, 802.11n, 802.3an, and so on [9]. However, LDPC decoder
products are still rare on the market because of implementation challenges.
Effective implementation of a practical hardware decoder is of great
interest.
1.1 LDPC decoder hardware implementation
LDPC codes had been ignored for about 30 years, partly because they were much
too complex for the technology at that time [2]. Various decoding algorithms
exist. In a straightforward implementation, a hardware decoder consists of
a large number of small decoders of two kinds. Every small decoder of one
kind is connected to multiple small decoders of the other kind. During decoding,
every small decoder receives information from its neighbours,
processes the received information, and sends the processing result to each
neighbour. Information exchanged between a pair of small decoders is called a
message, the exchange of information is called message passing, and the
processing of information is called message update. All small decoders perform
message updates and message passing simultaneously and repeatedly. A stop
criterion is evaluated from time to time.
Updating a message in hardware requires a data path. A data path consists
of basic logical and mathematical operators, which are in turn built from
digital circuits. Hardware resources include logic resources, memory resources,
routing resources, and so on. A small decoder consumes a certain amount of
logic and memory resources to build data paths. Passing messages
between small decoders requires routing resources. A typical LDPC decoder
in the straightforward implementation consists of a large number of small
decoders. For example, one code in this thesis needs 4,608 variable node decoders
and 2,304 check node decoders. 35,328 messages need to be computed, stored,
and passed in every iteration, and the number of iterations can be as many
as 20. Given the hardware technology of the past, such an implementation
was too complex to be practical for high speed applications.
Hardware decoder implementation became realisable only in recent years due
to improvements in code design, algorithm design, decoder architecture
design, and microelectronics technology. But new challenges exist due to
stringent system requirements such as low hardware cost, high error correction
performance, high decoding throughput, and flexibility to support multiple codes
of different rates and codeword lengths. A fully parallel decoder enables high
throughput but requires a prohibitively large amount of hardware resources [10];
a completely serial decoder consumes the least hardware resources but
its throughput is low. All practical decoders are made partly parallel to
trade off hardware cost against throughput. The key task in
architecture design is to parallelize message updates and message exchanges,
and to map this algorithmic functionality to the limited amount of logic,
memory, and routing resources provided by the target hardware device. The task is
harder when a decoder needs to support multiple code rates and multiple
codeword lengths. A decoder designed for one application may not fit another,
as the architecture design of an LDPC decoder is closely coupled with
application-specific decisions such as the code's characteristics, the decoding
algorithm, system requirements, and hardware resource constraints.
1.2 Scope of thesis
The codes to be implemented in the thesis are quasi-cyclic (QC) LDPC
codes. Three base matrices are defined for code rates 1/2, 2/3, and 3/4. Each base
matrix consists of 48 columns and expands to 5 parity check matrices
corresponding to block sizes 12, 24, 36, 48, and 96. Therefore 15 codes are defined.
The 5 codeword lengths of each rate are [12, 24, 36, 48, 96] × 48 = [576, 1152, 1728, 2304, 4608].
The objective of the thesis work was to design a hardware decoder architecture
suitable for an FPGA device. The implementation aims at small hardware
cost and challenging system requirements, including high decoding throughput,
real-time support for multiple code rates and multiple codeword lengths,
and small degradation of error correction performance caused by hardware
implementation. The decoder is part of a trial system for demonstration
purposes; therefore optimisation of hardware cost and decoding throughput
is not prioritised. However, room for future optimisation is considered.
The main result of the thesis is a hardware architecture suitable for an FPGA
device. The decoder is real-time configurable to decode any of the 15 specified
LDPC codes. A partly parallel architecture implements layered decoding: all
check node decoders in a check node block (CNB) operate in parallel, and
each check node decoder is serial and pipelined.
The thesis also presents some design techniques. Out-of-order memory-read
is used to improve throughput. A two-stage multi-size shifter is designed
to perform a cyclic shift on the first z values of a 96-value vector, where z ∈
{12, 24, 36, 48, 96} is the block size. The decoder checks mb consecutive CNBs
serially to evaluate the stop criterion, where mb ∈ {12, 16, 24} is the number
of block rows. Control signals are saved in memory modules, and updating
the memory content adapts the design to other codes without hardware
redesign, provided the new codes do not differ much from those used in this thesis.
1.3 Thesis organisation
The rest of the thesis is organised as follows. Decoding algorithms and
techniques to reduce complexity in hardware implementation are discussed in
Chapter 2. The proposed decoder architecture and techniques are presented
in Chapter 3. Results are reported in Chapter 4. Chapter 5 concludes the
thesis. References and appendices are at the end of the thesis.
Chapter 2
Algorithm and architecture
LDPC codes and decoding algorithms are introduced in this chapter. Various
techniques that reduce hardware implementation complexity can be found in the
literature; those employed in this thesis work are briefly introduced. A partly
parallel decoder architecture is illustrated.
2.1 System model
Figure 2.1 presents a basic model of a communications channel. An encoder
maps a source message $s \triangleq (s_0 s_1 \cdots s_{K-1})$ to a codeword
$c \triangleq (c_0 c_1 \cdots c_{N-1}) \in C$, where $s_k \in \{0, 1\}$,
$c_n \in \{0, 1\}$, $C$ is a code, $K$ is the message length, and $N$ is the
codeword length. Codeword $c$ is transmitted in the form of a signal
$x \triangleq (x_0 x_1 \cdots x_{N-1})$, $x_n \in \{-1, +1\}$, using the mapping
$c_n = 0 \rightarrow x_n = +1$; $c_n = 1 \rightarrow x_n = -1$.
Figure 2.1: System model
row index
  ↓    0 1 2 3 4 5 6 7 8 9   ← column index
  0    0 0 0 0 1 1 0 1 0 1
  1    1 0 1 0 0 1 0 1 0 0
  2    0 1 1 0 1 0 1 0 1 1
  3    1 0 0 1 0 1 0 1 1 1
  4    0 1 1 1 1 0 0 0 0 0
Table 2.1: Example: parity check matrix
The received signal through an AWGN channel is $r = x + n$, where
$n \triangleq (n_0 n_1 \cdots n_{N-1})$ is zero-mean white Gaussian noise with
variance $\sigma^2 = N_0/2$. A decoder produces an estimate $\hat{c}$ of the
transmitted codeword $c$.
2.2 Parity check matrices and graphs
Every $(N, K)$ binary linear block code is specified by its parity check matrix
$H_{M \times N}$, which is assumed to have full rank and whose dimension is
$M \times N$. A column of the matrix corresponds to a code bit. Columns and bits
are indexed from left to right by $0, 1, \cdots, N-1$. A row corresponds to a
parity check constraint. Rows and parity check constraints are indexed from top
to bottom by $0, 1, \cdots, M-1$. Let $h(m, n)$ be the parity check matrix entry
at row $m$ and column $n$; parity check $m$ checks bit $n$ if and only if
$h(m, n) = 1$. Table 2.1 presents an example parity check matrix with $M = 5$
and $N = 10$.
The graph introduced by Tanner in [11] is referred to as a Tanner graph and is
commonly used to represent LDPC codes. A graph is a set of nodes connected
by a set of edges. A Tanner graph is a bipartite graph: it has two types
of nodes, and each edge connects a node of one type to a node of the other
type. One type of node represents symbols of a codeword, and the other type
represents parity check constraints. Figure 2.2 presents the Tanner
graph for the example code given in Table 2.1. Circles are variable nodes,
indexed from left to right by $0, 1, \cdots, N-1$, with node $n$ corresponding to
bit $n$. Boxes are check nodes representing parity checks, indexed from left to
right by $0, 1, \cdots, M-1$, with node $m$ corresponding to parity check $m$. An
edge connects check node $m$ and variable node $n$ if and only if $h(m, n) = 1$.
The degree of a node is the number of its neighbours.
Figure 2.2: Example: Tanner graph
A one-to-one correspondence links a row of a parity check matrix, a parity
check constraint, and a check node. Another one-to-one correspondence links
a column of a parity check matrix, a codeword bit, and a variable node.
An LDPC code is a linear block code with a sparse parity check matrix. A
sparse matrix contains only a few ones in each row and each column. In
general, the parity check matrix of an LDPC code is made large and
pseudo-random in order to obtain good error correction performance. Most recent
practical codes are structured and show much regularity, but randomness is
still evident.
The following notation is used in this thesis:
• $\mathcal{N} \triangleq \{0, 1, \cdots, N-1\}$ is the index set for variable nodes, or code bits, or
matrix columns; $\mathcal{M} \triangleq \{0, 1, \cdots, M-1\}$ is the index set for check nodes,
or parity check constraints, or matrix rows.
• $\mathcal{N}(m)$ denotes all variable nodes checked by check node $m$. $\mathcal{N}(m)$ with $n \in \mathcal{N}(m)$
excluded is denoted by $\mathcal{N}(m) \setminus n$; $\mathcal{M}(n)$ is the index set for all check nodes
that are neighbours of variable node $n$. $\mathcal{M}(n)$ with $m \in \mathcal{M}(n)$
excluded is denoted by $\mathcal{M}(n) \setminus m$.
• Check node $m$ is denoted by $C_m$; variable node $n$ is denoted by $V_n$.
• The degree of check node $m$ is $d_m = |\mathcal{N}(m)|$; the degree of variable node $n$ is
$d_n = |\mathcal{M}(n)|$.
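As an illustration of this notation, the following minimal Python sketch builds the neighbour sets and node degrees for the example matrix of Table 2.1. The function names `vars_of_check` and `checks_of_var` are hypothetical and chosen only for this example; they are not part of the thesis design.

```python
# Parity check matrix of Table 2.1 (M = 5 rows, N = 10 columns).
H = [
    [0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
]
M, N = len(H), len(H[0])


def vars_of_check(H, m):
    """N(m): indices of the variable nodes checked by check node m."""
    return [n for n, h in enumerate(H[m]) if h == 1]


def checks_of_var(H, n):
    """M(n): indices of the check nodes that check variable node n."""
    return [m for m in range(len(H)) if H[m][n] == 1]


# Node degrees d_m = |N(m)| and d_n = |M(n)|.
check_degrees = [len(vars_of_check(H, m)) for m in range(M)]
var_degrees = [len(checks_of_var(H, n)) for n in range(N)]

# Every edge of the Tanner graph is counted once from each side,
# so both degree sums equal the number of ones in H.
num_edges = sum(check_degrees)
```

Each edge of the Tanner graph carries one V2C and one C2V message, so `num_edges` is also the number of message pairs a decoder for this small code would handle per iteration.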
2.3 Encoding and decoding
To design an LDPC code is to find a good parity check matrix. Given a parity
check matrix $H$, the code $C$ is the null space of $H$, i.e., $c \in C$ if and only if
$H c^T = 0$.
The expression $c = mG$ describes encoding message $m$ to codeword $c$, where $G$ is
the generator matrix, which can be obtained from the parity check matrix $H$. This
direct method requires a large number of multiplications and additions, so it
is too complex for hardware implementation. For a class of widely used QC
LDPC codes, encoding is efficiently realised by solving $H c^T = 0$ for $c$. The
codes are systematic, and a codeword $c$ consists of an information part $m$ and
a parity part $p$, written as $c = [m \; p]$. When $m$ is given, the corresponding $p$ can
be derived by solving $H [m \; p]^T = 0$. Further discussion of encoding can be
found in the literature, such as [12], and is outside the scope of this thesis.
Given the received signal $r$, a maximum likelihood (ML) decoding algorithm finds
the codeword estimate
\[ \hat{c} = \arg\max_{c \in C} \Pr(r \mid c) \]
A maximum a posteriori (MAP) algorithm finds a bit-wise estimate for each
bit; the estimate for bit $n$ is
\[ \hat{c}_n = \arg\max_{c_n \in \{0, 1\}} \Pr(c_n \mid r) \]
The message passing algorithm discussed in the following text is a MAP algo-
rithm. The min-sum algorithm, which can be considered an approximation
to message passing algorithm, is a ML algorithm.
Figure 2.3: Messages and updates
2.4 Message passing algorithms
Among various decoding algorithms, a fundamental one and its variants are
referred to by different names such as message passing (MP), belief propagation
(BP), and sum product algorithm (SPA). Those terms are often used
interchangeably in the literature, and so they are in this thesis. The message
passing algorithm is briefly illustrated in this section. Detailed treatment of
codes on graphs can be found in a number of references, for example, factor
graphs and SPA are discussed in [13], Forney graphs and SPA in [14], and BP in
[5] [6].
Along every graph edge ﬂow two kinds of information, which are represented
by variable-to-check (V2C) message and check-to-variable (C2V) message.
C2V message from Cm to Vn carries conditional information on Vn's value
being 0 or 1, as illustrated in Figure 2.3 (a). The information is obtained
using all V2C messages from Cm's neighbouring variable nodes excluding Vn.
V2C message from Vn to Cm carries conditional information on Vn's value
being 0 or 1, as illustrated in Figure 2.3 (b). The information is obtained
using channel input to Vn and all C2V messages from Vn's neighbouring
check nodes excluding Cm. Procedures computing V2C messages and C2V
messages are termed as variable update and check update, respectively.
A decoder needs to perform a large number of updates as fast as possible.
Scheduling is needed to allocate update tasks to hardware resources and time.
An iteration covers the least number of clock cycles needed for all messages to be
updated. The algorithm runs for multiple iterations to converge. The
conventional two-phase message passing (TPMP) is the most straightforward
example to illustrate scheduling. In TPMP, all check nodes perform check
node updates in one phase, all variable nodes perform variable node updates
in the subsequent phase, and those two phases constitute one iteration. In
order to start the iterative procedure, all V2C messages leaving a variable
node are initialised to the channel input to that node; then check node updates
can be performed. At the end of each iteration, every variable node makes a
hard decision using the latest C2V messages as well as its channel input.
Decoding terminates if the decisions satisfy all parity check constraints;
otherwise, a new iteration starts, until a preset maximum number of iterations
is reached.
Figure 2.4: Message passing viewed from a particular variable node
The effectiveness of such algorithms is illustrated by Figure 2.4. The variable
node of interest is drawn in the centre of the figure. Tiers of variable nodes
are indexed outwards by 1, 2, · · · . Only two tiers of variable
nodes are presented due to lack of space. Arrows represent graph edges and
messages. Each variable node also accepts information from its corresponding
channel input, which is not shown in the figure. The key observation is
that, after n iterations, the central variable node has collected information from
all variable nodes up to tier n, and the collected information is utilised to
decode the central variable node. Consequently, in general, the larger the
number of iterations, the more reliable the decoding result.
2.4.1 Message deﬁnition
Messages being exchanged can be probabilities or other quantities derived
from probabilities. For probability messages, let $r_{m,n}$ denote the C2V message
from $C_m$ to $V_n$, and $q_{n,m}$ denote the V2C message from $V_n$ to $C_m$. $r_{m,n}$ and
$q_{n,m}$ are two-element a posteriori probability (APP) vectors defined as
\[ r_{m,n} \triangleq \begin{bmatrix} \Pr(c_n = 0 \mid q_{j,m}, j \in \mathcal{N}(m) \setminus n) \\ \Pr(c_n = 1 \mid q_{j,m}, j \in \mathcal{N}(m) \setminus n) \end{bmatrix} \]
\[ q_{n,m} \triangleq \begin{bmatrix} \Pr(c_n = 0 \mid r_n, r_{i,n}, i \in \mathcal{M}(n) \setminus m) \\ \Pr(c_n = 1 \mid r_n, r_{i,n}, i \in \mathcal{M}(n) \setminus m) \end{bmatrix} \]
where $r_n$ is the channel input for bit $n$.
Log likelihood ratios (LLRs) are widely used as messages because of several
advantages over probabilities: only one ratio is maintained instead of two
probabilities; multiplication and division in the real-valued domain become
addition and subtraction in the LLR domain; fixed-point algorithms using
LLRs are more numerically stable and suffer less performance loss; and
expressions like (2.1) are neat in the LLR domain and easier to analyse and
approximate. Corresponding to $r_{m,n}$ and $q_{n,m}$, messages in the LLR domain
are defined as
\[ R_{m,n} \triangleq \ln \frac{\Pr(c_n = 0 \mid q_{j,m}, j \in \mathcal{N}(m) \setminus n)}{\Pr(c_n = 1 \mid q_{j,m}, j \in \mathcal{N}(m) \setminus n)} \]
\[ Q_{n,m} \triangleq \ln \frac{\Pr(c_n = 0 \mid r_n, r_{i,n}, i \in \mathcal{M}(n) \setminus m)}{\Pr(c_n = 1 \mid r_n, r_{i,n}, i \in \mathcal{M}(n) \setminus m)} \]
and the channel input LLR for bit $n$ is defined as
\[ L_{c,n} \triangleq \ln \frac{\Pr(c_n = 0 \mid r_n)}{\Pr(c_n = 1 \mid r_n)} \]
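For the system model of Section 2.1 (BPSK mapping 0 → +1, 1 → −1 over an AWGN channel), and under the additional assumption that both bit values are equally likely a priori, the channel input LLR reduces to $L_{c,n} = 2 r_n / \sigma^2$. The Python sketch below illustrates this; the function names and the fixed random seed are hypothetical choices made only for the example.

```python
import random


def bpsk_map(codeword):
    """Map code bits to signal values: 0 -> +1, 1 -> -1."""
    return [1.0 if c == 0 else -1.0 for c in codeword]


def awgn(signal, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in signal]


def channel_llr(received, sigma):
    """Channel LLR ln(Pr(c=0|r) / Pr(c=1|r)) = 2 r / sigma^2.

    Assumes equiprobable bits and the BPSK mapping above.
    """
    return [2.0 * r / sigma ** 2 for r in received]


rng = random.Random(0)          # fixed seed, only for reproducibility
c = [0, 1, 1, 0]                # arbitrary example bits
x = bpsk_map(c)
r = awgn(x, sigma=0.5, rng=rng)
L_c = channel_llr(r, sigma=0.5)

# Without noise, the LLR sign follows the transmitted symbol.
L_noise_free = channel_llr(x, sigma=0.5)
```

A positive LLR favours bit value 0 and a negative LLR favours bit value 1, which matches the slicing rule used for hard decisions.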
2.4.2 Equations
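The check node update producing C2V message $R_{m,n}$ is described (e.g., in
[15]) by
\[ \tanh \frac{R_{m,n}}{2} = \prod_{j \in \mathcal{N}(m) \setminus n} \tanh \left( \frac{Q_{j,m}}{2} \right) \tag{2.1} \]
The variable node update producing V2C message $Q_{n,m}$ is described (e.g., in
[15]) by
\[ Q_{n,m} = \sum_{i \in \mathcal{M}(n) \setminus m} R_{i,n} + L_{c,n} \tag{2.2} \]
The LLR of the APP for bit $n$ is also referred to in this thesis as the column
sum, or total information, as it is the sum of all LLR messages to $n$:
\[ L_n \triangleq \ln \left( \frac{p(c_n = 0 \mid r)}{p(c_n = 1 \mid r)} \right) = \sum_{m \in \mathcal{M}(n)} R_{m,n} + L_{c,n} \tag{2.3} \]
$L_n$ is sliced to obtain the hard decision:
\[ \hat{c}_n = \begin{cases} 0 & L_n \ge 0 \\ 1 & L_n < 0 \end{cases} \]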
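Equations (2.1) to (2.3), combined with the two-phase schedule of Section 2.4, can be sketched in floating-point Python as follows. This is an illustrative software model only, not the hardware design of this thesis; the function name `decode_tpmp`, the dictionary-based message storage, and the clamping constant are assumptions made for the example.

```python
import math


def decode_tpmp(H, L_c, max_iter=20):
    """Two-phase message passing with LLR messages, a minimal sketch."""
    M, N = len(H), len(H[0])
    Nm = [[n for n in range(N) if H[m][n]] for m in range(M)]   # N(m)
    Mn = [[m for m in range(M) if H[m][n]] for n in range(N)]   # M(n)
    # V2C messages are initialised to the channel input.
    Q = {(n, m): L_c[n] for m in range(M) for n in Nm[m]}
    R = {}
    hard = [0 if l >= 0 else 1 for l in L_c]
    for _ in range(max_iter):
        # Phase 1: check node updates, equation (2.1).
        for m in range(M):
            for n in Nm[m]:
                prod = 1.0
                for j in Nm[m]:
                    if j != n:
                        prod *= math.tanh(Q[(j, m)] / 2.0)
                # Clamp to keep atanh finite for very reliable inputs.
                prod = max(-1.0 + 1e-12, min(1.0 - 1e-12, prod))
                R[(m, n)] = 2.0 * math.atanh(prod)
        # Total information (2.3) and hard decision.
        L = [L_c[n] + sum(R[(m, n)] for m in Mn[n]) for n in range(N)]
        hard = [0 if l >= 0 else 1 for l in L]
        # Terminate when all parity check constraints are satisfied.
        if all(sum(hard[n] for n in Nm[m]) % 2 == 0 for m in range(M)):
            break
        # Phase 2: variable node updates, equation (2.2),
        # computed as L_n minus the message from the target check node.
        for n in range(N):
            for m in Mn[n]:
                Q[(n, m)] = L[n] - R[(m, n)]
    return hard


# Parity check matrix of Table 2.1.
H = [
    [0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
]
# All-zero codeword sent; bit 5 received with a weak wrong-sign LLR.
L_c = [2.0] * 10
L_c[5] = -0.5
decoded = decode_tpmp(H, L_c)
```

In this example the three check nodes connected to bit 5 each contribute a positive C2V message, which outweighs the weak negative channel LLR, so the hard decision for bit 5 is corrected.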
2.5 Quasi-cyclic (QC) LDPC code
Quasi-cyclic (QC) LDPC code [16] [17] is also referred to as block LDPC code
[18]. Let $H_{M \times N}$ be an $M \times N$ parity check matrix of a QC LDPC code.
$H_{M \times N}$ consists of $m_b$ rows and $n_b$ columns of submatrices of size $z \times z$, where
$z$ is the block size, $M = m_b \times z$, and $N = n_b \times z$. A submatrix is either a null
matrix or a circulant matrix obtained by cyclically shifting each row of an
identity matrix to the right by $p$ steps. Define
\[ P_z \triangleq \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \end{bmatrix} \]
then a submatrix of size $z \times z$ with shift step $p$, $0 \le p \le z - 1$, is $(P_z)^p$, the
$p$-th power of $P_z$. Conventionally, $(P_z)^0$ is defined as the identity matrix.
A row of submatrices is called a block row. Block rows are indexed from
top to bottom by $0, 1, \cdots, m_b - 1$. Let the symbol $\lfloor \; \rfloor$ denote the floor function
such that $\lfloor x \rfloor$ returns the largest integer no greater than $x$. Row $m$ is located in
block row $\lfloor m/z \rfloor$, and is indexed locally in block row $\lfloor m/z \rfloor$ by $(m \bmod z)$.
Accordingly, parity check constraints and check nodes are also grouped into
$m_b$ blocks, and indexed globally and locally in the same manner. For
instance, check node $m$ is indexed locally by $(m \bmod z)$ in check node block
(CNB) $\lfloor m/z \rfloor$. Similarly, parity check matrix columns, code bits, and
variable nodes are grouped respectively into $n_b$ block columns, code bit blocks,
and variable node blocks (VNBs). Those blocks are indexed from left to right
by $0, 1, \cdots, n_b - 1$. A global index $n$ corresponds to local index $(n \bmod z)$
in block $\lfloor n/z \rfloor$. For instance, code bit $n$ is indexed locally by $(n \bmod z)$ in
code bit block $\lfloor n/z \rfloor$.
The following notation is used. Check node $C_m$ is also written as
$C^{\lfloor m/z \rfloor}_{(m \bmod z)}$, meaning a check node in CNB $\lfloor m/z \rfloor$ with local index
$(m \bmod z)$. Likewise, variable node $V_n$ can be denoted by
$V^{\lfloor n/z \rfloor}_{(n \bmod z)}$, and total information $L_n$ can be written as
$L^{\lfloor n/z \rfloor}_{(n \bmod z)}$. V2C message $Q_{n,m}$ can be written as
$Q^{\lfloor n/z \rfloor, \lfloor m/z \rfloor}_{(n \bmod z), (m \bmod z)}$, meaning the Q message from
$V^{\lfloor n/z \rfloor}_{(n \bmod z)}$ to $C^{\lfloor m/z \rfloor}_{(m \bmod z)}$; also, C2V message $R_{m,n}$ can be
written as $R^{\lfloor m/z \rfloor, \lfloor n/z \rfloor}_{(m \bmod z), (n \bmod z)}$.
Suppose the submatrix at block row $i$ and block column $j$ is $H_{i,j} = (P_z)^p$. For
$k = 0, 1, \cdots, z - 1$, check node $C^i_k$ is connected to variable node
$V^j_{(k+p) \bmod z}$, and conversely, variable node $V^j_k$ is connected to check node
$C^i_{(k+z-p) \bmod z}$.
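The index conventions and the connection rule for a circulant submatrix can be expressed in a few lines of Python. This is an illustrative sketch; the helper names are hypothetical.

```python
def to_local(index, z):
    """Split a global index into (block index, local index)."""
    return index // z, index % z


def to_global(block, local, z):
    """Inverse of to_local."""
    return block * z + local


def var_for_check(k, p, z):
    """Local index of the variable node in VNB j connected to check
    node C^i_k, when H_{i,j} = (P_z)^p."""
    return (k + p) % z


def check_for_var(k, p, z):
    """Local index of the check node in CNB i connected to variable
    node V^j_k, when H_{i,j} = (P_z)^p."""
    return (k + z - p) % z


# With z = 3 and shift p = 2, check nodes 0, 1, 2 of a CNB see
# variable nodes 2, 0, 1 of the corresponding VNB.
neighbours = [var_for_check(k, 2, 3) for k in range(3)]
```

The two rules are inverses of each other, so a hardware shifter that rotates a block of z values by p positions in one direction can be undone by a rotation of z − p positions in the same direction.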
global  local |  0  1  2 |  3  4  5 |  6  7  8 |  9 10 11 | ← global column indices
row     row   |  0  1  2 |  0  1  2 |  0  1  2 |  0  1  2 | ← local column indices
  0      0    |  0  1  0 |  0  0  0 |  0  1  0 |  0  0  1 |
  1      1    |  0  0  1 |  0  0  0 |  0  0  1 |  1  0  0 | ← block row 0
  2      2    |  1  0  0 |  0  0  0 |  1  0  0 |  0  1  0 |
  3      0    |  1  0  0 |  0  1  0 |  0  0  1 |  0  0  0 |
  4      1    |  0  1  0 |  0  0  1 |  1  0  0 |  0  0  0 | ← block row 1
  5      2    |  0  0  1 |  1  0  0 |  0  1  0 |  0  0  0 |
block column  |     0    |     1    |     2    |     3    |
Table 2.2: Block LDPC code parity check matrix (trivial example)
2.5.1 An example
The foregoing definitions are illustrated by a trivial example shown in Table 2.2.
In this example, $M = 6$, $N = 12$, $z = 3$, $m_b = 2$, $n_b = 4$, and the parity check
matrix is partitioned into $2 \times 4 = 8$ submatrices. For example, the submatrix
in block row 0 and block column 3 is
\[ H_{0,3} = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \]
therefore, check nodes 0, 1, and 2 in CNB 0 are connected respectively to
variable nodes 2, 0, and 1 in VNB 3, and the permutation of the index vector is
described by
\[ \begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix} = H_{0,3} \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} (0 + 2) \bmod 3 \\ (1 + 2) \bmod 3 \\ (2 + 2) \bmod 3 \end{bmatrix} \]
1 -1 1 2
0 1 2 -1
Table 2.3: Block LDPC code base matrix (trivial example)
2.5.2 Base matrix and expansion
The parity check matrix of a block LDPC code is often described by an $m_b \times n_b$
base matrix $B_{m_b \times n_b}$. The $(i, j)$-th entry of $B_{m_b \times n_b}$, denoted by $b(i, j)$, is
\[ b(i, j) = \begin{cases} -1 & \text{when } H_{i,j} = 0_{z \times z} \\ p & \text{when } H_{i,j} = (P_z)^p \end{cases} \]
Table 2.3 presents the base matrix of the code defined in Table 2.2.
Expanding one base matrix to multiple parity check matrices with different
block sizes leads to codes of multiple codeword lengths. Given a base matrix, a
parity check matrix for a code of block size $z$ is obtained by setting each
submatrix as
\[ H_{i,j} = \begin{cases} 0_{z \times z} & \text{if } b(i, j) = -1 \\ (P_z)^{b(i,j) \bmod z} & \text{otherwise} \end{cases} \]
This is the commonly used modulo expansion method.
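The modulo expansion can be checked on the trivial example: expanding the base matrix of Table 2.3 with z = 3 should reproduce the parity check matrix of Table 2.2. The Python sketch below does this; the function names `circulant` and `expand` are hypothetical and chosen for illustration.

```python
def circulant(z, p):
    """(P_z)^p: identity matrix with each row cyclically shifted
    to the right by p steps."""
    return [[1 if col == (row + p) % z else 0 for col in range(z)]
            for row in range(z)]


def expand(base, z):
    """Modulo expansion of a base matrix into an (mb*z) x (nb*z)
    parity check matrix. An entry of -1 denotes a null submatrix."""
    H = []
    for base_row in base:
        blocks = [
            [[0] * z for _ in range(z)] if b == -1 else circulant(z, b % z)
            for b in base_row
        ]
        for row in range(z):
            # Concatenate row `row` of every submatrix in this block row.
            H.append([bit for block in blocks for bit in block[row]])
    return H


# Base matrix of Table 2.3.
B = [
    [1, -1, 1, 2],
    [0, 1, 2, -1],
]
H = expand(B, 3)
```

Because the shift is taken modulo z, the same base matrix can be reused with any block size, which is how one base matrix yields several codeword lengths.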
2.6 Approaches to reduce complexity
The techniques employed in this thesis work to reduce computational
complexity and hardware complexity are briefly presented in this section.
2.6.1 Min-sum algorithm and its correction
Comparing (2.1) and (2.2) shows that the check node update dominates
computational complexity. The min-sum algorithm (MSA [19] [20]) provides a simple
approximation to (2.1) at the cost of a small performance loss.
In MSA, the sign part and magnitude part of $R_{m,n}$ are computed separately.
Let the sign function be
\[ \mathrm{sign}(x) = \begin{cases} +1 & x \ge 0 \\ -1 & x < 0 \end{cases} \]
The sign part is given by
\[ \mathrm{sign}(R_{m,n}) = \prod_{j \in \mathcal{N}(m) \setminus n} \mathrm{sign}(Q_{j,m}) \tag{2.4} \]
and the magnitude part is related to the input magnitudes by
\[ \tanh \frac{|R_{m,n}|}{2} = \prod_{j \in \mathcal{N}(m) \setminus n} \tanh \left( \frac{|Q_{j,m}|}{2} \right) \tag{2.5} \]
Equation (2.5) is given in some literature in logarithmic form:
\[ |R_{m,n}| = \psi \left( \sum_{j \in \mathcal{N}(m) \setminus n} \psi(|Q_{j,m}|) \right) \]
where the psi function is
\[ \psi(|x|) \triangleq -\ln \tanh \frac{|x|}{2} = \ln \frac{e^{|x|} + 1}{e^{|x|} - 1} \]
The key step leading to MSA is to approximate the magnitude (2.5) by
\[ |R_{m,n}| = \min_{j \in \mathcal{N}(m) \setminus n} |Q_{j,m}| \tag{2.6} \]
MSA is described by (2.2), (2.4) and (2.6). In the thesis work, correction
with a scaling factor [20] [21] is used to reduce performance loss.
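A direct software rendering of (2.4) and (2.6) with a multiplicative correction might look as follows. This is a hedged sketch: the function name is hypothetical, and the default scaling factor 0.75 is an assumed illustrative value, since the factor used in the thesis design is not stated in this section.

```python
def min_sum_check_update(q_in, scale=0.75):
    """Min-sum check node update for one check node.

    q_in  : V2C LLRs Q_{j,m} from all neighbours of check node m,
            in neighbour order.
    scale : correction scaling factor (0.75 is an assumed example
            value, not taken from the thesis).
    Returns the C2V LLRs R_{m,n} in the same neighbour order.
    """
    out = []
    for n in range(len(q_in)):
        others = q_in[:n] + q_in[n + 1:]      # exclude the target node
        sign = 1
        for q in others:                      # equation (2.4)
            if q < 0:
                sign = -sign
        magnitude = min(abs(q) for q in others)   # equation (2.6)
        out.append(scale * sign * magnitude)
    return out


r_plain = min_sum_check_update([2.0, -1.0, 4.0], scale=1.0)
r_scaled = min_sum_check_update([2.0, -1.0, 4.0])
```

The minimum in (2.6) never underestimates the magnitude given by (2.5), which is why a scaling factor below one is used as the correction.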
2.6.2 Value reuse property
Direct implementation of (2.6) requires a large number of adders to find the
minimum values. The value reuse property [22] [23] [24] [25] [26] [27] [28] [29]
states that, when a check node performs the update according to (2.6), among
all the outgoing magnitudes there is at most one magnitude that differs
from the rest. The outgoing magnitudes can be obtained as follows.
Let $\{|Q_{j,m}| : j \in N(m)\}$ be the set of input magnitudes to check node $m$.
Suppose $|Q_{k_1,m}|$ is a minimum of $\{|Q_{j,m}| : j \in N(m)\}$, and $|Q_{k_2,m}|$ is a
minimum of $\{|Q_{j,m}| : j \in N(m) \setminus k_1\}$. Denote the two minima by first minimum
$m_1 \triangleq |Q_{k_1,m}|$ and second minimum $m_2 \triangleq |Q_{k_2,m}|$. Then the output
magnitude given by (2.6) becomes
\[
|R_{m,n}| =
\begin{cases}
m_2 & n = k_1 \\
m_1 & n \ne k_1
\end{cases}
\tag{2.7}
\]
As a low-cost hardware implementation, a check node decoder accepts input
messages and sends output messages serially. As it accepts a stream of
input messages, the check node decoder compares the incoming magnitudes to
find $m_1$, $m_2$, and $k_1$. These three values are available as soon as all
messages have been received, and the outgoing magnitudes can then be produced
according to (2.7).
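The serial search for the two minima and the reuse rule (2.7) can be sketched as follows. This is a software illustration with hypothetical function names, not the hardware finder itself. The function `direct_magnitudes` evaluates (2.6) separately for every output and serves only as a reference.

```python
def find_two_minima(mags):
    """Scan magnitudes serially, tracking first minimum m1, its index k1, and second minimum m2."""
    m1 = m2 = float("inf")
    k1 = -1
    for k, mag in enumerate(mags):
        if mag <= m1:
            m1, m2, k1 = mag, m1, k   # new first minimum, old one becomes second
        elif mag <= m2:
            m2 = mag                  # new second minimum
    return m1, m2, k1

def outgoing_magnitudes(mags):
    """Apply (2.7): m2 goes to position k1, m1 goes to every other position."""
    m1, m2, k1 = find_two_minima(mags)
    return [m2 if n == k1 else m1 for n in range(len(mags))]

def direct_magnitudes(mags):
    """Reference: evaluate (2.6) separately for every output."""
    return [min(mags[:n] + mags[n + 1:]) for n in range(len(mags))]

inputs = [4, 1, 6, 3]
reused = outgoing_magnitudes(inputs)
```

For the inputs above, only the output towards the position of the first minimum differs from the rest.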
2.6.3 Layered decoding
Turbo-decoding message passing (TDMP), presented in [30, 31], is modified
and referred to as layered decoding in [32]. Alternatively, layered decoding can
be considered a variant of two-phase message passing (TPMP). When
applied to QC LDPC codes, the algorithm is summarised as follows.
Iterations start after initialisation
\[
\begin{aligned}
L_n &= L_{c,n} && \forall n \in N \\
R_{m,n} &= 0 && \forall m \in M,\ \forall n \in N(m)
\end{aligned}
\]
Each iteration consists of $m_b$ subiterations, corresponding to $m_b$ check node
blocks (CNBs), and a CNB is regarded as a layer. Subiterations are
performed sequentially from CNB 0 to $m_b - 1$. In the $i$-th subiteration, the
following steps are performed for every check node $m$ in CNB $i$.
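1. restore variable-to-check messages, $\forall n \in N(m)$:
\[
Q_{n,m} = L_n - R_{m,n} \tag{2.8}
\]
2. check node update, $\forall n \in N(m)$:
\[
R_{m,n} = 2 \tanh^{-1}\left( \prod_{j \in N(m) \setminus n} \tanh\left(\frac{Q_{j,m}}{2}\right) \right) \tag{2.9}
\]
3. update total information, $\forall n \in N(m)$:
\[
L_n = Q_{n,m} + R_{m,n} \tag{2.10}
\]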
Given a QC LDPC code, the above computation for a check node in a CNB is
independent of any other check node in the same CNB. Therefore, all check nodes
in a CNB can perform the computation in parallel to increase throughput.
At the end of each iteration, convergence of decoding is checked: decoding
terminates if the stop criterion is met, otherwise another iteration begins.
Compared with TPMP, layered decoding converges in fewer iterations
and requires less memory and wiring. In layered decoding,
CNBs decode a codeword sequentially, layer by layer. At the end of each
subiteration, the newly updated check-to-variable (C2V) messages are used to perform
the variable node update immediately, and the updated variable-to-check (V2C) messages
are used in subsequent subiterations. For any variable node $V_n$, instead of storing all V2C
messages $Q_{n,m}, \forall m \in M(n)$, only the total information $L_n$ is maintained as
a running sum, which is updated in every subiteration.
2.7 A partly parallel architecture
Architectures proposed in [33], [34], [35], [36], [27], [37], and [38] are based
on TPMP or its variants. Architectures of this kind are good candidates
for application-specific integrated circuits (ASICs), but they were not adopted
in the thesis work because it is complex to make them support multiple rates
and lengths, and they consume more hardware resources. The solution in this
thesis work is motivated by the architectures presented in [32], [22], [23], [24],
[25], [28], [26], and [29], which are based on layered decoding.
A decoder architecture for the example code in Table 2.2 is illustrated in
Figure 2.5. Sum memory stores total information $L^i_j$ at data lane $j$ in
memory entry address $i$. The first row of the base matrix in Table 2.3 is
$[b(0,0), b(0,1), b(0,2), b(0,3)] = [1, -1, 1, 2]$. Therefore, for check node $C^0_0$,
(2.8), (2.9), and (2.10) lead to the following equations. Variable-to-check
messages to $C^0_0$ are restored as
\[
\begin{bmatrix} Q^{0,0}_{1,0} \\ Q^{2,0}_{1,0} \\ Q^{3,0}_{2,0} \end{bmatrix}
=
\begin{bmatrix} L^{0}_{1} \\ L^{2}_{1} \\ L^{3}_{2} \end{bmatrix}
-
\begin{bmatrix} R^{0,0}_{0,1} \\ R^{0,2}_{0,1} \\ R^{0,3}_{0,2} \end{bmatrix}
\tag{2.11}
\]
where check-to-variable messages $R^{0,0}_{0,1}$, $R^{0,2}_{0,1}$, and $R^{0,3}_{0,2}$ were updated in the previous
iteration. Check-to-variable messages are updated by
\[
\begin{bmatrix} R^{0,0}_{0,1} \\ R^{0,2}_{0,1} \\ R^{0,3}_{0,2} \end{bmatrix}
=
\begin{bmatrix}
f\left(Q^{2,0}_{1,0}, Q^{3,0}_{2,0}\right) \\
f\left(Q^{0,0}_{1,0}, Q^{3,0}_{2,0}\right) \\
f\left(Q^{0,0}_{1,0}, Q^{2,0}_{1,0}\right)
\end{bmatrix}
\tag{2.12}
\]
where $f(x, y) \triangleq 2 \tanh^{-1}\left(\tanh(x/2) \cdot \tanh(y/2)\right)$. Finally, total information
is updated by
\[
\begin{bmatrix} L^{0}_{1} \\ L^{2}_{1} \\ L^{3}_{2} \end{bmatrix}
=
\begin{bmatrix} Q^{0,0}_{1,0} \\ Q^{2,0}_{1,0} \\ Q^{3,0}_{2,0} \end{bmatrix}
+
\begin{bmatrix} R^{0,0}_{0,1} \\ R^{0,2}_{0,1} \\ R^{0,3}_{0,2} \end{bmatrix}
\tag{2.13}
\]
Figure 2.5: Layered decoder: an example architecture
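Vectors
\[
\begin{bmatrix} L^{0}_{0} \\ L^{0}_{1} \\ L^{0}_{2} \end{bmatrix},\quad
\begin{bmatrix} L^{2}_{0} \\ L^{2}_{1} \\ L^{2}_{2} \end{bmatrix},\quad \text{and} \quad
\begin{bmatrix} L^{3}_{0} \\ L^{3}_{1} \\ L^{3}_{2} \end{bmatrix}
\]
are read out serially from addresses 0, 2, and 3 of sum memory, one vector out
of one memory entry per clock cycle. A shifter rotates the vectors circularly and
upwards, by $b(0,0) = 1$ step, $b(0,2) = 1$ step and $b(0,3) = 2$
steps respectively, to obtain
\[
\begin{bmatrix} L^{0}_{1} \\ L^{0}_{2} \\ L^{0}_{0} \end{bmatrix},\quad
\begin{bmatrix} L^{2}_{1} \\ L^{2}_{2} \\ L^{2}_{0} \end{bmatrix},\quad \text{and} \quad
\begin{bmatrix} L^{3}_{2} \\ L^{3}_{0} \\ L^{3}_{1} \end{bmatrix}
\]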
The topmost elements of the 3 vectors, $L^{0}_{1}$, $L^{2}_{1}$, and $L^{3}_{2}$, are routed serially to the
topmost subtractor to form a data stream. Like the access
to sum memory, c2v memory 0 is accessed 3 times, so that the old
check-to-variable messages $R^{0,0}_{0,1}$, $R^{0,2}_{0,1}$, and $R^{0,3}_{0,2}$ are read out and routed serially
to the same subtractor to form another data stream. The two data streams
are synchronised, so that (2.11) is computed correctly at the subtractor.
The subtractor's output data stream $Q^{0,0}_{1,0}$, $Q^{2,0}_{1,0}$, and $Q^{3,0}_{2,0}$ enters check node
decoder (CND) 0 as well as buffer 0. CND 0 updates the check-to-variable
messages according to (2.12). Because the value reuse property introduced in Section 2.6.2 is
utilised, the updated messages $R^{0,0}_{0,1}$, $R^{0,2}_{0,1}$ and $R^{0,3}_{0,2}$ are produced at the output of CND 0
immediately after $Q^{0,0}_{1,0}$, $Q^{2,0}_{1,0}$, and $Q^{3,0}_{2,0}$ have entered the decoder. $Q^{0,0}_{1,0}$,
$Q^{2,0}_{1,0}$, and $Q^{3,0}_{2,0}$ are delayed by buffer 0 in order to be synchronised with the output
stream of CND 0, $R^{0,0}_{0,1}$, $R^{0,2}_{0,1}$ and $R^{0,3}_{0,2}$, and the two streams enter the
adder, which produces total information $L^{0}_{1}$, $L^{2}_{1}$, and $L^{3}_{2}$ according to (2.13).
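As shown in Figure 2.5, when CND 0 decodes on behalf of check node $C^0_0$,
CNDs 1 and 2 decode for $C^0_1$ and $C^0_2$ respectively, in parallel. The parallel
operations produce updated total information
\[
\begin{bmatrix} L^{0}_{1} \\ L^{0}_{2} \\ L^{0}_{0} \end{bmatrix},\quad
\begin{bmatrix} L^{2}_{1} \\ L^{2}_{2} \\ L^{2}_{0} \end{bmatrix},\quad \text{and} \quad
\begin{bmatrix} L^{3}_{2} \\ L^{3}_{0} \\ L^{3}_{1} \end{bmatrix}
\]
The 3 vectors are shifted to obtain
\[
\begin{bmatrix} L^{0}_{0} \\ L^{0}_{1} \\ L^{0}_{2} \end{bmatrix},\quad
\begin{bmatrix} L^{2}_{0} \\ L^{2}_{1} \\ L^{2}_{2} \end{bmatrix},\quad \text{and} \quad
\begin{bmatrix} L^{3}_{0} \\ L^{3}_{1} \\ L^{3}_{2} \end{bmatrix}
\]
before they are written back to the sum memory locations from which the old
total information vectors were read. Writing back the updated total information
completes one subiteration, and the subsequent subiteration can start
immediately.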
Because messages enter and leave a check node decoder serially,
such a CND is said to be serial. The overall architecture is said to be partly
parallel because it contains multiple copies of CNDs. The parallelism factor is 3 in this
example, and all 3 check nodes in a CNB are processed simultaneously.
Figure 2.6: Example decoder wave form, non-pipelined
Figure 2.7: Example decoder wave form, pipelined
2.7.1 Improve throughput of the example architecture
The decoding throughput of the example architecture can be improved by
pipelining [22] and out-of-order memory-write [24].
The behaviour of the decoder in a subiteration can be abstracted as reading
sum memory for a number of clock cycles, and then writing sum memory for
a number of clock cycles, as shown in Figure 2.6. rn in the ﬁgure stands for
read memory address n, wn means write memory address n, and possible
latencies are not shown in the waveform. The throughput can be increased
if memory read operation overlaps with preceding memory write operation,
as in Figure 2.7. Overlapping is allowed by using dual port memory. In this
example, complete overlap is not possible because write operation to address
0 (w0) must precede read operation to address 0 (r0). Check node decoder is
made serial and two-stage pipelined, so that memory read stage and memory
write stage can operate simultaneously on consecutive check node blocks.
Figure 2.8: Example decoder wave form, pipelined, out-of-order memory-write
Out-of-order memory-write is illustrated by Figure 2.8. In this example,
during subiteration i, instead of writing messages back to memory in the
order of w0-w1-w2, writing sequence is w2-w0-w1. If the order w0-w1-w2 is
in use, no overlap can be achieved because w2 must precede r2.
2.7.2 Multi-size shifter
A number of architectures include a shifter. Given a nonzero submatrix
$H_{i,j} = (P_z)^p$, where $0 \le p < z$ and $z$ is the block size, for $k = 0, 1, \cdots, z - 1$, check node
$C^i_k$ is connected to variable node $V^j_{(k+p) \bmod z}$, and the corresponding messages
on the edge are $Q^{j,i}_{(k+p) \bmod z,\,k}$ and $R^{i,j}_{k,\,(k+p) \bmod z}$.
Suppose a partly parallel architecture has $z$ check node decoders (CNDs)
operating in parallel, like the architecture given in Figure 2.5. Let $z$ total
information messages $L^j_k$, $k = 0, 1, \cdots, z - 1$, be stored together in one
memory entry $j$. A memory entry is divided into $z$ data lanes indexed by
$0, 1, \cdots, z - 1$. Data lane $k$ in memory entry $j$ stores $L^j_k$; in other words,
the memory entry with address $j$ stores the column vector
\[
\begin{bmatrix} L^j_0 \\ L^j_1 \\ \vdots \\ L^j_{z-1} \end{bmatrix}
\]
The message required by CND $k$ is stored at lane $(k + p) \bmod z$, and the
column vector of the messages for
\[
\begin{bmatrix} \mathrm{CND}_0 \\ \mathrm{CND}_1 \\ \vdots \\ \mathrm{CND}_{z-1} \end{bmatrix}
\quad \text{is} \quad
\begin{bmatrix} L^j_{(0+p) \bmod z} \\ L^j_{(1+p) \bmod z} \\ \vdots \\ L^j_{(z-1+p) \bmod z} \end{bmatrix}
\]
which is obtained by shifting
\[
\begin{bmatrix} L^j_0 \\ L^j_1 \\ \vdots \\ L^j_{z-1} \end{bmatrix}
\]
upwards and circularly by $p$ steps. Therefore a shifter is needed.
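The alignment performed by the shifter can be illustrated with a short Python sketch. The function names are hypothetical. The example reuses block size z = 3 and shift step p = b(0, 3) = 2 from the example of Section 2.7, with strings standing in for the messages of one memory entry.

```python
def shift_up(lanes, p):
    """Circular upward shift by p: output lane k takes input lane (k + p) mod z."""
    z = len(lanes)
    return [lanes[(k + p) % z] for k in range(z)]

def shift_down(lanes, p):
    """Inverse shift, restoring the storage order before write-back."""
    z = len(lanes)
    return [lanes[(k - p) % z] for k in range(z)]

entry = ["L0", "L1", "L2"]        # lanes 0..2 of one memory entry, z = 3
aligned = shift_up(entry, 2)      # what CND 0, CND 1, CND 2 receive
restored = shift_down(aligned, 2)
```

After the upward shift, CND 0 receives the message stored in lane 2, which agrees with the shifted vector for address 3 in the example of Section 2.7.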
To handle codes with multiple block sizes, the shifter is required to shift the first
$z$ messages of a vector of $Z$ messages, where $z$ is the block size of a code
and $Z$ is the maximum of all block sizes. It is common to design block sizes
as multiples of the smallest block size. For instance, in this thesis
work the block sizes are 12, 24, 36, 48 and 96. Consequently, the shifter needs to
perform a circular shift on as many as 96 messages for block size $z = 96$, but for
block size $z = 48$ it operates on messages $0, 1, \cdots, 47$, while the messages
in lanes $48, 49, \cdots, 95$ are ignored and can be processed in any manner.
In short, the difficulty of making such a shifter stems from the large number
of messages to shift, as well as from the multi-size requirement.
2.7.3 Remove one shifter
Figure 2.5 contains two shifters: the left one is located in the sum memory's
memory-read data path, and the right one in its memory-write data
path.
The shifter in the memory-write data path can be removed, and the corresponding
shift operation can be compensated by the left shifter in the memory-read
path. Removal of the shifter reduces the latency of the decoding data path and
improves throughput. This technique is applied in some works such as [26].
Chapter 3
Proposed decoder architecture
In this chapter, the codes to be implemented are first specified. The
decoding algorithm is then formulated and mapped to the proposed hardware
architecture using the existing techniques introduced in Chapter 2, as well as techniques
proposed in this thesis work.
3.1 Deﬁnition of 15 LDPC codes
The codes to be implemented in this thesis are QC LDPC codes. 3 base
matrices are speciﬁed respectively for code rate 1/2, 2/3, and 3/4, listed in
appendix A. Each base matrix can be expanded to 5 parity check matrices,
with block sizes of 12, 24, 36, 48 and 96. The modulo expansion method
described in Section 2.5 is used. All parity check matrices have nb = 48 block
columns, i.e., every codeword has 48 code bit blocks, hence code lengths are
576, 1152, 1728, 2304, and 4608. Number of block rows, mb, is 24 for rate
1/2 codes, 16 for rate 2/3, and 12 for rate 3/4. The 15 codes are enumerated
in appendix A as code 1, 2, · · · , 15.
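The parameters of the 15 codes follow directly from the numbers above and can be tabulated with a few lines of Python. The names are hypothetical, the number of information bits assumes that each parity check matrix has full rank, and the enumeration order of the list is arbitrary and does not claim to match the code numbering of appendix A.

```python
N_B = 48                                        # block columns in every parity check matrix
BLOCK_SIZES = [12, 24, 36, 48, 96]
BLOCK_ROWS = {"1/2": 24, "2/3": 16, "3/4": 12}  # m_b per code rate

def code_parameters():
    """List (rate, z, code length, information bits) for the 15 codes."""
    params = []
    for rate, mb in BLOCK_ROWS.items():
        for z in BLOCK_SIZES:
            n = N_B * z                 # codeword length
            k = (N_B - mb) * z          # information bits, assuming H has full rank
            params.append((rate, z, n, k))
    return params

codes = code_parameters()
```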
3.2 Decoding algorithm
Initialisation is regarded as the $(-1)$-th iteration, described as
\[
\begin{aligned}
L^{(-1)}_{n} &= L_{c,n} && \forall n \in N \\
R^{(-1)}_{m,n} &= 0 && \forall m \in M,\ \forall n \in N(m)
\end{aligned}
\]
Decoding proceeds iteration by iteration. An iteration consists of $m_b$
subiterations, executed sequentially from check node block (CNB) 0 to $m_b - 1$.
Let $it$ be the iteration index, $i$ the subiteration index, and $M_i$ the set of
check nodes in CNB $i$, and let quantities updated in the $it$-th iteration be
labelled with superscript $(it)$. A subiteration consists of the following sequential
operations.
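1. restore variable-to-check messages, $\forall m \in M_i, n \in N(m)$:
\[
Q^{(it)}_{n,m} = L^{(it-1)}_{n} - R^{(it-1)}_{m,n} \tag{3.1}
\]
2. check node update, sign processing and magnitude processing
respectively, $\forall m \in M_i, n \in N(m)$:
\[
\operatorname{sign}\left(R^{(it)}_{m,n}\right) = \prod_{j \in N(m) \setminus n} \operatorname{sign}\left(Q^{(it)}_{j,m}\right) \tag{3.2}
\]
\[
\left|R^{(it)}_{m,n}\right| = \min \left\{ \left|Q^{(it)}_{j,m}\right| : j \in N(m) \setminus n \right\} \tag{3.3}
\]
3. magnitude correction, scaling the magnitude down by a factor of $\beta = 0.8$,
$\forall m \in M_i, n \in N(m)$:
\[
\left|R^{(it)}_{m,n}\right| = \beta \cdot \left|R^{(it)}_{m,n}\right| \tag{3.4}
\]
4. update total information, $\forall m \in M_i, n \in N(m)$:
\[
L^{(it)}_{n} = Q^{(it)}_{n,m} + R^{(it)}_{m,n} \tag{3.5}
\]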
Substitution of (3.1) into (3.5) gives
\[
L^{(it)}_{n} = L^{(it-1)}_{n} + \left( R^{(it)}_{m,n} - R^{(it-1)}_{m,n} \right) = L^{(it-1)}_{n} + \Delta R^{(it)}_{m,n} \tag{3.6}
\]
which implies that the check node update improves the total information of every
codeword bit iteration by iteration.
The following stop condition is evaluated sequentially at the end of each iteration:
1. make hard decision, $\forall n \in N$:
\[
\hat{c}_n =
\begin{cases}
0 & L^{(it)}_{n} \ge 0 \\
1 & L^{(it)}_{n} < 0
\end{cases}
\]
2. decoding converges if $H \hat{c}^T = 0$, where $\hat{c} \triangleq (\hat{c}_0 \hat{c}_1 \cdots \hat{c}_{N-1})$, exit decoding
3. if $it = 20$, decoding fails, exit decoding
4. continue to next iteration: $it = it + 1$
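The algorithm of this section can be sketched in floating-point Python as follows. This is a behavioural illustration under stated assumptions, not the hardware implementation: the function name and the data layout (a list of layers, each a list of check nodes given by their neighbour sets N(m)) are hypothetical, the iteration limit is modelled as at most 20 iterations, and the small code used here is made up for the example (block size 3, base matrix rows [1, -1, 1, 2] and [0, 2, -1, 0]).

```python
def sign(x):
    return 1 if x >= 0 else -1

def layered_min_sum(layers, llr, beta=0.8, max_iter=20):
    """Layered min-sum decoding with scaling correction, following (3.1)-(3.5)."""
    L = list(llr)                                    # total information L_n
    R = {(l, c, n): 0.0 for l, cnb in enumerate(layers)
         for c, nm in enumerate(cnb) for n in nm}    # C2V messages, initialised to 0
    hard = [0 if x >= 0 else 1 for x in L]
    for it in range(max_iter):
        for l, cnb in enumerate(layers):             # one subiteration per CNB
            for c, nm in enumerate(cnb):
                Q = {n: L[n] - R[(l, c, n)] for n in nm}          # (3.1)
                for n in nm:
                    others = [Q[j] for j in nm if j != n]
                    s = 1
                    for q in others:
                        s *= sign(q)                              # (3.2)
                    mag = beta * min(abs(q) for q in others)      # (3.3), (3.4)
                    R[(l, c, n)] = s * mag
                    L[n] = Q[n] + R[(l, c, n)]                    # (3.5)
        hard = [0 if x >= 0 else 1 for x in L]                    # hard decision
        if all(sum(hard[n] for n in nm) % 2 == 0
               for cnb in layers for nm in cnb):                  # H c^T = 0
            return hard, it + 1, True
    return hard, max_iter, False

# Hypothetical toy code: 2 layers of 3 check nodes, 12 code bits (bit index = 3 * block + lane)
layers = [[[1, 7, 11], [2, 8, 9], [0, 6, 10]],
          [[0, 5, 9], [1, 3, 10], [2, 4, 11]]]
llr = [2.0] * 12
llr[0] = -0.5                    # all-zero codeword with one unreliable, wrong bit
decoded, iterations, converged = layered_min_sum(layers, llr)
```

In this toy case the erroneous bit is corrected within the first iteration, because the check node in layer 0 pushes its total information back to a positive value before layer 1 is processed.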
3.3 Proposed design techniques
In addition to the techniques presented in literature and summarised in Chap-
ter 2, some other techniques not found in literature are employed in the thesis
work, explained in this section.
3.3.1 Out-of-order memory-read
Out-of-order memory-write, introduced in Section 2.7.1, improves throughput.
The example given in Figure 2.8 can be further improved by also using
out-of-order memory-read, as shown in Figure 3.1. When processing CNB $i + 1$,
instead of using read order r2-r3-r4, the order in use is r3-r4-r2. With this
arrangement, the idle cycle in Figure 2.8 is removed.
Figure 3.1: Example decoder wave form, pipelined, out-of-order memory-write and -read
If memory entry $j$ is to be accessed in both subiteration $i$ and the subsequent
subiteration $i + 1$, it is written at the beginning of the memory write operation in
subiteration $i$, and is read at the end of the memory read operation in subiteration
$i + 1$. This technique is useful because the data path generally has a latency
of some clock cycles, which is not illustrated in Figure 2.8 and Figure 3.1.
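The ordering rule can be sketched as two small Python functions. This is a simplified illustration with hypothetical function names. It only considers one pair of consecutive subiterations and ignores entries that are shared with both the previous and the next subiteration. The address sets {0, 1, 2} and {2, 3, 4} are read off the access orders quoted for the example waveforms.

```python
def write_order(current, nxt):
    """Write entries that the next subiteration also needs first."""
    shared = [a for a in current if a in nxt]
    rest = [a for a in current if a not in nxt]
    return shared + rest

def read_order(previous, current):
    """Read entries that the previous subiteration also used last."""
    shared = [a for a in current if a in previous]
    rest = [a for a in current if a not in previous]
    return rest + shared

sub_i = [0, 1, 2]          # entries accessed in subiteration i
sub_next = [2, 3, 4]       # entries accessed in subiteration i + 1
writes = write_order(sub_i, sub_next)
reads = read_order(sub_i, sub_next)
```

For this example the sketch reproduces the orders w2-w0-w1 and r3-r4-r2 used in the text.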
3.3.2 Two-stage shifter
A multi-size shifter is needed, as described in Section 2.7.2. A two-stage
shifter is presented in this thesis and illustrated by the following example.
Assume the block sizes are 3, 6, and 9. The input and output ports of the shifter
are 9 messages wide, and the messages are viewed as a column vector, indexed
from top to bottom by $0, 1, \cdots, 8$.
Suppose the current block size is $z = 6$ and the shifting step is $p = 4$, i.e., the shifter
is required to shift elements 0 through 5 circularly upwards by 4 positions,
and to ignore elements 6 through 8.
The process is shown in Table 3.1. In Table 3.1 (a), the column vector $[0, 1, \cdots, 8]^T$
is reshaped to 3 columns as a 3 × 3 table, with columns and rows indexed by
0, 1, and 2. In general, block sizes are multiples of the smallest block size
$z_{min}$, and the number of rows of the table equals the smallest block size $z_{min}$.
In this example the shifting step is $p = 4$, and element 4 is located in
row $i_p = 1$ and column $j_p = 1$ of Table 3.1 (a). Table 3.1 (b) is obtained by
left-shifting all rows in Table 3.1 (a) circularly. However, only the leftmost
$z/z_{min}$ elements in each row are subjected to the shift operation when $z < Z$,
where $Z$ is the maximum block size; the other elements are ignored and shown
as the symbol × in the tables. The shift step for rows 0 through $i_p - 1$ is $j_p + 1$; in this
example, row 0 is left-shifted circularly by 2 steps. The shift step for the remaining
rows is $j_p$; in this example, rows 1 and 2 in Table 3.1 (a) are left-shifted
circularly by 1 step. Table 3.1 (c) is obtained by up-shifting all columns
in Table 3.1 (b) circularly by $i_p$ steps, i.e., all columns in Table 3.1 (b) are
up-shifted circularly by 1 step. Finally, Table 3.1 (c) is reshaped back to a
column vector, and the shifter realises the required shifting as
$[0, 1, 2, 3, 4, 5, ×, ×, ×]^T \rightarrow [4, 5, 0, 1, 2, 3, ×, ×, ×]^T$
The shifting is done first row-wise and then column-wise; hence it is a two-stage
shifter.

                (a)              (b)              (c)
column       0   1   2        0   1   2        0   1   2
row 0        0   3   6        0   3   ×       (4)  1   ×
row 1        1  (4)  7       (4)  1   ×        5   2   ×
row 2        2   5   8        5   2   ×        0   3   ×
Table 3.1: Illustration: shifter
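A behavioural Python sketch of the two-stage shifter is given below, together with a direct one-step shift that serves as a reference. The function names are hypothetical. In the sketch, the row offset and column offset of element p are computed as p mod z_min and p div z_min, the row stage uses step j_p + 1 for the rows above row i_p and j_p for the others, and the column stage rotates by the row offset i_p. Ignored elements are returned as `None`.

```python
def two_stage_shift(vec, z, p, zmin):
    """Shift the first z elements of vec circularly upwards by p in two stages."""
    Z = len(vec)
    rows, cols, valid = zmin, Z // zmin, z // zmin
    ip, jp = p % zmin, p // zmin
    # Reshape column-wise: element e sits in row e % zmin, column e // zmin.
    table = [[vec[r + zmin * c] for c in range(cols)] for r in range(rows)]
    # Stage 1: rotate the `valid` leftmost entries of each row to the left.
    stage1 = []
    for r in range(rows):
        step = jp + 1 if r < ip else jp
        stage1.append([table[r][(c + step) % valid] for c in range(valid)])
    # Stage 2: rotate every column upwards by ip.
    stage2 = [stage1[(r + ip) % rows] for r in range(rows)]
    # Reshape back to a column vector; positions z..Z-1 are don't-cares.
    out = [None] * Z
    for r in range(rows):
        for c in range(valid):
            out[r + zmin * c] = stage2[r][c]
    return out

def direct_shift(vec, z, p):
    """Reference: one-step circular upward shift of the first z elements."""
    return [vec[(k + p) % z] for k in range(z)] + [None] * (len(vec) - z)

result = two_stage_shift(list(range(9)), z=6, p=4, zmin=3)
```

For the example of Table 3.1 the sketch returns elements 4, 5, 0, 1, 2, 3 in the first six positions.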
3.3.3 Iteration termination
Evaluation of the stop criterion given in Section 3.2 requires computing $H \hat{c}^T$ and
comparing it with a long zero vector at the end of each iteration. This method
leads to high computational complexity and latency.
The decoder implemented in this thesis work checks convergence serially:
at the end of subiteration $i$, the new hard decisions for all bits decoded
by CNB $i$ are compared with the old hard decisions produced in the previous
iteration. The decoder checks whether all those decisions are equal, and also whether all
decisions satisfy all the parity check constraints in this CNB. A one-bit flag is
set to 1 if and only if both conditions are true, and cleared to 0 otherwise. With
one flag per CNB, $m_b$ flags make up a register, where $m_b$
is the number of CNBs of the code. If all flags in the register are 1 at
the end of a subiteration, decoding is regarded as converged.
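The flag register can be modelled with a few lines of Python. This is a behavioural sketch with hypothetical class and argument names. It only captures the rule stated above and says nothing about how the hardware stores or compares the decisions.

```python
class SerialTerminator:
    """One flag per CNB; decoding is regarded as converged when all flags are 1."""

    def __init__(self, mb):
        self.flags = [0] * mb

    def update(self, i, old_bits, new_bits, checks):
        """Call at the end of subiteration i.

        old_bits, new_bits: dict n -> hard decision for the bits touched by CNB i.
        checks: list of neighbour sets N(m) for the check nodes in CNB i.
        """
        unchanged = all(old_bits[n] == new_bits[n] for n in new_bits)
        parity_ok = all(sum(new_bits[n] for n in nm) % 2 == 0 for nm in checks)
        self.flags[i] = 1 if (unchanged and parity_ok) else 0
        return all(self.flags)

term = SerialTerminator(2)
first = term.update(0, {0: 0, 1: 0}, {0: 0, 1: 0}, [[0, 1]])    # flag 0 set, flag 1 still 0
second = term.update(1, {1: 0, 2: 1}, {1: 0, 2: 0}, [[1, 2]])   # a decision changed
third = term.update(1, {1: 0, 2: 0}, {1: 0, 2: 0}, [[1, 2]])    # stable and parity satisfied
```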
3.4 Decoder core
3.4.1 Operation overview
Main part of decoder architecture is shown as decoder core in Figure 3.2.
The implementation is similar to the example given in Section 2.7.
The algorithm described in Section 3.2 is mapped to hardware resources as
follows. Total information Ln, n ∈ N , stored in sum memory, is read out from
the memory and rotated by a shifter to get aligned with check node decoders
(CNDs). The subtractors implement (3.1) to restore variable-to-check (V2C)
messages. A CND implements min-sum algorithm with scaling factor cor-
rection, which is described by (3.2), (3.3) and (3.4). The adders implement
(3.5) to update total information. Suppose a CND performs computation on
behalf of check node m ∈M, the V2C messages Qn,m, n ∈ N (m), are saved
in the buﬀer accompanying that CND for a short while, and then read out
synchronously with the check-to-variable (C2V) messages updated by that
CND, and the two message streams are routed to the adder next to that
CND. Also saved in that buﬀer are hard decisions of bits n ∈ N (m) pro-
duced in previous iteration, which are compared with the new decisions based
on total information updated in current iteration. The comparison is done
by a terminator, which evaluates stop criteria and terminates the decoding
process when necessary. The C2V messages updated in current iteration are
saved into c2v memory for use in subsequent iterations. In the beginning
of the current iteration, the messages in c2v memory are the C2V messages updated in
the previous iteration.
Figure 3.2: Decoder core

             column 0        column 1        ···   column 47
row 0        $L^{0}_{0}$     $L^{1}_{0}$     ···   $L^{47}_{0}$
row 1        $L^{0}_{1}$     $L^{1}_{1}$     ···   $L^{47}_{1}$
⋮            ⋮               ⋮               ⋱     ⋮
row z − 1    $L^{0}_{z-1}$   $L^{1}_{z-1}$   ···   $L^{47}_{z-1}$
Table 3.2: Total information organised as columns
The maximum of block sizes {12, 24, 36, 48, 96} is 96. Each section of the
data path in Figure 3.2 operates in parallel on 96 messages. Those messages
are viewed as elements of a column vector, indexed from top to bottom by
0, 1, · · · , 95. For example, the data ports of sum memory are 96-message
wide, so that total information of nodes pertaining to a variable node block
(VNB) can be stored in one memory entry, and accessed in parallel in one
clock cycle. There are 96 identical subtractors, CNDs, buﬀers, and so on. If
block size z is less than 96, only messages 0, 1, · · · , z−1 are under meaningful
operation.
Next, each part of the decoder is brieﬂy discussed.
3.4.2 Sum memory
A codeword is divided into 48 blocks. A vector of total information
\[
\left[ L^{0}_{0}, L^{0}_{1}, \cdots, L^{0}_{z-1}, L^{1}_{0}, L^{1}_{1}, \cdots, L^{1}_{z-1}, \cdots, L^{47}_{0}, L^{47}_{1}, \cdots, L^{47}_{z-1} \right]
\]
is reshaped to 48 columns as shown in Table 3.2. The sum memory in Figure
3.2 is 48 entries deep and 96 lanes wide. The memory layout is shown in
Table 3.3, in which $m_{i,j}$ denotes the memory location at address $j$ and lane $i$.
Table 3.2 is loaded into sum memory by saving $L^{j}_{i}$ to memory location $m_{i,j}$,
$0 \le i \le z - 1$, $0 \le j \le 47$. One memory entry stores a message block of 96
messages; when $z < 96$, only messages 0 through $z - 1$ are valid.

             address 0     address 1     ···   address 47
lane 0       $m_{0,0}$     $m_{0,1}$     ···   $m_{0,47}$
lane 1       $m_{1,0}$     $m_{1,1}$     ···   $m_{1,47}$
⋮            ⋮             ⋮             ⋱     ⋮
lane 95      $m_{95,0}$    $m_{95,1}$    ···   $m_{95,47}$
Table 3.3: Sum memory
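The mapping from a codeword bit index to a location in sum memory can be sketched as follows. The function names are hypothetical, and the memory is modelled as a plain list of 48 entries with 96 lanes each. Unused lanes are marked with `None`.

```python
def sum_memory_location(n, z):
    """Map codeword bit index n to (lane i, address j) for block size z."""
    j, i = divmod(n, z)     # block index j selects the address, position i selects the lane
    return i, j

def load_sum_memory(llr, z, lanes=96, depth=48):
    """Fill a depth x lanes memory image; lanes z..lanes-1 stay unused (None)."""
    mem = [[None] * lanes for _ in range(depth)]
    for n, value in enumerate(llr):
        i, j = sum_memory_location(n, z)
        mem[j][i] = value
    return mem

# Block size z = 12: 48 * 12 = 576 code bits; the bit indices stand in for LLR values
mem = load_sum_memory(list(range(48 * 12)), z=12)
```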
3.4.3 Dual port memory
The sum memory is a simple dual-port memory, with one port dedicated to
read access and the other to write access. The two ports operate simultaneously:
while the read port sends out data for subiteration $i$, the write port is able
to receive data of the previous subiteration.
Similarly, the c2v memory is also a dual port memory.
3.4.4 Multi-size shifter
In subiteration $i$, if submatrix $H_{i,j}$ is not a null matrix, the 96 CNDs access
the message block at address $j$ in sum memory. The topmost $z$ messages in the
message vector read from the sum memory are shifted cyclically to align them
with the check node decoders, as described in Section 2.7.2. The two-stage shifter
proposed in Section 3.3.2 is implemented as in Figure 3.3. The 96 messages
are viewed as elements of Table 3.4. First, each of the 12 rows of the
table is shifted, and then each of the 8 columns is shifted. A register
separates the row operation from the column operation, so that the shifter is
two-stage pipelined.
Figure 3.3: Shifter, top level

             column 0       column 1       ···   column 7
row 0        $L^{j}_{0}$    $L^{j}_{12}$   ···   $L^{j}_{84}$
row 1        $L^{j}_{1}$    $L^{j}_{13}$   ···   $L^{j}_{85}$
⋮            ⋮              ⋮              ⋱     ⋮
row 11       $L^{j}_{11}$   $L^{j}_{23}$   ···   $L^{j}_{95}$
Table 3.4: A message block viewed as table
Figure 3.4: Shifter, horizontal shifter
Because the 5 block sizes {12, 24, 36, 48, 96} are multiples of 12, only the
(z/12) leftmost columns in Table 3.4 are valid. Consequently, only the (z/12)
leftmost messages in a message row are subjected to circular shift operation.
Figure 3.4 presents structure of a row shifter, which contains 5 subshifters.
A subshifter labelled with number n performs leftwards circular shift on the
n leftmost messages, ignoring the rest of the messages. Therefore, subshifter
1 is dummy, it lets the leftmost message pass and ignores others, subshifter
2 either lets the 2 leftmost messages pass or swaps them, ignoring the rest,
and so forth. The step for shifting is given by signal sh, which stands for step
for horizontal shift. As shown in Figure 3.3, sh is driven by a multiplexer
selecting sh1 or sh2, the usage of the two values is described in Section 3.3.2.
12 multiplexers are controlled by signals sel0, · · · , sel11. The column shifters
in Figure 3.3 are controlled by signal sv, which stands for step for vertical
shift.
A column shifter and a subshifter in a horizontal shifter can each be implemented
as a logarithmic shifter. Figure 3.5 presents an example of a log shifter. $\log_2 8 = 3$
stages of 2-input multiplexers are needed to implement a circular upwards
shifter on 8 elements. The bold lines show the path for a shift step of 6.
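The principle of the log shifter can be sketched in Python as follows. The function name is hypothetical and the sketch assumes a power-of-two width, as in the 8-element example. Each stage either passes its input unchanged or rotates it by a power of two, selected by one bit of the shift step.

```python
def log_shift_up(vec, step):
    """Circular upward shift built from log2(len(vec)) stages of 2-input multiplexers."""
    n = len(vec)
    stages = n.bit_length() - 1
    assert 1 << stages == n, "this sketch assumes a power-of-two width"
    data = list(vec)
    for s in range(stages):
        if (step >> s) & 1:                  # multiplexer select for this stage
            shift = 1 << s                   # stage s rotates by 2**s
            data = [data[(k + shift) % n] for k in range(n)]
    return data

shifted = log_shift_up(list(range(8)), 6)    # shift step 6 = binary 110
```

For a shift step of 6, the stage rotating by 1 passes its input through, and the stages rotating by 2 and by 4 are both active.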
Figure 3.5: Example, log shifter
Figure 3.6: Check node decoder, pipelined, sample waveform
3.4.5 Check node decoder (CND)
The main part of the decoder core is a column of 96 CNDs, indexed from
top to bottom by $k = 0, 1, \cdots, 95$ in Figure 3.2. During subiteration $i$,
CND $k$ performs the check node update on behalf of check node $C^i_k$. A CND
is serial (Section 2.7) and pipelined (Section 2.7.1), and uses out-of-order
memory-write (Section 2.7.1) and out-of-order memory-read (Section 3.3.1). Figure
3.6 presents sample waveforms to illustrate the behaviour of a CND, and
Figure 3.7 presents its structure, which is explained next.
A counter in the control block generates indices $0, 1, \cdots, \rho_i - 1$ for incoming
messages, where $\rho_i$ is the degree of the check nodes in CNB $i$. Messages are
integers, in two's complement format outside the CNDs and in sign-magnitude
format inside a CND. A converter at the input converts a message from two's
complement format to sign-magnitude format, and another converter does
the converse at the output. A column of registers in the middle of Figure 3.7 is
grouped into set (a) and set (b). While one set is used to process incoming
messages, the other can be used to produce outgoing messages pertaining to
the previous subiteration. The first minimum m1, the second minimum m2, and the index
ind of the first minimum (refer to Section 2.6.2) are held in registers named
m1, m2 and ind, respectively.
Figure 3.7: Check node decoder (CND) structure
Magnitude processing
At the beginning of each subiteration, registers m1 and m2 are initialised to
the largest values they can hold. As messages enter a CND one by one,
those registers are updated by the finder module. Let mag be the current
magnitude at the finder's input, ind its index, and m1 and m2 the current
values in the respective registers. Three cases are enumerated:
case A: mag ≤ m1
case B: m1 < mag ≤ m2
case C: mag > m2
In case A, mag is saved to the m1 register, m1 is saved to the m2 register, and ind is
saved to the ind register; in case B, mag is saved to the m2 register, and the other two
registers are unchanged; in case C, all registers are unchanged. The registers
are used to produce C2V messages as described in (2.7), which is reproduced
below: the output magnitude of the message with index $j$ is
\[
\begin{cases}
m_2 & j = ind \\
m_1 & j \ne ind
\end{cases}
\]
where the index $j = 0, 1, \cdots, \rho_i - 1$ is driven by the control block, as shown
in Figure 3.7.
Sign processing
Equation (3.2) can be written as
\[
\operatorname{sign}\left(R^{(it)}_{m,n}\right) = \operatorname{sign}\left(Q^{(it)}_{n,m}\right) \prod_{j \in N(m)} \operatorname{sign}\left(Q^{(it)}_{j,m}\right)
\]
because the square of $\operatorname{sign}\left(Q^{(it)}_{n,m}\right)$ is 1.
The sign function returns the real value ±1. In two's complement representation,
a sign bit of 1 indicates a negative number, and 0 indicates a non-negative
number. Products of ±1 in the equation are translated to binary sums over the
binary field {0, 1}, implemented by the XOR logic operator in hardware. As
messages enter a CND one by one, their sign bits are saved in a buffer
memory as shown in Figure 3.7, and the indices of the messages serve as addresses to the
buffer. The sum register in Figure 3.7 is a 1-bit flip-flop, which is cleared to
zero at the beginning of each subiteration, and then stores the running sum
of the sign bits. To produce the sign bit for the output check-to-variable message
with index $j$, the sign bit saved in buffer memory at address $j$ is read out and
added to the value in the sum flip-flop. Addition
is done by the adder in the upper-right corner in Figure 3.7.
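The sign processing can be modelled in Python as follows. The function name is hypothetical. The list `buffer` plays the role of the sign buffer memory and `total` the role of the 1-bit sum flip-flop.

```python
def sign_bits_out(sign_bits_in):
    """Serial sign processing; a sign bit of 1 means negative, 0 means non-negative."""
    buffer = []          # sign buffer memory, addressed by message index
    total = 0            # 1-bit sum flip-flop, cleared at the start of a subiteration
    for bit in sign_bits_in:
        buffer.append(bit)
        total ^= bit     # running XOR of all incoming sign bits
    # Output j: XOR of the total with input j's own sign bit, which equals
    # the XOR of the sign bits of all the other inputs.
    return [total ^ buffer[j] for j in range(len(buffer))]

out = sign_bits_out([1, 0, 1, 1])
```

For the four sign bits above, the total is 1, so every output sign bit is the complement of the corresponding input sign bit.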
Out-of-order memory access
Address sequences for all memory access in the decoder can be determined
by inspecting base matrices in design time, as described in Section 2.7.1 and
Section 3.3.1. Address sequences are saved in on-chip memories.
3.5 Overall architecture with dual buﬀer
It takes time to initialise the sum memory in Figure 3.2, and also
to output the decoded result from the memory to the decoder user. A dual buffer
configuration can hide this time and improve throughput. Figure 3.8 presents
the decoder core equipped with dual buffers. When one memory is in decoding
mode, it is connected to the decoder core, which reads data from and writes data back
to this memory. Meanwhile, the other memory can operate in input-output
(IO) mode: it is connected to the IO control block, which sends
decoded data out to the decoder user, or accepts the next received codeword to
initialise the memory.
Figure 3.8: Decoder with dual buffer
For the reason given in Section 2.7.3, the decoder core presented in Figure
3.2 includes only one shifter. Consequently, the IO block shown in Figure
3.8 includes another shifter, which pre-shifts messages during input mode, before iterations
start, and post-shifts messages during output mode. The control
bits supplied to this shifter are stored in memory.
The target Xilinx FPGA chip provides Block SelectRAM memory and Dis-
tributed SelectRAM memory [39]. Most memories in the decoder are built
with the former memory type, and the buﬀers in Figure 3.2 are built with
the latter type. The Xilinx block memory can be conﬁgured to dual port
mode.
A number of memories in the decoder are used to store memory access
addresses. The details of memory addressing are out of the scope of this thesis. The
basic idea is that the code structure is reflected by the memory address
sequences, and it is possible to modify those address sequences so that the
implementation can adapt to other codes, as long as the new codes are not
too different.
3.6 Fixed-point issues
Most parts of the data path in Figure 3.2 carry two's complement integers
represented by 5 bits, while total information messages are 8 bits wide. A subtractor
has two outputs: the message entering a buffer is 8 bits wide, and the
one going to a CND is 5 bits wide. The adder output is 8 bits wide.
Subtractor and adder outputs are clipped.
The input data to the hardware decoder are LLR values, which are also coded
as two's complement integers. As shown in Figure 2.1, the received
signal $r$ is assumed to be a real valued signal, and consequently the LLRs are also real
valued. Quantisation is needed to convert real valued LLR values to two's
complement integers with a finite number of digits. An interval is determined, and real
valued LLR values falling into this interval are quantised to 5-bit integers.
Real valued LLRs outside the interval are clipped to the boundary values
of the interval. The interval selected in simulation is
\[
\left[ -\frac{2}{\sigma^2} - 3 \times \frac{2}{\sigma},\ \frac{2}{\sigma^2} + 3 \times \frac{2}{\sigma} \right]
\]
where $\sigma^2$ is the noise power. An interval of this size covers over 99% of the LLR values.
Decoding performance is related to the size of the quantisation interval, as well
as to the number of bits of each section of the data path in the hardware. MATLAB
fixed-point simulation determines the input quantisation interval and the widths of
the sections of the data path. Detailed discussion of fixed-point modelling and
simulation is omitted due to lack of space.
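The clipping and quantisation described in this section can be sketched in Python as follows. The function names are hypothetical. The text specifies only the interval and the bit widths, so the uniform step, the rounding rule and the symmetric range of -15 to 15 for the 5-bit input are assumptions of this sketch and not necessarily the choices made in the fixed-point model.

```python
def llr_interval(sigma):
    """Upper bound 2/sigma^2 + 3 * (2/sigma) of the symmetric clipping interval."""
    return 2.0 / sigma ** 2 + 3.0 * (2.0 / sigma)

def quantise_llr(llr, sigma, bits=5):
    """Clip an LLR to the interval and map it to a two's complement integer."""
    bound = llr_interval(sigma)
    max_level = 2 ** (bits - 1) - 1             # 15 for 5 bits (assumed symmetric range)
    clipped = max(-bound, min(bound, llr))      # clip to the interval boundaries
    return int(round(clipped / bound * max_level))   # assumed uniform quantiser

def saturate(value, bits=8):
    """Clip an adder or subtractor result to the two's complement range."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(lo, min(hi, value))

bound = llr_interval(1.0)                       # 2 + 6 = 8 for sigma = 1
```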
Chapter 4
Results
The architecture proposed in Chapter 3 was implemented in the FPGA
device. The implementation results are reported in this chapter.
4.1 Introduction to design ﬂow
Implementing demanding algorithms in hardware is a complex procedure.
This section presents a simplified design flow consisting of the most important
steps only. Detailed discussion is out of scope. The hardware design
work flow is generally iterative: one needs to revert to earlier steps if the result of
the current step does not meet the requirements.
1. The thesis work starts with understanding the design requirements. Codes
are specified, system requirements are formulated, and the target hardware
device is studied.
2. During the literature review, various algorithms are studied, and a variety of
hardware architectures presented in the literature are summarised.
3. Next, an algorithm is chosen and a draft hardware architecture is made.
This is the step requiring creativity and experience. Decoding throughput
and consumed hardware resources are roughly estimated to check whether
the architecture satisfies the system requirements.
4. Fixed-point modelling is done for the drafted architecture. Number of
bits for each variable is determined by ﬁxed-point simulation. Simu-
lation also validates the error correction performance, and the result
should be checked against system requirement. One must revert to ear-
lier steps to modify architecture, or even change to another algorithm,
if current one does not meet the system requirement. This step is done
with MATLAB.
5. Following MATLAB simulation, a detailed hardware implementation
speciﬁcation is made. The overall hardware entity is partitioned to sub
blocks. Interfaces of blocks and their communications between blocks
are formulated, and documented with diagrams and text.
6. VHDL coding is done according to the speciﬁcation. VHDL stands
for very-high-speed-integrated-circuit hardware description language,
which is used to describe functionality of a digital hardware entity.
7. VHDL models are simulated and debugged with ModelSim software.
8. VHDL model is veriﬁed by simulation using reference data. Reference
data are obtained from MATLAB simulation of the ﬁxed-point MAT-
LAB model. Output of VHDL simulation must agree with output of
MATLAB simulation.
9. The task of generating a hardware netlist from the VHDL model is called
synthesis. A netlist is a low-level description of digital electronic
hardware, specifying which elementary hardware resources are used and how
they are connected. For example, a netlist may specify how logic
gates are connected. The synthesis software used in this thesis work
is Synplify Pro. Given the synthesis result, more accurate estimates of
quantities such as hardware size and decoding throughput can be obtained.
Optimisation within the synthesis step can improve the result to a certain
degree. One must revert to earlier steps if the result does not meet the
requirements.
10. Other tasks remain in a complete implementation flow. For
example, the procedure of generating the final physical layout from the
netlist is called place and route, and it is followed by post-synthesis
optimisation. Synthesis in the wide sense also includes this step. This step
is not emphasised in the thesis work, as it is out of the scope of the thesis.

[Figure 4.1 plot: MATLAB fixed-point simulation for rates 1/2, 2/3, and 3/4
and block sizes z = 12, 24, 36, 48, and 96; 3-sigma quantisation interval,
biased, 8-bit variable sum messages, 5-bit input LLRs. Horizontal axis:
Eb/N0 (dB), 0 to 3.5. Vertical axis: codeword error ratio, 1.0E-05 to
1.0E+00. One curve per code, labelled from "r 1/2 z 12" to "r 3/4 z 96".]
Figure 4.1: Error correction performance
4.2 Error correction performance
A fixed-point model was developed and simulated using MATLAB. It is not
presented in this thesis due to lack of space. Error correction performance
is validated by MATLAB fixed-point simulation. The curves are shown in
Figure 4.1, which presents all 15 codes. For example, the legend r 1/2
z 12 refers to the code of rate 1/2 and block size 12.
The simulation collects 20 erroneous codewords at each signal-to-noise ratio
(SNR) point. SNR is measured as Eb/N0 in dB, where Eb is the energy per
information bit, and N0/2 = σ². The performance is measured in codeword error
ratio rather than bit error ratio.
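The relation between Eb/N0 and σ, and the stopping rule of 20 erroneous
codewords, can be sketched as below. The sketch assumes unit-energy BPSK
symbols, so that the energy per information bit is 1/r for a code of rate r;
this normalisation and the names noise_sigma and codeword_error_ratio are
assumptions made for illustration. The argument trial stands for one
hypothetical encode, transmit, and decode run that returns True when the
decoded codeword is wrong.

```python
import math
from typing import Callable


def noise_sigma(ebn0_db: float, rate: float) -> float:
    """Noise standard deviation for unit-energy BPSK symbols (assumption)."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    # Es = rate * Eb = 1  =>  N0 = 1 / (rate * Eb/N0), and sigma^2 = N0 / 2.
    return math.sqrt(1.0 / (2.0 * rate * ebn0))


def codeword_error_ratio(trial: Callable[[], bool], target_errors: int = 20) -> float:
    """Run trials until target_errors codeword errors have been collected."""
    errors = 0
    trials = 0
    while errors < target_errors:
        trials += 1
        if trial():
            errors += 1
    return errors / trials
```

Stopping at a fixed number of errors rather than a fixed number of trials keeps
the relative accuracy of the estimate similar across SNR points, at the cost of
longer runs where errors are rare.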
4.3 VHDL design, veriﬁcation, and synthesis
The implementation is described by multiple files in
very-high-speed-integrated-circuit hardware description language (VHDL [40]).
The VHDL code is written at fully synthesizable register transfer level (RTL).
The VHDL files are not presented in this thesis due to lack of space.
The VHDL design is verified by ModelSim simulation using test vectors
captured from the MATLAB fixed-point simulation. It is verified that the
outputs of the VHDL model and of the MATLAB fixed-point simulation are
exactly the same.
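The bit-exact check can be pictured as a comparison of two integer sequences.
The following sketch only illustrates the principle; the name first_mismatch
and the representation of the vectors as Python sequences are hypothetical,
and the file formats and tooling actually used for the comparison are not
described here.

```python
from typing import Optional, Sequence


def first_mismatch(reference: Sequence[int], observed: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing sample, or None if bit-exact."""
    for index, (ref, obs) in enumerate(zip(reference, observed)):
        if ref != obs:
            return index
    if len(reference) != len(observed):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(observed))
    return None
```

Reporting the first differing index, rather than only pass or fail, makes it
easier to locate the clock cycle at which the hardware model starts to diverge
from the reference.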
XC2VP50-5F1152 [39, 41], a Xilinx Virtex-II Pro FPGA chip, was selected
as the target device. Trial synthesis of the VHDL design was performed using
Synplify Pro with default settings. It is estimated that about half of the
logic resources on the XC2VP50-5F1152 chip are consumed, and that the clock
frequency is around 70 MHz.
4.4 Decoding throughput
For the architecture proposed in this thesis work, throughput varies
depending on channel noise, code rate, and codeword length. When the SNR is
low, more iterations are needed and throughput is low. When the number of
decoding iterations and the codeword length are fixed, a code with a higher
rate produces higher throughput. When the number of decoding iterations and
the code rate are fixed, a code with a larger block size produces higher
throughput.
block size (codeword length)   12 (576)   24 (1152)   36 (1728)   48 (2304)   96 (4608)
rate 1/2                       0.0755     0.1511      0.2266      0.3021      0.6042
rate 2/3                       0.1011     0.2023      0.3034      0.4045      0.8091
rate 3/4                       0.1119     0.2237      0.3356      0.4474      0.8949
Table 4.1: Decoding throughput (decoded bits per clock cycle, 20 iterations)
[Figure 4.2 plot: throughput at 20 iterations. Horizontal axis: block size
(12, 24, 36, 48, 96). Vertical axis: decoded bits per clock cycle, 0 to 0.9.
One curve each for rate 1/2, rate 2/3, and rate 3/4.]
Figure 4.2: Decoding throughput
In the worst case, 20 iterations are performed. The number of clock cycles
needed to run 20 iterations is counted in ModelSim simulation of the VHDL
design: codes of rate 1/2 require 3813 clock cycles, codes of rate 2/3 require
3797 clock cycles, and codes of rate 3/4 require 3862 clock cycles. Throughput
at 20 iterations is the ratio of the number of information bits in a codeword
to the number of clock cycles for 20 iterations. Throughput for the 15 codes
at 20 iterations, measured in decoded information bits per clock cycle, is
presented in Table 4.1 and plotted in Figure 4.2. In this worst case, the
highest throughput, about 0.9 information bits per clock cycle, is obtained by
the rate 3/4 code of block size 96; the lowest throughput, about 0.075
information bits per clock cycle, is obtained by the rate 1/2 code of block
size 12. Other rates and block sizes lead to throughput between
0.075 bits/cycle and 0.9 bits/cycle.
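As a cross-check, the entries of Table 4.1 can be recomputed from the cycle
counts quoted above. The sketch below assumes that the codeword length is 48
times the block size, as listed in Appendix A; the names
CYCLES_AT_20_ITERATIONS and throughput_bits_per_cycle are illustrative only.

```python
from fractions import Fraction

BASE_COLUMNS = 48  # codeword length is 48 * block size (see Appendix A)

# Clock cycles for 20 iterations, as counted in the ModelSim simulation.
CYCLES_AT_20_ITERATIONS = {
    Fraction(1, 2): 3813,
    Fraction(2, 3): 3797,
    Fraction(3, 4): 3862,
}


def throughput_bits_per_cycle(rate: Fraction, block_size: int) -> float:
    """Information bits per codeword divided by cycles for 20 iterations."""
    codeword_length = BASE_COLUMNS * block_size
    info_bits = codeword_length * rate  # integer-valued for the 15 codes
    return float(info_bits / CYCLES_AT_20_ITERATIONS[rate])


if __name__ == "__main__":
    for rate in CYCLES_AT_20_ITERATIONS:
        row = [f"{throughput_bits_per_cycle(rate, z):.4f}" for z in (12, 24, 36, 48, 96)]
        print(f"rate {rate}:", " ".join(row))
```

Because the cycle count depends only on the rate, throughput within one rate
grows in proportion to the block size, which is the trend visible in each row
of the table.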
Throughput increases when the channel is good and fewer iterations are
needed. For instance, when the number of iterations drops from 20 to 10,
throughput doubles to 0.15 to 1.8 bits per cycle.
Assuming a 70 MHz clock frequency, the worst-case throughput at 20 iterations
translates to throughput ranging from 5.2 Mbit/s to 63 Mbit/s (information
bits). If the number of iterations drops from 20 to 10 due to good
channel quality, the throughput doubles to reach 10 Mbit/s to 120
Mbit/s. The average throughput of a given code depends on the average number
of iterations as well as on the code parameters.
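The conversion to bits per second and the scaling with the iteration count can
be sketched as follows. The sketch assumes that the number of clock cycles is
proportional to the number of iterations, which ignores any fixed overhead per
codeword; the function name and its parameters are illustrative.

```python
def throughput_mbit_per_s(info_bits: int,
                          cycles_at_20_iterations: int,
                          clock_mhz: float = 70.0,
                          iterations: int = 20) -> float:
    """Information throughput in Mbit/s for a given iteration count.

    Assumption: cycles scale linearly with the number of iterations.
    """
    cycles = cycles_at_20_iterations * iterations / 20
    # bits per cycle times cycles per microsecond gives Mbit/s
    return info_bits / cycles * clock_mhz


slowest = throughput_mbit_per_s(288, 3813)    # rate 1/2, block size 12
fastest = throughput_mbit_per_s(3456, 3862)   # rate 3/4, block size 96
```

Calling the function with iterations=10 instead of the default doubles both
values under the stated proportionality assumption.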
Chapter 5
Conclusion
The results presented in Chapter 4 are discussed in this chapter. The results
are evaluated, and limitations and future work are pointed out.
The objectives of the thesis work are met successfully. The consumed hardware
resources are within the design constraint. The worst-case throughput at
20 decoding iterations is 0.075 to 0.9 information bits per clock cycle. The
corresponding throughput at a 70 MHz clock frequency is 5.2 Mbit/s to 63
Mbit/s. The decoder is real-time configurable such that any of the 15 codes
can be decoded. Comparing the performance curves in Figure 4.1 with the
reference figures in Appendix B shows that the performance degradation is
acceptably small. The design can be further improved for smaller hardware size
and higher throughput.
The class of LDPC codes is large, and there is no universal hardware
decoder suitable for many applications. The decoder architecture presented
in this thesis is constrained by the limited amount of FPGA resources, and
throughput is traded for hardware resources. The architecture assumes
codes like those defined in Appendix A, and it is not likely to fit
other codes that are very different.
Optimisation for speed and hardware cost has not yet been emphasised
in the thesis work. Improvements can be made in the following aspects. First,
the VHDL files can be re-synthesised with higher optimisation effort and with
synthesis techniques applied; currently the design is synthesised only with
default settings. Inserting registers into the critical path can increase the
clock frequency. However, a more effective approach is to modify part of the
logic design. For example, handshaking is used in the decoding data path. In
fact, the values of the handshaking signals can be determined at design time;
therefore, the handshaking can be removed. The removal shortens the current
critical path significantly and thus increases clock frequency and throughput
significantly. A second example is to reduce the amount of memory. For a check
node Cm of degree dm, in the current implementation, all dm check-to-variable
(C2V) messages updated by Cm are saved in the c2v memory. Those messages can
be compressed by applying the value reuse property (Section 2.6.2) as in [29],
so that a large amount of memory can be saved. The reduction in memory reduces
hardware size and increases clock frequency and throughput. Thirdly, some
inefficient logic design can be corrected to save hardware and increase clock
speed. For example, the shifter in the IO block performs post-shifting (refer
to Section 3.5), and the needed control bits are saved in a memory block in
the decoding core rather than locally in the IO block. Each time the decoding
core writes messages to an entry of the sum memory, it reads the control bits
from its local memory and writes them together with the messages to the target
entry in the sum memory. More efficiently, the storage of those control bits
can be moved to the IO block, so that the extra memory and data movement are
avoided.
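The memory saving from value reuse can be illustrated with a small sketch. It
assumes plain min-sum check node processing, in which the dm outgoing
magnitudes take only two distinct values; it is not the correction variant
implemented in the decoder nor the exact scheme of [29], and the names
compress_c2v and expand_c2v are hypothetical.

```python
from typing import List, Tuple


def compress_c2v(v2c: List[int]) -> Tuple[int, int, int, List[bool]]:
    """Summarise one check node (degree >= 2) as (min1, min2, idx1, signs)."""
    magnitudes = [abs(x) for x in v2c]
    idx1 = min(range(len(v2c)), key=magnitudes.__getitem__)
    min1 = magnitudes[idx1]
    min2 = min(m for i, m in enumerate(magnitudes) if i != idx1)
    signs = [x < 0 for x in v2c]  # True marks a negative input
    return min1, min2, idx1, signs


def expand_c2v(min1: int, min2: int, idx1: int, signs: List[bool]) -> List[int]:
    """Regenerate all check-to-variable messages from the compressed form."""
    parity = sum(signs) % 2  # 1 if the product of all input signs is negative
    messages = []
    for i, negative in enumerate(signs):
        # The edge holding the smallest input receives the second minimum.
        magnitude = min2 if i == idx1 else min1
        # Sign of the product of all other inputs: total parity XOR own sign.
        messages.append(-magnitude if parity ^ negative else magnitude)
    return messages
```

Under this assumption a check node stores two magnitudes, one index, and dm
sign bits instead of dm full messages. For the inputs [3, -1, 4, -2] the
compressed form expands to [1, -2, 1, -1].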
In summary, a partly parallel LDPC decoder hardware architecture was designed
and implemented successfully in this thesis work. Various design techniques
are reviewed in the thesis, and some new techniques are proposed.
The thesis objectives are met, and further improvements have been pointed out.
The architecture can be applied to similar codes. The implementation can be
reused for other codes which are not very different by merely updating the
contents of the read-only memories (ROMs). LDPC codes will be widely adopted,
and research and development on implementing decoders continues. This thesis
provides useful information for practical design tasks.
Appendix A
Base matrices for 15 LDPC codes
The base matrices for the codes of rate 1/2, 2/3, and 3/4 are given in
Figures A.1, A.2, and A.3, respectively. All 15 codes, corresponding to block
sizes of 12, 24, 36, 48, and 96, can be obtained by the modulo expansion
method described in Section 2.5.
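A possible reading of the expansion is sketched below for orientation; Section
2.5 remains the authoritative description. The sketch assumes that an entry of
-1 denotes a z × z all-zero block, that an entry p ≥ 0 denotes a z × z identity
matrix cyclically shifted by p mod z, and that the shift moves the ones to the
right. The shift direction and the name expand_base_matrix are assumptions.

```python
from typing import List


def expand_base_matrix(base: List[List[int]], z: int) -> List[List[int]]:
    """Expand a base matrix into a binary parity-check matrix (assumed rule)."""
    rows, cols = len(base), len(base[0])
    h = [[0] * (cols * z) for _ in range(rows * z)]
    for i, base_row in enumerate(base):
        for j, p in enumerate(base_row):
            if p < 0:
                continue  # -1 stands for an all-zero block
            shift = p % z  # modulo reduction of the shift for block size z
            for r in range(z):
                h[i * z + r][j * z + (r + shift) % z] = 1
    return h
```

With this rule a single base matrix per rate serves all five block sizes,
since only the reduced shift values change with z.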
The 15 codes are enumerated in the following list. For example, the first line
reads as: code 1 is an (n, k) = (576, 288) code of rate r = 1/2 with block size
z = 12 and codeword length n = 576.
1. (576, 288) code: rate 1/2, block size 12, word length 48x12= 576
2. (1152, 576) code: rate 1/2, block size 24, word length 48x24= 1152
3. (1728, 864) code: rate 1/2, block size 36, word length 48x36= 1728
4. (2304, 1152) code: rate 1/2, block size 48, word length 48x48= 2304
5. (4608, 2304) code: rate 1/2, block size 96, word length 48x96= 4608
6. (576, 384) code: rate 2/3, block size 12, word length 48x12= 576
7. (1152, 768) code: rate 2/3, block size 24, word length 48x24= 1152
0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 -1 -1 -1 -1 1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 0 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 31 -1 14 -1 81 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 16 84 -1 -1 39 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 2 -1 72 -1 78 73 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 17 -1 71 -1 -1 -1 68 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 13 -1 -1 -1 17 16 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 62 3 -1 -1 -1 -1 87 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
79 -1 -1 -1 -1 -1 -1 -1 -1 59 -1 -1 -1 -1 -1 -1 -1 -1 -1 92 2 -1 76 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 73 -1 -1 -1 -1 -1 -1 -1 83 -1 -1 23 -1 25 -1 90 -1 37 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 36 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 13 91 82 11 7 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 22 -1 51 -1 -1 -1 -1 -1 -1 65 19 -1 -1 -1 74 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 30 -1 -1 -1 13 -1 -1 -1 -1 -1 -1 -1 -1 93 -1 89 2 -1 -1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 55 -1 -1 -1 -1 -1 -1 37 30 -1 26 28 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 74 20 -1 -1 -1 67 35 -1 -1 89 25 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1
61 -1 76 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 66 -1 -1 50 -1 15 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 6 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 83 -1 -1 83 89 -1 77 36 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 43 -1 -1 -1 -1 -1 -1 45 -1 -1 -1 -1 65 -1 -1 67 56 -1 85 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 37 -1 -1 -1 -1 -1 30 -1 -1 -1 -1 84 43 35 -1 -1 87 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1
-1 21 90 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 20 31 -1 37 -1 12 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1
-1 -1 -1 92 -1 -1 -1 -1 -1 -1 56 65 -1 -1 -1 -1 -1 -1 -1 -1 58 69 -1 51 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1
-1 -1 -1 -1 57 -1 19 -1 -1 -1 -1 -1 -1 -1 -1 -1 28 -1 -1 5 -1 55 88 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1
-1 91 -1 -1 -1 -1 -1 22 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 35 2 -1 95 53 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0
-1 -1 -1 -1 -1 -1 -1 -1 45 -1 -1 -1 -1 -1 -1 85 -1 45 -1 -1 40 66 7 -1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0
Figure A.1: Base matrix, rate 1/2
0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 -1 -1 0 0 0 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 -1 -1 1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 57 -1 -1 0 0 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 0 0 0 -1 -1 -1 -1 -1 -1 0 -1 0 -1 -1 -1 -1 -1 1 88 -1 -1 -1 -1 -1 58 31 25 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 0 0 0 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 37 47 -1 -1 -1 14 4 -1 87 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 82 33 -1 69 89 55 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 29 -1 -1 -1 -1 10 -1 -1 -1 -1 -1 -1 74 81 -1 -1 -1 -1 36 -1 3 -1 -1 -1 -1 -1 10 73 -1 -1 92 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1
31 -1 -1 27 -1 -1 -1 -1 -1 -1 -1 -1 51 68 -1 43 54 -1 -1 -1 -1 -1 66 -1 -1 -1 -1 93 62 -1 14 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 -1 29 -1 -1 -1 46 -1 1 82 -1 -1 -1 7 -1 79 -1 -1 -1 -1 -1 -1 -1 -1 72 92 62 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1 81 -1 -1 -1 39 -1 -1 88 59 -1 -1 -1 -1 -1 -1 -1 -1 69 8 -1 -1 50 20 -1 -1 90 0 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1
-1 -1 18 -1 38 -1 -1 -1 58 -1 -1 -1 -1 74 29 -1 1 81 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 87 -1 84 63 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1
94 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 70 -1 -1 41 47 -1 -1 -1 -1 -1 -1 53 -1 -1 -1 85 31 -1 37 -1 76 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1
-1 15 -1 -1 -1 -1 -1 -1 -1 46 -1 -1 -1 81 -1 68 -1 -1 -1 -1 89 -1 -1 75 -1 -1 -1 4 44 9 62 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1
-1 -1 -1 -1 -1 29 -1 -1 75 -1 49 -1 -1 -1 6 71 -1 85 -1 -1 -1 -1 -1 -1 -1 7 -1 93 40 -1 47 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1
-1 -1 38 -1 73 -1 -1 -1 -1 -1 -1 -1 -1 84 6 -1 -1 -1 35 -1 -1 -1 -1 -1 -1 -1 57 24 -1 91 52 77 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1
-1 -1 -1 -1 -1 35 -1 18 -1 -1 -1 -1 -1 91 -1 17 -1 -1 -1 45 -1 -1 -1 -1 62 -1 -1 -1 -1 40 51 12 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0
-1 -1 -1 21 -1 -1 -1 -1 -1 1 -1 -1 34 -1 -1 80 -1 -1 -1 -1 -1 6 -1 -1 -1 47 -1 81 -1 54 -1 22 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0
Figure A.2: Base matrix, rate 2/3
0 0 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 0 0 -1 -1 -1 -1 -1 -1 0 0 -1 0 0 -1 -1 -1 -1 -1 -1 -1 0 0 3 1 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 0 -1 -1 -1 -1 0 0 78 -1 -1 59 58 0 -1 -1 -1 -1 85 -1 24 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1
29 -1 -1 -1 0 0 -1 -1 -1 -1 0 -1 -1 -1 -1 -1 -1 0 0 0 -1 0 31 78 -1 -1 -1 -1 -1 76 82 -1 -1 -1 38 51 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1
-1 -1 85 -1 -1 -1 0 0 -1 -1 82 0 0 -1 -1 -1 -1 -1 -1 -1 0 -1 54 36 -1 -1 -1 -1 -1 -1 -1 33 45 91 56 16 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1
-1 31 -1 -1 -1 -1 -1 -1 0 0 -1 86 -1 -1 91 25 -1 -1 20 -1 -1 29 -1 39 -1 10 18 -1 -1 -1 -1 -1 -1 45 72 38 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1
-1 -1 27 -1 -1 -1 44 -1 -1 -1 50 55 -1 32 -1 -1 -1 -1 -1 9 -1 -1 4 18 -1 -1 -1 -1 -1 27 -1 2 -1 46 21 68 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1
-1 42 -1 -1 -1 21 -1 -1 -1 -1 46 51 -1 -1 -1 -1 73 60 -1 -1 -1 33 76 79 48 -1 -1 57 -1 -1 -1 -1 -1 38 94 -1 0 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1
-1 -1 -1 8 67 -1 -1 -1 -1 82 28 34 2 -1 -1 37 -1 -1 -1 -1 -1 15 -1 23 -1 -1 -1 -1 -1 16 3 -1 -1 37 77 15 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1
70 -1 -1 -1 -1 -1 -1 77 -1 -1 11 20 -1 -1 67 -1 -1 -1 -1 -1 78 88 81 54 -1 -1 -1 3 -1 -1 -1 -1 1 -1 0 52 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1
-1 -1 -1 -1 -1 94 -1 -1 35 -1 32 29 -1 15 -1 -1 -1 2 -1 -1 -1 69 20 42 -1 -1 95 -1 10 -1 -1 39 -1 91 -1 67 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1
-1 -1 -1 65 -1 -1 -1 66 -1 26 62 7 -1 -1 -1 -1 -1 -1 84 29 -1 35 85 -1 -1 84 -1 -1 -1 -1 34 -1 49 14 31 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0
-1 -1 -1 -1 21 -1 48 -1 88 -1 -1 70 -1 -1 -1 -1 44 -1 -1 -1 23 41 61 91 35 -1 -1 -1 2 -1 -1 -1 -1 4 42 35 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0
Figure A.3: Base matrix, rate 3/4
8. (1728, 1152) code: rate 2/3, block size 36, word length 48x36= 1728
9. (2304, 1536) code: rate 2/3, block size 48, word length 48x48= 2304
10. (4608, 3072) code: rate 2/3, block size 96, word length 48x96= 4608
11. (576, 432) code: rate 3/4, block size 12, word length 48x12= 576
12. (1152, 864) code: rate 3/4, block size 24, word length 48x24= 1152
13. (1728, 1296) code: rate 3/4, block size 36, word length 48x36= 1728
14. (2304, 1728) code: rate 3/4, block size 48, word length 48x48= 2304
15. (4608, 3456) code: rate 3/4, block size 96, word length 48x96= 4608
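The parameters in the list follow from the base matrix dimensions: every base
matrix has 48 columns, so n = 48z, and k = rn. The short sketch below
enumerates the 15 codes under exactly this relation; the name code_parameters
is illustrative.

```python
from fractions import Fraction
from typing import List, Tuple

BASE_COLUMNS = 48
RATES = (Fraction(1, 2), Fraction(2, 3), Fraction(3, 4))
BLOCK_SIZES = (12, 24, 36, 48, 96)


def code_parameters() -> List[Tuple[int, int, Fraction, int]]:
    """Return (n, k, rate, block size) for the 15 codes, in list order."""
    codes = []
    for rate in RATES:
        for z in BLOCK_SIZES:
            n = BASE_COLUMNS * z
            k = int(n * rate)  # exact: n is divisible by 2, 3, and 4
            codes.append((n, k, rate, z))
    return codes


if __name__ == "__main__":
    for number, (n, k, rate, z) in enumerate(code_parameters(), start=1):
        print(f"{number}. ({n}, {k}) code: rate {rate}, block size {z}")
```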
Appendix B
Reference performance
Figures B.1, B.2, and B.3 are provided as a decoding performance reference;
they were supplied by the project partner that designed the codes. An AWGN
channel and BPSK modulation are used. Decoding is done by the standard belief
propagation algorithm with 50 iterations, and the data are obtained through
floating-point simulation. The hardware implementation uses a less optimal
algorithm and is subject to fixed-point effects, hence the performance of the
hardware implementation is degraded. The figures can be found on pages 65 and
66 of [42].
Figure B.1: Codeword error ratio, rate 1/2, BPSK, AWGN
Figure B.2: Codeword error ratio, rate 2/3, BPSK, AWGN
Figure B.3: Codeword error ratio, rate 3/4, BPSK, AWGN
Bibliography
[1] C. Shannon, "A mathematical theory of communication," The Bell System
Technical Journal, vol. 27, pp. 379-423, 623-656, 1948.
[2] G. D. Forney and D. J. Costello, "Channel Coding: The Road to Channel
Capacity," Proceedings of the IEEE, vol. 95, no. 6, pp. 1150-1177, June
2007.
[3] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit
error-correcting coding and decoding: Turbo-codes. 1," in IEEE
International Conference on Communications ICC 93. Geneva. Technical
Program, Conference Record, vol. 2, 23-26 May 1993, pp. 1064-1070.
[4] R. G. Gallager, "Low-density parity-check codes," Ph.D. dissertation,
MIT, Cambridge, MA, 1963.
[5] D. J. C. MacKay and R. M. Neal, "Near Shannon limit performance of
low density parity check codes," Electronics Letters, vol. 33, no. 6, pp.
457-458, 13 March 1997.
[6] D. J. C. MacKay, "Good error-correcting codes based on very sparse
matrices," IEEE Transactions on Information Theory, vol. 45, no. 2,
pp. 399-431, March 1999.
[7] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, "Design of
capacity-approaching irregular low-density parity-check codes," IEEE
Transactions on Information Theory, vol. 47, no. 2, pp. 619-637, Feb
2001.
[8] S.-Y. Chung, G. D. Forney, Jr., T. J. Richardson, and R. Urbanke, "On
the design of low-density parity-check codes within 0.0045 dB of the
Shannon limit," IEEE Communications Letters, vol. 5, no. 2, pp. 58-60,
Feb 2001.
[9] K. Gracie and M. H. Hamon, "Turbo and Turbo-Like Codes: Principles
and Applications in Telecommunications," Proceedings of the IEEE,
vol. 95, no. 6, pp. 1228-1254, June 2007.
[10] A. J. Blanksby and C. J. Howland, "A 690-mW 1-Gb/s 1024-b,
rate-1/2 low-density parity-check code decoder," IEEE Journal of Solid-State
Circuits, vol. 37, no. 3, pp. 404-412, March 2002.
[11] R. Tanner, "A recursive approach to low complexity codes," IEEE
Transactions on Information Theory, vol. 27, no. 5, pp. 533-547, Sep 1981.
[12] T. J. Richardson and R. L. Urbanke, "Efficient encoding of low-density
parity-check codes," IEEE Transactions on Information Theory, vol. 47,
no. 2, pp. 638-656, Feb 2001.
[13] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, "Factor graphs and the
sum-product algorithm," IEEE Transactions on Information Theory,
vol. 47, no. 2, pp. 498-519, Feb 2001.
[14] G. D. Forney, "Codes on graphs: normal realizations," IEEE
Transactions on Information Theory, vol. 47, no. 2, pp. 520-548, Feb 2001.
[15] K. S. Andrews, D. Divsalar, S. Dolinar, J. Hamkins, C. R. Jones, and
F. Pollara, "The Development of Turbo and LDPC Codes for Deep-Space
Applications," Proceedings of the IEEE, vol. 95, no. 11, pp. 2142-2156,
Nov. 2007.
[16] D. E. Hocevar, "LDPC code construction with flexible hardware
implementation," in Proc. IEEE International Conference on
Communications ICC '03, vol. 4, 11-15 May 2003, pp. 2708-2712.
[17] H. Zhong and T. Zhang, "Design of VLSI implementation-oriented
LDPC codes," in Proc. VTC 2003-Fall Vehicular Technology Conference
2003 IEEE 58th, vol. 1, 6-9 Oct. 2003, pp. 670-673.
[18] H. Zhong and T. Zhang, "Block-LDPC: a practical LDPC coding system
design approach," IEEE Transactions on Circuits and Systems I: Regular
Papers, vol. 52, no. 4, pp. 766-775, April 2005.
[19] M. P. C. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity
iterative decoding of low-density parity check codes based on belief
propagation," IEEE Transactions on Communications, vol. 47, no. 5,
pp. 673-680, May 1999.
[20] J. Chen, A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, and X.-Y. Hu,
"Reduced-Complexity Decoding of LDPC Codes," IEEE Transactions
on Communications, vol. 53, no. 8, pp. 1288-1299, Aug. 2005.
[21] J. Chen and M. P. C. Fossorier, "Near optimum universal belief
propagation based decoding of low-density parity check codes," IEEE
Transactions on Communications, vol. 50, no. 3, pp. 406-414, March 2002.
[22] M. Karkooti, P. Radosavljevic, and J. R. Cavallaro, "Configurable, High
Throughput, Irregular LDPC Decoder Architecture: Tradeoff Analysis
and Implementation," in Proc. International Conference on
Application-specific Systems, Architectures and Processors ASAP '06,
Sept. 2006, pp. 360-367.
[23] Y. Sun, M. Karkooti, and J. R. Cavallaro, "High Throughput, Parallel,
Scalable LDPC Encoder/Decoder Architecture for OFDM Systems," in
Proc. IEEE Dallas/CAS Workshop on Design, Applications, Integration
and Software, Oct. 2006, pp. 39-42.
[24] K. K. Gunnam, G. S. Choi, W. Wang, E. Kim, and M. B. Yeary, "Decoding
of Quasi-cyclic LDPC Codes Using an On-the-Fly Computation," in
Proc. Fortieth Asilomar Conference on Signals, Systems and Computers
ACSSC '06, Oct. 29 2006-Nov. 1 2006, pp. 1192-1199.
[25] K. K. Gunnam, G. S. Choi, and M. B. Yeary, "A Parallel VLSI
Architecture for Layered Decoding for Array LDPC Codes," in Proc. th
International Conference on VLSI Design Held jointly with 6th
International Conference on Embedded Systems, 6-10 Jan. 2007, pp. 738-743.
[26] K. Gunnam, W. Wang, G. Choi, and M. Yeary, "VLSI Architectures for
Turbo Decoding Message Passing Using Min-Sum for Rate-Compatible
Array LDPC Codes," in Proc. 2nd International Symposium on Wireless
Pervasive Computing ISWPC '07, 5-7 Feb. 2007, Digital Object Identifier
10.1109/ISWPC.2007.34 2007.
[27] Z. Wang and Z. Cui, "A Memory Efficient Partially Parallel Decoder
Architecture for Quasi-Cyclic LDPC Codes," IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 15, no. 4, pp. 483-488,
April 2007.
[28] K. Gunnam, G. Choi, W. Wang, and M. Yeary, "Multi-Rate Layered
Decoder Architecture for Block LDPC Codes of the IEEE 802.11n Wireless
Standard," in Proc. IEEE International Symposium on Circuits and
Systems ISCAS 2007, 27-30 May 2007, pp. 1645-1648.
[29] K. K. Gunnam, G. S. Choi, M. B. Yeary, and M. Atiquzzaman,
"VLSI Architectures for Layered Decoding for Irregular LDPC Codes of
WiMax," in Proc. IEEE International Conference on Communications
ICC '07, 24-28 June 2007, pp. 4542-4547.
[30] M. M. Mansour and N. R. Shanbhag, "Turbo decoder architectures for
low-density parity-check codes," in Proc. IEEE Global Telecommunications
Conference GLOBECOM '02, vol. 2, 17-21 Nov. 2002, pp. 1383-1388.
[31] M. M. Mansour and N. R. Shanbhag, "High-throughput LDPC decoders,"
IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 11, no. 6, pp. 976-996, Dec. 2003.
[32] D. E. Hocevar, "A reduced complexity decoder architecture via layered
decoding of LDPC codes," in Proc. IEEE Workshop on Signal Processing
Systems SIPS 2004, 2004, pp. 107-112.
[33] Y. Chen and D. Hocevar, "A FPGA and ASIC implementation of rate
1/2, 8088-b irregular low density parity check decoder," in Proc. IEEE
Global Telecommunications Conference GLOBECOM '03, vol. 1, 1-5
Dec. 2003, pp. 113-117.
[34] M. Karkooti and J. R. Cavallaro, "Semi-parallel reconfigurable
architectures for real-time LDPC decoding," in Proc. International Conference
on Information Technology: Coding and Computing ITCC 2004, vol. 1,
2004, pp. 579-585.
[35] Y. Chen and K. K. Parhi, "Overlapped message passing for quasi-cyclic
low-density parity check codes," IEEE Transactions on Circuits and
Systems I: Regular Papers, vol. 51, no. 6, pp. 1106-1113, June 2004.
[36] Z. Wang and Z. Cui, "Low-Complexity High-Speed Decoder Design for
Quasi-Cyclic LDPC Codes," IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 15, no. 1, pp. 104-114, Jan. 2007.
[37] Z. Cui and Z. Wang, "Efficient Message Passing Architecture for High
Throughput LDPC Decoder," in Proc. IEEE International Symposium
on Circuits and Systems ISCAS 2007, 27-30 May 2007, pp. 917-920.
[38] X.-Y. Shih, C.-Z. Zhan, C.-H. Lin, and A.-Y. Wu, "An 8.29 mm² 52
mW Multi-Mode LDPC Decoder Design for Mobile WiMAX System in
0.13 um CMOS Process," IEEE Journal of Solid-State Circuits, vol. 43,
no. 3, pp. 672-683, March 2008.
[39] Xilinx, Virtex-II Pro and Virtex-II Pro X FPGA User Guide, March
2007.
[40] P. J. Ashenden, The Designer's Guide to VHDL. Morgan Kaufmann
Publishers, 1995.
[41] Xilinx, Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete
Data Sheet, 2007.
[42] S. Stiglmayr, "IST-4-027756 WINNER II D2.2.1-v1.0 Joint Modulation
and Coding Procedures," Tech. Rep., 2006. [Online]. Available:
http://www.ist-winner.org/index.html
