Analysis and Design of Cost-Effective, High-Throughput LDPC Decoders by Nguyen-Ly, Thien Truong et al.
Analysis and Design of Cost-Effective,
High-Throughput LDPC Decoders
Thien Truong Nguyen-Ly, Valentin Savin, Khoa Le, David Declercq, Fakhreddine Ghaffari, and Oana Boncalo
Abstract—This paper introduces a new approach to cost-
effective, high-throughput hardware designs for Low Density
Parity Check (LDPC) decoders. The proposed approach, called
Non-Surjective Finite Alphabet Iterative Decoders (NS-FAIDs),
exploits the robustness of message-passing LDPC decoders to
inaccuracies in the calculation of exchanged messages, and it
is shown to provide a unified framework for several designs
previously proposed in the literature. NS-FAIDs are optimized
by density evolution for regular and irregular LDPC codes,
and are shown to provide different trade-offs between hardware
complexity and decoding performance. Two hardware architec-
tures targeting high-throughput applications are also proposed,
integrating both Min-Sum (MS) and NS-FAID decoding kernels.
ASIC post synthesis implementation results on 65nm CMOS
technology show that NS-FAIDs yield significant improvements
in the throughput to area ratio, by up to 58.75% with respect to
the MS decoder, with even better or only slightly degraded error
correction performance.
Index Terms—Error correction, LDPC codes, NS-FAID, low
cost, high-throughput.
I. INTRODUCTION
The increasing demand for massive data rates in wireless communication systems will require significantly higher
processing speed of the baseband signal, as compared to con-
ventional solutions. This is especially challenging for Forward
Error Correction (FEC) mechanisms, since FEC decoding is
one of the most computationally intensive baseband processing
tasks, consuming a large amount of hardware resources and
energy. The use of very large bandwidths will also result in
stringent, application-specific, requirements in terms of both
throughput and latency. The conventional approach to increase
throughput is to use massively parallel architectures. In this
context, Low-Density Parity-Check (LDPC) codes are recog-
nized as the foremost solution, due to the intrinsic capacity of
their decoders to accommodate various degrees of parallelism.
They have found extensive applications in modern communi-
cation systems, due to their excellent decoding performance,
high throughput capabilities [1]–[4], and power efficiency [5],
[6], and have been adopted in several recent communication
standards.
Thien Truong Nguyen-Ly is with CEA-LETI, MINATEC Campus, Greno-
ble, France, and ETIS ENSEA / UCP / CNRS UMR-8051, Cergy-Pontoise,
France (e-mail: thientruong.nguyen-ly@cea.fr).
Valentin Savin is with CEA-LETI, MINATEC Campus, Grenoble, France
(e-mail: valentin.savin@cea.fr).
Khoa Le, David Declercq, and Fakhreddine Ghaffari are with
ETIS ENSEA / UCP / CNRS UMR-8051, Cergy-Pontoise, France (e-mail:
{khoa.letrung, ghaffari, declercq}@ensea.fr).
Oana Boncalo is with the Computers and Information Technology
Department of the University Politehnica Timisoara, Romania (e-mail:
boncalo@cs.upt.ro).
This paper targets the design of cost-effective, high-
throughput LDPC decoders. One important characteristic of
LDPC decoders is that the memory and interconnect blocks
dominate the overall area/delay/power performance of the
hardware design [7]. To address this issue, we build upon
the concept of Finite Alphabet Iterative Decoders (FAIDs),
introduced in [8]–[10]. While FAIDs have been previously
investigated for variable-node regular LDPC codes over the
binary symmetric channel, this paper extends their use to any
channel model, and to both regular and irregular LDPC codes.
The approach considered in this paper, referred to as Non-
Surjective FAIDs (NS-FAIDs), is to allow storing the ex-
changed messages using a lower precision (smaller number
of bits) than that used by the processing units. The basic idea
is to reduce the size of the exchanged messages, once they
have been updated by the processing units. Hence, to some
extent, the proposed approach is akin to the use of imprecise
storage, which is seen as an enabler for cost and throughput
optimizations. Moreover, NS-FAIDs are shown to provide a
unified framework for several designs previously proposed in
the literature, including the normalized and offset Min-Sum
(MS) decoders [11], [12], the partially offset MS decoder [13],
the MS-based decoders proposed in [14], [15], or the recently
introduced dual-quantization domain MS decoder [16].
This paper refines and extends some of the concepts we
previously introduced in [17], [18]. In particular, the definition
of NS-FAIDs [17] is extended so as to cover a larger class
of decoders, which is shown to significantly improve the
decoding performance when the exchanged messages are
quantized on a small number of bits (e.g., 2 bits per exchanged
message). We show that NS-FAIDs can be optimized by
using the Density Evolution (DE) technique, so as to obtain
the best possible decoding performance for given hardware
constraints, expressed in terms of memory size reduction. The
DE optimization is illustrated for both regular and irregular
LDPC codes, for which we propose a number of NS-FAIDs
with different trade-offs between hardware complexity and
decoding performance.
To assess the benefits of the NS-FAID approach, we further
extend the hardware architectures proposed in [18] to cover
the case of irregular codes, and provide implementation results
targeting an ASIC technology, which is more likely to reflect
the benefits of the proposed NS-FAID approach in terms of
throughput/area trade-off. The proposed architectures target
high-throughput and efficient use of the hardware resources.
Both architectures implement layered decoding with fully
parallel processing units. The first architecture is pipelined,
so as to increase throughput and ensure an efficient use of the
hardware resources, which in turn imposes specific constraints
on the decoding layers¹, in order to ensure proper execution
of the layered decoding process. The second architecture does
not make use of pipelining, but allows maximum parallelism
to be exploited through the use of full decoding layers², thus
resulting in significant increase in throughput. Both MS and
NS-FAID decoding kernels are integrated into each of the
two proposed architectures, and compared in terms of area
and throughput. ASIC post synthesis implementation results
on 65nm CMOS technology show a throughput to area ratio
improvement by up to 58.75%, when the NS-FAID kernel
is used, with even better or only slightly degraded error
correction performance.
The rest of the paper is organized as follows. NS-FAIDs are
introduced in Section II, which also discusses their expected
implementation benefits and the DE analysis. The optimization
of regular and irregular NS-FAIDs is presented in Section III.
The proposed hardware architectures, with both MS and NS-
FAID decoding kernels, are discussed in Section IV. Numeri-
cal results are provided in Section V, and Section VI concludes
the paper.
II. NON-SURJECTIVE FINITE ALPHABET ITERATIVE
DECODERS
A. Preliminaries
LDPC codes are defined by sparse bipartite graphs, com-
prising a set of variable-nodes (VNs), corresponding to coded
bits, and a set of check-nodes (CNs), corresponding to parity-
check equations. Finite Alphabet Iterative Decoders (FAIDs)
are message-passing LDPC decoders that have been introduced
in [8]–[10]. We state below the definition of a subclass of
FAID decoders, which is less general than the one proposed
in [10]. Let Q be a positive integer. A (2Q + 1)-level FAID
is a 4-tuple (M,Γ,Φv,Φc), where:
• M = {−Q, . . . ,−1, 0,+1, . . . ,+Q} is the alphabet of
the exchanged messages, and is also referred to as the
decoder alphabet,
• Γ ⊆M is the input alphabet of the decoder, i.e., the set
of all possible values of the quantized soft information
supplied to the decoder,
• Φv and Φc denote the update rules for VNs and CNs,
respectively.
We shall use m ∈ M and γ ∈ Γ to denote elements of M
and Γ, respectively. The CN-update function Φc is the same
for any FAID decoder, and is equal to the update function
used by the MS decoder. Precisely, for a CN of degree dc, the
update function Φc : M^{dc−1} → M is given by:
$$\Phi_c(m_1, \dots, m_{d_c-1}) = \left( \prod_{i=1}^{d_c-1} \mathrm{sgn}(m_i) \right) \min_{i=1,\dots,d_c-1} |m_i| \qquad (1)$$
¹ A decoding layer may consist of one or several rows of the base matrix
of the QC-LDPC code, assuming that they do not overlap.
² A decoding layer is said to be full if each column of the base matrix has
one non-negative entry in one of the rows composing the layer.
The VN-update function Φv : Γ × M^{dv−1} → M, for a VN
of degree dv, is defined as:
$$\Phi_v(\gamma, m_1, \dots, m_{d_v-1}) = F\left( \gamma + \sum_{j=1}^{d_v-1} m_j \right) \qquad (2)$$
where the function F : Z → M is defined based on a set of
threshold values T = {T_0, T_1, . . . , T_{Q+1}} ⊂ R̄+, with T_0 = 0,
T_{Q+1} = +∞, and T_i < T_j for any i < j:
$$F(x) = \mathrm{sgn}(x)\, m, \quad \text{where } m \text{ is s.t. } T_m \le |x| < T_{m+1} \qquad (3)$$
In Eq. (2), the variable γ represents the channel contribution,
i.e., the quantized soft information that has been supplied to the
decoder for the corresponding variable-node. The quantization
method and its impact on FAIDs’ decoding performance will
be discussed in Section II-E. It is also worth noting that in [10],
FAIDs are introduced with a more general VN-update function
Φv , but the simpler definition (2) that we use has many
hardware implementation benefits, which will be described
in later sections. Moreover, it can be easily seen that any
non-decreasing odd function satisfies Eq. (3). Precisely, the
following proposition holds.
Proposition 1: For any function F : Z→M, there exists a
threshold set T such that F is given by Eq. (3), if and only
if F satisfies the following two properties:
(i) F is an odd function, i.e., F (−x) = −F (x), ∀x ∈ Z
(ii) F is non-decreasing, i.e., F (x) ≤ F (y) for any x < y.
We note that the above proposition also implies that F (0) = 0
and F (x) ≥ 0,∀x > 0. In this paper, we further extend the
definition of FAIDs by allowing F (0) to take on non-zero
values. To ensure symmetry of the decoder, we shall write
F (0) = ±λ, with λ ≥ 0, meaning that F (0) takes on either
−λ or +λ with equal probability. In the following, F will be
referred to as framing function.
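As an illustration of Eq. (3) and Proposition 1, the following minimal sketch builds F from a threshold set and checks the two properties numerically. The helper name `framing_from_thresholds` and the threshold values are hypothetical, chosen only for this example; the extension F(0) = ±λ is not covered here.

```python
import bisect

def framing_from_thresholds(thresholds):
    """Return F(x) = sgn(x) * m, where T_m <= |x| < T_{m+1} (Eq. (3)).

    `thresholds` holds the finite values [T_0, ..., T_Q] with T_0 = 0;
    T_{Q+1} = +infinity is implicit.
    """
    def F(x):
        # Index of the last threshold that is <= |x|
        m = bisect.bisect_right(thresholds, abs(x)) - 1
        return m if x >= 0 else -m
    return F

# Hypothetical threshold set for Q = 3 (illustrative values only)
F = framing_from_thresholds([0, 1, 3, 6])

# Numerical check of Proposition 1 on a finite range of integers
is_odd = all(F(-x) == -F(x) for x in range(-10, 11))
is_non_decreasing = all(F(x) <= F(x + 1) for x in range(-10, 10))
```

Any strictly increasing threshold list starting at 0 can be passed in; the two checks should hold for all of them, which is the "if" direction of the proposition.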
As the focus of this work is on practical implementations,
we will further assume that the sum γ + ∑_{j=1}^{dv−1} m_j in Eq. (2) is
saturated to M, prior to applying F on it. Consequently, in the
sequel we shall only consider framing functions F : M → M,
and the VN-update function Φv from Eq. (2) is redefined as:
$$\Phi_v(\gamma, m_1, \dots, m_{d_v-1}) = F\left( s_M\left( \gamma + \sum_{j=1}^{d_v-1} m_j \right) \right) \qquad (4)$$
where sM : Z → M, sM(x) = sgn(x) min(|x|, Q), is the
saturation function. Since F (−m) = −F (m),∀m ∈ M, F is
completely determined by the vector [|F (0)|, F (1), ..., F (Q)],
further referred to as the Look-Up Table (LUT) of F , which
satisfies the following inequalities (Proposition 1):
0 ≤ |F (0)| ≤ F (1) ≤ · · · ≤ F (Q) ≤ Q (5)
Summarizing, the subclass of FAIDs considered in this
paper is defined by Eq. (4), where F is a framing function
satisfying Eq. (5). Furthermore, for any integer q > 0, the
expression q-bit FAID is used to refer to a (2Q + 1)-level
FAID, with Q = 2^{q−1} − 1. It follows that messages exchanged
within the FAID decoder are q-bit messages (including 1 bit
for the sign). Finally, the message-passing iterative decoding
process for a FAID with framing function F is depicted in
Algorithm 1.
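The update rules of Eqs. (1) and (4) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the framing function is passed as its LUT [|F(0)|, F(1), ..., F(Q)], and the random sign used for F(0) = ±λ is drawn with Python's `random` module as an assumption.

```python
import random

def sgn(x):
    return (x > 0) - (x < 0)

def q_to_Q(q):
    """Largest message magnitude of a q-bit FAID: Q = 2^(q-1) - 1."""
    return 2 ** (q - 1) - 1

def phi_c(msgs):
    """CN update, Eq. (1): product of signs times minimum magnitude."""
    sign = 1
    for m in msgs:
        sign *= sgn(m)
    return sign * min(abs(m) for m in msgs)

def saturate(x, Q):
    """Saturation function s_M(x) = sgn(x) * min(|x|, Q)."""
    return sgn(x) * min(abs(x), Q)

def phi_v(gamma, msgs, lut, rng=random):
    """VN update, Eq. (4): framing function applied to the saturated sum."""
    Q = len(lut) - 1
    s = saturate(gamma + sum(msgs), Q)
    if s == 0:
        # F(0) = +/- lambda with equal probability (gives 0 when lut[0] == 0)
        return rng.choice((-1, 1)) * lut[0]
    return sgn(s) * lut[abs(s)]

F1 = [0, 1, 1, 3, 3, 3, 7, 7]   # framing function F1 from Table I
MS = list(range(8))             # identity LUT, i.e., 4-bit MS decoder
```

Passing the identity LUT recovers the finite-alphabet MS update, so the same two functions serve both decoding kernels.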
Algorithm 1 FAID decoding with framing function F
Input: y = (y1, . . . , yN) ▷ received word
Output: x̂ = (x̂1, . . . , x̂N) ▷ estimated codeword
Initialization
  for all n = 1, . . . , N do γn = quant(LLR(xn | yn));
  for all m = 1, . . . , M and n ∈ H(m) do βm,n = 0;
Iteration Loop
  for all n = 1, . . . , N and m ∈ H(n) do ▷ VN-processing
    αm,n = Φv(γn, βm′,n | m′ ∈ H(n) \ m);
  for all m = 1, . . . , M and n ∈ H(m) do ▷ CN-processing
    βm,n = Φc(αm,n′ | n′ ∈ H(m) \ n);
  for all n = 1, . . . , N do ▷ AP-update
    γ̃n = γn + ∑_{m∈H(n)} βm,n;
  for all n = 1, . . . , N do x̂n = sign(γ̃n); ▷ hard decision
  if x̂ is a codeword then exit the iteration loop ▷ syndrome check
End Iteration Loop
Notation: H – bipartite graph of the LDPC code, with N VNs and M CNs;
H(n) – set of CNs connected to VN n; H(m) – set of VNs connected
to CN m; αm,n – message from VN n to CN m; βm,n – message from
CN m to VN n; γn and γ˜n – a priori and a posteriori LLR of VN n;
quant – quantization map (see also Section II-E).
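A minimal software sketch of Algorithm 1 is given below, assuming a flooding schedule, a dense 0/1 parity-check matrix given as a list of rows, and input LLRs that are already quantized. The function name `faid_decode`, the mapping of a zero a posteriori LLR to bit 0, and the toy parity-check matrix are assumptions made for illustration only.

```python
import random

def sgn(x):
    return (x > 0) - (x < 0)

def faid_decode(H, gamma, lut, max_iter=20, rng=random):
    """Flooding FAID decoding; lut = [|F(0)|, F(1), ..., F(Q)]."""
    Q = len(lut) - 1
    M, N = len(H), len(H[0])
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]  # H(m)
    cols = [[m for m in range(M) if H[m][n]] for n in range(N)]  # H(n)
    beta = {(m, n): 0 for m in range(M) for n in rows[m]}
    x_hat = [0] * N
    for _ in range(max_iter):
        # VN-processing, Eq. (4)
        alpha = {}
        for n in range(N):
            for m in cols[n]:
                s = gamma[n] + sum(beta[mp, n] for mp in cols[n] if mp != m)
                s = sgn(s) * min(abs(s), Q)
                if s == 0:
                    alpha[m, n] = rng.choice((-1, 1)) * lut[0]
                else:
                    alpha[m, n] = sgn(s) * lut[abs(s)]
        # CN-processing, Eq. (1); every CN is assumed to have degree >= 2
        for m in range(M):
            for n in rows[m]:
                others = [alpha[m, k] for k in rows[m] if k != n]
                sign = 1
                for a in others:
                    sign *= sgn(a)
                beta[m, n] = sign * min(abs(a) for a in others)
        # AP-update and hard decision (positive LLR <-> bit 0; ties -> 0)
        ap = [gamma[n] + sum(beta[m, n] for m in cols[n]) for n in range(N)]
        x_hat = [0 if v >= 0 else 1 for v in ap]
        # Syndrome check
        if all(sum(x_hat[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            break
    return x_hat

# Toy example: (7,4) Hamming parity-check matrix, all-zero codeword sent,
# first input LLR weakly in error.
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]
gamma = [-1, 3, 3, 3, 3, 3, 3]
decoded = faid_decode(H, gamma, lut=list(range(8)))
```

The Hamming matrix is not a sparse LDPC matrix; it is used only because it is small enough to trace by hand.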
B. Non-Surjective FAIDs
The finite alphabet MS decoder is a particular example of
FAID, with framing function F : M → M being the identity
function. Using Eq. (5) it can be easily verified that this is the
only FAID for which the framing function F is surjective (or
equivalently bijective, since M is finite). For any other framing
function F, there exists at least one element of M, which is
not in the image of F. The class of FAIDs defined by non-
surjective framing functions³ is investigated in this section.
Definition 2: The weight of a framing function F : M →
M, denoted by W, is the number of distinct entries in the
vector [|F(0)|, F(1), ..., F(Q)]. It follows that 1 ≤ W ≤ Q + 1.
By a slight abuse of terminology, we shall also refer to W
as the weight of the NS-FAID. We further define the framing
bit-length as w = ⌈log2(W)⌉ + 1.
Definition 3: A non-surjective FAID (NS-FAID) is a FAID
of weight W < Q+ 1. Hence the framing function F is non-
surjective, meaning that the image set of F is a strict subset
of M.
Table I provides two examples of q = 4-bit NS-FAIDs
(hence Q = 7), both of which are of weight W = 4, hence
framing bit-length w = ⌈log2 4⌉ + 1 = 3. Note that F1 maps 0
to 0, while F2 maps 0 to ±1. The image sets of F1 and F2 are
Im(F1) = {0, ±1, ±3, ±7} and Im(F2) = {±1, ±3, ±4, ±7}.
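The weight, framing bit-length, and image set of the two framing functions of Table I can be checked with a few lines of code. The helper names below are hypothetical, and each framing function is represented by its LUT [|F(0)|, F(1), ..., F(Q)].

```python
from math import ceil, log2

def weight(lut):
    """W: number of distinct entries of [|F(0)|, F(1), ..., F(Q)]."""
    return len(set(lut))

def framing_bit_length(lut):
    """w = ceil(log2(W)) + 1, the extra bit being the sign bit."""
    return ceil(log2(weight(lut))) + 1

def image(lut):
    """Im(F), obtained by symmetry F(-m) = -F(m)."""
    return {s * v for v in lut for s in (-1, 1)}

F1 = [0, 1, 1, 3, 3, 3, 7, 7]   # F1 from Table I
F2 = [1, 1, 1, 3, 3, 4, 4, 7]   # F2 from Table I, with |F2(0)| = 1
```

For the identity LUT `list(range(8))` the same helpers give W = 8 and w = 4, i.e., the 4-bit MS decoder.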
The main motivation for the introduction of NS-FAIDs is
that they allow reducing the size of the memory required
to store the exchanged messages. Clearly, for a NS-FAID
with framing bit-length w, the exchanged messages can be
represented using only w instead of q bits (including 1 bit
³ Although we restrict the study in this paper to non-surjective framing
functions F : M → M, it can be readily extended to non-surjective framing
functions F : Z → M. While such an extension would widen the class of
NS-FAIDs, our preliminary investigations have shown that the best NS-FAIDs
are actually those within the subclass of NS-FAIDs defined by F : M → M,
investigated in this paper.
Table I
EXAMPLES OF 4-BIT FRAMING FUNCTIONS OF WEIGHT W = 4
m       0    1   2   3   4   5   6   7
F1(m)   0    1   1   3   3   3   7   7
F2(m)   ±1   1   1   3   3   4   4   7
for the sign). Moreover, as a consequence of the message size
reduction, the size of the interconnect network that carries the
messages from the memory to the processing units is also
reduced.
Proposition 4: The number of (2Q + 1)-level NS-FAIDs of
weight W is given by:
$$N_{\text{NS-FAID}}(Q, W) = \binom{Q}{W-1} \binom{Q+1}{W} \qquad (6)$$
where $\binom{y}{x}$ denotes the binomial coefficient.
C. Examples of NS-FAIDs
As mentioned above, if the framing function F is the
identity function, then the corresponding FAID is equivalent
to the MS decoder with finite alphabetM. Some examples of
NS-FAIDs are provided below.
Example 1. Let F :M→M be defined by:
F (m) = sgn(m) max(|m| − θ, 0) (7)
where θ ∈ {1, . . . , Q− 1}. Then, the corresponding NS-FAID
is the Offset Min-Sum (OMS) decoder with offset factor θ.
Example 2. Let F : M → M be defined by:
$$F(m) = \begin{cases} m, & \text{if } |m| \text{ is even} \\ \mathrm{sgn}(m)\,(|m| - 1), & \text{if } |m| \text{ is odd} \end{cases} \qquad (8)$$
Then, the corresponding NS-FAID is the Partially OMS
(POMS) decoder from [13].
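The framing functions of Examples 1 and 2 can be written as LUTs, which makes their non-surjectivity easy to see. The helper names below are hypothetical; the LUT convention [F(0), F(1), ..., F(Q)] follows Section II-A.

```python
def sgn(x):
    return (x > 0) - (x < 0)

def oms_lut(Q, theta):
    """LUT of Eq. (7): F(m) = max(m - theta, 0) for m = 0..Q."""
    return [max(m - theta, 0) for m in range(Q + 1)]

def poms_lut(Q):
    """LUT of Eq. (8): even magnitudes unchanged, odd magnitudes decremented."""
    return [m if m % 2 == 0 else m - 1 for m in range(Q + 1)]

def frame(m, lut):
    """Apply F to a signed message m, using F(-m) = -F(m)."""
    return sgn(m) * lut[abs(m)]

oms = oms_lut(7, 1)    # 4-bit OMS with offset factor 1
poms = poms_lut(7)     # 4-bit POMS
```

With Q = 7, the OMS LUT with θ = 1 has weight 7 and the POMS LUT has weight 4, so both are NS-FAIDs in the sense of Definition 3.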
Moreover, it can be seen that the MS-based decoders pro-
posed in [14], [15] and the dual-quantization domain decoder
proposed in [16] are particular realizations of NS-FAIDs.
While the main reason behind the NS-FAIDs definition
consists in their ability to reduce memory and interconnect
requirements, we can also argue that they may allow improving
the error correction performance (with respect to MS). This is
the case of both OMS and POMS decoders mentioned above.
Given a target message bit-length w (e.g., corresponding to
some specific memory constraint), one may try to find the
framing function F of corresponding weight W , which yields
the best error correction performance. The optimization of the
framing function can be done by using the DE technique,
which will be discussed in Section II-E.
Since F is a non-decreasing function, it can be shown that
the framing function F can alternatively be applied at the CN-
processing step (instead of VN-processing), while resulting in
an equivalent decoding algorithm. Whether F is applied at the
VN-processing or the CN-processing step is rather a matter
of implementation. When F is applied at the VN-processing
step, both VN- and CN-messages belong to a strict subset
of the alphabet M, namely M′ = Im(F ) ⊂ M. When
F is applied at the CN-processing step, only CN-messages
belong to M′. However, it is worth noting that many hardware
implementations of Quasi-Cyclic (QC) LDPC decoders rely on
a layered architecture, which only requires storing the check-
node messages [7].
D. Irregular NS-FAIDs
In case of irregular LDPC codes, irregular NS-FAIDs are
NS-FAIDs using different framing functions Fdv for VNs of
different degrees dv . Framing functions Fdv may have different
weights Wdv . In this case, messages outgoing from degree-dv
VNs can be represented by using only wdv = ⌈log2(Wdv)⌉ + 1
bits. However, the message size reduction does not necessarily
apply to CN-messages, due to the fact that a CN may be
connected to VNs of different degrees. Let M′dv = Im(Fdv).
Then, messages outgoing from a CN c can be represented by
using:
$$\left\lceil \log_2 \left( \left| \bigcup_{d_v \in D_c} \mathcal{M}'_{d_v} \right| \right) \right\rceil \ \text{bits}, \qquad (9)$$
where Dc is the set of degrees of VNs connected to c, and | · |
is used to denote the number of elements of a set.
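As a small illustration of Eq. (9), the sketch below computes the number of bits needed for CN-messages from the LUTs of the framing functions used by the neighboring VNs. The function names and the three LUTs are chosen for illustration only; they are not taken from a specific decoder of this paper.

```python
from math import ceil, log2

def image(lut):
    """Im(F) for a LUT [|F(0)|, F(1), ..., F(Q)], using F(-m) = -F(m)."""
    return {s * v for v in lut for s in (-1, 1)}

def cn_message_bits(luts):
    """Eq. (9): ceil(log2 |union of the image sets|)."""
    union = set().union(*(image(lut) for lut in luts))
    return ceil(log2(len(union)))

F_a = [0, 1, 1, 3, 3, 3, 7, 7]   # weight 4, image {0, ±1, ±3, ±7}
F_b = [1, 1, 1, 1, 1, 6, 6, 6]   # weight 2, image {±1, ±6}
F_c = [1, 1, 1, 1, 7, 7, 7, 7]   # weight 2, image {±1, ±7}

bits_ab = cn_message_bits([F_a, F_b])   # images not nested
bits_ac = cn_message_bits([F_a, F_c])   # Im(F_c) included in Im(F_a)
```

In this example the non-nested pair needs 4 bits, as many as the full 4-bit alphabet, whereas the nested pair needs only 3 bits, which is why nested images are of interest for reducing the CN-message size.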
Alternatively, it is also possible to define CN-irregular NS-
FAIDs in a similar manner. However, in this work we only
deal with VN-irregular NS-FAIDs, since most of the practical
irregular LDPC codes are irregular on VNs, while almost
regular (or semi-regular) on CNs. In order to reduce the size
of the CN-messages, in Section III-B we will further impose
certain conditions on the framing functions Fdv , by requiring
their images being included in one another.
E. Density Evolution Analysis
For the sake of simplicity, we only consider transmission
over binary-input memoryless noisy channels. We assume
that the channel input alphabet is X = {−1,+1}, with the
usual convention that +1 corresponds to the 0-bit and −1
corresponds to the 1-bit, and denote by Y the output alphabet
of the channel. We denote by x ∈ X and y ∈ Y the transmitted
and received symbols, respectively.
We further consider a function ϕ : Y → Γ that maps
the output alphabet of the channel to the input alphabet of
the decoder, and set γ = ϕ(y). Hence, ϕ encompasses both
the computation of the soft (unquantized) log-likelihood ratio
(LLR) value and its quantization. We shall refer to ϕ as
quantization map and to γ as the input LLR of the decoder.
For transmission over the binary-input Additive White
Gaussian Noise (AWGN) channel, we shall consider that the
decoder’s input information as well as the exchanged messages
are quantized on the same number of bits; therefore Γ = M
unless otherwise stated. In this case, y = x + z, where z is the
white Gaussian noise with variance σ², and the quantization
map ϕ : Y → M is defined by:
ϕ(y) = [µ · y]M (10)
where µ > 0 is a constant referred to as gain factor, and [x]M
denotes the closest integer to x that belongs to M (see also
[19] and the gain factor quantizer defined therein).
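The quantization map of Eq. (10) amounts to scaling, rounding, and saturating. A minimal sketch is given below; the function name `quantize` is hypothetical, and rounding ties upward is an assumption, since the text only specifies "the closest integer".

```python
import math

def quantize(y, mu, Q):
    """phi(y) = [mu * y]_M: closest integer to mu * y within {-Q, ..., +Q}."""
    x = math.floor(mu * y + 0.5)      # nearest integer, ties rounded up (assumption)
    return max(-Q, min(Q, x))         # saturation to the decoder alphabet

# 4-bit alphabet (Q = 7) with gain factors taken from Table II
gamma_a = quantize(1.0, 5.6, 7)
gamma_b = quantize(2.0, 5.6, 7)
gamma_c = quantize(-1.0, 3.8, 7)
```

A larger gain factor spreads the channel values over more levels but saturates more of them, which is the trade-off behind optimizing µ for each decoder.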
The objective of the DE technique is to recursively com-
pute the probability mass function (pmf) of the exchanged
messages, through the iterative decoding process. This is done
under the assumption that exchanged messages are indepen-
dent, which holds in the asymptotic limit of the code length. In
this case, the decoding performance converges to the cycle free
case. DE equations for the NS-FAID decoder can be derived
in a similar way as for the finite-alphabet MS decoder [19].
Similar to [19], the DE is used to compute the asymptotic
error probability, defined as:
$$p_e^{(+\infty)} = \lim_{\ell \to +\infty} p_e^{(\ell)} \qquad (11)$$
where p_e^{(ℓ)} is the bit error probability at iteration ℓ.
For a target bit error probability η > 0, the η-threshold
is defined as the worst channel condition for which decoding
error probability is less than η. Assuming the binary-input
AWGN channel model, the η-threshold corresponds to the
maximum noise variance σ² (or equivalently minimum SNR),
such that the asymptotic error probability is less than η:
$$\sigma^2_{\text{thres}}(\eta) = \sup \left\{ \sigma^2 \;\middle|\; p_e^{(+\infty)} \le \eta \right\} \qquad (12)$$
In case that η = 0, the η-threshold is simply referred to as DE
threshold [20]. However, the asymptotic decoding performance
of finite-precision MS-based decoders is known to exhibit an
error floor phenomenon at high SNR [19]. This makes the η-
threshold definition more appropriate in practical cases, when
the target bit error rate can be fixed to a practical non-zero
value.
Finally, it is worth noting that the σ²thres(η) value depends
on: (i) the irregularity of the LDPC code, parametrized as
usual by the degree distribution polynomials λ(x) and ρ(x)
[20], (ii) the NS-FAID, i.e., the size of the decoder alphabet
and the framing function F , (iii) the channel quantizer ϕ, or
equivalently the gain factor µ used in Eq. (10). Therefore,
assuming that the degree distribution polynomials λ(x) and
ρ(x) and the size of the decoder alphabet are fixed, we use
the DE technique to jointly optimize the framing function F
and channel quantizer.
III. DENSITY EVOLUTION OPTIMIZATION OF NS-FAIDS
Throughout this section, we consider q = 4-bit NS-FAIDs
(hence, Q = 7). To illustrate the trade-off between hardware
complexity and decoding performance, we consider the opti-
mization of both regular and irregular NS-FAIDs.
A. Optimization of Regular NS-FAIDs
In this section, we consider the optimization of regular NS-
FAIDs for (dv = 3, dc = 6)-regular LDPC codes. We consider
q = 4-bit NS-FAIDs, with framing bit-length parameter
w ∈ {2, 3}. According to Proposition 4, the number of NS-
FAIDs is given by N_NS-FAID(q = 4, w = 3) = N_NS-FAID(Q =
7, W = 4) = $\binom{7}{3}\binom{8}{4}$ = 2450, and N_NS-FAID(q = 4, w = 2) =
N_NS-FAID(Q = 7, W = 2) = $\binom{7}{1}\binom{8}{2}$ = 196.
All regular NS-FAIDs have been evaluated by using the DE
technique from Section II-E. Table II summarizes the best NS-
FAIDs according to w and F(0) values⁴, for 0 ≤ |F(0)| ≤ 3;
⁴ NS-FAIDs with |F(0)| > 3 have worse DE thresholds, thus they have not
been included in the table.
Table II
BEST NS-FAIDS FOR (3, 6)-REGULAR LDPC CODES
w       F(0)         F                            SNR-thres (dB)
w = 4   MS           [0, 1, 2, 3, 4, 5, 6, 7]     1.643 (µ = 5.6)
w = 3   F(0) = 0     [0, 1, 1, 3, 3, 3, 7, 7]     1.409 (µ = 3.8)
        F(0) = ±1    [±1, 1, 1, 3, 3, 4, 4, 7]    1.412 (µ = 5.1)
        F(0) = ±2    [±2, 2, 2, 3, 3, 3, 4, 7]    1.712 (µ = 7.1)
        F(0) = ±3    [±3, 3, 3, 3, 3, 4, 5, 7]    2.227 (µ = 10.0)
w = 2   F(0) = 0     [0, 0, 0, 0, 0, 6, 6, 6]     2.251 (µ = 8.6)
        F(0) = ±1    [±1, 1, 1, 1, 1, 6, 6, 6]    1.834 (µ = 6.4)
        F(0) = ±2    [±2, 2, 2, 2, 2, 2, 2, 7]    1.911 (µ = 8.3)
        F(0) = ±3    [±3, 3, 3, 3, 3, 3, 3, 7]    2.014 (µ = 9.4)
DE thresholds (η = 0) and corresponding gain factors (µ)
are also reported. Best NS-FAIDs for w = 2 and w = 3
are emphasized in bold. For comparison purposes, the DE
threshold of the q = 4-bit MS decoder is also reported: MS
threshold is equal to 1.643 dB, for µ = 5.6. For w = 3,
it can be observed that best NS-FAIDs with F (0) = 0 or
F (0) = ±1 have better DE thresholds than the 4-bit MS
decoder. The best NS-FAID is given by the framing function
F = [0, 1, 1, 3, 3, 3, 7, 7] and its DE threshold is equal to
1.409 dB (µ = 3.8), representing a gain of 0.23 dB compared
to 4-bit MS. For w = 2, the best NS-FAID is given by
the framing function F = [±1, 1, 1, 1, 1, 6, 6, 6] and its DE
threshold is equal to 1.834 dB (µ = 6.4), which represents a
performance loss of only 0.19 dB compared to 4-bit MS. To
emphasize the benefits of the proposed NS-FAIDs extension,
we note that for w = 2, best NS-FAIDs with F(0) = ±1,
F (0) = ±2 or F (0) = ±3 have better DE thresholds than
the best NS-FAIDs with F (0) = 0. The latter is given by the
framing function F = [0, 0, 0, 0, 0, 6, 6, 6] and its DE threshold
is equal to 2.251 dB (µ = 8.6), thus resulting in a performance
loss of approximately 0.61 dB compared to 4-bit MS.
B. Optimization of Irregular NS-FAIDs
As a case study, we consider the optimization of ir-
regular NS-FAIDs for the WiMAX irregular LDPC codes
with rate 1/2 [21] (of course, the proposed method can
be applied to any other irregular codes in the same man-
ner). The edge-perspective degree distribution polynomials
are given by λ (x) = 0.2895x + 0.3158x2 + 0.3947x5 and
ρ (x) = 0.6316x5 + 0.3684x6. Hence, VNs are of degree
dv ∈ {2, 3, 6}. For each VN-degree dv , we consider that the
corresponding framing function Fdv may be of any weight
Wdv ∈ {2, 4, 8}, corresponding to a framing bit-length wdv ∈
{2, 3, 4}. Hence, the total number of framing functions is given
by NNS-FAID(7, 2) + NNS-FAID(7, 4) + NNS-FAID(7, 8) = 2645
(see Proposition 4). Since a different framing function may be
applied for each VN-degree, it follows that the total number
of irregular NS-FAIDs is equal to 2645^3 = 18 504 486 125.
Clearly, even though we rely on DE, it is practically impossible
to evaluate the decoding performance of all the irregular NS-
FAIDs. To overcome this problem, we proceed as described
below.
1) Optimization procedure: First, we evaluate the DE
thresholds of NS-FAIDs applying one and the same framing
function to all the variable-nodes, irrespective of their degree,
which for simplicity will be referred to as uniform NS-FAIDs
throughout this section. Uniform NS-FAIDs with framing bit-
length w ∈ {2, 3, 4} are then sorted by increasing DE
threshold value, from the best to the worst decoder. Note that
the case w = 4 represents a slight abuse of terminology,
since there is only one such decoder, corresponding to the
original MS decoder. We further denote by U^(best)-NS-FAID-
w the set of best uniform NS-FAIDs with framing bit-length
w, determined as follows:
• U^(best)-NS-FAID-2 is comprised of the uniform NS-
FAIDs with w = 2, whose DE threshold is less than
or equal to 5 dB; this represents 121 decoders out of the
total of N_NS-FAID(Q = 7, w = 2) = 196 decoders.
• U^(best)-NS-FAID-3 is comprised of the uniform NS-
FAIDs with w = 3, whose DE threshold is less than
or equal to 3 dB; this represents 946 decoders out of the
total of N_NS-FAID(Q = 7, w = 3) = 2450 decoders.
• U^(best)-NS-FAID-4 is comprised of the MS decoder only;
its DE threshold is equal to 1.374 dB.
The limiting values of the DE thresholds for U^(best)-NS-FAID-
2 and U^(best)-NS-FAID-3 are chosen such that the number of
selected NS-FAIDs in each set is small enough.
For irregular NS-FAIDs, we denote by NS-FAID-w2w3w6
the ensemble of NS-FAIDs defined by a triplet of framing
functions (F2, F3, F6), corresponding to variable node de-
grees dv=2, 3, 6, with framing bit-lengths w2, w3, w6. Since
w2, w3, w6 ∈ {2, 3, 4}, there are 27 such ensembles. Since the
number of NS-FAIDs in these ensembles can be very large, we
only evaluate part of them, by further imposing the following
two constraints:
Decoding performance constraint: We only consider irreg-
ular NS-FAIDs defined by triplets of framing functions
(F2, F3, F6), such that Fdv ∈ U^(best)-NS-FAID-wdv, for any
dv ∈ {2, 3, 6}.
Memory size reduction constraint: We further impose the
following inclusion constraint between the image sets of
framing functions used for different VN-degrees. Let
w^(max) = max_{dv} wdv and dv^(max) = argmax_{dv} wdv. We impose that
Im(Fdv) ⊆ Im(F_{dv^(max)}), ∀dv ∈ {2, 3, 6}. According to
Eq. (9), this constraint ensures that CN-messages can be
represented by using only w^(max) bits, which is particularly
suitable for layered architectures.
The number of irregular NS-FAIDs that satisfy the above
two constraints is equal to N^irregular_NS-FAIDs = 7 017 762.
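The memory size reduction constraint can be checked programmatically for a given triplet of framing functions. The sketch below uses hypothetical helper names and resolves ties in the argmax by taking the first degree with maximal bit-length, which is an assumption; the three nested LUTs are those listed as LUT4, LUT3 and LUT6 in Table IV.

```python
def image(lut):
    """Im(F) for a LUT [|F(0)|, F(1), ..., F(Q)], using F(-m) = -F(m)."""
    return {s * v for v in lut for s in (-1, 1)}

def satisfies_inclusion(luts, bit_lengths):
    """True if Im(F_dv) is included in Im(F_dv_max) for every VN-degree."""
    i_max = max(range(len(luts)), key=lambda i: bit_lengths[i])  # first argmax
    largest = image(luts[i_max])
    return all(image(lut) <= largest for lut in luts)

LUT4 = [0, 1, 1, 3, 3, 7, 7, 7]
LUT3 = [0, 1, 1, 3, 3, 3, 7, 7]
LUT6 = [1, 1, 1, 1, 7, 7, 7, 7]
OTHER = [1, 1, 1, 1, 1, 6, 6, 6]   # illustrative weight-2 LUT with image {±1, ±6}

ok = satisfies_inclusion([LUT4, LUT3, LUT6], [3, 3, 2])
not_ok = satisfies_inclusion([LUT4, LUT3, OTHER], [3, 3, 2])
```

A filter of this kind, applied to all triplets drawn from the selected uniform NS-FAIDs, is one way the candidate set could be pruned before running DE.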
2) Density Evolution evaluation: For each of the N^irregular_NS-FAIDs
irregular NS-FAIDs, we compute its decoding threshold for a
target bit error rate η = 10^−6, using the DE technique from
Section II-E. The threshold computation also encompasses the
optimization of the channel gain factor µ. Hence, for each NS-
FAID, we first determine the gain factor µ that maximizes the
η-threshold defined in Eq. (12). The corresponding η-threshold
value is then reported as the η-threshold of the NS-FAID.
Density evolution results for the MS decoder (indicated
as NS-FAID-444), as well as for the NS-FAIDs with the
best η-thresholds from five NS-FAID-w2w3w6 ensembles, are
shown in Table III: the framing functions used for VN-degrees
Table III
HARDWARE COMPLEXITY VS. DECODING PERFORMANCE TRADE-OFF FOR OPTIMIZED IRREGULAR NS-FAIDS
NS-FAIDs           Framing functions applied to   SNR-thres (dB)      SNR         Memory size reduction (%)
ensemble           dv = 2   dv = 3   dv = 6       (& gain factor µ)   gain/loss   VN-mess.   CN-messages
                                                  @ BER = 10^−6       (+/− dB)               uncomp.   comp.
NS-FAID-444 (MS)   LUT0     LUT0     LUT0         1.374 (µ=3.2)        0.000        0.00       0.00      0.00
NS-FAID-432        LUT0     LUT1     LUT6         1.188 (µ=3.0)       +0.186      −27.63       0.00      0.00
NS-FAID-433        LUT0     LUT3     LUT2         1.015 (µ=2.8)       +0.359      −17.76       0.00      0.00
NS-FAID-332        LUT4     LUT3     LUT6         1.273 (µ=2.6)       +0.101      −34.87     −25.00    −13.04
NS-FAID-333        LUT4     LUT4     LUT3         1.110 (µ=2.4)       +0.264      −25.00     −25.00    −13.04
NS-FAID-222        LUT7     LUT5     LUT5         2.299 (µ=2.3)       −0.925      −50.00     −50.00    −26.09
Table IV
LUTS USED BY NS-FAIDS IN TABLE III
m    LUT0   LUT1   LUT2   LUT3   LUT4   LUT5   LUT6   LUT7
0    0      0      0      0      0      ±1     ±1     ±1
1    1      0      1      1      1      1      1      1
2    2      2      1      1      1      1      1      1
3    3      2      2      3      3      1      1      5
4    4      3      2      3      3      5      7      5
5    5      3      7      3      7      5      7      5
6    6      7      7      7      7      5      7      5
7    7      7      7      7      7      5      7      5
w    4      3      3      3      3      2      2      2
dv = 2, 3, 6 are shown in columns 2, 3, and 4, while the η-
threshold value (in dB) and the corresponding gain factor µ
are shown in column 5. The SNR gain (+) or loss (−) reported
in column 6 corresponds to the differences between the SNR
threshold of the MS decoder (NS-FAID-444) and the SNR
threshold of the best NS-FAID-w2w3w6. The memory size
reduction of the NS-FAID-w2w3w6 decoders compared to the
MS decoder is reported in columns 7-9, for both VN and CN
messages. For CN messages, two possibilities are considered,
according to whether they are stored in an uncompressed or
compressed format, where compressed format means that only
the signs, first minimum, second minimum, and index of the
first minimum are stored [22]. Finally, the framing functions’
LUTs of the best NS-FAID-w2w3w6 decoders are reported in
Table IV.
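The VN-message and uncompressed CN-message columns of Table III can be reproduced with a short calculation. The formulas below are an assumption, since the text does not state them explicitly: the VN-message reduction is taken as the edge-weighted average of the framing bit-lengths relative to q = 4 bits, and the uncompressed CN-message reduction follows from storing w^(max) bits per message. The compressed CN-message column is not modeled.

```python
# Edge-perspective VN degree distribution of the rate-1/2 WiMAX code (Section III-B)
LAMBDA = {2: 0.2895, 3: 0.3158, 6: 0.3947}

def vn_memory_reduction(bits, q=4):
    """Percent reduction of VN-message storage; bits maps dv -> w_dv."""
    avg_bits = sum(LAMBDA[dv] * bits[dv] for dv in LAMBDA)
    return 100.0 * (1.0 - avg_bits / q)

def cn_memory_reduction_uncompressed(bits, q=4):
    """Percent reduction of uncompressed CN-message storage (w_max bits each)."""
    return 100.0 * (1.0 - max(bits.values()) / q)

r_432 = vn_memory_reduction({2: 4, 3: 3, 6: 2})
r_433 = vn_memory_reduction({2: 4, 3: 3, 6: 3})
r_332 = vn_memory_reduction({2: 3, 3: 3, 6: 2})
c_332 = cn_memory_reduction_uncompressed({2: 3, 3: 3, 6: 2})
```

Under these assumptions the computed values agree, to two decimals, with the 27.63%, 17.76%, 34.87% and 25.00% entries reported in Table III.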
Fig. 1 captures the trade-off between decoding performance
and memory size reduction, for each of the 27 NS-FAID-
w2w3w6 ensembles. For each ensemble, we select the NS-
FAID with the best threshold, and indicate the corresponding
memory gains and decoding performance. The height of verti-
cal bars indicates the VN memory size reduction (values on the
left vertical axis), while their color indicates the CN memory
size reduction (uncompressed CN message storage is assumed
in the legend). The red stems indicate the SNR threshold
gain or loss compared to MS decoder (values on the right
vertical axis). It can be seen that the NS-FAID-332 decoder
allows a significant memory size reduction for both variable-
and check-node messages, while still performing 0.1 dB ahead of
the MS decoder. The NS-FAID-433 decoder is also a very
good candidate for applications requiring increased decoding
performance: it achieves the best SNR gain (0.36 dB), while
providing a VN memory size reduction by 17.76% with
respect to the MS decoder.
Finally, it is worth noticing that the reported memory size
[Figure 1. Memory size reduction vs. decoding performance. Horizontal axis: NS-FAID ensemble (444, 443, 434, 344, 442, 424, 244, 432, 423, 324, 342, 234, 243, 422, 242, 224, 433, 343, 334, 233, 323, 332, 333, 223, 232, 322, 222). Left vertical axis: VN-message size reduction (%), from 0 to 50. Right vertical axis: SNR threshold gain (+) / loss (−) (dB), from −1 to 0.4. Bar color legend: CN-message size reduction (0%, 25%, 50%).]
reductions do not necessarily translate as such in hardware
implementations, for several reasons. First, depending on the
hardware architecture, VN messages may or may not be stored
in a dedicated memory. For instance, layered architectures
only require the storage of CN messages, which can be
further stored in either a compressed or uncompressed format.
Moreover, VN processing units (VNUs) need to be equipped
with a framing and a deframing module, which may offset
part of the promised gains. This is even more true in case of
irregular codes, for which some VNUs may need to implement
more than one single framing function, since the same VNU
may be reused to process VN of different degrees (except for
fully parallel architectures). To assess the gains of NS-FAIDs
in practical implementations, the integration of the NS-FAID
mechanism (framing/deframing modules) into two different
MS decoder architectures is discussed in the next section,
and ASIC synthesis results for the CMOS 65 nm process
technology are provided in Section V.
IV. HARDWARE ARCHITECTURES FOR NS-FAID
DECODERS
In this section we propose two layered decoder architectures
for QC-LDPC codes, with both MS and NS-FAIDs decoding
kernels. Proposed architectures target high-throughput, while
ensuring an efficient use of the hardware resources. Two
possible approaches to achieve high-throughput are explored,
consisting in either pipelining the datapath, or increasing the
hardware parallelism.
We consider a QC-LDPC code defined by a base matrix
B of size R × C, and expansion factor z, corresponding to
a parity check matrix H of size M ×N , with M = zR and
N = zC. With the notation from Section II-E, we denote by
$x = (x_1, \dots, x_N) \in X^N$ the transmitted codeword, by
$y = (y_1, \dots, y_N) \in Y^N$ the received word, and by
$(\gamma_1, \dots, \gamma_N)$ the input LLRs of the decoder, where
$\gamma_n = \varphi(y_n)$ and $\varphi$ is the quantization map.
VN and CN messages are denoted by $\alpha_{m,n}$ and $\beta_{m,n}$,
respectively, and the A Posteriori (AP)-LLR of a VN n is denoted
by $\tilde{\gamma}_n$. Hence,
$\tilde{\gamma}_n = \gamma_n + \sum_{m \in H(n)} \beta_{m,n}$, where
the sum is taken over all the check-nodes m connected to n,
denoted by $m \in H(n)$. Input LLRs and exchanged messages
are quantized on q bits, while AP-LLRs are assumed to be
quantized on $\tilde{q}$ bits, with $\tilde{q} > q$.
A decoding layer (or simply referred to as a layer) consists
of one or several consecutive rows of B, assuming that they
do not overlap, i.e., each column of B has at most one non-
negative entry within each layer. It is assumed that the same
number of rows of B participate in each decoding layer, which
is denoted by RPL (rows per layer). Hence, the number of
decoding layers is given by L = R/RPL. We further define
Z = z × RPL, corresponding to the number of rows of H
(parity checks) within one decoding layer, and referred to as
the parallelism degree of the hardware architecture. To ease
the description of the hardware architectures proposed in this
section, we shall assume that all CNs have the same degree,
denoted by dc. However, no assumptions are made concerning
VN degrees. We present each architecture assuming the MS
decoding kernel is being implemented, then we discuss the
required changes in order to integrate the NS-FAID decoding
kernel.
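The layer definitions above can be illustrated with a minimal Python sketch. The function names and the small base matrix are hypothetical and only serve the illustration; the numerical parameters (R = 12, z = 54, RPL = 1 or 4) are those of the regular code considered in Section V-A.

```python
def layer_parameters(R, z, rpl):
    """Return (L, Z): number of decoding layers and parallelism degree."""
    if R % rpl != 0:
        raise ValueError("RPL must divide the number of base-matrix rows R")
    return R // rpl, z * rpl


def layers_are_valid(base, rpl):
    """Check that, within each layer, every column of the base matrix has
    at most one non-negative entry (-1 marks an all-zero block)."""
    for start in range(0, len(base), rpl):
        rows = base[start:start + rpl]
        for column in zip(*rows):
            if sum(1 for b in column if b >= 0) > 1:
                return False
    return True


# Parameters of the (3, 6)-regular code of Section V-A.
print(layer_parameters(12, 54, 1))  # L = 12 layers, Z = 54
print(layer_parameters(12, 54, 4))  # L = 3 layers,  Z = 216

# Hypothetical 2 x 4 base matrix, for illustration only.
toy = [[3, -1, 0, -1],
       [-1, 5, -1, 2]]
print(layers_are_valid(toy, 1))  # every single row is a valid layer
print(layers_are_valid(toy, 2))  # the two rows do not overlap
```

Grouping more rows per layer reduces L and raises Z proportionally, which is the trade-off explored by the two architectures of this section.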
A. Pipelined architecture
The proposed architecture with MS decoding kernel is
detailed in Fig. 2. A high-level representation is also shown
in Fig. 5(a), for both MS and NS-FAID decoding kernels. The
architecture is optimized so as to reduce the critical path. In
particular, we completely reorganize the interconnect network
(barrel shifters BS_INIT and BS_R, see below), thus removing
the need for a barrel shifter on the data write-back side. The
main blocks of the architecture are discussed below.
Input/Output buffers: The input buffer, implemented as a
number of Serial Input Parallel Output (SIPO) shift registers,
is used to store the input LLR values (γn) received by the
decoder. The output buffer is used to store the hard bit
estimates of the decoded word. Input/output buffers allow data
load/offload operations to take place concomitantly with the
decoding of the current codeword.
Memory blocks: Two memory blocks are used, one for
AP-LLR values (γ̃_memory) and one for CN-messages
(β_memory). γ̃_memory is implemented by registers, in order
to allow massively parallel read or write operations. It is
organized in C blocks, denoted by $AP_j$ ($j = 1, \dots, C$),
corresponding to the columns of the base matrix, each one
consisting of $z \times \tilde{q}$ bits. Data are read from/written to
blocks corresponding to non-negative entries in the decoding
layer being processed. β_memory is implemented as a dual
port Random Access Memory (RAM), in order to support
pipelining, as explained below. Each memory word consists
of Z × β_messages, corresponding to one decoding layer.
Depending on the Check Node Unit (CNU) implementation,
β_messages can be either “uncompressed” (i.e., for a check-node
m, the corresponding β_message is given by the $d_c$
values $[\beta_{m,n_1}, \dots, \beta_{m,n_{d_c}}]$, where $n_1, \dots, n_{d_c}$
denote the variable nodes connected to m) or “compressed” (i.e.,
for a check-node m, the corresponding β_message is given by the
signs of the above $\beta_{m,n_i}$ messages, their first and second
minimum, denoted by min1 and min2, and the index of the first
minimum, denoted by indx_min1) [22].
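The two storage formats can be sketched as follows. This is only an illustrative Python model, not the hardware implementation: the function names are hypothetical, a zero input is assumed to carry a positive sign, and the word sizes follow the bus widths indicated in Fig. 2, namely Z × (dc · q) bits for uncompressed and Z × (dc + 2 · (q − 1) + ⌈log2(dc)⌉) bits for compressed β messages.

```python
import math


def cn_update_uncompressed(alpha):
    """Min-Sum CN update: one output per edge, computed from all the
    other incoming VN messages (sign product, minimum magnitude)."""
    beta = []
    for i in range(len(alpha)):
        others = alpha[:i] + alpha[i + 1:]
        sign = -1 if sum(a < 0 for a in others) % 2 else 1
        beta.append(sign * min(abs(a) for a in others))
    return beta


def cn_update_compressed(alpha):
    """Same update, returned as (signs, min1, min2, indx_min1)."""
    mags = [abs(a) for a in alpha]
    indx_min1 = min(range(len(alpha)), key=mags.__getitem__)
    min1 = mags[indx_min1]
    min2 = min(m for i, m in enumerate(mags) if i != indx_min1)
    parity = sum(a < 0 for a in alpha) % 2
    # Sign of beta_i is the product of the signs of all alpha_j, j != i.
    signs = [(parity ^ (a < 0)) == 1 for a in alpha]
    return signs, min1, min2, indx_min1


def decompress(signs, min1, min2, indx_min1):
    """Rebuild the dc uncompressed beta values from the compressed word."""
    return [(-1 if s else 1) * (min2 if i == indx_min1 else min1)
            for i, s in enumerate(signs)]


def beta_word_bits(Z, dc, q, compressed):
    """Size in bits of one beta memory word (one decoding layer)."""
    if compressed:
        return Z * (dc + 2 * (q - 1) + math.ceil(math.log2(dc)))
    return Z * dc * q


alpha = [3, -1, 5, -2, 7, 4]
print(cn_update_uncompressed(alpha))
print(decompress(*cn_update_compressed(alpha)))
print(beta_word_bits(54, 6, 4, False), beta_word_bits(54, 6, 4, True))
```

For dc = 6 and q = 4, the compressed format needs 15 bits per check node instead of 24, at the price of the decompression step.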
Read and Write Permutations (PER_R, PER_W): PER_R
permutation is used to rearrange the data read from γ̃_memory,
according to the processed layer, so as to ensure processing
by the proper VNU/CNU. PER_W block operates oppositely
to PER_R.
Barrel Shifters (BS_INIT, BS_R): Barrel shifters are used
to implement the cyclic (shift) permutations, according to
the non-negative entries of the base matrix. The γ̃_memory
is initialized from the input LLR values stored in the input
buffer. However, input LLR values are shifted by the BS_INIT
block before being written to the γ̃_memory, according to the
last non-negative shift factor on the corresponding base matrix
column. BS_R blocks are then used to shift the LLR values
read from the γ̃_memory, so as to properly align them with
the appropriate VNU. Note that there are $d_c$ BS_R blocks,
denoted by BS_R_1, …, BS_R_$d_c$. In case RPL = 1, a decoding
layer corresponds to a row of B, and each BS_R block is used to
shift the LLR values within one of the $d_c$ columns with
non-negative entries in the current row. Let i be the index of the
current row. The cyclic shift implemented by a BS_R block,
corresponding to a column j with $b_{i,j} \geq 0$, is given by
$-b_{i',j} + b_{i,j}$, where $b_{i',j}$ is the previous non-negative
entry in column j (i.e., i′ is the previous row with
$b_{i',j} \geq 0$). In case RPL > 1, each BS_R
block actually consists of RPL sub-blocks as above, with one
sub-block for each row in the layer. The values of the cyclic
shifts are computed offline for each layer ℓ. This eliminates
the need for data write-back barrel shifters, thus reducing the
critical path of the design. Finally, the inverse BS_INIT block
operates oppositely to BS_INIT, and is used to shift back the hard
decision bits into appropriate positions.
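The offline computation of the shift values can be sketched as below, for RPL = 1. This is a minimal illustration under stated assumptions: the function name and the toy base matrix are hypothetical, shifts are reduced modulo the expansion factor z, and the "previous" non-negative entry of the first such row of a column is taken to be the last one (from the previous iteration).

```python
def barrel_shifter_values(base, z):
    """Return (init_shift, read_shift) for a base matrix with RPL = 1.

    init_shift[j]    : shift applied by BS_INIT to column j.
    read_shift[i][j] : shift applied by BS_R when row i reads column j
                       (None where the base matrix entry is -1).
    """
    R, C = len(base), len(base[0])
    init_shift = [None] * C
    read_shift = [[None] * C for _ in range(R)]
    for j in range(C):
        rows = [i for i in range(R) if base[i][j] >= 0]
        if not rows:
            continue
        # Last non-negative shift factor of the column.
        init_shift[j] = base[rows[-1]][j]
        for k, i in enumerate(rows):
            prev = rows[k - 1]  # k = 0 wraps around to the last row
            read_shift[i][j] = (base[i][j] - base[prev][j]) % z
    return init_shift, read_shift


# Hypothetical 3 x 3 base matrix with z = 8, for illustration only.
toy = [[1, -1, 3],
       [4, 2, -1],
       [-1, 7, 5]]
init_shift, read_shift = barrel_shifter_values(toy, 8)
print(init_shift)
print(read_shift)
```

Because each read applies only the difference with respect to the previous alignment, the accumulated shift of a column always equals the shift factor of the row being processed, so no shift is needed when the data are written back.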
Variable Node Units (VNUs) and AP-LLR Units: These
units compute VN-messages ($\alpha_{m,n}$) and AP-LLR values
($\tilde{\gamma}_n$). Each VN message is computed by subtracting the
corresponding CN message from the AP-LLR value, that is
$\alpha_{m,n} = \tilde{\gamma}_n - \beta_{m,n}$. This operation is
implemented by a q̃-bit subtractor, hence the $\alpha_{m,n}$ value
outputted by the VNU is quantized on q̃ bits. The AP-LLR value is
updated by the AP-LLR unit, by
$\tilde{\gamma}_n = \alpha_{m,n} + \beta^{\text{new}}_{m,n}$, where
$\beta^{\text{new}}_{m,n}$ is the corresponding CN
message computed at the current iteration (see below).
Saturators (SATs): Prior to CNU processing, $\alpha_{m,n}$ values
are saturated to q bits.
[Figure 2: block diagram. Memory units: INPUT buffer (channel data), γ̃_memory (C × Z, register blocks AP_1, …, AP_C of z × q̃ bits each), β_memory (RAM, one word of Z × β_messages per layer 1, …, L), pipeline registers. Interconnection units: BS_INIT (at the input, and at the hard-bit output), PER_R, BS_R_1, …, BS_R_dc, PER_W, DS. Processing units: VNU_1, …, VNU_dc, SAT_1, …, SAT_dc, CNU, DCP, AP-LLR_1, …, AP-LLR_dc. Controller signals: clk, en_decoder, en_mem, data_sel, count_layer_read, count_layer_write, shift factor. Bus widths: Z × q̃ (AP-LLR and VN-message paths), Z × q (CNU inputs). β_memory word size: Z × (dc · q) if “uncompressed”; Z × (dc + 2 · (q − 1) + ⌈log2(dc)⌉) if “compressed” (signs, min1, min2, indx_min1). Output: hard bits (codeword).]
Figure 2. Block Diagram of the Proposed Pipelined Architecture with MS-kernel
Check Node Units (CNUs): These processing units compute
the CN-messages ($\beta_{m,n}$). For simplicity, Fig. 2 shows one
CNU block with $d_c$ inputs, each one of size Z × q bits. Thus,
this block actually includes Z computing units, used to process
in parallel the Z check-nodes within one layer. The CNU is
implemented by using either: (i) the high-speed low-cost tree-
structure (TS) approach proposed in [23] for “compressed”
CN-messages, or (ii) comparator trees for “uncompressed”
CN-messages.
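The processing of one check node within a layer (VNU, SAT, CNU and AP-LLR unit) can be summarized by the following Python sketch. It is a behavioral illustration only: the function names are hypothetical, the CNU is modeled by a direct Min-Sum computation rather than by the tree structures mentioned above, and clipping of the q̃-bit values to the symmetric range [−Q̃, +Q̃] is an assumption about how overflow is handled.

```python
def saturate(x, bits):
    """Clip x to the symmetric range [-Q, Q], with Q = 2**(bits-1) - 1."""
    Q = 2 ** (bits - 1) - 1
    return max(-Q, min(Q, x))


def min_sum(alpha):
    """Min-Sum CN update (uncompressed output)."""
    beta = []
    for i in range(len(alpha)):
        others = alpha[:i] + alpha[i + 1:]
        sign = -1 if sum(a < 0 for a in others) % 2 else 1
        beta.append(sign * min(abs(a) for a in others))
    return beta


def process_check_node(ap_llr, beta_old, q=4, q_tilde=6):
    """One layered update for the dc variable nodes of one check node.

    ap_llr   : AP-LLR values read from the gamma-tilde memory.
    beta_old : CN messages of this check node from the previous iteration.
    Returns the updated AP-LLR values and the new CN messages.
    """
    # VNU: alpha = gamma_tilde - beta (kept on q_tilde bits, assumed clipped).
    alpha = [saturate(g - b, q_tilde) for g, b in zip(ap_llr, beta_old)]
    # SAT + CNU: saturate to q bits, then Min-Sum update.
    beta_new = min_sum([saturate(a, q) for a in alpha])
    # AP-LLR unit: gamma_tilde = alpha + beta_new (assumed clipped).
    ap_new = [saturate(a + b, q_tilde) for a, b in zip(alpha, beta_new)]
    return ap_new, beta_new


ap_new, beta_new = process_check_node([10, -3, 20, -7, 5, 2],
                                      [1, -1, 2, -2, 1, 1])
print(ap_new, beta_new)
```

Note that the AP-LLR update uses the unsaturated VN message, so only the CNU inputs are reduced to q bits.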
Decompress (DCP): This block is only used in case that the
CN-messages are in compressed format (signs, min1, min2,
indx_min1). It converts the β_messages from compressed to
the uncompressed format.
Controller: This block generates control signals such as
count_layer_read, count_layer_write to indicate which layers
are being processed, write_en to enable data writing, etc. It
also controls the synchronous execution of the other blocks.
Pipelining: To increase the operating frequency, the data
path is pipelined by adding a set of registers after the VNU-
blocks. The timing schedule is shown in Fig. 5(a), where
the two pipeline stages (P1 and P2) are indicated by purple
and brown arrows. Hence, processing one layer takes 2 clock
cycles, but at each clock cycle the two pipeline stages work
on two consecutive layers of the base matrix. This imposes
specific constraints on the base matrix, as consecutive layers
must not overlap, in order to avoid γ̃_memory conflicts (note
that memory stall cycles would cancel the pipelining effect).
An example of a $d_c = 6$ regular base matrix without overlap
between consecutive layers is given in Fig. 3, assuming that
each layer corresponds to one row of the base matrix.
49 -1 -1 -1 -1 43 -1 -1 -1 -1 50 -1 -1 -1 -1 2 -1 27 -1 -1 -1 -1 -1 49
-1 -1 -1 10 41 -1 -1 -1 -1 52 -1 -1 32 -1 -1 -1 -1 -1 50 -1 50 -1 -1 -1
-1 -1 20 -1 -1 -1 -1 20 -1 -1 -1 51 -1 10 -1 -1 47 -1 -1 -1 -1 -1 33 -1
-1 24 -1 -1 -1 -1 22 -1 53 -1 -1 -1 -1 -1 31 -1 -1 -1 -1 18 -1 47 -1 -1
10 -1 -1 -1 15 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 50 -1 13 -1 -1 -1 -1 -1 53
-1 -1 44 -1 -1 6 -1 -1 -1 -1 -1 29 -1 40 -1 -1 16 -1 -1 -1 13 -1 -1 -1
-1 2 -1 -1 -1 -1 -1 13 41 -1 -1 -1 -1 -1 42 -1 -1 -1 -1 48 -1 49 -1 -1
-1 -1 -1 36 -1 -1 24 -1 -1 50 -1 -1 12 -1 -1 -1 -1 -1 10 -1 -1 -1 48 -1
-1 -1 47 -1 50 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 9 -1 7 -1 -1 -1 -1 -1 28
6 -1 -1 -1 -1 -1 5 -1 -1 -1 -1 13 -1 3 -1 -1 29 -1 -1 -1 16 -1 -1 -1
-1 -1 -1 35 -1 16 -1 -1 37 -1 -1 -1 4 -1 -1 -1 -1 -1 24 -1 -1 -1 29 -1
-1 24 -1 -1 -1 -1 -1 51 -1 38 -1 -1 -1 -1 6 -1 -1 -1 -1 23 -1 16 -1 -1
Figure 3. Base matrix of the (3, 6)-regular QC-LDPC code
Regular NS-FAID decoding kernel: The changes required
to integrate a regular NS-FAID decoding kernel, with framing
function F, are shown in Fig. 5(a). First, the Saturation (SAT)
block used within the MS-decoding kernel is replaced by
a Framing (FRA) block. Note that the output of the VNU
consists of q̃-bit (unsaturated) VN-messages. Hence, the FRA
block actually implements the concatenation of the following
operations, corresponding to $F \circ s_M$ in Eq. (4):
$$[-\tilde{Q}, \dots, +\tilde{Q}] \xrightarrow{\; s_M \;} [-Q, \dots, +Q] \xrightarrow{\; F \;} \mathrm{Im}(F) \xrightarrow{\; \sim \;} [-W, \dots, +W], \qquad (13)$$
where $[-\tilde{Q}, \dots, +\tilde{Q}]$ is the alphabet of unsaturated
messages ($\tilde{Q} = 2^{\tilde{q}-1} - 1$), F is the framing function
being used, Im(F) is the image of F (which is a subset of
$[-Q, \dots, +Q]$ according to the framing function definition), and
the last operation consists of a re-quantization of the Im(F) values
on a number of w bits, where $w = \lceil \log_2(W) \rceil + 1$ is the
framing bit-length. The De-framing (DE-FRA) block simply converts
back from w-bit to q-bit values
($[-W, \dots, +W] \xrightarrow{\sim} \mathrm{Im}(F) \subset [-Q, \dots, +Q]$),
i.e., it inverts the re-quantization operation
above. Although we have to add the de-framing blocks, the
reduction of the CN-messages size may still save significant
hardware resources, as compared to MS decoding. This will
be discussed in more detail in Section V.
Base matrix with reordered rows (row index, original row index in parentheses; columns v1 to v24):
Row      v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v20 v21 v22 v23 v24
1 (01)   -1 94 73 -1 -1 -1 -1 -1 55 83 -1 -1 7 0 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
2 (03)   -1 -1 -1 24 22 81 -1 33 -1 -1 -1 0 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1
3 (05)   -1 -1 39 -1 -1 -1 84 -1 -1 41 72 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1
4 (12)   43 -1 -1 -1 -1 66 -1 41 -1 -1 -1 26 7 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0
5 (07)   -1 -1 95 53 -1 -1 -1 -1 -1 14 18 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1
6 (09)   12 -1 -1 -1 83 24 -1 43 -1 -1 -1 51 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1
7 (11)   -1 -1 7 65 -1 -1 -1 -1 39 49 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0
8 (02)   -1 27 -1 -1 -1 22 79 9 -1 -1 -1 12 -1 0 0 -1 -1 -1 -1 -1 -1 -1 -1 -1
9 (04)   61 -1 47 -1 -1 -1 -1 -1 65 25 -1 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 -1
10 (06)  -1 -1 -1 -1 46 40 -1 82 -1 -1 -1 79 0 -1 -1 -1 -1 0 0 -1 -1 -1 -1 -1
11 (08)  -1 11 73 -1 -1 -1 2 -1 -1 47 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1 -1 -1
12 (10)  -1 -1 -1 -1 -1 94 -1 59 -1 -1 70 72 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 -1

First mapping table:
           VNU1  VNU2  VNU3  VNU4  VNU5  VNU6  VNU7
Layer 1    v2    v3    v9    v10   v13   v14   -
Layer 2    v4    v5    v6    v8    v12   v15   v16
Layer 3    v3    v7    v10   v11   v17   v18   -
Layer 4    v1    v6    v8    v12   v13   v24   -
Layer 5    v3    v4    v10   v11   v19   v20   -
Layer 6    v1    v5    v6    v8    v12   v21   v22
Layer 7    v3    v4    v9    v10   v23   v24   -
Layer 8    v2    v6    v7    v8    v12   v14   v15
Layer 9    v1    v3    v9    v10   v16   v17   -
Layer 10   v5    v6    v8    v12   v13   v18   v19
Layer 11   v2    v3    v7    v10   v20   v21   -
Layer 12   v6    v8    v11   v12   v22   v23   -
No. FRAs   2     2     2     2     3     1     1

Second mapping table:
           VNU1  VNU2  VNU3  VNU4  VNU5  VNU6  VNU7
Layer 1    v2    v13   v9    v10   v3    v14   -
Layer 2    v4    v6    v5    v8    v12   v15   v16
Layer 3    v7    v3    v11   v10   v17   v18   -
Layer 4    v1    v6    v13   v12   v8    v24   -
Layer 5    v4    v3    v11   v10   v19   v20   -
Layer 6    v1    v6    v5    v8    v12   v21   v22
Layer 7    v4    v3    v9    v10   v23   v24   -
Layer 8    v2    v6    v7    v8    v12   v14   v15
Layer 9    v1    v3    v9    v10   v16   v17   -
Layer 10   v5    v6    v13   v12   v8    v18   v19
Layer 11   v2    v3    v7    v10   v20   v21   -
Layer 12   v6    v8    v11   v12   v22   v23   -
No. FRAs   2     2     1     1     2     1     1
Figure 4. Mapping between VNs and VNUs. In black: VNs of degree 2, in
red: VNs of degree 3, in blue: VNs of degree 6.
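The chain of operations in Eq. (13) and its inversion by the DE-FRA block can be sketched in Python as follows. The look-up table used here is purely hypothetical (it is not one of the optimized framing functions of the paper), and the sketch assumes an odd framing function with F(0) = 0, described by its values on the magnitudes 0, …, Q.

```python
import math

Q_BITS = 4                     # q: bit-length of saturated messages
Q = 2 ** (Q_BITS - 1) - 1      # saturated alphabet is [-Q, ..., +Q]

# Hypothetical framing function on magnitudes 0..Q (illustration only).
F_LUT = [0, 1, 1, 3, 3, 3, 7, 7]

IMAGE = sorted(set(F_LUT))     # magnitudes in Im(F), here [0, 1, 3, 7]
W = len(IMAGE) - 1             # re-quantized alphabet is [-W, ..., +W]
w = math.ceil(math.log2(W)) + 1  # framing bit-length


def sign(x):
    return -1 if x < 0 else 1


def fra(alpha_unsat):
    """FRA block: saturation s_M, framing F, then re-quantization."""
    magnitude = min(abs(alpha_unsat), Q)   # s_M
    framed = F_LUT[magnitude]              # F
    return sign(alpha_unsat) * IMAGE.index(framed)


def de_fra(value):
    """DE-FRA block: map a w-bit value back to its q-bit value in Im(F)."""
    return sign(value) * IMAGE[abs(value)]


print(W, w)                  # size of the framed alphabet and bit-length
print(fra(25), fra(-4))      # framed (w-bit) values
print(de_fra(fra(-4)))       # corresponding q-bit value in Im(F)
```

With this table the messages exchanged with the CNU take only 7 distinct values, so they fit on w = 3 bits instead of q = 4.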
Irregular NS-FAID decoding kernel: First, we note that
the pipelined architecture proposed in this section can be applied
to the WiMAX QC-LDPC code with rate 1/2, considered in
Section III-B, by assuming that each decoding layer consists
of one row of the base matrix. Indeed, it is known that for
this code, the rows of the base matrix can be reordered, such
that any two consecutive rows do not overlap [24].
Regarding the integration of an irregular NS-FAID decoding
kernel, the same framing (FRA) or de-framing (DE-FRA)
block is reused for several VNs, which may be of different
degrees. This may require several framing functions to be
implemented within the FRA/DE-FRA blocks, thus increasing
the hardware complexity. To overcome this problem, one
may change the way the VNs are mapped to the processing
units, by reordering the columns of the base matrix processed
within each decoding layer. We determine offline such a
reordering for each decoding layer, so as to minimize the
number of FRA/DE-FRA blocks implementing more than
one single framing function. Hence, the PER_R and PER_W
blocks, which ensure the proper alignment between data and
processing units, are redefined accordingly.
The optimal mapping between VNs and VNUs is shown in
Fig. 4, for the base matrix with reordered rows from [24]. For
each VNU, we indicate the index of the VN (or equivalently,
base matrix column) processed by the VNU, within each
decoding layer. The last row of the table indicates the number
of framing functions that have to be implemented within the
FRA/DE-FRA blocks corresponding to each VNU. One may
see that 3 of the FRA/DE-FRA blocks must implement two
framing functions, while the other 4 FRA/DE-FRA blocks
implement only one framing function.
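The "No. FRAs" rows of Fig. 4 can be recomputed from the mapping tables, assuming that one framing function is needed per VN degree handled by a VNU. In the Python sketch below, the names are hypothetical, the degree of a VN is obtained as the number of layers in which it appears (each layer being one row of the base matrix), and the VNU4 entry of Layer 9 is taken to be v10, which is the column with a non-negative entry in that row of the base matrix.

```python
from collections import Counter

# Rows = layers 1..12, entries = VN index handled by VNU1..VNU7 (None = idle).
FIRST_MAPPING = [
    [2, 3, 9, 10, 13, 14, None], [4, 5, 6, 8, 12, 15, 16],
    [3, 7, 10, 11, 17, 18, None], [1, 6, 8, 12, 13, 24, None],
    [3, 4, 10, 11, 19, 20, None], [1, 5, 6, 8, 12, 21, 22],
    [3, 4, 9, 10, 23, 24, None], [2, 6, 7, 8, 12, 14, 15],
    [1, 3, 9, 10, 16, 17, None], [5, 6, 8, 12, 13, 18, 19],
    [2, 3, 7, 10, 20, 21, None], [6, 8, 11, 12, 22, 23, None],
]
SECOND_MAPPING = [
    [2, 13, 9, 10, 3, 14, None], [4, 6, 5, 8, 12, 15, 16],
    [7, 3, 11, 10, 17, 18, None], [1, 6, 13, 12, 8, 24, None],
    [4, 3, 11, 10, 19, 20, None], [1, 6, 5, 8, 12, 21, 22],
    [4, 3, 9, 10, 23, 24, None], [2, 6, 7, 8, 12, 14, 15],
    [1, 3, 9, 10, 16, 17, None], [5, 6, 13, 12, 8, 18, 19],
    [2, 3, 7, 10, 20, 21, None], [6, 8, 11, 12, 22, 23, None],
]


def framing_functions_per_vnu(mapping):
    """Number of distinct VN degrees (framing functions) seen by each VNU."""
    degree = Counter(v for layer in mapping for v in layer if v is not None)
    n_vnu = len(mapping[0])
    return [len({degree[layer[u]] for layer in mapping
                 if layer[u] is not None})
            for u in range(n_vnu)]


print(framing_functions_per_vnu(FIRST_MAPPING))
print(framing_functions_per_vnu(SECOND_MAPPING))
```

Under these assumptions, the second table leaves three VNUs with two framing functions and four VNUs with a single one, in agreement with the counts stated above.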
B. Full layers architecture
A different possibility to increase throughput is to increase
the hardware parallelism, by including several non-overlapping
rows of the base matrix in one decoding layer. For instance,
for the base matrix in Fig. 3, we may consider RPL = 4
consecutive rows per decoding layer, thus the number of
decoding layers is L = 3. In this case, each column of the base
matrix has one (and only one) non-negative entry in each decoding
layer; such a decoding layer is referred to as being full.
Full layers correspond to the maximum hardware parallelism
that can be exploited by layered architectures, but they also
prevent the pipelining of the data path. One possibility to
implement a full-layer decoder is to use a similar architecture
to the pipelined one, by removing the registers inserted after
the VNU (since pipelining is incompatible with the use of
full-layers), and updating the control unit. However, in such
an architecture, read/write operations from/to the β_memory
would occur at the same memory location, corresponding to
the current layer being processed ℓ. This would require the use
of asynchronous dual-port RAM to implement the β_memory,
which in general is known to be slower than synchronous dual-port
RAM. The architecture proposed in this section, shown
in Fig. 5(b), is aimed at avoiding the use of asynchronous
RAM, while providing an effective way to benefit from the
increased hardware parallelism enabled by the use of full
layers. We discuss below the main changes with respect to the
pipelined architecture from the previous section, consisting of
the α_memory and the barrel shifters blocks (the other blocks
are the same as for the pipelined architecture), as well as a
complete reorganization of the data path. However, it can be
easily verified that both architectures are logically equivalent,
i.e., they both implement the same decoding algorithm.
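Both structural requirements on the base matrix of Fig. 3 (no overlap between consecutive rows, for the pipelined architecture, and full layers for RPL = 4) can be checked with a short Python sketch. The row sets below list the 1-based column positions of the non-negative entries of Fig. 3; the function names are hypothetical, and the wrap-around from the last layer to the first one (next iteration) is included as an assumption.

```python
# Columns (1-based) with a non-negative entry, for each row of Fig. 3.
ROWS = [
    {1, 6, 11, 16, 18, 24}, {4, 5, 10, 13, 19, 21},
    {3, 8, 12, 14, 17, 23}, {2, 7, 9, 15, 20, 22},
    {1, 5, 11, 16, 18, 24}, {3, 6, 12, 14, 17, 21},
    {2, 8, 9, 15, 20, 22}, {4, 7, 10, 13, 19, 23},
    {3, 5, 11, 16, 18, 24}, {1, 7, 12, 14, 17, 21},
    {4, 6, 9, 13, 19, 23}, {2, 8, 10, 15, 20, 22},
]
N_COLS = 24


def consecutive_rows_disjoint(rows):
    """True if no two consecutive rows share a column (cyclically)."""
    n = len(rows)
    return all(rows[i].isdisjoint(rows[(i + 1) % n]) for i in range(n))


def layers_are_full(rows, rpl, n_cols):
    """True if every group of rpl rows hits each column exactly once."""
    for start in range(0, len(rows), rpl):
        group = rows[start:start + rpl]
        if sum(len(r) for r in group) != n_cols:
            return False
        if set().union(*group) != set(range(1, n_cols + 1)):
            return False
    return True


print(consecutive_rows_disjoint(ROWS))   # pipelined architecture, RPL = 1
print(layers_are_full(ROWS, 4, N_COLS))  # full layers, RPL = 4, L = 3
print(layers_are_full(ROWS, 2, N_COLS))  # RPL = 2 does not give full layers
```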
α_memory: This memory is used to store the VN-messages
for the current decoding layer (unlike the previous architecture,
the AP-LLR values are not stored in memory). Since only one
q̃-bit (unsaturated) VN-message is stored for each variable-node,
this memory has exactly the same size as the γ̃_memory
used within the previous pipelined architecture. VN-messages
for the current layer ℓ are read from the α_memory, then saturated
or framed depending on the decoding kernel, and supplied
to the corresponding CNUs. CN-messages computed by the
CNUs are stored in the β_memory (location corresponding
to layer ℓ), and also forwarded to the AP-LLR unit, through
the DCP (decompress) and DE-FRA (de-framing) blocks,
according to the CNU implementation (compressed or
uncompressed) and the decoding kernel (MS or NS-FAID). The
AP-LLR unit computes the sum of the incoming VN- and
CN-messages, which corresponds to the AP-LLR value to be used
at layer ℓ+1 (since already updated by layer ℓ). The AP-LLR
value is forwarded to the VNU, through the corresponding BS
and PER blocks. Eventually, the VN-message for layer ℓ+1 is
computed as the difference between the incoming AP-LLR and
the corresponding layer-(ℓ+1) CN-message computed at the
previous iteration, the latter being read from the β_memory.
PER / BS blocks: PER_1 / BS_1 blocks permute / shift the
data read from the input buffer, according to the
positions / values of the non-negative entries in the first decoding
layer. Similarly to the BS_R blocks in the pipelined
architecture, the PER_WR / BS_WR blocks permute / shift the
AP-LLR values, according to the difference between the
positions / values of the current layer’s (ℓ) non-negative entries
and those of the next layer (ℓ+1). This way, VN-messages
stored in the α_memory are already permuted and shifted for
the subsequent decoding layer. Finally, PER_L / BS_L blocks
permute / shift the hard decision bits (sign of AP-LLR values),
according to the positions / values of the non-negative entries
in the last decoding layer.
[Figure 5: (a) Pipelined Architecture, with timing schedule. Blocks: IN-buffer, DS, BS_INIT, γ̃-memory, PER_R, BS_R, VNU, REG, SAT/FRA, CNU, β-memory, DCP, DE-FRA, AP-LLR, PER_W, BS_INIT (hard-bit output side), Hard bits. β-memory is accessed for layers ℓ−1 and ℓ (1 layer delay due to pipeline). Timing schedule: pipeline stages P1 and P2 of Layer 1, Layer 2, Layer 3, … over clock cycles cc1 to cc6. (b) Full-Layer Architecture. Blocks: IN-buffer, DS, PER_1, BS_1, α-memory (only α-messages for current layer are stored), SAT/FRA, CNU, β-memory, DCP, DE-FRA, AP-LLR, PER_WR, BS_WR, VNU, PER_L, BS_L, Hard bits (sign bit only). β-memory is accessed for layers ℓ and ℓ+1. In both panels: SAT if MS-kernel, FRA if NS-FAID kernel; DCP only if compressed β-messages; DE-FRA only if NS-FAID kernel.]
Figure 5. High-Level Description of the Proposed HW Architectures, with both MS and NS-FAID kernels
V. IMPLEMENTATION RESULTS
This section reports implementation results, as well as the
error correction performance of the implemented codes, in
order to corroborate the analytic results obtained in Section III.
As mentioned in the previous section, both architectures are
logically equivalent, thus they both yield the same decoding
performance (assuming that they implement the same MS/NS-
FAID decoding kernel), but they may have different perfor-
mance in terms of area and throughput.
A. Regular LDPC Codes
We consider the (3, 6)-regular QC-LDPC code with base
matrix B of size R × C = 12 × 24, shown in Fig. 3. The
expansion factor z = 54, thus the codeword length is N =
zC = 1296 bits. The base matrix can be divided in either L =
12 decoding layers (RPL = 1), for the pipelined architecture,
or L = 3 horizontal decoding layers (RPL = 4), for the full-
layer architecture.
Fig. 6(a) shows the Bit-Error Rate (BER) performance of
the MS decoder with quantization parameters (q, q˜) = (4, 6),
as well as q = 4-bit NS-FAIDs with w = 2 and w = 3
(framing functions F corresponding to w and F (0) values
in the legend are those from Table II). Binary input AWGN
channel model is considered, with 20 decoding iterations. It
can be seen that the simulation results corroborate the analytic
results from Section III-A, in terms of SNR gain / loss provided
by NS-FAIDs, as compared to MS. For comparison purposes,
we have further included simulation results for the floating-point
Belief Propagation (BP) decoder [12], as well as the MS
decoder with (3, 5) and (2, 4)-quantization.
ASIC post-synthesis implementation results on 65nm-
CMOS technology are shown in Table V, for the MS(4, 6)
decoder and the NS-FAIDs with (w = 3, F (0) = 0) and
(w = 2, F (0) = ±1), indicated in the table as NS-FAID-3 and
NS-FAID-2, respectively. The first (Variant) row in Table V
indicates the architecture (pipelined or full layers) and the
CNU type (compressed or uncompressed). We also note that
for the NS-FAID-2, the assumption that 0 is mapped to either
−1 or +1, with equal probability, is only needed for theoretical
analysis (the symmetry of the decoder allows reducing the
analysis to the all-zero codeword). However, in practical situ-
ations one may always map 0 to +1, since random codewords
are transmitted (for instance, in telecommunications systems,
pseudo-randomness of the transmitted data is ensured by a
scrambling mechanism).
Throughput reported in Table V is given by the formula:
$$\text{Throughput} = \frac{N \times f_{\max}}{\delta + L \times n_{\text{iter}}}, \qquad (14)$$
where $f_{\max}$ is the maximum operating frequency (post synthesis),
$n_{\text{iter}}$ is the number of decoding iterations (set to 20), and
δ = 1 for the pipelined architecture or δ = 0 for the full layers
architecture. To keep the throughput comparison on an equal
basis, we further define the Throughput to Area Ratio metric
TAR = Throughput / Area (Mbps/mm²).
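As a worked example, Eq. (14) and the TAR gain can be evaluated for some entries of Table V with the Python sketch below (function names are hypothetical; N = 1296, 20 iterations, and the frequency and TAR values are taken from Table V).

```python
def throughput_mbps(n_bits, fmax_mhz, n_layers, n_iter, delta):
    """Eq. (14), with fmax in MHz so that the result is in Mbps."""
    return n_bits * fmax_mhz / (delta + n_layers * n_iter)


def tar_gain_percent(tar, tar_ref):
    """Relative TAR improvement with respect to a reference decoder."""
    return 100.0 * (tar / tar_ref - 1.0)


# MS(4,6), pipelined.uncompressed: L = 12, delta = 1, fmax = 200 MHz.
t_pipelined = throughput_mbps(1296, 200, 12, 20, 1)
# MS(4,6), full layers.uncompressed: L = 3, delta = 0, fmax = 151 MHz.
t_full = throughput_mbps(1296, 151, 3, 20, 0)
print(t_pipelined, t_full)          # about 1075.5 and 3261.6 Mbps

# NS-FAID-2 vs. MS(4,6), full layers.compressed (TAR values of Table V).
print(tar_gain_percent(5715, 3600))  # about 58.75 %
```

The full-layer variants run at a lower clock frequency but need four times fewer layers per iteration, which explains their higher throughput.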
While the NS-FAID-3 decoder outperforms the baseline
MS(4, 6) decoder by 0.19 dB at BER = $10^{-5}$ (Fig. 6(a)), it can
be seen from Table V that it also exhibits a TAR improvement
between 18.88% and 31.61%, depending on the hardware
architecture and CNU type. As predicted, the NS-FAID-2
decoder exhibits a performance loss of 0.21 dB compared
to MS(4, 6), but yields a significant TAR improvement, by
34.37% to 58.75%.
[Figure 6: Bit Error Rate (BER, from 10^-1 down to 10^-6) vs. SNR (dB). (a) (3, 6)-regular LDPC code, SNR from 1 to 4 dB; curves: BP, floating point; MS, (4,6)-quantization; NS-FAID (w=3, F(0)=0); NS-FAID (w=2, F(0)=±1); NS-FAID (w=2, F(0)=0); MS, (3,5)-quantization; MS, (2,4)-quantization. (b) WiMAX irregular LDPC code, SNR from 1 to 3.5 dB; curves: BP, floating point; MS, (4,6)-quant.; NS-FAID-433; NS-FAID-333; NS-FAID-432; NS-FAID-332; NS-FAID-222; MS, (3,5)-quant.]
Figure 6. BER performance of optimized regular and irregular NS-FAIDs
Table V
ASIC POST-SYNTHESIS IMPLEMENTATION RESULTS ON 65NM-CMOS TECHNOLOGY FOR (3, 6) REGULAR LDPC
Variant            | pipelined.uncompressed        | pipelined.compressed          | full layers.uncompressed      | full layers.compressed
Decoder            | MS(4,6)  NS-FAID-3  NS-FAID-2 | MS(4,6)  NS-FAID-3  NS-FAID-2 | MS(4,6)  NS-FAID-3  NS-FAID-2 | MS(4,6)  NS-FAID-3  NS-FAID-2
Max. Freq. (MHz)   | 200      222        227       | 175      200        208       | 151      172        192       | 125      147        172
Throughput (Mbps)  | 1075     1193       1220      | 941      1075       1118      | 3261     3715       4147      | 2700     3175       3715
Area (mm²)         | 0.45     0.42       0.38      | 0.41     0.38       0.36      | 0.80     0.72       0.68      | 0.75     0.67       0.65
TAR (Mbps/mm²)     | 2389     2840       3210      | 2295     2828       3105      | 4076     5159       6098      | 3600     4738       5715
±% w.r.t. MS(4,6)  | 0        +18.88     +34.37    | 0        +23.22     +35.29    | 0        +26.57     +49.61    | 0        +31.61     +58.75
B. Irregular LDPC Codes
We consider the irregular WiMAX QC-LDPC code of rate
1/2, with base matrix of size R × C = 12 × 24 [21].
The expansion factor z = 96, thus resulting in a codeword
length N = zC = 2304 bits. The pipelined architecture from
Section IV-A is implemented, with RPL = 1 row per decoding
layer, after reordering the rows of the base matrix, such that
any two consecutive rows do not overlap [24]. Note that the
full layers architecture does not apply to irregular WiMAX
LDPC codes, since it is not possible to group the rows of the
base matrix in full decoding layers.
BER results for the MS(4, 6) decoder and the NS-FAID-w2w3w6
decoders from Table III are shown in Fig. 6(b), while
ASIC post-synthesis implementation results on 65nm-CMOS
technology are shown in Table VI. The throughput reported
is computed using Eq. (14). TAR results and corresponding
gain/loss (+/−) with respect to the MS(4, 6) decoder are
reported in the last two rows. The NS-FAID-433 and NS-FAID-432
decoders outperform the MS decoder by 0.3 dB and 0.15 dB (at
BER = $10^{-5}$), respectively, at the price of a small degradation
of the TAR. NS-FAID-333 improves the BER performance
by 0.12 dB, with TAR improvement by 13.51% to 16.39%,
depending on the CNU type (compressed or uncompressed).
NS-FAID-332 exhibits similar BER performance, with TAR
improvement by 13.51% to 19.30%. The NS-FAID-222 decoder
yields the most significant TAR improvement (up to
42.09%), but this comes at the price of a significant BER
degradation by ≈ 1 dB as estimated in Section III-B.
To further emphasize the high-throughput characteristic of
the proposed architecture, the irregular NS-FAID-332 decoder
is further compared with other state of the art implementations
of WiMAX decoders in Table VII. We also report the TAR
and Normalized TAR (NTAR) metrics, so as to keep the
throughput comparison on an equal basis with respect to
technology, area, and number of iterations. To scale throughput
and area to 65nm, we use scale factors (technology size/65)
and (65/technology size)², respectively, as suggested in [28].
The computation of the TAR and NTAR metrics is detailed
in the footnote to Table VII. Note that for all the reported
implementations, the achieved throughput is inversely
proportional to the number of iterations, hence the NTAR metric
corresponds to the TAR value assuming that only one decoding
iteration is performed. We mention that the decoders proposed
in [24], [27] are reconfigurable decoders that support the IEEE
802.16e (WiMAX) and the IEEE 802.11n (WiFi) wireless
standards. The reported throughput is the maximum achievable
coded throughput for the (1152, 2304) WiMAX code, with
either 10 or 5 decoding iterations. From Table VII it can be
seen that the proposed irregular NS-FAID compares favorably
with state of the art implementations, yielding an NTAR value
of 45.86 Gbps/mm²/iteration.
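The normalization used in Table VII can be reproduced with the following Python sketch (function names are hypothetical; the scale factors and the TAR/NTAR definitions are those stated above and in the footnote to Table VII). Small differences with respect to the table come from the rounding of intermediate values.

```python
def scale_to_65nm(throughput_mbps, area_mm2, tech_nm):
    """Scale throughput by (tech/65) and area by (65/tech)**2."""
    return (throughput_mbps * tech_nm / 65.0,
            area_mm2 * (65.0 / tech_nm) ** 2)


def tar_and_ntar(throughput_mbps, area_mm2, tech_nm, iterations):
    """TAR (Mbps/mm2) and NTAR (Mbps/mm2/iteration) after scaling."""
    t_scaled, a_scaled = scale_to_65nm(throughput_mbps, area_mm2, tech_nm)
    tar = t_scaled / a_scaled
    return tar, tar * iterations


# W. Zhang'15 [24]: 2227 Mbps, 2.26 mm2, 40 nm, 10 iterations.
print(scale_to_65nm(2227, 2.26, 40))      # about (1370.5, 5.97)
print(tar_and_ntar(2227, 2.26, 40, 10))   # about (229.6, 2296)

# This work, NS-FAID-332: 1835 Mbps, 0.80 mm2, 65 nm, 20 iterations.
print(tar_and_ntar(1835, 0.80, 65, 20))   # about (2293.8, 45875)
```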
Table VI
ASIC POST-SYNTHESIS IMPLEMENTATION RESULTS ON 65NM-CMOS TECHNOLOGY FOR WIMAX LDPC
Variant            | pipelined.uncompressed                                                    | pipelined.compressed
Decoder            | MS(4,6)  NS-FAID-433  NS-FAID-432  NS-FAID-333  NS-FAID-332  NS-FAID-222 | MS(4,6)  NS-FAID-433  NS-FAID-432  NS-FAID-333  NS-FAID-332  NS-FAID-222
Max. Freq. (MHz)   | 175      172          178          192          192          200         | 161      156          161          178          178          200
Throughput (Mbps)  | 1673     1644         1701         1835         1835         1912        | 1539     1491         1539         1701         1701         1912
Area (mm²)         | 0.87     0.88         0.90         0.82         0.80         0.70        | 0.77     0.79         0.79         0.75         0.75         0.72
TAR (Mbps/mm²)     | 1922     1868         1890         2237         2293         2731        | 1998     1887         1948         2268         2268         2655
±% w.r.t. MS(4,6)  | 0        -2.81        -1.66        +16.39       +19.30       +42.09      | 0        -5.56        -2.50        +13.51       +13.51       +32.88
Table VII
COMPARISON BETWEEN THE PROPOSED NS-FAID AND STATE OF THE ART IMPLEMENTATIONS FOR THE WIMAX QC-LDPC CODE
Decoders                     | K. Zhang’09 [25] | B. Xiang’11 [6] | T. Heidari’13 [26] | W. Zhang’15 [24] | K. Kanchetla’16 [27] | This work NS-FAID-332
Code length                  | 2304             | 576-2304        | 2304               | 576-2304(†)      | 576-2304(†)          | 2304
Technology (nm)              | 90               | 130             | 130                | 40               | 90                   | 65
Frequency (MHz)              | 950              | 214             | 100                | 290              | 149                  | 192
Iterations                   | 10               | 10              | 10                 | 10               | 5                    | 20
Throughput (Mbps)            | 2200             | 955             | 183                | 2227             | 955                  | 1835
Tput scaled to 65nm (Mbps)   | 3036             | 1910            | 366                | 1370             | 1318                 | 1835
Area (mm²)                   | 2.90(∗)          | 3.03(∗)         | 6.90(∗∗)           | 2.26(∗)          | 11.42(∗)             | 0.80(∗)
Area scaled to 65nm (mm²)    | 1.51(∗)          | 0.76(∗)         | 1.73(∗∗)           | 5.97(∗)          | 5.94(∗)              | 0.80(∗)
TAR (Mbps/mm²)               | 2011             | 2513            | 212                | 229              | 222                  | 2293
NTAR (Mbps/mm²/iter)         | 20110            | 25130           | 2120               | 2290             | 1110                 | 45860
(†) support both WiMAX and Wi-Fi standards
(∗) only core area is reported
(∗∗) total chip area is reported
TAR = (Throughput scaled to 65nm) / (Area scaled to 65nm)
NTAR = TAR × Iterations
VI. CONCLUSION
In this paper, we first introduced the new framework of
Non-Surjective FAIDs, which allows trading off decoding
performance for hardware complexity reductions. NS-FAIDs
have been optimized by density evolution and shown to
provide significant memory size reductions, with similar or even
better decoding performance, as compared to the MS decoder.
Then, two hardware architectures have been presented, making
use of either pipelining or increased hardware parallelism in
order to increase throughput. Both MS and NS-FAID decoding
kernels have been integrated into each of the two proposed
architectures, and compared in terms of area and throughput.
ASIC post synthesis implementation results demonstrated the
effectiveness of the NS-FAID approach in yielding significant
improvements in terms of area and throughput, as compared
to the MS decoder, with even better or only slightly degraded
decoding performance.
ACKNOWLEDGMENT
The authors acknowledge support from the European H2020
Work Programme, project Flex5Gware, and the Franco-
Romanian (ANR-UEFISCDI) Joint Research Programme
“Blanc-2013”, project DIAMOND.
REFERENCES
[1] M. Karkooti and J. R. Cavallaro, “Semi-parallel reconfigurable archi-
tectures for real-time LDPC decoding,” in Proc. of Int. Conf. on Inf.
Technology: Coding and Computing (ITCC), vol. 1, 2004, pp. 579–585.
[2] X. Chen, J. Kang, S. Lin, and V. Akella, “Memory system optimization
for FPGA-based implementation of quasi-cyclic LDPC codes decoders,”
IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 58,
no. 1, pp. 98–111, 2011.
[3] V. A. Chandrasetty and S. M. Aziz, “Resource efficient LDPC decoders
for multimedia communication,” Integration, the VLSI Journal,
vol. 48, pp. 213–220, 2015.
[4] K. Zhang, X. Huang, and Z. Wang, “High-throughput layered decoder
implementation for quasi-cyclic LDPC codes,” IEEE Journal on Selected
Areas in Communications, vol. 27, no. 6, pp. 985–994, 2009.
[5] X. Peng, Z. Chen, X. Zhao, D. Zhou, and S. Goto, “A 115mW 1Gbps
QC-LDPC decoder ASIC for WiMAX in 65nm CMOS,” in IEEE Asian
Solid State Circuits Conference (A-SSCC), 2011, pp. 317–320.
[6] B. Xiang, D. Bao, S. Huang, and X. Zeng, “An 847–955 Mb/s 342–397
mW dual-path fully-overlapped QC-LDPC decoder for WiMAX system
in 0.13µm CMOS,” IEEE Journal of Solid-State Circuits, vol. 46, no. 6,
pp. 1416–1432, 2011.
[7] E. Boutillon and G. Masera, Channel coding: Theory, algorithms, and
applications. Elsevier, 2014, ch. Hardware Design and Realization for
Iteratively Decodable Codes, pp. 583–642.
[8] S. K. Planjery, S. K. Chilappagari, B. Vasić, D. Declercq, and
L. Danjean, “Iterative decoding beyond belief propagation,” in IEEE
Information Theory and Applications Workshop (ITA), 2010, pp. 1–10.
[9] S. K. Planjery, D. Declercq, L. Danjean, and B. Vasić, “Finite alphabet
iterative decoders for LDPC codes surpassing floating-point iterative
decoders,” IET Electronics Letters, vol. 47, no. 16, pp. 919–921, 2011.
[10] ——, “Finite alphabet iterative decoders – part I: Decoding beyond
belief propagation on the binary symmetric channel,” IEEE Transactions
on Communications, vol. 61, no. 10, pp. 4033–4045, 2013.
[11] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X. Hu,
“Reduced-complexity decoding of LDPC codes,” IEEE Trans. on Com-
munications, vol. 53, no. 8, pp. 1288–1299, 2005.
[12] V. Savin, Channel coding: Theory, algorithms, and applications. El-
sevier, 2014, ch. LDPC Decoders, pp. 211–260.
[13] T. Nguyen-Ly, K. Le, F. Ghaffari, A. Amaricai, O. Boncalo, V. Savin,
and D. Declercq, “FPGA design of high throughput LDPC decoder
based on imprecise offset min-sum decoding,” in IEEE International
New Circuits And Systems Conference (NEWCAS), June 2015.
[14] D. Oh and K. K. Parhi, “Min-sum decoder architectures with reduced
word length for LDPC codes,” IEEE Transactions on Circuits and
Systems I: Regular Papers, vol. 57, no. 1, pp. 105–115, 2010.
[15] V. A. Chandrasetty and S. M. Aziz, “An area efficient LDPC decoder
using a reduced complexity min-sum algorithm,” Integration, the VLSI
Journal, vol. 45, no. 2, pp. 141–148, 2012.
[16] S. Abu-Surra, E. Pisek, T. Henige, and S. Rajagopal, “Low-power
dual quantization-domain decoding for LDPC codes,” in IEEE Global
Communications Conference (GLOBECOM), 2014, pp. 3151–3156.
[17] T. T. Nguyen-Ly, K. Le, V. Savin, D. Declercq, F. Ghaffari, and
O. Boncalo, “Non-surjective finite alphabet iterative decoders,” in IEEE
Int. Conference on Communications (ICC), 2016, pp. 1–6.
[18] T. T. Nguyen-Ly, V. Savin, X. Popon, and D. Declercq, “High throughput
FPGA implementation for regular non-surjective finite alphabet iterative
decoders,” in IEEE Int. Conference on Communications (ICC), Work-
shop on Channel Coding for 5G and Future Networks, 2017, pp. 1–6.
[19] Z. Mheich, T. Nguyen-Ly, V. Savin, and D. Declercq, “Code-aware
quantizer design for finite-precision min-sum decoders,” in IEEE In-
ternational Black Sea Conference on Communications and Networking
(BlackSeaCom), Varna, Bulgaria, June 2016.
[20] T. Richardson and R. Urbanke, “The capacity of low-density parity-
check codes under message-passing decoding,” IEEE Trans. on Inf.
Theory, vol. 47, no. 2, pp. 599–618, 2001.
[21] IEEE-802.16e, “Physical and medium access control layers for com-
bined fixed and mobile operation in licensed bands,” 2005, amendment
to Air Interface for Fixed Broadband Wireless Access Systems.
[22] Z. Wang and Z. Cui, “A memory efficient partially parallel decoder
architecture for quasi-cyclic LDPC codes,” IEEE Trans. on Very Large
Scale Integration (VLSI) Systems, vol. 15, no. 4, pp. 483–488, 2007.
[23] C.-L. Wey, M.-D. Shieh, and S.-Y. Lin, “Algorithms of finding the
first two minimum values and their hardware implementation,” IEEE
Transactions on Circuits and Systems I: Regular Papers, vol. 55, no. 11,
pp. 3430–3437, 2008.
[24] W. Zhang, S. Chen, X. Bai, and D. Zhou, “A full layer parallel QC-
LDPC decoder for WiMAX and Wi-Fi,” in 2015 IEEE 11th International
Conference on ASIC (ASICON), 2015, pp. 1–4.
[25] K. Zhang, X. Huang, and Z. Wang, “High-throughput layered decoder
implementation for quasi-cyclic LDPC codes,” IEEE Journal on Selected
Areas in Communications, vol. 27, no. 6, pp. 985–994, 2009.
[26] T. Heidari and A. Jannesari, “Design of high-throughput QC-LDPC
decoder for WiMAX standard,” in 21st Iranian Conference on Electrical
Engineering (ICEE), 2013, pp. 1–4.
[27] V. K. Kanchetla, R. Shrestha, and R. Paily, “Multi-standard high-
throughput and low-power quasi-cyclic low density parity check decoder
for worldwide interoperability for microwave access and wireless fidelity
standards,” IET Circuits, Devices & Systems, vol. 10, no. 2, 2016.
[28] J. R. Hauser, “MOSFET device scaling,” in Handbook of Semiconductor
Manufacturing Technology. Boca Raton, FL: CRC Press, 2008.
Truong Nguyen-Ly received his B.S. and M.S.
degrees in Electronic and Telecommunication en-
gineering from Ho Chi Minh City University of
Technology (HCMUT), Vietnam, in 2010 and 2012,
respectively. From 2010 to 2014, he was a lecturer
at the Faculty of Electrical and Electronic Engineering,
HCMUT, Vietnam. He is currently working toward
the Ph.D. degree in Telecommunications Engineer-
ing at the Broadband Wireless Systems Laboratory,
CEA-LETI, MINATEC Campus, and ETIS EN-
SEA/UCP/CNRS UMR-8051, France. His research
interests include error-correction coding, analysis and implementation of
LDPC decoder architectures on FPGA/ASIC platforms, and speech processing.
Valentin Savin received his Master’s Degree in
Mathematics from the “École Normale Supérieure”
of Lyon in 1997, and his PhD in Mathematics
from J. Fourier Institute, Grenoble, in October 2001.
He also holds a Master’s Degree in Cryptography,
Security and Coding Theory from the University of
Grenoble 1. Since 2005, he has been with the Digital
Communications Laboratory of CEA-LETI, first as a
two-year postdoctoral fellow, and then as a research
engineer. In 2016, he was appointed CEA
Senior Expert on information and coding theory.
In recent years, he has been working on the design of low-complexity
decoding algorithms for LDPC and Polar codes, and on the analysis and the
optimization of LDPC codes for physical and upper-layers applications. He
has published more than 70 papers in international journals and conference
proceedings, holds 10 patents, and is currently participating in or coordinating
several French and European research projects in ICT.
Khoa Le received his Bachelor and Master of Science
degrees in Electronics and Telecommunication
Engineering from Ho Chi Minh City University of
Technology (HCMUT), Vietnam, in 2010 and 2012,
respectively. He is working toward the Ph.D. degree
at ETIS Laboratory, ENSEA, University of Cergy-
Pontoise, CNRS UMR-8051, France. His research
interests are in error-correcting code algorithms,
their analysis, and their implementation on FPGA/ASIC.
David Declercq was born in June 1971. He received
his PhD in Statistical Signal Processing in 1998
from the University of Cergy-Pontoise, France. He
is currently a full professor at the ENSEA in Cergy-
Pontoise. He is the general secretary of the National
GRETSI association, and a Senior Member of the
IEEE. He held a junior position at the “Institut
Universitaire de France” from 2009 to 2014. His
research topics lie in digital communications and
error-correction coding theory. He worked for several
years on the particular family of LDPC codes, from
both the code and decoder design aspects. Since 2003, he has developed a strong
expertise in non-binary LDPC codes and decoders in high-order Galois
fields GF(q). A large part of his research projects is related to non-binary
LDPC codes. He has mainly investigated two aspects: (i) the design of GF(q)
LDPC codes for short and moderate lengths, and (ii) the simplification of
the iterative decoders for GF(q) LDPC codes with complexity/performance
tradeoff constraints. David Declercq has published more than 40 papers in major
journals (IEEE Trans. Commun., IEEE Trans. Inf. Theory, Commun. Letters,
EURASIP JWCN), and more than 120 papers in major conferences in
Information Theory and Signal Processing.
Fakhreddine Ghaffari received the Electrical En-
gineering and Master degrees from the National
School of Electrical Engineering (ENIS, Tunisia),
in 2001 and 2002, respectively. He received the
Ph.D. degree in electronics and electrical engineering
from the University of Sophia Antipolis, France in
2006. He is currently an Associate Professor at the
University of Cergy Pontoise, France. His research
interests include VLSI design and implementation
of reliable digital architectures for wireless commu-
nication applications in ASIC/FPGA platforms and
the study of mitigating transient faults from algorithmic and implementation
perspectives for high-throughput applications.
Oana Boncalo received her B.Sc. and Ph.D. degrees
in Computer Engineering from the University Politehnica
Timisoara, Romania, in 2006 and 2009,
respectively. She is currently an Associate Professor
at University Politehnica Timisoara. She has published
over 50 research papers on topics related to
digital design. Her research interests include computer
arithmetic, LDPC decoder architectures, digital
design, and reliability estimation and evaluation.
Note concerning prior work: A preliminary version of part of this work was
previously published in [17], [18]. In this paper, the previous definition
and density-evolution analysis of NS-FAIDs [17] are extended to framing
functions with F(0) = ±λ, so as to cover a larger class of decoders,
which is shown to significantly improve the decoding performance when
the exchanged messages are quantized on a small number of bits (e.g., 2
bits per exchanged message). Optimization results presented in Section III are
new, and they report on the optimization of regular and irregular NS-FAIDs,
by taking into account the proposed extension. The hardware architectures
proposed in [18] have been extended to cover the case of irregular NS-FAIDs.
In addition, implementation results reported in this paper target an ASIC
technology, which is more likely to reflect the benefits of the proposed NS-
FAID approach in terms of throughput/area trade-off. All the implementation
results reported in Section V (for both regular and irregular codes) are new.
