A Systolic LLR Generation Architecture For Non-Binary LDPC Decoders by Al Ghouwayel, Ali & Boutillon, Emmanuel
To cite this version:
Ali Al Ghouwayel, Emmanuel Boutillon. A Systolic LLR Generation Architecture For Non-
Binary LDPC Decoders. IEEE Communications Letters, Institute of Electrical and Electronics
Engineers, 2011, 15 (8), pp 851-853. <hal-00608287>
HAL Id: hal-00608287
https://hal.archives-ouvertes.fr/hal-00608287
Submitted on 12 Jul 2011
Universite´ Europe´enne de Bretagne, UBS, Lab-STICC CNRS
56100 Lorient, FRANCE
Tel: +33-2-9787-4566, Fax: +33-2-9787-4500
E-mail: emmanuel.boutillon@univ-ubs.fr
Abstract—Non-binary LDPC codes offer better performance than their binary counterparts but suffer from higher decoding complexity. One way to reduce the decoding complexity is to use the Extended Min-Sum algorithm. The first step of this algorithm requires generating, for each received symbol, the $n_m$ most reliable Log-Likelihood Ratios (LLRs), sorted by reliability. For the case where GF(q) symbols are transmitted using a BPSK modulation, we propose a simple systolic architecture that generates the sorted list of symbols.
I. INTRODUCTION
Non-binary Low Density Parity Check (NB-LDPC) codes over GF(q), with q > 2, are known to outperform binary LDPC codes at short and medium block lengths [1]. Moreover, for high-spectral-efficiency modulations, channel symbols can be mapped directly to code symbols, which avoids the information loss caused by the marginalisation process used in the binary LDPC case (a gain of 1.6 dB is reported for the MIMO case in [2]).
The advantages of non-binary coding come at the expense of increased hardware complexity. However, the Extended Min-Sum (EMS) algorithm proposed by Declercq and Fossorier [3] significantly reduces the computational complexity of the decoding process while maintaining good performance. Instead of processing the q LLR values associated with the complete list of GF(q) symbols, this algorithm sorts the LLR messages, processes only the $n_m$ most reliable values, and applies an offset that compensates for the truncated part of the messages. One of the key components of an NB-LDPC decoder implementing the EMS algorithm is therefore the LLR generation circuit and its accompanying sorter. This paper considers the design and hardware implementation of an efficient circuit dedicated to generating the LLR values in sorted order.
The rest of this paper proceeds as follows. In Section II, we review some basic notions and definitions related to the LLR computation and we define a new simplified formula allowing the generation of the LLR values in an ordered list. In Section III, we discuss the proposed sorting algorithm. Section IV describes the hardware design and gives a complexity evaluation of the proposed LLR generation circuit implemented on Virtex 4 FPGA devices. Finally, a conclusion ends the paper.
This work is supported by INFSCO-ICT-216203 DAVINCI "Design And Versatile Implementation of Non-binary wireless Communications based on Innovative LDPC Codes" [www.ict-davinci-code.eu], funded by the European Commission under the Seventh Framework Programme (FP7).
II. DEFINITION OF THE LOG-LIKELIHOOD RATIO
This section describes the algorithm implemented in the LLR circuit, which provides an ordered list of the $n_m$ smallest LLR values and their associated GF(q) symbols from the m binary LLR values received from the channel.
Let $X = (x_0\ x_1 \ldots x_{m-1})$ be a GF($q = 2^m$) symbol composed of $m = \log_2(q)$ binary symbols, and let $Y = (y_0\ y_1 \ldots y_{m-1})$ be the symbol received from the channel, so that $y_i = B(x_i) + w_i$ for $i = 0, \ldots, m-1$, where $w_i$ is a realization of a white Gaussian noise of variance $\sigma^2$ and $B(x) = 2x - 1$ is the BPSK modulation that maps bit $x = 0$ to $-1$ and bit $x = 1$ to $+1$. For an Additive White Gaussian Noise channel with a noise of variance $\sigma^2$, $\ln(P(Y|X))$ is given by:
$$\ln(P(Y|X)) = m \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{1}{2\sigma^2} \sum_{i=0}^{m-1} \left(y_i - B(x_i)\right)^2. \tag{1}$$
Let $\tilde{X}$ be the symbol of GF(q) that maximizes $\ln(P(Y|X))$, i.e. $\tilde{X} = \arg\max_{X \in GF(q)} P(Y|X)$. Using equation (1), $\tilde{X}$ is given by $\tilde{X} = (\tilde{x}_i = D(y_i))_{i=0,\ldots,m-1}$, where $D$ denotes the hard decision on $y$, i.e. $D(y) = 0$ if $y < 0$ and $D(y) = 1$ otherwise.
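With the hypothesis that the GF(q) symbols are equiprobable, the reliability $L(X)$ of a symbol $X$ may be defined as
$$L(X) = \ln\left(\frac{P(Y|X)}{P(Y|\tilde{X})}\right) \tag{2}$$
$$L(X) = -\frac{1}{2\sigma^2} \sum_{i=0}^{m-1} \left[\left(y_i - B(x_i)\right)^2 - \left(y_i - B(\tilde{x}_i)\right)^2\right]. \tag{3}$$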
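By definition of $\tilde{X}$, $L(X)$ is never positive. In order to deal with positive numbers, the quantity $\bar{L}(X) = -L(X)$ will be used (it is written simply $L(X)$ in the remainder of the paper):
$$\bar{L}(X) = \frac{2}{\sigma^2} \sum_{i=0}^{m-1} |y_i| \Delta_i, \tag{4}$$
where $\Delta_i = x_i$ XOR $\tilde{x}_i$, i.e. $\Delta_i = 0$ if $x_i$ and $\tilde{x}_i$ are equal, and 1 otherwise.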
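The closed form in terms of $|y_i|$ and $\Delta_i$ can be checked numerically against the difference of Gaussian log-likelihoods. The following is a minimal Python sketch, assuming the standard AWGN log-likelihood for BPSK; the function names (`log_likelihood`, `hard_decision`, `positive_llr`) and the sample values of `y` and `sigma` are illustrative choices, not part of the original design.

```python
import itertools
import math


def log_likelihood(y, x, sigma):
    """ln P(Y|X) for BPSK (bit 0 -> -1, bit 1 -> +1) over an AWGN channel."""
    m = len(y)
    const = m * math.log(1.0 / (sigma * math.sqrt(2.0 * math.pi)))
    dist = sum((yi - (2 * xi - 1)) ** 2 for yi, xi in zip(y, x))
    return const - dist / (2.0 * sigma ** 2)


def hard_decision(y):
    """D(y) = 0 if y < 0, 1 otherwise, applied to every coordinate."""
    return tuple(0 if yi < 0 else 1 for yi in y)


def positive_llr(y, x, sigma):
    """Closed form (2 / sigma^2) * sum of |y_i| over the bits where x differs from D(y)."""
    x_hat = hard_decision(y)
    mismatch = sum(abs(yi) for yi, xi, xh in zip(y, x, x_hat) if xi != xh)
    return (2.0 / sigma ** 2) * mismatch


# Arbitrary test point (assumed values, chosen only for illustration).
y = (-0.7, 0.8, 1.2, -0.3)
sigma = 0.5
x_hat = hard_decision(y)

# Compare both expressions for all 2^m symbols.
for x in itertools.product((0, 1), repeat=len(y)):
    direct = log_likelihood(y, x_hat, sigma) - log_likelihood(y, x, sigma)
    assert math.isclose(direct, positive_llr(y, x, sigma), abs_tol=1e-9)
```

The check relies on the identity $(|y| + 1)^2 - (|y| - 1)^2 = 4|y|$, which is what turns each mismatching bit into a $|y_i|$ term.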
This approach, which inverts the logic of the conventional LLR computation, associates the lowest LLR value with the most reliable GF(q) symbol. Its main advantage is that it avoids the normalization step needed to maintain the numerical stability of the extrinsic LLRs during the decoding process.
The first step of the EMS algorithm [3] requires the generation of the $n_m$ smallest values of $L(X)$, which is not a trivial problem. An elegant algorithm has been proposed by Valembois and Fossorier in [5]. Unfortunately, this algorithm is software oriented, since it builds the LLR list dynamically, and it is not well suited to hardware implementation, especially when a parallel approach is considered. A hardware-oriented algorithm is therefore needed now that the hardware implementation of NB-LDPC decoders has become feasible.
III. PROPOSED ALGORITHM
Let us tackle the problem of generating the $n_m$ lowest $L(X)$ values and their associated GF(q) symbols, sorted in increasing order. In this paper, we propose an iterative construction of the list by considering, at stage $c$, only the first $c$ coordinates of $X$ ($c$ varying from 1 to $m$). Let us define $L(X^c) = \frac{2}{\sigma^2} \sum_{i=0}^{c-1} |y_i| \Delta_i$ as the partial LLR of $X$ using only its first $c$ coordinates. Let $\lambda_c$ be the set of couples $(L(X^c), X^c)$ sorted in increasing order of the $L(X^c)$ values. For the sake of simplicity, the constant multiplicative factor $2/\sigma^2$ will be omitted in the following. For $c = 1$, $\lambda_1$ is then equal to $\lambda_1 = \{(0, d_0), (|y_0|, \bar{d}_0)\}$, where $d_i = D(y_i)$ is the hard decision on $y_i$ and $\bar{d}$ means not($d$). At stage $c = 2$, the list $\lambda_2$ can be constructed from $\lambda_1$ in two steps. The first step is the expansion of $\lambda_1$ into two lists $\lambda_2^0$ and $\lambda_2^1$ by appending the second coordinate with value $(0, d_1)$ and $(|y_1|, \bar{d}_1)$. The obtained lists are respectively
$$\lambda_2^0 = \{(0, d_0 d_1), (|y_0|, \bar{d}_0 d_1)\} \tag{5}$$
$$\lambda_2^1 = \{(|y_1|, d_0 \bar{d}_1), (|y_0| + |y_1|, \bar{d}_0 \bar{d}_1)\}. \tag{6}$$
The second step is the merging of $\lambda_2^0$ and $\lambda_2^1$ to create $\lambda_2$. For example, if $|y_1| > |y_0|$, then $\lambda_2 = \{(0, d_0 d_1), (|y_0|, \bar{d}_0 d_1), (|y_1|, d_0 \bar{d}_1), (|y_0| + |y_1|, \bar{d}_0 \bar{d}_1)\}$.
In the general case, during the expansion process, the $k$th couple $(L(X^{c-1})(k), X^{c-1}(k))$ of $\lambda_{c-1}$ is expanded as:
$$\lambda_c^0(k) = \left(L(X^{c-1})(k) + 0,\ X^{c-1}(k) \,\&\, d_{c-1}\right) \tag{7}$$
$$\lambda_c^1(k) = \left(L(X^{c-1})(k) + |y_{c-1}|,\ X^{c-1}(k) \,\&\, \bar{d}_{c-1}\right), \tag{8}$$
where & stands for the binary append operator.
Since both $\lambda_c^0$ and $\lambda_c^1$ are sorted, the creation of $\lambda_c$ is simple.
These two ordered lists are iteratively compared. At each step, each list presents its smallest remaining LLR value. The two LLR values are compared, and the couple with the smaller LLR is removed from its list and appended to the output list $\lambda_c$. This algorithm can be performed efficiently in a serial pipeline architecture. Note that, since only the $n_m$ lowest values are required by the EMS algorithm, the size $s(c)$ of list $\lambda_c$ is equal to $\min(2^c, n_m)$.
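The expansion and merge steps described above can be sketched in software as follows. This is a minimal Python illustration, not the authors' hardware description: the function name `generate_sorted_llrs` and the bit-string representation of symbols are assumptions made for the example, ties are resolved in favour of the list that appends the hard-decision bit, and the $2/\sigma^2$ factor is omitted as in the text.

```python
def generate_sorted_llrs(y, n_m):
    """Return the n_m smallest (LLR, symbol) couples, sorted by increasing LLR.

    Symbols are bit strings with coordinate 0 first. The list is built
    coordinate by coordinate and truncated to min(2^c, n_m) at stage c.
    """
    d = [0 if yi < 0 else 1 for yi in y]  # hard decisions d_i = D(y_i)
    current = [(0, "")]                   # empty prefix before stage 1
    for c, yi in enumerate(y, start=1):
        size = min(2 ** c, n_m)
        bit = d[c - 1]
        # Expansion: both lists inherit the ordering of the previous list.
        list0 = [(llr, sym + str(bit)) for llr, sym in current]
        list1 = [(llr + abs(yi), sym + str(1 - bit)) for llr, sym in current]
        # Merge: repeatedly take the smaller of the two list heads.
        merged = []
        i = j = 0
        while len(merged) < size:
            take0 = j >= len(list1) or (
                i < len(list0) and list0[i][0] <= list1[j][0]
            )
            if take0:
                merged.append(list0[i])
                i += 1
            else:
                merged.append(list1[j])
                j += 1
        current = merged
    return current


if __name__ == "__main__":
    for llr, symbol in generate_sorted_llrs((-7, 8, 12, -3), 10):
        print(llr, symbol)
```

Truncating every intermediate list to $n_m$ entries is safe because each couple of $\lambda_c$ descends from a couple of $\lambda_{c-1}$ whose partial LLR is no larger.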
Let us consider the following example of LLR generation for $m = 4$, $n_m = 10$ and $Y = (-7, 8, 12, -3)$. The hard decoding of $Y$ leads to $D = (0, 1, 1, 0)$. Since $|y_0| < |y_1|$, the list $\lambda_2$ is equal to $\lambda_2 = \{(0, 01), (7, 11), (8, 00), (15, 10)\}$.
The expansion of $\lambda_2$ gives $\lambda_3^0 = \{(0, 011), (7, 111), (8, 001), (15, 101)\}$ and $\lambda_3^1 = \{(12, 010), (19, 110), (20, 000), (27, 100)\}$. Merging those two lists gives $\lambda_3 = \{(0, 011), (7, 111), (8, 001), (12, 010), (15, 101), (19, 110), (20, 000), (27, 100)\}$.
The expansion of $\lambda_3$ gives $\lambda_4^0 = \{(0, 0110), (7, 1110), (8, 0010), (12, 0100), (15, 1010), (19, 1100), (20, 0000), (27, 1000)\}$ and $\lambda_4^1 = \{(3, 0111), (10, 1111), (11, 0011), (15, 0101), (18, 1011), (22, 1101), (23, 0001), (30, 1001)\}$. Taking the $n_m = 10$ lowest values of $\lambda_4^0$ and $\lambda_4^1$ leads to the final list: $\lambda_4 = \{(0, 0110), (3, 0111), (7, 1110), (8, 0010), (10, 1111), (11, 0011), (12, 0100), (15, 1010), (15, 0101), (18, 1011)\}$.
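As a cross-check of this example, the LLR of every GF(16) symbol can be computed exhaustively and sorted. The sketch below is a brute-force reference written for illustration only; the name `direct_sorted_llrs` is hypothetical, and symbols with equal LLR may come out in a different order than in the list above.

```python
from itertools import product


def direct_sorted_llrs(y, n_m):
    """Enumerate all 2^m symbols, compute each LLR, sort, keep the n_m smallest."""
    d = [0 if yi < 0 else 1 for yi in y]  # hard decisions
    couples = []
    for x in product((0, 1), repeat=len(y)):
        # LLR = sum of |y_i| over the coordinates that differ from the hard decision.
        llr = sum(abs(yi) for yi, xi, di in zip(y, x, d) if xi != di)
        couples.append((llr, "".join(str(bit) for bit in x)))
    couples.sort(key=lambda couple: couple[0])  # stable sort on the LLR only
    return couples[:n_m]


if __name__ == "__main__":
    print([llr for llr, _ in direct_sorted_llrs((-7, 8, 12, -3), 10)])
```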
The direct approach (computing all q LLRs and sorting them to extract the first $n_m$ values) has a complexity on the order of $q \log_2(q)$. With the proposed method, the overall complexity is reduced to the order of $n_m \log_2(q)$. Moreover, the proposed algorithm can be implemented in a simple systolic hardware architecture.
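As an order-of-magnitude illustration of these two figures, consider the GF(64) configuration with $n_m = 12$ reported in Section IV-B. Constant factors are ignored, so the numbers below are not exact operation counts:

```latex
q \log_2 q = 64 \times 6 = 384, \qquad
n_m \log_2 q = 12 \times 6 = 72, \qquad
\frac{q \log_2 q}{n_m \log_2 q} = \frac{q}{n_m} = \frac{64}{12} \approx 5.3
```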
IV. HARDWARE ARCHITECTURE OF THE LLR CIRCUIT
In this section we describe the systolic hardware architecture
serially generating the sorted LLR list. This architecture is
composed of m stages working in pipeline mode and each
stage consists of one Processing Element (PE). Figure 1-a
illustrates the architecture for m = 4. As shown in this Figure,
the cth PE receives two inputs: the channel binary observation
yc−1 with its corresponding sign and the list λc−1 from the
(c − 1)th PE. As an output, the cth PE generates the list λc
to be fed to the (c+ 1)th PE of the next stage.
A. Structure of the PE
The PE constitutes the core of the proposed LLR circuit. It consists of two expansion modules, two First-In-First-Out (FIFO) memories and one comparator selecting the minimum of the two FIFO outputs, as shown in Figure 1-b, where the third-stage PE is illustrated. The input and output of this stage correspond to the intermediate LLR computation of the example described in the previous section. The 3rd PE serially receives (one new couple every clock cycle) the ordered list $\lambda_2$ from stage 2. The first step is the expansion of $\lambda_2$ into lists $\lambda_3^0$ and $\lambda_3^1$, as described in the previous section. Once the first elements of $\lambda_3^0$ and $\lambda_3^1$ are stored in the FIFOs, the merging process can start. At each clock cycle, the two outputs of the FIFOs are compared, the couple having the smaller LLR value is selected to be fed into the list $\lambda_3$, and a pull signal is sent to the FIFO that supplied it.
Fig. 1. LLR circuit block diagram designed over GF(16). Panel (b): internal view of PE 3 with the example of Section III.
Figure 1-c illustrates the contents of the FIFOs during the first six clock cycles of the 3rd stage processing. The highlighted FIFO cell indicates the content selected at each clock cycle. As previously discussed, FIFO 0 contains the first elements of the LLR list. It is thus possible for FIFO 0 to become empty, in which case the remaining contents of FIFO 1 are systematically selected to complete the list $\lambda_3$.
Note that, during the first $s(c-1)$ clock cycles, FIFO 0 and FIFO 1 each receive one new couple every clock cycle and one couple is output, either from FIFO 0 or from FIFO 1.
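The behaviour of one PE can be mimicked with a small cycle-by-cycle model. The sketch below is a behavioural Python illustration under a simplifying assumption: in each cycle the incoming couple is expanded and written to both FIFOs, and the comparator pops one couple in that same cycle. Register delays of the real circuit are not modelled, so the reported peak occupancies are properties of this model only. The name `simulate_pe` and its parameters are hypothetical.

```python
from collections import deque


def simulate_pe(prev_list, y_abs, d_bit, out_size):
    """Model one PE: expand the incoming list into two FIFOs and merge them.

    prev_list: sorted couples (LLR, symbol) produced by the previous stage.
    y_abs:     |y_{c-1}|, added to the LLR of couples written to FIFO 1.
    d_bit:     hard decision d_{c-1}, appended to couples written to FIFO 0.
    Returns (output list, peak occupancy of FIFO 0, peak occupancy of FIFO 1).
    """
    out_size = min(out_size, 2 * len(prev_list))  # cannot output more than exists
    fifo0, fifo1 = deque(), deque()
    pending = deque(prev_list)
    output = []
    peak0 = peak1 = 0
    while len(output) < out_size:  # one loop iteration models one clock cycle
        if pending:
            llr, sym = pending.popleft()
            fifo0.append((llr, sym + str(d_bit)))
            fifo1.append((llr + y_abs, sym + str(1 - d_bit)))
        # Occupancy is sampled after the write and before the read.
        peak0 = max(peak0, len(fifo0))
        peak1 = max(peak1, len(fifo1))
        # Comparator: pop the smaller head; FIFO 0 wins ties.
        if fifo0 and (not fifo1 or fifo0[0][0] <= fifo1[0][0]):
            output.append(fifo0.popleft())
        else:
            output.append(fifo1.popleft())
    return output, peak0, peak1


if __name__ == "__main__":
    lambda2 = [(0, "01"), (7, "11"), (8, "00"), (15, "10")]
    lambda3, peak0, peak1 = simulate_pe(lambda2, 12, 1, 8)
    print(lambda3, peak0, peak1)
```

Feeding the model with $\lambda_2$ of the Section III example, $|y_2| = 12$ and $d_2 = 1$ reproduces $\lambda_3$; FIFO 0 empties after the fifth output and the rest of the list is drained from FIFO 1.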
Let $f_0(c)$ and $f_1(c)$ be the minimum sizes required by FIFO 0 and FIFO 1 of the $c$th PE, respectively. Let us first consider the size $f_1(c)$ when $s(c)$ is even. In the worst case, the first $s(c)/2$ couples of $\lambda_c$ are retrieved from FIFO 0. At that point, FIFO 1 contains the $s(c)/2$ couples sufficient to complete the $s(c)/2$ remaining elements of $\lambda_c$. If $s(c)$ is odd, at most $(s(c) - 1)/2$ elements of $\lambda_c$ can come from FIFO 1. Thus, in the general case, $f_1(c) \le \lfloor s(c)/2 \rfloor$, where $\lfloor x \rfloor$ denotes the greatest integer smaller than or equal to $x$.
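To make the bound on FIFO 1 concrete, the list size $s(c) = \min(2^c, n_m)$ and the bound $\lfloor s(c)/2 \rfloor$ can be tabulated per stage. The short Python sketch below does this for $m = 6$ and $n_m = 12$, the GF(64) parameters quoted in Section IV-B; the helper names `list_size` and `fifo1_bound` are illustrative.

```python
def list_size(c, n_m):
    """Size s(c) = min(2^c, n_m) of the list produced by stage c."""
    return min(2 ** c, n_m)


def fifo1_bound(c, n_m):
    """Upper bound floor(s(c) / 2) on the depth needed by FIFO 1 at stage c."""
    return list_size(c, n_m) // 2


m, n_m = 6, 12
sizes = [list_size(c, n_m) for c in range(1, m + 1)]
bounds = [fifo1_bound(c, n_m) for c in range(1, m + 1)]
print(sizes)
print(bounds)
```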
The determination of $f_0(c)$ needs a more careful examination. Let us consider the case where $s(c-1) = s(c)$. In the worst case, the first $\lfloor s(c)/3 \rfloor$ couples are retrieved from FIFO 0, then the next $\lfloor s(c)/3 \rfloor$ couples are retrieved from FIFO 1. At that point, both FIFOs have already received $2\lfloor s(c)/3 \rfloor$ couples. The outputs of FIFO 0 and FIFO 1 are respectively $\lambda_c^0(\lfloor s(c)/3 \rfloor + 1)$ and $\lambda_c^1(\lfloor s(c)/3 \rfloor + 1)$. In all cases, the content of FIFO 0 is enough to retrieve the next $\lfloor s(c)/3 \rfloor$ values needed to complete the list $\lambda_c$. If $s(c)$ is not a multiple of 3, $f_0(c)$ is still equal to $\lfloor s(c)/3 \rfloor$. In this case, the $(2\lfloor s(c)/3 \rfloor + 1)$th element of $\lambda_c$ will be retrieved from FIFO 0, which frees room for the $(3\lfloor s(c)/3 \rfloor + 1)$th element coming from $\lambda_{c-1}$, and so on. If $s(c-1) = s(c)/2$, it can be shown that $f_0(c) = s(c)/4$. In summary, $f_0(c)$ varies between $s(c)/4$ and $s(c)/3$.
Figure 2 shows the timing diagram of the LLR computation for the example described in Section III. As shown in this figure, the LLR circuit operates synchronously with a clock signal. The signal startin indicates the start of the LLR computation and the signal load yi indicates the availability of the $y_i$ data at the input. The circuit has a latency of $m = 4$ cycles; the start of the generation of the output is indicated by the signal startout. After $n_m + 4 - 1 = 13$ cycles, all the (LLR, GF) couples have been generated. As previously mentioned, the LLR circuit operates in a pipelined manner: it can start processing the second symbol while the last stage of the circuit is still producing the (LLR, GF) couples of the first symbol. Thus, as shown in Figure 2, the first (LLR, GF) couple associated with the second symbol is produced after a delay of one cycle. This delay cycle is required to reinitialize the FIFOs.
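The cycle count quoted above can be written in general form by reading the 4 in $n_m + 4 - 1$ as the number of stages $m$. This reading is an interpretation, since the text only gives the GF(16) instance, and the GF(64) value below is an extrapolation from it rather than a reported measurement:

```latex
T = n_m + m - 1, \qquad
\text{GF}(16),\ n_m = 10:\ T = 10 + 4 - 1 = 13, \qquad
\text{GF}(64),\ n_m = 12:\ T = 12 + 6 - 1 = 17
```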
Fig. 2. Timing diagram of the LLR computation over GF(16), nm = 10. Annotation in the diagram: "Output of the nm=10 (LLR,GF) couples of the first symbol".
B. Implementation results
A generic implementation of the proposed systolic architecture has been developed and successfully validated. The architecture accepts any Galois field size q and any value $n_m \le q$. The internal binary word size $(n_b, n_s)$ of the (LLR, GF) couple is also parameterized. This architecture has been used in an FPGA-based GF(64) NB-LDPC code implementation, with q = 64 (i.e. m = 6), $n_m = 12$, $n_b = 6$ and $n_s = 6$. On a Virtex 4 (XC4VLX15), it requires 275 slices and operates at a frequency of 149 MHz.
V. CONCLUSIONS
We have presented a novel and efficient hardware design of the LLR computation circuit. To the best of our knowledge, the proposed circuit is the first of its kind. It performs the LLR computation in a systolic way and produces an ordered list of (LLR, GF) couples. This architecture has been implemented in the design of a GF(64) NB-LDPC code. It can also be used for other codes, for example in the soft Reed-Solomon decoder described in [6].
REFERENCES
[1] M. Davey and D. J. C. MacKay, "Low density parity check codes over GF(q)", IEEE Communications Letters, Vol. 2, pp. 165, 1998.
[2] S. Pfletschinger and D. Declercq, "Getting Closer to MIMO Capacity with Non-Binary Codes and Spatial Multiplexing", IEEE Globecom, Miami, Florida, USA, Dec. 2010.
[3] D. Declercq and M. Fossorier, "Decoding algorithms for Nonbinary LDPC codes over GF(q)", IEEE Transactions on Communications, Vol. 55, No. 4, pp. 633-643, Apr. 2007.
[4] V. Savin, "Min-max decoding for non binary LDPC codes", Proc. Int. Symp. on Information Theory, ISIT 2008, Toronto, Canada, pp. 960-964, Jul. 2008.
[5] A. Valembois and M. Fossorier, "An improved method to compute lists of binary vectors that optimize a given weight function with application to soft-decision decoding", IEEE Communications Letters, Vol. 5, No. 11, pp. 456-458, Nov. 2001.
[6] A. Kabat, F. Guilloud and R. Pyndiah, "On the Sensibility of the 'Arranged List of the Most a Priori Likely Tests' Algorithm", IEEE Military Communications Conference MILCOM 2007, Orlando, FL, USA, pp. 1-7, Oct. 2007.
