A 588 Gbps LDPC Decoder Based on Finite-Alphabet Message Passing by Ghanaatian, Reza et al.
A 588-Gbps LDPC Decoder Based on
Finite-Alphabet Message Passing
Reza Ghanaatian, Alexios Balatsoukas-Stimming, Christoph Müller, Michael Meidlinger, Gerald Matz,
Adam Teman, and Andreas Burg
Abstract—An ultra-high throughput low-density parity check
(LDPC) decoder with an unrolled full-parallel architecture is
proposed, which achieves the highest decoding throughput com-
pared to previously reported LDPC decoders in the literature.
The decoder benefits from a serial message-transfer approach
between the decoding stages to alleviate the well-known routing
congestion problem in parallel LDPC decoders. Furthermore,
a finite-alphabet message passing algorithm is employed to
replace the variable node update rule of the standard min-sum
decoder with look-up tables, which are designed in a way that
maximizes the mutual information between decoding messages.
The proposed algorithm results in an architecture with reduced
bit-width messages, leading to a significantly higher decoding
throughput and to a lower area as compared to a min-sum
decoder when serial message-transfer is used. The architecture is
placed and routed for the standard min-sum reference decoder
and for the proposed finite-alphabet decoder using a custom
pseudo-hierarchical backend design strategy to further alleviate
routing congestion and to handle the large design. Post-layout
results show that the finite-alphabet decoder with the serial
message-transfer architecture achieves a throughput as large as
588 Gbps with an area of 16.2 mm² and consumes an average
energy of 22.7 pJ per decoded bit in a 28 nm FD-SOI library.
Compared to the reference min-sum decoder, this corresponds to
3.1 times smaller area and 2 times better energy efficiency.
Index Terms—Low-density parity-check code, min-sum decod-
ing, unrolled architecture, finite-alphabet decoder, 28 nm FD-SOI.
I. INTRODUCTION
LOW-DENSITY PARITY-CHECK (LDPC) codes have become the
coding scheme of choice in high data-rate communication
systems after their re-discovery in the
1990s [1], due to their excellent error correcting performance
along with the availability of efficient high-throughput hard-
ware implementations in modern CMOS technologies. LDPC
codes are commonly decoded using iterative message passing
(MP) algorithms in which the initial estimations of the bits
are improved by a continuous exchange of messages between
decoder computation nodes. Among the various MP decod-
ing algorithms, the min-sum (MS) decoding algorithm [2]
and its variants (e.g., offset MS, scaled MS) are the most
common choices for hardware implementation. LDPC decoder
hardware implementations traditionally start from one
of these established algorithms (e.g., MS decoding), where the
exchanged messages represent log likelihood ratios (LLRs).
These LLRs are encoded as fixed point numbers in
two's-complement or sign-magnitude representation, using a small
number of uniform quantization levels, in order to realize
the message update rules with low-complexity conventional
arithmetic operations.
R. Ghanaatian, A. Balatsoukas-Stimming, C. Müller, A. Teman, and A.
Burg are with the Telecommunications Circuits Laboratory, EPFL, Lausanne,
Switzerland (email: {reza.ghanaatian, alexios.balatsoukas, christoph.mueller,
adam.teman, andreas.burg}@epfl.ch).
M. Meidlinger and G. Matz are with the Vienna University of Technology,
Vienna, Austria (email: {mmeidlin, gmatz}@nt.tuwien.ac.at).
Recently, there has been significant interest in the design
of finite-alphabet decoders for LDPC codes [3]–[7]. The main
idea behind finite-alphabet LDPC decoders is to start from
one or multiple arbitrary message alphabets, which can be
encoded with a bit-width that is acceptable from an implemen-
tation complexity perspective. The message update rules are
then crafted as generic mapping functions to operate on this
alphabet. The main advantage of such finite-alphabet decoders
is that the message bit-width can be reduced significantly with
respect to a conventional decoder, while maintaining the same
error-correcting performance [5], [6]. The downside of this
approach is that the message update rules of finite-alphabet
decoders usually cannot be described using fast and area-
efficient standard arithmetic circuits.
Different hardware architectures for LDPC decoders have
been proposed in the literature in order to fulfill the power and
throughput requirements of various standards. More specif-
ically, various degrees of resource sharing result in flexi-
ble decoders with different area requirements. On the one
hand, partial-parallel LDPC decoders [8], [9] and block-
parallel LDPC decoders [10], [11] are designed for medium
throughput, with modest silicon area. Full-parallel [12], [13]
and unrolled LDPC decoders [5], [14], on the other hand,
achieve very high throughput (in the order of several tens or
hundreds of Gbps) at the expense of large area requirements.
Even though, in principle, LDPC decoders are massively
parallelizable, the implementation of ultra-high speed LDPC
decoders still remains a challenge, especially for long LDPC
codes with large node degrees [15]. While synthesis results
for such long-length codes, as for example reported in [16],
show the potential for a very high throughput, the actual
implementation requires several further considerations mainly
due to severe routing problems and the impact of parasitic
effects.
Contributions: In this paper, we propose an unrolled full-
parallel architecture based on serial transfer of the decoding
messages, which enables an ultra-high throughput implemen-
tation of LDPC decoders for codes with large node degrees
by reducing the required interconnect wires for such decoders.
Moreover, we employ a finite-alphabet LDPC decoding
algorithm in order to decrease the required quantization
bit-width, and thus, to increase the throughput, which is limited
by the serial message-transfer in the proposed architecture.
We also adopt a linear floorplan for the unrolled full-parallel
architecture as well as an efficient pseudo-hierarchical flow
that allow the high-speed physical implementation of the
proposed decoder. To the best of our knowledge, by combining
the aforementioned techniques, we present the fastest fully
placed and routed LDPC decoder in the literature.
arXiv:1703.05769v2 [cs.AR] 30 Dec 2017
Outline: The remainder of this paper is organized as fol-
lows: Section II gives an introduction to decoding of LDPC
codes, as well as more details on existing high-throughput
implementations of LDPC decoders. Section III describes
our proposed ultra-high throughput decoder architecture that
employs a serial message-transfer technique. In Section IV,
our algorithm to design a finite-alphabet decoder with non-
uniform quantization is explained and applied to the serial
message-transfer decoder of Section III. Section V describes
our proposed approach for the physical implementation and
the timing and area optimization of our serial message-transfer
decoders. Finally, Section VI analyzes the implementation
results, and Section VII concludes the paper.
II. BACKGROUND
In this section, we first briefly summarize the fundamentals
of LDPC codes and the iterative MP algorithm for the decod-
ing. We then review the state-of-the-art in high speed LDPC
decoder architectures to set the stage for the description of our
implementation.
A. LDPC Codes and Decoding Algorithms
A binary LDPC code is a set of codewords which are
defined through an M × N binary-valued sparse parity check
matrix H as:
\[ \left\{ c \in \{0, 1\}^N \;\middle|\; Hc = 0 \right\}, \qquad (1) \]
where all operations are performed modulo 2. If the parity
check matrix contains exactly dv ones per column and exactly
dc ones per row, the code is called a (dv, dc)-regular LDPC
code. Such codes are usually represented with a Tanner graph,
which contains N variable nodes (VNs) and M check nodes
(CNs) and VN n is connected to CN m if and only if
Hmn = 1.
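As a minimal illustration of the definition in (1) and of the Tanner graph connectivity, the following Python sketch checks the parity-check condition modulo 2 and extracts the node neighborhoods. The matrix `H_TOY` and all function names are assumptions introduced here for illustration only; the toy matrix is a hypothetical (2, 4)-regular example and is not a code used in this paper.

```python
import numpy as np

# Toy (2,4)-regular parity-check matrix, invented for illustration only
# (M = 3 check nodes, N = 6 variable nodes); it is not a code from the paper.
H_TOY = np.array([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 1],
])

def is_codeword(H, c):
    """Return True if H c = 0 with all operations modulo 2, as in (1)."""
    H = np.asarray(H, dtype=int)
    c = np.asarray(c, dtype=int)
    return not np.any(H.dot(c) % 2)

def regular_degrees(H):
    """Return (dv, dc) if every column has dv ones and every row has dc ones, else None."""
    H = np.asarray(H, dtype=int)
    col_weights = set(H.sum(axis=0).tolist())
    row_weights = set(H.sum(axis=1).tolist())
    if len(col_weights) == 1 and len(row_weights) == 1:
        return col_weights.pop(), row_weights.pop()
    return None

def tanner_neighbors(H):
    """Adjacency lists of the Tanner graph: VN n is connected to CN m iff H[m, n] == 1."""
    H = np.asarray(H, dtype=int)
    M, N = H.shape
    vn_neighbors = [np.flatnonzero(H[:, n]).tolist() for n in range(N)]
    cn_neighbors = [np.flatnonzero(H[m, :]).tolist() for m in range(M)]
    return vn_neighbors, cn_neighbors

print(is_codeword(H_TOY, [1, 1, 0, 0, 0, 0]))
print(regular_degrees(H_TOY))
```

The neighbor lists returned by `tanner_neighbors` correspond to the sets N(n) and N(m) used in the message update rules below.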
LDPC codes are traditionally decoded using MP algorithms,
where information is exchanged as messages between the VNs
and the CNs over the course of several decoding iterations. At
each iteration the message from VN n to CN m is computed
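using a mapping $\Phi_v : \mathbb{R}^{d_v} \to \mathbb{R}$, which is defined as:
\[ \mu_{n \to m} = \Phi_v\left( L_n, \bar{\mu}_{\mathcal{N}(n) \setminus m \to n} \right), \qquad (2) \]
where $\mathcal{N}(n)$ denotes the neighbors of node n in the Tanner
graph, $\bar{\mu}_{\mathcal{N}(n) \setminus m \to n} \in \mathbb{R}^{d_v - 1}$ is a vector that contains the
incoming messages from all neighboring CNs except m, and
$L_n \in \mathbb{R}$ denotes the channel LLR corresponding to VN n.
Similarly, the CN-to-VN messages are computed using a
mapping $\Phi_c : \mathbb{R}^{d_c - 1} \to \mathbb{R}$, which is defined as:
\[ \bar{\mu}_{m \to n} = \Phi_c\left( \mu_{\mathcal{N}(m) \setminus n \to m} \right). \qquad (3) \]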
Fig. 1: (a) VN update and (b) CN update for $\mathcal{N}(n) = \{m, m_1, \ldots, m_{d_v - 1}\}$ and $\mathcal{N}(m) = \{n, n_1, \ldots, n_{d_c - 1}\}$.
Fig. 1 illustrates the message updates in the Tanner graph. In
addition to Φv and Φc, a third mapping $\Phi_d : \mathbb{R}^{d_v + 1} \to \{0, 1\}$
is needed to provide an estimate of the transmitted codeword
bits in the last VN iteration based on the incoming CN
messages and the channel LLR Ln according to:
\[ \hat{c}_n = \Phi_d\left( L_n, \bar{\mu}_{\mathcal{N}(n) \to n} \right). \qquad (4) \]
Messages are exchanged until a valid codeword has been
decoded or until the maximum number of iterations I has
been reached.
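For the widely used MS algorithm, the mappings (2) and
(3) are:
\[ \Phi_v^{\mathrm{MS}}(L, \bar{\mu}) = L + \sum_i \bar{\mu}_i, \qquad (5) \]
and
\[ \Phi_c^{\mathrm{MS}}(\mu) = \mathrm{sign}\,\mu \, \min |\mu|, \qquad (6) \]
where $\min |\mu|$ denotes the minimum of the absolute values of
the vector components and $\mathrm{sign}\,\mu = \prod_j \mathrm{sign}\,\mu_j$. The decision
mapping Φd is defined as:
\[ \Phi_d^{\mathrm{MS}}(L, \bar{\mu}) = \frac{1}{2}\left( 1 - \mathrm{sign}\left( L + \sum_i \bar{\mu}_i \right) \right). \qquad (7) \]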
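The update rules (5)-(7) and the stopping rule described above can be sketched in floating-point Python as follows. This is only an illustrative software model with a flooding schedule, not the hardware architecture of this paper; the toy matrix `H_TOY`, the function name `min_sum_decode`, and the choice to map a zero sum to bit 0 in (7) are assumptions made here. The sketch also assumes that every check node has degree at least 2.

```python
import numpy as np

# Hypothetical (2,4)-regular toy matrix used only to exercise the sketch.
H_TOY = np.array([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 1],
])

def min_sum_decode(H, llr, max_iter=5):
    """Flooding min-sum decoding following (5)-(7).

    Returns (c_hat, converged), where converged is True if H c_hat = 0 (mod 2)
    was reached within max_iter iterations.
    """
    H = np.asarray(H, dtype=int)
    llr = np.asarray(llr, dtype=float)
    M, N = H.shape
    edges = H == 1
    # Initial VN-to-CN messages are the channel LLRs (no CN messages yet).
    v2c = np.where(edges, llr, 0.0)
    c_hat = (llr < 0).astype(int)
    for _ in range(max_iter):
        # CN update (6): sign product times minimum magnitude of the other inputs.
        c2v = np.zeros((M, N))
        for m in range(M):
            idx = np.flatnonzero(edges[m])
            for n in idx:
                others = v2c[m, idx[idx != n]]
                c2v[m, n] = np.prod(np.sign(others)) * np.min(np.abs(others))
        # Decision (7): bit 1 for a negative sum; a zero sum is mapped to 0 here.
        total = llr + c2v.sum(axis=0)
        c_hat = (total < 0).astype(int)
        if not np.any(H.dot(c_hat) % 2):
            return c_hat, True
        # VN update (5): channel LLR plus all CN messages except the target CN.
        v2c = np.where(edges, total - c2v, 0.0)
    return c_hat, False

# One weakly wrong channel LLR (first bit) on the all-zero codeword.
c_hat, converged = min_sum_decode(H_TOY, [-0.5, 2.0, 2.0, 2.0, 2.0, 2.0])
print(c_hat, converged)
```

In the unrolled hardware discussed later, the loop body corresponds to one pair of CN and VN stages, and the early-termination check is not needed because the number of iterations I is fixed.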
B. High Throughput LDPC Decoders
Several high throughput LDPC decoders have been devel-
oped during the past decade in order to satisfy the high data-
rate requirements of optical and high-speed Ethernet networks.
These decoders usually rely on a full-parallel isomorphic [17]
architecture and a flooding schedule, which directly maps the
algorithm for one iteration to hardware. More specifically,
the CN and VN update equations are directly mapped to
M CN and N VN processing units and a hard-wired routing
network is responsible for passing the messages between
them. From an implementation perspective, while such an
architecture enables a very high throughput by fully exploiting
the inherent parallelism of each iteration, the complexity of
the highly unstructured routing network turns out to be a
severe bottleneck. In addition to this routing problem, such
full-parallel decoders usually require one or two clock cycles
for each iteration and in the worst case as many cycles as the
maximum number of iterations for each codeword, which is
another throughput limitation factor.
Several solutions have been proposed to alleviate the routing
problem in full-parallel decoders, on both architectural and
algorithmic levels. The authors of [18], [19] suggest using
a bit-serial architecture, which only requires a single wire for
each variable-to-check and check-to-variable node connection.
While this approach can reduce the routing congestion, it also
leads to a significant reduction in the decoding throughput. The
decoder in [19], for example, only achieves a throughput of
3.3 Gbps when implemented using a 130 nm CMOS technol-
ogy. Another architectural technique is reported in [13], where
the long wires of the decoder are partitioned into several short
wires with pipeline registers. As a result, the critical path is
broken down into shorter paths, but the decoder throughput
is also affected since more cycles are required to accomplish
each iteration. Nevertheless, the decoder of [13] is still able
to achieve 13.2 Gbps in 90 nm CMOS with 16 iterations.
On an algorithmic level, the authors of [20] propose an
MP algorithm, called MS split-row threshold, which uses a
column-wise division of the H matrix into Spn partitions.
Each partition contains N/Spn VNs and M CNs, and global
interconnects are minimized by only sharing the minimum
signs between the CNs of each partition. This algorithmic
modification was used to implement a full-parallel decoder
for the challenging (2048, 1723) LDPC code in 65 nm CMOS,
which achieves a throughput of 36.2 Gbps with 11 decoding
iterations. Another decoder, reported in [21], uses a hybrid
hard/soft decoding algorithm, called differential binary MP
algorithm, which reduces the interconnect complexity at the
cost of some error-correcting performance degradation. A
full-parallel (2048, 1723) LDPC decoder using this algorithm
was implemented in 65 nm CMOS technology, achieving a
throughput of up to 126 Gbps [22]. The work of [23] also
proposes another algorithmic level modification, called the
probabilistic MS algorithm, where a probabilistic second min-
imum value is used instead of the true second minimum value
to simplify the CN operation and to facilitate high-throughput
implementation of full-parallel LDPC decoders. Further, a
mix of tree and butterfly interconnect network is proposed
in the CN unit to balance the interconnect complexity and
the logic overhead and to reduce the routing complexity. The
implementation of a decoder for the (2048, 1723) LDPC code
with the proposed techniques in 90 nm CMOS technology
achieves a throughput of 45.42 Gbps.
Stochastic decoding of LDPC codes [24] was another im-
portant improvement based on both algorithm and gate-level
implementation considerations to solve the routing problem of
LDPC decoders, where probabilities are interpreted as binary
Bernoulli sequences. This approach, on the one hand, reduces
the complexity of CNs and the routing overhead, but, on the
other hand, introduces difficulties in VN update rules due
to correlated stochastic streams, which may deteriorate the
error-correcting performance especially in longer-length codes.
To solve this correlation problem by re-randomizing the VN
output streams, the work of [25] proposes to use majority-
based tracking forecast memories in each VN, which results
in a decoder with full-parallel architecture for a (2048, 1723)
LDPC code that achieves a throughput of 61.3 Gbps in 90 nm
CMOS. An alternative method to track the probability values,
called delayed stochastic decoding, is reported in [26] and
the full-parallel decoder for the same code can deliver a
throughput as large as 172.4 Gbps in 90 nm CMOS.
To solve the problem of throughput limitations in full-
parallel decoders from potentially using multiple iterations
for decoding, the work of [14] presents an unrolled full-
parallel LDPC decoder. In the proposed architecture, each
decoding iteration is mapped to distinct hardware resources,
leading to a decoder with I iterations that can decode one
codeword per clock cycle, at the cost of significantly increased
area requirements with respect to non-unrolled full-parallel
decoders. This unrolled architecture achieves a throughput
of 161 Gbps for a (672, 546) LDPC code with dv = 3 and
dc = 6, when implemented in a 65 nm CMOS technology. It
is noteworthy that an unrolled decoder has 50% reduced wires
between adjacent stages compared to a non-unrolled decoder
since one stage of variable nodes is connected to one stage
of check nodes with uni-directional data flow per decoding
iteration. Even though this measure leads to a lower routing
congestion, it is still challenging to fully place and route
such a decoder. This routing issue becomes more and more
severe when considering longer LDPC codes and especially
with increasing CN and VN degrees to achieve better error-
correcting performance, as required in wireline applications
such as for the (2048, 1723) code with dv = 6 and dc = 32
used in the IEEE 802.3an standard [15].
III. SERIAL MESSAGE-TRANSFER LDPC DECODER
Unrolled full-parallel LDPC decoders provide the ultimate
throughput with smaller routing congestion than conventional
full-parallel decoders. However, they are still not trivial to
implement for long LDPC codes with high CN and VN
degrees, which suffer from severe routing congestion. Hence,
in this section, we propose an unrolled full-parallel LDPC
decoder architecture that employs a serial message-transfer
technique between CNs and VNs.¹ This architecture is similar
to the bit-serial implementations of [18], [19] in the way
the messages are transferred; however, as we shall see later,
it differs in the fact that it is unrolled and in the way the
messages are processed in the CNs and VNs.
A. Decoder Architecture Overview
An overview of the proposed unrolled serial message-
transfer LDPC decoder architecture is shown in Fig. 2. As
with all unrolled LDPC decoders, each decoding iteration is
mapped to a distinct set of N VN and M CN units, which
form a processing pipeline. In essence, the unrolled LDPC
decoder is a systolic array, in which a new set of N channel
LLRs is read in every clock cycle and a decoded codeword is
put out in every clock cycle.
Fig. 2: Serial message-transfer decoder architecture.
¹We note that a first implementation of our decoder was based on
parallel (word-level) message-transfer. The place and route tool for this
implementation was hardly able to converge even when the area utilization was
unacceptable. We, therefore, propose a serial message-transfer architecture and
we adopt special implementation methodologies for this architecture, which
will be explained in Section V.
Even though both the CNs and VNs can compute their
outgoing messages in a single clock cycle, similar to the
architecture in [5], in the proposed serial message-transfer
architecture each message is transferred one bit at a time
between the consecutive variable node and check node stages
over the course of Qmsg clock cycles, where Qmsg is the
number of bits used for the messages. More specifically,
each CN and VN unit contains a serial-to-parallel (S/P) and
parallel-to-serial (P/S) conversion unit at the input and output,
respectively, which are clocked Qmsg times faster than the
processing clock to collect and transfer messages serially,
while keeping the overall decoding throughput constant. More
details on the architecture of the CN and VN units as well as
the proposed serial message-transfer mechanism are provided
in the sequel.
B. Decoder Stages
The unrolled LDPC decoder, illustrated in Fig. 2, consists
of three types of processing stages, which are described in
more detail below. We note that the CN/VN processors of
this reference decoder are similar to those of a standard MS
decoder, and our modifications for these parts (to realize a
finite-alphabet decoder) are discussed in Section IV.
1) Check Node Stage: Each check node stage consists of
M CN units, each of which contains three components: a CN
processor, which implements (6) similarly to [5], [14], dc S/P
units for the dc input messages, and dc P/S units for the dc
output messages. Moreover, the complete check node stage
contains a register bank that is used to store the channel LLRs,
which are not directly needed by the check node stage, but
nevertheless must be forwarded to the following variable node
stage and thus need to be buffered. Hence, no S/P and P/S units
are required for the channel LLR buffers in the check node
stages as they are simply forwarded serially to the following
variable node stage.
2) Variable Node Stage: Each variable node stage consists
of N VN units, each of which contains a VN processor and S/P
and P/S units at the inputs and outputs, respectively, similar
to the CN unit structure. Each VN processor implements the
update rule (5) similarly to [5].
3) Decision Node Stage: The last variable node stage is
called a decision node stage because it is responsible for taking
the final hard decisions on the decoded codeword bits. The
structure of this stage is similar to a variable node stage, but a
decision node (DN) has a simpler version of the VN processor
that only computes the sum of all inputs and outputs its sign [5],
and thus no P/S unit is required at its output.
C. Message Transfer Mechanism
One of the modifications, compared to [14] and [5], is the
serial transfer of the channel and message LLRs between the
stages of the decoder, which reduces the required routing
resources by a factor of Qmsg. This modification is applied
to make the placement and routing of the decoder feasible,
especially for large values of dv and dc. To this end, as
explained in the previous section, an S/P and a P/S shift register
are added to each input and each output of the CN and VN
units, as illustrated in Fig. 3. We see that the S/P unit consists
of a (Qmsg−1)-bit shift register and Qmsg memory registers,
while the P/S unit has Qmsg registers with multiplexed inputs.
The serial messages are transferred with a fast clock, denoted
by CLKF, that is Qmsg times faster than the slow processing
clock, denoted by CLKS. More specifically, at each CN unit
and VN unit input, data is loaded serially into the S/P shift
register using the fast CLKF, and after the Qmsg-th cycle all
message bits are stored in memory registers, clocked by the
slow CLKS. The CN/VN processing can then be performed
in one CLKS cycle and the output messages are saved in the
output P/S shift register and transferred serially to the next
stage using again CLKF. At the same time, a new set of
messages is serially loaded into the input shift register. We note
that for simpler clock tree generation, all registers in Fig. 3
are clocked by CLKF, while CLKS is actually implemented as
a pulsed clock, which is generated using a clock-gating cell
controlled by a finite state machine.
Fig. 3: The message receive and transfer mechanism by S/P
and P/S shift registers, enabled by the fast clock (CLKF), and
the message process enabled by the slow clock (CLKS) for
Qmsg = 5.
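The serial transfer mechanism can be mimicked by a small behavioral Python model, given here only as a sketch. The class and function names are hypothetical, the bit order (LSB first) is an assumption that the text does not specify, and clock gating is abstracted into the return value of `clock_fast`, which signals the CLKF cycle on which the pulsed CLKS would latch a complete word.

```python
def parallel_to_serial(word, q_msg):
    """P/S unit: load a q_msg-bit word and emit one bit per CLKF cycle (LSB first, assumed)."""
    return [(word >> i) & 1 for i in range(q_msg)]

class SerialToParallel:
    """S/P unit: a (q_msg - 1)-bit shift register plus q_msg memory registers."""

    def __init__(self, q_msg):
        self.q_msg = q_msg
        self.shift = []    # holds at most q_msg - 1 received bits
        self.memory = 0    # word seen by the slow CN/VN processor

    def clock_fast(self, bit):
        """Advance one CLKF cycle; return True when a complete word is latched."""
        if len(self.shift) == self.q_msg - 1:
            # The last bit bypasses the shift register and is latched together
            # with the q_msg - 1 stored bits (this models the CLKS pulse).
            bits = self.shift + [bit]
            self.memory = sum(b << i for i, b in enumerate(bits))
            self.shift = []
            return True
        self.shift.append(bit)
        return False

def transfer(words, q_msg):
    """Send words over a single wire and collect what the receiver latches."""
    receiver = SerialToParallel(q_msg)
    received = []
    for word in words:
        for bit in parallel_to_serial(word, q_msg):
            if receiver.clock_fast(bit):
                received.append(receiver.memory)
    return received

print(transfer([19, 7, 0, 31], q_msg=5))
```

Each word occupies the wire for exactly `q_msg` fast cycles, which is why the slow processing clock must be `q_msg` times slower than CLKF.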
D. Decoder Hardware Complexity and Performance Analysis
In this section, we describe the required memory complexity
of the proposed decoder as well as the decoding latency and
the throughput.
1) Memory Requirement: The memory complexity can
easily be characterized by counting the number of required
registers and can be approximated by:
\[ R_{\mathrm{tot}} \approx N (d_v + 1) Q (6I - 1), \qquad (8) \]
where I is the number of decoding iterations (which in
unrolled decoders strongly affects the memory requirements)
and Q = Qmsg = Qch, which is often the case for MS LDPC
decoders. From (8), one can easily see that the quantization
bit-width linearly increases the memory requirement for the
proposed architecture.
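To make the linear dependence on the bit-width concrete, (8) can be evaluated directly. The sketch below uses N = 2048 and dv = 6 from the IEEE 802.3an code and I = 5 iterations, which are the values used elsewhere in this paper; the function name is an assumption, and the formula is only the approximation (8), valid under Qmsg = Qch.

```python
def register_count(n, d_v, q, iterations):
    """Approximate register count of the unrolled serial decoder, following (8)."""
    return n * (d_v + 1) * q * (6 * iterations - 1)

r_q5 = register_count(n=2048, d_v=6, q=5, iterations=5)
r_q4 = register_count(n=2048, d_v=6, q=4, iterations=5)
print(r_q5, r_q4, r_q5 / r_q4)
```

Under these assumptions, going from 5 to 4 quantization bits reduces the approximate register count from about 2.08 million to about 1.66 million, i.e., by the ratio 5/4.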
2) Decoding Latency: Since each stage has a delay of two
CLKS cycles and there are two stages for each decoding iter-
ation, the decoder latency is 4I CLKS cycles or, equivalently,
4IQmax CLKF cycles, where Qmax = max(Qmsg, Qch).
3) Decoding Throughput: In the proposed unrolled
architecture, one decoded codeword is output in each CLKS cycle.
Therefore, the coded throughput of the decoder is:
\[ T = N f^{S}_{\max}, \qquad (9) \]
where $f^{S}_{\max}$ is the maximum frequency of CLKS while the
maximum frequency of CLKF or simply maximum frequency
of the decoder is $f_{\max} = f^{F}_{\max} = Q_{\max} f^{S}_{\max}$. For the proposed
architecture, we have:
\[ f^{S}_{\max} = \left\{ \max\left( Q_{\max} T_{\mathrm{CP,route}},\; T_{\mathrm{CP,VN}},\; T_{\mathrm{CP,CN}} \right) \right\}^{-1}, \qquad (10) \]
where $T_{\mathrm{CP,VN}}$ and $T_{\mathrm{CP,CN}}$ are the delays of the critical paths
of the VN unit and the CN unit, respectively, and $T_{\mathrm{CP,route}}$
is the critical path delay of the (serial) routing between the
decoding stages. Thus, the decoder throughput will be limited
by the routing, if the VN/CN delay is smaller than Qmax times
the routing delay. Hence, on the one hand, the serial message-
transfer decoder alleviates the routing problem by reducing the
required number of wires, but on the other hand, the decoder
throughput for large quantization bit-widths may be affected,
as the serial message-transfer delay will become the limiting
factor.
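The interplay of (9), (10), and the latency expression can be explored numerically with the following sketch. The delay values below are purely hypothetical placeholders chosen to show the routing-limited case; they are not measured results from this work, and the function names are assumptions.

```python
def slow_clock_max_hz(q_max, t_route, t_vn, t_cn):
    """Maximum CLKS frequency according to (10); all delays in seconds."""
    return 1.0 / max(q_max * t_route, t_vn, t_cn)

def coded_throughput_bps(n, f_s_max):
    """Coded throughput according to (9): one N-bit codeword per CLKS cycle."""
    return n * f_s_max

def latency_fast_cycles(iterations, q_max):
    """Decoder latency in CLKF cycles: 4 I CLKS cycles times Q_max."""
    return 4 * iterations * q_max

# Hypothetical delays (NOT from the paper): 1.0 ns routing, 2.5 ns VN, 2.0 ns CN.
T_ROUTE, T_VN, T_CN = 1.0e-9, 2.5e-9, 2.0e-9
for q_max in (5, 3):
    f_s = slow_clock_max_hz(q_max, T_ROUTE, T_VN, T_CN)
    print(q_max,
          f_s / 1e6,                              # CLKS frequency in MHz
          coded_throughput_bps(2048, f_s) / 1e9,  # throughput in Gbps
          latency_fast_cycles(5, q_max))          # latency in CLKF cycles
```

With these placeholder delays the routing term dominates for both bit-widths, so reducing Qmax from 5 to 3 raises the CLKS frequency by the factor 5/3. Read in the other direction, if the reported 588 Gbps is taken as the coded throughput of (9) with N = 2048, it would correspond to a CLKS frequency of roughly 287 MHz.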
IV. FINITE-ALPHABET SERIAL MESSAGE-TRANSFER
LDPC DECODER
Even though the serial message-transfer architecture allevi-
ates the routing congestion of an unrolled full-parallel LDPC
decoder, it has a negative impact on both throughput and
hardware complexity, as discussed in the previous section.
In our previous work [5], [6], we have shown that finite-
alphabet decoders have the potential to reduce the required
number of message bits while maintaining the same error rate
performance. In this section, we will review the basic idea
and our design method for this new type of decoders and then
show how the bit-width reduction technique of [5], [6] can
be applied verbatim in order to increase the throughput and
reduce the area of the serial message-transfer architecture.
A. Mutual Information Based Finite-Alphabet Decoder
In our approach of [5], [6], the standard message-passing
decoding algorithm update rules are replaced by custom up-
date rules that can be implemented as simple look-up tables
(LUTs). These LUTs take integer-valued input messages and
produce a corresponding output message. Moreover, the input-
output mapping that is represented by the LUTs is designed
in a way that maximizes the mutual information between the
LUT output messages and the codeword bit that these mes-
sages correspond to. We note that a similar approach was also
used in [7], but the corresponding hardware implementation
would have a much higher hardware complexity than the
method of [5], [6]. This happens because, contrary to [7],
in [5], [6] we used LUTs only for the VNs while the CNs
use the standard min-sum update rule.
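The design criterion, maximizing the mutual information between the LUT output and the codeword bit, can be illustrated on a toy example. The sketch below is not the LUT design algorithm of [5], [6]; it only evaluates the criterion for two hand-picked candidate mappings. The joint distribution `JOINT`, the candidate LUTs, and all function names are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """I(X;T) in bits for a joint pmf joint[x, t]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of the bit X
    pt = joint.sum(axis=0, keepdims=True)   # marginal of the message T
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ pt)[mask])))

def apply_lut(joint, lut, n_out):
    """Joint pmf of (X, lut[T]) for a deterministic mapping t -> lut[t]."""
    joint = np.asarray(joint, dtype=float)
    out = np.zeros((joint.shape[0], n_out))
    for t, label in enumerate(lut):
        out[:, label] += joint[:, t]
    return out

# Hypothetical joint pmf of a uniform bit X and a 4-level input message T.
JOINT = 0.5 * np.array([
    [0.50, 0.30, 0.15, 0.05],   # p(t | x = 0)
    [0.05, 0.15, 0.30, 0.50],   # p(t | x = 1)
])

mi_full = mutual_information(JOINT)
mi_a = mutual_information(apply_lut(JOINT, [0, 0, 1, 1], n_out=2))
mi_b = mutual_information(apply_lut(JOINT, [0, 1, 1, 1], n_out=2))
print(mi_full, mi_a, mi_b)
```

In this toy case the symmetric mapping `[0, 0, 1, 1]` preserves more information about the bit than the lopsided mapping `[0, 1, 1, 1]`, and neither can exceed the information carried by the unquantized 4-level message.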
B. Error-Correcting Performance and Bit-Width Reduction
In Fig. 4, we compare the performance of the IEEE 802.3an
LDPC code under floating-point MS decoding, fixed-point
MS decoding (with Qch = Qmsg ∈ {4, 5}), and LUT-based
decoding (with Qch = 4 and Qmsg = 3) when performing
I = 5 decoding iterations with a flooding schedule. We
also show the performance of a floating-point offset min-sum
(OMS) decoder as a reference.² We observe that the fixed-
point decoder with Qch = Qmsg = 5 has almost the same
performance as the floating-point decoder, while the fixed-
point decoder with Qch = Qmsg = 4 shows a significant
loss with respect to the floating-point implementation. Thus,
a standard MS decoder would need to use at least Qch =
Qmsg = 5 quantization bits. The LUT-based decoder, however,
can match the performance of the floating-point decoder with
only Qch = 4 channel quantization bits and Qmsg = 3 message
quantization bits.³ Additionally, with the above quantization
bit choices, no noticeable error floor has been observed for
either the MS decoder or the LUT-based decoder when $10^{10}$
frames have been transmitted (which corresponds to a BER
≈ $10^{-12}$ with the current block length). We note that, for the
LUT-based decoder, the performance in the error floor region
can be traded with the performance in the waterfall region by
an appropriate choice of the design SNR for the LUTs [6].
Fig. 4: Frame error rate (FER) of the IEEE 802.3an LDPC
code under floating-point MS decoding, fixed-point MS de-
coding with different bit-widths, LUT based decoding, and
floating-point offset min-sum (OMS) decoding (offset=0.5) as
reference, all with I = 5 decoding iterations.
C. LUT-Based Decoder Hardware Architecture
The LUT-based serial message-transfer decoder hardware
architecture is very similar to the MS decoder architecture,
described in Section III. However, the LUT-based decoder
can take advantage of the significantly fewer message bits
that need to be transferred from one decoding stage to the
next. This reduction reduces the number of CLKF cycles per
iteration, which in turn increases the throughput of the decoder
according to (10) provided that the CN/VN logic is sufficiently
fast. Moreover, the size of the buffers needed for the S/P and
P/S conversions is also reduced significantly, which directly
reduces the memory complexity of the decoder (see (8)).
²The reference simulation was obtained and matched with our simulation by
using the open-source simulator provided by: Adrien Cassagne; Romain Tajan;
Mathieu Lonardon; Baptiste Petit; Guillaume Delbergue; Thibaud Tonnellier;
Camille Leroux; Olivier Hartmann, AFF3CT: A Fast Forward Error Correction
Tool, 2016. [Online]. Available: https://doi.org/10.5281/zenodo.167837
³We note that reducing Qch further results in a non-negligible loss with
respect to the floating-point decoder.
On the negative side, we remark that the VN units for
each variable node stage (decoder iteration) of the LUT-
based decoder are different, which slightly complicates the
hierarchical physical implementation as we will see later.
Furthermore, since Qch > Qmsg, we now need to transfer
the channel LLRs with multiple (two) bits per cycle to avoid
the need to artificially limit the number of CLKF cycles per
iteration to Qch rather than to the smaller Qmsg. To reflect this
modification, we redefine Qmax in (10) as Qmax = max(Qmsg, ⌈Qch/2⌉).
While this partially parallel transfer of the channel LLRs
impacts routing congestion, we note that the overhead is neg-
ligible since the number of channel LLRs is small compared
to the total number of messages.
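The effect of transferring the channel LLRs with two bits per cycle can be checked with a few lines of Python. The function name and the generalization to an arbitrary number of channel bits per cycle are assumptions added for illustration; the paper itself only uses two bits per cycle.

```python
import math

def effective_q_max(q_msg, q_ch, ch_bits_per_cycle=2):
    """Number of CLKF cycles per transfer when channel LLRs use several wires."""
    return max(q_msg, math.ceil(q_ch / ch_bits_per_cycle))

print(effective_q_max(q_msg=3, q_ch=4))                       # LUT-based decoder
print(effective_q_max(q_msg=3, q_ch=4, ch_bits_per_cycle=1))  # fully serial channel LLRs
```

For Qmsg = 3 and Qch = 4, the two-bit channel transfer needs only ⌈4/2⌉ = 2 cycles, so the transfer time is set by the 3-bit messages, whereas a fully serial channel transfer would force 4 cycles per iteration.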
V. IMPLEMENTATION
Despite the use of a serial message-transfer, the physical
implementation of the decoders proposed in the previous
sections requires special scrutiny since the number of global
wires is still significant and the overall area is particularly
large. Therefore, in this section, we propose and describe
a pseudo-hierarchical design methodology to implement the
serial message-transfer architecture.
A. Physical Design
Due to the large number of identical blocks in the decoder
architecture, a bottom-up flow is expected to provide the best
results. The CN, VN, and DN units are first placed and
routed individually to build hard macros,⁴ and their timing
and physical information are extracted. These macros are then
instantiated as large cells in the decoder top level. We propose
to treat the macros as custom standard-cells with identical
height to be able to perform the placement using the standard-
cell placement, rather than the less capable macro placement of
the backend tool, since in our case the number of hard macro
instances is extremely large and the interconnect pattern is
complex and highly irregular.
Fig. 5 illustrates the proposed physical floorplan for the
decoders with the unrolled architecture. In this floorplan, the
CN and VN macros within each stage are constrained to be
placed into dedicated regions (placement regions in Fig. 5a).
This measure enforces the high-level structure of the systolic
array pipeline, but it also leaves freedom to the placement
tool to choose the location for the macros in each stage to
minimize routing congestion between stages. Note that the
linear floorplan also has the advantage of being scalable in
the number of iterations since little interaction or interference
exists between stages. Furthermore, the CN and VN macros
are placed in dedicated rows while the area between these rows
is left for repeaters and for the register standard-cells for the
channel LLRs in the check node stages, as shown in Fig. 5b.
We note that the proposed floorplan and the encapsulation of
the VN and CN macros as large standard-cells allow the
automated placement algorithm of the tool to optimize the
placement of both custom and conventional standard-cells in
order to alleviate the significant routing congestion.
4 Note that for the LUT-based decoder there are different macros for each
variable node stage, as opposed to the MS decoder.
[Fig. 5 panel labels: (a) placement regions for CN stage 1,
VN stage 1, CN stage 2, VN stage 2, ..., and the DN stage;
(b) areas for macros alternating with areas for registers
and repeaters.]
Fig. 5: The physical floorplan for serial message-transfer
architecture, (a) high level overview of the floorplan with
dedicated placement regions for each decoder stage; and (b)
zoomed in overview showing rows structure for custom macros
(large colored blocks) and conventional standard-cells (small
colored blocks) placement.
B. Timing and Area Optimization Flow
Although the synthesis results can give an approximate
evaluation for timing and area of the physical implementa-
tion, several iterations with different constraints are required
to reach an optimal layout. To this end, we propose the
methodology illustrated in the flowchart of Fig. 6 to effectively
implement the serial message-transfer architecture. The main
idea behind this methodology is that three main factors directly
contribute to the decoder throughput and also indirectly to
the decoder area, as discussed in Section III and specifically
summarized in (10). Our goal is to maximize the throughput
at a minimum area.
We define the timing constraint applied to CLKS as
TCSTR,CLKS, and the timing constraint applied to CLKF as
TCSTR,CLKF. The first step is to place and route the CN/VN
macros based on TCSTR,CLKS . This step is followed by the
implementation of the decoder using TCSTR,CLKF . (The initial
constraints for the backend are thereby extracted from syn-
thesis timing results.) The fully placed and routed design can
give an accurate routing delay, which will be used to update
TCSTR,CLKF and then TCSTR,CLKS according to (10). The updated
TCSTR,CLKS will be used to re-implement the CN and VN
macros within the minimum achievable area.
We note that for a long LDPC code with a large
area and long routing delay (such as the one of the
IEEE 802.3an standard), the first implementation starts with
TCSTR,CLKS < QmaxTCSTR,CLKF. After obtaining a realistic value
for the CLKF period (and hence for TCSTR,CLKF) at the end
of the implementation, TCSTR,CLKS is updated to
a larger value to approach TCSTR,CLKS ≈ QmaxTCSTR,CLKF.
Consequently, the CN and VN macro area, and thus the decoder
area, shrinks in the second iteration, which results in a larger
achievable CLKF frequency and hence smaller TCSTR,CLKF and
TCSTR,CLKS. The feedback loop reaches the optimum point
after a few iterations.
[Fig. 6 flowchart: Synthesis; implement CN/VN macros with
TCSTR,CLKS and minimum area; implement the decoder toplevel
with TCSTR,CLKF and extract the minimum achievable CLKF
period; update TCSTR,CLKF to the minimum achievable CLKF
period; check whether TCSTR,CLKS ≈ QmaxTCSTR,CLKF (yes: stop;
no: set TCSTR,CLKS = QmaxTCSTR,CLKF and re-implement the
macros).]
Fig. 6: The proposed flowchart to optimize timing and area
for the serial message-transfer architecture. TCSTR,CLKS is the
timing constraint applied to CLKS and TCSTR,CLKF is the timing
constraint applied to CLKF.
[Fig. 7 layout dimensions: (a) 2.99 mm × 7.78 mm; (b) 2.45 mm × 6.61 mm.]
Fig. 7: Layouts for (a) the MS decoder and (b) the LUT-based
decoder.
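The feedback loop of Fig. 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `implement_macros` and `implement_toplevel` are hypothetical stand-ins for the place-and-route runs, and the toy models inside them (macro area shrinking as the CLKS constraint is relaxed, achievable CLKF period growing with area) are invented for this sketch and are not measured data.

```python
def implement_macros(t_clks_ns):
    """Hypothetical stand-in for the CN/VN macro place-and-route run.

    Toy model (assumption): a tighter CLKS constraint costs area.
    Returns a relative macro area in arbitrary units.
    """
    return 1.0 + 2.0 / t_clks_ns


def implement_toplevel(macro_area):
    """Hypothetical stand-in for the top-level place-and-route run.

    Toy model (assumption): larger macros mean longer wires, so the
    minimum achievable CLKF period (ns) grows with the macro area.
    """
    return 0.8 + 0.3 * macro_area


def optimize(q_max, t_clks_init_ns, tol=0.01, max_iter=20):
    """Iterate until T_CSTR,CLKS is within tol of Q_max * T_CSTR,CLKF."""
    t_clks = t_clks_init_ns
    for iteration in range(1, max_iter + 1):
        area = implement_macros(t_clks)
        t_clkf = implement_toplevel(area)   # minimum achievable CLKF period
        target = q_max * t_clkf
        if abs(t_clks - target) <= tol * target:
            return t_clks, t_clkf, area, iteration
        t_clks = target                     # update the constraint and repeat
    raise RuntimeError("flow did not converge")


# Start with T_CSTR,CLKS < Q_max * T_CSTR,CLKF, as described for long codes.
t_clks, t_clkf, area, n_iter = optimize(q_max=5, t_clks_init_ns=4.0)
print(f"converged after {n_iter} iterations: "
      f"T_CLKS = {t_clks:.2f} ns, T_CLKF = {t_clkf:.2f} ns")
```

In this toy setting the relaxed CLKS constraint shrinks the macros in the second pass, which in turn shortens the CLKF period, mirroring the behavior described above.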
TABLE I: Critical path delays for MS and LUT-based decoder
| Path | MS decoder | LUT-based decoder |
|---|---|---|
| CN [ns] | 2.38 | 1.42 |
| VN [ns] | 0.96 | 1.24 |
| Routing [ns] | 1.51 | 1.16 |
[Fig. 8 bar chart, area in mm2 per component, values as extracted:]
| Component | LUT-based decoder | MS decoder |
|---|---|---|
| CN/VN macros | 9.56 | 14.12 |
| Channel LLR registers | 0.18 | 0.15 |
| Clk tree and routing buffers | 0.84 | 1.03 |
| Routing | 5.41 | 7.95 |
Fig. 8: Detailed area results for the LUT-based and MS de-
coder with total area of 16.2 mm2 and 23.3 mm2, respectively.
VI. RESULTS AND DISCUSSIONS
To study the impact of the serial message-transfer archi-
tecture and the finite-alphabet decoding scheme, we have
implemented the proposed architecture by employing the
methodology explained in Section V and we analyzed the
results for both MS and LUT-based decoding. We used the
parity check matrix of the LDPC code defined in the IEEE
802.3an standard [15], i.e., a (2048, 1723) LDPC code of
R = 0.8413 with dv = 6 and dc = 32. We used I = 5
for both decoders and Qmsg = Qch = 5 for the MS decoder
and Qmsg = 3 and Qch = 4 for the LUT-based decoder to
achieve the same error-correction performance, as described
in Section IV. The decoders were synthesized from a VHDL
description using Synopsys Design Compiler and placed and
routed using Cadence Encounter Digital Implementation. The
layouts are shown in Fig. 7. The results are reported for
a 28 nm FD-SOI library under typical operating conditions
(VDD = 1 V, T = 25 °C).
A. Delay Analysis
In the serial message-transfer architecture, the critical path
and, hence, the maximum decoding frequency are defined
by (10). To investigate the impact of serially transferring the
messages on the decoder throughput, we consider the delay
of the following register-to-register critical paths for both the
MS and LUT-based decoder.
1) CN Critical Path: The CN critical path (TCP,CN) is the
path from the S/P memory registers to the P/S shift register
within the CN unit. For both decoders, this path is essentially
comprised of the logic cells for a sorter tree with a depth of
four.
TABLE II: Detailed area for CN/VN unit*
| Component | MS decoder | LUT-based decoder |
|---|---|---|
| CN unit logic [µm2] | 1578 | 485 |
| CN unit register [µm2] | 1695 | 971 |
| CN macro [µm2] | 3607 | 1510 |
| VN unit logic [µm2] | 315 | 403† |
| VN unit register [µm2] | 381 | 235† |
| VN macro [µm2] | 755 | 646† |
* Logic and register areas are obtained from synthesis, and macro areas
are the final post-layout results.
† Although the VNs are different for each stage of the LUT-based decoder,
their areas are similar and the result for one of them is reported here.
TABLE III: Power and energy efficiency comparison for MS
and LUT-based decoder
| | MS decoder | LUT-based decoder |
|---|---|---|
| Total power @ fmax [mW] | 12248 | 13350 |
| Total power @ 662 MHz [mW] | 12248 | 10257 |
| Leakage power [mW] | 7.44 | 5.27 |
| Energy efficiency [pJ/bit] | 45.2 | 22.7 |
2) VN Critical Path: The VN critical path (TCP,VN) is the
path from S/P memory registers to P/S shift register within
the VN unit. This path is dominated by an adder tree for the
MS decoder and an LUT tree for the LUT-based decoder.
3) Routing Critical Path: The routing critical path
(TCP,route) comprises mainly the interconnect wires (and
buffers) that connect the CN/VN unit P/S shift register to the
VN/CN unit S/P shift register.
Table I summarizes the critical path delays of the CN/VN
and the routing path. Together with (10), the values in the
table dictate the maximum achievable frequency for CLKS
and CLKF, respectively, for both of the decoders with the
proposed serial message-transfer architecture. We note that the
critical paths are reported after the timing and area constraints
for the CN/VN macros and for the decoder toplevel are
jointly optimized according to the flow shown in Fig. 6.
According to (10), we observe that in both decoders the
message transfer limits the slow clock CLKS to a period
of 5 × 1.51 ns = 7.55 ns and 3 × 1.16 ns = 3.48 ns for the
MS and the LUT-based decoder, respectively, where 1.51 ns
and 1.16 ns are the corresponding minimum CLKF periods.
Consequently, in our flow, the VN and CN units end up
optimized for minimum area only, with relaxed and
easy-to-meet timing constraints.
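The arithmetic behind these clock periods can be reproduced from Table I. The sketch below is a hedged illustration: it assumes that (10) amounts to CLKS covering the slowest of the CN path, the VN path, and the serial transfer of one message, and that one codeword of N = 2048 bits leaves the unrolled pipeline per CLKS cycle. Both assumptions are our reading and agree with the frequencies and throughputs of Table IV; the names `PATHS_NS` and `clock_summary` are introduced only for this example.

```python
# Critical-path delays in ns (Table I) and serial message width Q in bits.
PATHS_NS = {
    "MS":  {"cn": 2.38, "vn": 0.96, "route": 1.51, "q": 5},
    "LUT": {"cn": 1.42, "vn": 1.24, "route": 1.16, "q": 3},
}
N_BITS = 2048  # codeword length of the (2048, 1723) code


def clock_summary(p):
    """Derive clock periods and throughput from the critical paths."""
    t_clkf = p["route"]              # minimum CLKF period (ns)
    t_transfer = p["q"] * t_clkf     # Q CLKF cycles move one message
    # Assumed reading of (10): CLKS must cover the slowest of the
    # CN path, the VN path, and the serial message transfer.
    t_clks = max(p["cn"], p["vn"], t_transfer)
    return {
        "f_clkf_mhz": 1e3 / t_clkf,
        "t_clks_ns": t_clks,
        # Assumption: one codeword leaves the pipeline per CLKS cycle.
        "throughput_gbps": N_BITS / t_clks,  # bits per ns equals Gbps
        "cn_slack_ns": t_clks - p["cn"],
        "vn_slack_ns": t_clks - p["vn"],
    }


summaries = {name: clock_summary(p) for name, p in PATHS_NS.items()}
for name, s in summaries.items():
    print(f"{name}: f_CLKF = {s['f_clkf_mhz']:.0f} MHz, "
          f"T_CLKS = {s['t_clks_ns']:.2f} ns, "
          f"throughput = {s['throughput_gbps']:.1f} Gbps, "
          f"CN slack = {s['cn_slack_ns']:.2f} ns, "
          f"VN slack = {s['vn_slack_ns']:.2f} ns")
```

The positive CN and VN slack in both decoders is what allows the macros to be optimized for area rather than speed.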
B. Area Analysis
Fig. 8 illustrates the area distribution among the various
components after the layout. The area utilization is approxi-
mately 67% for both decoders. While almost 62% of the layout
is filled with CN/VN macros and registers, the clock tree and
routing buffers occupy around 5%. Furthermore, the total
area of the MS decoder is 44% larger than that of the
LUT-based decoder, mainly because the total area for CN and VN
macros is 14.12 mm2 in the MS decoder, as opposed to only
9.56 mm2 in the LUT-based decoder.
TABLE IV: Implementation results for MS and LUT-based decoder and comparison with other works
| | MS decoder | LUT-based decoder | [14] | [8] | [20] | [23] | [27] | [26] |
|---|---|---|---|---|---|---|---|---|
| Process technology | 28 nm FD-SOI | 28 nm FD-SOI | 65 nm CMOS | 65 nm CMOS low-power | 65 nm CMOS | 90 nm CMOS | 90 nm CMOS | 90 nm CMOS |
| Supply voltage [V] | 1.0 | 1.0 | 1.2 | 1.2 | 1.3 | 0.9 | 1.2 | 1.0 |
| LDPC code | (2048, 1723) | (2048, 1723) | (672, 546) | (2048, 1723) | (2048, 1723) | (2048, 1723) | (2048, 1723) | (2048, 1723) |
| Node degree (dv, dc) | (6, 32) | (6, 32) | (3, 6) | (6, 32) | (6, 32) | (6, 32) | (6, 32) | (6, 32) |
| Algorithm | min-sum | finite-alphabet | min-sum | offset min-sum with post processor | split-row | normalized probabilistic min-sum | reduced-complexity min-sum | delayed stochastic |
| Imax | 5 | 5 | 9 | 8 | 11 | 9 | 30 | − |
| Quantization bits | 5 | 3 | 4 | 4 | 5 | 4 | 6 | 5 |
| Eb/N0 @ BER = 10^-7 [dB] | 4.97 | 4.95 | − | 4.25 | 4.55 | 4.4 | 4.32 | 4.7 |
| Architecture | unrolled full-parallel | unrolled full-parallel | unrolled full-parallel | partial-parallel | full-parallel | full-parallel | full-parallel | full-parallel |
| Core area [mm2] | 23.3 | 16.2 | 12.9 | 5.05 | 4.84 | 9.6 | 3.84 | 3.93 |
| Area utilization [%] | 66.4 | 65.9 | 76 | 84.5 | 97 | 91 | − | 93 |
| Max. frequency (fmax) [MHz] | 662 | 862 | 257 | 700 | 195 | 199.6 | 226 | 750 |
| Latency [ns] | 151 | 69.6 | 105 | 137 | 56.4 | 45.09 | − | 800 |
| Throughput @ Imax [Gbps] | 271 | 588 | 160.8 | 13.3 | 36.3 | 45.42 | 12.8 | 172.4* |
| Power @ fmax [mW] | 12248 | 13350 | 5360 | 2800 | 1359 | 1110 | 1040 | − |
| Area eff. [Gbps/mm2] | 11.6 | 36.3 | 12.5 | 2.63 | 7.5 | 4.73 | 3.34 | 43.86 |
| Energy per bit @ Imax [pJ/bit] | 45.2 | 22.7 | 33.3 | 210.5 | 37.4 | 24.44 | 81.2 | − |
| Scaled area eff.† [Gbps/mm2] | 11.6 | 36.3 | 156.4 | 32.9 | 93.8 | 157.1 | 110.9 | 1456.8 |
| Scaled energy per bit‡ [pJ/bit] | 45.2 | 22.7 | 10 | 63 | 9.5 | 9.4 | 17.5 | − |
* Maximum throughput @ Eb/N0 = 5.5 dB (Note that throughput @ Imax is not reported in the original paper)
† Scaling is done by S^3, where S is the relative dimension to 28 nm (Note that this is very rough and optimistic since it does not apply to the interconnects)
‡ Scaling is done by 1/(S U^2), where U is the relative voltage to 1.0 V
To understand this fact, we list the area of each CN/VN
macro in Table II. According to this table, the finite-alphabet
message passing algorithm leads to significantly smaller CN
processors because of two important factors: first, the bit-width
reduction of the messages directly affects the data-path area,
and second, the quantized messages in the LUT-based decoder
are processed directly in the sorter tree of the CNs without
the need to compute their absolute values. However, VN
processors are less area-efficient in the LUT-based decoder in
comparison with the ones of the MS decoder. This is caused by
the fact that the LUT-based computations are, in general, less
area-efficient than the conventional arithmetic based update
rules. Thus, the logic area of the VNs in the LUT-based decoder
is larger, even though their input/output bit-width is smaller.
Another contributing factor in Table II is the register area,
which is defined by the number of S/P and P/S registers.
For those, the 40% reduction of bit-width in the LUT-based
decoder is directly noticeable in the register area for both CN
and VN units. Altogether, the CN and VN macros in the
LUT-based decoder are 58% and 14% smaller, respectively,
compared to those of the MS decoder.
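The percentages quoted above follow directly from Table II and from the message bit-widths (5 bits for MS, 3 bits for the LUT-based decoder). The short script below recomputes them; the helper name `reduction_percent` and the dictionary layout are introduced only for this example.

```python
# Areas from Table II in um^2
# (logic/register: synthesis results; macro: post-layout results).
AREA_UM2 = {
    "CN": {
        "MS":  {"logic": 1578, "register": 1695, "macro": 3607},
        "LUT": {"logic": 485,  "register": 971,  "macro": 1510},
    },
    "VN": {
        "MS":  {"logic": 315, "register": 381, "macro": 755},
        "LUT": {"logic": 403, "register": 235, "macro": 646},
    },
}


def reduction_percent(ms_value, lut_value):
    """Relative saving of the LUT-based decoder; negative means larger."""
    return 100.0 * (ms_value - lut_value) / ms_value


bit_width_reduction = reduction_percent(5, 3)  # Q_msg: 5 bits -> 3 bits
print(f"message bit-width reduction: {bit_width_reduction:.0f}%")

reductions = {}
for unit, per_decoder in AREA_UM2.items():
    for part in ("logic", "register", "macro"):
        saving = reduction_percent(per_decoder["MS"][part],
                                   per_decoder["LUT"][part])
        reductions[(unit, part)] = saving
        print(f"{unit} {part}: {saving:+.1f}% smaller in the LUT-based decoder")
```

The negative value for the VN logic reflects the larger LUT-based update logic, while the register savings of both units sit close to the 40% bit-width reduction.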
C. Power Analysis
The energy consumed by each decoder is proportional to the
capacitance, which in turn is related to the decoder area. Also,
the number of CLKF cycles that the serial message-transfer
requires to decode one codeword, which is inversely proportional
to the decoding throughput at a constant frequency, directly
contributes to the energy consumed for each decoded bit.
Therefore, we analyze both the total power and the energy
efficiency of the decoders using post-layout vector-based power
analysis.5 The results are reported in Table III. We note that
the total powers are calculated at fmax for both decoders. Also,
for comparison purposes, we have calculated the total powers at
a constant CLKF frequency, here min(fmax,MS, fmax,LUT) = 662 MHz,
for both decoders and list them in Table III. According to this
table, at this common frequency the total power consumption of
the LUT-based decoder is 16.2% smaller than that of the MS
decoder. Furthermore, by considering the fact that the LUT-based
decoder has 66.7% higher throughput than the MS decoder at a
similar CLKF frequency, the energy efficiency of the LUT-based
decoder is almost 2 times better in comparison with the MS decoder.
5 We first extract the parasitic information of both the hard macros and the
top level from the placement and routing tool and then read and link them
using a power computation tool to generate the complete parasitic information.
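The figures in Table III can be cross-checked with a few lines of arithmetic. The sketch below assumes, as our reading of the serial message-transfer scheme, that one 2048-bit codeword is produced every Q CLKF cycles (Q = 5 for MS, Q = 3 for the LUT-based decoder); the function and variable names are illustrative only.

```python
N_BITS = 2048          # codeword length
F_COMMON_MHZ = 662     # min(f_max,MS, f_max,LUT)

# Power values in mW from Table III; q is the serial message width in bits.
DECODERS = {
    "MS":  {"p_fmax": 12248, "p_common": 12248, "f_max": 662, "q": 5},
    "LUT": {"p_fmax": 13350, "p_common": 10257, "f_max": 862, "q": 3},
}


def throughput_gbps(f_clkf_mhz, q):
    """Assumption: one codeword is produced every q CLKF cycles."""
    return N_BITS * f_clkf_mhz / q / 1e3


def energy_pj_per_bit(power_mw, tput_gbps):
    """mW divided by Gbps gives pJ per bit."""
    return power_mw / tput_gbps


energy = {
    name: energy_pj_per_bit(d["p_fmax"], throughput_gbps(d["f_max"], d["q"]))
    for name, d in DECODERS.items()
}
power_saving = 100.0 * (1 - DECODERS["LUT"]["p_common"]
                        / DECODERS["MS"]["p_common"])
tput_gain = 100.0 * (throughput_gbps(F_COMMON_MHZ, DECODERS["LUT"]["q"])
                     / throughput_gbps(F_COMMON_MHZ, DECODERS["MS"]["q"]) - 1)

for name, e in energy.items():
    print(f"{name}: {e:.1f} pJ/bit at f_max")
print(f"power saving at {F_COMMON_MHZ} MHz: {power_saving:.1f}%")
print(f"throughput gain at equal CLKF frequency: {tput_gain:.1f}%")
print(f"energy-efficiency ratio MS/LUT: {energy['MS'] / energy['LUT']:.2f}")
```

The energy per bit comes out the same whether it is evaluated at fmax or at the common 662 MHz, since both power and throughput scale with the clock frequency.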
D. Summary and Final Comparison
The final post-layout results for our MS and LUT-based de-
coders and also for some other recently implemented decoders
are summarized in Table IV. Our LUT-based decoder runs
at a maximum CLKF frequency of fmax,LUT = 862 MHz and
delivers a sustained throughput of 588 Gbps, while it occupies
16.2 mm2 area and dissipates 22.7 pJ/bit. Compared to the MS
decoder, the LUT-based decoder is 1.4× smaller, 2.2× faster,
and thus 3.1× more area efficient. It also has 16.2% lower
power dissipation at the same CLKF frequency and 2× better
energy efficiency, when the decoding throughput is taken
into account.
The work in [14] is the only other unrolled full-
parallel decoder in literature, but it is designed for the
IEEE 802.11ad [28] code, which has a shorter block length
and smaller node degrees (dv = 3 and dc = 6 as opposed
to dv = 6 and dc = 32 for the code used in the design
reported in this paper). The works in [8], [20], [23], [26], and
[27] target the same IEEE 802.3an code considered in this
paper, but with partial-parallel and full-parallel architectures.
The proposed LUT-based decoder has more than an order of
magnitude higher throughput compared to [20] and [23], and
three times higher throughput compared to [26], while the
maximum throughput of the proposed decoder is maintained
for all SNR scenarios as it does not require early termination to
achieve a high throughput. The area efficiency of the proposed
unrolled full-parallel architecture after technology scaling
(see Table IV), however, is inferior to that of the decoders
in [20], [23] and [27] with full-parallel architecture due to
the repeated routing overhead between the decoder stages in
our design.
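The scaled rows of Table IV follow from the two footnoted rules: area efficiency is multiplied by S^3 and energy per bit is divided by S·U^2, with S the feature size relative to 28 nm and U the supply voltage relative to 1.0 V. The sketch below applies these rules to the unscaled table entries; as the footnote warns, this scaling is rough and optimistic. The names `scaled_area_eff`, `scaled_energy`, and `REFS` are introduced only for this example.

```python
TARGET_NM = 28.0   # reference technology node
TARGET_V = 1.0     # reference supply voltage


def scaled_area_eff(area_eff, node_nm):
    """Scale area efficiency [Gbps/mm2] to 28 nm by S^3."""
    s = node_nm / TARGET_NM
    return area_eff * s ** 3


def scaled_energy(energy_pj, node_nm, vdd):
    """Scale energy per bit [pJ/bit] to 28 nm and 1.0 V by 1/(S U^2)."""
    s = node_nm / TARGET_NM
    u = vdd / TARGET_V
    return energy_pj / (s * u ** 2)


# Unscaled entries from Table IV: node, supply, area eff., energy per bit.
REFS = {
    "[14]": (65, 1.2, 12.5, 33.3),
    "[8]":  (65, 1.2, 2.63, 210.5),
    "[20]": (65, 1.3, 7.5, 37.4),
    "[23]": (90, 0.9, 4.73, 24.44),
    "[27]": (90, 1.2, 3.34, 81.2),
}

scaled = {
    ref: (scaled_area_eff(a, node), scaled_energy(e, node, vdd))
    for ref, (node, vdd, a, e) in REFS.items()
}
for ref, (a_s, e_s) in scaled.items():
    print(f"{ref}: {a_s:.1f} Gbps/mm2, {e_s:.1f} pJ/bit after scaling")
```

Because S = 1 and U = 1 for the 28 nm designs, the MS and LUT-based entries are unchanged by the scaling.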
VII. CONCLUSION
An ultra-high throughput LDPC decoder with a serial
message-transfer architecture and based on non-uniform
quantization of messages was proposed to achieve the highest
decoding throughput in the literature. The proposed decoder
architecture is an unrolled full-parallel architecture with serialized
messages for CN/VN units, which was enabled by employing
S/P and P/S shift registers at the inputs and outputs of each
unit. The proposed quantized message passing algorithm re-
places conventional MS, resulting in 40% reduction in message
bit-width without any performance penalty. This algorithm was
implemented by using generic LUTs instead of adders for VNs
while the CNs remained unchanged compared to MS decod-
ing. Placement and routing results in 28 nm FD-SOI show
that the LUT-based serial message-transfer decoder delivers
0.588 Tbps throughput and is 3.1 times more area efficient
and 2 times more energy efficient in comparison with the MS
decoder with serial message-transfer architecture.
ACKNOWLEDGMENT
This work was supported by the Swiss National Science
Foundation (SNSF) under the project number 200021-153640.
REFERENCES
[1] D. J. MacKay, “Good error-correcting codes based on very sparse
matrices,” IEEE Trans. Inf. Theory, vol. 45, no. 2, pp. 399–431, 1999.
[2] M. P. Fossorier, M. Mihaljević, and H. Imai, “Reduced complexity
iterative decoding of low-density parity check codes based on belief
propagation,” IEEE Trans. Commun., vol. 47, no. 5, pp. 673–680, 1999.
[3] S. K. Planjery, D. Declercq, L. Danjean, and B. Vasic, “Finite alphabet
iterative decoders–Part I: Decoding beyond belief propagation on the
binary symmetric channel,” IEEE Trans. Commun., vol. 61, no. 10, pp.
4033–4045, 2013.
[4] B. Kurkoski, K. Yamaguchi, and K. Kobayashi, “Noise thresholds for
discrete LDPC decoding mappings,” in IEEE Global Telecommun. Conf.
(GLOBECOM), Nov. 2008, pp. 1–5.
[5] A. Balatsoukas-Stimming, M. Meidlinger, R. Ghanaatian, G. Matz, and
A. Burg, “A fully-unrolled LDPC decoder based on quantized message
passing,” in IEEE Int. Workshop on Signal Process. Syst. (SiPS), Oct
2015, pp. 1–6.
[6] M. Meidlinger, A. Balatsoukas-Stimming, A. Burg, and G. Matz, “Quan-
tized message passing for LDPC codes,” in Asilomar Conf. on Signals,
Syst., and Comput. (ACSSC), Nov 2015, pp. 1606–1610.
[7] F. J. C. Romero and B. M. Kurkoski, “Decoding LDPC codes with
mutual information-maximizing lookup tables,” in IEEE Int. Symp. on
Inf. Theory (ISIT), Jun. 2015, pp. 426–430.
[8] Z. Zhang, V. Anantharam, M. J. Wainwright, and B. Nikolić, “An
efficient 10GBASE-T Ethernet LDPC decoder design with low error
floors,” IEEE J. Solid-State Circuits, vol. 45, no. 4, pp. 843–855, 2010.
[9] A. Cevrero, Y. Leblebici, P. Ienne, and A. Burg, “A 5.35 mm2
10GBASE-T Ethernet LDPC decoder chip in 90 nm CMOS,” in IEEE
Asian Solid-State Circuits Conf. (A-SSCC), 2010, pp. 1–4.
[10] C. Roth, P. Meinerzhagen, C. Studer, and A. Burg, “A 15.8 pJ/bit/iter
quasi-cyclic LDPC decoder for IEEE 802.11n in 90 nm CMOS,” in
IEEE Asian Solid-State Circuits Conf. (A-SSCC), 2010, pp. 1–4.
[11] T.-C. Kuo and A. N. Willson Jr, “A flexible decoder IC for WiMAX
QC-LDPC codes,” in IEEE Custom Integrated Circuits Conf. (CICC),
2008, pp. 527–530.
[12] A. J. Blanksby and C. J. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2
low-density parity-check code decoder,” IEEE J. Solid-State Circuits,
vol. 37, no. 3, pp. 404–412, 2002.
[13] N. Onizawa, T. Hanyu, and V. C. Gaudet, “Design of high-throughput
fully parallel LDPC decoders based on wire partitioning,” IEEE Trans.
VLSI Syst., vol. 18, no. 3, pp. 482–489, 2010.
[14] P. Schläfer, N. Wehn, M. Alles, and T. Lehnigk-Emden, “A new
dimension of parallelism in ultra high throughput LDPC decoding,” in
IEEE Int. Workshop on Signal Process. Syst. (SiPS), 2013, pp. 153–158.
[15] “IEEE Standard for Information Technology – Telecommunications and
Information Exchange between Systems – Local and Metropolitan Area
Networks – Specific Requirements Part 3: Carrier Sense Multiple Access
with Collision Detection (CSMA/CD) Access Method and Physical
Layer Specifications,” IEEE Std. 802.3an, Sep. 2006.
[16] K. Cushon, P. Larsson-Edefors, and P. Andrekson, “Low-power 400-
Gbps soft-decision LDPC FEC for optical transport networks,” J. Lightw.
Technol., vol. 34, no. 18, pp. 4304–4311, Aug. 2016.
[17] H. Kaeslin, Digital integrated circuit design: from VLSI architectures to
CMOS fabrication. Cambridge University Press, 2008.
[18] A. Darabiha, A. C. Carusone, and F. R. Kschischang, “A 3.3-Gbps bit-
serial block-interlaced min-sum LDPC decoder in 0.13-µm CMOS,” in
IEEE Custom Integrated Circuits Conf. (CICC), 2007, pp. 459–462.
[19] A. Darabiha, A. C. Carusone, and F. R. Kschischang, “Power reduction
techniques for LDPC decoders,” IEEE J. Solid-State Circuits, vol. 43,
no. 8, pp. 1835–1845, 2008.
[20] T. Mohsenin, D. N. Truong, and B. M. Baas, “A low-complexity
message-passing algorithm for reduced routing congestion in LDPC
decoders,” IEEE Trans. Circuits Syst. I, vol. 57, no. 5, pp. 1048–1061,
2010.
[21] N. Mobini, A. H. Banihashemi, and S. Hemati, “A differential binary
message-passing LDPC decoder,” IEEE Trans. Commun., vol. 57, no. 9,
pp. 2518–2523, 2009.
[22] K. Cushon, S. Hemati, C. Leroux, S. Mannor, and W. J. Gross, “High-
throughput energy-efficient LDPC decoders using differential binary
message passing,” IEEE Trans. Signal Process., vol. 62, no. 3, pp. 619–
631, 2014.
[23] C.-C. Cheng, J.-D. Yang, H.-C. Lee, C.-H. Yang, and Y.-L. Ueng, “A
fully parallel LDPC decoder architecture using probabilistic min-sum
algorithm for high-throughput applications,” IEEE Trans. Circuits Syst.
I, vol. 61, no. 9, pp. 2738–2746, 2014.
[24] S. S. Tehrani, W. Gross, and S. Mannor, “Stochastic decoding of LDPC
codes,” IEEE Commun. Lett., vol. 10, no. 10, pp. 716–718, 2006.
[25] S. S. Tehrani, A. Naderi, G.-A. Kamendje, S. Hemati, S. Mannor, and
W. J. Gross, “Majority-based tracking forecast memories for stochastic
LDPC decoding,” IEEE Trans. Signal Process., vol. 58, no. 9, pp. 4883–
4896, 2010.
[26] A. Naderi, S. Mannor, M. Sawan, and W. J. Gross, “Delayed stochastic
decoding of LDPC codes,” IEEE Trans. Signal Process., vol. 59, no. 11,
pp. 5617–5626, 2011.
[27] F. Angarita, J. Valls, V. Almenar, and V. Torres, “Reduced-complexity
min-sum algorithm for decoding LDPC codes with low error-floor,”
IEEE Trans. Circuits Syst. I, vol. 61, no. 7, pp. 2150–2158, 2014.
[28] “ISO/IEC/IEEE international standard for information technology–
telecommunications and information exchange between systems–local
and metropolitan area networks–specific requirements-part 11: Wireless
lan medium access control (MAC) and physical layer (PHY) specifi-
cations amendment 3: Enhancements for very high throughput in the
60 GHz band (adoption of IEEE Std 802.11ad-2012),” ISO/IEC/IEEE
8802-11:2012/Amd.3:2014(E), pp. 1–634, March 2014.
Reza Ghanaatian was born in Jahrom, Iran, in
1988. He received the M.Sc. degree in digital sys-
tems from the Department of Electrical Engineering,
Sharif University of Technology (SUT), Tehran,
Iran, in 2012. He was with Advanced Integrated
Circuit Design Laboratory (AICDL), at SUT from
2011 to 2013, working on field-programmable gate
array based systems for wireless and optical com-
munication applications.
In 2014, Mr. Ghanaatian joined Telecommunica-
tions Circuits Laboratory (TCL) at EPFL, Lausanne,
Switzerland, working towards his Ph.D. His current research interests include
VLSI circuits for signal processing and communications as well as approxi-
mate computing techniques for energy efficient system design.
Alexios Balatsoukas-Stimming received the
Diploma and M.Sc. degrees in Electronics
and Computer Engineering from the Technical
University of Crete, Greece, in 2010 and 2012,
respectively. His M.Sc. studies were supported
by a scholarship from the Alexander S. Onassis
foundation. He received the Ph.D degree in
Computer and Communications Sciences from
EPFL, Switzerland, where he performed his research
at the Telecommunications Circuits Laboratory.
He serves as reviewer for several IEEE Journals
and Conferences, and has been recognized as an Exemplary Reviewer
by the IEEE Wireless Communications Letters in 2013 and 2014. His
current research interests include VLSI circuits for signal processing and
communications, as well as error correction coding theory and practice.
Thomas Christoph Müller (S’16) received the
bachelor’s degree in information technology from
the Schmalkalden University of Applied Sciences,
Schmalkalden, Germany, in 2011, and the mas-
ter’s degree in System-on-Chip from Lund Univer-
sity, Sweden, in 2013. He has been working as
Project/Research Assistant at Lund University and
at Technical University of Denmark, Kongens Lyn-
gby, Denmark. Currently he is pursuing his Ph.D.
degree in the Telecommunication Circuits Laboratory
(TCL) at the École polytechnique fédérale de
Lausanne (EPFL), Switzerland, with a research focus on digital implementation
and an emphasis on low power, variation mitigation and design methodology.
Michael Meidlinger was born in Austria in 1989.
He received his B.Sc. and M.Sc. degrees from Technische
Universität (TU) Wien, Vienna, Austria, in 2011
and 2013, respectively, both with distinction. From
2011 to 2013, Michael has been working in the
field of mobile communication research and helped
to develop the Vienna LTE-A Simulators. Since
2013, Michael is part of the Communication Theory
group at TU Wien, working towards his Ph.D. His
current research interests include quantizer design
for telecommunication receivers as well as error
correction coding and superposition modulation techniques.
Gerald Matz received the Dipl.-Ing. (1994) and Dr.
techn. (2000) degrees in Electrical Engineering and
the Habilitation degree (2004) for Communication
Systems from Vienna University of Technology,
Austria. He currently holds a tenured Associate
Professor position with the Institute of Telecom-
munications, Vienna University of Technology. He
has held visiting positions with the Laboratoire des
Signaux et Systèmes at Ecole Supérieure d’Electricité
(France, 2004), the Communication Theory Lab at
ETH Zurich (Switzerland, 2007), and with Ecole
Nationale Supérieure d’Electrotechnique, d’Electronique, d’Informatique et
d’Hydraulique de Toulouse (France, 2011).
Prof. Matz has directed or actively participated in several research projects
funded by the Austrian Science Fund (FWF), by the Viennese Science and
Technology Fund (WWTF), and by the European Union. He has published
some 200 scientific articles in international journals, conference proceedings,
and edited books. He is co-editor of the book Wireless Communications over
Rapidly Time-Varying Channels (New York: Academic, 2011). His research
interests include wireless networks, statistical signal processing, information
theory, and big data.
Prof. Matz served as a member of the IEEE SPS Technical Committee
on Signal Processing Theory and Methods and of the IEEE SPS Technical
Committee on Signal Processing for Communications and Networking. He
was an Associate Editor of the IEEE Transactions on Information Theory
(2013–2015), of the IEEE Transactions on Signal Processing (2006–2010), of
the EURASIP Journal Signal Processing (2007–2010), and of the IEEE Signal
Processing Letters (2004–2008). He was Technical Program Chair of Asilomar
2016, Technical Program Co-Chair of EUSIPCO 2004, Technical Area Chair
for MIMO Communications and Signal Processing at Asilomar 2012, and
Technical Area Chair for Array Processing at Asilomar 2015. He has been
a member of the Technical Program Committee of numerous international
conferences. In 2006 he received the Kardinal Innitzer Most Promising Young
Investigator Award. He is an IEEE Senior Member and a member of the ÖVE.
Adam Teman received the B.Sc., M.Sc., and Ph.D.
degrees in Electrical Engineering from Ben-Gurion
University (BGU), Be’er Sheva, Israel in 2006, 2011,
and 2014, respectively. He worked as a Design
Engineer at Marvell Semiconductors from 2006 to
2007, with an emphasis on Physical Implementation.
Dr. Teman’s research interests include low-voltage
digital design, energy efficient SRAM, NVM, and
eDRAM memory arrays, low power CMOS image
sensors, low power design techniques for digital and
analog VLSI chips, significance-driven approximate
computing, and process tolerant design techniques. He has authored more
than 40 scientific papers and 4 patent applications and is an associate
editor at the Microelectronics Journal and a technical committee member
of several IEEE conferences. In 2010–2012, Dr. Teman was honored with
the Electrical Engineering Department’s Teaching Excellence recognition at
BGU, and in 2011, he was awarded with BGU’s Outstanding Project award.
Dr. Teman received the Yizhak Ben-Yaakov HaCohen Prize in 2010, the
BGU Rector’s Prize for Outstanding Academic Achievement in 2012, the
Wolf Foundation Scholarship for excellence of 2012 and the Intel Prize for
Ph.D. students in 2013. His doctoral studies were conducted under a Kreitman
Foundation Fellowship. Dr. Teman was a post-doctoral researcher at the
Telecommunications Circuits Lab (TCL) at the École Polytechnique Fédérale
de Lausanne (EPFL), Switzerland, under a Swiss Government Excellence
Scholarship from 2014 to 2015. In October 2015, Dr. Teman joined the faculty
of engineering at Bar-Ilan University, Ramat Gan, Israel, where he is
currently a tenure track researcher in the department of electrical engineering
and a leading member of the Emerging Nanoscaled Integrated Circuits and
Systems (EnICS) Labs.
Andreas Burg (S’97-M’05) was born in Munich,
Germany, in 1975. He received his Dipl.-Ing. de-
gree from the Swiss Federal Institute of Technology
(ETH) Zurich, Zurich, Switzerland, in 2000, and the
Dr. sc. techn. degree from the Integrated Systems
Laboratory of ETH Zurich, in 2006. In 1998, he
worked at Siemens Semiconductors, San Jose, CA.
During his doctoral studies, he worked at Bell Labs
Wireless Research for a total of one year. From
2006 to 2007, he was a postdoctoral researcher
at the Integrated Systems Laboratory and at the
Communication Theory Group of the ETH Zurich. In 2007 he co-founded
Celestrius, an ETH-spinoff in the field of MIMO wireless communication,
where he was responsible for the ASIC development as Director for VLSI.
In January 2009, he joined ETH Zurich as SNF Assistant Professor and as
head of the Signal Processing Circuits and Systems group at the Integrated
Systems Laboratory. Since January 2011, he has been a Tenure Track Assistant
Professor at the Ecole Polytechnique Federale de Lausanne (EPFL) where he
is leading the Telecommunications Circuits Laboratory.
In 2000, Mr. Burg received the Willi Studer Award and the ETH Medal for
his diploma and his diploma thesis, respectively. Mr. Burg was also awarded
an ETH Medal for his Ph.D. dissertation in 2006. In 2008, he received a
4-years grant from the Swiss National Science Foundation (SNF) for an SNF
Assistant Professorship. With his students he received the best paper award
from the EURASIP Journal on Image and Video Processing in 2013 and best
demo/paper awards at ISCAS 2013, ICECS 2013, and at ACSSC 2007.
He has served on the TPC of various conferences on signal processing,
communications, and VLSI. He was a TPC co-chair for VLSI-SoC 2012
and the TPC co-chair for ESSCIRC 2016 and SiPS 2017. He served as an
Editor for the IEEE Transactions on Circuits and Systems in 2013 and is on
the Editorial board of the Springer Microelectronics Journal and the MDPI
Journal on Low Power Electronics and its Applications. He is also a member
of the EURASIP SAT SPCN and of the IEEE TC-DISPS.
