Hardware Acceleration of Beamforming in a UWB Imaging Unit for Breast Cancer Detection by Colonna, Francesco et al.
Flexible LDPC decoder architectures
 Muhammad Awais,  Carlo Condo
Politecnico di Torino
Abstract—Flexible channel decoding is gaining significance with
the increase in the number of wireless standards and of modes within
a standard. A flexible channel decoder is a solution providing
inter-standard and intra-standard support without changes in
hardware. However, the design of efficient implementations of
flexible Low-Density Parity-Check (LDPC) code decoders satisfying
area, speed and power constraints is a challenging task and
still requires considerable research effort. This paper provides
an overview of the state of the art in the design of flexible LDPC
decoders. The published solutions are evaluated at two levels
of architectural design: the Processing Element (PE) and the
interconnection structure. A qualitative and quantitative analysis
of different design choices is carried out, and a comparison is
provided in terms of achieved flexibility, throughput, decoding
efficiency and area (power) consumption.
Index Terms—Low Density Parity Check codes, Channel Decoder,
Layered Decoding, NOC, Flexible architectures
I. INTRODUCTION
In channel decoding, flexibility is the ability of a decoder to
support different types of codes, enabling its use in a wide
variety of situations. Much research has been devoted to this
topic following the great increase in the number of standards,
standard complexity and code variety witnessed in recent years.
Next generation communication standards such as DVB-S2 [1],
IEEE 802.11n (WiFi) [2], IEEE 802.3an (10GBASE-T) [3] and
IEEE 802.16e (WiMAX) [4] feature multiple codes (LDPC, Turbo),
where each code comes with various code lengths and rates. The
need for flexible channel decoder intellectual properties (IPs)
is evident, and meeting it is challenging due to the often
unforgiving throughput requirements and narrow constraints on
decoder latency, power and area.
This work gives an overview of the most remarkable
techniques in the context of flexible channel decoding. We
discuss the design and implementation of the two major functional
blocks of flexible decoders: the Processing Element (PE) and the
interconnection structure. Various design choices are analyzed
in terms of achieved flexibility, performance, design novelty
and area (power) consumption.
The paper is organized as follows. Section 2 provides a
brief introduction to LDPC codes and decoding. Section 3
gives an overview of flexible LDPC decoders classifying them
on the basis of some important attributes e.g. parallelism,
implementation platforms and decoding schedules. Sections
4 and 5 are dedicated to PE and Interconnection structure
respectively, where we depict various design methodologies
and analyze some state of the art flexible LDPC decoders.
Finally, Section 6 draws the conclusions.
II. LDPC DECODING
A. Introduction
LDPC codes [5] are a special class of linear block codes.
A binary LDPC code is represented by a sparse parity check
matrix $H$ with dimensions $M \times N$ such that each element
$h_{mn}$ is either 0 or 1. $N$ is the length of the codeword and
$M$ is the number of parity bits. Each matrix row $H(i, \{1 \le j \le N\})$
introduces one parity check constraint on the input data vector
$x = \{x_1, x_2, \ldots, x_N\}$:
$$H_i \cdot x^T = 0 \bmod 2$$
The complete $H$ matrix can best be described by a Tanner
graph [6], a graphical representation of the associations between
code bits and parity checks. Each row of $H$ corresponds to
a Check Node (CN), while each column corresponds to a
Variable Node (VN) in the graph. An edge $e_{ji}$ on the Tanner
graph connects $VN_j$ with $CN_i$ only if the corresponding
element $h_{ij}$ is a '1' in $H$. If the number of edges entering a
node is constant for all nodes of the graph, the LDPC code is
called regular; if the node degree varies, it is called irregular.
Irregular LDPC codes yield better decoding performance compared
to regular ones.
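The parity check constraint and the Tanner graph sets can be made concrete with a short sketch. The matrix below is a small (7,4) Hamming-style parity check matrix chosen purely for illustration (it is not taken from any standard), and the helper names `syndrome`, `tanner_sets` and `is_regular` are hypothetical.

```python
# Toy parity check matrix (M = 3 checks, N = 7 bits); illustrative only.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def syndrome(H, x):
    """Return H * x^T mod 2: one bit per check node (all zero for a codeword)."""
    return [sum(h * b for h, b in zip(row, x)) % 2 for row in H]

def tanner_sets(H):
    """Return (N, M): N[i] lists the VNs of check i, M[j] lists the CNs of bit j."""
    rows, cols = len(H), len(H[0])
    N = [[j for j in range(cols) if H[i][j]] for i in range(rows)]
    M = [[i for i in range(rows) if H[i][j]] for j in range(cols)]
    return N, M

def is_regular(H):
    """True if all CNs share one degree and all VNs share one degree."""
    N, M = tanner_sets(H)
    return len({len(s) for s in N}) == 1 and len({len(s) for s in M}) == 1

codeword = [1, 0, 1, 1, 0, 1, 0]
print(syndrome(H, codeword))   # all checks satisfied
print(is_regular(H))           # VN degrees differ, so the toy code is irregular
```

Flipping a single bit of `codeword` makes at least one syndrome bit non-zero, which is the condition the decoders discussed later try to remove.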
Next generation wireless communication standards adopt
structured LDPC codes, which hold good interconnection,
memory and scalability properties at the decoder implemen-
tation level. In these codes, the parity check matrix H is
associated to a HBASE matrix, as defined in [7]:
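$$H_{BASE} =
\begin{bmatrix}
\pi_{0,0} & \pi_{0,1} & \cdots & \pi_{0,N_b} \\
\pi_{1,0} & \pi_{1,1} & \cdots & \pi_{1,N_b} \\
\vdots & \vdots & \ddots & \vdots \\
\pi_{M_b,0} & \pi_{M_b,1} & \cdots & \pi_{M_b,N_b}
\end{bmatrix}$$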
$H_{BASE}$ has $M_b$ block rows and $N_b$ block columns; it is
expanded, in order to generate the $H$ matrix, by replacing each
of its entries $\pi_{i,j}$ with a $z \times z$ permutation matrix,
where $z$ is the expansion factor. The permutation matrix can be
formed by a series of right shifts of the $z \times z$ identity
matrix according to a determined shifting factor, equal to the
value of $\pi_{i,j}$. The same base matrix is used as a platform
for all the different code lengths related to a selected code rate:
implementing a full mode decoder is thus a challenging task, due
to the huge variations in code parameters. For example, the current
IEEE 802.16e WiMAX standard features four code rates, i.e. 1/2, 2/3,
3/4 and 5/6, with $H_{BASE}$ matrices of size $12 \times 24$,
$8 \times 24$, $6 \times 24$ and $4 \times 24$ respectively. Each
code rate comes with 19 different codeword sizes ranging from
576 bits ($z = 24$) to 2304 bits ($z = 96$), with a granularity of
96 bits ($\Delta z = 4$).
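The expansion step described above can be sketched in a few lines. This is a minimal illustration rather than the procedure of any specific standard: the function names are hypothetical, the tiny base matrix is invented, and the use of a negative entry to mark an all-zero block is an assumption added here for convenience.

```python
def shifted_identity(z, shift):
    """z x z identity matrix cyclically shifted right: row r has its 1 in column (r + shift) % z."""
    return [[1 if c == (r + shift) % z else 0 for c in range(z)] for r in range(z)]

def expand(h_base, z):
    """Expand a base matrix of shift values into the binary H matrix.

    Assumption of this sketch: a negative entry stands for a z x z zero block.
    """
    H = []
    for base_row in h_base:
        blocks = [
            [[0] * z for _ in range(z)] if s < 0 else shifted_identity(z, s)
            for s in base_row
        ]
        for r in range(z):
            H.append([bit for blk in blocks for bit in blk[r]])
    return H

# Invented 2 x 2 base matrix with expansion factor z = 3 (illustration only).
H = expand([[0, 1], [2, -1]], z=3)
for row in H:
    print(row)

# Codeword sizes quoted in the text: 24 block columns, z from 24 to 96 in steps of 4.
wimax_sizes = [24 * z for z in range(24, 97, 4)]
print(len(wimax_sizes), wimax_sizes[0], wimax_sizes[-1])
```

The last lines simply reproduce the arithmetic behind the 19 codeword sizes between 576 and 2304 bits.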
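Algorithm 1 The Standard TPMP Algorithm
1. Initialization: For $j \in \{1, \ldots, N\}$
$$\lambda^0_{i,j} = \ln \frac{P(VN_j = 0 \mid y_j)}{P(VN_j = 1 \mid y_j)} = \frac{2 y_j}{\sigma^2} \tag{1}$$
2. CN Update Rule: $\forall\, CN_i,\ i \in \{1, \ldots, M\}$ do
$$\Gamma^n_{i,j} = \mathrm{sgn}(\Gamma^n_{i,j}) \cdot |\Gamma^n_{i,j}| \tag{2}$$
$$\mathrm{sgn}(\Gamma^n_{i,j}) = \prod_{j' \in \mathcal{N}(i) \setminus j} \mathrm{sgn}\left(\lambda^{(n-1)}_{i,j'}\right) \tag{3}$$
$$|\Gamma^n_{i,j}| = \bigotimes_{j' \ne j} \left(\lambda^{(n-1)}_{i,j'}\right) \tag{4}$$
3. VN Update Rule: $\forall\, VN_j,\ j \in \{1, \ldots, N\}$ do
$$\lambda^n_{i,j} = \lambda^0_{i,j} + \sum_{i' \in \mathcal{M}(j) \setminus i} \Gamma^n_{i',j} \tag{5}$$
4. Decoding: For each bit, compute its a-posteriori LLR
$$\lambda^n_j = \lambda^0_{i,j} + \sum_{i' \in \mathcal{M}(j)} \Gamma^n_{i',j} \tag{6}$$
Estimated codeword is $\hat{C} = (\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_N)$, where
element $\hat{c}_j$ is calculated as
$$\hat{c}_j = \begin{cases} 0 & \text{if } \lambda^n_j > 0 \\ 1 & \text{else} \end{cases} \tag{7}$$
If $H(\hat{C})^T = 0$ then stop, with correct codeword $\hat{C}$.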
B. LDPC Decoding Algorithms
The nature of LDPC decoding algorithms is mainly iterative.
Most of these algorithms are derived from the well known
Belief Propagation (BP) algorithm [5]. The aim of the BP algorithm
is to compute the a-posteriori probability (APP) that a given
bit in the transmitted codeword $c = [c_0, c_1, \ldots, c_{N-1}]$ equals
1, given the received word $y = [y_0, y_1, \ldots, y_{N-1}]$. For binary
phase shift keying (BPSK) modulation over an additive white
Gaussian noise (AWGN) channel with mean 1 and variance $\sigma^2$,
the reliability messages, represented as Logarithmic Likelihood
Ratios (LLRs), are computed in two steps: 1) check node update
and 2) variable node update. This is also referred to as two
phase message passing (TPMP). For the $n$th iteration, let $\lambda^n_{i,j}$
represent the message sent from variable node $VN_j$ to check
node $CN_i$, $\Gamma^n_{i,j}$ the message sent from $CN_i$ to $VN_j$,
$\mathcal{M}(j) = \{i : H_{ij} = 1\}$ the set of parity checks in which
$VN_j$ participates, $\mathcal{N}(i) = \{j : H_{ij} = 1\}$ the set of variable
nodes that participate in parity check $i$, $\mathcal{M}(j) \setminus i$ the set
$\mathcal{M}(j)$ with check $CN_i$ excluded and $\mathcal{N}(i) \setminus j$ the set
$\mathcal{N}(i)$ with $VN_j$ excluded. The standard TPMP algorithm is described
in Algorithm 1.
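As an illustration of the two-phase schedule, the following sketch runs a TPMP decoder with the Min-Sum magnitude rule on a toy parity check matrix. It is a simplified software model under stated assumptions: the matrix `H` is an invented (7,4) Hamming-style example, the function name `tpmp_min_sum` is hypothetical, every check node is assumed to have degree at least 2, and the input LLRs are taken as already scaled (positive values favour bit 0).

```python
# Toy parity check matrix, illustrative only.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def tpmp_min_sum(H, llr, max_iter=20):
    """Two-phase (flooding) Min-Sum decoding; returns the hard-decision vector."""
    m, n = len(H), len(H[0])
    N = [[j for j in range(n) if H[i][j]] for i in range(m)]
    M = [[i for i in range(m) if H[i][j]] for j in range(n)]
    lam = {(i, j): llr[j] for i in range(m) for j in N[i]}   # VN -> CN messages
    gam = {(i, j): 0.0 for i in range(m) for j in N[i]}      # CN -> VN messages
    c_hat = [0 if v > 0 else 1 for v in llr]
    for _ in range(max_iter):
        # Phase 1: every CN computes sign and magnitude from the other inputs.
        for i in range(m):
            for j in N[i]:
                others = [lam[i, jp] for jp in N[i] if jp != j]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                gam[i, j] = sign * min(abs(v) for v in others)
        # Phase 2: every VN forms its a-posteriori LLR and extrinsic outputs.
        for j in range(n):
            total = llr[j] + sum(gam[i, j] for i in M[j])
            for i in M[j]:
                lam[i, j] = total - gam[i, j]
            c_hat[j] = 0 if total > 0 else 1
        # Stop as soon as all parity checks are satisfied.
        if all(sum(H[i][j] * c_hat[j] for j in range(n)) % 2 == 0 for i in range(m)):
            break
    return c_hat

# One weakly received bit (index 2) in an otherwise reliable all-zero codeword.
print(tpmp_min_sum(H, [2.0, 2.0, -0.5, 2.0, 2.0, 2.0, 2.0]))
```

All check nodes are updated before any variable node, which is exactly the property that the layered schedule discussed later relaxes.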
As given in Eq. 2, the CN update consists of a sign update
and a magnitude update, where the latter depends on the type
of decoding algorithm; several are commonly used (Table I).
The Sum Product (SP) algorithm [8] gives near optimal results;
however, the implementation of the transcendental function
$\Psi(x)$ requires dedicated LUTs, leading to significant
hardware complexity [9]. The Min-Sum (MS) algorithm [10] is
a simple approximation of SP: its easier implementation comes
at the cost of a 0.2 dB performance loss compared to SP decoding
[11]. The Normalized Min-Sum (NMS) algorithm [12] gives better
performance than MS by multiplying the MS check node
update by a positive constant $k$, smaller than 1. Offset
Min-Sum (OMS) is another improvement of the standard MS
algorithm, which reduces the reliability values $\Gamma^n_{ij}$ by a
positive value $\beta$. For a quantitative performance comparison
of the different CN updates, refer to [13] and [14].
Table I
CHECK NODE UPDATE FOR LDPC DECODING ALGORITHMS
Algorithm   Formulation of $\bigotimes_{j' \ne j} (\lambda^{(n-1)}_{i,j'})$
SP          $\Psi\left(\sum_{j' \in \mathcal{N}(i) \setminus j} \Psi\left(|\lambda^{(n-1)}_{i,j'}|\right)\right)$, with $\Psi(x) = -\log(\tanh(x/2))$
MS          $\min_{j' \in \mathcal{N}(i) \setminus j} \{|\lambda^{(n-1)}_{i,j'}|\}$
OMS         $\max\{\min_{j' \in \mathcal{N}(i) \setminus j} \{|\lambda^{(n-1)}_{i,j'}|\} - \beta,\ 0\}$, with $\beta \ge 0$
NMS         $k \cdot \min_{j' \in \mathcal{N}(i) \setminus j} \{|\lambda^{(n-1)}_{i,j'}|\}$, with $k < 1$
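The four magnitude rules can be compared side by side with a small sketch. The function names and the default values of the offset `beta` and scaling factor `k` are illustrative assumptions, not values recommended by the cited works; the SP branch also assumes finite, non-zero input magnitudes so that the logarithm stays defined.

```python
import math

def psi(x):
    """Transcendental function of the SP rule: -log(tanh(x / 2)), for x > 0."""
    return -math.log(math.tanh(x / 2.0))

def cn_magnitude(others, rule="MS", beta=0.15, k=0.8):
    """Magnitude of one CN->VN message, given the other incoming VN->CN messages.

    beta (OMS offset) and k (NMS scaling) are placeholder values for this sketch.
    """
    mags = [abs(v) for v in others]
    if rule == "SP":
        return psi(sum(psi(m) for m in mags))
    if rule == "MS":
        return min(mags)
    if rule == "OMS":
        return max(min(mags) - beta, 0.0)
    if rule == "NMS":
        return k * min(mags)
    raise ValueError("unknown rule: " + rule)

incoming = [1.5, -0.4, 2.0]
for rule in ("SP", "MS", "OMS", "NMS"):
    print(rule, cn_magnitude(incoming, rule))
```

On this input the SP magnitude is smaller than the MS one, which shows why OMS and NMS shrink the MS result: both move it back towards the SP value.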
C. Layered Decoding of LDPC Codes
Modifying the VN update rule (Eq. 5) as
$$\lambda^n_{i,j} = \lambda^n_j - \Gamma^n_{i,j} \tag{8}$$
we can merge the CN and VN update rules into a single
operation, where the CN messages $\Gamma^n_{i,j}$ are computed from
$\lambda^{(n-1)}_j$ and $\Gamma^{(n-1)}_{i,j}$. This technique is called Layered
Decoding [15]. Layered decoding considers the $H$ matrix as a
concatenation of $l$ layers (block rows) or constituent sub-codes,
i.e. $H^T = [H_1^T, H_2^T, \ldots, H_l^T]$, where the column weight of
each layer is at most 1. In this way a decoding iteration
is divided into $l$ sub-iterations. Formally, the layered
Min-Sum decoding algorithm is described in Algorithm 2.
After the CN update is finished for one block row, the results are
immediately used to update the VNs, whose results are then
used to update the next layer of check nodes. Therefore,
updated information is available to the CNs at each sub-iteration.
Based on the same concept, the authors in [7] introduced
turbo decoding message passing (TDMP) [16], using the BCJR
algorithm [17], for their architecture-aware LDPC (AA-LDPC)
codes. TDMP results in about a 50% decrease in the number of
iterations needed to meet a certain BER, which is equivalent to
a $2\times$ increase in throughput, along with significant
memory savings compared to the standard TPMP schedule.
Similar to the TDMP schedule is Vertical Shuffle Scheduling
(VSS) [18]: while TDMP relies on horizontal divisions of
the parity check matrix, VSS divides the horizontal layers into
sub-blocks. It is a particularly efficient technique with
quasi-cyclic LDPC codes [19], where each sub-block is identified
by a parity check submatrix.
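Algorithm 2 The Layered Decoding Min-Sum Algorithm
1. Initialization: $\forall\, CN_i,\ i \in \{1, \ldots, M\}$ do $\Gamma^0_{i,j} = 0$
2. CN Update Rule: $\forall\, CN_i,\ i \in \{1, \ldots, M\}$ do
$$\lambda^n_j = \lambda^0_{i,j}$$
$$\Gamma^n_{i,j} = \prod_{j' \in \mathcal{N}(i) \setminus j} \mathrm{sgn}\left\{\lambda^{(n-1)}_{j'} - \Gamma^{(n-1)}_{i,j'}\right\} \times \min_{j' \in \mathcal{N}(i) \setminus j} \left|\lambda^{(n-1)}_{j'} - \Gamma^{(n-1)}_{i,j'}\right| \tag{9}$$
$$\lambda^n_j = \lambda^n_j + \Gamma^n_{i,j}$$
Estimated codeword is $\hat{C} = (\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_N)$, where
element $\hat{c}_j$ is calculated as
$$\hat{c}_j = \begin{cases} 0 & \text{if } \lambda^n_j > 0 \\ 1 & \text{else} \end{cases} \tag{10}$$
If $H(\hat{C})^T = 0$ then stop, with correct codeword $\hat{C}$.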
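A software model of the layered schedule may help to see how it differs from TPMP. The sketch below is a simplification under stated assumptions: each row of the toy matrix `H` is treated as its own layer (so the column weight per layer is trivially at most 1), the name `layered_min_sum` is hypothetical, and each check node is assumed to have degree at least 2.

```python
# Toy parity check matrix, illustrative only; each row is used as one layer.
H = [
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def layered_min_sum(H, llr, max_iter=20):
    """Layered Min-Sum decoding; returns the hard-decision vector."""
    m, n = len(H), len(H[0])
    N = [[j for j in range(n) if H[i][j]] for i in range(m)]
    lam = list(llr)                                       # a-posteriori LLR per bit
    gam = {(i, j): 0.0 for i in range(m) for j in N[i]}   # stored CN -> VN messages
    c_hat = [0 if v > 0 else 1 for v in lam]
    for _ in range(max_iter):
        for i in range(m):                                # one sub-iteration per layer
            # VN -> CN messages: remove this layer's previous contribution (Eq. 8).
            t = {j: lam[j] - gam[i, j] for j in N[i]}
            for j in N[i]:
                others = [t[jp] for jp in N[i] if jp != j]
                sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                gam[i, j] = sign * min(abs(v) for v in others)
                lam[j] = t[j] + gam[i, j]                 # visible to the next layer
        c_hat = [0 if v > 0 else 1 for v in lam]
        if all(sum(H[i][j] * c_hat[j] for j in range(n)) % 2 == 0 for i in range(m)):
            break
    return c_hat

print(layered_min_sum(H, [0.5, 2.0, -2.0, -2.0, 2.0, -2.0, 2.0]))
```

Because `lam` is refreshed inside the layer loop, later layers already see the results of earlier ones within the same iteration, which is the mechanism behind the faster convergence mentioned in the text.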
III. FLEXIBLE DECODERS
A. Parallelism
The standard TPMP algorithm described in the previous
section exploits the bipartite nature of the Tanner Graph: since
no direct connection is present between nodes of the same
kind, all CN (or VN) operations are independent from each
other and can be performed in parallel. Thus, a first broad
classification of LDPC decoders can be done in terms of the
degree of parallelism. The hardware implementation of LDPC
decoders can be serial, partially-parallel and fully parallel.
A serial LDPC decoder implementation is the simplest in
terms of area and routing. It consists of a single check node, a
single variable node and a memory. The variable nodes are
updated one at a time and then the check nodes are updated in a
serial manner. Maximum flexibility can be achieved by uploading
new check matrices into memory. However, each edge of the
graph must be handled separately: as a result, throughput is
usually very low, insufficient for most standard applications.
A fully parallel architecture is the direct mapping of the Tanner
graph to hardware. All node operations (CNs and VNs) are
directly realized in hardware PEs and connected through
dedicated links. This results in huge connection complexity
that in extreme cases dominates the total decoder area and
results in severe layout congestion; in exchange, the maximum
throughput can theoretically be reached. In [20], a 1024-bit
fully parallel decoder is presented, achieving 1 Gbps throughput
with a logic density of only 50% to accommodate the complexity
of the interconnection, which comprises 9750 wires with 3-bit
quantization. None of the parallel implementations in [20]–[22]
grant multi-mode flexibility, due to their wired connections.
In addition, almost all existing fully parallel LDPC decoders
are built on custom silicon, which precludes any prospect
of reprogramming. An alternative approach is the partially
parallel architecture, which divides the node operations of the
Tanner graph over $P$ PEs, with $P < (N + M)$. This means
that each PE performs the computation associated with
multiple nodes, necessitating memories to store intermediate
messages between tasks. Time sharing of PEs greatly reduces
the area and routing overhead. Partially parallel architectures
have been studied extensively and provide a good trade-off among
throughput, complexity and flexibility, with some solutions
obtaining throughputs up to 1 Gbps.
Table II
LDPC CODES APPLICATIONS
Application  Standard      Code Length [bits]  Code Rates   Throughput
WMAN         IEEE 802.16e  576-2304            1/2 - 5/6    70 Mb/s
WLAN         IEEE 802.11n  648-1944            1/2 - 5/6    450 Mb/s
Broadcast    DVB-S2        16200, 64800        1/4 - 9/10   90 Mb/s
Wired        10GBASE-T     2048                arbitrary    6.4 Gbps
B. Implementation Platforms
Hardware implementation of LDPC decoders is mainly
dictated by the nature of application. LDPC codes have been
adopted by a number of communication (wireless, wired and
broadcast) standards and storage applications: a few of them
are briefly summarized in Table II.
In wireless communication domain, LDPC codes are
adopted in IEEE 802.16e WiMAX which is a wireless
metropolitan area network (WMAN) standard and IEEE
802.11n WiFi which is a wireless local area network (WLAN)
standard. Both standards have adopted LDPC codes as an
optional channel coding scheme with various code lengths
and code rates. LDPC codes are also used in the Digital Video
Broadcast via satellite (DVB-S2) standard, which requires very
large code lengths of 64800 bits and 16200 bits with 11
different code rates, and a 90 Mb/s decoding throughput.
In the wireline communication domain, LDPC codes are adopted
in the 10 Gbit Ethernet copper (10GBASE-T) standard, which
specifies a high code rate LDPC code with a fixed code length
of 2048 bits and a very high decoding throughput of 6.4
Gbps.
There is no standard for magnetic recording hard disks;
however, they demand a high code rate, a low error floor and high
decoding throughput. In [23], a rate-8/9 LDPC decoder with
2.1 Gbps throughput has been reported for magnetic recording.
The decoder supports four block lengths, the largest
consisting of 36864 bits.
The varied nature of applications makes the selection of
a suitable hardware platform an important choice. Typical
platforms for LDPC decoder implementation include
programmable devices (e.g. microprocessors, Digital Signal
Processors (DSPs) and Application Specific Instruction set
Processors (ASIPs)), customized Application Specific Integrated
Circuits (ASICs) and reconfigurable devices (e.g. FPGAs).
General purpose microprocessors and DSPs utilize strong
programmability to achieve highly flexible LDPC decoding,
allowing various code parameters to be modified at run time.
Programmable devices are often used in the design, test and
performance comparison of decoding algorithms. However,
they are usually constituted by a limited number of PEs that
execute in a serial manner, thus limiting the computational
power to a great extent. An LDPC decoder implemented on a
TMS320C64xx could yield 5.4 Mb/s throughput running at
600 MHz [24]. This performance is not sufficient to support
the high data rates defined in new wireless standards.
Figure 1. Generalized datapath of LDPC Decoder: (a) TPMP Decoding; (b) Layered Decoding
Reconfigurable hardware platforms like FPGAs are widely
used for several reasons. First, they speed up the empirical
testing phases of decoding algorithms to a degree that is not
possible in software. Second, they allow rapid prototyping of
the decoder. Once verified, the algorithm can be deployed on the
same reconfigurable hardware. They also allow easy handling of
different code rates and SNRs, power requirements, block lengths
and other variable parameters. However, FPGAs are suited
for datapath intensive designs and have a programmable switch
matrix (PSM) optimized for local routing. High parallelism
and the intrinsic low adjacency of the parity check matrix lead
to longer and more complex routing, not fully supported by most
FPGA devices. Some designs [16], [25] use time sharing of
hardware and memories, which reduces the global interconnect
routing at the cost of reduced throughput.
Customized ASICs are a typical choice, yielding a
dedicated, high performance IC. ASICs can be used to fulfill
the high computational requirements of LDPC decoding,
delivering very high throughputs with reasonable parallelism. The
resulting IC usually meets area, power and speed metrics.
However, ASIC designs are limited in their flexibility and
usually intended for single standard applications only:
flexibility, if reached at all, comes at the cost of very long design
time and non-negligible area, power or speed sacrifices. An
alternative or parallel approach is the usage of ASIPs, which
largely overcome the limitations of general purpose
microprocessors and DSPs. A fully customized instruction set,
pipeline and memory achieve efficient, high performance decoding:
ASIP solutions are able to provide inter- and intra-standard
flexibility through limited programmability, guaranteeing
average to high throughput.
C. Decoding Schedule
A partially parallel architecture becomes mandatory to realize
flexible LDPC decoding. Generally, the functional description of
a generic LDPC decoder can be broken down into two parts:
• Node processors
• Interconnection structure
A partially parallel decoder with parallelism $P$ consists of $P$
node processors, while an interconnection structure allows
various kinds of message passing according to the implemented
architecture. Based on the decoding schedule, i.e. TPMP or
layered decoding, the datapath can be optimized accordingly.
Figure 1 shows two possible datapath implementations of a
partially parallel LDPC decoder, which are discussed as follows.
1) TPMP datapath: In the TPMP structure depicted in [26]
for a generic belief propagation algorithm, each VN consists
of 4 dual port RAMs: I, Sa, Sb and E, as shown in Fig.
1a. RAM I stores the channel intrinsic information, while
RAMs Sa and Sb manage the sum of extrinsic information for the
previous and current iteration respectively, and RAM E stores
the extrinsic information for the current iteration. The decoding
process consists of $D$ iterations: during iteration $d + 1$, the
intrinsic information ($\lambda^0_{i,j}$) fetched from RAM I is added to
the contents of RAM Sa ($\sum_{i' \in \mathcal{M}(j)} \Gamma^d_{i',j}$). Simultaneously, the
extrinsic information generated by the current parity check
during the previous iteration, $\Gamma^d_{i,j}$, is retrieved from RAM E
and subtracted from the total. The result of the subtraction is
fed to the PE, which executes the chosen CN update (see Table
I). The updated extrinsics of iteration $d + 1$ are then accumulated
with the iteration $d$ ones (RAM Sb) and replace the old extrinsic
information in RAM E. At iteration $d + 2$, the roles of RAM
Sa and RAM Sb are exchanged.
2) Layered datapath: The layered decoding datapath
described in [27] is shown in Fig. 1b. The VN structure is
simplified and consists of RAM I only, which stores $\lambda^d_j$ at
each sub-iteration. Equation 8 is computed inside the check
node, which consists of a PE, a FIFO and RAM S; the latter
stores $\Gamma^{d-1}_{i,j}$. During iteration $d$, these are subtracted from
the message incoming from RAM I, to generate the VN-CN
message. The updated extrinsic generated by the PE is added to
the corresponding input coming from the FIFO, storing the
resulting $\Gamma^d_{i,j}$ in RAM S.
In both datapath architectures described above, the assignment
of PEs to nodes (VNs and CNs) is determined by a given
code structure and can be done efficiently by designing LDPC
codes using permuted identity matrices. Considering
parallelism $P = z$, the $z$ VNs are connected to $z$ CNs through a
$z \times z$ interconnection network $\Pi/\Pi^{-1}$, which has to realize the
permutations of the identity matrix. Typically, a highly flexible
barrel shifter allows all possible permutations (rotations) of the
identity matrix. In some implementations, a single node type
joining both VN and CN operations is present, thus changing
the nature and function of the connections. The controller
is typically composed of address generators (AGs) and a
permutation RAM (PRAM). The address generators generate the
addresses of the RAMs (I, Sa, Sb and E), while the permutation RAM
generates the control signals for the permutation network
according to the rotation of the identity matrix. Multi-mode
flexibility is achieved by reconfiguring the AGs and PRAMs each
time a new code needs to be supported.
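The role of the barrel shifter can be illustrated with a behavioural sketch. It models only the rotation itself, not the logarithmic hardware structure or the PRAM contents; the function names are hypothetical, and the convention that output `r` takes input `(r + shift) % z` is an assumption chosen to match a right-shifted identity matrix.

```python
def barrel_shift(messages, shift):
    """Rotate z messages so that output r carries input (r + shift) % z."""
    z = len(messages)
    s = shift % z
    return messages[s:] + messages[:s]

def inverse_barrel_shift(messages, shift):
    """Undo barrel_shift with the same shift value (the return path CN -> VN)."""
    z = len(messages)
    return barrel_shift(messages, (z - shift) % z)

vn_messages = [10, 11, 12, 13]            # z = 4 messages read from memory
to_cns = barrel_shift(vn_messages, 1)     # shift value as stored for one sub-matrix
print(to_cns)
print(inverse_barrel_shift(to_cns, 1))    # original ordering restored
```

In this model, supporting a new code only changes the sequence of shift values fed to the two functions, which mirrors the reconfiguration of the permutation RAM described above.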
In order to realize an efficient LDPC decoder, optimization
is required at both the PE and the interconnection level. The
overall complexity and performance of the decoder are largely
determined by the characteristics of these two functional units.
In the next two sections we discuss them in detail and analyze
various design choices aimed at realizing high performance
flexible LDPC decoders.
IV. PROCESSING ELEMENT
The PE is the core of the decoding process, where the
algorithm operations are performed. Its design is an important
step that heavily affects the overall performance, complexity and
flexibility of the decoder. The PE can be designed to be serial,
with internal pipelining to maximize throughput, or parallel,
processing all data concurrently. Depending on this initial
choice, critical design issues can arise either in latency and
memory requirements or in complex interconnection structures
and extended logic area.
A. Serial PE
As described in Section II-A, the LDPC codes specified
by the majority of standards are so-called structured
LDPC codes. Considering a decoder parallelism $P = z$, as in
state-of-the-art layered decoders, one sub-matrix (equal to
$P$ edges) is processed per clock cycle, with one operation
completed by each PE working in a serial fashion. Fig. 2a
shows a generalized architecture for a serial PE implementing
the Min-Sum algorithm. In Min-Sum decoding, out of all
LLRs considered by a CN, only two magnitudes are of interest,
i.e. the minimum and the second minimum. The PE works serially,
maintaining three variables, namely MIN, MIN2 and INDEX.
MIN and MIN2 store the minimum and second minimum of all
values respectively, whereas INDEX stores the position index
of the minimum value. Each time a new VN-CN message $\lambda_{ij}$
is received, its magnitude is compared with MIN and MIN2,
possibly substituting one of the two, with consequent position
storage in INDEX. For each outgoing message $\Gamma_{ij}$, the
value is either MIN ($j \ne$ INDEX) or MIN2 ($j =$ INDEX). This
method avoids storing all VN-CN messages and results in
considerable memory savings in the CN kernel.
Table III
HBASE PARAMETERS FOR WIMAX AND WIFI LDPC CODES.
Code Rate                     1/2      2/3      3/4      5/6
HBASE matrix (WiMax / WiFi)   12 × 24  8 × 24   6 × 24   4 × 24
Mb (WiMax / WiFi)             12       8        6        4
Wr - Wc (WiMax)               7 - 6    11 - 6   15 - 6   20 - 4
Wr - Wc (WiFi)                8 - 12   11 - 8   15 - 6   22 - 4
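The MIN / MIN2 / INDEX bookkeeping can be modelled in software as follows. This is a behavioural sketch, not the datapath of Fig. 2a: the class name is hypothetical, and the per-input sign storage is an assumption added here so that complete signed messages can be produced (the text above only describes the magnitude path).

```python
class SerialMinSumPE:
    """Behavioural model of a serial Min-Sum check node PE."""

    def __init__(self):
        self.min1 = float("inf")   # MIN: smallest magnitude seen so far
        self.min2 = float("inf")   # MIN2: second smallest magnitude
        self.index = None          # INDEX: position of the minimum
        self.parity = False        # XOR of all input sign bits (assumed extra state)
        self.signs = {}            # one sign bit per input (assumed extra state)

    def push(self, j, value):
        """Consume one VN->CN message arriving from position j."""
        mag, neg = abs(value), value < 0
        self.signs[j] = neg
        self.parity ^= neg
        if mag < self.min1:
            self.min2, self.min1, self.index = self.min1, mag, j
        elif mag < self.min2:
            self.min2 = mag

    def output(self, j):
        """CN->VN message for position j: MIN2 if j is INDEX, MIN otherwise."""
        mag = self.min2 if j == self.index else self.min1
        return -mag if (self.parity ^ self.signs[j]) else mag

pe = SerialMinSumPE()
for j, v in [(0, 0.5), (1, 2.0), (3, -2.0), (4, 2.0)]:
    pe.push(j, v)
print([pe.output(j) for j in (0, 1, 3, 4)])
```

Whatever the check node degree, the magnitude state stays at two values plus one index, which is the memory saving the paragraph refers to.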
Table III collects the main parameters of the WiMAX and WiFi
LDPC codes. Mb denotes the number of
block rows in the HBASE matrix, whereas Wr and Wc denote
the maximum row and column weights (i.e. CN and VN
degrees) respectively. A full mode LDPC decoder for WiMAX
must support 6 code rates with row weights ranging from 7 to
20. A serial CN implementation is particularly suitable for this
scenario, as it allows run time flexibility to process any value
of CN degree with the same number of comparators, allowing
efficient hardware usage. However, very large values of CN
degree increase the latency and limit the achievable throughput
to a great extent, requiring a high degree of parallelism to
achieve medium-to-high throughputs (Table IV).
B. Parallel PE
Realizing high throughput decoders (supporting data rates
up to a few hundred Mb/s) calls for either massive parallelism
or a high clock frequency, resulting in significant area and
power overhead. However, parallelism at the CN level can bring
a significant increase in throughput with affordable complexity.
A parallel PE manages all VN-CN messages in parallel and
writes back the results simultaneously to all connected VNs.
This results in lower update latency and consequently higher
throughput. A parallel Min-Sum PE for $d_c = 6$ is shown in
Fig. 2b. This unit computes the minimum among the different
choices of five out of six inputs. The PE delivers each result to
the output port corresponding to the input that is not included in
the set, e.g. output 1 is $\min(a_2, a_3, \ldots, a_6)$. The PE is capable of
supporting all values of $d_c$ less than or equal to 6, whereby
unused inputs are initialized to $+\infty$. Supporting higher values
of $d_c$ requires additional circuitry, which adds to the complexity
and latency of the PE. As shown in the figure, the complexity
of the PE is dominated by logic components (e.g. comparators)
and increases almost linearly with node degree. This type
of PE architecture is mostly employed for structured LDPC
codes, where the check node degrees are either fixed or show
small variations throughout the decoding process. To achieve
code rate flexibility, the check node PE is synthesized for the
maximum check node degree ($d_{cmax}$) required by a particular
application, supporting all values of $d_c$ less than or equal to
$d_{cmax}$.
Figure 2. Min-Sum PE block scheme: (a) Serial Approach; (b) Parallel Approach
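The behaviour of the parallel PE described above, including the padding of unused inputs, can be sketched as follows. Only the magnitude path is modelled, the function name is hypothetical, and the software loop merely emulates comparisons that the hardware performs concurrently.

```python
def parallel_min_pe(inputs, dc_max=6):
    """For each used input k, return the minimum magnitude over all other inputs.

    Unused ports (when len(inputs) < dc_max) are tied to +infinity so that they
    never win a comparison.
    """
    if len(inputs) > dc_max:
        raise ValueError("PE synthesized for at most %d inputs" % dc_max)
    padded = [abs(v) for v in inputs] + [float("inf")] * (dc_max - len(inputs))
    return [min(padded[:k] + padded[k + 1:]) for k in range(len(inputs))]

print(parallel_min_pe([3.0, 1.0, 4.0, 1.5, 9.0, 2.6]))   # full degree, dc = 6
print(parallel_min_pe([2.0, 0.5, 1.0, 3.0]))             # dc = 4, two ports unused
```

A degree above `dc_max` is rejected, which reflects the statement that higher degrees need additional circuitry.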
C. State of the art
Flexibility as a design parameter is not always addressed
as an important figure of merit, but various design techniques
have been reported in literature which can be compared in
terms of throughput, complexity and number of supported
decoding modes, thus evaluating the obtainable degree of
flexibility.
1) ASIC Implementations: The partially parallel decoder
presented by Kuo and Willson in [28] offers a simple and
tailored solution to the mobile WiMAX problem. The designed
ASIC is able to work, upon reconfiguration, on all the mobile
WiMAX standard LDPC codes. The quasi-cyclic structure of
such codes allows an effective implementation of the layered
decoding approach, here exploited with a variable degree of
parallelism, and a simple interconnection between memories
and processing units. The implemented decoding algorithm
is OMS, and it is fixed. The individual components of [28] are
not flexible per se, but a serial architecture and programmable
parallelism extend its range of usable codes to any code with
parameters smaller than or equal to the WiMAX ones (block
length, column and row weights, total amount of exchanged
information), although without guaranteeing compliance with
standard throughput requirements.
In [33], intra-standard flexibility comes together with a
choice between two decoding approaches, layered decoding
and TPMP. Although the performance of this ASIC has been
evaluated only for the structured QC-LDPC codes
of WiMAX, the true benefit of the dual algorithm comes
with unstructured codes. Layered decoding, which usually
performs better, generates data collisions that are transparent
to the two-phase decoding process; TPMP thus enables the
presence of superimposed sub-matrices in the code H matrix. The
central processing unit consists of a Reconfigurable Serial
Processing Array (RSPA) incorporating a serial Min-Sum PE
and a reconfigurable logarithmic barrel shifter. The RSPA can be
dynamically reconfigured to choose between the two decoding
modes according to different block LDPC codes. With
intelligent hardware reuse via modular design, the overhead due
to the double decoding approach is reduced to a minimum,
with an overall acceptable power consumption. The decoder
operates at 260 MHz, achieving a throughput that
meets the WiMAX standard specifications.
When designing an efficient multi-mode decoder, a typical
approach is to find similarities between different modes and
then implement the common parts as reusable hardware
components. Controlling the data flow between reusable
components guarantees multi-mode flexibility. One such effort is
the work by Brack and Alles [27]. It describes an IP core of a
full mode LDPC decoder that can be synthesized for a selected
code rate specified by the WiMAX standard. The unified decoder
architecture proposes two different datapaths for TPMP and
layered decoding, as in [33], and combines them in a single
architecture sharing the components common to both. Code
rate and codeword size flexibility is achieved by realizing serial
check node PEs, and the chosen decoding algorithm is $\lambda$-3
Min [39].
As discussed in Section III-C2, a partially parallel TDMP
decoder performs serial scanning of the block rows of the HBASE
matrix. The check node reads the bit LLRs connected to a
block row serially and stores them in a FIFO whose size is
proportional to the row weight Wr. For multi-rate irregular QC
decoders, the utilization ratio of the FIFO is low for smaller Wr.
In addition, due to the random location of non-zero sub-matrices
and correlations between consecutive block rows of HBASE,
extrinsic information exchange can lead to memory access
conflicts. These limitations were addressed by Xiang et al.
in [34], where the authors presented an overlapped TDMP
decoding algorithm. The design proposes a block row and
column permutation criterion in order to reduce the correlation
between consecutive rows and to distribute zero and non-zero
matrices uniformly across columns, together with a smart memory
management technique. The resulting decoder is a full mode
QC-LDPC decoder for WiMAX. The decoder achieves a
maximum throughput of 287 Mb/s, with support for other
similar QC-LDPC codes.
Table IV
FLEXIBLE LDPC DECODER ASIC IMPLEMENTATIONS: CMOS TECHNOLOGY PROCESS (TECH.), AREA OCCUPATION (A), NORMALIZED AREA @ 130 NM (Anorm),
SCHEDULING (SCHED.: TDMP/TPMP), CODE TYPE (C.T), BLOCK LENGTH (N), NO. OF DECODING MODES SUPPORTED (DM), FLEXIBILITY
(FLEX.: D.T = DESIGN TIME, R.T = RUN TIME RECONFIGURABLE), DECODING ITERATIONS (IT.), THROUGHPUT (T.P), CLOCK FREQUENCY (f), PE
STRUCTURE (PE: SERIAL = Se, PARALLEL = Pa), NUMBER OF DATAPATHS (Dp), THROUGHPUT-AREA RATIO (TAR, Mb/s × It / mm2 = T.P × It / Anorm),
DECODING EFFICIENCY (DE, bits/cycle = T.P × It / f), FLEXIBILITY EFFICIENCY (FE, DM × bits/cycle / mm2 = DE × DM / Anorm)
Design  Tech.  A      Anorm   Sched.           C.T         N          DM         Flex.  It.     T.P        f      PE  Dp         TAR     DE    FE
        [nm]   [mm2]  [mm2]                                [bits]                               [Mb/s]     [MHz]
[29]    65     1.337  5.348   TDMP             QC-WiMAX    576-2304   114        D.T    25, 20  48-333     400    Se  24-96      1245.3  16.7  356.0
[29]    65     3.861  15.444  TDMP             DVB-S2      64800      20         D.T    50, 15  60-708     400    Se  90         687.6   26.6  34.4
[29]    65     1.023  4.092   TDMP             QC-WiFi     648-1944   12         D.T    25, 20  54-281     400    Se  27-81      1373.4  14.1  41.4
[30]    65     0.51   2.04    TDMP             WiMedia     1200-1320  8          R.T    5, 3    1120-1220  264    Pa  90         1794.1  13.9  54.5
[31]    90     9.60   20.03   TDMP             DVB-S2      64800      20         R.T    15      181-998    320    Se  360        747.4   46.8  46.7
[32]    90     0.42   0.87    TDMP             QC-WiFi     1944       12         R.T    30      43         294    Pa  324        1482.7  4.4   60.7
[28]    180    3.39   1.768   TDMP             QC-WiMAX    576-2304   114        R.T    10      68         100    Se  24         384.6   6.8   438.5
[33]    130    6.3    6.3     TDMP/TPMP        Block-LDPC  576-2304   114        R.T    15      205        260    Se  24-96      487.5   11.8  213.5
[34]    130    2.46   2.46    Overlapped TDMP  QC-WiMAX    576-2304   114        R.T    15      248-287    150    Se  96         1750.0  28.7  1330.0
[27]    130    3.843  3.843   TDMP/TPMP        QC-WiMAX    576-2304   114        D.T    10, 15  83-610     333    Se  24-96      2380.9  18.3  4.8
[35]    90     0.679  1.416   TDMP             QC-WiMAX    576-2304   114        R.T    8-12    200        400    Pa  16         1694.9  4.0   322.0
[36]    90     6.25   13.04   TDMP             QC-WiMAX    576-2304   114        R.T    20      105        150    Se  24-96      161.0   14.0  122.4
[37]    130    8.29   8.29    TDMP             QC-WiMAX    576-2304   19         R.T    2-8     222        83.3   Pa  4-8        214.2   5.3   12.1
[38]    130    4.94   4.94    TPMP             Arbitrary   1536       Arbitrary  R.T    2-8     86         125    Pa  Arbitrary  139.3   1.37  N/A
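The derived columns of Table IV follow directly from the formulas given in its caption, and a short sketch makes the arithmetic explicit. The function names are hypothetical. The quadratic scaling used for the normalized area is an assumption of this sketch: the caption only states that areas are normalized to 130 nm, although the tabulated A and Anorm pairs are consistent with it.

```python
def normalized_area(area_mm2, tech_nm, ref_nm=130.0):
    """Scale area to the reference node, assuming area ~ (feature size)^2."""
    return area_mm2 * (ref_nm / tech_nm) ** 2

def table_metrics(tp_mbps, iterations, f_mhz, a_norm_mm2, modes):
    """Return (TAR, DE, FE) as defined in the caption of Table IV."""
    tar = tp_mbps * iterations / a_norm_mm2   # throughput-area ratio
    de = tp_mbps * iterations / f_mhz         # decoding efficiency, bits/cycle
    fe = de * modes / a_norm_mm2              # flexibility efficiency
    return tar, de, fe

# Row [28]: 180 nm, A = 3.39 mm2, 10 iterations, 68 Mb/s, 100 MHz, 114 modes.
a_norm = normalized_area(3.39, 180)
tar, de, fe = table_metrics(68, 10, 100, a_norm, 114)
print(round(a_norm, 3), round(tar, 1), round(de, 1), round(fe, 1))
```

For rows that list ranges (of throughput or iterations), the result depends on which endpoint is inserted, so reproducing a tabulated value requires picking the same combination as the table.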
An interesting way to tackle the flexibility issue is proposed
in [35], where the different code parameters are handled via
what has been called processing task arrangement. This work
presents a decoder based on the decoding approach described
in [40], Layered Message Passing Decoding with Identical
Core Matrices (LMPD-ICM). LMPD-ICM is a variation of
the original layered decoding: the H matrix is partitioned into
several layers, each yielding a core matrix that consists of
the nonzero columns of that layer. The resulting core matrix
is further divided into smaller, identical tasks.
Applying LMPD-ICM to a QC-LDPC code reveals that the core
matrices of the layers are column-permuted versions of each
other and show similarities not only among different layers of
a single code, but also among different codes within the same
code class. This technique is applied to QC-LDPC codes for
WiMAX in [35], and a novel task arrangement algorithm is
proposed to assign the processing operations of a variety of
QC-LDPC codes to different PEs. The design features four-stage
pipelining for task execution, flexible address generation to
support multi-rate decoding, and an early termination strategy
that dynamically adjusts the number of iterations according
to the SNR to save power. The decoder achieves a moderate
throughput of 200 Mb/s at 400 MHz with a
parallelism P of 4.
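The notion of a core matrix can be illustrated with a minimal sketch. The toy matrix H, the two-row layers and the helper name core_matrix are hypothetical; the code only shows the step "keep the nonzero columns of a layer" and is not the task arrangement algorithm of [35].

```python
def core_matrix(layer):
    """Return (kept column indices, core matrix) for one layer of H.

    A layer is a list of rows; the core matrix keeps only the columns
    that contain at least one nonzero entry within that layer.
    """
    n_cols = len(layer[0])
    kept = [c for c in range(n_cols) if any(row[c] for row in layer)]
    core = [[row[c] for c in kept] for row in layer]
    return kept, core


# Toy 4x6 parity-check matrix (illustrative only), split into two layers.
H = [
    [1, 0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
layers = [H[0:2], H[2:4]]
cores = [core_matrix(layer) for layer in layers]
```

Each entry of cores pairs the surviving column indices with the reduced matrix, which is the object that LMPD-ICM would then split into identical tasks.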
A single design flow is exploited in [29] to provide three
different implementations, each supporting a different standard.
The Adaptive Single Phase Decoding (ASPD) [41] scheduling
is adopted, which decouples the decoder’s memory
requirements from the row and column weights of the
H matrix, leaving them dependent on the codeword length
only. Although it sacrifices up to 0.3 dB in BER performance,
this technique yields a 60-80% reduction in memory bits.
The offset min-sum decoding algorithm is employed, and
differently sized memories make the decoder comply with the
DVB-S2, 802.11n and 802.16e standards. At run time, the
standard serial node architecture enables intra-standard flexibility.
A classical layered scheduling is used in the DVB-S2
decoder proposed in [31]: the 360 PEs, whose architecture
is detailed along with the iteration timing, are able to process
a whole layer concurrently. A 360 × 360 barrel shifter manages
the inter-layer communication: since the number of rows that
compose a layer never changes in DVB-S2, a change of
code implies a different workload on the communication
structure, but a very easy reconfiguration.
The work in [36] presents a reconfigurable full-mode LDPC
decoder for WiMAX. A so-called phase overlapping algorithm,
similar to TDMP, is proposed: it resolves the data
dependencies between the CNs and VNs of consecutive
sub-iterations and overlaps their operation. The proposed decoder
features serial check nodes implementing the Min-Sum algorithm.
A parallelism of 96 yields a throughput of 105 Mb/s at 20
iterations.
In addition to serial check node architectures, the state of
the art for flexible LDPC decoders also reports some solutions
utilizing parallel check nodes. The work in [37] proposes a
reconfigurable multi-mode LDPC decoder for Mobile WiMAX.
The authors applied the matrix reordering technique [46] to
the H_BASE matrix of the rate-1/2 WiMAX code. This improved
matrix reordering technique allows overlapped operation of CNs
and VNs and results in a 68.75% reduction in decoding latency
compared to the non-overlapped approach. A reconfigurable Address
Generation Unit and an improved Early Stopping Criterion help
to realize a low-power flexible decoder which supports all
the 19 block lengths (576-2304) of WiMAX. Parallel check
nodes implementing the Min-Sum algorithm help to achieve a
throughput of 222 Mb/s at a low frequency of 83.3 MHz and a
parallelism of 4.
The work in [38] features a parallel check node based on the
divided group comparison technique, adaptive code length
assignment to improve decoding performance, and an early
termination scheme. The proposed solution is run-time programmable
to support arbitrary QC-LDPC codes of variable code lengths
and code rates. However, no compliance with standardized
codes is guaranteed. A reasonable throughput of 86 Mb/s at
125 MHz is achieved. In [47], the authors presented a parallel
check node incorporating the value reuse property of the Min-Sum
algorithm, for non-standardized, rate-0.5 regular codes.
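The value reuse property of Min-Sum can be made concrete with a short sketch: every outgoing check-to-variable message takes one of only two magnitudes, the smallest and the second smallest input magnitude. The function name and the sample LLR values are assumptions chosen for illustration; this is the textbook (unnormalized) Min-Sum update, not the specific circuits of [38] or [47].

```python
def min_sum_check_node(inputs):
    """Min-Sum check-node update for a list of nonzero input LLRs.

    Returns one check-to-variable message per input edge. Only two
    magnitudes are produced: min1 (overall minimum) for every edge
    except the one that supplied it, which receives min2.
    """
    mags = [abs(x) for x in inputs]
    idx_min1 = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx_min1]
    min2 = min(m for i, m in enumerate(mags) if i != idx_min1)

    # Product of all input signs; removing one edge's own sign is the
    # same as multiplying by it again, since each sign is +1 or -1.
    total_sign = 1
    for x in inputs:
        if x < 0:
            total_sign = -total_sign

    outputs = []
    for i, x in enumerate(inputs):
        own_sign = -1 if x < 0 else 1
        magnitude = min2 if i == idx_min1 else min1
        outputs.append(total_sign * own_sign * magnitude)
    return outputs


example = min_sum_check_node([2.0, -0.5, 3.0, -1.5])
```

A serial check node would find min1, min2 and the sign product in one pass over the inputs, while a parallel one computes them with a comparison tree; the outputs are the same in both cases.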
The WiMedia standard [48] requires very high throughput:
the work by Alles and Wehn [30] manages to deliver more
than 1 Gb/s throughputs for most code lengths and rates.
Table V
FLEXIBLE LDPC DECODER ASIP IMPLEMENTATIONS: CMOS TECHNOLOGY PROCESS (TECH.), AREA OCCUPATION (A), NORMALIZED AREA
(Anorm) @ 130 NM, CODE TYPE (C.T), FLEXIBILITY (FLEX.): DESIGN TIME (D.T), RUN TIME (R.T), MAXIMUM THROUGHPUT (T.P), MAXIMUM
ITERATIONS (IT.), NUMBER OF DATAPATHS (Dp), OPERATING FREQUENCY (f), PROCESSING ELEMENT (PE) (SERIAL SE, PARALLEL PA),
THROUGHPUT-AREA RATIO (TAR) (MB/S × IT / MM² = T.P × IT / Anorm), DECODING EFFICIENCY (DE) (BITS/CYCLE = T.P × IT / f)
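| Design | Tech. [nm] | A [mm²] | Anorm [mm²] | C.T | Flex. | It. | T.P [Mb/s] | f [MHz] | Dp | PE | TAR | DE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Multicore ASIP [42] | 90 | 2.6 | 5.42 | LDPC-{WiMAX, WiFi} | R.T | 10 | {312, 263} | 500 | 24 | Se | {575.6, 485.2} | {6.24, 5.26} |
| Multicore ASIP [42] | 90 | 2.6 | 5.42 | Turbo-{BTC-LTE, DBTC-WiMAX} | R.T | 6 | {173, 173} | 500 | 24 | Se | {191.4, 191.4} | {2.07, 2.07} |
| 2D NOC ASIP [43] | 130 | N/A | N/A | Turbo | R.T | 8 | 86.5 | 200 | 16 | Se | N/A | 3.46 |
| 2D NOC ASIP [43] | 130 | N/A | N/A | LDPC | R.T | 8 | 11.2 | 200 | 16 | Se | N/A | 0.448 |
| FlexiCHAP [44] | 65 | 0.62 | 2.48 | LDPC-{WiMAX, WiFi} | R.T | 10-20 | {237, 257} | 400 (Max.) | 27 | Se | {955.6, 2879.0} | {5.9, 6.42} |
| FlexiCHAP [44] | 65 | 0.62 | 2.48 | Turbo {BTC, DBTC} | R.T | 5 | {18.6, 37.2} | 400 (Max.) | 27 | Se | {37.5, 75.0} | {0.23, 0.46} |
| Bin/Non-Bin [45] | 65 | 3.4 | 13.6 | Bin LDPC-{WiMAX, WiFi} | R.T | 10 | 90 | 400 | 96 | Se | 66.2 | 2.25 |
| Bin/Non-Bin [45] | 65 | 3.4 | 13.6 | Non-Bin LDPC {GF(8)} | R.T | 1 | 12.5 | 400 | 96 | Se | 0.92 | 0.03 |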
The result is achieved via the instantiation of 3 PEs with an
internal parallelism of 30: the similarity of WiMedia codes
to the QC-LDPC codes of WiMAX allows a multiple
submatrix-level parallelism in the decoder.
A more technological point of view is given in [32], where
low-power VLSI techniques are used in an 802.11n LDPC
decoder design. The decoder exploits the TDMP VSS technique
with a 12-datapath architecture: separate variable-to-check and
check-to-variable memory banks are instantiated, one per type
for each datapath. Each of these “macro-banks” contains 3
“micro-banks”, each storing 9 values per word. The internal
degree of parallelism is effectively raised to 12 × 27. The VLSI
implementation is efficiently tackled in a less-than-worst-case
fashion thanks to the VOS [49] and RPR [50] techniques, saving
area and power consumption.
2) ASIP Implementations: Future mobile and wireless com-
munication standards will require support for seamless ser-
vice and heterogeneous interoperability: convolutional, turbo
and LDPC codes are established channel coding schemes
for almost all upcoming wireless standards. To provide the
aforementioned flexibility, ASIPs are potential candidates. The
state of the art reports a number of design efforts in this
domain, thanks to their good performance and acceptable degree
of flexibility.
The work portrayed in [42] outlines a multi-core architec-
ture based on an ASIP concept. Each core is characterized by
two optimized instruction sets, one for LDPC codes and one
for turbo codes. The complete decoder only requires 8 cores,
since each of them can handle three processing tasks at once.
The simple communication network maintains this parallelism,
allowing for efficient memory sharing and collision avoidance.
The intrinsic flexibility of the ASIP approach allows multiple
standards (WiFi, WiMAX, LTE, DVB-RCS) to be easily
supported: the exploitation of the diverse datapath allows a
very high best case throughput, while the reduced network
parallelism keeps the complexity low.
A possible dual turbo/LDPC decoder architecture is de-
scribed in [43]. Here, a novel approach to processing element
design is proposed, allowing a high percentage of shared logic
between turbo and LDPC decoding. The TDMP approach
allows the usage of BCJR decoding algorithm for both codes,
while the communication network can be split into smaller
supercode-bound interleavers, effectively merging LDPC and
turbo tasks.
In [44], the authors proposed a flexible channel coding
processor, FlexiCHAP. The proposed ASIP extends the capabilities
of the previous work FlexiTrep [51] and is capable of decoding
convolutional codes, binary/duo-binary turbo codes and
structured LDPC codes. The proposed ASIP is based on the Single
Instruction Multiple Datapath (SIMD) paradigm [52], i.e. a
single IP with an internal data parallelism greater than one,
and processes a whole sub-matrix of the parity check matrix with
a single instruction. Multi-standard multi-mode functionality is
achieved by utilizing a 12-stage pipeline and an elaborate memory
partitioning technique. The proposed ASIP is able to decode
binary turbo codes up to 6144 information bits, and duo-binary
turbo codes and convolutional codes up to 8192 information
bits. An LDPC decoding capability of block lengths up to 3456
bits and check node degrees up to 28 is sufficient to cover all
code rates of the WiMAX and WiFi standards. The ASIP achieves
payload throughputs of 237 Mb/s and 257 Mb/s for WiMAX and WiFi
respectively.
The solution proposed in [45] makes use of a single SIMD
ASIP with a maximum internal parallelism of 96. The combined
architecture is strictly designed for LDPC codes, but the
supported ones range from binary (WiFi, WiMAX) to non-binary
(Galois Field GF(8)): the decoding approach is
turbo-decoder-based. While the decoding mode can be changed at
run time, flexibility is guaranteed at design time, by
instantiating wide enough rotation engines for the different LDPC
submatrix sizes, and a sufficient number of memories. These
memories dominate the area occupation, mainly due to the
non-binary decoding process.
Tables IV and V summarize the specifications of the various
state-of-the-art ASIC and ASIP solutions for flexible LDPC
decoders discussed above. To simplify the comparison, the
area of each decoder has been scaled to a 130 nm process
and is reported as normalized area (Anorm). A parameter called
throughput-to-area ratio (TAR), defined as TAR = (Throughput ×
It.)/Area and computed on the normalized area, has also been
included in the tables to evaluate the area efficiency of the
proposed decoders. Another metric, named Decoding Efficiency
(DE) and given as DE = (Throughput × It.)/f [53], has been
defined to give a fair comparison regardless of the
different clock frequencies. DE gives the number of decoded
bits per clock cycle per iteration, while Dp is the total degree
of parallelism, taking into account both the number of PEs and
their possible multiple datapaths.
To evaluate the effective flexibility of each decoder, and its
cost, a metric called flexibility efficiency is introduced, and
computed as

$$\mathrm{FE} = \frac{\mathrm{DE} \times \mathrm{DM}}{A_{norm}} \tag{11}$$
It gives a measure of each decoder’s flexibility through
its number of decoding modes (DM), taking into account the
normalized throughput performance (DE) in relation to the
normalized area occupation Anorm. The metric is applied to
ASIC decoders only, since in these cases the cost of flexibility
is reflected in the area much more directly than in the ASIP
case.
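As a worked check of the three metrics, the sketch below recomputes the Table IV entry for [34] from its peak throughput (287 Mb/s), iteration count (15), clock frequency (150 MHz), normalized area (2.46 mm²) and number of decoding modes (114). The function names are illustrative, not taken from the cited works.

```python
def tar(throughput_mbps, iterations, a_norm_mm2):
    """Throughput-to-area ratio: (T.P x It.) / Anorm."""
    return throughput_mbps * iterations / a_norm_mm2


def de(throughput_mbps, iterations, f_mhz):
    """Decoding efficiency: (T.P x It.) / f."""
    return throughput_mbps * iterations / f_mhz


def fe(de_value, decoding_modes, a_norm_mm2):
    """Flexibility efficiency: (DE x DM) / Anorm."""
    return de_value * decoding_modes / a_norm_mm2


# Table IV values for the decoder of [34].
tar_34 = tar(287, 15, 2.46)
de_34 = de(287, 15, 150)
fe_34 = fe(de_34, 114, 2.46)
```

The results agree with the tabulated TAR of 1750.0, DE of 28.7 and FE of 1330.0, which confirms that the table uses the peak throughput in all three formulas.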
As shown in Table IV, the work in [27] dominates in terms
of throughput and TAR but offers only design-time flexibility
(effective DM = 1): for this reason, its FE value is very low.
On the contrary, the design-time flexibility of [29] results
in run-time flexibility once the standard to be supported
has been chosen: since the WiMAX decoder must support a
higher number of codes at the same time than the DVB-S2 and
WiFi ones, its FE is higher.
The best DE value is held by the DVB-S2 decoder presented in
[31], together with an average TAR. Explicitly enabling the
decoding of just 20 codes, however, lowers its FE measure.
Among the serial-PE-based run-time flexible solutions discussed
above, the work in [34] achieves very high throughput, TAR
and DE with a small area occupation of 2.46 mm², yielding
the best FE of all decoders. A full-mode reconfigurable
solution based on parallel check nodes in [35] achieves a
good throughput of 200 Mb/s and the best FE among
the parallel node solutions. Its Anorm of 1.416 mm² is the
minimum among all the WiMAX solutions discussed above, but
its very low DE penalizes the overall performance.
Among the ASIP solutions (Table V), the work in [43]
cannot effectively be compared to the others in terms of area,
since it does not provide complete estimations. The work in [44]
yields the highest TAR and DE in LDPC mode: the solution proposed
in [42], however, yields the best decoding efficiency and the
top TAR in turbo mode, while at the same time reaching a
very good DE in LDPC mode too.
V. INTERCONNECTION STRUCTURES
As shown in the previous sections, the great majority of
current LDPC decoders require some kind of intra-decoder
communication. Except for very few single-core
implementations based on the Single Instruction Single Datapath
(SISD) paradigm, the need for message routing or permutation
is a constant throughout the wireless communication
state of the art. As a first classification, two scenarios can be
roughly identified:
• Single PE Architectures: some state-of-the-art decoders
propose single-core solutions with internal parallelism
greater than one, which rely on smart memory sharing and
on programmable permutation networks. These decoders
often make use of TDMP and VSS, which require either
reduced communication or very regular patterns: the
interconnection structures involved are simple.
• Partially Parallel Architectures: referring to the graph
representation of the LDPC H matrix within a selected
decoding approach, it is possible to map the graph nodes
onto a certain number of processing cores. In the partially
parallel approach the number of graph nodes is much
higher than the number of PEs. Each node is connected to a set
of other nodes distributed over the available PEs: different
nodes will have different links, resulting in a widely
varied PE-to-PE communication pattern. This situation
calls for flexible and complex interconnection structures.
A. Shift and shuffle networks
Decoding of structured LDPC codes, regardless of the
implementation, often requires shift or shuffle operations to route
information between PEs or to/from memories. This is
particularly true for some kinds of LDPC codes, such as QC-LDPC
and shift-LDPC [54].
The barrel shifter (BS) is a well-known circuit designed
to perform all the permutations of its inputs obtainable with a
shift operation, and is thus well suited to the circularly shifted
structure of the QC-LDPC H matrix.
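A behavioural sketch of a logarithmic barrel shifter follows. It assumes the common structure in which stage k rotates by 2^k positions when bit k of the shift amount is set; the function name is illustrative, and none of the cited designs is claimed to be built exactly this way.

```python
def barrel_shift(data, shift):
    """Cyclically rotate `data` left by `shift` positions.

    Each loop iteration models one multiplexer stage: stage k either
    passes its input through or rotates it by 2**k positions,
    depending on bit k of the shift amount.
    """
    n = len(data)
    if n == 0:
        return []
    shift %= n
    out = list(data)
    k = 0
    while (1 << k) < n:
        if (shift >> k) & 1:
            step = 1 << k
            out = out[step:] + out[:step]
        k += 1
    return out


rotated = barrel_shift(list(range(8)), 3)
```

For a z × z circulant submatrix of a QC-LDPC code, routing the z messages of a block amounts to one such rotation, with the shift amount given by the circulant offset.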
Rovini et al. in [55] exploit the simple structure of the
barrel shifter to design a circular shifting network for WiMAX
codes. This network must be able to handle all the different
submatrix sizes of the standard, thus effectively becoming a
multi-size circular shifting (MS-CS) network. This MS-CS
network is composed of a number of B × B BSs, where
B is the greatest common divisor of all the supported
block sizes. Each BS rotates a block of B data by the same
shift amount; the blocks are subsequently rearranged by
an adaptation network into the desired order according to the
current submatrix size. Implementation results show that the
proposed MS-CS network outperforms previous similar solutions
such as [56]–[59] in terms of complexity, with a saving ranging
from 30.4% to 67.2%.
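The block size B can be checked numerically. The sketch assumes the WiMAX submatrix sizes run from 24 to 96 in steps of 4, as stated later in this section; the per-size block count z / B is our own reading of how the data would be split and is not a figure reported in [55].

```python
from functools import reduce
from math import gcd

# WiMAX submatrix (expansion factor) sizes: 24, 28, ..., 96.
sizes = list(range(24, 97, 4))

# B is the greatest common divisor of all supported sizes.
B = reduce(gcd, sizes)

# Number of B-sized blocks a submatrix of size z splits into
# (an assumption made for illustration).
blocks_per_size = {z: z // B for z in sizes}
```

With these sizes B evaluates to 4, so the largest submatrix (z = 96) would be handled as 24 blocks of 4 values.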
A shift network for the WiMAX standard, based on a self-routing
technique, is designed in [36]. The network is sized to
handle the largest submatrix size of the standard, 96: when
decoding a smaller code, dummy messages are routed as well,
marked with a dedicated flag. Two stages of barrel shifters
provide the shift function to real and dummy messages alike,
together with a single permutation network: a lookup engine
finally selects the useful ones, basing its decision on the flag
bits, the shift size and the submatrix size.
Barrel shifters, though providing the most immediate
implementation of the shift operation, often lack the flexibility
necessary to directly tackle multiple block sizes. For this
reason, they are usually combined with more complex structures.
One of the most common implementations among the
simplest interconnection structures is the Benes network (BeN).
This kind of network is a rearrangeable non-blocking
network frequently used as a permutation network. Defining $S_M$
as the number of inputs and outputs, an $S_M \times S_M$ BeN can
perform any permutation of the inputs, creating a one-to-one
relation with the outputs, with $(S_M/2)(2\log_2 S_M - 1)$ 2 × 2
switches. Its standalone use is thus confined to situations in
which sets of data tend not to intertwine: its range of usage,
though, can be extended through smart scheduling.
In [60] a flexible ASIC architecture for high-throughput
shift-LDPC decoders is depicted. Shift-LDPC codes are a
subclass of structured LDPC codes: the H matrix of an
$(N, M)(M_b, N_b)$ shift-LDPC code is structured as

$$
H = \begin{bmatrix}
H_{1,1} & P_m H_{1,1} & \cdots & P_m^{N_b-1} H_{1,1} \\
H_{2,1} & P_m H_{2,1} & \cdots & P_m^{N_b-1} H_{2,1} \\
\vdots & \vdots & \ddots & \vdots \\
H_{M_b,1} & P_m H_{M_b,1} & \cdots & P_m^{N_b-1} H_{M_b,1}
\end{bmatrix} \tag{12}
$$

where $N \times M$ are the dimensions of the H matrix, and the
leftmost $M_b$ submatrices $H_{i,1}$ are randomly row-permuted versions
of the $z \times z$ identity matrix $I$. Matrix $P_m$ identifies a $z \times z$
permutation matrix, obtained by cyclically shifting right the
columns of $I$ by a single position. The operations involved in
the definition of each of the $M_b \times N_b$ submatrices guarantee
that $P_m^k H_{i,1}$ is $P_m^{k-1} H_{i,1}$ with the rows shifted up one
position, so all matrices of row $i$ can be found by cyclically
shifting $H_{i,1}$.
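The row-shift property can be verified on a small example. In the sketch below, z = 5, the row permutation perm and N_b = 4 are arbitrary illustrative choices, and the helper names are hypothetical; P is built exactly as described, by cyclically shifting the columns of the identity matrix right by one position.

```python
def identity(z):
    return [[1 if r == c else 0 for c in range(z)] for r in range(z)]


def shift_columns_right(m):
    """Cyclically shift the columns of m right by one position."""
    return [row[-1:] + row[:-1] for row in m]


def matmul(a, b):
    """Binary (mod 2) matrix product of two square matrices."""
    z = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(z)) % 2
             for j in range(z)] for i in range(z)]


def rows_up(m):
    """Cyclically shift the rows of m up by one position."""
    return m[1:] + m[:1]


z = 5
P = shift_columns_right(identity(z))

# Hypothetical row-permuted identity standing in for some H_{i,1}.
perm = [2, 0, 4, 1, 3]
H_i1 = [identity(z)[p] for p in perm]

# Block row i of H: H_{i,1}, P H_{i,1}, P^2 H_{i,1}, ...
N_b = 4
block_row = [H_i1]
for _ in range(1, N_b):
    block_row.append(matmul(P, block_row[-1]))

property_holds = all(block_row[k] == rows_up(block_row[k - 1])
                     for k in range(1, N_b))
```

Since each submatrix is the previous one with its rows rotated, a decoder only needs to store the leftmost submatrix of each block row and rotate the row assignment at every step.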
To best exploit the properties of shift-LDPC codes, a VSS
scheme has been selected, along with a highly parallel
implementation: z Variable Node Units (VNUs) and M Check
Node Units (CNUs) perform a whole iteration in N_b steps.
Since every H submatrix is a shifted version of the previous
one, the connections between z VNUs and z CNUs also shift
cyclically every clock cycle: this observation leads to the joint
design of the Benes global permutation network and the CN
shuffler.
The BeN is used to define the links between VNUs and
CNUs: these links are static once the parameters z, N_b and
the structure of the M_b leftmost H_{x,1} have been fixed. Its
high degree of flexibility is exploited to guarantee support
for a variety of different codes. The inter-CN communication
required by the VSS approach is handled by the CN shuffle
network. Its function is to cyclically shift the submatrix rows
assigned to each CNU: this means that while each CNU is
physically connected to the same VNU for the whole decoding,
the row of the H matrix it represents changes. The BeN
consequently has no need to be rearranged.
The flexible decoder is able to achieve 3.6 Gb/s with an
area of 13.9 mm² in 180 nm CMOS technology: the area
occupation is relatively small with respect to the very high
degree of parallelism, thanks to the nonuniform 4-bit
quantization scheme adopted.
In [61] a SIMD-based ASIC is proposed for LDPC decoding
over a wide array of standards. The ASIC is composed of 12
parallel datapaths able to decode both turbo and LDPC codes
through the BCJR algorithm. Like most SIMD cores, the
decoder handles communication by means of shared memories.
Memory management can be challenging, especially when
the parallel datapaths are assigned to fractions of the same
codeword. According to [62], it is possible to avoid collisions
in such cases with an ad-hoc mapping of the interleaving laws
together with one (for LDPC) or two (for turbo) permutation
networks to interface with the memories. These two networks are
implemented in [61] as 8 × 8 BeNs, the one at the input of
the extrinsic values memory being transparent in LDPC mode.
Not every supported standard requires all 12 datapaths
to be active: the chosen parallelism is the minimum necessary
for throughput compliance, and the same can be said of
the working frequency. The implementation results show full
compliance with WiMAX, WiFi, 3GPP-HSDPA and DVB-SH,
at the cost of 0.9 mm² in 45 nm CMOS technology and a total
power consumption of 86.1 mW.
One of the limitations of the traditional BeN is that the number
of its inputs and outputs is bound to be a power
of 2. However, LDPC decoders often need permutations
of different sizes: for example, WiMAX codes require shift
permutations of sizes corresponding to the possible expansion
factors, i.e. from 24 to 96 in steps of 4. In [63] an alternative
switch network is designed that makes use of 3 × 3 switches
as well, leading to a more hardware-efficient design. The
introduction of 3 × 3 switches in fact allows $S_M = 3 \times 2^i$.
A fully compliant WiMAX LDPC decoder shift operation can
thus be implemented with a novel 96 × 96 switch network: it
requires $3 \times 2^i + 3 \times 2^i \log_2 2^i = 576$ 2 × 2 switches
(with i = 5), against the 832 necessary for a traditional
128 × 128 BeN. Together with efficient control signal generation,
this solution outperforms, in terms of both complexity and
flexibility, other modified Benes-based decoders such as [64]
and [57], which exploit a secondary BeN
to rearrange the first one.
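The two switch counts quoted above can be reproduced with a few lines. The first function implements the Benes formula given earlier in this section; the second simply evaluates the expression reported for the network of [63] and makes no claim about its internal structure. Both function names are illustrative.

```python
def benes_switches(s_m):
    """Number of 2x2 switches in an s_m x s_m Benes network.

    s_m must be a power of two: (s_m / 2) * (2 * log2(s_m) - 1).
    """
    log2_size = s_m.bit_length() - 1
    if 1 << log2_size != s_m:
        raise ValueError("Benes size must be a power of two")
    return (s_m // 2) * (2 * log2_size - 1)


def mixed_network_switches(i):
    """Switch count quoted for the (3 * 2**i)-input network of [63]:
    3 * 2**i + 3 * 2**i * log2(2**i)."""
    size = 3 * 2 ** i
    return size + size * i


benes_128 = benes_switches(128)
mixed_96 = mixed_network_switches(5)
```

For i = 5 the network has 96 inputs and the expression gives 576, compared with 832 for the 128-input Benes network that would otherwise be needed.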
In [65], Lin et al. propose an optimized Benes-based shuffle
network for WiMAX. Unlike [63], the starting point of the
design is the non-optimized 128 × 128 BeN: from here, all
the switches through which no signal passes are removed, whereas
switches with fixed output are replaced by wires. This ad-hoc
trimming technique, together with an efficient algorithm for
control signal creation, allows a 26.6%-71.1% area reduction
with respect to previously published shift network solutions
like [27], [55], [66].
Similar to the BeN is the Banyan network (ByN) [67], which
can be seen as a trade-off between the flexibility of the BeN
and the complexity of the BS. Although not non-blocking in
general, the ByN is non-blocking when only shift operations
are performed. Moreover, it is composed of about half of the 2 × 2
switches of a BeN, and requires fewer control signals.
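The non-blocking behaviour for shifts can be checked with a small simulation. The sketch assumes a butterfly-type banyan network with destination-tag routing, in which each stage fixes one address bit starting from the most significant one; the exact topology of [67] may differ, and the function name is illustrative.

```python
def butterfly_conflict_free(perm):
    """Check whether perm (perm[src] = dst) crosses a butterfly-type
    banyan network without two packets needing the same link."""
    n_inputs = len(perm)
    n_bits = n_inputs.bit_length() - 1
    if 1 << n_bits != n_inputs:
        raise ValueError("size must be a power of two")
    positions = list(range(n_inputs))   # current position of each packet
    for stage in range(n_bits):
        mask = 1 << (n_bits - 1 - stage)
        # The packet keeps its position except for one bit, which is
        # replaced by the corresponding bit of its destination.
        positions = [(pos & ~mask) | (perm[src] & mask)
                     for src, pos in enumerate(positions)]
        if len(set(positions)) != n_inputs:
            return False                # two packets collided
    return True


n = 16
shifts_ok = all(butterfly_conflict_free([(s + c) % n for s in range(n)])
                for c in range(n))
bit_reversal = [int(format(s, "04b")[::-1], 2) for s in range(n)]
reversal_ok = butterfly_conflict_free(bit_reversal)
```

In this model every cyclic shift passes without conflicts, while the bit-reversal permutation blocks, which matches the statement that the ByN is non-blocking for shifts but not in general.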
The work described in [68] portrays a highly parallel shuffle
network based on the ByN paradigm. Like BeNs, ByNs
are bound to a power-of-two number of inputs: as Parhi et al.
have done in [63], here too the introduction of 3 × 3 switches
allows the various submatrix sizes of the WiMAX standard to be
handled. The implemented decoder guarantees a very high degree
of flexibility with complexity lower than or comparable to [36],
[63]–[65].
B. Networks-on-Chip
Networks-on-Chip (NoCs) [69] are versatile interconnection
structures that allow communication among all the connected
devices through the presence of routing elements. Recently,
decoders for both turbo and LDPC codes based on
the NoC paradigm have been proposed, thanks to the intrinsic
flexibility of NoCs. NoC-based decoders are multi-processor
systems composed of various instantiations of the same IP,
each associated with a routing element, which are linked in a
defined pattern. This pattern can be represented with a graph, in
which every node corresponds to a processor and a router: the arcs
are the physical links among routers, thus identifying a topology.
In [70], a De Bruijn topology [71] NoC is proposed for
flexible LDPC decoders. Since the TPMP has been selected,
the NoC must handle the communication between VNs and
CNs. The NoC design, however, is completely detached from
code and decoding parameters, effectively allowing its usage for
any LDPC code. The router embeds a modified shortest path
routing algorithm that can be executed in 1 clock cycle,
together with deadlock-free and buffer-reducing arbitration
policies, and is connected to its PE via the network interface.
The network is synthesized and compared to other explored
network topologies, such as the 2-dimensional mesh [72], Benes
[73] and MDN [74]: the degree of flexibility and scalability
that the proposed topology guarantees is unmatched by the others.
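To give a feel for the De Bruijn topology, the sketch below builds a binary De Bruijn graph and finds shortest routes with a plain breadth-first search. It assumes directed links from node v to 2v mod N and 2v + 1 mod N; the modified single-cycle routing algorithm of [70] is not reproduced, and the function names are illustrative.

```python
from collections import deque


def de_bruijn_neighbors(node, n_nodes):
    """Successors of `node` in a binary De Bruijn graph (n_nodes = 2**k)."""
    return [(2 * node) % n_nodes, (2 * node + 1) % n_nodes]


def shortest_path(src, dst, n_nodes):
    """Breadth-first search over the directed De Bruijn links."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            break
        for nxt in de_bruijn_neighbors(cur, n_nodes):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


route = shortest_path(0, 15, 16)
longest = max(len(shortest_path(s, d, 16)) - 1
              for s in range(16) for d in range(16))
```

With 16 nodes no route needs more than log2(16) = 4 hops, which is the short-diameter property that makes this topology attractive for message passing between many PEs.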
The performance of another topology, the 2-D toroidal
mesh, is evaluated in [43]. The routing element implements the
near-optimal X-Y routing for the torus/mesh [75]. A whole
set of communication-centric parameters is varied in order
to evaluate the impact of the network latency on the whole
decoder performance. It has been shown that small PE sending
periods R, i.e. cycles between two available data from the
processor, increase latency to unsustainable levels, with the
smallest values at R7. Also, the variations in throughput
due to different NoC parallelisms are overshadowed by the impact
of latency.
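Dimension-ordered routing on a torus can be sketched as follows. This is a generic X-then-Y scheme that takes the wrap-around link whenever it is shorter; it is offered as an illustration of the idea and is not claimed to match the near-optimal variant of [75] in every detail. The function name and the coordinate convention are assumptions.

```python
def torus_xy_route(src, dst, cols, rows):
    """Route from src to dst on a cols x rows 2-D torus.

    Nodes are (x, y) tuples. The packet first travels along X, then
    along Y, and in each dimension it moves in the direction that
    needs fewer hops, using the wrap-around link when shorter.
    """
    def steps(a, b, size):
        forward = (b - a) % size
        backward = (a - b) % size
        return (1, forward) if forward <= backward else (-1, backward)

    x, y = src
    path = [(x, y)]
    dx, nx = steps(x, dst[0], cols)
    for _ in range(nx):
        x = (x + dx) % cols
        path.append((x, y))
    dy, ny = steps(y, dst[1], rows)
    for _ in range(ny):
        y = (y + dy) % rows
        path.append((x, y))
    return path


route = torus_xy_route((0, 0), (4, 1), 5, 5)
```

On the 5 × 5 torus of the example, the route from (0, 0) to (4, 1) uses the wrap-around link in X and therefore takes two hops instead of five.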
The work in [76] describes a flexible LDPC decoder design
tackling the communication problem with two different NoC
solutions. The first network is a De Bruijn NoC adopting
on-line dynamic routing, implementing the same modified shortest
path algorithm described in [70], while the second, a 2-D
torus, is based on a completely novel concept named Zero
Overhead NoC (ZONoC). Given a mapping of VNs and CNs
over a topology, the message exchange pattern is deterministic,
along with the status of the network at each instant. The
ZONoC exploits this property by running off-line simulations
and storing routing information in dedicated memories,
effectively wiping out the time spent for routing and traffic
control. The overall network complexity is scaled down, since no
routing algorithm is necessary; the FIFO length can be trimmed
to the minimum necessary, while the router architecture is as
simple as possible (Fig. 3). A crossbar switch controlled by the
routing memory receives messages from the FIFOs connected
to its PE and to other routers, while outgoing messages are sent
to as many output registers. Implementation results show a
significant reduction in complexity with respect to [29], [36],
[56], [72] and comparable or superior throughput.
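The idea of precomputed routing can be sketched with a toy off-line simulation. For brevity the example uses a unidirectional ring instead of the 2-D torus of [76], gives ring traffic priority over newly injected messages, and uses hypothetical names throughout; it only illustrates how a deterministic traffic pattern turns into per-router routing memories.

```python
from collections import deque


def precompute_routing(n_routers, messages, max_cycles=1000):
    """Off-line simulation of a unidirectional ring NoC.

    messages: list of (src, dst, inject_cycle) tuples with src != dst.
    Returns (routing_mem, last_cycle); routing_mem[r] lists the
    (cycle, input_port, output_port) crossbar settings of router r.
    """
    local = [deque() for _ in range(n_routers)]  # FIFOs fed by the PEs
    ring = [deque() for _ in range(n_routers)]   # FIFOs fed by the previous router
    routing_mem = [[] for _ in range(n_routers)]
    delivered = 0
    for cycle in range(max_cycles):
        for src, dst, inject_cycle in messages:
            if inject_cycle == cycle:
                local[src].append(dst)
        in_flight = []  # messages crossing a link during this cycle
        for r in range(n_routers):
            ring_out_busy = False
            if ring[r]:
                dst = ring[r].popleft()
                if dst == r:
                    routing_mem[r].append((cycle, "ring", "local"))
                    delivered += 1
                else:
                    routing_mem[r].append((cycle, "ring", "ring"))
                    in_flight.append(((r + 1) % n_routers, dst))
                    ring_out_busy = True
            if not ring_out_busy and local[r]:
                dst = local[r].popleft()
                routing_mem[r].append((cycle, "local", "ring"))
                in_flight.append(((r + 1) % n_routers, dst))
        for r, dst in in_flight:
            ring[r].append(dst)
        if delivered == len(messages):
            return routing_mem, cycle
    raise RuntimeError("traffic did not drain within max_cycles")


routing_mem, last_cycle = precompute_routing(4, [(0, 2, 0), (1, 2, 0)])
```

At run time each router would simply replay its list of crossbar settings cycle by cycle, so no header decoding or arbitration logic is needed.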
C. Reducing the NoC penalty
The NoC approach guarantees a very high degree of
flexibility and, in theory, a NoC-based decoder can reach very
high throughput. The achievable throughput is proportional to
the number of PEs, but increasing the size of the network
raises the latency, and thus degrades performance
again. Very few state-of-the-art solutions have managed to solve
this problem, and those that do suffer from large complexity
and power consumption. We have tried to overcome these
shortcomings in some recent works.
Figure 3. ZONoC routing element
1) NoC-based WiMAX LDPC decoder: The solution
described in [76] supports the WiMAX standard LDPC codes,
but does not guarantee a high enough throughput. Stemming
from it, we have developed a ZONoC-based LDPC decoder
fully compliant with the WiMAX standard. We designed a
sequential PE implementing the Normalized Min-Sum decoding
algorithm, as described by Hocevar in [77]: unlike in [76], we
adopt the layered decoding approach, which, although having a
more convoluted graph structure, relies on a smaller number
of exchanged messages and guarantees a 2× factor in
convergence speed.
The PE architecture is independent of the
code parameters, and the memory capacity sets the only limit
to the size of the supported codes. Together with the PE, we
devised a decoder reconfiguration technique to upload the
data necessary for routing and memory management when
switching between codes.
In order to comply with the WiMAX standard throughput
requirements, the size of the 2-D torus mesh has been increased
from 16 nodes to 25. As detailed in [78], the decoder guarantees
more than 70 Mb/s for all rates and block sizes of the standard,
with an area of 4.72 mm² in 130 nm CMOS technology.
2) Bandwidth and power reduction methods: While the
former decoder is compliant with WiMAX in the worst case, i.e.
when the maximum allowed number of iterations is performed,
a codeword is on average corrected with fewer iterations: the
unnecessary iterations contribute significantly to the high
power consumption of the NoC. In [79] two methods aimed at
reducing power and increasing throughput are studied and
implemented. The first one is the iteration early-stopping (ES)
criterion proposed in [80], which stops the decoding when all
the information bits of a codeword are correct, regardless of
the redundancy bits. The other is a threshold-based message
stopping (MS) criterion, which reduces the traffic load on the
network by avoiding the injection of values that carry
information about bits that are correct with high probability.
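A minimal sketch of threshold-based message stopping follows. It assumes that reliability is measured by the LLR magnitude and that a message is withheld when this magnitude reaches the threshold; the exact rule and threshold values used in [79] are not given here, so the function name, the sample LLRs and the threshold are all illustrative.

```python
def messages_to_inject(llrs, threshold):
    """Return the (index, llr) pairs that are still sent on the NoC.

    Messages whose magnitude reaches the threshold are considered
    reliable enough and are not injected into the network.
    """
    return [(i, llr) for i, llr in enumerate(llrs) if abs(llr) < threshold]


llrs = [0.4, -7.2, 3.1, 9.5, -1.0]
sent = messages_to_inject(llrs, threshold=5.0)
saved_fraction = 1 - len(sent) / len(llrs)
```

In this toy case two of the five messages are withheld, which reduces the traffic on the network at the price of the receiving nodes working with stale values for those bits.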
Various combinations of the two methods have been tried,
together with different parallelisms of the 2-D torus mesh.
Implementation of the ES criterion requires a dedicated
processing block with minimal PE modifications, while MS requires
a threshold comparison block for each PE and switching to
on-line dynamic routing. This is necessary since stopping a
message invalidates the statically computed communication
pattern. While the ES method guarantees an average 10%
reduction in decoding energy per frame regardless of the
implementation, the results of the MS method change with the size
of the NoC. Since stopped messages can lead to additional
errors, a performance sacrifice must be accepted: among the
solutions presented in [79], with a 0.3 dB BER loss, a 9-PE
NoC is sufficient to support the whole WiMAX standard.
3) NoC analysis for turbo/LDPC decoders: In [81] an
extensive analysis of the performance of various NoC topologies
is carried out in the context of multi-processor turbo decoders.
Flexibility can also be explored in terms of the types of code
supported: in [82] we have consequently extended the topology
analysis to LDPC codes, in order to find a suitable architecture
for a dual turbo/LDPC decoder. As a case study, we focused
our research on the WiMAX codes.
The performance of a wide set of topologies (ring,
spidergon, toroidal meshes, honeycomb, De Bruijn, Kautz) has been
evaluated in terms of achievable throughput and complexity,
considering different parallelisms. Exploiting a modified
version of the cycle-accurate simulation tool described in [81], a
range of design parameters has been taken into consideration,
including data injection rate, message collision management
policies, routing algorithms, node addressing modes and
structure of the routing element, making it possible to span from
a completely adaptive architecture to a ZONoC-like precalculated
routing.
The simulations revealed the Kautz topology [83] to be the
best trade-off in terms of throughput and complexity between
LDPC and turbo codes, with a partially adaptive router
architecture and FIFO-length-based routing. Two separate PEs for
turbo and LDPC codes have been designed: different NoC and
PE working frequencies make it possible to trim the message
injection rate into the network. The full decoder, complete with
the separate turbo and LDPC PEs, has been synthesized in 90 nm
CMOS technology: the decoder is compliant with both turbo and
LDPC throughput requirements for all the WiMAX standard
codes. Worst-case throughput results outperform the latest
similar solutions such as [42], [44], [61], [84], with a small area
occupation and a particularly low power consumption (59 mW)
in turbo mode.
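For readers unfamiliar with the Kautz topology, the sketch below builds a Kautz graph from its standard definition: nodes are strings in which adjacent symbols differ, and each node links to the strings obtained by dropping its first symbol and appending a new one. The parameters (degree 2, string length 3) are illustrative and are not the configuration selected in [82].

```python
from itertools import product


def kautz_graph(degree, length):
    """Build a Kautz graph as an adjacency dictionary.

    Nodes are tuples of `length` symbols over an alphabet of
    degree + 1 symbols in which adjacent symbols differ. A node
    s1 s2 ... sn links to s2 ... sn x for every symbol x != sn.
    """
    alphabet = range(degree + 1)
    nodes = [w for w in product(alphabet, repeat=length)
             if all(a != b for a, b in zip(w, w[1:]))]
    return {w: [w[1:] + (x,) for x in alphabet if x != w[-1]]
            for w in nodes}


g = kautz_graph(2, 3)
n_nodes = len(g)
```

The construction yields (degree + 1) · degree^(length − 1) nodes, 12 in this example, each with exactly `degree` outgoing links, so the router complexity stays fixed while the node count grows quickly with the string length.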
VI. CONCLUSIONS
A complete overview of LDPC decoders, with particular
emphasis on flexibility, has been drawn. Various classifications
are depicted, according to the degree of parallelism and the
implementation choices, focusing on common design choices and
elements for flexible LDPC decoders. An in-depth view is given
of the PE and interconnection parts of the decoders, with a
comparison of the current state of the art. Finally, the latest
work by the authors on NoC-based decoders is briefly described.
REFERENCES
[1] A. Morello and V. Mignone, “DVB-S2: The second generation standard
for satellite broad-band services,” Proceedings of the IEEE, vol. 94,
no. 1, pp. 210 –227, 2006.
[2] J. Lorincz and D. Begusic, “Physical layer analysis of emerging IEEE
802.11n WLAN standard,” in Advanced Communication Technology,
2006. ICACT 2006. The 8th International Conference, vol. 1, 2006, pp.
6 pp. –194.
[3] The IEEE p802.3an, 10GBASE-T task force. [Online]. Available:
www.ieee802.org/3/an
[4] “IEEE standard for local and metropolitan area networks part 16:
Air interface for fixed and mobile broadband wireless access systems
amendment 2: Physical and medium access control layers for combined
fixed and mobile operation in licensed bands and corrigendum 1,” IEEE
Std 802.16e-2005 and IEEE Std 802.16-2004/Cor 1-2005 (Amendment
and Corrigendum to IEEE Std 802.16-2004), 2006.
[5] R. Gallager, “Low-density parity-check codes,” Information Theory, IRE
Transactions on, vol. 8, no. 1, pp. 21 –28, 1962.
[6] R. Tanner, “A recursive approach to low complexity codes,” Information
Theory, IEEE Transactions on, vol. 27, no. 5, pp. 533 – 547, sep 1981.
[7] M. Mansour and N. Shanbhag, “A 640-Mb/s 2048-bit programmable
LDPC decoder chip,” Solid-State Circuits, IEEE Journal of, vol. 41,
no. 3, pp. 684 – 698, 2006.
[8] F. Kschischang, B. Frey, and H.-A. Loeliger, “Factor graphs and the sum-
product algorithm,” Information Theory, IEEE Transactions on, vol. 47,
no. 2, pp. 498 –519, feb 2001.
[9] G. Masera, F. Quaglio, and F. Vacca, “Finite precision implementation
of LDPC decoders,” Communications, IEE Proceedings-, vol. 152, no. 6,
pp. 1098 – 1102, dec. 2005.
[10] N. Wiberg, “Codes and decoding on general graphs,” Ph.D. dissertation,
Linkoping University, Linkoping, Sweden.
[11] M. Daud, A. Suksmono, Hendrawan, and Sugihartono, “Comparison of
decoding algorithms for LDPC codes of IEEE 802.16e standard,” in
Telecommunication Systems, Services, and Applications (TSSA), 2011
6th International Conference on, oct. 2011, pp. 280 –283.
[12] J. Chen and M. Fossorier, “Near optimum universal belief propagation
based decoding of LDPC codes and extension to turbo decoding,”
in Information Theory, 2001. Proceedings. 2001 IEEE International
Symposium on, 2001, p. 189.
[13] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X. Y. Hu,
“Reduced-complexity decoding of LDPC codes,” IEEE Transactions on
Communications, vol. 53, no. 8, pp. 1288–1299, Aug 2005.
[14] M. Martina, G. Masera, S. Papaharalabos, P. Mathiopoulos, and
F. Gioulekas, “On practical implementation and generalization of max*
operation for Turbo and LDPC decoders,” Instrumentation and
Measurement, IEEE Transactions on, to appear.
[15] E. Yeo, P. Pakzad, B. Nikolic, and V. Anantharam, “High throughput
low-density parity-check decoder architectures,” in Global Telecommu-
nications Conference, 2001. GLOBECOM ’01. IEEE, vol. 5, 2001, pp.
3019 –3024.
[16] M. Mansour and N. Shanbhag, “Memory-efficient turbo decoder archi-
tectures for LDPC codes,” in Signal Processing Systems, 2002. (SIPS
’02). IEEE Workshop on, oct. 2002, pp. 159 – 164.
[17] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of
linear codes for minimizing symbol error rate,” IEEE Transactions on
Information Theory, vol. 20, no. 3, pp. 284–287, Mar 1974.
[18] J. Zhang and M. Fossorier, “Shuffled belief propagation decoding,” in
Signals, Systems and Computers, 2002. Conference Record of the Thirty-
Sixth Asilomar Conference on, vol. 1, nov. 2002, pp. 8 – 15 vol.1.
[19] R. Tanner, D. Sridhara, A. Sridharan, T. Fuja, and D. J. Costello, Jr.,
“LDPC block and convolutional codes based on circulant matrices,”
Information Theory, IEEE Transactions on, vol. 50, no. 12, pp. 2966
– 2984, dec. 2004.
[20] A. Blanksby and C. Howland, “A 690-mW 1-Gb/s 1024-b, rate-1/2 low-
density parity-check code decoder,” Solid-State Circuits, IEEE Journal
of, vol. 37, no. 3, pp. 404 –412, mar 2002.
[21] L. Fanucci and P. Ciao, “Design of a fully-parallel high-throughput
decoder for turbo Gallager codes,” IEICE Transactions on Fundamentals
of Electronics, Communications and Computer Sciences, no. 7, pp. 1976–
1986.
[22] V. Nagarajan, N. Jayakumar, S. Khatri, and O. Milenkovic, “High-
throughput VLSI implementations of iterative decoders and related
code construction problems,” in Global Telecommunications Conference,
2004. GLOBECOM ’04. IEEE, vol. 1, 29 2004-dec. 3 2004, pp. 361 –
365 Vol.1.
[23] H. Zhong, W. Xu, N. Xie, and T. Zhang, “Area-efficient min-sum
decoder design for high-rate quasi-cyclic low-density parity-check codes
in magnetic recording,” Magnetics, IEEE Transactions on, vol. 43,
no. 12, pp. 4117 –4122, dec. 2007.
[24] G. Lechner, J. Sayir, and M. Rupp, “Efficient DSP implementation of
an LDPC decoder,” in Acoustics, Speech, and Signal Processing, 2004.
Proceedings. (ICASSP ’04). IEEE International Conference on, vol. 4,
may 2004, pp. iv–665 – iv–668 vol.4.
[25] T. Zhang and K. Parhi, “A 54 Mbps (3,6)-regular FPGA LDPC decoder,”
in Signal Processing Systems, 2002. (SIPS ’02). IEEE Workshop on, oct.
2002, pp. 127 – 132.
[26] E. Boutillon, J. Castura, and F. Kschischang, “Decoder first code design,”
in Proc. 2nd International Symposium on Turbo Codes and Related
Topics, Sept. 2000, pp. 459–462.
[27] T. Brack, M. Alles, F. Kienle, and N. Wehn, “A synthesizable IP
core for WIMAX 802.16e LDPC code decoding,” in Personal, Indoor
and Mobile Radio Communications, 2006 IEEE 17th International
Symposium on, sept. 2006, pp. 1 –5.
[28] T.-C. Kuo and A. Willson, “A flexible decoder IC for WiMAX QC-
LDPC codes,” in Custom Integrated Circuits Conference, 2008. CICC
2008. IEEE, 2008, pp. 527 –530.
[29] T. Brack, M. Alles, T. Lehnigk-Emden, F. Kienle, N. Wehn,
N. L’Insalata, F. Rossi, M. Rovini, and L. Fanucci, “Low complexity
LDPC code decoders for next generation standards,” in Design, Au-
tomation Test in Europe Conference Exhibition, 2007. DATE ’07, 2007,
pp. 1 –6.
[30] M. Alles, N. Wehn, and F. Berens, “A synthesizable IP core for WiMedia
1.5 UWB LDPC code decoding,” in Ultra-Wideband, 2009. ICUWB
2009. IEEE International Conference on, sept. 2009, pp. 597 –601.
[31] B. Zhang, H. Liu, X. Chen, D. Liu, and X. Yi, “Low complexity DVB-
S2 LDPC decoder,” in Vehicular Technology Conference, 2009. VTC
Spring 2009. IEEE 69th, april 2009, pp. 1 –5.
[32] J. Cho, N. Shanbhag, and W. Sung, “Low-power implementation of a
high-throughput LDPC decoder for IEEE 802.11n standard,” in Signal
Processing Systems, 2009. SiPS 2009. IEEE Workshop on, oct. 2009,
pp. 040 –045.
[33] S. Huang, D. Bao, B. Xiang, Y. Chen, and X. Zeng, “A flexible
LDPC decoder architecture supporting two decoding algorithms,” in
Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International
Symposium on, 30 2010.
[34] B. Xiang, D. Bao, S. Huang, and X. Zeng, “A fully-overlapped multi-
mode QC-LDPC decoder architecture for mobile WiMAX applications,”
in Application-specific Systems Architectures and Processors (ASAP),
2010 21st IEEE International Conference on, july 2010, pp. 225 –232.
[35] Y.-L. Wang, Y.-L. Ueng, C.-L. Peng, and C.-J. Yang, “Processing-task
arrangement for a low-complexity full-mode WiMAX LDPC codec,”
Circuits and Systems I: Regular Papers, IEEE Transactions on, 2010.
[36] C.-H. Liu, S.-W. Yen, C.-L. Chen, H.-C. Chang, C.-Y. Lee, Y.-S. Hsu,
and S.-J. Jou, “An LDPC decoder chip based on self-routing network
for IEEE 802.16e applications,” Solid-State Circuits, IEEE Journal of,
vol. 43, no. 3, pp. 684 –694, 2008.
[37] X.-Y. Shih, C.-Z. Zhan, C.-H. Lin, and A.-Y. Wu, “An 8.29 mm² 52
mW multi-mode LDPC decoder design for mobile WiMAX system in
0.13 µm CMOS process,” Solid-State Circuits, IEEE Journal of, vol. 43,
no. 3, pp. 672 –683, 2008.
[38] X.-Y. Shih, C.-Z. Zhan, and A.-Y. Wu, “A real-time programmable
LDPC decoder chip for arbitrary QC-LDPC parity check matrices,” in
Solid-State Circuits Conference, 2009. A-SSCC 2009. IEEE Asian, nov.
2009, pp. 369 –372.
[39] F. Guilloud, E. Boutillon, and J. Danger, “λ-min decoding algorithm of
regular and irregular LDPC codes,” in Turbo Codes and Related Topics,
2003 3rd International Symposium on, sept. 2003, pp. 451–454.
[40] Y. Dai, N. Chen, and Z. Yan, “Memory efficient decoder architectures
for quasi-cyclic LDPC codes,” Circuits and Systems I: Regular Papers,
IEEE Transactions on, vol. 55, no. 9, pp. 2898 – 2911, 2008.
[41] M. Castano, M. Rovini, N. E. L’Insalata, F. Rossi, R. Merlino, C. Ciofi,
and L. Fanucci, “Adaptive single phase decoding of LDPC codes,” Turbo
Codes Related Topics; 6th International ITG-Conference on Source and
Channel Coding (TURBOCODING), 2006 4th International Symposium
on, pp. 1 –6, april 2006.
[42] P. Murugappa, R. Al-Khayat, A. Baghdadi, and M. Jezequel, “A flexible
high throughput multi-ASIP architecture for LDPC and turbo decoding,”
in Design, Automation and Test in Europe Conference and Exhibition,
2011, pp. 1–6.
[43] M. Scarpellino, A. Singh, E. Boutillon, and G. Masera, “Reconfigurable
architecture for LDPC and turbo decoding: a NoC case study,” in
IEEE International Symposium on Spread Spectrum Techniques and
Applications, 2008, pp. 671–676.
[44] M. Alles, T. Vogt, and N. Wehn, “FlexiChaP: A reconfigurable ASIP for
convolutional, turbo, and LDPC code decoding,” in Turbo Codes and
Related Topics, 2008 5th International Symposium on, 2008, pp. 84 –89.
[45] F. Naessens, A. Bourdoux, and A. Dejonghe, “A flexible ASIP decoder
for combined binary and non-binary LDPC codes,” in Communications
and Vehicular Technology in the Benelux (SCVT), 2010 17th IEEE
Symposium on, nov. 2010, pp. 1 –5.
[46] I.-C. Park and S.-H. Kang, “Scheduling algorithm for partially parallel
architecture of LDPC decoder by matrix permutation,” in Circuits and
Systems, 2005. ISCAS 2005. IEEE International Symposium on, may
2005, pp. 5778 – 5781 Vol. 6.
[47] K. Gunnam, G. Choi, and M. Yeary, “A parallel VLSI architecture for
layered decoding for array LDPC codes,” in VLSI Design, 2007. Held
jointly with 6th International Conference on Embedded Systems., 20th
International Conference on, jan. 2007, pp. 738 –743.
[48] High Rate UWB PHY and MAC Standard, Standard ECMA-368 Std.
[Online]. Available: http://www.ecma-international.org
[49] R. Hegde and N. Shanbhag, “A voltage overscaled low-power digital
filter IC,” Solid-State Circuits, IEEE Journal of, vol. 39, no. 2, pp. 388
– 391, feb. 2004.
[50] B. Shim, S. Sridhara, and N. Shanbhag, “Reliable low-power digital
signal processing via reduced precision redundancy,” Very Large Scale
Integration (VLSI) Systems, IEEE Transactions on, vol. 12, no. 5, pp.
497 –510, may 2004.
[51] T. Vogt and N. Wehn, “A reconfigurable application specific instruction
set processor for convolutional and turbo decoding in a SDR environ-
ment,” in Design, Automation and Test in Europe, 2008. DATE ’08,
march 2008, pp. 38 –43.
[52] M. Flynn, “Very high-speed computing systems,” Proceedings of the
IEEE, vol. 54, no. 12, pp. 1901 – 1909, dec. 1966.
[53] M. Awais, A. Singh, E. Boutillon, and G. Masera, “A novel architecture
for scalable, high throughput, multi-standard LDPC decoder,” in Digital
System Design (DSD), 2011 14th Euromicro Conference on, 31 2011-
sept. 2 2011, pp. 340 –347.
[54] J. Sha, Z. Wang, M. Gao, and L. Li, “Multi-Gb/s LDPC code design
and implementation,” Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on, vol. 17, no. 2, pp. 262 –268, feb. 2009.
[55] M. Rovini, G. Gentile, and L. Fanucci, “Multi-size circular shifting
networks for decoders of structured LDPC codes,” Electronics Letters,
vol. 43, no. 17, pp. 938 –940, 16 2007.
[56] F. Quaglio, F. Vacca, C. Castellano, A. Tarable, and G. Masera, “Inter-
connection framework for high-throughput, flexible LDPC decoders,” in
Design, Automation and Test in Europe, 2006. DATE ’06. Proceedings,
vol. 2, march 2006, p. 6 pp.
[57] K. Gunnam, G. Choi, M. Yeary, and M. Atiquzzaman, “VLSI archi-
tectures for layered decoding for irregular LDPC codes of WiMAX,”
in Communications, 2007. ICC ’07. IEEE International Conference on,
june 2007, pp. 4542 –4547.
[58] J. Dielissen, A. Hekstra, and V. Berg, “Low cost LDPC decoder for
DVB-S2,” in Design, Automation and Test in Europe, 2006. DATE ’06.
Proceedings, vol. 2, march 2006, pp. 1 –6.
[59] M. Karkooti, P. Radosavljevic, and J. Cavallaro, “Configurable, high
throughput, irregular LDPC decoder architecture: Tradeoff analysis
and implementation,” in Application-specific Systems, Architectures and
Processors, 2006. ASAP ’06. International Conference on, sept. 2006,
pp. 360 –367.
[60] C. Zhang, Z. Wang, J. Sha, L. Li, and J. Lin, “Flexible LDPC decoder
design for multigigabit-per-second applications,” Circuits and Systems
I: Regular Papers, IEEE Transactions on, vol. 57, no. 1, pp. 116 –124,
jan. 2010.
[61] G. Gentile, M. Rovini, and L. Fanucci, “A multi-standard flexible
turbo/LDPC decoder via ASIC design,” in International Symposium on
Turbo Codes & Iterative Information Processing, 2010, pp. 294–298.
[62] A. Tarable, S. Benedetto, and G. Montorsi, “Mapping interleaving laws
to parallel turbo and LDPC decoder architectures,” Information Theory,
IEEE Transactions on, vol. 50, no. 9, pp. 2002 – 2009, sept. 2004.
[63] D. Oh and K. Parhi, “Low-complexity switch network for reconfigurable
LDPC decoders,” Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on, vol. 18, no. 1, pp. 85 –94, jan. 2010.
[64] J. Tang, T. Bhatt, V. Sundaramurthy, and K. Parhi, “Reconfigurable
shuffle network design in LDPC decoders,” in Application-specific
Systems, Architectures and Processors, 2006. ASAP ’06. International
Conference on, sept. 2006, pp. 81 –86.
[65] J. Lin, Z. Wang, L. Li, J. Sha, and M. Gao, “Efficient shuffle network
architecture and application for WiMAX LDPC decoders,” Circuits and
Systems II: Express Briefs, IEEE Transactions on, vol. 56, no. 3, pp.
215 –219, march 2009.
[66] C.-H. Liu, C.-C. Lin, S.-W. Yen, C.-L. Chen, H.-C. Chang, C.-Y. Lee,
Y.-S. Hsu, and S.-J. Jou, “Design of a multimode QC-LDPC decoder
based on shift-routing network,” Circuits and Systems II: Express Briefs,
IEEE Transactions on, vol. 56, no. 9, pp. 734 –738, 2009.
[67] F. T. Leighton, Introduction to Parallel Algorithms and Architectures:
Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, 1992.
[68] X. Peng, Z. Chen, X. Zhao, F. Maehara, and S. Goto, “High parallel
variation Banyan network based permutation network for reconfigurable
LDPC decoder,” in Application-specific Systems Architectures and Pro-
cessors (ASAP), 2010 21st IEEE International Conference on, july 2010,
pp. 233 –238.
[69] L. Benini and G. De Micheli, “Networks on chips: a new SoC paradigm,”
Computer, vol. 35, no. 1, pp. 70 –78, Jan. 2002.
[70] H. Moussa, A. Baghdadi, and M. Jezequel, “Binary De Bruijn on-chip
network for a flexible multiprocessor LDPC decoder,” in ACM/IEEE
Design Automation Conference, 2008, pp. 429–434.
[71] N. De Bruijn, “A combinatorial problem,” Koninklijke Nederlandse
Akademie van Wetenschappen, vol. 49, pp. 758–764, 1946.
[72] T. Theocharides, G. Link, N. Vijaykrishnan, and M. Irwin, “Implement-
ing LDPC decoding on network-on-chip,” in VLSI Design, 2005. 18th
International Conference on, 2005, pp. 134 – 137.
[73] G. Masera, F. Quaglio, and F. Vacca, “Implementation of a flexible
LDPC decoder,” Circuits and Systems II: Express Briefs, IEEE Trans-
actions on, vol. 54, no. 6, pp. 542 –546, 2007.
[74] F. Kienle, M. Thul, and N. Wehn, “Implementation issues of scalable
LDPC-decoders,” in Turbo Codes and Related Topics, Proc. 3rd Inter-
national Symposium on, 2003.
[75] D. Seo, A. Ali, W.-T. Lim, and N. Rafique, “Near-optimal worst-case
throughput routing for two-dimensional mesh networks,” in Computer
Architecture, 2005. ISCA ’05. Proceedings. 32nd International Sympo-
sium on, june 2005, pp. 432 – 443.
[76] F. Vacca, G. Masera, H. Moussa, A. Baghdadi, and M. Jezequel,
“Flexible architectures for LDPC decoders based on network on chip
paradigm,” in Digital System Design, Architectures, Methods and Tools,
2009. DSD ’09. 12th Euromicro Conference on, 2009, pp. 582 –589.
[77] D. Hocevar, “A reduced complexity decoder architecture via layered
decoding of LDPC codes,” in Signal Processing Systems, 2004. SIPS
2004. IEEE Workshop on, 2004, pp. 107 – 112.
[78] C. Condo, “A parallel LDPC decoder with Network on Chip as under-
lying architecture,” Master’s thesis, Politecnico di Torino, 2010.
[79] C. Condo and G. Masera, “A flexible NoC-based LDPC code decoder
implementation and bandwidth reduction methods,” in Design and Ar-
chitectures for Signal and Image Processing (DASIP), 2011 Conference
on, nov. 2011, pp. 1 –8.
[80] Z. Chen, X. Zhao, X. Peng, D. Zhou, and S. Goto, “An early stopping
criterion for decoding LDPC codes in WiMAX and WiFi standards,” in
Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International
Symposium on, 30 2010.
[81] M. Martina and G. Masera, “Turbo NOC: A framework for the design
of network-on-chip-based turbo decoder architectures,” Circuits and
Systems I: Regular Papers, IEEE Transactions on, vol. 57, no. 10, pp.
2776 – 2789, 2010.
[82] C. Condo, M. Martina, and G. Masera, “A Network-on-Chip-based high
throughput turbo/LDPC decoder architecture,” in Design, Automation
and Test in Europe, 2012. DATE’12. Proceedings. Conference on, to
appear.
[83] M. Imase and M. Itoh, “A design for directed graphs with minimum
diameter,” Computers, IEEE Transactions on, vol. C-32, no. 8, pp. 782–
784, Aug. 1983.
[84] F. Naessens, B. Bougard, S. Bressinck, L. Hollevoet, P. Raghavan, L. V.
der Perre, and F. Catthoor, “A unified instruction set programmable
architecture for multi-standard advanced forward error correction,” in
IEEE Workshop on Signal Processing Systems, 2008, pp. 31–36.
