Reduced-Complexity Column-Layered Decoding and Implementation for LDPC
  Codes by Cui, Zhiqiang et al.
Reduced-Complexity Column-Layered Decoding and
Implementation for LDPC Codes
Zhiqiang Cui1, Zhongfeng Wang2, Senior Member, IEEE, and Xinmiao Zhang3
1Qualcomm Inc., San Diego, CA 92121, USA
2Broadcom Corp., Irvine, CA 92617, USA
3Case Western Reserve University, Cleveland, OH 44106, USA
Abstract: Layered decoding is well regarded in Low-Density Parity-Check (LDPC) decoder
implementation because it achieves high decoding throughput with low computation
complexity. This work, for the first time, addresses low-complexity column-layered decoding schemes and
VLSI architectures for multi-Gb/s applications. First, the Min-Sum algorithm is incorporated into
column-layered decoding. Then algorithmic transformations and judicious approximations are explored to
minimize the overall computation complexity. Compared to the original column-layered decoding, the new
approach can reduce the computation complexity in check node processing for high-rate LDPC codes by up
to 90% while maintaining the fast convergence speed of layered decoding. Furthermore, a relaxed
pipelining scheme is presented to enable very high clock speed for VLSI implementation. Equipped with
these new techniques, an efficient decoder architecture for quasi-cyclic LDPC codes is developed and
implemented with 0.13 µm CMOS technology. It is shown that a decoding throughput of nearly 4 Gb/s with a
maximum of 10 iterations can be achieved for a (4096, 3584) LDPC code. Hence, this work has facilitated
practical applications of column-layered decoding and particularly made it very attractive in high-speed,
high-rate LDPC decoder implementation.
Index Terms: Decoder, Error correction codes, Low-density parity-check (LDPC), Quasi-cyclic (QC)
codes, VLSI Architecture, Layered decoding.
1 INTRODUCTION
Conventionally, LDPC codes are decoded using the Sum-Product algorithm (SPA) [1] or the modified
Min-Sum algorithm (MSA) [2]. In general, the SPA has the best decoding performance. The MSA is an
approximation of the SPA aimed to reduce computation complexity. It is widely employed in LDPC decoder
design [3][4][5][6]. Both algorithms are based on the two-phase message-passing (TPMP) scheme. In the
literature, variations of the SPA and MSA decoding approaches have been investigated to reduce the
interconnect complexity in VLSI implementation [15][16]. Recently, layered LDPC decoding schemes
[7][8][9][10] have attracted much attention in both academia and industry because they can effectively speed
up the convergence of LDPC decoding and thus reduce the required maximum number of decoding
iterations. Presently, two kinds of layered decoding approaches, i.e., row-layered decoding [7][8][9] and
column-layered decoding [9][10], have been proposed. In row-layered decoding, the parity-check matrix of
the LDPC code is partitioned into multiple row layers, and message updating is performed row layer by row
layer. Column-layered decoding employs a similar idea except that the parity-check matrix is
partitioned into multiple column layers, and message computation is performed column layer by column
layer.
The implementation of LDPC decoders with row-layered decoding has been widely studied
[11][12][13][14]. The SPA-based column-layered decoding approach [10] was proposed in 2005. In the
same year, Radosavljevic et al. [9] proposed a simplification of the SPA-based column-layered decoding
approach. It has been reported that column-layered decoding has a convergence speed similar to that of
row-layered decoding. However, the column-layered decoding algorithm has attracted much less attention due to
its inherently high computation complexity. In this work, we investigate and explore the benefits of the
column-layered decoding approach. We first incorporate the Min-Sum algorithm into the column-layered
message passing scheme because it has much lower complexity than the SPA. In addition, by deeply
investigating the message updating process in the Min-Sum based column-layered decoding, we develop a
simplified column-layered decoding scheme, which maximally eliminates the redundant computations in the
original scheme and significantly reduces the computation complexity further with judicious algorithmic
approximation. It is shown that up to 90% of the computations in check node processing can be saved for
high-rate LDPC codes.
For high-speed VLSI implementation, the proposed design has significant advantages over the
conventional row-layered decoding. First, column-layered decoding inherently has a shorter critical path.
For a row-layered decoder, the messages associated with multiple sub-blocks in a row layer can be processed
in one cycle to increase throughput. However, doing so increases the complexity of the check node unit (CNU) and
requires serial concatenation of multiple comparison and selection stages in VLSI implementation. It has
been reported that significant hardware overhead is required to optimize the corresponding circuitry for high
clock speed [12]. In column-layered decoding, the major implementation complexity is associated with the
variable node unit (VNU), particularly when the messages corresponding to multiple sub-blocks in a column
layer are processed in parallel. Because only addition operations are performed in a VNU, it is
convenient to employ arithmetic optimization to minimize the critical path. Second, in the proposed design,
the overall pipeline latency in decoding a code block is equal to the number of pipeline stages, whereas in
row-layered decoding the same amount of pipeline latency is introduced for the message updating of every layer.
Moreover, the intrinsic message loading latency is minimized in the column-layered decoding because the
decoding can start as soon as the intrinsic messages corresponding to one block column are available. In
summary, the proposed design is well suited for very high decoding throughput and low power LDPC
decoder implementation.
2 SIMPLIFIED COLUMN-LAYERED DECODING WITH MIN-SUM ALGORITHM
The column-layered decoding scheme (also known as the shuffled iterative decoding) for LDPC
codes was proposed in [10]. The decoding scheme is based on the SPA. Similar to row-layered
decoding [7][8][9], the maximum number of decoding iterations can be significantly reduced for the same
decoding performance. Since the Min-Sum algorithm has much lower implementation complexity than the
SPA, it is widely utilized in hardware implementation. In this paper, the Min-Sum algorithm is
incorporated into the column-layered decoding for the first time. Then, algorithmic transformations and
intelligent approximations are explored to significantly reduce the computation complexity and memory
requirement.
2.1 Min-Sum Algorithm Based Column-layered Decoding
Let C be a binary (N, K) LDPC code specified by a parity-check matrix H with M rows and N
columns. Each row of the parity-check matrix is associated with a check node, and each column is
associated with a variable node. Let $N(c) = \{v : H_{cv} = 1\}$ denote the set of variable nodes that participate in
check node c, and $M(v) = \{c : H_{cv} = 1\}$ denote the set of check nodes associated with variable node v. Let $I_v$
denote the intrinsic message for variable node v, $R_{cv}$ the check-to-variable message conveyed
from check node c to variable node v, and $L_{cv}$ the variable-to-check message conveyed from
variable node v to check node c. Assume that the N bits of a codeword are divided into G groups of the
same size, $N_0, N_1, \ldots, N_{G-1}$. Accordingly, the parity-check matrix is divided into G block columns. The
proposed column-layered decoding based on the Min-Sum algorithm is described with the pseudo code
below.
Min-Sum-based column-layered decoding algorithm
Initialization:
$L_{cv} = I_v$ for v = 0, 1, …, N-1, c = 0, 1, …, M-1;
Iterative decoding:
For iter = 1, 2, …, maximum iteration number
{
For g = 0, 1, …, G-1
{
Horizontal Step: For each check node c that is connected to a variable node $v \in N_g$, compute
$R_{cv} = \prod_{n \in N(c) \setminus v} \mathrm{sgn}(L_{cn}) \times \min_{n \in N(c) \setminus v} |L_{cn}|$. (1)
Vertical Step: For each variable node $v \in N_g$, update $L_{cv}$ and $L_v$ as follows:
$L_{cv} = I_v + \alpha \times \sum_{m \in M(v) \setminus c} R_{mv}$, (2)
$L_v = I_v + \alpha \times \sum_{m \in M(v)} R_{mv}$. (3)
}
Hard decision and termination:
Make a hard decision by using the sign of $L_v$; terminate the decoding if a valid codeword is found.
}
The optimum value of the scaling factor α is around 0.8. For the convenience of VLSI
implementation, it is set as 0.75 in this work.
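The schedule defined by (1)-(3) can be illustrated with a small software model. The following is only a minimal sketch under stated assumptions, not the decoder described in this paper: the function name `column_layered_min_sum`, the list-of-lists representation of H, the convention that a negative LLR maps to bit 1, and the requirement that every check node has at least two neighbors are all choices made for illustration.

```python
def sgn(x):
    # Sign convention assumed here: zero is treated as positive.
    return -1.0 if x < 0 else 1.0


def column_layered_min_sum(H, intrinsic, groups, alpha=0.75, max_iter=10):
    """Floating-point model of Min-Sum column-layered decoding.

    H         : parity-check matrix as a list of rows of 0/1
    intrinsic : list of intrinsic messages I_v (negative means bit 1, assumed)
    groups    : list of lists of variable indices, one list per column layer N_g
    """
    M, N = len(H), len(H[0])
    N_c = [[v for v in range(N) if H[c][v]] for c in range(M)]  # N(c)
    M_v = [[c for c in range(M) if H[c][v]] for v in range(N)]  # M(v)
    # Initialization: L_cv = I_v on every edge of the graph.
    L = {(c, v): intrinsic[v] for c in range(M) for v in N_c[c]}
    R = {edge: 0.0 for edge in L}
    L_post = list(intrinsic)
    hard = [1 if x < 0 else 0 for x in L_post]
    for _ in range(max_iter):
        for group in groups:
            # Horizontal step, equation (1), for edges touching this layer.
            for v in group:
                for c in M_v[v]:
                    others = [L[(c, n)] for n in N_c[c] if n != v]
                    prod = 1.0
                    for x in others:
                        prod *= sgn(x)
                    R[(c, v)] = prod * min(abs(x) for x in others)
            # Vertical step, equations (2) and (3).
            for v in group:
                total = sum(R[(c, v)] for c in M_v[v])
                L_post[v] = intrinsic[v] + alpha * total
                for c in M_v[v]:
                    L[(c, v)] = intrinsic[v] + alpha * (total - R[(c, v)])
        # Hard decision and syndrome check after each full iteration.
        hard = [1 if x < 0 else 0 for x in L_post]
        if all(sum(hard[v] for v in N_c[c]) % 2 == 0 for c in range(M)):
            return hard, True
    return hard, False
```

Note that the horizontal step recomputes the minimum over all other neighbors of a check node for every layer, which is the redundancy that the rest of this section sets out to remove.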
Assume all variable nodes are divided into 4 groups (i.e., G = 4). The computation flow of the column-
layered decoding for one iteration is illustrated in Figure 1(a). The shaded sub-matrices indicate coverage
of computation in decoding each layer, where the computation in (1) must be carried out for all block
columns except for the current updating block column. Hence, the computation complexity of the original
column-layered decoding scheme per iteration is many times more than that of the conventional TPMP
scheme whether the MSA or SPA is used.
[Figure 1: panels (a) and (b), each showing the coverage of the horizontal step and the vertical step for column layers g = 0, 1, 2, 3.]
Figure 1. The computation flow in an iteration of column-layered decoding,
(a) original algorithm, (b) after algorithm reformulation.
2.2 Low Complexity Decoding Scheme
2.2.1 Algorithm Reformulation
With a close study of the computation flow of the column-layered decoding, it can be observed that a
significant amount of redundant computation is performed in the original column-layered decoding
algorithm. To improve the computation efficiency, the computation in (1) for consecutive column layers
can be incrementally performed. In LDPC decoding, each variable node sends a variable-to-check message
to every neighboring check node. Assume that a check node c has $d_c$ variable node neighbors, and hence
receives $d_c$ soft messages from its neighboring variable nodes. To clarify the main procedure of the
reformulated column-layered decoding, we assume that each check node has only one variable node
neighbor in each block column of the parity-check matrix. It should be noted that this constraint is not
essential for column-layered decoding in general. Let $\mathbf{m}_c^{(g)} = [m_1, m_2, \ldots, m_{d_c}]$ be the vector of the
magnitudes of the soft messages received by check node c, sorted in ascending order. The superscript g indicates
that the vector is the one obtained when the g-th block column is processed. Similarly, let $S_c^{(g)}$ be
$\prod_{n \in N(c)} \mathrm{sign}(L_{cn})$ when the decoding for layer g is completed. To reduce the computation complexity of
Min-Sum based column-layered decoding, $\mathbf{m}_c^{(g)}$ can be computed from $\mathbf{m}_c^{(g-1)}$ in three steps.
1. For each variable node v in $N_g$, remove the old $|L_{cv}|$ from $\mathbf{m}_c^{(g-1)}$ to obtain a temporary sorted
vector $\tilde{\mathbf{m}}_c^{(g)}$. In addition, remove the old $\mathrm{sign}(L_{cv})$ from $S_c^{(g-1)}$ to obtain a temporary sign product
$\tilde{S}_c^{(g)}$. Since the smallest value, $m_1$, in $\tilde{\mathbf{m}}_c^{(g)}$ is $\min_{n \in N(c) \setminus v} |L_{cn}|$ and $\tilde{S}_c^{(g)}$ is $\prod_{n \in N(c) \setminus v} \mathrm{sign}(L_{cn})$, the
value of $R_{cv}$ can be computed as $\tilde{S}_c^{(g)} \times m_1$. Send the $R_{cv}$ message to variable node v.
2. Perform variable-to-check message computations for all variable nodes belonging to $N_g$. The new
$L_{cv}$ messages are sent back to the corresponding check nodes.
3. For each check node, insert the updated $|L_{cv}|$ into $\tilde{\mathbf{m}}_c^{(g)}$ in sorted order to obtain $\mathbf{m}_c^{(g)}$.
The reformulated column-layered decoding procedure is summarized as follows:
The Reformulated Column-layered Decoding
Initialization:
Let $L_{cv} = I_v$ for all variable nodes. For each check node, sort the magnitudes of the $L_{cv}$ messages from
its neighboring variable nodes. Compute the sign product for each check node, $S_c = \prod_{n \in N(c)} \mathrm{sign}(L_{cn})$.
Iterative decoding:
For iter = 1, 2, …, maximum iteration number
{
For g = 0, 1, …, G-1
{
Horizontal Step-A: for each check node c that connects to a variable node $v \in N_g$,
compute $\tilde{\mathbf{m}}_c^{(g)}$ by removing the old $|L_{cv}^{(g)}|$ from $\mathbf{m}_c^{(g-1)}$, (4)
and $R_{cv}^{(g)} = \tilde{S}_c^{(g)} \times m_1$, where $\tilde{S}_c^{(g)} = S_c^{(g-1)} \times \mathrm{sign}(\text{old } L_{cv}^{(g)})$. (5)
Vertical Step: For each variable node $v \in N_g$, compute $L_{cv}^{(g)}$ and $L_v^{(g)}$ using (2) and (3).
Horizontal Step-B: for each check node c that connects to a variable node $v \in N_g$,
compute $\mathbf{m}_c^{(g)}$ using $\tilde{\mathbf{m}}_c^{(g)}$ and $L_{cv}^{(g)}$, (6)
and $S_c^{(g)} = \tilde{S}_c^{(g)} \times \mathrm{sign}(L_{cv}^{(g)})$. (7)
}
Hard decision and termination:
Make a hard decision by using the sign of $L_v$; terminate decoding if a valid codeword is found or the
maximum decoding iteration is reached.
}
In the above algorithm, $\tilde{\mathbf{m}}_c^{(0)}$ and $\tilde{S}_c^{(0)}$ in a decoding iteration are computed from $\mathbf{m}_c^{(G-1)}$ and $S_c^{(G-1)}$,
respectively, which are obtained in the previous iteration. The new computation flow for one full iteration
is depicted in Figure 1(b).
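The incremental check node update of (4)-(7) can be sketched for a single check node as follows. This is an illustrative software model only: the class name `CheckNodeState`, its method names, and the use of a Python list kept sorted with `bisect` are assumptions made here, and the caller is assumed to supply the same old $L_{cv}$ value that was previously inserted.

```python
import bisect


class CheckNodeState:
    """Full sorted magnitude vector m_c and sign product S_c of one check node."""

    def __init__(self, messages):
        # messages: dict {variable index: L_cv} after initialization
        self.mags = sorted((abs(x), v) for v, x in messages.items())
        self.sign = 1
        for x in messages.values():
            self.sign *= -1 if x < 0 else 1
        self._tmp_sign = self.sign

    def step_a(self, v, old_L):
        """Equations (4) and (5): remove the old |L_cv| and return R_cv."""
        self.mags.remove((abs(old_L), v))
        self._tmp_sign = self.sign * (-1 if old_L < 0 else 1)
        return self._tmp_sign * self.mags[0][0]

    def step_b(self, v, new_L):
        """Equations (6) and (7): insert the new |L_cv| and update the sign."""
        bisect.insort(self.mags, (abs(new_L), v))
        self.sign = self._tmp_sign * (-1 if new_L < 0 else 1)
```

Each layer then costs one removal and one sorted insertion per check node instead of a fresh minimum search over all other neighbors.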
2.2.2 Simplification of Min-Sum Based Column-layered Decoding
The algorithm reformulation presented above removes a large amount of redundant computation in
the original column-layered decoding scheme, and thus significantly reduces the overall computation
complexity. However, because every vector $\mathbf{m}_c^{(g)}$ contains $d_c$ values, the reformulated algorithm still
requires a considerable amount of memory to store check-to-variable messages. For row c, only the two
smallest values in the sorted vector $\mathbf{m}_c^{(g)}$ are directly involved in the message updating Step-A from layer
g-1 to layer g. For example, if the smallest value in $\mathbf{m}_c^{(g-1)}$ is from layer g in the previous iteration, it is
removed from the vector $\mathbf{m}_c^{(g-1)}$ in horizontal Step-A. Then the second smallest value is used as the
magnitude of the check-to-variable message. In horizontal Step-B, $\mathbf{m}_c^{(g)}$ is computed with the new value, $|L_{cv}|$,
from the variable node, and the pre-sorted vector $\tilde{\mathbf{m}}_c^{(g)}$ from horizontal Step-A. The new value, $|L_{cv}|$, could
take any index in $\mathbf{m}_c^{(g)}$. If it takes a very small index, it has a greater chance of being used in Step-A of later
computation. Otherwise, it has much less chance of being one of the two smallest values in the remaining
computation. In other words, though $\mathbf{m}_c^{(g)}$ contains $d_c$ values, most of the values at the end of the sorted
vector are unlikely to be used as reliability information for message updating. Thus, it is reasonable to
reduce the length of the vector $\mathbf{m}_c^{(g)}$ to further reduce the implementation complexity. Our simulations
show that if the lengths of the vectors $\mathbf{m}_c^{(g)}$ and $\tilde{\mathbf{m}}_c^{(g)}$ are set to 3, the decoding convergence speed and
performance have almost no degradation compared to the standard Min-Sum based column-layered
decoding. In this case, we name the scheme three-min column-layered decoding. Similarly, the lengths
of the vectors $\mathbf{m}_c^{(g)}$ and $\tilde{\mathbf{m}}_c^{(g)}$ can be set to 2 or even 1. Correspondingly, larger penalties in convergence
speed and performance are expected.
In this approximation, $\tilde{\mathbf{m}}_c^{(g)}$ contains three values in most cases. It contains only two valid values if one
value is removed from the vector $\mathbf{m}_c^{(g-1)}$ in horizontal Step-A. In the computation of horizontal Step-B, at
most three comparison operations are required to sort the new value $|L_{cv}|$ from the variable node into the
sorted values in $\tilde{\mathbf{m}}_c^{(g)}$. If $\tilde{\mathbf{m}}_c^{(g)}$ contains three valid values, the third comparison is needed to determine
whether $|L_{cv}|$ is the third smallest value or should be discarded. Table I shows the average number
of $\mathbf{m}_c^{(g)}$ updating events in one iteration for the decoding of a (4096, 3584) (4, 32) QC-LDPC code at an
SNR of 4.1 dB. In one iteration, the total number of $\mathbf{m}_c^{(g)}$ updating events for a check node is 32. A message
updating event can be categorized into one of three types. Type I: a value is removed from $\mathbf{m}_c^{(g-1)}$ and then
a new value is inserted into $\tilde{\mathbf{m}}_c^{(g)}$. Type II: no value is removed from $\mathbf{m}_c^{(g-1)}$ but $\mathbf{m}_c^{(g)}$ is updated. Type III:
no value is removed from $\mathbf{m}_c^{(g-1)}$ and $|L_{cv}|$ is discarded. It can be seen from Table I that if no value is
removed from $\mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ is much more likely to be discarded than to be the third smallest value in
$\mathbf{m}_c^{(g)}$. To further reduce the decoding complexity, the third comparison in horizontal Step-B is eliminated.
In the modification, $|L_{cv}|$ is discarded if no value is removed from $\mathbf{m}_c^{(g-1)}$ and $|L_{cv}|$ is larger than the
second value in $\tilde{\mathbf{m}}_c^{(g)}$. This results in the simplified three-min column-layered decoding. In this case, a
check node requires 3 equality comparisons to remove the old $|L_{cv}|$ from the vector $\mathbf{m}_c^{(g-1)}$ in horizontal Step-A
and requires 2 regular comparisons to update the $\mathbf{m}_c^{(g)}$ vector with the new $|L_{cv}|$ in horizontal Step-B.
The major difference between the three-min decoding and the simplified three-min decoding is that
the three-min decoding requires 3 comparisons for sorting in horizontal Step-B. The simplified three-min
requires 2 comparisons for sorting in horizontal Step-B. Our simulation shows that the additional
approximation introduced by the simplified three-min decoding only causes very small performance loss.
Table I shows that a sorted vector $\mathbf{m}_c$ is updated only about 4 times in an iteration of the three-min
decoding. In contrast, for TPMP or row-layered decoding, the sorting computation to find the smallest
and the second smallest magnitudes for a check node restarts in every iteration. For the same LDPC code,
the average number of updating activities for a sorted vector is more than 16 per iteration. This difference leads to
significant power savings for the three-min column-layered decoding in check node message updating.
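The difference between the three-min and the simplified three-min update can be made concrete with a short sketch. The function names `step_a` and `step_b`, the representation of the sorted vector as a list of (magnitude, layer index) pairs, and the treatment of ties (a new value equal to an existing one is placed after it, or discarded in the simplified case) are assumptions for illustration only.

```python
def step_a(vec, layer):
    """Horizontal Step-A on a truncated vector.

    vec is a list of at most three (magnitude, layer_index) pairs in
    ascending order of magnitude. The entry from `layer` is removed by
    equality comparison on the stored indices, if it is present.
    """
    tmp = [entry for entry in vec if entry[1] != layer]
    removed = len(tmp) < len(vec)
    return tmp, removed


def step_b(tmp, removed, new_mag, layer, simplified=False):
    """Horizontal Step-B: merge the new |L_cv| into the temporary vector."""
    if simplified and not removed:
        # Simplified three-min: the third comparison is skipped, so a value
        # that is not below the second entry is discarded outright.
        if new_mag >= tmp[1][0]:
            return tmp
    pos = 0
    while pos < len(tmp) and tmp[pos][0] <= new_mag:
        pos += 1
    merged = tmp[:pos] + [(new_mag, layer)] + tmp[pos:]
    return merged[:3]  # keep only the three smallest magnitudes
```

In this model the magnitude of $R_{cv}$ is simply `tmp[0][0]` after Step-A. The only behavioral difference between the two variants appears when nothing was removed and the new magnitude would have become the third entry.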
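TABLE I. THE AVERAGE NUMBER OF $\mathbf{m}_c^{(g)}$ UPDATING EVENTS IN ONE ITERATION
| Event | In the 3rd iteration | In the 6th iteration |
|---|---|---|
| $\tilde{\mathbf{m}}_c^{(g)} \neq \mathbf{m}_c^{(g-1)}$ | 2.804 | 2.783 |
| $\tilde{\mathbf{m}}_c^{(g)} = \mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ being the smallest value in $\mathbf{m}_c^{(g)}$ | 0.036 | 0.050 |
| $\tilde{\mathbf{m}}_c^{(g)} = \mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ being the second smallest value in $\mathbf{m}_c^{(g)}$ | 0.140 | 0.170 |
| $\tilde{\mathbf{m}}_c^{(g)} = \mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ being the third smallest value in $\mathbf{m}_c^{(g)}$ | 0.880 | 1.005 |
| $\mathbf{m}_c^{(g)} = \tilde{\mathbf{m}}_c^{(g)} = \mathbf{m}_c^{(g-1)}$, $|L_{cv}|$ being discarded | 28.140 | 27.992 |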
2.2.3 Pipelining of Column-layered Decoding
Pipelining is a common practice in VLSI implementation to increase clock speed and thus to speed up
data processing throughput. In general, pipelining can only be applied to feed-forward data paths in order to
maintain the original function of VLSI circuitry. In LDPC column-layered decoding algorithms, data
dependency exists between consecutive layers. Thus, the VLSI circuitry for check nodes, variable nodes, and
message memories forms a logic loop, and pipelining cannot be directly applied to increase the effective
clock speed of LDPC decoders. In this section, a relaxed pipelining scheme for column-layered LDPC
decoding is proposed.
In the original column-layered decoding, the change between $\mathbf{m}_c^{(g-1)}$ and $\mathbf{m}_c^{(g-1+P)}$ is very small if the
value of P is not large. Thus, $\mathbf{m}_c^{(g-1)}$ can be used as an estimate of $\mathbf{m}_c^{(g-1+P)}$. An approximation of $R_{cv}^{(g+P)}$
can be calculated from $\mathbf{m}_c^{(g-1)}$ before $\mathbf{m}_c^{(g-1+P)}$ is obtained. Then, it is immediately used for computing
$L_{cv}^{(g+P)}$. The approximation allows P clock cycles to complete the message computation for each layer.
When $|L_{cv}^{(g+P)}|$ is obtained, $\mathbf{m}_c^{(g-1+P)}$ is already available. Thus, the horizontal step for layer g+P can be
undertaken with $\mathbf{m}_c^{(g-1+P)}$ and $|L_{cv}^{(g+P)}|$. The updating method of $\mathbf{m}_c^{(g+P)}$ is the same as before. The pipelined
column-layered decoding algorithm is formulated as follows.
The Pipelined Column-layered Decoding
Initialization:
Let $L_{cv} = I_v$ for all variable nodes. For each check node, sort the magnitudes of the $L_{cv}$ messages from
its neighboring variable nodes. Compute the sign product for each check node, $S_c = \prod_{n \in N(c)} \mathrm{sign}(L_{cn})$.
Iterative decoding:
For iter = 1, 2, …, maximum iteration number
{
For g = 0, 1, …, G-1
{
Compute $R_{cv}^{(g+P)}$: For each check node c that connects to a variable node $v \in N_{g+P}$, compute $\tilde{\mathbf{m}}_c^{(g+P)}$ by
removing the old $|L_{cv}^{(g+P)}|$ from $\mathbf{m}_c^{(g-1)}$. Then, $R_{cv}^{(g+P)} = \tilde{S}_c^{(g+P)} \times m_1$, where
$\tilde{S}_c^{(g+P)} = S_c^{(g-1)} \times \mathrm{sign}(\text{old } L_{cv}^{(g+P)})$. (8)
Vertical Step: For each variable node $v \in N_{g+P}$, compute $L_{cv}^{(g+P)}$ and $L_v^{(g+P)}$ using (2) and (3).
Horizontal Step-A: for each check node c that connects to a variable node $v \in N_g$, compute $\tilde{\mathbf{m}}_c^{(g)}$ by
removing the old $|L_{cv}^{(g)}|$ from $\mathbf{m}_c^{(g-1)}$. (9)
Horizontal Step-B: for each check node c that connects to a variable node $v \in N_g$,
compute $\mathbf{m}_c^{(g)}$ using $\tilde{\mathbf{m}}_c^{(g)}$ and $L_{cv}^{(g)}$, (10)
and $S_c^{(g)} = \tilde{S}_c^{(g)} \times \mathrm{sign}(L_{cv}^{(g)})$. (11)
}
Hard decision and termination:
Make a hard decision by using the sign of $L_v$; terminate decoding if a valid codeword is found or the
maximum decoding iteration is reached.
}
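The early estimate in (8) amounts to reading the current sorted vector without modifying it. A minimal sketch is given below, assuming a truncated vector stored as (magnitude, layer index) pairs; the function name `get_rcv` and its argument layout are hypothetical and chosen only for illustration.

```python
def get_rcv(vec, sign_product, layer, old_sign):
    """Estimate R_cv for `layer` from a possibly stale sorted vector.

    vec          : list of (magnitude, layer_index) pairs, ascending order
    sign_product : S_c as currently stored for the check node (+1 or -1)
    layer        : index of the layer whose message is being estimated
    old_sign     : sign (+1 or -1) of the old L_cv of that layer
    The stored vector is left untouched; only a filtered copy is read.
    """
    remaining = [mag for mag, idx in vec if idx != layer]
    return sign_product * old_sign * remaining[0]
```

Because the function does not write back, the regular Step-A and Step-B updates for the current layer can proceed on the same stored vector while the estimate for a later layer is in flight.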
2.2.4 Impact to the Complexity of VLSI Implementation
For the standard Min-Sum based column-layered decoding, a check node requires $d_c - 2$ regular
comparators to compute the check-to-variable message for a new column layer. To execute (1) and (2), the
variable-to-check messages corresponding to the whole H matrix must be stored. For the simplified three-min
column-layered decoding algorithm, a check node requires only 2 regular comparators. The
magnitudes of variable-to-check messages can be used on-the-fly and are not stored. Only the sign bits of
variable-to-check messages need to be stored. For each check node, three magnitudes, three indices, and one
sign bit need to be saved. For an (N, N-M) (4, 32) structured LDPC code, approximately $(30 - 2)/30 = 93\%$ of the
computation can be saved. Assuming each message is quantized as 4 bits, the number of memory bits
needed by a check node is $3 \times (3 + 5) + 1 = 25$. The percentage of memory bits that can be saved for extrinsic
messages is $[(32 \times 4 - 32 - 25) \times M] / (32 \times 4 \times M) = 55.4\%$.
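The arithmetic in this paragraph can be re-traced in a few lines. The split of each stored entry into a 3-bit magnitude (a 4-bit message without its sign bit) and a 5-bit index (enough for 32 block columns) is an interpretation assumed here; the variable names are illustrative.

```python
d_c = 32       # check node degree of the (4, 32)-regular code
msg_bits = 4   # message quantization, assumed to be 1 sign bit + 3 magnitude bits
idx_bits = 5   # assumed index width, since 2**5 = 32 block columns

# Comparator count per check node: standard scheme versus simplified three-min.
comparators_standard = d_c - 2
comparators_simplified = 2
comparator_saving = (comparators_standard - comparators_simplified) / comparators_standard

# Sorted-vector storage: three magnitudes, three indices, one sign bit.
bits_per_check = 3 * ((msg_bits - 1) + idx_bits) + 1

# Extrinsic storage per row: full messages versus sign bits plus sorted vector.
baseline_bits = d_c * msg_bits
proposed_bits = d_c * 1 + bits_per_check
memory_saving = (baseline_bits - proposed_bits) / baseline_bits

print(f"comparator saving: {comparator_saving:.1%}")
print(f"bits per check node: {bits_per_check}")
print(f"memory saving: {memory_saving:.2%}")
```

The factor M cancels in the memory ratio, so the saving is independent of the number of rows.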
Because the proposed column-layered decoding approaches do not save the magnitude of variable-to-
check messages, they are inherently memory-efficient. Equipped with the proposed decoding techniques,
an efficient column-layered decoder architecture for quasi-cyclic LDPC codes is developed with
architectural and arithmetic optimizations. The details are provided in Section 4.
3 PERFORMANCE SIMULATION
To evaluate the decoding performance and convergence speed of the proposed algorithms, three
LDPC codes are simulated. Codes I and II are the rate-1/2 (2304, 1152) LDPC code and the rate-5/6 (2304, 1920)
LDPC code, respectively, adopted in the WiMax (802.16e) standard. Code III is a (4096, 3584) (4, 32)-regular
QC-LDPC code, which is constructed using the progressive edge growth (PEG) method [17][18]. This code
is also used to illustrate the VLSI architecture design in Section 4. In each simulation, at least 50 frame
errors are observed.
Figure 2 shows the frame error rates in decoding the WiMax rate-1/2 (2304, 1152) LDPC code with
four different layered decoding algorithms: 1) row-layered decoding, 2) the original column-layered
decoding, 3) the three-min column-layered decoding, and 4) the simplified three-min column-layered
decoding. The maximum number of decoding iterations is set as 10 and 20, respectively, for all
decoding approaches. It can be observed that all four decoding algorithms have almost the same
decoding performance with a maximum of 20 iterations. With a maximum of 10 decoding iterations, only a
subtle performance difference can be observed. Figure 3 shows the frame error rates in decoding the
WiMax rate-5/6 (2304, 1920) LDPC code with a maximum of 10 and 20 iterations, respectively. It shows
again that the performance differences among the various layered decoding algorithms are negligible.
[Figure 2 plot: frame error rate (log scale, 10^-4 to 10^0) versus Eb/No (dB, 1.2 to 3) for row-layered, column-layered, 3-min column-layered, and simplified 3-min column-layered decoding, with curve groups for 10 and 20 iterations.]
Figure 2. Frame error rate of various layered decoding approaches in decoding
WiMax rate-1/2 (2304, 1152) LDPC code.
[Figure 3 plot: frame error rate (log scale, 10^-5 to 10^-1) versus Eb/No (dB, 3.3 to 4.1) for row-layered, column-layered, 3-min column-layered, and simplified 3-min column-layered decoding, with curve groups for 10 and 20 iterations.]
Figure 3. Frame error rate of various layered decoding approaches in decoding
WiMax rate-5/6 (2304, 1920) LDPC code.
Figure 4 shows the frame error rate (FER) in decoding the (4096, 3584) (4, 32) QC-LDPC code. The
number of columns in each column layer is 128. The maximum number of iterations is set as 10. It can be
seen that the three-min column-layered decoding algorithm achieves almost the same decoding
performance as the standard Min-Sum based column-layered decoding algorithm. The simplified three-min
approach has about 0.02 dB performance loss. The pipelined three-min column-layered decoding method
introduces a slight performance loss compared to the non-pipelined three-min decoding algorithm. With two
stages of pipelining, the pipelined decoding scheme has less than 0.02 dB performance loss.
Figure 4. Frame error rate of various decoding approaches in decoding a (4096, 3584) (4, 32)
QC-LDPC code.
4 PROPOSED COLUMN-LAYERED DECODER ARCHITECTURES
In this section, an optimized decoder architecture for a (4096, 3584) (4, 32)-regular QC-LDPC
code using the proposed pipelined column-layered decoding scheme is presented. The architecture
efficiently enables very high decoding parallelism for layered decoding while having a very short critical
path. In one clock cycle, the messages corresponding to 4 circulant matrices are processed in parallel. Two-stage
pipelining is employed in order to improve the clock frequency. To further reduce the critical
path delay, we rearrange the additions in the variable node unit to take maximal advantage of carry-save
addition.
4.1 Column-Layered Decoder Architecture for QC-LDPC Codes
The parity-check matrix of the QC-LDPC code considered in this work consists of a $4 \times 32$ array of
cyclically shifted permutation matrices of dimension $128 \times 128$. The decoding steps discussed in
Section 2.2 are performed for one block column at a time. That is, all extrinsic messages
corresponding to a block column of the H matrix are processed in parallel in one clock cycle. The
architecture of the pipelined decoder with two pipeline stages is shown in Figure 5. Each M component in the
M-register array on the left side of Figure 5 represents a 25-bit register for storing a sorted vector
associated with a row of the H matrix. The Get Rcv component performs (8) to get check-to-variable
messages for layer g+P, where P=2. The CNUa component (the portion of the check node unit for horizontal
Step-A) performs (9) to remove the old variable-to-check message associated with layer g from the sorted
vector. The CNUb component (the portion of the check node unit for horizontal Step-B) is utilized to perform
(10) and (11). The total numbers of M-register, Get Rcv, CNUa, and CNUb components are all 512. The number of
variable node units (VNUs) in the VNU array is 128. A VNU is required to perform the vertical step, (2) and
(3), for a variable node. The barrel shifters on the left side of the VNU array align the check-to-variable
messages from row order to column order. Similarly, the barrel shifters on the right side of the VNU array align
the variable-to-check messages from column order to row order. It can be observed from Figure 5 that two
types of loops are formed in column-layered decoding. The first type of loop consists of a CNUa and a
CNUb component. The second type of loop is composed of a Get Rcv, a CNUb, a VNU, and two barrel
shifters. With the pipelined column-layered decoding approach, the critical path in the second type of loop is
drastically reduced when applying two-stage pipelining.
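The role of the barrel shifters can be illustrated with a cyclic rotation. The sketch below assumes one particular circulant convention, namely that row r of a shifted permutation matrix with offset s has its single 1 in column (r + s) mod Z; the paper does not state its convention, and the function name `barrel_shift` is chosen only for illustration.

```python
def barrel_shift(values, shift):
    """Cyclically rotate a list to the left by `shift` positions."""
    shift %= len(values)
    return values[shift:] + values[:shift]


Z = 8   # small circulant size for illustration (the code in this work uses 128)
s = 3   # cyclic shift offset of the circulant under the assumed convention

row_order = [f"R{r}" for r in range(Z)]   # one check-to-variable message per row
col_order = barrel_shift(row_order, -s)   # align to column order for the VNUs
restored = barrel_shift(col_order, s)     # align back to row order for the CNUs
```

Under this convention, a rotation by -s before the VNU array and a rotation by +s after it are inverses of each other, which matches the two sets of shifters described above.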
The Get Rcv component is needed because its outputs are the check-to-variable messages for layer
g+P. In contrast, the outputs of CNUa are intermediate values for layer g. For non-pipelined decoding,
the Get Rcv component can be eliminated because the output of CNUa contains the check-to-variable
message for layer g. We only need one stage of barrel shifter at the input of the CNUa array to minimize the
critical path. The connections among the arrays of CNUa, CNUb, and VNU components do not change during
decoding.
[Figure 5 diagram: block labels recovered from the figure are four "Barrel shifter" blocks and two "Pipeline" stages.]
Figure 5. The top-level block diagram of pipelined column-layered decoding.
4.2 Architecture of Check Node Unit
Figure 6 shows the structure of the CNU for the three-min decoding scheme. The sub-matrices in a
block row of the H matrix are processed one at a time. The CNU is composed of two concatenated stages.
The first stage, CNUa, computes $\tilde{\mathbf{m}}_c^{(g)}$, and the second stage, CNUb, generates $\mathbf{m}_c^{(g)}$. The data in Figure 6
are associated with the vectors in the horizontal decoding step as follows:
$\mathbf{m}_c^{(g-1)} = [\text{min1\_old}, \text{min2\_old}, \text{min3\_old}]$,
$\tilde{\mathbf{m}}_c^{(g)} = [\text{min1\_temp}, \text{min2\_temp}, \text{min3\_temp}]$,
$\mathbf{m}_c^{(g)} = [\text{min1\_new}, \text{min2\_new}, \text{min3\_new}]$,
$\mathbf{I}_c^{(g-1)} = [\text{idx1\_old}, \text{idx2\_old}, \text{idx3\_old}]$,
$\tilde{\mathbf{I}}_c^{(g)} = [\text{idx1\_temp}, \text{idx2\_temp}, \text{idx3\_tmp}]$,
$\mathbf{I}_c^{(g)} = [\text{idx1\_new}, \text{idx2\_new}, \text{idx3\_new}]$.
In each M-register for a sorted vector, the three smallest magnitudes and their indices are stored. In the
decoding of a column layer, if the column index is not in the vector, the magnitudes and indices in the
vector are passed directly through select-logic-A to the second stage. Otherwise, min1_temp and
min2_temp get their values from the two remaining smallest magnitudes, and the value of min3_temp becomes void.
The temporary index values are determined in the same way. It is clear that min1_temp is the magnitude of
the $R_{cv}$ message for the column layer in non-pipelined decoding. After the new $L_{cv}$ value is sent back from the
VNU, it is used to compute the new value of the sorted vector. The structure of a Get Rcv component is
shown in Figure 6(b). This component is not required for non-pipelined decoding.
For the proposed simplified three-min decoding approach, the adder for the computation of
$\mathrm{abs}(L_{cv}) - \text{min3\_temp}$ is not needed. Instead, a simple three-input OR gate can be used to disable the M-register
update in the condition shown in row 5 of Table I. The three inputs of the OR gate are the outputs of the
three comparators in CNUa. It is shown in Figure 3 that the removal of the adder results in less than 0.02 dB
performance loss.
Figure 6. (a) CNU architecture for the three-min column-layered decoding. (b) The structure of the
Get Rcv component.
4.3 Architecture of Optimized Variable Node Unit
Figure 7 shows the structure of an optimized VNU that can simultaneously process 4 check-to-variable
messages. The addition operations are rearranged to take maximal advantage of carry-save
adders. At the input of a VNU, the check-to-variable messages in signed-magnitude
format are converted to 2's complement representation. In the data conversion of each signed-magnitude
number, the sign bit is not immediately added to the bitwise-not of the lower bits. Instead, all sign bits are
added through the adder array. Then, each two-bit sign-sum is sent to an adder in the second addition stage for
final summation. Because each adder in the second and third stages has three inputs, it can be implemented
using a carry-save adder and a regular binary adder. The right-shift operations >>1 and >>2 are used to
perform the scalar multiplication by 0.75. These arithmetic optimizations aim to
significantly reduce the logic delay in a VNU.
Figure 7. The architecture of the optimized VNU.
To illustrate the data flow of the optimized VNU, let us take the computation of $L_{3v}$ as an example.
Assume that $R_{4v}$ is a negative number and the other inputs are positive numbers. The value of $L_{3v}$ before the shift for the
scaling operation is $(|R_{1v}| + |R_{2v}| + \mathrm{bit\_inverse}(|R_{4v}|) + '0' + '0' + '1')$. The $'0' + '0' + '1'$ computation is
performed by a 1-bit full adder in the adder array block associated with the $L_{3v}$ message. After the final shift
and addition stage, the output $L_{3v}$ is $0.75 \times (R_{1v} + R_{2v} + R_{4v}) + I_v$.
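The deferred sign-bit addition and the shift-based scaling can be modeled with integers. The sketch below is a behavioral illustration under stated assumptions: Python's unbounded integers stand in for fixed-width hardware words, the function name `vnu_output` and the (sign bit, magnitude) input format are hypothetical, and the truncation behavior of the two right shifts may differ from the actual circuit.

```python
def vnu_output(signed_mags, exclude, intrinsic):
    """Compute one variable-to-check message from signed-magnitude inputs.

    signed_mags : list of (sign_bit, magnitude) check-to-variable messages
    exclude     : position of the message left out of the sum, as in (2)
    intrinsic   : intrinsic message I_v as a signed integer
    """
    partial = 0
    sign_sum = 0
    for i, (s, m) in enumerate(signed_mags):
        if i == exclude:
            continue
        # Negative inputs contribute the bitwise NOT of the magnitude;
        # the "+1" of the 2's complement conversion is deferred.
        partial += ~m if s else m
        # The deferred "+1" terms are simply the sum of the sign bits.
        sign_sum += s
    total = partial + sign_sum
    # Multiplication by 0.75 realized as (x >> 1) + (x >> 2).
    scaled = (total >> 1) + (total >> 2)
    return scaled + intrinsic
```

For the example in the text, with $R_{1v} = 4$, $R_{2v} = 2$, $R_{4v} = -2$, and $I_v = 1$, the pre-scaling value is 4 + 2 + ~2 + 1 = 4, and the result is 0.75 × 4 + 1 = 4.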
5 HARDWARE REQUIREMENT AND DECODING THROUGHPUT
The pipelined column-layered decoder with two pipeline stages is modeled in Verilog RTL and
synthesized using the Fujitsu 0.13um standard library. The required hardware resources and the synthesis
results are summarized in Table II. Each intrinsic message is quantized to 4 bits. It takes 32 clock cycles to
compute the initial sorted vectors and $32 \times 10 = 320$ clock cycles for 10 decoding iterations. The
pipeline latency is 2 clock cycles. Let $f_{clk}$ denote the clock frequency of the decoder. The information
(source data) decoding throughput is then $f_{clk} \times (4096 - 512)/(32 + 320 + 2)$. Thus, the decoder
achieves a decoding throughput of 3.928 Gb/s at a clock speed of 388 MHz.
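The throughput figure can be checked with a few lines of Python. The function and parameter names are hypothetical; the numbers are the ones stated in the paragraph above.

```python
def info_throughput_bps(f_clk_hz, code_length=4096, parity_bits=512,
                        init_cycles=32, cycles_per_iteration=32,
                        iterations=10, pipeline_latency=2):
    """Information throughput in bits per second for one decoded frame."""
    total_cycles = (init_cycles
                    + cycles_per_iteration * iterations
                    + pipeline_latency)
    info_bits = code_length - parity_bits
    return f_clk_hz * info_bits / total_cycles


if __name__ == "__main__":
    gbps = info_throughput_bps(388e6) / 1e9
    print(f"{gbps:.3f} Gb/s")  # 3.928 Gb/s
```

Each frame delivers 3584 information bits in 354 clock cycles, so the throughput scales linearly with the clock frequency.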
TABLE II. THE HARDWARE RESOURCE FOR (4096, 3584) (4, 32) QC-LDPC DECODERS

| Resource | Value |
|---|---|
| VNU | 128 |
| CNU | 512 |
| Get_Rcv | 512 |
| Intrinsic message (bits) | 4 × 4096 = 16384 |
| Sorted vector (bits) | 25 × 512 = 12800 |
| Signs of variable-to-check messages (bits) | 4 × 4096 = 16384 |
| Area per VNU (µm²) | 5597.8 |
| Area per CNU (µm²) | 2100.9 |
| Area per Get_Rcv (µm²) | 152.1 |
| Clock frequency (MHz) | 388 |
| Synthesis area (mm²) | 6.755 |
| Information decoding throughput (Gb/s) | 3.928 |

Table III compares the proposed column-layered decoder with state-of-the-art row-layered
decoders. To mitigate the discrepancies introduced by different implementation technologies, the area and
clock speed of all designs are scaled to 65nm. For implementation technologies of 0.18um, 0.13um,
and 90nm, the area is scaled down by a factor of 8, 4, and 2, respectively, and the clock frequency is
scaled up by $1.25^3$, $1.25^2$, and 1.25, respectively. The maximum number of decoding iterations is set
to 10 for all decoders with layered decoding algorithms. For area figures obtained from synthesis, a scaling
factor of 1/0.7 is applied to approximate the layout area.
TABLE III. DESIGN COMPARISONS WITH RECENT HIGH SPEED LDPC DECODERS

| | This work | [11] | [14] | [12] | [13] |
|---|---|---|---|---|---|
| Code | (4096, 3584) | (9600, 7200) | 802.11n | 802.16e, 802.11n | 2048-bits |
| Rate | 7/8 | 3/4 | 1/2 ~ 5/6 | 1/2 ~ 5/6 | 1/2 ~ 7/8 |
| Algorithm | Column-layered Min-Sum | Row-layered Min-Sum | Row-layered Min-Sum | Row-layered BP | Row-layered BCJR |
| LLR message quantization | 4-bits | 6-bits | 5-bits | - | 4-bits |
| Max. iteration | 10 | 10 | 5 | 10 | 10 |
| Max. decoding parallelism | 512 | 80 | 81 | 2 × 96 | - |
| Technology | 0.13um | 65nm | 0.18um | 90nm | 0.18um |
| Clock frequency (MHz) | 388 | 500 | 208 | 450 | 125 |
| Area (mm²) | 6.755 (synthesis) | 0.504 (synthesis) | 3.39 | 3.5 | 14.3 |
| Max. info. throughput (Gb/s) | 3.928 | 1.08 | 0.78 | 1.0 | 0.64 |
| Clock frequency, scaled to 65nm (MHz) | 606 | 500 | 406 | 562 | 244 |
| Layout area, scaled to 65nm (mm²) | 2.41 | 0.72 | 0.42 | 1.75 | 1.79 |
| Normalized info. throughput, scaled to 65nm (Gb/s) | 6.13 | 1.08 | 0.76 (10 iterations) | 1.25 | 1.25 |

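The scaled rows of Table III can be reproduced from the stated scaling rules with the following Python sketch. The function and dictionary names are hypothetical. Two points are inferred rather than stated: 65nm designs are left unscaled, and a design reported at fewer than 10 iterations has its throughput reduced in proportion to the iteration count, which matches the 0.76 Gb/s entry for [14].

```python
AREA_DIVISOR = {"0.18um": 8, "0.13um": 4, "90nm": 2, "65nm": 1}
CLOCK_EXPONENT = {"0.18um": 3, "0.13um": 2, "90nm": 1, "65nm": 0}
SYNTHESIS_TO_LAYOUT = 1 / 0.7   # applied only to synthesis-area figures


def scale_to_65nm(technology, clock_mhz, area_mm2, throughput_gbps,
                  area_is_synthesis=False, iterations=10):
    """Return (clock in MHz, layout area in mm^2, throughput in Gb/s)."""
    clock_gain = 1.25 ** CLOCK_EXPONENT[technology]
    area = area_mm2 / AREA_DIVISOR[technology]
    if area_is_synthesis:
        area *= SYNTHESIS_TO_LAYOUT
    # Assumption: throughput is inversely proportional to iteration count.
    throughput = throughput_gbps * clock_gain * iterations / 10
    return clock_mhz * clock_gain, area, throughput


if __name__ == "__main__":
    print(scale_to_65nm("0.13um", 388, 6.755, 3.928, area_is_synthesis=True))
    print(scale_to_65nm("0.18um", 208, 3.39, 0.78, iterations=5))
```

The first call corresponds to the "This work" column and the second to [14]; both agree with the tabulated values to within rounding.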
Considering the normalized decoding throughput, throughput/area ratio, and decoding performance
of the various designs, it can be concluded that the proposed simplified column-layered decoding
algorithm and architecture have significant advantages in high-throughput LDPC decoder implementation.
6 CONCLUSION
In this paper, various techniques have been explored to reduce the computation complexity of
column-layered decoding. The proposed method drastically reduces the overall computation
complexity of the original scheme while largely maintaining decoding performance and convergence speed.
In addition, a relaxed pipelining scheme has been shown to break the data dependency between adjacent
column layers and thus enhance the clock speed. Combining all the proposed techniques, a low-complexity,
high-speed LDPC decoder architecture for generic QC-LDPC codes has been developed, and the
implementation result for a specific example has demonstrated that the proposed column-layered decoder
architecture is very competitive with state-of-the-art row-layered LDPC decoder designs.
REFERENCES
[1] R. G. Gallager, “Low-density parity-check codes,” IRE Trans. Inform. Theory, vol. IT-8, pp. 21-28, Jan. 1962.
[2] J. Chen, A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, and X.-Y. Hu, “Reduced-Complexity Decoding of LDPC
Codes,” IEEE Transactions on Communications, vol. 53, issue 8, pp. 1288 – 1299, Aug. 2005.
[3] Z. Wang and Z. Cui, “A memory efficient partially parallel decoder architecture for QC-LDPC codes,” in Proc.
of the 39th Asilomar Conference on Signals, Systems & Computers, pp. 729-733, 2005.
[4] C. Lin, K. Lin, H. Chan, and C. Lee, “A 3.33 Gb/s (1200, 720) low-density parity check code decoder,” in Proc.
31st Eur. Solid-State Circuits Conf., pp. 211–214, Sep. 2005.
[5] D. Oh and K. K. Parhi, “Nonuniformly quantized Min-Sum decoder architecture for low-density parity-check
codes,” in Proc. of the 18th ACM Great Lakes Symposium on VLSI, pp. 451-456, 2008.
[6] X. Shih, C. Zhan, C. Lin, and A. Wu, “An 8.29 mm2 52 mW Multi-Mode LDPC Decoder Design for Mobile
WiMAX System in 0.13 µm CMOS Process,” IEEE Journal of Solid-State Circuits, vol. 43, issue 3, pp. 672 –
683, March 2008.
[7] E. Sharon, S. Litsyn, and J. Goldberger, “An efficient message-passing schedule for LDPC decoding,” in Proc. of
the 23rd IEEE Convention of Electrical and Electronics Engineers in Israel, pp. 223-226, Sept., 2004.
[8] D. E. Hocevar, “A reduced complexity decoder architecture via layered decoding of LDPC codes,” IEEE
Workshop on Signal Processing Systems, pp. 107 - 112, 2004.
[9] P. Radosavljevic, A. Baynast, and J. R. Cavallaro, “Optimized Message Passing Schedules for LDPC
Decoding,” in Proc. of 39th Asilomar Conference on Signals, Systems and Computers, pp. 591 – 595, 2005.
[10] J. Zhang and M. P. C. Fossorier, “Shuffled iterative decoding,” IEEE Transactions on Communications, vol.
53, issue 2, pp. 209 – 213, Feb. 2005.
[11] T. Brack, M. Alles, T. Lehnigk-Emden, F. Kienle, N. Wehn, N. E. L’Insalata, F. Rossi, M. Rovini, and L.
Fanucci, “Low Complexity LDPC Code Decoders for Next Generation Standards,” in Proc. of Design,
Automation and Test in Europe (DATE '07), Apr. 2007.
[12] Y. Sun and J. R. Cavallaro, “A low-power 1-Gbps reconfigurable LDPC decoder design for multiple 4G
wireless standards,” 2008 IEEE International SOC Conference, pp. 367 – 370, 2008.
[13] M. M. Mansour and N. R. Shanbhag, “A 640-Mb/s 2048-bit programmable LDPC decoder chip,” IEEE Journal
of Solid-State Circuits, vol. 41, issue 3, pp. 684-698, 2006.
[14] C. Studer, N. Preyss, C. Roth, and A. Burg, “Configurable high-throughput decoder architecture for quasi-
cyclic LDPC codes,” 42nd Asilomar Conference on Signals, Systems and Computers, pp. 1137 – 1142, 2008.
[15] A. Darabiha, A. C. Carusone, and F. R. Kschischang, “Block-interlaced LDPC decoders with reduced
interconnect complexity,” IEEE Transactions on Circuits and Systems II, vol. 55, no. 1, pp. 74-78, Jan. 2008.
[16] T. Mohsenin, D. Truong, and B. Baas, “Multi-Split-Row Threshold decoding implementations for LDPC
codes,” 2009 IEEE International Symposium on Circuits and Systems, pp. 2449 – 2452.
[17] X. -Y. Hu, E. Eleftheriou, and D. M. Arnold, “Regular and irregular progressive edge-growth tanner graphs,”
IEEE Trans. on Info. Theory, vol. 51, issue 1, pp. 386-398, Jan. 2005.
[18] Z. Li and B. V. K. V. Kumar, “A class of good quasi-cyclic low-density parity check codes based on
progressive edge growth graph,” Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, vol. 2,
pp. 1990-1994, 2004.
