Abstract: Layered decoding is widely used in Low-Density Parity-Check (LDPC) decoder implementation because it achieves high decoding throughput with low computation complexity. This work, for the first time, addresses low-complexity column-layered decoding schemes and VLSI architectures for multi-Gb/s applications. First, the Min-Sum algorithm is incorporated into column-layered decoding. Then algorithmic transformations and judicious approximations are explored to minimize the overall computation complexity. Compared to the original column-layered decoding, the new approach reduces the computation complexity of check node processing for high-rate LDPC codes by up to 90% while maintaining the fast convergence of layered decoding. Furthermore, a relaxed pipelining scheme is presented to enable a very high clock speed in VLSI implementation. With these techniques, an efficient decoder architecture for quasi-cyclic LDPC codes is developed and implemented in 0.13 µm CMOS technology. A decoding throughput of nearly 4 Gb/s at a maximum of 10 iterations is achieved for a (4096, 3584) LDPC code. This work therefore facilitates practical applications of column-layered decoding and makes it particularly attractive for high-speed, high-rate LDPC decoder implementation.
column-layered decoding approach. We first incorporate the Min-Sum algorithm into the column-layered message passing scheme because it has much lower complexity than the SPA. In addition, by deeply investigating the message updating process in the Min-Sum based column-layered decoding, we develop a simplified column-layered decoding scheme, which maximally eliminates the redundant computations in the original scheme and significantly reduces the computation complexity further with judicious algorithmic approximation. It is shown that up to 90% of computations in check node processing can be saved for high rate LDPC codes.
For high-speed VLSI implementation, the proposed design has significant advantages over the conventional row-layered decoding. First, the column-layered decoding inherently has shorter critical path.
For a row-layered decoder, the messages associated with multiple sub-blocks in a row layer can be processed in one cycle to increase throughput. However, doing so increases the complexity of the check node unit (CNU) and requires serial concatenation of multiple comparison and selection stages in VLSI implementation. It has been reported that significant hardware overhead is required to optimize the corresponding circuitry for high clock speed [12]. In column-layered decoding, the major implementation complexity is associated with the variable node unit (VNU), particularly when the messages corresponding to multiple sub-blocks in a column layer are processed in parallel. Because only addition operations are performed in a VNU, it is straightforward to employ arithmetic optimization to minimize the critical path. Second, in the proposed design, the overall pipeline latency in decoding a code block is equal to the number of pipeline stages, whereas in row-layered decoding the same amount of pipeline latency is introduced for the message updating of every layer.
Moreover, the intrinsic message loading latency is minimized in the column-layered decoding because the decoding can start as soon as the intrinsic messages corresponding to one block column are available. In summary, the proposed design is well suited for very high decoding throughput and low power LDPC decoder implementation.
SIMPLIFIED COLUMN-LAYERED DECODING WITH MIN-SUM ALGORITHM
The column-layered decoding scheme (also known as shuffled iterative decoding) for LDPC codes was proposed in [10]. That scheme is based on the SPA. Similar to row-layered decoding [7], [8], [9], it significantly reduces the maximum number of decoding iterations needed for the same decoding performance. Since the Min-Sum algorithm has much lower implementation complexity than the SPA, it is widely used in hardware implementation. In this paper, the Min-Sum algorithm is incorporated into column-layered decoding for the first time. Then, algorithmic transformations and judicious approximations are explored to significantly reduce the computation complexity and memory requirement.
Min-Sum Algorithm Based Column-layered Decoding
Let C be a binary (N, K) LDPC code specified by a parity-check matrix H with M rows and N columns. Each row of the parity-check matrix is associated with a check node, and each column is associated with a variable node. Let the variable nodes be divided into G groups. Accordingly, the parity-check matrix is divided into G block columns. The proposed column-layered decoding based on the Min-Sum algorithm is described with the pseudo code below.
Min-Sum-based column-layered decoding algorithm
Initialization:
Iterative decoding:
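Horizontal Step: For each check node c that is connected to a variable node v in the current column layer, compute the check-to-variable message $R_{cv}$ using (1).
Vertical Step: For each variable node v in the current column layer, update $L_{cv}$ and $L_v$ using (2) and (3).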
}

Hard decision and termination:
Make a hard decision using the sign of $L_v$; terminate the decoding if a valid codeword is found.
}
The optimum value of the scaling factor α is around 0.80. For convenience of VLSI implementation, it is set to 0.75 in this work.
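The message schedule described above can be illustrated with a small behavioral model. The following is a minimal sketch, not the authors' implementation: it assumes the standard scaled Min-Sum check node update and the usual variable node update, applies α at the check node output, and splits the columns into contiguous groups. The function name and arguments are hypothetical, and every check node is assumed to have at least two neighbors.

```python
def column_layered_min_sum(H, llr, num_groups, alpha=0.75, max_iter=10):
    """Behavioral sketch of Min-Sum column-layered (shuffled) decoding.

    H is a list of M rows of 0/1 entries; llr holds the N intrinsic
    messages (positive means bit 0). Contiguous column grouping and the
    placement of alpha are assumptions made for this illustration.
    """
    M, N = len(H), len(H[0])
    rows = [[v for v in range(N) if H[c][v]] for c in range(M)]  # neighbors of c
    cols = [[c for c in range(M) if H[c][v]] for v in range(N)]  # neighbors of v
    L_cv = {(c, v): float(llr[v]) for c in range(M) for v in rows[c]}
    R_cv = {edge: 0.0 for edge in L_cv}
    L_v = [float(x) for x in llr]
    size = -(-N // num_groups)  # ceiling division: columns per layer
    layers = [range(s, min(s + size, N)) for s in range(0, N, size)]
    hard = [1 if x < 0 else 0 for x in L_v]

    for _ in range(max_iter):
        for layer in layers:
            # Horizontal step: scaled min-sum over the other neighbors of c,
            # using the latest L_cv values from already-updated layers.
            for v in layer:
                for c in cols[v]:
                    others = [L_cv[c, u] for u in rows[c] if u != v]
                    sign = -1.0 if sum(m < 0 for m in others) % 2 else 1.0
                    R_cv[c, v] = alpha * sign * min(abs(m) for m in others)
            # Vertical step: refresh L_v and the outgoing L_cv of this layer.
            for v in layer:
                total = llr[v] + sum(R_cv[c, v] for c in cols[v])
                L_v[v] = total
                for c in cols[v]:
                    L_cv[c, v] = total - R_cv[c, v]
        # Hard decision and early termination on a zero syndrome.
        hard = [1 if x < 0 else 0 for x in L_v]
        if all(sum(hard[v] for v in rows[c]) % 2 == 0 for c in range(M)):
            break
    return hard
```

Note that the horizontal step recomputes the minimum over all other neighbors for every edge in every layer, which is the redundancy the rest of this section sets out to remove.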
Assume all variable nodes are divided into 4 groups (i.e., G = 4). The computation flow of the column-layered decoding for one iteration is illustrated in Figure 1(a). The shaded sub-matrices indicate the coverage of computation in decoding each layer, where the computation in (1) must be carried out for all block columns except for the block column currently being updated. Hence, the computation complexity of the original column-layered decoding scheme per iteration is many times that of the conventional TPMP scheme, whether the MSA or the SPA is used.
Low Complexity Decoding Scheme
Algorithm Reformulation
A close study of the computation flow of the column-layered decoding shows that a significant amount of redundant computation is performed in the original column-layered decoding algorithm. To improve the computation efficiency, the computation in (1) is reformulated. The reformulated column-layered decoding procedure is summarized as follows:
The Reformulated Column-layered Decoding
Initialization:
for all variable nodes. For each check node, sort the magnitudes of the $L_{cv}$ messages from its neighboring variable nodes. Compute the sign product for each check node.
Iterative decoding:
For iter = 1, 2, …, maximum iteration number
Step-A: for each check node c that connects to variable node
and
Vertical Step: update $L_{cv}$ and $L_v$ using (2) and (3).
Horizontal Step-B: for each check node c that connects to variable node
Simplification of Min-Sum Based Column-layered Decoding
The algorithm reformulation presented above removes a large amount of redundant computation in the original column-layered decoding scheme, and thus significantly reduces the overall computation complexity. However, because every vector
The major difference between the three-min decoding and the simplified three-min decoding is that the three-min decoding requires 3 comparisons for sorting in horizontal Step-B, whereas the simplified three-min decoding requires 2. Our simulation shows that the additional approximation introduced by the simplified three-min decoding causes only a very small performance loss. Table I shows that a sorted vector $m_c$ is updated only about 4 times in an iteration of the three-min decoding. In contrast, for TPMP or row-layered decoding, the sorting computation that finds the smallest and second smallest magnitudes for a check node restarts in every iteration. For the same LDPC code, the average number of updating activities for a sorted vector is more than 16 per iteration. This difference leads to significant power savings in check node message updating for the three-min column-layered decoding.
Pipelining of Column-layered Decoding
Pipelining is a common practice in VLSI implementation to increase clock speed and thus to speed up data processing throughput. In general, pipelining can only be applied to feed-forward data paths in order to maintain the original function of VLSI circuitry. In LDPC column-layered decoding algorithms, data dependency exists between consecutive layers. Thus, the VLSI circuitry for check node, variable node, and message memories forms a logic loop and pipelining can not be directly applied to increase the effective clock speed of LDPC decoders. In this section, a relaxed pipelining scheme of column-layered LDPC decoding is proposed.
In the original column-layered decoding, the change between by removing the old
Vertical
Horizontal Step-A: for each check node c that connects to variable node
Horizontal Step-B: for each check node c that connects to variable node
}
Hard decision and termination:
Make a hard decision using the sign of $L_v$; terminate decoding if a valid codeword is found or the maximum decoding iteration is reached. }
Impact to the Complexity of VLSI Implementation
For the standard Min-Sum based column-layered decoding, a check node requires 
PROPOSED COLUMN-LAYERED DECODER ARCHITECTURES
In this section, an optimized QC-LDPC decoder architecture for a (4096, 3584) (4, 32)-regular QC-LDPC code using the proposed pipelined column-layered decoding scheme is presented. The architecture efficiently enables very high decoding parallelism for layered decoding while having a very short critical path. In one clock cycle, the messages corresponding to 4 circulant matrices are processed in parallel. Two-stage pipelining is employed to improve the clock frequency. To further reduce the critical path delay, we rearrange the additions in the variable node unit to take maximum advantage of carry-save addition. The connections among the arrays of CNUa, CNUb, and VNU do not change during decoding.
Figure 5. The top-level block diagram of pipelined column-layered decoding.
Figure 6 shows the structure of the CNU for the three-min decoding scheme. The sub-matrices in a block row of the H matrix are processed one at a time. The CNU is composed of two concatenated stages.
Architecture of Check Node Unit
The Fig.   10 are associated with the vectors in the horizontal decoding step as follows:
In each M-register for a sorted vector, three smallest magnitudes and their indices are stored. In the decoding of a column layer, if the column index is not in the vector, the magnitudes and indices in the vector are directly passed through the select-logic-A to the second stage. Otherwise, min1_temp and min2_temp get values from the two remaining smallest magnitudes. The value of min3_temp becomes void.
The temporary index values are determined in the same way. It is clear that min1_temp is the magnitude of the $R_{cv}$ message for the column layer in non-pipelined decoding. After the new $L_{cv}$ value is sent back from the VNU, it is used to compute the new value for the sorted vector. The structure of a Get Rcv component is shown in Figure 6. This component is not required for non-pipelined decoding.
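The M-register and select-logic-A behavior described above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the circuit itself: the function names, the use of `None` for a void slot, and the use of a software sort in place of the comparator stages are all hypothetical choices made for illustration.

```python
VOID = None  # hypothetical marker for an empty slot in a sorted vector


def select_logic_a(mags, idxs, col):
    """Remove the entry of column layer `col` from a three-entry sorted vector.

    If `col` is not stored, the vector passes through unchanged. Otherwise
    the two remaining entries move up and the third slot becomes void.
    """
    if col not in idxs:
        return list(mags), list(idxs)
    kept = [(m, i) for m, i in zip(mags, idxs) if i != col]
    kept.append((VOID, VOID))
    return [m for m, _ in kept], [i for _, i in kept]


def insert_new(mags, idxs, new_mag, col):
    """Merge the new message magnitude for `col` into the temporary vector.

    Only the three smallest magnitudes (and their indices) are retained.
    A sort stands in for the hardware comparators in this model.
    """
    entries = [(m, i) for m, i in zip(mags, idxs) if m is not VOID]
    entries.append((new_mag, col))
    entries.sort(key=lambda e: e[0])
    kept = entries[:3]
    kept += [(VOID, VOID)] * (3 - len(kept))
    return [m for m, _ in kept], [i for _, i in kept]
```

For example, with stored magnitudes 2, 5, 7 at column indices 4, 9, 1, decoding column 9 leaves 2 and 7 in the temporary vector, so min1_temp is 2. A new magnitude of 3 for column 9 then yields the updated vector 2, 3, 7.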
For the proposed simplified three-min decoding approach, the adder for the computation of 
Architecture of Optimized Variable Node Unit
Figure 7 shows the structure of an optimized VNU that can simultaneously process 4 check-to-variable messages. The addition operations are rearranged to take maximum advantage of carry-save addition. At the input of the VNU, the check-to-variable messages in signed-magnitude format are converted to 2's complement representation. In the data conversion for each signed-magnitude number, the sign bit is not immediately added to the bitwise-not of the lower bits. Instead, all sign bits are added through an adder array. Then, each two-bit sign-sum is sent to an adder in the second addition stage for final summation. Because each adder in the second and third stages has three inputs, it can be implemented using a carry-save adder and a regular binary adder. The right shift operations >>1 and >>2 are used to perform the scalar multiplication by 0.75. These arithmetic optimizations aim to significantly reduce the logic delay in a VNU. To illustrate the data flow of the optimized VNU, let us take the computation of v L 3 as an example.
Assume that v R 4 is a negative number and other inputs are positive numbers. The value before the shift of 
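The two arithmetic ideas used in the VNU, deferring the sign-bit addition during signed-magnitude conversion and scaling by 0.75 with shifts, can be checked numerically. The sketch below is only an illustration with hypothetical function names. It uses unbounded Python integers instead of fixed-width words and does not model the carry-save adder tree or the exact point at which the decoder applies the scaling.

```python
def sum_signed_magnitude(pairs):
    """Sum (sign, magnitude) pairs with the +1 of each negation deferred.

    For a negative input, -mag equals (~mag) + 1 in 2's complement. Here
    only the bitwise-not is applied per input, and the sign bits are
    accumulated separately and added once at the end.
    """
    partial = 0
    sign_sum = 0
    for sign, mag in pairs:
        partial += ~mag if sign else mag  # bitwise-not only, no +1 yet
        sign_sum += sign                  # sign bits summed on their own
    return partial + sign_sum


def scale_three_quarters(x):
    """Multiply by 0.75 as x/2 + x/4 using arithmetic right shifts.

    Each shift truncates toward negative infinity, so the result is exact
    only when x is a multiple of 4.
    """
    return (x >> 1) + (x >> 2)
```

With inputs +5, -3, +2, -7 the deferred form gives -3, the same as direct addition, and `scale_three_quarters(8)` gives 6.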
HARDWARE REQUIREMENT AND DECODING THROUGHPUT
The pipelined column-layered decoder with two pipeline stages is modeled using Verilog RTL and synthesized using Fujitsu 0.13um standard library. The required hardware resource and the synthesis result are summarized in 10 for all decoders with layered decoding algorithms. For area of synthesis result, a scaling factor of 1/0.7 is applied to approximate the layout area. Considering the normalized decoding throughput, throughput/area ratio, and decoding performance among various designs, it can be concluded that the proposed simplified column-layered decoding algorithm and architecture have significant advantages in high throughput LDPC decoder implementation.
CONCLUSION
In this paper, various techniques have been explored to reduce the computation complexity of the column-layered decoding. As a result, the proposed method can drastically reduce the overall computation complexity of the original scheme while largely maintaining decoding performance and convergence speed.
In addition, a relaxed pipelining scheme has been shown to break the data dependency between adjacent column layers and thus increase the clock speed. Combining all the proposed techniques, a low-complexity, high-speed LDPC decoder architecture for generic QC-LDPC codes has been developed. The implementation result for a specific example demonstrates that the proposed column-layered decoder architecture is very competitive with state-of-the-art row-layered LDPC decoder designs.
