Abstract-In this paper, we propose a layered LDPC decoder architecture targeting flexibility, high-throughput, low cost, and efficient use of the hardware resources. The proposed architecture provides full design time flexibility, i.e., it can accommodate any Quasi-Cyclic (QC) LDPC code, and also allows redefining a number of parameters of the QC-LDPC code at the run time.
I. INTRODUCTION
Low Density Parity Check (LDPC) codes are a class of error correction codes known to closely approach the Shannon limit under iterative message-passing (MP) decoding algorithms. MP architectures are composed of processing units that perform the desired computation by passing messages to each other. The way such an architecture applies to LDPC decoding is closely related to the bipartite graph representation of LDPC codes [1]. The graph comprises two types of nodes, known as variable-nodes and check-nodes, corresponding respectively to coded bits and parity-check equations. Accordingly, an LDPC decoder comprises two types of processing units, namely Variable-Node Units (VNUs) and Check-Node Units (CNUs), which exchange messages according to the structure of the bipartite graph.
MP decoders may use different scheduling strategies, according to the order in which variable-node and check-node messages are updated during the iterative message-passing process. The classical convention is that, at each iteration, all check-nodes and subsequently all variable-nodes pass new messages to their neighbors. This message-passing schedule is usually referred to as flooding scheduling [2]. A different approach is to split the parity-check matrix into several horizontal layers, then process the layers sequentially, while check-nodes (rows) within the same layer are processed using a flooding schedule. Each time a layer is processed, the decoder updates the neighboring variable-nodes, so as to profit from the propagated messages, and then proceeds to the next layer. This message scheduling, known as layered scheduling [3], propagates information faster and converges in about half the number of iterations compared to the fully parallel scheduling [4], thus yielding a lower decoding latency. Layered scheduling applies advantageously to Quasi-Cyclic (QC) LDPC codes [5], which are naturally equipped with a layered structure, and are also known to significantly reduce the complexity of the interconnection network. Due to their benefits in terms of area, throughput, and flexibility, layered QC-LDPC decoders have been widely adopted, and can be considered a de facto standard solution in most applications [6]. Additional considerations may address different optimizations at the processing unit level, e.g., implementing different decoding algorithms or processing the input data in either a serial or a parallel manner [7]. Regarding the MP decoding algorithm, hardware implementations of LDPC decoders mostly rely on the Min-Sum (MS) algorithm [8], since the corresponding VNUs and CNUs can be implemented by very simple arithmetic operations (additions and comparisons).
In this work, we propose a layered MS decoder architecture targeting (i) flexibility, (ii) high throughput, and (iii) low cost and efficient use of the hardware resources. The highest flexibility can be achieved by using serial processing units: VNUs and CNUs process incoming messages in a serial manner, which makes their implementation independent of the variable-node or check-node degree. However, this comes at the cost of a reduced throughput. Thus, in this paper we focus on layered LDPC decoder architectures with fully parallel processing units. Such an architecture has some inherent limitations in terms of flexibility, mainly concerning the number of incoming messages into VNUs and CNUs, corresponding to the degrees (i.e., number of connections) of the corresponding variable and check nodes in the Tanner graph [1]. To ensure the highest possible flexibility, the proposed architecture can accommodate any QC-LDPC code, and also allows redefining a number of parameters at run time, e.g., the number of rows of the QC base matrix, as well as the positions and values of the non-negative entries within each row.
The classical solution to increase throughput and to also ensure an efficient use of hardware resources in layered architectures is to pipeline the datapath. However, the number of stages in the datapath may impose specific constraints on the base matrix of the QC-LDPC code, in order to ensure that no memory conflicts occur during the read/write operations from/to the memory storing the exchanged messages or the a posteriori logarithmic likelihood ratio (AP-LLR) values. Moreover, pipelined architectures violate the layered scheduling principle, in the sense that the processing of each layer starts before the processing of the previous layer is complete, thus reducing the convergence speed. To avoid such limitations, the proposed architecture does not use pipelining. Instead, we propose a specific design of the datapath processing units (VNUs, CNUs, and AP-LLR units) that allows an efficient reuse of the hardware resources, thus yielding a significant cost reduction. Accordingly, the main novelty of the paper consists of:
(1) A low-cost VNU/AP-LLR processing unit that efficiently merges the logical functionalities of the VNU and AP-LLR units, and operates in either VNU or AP-LLR mode.
(2) A high-speed, low-cost CNU architecture, which only computes the first minimum (min1) and the index of the first minimum (indx_min1), instead of the first two minima and indx_min1 as required by the MS decoding algorithm. To compute the second minimum (min2), the CNU is executed a second time with the input at position indx_min1 set to the maximum value (according to the bit-length of the exchanged messages). Due to a specific organization of the datapath, the second execution of the CNU does not induce any penalty in terms of throughput, as explained below.
(3) A split of the iteration processing in two perfectly symmetric stages, executed in two consecutive clock cycles, each one using the same processing resources. In the first clock cycle we perform read operations, then execute the VNU/AP-LLR unit in VNU mode, and the CNU to compute min1 and indx_min1. In the second clock cycle we execute the CNU to compute min2, the VNU/AP-LLR unit in AP-LLR mode, and perform write-back operations. The processing load is perfectly balanced between the two clock cycles, thus yielding an optimal clock frequency. In particular, the second execution of the CNU during the second clock cycle does not impose any penalty on the operating clock frequency.
The paper is organized as follows. In Section II we briefly review QC-LDPC codes and the MS decoding algorithm. Section III details the proposed low-cost, high-throughput flexible architecture for the layered MS decoder. We discuss first the baseline architecture, and then the main enhancements that we are incorporating into this architecture. Implementation results are provided in Section IV, and Section V concludes the paper.
II. LAYERED MS DECODING FOR QC-LDPC CODES
We consider a QC-LDPC code defined by a base matrix B of size R × C, with integer entries b_{i,j} ≥ −1. The parity-check matrix H is obtained by expanding the base matrix B by an expansion factor Z; thus, each entry of B is replaced by a square matrix of size Z × Z, defined as follows: −1 entries are replaced by the all-zero matrix, while b_{i,j} ≥ 0 entries are replaced by a circulant matrix, obtained by right-shifting the identity matrix by b_{i,j} positions. Hence, H has M = R × Z rows and N = C × Z columns. We also denote by M_r the set of Z consecutive rows of H corresponding to the r-th row of B. M_r is further referred to as a (decoding) layer of H. Finally, we denote by N(m) the set of columns of H having a non-zero ('1') entry in the m-th row, for any m = 1, ..., M. In the bipartite graph representation, check and variable nodes correspond respectively to rows and columns of H, and they are connected by edges according to the non-zero entries of H. The number of edges incident to each check or variable node (or equivalently, the weight of the corresponding row/column) is referred to as the node degree.
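The expansion of B into H can be illustrated with a short Python sketch. The function name `expand_base_matrix` and the toy base matrix are hypothetical, and the sketch assumes that "right-shifting the identity by b positions" places the '1' of row k in column (k + b) mod Z; the opposite shift convention would simply mirror the circulants.

```python
def expand_base_matrix(base, z):
    """Expand a base matrix (entries >= -1) into a binary parity-check matrix.

    Assumption: a shift value b maps row k of the Z x Z block to column (k + b) mod Z.
    """
    rows, cols = len(base), len(base[0])
    h = [[0] * (cols * z) for _ in range(rows * z)]
    for i in range(rows):
        for j in range(cols):
            shift = base[i][j]
            if shift < 0:
                continue  # -1 entry: the block stays all-zero
            for k in range(z):
                h[i * z + k][j * z + (k + shift) % z] = 1
    return h


# Toy example (not a code from the paper): R = 2, C = 3, Z = 3.
B = [[0, 1, -1],
     [2, -1, 0]]
H = expand_base_matrix(B, 3)
for row in H:
    print("".join(str(bit) for bit in row))
```

With this toy matrix, H has M = 6 rows and N = 9 columns, and every row has weight 2, matching the two non-negative entries in each row of B.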
Let (x_1, ..., x_N) denote a codeword that is sent over a binary-input channel, and (y_1, ..., y_N) be the received word. The following notation for MP decoders will be used throughout the paper:
• γ_n = log(Pr(x_n = 0 | y_n) / Pr(x_n = 1 | y_n)): the LLR value of x_n according to the received value y_n; it is also referred to as the a priori LLR of variable-node n;
• γ̃_n: the a posteriori (AP) LLR of variable-node n;
• α_{m,n}: message sent from variable-node n to check-node m;
• β_{m,n}: message sent from check-node m to variable-node n.
The layered MS decoding is described in Algorithm 1. To match the hardware implementation that will be discussed in the next section, we assume that input LLRs γ_n and check-to-variable node messages β_{m,n} are quantized on q bits, while AP-LLR values γ̃_n are quantized on q̃ bits, with q < q̃. Subtractions and additions used in the VNU and AP-LLR steps are implemented using q̃-bit saturated adders. Hence, variable-to-check messages α_{m,n} computed at the VNU step are quantized on q̃ bits, and they are saturated to q bits just before entering the CNU. The α_{m,n} values used at the AP-LLR step are the unsaturated q̃-bit values. It is worth noting that for a given m, the absolute values of the β_{m,n} messages computed at the CNU step are equal to either the first or the second minimum of the input messages' absolute values |α^SAT_{m,n}|. Moreover, only one β_{m,n} message has an absolute value equal to the second minimum, namely the one whose variable-node index corresponds to the first minimum. In the sequel, we shall denote by min1 and min2 the first and second minimum, and by indx_min1 the index of the first minimum. Thus, β_{m,n} messages can be stored in a compressed format [9] to reduce memory requirements, by storing only their signs, min1, min2, and indx_min1 values, as shown in Figure 2.
[Figure 2: signs, min1, min2, indx_min1]
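As a rough illustration of the layered schedule and of the min1/min2 structure of the check-node messages, the following Python sketch runs the VNU, CNU, and AP-LLR steps check by check. It is a software model under stated assumptions, not the paper's Algorithm 1: the names `saturate` and `layered_min_sum` are hypothetical, saturation is assumed symmetric, every check-node is assumed to have degree at least 2, and the early stop on a zero syndrome is an addition made here for convenience.

```python
def saturate(value, bits):
    """Clamp to a symmetric signed range (assumption: +/-(2^(bits-1) - 1))."""
    limit = (1 << (bits - 1)) - 1
    return max(-limit, min(limit, value))


def layered_min_sum(llr, checks, layers, q=4, q_tilde=6, iterations=10):
    """llr: channel LLRs; checks[m]: variable indices N(m); layers: lists of check indices."""
    ap = list(llr)  # a posteriori LLRs, initialised with the channel LLRs
    beta = {(m, n): 0 for m, nbrs in enumerate(checks) for n in nbrs}
    hard = [1 if v < 0 else 0 for v in ap]
    for _ in range(iterations):
        for layer in layers:
            for m in layer:
                nbrs = checks[m]
                # VNU step: alpha = AP-LLR minus the old check-to-variable message.
                alpha = {n: saturate(ap[n] - beta[m, n], q_tilde) for n in nbrs}
                alpha_sat = {n: saturate(alpha[n], q) for n in nbrs}
                # CNU step: signs, min1, min2 and indx_min1 of the saturated inputs.
                mags = sorted((abs(alpha_sat[n]), n) for n in nbrs)
                (min1, indx_min1), min2 = mags[0], mags[1][0]
                sign_prod = 1
                for n in nbrs:
                    if alpha_sat[n] < 0:
                        sign_prod = -sign_prod
                for n in nbrs:
                    mag = min2 if n == indx_min1 else min1
                    sign = sign_prod * (-1 if alpha_sat[n] < 0 else 1)
                    beta[m, n] = sign * mag
                    # AP-LLR step: uses the unsaturated alpha value.
                    ap[n] = saturate(alpha[n] + beta[m, n], q_tilde)
        hard = [1 if v < 0 else 0 for v in ap]
        if all(sum(hard[n] for n in nbrs) % 2 == 0 for nbrs in checks):
            break  # all parity checks satisfied
    return hard, ap


# Toy example: three degree-2 checks, one layer per check, one unreliable input.
decoded, posteriors = layered_min_sum([4, -1, 3, 4], [[0, 1], [1, 2], [2, 3]], [[0], [1], [2]])
print(decoded, posteriors)
```

In this toy run the negative LLR of the second bit is corrected within the first iteration, because each layer already sees the AP-LLR values updated by the previous one.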
III. LAYERED MS DECODER ARCHITECTURE
For the sake of simplicity, we shall first assume that all the check-nodes have the same degree, denoted in the sequel by d_cmax. No further assumptions are made regarding the base matrix B. The case of check-node irregular codes will be discussed in Section III-C. We start by discussing the baseline architecture; the proposed enhancements are then discussed in Section III-B.
A. Baseline Architecture
Figure 1 illustrates the baseline architecture of the layered MS decoder, whose main blocks are further discussed below. Each layer is processed in two clock cycles: all data are read and processed at the first rising clock edge, then written at the second rising clock edge.
Memory blocks. Two memory blocks are used, one for the γ̃_n values (γ̃ memory) and one for the β_{m,n} messages (β memory). γ̃_n values are quantized on q̃ bits, and β_{m,n} messages on q bits. The γ̃ memory is implemented by registers, in order to allow massively parallel read or write operations. The memory is organized in C blocks, denoted by AP_i (i = 0, ..., C − 1), corresponding to the columns of the base matrix, each one consisting of Z × q̃ bits. Data are read from or written to the blocks corresponding to the non-negative entries in the row of B (layer) being processed. The β memory is implemented as a Random Access Memory (RAM). Each memory word consists of Z compressed β-messages, corresponding to one row of B.
Permutations for Reading and Writing (PER_R, PER_W). The PER_R permutation is used to rearrange the data read from the γ̃ memory, according to the processed layer, so as to ensure processing by the proper VNU/CNU. The PER_W block performs the inverse operation.
Barrel Shifters for Reading and Writing (BS_R, BS_W). Barrel shifters are used to implement the cyclic (shift) permutations corresponding to the non-negative entries of the base matrix B.
Variable Node Units (VNUs). These processing units compute the α_{m,n} messages. Each VNU block in Figure 1 consists of Z q̃-bit saturated subtractors for the parallel execution of Z variable-nodes (one column of B).
Saturators (SATs). Prior to CNU processing, α_{m,n} values are saturated to q bits.
Check Node Units (CNUs). These processing units compute the β_{m,n} messages. For simplicity, Figure 1 shows one CNU block with d_cmax inputs, each one of size Z × q bits. Thus, this block actually includes Z computing units, used to process in parallel the Z check-nodes within one layer. The check-node processing consists of computing the signs of the β-messages, as well as the min1, min2, and indx_min1 values, and is implemented using the high-speed, low-cost tree-structure (TS) approach proposed in [10].
AP-LLR Units. These units compute the γ̃_n values. Each AP_LLR_i block (i = 0, ..., d_cmax − 1) in Figure 1 consists of Z q̃-bit saturated adders, for the parallel execution of Z variable-nodes (one column of B).
Controller. This block generates control signals such as count_layer, indicating which layer is being processed, and En_read and En_write, for reading and writing data. It also controls the synchronous execution of the other blocks.
B. Enhanced Architecture
In this section we discuss the main enhancements incorporated into the baseline architecture, which consist of (1) a low-cost VNU/AP-LLR processing unit that efficiently merges the logical functionalities of the VNU and AP-LLR units, (2) a low-cost CNU architecture, which is executed twice in order to complete the computation of the check-node messages, and (3) a split of the iteration processing in two perfectly symmetric stages, yielding an optimal clock frequency. The VNU/AP-LLR unit and the new CNU replace the VNU, AP-LLR, and old CNU units of the baseline architecture, as shown in Figure 3 (where VNU/AP-LLR is shortened to VN/AP). All the other blocks of the architecture remain the same.
1) VNU/AP-LLR Unit:
The main difference between the VNU and AP-LLR processing units is that subtractors are used within the former, while adders are used within the latter. We propose a new VNU/AP-LLR processing unit that merges their logical functionalities, controlled by a specific signal (sel) that selects between VNU and AP-LLR mode. The control signal is generated by the controller, such that VNU mode is selected during the first clock cycle, and AP-LLR mode during the second.
The block diagram of the VNU/AP-LLR unit is detailed in Figure 4. At the input, two multiplexers are used to select the input data according to either the VNU or AP-LLR mode. Similarly, at the output, a de-multiplexer is used to choose the value of either α_{m,n} or γ̃_n, depending on the sel signal. The block in the middle, which may act as either a subtractor or an adder, is detailed in Figure 5 (for the sake of simplicity, we illustrate this block for q̃ = 4 bits). It consists of a modified Ripple Carry Adder (RCA) whose carry-in is given by the complement of the sel signal (C_0 = NOT sel), and which is also XORed with each bit of the second input. It can be easily seen that the VNU/AP-LLR unit operates in VNU mode if sel = 0 (C_0 = 1), or in AP-LLR mode if sel = 1 (C_0 = 0).
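The behaviour of the merged adder/subtractor can be checked with a small bit-level model. The Python sketch below is illustrative only: the names `to_bits`, `from_bits`, and `vnu_ap_llr_core` are hypothetical, the operands are sign-extended by one bit so that overflow can be detected, and the result is saturated to the full two's complement range, which is an assumption about the saturation rule rather than something stated in the text.

```python
def to_bits(value, width):
    """Two's complement bits of a signed integer, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Signed integer from two's complement bits (least significant bit first)."""
    unsigned = sum(bit << i for i, bit in enumerate(bits))
    return unsigned - (1 << len(bits)) if bits[-1] else unsigned


def vnu_ap_llr_core(a, b, sel, width=4):
    """sel = 0: a - b (VNU mode); sel = 1: a + b (AP-LLR mode)."""
    c0 = 1 - sel                      # carry-in is the complement of sel
    ext = width + 1                   # one extra bit to observe overflow
    a_bits = to_bits(a, ext)
    b_bits = [bit ^ c0 for bit in to_bits(b, ext)]  # second input XORed with C0
    carry, out = c0, []
    for x, y in zip(a_bits, b_bits):  # ripple-carry chain of full adders
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    result = from_bits(out)
    low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
    return max(low, min(high, result))  # saturation (assumed range)


print(vnu_ap_llr_core(5, 3, sel=0), vnu_ap_llr_core(5, 3, sel=1))
```

With C_0 = 1 the XOR stage inverts the second operand and the carry-in adds one, which is the usual two's complement subtraction; with C_0 = 0 the same chain performs a plain addition.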
2) CNU Unit:
We focus only on the computation of min1, min2, and indx_min1, as the signs of the output messages can be computed simply by XORing the appropriate signs of the input messages. We propose a high-speed, low-cost CNU architecture inspired by the TS architecture proposed in [10], which is further simplified so as to compute only the value and the index of the first minimum. As shown in Figure 6, our CNU is executed during the first clock cycle to compute min1 and indx_min1, then it is re-executed during the second clock cycle with the input at position indx_min1 set to the maximum value, so as to compute min2. The sel control signal is used to indicate whether the CNU is in first or second minimum mode (first or second clock cycle). The compare-and-select block is used to set the input at position indx_min1 to the maximum value when the sel signal indicates that the second minimum is being computed (second clock cycle).
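The two-pass idea can be sketched in a few lines of Python. The names `first_min` and `cnu_two_pass` are hypothetical, and the sketch assumes that message magnitudes occupy q − 1 bits, so that the "maximum value" is 2^(q−1) − 1.

```python
def first_min(values):
    """Return (minimum value, index of its first occurrence)."""
    indx = 0
    for i, value in enumerate(values):
        if value < values[indx]:
            indx = i
    return values[indx], indx


def cnu_two_pass(magnitudes, q=4):
    """Compute (min1, min2, indx_min1) by running the same first-minimum search twice."""
    max_value = (1 << (q - 1)) - 1        # assumed largest representable magnitude
    # First clock cycle: plain first-minimum search.
    min1, indx_min1 = first_min(magnitudes)
    # Second clock cycle: the input at indx_min1 is forced to the maximum value.
    masked = list(magnitudes)
    masked[indx_min1] = max_value
    min2, _ = first_min(masked)
    return min1, min2, indx_min1


print(cnu_two_pass([5, 2, 7, 3]))
```

Because the masked input can never be smaller than any other input, the second search returns the second minimum without requiring dedicated min2 logic.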
The proposed CNU architecture is detailed in Figure 6 for a number of inputs (2^k + 2^r) equal to the sum of two powers of 2. The general case can be worked out by decomposing the number of inputs as a sum of powers of 2, then combining the corresponding blocks similarly to the technique used in [10]. The 2^k-FMIG (First Minimum and Index Generator) block computes the value and the index of the first minimum among the 2^k input values. The 2-FMIG block includes one comparator and one multiplexer, as shown in Figure 7. The 4-FMIG consists of three 2-FMIG blocks for finding the minimum value and one multiplexer for indicating its index, as shown in Figure 8. Similarly, the 2^(k+1)-FMIG block can be constructed from three 2^k-FMIG blocks and one multiplexer. The IG (Index Generator) block in Figure 6 is used to determine the index of the minimum value, and is further detailed in Figure 9.
[Figure 9. IG (Index Generator) architecture]
3) Iteration Processing Split:
As shown in Figure 3, in the new architecture the clock signal is fed to the CNU. This allows splitting the iteration processing in two perfectly symmetric stages, executed in two consecutive clock cycles, each one using the same processing units, but in a different mode. In the first clock cycle we perform read operations, then execute the VNU/AP-LLR unit in VNU mode, and the CNU to compute min1 and indx_min1. In the second clock cycle we execute the CNU to compute min2, the VNU/AP-LLR unit in AP-LLR mode, and perform write-back operations. The processing load is perfectly balanced between the two clock cycles, thus yielding an optimal clock frequency. In particular, the second execution of the CNU during the second clock cycle does not impose any penalty on the operating clock frequency. Executing the baseline CNU (i.e., computing min1, min2, and indx_min1) in one of the two clock cycles would lengthen the critical path, and therefore reduce the clock frequency, while splitting its execution between the two clock cycles would result in an inefficient use of the hardware resources.
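The tree-structured first-minimum search of Section III-B.2, including the decomposition of the input count into powers of 2, can be modelled in software as follows. This is only a behavioural analogy: the function names `fmig` and `first_min_general` are hypothetical, and the index is carried along with the value here, whereas the hardware derives it in a separate IG block.

```python
def fmig(values, offset=0):
    """First minimum and its index over a power-of-2 number of inputs (binary tree)."""
    if len(values) == 1:
        return values[0], offset
    half = len(values) // 2
    left = fmig(values[:half], offset)
    right = fmig(values[half:], offset + half)
    # 2-input stage: one comparison, then the winner is forwarded (ties go left).
    return left if left[0] <= right[0] else right


def first_min_general(values):
    """Split the inputs into power-of-2 blocks (e.g. 6 = 4 + 2) and combine the block winners."""
    best, start, remaining = None, 0, len(values)
    while remaining:
        size = 1 << (remaining.bit_length() - 1)  # largest power of 2 <= remaining
        candidate = fmig(values[start:start + size], start)
        if best is None or candidate[0] < best[0]:
            best = candidate
        start += size
        remaining -= size
    return best


print(first_min_general([5, 3, 8, 6, 2, 9]))
```

For six inputs the search uses one 4-input tree and one 2-input tree, then one final comparison between the two block winners.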
C. Case of Check-Node Irregular Codes
To accommodate QC-LDPC codes with variable check-node degree, a modified CNU is used (Figure 11).
[Figure 11. Modified CNU to accommodate variable check-node degree (example for d_cmin = d_cmax − 1)]
D. Design and Run Time Flexibility
Figure 12 details the flowchart of the QC-LDPC decoder generation. The VHDL inputs consist of two configuration files, for the base-matrix related parameters and the user-defined parameters. Base-matrix parameters relate to either the matrix size (number of rows and columns, expansion factor) or to the number, positions, and values of the non-negative entries (d_cmin, d_cmax, positions and values of the non-negative entries per row). While some of these parameters are fixed, meaning that they cannot be overwritten at run time, the number of rows of the base matrix as well as the positions and values of the non-negative entries per row can be overwritten at run time, while still ensuring proper operation of the decoder with the redefined base matrix. This property is particularly useful to achieve flexibility of the implemented decoder with respect to the coding rate. Note that it would also be possible to achieve flexibility with respect to the expansion factor Z, by including some extra control logic. However, such control logic has not been included in our current implementation, so we report this parameter as being fixed.
The RPL parameter shown in Figure 12 defines the number of base matrix Rows Per Layer. For the sake of simplicity, we have assumed so far that one decoding layer corresponds to one row of the base matrix B. However, in general it is also possible to define a decoding layer as RPL consecutive rows of the base matrix, as long as each column of B has at most one non-negative entry in each layer. This feature has been integrated into our design. If RPL > 1, the number of decoding layers is equal to R/RPL, with RPL × Z check nodes per layer.
Finally, the user-defined parameters specify the quantization parameters (q, q̃) and the number of decoding iterations.
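The condition on RPL stated above (at most one non-negative entry per column within each group of RPL consecutive rows) is easy to check in software before generating a decoder. The Python sketch below uses a hypothetical function name, `valid_layers`, and additionally assumes that R must be a multiple of RPL, which is implied by the layer count R/RPL but not stated explicitly.

```python
def valid_layers(base, rpl):
    """Check that groups of `rpl` consecutive rows of the base matrix form valid layers."""
    rows, cols = len(base), len(base[0])
    if rows % rpl != 0:
        return False  # assumption: R must be a multiple of RPL
    for start in range(0, rows, rpl):
        layer = base[start:start + rpl]
        for col in range(cols):
            # Count the non-negative entries of this column inside the layer.
            if sum(1 for row in layer if row[col] >= 0) > 1:
                return False
    return True


# Toy base matrix (not from the paper): rows 0-1 can share a layer, rows 2-3 cannot.
B_toy = [[0, -1],
         [-1, 1],
         [2, -1],
         [3, 0]]
print(valid_layers(B_toy, 1), valid_layers(B_toy, 2))
```

With RPL = 1 the condition always holds; with RPL = 2 the toy matrix is rejected because rows 2 and 3 both have a non-negative entry in the first column.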
IV. IMPLEMENTATION RESULTS
We have implemented the baseline and enhanced layered MS decoder architectures for a regular QC-LDPC code with variable-nodes of degree d_v = 3, and for the irregular WiMAX QC-LDPC code with rate 1/2 [11]. For both codes, the size of the base matrix is equal to R × C = 12 × 24. For the regular code, the base matrix B is shown in Figure 13. It can be divided into 3 horizontal layers, with each layer corresponding to RPL = 4 consecutive rows of B. For the WiMAX code, the RPL value is set to 1, thus the number of decoding layers is equal to 12. Configuration parameters of the two decoders are further detailed in Table I.
ASIC synthesis results targeting a 65 nm CMOS technology are shown in Table II. The top part of the table reports the maximum operating frequency, the corresponding throughput, and the area. The reported throughput is given by the formula:
where N = C × Z is the codeword length, and cyc_iter = 2 × (R/RPL) is the number of clock cycles needed to complete one iteration (2 clock cycles per layer, times the number of layers). First, we note that the enhanced architecture provides a significant increase in the maximum operating frequency compared to the baseline architecture, by a factor of 2.25 and 3 for the (3, 6)-regular and the WiMAX code, respectively. This is due to the proposed increased-speed CNU together with the proposed split of the iteration processing. Regarding the area, the enhanced architecture provides a significant reduction of 24.2% for the (3, 6)-regular code compared to the baseline architecture. However, the area reduction is only 2.27% for the WiMAX code. In order to keep the area comparison on an equal basis with respect to synthesis timing constraints, in the bottom part of Table II we report area figures when the same timing constraints are applied to both the baseline and the enhanced architecture. We consider timing constraints corresponding to the maximum operating frequency of the baseline architecture. In this case, the proposed cost-efficient VNU/AP-LLR and CNU processing units yield an area reduction of 25.26% for the (3, 6)-regular code, and of 13.64% for the WiMAX code.
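The cycle count per iteration follows directly from the definitions above, and can be combined with a clock frequency to estimate throughput. In the Python sketch below, only `cycles_per_iteration` is taken from the text; the throughput relation N × f / (cyc_iter × iterations) is an assumption (the usual form for such decoders), and the expansion factor, clock frequency, and iteration count in the example are hypothetical values, not results from Table II.

```python
def cycles_per_iteration(r, rpl):
    """cyc_iter = 2 x (R / RPL): two clock cycles per layer, times the number of layers."""
    return 2 * (r // rpl)


def throughput_bps(n, freq_hz, cyc_iter, iterations):
    """Assumed relation: codeword bits delivered per second at a fixed iteration count."""
    return n * freq_hz / (cyc_iter * iterations)


# Cycle counts for the two configurations described in the text (R = 12).
print(cycles_per_iteration(12, 4))   # regular code, RPL = 4
print(cycles_per_iteration(12, 1))   # WiMAX code, RPL = 1

# Hypothetical operating point: Z = 96, f = 100 MHz, 10 iterations.
example = throughput_bps(24 * 96, 100e6, cycles_per_iteration(12, 1), 10)
print(example / 1e6, "Mbps")
```

The comparison of the two cycle counts shows why grouping RPL = 4 rows per layer shortens an iteration from 24 to 6 clock cycles for a 12-row base matrix.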
For the WiMAX QC-LDPC code, the proposed enhanced architecture is further compared with other state-of-the-art implementations in Table III. We also report throughput and area figures scaled to 65 nm [12], as well as the Throughput to Area Ratio (TAR) and the Normalized TAR (NTAR) metrics [13], so as to keep the throughput comparison on an equal basis with respect to technology, area, and number of iterations. To scale throughput and area to 65 nm, we use the scale factors (technology size/65) and (65/technology size)^2, respectively, as suggested in [12]. From Table III, it can be seen that the proposed enhanced architecture compares favorably with state-of-the-art implementations, yielding an NTAR value of 27.9 Gbps/mm²/iteration. Finally, we mention that for the (3, 6)-regular QC-LDPC code, the proposed enhanced architecture achieves an NTAR value of 75 Gbps/mm²/iteration.
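The technology scaling described above amounts to two multiplications, sketched below in Python. The function names `scale_to_65nm` and `tar` are hypothetical, TAR is taken literally as throughput divided by area, and the numbers in the example are made up for illustration and do not come from Table III.

```python
def scale_to_65nm(throughput, area, tech_nm):
    """Scale throughput by (tech/65) and area by (65/tech)^2, as described in the text."""
    scaled_throughput = throughput * (tech_nm / 65.0)
    scaled_area = area * (65.0 / tech_nm) ** 2
    return scaled_throughput, scaled_area


def tar(throughput, area):
    """Throughput to Area Ratio, in the units of its inputs (e.g. Gbps/mm^2)."""
    return throughput / area


# Hypothetical 130 nm design: 1.0 Gbps on 4.0 mm^2.
scaled = scale_to_65nm(1.0, 4.0, 130)
print(scaled, tar(*scaled))
```

Moving from 130 nm to 65 nm doubles the throughput figure and divides the area by four in this model, so the TAR of the hypothetical design grows by a factor of eight.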
V. CONCLUSION
In this paper we proposed a low-cost and flexible architecture for high-throughput layered LDPC decoders with fully-parallel processing units. To do so, we proposed new processing unit architectures that allow a more efficient hardware usage, thus yielding a significant cost reduction. The proposed CNU further allows splitting the iteration processing in two perfectly symmetric stages, resulting in a significant increase in the maximum operating frequency. The proposed 
