Abstract-In this paper, an LDPC decoder chip based on CP-PEG code construction is presented. The (2048, 1920) irregular LDPC code generated by the CP-PEG algorithm performs better than other PEG-based codes; however, the large check node degrees introduced by the high code rate of 15/16 become the implementation bottleneck. To design such a high code-rate LDPC decoder, our approach features variable-node-centric sequential scheduling to reduce the iteration number, a single pipelined decoder architecture to reduce the message storage memory size, and an optimized check node unit to further reduce the register count. Overall, 73% of the message storage memory is saved compared with the traditional architecture. Fabricated in 90nm 1P9M CMOS technology, a test decoder chip achieves a maximum throughput of 11.5 Gbps under a 1.4V supply voltage with a core area of 2.7 x 1.4 mm². The energy efficiency is 0.033 nJ/bit at 5.77 Gbps and 0.8V, which meets IEEE 802.15.3c requirements.
I. INTRODUCTION
Low-density parity-check (LDPC) codes are a well-known class of error control codes with near-Shannon-limit performance [1], and each code can be described by its parity-check matrix H. The rows and columns of H are mapped to the check nodes and variable nodes of a bipartite graph, on which the belief-propagation (BP) algorithm exchanges messages between nodes iteratively to decode LDPC codes [2]. The order in which messages are exchanged between nodes is called scheduling, and it influences the convergence speed of the decoding algorithm. In the standard BP algorithm, the simultaneous update of all check node messages or all variable node messages is called flooding scheduling. Alternatively, the layered BP algorithm [3], which performs message updates along different check node groups, is a method of check-node-centric sequential scheduling (CSS). Research has shown that CSS can reduce the maximum iteration number to approximately half that of standard BP with similar performance.
Recently, LDPC codes adopted in high-throughput systems have used high code rates to increase channel efficiency. However, the resulting large check node degree $d_c$ makes implementation difficult. For example, the largest check node degree of the (2048, 1723) LDPC code adopted in IEEE 802.3an [4] equals 32, leading to routing congestion and low chip density. Even though CSS can reduce the iteration number, the throughput is still degraded by the long critical path of the check node unit (CNU). The situation becomes worse for the (1440, 1344) LDPC code of IEEE 802.15.3c [5] with $d_c$ = 45.
This work was supported by the NSC, Taiwan, R.O.C., under grant NSC 97-2220-E-009-013. The authors would like to thank UMC, MediaTek, CIC, and Mr. Yi-Kai Lin for their assistance.
978-1-4244-4353-6/09/$25.00 ©2009 IEEE
In this paper, the proposed decoder aims at providing a high-throughput and hardware-efficient solution for high code-rate LDPC codes with large check node degrees. In order to reduce the iteration number, the decoding scheduling is based on variable-node-centric sequential scheduling (VSS; also known as shuffled decoding [6]), where the messages are updated along different variable node groups. Since the inputs of the CNU operation are also divided into several subgroups, the complexity and critical path delay of the CNU are reduced. Furthermore, a single pipelined approach and a modified CNU are proposed to minimize the message storage memory. Using a (2048, 1920) LDPC code constructed by the circulant permutation progressive edge-growth (CP-PEG) algorithm [7] as a design example, the overall decoder chip implemented in 90nm technology shows its advantages in terms of throughput, energy efficiency, and hardware efficiency.
The rest of this paper is organized as follows. Section II introduces the code structure and the decoding algorithm with VSS. In Section III, we propose a modified scheduling algorithm and an improved decoder architecture. Performance simulation and implementation results are shown in Section IV. Finally, the conclusion is given in Section V.
II. CODE STRUCTURE AND DECODING ALGORITHM

A. CP-PEG LDPC Code Construction
The rate-15/16 (2048, 1920) irregular LDPC code used in this paper was constructed by the CP-PEG algorithm and is shown in Fig. 1(a). The constructed parity-check matrix H consists of p x p circulant permutation (CP) matrices and all-zero matrices. A CP matrix is a cyclic square matrix with constant row and column weight of one. The number assigned to each CP matrix indicates the cyclic shift amount, and -1 means an all-zero matrix. By setting p = 32, there are 4p check nodes and 64p variable nodes in the bipartite graph, where each check node has uniform degree 46, and 16p, 24p, and 24p variable nodes have degrees of 4, 3, and 2, respectively. This code was shown to perform better than other PEG-based LDPC codes [7]; nevertheless, the high check node degree requires a suitable decoder architecture to overcome implementation difficulties.
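The parameters quoted above can be cross-checked with a few lines of arithmetic: the number of edges counted from the check node side must equal the number counted from the variable node side. The sketch below is only an illustration of that check. The helper `circulant_permutation` and its shift direction (row r has its single 1 in column (r + shift) mod p) are assumptions made for this example, not the convention stated in the paper.

```python
# Parameters quoted in the text for the (2048, 1920) CP-PEG code.
p = 32
check_nodes = 4 * p
variable_nodes = 64 * p
check_degree = 46
# (number of variable nodes, degree): 16p, 24p, 24p nodes of degree 4, 3, 2.
variable_profile = [(16 * p, 4), (24 * p, 3), (24 * p, 2)]

# Every edge of the bipartite graph touches one check node and one variable node,
# so both counts must agree.
edges_from_checks = check_nodes * check_degree
edges_from_variables = sum(count * degree for count, degree in variable_profile)
code_rate = (variable_nodes - check_nodes) / variable_nodes


def circulant_permutation(size, shift):
    """Hypothetical helper: size x size CP matrix for a given shift.

    Assumed convention: row r has its single 1 in column (r + shift) % size;
    a shift of -1 denotes the all-zero matrix, as in the text.
    """
    if shift < 0:
        return [[0] * size for _ in range(size)]
    return [[1 if col == (row + shift) % size else 0 for col in range(size)]
            for row in range(size)]


cp = circulant_permutation(p, 5)
```

Both edge counts agree, and the resulting rate matches the stated 15/16.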
B. Variable-node-centric Sequential Scheduling
In VSS approach, the initialization, stopping criterion test, and output steps remain the same as the standard BP algorithm. The only difference between two algorithms lies in the updating procedure. The normalized min-sum (NMS) algorithm which compensates the approximation error in check node 
B. Modified CNU
The check node to variable node update can be divided into a magnitude part and a sign part. Fig. 3(a) illustrates the magnitude part of the CNU, which is an accumulative sorter composed of a local sorter and a global sorter. The local sorter finds the local min and second min values in each subgroup, and the global min and second min values of a check node are then found by the global sorter. Similarly, the sign operation can be computed in an accumulative way like the accumulative sorter.
For our proposed CP-PEG LDPC codes with $d_c$ = 46, the VSS approach with G = 4 could divide the 46 inputs of the sorter into only 12 inputs. A larger subgroup number G will result in

of equal number of variable nodes with the same degree to reduce the hardware cost of the variable node unit (VNU). Moreover, the submatrices with the same shift amounts (shaded blue CP matrices) are arranged in the same position; thus the routing and control could be further simplified.
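The local/global sorting idea can be sketched in software as follows. This is a behavioural illustration under stated assumptions, not the circuit of Fig. 3(a): the function names, the split of 46 inputs into subgroups of 12, 12, 12, and 10, the test magnitudes, and the encoding of a sign as a bit (1 for negative) are all hypothetical choices made for this example.

```python
def local_sort(magnitudes, offset):
    """Find (min, second_min, index_of_min) within one subgroup.

    `offset` is the position of the subgroup's first input among all inputs.
    """
    min1 = min2 = float("inf")
    idx = -1
    for k, value in enumerate(magnitudes):
        if value < min1:
            min1, min2, idx = value, min1, offset + k
        elif value < min2:
            min2 = value
    return min1, min2, idx


def global_sort(local_results):
    """Merge per-subgroup (min, second_min, index) triples into global values."""
    min1 = min2 = float("inf")
    idx = -1
    for lmin1, lmin2, lidx in local_results:
        if lmin1 < min1:
            # The old global min and the subgroup's second min compete for second place.
            min2 = min(min1, lmin2)
            min1, idx = lmin1, lidx
        else:
            min2 = min(min2, lmin1)
    return min1, min2, idx


def accumulate_sign(sign_bits_per_subgroup):
    """XOR all sign bits (1 = negative), one subgroup at a time."""
    total = 0
    for bits in sign_bits_per_subgroup:
        for bit in bits:
            total ^= bit
    return total


# 46 hypothetical input magnitudes, split into G = 4 subgroups of at most 12.
magnitudes = [(7 * k + 3) % 31 for k in range(46)]
subgroups = [(start, magnitudes[start:start + 12]) for start in range(0, 46, 12)]
local_results = [local_sort(values, start) for start, values in subgroups]
global_min, global_second_min, global_index = global_sort(local_results)
```

Each local sorter sees at most 12 values, and the global sorter only merges four triples, which is the reduction in sorter width described above.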
III. PROPOSED DECODER ARCHITECTURE
In this section, a complete decoder architecture is presented, including the datapath, scheduling, and VLSI structure of the CNU and the modified CNU.
Then the magnitude part of the check node to variable node message in (1) could be computed by the following equation:

Fig. 2(b) demonstrates the timing diagram of the proposed decoder. There are G initialization cycles required to calculate $a_g$ for $0 \le g \le G-1$. Since only one subgroup of the messages $z_{mn}^{(i)}$ is updated in the g-th cycle of one iteration, the main operation of the CNU could be simplified to calculating $a_g^{(i)}$ (local sorting) in each cycle and then performing global sorting as in equation (5).
With the proposed single pipelined architecture, only the messages $a_g^{(i)}$ and $\varepsilon_{mn}$ are stored. In the NMS algorithm, the sorted results can be represented by the min value, the second min value, and the index of the min value. Therefore, the proposed decoder only latches two values, one index, and the sign part of the messages in each subgroup, while the variable node to check node message $z_{mn}$ is calculated on the fly. The single pipelined architecture is feasible because the CNU can be updated immediately after the VNU operations in the VSS approach.
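To illustrate why the min value, the second min value, the min index, and the sign bits are sufficient, the sketch below rebuilds every check node to variable node message from those stored quantities. It assumes the textbook normalized min-sum rule with the scaling factor 0.75 quoted in the simulation settings. The function name, the sign-bit encoding (1 for negative), and the four example messages are hypothetical, and the paper's own update equations are not reproduced here.

```python
SCALE = 0.75  # NMS scaling factor quoted in the simulation settings


def check_to_variable(min1, min2, min_index, sign_xor, own_sign, n):
    """Rebuild the message to variable node n from the stored summary.

    The node holding the minimum receives the second minimum; every other
    node receives the minimum. XOR-ing out n's own sign bit removes its
    contribution from the accumulated sign.
    """
    magnitude = min2 if n == min_index else min1
    negative = sign_xor ^ own_sign
    value = SCALE * magnitude
    return -value if negative else value


# Hypothetical variable node to check node messages for one check node.
incoming = [3.0, -1.0, 4.0, -2.0]
signs = [1 if v < 0 else 0 for v in incoming]
mags = [abs(v) for v in incoming]

min_index = mags.index(min(mags))
min1 = mags[min_index]
min2 = min(m for k, m in enumerate(mags) if k != min_index)
sign_xor = 0
for s in signs:
    sign_xor ^= s

messages = [check_to_variable(min1, min2, min_index, sign_xor, signs[n], n)
            for n in range(len(incoming))]
```

Only `min1`, `min2`, `min_index`, and the sign bits are needed to produce all outgoing messages, so the individual incoming magnitudes do not have to be kept.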
A. Single Pipelined Architecture
The entire decoder depicted in Fig. 2(a) is composed of fully-parallel CNUs and partial-parallel VNUs, where VNU2, VNU3, and VNU4 handle variable node operations with degrees 2, 3, and 4, respectively. Let $a_g^{(i)}$ denote the sorted messages sent from the variable nodes in the g-th group to one specific check node at the i-th iteration, which is:
. N a -1, process
$m \in M(n)$
3) Hard Decision: Let $x_n$ be the n-th bit of the decoded codeword. If $z_n^{(i)} \ge 0$, $x_n = 0$; else if $z_n^{(i)} < 0$, $x_n = 1$.
In this work, the codeword is divided into G = 4 groups; therefore the parity-check matrix H is divided into 4 submatrices (H1 to H4). As shown in Fig. 1(b), each submatrix consists

CNU in Fig. 3(b), the memory size is reduced to

$$\mathrm{MEM}_{\mathrm{CNU}} = (N-K)\cdot\bigl(2\cdot(\mathrm{Min}+\mathrm{2ndMin}+\mathrm{Index}+\mathrm{2ndIndex})+\mathrm{Sign}\bigr) = (N-K)\cdot\bigl(4\cdot(w-1)+4\cdot\lceil\log_2 d_c\rceil+d_c\bigr) = 128\cdot\bigl(4\cdot 5+4\cdot\lceil\log_2 46\rceil+46\bigr) = 11520 \quad (7)$$
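The arithmetic in (7) can be reproduced directly. The sketch below only evaluates the expression as printed, with w = 6 and a check node degree of 46; the variable names are chosen for this example.

```python
import math

N, K = 2048, 1920   # code length and information length
w = 6               # message bit-width
d_c = 46            # check node degree

magnitude_bits = w - 1                        # magnitude without the sign bit
index_bits = math.ceil(math.log2(d_c))        # bits needed to address one of d_c inputs
# Two sets of (Min, 2ndMin, Index, 2ndIndex) plus one sign bit per input, as in (7).
bits_per_check_node = 4 * magnitude_bits + 4 * index_bits + d_c
mem_cnu_bits = (N - K) * bits_per_check_node
```

With 6 index bits, each of the 128 check nodes needs 90 bits, which gives the 11520 bits stated in (7).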
IV. PERFORMANCE AND IMPLEMENTATION RESULTS
Under an AWGN channel with BPSK modulation, the performance curves are simulated to determine the required bit-width and maximum iteration number. The simulation parameters of the proposed algorithm are 6-bit input quantization (5-bit integer and 1-bit fraction), a scaling factor of 0.75 for the NMS algorithm, and 4 or 5 iterations. In Fig. 4, the bit-error rate (BER) curves indicate that 4 iterations of the proposed algorithm are sufficient to achieve performance similar to the standard BP algorithm with 7 iterations. Furthermore, at almost the same code rate, our CP-PEG LDPC code outperforms the (255, 239) RS code by 1.2 dB at BER = $10^{-5}$, which reveals the potential of CP-PEG LDPC codes for storage applications and fiber optical communication systems. The overall SNR loss between this work and the Shannon limit is only 1.6 dB.
The proposed LDPC decoder is implemented with a standard-cell design flow and fabricated in 90-nm 1P9M CMOS technology. The core occupies an area of 3.84 mm² with 68%

Therefore the overall register reduction of the proposed architecture is 73%, leading to the following advantages: fewer registers, higher utilization of functional units, and reduced complexity. Since high-rate LDPC codes usually have more VNUs than CNUs (in our case, 512 VNUs and 128 CNUs), the elimination of registers from VNU to CNU not only reduces hardware cost but also lowers the power consumption of the clock tree. For the proposed single pipelined decoder and modified

fewer inputs of sorter, but increase the storage for min, second min, and index values of each subgroup.
In order to further reduce the storage overhead of each subgroup, we propose a reduced storage accumulative sorter as shown in Fig. 3(b). The basic idea is to merge the local min and local second min from G - 1 subgroups into one group. Some extra control circuits are needed to open or close the feedback loop in Fig. 3(b). This sorter architecture is beneficial since the complexity reduction of the storage registers and global sorters is higher than the overhead of the control circuits. Section IV will show that the performance of this modified CNU is similar to that of the original CNU.
C. Summary
In the traditional two-stage pipelined architecture, both the $z_{mn}$ and $\varepsilon_{mn}$ messages are kept in registers or memory. Assume the bit-width of the messages is w (= 6) and the variable node degree is $d_v$; then the required memory size (or number of registers) is as follows:

utilization. The die photo is shown in Fig. 5, where the distribution of CNUs and VNUs is determined automatically by the APR tool. Since the decoding of one LDPC codeword requires 4 initialization cycles plus 4 iterations, the throughput is (1920 bits / 20 cycles) x frequency. Fig. 6 shows the measured maximum throughput and power dissipation under different SNR conditions and supply voltages. The measurement results indicate that the test chip at the FF corner can achieve 11.5 Gbps throughput under a 1.4 V supply voltage. The throughput can be scaled down to 5.77 Gbps with a 0.8 V supply voltage to meet the throughput requirement of the IEEE 802.15.3c standard, at which point the energy efficiency is 0.033 nJ/bit.
Compared with the state-of-the-art designs in Table I, the proposed LDPC decoder outperforms the others in throughput, hardware efficiency, and power efficiency. Since the LDPC code specifications of these designs differ, the SNR loss between each work and its Shannon limit is also listed for reference.
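The throughput expression can be turned into a small calculator. The sketch below assumes that each iteration takes G = 4 cycles, which is consistent with the stated total of 20 cycles (4 initialization cycles plus 4 iterations). The clock frequencies and power it produces are values implied by the quoted throughput and energy figures, not measurements reported in the paper, and the function names are chosen for this example.

```python
INFO_BITS = 1920   # information bits per codeword
G = 4              # subgroups, assumed to take one cycle each
ITERATIONS = 4

# 4 initialization cycles + 4 iterations x 4 cycles = 20 cycles per codeword.
cycles_per_codeword = G + ITERATIONS * G


def throughput_bps(clock_hz):
    """Throughput = (1920 bits / 20 cycles) x clock frequency."""
    return INFO_BITS / cycles_per_codeword * clock_hz


def clock_for_throughput(bps):
    """Clock frequency implied by a given throughput."""
    return bps * cycles_per_codeword / INFO_BITS


def power_watts(energy_nj_per_bit, bps):
    """Power implied by an energy-per-bit figure at a given throughput."""
    return energy_nj_per_bit * 1e-9 * bps


clock_at_1v4 = clock_for_throughput(11.5e9)   # implied clock at 11.5 Gbps
clock_at_0v8 = clock_for_throughput(5.77e9)   # implied clock at 5.77 Gbps
power_at_0v8 = power_watts(0.033, 5.77e9)     # implied power at 0.8 V
```

Under these assumptions, the two operating points correspond to clock frequencies of roughly 120 MHz and 60 MHz, and the 0.8 V point corresponds to roughly 0.19 W.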
V. CONCLUSION
A high-throughput and power-efficient LDPC decoder is presented. Utilizing the characteristics of variable-node-centric sequential scheduling, the proposed decoding algorithm can reduce the maximum iteration number without performance loss. In addition, the single pipelined architecture and modified CNU save 73% of the message storage memory and decrease the sorter size, resulting in a low-complexity design. After implementa-
