An area-efficient, high-throughput, shift-based LDPC decoder architecture is proposed. The specially designed (512, 1,024) parity-check matrix is effective for partial parallel decoding by the min-sum algorithm (MSA). To increase throughput during decoding, two data frames are fed into the decoder to minimize the idle time of the check node units (CNUs) and the variable node units (VNUs). The throughput is thereby almost doubled. Unlike the conventional architecture, the message storage unit contains shift registers instead of de-multiplexers and registers, which reduces hardware cost. Routing congestion and critical path delay are also reduced, which increases energy efficiency. An implementation of the proposed decoder in a TSMC 0.18 μm CMOS process achieves a decoding throughput of 1.725 Gbps at a clock frequency of 56 MHz, a supply voltage of 1.8 V, and a core area of 5.18 mm². The normalized area is smaller and the throughput per normalized power consumption is higher than those reported for conventional architectures.
Introduction
With the advances in information transmission, error correction codes have been used to correct transmission errors and reduce the required transmission energy. The low-density parity-check (LDPC) code, first introduced by Gallager in 1962, 1 is a binary linear block code that is capable of approaching the Shannon limit.
Over the last decade, advances in VLSI technology have generated interest in the use of LDPC codes. 1–3 Several specifications have adopted LDPC codes as the error correction code to enhance transmission quality, such as DVB-S2, 10GBASE-T, WiFi (802.11n), WiMax (802.16e) and fourth-generation mobile communications (4G).
The belief propagation algorithm (BPA), 4 a slight simplification of the theoretical algorithm, provides accurate decoding capability. Blanksby et al. 5, 6 implemented a 1-Gb/s fully parallel decoder using BPA, but the circuit occupied a large chip area. Complex circuits can be simplified by using the min-sum algorithm (MSA), 7 which provides acceptable accuracy. Beyond the choice of decoding algorithm, circuit complexity can be further reduced in structured LDPC codes by using a regularly structured parity-check matrix (H matrix), which is more appropriate for hardware implementation than a randomly generated H matrix. Commonly used structured LDPC codes include quasi-cyclic LDPC (QC-LDPC) codes, 8 array LDPC codes 9 and Reed-Solomon-based LDPC (RS-LDPC) codes. 10 Their H matrices are composed of shifted sub-matrices that must be carefully selected, since the numbers and sizes of the sub-matrices as well as the minimum cycle length (i.e., the girth) affect the decoding performance and circuit complexity.
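To make the min-sum simplification mentioned above concrete, the following is a minimal sketch of a plain min-sum check-node update, assuming signed real-valued input messages and no scaling or offset correction. The function name `min_sum_check_node` is illustrative and is not taken from the paper; the decoder's actual CNU datapath and quantization are not described in this section.

```python
def min_sum_check_node(inputs):
    """Return one check-to-variable message per incoming edge.

    For each edge, the output sign is the product of the signs of all OTHER
    inputs, and the output magnitude is the smallest magnitude among all
    OTHER inputs (plain min-sum, no normalization or offset).
    """
    if len(inputs) < 2:
        raise ValueError("a check node needs at least two inputs")

    magnitudes = [abs(x) for x in inputs]

    # Only the two smallest magnitudes are needed: every edge sees min1,
    # except the edge that holds min1 itself, which sees min2.
    idx_min1 = min(range(len(inputs)), key=lambda i: magnitudes[i])
    min1 = magnitudes[idx_min1]
    min2 = min(m for i, m in enumerate(magnitudes) if i != idx_min1)

    negatives = sum(1 for x in inputs if x < 0)

    outputs = []
    for i, x in enumerate(inputs):
        magnitude = min2 if i == idx_min1 else min1
        # Count negative signs among the other inputs only.
        others_negative = negatives - (1 if x < 0 else 0)
        sign = -1 if others_negative % 2 else 1
        outputs.append(sign * magnitude)
    return outputs


# Six inputs, matching a row weight of 6.
print(min_sum_check_node([3, -1, 4, -2, 5, 6]))
```

Replacing the nonlinear functions of belief propagation with a sign product and a two-minimum search is what makes the check node cheap in hardware, at the cost of some accuracy.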
These LDPC decoders can be implemented using fully parallel, 11 partial parallel, 12 or serial 13 architectures. In the fully parallel architecture, each check or variable node requires a processor, so the throughput is very high. However, it occupies a huge chip area because of the numerous processors and the complex interconnections caused by a large number of irregular edges. In contrast, the serial architecture reduces the interconnect complexity, but its low throughput makes it impractical and limits the applications of serial LDPC decoders. The partial parallel architecture uses significantly fewer node processing units than the fully parallel architecture. This trade-off has therefore been widely adopted in many studies. 14 For example, an early version 12 employs many multiplexers (mux) and de-multiplexers (demux) and has long latency. Another, flexible architecture 15 results in greater circuit complexity and lower throughput than the standard architectures.
For practical applications of LDPC decoders, IC chips should have low cost and low power. In this paper, a specially designed H matrix is implemented using the partial parallel LDPC architecture with a special shift-register technique to achieve the best performance, including a figure of merit defined as throughput divided by normalized power. 16, 17 The organization of this paper is as follows. After the background is discussed, the design of the proposed LDPC code is described. Next, the architecture of the partial parallel LDPC decoder is proposed. Then, the experimental results and comparison are presented. The final section is the conclusion.
Background
The QC-LDPC codes are structured LDPC codes with good regularity, which makes them suitable for hardware implementation. Figure 1 illustrates the H matrix (H_{m×n}), which is composed of p × p shifted unitary (identity) matrices S_{i,j}, where the row index i and the column index j are integers from 1 to m/p and from 1 to n/p, respectively. This QC-LDPC code with block length n has column and row weights of m/p and n/p, which are the numbers of 1's in a column and in a row, respectively. Each matrix S_{i,j} in an H matrix is a p × p unitary matrix cyclically shifted to the right s_{i,j} times, where 0 ≤ s_{i,j} < p − 1.

Figure 2 shows that the conventional LDPC decoding architecture 12 can be divided into four major modules: variable node units (VNUs), check node units (CNUs), and two message storage units, which store the updated CNU and VNU data. The CNUs start to work only after the entire VNU computation is complete, and vice versa. In the QC-LDPC H matrix in Fig. 1, the shifted unitary matrices can be calculated column by column in the VNU operation and then compared row by row in the CNU operation. This partial parallel architecture reduces the number of node processors to p, rather than the n or m required by the fully parallel architecture, which also reduces routing complexity. However, additional mux and demux are required. 12 Registers with mux and demux can be replaced by memories such as register files, which are marked by the dashed lines in Fig. 2. In addition, data "re-order" blocks are needed to align the updated data in the correct positions before the data are stored back in the registers. For specific permutations of the H matrix, 12 overlapped scheduling is also used to reduce latency, but the idle time of the CNUs or VNUs is still approximately 50%.
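The block structure of the QC-LDPC H matrix described above can be sketched as follows. This is a minimal illustration in pure Python, assuming that a table of right-shift counts fully describes the matrix and that `None` marks an all-zero block; the function names and the small 2 × 3 shift table are hypothetical and are not the values of Fig. 1 or Fig. 5.

```python
def shifted_identity(p, shift):
    """p x p identity matrix with each 1 moved `shift` columns to the right (cyclically)."""
    return [[1 if col == (row + shift) % p else 0 for col in range(p)]
            for row in range(p)]


def expand_qc_ldpc(shifts, p):
    """Expand a table of shift counts into the full binary H matrix.

    shifts[i][j] is the right-shift count of block (i, j);
    None marks an all-zero p x p block.
    """
    h = []
    for block_row in shifts:
        blocks = [
            shifted_identity(p, s) if s is not None else [[0] * p for _ in range(p)]
            for s in block_row
        ]
        # Concatenate the r-th row of every block in this block row.
        for r in range(p):
            h.append([bit for block in blocks for bit in block[r]])
    return h


# Hypothetical example: m/p = 2 block rows, n/p = 3 block columns, p = 4.
H = expand_qc_ldpc([[0, 1, 2], [0, 2, 1]], 4)
for row in H:
    print("".join(str(bit) for bit in row))
```

With no zero blocks, every column of the expanded matrix has weight m/p and every row has weight n/p, which matches the weights stated in the text.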
In the design of a good H matrix, the girth, which is the length of the shortest cycle in the Tanner graph, significantly determines the decoding performance. The conventional QC-LDPC and RS-LDPC codes were further enhanced to the partition-and-shift LDPC (PS-LDPC) code, 18 which theoretically optimizes the girth. The algorithm counts 2t translations between the shifted unitary matrices. A closed loop of length 2t exists in the Tanner graph if and only if Eq. (1) holds, 18 where "mod" means the modulus operation and s_{1,1}, s_{2,2}, ..., s_{2t,2t} represent the shift counts of the shifted unitary matrices. The minimum such cycle length, 2t, is the girth.
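As a hedged illustration of this kind of cycle test, the sketch below checks only the shortest case (cycles of length 4, i.e., t = 2) using the widely used alternating-sum condition for matrices built from cyclically shifted identity blocks. It is not a reproduction of Eq. (1), whose exact notation is not shown in this section; the function name and the `None`-for-zero-block convention are assumptions made for the example.

```python
from itertools import combinations


def has_length4_cycle(shifts, p):
    """Return True if the expanded H matrix contains a cycle of length 4.

    shifts[i][j] is the cyclic shift of block (i, j); None marks a zero block.
    Four non-zero blocks at the corners of a rectangle close a length-4 cycle
    exactly when the alternating sum of their shifts is 0 modulo p.
    """
    rows = range(len(shifts))
    cols = range(len(shifts[0]))
    for i1, i2 in combinations(rows, 2):
        for j1, j2 in combinations(cols, 2):
            corners = (shifts[i1][j1], shifts[i1][j2],
                       shifts[i2][j2], shifts[i2][j1])
            if any(s is None for s in corners):
                continue  # a zero block cannot be part of a cycle
            a, b, c, d = corners
            if (a - b + c - d) % p == 0:
                return True
    return False


print(has_length4_cycle([[0, 0], [0, 0]], 5))   # all-identity blocks
print(has_length4_cycle([[0, 0], [0, 1]], 5))   # one block shifted by 1
```

Longer cycles are tested in the same spirit with longer alternating sums; a girth-8 design must rule out all cycles of length 4 and 6.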
Here, the PS-LDPC code algorithm is adopted to maximize the girth. Figure 3 shows the proposed H matrix using PS-LDPC codes, which can be partitioned into 4 × 4 sub-blocks. Each sub-block contains k shifted unitary matrices. The column and row weights of each sub-block are 1 and k, respectively. In our design, all anti-diagonal sub-blocks (H_{14}, H_{23}, H_{32} and H_{41}) are zero matrices. This avoids reading and writing the same memory data on the same clock edge. For example, if all sub-blocks were non-zero matrices, then after the CNUs process the last row (H_{41}, H_{42}, H_{43} and H_{44}), the updated data for H_{41} would have to be stored in the memory shown in Fig. 2 and read out at the same time for the next VNU operation on the first column (H_{11}, H_{21}, H_{31} and H_{41}). Register-file memories cannot be used in this situation. The solution is to use flip-flop based registers. However, the mux and demux before and after the registers lengthen the critical paths and increase routing complexity. With zero anti-diagonal sub-blocks, the critical paths can be shortened and the routing complexity reduced.
Design of LDPC Code
To further reduce the number of mux and the power consumption, the sub-blocks H_{11}, H_{22}, H_{33} and H_{44} are composed of k unitary matrices without shift. Figure 4 illustrates the proposed parity-check matrix, in which "I" indicates a p × p unitary (identity) matrix, and each sub-block has k shifted or non-shifted unitary matrices. The parameter k gives a coding rate of (k − 1)/k. The coding length is 4 × k × p, where p can be used to adjust the coding length. The approach is therefore very flexible and provides various coding rates and coding lengths. In other words, a higher coding rate is obtained by increasing k, which corresponds to more unitary matrices in a sub-block. This paper evaluates the implementation of a (1,024, 3, 6) H matrix with a girth of 8, as shown in Fig. 5, where 1,024 is the code length, and 3 and 6 are the column and row weights, respectively. The unitary matrices are 128 × 128. The integers in Fig. 5 indicate the numbers of right shifts in the unitary matrices. Figure 6 shows the decoding capability, which was tested by comparing the bit-error rates (BER) of the proposed PS-LDPC code and a randomly generated code.

Fig. 4. The proposed general form of the H matrix.
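The relations between k, p and the code parameters stated above can be checked with a short calculation. The sketch below assumes the 4 × 4 sub-block layout with one zero sub-block per block row and block column, as described in the text, and reports the design rate (the true rate can be slightly higher if H is not full rank). The function name is illustrative.

```python
from fractions import Fraction


def ps_ldpc_parameters(k, p, block_dim=4):
    """Derive code parameters for the 4 x 4 sub-block layout described in the text.

    Each sub-block holds k unitary matrices of size p x p side by side,
    and one sub-block per block row and block column is all-zero.
    """
    rows = block_dim * p          # number of parity checks
    cols = block_dim * k * p      # code length
    return {
        "rows": rows,
        "code_length": cols,
        "code_rate": Fraction(cols - rows, cols),   # equals (k - 1) / k
        "column_weight": block_dim - 1,             # three non-zero sub-blocks per column
        "row_weight": (block_dim - 1) * k,          # k ones in each non-zero sub-block
    }


# Parameters of the implemented code: k = 2, p = 128.
print(ps_ldpc_parameters(2, 128))
```

For k = 2 and p = 128 this gives a 512 × 1,024 matrix with column weight 3 and row weight 6, consistent with the (1,024, 3, 6) code evaluated in the paper.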
Shift-Based LDPC Decoding Architecture
Figure 7 shows the partial parallel decoding architecture using the min-sum algorithm with the specially designed LDPC code given in Fig. 5. Each variable node step covers 256 columns of the H matrix, so 256 VNUs work in parallel. Similarly, each check node step covers 128 rows, so 128 CNUs work in parallel. The variable node and check node operations each require four clock cycles in a decoding iteration. To make optimal use of the VNUs and CNUs with minimal idle time, two frames of received log-likelihood ratios (LLRs) are stored in the input buffers. The first frame is transferred to the VNUs, together with the updated data from the check-to-variable storage unit (CTVSU) if available, during the first four clocks. Then, the second frame is fed to the VNUs in the next four clocks. This process is called "collection." The VNU outputs are stored in the variable-to-check storage unit (VTCSU), which is composed of shift registers. In the following "check" procedure, the VTCSU data are sent to the CNUs for updating. The iteration is complete when the data are stored in the CTVSU. Note that the two data frames are processed by the VNUs and CNUs alternately to maximize throughput.
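The effect of interleaving two frames can be illustrated with a small timing model. This is an idealized sketch, not the paper's timing diagram: it assumes each VNU and CNU phase takes exactly four clocks, that a frame alternates strictly between the two phases, and that the second frame starts one phase later than the first. Storage-unit delays and input/output overhead are ignored, and the function name is hypothetical.

```python
def unit_utilization(frames, iterations, phase_clocks=4):
    """Return (VNU utilization, CNU utilization) for one or two interleaved frames."""
    if frames not in (1, 2):
        raise ValueError("this sketch only models one or two interleaved frames")

    # The second frame, if present, finishes one phase after the first.
    total = (2 * iterations + (frames - 1)) * phase_clocks
    vnu_busy = [False] * total
    cnu_busy = [False] * total

    for frame in range(frames):
        offset = frame * phase_clocks
        for it in range(iterations):
            vnu_start = offset + 2 * it * phase_clocks
            cnu_start = vnu_start + phase_clocks
            for c in range(phase_clocks):
                if vnu_busy[vnu_start + c] or cnu_busy[cnu_start + c]:
                    raise RuntimeError("two frames claimed the same unit in one clock")
                vnu_busy[vnu_start + c] = True
                cnu_busy[cnu_start + c] = True

    return sum(vnu_busy) / total, sum(cnu_busy) / total


print(unit_utilization(1, 8))   # single frame
print(unit_utilization(2, 8))   # two interleaved frames
```

In this model a single frame keeps each unit busy half of the time, while two interleaved frames keep each unit busy in 64 of 68 clocks over eight iterations; the remaining idle clocks are the start-up and drain phases of the isolated frame pair.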
The architectures of the VTCSU and CTVSU message storage units differ from those of the conventional approaches given in Fig. 2. Both register-file memories and registers with mux and demux occupy larger chip areas and consume more power than the proposed shift-register-based technique. Figure 8 shows the delay clock cycles for the VTCSU, that is, how the outputs of the VNUs are scheduled to the inputs of the CNUs. In Fig. 8(a), the sub-blocks marked by bold lines give examples of the delay cycles of three sub-blocks in the VTCSU. Sub-blocks H_{11}, H_{22}, H_{33} and H_{44} sequentially enter the CNUs four clock cycles after they leave the VNUs. The other sub-blocks, however, require different numbers of delay cycles before entering the CNUs, as indicated by the underlined numbers in Fig. 8(b), which shows the sub-blocks H_{ij}, where i and j range from 1 to 4. H_{14}, H_{23}, H_{32} and H_{41} are zero matrices. Figure 9 shows an implementation of the VTCSU using 12 blocks of shift registers, categorized as A1 to A4, B1 to B6 and C1 to C2. Figure 10 plots the corresponding timing diagram. The six VNU outputs, grouped in pairs (outputs 1 and 2, 3 and 4, and 5 and 6), enter the register blocks A4, B6, and C2, respectively; the outputs of A1 (pair 1 and 2), B1 (pair 3 and 4), and C1 (pair 5 and 6) enter the CNUs. The shift registers A4 to A1 sequentially shift H_{11}
The proposed VTCSU has several advantages. It not only eliminates the demux but also reduces the number of mux. The mux that re-order the data are inserted before the registers B1 and C1 because of the zero matrices H_{14} and H_{41}. Therefore, the minimum number of delay cycles is two and the critical paths are shortened. The routing can be distributed between different registers to reduce routing complexity. Table 1 compares the numbers of mux and demux used in the conventional register-based architecture and in the proposed shift-register-based design for the same LDPC code. The difference lies in the blocks marked by the dashed lines in Fig. 2 and the VTCSU and CTVSU in Fig. 7. The area of a 2:1 mux can be estimated as one-third of that of a 4:1 mux. The significant difference is that no demux is required in our design, so the gate count and the routing complexity are reduced.
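The reason a shift-register chain needs no demux or write addressing can be seen in a small behavioural model. The sketch below is a software analogy only, assuming one message word enters and one leaves per clock; the class name `ShiftRegisterDelay` is hypothetical, and the depth of four corresponds to the A4-to-A1 chain described in the text.

```python
from collections import deque


class ShiftRegisterDelay:
    """Fixed-depth shift register: a value written now is read `depth` clocks later."""

    def __init__(self, depth, fill=None):
        if depth < 1:
            raise ValueError("depth must be at least 1")
        # maxlen makes append() drop the oldest entry automatically.
        self._stages = deque([fill] * depth, maxlen=depth)

    def clock(self, value):
        """Advance one clock: return the oldest stored value and shift in `value`."""
        out = self._stages[0]
        self._stages.append(value)
        return out


# A four-stage chain delays each VNU output by four clocks.
chain = ShiftRegisterDelay(4)
print([chain.clock(v) for v in range(7)])
```

The position of a word in the chain encodes its delay, so no address decoding or demultiplexing is needed on the write side; a mux is only needed where words with different delays merge.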
Experimental Results and Comparison
The proposed shift-based (1,024, 3, 6) LDPC decoder, using eight iterations with eight clocks per iteration, was designed and implemented in the 0.18 μm CMOS process. Table 2 compares the simulated performance of the conventional and the proposed architectures. The gate count is reduced by 15%, and the shorter critical paths and lower routing complexity raise the maximum operating frequency by 25%. Note that the area reduction is mainly attributed to the VTCSU and CTVSU in Fig. 7. The corresponding storage units in the conventional LDPC decoder occupy approximately 47% of the chip area. If the area of the VTCSU and CTVSU is reduced by 32%, the net area improvement is about 47% × 32% ≈ 15%. Figure 11 shows the microphotograph of the chip, which occupies an area of 10.8 mm², whereas the core occupies an area of 5.18 mm². The chip works at a clock frequency
Owing to technology scaling, the chip area is proportional to the square of the channel length. Thus, the normalized area is usually defined as the chip area divided by (Technology)². In addition, a larger H matrix leads to a larger chip area. Therefore, the normalized area is further divided by the size of the H matrix, which is proportional to (codeword) × [(codeword) × (1 − code rate)], as shown in Eq. (2).
Power consumption is proportional to the square of the supply voltage, so the normalized power is given in Eq. (3). To evaluate the power efficiency, it is reasonable to compare the energy required for each bit per iteration. Therefore, the normalized energy is defined as the normalized power divided by the product of the throughput and the number of iterations, as shown in Eq. (4). In this work, all three normalized factors are very small. Specifically, for high throughput efficiency, we propose one figure of merit: throughput divided by normalized power. This parameter is higher for the proposed decoder than for the other decoders. Based on the above analyses, the proposed architecture is area-efficient and power-efficient. It can decode more data per unit area and per unit energy.
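The normalization steps described in words above can be sketched as a few small functions. This is an interpretation of the prose, not a reproduction of Eqs. (2) to (4): in particular, the normalized power below assumes only the supply-voltage scaling stated in the text, and the units (mm², μm, mW, V, Gbps) are choices made for the example. The power value in the usage lines is a placeholder, not a measured result.

```python
def normalized_area(area_mm2, technology_um, code_length, code_rate):
    """Area divided by technology squared and by the H-matrix size."""
    h_matrix_size = code_length * (code_length * (1 - code_rate))
    return area_mm2 / (technology_um ** 2 * h_matrix_size)


def normalized_power(power_mw, supply_v):
    """ASSUMPTION: power scaled only by the square of the supply voltage."""
    return power_mw / supply_v ** 2


def normalized_energy(power_mw, supply_v, throughput_gbps, iterations):
    """Normalized power divided by (throughput x number of iterations)."""
    return normalized_power(power_mw, supply_v) / (throughput_gbps * iterations)


def throughput_per_normalized_power(throughput_gbps, power_mw, supply_v):
    """Figure of merit named in the text: throughput / normalized power."""
    return throughput_gbps / normalized_power(power_mw, supply_v)


# Core area, technology, code length and rate are taken from the text.
print(normalized_area(5.18, 0.18, 1024, 0.5))

# PLACEHOLDER power value, for illustration only.
example_power_mw = 100.0
print(throughput_per_normalized_power(1.725, example_power_mw, 1.8))
```

Whether the chip area or the core area is used for the normalized area changes the result by a constant factor, so the same choice should be applied to every decoder being compared.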
Conclusion
A special PS-LDPC parity-check matrix equivalent to a randomly generated matrix is proposed. The proposed matrix is suitable for chip implementation with small area and low power consumption for high-throughput applications. Unlike the conventional architecture, no demux are required, and the number of mux and the critical paths are also reduced. In an implementation in the TSMC 0.18 μm CMOS process, the proposed architecture occupies 10.8 mm² with a core area of 5.18 mm². The measured throughputs are 1.46 Gbps at 1.62 V and higher than 1.7 Gbps at 1.8 V. The small normalized area and the high throughput per normalized power indicate that the proposed LDPC decoder is area-efficient and power-efficient and should be considered for use in future LDPC decoders.
