This paper presents a simple yet effective decoding method for general quasi-cyclic low-density parity-check (QC-LDPC) codes, which not only achieves high hardware utility efficiency (HUE) but also greatly reduces the number of memory blocks without any performance degradation. The main idea is to split the check matrix into several row blocks and then to perform the improved message passing computations sequentially, block by block. Because the improved decoding algorithm breaks the sequential tie between the two-phase computations, the two phases can be overlapped, which yields high HUE. Two overlapping schemes are presented, each of which suits a different situation. In addition, an efficient memory arrangement scheme is proposed to reduce the large number of memory blocks required by the LDPC decoder. As an example, for the 0.4 rate LDPC code selected from Chinese Digital TV Terrestrial Broadcasting (DTTB), our decoding saves over 80 memory blocks compared with the conventional decoding, and the decoder achieves an HUE of 0.97. Finally, the 0.4 rate LDPC decoder is implemented on an FPGA device EP2S30 (speed grade -5). Using 8 row processing units, the decoder achieves a maximum net throughput of 28.5 Mbps at 20 iterations.
Introduction
Low-density parity-check (LDPC) codes, which were first proposed in the early 1960's [1] and re-discovered in the 1990's [2] , have attracted much attention due to their capacity approaching performance and low decoding complexity. LDPC codes have been adopted as the European Digital Video Broadcasting standard (DVB-S2), Chinese Digital TV Terrestrial Broadcasting (DTTB) standard [3] , as well as WiFi and WiMAX standards. LDPC codes have also been proposed by the Consultative Committee for Space Data Systems (CCSDS) for the deep space communications and near Earth communications [4] . It is evident that LDPC codes will be widely used in wired and wireless communication, DVB and other fields in the near future.
LDPC codes can be effectively decoded using two-phase message passing (TPMP) algorithms [5] [6] [7] [8] [9]. In these algorithms, the check-to-variable messages and the variable-to-check messages are transmitted along the edges of the Tanner graph to update each other iteratively. In the recent literature, the turbo decoding message passing (TDMP) algorithm [10] is of particular interest since it can lead to faster convergence and higher throughput [11]. Zhang, et al. [11] designed an LDPC decoder for a (2 304, 1 152) code that achieves 2.2 Gb/s throughput with 10 iterations. Cui, et al. [12] proposed a 4.7 Gb/s decoder with 15 iterations, targeted at a (3 456, 1 728) LDPC code. The authors of Ref. [13] presented a block-serial layered decoder whose decoding rate is up to 1 Gb/s. However, TDMP decoding must satisfy the constraint that there is at most one "1" in each column of every layer, which limits the application of TDMP decoding to some excellent LDPC codes. To solve this problem, the authors of Refs. [14]-[15] tried to split the layers through memory mapping and matrix scheduling, but these methods cannot eliminate all the conflicts and also cause performance degradation. In addition, the dependence between the computations of adjacent layers makes it very hard to pipeline the computation of different layers, which limits the decoding throughput.
Open access under CC BY-NC-ND license.
ZHAO Ling et al. / Chinese Journal of Aeronautics 25(2012) 747-756
As for decoder architecture, the existing work can generally be classified into three categories: the fully parallel method [16], the fully serial method [17] and the partly parallel method [6]. Compared with the fully parallel decoder and the fully serial decoder, the partly parallel decoding architecture offers a better balance between throughput and hardware complexity. However, various challenging issues still remain in partly parallel decoding. One such issue is low hardware utility efficiency (HUE). Two techniques, folding and overlapping, are used to solve this problem. The authors of Ref. [6] remapped the check node and variable node functional units onto the same hardware to improve HUE. However, only part of the hardware resource is saved, and this method does not suit the min-sum algorithm. In Refs. [7]-[8], overlapped message passing (OMP) decoding was proposed, in which HUE is improved by overlapping the check and variable node update stages. A sliced message passing (SMP) decoding was presented in Ref. [5], whose HUE is almost 1, but the decoding only suits nearly fully parallel operation, which consumes too many hardware resources. Another problem of partly parallel decoding is the large number of small memory blocks required. Three kinds of memory, i.e. intrinsic memory, extrinsic memory and decision memory, are needed. The number of extrinsic memories always equals the number of nonzero submatrices of the check matrix, whereas the numbers of the other two memories both equal the number of column blocks of submatrices. This is a heavy burden, especially for long LDPC codes.
In this paper, we present a simple yet effective decoding method for general quasi-cyclic low-density parity-check (QC-LDPC) codes. In our decoding, the check matrix is split into several blocks row-wise, and the two-phase computations of different blocks are overlapped to achieve high HUE. There is no constraint on how the check matrix is split; hence, our decoding can be used in both high and low throughput applications for any QC-LDPC code. Furthermore, an efficient memory arrangement method is proposed to reduce the number of memory blocks required. As an example, for the 0.4 rate LDPC code selected from Chinese DTTB, our decoding saves over 80 memory blocks compared with the conventional decoding, and the decoder achieves an HUE of 0.97.
The rest of the paper is organized as follows. Section 2 introduces the background of QC-LDPC codes and the decoding algorithms. In Section 3, the improved decoding and overlapping schemes are presented. The proposed memory sharing scheme is discussed in Section 4 and comparisons between the presented decoding, OMP and TDMP decoding are performed in Section 5. An FPGA implementation is shown in Section 6 and in Section 7, the conclusions are drawn.
QC-LDPC Codes and Log-BP Decoding

QC-LDPC codes
An LDPC code is described as the null space of a binary sparse parity-check matrix. Each row of the matrix represents a parity check and each column corresponds to a variable symbol. The number of "1" entries in a row (column) is the row (column) weight. An LDPC code is a regular code if it has a uniform column weight and a uniform row weight; otherwise, it is an irregular code.
QC-LDPC codes are a special class of the LDPC codes with structured check matrix, whose check matrix is illustrated in Eq. (1). In general, the check matrices of QC-LDPC codes contain cyclically shifted identity submatrices, zero submatrices and compound circulant submatrices. Each of the compound matrices consists of w superimposed cyclically shifted identity matrices. Usually, w is very small, equaling 2 or a little larger. In this paper, for simplicity, the cyclically shifted identity matrix and the compound circulant matrix are called the weight-1 and weight-w circulant matrix, respectively. 
where the check matrix H is defined to have q×p submatrices, each of size b×b. The total number of nonzero submatrices is W. The maximum and minimum row weights of H are d_max and d_min, respectively, whereas the column weight of H is d_C.
Log-BP decoding algorithm
Among the message passing decoding algorithms, the min-sum (MS) and log-BP algorithms are usually used in practice to decode LDPC codes [6]. To achieve higher robustness, the log-MAP algorithm can also be used to compute the extrinsic message [13]. In this paper, however, for the sake of simplicity, we use only log-BP to demonstrate our decoding. It should be noted that the MS algorithm and other variants can also be used in our decoding to compute the extrinsic message.
The log-BP decoding algorithm consists of two phases of message passing, i.e., variable-to-check message passing and check-to-variable message passing. Let R_mj denote the check-to-variable message and L_mj the variable-to-check message. R_mj is computed from the variable-to-check messages L_mn that arrive at check node m. On the other hand, the variable-to-check message L_mj for check node m and variable node j is computed from the incoming check-to-variable messages R_mj and the received channel information I_j,
where M(j) is the set of check nodes connected to variable node j and I_j is the intrinsic information. The soft output L_j for variable node j is later sliced to check whether the decoded output is a codeword or not. According to the decoding algorithm described above, the check-to-variable message R_mj update unit (CNU) and the variable-to-check message L_mj update unit (VNU) can be implemented as illustrated in Fig. 1 and Fig. 2. For the purpose of clarity, the parity check part is not included in Fig. 1. "LUT" in Fig. 1 represents a lookup table, which is used to calculate Ψ(x) in Eq. (2) and Eq. (4). "S→T" and "T→S" represent the transformations between sign-magnitude and two's complement representations.
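The two update rules described above can be sketched in a few lines of Python. This is only an illustrative sketch: the equations themselves are not reproduced in this text, so the common choice Ψ(x) = −ln(tanh(x/2)) and the function names `psi`, `check_to_variable` and `variable_to_check` are assumptions, not the authors' exact formulation.

```python
import math

def psi(x):
    """Assumed nonlinearity psi(x) = -ln(tanh(x/2)); requires x > 0."""
    return -math.log(math.tanh(x / 2.0))

def check_to_variable(L_row):
    """Return R_mj for every j of one check node m, given its incoming L_mn.

    The row must hold at least two nonzero messages.
    """
    R = []
    for j in range(len(L_row)):
        others = L_row[:j] + L_row[j + 1:]          # exclude the target edge
        sign = 1.0
        for v in others:
            if v < 0:
                sign = -sign                         # product of the signs
        magnitude = psi(sum(psi(abs(v)) for v in others))
        R.append(sign * magnitude)
    return R

def variable_to_check(I_j, R_col):
    """Return L_mj for every m of one variable node j.

    The full sum I_j + sum(R) is formed once, then each R_mj is subtracted.
    """
    total = I_j + sum(R_col)
    return [total - r for r in R_col]
```

Forming the full sum first and subtracting the outgoing edge afterwards is the same reorganization of additions that the sliced decoding in the next section relies on.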
Improved Decoding and Overlapping Scheme
In conventional TPMP decoding, there are two computation stages: the check-to-variable message update stage (CNS) and the variable-to-check message update stage (VNS). R_mj is updated in CNS, whereas L_mj is updated in VNS. However, from Eqs. (2)-(6), it can be seen that the update of R_mj needs all the L_mn connected to check node m, and the update of L_mj needs all the R_nj connected to variable node j. In consequence, there is a sequential tie between CNS and VNS: CNS cannot be started until VNS has finished, and vice versa. As a result, conventional TPMP decoding only achieves an HUE of 0.5.
In Refs. [7]-[8], the authors presented and improved OMP decoding. By rearranging the computation sequence of the check nodes and variable nodes, CNS and VNS can be partly overlapped. For some special LDPC codes, OMP decoding even achieves a high HUE of 0.98. However, there are two primary problems. One is that the improvement of HUE depends strongly on the code structure. For some LDPC codes, especially the weight-w codes, OMP decoding can hardly improve HUE. The other is that OMP decoding cannot keep a similarly high HUE as the degree of computation parallelism varies.
In this paper, we improve TPMP decoding by breaking the sequential tie between the two computations to achieve high HUE, and the two primary problems of OMP do not exist in our decoding. Furthermore, we present an efficient memory arrangement scheme for our decoding, which saves a large number of memory blocks.
Check matrix splitting
In our decoding, the parity check matrix of an LDPC code is partitioned into several blocks row-wise, and then the message passing computations are performed sequentially, block by block. For the sake of clarity, we call one partitioned row block a slice, and call the presented decoding STMP (sliced two-phase message passing) decoding. TDMP decoding must satisfy the constraint that there is at most one "1" in each column of every layer. In STMP, by contrast, there is no constraint on how the check matrix is split, and each column of every slice may contain more than one "1".
For example, for H illustrated in Eq. (1), we can split it into q slices, each of which contains one submatrix row block. We could also split it into q/2 slices, each of which contains two submatrix row blocks, or into 2q slices, each of which contains half of a submatrix row block.
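The three splitting options above amount to choosing how many check-matrix rows go into each slice. A minimal sketch, assuming a hypothetical helper `split_into_slices` and illustrative sizes (q = 4 row blocks of b = 6 rows, which are not taken from the paper):

```python
def split_into_slices(num_rows, slice_rows):
    """Return a list of (start, stop) row ranges, one per slice."""
    if slice_rows <= 0:
        raise ValueError("slice_rows must be positive")
    return [(start, min(start + slice_rows, num_rows))
            for start in range(0, num_rows, slice_rows)]

# Hypothetical sizes: q row blocks, each with b rows.
q, b = 4, 6
one_block_per_slice = split_into_slices(q * b, b)        # q slices
two_blocks_per_slice = split_into_slices(q * b, 2 * b)   # q/2 slices
half_block_per_slice = split_into_slices(q * b, b // 2)  # 2q slices
```

Because STMP places no constraint on the column weight inside a slice, any positive slice size is admissible here.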
Sliced message passing decoding algorithm
In addition to the notations defined above, we define R_mj^k as the R_mj message in the kth iteration, R_mj^{k,t} as the R_mj message of the tth slice in the kth iteration, S_j as the accumulated result of the R_mj messages of variable node j, and S_j^{k,t} as the accumulated result over the 1st to tth slices in the kth iteration. Besides, we define N as the maximum iteration number and L as the number of slices after the check matrix splitting.
The STMP decoding with log-BP algorithm can be summarized in the following two major steps.
1) Row processing stage (RPS), performed by the row processing unit (RPU). At the kth iteration, where k = 1, 2, …, N, for check node m, calculate R_mj^k as follows:
When k = 1, set S_j to I_j and R_mj to 0. Meanwhile, the RPU uses the sign part of S_j to compute the parity check for each check node. The decoding process terminates when the entire codeword X satisfies the parity check equations, HX^T = 0, or when the preset maximum iteration number is reached.
2) Column processing stage (CPS), implemented by the column processing unit (CPU). At the tth slice in the kth iteration, where t = 1, 2, …, L, for variable node j, S_j is updated by
S_j^{k,t} = S_j^{k,t-1} + \sum_{m} R_{mj}^{k,t}    (9)
where the sum runs over the check nodes m of slice t that are connected to variable node j.
In each iteration, when a variable node participates in the CPS for the first time, S_j^{k,t-1} should be replaced by I_j. On the other hand, in every iteration, when a variable node participates in the CPS for the last time, a hard decision can be made for that variable node using the sign part of S_j.
Steps 1) and 2) are repeated until HX^T = 0 or until a fixed number of iterations is reached.
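The two steps above can be sketched as a small software model and compared against a conventional two-phase schedule. This is a minimal sketch under stated assumptions: the row update uses the assumed form Ψ(x) = −ln(tanh(x/2)) with a clamp added only to keep the arithmetic finite, the row-stage input is taken as S_j minus the previous R_mj (as the text on the separated subtraction suggests), and all function names are hypothetical.

```python
import math

def psi(x):
    x = min(max(x, 1e-12), 30.0)   # clamp (an assumption) to keep log/tanh finite
    return -math.log(math.tanh(x / 2.0))

def row_update(L_row):
    """Log-BP check-node rule for one row: list of L_mn -> list of R_mn."""
    out = []
    for j in range(len(L_row)):
        others = L_row[:j] + L_row[j + 1:]
        sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
        out.append(sign * psi(sum(psi(abs(v)) for v in others)))
    return out

def stmp_decode(H, I, slices, iterations):
    """H: list of 0/1 rows, I: intrinsic LLRs, slices: list of row-index lists."""
    cols = [[j for j in range(len(I)) if row[j]] for row in H]
    R = [[0.0] * len(c) for c in cols]   # R_mj = 0 before the first iteration
    S = list(I)                          # S_j = I_j before the first iteration
    for _ in range(iterations):
        S_next = list(I)                 # first CPS visit starts from I_j
        for rows in slices:
            for m in rows:               # RPS: needs only last-iteration S and R
                L_row = [S[j] - R[m][i] for i, j in enumerate(cols[m])]
                R[m] = row_update(L_row)
            for m in rows:               # CPS: accumulate this slice's new R
                for i, j in enumerate(cols[m]):
                    S_next[j] += R[m][i]
        S = S_next                       # S is handed back at the iteration end
    return S

def tpmp_decode(H, I, iterations):
    """Conventional two-phase schedule, used here only as a reference."""
    cols = [[j for j in range(len(I)) if row[j]] for row in H]
    L = [[I[j] for j in c] for c in cols]
    total = list(I)
    for _ in range(iterations):
        R = [row_update(L_row) for L_row in L]
        total = list(I)
        for m, c in enumerate(cols):
            for i, j in enumerate(c):
                total[j] += R[m][i]
        L = [[total[j] - R[m][i] for i, j in enumerate(c)]
             for m, c in enumerate(cols)]
    return total
```

In this model the soft outputs of `stmp_decode` agree with those of `tpmp_decode` for any slicing, which illustrates why the reorganization causes no performance degradation. Every row of H is assumed to have weight at least 2.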
In terms of the hardware implementation, the transformation from Eqs. (5)- (9) is just a reorganization of all addition operations to several groups, while the subtraction operation in Eq. (5) is separated to Eq. (8).
Hence, there is no performance degradation.
According to the decoding algorithm, the RPU and CPU can be implemented as illustrated in Figs. 3-4, respectively. From Fig. 3, it can be seen that the RPU is complex. In order to reduce the critical path delay and improve the clock frequency, the RPU can be partitioned into several pipeline stages. We define P_1 as the number of pipeline stages. Therefore, if the slice size is set to b and only one RPU is used, it takes (P_1 + b) clock cycles to finish the RPS computation of one slice. On the other hand, the CPU is so simple that it is composed of only a few adders. The number of adders equals the number of "1" entries in a column of the slice. Usually, the column weight is small; hence, no pipelining is needed in the CPU.
Overlapping schemes
From Eqs. (7)-(8), it can be seen that during the kth iteration, the RPU reads R_mj computed in the (k−1)th iteration by the RPU, reads S_j computed in the (k−1)th iteration by the CPU, and updates R_mj. In consequence, RPS can be performed slice by slice without the computation results of CPS in the same iteration.
On the other hand, according to Eq. (9), the CPU reads R_mj updated by the RPU in the current iteration and adds it up slice by slice. In consequence, the CPS of the tth slice can be started as soon as the RPS of the tth slice has finished.
During the iteration, the message R_mj is only delivered from the RPU to the CPU, and the accumulated R_mj result S_j is sent back from the CPU to the RPU at the end of each iteration. Therefore, the sequential tie between the check and variable node update stages of the conventional TPMP algorithm is broken. The CPS of slice t can be started as soon as the RPS of slice t is finished. Furthermore, because the RPS of slice t does not need the computation results of slice (t−1), the RPS of all slices can be pipelined. The overlapping scheme can alternatively be interpreted in timing slots, as shown in Fig. 5. It can be seen from Fig. 5 that the RPS of different slices can be fully pipelined, and there is no redundant waiting time between adjacent slices. In this overlapping scheme, it takes (L+1)b + P_1 clock cycles to finish one iteration, where the RPS and CPS computations are overlapped in Lb clock cycles. The HUE is thus approximated by
\mathrm{HUE} \approx \frac{Lb}{(L+1)b + P_1} \approx \frac{L}{L+1}    (10)
Obviously, in STMP, the HUE is decided only by the number of slices. If the check matrix is split into more than 9 slices, the decoder HUE will be larger than 0.9; if the check matrix is split into more than 19 slices, the decoder HUE will be larger than 0.95. Therefore, the decoding is suitable for most LDPC codes.
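As a quick numerical check of the thresholds above, the following sketch assumes that the HUE is approximated by L/(L+1), the form that depends only on the number of slices L and is consistent with the values quoted in this paper. The function name `stmp_hue` is hypothetical.

```python
def stmp_hue(num_slices):
    """Approximate HUE of the overlapped schedule, assumed to be L / (L + 1)."""
    if num_slices < 1:
        raise ValueError("at least one slice is required")
    return num_slices / (num_slices + 1)

# More than 9 slices gives HUE > 0.9; more than 19 slices gives HUE > 0.95.
hue_10 = stmp_hue(10)
hue_20 = stmp_hue(20)

# Slice counts used later for the three DTTB codes (35, 23 and 11 slices).
dttb_hue = [round(stmp_hue(L), 2) for L in (35, 23, 11)]
```

With these inputs, `dttb_hue` evaluates to 0.97, 0.96 and 0.92, matching the figures reported for the three benchmark codes.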
From Fig. 5, it can be seen that there are three types of overlapping in this scheme, which we call scheme A. Taking the RPS of slice 3 for example, at the beginning, the RPS of slice 3 is overlapped with the RPS of slice 2 and the CPS of slice 1; in the middle, the RPS of slice 3 is overlapped with the CPS of slice 2; in the end, the RPS of slice 3 is overlapped with the RPS of slice 4 and the CPS of slice 2. The three types of overlapping cause some conflicts in memory accessing, which will be discussed in the next section. As an alternative, we could add some waiting periods between adjacent RPS computations to avoid memory access conflicts. This overlapping scheme, called scheme B, is shown in Fig. 6. Scheme B has only two types of overlapping, and it takes (L+1)b + LP_1/2 clock cycles to finish one iteration. As a result, the HUE of scheme B is slightly smaller than that of scheme A.
Besides, the basic idea of OMP decoding can also be used in our overlapping scheme to avoid access conflicts in overlapping scheme A.
The Proposed Memory Arrangement
In the conventional partly parallel structure, three kinds of memory, i.e. intrinsic memory (storing intrinsic information I_j), extrinsic memory (storing extrinsic information L_mj and R_mj) and decision memory (storing hard decision results), are needed. The number of extrinsic memory blocks equals W and the numbers of the other two kinds of memory blocks both equal q. This is a heavy burden, especially for long LDPC codes. In this paper, an efficient memory sharing scheme is proposed to reduce the requirement for the three kinds of memories.
Conflict-free column block group
First of all, we define the conflict-free column block group (CFG).
Column blocks that have no nonzero submatrices in common in the same row compose a CFG. A single column block can also be regarded as a CFG.
As shown in Fig. 7, the check matrix is partitioned into 6 slices. Column blocks 1, 2 and 3 of H_1 compose a CFG. We define G as the minimum number of CFGs of the check matrix. It is obvious that G lies between d_max and q. For example, G of H_1 is 4, whereas G of H_2 is 5.
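One simple way to form CFGs is a greedy pass over the column blocks, sketched below. This is an illustration only, not the authors' procedure: the function name `greedy_cfgs` and the small base matrix are hypothetical, and a greedy grouping is not guaranteed to reach the minimum number G.

```python
def greedy_cfgs(base):
    """Group column blocks so that no two blocks in a group share a row block.

    base[r][c] is 1 when the submatrix in row block r, column block c is nonzero.
    Returns a list of groups, each a list of column-block indices (0-based).
    """
    num_cols = len(base[0])
    groups = []      # column-block indices per group
    used_rows = []   # row blocks already occupied by each group
    for c in range(num_cols):
        rows = {r for r, row in enumerate(base) if row[c]}
        for g, occupied in enumerate(used_rows):
            if not (rows & occupied):        # no conflict with this group
                groups[g].append(c)
                occupied |= rows
                break
        else:                                # conflicts everywhere: new group
            groups.append([c])
            used_rows.append(set(rows))
    return groups

# Hypothetical 3 x 4 base matrix (not H_1 or H_2 from the paper).
example_base = [[1, 0, 1, 0],
                [0, 1, 1, 0],
                [0, 0, 0, 1]]
example_groups = greedy_cfgs(example_base)
```

For this example the maximum row weight is 2, and the greedy pass also finds 2 groups, so the lower bound d_max is met.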
Efficient memory sharing scheme
In STMP decoding, four types of information need to be stored: the intrinsic LLR message I_j, the extrinsic LLR message S_j, the extrinsic message R_mj and the hard decision results. Intrinsic and extrinsic LLR messages are stored together. For convenience, the memory used to keep I_j and S_j is called L-RAM, the memory used to keep R_mj is called Q-RAM, and the memory used to keep hard decision results is called Z-RAM.
The arrangement of the three kinds of memories is based on the CFGs. We will discuss it in detail.
As for L-RAM, each CFG needs three L-RAMs, marked L_A, L_B and L_C, respectively. L_A stores I_j of the corresponding CFG, whereas L_B and L_C store S_j of two adjacent iterations, respectively. L_A, L_B and L_C have the same structure, each of which is partitioned into several sections. The number of sections equals the number of column blocks in the CFG. For example, in H_1, column blocks 1, 2 and 3 compose CFG_1; hence, L_1A, L_1B and L_1C each have three sections. Section 1 stores the messages of column block 1; Sections 2 and 3 store the messages of column blocks 2 and 3, respectively. Therefore, there are 3G L-RAMs in total.
On the other hand, as for Q-RAM, each CFG needs two Q-RAMs, marked Q_A and Q_B, to store R_mj of adjacent slices of the check matrix. Q_A and Q_B are also separated into several sections. The total number of sections of Q_A and Q_B equals the number of nonzero submatrices of the CFG. Q_A stores the R_mj messages of odd slices and Q_B stores the R_mj messages of even slices. For example, in H_1, CFG_1 contains six nonzero submatrices; consequently, Q_1A and Q_1B have six sections in total. Q_1A has three sections to store the R_mj messages of the odd slices, and Q_1B has three sections to store the R_mj messages of the even slices.
As for Z-RAM, its number can be set to d_min, the minimum row weight of the check matrix. Taking H_1 for example, the hard decision of column block 1 can be made in the CPS of slice 3, those of column blocks 4 and 6 can be made in the CPS of slice 4, that of column block 5 can be made in the CPS of slice 5, and those of column blocks 3, 5 and 7 can be made in the CPS of slice 6. The number of Z-RAMs that need to be accessed simultaneously is 3. Therefore, only three Z-RAMs are needed for H_1. Figure 8 illustrates the memory sharing scheme for H_1, where I(x), S(x) and Z(x) represent I_j, S_j and the hard decision results of the xth column block, respectively, and R(x,y) represents the R_mj messages of the submatrix in the xth row and yth column. It can be seen that after the memory arrangement, 23 memory blocks are needed in the decoder, whereas 34 (= W+2q) memory blocks are needed in the conventional TPMP decoder.
Generally, the memory block number of the decoding is thus calculated by
3G + 2G + d_{\min} = 5G + d_{\min}    (11)
If I_j and R_mj are quantized with f bits, then S_j theoretically needs [f + log_2(d_C + 1)] bits. However, in practice, (f+2) bits are enough for S_j. Then the total number of memory bits can be approximated by
Wbf + qbf + 2qb(f+2) + qb = Wbf + (3f+5)qb    (12)
Compared with conventional TPMP decoding, an extra 2qb(f+2) memory bits are needed to store the S_j messages.
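The bit counts can be checked numerically. The sketch below assumes that the conventional TPMP decoder stores Wbf extrinsic bits, qbf intrinsic bits and qb decision bits, and that STMP adds 2qb(f+2) bits for S_j as stated above. The function names are hypothetical. For the (7 493, 3 048) code, b = 127 and f = 7 come from the text and q = 7 493/127 = 59; W = 275 is an inferred value (it is the count that reproduces the 304 419 TPMP bits reported later), not a number stated in the paper.

```python
def tpmp_memory_bits(W, q, b, f):
    """Extrinsic (W*b*f) + intrinsic (q*b*f) + decision (q*b) bits."""
    return W * b * f + q * b * f + q * b

def stmp_extra_bits(q, b, f):
    """Extra bits for two copies of S_j, each (f + 2) bits wide."""
    return 2 * q * b * (f + 2)

def stmp_memory_bits(W, q, b, f):
    """Total STMP bits: the TPMP storage plus the extra S_j storage."""
    return tpmp_memory_bits(W, q, b, f) + stmp_extra_bits(q, b, f)

# (7 493, 3 048) code: W = 275 is inferred, the other values are from the text.
W, q, b, f = 275, 59, 127, 7
tpmp_bits = tpmp_memory_bits(W, q, b, f)    # 304419
extra_bits = stmp_extra_bits(q, b, f)       # 134874
```

Both results agree with the totals quoted in the comparison section (304 419 bits for TPMP and 134 874 extra bits for STMP).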
Memory accessing analysis
In Section 3, STMP decoding and the overlapping schemes were presented. In Section 4.2, an efficient memory sharing arrangement was proposed. In this section, we analyze the memory accessing of STMP decoding under the presented overlapping schemes and memory arrangement.
As Fig. 5 shows, if the degree of operation parallelism is set to 1, the RPS period of one slice is b + P_1 clock cycles. Taking the second slice for example, in the first P_1 clock cycles, the RPS of slice 2 is overlapped only with the RPS of slice 1. In this period, the RPU updates the R_mj messages of slice 1 and reads the R_mj and S_j messages of slice 2. The memory accessing of this period is shown in Fig. 9(a).
In the following b − P_1 clock cycles, the RPS of slice 2 is overlapped only with the CPS of slice 1. In this period, the RPU reads the R_mj and S_j messages of slice 2 and updates the R_mj messages of slice 2, while the CPU reads the R_mj and I_j (or S_j) messages of slice 1 and updates the S_j messages of slice 1. The memory accessing of this period is shown in Fig. 9(b).
In the last P_1 clock cycles, the RPS of slice 2 is overlapped not only with the CPS of slice 1 but also with the RPS of slice 3. In this period, the RPU updates the R_mj messages of slice 2 and reads the R_mj and S_j messages of slice 3, while the CPU reads the R_mj and I_j (or S_j) messages of slice 1 and updates the S_j messages of slice 1. Figure 9(c) shows the memory accessing of this period.
It can be seen from Fig. 9(c) that the RPU and CPU need to read Q_A simultaneously. This simultaneous access causes the conflict mentioned before. If the Q-RAM is divided into three parts that store the R_mj messages of three adjacent slices, the access conflict is eliminated, but the total memory block number of overlapping scheme A then becomes 6G + d_min. The corresponding memory accessing flowchart is shown in Fig. 9(d). As for overlapping scheme B, there are only two types of overlapping, whose memory accessing flowcharts correspond to Figs. 9(a)-(b), respectively; therefore, there is no conflict.
In a word, in the overlapping schemes, the RPU and CPU access L_B and L_C alternately in adjacent iterations, and access Q_A and Q_B (or Q_C) alternately in adjacent slices.
In addition, once the computation of S_j for the whole column is finished, a hard decision can be made by the CPU and the result is written into Z-RAM. The hard decisions may be made in the CPS of different slices.
Comparisons and Discussions
To verify the improvement of proposed scheme on the memory reduction and HUE, we simulated the decoding algorithm on a set of benchmark codes. Table 1 reports some details regarding the selected codes. We considered three codes which have been adopted in real applications (the codes are used in Chinese DTTB standard).
If the slice size is set to the submatrix size, then the three codes have 35, 23 and 11 slices, respectively. According to Eq. (10), the HUE of our STMP decoding is 0.97, 0.96 and 0.92, respectively. If the slice size is set to half of the submatrix size, then the three codes have 70, 46 and 22 slices, respectively, and the HUE of STMP decoding is 0.99, 0.98 and 0.96, respectively. The HUE is no less than that in Ref. [8]. By observing the distribution of nonzero submatrices, the check matrix can be separated into several CFGs. The separation results for the three codes are shown in column 2 of Table 2. In column 3 of Table 2, the numbers of memory blocks needed by STMP overlapping scheme A are estimated. The memory block requirement of TPMP decoding is listed in column 4. In order to make a fair comparison, the LUT memories are not taken into account.
As can be seen in column 5, STMP decoding significantly reduces the number of memory blocks compared with TPMP decoding.
According to Eqs. (11)-(12), a more detailed comparison of the memory block number and memory bits for code 1 between STMP and TPMP is carried out in Table 3. The intrinsic, extrinsic and decision memory numbers of TPMP equal q, W and q, respectively, as listed in column 3 of Table 3. The total memory bits of the three kinds of memories are 304 419, assuming that the extrinsic and intrinsic messages are quantized with 7 bits.
On the other hand, the STMP decoder needs G memory blocks to store the intrinsic messages I_j, 3G memory blocks to store the extrinsic messages R_mj, and d_min memory blocks to store the decision results. Moreover, an extra 2G memory blocks are needed to store the accumulated messages S_j. The statistical result is listed in column 2 of Table 3.
From column 4 of Table 3, it can be seen that STMP needs extra 22 memory blocks, 134 874 memory bits. However, for FPGA implementation, the memory resource is organized as several basic memory components, e.g. M4k (4 608 bit) and M512 (576 bit) for Altera FPGAs. One memory block in Table 3 occupies at least one memory component. As a result, the TPMP decoder needs at least 393 memory components, whereas STMP can integrate small memory blocks into larger ones to make better use of the memory components in the FPGA. In consequence, the STMP decoder requires fewer memory components than the TPMP decoder despite the increase in total memory bits. The FPGA implementation in Section 6 also supports this point. The bit error rate (BER) curves of the LDPC codes at three rates under Gaussian noise are illustrated in Fig. 10. The maximum iteration number is set to 20, and the intrinsic LLR message I_j and extrinsic message R_mj are quantized with 7 bits. Compared with the existing solution [18], the presented decoding has the same BER performance as the traditional TPMP decoding. There is no performance degradation in STMP decoding.
Discussion between OMP and STMP decoding
Because both STMP decoding and OMP decoding are based on TPMP decoding, the two decoding methods consume the same hardware resource in RPUs and CPUs. However, STMP decoding has the following merits.
1) STMP brings about a great memory block reduction. The total number of memory blocks of OMP is (2q+W), whereas that of STMP shrinks to (6G + d_min) under overlapping scheme A. Although the total memory bits of STMP are increased, STMP can combine small memory blocks into larger blocks, which lowers the total block requirement in FPGA implementations. Furthermore, fewer memory blocks also mean simpler interconnections.
2) STMP decoding is more flexible in trade-off between hardware and throughput. The trade-off step of STMP is quite small. We can increase or decrease one RPU to trade off hardware and throughput, and STMP decoding can work in full-serial or nearly full-parallel mode at a similar high HUE by adjusting the parallel degree, whereas the OMP decoding cannot.
3) STMP decoding is simple and suitable for general QC-LDPC codes. High HUE has nothing to do with the code structure. However, the OMP decoding is not so efficient especially for weight-w QC-LDPC codes.
Discussion between TDMP and STMP decoding
Compared with TDMP decoding, STMP decoding consumes the same hardware resource in function units because of the similar computations. For weight-1 codes in low-throughput applications, TDMP decoding is more efficient than STMP decoding because of its convergence speed. However, STMP is more powerful in the following two situations.
1) To decode weight-w QC-LDPC codes. Both STMP and TDMP decoding split the check matrix row-wise. But TDMP decoding must satisfy the constraint that there is at most one "1" in each column of every layer, whereas STMP has no such constraint. Therefore, our decoding is suitable for general QC-LDPC codes.
2) To decode weight-1 QC-LDPC codes at high throughput. In high throughput applications, the whole slice operates in parallel, and it takes only 1 clock cycle to finish the RPS if the pipeline delay is not considered. However, because of the dependence between adjacent layers, it takes L(1+P_1) clock cycles to complete one iteration of TDMP decoding. With STMP decoding, because all slices can be pipelined, it only takes (L+1+P_1) clock cycles to complete one iteration. Even if TDMP converges two times faster than STMP decoding, STMP still achieves a throughput gain of roughly (1+P_1)/[2(1+P_1/L)] over TDMP decoding. Taking code 1 in Table 2 for example, in a high throughput application, the check matrix can be split into 35 slices (L=35) and the parallel degree can be set to 127. If the number of pipeline stages is set to 7, then the throughput gain is (1+P_1)/[2(1+P_1/L)] = (1+7)/[2(1+7/35)] ≈ 3.33, which means the throughput of STMP is 3.33 times that of TDMP decoding.
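The throughput comparison above can be reproduced with a short calculation. This is a sketch using only the cycle counts quoted in the text; the function names and the `convergence_factor` parameter (how many times faster TDMP is assumed to converge) are hypothetical.

```python
def tdmp_cycles(num_slices, pipeline_stages):
    """Cycles per TDMP iteration: layers cannot be pipelined, L * (1 + P_1)."""
    return num_slices * (1 + pipeline_stages)

def stmp_cycles(num_slices, pipeline_stages):
    """Cycles per STMP iteration with all slices pipelined, L + 1 + P_1."""
    return num_slices + 1 + pipeline_stages

def exact_gain(num_slices, pipeline_stages, convergence_factor=2.0):
    """Cycle-count ratio, divided by the assumed TDMP convergence advantage."""
    return tdmp_cycles(num_slices, pipeline_stages) / (
        convergence_factor * stmp_cycles(num_slices, pipeline_stages))

def approx_gain(num_slices, pipeline_stages):
    """Approximation used in the text: (1 + P_1) / [2 (1 + P_1 / L)]."""
    return (1 + pipeline_stages) / (2 * (1 + pipeline_stages / num_slices))

gain_text = approx_gain(35, 7)     # the approximation quoted in the text
gain_exact = exact_gain(35, 7)     # 280 / 86, slightly lower
```

The approximation drops the "+1" term in the STMP cycle count, so the exact ratio for L = 35 and P_1 = 7 (about 3.26) is slightly below the quoted 3.33.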
FPGA Implementation Result
Finally, the (7 493, 3 048) irregular QC-LDPC code selected from the Chinese DTTB standard is decoded with our decoding. The check matrix is partitioned into 35 row blocks and the parallel degree is set to 8. Seven pipeline stages are inserted into the RPU, and the messages I_j and R_mj are quantized with 7 bits: 1 bit for the sign, 2 bits for the integer part and 4 bits for the fractional part.
The STMP decoder with scheme B was modeled in Verilog HDL and simulated using ModelSim. We then synthesized, placed and routed the design using the Altera Quartus II 7.2 software package. The design was targeted at the Altera Stratix EP2S30 device (speed grade -5). In fact, there are plenty of memory bits even in low-end FPGAs. As Table 4 shows, there are over 1.3 mega memory bits in the EP2S30, organized into M4k, M512 and M-RAM components. However, these memory components are not small enough to suit the conventional TPMP LDPC decoder. For the (7 493, 3 048) code, the submatrix size is 127, and one 1K-bit memory block is enough to store the extrinsic information L_mj or R_mj. In the FPGA implementation, however, we have to use one M4k block to store it, so more than 3K bits of each M4k are wasted.
STMP decoding, in contrast, combines the small memory blocks into larger ones, which achieves better utilization of the FPGA memory resource. As Table 4 shows, the presented decoder can be implemented on the EP2S30, whereas the TPMP decoder cannot. Even without counting the LUT memories, the TPMP decoder needs at least 393 memory components to store the intrinsic, extrinsic and decision messages. Based on the Altera timing report, the maximum clock frequency of the implementation is 127.2 MHz. If the maximum iteration number is set to 20, then the net throughput is (127.2×3 048)/[(35×16+7×17)×20] ≈ 28.5 Mbps.
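The net throughput figure follows directly from the clock frequency, the number of information bits and the cycle count per iteration. A minimal sketch that re-evaluates the expression quoted above (the function name is hypothetical; the cycle count 35×16+7×17 is taken as given from the text):

```python
def net_throughput_mbps(clock_mhz, info_bits, cycles_per_iteration, iterations):
    """Information bits decoded per microsecond, i.e. net throughput in Mbps."""
    return clock_mhz * info_bits / (cycles_per_iteration * iterations)

cycles_per_iteration = 35 * 16 + 7 * 17      # 679 cycles, as quoted in the text
throughput = net_throughput_mbps(127.2, 3048, cycles_per_iteration, 20)
```

Evaluating this gives roughly 28.5 Mbps, in line with the reported value.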
To provide a fair comparison, we also implemented a conventional TPMP decoder for the same code using the same quantization. The synthesis result is given in column 5 of Table 5. Compared with the conventional TPMP decoder, the STMP decoder consumes roughly 17 ALUTs and registers while achieves 40 throughput. The HUE of the STMP decoder is double that of the TPMP decoder. After the memory arrangement, the 494K memory bits of the STMP decoder can be placed into 134+177 = 311 memory components, while the TPMP decoder needs 423+646 = 1 069 memory components to place 581K memory bits. Obviously, the STMP decoder saves a large number of memory components in FPGA implementation. As a result, the STMP decoder can be implemented on the EP2S30, whereas the TPMP decoder would have to use the EP2S130.
Several previous FPGA implementations of the (7 493, 3 048) code are also included in Table 5. In Ref. [19] and Ref. [20], the authors use a dual-word schedule to achieve high HUE at the expense of nearly double the memory bits, e.g., for the (7 493, 3 048) code, over 600K extra memory bits are used. However, from Table 3, only 134K extra memory bits are needed in the proposed decoder to achieve the same high HUE. Even if the decoders of Ref. [20]-D used min-sum algorithm, the proposed decoder consumes roughly 30 ALUTs and registers while achieves 37 throughput. Furthermore, it should be noted that the idea of Ref. [20] can also be used in the proposed decoder to save logic elements.
Conclusions
In this paper, a decoding method for general QC-LDPC codes with high hardware utility efficiency and a low memory block requirement is proposed. The decoding not only achieves high hardware utility efficiency, but also brings about a great memory block reduction without any performance degradation. The presented decoding facilitates efficient circuit implementations of the LDPC decoder. In this paper, we use only log-BP to demonstrate the STMP decoding. However, the MS algorithm and other variants can also be used in the presented decoding.
