Abstract. Low-Density Parity-Check (LDPC) codes are among the best known error-correcting coding methods. This article concerns a hardware iterative decoder for a subclass of implementation-oriented LDPC codes, also known as Architecture-Aware LDPC. The decoder has been implemented as a synthesizable VHDL description. To achieve a high clock frequency of the hardware implementation, and consequently high data throughput, a large number of pipeline registers has been used in the processing chain. However, the registers increase the processing path delay, since more clock cycles are required for data to propagate. Thus, in general, idle cycles must be introduced between decoding subiterations. In this paper we study the conditions under which idle cycles are necessary and provide a method for calculating the exact number of required idle cycles on the basis of the parity check matrix of the code. We then propose a parity check matrix optimization method that minimizes the total number of required idle cycles and hence maximizes the decoder throughput. The proposed optimization, which sorts the rows and columns of the matrix, does not change the code properties. The results presented in the paper show that the decoder throughput can be significantly increased with the proposed optimization method.
Introduction
Modern demands on transmission throughput motivate continuous progress in error-correcting coding systems. Low-Density Parity-Check (LDPC) codes are among the best known coding methods that achieve very low bit error rates at code rates approaching Shannon's channel capacity limit. (The other well-known method is turbo coding.) Thus LDPC codes have recently attracted intense research interest. Their main advantage over turbo codes is a highly parallel decoding scheme.
LDPC codes were first introduced by Gallager in 1962 [1], but were soon forgotten. Their implementation complexity exceeded the capabilities of the technology available at the time, so they were not considered for practical applications. The codes were rediscovered in the late 1990s [2] and have since been studied by many researchers. Despite the constant progress in electronics technology, the hardware design of LDPC coding systems is still not straightforward. The main challenges are: 1) to define low-complexity, high-throughput decoder architectures, 2) to make the architecture versatile, i.e. capable of decoding a large family of codes.
The fully parallel LDPC iterative decoding architecture can achieve high decoding throughput, but it suffers from large hardware complexity caused by a large set of processing units and complex interconnections. A practical solution for area-efficient decoders is the partially parallel architecture, in which a processing step is performed in several time slots using a number of processing units working in parallel. It has been recognized that partially parallel decoder architectures suit a subclass of codes with a structured parity check matrix, known as Architecture-Aware LDPC (AA-LDPC, [3]), VLSI-Oriented [4] or Structured LDPC [5].
A programmable partially parallel decoder has been implemented as a synthesizable VHDL description. The decoder is capable of decoding any code whose parity check matrix has the Architecture-Aware form. The target hardware platform for the implemented decoder is FPGA; Xilinx VirtexII devices were used for decoder verification. In this paper we briefly present the decoder structure. Then we focus on the pipeline processing optimization that has been proposed to speed up the decoding process.
Pipeline registers have been used to achieve a high clock frequency of the decoder hardware implementation and, consequently, high data throughput. However, the registers increase the processing path delay, as more clock cycles are required for data to propagate. The main problem connected with pipeline processing is that a new step of data processing must not be started before the previous one is completed if the data updated in the previous step is required for the new one. Hence, in general, idle clock cycles must be introduced. However, we show in this article that the number of required idle cycles depends on the structure of the parity check matrix. Moreover, we present a parity check matrix optimization algorithm that minimizes the number of required idle cycles. As a result of the idle time reduction, the decoder throughput is significantly increased, as shown in the experimental results.
The paper is organized as follows. First, we present the basic concepts connected with LDPC codes and the Architecture-Aware subclass of codes. In particular, the decoding algorithm will be described. Then the decoder structure will be presented, with emphasis on the pipeline processing elements. In the following sections we study the conditions under which idle cycles are necessary and provide a method for calculating the exact number of required idle cycles on the basis of the parity check matrix structure. Then we propose a parity check matrix optimization method to minimize the total number of idle cycles, and finally we present results obtained with the proposed optimization method for several LDPC codes.
W. Sułek
LDPC codes basics
LDPC codes are linear block codes [6] with a sparse parity-check matrix. The parity-check matrix $H_{M \times N}$ of a code $C$ represents the relation between the $N$ bits of the codeword and the $M$ parity-check equations. A vector $x = \{x_1, x_2, \ldots, x_N\}$ is a valid codeword ($x \in C$) iff the parity check condition $Hx^T = 0$ is satisfied (in the GF(2) field).
In the encoder, an information vector $u = \{u_1, u_2, \ldots, u_K\}$ of length $K = N - M$ is transformed into a valid codeword by combining it with $M$ parity bits. The code rate $R = K/N$ characterizes the amount of redundancy in the codeword. In the decoder, where the information about bit values is distorted, the most probable codeword is determined on the basis of the received vector $y = \{y_1, y_2, \ldots, y_N\}$. The inputs to the decoding algorithm are received bits in the case of hard-decision decoding, or a priori probabilities of bit values $P(x_n = 0 \mid y_n)$, $P(x_n = 1 \mid y_n)$ (or some functions of these probabilities) in the case of soft-decision decoding. The latter case yields significantly better error-correcting performance.
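As a small illustration of the parity check condition, the following Python sketch tests whether a word satisfies every parity-check equation modulo 2. The matrix `H_EXAMPLE` is a toy Hamming (7,4) matrix chosen only for illustration (it is neither sparse nor taken from this paper), and the function name `is_codeword` is likewise an assumption.

```python
def is_codeword(h, x):
    """Return True if every parity-check equation (row of h) sums to 0 modulo 2."""
    return all(sum(hij & xj for hij, xj in zip(row, x)) % 2 == 0 for row in h)

# Toy 3 x 7 parity-check matrix (Hamming (7,4) code), for illustration only.
H_EXAMPLE = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

valid = [1, 1, 1, 0, 0, 0, 0]      # satisfies all three checks
corrupted = [1, 1, 1, 0, 0, 0, 1]  # same word with the last bit flipped

print(is_codeword(H_EXAMPLE, valid))      # True
print(is_codeword(H_EXAMPLE, corrupted))  # False
```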
Based on the parity check matrix, a bipartite graph G (Tanner graph) is defined, with bit nodes corresponding to bits and check nodes corresponding to parity-check equations (Fig. 1).
The Tanner graph visualizes the iterative message passing algorithms used for decoding. The algorithms proceed by exchanging messages (beliefs) between bit nodes and check nodes along the edges in both directions. Each node of the graph represents a computation of updated beliefs. In the LLR-BP (Log-Likelihood Ratio Belief Propagation) algorithm, messages are log-likelihood ratios of beliefs, hence the operations are sums (bit nodes) and sums of nonlinear functions of messages (check nodes). LLR-BP and its modifications are the algorithms most frequently considered for hardware implementations [3, 7]. The inputs to the decoding algorithm are bit values together with measures of their reliability based on channel observations (received channel soft values), in the form of log-likelihood ratios; they are called intrinsic channel reliability values and will be denoted as $\delta_n$ for the nth bit:
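The log-likelihood ratio referred to here is commonly written as shown below. The sign convention (positive values favouring $x_n = 0$) is an assumption, chosen to agree with the decoder output described later, where the most significant (sign) bit of a posterior value gives the decoded bit.

```latex
\delta_n = \ln \frac{P(x_n = 0 \mid y_n)}{P(x_n = 1 \mid y_n)}
```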
The basic two-phase message-passing (TPMP) algorithm is described in a rich literature, e.g. [2, 6, 7].
TDMP decoding algorithm.
Here we focus on the Turbo-Decoding Message-Passing (TDMP) algorithm [3, 8] that has been used for the presented hardware decoder. It is based on a modification of the message passing scheme of the classic TPMP algorithm, in which C is considered as a concatenation of a number of codes. This means that the set of rows of H is virtually partitioned into a number of subsets. Each subset constitutes a code, which will be called a subcode. A word is a valid codeword if it belongs to all constituent subcodes. Let $C^d$ denote the dth subcode with parity check matrix $H_d$. The sum of the $\Lambda_n$-messages for all subcodes, added to the channel value $\delta_n$, is the posterior reliability value, denoted by $\Gamma_n$ and updated at each subiteration. Thus $\Gamma_n$ represents "all the information" about the value of the bit $x_n$ at the current stage of the decoding process. The $\Gamma_n$ value calculated at the previous subiteration is denoted by $\gamma_n$. The TDMP algorithm is summarized as follows.
1) Initialization
Initialize posterior γ values to the intrinsic channel reliability LLRs and the messages λ for all subcodes to zeros.
Pipeline processing in low-density parity-check codes hardware decoder
2) Iteration
Carry out D decoding subiterations corresponding to subcodes
where $n' \in N(m) \setminus n$, $c_m$ is the check node in the subset $V_c^d$ incident to $b_n$ (as we assumed, there is only one such check node), and $\psi(x)$ is a nonlinear function defined as $\psi(x) = \psi^{-1}(x) = -\ln(\tanh(x/2))$, which can be calculated using known approximations [7, 12].
- Store the Λ and Γ values as λ and γ values to be used as inputs in the next subiteration:
3) Stop Criterion
After all subiterations, make hard decisions $x = \{x_1, x_2, \ldots, x_N\}$ such that
If the parity check equation $Hx^T = 0$ is satisfied or the maximum number of iterations $i_{max}$ is reached, halt the algorithm with x as output. Otherwise, go to step 2.
The decoder receives soft channel values δ and generates reliability values Γ of the decoded bits. These values are updated at each subiteration. Furthermore, in each constituent subiteration, extrinsic reliability values Λ are computed under the assumption that the codeword belongs to the subcode, according to (4)-(6). The intrinsic λ-messages pertaining to the subcode under consideration are subtracted from γ (Eq. (4)) to eliminate correlation between the newly generated messages and the previously generated ones. Thus $Q^d_n$ represents the LLR reliability value of the nth bit based on the messages from all subcodes except the subcode under consideration. These Q values are used for updating the extrinsic messages Λ corresponding to the subcode being decoded.
The main advantages of the TDMP scheme over standard TPMP are faster convergence (about 20-50% fewer decoding iterations) and memory savings due to the elimination of multiple bit-to-check messages [3].
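The update described above can be sketched in Python for a single check node of the subcode being decoded. This is a minimal floating-point sketch, not the fixed-point hardware datapath: the function names, the dictionary-based message storage and the clamping limits are assumptions, and the order of operations (subtract, check-node update, add) follows the verbal description rather than the exact equations.

```python
import math

def psi(x):
    """psi(x) = -ln(tanh(x/2)); the argument is clamped (assumed limits) to stay finite."""
    x = min(max(x, 1e-9), 30.0)
    return -math.log(math.tanh(x / 2.0))

def check_node_update(gamma, lam, cols):
    """Update one check node: gamma holds posteriors, lam the old messages of this check."""
    # Remove this check's previous contribution from the posterior values.
    q = {n: gamma[n] - lam[n] for n in cols}
    new_lam = {}
    for n in cols:
        others = [q[k] for k in cols if k != n]
        # Sign is the product of the signs of all other inputs.
        sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
        new_lam[n] = sign * psi(sum(psi(abs(v)) for v in others))
    # Add the fresh extrinsic messages back to obtain the updated posteriors.
    for n in cols:
        gamma[n] = q[n] + new_lam[n]
    return new_lam

# Hypothetical values: three bits taking part in one check, no earlier messages.
gamma = {0: 2.0, 1: -1.5, 2: 3.0}
lam = {0: 0.0, 1: 0.0, 2: 0.0}
lam = check_node_update(gamma, lam, [0, 1, 2])
bits = {n: int(g < 0) for n, g in gamma.items()}  # assumed rule: negative value -> bit 1
```

In this toy case the least reliable input (bit 1) is pulled towards the value that satisfies the parity check, which is the intended effect of the extrinsic messages.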
AA-LDPC codes.
As is well known, an efficient partially parallel decoder implementation is possible for parity-check matrices with certain constraints on their form [3, 10, 11]. Firstly, as we assumed, the rows of H are partitioned such that in each submatrix $H_d$ every column contains at most a single nonzero element. Secondly, in order to organize the message (Γ) memory suitably, the set of bit nodes of the graph is partitioned into L subsets that are delivered to P computing modules in a single clock period by a configurable interleaver, which shuffles the fragments of the memory word according to the local graph structure of the code.
With such code-graph organization, a parity check matrix is similar to the one shown in Eq. (9):
Matrix H is composed of $D \times L$ submatrices, where each submatrix $P_{d,l}$ of size $P \times P$ is either an all-zero matrix or a permutation matrix obtained by permuting the columns of an identity matrix [5]. The placement of nonzero submatrices in H is specified by the so-called seed matrix W. It is a $D \times L$ matrix with elements $w_{d,l} = 0$ if $P_{d,l}$ is the all-zero submatrix and $w_{d,l} = 1$ otherwise.
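To make the block structure concrete, the sketch below expands a binary seed matrix into a full parity check matrix. It assumes, purely for illustration, that every nonzero submatrix is a cyclically shifted identity matrix (the text only requires a column permutation of the identity); the names `expand_seed` and `shifts` are hypothetical.

```python
def expand_seed(w, shifts, p):
    """Build H (a list of rows) from the seed matrix w. shifts[d][l] is the cyclic
    shift of the nonzero submatrix at block position (d, l); cyclic shifts are an assumption."""
    h = []
    for d, seed_row in enumerate(w):
        for i in range(p):                      # row i inside block-row d
            row = []
            for l, bit in enumerate(seed_row):
                block = [0] * p
                if bit:
                    block[(i + shifts[d][l]) % p] = 1
                row.extend(block)
            h.append(row)
    return h

W = [[1, 0, 1, 1],
     [0, 1, 1, 0]]
SHIFTS = [[0, 0, 2, 1],
          [0, 1, 0, 0]]
P = 3
H = expand_seed(W, SHIFTS, P)

# Within each block-row (subcode), every column holds at most one nonzero element.
for d in range(len(W)):
    sub = H[d * P:(d + 1) * P]
    assert all(sum(col) <= 1 for col in zip(*sub))
```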
Decoder architecture
A configurable decoder based on the TDMP scheme has been implemented as a synthesizable VHDL model. This model can be adjusted for decoding any regular or irregular code whose matrix H has the presented Architecture-Aware form. The decoder structure has been described in detail in other papers [5, 13]. Here we focus on the pipeline processing that has been used to speed up the decoding process. The messages Λ are stored in the Λ-memory, which is partitioned and included in the SISO units as small memory buffers. When the stop criterion is met, decoding of the current word is halted and the data is output in words of length P (the MSBs of the P Γ-messages in a single memory word are equivalent to the decoded bits).
The time slot duration, i.e. the number of clock cycles needed to execute a single subiteration, depends on the check node degree as well as on the number of pipeline registers in the processing chain. The pipeline registers, denoted as grey boxes in the figures, are essential for achieving a high clock frequency of the decoder hardware implementation and, consequently, high data throughput. However, the registers increase the processing path delay, as more clock cycles are required for data to propagate. Here the main problem connected with pipeline processing arises. A new subiteration must not be started before the previous one is completed if the data (Γ messages) updated in the previous subiteration are required for the new one. Hence, in general, idle clock cycles must be introduced to wait for the completion of the previous subiteration. The number of required idle cycles depends on the processing chain delay, so the increase in clock frequency due to pipelining is counteracted by the increase in the number of idle clock cycles.
In the classic literature concerning TDMP decoding implementation [3, 8, 9] the problem mentioned above is not treated at all. In the next sections we first study the conditions under which idle cycles are necessary and provide a method for calculating the exact number of required idle cycles on the basis of the parity check matrix structure. Then we give a formula for the total number of idle cycles per iteration, and finally we propose a parity check matrix optimization method to minimize the total number of idle cycles. At the end we present results obtained with the proposed optimization method for several cases.
Number of idle cycles calculation
The basic block diagram of the SISO module is presented in Fig. 3. The main component of the SISO is the CNU (Check Node Unit), which is responsible for the calculation of the Λ messages as in (5). Since the CNU operates in a double recursion scheme (see e.g. [12]), the output messages are in reversed order. The blocks denoted as Subtr* and Add* are a subtractor and an adder, respectively, with clipping elements that constrain the results of addition (subtraction) to W bits. To achieve a high clock frequency, five pipeline registers (grey boxes in Fig. 3) have been used in the CNU, as well as two registers in blocks Add* and Subtr*. The placement of the registers was determined experimentally by observing the synthesis results for the VirtexII FPGA and trying to reach the maximum achievable clock speed.
Let $T_P$ be the total pipeline delay of the decoder, defined as the number of clock cycles from the last Γ-memory read to the first Γ-memory write. The $T_P$ delay is equal to the sum of the delays due to the interconnection network, $T_{NET}$, and due to the SISO module, $T_{SISO}$:
In the case of the implemented decoder we have: $T_{NET} = 1$ (Fig. 2), $T_{SISO} = 9$ (Fig. 3), thus $T_P = 11$. Let $T^d_{idle}$ be the number of idle cycles required before subiteration d is started (i.e. before its first message is fetched). Obviously $T^d_{idle}$ depends on $T_P$, but, as will be shown, it also depends on the existence of nonzero elements in the same columns of rows d, d − 1, d − 2 and d − 3 of the seed matrix. This is illustrated by the following example. Figure 4 presents part of a parity check matrix, where the numbered boxes represent permutation submatrices and the grey boxes represent all-zero submatrices. At the bottom of the figure we present the sequence of data transferred from and to the Γ-memory, where the numbers indicate the memory cells that are read or written, consistent with the location of the nonzero submatrices in the parity check matrix.
Before the 2nd subiteration can be started (blue marks in Fig. 4), the decoder has to wait for memory cell 9 to be updated by the 1st subiteration (black marks). Precisely, the pause $T^2_{idle}$ has to ensure that the read of memory cell 9 (2nd subiteration) occurs at least one cycle after the write of memory cell 9 (1st subiteration). (It is assumed that the "write first" configuration of the two-port memory is used for implementation.) Thus idle cycles are needed when one or more common messages are processed in consecutive subiterations, which results from "overlapping" ones in consecutive rows of the seed matrix (here: ones in the 9th column of the "black" and "blue" rows). In the example presented in Fig. 4, $T^2_{idle} = T_P - 2$. Furthermore, idle cycles may be needed even if two consecutive rows of the seed matrix do not contain ones in a common column. Such a case is presented in Fig. 4 for the 3rd subiteration (red marks). The pause $T^3_{idle}$ has to ensure that the read of memory cell 1 (3rd subiteration) occurs at least one cycle after the update of this memory cell (1st subiteration). The common usage of memory cell 1 results from "overlapping" ones in every second row of the seed matrix (here: the "black" and "red" rows).
Here we propose a formula for the exact calculation of the required idle cycles $T^d_{idle}$ on the basis of the seed matrix structure. Let $X^{(d_2,d_1)}$ be an auxiliary variable defined as follows:
• if rows $d_2$ and $d_1$ of the seed matrix do not contain ones in a common column, then $X^{(d_2,d_1)} = \infty$,
• if rows $d_2$ and $d_1$ of the seed matrix contain a one in a common column l, then $X^{(d_2,d_1)}$ is the difference between the number of ones in row $d_2$ in columns with indexes lower than l and the number of ones in row $d_1$ in columns with indexes greater than l, diminished by 1,
• if rows $d_2$ and $d_1$ of the seed matrix contain ones in more than one common column, then the rule stated above applies to the common column with the lowest index l.
Formally it can be stated as:
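Written out from the three rules above, with $w_{d,j}$ denoting the seed matrix elements and $l$ the lowest-index column in which both rows contain a one, the definition can be expressed as follows. This is a reconstruction from the verbal rules, so the notation is an assumption.

```latex
X^{(d_2,d_1)} =
\begin{cases}
\infty, & \text{if } w_{d_2,j}\, w_{d_1,j} = 0 \text{ for all } j, \\
\sum_{j<l} w_{d_2,j} - \sum_{j>l} w_{d_1,j} - 1, & \text{otherwise},
\end{cases}
\qquad l = \min\{\, j : w_{d_2,j} = w_{d_1,j} = 1 \,\}
```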
With reference to the above example, for the case presented in Fig. 4, e.g. $X^{(2,1)} = 3 - 0 - 1 = 2$ and $X^{(3,2)} = \infty$. We can observe that (in the case $X^{(d_2,d_1)} \neq \infty$) the value of $X^{(d_2,d_1)}$ represents the difference between the pipeline delay $T_P$ and the number of required idle cycles. Let $T^d_1$ be the number of idle cycles before subiteration d is started, considering only the wait for the completion of the previous (d − 1) subiteration. It can easily be shown that:
For the above example, $T^2_1 = T_P - 2$. If waiting for the completion of the penultimate (d − 2) subiteration is considered ("red" row in Fig. 4), the number of idle cycles, denoted as $T^d_2$, equals:
Finally, the need for idle cycles is sometimes due to waiting for the completion of subiteration d − 3. In this case the number of idle cycles equals:
To determine the desired value of $T^d_{idle}$, the case among those mentioned above that gives the highest number of idle cycles has to be considered. Thus:
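Taking the largest of the three partial values, as the preceding sentence prescribes, can be written as shown below. The symbol $T^d_3$ for the value associated with subiteration d − 3 is a name assumed here by analogy with $T^d_1$ and $T^d_2$.

```latex
T^{d}_{idle} = \max\left( T^{d}_{1},\; T^{d}_{2},\; T^{d}_{3} \right)
```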
Equation (15) along with (12)-(14) shows how to calculate the number of required idle cycles for consecutive subiterations. For a particular decoder, the calculations can be made according to (12)-(15).
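The rule for $X^{(d_2,d_1)}$ and the idle time caused by the immediately preceding subiteration can be sketched in Python. The seed matrix rows below are a hypothetical example built to match the values quoted in the text ($X^{(2,1)} = 2$, $X^{(3,2)} = \infty$); they are not the matrix of Fig. 4. Clipping the idle time at zero is an assumption, and the contributions of subiterations d − 2 and d − 3 are deliberately left out.

```python
INF = float("inf")

def x_value(row2, row1):
    """X^(d2,d1) for two seed-matrix rows given as 0/1 lists (row2 = later subiteration)."""
    common = [l for l, (a, b) in enumerate(zip(row2, row1)) if a and b]
    if not common:
        return INF
    l = common[0]  # lowest-index common column
    return sum(row2[:l]) - sum(row1[l + 1:]) - 1

def idle_after_previous(row, prev_row, t_p):
    """Idle cycles needed when waiting only for the previous subiteration (clipped at 0)."""
    return max(0, t_p - x_value(row, prev_row))

# Hypothetical rows of a 3 x 10 seed matrix.
ROW1 = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
ROW2 = [0, 1, 1, 0, 1, 0, 0, 0, 1, 0]
ROW3 = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
T_P = 11

print(x_value(ROW2, ROW1))                    # 2
print(x_value(ROW3, ROW2))                    # inf
print(idle_after_previous(ROW2, ROW1, T_P))   # 9
print(idle_after_previous(ROW3, ROW2, T_P))   # 0
```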
Optimization of the parity check matrix
In order to minimize the total idle time of the decoder and hence substantially increase the decoder throughput, a heuristic optimization algorithm has been developed. The algorithm takes advantage of the following linear code properties:
• shuffling the rows of the parity check matrix does not change the code defined by the matrix at all,
• shuffling the columns of the parity check matrix does not change the error performance of the code; it only results in a corresponding shuffling of the bit sequence in the codeword.
Thus the algorithm consists of sorting the seed matrix columns and rows so as to minimize the total number of idle cycles in a full iteration, defined as:
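For a seed matrix with D rows, this total is simply the sum of the per-subiteration values. The expression below is written from the verbal definition, so its exact notation is an assumption.

```latex
T_{idle} = \sum_{d=1}^{D} T^{d}_{idle}
```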
The proposed algorithm is presented below as Algorithm 1. In the first step, the columns of W are sorted by growing weight (number of ones). This is motivated by the fact that more idle cycles are needed when ones are located in common columns with lower indexes (Fig. 4). The rows are then reordered on the basis of the idle cycle numbers calculated with (12)-(15). A random choice is made if more than one row satisfies the selection condition.
The described procedure is repeated a number of times, and among the obtained seed matrices the one with the lowest $T_{idle}$ is selected as the final result. As experiments have shown, 1000 repetitions are sufficient for obtaining satisfactory results (for a practical number of rows in the seed matrix: up to a few hundred). In most cases increasing this number above 1000 does not improve the results significantly.
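A rough Python sketch of such a randomized greedy search is given below. It is not a transcription of Algorithm 1: the greedy row selection, the random starting row, the wrap-around from the last row to the first, and the simplified cost (only the wait on the immediately preceding row, clipped at zero) are all assumptions made to keep the example short, and every function name is hypothetical.

```python
import random

def x_value(row2, row1):
    """X^(d2,d1) per the rules in the text; infinity if the rows share no column."""
    common = [l for l, (a, b) in enumerate(zip(row2, row1)) if a and b]
    if not common:
        return float("inf")
    l = common[0]
    return sum(row2[:l]) - sum(row1[l + 1:]) - 1

def idle(row, prev_row, t_p):
    """Simplified cost: wait for the previous row only (assumption)."""
    return max(0, t_p - x_value(row, prev_row))

def total_idle(rows, t_p):
    """Sum of the simplified cost over one iteration, with last-to-first wrap-around."""
    return sum(idle(rows[d], rows[d - 1], t_p) for d in range(len(rows)))

def optimize(w, t_p, repetitions=1000, seed=0):
    rng = random.Random(seed)
    # Step 1: sort columns by growing weight (number of ones).
    order = sorted(range(len(w[0])), key=lambda l: sum(r[l] for r in w))
    rows = [[r[l] for l in order] for r in w]
    best = (total_idle(rows, t_p), rows)
    for _ in range(repetitions):
        left = rows[:]
        chosen = [left.pop(rng.randrange(len(left)))]  # random starting row
        while left:
            costs = [idle(r, chosen[-1], t_p) for r in left]
            ties = [i for i, c in enumerate(costs) if c == min(costs)]
            chosen.append(left.pop(rng.choice(ties)))  # random choice among equal candidates
        cand = (total_idle(chosen, t_p), chosen)
        if cand[0] < best[0]:
            best = cand
    return best

# Hypothetical 4 x 6 seed matrix.
W = [[1, 1, 0, 0, 1, 0],
     [0, 1, 1, 0, 0, 1],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 0]]
total, w_opt = optimize(W, t_p=11, repetitions=200)
```

Passing a different cost function in place of `idle` would allow the waits on earlier subiterations to be taken into account without changing the search loop.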
A large number of experiments have been performed to verify the performance of the proposed optimization algorithm. Results of the optimization for several seed matrices are shown in Tables 1 and 2. We present seed matrices of different sizes and code rates R, with regular and irregular constructions. All seed matrices were constructed with the Progressive Edge Growth algorithm [14]. The seed matrix $W_{10 \times 20}$ before and after optimization is shown in Fig. 5, where black squares indicate the placement of nonzero elements. In this case all neighboring overlapping rows were eliminated by the row reordering procedure, thus for every d, $T^d_1 = 0$. The total $T_{idle}$ is nevertheless fairly large, which is the case for all small matrices, due to their relatively dense placement of nonzero elements. Table 1 presents the total idle times for the original matrix W as well as for the matrices obtained with Algorithm 1. We included results for different numbers of repetitions ($k_m$ = 100, 1000, 5000) of the main loop of the algorithm. As mentioned, more than 1000 repetitions did not improve the results in most cases, and in the few remaining cases the improvement was insignificant.
It can be seen that the number of required idle cycles after seed matrix optimization is very low compared to the number before optimization. In many cases $T_{idle} = 0$ can be obtained, especially for codes with lower rates (R = 0.5 or less). For codes with higher rates R, some nonzero idle time is usually necessary: because H has fewer rows and they have higher weights, it is harder to select "non-overlapping" rows and thus to bring the constituent $T^d_{idle}$ values to zero. Moreover, we can observe that the results are better for larger matrices (for example, compare $W_{16 \times 64}$ and $W_{25 \times 100}$), which is related to their greater sparsity.
As a result of the idle time reduction, the decoder throughput is significantly increased, as shown in Table 2. The presented throughput values depend on the number of SISO units P used in the implementation (P is usually equal to the permutation submatrix size). The throughput has been determined using synthesis results for the Xilinx XC2V3000 FPGA device, where we achieved $f_{clk} = 145$ MHz for the normalized Min-Sum message calculation algorithm [15] used in the SISO implementation. Equation (17) shows the relationship between the actual throughput TH (the number of information bits decoded per second) and the parameters of the code and decoder. The sum in Eq. (17) is the total number of cycles for decoding a single word. For the results presented in Table 2 we assumed $i_{max} = 10$ iterations, which is a common assumption for throughput calculation [9]. The numerator in Eq. (17) is the number of information bits in a single word, $K = P(L - D)$.
The typical number of SISO units is in the range P = 10 . . . 100. For example, a decoder with P = 32 SISO units and wordlength W = 6 occupies about 1/3 of the available resources of the mentioned XC2V3000 device. For a rate R = 0.5 code the throughput is then more than 60 Mb/s, about twice as much as for the decoder without the proposed parity check matrix optimization. This throughput increase is the main contribution of the work presented in this paper.
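The relationship can be illustrated with a short Python sketch. Only the number of information bits P(L − D) and the role of the cycle count are taken from the text; the way the cycle count is assembled here (one cycle per nonzero seed-matrix element plus the idle cycles, repeated $i_{max}$ times, ignoring input and output transfer) is an assumption, and the numbers in the usage lines are made up for illustration rather than taken from Table 2.

```python
def cycles_per_word(seed_ones, t_idle, i_max):
    """Assumed cycle count: one read cycle per nonzero seed element plus idle cycles,
    for each of i_max iterations (data loading and unloading are ignored)."""
    return i_max * (seed_ones + t_idle)

def throughput(p, l, d, f_clk, cycles):
    """Information bits decoded per second: P*(L-D) bits every `cycles` clock cycles."""
    return p * (l - d) * f_clk / cycles

# Hypothetical decoder: P = 32, seed matrix 10 x 20 with 60 ones, 145 MHz clock,
# with 60 idle cycles per iteration before optimization and none after.
before = throughput(32, 20, 10, 145e6, cycles_per_word(60, 60, 10))
after = throughput(32, 20, 10, 145e6, cycles_per_word(60, 0, 10))
print(round(after / before, 2))  # 2.0
```

With these contrived inputs the idle time equals the useful processing time, so removing it doubles the throughput; real gains depend on the actual seed matrix and pipeline delay.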
Conclusions
In the first part of this article we briefly described an LDPC decoder architecture based on the TDMP decoding scheme. The main drawback of a straightforward pipelined implementation is the necessity for idle cycles, which reduce the throughput. However, as was shown, the number of required idle cycles can be significantly reduced by properly reordering the columns and rows of the parity check matrix. A heuristic algorithm for such optimization of the parity check matrix has been developed. The goal of the algorithm is to minimize the total idle time of the decoder. We performed a large number of experiments for different seed matrices. The achieved reduction of the idle time is always significant, and in many cases the idle time is reduced to zero. The resulting increase in decoder throughput is also substantial. Hence the full advantage of the speedup due to pipelined processing can be taken.
