This paper proposes an efficient HDL library of processing units for generic and DVB-S2 LDPC decoders, following a modular and automatic design approach. General-purpose, low-complexity and high-throughput bit node and check node functional models are developed, in both fully serial and parallel architecture versions. A dedicated functional unit for an array processor LDPC decoder architecture for the DVB-S2 standard is also considered. Additionally, an automatic HDL code generator tool for arbitrary decoder architectures and LDPC codes is described, based on the proposed processing units and Matlab scripts.
INTRODUCTION
Low Density Parity-Check (LDPC) codes (Gallager 1962; MacKay & Neal 1996) are among the most powerful forward error correction codes known and can be used in a vast number of applications, from data storage to telecommunications. The existence of efficient encoding and decoding algorithms, combined with their good decoding performance, has attracted the attention of the scientific community and has already led to their inclusion in the recent digital video satellite broadcasting standard (DVB-S2) (ETSI 2005). Although simple, the decoding algorithm presents a significant challenge from the hardware implementation point of view.
LDPC codes are a subset of linear block codes defined by a sparse parity-check matrix H, to which, as for any linear block code, a Tanner graph (Tanner 1981) can be associated. This bipartite graph is formed by two types of nodes: Check Nodes (CN), one per code constraint (row of H), and Bit Nodes (BN), one per codeword bit (column of H), with the connections between them given by H.
The importance of the Tanner graph is reinforced by the fact that the best known LDPC decoding algorithms, namely the Sum Product Algorithm (SPA) (Gallager 1962; Chen & Fossorier 2002), are all derived from its structure. The iterative procedure is based on an exchange of messages between the BN's and CN's of the Tanner graph. These messages carry beliefs about the value of each codeword bit and are represented either as probabilities or, more compactly, as log-likelihood ratios (LLR). The iterative procedure stops when a valid codeword is obtained or when the maximum number of iterations is reached (in which case a decoder failure is declared). A simple iterative decoder can thus be constructed by treating each CN and BN of the Tanner graph as a processing unit, and the connections between them as bidirectional communication channels through which the processed information is sent. In this paper we propose a generic hardware implementation for the CN and BN processing units.
A fully parallel decoder is impractical for codes of length 64800, such as those adopted by the DVB-S2 standard, because of the large silicon area that such an implementation would need, imposed not only by the high number of processing units but also by the huge number of connections between them (which creates severe routing problems).
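As a minimal illustration of how H defines the Tanner graph, the sketch below builds the two adjacency lists from a small parity-check matrix. The matrix is a hypothetical toy example (not a DVB-S2 code), and the names `tanner_graph`, `N` and `M` are chosen here for illustration only.

```python
# Toy parity-check matrix (hypothetical): 3 check nodes, 6 bit nodes.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def tanner_graph(H):
    """Return (N, M): N[m] lists the BNs on CN m, M[n] lists the CNs on BN n."""
    n_checks, n_bits = len(H), len(H[0])
    # Row m of H gives the neighbours of check node m.
    N = [[n for n in range(n_bits) if H[m][n]] for m in range(n_checks)]
    # Column n of H gives the neighbours of bit node n.
    M = [[m for m in range(n_checks) if H[m][n]] for n in range(n_bits)]
    return N, M

N, M = tanner_graph(H)
n_edges = sum(len(row) for row in N)  # one edge per non-zero entry of H
```

Each non-zero entry of H corresponds to one edge, that is, one bidirectional message channel in a decoder built directly from the graph.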
Following this line of thought, Kienle et al. (2005) proposed a partially parallel architecture in which processing units are shared by groups of nodes, which allows a drastic reduction of the silicon area used.
Another advantage of their implementation is that it exploits the particular characteristics, namely the periodicities, of the subset of LDPC codes adopted by the DVB-S2 standard (ETSI 2005), known as LDPC-IRA (LDPC Irregular Repeat-Accumulate) codes. This allows the decoder to work in a reconfigurable way.
The fact that LDPC decoders can be built following a modular approach allows the use of auxiliary tools and libraries in their development. It is possible to design Matlab application scripts that, given certain parameters, create and connect the full set of module units needed for each decoder according to the target architecture. Furthermore, these scripts can generate HDL code automatically, since the number of module units and their interconnections depend only on the parity-check matrix H of the code.
In the following sections we describe in further detail the proposed HDL models for each processing unit. Section 2 presents a short description of the LDPC-IRA codes and the special characteristics of the ones adopted by the DVB-S2 standard. Section 3 presents a brief review of the sum product algorithm in the logarithmic domain (LSPA) following the traditional flooding schedule approach. Alternative scheduling methods that speed up the convergence of the LSPA are also mentioned in that section. In Section 4, generic hardware modules are proposed for the basic processing units of an LDPC decoder. Section 5 describes the particular characteristics of a generic processing unit for an array processor DVB-S2 LDPC decoder. Finally, Section 6 describes the procedure for automatically generating Verilog/VHDL code for an LDPC decoder based on simple Matlab application scripts and previously developed libraries.
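The idea that the decoder structure follows directly from H can be sketched as below. The paper's tool is based on Matlab scripts; Python is used here only for illustration, and the module names `bn_unit` and `cn_unit`, their `W` parameter and their port names are hypothetical placeholders rather than the library's actual interface.

```python
def generate_decoder_top(H, width=6):
    """Emit Verilog-style text with one processing unit per Tanner-graph node.

    Module and port names are hypothetical; only the counts and the wiring
    pattern are derived from H.
    """
    n_checks, n_bits = len(H), len(H[0])
    lines = [f"// generated from a {n_checks}x{n_bits} parity-check matrix"]
    # One pair of message wires per non-zero entry of H (one graph edge).
    for m in range(n_checks):
        for n in range(n_bits):
            if H[m][n]:
                lines.append(f"wire signed [{width - 1}:0] lq_{n}_{m}, lr_{m}_{n};")
    # One BN unit per column of H, wired to the CNs found in that column.
    for n in range(n_bits):
        cns = [m for m in range(n_checks) if H[m][n]]
        lr = ", ".join(f"lr_{m}_{n}" for m in cns)
        lq = ", ".join(f"lq_{n}_{m}" for m in cns)
        lines.append(f"bn_unit #(.W({len(cns)})) bn_{n} "
                     f"(.lr_in({{{lr}}}), .lq_out({{{lq}}}));")
    # One CN unit per row of H, wired to the BNs found in that row.
    for m in range(n_checks):
        bns = [n for n in range(n_bits) if H[m][n]]
        lq = ", ".join(f"lq_{n}_{m}" for n in bns)
        lr = ", ".join(f"lr_{m}_{n}" for n in bns)
        lines.append(f"cn_unit #(.W({len(bns)})) cn_{m} "
                     f"(.lq_in({{{lq}}}), .lr_out({{{lr}}}));")
    return "\n".join(lines)

netlist = generate_decoder_top([[1, 1, 0], [0, 1, 1]])
```

The sketch covers only the fully parallel case, where every node gets its own unit; a partially parallel architecture would additionally need memories and control, which are outside its scope.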
LDPC-IRA CODES
The new satellite digital video broadcasting standard (DVB-S2) adopted a special class of LDPC codes known as IRA codes (Eroz, Sun & Lee 2004) as the main solution for the FEC system. LDPC-IRA codes combine the powerful error correction capabilities of LDPC codes with linear encoding complexity. In fact, although the parity-check matrix H of an LDPC code is sparse, the generator matrix needed for encoding, which is obtained from H by Gaussian elimination, is in general not sparse, leading to storage and encoding complexity problems.
By restricting the H matrix to be of the form

$$H = [\, A \mid B \,],$$

where A is a random sparse matrix and B a staircase lower triangular one, we can obtain an LDPC code with almost the same performance (less than 0.1 dB loss) as the best known LDPC codes of the same dimensions, with linear encoding complexity. The obtained code is systematic, $c = [\, i \;\; p \,]$, with the message/information bits, $i$, being associated with the A matrix, and the parity-check bits, $p$, with the B matrix.
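The linear encoding complexity can be illustrated with a short sketch. It assumes that H = [A | B] and that the staircase matrix B has ones on its main diagonal and on its first sub-diagonal, so that each parity bit is the previous parity bit plus one row of A applied to the information bits (an accumulator). The function names and the toy matrix A are hypothetical.

```python
def ira_encode(A, info_bits):
    """Systematic encoding c = [i p] for H = [A | B] with a staircase B."""
    parity, acc = [], 0
    for row in A:
        s = 0
        for a, b in zip(row, info_bits):
            s ^= a & b            # row of A times the information bits (mod 2)
        acc ^= s                  # accumulator: p_j = p_(j-1) + s_j (mod 2)
        parity.append(acc)
    return list(info_bits) + parity

def satisfies_parity(A, codeword):
    """Check H c^T = 0 with H = [A | B] built explicitly (for testing only)."""
    r = len(A)
    B = [[1 if j in (i, i - 1) else 0 for j in range(r)] for i in range(r)]
    H = [A[i] + B[i] for i in range(r)]
    return all(sum(h & c for h, c in zip(row, codeword)) % 2 == 0 for row in H)

A = [[1, 0, 1],
     [1, 1, 0],
     [0, 1, 1]]                   # toy matrix, stored densely for simplicity
codeword = ira_encode(A, [1, 0, 1])
```

With A stored in sparse form, the work is proportional to the number of ones in A plus one XOR per parity bit, and no generator matrix is needed.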
H Periodicity
The H matrices of the DVB-S2 LDPC codes have other properties beyond being of IRA type. Some periodicity constraints were imposed on the pseudo-random construction of the A matrices, which allows a significant reduction in the storage required for their description and also the design of efficient decoding architectures (Kienle et al. 2005). The construction of matrix A is based on dividing the IN's (the BN's associated with information bits) into groups of $M$ consecutive ones. All the IN's of a group, say group $l$, have the same weight, $w_l$, and it is only necessary to choose the CN's that connect to the first IN of the group in order to specify the CN's that connect to each one of the remaining $M-1$ IN's of the group.
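The sketch below shows how the connections of a whole group can be derived from those of its first IN. The specific rule used, CN index (r + i·q) mod (n − k) with q = (n − k)/M, is the addressing rule of the DVB-S2 standard and is an assumption here, since it is not stated in the text above; the toy sizes and the function name are hypothetical (DVB-S2 itself uses M = 360).

```python
def group_connections(first_in_cns, M, n_minus_k):
    """CN indices for each of the M INs of a group, from those of the first IN.

    Assumed rule (from the DVB-S2 standard, not from this text):
    IN i of the group connects to CN (r + i * q) mod (n - k), q = (n - k) / M.
    """
    if n_minus_k % M != 0:
        raise ValueError("n - k must be a multiple of M")
    q = n_minus_k // M
    return [[(r + i * q) % n_minus_k for r in first_in_cns] for i in range(M)]

# Toy sizes: 12 check nodes, groups of M = 4 INs, so q = 3.
connections = group_connections([0, 7], M=4, n_minus_k=12)
```

Only the list for the first IN of each group has to be stored, which is the source of the storage reduction mentioned above.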
SOFT-DECODING
The best known LDPC decoding algorithms (Gallager 1962) are based on iterative message passing between the BN's and CN's of the Tanner graph, with the messages carrying beliefs about the value of each codeword bit.
Given an LDPC code, the following notation is used:
• $Lq_{nm}$ - The LLR of BN $n$ that is sent to CN $m$, calculated from all the messages received from the CN's $M(n) \setminus m$ (the CN's connected to BN $n$, excluding $m$) and the channel information, $LP_n$.
• $LQ_n$ - The a posteriori LLR of BN $n$.
Traditional Flooding-Schedule
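Traditionally, the LDPC iterative decoding procedure follows the so-called flooding schedule approach: all messages sent by the BN's are updated together before being sent to the CN processing units, and vice versa. The Sum Product Algorithm (SPA) in the logarithmic domain, following this schedule, starts with the messages $Lq_{nm}$ initialized with the channel information $LP_n$ and then repeats the steps below, where $N(m)$ denotes the set of BN's connected to CN $m$ and $M(n)$ the set of CN's connected to BN $n$.
Iterative body:
A. Calculate the log-likelihood ratio of the message sent from CN $m$ to BN $n$:
$$Lr_{mn} = \mathop{\boxplus}_{n' \in N(m) \setminus n} Lq_{n'm}, \qquad a \boxplus b = \ln\frac{1 + e^{a+b}}{e^{a} + e^{b}} \qquad (5)$$
B. Calculate the log-likelihood ratio of the message sent from BN $n$ to CN $m$:
$$Lq_{nm} = LP_n + \sum_{m' \in M(n) \setminus m} Lr_{m'n} \qquad (6)$$
C. Compute the a posteriori pseudo-probabilities and perform hard decoding:
$$LQ_n = LP_n + \sum_{m' \in M(n)} Lr_{m'n}, \qquad (7)$$
with the hard decision $\hat{c}_n$ taken from the sign of $LQ_n$.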
The iterative procedure is stopped if the decoded word $\hat{c}$ satisfies all the parity-check equations of the code ($\hat{c} H^T = 0$) or if the maximum number of iterations is reached.
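The flooding schedule and the stopping rule can be sketched in software as follows. This is a floating-point reference model under stated assumptions, not the hardware implementation: the CN update uses the standard boxplus operation, the BN update and a posteriori LLR are the usual sums of channel information and incoming CN messages, and a negative LLR is assumed to decide for bit 1. The function names and the toy matrix are hypothetical, and every CN is assumed to have at least two neighbours.

```python
import math

def boxplus(a, b):
    """ln((1 + e^(a+b)) / (e^a + e^b)), written in a numerically stable form."""
    sign = (-1.0 if a < 0 else 1.0) * (-1.0 if b < 0 else 1.0)
    return (sign * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))

def boxplus_all(values):
    acc = values[0]
    for v in values[1:]:
        acc = boxplus(acc, v)
    return acc

def lspa_flooding(H, LP, max_iter=20):
    n_checks, n_bits = len(H), len(H[0])
    N = [[n for n in range(n_bits) if H[m][n]] for m in range(n_checks)]
    M = [[m for m in range(n_checks) if H[m][n]] for n in range(n_bits)]
    Lq = {(n, m): LP[n] for n in range(n_bits) for m in M[n]}  # initialization
    Lr = {}
    c_hat = [0] * n_bits
    for iteration in range(1, max_iter + 1):
        # Step A: every CN message is computed from the previous BN messages.
        for m in range(n_checks):
            for n in N[m]:
                Lr[(m, n)] = boxplus_all([Lq[(k, m)] for k in N[m] if k != n])
        # Step B: every BN message is computed from the new CN messages.
        for n in range(n_bits):
            for m in M[n]:
                Lq[(n, m)] = LP[n] + sum(Lr[(j, n)] for j in M[n] if j != m)
        # Step C: a posteriori LLRs and hard decision (assumed: LLR < 0 -> 1).
        LQ = [LP[n] + sum(Lr[(m, n)] for m in M[n]) for n in range(n_bits)]
        c_hat = [1 if v < 0 else 0 for v in LQ]
        # Stop as soon as every parity-check equation is satisfied.
        if all(sum(c_hat[n] for n in N[m]) % 2 == 0 for m in range(n_checks)):
            return c_hat, iteration, True
    return c_hat, max_iter, False

H = [[1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 0, 1, 0, 0, 1]]
# All-zero codeword sent; the fourth LLR has the wrong sign.
result = lspa_flooding(H, [2.0, 2.0, 2.0, -0.5, 2.0, 2.0])
```

In this toy run the unreliable bit is corrected by the single check it takes part in, so the decoder stops after the first iteration.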
Alternative Scheduling Methods
It is well known that the SPA, following the traditional flooding-schedule message updating rule, is an optimum a posteriori probability (APP) decoding method when applied to codes described by Tanner graphs without cycles (Kschischang et al. 2001). However, good codes always have cycles, and the short ones tend to degrade the performance of iterative message-passing algorithms, leading to results far from optimal. Motivated by this problem and by the goal of speeding up convergence, new message-passing schedules have been proposed (Zhang & Fossorier 2002; Sharon et al. 2004; Xiao & Banihashemi 2004).
With the flooding schedule, the messages sent by the BN's are all updated (in a serial or parallel manner) before the CN messages can be updated, and vice versa. At each step, the messages used in the computation of a new message all come from the previous iteration. A different approach is to use new information as soon as it is available, so that the next node to be updated can use more up-to-date (fresh) information. This can be done, for example, following two different strategies, known as horizontal and vertical scheduling, with a considerable reduction in the number of iterations needed to reach a valid codeword (Sharon et al. 2004).
The vertical schedule operates along the BN's, which are processed serially. After a BN, say $n$, is processed, the messages sent by the CN's connected to it are updated according to (5), taking into account the fresh information, $Lq_{nm}$, received from BN $n$. This way, the next BN to be processed receives more up-to-date information.
The horizontal schedule is similar to the vertical schedule, the only difference being that it operates along the CN's.
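One common way to realise a horizontal schedule in software is sketched below: the CN's are processed one after another, and the a posteriori LLR of each BN is refreshed immediately, so that the next CN already sees the new value. This is a floating-point sketch under the same assumptions as a standard log-domain SPA (boxplus CN update, negative LLR decides for bit 1); it is not claimed to be the exact procedure of the cited works, and the function names and toy matrix are hypothetical.

```python
import math

def boxplus(a, b):
    """ln((1 + e^(a+b)) / (e^a + e^b)), written in a numerically stable form."""
    sign = (-1.0 if a < 0 else 1.0) * (-1.0 if b < 0 else 1.0)
    return (sign * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))

def lspa_horizontal(H, LP, max_iter=20):
    n_checks, n_bits = len(H), len(H[0])
    N = [[n for n in range(n_bits) if H[m][n]] for m in range(n_checks)]
    LQ = list(LP)                                   # running a posteriori LLRs
    Lr = {(m, n): 0.0 for m in range(n_checks) for n in N[m]}
    c_hat = [0] * n_bits
    for iteration in range(1, max_iter + 1):
        for m in range(n_checks):                   # CNs processed serially
            # BN-to-CN messages: remove CN m's previous contribution.
            Lq = {n: LQ[n] - Lr[(m, n)] for n in N[m]}
            for n in N[m]:
                acc = None
                for k in N[m]:
                    if k != n:
                        acc = Lq[k] if acc is None else boxplus(acc, Lq[k])
                Lr[(m, n)] = acc
                LQ[n] = Lq[n] + acc                 # fresh value for later CNs
        c_hat = [1 if v < 0 else 0 for v in LQ]
        if all(sum(c_hat[n] for n in N[m]) % 2 == 0 for m in range(n_checks)):
            return c_hat, iteration, True
    return c_hat, max_iter, False

H = [[1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 0, 1, 0, 0, 1]]
result = lspa_horizontal(H, [2.0, 2.0, 2.0, -0.5, 2.0, 2.0])
```

Unlike the flooding schedule, the second CN here already works with LLRs that include the first CN's new messages from the same iteration.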
PROCESSING UNITS FOR A GENERIC LDPC DECODER
As already mentioned, a simple iterative decoder can be constructed by treating each CN and BN of the Tanner graph as a processing unit, and the connections between them as bidirectional communication channels through which the processed information is sent. Yet, from the hardware implementation point of view, this approach presents some disadvantages (mainly for long and unstructured LDPC codes), such as the high number of processing units required and the huge number of connections between them, which creates severe routing problems. However, even for the best known hardware-efficient structured LDPC codes, such as the ones recently adopted by the DVB-S2 standard (ETSI 2005; Kienle et al. 2005), or for LDPC decoders following different scheduling approaches, the updating procedure of a single BN or a single CN remains unchanged. Elementary hardware processing units can therefore be developed for both CN and BN and, thus, LDPC decoders can be constructed following a modular approach.
BN Processing Unit
A BN processor should calculate the log-likelihood ratio messages sent from the assigned BN to its neighbouring CN's, compute the a posteriori pseudo-probability associated with the current BN, and perform hard decoding by taking a decision about its bit value. Considering a BN of weight $w$, the BN processor can be seen as a black box with $w+1$ inputs, through which it receives the channel information plus the $w$ messages, $Lr_{mn}$, sent by the CN's connected to it, and with $w+1$ outputs, through which it communicates the hard decoding decision and sends the $w$ messages, $Lq_{nm}$, to the CN's connected to it.
Observing equations (6) and (7), we note that the message sent from BN $n$ to CN $m$ can easily be obtained by
$$Lq_{nm} = LQ_n - Lr_{mn}.$$
The computation procedure can thus be optimized and done in serial or parallel mode.
In a parallel version, the inputs are all added together, producing the value of the a posteriori pseudo-probability, $LQ_n$. The output messages can then be computed simultaneously by subtracting each input from the output of this adder. This type of implementation requires an adder capable of adding $w+1$ inputs of $x$ bits, as well as $w$ $x$-bit adders to perform the $w$ subtractions. A high number of gates is therefore required to implement a single processing unit, but the delay is minimal (high throughput), which allows the clock frequency to be lowered and, with it, the power consumption. Alternatively, in a serial version, the inputs are added recursively, as shown in figure 2. The Reg_Sum register is initialized with the received channel information. The output messages can be obtained in a parallel manner, as in figure 1, or using a fully serial approach, as shown in figure 2, with a new message being obtained at each clock cycle.
This implementation minimizes the hardware complexity (measured in number of logic gates) at the cost of a significant increase in processing time (timing constraints may require a higher clock frequency). The serial implementation also has the advantage of supporting BN's of any weight, at the expense of a little additional control.
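The two BN processor variants can be modelled behaviourally as below. This is a software sketch of the arithmetic only (sum everything, then subtract each input), using integers to stand for fixed-point values; the function names are hypothetical, and saturation and bit widths are deliberately ignored.

```python
def bn_parallel(LP, Lr_in):
    """Parallel model: one (w+1)-input adder followed by w subtractors."""
    LQ = LP + sum(Lr_in)                 # a posteriori value
    Lq_out = [LQ - r for r in Lr_in]     # all w outputs at once
    return LQ, Lq_out

def bn_serial(LP, Lr_in):
    """Serial model: one accumulator register and one subtractor."""
    reg_sum = LP                         # Reg_Sum starts with the channel value
    for r in Lr_in:                      # one input per clock cycle
        reg_sum += r
    Lq_out = []
    for r in Lr_in:                      # one output per clock cycle
        Lq_out.append(reg_sum - r)
    return reg_sum, Lq_out

def hard_decision(LQ):
    # Assumed sign convention: a negative LLR decides for bit 1.
    return 1 if LQ < 0 else 0

LQ, Lq_out = bn_parallel(3, [5, -2, 4])
```

Both models produce the same values; they differ only in how many operations are performed per clock cycle, which is the area versus latency trade-off discussed above. The serial loop also works unchanged for any number of inputs.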
CN Processing Unit
An approach similar to the one used in the previous section can be followed in the computation of the $Lr_{mn}$ messages sent by a CN. In fact, the boxplus operation defined in (5) can be reversed: if $c = a \boxplus b$, then
$$a = c \boxminus b = \ln\frac{1 - e^{c+b}}{e^{c} - e^{b}},$$
so that each $Lr_{mn}$ can be obtained by removing, with a boxminus operation, the input $Lq_{nm}$ from the boxplus-sum of all the inputs of the CN. Also, equation (5) can be rewritten in the following way:
$$a \boxplus b = \ln\frac{1+e^{a+b}}{e^{a}+e^{b}} = \mathrm{sign}(a)\,\mathrm{sign}(b)\,\min(|a|,|b|) + \ln\left(1+e^{-|a+b|}\right) - \ln\left(1+e^{-|a-b|}\right).$$
The correction functions $\ln(1+e^{-|x|})$ contain logarithmic operators whose hardware implementation consumes a significant number of resources. Their implementation can be significantly simplified by approximating them with fixed-point piecewise linear functions, namely with multiplying factors that are powers of two (shifts and adders) (Hu et al. 2001; Masera et al. 2005).
Boxplus and boxminus operations can both be implemented at the cost of four additions, one comparison and two corrections, each involving a shift and a constant addition, as shown in figure 3 and figure 4. Sometimes the boxplus operation is simplified even further, with a small decrease in performance, by considering a null correction factor. This simplification of the SPA is known as Min-Sum (Chen & Fossorier 2002; Hu et al. 2001).
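The structure of these operations, a sign/minimum core plus two correction terms, can be sketched in floating point as follows. The exact correction functions are used here; the piecewise linear, shift-based approximations of the hardware modules are not reproduced, since their coefficients are not given in the text. The boxminus expression is the standard inverse of boxplus, and all function names are hypothetical.

```python
import math

def _core(a, b):
    """Common part of both operations: sign(a) * sign(b) * min(|a|, |b|)."""
    sign = (-1.0 if a < 0 else 1.0) * (-1.0 if b < 0 else 1.0)
    return sign * min(abs(a), abs(b))

def boxplus(a, b):
    return (_core(a, b)
            + math.log1p(math.exp(-abs(a + b)))     # first correction
            - math.log1p(math.exp(-abs(a - b))))    # second correction

def boxminus(c, b):
    """Inverse of boxplus: returns a such that boxplus(a, b) == c.

    Requires |c| != |b|; otherwise the logarithm argument is zero.
    """
    return (_core(c, b)
            - math.log1p(-math.exp(-abs(c - b)))
            + math.log1p(-math.exp(-abs(c + b))))

def min_sum(a, b):
    """Boxplus with a null correction factor (Min-Sum simplification)."""
    return _core(a, b)

c = boxplus(1.2, -3.0)
recovered = boxminus(c, -3.0)
```

Dropping the two correction terms leaves only the comparison and sign logic, which is why Min-Sum is cheaper at the price of a small performance loss.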
Based on the proposed boxplus and boxminus hardware modules, it is possible to adopt a serial or a parallel configuration for the CN processor (similar to the ones described for the BN processor unit). Nevertheless, due to the complexity of the boxplus operation, a parallel implementation requires a boxplus-sum chain of all inputs, as shown in figure 5. The advantages of one configuration over the other are similar to the ones mentioned for the BN processor. However, it should be noted that the ratio between the silicon area occupied by a parallel implementation and that occupied by a serial implementation is in this case significantly higher than for the BN processor, due to the number of operations involved in the boxplus and boxminus processing. In fact, the number of gates required by the boxplus and boxminus processing units is higher than that required by the ordinary add and subtract arithmetic operations.
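A behavioural sketch of a CN processor built this way is given below: all inputs are first combined with a boxplus chain, and each output is then obtained by removing the corresponding input with boxminus. The closed-form logarithmic expressions are used for brevity, in floating point; a direct computation that leaves one input out is included only to check the result. Function names are hypothetical, and the inputs are assumed to be moderate in size and distinct in magnitude from the total.

```python
import math

def boxplus(a, b):
    return math.log((1 + math.exp(a + b)) / (math.exp(a) + math.exp(b)))

def boxminus(c, b):
    # Inverse of boxplus; assumes c was produced by a boxplus chain containing b.
    return math.log((1 - math.exp(c + b)) / (math.exp(c) - math.exp(b)))

def cn_unit(Lq_in):
    """Boxplus-sum of all inputs, then one boxminus per output message."""
    total = Lq_in[0]
    for q in Lq_in[1:]:                  # serial: one input per clock cycle
        total = boxplus(total, q)
    return [boxminus(total, q) for q in Lq_in]

def cn_direct(Lq_in):
    """Reference: each output is the boxplus of all the other inputs."""
    out = []
    for i in range(len(Lq_in)):
        others = Lq_in[:i] + Lq_in[i + 1:]
        acc = others[0]
        for q in others[1:]:
            acc = boxplus(acc, q)
        out.append(acc)
    return out

inputs = [1.5, -2.0, 0.8, 3.0]
Lr_out = cn_unit(inputs)
```

For a CN of weight w, the direct computation needs w(w - 2) boxplus operations, whereas the total-then-remove structure needs w - 1 boxplus and w boxminus operations.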
PROCESSING UNIT FOR A DVB-S2 LDPC DECODER
The particular characteristics of the LDPC-IRA codes adopted by the DVB-S2 standard make it possible to devise more efficient decoder solutions that overcome the evident limitations of a fully parallel architecture. Figure 7 presents the basic architecture of a partially parallel array processor decoder for DVB-S2 LDPC codes (Kienle et al. 2005). This efficient architecture not only exploits the periodicities of the adopted LDPC-IRA codes, but also has the great advantage of supporting all code rates and code lengths defined by the DVB-S2 standard through a simple reconfiguration mechanism.
In this section we suggest a possible implementation for each processor, or functional processing unit (FU), that merges the functions performed by the BN and CN units. Since messages sent from CN's to BN's are computed from the messages previously received from BN's, and vice versa, a message value can be discarded once used, and the memory location it occupies can be reused to store the newly computed message. The shuffling network is responsible for the correct exchange of messages between the CN's and BN's, emulating the Tanner graph.
Given the zigzag connectivity between PN's (the BN's associated with parity bits) and CN's, the PN's and IN's are updated following different scheduling methods. The traditional flooding schedule is applied to the IN's, while the PN's are updated jointly with the CN's following the horizontal schedule approach. This requires some modifications to the CN processing unit of figure 6 in order to construct the basic functional unit.
As mentioned, a single FU is shared by a constant number of IN's, CN's and PN's (CN's and PN's are processed jointly), depending on the code length and rate. More precisely, for a ( )
In CN mode, each FU updates not only the associated CN's but also the corresponding PN's (note that for each CN constraint there is one PN bit). Given the zigzag connectivity between PN's and CN's, a PN, say $m$, updated according to (6) works as a simple passing node, because the message that it sends to CN $m+1$ is simply the message received from CN $m$ added to the channel information, and vice versa (see figure 8). Since each FU processes $q$ consecutive CN's, the PN update can follow a horizontal schedule approach (both PN's and CN's processed simultaneously). This way, the message that travels through CN $m$, PN $m$ and CN
