Abstract: Through a rapid survey of the architecture of low-density parity-check (LDPC) decoders, this paper proposes a general framework to describe and compare the LDPC decoder architectures. A set of parameters makes it possible to classify the scheduling of iterative decoders, memory organization, and type of check-node processors and variable-node processors. Using the proposed framework, an efficient generic architecture for nonflooding schedules is also given.
part of the answer. We suggest also considering the complexity of the LDPC code itself.
The first part of the next section presents the notations that will be used throughout this paper. It also recalls briefly the bipartite graph representation, which is very convenient when dealing with the LDPC codes. In the second part of Section II, the decoding algorithm for LDPC codes is briefly recalled, independent of scheduling.
In Section III, three classical schedules are presented: the flooding schedule and the two shuffle schedules (the horizontal one and the vertical one from [7] and [8]). In Section IV, an evaluation of the complexity for decoding the LDPC codes is proposed. The purpose is to compare different LDPC codes and to help the designer evaluate the message-passing structure needed to meet the required specifications.
In Section V, a generic architecture for LDPC code decoders is proposed. It is specified by several parameters related either to the datapath or to the processing modes of variable and check nodes. We show that the combination of these parameters enables us to describe most of the published LDPC decoders, and, hence, to compare them. Based on this generic architecture, the synthesis of a new architecture for LDPC code decoders is proposed, implementing a fast converging decoding algorithm. Section VII summarizes the results and concludes this paper.
II. DECODING ALGORITHMS
The most popular LDPC decoding algorithm is the belief propagation (BP) algorithm, which is optimal if the graph of the code does not contain any cycles. Although the graph of the LDPC codes does contain cycles, this algorithm is still used and is considered as a reference. Before describing this algorithm, we introduce the notations that will be used hereafter.
A. Notations, Bipartite Graphs
An LDPC code or a repeat-accumulate (RA) code of size $N$ and rate $R$ can be represented as a bipartite graph, where the $N$ bits are represented by $N$ variable nodes $v_n$. Each variable node is connected to some of the $M$ parity-check nodes, $M \geq (1 - R) \times N$. We denote by $\mathcal{M}(n)$ (resp. $\mathcal{N}(m)$) the set of all the parity-check indices (resp. variable indices) that are connected to the variable $v_n$ (resp. parity-check $c_m$). We denote also by $\mathcal{M}(n) \setminus m$ (resp. $\mathcal{N}(m) \setminus n$) the set of the parity-check (resp. variable) indices that are connected to the variable $v_n$ (resp. parity-check $c_m$) without the parity-check $c_m$ (resp. variable $v_n$). A cycle on the graph is defined as a closed path. Finally, we denote by $|A|$ the cardinality of the set $A$. Thus, the parity-check $c_m$ is connected to $|\mathcal{N}(m)|$ variables and the variable node $v_n$ is connected to $|\mathcal{M}(n)|$ check nodes. The degree of a node is the number of edges connected to it; so the variable node $v_n$ has degree $|\mathcal{M}(n)|$ and the check node $c_m$ has degree $|\mathcal{N}(m)|$. An LDPC code is said to be regular if the degree $d_v$ of its variable nodes and the degree $d_c$ of its check nodes are constants.
0090-6778/$25.00 © 2007 IEEE
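The notation above can be made concrete with a short sketch. The parity-check matrix `H` below is a hypothetical toy example (it is not taken from this paper), and the helper name `neighbour_sets` is likewise an assumption made for illustration. The sketch builds the index sets written M(n) and N(m) in the text, counts the edges of the bipartite graph, and tests regularity.

```python
# Hypothetical toy parity-check matrix: M = 3 checks (rows), N = 6 variables (columns).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]

def neighbour_sets(H):
    """Return (M_of, N_of): M_of[n] lists the checks connected to variable n,
    N_of[m] lists the variables connected to check m."""
    num_checks, num_vars = len(H), len(H[0])
    N_of = [[n for n in range(num_vars) if H[m][n]] for m in range(num_checks)]
    M_of = [[m for m in range(num_checks) if H[m][n]] for n in range(num_vars)]
    return M_of, N_of

M_of, N_of = neighbour_sets(H)

# One edge per nonzero entry of H; summing check degrees counts each edge once.
num_edges = sum(len(s) for s in N_of)

# Regular code: all variable degrees equal and all check degrees equal.
is_regular = (len({len(s) for s in M_of}) == 1
              and len({len(s) for s in N_of}) == 1)
```

In this toy matrix every check has degree 3, but the variable degrees differ, so the code is not regular in the sense defined above.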
B. Decoding With the BP-Algorithm
Let $E_{m,n}$ denote the message from check $c_m$ to variable $v_n$. Similarly, let $T_{n,m}$ denote the message from variable $v_n$ to check $c_m$. Each node of the graph (check node or variable node) is replaced by a processor whose input-output ports are the connections of the graph. The BP algorithm describes the behavior of each type of processor:
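1) a variable-processor $V_n$ has to compute the output messages $T_{n,m}$ using the input messages $E_{m,n}$ according to

$$T_{n,m} = I_n + \sum_{m' \in \mathcal{M}(n) \setminus m} E_{m',n} = I_n + E_n - E_{m,n}$$

where

$$E_n = \sum_{m \in \mathcal{M}(n)} E_{m,n}$$

is the extrinsic information of the variable $v_n$. The variable $I_n$ associated with each variable $v_n$ is called the intrinsic information. $I_n = 2 y_n / \sigma^2$ in the case of a binary phase-shift keying (BPSK) modulation over an additive white Gaussian noise (AWGN) channel of variance $\sigma^2$, where $y_n$ is the observation of the $n$th received symbol of the codeword.
2) a check-processor $C_m$ has to compute the output messages $E_{m,n}$ using the input messages $T_{n,m}$ according to the function $F$:

$$E_{m,n} = F_{n' \in \mathcal{N}(m) \setminus n}\left(T_{n',m}\right) = \left( \prod_{n' \in \mathcal{N}(m) \setminus n} \operatorname{sign}\left(T_{n',m}\right) \right) f\left( \sum_{n' \in \mathcal{N}(m) \setminus n} f\left(\left|T_{n',m}\right|\right) \right)$$

where $f(x) = -\ln(\tanh(|x|/2))$.
The processors can operate independently: they sample their input messages and compute their output messages. For a graph without cycles, the algorithm converges toward a unique solution whatever the sample times are.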
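As an illustration of the two processor types, the following minimal sketch assumes the standard log-domain BP updates: each variable-to-check message is the intrinsic value plus all incoming check messages except the one on the target edge, and each check-to-variable message combines a sign product with the function f applied to a sum of f values. The function names, the dictionary-based ports, and the clamping constants are assumptions made for this example, not part of the paper.

```python
import math

def f(x):
    """f(x) = -ln(tanh(|x|/2)); the argument is clamped so the log stays finite."""
    x = min(max(abs(x), 1e-12), 30.0)
    return -math.log(math.tanh(x / 2.0))

def variable_processor(I_n, E_in):
    """E_in maps check index m -> E_{m,n}. Returns a dict m -> T_{n,m}."""
    total = I_n + sum(E_in.values())          # intrinsic plus extrinsic information
    return {m: total - e for m, e in E_in.items()}

def check_processor(T_in):
    """T_in maps variable index n -> T_{n,m}. Returns a dict n -> E_{m,n}."""
    out = {}
    for n in T_in:
        sign, acc = 1.0, 0.0
        for n2, t in T_in.items():
            if n2 != n:                       # exclude the message from the target edge
                if t < 0:
                    sign = -sign
                acc += f(t)
        out[n] = sign * f(acc)
    return out

T = variable_processor(0.5, {0: 1.0, 1: -2.0})
E = check_processor({0: 1.0, 1: 2.0, 2: -3.0})
```

Because f is its own inverse on positive arguments, a degree-2 check simply forwards the other input, which gives a quick sanity check of the implementation.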
III. DECODING SCHEDULES
When the graph of the code contains cycles, the order of the sample times among the processors influences the results of the message-passing algorithm. The schedule denotes the given order of the sampling times between all the node processors. Not all the schedules yield the same results: the unavoidable existence of cycles in an LDPC code of finite length generates a correlation between the outgoing and incoming messages and yields some self-information behavior that decreases the performance of the iterative decoding process [11], [12] and introduces pseudocodewords [13]. The loss in performance increases as the cycles become shorter. The smallest cycle length in the graph is called the girth. A probabilistic schedule that achieves good performance was first presented by Mao and Banihashemi [14]. It will not be addressed, however, in this paper, since we only consider schedules updating all the edges in one iteration.
A. Flooding Schedules
The most popular schedule associated with the BP algorithm is the flooding schedule (FS). All the variable-processors sample their input at the same time and, then, all the check-processors sample their input at the same time. Once all the messages have been processed, one decoding iteration has been completed. Iterations are repeated as long as required. The flooding schedule is summed up in Algorithm 1. It can be noted that in this scheduling, the computation of the extrinsic information $E_n^{(i)}$ of a given variable $v_n$ is performed in one step of the algorithm (line number 7).
It is possible to have a different and equivalent representation of the FS, closer to some hardware implementations, combining the operations processed in the check nodes and in the variable nodes. For example, the variable-node processing can be distributed during the whole iteration, while the check nodes are processed sequentially (singly or in groups of $P$). More precisely, the computation of the extrinsic information is not processed in a single step, as indicated in Algorithm 2 on line 10. In this case, memories are required in the variable-node processor to save the accumulation of the check-to-variable messages of both the current and the previous iteration ($E_n^{(i)}$ and $E_n^{(i-1)}$, respectively). This schedule will be denoted the parity-check flooding schedule (FS-P). It is also possible to process the variables sequentially with the processing of the check-processors being distributed. It is, then, denoted by the variable flooding schedule (FS-V).
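The flooding schedule can be simulated in software along the following lines. This is a minimal sketch under stated assumptions and is not a transcription of Algorithm 1: the toy matrix `H`, the intrinsic values `I`, the early stop on a zero syndrome, and the convention that a positive value means bit 0 are all choices made for this example.

```python
import math

def f(x):
    """f(x) = -ln(tanh(|x|/2)); the argument is clamped so the log stays finite."""
    x = min(max(abs(x), 1e-12), 30.0)
    return -math.log(math.tanh(x / 2.0))

def flooding_decode(H, I, max_iter=20):
    """Flooding-schedule BP. Returns (hard decisions, iterations used)."""
    M, N = len(H), len(H[0])
    N_of = [[n for n in range(N) if H[m][n]] for m in range(M)]
    E = {(m, n): 0.0 for m in range(M) for n in N_of[m]}   # check-to-variable
    hard = [0 if x >= 0 else 1 for x in I]
    for iteration in range(1, max_iter + 1):
        # Step 1: all variable-processors sample their inputs at the same time.
        total = list(I)
        for (m, n), e in E.items():
            total[n] += e
        T = {(m, n): total[n] - e for (m, n), e in E.items()}
        # Step 2: all check-processors sample their inputs at the same time.
        for m in range(M):
            for n in N_of[m]:
                sign, acc = 1.0, 0.0
                for n2 in N_of[m]:
                    if n2 != n:
                        if T[(m, n2)] < 0:
                            sign = -sign
                        acc += f(T[(m, n2)])
                E[(m, n)] = sign * f(acc)
        # Hard decision on intrinsic plus extrinsic information.
        post = list(I)
        for (m, n), e in E.items():
            post[n] += e
        hard = [0 if x >= 0 else 1 for x in post]
        if all(sum(hard[n] for n in N_of[m]) % 2 == 0 for m in range(M)):
            return hard, iteration                          # all checks satisfied
    return hard, max_iter

# Hypothetical toy code and channel values: all-zero codeword, one unreliable bit.
H = [[1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 0, 1, 0, 0, 1]]
I = [-0.5, 2.0, 2.0, 2.0, 2.0, 2.0]
decoded, iterations = flooding_decode(H, I)
```

All messages of step 2 are computed from the variable-to-check messages frozen in step 1, which is what distinguishes flooding from the shuffle schedules discussed next.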
B. Fast Converging Schedules
Two other schedules are used, which will be called the horizontal and vertical shuffle schedules (HSS and VSS) [4] , [7] , [8] .
The VSS was proposed independently by Kfir and Kanter [15] , and Zhang and Fossorier [7] , [8] . One of the main advantages of the VSS is that it enables the decoding convergence to speed up. In flooding-like schedules, we observe that the processing of the check-to-variable messages E [7] , [8] ; hence, the shuffling of the check-node update and the variable-node update. The VSS is described in Algorithm 3.
The HSS is the converse of the VSS: the roles of check nodes and variable nodes are swapped. It is a turbo-decoding like schedule, where the component codes are the rows or groups of rows of the parity-check matrix. A complete historical view of this schedule class can be found in [6] . An LDPC HSS decoder was generalized by Boutillon et al. [16] . This schedule also enables the decoding convergence to speed up.
Note that these schedules force the node processing to be serial. The use of parallelism (P > 1) leads to implementing the group-shuffled approach [7] .
IV. ANALYSIS OF COMPLEXITY
The decoding complexity of an LDPC code is directly linked to the number of messages to be processed per iteration, i.e., to the number of edges within the bipartite graph of the code, or equivalently, to the number of nonzero entries of the parity-check matrix (two messages per edge), whatever the scheduling is: the scheduling is in fact only a partitioning of the different edges to be processed.
A. Processing Power
One decoding iteration involves processing all the $T_{n,m}$ and $E_{m,n}$ messages, related to the edge between the variable $v_n$ and the parity-check $c_m$. Let $E$ denote the total number of edges inside the bipartite graph of the code. For a regular $(d_v, d_c)$ LDPC code of length $N$, for example, the total number of edges is given by

$$E = N d_v = M d_c$$
Let $P_c$ denote the required processing power to decode the LDPC codes, defined as the number of edges to be processed per clock cycle (we assume the decoder is implemented on synchronous-logic hardware, using a single clock). We can derive $P_c$ from the following parameters:
1) number $K$ of information bits to be transmitted per codeword;
2) rate $R$ of the code;
3) information throughput $D$ required;
4) maximum number of iterations $i_{\max}$;
5) clock frequency $f_{\mathrm{clk}}$.
The decoder has to handle $D/K$ codewords per second, each requiring up to $i_{\max}$ iterations over $E$ edges, so that

$$P_c = \frac{E \, i_{\max} \, D}{K \, f_{\mathrm{clk}}} \qquad (5)$$
$P_c$ can also be expressed using the average variable-node degree $\bar{d}_v = E/N$ and the rate $R = K/N$ of the code:

$$P_c = \frac{\bar{d}_v \, i_{\max} \, D}{R \, f_{\mathrm{clk}}}$$
B. Example
To illustrate the results of the theoretical study, let us consider a regular LDPC code of length $N$ and rate $R = 1/2$, with parameters $(d_v, d_c) = (3, 6)$. Assume that this code is decoded with a binary throughput $D = 10$ Mbit/s by means of a decoder with a clock frequency of 100 MHz. What is the minimum number of edges to be processed by the architecture per clock cycle to achieve the throughput if a maximum of $i_{\max} = 20$ iterations is specified?
The code being regular, the number $E$ of edges is $E = 3N$. Considering that $K = N/2$ (code rate is 1/2), the numerical application of (5) yields

$$P_c = \frac{3N \times 20 \times 10^{7}}{(N/2) \times 10^{8}} = 12$$
Thus, at each clock cycle, an average of 12/3 = 4 variable nodes and 12/6 = 2 parity-check nodes have to be processed. The architecture of the decoder has to use a parallelism of at least two check-processors and four variable-processors to achieve the specifications.
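The arithmetic of this example can be checked with a few lines of code. The sketch below assumes that the processing power is the number of edge visits per second (edges per iteration, times iterations per codeword, times codewords per second) divided by the clock frequency; the function name and the choice N = 1000 are illustrative assumptions, and the result does not depend on N.

```python
def processing_power(num_edges, K, D, i_max, f_clk):
    """Edges to process per clock cycle.

    num_edges: edges in the bipartite graph, K: information bits per codeword,
    D: information throughput in bit/s, i_max: maximum number of iterations,
    f_clk: clock frequency in Hz.
    """
    codewords_per_second = D / K
    edge_visits_per_second = num_edges * i_max * codewords_per_second
    return edge_visits_per_second / f_clk

# Regular (3, 6) code of rate 1/2; N is arbitrary because it cancels out.
N, d_v, d_c = 1000, 3, 6
P_c = processing_power(num_edges=d_v * N, K=N // 2, D=10e6, i_max=20, f_clk=100e6)

variable_nodes_per_cycle = P_c / d_v
check_nodes_per_cycle = P_c / d_c
```

Doubling the throughput or the iteration count doubles the required processing power, whereas doubling the clock frequency halves it.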
V. GENERIC ARCHITECTURE OF LDPC CODE DECODERS
We now define a parametric and generic architecture for LDPC code decoders that embeds most of the existing published architectures. This architecture is based on node processor architectures, the position and type of memory, the algorithm schedules, and the level of parallelism.
A. Generic Node Processors

1) Data Paths:
The generic node processor is made up of $d$ input/output ports $(e_j, s_j)$, $j \in \{1, \cdots, d\}$. Internal memory banks allow input or output messages to be saved inside the node processor. The processing of an output port $s_j$ is performed according to

$$s_j = \bigotimes_{i \neq j} e_i$$

where $\otimes$ is a generic associative operator. Many architectures can be implemented for this processor: for example, the trellis architecture [2], [18] in Fig. 1(b) or the "total sum" architecture [19], [20] in Fig. 1(a), each with either a parallel or a serial implementation. The total sum implementation involves computing first the "total sum," which is defined as $s = \bigotimes_{i} e_i$. Then, the $j$th input is inverted so as to compute $s_j$ from $s$. This implementation is interesting when the degree of the processor is high, since many common computations are shared. However, it is possible only when the operator can be inverted. These architectures are detailed in [4]. Registers can also be added to pipeline the processing and, thus, decrease the critical path length. The generic operator $\otimes$ may be either the sum ($+$) or the star ($\star$). The star operator between two log-likelihood ratios is defined as the function $F$ applied on them [21]:

$$e_i \star e_j = \ln \frac{1 + \exp(e_i + e_j)}{\exp(e_i) + \exp(e_j)}$$
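The difference between these node processor datapaths can be illustrated in software. In the sketch below, `direct` recomputes every output from scratch, `total_sum` uses the invertibility of the sum operator, and `forward_backward` reuses prefix and suffix results for an operator that is only assumed to be associative; it is meant as something close in spirit to the trellis arrangement, not as a description of Fig. 1(b). The star operator is written in its logarithmic form, and all function names are assumptions made for this example.

```python
import math
from functools import reduce
from operator import add

def star(a, b):
    """Star operator on two LLRs (naive form, not protected against overflow)."""
    return math.log((1.0 + math.exp(a + b)) / (math.exp(a) + math.exp(b)))

def direct(inputs, op):
    """s_j = op over all inputs except the j-th, recomputed for every port."""
    return [reduce(op, inputs[:j] + inputs[j + 1:]) for j in range(len(inputs))]

def total_sum(inputs):
    """Sum operator only: compute s once, then remove the j-th input."""
    s = sum(inputs)
    return [s - e for e in inputs]

def forward_backward(inputs, op):
    """Works for any associative op; needs at least two inputs."""
    d = len(inputs)
    fwd, bwd = list(inputs), list(inputs)
    for i in range(1, d):
        fwd[i] = op(fwd[i - 1], inputs[i])      # op over inputs[0..i]
    for i in range(d - 2, -1, -1):
        bwd[i] = op(inputs[i], bwd[i + 1])      # op over inputs[i..d-1]
    out = []
    for j in range(d):
        if j == 0:
            out.append(bwd[1])
        elif j == d - 1:
            out.append(fwd[d - 2])
        else:
            out.append(op(fwd[j - 1], bwd[j + 1]))
    return out

ports = [1.0, -2.0, 0.5, 3.0]
sum_outputs = total_sum(ports)
star_outputs = forward_backward(ports, star)
```

The direct form needs on the order of d² operator evaluations for d ports, while the other two need a number proportional to d, which is why they become attractive at high node degrees.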
2) Update Modes:
There are three steps to be handled inside a generic node processor: input message reading (or sampling), computation of the output messages, and output of the outgoing messages. There are mainly two types of approaches that can be defined to manage these three steps differently for the d input/output ports of the processor: grouped update and spread update.
The grouped update involves computing the output messages if and only if all the input messages have been sampled. So, a typical scheme is first to wait for all the inputs to be updated and to save them. Then, all the output messages are to be computed, and finally to be output. Then, a new cycle can start again by waiting for all the inputs to be updated in the memory, erasing the previous ones. The spread update is a kind of on-demand control. It means that, for example, the node processor can be asked for a given output message, and, then, it can be asked to take into account a new input message. Two spread update modes can be defined, depending on the memory inside the node processor: the straight one and the delayed one. When a new input message on a given edge has to be taken into account, either the previous message related to this edge is erased and replaced by this new one (straight update), or there are two memories (delayed update) denoted input-memory and compute-memory: the new message is saved in the input-memory while the previous one has been saved in the compute-memory. When the input-memory is full, then the roles of the two memories are swapped. In the straight update case, the same memory is used to save the new incoming inputs and to compute the output messages. Hence, the last output message will be processed with the most recent input messages. The straight spread update is implemented to speed up the propagation of the messages as in shuffle schedules.
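The difference between the straight and the delayed spread updates can be sketched with a small software model. The class below is a hypothetical illustration using the sum operator; its name, its methods, and the assumption that each port is written exactly once per round are not taken from the paper.

```python
class SpreadNode:
    """Toy node processor with spread update, using the sum operator."""

    def __init__(self, degree, delayed):
        self.delayed = delayed
        self.compute_mem = [0.0] * degree   # values used to compute outputs
        self.input_mem = [0.0] * degree     # only used in delayed mode
        self.filled = 0

    def write(self, j, value):
        """Take a new input message on port j into account."""
        if not self.delayed:
            # Straight update: the new message replaces the old one immediately.
            self.compute_mem[j] = value
            return
        # Delayed update: store aside until every port has been refreshed.
        self.input_mem[j] = value
        self.filled += 1
        if self.filled == len(self.input_mem):
            # Input-memory is full: swap the roles of the two memories.
            self.compute_mem, self.input_mem = self.input_mem, self.compute_mem
            self.filled = 0

    def read(self, j):
        """Output message on port j: sum of all stored inputs except port j."""
        return sum(self.compute_mem) - self.compute_mem[j]

straight = SpreadNode(3, delayed=False)
straight.write(0, 1.0)
straight_out = straight.read(1)      # already sees the new input on port 0

delayed = SpreadNode(3, delayed=True)
delayed.write(0, 1.0)
delayed_before = delayed.read(1)     # still computed from the old contents
delayed.write(1, 2.0)
delayed.write(2, 3.0)                # input-memory full: memories are swapped
delayed_after = delayed.read(1)
```

In the straight mode an output requested right after a write already reflects that write, which is the behavior exploited by the shuffle schedules; in the delayed mode the outputs only change once a full set of inputs has arrived.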
To summarize, there are three ways of handling the processing steps in a generic node processor: the straight spread update, the delayed spread update, and the grouped update. By combining these three update modes on the check- and variable-node processors, it is possible to span all the known schedules of LDPC decoding. These are summed up in Table I.
B. Message Passing Architecture
The generic processors are instantiated so as to create the global architecture of the decoder, as illustrated in Fig. 2. The interconnection network represented in the bipartite graph of an LDPC code is materialized through a shuffle or a routing network $\pi$ and its inverse $\pi^{-1}$. The complexity of the interconnection network depends on the structure of the parity-check matrix. From an implementation point of view, it seems desirable to have simple interconnections such as a barrel shifter, as in [6] and [22]. Depending on specific hardware constraints, this network can be implemented at different alternative locations, as illustrated in Fig. 3: depending on the check-node generic operator (star or sum operator), up to four different locations are presented (dashed lines) separating the variable-node processors from the check-node processors. With the use of the star operator (the input and the output of the parity-check node processors have the dimension of LLRs), there is only one possible location. With the sum operator, however, there are four different locations, depending on whether the inputs and the outputs of the node processors are in the Fourier domain or not.
The optimal number of variable-processors having d v input/output ports is given by the fact that the interconnection network has a determinate number of inputs and outputs. Each of the variable-processors is able to process a parity-check of
Such a generic message passing architecture has a complexity $P_c$, as defined by (5), which is equal to

$$P_c = \frac{P \, d_c}{\alpha}$$
Thus, if we take the example of Section IV-B, where $P_c = 12$, a solution with $P = 2$ check-processors is possible if the check-processors are able to process at least $d_c = P_c/P = 6$ inputs/outputs per clock cycle ($\alpha = 1$), i.e., if $P = 2$ parallel generic processors are used. If serial generic processors are to be used ($\alpha = d_c = 6$), then $P = 12$ check-processors would have to be instantiated to achieve the specifications.
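The two operating points quoted above are consistent with each check-processor handling d_c/α edges per clock cycle. Under that assumption (it is inferred from the example, and the function name is hypothetical), the number of check-processors needed for a given processing power can be sketched as follows.

```python
import math

def check_processors_needed(P_c, d_c, alpha):
    """Smallest P such that P processors, each handling d_c / alpha edges
    per clock cycle, reach a processing power of P_c edges per cycle."""
    return math.ceil(P_c * alpha / d_c)

parallel_case = check_processors_needed(P_c=12, d_c=6, alpha=1)   # fully parallel nodes
serial_case = check_processors_needed(P_c=12, d_c=6, alpha=6)     # fully serial nodes
```

Intermediate values of α trade the width of each node processor against the number of processors that must be instantiated.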
VI. ARCHITECTURE ANALYSIS AND SYNTHESIS FOR LDPC CODE DECODERS
A generic architecture for LDPC decoders was proposed in Section V. In this section, some examples from the literature are taken to illustrate the versatility of the generic architecture. Then, we show that shuffling the parameters of the generic architecture can even yield a new architecture for the VSS schedule.
A. Architecture Analysis
The parameters specifying the architecture of the LDPC code decoders have been defined in the previous sections. They are listed as follows:
• Node processors:
1) three possible architectures (direct, trellis, total sum);
2) four possible locations for the interconnection network;
3) three possible update modes (grouped, straight spread, or delayed spread).
• Message passing architecture:
1) three parameters for the parallelism specification $(P, \alpha, \beta)$.
Some combinations of values for these parameters have already been used or implemented in the LDPC decoders. Some other combinations are new and yield interesting new implementations.
Due to the limited number of pages, we cannot present a classification of all the published decoder architectures, so we present only three of them hereafter: they fairly illustrate the diversity of the published architectures. Note that the parameters we propose for the description of the following architectures are summed up in the first three columns of Table II.
The first one is from Zhang and Parhi [23]. It is a decoder implemented on an FPGA Xilinx Virtex 2600 having a throughput of 54 Mbit/s, for a regular code of length $N = 9216$ bits and of rate $R = 0.5$. We can note that the node processors are completely parallel. The scheduling is the FS, and the locations of the interconnection network are such that lookup tables are required in both the check nodes and the variable nodes (location number 2).
The second example is an architecture proposed by Chen and Hocevar [24], implemented on a 0.11-µm application-specific integrated circuit (ASIC) and having a throughput of 376 Mbit/s. The length of the code is $N = 8088$, the rate is also $R = 0.5$, but the code is irregular. In this implementation, the check-node processors are parallel whereas the variable-node processors are serial. The location of the interconnection network is classical (the lookup tables are in the check-node processors). The scheduling is the flooding one over the variables (FS-V).
The third example is an architecture proposed by Mansour [25], implemented on a 0.18-µm ASIC and having a throughput of 192 Mbit/s. The length of the code is $N = 2304$, the rate is $R = 2/3$, and the code is also irregular. In this implementation, the check-node processors and the variable-node processors are both serial. The location of the interconnection network is classical (the lookup tables are in the check-node processors). The scheduling is the shuffled one over the parity checks (HSS).
B. Architecture Synthesis
1) Example of a New Architecture:
The architecture family generated by our framework also encompasses new efficient architectures. An example of an application of this formalism is given in the last column of Table II. The algorithm performed by this architecture is exactly the VSS, as specified in Table I. To the best of our knowledge, it is the first published architecture implementing this algorithm. It is depicted in Fig. 4, where only the magnitude processing is illustrated. Algorithm 4 is applied: it is a hardware-oriented description of Algorithm 3 for the magnitude part only. The messages have changed: the variable-to-check message magnitudes are now denoted 
where $J$ is a positive integer varying from 1 to $d_c$ during the iteration. The $R_{m,n}$ messages are fed to the variable-node processor, where the $T_{m,n}$ values are processed and transformed into $Q_{n,m}$ messages. These new $Q_{n,m}$ messages replace in memory the messages of the previous iteration. Also, the $R_m$ value is updated by subtracting the old $Q_{n,m}$ and adding the new one (see Algorithm 4, line 13). This architecture requires saving $(N + M + E)$ values for the $I_n$, $R_m$, and $Q_{n,m}$ data, respectively. Note that the architecture associated with the HSS requires only saving $(N + E)$ values for the $I_n + E_n$ and $E_{m,n}$ data.
2) Comparison of Memory Requirements for the Different Schedules (Table III): We assume that an LDPC code of length $N$ has an average variable degree of $\bar{d}_v = 3$. Then $E = 3N$. We also assume that all the messages are coded using the same fixed-point format on $w$ bits. Finally, we omit the input/output buffers, which would add two memories of $Nw$ bits each. The HSS has the lowest required memory size. The VSS memory size is a function of the code rate: for high code rates, it is possible to use almost the same amount of memory as for the HSS. It is to be noted that the FS-P has a higher memory requirement than the two shuffle schedules HSS and VSS. However, the main part of the memory is used to save the edge messages. This issue can be addressed using suboptimal algorithms such as the (scaled or offset) BP-based algorithm [10], [26], the λ-min algorithm [21], or the A-min* algorithm [27].
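The two memory counts stated in the text, (N + E) values for the HSS and (N + M + E) values for the VSS, can be turned into bit counts as in the sketch below. Only these two schedules are covered, because the other entries of Table III are not reproduced here. The function name and the numerical values (N = 1000, w = 6, and M = (1 − R)N, which assumes a full-rank parity-check matrix) are illustrative assumptions.

```python
def memory_bits(N, M, E, w):
    """Message memory in bits, input/output buffers omitted.

    N: code length, M: number of parity checks, E: number of edges,
    w: word length of the fixed-point format in bits.
    """
    return {
        "HSS": (N + E) * w,        # I_n + E_n values and E_{m,n} messages
        "VSS": (N + M + E) * w,    # I_n, R_m and Q_{n,m} values
    }

N, w = 1000, 6
E = 3 * N                                   # average variable degree of 3

half_rate = memory_bits(N, M=500, E=E, w=w)     # R = 1/2
high_rate = memory_bits(N, M=100, E=E, w=w)     # R = 9/10
```

With these numbers the VSS overhead relative to the HSS drops from 3000 bits at rate 1/2 to 600 bits at rate 9/10, which illustrates why the two memory sizes become almost equal at high code rates.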
VII. CONCLUSION
Many LDPC code decoders have been proposed in the literature. However, it is difficult to compare them. We have proposed a global framework for the description, analysis, and synthesis of low-density parity-check (LDPC) code decoders. It is based on a generic model of a decoder described by parameters related to a generic node processor and to a generic message passing architecture. A quantification of the complexity required by the decoding of an LDPC code has also been proposed. This framework makes it possible to describe several published LDPC decoder implementations using the same model and parameters. It also makes it possible to ensure a good match between algorithm and scheduling on the one hand, and decoder architectures on the other hand, as illustrated by the new efficient architecture proposed in this paper.
