Abstract: Efficient implementations of the sum-product algorithm (SPA) for decoding low-density parity-check (LDPC) codes using log-likelihood ratios (LLR) as messages between symbol and parity-check nodes are presented. Various reduced-complexity derivatives of the LLR-SPA are proposed. Both serial and parallel implementations are investigated, leading to trellis and tree topologies, respectively. Furthermore, by exploiting the inherent robustness of LLRs, it is shown, via simulations, that coarse quantization tables are sufficient to implement complex core operations with negligible or no loss in performance. The unified treatment of decoding techniques for LDPC codes presented here provides flexibility in selecting the appropriate design point in high-speed applications from a performance, latency, and computational complexity perspective.
I. INTRODUCTION
Iterative decoding of binary low-density parity-check (LDPC) codes using the sum-product algorithm (SPA) has recently been shown to approach the capacity of the additive white Gaussian noise (AWGN) channel within 0.0045 dB [1], [2], [3]. Efficient hardware implementation of the SPA has become a topic of increasing interest. The direct implementation of the original form of SPA has been shown to be sensitive to quantization effects [4]. In addition, using likelihood ratios can substantially reduce the required quantization levels [4]. A simplification of the SPA that reduces the complexity of the parity-check update at the cost of some loss in performance was proposed in [5]. This simplification has been derived by operating in the log-likelihood domain. Recently, a new reduced-complexity decoding algorithm that also operates entirely in the log-likelihood domain was presented [6]. It bridges the gap in performance between the optimal SPA and the simplified approach in [5]. Finally, low-complexity software and hardware implementations of an iterative decoder for LDPC codes suitable for multiple access applications were presented in [7].
Here we present efficient implementations of the SPA and describe new reduced-complexity derivatives thereof. In our approach, log-likelihood ratios (LLR) are used as messages between symbol and parity-check nodes. It is known that in practical systems, using LLRs offers implementation advantages over using probabilities or likelihood ratios, because multiplications are replaced by additions and the normalization step is eliminated. The family of LDPC decoding algorithms presented here is called LLR-SPA.
The unified treatment of decoding techniques for LDPC codes presented here provides flexibility in selecting the appropriate design point in high-speed applications from a performance, latency, and computational complexity perspective. In particular, serial and parallel implementations are investigated, leading to trellis and tree topologies, respectively. In both cases, specific core operations similar to the special operations defined in the log-likelihood algebra of [8] are used. This formulation not only leads to reduced-complexity LDPC decoding algorithms that can be implemented with simple comparators and adders but also provides the ability to compensate for the loss in performance by using simple look-up tables or constant correction terms.
The remainder of the paper is organized as follows. In Section II, the SPA in the log-likelihood domain is described, and the issues associated with a brute-force implementation are discussed. In Section III, a trellis topology for carrying out the parity-check updates is derived. The core operation on this trellis is the LLR of the exclusive OR (XOR) function of two binary independent random variables [8], rather than the hyperbolic tangent operation used in the brute-force implementation. This core operation can either be implemented very accurately by using the max* operation [9] or approximately by using the so-called sign-min operation. In either case, the check-node updates can be efficiently implemented on the trellis by the well-known forward-backward algorithm. Section IV is devoted to parallel processing, and a simple tree topology with a new core operation is proposed. It is shown that such an implementation offers smaller latency compared to the serial implementation. In practice, this core operation can be realized by employing a simple eight-segment piecewise linear function. In Section V, simulation results are presented, comparing the performance of the various alternative implementations of the LLR-SPA. Finally, Section VI contains a summary of the results and conclusions.
II. SPA IN THE LOG-LIKELIHOOD DOMAIN
A binary LDPC code [1, 2] is a linear block code described by a sparse ! " 
. The LLR-SPA is then summarized as follows.
, and
, then halt the algorithm with i as the decoder output; otherwise go to Step (i). If the algorithm does not halt within some maximum number of iterations, then declare a decoder failure.
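The halting rule can be illustrated with a minimal Python sketch. It rests on assumptions that are not stated in the text above: the LLR is taken as ln P(0)/P(1), so a nonnegative posterior LLR maps to bit 0, and the parity-check matrix is stored as a list of rows, each row listing the indices of the symbol nodes it checks. The function names are hypothetical.

```python
def hard_decision(posterior_llrs):
    """Map posterior LLRs to bits, assuming LLR = ln P(0)/P(1)."""
    return [0 if llr >= 0.0 else 1 for llr in posterior_llrs]


def all_checks_satisfied(check_rows, bits):
    """Return True if every parity-check equation sums to zero modulo 2.

    check_rows[m] lists the symbol indices taking part in check m
    (an assumed sparse representation of the parity-check matrix).
    """
    return all(sum(bits[n] for n in row) % 2 == 0 for row in check_rows)


def should_halt(check_rows, posterior_llrs):
    """Decide whether the iterative decoder may stop after this iteration."""
    bits = hard_decision(posterior_llrs)
    return all_checks_satisfied(check_rows, bits), bits
```

A decoder loop would call `should_halt` once per iteration and declare a failure if it never returns `True` within the maximum number of iterations.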
The check-node updates are computationally the most complex part of the LLR-SPA. Two issues influence their complexity: i) the topology used in computing the messages that a particular check node sends to the symbol nodes associated with it, and ii) the implementation of the core operation needed for computing these messages. For example, the core operation of the check-node update computation in Step (i) above is the hyperbolic tangent function, which is known to be difficult to implement in hardware. Furthermore, in a brute-force implementation of the check-node update (1),
multiplications are necessary per check node, with all multiplicands requiring the evaluation of the hyperbolic tangent core operation. Clearly, the higher the rate of the code, the higher the row degree, thus leading to a higher number of multiplications. Therefore, the brute-force topology and its corresponding core operation are not suited for high-speed digital applications.
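As a point of reference, the brute-force check-node update discussed above can be sketched in Python. The sketch assumes the standard hyperbolic tangent form of the update and an LLR defined as ln P(0)/P(1); the function name and the list-based message layout are illustrative and not taken from the paper.

```python
import math


def check_node_update_tanh(incoming):
    """Brute-force check-node update.

    incoming[i] is the LLR arriving from the i-th connected symbol node.
    The message sent back on edge i combines all OTHER incoming LLRs,
    so every output costs its own product of tanh values.
    """
    limit = 1.0 - 1e-12  # keep atanh finite for very reliable inputs
    outgoing = []
    for i in range(len(incoming)):
        product = 1.0
        for j, llr in enumerate(incoming):
            if j != i:
                product *= math.tanh(llr / 2.0)
        product = max(-limit, min(limit, product))
        outgoing.append(2.0 * math.atanh(product))
    return outgoing
```

The nested loop makes the cost visible: the number of multiplications grows quadratically with the row degree, and each multiplicand needs a tanh evaluation.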
III. SERIAL IMPLEMENTATION: TRELLIS TOPOLOGY

A. Check-Node Updates
Consider a particular check node
connections from symbol nodes in
The goal is to efficiently compute the outgoing messages
. Let us define two sets of auxiliary binary random variables
, where denotes the binary XOR operation. It can easily be seen that for statistically independent binary random variable
Using (2) repeatedly, we can obtain
in a recursive manner based on the knowledge of 9 A D C p 9
. Using the parity-check node constraint p 9
. Therefore, the outgoing message from the check node 5 can be simply expressed as
The total computational load consists of the forward recursive computation of 
hyperbolic tangent operations for the check-node updates of the brute-force topology. Clearly, the above procedure is exactly the forward-backward algorithm on a single-state trellis, as shown in Fig. 1 . The serial nature of computations makes the latency in computing a check-node update of the order¸
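The forward-backward procedure on the single-state trellis can be sketched as follows. This is a minimal illustration under assumptions: the core operation is written as a function `boxplus` (a hypothetical name) that returns the LLR of the XOR of two independent bits, the LLR is ln P(0)/P(1), and the check node has at least two connections.

```python
import math


def boxplus(a, b):
    """LLR of the XOR of two independent bits with LLRs a and b."""
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return (sign * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))


def check_node_update_fb(incoming):
    """Check-node update by a forward and a backward recursion."""
    k = len(incoming)
    fwd = [0.0] * k  # fwd[i]: LLR of the XOR of inputs 0..i
    bwd = [0.0] * k  # bwd[i]: LLR of the XOR of inputs i..k-1
    fwd[0] = incoming[0]
    for i in range(1, k):
        fwd[i] = boxplus(fwd[i - 1], incoming[i])
    bwd[k - 1] = incoming[k - 1]
    for i in range(k - 2, -1, -1):
        bwd[i] = boxplus(bwd[i + 1], incoming[i])
    outgoing = []
    for i in range(k):
        if i == 0:
            outgoing.append(bwd[1])
        elif i == k - 1:
            outgoing.append(fwd[k - 2])
        else:
            # combine everything before edge i with everything after it
            outgoing.append(boxplus(fwd[i - 1], bwd[i + 1]))
    return outgoing
```

Each of the two recursions is a chain of dependent operations, which is why the number of sequential steps grows with the number of connections to the check node.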
B. Symbol-Node Updates
In the log-likelihood domain, the symbol-node updates consist only of additions of incoming messages. It is more conve- , given
are the incoming LLRs from the parity-check nodes
connected to the symbol node 
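A minimal sketch of the symbol-node update in the log-likelihood domain is given below. It assumes the common arrangement in which the posterior LLR is formed first as the channel LLR plus all incoming check-node messages, and each outgoing message is then obtained by subtracting the message that arrived on the same edge. The function and argument names are hypothetical.

```python
def symbol_node_update(channel_llr, incoming):
    """Symbol-node update using additions only.

    channel_llr: LLR of the symbol obtained from the channel output.
    incoming:    LLRs received from the connected parity-check nodes.
    Returns the posterior LLR and the list of outgoing messages.
    """
    posterior = channel_llr + sum(incoming)
    # remove each edge's own contribution to obtain the extrinsic message
    outgoing = [posterior - message for message in incoming]
    return posterior, outgoing
```

Computing the posterior once and subtracting avoids recomputing a separate sum for every edge, and the posterior is also the quantity needed for the hard decision.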
C. Efficient Implementation of Core Operation $L(U \oplus V)$
In this section, two efficient implementations of the core operation $L(U \oplus V)$ are described, both of which are amenable to efficient VLSI design.
The first version is analogous to the max* operation used in turbo codes [9, 10]. By using the Jacobian logarithm twice, we obtain
It can be shown that the following equality holds:
in which the terms
can be implemented by a look-up table. Fig.  2 shows a plot of the function
is given in Table I . The maximum approximation error is less than 0.05.
The function can also be approximated more accurately by a piece-wise linear function where the multiplying factors are powers of two and therefore simple to implement in hardware with shift operations. Table II shows a piece-wise linear approximation of this function with only eight regions. Fig. 2 shows the corresponding piece-wise linear approximation plot. As can be seen, the piece-wise linear function offers almost a perfect match to the original function. In summary, the core operation $L(U \oplus V)$ can be realized using four additions, one
which is called herein the sign-min approximation. The advantage of using the sign-min approximation lies in its simplicity. No additions are needed for check-node updates, merely two-way comparisons, hence requiring a very small number of logic gates.
Finally, the difference between the exact $L(U \oplus V)$ operation and its sign-min approximation is given by a term called the correction factor in [6]. This correction factor can be described by the bivariate function
where the arguments 
® ¦ AE
, respectively. It is shown in [6] that this correction factor can be approximated by a single constant without incurring any loss in performance with respect to the SPA. Clearly, one can also use the function shown in Fig. 2 instead of the bivariate function (7). In the approach introduced in [6], the correction factor is a positive or a negative nonzero value determined according to the signal-to-noise ratio. In this case, the computational complexity of the $L(U \oplus V)$ core operation is a single two-way comparison and an addition with a constant.
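The relation between the exact core operation, the sign-min approximation, and the correction factor can be checked numerically with a short Python sketch. It assumes that the LLR is ln P(0)/P(1) and that the look-up-table function is ln(1 + exp(-|x|)), which is what applying the Jacobian logarithm twice produces. The function names are hypothetical, and the look-up table and piece-wise linear approximations themselves are not reproduced here.

```python
import math


def g(x):
    """Assumed table function: ln(1 + exp(-|x|)), bounded by ln 2."""
    return math.log1p(math.exp(-abs(x)))


def sign_min(a, b):
    """Sign-min approximation: product of signs times the smaller magnitude."""
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return sign * min(abs(a), abs(b))


def correction(a, b):
    """Correction factor as a function of the sum and difference of the inputs."""
    return g(a + b) - g(a - b)


def boxplus_exact(a, b):
    """Exact LLR of the XOR: sign-min term plus the correction factor."""
    return sign_min(a, b) + correction(a, b)
```

In this form the sign-min term needs only comparisons, while the correction needs two evaluations of `g`. Because `g` is bounded by ln 2 and decays quickly, a coarse table or a constant is a plausible substitute for it.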
IV. PARALLEL IMPLEMENTATION: TREE TOPOLOGY
For applications with high throughput requirements, recursive algorithms such as the forward-backward algorithm may not be well suited. In this section, a simple tree topology that enables fast check-node updates is described. The symbol-node updates remain the same as in (4).
We begin by defining an auxiliary binary random variable. The LLR of this variable at a particular check node can be computed using the tree topology shown in Fig. 3. The operation at each node in the tree is $L(U \oplus V)$, which can be efficiently implemented using any of the alternatives described in Section III-C. The latency in computing this LLR grows only logarithmically with the number of symbol nodes connected to the check node, which is a speed-up compared to the serial trellis topology of Section III-A.
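The tree reduction can be sketched in Python as follows. The sketch assumes that the auxiliary variable is the XOR of all bits connected to the check node and that inputs are combined pairwise, level by level; `boxplus` and `tree_xor_llr` are hypothetical names, and the input list is assumed to be non-empty. All operations within one level are independent, so hardware could evaluate them in parallel, and the returned depth counts the sequential levels.

```python
import math


def boxplus(a, b):
    """LLR of the XOR of two independent bits with LLRs a and b."""
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return (sign * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))


def tree_xor_llr(llrs):
    """Combine all LLRs with a binary tree; return (LLR of the XOR, tree depth)."""
    level = list(llrs)
    depth = 0
    while len(level) > 1:
        # pair up neighbours; these operations do not depend on each other
        reduced = [boxplus(level[i], level[i + 1])
                   for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            reduced.append(level[-1])  # odd element passes through unchanged
        level = reduced
        depth += 1
    return level[0], depth
```

For eight inputs the tree has three levels, whereas a serial recursion over the same inputs needs seven dependent operations.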
Having obtained the LLR of ó C
, we now describe a simple and efficient way to compute the outgoing LLRs T° . Thus (9) becomes
After some algebra, we finally obtain
We define
Clearly, for eachéG 
In (12) the calculation of the function
is required, whose plot is given in Fig. 4.

As can be seen in Fig. 4, the simple sign-min approximation suffers a performance penalty of 0.3 to 0.5 dB. The loss in performance appears to grow as the number of parity-check equations of the LDPC code increases. On the other hand, all other reduced-complexity variants of the LLR-SPA perform very close to the conventional SPA. In particular, the piece-wise linear approximations of the core operations in LLR-SPA1 or LLR-SPA2 appear to suffer no loss (essentially less than 0.05 dB) in performance even in the case of the LDPC code that involves 3000 parity-check equations. Furthermore, as can be seen in Fig. 5, the simple LLR-SPA1 algorithm that uses a constant correction term is also able to achieve the performance of the conventional SPA, in particular at higher SNRs.

The core operations are somewhat different in the two cases. Nevertheless, the correction terms in these core operations can be implemented via look-up tables or piece-wise linear functions, or even by using a single constant, facilitating simple hardware design. Simulation results have shown that it is possible to approach the performance of the conventional SPA very closely with a significant reduction in implementation complexity.
