Low-density parity-check codes are attractive for high-throughput applications because of their low decoding complexity per bit, but also because all the codeword bits can be decoded in parallel. However, achieving this in a circuit implementation is complicated by the number of wires required to exchange messages between processing nodes. Decoding algorithms that exchange binary messages are interesting for fully-parallel implementations because they can reduce the number and the length of the wires, and increase logic density. This paper introduces the Relaxed Half-Stochastic (RHS) decoding algorithm, a binary message belief propagation (BP) algorithm that achieves a coding gain comparable to the best known BP algorithms that use real-valued messages. We derive the RHS algorithm by starting from the well-known Sum-Product algorithm, and then derive a low-complexity version suitable for circuit implementation. We present extensive simulation results on two standardized codes having different rates and constructions, including low bit error rate results. These simulations show that RHS can converge faster on average than existing state-of-the-art decoding algorithms, leading to improvements in throughput and energy efficiency.
Relaxed Half-Stochastic Belief Propagation
I. INTRODUCTION
Low-density parity-check (LDPC) codes can approach channel capacity with a low decoding complexity per bit, making them attractive for a wide range of error correction applications. For most applications, the decoding operation must be performed by a custom circuit implementation because of the processing performance requirements. For a given coding gain, we seek to optimize the processing performance (throughput, latency) normalized to the circuit area, and the energy per decoded bit.
The decoding of LDPC codes is part of a large family of problems that can be solved by iteratively passing messages in a factor graph [1]. When the desired decoding throughput is high, the most efficient implementation approach is to explicitly map the factor graph in hardware. This is known as a fully-parallel implementation. Because of the structure of LDPC factor graphs, the wiring complexity typically represents a large portion of the implementation complexity. It has a big impact on area requirements, as well as dynamic power consumption. Furthermore, the longer wires are likely to form critical paths and constrain the maximum clock frequency. We will call a circuit graph a graph where a node corresponds to a localized circuit block, and an edge corresponds to a connection between two circuit blocks, composed of one or several wires. In the simplest fully-parallel implementation, the circuit graph is identical to the factor graph.
Different approaches have been proposed to reduce the wiring complexity of fully-parallel LDPC decoders. Circuit architectures have been proposed [2] , [3] that send messages serially on a single wire, at the expense of using a larger number of clock cycles to exchange messages. However, [2] requires an approximation in the check node function which degrades performance, and [3] , while power-efficient, might have difficulty achieving high throughputs. In Split-Row decoders [4] , the topology of the circuit graph is modified by partitioning check nodes into multiple sub-nodes that are then linked by two or four wires. The authors show this topology change can have a big impact on area and power requirements of the decoder. However, their approach suffers from increasing error rates as the number of partitions increases. Finally, stochastic decoding algorithms have been introduced as a low-complexity alternative to the standard algorithms. The technique was initially demonstrated only on small codes [5] , but since then stochastic decoding has been applied to longer LDPC codes used in communication standards [6] . Since they rely on binary messages, stochastic algorithms have a low wiring complexity. They also use a very simple check node function, which can be partitioned arbitrarily without introducing any approximation. However, these stochastic algorithms suffer from a high latency, and an important loss in coding gain with respect to Sum-Product Algorithm (SPA) decoding.
In this paper, we introduce a new BP algorithm that is a binary message passing (BMP) algorithm, which for an LDPC decoder means that the algorithm is constrained to using only modulo-2 addition as the check node function. Hard-decision decoding algorithms such as "Gallager-B" and the stochastic algorithms mentioned previously are BMP algorithms. The Relaxed Half-Stochastic (RHS) algorithm relies on stochastic messages for exchanging information between processing nodes, but, contrary to existing stochastic algorithms, performs the variable node computation in the log-likelihood ratio domain. The name "relaxed" comes from the use in the variable node function of an estimation mechanism that is similar to the relaxation step in a Successive Relaxation decoder [7]. Our results show that despite the constraint on the check node function, it is possible to achieve very good performance, in terms of error rate, latency, and throughput, while preserving a low implementation complexity. The RHS algorithm can match or sometimes outperform the error rate of BP algorithms that use real-valued messages, such as the Sum-Product algorithm, while simultaneously having a low wiring complexity.
0090-6778/13$31.00 © 2013 IEEE
A preliminary version of the RHS algorithm was introduced in [8] . In this paper, we present several improvements that increase the performance while reducing complexity. In Section II, we briefly review the Sum-Product, Min-Sum and Normalized Min-Sum algorithms, and review the stochastic message representation. In Section III, the algorithm is developed using the Sum-Product Algorithm as a basis. Following this, we show how the average number of iterations can be decreased by changing the decoding rule at some predetermined iterations, similar to Gear-Shift decoding [9] . We also present a method for lowering error floors that can be used with RHS or SPA decoders. In Section IV, we derive a low-complexity implementation from the ideal case described previously. Finally, in Section V, we present simulation results for two standardized codes. Results are presented for various versions of the algorithm, from the ideal case described in Section III to a version containing all the approximations required to achieve a low-complexity circuit implementation.
II. BACKGROUND
An LDPC code can be modeled by a bipartite graph G = (V, C, E), where elements of C, named check nodes (CN), represent the parity-check equations, and elements of V, named variable nodes (VN), represent the variables of the parity-check equations. A variable can represent a transmitted symbol, or, in the case of punctured codes, an additional unknown that is not part of the transmitted information. An edge exists between a variable node and a check node if the variable is an argument of the parity-check equation associated with the check node. Throughout the paper, we denote the degree of a given variable node by d v , and the degree of a given check node by d c .
We will introduce Algorithm 1 as a template to be shared by all BP algorithms. This template will allow us to define all the algorithms in terms of a VN function VAR and of a CN function CHK, with subscripts indicating the algorithm.
Let v i ∈ V denote the variable node with index i ∈ [1, n], c j ∈ C denote the check node with index j ∈ [1, m] , and N (x) denote the set of neighbors of a node x. In Alg. 1, we
We also denote a message from $v_i$ to $c_j$ by $\eta_{i,j}$ and from $c_j$ to $v_i$ by $\theta_{j,i}$. The INIT($y_i$) function converts a channel output $y_i$ into the representation domain used by the specific algorithm, and similarly $\delta$ is the value corresponding to a probability of $\frac{1}{2}$. Finally, for a set $S = \{s_1, s_2, \ldots, s_z\}$, the expression VAR($S$) is equivalent to VAR($s_1, s_2, \ldots, s_z$), and similarly for CHK.
In the paper we use the term throughput to refer to the average number of bits processed by the decoder per time unit. The BP algorithm terminates as soon as a codeword is found, and therefore in the discussions we will assume that the data throughput of the decoder is inversely proportional to the average number of iterations until convergence. We also use the term latency to refer to the time required to run the decoder for L iterations.
A. The Sum-Product Algorithm
The Sum-Product algorithm is a BP algorithm that, for a cycle-free $G$, computes the maximum-likelihood (ML) estimate of each codeword bit. Because of the cycles contained in an LDPC code's factor graph, the Sum-Product algorithm is not guaranteed to converge to the bit-wise ML estimates. Nonetheless, it has proven to be a very useful approximation algorithm. The Sum-Product algorithm can be expressed in terms of various metrics, notably probability values or log-likelihood ratios (LLR). The LLR metric is most often used for implementations because it reduces the quantization error, and lowers the implementation complexity by avoiding the need for multiplications. An LLR metric $\Lambda$ is defined in terms of a probability $p$ as $\Lambda = \ln((1 - p)/p)$. Following [1], we will use the notation $\{\sim X_i\}$ to denote the set $\{X_1, X_2, \ldots, X_n\} \setminus \{X_i\}$, where $n$ is given implicitly from the context. In the LLR domain, a variable node output $\Lambda'_i$, $1 \le i \le d_v$, is given by
$$\Lambda'_i = \Lambda_0 + \sum_{\Lambda_j \in \{\sim \Lambda_i\}} \Lambda_j, \tag{1}$$
where $\Lambda_i$ is the input message on edge $i$ and $\Lambda_0$ is the a-priori likelihood obtained from the channel. A common expression for computing a check node output $\Lambda'_i$, $1 \le i \le d_c$, is given by
$$\Lambda'_i = 2 \operatorname{arctanh}\Bigg( \prod_{\Lambda_j \in \{\sim \Lambda_i\}} \tanh\left(\frac{\Lambda_j}{2}\right) \Bigg). \tag{2}$$
However, the use of tanh and arctanh is a source of quantization error in implementations. The check node function in the Min-Sum algorithm [10] removes the quantization error and reduces the complexity at the cost of an approximation. This check node function is given by
$$\Lambda'_i = \Bigg( \prod_{\Lambda_j \in \{\sim \Lambda_i\}} \operatorname{sgn}(\Lambda_j) \Bigg) \min_{\Lambda_j \in \{\sim \Lambda_i\}} |\Lambda_j|,$$
where $\operatorname{sgn}(\cdot)$ denotes the sign function.
The approximation can be improved by performing a multiplicative or additive correction on the min(·) operation [11], [12]. With a multiplicative correction, the algorithm is known as Normalized Min-Sum (NMS). The NMS check node function is given by
$$\Lambda'_i = \alpha \Bigg( \prod_{\Lambda_j \in \{\sim \Lambda_i\}} \operatorname{sgn}(\Lambda_j) \Bigg) \min_{\Lambda_j \in \{\sim \Lambda_i\}} |\Lambda_j|,$$
where $0 < \alpha \le 1$ depends on the code structure. Although the optimal $\alpha$ should also depend on the channel signal-to-noise ratio (SNR), this can be ignored in practice [11].
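The LLR-domain node functions reviewed in this subsection can be sketched in a few lines of Python. This is only an illustrative sketch under the LLR convention $\Lambda = \ln((1-p)/p)$ used above; the function names (`var_llr`, `chk_spa`, `chk_nms`) are hypothetical and do not come from the paper.

```python
import math

def var_llr(prior, inputs, i):
    """SPA variable node output for edge i: channel prior plus all inputs except input i."""
    return prior + sum(x for j, x in enumerate(inputs) if j != i)

def chk_spa(inputs, i):
    """SPA check node output for edge i (tanh rule) over all inputs except input i.

    Note: math.atanh raises ValueError if the product reaches exactly +/-1,
    which can happen in floating point for very large input magnitudes.
    """
    prod = 1.0
    for j, x in enumerate(inputs):
        if j != i:
            prod *= math.tanh(x / 2.0)
    return 2.0 * math.atanh(prod)

def chk_nms(inputs, i, alpha=1.0):
    """Normalized Min-Sum check node output; alpha = 1 gives plain Min-Sum."""
    others = [x for j, x in enumerate(inputs) if j != i]
    sign = 1.0
    for x in others:
        if x < 0:
            sign = -sign
    return alpha * sign * min(abs(x) for x in others)
```

For any finite inputs, the Min-Sum output has the same sign as the Sum-Product output and a magnitude that is at least as large, which is why a factor $\alpha \le 1$ is a natural correction.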
B. Stochastic Belief Representation
The stochastic stream representation expresses belief information by using a random sequence of binary messages, where the information is contained in the sequence's mean function. In the probability domain, the Sum-Product check node function for $n$ inputs is given by [13]
$$\mathrm{CHK}_p(p_1, p_2, \ldots, p_n) = \frac{1 - \prod_{i=1}^{n} (1 - 2 p_i)}{2}.$$
For the stochastic check node function, we let the check node inputs be the random bit sequences $\{X_1(t), \ldots, X_n(t)\}$, independent and distributed such that $E[X_i(t)] = p_i(t)$. To evaluate the check node function on these stochastic stream inputs, we want the check node function binary output $Y(t)$ to be a random sequence with $E[Y(t)] = \mathrm{CHK}_p(p_1(t), \ldots, p_n(t))$. This is satisfied by
$$Y(t) = \mathrm{CHK}_{\mathrm{bin}}(X_1(t), \ldots, X_n(t)) = X_1(t) \oplus X_2(t) \oplus \cdots \oplus X_n(t), \tag{6}$$
where $\oplus$ represents modulo-2 addition. Note that this is also the function used in Gallager's hard-decision decoding algorithms [13]. The probability domain SPA variable node function for 2 inputs is
$$\mathrm{VAR}_p(p_1, p_2) = \frac{p_1 p_2}{p_1 p_2 + (1 - p_1)(1 - p_2)}.$$
The function can be obtained for more inputs by reusing the two-input function, e.g. VAR p (p 1 , p 2 , p 3 ) = VAR p (p 1 , VAR p (p 2 , p 3 )). If we now let the stochastic variable node function inputs be {X 1 (t), . . . , X n (t)}, we would like the binary output Y (t) to be distributed such that E[Y (t)] = VAR p (p 1 (t), . . . , p n (t)). However, there is no memoryless binary-valued function that will achieve this. Some functions with memory are proposed in [5] , [6] . In this paper, we introduce a new variable node function that achieves higher accuracy by performing its core computation in the LLR domain. Furthermore, the new function can handle an extension of the concept of stochastic message to more than one bit.
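The relation between the modulo-2 check node and the probability-domain check node function can be checked numerically. The following Monte-Carlo sketch assumes independent Bernoulli input bits and the standard probability-domain expressions for the check node and two-input variable node functions; the names `chk_p`, `var_p` and `xor_stream_mean` are hypothetical helpers introduced only for illustration.

```python
import random

def chk_p(probs):
    """Probability that the modulo-2 sum of independent bits with Pr(1) = p_i equals 1."""
    prod = 1.0
    for p in probs:
        prod *= 1.0 - 2.0 * p
    return 0.5 * (1.0 - prod)

def var_p(p1, p2):
    """Two-input probability-domain SPA variable node function."""
    num = p1 * p2
    return num / (num + (1.0 - p1) * (1.0 - p2))

def xor_stream_mean(probs, n_samples, rng):
    """Empirical mean of Y = X_1 xor ... xor X_n for independent Bernoulli streams."""
    ones = 0
    for _ in range(n_samples):
        y = 0
        for p in probs:
            y ^= rng.random() < p  # draw one bit of stream X_i and accumulate the XOR
        ones += y
    return ones / n_samples
```

With enough samples, `xor_stream_mean(probs, n, rng)` approaches `chk_p(probs)`. No comparably simple memoryless bit-level operation reproduces `var_p`, which is the difficulty discussed above.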
III. THE RHS ALGORITHM
The objective of the RHS algorithm is to achieve high-precision decoding while only relying on binary messages. The advantages of using binary messages include the smaller number of wires required for transmitting messages, and the low complexity and other interesting properties of the binary check node function (Eq. (6)), which will be discussed in Section III-C. Note that these advantages are not related to the number of bits that are sent on a given edge of the factor graph during one iteration, as long as the bits are sent sequentially on the same wire. We can therefore introduce an extension of stochastic messages to sequences of binary messages. We will refer to these extended stochastic messages as "iteration messages". In an RHS decoder, all information is exchanged between processing nodes in the form of stochastic messages, but the variable node computation can be performed in another representation domain. This allows computing the variable node function with any desired accuracy.
A. Check Node Function
As is the case for SPA, the RHS check node function takes as input $d_c - 1$ messages, and produces one output message. Ideally, for output $i$ this computation would be evaluating $m_i = \mathrm{CHK}_p(\sim p_i)$. However, the messages are constrained to be binary, and the CN instead evaluates (6) one or several times. Parameter $k > 0$ controls the number of times (6) is used in one iteration. The binary inputs to the CN are represented by random variables $X_{i,j}$, where $i \in [1, d_c]$ is the input index, and $j \in [1, k]$ the bit index. The $j$-th binary check node evaluation for output $i$ is given by
$$Y_{i,j} = \mathrm{CHK}_{\mathrm{bin}}(\sim X_{i,j}) = \bigoplus_{l \ne i} X_{l,j}. \tag{8}$$
We then define a function $g_k$ that estimates the ideal iteration message $m_i$ from the binary messages:
$$\hat{m}_i = g_k(Y_{i,1}, Y_{i,2}, \ldots, Y_{i,k}). \tag{9}$$
In practice, the variable node circuit receives the binary sequence and evaluates (9) itself, but conceptually, this evaluation belongs to the check node computation. Note also that this message estimate corresponds to a single iteration of the algorithm, and should not be confused with the tracking estimator that will be introduced shortly. We will denote by $\mathcal{M}$ the image of $g_k$, which we call the message set. If the check node input messages $\{X_{1,j}, \ldots, X_{d_c,j}\}$ are independent and distributed such that $E[X_{i,j}] = p_i$, then $E[Y_{i,j}] = \mathrm{CHK}_p(\sim p_i) = m_i$,
and the optimal message estimator $g_k$ is the sample mean:
$$\hat{m}_i = g_k(Y_{i,1}, \ldots, Y_{i,k}) = \frac{1}{k} \sum_{j=1}^{k} Y_{i,j}. \tag{10}$$
Therefore the check node function is obtained by combining (10) and (8). Under (10), $\mathcal{M}$ has $k + 1$ elements. The $n$-th element of $\mathcal{M}$ will be denoted $\mu_n$, $0 \le n \le k$, and defined by $\mu_n = \hat{m}$ such that $\sum_{j=1}^{k} Y_{i,j} = n$, so that $\mu_n = n/k$ under (10).
Fig. 1. Functional diagram of the RHS VN computation. For clarity, only the output associated with edge $i = 1$ is shown. An iteration index $t$ is implied for all variables, except the prior $p_o$. (11) and (13) refer to the respective equations.
B. Variable Node Function
The variable node function for output $i$ takes as input $\{\sim \hat{m}_i\} \in \mathcal{M}^{d_v - 1}$ and the codeword symbol prior $p_o$, and generates a binary output sequence $\{X_{i,1}, X_{i,2}, \ldots, X_{i,k}\}$. A functional representation of the computation is shown for a single output in Fig. 1.
The intuition behind the RHS algorithm is to allow information to be transmitted using simple messages, while at the same time obtaining a high precision decoder using an averaging mechanism. We achieve this by using messages $\hat{m}_i(t)$ that are stochastic, and the averaging consists in estimating the mean function $p_i(t)$ of the message sequence. For the sake of simplicity, we use a linear tracking function, given by
$$p_i(t) = (1 - \beta)\, p_i(t-1) + \beta\, \hat{m}_i(t), \tag{11}$$
where $\hat{m}_i(t)$ is the new received message, and $0 < \beta \le 1$ is a real constant. Using the estimates $p_i(t)$ as input, the standard SPA function is then used to generate an intermediate output message $p'_i(t)$. The specific function used will depend on the representation domain. In the probability domain, the output probability for edge $i$, $p'_i(t)$, is given by
$$p'_i(t) = \frac{p_o \prod_{p_j \in \{\sim p_i\}} p_j(t)}{p_o \prod_{p_j \in \{\sim p_i\}} p_j(t) + (1 - p_o) \prod_{p_j \in \{\sim p_i\}} (1 - p_j(t))}. \tag{12}$$
However, we will see in Section IV that the LLR domain is more convenient for the implementation. Finally, we generate the binary sequence that will be sent on edge $i$. This sequence is composed of $k$ independent binary random variables $X_{i,j}$ such that $E[X_{i,j}] = p'_i(t)$, $1 \le j \le k$. This can be implemented by generating $k$ random thresholds $T_j$, such that $T_j$ is uniformly distributed over $[0, 1]$, and constructing $X_{i,j}$ as
$$X_{i,j} = \begin{cases} 1 & \text{if } T_j < p'_i(t), \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$
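The three stages of the variable node computation described above (message estimation and tracking, the SPA variable node function, and the generation of output bits by thresholding) can be sketched as follows. This is a minimal probability-domain sketch and not the implementation developed in Section IV; it assumes the sample-mean estimator, a linear tracker that weights the new message by $\beta$, and the helper names `estimate`, `track`, `var_out` and `emit_bits`, all of which are introduced here only for illustration.

```python
import random

def estimate(bits):
    """Sample-mean estimate of an iteration message from its k received bits."""
    return sum(bits) / len(bits)

def track(p_prev, m_hat, beta):
    """Linear tracking of the message mean (assumed form: weight beta on the new message)."""
    return (1.0 - beta) * p_prev + beta * m_hat

def var_out(p_prior, trackers, i):
    """Probability-domain SPA variable node output for edge i (tracker i is excluded)."""
    num = p_prior
    den = 1.0 - p_prior
    for j, p in enumerate(trackers):
        if j != i:
            num *= p
            den *= 1.0 - p
    return num / (num + den)

def emit_bits(p_out, k, rng):
    """Generate k output bits with mean p_out by comparing with uniform thresholds."""
    return [1 if rng.random() < p_out else 0 for _ in range(k)]
```

A single iteration for one edge would call `estimate` on each incoming bit sequence, update the corresponding tracker with `track`, and then pass the trackers to `var_out` and `emit_bits`.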
C. Properties of the Binary Check Node Function
In an implementation of the RHS algorithm, the binary check node function (Eq. (6)) is the only operation performed on message bits as they are transmitted between variable node blocks.
Compared to the check node functions used in SPA or Min-Sum (Eq. (2) to (4)), it has much lower complexity since only a modulo-2 sum is required. Furthermore, the various extrinsic outputs of a check node can be computed in terms of a total function. For a given CN output $Y_i$ we have
$$Y_i = \mathrm{CHK}_{\mathrm{bin}}(\sim X_i) = \mathrm{CHK}_{\mathrm{bin}}(X_1, X_2, \ldots, X_{d_c}) \oplus X_i. \tag{14}$$
This can be used to simplify the implementation by computing only the total $Y_T = \mathrm{CHK}_{\mathrm{bin}}(X_1, X_2, \ldots, X_{d_c})$ and broadcasting this result to all neighboring variable nodes, which then each recover their own extrinsic input as $Y_T \oplus X_i$.
Finally, another interesting property is that (14) can be factored arbitrarily. In a circuit implementation, this allows partitioning the logic in several locations, to provide flexibility in the circuit layout, as done in [4] .
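The total-function property follows from the fact that every bit is its own inverse under modulo-2 addition. A minimal sketch, with a hypothetical helper name, is given below.

```python
from functools import reduce
from operator import xor

def extrinsic_outputs(bits):
    """Return all extrinsic check node outputs from a single total XOR.

    Output i is the XOR of every input except input i, obtained by
    removing input i from the total with one more XOR.
    """
    total = reduce(xor, bits, 0)
    return [total ^ x for x in bits]
```

The same associativity is what allows the total to be computed in several partial sums placed at different locations of the layout.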
D. β-Sequences
The choice of (11) as a tracking function makes RHS similar to a Successive Relaxation SPA algorithm [7], but with the important difference that in the case of RHS, $\hat{m}_i$ and $p_i$ are defined on different domains. Assuming that the maximum number of decoding iterations is fixed to $L$, we are interested in two performance metrics of the decoder, namely the bit error rate (BER) and the average number of iterations. If we constrain $\beta$ to be constant, we can look for the value that optimizes some combination of these metrics. We will refer to this value as the optimal length-one $\beta$-sequence, denoted $\beta^*$. Experimental evidence shows that it depends on $k$, $L$, and on the channel SNR. This was also observed in [7] for $L$ and SNR. For a given $k$, $L$, and SNR, a simple way to identify $\beta^*$ is through Monte-Carlo simulation. Our results show that $\beta^*$ is only weakly affected by SNR, and that in practice this dependence can be ignored. Therefore the simulation can be performed at a moderate SNR, thereby ensuring that the computational complexity is reasonable. In our Monte-Carlo simulator, we included the ability to record BER as a function of the decoding iteration index. The BER can then be plotted in terms of the number of iterations, as in Fig. 2(a) (we will call these "settling curves"), or as a transfer function in the style of an error-probability EXIT chart [14], as in Fig. 2(b). By superimposing settling curves corresponding to different $\beta$ values, one can identify $\beta^*$ as a function of $L$, albeit with the constraint that $\beta^*$ is in the set of $\beta$ values simulated. For example, from Fig. 2(a) we can determine that given $\beta^* \in \{0.5, 0.25, 0.15\}$, $\beta^* = 0.5$ for $L \le 11$, and $\beta^* = 0.25$ for $11 < L \le 50$. In this example, we have used a cost function that assigns a small but non-zero weight to the average number of iterations, such that the optimization is in terms of BER, but ties in BER are settled in favor of the faster algorithm.
We now want to consider making $\beta$ a function of the iteration index $t$. A parallel can be made with Gear-Shift decoding [9], which considers how several decoding rules can be used in sequence to optimize some performance metric (such as the maximum or the average number of iterations) while achieving a given target BER. To find the optimal sequence of decoding rules, [9] assumes that these rules have the following two properties. First, that the messages sent from variable nodes to check nodes can be described by a one-parameter probability density function (PDF). Let $c(t)$ be the PDF parameter at iteration $t$. The second necessary property is that $c(t+1)$ is completely determined by $c(t)$ and $c_o$,
where $c_o$ is the parameter of the channel output PDF. Unfortunately, the second property does not hold for BP algorithms with memory. For example, in the case of RHS, the variable node input message $p_i(t)$ in (11) depends on all received messages $\{\hat{m}_i(1), \ldots, \hat{m}_i(t)\}$, which becomes clear when unrolling the recursion. Our goal is to jointly optimize the BER and the average number of iterations. As a design procedure, we propose to initially assume that the second property holds, i.e., that the algorithm is memoryless. In this case, the best $\beta(t)$ is simply the one that minimizes the BER at iteration $t$. Therefore the $\beta$-sequence can be read off the transfer function plot. Note that for a memoryless algorithm, the same sequence minimizes both the BER and the average number of iterations. Since in reality the second property does not hold, this sequence is only used as a starting point. Following [9], we will write the sequences as $\{\beta_1^a, \beta_2^b, \ldots\}$ to mean that $\beta_1$ is used for the first $a$ iterations, followed by $\beta_2$ for the next $b$ iterations, and so on. If we constrain the sequence to be of the form $\{\beta_1^l, \beta_2^{L-l}\}$, only the parameter $l$ is left to be found, and if necessary it can be adjusted to trade off BER for throughput. For example, Fig. 2(b) shows the BER transfer curves for various values of $\beta$. From the plot we see that $\beta = 0.5$ is the best choice up to a BER-in of $3 \cdot 10^{-4}$, while for lower BER-in values, $\beta = 0.25$ is superior. A BER of $3 \cdot 10^{-4}$ is achieved in 5 iterations with $\beta = 0.5$, therefore the best sequence is $\{0.5^5, 0.25^{L-5}\}$. The actual BER performance of this sequence is shown in Fig. 2(a). Compared with the $\beta = 0.25$ curve, the BER at 50 iterations is slightly degraded, but the average number of iterations is reduced by 33%. Note that for $L \le 50$ (and most likely for any $L$), there is no length-one $\beta$-sequence that will yield this combination of BER and average number of iterations.
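The sequence notation above maps directly to a per-iteration schedule of $\beta$ values. The sketch below expands such a sequence; the function name and the use of `None` to mean "all remaining iterations" are conventions chosen here for illustration, not part of the paper.

```python
def expand_beta_sequence(segments, max_iters):
    """Expand [(beta, count), ...] into one beta value per iteration.

    A count of None means "all remaining iterations" and is only
    allowed on the last segment.
    """
    betas = []
    for idx, (beta, count) in enumerate(segments):
        if count is None:
            if idx != len(segments) - 1:
                raise ValueError("open-ended segment must be last")
            count = max_iters - len(betas)
        betas.extend([beta] * count)
    return betas[:max_iters]

# The sequence {0.5^5, 0.25^(L-5)} from the example, with L = 50.
schedule = expand_beta_sequence([(0.5, 5), (0.25, None)], max_iters=50)
```

A decoder loop would then use `schedule[t - 1]` as the tracking constant at iteration $t$.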
E. Lowering the Error Floor
A convenient way to improve the error floor at the level of the decoding algorithm is to consider a two-phase approach. In the first decoding phase, the normal algorithm is used. If there are unsatisfied check node constraints at the end of the first phase, the decoding is known to have failed and a modified algorithm is used to attempt to resolve the failure. If the two algorithms are similar, the cost in terms of circuit area is kept to a minimum. The two-phase approach has been widely used to improve error-floor performance (e.g. [16] [17] [18] ).
The RHS algorithm should readily integrate with most Phase-II algorithms developed for SPA decoders since the RHS variable node operations are based on the SPA. To illustrate this capability, we will introduce a Phase-II algorithm named "VN Harmonization" that can be used both for RHS and for SPA or Min-Sum decoders. Error rate results for this algorithm will be presented in Section V. VN Harmonization has some similarities with the Phase-II algorithm presented in [17] , but has been found to be successful on certain codes for which the algorithm of [17] is ineffective, such as the (2640,1320) Margulis code [19] . It also has the advantage that it operates locally in each variable node and requires no communication between processing nodes.
For each variable node and each iteration t, the VN Harmonization algorithm performs a modification on the set I(t) = {Λ 1 (t), . . . , Λ dv (t)} of LLR-domain VN inputs. When used with the RHS algorithm, these LLR inputs correspond to the LLR-domain input trackers, which will be introduced in Section IV-B. We partition I(t) into I + (t) = {Λ ∈ I(t) : Λ ≥ 0}, and I − (t) = I(t) \ I + (t). We then define a majority set M (t) as corresponding to the set with the largest number of elements among I + (t) and I − (t), with M (t) = ∅ if |I + (t)| = |I − (t)|, where |S| denotes the cardinality of set S. The algorithm is described by Algorithm 2, where j is such that M (t) = {Λ j (t)} when |M (t)| = 1, and d ≥ 0 is a constant that must be found empirically.
We will now present an efficient implementation for the mechanisms introduced in the previous section. By design, the RHS algorithm minimizes the wiring complexity of the decoder by using only one wire to transmit a given message. When k > 1, the message bits are transmitted serially. In addition, because of the properties of the check node function discussed in Section III-C, the topology of the circuit graph can be modified to simplify wire routing. Since the check node function is very simple, this section is devoted to the variable node function. The LLR domain is attractive for implementing the VN function because (1) is much simpler than the equivalent probability domain function. We present an approach for implementing the RHS VN function in the LLR domain with low complexity.
A. Variable Node Output
For each edge $i$ of the variable node, $1 \le i \le d_v$, a sequence of random message bits $\{X_1, \ldots, X_k\}$ must be generated. This can be achieved by using (13), but we would now like to work with LLR values $\Lambda_i$ instead of $p_i$. As a result, the thresholds $T_j$ must be generated in the LLR domain. To generate LLR thresholds with low complexity, we approximate the natural logarithm with a base-2 logarithm, which can be generated using a simple circuit known as a priority encoder. A priority encoder takes as input a binary sequence $Z$ and outputs the number $W$ of "zero" elements preceding the first "one" element in the sequence. If $Z = \{Z_1, \ldots, Z_q\}$ is generated by a sequence of independent Bernoulli experiments such that $\Pr(Z_i = 1) = \psi_i$, the output $W \in \{0, 1, \ldots, q\}$ has the following probability mass function:
$$\Pr(W = w) = \begin{cases} \psi_{w+1} \prod_{i=1}^{w} (1 - \psi_i) & \text{if } 0 \le w < q, \\ \prod_{i=1}^{q} (1 - \psi_i) & \text{if } w = q. \end{cases}$$
The priority encoder only generates positive numbers, but since the LLR threshold distribution is symmetric, the sign bit can be generated separately using a fair random bit $S$. When quantized to integer values, the LLR threshold is expressed as $T_j = (-1)^S W$, for $W < q$. $\Pr(W = q)$ is a special case that only differs from $\Pr(W = q - 1)$ by a constant factor, and therefore the largest LLR magnitude that can be generated is $|T_j| = q - 1$. When $W = q$, we can let $|T_j|$ take a value that is otherwise underrepresented by our approximation, e.g. let $T_j = (-1)^S \cdot 2$ if $W = q$. We now have to find $\{\psi_1, \psi_2, \ldots\}$ that best approximate the true LLR threshold magnitude distribution. If we consider again the case of an integer quantization, the probabilities $\psi_1 = \frac{1}{4}$ and $\psi_2 = \psi_3 = \ldots = \psi_q = \frac{1}{2}$ provide a good approximation when $q$ is not too large. In a circuit implementation, sequences of fair pseudo-random bits can be generated easily, for example using linear-feedback shift-register circuits. $Z_1$ can be generated by combining two fair random bits, while $Z_2, \ldots, Z_q$ only require one fair random bit each.
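The quality of the priority-encoder approximation can be examined by comparing its output distribution with the distribution of the quantized threshold magnitude. The sketch below assumes that the exact threshold is $T = \ln((1-U)/U)$ with $U$ uniform on $(0, 1)$, so that $\Pr(|T| \le x) = \tanh(x/2)$, and that quantization rounds to the nearest integer; the rounding rule and both function names are assumptions made for illustration.

```python
import math

def priority_encoder_pmf(psi):
    """Pr(W = w) for w = 0..q, where psi[i] = Pr(Z_{i+1} = 1)."""
    pmf = []
    all_zero_so_far = 1.0
    for p in psi:
        pmf.append(all_zero_so_far * p)   # first "one" occurs at this position
        all_zero_so_far *= 1.0 - p
    pmf.append(all_zero_so_far)           # W = q: no "one" in the whole sequence
    return pmf

def rounded_magnitude_pmf(q):
    """Pr(round(|T|) = w) for w = 0..q-1, with T a logistic LLR threshold."""
    def cdf(x):
        return math.tanh(x / 2.0)         # Pr(|T| <= x)
    return [cdf(w + 0.5) - cdf(max(w - 0.5, 0.0)) for w in range(q)]

approx = priority_encoder_pmf([0.25] + [0.5] * 5)
exact = rounded_magnitude_pmf(6)
```

Under these assumptions the first entries of `approx` and `exact` are close, while the entry for magnitude 2 is somewhat lower in `approx`, which is consistent with reassigning the $W = q$ outcome to that magnitude.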
When deriving the stochastic check node function (6), we assumed that all binary messages entering a check node are statistically independent. Furthermore, in (10), we expect $\{Y_1, \ldots, Y_k\}$ to be independent, and therefore, in (13), the sequence $\{T_1, \ldots, T_k\}$ must be independent. To achieve this, the number $r$ of independent random numbers required in the entire decoder for one iteration is at most $r = Nk$, where $N$ is the number of variable nodes. However, by relaxing the requirement for independence, the decoder can function with far fewer random numbers. For the codes considered in this paper, simulations have shown that a random number generator can be shared among 64 VNs without any degradation in performance, that is $r = Nk/64$. As a result, the circuit implementation can contain only a few Random Number Generation modules, and the circuit area occupied by these modules is expected to be negligible.
In practice, the LLR values $\Lambda_i$ are represented on a finite range, and we must consider the impact this has on the operation of the decoder. Let this range be $[-\Lambda_{cap}, \Lambda_{cap}]$, with $\Lambda_{cap} > 0$. If $|\Lambda_i| \le \Lambda_{cap}$, the finite range has no impact. Therefore let $s$ be the number of check node inputs for which $|\Lambda_i| > \Lambda_{cap}$, $0 \le s < d_c$. We can show that this approximately has the effect of changing the mean of $Y_j$, defined in (8), to
$$E[Y_j] = \phi \left( m - \frac{1}{2} \right) + \frac{1}{2}, \tag{16}$$
where $\phi$ is a scaling factor that depends on $\Lambda_{cap}$ and on $s$:
$$\phi = \tanh^s\left( \frac{\Lambda_{cap}}{2} \right). \tag{17}$$
A consequence of (16) is that, when $s > 0$, the message estimator $g_k$ is no longer unbiased as defined in (10), and this has an impact on the error rate performance. To have $E[\hat{m}] = m$, we must replace it with
$$\hat{m} = \frac{1}{\phi} \left( \frac{1}{k} \sum_{j=1}^{k} Y_j - \frac{1}{2} \right) + \frac{1}{2}. \tag{18}$$
This in turn changes the message set $\mathcal{M}$, which is no longer a subset of $[0, 1]$. Similarly the codomain of the tracking function (11) would no longer be $[0, 1]$ and it must be redefined with the appropriate saturation. If we assume that $\phi \ge 1 - \frac{2}{k}$, we have $0 \le \mu_n \le 1$ for $n \in \{1, \ldots, k - 1\}$, and we can define the new tracking function as
$$p(t) = \begin{cases} 0 & \text{if } p(t-1) < L \text{ and } \hat{m}(t) = \mu_0, \\ 1 & \text{if } p(t-1) > H \text{ and } \hat{m}(t) = \mu_k, \\ (1 - \beta)\, p(t-1) + \beta\, \hat{m}(t) & \text{otherwise,} \end{cases} \tag{19}$$
where $L = \frac{\beta}{1 - \beta} \left( \frac{1}{2\phi} - \frac{1}{2} \right)$ and $H = 1 - L$. Ideally, the parameter $s$ in (17) would be set to the expected number of saturated check node inputs, which depends on the SNR and on the iteration index. However, having to set $s$ dynamically would make the decoder too complex, and we will resort to simple heuristic rules. At high SNR, a large portion of messages in a BP decoder become saturated. Therefore, it seems reasonable to use $s = d_c - 1$. Furthermore, when the $d_c$ values are small and $\Lambda_{cap}$ is not too large, simulation results presented in Section V show that such an $s$ can be used at all SNRs with almost no performance degradation. However, when the $d_c$ values are large, simulation results show significant variations in performance between $s = 0$ and $s = d_c - 1$, and a choice must be made based on the application.
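The constants involved in the saturated tracker can be computed with a few lines of code. The sketch below is built on two assumptions that should be checked against the full equations: the scaling factor is taken to be $\phi = \tanh^s(\Lambda_{cap}/2)$, and the unbiased message estimate is taken to be the sample mean rescaled about $\frac{1}{2}$ by $1/\phi$. The numerical values $s = 31$ and $\Lambda_{cap} = 8$ are hypothetical and are not stated in this section.

```python
import math

def scaling_factor(lam_cap, s):
    """Assumed form of the scaling factor: tanh(lam_cap / 2) raised to the power s."""
    return math.tanh(lam_cap / 2.0) ** s

def tracker_limits(beta, phi):
    """Thresholds L and H = 1 - L beyond which the tracker saturates."""
    low = beta / (1.0 - beta) * (1.0 / (2.0 * phi) - 0.5)
    return low, 1.0 - low

def message_set(k, phi):
    """Assumed unbiased message set: sample means n/k rescaled about 1/2 by 1/phi."""
    return [(n / k - 0.5) / phi + 0.5 for n in range(k + 1)]

phi = scaling_factor(lam_cap=8.0, s=31)     # hypothetical parameter values
low, high = tracker_limits(beta=0.15, phi=phi)
lam_L = math.log((1.0 - low) / low)         # largest LLR the tracker must represent
mu = message_set(k=2, phi=phi)
```

With these hypothetical values, `lam_L` comes out close to the value $\Lambda_L = 6.29$ quoted later in this section for $\beta = 0.15$, and only the two extreme elements of `mu` fall outside $[0, 1]$.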
B. Variable Node Input
At the input of the variable node circuit, the values required to evaluate (1) are estimated from the message stream using (11) (or (19)). However, since we choose to perform the variable node computations in the LLR domain, the estimate would need to be converted from the probability domain to LLR. There is a more interesting alternative, which is to design a tracking mechanism that operates directly in the LLR domain. We will first consider how to achieve this when the tracking function is (11), and will later comment on how the mechanism should be modified if (19) is used instead. Using $p = 1/(1 + e^{\Lambda})$, Equation (11) in the LLR domain becomes
$$\Lambda(t) = \ln\left( \frac{1 + e^{\Lambda(t-1)}}{(1 - \beta) + \beta\, \hat{m}(t) \left( 1 + e^{\Lambda(t-1)} \right)} - 1 \right).$$
By fixing the message $\hat{m} \in \mathcal{M}$, we obtain a transfer function that describes the tracker update, which we denote $f(\Lambda; \hat{m})$. The tracker update is then expressed as $\Lambda(t) = f(\Lambda(t-1); \hat{m})$.
With the message estimator given by (10) or (18), the transfer functions have the following symmetry property: $f(\Lambda; \mu_j) = -f(-\Lambda; \mu_{k-j})$, for $\frac{k}{2} < j \le k$. An example is shown in Fig. 3. Therefore, the number of transfer functions we need to consider is $\lfloor k/2 \rfloor + 1$. The remaining functions are simply obtained (and implemented) using the symmetry property.
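The LLR-domain transfer function and its symmetry can be explored numerically. The sketch below assumes the linear tracker $p(t) = (1-\beta)\,p(t-1) + \beta\,\hat{m}(t)$ and the convention $\Lambda = \ln((1-p)/p)$, and it evaluates the update by converting to the probability domain and back; the name `f_llr` is hypothetical.

```python
import math

def f_llr(lam, m_hat, beta):
    """Tracker transfer function f(lam; m_hat) for message estimates in [0, 1]."""
    p_prev = 1.0 / (1.0 + math.exp(lam))            # LLR to probability
    p_new = (1.0 - beta) * p_prev + beta * m_hat    # linear tracking step
    return math.log((1.0 - p_new) / p_new)          # probability back to LLR

# Symmetry check for k = 2: f(lam; mu_2) = -f(-lam; mu_0), with mu_0 = 0 and mu_2 = 1.
left = f_llr(1.3, 1.0, 0.15)
right = -f_llr(-1.3, 0.0, 0.15)
```

Under these assumptions, $f(\Lambda; 0)$ is bounded below by $\ln(\beta/(1-\beta))$, which it approaches as $\Lambda \to -\infty$.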
The transfer functions are non-linear in the LLR domain, but we need the tracking circuits to have a low complexity. Fortunately, the transfer functions can suitably be approximated by a linear function, combined with saturation functions at either or both ends of the linear domain. For each value of $\hat{m}$, the steps for deriving the simplified transfer functions are as follows:
1) Determine the domain $A$ on which $f(\Lambda; \hat{m})$ is to be approximated linearly. $\Lambda_L > 0$ is the maximum absolute value that must be represented in the trackers. It depends on the code structure and is the same as the maximum value that must be represented in an SPA decoder.
2) For $\Lambda \in A$, find the optimal linear approximation $a\Lambda + b$ to $f(\Lambda; \hat{m})$.
3) To simplify the circuit implementation, we want $a$ and $b$ to have binary representations that are as compact as possible, and therefore the constants are rounded according to this criterion. When possible we prefer $a = 1$. This step requires some simulation of the decoder to determine how much the constants can be rounded.
Example 1: Fig. 3 shows the transfer functions of the LLR-domain estimator for the case of 2-bit messages (i.e. $k = 2$ in (9) and (10)) and $\beta = 0.15$. In this case, the possible message estimates are $\mathcal{M} = \{0, \frac{1}{2}, 1\}$. For $f(\Lambda; 0)$, $A = [-1.73, \Lambda_L]$. We notice that the slope of $f(\Lambda; 0)$ on this range is close to 1, therefore we look for an approximation of the form $f(\Lambda; 0) = \Lambda + b$. With $\Lambda_L = 15$, the mean squared error is minimized by $b = 0.206$, which we round to $b = \frac{1}{4}$. $A$ is rounded to $[-\frac{7}{4}, 15]$. $f(\Lambda; 1)$ is obtained by symmetry, and for $\hat{m} = \frac{1}{2}$, we get $f(\Lambda; \frac{1}{2}) = 0.776\,\Lambda$ on the domain $[-2.5, 2.5]$, which we round to $f(\Lambda; \frac{1}{2}) = \frac{3}{4}\Lambda$. Ultimately, the tracking circuit can be very simple. In the example above, the tracker value can be represented on 7 bits, and the only operations that are required are addition with $\pm 1$ (representing $\pm\frac{1}{4}$), and multiplication by $\frac{3}{4}$, which can be implemented as the addition of two shifted values.
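The constants of Example 1 can be approximately reproduced with a simple least-squares fit. The sketch below makes several assumptions that are not stated in the text: the tracker follows $p(t) = (1-\beta)\,p(t-1) + \beta\,\hat{m}(t)$, the squared error is weighted uniformly over the domain, the offset $b$ is fitted with the slope fixed to 1, and the slope for $\hat{m} = \frac{1}{2}$ is fitted through the origin. All function names are hypothetical.

```python
import math

def f_llr(lam, m_hat, beta):
    """Tracker transfer function in the LLR domain (assumed linear tracker)."""
    p_prev = 1.0 / (1.0 + math.exp(lam))
    p_new = (1.0 - beta) * p_prev + beta * m_hat
    return math.log((1.0 - p_new) / p_new)

def grid(lo, hi, n):
    """n equally spaced points covering [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def fit_offset(m_hat, beta, lo, hi, n=20001):
    """Least-squares b for f ~ lam + b: the mean of f(lam) - lam over the domain."""
    pts = grid(lo, hi, n)
    return sum(f_llr(x, m_hat, beta) - x for x in pts) / n

def fit_slope(m_hat, beta, lo, hi, n=20001):
    """Least-squares a for f ~ a * lam (fit through the origin)."""
    pts = grid(lo, hi, n)
    num = sum(x * f_llr(x, m_hat, beta) for x in pts)
    den = sum(x * x for x in pts)
    return num / den

b = fit_offset(0.0, 0.15, -1.73, 15.0)
a = fit_slope(0.5, 0.15, -2.5, 2.5)
```

Under these assumptions, `b` and `a` land close to the values 0.206 and 0.776 reported in the example, before rounding to $\frac{1}{4}$ and $\frac{3}{4}$.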
When using s > 0, the transfer functions are similar, except that f(Λ; μ_0) goes to infinity at the LLR value corresponding to p(t − 1) = L, and similarly f(Λ; μ_k) goes to negative infinity at the LLR value corresponding to p(t − 1) = H, as shown in Fig. 3. The maximum value to be represented in the tracker will therefore be set to Λ_L = ln((1 − L)/L) (Λ_L = 6.29 in our example). However, for f(Λ; μ_0) and f(Λ; μ_k), using range A described above as the linear approximation domain results in a poor approximation, since the functions are highly non-linear near Λ_L and −Λ_L, respectively. The domain of the linear approximation must be reduced slightly in order to obtain a good fit. If we consider the quantized representation of Λ(t), the "infinity" values can be handled by simply assigning a special meaning to the largest positive and negative values. We then re-define (1) to take into account this special meaning. We first define a saturation indicator S_i as
If Σ_{S_j ∈ {∼S_i}} S_j = 0, the output Λ_i is given by (1) as usual.
Otherwise, if Σ_{S_j ∈ {∼S_i}} S_j > 0, we set Λ_i = Λ_cap, and if Σ_{S_j ∈ {∼S_i}} S_j < 0, Λ_i = −Λ_cap.
The proposed linear approximation of the LLR-domain estimator can also support an efficient implementation of the β-sequences introduced in Sect. III-D. In the case of the example above, using multiple values for b was found to provide a throughput advantage comparable to using β-sequences in a decoder that uses (11) directly.
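The modified variable node rule can be illustrated with a short sketch. It rests on assumptions that should be checked against the full text: (1) is taken to be the usual SPA variable node rule (the channel LLR plus the sum of the other incoming LLRs), the saturation indicator is taken to be +1 for the code reserved for positive infinity, −1 for the code reserved for negative infinity, and 0 otherwise, and the regular output is clipped to ±Λ_cap. The function name and the integer encoding are hypothetical.

```python
def vn_output(channel_llr, trackers, i, t_max, llr_cap):
    """Outgoing LLR on edge i of a variable node with reserved "infinity" codes.

    trackers : quantized tracker values, one per edge; +t_max and -t_max are
               the reserved codes standing for +infinity and -infinity.
    The indicator S_j in {-1, 0, +1} used here is an ASSUMED definition.
    """
    others = [t for j, t in enumerate(trackers) if j != i]
    # Sum of the saturation indicators over all edges except edge i.
    s_sum = sum((t == t_max) - (t == -t_max) for t in others)
    if s_sum > 0:
        return llr_cap
    if s_sum < 0:
        return -llr_cap
    # No net saturation: apply the assumed form of (1). Any remaining reserved
    # codes come in +/- pairs and cancel in the sum.
    total = channel_llr + sum(others)
    return max(-llr_cap, min(llr_cap, total))

if __name__ == "__main__":
    print(vn_output(2, [31, 3, -4], i=1, t_max=31, llr_cap=24))
    print(vn_output(2, [31, 3, -4], i=0, t_max=31, llr_cap=24))
```

In this sketch a single saturated input is enough to force the output to ±Λ_cap, unless it is cancelled by a saturated input of the opposite sign.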
C. Area Comparison
It is clear that the RHS algorithm reduces the amount of wiring in a fully-parallel decoder, but we must also consider the amount of area required by the logic. We do so here at a high level, by counting the various elementary modules that are required to build a full decoder, for both RHS and NMS. This number depends on the structure of the code, and on the quantization precision used in the decoder. We denote by N_v the number of variable nodes, N_c the number of check nodes, E the number of edges, d_c the average check node degree, n the number of quantization bits used in NMS, and w the number of bits used to represent tracker values in RHS. The intermediate LLR outputs of the RHS VN use n bits.
In the NMS algorithm, an iteration can be performed as follows. In the variable node, we compute the sum of the LLR inputs using a two's complement representation. We then convert the numbers to a sign and magnitude (S&M) representation and send the messages to the neighboring check nodes. In the check node, we use comparators and multiplexer (MUX) circuits arranged in a tree to find the input with the minimum magnitude. This first minimum is stored in a register, and we re-use the same circuit to find the second minimum. In parallel, we use a tree of XOR gates to compute the parity of the sign bits. We then combine the outgoing sign bits with either the first or the second minimum and send this value to the neighboring variable nodes. In the variable node, these values are converted back to a two's complement representation and stored in a register, which completes the iteration.
Table I gives the module count for fully-parallel implementations of NMS and RHS, with specific values for each of the codes simulated in Section V. For the first code, the values of n = 4 and w = 4 are taken from [22] and [23]. For the second code, n = 5 is based on the necessity of representing an LLR range of [−15, 15], while w = 6 is taken from the example provided in Sect. V-A. The table shows that an RHS decoder requires more registers and comparators, but far fewer MUX-2 and XOR-2 modules. Also, RHS requires additional logic to perform the tracker update, but does not have to convert numbers between S&M and two's complement. This analysis does not allow making a precise quantitative conclusion, but it suggests that the requirements of NMS and RHS in terms of logic are not very different. On the other hand, the wiring complexity is greatly reduced, and this has an important impact on the final area requirement. To illustrate this, we can compare two fully-parallel implementations of an IEEE 802.3an LDPC decoder, one based on RHS [23], and one based on NMS [4].
The NMS implementation is the only fully-parallel implementation of NMS reported for the IEEE 802.3an standard. To achieve this level of parallelism, the authors introduced an approximate check node function that simplifies the wire routing and reduces the complexity of the check node logic. Despite this approximation, which causes a significant loss in error rate, the NMS decoder occupies more area (4.84 mm² vs 4.41 mm²).
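For reference, the NMS check node operation described in this section (first minimum, second minimum, and sign parity) can be sketched in software as follows. This is a minimal sketch of the standard normalized min-sum rule, not of the approximate check node function used in [4]. It assumes that the normalization factor α is applied to the magnitude at the check node output, and the function name is hypothetical.

```python
def nms_check_node(inputs, alpha):
    """Normalized min-sum check node update (sketch).

    inputs : incoming LLRs, one per edge (at least two edges).
    alpha  : normalization factor, assumed to scale the output magnitude.
    Returns the outgoing LLR for every edge.
    """
    mags = [abs(x) for x in inputs]
    # First minimum and its position (the comparator/MUX tree in hardware).
    min1_idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min1_idx]
    # Second minimum: smallest magnitude among the remaining inputs.
    min2 = min(m for j, m in enumerate(mags) if j != min1_idx)
    # Parity of all sign bits (the XOR tree in hardware); 1 means negative.
    parity = 0
    for x in inputs:
        parity ^= (x < 0)
    outputs = []
    for j, x in enumerate(inputs):
        # The edge that supplied the first minimum receives the second minimum.
        mag = min2 if j == min1_idx else min1
        # Removing the edge's own sign from the total parity gives its output sign.
        negative = parity ^ (x < 0)
        outputs.append(-alpha * mag if negative else alpha * mag)
    return outputs

if __name__ == "__main__":
    print(nms_check_node([2.0, -3.0, 4.0], alpha=0.5))
```

Only two magnitudes and one parity bit are shared by all outputs, which is why the hardware description above needs a single minimum-finding tree that is used twice.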
V. SIMULATION RESULTS
To test the performance of the RHS algorithm we consider two standardized codes. The first is taken from the IEEE 802.3an standard. The code structure is based on a shortened Reed-Solomon code (RS-LDPC) [15]. The code has length 2048, rate 0.8413, and is regular with d_v = 6 and d_c = 32. The second code has been standardized in [24]. It has length 2048, with an additional 512 punctured variable nodes, rate 1/2, and d_v ∈ {1, 2, 3, 6}, d_c ∈ {3, 6}. The code design is known as Accumulate-Repeat-4-Jagged-Accumulate (AR4JA) [25].
The performance of the various decoding algorithms is measured using software Monte-Carlo simulations executed on a parallel computing platform. Following the discussion in Sections III and IV, we will present results for various levels of idealization of the variable node tracking functions. We will refer to the use of (11) as "floating-point (FP) probability tracking". The second step towards implementation is to use linearized LLR-domain tracking (as described in IV-B) with full-precision parameters, referred to as "optimal linear tracking". Note that even in this case, we constrain the μ_0 and μ_k solutions to be of the form Λ(t) = Λ(t − 1) + b, such that the corresponding circuit is simply an adder. The last step is to round the linear tracking parameters to reduce the circuit implementation complexity. This is referred to as "rounded linear tracking". For the IEEE 802.3an code, we went further and implemented the trackers with integer data types and quantized channel outputs, to mimic the operation of a circuit implementation.
An RHS decoder has two design parameters, L and k. The iteration limit L should be chosen based on the latency requirement. Figure 4 shows the effect of k on the frame error rate for the two codes simulated. A larger k improves the error rate for a given L. Since the number of predetermined linear operations that must be supported in the tracker circuit is k/2 + 1, the implementation complexity will generally grow with k. However, β* and the corresponding b are increasing functions of k, and a larger b can have the effect of decreasing the number of quantization bits required in the LLR tracker. Since message bits are transmitted serially, a larger k also increases the circuit delay associated with message transmission, but for small values of k, the variable node circuit delay is expected to be dominant. Parameter β is optimized as described in III-D, with BER as the optimization target (unless mentioned otherwise).
The variable node output LLR range, Λ_cap, is chosen as small as possible. We use Λ_cap = 8 for the IEEE 802.3an code, and Λ_cap = 6 for the AR4JA code.
We will first discuss in Section V-A the maximum BER performance of the RHS algorithm, that is, the BER obtained when L is chosen such that any further increase provides a negligible BER improvement. Then, in Section V-B, we consider how RHS performs when the focus is on decoding latency and throughput.
A. Error Correction Capability
The BER results are shown in Fig. 5(a) and 5(b) for the RS-LDPC code, and in Fig. 6 for the AR4JA code. We show the error rate achieved by floating-point SPA and NMS as a reference. For NMS, parameter α in (4) is set to α = 0.5 for the RS-LDPC code, and α = 0.75 for the AR4JA code. As seen in the figures, RHS can match the performance of FP SPA, and even outperforms it on the RS-LDPC code when the iteration limit is large. This superior performance on the RS-LDPC code can be attributed to the successive relaxation iterative dynamics that are a consequence of (11). We can see in Fig. 5(a) that the RHS curve with 1K iterations and FP probability tracking exactly matches the BER of Successive Relaxation SPA [7].
Both codes considered are affected by message saturation effects at low error rates, which can cause error floors. For the RS-LDPC code, the connection between the error floor and message saturation is well documented in [21]. For the AR4JA code, we have observed that when decoding with FP NMS, enforcing a limit on LLR messages causes a floor. For example, when limiting LLR values to the range [−16, 16], the BER of FP NMS never goes below 10^−9. Without the saturation limit, no floor is observed. Because of the message saturation effects, for the AR4JA code we use (19) as the basis for variable node message tracking, with s = d_c − 1 in (17). In the waterfall SNR region, we have observed little difference in BER between s = 0 and s = d_c − 1, which motivates the use of the latter. We can see in Fig. 6 that no floor is observed on either FP NMS or RHS.
For the RS-LDPC code, RHS has an error floor that is comparable with quantized NMS. On this code, we have a solution available to address floor performance that has low complexity and is very effective, namely the VN Harmonization algorithm that was introduced in III-E. This solution is therefore preferred over the use of s > 0, especially since we have observed that for the RS-LDPC code, using s > 0 degrades the BER in the waterfall region. We present some curves that use a decoding Phase-II with VN Harmonization at specific SNRs, denoted by (*), (**) and (***) in Fig. 5(a) and 5(b). The parameter d in Alg. 2 is set to 0.3. For the other RHS curves in Fig. 5(a), no Phase-II is used because the resulting BER would require too much computing time to be simulated.
As expected, the best BER performance is obtained when using (11) implemented with floating-point operations. We first want to observe the impact of the linear approximation to LLR-domain tracking. We can see that for both codes, the "optimal linear tracking" curves are close to FP probability tracking. We then consider the BER performance when low-complexity parameters are used for the LLR-domain tracking. To give an example of how we can expect the algorithm to perform once implemented in hardware, we show a simulation of the RS-LDPC code where the software implementation uses integer data types and the channel outputs are quantized on 4 bits. The specific parameters used were presented in the example of Section IV-B. The BER achieved by this simulation ("rounded linear tracking") shows approximately a 0.1 dB gain over NMS with 4-bit inputs, similar to the difference observed between "optimal linear tracking" RHS and FP NMS. On the AR4JA code, the "rounded linear tracking" curve uses the same software implementation as "optimal linear tracking", but with rounded parameters. For the AR4JA code, the BER curves are shown for k = 4, and therefore 3 different linear calculations must be implemented in the tracker. Since we choose s = d_c − 1 in (17), f(Λ; m) will depend on the degree d_c of the check node generating the message m. The AR4JA code has d_c ∈ {3, 6}. However, after rounding the parameters, the transfer functions are the same for both check node degrees, with the exception that Λ_L = 6.25 for d_c = 3, and Λ_L = 5.5 for d_c = 6. The transfer functions used for the "rounded linear tracking" result are as follows:
The functions are saturated outside the ranges specified. As was the case for the RS-LDPC code, these tracking functions have a low implementation complexity. Furthermore, Λ(t) can be represented on only 6 bits. 
B. Average and Worst-Case Decoding Time
Both the average and worst-case decoding times are important performance metrics. Assuming the decoder terminates as soon as a codeword is found, the average decoding time determines the data throughput with an infinite-capacity input buffer, and is also related to the decoding energy per bit. A shorter average decoding time will improve energy efficiency even without an input buffer, since the decoder can be put to sleep when the decoding uses fewer iterations than the maximum. The worst-case decoding time determines the data throughput without input buffering. The average and worst-case decoding times depend respectively on the average and maximum number of iterations, but also on the time required to perform one iteration.
At equal error rate performance, we expect that RHS will require a higher iteration limit than NMS, because more information is exchanged in one iteration of NMS. However, the circuit delay of one RHS iteration is probably smaller than in the case of NMS, because of the simpler check node operation, and most importantly, RHS allows achieving a much higher parallelism in the implementation. Unfortunately, it is difficult to say in general what the complexity ratio of one NMS iteration to one RHS iteration will be, but we can point to a circuit implementation of an RHS decoder that performs 50 iterations in 100 clock cycles [23], while an NMS implementation for the same code performs 8 iterations in 96 clock cycles [22], and both implementations use comparable area. In this case, an RHS iteration can be seen as 6 times "simpler". However, our claim concerning the RHS algorithm is merely that it can allow building decoders with a worst-case decoding time comparable to other approaches, but a faster average decoding time. This stems from the fact that the average number of iterations required by RHS is not much higher than NMS.
As presented in III-D, the average number of iterations required for convergence can be reduced by using multiple β values in sequence. For the RS-LDPC code, we found that the average number of iterations can be reduced to 3.46 (at 4.6 dB) by using β = {0.5^5, 0.25^(L−5)}. The RHS curve with L = 100 iterations in Fig. 5(b) uses this β sequence. Note that in practice, the behavior of the β sequence is implemented by varying b, as mentioned in Sect. IV-B. In comparison, the NMS algorithm requires an average of 2.5 iterations at 4.6 dB, and therefore RHS requires only 38% more iterations. On the AR4JA code, we have observed that using β = {0.5^26, 0.25^(L−26)} provides a 28% throughput improvement (at 2.5 dB) over β = 0.25. After this optimization the average number of iterations is 2.6× higher than for NMS. For both codes, it seems likely that an RHS decoder can achieve a significantly faster average decoding time than a decoder based on NMS. We see that RHS has a larger advantage on the RS-LDPC code, and this leads us to speculate that RHS will in general have a larger advantage on codes that are designed for fast convergence.
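The figures quoted in this section can be combined into a rough estimate of the average decoding time on the RS-LDPC code. The calculation below is only an illustration under strong assumptions that the text does not make: both decoders are taken to run at the same clock frequency, the number of clock cycles per iteration is taken from the implementations in [23] and [22] and assumed constant, and early termination is assumed to add no overhead. The function and variable names are hypothetical.

```python
def cycles_per_iteration(iterations, clock_cycles):
    """Average number of clock cycles spent on one decoding iteration."""
    return clock_cycles / iterations

# Implementation figures quoted in the text for the IEEE 802.3an code.
rhs_cpi = cycles_per_iteration(50, 100)   # RHS decoder of [23]
nms_cpi = cycles_per_iteration(8, 96)     # NMS decoder of [22]
iteration_cost_ratio = nms_cpi / rhs_cpi  # the "6 times simpler" figure

# Average iteration counts at 4.6 dB quoted in the text.
rhs_avg_iterations = 3.46
nms_avg_iterations = 2.5
extra_iterations = rhs_avg_iterations / nms_avg_iterations - 1.0

# Average decoding time in clock cycles, ASSUMING equal clock frequencies.
rhs_avg_cycles = rhs_avg_iterations * rhs_cpi
nms_avg_cycles = nms_avg_iterations * nms_cpi
time_ratio = rhs_avg_cycles / nms_avg_cycles

if __name__ == "__main__":
    print(f"cycles per iteration: RHS {rhs_cpi}, NMS {nms_cpi}")
    print(f"RHS needs {100 * extra_iterations:.0f}% more iterations on average")
    print(f"average cycles: RHS {rhs_avg_cycles:.2f}, NMS {nms_avg_cycles:.2f}")
    print(f"RHS / NMS average decoding time: {time_ratio:.2f}")
```

Under these assumptions, the 38% increase in the average number of iterations is more than offset by the lower cost per iteration. The actual ratio depends on the clock frequencies achieved by each circuit, and on whether the cited implementations use the same decoder parameters as the simulations.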
Other algorithms that are suitable for achieving a high level of parallelism include the previously proposed fully stochastic algorithms such as [6], and Split-Row NMS [4]. Their BER on the RS-LDPC code is shown in Fig. 5(b). We can see that 30 iterations of RHS provide a better BER than 400 iterations of the fully stochastic algorithm. Even though a decoding iteration of the fully stochastic algorithm is simpler, because the decoder uses less area and has simpler logic, it is clear that the worst-case decoding complexity is much higher than for RHS. Split-Row NMS, like RHS, converges in a few clock cycles on average (even fewer than RHS), resulting in a very low power decoder. However, the approximation used, which involves splitting a check node circuit into multiple sub-nodes, results in degraded BER, and seems to require a code with a large check node degree. Therefore, it might not be applicable to other codes. It is also not obvious whether its BER can be improved by increasing the number of iterations.
VI. CONCLUSION
We introduced a binary message passing decoding algorithm for LDPC codes that simplifies the wiring and the layout of fully-parallel circuit implementations, while being able to achieve an error rate that is comparable to the well-known Sum-Product and Normalized Min-Sum algorithms. To demonstrate the practicality of the RHS algorithm, we presented a low-complexity implementation, as well as simulation results that show that the bit error rate performance remains good even at low decoding latencies. In addition, we introduced the β-sequence method for reducing the average number of iterations with a negligible impact on implementation complexity, and we described an algorithm for resolving some decoding failures in the error floor region, named "VN Harmonization".
Our experimental results suggest that the RHS algorithm provides a significant gain over existing algorithms in average decoding complexity. As a result, the RHS algorithm can be used to build decoders with higher average throughput and increased power efficiency.
