Abstract-We present a hardware-based implementation of linear program (LP) decoding for binary linear codes. LP decoding frames error-correction as an optimization problem. In contrast, variants of belief propagation (BP) decoding frame error-correction as a problem of graphical inference. LP decoding has several advantages over BP-based methods, including convergence guarantees and better error-rate performance in high-reliability channels. The latter makes LP decoding attractive for optical transport and storage applications. However, LP decoding, when implemented with general solvers, does not scale to large blocklengths and is not suitable for a parallelized implementation in hardware. It has been recently shown that the alternating direction method of multipliers (ADMM) can be applied to decompose the LP decoding problem. The result is a message-passing algorithm with a structure very similar to BP. We present modifications to this algorithm, resulting in a more intuitive and hardware-compatible form. This is particularly true for projection onto the parity polytope: the major computational primitive for ADMM-LP decoding. Furthermore, we present results for a fixed-point Verilog implementation of ADMM-LP decoding. This implementation targets a field-programmable gate array (FPGA) platform to evaluate error-rate performance and estimate resource usage. We show that frame error rate performance well within 0.5 dB of double-precision implementations is possible with 10-bit messages. Finally, we outline research opportunities that should be explored en route to an application-specific integrated circuit (ASIC) implementation that is capable of Gigabit-per-second throughput.
I. INTRODUCTION
THE field of error-correction coding was revolutionized in the mid-1990s by the widespread adoption (and academic study) of graph-based codes and associated belief propagation (BP) message-passing decoding algorithms [1]-[3]. A key aspect of the success of these codes was their compatibility with hardware. BP-based decoders are naturally distributed algorithms and variants such as Min-Sum map relatively easily to hardware. Graph-based codes, particularly turbo codes and low-density parity check (LDPC) codes, have been adopted in many real-world systems. For example, LDPC codes have been adopted for the data channel in the "enhanced mobile broadband" (eMBB) component of the fifth-generation (5G) wireless standard. However, BP-based decoding algorithms have known issues. The first is their reliance on the tree assumption for the code-defining graph. In practice, tree codes are not used due to their poor distance properties [4]. This results in the use of LDPC codes without performance or convergence guarantees under BP due to graph cycles. Additionally, it is observed in practice that BP-based decoding algorithms often suffer from "error floors" in high-reliability channels [5].
In the early 2000s, Feldman and his collaborators realized that the maximum likelihood (ML) decoding problem for binary linear codes can be rephrased as an integer program [6] . They obtained a linear program (LP) by relaxing the integer constraints. Feldman's work applies to any binary linear code, but he concentrated on LDPC codes due to their prevalence and smaller constraint sets. These results generated much interest among coding theorists. LPs are an extremely well-studied and understood class of optimization problems, especially when contrasted with BP. For instance, LP decoding has an ML certificate property [6] , such that if LP decoding fails, it fails in a detectable way (to a non-integer vertex). The relaxation can then be tightened and the LP re-run [7] . If a high-quality expander or high-girth code is used, LP decoding is guaranteed to correct a constant number of bit flips [8] , [9] .
On the practical side, there was less excitement. There initially seemed to be no real-world need for such a decoder. Furthermore, traditional LP solvers did not scale easily to the blocklengths of modern error-correcting codes. Nevertheless, a number of groups did study how to build an application-specific low-complexity LP decoder [7], [10]-[12]. In particular, Barman et al. [12] built an application-specific LP decoder that was computationally competitive with BP and that had a message-passing structure with a standard schedule. They solved the LP decoding problem using the alternating direction method of multipliers (ADMM), a decomposition technique used in large-scale optimization [13]. This application of ADMM yielded an algorithm that scales linearly with respect to blocklength and drastically increases throughput. Once LP decoding performance could be studied at long blocklengths, it was observed empirically, and later confirmed theoretically, that LP decoding far outperforms BP in the high signal-to-noise ratio (SNR) regime with regard to error rates [12], [14], [15]. In this regime, LP decoders do not suffer from the same error floor effects as BP. Furthermore, Barman et al. [12] provide a convergence analysis of the ADMM-LP decoding algorithm, giving theoretical guarantees, as well as experimental measurements.
The application of ADMM to LP decoding catalyzed significant interest. A number of generalizations of the ADMM idea were proposed. Prominently, Liu and Draper augmented the objective of LP decoding with a penalty term [16] . This closed the gap in error-rate performance between BP and LP that had initially been observed in the low-SNR (waterfall) regime, while continuing to outperform alternate methods for improving the error-floor performance of BP, e.g., [17] - [19] . We note that similar penalization-type ideas have recently been developed in the context of computer vision and massive-MIMO, e.g., [20] , [21] . Additionally, LP decoding can be used as a subroutine in a multi-stage decoder that quickly approaches ML performance [22] . Thus, for applications in which reliability demands are extreme, LP decoding is an attractive alternative (or complement) to BP. Jiao et al. modified Liu's penalization method to improve error-rate performance of irregular LDPC codes [23] . Further generalizations of ADMM-LP to non-binary and multi-permutation codes were developed in [24] and [25] .
In parallel to these theoretical developments, a number of groups investigated how to reduce the computational load of ADMM-LP while others started to think about how to implement ADMM-LP in hardware. Regarding the former, several groups made progress creating efficient methods for solving the key computational primitive of ADMM-LP decoding: Euclidean projection onto the "parity polytope" [26] - [31] . In the preliminary work that underlies this paper, Wasson and Draper investigated mapping this operation to hardware [28] , [30] , [32] . These initial ideas were built on by a number of other groups [33] - [37] . Several implementation papers have also considered ADMM-LP decoding in other contexts. Debbabi et al. investigated how to schedule messages more efficiently and developed a multicore implementation [38] - [40] , while Ben Thameur et al. investigated fixed-point effects in the context of applying ADMM-LP to the decoding of convolutionally structured LDPC codes [41] .
While useful investigations, these studies do not demonstrate whether or not ADMM-LP decoding is viable in hardware. In this paper, we present a field-programmable gate array (FPGA)-based implementation that shows that the ADMM-LP decoding algorithm can be mapped to hardware without suffering an unacceptable performance loss. First, we review the LP decoding problem and provide modifications which simplify fixed-point implementation. Next, we expand upon the ADMM-LP decoding algorithm in a novel manner, which provides a more intuitive path to hardware implementation. We then dive into the algorithmic specifics of projection onto the parity polytope. The projection algorithm we develop herein expands on developments made in [28] by providing algorithmic enhancements, an improved proof, and the inclusion of simple geometric intuition not previously published. Hardware implementations of these algorithms are then described explicitly, and direction is given on how to assemble the pieces to form a complete LP decoder.
We present results for the [155, 64] Quasi-Cyclic (QC) LDPC code introduced by Tanner et al. [42], the [672, 546] QC-LDPC code from the IEEE 802.11ad (WiGig) standard [43], and an ensemble of (3, 6)-regular [1002, 503] QC-LDPC codes. We test code performance using an FPGA-based simulation environment. While our initial implementation requires more hardware resources than Min-Sum decoders, we show it is possible to preserve the superior error-rate performance of ADMM-LP in fixed-point. We emphasize that our objective in this paper is not to build a decoder that is more efficient than Min-Sum in terms of hardware resource utilization. In high error-rate applications, we do believe that (generally) Min-Sum will be the better choice. The target of ADMM-LP decoding is instead applications that must operate in the extremely low error-rate regime, traditionally dominated by error-floor effects. For these applications, we anticipate that the significant performance gain effected by ADMM-LP can offset the additional hardware costs.
The remainder of this paper is organized as follows. Section II provides an overview of the LP decoding problem, the penalized modification that improves waterfall behavior, and a "centering" reparameterization that improves dynamic range for fixed-point hardware implementation. Section III presents the ADMM-LP algorithm, the important computational primitives, and useful geometric intuition. It also develops the algorithmic contributions of the paper. Section IV describes the hardware architecture of our fixed-point ADMM-LP decoder design. Section V presents error-rate performance and implementation results of our FPGA-based ADMM-LP decoder, and a comparison of our implementation to previously published FPGA-based decoders. Finally, Section VI concludes this paper and provides some general directions for future work.
II. PROBLEM FORMULATION
In this paper, we consider the decoding of binary linear codes. A binary linear code $\mathcal{C}$ of blocklength $n$ is a $k$-dimensional subspace of $\mathbb{F}_2^n$. Such a code can be defined as the null space of the $m \times n$ "parity-check" matrix $H$, i.e., $\mathcal{C} = \{x \in \{0,1\}^n : Hx = 0 \pmod 2\}$. In general, $m \geq n - k$ with equality when $H$ has full rank. The rate of $\mathcal{C}$ is defined to be $R = k/n$, which specifies the number of information bits transmitted per codeword symbol. Each row of the parity-check matrix corresponds to a check, which specifies a subset of codeword symbols that must add to 0 modulo 2. These checks are indexed by the set $\mathcal{J} = \{1, \ldots, m\}$. Each column of the parity-check matrix corresponds to a codeword symbol or variable, indexed by $\mathcal{I} = \{1, \ldots, n\}$. The neighborhood of check $j$, denoted $\mathcal{N}_c(j)$, is the set of variables that check $j$ constrains to add to 0. That is, $\mathcal{N}_c(j) = \{i \in \mathcal{I} : H_{j,i} = 1\}$. Similarly, the neighborhood of variable $i$, $\mathcal{N}_v(i)$, is the set of checks in which variable $i$ participates, $\mathcal{N}_v(i) = \{j \in \mathcal{J} : H_{j,i} = 1\}$.
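As a small illustration of these definitions, the following Python sketch computes the check and variable neighborhoods from a parity-check matrix. The matrix `H` is a made-up toy example (not one of the codes studied in this paper), the function name `neighborhoods` is hypothetical, and indices are 0-based rather than the 1-based sets used in the text.

```python
import numpy as np

def neighborhoods(H):
    """Return (N_c, N_v) as lists of 0-based index arrays for a binary matrix H."""
    H = np.asarray(H)
    # N_c(j): variables that check j constrains (nonzeros of row j).
    n_c = [np.flatnonzero(H[j, :]) for j in range(H.shape[0])]
    # N_v(i): checks in which variable i participates (nonzeros of column i).
    n_v = [np.flatnonzero(H[:, i]) for i in range(H.shape[1])]
    return n_c, n_v

# Toy 2 x 4 parity-check matrix, for illustration only.
H = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1]])
n_c, n_v = neighborhoods(H)
```

Here `n_c[0]` lists the variables in the first check and `n_v[1]` lists the checks that touch the second variable.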
Given a stochastic channel model $P(y|x)$, where $y \in \mathcal{Y}^n$ is the channel output, ML decoding amounts to maximizing the likelihood over the set of codewords. That is, we decode to $\arg\max_{x \in \mathcal{C}} P(y|x)$. It was shown in [6] that, when considering a binary linear code transmitted over a symmetric memoryless channel, the ML decoding objective is linear in the length-$n$ vector $\gamma$ of log-likelihood ratios (LLRs) $\gamma_i = \log(P(y_i|0)/P(y_i|1))$. The ML decoding problem is
$$\arg\min_{x} \; \gamma^T x \quad \text{subject to} \quad x \in \mathcal{C}. \tag{1}$$
We note that $\gamma$ can be multiplied by any positive scalar without changing the problem. Having framed ML as an optimization problem with a linear objective, we are ready to develop the LP relaxation first proposed in [6]. First, denote by $x_S$, $S \subseteq \mathcal{I}$, the length-$|S|$ vector formed with the components of $x$ indexed by $S$. With this notation, we can restate the parity-check condition for a binary linear code as $\mathcal{C} = \{x \in \{0,1\}^n : \mathbf{1}^T x_{\mathcal{N}_c(j)} = 0 \pmod 2 \text{ for all } j \in \mathcal{J}\}$. Each of the $m$ constraints in this set can be visualized as requiring that the vector of codeword variables connected to any particular check be an even-weight vertex of the unit hypercube.
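To make the linear ML objective concrete, the sketch below decodes by exhaustive search: it enumerates every binary vector, keeps those satisfying all parity checks, and returns the one minimizing $\gamma^T x$. This is only feasible for tiny blocklengths and is not how any decoder in this paper operates; the function name and the length-3 repetition code used in the example are illustrative assumptions.

```python
from itertools import product
import numpy as np

def ml_decode_bruteforce(H, gamma):
    """Exhaustive ML decoding: minimize gamma^T x over codewords of H (tiny n only)."""
    H = np.asarray(H)
    gamma = np.asarray(gamma, dtype=float)
    best, best_cost = None, np.inf
    for bits in product((0, 1), repeat=H.shape[1]):
        x = np.array(bits)
        if np.any((H @ x) % 2):      # some parity check is violated
            continue
        cost = float(gamma @ x)      # linear ML objective
        if cost < best_cost:
            best, best_cost = x, cost
    return best, best_cost

# Length-3 repetition code (illustrative): codewords are 000 and 111.
H_rep = np.array([[1, 1, 0],
                  [0, 1, 1]])
x_hat, cost = ml_decode_bruteforce(H_rep, [2.0, -1.0, 1.5])
```

Multiplying the LLR vector by a positive scalar leaves `x_hat` unchanged, in line with the scaling remark above.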
The LP decoding problem results from relaxing these constraints [6]. Instead of requiring the vector of variables connected to each check to be an even-weight vertex of the unit hypercube, LP decoding requires this vector to lie in the convex hull of these vertices. Visualized in Fig. 1, the convex hull of the even-weight vertices of the unit hypercube is termed the "parity polytope," denoted $\mathbb{PP}_d$ in $d$ dimensions:
$$\mathbb{PP}_d := \operatorname{conv}\left(\left\{e \in \{0,1\}^d : \|e\|_1 \text{ is even}\right\}\right). \tag{2}$$
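The dimension $d$ will be used as a placeholder parameter. The parity polytope $\mathbb{PP}_d$ can be explicitly defined by a number of half-space inequalities [6]. Every odd-weight vertex is surrounded by even-weight vertices. Each half-space inequality is defined by the hyperplane that contains all these even-weight vertices and "cuts" off the half of the space in which the odd-weight vertex sits. Half of the $2^d$ vertices are of odd weight, so $2^{d-1}$ half-space constraints can be used. Each such inequality corresponds to one of the constraints in the first line of the following description of $\mathbb{PP}_d$, where we use the notation $[d] := \{1, \ldots, d\}$:
$$\mathbb{PP}_d = \left\{ v \in \mathbb{R}^d \;:\; \begin{aligned} &\sum_{i \in S} v_i - \sum_{i \in [d] \setminus S} v_i \le |S| - 1 \quad \forall\, S \subseteq [d] \text{ with } |S| \text{ odd}, \\ &0 \le v_i \le 1 \quad \forall\, i \in [d] \end{aligned} \right\}. \tag{3}$$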
The box constraints $0 \le v_i \le 1$ are not always redundant, e.g., when $d = 2$. In summary, LP decoding requires us to solve
$$\arg\min_{x} \; \gamma^T x \quad \text{subject to} \quad x_{\mathcal{N}_c(j)} \in \mathbb{PP}_{|\mathcal{N}_c(j)|} \;\; \forall\, j \in \mathcal{J}, \quad x \in [0,1]^n, \tag{4}$$
where $x$ is constrained to $[0,1]^n$ since a given $x_i$ is not guaranteed to participate in the parity polytope constraints. Note that LP decoding is not guaranteed to yield the ML solution. Due to the relaxation, the feasible space has fractional vertices. The failure mode of LP decoding is when one of these "pseudocodewords" is the minimal cost vertex.
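The half-space description of the parity polytope can be checked directly for small $d$. The sketch below tests the odd-subset cut inequalities and the box constraints separately; it enumerates all $2^{d-1}$ odd subsets, so it is meant as an illustration rather than an efficient test. The function names and the tolerance `tol` are assumptions made for this example.

```python
from itertools import combinations

def satisfies_cuts(v, tol=1e-9):
    """Check the odd-subset cut inequalities of the (non-centered) parity polytope."""
    d = len(v)
    for size in range(1, d + 1, 2):                  # odd subset sizes only
        for S in combinations(range(d), size):
            inside = sum(v[i] for i in S)
            outside = sum(v[i] for i in range(d) if i not in S)
            if inside - outside > size - 1 + tol:    # cut for subset S is violated
                return False
    return True

def in_parity_polytope(v, tol=1e-9):
    """Membership test: box constraints plus all cut inequalities."""
    in_box = all(-tol <= vi <= 1 + tol for vi in v)
    return in_box and satisfies_cuts(v, tol)
```

For example, `[1, 1, 0]` (an even-weight vertex) passes while `[1, 0, 0]` (an odd-weight vertex) fails. The point `[-0.2, -0.2]` satisfies both cuts for $d = 2$ but lies outside the box, which illustrates why the box constraints are not redundant in that case.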
One might consider rounding the fractional components when the LP solver outputs a pseudocodeword. However, this does not solve the pseudocodeword problem. An alternative approach proposed by Liu and Draper [16] is to augment the objective of (4) with a penalty function. This approach, referred to as "penalized LP" decoding, discourages pseudocodewords by penalizing the closeness of variables to 1/2. Many penalty functions were tested in [16], but we implement only the so-called $\ell_1$-penalty function, due to its good error-rate performance and algorithmic simplicity. The $\ell_1$-penalized LP decoding problem is given by
$$\arg\min_{x} \; \gamma^T x - \alpha \left\|x - \tfrac{1}{2}\mathbf{1}\right\|_1 \quad \text{subject to} \quad x_{\mathcal{N}_c(j)} \in \mathbb{PP}_{|\mathcal{N}_c(j)|} \;\; \forall\, j \in \mathcal{J}, \quad x \in [0,1]^n, \tag{5}$$
where $\alpha \ge 0$ is termed the penalty parameter. The penalty parameter tunes how severely non-binary variables should be penalized. Setting $\alpha = 0$ reduces (5) to (4). The use of $\alpha > 0$ causes (5) to be non-convex. This removes the ML certificate property of LP decoding. However, moderate $\alpha$ values have been empirically observed to improve performance in the waterfall regime [16]. Also, it is easy to conceive of a two-stage decoder that verifies whether a penalized ADMM-LP solution is also the LP decoding solution, thus restoring the ML certificate.
To this point, we have discussed prior formulations of LP decoding. We now discuss a novel transformation of LP decoding that proves useful when designing a fixed-point implementation. The previous LP decoding formulation operates inside the unit hypercube centered around 1/2. In our hardware implementation, signed integers are used to implement fixed-point arithmetic. Therefore, to eliminate possible asymmetries, we prefer LP decoding to operate inside the unit hypercube symmetrically centered around 0. To accomplish this, the simple variable substitution $x_{\text{new}} = x_{\text{old}} - 1/2$ can be applied to (5). The result
$$\arg\min_{x} \; \gamma^T x - \alpha \|x\|_1 \quad \text{subject to} \quad x_{\mathcal{N}_c(j)} \in \mathbb{PP}_{|\mathcal{N}_c(j)|} - \tfrac{1}{2} \;\; \forall\, j \in \mathcal{J}, \quad x \in \left[-\tfrac{1}{2}, \tfrac{1}{2}\right]^n \tag{6}$$
is an equivalent optimization problem with two important differences. The first is that the objective now penalizes closeness to 0 rather than to 1/2. The second is that check neighborhoods must be in the "centered" parity polytope. The $d$-dimensional centered parity polytope $\mathbb{PP}_d - \frac{1}{2}$ is obtained by taking every point in $\mathbb{PP}_d$ and subtracting the length-$d$ all-1/2 vector. For simplicity, we subsequently refer to this shifted object simply as the parity polytope, unless disambiguation is required.
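The equivalence under centering can be seen numerically: shifting $x$ by the all-1/2 vector changes the linear term by the constant $\frac{1}{2}\mathbf{1}^T\gamma$ and leaves the penalty unchanged, so the minimizers correspond one-to-one. The sketch below evaluates both objectives; the function names and the numeric values are illustrative assumptions.

```python
import numpy as np

def penalized_cost(x, gamma, alpha):
    """Objective of the l1-penalized LP in the original [0, 1] coordinates."""
    return float(gamma @ x - alpha * np.abs(x - 0.5).sum())

def centered_cost(x_c, gamma, alpha):
    """Objective after the substitution x_c = x - 1/2 (coordinates in [-1/2, 1/2])."""
    return float(gamma @ x_c - alpha * np.abs(x_c).sum())

gamma = np.array([1.5, -0.5, 2.0])   # illustrative LLRs
alpha = 0.3                          # illustrative penalty parameter
x = np.array([0.2, 0.9, 0.5])        # any point of the unit hypercube

# The two objectives differ only by a constant that does not depend on x.
offset = penalized_cost(x, gamma, alpha) - centered_cost(x - 0.5, gamma, alpha)
```

The value of `offset` equals `0.5 * gamma.sum()` for every choice of `x`.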
III. ALGORITHMS
Section III-A describes the ADMM-LP decoding algorithm. Sections III-B and III-C discuss the two key projection subroutines, onto the parity polytope and onto the probability simplex, respectively.
A. ADMM Decomposition and Message Passing
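In a linear code, each component of the codeword estimate $x$ (generally) participates in multiple check constraints, which inhibits the decomposability of LP decoding. A small modification is therefore introduced in [12] to apply ADMM to LP decoding. We define an auxiliary "replica" variable vector $z_j = x_{\mathcal{N}_c(j)}$ for each check neighborhood. By substituting into (6), we arrive at the following result, which fits the ADMM template:
$$\arg\min_{x,\,z} \; \gamma^T x - \alpha \|x\|_1 \quad \text{subject to} \quad z_j = x_{\mathcal{N}_c(j)}, \;\; z_j \in \mathbb{PP}_{|\mathcal{N}_c(j)|} - \tfrac{1}{2} \;\; \forall\, j \in \mathcal{J}, \quad x \in \left[-\tfrac{1}{2}, \tfrac{1}{2}\right]^n. \tag{7}$$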
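The ADMM decomposition for (penalized) LP decoding starts from the $\ell_2$-regularized Lagrangian
$$L_\mu(x, z, \lambda) = \gamma^T x - \alpha \|x\|_1 + \sum_{j \in \mathcal{J}} \lambda_j^T \left(x_{\mathcal{N}_c(j)} - z_j\right) + \frac{\mu}{2} \sum_{j \in \mathcal{J}} \left\|x_{\mathcal{N}_c(j)} - z_j\right\|_2^2.$$
We use $z$ and $\lambda$ to refer to the $z_j$ and $\lambda_j$ in aggregate. The $\lambda_j$ are length-$|\mathcal{N}_c(j)|$ dual-variable vectors that enforce the $z_j = x_{\mathcal{N}_c(j)}$ equality constraints. The $\mu$ parameter is a positive number that determines the degree of regularization. Regularization does not change the optimization problem solution, but it accelerates algorithmic convergence [13]. While the above regularized Lagrangian was used in previous work, we identify that $\mu$ is a redundant variable and thus, for hardware, we simplify the formulation as follows. First, note that the outputs of the additive white Gaussian noise (AWGN) channel are proportional to the LLRs, and further, channel outputs (decoder inputs) must be scaled in order to be quantized prior to decoding. The following steps allow us to absorb $\mu$ into the LLR scalings. We make the substitutions $\alpha = \mu\alpha'$, $\lambda = \mu\lambda'$ and $\gamma = \mu\gamma'$. These substitutions modify our choice of $\alpha$, slightly change what $\lambda$ represents, and linearly scale the objective function. The problem solution does not change. After dividing by the positive factor $\mu$ and dropping the primes, this results in the Lagrangian
$$L(x, z, \lambda) = \gamma^T x - \alpha \|x\|_1 + \sum_{j \in \mathcal{J}} \lambda_j^T \left(x_{\mathcal{N}_c(j)} - z_j\right) + \frac{1}{2} \sum_{j \in \mathcal{J}} \left\|x_{\mathcal{N}_c(j)} - z_j\right\|_2^2.$$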
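ADMM-LP decoding alternates, in a round-robin manner, between minimizing over codeword estimates $x$ and replicas $z$, followed by an update of the dual variables $\lambda$. Letting $\mathcal{X}$ and $\mathcal{Z}$ represent the feasible sets of $x$ and $z$ (the dual variables are unconstrained), each iteration takes the form [12], [16]
$$\begin{aligned} x^{k+1} &= \arg\min_{x \in \mathcal{X}} \; L(x, z^k, \lambda^k), \\ z^{k+1} &= \arg\min_{z \in \mathcal{Z}} \; L(x^{k+1}, z, \lambda^k), \\ \lambda_j^{k+1} &= \lambda_j^k + \left(x^{k+1}_{\mathcal{N}_c(j)} - z_j^{k+1}\right) \quad \forall\, j \in \mathcal{J}. \end{aligned} \tag{8}$$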
The $x$ update can be decomposed into individual variable updates since the solution to its optimization problem separates into distinct calculations for each variable [12], [16]. Similarly, the $z$ update can be decomposed to update each $z_j$ individually. The $\lambda$ update is already expressed in a decomposed form.

Algorithm 1: ADMM-LP decoding. Variable updates (for $i \in \mathcal{I}$, ending on line 7) and check updates (for $j \in \mathcal{J}$, lines 8-13) are repeated until termination (line 14).
The fact that the updates decompose means that the algorithm performs a set of parallel variable updates followed by a set of parallel check updates. The result is a message-passing algorithm with a structure similar to BP. Variable update $i$ is performed using the LLR $\gamma_i$ and a length-$|\mathcal{N}_v(i)|$ vector of messages from each of its neighboring checks, denoted $m_{\mathcal{N}_v(i) \to i}$. Check update $j$ is performed using the dual-variable vector $\lambda_j$ and a length-$|\mathcal{N}_c(j)|$ vector of messages from each of the neighboring variables. The latter contains the current estimates $x_{\mathcal{N}_c(j)}$ of neighboring variables. The result of the update of check $j$ is a new estimate of the associated dual variables $\lambda_j$, as well as a length-$|\mathcal{N}_c(j)|$ message vector whose components are sent to neighboring variables. This vector is denoted $m_{j \to \mathcal{N}_c(j)}$. Algorithm 1 presents the ADMM-LP decoding algorithm in full. The notation $\Pi_{\mathcal{A}}(\cdot)$ denotes Euclidean projection onto the set $\mathcal{A}$.
The algorithm derivation follows from finding the (con-
Observe that the Lagrangian is almost a quadratic form (other than the $\ell_1$-penalty, which has a slope $\pm\alpha$, and is reflected in the addition or subtraction of $\alpha$ in line 5 of Algorithm 1). The minimization problems in (8) therefore reduce to projections in Euclidean space onto convex objects described by the original problem constraints (the $\mathcal{X}$ and $\mathcal{Z}$ sets). The final form of Algorithm 1 reflects a refactoring of the algorithm presented in [12] and [16] to yield an intuitive description with a natural hardware realization.
Viewing the variable/constraint structure of ADMM-LP in (8) using a factor graph, we observe that the message-passing schedule followed in Algorithm 1 is the standard flooding schedule of BP. This view helps to highlight key differences between ADMM-LP and BP. A factor graph is a bipartite graph whose connectivity, when representing a linear code, is specified by the code's parity-check matrix [3]. Note that the $\lambda_j$ vectors do not appear in Fig. 2. Algorithm 1 shows that the vector of dual variables $\lambda_j$ is used only within the $j$th check update. Therefore, dual-variable vectors serve as internal check vertex states and are not passed as messages in the factor graph sense. This is an important difference from BP, as the dual variables play an important role in improving error-floor performance [15]. A second difference is that the $|\mathcal{N}_v(i)|$ outgoing messages from any variable $i$ are all equal, corresponding to the current estimate of $x_i$. This differs from BP where, in general, all outgoing messages from a variable will differ. Despite these differences, the similarity between LP and BP decoding is apparent on a macroscopic level. The formulation of ADMM-LP as a message-passing algorithm is presented in [12], and the description of its dynamics in the error-floor regime as a jump-linear system [15] contrasts with earlier descriptions of BP as a linear system.
We now examine the steps of Algorithm 1, recalling that this is where our novel refactoring of the ADMM-LP decoding algorithm is helpful. On line 4, variable updates first sum incoming messages and the negative LLR to form a variable estimate $t_i$. Incoming messages tell the variable what value it should take on. Next, on line 5, the penalization is applied. A non-zero penalty pushes each variable estimate in the direction of its current belief. Recall that this is done to discourage fractional solutions (pseudocodewords). When $\alpha$ is small, the effect of penalization is reduced, making the algorithm closer to (unpenalized) LP decoding. A slight difference in Algorithm 1 from penalized LP decoding's original derivation in [16] is that no penalty is applied if $t_i = 0$. This modification is important in a fixed-point implementation to avoid bias in codeword estimates. On line 6, the penalized estimate is normalized by the variable degree and projected onto the $[-\frac{1}{2}, \frac{1}{2}]$ interval. The resulting final estimate is then passed to neighboring checks. Roughly speaking, the variable estimate is the average of the incoming messages. On line 9, the first step in the check update is to add the vector of neighboring variable estimates to the vector of dual variables (the check state vector). An updated vector of the replica estimate is obtained by projecting $v_j$, the result of the addition, onto the parity polytope. This is where the parity polytope constraints of LP decoding are enforced. Using the projection, a new check state $\lambda_j$ and set of outgoing messages $m_{j \to \mathcal{N}_c(j)}$ are calculated (possibly in parallel) on lines 11 and 12.
The dual-variable estimates affect algorithmic progression in two major ways. First, $\lambda_j$ acts as a momentum term on line 9. It brings $v_j$ closer to the previous value of $v_j$, ensuring that $z_j$ does not evolve too erratically. Second, according to the pricing interpretation of duality, $\lambda_j$ specifies the cost of breaking the equality constraint $z_j = x_{\mathcal{N}_c(j)}$. We can see this effect more clearly if line 12 is rewritten as
$$m_{j \to \mathcal{N}_c(j)} \leftarrow z_j - \lambda_j.$$
Since the new $\lambda_j$ value is the mismatch between $v_j$ and $z_j$, line 12 compensates for this mismatch by including it in the outgoing messages. At convergence, the $\lambda_j$ subtracted off here is canceled by the $\lambda_j$ added in to compute $v_j$ on line 9.
Note that a termination condition is not specified in Algorithm 1. While algorithmic convergence can be used as the stopping criterion in floating-point [12] , it may not be possible to obtain convergence in fixed-point to an arbitrary precision. Thus, in our implementation, we impose a fixed number of iterations, but can also terminate early if rounding the current codeword estimate produces a codeword.
While their message-passing schedules are the same, we have already observed two differences between ADMM-LP and BP: the existence of dual variables that form the check states, and the fact that all outgoing messages from a variable node are identical. To this point, a third significant difference has been abstracted: the computational primitive of the check update, which is the Euclidean projection onto the parity polytope. A discussion of how to implement this projection efficiently in hardware is the topic of the next section.
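The message-passing structure described above can be sketched in a few lines of Python. This is a floating-point illustration reconstructed from the prose description of the variable and check updates, not the fixed-point design of this paper. Several details are assumptions: the dual variables and messages start at zero, a bit is decided as 1 when its centered estimate is positive, the outgoing message is formed as $2z_j - v_j$ (equal to $z_j - \lambda_j$ with the new $\lambda_j$), and every variable participates in at least one check. The projection is passed in as a function; `project_pair` is a stand-in that is exact only for degree-2 checks, where the centered parity polytope is the segment $\{(t, t) : |t| \le 1/2\}$.

```python
import numpy as np

def project_pair(v):
    """Exact projection onto the centered parity polytope for d = 2 only."""
    t = np.clip(v.mean(), -0.5, 0.5)
    return np.array([t, t])

def admm_lp_decode(H, gamma, project, alpha=0.0, max_iter=50):
    """Flooding-schedule ADMM-LP sketch in centered coordinates."""
    H = np.asarray(H)
    gamma = np.asarray(gamma, dtype=float)
    checks = [np.flatnonzero(row) for row in H]      # N_c(j), 0-based
    degree = H.sum(axis=0)                           # |N_v(i)|, assumed nonzero
    lam = [np.zeros(len(c)) for c in checks]         # check states (assumed zero init)
    msg = [np.zeros(len(c)) for c in checks]         # check-to-variable messages
    bits = (gamma < 0).astype(int)                   # channel hard decisions
    for _ in range(max_iter):
        # Variable updates: sum messages and negative LLR, penalize, normalize, clip.
        t = -gamma
        for c, m in zip(checks, msg):
            t[c] += m
        t = t + alpha * np.sign(t)                   # np.sign(0) = 0: no penalty at t = 0
        x = np.clip(t / degree, -0.5, 0.5)
        # Early termination when the rounded estimate is a codeword.
        bits = (x > 0).astype(int)
        if not np.any((H @ bits) % 2):
            break
        # Check updates: add state, project, update state and outgoing messages.
        for j, c in enumerate(checks):
            v = x[c] + lam[j]
            z = project(v)
            lam[j] = v - z
            msg[j] = 2 * z - v
    return bits

# Length-3 repetition code (illustrative); all checks have degree 2.
H_rep = np.array([[1, 1, 0],
                  [0, 1, 1]])
decoded = admm_lp_decode(H_rep, [2.0, -1.0, 1.5], project_pair)
```

In this toy run the channel hard decisions `[0, 1, 0]` are not a codeword, and the messages from the two checks pull the middle variable back so that the decoder stops at the all-zero codeword. For codes with higher-degree checks, `project` would be replaced by a general parity polytope projection.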
B. Parity Polytope Projection
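Euclidean projection of a vector $v$ onto the $d$-dimensional parity polytope is specified by the quadratic program
$$\Pi_{\mathbb{PP}_d - \frac{1}{2}}(v) = \arg\min_{z} \; \|v - z\|_2^2 \quad \text{subject to} \quad z \in \mathbb{PP}_d - \tfrac{1}{2}.$$
Projection onto the centered ($\mathbb{PP}_d - \frac{1}{2}$) and non-centered ($\mathbb{PP}_d$) parity polytope are similar operations related as
$$\Pi_{\mathbb{PP}_d - \frac{1}{2}}(v) = \Pi_{\mathbb{PP}_d}\left(v + \tfrac{1}{2}\right) - \tfrac{1}{2}.$$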
Barman et al. began the investigation into efficient projection onto the parity polytope [12], [44]. These researchers established a "two-slice" representation of the polytope and exploited rotational symmetry by sorting the components of $v$ into a canonical coordinate system for projection. However, their algorithm is not well suited for hardware due to its iterative nature. X. Zhang and Siegel [26], [45] improved the method by removing the sort and de-sort operations through efficient identification of the violated cut from (3). Unfortunately, as with the first approaches, the method remains intensively iterative with high latency, input dependence, and low potential for parallelization. Simultaneously, in [26], G. Zhang et al. made a connection to projection onto the probability simplex [27]. This connection provides a clean geometric intuition, but their technique relies on a partial sort operation that is not easy to implement in hardware (except via full sort). Thus, seeking a hardware-compatible approach, in [28], Wasson and Draper combined aspects from both of these papers. They employ the cut search technique from [26] to identify which facet to project onto, and then perform the projection onto the probability simplex as in [27]. This combination avoids the partial sort and replaces it with guaranteed linear complexity. We expand upon that work herein. We develop a centered version of the algorithm (which improves dynamic range), detail a useful geometric interpretation of the operations performed, and provide a cleaner proof.

Fig. 3. Projection onto the parity polytope $\mathbb{PP}_3 - \frac{1}{2}$: identify active facet, transform into canonical coordinate system, project onto probability simplex.
A geometric illustration of the main steps of Algorithm 2 is depicted in Fig. 3 . First, the violated cut from (3) is identified, revealing the active facet. Next, the identified cut defines a similarity transform used to reorient the problem into a canonical orientation. The problem is thereby reduced to projection onto the (centered) probability simplex. Finally, after projecting onto the simplex, the similarity transform is inverted to yield the projection onto the parity polytope. The algorithm has a straightforward and non-iterative execution path, whose steps can largely be parallelized. This, combined with simple intuition, makes Algorithm 2 an excellent candidate for hardware adoption.
In Algorithm 2, lines 1-7 form the facet identification portion of the projection algorithm. The objective here is to identify the vertex cut from (3) that is violated (if one is violated). This amounts to finding the closest odd-weight vertex of the unit hypercube [46]. First, the closest vertex of the hypercube is found and stored in the binary vector $f$. For the case of the centered polytope, considered in Algorithm 2, the actual vertex is $f - 0.5$. If the Hamming weight of $f$ (computed on line 4) is odd, then the closest vertex violates the parity constraint, i.e., it is not a codeword of the single parity-check code, and we have identified a potentially violated cut. On the other hand, if the Hamming weight of $f$ is even, the nearest vertex does not violate the parity constraint. To find the cut, we perturb the $f$ vector in one coordinate. The coordinate to perturb corresponds to the $v_i$ that is closest to the midpoint of the unit interval. This coordinate is identified on line 5 and $f$ is perturbed accordingly to make it of odd weight on line 6.
Once the possibly violated cut is known, a similarity transform applied on line 8 transforms $v$ to $\tilde{v}$. This aligns the identified cut with the (centered) probability simplex. This transformation is illustrated in Fig. 3, where $v$ is the dot in Fig. 3a, $\tilde{v}$ is the dot in Fig. 3b, and $f = [1\ 1\ 1]$. The transformed point $\tilde{v}$ is then projected onto the (centered) probability simplex, as illustrated in Fig. 3c. After projection, the similarity transform is inverted on line 13. The similarity transform is self-inverting.
Algorithm 2: Given a vector $v \in \mathbb{R}^d$, calculate its Euclidean projection onto the parity polytope.
The execution path up through line 14 of Algorithm 2 produces a projection onto the boundary or "shell" of the parity polytope. Through these steps, a point already inside the parity polytope, instead of being left unperturbed, would be projected onto the cut corresponding to the closest odd-weight vertex. To avoid this, we test for parity polytope membership on line 15.
We now describe the test used for parity polytope membership. If the vector being tested is in the unit hypercube, we need only to check if the previously identified cut is violated [46]. Line 15, originally given in [45], tests the hypercube projection of $v$ against the identified cut. This is done by taking the hypercube projection of $\tilde{v}$ and checking on which side of the (centered) probability simplex it lies. If the hypercube projection of $v$ is in the parity polytope, then this must be the parity polytope projection of $v$ since the parity polytope is a subset of the hypercube. If the hypercube projection of $v$ is not in the parity polytope, then $v$ is not in the parity polytope, and the projection of $v$ onto the identified facet $u$ is also the parity polytope projection of $v$.
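The steps discussed so far (facet identification, membership test, similarity transform, simplex projection, inverse transform) can be sketched compactly in floating point. This is an illustration that follows the prose, not a line-by-line copy of Algorithm 2: the membership test is performed before the simplex projection for brevity, the helper `project_simplex_centered` is a standard sort-based routine included only to keep the example self-contained, and all function names are hypothetical.

```python
import numpy as np

def project_simplex_centered(v):
    """Sort-based Euclidean projection onto the centered probability simplex."""
    d = len(v)
    s = np.sort(v)[::-1] + 0.5                    # non-centered, descending order
    css = np.cumsum(s) - 1.0
    k = np.arange(1, d + 1)
    rho = np.nonzero(s - css / k > 0)[0][-1]      # last index that stays unclipped
    tau = css[rho] / (rho + 1)                    # common shift
    return np.maximum(v + 0.5 - tau, 0.0) - 0.5

def project_parity_centered(v):
    """Euclidean projection onto the centered parity polytope PP_d - 1/2."""
    v = np.asarray(v, dtype=float)
    d = len(v)
    # Facet identification: nearest hypercube vertex, forced to odd weight.
    f = (v >= 0).astype(int)
    if f.sum() % 2 == 0:
        f[np.argmin(np.abs(v))] ^= 1              # flip the coordinate closest to 0
    sign = np.where(f == 1, -1.0, 1.0)            # similarity transform T_f
    # Membership test: hypercube projection against the identified cut.
    box = np.clip(v, -0.5, 0.5)
    if np.sum(sign * box) >= 1 - d / 2:
        return box
    # Project the transformed point onto the simplex, then invert the transform.
    return sign * project_simplex_centered(sign * v)

u = project_parity_centered([0.5, 0.5, 0.5])      # centered image of vertex (1, 1, 1)
```

In non-centered coordinates this example projects the odd-weight vertex $(1, 1, 1)$ onto the facet $x_1 + x_2 + x_3 = 2$ of $\mathbb{PP}_3$, giving $(2/3, 2/3, 2/3)$, i.e., `u` is approximately `[1/6, 1/6, 1/6]`.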
The previous paragraphs aim to provide the reader an understanding of Algorithm 2. We now provide a concise proof of correctness.
We first assign a name to the similarity transform used in lines 8 and 12: $T_f(e)_i := e_i(-1)^{f_i}$, where $e \in \mathbb{R}^d$ and $i \in [d]$. Next, we establish that
$$\Pi_{[-\frac{1}{2},\frac{1}{2}]^d}\left(T_f(e)\right) = T_f\left(\Pi_{[-\frac{1}{2},\frac{1}{2}]^d}(e)\right) \tag{10}$$
for all $e \in \mathbb{R}^d$ and $f \in \{0,1\}^d$. This follows since $T_f(\cdot)$ changes only the sign of the input in a component-wise manner, and the interval $[-\frac{1}{2}, \frac{1}{2}]$ is symmetric about 0.

Algorithm 3: Given a vector $v \in \mathbb{R}^d$, calculate its Euclidean projection onto the centered probability simplex. This method returns $w$.

By applying (10) to bring $T_f(\cdot)$ out of the hypercube projection, we arrive at the centered version of the parity polytope cut (3), applied to hypercube projection. Zhang and Siegel provide the forward implication in [26], [46], where their cut search algorithm asserts that the above cut is the only cut that needs to be tested for vectors inside the hypercube. The reverse implication is simple, since one cut being violated implies that parity polytope membership is not satisfied.
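Theorem 3 in [26] states that if $\Pi_{[-\frac{1}{2},\frac{1}{2}]^d}(v)$ is not in the parity polytope, then its parity polytope projection is given by
$$\arg\min_{e} \; \|v - e\|_2^2 \quad \text{subject to} \quad e \in \left[-\tfrac{1}{2}, \tfrac{1}{2}\right]^d, \quad \mathbf{1}^T T_f(e) = 1 - \tfrac{d}{2}.$$
If we apply the substitution $e = T_f(\tilde{e})$, then the problem
$$\arg\min_{\tilde{e}} \; \|T_f(v) - \tilde{e}\|_2^2 \quad \text{subject to} \quad \tilde{e} \in \left[-\tfrac{1}{2}, \tfrac{1}{2}\right]^d, \quad \mathbf{1}^T \tilde{e} = 1 - \tfrac{d}{2}$$
is produced after simplification. This is the Euclidean projection of $T_f(v)$ onto the centered probability simplex. Therefore,
$$\Pi_{\mathbb{PP}_d - \frac{1}{2}}(v) = T_f\left(\Pi_{\mathbb{S}_d - \frac{1}{2}}\left(T_f(v)\right)\right).$$
Otherwise, if $\Pi_{[-\frac{1}{2},\frac{1}{2}]^d}(v)$ is in the parity polytope, then it serves as the parity polytope projection since the parity polytope is a subset of the hypercube.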
C. Simplex Projection
We now consider the final important algorithm, Algorithm 3: projection onto the centered probability simplex. The centered probability simplex $S_d - \tfrac{1}{2}$ is defined by subtracting the all-1/2 vector from the probability simplex
d onto the centered probability simplex is a quadratic program. Additionally, projections onto the centered and non-centered probability simplexes are related in the same manner as for the centered and non-centered parity polytope. Algorithm 3 presents a simplex projection method from [47], modified to project onto the centered simplex. Our interpretation of this algorithm is as follows. Computing the projection is a straightforward optimization solved through analyzing the Karush-Kuhn-Tucker (KKT) conditions [32], [47]. The KKT conditions tell us that the projection is obtained by shifting v along the all-1s vector, clipping components that fall below 0, and ensuring non-clipped components sum to 1. The shift along the all-1s vector results from the fact that the all-1s vector is orthogonal to the simplex, as shown in Fig. 3c. The clipped components are the most negative components of v; therefore, the d possible shifts are computed from a sorted version of v. A smart way to identify the best common shift is developed in [48] and used herein. The magnitude of the common shift is computed on lines 2-5 of Algorithm 3. The shift and clip are implemented on line 7.
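The shift-and-clip interpretation above can be written as a short Python sketch. It follows the standard sort-based simplex projection and handles the centering by adding 1/2 before projecting and subtracting it afterwards. The function name and the exact ordering of the steps are our assumptions and need not match the listing of Algorithm 3 line by line.

```python
def project_centered_simplex(v):
    """Euclidean projection onto {w : w_i >= -1/2, sum_i (w_i + 1/2) = 1}."""
    x = [vi + 0.5 for vi in v]            # move to the non-centered simplex
    rho = sorted(x, reverse=True)         # sort in descending order

    # Candidate shifts: u_k = (rho_1 + ... + rho_k - 1) / k for k = 1..d
    shifts = []
    partial = -1.0
    for k, r in enumerate(rho, start=1):
        partial += r
        shifts.append(partial / k)

    # Keep the candidate with the largest index k such that rho_k > u_k
    shift = shifts[0]
    for r, u in zip(rho, shifts):
        if r > u:
            shift = u

    # Shift along the all-1s vector, clip at 0, return to centered coordinates
    return [max(xi - shift, 0.0) - 0.5 for xi in x]
```

As a small check, projecting the centered point (1, -1) gives (1/2, -1/2): the first component absorbs the whole unit of mass and the second is clipped.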
IV. HARDWARE ARCHITECTURE
We now build upon well-known hardware architectures for message-passing decoders to implement an ADMM-LP decoder. Our hardware architecture specifically targets FPGA devices, taking advantage of the high availability of flip-flops and internal block RAMs (BRAMs). To reduce critical path timing, we deeply pipeline our CN and VN operations using existing on-FPGA flip-flops. We also use existing on-FPGA BRAMs to store intermediate decoder messages, rather than constructing wired message-passing networks. While these design strategies work well for FPGAs, the approach in an ASIC design could be slightly different. In the following sections we provide original implementations of the check and variable processing nodes, as per the operations outlined in Algorithm 1. The Verilog register-transfer level (RTL) source code for our implementation is freely available under the MIT license [49] .
A. Decoder
A central challenge in implementing hardware-based decoders is the scalability of the message-passing network. We restrict our implementation to QC codes [50] and a partially-parallel architecture [51]. This simplifies message routing and memory interfacing. We implement the message-passing network with regularly-distributed on-chip FPGA BRAMs. Fig. 4 presents an overview of our partially-parallel QC-LDPC decoder architecture for (r, s)-regular LDPC codes. The architecture comprises multiple memory types to store input LLRs, intermediate messages, and output codewords, as well as pipelined Check Node (CN) and Variable Node (VN) processing units that perform the arithmetic operations of Algorithm 1. We use an aggressive pipelining strategy that captures every intermediate result in pipeline registers.
QC-LDPC codes are defined by a parity-check matrix formed by tilings of p × p circulant matrices. Each tile of a QC parity-check matrix can be either the all-zeros matrix or some addition of shifted-identity matrices. The tilings naturally divide the parity-check matrix into s = message locations for a check (variable) computation are the locations for the previous check (variable) plus 1 modulo p. This rich class of codes is popular in hardware implementations and standards such as IEEE 802.11ad [43].
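To make the circulant structure concrete, the following Python sketch expands a small table of shift values into a binary parity-check matrix. The nested-list input format, the function names, and the shift direction (row r of a tile with shift s has its one in column (r + s) mod p) are assumptions chosen for illustration; the paper's own shift matrices are those of Fig. 10.

```python
def expand_qc(shifts, p):
    """Expand circulant shift values into a binary parity-check matrix.

    shifts[a][b] is a list of shift values for tile (a, b); an empty list
    denotes the all-zeros tile. Shifts within a tile are assumed distinct.
    """
    rows, cols = len(shifts), len(shifts[0])
    H = [[0] * (cols * p) for _ in range(rows * p)]
    for a in range(rows):
        for b in range(cols):
            for s in shifts[a][b]:
                for r in range(p):
                    # Addition of shifted identities over GF(2)
                    H[a * p + r][b * p + (r + s) % p] ^= 1
    return H

def column_in_tile(check_row, shift, p):
    """Column (within a tile) touched by a given check row of that tile."""
    return (check_row + shift) % p
```

With this convention, moving from one check row to the next inside a tile advances the column by 1 modulo p, which is the addressing pattern that lets a message memory be read with a simple incrementing counter.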
The first step that our decoder executes is to load LLRs into memory. We instantiate s memories, each of depth p, to store the LLRs. Each memory is read in parallel to feed LLRs into s pipelined VNs. The VNs also receive messages from a Check-to-Variable (CN-to-VN) message memory, to be discussed later. At the output of the VNs, the current variable estimates x i are written in parallel into s estimate memories, to be read from upon termination. Variable estimates are also written into Variable-to-Check (VN-to-CN) message memories. There is a VN-to-CN message memory for each shifted-identity matrix in the specification of the parity-check matrix. These memories are addressed using their corresponding shift number to ensure the messages are passed to the proper CN.
Next, r pipelined CNs read their required messages in parallel from the VN-to-CN message memory. In addition, check states are read from check state memories, which are instantiated in the same manner as the VN-to-CN message memories. However, address shifting is not required since these memories are accessed only by CNs. When a CN computation completes, the new check states are written into check state memories and the messages are written into CN-to-VN message memories. The CN-to-VN message memories are structured in the same manner as the VN-to-CN memories, with write operations using cyclic shift information. The process repeats until the maximum number of iterations is exceeded, or some early termination condition is satisfied.
We find our current implementation of the ADMM-LP decoder to be sensitive to fixed-point quantization. Min-Sum decoders can be implemented with 5- or 6-bit message widths while suffering minimal degradation in error-rate performance compared to floating-point [52]. ADMM-LP requires larger bit-widths. We believe the higher precision is required because the result of the projection operation that CNs perform must be quantized. The quantization results in a loss of precision and a corresponding deterioration of message resolution.
We now discuss the logic that underlies the choices we made in selecting fixed-point representations. We first note that a change in the assignment of bits between the integer and fractional parts of fixed-point LLRs amounts to a linear scaling of the LP objective. However, any scaling of the objective in an LP, e.g., of γ in (6), does not change the solution of the LP. This provides some flexibility in choosing the fixed-point representation of the LLRs. Next, we note that each message passed to a VN can be thought of as either trying to overcome the channel information or as trying to reinforce it. Thus, we allocate any extra bit-width to the integer part of a CN-to-VN message. This provides the dynamic range required to override channel LLRs. In contrast, extra bit-width allocated to VN-to-CN messages should be in the fractional part. An increase in the number of fractional bits mitigates the effect of the inexact (due to finite precision) normalization by $|N_v(i)|$ in the VNs.
Based on this intuition, we select fixed-point message representations to retain as much channel information as possible. We first consider the bit-width of LLRs and the estimate outputs, which correspond to the decoder's input and output message widths, respectively. Next, we consider how many additional bits VN-to-CN and CN-to-VN messages will receive. VN-to-CN messages, as well as the estimates, lie in the centered unit hypercube. Therefore, these messages receive one sign bit and zero integer bits. Next, we assign LLRs one sign bit, zero integer bits, and allocate the remainder to fractional bits. This ensures that all channel information is visible in the estimates and the VN-to-CN messages. Experimentation shows that this fixed-point LLR representation provides the best error-rate performance [32]. The CN-to-VN messages are assigned one sign bit and the same number of fractional bits as the LLRs. This is done so the summation in the VN computation produces an output that does not have any constant bits for some given LLR. Finally, the check states are assigned the same representation as the CN-to-VN messages because they are computed in a similar manner.
B. Variable Node
This large addition operation is performed using a pipelined adder tree with $\log_2(|N_v(i)| + 1)$ adder stages with pipelining in between. The output of the adder tree has an additional $\log_2(|N_v(i)| + 1)$ integer bits to prevent overflow. For clarity, we do not show pipelining registers in our implementation schematics in this section. We note here that there may be several ways to pipeline a particular design, depending on the design constraints, e.g., clock frequency, target throughput, etc. To implement penalization, the VN checks if the adder tree result $t_i$ is greater than, equal to, or less than 0. Using this information, a multiplexer then chooses to add α, 0, or −α to the adder tree output. An integer bit is added to the fixed-point representation to avoid overflow.
The next step in the VN is to normalize the penalized sum $s_i$. Division is generally an expensive operation to perform; however, variable degrees are constant for a given code. Therefore, division by $|N_v(i)|$ can be performed by finding its reciprocal during RTL synthesis and computing the normalization via multiplication. The fixed-point representation of the reciprocal has one sign bit and zero integer bits. Our FPGA implementation uses an on-FPGA Digital Signal Processing (DSP) block to execute this multiplication. The reciprocal is represented with the maximum bit-width accepted by the DSP modules on the FPGA used for our error-rate simulations. Theoretically, this results in a large bit-width for the normalization output; however, unused bits are trimmed during FPGA RTL design synthesis.
This normalization is trivial for certain variable degrees. For example, if $|N_v(i)|$ is a power of 2, the normalization can be implemented by bit-shifting the fixed-point representation. Similarly, if the reciprocal of $|N_v(i)|$ has few ones in its fixed-point representation, soft logic can efficiently implement the resulting multiplication. Thus, to simplify the normalization step, a hardware-oriented code design approach can be taken, where $|N_v(i)|$ is chosen to be a power of 2.
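The constant-division idea can be illustrated with integer arithmetic in Python. This is a sketch only: the 17-bit reciprocal width, the round-to-nearest step, and the function names are our assumptions, not parameters of the published RTL.

```python
def reciprocal_constant(degree, frac_bits):
    """Fixed-point reciprocal of a node degree, computed once (as at synthesis)."""
    return round((1 << frac_bits) / degree)

def normalize(value, degree, frac_bits=17):
    """Approximate value / degree for an integer (raw fixed-point) value."""
    if (degree & (degree - 1)) == 0:
        # Power of two: an arithmetic right shift with rounding is enough
        shift = degree.bit_length() - 1
        return (value + ((1 << shift) >> 1)) >> shift
    # Otherwise: one multiplication by the stored reciprocal, then a shift
    recip = reciprocal_constant(degree, frac_bits)
    return (value * recip + (1 << (frac_bits - 1))) >> frac_bits
```

For a degree-3 node, `normalize(300, 3)` multiplies by the stored constant 43691 and shifts right by 17 bits, while a degree-4 node needs no multiplier at all.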
To form the VN output $x_i$, the above normalization must be projected onto the centered unit interval. Similar to the penalization step, the VN tests whether the normalized estimate is less than $-\tfrac{1}{2}$ or greater than $\tfrac{1}{2}$, and clips it to the corresponding bound if so.
The final step of the VN architecture is to format the variable estimate $x_i$ to the correct fixed-point representation. The VN-to-CN messages generally have a smaller bit-width than the projected estimate. Since the projected estimate is guaranteed to be between $-\tfrac{1}{2}$ and $\tfrac{1}{2}$, its fixed-point representation has one sign bit and zero integer bits. Therefore, only excess fractional bits need to be removed, which causes the previously mentioned bit trimming for the normalization output. While not indicated in Fig. 5, it is very important to round (rather than truncate) when removing these fractional bits. Truncation, i.e., rounding down, biases decoding towards lower-weight codewords. Rounding prevents such a bias.
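The VN steps described in this subsection (sum, penalize, normalize, project, format) can be summarized in a behavioral Python sketch. It uses floating-point arithmetic for readability, and it assumes that the channel contribution has already been signed and scaled as the VN update of Algorithm 1 requires. The function names and the `channel_term` argument are ours.

```python
import math

def vn_update(messages, channel_term, alpha):
    """Behavioral model of one variable node with degree len(messages)."""
    t = channel_term + sum(messages)       # adder tree
    if t > 0:                              # penalization multiplexer
        s = t + alpha
    elif t < 0:
        s = t - alpha
    else:
        s = t
    x = s / len(messages)                  # normalization by the node degree
    return min(0.5, max(-0.5, x))          # projection onto [-1/2, 1/2]

def format_estimate(x, frac_bits, rounding=True):
    """Keep frac_bits fractional bits, by rounding or by truncation."""
    scale = 1 << frac_bits
    y = x * scale
    raw = math.floor(y + 0.5) if rounding else math.floor(y)
    return raw / scale
```

Averaged over inputs spread symmetrically around zero, `format_estimate(..., rounding=False)` has a mean error of roughly minus half a least-significant bit, whereas the rounded version has a mean error near zero. This is the bias that the text warns about.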
The most complex operation in a VN is the adder tree. ADMM-LP VNs have $O(|N_v(i)|)$ area scaling and $O(\log |N_v(i)|)$ delay scaling. Additionally, no information needs to be stored in the VN for use in future iterations. The result is a pipeline-friendly module.
C. Check Node
The vector addition output $v_j$ is fed into the parity polytope projection module. It must also be temporarily stored while the projection is computed so that it can be used to calculate CN outputs. The implementation of the parity polytope projection, the most resource intensive part of the CN, will be covered in the next subsection. The replica variable vector $z_j$ is assigned as the output of the projection module. The replica variable vector has the same bit-width as the projection input, but its fixed-point representation has one sign bit and zero integer bits since its components are guaranteed to be in $[-\tfrac{1}{2}, \tfrac{1}{2}]$.
$O((\log |N_c(j)|)^2)$ pipeline stages.
D. Parity Polytope Projection
With the active facet identified in f, a similarity transform is executed on v to align the active facet with the probability simplex. This is accomplished with d multiplexers choosing between $v_i$ or $-v_i$ based on the value of $f_i$. The resulting vector ṽ uses the same fixed-point representation as v; however, since its components are guaranteed to be negative, the sign bit can be dropped and added back in later when required for computation. This operation has constant delay and linear area scaling in the dimensionality of projection d.
At this algorithmic juncture, there are three operations that can take place in parallel: projection onto the unit hypercube, projection onto the probability simplex, and testing parity polytope membership. Our implementation, however, does not execute these operations in parallel. Parallel execution requires knowing the depth of each operation in order to properly pad the lower latency operations with pipeline registers. This cannot be done without knowing code check degrees a priori.
Projection of v onto the unit hypercube is the simplest operation to perform. For each component of v, a multiplexer chooses between $-\tfrac{1}{2}$, the component itself, and $\tfrac{1}{2}$.
Testing parity polytope membership involves projecting ṽ onto the unit hypercube. Hypercube projection is performed in the same manner as above. This is followed by summing the resultant vector using a minimum-depth adder tree. In the adder tree, extra integer bits are added to prevent overflow. By comparing the adder tree result to a constant, we are able to determine what the projection output should be. This decision is stored with a single bit. Due to the adder tree, this operation has $O(d)$ area scaling and $O(\log d)$ delay scaling.
The implementation of simplex projection is the topic of the next subsection. Simplex projection dominates the complexity of parity polytope projection. It gives the parity polytope projection $O(d(\log d)^2)$ area scaling and $O((\log d)^2)$ delay scaling. Additionally, the hypercube projection of v and the active facet identifier f must be stored for the $O((\log d)^2)$ pipeline stages it takes to execute the simplex projection. This uses $O(d(\log d)^2)$ area. The similarity transform is applied again to the output of the simplex projection to invert itself.
The stored bit indicating parity polytope membership then drives d multiplexers that choose to output the hypercube projection of v, or the transformed output u of the simplex projection module. Both of these possible outputs have fixed-point representations with one sign bit and zero integer bits.
E. Simplex Projection
Our algorithm for simplex projection is detailed in Algorithm 3. A schematic is depicted in Fig. 8. The components of the vector to be projected are first sorted in descending order. To sort in hardware, we require the set of operations to be performed regardless of the input vector. Sorting networks accomplish this. Sorting networks are composed of compare-swap modules, each of which can be implemented with a compare operation and two multiplexers. A delay-optimal sorting network for 16 values is shown in Fig. 9.
Fig. 8. Simplex projection.
Fig. 9. An example of a minimum-delay sorting network from [53], showing non-overlapping compare-swap modules for 16 inputs.
Knuth provides an in-depth analysis of sorting networks [53]. He explains that, in practice, the best sorting networks achieve $O((\log d)^2)$ delay scaling and $O(d(\log d)^2)$ area scaling. There exists a theoretical construction with better scaling, but it has prohibitively large overhead [53]. We use delay-optimal sorting networks from Knuth [53] for small vector lengths, and use Batcher's odd-even merge sort [54] to extend to a general construction for large dimensional projections [28], [32].
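As an illustration of a data-independent sorting schedule, the Python sketch below generates the compare-swap pairs of Batcher's odd-even merge sort for a power-of-two number of inputs and applies them so that the output is in descending order. The function names are ours, and the power-of-two restriction is an assumption of this sketch; it is not the hand-optimized small networks from [53].

```python
def oddeven_merge(lo, hi, r):
    """Compare-swap pairs merging two sorted halves of the range [lo, hi]."""
    step = r * 2
    if step < hi - lo:
        yield from oddeven_merge(lo, hi, step)
        yield from oddeven_merge(lo + r, hi, step)
        yield from ((i, i + r) for i in range(lo + r, hi - r, step))
    else:
        yield (lo, lo + r)

def oddeven_merge_sort_range(lo, hi):
    """Compare-swap pairs sorting indices lo..hi (inclusive)."""
    if hi - lo >= 1:
        mid = lo + (hi - lo) // 2
        yield from oddeven_merge_sort_range(lo, mid)
        yield from oddeven_merge_sort_range(mid + 1, hi)
        yield from oddeven_merge(lo, hi, 1)

def batcher_network(n):
    """List of compare-swap pairs (i, j), i < j, for n a power of two."""
    return list(oddeven_merge_sort_range(0, n - 1))

def apply_network(network, values):
    """Run the fixed schedule; the larger value always moves to index i."""
    x = list(values)
    for i, j in network:
        if x[i] < x[j]:
            x[i], x[j] = x[j], x[i]
    return x

def network_depth(network):
    """Number of pipeline layers when each pair is scheduled as early as possible."""
    ready = {}
    depth = 0
    for i, j in network:
        layer = max(ready.get(i, 0), ready.get(j, 0)) + 1
        ready[i] = ready[j] = layer
        depth = max(depth, layer)
    return depth
```

Because the list of pairs depends only on the number of inputs, each layer of the schedule can be mapped to one pipeline stage of compare-swap modules.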
The next step of simplex projection is to calculate all partial sums (termed "prefix sum") of the sorted vector ρ. Since we need to subtract 1 from every partial sum, we simply include −1 as part of the prefix sum input. The prefix sum operation can be performed with O(d) area scaling and O(log d) delay scaling. Ladner and Fischer describe such a construction [55] . The dth sum is computed with a minimum-depth adder tree. Other sums are calculated by reusing computations when possible, making linear area scaling possible. Extra integer bits are allocated to the fixed-point representation of the prefix sum output to prevent overflow. Note that both v and ρ must be stored during this operation, requiring O(d log d) area.
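The logarithmic-depth property of the prefix sum can be demonstrated with a simple divide-and-conquer schedule, sketched below in Python. Note the substitution: this Sklansky-style schedule uses on the order of d log d additions, so it illustrates only the O(log d) delay and not the linear-area construction of [55]. The function name is ours.

```python
def prefix_sums_log_depth(values):
    """Inclusive prefix sums computed in ceil(log2(n)) parallel levels.

    Returns the list of prefix sums and the number of levels used.
    """
    x = list(values)
    n = len(x)
    levels = 0
    span = 1
    while span < n:
        nxt = list(x)
        for i in range(n):
            if (i // span) % 2 == 1:
                # Add the running total of the block immediately to the left
                src = (i // span) * span - 1
                nxt[i] = x[src] + x[i]
        x = nxt                  # all additions of a level happen in parallel
        span *= 2
        levels += 1
    return x, levels
```

To fold the subtraction of 1 into the same structure, the constant can be absorbed into the first input, e.g. `prefix_sums_log_depth([rho[0] - 1] + rho[1:])` for a sorted list `rho`.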
Next, the prefix sum output vector components are normalized by their respective component indices. Component index reciprocals are found during synthesis and the normalization is performed by multiplication. As with VNs, multiplication by a power of 2 can easily be implemented with FPGA soft logic for some indices, while a multiplier DSP core is required for others. This operation has constant delay and linear area scaling in the dimension of projection.
We wish to select the normalized partial sum with the largest index that satisfies ρ i > u i as the common shift in the simplex projection. First, a length-d binary vector is created indicating the indices satisfying ρ i > u i . A priority encoder is then used to create a one-hot vector indicating the largest index position satisfying ρ i > u i . This is also a prefix operation, which yields the same complexity as the prefix sum [28] . However, the operation is on a binary vector, and not a fixed-point vector. The resources consumed are thus much smaller. This one-hot vector is used to select the corresponding component of u.
Finally, the selected component u i * and 
V. RESULTS
We tested the hardware viability of ADMM-LP decoding using a Xilinx Kintex UltraScale FPGA board. The architecture was synthesized for the FPGA, along with wrapper logic for noise generation and data transfer to a software test bench.
The binary-input AWGN channel is simulated using a Gaussian random number generator [56] on the FPGA to minimize simulation time. The core is a linear feedback shift register of period $2^{176}$, fed into an approximation of the inverse cumulative distribution function. We verified that FPGA-based channel simulation produced the same frame error rate (FER) results as CPU-based channel simulation for low-SNR channels.
Three QC-LDPC codes are considered for error-rate simulation and FPGA resource usage analysis. The first is the [155, 64, 20] Tanner code [42], whose parity-check matrix is composed of 31 × 31 cyclic matrices. The second is the [672, 546] "WiGig" code from the IEEE 802.11ad (WiGig) standard [43], composed of 42 × 42 matrices. The final codes are an ensemble of five [1002, 503] (3, 6)-regular QC-LDPC codes. The five parity-check matrices for this ensemble were created by randomly generating shifts for 167 × 167 identity matrices. The resulting factor graph girths were verified using techniques from [57]. Codes with girth less than 6 were discarded. The QC shift matrices for the three codes are provided in Fig. 10.
Table I summarizes the fixed-point precision of each message variable as determined through experimentation to obtain FER performance close to floating-point (double-precision). Message variables are grouped by computation module and expressed as signed fixed-point numbers in the Q format [58]. An Input/Output (I/O) bit-width of 8 bits was required to guarantee error-rate performance close to double-precision implementations [32]. For internal CN-to-VN messages, it was found that two additional bits are required to achieve good FER performance in higher-reliability channels [32]. In our architecture, LLRs, CN-to-VN messages, and check states all use the same number of fractional bits. Experimentation indicates that maximizing the number of fractional bits results in the best error-rate performance [32]. That is, LLRs should have zero integer bits, and the CN-to-VN messages and check states should have two integer bits. We use the allocations shown in Table I in all of our fixed-point ADMM-LP simulations. From experimentation, we found that the smallest fixed-point representations for which our decoder functioned reliably were 6-bit I/Os and 8-bit internal messages [32].
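The bit allocations described above can be mimicked with a small saturating quantizer. The sketch below assumes a representation with one sign bit, `int_bits` integer bits, and `frac_bits` fractional bits, so that an 8-bit LLR with zero integer bits has 7 fractional bits and a 10-bit CN-to-VN message has 2 integer bits and 7 fractional bits. This is our reading of the text; the authoritative allocations are those of Table I, and the function names are hypothetical.

```python
import math

def to_fixed(x, int_bits, frac_bits):
    """Quantize x to a signed fixed-point raw integer with saturation."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits))        # most negative raw value
    hi = (1 << (int_bits + frac_bits)) - 1     # most positive raw value
    raw = math.floor(x * scale + 0.5)          # round to nearest
    return max(lo, min(hi, raw))

def from_fixed(raw, frac_bits):
    """Real value represented by a raw fixed-point integer."""
    return raw / (1 << frac_bits)
```

With these assumptions an LLR of 2.0 saturates in the 8-bit format (`to_fixed(2.0, 0, 7)` gives 127), while the same value is representable in the 10-bit CN-to-VN format (`to_fixed(2.0, 2, 7)` gives 256).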
A. Error-Rate Performance
The previously discussed parameter choices affect both error-rate performance and resource consumption. There are two additional parameters that affect only error-rate performance. We discuss the choice of these parameters here.
Simulated channel outputs are saturated at some value in order to produce LLRs within the decoder's input range. We parameterize this in terms of standard deviations of channel noise. That is, the channel output is saturated at ±(1 + aσ), where a > 0 and σ is the standard deviation of the added Gaussian noise. Experimentation revealed that our implementation is not extremely sensitive to this parameter; however, a = 1 was found to be optimal with respect to FER [32]. Therefore, we saturate channel outputs at one standard deviation beyond the transmission values ±1. The saturated channel outputs are then scaled such that the saturation values are mapped to the minimum and maximum LLR values. Recall that AWGN channel outputs are proportional to LLR values, and scaling LLRs does not change the LP decoding objective.
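The saturate-then-scale step can be written as a one-line mapping. The Python sketch below assumes a symmetric LLR range of ±`llr_max` and leaves the sign convention of the LLR untouched (the output is simply proportional to the saturated channel value). The function name and the `llr_max` parameter are ours.

```python
def channel_to_llr(y, sigma, a=1.0, llr_max=1.0):
    """Saturate a channel output at +/-(1 + a*sigma) and scale it to the LLR range."""
    limit = 1.0 + a * sigma
    y_sat = max(-limit, min(limit, y))     # saturation
    return y_sat * (llr_max / limit)       # map +/-limit to +/-llr_max
```

For σ = 0.5 and a = 1, the saturation point is 1.5, so a received value of 0.75 maps to half of the LLR range and anything beyond ±1.5 maps to the extremes.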
The final parameter configuration is to choose a suitable penalty parameter α. The optimal penalty parameter changes with respect to SNR. Our double-precision and fixed-point implementations empirically show that larger penalty parameters perform better on low-SNR channels, while smaller penalty parameters perform better on high-SNR channels [16] , [32] . Intuitively, high-SNR channels do not perturb transmissions far from optimal LP solutions, and thus do not stand to benefit from penalization. To illustrate this point, Fig. 11 shows the FER for the Tanner code as a function of the penalty parameter. Generally, for all the codes we tested, we found that setting the penalty parameter to α = 0.1 minimizes the FER.
The number of decoding iterations also affects the error rate. Similar to BP, ADMM-LP can be configured to terminate after a fixed number of iterations, or once rounded variable estimates produce a codeword. The latter provides a more practical termination condition than algorithmic convergence, and can be used to enforce latency and throughput constraints. Fig. 12 plots FER as a function of the number of iterations allowed for the fixed-point implementation of the Tanner code. The FER does not improve substantially beyond 60 iterations. Hence, in the remainder of our experiments, we enforce a fixed number of 60 iterations, allowing for a fair comparison of decoding latency and throughput among the three codes.
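The codeword-based termination rule mentioned above amounts to a syndrome check on the rounded estimates. The Python sketch below assumes centered estimates, so that rounding is a comparison against 0, and takes the decoder iteration as a hypothetical callback `iterate(state)` returning the new state and the current estimates. None of these names come from the paper's RTL.

```python
def hard_decision(x_centered):
    """Round centered estimates in [-1/2, 1/2] to bits."""
    return [1 if x > 0 else 0 for x in x_centered]

def is_codeword(H, bits):
    """True if every parity check of H is satisfied by bits."""
    return all(sum(h * b for h, b in zip(row, bits)) % 2 == 0 for row in H)

def decode(iterate, state, H, max_iters=60):
    """Run iterations until the rounded estimates form a codeword or the cap is hit."""
    bits = None
    for it in range(1, max_iters + 1):
        state, x = iterate(state)          # one decoding iteration (hypothetical)
        bits = hard_decision(x)
        if is_codeword(H, bits):
            return bits, it                # early termination
    return bits, max_iters
```

Setting the cap to a fixed value and ignoring the early exit reproduces the fixed-iteration configuration used for the latency and throughput comparisons.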
Figs. 13-15 present the FER results for the three codes under investigation on the binary-input AWGN channel. Each plotted FER point is based on an accumulation of 100 frame errors. We present results for both penalized (α = 0.1) and unpenalized (α = 0) ADMM-LP, where the value of α refers to its setting in Algorithm 1. Double-precision simulations for ADMM-LP and BP were performed using Liu's implementation [59] . The BP results shown were generated with Butler and Siegel's nonsaturating version described in [17] . We also present results for our own implementation of Min-Sum decoding.
1) Tanner Code: Fig. 13 presents the FER performance of the Tanner code. A small performance gap exists between the fixed-point and double-precision ADMM-LP implementations. At higher SNRs, all ADMM-LP implementations outperform double-precision BP and double-precision Min-Sum. The penalized ADMM-LP decoder closes the gap to double-precision BP. However, in fixed point, penalized ADMM-LP does not perform 
These results support the conclusion that unpenalized ADMM-LP is better suited to high-SNR channels.
2) WiGig Code: Fig. 14 presents the FER performance of the WiGig code. Again, the performance of both penalized and unpenalized fixed-point ADMM-LP with 10-bit messages is very close to that of double-precision ADMM-LP. However, there is a very large performance gap between BP and ADMM-LP. This performance gap is not closed with the addition of penalization; the opposite of what has been observed with other codes [16] . The root cause of the weakness of LP decoding for this code requires further investigation. We conjecture that, in part, it may be due to the high degrees of the check nodes, resulting in more pseudocodewords, or due to the degree-one variable nodes. Another factor to consider is that significant research went into designing this standardized code for Min-Sum usage. Therefore, an argument can be made that BP and Min-Sum are performing abnormally well due to this code's design and construction.
3) QC-LDPC Ensemble: The FER performance of the ensemble of (3, 6)-regular QC-LDPC codes is shown in Fig. 15 . Each curve is obtained by averaging the performance of the same five codes from the QC-LDPC ensemble. This experiment is a more powerful demonstration of the performance of ADMM-LP, where the addition of penalization closes the performance gap between BP and ADMM-LP. The fixed-point implementations of both penalized and unpenalized ADMM-LP achieve performance very close to double-precision.
B. Resource Usage
Table II summarizes the FPGA resource utilization and throughput results for the three partially-parallel decoders synthesized for the Tanner, WiGig, and QC-LDPC ensemble codes using the fixed-point message representations summarized in Table I. While the error-rate simulations were performed on an older Xilinx Virtex-5 FPGA, the resource utilization results presented here target a newer Xilinx Kintex UltraScale FPGA (model XKU-115). This FPGA has 1,326,720 configurable logic block (CLB) flip-flops, 663,360 CLB lookup tables (LUTs), a total of 75.9 Mbits of block RAM, and 5,520 DSP slices. Synthesis was performed using the Xilinx Vivado 2016.4 tool suite. Power consumption was estimated using the power utilization tool in Vivado. In general, pipeline depth and intermediate value storage have a large impact on resource consumption for all three decoders.
Table note: In this work we present FPGA implementation results for both a Min-Sum decoder and ADMM-LP decoder based on the (3, 6) QC-LDPC code. The acronyms used are as follows: FP = "fully parallel," PP = "partially parallel," PEG = "progressive edge growth," AC = "Altera Cyclone," XV = "Xilinx Virtex," and XKU = "Xilinx Kintex UltraScale."
1) Tanner Code: The Tanner code decoder has three degree-5 CNs and five degree-3 VNs. Each VN uses a single DSP block to perform normalization, and each CN uses two DSP blocks to perform division by 3 and division by 5.
From Table II, we see that CNs account for the majority of resource usage. Therefore, a further breakdown of CN resource consumption is warranted. We now present a breakdown of the CLB utilization and power consumption inside a CN on a sub-component basis. In a CN, parity polytope projection accounts for 83.1% of combined CLB registers and LUTs, and 90.9% of the total power. The nested simplex projection accounts for 47.2% of combined CLB registers and LUTs, and 48.48% of the power per CN. The large amount of resources required for polytope projection is primarily consumed by intermediate storage, as each $v_j$ in a CN must be stored until projection is complete. Sorting consumes 11.4% of combined CLB registers and LUTs, and 13.6% of the power per CN.
2) WiGig Code: The WiGig code decoder has one degree-16 CN, one degree-15 CN, and one degree-14 CN. It has 14 degree-3 VNs, one degree-2 VN, and one degree-1 VN. The degree-3 VNs use fewer resources than the Tanner decoder due to increased resource sharing among the 14 degree-3 VNs.
Again, CNs account for the majority of resource usage. The percentage of CLB usage and power consumption inside the degree-16 CN on a sub-component basis is very similar to the degree-5 CNs of the Tanner decoder. For the WiGig decoder, the two complexity-dominating operations, sort and prefix addition, consume a larger fraction of resources.
3) QC-LDPC Ensemble: The QC-LDPC ensemble decoder has six degree-3 VNs and three degree-6 CNs. CNs account for the majority of resource usage, and the internal CN resource breakdown is almost identical to that of the Tanner decoder.
C. Implementation Comparison
Table III compares our implementation for the QC-LDPC ensemble with several FPGA-based Min-Sum decoder implementations for similar code rates and blocklengths. To establish a fair basis for comparison, we also implemented a Min-Sum decoder on our FPGA, using the same LLR and internal-message bit-widths as our ADMM-LP decoder. Our fixed-point Min-Sum decoder achieves comparable error-rate performance to double-precision Min-Sum for the (3, 6) QC-LDPC ensemble code presented in Fig. 15 . Compared to the five comparison works, our ADMM-LP decoder achieves better FER and bit error rate (BER) performance at the chosen operating point of E b /N 0 = 3 dB, since ADMM-LP outperforms Min-Sum. However, our ADMM-LP decoder requires more iterations and larger bit-widths, resulting in lower throughput and higher logic utilization.
A direct comparison of area and logic utilization between designs implemented on different generations of FPGAs is not possible. First, the internal LUT and flip-flop structure of a Xilinx CLB is not equivalent to an Altera adaptive logic module (ALM). Second, as logic and memory resources scale with CMOS technology node, ALM and CLB architectures also change between FPGA generations. As such, FPGA synthesis and place-and-route steps are highly dependent on the architecture of the target FPGA device. Nevertheless, the logic utilization numbers of Table III show that our partially-parallel ADMM-LP decoder implementation has a logic resource utilization within an order of magnitude of the partially-parallel Min-Sum comparison works implemented on Xilinx devices. As expected, our level of resource utilization for ADMM-LP is between the comparison implementations whose Min-Sum decoders realize serial and fully-parallel architectures.
VI. DISCUSSION AND CONCLUSIONS
In this paper, we demonstrate that ADMM-LP decoding can attain excellent error-rate performance in a fixed-point implementation. While our initial implementation requires higher precision and more logic resources than Min-Sum, this study points to numerous possible avenues for future developments to bring ADMM-LP's resource requirements into line with those of other message-passing decoders.
One avenue is algorithmic simplification. Just as Min-Sum can be viewed as a computationally simple approximation of Sum-Product BP, we can seek approximations of ADMM-LP that preserve its high-SNR performance. As one example, in [32] it is observed that implementing partial-sort (rather than full-sort) can result in a negligible increase in error rates.
A second set of directions is hardware centric. For example, it is not obvious how to implement a CN or a VN unit that can handle multiple node degrees. We believe that this problem can be solved through innovative hardware sharing or algorithmic generalization. Second, ADMM-LP provides an opportunity for simplifying message-passing networks, especially in fully-parallel architectures. The same message is sent from each variable to all connected checks. Such message broadcasting can perhaps be exploited to reduce interconnect complexity.
Finally, this study is a first step en route to the development of a fully custom, in-silicon, application-specific integrated circuit (ASIC). An ASIC would allow for power-optimized register files and customized message-passing resources that would yield significant performance improvements not possible on FPGAs. State-of-the-art ASIC decoders currently achieve multi-Gb/s throughput with less than 200 mW of power consumption [52], [65]-[67], based on Min-Sum decoding of codes with blocklengths and rates similar to those of the three codes explored in this work. While the power consumption of our FPGA-based ADMM-LP decoder is on the same order of magnitude as state-of-the-art ASIC Min-Sum decoders, our throughput is lower by 3-to-4 orders of magnitude. Future work should explore algorithmic and architectural co-design to improve the throughput of hardware-based ADMM-LP decoders while maintaining low power consumption.
Referring to Table III , we note that while the normalized throughput per iteration of our ADMM-LP decoder is slightly lower than that of our Min-Sum decoder, our ADMM-LP decoder achieves a BER nearly 20× better. This is the crux of the matter. If one is concerned with applications where excellent high-SNR performance is required, SNRs at which Min-Sum and Sum-Product encounter error-floor problems, then ADMM-LP will be an algorithm of great interest. Our current implementation outperforms Min-Sum with less than an order of magnitude difference in FPGA resources. Further development, and innovation, could turn ADMM-LP into the algorithm of choice in such operating regimes.
