Abstract: One of the most significant impediments to the use of LDPC codes in many communication and storage systems is the error-rate floor phenomenon associated with their iterative decoders. The error floor has been attributed to certain subgraphs of an LDPC code's Tanner graph induced by so-called trapping sets. We show in this paper that once we identify the trapping sets of an LDPC code of interest, a sum-product algorithm (SPA) decoder can be custom-designed to yield floors that are orders of magnitude lower than the floors of the conventional SPA decoder. We present three classes of such decoders: (1) a bi-mode decoder, (2) a bit-pinning decoder which utilizes one or more outer algebraic codes, and (3) three generalized-LDPC decoders. We demonstrate the effectiveness of these decoders for two codes: the rate-1/2 (2640,1320) Margulis code, which is notorious for its floors, and a rate-0.3 (640,192) quasi-cyclic code, which was devised for this study. Although the paper focuses on these two codes, the decoder design techniques presented are fully generalizable to any LDPC code.
is proposed which exploits the knowledge of the deleterious subgraphs (or trapping sets, as defined later) for a given LDPC code, and lowers the floor by inverting the bits of a known trapping set after the decoder becomes trapped in that trapping set. We discuss this decoder further in Section III. Also, [19] investigates error floors by hardware simulation. The approaches in the literature that improve the decoder to lower LDPC code error floors can be divided into two categories: decoder modification (DM) and post-processing (PP). In this paper, we propose the following decoder-based strategies: a bi-mode decoder, a bit-pinning decoder together with an outer algebraic code, and three generalized-LDPC (G-LDPC) decoders. The bi-mode decoder is essentially a PP technique, the bit-pinning approach is a mixture of the DM and PP categories, and the G-LDPC decoders belong to the DM category. We present these decoding techniques in the order just given, which is the order of increasing complexity.
It is well known that the error floors of LDPC codes under message-passing decoding are usually due to low-weight near-codewords [21], or trapping sets [20], rather than low-weight codewords. A (w, v) trapping set is a set of w variable nodes (VNs) which induces a subgraph with v odd-degree check nodes (CNs) and an arbitrary number of even-degree CNs. When w and v are relatively small, errors in each of the w VNs create v parity-check failures and tend to lead to situations from which the iterative decoder cannot escape. Throughout the paper, to simplify the language, we shall often use "trapping set" when we are referring to the induced subgraph.
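The (w, v) definition above can be checked mechanically. The following is a minimal sketch, assuming the parity-check matrix is stored as a list of rows, each row listing the VN indices that participate in that check; the function name and the toy graph are hypothetical and are not taken from either code studied here.

```python
def trapping_set_params(check_rows, vn_set):
    """Return (w, v) for the subgraph induced by vn_set.

    check_rows[m] lists the VN indices taking part in check m.
    w is the number of VNs in the set; v is the number of checks
    that see an odd number of those VNs.
    """
    vn_set = set(vn_set)
    odd = 0
    for row in check_rows:
        # Degree of this check inside the induced subgraph.
        deg = sum(1 for n in row if n in vn_set)
        if deg % 2 == 1:
            odd += 1
    return len(vn_set), odd


# Toy (hypothetical) Tanner graph: 4 checks on 6 variable nodes.
toy_checks = [[0, 1, 2], [1, 2, 3], [0, 3, 4], [2, 4, 5]]
params = trapping_set_params(toy_checks, {1, 2})
```

If every VN of the set is in error and all other bits are correct, the v odd-degree checks are exactly the unsatisfied checks, which is why v coincides with the syndrome weight used later by the bi-mode decoder.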
Iterative decoders are susceptible to trapping set problems because they work locally in a distributed-processing fashion to (hopefully) arrive at a globally optimum decision. Of course, iterative decoders are vulnerable to cycles in the graph on which the decoder is based, for cycles often lead the iterative decoder away from the maximum-likelihood (ML) codeword. (Further, trapping sets are unions of several cycles.) In [28] , Koetter and Vontobel derived a framework for the finite-length analysis of iterative decoders of LDPC codes by introducing the concept of graph-cover decoding. It was shown that the iterative decoder cannot determine whether it is acting on the Tanner graph of the code itself or on some finite cover of the graph. The decoder decodes to the "pseudo signal" that has the highest correlation with the channel output, and the set of pseudo-codewords (related to trapping sets) from all finite graph covers competes with the transmitted codeword for being the "best" solution. In other words, the ML decision space is a subspace of the iterative decoder decision space.
We shall not delve into such theory in this paper. Rather, we shall explore specific codes to determine the trapping sets which dominate the floor of their iterative decoding performance curves. The most dominant trapping sets can be determined from computer simulations in the floor region. Moreover, for most practical codes, because the trapping set-induced subgraphs associated with the error floor are relatively small, and because their cardinalities are not too large, it is feasible to discover, enumerate, and evaluate all of the dominant trapping sets. Hence, once we obtain the trapping set information for an LDPC code by simulation and by graph-search techniques, we explicitly target the known trapping sets with novel, custom-tailored iterative decoder designs. These low-floor decoders lower the floor by orders of magnitude. Although the specific trapping sets depend on decoder specifics such as algorithm type and message word size, the applicability and effectiveness of our low-floor techniques do not.
0090-6778/09$25.00 © 2009 IEEE
We consider the binary-input additive white Gaussian noise (AWGN) channel and the sum-product algorithm (SPA) decoder with floating-point precision as our baseline system. We chose two LDPC codes to demonstrate the effectiveness of these decoders: (1) the rate-0.5 (2640, 1320) Margulis code, which is notorious for its trapping set-induced floors [20], [21], and (2) a short quasi-cyclic rate-0.3 (640, 192) code that we have devised for this research.
The rest of the paper is organized as follows. Section II briefly reviews the two LDPC codes under study and their dominant trapping sets. Section III describes the bi-mode decoder, which recovers trapping sets by a post-processing erasure decoding algorithm. In Section IV, we propose the bit-pinning decoder, which utilizes one or more outer algebraic codes. In Section V, we explore the concept of generalized LDPC Tanner graphs and their G-LDPC decoders. The idea is to decode conventional LDPC codes as if they were G-LDPC codes, with the goal of eliminating trapping-set error events. The algorithm for finding the trapping sets of G-LDPC decoders is also presented in that section, after which the frame error rate (FER) performance of G-LDPC decoders can be accurately predicted using the method in [20]. Section VI discusses the complexity cost of these decoders in comparison with the conventional sum-product algorithm decoder. Finally, in Section VII, we draw some conclusions.
II. CODES UNDER STUDY
Both of the LDPC codes we consider have the following structure: the parity-check matrix H can be conveniently arranged into an M ×N array of Q×Q permutation matrices and Q × Q zero matrices. This structure simplifies the analysis of the code because the Tanner graph possesses an automorphism of order Q, and thus simplifies the search for all of the trapping sets of a code. Obviously, if the Q × Q permutation matrices are circulants, the code is quasi-cyclic (QC), a characteristic that facilitates encoder and decoder implementations. We will refer to the Q variable nodes associated with a column of permutation matrices as a VN group and the Q constraint nodes associated with a row of permutation matrices as a CN group. A VN of such a code can be represented by an integer pair 
[Fig. 1: trapping sets of the two codes under study; surviving panel label: Margulis (14,4) trapping set.]
A. The Margulis Code
The Margulis construction of a regular (3, 6) Gallager code has been well studied [21], [22], [23]. The rate-1/2 (2640,1320) Margulis code used in this paper is generated from the special linear group SL_2(11), for which Q = 11. Its parity-check matrix can be expressed as a 120 × 240 array of 11 × 11 permutation matrices. The "weakness" of this algebraically constructed code is a relatively high error floor due to trapping sets, as discovered by MacKay and Postol in [21] using computer simulations. The source of the error floor is 1320 isomorphic (12,4) trapping sets and 1320 isomorphic (14,4) trapping sets, both types depicted in Fig. 1. As can be observed in the figure, a (14,4) trapping set has a structure similar to a (12,4) trapping set, and in fact each (14,4) trapping set contains a unique (12,4) trapping set as a subgraph. Hence, there is a 1-1 correspondence between the (12,4) and the (14,4) trapping sets. For both trapping set classes, half of the bit errors are systematic errors and half are parity errors.
This code is extremely difficult to deal with in terms of decoder design (as we will show later) because its trapping sets are highly entangled in a way that is more involved than what we just described. For instance, every VN of the code belongs to six different (12, 4) trapping sets. It is for this reason that we have studied the Margulis code: to devise decoders that lower the floor of codes with (and without) entangled trapping sets.
In [20] , Richardson proposed a semi-analytical method to predict the performance of LDPC codes in the floor region and he verified results for the Margulis code on the binary-input AWGN channel using FPGA simulations. In his method, a nearly complete list T of the dominant trapping set candidates was formed first. Then, the decoder failure is decomposed into the summation of the effects from the individual trapping sets:
where ξ_T denotes the set of decoder inputs that cause a failure on a trapping set T. Next, the contribution of each trapping set class to the FER is evaluated by using an importance sampling (IS) method [24], [25], [26]:
where w is the weight of T, σ^2 is the variance of the Gaussian noise when operating at a high SNR in the floor region, B is the noise biasing random variable, and E_B represents the expectation over B. It was shown in [20] that the 1320 (12,4) trapping sets account for about 75% of the error floor performance, and the (14,4) trapping sets account for about 23%. (Low-weight trapping sets are always more dominant than high-weight trapping sets.) In this paper, we use a floating-point software SPA decoder simulator which we have verified produces the same performance and prediction curves as those in [20]. It is known that quantization potentially affects the error floor, and various implementations of the sum-product algorithm together with various levels of quantization could lead to different trapping sets. We contend that our algorithms will work on any decoder implementation because these algorithms lower floors by eliminating the trapping set errors of any given decoder implementation.
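To illustrate why importance sampling makes floor-region probabilities measurable at all, the following toy sketch estimates a Gaussian tail probability by mean-shift biasing. It is not the estimator of [20]: the event here is a simple threshold crossing rather than a decoder failure on a trapping set, and the function names, the shift value, and the sample count are all illustrative assumptions.

```python
import math
import random


def tail_prob_is(threshold, sigma, shift, n_samples, seed=0):
    """Estimate P(N > threshold) for N ~ Normal(0, sigma^2) by mean-shift IS."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw from the biased density Normal(shift, sigma^2).
        x = rng.gauss(shift, sigma)
        if x > threshold:
            # Weight by the likelihood ratio of the true density
            # Normal(0, sigma^2) to the biased density Normal(shift, sigma^2).
            total += math.exp((shift * shift - 2.0 * shift * x)
                              / (2.0 * sigma * sigma))
    return total / n_samples


def tail_prob_exact(threshold, sigma):
    """Closed-form Gaussian tail probability, for comparison."""
    return 0.5 * math.erfc(threshold / (sigma * math.sqrt(2.0)))


estimate = tail_prob_is(threshold=5.0, sigma=1.0, shift=5.0, n_samples=20000)
exact = tail_prob_exact(5.0, 1.0)
```

With 20000 biased samples the estimate lands close to the exact value of roughly 3 × 10^-7, an event that plain Monte Carlo would essentially never observe in that many trials. The floor prediction method applies the same idea with the bias directed toward a specific trapping set.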
B. The Short QC Code
The rate-0.3 (640,192) QC code is designed using a progressive edge growth-like (PEG-like) algorithm [3] and the approximate cycle extrinsic message degree (ACE) algorithm [4]. It has a circulant size Q = 64 and column weight w_c = 5. The H matrix is a 7 × 10 array of 64 × 64 circulant permutation matrices and zero matrices. We observed in our simulations two trapping set classes, (5,5) and (5,7), with 64 isomorphic trapping sets in each class; a representative from each class is shown in Fig. 1. The (5,5) trapping set is the dominant one, as it accounts for over 90% of the error floor level.
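The array-of-circulants structure can be made concrete with a short sketch that expands a table of circulant shifts into a binary parity-check matrix. The shift values below are hypothetical (the actual shifts of the (640,192) code are not listed here), and the convention that -1 marks a zero block is an assumption of this sketch.

```python
def expand_qc(shifts, q):
    """Expand an array of circulant shifts into a binary H (list of rows).

    shifts[i][j] = -1 stands for a q x q zero block; a value s >= 0 stands
    for the q x q identity matrix with its columns cyclically shifted by s.
    """
    m, n = len(shifts), len(shifts[0])
    h = [[0] * (n * q) for _ in range(m * q)]
    for i in range(m):
        for j in range(n):
            s = shifts[i][j]
            if s < 0:
                continue  # zero block
            for r in range(q):
                h[i * q + r][j * q + (r + s) % q] = 1
    return h


# Hypothetical 2 x 3 shift array with q = 4 (not the code from this paper).
h_toy = expand_qc([[0, 1, -1], [2, -1, 3]], 4)
```

For the code in this section the shift array would be 7 × 10 with q = 64, with five non-zero blocks per block column to give column weight 5. Because every block is a circulant, cyclically shifting all VN and CN indices within their groups maps the graph onto itself, which is the order-Q automorphism that produces the 64 isomorphic trapping sets per class.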
We remark that it is very common for structured LDPC codes to have one or two dominant trapping set classes because their structured graphs lead to isomorphic classes of trapping sets. Unstructured LDPC codes are more likely to have a more diverse trapping set collection, but these codes are rarely of interest in practice. Still, the techniques in the paper can be straightforwardly extended to unstructured codes as well.
III. BI-MODE SYNDROME-ERASURE DECODER
In [16], the authors propose a post-processing technique to lower the error floor by using a look-up table of known trapping sets. After conventional SPA decoding, this table is used to process the residual error blocks much like a syndrome decoder. A drawback of this approach is that, if the cardinality of trapping sets is large, the implementation complexity of the look-up table may be very costly. Inspired by this technique, we propose another post-processing decoder, which avoids look-up-table decoding. In our technique, post-processing involves simple graph-based erasure decoding, so that our overall decoder has two decoding modes.
In the error-floor region, there are three types of error events: (1) unstable error events, which dynamically change from iteration to iteration and for which w and v are typically large; (2) stable trapping sets, for which w and v are typically small and thus are the main cause of the error floor; and (3) oscillating trapping sets, which periodically vary with the number of decoder iterations and which are sometimes subsets of stable trapping sets. The targets of the bi-mode decoder are the dominant stable trapping sets and some of the dominant oscillating trapping sets. In the first mode, SPA decoding is performed with a sufficient number of iterations for the decoder to reach one of the three error event situations just listed. The second mode is the post-processing mode, which is activated only when the syndrome weight of an error event falls into the set of syndrome weights of the target trapping sets. The key role of the second mode is, using syndrome information, to produce an erasure set that contains all of the VNs of the trapping set reached by the decoder, thus resulting in a pure binary erasure channel (BEC) with only correct bits and erasures. Iterative erasure decoding based on the LDPC code's graph can then resolve all of the erasures, including the bits that were originally part of the trapping set error event.
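The second-mode erasure decoding can be sketched as the standard peeling procedure on the Tanner graph: any check with exactly one erased neighbor determines that neighbor. This is a minimal sketch, assuming hard decisions outside the erasure set are correct; the function name and the small example graph are illustrative only.

```python
def peel_erasures(check_rows, bits, erased):
    """Iterative (peeling) erasure decoding on a Tanner graph.

    check_rows[m] lists the VN indices in check m.
    bits holds 0/1 hard decisions; values at erased positions are ignored.
    erased is the set of erased VN indices.
    Returns (decoded bits, set of erasures that could not be resolved).
    """
    bits = list(bits)
    erased = set(erased)
    progress = True
    while erased and progress:
        progress = False
        for row in check_rows:
            unknown = [n for n in row if n in erased]
            if len(unknown) == 1:
                # The check fixes its single erased neighbor by parity.
                n = unknown[0]
                bits[n] = sum(bits[k] for k in row if k != n) % 2
                erased.discard(n)
                progress = True
    return bits, erased


# Small example: three checks on seven bits, codeword 1011001.
example_checks = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
decoded, leftover = peel_erasures(example_checks, [0, 0, 1, 1, 0, 0, 1], {0, 5})
```

A non-empty leftover set means the erasures contain a stopping set, which is the failure mode that limits how large the erasure set may grow.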
The look-up table post-processing decoder in [16] is based on the fact that there is usually a 1-1 correspondence between a set of unsatisfied CNs and a trapping set. It was also observed that in the error-floor region, after running extensive Monte Carlo simulations, most of the error patterns correspond to the so-called elementary trapping sets [29] whose induced subgraphs have only degree-one and degree-two CNs. That is, for an elementary trapping set, the unsatisfied CNs are usually connected to the trapping set exactly once (there is one bit error per unsatisfied CN). This is the case for the Margulis and QC codes we study. However, the applicability of the bi-mode decoding algorithm is not limited to codes with elementary trapping sets.
Assume now that the decoder converges to a trapping set and suppose an unsatisfied CN has degree d_c. The goal of that CN is to find which one of the d_c neighbors is in error. Our experiments have shown that the LLR magnitude of the bit in error is not necessarily the smallest among the d_c VNs connected to the unsatisfied CN. Thus, we propose flagging as erasures all of the d_c neighbors of the unsatisfied CN and then performing iterative erasure decoding in the neighborhood of that unsatisfied CN.
The set of erasures is generated by erasing the VNs of v trees whose roots are the v unsatisfied CNs, as in Fig. 2. The depth of the trees depends on the trapping set structures of the code being decoded. Consider a regular LDPC code with row weight d_c and column weight d_v. For a (w, v) trapping set, the number of nodes in each level of the trees is listed in Fig. 2. For a trapping set with w ≤ v, i.e., every VN is connected to at least one unsatisfied CN, trees with two levels can cover the whole trapping set. Otherwise, when w > v, bigger trees are necessary. As the trees grow, the number of erasures grows exponentially and the whole trapping set will eventually be covered. However, the growth must be limited, because too many erasures may form so-called stopping sets (trapping sets on the BEC) and overwhelm the decoder. As will be explained below for the Margulis code, for some codes, additional steps are necessary to produce a smaller erasure set covering the whole trapping set. We now present the bi-mode decoding solutions for the Margulis and QC codes.
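The tree-growing step can be sketched as a breadth-first expansion from the unsatisfied CNs. This is one reading of the construction, with the tree depth counted here in VN levels (so a "two-level" tree, root CN plus its VNs, corresponds to vn_levels=1); the function name and that parameter convention are assumptions of the sketch.

```python
def flag_erasures(check_rows, hard_bits, vn_levels=1):
    """Flag as erasures the VNs of trees rooted at the unsatisfied CNs.

    check_rows[m] lists the VN indices in check m.
    hard_bits holds the decoder's 0/1 hard decisions.
    vn_levels is the number of VN levels grown below the root CNs.
    Returns the set of flagged VN indices.
    """
    var_to_checks = [[] for _ in hard_bits]
    for m, row in enumerate(check_rows):
        for n in row:
            var_to_checks[n].append(m)
    # Roots: checks whose parity fails under the hard decisions.
    frontier = {m for m, row in enumerate(check_rows)
                if sum(hard_bits[n] for n in row) % 2 == 1}
    seen_checks = set(frontier)
    flagged = set()
    for _ in range(vn_levels):
        # Next VN level: neighbors of the current CN frontier.
        new_vns = {n for m in frontier for n in check_rows[m]} - flagged
        flagged |= new_vns
        # Next CN level: checks touched by the new VNs and not yet visited.
        frontier = {m for n in new_vns for m in var_to_checks[n]} - seen_checks
        seen_checks |= frontier
    return flagged


tree_checks = [[0, 1, 2, 4], [1, 2, 3, 5], [0, 2, 3, 6]]
one_level = flag_erasures(tree_checks, [0, 0, 0, 0, 1, 0, 0], vn_levels=1)
two_levels = flag_erasures(tree_checks, [0, 0, 0, 0, 1, 0, 0], vn_levels=2)
```

On this tiny graph the second VN level already erases every bit, which illustrates on a small scale why the tree depth must be limited.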
A. Margulis Code Solution
We observed from simulations that the Margulis code with an SPA decoder is sometimes trapped in two oscillating (6, 18) trapping sets or two oscillating (7, 21) trapping sets, both with period two. The union of the (6,18) pair is a (12,4) trapping set, and that of the (7, 21) pair is a (14,4) trapping set. In the decoder solution for this code, the target trapping sets include the stable trapping sets as well as the two oscillating configurations. It is obvious that the oscillating trapping sets and some moderate-length unstable error events with w ≤ v can be flagged with two-level trees and recovered successfully. As depicted in Fig. 3 , any (12,4) trapping set can be flagged and recovered using trees with four levels; however, four-level trees cover only 10 VNs of any (14, 4) trapping set. If 6-level trees are used, the erasure decoder is overwhelmed by too many erasures.
To solve this problem, an auxiliary step is necessary. As discussed in Section II-A, any (14,4) trapping set contains a (12,4) subset plus two additional VNs. From the syndrome weight itself, the decoder cannot distinguish which of these two trapping set classes it is dealing with. The algorithm described below, which resolves this issue, is based on the following observation from the structure of the trees generated from the (12,4) and (14,4) trapping sets: as seen in Fig. 3, for either trapping set there exist exactly two VNs in the second level which share a common level-3 CN. These two VNs are what distinguish the (12,4) and (14,4) trapping sets (examine Fig. 3). Toggling the values of these VNs will switch a (12,4) trapping set to a (14,4) trapping set, and vice versa. Note that c_a, v_a1 and v_a2 in Algorithm 1 are unique for any (12,4) or (14,4) trapping set. The difference is this: if a (14,4) trapping set is reached in the first decoding mode, v_a1 and v_a2 are incorrect; otherwise, when a (12,4) trapping set is reached, the two VNs have correct bit values. Note also that when repeating Steps 1 to 3, the decoder adds to the erasures already flagged in the first round of Steps 1 to 3. Flipping the two bits of a (14,4) trapping set essentially corrects two bit errors and turns it into a (12,4) trapping set. However, since both trapping set classes have the two unique VNs identified in Step 4, the decoder still cannot distinguish which trapping set it is operating on. Thus, if a (12,4) trapping set is reached, v_a1 and v_a2 are flipped anyway, resulting in a (14,4) trapping set. For this case, repeating Steps 1 to 3 simply flags as erasures more VNs that are not in error, but does not overwhelm the erasure decoder. By using this algorithm, 352 erasures will be produced for any (12,4) or (14,4) trapping set, which can be recovered by the LDPC code successfully within 3 or 4 iterations.
The performance results for this bi-mode decoder with the Margulis code are presented in Fig. 4. For the SPA curves in the waterfall regime, 100 LDPC codeword errors were collected; in the floor region, 50 codeword errors were collected, except for the last point at 2.7 dB, which corresponds to 20 codeword errors. For the bi-mode curves, the numbers are similar, except the last three data points contain at least 10 block errors. No floor is observed down to FER ∼ 10
for the bi-mode decoder.
B. Short QC Code Solution
For both trapping set classes of the short QC code, every VN is associated with at least one unsatisfied CN (w ≤ v). Hence, one level of VNs per tree is enough to include a whole trapping set in the erasure set. Whenever the SPA decoder gets trapped in an error event with syndrome weight 5 or 7, the decoding enters the second mode. In this mode, 34 erasures are flagged for any (5,5) trapping set and 54 are flagged for any (5,7) trapping set, all of which can be recovered with one erasure decoding iteration. The performance of this code with the bi-mode decoder is presented in Fig. 4, with 100 block errors collected, except for the last two data points for the bi-mode curves, which correspond to 20 block errors. No floor is seen down below FER ∼ 10^-6, so that the floor is lowered by at least two orders of magnitude relative to the SPA decoder.
With certain modifications custom-tailored to tackle different trapping sets, this simple yet effective bi-mode technique can be applied to many LDPC codes. One scenario that requires modification is when the erased bits contain a stopping set, which usually happens if the trapping set has a large number of check violations and/or the code's graph has large check node degrees. One solution is to use the soft-erasing method, which retains the soft information (LLRs) of the bits outside of the erasure set, erases the bits in the erasure set by setting their LLRs to zero, and then performs conventional iterative decoding. Another variation of the bi-mode technique is the partial-soft-erasing technique, which erases (by setting LLRs to zero) the neighboring bits of one of the unsatisfied checks and then performs normal decoding; if the decoder fails to converge, the decoder is reset with erasures based on another unsatisfied check, and this procedure is repeated until the decoder converges or all unsatisfied checks have been used. For example, in [27], the authors show that the (2048, 1723) LDPC code adopted in the IEEE 802.3an 10GBASE-T standard possesses an isomorphic class of (8,8) trapping sets. These (8,8) trapping sets are easily resolved using our partial-soft-erasing bi-mode decoder, which requires on average around 20 additional decoding iterations to correct the entire trapping set.
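The partial-soft-erasing loop can be sketched as follows. The iterative decoder is passed in as a callable, since its internals are outside the scope of this sketch; the `decode` signature, the function name, and the choice to start each attempt from the same input LLRs are assumptions rather than details given in the text.

```python
def partial_soft_erase(check_rows, llrs, hard_bits, decode):
    """Retry decoding, soft-erasing the neighbors of one unsatisfied check at a time.

    check_rows[m] lists the VN indices in check m.
    llrs are the soft inputs kept for bits outside the erasure set.
    hard_bits are the hard decisions of the failed first decoding attempt.
    decode(llrs) -> (converged, bits) is a hypothetical decoder callable.
    Returns (success flag, decoded bits or the original hard decisions).
    """
    unsatisfied = [m for m, row in enumerate(check_rows)
                   if sum(hard_bits[n] for n in row) % 2 == 1]
    for m in unsatisfied:
        trial = list(llrs)          # reset: each attempt starts fresh
        for n in check_rows[m]:
            trial[n] = 0.0          # soft erasure: no prior belief
        converged, bits = decode(trial)
        if converged:
            return True, bits
    return False, list(hard_bits)
```

Plain soft-erasing is the special case in which the whole erasure set is zeroed at once and the decoder is run a single time.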
We remark that the bi-mode decoder has the following advantages relative to other floor-lowering techniques:
1) Besides stable trapping sets and oscillating trapping sets, this technique can also handle some unstable error events with w ≤ v.
2) No outer codes are employed, so the gain is achieved without any code rate loss.
3) The erasure decoding post-processing has very low computational complexity, which is equivalent to solving linear equations with binary unknowns, and thus involves only a collection of binary XORs.
IV. CONCATENATION AND BIT-PINNING
It is well known that error floors of LDPC codes are usually a consequence of frequently occurring block errors with a small number of bit errors in each block error. A natural solution to lowering the floor in this case is to concatenate the LDPC code with a high-rate outer algebraic code to clean up the residual systematic errors, at the expense of a code-rate loss. The serial concatenation of an outer code with the LDPC code has in fact been incorporated into the digital video broadcasting (DVB-S2) standard [39]. We propose a more effective concatenation technique which reduces the code-rate loss by exploiting the trapping set knowledge of the LDPC code of interest.
The technique we propose combines outer code concatenation with the bit-pinning technique [11]. In bit-pinning, the encoder fixes (pins) to zero one or more bits of each trapping set of interest and these bits are deleted prior to transmission. The decoder then sets the magnitude of the log-likelihood ratios (LLRs) for these pinned bits to the maximum possible value. Simulations have shown that this procedure offers a substantial improvement in the floor region [11]. However, this solution is less effective for codes such as the Margulis code, which have a large number of overlapping trapping sets, because of the substantial code-rate loss that pinning introduces. The code-rate loss can be substantially reduced for this code or any other code by the combination of concatenation and bit-pinning.
In one solution to this problem, one or more high-rate outer codes can be concatenated with the LDPC code, where the bit assignments in the outer codes are arranged such that, whenever a trapping set error event occurs, at least one (systematic) bit is corrected by an outer algebraic decoder. The goal is to ensure that the short, low-complexity, high-rate outer codes target the trapping sets of interest. The multiple decoders are coordinated as follows. First, a sufficient number of LDPC decoder iterations is performed to guarantee that either all errors are corrected or a stable trapping set is reached. Then the outer hard-decision decoder(s) correct some of the residual errors and feed back the signs of the corrected bits to the LDPC decoder, which pins the absolute values of the LLRs corresponding to these bits to the maximum possible value. Then the LDPC decoder continues to iterate. With probability near one, the LDPC decoder will correct both systematic and parity errors caused by a trapping set within a few additional iterations.
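The feedback step from the outer decoder to the LDPC decoder amounts to overwriting a few LLRs. The following is a minimal sketch, assuming the common convention that a positive LLR favors bit 0; the function name, the dictionary interface, and the default maximum of 20.0 are illustrative assumptions.

```python
def pin_llrs(llrs, corrected_bits, llr_max=20.0):
    """Pin the LLRs of bits corrected by an outer decoder.

    llrs are the LDPC decoder's current channel LLRs (positive favors bit 0).
    corrected_bits maps a VN index to the bit value decided by the outer code.
    llr_max is the largest LLR magnitude the decoder represents (assumed).
    Returns a new LLR list with the corrected positions pinned.
    """
    pinned = list(llrs)
    for n, bit in corrected_bits.items():
        # Full confidence, with the sign chosen by the outer decoder.
        pinned[n] = llr_max if bit == 0 else -llr_max
    return pinned


pinned_example = pin_llrs([-1.5, 0.3, 2.0], {0: 0, 2: 1}, llr_max=10.0)
```

After pinning, the LDPC decoder simply resumes iterating from the modified LLRs.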
A. Margulis Code Solution
Since each of the dominant trapping set classes of the Margulis code corresponds to either 6 or 7 systematic bit errors, a single t-error-correcting BCH code with t = 7 is an obvious solution. In this case, no pinning is necessary. We choose the t = 7 (1320, 1243) BCH code with roots in GF(2^11), which is shortened from a primitive (2047, 1970) BCH code. The code rate of the overall system is reduced from 0.5 to 0.47, corresponding to a 0.26 dB rate loss.
We can reduce the code rate loss by using multiple BCH codes of higher rates, an approach that exploits the fact that the trapping sets of this code are highly overlapped. We found that four BCH codes with t = 1 were sufficient when allowing feedback from the BCH decoders to the LDPC decoder. The BCH code bit assignments to the Margulis code bits are shown below. For convenience, the bit assignments are listed by VN group indices v_p, since all 11 bits in the VN group will be part of the BCH codeword. 
4) BCH(33,28): 94,116,203
The overall code rate is 0.49 (a 0.086 dB loss). The FER and BER curves of the single-BCH-code (no pinning) and four-BCH-code (with pinning) solutions are shown in Fig. 5 and Fig. 6, respectively. All curves have at least 20 frame error occurrences in the range of E_b/N_0 ≥ 2.5 dB. We observe that both solutions lower the floor beyond the reach of our simulations, and that the code-rate loss for the four-BCH-code solution is about 0.2 dB less than that of the single-BCH-code solution, as indicated above.
B. Short QC Code Solution
As discussed in Section II-B, the (640, 192) QC code has two trapping set configurations. We can concatenate that code with two t = 1 binary BCH (64, 58) codes. The bits of one BCH code belong to 64 different (5,5) trapping sets, and those of the other belong to 64 different (5,7) trapping sets. Thus, whenever a trapping set is in error, one of its five bits is correctable by one BCH code. Once corrected, this information is fed back to the LDPC decoder, which pins its LLR magnitude to the maximum possible value (with the appropriate sign). The overall code rate is reduced from 0.3 to 0.2813 (a 0.28 dB rate loss). The simulation results presented in Fig. 5 and Fig. 6 show no floor down to FER ∼ 10^-7. Twenty LDPC codeword errors were collected in the high-SNR region.
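The quoted rate losses can be reproduced with a one-line formula, under the assumption that the loss in dB is 10·log10 of the ratio of the code rates before and after concatenation. The parity-bit counts below follow from the stated BCH parameters: 1320 − 1243 = 77 for the single t = 7 code, and 2 × (64 − 58) = 12 for the two t = 1 codes.

```python
import math


def rate_loss_db(rate_before, rate_after):
    """Rate loss in dB, taken here as 10*log10(rate_before / rate_after)."""
    return 10.0 * math.log10(rate_before / rate_after)


# Margulis code with the single (1320, 1243) BCH outer code.
margulis_single = rate_loss_db(1320 / 2640, 1243 / 2640)

# Short QC code with two (64, 58) BCH outer codes: 12 parity bits in total.
qc_rate_after = (192 - 2 * (64 - 58)) / 640
qc_two_bch = rate_loss_db(192 / 640, qc_rate_after)
```

These evaluate to roughly 0.26 dB and 0.28 dB, with an overall QC rate of 0.28125, in agreement with the figures quoted for the two solutions.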
V. GENERALIZED-LDPC DECODER
In [17], an averaged decoding algorithm was proposed to reduce the number of incorrectly decoded frames in the error floor region of the Margulis code. This algorithm is a modified SPA decoding algorithm which averages messages over several iterations to "slow down" the convergence rate of certain variable values, because a sudden magnitude change in the values of certain variable messages or fast convergence to an unreliable estimate is a possible indicator of the emergence of an error trap. In this section, we propose novel SPA decoders which, loosely speaking, are designed by transforming the LDPC code into a generalized LDPC (G-LDPC) code. A G-LDPC code, like an LDPC code, is a code that can be described by a sparse bipartite graph with variable nodes and constraint nodes. However, for G-LDPC codes the constraints may be more general than single parity-check (SPC) constraints. For example, a constraint node can represent an arbitrary (n, k) binary linear code.

Fig. 7. An example of combining 3 SPC constraint processors into the GCP SC_1: the redundant edges (dashed lines) on the right are deleted after grouping. Conventional iterative decoding updates messages based on the graph on the left, employing SPA constraint processors for each SPC separately. The G-LDPC decoder updates messages based on the generalized graph on the right, using the BCJR algorithm, for example, to calculate extrinsic information to be sent from SC_1 to neighboring VNs. Note that we allow SPC constraints to coexist in the generalized graph, whose node processors are the same as those in the original graph.
To see how the G-LDPC philosophy arises, we remind the reader of the discussion in the introduction regarding locally optimum versus globally optimum decoders. Each constraint node decoder in an LDPC SPA decoder is locally optimum. It is possible in principle to group all of the SPC constraints into one combined global constraint and design a decoder for that graph. But that would be an ML decoder (for example), which has unacceptable complexity. Our strategy is to instead take one step toward that ideal and cleverly combine only a few SPC constraints at a time. Specifically, we combine check node processors corresponding to unsatisfied checks in the problematic trapping sets and call the combination a generalized-constraint processor (GCP) . Observe that an advantage of doing so is that cycles and other deleterious graphical structures (from the perspective of an iterative decoder) may be removed. See Fig. 7 for an illustration of this. The notion of combining check nodes into a super node to remove cycles first appeared in [41] . The approach here more directly targets problematic trapping sets rather than cycles.
The constituent decoder for the GCP can be any soft-input/soft-output decoder. We use a locally optimal decoder which employs the BCJR algorithm designed for the "BCJR trellis" [30], [31] of the linear code represented by the GCP. We allow both SPCs and GCPs to co-exist in the generalized Tanner graph, i.e., some of the SPCs are not combined to form a GCP. The G-LDPC decoder, essentially the SPA with GCPs (denoted by SPA-GCP), passes soft information iteratively between VN processors and GCPs in the same manner as the SPA decoder.
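What a GCP computes can be illustrated on a very small constraint by brute force. The sketch below is not the BCJR algorithm: it enumerates all words of the local code directly, which yields the same exact bitwise extrinsic LLRs that a trellis-based processor would produce, but is practical only for a handful of variables. The function name and the LLR convention (log of P(0)/P(1)) are assumptions.

```python
import itertools
import math


def gcp_extrinsic(local_checks, n_vars, llrs_in):
    """Exact extrinsic LLRs for a small generalized constraint, by enumeration.

    local_checks lists the grouped SPCs, each as a list of local VN indices.
    llrs_in[i] = log(P(bit i = 0) / P(bit i = 1)) arriving from VN i.
    Every bit must take both values somewhere in the local code.
    """
    mass0 = [0.0] * n_vars  # total weight of local codewords with bit i = 0
    mass1 = [0.0] * n_vars  # total weight of local codewords with bit i = 1
    for word in itertools.product((0, 1), repeat=n_vars):
        if any(sum(word[i] for i in chk) % 2 for chk in local_checks):
            continue  # violates one of the grouped parity checks
        # Relative probability of this word under the incoming beliefs.
        weight = math.exp(-sum(l for b, l in zip(word, llrs_in) if b))
        for i, b in enumerate(word):
            if b:
                mass1[i] += weight
            else:
                mass0[i] += weight
    # A-posteriori LLR minus the incoming LLR leaves the extrinsic part.
    return [math.log(mass0[i] / mass1[i]) - llrs_in[i] for i in range(n_vars)]


# With a single SPC the result reduces to the usual SPA check-node update.
spc_out = gcp_extrinsic([[0, 1, 2]], 3, [1.0, 2.0, -0.5])
```

Passing several checks in `local_checks` gives the joint update over the combined constraint, which is where the GCP differs from running the SPCs separately.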
Below we present three different G-LDPC decoders (equivalently, methods for grouping SPCs into GCPs) for both the Margulis code and the QC code. The approaches are based on the knowledge of the dominant trapping sets. The goal of the SPC grouping methods is to eliminate the dominant trapping sets, thus lowering the error floor. In selected cases, we predict the FER performance in the floor using the method in [20] .
A. G-LDPC Decoder I
1) Margulis Code Solution:
At least qualitatively, it is clear why an iterative decoder gets "trapped" by the subgraphs associated with trapping sets: As seen in Fig. 1 , when all of the bits in a trapping set are incorrect, most of the associated check equations are still satisfied (that is, mis-satisfied), so that a locally operating iterative decoder is incapable of resolving the discrepancies caused by these errors. However, if the unsatisfied CNs can be mutually fortified, the decoder may be able to correct the errors. The Type I G-LDPC (G-LDPC I) decoders are designed based on this idea.
Because the Margulis code trapping sets are highly overlapped (intersections are non-empty), we chose to require that no two GCPs share any common SPC constraint. We first grouped the four unsatisfied check nodes in the 198 non-overlapping (12,4) trapping sets into 198 GCPs. We also grouped the four unsatisfied check nodes in the remaining 22 non-overlapped (14,4) trapping sets into another 22 GCPs. In addition to these 220 GCPs (66.7% of all SPCs), each of which is a composition of four standard check nodes, the generalized Tanner graph also has 440 standard SPC nodes. At each GCP, the BCJR algorithm was applied to the BCJR trellis for the four associated check equations. The SPC constraints used standard SPA processing.
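The constraint that no two GCPs share an SPC can be met by a simple greedy pass over the trapping sets. The text does not state how the non-overlapping sets were chosen, so the first-come greedy order below is only an assumed selection rule, and the function name and toy input are hypothetical.

```python
def group_disjoint(unsat_sets):
    """Greedily pick trapping sets whose unsatisfied-CN sets are pairwise disjoint.

    unsat_sets holds one iterable of CN indices per trapping set.
    Returns (list of chosen CN groups, set of all CNs absorbed into GCPs).
    """
    used = set()
    gcps = []
    for cns in unsat_sets:
        cns = frozenset(cns)
        if used.isdisjoint(cns):
            gcps.append(cns)  # these SPCs become one GCP
            used |= cns
    return gcps, used


# Toy input: the second set overlaps the first on CN 3 and is skipped.
toy_gcps, toy_used = group_disjoint(
    [[0, 1, 2, 3], [3, 4, 5, 6], [4, 5, 6, 7], [8, 9, 10, 11]])

# Counts quoted for the Margulis code: 198 + 22 GCPs of four SPCs each.
combined_spcs = (198 + 22) * 4
remaining_spcs = 1320 - combined_spcs
```

The last two lines reproduce the bookkeeping in the text: 880 of the 1320 SPCs (66.7%) are absorbed into GCPs and 440 remain as standard check nodes.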
We ran Monte Carlo simulations on the generalized Tanner graph, and the performance in terms of frame error rate (FER) and bit error rate (BER) is shown in Fig. 9 and Fig. 10, respectively. We observe in the figures that the error floor is lowered by an order of magnitude compared to the standard SPA decoder. The simulator collected more than 20 frame errors for the last two simulation points of the G-LDPC I decoder, and it performed 220 iterations in order to obtain the stable trapping sets. We observed no (12,4) or (14,4) trapping sets with the G-LDPC I decoder. The new trapping sets of the G-LDPC I decoder are transformed from the original ones by deleting/adding one VN from/to the (12,4) or (14,4) trapping sets, as in the (13,5) example shown in Fig. 8. Given the algebraic construction of the Margulis code, we can find 1320 subgraphs of each configuration in Fig. 8 in the original graph. However, from the G-LDPC decoder's point of view, not all of the 1320 subgraphs are isomorphic. For example, a (13,5) subgraph in the original Tanner graph, which has the same structure as the example in Fig. 8, is a dominant trapping set for the G-LDPC I decoder if and only if its five unsatisfied CNs are not involved in any GCP and the VN with two unsatisfied CNs is attached to the rest of the subgraph through a GCP. The same holds for the other configurations listed in Fig. 8.
These observations allow us to find a complete list T of the new trapping sets that are related to the original ones by searching the generalized Tanner graph, or more efficiently by using a two-step importance sampling method which we will describe in Subsection V-A3. Fig. 8 lists the dominant trapping sets and their multiplicities for the G-LDPC I decoder. The contribution of each trapping set class to the error floor performance was evaluated by Richardson's method [20] . These results demonstrate that the G-LDPC I decoder eliminates the (12,4) and (14,4) trapping sets as problems. Although new dominant trapping sets arise, the new trapping set configurations are larger in weight and have much smaller multiplicities, and are thus less harmful. The floor prediction curves in Fig. 9 are obtained by weighted (by the multiplicity) sums of the error rates of each type in the table.
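The multiplicity-weighted sum used for the floor prediction can be sketched as follows. The class labels, multiplicities, and per-set error rates below are hypothetical placeholders, not values from Fig. 8, and `predict_floor` is an illustrative name.

```python
def predict_floor(classes):
    """Return the predicted floor and each class's share of it.

    classes: dict mapping a class label to (multiplicity, per_set_error_rate),
    where per_set_error_rate is the error rate attributed to one trapping set
    of that class at the SNR of interest.
    """
    contributions = {
        label: mult * rate for label, (mult, rate) in classes.items()
    }
    floor = sum(contributions.values())
    shares = {label: c / floor for label, c in contributions.items()}
    return floor, shares


# Hypothetical numbers, chosen only to exercise the function.
toy_classes = {
    "class A": (100, 1e-9),
    "class B": (20, 5e-9),
}
floor, shares = predict_floor(toy_classes)
print(floor, shares)
```

The shares make it easy to see which classes dominate the floor and would be targeted first when adding GCPs.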
2) Short QC Code Solution: A similar G-LDPC I decoder is derived for the short QC code by combining the five unsatisfied, non-overlapped SPCs into a GCP. The resulting graph contains 29 GCPs and 303 standard SPCs. Through the two-step IS method in the next subsection, we found this decoder has three different dominant (6, 8) random variable (RV) with mean -2 and variance corresponding to an SNR value in the floor region. We then observe the G-LDPC decoder outputs. The stable low-weight error events are the trapping sets of interest. Their multiplicities are also obtained in this way. Note that one pattern from Step (a) may lead to several stable trapping sets; thus, multiple realizations of the RV need to be observed. The advantage of applying the two-step IS instead of Step (b) alone is the following. When biasing with Gaussian noise, multiple G-LDPC decoding instances are performed. Using Step (b) alone could therefore be very time-consuming, because every SPA trapping set would have to be tested and their number may be large (for example, there are 2640 SPA trapping sets to be tested for the Margulis code).
B. G-LDPC Decoder II
In Section V-A1, we presented the G-LDPC I decoder, which eliminates the dominant trapping sets of the SPA decoder. However, several new trapping sets emerged and dominate the new (lower) floor. Moreover, only 66.7% of the CNs were combined in the Margulis code G-LDPC I decoder, so we can target these new trapping sets and group appropriately selected SPCs into additional GCPs. According to Fig. 8, three trapping set classes account for over 90% of the floor of G-LDPC I: (11,5), (13,5), and (15,5). We can thus combine the five non-overlapping unsatisfied SPCs of these trapping sets, arriving at a G-LDPC graph with 229 standard CNs and 258 GCPs. In this decoder, 82.6% of the SPCs in the original graph are combined.
The trapping sets observed for the G-LDPC II decoder are also closely related to the SPA trapping sets, and the same two-step IS method can be applied to enumerate them (results are listed in Fig. 8). We can see that the three dominant trapping set classes of the G-LDPC I decoder are eliminated by the 38 additional GCPs, and the dominant trapping sets become (14,6) and (16,6), which contribute 90% of the new floor level. A prediction curve is drawn in Fig. 9; it is over two orders of magnitude lower than the floor of the standard iterative decoder. This procedure of adding more GCPs to target new trapping sets can continue until no SPC CNs remain in the generalized graph or the system's floor performance requirement is met. Of course, the improvement comes at the cost of increased complexity. This realization led us to the G-LDPC III decoder.
The approach behind the G-LDPC II decoder, discussed above in the context of the Margulis code, could clearly be applied to the QC code as well. However, we saw no need to carry out this straightforward (and tedious) step, particularly since the G-LDPC III decoder in the next section provides the best solution.
C. G-LDPC Decoder III
1) Margulis Code Solution: It was observed through simulations of many LDPC codes that, in the floor region, a frame error event usually contains a single trapping set. Instead of combining unsatisfied CNs within a trapping set, the G-LDPC III decoder takes trapping sets in pairs and combines an unsatisfied CN of one trapping set with that of the other. When the decoder is trapped in one trapping set, reliable information from its companion trapping set will be passed along the GCP, allowing recovery from the trapping set error event.
The G-LDPC III decoder solution for the Margulis code contains 660 GCPs, each consisting of two SPCs. This number comes from pairing the 1320 (12,4) trapping sets (660 pairs). Thus, there are no SPC CNs in the generalized graph in this case. Because the trapping sets overlap (for example, every CN is one of the unsatisfied CNs in four different trapping sets), all the trapping sets form a linked, overlapped network. The IS method in Section V-A3 was applied to this decoder and no trapping sets were observed. The effectiveness of this simple G-LDPC decoder is also confirmed by simulations, which show no floor down to FER ∼ 10⁻⁸. We collected 50 error events at 2.2 dB and 2.3 dB, 10 error events at 2.4 dB, and 5 error events at 2.5 dB.
2) Short QC Code Solution: Take any (5,5) trapping set; its five VNs belong to five VN groups with indices [0, 1, 2, 3, 4], and its five unsatisfied CNs belong to four CN groups with indices [0, 1, 2, 3]. Thus, we take any two (5,5) trapping sets and group the three unsatisfied CNs which belong to disjoint CN groups. The 64 (5,5) trapping sets can be combined to give 32 · 3 = 96 GCPs. The FER and BER performance curves with this decoder are shown in Fig. 9 and Fig. 10, respectively, and no trapping set is observed down to FER ∼ 10⁻⁷. The curves were drawn with at least 30 block error occurrences.
VI. COMPLEXITY DISCUSSION
As shown in this paper, the three proposed decoders efficiently eliminate or lower LDPC error floors given knowledge of the trapping sets that cause the floor. Acquiring this knowledge is an off-line task, which can be done by using extensive simulations, graphical search, and/or importance sampling. In this section, we compare the complexities of the three decoders with that of the conventional SPA decoder based on an LDPC code Tanner graph with only single parity-check nodes.
The bi-mode decoder has the lowest complexity of the three decoders because it is a post-processing decoder whose erasure flagging/recovery process operates only when the conventional SPA decoder is trapped in one of the known trapping sets. When the decoder enters its second decoding mode, bits in proximity to unsatisfied check nodes are flagged according to a trivial algorithm, and then iterative erasure decoding, which involves only XORs, is performed. The erasure flag can occupy one of the bits in the magnitude portion of the channel messages, since in the second decoding mode only the sign bits and the erasure locations (flags) are needed. The XOR operators are already present in the SPA decoder, which handles the sign computations at the check nodes in this way. Thus, the additional hardware beyond standard SPA is truly negligible and, because only a few erasure decoding iterations are necessary, the additional number of computations is negligible as well. Note that, in contrast to the algorithm in [16], no look-up tables are required.
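The XOR-only erasure recovery of the second decoding mode can be sketched as a simple iterative (peeling) erasure decoder. This is a minimal illustration under stated assumptions: the set of flagged positions is taken as given (the flagging rule is described elsewhere in the paper), and the function name, data layout, and toy code are hypothetical.

```python
def erasure_decode(checks, bits, erased, max_iters=10):
    """Iteratively fill in erased bits using single parity checks.

    checks: list of lists, each holding the variable indices of one check.
    bits:   list of 0/1 hard decisions (values at erased positions are ignored).
    erased: iterable of indices flagged as erasures.
    Returns (decoded_bits, still_erased).
    """
    bits = list(bits)
    erased = set(erased)
    for _ in range(max_iters):
        progress = False
        for check in checks:
            unknown = [v for v in check if v in erased]
            # A check with exactly one erased bit determines that bit
            # as the XOR of the other bits in the check.
            if len(unknown) == 1:
                value = 0
                for v in check:
                    if v not in erased:
                        value ^= bits[v]
                bits[unknown[0]] = value
                erased.discard(unknown[0])
                progress = True
        if not erased or not progress:
            break
    return bits, erased


# Toy code (hypothetical): three checks on five bits; [1, 0, 1, 1, 0] is a codeword.
toy_checks = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
received = [0, 0, 0, 1, 0]          # positions 0 and 2 are flagged, values unreliable
decoded, remaining = erasure_decode(toy_checks, received, erased={0, 2})
print(decoded, remaining)
```

Only comparisons and XORs appear in the inner loop, which is consistent with the claim that the second mode adds little beyond the sign logic already present at the check nodes.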
The additional computational complexity required by the concatenation/bit-pinning decoder consists of: (1) hard-decision decoder(s) for the outer algebraic code(s) and (2) a few additional SPA iterations (usually fewer than five) to process the pinned bits fed back from the outer decoder(s). Because algebraic decoders have extremely low complexity, whereas the SPA decoder is quite complex, the percentage complexity increase can be on the order of 1%, depending on the LDPC code length.
As demonstrated in Section V, the G-LDPC decoders, particularly G-LDPC III, offer substantially improved error-floor performance as well as some improvement in the moderate SNR region, at the cost of higher decoding complexity. The complexity increase is due to the constituent decoder used to process the extrinsic information of the GCPs. In our experiments, we used the BCJR algorithm to implement the optimal GCP, though lower-complexity suboptimal algorithms, such as the soft-output Chase algorithm of Pyndiah [32] or the soft-output Viterbi algorithm (SOVA) [33], can be used. The BCJR algorithm works on a trellis representing the linear block code [30] that corresponds to the generalized constraint of interest. Its complexity is proportional to the number of states, S, the trellis length, L, and the number of core operations per state, O. For the three-stage structure of the BCJR algorithm (forward recursion, backward recursion, and completion stage), O = 3 binary max* operations, where max*(x, y) ≜ log(e^x + e^y) (implementable as a lookup table). For example, consider the G-LDPC III decoder for the Margulis code. Each GCP is a combination of two degree-6 SPC nodes, i.e., S = 4 and L ≤ 12. Thus, the BCJR algorithm needs S · L · O = 12L max* operations to generate the extrinsic information from this GCP to its L neighboring VNs. This amounts to 12 max* operations per extrinsic message.
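The max* operation and the S · L · O count can be made concrete with a short sketch. The function names are illustrative, and the exact `log1p` evaluation below stands in for the lookup table mentioned in the text; L = 12 is the upper bound on the trellis length quoted above.

```python
import math


def max_star(x, y):
    """max*(x, y) = log(e^x + e^y), computed in a numerically stable form."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def bcjr_max_star_ops(num_states, trellis_length, ops_per_state=3):
    """Number of binary max* operations for one BCJR pass: S * L * O."""
    return num_states * trellis_length * ops_per_state


# G-LDPC III GCP for the Margulis code: two degree-6 SPCs, so S = 4 and L <= 12.
S, L = 4, 12
total_ops = bcjr_max_star_ops(S, L)
ops_per_message = total_ops / L
print(total_ops, ops_per_message)
```

Because the total scales linearly with L, the per-message count stays at S · O = 12 even when the two SPCs share neighbors and L is smaller than 12.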
As for the SPA decoder, each SPC constraint needs d_c(d_c − 2) log-tanh operations [34] (often denoted by "⊞") to produce extrinsic information to its d_c neighbors. Thus, the two SPC nodes that are combined in the G-LDPC-III decoder require 2d_c(d_c − 2) log-tanh operations to compute the messages to be sent to their L neighbors. That is, since d_c = 6 for the Margulis code, 2d_c(d_c − 2)/L ≥ 4 log-tanh operations per extrinsic message are required. Further, the following relationship holds:
x ⊞ y = max(0, x + y) − max(x, y) + s(x, y),
where s(x, y) is a so-called correction term given by s(x, y) = log(1 + e^{−|x+y|}) − log(1 + e^{−|x−y|}).
The correction term can also be implemented using a lookup table. This leads us to the conclusion that the conventional SPA decoder requires ≥ 4 correction-term table-lookup operations and 2 max operations per extrinsic message, compared to 12 max* table-lookup operations for the G-LDPC-III decoder.
Hence the computational complexity increase of the G-LDPC-III decoder versus the SPA decoder is no more than 200%.
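The comparison can be checked numerically with a small sketch. It assumes the standard identity that the pairwise log-tanh operation on two log-likelihood ratios equals max(0, x + y) − max(x, y) + s(x, y), with s(x, y) the correction term given above; the function names are illustrative, L = 12 is taken as the upper bound from the text, and one log-tanh operation is counted on the same footing as one max* operation, as in the text.

```python
import math


def correction(x, y):
    """Correction term s(x, y) = log(1 + e^-|x+y|) - log(1 + e^-|x-y|)."""
    return math.log1p(math.exp(-abs(x + y))) - math.log1p(math.exp(-abs(x - y)))


def box_plus(x, y):
    """Pairwise log-tanh operation via two max operations plus s(x, y)."""
    return max(0.0, x + y) - max(x, y) + correction(x, y)


def box_plus_tanh(x, y):
    """Reference form of the same operation: 2 * atanh(tanh(x/2) * tanh(y/2))."""
    return 2.0 * math.atanh(math.tanh(x / 2.0) * math.tanh(y / 2.0))


# The two forms agree on a few sample LLR pairs.
samples = [(1.5, -0.7), (-2.0, -3.0), (0.3, 4.0)]
max_gap = max(abs(box_plus(x, y) - box_plus_tanh(x, y)) for x, y in samples)

# Operation counts per extrinsic message for the Margulis code.
d_c, L = 6, 12
spa_ops = 2 * d_c * (d_c - 2) / L      # log-tanh operations per message
gldpc_ops = 4 * L * 3 / L              # max* operations per message (S = 4, O = 3)
increase_percent = 100.0 * (gldpc_ops - spa_ops) / spa_ops
print(max_gap, spa_ops, gldpc_ops, increase_percent)
```

Since L ≤ 12 only raises the SPA count per message above 4 while the G-LDPC-III count stays at 12, the 200% figure is an upper bound under these counting assumptions.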
VII. CONCLUSION
We have presented several decoding strategies for lowering floors in LDPC codes. All of the techniques presented succeeded in lowering the floors of the two codes we studied by orders of magnitude, and are fully generalizable to other LDPC codes. The techniques have varying levels of complexity, and the chosen technique would depend on the performance/complexity requirements of the application. The bi-mode decoder and its variations should be considered in the early stages of the system design because they have the lowest complexity. On the other hand, as shown in Fig. 5 and Fig. 6, which compare the performances of the three techniques, the G-LDPC III decoder yields the best performance. Our
