A Novel Architecture for Elementary-Check-Node
Processing in Nonbinary LDPC Decoders
Oussama Abassi, Laura Conde-Canencia, Ali Al Ghouwayel, and Emmanuel Boutillon
Abstract-This brief presents an efficient architecture design for elementary-check-node processing in nonbinary low-density parity-check decoders based on the extended min-sum algorithm. This architecture relies on a simplified version of the bubble check algorithm and is implemented by means of first-in-first-out (FIFO) buffers. The adoption of this new design at the check node level results in a high-rate, low-cost, fully pipelined processor. A proof-of-concept implementation of this processor shows that the proposed architecture halves the occupied field-programmable gate array (FPGA) area and doubles the maximum frequency without modifying the input/output behavior of the previous design.
Index Terms-Channel coding, communication systems decoding, Field Programmable Gate Array.
I. INTRODUCTION
NONBINARY (NB) low-density parity-check (LDPC) codes are now known to outperform both binary LDPC and turbo codes when considering moderate or small code lengths [1], [2]. This family of codes retains the benefits of a steep waterfall region (typical of convolutional turbo codes) and a low error floor (typical of binary LDPC). Compared to their binary counterparts, NB-LDPC codes generally present higher girths, which leads to better decoding performance. Moreover, their association with high-order modulations is advantageous, as symbol likelihoods are calculated directly, without any marginalization [3]. Different works have also revealed the interest of NB-LDPC in MIMO systems [4]-[6].
However, these advantages entail the drawback of high computational complexity because NB-LDPC codes are defined over a Galois field GF($q = 2^m$) (where $q > 2$ is the order of the GF), i.e., the nonzero entries of their parity-check matrices belong to high-order finite fields. The elements of GF($q$) are called symbols, and each symbol is a set of $m$ bits. Consequently, in the decoding process, each message exchanged between processing nodes in the associated Tanner graph is an array of values, each one corresponding to a GF element. From an implementation point of view, this leads to a highly increased complexity compared to binary LDPC.
In recent years, significant effort has been dedicated to reducing the complexity of NB-LDPC decoders, and several algorithms with their associated architectures have been proposed. In this brief, we propose a new design for the so-called L-bubble check architecture that is used to implement the extended min-sum (EMS) algorithm [7]. Without modifying the algorithm, we moved the data-dependent computations to the last stage of the architecture. This modification relaxes the critical path and significantly simplifies the hardware design.
This brief is organized as follows. Section II provides a state of the art on NB-LDPC decoding algorithms and architectures. Section III presents the notations and principles of NB-LDPC codes. Section IV describes the min-sum algorithm for NB-LDPC codes as well as the elementary check node (CN) processing. Section V describes the algorithm and architecture of the FIFO-based elementary CN processor. Section VI presents the FPGA postsynthesis results to compare the new design with the state of the art. Finally, the conclusion and perspectives are discussed in Section VII.
II. STATE OF THE ART ON NB-LDPC DECODING
The direct application of the belief propagation (BP) algorithm to NB-LDPC codes leads to a computational complexity of the order of $O(q^2)$ [8] for each check node update, which becomes prohibitive when considering values of $q > 16$. Significant effort has thus been dedicated to developing reduced-complexity algorithms for NB-LDPC decoding. In order to reduce the prohibitive complexity of the BP algorithm for high-order NB-LDPC codes, the authors in [9] proposed to perform the BP algorithm in the logarithmic domain. This replaces all the products by the max* operation, without any performance loss for GF(8). In [10], an FFT-based BP decoding algorithm was proposed. The description of this algorithm in the log domain was presented in [11]. Note that the Fourier transform is easily computed when the GF is a binary extension field with order $q = 2^m$, and in this case, the computational complexity of the BP algorithm is reduced to the order of $O(d_c \cdot q \cdot m)$ per check node. The decoding of GF(256)-LDPC codes using this method was described in [12]. However, although these algorithms considerably reduce the computational complexity of the decoding process, they are still far from being considered for hardware implementation. This implementation became feasible with the introduction of the EMS [7] and the min-max [13] algorithms.
The EMS algorithm [7], [14] is based on a generalization of the min-sum algorithm (initially proposed for binary LDPC codes [15]). The EMS has the advantage of performing only additions while truncating the size of the messages from $q$ to $n_m$ ($n_m \ll q$). This suboptimality introduces a performance degradation that is compensated by a correction factor that can be optimized so that the EMS algorithm can approach, or even in some cases slightly outperform, the BP-FFT decoder [10], [12]. Also, the complexity/performance tradeoff can be adjusted with the value of the $n_m$ parameter, making the EMS decoder architecture flexible for both implementation and performance constraints. In the min-max algorithm, the extrinsic messages exchanged within the min-sum-based decoder are composed of a set of GF symbols with their corresponding reliability metrics measured with respect to the most likely one. By using appropriate metrics, the author in [13] derived a low-complexity quasi-optimal iterative algorithm as well as its canonical selective implementation that reduces the number of operations at each decoding iteration. Different architectures for CN have been proposed based on the min-max algorithm [16], a trellis-based approach [17], and a basis construction [18]. Also, a simplified version of the min-sum algorithm and its associated architectures were presented in [19] and [20]. Two other alternative approaches have been proposed for NB-LDPC decoding. The first one is based on symbol flipping algorithms, characterized by their low complexity at the cost of performance degradation [21]. The second approach is based on stochastic computations [22].
1549-7747 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The complexity reduction of the EMS-based CN processing has been investigated in [23] and [24], specifically the elementary check node (ECN) processor, which constitutes the core of the CN based on the forward-backward (FB) structure [9]. According to this FB model, the CN is composed of $3 \cdot (d_c - 2)$ ECNs; thus, simplifying the ECN architecture will considerably reduce the global decoder complexity. The bubble check and L-bubble check algorithms proposed in [23] and [24] constitute two original approaches in the design of ECN allowing the reduction of hardware complexity from $O(n_m^2)$ [14] to $O(n_m \cdot \sqrt{n_m})$ and $O(4 \cdot n_m)$, respectively. The L-bubble check only considers the first two columns and the first two rows of the $n_m \times n_m$ ECN matrix. In other words, the paths to follow are predetermined, and the size of the bubble sorter is always fixed to 4. These characteristics significantly simplify the ECN architecture compared to the original bubble check. Based on the L-bubble check algorithm, a global architecture for a GF(64)-LDPC EMS decoder was presented in [25]. In this architecture, the ECN contains a feedback loop, including the sorter that outputs valid candidates. This mechanism results in a long critical path that greatly impacts the latency of the architecture. Recently, the authors in [26] presented the syndrome-based algorithm, which is a lower complexity hardware approach for EMS-based CN processing that allows increased parallelism while achieving slightly better communication performance.
In this brief, we propose a novel and efficient architecture of the ECN processor for the EMS-based decoder. The proposed ECN architecture is based on the efficient use of FIFO buffers that directly output the candidates to the sorter. Also, we consider a tree-based architecture for the CN processor instead of the FB solution to reduce the number of ECNs in the critical path. Implementation results on FPGA show that the CN processor area is divided by two, the maximum operating frequency is doubled, and the hardware efficiency is enhanced by a factor of 6, compared to previous EMS implementations [25].
III. NB-LDPC CODES
This section presents NB-LDPC codes and provides some details and references on the matrix construction.
A. Definition of NB-LDPC Codes
An NB-LDPC code is a linear block code defined on a very sparse parity-check matrix $H$ whose nonzero elements belong to a Galois field GF($q$), where $q > 2$. A code word is denoted by $X = (x_1, x_2, \ldots, x_N)$, where $x_k$, $k = 1, \ldots, N$, is a GF($q$) symbol represented by $m = \log_2(q)$ bits as follows: $x_k = (x_{k,1}, x_{k,2}, \ldots, x_{k,m})$. The construction of these codes is expressed as a set of parity-check equations over GF($q$), where a single parity equation involving $d_c$ code-word symbols is
$$\sum_{k=1}^{d_c} h_{j,k} \, x_k = 0$$
with $h_{j,k}$ being the nonzero values of the $j$th row of $H$. The dimension of matrix $H$ is $M \times N$, where $M$ is the number of parity CNs and $N$ is the number of variable nodes (VNs), i.e., the number of GF($q$) symbols in a code word.
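As an illustration of a single parity-check equation over GF($q$), the following minimal sketch evaluates the weighted sum for a small field. It assumes GF(8) built from the primitive polynomial $x^3 + x + 1$, which is a choice made only for this example; the helper names `gf_add`, `gf_mul`, and `parity_check` are hypothetical.

```python
# Sketch: checking one parity equation sum_k h_k * x_k = 0 over GF(8).
# Assumption: GF(8) is generated by the primitive polynomial x^3 + x + 1 (0b1011);
# the field polynomial is not specified in the text.

M_BITS = 3          # m = log2(q), here q = 8
PRIM_POLY = 0b1011  # x^3 + x + 1


def gf_add(a, b):
    """Addition in GF(2^m) is a bitwise XOR of the m-bit representations."""
    return a ^ b


def gf_mul(a, b):
    """Shift-and-add multiplication, reduced modulo the primitive polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M_BITS):  # degree reached m: reduce
            a ^= PRIM_POLY
    return result


def parity_check(h_row, symbols):
    """Return True when the GF-weighted sum of the symbols is zero."""
    acc = 0
    for h, x in zip(h_row, symbols):
        acc = gf_add(acc, gf_mul(h, x))
    return acc == 0


if __name__ == "__main__":
    print(parity_check([1, 2, 3], [6, 4, 3]))  # satisfied equation
    print(parity_check([1, 2, 3], [6, 4, 2]))  # one symbol in error
```

Each symbol is stored as an integer whose binary digits are the $m$ bits of its representation, so the GF addition reduces to an XOR.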
B. Construction of NB-LDPC Codes
The Tanner graph of an NB-LDPC code is usually much sparser than the one of its homologous binary counterpart for the same rate and binary code length. The ultrasparse codes [27] achieve better performance with VN degrees $d_v = 2$ because this reduces the stopping and trapping set effects, and the performance of message-passing algorithms then becomes closer to the optimal performance of maximum-likelihood decoding. The protograph-based codes [28]-[30] obtain both good error-correcting performance and a hardware-friendly decoder architecture by maximizing the girth of the Tanner graph and minimizing the multiplicity of the cycles with minimum length [31].
IV. MIN-SUM ALGORITHM FOR NB-LDPC DECODING
The EMS algorithm [7] is an extension of the min-sum [15] , [32] algorithm from binary to NB-LDPC codes. The exchanged messages between the VN and CN processors consist of vectors of log-likelihood ratio (LLR) values.
A. Definition of NB LLR Values
The first step of the min-sum algorithm is the computation of the LLR value for each symbol of the code word. With the hypothesis that the GF($q$) symbols are equiprobable, the LLR value $L_k(x)$ of the $k$th symbol is given by [13]
$$L_k(x) = \ln\left(\frac{P(y_k \mid \tilde{x}_k)}{P(y_k \mid x)}\right)$$
where $\tilde{x}_k$ is the symbol of GF($q$) that maximizes $P(y_k \mid x)$, i.e., $\tilde{x}_k = \arg\max_{x \in \mathrm{GF}(q)} \{P(y_k \mid x)\}$, and $y_k$ is the received symbol. Note that $L_k(\tilde{x}_k) = 0$ and, for all $x \in \mathrm{GF}(q)$, $L_k(x) \geq 0$. Thus, when the LLR of a symbol increases, its reliability decreases. This LLR definition avoids the need to renormalize the messages after each node update computation and reduces the effect of quantization when considering the finite precision representation of the LLR values.
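The LLR definition above can be illustrated with a short sketch. It assumes the channel likelihoods $P(y_k \mid x)$ are already available as a list indexed by the integer label of each GF($q$) symbol and are strictly positive; the function name `nb_llr` is hypothetical.

```python
import math


def nb_llr(likelihoods):
    """Nonbinary LLR vector for one received symbol.

    likelihoods[x] holds P(y_k | x) for each GF(q) symbol x (integer label).
    Assumption: all likelihoods are strictly positive.
    """
    p_max = max(likelihoods)  # likelihood of the most likely symbol
    return [math.log(p_max / p) for p in likelihoods]


if __name__ == "__main__":
    llr = nb_llr([0.1, 0.4, 0.2, 0.3])  # GF(4) example, symbol 1 is most likely
    print(llr)
```

The most likely symbol receives an LLR of zero and every other symbol a nonnegative value, so a larger LLR means a less reliable symbol.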
B. CN Processing in the Min-Sum Algorithm
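With the FB algorithm [9], a CN of degree $d_c$ can be decomposed into $3(d_c - 2)$ ECNs, where an ECN has two input messages $U$ and $V$ and one output message $E$. Each of the messages can be written as
$$U = \{U(i)\}_{i = 1, \ldots, n_m}, \quad U(i) = \left(u_i^{\mathrm{GF}}, u_i^{\mathrm{LLR}}\right)$$
where $u_i^{\mathrm{GF}}$ is a GF($q$) symbol and $u_i^{\mathrm{LLR}}$ is its associated LLR value. The messages are sorted in increasing order as follows: $u_1^{\mathrm{LLR}} \leq u_2^{\mathrm{LLR}} \leq \cdots \leq u_{n_m}^{\mathrm{LLR}}$ [13], [25]. With the min-sum algorithm, the values in message $E$ are given, for each symbol $x \in \mathrm{GF}(q)$, by
$$e^{\mathrm{LLR}}(x) = \min_{(i, j) \,:\, u_i^{\mathrm{GF}} \oplus v_j^{\mathrm{GF}} = x} \left( u_i^{\mathrm{LLR}} + v_j^{\mathrm{LLR}} \right)$$
where $\oplus$ is the addition in GF($q$).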
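As a reference point, the min-sum ECN update can be sketched as an exhaustive search over all pairs of input elements. This is an illustrative baseline, not the architecture proposed in this brief. It assumes that each message is a list of (LLR, symbol) couples sorted by increasing LLR, that symbols are integer labels of GF($2^m$) so that the GF addition is an XOR, and the function name `ecn_exhaustive` is hypothetical.

```python
def ecn_exhaustive(U, V, n_m):
    """Exhaustive min-sum ECN: keep, for each GF symbol, the smallest LLR sum.

    U, V: lists of (llr, gf_symbol) couples sorted by increasing llr.
    Returns the n_m most reliable (llr, gf_symbol) couples of the output E.
    """
    best = {}
    for u_llr, u_gf in U:
        for v_llr, v_gf in V:
            sym = u_gf ^ v_gf      # addition in GF(2^m)
            llr = u_llr + v_llr    # LLR addition
            if sym not in best or llr < best[sym]:
                best[sym] = llr
    E = sorted((llr, sym) for sym, llr in best.items())
    return E[:n_m]


if __name__ == "__main__":
    U = [(0, 5), (1, 3), (4, 6)]
    V = [(0, 1), (2, 7), (3, 2)]
    print(ecn_exhaustive(U, V, n_m=3))
```

The double loop visits all $n_m^2$ pairs, which is the cost that the bubble-check family of algorithms avoids.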
C. EMS ECN Processing
In practice, the role of an EMS ECN processor is to select the $n_m$ most reliable symbols in the 2-D matrix $T_\Sigma$ (see Fig. 1), where
$$T_\Sigma(i, j) = U(i) + V(j), \quad (i, j) \in [1, n_m]^2.$$
This addition represents the addition of the LLR values of $U(i)$ and $V(j)$, together with the GF($q$) addition of their associated symbols.
The most reliable symbols correspond to the smallest LLR values in $T_\Sigma$. To simplify the selection of these symbols, the authors in [23] and [24] showed that the exploration of only a portion of matrix $T_\Sigma$ does not impact the decoder performance. To be specific, the L-bubble check algorithm divides this portion into four paths that constitute the so-called L-path [24].
V. S-BUBBLE CHECK ALGORITHM AND ARCHITECTURE

A. Principle
The four straight paths run along the first two columns and the first two rows of $T_\Sigma$ (see Fig. 1).
As each path is inherently sorted in increasing order, to extract the $n_m$ most reliable symbols in $T_\Sigma$, we only need to compare one candidate from each path during $n_m$ clock cycles. At the end of each comparison, the most reliable candidate is replaced by the next symbol on the same path, as shown by the arrows in Fig. 1. A detailed description is given in Algorithm 1. This algorithm is named the S-bubble check in reference to the straight paths. If $q$ is high (typically, $q \geq 256$), the use of only four paths may introduce performance degradation. In that case, it is possible to add extra straight paths (for example, starting from $U(3)$ and $V(3)$ for a six-path ECN) while keeping the same architecture structure.
Algorithm 1 S-Bubble Check Algorithm
Initialization step:
This operation is equivalent to a pull operation in the FIFO buffers (see Fig. 2).
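The principle described above (one candidate per path, the minimum selected at each cycle, and the selected path advanced by one position) can be sketched in software as follows. This is a behavioral sketch under stated assumptions, not the hardware design: the exact index ranges of the four paths are assumed here to be the first column, the first row, the second column, and the second row of $T_\Sigma$ without overlap; the FIFOs are prefilled rather than fed cycle by cycle; redundant symbols are simply skipped; and the name `s_bubble_ecn` is hypothetical.

```python
from collections import deque


def s_bubble_ecn(U, V, n_m, n_op=None):
    """Behavioral sketch of a four-path, FIFO-based ECN.

    U, V: lists of (llr, gf_symbol) couples sorted by increasing llr,
          each with at least two elements.
    n_m:  number of valid output couples wanted.
    n_op: number of comparison cycles (defaults to n_m).
    """
    if n_op is None:
        n_op = n_m

    def cand(i, j):
        # LLR addition and GF(2^m) addition (XOR) for cell (i, j) of T_sigma
        return (U[i][0] + V[j][0], U[i][1] ^ V[j][1])

    nu, nv = len(U), len(V)
    # Assumed path layout (0-based indices), one FIFO per straight path
    fifos = [
        deque(cand(i, 0) for i in range(nu)),      # first column
        deque(cand(0, j) for j in range(1, nv)),   # first row
        deque(cand(i, 1) for i in range(1, nu)),   # second column
        deque(cand(1, j) for j in range(2, nv)),   # second row
    ]

    E, seen = [], set()
    for _ in range(n_op):
        heads = [(f[0], k) for k, f in enumerate(fifos) if f]
        if not heads or len(E) == n_m:
            break
        (llr, sym), k = min(heads)  # smallest LLR among the FIFO heads
        fifos[k].popleft()          # pull: the FIFO exposes its next candidate
        if sym not in seen:         # only the first occurrence is valid
            seen.add(sym)
            E.append((llr, sym))
    return E


if __name__ == "__main__":
    U = [(0, 5), (1, 3), (4, 6)]
    V = [(0, 1), (2, 7), (3, 2)]
    print(s_bubble_ecn(U, V, n_m=3, n_op=5))
```

On this small input, running only `n_op = n_m` cycles yields fewer than `n_m` valid couples because two of the selected candidates are redundant, which illustrates why extra cycles can be useful.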
B. Redundancy Control and Nonvalid Couples
Redundancy in GF($q$) symbols occurs when the most recently selected candidate corresponds to a symbol that is already selected and contained in the output vector denoted by $E$. Then, if a redundant symbol is detected, only its first occurrence is considered valid, and the following occurrences are tagged as nonvalid.
C. S-Bubble Elementary Check Architecture
Fig. 2 shows the architecture of the S-bubble ECN. The adders receive vectors $U$ and $V$ and perform the LLR and GF($q = 2^m$) additions as previously described. The results are directly provided to the FIFOs on a clock cycle basis (push is always equal to 1). Note that each FIFO receives the elements of the corresponding path in matrix $T_\Sigma$. The operator Min compares the outputs of the four FIFOs and selects the minimum LLR value with its associated GF symbol that will constitute a new element of message $E$. The selected candidate is freed from the relevant FIFO (pull = 1), and the FIFO outputs a new candidate in the next clock cycle. This process is repeated $n_m$ times. After $n_m/2$ clock cycles, $n_m/2$ symbols have been output, and $n_m/2$ symbols still remain in the FIFOs. In the worst case, all those symbols are extracted from a FIFO that has not yet been read during the first $n_m/2$ clock cycles. Thus, the maximum size $D$ of a FIFO can be bounded by $n_m/2$ if a mechanism preventing the input of new symbols, once a FIFO is full, is employed.
A detailed clock cycle examination combined with low-level hardware FIFO behavior (not described here) leads to sizes $n_m/2$ for $F_1$ and $F_2$, $n_m/2 - 1$ for $F_3$, and $n_m/2 - 2$ for $F_4$. Note that, for the sake of simplicity, Fig. 2 does not show the control unit that tracks redundant symbols in the output message $E$. Also note that, in the implementation described in [25], a number $n_{op} > n_m$ (typically, $n_{op} = n_m + \delta$, with $\delta = 2$) of messages are generated at the output of the ECN to compensate for the fact that the redundant output symbols are discarded. In such a case, the FIFO sizes should be lengthened by $\delta/2$.
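The FIFO sizes stated above can be evaluated with a small helper. This is only an arithmetic sketch: it assumes $n_m$ and $\delta$ are even, that the $\delta/2$ extension applies to each of the four FIFOs, and the function name `fifo_depths` is hypothetical.

```python
def fifo_depths(n_m, delta=0):
    """Depths of FIFOs F1..F4 as stated in the text.

    Assumptions: n_m and delta are even, and the delta/2 extension
    is applied to every FIFO.
    """
    half = n_m // 2
    extra = delta // 2
    return [half + extra, half + extra, half - 1 + extra, half - 2 + extra]


if __name__ == "__main__":
    print(fifo_depths(12))           # n_m = 12, no extra outputs
    print(fifo_depths(12, delta=2))  # n_m = 12, n_op = n_m + 2
```

For $n_m = 12$, this gives depths of 6, 6, 5, and 4 without the extension, and 7, 7, 6, and 5 with $\delta = 2$.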
As described in [25] , the critical path of the L-bubble check ECN architecture contains a feedback loop, including several elements: RAM access, adder, comparators, and an index update operation, along with complex control. This mechanism results in a long critical path that greatly impacts the clock frequency. In the S-bubble architecture, the critical path on the feedback loop (the right part of the FIFOs in Fig. 2 ) is reduced to the Min processing and to the FIFO accesses (pull operation).
D. CNP Architecture
The check node processor (CNP) can be designed based on the FB architecture [see Fig. 3(a)] or, alternatively, using a tree-based structure [33], as illustrated in Fig. 3(b) for $d_c = 6$. The main advantage of the tree structure is that the number of ECNs in the critical path is minimized and constant for all outputs. For these reasons, we considered the tree structure in this brief. Fig. 4 illustrates the timing diagram: $L_C$ is the CNP latency, and $\Delta$ is the time delay required to start a new processing. Note that the symbols of the messages entering the CNP must be multiplied by the nonzero entries of the parity-check matrix (the row corresponding to the CNP in the Tanner graph), and the output messages of the CNP must be divided by these nonzero entries. Therefore, the implemented CNP architecture uses the wired multipliers presented in [25] to perform multiplications over GF($q$).
VI. FPGA PROTOTYPING
The proposed FIFO-based ECN was implemented on the Xilinx Virtex XC5VLX330T speed-2 FPGA device. For comparison purposes, we considered the same design parameters as in [25], which are as follows: $q = 64$, $m = \log_2 q = 6$, $l = 6$ ($l$ is the number of bits used to represent each LLR value), $n_m = 12$, and $n_{op} = 14$. The CN degree $d_c$ is set to 4, 6, 8, and 12, corresponding to code rates of 1/2, 2/3, 3/4, and 5/6, respectively. Memory units were synthesized using distributed RAMs. In order to compare the architectures, we define the hardware efficiency $E_n$ as the ratio of the number of CNs processed per second to the number $S$ of slices in the CNP. In the proposed architecture, a new CN can be started every $n_{op}$ clock cycles; thus, $F/n_{op}$ CNs can be processed per second, with $F$ being the clock frequency of the design: $E_n = F/(n_{op} \cdot S)$ CN/s/slice. If we denote by $\gamma$ the number of ECNs contained in the CNP critical path and consider that the latency of an ECN is 2 clock cycles, the latency $L_C$ can be expressed as follows: $L_C = 2 \cdot \gamma + 2$, where $\gamma \leq (d_c + 1)/2$ in the tree architecture, and $\gamma = d_c - 2$ in the FB one. The two extra clock cycles of $L_C$ are required for the input and output GF multipliers. Table I presents the postsynthesis results obtained for the CN processor considering the S-bubble ECN architecture for $d_c$ = 4, 6, 8, and 12 and the FB-based architecture in [25] (L-bubble check ECN and $d_c = 6$). As shown in Table I, the FIFO-based architecture increases the hardware efficiency $E_n$ by a factor greater than 6 compared to available data in the state of the art. For completeness, we should indicate that the work presented in [35] also improves the implementation results for the EMS ECN, based on a prefetching technique. Nevertheless, the authors in [35] do not provide synthesis results at the CN level. However, in terms of frequency, our ECN implementation operates at 209 MHz, which is twice the frequency achieved in [35].
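The two figures of merit defined above can be computed with the following sketch. The function names are hypothetical, and the numerical inputs in the usage lines (clock frequency and slice count) are placeholder values chosen for round arithmetic, not results from Table I.

```python
def cnp_latency(gamma):
    """CNP latency in clock cycles: 2 cycles per ECN on the critical path,
    plus 2 cycles for the input and output GF multipliers."""
    return 2 * gamma + 2


def hardware_efficiency(freq_hz, n_op, slices):
    """Hardware efficiency E_n = F / (n_op * S), in CN/s/slice."""
    return freq_hz / (n_op * slices)


if __name__ == "__main__":
    # FB architecture with d_c = 6: gamma = d_c - 2 = 4
    print(cnp_latency(6 - 2))
    # Placeholder values (not from Table I): F = 140 MHz, n_op = 14, S = 1000
    print(hardware_efficiency(140e6, 14, 1000))
```

With the FB relation $\gamma = d_c - 2$, a degree-6 CN gives $L_C = 2 \cdot 4 + 2 = 10$ clock cycles.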
VII. CONCLUSION
This brief was dedicated to the design of an efficient ECN architecture for NB-LDPC decoders based on the EMS algorithm. The proposed architecture enhances the hardware efficiency of the CNP by a factor of 6, compared to previous work. This solution is based on the use of optimized FIFOs at the elementary level and on a tree architecture at the check node level. Future work will be dedicated to the optimization of the VN architecture and the implementation of the optimized global NB-LDPC decoder.
