Abstract—This paper proposes a unified framework to describe the check node architectures of non-binary low-density parity-check (NB-LDPC) decoders. Forward-backward, syndrome-based, and pre-sorting approaches are first described. Then, they are hybridized in an effective way to reduce the amount of computation required to perform a check node update. This paper especially impacts check nodes of high degree (i.e., high coding rates). Results of 28-nm ASIC post-synthesis for a check node of degree 12 (i.e., a code rate of 5/6 with a variable node degree equal to 2) are provided for NB-LDPC codes over GF(64) and GF(256). While simulations show almost no performance loss, the newly proposed hybrid check node implementation increases the hardware and power efficiency by a factor of six compared with the classical forward-backward architecture. This leads to the first reported implementation of a degree-12 check node over GF(256), and these preliminary results open the road to high-throughput, high-rate, high-order Galois field NB-LDPC decoders with reasonable hardware complexity.
I. INTRODUCTION
Low-Density Parity-Check (LDPC) codes [1] have now been adopted for a wide range of standards (WiMAX, WiFi, DVB-C, DVB-S2X, DVB-T2) because of their near-channel-capacity performance. However, this capacity-approaching performance is obtained for long codeword lengths, and LDPC codes start to show their weakness for short and moderate codeword lengths. In the last decade, significant research effort has been devoted to the extension of LDPC codes to high-order Galois fields GF(q) (q > 2 being the order of the field). This family of codes, named Non-Binary (NB) LDPC codes, shows strong error correction capability at moderate and short codeword lengths [2]. This is mainly due to the fact that NB-LDPC codes present higher girths than their binary counterparts, and thus better decoding performance with message-passing algorithms. Also, the NB nature of such codes makes them suitable for high-spectral-efficiency modulation schemes where the constellation symbols are directly mapped to GF(q) symbols [3]. This mapping bypasses the marginalization process of binary LDPC codes that causes information loss. NB-LDPC codes thus become serious competitors to classical binary LDPC and Turbo codes in future wireless communication and digital video broadcasting standards.
The main drawback of NB-LDPC codes is related to their high decoding complexity. In the NB-LDPC decoder each message exchanged between the processing nodes is an array of values, each one corresponding to a GF element. From an implementation point of view, this leads to a highly increased complexity compared to binary LDPC decoding.
A straightforward implementation of the Belief-Propagation (BP) algorithm for NB-LDPC codes results in a computational complexity of the order of $O(q^2)$ [2]. The Extended Min-Sum (EMS) algorithm, which extends the well-known Min-Sum algorithm from the binary to the NB domain, represents an interesting compromise between hardware complexity and error correction performance [4], [5]. As the Check Node (CN) processing constitutes the computational bottleneck of EMS decoding, much research work has focused on its complexity reduction. Currently, state-of-the-art architectures apply the Forward-Backward (FB) algorithm [5] to process the CN. With this approach, a serial calculation is carried out to reduce the hardware cost and to reuse intermediate results of Elementary CNs (ECNs). However, the FB CN structure suffers from high latency and low throughput. The Trellis-EMS (T-EMS) introduced in [6] avoids the long latency of the FB computation, but its hardware complexity increases sharply with q when a parallel implementation is considered. The complexity of the T-EMS was reduced with the one-minimum T-EMS [7] and the Trellis Min-Max (T-MM) [8], [9] algorithms.
The Syndrome-Based (SB) algorithm, recently presented in [10] and [11], is an efficient method to perform a parallel computation of the CN when q ≥ 16. This architecture was used for the first reported implementation of a GF(256) CN processor with degree $d_c = 4$ [12]. However, the complexity of the SB CN algorithm is dominated by the number of syndromes to be computed, which increases quadratically with $d_c$, limiting its interest for high coding rates, i.e., for high $d_c$ values. Recently, we showed that sorting the input vectors of the CN according to a reliability criterion [13], [14] allows a significant reduction of the hardware complexity of the CN architecture without performance degradation. This pre-sorting technique was combined with the SB CN [13] and with the FB CN, leading to the PS-FB CN [14].
The goals of this paper are twofold. The first one is to synthesize several contributions published in conference papers in a unified framework. The second goal is to present a simplification of the Extended Forward (EF) architecture and several original hybridizations of existing architectures to further reduce the hardware complexity of the CN and to derive efficient hardware implementations for high-order GF and/or high CN degrees. The pre-sorting is the key to understanding the efficiency of the hybrid architecture compared to previous algorithms: it allows matching locally the processing algorithm (SB, EF or FB) with the dynamics of the incoming and intermediate messages. Table I presents the names of the different existing and proposed architectures. In this Table, the references with an asterisk correspond to previous contributions of the authors.
The paper is structured as follows: Section II recalls the structure of a NB-LDPC code and presents the EMS decoding algorithm. Section III gives a survey of the existing practical EMS architectures. Section IV presents several original hybrid architectures and Section V provides synthesis and performance results. Finally, conclusions are drawn in Section VI.
II. NB-LDPC NOTATIONS AND DECODING
This section first recalls the principles and notations of NB-LDPC codes. Then, the EMS algorithm is described.
A. NB-LDPC Code
A NB-LDPC code is a linear block code defined by a sparse parity-check matrix H of size M × N. The M rows of the matrix H refer to M parity-check equations. The $i$-th parity-check equation involves the $d_c(i)$ non-zero GF(q) values $\{h_{i,j_i(k)}\}_{k=1,\ldots,d_c(i)}$ of the $i$-th row of the parity-check matrix as
$$\sum_{k=1}^{d_c(i)} h_{i,j_i(k)} \, x_{j_i(k)} = 0 \;\;\text{in GF}(q), \qquad (1)$$
where $x_{j_i(k)}$ denotes the codeword symbol at position $j_i(k)$. In this paper, we consider regular codes with a constant check node degree $d_c$ and a variable node degree $d_v = 2$ [18]. In this case, assuming a full-rank parity-check matrix H, the rate of the code is $r = 1 - 2/d_c$.
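For illustration (a toy example of ours, not taken from the codes considered later), a row with $d_c(i) = 3$ and coefficients $h_{i,1} = \alpha$, $h_{i,2} = 1$, $h_{i,3} = \alpha^2$ over GF(4) constrains the codeword symbols through $\alpha x_1 \oplus x_2 \oplus \alpha^2 x_3 = 0$.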
B. EMS Algorithm
For simplicity, the EMS algorithm is described only at the CN level. The reader can refer to [5] for a complete description of the whole decoding process, including the variable and edge nodes. Fig. 1 shows a CN of degree $d_c$ with inputs $e_k = h_{i,k} x_k$, $k = 1, \ldots, d_c$. Each input $e_k$ is a variable defined over GF(q) and its associated a priori information is the discrete probability distribution $P(e_k = x)$, $x \in$ GF(q). Each element of the probability distribution E associated to a generic input e can be expressed in the logarithmic domain as the Log-Likelihood Ratio (LLR) $e^+(x)$ defined as
$$e^+(x) = \log\left(\frac{P(e = \hat{x})}{P(e = x)}\right), \qquad (2)$$
where $\hat{x}$ is the hard decision on e obtained with the maximum likelihood criterion, i.e., $\hat{x} = \arg\max_{x \in \mathrm{GF}(q)} P(e = x)$. From this LLR definition, $e^+(\hat{x}) = 0$ and $\forall x \in$ GF(q), $e^+(x) \geq 0$. The distribution (or message) E associated to e is thus $E = \{e^+(x)\}_{x \in \mathrm{GF}(q)}$. The EMS algorithm is an extension of the Min-Sum aiming at reducing the complexity of NB-LDPC decoders. Its main characteristic is the truncation of the messages E from q to the $n_m$ most reliable values ($n_m \ll q$). At the CN level, each incoming message U is thus composed of $n_m$ couples sorted in increasing order of LLR (note that a high LLR value means low reliability). In Fig. 1, each input U of the CN is a list of couples $U = \{(u^+[j], u[j])\}_{j=0,\ldots,n_m-1}$, where $u[j] \in$ GF(q) and $u^+[j]$ is its associated LLR.
The same representation is used for each output V of the CN.
The EMS algorithm at the CN level can be described in two steps: the computation, for each output $i$, of
$$v_i^+(x) = \min\Big\{ \sum_{j \neq i} u_j^+(x_j) \;:\; \bigoplus_{j \neq i} x_j = x \Big\}, \qquad (3)$$
where $\bigoplus$ denotes the addition over GF(q), followed by the truncation of each output message to its $n_m$ most reliable values. As the direct computation of (3) implies a prohibitive number of calculations, different approaches have been proposed for the EMS CN processing. They are presented in the following Section.
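To make the cost of this direct evaluation concrete, the following sketch (our illustration, with GF($2^p$) addition modeled by XOR on integer labels) exhaustively evaluates (3) over truncated messages; its $(n_m)^{d_c - 1}$ terms per output explain why practical architectures factorize the computation.

# Illustrative sketch (not from the paper): direct evaluation of the EMS
# check-node update (3) over truncated messages. GF(2^p) addition is
# modeled by XOR on the integer labels of the field elements.
from itertools import product

def ems_cn_direct(U, n_m):
    """U: list of d_c input messages, each a list of (llr, gf) couples
    sorted by increasing LLR. Returns the d_c output messages."""
    d_c = len(U)
    V = []
    for i in range(d_c):
        others = [U[j] for j in range(d_c) if j != i]
        best = {}  # gf value -> smallest LLR sum found so far
        # Exhaustive search over all combinations: (n_m)^(d_c - 1) terms.
        for combo in product(*others):
            llr = sum(c[0] for c in combo)
            gf = 0
            for c in combo:
                gf ^= c[1]
            if llr < best.get(gf, float("inf")):
                best[gf] = llr
        # Truncation: keep the n_m most reliable (smallest LLR) values.
        V.append(sorted((llr, gf) for gf, llr in best.items())[:n_m])
    return V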
III. PRACTICAL IMPLEMENTATIONS OF EMS CN PROCESSING
This section reviews the two state-of-the-art implementations of the EMS algorithm: the Forward-Backward (FB) and the Syndrome-Based (SB). Then, the presorting technique is introduced to show that sorting the CN input messages can lead to significant savings in terms of computations and implementation hardware.
A. Forward-Backward CN Processing
The FB CN algorithm exploits the commutative and associative properties of the addition in GF(q) and factorizes (3) using a set of 2-input 1-output ECNs. The CN processing is split into three layers: forward, backward and merge, each one containing $d_c - 2$ ECNs [5]. An ECN processes a single output C as a function of two inputs A and B. Fig. 2 shows the resulting structure for a FB CN with $d_c = 6$ inputs using $(d_c - 2) \times 3 = 12$ ECNs. Intermediate results of the ECNs are reused in the later stages, avoiding re-computations and thus reducing the amount of processing. Several reported hardware implementations of NB-LDPC decoders use this efficient FB architecture [19], [20].
The ECN processing [19] can be described in three steps. 1) Addition: for each couple of indexes $(a, b) \in \{0, 1, \ldots, n_m - 1\}^2$, compute the output tuple
$$C(a, b) = \big(a^+[a] + b^+[b],\; a[a] \oplus b[b]\big). \qquad (4)$$
2) Sorting: sort the generated tuples in increasing order of LLR. 3) Redundancy Elimination (RE): discard tuples whose GF value has already been generated. In order to keep output values with a high probability, a number $n_{op} > n_m$ of outputs is generated by each ECN (typically, $n_{op} = n_m + 2$) [4].
For the sake of clarity in the rest of the paper, the three ECN steps are represented by the operator $\boxplus$ as
$$C = A \boxplus B. \qquad (5)$$
In [19], it is shown that (4) needs to be evaluated only for indexes $(a, b)$ that verify $(a + 1)(b + 1) \leq n_{op}$. In fact, since the vectors $A^+$ and $B^+$ are sorted in increasing order of LLR, any tuple $C(a, b)$ with $(a + 1)(b + 1) > n_{op}$ is preceded by at least $n_{op}$ tuples of smaller or equal LLR; it thus does not belong to the set of the $n_{op}$ smallest values and does not need to be evaluated.
As proposed in [14], the notion of potential bubbles is introduced by specifying the index variation ranges $n_a$ and $n_b$ of the two entries A and B, i.e., $0 \leq a < n_a$ and $0 \leq b < n_b$, and the range $n_c$ of the output C, i.e., $0 \leq c < n_c$. Note that $n_a$ and $n_b$ should be smaller than or equal to $n_c$. In Fig. 3(a), the subset of potential bubbles is represented in grey for $n_m = n_{op} = 10$. A serial hardware implementation of the Bubble CN architecture was presented in [19]. Suboptimal versions considering only the subset of the most probable potential bubbles (the first two rows and two columns) were presented in [19]-[21]. An illustrative software sketch of a single ECN is given below.
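The following sketch (our illustration) puts the three ECN steps and the potential-bubble restriction $(a+1)(b+1) \leq n_{op}$ together; couples are (LLR, GF) pairs and GF addition is modeled by XOR.

# Illustrative sketch: one ECN of the FB architecture. Inputs A, B are
# lists of (llr, gf) couples sorted by increasing LLR; only the potential
# bubbles (a+1)(b+1) <= n_op are evaluated, as shown in [19].
def ecn(A, B, n_op):
    candidates = []
    for a in range(len(A)):
        for b in range(len(B)):
            if (a + 1) * (b + 1) > n_op:
                break  # larger b can only rank worse; skip the rest of the row
            candidates.append((A[a][0] + B[b][0], A[a][1] ^ B[b][1]))
    candidates.sort()                       # step 2: sorting by LLR
    out, seen = [], set()
    for llr, gf in candidates:              # step 3: redundancy elimination
        if gf not in seen:
            seen.add(gf)
            out.append((llr, gf))
        if len(out) == n_op:
            break
    return out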
B. Syndrome-Based CN Processing
The SB CN algorithm relies on the definition of a deviation path and its associated syndrome. In the sequel, $n_{m,in}$ (resp. $n_{m,out}$) refers to the size of the input (resp. output) vectors of a CN.
A deviation path, denoted by $\delta$, is defined as a $d_c$-tuple $\delta = (\delta(1), \ldots, \delta(d_c)) \in \{0, \ldots, n_{m,in} - 1\}^{d_c}$, where $\delta(k)$ is the index of the couple selected in the input vector $U_k$.
A syndrome associated to a deviation path $\delta$ is denoted by $S(\delta)$ and defined as the triplet composed of an LLR value, a GF value and a Deviation Binary Vector (DBV), i.e., $S(\delta) = (s^+(\delta), s(\delta), s^D(\delta))$, where
$$s^+(\delta) = \sum_{k=1}^{d_c} u_k^+[\delta(k)], \qquad (6)$$
$$s(\delta) = \bigoplus_{k=1}^{d_c} u_k[\delta(k)], \qquad (7)$$
and the DBV is the $d_c$-bit binary vector whose $k$-th bit equals 1 if and only if input k deviates from its most reliable couple, i.e.,
$$s^D(\delta) = \big(\mathbb{1}[\delta(1) \neq 0], \ldots, \mathbb{1}[\delta(d_c) \neq 0]\big). \qquad (8)$$
For instance, with input vectors $U_1$ to $U_4$ whose elements of GF(64) are represented by the powers of a primitive element $\alpha$ constructed from a primitive polynomial, a deviation path selects one couple per input vector and its syndrome follows from (6)-(8).
Let $\Delta_0$ be the set of all possible deviation paths that can contribute to an output value, i.e., $\Delta_0 \subset \{0, \ldots, n_{m,in} - 1\}^{d_c}$. Using the syndrome associated to a deviation path, (3) can be reformulated as
$$v_i^+(x) = \min_{\delta \in \Delta_0,\; \delta(i) = 0,\; s(\delta) \oplus u_i[0] = x} s^+(\delta). \qquad (9)$$
The DBV is used to reduce the complexity of (9) by avoiding redundant computation. In fact, if the $i$-th bit of $s^D(\delta)$ equals 0, then $\delta(i) = 0$, so the constraint $\delta(i) = 0$ can be tested directly on the DBV. It is thus possible to simplify (9) and express it as
$$v_i^+(x) = \min_{\delta \in \Delta_0,\; s^D(\delta)(i) = 0,\; s(\delta) \oplus u_i[0] = x} s^+(\delta). \qquad (10)$$
Finally, (10) is further reduced by replacing $\delta \in \Delta_0$ by $\delta \in \Delta$, where $\Delta$ is a subset of $\Delta_0$ with a reduced cardinality $|\Delta| = n_s$ [10].
The SB CN algorithm proposed in [10] is summarized in Algorithm 1 and its associated architecture is presented in Fig. 5.
Step 1 is performed by the Syndrome unit, Step 2 by the Sorting unit and, finally, Step 3 by $d_c$ Decorrelation Units (DU) and $d_c$ RE units. The DUs are represented in parallel to show the inherent parallelism of the SB CN. The RE units discard couples with a GF value already generated (last test of Step 3 in Algorithm 1). Note that in [10], the sorting process is done only partially.
Algorithm 1: The SB CN Algorithm
Offline processing: select a subset $\Delta \subset \Delta_0$ of cardinality $|\Delta| = n_s$. Initialization: set the output lists $V_i$, $i = 1, \ldots, d_c$, to empty.
Step 1 (syndrome computation): $\forall \delta \in \Delta$, compute $S(\delta)$;
Step 2 (sorting process): sort the syndromes in increasing order of $s^+(\delta)$ to obtain an ordered list $\{S[k]\}_{k=1,2,\ldots,|\Delta|}$ of syndromes;
Step 3 (decorrelation and RE): for each output $i \in \{1, \ldots, d_c\}$, scan the sorted syndromes $S[k]$, $k = 1, 2, \ldots$; if the $i$-th bit of the DBV of $S[k]$ equals 0, compute the couple $(s^+[k],\; s[k] \oplus u_i[0])$ and append it to $V_i$ if its GF value has not already been generated; stop when $V_i$ contains $n_{m,out}$ couples.
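A compact software model of Algorithm 1 follows (our illustration; `paths` stands for the offline-selected subset $\Delta$, DBVs are packed as integer bitmasks, and inputs are normalized so that $u_k^+[0] = 0$).

# Illustrative sketch of the SB CN (Algorithm 1), with GF addition
# modeled by XOR and the DBV packed as an integer bitmask.
def sb_cn(U, paths, n_m_out):
    d_c = len(U)
    # Step 1: syndrome computation (LLR, GF value, DBV).
    syndromes = []
    for delta in paths:
        llr, gf, dbv = 0, 0, 0
        for k in range(d_c):
            llr += U[k][delta[k]][0]
            gf ^= U[k][delta[k]][1]
            dbv |= (delta[k] != 0) << k
        syndromes.append((llr, gf, dbv))
    syndromes.sort()                        # Step 2: sorting by LLR
    V = []
    for i in range(d_c):                    # Step 3: decorrelation and RE
        out, seen = [], set()
        for llr, gf, dbv in syndromes:
            if dbv & (1 << i):              # input i deviated: discard
                continue
            x = gf ^ U[i][0][1]             # remove contribution of input i
            if x not in seen:
                seen.add(x)
                out.append((llr, x))
            if len(out) == n_m_out:
                break
        V.append(out)
    return V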
C. Presorting of Input Messages
The idea of input presorting is to dynamically change the order of inputs in the CN processor to classify reliable and unreliable inputs. This polarization of the inputs makes some deviation paths (for the SB CN [13] ) or some potential bubbles (for the FB CN [14] ) very unlikely to contribute to an output. The suppression of those useless configurations leads to significant hardware saving without affecting performance.
The presorting principle, as described in Algorithm 2, can be efficiently applied to the EMS algorithm [4] and its derived implementations (FB CN [5], [22] and SB CN [10]). Let us consider the application of the presorting algorithm to an EMS-based CN as illustrated in Fig. 6.
Step 1: Extract a reliability metric from each input vector $U_k$ and compute the permutation $\pi$ that sorts the inputs from most to least reliable.
Step 2: Permute the input vectors using $\pi$: $U'_k = U_{\pi(k)}$, $k = 1, \ldots, d_c$.
Step 3: Perform the CN processing with the permuted input vectors $U'_1, \ldots, U'_{d_c}$, producing outputs $V'_1, \ldots, V'_{d_c}$.
Step 4: Permute the output vectors using the inverse permutation $\pi^{-1}$: $V_{\pi(k)} = V'_k$.
The presorting stage requires two permutation networks (or switches). However, it allows some simplifications in the CN itself, globally leading to an important complexity reduction of the whole CN processing.
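The wrapper below (our illustration) models Algorithm 2 around any CN core; ranking inputs by the LLR of their second couple is our assumption for the reliability metric.

# Illustrative sketch of the presorting wrapper (Algorithm 2). The
# reliability metric U[k][1][0] (LLR of the second couple) is an
# assumption; `cn_process` is any CN implementation (FB, SB, EF, ...).
def presorted_cn(U, cn_process):
    d_c = len(U)
    # Step 1: an input whose second-best LLR is large is "reliable"
    # (big gap to the hard decision), so sort by decreasing U[k][1][0].
    pi = sorted(range(d_c), key=lambda k: U[k][1][0], reverse=True)
    # Step 2: permute the input vectors.
    U_p = [U[k] for k in pi]
    # Step 3: CN processing on the sorted inputs.
    V_p = cn_process(U_p)
    # Step 4: inverse permutation of the outputs.
    V = [None] * d_c
    for k in range(d_c):
        V[pi[k]] = V_p[k]
    return V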
D. Decorrelation and Permutation Unit
The first contribution of this paper is to reduce the complexity of the inverse permutation (Step 4 of Algorithm 2), originally implemented by a switch as shown at the bottom of Fig. 6. In fact, the decorrelation and switch operations can be efficiently merged in the SB CN architecture. To be specific, the DU can be modified so that it also performs the switching. This is achieved as follows: the Switch block at the bottom of Fig. 6 reorders the $V'$ vectors to obtain the $V$ vectors, i.e., $V_{\pi(k)} = V'_k$; since the $k$-th DU generates $V'_k$, it suffices to route its output directly to position $\pi(k)$, which removes the output switch. The modified DU is named Decorrelation and Permutation Unit (DPU) and an example of its use is shown in Fig. 11.
IV. HYBRID ARCHITECTURES
In the previous section, the FB and SB CN architectures, as well as the application of the presorting technique to both of them, have been presented. In this section, we consider the hybridization of these approaches in a single CN architecture. This constitutes the main contribution of this paper. The first proposed hybrid architecture uses an Extended Forward (EF) processing to dynamically generate the set of syndromes. The objective is to take advantage of the simplicity of the SB architecture while keeping the complexity linear with $d_c$. The second original architecture introduces presorting in the EF to further reduce the complexity. Finally, a new level of hybridization is performed to take full advantage of the presorting technique in the EF architecture.
A. Syndrome Computation Using the EF Processing
A syndrome set S can also be computed by performing a forward iteration on all the inputs of the CN using a serial concatenation of ECNs (5) as
$$S = (\cdots((U_1 \boxplus U_2) \boxplus U_3) \cdots) \boxplus U_{d_c}. \qquad (11)$$
Applying the SB CN approach with ECNs of parameters $(n_m, n_m, n_m)$ provides $n_m$ syndromes sorted in increasing order of their associated LLR values. The syndrome set can be computed in a serial scheme as shown in Fig. 7, or with $\lceil \log_2(d_c) \rceil$ layers of ECNs using a tree structure.
Thanks to the use of ECNs, the computed syndrome set is sorted and can be directly applied to the decorrelation process. Note that in [10], a sorting process is required after the syndrome computation. However, the ECNs used in the EF architecture require a small additional patch compared to those in the FB architecture. The role of this patch is to construct the DBV (8) during the ECN processing by providing the additional term $c^D_{a,b}$. The ECN addition (4) is then modified as
$$C(a, b) = \big(a^+[a] + b^+[b],\; a[a] \oplus b[b],\; a^D[a] \,\|\, b^D[b]\big), \qquad (12)$$
where $\|$ represents the concatenation operation of two binary vectors. The CN inputs are initialized with a DBV value of one bit as follows: $u_k^D[0] = 0$ and $u_k^D[j] = 1$ for $j \neq 0$.
Thanks to the DBV computation, the output of the EF processing is similar to the output of the SB processing just before the decorrelation. In particular, the notion of deviation path can also be applied to the EF processing, with the only difference that the set of deviation paths $\Delta_{EF}$ is input dependent, while $\Delta$ is predefined offline in the SB architecture [10].
A first drawback of the EF is that the number of computed syndromes is typically $3 \times n_{m,out}$ to compensate for the discarded redundant syndromes. Even with this approach, the first simulation results of the EF algorithm showed significant performance degradation compared to the FB algorithm [17]. The reason for this performance degradation resides in the RE process performed by each ECN: since an ECN performs RE, no more than one ECN output can be associated to a given GF value. However, since the ECN outputs in the EF algorithm are partial syndromes, RE may discard useful partial syndromes that would construct valid complete syndromes at the end of the EF processing. In Fig. 8, an example of a CN with $d_c = 4$, $n_{m,in} = 2$ and $n_{m,out} = 3$ is presented to illustrate the problem. The two deviation paths $\delta_1 = (1, 0, 0, 0)$ and $\delta_2 = (0, 1, 0, 0)$ lead to the same partial GF value. The output $C_1 = U_1 \boxplus U_2$ of the first ECN is equal to $C_1 = \{(0, 0, 00), (1, \alpha^{24}, 10), (2, \alpha^{24}, 01), (3, 0, 11)\}$ before the RE and equal to $C_1 = \{(0, 0, 00), (1, \alpha^{24}, 10)\}$ after RE. Note that the seed of the partial syndrome of $\delta_2$ is eliminated. The final output in this example will be $S = C_3 = \{(0, \alpha^4, 0000), (1, \alpha^0, 1000), (7, \alpha^{17}, 0010)\}$ and, after the decorrelation unit, $V_1 = \{(0, \alpha^{24}), (7, \alpha^{47})\}$ instead of $V_1 = \{(0, \alpha^{24}), (2, 0), (7, \alpha^{47})\}$. The key idea to avoid this problem is to allow redundant GF values in the syndrome set. Thus, removing the RE process from the ECN processing avoids the performance degradation. Let us define a modified ECN operation, denoted by $\boxplus^*$, where the ECN addition is performed as in (12) and no RE is performed. The syndrome set of size $n_m$ can then be computed as
$$S = (\cdots((U_1 \boxplus^* U_2) \boxplus^* U_3) \cdots) \boxplus^* U_{d_c}. \qquad (13)$$
The RE process then takes place after the decorrelation operation performed by the DUs. As previously mentioned, the set of paths of the SB CN is pre-determined offline, while it is determined dynamically, on the fly, in the EF CN according to the current LLR values being processed. This leads to a significant reduction of the total number $n_s$ of syndromes to be computed. A software sketch of the modified ECN is given below.
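The sketch below (our illustration) models the modified ECN $\boxplus^*$: a lazy merge that outputs the $n_m$ smallest-LLR combinations with concatenated DBVs and, crucially, no redundancy elimination.

# Illustrative sketch: the modified ECN used in the EF syndrome
# computation. Couples are (llr, gf, dbv_string); GF addition follows
# (12) with DBV concatenation, and no RE is performed.
import heapq

def ecn_no_re(A, B, n_m):
    """A, B: lists of (llr, gf, dbv) sorted by LLR. Returns the n_m
    smallest-LLR combinations, duplicates in GF value included."""
    # Lazy merge of the sorted "bubble" matrix using a heap: start from
    # index (0, 0) and push the two neighbours of each popped candidate.
    heap, seen, out = [(A[0][0] + B[0][0], 0, 0)], {(0, 0)}, []
    while heap and len(out) < n_m:
        llr, a, b = heapq.heappop(heap)
        out.append((llr, A[a][1] ^ B[b][1], A[a][2] + B[b][2]))
        for na, nb in ((a + 1, b), (a, b + 1)):
            if na < len(A) and nb < len(B) and (na, nb) not in seen:
                seen.add((na, nb))
                heapq.heappush(heap, (A[na][0] + B[nb][0], na, nb))
    return out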
B. EF CN With Presorting
As shown in Section III-C, presorting leads to significant hardware savings by reducing the number of candidate GF symbols to be processed within the CN. In this section, we show that this presorting technique, when applied to the message vectors entering the EF CN, leads to a high complexity reduction of the CN architecture. This architectural reduction is obtained by reducing the number of bubbles to be considered at each ECN. For this, we perform a statistical study based on Monte-Carlo simulation that traces the paths of the GF symbols contributing to the output of the CN on their way across the different ECNs. This statistical study [14] identifies, in each ECN, how often a given bubble contributes to an output. This information allows pruning the bubbles that never or rarely contribute. More formally, this study is conducted through the following two steps: 1) Monte Carlo simulation giving the trace of the different bubbles (each time a bubble b is used in an output message, its score $\gamma(b)$ is incremented). 2) ECN pruning that aims at discarding the least important bubbles, thus simplifying the ECN architectures. How to prune low-score bubbles for best efficiency is still an open question. However, we propose here a method that prunes bubbles based on the statistics of their scores at each ECN. Let $I_b$ be the set of indexes of the potential bubbles of a given ECN, sorted in increasing order of score. Let $\tau$ be a real number between 0 and 1 and let $\Gamma$ be the cumulative score of all bubbles, i.e., $\Gamma = \sum_{b \in I_b} \gamma(b)$. The pruning then discards the lowest-score bubbles whose cumulative score remains below $\tau \times \Gamma$, as sketched below.
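The following sketch (our illustration, under our reading of the threshold rule just stated) prunes the lowest-score bubbles while their cumulative score stays below $\tau \times \Gamma$.

# Illustrative sketch of the score-based bubble pruning.
def prune_bubbles(scores, tau):
    """scores: dict mapping bubble index (a, b) -> score gamma(b).
    Returns the set of bubbles kept after pruning."""
    gamma_total = sum(scores.values())       # cumulative score Gamma
    budget = tau * gamma_total               # total score allowed to drop
    kept, dropped = set(scores), 0.0
    # Drop bubbles from the least to the most used while the cumulative
    # dropped score stays below the tau * Gamma threshold.
    for b in sorted(scores, key=scores.get):
        if dropped + scores[b] > budget:
            break
        dropped += scores[b]
        kept.discard(b)
    return kept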
After this pruning process, the structure of some ECNs is greatly simplified. The choice of the value of τ is a trade-off between hardware complexity and performance. As an example, Fig. 9 represents the remaining bubbles after the pruning process for a $d_c = 12$ GF(64) (144, 120) NB-LDPC code with $n_{m,out}$ set to 16. The pruning process has been performed for an SNR of 5 dB and a value of τ equal to 0.01, leading to the following simplified ECN architectures:
• 1B: only a single bubble is considered, i.e., $C_i$ is reduced to $C_i[0] = (a^+[0] + b^+[0],\; a[0] \oplus b[0])$.
• S-1B: it directly generates the $n_c^i$ sorted outputs $C_i$ as $C_i[c] = (a^+[c] + b^+[0],\; a[c] \oplus b[0])$, $0 \leq c < n_c^i$, the output being sorted by construction since A is sorted.
For this operation, a single GF-adder is required.
• S-xB: with x > 1, also known as S-bubble ECN.
As described in [21], this architecture compares x bubbles per clock cycle. In Fig. 9, we represent each bubble in an ECN by a filled circle and the direction of the next bubble by an arrow; the length of the arrow corresponds to the size of the FIFO in the architecture. Note that the complexity of the ECNs increases from left to right: only trivial ECN blocks (1B, S-1B, S-2B) are required on the left part, while an S-5B ECN is required on the right part. It is possible to regroup several ECNs in a single component called Syndrome Node (SN). As detailed hereafter, this SN computes sorted partial syndromes in only one clock cycle, leading to significant hardware and latency reductions.
C. The Syndrome Node
In Fig. 9, the first 3 ECNs are of type 1B and can be processed together in a single clock cycle by simply adding the most reliable GF values of the first four inputs: $c_4[0] = u_1[0] \oplus u_2[0] \oplus u_3[0] \oplus u_4[0]$, with LLR $c_4^+[0] = 0$ and DBV $c_4^D[0] = 0000$. Also, the $n_c^6 = 3$ values of $C_6$ can be computed in one clock cycle. In fact, thanks to presorting, only first-order deviations on the two last inputs need to be considered at this stage, and the first three partial syndromes are $(0,\; c_4[0] \oplus u_5[0] \oplus u_6[0],\; 000000)$, $(u_6^+[1],\; c_4[0] \oplus u_5[0] \oplus u_6[1],\; 000001)$ and $(u_5^+[1],\; c_4[0] \oplus u_5[1] \oplus u_6[0],\; 000010)$. In summary, we can consider all these computations to belong to a unique block, i.e., the SN, that involves several ECNs (to be specific, 5 ECNs in the example of Fig. 9) but generates its outputs in a single clock cycle, as illustrated by the sketch below.
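A software model of the SN for this example follows (our illustration; the DBV bit convention and the restriction to first-order deviations on $U_5$ and $U_6$ are assumptions consistent with the description above).

# Illustrative sketch of the Syndrome Node for the example of Fig. 9:
# the four most reliable inputs contribute only their best couple, while
# small deviations on the remaining two inputs yield the first sorted
# partial syndromes.
def syndrome_node(U):
    """U: 6 presorted inputs as lists of (llr, gf) couples, llr[0] = 0."""
    base = 0
    for k in range(4):              # 1B ECNs: most reliable GF values only
        base ^= U[k][0][1]
    u5, u6 = U[4], U[5]
    cands = [
        (0,        base ^ u5[0][1] ^ u6[0][1], "000000"),  # no deviation
        (u6[1][0], base ^ u5[0][1] ^ u6[1][1], "000001"),  # deviate on U6
        (u5[1][0], base ^ u5[1][1] ^ u6[0][1], "000010"),  # deviate on U5
    ]
    return sorted(cands)            # sorted partial syndromes, 1 cycle in HW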
D. Hybridization Between FB and EF CN Architectures
Combining the EF architecture and the FB approach leads to a reduction of the total number of syndromes needed to guarantee a given number of valid syndromes. Fig. 10 shows the average number of syndromes that should be computed for a given output $V_i$ to obtain, with a probability of 90%, $n_{op} = 18$ valid syndromes. This number is denoted by $n_s^{0.9}(i)$ and varies for each output $V_i$. Note that when the presorting technique is considered, $n_s^{0.9}(i)$ increases with i. To decode without performance degradation, the number of computed syndromes is bounded by the number of syndromes required by the last output, i.e., $n_s = n_s^{0.9}(12) = 46$ in the example of Fig. 10. Fig. 9 shows that $V_{12}$ can be directly obtained from $C_{11}$ without DU, since $C_{11}$ contains the contribution of all the inputs except $U_{12}$, i.e., $V_{12} = C_{11}$. This result can be seen as the application of the FB algorithm to the output $V_{12}$, since the forward process of the FB algorithm is included in the EF and systematically generates $V_{12}$. Consequently, the number of required syndromes can be reduced from $n_s^{0.9}(12) = 46$ to $n_s^{0.9}(11) = 36$, since $V_{12}$ is computed directly. This reduces the overall complexity and latency of the EF CN architecture without performance degradation. Note that this constitutes a first example of a hybrid architecture where one output is generated with the FB approach and the other $d_c - 1 = 11$ outputs with the EF CN. This kind of approach can be generalized, as described in the following.
E. General Notations for Hybrid Architectures
Let HB($\rho_{SN}$, $\rho_{EF}$, $\rho_{FB}$) be a hybrid architecture that combines the SN, EF and FB schemes. The first $\rho_{SN}$ inputs are processed by a SN block, the next $\rho_{EF}$ inputs are processed by an EF block and the remaining $\rho_{FB}$ inputs are processed by a FB block. Obviously, $\rho_{SN} + \rho_{EF} + \rho_{FB} = d_c$. Fig. 11 shows the HB(0, 4, 2) architecture for a CN of degree 6. As shown, $V'_5$ and $V'_6$ are computed using the FB algorithm in order to further reduce the number of required syndromes. There are several possible HB architectures between the EF (i.e., HB(0, 6, 0)) and the classical FB CN (i.e., HB(0, 0, 6)). Note that $V'_6$ (resp. $V'_5$) should bypass the decorrelation units and be directly connected to $V_{\pi(6)}$ (resp. $V_{\pi(5)}$). Fig. 11 shows the case where $\pi(6) = 3$, i.e., the third multiplexer connects $V'_6$ to $V_3$, and $\pi(5) = 5$, i.e., the fifth multiplexer connects $V'_5$ to $V_5$. Finally, $V_1$, $V_2$, $V_4$ and $V_6$ are each connected to the output of the corresponding DPU. Fig. 13 shows the HB(6, 4, 2) architecture for a CN of degree 12: $V'_{11}$ and $V'_{12}$ are computed using the FB algorithm and a SN is used to process the first 6 inputs $U_1$ to $U_6$.
F. Choice of Parameters
The determination of the CN architecture parameters, i.e., ($\rho_{SN}$, $\rho_{EF}$, $\rho_{FB}$) at the macro level and the internal structure of the EF and FB blocks (the parameters of each ECN) at the micro level, is a complex problem. It can be formulated as an optimization problem: how to minimize the hardware complexity without introducing significant performance degradation. In this paper, we have limited the value of $\rho_{FB}$ to 1 and 2. Then, for these two cases, we have applied the same method as the one described in Section IV-B to determine the parameters of each ECN of the EF and FB blocks. Note that after the automatic raw pruning process described in Section IV-B, the parameters are further tuned by hand with a "try and see" (i.e., estimate performance by simulation) method. Once the pruning process is finished, the value of $\rho_{SN}$ is fixed in order to optimize the hardware efficiency of the CN architecture. In fact, at a given point, a CN with parameters ($\rho_{SN} + 1$, $\rho_{EF} - 1$, $\rho_{FB}$) will have a higher hardware complexity than a CN with parameters ($\rho_{SN}$, $\rho_{EF}$, $\rho_{FB}$), but a lower decoding latency.
G. Suppression of Final Output RE
In some decoder implementations [16], [19] with $d_v = 2$, the VNs connected to a CN are updated just after the CN update. For example, in Fig. 11 a variable node unit may be connected directly to each output $V_1$ to $V_6$, just after the RE units RE$_1$ to RE$_6$. Also, in some VN implementations [16], [19], RE is performed in the VN. If this type of VN is used, then the RE block can be removed from the hybrid architecture for complexity reduction. The suppression of the $d_c$ RE blocks is especially interesting for high $d_c$ values. In a HB architecture with final RE, the RE reduces the number of output messages from $n_s$ (in the case where all messages are valid) to $n_{m,out}$. Although removing the RE increases the number of output messages ($n_s > n_{m,out}$), it has a limited impact on the complexity since these elements are processed on the fly by the VN, without the need for intermediate storage. However, it may slightly impact the VN power consumption since $n_s$ elements are processed instead of $n_{m,out}$. Note that the suppression of RE affects neither the algorithm nor the performance, since the RE is still performed in the VN.
V. PERFORMANCE AND COMPLEXITY ANALYSIS
We consider GF(64) and GF(256) NB-LDPC codes to obtain performance and post-synthesis results for the different proposed decoding architectures.
A. Performance
We ran bit-true Monte-Carlo simulations over the Additive White Gaussian Noise (AWGN) channel with a Binary Phase Shift Keying (BPSK) modulation scheme. The different parameters were set as follows: extrinsic and intrinsic LLR messages quantized on 6 bits, a posteriori LLRs on 7 bits, and a maximum number of decoding iterations of 10. The matrices used in our simulations are available in [23]. Fig. 12 shows the obtained Frame Error Rate (FER) for a GF(64) code of size (864, 720) bits, code rate R = 5/6, $d_c = 12$ and $d_v = 2$ over the Gaussian channel. We consider the FB decoder in [21] as a reference, i.e., the S-bubble algorithm with 4 bubbles, $n_m = 16$ and $n_{op} = 18$. We simulated the HB(6, 6, 0) (or EF), the HB(6, 5, 1) and the HB(6, 4, 2) architectures with the same number of computed syndromes $n_s = 20$. Fig. 13 shows the HB(6, 4, 2) architecture, for which no performance degradation is observed. We observe less than 0.05 dB of performance loss for the HB(6, 5, 1) and around 0.2 dB for the HB(6, 6, 0) configuration. We conclude from these simulation results that the hybrid architectures can achieve the same performance as the FB architecture and outperform the EF architecture, while needing a number of syndromes reduced by a factor ranging from 3 to 4. We can then conclude that this new family of hybrid architectures allows for a significant complexity reduction in CN implementations without any performance loss compared to more complex state-of-the-art solutions. Fig. 15 shows performance results for a GF(64) (1536, 1344) LDPC code, code rate R = 7/8, $d_c = 16$ and $d_v = 2$. As reference, we consider the FB decoder with an S-bubble architecture [21], 4 bubbles, $n_m = 16$ and $n_{op} = 18$. The architecture being used is the same as the one presented in Fig. 13, except that the SN includes four more 1B ECNs (i.e., the HB(10, 4, 2) CN architecture). The T-MM performance of a code of the same length and coding rate provided in [8] is also presented in Fig. 15. The Hybrid architecture shows a performance gain of 0.5 dB over [24].
B. Implementation Results
For complexity and power analysis, we considered the implementation of the architectures in a 28 nm FD-SOI technology, targeting a clock frequency of 800 MHz. The different kinds of ECNs presented in Fig. 9 were synthesized individually to provide the results in Table II. Additionally, the synthesis results of the SN, the S-bubble with RE (used in the FB CN), the sorter, the switch, the DU (Fig. 5) and the RE units are also provided. The sorter is implemented using a serial architecture as in [14], and the switch is a crossbar switch. The minimum clock period ($P_{clk}$) is given in nanoseconds with a clock uncertainty of 0.05 ns and a setup time of 0.02 ns. The Cycle Latency (CL) represents the number of clock cycles between the first input and the first output. A reduction factor of 57, 34 and 7 is observed between the S-4B RE and the 1B architectures in terms of area, power and clock period, respectively. These results show that a significant gain can be obtained thanks to ECN simplification, even if it implies the overcost of the presorter, the switch and the DPU units [16], [19]. CL(CN) gives the latency in cycles between the first input of a CN and the first output of a CN, taking into account the latency of the ECNs, the Pre-Sorter, the switch, the DPU, the RE and the GF multiplication and division. For the HB(6, 4, 2) architecture (Fig. 13), CL(CN) = CL(Sorter) + CL(Switch) + CL(SN) + 3 × CL(S-2B) + 2 × CL(S-3B) + CL(S-4B) + CL(DPU) = 14, considering that the multiplication is performed in the same cycle as the switch and the division in the same cycle as the DPU.
For the GF(256) results, the FB architecture uses $n_m = 40$ and S-6B ECNs, while the EF and HB architectures consider ECNs with a maximum $n_{m,in}$ value of 6.
C. Area and Energy Efficiency Comparison
To compare the efficiency of the different CN architectures, we consider the number of computed CNs per second, $T_{CN} = 1/(\mathrm{CL(layer)} \times P_{clk})$, where CL(layer) is the periodicity of a CN computation in a layered decoder. In our design, CL(layer) = CL(CN) + CL(VN) + $n_{m,out}$ + $n_{m,int}$, where CL(VN) is 7, $n_{m,int} = 4$ for GF(64) and $n_{m,int} = 6$ for GF(256). The Area Efficiency (AE) can then be expressed as AE = $T_{CN}$/Area, in computed CNs per second per mm². The Energy Efficiency (EE) can be expressed as EE = $F_{clk}$/(Power × CL(layer)), in CN computations per microjoule, where the clock frequency $F_{clk}$ is set to 800 MHz. In Table IV, we compare the different architectures. The HB(6, 5, 1) and HB(6, 4, 2) in GF(64) improve the AE compared with the FB architecture by factors of 6.8 and 6.2, respectively. When comparing the EE, the improvement factors are 6.4 and 5.5.
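As a numerical illustration combining the figures above (our own back-of-the-envelope computation, assuming $n_{m,out} = 16$ as in the example of Fig. 9): for the HB(6, 4, 2) CN over GF(64), CL(layer) = CL(CN) + CL(VN) + $n_{m,out}$ + $n_{m,int}$ = 14 + 7 + 16 + 4 = 41 cycles; with $P_{clk}$ = 1.25 ns at 800 MHz, this yields $T_{CN}$ = 1/(41 × 1.25 ns) ≈ 19.5 × 10⁶ computed CNs per second.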
The CN implementation results of Table III and Table IV compare the FB, EF and HB architectures with and without RE. To compare the HB to the SB, we refer to [25] . 
D. Comparison With the T-MM Architecture
For a fair comparison with the T-MM architecture [8], [9], the full decoder is considered, since the T-MM-based decoder uses a different VN architecture. We implemented a complete decoder architecture based on an HB(10, 4, 2) CN over GF(64), connected to 16 VNs. The variable node units mainly consist in adding the $n_{m,out}$ values to their corresponding intrinsic values and then sorting to keep the $n_{m,in}$ smallest obtained values. A decision unit is implemented in parallel to each VN in order to implement the stopping criterion. The synthesis results are given in Table VI. We considered the same clock frequency as [8].
E. Throughput
The throughput can be greatly improved by processing p CNs in parallel in a layered decoder [16], [26]. Because we consider a code with $d_v = 2$, the VNs are cascaded after the CNs as in [16] and [20], and the throughput of a layered decoder therefore scales with the number p of parallel CN units.
VI. CONCLUSION
This paper was dedicated to low-complexity implementations of CN processors in NB-LDPC decoders. We reviewed the state-of-the-art architectures that consider the Extended Min-Sum algorithm and introduced new approaches to reduce the hardware complexity of the CN processors. We particularly focused on the effect of the presorting technique and on the advantages of the Extended-Forward architecture. We then presented the hybrid architectures, which combine the Extended-Forward and the Forward-Backward approaches to significantly reduce the total number of computed syndromes. The post-synthesis results on 28 nm ASIC technology showed that the area efficiency is improved by a factor of 6.2 without any performance loss, or by a factor of 6.8 with a performance loss of 0.04 dB, compared with the Forward-Backward architecture.
