Abstract-Nonbinary LDPC codes outperform their binary counterparts in different scenarios. However, they require a considerable increase in complexity, especially in the check-node (CN) processor, for Galois fields (GFs) of order higher than GF(16). To overcome this drawback, we propose an approximation for the trellis min-max algorithm that reduces the number of messages exchanged between the CN and the variable node compared with previous proposals from the literature. In addition, we reduce the complexity of the CN processor while keeping the parallel computation of messages. We implemented a layered scheduled decoder, based on this algorithm, in a 90-nm CMOS technology for the (837, 726) NB-LDPC code over GF(32) and the (1536, 1344) code over GF(64), achieving an area saving of 16% and 36% for the CN and 10% and 12% for the whole decoder, respectively. The throughput is 1.07 and 1.26 Gb/s, respectively, which outperforms the state-of-the-art high-rate decoders for high-order GFs from the literature.
I. INTRODUCTION

NONBINARY low-density parity-check (NB-LDPC) codes are a promising kind of linear block code defined over Galois fields (GFs) GF($q = 2^p$) with $p > 1$. NB-LDPC codes have numerous advantages over their binary counterparts, including better error correction performance for short/medium codeword lengths, higher burst error correction capability, and improved performance in the error-floor region.
The main disadvantage of NB-LDPC codes is the high complexity of the decoding algorithms and the derived hardware architectures, which limits their application in real scenarios where high throughput and reduced silicon area are important requirements. Davey and MacKay [1] rediscovered the LDPC codes defined over GF($q = 2^p$) with $p > 1$ by introducing the Q-ary sum-of-product algorithm (QSPA) as an extension of binary LDPC decoding based on belief propagation. Since then, several advances have been made to reduce the complexity of the decoders.
Improvements based on QSPA, such as fast Fourier transform SPA [2], log-SPA, and max-log-SPA [3], reduce the computational load of the parity-check equations without introducing any performance loss. The recently proposed trellis max-log-QSPA algorithm [4] considerably improves both the area and the decoding throughput compared with the previous solutions based on QSPA, making use of a path construction scheme to generate the output messages in the check-node (CN) processor. These solutions offer the highest coding gain for high-rate NB-LDPC codes, but at the same time, they include costly processing that limits their application in real communication and storage systems.
Extended min-sum (EMS) [5] and min-max [6] algorithms were proposed with the aim of reducing the complexity of the solutions based on QSPA. In these algorithms, the CN equations are simplified by making approximations so that the parity-check equations involve only additions and comparisons. Since both algorithms make use of forward-backward (FB) metrics in the CN processor, the maximum throughput is bounded by the serial computations. The number of messages exchanged between the CN and the variable node (VN) for both algorithms is $n_m \times d_c$, where $n_m$ is the number of reliabilities retained out of the q total, with $n_m \ll q$, and $d_c$ is the CN degree. Therefore, the number of messages between the nodes is lower than in the previous solutions from the literature.
To avoid the use of FB metrics, the trellis EMS (T-EMS) algorithm [7], [8] was proposed. The input messages are organized in a trellis with an extra column, which enables the generation of the CN output messages in parallel. The trellis min-max (T-MM) algorithm [9], in turn, improves both the algorithm and the architecture compared with T-EMS from [7] and [8]. One-minimum-only T-MM (OMO-TMM) [10] is an approximation of T-MM that reduces the complexity of the CN by obtaining only one minimum and estimating the second one. Neither T-EMS nor T-MM introduces any performance loss compared with the EMS and the min-max algorithms, respectively. Moreover, the derived hardware architectures improve in area and speed with respect to other proposals from the literature based on the algorithms from [5] and [6]. The main drawbacks of T-EMS, T-MM, and OMO-TMM are: 1) the high number of messages exchanged between the CN and the VN ($q \times d_c$ reliabilities), which increases wiring congestion and limits the maximum achievable throughput and 2) the large number of storage elements required in the hardware implementations of these algorithms, which account for the major part of the decoder's area.
To overcome the drawbacks of T-EMS and T-MM, the proposal in [11] introduces a message compression technique that reduces the wiring congestion between the CN and the VN and the storage elements used in the derived architectures. The messages at the output of the CN are reduced to four elementary sets that include the intrinsic and extrinsic information, the path coordinates, and the hard-decision symbols. The information exchanged between processors is reduced from $q \times d_c$ reliabilities to $4 \times (q - 1) + d_c$ messages without introducing any performance loss. A step further was taken in [12], where the mT-MM algorithm was proposed. This algorithm reduces the cardinality of the intrinsic information to only two elements, and the remaining $q - 2$ values are approximated by a constant value. The information exchanged between processors is reduced to $3 \times (q - 1) + d_c$ messages but at the cost of some performance loss.
In this paper, we take the solution from [11] as the starting point to propose a novel algorithm that reduces the messages that include the intrinsic information and the path coordinates from $(q - 1)$ values to only L messages each, with $L < n_m \ll q$. This improvement reduces the number of messages exchanged in [11] to only $(q - 1) + 3 \times L + d_c$, saving area in the decoder thanks to the reduction in the memory requirements. This reduction in messages introduces a performance loss in the coding gain that can be controlled by means of the parameter L. In a second step, we introduce a novel method to generate the L most reliable values of the intrinsic set, considerably reducing the CN complexity compared with the previous solutions from [8], [9], and [11]. The small size of this set allows us to propose a simplified network that calculates the L most reliable values of the intrinsic information. These values are sent to the VN. The proposed network greatly reduces the area required by the extra column processor from [8], [9], and [11], which is the bottleneck of the implemented CN processors. Our proposal allows the design of high-rate NB-LDPC decoders over GF(32) and GF(64) without prohibitive areas. For the (1536, 1344) NB-LDPC code over GF(64), the area saving is ∼36% in the CN and 15% in the overall decoder compared with the solutions from [11], with a performance loss of 0.1 dB. In terms of throughput, the increase is ∼17.5% compared with the design from [11]. For the (837, 726) NB-LDPC code over GF(32), the area saving is ∼16% in the CN and 10% in the overall decoder, with a performance loss of 0.08 dB and a throughput gain of 10% compared with [11]. In both cases, we implemented a layered scheduled decoder because the aim of this paper is to obtain high-throughput decoders for codes over large GFs. For other efficient decoders not focused on high throughput, we refer to [13].
The rest of this paper is organized as follows. Section II includes the basics of NB-LDPC codes and the T-MM algorithm implemented using compressed messages. Section III includes the proposed approximation to reduce the CN output messages for the T-MM algorithm and describes a novel way to obtain the most reliable intrinsic information without analyzing the entire trellis. Section IV includes the hardware implementation of the proposed CN architecture. The implementation of a layered scheduled decoder and the comparison with other proposals from the literature are presented in Section V. Finally, the conclusions are presented in Section VI.
II. T-MM DECODING ALGORITHM WITH COMPRESSED MESSAGES
A sparse parity-check matrix H defines an NB-LDPC code, where each nonzero element $h_{m,n}$ belongs to GF($q = 2^p$). Another common way to characterize NB-LDPC codes is by means of a Tanner graph [14], in which two kinds of nodes represent the N columns (VNs) and the M rows (CNs) of H. $\mathcal{N}(m)$ denotes the set of VNs connected to a CN m and $\mathcal{M}(n)$ denotes the set of CNs connected to a VN n; therefore, the cardinalities of these sets are $d_c$ and $d_v$, respectively.
Let us consider a message $m \in GF(q)^K$ that is coded to $c = m \times G$, where G is the generator matrix that satisfies $G \cdot H^T = 0$, where 0 is the zero matrix of size $K \times M$. Using BPSK signaling, the codeword c is transmitted over a binary-input additive Gaussian noise channel. The received sequence is $y = c + e$, where e is the error vector introduced by the noisy communication channel.
NB-LDPC codes are decoded by applying iterative algorithms in which messages that represent reliability values are passed from VN to CN and vice versa. Basically, two types of scheduling are used: 1) flooding, where all the CNs are processed first and then all the VNs are updated based on the CN output messages and the channel information and 2) layered, where one CN is processed and then all the connected VNs are updated, and the process is repeated until all the CNs are processed. In this paper, we consider the layered schedule, since it offers a better tradeoff between complexity and decoding speed and converges faster than the flooding schedule [15]. Algorithm 1 includes the basic steps involved in the layered schedule of NB-LDPC decoding.
The initialization step requires extracting the a priori information from the communication channel to compute the log-likelihood ratio (LLR). This is obtained by means of $L_n(a) = \log[P(c_n = z_n | y_n)/P(c_n = a | y_n)]$. In addition, a normalization is made to ensure that all the LLR values are nonnegative, $L_n(a) = |L_n(a) - L_n(z_n)|$, where $z_n$ is the hard-decision symbol associated with the highest reliability. The LLRs are loaded in the VN, which is represented by the set $Q_n(a)$. This set corresponds to the a posteriori information that is updated as the decoding algorithm progresses, as shown in Step 3 of Algorithm 1.
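The initialization described above can be sketched in a few lines of Python. This is only an illustration: the function name `normalized_llrs` and the example probabilities are hypothetical, and symbols are assumed to be indexed by the integers 0 to q − 1.

```python
import math

def normalized_llrs(probs):
    """probs[a] = P(c_n = a | y_n) for each symbol a of GF(q), indexed 0..q-1."""
    # Hard decision z_n: the symbol with the highest a posteriori probability.
    z = max(range(len(probs)), key=lambda a: probs[a])
    # L_n(a) = log(P(z_n) / P(a)); zero for a = z_n and nonnegative elsewhere.
    llr = [math.log(probs[z] / p) for p in probs]
    return z, llr

# Hypothetical channel output for one GF(4) symbol.
z_n, L_n = normalized_llrs([0.1, 0.6, 0.2, 0.1])
```

With the LLR defined relative to the hard-decision symbol, the most reliable symbol always has the value 0 and larger values mean lower reliability, which is the convention used by the min-max operations in the CN.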
Messages from VN to CN are denoted as $Q_{m,n}(a)$ and are calculated using the VN information $Q_n(a)$ and the CN-to-VN messages $R_{m,n}(a)$ (Step 1 of Algorithm 1). The CN output messages $R_{m,n}(a)$ are calculated using the T-MM algorithm [9], which was proposed as a new implementation of min-max from [6] that allows the parallel processing of messages in the CN and reduces the complexity. Applying a message compression technique [16], both the basic steps of T-MM in the CN and the number of exchanged messages are further reduced without introducing any performance loss compared with the original T-MM algorithm.
In the compressed version of the T-MM algorithm, instead of sending $q \times d_c$ messages $R_{m,n}(a)$ to the VN processor, the information in the CN is organized in four elementary sets called $I(a)$, $E(a)$, $P(a)$, and $z^*_n$. $I(a)$ is the set related to the intrinsic information sent to the VN processor. This set is calculated by applying (1) to the most reliable CN input messages in delta domain, $m1(\eta(a))$ [7]
I(a) = min
where $conf^*(n_r, n_c)$ [9] is the configuration set that selects the possible paths formed by the $n_r$ symbols with the highest reliability. From all the possible paths, the configuration set only selects the ones that deviate at most $n_c$ times from the hard-decision path.¹ From this reduced set of possible paths, the one selected for the corresponding $I(a)$ value is the one that ensures the highest reliability (minimum value). In this paper, we consider the case where $n_r = 1$ and $n_c = 2$. Therefore, only the most reliable messages are considered [first minimum set $m1(a)$] and only one- and two-deviation paths are taken into account.
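For the case $n_r = 1$ and $n_c = 2$, the selection of $I(a)$ can be illustrated with a brute-force Python sketch. It assumes that GF($2^p$) symbols are stored as integers in polynomial representation, so that field addition is a bitwise XOR, and that a two-deviation path must deviate in two different columns; the function name and the example values are hypothetical.

```python
from itertools import combinations

def intrinsic_set(m1, m1col):
    """m1[a], m1col[a]: first minimum and its column for every nonzero symbol a."""
    # One-deviation paths: the path for symbol a is simply m1(a).
    I = dict(m1)
    # Two-deviation paths: symbols b and c taken from different columns of the
    # trellis produce symbol b + c with reliability max(m1(b), m1(c)).
    for b, c in combinations(m1, 2):
        if m1col[b] != m1col[c]:
            a = b ^ c  # GF(2^p) addition; b != c guarantees a != 0
            I[a] = min(I[a], max(m1[b], m1[c]))
    return I

# Hypothetical GF(8), d_c = 4 example (symbols 1..7, columns 0..3).
m1 = {1: 5, 2: 1, 3: 7, 4: 2, 5: 9, 6: 3, 7: 8}
m1col = {1: 0, 2: 1, 3: 2, 4: 1, 5: 0, 6: 3, 7: 2}
I = intrinsic_set(m1, m1col)
```

The sketch visits every pair of symbols, which is what a complete trellis analysis has to do in hardware; it is useful mainly as a reference model against which simplified selection rules can be checked.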
The set $E(a)$ is related to the extrinsic information. It is composed of $m1(a)$ or $m2(a)$ (second minimum set) messages, depending on the number of deviations of the path used to form $I(a)$, according to (2).
¹ The hard-decision path is the one formed only by messages corresponding to the symbol $\alpha^{-\infty}$, which in delta domain corresponds to the reliability of $z_n$.
The set $P(a)$, with $n_c \times (q - 1)$ values, is used to keep track of the columns where deviations take place when the values of the set $I(a)$ are generated. This information is used in two situations: 1) to select the proper values for the set $E(a)$ depending on the deviation information when the set $I(a)$ is computed and 2) at the VN to generate the $q \times d_c$ reliabilities, as described below. Finally, the hard-decision symbols, defined as $z_n = \arg\min_{a \in GF(q)} Q_{m,n}(a)$, and the syndrome $\beta = \sum_{n=1}^{d_c} z_n$ are used to generate the hard-decision symbols $z^*_n = z_n + \beta$ required for the delta-to-normal domain transformation.
At the VN processor, decompression of messages is made to generate the R m,n (a) values used to obtain the a posteriori information Q n (a) [9] . The decompression operations are made following:
III. T-MM ALGORITHM WITH REDUCED SET OF MESSAGES
In this section, we introduce a novel method to reduce the number of messages exchanged between the CN and the VN compared with the proposal from [11] . First, we define the reduced set of compressed messages that are sent from CN to VN and an approximation to obtain the rest of values in the VN. Second, the performance of the method is analyzed. Third, a technique to generate the most reliable values of the set I (a) without building a complete trellis structure is presented.
A. Reduction of the CN-to-VN Messages
The sets $I(a)$ and $P(a)$ are required to generate the messages $R_{m,n}(a)$ at the VN processor, as shown in (3). Reducing the cardinality of $I(a)$ also reduces that of $P(a)$.
Our proposal is to keep the L most reliable values of $I(a)$ and the corresponding ones of $P(a)$ and $E(a)$, where $L < (q - 1)$. We consider the set $I^*(a') = \{I^*(a'_1), I^*(a'_2), \ldots, I^*(a'_L)\}$, which contains the L most reliable values from the set $I(a)$, where $a' = \{a'_1, a'_2, \ldots, a'_L\}$ are their corresponding GF symbols. In addition, consider the sets $E^*(a') = E(a)\ \forall a \in a'$ and $P^*(a') = P(a)\ \forall a \in a'$.
Defining the complementary set $a'' \in a \setminus a'$, we propose to set $E^*(a'') = m1(a'')$. Therefore, the cardinality of the set $E^*(a)$ is kept at $q - 1$. Table I includes the number of bits of each one of the sets exchanged from the CN to the VN processor compared with the proposal from [11], where w is the number of bits used to quantize the reliabilities.
As an example, consider the (837, 726) NB-LDPC code over GF(32) and the (1536, 1344) NB-LDPC code over GF(64), constructed using the methods from [17]. For the first code, the number of bits at the CN output is 817 using w = 6 bits with the method from [11], while for our proposal, the number of bits is only 385, a reduction of 53%. For the second code, the method from [11] outputs 1530 bits, while our proposal only exchanges 586 bits, which corresponds to a reduction of 62% in the number of bits. The L value was set to 4 in these examples.
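The bit counts quoted above can be reproduced with a short Python sketch. The per-set breakdown used here is an assumption inferred from the description of the sets, not a copy of Table I: w bits per reliability in the intrinsic and extrinsic sets, $\lceil \log_2 d_c \rceil$ bits per path coordinate with $n_c = 2$ coordinates per symbol, and p bits per hard-decision symbol. The check-node degrees $d_c = 27$ and $d_c = 24$ are likewise assumed values, chosen because they yield the stated totals.

```python
from math import ceil, log2

def cn_output_bits(q, dc, w, L=None, nc=2):
    """Bits sent from CN to VN; L=None models the compressed messages of [11]."""
    p = int(log2(q))              # bits per GF(q) symbol
    col_bits = ceil(log2(dc))     # bits to address one trellis column
    z_bits = dc * p               # hard-decision symbols z*_n
    if L is None:
        # I(a) and E(a): (q-1) reliabilities each; P(a): nc*(q-1) coordinates.
        return 2 * (q - 1) * w + nc * (q - 1) * col_bits + z_bits
    # I*(a'): L reliabilities; P*(a'): nc*L coordinates; E*(a): q-1 reliabilities.
    return L * w + nc * L * col_bits + (q - 1) * w + z_bits

gf32 = (cn_output_bits(32, 27, 6), cn_output_bits(32, 27, 6, L=4))
gf64 = (cn_output_bits(64, 24, 6), cn_output_bits(64, 24, 6, L=4))
reductions = [round(100 * (1 - new / old)) for old, new in (gf32, gf64)]
```

Under these assumptions the extrinsic set and the hard-decision symbols dominate the remaining traffic, since only the intrinsic set and the path coordinates scale with L.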
Since the cardinality of the sets $I^*(a')$ and $P^*(a')$ has been reduced compared with $I(a)$ and $P(a)$, respectively, it is no longer possible to generate the messages $R_{m,n}(a)$ using (3) at the VN.
For the symbols $a'$, it is possible to construct
For the complementary set of symbols $a''$, it is necessary to propose a function that approximates the reliabilities of the messages $R_{m,n}(a'')$ due to the cardinality reduction of the sets $I(a)$ and $P(a)$. Therefore, we introduce a novel way to obtain the messages $R_{m,n}$ corresponding to the symbols $a''$. This uses an approximation function based on an offset and a scaled version of the set $E^*(a'')$ as expressed in
Even though the scaling factors $\gamma_1$ and $\gamma_2$ are constant values, the offset $I^*(a'_L)$ and the set $m1(a'')$ depend on the specific CN input messages at each iteration. This fact introduces a self-adjusted term for the approximated values of $R_{m,n}(a'')$.
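A possible reading of this approximation is sketched below in Python. The exact expression is not reproduced here, so the form used in the code, $\gamma_1 \cdot I^*(a'_L) + \gamma_2 \cdot m1(a'')$, is an assumption based only on the description of an offset plus a scaled version of $m1(a'')$; the function name and the example values are hypothetical.

```python
def approx_reliabilities(I_star, m1, gamma1=0.5, gamma2=0.5):
    """I_star: {symbol: value} for the L retained symbols a'.
    m1: {symbol: first minimum} for every nonzero symbol.
    Returns the approximated reliabilities for the discarded symbols a''."""
    # Offset I*(a'_L): the least reliable (largest) of the L retained values.
    offset = max(I_star.values())
    # ASSUMED form of the approximation: scaled offset plus scaled m1(a'').
    return {a: gamma1 * offset + gamma2 * m1[a] for a in m1 if a not in I_star}

# Hypothetical GF(8) example with L = 3 retained symbols.
m1 = {1: 5, 2: 1, 3: 7, 4: 2, 5: 9, 6: 3, 7: 8}
R_approx = approx_reliabilities({2: 1, 4: 2, 6: 3}, m1)
```

Because both the offset and $m1(a'')$ come from the current CN inputs, the approximated values track the input messages from one iteration to the next even though the scaling factors stay fixed.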
B. Performance Analysis
To show the behavior of the set $R^*_{m,n}(a)$ compared with $R_{m,n}(a)$ in an implementation of the T-MM algorithm [9], we computed histograms for the sets $R^*_{m,n}(a)$ and $R_{m,n}(a)$. We tested several NB-LDPC codes over different GFs and degree distributions, for various Eb/No values and taking $10^6$ repetitions for each configuration. We achieved similar results in all cases.
In Fig. 1, we present the results for the (837, 726) NB-LDPC code over GF(32) [17]. Eb/No was set to 4.3 dB, $\gamma_1 = \gamma_2 = 0.5$ in (5), and L = 4 for this example. In Fig. 1, the x-axis includes the arranged reliabilities for
In order to test our proposal and reduce the number of messages exchanged between the CN and the VN, we performed BER simulations to compare it with the conventional T-MM algorithm. The code under test was the (837, 726) NB-LDPC code over GF(32) [17]. It can be seen in Fig. 2 that an increment in the parameter L (more messages exchanged from CN to VN) translates into a BER performance closer to that of the conventional implementation of the T-MM algorithm. We observe an improvement of almost 0.2 dB in the coding gain when increasing L from L = 2 to L = 4, 0.05 dB from L = 4 to L = 6, and an almost negligible improvement when passing from L = 6 to L = 8. We also include in Fig. 2 the BER performance for L = 4 and 8 decoding iterations for the quantized model (6 bits) to ease the comparisons with other proposals in Section V.
The same analysis was made for the (1536, 1344) NB-LDPC code over GF(64) varying the L parameter. The BER performance is presented in Fig. 3 . It can be seen that the performance losses are greater than the ones from Fig. 2 for small L values, comparing both to the conventional T-MM algorithms [9] . This is due to the percentage of reliabilities approximated using (5), which is 87.5% for the GF(32) code and 93.75% for the GF(64) code, considering L = 4 for both the cases. It is shown in Section IV that the performance loss of 0.1 dB for L = 4 introduced with our approach is compensated with an important reduction in the complexity of the CN.
C. Generation of the Set I*(a')
In Section III-A, a method to reduce the number of messages sent from CN to VN was presented. It was shown that, by modifying the parameter L, the performance loss compared with the T-MM algorithm [9] can be tuned. In addition, a method to approximate the discarded messages of the set $I(a)$ was introduced using (5). The maximum performance loss is set to 0.1 dB, so we fix L = 4 in the rest of this paper. In this way, the performance loss is 0.08 dB for the (837, 726) NB-LDPC code over GF(32) and 0.1 dB for the (1536, 1344) NB-LDPC code over GF(64).
From the analysis made in Section III-A, it is easy to see that even when the number of messages exchanged from CN to VN is considerably reduced, the CN has to calculate the entire set $I(a)$ using (1) and the set $P(a)$ before selecting the L most reliable values from them. In this paper, we propose a method to obtain the L most reliable values without using (1) or introducing any approximation. Our method takes advantage of the min-max operator involved in (1). The min-max operator is used to obtain the reliability value among the reliabilities selected by the configuration set, for each symbol a. Examining how the min-max operator behaves to obtain the L most reliable symbols, it is possible to extract some rules that avoid the implementation of a complete trellis structure. In Fig. 4, an example for the set $Q_{m,n}(a)$ (GF(8), $d_c = 4$) is presented, where the most reliable messages per row are marked with a dashed square. The rightmost column includes the set $I(a)$ formed by the combination of the $m1(a)$ values following (1). This example will be used to explain the method to obtain the set $I^*(a')$, composed of the L = 4 most reliable values of the set $I(a)$.
First, consider the absolute minimum, $m1_1$, among all the $m1(a)$ reliabilities; in the example from Fig. 4, $m1_1 = 1$. $m1_1$ will appear in the set $I(a)$ only in one-deviation paths, because in the two-deviation cases, $m1_1$ will be discarded by the max operator when all the possible paths for each symbol a are analyzed. On the other hand, there is only one one-deviation path for each symbol a, so, in the example of Fig. 4, for the symbol $\alpha^3$, the one-deviation path corresponds to $m1_1$. In fact, this path is the most reliable among all the possible ones for $\alpha^3$. Then, instead of analyzing all the possible paths to obtain the most reliable value of the set $I(a)$, $I^*(a'_1)$, we only have to assign the value $I^*(a'_1) = m1_1$ and retain the value of the corresponding symbol $a'_1 = a_{m1_1} = \alpha^3$.
A similar analysis can be done to find the second most reliable value of the set $I(a)$. This value can be obtained by assigning the second minimum of $m1(a)$ ($m1_2 = 2$ in the example from Fig. 4), so $I^*(a'_2) = m1_2$ and $a'_2 = a_{m1_2} = \alpha^0$. Note that there can be a two-deviation path that gives the same reliability value as $m1_2$, namely the combination of $m1_1$ and $m1_2$ if they belong to different columns ($m1^{col}_1 \neq m1^{col}_2$). In this case, the reliability of this two-deviation path corresponds to $m1_2$ due to the max operation involved in (1).
The selection of the third most reliable value of $I(a)$, $I^*(a'_3)$, requires a comparison between multiple candidates, which include the one-deviation path formed with $m1_3$ and the two-deviation path made with the combination of $m1_1$ and $m1_2$. The two-deviation path will be selected for $I^*(a'_3)$ ($I^*(a'_3) = m1_2$ and $a'_3 = a_{m1_1} + a_{m1_2}$) unless $m1_1$ and $m1_2$ belong to the same column of $Q_{m,n}(a)$. In that case, the reliability selected is $m1_3$ ($I^*(a'_3) = m1_3$ and $a'_3 = a_{m1_3}$). In the example from Fig. 4, since $m1_1 = 1$ and $m1_2 = 2$ belong to the same column of the trellis (n = 1), $m1_2$ cannot be used for $I^*(a'_3)$. Instead, the selected reliability is $I^*(a'_3) = m1_3 = 3$ and $a'_3 = a_{m1_3} = \alpha^4$.
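The rules for the first three values can be sketched in Python as follows. The sketch covers only $I^*(a'_1)$ to $I^*(a'_3)$ as described above, assumes that symbols are integers in polynomial representation (so GF addition is XOR), and breaks ties between equal reliabilities by symbol index, which is an arbitrary choice; all names and values are hypothetical.

```python
def top3_shortcut(m1, m1col):
    """Return [(value, symbol)] for I*(a'_1), I*(a'_2), I*(a'_3)."""
    # Sort the symbols by their first-minimum reliability m1(a).
    order = sorted(m1, key=lambda a: (m1[a], a))
    s1, s2, s3 = order[:3]
    # First and second values: one-deviation paths with m1_1 and m1_2.
    out = [(m1[s1], s1), (m1[s2], s2)]
    if m1col[s1] != m1col[s2]:
        # Two-deviation path m1_1 + m1_2; its reliability is max(m1_1, m1_2) = m1_2.
        out.append((m1[s2], s1 ^ s2))
    else:
        # Same column: the two-deviation path is not valid, so fall back to m1_3.
        out.append((m1[s3], s3))
    return out

# Hypothetical GF(8) example; in col_same the two smallest minima share a column.
m1 = {1: 5, 2: 1, 3: 7, 4: 2, 5: 9, 6: 3, 7: 8}
col_same = {1: 0, 2: 1, 3: 2, 4: 1, 5: 0, 6: 3, 7: 2}
col_diff = {1: 0, 2: 1, 3: 2, 4: 2, 5: 0, 6: 3, 7: 2}
same_case = top3_shortcut(m1, col_same)
diff_case = top3_shortcut(m1, col_diff)
```

Only a sort of the $m1(a)$ values and one column comparison are needed, instead of evaluating every two-deviation path of the trellis.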
For $I^*(a'_4)$, we consider the candidates listed in Table II with the priority given in its leftmost column. The conditions to select a reliability are listed in the rightmost column of Table II. Basically, the conditions ensure that a value will not be selected if another one with higher reliability has been used for a symbol $a'_i$, $\forall i \in \{1, 2, \ldots, L\}$, and that, for the two-deviation cases, no more than one deviation is made on each stage of the trellis [7], [9].
Following the example in Fig. 4 and the priority and conditions listed in Table II for the possible candidates of $I^*(a'_4)$, the highest priority candidate (OD, $m1_3$) must be discarded, since it was used for the $I^*(a'_3)$ reliability. Next, we select the one with the second priority, since it meets the conditions from Table II. Thus, $I^*(a'_4) = m1_3$ and the corresponding symbol is $a'_4 = a_{m1_1} + a_{m1_3} = \alpha^6$.
The conditions derived to obtain the L most reliable values of the set I (a) can be mapped directly in a hardware structure, avoiding a complete analysis of the trellis. The CN architecture is presented in Section IV.
The proposed CN decoding algorithm is summarized in Algorithm 2.
Step 1 corresponds to the delta-domain transformation [18] of the CN input messages, $Q_{m,n}(a)$, using the tentative hard-decision symbols $z_n$. The syndrome β is calculated by adding, in the GF domain, all the $z_n$ symbols (Step 2).
Step 3 finds the two minima among the $d_c$ input messages in delta domain for each symbol a. The position of the first minimum, $m1^{col}(a)$, is also retained. An L-min finder for the set $m1(a)$ is included in Step 4. Function ψ selects the L values for the set $I^*$, as detailed in this section.
Step 6 includes the conditions to select the values of the set E(a), as explained in Section II.
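Steps 1 to 3, together with the generation of the new hard-decision symbols, can be sketched in Python as follows. The sketch assumes that symbols are integers in polynomial representation, so that GF($2^p$) addition is XOR, and uses the common delta-domain convention in which index η = a + $z_n$ holds the reliability of symbol a; the function name and its data layout are hypothetical.

```python
from functools import reduce

def cn_front_end(Q, z):
    """Q[n][a]: VN-to-CN reliability of symbol a on edge n (n = 0..d_c-1).
    z[n]: tentative hard-decision symbol of edge n."""
    dc, q = len(Q), len(Q[0])
    # Step 1: delta domain, dQ[n][eta] = Q[n][eta + z_n]; eta = 0 is the hard decision.
    dQ = [[Q[n][eta ^ z[n]] for eta in range(q)] for n in range(dc)]
    # Step 2: syndrome, the GF sum (XOR) of all the hard-decision symbols.
    beta = reduce(lambda x, y: x ^ y, z)
    # Step 3: first and second minimum over the d_c edges for each nonzero
    # symbol, keeping the column of the first minimum.
    m1, m2, m1col = {}, {}, {}
    for eta in range(1, q):
        cols = sorted(range(dc), key=lambda n: dQ[n][eta])
        m1[eta], m2[eta], m1col[eta] = dQ[cols[0]][eta], dQ[cols[1]][eta], cols[0]
    # New hard-decision symbols z*_n = z_n + beta.
    z_star = [zn ^ beta for zn in z]
    return beta, z_star, m1, m2, m1col

# Hypothetical GF(4), d_c = 3 input.
Q = [[0, 3, 5, 2], [4, 0, 1, 6], [2, 7, 0, 3]]
beta, z_star, m1, m2, m1col = cn_front_end(Q, [0, 1, 2])
```

The two-minimum search is written with a sort for brevity; a hardware implementation would use a comparator tree instead.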
IV. CHECK-NODE ARCHITECTURE
In this section, we present the architecture for the CN processor based on the proposed method. It includes a network to calculate, in an efficient way, the L = 4 most reliable messages of the set $I(a)$, using the conditions explained in Section III-C.
The top-level block diagram for the proposed CN is detailed in Fig. 5. The CN inputs are the messages $Q_{m,n}$, which come from the VN processor, and the tentative hard-decision symbols z. Both inputs are used to compute the normal-to-delta-domain transformation (N → Δ block in Fig. 5). $d_c$ transformation networks are needed in the CN, and each one requires $q \times \log(q)$ w-bit MUXes following the approach proposed in [19], where w is the number of bits of the datapath.
z is also used to obtain the syndrome β by adding all $d_c$ tentative hard-decision symbols. This operation requires $w \times (d_c - 1)$ XOR gates. β is used to generate the new hard-decision symbols $z^*$, which are sent to the VN to generate the $R^*_{m,n}$ messages using (4). The $z^*$ symbols are generated using GF(q) adders, which require $d_c \times w$ XOR gates.
Two-minimum finders obtain the two most reliable messages for each GF(q) symbol over the delta-domain values ($\Delta Q_{m,n}$). The search for $\alpha^{-\infty}$ is excluded, since it corresponds to the hard-decision symbols, which have the highest reliability (zero value). Therefore, in the CN processor, there are $q - 1$ two-minimum finders, in which the position of the first minimum is also extracted to obtain the set $I^*(a')$, as explained in Section III-C. The implementation is done by means of tree-based two-minimum finders, following the approach from [20]. Each finder has $d_c$ inputs and is implemented with $2 \times d_c$ w-bit comparators and $3 \times d_c$ w-bit MUXes.
An L-min finder is used to obtain the L most reliable values of the set $m1(a)$, $m1(a')$ ($m1^*$ in Fig. 5), output by the two-minimum finders. We propose to use a parallel sorting approach for the implementation with the aim of improving speed at the CN processor. The proposed architecture is presented in Fig. 6, where an example for four inputs is included. It is based on a two-stage circuit: 1) we compute comparisons between all the combinations of input pairs $(X_i, X_j)\ \forall i \neq j$ [Fig. 6(a)] and 2) we add the outputs of the comparators for each one of the inputs. The main idea is to count the number of times that an input $X_i$ is lower than the other $N - 1$ inputs, where $V_{X_i}$ is that count and N is the number of inputs of the network. The greater the $V_{X_i}$ value, the lower $X_i$ is. Therefore, the second stage [Fig. 6(b)] is responsible for finding the value $V_{X_i}$ corresponding to the minimum that we are looking for. For example, the $m1_1$ value corresponds to the one with $V_{X_i} = N - 1$, since it is lower than the rest of the inputs. Likewise, $m1_2$ corresponds to $V_{X_i} = N - 2$, and so on for the rest of the $m1_j$ values, which correspond to $V_{X_i} = N - j$.
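The counting principle of this two-stage network can be modeled in Python as shown below. This is a behavioral sketch, not the circuit of Fig. 6: the tie-breaking rule by input index is an assumption added so that every count is unique, and the function name is hypothetical.

```python
def l_min_finder(x, L):
    """Return the indices of the L smallest inputs, most reliable first."""
    n = len(x)

    def lower(i, j):
        # Stage 1: one comparison per ordered pair; ties are broken by index
        # (an assumption) so that no two inputs obtain the same count.
        return x[i] < x[j] or (x[i] == x[j] and i < j)

    # Stage 2: V[i] counts how many of the other N - 1 inputs X_i is lower than.
    V = [sum(lower(i, j) for j in range(n) if j != i) for i in range(n)]
    # The j-th minimum is the input whose count equals N - j.
    return [V.index(n - j) for j in range(1, L + 1)]

# Hypothetical m1(a) values for the seven nonzero symbols of GF(8).
positions = l_min_finder([5, 1, 7, 2, 9, 3, 8], L=4)
```

All the comparisons are independent of each other, which is what allows the hardware to evaluate them in parallel at the cost of a number of comparators that grows quadratically with the number of inputs.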
The proposed CN architecture requires a structure like the one in Fig. 6(a) operating with $q - 1$ inputs. Since we particularize the CN for the case where L = 4, we require four selection networks like the one in Fig. 6(b), one for each $m1_j$ value.
The implementation of the structure from Fig. 6(a) requires $(q - 1) \times (q - 2)/2$ w-bit comparators, one per pair of inputs. The number of adders is summarized in Table III for different field orders.
Four structures as the one in Fig. 6 AND gates, assuming that the symbols a and columns m1 col (a ) from the L most reliable m1(a) values must be retained to be used in the calculation of the I * (a ), as can be seen in the block diagram in Fig. 5 . Finally, 4 × q × (w + p + log d c ) OR gates complete the logic elements required for the implementation of the circuit.
The solution from Fig. 6 to the L-min finder offers a high-speed structure that does not compromise the latency of the overall CN processor.
The set $I^*(a')$ is generated using the circuit presented in Fig. 7, which is a direct implementation of the method explained in Section III-C. It uses the outputs of the L-min finder as inputs to obtain the sets $I^*(a')$, $I^*(a')_{path}$, and $I^*(a')_{sym}$.
As shown in Fig. 7 , the generation of the set I * (a ) requires few hardware resources that can be easily summarized GF(64) . The increase in the field order does not increment significantly the number of required gates compared with the structure that generates the extra column Q(a) in the proposal from [9] , which is unsuitable for the fields higher than GF(32).
The reliabilities of the set $E^*(a)$ are generated using the circuit from Fig. 8. The portion of the circuit enclosed by dashed lines is repeated for each GF symbol. The generation of the set $E^*(a)$ requires $(q - 1) \times (23 \times w + 6 \times p) + 3 \times \log d_c$ equivalent NAND gates.
To compare our proposed CN architecture with a conventional implementation of the T-MM algorithm [11], we synthesized the design using Cadence RTL Compiler for the (837, 726) NB-LDPC code over GF(32) and the (1536, 1344) NB-LDPC code over GF(64). It can be seen in Table IV that the area saving is almost doubled for the GF(64) NB-LDPC code compared with the GF(32) case. This is due to the reduction in complexity of the $I^*(a)$ generation, which is the bottleneck in the CN implementation from [11].
V. TOP-LEVEL DECODER ARCHITECTURE AND COMPLEXITY COMPARISON
In this section, we include the proposed CN architecture in a layered decoder with a structure similar to the one in [16].
The decompression network generates the set $R^*_{m,n}(a)$ and implements (4) and (5) using the structures presented in Fig. 9. The circuit from Fig. 9(a) generates a $(q - 1)$-length set $I^*(a)$ from the reduced set $I^*(a')$. Once the set $I^*(a)$ is obtained, the circuit from Fig. 9(b) is used to generate the set $R^*_{m,n}(a)$ by performing the delta-to-normal-domain transformation from the sets $I^*(a)$ and $E^*(a)$ and the new hard-decision symbols $z^*_n$. The decoder requires $2 \times (q - 1)$ circuits like the one in Fig. 9(a), each of which uses $27 \times \log d_c + 14 \times w + 6 \times p$ equivalent NAND gates. In addition, it requires $2 \times d_c$ circuits like the one presented in Fig. 9(b), each of which uses $q \times ((p + 1) + 2 \times \log d_c + 1)$ equivalent NAND gates.
One of the main benefits of reducing the number of messages exchanged from CN to VN is that the number of registers required to store the CN output messages from one iteration to the next one is greatly reduced compared with the conventional implementations of the T-MM algorithm [9] , which store M × q× d c × w information bits. Our proposal
A. Decoder Implementation Results and Comparisons
The complete decoder architecture based on the CN architecture explained in Section IV was implemented on a 90-nm CMOS process with nine metal layers and operating conditions 1.2 V and 25°C. VHDL was used for the description of the hardware, and Cadence tools were used for the synthesis and implementation of the proposed approach. To show the efficiency of our proposal for high-rate NB-LDPC codes over high-order fields, we present results for the (1536, 1344) NB-LDPC code over GF(64). In order to simplify comparisons with other proposals from the literature, we include the results for the (837, 726) NB-LDPC code over GF(32). Both the QC codes have been constructed using the methods from [17] . The throughput is obtained as
where qQC is the size of the circulant submatrices that conform H and seg corresponds to the pipeline stages used in the design. For both the codes, we choose seg = 16 to achieve a balance between throughput and area.
To the best of our knowledge, we present the first postlayout results for a high-rate NB-LDPC code over GF(64). The best high-throughput decoder implementation for GF(64) is presented in [21]. It includes a chip implementation of a full-parallel decoder based on the (160, 80) NB-LDPC code over GF(64) with degree distribution ($d_c = 4$, $d_v = 2$) using a 65-nm CMOS process. The reported gate count is 2.78 M, reaching a throughput of 1221 Mb/s (881 Mb/s when scaled to 90 nm). A direct comparison is not possible, because this is not a high-rate code (its rate is only 0.5, while our code has a rate of 0.875); furthermore, it is about ten times shorter than the one we use (960 bits per codeword compared with 9216 bits in our code). In order to compare our decoder with the previous proposals implementing the same code, we synthesized the designs from [9]-[12] for the GF(64) code. We could not obtain postlayout results for them due to the high gate count of the designs. The results are summarized in Table V, where we also show the implementation results of our decoder for L = 4 and L = 5. The implementation for L = 5 was obtained by extrapolating the architecture for L = 4. Comparing the implementations for L = 4 and L = 5, the increment in area and the reduction in throughput are both ∼1%. On the other hand, there is a coding gain of 0.02 dB with this increment in L. Comparing our decoder for L = 4 with the other proposals in Table V, it can be seen that the highest reduction in the gate count is ∼61% compared with the work from [9], and the lowest is 12% compared with the proposal from [11]. In order to make fair comparisons in terms of throughput, it is important to remark that the clock frequency ($f_{clk}$) usually decreases after placing and routing the design.
For example, our proposal achieves $f_{clk}$ = 351 MHz after synthesis and this value is lowered to 271 MHz after the place and route stages, which corresponds to a reduction of 23%. Thus, the postsynthesis throughput of the other works is reduced by the same percentage, as shown in Table V. Considering these values, our work would outperform them by between 16.6% and 30.6% thanks to the reduction of complexity in the CN processor and the minimization of the messages exchanged between CN and VN, which mitigates the routing congestion.
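The two normalizations used in this comparison can be reproduced with a few lines of Python. The linear scaling with feature size is an assumption that happens to match the 1221 Mb/s to 881 Mb/s conversion quoted for [21]; the function names are hypothetical.

```python
def scale_throughput_to_node(throughput_mbps, from_nm, to_nm):
    # ASSUMED rule: throughput scales linearly with the feature-size ratio.
    return throughput_mbps * from_nm / to_nm

def post_layout_factor(f_synth_mhz, f_layout_mhz):
    # Ratio applied to postsynthesis throughput to estimate postlayout values.
    return f_layout_mhz / f_synth_mhz

scaled = scale_throughput_to_node(1221.0, 65, 90)   # 65-nm result referred to 90 nm
factor = post_layout_factor(351.0, 271.0)           # clock drop after place and route
reduction_percent = round(100 * (1 - factor))
```

Applying the same factor to every postsynthesis figure keeps the comparison consistent, although the actual degradation of each design after place and route may differ.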
In terms of efficiency, measured as the ratio between throughput (megabits per second) and number of equivalent NAND gates (in millions), our approach outperforms the one from [9] by a factor of almost 2.4. Compared with the design from [11], our proposal is 35% more efficient.
Table VI compares the implementation results of the proposed decoder (L = 4) with the other state-of-the-art proposals for the (837, 726) NB-LDPC code over GF(32). The number of iterations in all the proposals listed in Table VI was adjusted to achieve similar performance at Eb/No = 4.4 dB. As can be seen, our proposal outperforms most of the other approaches in both area and throughput. In terms of gate count, despite the fact that [23] requires 21% fewer gates, our decoder achieves a throughput that is almost seven times higher due to the parallel processing used in the CN. Compared with the proposal from [12], our approach has similar throughput and requires almost 6% less area, thanks to the reduction of complexity in the CN with the hardware structures presented in Section IV. In terms of efficiency, our approach is five times more efficient than the proposals from [9] and [23] and almost nine times more efficient than the decoder from [22]. Compared with the design from [12], our proposal offers 8.6% higher efficiency.
VI. CONCLUSION
In this paper, we introduce an approximation for the T-MM algorithm to reduce the complexity of the CN architecture, which was the bottleneck in the previous solutions from the literature. This reduction allows us to offer postlayout results for high-rate NB-LDPC codes over GF(64) without prohibitive areas and with higher throughput than the existing proposals, at the expense of some performance loss.
