Abstract-This paper presents a novel algorithm based on trellis min-max for decoding non-binary low-density parity-check (NB-LDPC) codes. This decoder reduces the number of messages exchanged between check node and variable node processors, which decreases the storage resources and the wiring congestion and, thus, increases the throughput of the decoder. Our frame error rate performance simulations show that the proposed algorithm has a negligible performance loss for high-rate codes over GF(16) and GF(32), and a performance loss smaller than 0.07 dB for high-rate codes over GF(64). In addition, a layered decoder architecture is presented and implemented on a 90-nm CMOS process for the following high-rate NB-LDPC codes: (2304, 2048) over GF(16), (837, 726) over GF(32), and (1536, 1344) over GF(64). In all cases, the achieved throughput is higher than 1 Gb/s.
Index Terms-Check node (CN) processing, high rate, high speed, layered schedule, non-binary low-density parity-check (NB-LDPC), VLSI design.
I. INTRODUCTION
LOW-DENSITY parity-check (LDPC) codes have been adopted by numerous communication standards, such as DVB-S2 [1], IEEE 802.16e [2], and IEEE 802.11n [3], among others. Good error rate performance, low-complexity decoders, and high-rate decoding are some of the advantages of implementing LDPC codes over other error correction schemes.
Binary LDPC codes suffer from error correction degradation for short/medium codeword lengths. In addition, an effect called error floor appears at high signal-to-noise ratios. This effect limits the error correction performance, so some additional processing is required to avoid it. Non-binary LDPC (NB-LDPC) codes, defined over Galois fields (GFs) GF(q = 2^p) with p > 1, were first investigated in [4] as an extension of binary LDPC codes, where p = 1. These codes emerge as an alternative to their binary counterparts to overcome the weaknesses shown by binary LDPC codes. In addition, they improve the burst error correction capability, especially with high-order GFs, and offer the possibility to be used in conjunction with high-order modulation schemes (16, 64, 256 QAM), reducing the complexity in both the encoder and the decoder [5], [6]. Unfortunately, NB-LDPC codes have some drawbacks: 1) a high complexity of their check node (CN); 2) a large amount of area spent on storage elements (RAM memories and registers); and 3) routing congestion that limits the overall decoding throughput. Since their appearance, many efforts have been made to mitigate these problems.
The first algorithm proposed to decode NB-LDPC codes was the Q-ary sum-of-product algorithm (QSPA) [4], which was developed as a generalization of the SPA for binary LDPC codes. Further improvements, such as fast Fourier transform SPA [7], log-SPA, and max-log-SPA [8], were proposed to reduce the complexity of the CN processing equations without introducing any performance loss. More recently, a trellis-based implementation for QSPA (T-max-log-QSPA) [9] was proposed, offering a solution that increases the throughput with respect to previous solutions based on QSPA. Its main drawback is that the required area is prohibitive for real applications in communications and storage systems.
Extended min-sum (EMS) [10] and min-max [11] algorithms were presented as approximations of the QSPA [4] that considerably reduce the CN complexity, which only requires additions and/or comparisons. In addition, EMS and min-max algorithms utilize forward-backward metrics to derive the CN output messages. These metrics involve serial computations that limit the throughput of the derived hardware architectures [11], [12].
The trellis EMS (T-EMS) algorithm was proposed [13], [14] with the aim of enabling parallel processing of the messages in the CN. The input messages are organized in a trellis structure, while the output messages are generated in parallel by means of an extra column included in the trellis. The trellis min-max (T-MM) algorithm in [15] adapts the idea of T-EMS to the min-max algorithm. One minimum only T-MM (OMO T-MM) [16] is an approximation of T-MM that reduces the complexity of the CN by obtaining only one minimum and estimating the second one. All these algorithms [13]-[16] exchange q × d_c reliability values between CN and variable node (VN) processors. This number of exchanged messages is large enough to cause wiring congestion, and this limits the maximum throughput, especially for high-rate NB-LDPC codes and high-order GFs. In addition, in decoder architectures with a layered schedule, the CN output messages are stored to be used in the next iteration. So, the required memory, which is the main part of the area in NB-LDPC decoder architectures [15]-[17], is too high.
Other proposals from [9] and [18]-[22] exchange a smaller number of messages between CN and VN and vice versa. This reduces the wiring congestion and the required memory resources, but implies the use of some kind of algorithm to generate the nonexchanged messages. Moreover, these approaches introduce a nonnegligible performance loss that depends on the GF order and the size of the reduced set.
In this paper, we propose the modified T-MM (mT-MM) algorithm, which reduces the number of check-to-variable messages by taking advantage of the replicated information in the output messages from the CN in the T-MM algorithm. The original idea comes from [23], where we proposed a method to compress the messages between CN and VN for NB-LDPC message passing decoders. As the messages are not modified, this method does not introduce any performance loss. In [24], we particularize the proposal in [23] to the T-MM algorithm, and we detail a hardware architecture for the CN processor and for a decoder with a layered schedule. In this paper, we extend the work in [24], and present a modification of the T-MM algorithm that allows us to reduce the number of exchanged messages even further. The CN output messages are split into two arrays: one that compresses the extrinsic information and the other one that represents the intrinsic information. Based on a statistical analysis, we found that reducing the size of the intrinsic information from q to only two elements introduces a negligible performance loss for high-rate NB-LDPC codes over GF(16) and GF(32) and a performance loss smaller than 0.07 dB for high-rate NB-LDPC codes over GF(64), compared with the T-MM algorithm [15]. In addition, we present a high-throughput architecture for the entire decoder (with a layered schedule), which includes the mT-MM algorithm in the CN processor, and compare our implementation results for the 90-nm CMOS technology with the other state-of-the-art decoder architectures.
The rest of this paper is organized as follows. Section II includes the basis of NB-LDPC codes and the T-MM algorithm. The proposed mT-MM algorithm is presented in Section III. Section IV includes the hardware implementation of the mT-MM algorithm and its inclusion in a full decoder. A comparison with other proposals from the literature is also provided. Finally, the conclusions are presented in Section V.
II. TRELLIS MIN-MAX DECODING ALGORITHM
NB-LDPC codes are linear block codes defined by a sparse parity-check matrix H with M rows and N columns, where each nonzero element h_m,n belongs to GF(q = 2^p). A bipartite graph is commonly used to represent NB-LDPC codes graphically. In this graph, the nodes called VNs represent the N columns of H, and the nodes called CNs represent the M rows of H. For the sake of simplicity, in this paper, we consider regular NB-LDPC codes where the number of VN (CN) connected to a CN (VN) is constant and equal to d_c (d_v). Despite this, the approach presented in this paper is perfectly applicable to irregular NB-LDPC codes, provided that the appropriate control signals are included to avoid possible memory access conflicts. In the same way, N(m) (M(n)) denotes the set of VN (CN) connected to a CN m (VN n); therefore, the cardinality of the set corresponds to d_c (d_v). Q_mn(a) and R_mn(a) denote the exchanged messages from VN to CN and from CN to VN for each symbol a ∈ GF(q), respectively.
Algorithm 1 T-MM Algorithm [15]
Let c = c_1, c_2, ..., c_N be the transmitted codeword over a binary input AWGN channel and y = y_1, y_2, ..., y_N the received symbol sequence, with y = c + e, where e is the error vector introduced by the noisy communication channel. L_n(a) corresponds to the a priori information from the communication channel obtained by means of the log-likelihood ratio (LLR) as L_n(a) = log[P(c_n = z_n | y_n)/P(c_n = a | y_n)]. All the LLR values are nonnegative, and the hard-decision symbol z_n is the GF symbol associated with the highest reliability. Q_n(a) is the a posteriori information, which is updated as the message passing decoding algorithm progresses.
The CN operations solve the parity-check equations, based on the messages from the VN (Q_mn(a)), and update the reliability values for each GF symbol a. In this paper, we propose an algorithm for the CN to do these tasks (described in Section III), which is based on the T-MM algorithm [15]. The T-MM algorithm offers a good tradeoff between coding gain and decoding complexity compared with other proposals from the literature.
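As a minimal sketch of the LLR definition above, the following Python fragment derives the hard-decision symbol and the nonnegative LLR vector for one received symbol. The posterior probabilities and the function name `llr_vector` are hypothetical illustration values, not taken from the paper.

```python
import math

def llr_vector(probs):
    """Return (z, llr) for one symbol position.

    probs[a] is an assumed posterior P(c_n = a | y_n) for each GF symbol a.
    z is the hard-decision symbol (most probable one) and
    llr[a] = log(P(c_n = z | y_n) / P(c_n = a | y_n)), which is always >= 0.
    """
    z = max(range(len(probs)), key=lambda a: probs[a])
    llr = [math.log(probs[z] / p) for p in probs]
    return z, llr

# Hypothetical posteriors for the four symbols of GF(4).
probs = [0.1, 0.6, 0.2, 0.1]
z, llr = llr_vector(probs)
```

With this convention, a value of zero marks the hard-decision symbol and larger values mean lower reliability, which is why the later steps search for minimum values.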
The basic steps to implement the CN processor of the T-MM algorithm [15] are presented in Algorithm 1.
Step 1 involves normal-to-delta domain transformation using the input messages and the hard-decision symbols. This transformation ensures that the reliabilities corresponding to the hard-decision symbols z_n are related to the GF symbol α^(−∞), simplifying the rest of the steps in the T-MM algorithm.
Step 2 obtains the syndrome β by adding all hard-decision symbols.
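Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: GF(2^p) symbols are represented as integers in a polynomial basis so that GF addition is a bitwise XOR, the zero symbol α^(−∞) is the integer 0, and the delta-domain message is assumed to be obtained by re-indexing each reliability from a to a + z_n, as is usual in trellis-based decoders. The function names are hypothetical.

```python
from functools import reduce

def to_delta_domain(q_mn, z_n):
    """Assumed Step 1: delta[a XOR z_n] = q_mn[a].

    Because q_mn[z_n] == 0 for the hard-decision symbol, the zero symbol
    (index 0) of the delta-domain message always carries reliability 0.
    """
    delta = [0.0] * len(q_mn)
    for a, value in enumerate(q_mn):
        delta[a ^ z_n] = value
    return delta

def syndrome(hard_decisions):
    """Step 2: GF(2^p) sum (bitwise XOR) of all hard-decision symbols."""
    return reduce(lambda x, y: x ^ y, hard_decisions, 0)

# Hypothetical GF(4) input message whose hard decision is symbol 1.
q_mn = [7.0, 0.0, 3.0, 9.0]
delta = to_delta_domain(q_mn, 1)

# Hypothetical hard decisions of the d_c = 5 VNs connected to one CN.
beta = syndrome([1, 2, 0, 2, 0])
```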
Step 3 calculates the first and second most reliable messages (minimum values), m1(a) and m2(a), by means of the function ψ, which also extracts the position of m1(a), m1_col(a).
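The role of the function ψ in Step 3 can be illustrated with a simple software two-minimum finder applied to one trellis row (one GF symbol a across the d_c columns). The sequential scan below is only a behavioral sketch; the name `two_min_with_position` and the input row are hypothetical.

```python
def two_min_with_position(row):
    """Return (m1, m2, m1_col): the two smallest values of row and the
    column index of the smallest one."""
    m1, m2, m1_col = float("inf"), float("inf"), -1
    for col, value in enumerate(row):
        if value < m1:
            # The previous minimum becomes the second minimum.
            m2 = m1
            m1, m1_col = value, col
        elif value < m2:
            m2 = value
    return m1, m2, m1_col

# Hypothetical delta-domain reliabilities of one symbol for d_c = 5.
row = [12, 5, 17, 9, 20]
m1, m2, m1_col = two_min_with_position(row)
```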
Step 4 computes the extra column of the trellis, Q(a), which collects the reliability of the most reliable path for each GF symbol a.
conf*(n_r, n_c) [15] is the configuration set that selects the possible paths formed by the n_r symbols with the highest reliability values. From all the possible paths, the ones that deviate at most n_c times from the hard decision are selected. From this reduced set of possible paths, the one chosen for the corresponding Q(a) value is the one that ensures the highest reliability (minimum value). In this paper, we consider the case where n_r = 1 and n_c = 2. So, only the most reliable messages are considered [first minimum set m1(a)], and only paths with one or two deviations are considered.
Finally, the CN output messages are generated in two steps. First (Step 5 in Algorithm 1), each row of R_mn(a) is filled with the corresponding Q(a) reliability, except for the columns that correspond to the stage in the trellis where deviations from the hard-decision path are made. In cases where only one deviation is made, the empty column is filled with the reliability of the second most reliable symbol m2(a). In those cases where two deviations are made, empty columns are filled with the m1(a) reliability. Second (Step 6), conversion from delta-to-normal domain is required for the CN output messages, where β is used to correct the tentative hard-decision symbols. In addition, a scaling factor λ is used to improve the performance and convergence rate of the T-MM algorithm.
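For n_r = 1 and n_c = 2, the extra column computation of Step 4 can be sketched as below. The sketch assumes integer symbols with XOR as GF addition and treats the reliability of a two-deviation path as the maximum of its two m1 values, with both deviations required to come from different trellis columns. The function name and the column positions are hypothetical; the reliabilities 5, 10, and 17 mirror the GF(4) example discussed later in the paper.

```python
def extra_column(m1, m1_col):
    """Return (delta_q, path) for symbols 0..q-1.

    m1[a] and m1_col[a] are the first minimum and its column for each
    nonzero symbol a (index 0 is unused). path[a] holds the pair of
    symbols that defines the most reliable path; both entries are equal
    when the path has a single deviation.
    """
    q = len(m1)
    delta_q = [0.0] * q          # zero symbol: hard-decision path
    path = [(0, 0)] * q
    for a in range(1, q):
        best, best_path = m1[a], (a, a)      # one-deviation path
        for a1 in range(1, q):
            a2 = a ^ a1                      # a1 + a2 = a in GF(2^p)
            if a2 <= a1:
                continue                     # skip a2 == 0 and duplicates
            if m1_col[a1] == m1_col[a2]:
                continue                     # two deviations in one column
            cost = max(m1[a1], m1[a2])
            if cost < best:
                best, best_path = cost, (a1, a2)
        delta_q[a], path[a] = best, best_path
    return delta_q, path

# GF(4) with alpha^0 = 1, alpha^1 = 2, alpha^2 = 3 (assumed mapping).
m1 = [0, 5, 10, 17]
m1_col = [-1, 0, 3, 1]           # hypothetical column positions
delta_q, path = extra_column(m1, m1_col)
```

In this example the two-deviation path through symbols 1 and 2 gives max(5, 10) = 10, which beats the one-deviation reliability 17 for symbol 3.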
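The filling rule of Step 5 can be sketched as follows, in the delta domain only; the delta-to-normal conversion and the scaling factor λ of Step 6 are deliberately left out. All input values and the function name `cn_output_delta` are hypothetical, and the zero-symbol row is assumed to stay at 0.

```python
def cn_output_delta(delta_q, path, m1, m2, m1_col, d_c):
    """Build the q x d_c delta-domain output matrix r[a][j].

    Non-deviation columns receive delta_q[a]. Deviation columns receive
    m2[a] for a one-deviation path and m1[a] for a two-deviation path.
    """
    q = len(delta_q)
    r = [[0.0] * d_c for _ in range(q)]
    for a in range(1, q):
        eta1, eta2 = path[a]
        dev_cols = {m1_col[eta1], m1_col[eta2]}
        fill = m2[a] if eta1 == eta2 else m1[a]
        for j in range(d_c):
            r[a][j] = fill if j in dev_cols else delta_q[a]
    return r

# Hypothetical GF(4) check node with d_c = 5.
delta_q = [0.0, 5, 10, 10]
path = [(0, 0), (1, 1), (2, 2), (1, 2)]
m1 = [0, 5, 10, 17]
m2 = [0, 8, 12, 20]
m1_col = [-1, 0, 3, 1]
r = cn_output_delta(delta_q, path, m1, m2, m1_col, d_c=5)
```

Each row repeats the same value in all non-deviation columns, which is the replicated information that a compressed representation can avoid sending.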
III. MODIFIED TRELLIS MIN-MAX ALGORITHM
This section is organized as follows. In Section III-A, we reformulate the T-MM algorithm to introduce some variables that are required to explain how replicated information is reduced and that are used in the definition of the proposed mT-MM algorithm. Section III-B extends the explanation of the algorithm proposed in [24], taking as a reference the algorithm reformulated in Section III-A, and also includes an analogy with binary LDPC decoders. Finally, Section III-C defines the new algorithm (modified T-MM, mT-MM), which is based on a statistical analysis, and gives frame error rate (FER) performance results for high-rate NB-LDPC codes over GF(16), GF(32), and GF(64).
A. Reformulation of Trellis Min-Max Algorithm
In this section, we reformulate the T-MM algorithm (Algorithm 1) as a first step to define our proposal. As can be seen in Algorithm 2, Steps 4 and 5 are the ones reformulated. The function ψ in Step 4 determines which path in the trellis was used to obtain Q(a), that is, the most reliable path. Considering that a maximum of two deviations is evaluated, the function returns the two GF symbols that define this path, η*_1(a) and η*_2(a). If the path used to obtain Q(a) has only one deviation from the hard-decision path, the function ψ sets η*_2(a) equal to η*_1(a). On the other hand,
Algorithm 2 Reformulated T-MM Algorithm
Step 5 calculates R_mn(a), which is equal to Q(a), to the first minimum of Q_mn (m1(a)), or to its second minimum (m2(a)), depending on the deviation information (η*_1(a) and η*_2(a)). For a symbol a, if the most reliable path does not deviate at column j (m1_col(η*_1(a)) ≠ j and m1_col(η*_2(a)) ≠ j), the extra column information Q(a) is assigned to the output R_mn(a). On the other hand, two different updates can be performed at the columns where the most reliable path deviates from the hard-decision path: 1) if this path has only one deviation, the second minimum m2(a) is assigned to R_mn(a) and 2) if this path has two deviations, m1(a) is assigned to the output.
Fig. 1 includes an example of a trellis with GF(4) and d_c = 5. It shows the CN input messages Q_mn(a) before and after the delta domain transformation. The hard-decision symbols are z = {α^1, α^0, 0, α^0, 0}. After the normal-to-delta domain transformation, the reliabilities Q_mn(a) in the first row of the trellis are equal to 0. The minimum value per row (per GF symbol a) of Q_mn(a) is enclosed by a dotted box, so the most reliable path for a GF symbol a must include only these boxes. In Fig. 1, the most reliable path for the symbol α^2 is shown in red (α^2 = α^0 + α^1). This path is more reliable (reliability equal to the maximum between 5 and 10) than the path that makes only one deviation (reliability equal to 17), so Q(α^2) = 10. In a similar way, the most reliable paths for the symbols α^0 and α^1 are built, but in these cases, with only one deviation from the hard-decision path [Q(α^0) = 5 and Q(α^1) = 10]. For symbol α^2, the new variables defined in Algorithm 2 are η*_1(α^2) = α^0 and 
B. Reduction of Replicated Information in Check-to-Variable Exchanged Messages
For a better understanding of our proposal, an analogy with binary LDPC decoders is established. In [25], a decoder architecture for binary LDPC codes is proposed. In this architecture, the messages are compressed in a way similar to the one we propose here for the nonbinary case. Instead of sending an individual message to each neighbor VN, a CN sends the same message to all its connected VNs, which includes the first minimum, the second minimum, the position of the first minimum, and the sign (which depends on the syndrome value). In this way, the routing congestion is reduced.
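The binary analogy can be sketched with a generic min-sum check node. This is a textbook-style illustration and not the architecture of [25]; the function names are hypothetical, and the per-edge input signs are assumed to be available when the messages are expanded.

```python
def compress_check_node(msgs):
    """Compress d_c >= 2 binary min-sum inputs into
    (min1, min2, min1_pos, parity, signs)."""
    mags = [abs(m) for m in msgs]
    min1_pos = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min1_pos]
    min2 = min(m for i, m in enumerate(mags) if i != min1_pos)
    signs = [m < 0 for m in msgs]
    parity = sum(signs) % 2          # overall sign (syndrome) bit
    return min1, min2, min1_pos, parity, signs

def decompress(min1, min2, min1_pos, parity, signs, n):
    """Rebuild the check-to-variable message for edge n."""
    mag = min2 if n == min1_pos else min1
    negative = parity ^ signs[n]     # remove the edge's own sign
    return -mag if negative else mag

# Hypothetical variable-to-check messages for a degree-4 check node.
msgs = [-3.0, 1.5, 4.0, -0.5]
packed = compress_check_node(msgs)
rebuilt = [decompress(*packed, n) for n in range(len(msgs))]
```

The expanded messages match what a direct min-sum computation over the other edges would give, while only two magnitudes, one position, and the sign information need to be transferred.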
In the nonbinary case, the T-MM algorithm behaves in a similar way.
Step 5 in Algorithm 2 generates the CN output messages in delta domain. For each GF symbol a
In (1), m1_col(η*_1(a)) and m1_col(η*_2(a)) are the positions of the symbols that ensure the highest reliability of Q(a) (Step 4 of Algorithm 2). Let us consider the case where the highest reliability path in Q(a) was built performing only one deviation from the hard-decision path. In this particular case, the exclusion set is reduced to only one position and η*_1(a) = η*_2(a). Therefore, Q(a) is equal to m2(a) and (1) can be rewritten as (2), which corresponds to the generalization of the CN output message of binary min-sum LDPC decoders to the nonbinary ones
Although (2) is a particular case of (1) in the T-MM algorithm, it is useful to remark that Q(a) plays the role of the intrinsic information from the binary min-sum. For the extrinsic messages, we define a set E(a) (3) that includes the m1(a) or m2(a) reliabilities depending on the number of 
By exchanging the sets E(a) and Q(a) instead of R_mn(a) from CN to VN, the cardinality of the messages is reduced from q × d_c to 2 × (q − 1).
Following with the analogy with binary min-sum-based LDPC decoders, the extrinsic information related to the syndrome (sign values) is also sent to the VN processor. In the nonbinary case, these extrinsic syndromes are obtained as
This increases the amount of information sent to the VN by d_c p-bit values.
Finally, to reconstruct the q × d_c messages at the VN processor, it is necessary to send the positions where deviations were made to obtain the Q(a) values. These positions are included in a set P(a) that contains 2 × (q − 1) elements of ⌈log2(d_c)⌉ bits each.
In terms of bits, the total amount of information exchanged from CN to VN is the sum of the three contributions above,
2 × (q − 1) × w + d_c × p + 2 × (q − 1) × ⌈log2(d_c)⌉
where w is the number of bits used to represent the reliability of messages in the decoder. This information has been detailed in Table I for each set exchanged from CN to VN. In this way, the information sent is only reorganized (not modified), so we do not have any performance loss with respect to T-MM.
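The bit counts described in this section can be checked with a short calculation. The sketch below adds the contributions listed in the text: 2 × (q − 1) reliabilities of w bits for E(a) and Q(a), d_c extrinsic syndromes of p bits, and 2 × (q − 1) positions for P(a). Rounding the position width up to ⌈log2(d_c)⌉ bits is an assumption, the function names are hypothetical, and the resulting numbers come from this sketch rather than from Tables I or II.

```python
import math

def conventional_bits(q, d_c, w):
    """Full set of q x d_c reliabilities of w bits each."""
    return q * d_c * w

def compressed_bits(q, d_c, w):
    """Bits for E(a), Q(a), extrinsic syndromes, and P(a)."""
    p = q.bit_length() - 1                   # q = 2^p
    pos_bits = math.ceil(math.log2(d_c))     # assumed position width
    reliabilities = 2 * (q - 1) * w          # sets E(a) and Q(a)
    syndromes = d_c * p                      # extrinsic syndromes
    positions = 2 * (q - 1) * pos_bits       # set P(a)
    return reliabilities + syndromes + positions

# Parameters of the (837,726) code over GF(32) with w = 6 bits.
full = conventional_bits(q=32, d_c=27, w=6)
packed = compressed_bits(q=32, d_c=27, w=6)
```

Under these assumptions the compressed representation needs 372 + 135 + 310 = 817 bits per check node against 5184 bits for the full message set.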
C. Modified Trellis Min-Max Algorithm
In this section, we propose a new definition of the CN output messages (based on Section III-B) that allows us to reduce the exchanged messages from CN to VN even further. This new definition keeps only a minimum number of values from Q(a) (the most reliable ones) and obtains the rest using an approximation function.
First, a statistical analysis of the set Q(a), a ∈ GF(q), was done in order to find its mean value. Using a software model of an NB-LDPC decoder based on the T-MM algorithm, we obtained Q(a) for all the M rows of H when decoding 10^6 noisy sequences (for E_b/N_0 = 4.3 dB). Then, we ordered each set Q(a) from the lowest to the highest value and obtained the mean value of the ordered sets Q(a). The results of the analysis are presented in Fig. 2. We used the (837,726) NB-LDPC code over GF(32), with degree distribution d_c = 27, d_v = 4, built using the methods presented in [26]. In addition, we replicated the analysis for NB-LDPC codes with different degree distributions and GF orders and we obtained the same conclusions. As can be seen in Fig. 2, there is a big increase in mean value from one Q(a) index to the next for the first indices; however, this increase is lower for the rest of the indices. Based on this observation, we propose to keep only the first minimum, Q_m1, and the second minimum, Q_m2, from the set Q(a) ∀a ∈ GF(q)\α^(−∞). By storing only a limited set of values from Q(a), the exchanged information between CN and VN is reduced.
At the VN processor, we propose to approximate the remaining q − 3 values of Q(a) using (4), where a_m1 and a_m2 are the GF symbols corresponding to Q_m1 and Q_m2, respectively. As the second most reliable value, Q_m2, is updated at each iteration, the distance between it and the values Q(a) approximated using (4) is kept. So, it is expected that the fixed scaling factor γ is greater than one, to ensure that the reliabilities of the approximated values of Q(a) will not be lower than Q_m2.
The value of the scaling factor γ from (4) is obtained as follows. First, we calculate the mean value of the entire set, Q*(a), from Fig. 2. Then, we obtain the initial value of γ by dividing Q*(a) by the mean value of Q_m2. In Fig. 2, Q*(a) = 7.097 and the mean value of Q_m2 is 3.697, and thus, the initial value for γ is 1.9198. Finally, we adjust the initial value chosen for γ by means of FER simulations, optimized for E_b/N_0 = 4.3 dB.
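The approximation and the choice of γ can be sketched as follows. The sketch assumes that every Q(a) value other than Q_m1 and Q_m2 is replaced by γ × Q_m2, which is how the text describes the role of the scaling factor; the exact form of (4) is not reproduced here. The function names and the example reliabilities are hypothetical, while the two mean values are the ones quoted from Fig. 2.

```python
def initial_gamma(mean_all, mean_m2):
    """Initial scaling factor: mean of the set divided by the mean of Q_m2."""
    return mean_all / mean_m2

def approximate_q(q, a_m1, q_m1, a_m2, q_m2, gamma):
    """Rebuild Q(a) for the q - 1 nonzero symbols from two stored values.

    The two kept minima are restored exactly; the other q - 3 entries are
    assumed to be approximated by gamma * q_m2.
    """
    approx = {a: gamma * q_m2 for a in range(1, q)}
    approx[a_m1] = q_m1
    approx[a_m2] = q_m2
    return approx

# Mean values quoted in the text for the GF(32) code.
gamma0 = initial_gamma(7.097, 3.697)   # close to 1.92

# Hardware-friendly value used for the FER curves of the GF(32) code.
gamma_hw = 2.0

# Hypothetical GF(8) example with symbols represented as integers 1..7.
approx = approximate_q(8, a_m1=3, q_m1=1.5, a_m2=6, q_m2=2.5, gamma=gamma_hw)
```

With γ greater than one, no approximated entry falls below Q_m2, and a factor of 2.0 can be implemented as a shift instead of a multiplier.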
Considering the modifications presented above and the definitions made in Section III-B, Algorithm 3 describes the mT-MM decoding algorithm. Function ψ is a modified version of the ψ function from Algorithm 1, which also extracts the position (GF symbol) of the second minimum.
We include in Table II the number of bits exchanged between CN and VN for our proposal and for other works from the literature. The rightmost column includes the numerical results for the (837,726) NB-LDPC code over GF(32) [26] with degree distribution (d_c = 27, d_v = 4). We consider the same number of quantization bits for all proposals (w = 6 bits), and we set n_m = 16 and n_v = 5 according to [19], [21], [22], and [27], as they propose for their codes. The work from [15] exchanges a full set of messages, which results in a higher number of bits at the CN output. As can be seen, proposals from [19], [22], and [27] eliminate the q-dependence, exchanging only a fraction n_m < q of the reliabilities. These proposals maintain a strong dependence on the CN degree d_c, which is a penalty for high-rate NB-LDPC codes. The work from [21] reduces the fraction of output messages at the CN even further compared to previous proposals from the literature, with n_v < n_m < q. This reduction comes at the cost of some error-correction degradation and the need to include real multipliers at the VN for the message approximation. The work from [23] exchanges a fixed number of sets, and the size of each set depends on q and d_c, without introducing any performance loss compared with [15]. Finally, we propose in this paper a cardinality reduction of the set Q(a) to only two elements, reducing the total number of bits exchanged to the VN compared with the other proposals from Table II, as can be seen in the example from its rightmost column for the high-rate NB-LDPC code over GF(32).
Algorithm 3 Modified T-MM Algorithm
In terms of complexity, the CN processor of Algorithm 3 has less computational load than the one of Algorithm 2, because it does not compute q × d c output messages. So, the number of wires between CN and VN is also reduced.
Figs. 3 and 4 compare the number of bits exchanged from CN to VN in a conventional implementation of T-MM with our proposal, varying the field order (p) or the CN degree (d_c), respectively. In all cases, the proposed approach outperforms the conventional T-MM in terms of exchanged bits. The differences are considerably higher when the field order and/or the CN degree are increased. This has a great impact on the area of a decoder that uses a layered schedule, as will be seen in Section IV.
Fig. 5 shows the FER performance of the proposed mT-MM decoding algorithm [floating-point and fixed-point versions (6 bits)] for the (837,726) NB-LDPC code over GF(32), with 15 iterations and γ = 2.0 (approximated to a hardware-friendly value). It also includes the performance of the floating-point T-MM algorithm [15] with 15 iterations for performance comparison purposes. As can be seen, our proposed algorithm introduces a negligible performance loss of 0.01 dB with respect to T-MM (floating-point versions). Fig. 5 also shows the FER performance of other algorithms from the literature (Simplified min-sum algorithm [12], T-max-log-QSPA [9], Relaxed Min-Max [17], and OMO T-MM [16]) that will be used in Section IV-C to compare their implementation results under the same performance. The number of iterations of each algorithm is adjusted to obtain a performance similar to [17] [26]. The algorithms analyzed are T-MM and mT-MM. The results show that mT-MM has a performance loss of 0.05 dB for the code in Fig. 6 and 0.07 dB for the code in Fig. 7 with respect to T-MM. Thus, the proposed mT-MM algorithm achieves good FER performance results for several GF orders and different degree distributions.
Table III summarizes the parameters needed to adjust the initial value for the scaling factor γ (Q*(a) and the mean value of Q_m2), as well as the hardware-friendly value of γ (γ_HF) chosen to generate the FER curves from Figs. 5-7.
IV. NB-LDPC DECODER IMPLEMENTATION
In this section, we describe the architecture designed to implement the proposed mT-MM algorithm (Section III-C). In addition, we include the top-level design of an NB-LDPC decoder that uses a layered schedule. The proposed decoder is designed for quasi-cyclic NB-LDPC codes over GF(q) constructed applying the methods in [26] , where H is formed by QC × QC circulant submatrices. These submatrices can be composed of zero elements or a cyclic-shifted identity matrix with nonzero elements from GF(q). In this way, the numbers of rows and columns in
A. CN Architecture for mT-MM Algorithm
Parallel processing is adopted in the CN processor, so its latency is kept low and this increases the overall throughput, as will be seen in Section IV-B. The main characteristic of the proposed mT-MM algorithm is that it moves part of the complexity of the CN processor to the VN processor. In this way, the number of exchanged messages between them and also the storage resources of the decoder are reduced. Therefore, the CN architecture presented in this section requires fewer functional blocks than a conventional implementation of the T-MM algorithm [15].
Next, the hardware required to perform Algorithm 3 is detailed. Fig. 9 shows the block diagram for the top-level CN architecture, where each block corresponds to a step in the mT-MM algorithm.
Step 1, the normal-to-delta domain transformation, is performed by means of d_c permutation networks that follow the structure introduced in [28]. Each one requires q × log(q) w-bit MUXES. The CN syndrome (Step 2) is obtained using GF-adders in a tree structure [
Function ψ (Step 3) is implemented using a tree-based two-minimum finder [29], modified to also extract the position of the first minimum [14]. In total, q − 1 two-minimum finders with d_c inputs are required. Each one is implemented with 2 × d_c w-bit comparators and 3 × d_c w-bit MUXES.
The extra column values, Q(a), and the corresponding path information (Step 4) are generated using only the As an example, the processor for the GF symbol α 0 and GF (8) is presented in Fig. 8 . The SAT block from Fig. 8 excludes paths deviating more than once in the same stage of trellis. That is, when it detects more than one m1(a) in the same path coming from the same column of the trellis, it assigns the maximum value (minimum reliability) to the one-minimum finder input.
Step 5 is implemented as a single two-minimum finder with q − 1 inputs, as shown in Fig. 8. It selects the first and second minimum values of the set Q(a) (Q_m1 and Q_m2, respectively) and their positions (GF symbols), a_m1 and a_m2. It receives as inputs the outputs of the q − 1 extra column processors to extract the two most reliable (minimum) values.
The computation of the set E(a) (Step 6) requires (q − 1) ⌈log2(d_c)⌉-bit comparators and (q − 1) w-bit MUXES. With this hardware, we distinguish paths with one [E(a) = m2(a)] and two deviations [E(a) = m1(a)] from the hard-decision path, for each GF(q) symbol.
Finally, the calculation of the extrinsic syndromes (Step 7), z*_n, requires d_c XOR gates.
As can be seen in Fig. 9, some blocks do not depend on others, so they can be processed in parallel with the rest of the blocks. This is the case of the CN syndrome calculation (Step 2), β, and the extrinsic syndromes calculation (Step 7), z*_n. In addition, the E(a) calculation (Step 6) and the two-minimum finder (Step 5) can be processed at the same time. This reduces the total latency of the CN architecture.
Algorithm 4 Layered Schedule for the Proposed Decoder
As will be explained in Section IV-B, the VN processor uses z*_n, E(a), P(a), Q_m1, Q_m2, a_m1, and a_m2 to build R_mn in Algorithm 2. So, the total amount of information exchanged from CN to VN is (q − 1)
where w is the number of bits used to represent the reliability of messages in the decoder.
B. Top-Level Decoder Architecture
In this section, we explain how the CN architecture for the mT-MM algorithm from Section IV-A is included in a complete decoder with a horizontal layered schedule. This schedule improves the convergence of the decoding algorithm in comparison with the flooding one. In this way, the number of iterations is reduced, and hence, the throughput is improved. On the other hand, the area of the resulting decoder is considerably lower than the one required by a fully parallel implementation.
In Algorithm 4, the layered schedule for the proposed decoder is presented, where mT-MM is the CN processor, which implements Algorithm 3, and DN is the decompression network from Algorithm 5. The VN processor uses the DN blocks that generate R mn by using the information given by the mT-MM CN processor.
Algorithm 5 details the operations required to reconstruct R mn , that is, the entire set of q × d c messages that goes from CN to VN processors. The DN has as input the reduced set of messages coming from the CN.
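A behavioral sketch of the decompression idea is given below. It rebuilds one column of delta-domain output messages from a set of extrinsic values E(a), deviation positions P(a), and extra column values Q(a), following the filling rule described in Section III-A. The delta-to-normal conversion with the extrinsic syndromes and the approximation of Q(a) from Q_m1 and Q_m2 are omitted, the zero-symbol entry is assumed to be 0, and the function name and all input values are hypothetical; the sketch is not a transcription of Algorithm 5.

```python
def decompress_column(n, e, p, delta_q):
    """Rebuild the delta-domain messages of column n for all q symbols.

    e[a] is the extrinsic value of symbol a, p[a] the tuple of columns
    where the most reliable path of a deviates, and delta_q[a] the extra
    column value. Index 0 (zero symbol) is assumed to stay at 0.
    """
    column = [0.0]
    for a in range(1, len(delta_q)):
        column.append(e[a] if n in p[a] else delta_q[a])
    return column

# Hypothetical GF(4) sets for a check node with d_c = 5.
delta_q = [0.0, 5, 10, 10]
e = [None, 8, 12, 17]
p = [(), (0,), (3,), (0, 3)]
columns = [decompress_column(n, e, p, delta_q) for n in range(5)]
```

One such operation per column yields the whole q × d_c set of messages, so the values that are replicated across columns never have to be stored or routed individually.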
Algorithm 5 Proposed Decompression Operations
The complete block diagram for the proposed decoder is presented in Fig. 10. As can be seen, there is only one CN processor and one VN processor, which processes one row of H per clock cycle. A layered schedule requires storing the CN output messages from one iteration to be used in the next one. This is done by means of a shift register with M stages (SR in Fig. 10). The implementation of a conventional CN processor with q × d_c output messages would
registers to store the messages from the last iteration. This reduces the storage elements following the behavior presented in Figs. 3 and 4 when the field order or the CN degree is varied.
The blocks P and P −1 in Fig. 10 perform a direct and inverse permutation of messages from VN to CN and vice versa, respectively. The permutation is done based on the h m,n nonzero values of H.
The VN mem block is the memory required to store the messages in the VN processor during the decoding process. The depth of the required memories fits with the size of the circulant submatrices (QC) that form H [26]. On the other hand, the block LLR mem stores the channel information. This information is loaded in VN mem at the beginning of each new decoding frame.
Fig. 11 shows the implementation of a DN for GF(4). A total of d_c decompression networks are required to generate all q × d_c R_mn values. Note that two decompression networks are included in the VN processor. However, the area required in our proposal, which duplicates the logic required to implement DN, is much lower than the one of a conventional implementation of the T-MM algorithm with a layered schedule [15], [16].
To illustrate the decoder operation, in Fig. 12 , a timing diagram is presented. It includes the input and output of the VN processor memory (VN MEM), the CN processor output (CN output), and the VN processor output (VN output). There are d v × QC = M rows in H to be processed in each iteration, that is, M layers that require M clock cycles (one layer per clock cycle). On the other hand, we included seg pipeline stages in the CN to improve timing. After processing QC layers (the size of a circulant matrix), the pipeline must be emptied before processing the following QC layers, which requires seg clock cycles. So, block n in Fig. 12 includes the processing of layers from QC × (n − 1) + 1 to n × QC, plus seg clock cycles due to the pipeline. The decoding process starts loading the channel information L n (a) = Q n is read from VN MEM and, at the same time, the VN processor starts; after seg clock cycles, the CN processor obtains its outputs, and
n is saved in VN MEM. Then, this process is replicated for blocks from 2 to d_v. The same operations are repeated until the maximum number of decoding iterations (MaxIter) is reached. At this point, the tentative hard decoding starts to obtain the symbols c_n and store them in the corresponding memory 
C. Decoder Implementation Results and Comparisons
The decoder architecture explained in Section IV-B was implemented on a 90-nm CMOS process with nine metal layers and operating conditions 1.2 V and 25°C. VHDL was used for hardware description, and Cadence tools were used for synthesis and implementation. Table IV shows the implementation results for two highrate NB-LDPC codes whose performance are analyzed in Section III-C: (2304,2048) NB-LDPC code over GF (16) have the same impact on the number of gates of the decoder [the GF(64) NB-LDPC code has 2.85 times the number of gates of the one for the GF(16) NB-LDPC code]. In addition, the GF(64) NB-LDPC code has a stronger burst error correction capability. On the other hand, our proposal reach a throughput over 1 and 1.3 Gb/s for GF (16) and GF(64), respectively. Table V compares the implementation of our proposal with the other state-of-the-art proposals from the literature for the (837,726) NB-LDPC code over GF(32). For each reference, the number of iterations is selected to achieve approximately the same performance (see Fig. 5 ), and all of them use a layered schedule on their implemented decoders. For the proposals that do not use a CMOS 90-nm process, the throughput showed in Table V is scaled to this technology using the equations in [30] . On the other hand, our place-and-routed results have a core occupation of 70%.
In terms of gate count, our proposal, which applies parallel processing in the CN, outperforms the other decoders in Table V except for [17]. The decoder from [17] requires 23% fewer gates than our approach thanks to the serial processing used in its design. This significantly reduces the area but considerably increases the latency of the design, as can be seen in Table V. In terms of throughput, our proposal achieves the highest value among the solutions from the literature listed in Table V. This is due to the reduced set of messages exchanged between CN and VN, which reduces the wiring congestion. Our approach outperforms the solutions from [15] and [16], which have the highest throughput among the other decoders in Table V, by 48% and 20%, respectively.
Regarding efficiency, which is obtained as throughput divided by gate count, our proposal clearly outperforms the rest of the decoders: its efficiency is 93.85% higher than that of the most efficient decoder in Table V [16].
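As a minimal sketch of how this metric and the relative gain are computed, the snippet below uses the definition given above (throughput divided by gate count). The function names and all numeric values are hypothetical placeholders; they are not the entries of Table V.

```python
def efficiency(throughput_mbps: float, gate_count: float) -> float:
    """Efficiency as defined in the text: throughput divided by gate count."""
    return throughput_mbps / gate_count


def relative_gain_percent(new: float, reference: float) -> float:
    """Percentage by which `new` exceeds `reference`."""
    return 100.0 * (new - reference) / reference


if __name__ == "__main__":
    # Hypothetical decoders, for illustration only (not Table V data).
    eff_a = efficiency(throughput_mbps=1200.0, gate_count=1.0e6)
    eff_b = efficiency(throughput_mbps=900.0, gate_count=1.5e6)
    print("efficiency gain of A over B (%):",
          relative_gain_percent(eff_a, eff_b))
```

Because the metric is a ratio, a decoder can lead in efficiency through a higher throughput, a lower gate count, or both, which is why the efficiency gain reported above is larger than the throughput gain alone.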
The postlayout area required by the proposed decoder is smaller than that of any other solution from the literature for a similar CMOS technology and code parameters. The reduction in area is about 65% compared with [15], which was the solution with the lowest area until now.
To quantify the reduction in the wire length when the mT-MM algorithm is applied, we compare the postlayout results of the decoder from [15] with the proposed approach where the same process is considered for both implementations. The total wire length is 75.4 cm for [15] and 58.2 cm for the proposed decoder, which corresponds to a reduction of 23%.
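The stated reduction follows directly from the two wire lengths reported above; as a worked check, with the symbols \ell_{[15]} and \ell_{\mathrm{prop}} introduced here only for illustration:

```latex
\[
\frac{\ell_{[15]} - \ell_{\mathrm{prop}}}{\ell_{[15]}}
  = \frac{75.4 - 58.2}{75.4}
  = \frac{17.2}{75.4}
  \approx 0.228 \approx 23\%
\]
```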
To sum up, the proposed decoder based on the novel mT-MM algorithm offers important advantages over the state of the art in both area and throughput. It is also important to remark that the proposed mT-MM algorithm does not introduce a significant performance loss for GF orders lower than or equal to GF(32) and involves a nonnegligible performance loss of about 0.07 dB for GF(64), which is compensated for by a large area saving and a throughput over 1.3 Gb/s, as can be seen in Table IV.
V. CONCLUSION
The mT-MM algorithm is proposed in this paper. This algorithm considerably reduces the number of messages exchanged between the CN and VN processors in NB-LDPC decoders. In terms of performance, the proposed algorithm introduces a negligible performance loss compared with the original T-MM algorithm for high-rate codes over GF(16) and GF(32). Regarding implementation results, our approach has significant advantages in terms of area and speed compared with proposals that exchange the complete set of messages between CN and VN processors, especially for codes with high-order fields and high CN degree. To show these advantages, we implemented several layered decoders with the mT-MM algorithm for different fields and degree distributions, outperforming in all cases other proposals from the literature in terms of area and throughput.
He is currently with the Institute of Telecommunications and Multimedia Applications, Valencia. His current research interests include hardware and algorithmic optimization of error-control decoders.
