Abstract-A one minimum only decoder for Trellis-EMS (OMO T-EMS) and for
I. INTRODUCTION

Since the first non-binary low-density parity-check (NB-LDPC) decoder architecture was proposed for the Q-ary Sum-of-Product algorithm (QSPA) [1], hardware designers have been working to derive solutions that allow the use of NB-LDPC codes in a wide range of communication and storage systems. Good error correction, high throughput and small area remain the challenge of any NB-LDPC decoder designer.
The Extended Min-Sum (EMS) [2] and Min-Max (MM) [3] algorithms were proposed with the aim of reducing the complexity of the check node processor, which is the bottleneck of the QSPA algorithm. Although the decoding process is simplified by using forward-backward metrics to extract the check-to-variable messages, these metrics limit the maximum throughput achievable when they are implemented in hardware.
To avoid the use of forward-backward metrics, the Trellis Extended Min-Sum (T-EMS) algorithm was proposed in [4]. With T-EMS, the degree of parallelism is increased by using only combinations of the most reliable Galois field (GF) symbols to compute the check-to-variable messages. The decoder presented in [4] was outperformed in [5], where an extra column is added to the original trellis in order to generate the check-to-variable messages in parallel. The main drawback of the approach presented in [5] is that it requires a large area and many pipeline stages, reducing the overall efficiency of the decoder. In [6] the hardware implementation of a T-EMS decoder is described, reaching the highest throughput found in the literature. Previous trellis-based proposals, such as the ones from [7]-[9], applied partial-parallel decoding to obtain the output messages in the check node processor.
In [10] a decoder architecture named Relaxed Min-Max (RMM) is proposed. RMM approximates the second minimum calculation and hence generates the check-to-variable messages with less complexity. The main drawbacks of this approach are: i) the check node output messages are derived serially, which reduces the overall throughput of the decoder and increases latency; and ii) the approach suffers from an early degradation in the error floor region, due to the way the second minimum is derived.
In this paper, we introduce a novel second minimum approximation based on the statistical analysis of the check node messages, named the One Minimum Only (OMO) decoder. The motivation for this approximation is that the two-minimum finder doubles the critical path and increases the complexity of the check node processor. In addition to the second minimum estimator proposed in [10], we analyze two other estimators: one based on a slight modification of the one-minimum finder, and another that linearly combines the first two estimators and showed the best performance in simulations. The proposed OMO decoder can be applied to both T-EMS and Trellis Min-max, yielding the OMO T-EMS and OMO T-MM decoders respectively. By avoiding the use of two-minimum finders [11] in the check node, we reduce both the area and the latency of the check node update without introducing any significant performance loss compared to the original EMS or Min-max algorithms.
The OMO T-EMS and OMO T-MM check node architectures have been implemented and included in a layered scheduling decoder. A 90 nm CMOS process has been employed and a (837,726) NB-LDPC code over GF(32) has been chosen to show the efficiency of our approach for high order fields and high rate codes. The OMO T-EMS and OMO T-MM decoders achieve 100% and 159% higher efficiency (Mbps/million gates) compared to the most efficient decoder found in literature [10] respectively, with about 30% less latency and 40% higher throughput than the solution from [6] depending on whether EMS or Min-max version is implemented.
The rest of the paper is organized as follows. In Section II we introduce the nomenclature and the main concepts of the T-EMS algorithm. The proposed approach for the second minimum estimation of the T-EMS algorithm, OMO T-EMS, is presented in Section III, including an analysis of performance for different NB-LDPC codes and showing that it can be extended to the Min-max algorithm without loss of generality. Section IV describes the hardware implementation of the proposed check node and of the overall decoder, together with synthesis and post place and route results of the design and comparisons with other architectures. Finally, conclusions are outlined in Section V.
II. TRELLIS-EXTENDED MIN-SUM ALGORITHM
NB-LDPC codes are characterized by a sparse parity check matrix H in which each non-zero element belongs to the Galois field GF(q). We consider regular NB-LDPC codes with constant row weight d_c and column weight d_v. Decoding algorithms for NB-LDPC codes exchange messages iteratively between two types of nodes: check nodes (CN), which correspond to the M rows of H, and variable nodes (VN), which correspond to the N columns of H.
Let be the set of VN (CN) connected to a CN (VN) . Let and be the edge messages from VN to CN and from CN to VN for each symbol respectively. denotes the channel information and the a posteriori information.
Let and be the transmitted codeword and received symbol sequence respectively, with and is the error vector introduced by the communication channel. The log-likelihood ratio (LLR) for each received symbol is obtained as ensuring that all values are non-negative where is the symbol associated to the highest reliability.
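The non-negative LLR convention described above can be illustrated with a minimal Python sketch. It assumes the common definition in which each LLR is the logarithm of the ratio between the probability of the most reliable symbol and the probability of the symbol in question; the function name and the example probabilities are hypothetical and are not taken from the paper.

```python
import math

def normalized_llrs(probs):
    """Map symbol probabilities to non-negative LLRs.

    Assumed convention: L[a] = ln(P(z) / P(a)), with z the most likely symbol,
    so L[z] = 0 and less reliable symbols get larger values.
    """
    p_max = max(probs)
    return [math.log(p_max / p) for p in probs]

# Hypothetical channel probabilities for one received GF(4) symbol.
probs = [0.1, 0.6, 0.2, 0.1]
llrs = normalized_llrs(probs)

# The tentative hard decision is the symbol whose LLR is zero (the smallest).
hard_decision = min(range(len(llrs)), key=llrs.__getitem__)
```

With this convention, a smaller value always means a more reliable symbol, which is why the check node searches for minima.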
The Trellis Extended Min-Sum (T-EMS) algorithm [4] implements the original EMS algorithm [2] without the forward-backward metrics, which increases the degree of parallelism of the CN. Algorithm 1 gives the original T-EMS CN algorithm. Its first step transforms the incoming messages from the "normal" domain to the delta domain, and its fifth step transforms the outgoing CN messages from the delta domain back to the normal domain. For these transformations, the syndrome of the CN must be obtained (Step 2 of Algorithm 1) using the incoming tentative hard decision for each CN message. The extra column is computed in Step 3 using the configuration sets originally proposed in [2], with the aim of building the output messages using only the most reliable information. The configuration set is defined as the set of at most symbols that satisfy the parity equation, deviating at most times from the combination (path) of symbols with the highest reliability. Considering only the case when and , the extra column is built with the combination of the most reliable messages for each GF(q) symbol, i.e., with the minimum value message, . Once the values are derived, the outgoing CN messages in delta domain , are generated in Step 4 of Algorithm 1, which provides all the values for the extrinsic CN outgoing messages. For the intrinsic values, and are used, as explained in detail in [5].
It is important to remark that the values are used for both and generation, while is only used to compute (in the case of and ). Additionally, the position of the first minimum must also be extracted , since this information is used to derive the path for each extra column value in the trellis. However, the two minimum values must be processed using a two-minimum finder before the extra column calculation. This two-minimum finder increases the critical path for due to the extraction. In the next section we propose a novel approach that approximates the second minimum: it reduces the critical path for obtaining the first minimum and, at the same time, accurately estimates the second one without degrading the performance of the original T-EMS and Min-max algorithms.
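The first steps of the check node processing described in this section (delta-domain transformation, syndrome computation and row-wise search of the first minimum) can be sketched as follows. This is an illustrative sketch, not the authors' Algorithm 1: it assumes messages over GF(2^p) indexed by the integer representation of each symbol, so that field addition is a bitwise XOR, and it assumes the multiplication by the non-zero entries of the parity check matrix has already been applied. All function names are hypothetical.

```python
from functools import reduce
from operator import xor

def hard_decisions(messages):
    """Index of the smallest (most reliable) LLR in each incoming message."""
    return [min(range(len(q)), key=q.__getitem__) for q in messages]

def syndrome(z):
    """Sum of the hard decisions; addition in GF(2^p) is a bitwise XOR."""
    return reduce(xor, z, 0)

def to_delta_domain(q, z_n):
    """delta[eta] = q[eta XOR z_n], so the most reliable symbol moves to eta = 0."""
    return [q[eta ^ z_n] for eta in range(len(q))]

def row_minimum(delta_messages, eta):
    """First minimum of trellis row eta and the index of the edge holding it."""
    row = [dq[eta] for dq in delta_messages]
    pos = min(range(len(row)), key=row.__getitem__)
    return row[pos], pos

# Three hypothetical VN-to-CN messages over GF(4) (LLRs, 0 = most reliable).
messages = [[0, 3, 5, 2], [4, 0, 1, 6], [2, 7, 0, 3]]
z = hard_decisions(messages)
s = syndrome(z)
delta = [to_delta_domain(q, zn) for q, zn in zip(messages, z)]
m1, pos = row_minimum(delta, 1)
```

The position returned with the minimum is the path information mentioned above; the second minimum of each row is the quantity that the rest of the paper approximates.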
III. ONE MINIMUM ONLY TRELLIS DECODER
As shown in Section II, the two-minimum finder represents an important part of the CN architecture. Moreover, the hardware architectures that implement the minimum finder [11] introduce the same delay for both and , which is not optimal for EMS and Min-max algorithms. This observation is our principal motivation for creating a novel check node architecture that approximates the second minimum, reducing the delay of the first one and hence improving the latency and the throughput of the decoder, as shown in the next sections. Our proposed approach has been tested on multiple NB-LDPC codes with different GF(q) and degree distributions, showing in all cases a negligible performance loss compared with the T-EMS and Min-max algorithms. To simplify the description of our proposal, we focus on T-EMS; the method can be applied directly to the Min-max algorithm without any loss of generality, and we provide performance and implementation results for both EMS and Min-max decoders.
In the rest of the section, different estimators of the second minimum values are considered, and a statistical analysis of their distribution compared to the true second minimum is made. The analysis is done for the (837,726) NB-LDPC code over GF(32), where is generated using the methods proposed in [12] , with circulant sub-matrices of size . However, other codes with different GF and degree distribution have been tested obtaining the same behavior.
A. Estimators for the Second Minimum Value
A first natural solution for the estimation of is to use a scaled version of the first minimum , described in (1):
(1)
This approximation has already been proposed in [10]. However, if the scaling value is chosen to mimic as closely as possible the behavior of EMS or Min-max in the waterfall region (see footnote 1), applying (1) alone usually underestimates the second minimum. As can be seen in Fig. 1, where we draw the distributions of the true and their proposed estimators, the value of is on average smaller than the real , which leads to an important performance degradation in the error floor region.
A second possible estimator takes advantage of re-using the hardware architecture. Using a radix-2 one-minimum finder, it is possible to obtain an early estimate of the second minimum. Fig. 3 presents a one-minimum tree finder in which we include an extra multiplexor in the last stage that extracts the loser term, denoted . With just this extra multiplexor, this term can be used as an early estimator of the second minimum, and it represents an upper bound on the true second minimum value. If the true value is located in the other half of the tree ( branches of the minimum tree finder not connected to ), then we obtain . In the other cases, . Hence, the resulting value is a provable upper bound on the true . A systematic overestimation of the second minimum value could also degrade the performance of the complete decoder, so we propose to combine and in order to obtain an estimator with better statistical behavior.
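The behavior of the loser term can be checked with a small software model of a radix-2 one-minimum tree. This is only a sketch of the idea: it assumes a power-of-two number of inputs (a hardware tree for another input count would need padding), and the function name is hypothetical.

```python
def one_min_with_loser(values):
    """Radix-2 tournament: returns (first minimum, its index, loser of the last stage)."""
    n = len(values)
    if n < 2 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of inputs")
    # Each entry keeps its original index so the position of the minimum survives.
    layer = [(v, i) for i, v in enumerate(values)]
    while len(layer) > 2:
        # One tree stage: compare neighbours and keep the smaller of each pair.
        layer = [min(layer[k], layer[k + 1]) for k in range(0, len(layer), 2)]
    # Last stage: the winner is the first minimum, the loser is the early estimate.
    winner, loser = sorted(layer)
    return winner[0], winner[1], loser[0]

# True second minimum (3) sits in the half not containing the minimum: loser is exact.
print(one_min_with_loser([7, 2, 9, 4, 3, 8, 6, 5]))  # (2, 1, 3)
# True second minimum (3) shares a half with the minimum: loser (5) overestimates it.
print(one_min_with_loser([7, 2, 3, 4, 9, 8, 6, 5]))  # (2, 1, 5)
```

In both cases the loser is the minimum of the half of the tree that does not contain the first minimum, so it can never be smaller than the true second minimum.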
As we will demonstrate with a statistical analysis in the next section, both and are biased estimators: one overestimates the true second minimum and the other underestimates it. We therefore propose to combine the two preceding estimators linearly, in the following way:
(2)
Compared to the real values, presents a similar histogram in Fig. 1, which implies that the proposed estimate has statistical behavior similar to that of the exact values.
1 is selected as the mean value of the ratio between and .
The operations involved in implementing (2) can be performed after the and values are obtained (using the hardware structure in Fig. 3). Therefore, the second minimum estimation can be made at the same time as the values are obtained, before finally calculating the check-to-variable output messages.
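The three estimators discussed in this subsection can be written compactly as below. This is a hedged sketch rather than the paper's exact equations: the names `gamma`, `loser` and `weight` are hypothetical, and the equal weighting used as the default is a placeholder assumption, since the coefficients of the linear combination in (2) are not reproduced here.

```python
def mean_ratio(pairs):
    """Mean of m2 / m1 over (m1, m2) pairs with m1 > 0, used here as the scaling value."""
    ratios = [m2 / m1 for m1, m2 in pairs if m1 > 0]
    return sum(ratios) / len(ratios)

def scaled_estimate(m1, gamma):
    """First estimator: a scaled version of the first minimum."""
    return gamma * m1

def combined_estimate(m1, loser, gamma, weight=0.5):
    """Third estimator: linear combination of the scaled minimum and the loser term.

    weight is an assumed, illustrative coefficient (not taken from the paper).
    """
    return weight * scaled_estimate(m1, gamma) + (1.0 - weight) * loser

# Hypothetical (first minimum, true second minimum) observations.
gamma = mean_ratio([(2.0, 3.0), (4.0, 8.0)])
est = combined_estimate(m1=2.0, loser=5.0, gamma=1.5)
```

Because the scaled estimator tends to fall below the true value and the loser term never does, any combination with weights between 0 and 1 lies between the two.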
In the next section, we analyse the statistical behavior of each of the three proposed estimators.
B. Statistical Analysis of the Different Estimators
In Fig. 2, we plot the distributions of the difference between the proposed estimators, each defined following one of (1), (2), and the true minimum, i.e. . We performed this analysis by computing, for each iteration and for different values, the difference between the real second minimum at the check node and each one of the estimators. The analysis uses all the check nodes of the parity check matrix.
From the shape of the distributions, we can see that seems to be biased and skewed toward positive values of the difference, which means not only that underestimates the true minimum, but also that the difference is not symmetric around its bias. As expected for , which is an upper bound on the true second minimum, we observe the opposite behavior: the distribution of is left biased and skewed. To better measure the performance of each estimator, we have computed four statistics of the distributions , derived from their first four cumulants, and reported their values in Table I after 1 and 15 decoding iterations. The first is the mean (the first cumulant), which measures the bias of the estimator; a value of zero indicates that the estimator is unbiased. The second is the standard deviation (the square root of the variance, which is the second cumulant), which indicates the spread of the estimator around the mean value. The third is the skewness (the normalized third cumulant), a measure of the asymmetry of the distribution. A zero skewness indicates that an estimator does not favor positive or negative differences with the true value . Finally, the fourth, the kurtosis, is a measure of the heaviness of the tails of the distribution. A low kurtosis indicates that very large outlier values of the difference with the true minimum do not appear with high probability. The kurtosis of a Gaussian distribution is equal to 3.
As we can see from Table I, typically tends to underestimate the true value, since both the mean and the skewness are positive, while overestimates the true value, since both the mean and the skewness are negative (which was expected, as is actually an upper bound of ). The third estimator that we propose, namely , is a better estimator than the other two with respect to all statistics. First, it is practically unbiased at the first iteration, although a slight positive bias seems to appear at iteration 15. The skewness is almost zero, which tells us that, on average, the decoder will underestimate or overestimate the with the same frequency. Finally, both the variance and the kurtosis of are the smallest among the three estimators, which indicates that values very different from the true minimum will appear less often with than with the other two estimators. With respect to those indicators, is a better estimator of than or . We will confirm in the next section that also provides the maximum gain in error correction performance for the overall LDPC decoder.
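The four statistics used in this comparison can be computed from a list of differences as in the sketch below. It assumes the difference is taken as the true second minimum minus the estimate, which matches the sign convention of the text (a positive mean indicates underestimation), and it uses population moments with the non-excess kurtosis, for which a Gaussian gives 3. The data values are hypothetical.

```python
import math

def summary_statistics(diffs):
    """Return (mean, standard deviation, skewness, kurtosis) of the differences."""
    n = len(diffs)
    mean = sum(diffs) / n

    def central(k):
        # k-th central moment of the empirical distribution.
        return sum((d - mean) ** k for d in diffs) / n

    variance = central(2)
    std = math.sqrt(variance)
    skewness = central(3) / std ** 3      # zero for a symmetric distribution
    kurtosis = central(4) / variance ** 2  # equals 3 for a Gaussian
    return mean, std, skewness, kurtosis

# Hypothetical true second minima and estimates from a few check node rows.
true_m2 = [3.0, 4.0, 2.5, 6.0, 5.0]
estimate = [2.5, 4.5, 2.0, 5.0, 5.5]
diffs = [t - e for t, e in zip(true_m2, estimate)]
mean, std, skew, kurt = summary_statistics(diffs)
```

In a real experiment the differences would be collected over all check nodes and symbols at a given decoding iteration, as described above.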
C. Frame Error Rate Performance
To verify the correct behavior of the proposed OMO T-EMS and OMO T-MM algorithms, we performed Frame Error Rate (FER) simulations for NB-LDPC codes with different degree distributions and Galois field sizes, from GF(4) to GF(32), assuming Binary Phase Shift Keying (BPSK) modulation over an Additive White Gaussian Noise (AWGN) channel. In this subsection we only include the performance of two codes: the (837,726) NB-LDPC code over GF(32) with and and the (2212,1896) NB-LDPC code over GF(4) with and . For the rest of the codes we obtained similar results. We compare the proposed OMO T-EMS and OMO T-MM approaches to the T-EMS [6] and Relaxed Min-Max (RMM) [10] algorithms. Fig. 4 shows the FER simulation results for the (837,726) NB-LDPC code. For this code, the proposed OMO T-EMS algorithm in its floating point version (fp) achieves the same performance as the T-EMS algorithm, without any performance loss. Both algorithms use 15 iterations (it) and a scaling factor . In addition, the OMO T-MM algorithm has a coding gain of 0.12 dB compared to the RMM from [10]. Regarding the quantized versions, OMO T-EMS with 7 bits (7b) for the datapath and 9 iterations achieves the same performance as RMM [10] with 5 bits for the datapath and 15 iterations, so the proposed approach requires fewer iterations than the method from [10] to achieve the same performance. The quantized OMO T-MM with 6 bits (6b) and 8 iterations matches the performance of the RMM decoder.
In Fig. 5, we plot the performance of the T-EMS decoder with 15 iterations for all the approximations of the second minimum discussed in this paper. The curve labeled T-EMS uses the exactly computed value of . As we can see, the fact that and do not accurately estimate the second minimum does have an impact on the overall decoder performance, especially in the error floor region, where a strong early flattening appears for both approximations (especially for ). On the other hand, our proposed approximation has no performance loss compared to T-EMS with the exact minimum computation, in both the waterfall and the error floor regimes. As a result, the complexity gains provided by OMO T-EMS come at no performance loss, at least for the codes that we simulated.
In Fig. 6, simulations for the (2212,1896) NB-LDPC code show a negligible performance loss of 0.03 dB for a comparing T-EMS to OMO T-EMS. The same happens when we compare the Min-max algorithm to OMO T-MM: the approximation introduces a negligible difference of only 0.04 dB. The values in all simulations are adjusted using the mean of the ratio, as described above.
IV. OMO T-EMS AND OMO T-MM HARDWARE ARCHITECTURES
In this section the hardware architectures for the proposed OMO T-EMS and OMO T-MM are introduced. Since the main contribution of this paper focuses on the CN processing, we first detail the implementation results for the OMO T-EMS and OMO T-MM CN architectures, comparing them to other existing solutions. We then present the results for the complete decoders with horizontal layered schedule.
Fig. 7. Check node top architecture for T-EMS algorithm (a). Proposed OMO T-EMS/OMO T-MM check node architecture (b).
A. Check Node Architecture
Fig. 7(a) shows the original T-EMS hardware structure [6], while the proposed OMO T-EMS CN structure is presented in Fig. 7(b). The main advantage of our approach is that it replaces the two-minimum finders with one-minimum finders, reducing the total complexity and the delay for the values, and introduces the novel second-minimum estimation. To perform this approximation, the block labeled " Processor" is responsible for implementing (2). The Processor does not introduce any additional delay, since its processing is done at the same time as the values are computed. Moreover, it is important to remark that the OMO decoding technique can be directly implemented in a Min-max decoder, obtaining the same advantages.
For both OMO T-EMS and OMO T-MM algorithms, the row-wise search of the most reliable messages implies that the one-minimum finder must have inputs and include an extra multiplexor in the last stage to extract the values, as shown in Fig. 3. For the CN, one-minimum finders are required, each one formed by -bit comparators and 2-input multiplexors, where is the number of bits for the datapath. Compared with the two-minimum finders [11], the critical path is halved because no hardware is spent on the second minimum computation, which greatly improves the obtained throughput. To make a fair comparison with the two-minimum finder used in conventional designs, we must add the extra resources needed to implement (2), which amount to -bit adders for the one-minimum finders. This value is calculated considering that the implementation of (1) and (2) needs only two additional adders.
In contrast, a conventional two-minimum finder [11] requires -bit comparators and 2-input multiplexors. Considering the same number of equivalent gates for an adder and a comparator ( bits each), the two-minimum finder has three times more multiplexors and two times more comparators than the one-minimum finder plus the second minimum estimation implementing (2).
For the and transformation, the approach used is similar to the one proposed in [13], requiring 2-input multiplexors to perform both transformations. The check node's syndrome is calculated by adding all tentative hard decision symbols with a tree of GF(q) adders, which needs XOR gates. The extra column values are generated using a configuration processor similar to the one proposed in [6], using -bit adders or comparators to generate the tentative extra column values depending on whether an EMS or a Min-max check node is implemented. To select the most reliable value, one-minimum finders are required, each one formed by -bit comparators and 2-input multiplexors. To compute the path info, 2-input multiplexors, XOR gates and OR gates are implemented. The resources required for the CN implementation of OMO T-EMS and OMO T-MM are summarized in Table II and compared with the approaches from [10] and [6]. VHDL was used to describe the hardware, and the total gate count was derived after synthesis using Cadence RTL Compiler. The hardware implementation was performed for the (837,726) NB-LDPC code over GF(32), with and . As can be observed, although the CN in [10] has fewer NAND gates than our proposals, it must store intermediate messages due to the serial processing, which increases its gate count to 230714 NAND gates (considering that storing one bit of RAM is equivalent in area to 1.5 NAND gates [10], [14], [15]). Hence, our proposals require up to 18% fewer logic resources than the CN presented in [10], even though we use two extra bits.
In [6], we did not provide separate results for the CN architecture of the T-EMS decoder; however, we obtained these results by considering the main differences with our new proposal. The CN from [6] needs about four times more hardware than our OMO approaches for the extra column computation, because it uses both the first and the second minimum to extract the extra column values. As we can see in Table II, OMO T-EMS and OMO T-MM require 37% fewer logic resources than [6].
B. Complete Decoder Architecture
The proposed CN architecture has been included in a horizontal layered schedule decoder with one CN cell (Fig. 7) and VN units. Each VN processor includes dual-port memories that store the LLR values and avoid adding extra latency. In addition, due to the layered schedule, a shift register is required to store the "last iteration" CN output information . Since only one CN cell is implemented in the decoder, clock cycles are required to complete one decoding iteration. This value is increased by the pipeline stages introduced in the decoder ( clock cycles are added) with the aim of achieving the desired clock frequency . Because the pipeline registers must be empty after processing one entire circulant sub-matrix and before processing a new one, reducing the logic path of the decoder has a great impact on the maximum throughput achieved by the decoder (3). Finally, additional clock cycles are required to load the LLR values and output the estimated codeword of the decoder. With OMO T-EMS we reduce the critical path of the CN, so we only require 8 pipeline stages to achieve a clock frequency of after place and route with Cadence SOC Encounter tools, employing a 90 nm CMOS library in which the area of a NAND gate is 3.13 µm². The total area of the decoder is 19.02 mm² with a core occupation of 60% and a gate count of million NAND gates. For OMO T-MM the number of pipeline stages is 8 and the maximum clock frequency is . The total area is 16.10 mm² with a core occupation of 70% and a gate count of million NAND gates.
It is important to remark that the library used to implement both OMO T-EMS and OMO T-MM does not include optimized RAM memories, so each bit of RAM is implemented as a register and, hence, the area of the memories is about three times larger. Due to this, the total number of equivalent NAND gates is overestimated compared to the results found in the literature, which always assume optimized memories. For this reason we include in Table III, for comparison purposes, the equivalent number of NAND gates assuming that each bit of RAM is implemented with an area of 1.5 NAND gates.
To achieve the same performance as in [10] and [6] , our OMO T-EMS approach requires only 9 decoding iterations, as can be seen in Fig. 4 , therefore the total latency of the decoder is 1435 clock cycles, which corresponds to a throughput of 729 Mbps (3). For the OMO T-MM 1279 clock cycles are required to get the same FER performance as RMM or T-EMS, so a maximum throughput of 818 Mbps is reached.
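The relation between latency and throughput quoted above can be checked with a short calculation. This sketch assumes that throughput is defined as the number of coded bits per codeword (837 symbols of 5 bits each for GF(32)) divided by the decoding time, and it uses a 250 MHz clock, which is an assumption chosen because it reproduces the reported figures; neither the exact form of (3) nor the clock frequency is taken from the text.

```python
def throughput_mbps(clock_hz, n_symbols, bits_per_symbol, latency_cycles):
    """Coded bits delivered per second, in Mbps, for one codeword per latency period."""
    return clock_hz * n_symbols * bits_per_symbol / latency_cycles / 1e6

CLOCK_HZ = 250e6  # assumed clock frequency, not stated in the text above

# (837,726) code over GF(32): 837 symbols, 5 bits per symbol.
ems = throughput_mbps(CLOCK_HZ, 837, 5, 1435)  # OMO T-EMS latency in clock cycles
mm = throughput_mbps(CLOCK_HZ, 837, 5, 1279)   # OMO T-MM latency in clock cycles
print(round(ems), round(mm))
```

Under these assumptions the two results round to the 729 Mbps and 818 Mbps reported for OMO T-EMS and OMO T-MM respectively.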
OMO T-EMS and OMO T-MM decoders have been compared to the most efficient NB-LDPC decoder designs to the best knowledge of the authors. The results of the comparisons have been included in Table III , where we have scaled the results in [10] to include all throughput results over 90 nm CMOS process [16] .
The throughput of both OMO T-EMS and OMO T-MM decoders is higher than any decoder proposed in literature for high order fields and high rate NB-LDPC codes (see Table III ), because of the improvements made at the check node processor.
Our approaches require, in the worst case (OMO T-EMS), less than half the area of [15] and achieve at least 3.2 times higher throughput, so our most complex solution is 13 times more efficient (see footnote 2). The decoder presented in [10] has also been considered, since it was the most efficient one until now and it uses (1) to approximate the second minimum, which gives benefits in terms of area but introduces early performance degradation (Fig. 4). Despite this, OMO T-EMS has 8.8 times lower latency than [10] and achieves 4.75 times higher throughput, with a decoder that is 49.7% more efficient in terms of area over throughput (for a 90 nm CMOS process). Similarly, OMO T-MM has 61.2% higher efficiency than [10], with 9.9 times lower latency and 5.3 times higher throughput.
Finally, our OMO T-EMS approach has been compared against the T-EMS decoder presented in [6]. With the novel approach for the second minimum estimation, the latency is reduced by 33% with respect to [6], with a 50% increase in throughput. The area is also reduced by 25%, so the efficiency is 50% higher.
It is important to remark that the proposed approach is focused on high-rate NB-LDPC codes. However, efficient NB-LDPC decoders suitable for lower rate codes have been proposed in the literature [17] , [18] . These architectures make a parallel processing of messages.
V. CONCLUSIONS
In this paper a new method to estimate the second minimum value in message of the check node processor of NB-LDPC decoders is proposed. This solution avoids the use of two-minimum finders, greatly reducing the check node complexity. The simplifications applied to the T-EMS and T-MM algorithms reduce latency and area with respect to the original proposal, without introducing any significant performance loss. The proposed check node architecture was included in a complete decoder with layered schedule achieving 729 Mbps of throughput after place and route on a 90 nm CMOS process for OMO T-EMS and 818 Mbps for OMO T-MM. The designed decoder nearly doubles the efficiency of the best solutions found in literature for high order fields and high rate codes.
2 Note that [15] is the only proposal that also provides post place and route results.
David Declercq (SM'11) was born in June 1971. He received his Ph.D. degree in statistical signal processing from the University of Cergy-Pontoise, France, in 1998. He is currently full professor at the ENSEA in Cergy-Pontoise. He is the general secretary of the National GRETSI association. He is currently the recipient of a junior position at the Institut Universitaire de France. His research topics lie in digital communications and error-correction coding theory. He worked several years on the particular family of LDPC codes, both from the code and decoder design aspects. Since 2003, he developed a strong expertise on non-binary LDPC codes and decoders in high order Galois fields GF(q). A large part of his research projects are related to non-binary LDPC codes. He mainly investigated two aspects: i) the design of GF(q) LDPC codes for short and moderate lengths, and ii) the simplification of the iterative decoders for GF(q) LDPC codes with complexity/performance tradeoff constraints. He has published more than 30 papers in major journals (IEEE TRANSACTIONS ON COMMUNICATIONS, IEEE TRANSACTIONS ON INFORMATION THEORY, Communications Letters, EURASIP JWCN), and more than 100 papers in major conferences in information theory and signal processing.
