ABSTRACT This paper presents a modified Trellis Min-Max (T-MM) algorithm together with the associated architecture for non-binary (NB) low-density parity-check (LDPC) decoders. The proposed T-MM algorithm is able to reduce the memory requirements for the check-node messages through an efficient compression method and enhance the error-rate performance using the appropriate decompression. A method of updating the a posteriori log-likelihood ratio in the delta domain is used to simplify the computational and storage complexity. In order to enhance the decoding throughput, a low-complexity early termination (ET) scheme is devised by using the hard decisions of the variable-to-check messages, where, although a minor overhead is introduced, there is no visible degradation in error rate. As a proof of concept, a row-parallel layered decoder for the 32-ary (837, 726) LDPC code is implemented using a 90-nm CMOS process. The proposed decoder achieves a throughput of 1.64 Gb/s at 526.32 MHz based on eight iterations and has an area of 6.86 mm$^2$. When the ET scheme is enabled, the decoder achieves a maximum throughput of 4.68 Gb/s with a frame error rate of $3.25 \times 10^{-6}$ at $E_b/N_0 = 4.5$ dB. The proposed NB-LDPC decoder achieves the highest throughput and hardware efficiency compared to the state-of-the-art decoders, even when the ET scheme is not enabled.
I. INTRODUCTION
Recent state-of-the-art communication and storage systems have adopted binary low-density parity-check (LDPC) codes proposed by Gallager [1], thanks to their near-capacity performance and the fact that they are hardware friendly. In order to achieve a better coding gain, especially for short code lengths, non-binary LDPC (NB-LDPC) codes [2], [3] defined over high-order Galois fields GF($q$), where $q > 2$, have been proposed. NB-LDPC codes can be decoded by using the Q-ary sum-of-product algorithm (QSPA) [2], fast Fourier transform-QSPA (FFT-QSPA) [4], max-log-SPA [5], and T-max-log-QSPA [6]. However, NB-LDPC decoders based on these algorithms suffer from high computation and storage complexity, especially for codes constructed using high-order Galois fields, limiting the overall decoding throughput.
The associate editor coordinating the review of this manuscript and approving it for publication was Francis Lau.
In order to achieve an efficient hardware implementation, the simplified min-sum algorithm (SMA) and the extended min-sum (EMS) algorithm were presented in [7] and [8] , respectively, where only comparisons and additions are adopted in the check node (CN) units, reducing both the computation complexity and the memory requirements. Based on the EMS algorithm, high-throughput decoders for low-rate NB-LDPC codes were proposed in [9] and [10] . In the min-max (MM) algorithm presented in [11] , the sum operations in the CN units are approximated by using the maximization operations. Although both the EMS and the MM algorithms reduce the implementation complexity, the forward-backward (FB) computation process in the CN unit limits the overall throughput for high-rate NB-LDPC codes.
To avoid the use of FB computation, a trellis EMS algorithm and a trellis min-max (T-MM) algorithm, proposed in [12] and [13], respectively, generate the $d_c$ check-to-variable (C2V) messages for a single CN in parallel, where $d_c$ is the check node degree, and hence the decoding throughput can be increased. The T-MM decoder [13] uses min-max operations, which have a lower complexity, in order to improve on the decoding throughput of the trellis EMS decoder [12]. For the trellis-based decoders presented in [12] and [13], $d_c \times q$ C2V messages for a single CN are stored directly in memory, resulting in a significant increase in the decoder area. Lacruz et al. [14], [15] and Thi and Lee [17] found that C2V messages can be recovered from intrinsic messages, extrinsic messages and other related information, and proposed storing these important CN messages rather than all the $d_c \times q$ C2V messages in order to reduce the area required for storage. Although this indirect-storage method does not introduce any loss in error-rate performance, additional processing is required in order to recover the C2V messages, resulting in a higher latency compared to the direct-storage method.
In order to increase the decoding throughput, an early termination (ET) scheme that is able to reduce the number of iterations is desired and is usually employed for binary LDPC codes. However, realizing the standard ET scheme for non-binary LDPC codes is too complex, especially for a large field size q and, hence, is infrequently discussed in the open literature. Park et al. [9] proposed a node-level convergence detection method that is able to avoid the calculation of syndrome values. However, node-level convergence detection requires at least several iterations, resulting in a greater number of iterations in the high SNR region compared to the standard ET scheme.
In this paper, a modified T-MM algorithm is proposed, together with the associated architecture, which is intended to reduce the hardware complexity and enhance the operating frequency. To achieve a reduction in storage area, we propose compressing both the intrinsic and the extrinsic messages from $q$ elements to $L$ elements, where $L \ll q$, such that only $4 \times L + d_c$ CN messages are stored. A simplified compression checking method is also proposed that enables the processing latency to be minimized. A decompression formula is proposed that ensures the true C2V values within the desired range can be approximated well, and, hence, the proposed T-MM algorithm achieves an error-rate performance comparable to that described in [15]. Moreover, an efficient two-stage minimum finder is proposed that can be used to further improve the hardware efficiency of the CN unit. In order to reduce the decoder area, an LLR updating method in the delta domain is presented. A low-complexity ET scheme is proposed in order to achieve a higher decoding throughput without introducing any visible degradation in the error rate. The ET circuits can be effectively shared with the CN unit, meaning that any overhead introduced is minor.
In order to demonstrate these techniques, a layered decoder for the 32-ary (837, 726) LDPC code is implemented using a 90-nm CMOS process. The proposed decoder achieves a throughput of 1.64 Gbps at 526.32 MHz based on 8 iterations, and has an area of 6.86 mm$^2$. When the ET scheme is enabled, the decoder achieves a maximum throughput of 4.68 Gbps with a frame error rate (FER) of $3.25 \times 10^{-6}$ at $E_b/N_0 = 4.5$ dB. Compared to the work presented in the previous literature, the proposed decoder can achieve the highest throughput and the best hardware efficiency in terms of throughput-to-area ratio (TAR), even when the ET scheme is not enabled.
The remainder of this paper is organized as follows. A review of the T-MM algorithm for NB-LDPC decoding is described in Section II. The proposed hardware-friendly T-MM algorithm that includes an efficient C2V compression and decompression method, an LLR updating method in the delta domain, and a low-complexity ET scheme, is presented in Section III. Section IV illustrates the row-parallel layered NB-LDPC decoder architecture. The implementation and comparison results are provided in Section V. Finally, Section VI concludes this paper.
II. PRELIMINARIES
A. NB-LDPC DECODING
An NB-LDPC code can be defined using an $M \times N$ non-binary parity-check matrix $H = [h_{m,n}]$, where $h_{m,n} \in$ GF($q$). The LLR for the $n$-th variable node (VN) at the $i$-th iteration is denoted as $L^{i}_{n}(a)$, where $a \in$ GF($q$). The message from the $m$-th CN to the $n$-th VN, and the message from the $n$-th VN to the $m$-th CN at the $i$-th iteration are denoted as $R^{i}_{m,n}(a)$ and $Q^{i}_{m,n}(a)$, respectively. The layered NB-LDPC decoding algorithm is detailed as follows. For ease of presentation, each layer consists of a single CN.
Initialization: The a posteriori LLR values for all VNs are initialized according to $L^{1}_{n}(a) = \log[P(c_n = z^{0}_{n} \mid y_n)/P(c_n = a \mid y_n)]$, where $y_n$ is the received symbol for VN $n$ and $z^{0}_{n} = \arg\max_{a \in \mathrm{GF}(q)} P(c_n = a \mid y_n)$. At the same time, all the $R^{0}_{m,n}(a)$ and $Q^{0}_{m,n}(a)$ values are reset.
Iterative decoding:
1) VN updating: For each variable node $n$ neighboring CN $m$, i.e., $n \in \mathcal{N}(m)$, compute $Q^{i}_{m,n}(a)$, $a \in$ GF($q$), according to
$$Q^{i}_{m,n}(a) = L^{i}_{n}(a) - R^{i-1}_{m,n}(a). \tag{1}$$
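Then, the V2C messages $Q^{i}_{m,n}(a)$ are normalized to ensure that the most reliable message is always zero:
$$Q^{i}_{m,n}(a) \leftarrow Q^{i}_{m,n}(a) - \min_{a' \in \mathrm{GF}(q)} Q^{i}_{m,n}(a'). \tag{2}$$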
VOLUME 7, 2019
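2) CN updating: For check node (layer) $m$, compute $R^{i}_{m,n}(a)$, $a \in$ GF($q$), according to
$$R^{i}_{m,n}(a) = F_{CN}\big(\{Q^{i}_{m,n'}(a') \mid n' \in \mathcal{N}(m),\ a' \in \mathrm{GF}(q)\}\big) \tag{3}$$
for each of the neighboring variable nodes $n$, where the CN updating function $F_{CN}$ depends on the decoding algorithm used, such as EMS [8] or T-MM [13].
3) LLR updating: For each variable node $n$ involved in the $m$-th CN (layer), compute $L^{i+1}_{n}(a)$, $a \in$ GF($q$), according to
$$L^{i+1}_{n}(a) = Q^{i}_{m,n}(a) + R^{i}_{m,n}(a). \tag{4}$$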
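The VN updating, normalization and LLR updating steps for a single edge can be sketched as follows. This is a minimal illustration rather than the decoder's implementation: it assumes that messages are plain Python lists indexed by the GF($q$) symbol, that the V2C message is the LLR minus the previous C2V message, and that normalization subtracts the smallest entry. The function names `vn_update` and `llr_update` are hypothetical.

```python
def vn_update(llr, c2v_prev):
    """V2C message for one edge: subtract the previous C2V message,
    then normalize so that the most reliable entry becomes zero."""
    v2c = [l - r for l, r in zip(llr, c2v_prev)]
    smallest = min(v2c)
    return [v - smallest for v in v2c]


def llr_update(v2c, c2v):
    """Updated LLR for one edge: sum of the V2C and the new C2V message."""
    return [qv + rv for qv, rv in zip(v2c, c2v)]


# Illustrative values only (not taken from the paper).
v2c = vn_update([4, 1, 6, 3], [1, 0, 4, 0])
print(v2c)
print(llr_update(v2c, [0, 3, 1, 3]))
```

In a layered schedule these two steps are applied per layer, so the LLR written back by one layer is the input to the VN updating step of the next layer.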
B. CN UPDATING FOR THE T-MM ALGORITHM
The T-MM algorithm proposed in [13] avoids the FB computation process used in [8] and [11] and generates the $d_c$ C2V messages $R_{m,n}(a)$ in parallel for each CN, which enhances the decoding throughput. We now present details of the computation for the CN updating function $F_{CN}$ used in the T-MM algorithm [13]. In the following, the iteration index $i$ is omitted in order to simplify the notation, since the entire description relates to the same iteration. When updating CN (layer) $m$, the V2C messages $Q_{m,n}(a)$, $a \in$ GF($q$) and $n \in \mathcal{N}(m)$, are stored in a $q \times d_c$ array. Fig. 1(a) shows an example for a CN, where $d_c = 5$ and $q = 4$. This is called the normal domain representation. Declercq and Gunnam [12] proposed a normal domain to delta domain (N2D) transformation defined by
$$\Delta Q_{m,n}(\Delta a) = Q_{m,n}(a), \quad \Delta a = a \oplus z_n, \tag{5}$$
where $\oplus$ denotes the addition operation in GF($q$), and $z_n \equiv \arg\min_{a \in \mathrm{GF}(q)} Q_{m,n}(a)$ is called the reference symbol for VN $n$. This transformation ensures that the most reliable message is always in the first position of each column as shown in Fig. 1(b), where the second column (1, 11, 0, 5) in Fig. 1(a) is transformed to (0, 5, 1, 11) in Fig. 1(b). It is worth noting that the reference symbol is $\alpha$, i.e., $z_n = \alpha$, for this VN. In this delta-domain representation, a path selects one symbol $\omega_i$ from each of the $d_c$ columns, i.e., $(\omega_1, \omega_2, \ldots, \omega_{d_c})$, and $\eta = \omega_1 \oplus \omega_2 \oplus \cdots \oplus \omega_{d_c}$ denotes the GF($q$) sum of its symbols.
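The N2D transformation can be illustrated with a short sketch. It assumes that the GF(4) symbols 0, 1, $\alpha$, $\alpha^2$ are labeled 0, 1, 2, 3 in a polynomial basis, so that GF addition becomes a bitwise XOR of the labels; the function name `n2d` is hypothetical.

```python
def n2d(column):
    """Normal-to-delta transform of one V2C column.

    column[a] holds the V2C value for symbol a. Symbols are indexed by
    their binary labels, so GF(2^p) addition is a bitwise XOR
    (an assumption of this sketch).
    """
    # Reference symbol: position of the most reliable (smallest) entry.
    z = min(range(len(column)), key=column.__getitem__)
    # Delta-domain entry for delta_a is the normal-domain entry at delta_a XOR z.
    delta = [column[delta_a ^ z] for delta_a in range(len(column))]
    return delta, z


# Second column of the example: (1, 11, 0, 5) with reference symbol alpha (label 2).
delta, z = n2d([1, 11, 0, 5])
print(delta, z)  # [0, 5, 1, 11] 2
```

Because XOR is its own inverse, applying the same index permutation to the delta-domain column returns the normal-domain column.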
For example, the dotted line, i.e., (0, 1, 0, 0, 0) in Fig. 1(b), forms a path for which $\eta$ is 1, i.e., $\eta = 0 \oplus 1 \oplus 0 \oplus 0 \oplus 0 = 1$. It is also worth noting that several paths can yield the same $\eta$ value. For example, the solid line also forms a path where $\eta = 1$. In this paper, the reliability of symbol $\omega_i$ is denoted as $\gamma_i$. Since the path reliability is dominated by the most unreliable symbol within the path, it can be approximated as $\max_{1 \le i \le d_c} \gamma_i$, where $\gamma_i$ can be obtained from the V2C value for symbol $\omega_i$. For example, the reliability of the dotted line and the solid line is 5 and 3, respectively. The reliability of the most reliable path forming $\eta = 1$ is denoted as $I(\eta = 1)$. It can be verified that $I(\eta = 1)$ is 3 for this example. In [13], $I(\eta)$ is identified as the intrinsic message. The column positions that generate the intrinsic message $I(\Delta a)$ are denoted as $P(\Delta a)$. For example, $P(\Delta a = 1) = (0, 1)$. In Fig. 1(b), an extra column $I(\Delta a)$ is used to store the values of the intrinsic message $I(\eta)$, and an extra column $P(\Delta a)$ is used to store the associated positions.
As shown in Fig. 1(b), the most reliable path is the all-zero path, which yields $\eta = 0$ and $I(\eta = 0) = 0$. For the calculation of the other $I(\eta)$ values, the concept of the configuration set (conf) proposed in [8] can be used so as to reduce the complexity. Consider the paths in the configuration set conf(1, 1), which includes all paths that differ from the all-zero path in exactly one symbol. A path in conf(1, 1) is called a one-deviation path, as indicated by the dotted line in Fig. 1(b). A path that differs from the all-zero path in exactly two symbols is called a two-deviation path, as indicated by the solid line in Fig. 1(b). The configuration set conf(1, 2) consists of all the one- and two-deviation paths. Recall that $I(\eta = \Delta a)$ is determined by the most reliable path yielding $\eta = \Delta a$, and the reliability of a path is dominated by the most unreliable symbol. Therefore, only the minimum V2C value for each of the non-zero rows needs to be considered. For row $\Delta a$, this minimum value is denoted as $m1(\Delta a)$. As shown in Fig. 1(b), the elements shaded in gray are the $m1(\Delta a)$ elements, i.e., (5, 1, 3). It is worth noting that it is impossible to form a two-deviation path that passes through elements 5 and 1, since these two elements are located in the same column.
In addition to $I(\Delta a)$, the reliability of the second most reliable path where $\eta = \Delta a$, known as the extrinsic message $E(\Delta a)$, should be determined in order to recover the C2V values. For example, the solid line indicated in Fig. 1(b) is the path that yields $I(\eta = 1)$. The dotted line is the second most reliable path, and, hence, generates $E(\eta = 1)$. Since considering all possible paths is too complex, the $E(\Delta a)$ value is only calculated from the reliability values of one-deviation paths [13]. If the source of $I(\eta = \Delta a)$ is a one-deviation path, $E(\Delta a) = m2(\Delta a)$, where $m2(\Delta a)$ is the second minimum value for row $\Delta a$. If the source of $I(\eta = \Delta a)$ is a two-deviation path, $E(\Delta a) = m1(\Delta a)$. The extrinsic messages $E(\Delta a)$, $\Delta a \in$ GF($q$), can be obtained according to
$$E(\Delta a) = \begin{cases} m1(\Delta a), & \text{if } I(\Delta a) \text{ is obtained from a two-deviation path}, \\ m2(\Delta a), & \text{if } I(\Delta a) \text{ is obtained from a one-deviation path}, \end{cases} \tag{6}$$
and are stored in an additional column, as illustrated in Fig. 1(b). Each symbol for the C2V values is recovered by using either $I(\Delta a)$ or $E(\Delta a)$ depending on the value of $P(\Delta a)$. In Fig. 1
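In summary, the method used for the calculation of the C2V messages is as follows.
1) Intrinsic message: $I(\Delta a)$ is calculated according to
$$I(\Delta a) = \min_{\text{paths in conf}(1,2)\,:\,\eta = \Delta a}\ \max_{1 \le i \le d_c} \gamma_i, \tag{7}$$
where $\gamma_i$ can be obtained from the elements of $\{m1(\Delta a) \mid \Delta a \in \mathrm{GF}(q)\}$, and then the column position $P(\Delta a)$ is recorded.
2) C2V message: $\Delta R_{m,n}(\Delta a)$ is equal to either $I(\Delta a)$ or $E(\Delta a)$ based on the value of $P(\Delta a)$:
$$\Delta R_{m,n}(\Delta a) = \begin{cases} \lambda \times I(\Delta a), & \text{if } n \notin P(\Delta a), \\ \lambda \times E(\Delta a), & \text{if } n \in P(\Delta a), \end{cases} \tag{8}$$
where $\lambda$ is a scaling factor and $0 \le \lambda \le 1$.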
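The calculation summarized above can be sketched in a few lines. The sketch assumes XOR-labeled GF(4) symbols, considers only one-deviation candidates and two-deviation candidates built from pairs of $m1$ elements, and applies the extrinsic value to the columns listed in $P(\Delta a)$. The delta-domain array `dq` is hypothetical: its second column (0, 5, 1, 11) and its row for $\alpha$, (2, 1, 3, 12, 6), follow the numbers quoted for the Fig. 1 example, while the remaining entries are made up so that the $m1$ elements are (5, 1, 3).

```python
from itertools import combinations


def cn_update(dq, lam=1.0):
    """dq[delta_a][n]: delta-domain V2C values (row = symbol, column = VN)."""
    q, dc = len(dq), len(dq[0])
    m1, m2, pos = {}, {}, {}
    for da in range(1, q):
        order = sorted(range(dc), key=lambda n: dq[da][n])
        pos[da], m1[da], m2[da] = order[0], dq[da][order[0]], dq[da][order[1]]
    # Start from the one-deviation candidates: I = m1, E = m2.
    I = [0] + [m1[da] for da in range(1, q)]
    E = [0] + [m2[da] for da in range(1, q)]
    P = [()] + [(pos[da],) for da in range(1, q)]
    # Two-deviation candidates built from pairs of m1 elements.
    for a, b in combinations(range(1, q), 2):
        if pos[a] == pos[b]:
            continue  # same column: not a legal path
        eta, rel = a ^ b, max(m1[a], m1[b])
        if rel < I[eta]:
            I[eta], E[eta] = rel, m1[eta]
            P[eta] = tuple(sorted((pos[a], pos[b])))
    # Column n receives E if it lies on the intrinsic path, otherwise I.
    R = [[lam * (E[da] if n in P[da] else I[da]) for n in range(dc)]
         for da in range(q)]
    return I, E, P, R


dq = [
    [0, 0, 0, 0, 0],     # delta_a = 0
    [7, 5, 9, 6, 8],     # delta_a = 1        (hypothetical except column 1)
    [2, 1, 3, 12, 6],    # delta_a = alpha
    [3, 11, 4, 9, 7],    # delta_a = alpha^2  (hypothetical except column 1)
]
I, E, P, R = cn_update(dq)
print(I, E, P[1])  # [0, 3, 1, 3] [0, 5, 2, 4] (0, 1)
```

With these values the sketch reproduces $I(\Delta a) = (0, 3, 1, 3)$, $E(\Delta a) = (0, 5, 2, 4)$ and $P(\Delta a = 1) = (0, 1)$ as quoted in the text.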
Since all the C2V messages are in the delta domain, a delta to normal domain (D2N) transformation defined by
$$R_{m,n}(a) = \Delta R_{m,n}(\Delta a), \quad a = \Delta a \oplus z_n \oplus \theta, \tag{9}$$
where $\theta = \bigoplus_{n=1}^{d_c} z_n$, is used to obtain the C2V messages in the normal domain. Fig. 1(d) shows the result after recovering the C2V messages.
In the direct-storage method presented in [13], all the C2V messages $R_{m,n}(a)$ are stored in memory, resulting in a large area overhead. It can be seen from (8) that $\Delta R_{m,n}(\Delta a)$ is either $I(\Delta a)$ or $E(\Delta a)$ depending on $P(\Delta a)$. In addition, it can be seen from (9) that $z_n \oplus \theta$, known as the inverse reference symbol ($z^{*}_{n}$), is required in order to obtain $R_{m,n}(a)$. Lacruz et al. [14] proposed an indirect-storage method that is used to store $3 \times (q - 1)$ messages, including $I(\Delta a)$, $E(\Delta a)$, $P(\Delta a)$, and $d_c$ inverse reference symbols $z^{*}_{n}$. In this method, additional recovery units are needed in order to recover the C2V messages, i.e., $R_{m,n}(a)$.
III. PROPOSED T-MM ALGORITHM WITH EARLY TERMINATION
A. REDUCTION IN CN MESSAGES
In order to reduce the storage requirements for the CN messages, a compression method, which stores the $L$ most reliable intrinsic messages, i.e., $\{I(\Delta a_1),$
, $S_L$, and $z^{*}_{n}$ need to be stored. The $L$ most reliable intrinsic messages can be determined after all the intrinsic messages are calculated. However, this approach requires that all the one-deviation and two-deviation paths are considered, resulting in a higher complexity. Since the intrinsic messages are formed from the elements in $\{m1(\Delta a) \mid \Delta a \in \mathrm{GF}(q), \Delta a \ne 0\}$ based on (7), it is more efficient to use the $L'$ minimum values, which are denoted as $m1_1, m1_2, \ldots, m1_{L'}$, to calculate the $L$ most reliable intrinsic messages. For this purpose, these $L'$ minimum values are ordered such that $m1_1 \le m1_2 \le \cdots \le m1_{L'}$. In addition, let $p_i$ and $\beta_i$ respectively denote the position and field symbol of $m1_i$.
The field symbols for the $L'$ minimum values, i.e., $m1_1, m1_2, m1_3, \ldots, m1_{L'}$, can produce $C^{L'}_{1}$ one-deviation paths and, at most, $C^{q-1}_{2}$ two-deviation paths, since a legal two-deviation path must satisfy the condition that the positions of the two constituent $m1(\Delta a)$ elements are different. It can be seen that $L' \le L$ can be easily achieved. We use $(m1_j, m1_k)$ to denote a candidate that may form a two-deviation path passing through elements $m1_j$ and $m1_k$. If $p_j \ne p_k$, the two-deviation path is legal. The reliability of the two-deviation path $(m1_j, m1_k)$ is $m1_j$ when $m1_j > m1_k$. Fig. 2 illustrates the candidates for the calculation of $I(\Delta a_i)$ from $i = 1$ to $i = 10$. It can be observed that $I(\Delta a_1)$ is only formed from the one-deviation candidate $m1_1$, since the value of candidate $m1_1$ is the smallest. Candidates $m1_2$ and $(m1_2, m1_1)$ have the same chance of being selected for $I(\Delta a_2)$, since both candidates have the same reliability. Therefore, $\Delta a_2$ is equal to either $\beta_2$ or $\beta_2 \oplus \beta_1$. However, the probability that $I(\Delta a_2)$ is obtained from $m1_2$ is greater than 50%, since $(m1_2, m1_1)$ is an illegal path when $p_1 = p_2$. For $I(\Delta a_3)$, candidates $m1_2$ and $(m1_2, m1_1)$ should be considered again, since only one of these two candidates can be selected for $I(\Delta a_2)$. Moreover, all candidates that have a reliability $m1_3$, which are $m1_3$, $(m1_3, m1_1)$, and $(m1_3, m1_2)$, should be considered in the calculation of $I(\Delta a_3)$ due to the illegal path issue for candidate $(m1_2, m1_1)$. Similarly, there are seven candidates in the calculations for $I(\Delta a_4)$, $I(\Delta a_5)$, and $I(\Delta a_6)$. Since a candidate that has a higher reliability also has a higher priority, the priority for the candidates having a reliability $m1_3$ is higher than that for the candidates having a reliability $m1_4$. Therefore, it is possible that the reliability of $I(\Delta a_i)$, $i = 4, 5, 6$, is $m1_3$, and the reliability of $I(\Delta a_i)$, $i = 7, 8, 9, 10$, is $m1_4$.
It can be observed from Fig. 2 that there are too many candidates in the calculation of $I(\Delta a_i)$. Since we need to check whether or not the path for a two-deviation candidate is legal, a higher computational complexity is introduced compared to that for the one-deviation candidates. In order to simplify the calculation of the intrinsic messages, a priority assignment is used when the decoder selects paths that have the same reliability. For example, candidate $m1_2$ is assigned a higher priority compared to $(m1_2, m1_1)$. Therefore, the calculation of $I(\Delta a_2)$ always considers candidate $m1_2$, and, hence, $\Delta a_2 = \beta_2$. Since both candidates have the same reliability values, but are associated with different field symbols, the priority assignment does not affect the value of the intrinsic message. Fig. 3 shows the percentage of path types for the calculation of $I(\Delta a_i)$, $i = 1, 2, \ldots, 31$, together with the priority assignment, where the 32-ary (837, 726) LDPC code and $E_b/N_0 = 4.0$ dB are considered. It can be observed that both $I(\Delta a_1)$ and $I(\Delta a_2)$ are always selected from the one-deviation candidates. It can be deduced that the candidates for $I(\Delta a_3)$ are reduced to $(m1_2, m1_1)$ and $m1_3$. Since candidate $(m1_2, m1_1)$ has a higher reliability compared to candidate $m1_3$, $I(\Delta a_3)$ is usually selected from $(m1_2, m1_1)$.
Although the priority assignment approach can be used to reduce the number of candidates for the calculation of $I(\Delta a_i)$, $i \le 3$, the number of candidates for the calculation of $I(\Delta a_i)$, $i \ge 4$, cannot be reduced. For example, seven candidates still need to be considered in the calculation of $I(\Delta a_4)$, $I(\Delta a_5)$, and $I(\Delta a_6)$. Moreover, when calculating $I(\Delta a_i)$, it is necessary to check whether or not the considered candidate duplicates a path that yields an $\eta$ value equal to $\Delta a_1, \Delta a_2, \ldots$, or $\Delta a_{i-1}$. For example, when the value of $I(\Delta a_3)$ is selected from candidate $(m1_2, m1_1)$, in the calculation of $I(\Delta a_4)$, candidate $m1_3$ should be checked based on the condition $\beta_2 \oplus \beta_1 = \beta_3$. When the condition $p_1 = p_2$ is satisfied, the value of $I(\Delta a_3)$ is selected from candidate $m1_3$. Therefore, in the calculation of $I(\Delta a_4)$, candidate $m1_3$ should not be considered. When either of these two conditions is not satisfied, the calculation for $I(\Delta a_4)$ should consider the other candidates, such as $(m1_3, m1_1), (m1_3, m1_2), \ldots, (m1_4, m1_3)$. Similarly, candidate $(m1_3, m1_1)$ should be checked based on the duplicate condition $\beta_3 \oplus \beta_1 = \Delta a_2$, and the legal path condition $p_3 \ne p_1$. It can be seen that an increasing number of conditions must be checked in sequence in order to perform the calculation of $I(\Delta a_i)$, $i \ge 4$, and, hence, a higher latency occurs in the calculation of the intrinsic messages, with the result being that the hardware clock frequency is limited.
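The candidate bookkeeping described above (legal-path check, priority on ties, and duplicate check) can be sketched as follows. This is an illustration of the unsimplified selection whose sequential checks motivate the method proposed next, not the proposed hardware method itself. It assumes XOR-labeled field symbols; the function names and the example values are hypothetical.

```python
from itertools import combinations


def build_candidates(minima):
    """minima: list of (value m1_i, position p_i, symbol beta_i), ascending."""
    cands = [(v, s, 1) for v, _, s in minima]               # one-deviation
    for (vj, pj, sj), (vk, pk, sk) in combinations(minima, 2):
        if pj != pk:                                        # legal path only
            cands.append((max(vj, vk), sj ^ sk, 2))         # two-deviation
    # Smaller value = more reliable; on ties the one-deviation candidate wins.
    return sorted(cands, key=lambda c: (c[0], c[2]))


def select_intrinsic(cands, L):
    """Pick the L most reliable candidates with distinct field symbols."""
    chosen, used = [], set()
    for value, symbol, kind in cands:
        if symbol in used:
            continue                                        # duplicate eta
        used.add(symbol)
        chosen.append((value, symbol, kind))
        if len(chosen) == L:
            break
    return chosen


# Hypothetical minima: (value, column position, field symbol label).
minima = [(1, 1, 2), (3, 0, 3), (5, 1, 1)]
print(select_intrinsic(build_candidates(minima), L=3))
# [(1, 2, 1), (3, 3, 1), (3, 1, 2)]
```

In this example the third intrinsic message comes from the two-deviation candidate formed by the two smallest minima, which matches the observation that $I(\Delta a_3)$ is usually selected from $(m1_2, m1_1)$.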
Moreover, a data dependency exists between the intrinsic and extrinsic messages, as can be seen from (6), resulting in a higher complexity. For example, $E(\Delta a_4)$ is equal to $m1(\Delta a = \beta_3 \oplus \beta_1)$ after $I(\Delta a_4)$ is determined from $(m1_3, m1_1)$ and $\Delta a_4$ is determined as $\beta_3 \oplus \beta_1$. If $I(\Delta a_4)$ is equal to $m1_3$, then $E(\Delta a_4)$ is equal to $m2(\Delta a = \beta_3)$. Therefore, the calculation
To address these issues, a method is proposed that reduces the number of candidates. In the proposed simplified candidate method, either one- or two-deviation candidates are used in the calculation of a specific $I(\Delta a_i)$ when $i \ge 4$. For example, the calculation of $I(\Delta a_4)$ only considers the two one-deviation candidates $m1_3$ and $m1_4$, and the calculation of $I(\Delta a_5)$ only considers those two-deviation candidates where the reliability values are equal to either $m1_3$ or $m1_4$. Fig. 4 shows the candidates for $I(\Delta a_1)$ to $I(\Delta a_{10})$ using both the priority assignment and the simplified candidate method. It can be observed that almost all of the intrinsic messages are the same as those shown in Fig. 2, but are associated with different field symbols. Moreover, the data dependency between the intrinsic and extrinsic messages is reduced significantly such that the related hardware can be implemented efficiently. The calculation of $E(\Delta a_i)$, $i \ge 4$, only considers the elements from either $\{m1(\Delta a) \mid \Delta a \in \mathrm{GF}(q), \Delta a \ne 0\}$ or $\{m2(\Delta a) \mid \Delta a \in \mathrm{GF}(q), \Delta a \ne 0\}$. For example, $E(\Delta a_4)$ only considers the elements from $\{m2(\Delta a) \mid \Delta a \in \mathrm{GF}(q), \Delta a \ne 0\}$, while both $E(\Delta a_5)$ and $E(\Delta a_6)$ consider elements from $\{m1(\Delta a) \mid \Delta a \in \mathrm{GF}(q), \Delta a \ne 0\}$. The details of this are described in Section IV-B.
We now present a method for recovering C2V messages from compressed messages. The proposed decompression formula is
where $K$ is a constant. In this paper, $K$ is equal to $E(\Delta a_L)$.
As can be seen from (10), the proposed method avoids using all the extrinsic messages, and, hence, the extrinsic messages can be compressed. As with (8), a scaling factor $\lambda$ is also used for all $\Delta R(\Delta a)$, $\Delta a \in$ GF($q$), values. It can also be seen from Fig. 5 that when $L = 4$, the proposed T-MM algorithm achieves almost the same error-rate performance as that achieved by the decoder described in [14]. Compared to the decoder presented in [15], the proposed decoder has a gain of 0.08 dB when $L = 4$ is used for both decoders. It can be observed from Fig. 6 that the proposed decoder has a gain of 0.15 dB when $L = 8$ is used for the 64-ary (1536, 1344) LDPC code. Table 1 shows the CN memory requirements when using the proposed T-MM algorithm compared to the algorithms used in previous work. The last row of this table shows the requirements for the 32-ary (837, 726) code when using $w = 6$ quantization bits. It can be seen that only 2L memory bits are required in order to store $P(\Delta a)$, where = log $d_c$, since only two-deviation candidates need to be considered in the calculation of $I(\Delta a_3)$. Finally, since the proposed recovery formula uses the $I(\Delta a_L)$ and $E(\Delta a_L)$ values to obtain the C2V messages for $\Delta a \notin S_L$, both the intrinsic and the extrinsic messages can be compressed to $L$ elements. In contrast, the work presented in both [15] and [17] only compresses the intrinsic messages, and, hence, the proposed decoder only needs around 60% of the memory bits for the CN messages.
B. LLR UPDATING IN THE DELTA DOMAIN
Conventionally, all the C2V messages in the delta domain, i.e., $\Delta R(\Delta a)$, are recovered in advance from the intrinsic and extrinsic messages, and are transformed to the normal domain according to (9). Then, the V2C message and the C2V message in the normal domain, i.e., $Q(a)$ and $R(a)$, are used to update the LLR messages in the normal domain, i.e., $L(a)$. In order to reduce the complexity of the LLR updating process, an LLR updating method in the delta domain is proposed.
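For the V2C messages, the relationship between the delta domain representation and the normal domain representation is $Q^{i}_{m,n}(a) = \Delta Q^{i}_{m,n}(a \oplus z_n)$ based on (5). Similarly, it can be observed from (9) that $R^{i}_{m,n}(a) = \Delta R^{i}_{m,n}(a \oplus z_n \oplus \theta)$, where $\theta = \bigoplus_{n=1}^{d_c} z_n$. Then, the LLR updating equation (4) can be written as
$$L^{i+1}_{n}(a) = \Delta Q^{i}_{m,n}(a \oplus z_n) + \Delta R^{i}_{m,n}(a \oplus z_n \oplus \theta). \tag{11}$$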
It can be observed from (11) that the field symbols for the C2V messages and the V2C messages are different, and the difference value is $\theta$. If the transformation related to $\theta$ for the C2V messages is performed first, the LLR messages can then be updated in the delta domain. Following that, the transformation from the delta domain to the normal domain is performed for the LLR messages according to $L_{m,n}(a) = \Delta L_{m,n}(a \oplus z_n)$. Finally, the operations related to the inverse permuted factor $h^{-1}_{m,n}$ are performed at the end of each layered processing operation.
Since the C2V messages can be obtained from both the intrinsic and the extrinsic messages, the transformation related to $\theta$ can be directly executed for both the intrinsic and the extrinsic messages. Consider again the example shown in Fig. 1(c), where the CN messages that include $I(\Delta a)$, $E(\Delta a)$ and $P(\Delta a)$ are calculated based on (6) and (7). Then, all the intrinsic and extrinsic messages are transformed to a modified delta domain based on $\Delta a' = \Delta a \oplus \theta$. For example, $I(\Delta a) = (0, 3, 1, 3)$ is transformed to $I(\Delta a') = (1, 3, 0, 3)$, and $E(\Delta a) = (0, 5, 2, 4)$ is transformed to $E(\Delta a') = (2, 4, 0, 5)$, as shown in Fig. 7(a). Using the transformed intrinsic and extrinsic messages, all the C2V messages for $\Delta a'$ are recovered as shown in Fig. 7(a). Then, the LLR messages are updated based on $\Delta Q^{i}_{m,n}(\Delta a) + \Delta R^{i}_{m,n}(\Delta a')$. For example, the V2C messages for $\Delta a = \alpha$ are (2, 1, 3, 12, 6), as shown in Fig. 1(b), and the C2V messages for $\Delta a' = \alpha$ are (0, 0, 0, 0, 0), as shown in Fig. 7(a). The resultant LLR messages for $\Delta a = \alpha$ are (2, 1, 3, 12, 6), as shown in Fig. 7(b). Since the transformation related to $\theta$ is performed in advance, the C2V messages for the LLR updating process can be efficiently determined as either $I(\Delta a')$ or $E(\Delta a')$ based on $P(\Delta a')$.
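The transformation related to $\theta$ and the subsequent delta-domain LLR update can be sketched as below. The sketch assumes XOR-labeled GF(4) symbols (0, 1, $\alpha$, $\alpha^2$ as 0, 1, 2, 3). The value $\theta = \alpha$ is not stated in the text; it is inferred here from the quoted vectors, and the function names are hypothetical.

```python
def shift_by_theta(msg, theta):
    """Re-index a CN message vector so that entry k holds msg[k XOR theta]."""
    return [msg[k ^ theta] for k in range(len(msg))]


def llr_row(dq_row, i_val, e_val, p_cols):
    """Delta-domain LLR row: V2C value plus I or E, chosen per column by P."""
    return [qv + (e_val if n in p_cols else i_val)
            for n, qv in enumerate(dq_row)]


theta = 2  # alpha, inferred from the example vectors (assumption)
I_t = shift_by_theta([0, 3, 1, 3], theta)
E_t = shift_by_theta([0, 5, 2, 4], theta)
print(I_t, E_t)  # [1, 3, 0, 3] [2, 4, 0, 5]

# Row alpha: both transformed CN values are 0, so the LLR row equals the V2C row.
print(llr_row([2, 1, 3, 12, 6], I_t[2], E_t[2], ()))  # [2, 1, 3, 12, 6]
```

Shifting the $q$-entry intrinsic and extrinsic vectors once per layer replaces a per-column re-indexing of the recovered C2V messages, which is the source of the complexity reduction described above.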
For the proposed LLR updating process in the delta domain, an efficient architecture can be devised, the details of which will be described in Section IV-C. Compared to the conventional method used in [15] , the proposed LLR updating method can achieve a reduction in area of up to 13%.
C. EARLY TERMINATION SCHEME
In order to increase the decoding throughput, an ET scheme that is able to reduce the number of iterations is desired. A decoded codeword $x = (x_1, x_2, \ldots, x_N)$ is obtained through the hard decisions of the LLR messages. Then, the syndrome $s$ for the decoded codeword is calculated based on $s = (s_1, s_2, \ldots, s_M)^T = Hx^T$, where the $m$-th syndrome component $s_m$ is obtained using the check-sum $s_m = \bigoplus_{n=1}^{N} x_n h_{m,n}$. The decoder can be terminated when $s = 0$ is satisfied. This scheme is called the standard ET scheme. However, realizing this standard ET scheme is too complex, especially for a large field size $q$. In addition, the standard ET scheme usually requires additional clock cycles to calculate the syndrome values. Park et al. [9] proposed a node-level convergence detection scheme that was used to avoid the calculation of $s$. In this ET scheme, a minimum number of iterations $I_{min}$ is required. The decoding process is terminated when the hard decisions for the V2C messages, i.e., $z_n$, remain unchanged for $I_{\tau}$ consecutive iterations. Both criteria are used to prevent false convergence without causing any degradation in the error-rate performance. Since the node-level convergence detection requires at least $I_{min}$ iterations, it is expected that a greater average number of iterations is required compared to the standard ET scheme.
In this work, a low-complexity ET scheme is proposed, where the minimum number of iterations $I_{min}$ is not required. Based on (11), the LLR messages are calculated via the sum of the V2C and C2V messages. It can be seen that the value of the LLR messages is equal to the value of the V2C messages when the C2V messages are zero. However, the hard decisions for the LLR messages and the V2C messages are not necessarily the same. In fact, the hard decisions for the LLR messages are equal to the hard decisions for the V2C messages when the value of $\theta$ is zero, and, hence, the desired check-sum is also zero. Based on this observation, we now present an efficient ET scheme.
A layer that satisfies the condition $\theta = 0$ is viewed as a valid layer. However, any previous valid layers might become invalid during the processing operation for the current layer, since the hard decisions for the LLR messages might be changed. When the decoder nears convergence as the number of iterations increases, the value of $\theta$ is equal to zero with a high probability. It is proposed that the decoding process is terminated if $\tau$ consecutive layers are valid layers. In the implementation, a counter, which is initially set to zero, is used to accumulate the number of consecutive valid layers. If an invalid layer occurs before the counter reaches $\tau$, the counter is reset. Fig. 8 shows the error-rate performance and the average number of iterations, denoted as $I_{av}$, using different $\tau$ values for the 32-ary LDPC decoder. Also included in Fig. 8 is the performance where the ET scheme is not implemented. It is worth noting that the ET can be triggered without introducing any notable performance degradation as long as all variable nodes are connected to the valid layers. This means that an evaluation of the validation may not be required for all check nodes, and the value of $\tau$ can be less than the number of check nodes. The exact value of $\tau$ depends on the structure of the LDPC code. It can be observed in Fig. 8 that when the proposed ET scheme is applied to the 32-ary LDPC code using $\tau = 0.25M$, a poor performance is observed since only 1/4 of the total number of layers are valid, and they cannot connect to all variable nodes. In contrast, when $\tau = 0.75M$, all variable nodes are successively connected via 3/4 of the total number of check nodes, and a lower number of iterations can be achieved without introducing any visible degradation in FER compared to the decoder where the ET scheme is not used. Fig. 9 compares the error-rate performance between the proposed scheme and the scheme proposed in [9]. It can be observed that when $I_{min} = 6$ and $I_{\tau} = 2$ for the scheme proposed in [9], a comparable error-rate performance to the proposed scheme can be achieved. However, it can be seen from Fig. 10 that the proposed ET scheme has a much lower average number of iterations in the high SNR region compared to the ET scheme presented in [9].
FIGURE 9. Comparison of the FER for the 32-ary LDPC decoder between the proposed ET scheme and the ET scheme proposed in [9], where $I_{MAX} = 8$ and BPSK modulation are considered.
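The counter-based termination rule can be sketched as follows. The sketch assumes that the decoder exposes the value of $\theta$ for each processed layer as a sequence; the function name `layers_until_termination` and the example sequence are hypothetical.

```python
def layers_until_termination(thetas, tau):
    """Return the number of layers processed when ET triggers, or None.

    thetas: theta value of each processed layer, in processing order.
    tau:    required number of consecutive valid layers (theta == 0).
    """
    run = 0  # counter of consecutive valid layers, initially zero
    for count, theta in enumerate(thetas, start=1):
        run = run + 1 if theta == 0 else 0  # an invalid layer resets the counter
        if run >= tau:
            return count
    return None


print(layers_until_termination([0, 0, 3, 0, 0, 0, 0], tau=3))  # 6
```

Because the counter runs across iteration boundaries, termination can occur in the middle of an iteration as soon as $\tau$ consecutive layers are valid.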
IV. EFFICIENT LAYERED DECODER ARCHITECTURE
A. DECODER ARCHITECTURE
Fig. 11(a) shows the proposed row-parallel layered decoder, where $d_c$ VN units, a CN unit, $q$ LLR updating units, an ET unit, and two memory blocks are included. The decoding process starts by receiving the channel LLR values $y$, which are used to initialize the a posteriori LLR messages ($L^{i}_{n}(a)$) stored in the L Mem. Following the method proposed in [20], a combined permutation (CP) unit is used. In the initial step, the CP unit considers only the permuted factors $h_{m,n}$. For the remaining layered processing operations, a combined permuted factor $\mu_{m,n} = h^{-1}_{m,n} \times h_{(m,n),n}$ is used, where $h_{(m,n),n}$ represents the permuted factor for the next layer in the same column in $H$. Following that, the LLR messages are sent to the VN unit to produce the V2C messages and the reference symbol, i.e., $Q^{i}_{n}(a)$ and $z_n$, respectively. The decoder requires $d_c$ VN units, and each VN unit calculates $q$ V2C messages. The VN unit receives all CN messages, including $I(\Delta a)$, $P(\Delta a)$, $E(\Delta a)$, $S_L$, and $z^{*}_{n}$, from the CN Mem block, together with the remaining LLR messages from the L Mem block. The VN units then update all the V2C messages and the reference symbols according to (1) and (2).
The V2C messages and the reference symbols are used as the input for the CN unit, which comprises d c N2D units, a path calculation (Path Cal.) unit, two FIFO registers and L + d c GF adders. The Path Cal. unit produces the compressed intrinsic messages, the compressed extrinsic messages, and the associated information, including P( a), S L and θ values. In order to perform transformation related to θ for the CN messages, L GF adders are used to execute a = a ⊕ θ . The set of these L transformed field symbols, i.e., { a 1 , a 2 , a 3 , a 4 }, is denoted as S L . The inverse reference symbols are calculated using d c GF adders based on z * n = z n ⊕ θ . The LLR updating units produce the LLR messages in the delta domain, i.e., L i+1 n ( a), using the compressed intrinsic and extrinsic messages together with the V2C messages stored in the FIFO. The D2N units transform the LLR messages to the normal domain. The decoding process is repeated until the maximum number of iterations is reached, or until the ET conditions are satisfied.
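The N2D and D2N transforms mentioned above can be sketched as index remappings, since addition in GF(2^p) is a bitwise XOR of the symbol labels. This sketch assumes the usual delta-domain convention in which the delta-domain message at offset η equals the normal-domain message of symbol z n ⊕ η; the function names `n2d` and `d2n` and the 8-ary example values are hypothetical.

```python
def n2d(normal_msgs, z):
    """Normal-to-delta transform: delta[eta] = normal[eta XOR z].

    z is the reference symbol (the most reliable symbol), so delta[0]
    holds the message of the reference symbol itself.
    """
    return [normal_msgs[eta ^ z] for eta in range(len(normal_msgs))]


def d2n(delta_msgs, z):
    """Delta-to-normal transform: the same XOR remapping, applied again."""
    return [delta_msgs[a ^ z] for a in range(len(delta_msgs))]


# Small 8-ary example (illustrative values, smaller = more reliable).
normal = [4, 0, 7, 2, 9, 5, 3, 6]
z = 1                      # symbol with the smallest (zero) message
delta = n2d(normal, z)     # delta[0] is now the reference symbol's message
restored = d2n(delta, z)   # XOR is an involution, so this recovers `normal`
```

Because XOR with a fixed symbol is its own inverse, the same remapping serves both directions, which is why only GF adders are needed for the θ-related transformations.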
For the NB-LDPC codes constructed using the method proposed in [18], the parity-check matrix H consists of α-multiplied circulant permutation matrices, in which each row is the right cyclic shift of the row above it multiplied by the primitive element α. In other words, the permuted factors in the same column are the same, meaning that the combined permuted factor µ m,n becomes the identity element. Consequently, the CP unit can be moved to the input of the decoder, as shown in Fig. 11(b), and, hence, the number of decoding cycles can be reduced. This means that the CP operation is executed before the decoding process, and then additional d c GF multipliers are used to execute operations related to h −1 m,n following the completion of the decoding process.
Fig. 12 shows the block diagram for the proposed CN unit. The d c N2D units are used to transform q × d c V2C messages from the normal domain to the delta domain based on the reference symbols z n . The scaling factor λ is set to 0.75 to ensure that the majority of the intrinsic and extrinsic values are less than 2^(w−1). Therefore, the number of quantization bits for the intrinsic and extrinsic messages can be reduced by 1 without introducing any notable degradation in error-rate performance. Since only minimization and maximization operations are involved in the CN unit, the scaling operation can be moved from the output of the CN unit to the input, which results in a reduction in the area required for the CN unit. The two-minimum finder (2-MF) then determines the first two minimum values and the associated first minimum position, i.e., m1( a), m2( a), and p( a), among { Q m,n ( a)|n ∈ N (m)}. The L-MF is used to identify the first L = 4 minimum values among {m1( a)| a ∈ GF(q), a ≠ 0}. The L-MF unit is based on the architecture proposed in [15]. Finally, the C2V compressor unit yields the desired compressed intrinsic and extrinsic messages, and records the associated field symbols, i.e., the elements in S L .
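The behavior of the 2-MF and L-MF stages can be modeled in a few lines. This is a purely behavioral sketch rather than the hardware architecture of [15] or [19]: the function names `two_min` and `l_min_symbols` and the example values are hypothetical, and ties are broken arbitrarily by position.

```python
import heapq


def two_min(values):
    """Return (m1, m2, p): smallest value, second smallest, index of m1."""
    m1 = m2 = float("inf")
    p = -1
    for n, v in enumerate(values):
        if v < m1:
            m1, m2, p = v, m1, n   # old minimum becomes the runner-up
        elif v < m2:
            m2 = v
    return m1, m2, p


def l_min_symbols(m1_by_symbol, L=4):
    """Return the L nonzero field symbols with the smallest m1 values."""
    candidates = [(m1, a) for a, m1 in enumerate(m1_by_symbol) if a != 0]
    return [a for _, a in heapq.nsmallest(L, candidates)]


# 2-MF over the inputs of one field symbol (illustrative values).
m1, m2, p = two_min([5, 2, 7, 2, 9])

# L-MF over the per-symbol first minima of an 8-ary toy example;
# index 0 (the zero symbol) is excluded from the search.
s_l = l_min_symbols([0, 9, 3, 7, 1, 8, 2, 6])
```

Scaling every input by a positive constant such as λ = 0.75 scales the two minima by the same constant and leaves the minimum position unchanged, which is the property that allows the scaling operation to be moved to the input of the CN unit.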
Fig. 13 shows the circuit for the C2V compressor based on Table 2. In general, the calculations of E( a i ) require 2-layer MUXes due to the dependent relationship between I ( a i ) and E( a i ). The first-layer MUXes are used to select elements from either {m2( a)| a ∈ GF(q), a ≠ 0} or {m1( a)| a ∈ GF(q), a ≠ 0} based on the candidate type used to calculate I ( a i ), and the second-layer MUXes are used to determine the E( a i ) value from the outputs of the first-layer MUXes based on η = a i . Thanks to the simplified candidate method, only the calculation of E( a 3 ) requires 2-layer MUXes. In contrast, the calculation of both the E( a 3 ) and E( a 4 ) values described in [15] requires 2-layer MUXes, resulting in a larger area.
B. CN UNIT
The timing diagram for the conventional CN unit is shown in Fig. 14(a), where the L-MF operates after calculating the values for m1( a), m2( a), and p( a) using the 2-MF. If the 2-MF is based on the tree structure (TS) proposed in [19], three clock cycles are required for a 32-ary LDPC decoder with d c = 27. Fig. 14(b) shows an example for the TS-based 2-MF with eight inputs. The mVG is able to determine the first minimum value, and the 2mVG is able to determine the first two minimum values. It can be seen from Fig. 14(b) that the TS-based 2-MF requires an additional delay of two MUXes in order to identify the second minimum value for the eight inputs. For the d c = 27 inputs considered in this work, the TS-based 2-MF requires an additional delay of five MUXes.
It is observed in Fig. 14(a) that m2( a) values are unnecessary and are stored in FIFO registers until the C2V compressor begins. In order to reduce both the processing latency and the area, a two-stage 2-MF is proposed based on the fact that the decision for the second minimum value can be postponed. Fig. 15(a) illustrates an example of the two-stage 2-MF with eight inputs. The first stage only identifies the first minimum value and the associated position among the eight inputs. More importantly, in the first stage, the second minimum value is not identified and only the log d c candidates for the second minimum values (CM 2s) are reserved by the MUX units. Therefore, the latency is lower compared to that of the TS-based 2-MF. Fig. 15(b) shows the proposed timing diagram for the CN unit, where the process for determining the m2( a) value is divided into two stages. It can be observed that the L -MF unit and the second stage of the two-stage 2-MF operate in parallel. When the first minimum value is processed in the L -MF unit, the second minimum value m2( a) is also identified using parallel comparison. As a result, the CN unit requires fewer clock cycles compared to the conventional CN unit. The proposed CN unit only requires six clock cycles, as shown in Fig. 15(b) , whereas the conventional CN unit requires seven. This implies that the proposed CN unit requires fewer pipeline stages to achieve the same throughput. Table 3 shows a comparison between the TS-based 2-MF [19] and the proposed 2-stage 2-MF. The operation frequency for the two-stage 2-MF using d c = 27 inputs can be increased by 13% and a reduction of 26 comparators (cmps) can be achieved compared to the TS-based 2-MF presented in [19] . Moreover, the area of the FIFO registers shown in Fig. 12 , which store Q n ( a) and z n , is also reduced by 14.3% thanks to the reduction in the number of clock cycles and the pipeline stages required in the CN unit. 
More details of the design for the proposed CN unit are presented in [16] .
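The idea behind the two-stage 2-MF described above can be illustrated with a tournament model: the first stage finds the minimum and keeps only the values that lost directly to it, and the second stage picks the smallest of those candidates. This is a behavioral sketch under stated assumptions, not the circuit of Fig. 15(a): the names `first_stage` and `second_stage` are hypothetical, and an unpaired input is simply passed to the next round.

```python
def first_stage(values):
    """Tournament minimum search that postpones the second-minimum decision.

    Returns (m1, p, cm2): the minimum, its position, and the values that
    lost directly to it (the candidates for the second minimum).
    """
    # Each entry: (value, position, values beaten so far).
    entries = [(v, i, []) for i, v in enumerate(values)]
    while len(entries) > 1:
        nxt = []
        for k in range(0, len(entries) - 1, 2):
            a, b = entries[k], entries[k + 1]
            win, lose = (a, b) if a[0] <= b[0] else (b, a)
            nxt.append((win[0], win[1], win[2] + [lose[0]]))
        if len(entries) % 2:          # odd count: last entry gets a bye
            nxt.append(entries[-1])
        entries = nxt
    m1, p, cm2 = entries[0]
    return m1, p, cm2


def second_stage(cm2):
    """Resolve the second minimum from the reserved candidates."""
    return min(cm2)


# Eight-input example: only three candidates survive the first stage.
vals8 = [9, 4, 7, 3, 8, 6, 5, 10]
m1, p, cm2 = first_stage(vals8)
m2 = second_stage(cm2)
```

The second minimum must have lost directly to the overall minimum at some round, so at most one candidate per tournament round needs to be kept. For eight inputs that is three candidates, and for 27 inputs at most five, which is consistent with reserving about log d c candidates.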
C. DELTA-DOMAIN-BASED LLR UPDATING AND VN UNITS
Conventionally, the VN unit and the updating unit for the LLR messages occupy a significant percentage of the chip area. In this section, we present an efficient LLR updating method together with the associated VN unit based on the proposed delta-domain LLR updating method presented in Section III-B. As shown in Fig. 11(a) , the row-parallel decoder contains q LLR updating units for q field symbols, and each LLR updating unit yields d c LLR messages for the same field symbol in the delta domain representation. Fig. 16 illustrates the proposed LLR updating unit that includes a selection unit (SU), and d c updating units. For the calculation of the intrinsic and extrinsic messages, the SU should consider L + 2 cases, which are a = a 1 , a = a 2 , a = a 3 , a = a 4 , a = θ , and the others. Therefore, two (L + 2)-to-1 MUXes are required.
After identifying the intrinsic and extrinsic messages for a , the LLR values are updated in the updating unit. Since the V2C values for a = 0 are always zero in the delta domain, i.e., Q m,n ( a = 0) = 0, L n ( a = 0) is directly equal to either I ( a = 0) or E( a = 0). Therefore, the d c adders for a = 0 in the updating unit can be removed. Moreover, in the proposed decoder, the FIFO registers only store (q − 1) × d c × w × D CN bits, where D CN denotes the number of pipeline stages required for the CN unit. Fig. 17(a) shows the architecture for the VN unit, which consists of a decompression unit (DU), q adders, and a normalized (Nor.) unit, to calculate q V2C messages and the reference symbol z n . The Nor. unit, which consists of a minimum finder (1-MF) and q adders, is used to realize the normalization function in order to ensure that the most reliable V2C message for each VN is always zero. The 1-MF identifies the first minimum value and the associated field symbol among q elements in {Q i m,n (a)|a ∈ GF(q)}. The field symbol for the minimum value among {Q m,n (a)|a ∈ GF(q)} is the reference symbol z n .
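The normalization performed by the Nor. unit can be sketched as a minimum search followed by a subtraction. This is a minimal behavioral model: the function name `normalize` and the 4-ary example values are hypothetical, and smaller values are taken to mean more reliable symbols, as in the Min-Max convention.

```python
def normalize(v2c):
    """Subtract the minimum so the most reliable message becomes zero.

    Returns (normalized_messages, z), where z is the index of the
    minimum, i.e., the reference symbol.
    """
    z = min(range(len(v2c)), key=lambda a: v2c[a])   # 1-MF: position of min
    m = v2c[z]
    return [x - m for x in v2c], z                   # q subtractions


# Toy example with q = 4 (illustrative values only).
normalized, z_ref = normalize([5, 3, 9, 4])
```

After this step the entry at the reference symbol is zero, which is why the delta-domain message for a = 0 never needs to be stored or added.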
The DU converts all C2V messages R m,n ( a = 0) from the delta domain to the normal domain based on (9). Finally, all C2V messages in the normal domain are used to update the final V2C messages. Fig. 17(b) illustrates the architecture for the C2V Gen. unit that includes a selection unit (SU V ) and d c recovery units (RUs). In contrast to the SU in the LLR updating unit, the selection unit SU V in the VN unit only needs to consider L + 1 cases, which are a = a 1 , a = a 2 , a = a 3 , a = a 4 , and the others. After the operation by the SU V , d c RUs recover C2V values which are either I ( a) or E( a) based on P( a).
V. PERFORMANCE EVALUATION
The proposed layered 32-ary (837, 726) LDPC decoder was implemented using a 90-nm 1P9M CMOS process, and the operating conditions follow those described in the work presented in [15] and [17]. Fig. 18 shows a layout plot for the overall decoder, and a summary of the decoder is provided in Table 4. The throughput (TP) for the decoder, in Mbps, is given by (12), where D denotes the number of clock cycles (the number of pipeline stages) required for each layered processing operation. The decoder achieves a throughput of 1.64 Gbps at a maximum clock frequency of 526.32 MHz, when I MAX = 8 and D = 10. Fig. 19 shows the progressive improvements in area and throughput performance for each of the proposed techniques, where the T-MM decoder described in [15] is used as the base-line decoder. Since the proposed CN unit compresses both the intrinsic and the extrinsic messages, and an efficient two-stage 2-MF is adopted, a reduction in area of 19.9% and an improvement in throughput of 38.86% can be achieved compared to the base-line decoder. The proposed recovery function enhances the error-rate performance, since it can approximate the true R( a i ) value well in the desired range. The updating technique for delta domain LLR together with its related architecture also reduces both the computational and storage complexity. Therefore, the TAR is increased by about 45.41%. The two proposed techniques can achieve a 30% reduction in area and a 53.14% improvement in decoding throughput compared to the base-line decoder [15]. In order to further enhance the hardware efficiency, a low-complexity ET scheme is proposed such that the proposed decoder has a 2.85 times higher throughput at E_b/N_0 = 4.5 dB. The proposed decoder can achieve a maximum throughput of 4.68 Gbps with an FER of 3.25 × 10⁻⁶ at E_b/N_0 = 4.5 dB.
A comparison of the performance with that of the 32-ary NB-LDPC decoders described in the previously published literature is summarized in both Table 5 and Fig. 20. Also included in Table 5 are the EMS-based decoders presented in [9] and [10] for a low-rate NB-LDPC code. The error-rate performance for the proposed decoder is comparable to that of the T-MM-based decoders presented in [15] and [17]. Compared to the base-line decoder described in [15], the proposed decoder not only achieves a slightly better error-rate performance, but also has a much better TAR value since a 40% reduction in the memory requirements for the CN messages is achieved. The basic-set T-MM algorithm presented in [17] also compressed the intrinsic messages and reduced the CN calculation. However, a loss in FER performance can be observed when compared to the proposed decoder.
Since the modified T-MM algorithm is less complex than those presented in [14], [15], and [17], the computational complexity is significantly reduced. Moreover, the proposed techniques, including efficient compression and decompression, a two-stage 2-MF architecture, and delta-domain LLR updating, prevent long critical paths, resulting in a higher operating frequency compared to other T-MM decoders. It can be observed from Table 5 that the proposed decoder achieves the highest normalized clock frequency and throughput compared to the state-of-the-art decoders, even when the ET scheme is not enabled. The throughput and TAR can be significantly enhanced when using the proposed ET scheme.
VI. CONCLUSION
An efficient row-parallel layered NB-LDPC decoder based on a hardware-friendly T-MM algorithm has been presented, together with the associated architecture. The hardware-friendly T-MM algorithm reduces the requirements for CN storage by implementing an efficient compression method, and enhances the decoding performance via appropriate decompression. Moreover, an efficient two-stage 2-MF is proposed such that the hardware efficiency of the CN unit can be further improved. In order to simplify the computational and storage complexity, a method of updating the LLR in the delta domain is proposed. A low-complexity ET scheme is proposed that achieves a higher decoding throughput. Only a minor overhead is introduced, since the ET circuits can be effectively shared with the CN unit. The proposed techniques have been demonstrated by implementing a 32-ary (837, 726) LDPC decoder using a 90-nm process in an area of 6.86 mm². The decoder achieves a maximum throughput of 4.68 Gbps with an FER of 3.25 × 10⁻⁶ at E_b/N_0 = 4.5 dB.
