Abstract-In numerous memory and communication systems, Bose-Chaudhuri-Hocquenghem (BCH) codes are widely employed to enhance reliability. A one-pass Chase soft-decision decoding algorithm for BCH codes was previously proposed to achieve significant performance improvement over traditional hard-decision decoding without a large increase in computational complexity. The bottleneck in conventional one-pass Chase decoding is the procedure of judging whether an obtained error locator polynomial is valid. In this brief, a novel algorithm that efficiently verifies the eligibility of each generated error locator polynomial is proposed. The problem is first reformulated as a polynomial modulo problem, where repeated squaring can be employed for further simplification. To decrease the critical path delay and hardware complexity, an efficient polynomial division algorithm based on polynomial inversion is also proposed. In addition, a VLSI architecture for the proposed algorithm is presented. Implementation results show that the proposed eligibility checking algorithm reduces the gate count to only 12% of that of a conventional polynomial selection algorithm without introducing any speed penalty. The projected area reduction achieved in a complete one-pass Chase decoder is approximately 75%. In addition, post-layout simulation shows that the proposed algorithm is 20 times more power efficient than the conventional method.
I. INTRODUCTION

BOSE-CHAUDHURI-HOCQUENGHEM (BCH) codes are among the most popular error correction codes (ECCs) employed in numerous memory and communication systems due to their relatively simple decoding and outstanding error correction capability. Conventional BCH codes can correct up to $t = \lfloor d_{\min}/2 \rfloor$ errors, where $d_{\min}$ stands for the minimum code distance. The hard-decision decoding (HDD) procedure for BCH codes is well developed. Berlekamp's algorithm is usually employed to find the error locator polynomial, and the Chien search is then utilized to locate each error. To better utilize the soft information, or confidence, of each bit before decoding, Chase proposed a soft-decision decoding (SDD) algorithm that first flips the $\eta$ least reliable bits before conventional HDD [1]. By utilizing soft information, the Chase II algorithm can correct up to $t + \eta$ errors.
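The test-pattern generation behind Chase II decoding can be sketched as follows. This is a minimal illustration under stated assumptions, not the decoder of [1]: the function name `chase2_test_patterns`, the list-based bit representation, and the convention that a smaller reliability value means a less reliable bit are all hypothetical choices made for the example.

```python
from itertools import product

def chase2_test_patterns(hard_bits, reliabilities, eta):
    """Yield the 2**eta test words obtained by flipping every subset
    of the eta least reliable hard-decision bits (assumed convention:
    smaller reliability value = less reliable bit)."""
    # Indices of the eta least reliable positions.
    weakest = sorted(range(len(hard_bits)), key=lambda i: reliabilities[i])[:eta]
    for flips in product((0, 1), repeat=eta):
        word = list(hard_bits)
        for pos, flip in zip(weakest, flips):
            word[pos] ^= flip  # flip the bit when flip == 1
        yield word

patterns = list(chase2_test_patterns([1, 0, 1, 1, 0],
                                     [0.9, 0.1, 0.8, 0.2, 0.7], eta=2))
```

Each test word would then be passed to a hard-decision decoder. The number of test words doubles with every increment of $\eta$, which is the exponential growth that one-pass Chase decoding is meant to avoid.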
The most straightforward and widely adopted way to implement Chase decoding in hardware is to use an existing BCH HDD circuit iteratively, with control circuitry that generates the different test patterns [2], [3]. This approach, however, becomes prohibitively expensive when $\eta$ is large since the computation time grows exponentially with $\eta$. To circumvent this, a one-pass Chase decoding algorithm was proposed in [4]. The original work targets Reed-Solomon (RS) codes, but it can be easily adapted for BCH codes. Berlekamp's algorithm only needs to be applied once to obtain one starting error locator polynomial. Other error locator polynomials corresponding to different error patterns can then be easily derived from the starting polynomial by using the polynomial update algorithm outlined in [4]. This is a significant result since the algorithm turns the exponentially scaled Chase decoding algorithm into a linearly scaled algorithm, making soft-decision Chase decoding more attractive.
A hardware architecture implementing the one-pass Chase decoding algorithm was discussed in [5]. Zeros in the error locator polynomial corresponding to the flipped bits were taken out to maintain the order of the polynomial, reducing the computational effort. In that work, it was stated that most of the area and power was consumed by the highly parallel Chien search, which has to run as fast as the polynomial update block. An interpolation-based Chase decoding algorithm was developed in [6] to avoid the expensive parallel Chien search. A 2.3 times higher efficiency was reported in that work.
To better accommodate the one-pass Chase decoding algorithm, an eligibility verification algorithm is proposed in this brief. The algorithm checks if the obtained error locator polynomial from the polynomial update algorithm or Berlekamp's algorithm is able to generate a correct error pattern such that the corrected code is a valid code word. The main motivation of the proposed algorithm is that an invalid polynomial can be easily detected without actually finding out all roots of that polynomial. The problem of checking the eligibility is first converted into a problem of calculating polynomial modulus. Calculation of modulus can be then conveniently solved by repeated squaring. In addition, to further reduce the computation complexity and the critical path delay in a hardware implementation, a polynomial inversion algorithm is proposed. A hardware architecture and complexity analysis for the proposed algorithm are provided in Section III. In Section IV, a design example is presented. Compared with conventional exhaustive searching, the proposed algorithm achieves a reduction of 88% in gate counts.
1549-7747 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
II. ALGORITHM
It has been shown in [7] that sufficient and necessary conditions for an error locator polynomial to locate errors such that a legal (not necessarily correct) word can be recovered are as follows.
1) The error locator polynomial $\Lambda(x)$ has exactly $e$ distinct nonzero roots in GF($2^m$).
where $L_\Lambda$ is the length of a linear feedback shift register described by $\Lambda(x)$.
It is also known that $d_\Lambda \le L_\Lambda$ [8], where $d_\Lambda$ is the degree of the polynomial $\Lambda(x)$. Consequently, the process of verifying an error locator polynomial can be divided into two cases. The condition $d_\Lambda = L_\Lambda$ can be easily checked by identifying the location of the first nonzero coefficient in $\Lambda(x)$. Furthermore, simulation shows that this condition is satisfied most of the time, even when the error locator polynomial is not a correct one. Therefore, counting the number of roots of $\Lambda(x)$ is the key to determining whether the obtained error locator polynomial is valid.
A zero root of $\Lambda(x)$ can be identified by checking whether the constant coefficient $\Lambda_0$ is zero. A $\Lambda(x)$ with a zero root is discarded right away without further processing. Therefore, in the remainder of this brief, it is assumed that $\Lambda(x)$ does not have a root that is equal to zero.
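An auxiliary polynomial $d(x)$ is defined according to (1), where the operator $\gcd(a, b)$ stands for finding the greatest common divisor of $a$ and $b$. Since $x^{2^m} - x$ has all the elements in GF($2^m$) as its roots [9], it can be shown that the degree of $d(x)$ is equal to the number of distinct roots that $\Lambda(x)$ has in GF($2^m$)
$$d(x) = \gcd\left(\Lambda(x),\; x^{2^m} - x\right). \tag{1}$$
Euclid's algorithm can be employed here to obtain a further simplified expression, as shown in the following equation:
$$d(x) = \gcd\left(\Lambda(x),\; (x^{2^m} - x) \bmod \Lambda(x)\right). \tag{2}$$
Following (2), it can be proven that the sufficient and necessary condition for $\Lambda(x)$ to have $d_\Lambda$ distinct roots in GF($2^m$) is
$$(x^{2^m} - x) \bmod \Lambda(x) = 0. \tag{3}$$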
Equation (3) is too expensive to be computed directly when $m$ is large. Fortunately, the square-and-multiply method [10] can be utilized here to save a significant amount of computation. In addition, since $2^m$ is a power of 2, only repeated squaring is needed (i.e., no multiplication step). More specifically, $x^{2^m} \bmod \Lambda(x)$ can be calculated iteratively as follows. Starting from the trivial case
$$x^{2^i} \bmod \Lambda(x) = x^{2^i}, \quad 2^i \le d_\Lambda - 1 \tag{4}$$
we can then compute $x^{2^{i+1}} \bmod \Lambda(x)$ as
$$x^{2^{i+1}} \bmod \Lambda(x) = r_i^2(x) \bmod \Lambda(x) \tag{5}$$
where
$$r_i(x) = x^{2^i} \bmod \Lambda(x). \tag{6}$$
By doing this, only $m - \log_2(d_\Lambda - 1)$ polynomial modulo operations are needed to compute $x^{2^m} \bmod \Lambda(x)$. Each polynomial modulo operation is at most on the order of $t$.
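The root-counting test by repeated squaring can be sketched in software as follows. This is a minimal illustration, not the hardware algorithm of this brief: it assumes a toy field GF($2^4$) built on the primitive polynomial $x^4 + x + 1$, stores polynomials as coefficient lists with the lowest degree first, uses plain long division for the modulo step, and starts the squaring from $x$ instead of skipping the trivial iterations. All function names are hypothetical.

```python
M = 4            # assumed toy field GF(2^4)
PRIM = 0b10011   # assumed primitive polynomial x^4 + x + 1

def gf_mul(a, b):
    """Multiply two GF(2^M) elements (shift-and-add with reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM
    return r

def gf_inv(a):
    """Inverse of a nonzero element, computed as a^(2^M - 2)."""
    r = 1
    for _ in range(2 ** M - 2):
        r = gf_mul(r, a)
    return r

def poly_mod(a, lam):
    """Remainder of a(x) / lam(x) by long division (lam[-1] must be nonzero)."""
    a, d = a[:], len(lam) - 1
    inv_lead = gf_inv(lam[-1])
    for i in range(len(a) - 1, d - 1, -1):
        q = gf_mul(a[i], inv_lead)
        if q:
            for j in range(d + 1):
                a[i - d + j] ^= gf_mul(q, lam[j])
    r = a[:d]
    return r + [0] * (d - len(r))

def poly_square(a):
    """Square a polynomial in characteristic 2: square each coefficient."""
    out = [0] * (2 * len(a) - 1)
    for i, c in enumerate(a):
        out[2 * i] = gf_mul(c, c)
    return out

def poly_from_roots(roots):
    """Build prod (x + r) over GF(2^M); used only to create test inputs."""
    p = [1]
    for r in roots:
        shifted = [0] + p                          # x * p(x)
        scaled = [gf_mul(r, c) for c in p] + [0]   # r * p(x)
        p = [s ^ t for s, t in zip(shifted, scaled)]
    return p

def is_eligible(lam):
    """True if lam(x) has deg(lam) distinct nonzero roots in GF(2^M)."""
    if lam[0] == 0:                  # zero root: discard right away
        return False
    x_mod = poly_mod([0, 1], lam)
    r = x_mod
    for _ in range(M):               # r <- r^2 mod lam, M times
        r = poly_mod(poly_square(r), lam)
    return r == x_mod                # x^(2^M) congruent to x modulo lam
```

For instance, `is_eligible(poly_from_roots([1, 2, 3]))` accepts a cubic with three distinct nonzero roots, whereas `is_eligible([4, 0, 1])` rejects $x^2 + 4$, which has a double root in this field.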
To carry out the modulo operation, conventional long division can be employed. Each modulo operation takes about $d_\Lambda^2$ multiplications. One problem with this straightforward implementation is that the critical path is long, as will be shown in Section III.
Considering that the divisor in all modulo operations is $\Lambda(x)$ [see (2)-(6)], it is worth spending some effort on converting $\Lambda(x)$ into a form with which the divisions in the following stages can be performed more efficiently. Inspired by the algorithm in [11], we propose the following polynomial inversion algorithm to improve the efficiency and critical path delay of the polynomial division.
Let $\Lambda_r(x)$ represent the polynomial whose coefficients are those of $\Lambda(x)$ arranged in reverse order. That is,
$$\Lambda_r(x) = x^{d_\Lambda}\,\Lambda(1/x). \tag{7}$$
Similar notation applies to other polynomials. Then, it can be shown that the reverse quotient polynomial can be computed as in (8), where $\bar{\Lambda}_r(x)$ is defined as the inverse polynomial of $\Lambda_r(x)$ such that (9) holds. From (9) and (10), one can obtain $\bar{\Lambda}_r(x)$.
It is noted that, in (10), only unknown coefficients need to be calculated. The process of computing the inverse polynomial $\bar{\Lambda}_r(x)$ takes one inversion,
/2) addition when $d_\Lambda - 1$ is a power of two. As will be shown in Section III, the computational complexity of deriving the inverse polynomial $\bar{\Lambda}_r(x)$ is much less than the complexity of computing (5). Furthermore, precomputing the inverse polynomial actually reduces the effort of computing (5).
It should be noted that the derivation of the proposed algorithm implicitly assumes that the BCH code is not shortened. Therefore, the proposed algorithm is not rigorously applicable to shortened BCH codes. It is theoretically possible for the error locator polynomial of a shortened BCH code to have the correct number of roots over GF($2^m$), yet one or more of these roots are not in the valid range. In this case, the proposed algorithm fails to detect the invalid error locator polynomial, whereas the exhaustive Chien search is still able to. The simulation, however, shows that the probability of this failure is negligibly low. Therefore, we argue that the proposed algorithm can be applied to a shortened BCH code for practical purposes.
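The idea of dividing by a precomputed inverse of the reversed divisor can be sketched as follows. This is an illustration of the general technique only, assuming the textbook Newton iteration for power-series inversion specialized to characteristic 2; it is not claimed to be the exact recursion or precision used in this brief. The toy field GF($2^4$) with primitive polynomial $x^4 + x + 1$, the coefficient-list representation (lowest degree first), and all function names are assumptions.

```python
M = 4            # assumed toy field GF(2^4)
PRIM = 0b10011   # assumed primitive polynomial x^4 + x + 1

def gf_mul(a, b):
    """Multiply two GF(2^M) elements."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM
    return r

def gf_inv(a):
    """Inverse of a nonzero element, computed as a^(2^M - 2)."""
    r = 1
    for _ in range(2 ** M - 2):
        r = gf_mul(r, a)
    return r

def poly_mul_trunc(a, b, n):
    """Product a(x) b(x) mod x^n, returned as n coefficients."""
    out = [0] * n
    for i, ai in enumerate(a[:n]):
        if ai:
            for j, bj in enumerate(b[:n - i]):
                out[i + j] ^= gf_mul(ai, bj)
    return out

def inverse_mod_xk(f, k):
    """Return g(x) with f(x) g(x) = 1 mod x^k (requires f[0] != 0).
    Newton step g <- g (2 - f g) reduces to g <- f g^2 in characteristic 2."""
    g, prec = [gf_inv(f[0])], 1      # the single field inversion
    while prec < k:
        prec = min(2 * prec, k)      # precision doubles every step
        g = poly_mul_trunc(f, poly_mul_trunc(g, g, prec), prec)
    return g

def divmod_with_inverse(a, lam, lam_r_inv):
    """Quotient and remainder of a(x) / lam(x) using the precomputed
    inverse of the reversed divisor (needs deg a >= deg lam >= 1)."""
    n, d = len(a) - 1, len(lam) - 1
    k = n - d + 1                                 # quotient coefficients
    q = poly_mul_trunc(a[::-1], lam_r_inv, k)[::-1]
    prod = poly_mul_trunc(q, lam, d)              # low d coefficients of q*lam
    r = [a[i] ^ prod[i] for i in range(d)]        # subtraction is XOR
    return q, r

lam = [6, 7, 0, 1]                        # x^3 + 7x + 6
lam_r_inv = inverse_mod_xk(lam[::-1], 4)  # computed once, reused below
q, r = divmod_with_inverse([0, 0, 0, 0, 1], lam, lam_r_inv)
```

Because every modulo operation in the repeated squaring uses the same divisor, the inverse is computed once and each later division reduces to truncated multiplications, which is the motivation stated above.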
III. VLSI ARCHITECTURE
Although the proposed algorithm can be used for a polynomial of arbitrary degree, it is, in practice, sufficient to check only polynomials of degree $t$. By doing this, the hardware complexity can be significantly reduced without noticeable performance degradation. Fig. 1 compares the performance achieved by SDD employing the proposed eligibility verification algorithm with the performance of SDD using direct polynomial search and the performance of HDD. In the proposed eligibility checking algorithm, $d_\Lambda$ is set to $t$. That is, the verification process only runs on polynomials of degree $t$. Polynomials with degrees less than $t$ are adopted as valid solutions. The figure shows that the proposed eligibility verification algorithm can effectively identify valid error locator polynomials. The performance of the proposed algorithm is degraded neither by applying the algorithm to a shortened BCH code nor by only running verification on polynomials of degree $t$.
A VLSI architecture of the proposed eligibility checking algorithm is shown in Fig. 2. There are three main blocks: Blocks I and II are for polynomial multiplication, and Block III is used for polynomial inversion. Pipelining can be applied at the locations marked with arrows in order to improve the throughput of the system.
There are mainly three finite field operations shown in Fig. 2 aside from the standard multiplexing, delay, and compare operations. The first operation is subtracting polynomials in a finite field. This operation can be done by simply performing bitwise XOR operations on each coefficient of the two polynomials. The second operation is squaring a polynomial in a finite field, which can be achieved by squaring each coefficient of the polynomial. The third operation is to multiply and modulo polynomials. This is the main operation in the proposed algorithm. It can be formulated as a matrix-vector multiplication where the matrix is a Toeplitz matrix. Due to a unique property of Toeplitz matrices, the multiplication-and-modulo operation can be efficiently conducted by employing the circuit shown in Fig. 3. In Fig. 3, diagonal multipliers share the same multiplicands $b_i$. Multipliers in the same column share the same multipliers $a_i$. Products in the same row are then added by an XOR tree to get the final results $c_i$. Since a finite field multiplication can be expressed as a matrix-vector multiplication, matrices associated with the shared multiplicands only need to be calculated once and distributed along the diagonals, reducing gate counts and the critical path delay.
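The second and third operations can be written out for a small case. The block below is a worked illustration under stated assumptions: the squaring identity holds in any field of characteristic 2, while the $3 \times 3$ Toeplitz form assumes that the modulo is a truncation modulo $x^3$ and that $b_i$ populate the diagonals and $a_i$ the input vector, matching the sharing pattern described above. The exact matrix dimensions used in Fig. 3 are not reproduced here.

```latex
% Squaring in characteristic 2: every cross term appears twice and cancels,
% so only the squared coefficients remain.
\left( \sum_{i} a_i x^{i} \right)^{2} = \sum_{i} a_i^{2}\, x^{2i}

% Assumed example: c(x) = a(x) b(x) mod x^3 as a Toeplitz matrix-vector product.
% Each diagonal carries one b_i, each column multiplies one a_i,
% and each row is summed (XORed) into one c_i.
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}
=
\begin{bmatrix}
b_0 & 0   & 0   \\
b_1 & b_0 & 0   \\
b_2 & b_1 & b_0
\end{bmatrix}
\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}
```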
Hardware complexity of the main blocks in the proposed eligibility checking circuit is summarized in Table I. The proposed circuit has an area-latency product on the order of $(m - \log_2 t)\,t^2$. This is much less than $nt$, i.e., the area-latency product of a conventional exhaustive search. The area-latency product is defined as the product of the number of multipliers and the number of clock cycles needed to complete the task. It serves as a quick estimate of how the complexity of the circuit grows with the size of the problem. Only finite field multipliers are counted because they dominate the area of the circuit. To give an immediate comparison, the gate count ratio between the proposed algorithm and the conventional method is less than $(n - k)/n$, which is the redundancy ratio of the ECC. The redundancy ratio in most memory systems is much less than one. Although the employed nonconstant multiplier takes a larger area than the constant multiplier employed in the Chien search, the saving on the area-latency product is still significant, as will be shown in Section IV. In addition, the diagonal-sharing technique mentioned above also helps reduce the gate counts effectively.
In addition, compared with the straightforward implementation with long division that requires $(m - \log_2 t)\,t^2$ multipliers, the proposed algorithm only needs approximately $(3/4)(m - \log_2 t)\,t^2$ multipliers, reducing gate counts by roughly 25%. This saving is achieved by precomputing the inverse polynomial $\bar{\Lambda}_r(x)$.
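The order-of-growth expressions above can be evaluated for the (4200, 4096) code over GF($2^{13}$) used in Section IV. This is a rough sketch: the value $t = 8$ is an assumption inferred from $n - k = 104 = 13 \times 8$ parity bits, the function names are hypothetical, and the results are growth estimates rather than gate counts.

```python
from math import log2

def area_latency_proposed(m, t):
    """Order-of-growth estimate (m - log2 t) * t^2 for the proposed circuit."""
    return (m - log2(t)) * t ** 2

def area_latency_exhaustive(n, t):
    """Order-of-growth estimate n * t for the exhaustive search."""
    return n * t

m, n, k = 13, 4200, 4096
t = (n - k) // m                      # assumption: m * t parity bits, so t = 8

proposed = area_latency_proposed(m, t)        # long-division variant
with_inverse = 0.75 * proposed                # with the precomputed inverse
exhaustive = area_latency_exhaustive(n, t)
redundancy = (n - k) / n

print(proposed, with_inverse, exhaustive)
print(round(proposed / exhaustive, 4), round(redundancy, 4))
```

Under these assumptions the estimate for the proposed circuit is well below that of the exhaustive search, and the ratio of the two stays under the redundancy ratio $(n - k)/n$. The measured gate count ratio reported in Section IV is higher than this estimate because nonconstant multipliers are larger than the constant multipliers of the Chien search.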
IV. DESIGN EXAMPLE
In this section, the proposed eligibility checking circuit is implemented for a (4200, 4096) code over GF($2^{13}$). Comparisons of gate counts and critical path delay are summarized in Table II. In this design example, the finite field multiplier and squaring circuit proposed in [12] are used. The multiplier in [12] is not the optimal choice in terms of gate counts. It is adopted in our design example because of its simplicity. More sophisticated multipliers such as those in [13] can be employed to further reduce the gate counts. In the table, the numbers of flip-flops are read directly from the synthesized netlists, and the combinational gate counts reported by the synthesis tool are converted to equivalent NAND2 gate counts for comparison.
As shown in the table, the gate count of the proposed eligibility checking circuit is only around 12% of that of the exhaustive Chien search. This number can be further projected to estimate the overall saving on the area of the decoder. In [5], it is shown that the polynomial searching block occupies 85% of the total area of the decoder. Therefore, it is estimated that the proposed eligibility verification circuit can reduce the area of the one-pass Chase decoder by 75% while having a similar performance. Furthermore, due to the polynomial inversion step, the critical path delay of the proposed circuit is reduced to a value similar to that of the Chien search. Table III is calculated according to the simulated critical path delay with a 10% margin. As noted in Table III, the proposed eligibility checking circuit is 20 times more power efficient than the conventional exhaustive Chien search. This number is larger than the area-saving ratio. This is mainly because the conventional Chien search has a larger activity factor.
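The projected figures follow from simple arithmetic on the numbers quoted above. The sketch below assumes that the rest of the decoder is unchanged and that the 85% area share reported in [5] carries over to this design.

```latex
% Decoder-level area saving: the search block is 85% of the decoder
% and shrinks to 12% of its original size.
0.85 \times (1 - 0.12) = 0.748 \approx 75\%

% A 20-fold gain in power efficiency corresponds to a power saving of
1 - \frac{1}{20} = 0.95 = 95\%
```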
V. CONCLUSION
In this brief, we have presented a novel eligibility verification algorithm that avoids the area and power penalty incurred by the parallel Chien search in a conventional one-pass Chase soft-decision BCH decoder. The proposed algorithm effectively checks the validity of a derived error locator polynomial by counting the number of roots it has. The root-counting problem is transformed into a polynomial modulo problem, which can be efficiently solved by repeated squaring. In addition, an iterative polynomial inversion algorithm is presented to reduce the area and the critical path delay. A hardware architecture for the proposed algorithms is also presented, and its hardware complexity is examined. A design example is implemented for a (4200, 4096) code over GF($2^{13}$). The obtained gate counts and critical path delay are compared with a conventional design. The proposed design achieves an area reduction of about 88% while having a similar critical path delay. This translates into a 75% reduction in the overall decoder area. The proposed design is also placed and routed. The transistor-level simulation shows that, with a critical path delay similar to that of the conventional method, a 95% power saving is achieved.
