I. INTRODUCTION

BOSE-CHAUDHURI-HOCQUENGHEM (BCH) codes were introduced in [1] and [2] and have been studied extensively. Long BCH codes can sometimes achieve better performance than Reed-Solomon codes, which are now widely used in digital video broadcasting, optical communication, and magnetic recording systems. Hence, efficient and high-speed hardware encoding and decoding implementations of BCH codes are of great interest. Long BCH encoding can be implemented directly by a linear feedback shift register (LFSR) (see [3] and [10]). However, this LFSR-based architecture suffers from a large fan-out effect. The LFSR-based systematic encoding of a long BCH code is actually a division circuit with the divisor g(x) (see [10]), and the large fan-out of some XOR gates leads to a large gate delay. In high-speed applications, such as optical communication systems and digital video broadcasting, such LFSR-based long BCH encoding cannot keep up with the data transmission speed. Thus, a long BCH hardware encoder that eliminates the large fan-out effect is needed.
Several parallel encoding architectures for long cyclic codes have been proposed in [4]-[6]. In [7] and [8], encoding based on the J-unfolding method (see [9]), which can eliminate the large fan-out effect, was proposed. The basic idea of [7] and [8] is to modify the generator polynomial of the BCH code such that retiming around the rightmost XOR gates in the J-unfolded LFSR architecture becomes possible. Multiplying by a suitable auxiliary polynomial can enhance the efficiency of encoding and eliminate the large fan-out effect. Our approach here is to replace the long LFSR in BCH systematic encoding with several short parallel LFSRs. It seems possible to combine these two methods for a more efficient hardware encoding of long BCH codes.
In this brief, we give a Chinese remainder theorem (CRT)-based parallel architecture for long BCH encoding that can be used to eliminate the large fan-out effect. The basic idea is to replace the long-division LFSR circuit of the generator polynomial with several short-division LFSR circuits of low-degree polynomials working in parallel. This transformation also needs some multiplication LFSR circuits, which have no large fan-out effect. The advantage of our novel parallel architecture is that the fan-out in the CRT-based long BCH encoding is limited only by $\log_2 N$, where N is the code length of the BCH code.
II. PRELIMINARIES
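Let GF($2^t$) be a finite field of $2^t$ elements, and let α be a primitive element of GF($2^t$). For a designed distance δ, let C(δ) be the binary BCH code of length $N = 2^t - 1$ whose generator polynomial g(x) ∈ GF(2)[x] is the lowest-degree polynomial having $\alpha, \alpha^2, \ldots, \alpha^{\delta-1}$ among its zeros.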
It is well known that the minimum Hamming distance of C(δ) is at least δ. Any polynomial in GF(2)[x] can be factorized into a product of irreducible polynomials in GF(2)[x] (see [11]). Let $w_1(x), \ldots, w_r(x)$ ∈ GF(2)[x] be the distinct monic irreducible polynomials whose zeros are of the form $\alpha^{i 2^d}$ with $1 \le i \le \delta - 1$, where d is an arbitrary nonnegative integer. We know that the generator polynomial g(x) is the product $g(x) = w_1(x) \cdots w_r(x)$. It is clear that $\deg(w_i(x)) \le t$ for $i = 1, \ldots, r$ (see [10]).
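The factor structure of g(x) can be checked by counting 2-cyclotomic cosets, since each distinct coset of an exponent i corresponds to one irreducible factor whose degree equals the coset size. The following is a minimal sketch under stated assumptions: the function names are illustrative, and the code is taken to be narrow-sense, with zeros $\alpha^i$ for $1 \le i \le \delta - 1$ and length $2^t - 1$.

```python
def cyclotomic_coset(i, n):
    """Return the 2-cyclotomic coset {i, 2i, 4i, ...} modulo n."""
    coset = set()
    j = i % n
    while j not in coset:
        coset.add(j)
        j = (2 * j) % n
    return frozenset(coset)


def bch_factor_structure(t, delta):
    """Return (r, deg g, max factor degree) for a narrow-sense BCH code.

    Assumes code length n = 2**t - 1 and zeros alpha**i, 1 <= i <= delta - 1.
    Each distinct coset gives one irreducible factor w_i of g(x).
    """
    n = 2**t - 1
    cosets = {cyclotomic_coset(i, n) for i in range(1, delta)}
    degrees = [len(c) for c in cosets]
    return len(cosets), sum(degrees), max(degrees)


if __name__ == "__main__":
    # Length-15 code with designed distance 5: two factors of degree 4.
    print(bch_factor_structure(4, 5))
```

With t = 4 and δ = 5 this gives two factors of degree 4, so deg(g(x)) = 8. The designed distance passed in is an input chosen by the reader; it is not stated in the text.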
The systematic encoding of a cyclic code with generator polynomial g(x) (deg(g(x)) = n − k) and code length n is processed as follows: a message polynomial m(x) of degree at most k − 1 is encoded as

$$c(x) = m(x)x^{n-k} + \mathrm{Rem}_{g(x)}\big(m(x)x^{n-k}\big),$$

where $\mathrm{Rem}_{g(x)}(f(x))$ denotes the remainder of f(x) divided by g(x).

For multiplying the input polynomial u(x) ∈ GF(2)[x] by a polynomial h(x) ∈ GF(2)[x], we have an LFSR circuit to implement the multiplication with at most nz(h) ≤ deg(h(x)) + 1 XOR gates, where nz(h) is the number of nonzero coefficients in h(x). For dividing the input polynomial u(x) ∈ GF(2)[x] by the polynomial h(x) ∈ GF(2)[x], we have an LFSR circuit with at most nz(h) XOR gates, which outputs the remainder polynomial of the division (see [10] and [11]).

It is clear that r < deg(g(x))/2. Most cyclotomic cosets are of size t, and r is roughly deg(g(x))/t. When t is a prime number, every cyclotomic coset except the trivial one, corresponding to the element 1, is of size t (see [11]). Thus, r = deg(g(x))/t when t is a prime (see Examples 2 and 3 below).
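To make the systematic encoding rule concrete, the sketch below models polynomials over GF(2) as Python integers (bit i holds the coefficient of $x^i$) and computes $m(x)x^{n-k} + \mathrm{Rem}_{g(x)}(m(x)x^{n-k})$ in software. It is only an illustration of the arithmetic that the division LFSR performs, not a hardware description; all function names are assumptions made for this example.

```python
def gf2_rem(a, h):
    """Remainder of a(x) divided by h(x) in GF(2)[x]; h must be nonzero."""
    deg_h = h.bit_length() - 1
    while a.bit_length() - 1 >= deg_h:
        # Cancel the leading term of a(x) with a shifted copy of h(x).
        a ^= h << (a.bit_length() - 1 - deg_h)
    return a


def nz(h):
    """Number of nonzero coefficients of h(x)."""
    return bin(h).count("1")


def systematic_encode(m, g):
    """Return m(x) x^(n-k) + Rem_g(m(x) x^(n-k)), where n - k = deg g."""
    shifted = m << (g.bit_length() - 1)
    return shifted ^ gf2_rem(shifted, g)


if __name__ == "__main__":
    g = 0b1011  # x^3 + x + 1, generator of the (7, 4) Hamming code
    codeword = systematic_encode(0b1001, g)
    print(bin(codeword), gf2_rem(codeword, g), nz(g))
```

For the (7, 4) cyclic Hamming code with g(x) = x^3 + x + 1 and message x^3 + 1, the parity part is x^2 + x, and the resulting codeword is divisible by g(x).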
III. CRT-BASED PARALLEL ARCHITECTURE OF LONG BCH ENCODING
We now recall the CRT [11]. Let f(x) ∈ GF(2)[x] and g(x) ∈ GF(2)[x] be two polynomials. Suppose $g(x) = g_1(x) \cdots g_r(x)$, where $g_1(x), \ldots, g_r(x)$ are pairwise coprime, i.e., $\gcd(g_i(x), g_j(x)) = 1$ for any two distinct i and j. Set $\hat{g}_i(x) = g(x)/g_i(x)$; then $\gcd(\hat{g}_i(x), g_i(x)) = 1$. By using the generalized Euclid algorithm, we can find a polynomial $u_i(x)$ with $\deg(u_i(x)) < \deg(g_i(x))$ such that $u_i(x)\hat{g}_i(x) \equiv 1 \pmod{g_i(x)}$. We have the following result.

A. Chinese Remainder Theorem (See [11])

$$\mathrm{Rem}_{g(x)}\big(f(x)\big) = \sum_{i=1}^{r} \hat{g}_i(x)\,\mathrm{Rem}_{g_i(x)}\big(u_i(x)f(x)\big).$$

Let $g(x) = w_1(x) \cdots w_r(x)$ be the generator polynomial of a BCH code, where $w_1(x), \ldots, w_r(x)$ are the distinct irreducible polynomials in GF(2)[x] as in the previous section. It is well known (see [11]) that $\deg(w_i(x)) \le t$, where t is a fixed constant around $\log_2 N$, and $N = 2^t - 1$ is the code length of the BCH code. It is clear that these polynomials are pairwise coprime. Set $\hat{w}_i(x) = g(x)/w_i(x)$, and let $u_i(x)$ satisfy $u_i(x)\hat{w}_i(x) \equiv 1 \pmod{w_i(x)}$ for $i = 1, \ldots, r$. From

$$\mathrm{Rem}_{g(x)}\big(m(x)x^{n-k}\big) = \sum_{i=1}^{r} \hat{w}_i(x)\,\mathrm{Rem}_{w_i(x)}\big(u_i(x)m(x)x^{n-k}\big),$$

we can have a parallel architecture for immediately getting $\mathrm{Rem}_{g(x)}(m(x)x^{n-k})$. First, we have r parallel LFSR circuits multiplying by $u_1(x), \ldots, u_r(x)$; then r parallel LFSR circuits dividing by $w_1(x), \ldots, w_r(x)$; r parallel LFSR circuits multiplying by $\hat{w}_1(x), \ldots, \hat{w}_r(x)$ in the third step; and finally a circuit summing the outputs from the previous circuits.
Here, the fan-out of the LFSR circuits dividing by $w_1(x), \ldots, w_r(x)$ is upper bounded by t, which is around $\log_2 N$. It is well known that the multiplying and summing circuits have no large fan-out effects and can be executed with small latency. Compared with the direct LFSR-based architecture, although the number of clock cycles is increased in our architecture, the clock period is substantially decreased by eliminating the large fan-out effect. Thus, our parallel architecture for computing $\mathrm{Rem}_{g(x)}(m(x)x^{n-k})$ (the systematic encoding of the BCH code) is suitable for high-speed applications. The speed of this CRT-based parallel architecture of long BCH encoding depends essentially on the number t, which is around $\log_2 N$, where N is the code length of the BCH code.
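The four-stage computation (multiply by $u_i$, reduce modulo $w_i$, multiply by the cofactor $g/w_i$, then sum) can be checked in software against a direct division by g(x). The sketch below is a plain arithmetic model, assuming polynomials over GF(2) stored as Python integers (bit i is the coefficient of $x^i$); the function names are illustrative and nothing here models LFSR timing or fan-out.

```python
def gf2_mul(a, b):
    """Carry-less product of a(x) and b(x) in GF(2)[x]."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_divmod(a, h):
    """Quotient and remainder of a(x) divided by h(x) != 0 in GF(2)[x]."""
    quotient, deg_h = 0, h.bit_length() - 1
    while a.bit_length() - 1 >= deg_h:
        shift = a.bit_length() - 1 - deg_h
        quotient ^= 1 << shift
        a ^= h << shift
    return quotient, a


def gf2_inverse(a, modulus):
    """Inverse of a(x) modulo modulus(x) by the extended Euclid algorithm."""
    r0, r1 = modulus, gf2_divmod(a, modulus)[1]
    s0, s1 = 0, 1  # invariant: s_k * a = r_k (mod modulus)
    while r1:
        q, r = gf2_divmod(r0, r1)
        r0, r1 = r1, r
        s0, s1 = s1, s0 ^ gf2_mul(q, s1)
    if r0 != 1:
        raise ValueError("polynomials are not coprime")
    return gf2_divmod(s0, modulus)[1]


def crt_remainder(f, factors):
    """Rem_g(f) for g = product of pairwise coprime factors, via the CRT."""
    g = 1
    for w in factors:
        g = gf2_mul(g, w)
    total = 0
    for w in factors:
        cofactor = gf2_divmod(g, w)[0]        # g(x) / w_i(x)
        u = gf2_inverse(cofactor, w)          # u_i * cofactor = 1 (mod w_i)
        stage1 = gf2_mul(u, f)                # multiply by u_i
        stage2 = gf2_divmod(stage1, w)[1]     # remainder modulo w_i
        stage3 = gf2_mul(cofactor, stage2)    # multiply by g / w_i
        total ^= stage3                       # summation over GF(2)
    return total


if __name__ == "__main__":
    w1, w2 = 0b10011, 0b11111                 # x^4+x+1 and x^4+x^3+x^2+x+1
    g = gf2_mul(w1, w2)                       # generator of a (15, 7) BCH code
    m = 0b1010011
    print(crt_remainder(m << 8, [w1, w2]) == gf2_divmod(m << 8, g)[1])
```

In this model each loop iteration is independent of the others, which mirrors the r parallel branches of the architecture; only the final XOR accumulation combines them.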
IV. IMPLEMENTATION AND FURTHER COMMENTS
In this section, we give the implementation and the cost of the CRT-based architecture of long BCH encoding.
A. Implementation
Step 1) Multiplication LFSRs for the polynomials $u_1, \ldots, u_r$ with the input polynomial $m(x)x^{n-k}$. Here, the circuits need $\sum_i(\deg(u_i) + 1)$ XOR gates.
Step 2) Division LFSRs for the polynomials $w_1, \ldots, w_r$, whose inputs are the outputs of the circuits in Step 1. Here, the circuits need $\sum_i(\deg(w_i) + 1)$ XOR gates.
Step 3) Multiplication LFSRs for the cofactor polynomials $\hat{w}_i = g/w_i$, $i = 1, \ldots, r$, whose inputs are the outputs of the circuits in Step 2. Here, the circuits need $\sum_i(\deg(g) - \deg(w_i) + 1)$ XOR gates.
Step 4) The summation LFSR of the r outputs in Step 3. Here, the circuits need at most r(t + 1) XOR gates.
We can get an upper bound on the number of XOR gates used directly; it is upper bounded by $\sum_i(\deg(u_i) + 1 + \deg(w_i) + 1 + \deg(g) - \deg(w_i) + 1) + r(t + 1) \le 2r(t + 1) + r(\deg(g) + 2)$. From the estimation of r, this number is roughly $2\deg(g) + (\deg(g)/t)(\deg(g) + 2)$.
1) Example 2 (See [7]): We consider the BCH code generated by a degree-121 polynomial g(x) ∈ GF(2)[x]. Its code length is $2047 = 2^{11} - 1$, and its dimension is 1926. The generator polynomial is the product of 11 distinct irreducible polynomials $w_1, \ldots, w_{11}$ of degree 11. Thus, our architecture needs at most 1595 XOR gates. The fan-out is upper bounded by the largest of $\deg(w_1), \ldots, \deg(w_{11})$, which is at most 11. In some sense, this is better than the architecture in [7].
2) Example 3 (See [8]): We consider the BCH code generated by a degree-507 polynomial g(x) in GF(2)[x]. Its code length is $N = 2^{13} - 1 = 8191$, and its dimension is 7684. The generator polynomial is the product of 39 distinct irreducible polynomials of degree 13 in GF(2)[x]. Thus, our architecture needs at most 20 865 XOR gates. The fan-out of the XOR gates in the architecture is at most 13. In some respects, this is better than the architecture in [8].
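The XOR gate counts quoted in Examples 2 and 3 follow from the rough estimate $2\deg(g) + (\deg(g)/t)(\deg(g) + 2)$ with r = deg(g)/t. The short sketch below simply redoes that arithmetic; the function names are illustrative, and the second function evaluates the unsimplified bound $2r(t + 1) + r(\deg(g) + 2)$ for comparison.

```python
def xor_gate_estimate(deg_g, t):
    """Rough estimate 2*deg(g) + (deg(g)/t)*(deg(g) + 2), with r = deg(g)/t."""
    r = deg_g // t
    return 2 * deg_g + r * (deg_g + 2)


def xor_gate_bound(deg_g, t):
    """Unsimplified bound 2r(t + 1) + r(deg(g) + 2), with r = deg(g)/t."""
    r = deg_g // t
    return 2 * r * (t + 1) + r * (deg_g + 2)


if __name__ == "__main__":
    for deg_g, t in [(121, 11), (507, 13)]:
        print(deg_g, t, xor_gate_estimate(deg_g, t), xor_gate_bound(deg_g, t))
```

The rough estimate reproduces 1595 and 20 865 for the two examples; the unsimplified bound is slightly larger (1617 and 20 943), since it does not replace r(t + 1) by deg(g).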
It is clear that this idea can be used for the systematic encoding of any long cyclic code with generator polynomial $g(x) = g_1(x) \cdots g_r(x)$, where $g_1, \ldots, g_r$ are pairwise coprime polynomials. Moreover, in some cases, if we can choose among generator polynomials with the same code parameters, then it is better to use a generator polynomial g(x) ∈ GF(2)[x] with the property that the numbers of nonzero coefficients in $g_1, \ldots, g_r$ are as small as possible. However, the idea of the CRT-based architecture cannot be used for the encoding of long cyclic redundancy check (CRC) codes (see [4]-[6]), because the generator polynomials of CRC codes are irreducible.
V. CONCLUSION
In this brief, we have presented a CRT-based high-speed parallel architecture for long BCH encoding. The architecture can be used for eliminating the large fan-out effect: the fan-out of this CRT-based parallel architecture is limited only by the logarithm of the code length of the BCH code. It should be noted that our approach of using the CRT to transform the long-division LFSR of the generator polynomial into short-division LFSRs in parallel can be used for the systematic hardware encoding of any long cyclic code whose generator polynomial factors into pairwise coprime polynomials.
