In this paper, a low-power design for the Reed-Solomon (RS) decoder is presented. Our approach includes a novel two-stage syndrome calculator that reduces the syndrome computations by one-half when the received codeword is error-free, a modified Berlekamp-Massey algorithm in the key equation solver, and a termination mechanism in the Chien search circuit. The test chips for the (255, 239) and (208, 192) RS decoders are implemented with 0.25 µm CMOS 1P5M and 0.35 µm CMOS SPQM standard cells, respectively. Simulation results show that our approach works correctly and achieves a large reduction in average power consumption.
Introduction
Among the best-known error-correcting codes, Reed-Solomon (RS) codes are the most widely used block codes in communication and storage systems for improving immunity to burst errors. An (N, K) RS code contains N coded symbols with K message symbols in each codeword and is capable of correcting up to t = (N − K)/2 symbol errors, where each symbol belongs to a finite field (Ref. 1).
Due to the increasing demand for high-capacity communication systems and portable wireless applications, low-power implementations of RS decoders are desirable to meet higher data rates for system-level integration.
The most popular RS decoder architecture can be summarized in four steps: (1) calculating the syndromes from the received codeword, (2) computing the error locator polynomial and the error evaluator polynomial, (3) finding the error locations, and (4) computing the error values. The second step involves solving the key equation (Ref. 2), which is

$$S(x)\,\sigma(x) \equiv \Omega(x) \pmod{x^{2t}},$$
where S(x) represents the syndrome polynomial, σ(x) is the error locator polynomial, and Ω(x) is the error evaluator polynomial. As a consequence, existing RS decoders usually contain a syndrome calculator, a key equation solver, a Chien search, and an error value evaluator, as illustrated in Fig. 1. The syndrome calculator generates a set of syndromes from the received codeword polynomial R(x). From the syndromes, the key equation solver produces the error locator polynomial σ(x) and the error evaluator polynomial Ω(x), which are used by the Chien search and the error value evaluator to determine the error locations and error values, respectively. The received data memory stores the received symbols. Given an error value and its location, the output of the finite-field adder shown in Fig. 1 is the corresponding corrected symbol.
When implemented for portable storage systems or optical communications with higher data rates, existing RS decoders make system-level integration relatively difficult. We therefore propose a low-power design for RS decoders using a novel two-stage syndrome calculator, addressed in Sec. 2, to detect whether the received codeword carries errors. If no error has occurred, the power consumption can be reduced significantly by terminating the follow-up decoding procedure. Section 3 presents a modified Berlekamp-Massey algorithm that eliminates many unnecessary calculations, and Sec. 4 describes the Chien search with the termination mechanism and an area-efficient architecture for the error value evaluator. The (255, 239) and (208, 192) RS decoders are implemented as design examples, and simulation results are shown in Sec. 5. Finally, the conclusion is given in Sec. 6.
The Novel Syndrome Calculator
By definition, the syndrome polynomial S(x) is denoted as $S_1 + S_2 x + \cdots + S_{2t} x^{2t-1}$, where

$$S_i = R(\alpha^i) = \sum_{n=0}^{N-1} R_n\,\alpha^{in}, \qquad i = 1, 2, \ldots, 2t.$$
If the received codeword contains no errors, all syndromes $S_1$ through $S_{2t}$ equal zero. From the relations between the syndromes and the coefficients of the error locator polynomial σ(x), given by

$$S_{\nu+j} = \sigma_1 S_{\nu+j-1} + \sigma_2 S_{\nu+j-2} + \cdots + \sigma_\nu S_j, \qquad j = 1, 2, \ldots, 2t-\nu,$$

it follows that $S_{\nu+1}, S_{\nu+2}, \ldots,$ and $S_{2\nu}$ will all be zero if both ν ≤ t and $S_1 = S_2 = \cdots = S_\nu = 0$, where ν represents the number of actual errors and t is the number of correctable errors. Therefore, the first half of the syndromes can be used as an error detector.
Once t consecutive syndromes equal zero, all 2t syndromes equal zero. The follow-up decoding procedure can then be terminated and all error values set to zero directly, whether a no-error codeword (ν = 0) or an out-of-correction codeword (ν > t) has been received. The proposed syndrome calculation therefore proceeds in two stages: $S_1$ through $S_t$ are computed first; if they are all zero, decoding stops, and otherwise $S_{t+1}$ through $S_{2t}$ are computed from the symbols held in the data memory.
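The two-stage procedure can be sketched in software as follows. This is only an illustrative model, not the hardware of Fig. 2: the field polynomial `0x11D`, the primitive element `ALPHA = 2`, and the function names are assumptions introduced for the example.

```python
PRIM = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1
ALPHA = 2     # assumed primitive element of GF(2^8)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, n):
    """Raise a to the non-negative integer power n."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def syndrome(received, i):
    """S_i = R(alpha^i) by Horner's rule; received[0] is R_{N-1}."""
    a_i = gf_pow(ALPHA, i)
    s = 0
    for symbol in received:
        s = gf_mul(s, a_i) ^ symbol  # multiply partial syndrome, then accumulate
    return s

def two_stage_syndromes(received, t):
    """Stage 1 computes S_1..S_t; stage 2 runs only if stage 1 is nonzero."""
    first = [syndrome(received, i) for i in range(1, t + 1)]
    if not any(first):
        return first, None  # no error detected: stage 2 is skipped
    second = [syndrome(received, i) for i in range(t + 1, 2 * t + 1)]
    return first, second
```

For an all-zero (hence valid) codeword the second element of the result is `None`, which models the cells $S_{t+1}$ through $S_{2t}$ staying idle.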
The syndrome calculator cell $S_i$ is shown in Fig. 2(a), where the partial syndrome is multiplied by $\alpha^i$ and accumulated with the received symbol at each cycle. After all received symbols from $R_{N-1}$ to $R_0$ are processed, the accumulated result is the ith syndrome, $S_i$. In Fig. 2(b), the proposed two-stage structure of the syndrome calculator is illustrated for t = 8. After N cycles, the syndromes $S_1$ through $S_8$ are obtained, and the signal $R_{enb}$ is then used to control access to the data memory. The syndrome calculator cells $S_9$ through $S_{16}$ remain idle unless the control signal $R_{enb}$ goes high.
Moreover, the look-forward architecture (Ref. 4) can be used to improve the throughput rate. For example, in the (255, 239) RS code the calculation of the ith syndrome can be regrouped so that four received symbols are processed per cycle; the corresponding syndrome calculator cell $S_i$ is illustrated in Fig. 3. At each cycle, the partial syndrome is multiplied by $\alpha^{4i}$ and accumulated with the four received symbols, three of which are first multiplied by $\alpha^{i}$, $\alpha^{2i}$, and $\alpha^{3i}$ in parallel. After all the received symbols are processed, the accumulated result is $S_i$. Thus, the proposed two-stage syndrome calculator can also be applied to applications with higher data rates.
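The equivalence between the one-symbol-per-cycle cell and the four-symbol look-forward cell can be checked with a short sketch. The field polynomial `0x11D`, `ALPHA = 2`, the function names, and the zero padding of the high-order end to a multiple of four symbols are assumptions of this example, not details taken from Fig. 3.

```python
PRIM = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1
ALPHA = 2     # assumed primitive element of GF(2^8)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, n):
    """Raise a to the non-negative integer power n."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def syndrome_serial(received, i):
    """One symbol per cycle: s <- s * alpha^i + R_n, starting from R_{N-1}."""
    a1 = gf_pow(ALPHA, i)
    s = 0
    for symbol in received:
        s = gf_mul(s, a1) ^ symbol
    return s

def syndrome_lookforward4(received, i):
    """Four symbols per cycle: s <- s*a^4 + r3*a^3 + r2*a^2 + r1*a + r0."""
    a1 = gf_pow(ALPHA, i)
    a2 = gf_mul(a1, a1)
    a3 = gf_mul(a2, a1)
    a4 = gf_mul(a2, a2)
    # Zero symbols in front of R_{N-1} do not change the syndrome.
    padded = [0] * (-len(received) % 4) + list(received)
    s = 0
    for k in range(0, len(padded), 4):
        r3, r2, r1, r0 = padded[k:k + 4]
        s = gf_mul(s, a4) ^ gf_mul(r3, a3) ^ gf_mul(r2, a2) ^ gf_mul(r1, a1) ^ r0
    return s
```

With N = 255 the padded word has 256 symbols, so each syndrome is obtained in 64 iterations instead of 255.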
Note that the finite-field multipliers (FFMs) in the syndrome calculator are all constant-variable FFMs, which have one constant input and one variable input, so their circuit complexity and power consumption are much lower than those of variable-variable FFMs, whose inputs are both variables. Since the data received in realistic systems are almost always correct (i.e., $S_1 = S_2 = \cdots = S_t = 0$), our proposal eliminates almost half of the syndrome computations, which lowers the power consumption of the entire RS decoder.
The Key Equation Solver
The techniques frequently used to solve the key equation include the Berlekamp-Massey (BM) algorithm (Refs. 2 and 5) and the Euclidean algorithm (Ref. 6). The BM algorithm is generally considered to have the least hardware complexity for solving the key equation. Another advantage is that the constant terms of σ(x) and Ω(x) always equal 1 and $S_1$, respectively, which allows redundant computations in the BM algorithm to be eliminated. Since the BM algorithm is an iterative procedure, the first iteration can be calculated in advance; the resulting modified BM algorithm, which differs in its initial conditions, is as follows:
Initial condition:
where $\sigma^{(i)}(x)$ is the ith error locator polynomial, $\Omega^{(i)}(x)$ is the ith error evaluator polynomial, $\Delta^{(i)}$ is the ith discrepancy, and δ is the previous discrepancy; $D^{(i)}$ is an auxiliary degree variable in the ith iteration, and $\tau^{(i-1)}(x)$ and $\gamma^{(i-1)}(x)$ are auxiliary polynomials for calculating $\sigma^{(i)}(x)$ and $\Omega^{(i)}(x)$, respectively. Note that the ith error locator polynomial $\sigma^{(i)}(x)$ calculated by Eq. (4) will be equal to the previous polynomial $\sigma^{(i-1)}(x)$ when the discrepancy $\Delta^{(i)}$ is zero. After the final iteration, $\sigma^{(2t-1)}(x)$ and $\Omega^{(2t-1)}(x)$ are equivalent to the error locator polynomial σ(x) and the error evaluator polynomial Ω(x), respectively.
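For orientation, a textbook Berlekamp-Massey iteration for σ(x) can be sketched as below. This is Massey's standard formulation with a field inversion, not the modified algorithm of this paper (whose pre-computed first iteration and initial conditions are not reproduced here). The field polynomial `0x11D` and all names are assumptions of the example; `tau` and `delta` only loosely correspond to the auxiliary polynomial and previous discrepancy described above.

```python
PRIM = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, n):
    """Raise a to the non-negative integer power n."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    """Inverse of a nonzero element: a^254 in GF(2^8)."""
    return gf_pow(a, 254)

def berlekamp_massey(S):
    """S = [S_1, ..., S_2t]; returns [1, sigma_1, ..., sigma_L]."""
    n_syn = len(S)
    sigma = [1] + [0] * n_syn   # current locator polynomial
    tau = [1] + [0] * n_syn     # locator saved at the last length change
    L, delta, shift = 0, 1, 1   # register length, previous discrepancy, x-power
    for n in range(n_syn):
        d = S[n]                # discrepancy of the current iteration
        for j in range(1, L + 1):
            d ^= gf_mul(sigma[j], S[n - j])
        if d == 0:              # locator unchanged when the discrepancy is zero
            shift += 1
            continue
        scale = gf_mul(d, gf_inv(delta))
        updated = list(sigma)
        for j in range(n_syn + 1 - shift):
            updated[j + shift] ^= gf_mul(scale, tau[j])
        if 2 * L <= n:          # length change: remember the old locator
            tau, delta, L, shift = sigma, d, n + 1 - L, 1
        else:
            shift += 1
        sigma = updated
    return sigma[:L + 1]
```

The `continue` branch mirrors the remark above that a zero discrepancy leaves the locator polynomial unchanged, and the constant term of the result is always 1.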
The conventional way to compute the error evaluator polynomial Ω(x), shown as above, is to do it in parallel with the computation of σ(x). From the key equation and Newton's identity, the computation of Ω(x) can be shown as follows:
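$$\Omega(x) = S(x)\,\sigma(x) \bmod x^{2t} \;\Rightarrow\; \Omega_0 = S_1,\quad \Omega_1 = S_2 + S_1\sigma_1,\quad \ldots,\quad \Omega_{\nu-1} = S_\nu + S_{\nu-1}\sigma_1 + \cdots + S_1\sigma_{\nu-1},$$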
where the $\Omega_j$'s are the coefficients of the error evaluator polynomial Ω(x). Note that the proposed direct computation of Ω(x) after σ(x) is computed requires fewer multiplications and additions than the original BM algorithm. Table 1 compares the average calculation counts of the original Euclidean and BM algorithms with those of the modified BM algorithm over many simulated random test patterns. In addition, the computation of the ith coefficient $\Omega_i$ is similar to that of the ith discrepancy $\Delta^{(i)}$. Therefore, the same hardware used to compute $\Delta^{(i)}$ can be reconfigured to compute the coefficient $\Omega_i$. Depending on the implementation, there are two different approaches, illustrated in Fig. 4, to compute $\Delta^{(i)}$ or $\Omega_i$ over GF($2^m$) (Ref. 7).
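The direct computation of Ω(x) from the syndromes and a known σ(x) amounts to taking the first ν coefficients of the product S(x)σ(x). A minimal sketch follows, assuming the field polynomial `0x11D` and the hypothetical function name `omega_direct`; it models the arithmetic only, not the shared hardware of Fig. 4.

```python
PRIM = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, n):
    """Raise a to the non-negative integer power n."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def omega_direct(S, sigma):
    """Omega_j = sum_{k=0..j} sigma_k * S_{j+1-k}, for j = 0..nu-1.

    S is [S_1, ..., S_2t]; sigma is [1, sigma_1, ..., sigma_nu].
    """
    nu = len(sigma) - 1
    omega = []
    for j in range(nu):
        acc = 0
        for k in range(j + 1):
            acc ^= gf_mul(sigma[k], S[j - k])  # S[j - k] holds S_{j+1-k}
        omega.append(acc)
    return omega
```

Each coefficient is a sum of products of locator coefficients and syndromes, which is the same form as a discrepancy and is why one datapath can serve both computations.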
In finite-field arithmetic, the multiplication of two operands can be split into a bit-wise multiplying operation and a modular operation. In the original approach of Fig. 4(a), each multiplier requires both the bit-wise multiplying operation and the modular operation. The separated approach, shown in Fig. 4(b), eliminates t − 1 modular operations and requires only an extra m − 1 t-input XOR gates. Simulation results show that, for (t, m) = (8, 8), the separated approach achieves approximately a 30% reduction in power consumption and a 15% reduction in circuit complexity compared with the original approach for calculating $\Delta^{(i)}$ or $\Omega_i$.
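The separated approach works because modular reduction is linear over GF(2): reducing the XOR of the unreduced products gives the same result as XOR-ing the individually reduced products. The sketch below illustrates this for m = 8, assuming the field polynomial `0x11D`; the function names are hypothetical.

```python
PRIM = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1

def clmul(a, b):
    """Bit-wise (carry-less) multiply without reduction; result has up to 15 bits."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf_reduce(v):
    """Modular operation: reduce a value of up to 15 bits modulo PRIM."""
    for bit in range(14, 7, -1):
        if v & (1 << bit):
            v ^= PRIM << (bit - 8)
    return v

def sum_of_products_original(xs, ys):
    """Each multiplier performs its own modular operation (t reductions)."""
    acc = 0
    for x, y in zip(xs, ys):
        acc ^= gf_reduce(clmul(x, y))
    return acc

def sum_of_products_separated(xs, ys):
    """Unreduced products are XOR-ed first, then reduced once."""
    acc = 0
    for x, y in zip(xs, ys):
        acc ^= clmul(x, y)  # 2m - 1 = 15 bit-columns are accumulated
    return gf_reduce(acc)
```

In the separated version the accumulation is 2m − 1 bits wide instead of m bits, which corresponds to the extra m − 1 t-input XOR gates mentioned above.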
Chien Search and the Error Value Evaluator
In the (N, K) RS decoding algorithm, the Chien search checks whether the error locator polynomial σ(x) equals zero at $x = \alpha^{-n}$ for n = 0, 1, ..., N − 1. If $\sigma(\alpha^{-n}) = 0$, there is an error at $R_n$. In Ref. 8, McEliece proposed three conditions to determine whether the received codeword can be corrected. The corresponding hardware compares the degree of σ(x) with the number of roots found by the Chien search. When an out-of-correction received codeword is detected, the follow-up error value evaluation is stopped. Figure 5(a) shows the circuit of the Chien search cell $C_i$, and the structure of the Chien search with t cells is illustrated in Fig. 5(b). At the ηth cycle after initialization, the finite-field adder (FFA) on the right-hand side of Fig. 5(b) calculates the value of $\sigma(\alpha^{-\eta})$, and the NOR gate checks whether the final sum equals zero. Note that $\sigma_{odd}(x) = \sigma_1 x + \sigma_3 x^3 + \cdots + \sigma_{t_{odd}} x^{t_{odd}}$ is prepared for calculating the error value, where $t_{odd}$ is the largest odd number less than or equal to t; for t = 8, $t_{odd} = 7$. The structure in Fig. 5 checks one root per cycle, whereas Fig. 6 illustrates another structure of the Chien search that checks eight roots per cycle for applications with higher data rates. Note that all input ports to the shaded cells in Fig. 6 are identical. In the jth iteration after initialization, the FFAs in the shaded cells output the values of σ(x) at the eight candidate roots checked in that iteration.
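A software model of the one-root-per-cycle search, including the comparison of the root count with the degree of σ(x), might look as follows. The field polynomial `0x11D`, `ALPHA = 2`, and the names are assumptions, and only the degree-versus-root-count check is modeled rather than all three of McEliece's conditions.

```python
PRIM = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1
ALPHA = 2     # assumed primitive element of GF(2^8)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, n):
    """Raise a to the non-negative integer power n."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def chien_search(sigma, n_symbols):
    """Return (error locations n with sigma(alpha^-n) == 0, correctable flag).

    sigma is [1, sigma_1, ..., sigma_nu] with a nonzero leading coefficient.
    """
    # Cell i multiplies its register by the constant alpha^{-i} every cycle.
    steps = [gf_pow(ALPHA, (255 - i) % 255) for i in range(len(sigma))]
    regs = list(sigma)  # at cycle n, regs[i] holds sigma_i * alpha^{-i*n}
    locations = []
    for n in range(n_symbols):
        total = 0
        for r in regs:  # finite-field adder over all cells
            total ^= r
        if total == 0:  # zero detect
            locations.append(n)
        regs = [gf_mul(r, s) for r, s in zip(regs, steps)]
    correctable = len(locations) == len(sigma) - 1
    return locations, correctable
```

When `correctable` is false, the error value evaluation that would follow can be skipped, in the spirit of the termination mechanism described above.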
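For calculating the error value, the Forney algorithm proposed in Ref. 9 is utilized and can be expressed as

$$e_l = \frac{\left.\left(\Omega(x)\cdot x\right)\right|_{x=\chi_l^{-1}}}{\sigma_{odd}(\chi_l^{-1})},$$

where $e_l$ denotes the lth error value and $\chi_l^{-1}$ indicates the lth root of σ(x), for l = 1, ..., t. Figure 7 shows an area-efficient architecture for the error value evaluator, which requires only one variable-variable FFM. Note that Buffer-1 and Buffer-2 store all the roots, and the detailed computation of $(\Omega(x)\cdot x)|_{x=\chi_l^{-1}}$ can be shown as

$$\left.\left(\Omega(x)\cdot x\right)\right|_{x=\chi_l^{-1}} = \left(\cdots\left(\left(\Omega_{\nu-1}\chi_l^{-1} + \Omega_{\nu-2}\right)\chi_l^{-1} + \Omega_{\nu-3}\right)\chi_l^{-1} + \cdots + \Omega_0\right)\chi_l^{-1}.$$

At the (ν + 1)th cycle, the output of the Ω(x)-buffer is 0 and the output of the multiplexer in Fig. 7 is switched to $1/\sigma_{odd}(\chi_l^{-1})$; the output of the FFM, which is latched by the register on the right-hand side of Fig. 7, is then the error value corresponding to the root $\chi_l^{-1}$.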
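The single-multiplier evaluation can be modeled as a Horner loop with one multiplication per cycle, followed by a final multiplication by the inverse of $\sigma_{odd}$ at the root. In this sketch the field polynomial `0x11D` and all names are assumptions, and $\sigma_{odd}$ is recomputed in software, whereas the text above has it prepared by the Chien search.

```python
PRIM = 0x11D  # assumed field polynomial x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return r

def gf_pow(a, n):
    """Raise a to the non-negative integer power n."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    """Inverse of a nonzero element: a^254 in GF(2^8)."""
    return gf_pow(a, 254)

def error_value(omega, sigma, root):
    """Error value at a root of sigma(x).

    omega is [Omega_0, ..., Omega_{nu-1}], sigma is [1, sigma_1, ..., sigma_nu],
    and root is the field element at which sigma evaluates to zero.
    """
    acc = 0
    for coeff in reversed(omega):        # Omega_{nu-1} first, Omega_0 last
        acc = gf_mul(acc ^ coeff, root)  # one multiplication per cycle
    # acc now equals (Omega(x) * x) evaluated at x = root.
    s_odd = 0
    for j in range(1, len(sigma), 2):    # odd-degree terms of sigma only
        s_odd ^= gf_mul(sigma[j], gf_pow(root, j))
    return gf_mul(acc, gf_inv(s_odd))    # final cycle: multiply by 1 / sigma_odd
```

The last line corresponds to the (ν + 1)th cycle, in which the buffer contributes 0 and the multiplier's second operand is switched to the inverse of $\sigma_{odd}$.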
Chip Implementation and Simulation Results
Here we present two design examples to verify our low-power architecture. One is the (255, 239) RS decoder, which is recommended in ITU-T G.975 to resist burst errors in optical fiber submarine cable systems on the STM-16 basis. Note that the STM-16 format has a data rate of 2.5 Gbps. To meet a higher data rate with a lower clock rate, the look-forward architecture is utilized in our (255, 239) RS decoder to process four bytes per cycle. Figure 8 shows the pipelining diagram with a latency of 192 cycles. In the first pipeline stage, the first-half syndromes $S_1$ through $S_8$ are calculated and used to detect whether the received codeword carries errors. If the received codeword has no errors, the calculation of the second-half syndromes, $S_9$ through $S_{16}$, in the second stage is unnecessary. The Chien search takes only half of the third stage, checking eight roots per cycle. In the fourth stage, the error value is evaluated and simultaneously added to the corresponding received symbol to correct the error.
After being implemented in Verilog with Synopsys design tools and the Artisan 0.25 µm CMOS 1P5M standard cells, the (255, 239) RS decoder contains four 2 Kbit embedded single-port synchronous memories and has a gate count of 32.9 K. The total size is 2.23 mm × 2.23 mm with an RS core of 2.01 mm × 1.01 mm. The layout view and chip summary are shown in Fig. 9. When simulated at a supply voltage of 1.8 V with EPIC PowerMill, our proposed RS decoder works correctly at the 2.5 Gbps data rate and consumes 14.8 mW of core power and 53.7 mW of memory power at a bit-error rate (BER) of $10^{-4}$, which corresponds to an approximately 20% error probability for the received codewords. For a BER below $10^{-5}$, only the syndrome calculator operates to detect errors, and the other three parts (key equation solver, Chien search, and error value evaluator) remain almost idle.
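The quoted codeword error probability can be checked with a short calculation. The sketch assumes independent bit errors and 8-bit symbols; the function name is hypothetical.

```python
def codeword_error_probability(ber, n_symbols=255, bits_per_symbol=8):
    """Probability that at least one bit of a codeword is in error,
    assuming independent bit errors at the given bit-error rate."""
    n_bits = n_symbols * bits_per_symbol
    return 1.0 - (1.0 - ber) ** n_bits

p_high = codeword_error_probability(1e-4)  # about 0.18 for a (255, 239) codeword
p_low = codeword_error_probability(1e-5)   # about 0.02
```

Under this assumption roughly four out of five codewords at a BER of $10^{-4}$, and about 98% at $10^{-5}$, are error-free and need only the first syndrome stage.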
The other design example is the (208, 192) RS decoder utilized for DVD applications. Since the data rate of 1× DVD is below 4 MBps, the overall decoding speed can be maintained with the two-stage approach of calculating syndromes. Figure 11 shows the layout view and chip summary.
When simulated at a supply voltage of 3.3 V with EPIC PowerMill, our proposed RS decoder works correctly at a clock rate of 100 MHz. Table 2 compares the circuit complexity and simulated power consumption of the previous architecture (Ref. 10) with those of our proposal. Although the architecture proposed in Ref. 10 has lower complexity, the proposed RS decoder consumes only 60% of the power of Ref. 10 and consumes approximately no power in the key equation solver, Chien search, and error value evaluator when no-error codewords are received. Note that the two results for each architecture correspond to the two extreme cases of received codewords carrying eight errors and carrying no errors. In realistic communication systems, the probability of no error is much larger than that of error. The proposed two-stage syndrome calculator, which halves the syndrome calculations, together with the termination mechanisms in the other parts, can therefore lead to a very power-efficient RS decoder.
Conclusion
In this paper, the design and implementation of a low-power RS decoder is presented. The proposed architecture features the novel two-stage syndrome 
