Abstract-We propose a new efficient serial architecture to implement the Berlekamp-Massey algorithm, which is frequently used in BCH and Reed-Solomon decoders. An inversionless Berlekamp-Massey algorithm is adopted which not only eliminates the finite-field inverter but also introduces additional parallelism. We discover a clever scheduling of three finite-field multipliers to implement the algorithm very efficiently. Compared to a previously proposed serial Berlekamp-Massey architecture, our technique significantly reduces the latency.
I. INTRODUCTION
AMONG the most well-known error-correcting codes, the Bose-Chaudhuri-Hocquenghem (BCH) codes [1], [2] and the Reed-Solomon (RS) codes [3] are undoubtedly the most widely used block codes in communications and storage systems. For a comprehensive review of BCH and RS decoders, see the texts by Berlekamp [4], Lin and Costello [5], or Blahut [6].
The most popular RS decoder architecture today can be summarized in four steps: 1) calculating the syndromes from the received codeword; 2) computing the error locator polynomial and the error evaluator polynomial; 3) finding the error locations; and 4) computing the error values. The second step involves solving the key equation [4], which is

$$S(x)\,\sigma(x) \equiv \Omega(x) \pmod{x^{2t}}$$

where $S(x)$ is the syndrome polynomial, $\sigma(x)$ is the error locator polynomial, $\Omega(x)$ is the error evaluator polynomial, and $t$ is the number of correctable errors. The techniques frequently used to solve the key equation include the Berlekamp-Massey algorithm [4], [8], the Euclidean algorithm [9], and the continued-fraction algorithm [10]. Compared to the other two algorithms, the Berlekamp-Massey algorithm is generally considered to be the one with the least hardware complexity [11]. Another advantage of the Berlekamp-Massey algorithm is that it can be formulated to compute $\sigma(x)$ only, thus saving the portion of the hardware used to compute $\Omega(x)$.

Existing architectures to implement the Berlekamp-Massey algorithm in hardware were proposed by Berlekamp [12], Liu [13], and Oh and Kim [14]. These proposals require a number of finite-field multipliers (FFM's) that depends on $t$. In addition, they all require a finite-field inverter (FFI) to implement the division operation, which adds significant hardware complexity. An inversionless Berlekamp-Massey algorithm was proposed by Burton [15] for BCH decoders and was implemented by Reed, Shih, and Truong [11] for BCH and RS codes. However, the existing implementation of the inversionless Berlekamp-Massey algorithm [11] requires more FFM's.
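To make the cost difference between an FFM and an FFI concrete, the following minimal sketch multiplies and inverts in a small field. The field GF(2^4), the primitive polynomial x^4 + x + 1, and the function names are assumptions chosen for illustration and do not come from the letter. The inverse is obtained by the plain power method, which is only one of several possible schemes; it is used here simply to show that an inversion reduces to a chain of multiplications.

```python
M = 4                  # assumed field size: GF(2^4), illustration only
PRIM_POLY = 0b10011    # assumed primitive polynomial x^4 + x + 1


def gf_mul(a, b):
    """Multiply two field elements by shift-and-add with modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1
        if a & (1 << M):         # degree reached M: reduce by the polynomial
            a ^= PRIM_POLY
    return result


def gf_inv(a):
    """Invert a nonzero element as a^(2^M - 2).

    Returns (inverse, number_of_multiplications_used). The naive chain below
    is a sketch, not a statement about how a hardware FFI is built.
    """
    if a == 0:
        raise ZeroDivisionError("zero has no multiplicative inverse")
    result, mults = a, 0
    for _ in range(2 ** M - 3):  # a^1 -> a^(2^M - 2)
        result = gf_mul(result, a)
        mults += 1
    return result, mults


if __name__ == "__main__":
    inv, mults = gf_inv(5)
    print("inverse of 5:", inv, "using", mults, "multiplications")
    print("check:", gf_mul(5, inv))
```

A single multiplication is one pass through `gf_mul`, whereas the inversion above needs many such passes, which is the kind of overhead an inversionless formulation avoids.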
In this letter we present a new architecture to implement the Berlekamp-Massey algorithm with drastically reduced hardware complexity while maintaining the overall decoding speed. Our work was motivated by the following observation. The existing architectures to implement the Berlekamp-Massey algorithm are faster than they need to be. Indeed, in the four-step decoding approach the throughput is limited by syndrome calculation and Chien search, which take more cycles to finish than the existing Berlekamp-Massey architectures do. Slowing down the Berlekamp-Massey algorithm until it takes as many cycles as those two steps will not slow down the decoding. Therefore, we time-share a smaller number of FFM's to implement the Berlekamp-Massey algorithm. In this letter, such an approach is termed a serial architecture, in contrast to the parallel architectures of [11]-[14].
To our knowledge, a serial architecture for the Berlekamp-Massey algorithm was first shown in a text by Blahut [6]. That architecture uses three FFM's and one FFI, and its clock cycle is determined by the logic circuit delay of one FFM and one FFI. In this letter, we propose a new serial architecture that uses three FFM's and no FFI and requires fewer clock cycles in each iteration. The clock cycle in our architecture is determined by the logic circuit delay of one FFM alone. Our architecture is therefore much faster than that in [6].
In Section II, we describe the time-sharing idea in detail and present our efficient scheduling. In Section III, we show how to reconfigure the architecture to compute the error evaluator polynomial. In Section IV, we conclude the paper.
0090-6778/99$10.00 © 1999 IEEE

In other words, we can decompose each iteration of the algorithm into a sequence of cycles. In each cycle, updating one coefficient of the error locator polynomial requires at most two finite-field multiplications, and accumulating one term of the next discrepancy requires only one finite-field multiplication. The data dependency of the decomposed algorithm can be seen in Table I.
It is evident from Table I that, in each cycle, the computation of the partial discrepancy requires only the locator coefficient and the partial sum that were computed in the preceding cycle. Similarly, the computation of a locator coefficient requires only the discrepancy and the coefficients that were completed in an earlier cycle and in the previous iteration, respectively. Note that the original Berlekamp-Massey algorithm cannot be scheduled as efficiently, because updating a locator coefficient there requires two sequential multiplications and one inversion. The inversionless Berlekamp-Massey algorithm provides the necessary parallelism to allow our efficient scheduling. The scheduling and data dependency of the decomposed algorithm are further illustrated in Fig. 1.
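The decomposition can be illustrated with a small software model. The sketch below is one common formulation of an inversionless Berlekamp-Massey iteration, written coefficient by coefficient so that no cycle uses more than three multiplications. The field GF(2^4), the initial conditions, the indexing, the helper `syndromes`, and the cycle count per iteration in this model are all assumptions made for illustration; they are not taken from Table I or Fig. 1.

```python
M = 4                  # assumed field GF(2^4), illustration only
PRIM_POLY = 0b10011    # assumed primitive polynomial x^4 + x + 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^4)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM_POLY
    return result


def syndromes(locators, values, t):
    """Hypothetical helper: s_i = sum of Y * X^i for i = 1..2t."""
    out = []
    for i in range(1, 2 * t + 1):
        acc = 0
        for x, y in zip(locators, values):
            p = y
            for _ in range(i):
                p = gf_mul(p, x)
            acc ^= p
        out.append(acc)
    return out


def inversionless_bm(s, t):
    """Return (sigma, D, max_mults_per_cycle); s[i] holds syndrome s_{i+1}.

    sigma is a nonzero scalar multiple of the usual error locator polynomial.
    """
    sigma = [1] + [0] * t      # locator from the previous iteration
    tau = [1] + [0] * t        # auxiliary polynomial
    delta = 1                  # discrepancy saved at the last length change
    disc = s[0]                # current discrepancy
    D = 0                      # current register length
    max_mults = 0
    for k in range(2 * t):
        new_sigma = [0] * (t + 1)
        next_disc = 0          # the value left after the last iteration is unused
        for j in range(t + 2):             # one pass of this loop = one cycle
            mults = 0
            if j <= t:                     # locator coefficient: <= 2 products
                term = gf_mul(delta, sigma[j])
                mults += 1
                if j >= 1:
                    term ^= gf_mul(disc, tau[j - 1])
                    mults += 1
                new_sigma[j] = term
            idx = k + 2 - j                # syndrome paired with coefficient j-1
            if j >= 1 and 0 <= idx < 2 * t:
                # uses the coefficient produced in the preceding cycle
                next_disc ^= gf_mul(s[idx], new_sigma[j - 1])
                mults += 1
            max_mults = max(max_mults, mults)
        if disc == 0 or 2 * D >= k + 1:
            tau = [0] + tau[:t]            # tau <- x * tau
        else:
            D, delta, tau = k + 1 - D, disc, sigma[:]
        sigma, disc = new_sigma, next_disc
    return sigma, D, max_mults


if __name__ == "__main__":
    s = syndromes([2, 4], [3, 7], t=2)     # two hypothetical errors
    print(inversionless_bm(s, t=2))
```

In this model the discrepancy term in a given cycle depends only on the locator coefficient finished one cycle earlier, so the three products in a cycle are independent of one another and could be issued to three multipliers at once.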
The decomposed algorithm shown above suggests a three-FFM implementation of the inversionless Berlekamp-Massey algorithm, which is shown in Fig. 2. Though not shown in this letter, our architecture can also be used in the correction of both errors and erasures. Compared to the previously proposed parallel architectures [11]-[14], our architecture reduces the hardware complexity significantly. Compared to a previously proposed serial architecture [6], our architecture reduces the time complexity significantly because of 1) the reduction of the number of clock cycles and 2) the reduction of the cycle time. Therefore, the proposed architecture improves the area-delay product.
III. EFFICIENT COMPUTATION OF THE ERROR EVALUATOR POLYNOMIAL
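The conventional way to compute the error evaluator polynomial is to do it in parallel with the computation of the error locator polynomial. Using the Berlekamp-Massey algorithm, this involves an iterative algorithm to compute the error evaluator polynomial. However, if the error locator polynomial (together with its degree) is obtained first, it follows from the key equation and Newton's identities that each coefficient of the error evaluator polynomial can be written directly in terms of the syndromes and the coefficients of the error locator polynomial.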
That is, the error evaluator polynomial can be computed directly once the error locator polynomial is available. Note that the direct computation requires fewer multiplications than the iterative algorithm, which computes many unnecessary intermediate results. The penalty of this efficient computation is additional latency, because the two polynomials are computed in sequence. Furthermore, the computation of each evaluator coefficient is very similar to that of the discrepancy, apart from some minor differences. Therefore, the same hardware used to compute the error locator polynomial can be reconfigured to compute the error evaluator polynomial afterward. Like the discrepancy, each evaluator coefficient is accumulated one product at a time, starting from an initial term and adding one further term in each subsequent cycle.
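The direct computation can be written out as a worked equation. The following is a sketch under the common convention that the syndrome polynomial is $S(x) = s_1 + s_2 x + \cdots + s_{2t} x^{2t-1}$ and the error locator polynomial is $\sigma(x) = \sigma_0 + \sigma_1 x + \cdots + \sigma_t x^t$. This indexing and the superscript notation for partial sums are assumptions, since the letter's own symbols are not reproduced here.

```latex
% Coefficient i of S(x) sigma(x) mod x^{2t}, for i = 0, 1, ..., t-1
\Omega_i = s_{i+1}\sigma_0 + s_i\sigma_1 + \cdots + s_1\sigma_i
         = \sum_{j=0}^{i} s_{i+1-j}\,\sigma_j

% The same coefficient accumulated one product per cycle
\Omega_i^{(j)} =
\begin{cases}
  s_{i+1}\,\sigma_0, & \text{for } j = 0 \\
  \Omega_i^{(j-1)} + s_{i+1-j}\,\sigma_j, & \text{for } 1 \le j \le i
\end{cases}
\qquad \Omega_i = \Omega_i^{(i)}
```

Under this convention the coefficients of $S(x)\sigma(x)$ from degree $t$ up to $2t-1$ vanish, so only the first $t$ coefficients need to be accumulated, and each accumulation step costs a single multiplication.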
In Fig. 3, we show how the same three-FFM architecture can be reconfigured to compute the error evaluator polynomial.
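As a software counterpart to the reconfigured datapath, the sketch below accumulates the evaluator coefficients directly from the syndromes and a known locator polynomial, one product at a time. It assumes GF(2^4) with x^4 + x + 1 and the indexing s[i] = s_{i+1}; the function names and the example values, which correspond to two hypothetical errors, are illustrative and not taken from the letter.

```python
M = 4                  # assumed field GF(2^4), illustration only
PRIM_POLY = 0b10011    # assumed primitive polynomial x^4 + x + 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^4)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM_POLY
    return result


def product_coeff(s, sigma, i):
    """Coefficient of x^i in S(x) * sigma(x), one multiplication per term."""
    acc = 0
    for j in range(min(i, len(sigma) - 1) + 1):
        acc ^= gf_mul(s[i - j], sigma[j])   # s[i - j] is syndrome s_{i+1-j}
    return acc


def error_evaluator(s, sigma, t):
    """Return the t low-order coefficients of S(x) * sigma(x) mod x^(2t)."""
    return [product_coeff(s, sigma, i) for i in range(t)]


if __name__ == "__main__":
    # Hypothetical example: locators 2 and 4, error values 3 and 7, t = 2.
    s = [9, 5, 9, 13]          # s_1 .. s_4 for those errors
    sigma = [1, 6, 8]          # (1 + 2x)(1 + 4x) over GF(2^4)
    print("omega:", error_evaluator(s, sigma, t=2))
    print("higher coefficients:", [product_coeff(s, sigma, i) for i in (2, 3)])
```

The inner loop has the same multiply-and-accumulate shape as a discrepancy computation, which is why one datapath can plausibly serve both purposes.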
IV. CONCLUSION
In this letter we propose a new efficient serial architecture to implement the Berlekamp-Massey algorithm, which is frequently used in BCH and RS decoders. An inversionless Berlekamp-Massey algorithm is adopted, which not only eliminates the FFI but also introduces additional parallelism to the computation. We discover a clever scheduling of three FFM's to implement the algorithm very efficiently. To compute the error evaluator polynomial efficiently, its computation is performed after the error locator polynomial is obtained. Moreover, in our architecture the computations of the two polynomials share the same hardware. Our technique can also be applied to the correction of both errors and erasures. Compared to the previously proposed parallel Berlekamp-Massey architectures, our architecture significantly reduces the hardware complexity. Compared to a previously proposed serial Berlekamp-Massey architecture, our architecture significantly reduces the time complexity. Therefore, our architecture improves the area-delay product.
