A new VLSI design of a pipeline Reed-Solomon decoder is presented. The transform decoding technique used in a previous design is replaced by a simple time domain algorithm. A new architecture which realizes this algorithm permits efficient pipeline processing with a minimum of circuits. A systolic array is also developed to perform erasure corrections in the new design. A modified form of Euclid's algorithm is developed with a new architecture which maintains a real-time throughput rate with fewer transistors. Such improvements result in both an enhanced capability and a significant reduction in silicon area, thereby making it possible to build a pipeline (255,223) RS decoder on a single VLSI chip.
INTRODUCTION
Recently Brent and Kung [1] suggested a systolic array architecture to compute the greatest common divisor (gcd) of two polynomials. By the use of this idea a VLSI design of a pipeline Reed-Solomon decoder was developed [2]. The syndrome computation of this decoder for a 4-bit (15,9) RS code was implemented successfully on a chip [3]. In the design of the chip for the above-mentioned decoder, three major problems arose: (1) It required $N$ identical cells to implement the inverse transform for an $(N, I)$ RS code in the architecture suggested in [2]. As a consequence, for a long code such as the (255,223) RS code, the inverse transform circuit would need 255 cells and be quite large. (2) The basic cell of the systolic array needed to perform a modified form of Euclid's algorithm occupied considerable silicon area. Since the decoding algorithm in [2] required $N - I$ such cells, the entire systolic array needed much more silicon area than desired. (3) Erasure corrections became necessary and were not included in the original design.
To reduce the large circuit area required by the inverse transform operation, it was decided to modify the original transform decoding algorithm. When the need for erasure correction was also considered, it was found that the decoding algorithm given in [4] could accommodate both requirements. In addition, the throughput rate of a systolic array to perform Euclid's algorithm can be maintained by multiplexing fewer basic cells.
The research described in this paper was carried out by the JPL, Caltech, under contract with NASA.
ICASSP 86, TOKYO
In this paper, an improved VLSI architecture over that in [2] is developed utilizing the above observations. A systolic array is also designed for the polynomial expansion needed in the erasure polynomial computation. These new modifications result in both an enhanced capability and a significant reduction in silicon area without any loss in the pipeline throughput rate.
THE DECODER ARCHITECTURE
Let $N = 2^m - 1$ be the length of the $(N, I)$ RS code over $GF(2^m)$ with design distance $d$. Suppose that $t$ errors and $s$ erasures occur, and $s + 2t \le d - 1$. The decoding procedure in [4] is summarized as follows:
Let $X_i$ be an error location or an erasure location, and let $\Lambda = \{X_i \mid X_i \text{ is an erasure location}\}$ and $\lambda = \{X_i \mid X_i \text{ is an error location}\}$. Let $Y_i$ be the corresponding errata magnitude and $r = (r_0, r_1, \ldots, r_{N-1})$ be the received vector.
Step 1) Compute the syndrome polynomial $S(Z)$.
Step 2) Compute the erasure locator polynomial $\Lambda(Z)$.
Step 3) Multiply $S(Z)$ and $\Lambda(Z)$ to obtain the Forney syndrome polynomial.
Step 4) Compute the errata evaluator polynomial $A(Z)$ and the error locator polynomial $\lambda(Z)$ from the Forney syndrome polynomial by a modified Euclid's algorithm.
Step 5) Multiply $\Lambda(Z)$ and $\lambda(Z)$ to get the errata locator polynomial $P(Z)$.
Step 6) Perform a Chien search on $\lambda(Z)$ to find the error location set $\lambda$.
Step 7) Compute the errata magnitudes $Y_k$ for $1 \le k \le s + t$ by evaluating $A(Z)$ and $P'(Z)$, the derivative of $P(Z)$. Use the sets $\lambda$ and $\Lambda$ to direct the additions of $Y_k$ to the received vector $r$.
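Step 1 can be illustrated with a small software model. The sketch below is not the transform circuit of [2]; it is a minimal Python illustration for the 4-bit (15,9) code mentioned in the introduction. The primitive polynomial $x^4 + x + 1$, the code roots $\alpha^1, \ldots, \alpha^{d-1}$, and the syndrome form $S_k = \sum_n r_n \alpha^{nk}$ are all assumptions chosen for illustration, as are the helper names.

```python
# Sketch of Step 1 for a (15,9) RS code over GF(2^4).
# Assumptions (not taken from the paper): primitive polynomial x^4 + x + 1,
# code roots alpha^1 .. alpha^(d-1), and S_k = sum_n r_n * alpha^(n*k).

def build_gf16():
    """Return (exp, log) tables for GF(16) generated by x^4 + x + 1."""
    exp, log = [0] * 30, [0] * 16
    x = 1
    for i in range(15):
        exp[i] = exp[i + 15] = x  # duplicated so that log sums need no modulo
        log[x] = i
        x <<= 1
        if x & 0x10:
            x ^= 0x13
    return exp, log

EXP, LOG = build_gf16()

def gf_mul(a, b):
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)  # addition in GF(2^m) is XOR
    return out

def syndromes(received, d):
    """Return [S_1, ..., S_(d-1)] for a list of received symbols."""
    result = []
    for k in range(1, d):
        s = 0
        for n, r_n in enumerate(received):
            if r_n:
                s ^= EXP[(LOG[r_n] + n * k) % 15]
        result.append(s)
    return result

N, I = 15, 9
D = N - I + 1  # design distance of the (15,9) code

# Generator polynomial with the assumed roots alpha^1 .. alpha^(D-1).
generator = [1]
for k in range(1, D):
    generator = poly_mul(generator, [EXP[k], 1])  # factor (x + alpha^k)

# The generator polynomial itself, zero-padded to length N, is a codeword.
codeword = generator + [0] * (N - len(generator))
clean = syndromes(codeword, D)

# A single error of value 9 at position 4 gives S_k = 9 * alpha^(4k).
corrupted = list(codeword)
corrupted[4] ^= 9
dirty = syndromes(corrupted, D)
```

Under these assumptions every syndrome of an uncorrupted codeword is zero, and a single symbol error leaves a syndrome sequence that depends only on the error value and its position.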
The pipeline architecture of the RS decoder is shown in Fig. 1. The decoder computes the syndrome polynomial $S(Z)$ by the transform circuit given in [2]. The erasure information $\Lambda$ enters the decoder in the form of a binary sequence. The systolic array described in the next section expands the factors of $\Lambda(Z)$ into a polynomial. Polynomial multiplications are performed simultaneously with the same circuit. A new architecture is developed which implements a modified Euclid's algorithm by operating on the product of $S(Z)$ and $\Lambda(Z)$. The resulting error locator polynomial $\lambda(Z)$ is then multiplied by $\Lambda(Z)$, thereby obtaining the errata locator polynomial $P(Z)$. The derivative $P'(Z)$ of $P(Z)$ is obtained by dropping the even-degree terms of $P(Z)$, which vanish under differentiation in a field of characteristic 2. The errata magnitudes $Y_k$ are then calculated by a polynomial evaluation, field multiplication and inversion. Next the errata locations are obtained in the form of a binary sequence by the use of another polynomial evaluation circuit which performs the Chien search on $\lambda(Z)$. This sequence of error locations directs the addition of $Y_k$ to the received message.
A VLSI DESIGN FOR EXPANDING THE ERASURE LOCATOR POLYNOMIAL
It is reasonable to assume that the erasure location information is derived from outside the decoder, possibly from a convolutional decoder. Let it arrive serially in the form of 1's and 0's. A simple circuit, of the form shown in Fig. 2a, first converts this erasure data into a sequence of $\alpha^k$'s and 0's, where $\alpha^k \in \Lambda$. Let $\Lambda = \{\alpha^{k_1}, \alpha^{k_2}, \ldots, \alpha^{k_s}\}$; then the computation of the erasure locator polynomial demands the expansion of the product of the $s$ factors associated with these locations, Eq. (6). Eq. (6) can be computed by a recursion, Eq. (7), given $Q_0(Z) = 1$. The Forney syndrome polynomial $S(Z)\Lambda(Z)$ can likewise be obtained by letting $Q_0(Z) = S(Z)$. Similarly, the errata locator polynomial $P(Z) = \lambda(Z)\Lambda(Z)$ can be computed by letting $Q_0(Z) = \lambda(Z)$. Therefore in the RS decoder design, two of the polynomial expansion/multiplication circuits, as shown in Fig. 1, are needed to expand the erasure locator polynomial and to simultaneously multiply by another polynomial, either $S(Z)$ or $\lambda(Z)$.
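The expansion recursion and its reuse as a multiplier can be sketched in software. The Python fragment below is a minimal illustration rather than a model of the systolic array. It assumes each factor has the form $(1 + \alpha^{k_i} Z)$ (signs are immaterial in $GF(2^m)$), works over $GF(2^4)$ with the assumed primitive polynomial $x^4 + x + 1$, applies no truncation to the product, and uses made-up erasure positions and syndrome coefficients.

```python
# Sketch of the expansion recursion Q_i(Z) = Q_(i-1)(Z) * (1 + a_i Z).
# The factor form, the field GF(2^4) with x^4 + x + 1, and all sample
# values are assumptions for illustration only.

def build_gf16():
    """Return (exp, log) tables for GF(16) generated by x^4 + x + 1."""
    exp, log = [0] * 30, [0] * 16
    x = 1
    for i in range(15):
        exp[i] = exp[i + 15] = x
        log[x] = i
        x <<= 1
        if x & 0x10:
            x ^= 0x13
    return exp, log

EXP, LOG = build_gf16()

def gf_mul(a, b):
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def expand(q0, locators):
    """Multiply q0 (lowest degree first) by (1 + a Z) for each locator a."""
    q = list(q0)
    for a in locators:
        shifted = [0] + [gf_mul(a, c) for c in q]      # a * Z * Q(Z)
        q = [u ^ v for u, v in zip(q + [0], shifted)]  # Q(Z) + a * Z * Q(Z)
    return q

def poly_eval(p, x):
    """Evaluate p (lowest degree first) at x by Horner's rule."""
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x) ^ c
    return acc

erasure_positions = [2, 5, 11]                 # hypothetical k_1, k_2, k_3
locators = [EXP[k] for k in erasure_positions]

# Starting from Q_0(Z) = 1 yields the erasure locator polynomial.
erasure_locator = expand([1], locators)

# Starting from Q_0(Z) = S(Z) yields the product S(Z) * Lambda(Z).
syndrome_poly = [7, 0, 12, 3, 9, 1]            # hypothetical coefficients
forney = expand(syndrome_poly, locators)
```

The same routine serves both purposes, which mirrors the observation that one circuit can expand the erasure locator polynomial and multiply by a second polynomial at the same time.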
A NEW ARCHITECTURE TO PERFORM THE MODIFIED EUCLIDEAN ALGORITHM
A systolic array was designed in [2] to compute the error locator polynomial by a modified Euclidean algorithm. The array required $2t$ cells, twice the number of correctable errors. It is capable of performing the modified Euclidean algorithm continuously.
In the modified Euclidean algorithm only one syndrome polynomial is computed in the time interval of one code word. As a consequence, for the original architecture in [2] a pipeline RS decoder is not as efficient as it might be. A substantial portion of the systolic array is always idling. This fact makes possible a more efficient design with fewer cells and no loss in the throughput rate.
For the $(N, I)$ RS code the length of the syndrome polynomial is $N - I$. The maximum length of the resultant Forney syndrome polynomial is also $N - I$. Imagine now that a single cell is used recursively to perform the successive steps of the modified Euclidean algorithm instead of pipelining data to the next cell. Then it would take $N - I$ recursions to complete the algorithm, where each recursion requires only $N - I$ symbol times. Therefore a single cell, used recursively, requires a total of only $(N - I)^2$ symbol times to complete the modified form of Euclid's algorithm. Since a syndrome polynomial arrives every $N$ symbol times, roughly $(N - I)^2 / N$ cells are needed to process successive syndrome polynomials at the full pipeline throughput rate. Fig. 3 shows the new architecture. The input multiplexer directs the syndrome polynomials to different cells. Each processor cell is almost identical to the cell presented in [2], except that it is used to process data recursively.
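The cell-count argument above is simple arithmetic and can be checked directly. The short Python sketch below assumes that the number of recursive cells is the ratio of $(N - I)^2$ to $N$ rounded up to a whole cell; the rounding and the function names are illustrative assumptions, not figures taken from Table 1.

```python
# Back-of-envelope cell counts for the two architectures.
# Assumption: a fractional cell count is rounded up to the next integer.

def recursive_cells(n, i):
    """Cells needed when each cell runs the whole algorithm recursively."""
    symbol_times_per_polynomial = (n - i) ** 2
    # One polynomial arrives every n symbol times, so this many cells
    # must work in parallel to keep up (ceiling division).
    return -(-symbol_times_per_polynomial // n)

def pipelined_cells(n, i):
    """Cells needed by the original systolic array, one per recursion."""
    return n - i

for n, i in [(15, 9), (255, 223)]:
    print((n, i), pipelined_cells(n, i), recursive_cells(n, i))
```

Under this assumption the (255,223) code needs 5 recursive cells where the fully pipelined array uses 32, and the (15,9) code needs 3 where the array uses 6, so the saving grows with the code rate.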
The primary differences between the new cell structure and the architecture developed previously [2] are presented in the following. Since division is avoided in the modified form of Euclid's algorithm, a scale factor appears at the output. Although such a scale factor, call it $K$, is irrelevant to the problem of finding the roots of the error locator polynomial $\lambda(Z)$, it must be removed from the errata evaluator polynomial $A(Z)$.
In order to effectively utilize the processor cell given in [2], the factor $K$ which appears at the output of each cell is calculated independently of the cell computation. This is accomplished by using a multiplier, operating recursively, to accumulate the product of all the nonzero leading coefficients of the divisor polynomials. Then an inverse computation circuit and a multiplier after the demultiplexer are used to remove the unwanted scalar $K$ from $K A(Z)$. This computational process is illustrated in Fig. 3. The architecture of the new basic cell is given in Fig. 4. Compared with the previous systolic array design [2], the present scheme for multiplexing the recursive cell computations significantly reduces the number of cells and, as a consequence, the number of circuits. Table 1 shows that the cell reduction is greater for high rate codes.
The polynomial evaluation in (8) has a form which is identical to the syndrome computation (1). However, in (8) the polynomial is shorter than in (1). Also, since $N > s + t - 1$, (8) is evaluated over a wider range of field elements than was needed for (1).
These differences make it inefficient to implement (8) in a manner similar to that used for the syndrome computations. A better method is to evaluate $A_i(\alpha^i)^k$ sequentially for each $k$ at the $i$th cell. This is illustrated in Fig. 5. The polynomial coefficient $A_i$ is multiplied by $\alpha^i$ at the initialization of cell $i$. From then on a feedback loop computes the quantities $A_i(\alpha^i)^k$ recursively for $k = 1, 2, 3, \ldots, N-1$. The summation shown at the bottom of the figure is implemented quite simply since all quantities are binary.
Since the new RS decoder has a pipeline design, the interconnections are short and simple. By determining the size of the major building blocks, such as registers, latches, multipliers and other basic cells, estimates of the transistor count of the functional blocks for a (255,223) Reed-Solomon decoder are given in Table 2. It appears feasible to integrate the entire (255,223) RS decoder onto a single chip with current technology.
Note that the functional block called "delay II" takes about half of all the transistors. The rest of the decoder, which actually performs the decoding operations, takes approximately 60 thousand transistors. If the delay II circuitry is implemented on another chip, then it is certain that such a pipeline design of a (255,223) RS decoder could be fabricated on a single chip with present technology.
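Returning to the recursive polynomial evaluation illustrated in Fig. 5, the per-cell feedback loop can be sketched in a few lines of Python. This is an illustrative software model, not the circuit itself: each list entry plays the role of one cell's register, the field is $GF(2^4)$ with the assumed primitive polynomial $x^4 + x + 1$, and the function names and sample coefficients are hypothetical.

```python
# Sketch of the cell-wise evaluation: cell i holds A_i * (alpha^i)^k and a
# feedback multiplier by alpha^i advances it from k to k + 1.
# Assumptions: GF(2^4) with x^4 + x + 1; names and sample data are made up.

def build_gf16():
    """Return (exp, log) tables for GF(16) generated by x^4 + x + 1."""
    exp, log = [0] * 30, [0] * 16
    x = 1
    for i in range(15):
        exp[i] = exp[i + 15] = x
        log[x] = i
        x <<= 1
        if x & 0x10:
            x ^= 0x13
    return exp, log

EXP, LOG = build_gf16()

def gf_mul(a, b):
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def poly_eval(p, x):
    """Reference evaluation of p (lowest degree first) by Horner's rule."""
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x) ^ c
    return acc

def evaluate_by_cells(coeffs, n):
    """Return [A(alpha^1), ..., A(alpha^(n-1))], one register per cell."""
    # Initialization: cell i is loaded with A_i * alpha^i (the k = 1 term).
    regs = [gf_mul(c, EXP[i % 15]) for i, c in enumerate(coeffs)]
    values = []
    for k in range(1, n):
        total = 0
        for v in regs:
            total ^= v                # summation over cells is a bitwise XOR
        values.append(total)
        # Feedback loop: multiply cell i by alpha^i to move on to k + 1.
        regs = [gf_mul(v, EXP[i % 15]) for i, v in enumerate(regs)]
    return values

sample = [5, 0, 11, 1]                # hypothetical coefficients A_0 .. A_3
cell_values = evaluate_by_cells(sample, 15)

# A(Z) = 1 + alpha^3 Z vanishes only at Z = alpha^(-3) = alpha^12.
root_exponents = [k for k, v in enumerate(evaluate_by_cells([1, EXP[3]], 15), start=1) if v == 0]
```

Each cell needs only one constant multiplier and one register, and the values for successive $k$ emerge one per step, which is the property that lets the same structure serve for the Chien search.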
