Abstract-A versatile Reed-Solomon (RS) decoder structure is developed based on the time-domain decoding algorithm (transform decoding without transforms). In this paper, the algorithm is restructured and a method is given to decode any RS code generated by any generator polynomial. The main advantage of the introduced decoder structure is its versatility. By versatile decoder we mean a decoder that can be programmed to decode any Reed-Solomon code defined in the Galois field GF($2^m$) with a fixed symbol size m. This decoder can correct both errors and erasures for any RS code, including shortened and singly extended codes. It is shown that the introduced versatile decoder has a very simple structure and can be used to design high-speed single-chip VLSI decoders. As an illustrative example, a Gate-Array-Based programmable RS decoder is implemented on a single chip. This decoder chip can decode any RS code defined in GF($2^5$) with any code word length and any number of information symbols. The decoder chip is fabricated using low-power, 1.5-µm, 2-layer metal HCMOS technology.
I. INTRODUCTION
Reed-Solomon codes have come into widespread use in both communication and data storage systems.
Every particular application has its own distinct requirements, usually satisfied by its own individual hardware design. Recently, two structures for a versatile decoder have been developed, one based on the algebraic [1] and the other based on the transform decoding algorithm [2]. These structures can decode a large range of RS codes, but they are very complex and difficult to implement on a single VLSI chip. Two other structures for RS decoders have also been presented which are suitable for VLSI design [3], [4]; but these structures are not versatile, and the design must be configured for each specific RS code.
Our purpose is to design a versatile decoder, which means a decoder that can be used to decode any RS code defined in a specified Galois field. For this versatile decoder, the error and erasure correction capability can be programmed, but the code should be defined in GF($2^m$) with a fixed m. The reason for limiting ourselves to a specified GF($2^m$) is that the multiplier and the divider in Galois fields have different designs for various values of m. This structure also allows the programming of different generator polynomials of RS codes.
Manuscript received May 16, 1989; revised June 10, 1990. This work was supported in part by Canada NSERC under Grants A5987 and A3267 I, and by Quebec FCAR under Grant ER-0106.
Y. R. Shayan is with SR Telecom Inc., St. Laurent, P.Q., Canada H4S 1M5.
T. Le-Ngoc is with the Department of Electrical and Computer Engineering, Concordia University, Montreal, P.Q., Canada H3G 1M8.
V. K. Bhargava is with the Department of Electrical and Computer Engineering, University of Victoria, Victoria, B.C., Canada V8W 2Y2.
IEEE Log Number 9037976.
To present the versatile decoder, the time-domain decoding algorithm introduced by Blahut [5] is used. The time-domain decoding algorithm is based on the transform decoding algorithm, and it was first called "Transform Decoding without Transforms" [6]. In this paper, the time-domain decoding algorithm is restructured and a method is given to decode any RS code generated by any generator polynomial. It is also shown that the introduced structure can be used to decode shortened and singly extended RS codes. The complexity and throughput of this decoder are discussed. It is shown that this structure is very simple and can be used to design high-speed single-chip VLSI decoders.
As an illustrative example, the Gate-Array-Based programmable RS decoder for the Galois field GF($2^5$) is implemented on a single chip. This chip can work in two modes: Stand-Alone (ST) and Microprocessor-Peripheral (MP). These modes of operation and their applications are explained.
II. DECODING ALGORITHM
Let GF($2^m$) be the finite field of $2^m$ elements. Also, let $n = 2^m - 1$ be the length of the (n, k) RS code over GF($2^m$) with minimum distance $d = n - k + 1$, where k denotes the number of m-bit message symbols. This code has the capability of correcting $\nu$ errors and $\rho$ erasures as long as $2\nu + \rho \le n - k$.
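The bound above can be checked mechanically. The following is a minimal Python sketch; the function names `rs_parameters` and `is_correctable` are hypothetical and chosen only for illustration.

```python
def rs_parameters(m: int, k: int) -> tuple:
    """Return (n, d) for a primitive (n, k) RS code over GF(2^m)."""
    n = 2**m - 1
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    return n, n - k + 1


def is_correctable(n: int, k: int, errors: int, erasures: int) -> bool:
    """True when 2*errors + erasures does not exceed the n - k parity symbols."""
    return 2 * errors + erasures <= n - k


n, d = rs_parameters(5, 15)          # a (31, 15) code over GF(2^5)
print(n, d, is_correctable(n, 15, errors=5, erasures=6))
```

For the (31, 15) code, five errors plus six erasures use exactly the 16 available parity symbols, so the pattern is still within the capability of the code.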
The time-domain decoding algorithm was first presented by Blahut [6]. To explain the time-domain decoding of Reed-Solomon codes, the transform decoding algorithm is first briefly reviewed, and then the derivation of the time-domain decoding algorithm from the transform one is explained.
The transform decoding algorithm is explained in [7]. In the transform decoding algorithm, first n - k syndromes are calculated, which is a form of Fourier transform. Then the Berlekamp-Massey algorithm [8] is performed to obtain the errata locator polynomial by initializing the algorithm with the erasure locator polynomial. The errata locator polynomial, which has degree at most n - k, is found in $n - k - \rho$ iterations, where $\rho$ is the number of erasures. The roots of this polynomial indicate the error locations in the received code word. The third step of the decoding is recursive extension, which yields, in the frequency domain, the components of the errata values. In the last step, an inverse transform is performed on the frequency components of the errata values to obtain the time-domain errata-value vector. Finally, the estimate of the original code vector is evaluated by subtracting the time-domain errata values from the received code word.
0733-8716/90/1000-1535$01.00 © 1990 IEEE
As shown above, in the transform decoder the Berlekamp-Massey algorithm is preceded by a Fourier transform and is followed by an inverse Fourier transform in some form. However, instead of transforming the received word into the frequency domain, it is possible to find the equivalent of the Berlekamp-Massey algorithm in the time domain [5]. Using the time-domain Berlekamp-Massey algorithm, the Fourier transforms simply vanish. The time-domain decoder is structurally simple and useful in designing versatile decoders, as will be shown in the next section.
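To make the Fourier-transform character of the syndrome step concrete, here is a minimal Python sketch that evaluates $S_j = \sum_i v_i \alpha^{ij}$ for $j = 1, \ldots, n-k$ over GF($2^5$). The field polynomial $x^5 + x^2 + 1$, the choice of syndrome indices 1 to n - k, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
M = 5                    # symbol size (GF(2^5))
PRIM_POLY = 0b100101     # assumed field polynomial x^5 + x^2 + 1
N = (1 << M) - 1         # n = 2^m - 1 = 31

# Antilog/log tables; EXP is doubled so that index sums need no modulo.
EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i] = EXP[i + N] = x
    LOG[x] = i
    x <<= 1
    if x >> M:
        x ^= PRIM_POLY


def mul(a, b):
    """Multiply two GF(2^m) elements using the log tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def syndromes(v, k):
    """Return [S_1, ..., S_{n-k}] with S_j = sum_i v_i * alpha^(i*j)."""
    out = []
    for j in range(1, N - k + 1):
        s = 0
        for i, vi in enumerate(v):
            s ^= mul(vi, EXP[(i * j) % N])   # addition in GF(2^m) is XOR
        out.append(s)
    return out


# A single error of value 7 at position 3 on top of the all-zero code word.
v = [0] * N
v[3] = 7
print(syndromes(v, k=27))
```

Each syndrome is one spectral component of the received word, which is why the time-domain formulation can dispense with this step by folding the factors $\alpha^{ij}$ into the iterations.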
A. Time-Domain Decoding
The time-domain decoding algorithm is explained in [5] and is reviewed in this subsection. This algorithm has three steps as follows.
Step I-Time-Domain Erasure Locator: In the first step of the algorithm, the erasure locator vector $\boldsymbol{\rho} = (\rho_{n-1}, \ldots, \rho_0)$, which is the time-domain equivalent of the erasure locator polynomial, should be evaluated for $i = 0, \ldots, n-1$.
Step II-Time-Domain Errata Locator: In this step of the decoding, the time-domain errata locator vector $\boldsymbol{\lambda} = (\lambda_{n-1}, \ldots, \lambda_0)$ is calculated using the time-domain Berlekamp-Massey algorithm. This step uses a set of recursive equations to compute $\lambda_i^{(n-k)}$ using the temporary vector $\boldsymbol{b} = (b_{n-1}, \ldots, b_0)$ and the variables L, $\delta$, and $\Delta$, for $i = 0, \ldots, n-1$ and for $r = \rho+1, \rho+2, \ldots, n-k$. The initial conditions are $\lambda_i^{(\rho)} = b_i^{(\rho)} = \rho_i$ for all i, L = 0, and $\delta = 1$ if both $\Delta \ne 0$ and $2L - \rho \le r - 1$, and otherwise $\delta = 0$. Then $\lambda_i^{(n-k)} = 0$ if and only if the ith symbol of the received vector V is in error, and hence the component of the errata locator vector is $\lambda_i = \lambda_i^{(n-k)}$ for all i.
Step III-Time-Domain Errata Value: In the frequency domain, to find the errata magnitudes, we need to use recursive extension. An equivalent set of recursive equations, for $r = n-k+1, \ldots, n$, is restructured in the time domain. Starting with $s_i^{(n-k)} = v_i$ and $\lambda_i = \lambda_i^{(n-k)}$ for $i = 0, \ldots, n-1$, the last iteration results in the errata value vector $\boldsymbol{s} = (s_{n-1}, \ldots, s_0)$. Finally, the estimate of the errata value vector is obtained from $\boldsymbol{s}$ and added to the received vector V to form the corrected vector C.
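As an illustration of Step I, the sketch below builds a time-domain erasure locator vector in Python. It assumes the product form $\rho_i = \prod_l (1 - \alpha^{j_l}\alpha^{-i})$ over the erased positions $j_l$, which is the time-domain image of the polynomial factors $(1 - \alpha^{j_l} x)$; the exact convention of the original Step I equation is not reproduced here, so this form, the field polynomial, and the function names are assumptions.

```python
M = 5                    # symbol size (GF(2^5))
PRIM_POLY = 0b100101     # assumed field polynomial x^5 + x^2 + 1
N = (1 << M) - 1

EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i] = EXP[i + N] = x
    LOG[x] = i
    x <<= 1
    if x >> M:
        x ^= PRIM_POLY


def mul(a, b):
    """Multiply two GF(2^m) elements using the log tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def erasure_locator(erased_positions):
    """One pass per erasure: rho_i <- rho_i * (1 + alpha^(j - i)).

    In characteristic 2, subtraction and addition are both XOR.
    """
    rho = [1] * N
    for j in erased_positions:
        rho = [mul(rho[i], 1 ^ EXP[(j - i) % N]) for i in range(N)]
    return rho


rho = erasure_locator([2, 9, 30])
print([i for i in range(N) if rho[i] == 0])
```

The vector vanishes exactly at the erased positions, which matches the property stated for the errata locator in Step II: a zero component marks a symbol to be corrected.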
B. Restructured Algorithm
The time-domain decoding algorithm explained in [5] may introduce many extra errors when the number of errors in the channel is larger than the error correction capability of the code. In this subsection, a block is added at the end of the time-domain decoding algorithm to overcome this problem and hence improve its performance. Some other modifications are also given to simplify the structure of the decoder.
Consider the t-error-correcting time-domain Reed-Solomon decoder. When the number of errors introduced in the channel ($\nu$) is more than t, two cases can be distinguished.
In the first case, the pattern of the received vector V is such that the decoder assumes a codeword has been received which contains $\mu \le t$ errors. Therefore, the decoder decodes the received pattern by introducing at most $\mu$ errors. Hence, the decoder output has at most $\mu + \nu$ errors.
In the second case, the received pattern is such that the decoder assumes a codeword has been received which contains $\mu > t$ errors. In this case, the decoder tries to correct the received vector, but since this is beyond the capability of the code, the decoder fails and introduces up to n errors in the received vector. This happens even when the number of errors introduced in the channel is just above the error correction capability of the code ($\nu = t + 1$). This case can be detected easily, and the decoder can then output the received code word as it is, without introducing any extra error (C = V).
To detect the second case, consider the second step of the time-domain algorithm. In this step, when the ith … This block compares all the components of $\boldsymbol{\lambda}$ and $\boldsymbol{s}$. If $\lambda_i s_i$ is zero for $i = 0, 1, \ldots, n-1$, then the output of the decoder is corrected to $C = V - \boldsymbol{s}$; otherwise $C = V$.
In Fig. 1, other modifications are also provided to decrease the complexity or increase the speed of the decoder. As shown in this figure, after updating r, the block $s_i \leftarrow \alpha^i s_i$ is introduced. This removes the term in $\alpha^{ir}$ from the equation for $\Delta$, and likewise from the equation for updating $s_i$. Because there are n iterations, the final $s_i$ is multiplied by $\alpha^{in}$, but $\alpha^{in}$ equals one since $\alpha$ has order n, and hence the final result is not affected. This modification eliminates the calculation of $\alpha^{ir}$ in the original algorithm (Fig. 1 of [5]).
In the restructured algorithm (Fig. 1), the value of L is initialized to zero, and $\lambda_i$ and $b_i$ are initialized to $\alpha^0 = 1$ for $i = 0, 1, \ldots, n-1$, where $\boldsymbol{b} = (b_{n-1}, \ldots, b_0)$ is a temporary vector. Then the vector $\boldsymbol{s}$ is initialized to V. In the first $\rho$ iterations, the Berlekamp-Massey algorithm is initialized for the erasure locations, and it then proceeds through $n - k - \rho$ iterations to compute the time-domain errata-locator vector ($\boldsymbol{\lambda}$). The last k iterations update $\boldsymbol{s}$ to compute the errata value vector. Then all the components of $\boldsymbol{\lambda}$ are compared to the corresponding components of the updated $\boldsymbol{s}$. At the end, it is decided whether or not to correct the received vector V, depending on the above component-by-component comparison.
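The iteration just described can be sketched in Python for the errors-only case ($\rho = 0$). This is a minimal sketch under stated assumptions rather than the decoder of Fig. 1: the field polynomial $x^5 + x^2 + 1$ is assumed, the code is taken to have parity roots $\alpha^1, \ldots, \alpha^{n-k}$, the update formulas follow the usual Berlekamp-Massey recursion carried into the time domain, and the helper `encode` (a non-systematic multiplication by the generator polynomial) exists only to produce test code words.

```python
M = 5                    # symbol size (GF(2^5))
PRIM_POLY = 0b100101     # assumed field polynomial x^5 + x^2 + 1
N = (1 << M) - 1

EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i] = EXP[i + N] = x
    LOG[x] = i
    x <<= 1
    if x >> M:
        x ^= PRIM_POLY


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def inv(a):
    return EXP[N - LOG[a]]          # a must be nonzero


def encode(d, k):
    """Illustration only: c(x) = d(x) g(x), g with roots alpha^1..alpha^(n-k)."""
    g = [1]
    for j in range(1, N - k + 1):   # multiply g(x) by (x + alpha^j)
        g = [(g[i - 1] if i else 0) ^ (mul(g[i], EXP[j]) if i < len(g) else 0)
             for i in range(len(g) + 1)]
    c = [0] * N
    for i, di in enumerate(d):
        for j, gj in enumerate(g):
            c[i + j] ^= mul(di, gj)
    return c


def time_domain_decode(v, k):
    """Errors-only time-domain decoding with the final lambda_i * s_i check."""
    lam, b, s, L = [1] * N, [1] * N, list(v), 0
    for r in range(1, N + 1):
        s = [mul(EXP[i], s[i]) for i in range(N)]        # s_i <- alpha^i s_i
        delta = 0
        for i in range(N):
            delta ^= mul(lam[i], s[i])                   # discrepancy
        if r <= N - k:                                   # Berlekamp-Massey part
            grow = delta != 0 and 2 * L <= r - 1
            new_lam = [lam[i] ^ mul(delta, mul(EXP[N - i], b[i]))
                       for i in range(N)]
            if grow:
                L = r - L
                d_inv = inv(delta)
                b = [mul(d_inv, lam[i]) for i in range(N)]
            else:
                b = [mul(EXP[N - i], b[i]) for i in range(N)]
            lam = new_lam
        else:                                            # errata-value part
            s = [si ^ delta for si in s]
    # After n iterations the scaling alpha^(i*n) equals one, so s holds the errata.
    consistent = all(mul(lam[i], s[i]) == 0 for i in range(N))
    return [v[i] ^ s[i] for i in range(N)] if consistent else list(v)


d = [(3 * i + 1) % 32 for i in range(25)]
c = encode(d, 25)                   # (31, 25) code, t = 3
v = list(c)
v[2] ^= 7
v[10] ^= 1
v[30] ^= 19
print(time_domain_decode(v, 25) == c)
```

The final comparison plays the role of the added block: the correction is applied only when every nonzero errata value falls on a position where the locator is zero, and otherwise the received vector is passed through unchanged.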
Note that a mistake in Fig. 1 of [5] is corrected by moving the block for updating the components of vector 5 to the right place.
C. Decoding with any Generator Polynomial
The original time-domain algorithm is only applicable for systematic codes generated by the generator polynomial g(x) of (9).
It is known that the polynomial of (9) is not the only generator polynomial generating RS codes. In fact, by choosing another generator polynomial, the structure of the encoder can become simpler [9] . In this subsection, a method is introduced to decode any RS code generated by any generator polynomial.
An (n, k) Reed-Solomon code can be generated by n different generator polynomials,
where the constant h can be chosen to have values of $0, 1, 2, \ldots, n-1$.
To explain the method, we state the following theorem.
Theorem: In a systematic encoder, encoding the message polynomial d(x) by the generator polynomial g(x) to get the code word polynomial c(x) is equivalent to encoding … by G(x) to find the code word polynomial C(x), and then evaluating $c(x) = C(\alpha^{h-1} x)$.
The proof of this theorem is not discussed here in any detail [10]. Using this theorem, the time-domain decoding algorithm can be used to decode any RS code with any generator polynomial. Assume the code word vector $c = (c_{n-1}, \ldots, c_0)$ is generated by the generator polynomial G(x) in a systematic form. In the channel, the error vector e is added to this vector to form the received vector at the input of the decoder.
In the decoder, a block can be added to form the vector V as
$$V_i = v_i\,\alpha^{i(h-1)}. \tag{13}$$
The vector V is the sum of a code word vector C and a new error vector E. According to the above theorem, C is a valid code vector generated by the generator polynomial g(x) and related to c as
$$c_i = C_i\,\alpha^{-i(h-1)}. \tag{14}$$
Comparing (12), (13), and (14), we have
$$E_i = e_i\,\alpha^{i(h-1)}.$$
Since $\alpha^{i(h-1)}$ is nonzero for all i, $E_i$ is nonzero exactly when $e_i$ is, so the number of errors is the same in both representations. Hence, the time-domain decoder can be used for decoding any RS code. At the end of decoding, after evaluating the corrected vector C, we need to find the value of c by using (14).
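The symbol-wise scaling can be tried out numerically. The Python sketch below assumes that the general generator polynomial has the consecutive roots $\alpha^h, \ldots, \alpha^{h+n-k-1}$ and that the polynomial handled by the time-domain algorithm has the roots $\alpha^1, \ldots, \alpha^{n-k}$; these root sets, the field polynomial, the non-systematic test encoder, and all function names are assumptions made for illustration.

```python
M = 5                    # symbol size (GF(2^5))
PRIM_POLY = 0b100101     # assumed field polynomial x^5 + x^2 + 1
N = (1 << M) - 1

EXP, LOG = [0] * (2 * N), [0] * (N + 1)
x = 1
for i in range(N):
    EXP[i] = EXP[i + N] = x
    LOG[x] = i
    x <<= 1
    if x >> M:
        x ^= PRIM_POLY


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_eval(coeffs, point):
    """Horner evaluation; coeffs[i] is the coefficient of x^i."""
    acc = 0
    for c in reversed(coeffs):
        acc = mul(acc, point) ^ c
    return acc


def codeword(d, h, n_parity):
    """Illustration only: d(x) times a generator with roots alpha^h..alpha^(h+n_parity-1)."""
    g = [1]
    for j in range(h, h + n_parity):
        root = EXP[j % N]
        g = [(g[i - 1] if i else 0) ^ (mul(g[i], root) if i < len(g) else 0)
             for i in range(len(g) + 1)]
    c = [0] * N
    for i, di in enumerate(d):
        for j, gj in enumerate(g):
            c[i + j] ^= mul(di, gj)
    return c


def to_decoder_domain(v, h):
    """V_i = v_i * alpha^(i(h-1)), applied at the decoder input."""
    return [mul(vi, EXP[(i * (h - 1)) % N]) for i, vi in enumerate(v)]


def from_decoder_domain(C, h):
    """c_i = C_i * alpha^(-i(h-1)), applied at the decoder output."""
    return [mul(Ci, EXP[(-i * (h - 1)) % N]) for i, Ci in enumerate(C)]


h, n_parity = 4, 6
c = codeword(list(range(1, 26)), h, n_parity)
C = to_decoder_domain(c, h)
print(all(poly_eval(C, EXP[j]) == 0 for j in range(1, n_parity + 1)))
```

After scaling, the word vanishes at $\alpha^1, \ldots, \alpha^{n-k}$, so a decoder built for that root set can process it, and the inverse scaling recovers the original symbols.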
III. DECODER ARCHITECTURE
In this section, the architecture of the versatile time-domain decoder is presented [15]. The Reed-Solomon decoder has two units, which are the Input/Output (I/O) Unit and the Decoding Unit. These two blocks are explained in this section with the assumption that the decoder receives the symbol $V_0$ first and the symbol $V_{n-1}$ last. The output of the decoder is also in the same order, which means that $C_0$ is transmitted first and $C_{k-1}$ last.
A. Input and Output Interface
The I/O Unit, shown in … and $\alpha^{-i(h-1)}$ are shown in Fig. 2. The register RI is initialized to $\alpha^0 = 1$ when the I/O Unit receives the first symbol $V_0$ of the input vector, and therefore the first symbol $V_0$ is multiplied by 1 to form $v_0$. This register is loaded when the next symbols $V_i$ (for $i = 0, 1, \ldots, n-1$) are received, so $V_i$ is multiplied by $(\alpha^{h-1})^i$. To evaluate $\alpha^{-i(h-1)}$ at the output section, a similar procedure is used. The inputs and the outputs of the decoder are continuous; hence, they cannot be stopped or delayed. On the other hand, the decoder needs some time to correct the errors after receiving an entire input block. To let the decoder correct the erasures and errors without losing any data at the input or output of the decoder, three sets of n-stage parallel-in/parallel-out shift registers are used, where $n = 2^m - 1$.
The I/O unit, upon receiving a code word, stores it symbol by symbol in one of these buffers, and then the phase control connects this buffer to the decoder unit. While the decoder is correcting the errors, another buffer receives the next code word and stores it. In the third phase, the buffer which holds the corrected code word outputs the data. The phase control is responsible for the operation of these buffers in each phase. Note that we have assumed a systematic RS (n, k) code, and the information symbols of $v_i$ and $c_i$ are in locations with indexes $i = k-1, k-2, \ldots, 0$, while the parity symbols are in locations $i = n-1, n-2, \ldots, k$. Therefore, after correction of the errors, the I/O Unit only needs to output the information symbols, which are located at the k locations at the right side of the output buffer; this is done automatically, since the output buffer is shifted out with a lower-frequency clock than the input buffer.
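The three-phase rotation of the buffers can be modelled in a few lines of Python. This is a behavioural sketch only: the class name `TripleBuffer`, its `phase` method, and the stand-in `decode` callable are hypothetical, and no clocking or shift-register detail is modelled.

```python
class TripleBuffer:
    """Behavioural model: one block is received, one decoded, one shifted out."""

    def __init__(self, decode):
        self.decode = decode
        self.work = None     # block received in the previous phase
        self.out = None      # block corrected in the previous phase

    def phase(self, received_block):
        """Run one phase and return the block leaving the output buffer."""
        emitted = self.out
        # The block received one phase ago is corrected during this phase.
        self.out = self.decode(self.work) if self.work is not None else None
        # The input buffer of this phase becomes the work buffer of the next.
        self.work = received_block
        return emitted


# Stand-in "decoder" that just tags each block so the latency is visible.
pipeline = TripleBuffer(decode=lambda block: block + ["corrected"])
for word in (["w0"], ["w1"], ["w2"], ["w3"]):
    print(pipeline.phase(word))
```

A block received in one phase is decoded in the next and leaves in the one after that, so the first two phases emit nothing and the input and output streams never have to stall.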
There is another input to the I/O unit which gives the erasure information about any received symbol (ER). This signal exhibits a low-to-high transition to denote an erasure which occurs in the synchronously received symbol. If the received symbol $V_i$ is an erasure, then we need to store $\alpha^i$ and use it later. To generate $\alpha^i$, a multiplier and the register RER are used. The register RER is initialized to $\alpha^0 = 1$, and the clock of this register is Clock-In. The erasure information is stored in n-stage parallel-in/parallel-out shift registers. Obviously, two sets of these n-stage shift registers should be used, one for the input and one for the work buffer. The phase control circuitry changes the role of these buffers consecutively. An up-counter finds the number of erasures ($\rho$) to be used in the Decoding Unit.
B. Decoding Unit
The Decoding Unit structure is shown in Fig. 3. In the Decoding Unit, there are three sets of n-stage parallel-in/parallel-out shift registers to store $s_i$, $\lambda_i$, and $b_i$ for $i = 0, 1, \ldots, n-1$. The contents of these registers are shifted to the Arithmetic Unit with $clk_i$, which is the internal clock of the system. The Arithmetic Unit processes the contents of these registers and feeds them back circularly. To distinguish between the different iterations of the decoding process, the internal clock is divided by n and fed to the iteration counter. The output of this counter indicates the iteration number r. There are n + 1 iterations, which are determined by the control circuitry to give the proper command to the Arithmetic Unit.
In this design, the three n-stage shift registers are shifted to the right with the $clk_i$ clock. This shifting is performed up to the end of the decoding process. The switches are connected based on the value of the iteration counter (r). The connections of these switches are shown on the contacts by specifying the value of r.
In iteration r = 0, the contents of the work buffer ($v_i$) are multiplied by $\alpha^i$ and loaded into the shift register $s_i$. At the same time, $v_i \alpha^i$ is accumulated in the $\Delta$ register to form $\Delta$ for the first iteration. In this iteration, the shift registers $\lambda_i$ and $b_i$ are loaded with the initial value of $\alpha^0 = 1$. In order to keep the data of the work buffer, the values of $v_i$ are also fed back to the work buffer with $clk_i$. In this iteration, L is initialized to zero.
In iterations 1 up to $\rho$, the registers $\lambda_i$ and $b_i$ are initialized by the erasure locator vector. In iterations $\rho + 1$ up to n - k, the value of $\delta$ and the updated value of L are calculated based on the value of the iteration counter r, the discrepancy $\Delta$, and the previous value of L. In these iterations $\lambda_i$, $b_i$, and $\Delta$ are calculated simultaneously. In iterations n - k + 1 up to n - 1, the value of $s_i$ is added to $\Delta$ and, simultaneously, the new value of $\Delta$ is calculated. In these iterations, $\lambda_i$ is also shifted circularly to maintain the contents. In the last iteration (n), $s_i$ is added to the previous value of $\Delta$ to be used for correcting $v_i$.
C. Structure for Shortened and Extended RS Decoders
The shortened Reed-Solomon codes, that is, codes terminated by fixing some of the information places at zero, can also be decoded using the structure shown in the previous subsections. Let us consider an RS (N, K) code, which is the shortened form of the primitive RS (n, k) code, where n - k = N - K, and the error correction capability of the shortened code is at least $t = (n - k)/2$. Assume that the RS (N, K) code is formed by fixing zero at locations indexed by $i = 0, 1, \ldots, n - N - 1$. Therefore, decoding of this code can be accomplished by the same structure discussed before, with only two minor changes in the algorithm.
The first change is in inputting the data, which fixes zero in the locations defined above. This can be done by clearing the input buffer at the beginning of each phase before entering any data in the buffer. The other change is in outputting the data. As we discussed, the last n - N symbols of the decoded vector in the work buffer ($i = N, N+1, \ldots, n-1$) are zero, and we do not want to output them. Therefore, after correcting the errors and forming the vector $c_i$ in the work buffer, this buffer should be shifted to the right n - N times to make the information symbols available.
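The two changes amount to zero-filling before decoding and discarding the filler afterwards. The Python sketch below assumes that the filler occupies the highest-index positions $i = N, \ldots, n-1$ of the work buffer; `decode_shortened` and the `decode_full` callable are hypothetical names, and any full-length decoder can be passed in.

```python
def decode_shortened(received, n, decode_full):
    """Decode an (N, K) shortened word with a decoder for the length-n parent code.

    received    : list of N symbols
    n           : length of the parent code (2^m - 1)
    decode_full : callable taking and returning a list of n symbols
    """
    n_short = len(received)
    if n_short > n:
        raise ValueError("shortened length cannot exceed the parent length")
    # Change 1: the unused positions are fixed at zero (cleared input buffer).
    padded = list(received) + [0] * (n - n_short)
    corrected = decode_full(padded)
    # Change 2: the n - N filler symbols are skipped on output.
    return corrected[:n_short]


# Identity "decoder" used only to show the data flow.
print(decode_shortened([1, 2, 3], n=7, decode_full=lambda word: word))
```

Because the filler symbols are known zeros, they carry no errors, and the parent code's correction capability is fully available to the N transmitted symbols.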
Considering the above-mentioned changes in the system, the decoder chip can decode any RS (N, K) code which is defined in GF($2^m$), where $N \le 2^m - 1$ and $t = (N - K)/2$ is the error correction capability of the code. The only fixed parameter in the decoder chip is the Galois field (m is fixed).
The decoding of a singly extended Reed-Solomon code is described in [11]. Singly extended codes are important because they allow one to use codes such as a (32, 16) or a (256, 240) Reed-Solomon code in which the block length is a power of two. An (N, K) singly extended Reed-Solomon code with $N = 2^m$ can be described as an … is performed for the Berlekamp-Massey algorithm. In this case, … - 2t iterations are performed for updating the $s_i$ vector. These changes can be easily included in the control circuitry of the decoder, and therefore this decoder structure can also be used for singly extended RS codes.
IV. COMPLEXITY AND THROUGHPUT
In this section, the complexity and throughput of the versatile decoder are discussed for different Galois field sizes.
As shown in Fig. 2, the I/O unit uses five n-stage parallel-in/parallel-out shift registers ($n = 2^m - 1$) consisting of m-bit symbols. These registers are used as buffers for storing input, output, and erasure information. Table I gives the number of gates needed for the I/O unit for different Galois fields GF($2^m$).
Considering Table I, we note that the number of gates in the buffers is much higher than the number of gates in the control and arithmetic sections of the I/O unit.
The same is true for the decoding unit of Fig. 3. Table II shows the number of gates needed for the decoding unit.
All the buffers in the decoder are easy to design, since they have a modular structure and are similar to each other. The irregular part of the design is the arithmetic and control sections. But these sections are very small compared to the buffer sections, and therefore the design time is very short. For hardware design, this point will decrease the design and implementation cost by a large factor.
In the arithmetic and control sections, all together, there are ten parallel multipliers in GF($2^m$), four m-bit EXCLUSIVE-OR gates for Galois field addition, and one inversion in GF($2^m$). Also, some switches (multiplexers), counters, and other control circuitry are present. For the multiplier and inversion circuitry, standard designs can be used [12].
The decoding time of the decoder is determined by the longest delay path. This path has a delay $\tau$ equivalent to the delay of 24 gates. The decoding algorithm needs $n = 2^m$ iterations and one extra iteration for initialization. In each iteration, the shift registers are shifted to the right $n = 2^m$ times, so one iteration period is $T = n\tau$ and the decoding time is $n(n + 1)\tau$. The decoding time can be used to evaluate the maximum input bit rate of the code, which is $m / [(2^m + 1)\tau]$. Considering an internal clock of 20 MHz ($\tau = 50$ ns), the maximum bit rates at the input of the decoder for m = 5, 6, 7, 8 are 3, 1.8, 1.1, and 0.6 Mb/s, respectively.
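The quoted bit rates follow directly from the decoding time. The short Python sketch below reproduces the arithmetic, taking $n = 2^m$ as in the paragraph above and assuming that one block of n m-bit symbols must be absorbed per decoding time; the function names are hypothetical.

```python
def decoding_time(m: int, tau: float) -> float:
    """n + 1 iterations of n shifts each, with n = 2^m and clock period tau (seconds)."""
    n = 2**m
    return n * (n + 1) * tau


def max_input_bit_rate(m: int, tau: float) -> float:
    """Bits per second when one block of n symbols of m bits arrives per decoding time."""
    n = 2**m
    return n * m / decoding_time(m, tau)


tau = 50e-9                      # 20 MHz internal clock
for m in (5, 6, 7, 8):
    print(m, round(max_input_bit_rate(m, tau) / 1e6, 1), "Mb/s")
```

The ratio simplifies to $m / [(2^m + 1)\tau]$, so the rate roughly halves with each additional bit of symbol size.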
V. VLSI GATE-ARRAY-BASED DECODER
As shown in the previous section, the buffer sections of the decoder use most of the gates needed for the decoder design. Since the buffer structure is cellular and regular, the design and layout time for VLSI implementation of these sections becomes very low. The control and arithmetic sections of the decoder are not cellular but are very simple, with a low number of gates, and hence are easy to design. These features make the decoder structure suitable for VLSI design. One other reason for the suitability for VLSI design is the versatility of the RS decoder structure. In fact, because of the high cost of fabrication, general-purpose chips such as this decoder should be considered for VLSI design.
As an illustrative example, a Gate-Array-Based programmable RS decoder is implemented on a single chip [13]. The decoder chip operates in a 5-bit symbol Galois field, GF($2^5$). Both parameters of the RS code, n (length of the code word) and k (length of the information block), are user programmable. For the fabrication of the decoder chip, low-power, 1.5-µm, 2-layer metal HCMOS technology is used. The decoder chip can decode any RS code defined in GF($2^5$). The power dissipation of the chip is 0.6 W in the worst case, and the design needs 20,500 gates in a 68-pin LCC package. One of the reasons for using more gates in this design, compared to Tables I and II, is the restrictions imposed by the gate array design [14]. The other reason is the addition of some more features to the versatile RS decoder, such as microprocessor interface circuits, compared to Figs. 2 and 3.
The Gate-Array-Based RS decoder chip can be used in two modes: ST (Stand-Alone) and MP (Microprocessor Peripheral). In ST mode, it operates as a stand-alone device, as shown in Figs. 2 and 3, directly in the data symbol stream. In this mode, it does not require a processor or any external buffer. In MP mode, it can be directly interfaced to any standard microprocessor bus. This configuration (MP mode) is useful for applications in magnetic recording systems, data communications, as well as DSP-based modems. In this mode, the decoder chip operates as a standard peripheral device of a microprocessor or a digital signal processor.
VI. CONCLUSION
In this paper, a versatile Reed-Solomon decoder was presented to decode any RS ( n , k ) code in a specified Galois field. The time-domain decoding algorithm was restructured and used to introduce the decoder. A method was also presented to decode an RS code generated by any generator polynomial. The structure of the versatile decoder is such that different RS codes can be programmed to correct errors and erasures including shortened and singly extended codes.
The structure of the decoder is very simple and easy to implement on a VLSI chip. As an illustrative example, a Gate-Array-Based design of the versatile decoder in GF($2^5$) was given.
