An area-efficient VLSI architecture for decoding of Reed-Solomon codes by Jah-Ming Hsu & Chin-Liang Wang
AN AREA-EFFICIENT VLSI ARCHITECTURE FOR DECODING OF 
REED-SOLOMON CODES 
Jah-Ming Hsu and Chin-Liang Wang 
Department of Electrical Engineering 
National Tsing Hua University, Hsinchu, Taiwan 300, Republic of China 
clwang@ee.nthu.edu.tw
ABSTRACT 
This paper presents a new pipelined VLSI array for 
decoding Reed-Solomon (RS) codes. The architecture is 
designed based on the modified time-domain Berlekamp- 
Massey algorithm incorporated with the remainder 
decoding concept. A prominent feature of the proposed 
system is that, for a t-error-correcting RS code with block 
length n, it involves only 2t consecutive symbols to 
compute a discrepancy value in the decoding process, 
instead of n consecutive symbols used in the previous RS 
decoders based on the same algorithm without using the 
remainder decoding concept. The proposed RS decoder 
reaches an average decoding rate of one data symbol per 
clock cycle. As compared to a similar pipelined RS 
decoder with the same decoding rate, it gains significant 
improvements in hardware complexity and latency. 
1. INTRODUCTION 
Reed-Solomon (RS) [1] codes are an important class of
error-correcting codes. They have been extensively used
in applications such as satellite and mobile communications,
magnetic recording [2], and high-definition
television (HDTV) [3]. Since the decoding process for RS
codes involves very high computational complexity, there 
is a need to design high-speed RS decoders to meet the 
real-time requirements. 
Various algorithms have been proposed for decoding of 
RS codes. Among them, the time-domain Berlekamp- 
Massey algorithm [4] is ideal for VLSI implementation 
due to its repeated use of some basic operations. One 
major disadvantage associated with this algorithm is that 
the number of iterations involved is usually large and thus 
the corresponding computational complexity is rather 
high. Recently, Choomchuay and Arambepola [5]
proposed a modified time-domain Berlekamp-Massey
algorithm having roughly 60% fewer computations than
the original algorithm by Blahut [4]. To further enhance
the decoding speed, Choomchuay and Arambepola [5]
also presented a pipelined array architecture of 2t basic
cells and an error correcting unit to implement their
algorithm for decoding a t-error-correcting RS code with
block length n, as shown in Fig.1(a). The 2t basic cells are
used to estimate the error pattern, where each cell realizes
one iteration stage of the decoding algorithm, and the
error correcting unit corrects the received data block by
adding the estimated error pattern to it. It should be noted
that updating the temporary variables at each basic cell of
the array cannot be performed until the corresponding
discrepancy value is evaluated, where a discrepancy value
is computed using the entire received data block of n
consecutive symbols. Since the received data block enters
the array on a symbol-by-symbol basis, each basic cell
requires n delay elements to delay one temporary variable.
This gives the array an enormous storage
requirement.
This work was supported by the National Science Council of the
Republic of China under Grant NSC84-2213-E007-059.
In the following, we first modify the RS decoding
algorithm in [5] such that the number of symbols
relevant to the computation of a discrepancy value is
reduced. Then we present a pipelined array structure to
realize the modified algorithm. In contrast to the pipelined
RS decoder in [5], the proposed one achieves the same
throughput performance with much smaller chip area and
latency.
2. AN RS DECODING ALGORITHM 
Theorem: Consider a t-error-correcting RS code of block
length $n = 2^m - 1$, with each block containing k m-bit
information symbols over the finite field GF($2^m$), where
2t = n - k. For such an RS code, the modified time-domain
Berlekamp-Massey algorithm in [5] is applicable to RS
decoding when the input is the remainder polynomial r(x)
of degree 2t-1 that is obtained by dividing the received data
polynomial (or data block) v(x) of degree n-1 by the
generator polynomial g(x) of degree 2t (i.e., r(x) = v(x)
mod g(x)).
0-7803-3192-3/96 $5.00 © 1996 IEEE
Fig.1. Two pipelined RS decoders. (a) The structure in
[5]. (b) The proposed structure.
A simple deduction is given here to show this theorem.
Assume that the number of errors introduced in the noisy
channel does not exceed the error correcting capability of
the RS code used. In addition, let each received data
polynomial v(x) consist of two components, a correctable
error pattern e(x) and an arbitrary RS codeword c(x),
where v(x) = c(x) + e(x). With these assumptions, the
decoding algorithm in [5] can compute the error pattern
exactly, regardless of the codeword transmitted. In other
words, adding or subtracting an arbitrary multiple of the
generator polynomial g(x) to or from the received data
polynomial v(x) will still leave the estimated error
pattern unchanged. Since the remainder polynomial r(x)
of v(x) is obtained by subtracting a multiple of g(x) from
v(x), it can be used as an excitation to the algorithm in [5]
for RS decoding. Note that the above result is consistent
with the concept indicated by Welch and Berlekamp [6]
that the remainder polynomial of the received data block
contains sufficient information for decoding an RS code.
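The argument above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes GF(2^8) built from the primitive polynomial 0x11D with alpha = 2 (the paper does not name a field polynomial), and the helper names and the test data are hypothetical. It shows that v(x) and r(x) = v(x) mod g(x) take the same values at the 2t roots of g(x), and that adding a multiple of g(x) to v(x) leaves r(x) unchanged.

```python
# GF(2^8) log/antilog tables. Assumption: primitive polynomial 0x11D, alpha = 2.
EXP, LOG = [0] * 510, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    """Multiply two field elements via the log tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def poly_mul(a, b):
    """Product of two polynomials (coefficients listed lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, bj)
    return out

def poly_eval(p, x):
    """Evaluate p at x by Horner's rule."""
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x) ^ c
    return acc

def poly_mod(v, g):
    """Remainder of v(x) divided by a monic g(x)."""
    r, dg = list(v), len(g) - 1
    for k in range(len(r) - 1, dg - 1, -1):
        q = r[k]
        if q:
            for j in range(dg + 1):
                r[k - dg + j] ^= gf_mul(q, g[j])
    return r[:dg]

def generator(t, t0):
    """g(x) = (x + alpha^t0)(x + alpha^(t0+1)) ... (x + alpha^(t0+2t-1))."""
    g = [1]
    for k in range(2 * t):
        g = poly_mul(g, [EXP[(t0 + k) % 255], 1])
    return g

n, t, t0 = 255, 3, 1                               # illustrative parameters
g = generator(t, t0)
v = [(37 * i + 11) % 256 for i in range(n)]        # arbitrary received block
r = poly_mod(v, g)
roots = [EXP[(t0 + k) % 255] for k in range(2 * t)]
syn_v = [poly_eval(v, a) for a in roots]           # values of v(x) at the roots
syn_r = [poly_eval(r, a) for a in roots]           # values of r(x) at the roots

# Add an arbitrary multiple of g(x) to v(x); the remainder must not change.
mg = poly_mul([5, 0, 9], g)
v2 = [c ^ (mg[i] if i < len(mg) else 0) for i, c in enumerate(v)]
r2 = poly_mod(v2, g)
print(syn_v == syn_r, r2 == r)
```

Because g(x) vanishes at each of its roots, any quantity the decoder derives from those evaluations is the same for v(x) and r(x), which is the property the deduction relies on.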
Let the generator polynomial g(x) of the RS code, the
received data polynomial v(x), the corresponding
remainder polynomial r(x), and the error pattern e(x) be
expressed as follows:
$$g(x) = (x + \alpha^{t_0})(x + \alpha^{t_0+1}) \cdots (x + \alpha^{t_0+2t-1}) \tag{1}$$
$$v(x) = v_0 + v_1 x + v_2 x^2 + \cdots + v_{n-2} x^{n-2} + v_{n-1} x^{n-1} \tag{2}$$
$$r(x) = r_0 + r_1 x + r_2 x^2 + \cdots + r_{2t-2} x^{2t-2} + r_{2t-1} x^{2t-1} \tag{3}$$
$$e(x) = e_0 + e_1 x + e_2 x^2 + \cdots + e_{n-2} x^{n-2} + e_{n-1} x^{n-1} \tag{4}$$
where $\alpha$ is a primitive element in GF($2^m$) and $t_0$ is an
arbitrary integer between 0 and n-1. Then, based on the
above-mentioned theorem and Algorithm D in [5], we
have the following algorithm for RS decoding:
Initialization: { i=O, 1, 2, ..., n-1 } 
LO = yo, ,P = 1 
t;p = sp= + 0 
I 1 1  
$L_0 = 0$
$K_0 = u_0 = 1$
Recursion:
{ i = 0, 1, 2, ..., n-1 and j = 1, 2, ..., 2t. $\delta_j = 1$ if both $\Delta_j \neq 0$
and $2L_{j-1} \le j-1$, and $\delta_j = 0$ otherwise. If $\delta_j = 0$,
$u_j = u_{j-1} + 1$; otherwise $u_j = 1$. }
Error Evaluation: { i = 0, 1, 2, ..., n-1 }
-, i f h f f = ~  
0, otherwise 
(9) 
Note that the only difference between the above
algorithm and Choomchuay-Arambepola's Algorithm D
in [5] is that the former uses the remainder polynomial
r(x) of degree 2t-1 to compute the discrepancy values, as
given by (5), while the latter uses the received data
polynomial v(x) of degree n-1 for the same purpose, as
given by eqn. 8a in [5].
3. REALIZATION OF THE PROPOSED RS 
DECODING ALGORITHM 
Fig.1(b) shows a pipelined array structure for realizing the
proposed RS decoding algorithm with t-error-correcting
capability and block length n. The structure is rather
similar to that given in [5]. It consists of a pre-processing
unit, 2t basic cells, and an error correcting unit. The
pre-processing unit generates the remainder
polynomial r(x) from the received data block v(x), as
shown in Fig.2, where the received data symbol of the
highest degree (i.e., $v_{n-1}$) enters first and three FILO
(first-in-last-out) buffers are used to reverse the sequence
of the received data symbols in a block and the output
sequence of the remainder polynomial under the control
of the C1 and C2 patterns. The remainder polynomial
generator is actually a divide-by-g(x) circuit [7]. The
control pattern C1 is a sequence of n-2t 1's followed by 2t
0's, and the control pattern C2 is a sequence of n 0's
followed by n 1's. These two patterns are synchronized
with the input data symbols and recirculate
continuously. The function of the basic cell is shown in
Fig.3, where the jth basic cell of the array realizes the jth
iteration given by (5)-(9). The error correcting unit
evaluates the error pattern based on (10) and adds the
result to the received data block.
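As an illustration of the remainder generator, the sketch below models a textbook divide-by-g(x) shift register fed with the highest-degree symbol first, followed by a FILO that hands the remainder to the cells lowest degree first. It is a behavioural model under stated assumptions, not the circuit of Fig.2: the field polynomial 0x11D, the unload order of the register, and all function names are assumptions, and the C1/C2 control patterns are not modelled.

```python
# GF(2^8) log/antilog tables. Assumption: primitive polynomial 0x11D, alpha = 2.
EXP, LOG = [0] * 510, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] ^= gf_mul(ai, bj)
    return out

def poly_eval(p, x):
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x) ^ c
    return acc

def generator(t, t0):
    g = [1]
    for k in range(2 * t):
        g = poly_mul(g, [EXP[(t0 + k) % 255], 1])   # factor (x + alpha^(t0+k))
    return g

def lfsr_remainder(symbols_high_first, g):
    """Clock symbols through a 2t-stage divide-by-g(x) register (g monic)."""
    deg = len(g) - 1
    reg = [0] * deg
    for s in symbols_high_first:
        fb = reg[-1]                           # coefficient shifted out this clock
        for i in range(deg - 1, 0, -1):        # shift up and add fb * g_i
            reg[i] = reg[i - 1] ^ gf_mul(fb, g[i])
        reg[0] = s ^ gf_mul(fb, g[0])
    return reg                                 # reg[i] holds r_i

n, t, t0 = 255, 2, 0                           # illustrative parameters
g = generator(t, t0)
v = [(91 * i + 5) % 256 for i in range(n)]     # v[i] is the coefficient of x^i
reg = lfsr_remainder(reversed(v), g)           # v_{n-1} enters first

# FILO model (assumption): the register is unloaded highest degree first,
# so popping the stack delivers r_0, r_1, ..., r_{2t-1} to the basic cells.
filo = list(reversed(reg))
to_cells = [filo.pop() for _ in range(len(filo))]

roots = [EXP[(t0 + k) % 255] for k in range(2 * t)]
same_at_roots = all(poly_eval(v, a) == poly_eval(reg, a) for a in roots)
print(len(to_cells), same_at_roots)
```

A polynomial of degree below 2t is fixed by its values at the 2t distinct roots of g(x), so agreement at the roots confirms that the register holds v(x) mod g(x) after n clocks.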
Fig.2. The circuit of the pre-processing unit. (Block labels: remainder generator, order reversing, FILO #1 of size 2t.)
Each basic cell consists of six multipliers (M1-M6),
five adders, six multiplexers (MUXs), one comparator,
and a number of shift registers, where the circuit for
computing the discrepancy values ($\Delta_j$'s) is further given in
Fig.4. This circuit is designed based on the following
equation:
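$$\begin{aligned}
\Delta_j &= (\alpha^j)^{2t-1}\left[ r_0\lambda_0^{(j-1)}\alpha^{-j(2t-1)} + r_1\lambda_1^{(j-1)}\alpha^{-j(2t-2)} + \cdots + r_{2t-1}\lambda_{2t-1}^{(j-1)} \right] \\
&= (\alpha^j)^{2t-1}\left[ \left(\cdots\left(\left(r_0\lambda_0^{(j-1)}\alpha^{-j} + r_1\lambda_1^{(j-1)}\right)\alpha^{-j} + r_2\lambda_2^{(j-1)}\right)\alpha^{-j} + \cdots\right)\alpha^{-j} + r_{2t-1}\lambda_{2t-1}^{(j-1)} \right]
\end{aligned} \tag{11}$$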
Since the symbols are fed to the discrepancy computing
circuit in serial form with the one of lowest degree first, it
takes 2t+1 clock cycles to calculate the discrepancy value
$\Delta_j$, where one clock cycle is required for the
multiplication by $\alpha^{j(2t-1)}$. When $\Delta_j$ is ready, we require
two more clock cycles to compute $\Delta_j K_{j-1}\alpha^{-iu_{j-1}}$. It can
be checked from (8) that $\lambda_i^{(j-1)}$ should be delayed by 2t+3
clock cycles (denoted by "D(2t+3)") such that it can meet
$\Delta_j K_{j-1}\alpha^{-iu_{j-1}}$ to form $\lambda_i^{(j)}$. Note that the multiplier M3 is
used to perform the multiplication by $\alpha^{j(2t-1)}$ given in
(11) and the two multiplications involved in the
computation of $\Delta_j K_{j-1}\alpha^{-iu_{j-1}}$ under the control of a
multiplexer. The circuits for realizing the other operations
of the recursion given in (5)-(9) are similar to those
described in [5], with a difference in the size of delay for
buffering variables.
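The nested form of (11) and its cycle count can be sketched as follows. This is an illustrative model only: the field polynomial 0x11D, the function names, and the sample values of r_i and lambda_i are assumptions, and the cycle counter simply counts one multiply-accumulate per incoming symbol plus the final scaling step described in the text.

```python
# GF(2^8) log/antilog tables. Assumption: primitive polynomial 0x11D, alpha = 2.
EXP, LOG = [0] * 510, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i], LOG[_x] = _x, _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def alpha_pow(e):
    """alpha^e for any integer e (negative exponents wrap modulo 255)."""
    return EXP[e % 255]

def discrepancy_direct(r, lam, j):
    """Expanded sum that (11) factors: sum over i of alpha^(i*j) * r_i * lam_i."""
    acc = 0
    for i, (ri, li) in enumerate(zip(r, lam)):
        acc ^= gf_mul(alpha_pow(i * j), gf_mul(ri, li))
    return acc

def discrepancy_serial(r, lam, j):
    """Nested form of (11); returns (value, clock cycles counted)."""
    acc, cycles = 0, 0
    inv = alpha_pow(-j)
    for ri, li in zip(r, lam):                 # lowest-degree symbol first
        acc = gf_mul(acc, inv) ^ gf_mul(ri, li)
        cycles += 1
    acc = gf_mul(acc, alpha_pow(j * (len(r) - 1)))   # scale by alpha^(j(2t-1))
    cycles += 1
    return acc, cycles

t = 4                                          # illustrative value
r = [(29 * i + 7) % 256 for i in range(2 * t)]     # hypothetical remainder symbols
lam = [(53 * i + 1) % 256 for i in range(2 * t)]   # hypothetical lambda_i^(j-1)
results = [(discrepancy_direct(r, lam, j), discrepancy_serial(r, lam, j))
           for j in range(1, 2 * t + 1)]
print(all(d == s for d, (s, _) in results))
```

The serial version needs only one constant multiplier by alpha^(-j) and one accumulator, at the cost of the extra scaling cycle, which matches the 2t+1 cycles quoted in the text.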
Fig.3. The circuit of the basic cell in Fig.1(b).
Fig.4. The proposed discrepancy computing circuit. 
The reason why the order of data symbols has to be
reversed in our design is given as follows. From (8), we
have
(12) 
It is clear that the values of $\lambda_i^{(j-1)}$ and $\gamma_i^{(j-1)}$ must remain
unchanged until $\Delta_j$ is evaluated. Also, with h=l for (5),
we obtain 
hj 1 = hj-1- I A .  J K .  j-1 a-iuj-i. j-1 
Let us see first what will happen when symbol-order
reversing is not employed. In this case, $\lambda_{n-1}^{(j-1)}$ appears first
and the value of $\Delta_j$ can be obtained only after $\lambda_0^{(j-1)}$ has
entered the discrepancy computing circuit. This means
that, for correct execution, $\lambda_{n-1}^{(j-1)}$ should be buffered by at
least n clock cycles before it becomes $\lambda_{n-1}^{(j)}$. In other
words, we need at least n delay elements to buffer $\lambda_i^{(j-1)}$
for pipelining operations. Similar buffering strategies are
also necessary for the other variables of the recursion given in
(5)-(9), excluding $u_{j-1}$, $L_{j-1}$, and $K_{j-1}$. Such a
characteristic causes each basic cell to have a huge
storage requirement of O(n), which is the same as that of
the basic cell used for the pipelined array in [5]. On the
other hand, when symbol-order reversing is adopted, $\lambda_0^{(j-1)}$
comes in first and the value of $\Delta_j$ can be computed after
$\lambda_{2t-1}^{(j-1)}$ has been fed to the discrepancy computing circuit.
Accordingly, $\lambda_0^{(j-1)}$ should be buffered by at least 2t clock
cycles before it becomes $\lambda_0^{(j)}$. That is, the number of delay
elements required to buffer $\lambda_i^{(j-1)}$ and other variables
(excluding $u_{j-1}$, $L_{j-1}$, and $K_{j-1}$) inside each basic cell for
pipelining operations is O(2t), which is much smaller than
the O(n) involved in the design without symbol-order
reversing.
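To put numbers on the O(2t) versus O(n) contrast, the sketch below uses the delay-line counts that the paper quotes in its comparison section (8 lines of length 2t+3 per cell for the proposed array, 7 lines of length n+2 per cell for the array in [5]) together with its stated assumption of 64 transistors per 8-bit delay element. The function names are hypothetical, and the totals cover delay lines only, not multipliers or other logic.

```python
def delay_elements_per_cell(n, t):
    """Delay elements in one basic cell: (proposed, structure in [5])."""
    proposed = 8 * (2 * t + 3)      # 8 delay lines of length 2t+3
    reference = 7 * (n + 2)         # 7 delay lines of length n+2
    return proposed, reference

def array_delay_transistors(n, t, per_element=64):
    """Transistors spent on delay lines across all 2t cells (delay lines only)."""
    proposed, reference = delay_elements_per_cell(n, t)
    cells = 2 * t
    return cells * proposed * per_element, cells * reference * per_element

n = 255
for t in range(5, 11):
    p, ref = delay_elements_per_cell(n, t)
    tp, tref = array_delay_transistors(n, t)
    print(f"t={t:2d}  per cell: {p:4d} vs {ref:4d}   array: {tp:7d} vs {tref:8d}")
```

For n = 255 the per-cell count of the reference design does not depend on t, while the proposed count grows only linearly in t, which is why the gap is widest for small t.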
4. COMPARISON WITH AN EXISTING SYSTEM
Like the pipelined array in [5], the proposed array has a
decoding rate of one data block per n clock cycles. It can
also be seen that the proposed one employs 8 delay lines
of length 2t+3 inside each basic cell, while that in [5]
requires 7 delay lines of length n+2. For a long block
length, say n=255, and a moderate error correcting
capability t, say 5 ≤ t ≤ 10, the former gains a significant
improvement in the storage requirement over the latter.
For further comparison of the proposed RS decoder with
that given in [5], the following assumptions are made: 1)
each symbol is an element in GF($2^8$), 2) an 8-bit delay
element comprises 64 transistors, 3) a multiplier for
GF($2^8$) consists of 1,792 transistors, estimated based on
the array multiplier given in [8], 4) an inverter for GF($2^8$)
is realized using one ROM, composed of about 2,500
transistors, and 5) each register in a FILO buffer of the
pre-processing unit occupies about 12 transistors. Based
on these assumptions, a comparison of the proposed
structure with that in [5] is listed in Table I. This table
indicates that the proposed structure has a substantial
reduction in the transistor count and the latency.
5. CONCLUSION
In this paper, we have presented a new RS decoding
algorithm that combines the modified time-domain
Berlekamp-Massey algorithm with the remainder
decoding concept. We have also proposed an
area-efficient pipelined array architecture for realizing the
decoding algorithm. The proposed RS decoder possesses
the features of regularity and modularity, and is thus well
suited to VLSI implementation. It can process one data
block every n clock cycles, where n is the block length of
the RS code. As compared to the similar pipelined RS
decoder described in [5], the proposed one gains
significant improvements in chip area and latency.
6. REFERENCES
[1] F. J. MacWilliams and N. J. A. Sloane, The Theory of
Error-Correcting Codes. New York: North-Holland,
1977.
[2] T. R. N. Rao and E. Fujiwara, Error-Control Coding
for Computer Systems. Englewood Cliffs, NJ:
Prentice-Hall, 1989.
[3] K. Challapali, X. Lebegue, J. S. Lim, W. H. Paik, R. S.
Girons, E. Petajan, V. Sathe, P. A. Snopko, and J.
Zdepski, "The grand alliance system for US HDTV,"
Proc. IEEE, vol. 83, pp. 158-174, Feb. 1995.
[4] R. E. Blahut, "A universal Reed-Solomon decoder,"
IBM J. Res. Develop., vol. 28, pp. 150-158, Mar.
1984.
[5] S. Choomchuay and B. Arambepola, "Time domain
algorithms and architectures for Reed-Solomon
decoding," IEE Proc. - I, vol. 140, pp. 189-196, June
1993.
[6] L. Welch and E. R. Berlekamp, "Error correction for
algebraic block codes," U.S. Patent 4663470, Sept.
1983.
[7] R. E. Blahut, Theory and Practice of Error Control
Codes. Reading, MA: Addison-Wesley, 1983.
[8] C.-L. Wang and J.-L. Lin, "Systolic array
implementation of multipliers for finite fields GF($2^m$),"
IEEE Trans. Circuits Syst., vol. 38, pp. 796-800, July
1991.
TABLE I
A Comparison of Two RS Decoders

Design                 Quantity                  Error correcting capability t
                                                 5      6      7      8      9      10
Structure in [5]       # of transistors          1261k  1513k  1765k  2017k  2268k  2520k
                       latency (# of cycles)     2570   3084   3598   4112   4626   5140
Proposed structure     # of transistors          226k   282k   342k   406k   473k   546k
                       latency (# of cycles)     640    690    748    814    888    970
Fractional reduction   transistor count          0.820  0.814  0.806  0.799  0.791  0.783
                       latency size              0.751  0.776  0.792  0.802  0.808  0.811
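The reduction rows of Table I can be rechecked from its transistor and latency rows. The sketch below does that with the table's own numbers; the variable names are hypothetical. It also tests two closed forms for the latency at n = 255, namely 2t(n+2) for the structure in [5] and 2t(2t+3)+2n for the proposed one. These closed forms are only an observation that happens to reproduce the table; the paper does not state them.

```python
T = [5, 6, 7, 8, 9, 10]
REF_TRANSISTORS_K = [1261, 1513, 1765, 2017, 2268, 2520]    # structure in [5]
PROP_TRANSISTORS_K = [226, 282, 342, 406, 473, 546]         # proposed structure
REF_LATENCY = [2570, 3084, 3598, 4112, 4626, 5140]
PROP_LATENCY = [640, 690, 748, 814, 888, 970]
TABLE_TRANSISTOR_REDUCTION = [0.820, 0.814, 0.806, 0.799, 0.791, 0.783]
TABLE_LATENCY_REDUCTION = [0.751, 0.776, 0.792, 0.802, 0.808, 0.811]

def reduction(reference, proposed):
    """Fractional reduction of the proposed value relative to the reference."""
    return [1 - p / r for r, p in zip(reference, proposed)]

transistor_reduction = reduction(REF_TRANSISTORS_K, PROP_TRANSISTORS_K)
latency_reduction = reduction(REF_LATENCY, PROP_LATENCY)

# Transistor counts are rounded to the nearest thousand in the table,
# so the recomputed ratios are compared with a small tolerance.
TOL = 0.0015
transistors_ok = all(abs(a - b) < TOL for a, b in
                     zip(transistor_reduction, TABLE_TRANSISTOR_REDUCTION))
latency_ok = all(abs(a - b) < TOL for a, b in
                 zip(latency_reduction, TABLE_LATENCY_REDUCTION))

# Observation only (not stated in the paper): closed forms that match the
# latency rows for n = 255.
n = 255
ref_formula = [2 * t * (n + 2) for t in T]
prop_formula = [2 * t * (2 * t + 3) + 2 * n for t in T]

for t, tr, lr in zip(T, transistor_reduction, latency_reduction):
    print(f"t={t:2d}  transistor reduction {tr:.3f}  latency reduction {lr:.3f}")
print(transistors_ok, latency_ok)
```

The tolerance absorbs the rounding of the transistor counts to thousands; the latency ratios, computed from exact cycle counts, agree with the table to three decimals.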
