Abstract -- This paper examines Reed-Solomon time-domain and frequency-domain decoder implementations in both software and hardware. The focus was on designing area-efficient, low-power and low-complexity decoders suitable for today's moderate data-rate applications. Two decoder chips were designed using a synthesized standard cell library in a 0.18 µm CMOS process, targeting a 160 Mbps decoding rate. The time-domain decoder was fabricated with a core area of 1.50 mm².
I. INTRODUCTION
Reed-Solomon (RS) codes and decoders are extremely powerful error correction tools that greatly enhance transmission quality in communications systems. These techniques are used in numerous applications such as the compact disc (CD) player [1].
In recent years there has been a shift from large-scale, high-speed uses to small, moderate data-rate applications such as the ubiquitous cell phone. Driving the development of new technologies is the insatiable demand for lighter and smaller communication devices with greater capabilities. There is therefore a need to focus on area and power considerations rather than solely on speed. This work focuses on two applications that have identical RS code parameters: Digital Video Broadcasting (DVB) and asymmetrical digital subscriber line (ADSL). However, the results are not strictly limited to these two applications.
To date, most RS designers have emphasized aggressive decoding speeds and paid less attention to optimizing area and power. Moreover, the literature has yet to present a concise, documented qualitative and quantitative comparison of the various VLSI decoder implementations.
An RS decoder can be designed as either a frequency-domain decoder (FDD) or a time-domain decoder (TDD). Within this algorithmic context, the decoders can either be implemented in hardware as an application-specific integrated circuit (ASIC), or in embedded software that resides in memory as part of a system-on-a-chip (SOC) implementation. These approaches have their respective advantages and inherent trade-offs. Consequently, a four-quadrant qualitative comparison of RS decoders will be presented in this paper.
II. REED-SOLOMON DECODING

A. Reed-Solomon Codes
The general structure of an RS code can be described as follows. Each codeword is composed of n symbols, of which k are message symbols and (n-k) are redundant symbols. The code is referred to as an (n,k) RS code of length n and dimension k over a Galois field GF(q), where q is a power of a prime number. It has a minimum distance of $d = n - k + 1$, where $n = q - 1$, and an error correction capability of $t = \lfloor (d-1)/2 \rfloor$ errors. Codewords are generated by a set of polynomials with degree k-1 and coefficients from GF(q). RS codes relevant to digital communications are based on the binary extension field GF($2^m$), where each symbol is an m-bit word.
0-7803-7587-4/02/$17.00 ©2002 IEEE
Erasures enhance the error correcting capability of an RS decoder. An erasure location is a symbol location in the codeword that the decoder recognizes as being incorrect, although it does not know which bits are in error. If there are v erased coordinates, then it is possible to correct $t = \lfloor (d - v - 1)/2 \rfloor$ errors in the unerased coordinates of the received word, provided that $2t + v < d$.
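The parameter relations above can be checked numerically. The following is a minimal sketch; the function names (`rs_parameters`, `correctable_errors`, `is_decodable`) are hypothetical and chosen only for illustration.

```python
def rs_parameters(n, k):
    """Return (d, t) for an (n, k) RS code: d = n - k + 1, t = floor((d - 1) / 2)."""
    d = n - k + 1
    t = (d - 1) // 2
    return d, t


def correctable_errors(d, v):
    """Errors correctable in the unerased coordinates when v symbols are erased."""
    return max(0, (d - v - 1) // 2)


def is_decodable(d, errors, erasures):
    """Errors-and-erasures condition from the text: 2t + v < d."""
    return 2 * errors + erasures < d


d, t = rs_parameters(255, 239)
print(d, t)                       # 17 8
print(correctable_errors(d, 4))   # 6
print(is_decodable(d, 6, 4))      # True  (2*6 + 4 = 16 < 17)
print(is_decodable(d, 8, 1))      # False (2*8 + 1 = 17 is not < 17)
```

Each erasure costs one unit of distance while each error costs two, which is why flagging unreliable symbols as erasures extends what the decoder can repair.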
B. Applications
The two applications targeted here are G.lite (ADSL) and DVB. Both standards are based on a (255,239) RS code, have an error correcting capability of t = 8, and are generated from the primitive polynomial $m(x) = x^8 + x^4 + x^3 + x^2 + 1$ in GF($2^8$). A maximum decoding speed of 20 MBps (160 Mbps) is required for these applications. This speed comfortably accommodates G.lite (1 MBps) and very-high-speed digital subscriber line (VDSL) (13 MBps). Other applications that use the same primitive polynomial include the digital versatile disc (DVD) and deep space communications [1].
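As an illustration of the arithmetic underlying a GF($2^8$) symbol code, the following minimal sketch multiplies field elements and builds power and logarithm tables. It assumes the primitive polynomial $x^8 + x^4 + x^3 + x^2 + 1$, encoded as the bit pattern `0x11D`, and takes `0x02` as the primitive element; the helper names are hypothetical and do not come from the paper.

```python
PRIM_POLY = 0x11D  # assumed encoding of x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # GF addition is a bitwise XOR
        a <<= 1
        if a & 0x100:          # degree reached 8, so reduce by the field polynomial
            a ^= PRIM_POLY
        b >>= 1
    return result


def build_tables():
    """Return (exp, log) with exp[i] = alpha^i and log[alpha^i] = i."""
    exp, log = [0] * 255, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x = gf_mul(x, 2)
    return exp, log


EXP, LOG = build_tables()


def gf_inv(a):
    """Multiplicative inverse through the log table; a must be nonzero."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse in GF(2^8)")
    return EXP[(255 - LOG[a]) % 255]


print(hex(gf_mul(0x80, 2)))  # 0x1d, i.e. alpha^8 = alpha^4 + alpha^3 + alpha^2 + 1
print(len(set(EXP)))         # 255, every nonzero element is a power of alpha
```

Table lookups of this kind are a common software choice, whereas hardware decoders typically use combinational multipliers.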
C. Decoding Algorithms
The most challenging aspect of RS codes is finding efficient techniques for decoding the received symbols. This paper examines the Modified Euclidean Algorithm (MEA) and the Berlekamp-Massey Algorithm (BMA) [1].
Thus far, ambiguity about whether to implement the MEA or the BMA has been an issue for VLSI designers of RS decoders.
Until the work of [3], most practical VLSI decoder implementations used the MEA approach because previous BMA approaches used more multipliers than the MEA [3]. However, in [3] a time-sharing algorithm was developed, which made the VLSI implementation of the BMA comparable to the MEA. The design reduced the number of finite field multipliers in the decoding algorithm from between 2t and 3t to three, which is comparable to the four used in the standard MEA.
Two independent perspectives, referenced in [1], corroborate the use of either algorithm. An algorithmic comparison concluded that both methods yield distinct but similar parallel architectures with the same optimal complexity. Both algorithms also required the same number of machine cycles to complete decoding.
Finally, the memory requirements can be said to be equivalent for both algorithms. Each one requires the storage of the input codewords, which will then be used in their respective output stages to form the decoded codewords. Therefore, these two algorithms can be said to be equivalent, and the choice to implement one versus the other becomes a matter of personal preference for the RS decoder designer.
III. REED-SOLOMON DECODER DESIGN
A. Time-Domain Decoder Overview
The TDD and FDD designs are based on previously published designs. In the TDD, the data codeword r and the erasure locations U are received and then decoded using the MEA. The codeword r is received one m-bit symbol at a time, and then sent to two different blocks: the syndrome computation block and a delay (memory) unit. The syndrome block is a multiply-accumulate unit. Once all the symbols of a codeword have been input, the syndromes are checked. If the syndromes are all zero, then the current codeword is correct and the decoding can be skipped; otherwise the decoding process continues. Received symbols are stored in memory because they need to be combined at the output stage with the decoded symbols to produce the codeword. At the same time, the erasure location bit stream is serially input into the power calculation unit. A '1' indicates the presence of an erasure at the current codeword symbol position. If this is the case, then the power calculation unit will convert it to a power of α, where α is a primitive element of the GF over which the RS code is defined.
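The multiply-accumulate behaviour of the syndrome block can be sketched in software as a Horner evaluation that is updated once per received symbol. This is a minimal sketch under stated assumptions: the field polynomial is `0x11D`, the code roots are taken to be $\alpha^0, \ldots, \alpha^{15}$, and symbols arrive highest-degree coefficient first. All names are hypothetical; the paper's hardware may use a different root convention.

```python
PRIM_POLY = 0x11D  # assumed encoding of x^8 + x^4 + x^3 + x^2 + 1
N, K = 255, 239


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def code_roots(count=N - K):
    """Assumed code roots alpha^0 .. alpha^(count-1), with alpha = 0x02."""
    roots, x = [], 1
    for _ in range(count):
        roots.append(x)
        x = gf_mul(x, 2)
    return roots


def syndromes(received):
    """One multiply-accumulate per syndrome for every incoming symbol."""
    roots = code_roots()
    acc = [0] * len(roots)
    for symbol in received:                 # symbols arrive one at a time
        for j, root in enumerate(roots):
            acc[j] = gf_mul(acc[j], root) ^ symbol
    return acc


def generator_poly():
    """g(x) = product of (x + root), highest-degree coefficient first."""
    g = [1]
    for root in code_roots():
        nxt = g + [0]                       # g(x) * x
        for i, c in enumerate(g):
            nxt[i + 1] ^= gf_mul(c, root)   # plus g(x) * root
        g = nxt
    return g


g = generator_poly()
codeword = [0] * (N - len(g)) + g           # g(x) itself is a valid codeword
corrupted = list(codeword)
corrupted[10] ^= 0x5A                       # inject a single symbol error

print(any(syndromes(codeword)))             # False: decoding can be skipped
print(any(syndromes(corrupted)))            # True: decoding must continue
```

The all-zero test in the last two lines mirrors the early-exit check described above.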
The subsequent section receives the syndromes and α powers as inputs. Here, three blocks calculate parameters that the MEA requires for its decoding. First, the polynomial expansion unit uses the syndromes and α powers to find the Forney syndromes, which form a polynomial. This initializes the MEA to be able to correct both errors and erasures. Second, the α powers are expanded into an erasure locator polynomial in the power expansion block, and used to initialize the MEA. If an input symbol $r_i$ is labeled an erasure, then $\alpha^{-i}$ will be a root of this erasure locator polynomial. The last block computes $\lfloor (d+v-3)/2 \rfloor$, which is used as an MEA stop indicator. If the degree of the Forney polynomial is less than the degree of the erasure locator polynomial, then there are no errors and the MEA may be skipped. However, if there are errors, then a specific decoding algorithm in the MEA block iterates to solve a key equation [1]. Two results are produced and passed to the Chien search: the errata locator and the errata magnitude polynomials.
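The power expansion step can be illustrated by multiplying out one linear factor per erasure. The sketch below assumes the common convention in which the erasure locator is the product of $(1 + \alpha^{i} x)$ over the erased positions i, so that $\alpha^{-i}$ is a root, and assumes the field polynomial `0x11D`. The function names are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed encoding of x^8 + x^4 + x^3 + x^2 + 1


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def alpha_pow(i):
    """alpha^i with alpha = 0x02; exponents wrap modulo 255."""
    x = 1
    for _ in range(i % 255):
        x = gf_mul(x, 2)
    return x


def erasure_locator(positions):
    """Expand the product of (1 + alpha^i x), lowest-degree coefficient first."""
    poly = [1]
    for i in positions:
        a = alpha_pow(i)
        nxt = poly + [0]                   # poly * 1
        for deg, c in enumerate(poly):
            nxt[deg + 1] ^= gf_mul(c, a)   # plus poly * (alpha^i x)
        poly = nxt
    return poly


def poly_eval(poly, x):
    """Horner evaluation of a lowest-degree-first polynomial."""
    acc = 0
    for c in reversed(poly):
        acc = gf_mul(acc, x) ^ c
    return acc


def mea_stop_degree(d, v):
    """Stop indicator from the text: floor((d + v - 3) / 2)."""
    return (d + v - 3) // 2


erased = [3, 17, 200]
locator = erasure_locator(erased)
print(len(locator) - 1)                                            # 3
print([poly_eval(locator, alpha_pow(255 - i)) for i in erased])    # [0, 0, 0]
print(mea_stop_degree(17, len(erased)))                            # 8
```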
An exhaustive search is then performed to find the roots of the errata locator polynomial by evaluating it for all $\alpha^{-i}$, where $i = 0, 1, \ldots, n-1$. If a root is found, then the corresponding symbol $r_i$ is corrupted. The errata magnitudes are found by exhaustively evaluating the errata magnitude polynomial and the derivative of the errata locator polynomial for all $\alpha^{i}$ in the particular GF. Finally, these errata values are GF added (XORed) with the original symbols stored in memory to form the corrected codeword that the decoder outputs.
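The search and correction steps can be sketched as follows. This is a minimal illustration, not the paper's architecture: it assumes the field polynomial `0x11D`, syndromes $S_j = r(\alpha^j)$ for $j = 0, \ldots, 15$, a locator built as the product of $(1 + \alpha^{i} x)$, and the Forney formula $e_i = \alpha^{i}\,\Omega(\alpha^{-i}) / \Lambda'(\alpha^{-i})$ that goes with that convention. In the demo the two polynomials are constructed directly from a known errata pattern in place of an MEA; list index i is taken to be the coefficient of $x^i$. All names are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed encoding of x^8 + x^4 + x^3 + x^2 + 1
N = 255


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


EXP = []                       # EXP[i] = alpha^i
_x = 1
for _ in range(N):
    EXP.append(_x)
    _x = gf_mul(_x, 2)
LOG = {value: i for i, value in enumerate(EXP)}


def gf_div(a, b):
    """a / b in GF(2^8) using logarithms; b must be nonzero."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % N]


def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out


def poly_eval(poly, x):
    """Horner evaluation, lowest-degree coefficient first."""
    acc = 0
    for c in reversed(poly):
        acc = gf_mul(acc, x) ^ c
    return acc


def formal_derivative(poly):
    """In characteristic 2 only the odd-degree terms survive differentiation."""
    return [c if deg % 2 == 1 else 0 for deg, c in enumerate(poly)][1:]


def chien_forney(locator, evaluator):
    """Exhaustively test alpha^(-i) for i = 0..N-1; return {position: magnitude}."""
    deriv = formal_derivative(locator)
    errata = {}
    for i in range(N):
        x_inv = EXP[(-i) % N]
        if poly_eval(locator, x_inv) == 0:            # root found: symbol i is corrupted
            numerator = gf_mul(EXP[i], poly_eval(evaluator, x_inv))
            errata[i] = gf_div(numerator, poly_eval(deriv, x_inv))
    return errata


def correct(received, errata):
    """GF-add (XOR) each errata value onto the stored symbol."""
    return [sym ^ errata.get(i, 0) for i, sym in enumerate(received)]


# Demo: two symbol errors on the all-zero codeword.
true_errata = {5: 0x21, 100: 0xC3}
received = [0] * N
syndromes = [0] * 16
locator = [1]
for pos, val in true_errata.items():
    received[pos] ^= val
    locator = poly_mul(locator, [1, EXP[pos]])
    for j in range(16):
        syndromes[j] ^= gf_mul(val, EXP[(pos * j) % N])
evaluator = poly_mul(syndromes, locator)[:16]          # S(x) * Lambda(x) mod x^16

found = chien_forney(locator, evaluator)
corrected = correct(received, found)
```

Because the magnitude is a ratio, any common scale factor on the two polynomials cancels, which is consistent with a time-domain output stage that needs no normalization.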
B. Frequency-Domain Decoder Overview
An overview of the FDD is illustrated in Fig. 2. The input and decoding stage is similar to that of the TDD, with two differences. First, an extra delay (memory) is required to store the syndromes from the first section of the decoder, which are required in the output stage. Second, the MEA outputs only the errata locator polynomial, which initializes the transform error pattern block. The delayed syndromes are also input into this block. Finally, there is a significant difference in the structure of the output stage.
The first section of the output stage calculates the m-bit transforms of the errata pattern. Delayed syndromes from the first section in the previous stage form the first n-k error transforms. To calculate the remaining k transforms, a recursive equation is used in conjunction with the syndromes and the errata locator polynomial. Each of the m-bit transforms is then sent to the next output block: the inverse transform unit. The inverse of all n errata transforms must be calculated before being added to the stored input symbols. This inverse transform is taken over GF($2^m$). Finally, the results can be GF added to the stored input symbols and then output.
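The two output-stage steps can be sketched in software. The following is a minimal illustration under stated assumptions: the field polynomial is `0x11D`, the syndromes are $S_j = r(\alpha^j)$ for $j = 0, \ldots, 15$, and the locator is normalized so that its constant term is 1. The recursion used is $E_j = \sum_{k \ge 1} \Lambda_k E_{j-k}$ and the inverse transform is $e_i = \sum_j E_j \alpha^{-ij}$; no scaling by $1/n$ is needed because n = 255 is odd in a field of characteristic 2. The names are hypothetical and the locator is built from a known errata pattern in place of an MEA.

```python
PRIM_POLY = 0x11D  # assumed encoding of x^8 + x^4 + x^3 + x^2 + 1
N, PARITY = 255, 16


def gf_mul(a, b):
    """Shift-and-add multiplication in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


EXP = []                       # EXP[i] = alpha^i
_x = 1
for _ in range(N):
    EXP.append(_x)
    _x = gf_mul(_x, 2)


def extend_transform(syndromes, locator):
    """Use the n-k syndromes as E_0..E_15, then recurse for the remaining k values."""
    E = list(syndromes)
    for j in range(len(syndromes), N):
        acc = 0
        for k in range(1, len(locator)):
            acc ^= gf_mul(locator[k], E[j - k])
        E.append(acc)
    return E


def inverse_transform(E):
    """e_i = sum over j of E_j * alpha^(-i*j), for i = 0..N-1."""
    errata = []
    for i in range(N):
        acc = 0
        for j in range(N):
            acc ^= gf_mul(E[j], EXP[(-i * j) % N])
        errata.append(acc)
    return errata


# Demo: three symbol errors; index i is the coefficient of x^i.
true_errata = {7: 0x11, 42: 0xFE, 254: 0x01}
syndromes = [0] * PARITY
locator = [1]
for pos, val in true_errata.items():
    for j in range(PARITY):
        syndromes[j] ^= gf_mul(val, EXP[(pos * j) % N])
    # Multiply the locator by (1 + alpha^pos x), lowest-degree first.
    nxt = locator + [0]
    for deg, c in enumerate(locator):
        nxt[deg + 1] ^= gf_mul(c, EXP[pos])
    locator = nxt

transform = extend_transform(syndromes, locator)
recovered = inverse_transform(transform)
expected = [true_errata.get(i, 0) for i in range(N)]
print(recovered == expected)
```

The inverse transform loops over all n x n symbol pairs, which gives a sense of why this block dominates the cost of a frequency-domain output stage.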
IV. REED-SOLOMON VLSI IMPLEMENTATION
An area- and power-efficient (255,239) RS decoder implementation that supported 160 Mbps throughput was the primary design objective.
A. Time-Domain Decoder
The design was partitioned into four pipeline stages and used two clocks (clock and symbol_clock). The time required for each pipeline stage was 255 symbol_clock cycles. The symbol_clock was four times slower than clock, but both were in phase. The savings in area were offset by the increased complexity of the control module for this block. However, it was found that only one MEA cell was needed to meet the timing requirements of the target applications if clock was used. The outputs of the MEA cell are delayed and then fed back as inputs. The Chien search was optimized for area by using the architecture suggested in [7], reducing the number of GF multipliers by 16 and XOR tree summation blocks by 2, but the completion time doubled because of the time-shared hardware.
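The timing budget implied by this clocking scheme can be estimated with simple arithmetic. The sketch below assumes one 8-bit symbol per symbol_clock cycle at the 160 Mbps target, which the text does not state explicitly; the resulting frequencies are therefore estimates, not figures reported for the chip. The function name is hypothetical.

```python
def clock_plan(throughput_bps=160_000_000, bits_per_symbol=8, clock_ratio=4, n=255):
    """Estimate clock rates and per-stage time, assuming one symbol per symbol_clock."""
    symbol_clock_hz = throughput_bps // bits_per_symbol
    clock_hz = clock_ratio * symbol_clock_hz
    return {
        "symbol_clock_hz": symbol_clock_hz,
        "clock_hz": clock_hz,
        "clock_cycles_per_stage": n * clock_ratio,
        "stage_time_us": n * 1e6 / symbol_clock_hz,
    }


plan = clock_plan()
print(plan["symbol_clock_hz"])          # 20000000
print(plan["clock_hz"])                 # 80000000
print(plan["clock_cycles_per_stage"])   # 1020
print(plan["stage_time_us"])            # 12.75
```

Under these assumptions a block driven by the faster clock has 1020 cycles per codeword instead of 255, which is the headroom that lets a single time-shared cell replace several parallel ones.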
Power consumption in the RS decoder was reduced by using the error and erasure information generated by the decoder. For instance, if the sixteen syndromes are all calculated to be zero then there are no errors, and the entire decoding process can be skipped. Memory cores were used to reduce the number of register shifting operations. The memory blocks were disabled when not reading or writing to them and only powered on one out of every four clock cycles.
B. Frequency-Domain Decoder
All of the architectures before the output stage of the decoder were reused from the TDD design. However, a GF division was needed at the output of the MEA to normalize the output since only the errata locator polynomial was required.
The design for the error and inverse transform blocks was adopted from [4]. No hardware optimizations over this implementation were immediately evident.
C. VLSI Decoder Results and Comparison
The TDD and FDD were designed using the Virtual Silicon 0.18 µm standard cell library and a commercial memory module generator. The TDD was fabricated (Fig. 3, die photo), and performance results for both decoders are shown in Table 1. The MEA was the largest contributor to design complexity. Since both the TDD and the FDD contain an MEA, they are comparable in terms of design and implementation complexity. The TDD proved superior to the FDD because the TDD chip had a smaller area, faster decoding speed, and lower power consumption. The prohibitive area of the FDD was a prominent limitation for practical applications. A comparison of previous implementations of RS decoders and the TDD in this paper is provided in Table 2.
V. REED-SOLOMON SOFTWARE IMPLEMENTATION
The purpose of implementing both decoders in software was to determine which approach would be superior for an embedded software SOC application. C code was used and compiled on a SPARC architecture CPU using various compiler optimizations [1], and the results are shown in Table 3.
VI. CONCLUSIONS
The focus in this paper was to design area-efficient, low-power and low-complexity RS decoders. The main drawback of the FDD was the inverse transform block of the output stage. Hence, based on the VLSI and software results, the TDD is superior to the FDD for hardware and embedded software applications.
