Abstract-This paper presents a high-speed parallel Reed-Solomon (RS) (255,239) decoder architecture using the modified Euclidean algorithm for high-speed multigigabit-per-second fiber optic systems. Pipelining and parallelizing allow inputs to be received at very high fiber-optic rates and outputs to be delivered at correspondingly high rates with minimum delay. The parallel processing architecture enables data-processing rates of 10 Gb/s and beyond, even though the maximum achievable clock frequency is generally bounded by the critical path of the modified Euclidean algorithm block. The parallel RS decoders have been designed and implemented in a 0.13-μm CMOS standard cell technology with a supply voltage of 1.1 V. The results suggest that a parallel RS decoder that keeps up with optical transmission rates, i.e., 10 Gb/s and beyond, can be implemented. The proposed channel = 4 parallel RS decoder operates at a clock frequency of 770 MHz and has a data processing rate of 26.6 Gb/s.
I. INTRODUCTION
The use of error-correction coding to eliminate the necessity of retransmitting the data is called forward error correction (FEC). The basic concept is to systematically add redundancy to messages at the encoder such that the decoder can successfully recover the messages from the received block, which may be corrupted by channel noise. The use of FEC in optical networks was pioneered for submarine systems, where detection and correction of errors was critical for transmission over very long haul networks. Among the many error correction codes, Reed-Solomon (RS) codes have been widely used in a variety of communication systems such as space communication links, digital subscriber loops, and wireless systems, as well as in networking communications [1]. RS decoders can be used to protect digital data against errors that occur in the transmission process and to reduce the signal-to-noise ratio required for it. The RS decoder can be implemented using the modified Euclidean (ME) algorithm or the Berlekamp-Massey (BM) algorithm to solve a key equation. For either algorithm, the finite field (also called Galois field) is a mathematical structure that plays a crucial role in the theory of RS codes, and the finite-field arithmetic operations are the fundamental building blocks of the RS decoder [2], [3]. Basic finite-field concepts and the Reed-Solomon encoding algorithm are well introduced in [1]-[3].
If no erasure is taken into consideration, a syndrome-based RS decoder consists of three components. The first is a syndrome computation (SC) block. It generates a syndrome polynomial $S(x)$ that is used in the key-equation solver (KES) block for solving the key equation $S(x)\sigma(x) = \omega(x) \bmod x^{2t}$. Either the ME algorithm or the BM algorithm can be used to solve the key equation for an error-locator polynomial $\sigma(x)$ and an error-value polynomial $\omega(x)$. Then, in the third component, these two polynomials are used to find the error locations and the corresponding error values according to the Chien search and the Forney algorithm, and the errors are corrected as the received word is being read out of the decoder. In addition, a first-in first-out (FIFO) memory is used to buffer the received symbols according to the latency of these components.
The very high-speed data transmission techniques that have been developed for fiber optical networking systems have necessitated the implementation of high-speed RS decoder architectures to meet the continual demand for ever higher data rates. With the advent of dense wavelength division multiplexing (DWDM), the capacity of optical transmission systems has rapidly increased over the past ten years. In particular, the RS(255, 239) code is now commonly used in high-speed (10 Gb/s and beyond) fiber optic systems such as DWDM systems due to its 8-byte error-correcting capability. The throughput bottleneck in an RS decoder is in the KES block, whereas the other blocks of the decoder are simple to pipeline due to their feedforward structure. Thus, we propose a highly pipelined ME algorithm block that improves the clock frequency. A key advantage of RS decoders based upon the ME algorithm is their regular data-flow structure, which can be easily pipelined and implemented in very large scale integration (VLSI) [10]-[14].
Several parallel architectures used for Viterbi decoding, CRC codes, and BCH codes [4]-[6] have been reported in order to obtain processing rates higher than the clock frequency. Such an approach is an effective method of achieving data-processing rates above 10 Gb/s, since the maximum clock frequency is generally bounded by the physical constraints of the circuit.
This paper presents a very high-speed VLSI architecture for a parallel RS(255, 239) decoder based on the ME algorithm. The RS(255, 239) code is now commonly used for submarine fiber optic systems due to its 8-byte error-correcting capability [7]. An ITU standard was developed using the RS(255, 239) code that provides 6.0 dB net electrical coding gain with approximately 7% overhead. The speed bottleneck in the conventional ME algorithm block is in the finite-field multiplier, adder, and multiplexer. This bottleneck is eliminated via a pipelined fully-parallel multiplier and a pipelined degree computation block that reduce the critical path delay. Section II shows the proposed architectures for the pipelined decoder components and illustrates the decoding process. Section III describes the high-speed parallel-processing RS decoder architecture that achieves a data-processing rate of more than 10 Gb/s. Finally, the conclusions are given in Section IV.
II. PIPELINED RS DECODER
This section presents the components of a high-speed pipelined RS decoder based on the ME algorithm. The finite-field multiplier contributes significantly to the critical path delay and, hence, limits the maximum data rates achievable with the conventional ME algorithm block. In this section, we propose a new RS decoder architecture that achieves a smaller critical path delay through a highly pipelined ME algorithm block.
A. Pipelined Fully Parallel Multiplier
The finite-field element multiplication plays an important role in the VLSI implementation of an RS decoder. The chip complexity and the computation time both depend on the design of the finite-field multiplier. The design of finite-field multipliers depends on the choice of basis for the representation. Here, we consider only the standard polynomial basis, in which the m-bit byte $(a_{m-1}, a_{m-2}, \ldots, a_1, a_0)$ represents the finite-field element $A = a_{m-1}\alpha^{m-1} + a_{m-2}\alpha^{m-2} + \cdots + a_1\alpha + a_0$. The product W of two $GF(2^m)$ elements A and B is computed in two steps: the polynomial multiplication as in (1) and the modular reduction as in (2). There are no data dependences within either procedure, so each can be computed in parallel. Using a polynomial to represent a field element, the finite-field multiplication of A and B can be written more concisely as $W(x) = A(x)B(x) \bmod p(x)$.
1063-8210/03$17.00 © 2003 IEEE
Fig. 1. Reed-Solomon decoder using ME algorithm.
The computation $A(x) \cdot B(x)$ is basically done by the formula
$$A(x) \cdot B(x) = \sum_{j=0}^{m-1} A(x)\, b_j\, x^j$$
which can be implemented in parallel or serially. The part $A(x) b_j$ is the AND operation and the operator "+" is the XOR operation.
The standard approach is to perform one simultaneous computation of 64 ANDs followed by 77 XORs. This fully-parallel multiplier can be fully pipelined on the XOR trees using the pipeline cutset shown in Fig. 2. For both parts, balanced XOR trees are used to obtain the shortest logic paths. Thus, the critical path delay is $T_{pipe\_mult} = T_{ff} + T_{and2} + T_{xor2} + T_{xor3}$, where $T_{ff}$, $T_{and2}$, $T_{xor2}$, and $T_{xor3}$ are the delays of the D flip-flop, the 2-input AND gate, the 2-input XOR gate, and the 3-input XOR gate, respectively. The pipelined multiplier structure provides a significant reduction in critical path delay.
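The two-step structure described above (AND/XOR polynomial multiplication followed by modular reduction) can be illustrated with a small software model. This is only a sketch: the field polynomial $p(x) = x^8 + x^4 + x^3 + x^2 + 1$ (0x11D) is an assumption, since the text does not state which polynomial is used, and the function names are hypothetical. The nested bit tests in `poly_multiply` correspond to the $m^2 = 64$ AND terms for m = 8.

```python
# Assumption: p(x) = x^8 + x^4 + x^3 + x^2 + 1 (0x11D); the paper does not name p(x).
M = 8
PRIM_POLY = 0x11D


def poly_multiply(a: int, b: int) -> int:
    """Step 1: carry-less product A(x)B(x) = sum_j A(x) b_j x^j."""
    product = 0
    for j in range(M):
        if (b >> j) & 1:          # b_j gates A(x): the AND array
            product ^= a << j     # "+" is XOR in GF(2)
    return product                # degree is at most 2m - 2


def modular_reduce(product: int) -> int:
    """Step 2: reduce the (2m-1)-bit product modulo p(x)."""
    for degree in range(2 * M - 2, M - 1, -1):
        if (product >> degree) & 1:
            product ^= PRIM_POLY << (degree - M)
    return product


def gf_multiply(a: int, b: int) -> int:
    """W(x) = A(x)B(x) mod p(x)."""
    return modular_reduce(poly_multiply(a, b))


print(hex(gf_multiply(0x02, 0x80)))  # 0x1d, i.e. alpha^8 reduced by p(x)
```

In hardware both steps are combinational XOR trees, which is why a pipeline register can be placed between them; the loops here only model the arithmetic, not the timing.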
B. Syndrome Computation Block
Let $C(x)$ and $R(x)$ be the code word polynomial and the received polynomial, respectively. The received polynomial can be corrupted by channel noise during transmission. This can be described as $R(x) = C(x) + E(x) = R_{n-1}x^{n-1} + \cdots + R_1 x + R_0$, where $E(x)$ is the error polynomial.
The first step in the decoding algorithm is to calculate the 2t syndromes of the word received from the noisy channel. The syndrome computation considers the symbol values as polynomial coefficients and determines whether the series of symbols contained in a data block forms a valid code word for the particular RS code chosen. It evaluates the polynomial for the 2t syndrome values and detects whether the evaluations are all zero (the data block is a code word) or not (the data block is not a code word). Any block that is not a code word is corrupted by errors. As shown in Fig. 3(a), the partial syndrome is multiplied by $\alpha^i$ at each cycle and accumulated with the received symbol.
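The multiply-and-accumulate recurrence of the syndrome cell is Horner's rule for evaluating $R(x)$ at $\alpha^i$. The following is a minimal software sketch of that recurrence, not the authors' circuit. It assumes the field polynomial 0x11D, $\alpha$ = 0x02, and syndromes evaluated at $\alpha^1, \ldots, \alpha^{2t}$; none of these conventions is stated in the text, and all function names are hypothetical.

```python
PRIM_POLY = 0x11D  # assumed field polynomial, not given in the text


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def gf_pow(a: int, e: int) -> int:
    """Raise a to the power e by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def syndromes(received, two_t=16, first_root=1):
    """received lists the symbols with R_{n-1} first, as they arrive serially."""
    out = []
    for i in range(first_root, first_root + two_t):
        root = gf_pow(0x02, i)
        partial = 0
        for symbol in received:
            # One clock cycle of the cell: multiply by alpha^i, add the symbol.
            partial = gf_mul(partial, root) ^ symbol
        out.append(partial)
    return out


# Usage: the all-zero code word with one corrupted symbol (value 0x05 at x^3).
word = [0] * 255
word[254 - 3] = 0x05
print(syndromes(word)[0])  # 40, which is 0x05 * alpha^3
```

In the decoder all 2t cells run concurrently on the same input symbol; the outer loop here simply visits them one after another.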
C. Modified Euclidean Algorithm Block
We need to obtain the error locator polynomial $\sigma(x)$ and the error value polynomial $\omega(x)$ by solving the key equation $S(x)\sigma(x) = \omega(x) \bmod x^{2t}$. These polynomials can be obtained by using the ME algorithm.
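The ME algorithm is summarized as follows. Initially, $R_0(x) = x^{2t}$, $Q_0(x) = S(x)$, $L_0(x) = 0$, and $U_0(x) = 1$.
In the ith iteration, the polynomials $R_i(x)$, $Q_i(x)$, $L_i(x)$, and $U_i(x)$ are updated as in (4)-(7),
where $a_{i-1}$ and $b_{i-1}$ are the leading coefficients of $R_{i-1}(x)$ and $Q_{i-1}(x)$, respectively, and the degree-related quantities are given by (8) and (9).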
The algorithm stops when $\deg(R_i(x)) < t$, where $\deg(\cdot)$ denotes the degree of a polynomial. If the stop condition is satisfied, then $\omega(x) = R_i(x)$ and $\sigma(x) = L_i(x)$. Fig. 4(a) shows the ME algorithm processing element (PE), which consists of a degree computation (DC) block and a polynomial arithmetic (PA) block. The ME algorithm processing block shown in Fig. 4(b) consists of 2t PEs connected by means of a systolic-array structure. Such a systolic array of 2t PEs computes the error-locator polynomial $\sigma(x)$ and the error-value polynomial $\omega(x)$, and it is capable of performing the ME algorithm continuously.
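To make the iteration concrete, the following sketch solves the key equation with a generic division-free Euclidean recurrence: exchange the two polynomial pairs when the degree of R falls below that of Q, cross-multiply by the leading coefficients, and stop when $\deg(R) < t$. It is not a transcription of the paper's update equations or of the systolic array; it only follows the initialization and stop condition given in the text. The field polynomial 0x11D, the first syndrome root $\alpha^1$, and all function names are assumptions.

```python
PRIM_POLY = 0x11D  # assumed field polynomial


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def degree(p):
    """Degree of a coefficient list (index = power); -1 for the zero polynomial."""
    for k in range(len(p) - 1, -1, -1):
        if p[k]:
            return k
    return -1


def combine(p, b, q, a, shift):
    """Return b*p(x) + a*x^shift*q(x) over GF(2^8), with leading zeros trimmed."""
    out = [0] * max(len(p), len(q) + shift)
    for k, c in enumerate(p):
        out[k] = gf_mul(b, c)
    for k, c in enumerate(q):
        out[k + shift] ^= gf_mul(a, c)
    del out[degree(out) + 1:]
    return out


def solve_key_equation(syndrome, t):
    """Return (sigma, omega) with S*sigma = omega mod x^(2t), up to a common scalar."""
    R = [0] * (2 * t) + [1]      # R_0(x) = x^(2t)
    Q = list(syndrome)           # Q_0(x) = S(x)
    L, U = [], [1]               # L_0(x) = 0, U_0(x) = 1
    if degree(Q) < 0:
        return [1], []           # all-zero syndrome: nothing to correct
    while degree(R) >= t:
        if degree(R) < degree(Q):            # exchange the two polynomial pairs
            R, Q, L, U = Q, R, U, L
        shift = degree(R) - degree(Q)
        a, b = R[degree(R)], Q[degree(Q)]    # leading coefficients
        R = combine(R, b, Q, a, shift)       # the leading term of R cancels
        L = combine(L, b, U, a, shift)
    return L, R


# Usage: two errors (0x05 at position 3, 0xA7 at position 200), t = 8.
errors = {3: 0x05, 200: 0xA7}
S = [0] * 16
for j in range(16):
    for pos, val in errors.items():
        S[j] ^= gf_mul(val, gf_pow(0x02, (j + 1) * pos))
sigma, omega = solve_key_equation(S, 8)
print(degree(sigma))  # 2, one locator root per error
```

Because no field division is used, the returned polynomials are scaled by a common nonzero constant, which cancels in the later error-value computation.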
The DC block performs several operations, such as the degree computation, the degree update using subtraction, and the leading zero detection, in order to calculate (8) and (9). First, it determines when the polynomials of the two systems are to be exchanged and when the initial polynomial and the product polynomial are to be exchanged. Thus, each exchange control circuit controls whether the first ($R_i(x)$) and second ($xQ_i(x)$) data lines are exchanged, and also whether the third ($L_i(x)$) and fourth ($xU_i(x)$) data lines are exchanged. That is, the exchange control circuit evaluates the exchange condition on the polynomial degrees; if the condition is satisfied, then the signal "sw" outputs a high level (sw = 1), otherwise it outputs a low level (sw = 0). Second, it detects whether the arithmetic operation stop condition $\deg(\omega(x)) < \deg(\sigma(x))$ or $\deg(R_i(x)) < t$ is satisfied. That is, if $\deg(R_{i+1}) < t$ or $\deg(Q_{i+1}) < t$ (with t = 8), then stop_o = 1 and the computation stops; otherwise stop_o = 0. A "start" signal is used to indicate the beginning of the polynomials, i.e., the leading coefficients $a_{i-1}$ and $b_{i-1}$. The "start" signal, as well as $xQ_0(x)$ and $xU_0(x)$, is delayed by one time unit so that the leading coefficients of $R_1(x)$, $Q_1(x)$, $L_1(x)$, and $U_1(x)$ are properly initiated by the start signal at the output of the first PE (PE1). The signal "z_lead" has to be generated in order to check whether the leading coefficient of $Q_i(x)$ is equal to zero. Five bits of arithmetic data are passed from DC block to DC block in the ME algorithm processing block, and these data are used to generate the multiplexer control signal "sw" and the "stop" signal in each PE. The PA block performs the finite-field multiplications and additions. One PA block contains four pipelined multipliers, two adders, and eight multiplexers in order to calculate (4)-(7).
At the first iteration step of the ME algorithm, $R_0(x)$ and $Q_0(x)$ are initialized to $x^{2t}$ and $S(x)$, respectively. $L_0(x)$ and $U_0(x)$ are initialized to "0" and "1," respectively.
The PA block uses the pipelined fully-parallel multiplier to reduce the critical path delay, which provides significant gains in clock frequency. The critical path delay of the PA block is therefore reduced from $T_{ff} + T_{mult} + T_{add}$ to just $T_{pipe\_mult} = T_{ff} + T_{and2} + T_{xor2} + T_{xor3}$.
Furthermore, the DC block can also be pipelined in five stages and retimed to reduce the critical path delay. Thus, the critical path delay of the DC block is $T_{subt} + T_{mux2}$, which is less than the delay of the PA block $T_{PA} = T_{pipe\_mult} + T_{add} + T_{mux2}$, where $T_{subt}$, $T_{mux2}$, and $T_{pipe\_mult}$ are the delays of the subtracter, the 2-to-1 multiplexer, and the pipelined fully-parallel multiplier, respectively. The critical path delay of the ME algorithm block can be defined as $T_{ME} = T_{subt} + T_{mux2} = T_{ff} + 3T_{or2} + T_{xnor2} + T_{mux2}$, which exists in the DC block. A serial-parallel converter is also included to pass the outputs of the ME algorithm block to the following Chien search block.
D. Chien Search, Forney Algorithm, and Error Correction Blocks
Let the error locator polynomial of degree t over $GF(2^m)$ be $\sigma(x) = \sigma_t x^t + \cdots + \sigma_1 x + \sigma_0$, where t is the maximum number of errors that can be corrected by the RS code. The Chien search algorithm requires the multiplication of each coefficient $\sigma_i$ by the powers of $\alpha$, where $\alpha$ is the root of a primitive irreducible polynomial of degree m over GF(2).
The Chien search and the Forney algorithm calculate the error locations and the error values, respectively. The circuit of the ith Chien search cell is shown in Fig. 5(a). Fig. 5(b) shows the block diagram of the Chien search block with eight Chien search cells. The finite-field adders accumulate the results of two Chien search cells and send the sum to the next adder. Fig. 5(c) shows the block diagram of the Forney algorithm and error correction blocks, which generate the error value and then the corrected symbol. For division in the Galois field, the inverse element of the divisor is derived first, and it is then multiplied with the dividend by the pipelined fully-parallel multiplier. A straightforward approach for computing the inverse of a nonzero element in $GF(2^8)$ is to use a simple look-up table composed of 255 words of 8 bits, in which the inverses of the field elements are stored. Consequently, it can be realized by means of a static read-only memory (ROM), which gives a critical path delay less than that of the pipelined multiplier. As each error value is computed, the corresponding received symbol is fetched from a FIFO memory, which buffers the received symbols during the decoding process. Each error value is simply added to the received symbol to produce the corrected symbol. In order to match the processing latency, the FIFO memory block is used to delay the received data by the sum of n, the delay required for the syndrome computation, and the processing time delay of the ME algorithm processing blocks.
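The location search, the table-based division, and the final XOR correction can be sketched in software as follows. This is an illustrative model under stated assumptions rather than the circuit of Fig. 5: the field polynomial 0x11D, the first syndrome root $\alpha^1$ (which gives the error value as $\omega(x)/\sigma'(x)$ at $x = \alpha^{-pos}$), and all function names are hypothetical choices not taken from the text.

```python
PRIM_POLY = 0x11D  # assumed field polynomial
N = 255            # code length of RS(255, 239)


def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


# Inverse look-up table with 255 useful words (index 0 is unused).
INVERSE = [0] * 256
for element in range(1, 256):
    INVERSE[element] = gf_pow(element, 254)   # a^254 = a^-1 because a^255 = 1


def poly_eval(p, x):
    """Evaluate a coefficient list (index = power) at x by Horner's rule."""
    y = 0
    for c in reversed(p):
        y = gf_mul(y, x) ^ c
    return y


def formal_derivative(p):
    """In characteristic 2 only the odd-power terms survive differentiation."""
    return [c if k % 2 == 1 else 0 for k, c in enumerate(p)][1:]


def chien_forney(sigma, omega):
    """Return {position: error value} for every root alpha^(-pos) of sigma."""
    d_sigma = formal_derivative(sigma)
    found = {}
    for pos in range(N):
        x = gf_pow(0x02, (N - pos) % N)        # alpha^(-pos)
        if poly_eval(sigma, x) == 0:           # Chien search hit
            numerator = poly_eval(omega, x)
            denominator = poly_eval(d_sigma, x)
            # Division = table look-up of the inverse followed by one multiply.
            found[pos] = gf_mul(numerator, INVERSE[denominator])
    return found


# Usage: one error of value 0x05 at position 3, so sigma(x) = 1 + alpha^3 x
# and omega(x) = 0x05 * alpha^3 (alpha^3 = 0x08, product 0x28).
found = chien_forney([1, 0x08], [0x28])
received = [0] * N
received[3] = 0x05                 # all-zero code word with one corrupted symbol
for pos, value in found.items():
    received[pos] ^= value         # the error value is added to the received symbol
print(found)  # {3: 5}
```

A common scalar factor in `sigma` and `omega` cancels in the quotient, so polynomials produced by a division-free key-equation solver can be used directly.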
E. Result and Comparison
The proposed RS decoder was modeled in Verilog hardware description language (HDL) and simulated to verify its functionality. After complete verification of the design functionality, it was synthesized using appropriate time and area constraints. Both the simulation and synthesis steps were carried out using Synopsys design tools and a 0.13-μm CMOS technology optimized for a 1.1-V supply voltage. Column two of Table II shows the implementation results for the proposed pipelined RS(255, 239) decoder using the ME algorithm. The total number of gates from the synthesized results is 115,500, excluding the FIFO memory, and the clock frequency is 770 MHz. For complexity reasons, we choose constant-variable finite-field multipliers to implement the syndrome computation block and the Chien search and Forney algorithm blocks.
Fig. 7. Ch = 4 parallel RS decoder using ME algorithm.
The critical path delay in the conventional architectures is mostly due to the finite-field multipliers in the KES block [9]-[13]. For the ME algorithm PE shown in Fig. 4(a), the pipelined multiplier structure was used to reduce the critical path delay and has provided significant gains in clock frequency. As discussed in Section II-C, the critical path delay of the ME algorithm block can be defined as $T_{ME} = T_{ff} + 3T_{or2} + T_{xnor2} + T_{mux2}$, which exists in the DC block. Thus, the critical path delay of the proposed RS decoder is $T_{ff} + 3T_{or2} + T_{xnor2} + T_{mux2}$, whereas that of a conventional RS decoder is at best $T_{mult} + T_{add} + T_{mux2}$. The proposed RS decoder using the ME algorithm achieves a smaller critical path delay (1.3 ns) than the previous RS decoders, as shown in Table I. The pipelined multiplier structure in combination with the systolic architecture provides significant gains over existing approaches. The proposed RS decoder has much higher speed and data processing rate (throughput) than the other architectures published in the past [12]-[16]. Fig. 6 shows the timing chart for the proposed RS decoder using the ME algorithm. The syndrome computation block provides 2t syndromes simultaneously to the next blocks, one clock cycle after the last received symbol has been fed into the block. The ME algorithm processing block accepts the syndromes and outputs the coefficients of $\sigma(x)$ and $\omega(x)$ serially. The proposed RS decoder takes in code blocks consecutively, performs the decoding operation, and outputs the data with a fixed latency of n + 10t + 20 clock cycles, where n clock cycles are required for computing the syndromes, 10t + 8 clock cycles for the ME algorithm processing block including the serial-parallel converter, and 12 clock cycles for the Chien search, Forney algorithm, and error correction blocks.
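As a worked instance of the latency expression, the short sketch below evaluates n + 10t + 20 for RS(255, 239), i.e., n = 255 and t = 8, and converts the cycle count to time at the reported 770 MHz clock. The function name is hypothetical; the three terms are the ones listed above.

```python
def decoder_latency_cycles(n: int = 255, t: int = 8) -> int:
    """Fixed decoding latency in clock cycles: n + (10t + 8) + 12."""
    syndrome_cycles = n              # syndrome computation
    me_cycles = 10 * t + 8           # ME block including serial-parallel converter
    chien_forney_cycles = 12         # Chien search, Forney, error correction
    return syndrome_cycles + me_cycles + chien_forney_cycles


cycles = decoder_latency_cycles()        # 255 + 88 + 12 = 355
latency_ns = cycles / 770e6 * 1e9        # time at a 770 MHz clock
print(cycles, round(latency_ns, 1))      # 355 461.0
```

For t = 8 the ME term is 88 clock cycles, so the syndrome computation dominates the total latency.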
III. HIGH-SPEED PARALLEL-PROCESSING RS DECODER
Since the maximum achievable clock frequency is bounded by the physical constraints of the ME algorithm block, parallel processing is an effective method to keep up with high data-processing rates (i.e., more than 10 Gb/s). The parallel RS decoder is designed to parallelize the entire decoding process using block-parallel processing, i.e., the received symbols are input in parallel and the estimated codeword symbols are output in parallel. Fig. 7 shows the proposed architecture for the channel (Ch) = 4 parallel-processing RS decoder, while Fig. 8 shows the timing chart for this decoder. Since the decoding process that finds the error locator polynomial and the error magnitude polynomial involves high computational complexity, it affects the speed and the hardware complexity of RS decoders. Syndrome generation and application of the correction have to be instantiated independently for each of the four decoding channels, while the ME algorithm block can be shared between all channels. The ME algorithm block relies only on the result of syndrome generation and needs only 16 operation steps, which are spread over 88 clock cycles due to the pipelining of the block. As seen from the resulting timing shown in Fig. 8, sharing the ME algorithm block requires phase shifting between the individual channels. Fig. 8 shows the relation between the received data (Rin) and the corrected data (Cout) for each channel. For correction, the ME algorithm block is required only for a rather short time after complete syndrome generation (Si). The required shifted structure is realized not by an appropriately shifted frame transmission but by wrapping the Ch = 1 RS decoder with additional asymmetric delaying FIFOs, as shown in Fig. 7. The proposed architecture allows H symbols to be processed in parallel and can process input data at mFH bits/s, where F is the clock frequency of the decoder and H can be chosen as an arbitrary integer.
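The rate expression mFH can be checked numerically with a few lines. The sketch below is plain arithmetic on the formula with m = 8 bits per symbol; the function names are hypothetical, and the second function simply inverts the formula to give the clock frequency that a target rate implies.

```python
def data_rate_bps(clock_hz: float, parallel_symbols: int, bits_per_symbol: int = 8) -> float:
    """Input data rate m * F * H in bits per second."""
    return bits_per_symbol * clock_hz * parallel_symbols


def required_clock_hz(rate_bps: float, parallel_symbols: int, bits_per_symbol: int = 8) -> float:
    """Clock frequency F implied by a target rate under the same formula."""
    return rate_bps / (bits_per_symbol * parallel_symbols)


print(data_rate_bps(770e6, 1) / 1e9)        # 6.16 (Gb/s) for H = 1 at 770 MHz
print(required_clock_hz(26.6e9, 4) / 1e6)   # 831.25 (MHz) implied by 26.6 Gb/s with H = 4
```

Because the rate scales linearly in H while only the syndrome and correction logic is replicated, the formula makes the throughput benefit of sharing the ME block easy to quantify for any H.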
We believe that this parallel RS decoder using the ME algorithm can achieve much higher speeds than other implementations of RS decoders using the ME and BM algorithms, while its hardware complexity does not grow in proportion to the number of parallel symbols H. Table II compares the gate counts of several parallel RS decoders with that of the Ch = 1 RS decoder, whose block diagram is shown in Fig. 1. The Ch = 4 parallel RS decoder requires only a 40% increase over the gate count of the Ch = 1 RS decoder. Sharing the ME algorithm block among the channels of the Ch = 4 RS decoder leads to substantial hardware savings, as the ME algorithm block accounts for about 80% of the total gates of the RS decoder. Table III shows the frequency, latency, and data processing rate for several parallel RS decoders. The Ch = 1 RS decoder has a data processing rate of 6.16 Gb/s at a frequency of 770 MHz, whereas the Ch = 4 parallel RS decoder has a data processing rate of 26.6 Gb/s. The proposed parallel RS decoders offer very high-speed performance to meet the continual demand for ever higher data rates in fiber optic systems.
IV. CONCLUSION
We have presented the components of a pipelined RS decoder and a very high-speed parallel RS decoder using the ME algorithm for fiber optic systems. Their regular data flow structure can be highly pipelined and easily implemented in VLSI. The pipelined multiplier structure in combination with the systolic architecture results in an order of magnitude reduction in the critical path delay and provides significant gains over existing approaches. The parallel processing architecture enables data-processing rates of more than 10 Gb/s, even though the maximum achievable clock frequency is generally bounded by the critical path of the modified Euclidean algorithm block. These pipelined and parallel RS decoders have been implemented in a 0.13-μm CMOS technology with a supply voltage of 1.1 V. The Ch = 1 pipelined RS decoder has a data processing rate of 6.16 Gb/s at a frequency of 770 MHz, whereas the Ch = 4 parallel RS decoder has a data processing rate of 26.6 Gb/s. The proposed parallel RS decoders offer very high-speed performance to meet the continual demand for fiber optic systems with speeds beyond 10 Gb/s.
