This paper presents a high-throughput, low-complexity four-parallel Reed-Solomon (RS) decoder for high-rate WPAN systems. Four-parallel processing is used to achieve 12-Gbps data throughput with low hardware complexity. In addition, the proposed pipelined folded Degree-Computationless Modified Euclidean (fDCME) algorithm is used to implement the key equation solver (KES) block, which keeps the hardware complexity of the RS decoder low. The proposed four-parallel RS decoder is implemented in 90-nm CMOS technology optimized for a 1.2 V supply voltage. The implementation results show that the proposed RS decoder can operate at a clock frequency of 400 MHz and has a data throughput of 12.8 Gbps. Because the proposed architecture combines a high data processing rate with low hardware complexity, it can be applied in FEC devices for next-generation high-rate WPAN systems with data rates of 10 Gbps and beyond.
key words: forward error correction (FEC), Reed-Solomon (RS), decoder, mmWAVE, WPAN
Introduction
The emergence of a multitude of "bandwidth hungry" multimedia applications has exacerbated the need for multi-gigabit wireless solutions, which are beyond the reach of conventional WLAN technology (802.11a, b and g). Uncompressed high-definition video distribution and massive data synchronization are driving data-throughput requirements well beyond gigabits/s (Gbps), and are already demanding up to 10 Gbps with the introduction of, for example, the HDMI 1.3 video standard [1]. The strong commercial interest in using the 57-66 GHz band, known as the millimeter-wave band, for indoor wireless communications is evidenced by recent industrial and standards development efforts in several international standard groups, including ECMA TC-387, IEEE 802.15.3c and 802.11 VHT60 [2].
These task groups are developing a millimeter-wave (mmWave) based alternative physical layer (PHY) for the high-rate Wireless Personal Area Network (WPAN) standard [3], [4]. This mmWave WPAN system will allow high coexistence with all other microwave systems in the 802.15 family of WPANs. In addition, the mmWave WPAN will support high-data-rate applications such as high-speed internet access and streaming content download (video on demand, home theater, 3D TV, etc.). Very high data rates in excess of 10 Gbps will be provided for simultaneous time de-
For these reasons, the demand for ever higher data rates makes it necessary to devise very high-speed Forward Error Correction (FEC) architectures. Reed-Solomon (RS) codes have been adopted in WPAN systems as an FEC scheme [3], [4], and several multi-gigabit RS decoders have been reported. To obtain high throughput, parallel processing is a natural choice for the hardware design. The one-shot Reed-Solomon encoder/decoder scheme [5], [6], which is based on a parallel combinational circuit, is a representative example of a high-throughput RS decoder.
In this paper, we present a four-parallel RS(240,224) encoder/decoder architecture for mmWave WPAN systems, in particular the ECMA standard. Four-parallel processing is used to achieve 12-Gbps data throughput. In addition, a folded Degree-Computationless Modified Euclidean (fDCME) architecture is applied to the key equation solver (KES) block to reduce hardware complexity. This paper is organized as follows. Section 2 presents the proposed four-parallel RS encoder architecture. Section 3 describes the key ideas applied to the four-parallel RS decoder design, especially those for achieving high throughput and reduced hardware complexity: the four-parallel syndrome computation block, the Chien search and error correction block, and the pipelined fDCME architecture. Section 4 gives implementation results and a performance comparison. Finally, conclusions are provided in Sect. 5.
Four-Parallel Reed-Solomon Encoder
Systematic RS encoding produces the codeword polynomial in Eq. (1), which comprises the message symbols followed by the parity symbols. The message polynomial M(x) is multiplied by $x^{n-k}$ and then added to the parity polynomial P(x). If the generator polynomial G(x) is given as in Eq. (2), the parity polynomial P(x) can be written as Eq. (3). To apply the four-parallel structure, Eq. (3) must be reformulated. M(x) consists of 224 symbols, which is a multiple of four. As a result, the four-parallel form of P(x) can be rewritten as Eq. (4), from which we derive the partial generator polynomials shown in Eq. (5).
The proposed four-parallel RS encoder is shown in Fig. 1 and is built around a Linear Feedback Shift Register (LFSR) structure.
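The encoding relation described above, P(x) = M(x)·x^(n−k) mod G(x), can be illustrated with a minimal Python sketch. Several things here are assumptions rather than details taken from the paper: the field GF(2^8) with primitive polynomial 0x11D, generator roots α^0 to α^15, and the helper names `generator_poly`, `lfsr_step`, and `encode_four_parallel`. The four-parallel behavior is modeled by applying the serial LFSR update four times per clock; a hardware design would flatten those four updates into one combinational stage, which is what the partial generator polynomials are for.

```python
# Sketch only. ASSUMPTIONS: GF(2^8) with primitive polynomial 0x11D and
# generator roots alpha^0 .. alpha^15; neither is stated in the text.
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def generator_poly(nroots=16):
    """G(x) = prod_{i=0}^{nroots-1} (x + alpha^i), lowest degree first."""
    g = [1]
    for i in range(nroots):
        nxt = [0] * (len(g) + 1)
        for j, c in enumerate(g):
            nxt[j] ^= gf_mul(c, EXP[i])  # c * alpha^i contributes to x^j
            nxt[j + 1] ^= c              # c * x contributes to x^(j+1)
        g = nxt
    return g


def lfsr_step(reg, sym, g):
    """One serial LFSR update; reg[j] holds the x^j remainder coefficient."""
    fb = sym ^ reg[-1]
    out = [gf_mul(fb, g[0])]
    for j in range(1, len(reg)):
        out.append(reg[j - 1] ^ gf_mul(fb, g[j]))
    return out


def encode_four_parallel(msg, g):
    """Systematic encoding, consuming four message symbols per clock."""
    assert len(msg) % 4 == 0
    reg = [0] * (len(g) - 1)
    for clk in range(0, len(msg), 4):
        for sym in msg[clk:clk + 4]:  # four unrolled LFSR updates per clock
            reg = lfsr_step(reg, sym, g)
    return msg + reg[::-1]  # message symbols followed by parity symbols
```

With 224 message symbols the loop runs for 56 clocks and returns a 240-symbol codeword whose first 224 symbols are the unmodified message.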
Four-Parallel Reed-Solomon Decoder
In general, an RS decoder consists of three blocks: the syndrome computation block, the KES block, and the Chien search and error correction block. The RS decoder can be implemented using the modified Euclidean (ME) algorithm to solve the key equation. In this paper, we propose the fDCME algorithm, a reformulated version of our previous pipelined Degree-Computationless Modified Euclidean (pDCME) algorithm in [16]. While the pDCME algorithm is implemented with a systolic array architecture, the fDCME algorithm is suited to a folded architecture. The proposed fDCME architecture can therefore provide much lower hardware complexity for the KES block. Both the syndrome computation block and the Chien search and error correction block are reformulated for high-throughput four-parallel processing. The proposed four-parallel RS decoder architecture is shown in Fig. 2. It includes the four-parallel syndrome computation block, the fDCME block, and the four-parallel Chien search and error correction block. This section explains each of these sub-blocks.
Four-Parallel Syndrome Computation Block
The syndrome computation block calculates all syndromes $S_i$ (0 ≤ i ≤ 15) by substituting the roots of the generator polynomial G(x) into the received codeword polynomial R(x), as in Eq. (6). As shown in Fig. 3, the proposed four-parallel syndrome computation block is implemented according to Eq. (7).
$R(x) = r_{239}x^{239} + r_{238}x^{238} + \cdots + r_1 x + r_0$
The received codeword consists of 240 symbols, which is a multiple of 4, so the proposed syndrome computation block calculates the syndromes in 60 clock cycles. At the first clock, the received codeword ($r_{239}$
This iterative process runs for 60 clock cycles, after which the syndromes $S_i$ are held in flip-flop (1). Multiplexers (3) and (4) select '1' at every 60th clock cycle, and the syndromes $S_i$ are shifted to flip-flop (2). Finally, the syndromes $S_0, S_1, \ldots, S_{15}$ are output serially to the KES block, and new syndromes can be computed in the syndrome cells.
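The 60-cycle iteration can be sketched in Python as a four-symbol Horner update. This is one common way to parallelize syndrome evaluation and is not necessarily identical to Eq. (7). The field polynomial 0x11D, the choice of α^i as the evaluation point for S_i, and the function names are assumptions made for illustration. Each clock updates every syndrome as S_i ← S_i·α^(4i) + r_a·α^(3i) + r_b·α^(2i) + r_c·α^i + r_d, where r_a to r_d are the four symbols received in that clock.

```python
# Sketch only. ASSUMPTIONS: GF(2^8) with primitive polynomial 0x11D and
# syndromes S_i = R(alpha^i) for i = 0..15.
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes_four_parallel(rx, nsyn=16):
    """rx lists r_239 first; four symbols are consumed per clock."""
    assert len(rx) % 4 == 0
    # Constant multipliers alpha^i, alpha^2i, alpha^3i, alpha^4i per cell.
    powers = [[EXP[(k * i) % 255] for k in (1, 2, 3, 4)] for i in range(nsyn)]
    s = [0] * nsyn
    for clk in range(0, len(rx), 4):  # 60 iterations for 240 symbols
        ra, rb, rc, rd = rx[clk:clk + 4]
        for i in range(nsyn):
            a1, a2, a3, a4 = powers[i]
            s[i] = (gf_mul(s[i], a4) ^ gf_mul(ra, a3)
                    ^ gf_mul(rb, a2) ^ gf_mul(rc, a1) ^ rd)
    return s


def syndromes_serial(rx, nsyn=16):
    """Reference: one Horner step per symbol, one symbol per clock."""
    out = []
    for i in range(nsyn):
        acc = 0
        for c in rx:
            acc = gf_mul(acc, EXP[i]) ^ c
        out.append(acc)
    return out
```

The serial reference needs 240 clocks for the same result, so the four-parallel form trades three extra constant multipliers per syndrome cell for a fourfold reduction in cycle count.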
Key Equation Solver Block
The KES block obtains the error locator polynomial σ(x) and the error value polynomial ω(x) by solving the key equation $\omega(x) = S(x)\sigma(x) \bmod x^{2t}$. The KES block is the most critical part in the design of RS decoders. KES architectures based on the modified Euclidean (ME) algorithm [7]-[9], [17], [18] or the Berlekamp-Massey (BM) algorithm [11], [12] have regular structures, but their hardware cost is very high because they require both a systolic-array structure and degree computation units. The pDCME algorithm was suggested as an alternative in [15], but the pDCME architecture still has high hardware complexity. Whereas the pDCME architecture requires 2t processing elements (PEs), the proposed fDCME algorithm, which employs a folding technique, needs only 2 PEs with shift registers.
The proposed fDCME algorithm is described by the pseudo-code below. An array of two PEs performs the DCME algorithm repeatedly, and the error locator polynomial σ(x) and the error value polynomial ω(x) are computed as a result. Until the index 'stage' reaches t, $a_{i-1}$ and $b_{i-1}$ are the leading coefficients of the polynomials $F_{i-1}(x)$ and $G_{i-1}(x)$, respectively. Either Step2 (swap operation) or Step3 (delaying the previous coefficients) is executed repeatedly until the index 'loop' of Step1 reaches 2. Step2 is controlled by the stop signal (stop), the swap signal (sw) and the shift signal (sht).
The inputs of PE (1) and PE (2) follow several patterns, which correspond to Eqs. (8) to (10). These patterns are used to generate the two control signals 'sw' and 'sht'. The 'sw' signal determines whether the polynomial pairs $F_{i-1}(x)$, $G_{i-1}(x)$ and $H_{i-1}(x)$, $I_{i-1}(x)$ should be swapped. The 'sht' signal selects either the polynomial arithmetic operation or the shift operation.
In Eq. (8), $G_0(x)$ is $x^{2t}$, $F_0(x)$ is $S(x)$ multiplied by $x$, and the coefficient $S_{2t-1}$ is nonzero. Since both polynomials have degree 16, PE (1) executes the arithmetic operation, after which $G_1(x)$ and $F_1(x)$ both have degree 15. In Eq. (9), the coefficient $S_{2t-1}$ is zero, which means that the degree of the $F_0$ input is 15. PE (1) therefore only delays the $G_0$ output so that the two inputs have the same degree, and PE (2) then executes the arithmetic operation because its two inputs have equal degree. In Eq. (10), the coefficient $S_{2t-1}$ is nonzero but $S_{2t-2}$ is zero. In this case PE (1) performs the same operation as in Eq. (8), but the output derived from $F_0$ has degree 14. Thus, PE (2) only delays the output of $G_1$. Since the degree of $F_1(x)$ is less than the degree of $G_1(x)$, the two inputs are swapped before the PE (1) operation. When the index 'stage' reaches t, the fDCME algorithm stops. The output $F_{16}(x)$ of PE (2) becomes the error value polynomial ω(x), and the output $H_{16}(x)$ becomes the error locator polynomial σ(x).
Figure 4 shows a block diagram of the proposed fDCME architecture, which consists of two PEs and shift registers connected in a recursive loop. The updated coefficients of $F_{i-1}(x)$, $G_{i-1}(x)$, $H_{i-1}(x)$ and $I_{i-1}(x)$ are generated serially. The output of PE (2) is fed back into PE (1) in descending order. PE (1) and PE (2) each consist of a polynomial arithmetic structure, a control-signal generation block and a stop-signal generation block.
Each PE consists of four Galois-field (GF) multipliers, two GF adders, and ten multiplexers. The PE has three pipeline stages, which significantly improves the clock frequency. A twelve-stage shift register stores the output of PE (2) at each recursive iteration step. The fDCME block therefore has eighteen pipeline stages in total (two PEs with three stages each, plus the twelve shift-register stages). PE (1) and PE (2) use pipelined fully-parallel GF multipliers to reduce the critical path delay. As a result, the critical path delay of a PE is $T_{inv} + T_{and2} + 3T_{mux2} + T_{ff}$, where $T_{inv}$, $T_{and2}$, and $T_{mux2}$ are the delays of the inverter, the 2-input AND gate, and the 2×1 multiplexer, respectively, and $T_{ff}$ is that of the flip-flop.
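To make the role of the KES block concrete, the sketch below solves the key equation with the textbook extended Euclidean algorithm (polynomial division with explicit degree tests). It is deliberately not the fDCME algorithm: the degree-computationless control signals, the folding onto two PEs, and the pipelining are hardware techniques that this software model does not attempt to reproduce. The field polynomial 0x11D and all function names are assumptions. Polynomials are coefficient lists with the lowest degree first, and σ(x) and ω(x) are returned up to a common scale factor.

```python
# Sketch only. ASSUMPTION: GF(2^8) with primitive polynomial 0x11D.
# This is the classical Euclidean solver, NOT the fDCME architecture.
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def deg(p):
    """Degree of a polynomial; -1 for the zero polynomial."""
    d = len(p) - 1
    while d >= 0 and p[d] == 0:
        d -= 1
    return d


def poly_add(a, b):
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) ^ (b[i] if i < len(b) else 0)
            for i in range(n)]


def poly_mul(a, b):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] ^= gf_mul(x, y)
    return out


def poly_eval(p, x):
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x) ^ c
    return acc


def poly_divmod(a, b):
    """Return (quotient, remainder) of a / b; b must be nonzero."""
    a = a[:]
    db = deg(b)
    inv_lead = EXP[255 - LOG[b[db]]]
    q = [0] * max(deg(a) - db + 1, 1)
    while deg(a) >= db:
        da = deg(a)
        c = gf_mul(a[da], inv_lead)
        q[da - db] = c
        for j in range(db + 1):
            a[j + da - db] ^= gf_mul(c, b[j])
    return q, a


def solve_key_equation(synd, t):
    """Return (sigma, omega) with omega = S * sigma mod x^(2t)."""
    r_prev, r_cur = [0] * (2 * t) + [1], synd[:]  # x^(2t) and S(x)
    t_prev, t_cur = [0], [1]
    while deg(r_cur) >= t:  # stop once deg(omega) < t
        q, rem = poly_divmod(r_prev, r_cur)
        r_prev, r_cur = r_cur, rem
        t_prev, t_cur = t_cur, poly_add(t_prev, poly_mul(q, t_cur))
    return t_cur, r_cur
```

For the RS(240,224) code, `solve_key_equation(synd, 8)` takes the sixteen syndromes as `synd`. The explicit `deg` calls and the variable-length division are exactly the operations that degree-computationless formulations are designed to avoid in hardware.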
Four-Parallel Chien Search and Error Correction Block
After the KES block operation, the error locator polynomial σ(x) and the error value polynomial ω(x) are obtained. Letting $X_l = \alpha^{m_l}$ and $Y_l = e_{m_l}$, Eq. (11) can be transformed into Eq. (12), where $X_l$ and $Y_l$ are the possible error location and the possible error value, respectively. The Chien search algorithm can be implemented using Eq. (13). The roots of σ(x) are the inverses of the error locations. For the RS(240,224) code, $\sigma(\alpha^{16}) = 0$ means that $r_{239}$ was corrupted by an error, since $\alpha^{-239} = \alpha^{255-239} = \alpha^{16}$ in GF($2^8$). Accordingly, $\alpha^{16}$ is substituted into σ(x) first, because the first symbol of the received codeword is $r_{239}$.
The error value polynomial can be derived as in Eq. (14). Finally, the error value can be computed using Eq. (15), where $\sigma'(x)$ is the derivative of σ(x). Rewriting σ(x) as the sum of its even terms $\sigma_{even}(x)$ and its odd terms $\sigma_{odd}(x)$, we have $\sigma_{odd}(x) = x \cdot \sigma'(x)$, because the derivatives of the even-degree terms vanish in a field of characteristic 2. The Chien search and error correction block is therefore implemented as shown in Fig. 5.
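The search order and the use of $\sigma_{odd}(x)$ can be sketched in Python as follows. The sketch tests four positions per clock, starting at $r_{239}$ with evaluation point $\alpha^{16}$, and computes each error value as $\omega(X_l^{-1}) / \sigma_{odd}(X_l^{-1})$. That expression follows from Forney's formula only under the assumption that the syndromes are $S_i = R(\alpha^i)$ for i = 0 to 15; the field polynomial 0x11D, the lookup table `INV`, and the function names are also assumptions.

```python
# Sketch only. ASSUMPTIONS: GF(2^8) with primitive polynomial 0x11D and
# syndromes defined at alpha^0 .. alpha^15, which gives
# error value = omega(X^-1) / sigma_odd(X^-1).
PRIM = 0x11D
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIM
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]

# Inverse lookup table with 256 one-byte entries; entry 0 is unused.
INV = [0] + [EXP[255 - LOG[a]] for a in range(1, 256)]


def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_eval(p, x):
    """Evaluate a polynomial given lowest degree first."""
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x) ^ c
    return acc


def chien_forney_four_parallel(sigma, omega, n=240):
    """Return {position: error value}; four positions tested per clock."""
    sigma_odd = [c if k % 2 else 0 for k, c in enumerate(sigma)]
    errors = {}
    for clk in range(n // 4):              # 60 clocks for n = 240
        for lane in range(4):              # four parallel search cells
            pos = n - 1 - (4 * clk + lane)
            x_inv = EXP[(255 - pos) % 255]  # alpha^16 for position 239
            if poly_eval(sigma, x_inv) != 0:
                continue                    # not a root: no error here
            den = poly_eval(sigma_odd, x_inv)
            errors[pos] = gf_mul(poly_eval(omega, x_inv), INV[den])
    return errors
```

Each returned value is XORed into the received symbol at that position to correct it. Because the error value is a ratio of ω and $\sigma_{odd}$, a common scale factor in σ(x) and ω(x) cancels out.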
The division is implemented with a 256 × 8 ROM in which the inverses of the field elements are stored. As shown in Fig. 5(b
Performance Comparison
The proposed four-parallel RS encoder/decoder architecture was modeled in Verilog HDL and simulated to verify its functionality. After the design functionality was fully verified, it was synthesized using appropriate timing and area constraints. Both the simulation and synthesis steps were carried out using the SYNOPSYS synthesis tool and 90 nm CMOS technology optimized for a 1.2 V supply voltage. According to the synthesis results, the proposed four-parallel RS decoder requires 23,920 gates in total, including the memory block. Post-layout simulation shows that the proposed four-parallel RS decoder architecture can operate at a clock frequency of 400 MHz and has a data processing rate of 12.8 Gbps. Table 1 compares various RS decoder architectures. For the KES block, the proposed fDCME architecture provides much lower hardware complexity than other KES architectures based on the ME algorithm. For the purpose of comparison, we used the Technology-Scaled Normalized Throughput (TSNT) from [13]. The TSNT relates throughput to gate count, normalized to a 0.13 μm technology, as shown below. The throughput rate and the TSNT index of our design are the highest among the compared architectures.
$$\mathrm{TSNT} = \frac{\text{Throughput Rate}}{\#\,\text{of Total Gates}} \times \frac{\text{Tech.}}{0.13\ \mu\text{m}}$$
The implementation results show that the proposed four-parallel RS decoder architecture has a much higher data processing rate and lower hardware complexity than the conventional ME-algorithm-based RS decoder architectures.
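The reported figures can be cross-checked with a short Python calculation. It assumes 8-bit symbols (consistent with the 256 × 8 inverse ROM) and assumes that the TSNT is the throughput multiplied by the ratio of the technology feature size to 0.13 μm, divided by the total gate count. The function names and the choice of Mbps per gate as the unit are illustrative, and the result is not claimed to match the entry in Table 1.

```python
def throughput_gbps(parallel_symbols, bits_per_symbol, clock_mhz):
    """Raw throughput of a decoder that accepts several symbols per clock."""
    return parallel_symbols * bits_per_symbol * clock_mhz / 1000.0


def tsnt(throughput_mbps, total_gates, tech_um, ref_um=0.13):
    """Technology-scaled normalized throughput (ASSUMED form of the formula)."""
    return throughput_mbps * (tech_um / ref_um) / total_gates


# Four 8-bit symbols per clock at 400 MHz.
rate = throughput_gbps(4, 8, 400)

# 23,920 gates in 90 nm (0.09 um) technology, throughput given in Mbps.
score = tsnt(rate * 1000.0, 23920, 0.09)
print(f"throughput = {rate:.1f} Gbps, TSNT = {score:.3f} Mbps/gate")
```

The first function reproduces the 12.8 Gbps figure from the 400 MHz clock, and the second shows how a design in a finer technology is scaled down when compared against the 0.13 μm reference.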
Conclusions
This paper presented the design and implementation of a four-parallel RS encoder/decoder for high-rate WPAN systems. Four-parallel processing is used to achieve 12-Gbps data throughput. A high-speed, low-complexity fDCME architecture is applied to the KES block. Four-way parallelization of the syndrome computation and Chien search blocks allows the inputs to be received at very high data rates and the outputs to be delivered at correspondingly high rates with minimum delay. As a result, the proposed four-parallel RS decoder architecture has a much higher data processing rate and lower hardware complexity than conventional RS decoder architectures. The proposed RS decoder can be applied in FEC devices for next-generation high-rate WPAN systems.
