A novel pipelined recursive modified Euclidean (ME) algorithm block for a low-complexity, high-speed Reed-Solomon (RS) decoder is presented. The deeply pipelined recursive structure enables implementation of a low-complexity ME algorithm block with a much improved clock frequency. The low-complexity, high-speed RS decoder using the pipelined recursive ME algorithm block has been implemented in 0.13 µm CMOS technology with a supply voltage of 1.1 V. The results show that it has significantly low hardware complexity and a high data processing rate of 6.16 Gbit/s.
Introduction: The multi-bit error detection and correction capability of Reed-Solomon (RS) codes, along with their relatively low-complexity architecture, has made them the most widely used codes for forward error correction (FEC) in a variety of communication systems such as space communication links, digital subscriber loops and wireless systems, as well as in networking communication. The most commonly used RS decoder architecture, which can detect and correct $t$ errors, consists of three main components (Fig. 1). The first component is a syndrome computation (SC) block. It generates a syndrome polynomial $S(x)$, which is a function of the error pattern in the received codeword. This polynomial is used in the second component of the RS decoder, the key-equation solver (KES) block, to solve the key equation $S(x)\sigma(x) = \omega(x) \bmod x^{2t}$. Either the modified Euclidean (ME) or the Berlekamp-Massey (BM) algorithm can be used to solve the key equation for an error locator polynomial $\sigma(x)$ and an error value polynomial $\omega(x)$. In the third component of the RS decoder, the error locator and error value polynomials are used to find the error magnitude values corresponding to the error locations using the Chien search and the Forney algorithms. The output of this block is the corrected received codeword, which is read out of the decoder. In addition, a FIFO memory is used to buffer the received symbols while the decoder executes the error detection and correction process. The depth of the FIFO is determined by the total latency of the decoder components. A modified division-free Euclidean algorithm and several high-speed ME algorithm (MEA) blocks for RS decoding were proposed in [1-4]. The conventional MEA blocks consist of $2t$ (twice the maximum number of correctable errors) processing elements (PEs) connected by means of a systolic-array structure. However, the hardware size of the MEA block accounts for about 70% of the total RS decoder size.
As a consequence, a key challenge is to minimise the hardware complexity of the MEA block so that the critical path delay and the total power consumption can be reduced. In this Letter we propose a pipelined recursive MEA block to achieve a low-complexity RS decoder with high throughput.
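To make the role of the SC block concrete, the following minimal Python sketch evaluates the received word at $2t$ consecutive powers of $\alpha$. It is an illustration only and not the hardware design of this Letter: the field GF(2^8) with primitive polynomial 0x11D, the code length of 255 symbols, the choice of $\alpha^0, \ldots, \alpha^{2t-1}$ as the evaluation points, and the names `gf_mul` and `syndromes` are all assumptions made for the example.

```python
# Assumed field: GF(2^8) with primitive polynomial x^8+x^4+x^3+x^2+1 (0x11D).
PRIM = 0x11D
EXP = [0] * 510          # EXP[i] = alpha^i, doubled so products need no modulo
LOG = [0] * 256          # LOG[alpha^i] = i
value = 1
for i in range(255):
    EXP[i] = EXP[i + 255] = value
    LOG[value] = i
    value <<= 1
    if value & 0x100:
        value ^= PRIM


def gf_mul(a, b):
    """Multiply two GF(2^8) elements via log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def syndromes(received, t):
    """Return [S_0, ..., S_(2t-1)] with S_k = r(alpha^k) (assumed root set).

    received[j] is the coefficient of x^j in the received polynomial r(x).
    """
    out = []
    for k in range(2 * t):
        root = EXP[k]
        acc = 0
        for symbol in reversed(received):   # Horner evaluation of r(root)
            acc = gf_mul(acc, root) ^ symbol
        out.append(acc)
    return out


# The all-zero word is a codeword, so a single injected error of value e at
# position j should give S_k = e * alpha^(j*k).
word = [0] * 255
word[10] = 0x53
S = syndromes(word, t=8)
```

With this convention a word without errors yields an all-zero syndrome, which is the case in which the KES block has nothing to solve.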
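Pipelined recursive MEA block: The pipelined recursive MEA block has been implemented to solve the key equation $S(x)\sigma(x) = \omega(x) \bmod x^{2t}$ for the error locator polynomial $\sigma(x)$ and the error value polynomial $\omega(x)$. The MEA is summarised as follows. Initially, $R_0(x) = x^{2t}$, $Q_0(x) = S(x)$, $L_0(x) = 0$ and $U_0(x) = 1$.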
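In the $i$th iteration, $R_i(x)$, $Q_i(x)$, $L_i(x)$ and $U_i(x)$ are updated according to (1) to (4), where $a_{i-1}$ and $b_{i-1}$ are the leading coefficients of $R_{i-1}(x)$ and $Q_{i-1}(x)$, respectively, and $l_{i-1}$ is given by (5).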
The algorithm stops when $\deg(R_i(x)) < t$, where $\deg(\cdot)$ denotes the degree of a polynomial. If the stop condition is satisfied, then $\sigma(x) = L_i(x)$ and $\omega(x) = R_i(x)$. Fig. 2 shows the low-complexity pipelined recursive MEA block, which consists of a pipelined degree computation (DC) block, a polynomial arithmetic (PA) block and shift-registers (SRs) connected by means of a recursive loop. The pipelined recursive MEA block computes the error locator polynomial $\sigma(x)$ and the error value polynomial $\omega(x)$. The DC block performs several operations: the degree computation, the degree update by subtraction, the re-evaluation of the 'stop' condition for the MEA, and the generation of the loop-back 'stop' signal for the subsequent iterations. The first part of the DC block compares the degrees of the $R_i(x)$ and $Q_i(x)$ polynomials. This comparison determines when the two polynomials $R_i(x)$ and $Q_i(x)$ from (1) and (2), and the two polynomials $L_i(x)$ and $U_i(x)$ from (3) and (4), need to be exchanged. Thus the exchange control circuit computes $l_{i-1}$ in (5). If $\deg(R_i(x)) < \deg(Q_i(x))$, the signal 'sw' is asserted high (sw = 1); otherwise it is asserted low (sw = 0). The second part of the DC block computes the degrees of both the $R_i(x)$ and $Q_i(x)$ polynomials for the next iteration step of the MEA. These polynomial degree values are registered at the end of each iteration and are held constant until the next iteration. This is important in order to avoid any dependency between two successive iterations, since a single highly pipelined MEA block is utilised recursively. Third, the DC block detects the condition $\deg(\omega(x)) < \deg(\sigma(x))$ or $\deg(R_i(x)) < t$. If this condition is satisfied, the DC block generates a 'stop' signal indicating the end of the algorithm and thus halts all arithmetic operations in the PA block, i.e. if $\deg(R_{i+1}) < t$ or $\deg(Q_{i+1}) < t$ (with $t = 8$), then stop = 1 and the computation stops; otherwise stop = 0.
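The recursion and the 'sw' and 'stop' decisions described above can be sketched in software as follows. This is a behavioural illustration of a generic division-free modified Euclidean recursion, not a model of the pipelined hardware, and it may differ in detail from (1) to (5). The field GF(2^8) with primitive polynomial 0x11D and the helper names (`gf_mul`, `deg`, `scale_shift`, `poly_add`, `mea`) are assumptions made for the example.

```python
# Assumed field: GF(2^8), primitive polynomial 0x11D (not stated in the text).
PRIM = 0x11D
EXP, LOG = [0] * 510, [0] * 256
value = 1
for i in range(255):
    EXP[i] = EXP[i + 255] = value
    LOG[value] = i
    value <<= 1
    if value & 0x100:
        value ^= PRIM


def gf_mul(a, b):
    """Multiply two GF(2^8) elements via log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def deg(p):
    """Degree of a polynomial stored lowest order first; -1 for zero."""
    for i in range(len(p) - 1, -1, -1):
        if p[i]:
            return i
    return -1


def scale_shift(p, c, shift=0):
    """Return c * x^shift * p(x)."""
    return [0] * shift + [gf_mul(c, v) for v in p]


def poly_add(p, q):
    """Add two polynomials (XOR of coefficients), trimming leading zeros."""
    n = max(len(p), len(q))
    out = [(p[i] if i < len(p) else 0) ^ (q[i] if i < len(q) else 0)
           for i in range(n)]
    while len(out) > 1 and out[-1] == 0:
        out.pop()
    return out


def mea(S, t):
    """Return (L, R) with S(x)L(x) = R(x) mod x^(2t) and deg(R) < t."""
    R, Q = [0] * (2 * t) + [1], list(S)   # R_0 = x^(2t), Q_0 = S(x)
    L, U = [0], [1]                        # L_0 = 0,      U_0 = 1
    if deg(Q) < 0:                         # all-zero syndrome: nothing to solve
        return [1], [0]
    stop = False
    while not stop:
        sw = deg(R) < deg(Q)               # exchange control ('sw')
        if sw:
            R, Q, L, U = Q, R, U, L
        a, b = R[deg(R)], Q[deg(Q)]        # leading coefficients
        shift = deg(R) - deg(Q)
        # Cancel the leading term of R without any field division.
        R = poly_add(scale_shift(R, b), scale_shift(Q, a, shift))
        L = poly_add(scale_shift(L, b), scale_shift(U, a, shift))
        stop = deg(R) < t                  # 'stop' condition
    return L, R
```

Because no division is performed, the returned pair is a common non-zero scalar multiple of $\sigma(x)$ and $\omega(x)$; the roots of $L(x)$ and the ratio used by the Forney algorithm are unaffected by that factor.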
In the PA block, a 'start' signal is used to indicate the beginning of the polynomials, i.e. the 'start' signal is always aligned with the leading coefficients $a_{i-1}$ and $b_{i-1}$ of the $R_{i-1}(x)$ and $Q_{i-1}(x)$ polynomials, respectively. The 'start' signal, as well as $xQ_0(x)$ and $xU_0(x)$, is delayed by one time unit so that the leading coefficients of $R_1(x)$, $Q_1(x)$, $L_1(x)$ and $U_1(x)$ are properly initiated by the 'start' signal at the first iteration step of the MEA. The signal 'z_lead' is generated in the PA block to indicate whether the leading coefficient of $Q_i(x)$ is equal to zero. The PA block performs finite-field multiplications and additions. One PA block contains four pipelined Galois-field multipliers, two Galois-field adders and 11 multiplexers in order to calculate (1) to (4). At the first iteration step of the MEA, $R_0(x)$ and $Q_0(x)$ are initialised to $x^{2t}$ and $S(x)$, respectively, and $L_0(x)$ and $U_0(x)$ are initialised to 0 and 1, respectively. The PA block uses pipelined fully parallel multipliers and has five pipelining stages, which provide a significant improvement in clock frequency. The 11-stage shift-registers are used to store the output of each recursive iteration step. Therefore, the MEA block has a total of 16 stages, including the pipelining registers inside the recursive PE block. The critical path of the proposed MEA block is the 5-bit comparator in the DC block, which has the largest number of logic levels. Thus the critical path delay in the DC block, and consequently in the ME block, is defined as
, where $T_{pipe\_mult}$ is the delay of the 8-bit pipelined Galois-field variable multiplier. After the MEA is completed, the error locator polynomial $\sigma(x)$ and the error value polynomial $\omega(x)$ are loaded into a serial-to-parallel converter. These values are then passed to the Chien search block, which calculates the roots of the error locator polynomial. The Forney algorithm block works in parallel with the Chien search block to calculate the magnitude of the error symbol at each error location. The final step in the decoding process is to add (XOR) the FIFO-buffered input codeword and the error magnitudes to correct the errors. This design provides much reduced hardware complexity compared to the conventional systolic-array ME architectures [1-4] without impacting the critical path delay.
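The Chien search, Forney evaluation and final XOR correction described above can be illustrated with a short behavioural sketch. It is not the parallel hardware of this design. The field GF(2^8) with primitive polynomial 0x11D, the code length of 255, syndromes taken at $\alpha^0, \ldots, \alpha^{2t-1}$ (which fixes the Forney formula to $e_j = X_j\,\omega(X_j^{-1})/\sigma'(X_j^{-1})$ with $X_j = \alpha^j$), and all function names are assumptions made for the example.

```python
# Assumed field: GF(2^8), primitive polynomial 0x11D (not stated in the text).
PRIM = 0x11D
EXP, LOG = [0] * 510, [0] * 256
value = 1
for i in range(255):
    EXP[i] = EXP[i + 255] = value
    LOG[value] = i
    value <<= 1
    if value & 0x100:
        value ^= PRIM


def gf_mul(a, b):
    """Multiply two GF(2^8) elements via log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_div(a, b):
    """Divide a by b in GF(2^8); b must be non-zero."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^8)")
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 255]


def poly_eval(p, x):
    """Evaluate p (lowest order first) at x using Horner's rule."""
    acc = 0
    for c in reversed(p):
        acc = gf_mul(acc, x) ^ c
    return acc


def formal_derivative(p):
    """Formal derivative in characteristic 2: only odd-power terms survive."""
    return [c if i % 2 == 1 else 0 for i, c in enumerate(p)][1:]


def chien_forney(sigma, omega, n=255):
    """Return {position: error value} for every root alpha^(-j) of sigma."""
    dsigma = formal_derivative(sigma)
    errors = {}
    for j in range(n):
        x_inv = EXP[(255 - j) % 255]               # alpha^(-j)
        if poly_eval(sigma, x_inv) == 0:           # Chien search hit
            num = gf_mul(EXP[j], poly_eval(omega, x_inv))
            den = poly_eval(dsigma, x_inv)
            if den == 0:
                raise ValueError("repeated root: word is not correctable")
            errors[j] = gf_div(num, den)           # Forney error value
    return errors


def correct(received, errors):
    """XOR the error values into the buffered received word."""
    out = list(received)
    for j, e in errors.items():
        out[j] ^= e
    return out
```

Since the error value is a ratio of $\omega$ to $\sigma'$, a common scalar factor on both polynomials, as produced by a division-free key-equation solver, cancels out.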
Results: The proposed RS decoder was designed in Verilog HDL and simulated to verify its functionality. After complete verification of the design functionality, it was synthesised using appropriate timing and area constraints. Both the simulation and synthesis steps were carried out using Cadence design tools and a 0.13 µm CMOS technology optimised for a 1.1 V supply voltage. Table 1 shows the comparison of critical path delay, hardware complexity and data processing rate for various KES blocks. It shows that the proposed MEA block has almost the same critical path delay as the previous high-speed MEA block [3], and a significantly shorter critical path delay than the Euclidean algorithm (EA) [5] and the BM algorithm [6] blocks. In contrast to the conventional KES blocks, the proposed pipelined recursive ME algorithm block can be implemented with only one PE and the SRs. As a result, it shows significantly reduced hardware complexity compared to the conventional MEA [1-4], EA [5] and BM algorithm [6] blocks. The proposed MEA block requires only 9% and 21% of the gate count of the previous MEA [3] and EA [5], respectively. The RS decoder using the proposed MEA block operates at a clock rate of 770 MHz and has a throughput of 6.16 Gbit/s. As a result, the proposed RS decoder has significantly reduced hardware complexity and a high data processing rate compared to previously published conventional RS decoders [1-6].
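The reported throughput is consistent with the reported clock rate under the assumption that the decoder processes one 8-bit symbol per clock cycle; the symbol width is inferred here from the 8-bit Galois-field multiplier and is not stated explicitly as a throughput assumption in the text. A quick check of the arithmetic, with hypothetical variable names:

```python
clock_hz = 770e6        # reported clock rate
bits_per_symbol = 8     # assumed: one GF(2^8) symbol per clock cycle

throughput_gbps = clock_hz * bits_per_symbol / 1e9
print(f"{throughput_gbps:.2f} Gbit/s")   # prints 6.16 Gbit/s

# Stage count of the recursive loop as described for the MEA block:
# five pipelining stages in the PA block plus the 11-stage shift-registers.
total_stages = 5 + 11
```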
