†, Nonmember and Hanho LEE†a), Member
SUMMARY A high-speed, low-complexity time-multiplexing Reed-Solomon-based forward error correction architecture based on the pipelined truncated inversionless Berlekamp-Massey algorithm is presented in this paper. The proposed architecture achieves higher speed and lower hardware complexity than conventional Reed-Solomon-based forward error correction architectures. Hardware complexity is reduced by employing a truncated inversionless Berlekamp-Massey algorithm. High speed and high throughput are achieved by employing a three-parallel pipelined processing technique and a modified syndrome computation block. The time-multiplexing method for the pipelined truncated inversionless Berlekamp-Massey architecture is used in the parallel Reed-Solomon decoder to reduce hardware complexity. The proposed architecture has been designed and implemented in 90-nm CMOS technology. Synthesis results show that the proposed 16-channel Reed-Solomon-based forward error correction architecture requires 417,600 gates and can operate at 625 MHz to achieve a throughput of 240 Gb/s. The proposed architecture can be readily applied to Reed-Solomon-based forward error correction devices for next-generation short-reach optical communications.
Introduction
Demand for 100 Gigabit Ethernet (GbE) devices is increasing dramatically where data traffic converges, such as in high-performance computing, servers, data centers, and enterprise networks. In the future, bandwidth beyond 100 GbE will be required. For this reason, the IEEE 802.3ba task force approved IEEE Std 802.3ba-2010 for 40 Gb/s and 100 Gb/s Ethernet [1]. The very high-speed data transmission techniques developed for fiber-optic networking systems have necessitated high-speed forward error correction (FEC) architectures to meet the continuing demand for ever higher data rates. High-speed (40 Gb/s and beyond) short-reach optical communication systems commonly use the Reed-Solomon (RS)(255,239) code. Specifically, the ITU-T has discussed standardization of a hard-decision FEC for a 100 Gb/s optical transport network (OTN) [2]. As a result, the RS(255,239) code has become one of the candidates for 100 Gb/s short-reach optical communication systems.
The very high-speed data transmission techniques for optical communications have necessitated high-speed, low-complexity RS-based FEC architectures to meet the continuing demand for ever higher data rates (100 Gb/s and beyond). Typical high-speed parallel RS-based FEC architectures have adopted the modified Euclidean (ME) architecture to meet the high-throughput requirement [3]-[7]. However, their hardware utilization is inefficient, and they incur a large hardware cost to reach the very high data rates of optical systems. RS decoder architectures using a folded ME architecture have also been proposed to achieve efficient hardware utilization and low hardware complexity [7], [8]. However, they require very long latency.
In this paper, we present a three-parallel RS decoder architecture and a high-speed, low-complexity time-multiplexing RS-based FEC architecture using a truncated inversionless Berlekamp-Massey (TiBM) algorithm for next-generation short-reach optical systems. We describe the key ideas applied to the design of the 16-channel time-multiplexing RS-based FEC architecture, especially those related to achieving high throughput, low complexity, and low latency. The synthesis results show that, compared with related work, the proposed RS-based FEC architecture has very low hardware complexity and delivers a very high throughput rate.
The rest of this paper is organized as follows. Section 2 presents the three-parallel RS decoder with a modified syndrome computation block and the pipelined TiBM (pTiBM) architecture. Section 3 presents the high-speed, low-complexity 16-channel time-multiplexing RS-based FEC architecture. The performance evaluation and comparisons with related work are described in Sect. 4. Finally, conclusions are provided in Sect. 5.
Three-Parallel Reed-Solomon Decoder
The RS decoder consists of three main blocks: a syndrome computation block, a key equation solver (KES) block, and a Chien search and error evaluation (CSEE) block, as shown in Fig. 1. Generally, the RS decoder can be implemented with a Berlekamp-Massey (BM) algorithm or an ME algorithm to solve the key equation. In this section, we propose a three-parallel RS decoder using a modified syndrome computation block and the pTiBM architecture, which provides high speed and low hardware complexity. The modified syndrome computation block and the CSEE block are reformulated to minimize the critical path delay.
Copyright © 2012 The Institute of Electronics, Information and Communication Engineers
Modified Three-Parallel Syndrome Computation Block
Let $C(x)$ and $R(x)$ be the codeword polynomial and the received polynomial, respectively. The transmitted polynomial can be corrupted by channel noise during transmission. Therefore, the received polynomial can be described as $R(x) = C(x) + E(x)$, where $E(x)$ is the error polynomial. The first step in the decoding algorithm is to calculate the $2t$ syndromes $S_i$ ($0 \le i \le 2t-1$), which are used to correct the correctable errors. Here $t$ is the error-correction capability. If all $2t$ syndromes $S_i$ ($0 \le i \le 2t-1$) are zero, then the received polynomial $R(x)$ is a valid codeword $C(x)$, that is, no errors have occurred. The syndrome polynomial $S(x)$ is defined by (1) and (2), and (3) expresses the syndromes in the form used for three-parallel processing.
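As a sketch, under the standard RS-decoding convention that the syndromes are evaluations of $R(x)$ at consecutive powers of the primitive element $\alpha$ (the exact indexing used in (1)-(3) is an assumption here), the definitions referred to above can be written as follows, with $n = 255$ the code length and $R_j$ the received symbols:

```latex
\begin{align}
S(x) &= \sum_{i=0}^{2t-1} S_i x^i, \\
S_i &= R(\alpha^i) = \sum_{j=0}^{n-1} R_j \alpha^{ij}, \\
S_i &= \sum_{j=0}^{n/3-1} \left( R_{3j+2}\,\alpha^{2i} + R_{3j+1}\,\alpha^{i} + R_{3j} \right) \alpha^{3ij}.
\end{align}
```

The last line only regroups the sum three symbols at a time, which is why $n/3 = 85$ clock cycles suffice for $n = 255$.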
The conventional three-parallel syndrome computation block consists of $2t$ syndrome cells, which compute the $S_i$ values over 85 clock cycles. However, the critical path of the syndrome cell increases when the syndrome computation block is implemented for three-parallel processing as shown in (3). To reduce the critical path, the syndrome polynomial can be separated into even terms and odd terms. If the three-parallel syndrome computation block is reformulated according to the syndrome polynomial in (7), pipelining is possible without any additional latency. Figure 2 shows the modified three-parallel syndrome computation block. The even and odd terms are computed alternately during the first 84 clock cycles. At the final, 85th clock cycle, the syndrome is obtained by multiplying the odd term by $\alpha^{3i}$ and combining it with the even term. The critical path of the proposed syndrome computation block is reduced from $6T_{xor} + T_{mux} + T_{ff}$ for the conventional syndrome computation block to $3T_{xor} + T_{ff}$, where $3T_{xor}$ is the critical path delay of the constant Galois-field (GF) multiplier.
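The even/odd separation can be checked in software. The following is a minimal sketch, not the authors' hardware: it assumes the field polynomial $x^8 + x^4 + x^3 + x^2 + 1$ for GF($2^8$), syndromes taken at $\alpha^0, \ldots, \alpha^{2t-1}$, and one plausible schedule in which two accumulators with feedback multiplier $\alpha^{6i}$ are updated alternately. All function names are hypothetical.

```python
import random

PRIM_POLY = 0x11D  # assumed GF(2^8) field polynomial: x^8 + x^4 + x^3 + x^2 + 1

# Exponent/log tables; EXP is doubled so gf_mul needs no modulo.
EXP, LOG = [0] * 510, [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= PRIM_POLY
for power in range(255, 510):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def alpha(e):
    """Return alpha**e with the exponent reduced modulo 255."""
    return EXP[e % 255]


def syndrome_direct(r, i):
    """S_i = sum_j r[j] * alpha^(i*j), evaluated term by term."""
    s = 0
    for j, rj in enumerate(r):
        s ^= gf_mul(rj, alpha(i * j))
    return s


def group_term(r, j, i):
    """The three symbols consumed in one clock cycle, weighted by 1, alpha^i, alpha^(2i)."""
    return gf_mul(r[3 * j + 2], alpha(2 * i)) ^ gf_mul(r[3 * j + 1], alpha(i)) ^ r[3 * j]


def syndrome_three_parallel(r, i):
    """Single Horner recursion over 85 groups, feedback multiplier alpha^(3i)."""
    s = 0
    for j in reversed(range(len(r) // 3)):
        s = gf_mul(s, alpha(3 * i)) ^ group_term(r, j, i)
    return s


def syndrome_even_odd(r, i):
    """Two alternating Horner recursions, feedback multiplier alpha^(6i)."""
    groups = len(r) // 3  # 85 for n = 255
    acc = [0, 0]  # acc[0]: even-indexed groups, acc[1]: odd-indexed groups
    for j in reversed(range(1, groups)):  # first 84 cycles: j = 84, 83, ..., 1
        acc[j % 2] = gf_mul(acc[j % 2], alpha(6 * i)) ^ group_term(r, j, i)
    # 85th cycle: fold in group 0 and scale the odd accumulator by alpha^(3i)
    even = gf_mul(acc[0], alpha(6 * i)) ^ group_term(r, 0, i)
    return even ^ gf_mul(acc[1], alpha(3 * i))


random.seed(1)
received = [random.randrange(256) for _ in range(255)]  # arbitrary test word
for i in range(16):  # 2t = 16 syndromes for RS(255,239)
    assert syndrome_direct(received, i) == syndrome_three_parallel(received, i)
    assert syndrome_direct(received, i) == syndrome_even_odd(received, i)
```

Because each accumulator is updated only every second cycle, the feedback loop in a hardware version of this schedule could hold an extra pipeline register, which is the motivation the text gives for the reformulation.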
pTiBM Architecture
The low-complexity TiBM architecture for the KES block was presented in our previous paper [9]; it removes the $t-1$ unnecessary PEs of the conventional RiBM architecture [10]. The TiBM algorithm can be described by pseudocode as follows:
Step TiBM.3: if $\delta_0(r) \neq 0$ and $k(r) \geq 0$ then begin …
Figure 3 shows the block diagram of the proposed pTiBM architecture. In the pTiBM architecture, the original $t+1$ PE1s employed in the conventional RiBM architecture are used as PE1$_0$ to PE1$_t$, and $t+1$ modified PE2s are used as PE2$_{t+1}$ to PE2$_{2t+1}$. Some zero values are lost because $t-1$ PE1s are truncated. Thus, MUX(1) and MUX(2) are added to the modified PE2s to supply zero values at the appropriate time. In addition, the proposed pTiBM architecture can be pipelined for high speed. This means that a time-multiplexing method can be used efficiently in the multi-channel RS-based FEC architecture. The time-multiplexing method is described in Sect. 3.
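For orientation, the RiBM recursion of [10], from which TiBM removes $t-1$ PEs, can be sketched in software. This is the baseline RiBM recursion as commonly stated, not the authors' TiBM or pTiBM; the field polynomial, the syndrome indexing from $\alpha^0$, and all function names are assumptions made for illustration.

```python
PRIM_POLY = 0x11D  # assumed GF(2^8) field polynomial: x^8 + x^4 + x^3 + x^2 + 1

EXP, LOG = [0] * 510, [0] * 256
value = 1
for power in range(255):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & 0x100:
        value ^= PRIM_POLY
for power in range(255, 510):
    EXP[power] = EXP[power - 255]


def gf_mul(a, b):
    """Multiply two GF(2^8) elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def alpha(e):
    """Return alpha**e with the exponent reduced modulo 255."""
    return EXP[e % 255]


def poly_eval(coeffs, x):
    """Evaluate a polynomial (lowest-order coefficient first) by Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = gf_mul(result, x) ^ c
    return result


def error_syndromes(errors, t):
    """Syndromes of an error pattern {position: value}; the codeword part contributes zero."""
    return [
        poly_eval([0], 0) ^ _sum(gf_mul(v, alpha(i * p)) for p, v in errors.items())
        for i in range(2 * t)
    ]


def _sum(terms):
    """XOR-accumulate GF(2^8) terms."""
    total = 0
    for term in terms:
        total ^= term
    return total


def ribm(syndromes, t):
    """Run 2t RiBM iterations; return (lambda, omega) coefficient lists up to a common scale."""
    delta = list(syndromes) + [0] * t + [1]  # delta_0 .. delta_3t
    theta = list(delta)
    gamma, k = 1, 0
    for _ in range(2 * t):
        d0 = delta[0]
        shifted = delta[1:] + [0]  # delta_{i+1}(r), with delta_{3t+1}(r) = 0
        new_delta = [gf_mul(gamma, shifted[i]) ^ gf_mul(d0, theta[i])
                     for i in range(3 * t + 1)]
        if d0 != 0 and k >= 0:
            theta, gamma, k = shifted, d0, -k - 1
        else:
            k += 1
        delta = new_delta
    return delta[t:2 * t + 1], delta[:t]


def chien_positions(lam):
    """Positions p in 0..254 for which alpha^(-p) is a root of lambda(x)."""
    return [p for p in range(255) if poly_eval(lam, alpha(-p)) == 0]


errors = {3: 0x1F, 100: 0x80, 254: 0x01}  # hypothetical error pattern: position -> value
lam, omega = ribm(error_syndromes(errors, t=8), t=8)
assert chien_positions(lam) == sorted(errors)
```

The returned $\lambda(x)$ is a scalar multiple of the true error locator, so its roots still identify the error positions. In RiBM the array `delta` has $3t+1$ entries, one per PE; the TiBM architecture described in the text reaches the same outputs with $2t+2$ PEs.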
The pTiBM architecture consists of PE1s, PE2s, and Control Units 1 and 2. Because $t-1$ PE1s are removed, control circuits are needed to adjust MUX(1) and MUX(2) in the PE2s and to propagate $\delta_i(r)$ and $\theta_i(r)$ correctly. Control Unit 1 generates control signals such as MC(r), $\gamma(r)$, and $\delta_0(r)$. Control Unit 2 generates the selection signals of MUX(1) and MUX(2) in the PE2s and can be implemented as a finite state machine (FSM). Each selection signal of the 9 MUX(1)s is represented by 2 bits, with values 0 (00), 1 (01), and 2 (10), so the selection signals of the 9 MUX(1)s total 18 bits. Each selection signal of the 9 MUX(2)s is represented by 1 bit, either 0 or 1, for a total of 9 bits. Therefore, the total selection signal for the MUX(1)s and MUX(2)s is 27 bits, as shown in Fig. 3. The FSM starts its operation with a reset signal, and the input w repeats periodically as 0, x, 0, x, 0, x, 0, x, 0, x, 0, x, 0, x, 1, where x is don't care.
Fig. 3 Proposed pTiBM architecture and its sub-blocks such as original PEs, modified PE2s, and control units.
MUX signal Gen. 1 and MUX signal Gen. 2 generate the 27-bit selection signals. MUX signal Gen. 1 can be generated by concatenating 18 bits for MUX(1) and 9 bits for MUX(2). The former 18 bits shift right every 2 clock cycles, and '2' is inserted at the far left, as shown for Control Unit 2 in Fig. 3. Likewise, the latter 9 bits shift right every 2 clock cycles, and '1' is inserted at the far left. For instance, the 27-bit initial selection signals (2, 2, 0, 1, 1, 1, 1, 1, 1 and 1, 1, 0 … When MUX(1) and MUX(2) are adjusted using this method, the error locator polynomial $\lambda(x)$ and the error evaluator polynomial $\omega(x)$ can be obtained correctly using only $2t+2$ PEs after $2t$ iterations. The PE architecture consists of 3-stage pipelined GF multipliers, adders, and D-FFs. The critical path delay of the proposed KES block is $2T_{xor} + T_{ff}$.
Fig. 4 Pipelined three-parallel Chien search block and cell.
Pipelined Three-Parallel CSEE Block
The CSEE block finds the error locations and error values. Figure 4 shows the three-parallel Chien search block and its cells. The Forney algorithm block has almost the same structure as the Chien search block, except that the C8 cell is eliminated. The dotted line in Fig. 4 is a cutline for pipelining. With this pipelining, the critical path delay of the Chien search block is reduced from $7T_{xor} + T_{mux} + T_{ff}$ to $3T_{xor} + T_{mux} + T_{ff}$. Details of the parallel Chien search block are given in [11].
16-Channel Time-Multiplexing RS-Based FEC Architecture
Figure 5 shows the proposed 16-channel time-multiplexing RS-based FEC architecture, which is made up of four-channel three-parallel RS decoders. The syndrome computation block provides the $2t$ syndromes required for the syndrome polynomial after 85 clock cycles. Since four syndrome computation blocks are connected to only one KES block, the syndrome values are entered into the KES block alternately. The KES block outputs four error locator polynomials $\lambda(x)$ and four error evaluator polynomials $\omega(x)$ in parallel after 64 clock cycles. Finally, a CSEE block completes the error correction. Most conventional high-speed RS decoders have used the ME algorithm in the KES block, because the ME algorithm can be easily implemented as a fully pipelined systolic-array structure. However, the systolic-array ME architecture has very high hardware complexity compared with the BM architecture. In general, the BM algorithm is difficult to pipeline because of its feedback loops. If many channels share the TiBM architecture, however, pipelining can be used efficiently together with a time-multiplexing method. Therefore, the proposed pTiBM architecture is able to process a maximum of four independent sets of syndrome values, because the iteration period for obtaining $\lambda(x)$ and $\omega(x)$ in the KES block is 16 clock cycles and the syndrome computation block uses 85 clock cycles for its computation.
Figures 6(a) and (b) show the timing charts of four independent sets of syndrome values for the conventional ME architecture and for the proposed pTiBM architecture using time-multiplexing. The proposed pTiBM block is initialized with four independent sets of syndrome values during 4 clock cycles, as shown in Fig. 6(b). After 60 clock cycles, the computation of the pTiBM architecture is completed, and the outputs $\lambda(x)$ and $\omega(x)$ are generated during clock cycles 61 to 64.
In the pTiBM architecture, a total of 18 processing elements (PEs) are connected serially, and every PE receives the values $\delta_0$ and $\gamma$ and the MC control signal from a control unit. After 64 clock cycles, the D-FFs in PE$_0$ to PE$_7$ hold the four independent sets of $\omega(x)$ values, and those in PE$_8$ to PE$_{16}$ hold the corresponding $\lambda(x)$ values. Figure 7 shows a timing chart of the proposed 4-channel RS decoder. This architecture has a latency of 161 clock cycles. The syndrome computation block uses 85 clock cycles because of its three-parallel architecture, and the KES block uses 64 clock cycles with the time-multiplexing method. The rest of the latency is delay used to adjust the timing sequence.
Fig. 6 Timing chart of (a) conventional ME architecture [6], and (b) proposed pTiBM architecture using time-multiplexing for 4-channel RS decoder architecture.
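The cycle counts above can be tabulated with a short sketch. It assumes a simple round-robin order of the four channels through the shared KES block, which is one reading of the description of Fig. 6(b) and not a statement of the actual control logic; the function and constant names are hypothetical.

```python
def kes_schedule(channels=4, iterations=16):
    """Return the (channel, iteration) pair served in each KES clock cycle.

    Assumes round-robin interleaving: channel = cycle mod channels.
    """
    return [(cycle % channels, cycle // channels)
            for cycle in range(channels * iterations)]


SYNDROME_CYCLES = 85   # 255 symbols processed three per cycle
KES_CYCLES = 4 * 16    # four channels, 2t = 16 iterations each
TOTAL_LATENCY = 161    # total latency in clock cycles stated in the text

schedule = kes_schedule()
# Cycles left over for aligning the timing sequence.
alignment_cycles = TOTAL_LATENCY - SYNDROME_CYCLES - len(schedule)

print(schedule[:5])       # first five slots of the interleaved schedule
print(alignment_cycles)   # remaining cycles after syndrome and KES stages
```

Under this interleaving, successive iterations of the same channel are four cycles apart, so a result fed back to the next iteration of that channel may spend several cycles in pipelined GF multipliers without stalling. The 64 KES cycles also fit within the 85 cycles the syndrome blocks need for the next set of codewords.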
Result and Comparison
The proposed 16-channel time-multiplexing RS-based FEC architecture and the conventional architectures [5], [6] were modeled in Verilog HDL and simulated to verify their functionality. After the functionality of the design was fully verified, it was synthesized using appropriate timing and area constraints. Both simulation and synthesis were carried out using Synopsys design tools and a 90-nm CMOS technology optimized for a 1.2 V supply voltage. For a fair comparison, the conventional RS decoders in [5], [6] were synthesized using the same 90-nm CMOS technology. Table 1 shows the critical path of each sub-block for the proposed and conventional decoder architectures. As shown in Table 1, the critical path delay of the proposed architecture is reduced significantly. Table 2 shows the implementation results of the proposed 16-channel time-multiplexing RS-based FEC architecture and other existing RS-based FEC architectures. The total number of gates for the proposed architecture is 417,600 according to the synthesis results (excluding the FIFO memory), and the clock frequency is 625 MHz. The proposed time-multiplexing architecture has a higher throughput rate and lower hardware complexity than the parallel architectures in [4]-[6].
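The reported throughput is consistent with a simple count of bits per clock cycle. As a sketch, assuming each of the 16 channels accepts three 8-bit RS(255,239) symbols per cycle at the 625 MHz clock quoted above, and using the 161-cycle latency from Sect. 3:

```latex
\begin{align}
\text{Throughput} &= 16 \times 3 \times 8\,\text{bit} \times 625\,\text{MHz} = 240\,\text{Gb/s}, \\
\text{Latency} &= \frac{161\ \text{cycles}}{625\,\text{MHz}} = 257.6\,\text{ns}.
\end{align}
```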
Compared with the design in [3], the proposed design can operate much faster with comparable hardware requirements. Note that the proposed architecture uses a highly pipelined GF multiplier, whereas the design in [3] cannot use a pipelined GF multiplier in its KES block. As a result, the proposed time-multiplexing RS-based FEC architecture has a higher throughput rate, lower hardware complexity, and lower latency than previous architectures.
Conclusion
This paper presented a high-speed, low-complexity VLSI architecture of a 16-channel time-multiplexing RS-based FEC for next-generation short-reach optical communication applications. The three-parallel processing for syndrome computation and error correction allows the inputs to be received at very high fiber-optic rates and the outputs to be delivered at correspondingly high rates with minimum delay. High speed and high throughput are achieved by employing a three-parallel pipelined processing technique and a modified syndrome computation block. In particular, the syndrome computation block is reformulated for pipelining to obtain a high clock speed. The time-multiplexing method for resource sharing of the pTiBM architecture is used in the parallel RS decoder to reduce hardware complexity. As a result, the proposed RS-based FEC architecture has a much higher throughput rate and lower hardware complexity than conventional RS-based FEC architectures. The proposed architecture has potential applications in RS-based FEC devices for short-reach optical communications with data rates of 100 Gb/s and beyond.
