The decoder is implemented and verified on an FPGA and runs at 178 MHz on a Virtex2P device. Thus an 8-channel FPGA implementation can be used for 10 Gbps satellite communication systems. The decoder is also synthesized in Chartered 90 nm CMOS technology and compared with previous decoders. The results show that it is more area-efficient than previous decoders. In this CMOS technology the decoder can be clocked at about 1350 MHz, so a single-channel ASIC implementation can meet the requirement of 10 Gbps satellite communication.
Introduction
Reed-Solomon (RS) codes are widely used in modern communication systems, such as satellite communication systems (DVB-S), star-earth link systems (ShiJian-V), and optical communication systems (10GBase-LR) [1, 2]. RS (244,212) is recommended by CAST for the wideband multimedia satellite communication system. Because of the growing demand for multi-channel high-definition multimedia applications in wideband satellite communication systems, a high-speed RS decoder with low hardware cost is urgently needed.
General RS decoders consist of Syndrome Computation (SC), Key Equation Solver (KES), and Chien Search & Error Correction (CSEC) blocks. The KES block, which is considered the most complex block, occupies 60%-80% of the area of the whole decoder [1]. Many implementations of KES have been proposed to reduce the VLSI area. They fall into two categories: Berlekamp-Massey (BM) based implementations and modified Euclidean (ME) based implementations. In this paper, a new scheme named TD-iBM is proposed based on iBM. It adopts time division scheduling to increase the utilization of the Galois Field (GF) multipliers and reduce complexity. We present an area-efficient, pipeline-balanced RS decoder for 10 Gbps satellite communication applications based on TD-iBM, and implement it on a Virtex2P FPGA and in Chartered 90 nm CMOS technology. The proposed decoder also introduces techniques for reducing the complexity of the SC and CSEC blocks.
2 Proposed architecture and timing
2.1 Fixed-Factor Syndrome Computation block
The SC block calculates 2t syndromes from the received stream, where t denotes the error correction capacity [1]. It can be implemented fully parallel or partly parallel. In a P-parallel implementation, which instantiates P sets of GF adders and multipliers, the processing time of one codeword, T_cycle, is 2t × n/P cycles, where n is the code length. Since decoding speed is one of the main optimization goals of our design, the fully parallel (32-parallel) implementation is adopted. The upper part of Fig. 1 (a) shows the implementation detail of the fully parallel SC block. With t = 16, n = 244, and P = 32, T_cycle is 244.
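The fully parallel syndrome computation described above can be sketched in software as one Horner step per received symbol. This is a minimal illustration and not the paper's HDL: the field GF(2^8) with primitive polynomial 0x11D, the element α = 0x02, and the root set α^0 to α^(2t−1) are assumptions, since the text states none of them. The helper names are hypothetical.

```python
# Sketch only. Assumed (not stated in the text): GF(2^8), primitive
# polynomial 0x11D, alpha = 0x02, syndrome roots alpha^0 .. alpha^(2t-1).
PRIM_POLY = 0x11D
ALPHA = 0x02
T = 16        # error correction capacity t (from the text)
N_CODE = 244  # code length n (from the text)

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def gf_pow(a, e):
    """Raise a GF(2^8) element to a non-negative integer power."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def syndromes(received):
    """Compute 2t syndromes, consuming one received symbol per step.

    Each register is updated as s_i = s_i * root_i + r, i.e. one
    constant-operand multiplier and one adder per syndrome.
    """
    roots = [gf_pow(ALPHA, i) for i in range(2 * T)]
    s = [0] * (2 * T)
    for r in received:
        s = [gf_mul(s_i, root) ^ r for s_i, root in zip(s, roots)]
    return s

def sc_cycles(t, n, p):
    """Processing time 2t * n / P from the text for a P-parallel SC block."""
    return 2 * t * n // p
```

With P = 2t = 32 the cycle count equals the code length, which matches the 244 cycles quoted in the text.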
Since the primitive element α is fixed in our design, one operand of each multiplier in the SC block is constant, so these multipliers can be simplified into single-operand Fixed-Factor (FF) multipliers. Fig. 1 (a) shows the detail of one type of FF multiplier, which is used for calculating syndrome s_0. FF multipliers are designed by removing the logic of the general multiplier that becomes redundant when one operand is fixed. Experiments show that the FF multiplier detailed in Fig. 1 (a) costs 3 slices, whereas a general multiplier costs 30 slices. Even the most complex FF multiplier in the design occupies only 16.7% of the hardware of a general multiplier, so the FF scheme significantly reduces the hardware cost.
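To illustrate why fixing one operand removes logic, the sketch below expands multiplication by a constant into a fixed XOR network: because the map x → c·x is linear over GF(2), each output bit is the XOR of a fixed subset of input bits. The field polynomial 0x11D and all function names are assumptions for illustration. The XOR counts it reports are unshared two-input XOR estimates and are not the slice figures given in the text.

```python
# Sketch only. Assumed (not stated in the text): GF(2^8) with
# primitive polynomial 0x11D. Function names are hypothetical.
PRIM_POLY = 0x11D

def gf_mul(a, b):
    """General GF(2^8) multiplication (both operands variable)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def fixed_factor_rows(c):
    """Return 8 bit masks; output bit j is the XOR of the input bits in rows[j]."""
    # Column i of the linear map is c * x^i; transpose columns into row masks.
    cols = [gf_mul(c, 1 << i) for i in range(8)]
    rows = []
    for j in range(8):
        mask = 0
        for i in range(8):
            if (cols[i] >> j) & 1:
                mask |= 1 << i
        rows.append(mask)
    return rows

def apply_fixed_factor(rows, x):
    """Multiply x by the constant using only the precomputed XOR selections."""
    y = 0
    for j, mask in enumerate(rows):
        # In hardware the mask is just wiring; only the parity (XOR) costs gates.
        parity = bin(x & mask).count("1") & 1
        y |= parity << j
    return y

def xor_count(rows):
    """Two-input XOR gates without sharing: (selected inputs - 1) per output bit."""
    return sum(max(bin(m).count("1") - 1, 0) for m in rows)
```

Under the assumed polynomial, multiplying by α = 0x02 needs only three two-input XORs and multiplying by 1 needs none, which shows the direction of the saving even though the absolute numbers depend on the target device.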
TD-iBM key equation solver block
The iBM algorithm is used to solve the key equation ω(x) ≡ σ(x)S(x) mod x^{2t}, where σ(x) and ω(x) are the error locator polynomial and the error evaluator polynomial. A general iBM implementation uses t + 1 multipliers in the KES block, so the hardware overhead is high. Moreover, the processing latency of the KES block is then far less than that of the SC block, so the macro-pipeline between KES and SC is not balanced. The TD-iBM algorithm is a variation of the iBM algorithm that reduces hardware cost and supports pipeline balancing. Time Division (TD) scheduling is adopted to implement the KES block with ⌈(t + 1)/Γ⌉ multipliers, where Γ is the TD scheduling factor. As Γ increases, the hardware cost of the KES block decreases, while the processing latency increases.
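The time-division idea can be sketched as a GF inner product computed in Γ passes over a smaller bank of multipliers. This is only an illustration of the scheduling principle, not the paper's datapath: the field polynomial 0x11D, the function names, and the rounding of (t + 1)/Γ up to an integer are assumptions (the rounding agrees with the 9 multipliers the text reports for t = 16 and Γ = 2).

```python
import math

# Sketch only. Assumed (not stated in the text): GF(2^8) with
# primitive polynomial 0x11D. Function names are hypothetical.
PRIM_POLY = 0x11D

def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result

def td_inner_product(sigma, s_aligned, gamma):
    """GF sum of sigma[n] * s_aligned[n], computed in gamma passes.

    Returns the accumulated value and the number of multipliers used.
    """
    n_mult = math.ceil(len(sigma) / gamma)   # multipliers instantiated
    pad = n_mult * gamma - len(sigma)        # zero-fill the last subset
    sigma = list(sigma) + [0] * pad
    s_aligned = list(s_aligned) + [0] * pad
    acc = 0
    for g in range(gamma):                   # one subset per pass
        base = g * n_mult
        for k in range(n_mult):              # these products run in parallel in hardware
            acc ^= gf_mul(sigma[base + k], s_aligned[base + k])
    return acc, n_mult
```

For 17 operand pairs (t + 1 with t = 16), Γ = 1 uses 17 multipliers in one pass, while Γ = 2 uses 9 multipliers in two passes and produces the same sum.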
Algorithm 1: TD-iBM Algorithm
Input: syndrome set S: [...]
[...], all others are set to zero;
 1: for j := 0 to 2t − 1 do
 2:   Realign S to generate S*; for each s*_n ∈ S*, s*_n = s_{j−n};
 3:   Divide the sets S* and Σ_j into Γ subsets [...]
 4:   [...], where the summation is the GF sum;
 5:   [...], where the summation is the GF sum;
      [...]
      for each subset Σ_i^γ and Σ_j^γ
 9:     [...]
        [...] N*γ+n = delta_{γ,n};
12:   for m := 0 to t do
13:     [...]
14:   end for
15: end for
16: for i := 0 to t − 1 do
17:   Realign S to generate S**; for each s**_n ∈ S**, s**_n = s_{i−n};
18:   Divide the sets S** and Σ into Γ subsets S [...]
      [...], where the summation is the GF sum;
21: end for
The TD-iBM algorithm is shown above. The symbols '⊗', '⊕', and ' ' denote the GF multiplier, adder, and subtracter. In the TD-iBM algorithm, σ(x) and ω(x) are calculated in sequence: the first 15 lines calculate σ(x), and the remaining lines calculate ω(x) from the σ(x) results. TD-iBM employs TD scheduling in three operation sections: lines 2-5, lines 7-11, and lines 17-20. Taking the first section as an example, the two operand sets Σ and S* are each divided into Γ subsets of N elements (N = ⌈(t + 1)/Γ⌉; the last subset may have fewer than N elements, in which case it is padded with zeros), and the subsets complete operation (4) in sequence. Without TD scheduling, t + 1 multipliers are needed to generate d_i in full parallel. With Γ-factor TD scheduling, d_i is generated in Γ cycles, so the number of multipliers is reduced to ⌈(t + 1)/Γ⌉. According to Algorithm 1, the processing latency is 2 × t × 3 × Γ + t × t/⌈(t + 1)/Γ⌉ cycles. When the TD-iBM algorithm is applied to different applications, Γ should be selected carefully to balance processing latency against hardware cost.
Fig. 1 (b) shows the main datapath of the TD-iBM architecture. The TD factor Γ is set to 2 in the proposed decoder. The main components of the TD-iBM architecture are 9 multipliers and the related multiplexers. The variables in TD-iBM first pass through the Inner Logic Scheduling MUXs, which are implemented as MUX4, then through the TD Scheduling MUXs, and are then sent to the input registers of the GF multipliers. The GF multiplier results are used to generate the new variables for the next cycle. The generation circuits for D_i, D_j, the σ set, and the σ delay set are also shown in Fig. 1 (b). Since the datapath of the ω set generation circuit is similar to that of the σ set, it is omitted from Fig. 1 (b). According to lines 2 and 17 of the TD-iBM algorithm, one shift register is used to prepare the syndrome set for the d_i calculation.
The 'LoadEnable' ports of all the D-latches and Shift Register are controlled by the decoding FSM according to the processing timing.
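The trade-off between multiplier count and latency can be tabulated from the expressions quoted in this section. The sketch below is a rough calculator, not a cycle-accurate model: it assumes that both divisions are rounded up to whole numbers (the rounding is an assumption, chosen so that t = 16 and Γ = 2 give the 9 multipliers reported in the text), and the function names are hypothetical.

```python
import math

def kes_multipliers(t, gamma):
    """Multiplier count (t + 1) / gamma, rounded up (rounding is assumed)."""
    return math.ceil((t + 1) / gamma)

def kes_latency(t, gamma):
    """Latency 2*t*3*gamma + t * t / multipliers, second term rounded up per pass."""
    sigma_cycles = 2 * t * 3 * gamma
    omega_cycles = t * math.ceil(t / kes_multipliers(t, gamma))
    return sigma_cycles + omega_cycles

if __name__ == "__main__":
    t = 16
    for gamma in (1, 2, 3, 4):
        print(gamma, kes_multipliers(t, gamma), kes_latency(t, gamma))
```

Under these assumptions, Γ = 2 gives 224 cycles, which stays below the 244-cycle SC processing time, whereas Γ = 3 gives 336 cycles and would exceed it. This is consistent with the choice of Γ = 2, although the text does not give this derivation.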
Fixed-Factor Chien search block
The architecture of the CSEC block is shown in Fig. 1 (a). In this block, the Chien search algorithm and the Forney algorithm are used to calculate the error locations and values. As in the SC block, all multipliers are implemented as FF multipliers. The upper part of the CSEC block, containing 17 FF multipliers, performs the Chien search, and the lower part, containing 16 FF multipliers, performs error correction. According to the algorithm, 212 cycles are needed to complete the operations in the CSEC block.
To compare the proposed decoder with previous decoders, the HDL model was synthesized in Chartered 90 nm CMOS technology with Synopsys Design Compiler. Design Compiler reports an area of 80696 µm² without the FIFO block, which is equivalent to 28591 NAND2 gates. The timing report shows a maximum clock frequency of 1350 MHz. Table I compares the area efficiency of the related decoders; C.T. is short for Code Type. None of the values in the Gates column include the FIFO block. The Throughput column is given in bits per cycle, which is fairer than Gbps because it removes the gain due to the CMOS technology. Gate count is used for comparison to remove the variation of CMOS technology between decoders. Moreover, since the code type used in this paper is not common, RS (255,239) decoders are used for comparison. The main difference between RS (244,212) and RS (255,239) is the parameter t, which affects both the complexity of the decoder and the error correction capacity.
To compare these decoders fairly, the index 1024 × Throughput × C.T./Gates is used to denote hardware efficiency. This index accounts for throughput, VLSI cost, and error correction capacity, while removing the influence of the CMOS technology. The efficiency of the proposed decoder is 4.58, which is higher than that of the other decoders compared. This is mainly due to the FF multiplier implementation and the proposed TD-iBM architecture, which improves the utilization of the GF multipliers.
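As a sanity check, the efficiency index and the line-rate claims can be reproduced with a few lines of arithmetic. The inputs below are partly assumptions: a throughput of 8 bits per cycle (one 8-bit symbol per clock) and a C.T. factor equal to t = 16 are not stated explicitly in the text, but together with the reported 28591 gates they reproduce the quoted value of 4.58. The function names are hypothetical.

```python
def efficiency_index(throughput_bits_per_cycle, ct_factor, gates):
    """Index 1024 * Throughput * C.T. / Gates from the text."""
    return 1024 * throughput_bits_per_cycle * ct_factor / gates

def line_rate_gbps(clock_mhz, bits_per_cycle, channels=1):
    """Aggregate data rate in Gbps for a number of parallel decoder channels."""
    return clock_mhz * bits_per_cycle * channels / 1000.0

if __name__ == "__main__":
    # Assumed: 8 bits per cycle and C.T. factor t = 16; gates from the text.
    print(round(efficiency_index(8, 16, 28591), 2))
    # Clock frequencies from the text; 8 bits per cycle is assumed.
    print(line_rate_gbps(1350, 8))              # single-channel ASIC
    print(line_rate_gbps(178, 8, channels=8))   # 8-channel FPGA
```

With the same 8-bit-per-cycle assumption, a single 1350 MHz channel and eight 178 MHz channels both come out above 10 Gbps, in line with the feasibility statements in the text.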
Conclusions
This paper proposed an area-efficient Reed-Solomon decoder for 10 Gbps satellite communication. The TD-iBM algorithm and architecture were introduced to reduce the VLSI area and balance the pipeline. Fixed-Factor multipliers were also used to reduce the hardware cost. The decoder architecture was implemented on a Xilinx FPGA and in Chartered 90 nm CMOS technology. The results show that the proposed decoder is more area-efficient than previous decoders, and that either an 8-channel FPGA implementation or a single-channel ASIC implementation is feasible for a 10 Gbps satellite communication system.
