INTRODUCTION
The theory of error detection and correction codes deals with the reliable transmission and storage of data over unreliable communication channels. Error correction coding is the process of adding parity bits to the message bits, producing a longer block that is usually called a codeword. When the codeword is received at the destination, it is decoded to retrieve the original message bits. The BCH (Bose-Chaudhuri-Hocquenghem) code is one of the most powerful algebraic codes. It is used extensively in modern digital communication systems owing to its efficient error correction ability and its suitability for high-speed hardware implementation. Compared to RS (Reed-Solomon) codes, BCH codes can achieve about 0.6 dB of additional coding gain over the AWGN channel [1]. The conventional BCH encoder is implemented with an LFSR (linear feedback shift register) architecture. Since this architecture is based on a single feedback loop, it can be operated at a high clock frequency. However, the major drawback of the LFSR-based BCH encoder is that it operates in a bit-serial manner, processing only one message bit per clock cycle, which is a barrier to high throughput. Where high throughput is required, the clock frequency of such LFSR-based encoders cannot keep up with the data transmission rate, and thus parallel processing must be employed [2] [3] [4] [5] [6].
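The bit-serial LFSR encoding described above can be sketched in software as follows. This is only an illustrative model, not the hardware design of this paper. It assumes the generator polynomial commonly tabulated for the (31,16) triple-error-correcting BCH code (octal 107657), which may differ from the polynomial used elsewhere in this paper; the names `lfsr_step`, `encode_serial` and `poly_remainder` are hypothetical.

```python
# Assumption: g(x) = x^15+x^11+x^10+x^9+x^8+x^7+x^5+x^3+x^2+x+1 (octal 107657),
# the commonly tabulated generator of the (31,16) t=3 BCH code.
G_EXPONENTS = (15, 11, 10, 9, 8, 7, 5, 3, 2, 1, 0)
N, K = 31, 16
R = N - K  # 15 parity bits, one LFSR register per bit

# TAPS[i] is the coefficient of x^i in g(x) for i = 0..R-1 (x^R is implicit).
TAPS = [1 if i in G_EXPONENTS else 0 for i in range(R)]


def lfsr_step(state, bit):
    """One clock cycle of the serial encoder; state[i] is the x^i register."""
    feedback = bit ^ state[R - 1]          # the single feedback loop
    new_state = [0] * R
    new_state[0] = feedback & TAPS[0]
    for i in range(1, R):
        new_state[i] = state[i - 1] ^ (feedback & TAPS[i])
    return new_state


def encode_serial(message):
    """Systematic encoding: K message bits (highest degree first) -> N bits."""
    state = [0] * R
    for bit in message:                    # one message bit per clock cycle
        state = lfsr_step(state, bit)
    parity = state[::-1]                   # highest-degree parity bit first
    return list(message) + parity


def poly_remainder(bits):
    """Remainder of bits(x) divided by g(x); bits are highest degree first."""
    g = sum(1 << e for e in G_EXPONENTS)
    value = int("".join(str(b) for b in bits), 2)
    while value.bit_length() > R:
        value ^= g << (value.bit_length() - 1 - R)
    return value
```

A valid codeword leaves a zero remainder when divided by g(x), so `poly_remainder(encode_serial(message))` should be 0 for any 16-bit message. The loop in `encode_serial` runs once per message bit, which is the throughput limit that motivates parallel encoding.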
Several parallel BCH encoding methods have been introduced earlier, such as matrix multiplication, the unfolding method, and CRT-based encoding [2] [3] [4]. With matrix multiplication, the BCH encoder becomes very complex once the parallel factor exceeds the number of registers in the circuit. In the unfolding method, the generator polynomial has to be modified to obtain higher throughput, which incurs area overhead. The difficulty with the CRT method is its limited flexibility in parallelization. In a parallel BCH encoder circuit, p bits of data are processed at a time, so the total number of clock cycles is reduced by a factor of p. However, the throughput does not increase by a factor of p, because the critical path becomes longer as the parallel factor p increases. Therefore, a parallel encoder with high speed and small area overhead is essential. The flexibility to increase or decrease the parallel factor p to match the desired throughput is also important.
In this paper, we present a new systolic array BCH encoder that supports several parallel factors p. In addition, the proposed BCH encoder introduces a tree-type structure to reduce the delay of the critical path. The proposed systolic array encoder offers great flexibility in parallelization without added complexity, together with high throughput. It improves performance compared to its counterparts without any significant area increase. In addition, the optimized tree-type structure significantly increases the throughput at the same parallel factor. As an example, we have implemented a (31,16) triple-error-correcting binary BCH code, whose error detecting capability is similar to that of CRC-16. Synthesis and simulation results for several parallel factors p are presented for the original encoder and the optimized systolic array, using VHDL on an FPGA.
RALLEL BCH
e, a k bit messa ts of k bit mes d is defined as n-1) as the co-e k bit message 
e entered into a g operation [9] . derived by the ion for X t+p can
, 
From Eq. (7), the circuit can process p bits in parallel, changing the state from X_t to X_{t+p} in a single clock cycle. The generator polynomial of the eight-parallel encoder for the (31,16,3) BCH code is given in Eq. (2). For this generator polynomial, Eq. (7) represents the stages of the eight-parallel systolic array BCH encoder. Let X_0(t+8) to X_14(t+8) represent the values of the registers at time (t+8), and let z_0(t) to z_7(t) represent the eight parallel input bits at time t [8]. The values of the registers after the eight parallel inputs have been processed in the systolic array BCH encoder are obtained as follows:
U_14 = z_0(t) + X_14(t)
U_13 = z_1(t) + X_13(t)
U_12 = z_2(t) + X_12(t)
U_11 = z_3(t) + X_11(t)
U_10 = z_4(t) + X_10(t)
U_9 = z_5(t) + X_9(t)
U_8 = z_6(t) + X_8(t)
U_7 = z_7(t) + X_7(t)
X_14(t+8) = X_6 + 2U_14 + U_13 + U_12 + U_11 + U_10
X_13(t+8) = X_5 + 2U_14 + 2U_13 + U_12 + U_11 + U_10 + U_9
X_12(t+8) = X_4 + 4U_14 + 2U_13 + 2U_12 + U_11 + U_10 + U_9 + U_8
X_11(t+8) = X_3 + 4U_14 + 4U_13 + 2U_12 + 2U_11 + U_10 + U_9 + U_8 + U_7
Constructing a systolic array BCH encoder is simple and follows from Eq. (11), which traces the path of a node from one register to another through the different stages. A node is identified by the degree of its register (the column) and by the position of its parallel stage (the row). First, the conventional serial BCH encoder is drawn as the first stage. The XOR gates keep the same positions as in the first stage, and the stage is repeated up to stage p-1.
...,GT
= + (11)
where a is the degree of the register, b is the parallel stage, and r is the number of registers between two XOR gates. For example, to find the path from the node X_11[0], with a = 11, b = 0 and r = 3, Eq. (11) gives the node X'_14[3]. The first, unnamed node, which is the feedback of the rightmost XOR gate, is always shifted by one to the next stage. In the result, if b' > b, then b'-b is the number of nodes to be shifted. The output of the rightmost XOR gate is an input to all the other XOR gates, in the same manner as in the serial BCH encoder. Because the output of stage p-1 is the input to the registers, its path does not need to be traced.
Instead of the single message bit accepted by the serial BCH encoder, the encoder in our example receives one byte of message bits in parallel. This allows several bits to be encoded simultaneously and at a higher speed than comparable designs. Because the stages are replicas of the first stage, their number can be increased or decreased to any parallel factor without an increase in hardware complexity. The key reason for this flexibility is that the stages can be replicated for any parallel factor. However, the output of stage p-1 must be the input to the first stage.
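The idea that a p-parallel encoder is built from p replicas of the serial stage can be checked with a small functional model. The sketch below only models the input/output behaviour: it chains p copies of one serial stage inside a single clock cycle and feeds the output of the last stage back to the registers. It does not reproduce the node wiring of Eq. (11). The generator polynomial (octal 107657) and the names `serial_stage`, `parallel_step` and `encode_parallel` are assumptions made for illustration.

```python
# Assumption: commonly tabulated (31,16) t=3 BCH generator, octal 107657.
G_EXPONENTS = (15, 11, 10, 9, 8, 7, 5, 3, 2, 1, 0)
R = 15  # number of registers X_0 .. X_14
TAPS = [1 if i in G_EXPONENTS else 0 for i in range(R)]


def serial_stage(state, bit):
    """One stage: identical to one clock cycle of the serial LFSR encoder."""
    feedback = bit ^ state[R - 1]
    return [feedback & TAPS[0]] + [
        state[i - 1] ^ (feedback & TAPS[i]) for i in range(1, R)
    ]


def parallel_step(state, bits):
    """One clock cycle of a p-parallel encoder, p = len(bits).

    The p stages are replicas of the first stage chained combinationally;
    the output of the last stage becomes the next register contents.
    """
    for bit in bits:
        state = serial_stage(state, bit)
    return state


def encode_parallel(message, p):
    """Return (parity registers, clock cycles used) for a parallel factor p."""
    if len(message) % p != 0:
        raise ValueError("message length must be a multiple of p")
    state = [0] * R
    cycles = 0
    for start in range(0, len(message), p):
        state = parallel_step(state, message[start:start + p])
        cycles += 1
    return state, cycles
```

For a 16-bit message, `encode_parallel(message, 8)` returns the same register contents as `encode_parallel(message, 1)` but uses 2 clock cycles instead of 16. Changing p only changes how many times `serial_stage` is chained, which mirrors the flexibility argument above; the price is a longer combinational path per clock cycle.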
In this paper, we focus on the critical path delay of the systolic array BCH encoder. Since the circuit operates as a loop, its speed is limited by the critical path delay. In the 8-parallel systolic array BCH encoder, the longest critical path is 7T, where T is the delay of one XOR gate. The calculation of the critical path from register X_2 to register X_10 is shown in Fig. 3.
During the XOR operation, the signal S(0) is obtained from the value of the register X_2 and the values of Z_0 and X_14, which gives a critical path of 2T. Similarly, the signal S(0) is passed to the next stage, where it is combined with the values of Z_2 and X_12 to create a new signal S(1), and the critical path grows to 3T. The overall data delay path for the longest critical path for the signal S (5) ath from registe
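The benefit of a tree-type arrangement can be illustrated with an idealized delay count. The sketch below compares a linear chain of two-input XOR gates with a balanced tree computing the same sum, counting the delay in units of T (one XOR gate) and ignoring fan-out and wiring. It is a generic model, not the exact structure of Fig. 3, and the function names are hypothetical.

```python
from functools import reduce


def chain_depth(n_inputs):
    """XOR delay (in units of T) when n inputs are combined one after another."""
    return max(n_inputs - 1, 0)


def tree_depth(n_inputs):
    """XOR delay (in units of T) of a balanced binary tree over n inputs."""
    return (n_inputs - 1).bit_length() if n_inputs > 0 else 0


def xor_chain(bits):
    """Functional result of the chained XOR."""
    return reduce(lambda a, b: a ^ b, bits)


def xor_tree(bits):
    """Functional result of the tree XOR, plus the number of levels used."""
    level = list(bits)
    depth = 0
    while len(level) > 1:
        # Pair up neighbouring signals; an unpaired signal passes through.
        paired = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
        depth += 1
    return level[0], depth
```

Under this model, combining eight signals takes 7T in a chain and 3T in a balanced tree, while both arrangements produce the same output bit. The actual gain in the encoder depends on which signals of the critical path can be regrouped.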
PERFORMANCE RESULTS
ncoder is synthe ith a device X mplemented w egree 15 and 9 n entional BCH e BCH encoder and Figure 4 sh uts of bit-serial systolic array Fig. 3 
esized on the X XC2VP30 [11] . with the gene non-zero terms encoder in bit-s employing sys how the compa (p=1), 4-paralle y BCH enco allel BCH enc nventional bit-s complexity of 38% compared own in Fig. 4 increased by al BCH enc el BCH encoder parallel counter t is quite stee n we compared o 16-parallel ughput has incre s), whereas are still worthwhi er with high par Table 2 , th H encoder and u 2.078 ns res oder, the critica ulation using X oder has speede oder. 
CONCLUSION
In this paper, we have presented a tree-type systolic array BCH encoder that operates in parallel without significant area overhead. Several parallel factors p of the original systolic array BCH encoder have been compared in terms of area and speed. The systolic array BCH encoder has also been compared with the unfolding method. Applying the tree-type structure to the original systolic array encoder improves the performance substantially with only a minor increase in area. The proposed BCH encoder also offers great flexibility, since it can be adapted to any parallel factor without added complexity. The fan-out effect has been disregarded in our experiment. Retiming can be applied to the proposed encoder without modifying the generator polynomial or increasing the hardware. Future work can be directed toward reducing the fan-out effect in the proposed tree-type systolic array BCH encoder to further improve its performance.
