Abstract. The Bose, Chaudhuri and Hocquenghem (BCH) codes form a large class of powerful random-error-correcting cyclic codes. However, implementing their decoders requires high-complexity computation resources with a large number of sequential circuits. This paper presents a low-complexity register transfer level (RTL) circuit design of a BCH decoder. Based on the table relationship between the syndrome and the error bit position, we propose a circuit made up of combinational elements, with no sequential elements involved. The designed system therefore has low complexity and high throughput. The implementation of the BCH (15,7) decoder on a Virtex 5 FX70TFF1136 requires 77 look-up tables (LUTs), with the maximum throughput reaching 1.7 Gbps.
Introduction
Today, error-correcting codes are used throughout digital communication systems. Satellite communications, cell phones, compact disc players, DVDs, disk drives, two-dimensional bar code systems and many other communication devices use varying amounts of error control to achieve a certain degree of accuracy in transmitting information. The binary Bose, Chaudhuri and Hocquenghem (BCH) codes, discovered by Hocquenghem in 1959 and independently investigated by Bose and Chaudhuri in 1960, are a remarkable generalization of the Hamming codes for multiple-error correction. BCH codes, which include Reed-Solomon codes, have been widely adopted in practical error-control applications, owing to their good performance against degradation and the flexibility they allow in setting appropriate parameters [1]. Digital Video Broadcasting (DVB) [2] and Worldwide Interoperability for Microwave Access (WiMAX) [3] are examples of current standards that use BCH codes.
One of the well-developed algorithms for decoding binary BCH codes uses the Euclidean algorithm [4]. However, it requires high computation resources because of the error-locator polynomial. Other algorithms are step-by-step algorithms [5], [6], which test whether the error pattern weight falls when the received symbols are changed one at a time. This decoding procedure does not terminate until the error pattern weight has been reduced to zero or all received information symbols have been tested. Hence, it is called an iterative method of decoding. Even though the hardware implementation [7] is less complex than that of the first decoding algorithm, the throughput may not be higher because of the iterative procedure.
In this paper, we propose a simple hardware implementation procedure with low complexity and high throughput properties. This simple combinational circuit was developed based on the table relationship between the syndrome and the error bit position. Thus, a low-complexity BCH decoder could be developed. Furthermore, the decoder throughput could be increased by employing pipelining and parallelization. This paper is organized as follows. In Section 2, the architecture of the encoder and the decoder is detailed. The design complexity is explained in Section 3. In Section 4, the compilation and synthesis results are presented. Finally, conclusions are drawn in Section 5.
Architecture Description

Encoder Specification
The architecture of a BCH encoder using a shift register was introduced by Massey in 1969 [8]. This paper considers a BCH (15,7) encoder, with 7 information bits and 8 parity bits, as the target implementation. The sent bits (SB) of this BCH (15,7) encoder are based on the polynomial given by:
where u0, u1, u2, u3, u4, u5, u6 represent the information bits, and r0, r1, r2, r3, r4, r5, r6, r7 represent the parity bits. This can be implemented using the remainder polynomial, based on:
Furthermore, Eq. (2) can be realized by a simple circuit, as shown in Figure 1, where u6, u5, …, u0 are input serially at the signal input port (SIN), and r7, r6, …, r0 are generated after seven clock cycles. A parallel process to obtain the parity bits is introduced in order to reach a higher throughput. Parallel computation is performed based on the remainder polynomial of
Based on Eq. (2) 
Eq. (4) can be realized easily in a circuit, as shown in Figure 2 , where the adder symbol is implemented using XOR gates. 
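As a minimal software sketch of the remainder-based encoding described above, the parity bits can be obtained by dividing the shifted information polynomial by the generator polynomial over GF(2). The generator g(x) = x^8 + x^7 + x^6 + x^4 + 1 is an assumption here (it is the usual generator of the double-error-correcting (15,7) code and is not quoted from the equations of this paper); the names `poly_mod` and `encode` are hypothetical.

```python
# Assumption: g(x) = x^8 + x^7 + x^6 + x^4 + 1 (not quoted from the paper).
# Bit i of an int holds the coefficient of x^i.
G = 0b111010001


def poly_mod(a: int, m: int) -> int:
    """Remainder of a(x) divided by m(x), coefficients in GF(2)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        # Cancel the leading term of a(x) with a shifted copy of m(x).
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a


def encode(u: int) -> int:
    """Systematic codeword: u6..u0 sit at x^14..x^8, r7..r0 at x^7..x^0."""
    shifted = u << 8                       # u(x) * x^8
    return shifted | poly_mod(shifted, G)  # append the 8 parity bits


if __name__ == "__main__":
    codeword = encode(0b1010011)
    print(f"{codeword:015b}")
```

In hardware, each parity bit of this remainder reduces to a fixed XOR of information bits, which is what a parallel encoder circuit computes in a single step.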
Proposed BCH (15,7) Decoder
This paper proposes parallel computation for the BCH (15,7) decoder to reach a higher throughput. The proposed system consists of 7 main blocks, as shown in Figure 3 .
Syndrome Calculation
In this process, syndrome blocks S1 and S2 generate the syndrome bits of the received bits with error (RBWE). The generator polynomial G(x) for error checking is given by,
G1(x) and G2(x) are related to syndrome S1 and syndrome S2, respectively. If there is no error in the received code, the remainder polynomials with respect to both G1(x) and G2(x) are zero. Suppose the received bits with error (RBWE) are expressed as,
Based on Eq. (5), we can derive the remainder polynomial for x^5, x^6, …, x^14 and then substitute it in Eq. (7); hence syndrome S1 is:
By replacing each (+) sign with an XOR gate, a total of 48 XOR gates are required for the syndrome calculation block. However, this can be reduced by sharing common logic, such as û6 XOR û5, which is used in S1(3) as well as in S1(0). This reduces the number of XOR gates from 48 to 37.
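The syndrome computation can be illustrated with a short sketch. It assumes G1(x) = x^4 + x + 1 and G2(x) = x^4 + x^3 + x^2 + x + 1; these factors are an assumption, chosen because they reproduce the S2 patterns and Ŝ1 candidates listed in this paper. The helper names are hypothetical. The sketch also counts the two-input XOR gates needed before any sharing.

```python
# Assumed factors of the generator polynomial (bit i = coefficient of x^i).
G1 = 0b10011  # x^4 + x + 1
G2 = 0b11111  # x^4 + x^3 + x^2 + x + 1


def poly_mod(a: int, m: int) -> int:
    """Remainder of a(x) divided by m(x), coefficients in GF(2)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a


def syndromes(word: int) -> tuple:
    """Return (S1, S2) for a 15-bit received word."""
    return poly_mod(word, G1), poly_mod(word, G2)


def fmt(s: int) -> str:
    """Write a syndrome as a 4-character string with bit (0) on the left."""
    return "".join(str((s >> i) & 1) for i in range(4))


def xor_gate_count(m: int) -> int:
    """Two-input XOR gates for one syndrome block, without logic sharing."""
    total = 0
    for bit in range(4):
        # Number of received bits that feed this syndrome bit.
        fan_in = sum((poly_mod(1 << p, m) >> bit) & 1 for p in range(15))
        total += fan_in - 1  # n inputs need n - 1 two-input XOR gates
    return total


if __name__ == "__main__":
    print(fmt(syndromes(1 << 10)[0]), fmt(syndromes(1 << 10)[1]))
    print(xor_gate_count(G1) + xor_gate_count(G2))
```

Under these assumptions the unshared gate count agrees with the 48 XOR gates stated above; the reduction to 37 depends on which sub-expressions are shared and is not modeled here.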
Error Position Detection
The next process is error position detection based on the values of syndrome S1 and S2. Eq. (10) will be re-applied and rearranged, becoming:
It is clear that an error in û2, r5 or r0 affects only S2(0). In the same way, an error in û4, r7 or r2 affects only S2(2). In Table 1, which relates the syndrome bits to the error bit positions, such bits occupy the same column. Thus, there are 5 column groups (CG):
CG0, consisting of û2, r5 and r0, corresponds to S2 = "1000";
CG1, consisting of û3, r6 and r1, corresponds to S2 = "0100";
CG2, consisting of û4, r7 and r2, corresponds to S2 = "0010";
CG3, consisting of û5, û0 and r3, corresponds to S2 = "0001";
CG4, consisting of û6, û1 and r4, corresponds to S2 = "1111".
Therefore, the recognition of a CG can be based on the position of bit "1" in S2. For example, S2 = "0101" means there is an error in CG1 and an error in CG3.
However, if an error in CG4 occurs together with an error in another CG, the recognition scheme becomes different. For example, when an error occurs in CG4 as well as in CG2, the result S2 = "1101" cannot be recognized from the bit "1" positions. In this case, an inverter is required before recognition of the bit "1" position takes place. However, the inverter must only be applied if the number of bit "1" is more than two. Therefore, the column group detection system consists of the number of bit "1" calculation, selectors, and bit "1" position recognition.
The first component in the column group detection system is the number of bit "1" calculation. This could be implemented with a 4-bit adder, but an adder requires XOR and AND gates for each of the four bits and may therefore consume considerable resources. Since the goal is not to count the number of bit "1" but only to recognize that it is more than two, we propose a combinational circuit that uses only four AND gates and three OR gates. This is expressed in terms of syndrome S2 as SEL.
SEL is "1" when the number of bit "1" within the S2 bits is more than two. This equation also expresses CG4 detection, since an error in CG4 is always present when the number of bit "1" is more than two. The next component is the selector. Its function is to select between S2 and (NOT S2). An XOR gate has been chosen as the selector. The inputs are S2 and SEL, and the output is the column group. Therefore, bit "1" position recognition is no longer required. Finally, the column group detection circuit shown in Figure 4 consists of 4 AND gates, 3 OR gates, and 4 XOR gates.
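The column group detection can be sketched as follows. The factoring of SEL used here is an assumption: it is one expression that matches the stated budget of four two-input AND gates and three two-input OR gates, not necessarily the expression used in the actual design. The function names are hypothetical.

```python
def sel(s2):
    """SEL for s2 = [S2(0), S2(1), S2(2), S2(3)]: 1 if three or more bits are set.

    Assumed factoring with 4 AND and 3 OR gates:
    (a AND b AND (c OR d)) OR (c AND d AND (a OR b)).
    """
    a, b, c, d = s2
    return (a & b & (c | d)) | (c & d & (a | b))


def column_groups(s2):
    """Return [CG0, CG1, CG2, CG3, CG4] flags for a given S2."""
    k = sel(s2)
    # Four XOR gates act as the selector between S2 and NOT S2;
    # SEL itself signals an error in CG4.
    return [bit ^ k for bit in s2] + [k]


if __name__ == "__main__":
    # S2 = "1101" (bit (0) on the left): errors in CG2 and CG4.
    print(column_groups([1, 1, 0, 1]))
    # S2 = "0101": errors in CG1 and CG3.
    print(column_groups([0, 1, 0, 1]))
```

Because SEL doubles as the CG4 flag, no separate position-recognition stage is needed after the XOR selector.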
Furthermore, to simplify the error detection process, we divide the error possibilities into two groups: error possibility 1 (EP1) and error possibility 2 (EP2). EP1 consists of two errors occurring at the same time and in the same CG. Note from Eq. (11) that two errors occurring at the same time in the same CG give S2 = "0000". In the following discussion, all error possibilities in the S2 = "0000" column of Table 1 are categorized as EP1.
The next error group is error possibility 2 (EP2). This group includes all errors in every column of Table 1 except column S2 = "0000". Note that "all errors" means all errors that can be recognized and recovered by this error correction algorithm.
Error Possibility 1 (EP1)
EP1 occurs when two errors come from the same CG. This case is characterized by syndrome S2 = "0000" AND S1 ≠ "0000". The only way to detect the errors is direct mapping between S1 and the error bit position within the 15 received bits. Based on Table 1, column S2 = "0000", the combinational circuit for ERROR1(14:0) can be expressed as:
ERROR1(14) = S1_12 OR S1_5
ERROR1(13) = S1_1 OR S1_10
ERROR1(12) = S1_2 OR S1_13
ERROR1(11) = S1_4 OR S1_3
ERROR1(10) = S1_6 OR S1_8
ERROR1(9) = S1_9 OR S1_12
ERROR1(8) = S1_1 OR S1_11
ERROR1(7) = S1_2 OR S1_15
ERROR1(6) = S1_4 OR S1_7
ERROR1(5) = S1_8 OR S1_14
ERROR1(4) = S1_5 OR S1_9
ERROR1(3) = S1_10 OR S1_11
ERROR1(2) = S1_13 OR S1_15
ERROR1(1) = S1_3 OR S1_7
ERROR1(0) = S1_6 OR S1_14
where,
S1_1 = (NOT S1(0)) AND (NOT S1(1)) AND (NOT S1(2)) AND S1(3) S1_2 = (NOT S1(0)) AND (NOT S1(1)) AND S1(2) AND (NOT S1(3)) S1_3 = (NOT S1(0)) AND (NOT S1(1)) AND S1(2) AND S1(3) S1_4 = (NOT S1(0)) AND S1(1) AND (NOT S1(2)) AND (NOT S1(3)) S1_5 = (NOT S1(0)) AND S1(1) AND (NOT S1(2)) AND S1(3) S1_6 = (NOT S1(0)) AND S1(1) AND S1(2) AND (NOT S1(3)) S1_7 = (NOT S1(0)) AND S1(1) AND S1(2) AND S1(3) S1_8 = S1(0) AND (NOT S1(1)) AND (NOT S1(2)) AND (NOT S1(3)) S1_9 = S1(0) AND (NOT S1(1)) AND (NOT S1(2)) AND S1(3) S1_10 = S1(0) AND (NOT S1(1)) AND S1(2) AND (NOT S1(3)) S1_11 = S1(0) AND (NOT S1(1)) AND S1(2) AND S1(3) S1_12 = S1(0) AND S1(1) AND (NOT S1(2)) AND (NOT S1(3)) S1_13 = S1(0) AND S1(1) AND (NOT S1(2)) AND S1(3) S1_14 = S1(0) AND S1(1) AND S1(2) AND (NOT S1(3)) S1_15 = S1(0) AND S1(1) AND S1(2) AND S1(3)
This requires 45 AND gates, 15 OR gates and 28 NOT gates. However, computation can be shared among S1_1 to S1_15, so that:
S1_1 = C1_00 AND C0_01
S1_2 = C1_00 AND C0_10
S1_3 = C1_00 AND C0_11
S1_4 = C1_01 AND C0_00
S1_5 = C1_01 AND C0_01
S1_6 = C1_01 AND C0_10
S1_7 = C1_01 AND C0_11
S1_8 = C1_10 AND C0_00
S1_9 = C1_10 AND C0_01
S1_10 = C1_10 AND C0_10
S1_11 = C1_10 AND C0_11
S1_12 = C1_11 AND C0_00
S1_13 = C1_11 AND C0_01
S1_14 = C1_11 AND C0_10
S1_15 = C1_11 AND C0_11
where
C1_00 = (NOT S1(0)) AND (NOT S1(1))
C1_01 = (NOT S1(0)) AND S1(1)
C1_10 = S1(0) AND (NOT S1(1))
C1_11 = S1(0) AND S1(1)
C0_00 = (NOT S1(2)) AND (NOT S1(3))
C0_01 = (NOT S1(2)) AND S1(3)
C0_10 = S1(2) AND (NOT S1(3))
C0_11 = S1(2) AND S1(3)
This scheme needs only 23 AND gates (about half as many as before) and 8 NOT gates (reduced from 28).
EP1 consists of three parts, as shown in Figure 5. The first part computes C0_00, C0_01, C0_10, C0_11, C1_00, C1_01, C1_10 and C1_11 simultaneously. The second part computes S1_1 up to S1_15 as expressed in Eq. (13). The last part computes ERROR1(14:0) based on Eq. (12). All computations are done without buffering or latency. Finally, the EP1 block requires 23 AND gates, 15 OR gates and 8 NOT gates.
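The ERROR1 mapping can be cross-checked in software. The sketch below assumes G1(x) = x^4 + x + 1 and that bit position p of the 15-bit word corresponds to x^p (r0 at position 0, û6 at position 14); both are assumptions consistent with the Ŝ1 candidates quoted in this paper. It derives, for every pair of errors in the same column group, the S1_k minterm that fires, and compares the result with the ERROR1 equations above. All identifiers are hypothetical.

```python
G1 = 0b10011  # assumed x^4 + x + 1 (bit i = coefficient of x^i)


def poly_mod(a: int, m: int) -> int:
    """Remainder of a(x) divided by m(x), coefficients in GF(2)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a


def s1_index(error_mask: int) -> int:
    """Index k of the minterm S1_k; S1(0) is the leading digit of k."""
    s = poly_mod(error_mask, G1)
    return sum(((s >> i) & 1) << (3 - i) for i in range(4))


# Minterms derived from the code: pairs p, q in the same column group
# differ by a multiple of five.
derived = {p: set() for p in range(15)}
for p in range(15):
    for q in range(p + 5, 15, 5):
        k = s1_index((1 << p) | (1 << q))
        derived[p].add(k)
        derived[q].add(k)

# Minterms transcribed from the ERROR1 equations in the text.
PAPER = {14: {12, 5}, 13: {1, 10}, 12: {2, 13}, 11: {4, 3}, 10: {6, 8},
         9: {9, 12}, 8: {1, 11}, 7: {2, 15}, 6: {4, 7}, 5: {8, 14},
         4: {5, 9}, 3: {10, 11}, 2: {13, 15}, 1: {3, 7}, 0: {6, 14}}


def error1(s1_bits) -> int:
    """ERROR1 as a 15-bit mask for s1_bits = [S1(0), S1(1), S1(2), S1(3)]."""
    k = s1_bits[0] * 8 + s1_bits[1] * 4 + s1_bits[2] * 2 + s1_bits[3]
    return sum(1 << p for p, ks in PAPER.items() if k in ks)


if __name__ == "__main__":
    print(derived == PAPER)
    print(f"{error1([1, 1, 1, 0]):015b}")
```

Each nonzero S1 value selects exactly two bits of ERROR1, which is why a two-level AND/OR network suffices for this block.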
Error Possibility 2 (EP2)
Error detection in this group is performed based on the column group and syndrome S1. There are 105 possible error patterns in this group. The detection concept consists of three steps. First, the column group is used to generate a maximum of nine candidates in terms of syndrome Ŝ1. Next, all possible combinations of the syndrome Ŝ1 candidates are prepared and compared with the actual syndrome S1. As a result, the syndrome Ŝ1 candidate that has the same pattern as the actual syndrome S1 is recognized. Finally, this result is converted to the error position, within 0 to 15. The general architecture of EP2 detection is shown in Figure 6. Each step is explained in detail below.
Figure 6 General architecture of EP2.
The main process of Step 1 is to generate all Ŝ1 candidates based on the received column group (CG). A maximum of two groups can be detected at the same time, and each group has three Ŝ1 candidates, so a maximum of nine Ŝ1 candidate combinations are produced in the Step 1 block. The Ŝ1 candidates are based on Table 1. They are:
CG0: Ŝ1 = "1000", "0110", "1110"
CG1: Ŝ1 = "0100", "0011", "0111"
CG2: Ŝ1 = "0010", "1101", "1111"
CG3: Ŝ1 = "0001", "1010", "1011"
CG4: Ŝ1 = "1100", "0101", "1001"
Note that S1, S2 and Ŝ1 have the same configuration, where the leftmost bit is the least significant bit (LSB, e.g. S1(0)) and the rightmost bit is the most significant bit (MSB, e.g. S1(3)).
Notice that each Ŝ1 in a CG is equal to the XOR of the two other Ŝ1s in that CG. Therefore, only two Ŝ1s per CG need to be provided in Step 1. The details of the architecture of Step 1 are shown in Figure 7; it consists of ten selectors and a comparator to recognize a single error, since a single error gives the same value at both outputs. The ten bits on each selector represent two Ŝ1s (each 4 bits) and a representative error position (3 bits).
The main process of Step 2 consists of generating the Ŝ1 combinations and comparing them with the actual syndrome S1. Since each CG contributes three Ŝ1s and there is a maximum of two errors in different CGs, the maximum number of combinations is nine. For example, the first CG gives Ŝ1 = A1, B1, and C1, and the other CG gives Ŝ1 = A2, B2, and C2. Therefore, the combinations of Ŝ1 are the nine pairs A1 XOR A2, A1 XOR B2, A1 XOR C2, B1 XOR A2, B1 XOR B2, B1 XOR C2, C1 XOR A2, C1 XOR B2 and C1 XOR C2.
One of them should be the same as the actual S1. Figure 8 shows the details of the architecture of Step 2.
In Step 3, error positions are recognized by comparing the results of Step 2 with the representative error from Step 1. A representative error is the smallest error position in each CG; for example, the representative error in CG2 is "010". The two other errors can be recognized because they follow a regular pattern, namely an interval of five. Thus, when the representative error is "010", the other errors are "0111" and "1100". However, to recognize the actual error, the output of Step 2 has to be considered.
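The three steps can be mirrored in a behavioral sketch. It is not the selector and comparator circuit of the figures; it only follows the same logic: take the candidate positions of each detected column group (the representative position plus multiples of five), form the candidate Ŝ1 combinations, and keep the one that equals the actual S1. The polynomial G1(x) = x^4 + x + 1 and all function names are assumptions.

```python
G1 = 0b10011  # assumed x^4 + x + 1 (bit i = coefficient of x^i)


def poly_mod(a: int, m: int) -> int:
    """Remainder of a(x) divided by m(x), coefficients in GF(2)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a


def s1_of(pos: int) -> int:
    """Candidate syndrome S1 of a single error at bit position pos."""
    return poly_mod(1 << pos, G1)


def ep2_positions(groups, s1):
    """Locate EP2 errors.

    groups: one or two column group indices (0..4), as detected from S2.
    s1: actual syndrome S1 as an int (bit i = S1(i)).
    Returns the sorted error positions, or None if nothing matches.
    """
    # Step 1: candidates are the representative position g and g+5, g+10.
    cands = [[g, g + 5, g + 10] for g in groups]
    if len(groups) == 1:
        for p in cands[0]:
            if s1_of(p) == s1:
                return [p]
        return None
    # Step 2: compare up to nine XOR combinations with the actual S1.
    for p in cands[0]:
        for q in cands[1]:
            if s1_of(p) ^ s1_of(q) == s1:
                # Step 3: report the matching positions.
                return sorted([p, q])
    return None


if __name__ == "__main__":
    s1 = s1_of(7) ^ s1_of(14)         # errors in CG2 and CG4
    print(ep2_positions([2, 4], s1))
```

Enumerating every single error and every pair of errors in different column groups with this sketch gives 105 distinct, uniquely located patterns under the stated assumptions.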
EP2 Decoder
This block converts the EP2 output from 4-bit format to the error position within 15 bits. These 15-bit patterns are also called ERROR2. The relationship between the two input ports (out_pos1 and out_pos2) and the 15-bit ERROR2 is expressed as:
out_pos1_0000 = (NOT out_pos1(0)) AND (NOT out_pos1(1)) AND (NOT out_pos1(2)) AND (NOT out_pos1(3))
…
out_pos1_1110 = (NOT out_pos1(0)) AND out_pos1(1) AND out_pos1(2) AND out_pos1(3)
out_pos2_0000 = (NOT out_pos2(0)) AND (NOT out_pos2(1)) AND (NOT out_pos2(2)) AND (NOT out_pos2(3))
…
out_pos2_1110 = (NOT out_pos2(0)) AND out_pos2(1) AND out_pos2(2) AND out_pos2(3)
Therefore, it can be implemented using a combinational circuit consisting of AND and OR gates. Some parts of this circuit are shown in Figure 10.
Figure 10 Architecture of the EP2 decoder.
Finally, the total complexity of EP2 and its decoder is given in Table 2 , consisting of two multiplexers, a comparator, 20 XOR gates, 32 OR gates, 49 AND gates, 18 NOT gates, and two adders. 
Error Correction
The last step is error correction. The main concept of the error correction system is XOR-ing the received information with the correction pattern built from EP1 and EP2. However, we must also consider syndromes S1 and S2 when selecting the correction pattern, which can be expressed as,
[Equation: piecewise selection of the correction pattern, with cases conditioned on whether S1 and S2 are zero and a default "others" case]
Figure 11 Error correction architecture.
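The selection and correction step can be sketched end to end in software. This is a table-based stand-in, not the gate-level circuit. It assumes g(x) = x^8 + x^7 + x^6 + x^4 + 1 with factors G1(x) = x^4 + x + 1 and G2(x) = x^4 + x^3 + x^2 + x + 1, and it assumes the correction pattern is ERROR1 when S2 is zero and S1 is nonzero, ERROR2 when both are nonzero, and zero otherwise. All names are hypothetical.

```python
from itertools import combinations

# Assumed polynomials (bit i = coefficient of x^i).
G, G1, G2 = 0b111010001, 0b10011, 0b11111


def poly_mod(a: int, m: int) -> int:
    """Remainder of a(x) divided by m(x), coefficients in GF(2)."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a


def encode(u: int) -> int:
    """Systematic codeword with the information bits at x^14..x^8."""
    return (u << 8) | poly_mod(u << 8, G)


ERROR1 = {}  # S2 == 0: two errors in the same column group, keyed by S1
ERROR2 = {}  # S2 != 0: one error or two errors in different groups
for n in (1, 2):
    for pos in combinations(range(15), n):
        e = sum(1 << p for p in pos)
        s1, s2 = poly_mod(e, G1), poly_mod(e, G2)
        if s2 == 0:
            ERROR1[s1] = e
        else:
            ERROR2[(s1, s2)] = e


def correct(rbwe: int) -> int:
    """Return the corrected word (RBEC) for a received word (RBWE)."""
    s1, s2 = poly_mod(rbwe, G1), poly_mod(rbwe, G2)
    if s2 == 0 and s1 != 0:
        pattern = ERROR1.get(s1, 0)
    elif s2 != 0 and s1 != 0:
        pattern = ERROR2.get((s1, s2), 0)
    else:
        pattern = 0  # no error, or a pattern outside the correctable set
    return rbwe ^ pattern


if __name__ == "__main__":
    sent = encode(0b1100101)
    received = sent ^ (1 << 3) ^ (1 << 12)  # inject two bit errors
    print(correct(received) == sent)
```

Under these assumptions the two tables hold 15 and 105 entries respectively, matching the split between EP1 and EP2.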
Proposed Design Complexity
In Section 2, the proposed RTL design was presented along with its complexity. The total complexity of each block is summarized in Table 3. Note that a multiplexer is counted as 3 logic gates, and a comparator as an adder and a logic gate. Based on the calculation shown in Table 3, the proposed BCH (15,7) decoder needs 263 logic gates and 3 adders. For comparison, we consider a simple algorithm proposed by Hong [9]. Hong's algorithm for 2-bit errors requires a total of 110 multipliers, excluding other components such as adders. The 110 multipliers are distributed such that 56 multipliers are used for syndrome evaluation, 6 multipliers for the error locator polynomial, 44 multipliers for root finding, and 4 multipliers for error evaluation.
The complexity of a 2-bit multiplier is equal to 4 logic gates and an adder [10]. Thus, Hong's algorithm for 2-bit errors is equal to 440 logic gates and 110 adders. Therefore, the proposed system has a lower complexity than Hong's algorithm.
Simulation, Compilation and Synthesis Results
In order to ensure that the developed system works properly, we performed a verification based on the block diagram in Figure 12. All parts were implemented in VHSIC Hardware Description Language (VHDL) and simulated using ModelSim 6.3. A snapshot of the functional simulation is shown in Figure 13. The simulation shows that the errors are recovered by the decoder.
Figure 12 Block diagram for verification.
Figure 13 Snapshot of simulation result.
Furthermore, using a clock period of 200 ns, a long simulation was run covering 3 seconds, or 15 × 10^6 clock cycles. Within this period, the BCH decoder received approximately 1,000,000 data words. No errors remained, as shown in Figure 14, which means that the bit error rate was zero, i.e. all received bits were corrected.
Figure 14 Simulation snapshot of one million data.
The design was compiled using the ISE 11.2 design tool. The result shows that the critical path runs from register input RBWE(0) to register output RBEC(14), as shown in the snapshot of the compilation result in Figure 15. This path passes through syndrome S2, column group detection, error possibility 2, and the error correction block. The critical path delay is 8.713 ns for the Virtex 5 FX70TFF1136 implementation. The critical path can be shortened by pipelining. Without pipelining, the proposed design has a maximum clock frequency of 114.771 MHz. Since the computation is done in 15-bit parallel processing, the maximum throughput that can be achieved is 1.7 Gbps.
Figure 15 Snapshot of the timing report.
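The throughput figure follows directly from the reported critical path delay and the 15-bit parallel width. The short calculation below is only a worked check of that arithmetic; the variable names are hypothetical.

```python
# Values reported in the text.
critical_path_ns = 8.713   # critical path delay on the target device
bits_per_cycle = 15        # one 15-bit word is processed per clock cycle

# Maximum clock frequency is the reciprocal of the critical path delay.
f_max_mhz = 1e3 / critical_path_ns

# Throughput = bits per cycle times cycles per second.
throughput_gbps = bits_per_cycle * f_max_mhz / 1e3

if __name__ == "__main__":
    print(f"f_max = {f_max_mhz:.3f} MHz")
    print(f"throughput = {throughput_gbps:.2f} Gbps")
```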
From a circuit area point of view, the proposed architecture of the BCH (15,7) decoder requires 77 slice LUTs and no flip-flops, as shown in Figure 16. Since all components are built from combinational circuits, there are no sequential components such as registers or memories. Thus, the clock latency is zero.
Figure 16 Snapshot of the resource summary.
The 1.7 Gbps throughput is higher than that of the decoder architecture proposed by Kumar et al. [11], which can reach a data rate of up to 1.6 Gbps with a maximum clock of 200 MHz in an application-specific integrated circuit (ASIC) implementation. In addition, the proposed system has no latency since no sequential circuit is included. The decoder proposed by Kumar et al. [11] has a latency of 284 clock cycles. Thus, the proposed system has a lower latency than Kumar's decoder.
Conclusions
We have designed a BCH (15,7) hardware implementation as a combinational circuit instead of a sequential circuit, in order to avoid high computation requirements and iterative processing. The simulation results using ModelSim 6.3 show that the developed circuit functions correctly. Furthermore, based on the compilation and synthesis results, the BCH decoder occupies 77 of the 44800 LUTs on the target device, a Virtex 5 FX70TFF1136. The critical path delay is 8.713 ns in 15-bit parallel processing. Thus, the maximum throughput can reach 1.7 Gbps. Since sequential circuits are no longer involved, there is no processing latency and the output is produced within one clock cycle.
