A BCH (Bose-Chaudhuri-Hocquenghem)
Introduction
In recent years, there has been an increasing demand for efficient and reliable digital data transmission and storage systems. This demand has been accelerated by the rapid development and availability of VLSI technology, high-speed data networks, and digital storage. A digital system must be highly reliable, since a single error may shut down the whole system or cause unacceptable corruption of data. In such situations, error control must be employed so that an error can be detected and subsequently corrected. For this reason, error control codes have become an integral part of the design of modern communication and digital storage systems.
Flash memories are widely used storage elements in embedded systems and mobile applications, owing to their low power consumption, non-volatility, and high storage density. Two types are in common use: NAND and NOR flash memory. NAND flash is mainly used for high-density data storage, whereas NOR flash is used for code storage and direct execution. Some major differences between the NAND and NOR types are shown in Table 1. An SLC (single-level cell) stores a single bit per cell, whereas an MLC (multi-level cell) stores multiple bits per cell. In MLC devices, errors during read and write cycles are more likely because the programming voltage available for each data level is reduced [1-2].
The organization of MLC flash memories is shown in Figure 1. Each plane has a 2 KB page register with an additional 64 spare bytes to store the ECC parity bits. Each block consists of several pages. The page register is divided into four 512-byte sectors. Page registers are organized in one of two ways, as shown in Figure 2. The most common way is to store the data in the lower part of the page and the error correction bits in the upper part, as in Figure 2(a). However, some implementations prefer a different organization to increase performance, as in Figure 2(b). For a write operation, the input data are encoded before being written to the page register. For a read operation, the data from the registers are decoded. A decoding failure is reported when the number of errors exceeds the designed capability of the ECC circuits. In a flash memory, data are read and written one byte per clock cycle. Thus, byte-wise parallel encoding and decoding are desirable for high-speed flash memories.
The theory of error detecting and correcting codes deals with the reliable storage of data. Storage media are not completely reliable in practice, since noise frequently distorts the data. In particular, BCH codes for multiple error correction are widely used in MLC flash memory. A BCH encoder can be implemented in either hardware or software. Since a software implementation of a BCH code cannot reach the required speed, a hardware design is preferred for high-speed applications. In the conventional BCH design, a simple shift register, that is, an LFSR (linear feedback shift register), is combined with specific XOR gates to process one message bit per cycle. Owing to the ever-increasing demand for high-speed communication appliances, the conventional bit-serial BCH encoder should be replaced by the parallel BCH encoders that have been presented so far.
In a parallel BCH encoder circuit, p bits of data are processed at a time, so the encoder is p times faster with a relatively small increase in hardware. Several parallel processing methods have been introduced earlier, such as matrix multiplication, the unfolding method, and CRT-based encoding [3] [4] [5]. A high-speed circuit is always required, but area and flexibility should also be considered. In this paper, we present a systolic array BCH encoder employing a tree-type structure. The proposed BCH encoder offers high throughput and great flexibility in the choice of parallelization factor without added circuit complexity. It improves performance compared to its counterparts without any significant increase in area.
Serial and Parallel BCH Encoder
A BCH (n, k) code is capable of correcting t errors in a block of $n = 2^m - 1$ bits, where n is the length of the codeword, k is the length of the message, n-k is the number of parity bits, and t is the number of errors that can be corrected. The BCH (4122, 4096) code is a shortened version of the BCH (8191, 8165) code over the Galois field GF($2^{13}$) and is able to correct two errors. A shortened BCH code is used when the required block length is smaller than $2^m - 1$, and it can be treated as a standard BCH code. The shortened codewords form a subset of the original code and have the same error correction and detection capabilities as the original long BCH code. Encoding and decoding of the shortened BCH code are almost the same as for the long BCH code.
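The parameters quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `shortened_bch_params` is hypothetical, and it assumes that each correctable error costs exactly m parity bits (n - k = m·t), which holds for the two codes discussed in this paper but not for every BCH code.

```python
def shortened_bch_params(m, t, k_short):
    """Return (n, k, n_short, parity_bits) for a binary BCH code over GF(2^m)
    shortened to k_short message bits.

    Assumption: n - k = m * t, true for the codes quoted in the text.
    """
    n = 2**m - 1              # full-length codeword
    parity = m * t            # number of parity bits, n - k
    k = n - parity            # full-length message
    if not 0 < k_short <= k:
        raise ValueError("k_short must lie in 1..k")
    return n, k, k_short + parity, parity


print(shortened_bch_params(13, 2, 4096))  # (8191, 8165, 4122, 26)
print(shortened_bch_params(5, 3, 16))     # (31, 16, 31, 15)
```

With m = 13 and t = 2, a 4096-bit message (the size of one 512-byte sector) needs 26 parity bits, which gives the (4122, 4096) code.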
In a binary BCH (n, k) code, a k-bit message is encoded into an n-bit codeword that consists of the k message bits and n-k parity bits. The n-bit codeword is defined as $(c_{n-1}, c_{n-2}, \ldots, c_0)$, with the corresponding codeword polynomial $c(x) = c_{n-1}x^{n-1} + c_{n-2}x^{n-2} + \cdots + c_0$. The encoding of a BCH code can be expressed as $c(x) = m(x)g(x)$, where $g(x)$ is the generator polynomial of degree n-k.
For systematic encoding, the BCH codeword is constructed in three steps. The message $m(x)$ is multiplied by $x^{n-k}$, the product is divided by the generator polynomial $g(x)$ to obtain the remainder $\mathrm{Rem}(m(x)x^{n-k})_{g(x)}$, and the remainder is added to the shifted message to form the codeword, as shown in Eq. (1):
$c(x) = m(x)x^{n-k} + \mathrm{Rem}(m(x)x^{n-k})_{g(x)}$ (1)
In this paper, the proposed BCH encoder is implemented with the (31, 16, 3) BCH code [6]. Its generator polynomial g(x) is given in Eq. (2). For conventional bit-serial encoding, the k message bits are fed into the LFSR one bit per cycle. After the k-th cycle, the registers contain $\mathrm{Rem}(m(x)x^{n-k})_{g(x)}$, which is also called the parity bits. Figure 3 illustrates the circuit of a conventional serial BCH encoder. The critical path of this bit-serial architecture consists of two XOR gates, as shown in Figure 3. This architecture is straightforward, but it cannot run at the speed that the application requires. The three steps of BCH encoding can be carried out with a simple LFSR architecture. The conventional serial encoder for BCH (4122, 4096) is shown in Figure 3. The feedback terms are determined by the generator polynomial g(x) in Eq. (2). Modulo-2 addition is performed with XOR gates and multiplication with finite field multipliers. A total of 26 (n-k) registers store the remainder $\mathrm{Rem}(m(x)x^{n-k})_{g(x)}$. For a binary BCH code, each multiplication reduces to a connection or a disconnection, depending on whether $g_i$ ($0 \le i \le n-k$) equals '1' or '0'.
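The bit-serial encoding described above can be modelled in software as follows. This is a minimal sketch, not the circuit of Figure 3. It assumes the commonly tabulated generator polynomial of the (31, 16) triple-error-correcting BCH code, g(x) = x^15 + x^11 + x^10 + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x + 1, which may differ from the polynomial in Eq. (2). The names `lfsr_parity`, `poly_mod` and `encode` are illustrative.

```python
# Assumption: g(x) = x^15 + x^11 + x^10 + x^9 + x^8 + x^7 + x^5 + x^3 + x^2 + x + 1
G = 0b1000111110101111
R = 15                      # n - k, number of parity registers
K = 16                      # message length
MASK = (1 << R) - 1


def lfsr_parity(msg_bits):
    """Shift the message (highest-degree bit first) through the LFSR,
    one bit per cycle, and return the remainder left in the registers."""
    state = 0
    for z in msg_bits:
        feedback = ((state >> (R - 1)) & 1) ^ z   # XOR of input and top register
        state = (state << 1) & MASK               # shift by one position
        if feedback:
            state ^= G & MASK                     # taps where g_i = 1
    return state


def poly_mod(a, g=G):
    """Remainder of a(x) divided by g(x) over GF(2); polynomials stored as ints."""
    while a.bit_length() >= g.bit_length():
        a ^= g << (a.bit_length() - g.bit_length())
    return a


def encode(msg_bits):
    """Systematic codeword: message bits followed by the parity bits."""
    m = int("".join(map(str, msg_bits)), 2)
    return (m << R) | lfsr_parity(msg_bits)


msg = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
assert poly_mod(encode(msg)) == 0   # every codeword is a multiple of g(x)
```

The final assertion checks the defining property of the systematic form: adding the remainder to the shifted message yields a polynomial that g(x) divides exactly.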
Parallelization of the circuit means feeding p message bits into the encoder at each time step t. When p message bits arrive in each clock cycle, only k/p clock cycles are required to compute the remainder in the registers. Unfolding is one method of parallelizing a circuit and achieves a high throughput [4]. In the J-unfolded architecture, there are J copies of each node, each with the same function as in the original architecture. Assume that there is a path from node U to node V in the original architecture with w delay elements. Then node $U_i$ is connected to node $V_{(i+w) \bmod J}$ with $\lfloor (i+w)/J \rfloor$ delay elements, where $U_i$ and $V_j$ ($0 \le i, j < J$) are the copies of nodes U and V, respectively. The fan-out problem also exists in the unfolding method, but retiming cannot be applied when the unfolding factor J is larger than the difference between the highest and the second-highest degrees of g(x).
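The unfolding rule stated above can be written as a short routine. The following is a minimal sketch that assumes the data-flow graph is given as a list of (U, V, w) edges; the function name `unfold` and this edge representation are illustrative choices.

```python
def unfold(edges, J):
    """J-unfold a data-flow graph.

    edges: iterable of (U, V, w), meaning a path from node U to node V
           with w delay elements.
    Returns edges ((U, i), (V, (i + w) % J), (i + w) // J) for i = 0..J-1,
    i.e. copy U_i feeds copy V_{(i+w) mod J} through floor((i+w)/J) delays.
    """
    unfolded = []
    for u, v, w in edges:
        for i in range(J):
            unfolded.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return unfolded


# One edge U -> V with a single delay, unfolded by J = 3
for edge in unfold([("U", "V", 1)], 3):
    print(edge)
# (('U', 0), ('V', 1), 0)
# (('U', 1), ('V', 2), 0)
# (('U', 2), ('V', 0), 1)
```

The total number of delay elements over the J copies of an edge equals the original w, so unfolding adds logic but no registers.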
When a J-unfolded BCH encoder is required, the generator polynomial needs to be modified, and the remainder $\mathrm{Rem}(m(x)x^{n-k})_{g(x)}$ in the BCH encoding can be computed by the steps illustrated in Figure 4. The additional hardware increases dramatically when a large unfolding factor J is used [5].
Proposed Tree-type Systolic Array BCH Encoder
The major drawback of the unfolding method is that the generator polynomial must be modified to permit retiming when the difference between its two highest degrees is smaller than J. This problem is easily solved in the systolic array BCH encoder, which needs neither a modified generator polynomial nor additional hardware. Before implementing a p-parallel encoder, we need to know the state of the LFSR at time t+1 given its state at time t. Let $X(t) = [X_0, X_1, \ldots, X_{n-1}]$ denote the state of the registers at time t, and let z(t+1) denote the input bit entering at time t+1. Then the state of the encoder at time t+1 can be represented as:
The first column of matrix F holds the coefficients of the generator polynomial g(x), and Eq. (4) can be written as:
$X(t+1) = F \cdot [X(t) + z(t+1)]$ (5)
Let X(t+p) be the state of the registers at time t+p. From Eq. (5), a recursive equation for X(t+p) can be derived as:

where the p input bits form the vector $[z_0 \; z_1 \; \ldots \; z_{p-1} \mid 0 \ldots 0]^T$, and the matrix $F^p$ can be constructed recursively as i runs from 2 to p. Thus, when a BCH encoder processes a p-bit message in each cycle, it takes k/p cycles to store the parity bits in the registers [7]. Figure 4 illustrates the eight-parallel systolic array BCH encoder for the generator polynomial given in Eq. (2), which serves as our example. Let $X_0(t+8)$ to $X_{14}(t+8)$ represent the values of the registers at time t+8, and let $z_0(t)$ to $z_7(t)$ represent the eight parallel input bits at time t+1. From Eq. (6), we can obtain the values of the registers in the systolic array BCH encoder as follows. The systolic array BCH encoder is almost the same as the conventional serial BCH encoder. In the serial BCH encoder, the output of the rightmost XOR gate is the input to the other XOR gates as well as to the first register. The number of stages in Figure 5 equals the parallel factor p. The positions of the XOR gates in each stage replicate those of the first stage, but the input to every stage comes from the previous stage. This process amounts to a shift operation. After the XOR operation, the output of stage p-1 becomes the input to stage 0, and the process repeats.
Figure 5. Systolic Array Type BCH Encoder
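The p-step state update can be checked numerically against the bit-serial recursion of Eq. (5). The sketch below makes several assumptions that are not taken from the paper: the input bit is added at the highest-degree register, the p-step update has the form X(t+p) = F^p [X(t) + Z] with the p input bits occupying the p highest-degree positions of Z, the state is indexed from the lowest degree upward (so the generator coefficients appear in the last column of F, not the first), and g(x) is the commonly tabulated generator of the (31, 16) code. Matrices are stored as lists of column bitmasks, and all function names are illustrative.

```python
G = 0b1000111110101111   # assumed g(x), degree 15
R = 15                   # number of registers, n - k


def times_x(s):
    """Multiply the state polynomial by x modulo g(x)."""
    s <<= 1
    return s ^ G if (s >> R) & 1 else s


def serial_step(s, z):
    """One cycle of Eq. (5): X(t+1) = F [X(t) + z(t+1)]."""
    return times_x(s ^ (z << (R - 1)))


def mat_vec(cols, v):
    """GF(2) product of a matrix (list of column bitmasks) and a vector."""
    out = 0
    for j, col in enumerate(cols):
        if (v >> j) & 1:
            out ^= col
    return out


def power_of_F(p):
    """Build F^p recursively: F^i = F * F^(i-1) for i = 2..p."""
    F = [times_x(1 << j) for j in range(R)]   # column j is the image of x^j
    Fi = F
    for _ in range(2, p + 1):
        Fi = [mat_vec(F, col) for col in Fi]
    return Fi


def parity_parallel(msg_bits, p):
    """Process p bits per cycle; needs p <= R and p dividing len(msg_bits)."""
    if p > R or len(msg_bits) % p:
        raise ValueError("unsupported parallel factor")
    Fp = power_of_F(p)
    s = 0
    for i in range(0, len(msg_bits), p):
        Z = 0
        for j, z in enumerate(msg_bits[i:i + p]):
            Z |= z << (R - 1 - j)             # Z = [z_0 ... z_{p-1} | 0 ... 0]
        s = mat_vec(Fp, s ^ Z)                # X(t+p) = F^p [X(t) + Z]
    return s


def parity_serial(msg_bits):
    s = 0
    for z in msg_bits:
        s = serial_step(s, z)
    return s


msg = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
assert parity_parallel(msg, 8) == parity_serial(msg)   # 2 cycles instead of 16
```

Under these assumptions the eight-parallel update reaches the same register contents as sixteen serial cycles in k/p = 2 cycles.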
Constructing a systolic array BCH encoder is simple and can be done with the help of Eq. (7). The equation traces the path of a node from one register to another through the different stages. A node is identified by the degree of its register (the column) and the position of its stage among the p parallel stages (the row). First, the conventional serial BCH encoder is drawn. The positions of its XOR gates define the first stage and are repeated up to stage p-1.
where a is the degree of the register, b is the parallel stage, and r is the number of registers between two XOR gates.
For example, to find the path from node $X_2[3]$, with a = 2, b = 3 and r = 2, we obtain node $X_4[5]$ from Eq. (7). The register nodes are shown as dots in Figure 5. The first, unnamed node, which is the feedback of the rightmost XOR gate, is always shifted by one to the next stage. The output of the rightmost XOR gate is the input to all other XOR gates, in the same manner as in the serial BCH encoder. Since the output of stage p-1 is the input to the registers, there is no need to trace that path. Instead of the one message bit per cycle of the serial BCH encoder, the encoder in our example receives one byte of message bits in parallel. This allows several bits to be encoded simultaneously and at a higher speed than with comparable designs.
Since the stages are replicas of the first stage, their number can be increased or decreased to any parallel factor without adding complexity to the circuit. This is the key reason for the flexibility of the design. The only requirement is that the output of stage p-1 is fed back as the input to the first stage.
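The idea of cascading p identical stages can be illustrated with a bit-level software model. This is a rough sketch, not the circuit of Figure 5: each call to `stage` stands for one combinational row with the XOR positions of the serial encoder, and `clock` latches the output of the last row into the registers. The tap positions in `TAPS` assume the commonly tabulated (31, 16) generator polynomial, which may differ from Eq. (2), and all names are illustrative.

```python
# Assumed non-zero coefficients g_i of g(x) below degree 15
TAPS = [0, 1, 2, 3, 5, 7, 8, 9, 10, 11]
R = 15   # number of registers


def stage(x, z):
    """One combinational stage: shift by one position and XOR the feedback
    into the tap positions, exactly as in the serial encoder."""
    feedback = x[R - 1] ^ z
    nxt = [0] + x[:-1]
    for i in TAPS:
        nxt[i] ^= feedback
    return nxt


def clock(x, in_bits):
    """One clock cycle of a p-parallel encoder, p = len(in_bits).
    Stage i takes its input from stage i-1; the last stage feeds the registers."""
    for z in in_bits:
        x = stage(x, z)
    return x


def parity(msg_bits, p):
    """Register contents after the whole message has been fed p bits per cycle."""
    x = [0] * R
    for i in range(0, len(msg_bits), p):
        x = clock(x, msg_bits[i:i + p])
    return x


msg = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1]
assert all(parity(msg, p) == parity(msg, 1) for p in (2, 4, 8, 16))
```

Changing the parallel factor only changes how many times the same stage is replicated per cycle, which mirrors the flexibility argument made above.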
In this paper, we focus on the critical path delay of the systolic array BCH encoder in Figure 5. Since the circuit forms a feedback loop, its speed is limited by the critical path delay. In the eight-parallel systolic array BCH encoder, the longest paths run from $X_0$ to $X_8$, $X_1$ to $X_9$, $X_2$ to $X_{10}$, $X_3$ to $X_{11}$ and $X_4$ to $X_{12}$. Each of them has a critical path of 7T, where T is the delay of one XOR gate. As an example, the route from register $X_1$ to $X_9$ is drawn with a bold line.
During the XOR operation, the signal s(0) is obtained from the value of register $X_1$ and the values of $Z_0$ and $X_{14}$, which gives a critical path of 2T. The signal s(0) is then passed to stage 1, where it is combined with the values of $Z_1$ and $X_{13}$ to create a new signal s(1), and the critical path grows to 3T. Continuing in this way, the total delay along the longest path, for the signal s(5), is 7T. The critical path from register $X_1$ to $X_9$ is shown in detail in Figure 6. To lower the delay, a tree-type structure is applied to the circuit of Figure 6; the resulting circuit is not shown because of its complexity. The critical path of the proposed eight-parallel systolic array BCH encoder is reduced to 5T from 24T, as shown in Figure 7. With the tree-type structure, the delay caused by the cascaded XOR gates is much smaller than that caused by large fan-out [4]. Compared to the encoder in Figure 4, the proposed tree-type encoder is 40% faster.
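The general reason why a tree shortens the delay can be seen by comparing the depth of a cascaded XOR chain with that of a balanced XOR tree over the same operands. The sketch below is a generic illustration only. It does not model the feedback paths of the actual encoder, so it does not reproduce the 5T figure reported above. The function names are illustrative, and depth is counted in units of one XOR gate delay T.

```python
from functools import reduce


def xor_chain(bits):
    """Cascaded XORs: each gate waits for the previous one.
    Returns (value, depth in units of T) for a non-empty operand list."""
    return reduce(lambda a, b: a ^ b, bits), len(bits) - 1


def xor_tree(bits):
    """Balanced tree: operands are paired level by level.
    Returns (value, depth in units of T) for a non-empty operand list."""
    level, depth = list(bits), 0
    while len(level) > 1:
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd operand passes through unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth


operands = [1, 0, 1, 1, 0, 0, 1, 0]
print(xor_chain(operands))  # (0, 7)
print(xor_tree(operands))   # (0, 3)
```

Both arrangements compute the same parity, but the chain depth grows linearly with the number of operands while the tree depth grows logarithmically.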
Performance Results
The BCH (31, 16, 3) encoder is synthesized with Xilinx ISE 10.1 for a Virtex-II Pro XC2VP30 device. This device family provides logic array blocks that accommodate groups of basic logic elements with fast interconnections. The encoder is simulated with the generator polynomial in Eq. (3), which has degree 26 and 13 non-zero terms.
First, we compared the proposed systolic array type BCH encoder for several parallel factors. As shown in Figure 8(a), the number of XOR gates for the serial encoder is just 14, and it increases linearly with the parallel factor p. The 8-, 16- and 32-parallel systolic array BCH encoders use 112, 224 and 448 XOR gates, respectively. Similarly, the number of FPGA slices used by the 16-parallel encoder is 5.2 times that of the serial encoder, whereas the 32-parallel encoder uses 1.8 times as many slices as the 16-parallel one.
Figure 9 shows the simulation waveforms of the J-unfolding method and the systolic array type BCH encoder, which produce the same parity bits in the registers. Both are implemented with the generator polynomial in Eq. (2), which has degree 15 and 9 non-zero terms. Table 2 and Table 3 compare the J-unfolding method, the systolic array and the proposed tree-type systolic array BCH encoders in terms of device utilization and critical path delay, respectively. The J-unfolding method has been implemented without any change to the generator polynomial, i.e., no retiming is applied. The proposed tree-type encoder uses some additional XOR gates, but it remains area-efficient (about 0% of the available area of the XC2VP30 device). In Table 2, the number of slices is slightly increased, which is negligible for a modern FPGA. The maximum operating frequency of the proposed tree-type encoder is around 657.9 MHz.
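The linear growth of the XOR count can be reproduced with a one-line model. This minimal sketch assumes that every stage is an exact replica of the 14-gate serial stage, as the reported figures suggest; the function name `xor_gates` is illustrative.

```python
def xor_gates(p, serial_gates=14):
    """XOR gates in a p-parallel systolic array encoder, assuming each of the
    p stages replicates the serial stage exactly."""
    return serial_gates * p


print([xor_gates(p) for p in (1, 8, 16, 32)])  # [14, 112, 224, 448]
```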
According to Table 3, the critical path of the original systolic array BCH encoder is 2.097 ns. With the tree-type systolic array encoder, the critical path is reduced to 1.49 ns in the VHDL simulation. The proposed encoder is thus about 40% faster than the original systolic array encoder.
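The quoted speed-up follows directly from the two critical path delays. The sketch below assumes that the clock period equals the critical path delay, so that speed scales with the inverse of the delay; the function name `speedup_percent` is illustrative.

```python
def speedup_percent(t_old_ns, t_new_ns):
    """Relative gain in clock rate when the critical path shrinks from
    t_old_ns to t_new_ns, assuming the period equals the critical path."""
    return (t_old_ns / t_new_ns - 1.0) * 100.0


print(round(speedup_percent(2.097, 1.49), 1))  # 40.7
```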
Conclusions
A tree-type systolic array BCH encoder has been proposed. The proposed eight-parallel BCH encoder has been compared with the unfolding method. The results show that the proposed encoder achieves a higher throughput than its counterpart with only a minor increase in area. Moreover, the proposed BCH encoder can be adapted to any parallelization factor without added circuit complexity. The fan-out effect has been disregarded in our experiment.
Retiming can be applied to the proposed encoder without modifying the generator polynomial or increasing the hardware. Future work can be directed toward reducing the fan-out effect in the proposed tree-type systolic array BCH encoder to improve its performance further.
