Error correction coding (ECC) has become one of the most important tasks of flash memory controllers. The gate count of the ECC unit is taking up a significant share of the overall logic. Scaling the ECC strength to the growing error correction requirements has become increasingly difficult when considering cost and area limitations.
Introduction
The importance of flash memories as non-volatile mass storage is continuously increasing. A flash memory consists of floating gates in which the information is stored. These floating gates keep their state without a power supply. However, errors occur while the information is read. Hence, an error correction coding (ECC) unit is required. The error rate depends on the storage density, the flash technology used, and the number of read and write cycles [1]. Flash memory chips using single-level cell (SLC) technology, where each cell stores one binary digit, are the most reliable. Multi-level cells (MLC) can store multiple binary digits. However, in the literature the term MLC typically refers to memories that store two bits per cell, whereas TLC (triple-level cell) chips store three bits per cell.
The ECC unit of a flash controller is embedded within the controller. The relationship between the ECC unit and the NAND flash is depicted in Fig. 1. In addition of correctable errors is large [10, 11].

In Section 4, we present a concept for the BMA that enables a better trade-off between throughput and area. The new concept requires both a parallel and a serial implementation of the BMA, but the parallel implementation is only used to correct the most likely errors. Hence, it requires fewer multipliers and fewer decoding cycles than a fully parallel implementation. If, however, the number of errors is larger than the error correcting capability of the parallel BMA, we use a serial implementation for error correction. The required area of a parallel implementation is dominated by the number of multipliers. In Section 5, we analyze the average number of decoding cycles and show that a carefully chosen value of the correcting capability of the parallel BMA allows reducing the required area without much effect on the throughput. This mixed serial/parallel approach can also be applied to the following decoding step, i.e. the Chien search. Section 6 presents such an extension to a Chien search with mixed parallelization degrees and an optimization for both decoding steps. The whole design has been realized with Verilog HDL and verified on an FPGA. The corresponding synthesis results demonstrate a significant reduction of the number of logic elements, where the average number of decoding cycles can even be smaller than with a fully parallel implementation.
BCH encoding
The presented simulation results indicate that BCH codes are a suitable solution for the next generation of flash memories. However, these flash technologies will have significantly higher raw bit error rates. Moreover, the diversification of the flash technologies is likely to increase. This motivates configurable encoding and decoding schemes for BCH codes with high error correcting capability. In the remainder of this paper, we propose such a flexible BCH encoding/decoding architecture, where we start with a configurable encoder structure in this section.
The design of parallelized BCH encoder/decoder units is described for instance in [9, 12]. We also consider the design of a parallel BCH error correction unit. However, the proposed design is configurable for different error correction capabilities. A codeword of an $(n, k, t)$ BCH code $C$ consists of $n$ bits $v_0, \ldots, v_{n-1}$ which are calculated from $k$ information bits $u_0, \ldots, u_{k-1}$. For cyclic codes, the corresponding vectors $(v_0, \ldots, v_{n-1})$ and $(u_0, \ldots, u_{k-1})$ are represented by polynomials
$$v(x) = v_0 + v_1 x + \cdots + v_{n-1} x^{n-1}, \qquad u(x) = u_0 + u_1 x + \cdots + u_{k-1} x^{k-1}.$$
Usually, the codeword is calculated as
$$v(x) = u(x) \cdot g(x),$$
where $g(x)$ is the so-called generator polynomial of degree $n - k$. More precisely, $g(x)$ is the lowest-degree polynomial over the Galois field $GF(2)$ having $\alpha, \alpha^2, \ldots, \alpha^{2t}$ as its roots. Here $\alpha$ is a root of the primitive polynomial that generates the Galois field $GF(2^m)$ and $t$ is the number of correctable errors. With BCH codes the generator polynomial ensures the following condition for the syndrome of a codeword
$$S_i = v(\alpha^i) = 0, \quad i = 1, \ldots, 2t.$$
For encoding of cyclic codes, like BCH codes, an information polynomial $u(x)$ is multiplied with a generator polynomial $g(x)$. This can be implemented with a linear feedback shift register (LFSR) whose taps correspond to the coefficients of the generator polynomial (see Fig. 2). The encoder shifts the information bitwise into the LFSR. After encoding, the register contains the parity bits $s(x)$. The length of the generator polynomial depends on the alphabet size and on the number of minimal polynomials used, which determines the number of correctable bit errors $t$ (error correction capability).
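As a minimal sketch of the bit-serial LFSR encoder described above, the following Python fragment assumes systematic encoding, i.e. that the register finally holds the remainder of $u(x) x^{n-k}$ modulo $g(x)$. It uses the small (15, 7, 2) BCH code with $g(x) = x^8 + x^7 + x^6 + x^4 + 1$ purely for illustration. The code parameters and all function names are assumptions of this sketch and are not taken from the presented design.

```python
# Hypothetical example code: (15, 7, 2) BCH code, not the code used in the paper.
G = 0b111010001            # g(x) = x^8 + x^7 + x^6 + x^4 + 1
DEG = 8                    # degree of g(x) = number of LFSR stages
MASK = (1 << DEG) - 1


def lfsr_parity(info_bits):
    """Shift the information bits (highest-degree coefficient first) into the LFSR."""
    reg = 0
    for bit in info_bits:
        # The new bit is XORed with the bit leaving the register.
        feedback = bit ^ ((reg >> (DEG - 1)) & 1)
        reg = (reg << 1) & MASK
        if feedback:
            reg ^= G & MASK          # taps = coefficients of g(x) below x^DEG
    return reg


def poly_mod(value, g=G):
    """Remainder of a GF(2) polynomial (stored as an integer) modulo g(x)."""
    while value.bit_length() >= g.bit_length():
        value ^= g << (value.bit_length() - g.bit_length())
    return value


def encode(info_bits):
    """Systematic codeword: information bits followed by the parity bits."""
    u = 0
    for bit in info_bits:
        u = (u << 1) | bit
    return (u << DEG) | lfsr_parity(info_bits)


codeword = encode([1, 0, 1, 1, 0, 0, 1])
print(format(codeword, "015b"), "remainder:", poly_mod(codeword))
```

Under these assumptions every codeword produced by `encode` is divisible by $g(x)$, which is what `poly_mod` checks.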
Depending on the requirements, this encoding sequence can be parallelized [13]. If a high throughput has to be achieved, a higher parallelization is required. However, the higher the parallelization, the higher the number of required gates. In a partially parallel encoder the number of information bits processed per cycle can be chosen arbitrarily. A useful choice for the parallelization degree is the size of a byte or a word.
As already mentioned, the parity bits s(x) can be calculated by multiplication of the information u(x) by the generator polynomial g(x).
Reference [14] describes how the decomposition of $s(x)$ for $m$ different extensions $s'(x)$ can be formulated as
where $l$ is the degree of $g(x)$. In Fig. 3, $b(x)$ is the information fragment of $m$ bits that has to be processed in parallel. Figure 2 shows how each new information bit that is shifted into the LFSR is XORed with the last remaining bit in the shift register.
Reference [14] introduces the simplification
for a new information fragment. The first summand inserts the new information bits into the parity calculation and the second summand represents the feedback of the already calculated parity. The integration of the new information bits is done by multiplication of these information bits with the generator polynomial. This means that for the parallel calculation each information bit is combined with the generator polynomial shifted according to the position of the bit. The first information bit is combined with the generator polynomial shifted by zero, the second information bit is combined with $g(x)$ shifted by one, and so on. The generator polynomial fractions are static for a given $g(x)$. Therefore, a bitmask can be created for each parity bit which is combined bitwise with the information $t(x)$ by AND gates. The resulting bits are combined by an $m$-input XOR gate (addition in the Galois field $GF(2)$). There are exactly $\deg(g(x))$ bitmasks of size $m$. Each bitmask column represents a logic circuit for one LFSR tap. A schematic of the parallel encoder is depicted in Fig. 3, where the orange blocks represent the tap logic.

A configurable bit-wise BCH encoder was presented in [15], where multiplexers are used to configure the LFSR taps for the different generator polynomials. Similarly, a configurable parallel encoder can be achieved by implementing the logic circuits for different BCH codes. During operation of the encoder unit, the required logic is selected with a multiplexer for each register tap.
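The bitmask idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the presented hardware: the masks are derived by simulating the bit-serial LFSR on unit inputs (which is valid because the LFSR is linear over $GF(2)$), the feedback of the old register content is handled by a second set of masks, and the (15, 7, 2) example code, the parallelization degree `M = 4` and all names are hypothetical.

```python
DEG = 8                 # degree of g(x), number of LFSR taps
G_LOW = 0b11010001      # g(x) = x^8 + x^7 + x^6 + x^4 + 1 without the x^8 term
M = 4                   # parallelization degree (hypothetical choice)


def serial_step(reg, bit):
    """One clock cycle of the bit-serial LFSR (reference model)."""
    feedback = bit ^ ((reg >> (DEG - 1)) & 1)
    reg = (reg << 1) & ((1 << DEG) - 1)
    return reg ^ G_LOW if feedback else reg


def serial_block(reg, bits):
    for bit in bits:
        reg = serial_step(reg, bit)
    return reg


def build_masks(m):
    """One information mask and one feedback mask per LFSR tap."""
    info_resp = [serial_block(0, [int(j == i) for j in range(m)]) for i in range(m)]
    reg_resp = [serial_block(1 << i, [0] * m) for i in range(DEG)]
    info_masks = [sum(((info_resp[i] >> tap) & 1) << i for i in range(m))
                  for tap in range(DEG)]
    reg_masks = [sum(((reg_resp[i] >> tap) & 1) << i for i in range(DEG))
                 for tap in range(DEG)]
    return info_masks, reg_masks


def xor_reduce(value):
    """Multi-input XOR gate: parity of the set bits."""
    return bin(value).count("1") & 1


def parallel_step(reg, fragment, masks):
    """Process m information bits at once: AND with the masks, then XOR."""
    info_masks, reg_masks = masks
    new_reg = 0
    for tap in range(DEG):
        bit = xor_reduce(fragment & info_masks[tap]) ^ xor_reduce(reg & reg_masks[tap])
        new_reg |= bit << tap
    return new_reg


def parallel_parity(info_bits, m=M):
    # Leading zeros do not change the remainder, so pad to a multiple of m.
    padded = [0] * (-len(info_bits) % m) + list(info_bits)
    masks = build_masks(m)
    reg = 0
    for start in range(0, len(padded), m):
        chunk = padded[start:start + m]
        fragment = sum(bit << i for i, bit in enumerate(chunk))
        reg = parallel_step(reg, fragment, masks)
    return reg


bits = [1, 0, 1, 1, 0, 0, 1]
print(parallel_parity(bits) == serial_block(0, bits))
```

In this sketch the masks are computed once per generator polynomial, so supporting several BCH codes would amount to storing one mask set per code and selecting between them.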
BCH decoding
The received word $r(x)$ is usually not a codeword if errors occurred. The errors can be represented by an error polynomial $e(x)$ with $e_i = 1$ if position $i$ is in error and $e_i = 0$ otherwise. Hence, the received word is $r(x) = v(x) + e(x)$. The syndrome can be calculated based on the received word, but depends only on the error polynomial
$$S_i = r(\alpha^i) = v(\alpha^i) + e(\alpha^i) = e(\alpha^i), \quad i = 1, \ldots, 2t.$$
If the number of errors $e$ is less than or equal to $t$, this condition results in a set of equations (the so-called key equation) that has a unique solution. This solution is usually calculated in the form of the error location polynomial
$$\sigma(x) = \sigma_0 + \sigma_1 x + \cdots + \sigma_e x^e,$$
where the roots indicate the locations of errors, i.e. $\sigma(\beta_l^{-1}) = 0$ if there is an error in position $l$. The calculation of the error location polynomial is usually implemented based on the Berlekamp-Massey algorithm (BMA) [16]. The Chien search locates the errors by checking $\sigma(\alpha^{-l}) = 0$ for all $0 \le l \le n - 1$ [17]. Hence, the Chien search calculates an estimated error polynomial $\hat{e}(x)$ with $\hat{e}_l = 1$ if $\sigma(\alpha^{-l}) = 0$ and $\hat{e}_l = 0$ otherwise.
In summary, algebraic decoding of BCH codes consists of the following four steps which are usually realized in separate modules in a hardware implementation (see e.g. [18]):
(1) Calculating the syndrome
(2) Calculating the error location polynomial (BMA)
(3) Evaluation of the error location polynomial (Chien search)
(4) Correcting the erroneous positions
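The first of these steps can be sketched in a few lines of Python. The sketch assumes the small field $GF(2^4)$ generated by the primitive polynomial $x^4 + x + 1$ and the (15, 7, 2) BCH code with $g(x) = x^8 + x^7 + x^6 + x^4 + 1$. These parameters and all names are illustrative assumptions, far smaller than the codes considered for flash memories.

```python
def build_gf16():
    """Exponent and logarithm tables of GF(2^4) for x^4 + x + 1 (assumed field)."""
    exp, log = [0] * 15, [0] * 16
    value = 1
    for i in range(15):
        exp[i], log[value] = value, i
        value <<= 1
        if value & 0x10:
            value ^= 0b10011
    return exp, log


EXP, LOG = build_gf16()


def syndromes(received, t):
    """S_i = r(alpha^i) for i = 1..2t; received[l] is the coefficient of x^l."""
    result = []
    for i in range(1, 2 * t + 1):
        s = 0
        for position, bit in enumerate(received):
            if bit:
                s ^= EXP[(i * position) % 15]   # add alpha^(i * position)
        result.append(s)
    return result


# g(x) itself is a codeword: coefficients of x^0, x^4, x^6, x^7, x^8 are one.
codeword = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
received = codeword[:]
received[3] ^= 1                                 # single error in position 3

print(syndromes(codeword, 2))
print(syndromes(received, 2))
```

For the error-free word all syndrome values vanish, while for the corrupted word the values depend only on the error position, in line with $S_i = e(\alpha^i)$.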
The basic decoder architecture is illustrated in Fig. 4. The computational complexity, and therefore the area required for a hardware implementation, is dominated by steps 2 and 3. Both decoding steps can be realized with serial and parallel implementations. A parallel implementation achieves larger throughput but also requires more area. With a BCH code that has $n$ code bits the Chien search consists of $n$ steps which can be parallelized arbitrarily. Commonly, parallel implementations of the BMA are used because of the higher throughput [6, 7, 12, 19]. An overview of different parallel implementations is provided in reference [9]. Reference [9] also presents a comparison of the computational complexity of the different realizations. Accordingly, a parallel implementation for a $t$ error correcting code requires at least $2t$ multipliers and $2t$ iterations. Serial implementations of the Berlekamp-Massey algorithm require a significantly smaller number of multipliers, but increase the number of iterations. The first serial architecture for the BMA was introduced by Blahut [10]. The serial implementation according to reference [11] requires only 3 multipliers, but $2t^2$ iterations. To our knowledge, mixed serial/parallel implementations do not exist.
In the following section we present a concept for a BMA implementation that enables a better trade-off between throughput and area. The new concept requires both a parallel and a serial implementation of the BMA, but the parallel implementation is only used to correct up to $t_1 < t$ errors. Hence, it requires only $2t_1$ multipliers and $2t_1$ iterations. If, however, the number of errors is larger than $t_1$, we use a serial implementation for error correction. The required area of a parallel implementation is dominated by the number of multipliers. Therefore, a carefully chosen value of $t_1$ allows reducing the required area without much effect on the throughput.
Two-stage BMA
The basic idea of this work is to implement the parallel BMA only for error correction up to $t_1 < t$ errors. The decoder therefore utilizes only the first $2t_1$ coefficients of the syndrome. If the number of errors $e$ is smaller than or equal to $t_1$, the decoding results in the correct codeword $v(x)$. However, for $t_1 < e \le t$ the parallel decoder may fail or may result in a vector $\hat{v}(x) \notin C$. Therefore, we extend the parallel BMA to detect whether the BMA has obtained a valid solution. In the case of a failure of the parallel BMA, the decoder uses a serial implementation of the BMA to compute the error location polynomial. The decoder architecture for the proposed two-stage BMA is illustrated in Fig. 5.
Next we consider the method to verify whether the parallel BMA has obtained a valid solution. Note that the condition
$$v(\alpha^i) = 0, \quad i = 1, \ldots, 2t_1 \qquad (7)$$
defines a new code $C_1$ which is a supercode of the code $C$, i.e. $C \subset C_1$. If the parallel decoder finds a solution to the key equation, the corresponding estimated codeword $\hat{v}(x)$ satisfies Equation (7). Therefore, the parallel decoding can result in a codeword $\hat{v}(x) \in C_1$, $\hat{v}(x) \notin C$. The mixed implementation therefore requires some additional means to ensure $\hat{v}(x) \in C$. In the following we discuss two methods to check whether the vector $\hat{v}(x)$ is a codeword of the code $C$.
A straightforward method is to implement a syndrome calculation for the remaining syndrome positions in parallel to the Chien search. The syndrome calculation checks
$$\hat{v}(\alpha^i) = 0, \quad i = 2t_1 + 1, \ldots, 2t.$$
If this condition is satisfied, $\hat{v}(x)$ is the correct codeword. Otherwise the decoding starts over using the serial implementation for the complete syndrome. The second method results in a small modification of the BMA. During every iteration of the BMA the so-called discrepancy
is calculated. Here $j = 0, 1, \ldots$ denotes the iteration and $L^{(j)}$ is the degree of the error location polynomial in the $j$-th iteration. If the number of errors is less than or equal to $t_1$, the discrepancy is zero for all iterations with $j \ge 2t_1 - 1$. Therefore, if the discrepancy $\Delta^{(j)}$ is equal to zero for all iterations $j = 2t_1 - 1, \ldots, 2t - 1$, we have $\hat{v}(x) \in C$.
Because the calculation of the discrepancy is part of the BMA, we can use the hardware of the parallel implementation, where for the iterations $j \ge 2t_1 - 1$ we only calculate the discrepancy and omit the update of the error location polynomial. Hence, the parallel implementation of the modified BMA only requires $2t_1$ multipliers. In fact it follows from the results of Chen [20] that only a subset of the syndrome equations has to be satisfied. It is sufficient to check the discrepancy values for $j = 2t_1 - 1, \ldots, t + t_1 - 1$. The BMA requires $2t_1$ iterations and in addition $t - t_1$ iterations are necessary to check whether the vector $\hat{v}(x)$ is a codeword of the code $C$. For binary BCH codes, the discrepancy values of the odd-numbered iterations are always zero. Therefore, the BMA only requires $t_1$ iterations and $\lfloor (t - t_1)/2 \rfloor$ additional iterations to check the discrepancies.
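The control flow of the two-stage approach can be sketched in software as follows. This is only a functional model under several assumptions: it uses a textbook Berlekamp-Massey iteration over $GF(2^4)$ for both stages (the area and cycle differences between the parallel and serial hardware are not modelled), it indexes the syndromes from zero, so its iteration index may be offset from the one used in the text, and it conservatively checks all remaining discrepancies instead of the reduced range. The field, the example with $t = 3$ and $t_1 = 1$, and all names are hypothetical.

```python
def build_gf16():
    exp, log = [0] * 15, [0] * 16
    value = 1
    for i in range(15):
        exp[i], log[value] = value, i
        value = (value << 1) ^ (0b10011 if value & 0x8 else 0)
    return exp, log


EXP, LOG = build_gf16()


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 15]


def inv(a):
    return EXP[(15 - LOG[a]) % 15]


def discrepancy(sigma, S, j):
    """Discrepancy for 0-based syndrome index j (S[j] stands for S_{j+1})."""
    d = S[j]
    for i in range(1, len(sigma)):
        d ^= mul(sigma[i], S[j - i])
    return d


def berlekamp_massey(S):
    """Textbook BMA; returns sigma as a coefficient list, lowest degree first."""
    C, B, L, m, b = [1], [1], 0, 1, 1
    for j in range(len(S)):
        d = discrepancy(C, S, j)
        if d == 0:
            m += 1
            continue
        T, coef = C[:], mul(d, inv(b))
        C = C + [0] * (len(B) + m - len(C))
        for i, coeff_b in enumerate(B):
            C[i + m] ^= mul(coef, coeff_b)       # C(x) += d/b * x^m * B(x)
        if 2 * L <= j:
            L, B, b, m = j + 1 - L, T, d, 1
        else:
            m += 1
    return C


def two_stage_bma(S, t, t1):
    """Stage 1 uses 2*t1 syndromes; fall back to all 2*t if a check fails."""
    sigma = berlekamp_massey(S[:2 * t1])
    if all(discrepancy(sigma, S, j) == 0 for j in range(2 * t1, 2 * t)):
        return sigma, "parallel"
    return berlekamp_massey(S), "serial"


def syndromes_from_errors(positions, t):
    return [eval_sum(positions, i) for i in range(1, 2 * t + 1)]


def eval_sum(positions, i):
    s = 0
    for pos in positions:
        s ^= EXP[(i * pos) % 15]
    return s


def error_positions(sigma):
    """Positions l with sigma(alpha^-l) = 0 (simple exhaustive search)."""
    found = []
    for pos in range(15):
        x, acc = EXP[(-pos) % 15], 0
        for c in reversed(sigma):
            acc = mul(acc, x) ^ c
        if acc == 0:
            found.append(pos)
    return found


for errors in ([4], [2, 7, 11]):
    sigma, stage = two_stage_bma(syndromes_from_errors(errors, 3), 3, 1)
    print(errors, stage, error_positions(sigma))
```

Because a run of zero discrepancies leaves the error location polynomial unchanged, the first stage returns the same polynomial as the full algorithm whenever its check passes.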
Average number of iterations
The number of iterations with the parallel and serial BMA is constant and determined by the error correction capability of the code. With the mixed serial/parallel implementation, in contrast, the number of iterations depends on the number of errors in the received word. In this section we estimate the average number of iterations for the binary symmetric channel (BSC). Furthermore, we demonstrate that a proper choice of $t_1$ leads to an average number of iterations that is smaller than with a fully parallel implementation.
Let $P(t_1, p)$ denote the conditional probability that the number of errors is greater than $t_1$ given that at least one error occurred. The conditional probability $P(t_1, p)$ depends on $t_1$ and the channel crossover probability $p$ of the BSC. The average number of iterations is
for binary BCH codes and
for non-binary codes. For the binary symmetric channel, $P(t_1, p)$ can be approximated as
$$P(t_1, p) \approx \frac{\sum_{i=t_1+1}^{n} \binom{n}{i} p^i (1-p)^{n-i}}{1 - (1-p)^n},$$
where the numerator is the probability that the number of errors is larger than $t_1$ and the denominator is the probability that at least one error occurs in a word of $n$ bits.
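To make this ratio concrete, the following Python sketch evaluates the binomial tail probability over the probability of at least one error for a BSC. The summation is carried out in the log domain to avoid overflow for long words. The values `n = 8960` and `p = 1e-3` are placeholders chosen for illustration and are not the parameters of Table 1, and the function names are hypothetical.

```python
import math


def log_binomial_pmf(n, i, p):
    """Logarithm of C(n, i) * p^i * (1 - p)^(n - i)."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))


def prob_more_than(n, t1, p):
    """Probability of more than t1 errors in a word of n bits."""
    return sum(math.exp(log_binomial_pmf(n, i, p)) for i in range(t1 + 1, n + 1))


def conditional_p(n, t1, p):
    """P(t1, p): more than t1 errors, given that at least one error occurred."""
    at_least_one = -math.expm1(n * math.log1p(-p))   # 1 - (1 - p)^n
    return prob_more_than(n, t1, p) / at_least_one


n, p = 8960, 1e-3          # placeholder word length and crossover probability
for t1 in (0, 8, 16, 24, 32):
    print(t1, conditional_p(n, t1, p))
```

The printed values show how quickly the probability of needing the serial stage falls as $t_1$ grows, which is the effect exploited when choosing $t_1$.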
The average number of iterations should satisfy
$$N \le t \qquad (13)$$
for binary BCH codes and
$$N \le 2t \qquad (14)$$
for non-binary codes, to ensure that the average throughput is not reduced.
To illustrate the performance of the mixed serial/parallel implementation we consider a typical example for flash memories. To provide a sector size of 1 kbyte, a code dimension of $k = 8288$ is required, which includes the data bits plus some additional bits used by the flash controller. The error correction should ensure a word error rate of less than $10^{-16}$. We consider three different BCH codes with different error correction capabilities. The parameters of the codes are summarized in Table 1. Because the crossover probability of flash memories deteriorates with an increasing number of read/write cycles, we consider the maximum tolerable crossover probability. In Table 1 the value $p_t$ denotes the maximum crossover probability so that the word error rate is less than $10^{-16}$ for a given error correction capability. The value $t_{1,\mathrm{opt}}$ is the value of $t_1$ that minimizes the average number of iterations for the maximum crossover probability $p_t$. For the optimal value of $t_1$ the average number of iterations is smaller than $t$. The improvement is illustrated in Fig. 6, which depicts the reduction of the number of iterations $t - N$ versus $t_1$ compared to the fully parallel implementation.

The whole design has been realized with Verilog HDL and verified on a Xilinx Virtex4 (xc4vlx200) FPGA. For this target a 24-bit parallel BMA requires 11069 lookup tables (LUT), while the serial implementation for 48 bits only requires 3462 LUT. In total 14531 LUT are required to implement the mixed serial/parallel BMA for 48 bit errors, whereas an implementation of the parallel BMA for 48 bit errors requires 29655 LUT. Hence, with the mixed serial/parallel implementation the number of lookup tables for the BMA is reduced by 51% without any reduction of the average throughput compared with a parallel implementation.
Chien search
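As already described, the Chien search checks all error positions $0 \le l \le n - 1$ by testing whether the equation $\sigma(\alpha^{-l}) = 0$ is fulfilled. This can be implemented in different parallel forms. In a parallel Chien search the errors for $m$ positions can be checked simultaneously by evaluating
$$\sigma(\alpha^{-(l+j)}) = \sum_{i=0}^{t} \sigma_i \alpha^{-i(l+j)}, \quad j = 0, \ldots, m - 1. \qquad (15)$$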
This needs $tm$ multiplications and $(t + 1)m$ additions, where $t$ is the error correction capability.
In this section, we propose a Chien search that uses two different parallelization degrees. A faster Chien search is used if the detected number of errors is less than or equal to $t_1$. This fast search calculates $m_1$ positions in parallel. A slower calculation with error correction capability $t$ and parallelization degree $m < m_1$ is used if the number of errors is greater than $t_1$.
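We illustrate this concept with a simple example with $t = 4$ and $t_1 = 2$. For instance, for a parallel implementation with parallelization degree $m = 4$ the evaluation of $\sigma(\alpha^0), \sigma(\alpha^{-1}), \sigma(\alpha^{-2}), \sigma(\alpha^{-3})$ has to be realized in hardware. For the next iteration, the coefficients of the error location polynomial are multiplied by powers of $\alpha^{-4}$, which results in the new polynomial
$$\sigma'(x) = \sigma_0 + \sigma_1 \alpha^{-4} x + \sigma_2 \alpha^{-8} x^2 + \sigma_3 \alpha^{-12} x^3 + \sigma_4 \alpha^{-16} x^4.$$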
This polynomial is used as error location polynomial in the next iteration. Therefore, in the second iteration $\sigma(\alpha^{-4}), \sigma(\alpha^{-5}), \sigma(\alpha^{-6}), \sigma(\alpha^{-7})$ are evaluated. For a combined solution with $m_1 = 4$ and $m = 2$ the coefficients in Fig. 7 are needed. Each row in Fig. 7 represents one instance of Eq. (15). The first mode (correction capability $t_1 = 2$ and parallelization degree $m_1 = 4$) needs the coefficients $\sigma_1, \sigma_2$ and the powers of $\alpha$ in the corresponding columns from all four rows. The second mode (with parameters $t = 4$ and $m = 2$) needs the first two rows and all four columns. This means that multiplications with $\alpha^{-3}$ and $\alpha^{-4}$ are used in both modes, but with different coefficients of the error location polynomial. To use the corresponding multipliers in both modes, the coefficients can be selected using multiplexers. The mode selection and the multiplexing of the coefficients are illustrated in Fig. 8.
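A software model of one Chien search mode may help to follow this scheme. The Python sketch below evaluates `m` positions per iteration with constant multipliers $\alpha^{-ji}$ and then scales coefficient $i$ by $\alpha^{-mi}$ for the next iteration. It assumes $GF(2^4)$ with the primitive polynomial $x^4 + x + 1$ and $n = 15$, builds the error location polynomial directly from chosen error positions, and does not model the multiplexed sharing of multipliers between the two modes. All names are hypothetical.

```python
def build_gf16():
    exp, log = [0] * 15, [0] * 16
    value = 1
    for i in range(15):
        exp[i], log[value] = value, i
        value = (value << 1) ^ (0b10011 if value & 0x8 else 0)
    return exp, log


EXP, LOG = build_gf16()


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 15]


def locator_from_positions(positions):
    """sigma(x) as the product of (1 + alpha^l * x) over the error positions l."""
    sigma = [1]
    for pos in positions:
        extended = sigma + [0]
        for i, c in enumerate(sigma):
            extended[i + 1] ^= mul(c, EXP[pos % 15])
        sigma = extended
    return sigma


def chien_parallel(sigma, n, m):
    """Check m positions per iteration; returns (error positions, iterations)."""
    coeffs, found, iterations = sigma[:], [], 0
    for base in range(0, n, m):
        iterations += 1
        for j in range(m):                      # m evaluation units in parallel
            if base + j >= n:
                break
            acc = 0
            for i, c in enumerate(coeffs):
                acc ^= mul(c, EXP[(-j * i) % 15])   # constant multiplier alpha^(-j*i)
            if acc == 0:
                found.append(base + j)
        # Prepare the next iteration: coefficient i is scaled by alpha^(-m*i).
        coeffs = [mul(c, EXP[(-m * i) % 15]) for i, c in enumerate(coeffs)]
    return found, iterations


fast = chien_parallel(locator_from_positions([3, 10]), 15, 4)         # up to t1 = 2 errors
slow = chien_parallel(locator_from_positions([1, 6, 9, 13]), 15, 2)   # up to t = 4 errors
print(fast, slow)
```

In this model the fast mode needs about half as many iterations as the slow mode, matching the ratio of the two parallelization degrees.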
The modification of the error location polynomial for the next iteration is implemented for $m_1 = 4$. Hence the mode for $m = 2$ evaluates only one half of all possible positions. This is resolved by running the Chien search twice: first with the initial $\sigma(x)$ and in the second run with a modified error location polynomial
$$\tilde{\sigma}(x) = \sigma_0 + \sigma_1 \alpha^{-2} x + \sigma_2 \alpha^{-4} x^2 + \sigma_3 \alpha^{-6} x^3 + \sigma_4 \alpha^{-8} x^4,$$
where all coefficients are multiplied with powers of $\alpha^{-2}$. Therefore, the Chien search for $m = 2$ requires twice the number of iterations. Alternatively, a mode selection for the modification of the error location polynomial can be implemented by multiplexing the required powers of $\alpha^{-m_1}$ or $\alpha^{-m}$, respectively.

Similar to the analysis in Section 5, we can estimate the average number of iterations for the Chien search. The number of errors is estimated by the BMA. Therefore, the decoder can select the fast or the slow Chien search based on the degree of the error location polynomial. The Chien search requires $n/m_1$ iterations in the fast mode and $n/m$ in the slow mode. Therefore, we obtain the expected value of
$$(1 - P(t_1, p)) \frac{n}{m_1} + P(t_1, p) \frac{n}{m}$$
for non-binary codes. However, for the Chien search the number of gates required to realize a multiplier varies. Hence, the number of multipliers allows only a rough estimate of the actual area reduction. Therefore, the whole design was implemented with Verilog HDL and verified on a Xilinx Virtex4 (xc4vlx200) FPGA. For this target a total of 14531 LUT are required to implement the mixed serial/parallel BMA for 48 bit errors, whereas an implementation of the parallel BMA for 48 bit errors requires 29655 LUT. Similarly, for the Chien search the number of LUT is reduced from 29293 to 21831. Hence, with the proposed implementation the total number of lookup tables for both decoding steps is reduced by 38% without any reduction of the average throughput compared with a fully parallel implementation.
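The reported percentages can be retraced with a few lines of Python. The LUT counts are the synthesis figures quoted in the text. The helper name is hypothetical, and the assumption that the 38% figure refers to the BMA and the Chien search taken together is inferred from the numbers.

```python
def reduction_percent(before, after):
    """Relative saving in percent."""
    return 100.0 * (before - after) / before


bma_parallel = 29655                 # fully parallel BMA, 48 bit errors
bma_mixed = 11069 + 3462             # 24-bit parallel stage + serial stage
chien_parallel = 29293               # fully parallel Chien search
chien_mixed = 21831                  # Chien search with two parallelization degrees

print("BMA LUT:", bma_mixed, "saving %.1f%%" % reduction_percent(bma_parallel, bma_mixed))
print("BMA + Chien saving %.1f%%" % reduction_percent(
    bma_parallel + chien_parallel, bma_mixed + chien_mixed))
```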
Conclusions
In this work, we have presented a new concept for a BMA hardware implementation that enables a better trade-off between throughput and space complexity. The new concept requires both a parallel and a serial implementation of the BMA, but the parallel implementation is only used to correct up to $t_1 < t$ errors, where $t_1$ is typically much smaller than $t$. The new architecture only requires $2t_1 + 3$ multipliers compared to $2t$ multipliers with a parallel implementation. A carefully chosen value of $t_1$ allows reducing the required area without much effect on the average throughput.
The presented examples demonstrate that in some cases the number of multipliers can be reduced by more than 50 percent in the BMA and up to 40 percent in the Chien search. Moreover, the average number of cycles with the mixed implementation is even smaller than with a fully parallel implementation, because the serial operation is only required in the rare event of an error pattern with more than $t_1$ errors.
In general, it is expected that the bit error rate (BER) in flash memories grows with the number of read/write cycles [2]. At the beginning of a flash memory life cycle the bit error rate is low, so that the error location polynomial can be determined quickly using only the first parallel stage of the proposed architecture. Program and erase cycles cause permanent damage which increases the BER towards the end of the life cycle. Therefore, the serial implementation is probably only needed at the end of the life cycle of the flash memory.
SLC and MLC technologies require rather different error correcting capabilities. In [15] an adaptive rate ECC scheme with BCH codes was proposed. In particular, a flexible BCH encoder structure was presented that supports different BCH code rates. The new decoder structure could easily be combined with the approach from [15] to support different BCH codes with a large range of error correcting capabilities.
