NAND flash memories represent a key storage technology for solid-state storage systems. However, they suffer from serious reliability and endurance issues that must be mitigated by the use of proper error correction codes. This paper proposes the design and implementation of an optimized Bose-Chaudhuri-Hocquenghem hardware codec core able to adapt its correction capability in a range of predefined values. Code adaptability makes it possible to efficiently trade off in-field reliability and code complexity. This feature is very important considering that the reliability of a NAND flash memory continuously decreases over time, meaning that the required correction capability is not fixed during the life of the device. Experimental results show that the proposed architecture saves resources when the device is in the early stages of its lifecycle, while introducing a limited overhead in terms of area.
Introduction

NAND flash memories are a widespread technology for the development of compact, low-power, low-cost and high data throughput mass storage systems for consumer/industrial electronics and mission critical applications. Manufacturers are pushing flash technologies into smaller geometries to further reduce the cost per unit of storage. This includes moving from traditional single-level cell (SLC) technologies, able to store a single bit of information, to multi-level cell (MLC) technologies, storing more than one bit per cell.
The strong transistor miniaturization and the adoption of an increasing number of levels per cell introduce serious issues related to yield, reliability, and endurance [? ? ? ? ?]. Error correction codes (ECCs) must therefore be systematically applied. ECCs are a cost-efficient technique to detect and correct multiple errors [?]. Flash memories support ECCs by providing spare storage cells dedicated to system management and parity bit storage, while delegating the actual implementation to the application designer [? ?]. Choosing the correction capability of an ECC is a trade-off between reliability and code complexity. It is therefore a strategic decision in the design of a flash-based storage system. A wrong choice may either overestimate or underestimate the required redundancy, with the risk of missing the target failure rate. In fact, the reliability of a NAND flash memory continuously decreases over time, since program and erase operations are somewhat destructive. At the early stage of their lifetime, devices have a reduced error rate compared to intensively used devices [?]. Therefore, designing an ECC system whose correction capability can be modified in-field is an attractive solution to adapt the correction scheme to the reliability requirements the flash encounters during its lifetime, thus maximizing performance and reliability.
This paper proposes the hardware implementation of an optimized adaptable Bose-Chaudhuri-Hocquenghem (BCH) codec core for NAND flash memories and a related framework for its automatic generation.
is motivated by the fact that contemporary high-density MLC flash memories require a more powerful error correction capability, and, at the same time, they have to meet more demanding requirements in terms of read/write latency.
Given this premise, we will tackle a BCH hardware implementation for encoding and decoding tasks. In particular, the main contribution of the proposed architecture is its adaptability. It enables in-field selection of the desired correction capability, coupled with high optimization that minimizes the required resources. Experimental results compare the proposed architecture with typical BCH codecs proposed in the literature.
The paper is organized as follows: Section ?? briefly introduces basic notions and related works. Sections ?? and ?? present a solution to reduce resource overhead, while Sections ?? and ?? overview the proposed adaptable architecture. Section ?? provides experimental results and Section ?? summarizes the main contributions of the work and concludes the paper.
demonstrated to provide high correction efficiency [?], when considering the specific application domain of flash memories, the need to trade off code efficiency, hardware complexity and performance has moved both the scientific and industrial communities toward a set of codes that enable very efficient and optimized hardware implementations [? ?].
Old SLC flash designs used very simple Hamming-based block codes. Hamming codes are relatively straightforward and simple to implement in both software and hardware, but they offer very limited correction capability [? ?]. As the error rate increased with successive generations of both SLC and MLC NAND flash memories, designers moved to more complex and powerful codes including Reed-Solomon (RS) codes [?] and Bose-Chaudhuri-Hocquenghem (BCH) codes [?]. Both codes are similar and belong to the larger class of cyclic codes, which have efficient decoding algorithms due to their strict algebraic structure and enable very optimized hardware implementations. RS codes perform correction over multi-bit symbols and are better suited when errors are expected to occur in bursts, while BCH codes perform correction over single-bit symbols and perform better when bit errors are not correlated, or randomly distributed. In fact, several studies have reported that NAND flash memories manifest non-correlated or randomly distributed bit errors over a page [?]
Given a finite Galois field $GF(2^m)$ (with $m \geq 3$), a t-error-correcting BCH code encodes a k-bit message into an n-bit codeword by adding r parity bits to the original message. The number r of parity bits required to correct t errors in the n-bit codeword is computed by finding the minimum m that solves the inequality $k + r \leq 2^m - 1$, where $r = m \cdot t$. Whenever $n = k + r < 2^m - 1$, the BCH code is called shortened or polynomial. In a shortened BCH code the codeword includes fewer binary symbols than the selected Galois field would allow. The missing information symbols are imagined to be at the beginning of the codeword and are considered to be 0. Let $\alpha$ be a primitive element of $GF(2^m)$ and $\psi_1(x)$ a primitive polynomial with $\alpha$ as a root. Starting from $\psi_1(x)$, a set of minimal polynomials $\psi_i(x)$ having $\alpha^i$ as a root can always be constructed [?]. For the same $GF(2^m)$, different valid $\psi_1(x)$ may exist [?]. The generator polynomial $g(x)$ of a t-error-correcting BCH code is computed as the Least Common Multiple (LCM) among the $2t$ minimal polynomials $\psi_i(x)$ ($1 \leq i \leq 2t$). However, since $\psi_i(x) = \psi_{2i}(x)$, only t minimal polynomials must be considered and $g(x)$ can therefore be computed as:
$$g(x) = \mathrm{LCM}\left[\psi_1(x), \psi_3(x), \ldots, \psi_{2t-1}(x)\right]$$
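The construction of the generator polynomial can be illustrated on a very small field. The following is a minimal sketch, assuming $GF(2^4)$ built from the primitive polynomial $x^4 + x + 1$ (an illustrative choice, far smaller than the fields needed for flash pages); the helper names `gf_mul`, `poly_mul`, `minimal_polynomial` and `generator_polynomial` are hypothetical and polynomials are stored lowest degree first.

```python
# Illustrative field GF(2^4), primitive polynomial x^4 + x + 1 (assumption).
M = 4
PRIM = 0b10011
ORDER = (1 << M) - 1          # number of nonzero field elements

EXP = [0] * (2 * ORDER)       # EXP[i] = alpha^i
LOG = [0] * (ORDER + 1)       # LOG[alpha^i] = i
value = 1
for i in range(ORDER):
    EXP[i] = value
    LOG[value] = i
    value <<= 1
    if value & (1 << M):
        value ^= PRIM
for i in range(ORDER, 2 * ORDER):
    EXP[i] = EXP[i - ORDER]


def gf_mul(a, b):
    """Multiply two elements of GF(2^m) using the log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_mul(p, q):
    """Multiply polynomials with GF(2^m) coefficients (lowest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out


def minimal_polynomial(i):
    """psi_i(x): product of (x + beta) over all conjugates beta of alpha^i."""
    exponents = []
    e = i % ORDER
    while e not in exponents:
        exponents.append(e)
        e = (2 * e) % ORDER
    poly = [1]
    for e in exponents:
        poly = poly_mul(poly, [EXP[e], 1])
    return poly


def generator_polynomial(t):
    """g(x) = LCM[psi_1, psi_3, ..., psi_(2t-1)], as a product of distinct factors."""
    g, seen = [1], []
    for i in range(1, 2 * t, 2):
        psi = minimal_polynomial(i)
        if psi not in seen:       # minimal polynomials are irreducible
            seen.append(psi)
            g = poly_mul(g, psi)
    return g
```

For $t = 2$ the sketch yields $x^8 + x^7 + x^6 + x^4 + 1$, the well-known generator of the (15, 7) BCH code, whose degree equals $m \cdot t = 8$.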
When working with BCH codes, the message and the codeword can be represented as two polynomials: (1) $b(x)$ of degree $k - 1$ and (2) $c(x)$ of degree $n - 1$. Given this representation, both the encoding and the decoding process can be defined by algebraic operations among polynomials in $GF(2^m)$. The encoding process can be expressed as:
$$c(x) = b(x) \cdot x^{r} + \mathrm{Rem}\left(b(x) \cdot x^{r}\right)_{g(x)}$$
where $\mathrm{Rem}\left(b(x) \cdot x^{r}\right)_{g(x)}$ denotes the remainder of the division between the message left-shifted by r positions and the generator polynomial $g(x)$. This remainder represents the r parity bits to append to the original message.
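The encoding rule can be sketched in software as follows. This is only an illustrative model, assuming binary polynomials are stored as Python integers (bit i holds the coefficient of $x^i$) and using the (15, 7) BCH code with $t = 2$ as a toy example; the function names `poly_mod` and `bch_encode` are hypothetical.

```python
def poly_mod(dividend, divisor):
    """Remainder of the division between two GF(2) polynomials stored as ints."""
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        # Cancel the current leading term of the dividend.
        dividend ^= divisor << (dividend.bit_length() - divisor_len)
    return dividend


def bch_encode(message, g):
    """Systematic encoding: c(x) = b(x) * x^r + Rem(b(x) * x^r) mod g(x)."""
    r = g.bit_length() - 1            # number of parity bits = degree of g(x)
    shifted = message << r            # b(x) * x^r
    return shifted ^ poly_mod(shifted, g)


# Generator polynomial of the (15, 7) BCH code: x^8 + x^7 + x^6 + x^4 + 1.
G_15_7 = 0b111010001
```

Calling `bch_encode(0b1011001, G_15_7)` returns a 15-bit codeword whose upper 7 bits are the original message and whose lower 8 bits are the parity bits; by construction the codeword is a multiple of $g(x)$.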
The BCH decoding process searches for the position of erroneous bits in the codeword. This operation requires three main computational steps: 1) syndrome computation, 2) error locator polynomial computation, and 3) error position computation.
Given the selected correction capability t, the decoding process first requires the computation of 2t syndromes of the codeword $c(x)$, each associated with one of the 2t minimal polynomials $\psi_i(x)$ generating the code. Syndromes are calculated by first computing the remainders $R_i(x)$ of the division between $c(x)$ and each minimal polynomial $\psi_i(x)$. If all remainders are null, $c(x)$ does not contain any error and the decoding stops. Otherwise, the 2t syndromes are computed by evaluating each remainder $R_i(x)$ in $\alpha^i$:
$$S_i = R_i(\alpha^i), \quad 1 \leq i \leq 2t$$
In practice, according to (??), given that $\psi_i(x) = \psi_{2i}(x)$, only t remainders must be computed and evaluated in 2t elements of $GF(2^m)$.
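The two-step syndrome computation (remainders first, evaluation afterwards) can be sketched as follows. The example is a software model under stated assumptions: the toy field $GF(2^4)$ with primitive polynomial $x^4 + x + 1$, the (15, 7) code with $t = 2$, and hypothetical helper names. Only the odd-indexed remainders are computed and reused for the even-indexed syndromes.

```python
M, PRIM = 4, 0b10011               # illustrative field GF(2^4), x^4 + x + 1
ORDER = (1 << M) - 1
EXP = [0] * ORDER                  # EXP[i] = alpha^i
value = 1
for i in range(ORDER):
    EXP[i] = value
    value <<= 1
    if value & (1 << M):
        value ^= PRIM

# Minimal polynomials psi_1(x) and psi_3(x) of this field, stored as ints.
PSI = {1: 0b10011, 3: 0b11111}


def poly_mod(dividend, divisor):
    """Remainder of the division between two GF(2) polynomials stored as ints."""
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        dividend ^= divisor << (dividend.bit_length() - divisor_len)
    return dividend


def evaluate(poly, exponent):
    """Evaluate a binary polynomial (int) at alpha^exponent."""
    result, j = 0, 0
    while poly >> j:
        if (poly >> j) & 1:
            result ^= EXP[(exponent * j) % ORDER]
        j += 1
    return result


def syndromes(codeword, t):
    """Return [S_1, ..., S_2t] using only the t remainders R_1, R_3, ..., R_(2t-1)."""
    remainders = {i: poly_mod(codeword, PSI[i]) for i in range(1, 2 * t, 2)}
    result = []
    for i in range(1, 2 * t + 1):
        odd = i
        while odd % 2 == 0:        # psi_i(x) = psi_2i(x): reuse the odd-index remainder
            odd //= 2
        result.append(evaluate(remainders[odd], i))
    return result
```

For an error-free codeword all remainders are zero and so are all syndromes; a single error in bit position p produces $S_i = \alpha^{i \cdot p}$.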
The most used algebraic method to compute the coefficients of the error locator polynomial from the syndromes is the Berlekamp-Massey algorithm [?]. Since the complexity of this algorithm grows linearly with the correction capability of the code, it enables efficient hardware implementations. The equations that link syndromes and error locator polynomial can be expressed as:
$$S_{t+j} + \lambda_1 S_{t+j-1} + \cdots + \lambda_t S_{j} = 0, \quad 1 \leq j \leq t$$
where $\lambda_1, \ldots, \lambda_t$ are the coefficients of the error locator polynomial $\lambda(x)$.
The Berlekamp-Massey algorithm iteratively solves the system of equations defined in (??) using consecutive approximations. The gray box of Fig. ??
1. Common terms, i.e., terms defined in all of the considered polynomials ($x^{13}$, $x^{10}$, $x^{5}$, $x^{3}$, $x$, and 1 in Table ??). For these terms, an XOR will always be required in the PPLFSR, thus saving the area dedicated to the MUX and the related control logic.
2. Missing terms (represented in underlined italic zeros), i.e., terms not defined in any of the considered polynomials ($x^{14}$, $x^{11}$, $x^{9}$, $x^{8}$, $x^{7}$ and $x^{6}$ in Table ??). For these terms both the XOR and the related MUX can be avoided.
3. Specific terms, i.e., terms that are specific to a subset of the considered polynomials ($x^{15}$, $x^{12}$, $x^{4}$, $x^{2}$ in Table ??). These terms are the only ones that actually require the MUX and the related control logic.
We can therefore implement an optimized programmable LFSR (OPPLFSR) with three main building blocks:
each common present term (i.e., columns of all "1" of
This optimization also applies to polynomials with very different lengths. As an example, an OPPLFSR with single-bit parallelism and able to divide by $p_1(x) = x^{225} + x + 1$ and $p_2(x) = x + 1$ would only require a single adaptable block, compared to the 226 blocks required by a normal PPLFSR.
generic OPPLFSR. Such a block is able to divide by a set $p_1(x), \ldots, p_M(x)$ of polynomials. We denote with q the number of required gray boxes.
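The classification of terms into common, missing and specific ones can be sketched with plain set operations. The following is an illustrative model only: each polynomial is represented as the set of exponents with a nonzero coefficient, and the function name `classify_terms` is hypothetical.

```python
def classify_terms(polynomials):
    """Split the terms of a set of polynomials into common, missing and specific.

    Each polynomial is a set of exponents whose coefficient is 1.
    """
    max_degree = max(max(p) for p in polynomials)
    common = set.intersection(*polynomials)      # defined in every polynomial
    present = set.union(*polynomials)            # defined in at least one
    missing = set(range(max_degree + 1)) - present
    specific = present - common                  # defined only in a subset
    return common, missing, specific


# The two polynomials of the example above.
p1 = {225, 1, 0}   # x^225 + x + 1
p2 = {1, 0}        # x + 1
common, missing, specific = classify_terms([p1, p2])
```

For this pair, `specific` contains only the exponent 225, which matches the single adaptable block mentioned in the text, while the remaining 225 of the 226 terms are either common or missing.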
BCH Code Design Optimization
In this section, we first address the issue of choosing the most suitable set of polynomials for an optimized adaptable BCH code. Then, we propose a novel block, shared between the adaptable BCH encoder and the decoder, which reduces the area overhead of the resulting codec core.
As an example, an excessive number of shared polynomials may make it difficult to find common terms, leading to an unwanted increase of the area overhead. Therefore, the choice of the polynomials to share is critical and must be properly tailored to the overall design.
Let us denote by $\Omega$ the set of t generators $g_i(x)$ and t minimal polynomials $\psi_i(x)$ which fully characterize an adaptable BCH code (see Section ??).
In the sequel, we introduce an algorithm to assess each set $\Omega_i$ according
the output is a set of partitions:
Fig. ?? graphically shows the MCI of two partitions generated from two different starting points, for a hypothetical set $\Omega_i$.
Eq. ?? applies to each set $\Omega_i$ where $i \in \{1, \ldots, Y\}$.
The best partition of the set $\Omega_i$ is then computed by selecting the one with maximum $MCI_{avg}$:
$$S^{best}_{i} = \arg\max_{j} \; MCI_{avg}\left(S_{i,j}\right)$$
Finally, Eq. ?? compares the best partition of each set $\Omega_i$ to find the best set of polynomials:
$$S_{bestBCH} = \arg\max_{i \in \{1, \ldots, Y\}} \; MCI_{avg}\left(S^{best}_{i}\right)$$
Eq. ?? defines the family of polynomials $S_{bestBCH}$ with the maximum average number of common terms. Consider a single set $\Omega_i$ composed of the polynomials of
5 from Eq. ??.
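The selection of the best partition can be sketched as a simple maximization. The sketch below rests on an explicit assumption: since the exact definition of the MCI is not reproduced in this section, the score of a group is taken to be the number of terms shared by all of its polynomials, and $MCI_{avg}$ is the mean of that score over the groups of a partition. The polynomials, the candidate partitions and all function names are hypothetical.

```python
def common_terms(group):
    """Number of terms shared by all polynomials of a group (assumed MCI score)."""
    return len(set.intersection(*group))


def mci_avg(partition):
    """Average number of common terms over the groups of a partition."""
    return sum(common_terms(group) for group in partition) / len(partition)


def best_partition(candidates):
    """Select the candidate partition with the maximum average score."""
    return max(candidates, key=mci_avg)


# Hypothetical polynomials, stored as sets of exponents.
a, b = {0, 1, 3}, {0, 1, 4}
c, d = {0, 2, 5}, {0, 2, 6}

partition_1 = [[a, b], [c, d]]   # groups polynomials sharing two terms each
partition_2 = [[a, c], [b, d]]   # groups polynomials sharing one term each
chosen = best_partition([partition_1, partition_2])
```

Under this assumed score the first partition is preferred, because grouping polynomials with more common terms leaves fewer specific terms to be handled by adaptable logic.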
The complete algorithm iterates this computation for all possible starting points. Fig. ?? graphically shows the output of the MCI associated with each partition $S_{i,j}$ calculated for the starting points $j \in \{1, 2, 3, 4\}$. Table ??
According to Eq. ??, $S_{i,2}$ (the bold line) is the $S^{best}_{i}$ of the example of
• the codeword $c(x)$ by (potentially) all minimal polynomials from $\psi_1(x)$ up to $\psi_{2t_M - 1}(x)$, to compute the set of syndromes required during the decoding phase.
In a traditional implementation, these computations are performed by two separate sets of LFSRs. In this paper, we propose a shared set of LFSRs able to: (i) perform all these computations, and (ii) reduce the overall cost in terms of resource overhead. Therefore, we can adopt the same shared set of LFSRs both in the encoding and decoding processes. This is possible since in a flash memory these operations are, in general, not required at the same time.
The OPPLFSR, introduced in Section ??, is the main building block of the set of shared LFSRs. Therefore, we will refer hereafter to such a set of LFSRs as shared OPPLFSR (shOPPLFSR). Fig. ?? shows the high-level architecture of the shOPPLFSR. Its interface includes: an s-bit input port (IN) used to input the data to be divided, a $\lceil \log_2(N) \rceil$-bit input port (en) used to enable each OPPLFSR, an input port (sel) used to select the proper polynomial by which each OPPLFSR has to divide, and an $N \times s$-bit port (p) providing the result of the division. Given N OPPLFSRs and a maximum correction capability $t_M$, each OPPLFSR$_i$ performs the division by a set of generator polynomials $g(x)$ and minimal polynomials $\psi(x)$. Such a shOPPLFSR can be seen as an optimized programmable LFSR able to:
• divide by all generator polynomials from $g_1(x)$ to $g_{t_M}(x)$;
• divide by specific subsets of minimal polynomials from Eq. ??, as well.
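The behavior of a programmable divider shared between encoding and decoding can be modeled in software as follows. This is a behavioral sketch only, assuming single-bit parallelism ($s = 1$) and a single register; the class name `ProgrammableLFSR` and its methods are hypothetical and do not reflect the gate-level structure of the OPPLFSR.

```python
class ProgrammableLFSR:
    """Bit-serial model of an LFSR that divides by a selectable polynomial.

    Polynomials are stored as ints (bit i = coefficient of x^i).
    """

    def __init__(self, polynomials):
        self.polynomials = polynomials
        self.sel = 0
        self.state = 0

    def select(self, sel):
        """Choose the divisor polynomial and clear the register."""
        self.sel = sel
        self.state = 0

    def shift(self, bit):
        """Feed one input bit, most significant bit of the data first."""
        poly = self.polynomials[self.sel]
        degree = poly.bit_length() - 1
        self.state = (self.state << 1) | bit
        if self.state >> degree:          # feedback when the top term appears
            self.state ^= poly

    def divide(self, bits):
        """Return the remainder after shifting in all bits."""
        for bit in bits:
            self.shift(bit)
        return self.state


# One register shared among psi_1(x), psi_3(x) and the generator g(x)
# of the illustrative (15, 7) BCH code over GF(2^4).
lfsr = ProgrammableLFSR([0b10011, 0b11111, 0b111010001])
g_bits = [1, 1, 1, 0, 1, 0, 0, 0, 1]      # coefficients of g(x), MSB first
```

Selecting index 2 makes the model divide by the generator polynomial, as needed during encoding, while indexes 0 and 1 reuse the same register for the remainders required by the syndrome computation.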
An improper choice of the shared polynomials $g(x)$ and $\psi(x)$ can dramatically reduce the performance of the overall BCH codec. The partitioning strategy adopted is also critical to maximize the optimization in terms of area, while minimizing the impact on the latency of encoding/decoding operations.
The algorithm presented in Section ?? provides valuable support for the exploration of this huge design space. In fact, the proposed method can be exploited to properly partition polynomials into the different OPPLFSRs of Fig. ??
in the most significant bit of the LFSR, rather than starting from the least significant bit. Fig. ?? shows the high-level architecture of the adaptable encoder.
The encoder's interface includes: an s-bit input port (IN) used to input the k-bit message to encode starting from the most significant bits, a $\lceil \log_2(t_M) \rceil$-bit input port (t) selecting the requested correction capability in a range between 1 and $t_M$, a start input signal used to start the encoding process, and an s-bit output port (OUT) providing the r parity bits. Three blocks compose the encoder: a shOPPLFSR, a flush logic and a controller.
The shOPPLFSR performs the actual parity bits computation. According to the BCH theory, adaptation is achieved by supporting the computation
through the lsel signal, according to the desired correction capability t. Then, it manages the overall encoding process based on two internal parameters: 1) the number of s-bit words composing the message (fixed at design time) and 2) the number of produced s-bit parity words, which depends on the selected correction capability. The flush logic splits the r parity bits into s-bit words, providing them in output, one per clock cycle.
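The two parameters handled by the controller can be worked out with a short calculation. The sketch below assumes that partial words are rounded up to a full s-bit word, which is not stated in the text; the function name `word_counts` and the numeric values used in the example (a 2 KB message, $m = 15$, $s = 8$) are illustrative assumptions.

```python
def word_counts(k, m, t, s):
    """Return (message words, parity words) for an s-bit wide encoder.

    k: message length in bits, m: field degree, t: selected correction
    capability, s: parallelism. Ceiling division is an assumption.
    """
    message_words = -(-k // s)            # fixed at design time
    parity_words = -(-(m * t) // s)       # r = m * t parity bits, depends on t
    return message_words, parity_words
```

With these illustrative values, `word_counts(16384, 15, 1, 8)` gives 2048 message words and 2 parity words, while raising t to 8 increases the parity words to 15 and leaves the message words unchanged.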
To further optimize the encoding and the decoding process, since in a flash memory these operations are not required at the same time, the encoder's shOPPLFSR can be merged with the shOPPLFSRs that will be employed in the syndrome computation (see Section ??), thus allowing additional area saving.
input the n-bit codeword to decode (starting from the most significant bits), a $\lceil \log_2(t_M) \rceil$-bit input port (t) to select the desired correction capability, a start input signal to start the decoding and a set of output ports providing information about detected errors. In particular:
• deterr is a $\lceil \log_2(t_M) \rceil$-bit port providing the number of errors that have been detected in a codeword. In case of decoding failure it is set to 0;
• erradd and errmask provide information about the detected error positions. Assuming the codeword is split into h-bit words, erradd is used as a word address in the codeword and errmask is an h-bit mask whose asserted bits indicate detected erroneous bits in the addressed word. The parallelism h of the error mask depends on the parallelism of the Chien machine, as explained later in this section;
• vmask is asserted whenever a valid error mask is available at the output of the decoder;
• fail is asserted whenever an error occurred during the decoding process (e.g., the number of errors is greater than the selected correction capability);
• end is asserted when the decoding process is completed.
syndrome machine with correction capability $1 \leq t \leq t_M$. Each PLFSR computes the remainder of the division of the codeword by a different minimal polynomial $\psi_i(x)$. Given two correction capabilities $t_1$ and $t_2$ with $t_1 < t_2 \leq t_M$, the set of $2t_1$ minimal polynomials generating the code for $t_1$ is a subset of those generating the code for $t_2$. To obtain adaptability of the correction capability in a range between 1 and $t_M$, the syndrome machine can therefore be designed to compute the maximum number $t_M$ of remainders required to obtain $2t_M$ syndromes. Based on the selected correction capability t, only the first t PLFSRs out of the $t_M$ available in the circuit are actually enabled through the Enable div. network of Fig. ??.
the PLFSRs require a 2-bit alignment, implemented by the network of Fig. ??. It simply delays the last 2 input bits using two flip-flops, whose initial state has to be zero, and properly rotates the remaining input bits. Changing the correction capability of the decoder changes the number of parity bits of the codeword, and therefore the required alignment. Given the parallelism s of the decoder, a maximum of s alignments must be provided and implemented in the Aligner block of Fig. ??
This iterative process is repeated until all equations are solved. If, at the end of the iterations, the computed polynomial has a degree not greater than t, it correctly represents the error locator polynomial and its degree represents the number of detected errors; otherwise, the code is unable to correct the given codeword.
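The iterative computation of the error locator polynomial can be sketched in software. The following is a minimal sketch of an inversion-less Berlekamp-Massey iteration for binary BCH codes, written from the textbook formulation rather than from the hardware description; it assumes the toy field $GF(2^4)$ with primitive polynomial $x^4 + x + 1$, stores polynomials lowest degree first, and uses hypothetical function names. The returned polynomial is a nonzero scalar multiple of $\lambda(x)$, which leaves its roots unchanged.

```python
M, PRIM = 4, 0b10011                      # illustrative field GF(2^4)
ORDER = (1 << M) - 1
EXP, LOG = [0] * (2 * ORDER), [0] * (ORDER + 1)
value = 1
for i in range(ORDER):
    EXP[i], LOG[value] = value, i
    value <<= 1
    if value & (1 << M):
        value ^= PRIM
for i in range(ORDER, 2 * ORDER):
    EXP[i] = EXP[i - ORDER]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def inversionless_bm(S, t):
    """S = [S_1, ..., S_2t]; returns (scaled lambda(x), assumed number of errors)."""
    lam, k, delta, length = [1], [1], 1, 0
    for i in range(t):
        n = 2 * i                         # odd-indexed steps vanish for binary codes
        d = 0                             # discrepancy
        for j, coeff in enumerate(lam):
            if j <= n:
                d ^= gf_mul(coeff, S[n - j])
        xk = [0] + k                      # x * k(x)
        new = [0] * max(len(lam), len(xk))
        for j, coeff in enumerate(lam):
            new[j] ^= gf_mul(delta, coeff)
        for j, coeff in enumerate(xk):
            new[j] ^= gf_mul(d, coeff)
        if d != 0 and length <= i:
            k, delta, length = [0] + lam, d, n + 1 - length
        else:
            k = [0, 0] + k                # x^2 * k(x)
        lam = new
    return lam, length


def poly_eval(poly, x):
    """Horner evaluation of a polynomial with GF(2^m) coefficients."""
    result = 0
    for coeff in reversed(poly):
        result = gf_mul(result, x) ^ coeff
    return result


def error_positions(lam):
    """Bit positions p such that lambda(alpha^-p) = 0 (full-length code)."""
    return [p for p in range(ORDER)
            if poly_eval(lam, EXP[(ORDER - p) % ORDER]) == 0]
```

As a usage example, the syndromes of an error pattern with errors in positions 3 and 10 are $S_j = \alpha^{3j} + \alpha^{10j}$; feeding them to `inversionless_bm` with $t = 2$ produces a degree-2 polynomial whose roots identify exactly those two positions.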
The architecture of the iBM machine is intrinsically adaptive as long as one guarantees that the internal buffers and the hardware structures are sized to deal with the worst case design (i.e., $t = t_M$). The coefficients of $\lambda(x)$ are
Algorithm 1 Inversion-less Berlekamp-Massey alg.
1: $\lambda(x) = 1$, $k(x) = 1$, $\delta = 1$
2: for $i = 0$ to $t - 1$ do
The overall architecture of the proposed adaptable Chien Machine is shown in Fig. ??. The machine first loads into $t_M$ m-bit registers the coefficients from $\lambda_1$ to $\lambda_{t_M}$ of the error locator polynomial $\lambda(x)$ computed by the iBM machine (ld = 0). The actual search is then started (ld = 1). At each clock cycle, the block performs h parallel evaluations of $\lambda(x)$ in $GF(2^m)$ and outputs an h-bit word, denoted as errmask. Each bit of errmask corresponds to one of the h candidate error locations that have been evaluated. Asserted bits denote detected errors. This mask can then be XORed (outside the Chien Machine) with the related bits of the codeword in order to correct the detected erroneous bits.
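The generation of the error masks can be modeled as follows. This is a behavioral sketch under several assumptions: the toy field $GF(2^4)$, a mapping in which bit b of the word produced at cycle w is asserted when $\lambda(\alpha^{w \cdot h + b}) = 0$, and direct re-evaluation of the polynomial instead of the register updates through constant multipliers used in hardware. All names are hypothetical.

```python
M, PRIM = 4, 0b10011                      # illustrative field GF(2^4)
ORDER = (1 << M) - 1
EXP, LOG = [0] * (2 * ORDER), [0] * (ORDER + 1)
value = 1
for i in range(ORDER):
    EXP[i], LOG[value] = value, i
    value <<= 1
    if value & (1 << M):
        value ^= PRIM
for i in range(ORDER, 2 * ORDER):
    EXP[i] = EXP[i - ORDER]


def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def poly_eval(poly, x):
    """Horner evaluation; poly holds GF(2^m) coefficients, lowest degree first."""
    result = 0
    for coeff in reversed(poly):
        result = gf_mul(result, x) ^ coeff
    return result


def chien_search(lam, h):
    """Return one h-bit error mask per modeled clock cycle."""
    masks = []
    for base in range(0, ORDER, h):       # each iteration models one clock cycle
        mask = 0
        for b in range(h):                # h evaluations performed in parallel
            exponent = base + b
            if exponent < ORDER and poly_eval(lam, EXP[exponent]) == 0:
                mask |= 1 << b
        masks.append(mask)
    return masks


# lambda(x) = (1 + alpha^3 x)(1 + alpha^10 x): errors in positions 3 and 10.
lam = [1, EXP[3] ^ EXP[10], gf_mul(EXP[3], EXP[10])]
masks = chien_search(lam, 4)
```

With $h = 4$ the search completes in four modeled cycles, and the two asserted bits fall on the exponents 5 and 12, i.e., on $\alpha^{-10}$ and $\alpha^{-3}$, the inverses of the two error locators.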
The architecture of Fig. ?? provides an adaptable Chien machine with lower area consumption than other designs [?], having, at the same time, a marginal impact on performance. Four interesting features contribute to such optimization: (i) constant multipliers substructure sharing, (ii) adaptability to the correction capability, (iii) improved fast skipping to reduce the decoding time, and (iv) reduced full GF multipliers area. In the sequel, we briefly address each feature.
The first feature is represented by the optimized GF Constant Multipliers (optGFCM) networks of Fig. ??. The h parallel evaluations are based on
provide up to 60% reduction of the hardware complexity of the machine with no impact on performance.
The second feature is the adaptability of the Chien machine. The rows of the matrix define the parallelism of the block (i.e., the number of evaluations per clock cycle), while the columns define the maximum correction capability of the block. Whenever the selected correction capability t is lower than $t_M$, the coefficients of the error locator polynomial of degree greater than t are equal to zero and do not contribute to equation (??), thus allowing us to adapt the computation to the different correction capabilities.
The third feature stems from a simple observation. Depending on the selected correction capability t, not all the elements of $GF(2^m)$ represent realistic error locations. In fact, considering a codeword composed of k bits of the original message and $r = m \cdot t$ parity bits, only $k + m \cdot t$ out of the $2^m - 1$ nonzero elements of $GF(2^m)$ correspond to actual bit positions of the codeword.
All elements between $\alpha^{0}$ and $\alpha^{2^m - k - m \cdot t}$ can be skipped to reduce the computation time. Differently from fixed correction capability fast skipping Chien machines, this interval is not constant here but depends on the selected t. The architecture of Fig. ?? implements an adaptable fast skipping by initializing the internal registers to the coefficients of the error locator polynomial multiplied by a proper value $t_{ini} = \alpha^{2^m - k - m \cdot t - 1}$. For each value of t, $t_M$ m-bit constant values corresponding to
$$\left(t_{ini}\right)^{i}, \quad 1 \leq i \leq t_M$$
must be stored in an internal ROM (not shown in Fig. ??) and multiplied by the coefficients $\lambda_i$ using a full GF multiplier.
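The content of the initialization ROM can be worked out with modular arithmetic on the exponents of $\alpha$. The sketch below assumes that the constant paired with the i-th coefficient is the i-th power of the initial value, so that each register starts from the first non-skipped evaluation point; the function name `fast_skip_rom` is hypothetical and the function returns exponents of $\alpha$ rather than field elements.

```python
def fast_skip_rom(m, k, t, t_max):
    """Exponents e_i such that the i-th ROM constant equals alpha^(e_i).

    m: field degree, k: message bits, t: selected correction capability,
    t_max: maximum correction capability (number of stored constants).
    """
    order = (1 << m) - 1                      # alpha^order = 1
    skip = (1 << m) - k - m * t - 1           # exponent of the initial value
    return [(i * skip) % order for i in range(1, t_max + 1)]
```

For a full-length code such as the (15, 7) example ($m = 4$, $k = 7$, $t = 2$) the skip exponent is 0 and every constant equals 1, so nothing is skipped; for a shortened code the exponents grow with the number of unused positions and change with the selected t.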
This is connected with the last feature, the reduced GF Full Multipliers (redGFFM) network of Fig. ??. Each full GF multiplier has a high cost in terms of area. Since they are used only during the initialization of the Chien machine, the redGFFM adopts only $z \leq t_M$ full GF multipliers. It also includes a ($\lambda$) input port to input z coefficients of the error locator polynomial per clock cycle. This network reduces area consumption, at a reasonable cost in terms of latency.
For the sake of brevity, a detailed description of the controller required to fully coordinate the interaction of the decoder's modules is out of the scope of this paper.
Experimental Results
This section provides experimental data from the implementation of the proposed adaptable BCH codec on a selected case study.
Automatic generation framework
To cope with the complexity of a manual design of these blocks, a semi-automatic generation tool named ADAGE (ADaptive ECC Automatic GEnerator) [?], able to generate a fully synthesizable adaptable BCH codec core following the proposed architecture, has been designed and exploited in this experimentation, extending a preliminary framework previously introduced in [?]. The overall architecture of the framework is in Fig. ??.
the BER after the application of the ECC, which is application dependent. It is computed as the probability of having more than t errors in the codeword (calculated as a binomial distribution of randomly occurring bit errors) divided by the length of the codeword [?]:
$$UBER = \frac{1}{n} \sum_{i=t+1}^{n} \binom{n}{i} \cdot RBER^{i} \cdot \left(1 - RBER\right)^{n-i}$$
Given the RBER of the flash and the target UBER, Eq. ?? can be exploited to compute the maximum required correction capability of the code and consequently the value of m that defines the target GF. Given these two parameters, the Galois Field manager exploits an internal polynomials database to generate the set of minimal polynomials and the related generator polynomials for the selected code.
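The selection of the correction capability from the RBER and the target UBER can be sketched numerically. The following is an illustrative model that evaluates the binomial tail directly in the log domain to avoid cancellation; the function names, the search bound `t_max` and the target UBER used in the usage note are assumptions, not values taken from the experiments.

```python
import math


def min_field_degree(k, t):
    """Smallest m >= 3 such that k + m * t <= 2^m - 1."""
    m = 3
    while k + m * t > (1 << m) - 1:
        m += 1
    return m


def uber(k, t, rber):
    """Probability of more than t errors in the codeword, divided by its length."""
    n = k + min_field_degree(k, t) * t
    log_p, log_q = math.log(rber), math.log1p(-rber)
    tail = 0.0
    for i in range(t + 1, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        term = math.exp(log_term)
        tail += term
        if term < tail * 1e-18:           # remaining terms are negligible
            break
    return tail / n


def required_t(k, rber, target_uber, t_max=100):
    """Smallest correction capability meeting the target UBER."""
    for t in range(1, t_max + 1):
        if uber(k, t, rber) <= target_uber:
            return t
    raise ValueError("target UBER not reachable within t_max")
```

For instance, `required_t(16384, 9e-6, 1e-13)` returns the smallest t for a 2 KB block at an RBER of $9 \times 10^{-6}$ and a hypothetical target UBER of $10^{-13}$, and `min_field_degree(16384, 1)` gives $m = 15$; a higher RBER, as expected for aged devices, leads to a larger t.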
Finally, the RTL VHDL code generator combines these parameters and generates an RTL description of the BCH encoder and decoder implementing the architecture illustrated in this paper.
The whole framework combines Matlab software modules with custom C programs. The full framework code is available for download at http://www.testgroup.polito.it in the Tools section of the website.
the BCH code, the current trend is to enlarge the block size k over which ECC operations are performed. In fact, longer blocks better handle higher concentrations of errors, providing more protection while using fewer parity bits [?]. For this reason, we adopted a block size k = 2KB, equal to the page size of the selected memory.
Experiments performed on the flash showed that, in a range between 10 and 100,000 program/erase (P/E) cycles on a page, the estimated RBER changes in a range [$9 \times 10^{-6}$ ÷ 3.
we opted for a Chien machine with parallelism h = 8 and z = 1 full GF multipliers.
In this experimentation we analyzed the three architectures summarized in Table ??.
capability. The total occupation ranges in this case from 15% to 70% of the total spare area. This mitigates the overhead for storing parity bits whenever the error rate allows low correction capabilities to be selected (e.g., for devices in
Synthesis Results
Synopsys Design Vision and a CORE 45nm technology cell library have been exploited to synthesize the designs.
architectures computed using Synopsys PrimeTime. As for the decoding
Conclusions
This paper proposed a BCH codec architecture and its related automatic generation framework, which enables its code correction capability to be selected in a predefined range of values. Designing an ECC system whose correction capability can be modified in-field makes it possible to adapt the correction scheme to the reliability requirements the flash encounters during its lifetime, thus maximizing performance and reliability.
Experimental results on a selected NAND flash memory architecture proved that the proposed solution reduces spare area usage, decoding time, and power dissipation whenever a small correction capability can be selected.
