Abstract-Program disturb, read disturb, and retention time noise are identified as three major contributors to multilevel cell (MLC) NAND flash memory bit errors. With program/erase cycling and technology scaling, the bit error rate (BER) of MLC NAND flash memory increases rapidly. Previous works revealed that the BER depends heavily on data patterns. Based on this observation, we propose a data-pattern-aware (DPA) error protection technique to extend the lifespan of NAND flash-based storage systems. DPA manipulates the ratio of 0's and 1's in the stored data to reduce the probability of data patterns that are susceptible to device noise. By minimizing the vulnerable data patterns, our scheme effectively reduces the BER and improves system endurance. Our DPA scheme also incorporates a data management scheme to minimize the redundancy-induced performance overhead. Simulation results show that our scheme can increase flash system life expectancy by up to 4×.
other wordlines. Retention time noise originates from charge loss on the floating gate and leads to a V th decrease. These noises broaden the V th distribution and cause neighboring levels to overlap. They are aggravated by program/erase (P/E) cycling and eventually result in a high bit error rate (BER) at the end of the lifetime of the flash memory.
High BER is one of the major factors limiting memory endurance. The maximum rated endurance of MLC NAND flash memory is only 10^4 P/E cycles. With technology scaling, MLC NAND flash memory endurance degrades continuously. At the 2x-nm technology node, for example, MLC flash memory can tolerate only 5000 P/E operations [3]. Programming an MLC NAND flash memory beyond the rated P/E cycle count may incur uncorrectable errors. To assure data integrity, error correction code (ECC) is employed in NAND flash memory-based storage systems (NFSSs). Two typical ECCs are the Bose-Chaudhuri-Hocquenghem (BCH) code and the low-density parity-check (LDPC) code. The LDPC code has a long decoding latency and is therefore not suitable for many read-critical applications. In comparison, the BCH code has a much lower decoding latency. However, the hardware cost of the BCH code may become prohibitively high when the required correction capability increases. In this paper, we focus on improving NFSS lifetime under the protection of the BCH code.
This paper is based on the observation that the occurrence of bit errors heavily depends on V th levels. Moon et al. [4] revealed that programming flash cells to the highest V th level incurs serious program disturb [1] . Mielke et al. [1] found that the highest V th level is susceptible to retention time errors. If the probability of storing data to these vulnerable V th levels can be reduced, the BER will be effectively reduced accordingly. The main contributions of this paper are summarized as follows.
1) We propose DPA pattern probability unbalance (DPA-PPU) scheme to reduce the probability of data patterns vulnerable to bit errors. Depending on data correlation, DPA-PPU adopts decorrelation and scrambling techniques to unbalance the number of 1's and 0's in the stored data. By unbalancing the ratio between 0's and 1's, fewer NAND flash cells are placed on the vulnerable V th levels. As a result, the BER is effectively reduced.
2) We propose DPA data-redundancy management (DPA-DRM) scheme to mitigate performance degradation of DPA-PPU-induced data redundancy. DPA-DRM adopts P/E-cycle-dependent DPA-PPU to reduce the redundancy amount and optimizes the layout of the redundant data to minimize access latency. Due to random data patterns, the redundant data are more vulnerable to bit errors. The DPA-DRM employs differential storage to protect the integrity of the redundant data.
Experimental results show that our scheme can achieve ∼4× lifespan extension with marginal hardware overhead.
The rest of this paper is organized as follows: Section II presents preliminary knowledge; Section III gives the overview of our DPA error prevention technique; Section IV and Section V discuss the details of DPA-PPU and DPA-DRM, respectively; Section VI shows the simulation results; Section VII concludes this paper.
II. PRELIMINARY

A. MLC NAND Flash Basics
An MLC NAND flash memory is divided into a number of blocks. The structure of a NAND flash block is shown in Fig. 1(a) . An MLC NAND flash block includes many rows that are controlled by wordlines. Each wordline corresponds to two or four pages. A NAND flash page contains a data area and an out-of-band (OOB) area, which consist of identical NAND flash cells. Each MLC NAND flash cell stores two bits with four V th levels (L0∼L3). L0 (the lowest level) ∼ L3 (the highest level) represent 11, 10, 01, and 00, respectively. V th levels are confined with a pair of lower and upper reference voltages.
There are three basic operations in NAND flash memories: read, program, and erase. The read operation is realized by comparing the programmed V th with a series of reference voltages [5]. The program (i.e., write) operation injects a predefined amount of electrons into the floating gate. An incremental-step progressive programming (ISPP) method [1] is employed in the programming operation. Programming pulses are iteratively applied until V th reaches the verifying voltage. To maximize the noise margin, the iterative program-verifying process places V th in the center of the V th level window [6]. The iterative ISPP partially contributes to the long program operation latency, i.e., 800 μs∼3 ms [3]. To reduce the performance degradation caused by long program latency, an array of NAND flash memories is employed in the NFSS. The memory array is divided into a number of channels, which can be accessed in parallel (i.e., internal parallelism) [7]. As such, the system bandwidth can be improved. Unfortunately, P/E cycling gradually wears out NAND flash memory and leaves the programmed cells susceptible to device noises. Previous works identified program disturb, read disturb, and retention time noise as the three major noises that cause V th distortion [1].
B. Program Disturb
Program disturb stems from the combined effect of random telegraph noise (RTN) and cell-to-cell interference. RTN broadens the V th distribution by either increasing or decreasing the V th of the programmed cell [8]. Assume that λ represents the mean value of the V th shift V rtn. The distribution of V rtn can be modeled by [8]
Cell-to-cell interference is caused by parasitic capacitance coupling. As shown in Fig. 1(b), cell-to-cell interference incurs a V th shift of the victim cell when its adjacent cells (i.e., interfering cells) are programmed. The V th shift of the victim cell V c2c can be modeled by [8]
Here, V p (k) and γ (k) denote the V th shift of the kth interfering cell after programming and the corresponding coupling ratio, respectively. Reference [4] shows that the effect of cell-to-cell interference is pattern-dependent. The largest V th distortion of the victim cell occurs when the interfering cell is programmed to V th level L1 or L3.
C. Read Disturb
Read disturb results from the Fowler-Nordheim (FN) tunneling mechanism and stress-induced leakage current (SILC). The mechanism of read disturb is shown in Fig. 1(c). The verify voltage V READ (0∼4 V) is applied to the wordline of the read page, while V PASS (6 V) is applied to the unread wordlines in the same block. V PASS and V READ can inject electrons into the floating gate and lead to a V th increase. Coda et al. [9] revealed that the lowest V th level L0 is most susceptible to read disturb.
D. Retention Time Noise
Retention time noise results from electron detrapping and SILC. Electrons are trapped in the transistor tunnel oxide through P/E cycling [8]. These trapped electrons gradually leak away and assist charges stored on the floating gates to escape, leading to a V th decrease. The noise margin of retention time noise is the difference between V th and the lower reference voltage. Mielke et al. [1] showed that bit errors incurred by retention time noise dominate at the postcycling stage. Note that the V th shift distribution caused by retention time noise can be modeled by
Here K s , K d , K m , and t 0 are constants. N is P/E cycle count. x 0 is V th of level L0. x is the initial V th after programming and t is elapsed time. Equations (3) and (4) show that higher initial V th level incurs larger V th shift. Hence, L3 is most vulnerable to retention time bit errors.
E. Related Works
To extend the lifetime of the NFSS, researchers and system designers have devoted considerable effort to reducing memory BER. References [8] and [10] proposed refresh schemes to avoid the accumulation of retention time errors. However, the refresh schemes must be performed under a power-ON status and therefore cannot prevent bit errors from occurring when power is OFF. Reference [11] proposed to reduce the number of V th levels and increase the noise margin of each V th level to reduce memory BER. Some works also explore other coding techniques to improve system reliability. For example, [12]-[16] proposed to apply soft-decision codes, such as LDPC and Reed-Solomon codes, in the NFSS to improve error correction performance. Dong et al. [15], [17] proposed to adopt entropy coding to reduce the memory-to-controller data transfer latency of the soft sensing operation. However, both soft-decision codes and entropy coding incur high hardware cost and degrade read performance. Reference [18] proposed an error-prediction (EP) LDPC ECC to extend system lifetime. The EP LDPC ECC counts the number of 1's to determine the dominant error type and applies a retention error-recovery pulse or a programming error-recovery pulse accordingly to reduce BER.
There are also previous works dedicated to improving the performance of ECC in NAND-based storage systems. Reference [19] proposed three error-correction architectures to optimize code rate, hardware complexity, and power consumption. Reference [20] proposed coset coding to reduce memory BER by converting data to multiple vectors and adopting the vector with the fewest flipping bits. This paper and [20] target NAND flash memory and phase change memory (PCM), respectively, and therefore employ different approaches to minimize BER: the BER of PCM can be minimized by simply reducing flipping bits, while the BER of NAND flash memory can be reduced by avoiding vulnerable V th levels. In addition, the implementation of coset coding is also different from this paper. Coset coding adopts linear algebraic block codes to reduce the Hamming distance of two writes to the same location. In comparison, this paper adopts scrambling and decorrelation techniques to reduce the number of 0's in the written data. Reference [21] proposed energy-aware error control coding to reduce the energy of write operations to NAND flash memory. Different from this paper, which reduces the number of "00" patterns, energy-aware error control coding reduces the number of "10" and "01" patterns in the data written to NAND flash memory. Besides, energy-aware error control coding XORs adjacent bits in the same data block to detect the data patterns "10" and "01," which is significantly different from our approach, which unbalances the ratio of 0's by XORing adjacent data blocks and adopting Galois field modulo-2 operations. Reference [22] proposed nibble remapping coding to reduce BER by reducing the number of 1's. Different from this paper, which adopts multiple polynomials to reduce the number of 1's, nibble remapping coding collects the statistics of 1's in the data and maps the data pattern with the highest appearance frequency to the one with the fewest 1's. However, nibble remapping coding is inefficient for data whose patterns have approximately equal appearance frequencies (e.g., multimedia files).
The DPA error prevention technique proposed in this paper can work together with all these previous works to further improve system reliability and extend system lifespan. Compared with the previous works, however, our DPA technique has the unique advantage that it can extend system lifetime under both power-OFF and power-ON statuses. In addition, our DPA technique can achieve fast decoding speed with low hardware cost.
III. OVERVIEW OF DPA
The DPA error prevention technique is targeted at minimizing the BER of NAND flash memory, based on the following observations.
1) The program disturb bit errors occur at the lowest probability when an interfering cell is programmed to V th levels L0 and L2.
2) The retention time bit errors occur at the lowest probability when a cell is programmed to V th level L0.
According to the error patterns, if more cells are put at the V th level L0, the BER can be effectively reduced. Placing more cells at the V th level L0 may make them more susceptible to read disturb. However, due to the reduction of cell-to-cell interference, it also assures a larger noise margin to read disturb and finally introduces a reduced read disturb BER. The effect of the DPA technique on read disturb will be evaluated in detail in Section VI.
To place more cells in L0, data patterns should be converted. According to the data encoding scheme in Fig. 1(a) , two bits data "11" are mapped to V th level L0. Hence, increasing the ratio of 1's in the stored data can increase the probability of V th level L0. To raise the ratio of 1's, we propose a DPA error prevention technique, as shown in Fig. 2 . DPA includes two components: DPA-PPU module and DPA-DRM module. DPA-PPU converts the data to the patterns with more 1's by adopting decorrelation and scrambling coding techniques. However, both decorrelation and scrambling coding techniques introduce data redundancies. These redundancies incur extra accesses to NAND flash memory and thus performance degradation. Due to random data patterns, the redundant data are also more vulnerable to bit errors than the converted data. As a solution, DPA-DRM manages the redundancies by reducing their access latency and offering stronger data protection.
IV. DPA PATTERN PROBABILITY UNBALANCE
A. DPA-PPU Basics
The DPA-PPU is designed to increase the number of 1's in the stored data. Instead of directly increasing the number of 1's, DPA-PPU first increases the ratio of 0's and then inverts the whole data page. In DPA-PPU, increasing the ratio of 0's is achieved through decorrelation and scrambling techniques. The choice of technique depends on the correlation of the data. For strongly correlated data, decorrelation [23] is applied to increase the ratio of 0's using simple XOR operations. For weakly correlated data, scrambling coding is applied to increase the number of 0's in the stored data. Scrambling implements the Galois field modulo-2 operation [24] by XORing the data with a polynomial cyclically. If an appropriate polynomial is employed, i.e., one whose pattern overlaps with the pattern of the stored data to the maximum degree, the ratio of 0's can be effectively increased.
The DPA-PPU architecture is shown in Fig. 3. A data page is divided into multiple data chunks. These data chunks are sent to a set of data conversion circuits. The data conversion circuit set is composed of n scrambling circuits and a decorrelation circuit. The n scrambling circuits, labeled with polynomial tags 0 ∼ n − 1, employ different polynomials to maximize the scrambling efficiency. Each data chunk is processed by the n scrambling circuits and the decorrelation circuit simultaneously. The n + 1 outputs from the scrambling and decorrelation circuits are sent to n + 1 counters. Each counter calculates the total number of 0's in the corresponding output. The values of these counters are checked by a comparator. The output with the most 0's (equivalently, the fewest 1's) is selected and inverted before being flushed into the flash memory. If the decorrelated data are selected, the correlation bit is set to 1. Otherwise, the correlation bit is set to 0. The correlation bit needs to be stored in NAND flash memory for data restoration. If the scrambled data are selected, the corresponding polynomial tag must also be stored, as it is required for descrambling during read operations.
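The selection flow described above can be illustrated with a minimal Python sketch. It is not the hardware design of Fig. 3: the four 16-bit patterns in `POLYNOMIALS` are hypothetical placeholders, scrambling is simplified to a cyclic XOR with the pattern, decorrelation is assumed to be an XOR with the previous chunk, and the correlation bit is kept per chunk for simplicity.

```python
from typing import List, Tuple

CHUNK_BITS = 64  # 8-byte data chunk
FULL_MASK = (1 << CHUNK_BITS) - 1

# Hypothetical 16-bit scrambling patterns (roughly balanced 1's and 0's).
# These values are illustrative only and are not taken from the paper.
POLYNOMIALS: List[int] = [0xA5A5, 0x5A5A, 0x3C3C, 0xC3C3]


def expand_cyclic(pattern: int, pattern_bits: int = 16) -> int:
    """Repeat a short pattern cyclically until it covers one chunk."""
    mask = 0
    for shift in range(0, CHUNK_BITS, pattern_bits):
        mask |= pattern << shift
    return mask & FULL_MASK


def count_zeros(value: int) -> int:
    """Number of 0 bits in a chunk-wide value."""
    return CHUNK_BITS - bin(value).count("1")


def encode_chunk(chunk: int, prev_chunk: int) -> Tuple[int, int, int]:
    """Return (stored_chunk, correlation_bit, polynomial_tag)."""
    # Decorrelation candidate: XOR with the previous chunk (assumed form).
    best = (chunk ^ prev_chunk, 1, 0)
    # Scrambling candidates: one per polynomial tag.
    for tag, poly in enumerate(POLYNOMIALS):
        candidate = chunk ^ expand_cyclic(poly)
        if count_zeros(candidate) > count_zeros(best[0]):
            best = (candidate, 0, tag)
    converted, corr_bit, tag = best
    # Invert the winner so that its 0's become 1's before it is stored.
    return converted ^ FULL_MASK, corr_bit, tag


def decode_chunk(stored: int, corr_bit: int, tag: int, prev_chunk: int) -> int:
    """Undo the inversion, then undo decorrelation or scrambling."""
    converted = stored ^ FULL_MASK
    if corr_bit:
        return converted ^ prev_chunk
    return converted ^ expand_cyclic(POLYNOMIALS[tag])
```

Because XOR is its own inverse, the stored correlation bit and polynomial tag are sufficient to restore the original chunk, which is why they must be kept alongside the converted data.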
The efficiency of the scrambling circuit is determined jointly by the polynomial pattern, the data chunk size, the number of polynomials, and the polynomial order. One efficient implementation of the scrambling circuit is cyclic XOR operations. The ratio of 0's can be effectively increased when the overlapping bits between the stored data and the polynomial are maximized. To select appropriate polynomials, we investigate different types of data patterns, as shown in Table I. The ratios of 1's in these types of data are shown in Fig. 4. It can be seen that the occurrence probabilities of 1's and 0's in the stored data are approximately equal. Hence, polynomials with a 1-to-0 ratio of 1:1 are generally preferred. Increasing the data chunk size diversifies the data patterns and potentially reduces the overlapping bits between the data chunks and the polynomial. As a result, the scrambling efficiency degrades when the data chunk size rises. Conversely, increasing the number of polynomials can enhance the scrambling efficiency by increasing the bit overlapping probability. The polynomial order itself does not directly affect the scrambling efficiency. However, a higher order provides more polynomial options and potentially increases the bit overlapping probability.
B. V th Offsetting Technique
Increasing the ratio of 1's reduces the probability that NAND flash cells are placed on the higher V th levels (L1∼L3). However, reducing the number of cells programmed to higher V th levels may increase retention time bit errors due to V th decrease. The cause of the retention time BER increase is shown in Fig. 5. The difference between the programmed V th and the lower reference voltage of the V th level to which the cell is programmed is defined as the retention time noise margin. A cell with a higher retention time noise margin is more tolerant of retention time bit errors. A higher retention time noise margin can be realized by increasing the V th within the same V th level. As discussed in Section II-B, one of the major contributors to V th increase is cell-to-cell interference. However, under our DPA technique, more cells are programmed to the low V th levels, resulting in less cell-to-cell interference. As such, for the cells that are programmed to high V th levels, the retention time BER increases. To increase the retention time noise margin, we adopt a V th offsetting technique, as shown in Fig. 5. For a cell that is programmed to V th levels L1∼L3, we increase the verifying voltage to increase the programmed V th. As such, the retention time noise margin increases and retention time errors can be reduced. Increasing the verifying voltage may increase the probability that V th exceeds the upper reference voltage, causing program bit errors. However, in our design, due to the reduction of cell-to-cell interference, increasing the verifying voltage by a proper amount will not increase the program BER.
C. DPA-PPU Overhead
DPA-PPU incurs both hardware and storage overheads. The hardware overhead includes n pairs of scrambling and descrambling circuits, two decorrelation circuits, n + 1 counters, and a comparator. The decorrelation circuit is implemented with simple XOR operations. The scrambling-descrambling circuit is implemented with linear feedback shift registers. Assume that n is set to 64. It is estimated that DPA-PPU consumes 9800 lookup tables (LUTs) and 6400 registers on a Xilinx XC7VX485TFFG1157-2 FPGA chip. Compared with the resources used by the BCH ECC (about 160 000 LUTs), the DPA-incurred hardware overhead is marginal.
The storage overhead includes polynomial tags and correlation bits. The cost of the correlation bit is negligible, since one data page needs only 1 bit. Comparatively, the polynomial tags, which are called redundancy hereafter, incur a high storage overhead. For example, with an 8-byte chunk size and 64 16-order polynomials, 24 GB of storage space is allocated for polynomial tags in a 256-GB NFSS. Due to its large size, the redundancy cannot be completely stored in the OOB area of the same page where the data are stored. Instead, the redundancy is stored in extra pages. Therefore, extra page read and write accesses are incurred, which directly leads to performance degradation. To reduce the redundancy-incurred hardware and performance overheads, we propose DPA-DRM.
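The 24-GB figure can be checked with a short calculation. The sketch below assumes that each chunk stores one polynomial tag of ceil(log2 n) bits; the function name `tag_overhead_gb` is ours and is used only for illustration.

```python
def tag_overhead_gb(capacity_gb: float, chunk_bytes: int, num_polynomials: int) -> float:
    """Storage consumed by polynomial tags, assuming one tag per data chunk."""
    # Bits needed to index num_polynomials tags (6 bits for 64 polynomials).
    tag_bits = (num_polynomials - 1).bit_length()
    chunk_bits = chunk_bytes * 8
    return capacity_gb * tag_bits / chunk_bits


# 8-byte chunks, 64 polynomials, 256-GB NFSS: 256 * 6 / 64 = 24 GB.
print(tag_overhead_gb(256, 8, 64))
```

Under the same assumption, halving the chunk size to 4 bytes doubles the tag overhead, which is the cost that DPA-DRM is designed to manage.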
V. DPA DATA-REDUNDANCY MANAGEMENT
DPA-DRM reduces the redundancy overhead by exploiting the relationship between redundancy overhead and the BER. Given the number of polynomials and the polynomial order, the redundancy overhead, which is determined by the redundancy size, depends only on the chunk size. The chunk size determines how efficiently the ratio of 1's is reduced and therefore the BER reduction efficiency. NAND flash memories show different BERs at different P/E cycles. To ensure system reliability, we reduce the BER to an acceptable level by tuning the chunk size based on the P/E cycles. The redundancy overhead, hence, can be maintained at the minimum required level. A tune-on-demand strategy is adopted in DPA-DRM as follows. The entire lifetime of the NFSS is divided into three stages. At the early cycling stage, where the BER is low, no DPA-PPU is needed. At the early postcycling stage, where the BER is relatively high, DPA-PPU with a medium chunk size is enough to reduce the BER to the acceptable level. At the late postcycling stage, where the BER is high, DPA-PPU with a small chunk size is required to reduce the BER to the maximum degree. By dynamically adjusting the chunk size according to the BER, unnecessary redundancy can be avoided and performance can be improved.
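A minimal sketch of the tune-on-demand policy is given below. The stage boundaries `EARLY_STAGE_END` and `LATE_STAGE_START` are hypothetical values chosen only for illustration, and the 8-byte and 4-byte chunk sizes are borrowed from the evaluation settings; a real controller would derive the boundaries from the measured BER.

```python
from typing import Optional

# Hypothetical stage boundaries in P/E cycles (illustrative values only).
EARLY_STAGE_END = 10_000
LATE_STAGE_START = 25_000


def select_chunk_bytes(pe_cycles: int) -> Optional[int]:
    """Return the DPA-PPU chunk size for a block, or None if DPA-PPU is off."""
    if pe_cycles < EARLY_STAGE_END:
        return None  # early cycling stage: BER is low, no DPA-PPU needed
    if pe_cycles < LATE_STAGE_START:
        return 8     # early postcycling stage: medium chunk size
    return 4         # late postcycling stage: small chunk size
```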
We also reduce the redundancy-incurred overhead by optimizing the data layout and adopting redundancy caching. The optimized interleaving data layout is shown in Fig. 6. The converted data are stored in the data zone of NAND flash pages. We call the NAND flash pages that store converted data the data pages. Due to its large size, only part of the redundancy can be stored in the OOB area of the data page. The remaining redundancy is stored in a separate page, called the redundancy page. The data pages and redundancy pages are stored in different blocks of adjacent channels. For the data pages in the same block, their redundancy pages are sequentially stored in another block. Hence, the data pages and the redundancy pages can be written to NAND flash memory in parallel. The relationship between the physical addresses of the redundancy pages and the physical addresses of the corresponding data pages is stored in a redundancy mapping table. The redundancy mapping table is stored in DRAM for fast access. Upon reading a data page, the physical address of the corresponding redundancy page is looked up in the redundancy mapping table by using the physical address of the data page as the index. As such, the redundancy pages can be read out simultaneously with the data pages and the read latency of the redundancy is hidden. Because the redundancy pages of the data pages in one block are stored sequentially, we only need to record in the redundancy mapping table the address of the redundancy page whose data page is the first page in a physical block (referred to as the starting redundancy page hereafter), as shown in Fig. 6. The address of the redundancy page belonging to any other data page can be calculated by adding an offset to the related starting redundancy page address. Such a layout effectively minimizes the storage overhead of the redundancy mapping table.
For a 512-GB solid state disk (SSD) with 524 288 1-MB blocks, only 2 MB is assigned to the redundancy mapping table. Hence, the incurred storage overhead is marginal. The redundancy mapping table is only flushed to NAND flash memory upon power failure or power-down. Therefore, it will not cause any runtime performance overhead.
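The table size and the address calculation can be sketched as follows. The 4-byte entry width `ENTRY_BYTES` is an assumption that happens to reproduce the 2-MB figure, and the function names and the dictionary-based table are illustrative only.

```python
from typing import Dict

BLOCK_BYTES = 1 << 20  # 1-MB physical block
ENTRY_BYTES = 4        # assumed width of one starting-redundancy-page address


def mapping_table_bytes(capacity_bytes: int,
                        block_bytes: int = BLOCK_BYTES,
                        entry_bytes: int = ENTRY_BYTES) -> int:
    """One table entry per physical block (the starting redundancy page)."""
    return (capacity_bytes // block_bytes) * entry_bytes


def redundancy_page_address(table: Dict[int, int],
                            data_block: int,
                            page_offset: int) -> int:
    """Starting redundancy page of the data block plus the page offset."""
    return table[data_block] + page_offset


# 512 GB / 1 MB = 524 288 blocks; 524 288 entries * 4 bytes = 2 MB.
print(mapping_table_bytes(512 * 2**30) // 2**20)
```

Keeping one entry per block rather than one per page is what holds the table small enough to reside in DRAM.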
Under the different data layouts of the DPA-DRM system design, the redundancy of a data page is smaller than a physical page. Hence, an update to the redundancy of a data page may cause a partial page update [25]-[27], in which the entire page has to be rewritten although only a part of it is updated. In our DPA design, a dedicated garbage collection scheme is proposed to avoid partial page updates. When a data page is updated, the physical address of its new redundancy page is filled in the redundancy mapping table. The old data page is invalidated, but its redundancy pages are still valid and recorded in the redundancy mapping table. During garbage collection, when the physical block of the old data page is recycled, the corresponding redundancy pages are invalidated together in the redundancy mapping table. If there is a valid page in the recycled physical block, the corresponding redundancy page is migrated to a clean page. As such, partial page updates to the redundancy page can be avoided. To further minimize the garbage collection overhead, we store the redundancy pages of the frequently accessed (hot) data pages in one block. The techniques in [25] and [28]-[33] can be adopted to keep track of hot data. To further minimize the write latency of redundancy pages, we adopt redundancy caching. The redundancy pages are temporarily buffered in the redundancy cache under a least recently used (LRU) cache replacement policy. Frequently updated redundancy pages can thus be merged, and the write accesses to NAND flash memory are reduced. We also adopt a delayed write scheme [34] to further hide the write latency; that is, the evicted redundancy pages in the redundancy cache are flushed to NAND flash memory only during system idle time. If new redundancy is generated while the redundancy cache is full, the system stops responding to I/O requests from the host until there is free space in the redundancy cache, instead of discarding the evicted data.
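The caching behavior described above can be illustrated with the following sketch. The class name `RedundancyCache`, its interface, and the `flash_write` callback are hypothetical; the sketch only models LRU ordering, merging of repeated updates, idle-time flushing, and stalling when the cache is full.

```python
from collections import OrderedDict
from typing import Callable


class RedundancyCache:
    """LRU buffer for redundancy pages with delayed (idle-time) write-back."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Least recently used entry first, most recently used entry last.
        self.pages: "OrderedDict[int, bytes]" = OrderedDict()

    def write(self, page_addr: int, data: bytes) -> bool:
        """Buffer a redundancy page; return False if the host must wait."""
        if page_addr in self.pages:
            # Merge the update into the cached copy: no flash write needed.
            self.pages[page_addr] = data
            self.pages.move_to_end(page_addr)
            return True
        if len(self.pages) >= self.capacity:
            # Cache full: the request stalls until idle-time flushing frees space.
            return False
        self.pages[page_addr] = data
        return True

    def flush_on_idle(self, flash_write: Callable[[int, bytes], None],
                      count: int = 1) -> int:
        """During idle time, write back up to `count` LRU pages to flash."""
        flushed = 0
        while self.pages and flushed < count:
            addr, data = self.pages.popitem(last=False)
            flash_write(addr, data)
            flushed += 1
        return flushed
```

Returning `False` instead of evicting immediately mirrors the design choice that evicted data are never discarded and are written to flash only when the system is idle.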
We note that the redundancy pages also require stronger data protection than the data pages, since they store the polynomial tags. Our results (Table II) show that the probability of each possible polynomial tag being selected is about the same. Hence, the redundancy page has approximately equal ratios of 1's and 0's, causing a higher BER than the data page. To solve this problem, we propose a wear-unleveling technique, where some blocks are intentionally controlled to wear out slower than other blocks. Therefore, the blocks wearing out more slowly always have a lower BER than the blocks wearing out faster. Due to the low BER, only BCH is applied to the blocks wearing out more slowly to protect the data integrity. At the postcycling stage, the data pages are stored in the blocks wearing out faster and the redundancy pages are stored in the blocks wearing out more slowly. Since BCH is sufficient to assure the reliability of the redundancy pages, little performance overhead is incurred.
VI. EVALUATION
A. DPA-PPU Evaluation
We study the efficiency of the decorrelation and scrambling circuits in reducing the ratio of 1's in the seven types of data listed in Table I. Fig. 7 shows the ratio of 1's in the data before and after being processed by the decorrelation circuit. The ratio of 1's in all file data except system metadata is around 50%. Hence, a simple mechanism such as inverting the data cannot effectively increase the ratio of 1's. For system data with more than 50% 0's, our scheme can generate more 1's than simply inverting the data, since we can further unbalance the ratio of 1's and 0's. Among all the data types, only system metadata exhibits strong data correlation, where the ratio of 1's decreases from 0.45 to 0.27 after decorrelation. For the other data types, however, the data are weakly correlated. Therefore, the reduction of the ratio of 1's after decorrelation is very small.
We also demonstrate the efficiency of the scrambling scheme. The default values of data chunk size, polynomial number, and polynomial order are set to 8 bytes, 64, and 16, respectively. We change the value of only one parameter at a time to evaluate its impact on scrambling efficiency. The probability that each scrambling polynomial reduces the number of 1's is shown in Table II. In general, scrambling demonstrates good efficiency: almost all polynomials have approximately 50% probability of reducing the number of 1's. This demonstrates that a polynomial with approximately the same number of 1's and 0's can reduce the number of 1's efficiently, as we claimed in Section IV-A. Fig. 8(a)-(c) reveals the efficiency of DPA-PPU in reducing the ratio of 1's. As shown in Fig. 8(a), the average ratio of 1's decreases from 0.42 to 0.34 as the polynomial order rises from 4 to 16 in the scrambling scheme. Such an improvement is due to the increase in the number of available polynomials to select from. Fig. 8(b) shows that the average ratio of 1's decreases by 11.5% as the polynomial number changes from 32 to 256. Fig. 8(c) shows that the ratio of 1's increases from 0.27 to 0.42 on average as the chunk size increases from 4 to 32 bytes, because the overlap between the data chunk and the polynomial pattern is reduced.
Finally, we compare the ratios of 0's in the outputs of DPA-PPU under data chunk sizes of 4 and 8 bytes. Here, the outputs of DPA-PPU are the inversion of the outputs from the decorrelation or scrambling circuits. All the design parameters are set to the default values. As shown in Fig. 9, the efficiency of DPA-PPU is generally better when the data chunk size is small. The only exception is system metadata under the 8-byte data chunk size, where the data are mainly generated by the decorrelation circuit. Based on the results in Fig. 9, we derived the probability of the four V th levels under chunk sizes of 4 and 8 bytes. As shown in Fig. 10, the probability of V th level L0 increases from 24.5% to 48% when DPA-PPU with a data chunk size of 8 bytes is applied, compared with the case in which no DPA-PPU is applied. When the data chunk size is reduced to 4 bytes, the probability of V th level L0 further rises to 59%.
The results in Fig. 10 can be directly used to derive the efficiency of DPA-PPU in improving system reliability, i.e., in reducing the uncorrectable BER (UBER) U (n). In this paper, U (n) is evaluated by assuming that a BCH ECC is applied, where n is the number of correctable bits. Without loss of generality, we assume that an n-bit BCH ECC (m,l,n) is applied to an l-bit data block, where m is the total codeword length. U (n) can be expressed by
Here, p c is the BER of a single NAND flash memory cell. The probability that each cell is programmed to V th level Li is denoted by p i (i = 0, 1, 2, 3). The BER corresponding to each V th level is represented by p li (i = 0, 1, 2, 3), respectively. p c can be calculated by
Equations (2) ∼ (4) can be used to evaluate p li of program disturb and retention time noise, respectively. The effects of read disturb are different in NAND flash memories under different technology nodes from different vendors. In [2] and [35] , read disturb is much lower compared with program disturb and retention time errors. In contrast, read disturb is much higher in some NAND flash memory, such as [29] and [36] .
In this paper, we simulate the read disturb as in [2] and [35]. We will evaluate our design on devices with higher read disturb in our future work. Reference [9] shows that read disturb stems from FN tunneling and SILC. The V th shift due to FN tunneling is modeled in [37, eqs. (5) and (16)]. The effect of SILC is modeled in [9] and [38]. The V th shift induced by read disturb can be modeled by combining these two models. We model the V th shift caused by read disturb as a random variable following a Gaussian distribution. The mean μ rd and variation σ rd can be expressed as
where γ , β, K r , and α are technology-related coefficients. V CG and V T ,0 represent the voltages applied on the wordline and the floating gate, respectively. t s and N denote read pulse duration and P/E cycle count, respectively.
In (6), p_i is determined by the ratios of 1's and 0's in the data stored in the NAND flash memory. Let p(0) and p(1) denote the probabilities of 0's and 1's, respectively. p_0 to p_3 can then be estimated by p(1)p(1), p(0)p(1), p(1)p(0), and p(0)p(0), respectively. As shown in Fig. 4, the ratio of 1's is around 50% for all data types except system metadata files (∼45%). Hence, we can approximately assume that p(0) = p(1) = 0.5.
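The estimate of p_0 to p_3 above can be illustrated with a short Python sketch. It assumes, as the product form implies, that the two bits stored in a cell are statistically independent. The function name and the biased ratio of 0.7 are hypothetical and chosen only for illustration.

```python
def level_probabilities(p_one):
    """Return (p_0, p_1, p_2, p_3) for a given probability of 1's.

    Follows the ordering in the text: p(1)p(1), p(0)p(1), p(1)p(0), p(0)p(0),
    and assumes the two bits of a cell are independent.
    """
    if not 0.0 <= p_one <= 1.0:
        raise ValueError("p_one must be a probability")
    p_zero = 1.0 - p_one
    return (p_one * p_one, p_zero * p_one, p_one * p_zero, p_zero * p_zero)


balanced = level_probabilities(0.5)   # every level gets probability 0.25
biased = level_probabilities(0.7)     # hypothetical bias toward 1's
print(balanced)
print(biased)
```

With balanced data each level is equally likely, whereas biasing the data toward 1's concentrates cells on L0. Under the same independence assumption, a ratio of 1's of about 0.495 gives p_0 of about 0.245, which is consistent with the 24.5% L0 probability reported for the case without DPA-PPU.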
In our simulation, an 8-bit BCH ECC (4200, 4096, 104), i.e., a 4200-bit codeword consisting of 4096 data bits and 104 parity bits, is applied to every 512-byte data block. The acceptable UBER(8) is set to 10^−13 [1]. The V_th distribution of level L0 is modeled by a Gaussian distribution N(1.1, 0.35). The ISPP verify voltages and the incremental program step voltage are 2.55, 3.15, 3.75, and 0.3, respectively [39]. λ is set to 2.8 × 10^−4 N^0.4 [40]. We adopt the emerging all-bit-line structure. The coupling ratios γ_y and γ_xy are set to 0.08 and 0.0048, respectively [39]. By fitting the data in [2], K_d and K_m are set to 4 × 10^−5 and 3 × 10^−6, respectively. For the read disturb model, γ, β, and t_s are set to 1.1, 8.8 × 10^5, and 8 × 10^−2, respectively. By fitting the data in [9] and [37], α and K_r are set to 5.3 × 10^−2 and 11.9, respectively. Read disturb is evaluated under a read count of 10k, and the retention time error rate is estimated with an elapsed time of 1 year.
The UBERs under program disturb, read disturb, and retention time noise are shown in Fig. 11. Fig. 11(a) shows that the program disturb UBER under DPA-PPU stays below 2 × 10^−18 even at a P/E cycle count of 40k. Compared with the system without DPA-PPU, the program disturb UBER decreases by a factor of up to 10^13. This reduction can be explained by the significantly reduced cell-to-cell interference. Fig. 11(b) shows that the retention time UBER under DPA-PPU decreases by a factor of up to 10^9 (at 30k P/E cycles) compared with the system without DPA-PPU. Fig. 11(c) shows the UBER of read disturb at a read count of 5k. Read operations are always performed on pages that have already been programmed. Therefore, the read disturb UBER we simulate here includes both program disturb and read disturb errors. Compared with the system without DPA-PPU, the read disturb UBER decreases by a factor of up to 10^5 at a P/E cycle count of 30k. Note that this reduction is observed even though DPA-PPU places more cells on L0, which increases read disturb errors. The reason is that the read disturb UBER includes both program disturb and read disturb errors, and our DPA scheme significantly reduces program disturb errors. As shown in [2], read disturb errors are much fewer than program disturb errors. Hence, the decrease in program disturb errors outweighs the increase in read disturb errors, and the read disturb UBER is reduced.
In general, retention time bit errors dominate in the NFSS; therefore, we use the retention time UBER to determine the NFSS lifetime. With an 8-byte data chunk size, the UBER reaches the acceptable UBER(8) (i.e., 10^−13) at a P/E cycle count of 23k. With a 4-byte data chunk size, it reaches UBER(8) at a P/E cycle count of 30k. Compared with the system without DPA-PPU, where the maximum allowed P/E cycle count is 7.5k, the NAND flash lifetime is improved by about 3× and 4× under 8- and 4-byte chunk sizes, respectively. In all scenarios, reducing the data chunk size improves the NAND flash reliability. Fig. 12 shows the relationship between the maximum tolerable read count and the P/E cycle count under UBER(8). The maximum read count of the system without DPA-PPU quickly drops to zero when the P/E cycle count rises above 18k. After the DPA-PPU technique is applied, the working range of the flash cells is dramatically expanded.
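The lifetime improvements quoted above follow from dividing the P/E cycle count at which the retention time UBER reaches UBER(8) by the corresponding count for the system without DPA-PPU. A minimal sketch of this arithmetic, with a hypothetical function name, is:

```python
def lifetime_gain(pe_limit_with_dpa, pe_limit_baseline):
    """Ratio of the maximum allowed P/E cycle counts (lifetime improvement)."""
    if pe_limit_baseline <= 0:
        raise ValueError("baseline P/E limit must be positive")
    return pe_limit_with_dpa / pe_limit_baseline


BASELINE_PE_LIMIT = 7_500  # system without DPA-PPU, from the text

print(round(lifetime_gain(23_000, BASELINE_PE_LIMIT), 2))  # 8-byte chunks: 3.07
print(round(lifetime_gain(30_000, BASELINE_PE_LIMIT), 2))  # 4-byte chunks: 4.0
```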
B. Evaluation of DPA-DRM
Flashsim [45] is adopted as our simulation platform. We modify the simulator by adding multichip access capability and implementing the DPA technique. Benchmarks representing five applications are selected to evaluate our scheme, and their characteristics are listed in Table III. In this simulation, the block size and page size are 512 KB and 4 KB, respectively. The OOB size is 218 bytes, and the number of blocks is set to 4096. Program latency, read latency, and erase latency are set to 900, 50, and 3.5 μs, respectively [41], [46]–[48]. The NFSS capacity is set to 256 GB, and the redundancy cache size is set to 4 MB.
We evaluate the impact of DPA-DRM on system performance under data chunk sizes of 4 and 8 bytes. The data chunk size is switched from 8 to 4 bytes when the P/E cycle count reaches 23k (the reliability limit for the 8-byte data chunk size; see Fig. 11). When the data chunk size is 8 bytes, 1/3 of the redundant bits can be stored in the OOB zone. When the data chunk size changes to 4 bytes, the number of redundant bits increases and only 1/7 of them can be stored in the OOB zone, inducing higher page access overhead. Fig. 13(a) shows that, compared with the system without DPA, after 23k P/E cycles, the response time of DPA+DRM with 8- and 4-byte data chunk sizes degrades on average by ∼7% and ∼10%, respectively. The maximum performance degradation (∼15%) occurs for the financial workload. There is also performance degradation for the read-intensive web workload. One reason for this overhead is that DPA incurs a considerable number of redundant page reads. Another is that most reads in the web workload are large and sequential, for which the hit rate of the redundant page cache is low. Fig. 13(b) and (c) show the write counts and erase counts of the two experimental setups, respectively. On average, under the 4-byte data chunk size, DPA+DRM increases the write count and erase count by ∼7% and ∼4%, respectively.
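The chunk-size switching policy used in this evaluation can be summarized by the following sketch. The function and constant names are hypothetical; only the 23k switching point and the 1/3 and 1/7 OOB fractions come from the text, and the sketch does not model how the remaining redundant bits are placed in redundant pages.

```python
from fractions import Fraction

SWITCH_PE_CYCLES = 23_000  # reliability limit of the 8-byte chunk size

# Fraction of redundant bits that fit in the OOB zone, per chunk size in bytes.
OOB_FRACTION = {8: Fraction(1, 3), 4: Fraction(1, 7)}


def chunk_size_bytes(pe_cycles):
    """Use 8-byte chunks until the switching point, then 4-byte chunks."""
    if pe_cycles < 0:
        raise ValueError("P/E cycle count cannot be negative")
    return 8 if pe_cycles < SWITCH_PE_CYCLES else 4


def spilled_fraction(pe_cycles):
    """Fraction of redundant bits that must be stored outside the OOB zone."""
    return 1 - OOB_FRACTION[chunk_size_bytes(pe_cycles)]


print(chunk_size_bytes(10_000), spilled_fraction(10_000))
print(chunk_size_bytes(25_000), spilled_fraction(25_000))
```

The larger spilled fraction after the switch (6/7 instead of 2/3) is what drives the higher page access overhead reported for the 4-byte chunk size.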
VII. CONCLUSION
In this paper, we propose a DPA error prevention technique to extend the lifespan of NAND flash storage systems. The DPA technique is based on the observation that V_th level L0 is resistant to retention time errors and cell-to-cell interference. To reduce bit errors, we propose DPA-PPU, which places more cells on V_th level L0 by biasing the numbers of 1's and 0's. We also propose DPA-DRM to mitigate the performance degradation induced by DPA-PPU. Our experimental results show that the DPA technique improves system endurance by up to four times with marginal hardware cost, while introducing approximately 10% performance overhead.
