This paper focuses on finding, characterizing and reducing the program disturb error of 3D-TLC NAND flash memory. Experimental analysis on an FPGA test platform determines the characteristics of the program disturb error. The program disturb error makes the state shift ratios of storage cells unbalanced. The MSB, CSB and LSB pages have unbalanced bit error ratios and bit error rate distributions, and the page program disturb bit error rate stays unbalanced as the number of program/erase cycles changes. Based on these experimental results on error rate imbalance, an error avoidance algorithm is designed that shifts a data state vulnerable to program disturb to one that is not. The test results show that the algorithm can reduce the program disturb error rate by 20% to 90%. The program disturb error phenomena found and the error avoidance algorithm designed in this paper therefore help improve the reliability of 3D-TLC NAND flash memory.
Introduction
Compared with traditional magnetic storage products, NAND flash memory [1] [2] [3] has many advantages, such as high storage density, short response time, vibration resistance, light weight and low power consumption, and it has been widely used. Depending on the number of bits per storage cell, NAND flash memory is divided into three types: Single-Level Cell (SLC, 1 bit per storage cell), Multi-Level Cell (MLC, 2 bits per storage cell) and Triple-Level Cell (TLC, 3 bits per storage cell). The storage structure of NAND flash memory has now changed from a traditional planar structure to a three-dimensional structure [4] [5] [6] [7], which further improves storage density. With the increase in storage density, the data reliability [8] of flash memory has decreased [9]. To design effective algorithms for ensuring reliability, we must understand flash failure modes [10] [11]. Some papers [12] [13] attribute this reliability decrease mainly to the coupling disturb between adjacent cells during data programming, which leads to program disturb errors and therefore affects data storage reliability [14]. The program disturb error is one of the main errors in 3D-TLC NAND flash memory [15]. Designing an effective fault-tolerance scheme is a powerful means of decreasing program disturb errors and improving data storage reliability, and to do so we must first determine the program disturb error patterns of 3D-TLC NAND flash memory. Prior works [12] [16] [17] [18] [19] mainly studied the error patterns of planar NAND flash memory, without systematically exploring the specific program disturb error patterns of 3D NAND flash memory. The previous work [9] investigated only the error patterns of 3D-MLC NAND flash memory, without considering the error characteristics of 3D-TLC NAND flash memory [7].
Based on the FPGA test platform, this paper conducts experiments on 3D-TLC NAND flash program disturb errors and analyzes the experimental results, especially the correlation of the program disturb error state, program disturb bit error characteristics and unbalanced page program disturb error rates. Some rules and characteristics of 3D-TLC NAND flash memory program disturb errors are discovered in these experiments, and these provide help and guidance for designing an effective error tolerance algorithm to solve program disturb problems and ensure the data storage reliability. Based on the study of the specific error mode of 3D-TLC NAND flash memory program disturb errors, a program state shift algorithm is designed in this paper. In the program process, the data state that is susceptible to disturb can be shifted to a state that is not easily disturbed, thereby reducing program disturb probability and original bit error rate. This improves the reliability of data storage on flash memory.
Background

NAND Flash Memory
The block structure of NAND flash memory resembles a cross matrix: word lines run in the transverse direction, bit lines run in the longitudinal direction, a storage cell sits at each crossing of a word line and a bit line, and a page consists of a number of storage cells on one word line. There are two main block structures: odd/even bit lines and all-bit lines. In the odd/even structure, the odd and even bit lines are accessed alternately during programming, and the even bit lines are programmed first. In the all-bit-line structure, all bit lines connected by a word line are accessed at the same time during programming.
Each cell of 3D-TLC NAND flash memory can store three bits. As shown in Figure 1, the left bit is called the Most Significant Bit (MSB), the middle bit is the Central Significant Bit (CSB), and the right bit is the Least Significant Bit (LSB). The three bits use Gray code mapping, so any two adjacent codes differ in only one bit, and there are eight states in total. Flash memory is programmed in page units, and one-shot programming writes the MSB, CSB and LSB pages, three different logical pages, at a time [14]. In other words, these three pages are located on the same word line.
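The Gray code property can be checked directly. The following is a minimal sketch, assuming the state order 111, 011, 001, 000, 010, 110, 100, 101 that this paper uses for states 0 to 7; the helper names `hamming`, `is_gray_sequence` and `page_bits` are illustrative and not part of the paper.

```python
from typing import Dict, List

# State order 0..7 as listed in this paper: 111, 011, 001, 000, 010, 110, 100, 101.
STATES: List[str] = ["111", "011", "001", "000", "010", "110", "100", "101"]


def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_gray_sequence(states: List[str]) -> bool:
    """True if every pair of neighboring states differs in exactly one bit."""
    return all(hamming(a, b) == 1 for a, b in zip(states, states[1:]))


def page_bits(state: str) -> Dict[str, int]:
    """Split a 3-bit state into its MSB (left), CSB (middle) and LSB (right) bits."""
    return {"MSB": int(state[0]), "CSB": int(state[1]), "LSB": int(state[2])}
```

Because neighboring states differ in one bit, a shift to an adjacent threshold voltage window corrupts only one of the three pages on the word line.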
Program Disturb Error
The flash memory erase operations are essentially a process of injecting charge into and releasing charge from the flash cell, which damages the oxide layer around the cell and produces charge traps in it. This results in charge disturbance, which changes the amount of charge inside the flash cell, and in noise disturbance. Therefore, the erase operation is the main factor behind program disturb errors: it not only damages the oxide layer around the flash cell, but also produces a charge coupling effect in adjacent flash cells and creates charge leakage paths in the oxide layer, which ultimately results in charge loss. The data read from a flash cell depends closely on the amount of stored charge. Once that amount changes with program/erase (P/E) cycles, read errors occur because the cell threshold shifts.
The main reason for flash memory errors is a change (increase or decrease) of the threshold voltage inside the flash cells. When data is read from a flash cell, the detected threshold voltage is compared with a reference voltage to obtain the bit value. Ideally, this voltage comparison identifies the stored binary value correctly, but when the cell is subjected to noise disturbance, the comparison can give the wrong result, and the data in the cell is misread. Many factors change the charge within a flash cell, such as P/E cycles and data retention time [20], but an error occurs only when the charge changes beyond a certain range.
When the three pages of one word line are programmed, the coupling effect of parasitic capacitance [16] disturbs the three pages of adjacent word lines that have already been programmed. The disturbed pages capture additional electrons, which leads to threshold voltage drift [9] and easily results in bit errors, as shown in Figure 1. When a 3D-TLC NAND flash cell is affected by program disturb, its threshold voltage window moves to the right and crosses into other windows, causing bit errors and thus reducing reliability.
Experiment and Analysis
State Correlation
We study the state correlation of 3D-TLC NAND flash program disturb errors by counting the conversion ratios between different states under program disturb as the P/E cycles change, as shown in Figure 2. Experimental results show that different 3D-TLC NAND flash cell states experience program disturb errors to different degrees. The state correlation of program disturb errors depends on the P/E cycles and fluctuates as they increase. Program disturb mainly shifts the cell threshold voltage to the right, from voltage windows with fewer charges to voltage windows with more charges. Such drift is more likely to occur between two neighboring states than across one or more states. For ease of presentation, we use the numbers '0', '1', '2', '3', '4', '5', '6' and '7' to denote the eight states of 3D-TLC NAND flash cells, i.e., 111, 011, 001, 000, 010, 110, 100 and 101. The error ratio from one state to another is defined as follows.
Here, $R_{ij}$ ($0 \le i, j \le 7$, $i \ne j$) denotes the state error ratio from state $i$ to state $j$ after a certain number of P/E cycles, and $N_{ij}$ represents the total number of errors that shift from state $i$ to state $j$.
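As an illustration of how such ratios could be computed from measured counts, the sketch below assumes that each ratio is the count for one transition divided by the total count over all erroneous transitions at a given P/E cycle count. That normalization is an assumption made for this example, and the function name `state_error_ratios` and the count matrix are hypothetical.

```python
from typing import List


def state_error_ratios(counts: List[List[int]]) -> List[List[float]]:
    """Return R[i][j] = N[i][j] / (sum of all N[m][n] with m != n).

    counts[i][j] is a hypothetical number of cells programmed to state i
    that were read back as state j. Diagonal entries (i == j) are not
    errors and are ignored. The normalization over all erroneous
    transitions is an assumption of this sketch.
    """
    n = len(counts)
    total = sum(counts[i][j] for i in range(n) for j in range(n) if i != j)
    ratios = [[0.0] * n for _ in range(n)]
    if total == 0:
        return ratios  # no errors observed, so every ratio stays 0
    for i in range(n):
        for j in range(n):
            if i != j:
                ratios[i][j] = counts[i][j] / total
    return ratios
```

Under this normalization, the ratios of all transitions at one P/E cycle count sum to 1, which matches the way the figures report each transition as a percentage share.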
Figure 2(a) shows that the ratio of state 001 to state 000 is higher than the ratios of state 000 to state 010 and state 000 to state 100. Because state 001 holds relatively few charges, its internal electric field is relatively weak, so it captures additional electrons and shifts to state 000 much more easily. Conversely, the shift from state 000 to state 100 is more difficult because the cell must capture enough electrons to cross the intermediate states 010 and 110. As a result, this state shift ratio is very low, below 10%. The ratio from state 000 to state 010 fluctuates around 1%, and it becomes lower as P/E cycles increase. At the beginning of P/E cycling, the ratio from state 00 to state 000 is near 4%, and it declines as P/E cycles increase. This is because data programming proceeds from the first page of the first block to the last page of the specified data block, and only then do the data read operations start. This process causes data retention errors, which shift state 000 to state 001. This error ratio fluctuates with the increase in P/E cycles. When the number of P/E cycles reaches 2000, the ratio of state shifts caused by retention errors reaches 8%, after which it shows a declining trend.
Figure 2(b) shows how the ratios of state 001 to state 011, state 010 to state 000, and state 110 to state 001 change with the increase in P/E cycles. The proportions of the four error states become less stable as the number of P/E cycles increases. When the number of P/E cycles is less than 1000, these four state shift ratios are relatively high, and they decline as P/E cycles increase. The error ratio from state 010 to state 110 reaches 9% near 300 P/E cycles and then declines sharply. The error ratio from state 001 to state 000 declines from the original 4.8%. The ratios of state 010 to state 110 and state 001 to state 000 are mostly higher than those of state 001 to state 011 and state 010 to state 000. This is because the transitions from state 001 to state 011 and from state 010 to state 000 are caused by data retention errors, and it is difficult to move from a high state to a low state: a high state holds more charge, so the cell must lose enough electrons to reach a lower state, which requires a longer retention period.
Figure 2 (c) shows the change trend of state 011 to state 111, state 100 to state 101, state 100 to state 110, and state 101 to state 100 with an increasing number of P/E cycles. When the P/E cycles are increased, the error ratios of state 011 to state 111, state 100 to state 110, and state 101 to state 100 show a stable trend.
Figure 2(d) shows how the error ratios of state 110 to state 010, state 110 to state 100, state 111 to state 011, and state 111 to state 001 change with an increasing number of P/E cycles. The error ratios of state 110 to state 010, state 110 to state 100, and state 111 to state 001 are very low and stable near 5%, because it is more difficult for program disturb to push the threshold voltage window across two states. The error ratio of state 111 to state 011 caused by program disturb is much higher and grows rapidly with P/E cycles, from about 15% to 75%. This is because state 111 is the erased state: it holds fewer electrons and its electric field is weak, so it captures extra electrons from outside more easily and shifts to state 011. The program disturb errors of 3D-TLC NAND flash memory thus make the cell state shift ratios unbalanced. A cell shifts easily to the neighboring state, but it is difficult for it to capture enough electrons to cross an intermediate state and reach other states. In addition, under program disturb the cell state shift ratio from a lower state to a higher state falls below 5%. These state shifts are caused by retention errors; because the retention time is short, retention errors are fewer than program disturb errors, which are the main errors. State 100 shifts to state 101 due to program disturb errors, and its error ratio is significantly higher than that of the other three transitions. With the increase in P/E cycles, the error ratio has an overall downward trend. A cell state with fewer electrons shifts much more easily to a cell state with more electrons.
Bit Error Characteristics
In this paper, we theoretically study the causes of the MSB, CSB and LSB page errors induced by program disturb in flash memory. Program disturb typically shifts the flash threshold voltage to the right, causing bit errors. The bit errors for the LSB, CSB and MSB pages caused by state shifts can be modeled as follows.
Here, $R_{lsb}$, $R_{csb}$ and $R_{msb}$ respectively represent the error ratios of the LSB page, CSB page and MSB page when disturbed by programming. $N_{ij}$ ($0 \le i, j \le 7$) denotes the ratio of state $i$ to state $j$. When the flash memory is affected by program disturb, cells with fewer electrons shift to adjacent states, and this proportion is relatively large. We count the error ratio in the MSB page, CSB page and LSB page, including both bit 0 flipped to bit 1 and bit 1 flipped to bit 0, and the corresponding bit error rate distribution is shown in Figure 4 and Figure 5 respectively.
Figure 3 shows how the bit error ratios of the MSB page, CSB page and LSB page change as P/E cycles increase. In the figure, 0>>1 denotes bit 0 flipped to bit 1, and 1>>0 denotes bit 1 flipped to bit 0. The error ratio of 1>>0 is significantly higher than that of 0>>1 in the MSB page and the CSB page. Conversely, the error ratio of 0>>1 is significantly higher than that of 1>>0 in the LSB page. The error ratio of 0>>1 in the MSB page gradually decreases from the original 85% to 5% as P/E cycles increase. The error ratios of 0>>1 in the CSB page and 1>>0 in the LSB page fluctuate as P/E cycles increase. When the number of P/E cycles is less than 3000, the error ratio of 0>>1 in the LSB page is relatively high and fluctuates around 90%. When the number of P/E cycles exceeds 3000, the error ratio of 0>>1 in the LSB page starts to decline, and when it exceeds 5000, the ratio has fallen to 70%. In the MSB page, the error ratio of 1>>0 increases from 18% to 95%, and the error ratio of 1>>0 in the CSB page is about 80%, fluctuating above and below that value. The error of 1>>0 in the MSB page is caused by 110>>010, 111>>011 and 111>>001 in Figure 3. The error of 0>>1 in the MSB page is caused by 000>>100, 010>>110 and 011>>111.
In the CSB page, the 1>>0 error is caused by 011>>001, 010>>000, 110>>100 and 111>>001, and the 0>>1 error is caused by 000>>010, 001>>011 and 100>>110. In the LSB page, the 1>>0 error is caused by 001>>000 and 101>>100, and the 0>>1 error is caused by 000>>001 and 100>>101.
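The mapping from a state transition to the page errors it produces follows mechanically from the bit positions. The sketch below illustrates it, assuming the left, middle and right bits belong to the MSB, CSB and LSB pages as defined in the Background section; the function name `page_errors` is hypothetical.

```python
from typing import List, Tuple


def page_errors(src: str, dst: str) -> List[Tuple[str, str]]:
    """List (page, 'a>>b') for each bit that flips when state src is read as dst."""
    pages = ("MSB", "CSB", "LSB")
    return [(p, f"{a}>>{b}") for p, a, b in zip(pages, src, dst) if a != b]


# Transitions between neighboring states corrupt a single page.
single = page_errors("111", "011")
# A transition across an intermediate state can corrupt two pages at once.
double = page_errors("111", "001")
```

For example, 111>>011 yields only an MSB 1>>0 error, while 111>>001 yields both an MSB 1>>0 and a CSB 1>>0 error, which is why that transition appears in the lists for both pages.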
In Figure 2, the ratio of 111>>011 increases with the increase in P/E cycles, which makes the 1>>0 error in the MSB page increase gradually and account for a high proportion. The 100>>101 transition has a high proportion, leading to a relatively high error ratio of 0>>1 in the LSB page. The 1>>0 error in the CSB page is caused by a larger number of state shifts. Figure 4 shows the bit error rate distribution of 0>>1 and 1>>0 in the MSB page, CSB page and LSB page as P/E cycles change. The error rates of 0>>1 and 1>>0 are unbalanced across the MSB, CSB and LSB pages. In the MSB page, the error rate of 1>>0 is high and grows approximately exponentially with the increase in P/E cycles. When the number of P/E cycles reaches 5000, the error rate of 1>>0 exceeds 0.005 in the MSB page, and reliability is seriously threatened. The other bit error rates are lower than 0.001 and grow steadily with the increase in P/E cycles. In the MSB page, CSB page and LSB page, the distributions of the 0>>1 and 1>>0 bit error ratios are unbalanced. In the MSB page, the 1>>0 error occurs easily and its error rate increases with the increase in P/E cycles.
Error Rate Imbalance
Finally, we theoretically study the unbalanced error rate caused by program disturb of 3D-TLC NAND flash pages. When cells suffer from program disturb, the unbalanced bit error rate of the LSB, CSB and MSB pages introduced by state shifts is expressed as follows: […] CSB page and LSB page. In this paper, we count the program disturb error rate of flash pages as the P/E cycles increase, as shown in Figure 5. The bit error rate of page program disturb is unbalanced. The MSB page program disturb bit error rate increases relatively quickly with the number of P/E cycles, following an approximately exponential trend. When the P/E cycle count is 5000, the bit error rate reaches […]. However, for the CSB and LSB pages, the bit error rate of program disturb is lower, mostly below $10^{-3}$, and it increases slowly with the change of P/E cycles. As previously mentioned, flash program disturb errors make the shift ratios of the cell states unbalanced, and the 1>>0 error occurs more easily in the MSB page as P/E cycles increase. The red curves in Figure 5 represent the trends of the average bit error rate of program disturb in the MSB, CSB and LSB pages.
Figure 6 shows the distribution of the program disturb bit errors as P/E cycles change. When the P/E cycle count is less than 1000, the share of program disturb errors in the LSB page is much higher and fluctuates around 55%. As P/E cycles increase, the percentage of program disturb errors in LSB pages gradually decreases, and at 5000 P/E cycles it is reduced to 10%. The percentage of program disturb errors in the CSB page also decreases from the original 20% to 10%, and it is much lower than those of the MSB page and LSB page because the CSB page has few program disturb errors, as shown in Figure 5. The percentage of program disturb errors in the MSB page increases from the initial 25% to 80% with the increase in P/E cycles. When the P/E cycle count is greater than 2000, the MSB page accounts for most of the program disturb errors because of severe program disturb. Both the bit error rate and the percentage of program disturb errors of flash pages are unbalanced as P/E cycles change, and both are higher in the MSB page. The program disturb errors in the CSB and LSB pages increase nearly linearly, and their error percentage is reduced gradually with the increase in P/E cycles.
Design of State Shift Algorithm
Based on the test results on flash memory program disturb error modes, we observe that when flash pages are programmed, the MSB, CSB and LSB pages suffer from varying degrees of program disturb, which causes a serious imbalance of the bit error rate. The bit error rate of the MSB page is much higher than that of the other two pages, and it gradually increases with the increase in P/E cycles, as shown in Figures 4, 5 and 6. The high bit error rate seriously affects reliability, so the bit error rate must be reduced at its source. On the one hand, decreasing the bit error rate ensures reliability. On the other hand, it reduces the cost of the error-correcting code, such as decoding complexity and decoding delay, and a shorter decoding delay improves the read performance of flash storage systems. Once the error rate of a page in a flash block exceeds the correction capability of the error-correcting code, the block is marked as a bad block and no longer used, resulting in a huge waste of storage space.
The fundamental cause of the high bit error rate in MSB pages is the increase in the number of P/E cycles. State 111 is easily affected by program disturb and shifted to state 011. The ratio of state 111 to state 011 increases to 75% with the increase in P/E cycles, accounting for a large proportion of all state transitions, as shown in Figure 2(d). Once this state changes to state 011, it causes a 1-to-0 error in the MSB page. This is the root cause of the higher 1-to-0 error rate in the MSB page, as shown in Figures 3 and 4. Therefore, this paper designs a program state shift algorithm to reduce the program disturb error rate of the MSB page, thus improving the reliability and performance of flash memory. When performing a data program operation, the algorithm first checks whether the data state to be written to the flash cell is 111. If it is, the state is shifted to 101, because state 101 is least prone to program disturb errors. When a flash cell is affected by program disturb, it captures extra electrons and its state moves to the right, as shown in Figure 1. Even if state 101 suffers from program disturb, the shift to the right does not cause a bit error, because state 101 does not intersect with any other state to its right. Therefore, the probability of a bit error is lower.
In this paper, a program state shift module is added to the 3D-TLC NAND flash controller. It contains two main submodules: the state recognition module and the state shift module. The state recognition module identifies the data to be written to the flash cells and finds the cells whose data state is 111. The state shift module shifts the identified data state 111 so that this error-prone state does not occur in the MSB page, which reduces the program disturb probability and bit error rate associated with state 111. The state shift algorithm proceeds as follows:
Step 1 The host sends a program operation command to the controller.
Step 2 The encoder in the controller encodes the data, adding redundant bits so that bit errors can be corrected and reliability is ensured.
Step 3 The state recognition module identifies the bit data that is to be written to the flash cell, finds the bit data with a state of 111, and records it.
Step 4 The controller starts the data write operation, and the state shift module shifts the bit data of 111 to the data state 101 that does not suffer from program disturb.
Step 5 The state-shifted data is written to each page of the flash memory chip. The flash translation layer contains three modules: address mapping, garbage collection and wear leveling. Address mapping assigns physical storage space to a logical address when data is programmed and establishes the translation between logical and physical addresses. Garbage collection reclaims dirty blocks so that they can be reused, and wear leveling keeps the P/E cycles of all blocks at the same level, so that the blocks reach the end of their lifetime at about the same time.
Step 6 The host sends a read command to the controller. The controller reads the data from the chip into the decoder, and the decoded data is passed to the state shift module. According to the record made by the state recognition module, the state shift module restores the decoded data to the original data and transmits it to the host.
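Steps 3, 4 and 6 can be sketched in a few lines. This is a simplified illustration, not the controller implementation: it assumes the data is available as a list of 3-bit cell states and that the record is a list of cell positions, neither of which is specified in the paper. The names `shift_states` and `restore_states` are hypothetical.

```python
from typing import List, Tuple

VULNERABLE = "111"  # state the paper identifies as most prone to program disturb
SAFE = "101"        # state the paper shifts it to


def shift_states(cells: List[str]) -> Tuple[List[str], List[int]]:
    """Steps 3-4: record the positions holding 111 and replace them with 101."""
    record = [i for i, state in enumerate(cells) if state == VULNERABLE]
    shifted = [SAFE if state == VULNERABLE else state for state in cells]
    return shifted, record


def restore_states(cells: List[str], record: List[int]) -> List[str]:
    """Step 6: use the record to turn the shifted cells back into 111."""
    restored = list(cells)
    for i in record:
        restored[i] = VULNERABLE
    return restored


# Round trip on made-up data: cells that held 101 originally are left alone.
data = ["111", "101", "000", "111"]
written, positions = shift_states(data)
read_back = restore_states(written, positions)
```

The record is what distinguishes a shifted cell from a cell that held 101 originally, so the restore step depends on the record being stored reliably; how and where it is stored is left open here.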
Algorithm Verification and Analysis
The program state shift algorithm is designed to reduce the probability of data state 111, thus reducing the bit error rate of the MSB page and improving the reliability of data. Figure 7 shows the distribution of the state error ratio in the flash cell with the different P/E cycles.
With the increase in P/E cycles, the error ratio of state 111 shifting to state 011 increases from the initial 2% to 75%, while the share of all other state errors gradually decreases from 98% to 25%. The error ratio of state 111 to state 011 reaches the same level as the combined ratio of all other state errors as the P/E cycles increase from 2000 to 2500. This large share of the total state error ratio is the most direct factor behind the increase in the original bit error rate of the MSB page.
Figure 8 shows the cell error rate under program disturb and the cell error rate after the program state shift method is applied. Before the state shift method is used, program disturb makes the error rate of state 111 shifting to state 011 grow with the number of P/E cycles, and it accounts for a large proportion of the total cell error rate. This leads to a cell error rate that increases with the number of P/E cycles. The error rate is $9 \times 10^{-4}$ at 500 P/E cycles, early in the flash lifetime. When the P/E cycle count increases to 5000, the cell error rate is $6.8 \times 10^{-3}$, a 7.6-fold increase. After the program state shift algorithm is adopted, the cell error rate is significantly reduced. At 5000 P/E cycles, the cell error rate decreases from the original $6.8 \times 10^{-3}$ to $5.2 \times 10^{-3}$, a reduction of 24%.
Figure 9 shows how the 1-to-0 error ratio in the MSB pages is reduced across P/E cycles after using the program state shift algorithm. Before the algorithm is used, the 1-to-0 error ratio in the MSB page is higher, and it fluctuates around 90% as P/E cycles increase. State 111, when affected by program disturb, shifts to state 011 more easily, which leads to this high proportion.
Once state 111 shifts to state 011, it causes a 1-to-0 bit error in the MSB page. After the program state shift algorithm is applied, state 111 is moved to state 101. Program disturb errors are thereby avoided during programming, so the 1-to-0 error ratio in the MSB page is greatly reduced. When the P/E cycle count is 500, the 1-to-0 error ratio in the MSB page is reduced from 76% to 18%. When the P/E cycle count is increased from 500 to 5000, the error ratio of bit 1 in the MSB page is reduced from 95% to 40%. After using the program state shift algorithm, the original bit error rate of the MSB page is significantly reduced because of the decreased 1-to-0 errors. When the P/E cycle count is 500, the error rate of the MSB page is reduced by 72%, and the reduction reaches up to 90% when the P/E cycle count is 5000.
Figure 10 shows the distribution of the original bit error rate before and after using the program state shift algorithm. Before the algorithm is used, the bit error rates of the MSB page, CSB page and LSB page are quite different, and this gap grows with the increase in P/E cycles. The bit error rate of the MSB page clearly increases with the P/E cycles. At 500 and 1000 P/E cycles, the original bit error rates of the MSB page and LSB page are close, about $5 \times 10^{-4}$. With the increase in P/E cycles, the original bit error rates of the CSB page and LSB page increase slowly, while the original bit error rate of the MSB page rises sharply, so the gap between the MSB page and the LSB and CSB pages widens rapidly. The raw bit error rate of the MSB page is as high as $5.6 \times 10^{-3}$ at 5000 P/E cycles. The extremely unbalanced bit error rate in flash pages has a serious impact on reliability. Once the bit error rate of the MSB page reaches the correction capability of the error-correcting code, the block becomes a bad block and is no longer used. Conversely, when the program state shift algorithm is used, the original bit error rate of the MSB page is significantly reduced. With the increase in P/E cycles, the original bit error rate of the MSB page fluctuates within a stable range, which narrows the wide gap in original bit error rates between the MSB, LSB and CSB pages. This improves the reliability and lifetime of flash storage systems.
Conclusions
This paper conducts experiments to study the program disturb errors of 3D-TLC NAND flash memory, and the results identify important characteristics of program disturb in three respects: state correlation, bit error characteristics and error rate imbalance. Based on the error rate imbalance, we design a state shift algorithm that can reduce the program disturb error rate by 20% to 90%. The phenomena found in this paper and the program disturb error avoidance algorithm help improve the reliability of 3D-TLC NAND flash memory. However, the experiments and analysis in this paper are limited, and many aspects still need further study.
