Abstract-This paper studies the system-level reliability of 16nm MLC NAND flash memories under the total ionizing dose (TID) effect. Errors that occur in the parts under TID effect are characterized at multiple levels. Results show that faithful data recovery only lasts until 9k rad. Data errors observed in irradiated flash samples are strongly asymmetric. To improve the reliability of the parts, we study error mitigation methods that exploit the specific properties of TID errors. First, we implement a novel data representation scheme that stores data using the relative order of cell voltages. The representation is more robust against uniform asymmetric threshold voltage shifts of floating gates. Experimental results show that the scheme reduces errors by at least 50% for blocks with fewer than 3k program/erase cycles (PECs) and less than 10k rad. Second, we conduct empirical evaluations of memory scrubbing schemes. Based on the results, we identify a scheme that refreshes cells without block erasure. Evaluation results show that parts under this scrubbing scheme survive up to 8k PECs and a total dose of 57k rad.
I. INTRODUCTION
NAND flash memory is an attractive medium for primary storage in various space applications thanks to its excellent properties such as high density, low power consumption and random access. A necessary step in determining whether a NAND flash memory is suitable for a specific space application is to test its reliability against radiation. In scenarios where flash components must survive high radiation levels, such as GEO orbits and other deep space trajectories, radiation-hardened NAND flash memory is used. The thin (8nm) tunnel oxides and high internal on-chip voltages (18-20V) make radiation hardening of NAND flash memories very challenging. However, many useful space missions (LEO, some Mars missions, etc.) only require a reduced level of radiation tolerance, often less than 50k rad. Therefore, it is possible to use conventional commercial NAND flash memory to achieve high storage density at low cost. In this paper, we focus on such scenarios.
The reliability of commercial NAND flash memory decreases exponentially as density increases. NAND flash memory stores data by programming memory cells to different charge levels. As feature size shrinks, memory cells carry less charge, and their charge levels are more sensitive to both internal and external noise such as interference, charge leakage and radiation. This paper studies the radiation reliability of 16nm commercial NAND flash memory, one of the densest and most cost-effective NAND flash memories on the market. The question is how these parts behave under total ionizing dose (TID) and single-event upset (SEU) effects. In space, these effects correspond to accumulated background radiation and strikes from high-energy ionizing particles, respectively. This paper only reports our results on TID effect, leaving the SEU effect for future work.
TID effect has recently been studied for NAND flash memories with lower densities, e.g. 90nm∼25nm single-level cell (SLC) [1] [2] [3] [4] and 25nm multi-level cell (MLC) [4]. To the best of our knowledge, this paper is the first to investigate the system-level reliability of 16nm NAND flash memory. We provide a quantitative study of the properties of errors at multiple levels during irradiation. Based on the study, we further implement effective error mitigation methods for reliability enhancement.
* Part of the research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.
The contributions of this paper include:
1) Characterization of the distribution shifts, bit errors, and cell state errors of the parts under TID effect. Results show strong asymmetry in errors at all levels. Faithful data recovery only lasts until 9k rad when typical configurations of error correcting codes (ECC) are used. ECC decoding failures due to voltage shifts in floating gates (FGs) occur much earlier than failures of peripheral circuits.
2) Design and implementation of a novel data representation named rank modulation [5]. The scheme represents data using the relative order of cell voltages, and provides higher reliability against uniform asymmetric shifts. Results show that the scheme reduces bit errors by at least 50% for blocks with less than 10k rad and fewer than 3k PECs.
3) Empirical evaluation of three memory scrubbing (MS) methods. Based on the results, we identify an MS scheme that refreshes cells without block erasure. Under this scheme, the parts survived up to 8k PECs and 57k rad, outperforming the other two schemes in different aspects.
II. METHODOLOGY

A. Experimental setup
We used 16nm planar MLC NAND flash memory manufactured by a major flash vendor. Each package contains one die with a total capacity of 64Gb and a specified lifetime of 3k PECs. Fig. 1 shows a decapsulated package. A die contains two planes, with each plane having 1024 blocks. A block contains 256 pages, with a page size of 16KB. The peripheral circuits mainly include charge pumps used for page reading, page programming and block erasure, row decoders for page addressing, as well as control logic. The parts comply with the ONFI standard [6].
We operated the flash packages using a commercial NAND flash tester [7] shown in Fig. 2. The tester uses an FPGA as its controller, and connects up to two daughter boards, each carrying two sockets where NAND flash packages are inserted. Flash characterization and error mitigation methods in this work were implemented as software on a host PC, which communicated with the tester via a USB interface.
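As a quick consistency check, the stated geometry can be multiplied out. The sketch below assumes that the 16KB page size refers to user data only (spare area is not counted) and that "64Gb" is meant in binary units; the constant names are illustrative.

```python
# Geometry as stated in the text (one die per package).
PLANES_PER_DIE = 2
BLOCKS_PER_PLANE = 1024
PAGES_PER_BLOCK = 256
PAGE_SIZE_BYTES = 16 * 1024  # assumption: user data only, spare area excluded


def die_capacity_bits() -> int:
    """Return the user-data capacity of one die in bits."""
    pages = PLANES_PER_DIE * BLOCKS_PER_PLANE * PAGES_PER_BLOCK
    return pages * PAGE_SIZE_BYTES * 8


if __name__ == "__main__":
    print(f"{die_capacity_bits() / 2**30:.0f} Gb per die")
```

Under these assumptions the product matches the 64Gb capacity quoted for the package.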
Irradiation experiments were carried out at Shepard Co-60 High Dose Rate TID Center at the Jet Propulsion Laboratory. In the experiments, the parts were removed from the testers and relocated to a chamber for irradiation. After irradiation, the parts were returned to the tester for measurements.
B. Testing procedures
To characterize the parts under TID, randomly selected blocks were first program-erase cycled up to 8k PECs. This process degrades the blocks to different stages of their lifetime. We then stored pseudo-random test data in the blocks. Pages in a block were programmed sequentially (i.e. page 0, 1, ..., 255) as recommended by the vendor for reducing programming interference. After the test data were written, the blocks were immediately read. Both the binary output data and the distributions of the blocks were saved. The parts were then irradiated up to 63k rad at a dose rate of 9 rad/s. The irradiation was carried out at room temperature, and all the parts were unbiased. During this process, the blocks were read and their distributions were measured each time an additional 3k rad had been received. After irradiation, all the collected output data were compared with the test data regenerated from the saved seeds for error analysis. The distributions measured during irradiation were compared with the initial distributions to understand the errors observed in the output data.
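The bookkeeping of this procedure can be sketched as follows. This is only an illustration: the function names are hypothetical, and Python's `random` module stands in for whatever pseudo-random generator the host software actually used.

```python
import random


def make_test_page(seed: int, page_size: int = 16 * 1024) -> bytes:
    """Regenerate the pseudo-random test data of one page from its saved seed."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(page_size))


def read_points_krad(total_krad: int = 63, step_krad: int = 3) -> list:
    """Total doses (in krad) at which blocks are read and distributions measured."""
    return list(range(step_krad, total_krad + 1, step_krad))


def exposure_seconds(total_rad: float = 63_000, dose_rate: float = 9.0) -> float:
    """Cumulative beam time needed to reach the total dose at the given dose rate."""
    return total_rad / dose_rate


if __name__ == "__main__":
    print(len(read_points_krad()), "measurement points")
    print(exposure_seconds(), "seconds of cumulative exposure")
```

Because the data are regenerated from the seed, only the seeds need to be stored alongside the read-back data for later error analysis.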
III. CHARACTERIZATION RESULTS
This section reports the results of NAND flash characterization. We first discuss the overall raw bit error rates (RBERs) of the parts under TID effect, and determine their lifetime under typical ECC configurations used by commodity solid-state drives (SSDs). We then describe shielding test results that identify the major source of errors. Finally, we analyze the properties of distribution shifts, bit errors and cell state errors, and explain the strong asymmetry observed in errors at all levels. The purpose of the characterization is to gain a deeper understanding of the properties of the errors, which facilitates the design of more effective error mitigation schemes.
A. Overall raw bit error rate
Fig. 3 shows the overall RBERs (defined as the ratio between the number of bit errors and the total number of bits in a block) of the blocks under TID effect. Compared to the RBERs measured at 0 rad, the RBERs of the blocks that carried 5, 1k, 3k, 6k, and 8k PECs increased by 1647x, 1017x, 463x, 141x, and 55x at 63k rad, respectively. The errors observed above were mainly due to programming interference and TID effect. Programming interference increases the voltages of neighboring cells when a cell is being programmed, due to capacitive coupling. TID effect changes the threshold voltage of an FG in two ways [8]. First, irradiation generates electron-hole pairs in tunnel oxides. When an FG stores charges (in this case, the cell has a positive threshold voltage), some of the generated holes are pulled into the FG by the electric field created by the stored charges. The injected holes recombine with the charges in the FG, and thus reduce the cell voltage. Similarly, when an FG stores holes (the cell is in the erased state and has a negative threshold voltage), charges in the tunnel oxide are pulled into the FG to recombine with the holes after irradiation. Such recombination shifts the cell threshold voltage towards the neutral state. Second, irradiation causes photoemission, which dissipates the charges or holes carried by FGs.
The TID errors in our parts are almost entirely due to the downward shift of cell voltages. This is because only FGs in the erased state store holes and have negative threshold voltages. The reference threshold voltages (RTVs) for distinguishing cells in the erased state from cells in other states are always positive, while FGs with negative voltages can never be shifted to positive values under TID effect. Therefore, upward voltage shifts of erased FGs do not introduce data errors.
After 18k rad, the blocks with more PECs have lower RBERs than the blocks with fewer PECs. This is because errors due to programming interference dominate at lower doses, and radiation errors dominate at higher doses. For cells that are not in the erased state, programming interference and TID effect shift their threshold voltages in opposite directions, so cells that carried more PECs have more programming errors "corrected" by radiation.
TID effect also degrades the quality of peripheral circuits [9]. Irradiation makes charge pumps generate lower than normal voltages, leading to under-programming errors and incomplete block erasure. The degradation of row decoders causes misaddressing, leading to a large number of bit errors due to incorrect read addresses. In our experiments, we observed that programming, erasure and addressing started failing at 42k rad. Programming and erasure failures occurred persistently, while addressing failures were temporary. Fig. 4 shows an example of row decoder failure at 40k rad.
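The RBER definition used in this section (number of bit errors divided by the total number of bits) can be sketched as below. The function names are illustrative, and the inputs are assumed to be the written data and the data read back from the same block.

```python
def count_bit_errors(written: bytes, read: bytes) -> int:
    """Count the bit positions at which the read data differ from the written data."""
    if len(written) != len(read):
        raise ValueError("written and read data must have the same length")
    # XOR leaves a 1 at every mismatching bit; count the ones byte by byte.
    return sum(bin(w ^ r).count("1") for w, r in zip(written, read))


def rber(written: bytes, read: bytes) -> float:
    """Raw bit error rate: bit errors divided by the total number of bits."""
    return count_bit_errors(written, read) / (8 * len(written))
```

For a whole block, `written` and `read` would simply be the concatenation of all 256 pages.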
B. Lifetime under error correcting code
The lifetime of a block under TID effect is defined as the largest total dose at which its RBER remains within the correction limit of the ECC. The correction limit of an ECC is the maximal RBER that keeps the uncorrectable bit error rate (UBER) below $10^{-15}$, as required by the industry standard [10]. UBER measures the bit error rate after ECC decoding, and can be calculated as
$$\mathrm{UBER} = \frac{1}{n} \sum_{i=t+1}^{n} \binom{n}{i} p^{i} (1-p)^{n-i},$$
where $p$ is the current RBER, $t$ is the number of bit errors that the ECC corrects, and $n$ is the codeword length used. The correction limits shown in Fig. 3 are those of BCH codes that correct 40, 45, 50, and 55 bit errors, respectively. BCH code is one of the dominant ECCs used in commodity SSDs as it allows efficient hardware implementation, low redundancy and good error correction capability. The BCH configurations in Fig. 3 are typical for SSDs, where high code rates (the ratio between the number of information bits and the codeword length) are desired. For instance, the BCH codes that correct 40 bit errors have a code rate of 0.935, and they are recommended by the vendor of the parts. Comparisons between the correction limits and the RBERs of the blocks show that reliable data recovery only lasted up to 9k rad. For instance, under the 40-bit BCH code the data of the blocks that carried only 5 PECs could be recovered until 6k rad while meeting the requirement on UBER, and data recovery failed at 3k rad for blocks that carried 3k PECs.
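A correction limit of this kind can be computed numerically. The sketch below assumes the common form of UBER, namely the probability of more than t errors in an n-bit codeword divided by n, and assumes a codeword of 1KB of data plus 14 parity bits per corrected error (a BCH code over GF(2^14)); both are assumptions for illustration and may differ from the exact configuration behind Fig. 3.

```python
import math


def log_binom_pmf(n: int, i: int, p: float) -> float:
    """Natural log of C(n, i) * p**i * (1 - p)**(n - i), valid for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))


def uber(p: float, t: int, n: int) -> float:
    """Assumed UBER form: P(more than t bit errors in n bits) / n."""
    if p <= 0.0:
        return 0.0
    total = 0.0
    for i in range(t + 1, n + 1):
        term = math.exp(log_binom_pmf(n, i, p))
        total += term
        # Past the mean of the binomial the terms only shrink; stop once negligible.
        if i > n * p and term < total * 1e-18:
            break
    return total / n


def correction_limit(t: int, n: int, target: float = 1e-15) -> float:
    """Largest RBER (found by bisection) whose UBER stays at or below the target."""
    lo, hi = 0.0, 0.5
    for _ in range(40):
        mid = (lo + hi) / 2
        if uber(mid, t, n) <= target:
            lo = mid
        else:
            hi = mid
    return lo


def codeword_bits(t: int, data_bits: int = 8192, m: int = 14) -> int:
    """Hypothetical codeword length: data bits plus m parity bits per corrected error."""
    return data_bits + m * t


if __name__ == "__main__":
    for t in (40, 45, 50, 55):
        print(t, correction_limit(t, codeword_bits(t)))
```

Bisection is sufficient here because the tail probability grows monotonically with the RBER.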
C. Shielding test
As errors introduced during irradiation are due to the degradation of FGs and peripheral circuits, it is important to understand how each source contributes to the errors. We thus conducted shielding tests to characterize the RBERs of the parts with different components being shielded separately during irradiation. Specifically, we divided 8 parts into two groups of the same size. As shown in Fig. 5, we shielded the FG arrays of each part in the first group, and shielded the peripheral circuits of each part in the second group. A standard 3.625 inch thick lead brick was used, which theoretically shields 100% of the radiation. The boundary between peripheral circuits and FG arrays was determined using the decapsulated part in Fig. 1. Both groups were first irradiated to 25k rad, and then to 50k rad. Fig. 6 compares the RBERs of the shielded parts with the RBERs of unshielded parts and the RBERs of unirradiated parts. At both 25k rad and 50k rad, the RBERs of the parts with shielded peripheral circuits are close to those of the unshielded parts. The RBERs of the parts with shielded FG arrays are the closest to those of the unirradiated parts. The RBERs of the parts with shielded FG arrays are only 22% higher on average at 50k rad than at 25k rad, whereas the difference is 358% for the parts with shielded peripheral circuits. These observations all indicate that the degraded FG arrays are the dominant source of errors.
D. Threshold voltage distribution
The threshold voltages of FGs are shifted due to TID effect. MLC NAND flash memory uses four logical cell states to store 2 bits in each FG. Each state corresponds to a different level of threshold voltage, and is read by comparing the cell voltage with predetermined RTVs. We refer to the logical states as P0, P1, P2 and P3, with P0 being the erased state and P3 being the state with the highest voltage on average. Following the discussion in Section III-A, most of the cells in state P0 have negative voltages, and thus will be shifted to higher values. The threshold voltages of the cells in the other three states will be shifted to lower values. The state of a cell changes when its threshold voltage is shifted across a predetermined RTV. Fig. 7 shows the threshold voltage distributions of the cells in states P1, P2, and P3 before irradiation, after 24k rad, and after 48k rad, respectively. The cells were from the same wordline in the middle of a flash block, and each distribution was measured by reading the cells multiple times with different RTVs. The distributions of all three states kept shifting towards lower voltages. The distribution of state P0 was not measured due to its negative mean voltage. A significant portion of the cells in states P2 and P3 (belonging to the left tails of their distributions) shifted to the central regions of states P1 and P2 after 48k rad, leading to the high RBERs observed in Fig. 3. The amount of voltage shift of a cell is proportional to its initial voltage before irradiation. For instance, among all the distributions, the distribution of state P3 had the largest shift, and that of state P1 had the smallest shift. This is because a cell with a higher initial voltage creates a larger electric field, which pulls more holes from the tunnel oxide into the FG to recombine with the stored charges.
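How a state is read by comparing the cell voltage with RTVs, and why only downward shifts of programmed cells cause errors, can be illustrated with a small sketch. The RTV values and cell voltages below are hypothetical numbers chosen for illustration, not measurements from the parts.

```python
from bisect import bisect_right

# Hypothetical RTVs (in volts) separating P0|P1, P1|P2 and P2|P3.
RTVS = (0.5, 2.0, 3.5)


def read_state(vth: float, rtvs=RTVS) -> int:
    """Return the index X of the logical state PX: the number of RTVs at or below vth."""
    return bisect_right(rtvs, vth)


if __name__ == "__main__":
    # A P3 cell near the left tail of its distribution loses charge and is read as P2.
    print(read_state(3.8), "->", read_state(3.8 - 0.5))
    # An erased cell moves up towards 0 V but stays below the positive P0|P1 RTV.
    print(read_state(-2.0), "->", read_state(-2.0 + 0.5))
```

Measuring a distribution as in Fig. 7 amounts to repeating such reads while sweeping the RTVs.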
E. Error on cell state
We analyze the patterns of cell state errors caused by threshold voltage shift. In MLC NAND flash memory, the logical states of the cells in a physical page are determined by the bits from the input lower page and upper page. A lower (an upper) page contains the LSB (MSB) to be stored in each cell. A pair of (MSB, LSB) is mapped to a logical state following the Gray mapping: (1, 1) ↔ P0, (0, 1) ↔ P1, (0, 0) ↔ P2 and (1, 0) ↔ P3. We computed the cell states before and after irradiation, and compared them for analysis. Fig. 8 shows the numbers of upward state errors (changes of cell state from PX to PY, where X < Y) and downward state errors for blocks at different PECs and total doses. Downward errors started dominating at 6k rad. For instance, at 12k rad there are 3x-193x more downward state errors than upward errors. The number of upward errors gradually decreases as total dose increases due to radiation-induced charge loss. Fig. 9 shows that adjacent state transitions (state changes from PX to PY, where |X − Y| = 1) are the major state error patterns, whose number is up to 2.5×10^4 times higher than that of nonadjacent state errors on average. There are more P2 → P1 and P3 → P2 errors than P1 → P0 errors due to the larger voltage shifts of the cells in states P2 and P3.
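The classification of state errors used in Fig. 8 and Fig. 9 can be sketched as follows, with each cell state represented by its index X in PX. The function name and the counter keys are illustrative.

```python
from collections import Counter


def classify_state_errors(written, read) -> Counter:
    """Count upward/downward and adjacent/nonadjacent state errors, plus each pattern."""
    counts = Counter()
    for x, y in zip(written, read):
        if x == y:
            continue  # the cell was read in the state it was written to
        counts["upward" if y > x else "downward"] += 1
        counts["adjacent" if abs(x - y) == 1 else "nonadjacent"] += 1
        counts[f"P{x}->P{y}"] += 1
    return counts
```

For example, comparing written states `[3, 2, 1, 0, 3]` with read states `[2, 1, 1, 1, 1]` gives three downward errors, one upward error, and one nonadjacent error (P3 → P1).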
F. Bit error
Bit errors displayed strong asymmetry as well. Fig. 10(a) shows the average RBERs of upper pages and lower pages under TID effect. The RBERs of upper pages are higher than those of lower pages by 22%-41% on average. This is explained by the Gray mapping between cell states and binary bits. The errors in MSBs/upper pages are mainly due to the state errors that cause P3→P2 and P1→P0 state transitions. The errors in LSBs/lower pages are mainly due to the state errors that cause P2→P1 state transitions. According to the measurements in Fig. 9, upper pages thus have higher RBERs. Therefore, the RBERs of upper pages should be used as the worst case for determining the ECC correction capability in practice. Fig. 10(b) and Fig. 10(c) analyze the patterns of bit errors in lower pages and upper pages, respectively. At higher doses, bit errors are asymmetric in lower pages, which contain significantly more 0 → 1 errors than 1 → 0 errors. This is because downward cell state errors dominate at higher doses, and downward errors cause 0 → 1 errors in lower pages. Bit errors in upper pages are more symmetric than those in lower pages. In upper pages, 1 → 0 bit errors are caused by P3→P2 state errors, and 0 → 1 bit errors are caused by P1→P0 state errors. Following the results of Fig. 9, the number of 0 → 1 errors thus approaches that of 1 → 0 errors at higher total doses, as shown in Fig. 10(c).
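The link between state transitions and bit errors can be checked mechanically. The sketch below reads each pair of the Gray mapping as (upper-page bit, lower-page bit), which is the reading under which the mapping reproduces the transitions discussed in this section; this ordering is an assumption of the sketch.

```python
# State index -> (upper-page bit, lower-page bit); the ordering is an assumption.
GRAY = {0: (1, 1), 1: (0, 1), 2: (0, 0), 3: (1, 0)}


def bit_flips(x: int, y: int) -> dict:
    """Return which page bits change, and how, when a cell moves from state Px to Py."""
    (ux, lx), (uy, ly) = GRAY[x], GRAY[y]
    flips = {}
    if ux != uy:
        flips["upper"] = f"{ux}->{uy}"
    if lx != ly:
        flips["lower"] = f"{lx}->{ly}"
    return flips


if __name__ == "__main__":
    for x, y in ((3, 2), (2, 1), (1, 0)):
        print(f"P{x}->P{y}:", bit_flips(x, y))
```

Each adjacent transition flips exactly one bit, as expected of a Gray mapping: two of the three downward transitions fall on the upper page, and the remaining one produces a 0 → 1 error in the lower page.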
IV. ERROR MITIGATION
The error characterization shows that peripheral circuits still function at lower total doses, and that errors under TID effect are due to uniform asymmetric voltage shift. In this section, we study two effective error mitigation methods, namely rank modulation (RM) coding and memory scrubbing (MS), which take advantage of these properties to improve the lifetime of flash memory under TID effect.
A. Rank modulation
RM provides a novel data representation for reducing errors caused by asymmetric shifts in NAND flash memory [5]. Instead of using quantized voltage levels for data representation as in current flash, RM represents data using the relative order of cell voltages. The new representation provides higher reliability to flash under TID, as the order of cell voltages is largely preserved when asymmetric shifts occur.
We implemented an adapted version of the RM scheme. In the implementation, the cells of a physical page are divided into groups of equal size. User input data are first used to determine the state of each cell. Then we assign each cell a rank, which simply equals the index of the cell state. Consequently, cells with lower ranks have lower threshold voltages. Furthermore, for each group we generate metadata that record the number of cells in each rank. Both user data and metadata are then stored together in the cells of the same physical page. Data are read by sorting the cells in a group by approximated voltage, and assigning ranks following the sorted order. Cell voltages are approximated by reading with different RTVs. These RTVs split the whole range of threshold voltage into multiple bins. The results of the multiple reads are combined to determine the bin of each cell, where the index of a bin provides an estimate of the cell threshold voltage. For each group, we sort the cells by bin index, and assign ranks to cells following the sorted order. The number of cells to be assigned to each rank is given by the previously stored metadata. We refer readers to a separate paper [11] for the detailed description of our implementation and more experimental results.
Fig. 11 shows that for blocks with PEC ≤ 3k, reading using RM yields 70%, 61% and 50% lower RBERs at 1k, 5k, and 10k rad, respectively, compared to reading using adaptive reference threshold voltages. The latter is a scheme recommended by the vendor, which shifts the reference threshold voltages to lower values when cell voltages shift. In this experiment, each RM group had 512 cells, with each cell storing 1.97 bits on average (0.03 bit less than the uncoded control scheme). We used 4 RTVs between every two adjacent distributions for estimating the cell voltages. For the control scheme, we report the minimum RBER over 8 reads using different RTVs between two adjacent distributions. The values of the RTVs are supplied by the vendor.
The reliability gain of RM becomes smaller as PEC increases. When cells are being worn out, process variation is amplified due to the increased number of charges that are trapped inside the tunnel oxide. High variation implies that cell voltage shifts under TID effect are less uniform. This increases the probability that cells of higher ranks have lower voltages than cells of lower ranks after irradiation. The resulting rank switches introduce more bit errors into the output data.
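The read procedure of the adapted RM scheme can be sketched as follows. This is a simplified illustration and not the implementation of [11]: the metadata are assumed to be recovered error-free, the bin indices are hypothetical, and ties between cells in the same bin are broken by cell position.

```python
from collections import Counter


def rm_metadata(states, num_ranks: int = 4) -> list:
    """Metadata for one group: the number of cells written to each rank."""
    counts = Counter(states)
    return [counts[r] for r in range(num_ranks)]


def rm_read(bins, counts) -> list:
    """Sort cells by bin index (estimated voltage) and hand out ranks in sorted order."""
    order = sorted(range(len(bins)), key=lambda i: bins[i])
    ranks = [0] * len(bins)
    pos = 0
    for rank, count in enumerate(counts):
        for i in order[pos:pos + count]:
            ranks[i] = rank
        pos += count
    return ranks


def fixed_read(bins, bins_per_state: int = 4) -> list:
    """Baseline for comparison: quantize each bin index against fixed thresholds."""
    return [b // bins_per_state for b in bins]


if __name__ == "__main__":
    states = [0, 3, 1, 2, 3, 1]       # ranks written to a six-cell group
    shifted = [1, 11, 5, 8, 10, 4]    # hypothetical bins after a downward shift
    print(fixed_read(shifted))        # two P3 cells are misread as P2
    print(rm_read(shifted, rm_metadata(states)))
```

In the example the higher states shift further down, yet the relative order of the cells is preserved, so the rank-based read recovers the written states while the fixed-threshold read does not.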
B. Memory scrubbing
To further improve the lifetime of flash memory under TID effect, MS has been suggested as a way to keep the RBER below the correction limit of the ECC [1] [4] [12]. Conventional MS schemes periodically read data, correct errors using ECC, erase the blocks, and write the corrected data back to the blocks [4] [12]. Therefore, cells that lose charges due to irradiation are recharged after scrubbing.
We conducted an empirical study of three different MS schemes, referred to as MLC-MS, E-SLC-MS, and SLC-IPR-MS. For the first two schemes, all the pages are programmed for storing data. MLC-MS writes data using conventional sequential MLC programming. E-SLC-MS only uses states P0 and P3 of a cell to store one bit of data, with the two states representing bits "1" and "0", respectively. Therefore, E-SLC-MS provides higher reliability at the cost of lower capacity, thanks to the large voltage gap between the two states. During scrubbing, both schemes need to erase the blocks and write back the corrected data. The process thus introduces one additional PEC to each block being scrubbed. For SLC-IPR-MS, only the lower pages of a block are written for storing data, and upper pages are never programmed. This enables the single-level cell (SLC) mode, which uses states P0 and P1+ for storing bits "1" and "0", where state P1+ is an intermediate state whose distribution lies between those of P1 and P2. Moreover, it also allows in-place rewriting (IPR) [13], i.e. bit 1 in a lower page can be reprogrammed to 0 without first erasing the block. Therefore, MS using IPR is able to correct the dominant 0 → 1 bit errors of lower pages without introducing an additional PEC. Note that the capacity of MLC NAND flash under the E-SLC-MS and SLC-IPR-MS schemes is reduced by 50%. Compared to the other two MS schemes, SLC-IPR-MS gives cells lower voltages on average due to the smaller voltage gap between states P0 and P1+, and programs fewer cells during scrubbing.
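The in-place rewriting step of SLC-IPR-MS can be sketched at the bit level. The sketch assumes that the ECC decoder has already produced the corrected page, and it models programming as clearing bits from 1 to 0; the function names are illustrative.

```python
def ipr_scrub_plan(corrected: bytes, read: bytes):
    """Split the bit errors of a page into IPR-fixable and non-fixable masks.

    reprogram: bits read as 1 that should be 0 (0 -> 1 errors), fixable in place.
    unfixable: bits read as 0 that should be 1 (1 -> 0 errors), which need an erase.
    """
    reprogram = bytes(r & ~c & 0xFF for c, r in zip(corrected, read))
    unfixable = bytes(c & ~r & 0xFF for c, r in zip(corrected, read))
    return reprogram, unfixable


def apply_ipr(read: bytes, reprogram: bytes) -> bytes:
    """Model in-place programming: every selected bit is driven from 1 to 0."""
    return bytes(r & ~m & 0xFF for r, m in zip(read, reprogram))
```

Only the cells flagged in `reprogram` are touched, which is why this scheme programs far fewer cells per scrub than a full erase and rewrite.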
In total, four parts were used in the evaluation, in which the parts were scrubbed while being irradiated. For each part, we chose 20 blocks for evaluating each MS scheme. For each scheme, the blocks were divided into an experimental group and a control group of equal size. The experimental group used MS on the blocks, and the control group only read data. Before irradiation, the 10 blocks of each group were cycled to 5, 0.5k, 1k, 2k, 3k, 4k, 5k, 6k, 7k and 8k PECs, respectively. A BCH code of length 1KB correcting 55 bit errors was used as the ECC. For all MS schemes and their control schemes, data in the blocks were read and decoded every 3k rad. The RBERs were also recorded for each read. If the number of bit errors in any codeword of a block exceeded a predetermined scrubbing threshold, which was set to 25, and was still less than 55, the whole block was scrubbed.
Fig. 12 shows the RBERs of 6 blocks with 3k PECs from the same package during irradiation. Among the blocks, three were scrubbed using different schemes. The results show that the MLC-MS and SLC-IPR-MS schemes significantly improve the lifetime of flash memory under TID effect. The zigzag patterns shown in Fig. 12 indicate the triggering of scrubbing, after which RBERs started increasing again from low values. For E-SLC-MS, no triggering of scrubbing was observed in our experiment, as the RBERs were always below the predetermined scrubbing threshold thanks to the large voltage gap between the two cell states used. The RBERs of cells under E-SLC-MS quickly moved above the correction limit of the ECC after 50k rad, possibly due to peripheral circuit failures, leading to ECC decoding failure. Although our results show no difference between E-SLC-MS and its control case, it is still possible for E-SLC-MS to be beneficial if smaller scrubbing thresholds are used, so that scrubbing can happen before the failure of peripheral circuits.
113% and 43% larger than those of MLC-MS and E-SLC-MS, respectively. The SR of SLC-IPR-MS reached 57k rad, at which point scrubbing was still effective (Fig. 12). As TID effect gradually degrades charge pumps, programming/erasure failures start occurring due to the lower than normal voltages supplied by the charge pumps [1]. The scrubbing of SLC-IPR-MS does not need block erasure, which avoids errors caused by erasure failures. The in-place rewriting only programs the cells that have 0 → 1 bit errors. Therefore, far fewer cells need to be programmed during each scrubbing. Considering also the smaller voltage gap used during programming, the scrubbing operation of SLC-IPR-MS has a higher success rate than those of the other schemes at higher total doses. E-SLC-MS and its control scheme provide the highest reliability when blocks carry fewer than 4k PECs. This is because the large voltage gap between P0 and P3 used in E-SLC-MS requires a much higher dose of radiation to introduce state errors compared to the other schemes. The results in Fig. 12 further confirm the explanation: the RBERs of E-SLC-MS and its control scheme grow much more slowly than those of the other schemes.
SLC-IPR-MS provides the highest reliability to blocks that carry 5k-8k PECs, for which almost all the blocks under the other two schemes failed to decode even before irradiation. This observation is explained by comparing the programming errors of each scheme. Programming errors come from interference and charge over-injection. The amount of interference received by a victim cell depends on the average voltage increase of its neighboring cells, which is proportional to the voltage gap between cell states. Thus the programming of SLC-IPR-MS creates less interference than the other schemes. Charge over-injection is the phenomenon in which additional charges are injected into FGs during programming due to leakage paths formed by the charges trapped in tunnel oxides. The amount of over-injection of an FG depends on the duration of programming and the number of PECs carried by the FG. In our experiments, the blocks used for comparing different schemes have the same PECs, and the programming used by SLC-IPR-MS takes much less time thanks to the smaller voltage gap used. SLC-IPR-MS thus introduces less over-injection, and therefore fewer programming errors. This explanation is supported by the results of Fig. 12, where SLC-IPR-MS and its control scheme have the lowest RBERs at 0 rad. As over-injection errors grow with PECs, and the programming methods of MLC-MS and E-SLC-MS have higher over-injection errors, blocks under these two schemes suffer from more programming errors at higher PECs, and started having ECC decoding failures at earlier PECs than the blocks under SLC-IPR-MS.
V. CONCLUSIONS AND FUTURE WORK
In summary, we have characterized the errors of 16nm MLC NAND flash memory under TID effect. Results have shown that faithful data recovery only lasted until 9k rad under typical ECC configurations. FGs degraded by TID effect are the major source of errors. We have observed strong asymmetry of errors at multiple levels, and studied RM and MS schemes that take advantage of the properties of TID errors for reliability enhancement. Evaluation has shown that both error mitigation schemes significantly extend the lifetime of flash memory under TID effect. This work is the first step of our efforts towards high-density flash-based storage in space. As the next step, we would like to continue studying the behavior of high-density NAND flash under SEU effect as well as in low temperature environments, and to design more effective and practical error mitigation schemes.
