This paper summarizes our work on experimentally analyzing, exploiting, and addressing vulnerabilities in multi-level cell NAND flash memory programming, which was published in the industrial session of HPCA 2017 [9], and examines the work's significance and future potential. Modern NAND flash memory chips use multi-level cells (MLC), which store two bits of data in each cell, to improve chip density. As MLC NAND flash memory scaled down to smaller manufacturing process technologies, manufacturers adopted a two-step programming method to improve reliability. In two-step programming, the two bits of a multi-level cell are programmed using two separate steps, in order to minimize the amount of cell-to-cell program interference induced on neighboring flash cells.
Introduction
Solid-state drives (SSDs), which consist of NAND flash memory chips, are widely used for storage today due to significant decreases in the per-bit cost of NAND flash memory, which, in turn, have driven great increases in SSD capacity. These improvements have been enabled by both aggressive process technology scaling and the development of multi-level cell (MLC) technology. NAND flash memory stores data by changing the threshold voltage of each flash cell, where a cell consists of a floating-gate transistor [44, 74, 81]. In single-level cell (SLC) flash memory, the threshold voltage range could represent only a single bit of data. A multi-level cell uses the same threshold voltage range to represent two bits of data within a single cell (i.e., the range is split up into four windows, known as states, where each state represents one of the data values 00, 01, 10, or 11), thereby doubling storage capacity [11, 20, 37, 63, 92, 114]. In a NAND flash memory chip, a row of cells is connected together by a common wordline, which typically spans 32K-64K cells. Each wordline contains two pages of data, where a page is the granularity at which the data is read and written (i.e., programmed). The most significant bits (MSBs) of all cells on the same wordline are combined to form an MSB page, and the least significant bits (LSBs) of all cells on the wordline are combined to form an LSB page [13].
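The relationship between threshold voltage windows, states, and the two pages of a wordline can be illustrated with a minimal sketch. The read reference voltages and the Gray-coded bit assignment below are assumptions chosen for illustration (real values are proprietary and device-specific), and the function names are hypothetical.

```python
from bisect import bisect_right

# Hypothetical normalized read reference voltages that split the
# threshold voltage range into four windows (states).
READ_REFS = [0.25, 0.50, 0.75]
STATES = ["ER", "P1", "P2", "P3"]

# Assumed Gray-coded (MSB, LSB) value stored by each state.
STATE_BITS = {"ER": (1, 1), "P1": (0, 1), "P2": (0, 0), "P3": (1, 0)}


def decode_cell(vth):
    """Return (state, msb, lsb) for a normalized threshold voltage."""
    state = STATES[bisect_right(READ_REFS, vth)]
    msb, lsb = STATE_BITS[state]
    return state, msb, lsb


def split_pages(wordline_vths):
    """Combine the MSBs and LSBs of all cells on a wordline into two pages."""
    decoded = [decode_cell(v) for v in wordline_vths]
    msb_page = [msb for _, msb, _ in decoded]
    lsb_page = [lsb for _, _, lsb in decoded]
    return msb_page, lsb_page
```

For example, `split_pages([0.1, 0.3, 0.6, 0.9])` decodes one cell in each of the four states and returns the MSB page and LSB page as two separate bit lists.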
To precisely control the threshold voltage of a flash cell, the flash memory device uses incremental step pulse programming (ISPP) [20, 37, 63, 114]. ISPP applies multiple short pulses of a high programming voltage to each cell in the wordline being programmed, with each pulse increasing the threshold voltage of the cell by some small amount. SLC and older MLC devices programmed the threshold voltage in one shot, issuing all of the pulses back-to-back to program both bits of data at the same time. However, as flash memory scales down to smaller technology nodes, the distance between neighboring flash cells decreases, which in turn increases the program interference that occurs due to cell-to-cell coupling. This program interference causes errors to be introduced into neighboring cells during programming [13, 16, 29, 66, 68, 92]. To reduce this interference by half [13], manufacturers have been using two-step programming for MLC NAND flash memory since the 40nm technology node [92]. A large fraction of SSDs on the market today use sub-40nm MLC NAND flash memory.
Two-step programming stores each bit within an MLC flash memory cell using two separate, partial programming steps, as shown in Figure 1. An unprogrammed cell starts in the erased (ER) state. The first programming step programs the LSB page: for each flash cell within the page, the cell is partially programmed depending on the LSB being written to the cell. If the LSB of the cell should be 0, the cell is programmed into a temporary program state (TP); otherwise, it remains in the ER state. The maximum voltage of a partially-programmed cell is approximately half of the maximum possible threshold voltage of a fully-programmed flash cell.
In its second step, two-step programming programs the MSB page: it reads the LSB value into a buffer inside the flash chip (called the internal LSB buffer) to determine the partially-programmed state of the cell's threshold voltage, and then partially programs the cell again, depending on whether the MSB of the cell is a 0 or a 1. The second programming step moves the threshold voltage from the partially-programmed state to the desired final state (i.e., ER, P1, P2, or P3). By breaking MLC programming into two separate steps, manufacturers halve the program interference of each programming operation [13, 68]. The SSD controller employs shadow program sequencing [6, 7, 8, 13, 25, 91], which interleaves the partial programming steps of a cell with the partial programming steps of neighboring cells to ensure that a fully-programmed cell experiences interference only from a single neighboring partial programming step.
Figure 1: Starting (after erase), temporary (after LSB programming), and final (after MSB programming) states for two-step programming. The y-axis of the plot is probability density, and each state is labeled with its MSB and LSB values. Reproduced from [9].
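The ISPP procedure described earlier in this section can be sketched as a simple loop that pulses a cell until its threshold voltage reaches the target. This is an idealized model, not a device implementation: the integer voltage units, the fixed step size, and the pulse limit are assumptions made for illustration.

```python
def ispp_program(vth, target, step=50, max_pulses=64):
    """Idealized ISPP: apply short pulses until vth reaches the target.

    Voltages are in arbitrary integer units (assumed). Each pulse raises
    the threshold voltage by `step`; the loop condition plays the role of
    the verify check performed between pulses.
    Returns (final_vth, pulses_used).
    """
    pulses = 0
    while vth < target and pulses < max_pulses:
        vth += step      # one short programming pulse
        pulses += 1
    return vth, pulses
```

In this model, a smaller `step` places the threshold voltage more precisely at the cost of more pulses. One-shot programming would issue all pulses for the final state back-to-back, whereas two-step programming would call such a loop once per step with a different target.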
Error Sources in Two-Step Programming
In our HPCA 2017 paper [9], we demonstrate that two-step programming introduces new possibilities for flash memory errors that can corrupt some of the data stored within flash cells without accessing them, and that these errors can be exploited to design malicious attacks. As there is a delay between programming the LSB and the MSB of a single cell due to the interleaved writes to neighboring cells, raw bit errors can be introduced into the already-programmed LSB page before the MSB page is programmed. These errors can cause a cell to be programmed to an incorrect state in the second programming step. During the second step, both the MSB and LSB of each cell are required to determine the final target threshold voltage of the cell. As shown in Figure 2, the data to be programmed into the MSB is loaded from the SSD controller to the internal MSB buffer (1 in the figure). Concurrently, the LSB data is loaded into the internal LSB buffer from the flash memory wordline (2). By buffering the LSB data inside the flash chip and not in the SSD controller, flash manufacturers avoid data transfer between the chip and the controller during the second programming step, thereby reducing the step's latency. Unfortunately, this means that the errors loaded from the internal LSB buffer cannot be corrected as they would otherwise be during a read operation, because the error correction (ECC) engine resides only inside the controller (3), and not inside the flash chip. As a result, the final cell voltage can be incorrectly set during MSB programming, permanently corrupting the LSB data.
Footnote 1: We refer the reader to our prior works [6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 72] for a detailed background on NAND flash memory. Our recent survey paper [6, 7, 8] provides an extensive survey of the state-of-the-art in NAND flash memory.
Figure 2: In the second step of two-step programming, LSB data does not go to the controller, and is not corrected when read into the internal LSB buffer, resulting in program errors. The diagram shows two blocks, the flash memory and the SSD controller. Reproduced from [9].
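The way an uncorrected LSB error turns into a permanently wrong final state can be sketched with a small model of the two programming steps. The state-to-bit assignment is an assumed Gray code, the `disturbed` flag is a hypothetical stand-in for a threshold voltage shift caused by interference or read disturb, and all function names are illustrative.

```python
# Assumed final state for each (LSB, MSB) pair.
FINAL_STATE = {(1, 1): "ER", (1, 0): "P1", (0, 0): "P2", (0, 1): "P3"}


def program_lsb(lsb):
    """Step 1: the cell stays in ER for LSB=1, or moves to TP for LSB=0."""
    return "ER" if lsb == 1 else "TP"


def read_lsb_internal(state, disturbed=False):
    """Read the LSB into the internal LSB buffer, with no ECC applied.

    `disturbed` models an ER cell whose threshold voltage has risen
    enough to be misread as TP (LSB flips from 1 to 0).
    """
    lsb = 1 if state == "ER" else 0
    if disturbed and state == "ER":
        lsb = 0
    return lsb


def program_msb(state, msb, disturbed=False):
    """Step 2: pick the final state from the buffered LSB and the new MSB."""
    lsb_buffer = read_lsb_internal(state, disturbed)
    return FINAL_STATE[(lsb_buffer, msb)]
```

With these assumptions, a cell written with LSB=1 and MSB=1 should remain in ER, but `program_msb(program_lsb(1), 1, disturbed=True)` places it in P3, whose LSB reads as 0 from then on. The error is baked into the final threshold voltage, so later ECC-protected reads see it as stored data rather than as a transient read error.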
We briefly discuss two sources of errors that can corrupt LSB data, and characterize their impact on real state-of-the-art 1X-nm (i.e., 15-19nm) MLC NAND flash chips. We perform our characterization using an FPGA-based flash testing platform [10, 11] that allows us to issue commands directly to raw NAND flash memory chips. In order to determine the threshold voltage stored within each cell, we use the read-retry mechanism built into modern SSD controllers [13, 17, 108, 130]. Throughout this work, we present normalized voltage values, as actual voltage values are proprietary information to flash manufacturers. Our complete characterization results can be found in our HPCA 2017 paper [9].
Cell-to-Cell Program Interference
The first error source, cell-to-cell program interference, introduces errors into a flash cell when neighboring cells are programmed, as a result of parasitic capacitance coupling [6, 7, 8, 13, 16, 28, 29, 32, 68]. While two-step programming reduces program interference for fully-programmed cells, we find that interference during two-step programming is a significant error source for partially-programmed cells.
As an example, we look at a flash block in the commonly-used all-bit-line (ABL) flash architecture [13, 19, 20], which is shown in Figure 3. After the LSB page on Wordline 1 (Page 1 in Figure 3) is programmed, the next two pages that are programmed (Pages 2 and 3) reside on directly-adjacent wordlines. Therefore, before the MSB page on Wordline 1 (Page 4) is programmed, the LSB page (Page 1) could be susceptible to program interference when Pages 2 and 3 are programmed.
Figure 3: [...] flash memory. Reproduced from [9].
Figure 4 shows the measured raw bit error rate for Page 1 in real NAND flash memory devices at four different points in time, normalized to the error rate just after Page 1 is programmed:
A. Just after Page 1 is programmed (no interference),
B. After Page 2 is programmed with pseudo-random data,
C. After Pages 2 and 3 are programmed with pseudo-random data,
D. After Pages 2 and 3 are programmed with a data pattern that induces the worst-case program interference.
We observe that the amount of interference is especially high when Pages 2 and 3 in Figure 3 are written with the worst-case data pattern, after which the raw bit error rate of Page 1 is 4.9x the rate before interference. Note that the worst-case data pattern that we write to Pages 2 and 3 requires no knowledge of the data stored within Page 1 [9].
Read Disturb
The second error source, read disturb, disrupts the contents of a flash cell when another cell is read [6, 7, 8, 18, 28, 32, 35, 77, 90, 115]. NAND flash memory cells are organized into multiple flash blocks (two-dimensional cell arrays), where each block contains a set of bitlines that connect multiple flash cells in series. To accurately read the value from one cell, the SSD controller applies a pass-through voltage to turn on the unread cells on the bitline, which allows the value to propagate through the bitline. Unfortunately, this pass-through voltage induces a weak programming effect on an unread cell: it slightly increases the cell threshold voltage [6, 7, 8, 18]. As more neighboring cells within a block are read, an unread cell's threshold voltage can increase enough to change the data value stored in the cell [6, 7, 8, 18, 35, 90]. In two-step programming, a partially-programmed cell is more likely to have a lower threshold voltage than a fully-programmed cell, and the weak programming effect is stronger on cells with a lower threshold voltage. Measuring errors in real NAND flash memory devices, we find that the raw bit error rate for an LSB page in a partially-programmed or unprogrammed wordline is an order of magnitude greater than the rate for an LSB page in a fully-programmed wordline. However, existing read disturb management solutions are designed to protect fully-programmed cells [18, 31, 35, 36, 52, 105], and offer little mitigation for partially-programmed cells.
Exploiting Two-Step Programming Errors
Two major issues arise from the program interference and read disturb vulnerabilities of partially-programmed and unprogrammed cells. First, the vulnerabilities induce a large number of errors on these cells, exhausting the SSD's error correction capacity and limiting the SSD lifetime. Second, the vulnerabilities can potentially allow (malicious) applications to aggressively corrupt and change data belonging to other programs and further hurt the SSD lifetime. We present two example sketches of potential exploits in our HPCA 2017 paper [9], which we briefly summarize here.
Sketch of Program Interference Based Exploit
In this exploit, a malicious application can induce a significant amount of program interference onto a flash page that belongs to another, benign victim application, corrupting the page and shortening the SSD lifetime. Recall from Section 2.1 that writing the worst-case data pattern can induce 4.9x the number of errors into a neighboring page (with respect to an interference-free page). The goal of this exploit is for a malicious application to write this worst-case data pattern in a way that ensures that the page that is disrupted belongs to the victim application, and that the page that is disrupted experiences the greatest amount of program interference possible. Figure 5 illustrates the contents of the pages within neighboring 8KB wordlines (rows of flash cells within a block). The SSD controller uses shadow program sequencing to interleave partial programming steps to pages in ascending order of the page numbers shown on the left side of the figure. A malicious application can write a small 16KB file with all 1s to prepare for the attack (1 in the figure), and then wait for the victim application to write its data to Wordline n (2). Once the victim writes its data, the malicious application then writes all 0s to a second 16KB file (3a and 3b). This induces the largest possible change in voltage on the victim data, and can be used to flip bits within the data. In our HPCA 2017 paper [9], we discuss how a malicious application can (1) work around SSD scrambling and (2) monitor victim application data writes.
Sketch of Read Disturb Based Exploit
In this exploit, a malicious application can induce a significant amount of read disturb onto several flash pages that belong to other, benign victim applications. Recall from Section 2.2 that the error rate after read disturb for an LSB page in a partially-programmed wordline is an order of magnitude greater than the error rate for an LSB page in a fully-programmed wordline. The goal of this exploit is for a malicious application to perform a large number of read operations in a very short amount of time, to induce read disturb errors that corrupt both pages already written to partially-programmed wordlines and pages that have yet to be written. The malicious application writes an 8KB file, with arbitrary data, to the SSD. Immediately after the file is written, the malicious application repeatedly forces the file system to send a new read request to the SSD. Each request induces read disturb on the other wordlines within the flash block, causing the cell threshold voltages of these wordlines to increase. After the malicious application finishes performing the repeated read requests, a victim application writes data to a file. As the SSD is unaware that an attack took place, it does not detect that the data cannot be written correctly due to the increased cell threshold voltages. As a result, bit flips can occur in the victim application's data. Unlike the program interference exploit, which attacks a single page, the read disturb exploit can corrupt multiple pages with a single attack, and the corruption can affect pages written at a much later time than the attack if the host write rate is low.
Protection and Mitigation Mechanisms
We propose three mechanisms to eliminate or mitigate the program interference and read disturb vulnerabilities of partially-programmed and unprogrammed cells due to two-step programming. Table 1 summarizes the costs and benefits of each mechanism. We briefly discuss our three mechanisms here, and provide more detail on them in our HPCA 2017 paper [9].
Our first mechanism buffers LSB data in the SSD controller, eliminating the need to read the LSB page from flash memory at the beginning of the second programming step, thereby completely eliminating the vulnerabilities. It maintains a copy of all partially-programmed LSB data within DRAM buffers that exist in the SSD near the controller. Doing so ensures that the LSB data is read without any errors from the DRAM buffer, where it is free from the vulnerabilities (instead of from the flash memory, where it incurs errors that are not corrected), in the second programming step. Figure 6 shows a flowchart of our modified two-step programming algorithm. This solution increases the programming latency of the flash memory by 4.9% in the common case, due to the long latency of sending the LSB data from the controller to the internal LSB buffer inside flash memory.
Figure 6: Modified two-step programming, using a DRAM buffer for LSB data (modifications shown in shaded boxes). One of the added flowchart boxes is labeled "A: Send LSB data to internal LSB buffer". Reproduced from [9].
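The first mechanism can be sketched as a small controller-side structure that keeps LSB pages of partially-programmed wordlines in DRAM and supplies them at MSB programming time. This is a minimal illustration rather than the flowchart from the paper: the class name, method names, and the fallback to an in-flash read when no buffered copy exists are all assumptions.

```python
class ControllerLsbBuffer:
    """Hypothetical controller-side DRAM buffer for partially-programmed LSB pages."""

    def __init__(self):
        self._pages = {}  # wordline id -> error-free LSB page data

    def on_lsb_program(self, wordline, lsb_page):
        """Step 1: keep a clean copy of the LSB page while the wordline is partial."""
        self._pages[wordline] = bytes(lsb_page)

    def on_msb_program(self, wordline, flash_read_lsb):
        """Step 2: return (lsb_page, source) to load into the internal LSB buffer.

        The DRAM copy is preferred because it is free of interference and
        read disturb errors. Falling back to `flash_read_lsb` (a callable
        supplied by the caller) is an assumed policy for missing entries.
        """
        lsb_page = self._pages.pop(wordline, None)
        if lsb_page is not None:
            return lsb_page, "dram"
        return flash_read_lsb(wordline), "flash"
```

The entry is dropped once the wordline is fully programmed, so the buffer only needs to hold LSB pages for wordlines that are currently partially programmed.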
The two other mechanisms that we develop largely reduce (but do not fully eliminate) the probability of two-step programming errors at much lower latency impact. Our second mechanism adapts the LSB read operation to account for threshold voltage changes induced by program interference and read disturb. It adaptively learns an optimized read reference voltage for LSB data, lowering the probability of an LSB read error. Our third mechanism greatly reduces the errors induced during read disturb, by customizing the pass-through voltage for unprogrammed and partially-programmed flash cells. State-of-the-art SSDs apply a single pass-through voltage ($V_{pass}$) to all of the unread cells, as shown in Figure 7a. This leaves a large gap between the pass-through voltage and the threshold voltage of a partially-programmed or unprogrammed cell, which greatly increases the impact of read disturb [9, 18]. To minimize this gap, and, thus, the impact of read disturb, we propose to use three pass-through voltages, as shown in Figure 7b: $V_{pass}^{erase}$ for unprogrammed cells, $V_{pass}^{partial}$ for partially-programmed cells, and the same pass-through voltage as before ($V_{pass}$) for fully-programmed cells. This mechanism decreases the number of errors induced by read operations to neighboring cells by 72%, which translates to a 16% increase in NAND flash memory lifetime (see Section 6.3 of our HPCA 2017 paper [9] for more detail). We conclude that, by eliminating or reducing the probability of introducing errors during two-step programming, our solutions completely close or greatly reduce the exposure to reliability and security vulnerabilities.
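The third mechanism amounts to choosing the pass-through voltage per wordline according to how far that wordline has been programmed. The sketch below illustrates only that selection step. The normalized voltage values are invented placeholders (actual values are proprietary), and the state labels and function name are hypothetical.

```python
# Hypothetical normalized pass-through voltages, lowest for unprogrammed
# wordlines and highest for fully-programmed wordlines.
V_PASS = {
    "unprogrammed": 0.55,  # erase-state pass-through voltage (assumed value)
    "partial": 0.75,       # partial-state pass-through voltage (assumed value)
    "full": 1.00,          # unchanged pass-through voltage for full wordlines
}


def pass_through_voltages(wordline_states, read_index):
    """Return the voltage applied to each wordline in a block during a read.

    `wordline_states` lists the programmed state of every wordline in the
    block. The wordline being read gets None, because it receives a read
    reference voltage rather than a pass-through voltage.
    """
    return [
        None if i == read_index else V_PASS[state]
        for i, state in enumerate(wordline_states)
    ]
```

For an open block such as `["full", "full", "partial", "unprogrammed"]`, reading wordline 0 would apply the full pass-through voltage only to wordline 1, and lower voltages to the partial and unprogrammed wordlines.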
Related Work
To our knowledge, our HPCA 2017 paper [9] is the first to (1) experimentally characterize both program interference and read disturb errors that occur due to the two-step programming method commonly used in MLC NAND flash memory; (2) reveal new reliability and security vulnerabilities exposed by two-step programming in flash memory; and (3) develop novel solutions to reduce these vulnerabilities. We briefly describe related works in the areas of DRAM and NAND flash memory. We note that a thorough survey of error mechanisms in NAND flash memory is provided in our recent works [6, 7, 8].
Read Disturb Errors in DRAM
Commodity DRAM chips that are sold and used in the field today exhibit read disturb errors [55], also called RowHammer-induced errors [82], which are conceptually similar to the read disturb errors found in NAND flash memory (see Section 2.2). Repeatedly accessing the same row in DRAM can cause bit flips in data stored in adjacent DRAM rows. In order to access data within DRAM, the row of cells corresponding to the requested address must be activated (i.e., opened for read and write operations). This row must be precharged (i.e., closed) when another row in the same DRAM bank needs to be activated. Through experimental studies on a large number of real DRAM chips, we show that when a DRAM row is activated and precharged repeatedly (i.e., hammered) enough times within a DRAM refresh interval, one or more bits in physically-adjacent DRAM rows can be flipped to the wrong value [55].
In our original RowHammer paper [55], we tested 129 DRAM modules manufactured by three major manufacturers (A, B, and C) between 2008 and 2014, using an FPGA-based experimental DRAM testing infrastructure [38] (more detail on our experimental setup, along with a list of all modules and their characteristics, can be found in our original RowHammer paper [55]). Figure 8 shows the rate of RowHammer errors that we found, with the 129 modules that we tested categorized based on their manufacturing date. We find that 110 of our tested modules exhibit RowHammer errors, with the earliest such module dating back to 2010. In particular, we find that all of the modules manufactured in 2012-2013 that we tested are vulnerable to RowHammer. As with many NAND flash memory error mechanisms, especially read disturb, RowHammer is a recent phenomenon that especially affects DRAM chips manufactured with more advanced manufacturing process technology generations [82]. The phenomenon is due to reliability problems caused by DRAM technology scaling [82, 83, 84, 85].
Figure 9 shows the distribution of the number of rows (plotted in log scale on the y-axis) within a DRAM module that flip the number of bits shown along the x-axis, as measured for example DRAM modules from three different DRAM manufacturers [55]. We make two observations from the figure. First, the number of bits flipped when we hammer a row (known as the aggressor row) can vary significantly within a module. Second, each module has a different distribution of the number of rows. Despite these differences, we find that this DRAM failure mechanism affects more than 80% of the DRAM chips we tested [55]. As indicated above, this read disturb error mechanism in DRAM is popularly called RowHammer [82].
Figure 9: [...] when an aggressor row is repeatedly activated, for three representative DRAM modules from three major manufacturers. We label the modules in the format X yyww n, where X is the manufacturer (A, B, or C), yyww is the manufacture year (yy) and week of the year (ww), and n is the number of the selected module. Reproduced from [55].
Various recent works show that RowHammer can be maliciously exploited by user-level software programs to (1) induce errors in existing DRAM modules [55, 82] and (2) launch attacks to compromise the security of various systems [3, 4, 33, 34, 82, 101, 106, 107, 117, 123]. For example, by exploiting the RowHammer read disturb mechanism, a user-level program can gain kernel-level privileges on real laptop systems [106, 107], take over a server vulnerable to RowHammer [34], take over a victim virtual machine running on the same system [3], and take over a mobile device [117]. Thus, the RowHammer read disturb mechanism is a prime (and perhaps the first) example of how a circuit-level failure mechanism in DRAM can cause a practical and widespread system security vulnerability.
Note that various solutions to RowHammer exist [53, 55, 82], but we do not discuss them in detail here. Our recent work [82] provides a comprehensive overview. A very promising proposal is to modify either the memory controller or the DRAM chip such that it probabilistically refreshes the physically-adjacent rows of a recently-activated row, with very low probability. This solution is called Probabilistic Adjacent Row Activation (PARA) [55]. Our prior work shows that this low-cost, low-complexity solution, which does not require any storage overhead, largely closes the RowHammer vulnerability [55].
The RowHammer effect in DRAM worsens as the manufacturing process scales down to smaller node sizes [55, 82]. More findings on RowHammer, along with extensive experimental data from real DRAM devices, can be found in our prior works [53, 55, 82].
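The idea behind PARA can be sketched in a few lines: on each row activation, a random draw decides whether the neighbors of the activated row are refreshed. This is a simplified illustration, not the design evaluated in [55]. The probability value, the choice to refresh both neighbors on a single draw, and the function name are assumptions.

```python
import random


def para_on_activate(row, num_rows, p=0.001, rng=random):
    """Return the adjacent rows to refresh after `row` is activated.

    With probability `p` (an assumed value), both physical neighbors of
    the activated row are refreshed; otherwise nothing is done. The
    function keeps no per-row counters, reflecting the property that the
    approach needs no storage overhead.
    """
    if rng.random() >= p:
        return []
    return [r for r in (row - 1, row + 1) if 0 <= r < num_rows]
```

Because a hammered row must be activated a very large number of times within a refresh interval to cause bit flips, even a small `p` makes it very likely that its neighbors are refreshed at least once during an attack.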
Cell-to-Cell Interference Errors in DRAM
Like NAND flash memory cells, DRAM cells are susceptible to cell-to-cell interference. In DRAM, one important way in which cell-to-cell interference exhibits itself is the data-dependent retention behavior, where the retention time of a DRAM cell is dependent on the values written to nearby DRAM cells [46, 47, 48, 49, 70, 82, 97]. This phenomenon is called data pattern dependence (DPD) [70]. Data pattern dependence in DRAM is similar to the data-dependent nature of program interference that exists in NAND flash memory (see Section 2.1). Within DRAM, data pattern dependence occurs as a result of parasitic capacitance coupling (between DRAM cells). Due to this coupling, the amount of charge stored in one cell's capacitor can inadvertently affect the amount of charge stored in an adjacent cell's capacitor [46, 47, 48, 49, 70, 82, 97]. As DRAM cells become smaller with technology scaling, cell-to-cell interference worsens because parasitic capacitance coupling between cells increases [46, 70]. More findings on cell-to-cell interference and the data-dependent nature of cell retention times in DRAM, along with experimental data obtained from modern DRAM chips, can be found in our prior works [46, 47, 48, 49, 70, 82, 97].
Errors in Emerging Memory Technologies
Emerging nonvolatile memories, such as phase-change memory (PCM) [60, 61, 62, 100, 122, 125, 129], spin-transfer torque magnetic RAM (STT-RAM or STT-MRAM) [57, 86], metal-oxide resistive RAM (RRAM) [121], and memristors [26, 113], are expected to bridge the gap between DRAM and NAND-flash-memory-based SSDs, providing DRAM-like access latency and energy, and at the same time SSD-like large capacity and nonvolatility (and hence SSD-like data persistence). While their underlying designs are different from DRAM and NAND flash memory, these emerging memory technologies have been shown to exhibit similar types of errors. PCM-based devices are expected to have a limited lifetime, as PCM can only sustain a limited number of writes [60, 100, 122], similar to the P/E cycling errors in SSDs (though PCM's write endurance is higher than that of SSDs [60]). PCM suffers from (1) resistance drift [41, 98, 122], where the resistance used to represent the value becomes higher over time (and eventually can introduce a bit error), similar to how charge leakage in NAND flash memory and DRAM leads to retention errors over time; and (2) write disturb [43], where the heat generated during the programming of one PCM cell dissipates into neighboring cells and can change the value that is stored within the neighboring cells, similar in concept to cell-to-cell program interference in NAND flash memory. STT-RAM suffers from (1) retention failures, where the value stored for a single bit (as the magnetic orientation of the layer that stores the bit) can flip over time; and (2) read disturb (a conceptually different phenomenon from the read disturb in DRAM and flash memory), where reading a bit in STT-RAM can inadvertently induce a write to that same bit [86].
Due to the nascent nature of emerging nonvolatile memory technologies and the lack of availability of large-capacity devices built with them, extensive and dependable experimental studies have yet to be conducted on the reliability of real PCM, STT-RAM, RRAM, and memristor chips. However, we believe that error mechanisms conceptually or abstractly similar to those for flash memory and DRAM are likely to be prevalent in emerging technologies as well (as supported by some recent studies [2, 43, 50, 86, 109, 110, 128]), albeit with different underlying mechanisms and error rates.
Other Related Works
Memory Error Characterization and Understanding. Prior works study various types of NAND flash memory errors derived from circuit-level noise, such as data retention noise [6, 7, 8, 11, 12, 14, 15, 73, 77, 79], read disturb noise [6, 7, 8, 18, 77, 90], cell-to-cell program interference noise [11, 13, 15, 16], and P/E cycling noise [6, 7, 8, 11, 15, 17, 72, 77, 96]. Other prior works examine the aggregate effect of these errors on large sets of SSDs that are deployed in the production data centers of Facebook [75], Google [103], and Microsoft [87]. None of these works characterize how program interference and read disturb significantly increase errors within the unprogrammed or partially-programmed cells of an open block due to the vulnerabilities in two-step programming, nor do they develop mechanisms that exploit or mitigate such errors.
A concurrent work by Papandreou et al. [89] characterizes the impact of read disturb on partially-programmed and unprogrammed cells in state-of-the-art MLC NAND flash memory. The authors come to similar conclusions as we do about the impact of read disturb. However, unlike our work, they do not (1) characterize the impact of cell-to-cell program interference on partially-programmed cells, (2) propose exploits that can take advantage of the vulnerabilities in partially-programmed cells, or (3) propose mechanisms that mitigate or eliminate the vulnerabilities.
Similar to the characterization studies performed for NAND flash memory, DRAM latency, reliability, and variation have been experimentally characterized at both a small scale (e.g., hundreds of chips) [21, 22, 23, 38, 46, 47, 48, 49, 51, 53, 55, 64, 65, 67, 70, 97, 99] and a large scale (e.g., tens of thousands of chips) [40, 76, 104, 111, 112].
Program Interference Error Mitigation Mechanisms.
Prior works [13, 16] model the behavior of program interference, and propose mechanisms that estimate the optimal read reference voltage once interference has occurred. These works minimize program interference errors only for fully-programmed wordlines, by modeling the change in the threshold voltage distribution as a result of the interference. These models are fitted to the distributions of wordlines after both the LSB and MSB pages are programmed, and are unable to determine and mitigate the shift that occurs for wordlines that are partially programmed. In contrast, we propose mechanisms that specifically address the program interference resulting from two-step programming, and reduce the number of errors induced on LSB pages in both partially-programmed and unprogrammed wordlines.
Read Disturb Error Mitigation Mechanisms. One patent [31] proposes a mechanism that uses counters to monitor the total number of reads to each block. Once a block's counter exceeds a threshold, the mechanism remaps and rewrites all of the valid pages within the block to remove the accumulated read disturb errors [31] . Another patent [105] proposes to monitor the MSB page error rate to ensure that it does not exceed the ECC error correction capability, to avoid data loss. Both of these mechanisms monitor pages only from fully-programmed wordlines. Unfortunately, as we observed, LSB pages in partially-programmed and unprogrammed wordlines are twice as susceptible to read disturb as pages in fully-programmed wordlines (see Section 2.2). If only the MSB page error rate is monitored, read disturb may be detected too late to correct some of the LSB pages.
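The counter-based approach attributed to [31] above can be sketched as a per-block read counter that signals when a block should be remapped and rewritten. This is only an illustration of the general idea: the class name, the threshold value, and the choice to reset the counter after signaling are assumptions, not details from the patent.

```python
class ReadCounterMonitor:
    """Hypothetical per-block read counter that triggers remapping."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed read budget per block
        self.reads = {}             # block id -> reads since last rewrite

    def on_read(self, block):
        """Record one read to `block`.

        Returns True when the block's counter exceeds the threshold, in
        which case the caller is expected to remap and rewrite the valid
        pages in the block. The counter is then reset.
        """
        self.reads[block] = self.reads.get(block, 0) + 1
        if self.reads[block] > self.threshold:
            self.reads[block] = 0
            return True
        return False
```

Note that a single threshold per block treats all wordlines alike, which is why such a scheme, tuned for fully-programmed wordlines, can react too late for the more vulnerable LSB pages in partially-programmed wordlines.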
Our earlier work [18] dynamically changes the passthrough voltage for each block to reduce the impact of read disturb. As a single voltage is applied to the whole block, this mechanism does not help signi cantly with the LSB pages in partially-programmed and unprogrammed wordlines. In contrast, our read disturb mitigation technique (see Section 4) speci cally targets these LSB pages by applying multiple different pass-through voltages in an open block, optimized to the di erent programmed states of each wordline, to reduce read disturb errors.
Other prior works [35, 36, 52] propose to use read reclaim to mitigate read disturb errors. The key idea of read reclaim is to remap the data in a block to a new ash block, if the block has experienced a high number of reads [35, 36, 52] . Read reclaim is similar to the remapping-based refresh mechanism [14, 15, 71, 80, 88] employed by many modern SSDs to mitigate data retention errors [6, 7, 8] . Read reclaim can remap the contents of a wordline only after the wordline is fully programmed, and does not mitigate the impact of read disturb on partiallyprogrammed or unprogrammed wordlines.
Using Flash Memory for Security Applications. Some prior works studied how ash memory can be used to enhance the security of applications. One work [119] uses ash memory as a secure channel to hide information, such as a secure key. Other works [118, 124] use ash memory to generate random numbers and digital ngerprints. None of these works study vulnerabilities that exist within the ash memory.
Based on our HPCA 2017 paper [9] , recent work [58] demonstrates how an attack can be performed on a real SSD using our program interference based exploit (see Section 3.1). The authors use our exploit to perform a le system level attack on a Linux machine, using the attack to gain root privileges.
Two-Step vs. One-Shot Programming. One-shot programming shifts flash cells directly from the erased state to their final target state in a single step. For smaller transistors with less distance between neighboring flash cells, such as those in sub-40nm planar (i.e., 2D) NAND flash memory, two-step programming has replaced one-shot programming to alleviate the coupling capacitance resulting from cell-to-cell program interference [92]. 3D NAND flash memory currently uses one-shot programming [94, 95, 127], as 3D NAND flash memory chips use larger process technology nodes (i.e., 30-50 nm) [102, 126] and employ charge trap transistors [30, 42, 45, 56, 93, 116, 120] for flash cells, as opposed to the floating-gate transistors used in planar NAND flash memory. However, once the number of 3D-stacked layers reaches its upper limit [59, 69], 3D NAND flash memory is expected to scale to smaller transistors [126], and we expect that the increased program interference will again require partial programming (just as it happened for planar NAND flash memory in the past [54, 92]). More detail on 3D NAND flash memory is provided in a recent survey article [8].
Long-Term Impact
As we discuss in Section 5, our HPCA 2017 paper [9] makes several novel contributions on characterizing, exploiting, and mitigating vulnerabilities in the two-step programming algorithm used in state-of-the-art MLC NAND flash memory. We believe that these contributions are likely to have a significant impact on academic research and industry.
Exposing the Existence of Errors
NAND flash manufacturers use two-step programming widely in their contemporary MLC NAND flash devices. Prior to our HPCA 2017 paper [9] and concurrent work by Papandreou et al. [89], there was no publicly-available knowledge about how two-step programming introduced new error sources that did not exist in the prior one-shot programming approach. Using real off-the-shelf contemporary NAND flash memory chips, our HPCA 2017 paper exposes the fact that fundamental limitations of the two-step programming method introduce program errors that reduce the lifetime of SSDs available on the market today.
Through a rigorous characterization, our HPCA 2017 paper [9] analyzes two major sources of these errors, program interference and read disturb, demonstrating how they can corrupt data stored in a partially-programmed flash cell. While prior works have addressed both program interference (e.g., [13, 29, 68, 92]) and read disturb (e.g., [18, 31, 35, 105]) errors in the past, we find that none of these existing solutions are able to protect the vulnerable partially-programmed pages produced during two-step programming. We expect that by exposing these errors and the unique vulnerabilities of partially-programmed cells, our work will (1) provide NAND flash memory manufacturers and the academic community with significant insight into the problem; (2) foster the development of new solutions that can reduce or eliminate this vulnerability; and (3) inspire others to search for other reliability and security vulnerabilities that exist in NAND flash memory.
Security Implications for Flash Memory
Our HPCA 2017 paper [9] proposes two sketches of new potential security exploits based on errors arising from two-step programming. Malicious applications can be developed to use these (or other similar) exploits to corrupt data belonging to other applications. For example, our paper has already enabled the development and demonstration of a file-system-based attack by IBM security researchers [58]. In that work, the researchers built upon our program-interference-based exploits to show how to use the file system to acquire root privileges on a real machine. The work confirms that our exploit sketches are likely viable on a real system, and that the threat of maliciously exploiting vulnerabilities in two-step programming is real (and needs to be addressed).
As was the case for RowHammer attacks in DRAM (see Section 5.1), our findings have already generated significant interest and concern in the broader technology community (e.g., [5, 24, 27, 39]). The reason behind the broader impact of our work is that many existing drives in the field today can be attacked. After IBM researchers demonstrated the ability to perform such attacks on a real system [58], there has been further interest in NAND flash memory attacks (e.g., [1, 78]).
We hope and expect that other researchers will take our cue and begin to investigate how other reliability issues in NAND flash memory can be exploited by applications to perform malicious attacks. We believe that this is a new area of research that will grow in importance as SSDs and flash memory become even more widely used.
Eliminating Program Error Attacks
Our HPCA 2017 paper [9] proposes three solutions that either eliminate or mitigate vulnerabilities to program interference and read disturb during two-step programming. We intentionally design all three of our solutions to be low overhead and easily implementable in commercial SSDs. One of our three solutions completely eliminates the vulnerabilities, albeit with a small increase in flash programming latency. We expect our work to have a direct impact on the NAND flash memory industry, as manufacturers will likely incorporate solutions such as the ones we propose to mitigate or eliminate these vulnerabilities in their new SSDs. We also expect manufacturers and researchers to explore new mechanisms, inspired by our work and by our solutions, that can eliminate these or other vulnerabilities and exploits due to NAND flash memory reliability errors.
Conclusion
Our HPCA 2017 paper [9] shows that the two-step programming mechanism commonly employed in modern MLC NAND flash memory chips opens up new vulnerabilities to errors, based on an experimental characterization of modern 1X-nm MLC NAND flash chips. We show that the root cause of these vulnerabilities is the fact that when a partially-programmed cell is set to an intermediate threshold voltage, it is much more susceptible to both cell-to-cell program interference and read disturb. We demonstrate that (1) these vulnerabilities lead to errors that reduce the overall reliability of flash memory, and (2) attackers can potentially exploit these vulnerabilities to maliciously corrupt data belonging to other programs. Based on our experimental observations and the resulting understanding, we propose three new mechanisms that can remove or mitigate these vulnerabilities, by eliminating or reducing the errors introduced as a result of the two-step programming method. Our experimental evaluation shows that our new mechanisms are effective: they can either eliminate the vulnerabilities with modest/low latency overhead, or drastically reduce the vulnerabilities and reduce errors with negligible latency or storage overhead. We hope that the vulnerabilities we analyzed and exposed in this work, along with the experimental data we provided, open up new avenues for mitigation as well as for exposure of other potential vulnerabilities due to internal flash memory operation.
