The high dynamic power and poor scaling potential of DRAM and the high leakage power of SRAM pose major challenges to be overcome in future computing systems. Nonvolatile memory (NVM) technologies such as phase change memory (PCM), resistive RAM (RRAM), and spin-transfer torque RAM (STT-RAM) have emerged as promising replacement candidates for DRAM/SRAM due to their better scalability, higher data density, and lower leakage power [1]-[3]. These resistance-class NVMs store data by modulating the resistance (PCM/RRAM) or magnetoresistance (STT-RAM) of the storage material. Due to a large separation between the lowest and the highest resistance states, these NVMs also support multilevel cell (MLC) or triple-level cell (TLC) operation, i.e., they can store two or three logical bits per physical cell, increasing data density and driving down cost.
read/write latency, write energy (aggravated for MLC/TLC NVMs), and data security fronts. To address these challenges, a broad class of solutions that rely on data encoding, read/write scheduling, circuit and/or architecture modifications, and memory encryption have been proposed; however, these solutions almost invariably impact NVM reliability/lifetime. NVM reliability has thus emerged as a first-order design constraint that gates the broad commercialization of these advanced memory technologies. To this end, this paper presents a comprehensive survey of reliability enhancement techniques for emerging NVMs. The techniques are classified based on the NVM technology and fault classes; their merits/demerits and applicability in practice are discussed; and directions and open problems for further research are summarized.
The reader will also benefit from other insightful surveys [1]-[13] on NVM technologies in the literature that provide focused in-depth coverage of candidate technologies, e.g., PCM in [1], [4], RRAM in [2], [7], and STT-RAM in [3], [8], [9]. However, most of these works [1]-[6], [10]-[13] do not focus exclusively on reliability; the surveys that do cover reliability [1], [4]-[6], [10] are over four years old. By narrowing the focus to NVM reliability, simultaneously covering the three most promising NVM technologies (PCM, RRAM, and STT-RAM), and providing a chronological severity of error organization for both permanent and transient faults in these technologies, this survey is uniquely positioned
Editor's note: Reliability continues to be a severe challenge in the development of emerging memories. In this article, the authors offer a comprehensive survey of reliability enhancement techniques for three mainstream emerging memories and a summary of the possible future research directions in this area.
the chemical lattice of the phase change material (chalcogenide). RRAM suffers from RF due to random generation (or recombination with oxygen ions) of the oxygen vacancies. Process variations in the magnetic tunnel junction (MTJ) of STT-RAM cells reduce the sense margin, resulting in faults such as incomplete write and false read. As a result, soft error mitigation and correction techniques have followed a trajectory of customized evolution for the specific NVM technology. The upcoming section surveys solutions to recover from transient faults in PCM [16], [26], [27], in RRAM [15], [28]-[30], and in STT-RAM [31]-[33].
Permanent (hard) faults
A permanent (hard) fault causes a memory cell to be stuck at the set (1) or reset (0) state, and the faulty cell cannot be subsequently programmed, resulting in a hard error. Whereas hard errors in SRAM/DRAM were usually the result of manufacturing defects, in NVMs, repeated programming (writing) of memory cells over time causes cell wear, eventually resulting in hard errors. Due to the low write endurance of a PCM/RRAM cell (10^8 to 10^10 writes until cell failure on average [15], [34]), hard errors are more prominent in PCM-/RRAM-based NVMs. Note that the write endurance of an STT-RAM cell exceeds 10^12 writes to cell failure [17], making it relatively immune to hard errors. Whereas NVM hard error detection and correction techniques are agnostic to the NVM technology, most advances in this area have been proposed in the context of PCM/RRAM [18]-[25]. In PCM, a stuck-at 0 fault occurs when the heating element detaches itself from the cell due to continued thermal expansion and contraction over time. In contrast, a stuck-at 1 fault occurs when overheating of the chalcogenide material destroys its physical characteristics [1], [16]. In RRAM, a stuck-at 0 (1) fault occurs due to a deficiency (excessive doping) of oxygen vacancies in the cell, causing the cell to remain stuck at 0 (1) irrespective of the voltage applied across it. Although rare, a stuck-at 1 (0) fault occurs in STT-RAM-based NVMs when a resistive bridge short circuits a word line or the node between a transistor and an MTJ to VDD (GND) [35].
Since hard errors appear over memory lifetime, it is necessary to continuously scrub memory to detect new cell failures. To this end, it is customary to adopt a verify-read-after-write technique to detect new cell failures, i.e., hard errors [18]-[20], [25]. In
to complement these earlier works and function as a single-resource introduction to the topic of NVM reliability.
NVMs are susceptible to both permanent and transient faults due to process variations, design marginalities, corner operating conditions, decreasing feature size, and increasing data density (MLC/TLC NVMs) [1]-[3]. A permanent fault causes a memory cell to be stuck at 1/0, and the faulty cell cannot be subsequently programmed, resulting in a hard error. In contrast, a transient fault causes a memory cell to temporarily change its state, resulting in a soft error, which can be corrected by reprogramming the faulty cell. These permanent/transient faults, if not addressed, manifest themselves as read and write errors, which compromise the reliability of the memory system. Although some permanent/transient faults can be detected and corrected during postmanufacturing test [16], [17], faults, permanent as well as transient, that occur during field operation require built-in hard-/soft-error detection and correction capabilities to ensure memory integrity and dependability.
Without exception, hard errors (resulting from permanent faults) occur in all NVM technologies. As a result, solutions to recover from hard errors in NVMs have usually been agnostic to the NVM technology. Whereas the use of error-correcting codes (ECCs) for hard error correction is a well-researched area, it was recognized early on that conventional ECC is a poor choice for addressing hard errors in NVMs. Since the number of hard errors increases over time, strong ECC implementations, which incur large overhead, are required to ensure complete error recovery. This led to the evolution of a broad spectrum of non-ECC hard error correction techniques that were motivated by challenges unique to NVM technologies [18]-[25] (refer to the next section).
As NVM densities increased, it was observed that transient faults like resistance drift, retention failure (RF), read disturb, write disturb (WD), etc., also increased, leading to an increase in the soft error rate [15], [16], [26]-[33]. Unlike hard error recovery techniques that are agnostic to the underlying NVM technology, soft-error recovery techniques are customized for the underlying NVM technology due to the unique physical/chemical properties of the memory cells used in these NVMs. For example, PCM primarily suffers from resistance drift due to the presence of defect structures in
Third, SAFER [20], proposed concurrently with ECP, was a significant departure from contemporary hard error correction techniques since it leveraged the observation that a stuck-at cell can not only be read but also used to store data. SAFER dynamically partitions a faulty data block into groups, each of which has at most one faulty cell, and writes the data inverted (as-is) if the stuck-at value of the faulty cell and the data bit to be written to that faulty cell differ (match). The hardware overhead of SAFER is proportional to the number of groups required to partition the data to ensure one failed cell per group. Whereas SAFER is complementary to other hard error correction techniques such as ECC, DRM, and ECP, integrating SAFER with these techniques increases the memory overhead of error correction. Recursively defined invertible set (RDIS) [23] and Aegis [24] are also data-inversion-based hard error correction techniques that improve SAFER's partitioning approaches to realize lifetime improvements over SAFER.
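The inversion rule that SAFER applies within one group can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [20]: the function names are hypothetical, each group is assumed to already contain at most one stuck-at cell, and the dynamic partitioning step is omitted.

```python
def safer_write_group(data_bits, stuck_pos=None, stuck_val=None):
    """Store one group that has at most one stuck-at cell.

    Returns (stored_bits, inverted). Names are illustrative only.
    """
    # Invert the whole group when the stuck-at value differs from the data bit
    # that must land on the stuck cell; otherwise write the data as-is.
    inverted = stuck_pos is not None and data_bits[stuck_pos] != stuck_val
    stored = [b ^ 1 for b in data_bits] if inverted else list(data_bits)
    if stuck_pos is not None:
        stored[stuck_pos] = stuck_val  # the stuck cell always holds its stuck value
    return stored, inverted


def safer_read_group(stored_bits, inverted):
    """Undo the inversion recorded for this group."""
    return [b ^ 1 for b in stored_bits] if inverted else list(stored_bits)


data = [1, 0, 1, 1]
stored, flag = safer_write_group(data, stuck_pos=2, stuck_val=0)
recovered = safer_read_group(stored, flag)
```

In this toy model the only per-group metadata is the inversion flag, which is consistent with the observation above that the overhead grows with the number of groups.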
Fourth, FREE-p [21] is a fine-grained remapping scheme that maps a failed memory block (512 bits) within a 4-kB page to a new location and stores the remapping pointer in the failed block. The remapping pointer points to the location storing the data of the failed block. Unlike ECC or ECP or SAFER, FREE-p implements error correction inside the memory controller, enabling detection/correction of errors in the wires, packaging, and peripheral circuits in addition to errors in NVM cells. However, similar to DRM, FREE-p requires special support from the OS to manage remapping of the dead blocks.
Fifth, Pay-As-You-Go (PAYG) is the first technique that departs from a uniform memory allocation policy for error correction. PAYG uses ECPs for error correction and follows an on-demand ECP allocation policy to mitigate ECP wastage associated with uniform ECP allocation [22]. To realize on-demand error correction, PAYG uses a local error correction (LEC) pool that allocates one ECP per 512-bit memory block and a global error correction (GEC) pool to provide extra ECPs to only those memory blocks that experience more than one cell failure. After a memory block has exhausted its LEC pool, the GEC pool is accessed to obtain more ECPs. PAYG effectively reduces the memory overhead (≈ 3× over ECP-6/SAFER/FREE-p) of hard error correction; however, in the presence of a large number of hard errors, PAYG incurs a performance penalty of 1-2 read cycles on every read/write to access the GEC pool for error correction.
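The on-demand allocation policy described above can be illustrated with a toy allocator. This is a sketch under assumptions: the class name, the dictionary-based pools, and the fixed GEC capacity are hypothetical and do not reflect the hardware organization in [22].

```python
class PaygSketch:
    """Toy model: one LEC entry per block, extra entries from a shared GEC pool."""

    def __init__(self, gec_capacity):
        self.gec_capacity = gec_capacity  # assumed size of the global pool
        self.lec = {}       # block id -> position of the first failed cell
        self.gec = {}       # block id -> positions of additional failed cells
        self.gec_used = 0

    def record_failure(self, block, pos):
        # The first failure in a block is covered by its local entry.
        if block not in self.lec:
            self.lec[block] = pos
            return "LEC"
        # Later failures draw from the global pool, only when needed.
        if self.gec_used >= self.gec_capacity:
            raise RuntimeError("GEC pool exhausted")
        self.gec.setdefault(block, []).append(pos)
        self.gec_used += 1
        return "GEC"


pool = PaygSketch(gec_capacity=2)
first = pool.record_failure(block=0, pos=5)
second = pool.record_failure(block=1, pos=7)
third = pool.record_failure(block=0, pos=9)
```

Only block 0 consumes a global entry here, which is the wastage reduction that motivates on-demand allocation.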
verify-read-after-write, a cache line is immediately read back after a write to determine if any new cell failures occurred during the write. If one or more cell failures are encountered on the verify-read, the memory controller rewrites the correct data using the underlying hard error correction scheme. Thereafter, subsequent reads or writes to the faulty location require hard error correction.
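A verify-read-after-write check can be sketched as below. The `FaultyLine` class is a hypothetical fault model introduced only for illustration (stuck-at cells ignore writes); a real memory controller would hand the reported positions to its hard error correction scheme.

```python
class FaultyLine:
    """Hypothetical memory line whose stuck-at cells ignore writes."""

    def __init__(self, width, stuck=None):
        self.cells = [0] * width
        self.stuck = dict(stuck or {})  # position -> stuck-at value

    def raw_write(self, data):
        for pos, bit in enumerate(data):
            self.cells[pos] = self.stuck.get(pos, bit)

    def raw_read(self):
        return list(self.cells)


def verify_read_after_write(line, data):
    """Write the line, read it back, and return positions that did not take."""
    line.raw_write(data)
    readback = line.raw_read()
    return [pos for pos, (w, r) in enumerate(zip(data, readback)) if w != r]


line = FaultyLine(width=4, stuck={1: 1})
failed_now = verify_read_after_write(line, [0, 0, 0, 0])
failed_later = verify_read_after_write(line, [0, 1, 0, 0])
```

The second write reports no mismatch because the data happens to agree with the stuck-at value, which shows why a stuck cell is detected only on a write that disagrees with it.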
Hard error correction techniques for NVMs have evolved from conventional ECC to error-correcting pointers (ECPs) [19] and their extensions in PAYG [22] and Zombie memory [25], as well as non-ECP-based schemes like dynamically replicated memory (DRM) [18], SAFER [20], and FREE-p [21]. A roughly chronological survey of these techniques along with their merits/demerits is presented in this section.
First, DRM [18] was the first departure from conventional ECC for NVMs. In DRM, hard error correction is performed by combining two pages with failed memory blocks (faulty pages henceforth) such that the failed memory blocks across the two pages do not align. DRM assigns a parity bit to every n-bit block within a 4-kB page; a block is marked as dead when any bit within it (including the parity bit) fails. Unlike SEC-DED ECC, where error correction capability is exhausted early in the memory lifetime, DRM provides error correction capability till late in the memory lifetime by leveraging compatible faulty pages. However, it is important to note that DRM requires modifications and support of both the hardware (memory controller) and the software (operating system (OS)).
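The pairing condition used by DRM can be expressed compactly. The sketch below is an assumption-laden illustration: pages are modeled as lists of blocks, dead blocks as sets of block indices, and the function names are hypothetical.

```python
def compatible(dead_a, dead_b):
    """Two faulty pages can be paired if no block index is dead in both."""
    return set(dead_a).isdisjoint(dead_b)


def read_block(index, page_a, dead_a, page_b):
    """Serve a block from page A unless it is dead there, then fall back to page B."""
    return page_b[index] if index in dead_a else page_a[index]


page_a = ["a0", "a1", "a2", "a3"]
page_b = ["b0", "b1", "b2", "b3"]
dead_a, dead_b = {1}, {3}

ok = compatible(dead_a, dead_b)
value = read_block(1, page_a, dead_a, page_b)
```

In this model a pair stops being usable as soon as the same block index fails in both pages, at which point a new compatible partner would have to be found.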
Second, ECPs [19] are now regarded as a milestone in hard error correction for NVMs. ECP uses a pointer to point to a failed cell and also stores the correct value for that cell. For example, on a 512-bit memory block, 9 bits (log2 512 = 9) are required to uniquely point to any failed cell and 1 replacement bit is required to store the correct value. The standard ECP implementation reserves six ECPs for every 512-bit memory block, enabling each 512-bit block to recover from up to 6 hard errors (ECP-6 henceforth). Thus, ECP-6 has equivalent overhead to the (72,64) SEC-DED ECC widely used in SRAMs/DRAMs for error recovery. ECPs are fully contained inside the NVM DIMM and require no modifications to or support from the OS. Due to its low hardware and performance overhead, ECP is considered a baseline reference for evaluating hard error correction in NVMs.
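The pointer-plus-replacement-bit idea can be sketched as a small model. This is an illustrative sketch, not the design from [19]: the `EcpBlock` class and its fault model are hypothetical, and the overhead helper counts only pointer and replacement bits (any additional status bits are left out).

```python
BLOCK_BITS = 512
POINTER_BITS = 9   # log2(512)
ENTRIES = 6        # ECP-6


class EcpBlock:
    """Hypothetical 512-bit block protected by up to six correction entries."""

    def __init__(self, stuck=None):
        self.cells = [0] * BLOCK_BITS
        self.stuck = dict(stuck or {})  # fault model: position -> stuck-at value
        self.entries = {}               # pointer -> replacement bit

    def write(self, data):
        for pos, bit in enumerate(data):
            self.cells[pos] = self.stuck.get(pos, bit)
        for pos, bit in enumerate(data):
            if pos in self.entries:
                self.entries[pos] = bit          # refresh the replacement bit
            elif self.cells[pos] != bit:         # newly detected failed cell
                if len(self.entries) >= ENTRIES:
                    raise RuntimeError("block exhausted its ECP entries")
                self.entries[pos] = bit

    def read(self):
        data = list(self.cells)
        for pos, bit in self.entries.items():
            data[pos] = bit                      # patch failed cells on read
        return data


def ecp_overhead_bits(entries=ENTRIES):
    return entries * (POINTER_BITS + 1)


block = EcpBlock(stuck={3: 1, 100: 0})
data = [0] * BLOCK_BITS
data[100] = 1
block.write(data)
```

With these numbers the six entries take 60 bits per 512-bit block, which is close to the 64 check bits that (72,64) SEC-DED would need for the same 512 data bits.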
Transient (temporary) faults
Whereas hard errors were the subject of research in the early days of NVM research, technology scaling (i.e., reduction in intercell spacing) and the maturation of MLC/TLC processes led to the emergence of soft errors (attributed to transient faults) as a major reliability concern for PCM, RRAM, and STT-RAM. Further, whereas certain transient faults share a common nomenclature (e.g., read/write disturb in PCM and RRAM), the underlying failure mechanisms are technology specific and this has necessitated technology-specific nomenclatures and customized soft error correction measures. Note that technology-agnostic solutions like ECC-based scrubbing are also employed for detection or correction of soft errors. This section summarizes the failure mechanisms and correction approaches (customized/generic) for each NVM technology.
PCM
Whereas resistance drift has been a dominant source of soft errors in PCM, with technology scaling, soft errors due to read disturb or WD faults have also increased. Note that soft error correction techniques (discussed next) are applied to safeguard both data bits and hard error correction bits from soft errors. Hence, soft error correction is performed before hard error correction. We next discuss transient faults in PCM in decreasing order of prominence.
First, PCM resistance drift refers to an increase in the resistance of a PCM cell over time, caused due to the presence of "defect crystals" in the chalcogenide material that stabilize over time and render the material amorphous [26] . Resistance drift is exacerbated in MLC/TLC PCM, where the resistance ranges for different cell states are narrower in comparison to single-level cell (SLC) (10 kΩ versus 100 kΩ) [26] , [27] and a small drift in resistance can change the state of a cell. Awasthi et al. [26] proposed multibit ECC along with memory scrubbing to detect and correct resistance drift errors. Besides energy and latency, repeated scrubbing of memory adversely affects the endurance of the memory cells. To reduce the scrubbing frequency, Seong et al. [27] proposed partial data mapping (PDM) as a circuit-level solution to mitigate PCM resistance drift faults. PDM drops the most resistance-drift-prone state of an MLC PCM and stores the data in the remaining three out of four states.
Sixth, Zombie memory [25] is a state-of-the-art hard error correction technique that recycles the dead memory pages and uses them for storing error-correcting information. Zombie initially allocates 64-bit metadata to every 512-bit memory block for storing ECPs. Once the metadata of a memory block (primary block) on an active page is exhausted, in order to accommodate more ECPs, the ECPs of that memory block are migrated from its metadata to a larger (up to 512-bit) spare (secondary) block (recovered from a dead page). The metadata in the primary block is then used for storing the address of the secondary block. For the same memory overhead as ECP-6, Zombie memory improves memory lifetime by up to 50% over ECP-6. Furthermore, Zombie memory does not require any modifications to the OS (unlike DRM/FREE-p); however, it incurs a performance penalty of an extra read cycle to read or update the corresponding secondary block after every read/write to the primary block.
Whereas hard error correction improves memory reliability by allowing error-free memory operations in the presence of hard errors, fault avoidance solutions that emphasize hard error prevention have also been advocated and evaluated by memory architects to reduce memory wear and delay the onset of hard errors. Typically, memory wear (write) reduction is achieved by data comparison write (DCW), Flip-N-Write (FNW), and wear leveling. DCW writes only the modified data to the memory and leaves the unmodified data as-is. DCW can be performed either at the word level [14] (write only modified words) or the bit level [1], [36]. FNW is an extension of DCW in which the data are written in inverted form if it results in fewer bit writes in comparison to writing uninverted data. In the worst case, data inversion reduces the number of bits to be written by half [37]. Note that both DCW and FNW require a read before write in order to compare the existing data with the new data. In contrast, wear leveling attempts to spread writes uniformly across memory, preventing early life failures of frequently written cells [1], [14], [38]. However, the benefits from wear leveling come at the cost of address translation and data migration overhead. A summary of hard error correction and prevention techniques is provided in Table 1.
much impact on system performance. In summary, resistance drift is a very important challenge and is only expected to worsen with denser (8-/16-level) PCM technologies; hence, designing robust resistance drift mitigation techniques is necessary to realize reliable PCM NVMs [27].
Second, a PCM WD fault occurs when a cell write dissipates heat to the neighboring cells, disturbing the state of those cells. WD is affected by intercell spacing and worsens as cell-to-cell spacing is reduced with technology scaling [34], [41]. In [34], a data insulation (DIN) framework to reduce the frequency of writes of WD-vulnerable data patterns is
Furthermore, by combining PDM with single error correction, double error detection (SEC-DED) ECC, memory scrubbing can be almost completely eliminated. Whereas PDM effectively eliminates PCM resistance drift errors, it reduces the effective capacity of the memory. M-metric [39] is another circuit-level solution that proposes drift-resilient voltage sensing, instead of traditional current sensing (referred to as R-metric). However, M-metric based readout is latency intensive; Wang et al. [40] proposed a hybrid readout scheme that integrates M-metric and R-metric, as a middle ground solution to achieve low resistance drift error rate without
Whereas the crossbar architecture achieves higher density in comparison to the grid architecture, it renders RRAM NVMs susceptible to read/WD faults [15], [28]. Further, due to the changes to an RRAM cell's physical properties under different programming conditions, faults such as RFs and pseudohard errors are also observed in RRAM NVMs. Note that similar to PCM NVMs, RRAM NVMs also employ soft error correction on top of hard error correction for reliable operation [43].
First, a read disturb fault in RRAM occurs when the half-selected cells (unselected cells in the same word/bit line as the selected cell) are in a low resistance state (LRS) and the selected cell is in an HRS. Under these conditions, a high half-select current (sneak current) flows in the selected bit line, causing an RRAM read disturb failure in the selected cells [15], [28]. To reduce read sneak currents, [28] proposes biasing all unselected rows at the same voltage as the selected column.
However, leakage currents and variations in peripheral circuitry may change the voltage drop on the unselected cells connected to the selected column away from ideal zero. In contrast, a two-step sampling approach to isolate the noise current of the parasitic half-selected cells from the read current of the selected cell was also proposed in [28] . The technique involves sampling current from both half-selected and the selected cells to accurately identify the true state of the selected cell. Although this technique successfully eliminates the noise current from half-selected cells, it incurs high latency overhead due to the two-step current sampling.
Second, an RRAM WD fault is the result of unwanted voltage drops on the cells of a word line due to the lack of isolation among cells in the crosspoint structure [29]. For example, on a word line, when the cell farthest from the voltage driver is written, the cell closest to the voltage driver experiences a nonnegligible voltage drop that may change its resistance. WD errors are aggravated in the presence of high sneak currents, which cause the resistances of the half-selected cells to shift higher (lower) when programming the selected cell in HRS (LRS). RRAM WD errors can be avoided by dual-port write, i.e., adding another set of voltage drivers on the other side of the word line [29]; however, this doubles the area and energy of the voltage drivers. Dual-port write is often complemented with a test-and-flip strategy [29]. Since the cells in LRS generate the most sneak currents, the test-and-flip technique ensures
described. In DIN, an n-bit data is encoded using an m-bit code (m > n > 1), with multiple possible encodings for an n-bit data pattern. DIN selects the least WD-prone bit pattern out of the 2^m possible patterns. DIN reduces WD errors only along the word lines and relies on the compressibility of memory data to reduce the memory overhead of writing an n-bit data using an m-bit code. However, for applications with low data compressibility, e.g., computer vision, data mining, memory encryption, [42] proposes to employ unused ECPs in conjunction with DIN to temporarily recover from WD errors along both word and bit lines. Since WD faults are transient, ECP entries can be cleared on the next write. A verify-read-after-write is employed to identify the memory cells that have recovered from WD faults and no longer require ECP entries. WD errors are expected to rise with reducing feature size, and hence developing efficient WD solutions is necessary for continued technology scaling [34].
Third, a PCM read recovery disturb (RRD) fault occurs when a cell is read immediately after it has been programmed to a high resistance state (HRS). On an RRD error, the read cell returns "1" where the stored value is "0." RRD faults occur because a recently programmed PCM cell takes some time to cool down completely, and an instant read-after-write (RAW) operation disturbs the cooling process of the chalcogenide [16], [41]. The occurrence of RRD errors can be mitigated by increasing the RAW timings and using ECC to detect erroneous reads [41].
Fourth, a PCM read disturb fault occurs when the small read current flowing through a cell induces localized heating that accelerates a spontaneous amorphous-to-crystalline transition. Additionally, in the presence of defects, the read current flowing through a cell is increased, leading to an increase in the heat generated in the chalcogenide material. These factors may result in a cell-state switch from the amorphous to the crystalline state [16], [41]. PCM read disturb errors are detected using ECC and the faulty cells are programmed back to their original states [16], [41].
RRAM
Unlike conventional grid architecture-based memory design, RRAM-based NVMs are organized as a crossbar architecture, where RRAM cells are interconnected to each other without transistors [15].
read disturb, false read, and incomplete write, all of which are related to the uncertainty in MTJ switching behavior. A detailed discussion of these faults and their mitigation techniques is as follows.
First, an STT-RAM read disturb fault occurs when the data stored in a cell is flipped during a read operation [31], [32]. Although the read current is 5-10× lower than the critical write current, it may flip the magnetic orientation of an MTJ cell, disturbing the stored data. Due to the hysteresis inherent to MTJ device operation, the read disturbance can only occur in one direction, i.e., an antiparallel to parallel read disturb or vice versa. Mostly, circuit-level techniques are employed for STT-RAM read disturb fault mitigation. For example, a pulsed read approach switches the word line on and off so that the read current cannot flow continuously through the cell. However, this reduces read disturb errors at the cost of increased read latency and high sensing complexity. Similarly, increasing the thermal stability factor can also reduce the STT-RAM read disturb fault rate, but it consumes more write power [31]. In contrast, the asymmetric differential cell structure proposed in [33] uses a differential current sensing scheme to improve the read performance without incurring high power/latency penalties. At the architecture level, a postread restore operation (restore-after-read) can be employed to ensure data reliability of STT-RAM-based caches. To mitigate the high energy/latency overhead of the restore-after-read approach, [45] proposes a selective restore (SR) solution. In SR, the restore of a read-disturb-affected L2 cache line is deferred until the cache line is evicted from the upper-level L1 cache.
Second, an STT-RAM false read denotes the scenario where a cell stores "0" but is read out as "1" or vice versa. To read a cell, its effective resistance is measured by sensing the current through the cell. If the sensing current is lower (higher) than the reference current value when reading a cell storing a "0" ("1"), the stored value is read incorrectly and termed a false read [32]. False read also occurs due to the small sense margin incurred by the process variations of the MTJ or access transistor [46]. The problem is exacerbated in MLCs due to narrower sense margins than SLCs. A device-level solution for STT-RAM false read is to increase the access transistor's width, which improves the read reliability by increasing the distance between the means of the read-0 and read-1 current distributions; however, it reduces cell density and increases power consumption. Whereas power
that the number of cells in LRS is always less than half of the total number of cells in a word/bit line. Since WD errors are expected to rise with denser crossbar structures, development of circuit-/architecture-level solutions for RRAM WD mitigation is an active area of research [28], [29].
Third, RRAM RFs refer to an abrupt resistance drop (rise) of a cell in the HRS (LRS), resulting from the random generation (recombination with oxygen ions) of the oxygen vacancies [28], [30]. RF faults are largely addressable through device-level techniques to improve data retention by controlling the oxygen vacancy concentration in a cell's conductive filament [30]. Varying the voltage settings of word/bit lines to boost the off-to-on resistance ratio also improves the retention capability of the RRAM cell [44]. Such device-/circuit-level preventive measures can be complemented by ECC-based data scrubbing to detect and correct RF errors [30].
Fourth, an RRAM slow-write-1 (SW1) or slow-write-0 (SW0) fault occurs when the write pulse is not strong enough to switch an RRAM cell from logic 0/1 to logic 1/0. The SW1 (SW0) RRAM fault is caused by a small increase (decrease) in the dopants (oxygen vacancies). The SW1 (SW0) fault can be avoided by using a large-magnitude write 1 (0) pulse to program the cell to logic 1 (0), but this incurs more write energy. An alternate approach that has latency but not energy overhead is to use a read-monitored write with successive reads followed by short write pulses to gradually program a cell into the desired resistance range [16].
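A read-monitored write loop can be sketched as follows. Everything here is a toy model under stated assumptions: `ToyCell`, its fixed resistance step per pulse, the arbitrary resistance units, and the pulse budget are hypothetical and stand in for device behavior that the text does not specify.

```python
class ToyCell:
    """Hypothetical cell whose resistance moves by a fixed step per short pulse."""

    def __init__(self, resistance, step):
        self.resistance = resistance
        self.step = step

    def read(self):
        return self.resistance

    def pulse(self, increase):
        self.resistance += self.step if increase else -self.step


def read_monitored_write(read_resistance, apply_pulse, lo, hi, max_pulses=32):
    """Alternate reads and short pulses until the resistance lies in [lo, hi].

    Returns the number of pulses applied.
    """
    for pulses in range(max_pulses + 1):
        r = read_resistance()
        if lo <= r <= hi:
            return pulses
        if pulses < max_pulses:
            apply_pulse(increase=(r < lo))
    raise RuntimeError("cell did not reach the target resistance range")


cell = ToyCell(resistance=10.0, step=2.0)
pulses_used = read_monitored_write(cell.read, cell.pulse, lo=15.0, hi=17.0)
```

The latency cost mentioned above shows up as the number of read-and-pulse iterations, while each individual pulse stays short.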
Fifth, RRAM pseudohard errors are a result of temporal variations in a single device from cycle to cycle, caused by the stochastic nature of the formation and rupturing of the conductive filament [29]. Due to these temporal variations, if the LRS (HRS) resistance is not low (high) enough, the sensing margin shrinks, resulting in a read failure. In contrast, if the LRS (HRS) resistance is too low (high), a write failure may occur in the next write cycle since the same programming current cannot switch the cell anymore. However, unlike hard errors, this error can be recovered by increasing the voltage amplitude or the width of the write pulse. Hence, it is referred to as a pseudohard error [29].
STT-RAM
STT-RAM is a truly bistable device and does not suffer from the dynamic WD faults as seen in PCM/RRAM [35]. The major sources of soft errors in STT-RAM are
penalty of using a wider access transistor is acceptable for SLC STT-RAM, employing the same solution for already high power consuming MLC STT-RAM is infeasible in practice [32], [46]. For MLC STT-RAM, a low-power circuit-level solution is to use a state-restrict MLC that eliminates the most error-prone resistance state and stores data in only the three remaining states [46]. Whereas state-restrict MLC is a power-efficient STT-RAM false read mitigation approach, it reduces the effective capacity of the memory.
Third, an STT-RAM incomplete write fault occurs when the resistance of a cell is too large for the access transistor to provide sufficient write current required to reliably program the cell in a given write cycle [17], [46]. Techniques to mitigate incomplete write increase the write current, which in turn reduces the MTJ switching time and improves the write reliability. For example, word line boosting during a write operation brings down the resistance of the access transistor and increases the current flowing through the MTJ; however, it negatively impacts the reliability of the access transistor [17]. Similarly, body biasing of the access transistor lowers the threshold voltage of the access transistor and increases the cell current. Furthermore, reducing the write threshold current also helps in improving writability [32]. This can be accomplished by increasing the write pulse width, which lowers the incomplete write failures at the cost of reduced write speed. Apart from modulating the write current, techniques like error-pattern removal suppress the incomplete writes in MLC STT-RAM using state-restrict MLC or ternary MLC, which removes the most error-prone state ("10") [46].
Fourth, an STT-RAM RF occurs when the content of an idle cell flips due to thermal noise [47]. The RF rate is exponentially dependent on the thermal stability, which scales down with technology, leading to an exponential rise in the RF error rate in STT-RAM. The dominant STT-RAM RF error rate reduction approach relies on ECC and memory scrubbing [47].
Table 2. Transient faults in PCM, RRAM, and STT-RAM: manifestations, causes, and solutions.
| NVM | Fault | Manifestation | Cause | Solutions |
| PCM | Resistance drift [26], [27], [40] | Increase in cell's resistance over time | Presence of defect crystals in chalcogenide | ECC + scrubbing [26], tri-level PCM [27], M-/R-metric based hybrid readout [40] |
| PCM | Read disturb [16], [41] | Switching of a cell during read | Increased heating during read operation | ECC + scrubbing [41] |
| PCM | Read recovery disturb [16], [41] | Write immediately followed by a read results in incorrect read | Recently programmed cell takes time to cool down completely | Increasing read-after-write (RAW) timing, ECC [16], [41] |
| PCM | Write disturb [34], [41], [42] | Writing a cell disturbs data in neighboring cells | Small intercell space due to scaling | Verify-and-restore, Flip-N-Write, coding schemes like data insulation (DIN) [34], [42] |
| Resistive RAM (RRAM) | Retention failure [28], [30], [44] | Abrupt drop (rise) in resistance of a cell in HRS (LRS) | Random generation (recombination with oxygen ions) of oxygen vacancies | ECC + scrubbing [28], boosting OFF-to-ON resistance ratio [44] |
| Resistive RAM (RRAM) | Read disturb [15], [28] | Read failure due to current from half-selected cells in LRS | Sneak-path current in crossbar circuit | Bias unselected rows at the voltage of selected column, two-step sampling to isolate noise current [28] |
| Resistive RAM (RRAM) | Write disturb [28], [29] | Unwanted voltage drop on half-selected cells of word/bit line | Lack of isolation in crosspoint structure | Dual-port write, test-and-flip [29], verify-and-restore [28] |
| Resistive RAM (RRAM) | Slow-write-1/0 [16] | Cell state remains unchanged after write | SW1 occurs due to an increase in dopants; SW0 occurs due to a decrease in oxygen vacancies | Large amplitude write pulse, read-monitored write [16] |
| Resistive RAM (RRAM) | Pseudohard error [29] | Shrinking of sense margin | Formation and rupture of conductive filament | ECC, using stronger programming voltage [29] |
| Spin-Transfer Torque RAM (STT-RAM) | Retention failure [47] | Flipping of the content of an idle cell | Thermal noise | ECC + scrubbing [47] |
| Spin-Transfer Torque RAM (STT-RAM) | Read disturb [31], [33], [45] | Data stored in a cell is changed during read | Change in the magnetic orientation of MTJ due to read current | Pulsed read [31], increasing thermal stability factor [33], selective restore for LLC [45] |
| Spin-Transfer Torque RAM (STT-RAM) | False read [32], [46] | Cell stores a '0' but is read as '1' or vice versa | Process variation of MTJ or access transistor reduces sense margin | State-restrict MLC [46], ECC, increase width of access transistor [32] |
| Spin-Transfer Torque RAM (STT-RAM) | Incomplete write [17], [32], [46] | Failing to write data reliably in a write cycle | | |
Recent research [48] on STT-RAM cell optimization has proposed a new classification, persistent and nonpersistent errors, for read or write errors in STT-RAM. Whereas persistent errors are deterministic and can be reproduced after the chip is manufactured, nonpersistent errors are transient failures induced by intermittent events. For example, write failures caused by geometric variations of the transistors and MTJ are classified as persistent errors. In contrast, write failures caused by thermal fluctuation of the critical MTJ switching current are nondeterministic, and hence classified as nonpersistent errors. Similarly, a false read caused by device mismatch in sense amplifiers is a persistent error, but flipping of the resistance state of the MTJ (read disturb) is a nonpersistent error. Table 2 summarizes the transient faults along with their manifestations, causes, and solutions for PCM, RRAM, and STT-RAM.
Conclusion
Nonvolatile memory technologies like PCM/RRAM/STT-RAM have emerged as promising replacement candidates for DRAM/SRAM. However, the low reliability of these NVMs has become a first-order design consideration and potentially gates their wide-scale integration and adoption into commodity and high-performance computing systems. This paper surveys the evolution and future trends of reliability enhancement techniques proposed for these advanced NVM technologies. The key takeaways of this survey can be summarized as follows. First, PCM/RRAM/STT-RAM experience both hard and soft errors and require holistic integration of solutions to address both types of errors. Second, due to the technology-specific nature of transient faults, customized fault mitigation techniques are employed for PCM/RRAM/STT-RAM. Lastly, hard as well as soft errors are exacerbated by technology scaling and MLC/TLC operation, which necessitates even stronger error correction solutions for dense NVMs.
Directions for NVM reliability research include:
• architectures that are capable of synergistically recovering from both hard and soft errors for low memory energy or latency overhead;
• encryption architectures to thwart security attacks that negatively impact NVM endurance and, by extension, coexist with reliability solutions;
• the exploration of holistic system designs that incorporate novel features, such as applications that do not require page initialization and frequent page copy, to complement state-of-the-art device-, circuit-, or architecture-level solutions for reducing hard or soft errors.
