Experimental Characterization, Optimization, and Recovery of Data Retention Errors in MLC NAND Flash Memory
This paper summarizes our work on experimentally characterizing, mitigating,
and recovering data retention errors in multi-level cell (MLC) NAND flash
memory, which was published in HPCA 2015, and examines the work's significance
and future potential. Retention errors, caused by charge leakage over time, are
the dominant source of flash memory errors. Understanding, characterizing, and
reducing retention errors can significantly improve NAND flash memory
reliability and endurance. In this work, we first characterize, with real 2Y-nm
MLC NAND flash chips, how the threshold voltage distribution of flash memory
changes with different retention ages -- the length of time since a flash cell
was programmed. We observe from our characterization results that 1) the
optimal read reference voltage of a flash cell, using which the data can be
read with the lowest raw bit error rate (RBER), systematically changes with its
retention age, and 2) different regions of flash memory can have different
retention ages, and hence different optimal read reference voltages.
Based on our findings, we propose two new techniques. First, Retention
Optimized Reading (ROR) adaptively learns and applies the optimal read
reference voltage for each flash memory block online. The key idea of ROR is to
periodically learn a tight upper bound of the optimal read reference voltage,
and from there approach the optimal read reference voltage. Our evaluations
show that ROR can extend flash memory lifetime by 64% and reduce average error
correction latency by 10.1%. Second, Retention Failure Recovery (RFR) recovers
data with uncorrectable errors offline by identifying and probabilistically
correcting flash cells with retention errors. Our evaluation shows that RFR
essentially doubles the error correction capability.
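The key idea of ROR stated above (start from an upper bound of the optimal read reference voltage and approach the optimum from there) can be illustrated with a minimal sketch. Everything in it is hypothetical: `count_errors` stands in for a controller's ability to measure raw bit errors at a given read reference voltage, the voltage units and step size are arbitrary, and the simple step-down search is an assumption for illustration, not the procedure from the paper.

```python
from typing import Callable


def approach_optimal_vref(
    count_errors: Callable[[int], int],  # hypothetical error-measurement hook
    upper_bound: int,
    lower_limit: int,
    step: int = 1,
) -> int:
    """Step down from an upper bound until the error count stops improving."""
    best_v = upper_bound
    best_errors = count_errors(best_v)
    v = upper_bound - step
    while v >= lower_limit:
        errors = count_errors(v)
        if errors >= best_errors:
            break  # errors no longer decrease: assume the optimum was passed
        best_v, best_errors = v, errors
        v -= step
    return best_v


# Toy error model (an assumption): errors grow quadratically away from 42.
def toy_errors(v: int) -> int:
    return (v - 42) ** 2 + 5


print(approach_optimal_vref(toy_errors, upper_bound=50, lower_limit=0))
```

The search only moves downward, which matches the assumption that the starting point is an upper bound of the optimum; a real controller would also have to bound the number of extra reads it spends on the search.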
Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery
NAND flash memory is ubiquitous in everyday life today because its capacity
has continuously increased and cost has continuously decreased over decades.
This positive growth is a result of two key trends: (1) effective process
technology scaling; and (2) multi-level (e.g., MLC, TLC) cell data coding.
Unfortunately, the reliability of raw data stored in flash memory has also
continued to become more difficult to ensure, because these two trends lead to
(1) fewer electrons in the flash memory cell floating gate to represent the
data; and (2) larger cell-to-cell interference and disturbance effects. Without
mitigation, worsening reliability can reduce the lifetime of NAND flash memory.
As a result, flash memory controllers in solid-state drives (SSDs) have become
much more sophisticated: they incorporate many effective techniques to ensure
the correct interpretation of noisy data stored in flash memory cells.
In this chapter, we review recent advances in SSD error characterization,
mitigation, and data recovery techniques for reliability and lifetime
improvement. We provide rigorous experimental data from state-of-the-art MLC
and TLC NAND flash devices on various types of flash memory errors, to motivate
the need for such techniques. Based on the understanding developed by the
experimental characterization, we describe several mitigation and recovery
techniques, including (1) cell-to-cell interference mitigation; (2) optimal
multi-level cell sensing; (3) error correction using state-of-the-art
algorithms and methods; and (4) data recovery when error correction fails. We
quantify the reliability improvement provided by each of these techniques.
Looking forward, we briefly discuss how flash memory and these techniques could
evolve into the future.
Comment: arXiv admin note: substantial text overlap with arXiv:1706.0864
RowHammer: A Retrospective
This retrospective paper describes the RowHammer problem in Dynamic Random
Access Memory (DRAM), which was initially introduced by Kim et al. at the ISCA
2014 conference. RowHammer is a prime (and perhaps
the first) example of how a circuit-level failure mechanism can cause a
practical and widespread system security vulnerability. It is the phenomenon
that repeatedly accessing a row in a modern DRAM chip causes bit flips in
physically-adjacent rows at consistently predictable bit locations. RowHammer
is caused by a hardware failure mechanism called DRAM disturbance errors,
which is a manifestation of circuit-level cell-to-cell interference in a scaled
memory technology.
Researchers from Google Project Zero demonstrated in 2015 that this hardware
failure mechanism can be effectively exploited by user-level programs to gain
kernel privileges on real systems. Many other follow-up works demonstrated
other practical attacks exploiting RowHammer. In this article, we
comprehensively survey the scientific literature on RowHammer-based attacks as
well as mitigation techniques to prevent RowHammer. We also discuss what other
related vulnerabilities may be lurking in DRAM and other types of memories,
e.g., NAND flash memory or Phase Change Memory, that can potentially threaten
the foundations of secure systems, as the memory technologies scale to higher
densities. We conclude by describing and advocating a principled approach to
memory reliability and security research that can enable us to better
anticipate and prevent such vulnerabilities.
Comment: A version of this work is to appear at IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems (TCAD) Special Issue
on Top Picks in Hardware and Embedded Security, 2019. arXiv admin note:
substantial text overlap with arXiv:1703.00626, arXiv:1903.1105
Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips
This article summarizes key results of our work on experimental
characterization and analysis of latency variation and latency-reliability
trade-offs in modern DRAM chips, which was published in SIGMETRICS 2016, and
examines the work's significance and future potential.
The goal of this work is to (i) experimentally characterize and understand
the latency variation across cells within a DRAM chip for three fundamental
DRAM operations (activation, precharge, and restoration), and (ii) develop new
mechanisms that exploit our
understanding of the latency variation to reliably improve performance. To this
end, we comprehensively characterize 240 DRAM chips from three major vendors,
and make six major new observations about latency variation within DRAM.
Notably, we find that (i) there is large latency variation across the cells for
each of the three operations; (ii) variation characteristics exhibit
significant spatial locality: slower cells are clustered in certain regions of
a DRAM chip; and (iii) the three fundamental operations exhibit different
reliability characteristics when the latency of each operation is reduced.
Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a
mechanism that exploits latency variation across DRAM cells within a DRAM chip
to improve system performance. The key idea of FLY-DRAM is to exploit the
spatial locality of slower cells within DRAM, and access the faster DRAM
regions with reduced latencies for the fundamental operations. Our evaluations
show that FLY-DRAM improves the performance of a wide range of applications by
13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors'
real DRAM chips, in a simulated 8-core system.
Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation
Compared to planar (i.e., two-dimensional) NAND flash memory, 3D NAND flash
memory uses a new flash cell design, and vertically stacks dozens of silicon
layers in a single chip. This allows 3D NAND flash memory to increase storage
density using a much less aggressive manufacturing process technology than
planar NAND flash memory. The circuit-level and structural changes in 3D NAND
flash memory significantly alter how different error sources affect the
reliability of the memory.
In this paper, through experimental characterization of real,
state-of-the-art 3D NAND flash memory chips, we find that 3D NAND flash memory
exhibits three new error sources that were not previously observed in planar
NAND flash memory: (1) layer-to-layer process variation, where the average
error rate of each 3D-stacked layer in a chip is significantly different; (2)
early retention loss, a new phenomenon where the number of errors due to charge
leakage increases quickly within several hours after programming; and (3)
retention interference, a new phenomenon where the rate at which charge leaks
from a flash cell is dependent on the data value stored in the neighboring
cell.
Based on our experimental results, we develop new analytical models of
layer-to-layer process variation and retention loss in 3D NAND flash memory.
Motivated by our new findings and models, we develop four new techniques to
mitigate process variation and early retention loss in 3D NAND flash memory.
These four techniques are complementary, and can be combined together to
significantly improve flash memory reliability. Compared to a state-of-the-art
baseline, our techniques, when combined, improve flash memory lifetime by
1.85x. Alternatively, if a NAND flash vendor wants to keep the lifetime of the
3D NAND flash memory device constant, our techniques reduce the storage
overhead required to hold error correction information by 78.9%.
Comment: presented at SIGMETRICS 201
Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance
This paper summarizes our work on characterizing application memory error
vulnerability to optimize datacenter cost via Heterogeneous-Reliability Memory
(HRM), which was published in DSN 2014, and examines the work's significance
and future potential. Memory devices represent a key component of datacenter
total cost of ownership (TCO), and techniques used to reduce errors that occur
on these devices increase this cost. Existing approaches to providing
reliability for memory devices pessimistically treat all data as equally
vulnerable to memory errors. Our key insight is that there exists a diverse
spectrum of tolerance to memory errors in new data-intensive applications, and
that traditional one-size-fits-all memory reliability techniques are
inefficient in terms of cost. This presents an opportunity to greatly reduce
server hardware cost by provisioning the right amount of memory reliability for
different applications.
Toward this end, in our DSN 2014 paper, we make three main contributions to
enable highly-reliable servers at low datacenter cost. First, we develop a new
methodology to quantify the tolerance of applications to memory errors. Second,
using our methodology, we perform a case study of three new data-intensive
workloads (an interactive web search application, an in-memory key--value
store, and a graph mining framework) to identify new insights into the nature
of application memory error vulnerability. Third, based on our insights, we
propose several new hardware/software heterogeneous-reliability memory system
designs to lower datacenter cost while achieving high reliability and discuss
their trade-offs. We show that our new techniques can reduce server hardware
cost by 4.7% while achieving 99.90% single server availability.
Comment: 4 pages, 4 figures, summary report for DSN 2014 paper:
"Characterizing Application Memory Error Vulnerability to Optimize Datacenter
Cost via Heterogeneous-Reliability Memory
Adaptive-Latency DRAM: Reducing DRAM Latency by Exploiting Timing Margins
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was
published in HPCA 2015, and examines the work's significance and future
potential. AL-DRAM is a mechanism that optimizes DRAM latency based on the DRAM
module and the operating temperature, by exploiting the extra margin that is
built into the DRAM timing parameters. DRAM manufacturers provide a large
margin for the timing parameters as a provision against two worst-case
scenarios. First, due to process variation, some outlier DRAM chips are much
slower than others. Second, chips become slower at higher temperatures. The
timing parameter margin ensures that the slow outlier chips operate reliably at
the worst-case temperature, and hence leads to a high access latency.
Using an FPGA-based DRAM testing platform, our work first characterizes the
extra margin for 115 DRAM modules from three major manufacturers. The
experimental results demonstrate that it is possible to reduce four of the most
critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C while
maintaining reliable operation. AL-DRAM uses these observations to adaptively
select reliable DRAM timing parameters for each DRAM module based on the
module's current operating conditions. AL-DRAM does not require any changes to
the DRAM chip or its interface; it only requires multiple different timing
parameters to be specified and supported by the memory controller. Our real
system evaluations show that AL-DRAM improves the performance of
memory-intensive workloads by an average of 14% without introducing any errors.
Our characterization and proposed techniques have inspired several other works
on analyzing and/or exploiting different sources of latency and performance
variation within DRAM chips.
Comment: arXiv admin note: substantial text overlap with arXiv:1603.0845
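The selection step that AL-DRAM is described as performing (choosing, per DRAM module, a set of timing parameters suited to the current operating temperature) might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's implementation: the profile layout, the function name `select_timings`, the parameter names, and every numeric value are placeholders.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

TimingSet = Dict[str, float]               # parameter name -> latency in ns
Profile = List[Tuple[float, TimingSet]]    # (max temperature in °C, timings)

# Placeholder worst-case timings; the values are illustrative only.
STANDARD_TIMINGS: TimingSet = {"tRCD": 13.75, "tRAS": 35.0, "tWR": 15.0, "tRP": 13.75}


def select_timings(profile: Profile, temperature_c: float) -> TimingSet:
    """Pick the first temperature bin that covers the reading.

    `profile` must be sorted by ascending temperature limit. Readings above
    the highest bin fall back to the standard (worst-case) timings.
    """
    limits = [limit for limit, _ in profile]
    i = bisect_left(limits, temperature_c)
    if i == len(profile):
        return STANDARD_TIMINGS
    return profile[i][1]


# Hypothetical profile for one module: reduced timings up to 55 °C only.
module_profile: Profile = [
    (55.0, {"tRCD": 10.0, "tRAS": 24.0, "tWR": 8.0, "tRP": 11.0}),
    (85.0, STANDARD_TIMINGS),
]

print(select_timings(module_profile, 40.0)["tRCD"])
print(select_timings(module_profile, 70.0)["tRCD"])
```

Keeping the worst-case timings as the fallback means that an out-of-range temperature reading can cost performance but never selects a more aggressive setting than the profile allows.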
Reducing DRAM Refresh Overheads with Refresh-Access Parallelism
This article summarizes the idea of "refresh-access parallelism," which was
published in HPCA 2014, and examines the work's significance and future
potential. The overarching objective of our HPCA 2014 paper is to reduce the
significant negative performance impact of DRAM refresh with intelligent memory
controller mechanisms.
To mitigate the negative performance impact of DRAM refresh, our HPCA 2014
paper proposes two complementary mechanisms, DARP (Dynamic Access Refresh
Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal
is to address the drawbacks of the state-of-the-art per-bank refresh mechanism
by building more efficient techniques to parallelize refreshes and accesses
within DRAM. First, instead of issuing per-bank refreshes in a round-robin
order, as is done today, DARP issues per-bank refreshes to idle banks in an
out-of-order manner. Furthermore, DARP proactively schedules refreshes during
intervals when a batch of writes is draining to DRAM. Second, SARP exploits
the existence of mostly-independent subarrays within a bank. With minor
modifications to DRAM organization, it allows a bank to serve memory accesses
to an idle subarray while another subarray is being refreshed. Our extensive
evaluations on a wide variety of workloads and systems show that our mechanisms
improve system performance (and energy efficiency) compared to three
state-of-the-art refresh policies, and their performance benefits increase as
DRAM density increases.
Comment: 9 pages. arXiv admin note: text overlap with arXiv:1712.07754,
arXiv:1601.0635
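The out-of-order refresh idea attributed to DARP above (issue per-bank refreshes to idle banks rather than in fixed round-robin order) can be sketched as a small selection function. The data structures, the function name `pick_bank_to_refresh`, and the tie-breaking rule are assumptions made for illustration; the sketch also simplifies "idle" to "fewest queued requests" and does not model refresh deadlines or write-drain intervals.

```python
from typing import Dict, Optional


def pick_bank_to_refresh(
    pending_refreshes: Dict[int, int],  # bank id -> refreshes still owed
    queued_requests: Dict[int, int],    # bank id -> demand requests waiting
) -> Optional[int]:
    """Pick a bank that still owes a refresh, preferring the least busy one."""
    candidates = [bank for bank, owed in pending_refreshes.items() if owed > 0]
    if not candidates:
        return None
    # Fewest queued requests first; then most refreshes owed; then lowest id.
    return min(
        candidates,
        key=lambda bank: (queued_requests.get(bank, 0), -pending_refreshes[bank], bank),
    )


pending = {0: 1, 1: 1, 2: 0, 3: 2}
queued = {0: 3, 1: 0, 2: 0, 3: 0}
print(pick_bank_to_refresh(pending, queued))
```

In this toy state, banks 1 and 3 are both idle and owe refreshes, so the tie is broken in favor of bank 3, which owes more; bank 0 is skipped while it has demand requests queued.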
Read Disturb Errors in MLC NAND Flash Memory
This paper summarizes our work on experimentally characterizing, mitigating,
and recovering read disturb errors in multi-level cell (MLC) NAND flash memory,
which was published in DSN 2015, and examines the work's significance and
future potential. NAND flash memory reliability continues to degrade as the
memory is scaled down and more bits are programmed per cell. A key contributor
to this reduced reliability is read disturb, where a read to one row of cells
impacts the threshold voltages of unread flash cells in different rows of the
same block.
For the first time in open literature, this work experimentally characterizes
read disturb errors on state-of-the-art 2Y-nm (i.e., 20-24 nm) MLC NAND flash
memory chips. Our findings (1) correlate the magnitude of threshold voltage
shifts with read operation counts, (2) demonstrate how program/erase cycle
count and retention age affect the read-disturb-induced error rate, and (3)
identify that lowering pass-through voltage levels reduces the impact of read
disturb and extends flash lifetime. In particular, we find that the probability
of read disturb errors increases with both higher wear-out and higher
pass-through voltage levels.
We leverage these findings to develop two new techniques. The first technique
mitigates read disturb errors by dynamically tuning the pass-through voltage on
a per-block basis. Using real workload traces, our evaluations show that this
technique increases flash memory endurance by an average of 21%. The second
technique recovers from previously-uncorrectable flash errors by identifying
and probabilistically correcting cells susceptible to read disturb errors. Our
evaluations show that this recovery technique reduces the raw bit error rate by
36%.
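The first technique above tunes the pass-through voltage per block. The abstract does not say how the voltage is chosen, so the sketch below is only one plausible shape for such a tuner, under loudly stated assumptions: it assumes that lowering the pass-through voltage too far introduces read errors of its own, that a hypothetical hook `errors_at` can measure the error count of a block at a candidate voltage, and that `ecc_limit` is the number of errors the controller is willing to tolerate. None of these names or numbers come from the paper.

```python
from typing import Callable, Sequence


def tune_pass_through(
    candidates: Sequence[float],         # candidate voltages, arbitrary units
    errors_at: Callable[[float], int],   # hypothetical per-block measurement
    ecc_limit: int,                      # tolerated error count (assumption)
    default: float,                      # nominal voltage used as fallback
) -> float:
    """Return the lowest candidate whose error count stays within ecc_limit."""
    for voltage in sorted(candidates):
        if errors_at(voltage) <= ecc_limit:
            return voltage
    return default


# Toy model (an assumption): errors fall linearly as the voltage rises to 6.0.
def toy_block_errors(voltage: float) -> int:
    return int(max(0.0, (6.0 - voltage) * 20))


print(tune_pass_through([6.0, 5.0, 5.5], toy_block_errors, ecc_limit=12, default=6.0))
```

Calling the function once per block, and repeating the call as the block wears, is what would make such tuning dynamic and per-block in the sense used above.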
Adaptive-Latency DRAM (AL-DRAM)
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was
published in HPCA 2015. The key goal of AL-DRAM is to exploit the extra margin
that is built into the DRAM timing parameters to reduce DRAM latency. The key
observation is that the timing parameters are dictated by the worst-case
temperatures and worst-case DRAM cells, both of which lead to a small amount of
charge storage and hence high access latency. One can therefore reduce latency
by adapting the timing parameters to the current operating temperature and the
current DIMM that is being accessed. Using an FPGA-based testing platform, our
work first characterizes the extra margin for 115 DRAM modules from three major
manufacturers. The experimental results demonstrate that it is possible to
reduce four of the most critical timing parameters by a minimum/maximum of
17.3%/54.8% at 55°C while maintaining reliable operation. AL-DRAM adaptively
selects between multiple different timing parameters for each DRAM module based
on its current operating condition. AL-DRAM does not require any changes to the
DRAM chip or its interface; it only requires multiple different timing
parameters to be specified and supported by the memory controller. Real system
evaluations show that AL-DRAM improves the performance of memory-intensive
workloads by an average of 14% without introducing any errors.
Comment: This is a summary of the original paper, entitled "Adaptive-Latency
DRAM: Optimizing DRAM Timing for the Common-Case" which appears in HPCA 201