734 research outputs found
Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery
NAND flash memory is ubiquitous in everyday life today because its capacity
has continuously increased and cost has continuously decreased over decades.
This positive growth is a result of two key trends: (1) effective process
technology scaling; and (2) multi-level (e.g., MLC, TLC) cell data coding.
Unfortunately, the reliability of raw data stored in flash memory has also
continued to become more difficult to ensure, because these two trends lead to
(1) fewer electrons in the flash memory cell floating gate to represent the
data; and (2) larger cell-to-cell interference and disturbance effects. Without
mitigation, worsening reliability can reduce the lifetime of NAND flash memory.
As a result, flash memory controllers in solid-state drives (SSDs) have become
much more sophisticated: they incorporate many effective techniques to ensure
the correct interpretation of noisy data stored in flash memory cells.
In this chapter, we review recent advances in SSD error characterization,
mitigation, and data recovery techniques for reliability and lifetime
improvement. We provide rigorous experimental data from state-of-the-art MLC
and TLC NAND flash devices on various types of flash memory errors, to motivate
the need for such techniques. Based on the understanding developed by the
experimental characterization, we describe several mitigation and recovery
techniques, including (1) cell-to-cell interference mitigation; (2) optimal
multi-level cell sensing; (3) error correction using state-of-the-art
algorithms and methods; and (4) data recovery when error correction fails. We
quantify the reliability improvement provided by each of these techniques.
Looking forward, we briefly discuss how flash memory and these techniques could
evolve into the future.
Comment: arXiv admin note: substantial text overlap with arXiv:1706.0864
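The "optimal multi-level cell sensing" direction above can be illustrated with a minimal sketch, assuming a hypothetical two-state Gaussian threshold-voltage model (the chapter's actual models and parameters are more detailed): the best read reference voltage is the one that minimizes the modeled raw bit error rate between two adjacent states.

```python
# Toy model of read-reference-voltage optimization between two adjacent
# threshold-voltage states. Assumption (hypothetical): each state's cell
# voltages follow a Gaussian distribution with known mean and sigma.
import math

def gaussian_cdf(x, mu, sigma):
    """P(V <= x) for a Gaussian threshold-voltage distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def raw_ber(vref, mu0, sigma0, mu1, sigma1):
    """Modeled probability of misreading a cell at read voltage vref,
    assuming both states are equally likely: state-0 cells sensed above
    vref plus state-1 cells sensed below vref."""
    p0_above = 1.0 - gaussian_cdf(vref, mu0, sigma0)
    p1_below = gaussian_cdf(vref, mu1, sigma1)
    return 0.5 * (p0_above + p1_below)

def best_vref(mu0, sigma0, mu1, sigma1, steps=1000):
    """Sweep candidate read voltages between the two state means and
    return (voltage, error rate) with the lowest modeled raw BER."""
    lo, hi = mu0, mu1
    best_v, best_e = lo, raw_ber(lo, mu0, sigma0, mu1, sigma1)
    for i in range(1, steps + 1):
        v = lo + (hi - lo) * i / steps
        e = raw_ber(v, mu0, sigma0, mu1, sigma1)
        if e < best_e:
            best_v, best_e = v, e
    return best_v, best_e
```

With equal sigmas the optimum lands at the midpoint between the two means; as distributions shift and widen with wear, the optimum moves, which is why controllers re-tune read reference voltages over a device's lifetime.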
RowHammer: A Retrospective
This retrospective paper describes the RowHammer problem in Dynamic Random
Access Memory (DRAM), which was initially introduced by Kim et al. at the ISCA
2014 conference. RowHammer is a prime (and perhaps
the first) example of how a circuit-level failure mechanism can cause a
practical and widespread system security vulnerability. It is the phenomenon
that repeatedly accessing a row in a modern DRAM chip causes bit flips in
physically-adjacent rows at consistently predictable bit locations. RowHammer
is caused by a hardware failure mechanism called "DRAM disturbance errors",
which is a manifestation of circuit-level cell-to-cell interference in a scaled
memory technology.
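The failure mechanism described above can be sketched with a toy model: repeatedly activating an "aggressor" row disturbs its two physically adjacent "victim" rows, and a victim bit flips once the disturbance count since its last refresh crosses a threshold. The threshold value below is purely illustrative; real chips vary widely.

```python
# Toy model of DRAM disturbance errors (RowHammer). The threshold is a
# made-up illustrative number, not a measured chip characteristic.
HAMMER_THRESHOLD = 139_000

class ToyDram:
    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.disturb = [0] * num_rows   # disturbances since last refresh
        self.flipped = [False] * num_rows

    def activate(self, row):
        """Activating `row` disturbs the physically adjacent rows; once
        a victim accumulates enough disturbance, a bit flips."""
        for victim in (row - 1, row + 1):
            if 0 <= victim < self.num_rows:
                self.disturb[victim] += 1
                if self.disturb[victim] >= HAMMER_THRESHOLD:
                    self.flipped[victim] = True

    def refresh(self, row):
        """Refreshing a row restores its charge and resets its
        accumulated disturbance."""
        self.disturb[row] = 0
```

The model captures why the bit flips occur at consistently predictable locations: the victims are determined by physical adjacency to the hammered row, not by chance.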
Researchers from Google Project Zero demonstrated in 2015 that this hardware
failure mechanism can be effectively exploited by user-level programs to gain
kernel privileges on real systems. Many other follow-up works demonstrated
other practical attacks exploiting RowHammer. In this article, we
comprehensively survey the scientific literature on RowHammer-based attacks as
well as mitigation techniques to prevent RowHammer. We also discuss what other
related vulnerabilities may be lurking in DRAM and other types of memories,
e.g., NAND flash memory or Phase Change Memory, that can potentially threaten
the foundations of secure systems, as the memory technologies scale to higher
densities. We conclude by describing and advocating a principled approach to
memory reliability and security research that can enable us to better
anticipate and prevent such vulnerabilities.
Comment: A version of this work is to appear at IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems (TCAD) Special Issue
on Top Picks in Hardware and Embedded Security, 2019. arXiv admin note:
substantial text overlap with arXiv:1703.00626, arXiv:1903.1105
Reducing DRAM Refresh Overheads with Refresh-Access Parallelism
This article summarizes the idea of "refresh-access parallelism," which was
published in HPCA 2014, and examines the work's significance and future
potential. The overarching objective of our HPCA 2014 paper is to reduce the
significant negative performance impact of DRAM refresh with intelligent memory
controller mechanisms.
To mitigate the negative performance impact of DRAM refresh, our HPCA 2014
paper proposes two complementary mechanisms, DARP (Dynamic Access Refresh
Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal
is to address the drawbacks of the state-of-the-art per-bank refresh mechanism
building more efficient techniques to parallelize refreshes and accesses within
by building more efficient techniques to parallelize refreshes and accesses
within DRAM. First, instead of issuing per-bank refreshes in a round-robin
order, as is done today, DARP issues per-bank refreshes to idle banks in an
out-of-order manner. Furthermore, DARP proactively schedules refreshes during
intervals when a batch of writes are draining to DRAM. Second, SARP exploits
the existence of mostly-independent subarrays within a bank. With minor
modifications to DRAM organization, it allows a bank to serve memory accesses
to an idle subarray while another subarray is being refreshed. Our extensive
evaluations on a wide variety of workloads and systems show that our mechanisms
improve system performance (and energy efficiency) compared to three
state-of-the-art refresh policies, and their performance benefits increase as
DRAM density increases.
Comment: 9 pages. arXiv admin note: text overlap with arXiv:1712.07754,
arXiv:1601.0635
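DARP's first idea, issuing per-bank refreshes out of order to idle banks, can be sketched as a simple scheduling decision. The queue model and fallback policy below are simplifying assumptions for illustration, not the paper's exact mechanism.

```python
# Sketch of DARP-style out-of-order per-bank refresh scheduling.
# Assumptions (hypothetical): each bank has a count of queued demand
# requests, and a round-robin fallback keeps refreshes from starving.
def pick_refresh_bank(needs_refresh, pending_reads, rr_pointer):
    """Return the bank to refresh next.

    needs_refresh : set of bank ids still owing a refresh this window
    pending_reads : dict, bank id -> number of queued demand requests
    rr_pointer    : baseline round-robin position (fallback order)
    """
    # Prefer an idle bank, so the refresh overlaps with demand accesses
    # being served by the busy banks.
    idle = [b for b in sorted(needs_refresh)
            if pending_reads.get(b, 0) == 0]
    if idle:
        return idle[0]
    # No idle bank: fall back to round-robin order.
    nbanks = max(needs_refresh) + 1
    for i in range(nbanks):
        b = (rr_pointer + i) % nbanks
        if b in needs_refresh:
            return b
    return None
```

SARP is complementary: rather than choosing *which bank* to refresh, it lets a bank serve an access to one subarray while a different subarray of the same bank is being refreshed.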
Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips
This article summarizes key results of our work on experimental
characterization and analysis of latency variation and latency-reliability
trade-offs in modern DRAM chips, which was published in SIGMETRICS 2016, and
examines the work's significance and future potential.
The goal of this work is to (i) experimentally characterize and understand
the latency variation across cells within a DRAM chip for the three
fundamental DRAM operations (activation, precharge, and restoration), and
(ii) develop new mechanisms that exploit our
understanding of the latency variation to reliably improve performance. To this
end, we comprehensively characterize 240 DRAM chips from three major vendors,
and make six major new observations about latency variation within DRAM.
Notably, we find that (i) there is large latency variation across the cells for
each of the three operations; (ii) variation characteristics exhibit
significant spatial locality: slower cells are clustered in certain regions of
a DRAM chip; and (iii) the three fundamental operations exhibit different
reliability characteristics when the latency of each operation is reduced.
Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a
mechanism that exploits latency variation across DRAM cells within a DRAM chip
to improve system performance. The key idea of FLY-DRAM is to exploit the
spatial locality of slower cells within DRAM, and access the faster DRAM
regions with reduced latencies for the fundamental operations. Our evaluations
show that FLY-DRAM improves the performance of a wide range of applications by
13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors'
real DRAM chips, in a simulated 8-core system.
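FLY-DRAM's key idea can be sketched as a per-region latency lookup in the memory controller: regions profiled as fast are accessed with reduced timings, while all other regions keep the standard, reliable-for-all-cells timings. The region granularity and timing values below are hypothetical placeholders.

```python
# Sketch of FLY-DRAM's region-based latency selection. Timing values
# and region size are illustrative assumptions, not measured numbers.
DEFAULT_TIMING = 13.5   # ns, standard access latency (hypothetical)
REDUCED_TIMING = 10.0   # ns, reduced latency for fast regions (hypothetical)

def access_latency(row, fast_regions, region_size=512):
    """Return the access latency for `row`: reduced if its region was
    profiled as reliably fast, standard otherwise. This exploits the
    spatial locality of slow cells: they cluster in certain regions,
    so most regions can safely use the reduced timing."""
    region = row // region_size
    return REDUCED_TIMING if region in fast_regions else DEFAULT_TIMING
```

The design choice here mirrors the paper's observation (iii): because reducing latency degrades reliability only for the slow cells, confining reduced timings to profiled-fast regions improves performance without sacrificing correctness.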
Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance
This paper summarizes our work on characterizing application memory error
vulnerability to optimize datacenter cost via Heterogeneous-Reliability Memory
(HRM), which was published in DSN 2014, and examines the work's significance
and future potential. Memory devices represent a key component of datacenter
total cost of ownership (TCO), and techniques used to reduce errors that occur
on these devices increase this cost. Existing approaches to providing
reliability for memory devices pessimistically treat all data as equally
vulnerable to memory errors. Our key insight is that there exists a diverse
spectrum of tolerance to memory errors in new data-intensive applications, and
that traditional one-size-fits-all memory reliability techniques are
inefficient in terms of cost. This presents an opportunity to greatly reduce
server hardware cost by provisioning the right amount of memory reliability for
different applications.
Toward this end, in our DSN 2014 paper, we make three main contributions to
enable highly-reliable servers at low datacenter cost. First, we develop a new
methodology to quantify the tolerance of applications to memory errors. Second,
using our methodology, we perform a case study of three new data-intensive
workloads (an interactive web search application, an in-memory key--value
store, and a graph mining framework) to identify new insights into the nature
of application memory error vulnerability. Third, based on our insights, we
propose several new hardware/software heterogeneous-reliability memory system
designs to lower datacenter cost while achieving high reliability and discuss
their trade-offs. We show that our new techniques can reduce server hardware
cost by 4.7% while achieving 99.90% single server availability.
Comment: 4 pages, 4 figures, summary report for DSN 2014 paper:
"Characterizing Application Memory Error Vulnerability to Optimize Datacenter
Cost via Heterogeneous-Reliability Memory"
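The core placement idea of HRM can be sketched in a few lines: data an application marks as error-tolerant is placed in cheaper, less-reliable memory, while error-vulnerable data keeps fully reliable memory. The cost figures and tier names below are made up purely for illustration.

```python
# Sketch of heterogeneous-reliability memory placement. The per-GB cost
# numbers and the binary tolerant/vulnerable labeling are illustrative
# assumptions, not figures from the paper.
COST_PER_GB = {"reliable": 10.0, "less_reliable": 7.0}  # hypothetical

def place(allocations):
    """allocations: list of (size_gb, tolerant_flag) pairs.
    Returns (per-allocation tier list, total memory cost)."""
    placement, cost = [], 0.0
    for size_gb, tolerant in allocations:
        # Error-tolerant data can trade reliability for cost; vulnerable
        # data stays on the fully reliable (e.g., ECC-protected) tier.
        tier = "less_reliable" if tolerant else "reliable"
        placement.append(tier)
        cost += size_gb * COST_PER_GB[tier]
    return placement, cost
```

The cost saving comes directly from the paper's key insight: the larger the fraction of application data that tolerates errors, the more of the memory budget can be provisioned at the cheaper reliability level.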
A Survey on Tiering and Caching in High-Performance Storage Systems
Although each storage technology invented so far has been a big step toward
perfection, none of them is flawless. Essential data-store requirements such as
performance, availability, and recoverability have yet to be met together in a
single, economically affordable medium. One of the most influential factors is
price, so there has always been a trade-off between the desired set of storage
properties and their cost. To address this issue, a network of various types of
storage media is used to deliver the high performance of expensive devices,
such as solid-state drives and non-volatile memories, along with the high
capacity of inexpensive ones, like hard disk drives. In software, caching and
tiering are long-established concepts for handling file operations, moving data
automatically within such a storage network, and managing data backup in
low-cost media. Intelligently moving data across the different devices based on
need is the key insight in this domain. In this survey, we discuss recent
research on improving high-performance storage systems with caching and tiering
techniques.
Comment: Ph.D. Research Exam Repor
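The tiering idea the survey covers can be sketched as a simple hotness-driven policy: track per-block access counts and keep the most frequently accessed blocks on the fast tier (e.g., SSD), leaving the rest on the capacity tier (e.g., HDD). The capacity model and policy below are illustrative assumptions, not a policy from any surveyed system.

```python
# Sketch of a frequency-based tiering policy. Tier names and the
# access-count heuristic are illustrative; real systems also consider
# recency, migration cost, and I/O size.
from collections import Counter

class TieringPolicy:
    def __init__(self, fast_tier_blocks):
        self.capacity = fast_tier_blocks  # blocks that fit on the fast tier
        self.heat = Counter()             # block id -> access count

    def record_access(self, block):
        self.heat[block] += 1

    def fast_tier(self):
        """Blocks that currently belong on the fast tier: the most
        frequently accessed ones, up to the fast tier's capacity."""
        return {b for b, _ in self.heat.most_common(self.capacity)}
```

Caching differs from tiering mainly in that a cached block also keeps a copy on the capacity tier, whereas a tiered block has a single home that migrates.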
Revitalizing Copybacks in Modern SSDs: Why and How
For modern flash-based SSDs, the performance overhead of internal data
migrations is dominated by the data transfer time, not by the flash program
time as in old SSDs. In order to mitigate the performance impact of data
migrations, we propose rCopyback, a restricted version of copyback. rCopyback
works like the original copyback except that only n consecutive copybacks are
allowed. By limiting the number of successive copybacks, it guarantees that no
data reliability problem occurs when data is internally migrated using
rCopyback. In order to take full advantage of rCopyback, we developed an
rCopyback-aware FTL, rcFTL, which intelligently decides whether rCopyback
should be used or not by exploiting varying host workloads. Our evaluation
results show that rcFTL can improve the overall I/O throughput by 54% on
average over an existing FTL which does not use copybacks.
Comment: 5 pages, 6 figure
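The restriction rCopyback imposes can be sketched as a per-page counter: a page may migrate via copyback (no off-chip transfer) only a bounded number of consecutive times; once the budget is exhausted, the next migration goes through the controller, which re-applies ECC and resets the count. The limit value below stands in for the paper's n and is made up here.

```python
# Sketch of the consecutive-copyback limit ("n" in the abstract).
# The concrete value is a hypothetical placeholder.
MAX_CONSECUTIVE_COPYBACKS = 3

def migrate(page_copyback_count):
    """Decide how to migrate a page given how many consecutive
    copybacks it has already undergone. Returns (method, new_count)."""
    if page_copyback_count < MAX_CONSECUTIVE_COPYBACKS:
        # Within the copyback budget: migrate on-chip, skipping the
        # costly data transfer to and from the controller.
        return "copyback", page_copyback_count + 1
    # Budget exhausted: read the page out through the controller so ECC
    # can correct accumulated errors, then the counter starts over.
    return "read-modify-write", 0
```

Bounding consecutive copybacks is what keeps accumulated raw bit errors within the ECC's correction capability, which is the reliability guarantee the abstract refers to.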
Architectural Techniques for Improving NAND Flash Memory Reliability
Raw bit errors are common in NAND flash memory and will increase in the
future. These errors reduce flash reliability and limit the lifetime of a flash
memory device. We aim to improve flash reliability with a multitude of low-cost
architectural techniques. We show that NAND flash memory reliability can be
improved at low cost and with low performance overhead by deploying various
architectural techniques that are aware of higher-level application behavior
and underlying flash device characteristics.
We analyze flash error characteristics and workload behavior through
experimental characterization, and design new flash controller algorithms that
use the insights gained from our analysis to improve flash reliability at a low
cost. We investigate four directions through this approach. (1) We propose a
new technique called WARM that improves flash reliability by 12.9 times by
managing flash retention differently for write-hot data and write-cold data.
(2) We propose a new framework that learns an online flash channel model for
each chip and enables four new flash controller algorithms to improve flash
reliability by up to 69.9%. (3) We identify three new error characteristics in
3D NAND through a comprehensive experimental characterization of real 3D NAND
chips, and propose four new techniques that mitigate these new errors and
improve 3D NAND reliability by up to 66.9%. (4) We propose a new technique
called HeatWatch that improves 3D NAND reliability by 3.85 times by utilizing
the self-healing effect to mitigate retention errors in 3D NAND.
Comment: Thesis, Carnegie Mellon University (2018
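Direction (1) above, managing retention differently for write-hot and write-cold data, can be sketched as a hotness-based policy decision. The threshold and interval values below are illustrative assumptions, not parameters from WARM.

```python
# Sketch of write-hotness-aware retention management in the spirit of
# WARM. All numeric values are hypothetical placeholders.
WRITE_HOT_THRESHOLD = 10      # writes per window considered "hot"
RELAXED_INTERVAL_DAYS = 28    # relaxed interval for write-hot data
STANDARD_INTERVAL_DAYS = 7    # conservative interval for write-cold data

def retention_policy(writes_in_window):
    """Pick a retention-management interval based on write hotness.
    Write-hot pages are rewritten soon anyway, so retention errors have
    little time to accumulate and handling can be relaxed; write-cold
    pages keep the conservative policy."""
    if writes_in_window >= WRITE_HOT_THRESHOLD:
        return "write-hot", RELAXED_INTERVAL_DAYS
    return "write-cold", STANDARD_INTERVAL_DAYS
```

Separating data by write hotness lets the controller spend retention-mitigation effort (e.g., periodic remapping or stronger ECC) only where it is actually needed, which is the source of the reliability gain.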
Recent Advances in DRAM and Flash Memory Architectures
This article features extended summaries and retrospectives of some of the
recent research done by our group, SAFARI, on (1) understanding,
characterizing, and modeling various critical properties of modern DRAM and
NAND flash memory, the dominant memory and storage technologies, respectively;
and (2) several new mechanisms we have proposed based on our observations from
these analyses, characterization, and modeling, to tackle various key
challenges in memory and storage scaling. In order to understand the sources of
various bottlenecks of the dominant memory and storage technologies, these
works perform rigorous studies of device-level and application-level behavior,
using a combination of detailed simulation and experimental characterization of
real memory and storage devices.
Comment: arXiv admin note: substantial text overlap with arXiv:1805.0640
Large Scale Studies of Memory, Storage, and Network Failures in a Modern Data Center
The workloads running in the modern data centers of large scale Internet
service providers (such as Amazon, Baidu, Facebook, Google, and Microsoft)
support billions of users and span globally distributed infrastructure. Yet,
the devices used in modern data centers fail due to a variety of causes, from
faulty components to bugs to misconfiguration. Faulty devices make operating
large scale data centers challenging because the workloads running in modern
data centers consist of interdependent programs distributed across many
servers, so failures that are isolated to a single device can still have a
widespread effect on a workload. In this dissertation, we measure and model the
device failures in a large scale Internet service company, Facebook. We focus
on three device types that form the foundation of Internet service data center
infrastructure: DRAM for main memory, SSDs for persistent storage, and switches
and backbone links for network connectivity. For each of these device types, we
analyze long term device failure data broken down by important device
attributes and operating conditions, such as age, vendor, and workload. We also
build and release statistical models to examine the failure trends for the
devices we analyze. Our key conclusion in this dissertation is that we can gain
a deep understanding of why devices fail---and how to predict their
failure---using measurement and modeling. We hope that the analysis,
techniques, and models we present in this dissertation will enable the
community to better measure, understand, and prepare for the hardware
reliability challenges we face in the future.
Comment: PhD thesis, CMU (Dec 2018