FMMU: A Hardware-Automated Flash Map Management Unit for Scalable Performance of NAND Flash-Based SSDs
NAND flash-based Solid State Drives (SSDs), which are widely used from
embedded systems to enterprise servers, improve performance by exploiting
the parallelism of NAND flash memories. To keep pace with this performance
growth, storage systems have rapidly migrated the SSD host interface from
Serial ATA, inherited from hard disk drives, to high-speed PCI Express.
Since NAND flash memory does not allow in-place updates, it requires special
software called the Flash Translation Layer (FTL), and SSDs are equipped with
embedded processors to run it. Existing SSDs raise the clock frequency of the
embedded processors or increase their number to keep the FTL from becoming
the bottleneck of SSD performance, but these approaches are not scalable.
This paper proposes a hardware-automated Flash Map Management Unit, called
FMMU, that handles the address translation process, which dominates the
execution time of the FTL. FMMU provides methods for exploiting the
parallelism of flash memory by processing outstanding requests in a
non-blocking manner while reducing the number of flash operations. The
experimental results show that FMMU reduces the FTL execution time in the map
cache hit case and the miss case by 44% and 37%, respectively, compared with
the existing software-based approach running on four cores. FMMU also keeps
the FTL from becoming a performance bottleneck for up to a 32-channel, 8-way
SSD using a PCIe 3.0 x32 host interface.
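As a rough illustration of the translation step that FMMU moves into hardware, the minimal C sketch below models page-level address translation through a small map cache in front of the full mapping table; a hit avoids the flash map-page read that dominates the miss path. The names, sizes, and direct-mapped organization (translate, MAP_CACHE_ENTRIES) are assumptions for illustration, not the paper's design.

/* Hypothetical sketch of page-level FTL address translation with a map
 * cache, the step FMMU automates in hardware. Not the paper's design. */
#include <stdint.h>
#include <stdio.h>

#define MAP_CACHE_ENTRIES 256          /* illustrative cache size          */

typedef struct {
    uint32_t lpn;                      /* logical page number (tag)        */
    uint32_t ppn;                      /* physical page number             */
    int      valid;
} map_entry_t;

static map_entry_t map_cache[MAP_CACHE_ENTRIES];   /* direct-mapped       */
static uint32_t    full_map[1 << 20];  /* full map; on flash in a real SSD */

/* Translate an LPN: a map cache hit avoids reading a map page from flash. */
static uint32_t translate(uint32_t lpn, int *hit)
{
    map_entry_t *e = &map_cache[lpn % MAP_CACHE_ENTRIES];
    if (e->valid && e->lpn == lpn) {   /* map cache hit                    */
        *hit = 1;
        return e->ppn;
    }
    *hit = 0;                          /* miss: fetch mapping (a flash read
                                          in a real SSD), then install it  */
    e->lpn = lpn;
    e->ppn = full_map[lpn];
    e->valid = 1;
    return e->ppn;
}

int main(void)
{
    for (uint32_t i = 0; i < (1u << 20); i++)
        full_map[i] = i + 1000;        /* dummy mapping                    */

    int hit;
    uint32_t ppn = translate(42, &hit);        /* miss, installs entry     */
    printf("ppn=%u hit=%d\n", ppn, hit);
    ppn = translate(42, &hit);                 /* hit                      */
    printf("ppn=%u hit=%d\n", ppn, hit);
    return 0;
}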
Moving Processing to Data: On the Influence of Processing in Memory on Data Management
Near-Data Processing refers to an architectural hardware and software
paradigm based on the co-location of storage and compute units. Ideally, it
allows application-defined data- or compute-intensive operations to execute
in situ, i.e., within (or close to) the physical data storage. Near-Data
Processing thereby seeks to minimize expensive data movement, improving
performance, scalability, and resource efficiency. Processing-in-Memory is a
subclass of Near-Data Processing that targets data processing directly within
memory (DRAM) chips. Effective use of Near-Data Processing mandates new
architectures, algorithms, interfaces, and development toolchains.
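As a purely illustrative contrast, and not an interface from the literature, the sketch below compares a host-side filter, which moves every element across the memory interface, with a hypothetical ndp_filter call whose traffic would scale with the result size rather than the input size; the in-situ execution is only modeled here.

/* Illustrative contrast between a host-side scan and a hypothetical
 * near-data filter; ndp_filter() is an invented interface, not a real API. */
#include <stdint.h>
#include <stdio.h>

/* Host-side scan: every element crosses the memory interface. */
static size_t host_filter(const int32_t *data, size_t n,
                          int32_t threshold, int32_t *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] > threshold)
            out[k++] = data[i];
    return k;
}

/* Hypothetical NDP offload: conceptually, only the predicate travels to the
 * device and only matches travel back. In-situ execution is merely modeled
 * here by delegating to the host loop. */
static size_t ndp_filter(const int32_t *data, size_t n,
                         int32_t threshold, int32_t *out)
{
    return host_filter(data, n, threshold, out);
}

int main(void)
{
    int32_t data[] = {5, 42, 7, 99, 1};
    int32_t out[5];
    size_t k = ndp_filter(data, 5, 10, out);
    printf("%zu matches, first = %d\n", k, out[0]);  /* 2 matches: 42, 99 */
    return 0;
}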
Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips
This article summarizes key results of our work on experimental
characterization and analysis of latency variation and latency-reliability
trade-offs in modern DRAM chips, which was published in SIGMETRICS 2016, and
examines the work's significance and future potential.
The goal of this work is to (i) experimentally characterize and understand
the latency variation across cells within a DRAM chip for the three
fundamental DRAM operations (activation, precharge, and restoration), and
(ii) develop new mechanisms that exploit our understanding of the latency
variation to reliably improve performance. To this end, we comprehensively
characterize 240 DRAM chips from three major vendors, and make six major new
observations about latency variation within DRAM.
Notably, we find that (i) there is large latency variation across the cells for
each of the three operations; (ii) variation characteristics exhibit
significant spatial locality: slower cells are clustered in certain regions of
a DRAM chip; and (iii) the three fundamental operations exhibit different
reliability characteristics when the latency of each operation is reduced.
Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a
mechanism that exploits latency variation across DRAM cells within a DRAM chip
to improve system performance. The key idea of FLY-DRAM is to exploit the
spatial locality of slower cells within DRAM, and access the faster DRAM
regions with reduced latencies for the fundamental operations. Our evaluations
show that FLY-DRAM improves the performance of a wide range of applications by
13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors'
real DRAM chips, in a simulated 8-core system.
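To make the FLY-DRAM idea concrete, here is a minimal sketch of a memory controller choosing timing parameters from a per-region latency profile; the region granularity, timing values, and structures are all illustrative assumptions, not the paper's exact design.

/* Sketch of FLY-DRAM's key idea: keep a per-region latency profile and use
 * reduced timing parameters only for regions profiled as fast. All values
 * and structures are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define NUM_REGIONS 64                 /* illustrative region granularity  */

typedef struct { uint8_t trcd_cycles, trp_cycles; } timing_t;

static const timing_t FAST = { 8, 8 };     /* reduced timings (assumed)    */
static const timing_t SLOW = { 11, 11 };   /* standard timings (assumed)   */

static int region_is_fast[NUM_REGIONS];    /* filled by offline profiling  */

/* The controller picks timings based on which region a row falls in. */
static timing_t timings_for_row(uint32_t row)
{
    uint32_t region = row % NUM_REGIONS;   /* toy region mapping           */
    return region_is_fast[region] ? FAST : SLOW;
}

int main(void)
{
    region_is_fast[3] = 1;                 /* pretend profiling found one  */
    timing_t t = timings_for_row(3);
    printf("row 3: tRCD=%u tRP=%u\n", t.trcd_cycles, t.trp_cycles);
    t = timings_for_row(4);
    printf("row 4: tRCD=%u tRP=%u\n", t.trcd_cycles, t.trp_cycles);
    return 0;
}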
Reducing DRAM Refresh Overheads with Refresh-Access Parallelism
This article summarizes the idea of "refresh-access parallelism," which was
published in HPCA 2014, and examines the work's significance and future
potential. The overarching objective of our HPCA 2014 paper is to reduce the
significant negative performance impact of DRAM refresh with intelligent memory
controller mechanisms.
To mitigate the negative performance impact of DRAM refresh, our HPCA 2014
paper proposes two complementary mechanisms, DARP (Dynamic Access Refresh
Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal
is to address the drawbacks of the state-of-the-art per-bank refresh
mechanism by building more efficient techniques to parallelize refreshes and
accesses within DRAM. First, instead of issuing per-bank refreshes in
round-robin order, as is done today, DARP issues per-bank refreshes to idle
banks in an out-of-order manner. Furthermore, DARP proactively schedules
refreshes during intervals when a batch of writes is draining to DRAM.
the existence of mostly-independent subarrays within a bank. With minor
modifications to DRAM organization, it allows a bank to serve memory accesses
to an idle subarray while another subarray is being refreshed. Our extensive
evaluations on a wide variety of workloads and systems show that our mechanisms
improve system performance (and energy efficiency) compared to three
state-of-the-art refresh policies, and their performance benefits increase as
DRAM density increases.
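A minimal sketch of DARP's first component, written under assumed structures rather than the paper's exact implementation: rather than refreshing banks in round-robin order, the scheduler selects an idle, not-yet-refreshed bank so demand requests to busy banks are not blocked.

/* Sketch of out-of-order per-bank refresh (DARP's first idea); bank-state
 * tracking is simplified for illustration. */
#include <stdio.h>

#define NUM_BANKS 8

static int bank_busy[NUM_BANKS];     /* set while a bank serves reads/writes */
static int refresh_done[NUM_BANKS];  /* per-bank refresh issued this period  */

/* Pick a bank to refresh: prefer an idle, not-yet-refreshed bank. Returns
 * -1 if every unrefreshed bank is busy (the scheduler retries later). */
static int pick_refresh_bank(void)
{
    for (int b = 0; b < NUM_BANKS; b++)
        if (!bank_busy[b] && !refresh_done[b])
            return b;
    return -1;
}

int main(void)
{
    bank_busy[0] = bank_busy[1] = 1; /* banks 0, 1 serving demand requests */
    int b = pick_refresh_bank();
    printf("refresh bank %d while banks 0 and 1 stay available\n", b);
    return 0;
}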
Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency
This paper summarizes the idea of ChargeCache, which was published in HPCA
2016 [51], and examines the work's significance and future potential. DRAM
latency continues to be a critical bottleneck for system performance. In this
work, we develop a low-cost mechanism, called ChargeCache, that enables faster
access to recently-accessed rows in DRAM, with no modifications to DRAM chips.
Our mechanism is based on the key observation that a recently-accessed row has
more charge and thus the following access to the same row can be performed
faster. To exploit this observation, we propose to track the addresses of
recently-accessed rows in a table in the memory controller. If a later DRAM
request hits in that table, the memory controller uses lower timing parameters,
leading to reduced DRAM latency. Row addresses are removed from the table after
a specified duration to ensure rows that have leaked too much charge are not
accessed with lower latency. We evaluate ChargeCache on a wide variety of
workloads and show that it provides significant performance and energy benefits
for both single-core and multi-core systems.
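The table-based mechanism the abstract describes lends itself to a short sketch. Below, a small controller-side table tracks recently-accessed row addresses with an expiry window; a hit means the row is still highly charged and can, in the real mechanism, be accessed with lowered timing parameters. Table size, lifetime, and names are illustrative assumptions.

/* Sketch of a ChargeCache-style table: hits within the lifetime window
 * allow reduced DRAM timings. Sizes and the lifetime value are invented. */
#include <stdint.h>
#include <stdio.h>

#define CC_ENTRIES   128
#define CC_LIFETIME  1000000ULL        /* cycles before charge decays too far */

typedef struct { uint64_t row; uint64_t inserted_at; int valid; } cc_entry_t;
static cc_entry_t cc[CC_ENTRIES];

/* Returns 1 if the row was accessed recently enough to use reduced
 * timings; every access also (re)inserts the row, since an access
 * restores the row's charge. */
static int chargecache_access(uint64_t row, uint64_t now)
{
    cc_entry_t *e = &cc[row % CC_ENTRIES];
    int hit = e->valid && e->row == row &&
              (now - e->inserted_at) < CC_LIFETIME;
    e->row = row;
    e->inserted_at = now;
    e->valid = 1;
    return hit;
}

int main(void)
{
    printf("%d\n", chargecache_access(7, 0));       /* 0: first touch      */
    printf("%d\n", chargecache_access(7, 500));     /* 1: recent, go fast  */
    printf("%d\n", chargecache_access(7, 5000000)); /* 0: entry expired    */
    return 0;
}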
Tiered-Latency DRAM (TL-DRAM)
This paper summarizes the idea of Tiered-Latency DRAM, which was published in
HPCA 2013. The key goal of TL-DRAM is to provide low DRAM latency at low cost,
a critical problem in modern memory systems. To this end, TL-DRAM introduces
heterogeneity into the design of a DRAM subarray by segmenting the bitlines,
thereby creating a low-latency, low-energy, low-capacity portion in the
subarray (called the near segment), which is close to the sense amplifiers, and
a high-latency, high-energy, high-capacity portion, which is farther away from
the sense amplifiers. Thus, DRAM becomes heterogeneous with a small portion
having lower latency and a large portion having higher latency. Various
techniques can be employed to take advantage of the low-latency near segment
and this new heterogeneous DRAM substrate, including hardware-based caching,
software-based caching, and memory allocation of frequently used data in the
near segment. Evaluations with such simple techniques show significant
performance and energy-efficiency benefits.
This is a summary of the original paper, entitled "Tiered-Latency DRAM: A Low
Latency and Low Cost DRAM Architecture", which appears in HPCA 2013.
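One use of the near segment mentioned above is hardware-based caching of hot rows. The sketch below shows one plausible promotion policy under invented thresholds and structures; it is not the paper's exact policy.

/* Sketch of hardware-managed caching of hot rows into TL-DRAM's
 * low-latency near segment; thresholds and sizes are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define NEAR_ROWS   32                 /* small low-latency segment        */
#define HOT_THRESH  4                  /* accesses before promotion        */

static uint32_t near_seg[NEAR_ROWS];   /* far-segment rows cached near     */
static int      near_used;
static uint32_t access_count[1024];    /* per-row counters (toy scale)     */

/* On each activation, count accesses and promote hot rows while space
 * remains. Returns 1 if the row is served from the near (fast) segment. */
static int activate_row(uint32_t row)
{
    for (int i = 0; i < near_used; i++)
        if (near_seg[i] == row)
            return 1;                  /* fast: near-segment hit           */
    if (++access_count[row] >= HOT_THRESH && near_used < NEAR_ROWS)
        near_seg[near_used++] = row;   /* promote hot row                  */
    return 0;                          /* slow: far segment                */
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        printf("access %d -> %s\n", i, activate_row(42) ? "near" : "far");
    return 0;
}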
Voltron: Understanding and Exploiting the Voltage-Latency-Reliability Trade-Offs in Modern DRAM Chips to Improve Energy Efficiency
This paper summarizes our work on experimental characterization and analysis
of reduced-voltage operation in modern DRAM chips, which was published in
SIGMETRICS 2017, and examines the work's significance and future potential.
We take a comprehensive approach to understanding and exploiting the latency
and reliability characteristics of modern DRAM when the DRAM supply voltage is
lowered below the nominal voltage level specified by DRAM standards. We perform
an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured
recently by three major DRAM vendors. We find that reducing the supply voltage
below a certain point introduces bit errors in the data, and we comprehensively
characterize the behavior of these errors. We discover that these errors can be
avoided by increasing the latency of three major DRAM operations (activation,
restoration, and precharge). We perform detailed DRAM circuit simulations to
validate and explain our experimental findings. We also characterize the
various relationships between reduced supply voltage and error locations,
stored data patterns, DRAM temperature, and data retention.
Based on our observations, we propose a new DRAM energy reduction mechanism,
called Voltron. The key idea of Voltron is to use a performance model to
determine by how much we can reduce the supply voltage without introducing
errors and without exceeding a user-specified threshold for performance loss.
Our evaluations show that Voltron reduces the average DRAM and system energy
consumption by 10.5% and 7.3%, respectively, while limiting the average system
performance loss to only 1.8%, for a variety of memory-intensive quad-core
workloads. We also show that Voltron significantly outperforms prior dynamic
voltage and frequency scaling mechanisms for DRAM.
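Voltron's selection step can be sketched as a simple search: among candidate supply voltages, pick the lowest whose model-predicted performance loss stays under the user's threshold. The voltage levels, the loss numbers, and the table-based "model" below are invented placeholders, not measurements or the paper's actual performance model.

/* Sketch of Voltron-style voltage selection against a performance-loss
 * budget; all constants are fake and the model is a stand-in. */
#include <stdio.h>

#define NUM_LEVELS 5

/* Candidate voltages (V), ordered high to low, and the performance loss
 * (%) the model predicts once latencies are raised enough to stay
 * error-free at that voltage. */
static const double volts[NUM_LEVELS]     = {1.35, 1.30, 1.25, 1.20, 1.15};
static const double perf_loss[NUM_LEVELS] = {0.0,  0.5,  1.2,  2.5,  6.0};

/* Return the lowest voltage whose predicted loss <= threshold_pct. */
static double voltron_select(double threshold_pct)
{
    double best = volts[0];
    for (int i = 0; i < NUM_LEVELS; i++)
        if (perf_loss[i] <= threshold_pct)
            best = volts[i];           /* arrays ordered high -> low V     */
    return best;
}

int main(void)
{
    printf("chosen Vdd = %.2f V\n", voltron_select(2.0));  /* 1.25 V */
    return 0;
}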
Adaptive-Latency DRAM (AL-DRAM)
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was
published in HPCA 2015. The key goal of AL-DRAM is to exploit the extra margin
that is built into the DRAM timing parameters to reduce DRAM latency. The key
observation is that the timing parameters are dictated by the worst-case
temperatures and worst-case DRAM cells, both of which lead to a small amount
of charge storage and hence high access latency. One can therefore reduce
latency
by adapting the timing parameters to the current operating temperature and the
current DIMM that is being accessed. Using an FPGA-based testing platform, our
work first characterizes the extra margin for 115 DRAM modules from three major
manufacturers. The experimental results demonstrate that it is possible to
reduce four of the most critical timing parameters by a minimum/maximum of
17.3%/54.8% at 55°C while maintaining reliable operation. AL-DRAM adaptively
selects between multiple different timing parameters for each DRAM module based
on its current operating condition. AL-DRAM does not require any changes to the
DRAM chip or its interface; it only requires multiple different timing
parameters to be specified and supported by the memory controller. Real system
evaluations show that AL-DRAM improves the performance of memory-intensive
workloads by an average of 14% without introducing any errors.
This is a summary of the original paper, entitled "Adaptive-Latency DRAM:
Optimizing DRAM Timing for the Common-Case", which appears in HPCA 2015.
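Since AL-DRAM only changes which timing parameters the memory controller applies, the mechanism reduces to a lookup keyed by operating condition. The sketch below uses two invented timing sets and a single temperature threshold; real deployments would hold per-module values obtained from characterization.

/* Sketch of AL-DRAM's controller-side selection: per-module timing sets
 * indexed by the current DIMM temperature. All numbers are illustrative. */
#include <stdio.h>

typedef struct { int trcd, tras, trp, twr; } timings_t;    /* in cycles */

/* Reduced timings for typical temperatures; standard worst-case timings
 * above the threshold. */
static const timings_t REDUCED  = { 9, 28,  9, 10 };
static const timings_t STANDARD = {11, 35, 11, 12 };
#define TEMP_THRESHOLD_C 55

/* Re-selected as the temperature reported by the DIMM sensor changes. */
static timings_t al_dram_timings(int dimm_temp_c)
{
    return (dimm_temp_c <= TEMP_THRESHOLD_C) ? REDUCED : STANDARD;
}

int main(void)
{
    timings_t t = al_dram_timings(45);
    printf("45C: tRCD=%d tRAS=%d\n", t.trcd, t.tras);   /* reduced  */
    t = al_dram_timings(70);
    printf("70C: tRCD=%d tRAS=%d\n", t.trcd, t.tras);   /* standard */
    return 0;
}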
Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery
NAND flash memory is ubiquitous in everyday life today because its capacity
has continuously increased and cost has continuously decreased over decades.
This positive growth is a result of two key trends: (1) effective process
technology scaling; and (2) multi-level (e.g., MLC, TLC) cell data coding.
Unfortunately, the reliability of raw data stored in flash memory has also
continued to become more difficult to ensure, because these two trends lead to
(1) fewer electrons in the flash memory cell floating gate to represent the
data; and (2) larger cell-to-cell interference and disturbance effects. Without
mitigation, worsening reliability can reduce the lifetime of NAND flash memory.
As a result, flash memory controllers in solid-state drives (SSDs) have become
much more sophisticated: they incorporate many effective techniques to ensure
the correct interpretation of noisy data stored in flash memory cells.
In this chapter, we review recent advances in SSD error characterization,
mitigation, and data recovery techniques for reliability and lifetime
improvement. We provide rigorous experimental data from state-of-the-art MLC
and TLC NAND flash devices on various types of flash memory errors, to motivate
the need for such techniques. Based on the understanding developed by the
experimental characterization, we describe several mitigation and recovery
techniques, including (1) cell-to-cell interference mitigation; (2) optimal
multi-level cell sensing; (3) error correction using state-of-the-art
algorithms and methods; and (4) data recovery when error correction fails. We
quantify the reliability improvement provided by each of these techniques.
Looking forward, we briefly discuss how flash memory and these techniques
could evolve in the future.
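One recovery technique in this space is read-retry: if ECC decoding fails at the default read-reference voltage, the controller re-reads the page at shifted reference voltages until decoding succeeds. The sketch below shows the control flow only; the flash_read_raw and ecc_decode calls are stand-ins, since real controllers use vendor-specific commands and decoders.

/* Sketch of a read-retry style recovery flow; device and ECC calls are
 * hypothetical stand-ins for vendor-specific interfaces. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   16384
#define MAX_RETRIES 8

/* Stand-in for a raw page read at a given read-reference offset. For this
 * demo, only offset +2 yields a decodable result. */
static int flash_read_raw(uint8_t *buf, int vref_offset)
{
    (void)buf;
    return vref_offset;
}

/* Stand-in ECC decode: succeeds only when errors are few enough. */
static int ecc_decode(uint8_t *buf, int raw_state)
{
    (void)buf;
    return raw_state == 2;             /* pretend offset +2 decodes        */
}

static int read_page_with_retry(uint8_t *buf)
{
    /* Try the default Vref first, then increasingly shifted references. */
    static const int offsets[MAX_RETRIES] = {0, 1, -1, 2, -2, 3, -3, 4};
    for (int i = 0; i < MAX_RETRIES; i++) {
        int raw = flash_read_raw(buf, offsets[i]);
        if (ecc_decode(buf, raw)) {
            printf("decoded at Vref offset %d after %d tries\n",
                   offsets[i], i + 1);
            return 0;
        }
    }
    return -1;                         /* escalate to higher-level recovery */
}

int main(void)
{
    uint8_t page[PAGE_SIZE];
    return read_page_with_retry(page);
}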
Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disks
Resource utilization is one of the emerging problems in many-chip SSDs. In
this paper, we propose Sprinkler, a novel device-level SSD controller, which
targets maximizing resource utilization and achieving high performance without
additional NAND flash chips. Specifically, Sprinkler relaxes parallelism
dependency by scheduling I/O requests based on internal resource layout rather
than the order imposed by the device-level queue. In addition, Sprinkler
improves flash-level parallelism and reduces the number of transactions
(i.e., improves transactional locality) by over-committing flash memory
requests to
specific resources. Our extensive experimental evaluation using a
cycle-accurate large-scale SSD simulation framework shows that a many-chip SSD
equipped with Sprinkler provides at least 56.6% shorter latency and 1.8x to
2.2x higher throughput than state-of-the-art SSD controllers. Further, it
improves overall resource utilization by 68.8% under different I/O request
patterns and provides, on average, 80.2% more flash-level parallelism by
halving the number of flash memory requests at runtime.
This paper was published at the 20th IEEE International Symposium on
High-Performance Computer Architecture (HPCA).
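Sprinkler's core scheduling move, dispatching by internal resource layout rather than queue order, can be sketched as a scan for the first request whose target resource is idle. The queue and layout structures below are invented simplifications of a real device-level controller.

/* Sketch of layout-aware dispatch: skip past queue-order requests whose
 * target channel is busy. Structures are illustrative. */
#include <stdio.h>

#define NUM_CHANNELS 4
#define QUEUE_DEPTH  8

typedef struct { int channel; int valid; } request_t;

static request_t queue[QUEUE_DEPTH];
static int channel_busy[NUM_CHANNELS];

/* FIFO order would stall on a busy channel; instead scan the queue for
 * the first request whose target channel is idle. */
static int pick_next_request(void)
{
    for (int i = 0; i < QUEUE_DEPTH; i++)
        if (queue[i].valid && !channel_busy[queue[i].channel])
            return i;
    return -1;
}

int main(void)
{
    queue[0] = (request_t){0, 1};      /* head targets busy channel 0      */
    queue[1] = (request_t){2, 1};      /* later request targets idle ch 2  */
    channel_busy[0] = 1;
    printf("dispatch queue slot %d\n", pick_next_request());  /* slot 1 */
    return 0;
}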