FMMU: A Hardware-Automated Flash Map Management Unit for Scalable Performance of NAND Flash-Based SSDs
NAND flash-based Solid State Drives (SSDs), which are widely used from
embedded systems to enterprise servers, improve performance by exploiting
the parallelism of NAND flash memories. To keep pace with this performance
growth, storage systems have rapidly migrated the SSD host interface from
Serial ATA, inherited from hard disk drives, to high-speed PCI Express.
Since NAND flash memory does not allow in-place updates, it requires special
software called the Flash Translation Layer (FTL), and SSDs are equipped with
embedded processors to run it. Existing SSDs raise the clock frequency of the
embedded processors or increase their number to keep the FTL from becoming
the bottleneck of SSD performance, but these approaches are not scalable.
This paper proposes a hardware-automated Flash Map Management Unit, called
FMMU, that handles the address translation process, which dominates the
execution time of the FTL. FMMU provides methods for exploiting the
parallelism of flash memory by processing outstanding requests in a
non-blocking manner while reducing the number of flash operations. The
experimental results show that FMMU reduces the FTL execution time in the map
cache hit case and the miss case by 44% and 37%, respectively, compared with
the existing software-based approach running on four cores. FMMU also keeps
the FTL from becoming a performance bottleneck for up to a 32-channel, 8-way
SSD using a PCIe 3.0 x32 host interface.
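As a rough illustration of the translation step that FMMU moves into hardware, the minimal C sketch below models page-level address translation through a small map cache in front of the full mapping table; a hit avoids the flash map-page read that dominates the miss path. The names, sizes, and direct-mapped organization (translate, MAP_CACHE_ENTRIES) are assumptions for illustration, not the paper's design.

/* Hypothetical sketch of page-level FTL address translation with a map
 * cache, the step FMMU automates in hardware. Not the paper's design. */
#include <stdint.h>
#include <stdio.h>

#define MAP_CACHE_ENTRIES 256          /* illustrative cache size          */

typedef struct {
    uint32_t lpn;                      /* logical page number (tag)        */
    uint32_t ppn;                      /* physical page number             */
    int      valid;
} map_entry_t;

static map_entry_t map_cache[MAP_CACHE_ENTRIES];   /* direct-mapped       */
static uint32_t    full_map[1 << 20];  /* full map; on flash in a real SSD */

/* Translate an LPN: a map cache hit avoids reading a map page from flash. */
static uint32_t translate(uint32_t lpn, int *hit)
{
    map_entry_t *e = &map_cache[lpn % MAP_CACHE_ENTRIES];
    if (e->valid && e->lpn == lpn) {   /* map cache hit                    */
        *hit = 1;
        return e->ppn;
    }
    *hit = 0;                          /* miss: fetch mapping (a flash read
                                          in a real SSD), then install it  */
    e->lpn = lpn;
    e->ppn = full_map[lpn];
    e->valid = 1;
    return e->ppn;
}

int main(void)
{
    for (uint32_t i = 0; i < (1u << 20); i++)
        full_map[i] = i + 1000;        /* dummy mapping                    */

    int hit;
    uint32_t ppn = translate(42, &hit);        /* miss, installs entry     */
    printf("ppn=%u hit=%d\n", ppn, hit);
    ppn = translate(42, &hit);                 /* hit                      */
    printf("ppn=%u hit=%d\n", ppn, hit);
    return 0;
}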
Moving Processing to Data: On the Influence of Processing in Memory on Data Management
Near-Data Processing refers to an architectural hardware and software
paradigm based on the co-location of storage and compute units. Ideally, it
allows application-defined data- or compute-intensive operations to execute
in situ, i.e., within (or close to) the physical data storage. Near-Data
Processing thereby seeks to minimize expensive data movement, improving
performance, scalability, and resource efficiency. Processing-in-Memory is a
subclass of Near-Data Processing that targets data processing directly within
memory (DRAM) chips. Effective use of Near-Data Processing mandates new
architectures, algorithms, interfaces, and development toolchains.
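As a purely illustrative contrast, and not an interface from the literature, the sketch below compares a host-side filter, which moves every element across the memory interface, with a hypothetical ndp_filter call whose traffic would scale with the result size rather than the input size; the in-situ execution is only modeled here.

/* Illustrative contrast between a host-side scan and a hypothetical
 * near-data filter; ndp_filter() is an invented interface, not a real API. */
#include <stdint.h>
#include <stdio.h>

/* Host-side scan: every element crosses the memory interface. */
static size_t host_filter(const int32_t *data, size_t n,
                          int32_t threshold, int32_t *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (data[i] > threshold)
            out[k++] = data[i];
    return k;
}

/* Hypothetical NDP offload: conceptually, only the predicate travels to the
 * device and only matches travel back. In-situ execution is merely modeled
 * here by delegating to the host loop. */
static size_t ndp_filter(const int32_t *data, size_t n,
                         int32_t threshold, int32_t *out)
{
    return host_filter(data, n, threshold, out);
}

int main(void)
{
    int32_t data[] = {5, 42, 7, 99, 1};
    int32_t out[5];
    size_t k = ndp_filter(data, 5, 10, out);
    printf("%zu matches, first = %d\n", k, out[0]);  /* 2 matches: 42, 99 */
    return 0;
}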
Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips
This article summarizes key results of our work on experimental
characterization and analysis of latency variation and latency-reliability
trade-offs in modern DRAM chips, which was published in SIGMETRICS 2016, and
examines the work's significance and future potential.
The goal of this work is to (i) experimentally characterize and understand
the latency variation across cells within a DRAM chip for the three
fundamental DRAM operations (activation, precharge, and restoration), and
(ii) develop new mechanisms that exploit our understanding of the latency
variation to reliably improve performance. To this end, we comprehensively
characterize 240 DRAM chips from three major vendors, and make six major new
observations about latency variation within DRAM.
Notably, we find that (i) there is large latency variation across the cells for
each of the three operations; (ii) variation characteristics exhibit
significant spatial locality: slower cells are clustered in certain regions of
a DRAM chip; and (iii) the three fundamental operations exhibit different
reliability characteristics when the latency of each operation is reduced.
Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a
mechanism that exploits latency variation across DRAM cells within a DRAM chip
to improve system performance. The key idea of FLY-DRAM is to exploit the
spatial locality of slower cells within DRAM, and access the faster DRAM
regions with reduced latencies for the fundamental operations. Our evaluations
show that FLY-DRAM improves the performance of a wide range of applications by
13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors'
real DRAM chips, in a simulated 8-core system.
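To make the FLY-DRAM idea concrete, here is a minimal sketch of a memory controller choosing timing parameters from a per-region latency profile; the region granularity, timing values, and structures are all illustrative assumptions, not the paper's exact design.

/* Sketch of FLY-DRAM's key idea: keep a per-region latency profile and use
 * reduced timing parameters only for regions profiled as fast. All values
 * and structures are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define NUM_REGIONS 64                 /* illustrative region granularity  */

typedef struct { uint8_t trcd_cycles, trp_cycles; } timing_t;

static const timing_t FAST = { 8, 8 };     /* reduced timings (assumed)    */
static const timing_t SLOW = { 11, 11 };   /* standard timings (assumed)   */

static int region_is_fast[NUM_REGIONS];    /* filled by offline profiling  */

/* The controller picks timings based on which region a row falls in. */
static timing_t timings_for_row(uint32_t row)
{
    uint32_t region = row % NUM_REGIONS;   /* toy region mapping           */
    return region_is_fast[region] ? FAST : SLOW;
}

int main(void)
{
    region_is_fast[3] = 1;                 /* pretend profiling found one  */
    timing_t t = timings_for_row(3);
    printf("row 3: tRCD=%u tRP=%u\n", t.trcd_cycles, t.trp_cycles);
    t = timings_for_row(4);
    printf("row 4: tRCD=%u tRP=%u\n", t.trcd_cycles, t.trp_cycles);
    return 0;
}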
Reducing DRAM Refresh Overheads with Refresh-Access Parallelism
This article summarizes the idea of "refresh-access parallelism," which was
published in HPCA 2014, and examines the work's significance and future
potential. The overarching objective of our HPCA 2014 paper is to reduce the
significant negative performance impact of DRAM refresh with intelligent memory
controller mechanisms.
To mitigate the negative performance impact of DRAM refresh, our HPCA 2014
paper proposes two complementary mechanisms, DARP (Dynamic Access Refresh
Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal
is to address the drawbacks of the state-of-the-art per-bank refresh
mechanism by building more efficient techniques to parallelize refreshes and
accesses within DRAM. First, instead of issuing per-bank refreshes in
round-robin order, as is done today, DARP issues per-bank refreshes to idle
banks in an out-of-order manner. Furthermore, DARP proactively schedules
refreshes during intervals when a batch of writes is draining to DRAM.
the existence of mostly-independent subarrays within a bank. With minor
modifications to DRAM organization, it allows a bank to serve memory accesses
to an idle subarray while another subarray is being refreshed. Our extensive
evaluations on a wide variety of workloads and systems show that our mechanisms
improve system performance (and energy efficiency) compared to three
state-of-the-art refresh policies, and their performance benefits increase as
DRAM density increases.
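A minimal sketch of DARP's first component, written under assumed structures rather than the paper's exact implementation: rather than refreshing banks in round-robin order, the scheduler selects an idle, not-yet-refreshed bank so demand requests to busy banks are not blocked.

/* Sketch of out-of-order per-bank refresh (DARP's first idea); bank-state
 * tracking is simplified for illustration. */
#include <stdio.h>

#define NUM_BANKS 8

static int bank_busy[NUM_BANKS];     /* set while a bank serves reads/writes */
static int refresh_done[NUM_BANKS];  /* per-bank refresh issued this period  */

/* Pick a bank to refresh: prefer an idle, not-yet-refreshed bank. Returns
 * -1 if every unrefreshed bank is busy (the scheduler retries later). */
static int pick_refresh_bank(void)
{
    for (int b = 0; b < NUM_BANKS; b++)
        if (!bank_busy[b] && !refresh_done[b])
            return b;
    return -1;
}

int main(void)
{
    bank_busy[0] = bank_busy[1] = 1; /* banks 0, 1 serving demand requests */
    int b = pick_refresh_bank();
    printf("refresh bank %d while banks 0 and 1 stay available\n", b);
    return 0;
}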
Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency
This paper summarizes the idea of ChargeCache, which was published in HPCA
2016 [51], and examines the work's significance and future potential. DRAM
latency continues to be a critical bottleneck for system performance. In this
work, we develop a low-cost mechanism, called ChargeCache, that enables faster
access to recently-accessed rows in DRAM, with no modifications to DRAM chips.
Our mechanism is based on the key observation that a recently-accessed row has
more charge and thus the following access to the same row can be performed
faster. To exploit this observation, we propose to track the addresses of
recently-accessed rows in a table in the memory controller. If a later DRAM
request hits in that table, the memory controller uses lower timing parameters,
leading to reduced DRAM latency. Row addresses are removed from the table after
a specified duration to ensure rows that have leaked too much charge are not
accessed with lower latency. We evaluate ChargeCache on a wide variety of
workloads and show that it provides significant performance and energy benefits
for both single-core and multi-core systems.
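The table-based mechanism the abstract describes lends itself to a short sketch. Below, a small controller-side table tracks recently-accessed row addresses with an expiry window; a hit means the row is still highly charged and can, in the real mechanism, be accessed with lowered timing parameters. Table size, lifetime, and names are illustrative assumptions.

/* Sketch of a ChargeCache-style table: hits within the lifetime window
 * allow reduced DRAM timings. Sizes and the lifetime value are invented. */
#include <stdint.h>
#include <stdio.h>

#define CC_ENTRIES   128
#define CC_LIFETIME  1000000ULL        /* cycles before charge decays too far */

typedef struct { uint64_t row; uint64_t inserted_at; int valid; } cc_entry_t;
static cc_entry_t cc[CC_ENTRIES];

/* Returns 1 if the row was accessed recently enough to use reduced
 * timings; every access also (re)inserts the row, since an access
 * restores the row's charge. */
static int chargecache_access(uint64_t row, uint64_t now)
{
    cc_entry_t *e = &cc[row % CC_ENTRIES];
    int hit = e->valid && e->row == row &&
              (now - e->inserted_at) < CC_LIFETIME;
    e->row = row;
    e->inserted_at = now;
    e->valid = 1;
    return hit;
}

int main(void)
{
    printf("%d\n", chargecache_access(7, 0));       /* 0: first touch      */
    printf("%d\n", chargecache_access(7, 500));     /* 1: recent, go fast  */
    printf("%d\n", chargecache_access(7, 5000000)); /* 0: entry expired    */
    return 0;
}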
Tiered-Latency DRAM (TL-DRAM)
This paper summarizes the idea of Tiered-Latency DRAM, which was published in
HPCA 2013. The key goal of TL-DRAM is to provide low DRAM latency at low cost,
a critical problem in modern memory systems. To this end, TL-DRAM introduces
heterogeneity into the design of a DRAM subarray by segmenting the bitlines,
thereby creating a low-latency, low-energy, low-capacity portion in the
subarray (called the near segment), which is close to the sense amplifiers, and
a high-latency, high-energy, high-capacity portion, which is farther away from
the sense amplifiers. Thus, DRAM becomes heterogeneous with a small portion
having lower latency and a large portion having higher latency. Various
techniques can be employed to take advantage of the low-latency near segment
and this new heterogeneous DRAM substrate, including hardware-based caching,
software-based caching, and memory allocation of frequently used data in the
near segment. Evaluations with such simple techniques show significant
performance and energy-efficiency benefits.
This is a summary of the original paper, entitled "Tiered-Latency DRAM: A Low
Latency and Low Cost DRAM Architecture", which appears in HPCA 2013.
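One use of the near segment mentioned above is hardware-based caching of hot rows. The sketch below shows one plausible promotion policy under invented thresholds and structures; it is not the paper's exact policy.

/* Sketch of hardware-managed caching of hot rows into TL-DRAM's
 * low-latency near segment; thresholds and sizes are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define NEAR_ROWS   32                 /* small low-latency segment        */
#define HOT_THRESH  4                  /* accesses before promotion        */

static uint32_t near_seg[NEAR_ROWS];   /* far-segment rows cached near     */
static int      near_used;
static uint32_t access_count[1024];    /* per-row counters (toy scale)     */

/* On each activation, count accesses and promote hot rows while space
 * remains. Returns 1 if the row is served from the near (fast) segment. */
static int activate_row(uint32_t row)
{
    for (int i = 0; i < near_used; i++)
        if (near_seg[i] == row)
            return 1;                  /* fast: near-segment hit           */
    if (++access_count[row] >= HOT_THRESH && near_used < NEAR_ROWS)
        near_seg[near_used++] = row;   /* promote hot row                  */
    return 0;                          /* slow: far segment                */
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        printf("access %d -> %s\n", i, activate_row(42) ? "near" : "far");
    return 0;
}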
Voltron: Understanding and Exploiting the Voltage-Latency-Reliability Trade-Offs in Modern DRAM Chips to Improve Energy Efficiency
This paper summarizes our work on experimental characterization and analysis
of reduced-voltage operation in modern DRAM chips, which was published in
SIGMETRICS 2017, and examines the work's significance and future potential.
We take a comprehensive approach to understanding and exploiting the latency
and reliability characteristics of modern DRAM when the DRAM supply voltage is
lowered below the nominal voltage level specified by DRAM standards. We perform
an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured
recently by three major DRAM vendors. We find that reducing the supply voltage
below a certain point introduces bit errors in the data, and we comprehensively
characterize the behavior of these errors. We discover that these errors can be
avoided by increasing the latency of three major DRAM operations (activation,
restoration, and precharge). We perform detailed DRAM circuit simulations to
validate and explain our experimental findings. We also characterize the
various relationships between reduced supply voltage and error locations,
stored data patterns, DRAM temperature, and data retention.
Based on our observations, we propose a new DRAM energy reduction mechanism,
called Voltron. The key idea of Voltron is to use a performance model to
determine by how much we can reduce the supply voltage without introducing
errors and without exceeding a user-specified threshold for performance loss.
Our evaluations show that Voltron reduces the average DRAM and system energy
consumption by 10.5% and 7.3%, respectively, while limiting the average system
performance loss to only 1.8%, for a variety of memory-intensive quad-core
workloads. We also show that Voltron significantly outperforms prior dynamic
voltage and frequency scaling mechanisms for DRAM.
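Voltron's selection step can be sketched as a simple search: among candidate supply voltages, pick the lowest whose model-predicted performance loss stays under the user's threshold. The voltage levels, the loss numbers, and the table-based "model" below are invented placeholders, not measurements or the paper's actual performance model.

/* Sketch of Voltron-style voltage selection against a performance-loss
 * budget; all constants are fake and the model is a stand-in. */
#include <stdio.h>

#define NUM_LEVELS 5

/* Candidate voltages (V), ordered high to low, and the performance loss
 * (%) the model predicts once latencies are raised enough to stay
 * error-free at that voltage. */
static const double volts[NUM_LEVELS]     = {1.35, 1.30, 1.25, 1.20, 1.15};
static const double perf_loss[NUM_LEVELS] = {0.0,  0.5,  1.2,  2.5,  6.0};

/* Return the lowest voltage whose predicted loss <= threshold_pct. */
static double voltron_select(double threshold_pct)
{
    double best = volts[0];
    for (int i = 0; i < NUM_LEVELS; i++)
        if (perf_loss[i] <= threshold_pct)
            best = volts[i];           /* arrays ordered high -> low V     */
    return best;
}

int main(void)
{
    printf("chosen Vdd = %.2f V\n", voltron_select(2.0));  /* 1.25 V */
    return 0;
}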
Adaptive-Latency DRAM (AL-DRAM)
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was
published in HPCA 2015. The key goal of AL-DRAM is to exploit the extra margin
that is built into the DRAM timing parameters to reduce DRAM latency. The key
observation is that the timing parameters are dictated by the worst-case
temperatures and worst-case DRAM cells, both of which lead to a small amount
of charge storage and hence high access latency. One can therefore reduce
latency
by adapting the timing parameters to the current operating temperature and the
current DIMM that is being accessed. Using an FPGA-based testing platform, our
work first characterizes the extra margin for 115 DRAM modules from three major
manufacturers. The experimental results demonstrate that it is possible to
reduce four of the most critical timing parameters by a minimum/maximum of
17.3%/54.8% at 55°C while maintaining reliable operation. AL-DRAM adaptively
selects between multiple different timing parameters for each DRAM module based
on its current operating condition. AL-DRAM does not require any changes to the
DRAM chip or its interface; it only requires multiple different timing
parameters to be specified and supported by the memory controller. Real system
evaluations show that AL-DRAM improves the performance of memory-intensive
workloads by an average of 14% without introducing any errors.
This is a summary of the original paper, entitled "Adaptive-Latency DRAM:
Optimizing DRAM Timing for the Common-Case", which appears in HPCA 2015.
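Since AL-DRAM only changes which timing parameters the memory controller applies, the mechanism reduces to a lookup keyed by operating condition. The sketch below uses two invented timing sets and a single temperature threshold; real deployments would hold per-module values obtained from characterization.

/* Sketch of AL-DRAM's controller-side selection: per-module timing sets
 * indexed by the current DIMM temperature. All numbers are illustrative. */
#include <stdio.h>

typedef struct { int trcd, tras, trp, twr; } timings_t;    /* in cycles */

/* Reduced timings for typical temperatures; standard worst-case timings
 * above the threshold. */
static const timings_t REDUCED  = { 9, 28,  9, 10 };
static const timings_t STANDARD = {11, 35, 11, 12 };
#define TEMP_THRESHOLD_C 55

/* Re-selected as the temperature reported by the DIMM sensor changes. */
static timings_t al_dram_timings(int dimm_temp_c)
{
    return (dimm_temp_c <= TEMP_THRESHOLD_C) ? REDUCED : STANDARD;
}

int main(void)
{
    timings_t t = al_dram_timings(45);
    printf("45C: tRCD=%d tRAS=%d\n", t.trcd, t.tras);   /* reduced  */
    t = al_dram_timings(70);
    printf("70C: tRCD=%d tRAS=%d\n", t.trcd, t.tras);   /* standard */
    return 0;
}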
Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery
NAND flash memory is ubiquitous in everyday life today because its capacity
has continuously increased and cost has continuously decreased over decades.
This positive growth is a result of two key trends: (1) effective process
technology scaling; and (2) multi-level (e.g., MLC, TLC) cell data coding.
Unfortunately, the reliability of raw data stored in flash memory has also
continued to become more difficult to ensure, because these two trends lead to
(1) fewer electrons in the flash memory cell floating gate to represent the
data; and (2) larger cell-to-cell interference and disturbance effects. Without
mitigation, worsening reliability can reduce the lifetime of NAND flash memory.
As a result, flash memory controllers in solid-state drives (SSDs) have become
much more sophisticated: they incorporate many effective techniques to ensure
the correct interpretation of noisy data stored in flash memory cells.
In this chapter, we review recent advances in SSD error characterization,
mitigation, and data recovery techniques for reliability and lifetime
improvement. We provide rigorous experimental data from state-of-the-art MLC
and TLC NAND flash devices on various types of flash memory errors, to motivate
the need for such techniques. Based on the understanding developed by the
experimental characterization, we describe several mitigation and recovery
techniques, including (1) cell-to-cell interference mitigation; (2) optimal
multi-level cell sensing; (3) error correction using state-of-the-art
algorithms and methods; and (4) data recovery when error correction fails. We
quantify the reliability improvement provided by each of these techniques.
Looking forward, we briefly discuss how flash memory and these techniques
could evolve in the future.
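One recovery technique in this space is read-retry: if ECC decoding fails at the default read-reference voltage, the controller re-reads the page at shifted reference voltages until decoding succeeds. The sketch below shows the control flow only; the flash_read_raw and ecc_decode calls are stand-ins, since real controllers use vendor-specific commands and decoders.

/* Sketch of a read-retry style recovery flow; device and ECC calls are
 * hypothetical stand-ins for vendor-specific interfaces. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE   16384
#define MAX_RETRIES 8

/* Stand-in for a raw page read at a given read-reference offset. For this
 * demo, only offset +2 yields a decodable result. */
static int flash_read_raw(uint8_t *buf, int vref_offset)
{
    (void)buf;
    return vref_offset;
}

/* Stand-in ECC decode: succeeds only when errors are few enough. */
static int ecc_decode(uint8_t *buf, int raw_state)
{
    (void)buf;
    return raw_state == 2;             /* pretend offset +2 decodes        */
}

static int read_page_with_retry(uint8_t *buf)
{
    /* Try the default Vref first, then increasingly shifted references. */
    static const int offsets[MAX_RETRIES] = {0, 1, -1, 2, -2, 3, -3, 4};
    for (int i = 0; i < MAX_RETRIES; i++) {
        int raw = flash_read_raw(buf, offsets[i]);
        if (ecc_decode(buf, raw)) {
            printf("decoded at Vref offset %d after %d tries\n",
                   offsets[i], i + 1);
            return 0;
        }
    }
    return -1;                         /* escalate to higher-level recovery */
}

int main(void)
{
    uint8_t page[PAGE_SIZE];
    return read_page_with_retry(page);
}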
Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disks
Resource utilization is one of the emerging problems in many-chip SSDs. In
this paper, we propose Sprinkler, a novel device-level SSD controller, which
targets maximizing resource utilization and achieving high performance without
additional NAND flash chips. Specifically, Sprinkler relaxes parallelism
dependency by scheduling I/O requests based on internal resource layout rather
than the order imposed by the device-level queue. In addition, Sprinkler
improves flash-level parallelism and reduces the number of transactions
(i.e., improves transactional locality) by over-committing flash memory
requests to
specific resources. Our extensive experimental evaluation using a
cycle-accurate large-scale SSD simulation framework shows that a many-chip SSD
equipped with Sprinkler provides at least 56.6% shorter latency and 1.8x to
2.2x higher throughput than state-of-the-art SSD controllers. Further, it
improves overall resource utilization by 68.8% under different I/O request
patterns and provides, on average, 80.2% more flash-level parallelism by
halving the number of flash memory requests at runtime.
This paper was published at the 20th IEEE International Symposium on
High-Performance Computer Architecture (HPCA).
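Sprinkler's core scheduling move, dispatching by internal resource layout rather than queue order, can be sketched as a scan for the first request whose target resource is idle. The queue and layout structures below are invented simplifications of a real device-level controller.

/* Sketch of layout-aware dispatch: skip past queue-order requests whose
 * target channel is busy. Structures are illustrative. */
#include <stdio.h>

#define NUM_CHANNELS 4
#define QUEUE_DEPTH  8

typedef struct { int channel; int valid; } request_t;

static request_t queue[QUEUE_DEPTH];
static int channel_busy[NUM_CHANNELS];

/* FIFO order would stall on a busy channel; instead scan the queue for
 * the first request whose target channel is idle. */
static int pick_next_request(void)
{
    for (int i = 0; i < QUEUE_DEPTH; i++)
        if (queue[i].valid && !channel_busy[queue[i].channel])
            return i;
    return -1;
}

int main(void)
{
    queue[0] = (request_t){0, 1};      /* head targets busy channel 0      */
    queue[1] = (request_t){2, 1};      /* later request targets idle ch 2  */
    channel_busy[0] = 1;
    printf("dispatch queue slot %d\n", pick_next_request());  /* slot 1 */
    return 0;
}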