37 research outputs found

    Tiered-Latency DRAM (TL-DRAM)

    This paper summarizes the idea of Tiered-Latency DRAM, which was published in HPCA 2013. The key goal of TL-DRAM is to provide low DRAM latency at low cost, a critical problem in modern memory systems. To this end, TL-DRAM introduces heterogeneity into the design of a DRAM subarray by segmenting the bitlines, thereby creating a low-latency, low-energy, low-capacity portion in the subarray (called the near segment), which is close to the sense amplifiers, and a high-latency, high-energy, high-capacity portion, which is farther away from the sense amplifiers. Thus, DRAM becomes heterogeneous, with a small portion having lower latency and a large portion having higher latency. Various techniques can be employed to take advantage of the low-latency near segment and this new heterogeneous DRAM substrate, including hardware-based caching and software-based caching and memory allocation of frequently used data in the near segment. Evaluations with simple versions of such techniques show significant performance and energy-efficiency benefits. Comment: This is a summary of the original paper, entitled "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture" which appears in HPCA 201
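
    The caching use of the near segment described above can be illustrated with a toy model. This is a minimal sketch, not the mechanism from the paper: the class name `NearSegmentCache`, the LRU replacement policy, and the relative latencies (1 for near, 2 for far) are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class NearSegmentCache:
    """Toy model: the near segment holds copies of recently used far-segment rows.

    Hypothetical illustration only; LRU replacement and the latency values
    are assumptions, not details taken from the TL-DRAM summary.
    """

    def __init__(self, near_rows, near_latency=1, far_latency=2):
        self.near_rows = near_rows        # capacity of the near segment, in rows (>= 1)
        self.near_latency = near_latency  # relative units, illustrative
        self.far_latency = far_latency    # relative units, illustrative
        self.cached = OrderedDict()       # row id -> True, least recently used first

    def access(self, row):
        """Return the latency of accessing `row` and update the cached set."""
        if row in self.cached:
            self.cached.move_to_end(row)  # mark as most recently used
            return self.near_latency
        # Miss: serve from the far segment, then copy the row into the near segment.
        if len(self.cached) >= self.near_rows:
            self.cached.popitem(last=False)  # evict the least recently used row
        self.cached[row] = True
        return self.far_latency


cache = NearSegmentCache(near_rows=2)
latencies = [cache.access(row) for row in [7, 7, 9, 7, 3, 9]]
print(latencies)
```

    Repeated accesses to row 7 hit in the near segment, while row 9 is evicted before its second access and pays the far-segment latency again.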

    Adaptive-Latency DRAM (AL-DRAM)

    This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was published in HPCA 2015. The key goal of AL-DRAM is to exploit the extra margin that is built into the DRAM timing parameters to reduce DRAM latency. The key observation is that the timing parameters are dictated by the worst-case temperatures and worst-case DRAM cells, both of which lead to a small amount of charge storage and hence high access latency. One can therefore reduce latency by adapting the timing parameters to the current operating temperature and the current DIMM that is being accessed. Using an FPGA-based testing platform, our work first characterizes the extra margin for 115 DRAM modules from three major manufacturers. The experimental results demonstrate that it is possible to reduce four of the most critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C while maintaining reliable operation. AL-DRAM adaptively selects between multiple different timing parameters for each DRAM module based on its current operating condition. AL-DRAM does not require any changes to the DRAM chip or its interface; it only requires multiple different timing parameters to be specified and supported by the memory controller. Real system evaluations show that AL-DRAM improves the performance of memory-intensive workloads by an average of 14% without introducing any errors. Comment: This is a summary of the original paper, entitled "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case" which appears in HPCA 201
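
    The per-module, per-temperature selection described above could look roughly like the lookup below. This is a sketch under stated assumptions: the table layout, the names `PROFILES` and `select_timings`, the choice of tRCD/tRAS/tWR/tRP as the four parameters, and every numeric value are placeholders for illustration, not characterization data from the paper.

```python
from bisect import bisect_left

# Conservative fallback timings in nanoseconds (placeholder values).
STANDARD = {"tRCD": 13.75, "tRAS": 35.0, "tWR": 15.0, "tRP": 13.75}

# Hypothetical per-module profiles: a list of (max temperature in C, timing set),
# sorted by temperature. A real controller would fill this from profiling.
PROFILES = {
    "dimm0": [
        (55, {"tRCD": 10.0, "tRAS": 24.0, "tWR": 8.0, "tRP": 9.0}),
        (85, STANDARD),
    ],
}


def select_timings(module_id, temperature_c):
    """Pick the timing set for a module at its current temperature.

    Falls back to the standard timings for unknown modules or for
    temperatures above every profiled limit.
    """
    profile = PROFILES.get(module_id)
    if profile is None:
        return STANDARD
    limits = [limit for limit, _ in profile]
    index = bisect_left(limits, temperature_c)  # first limit >= temperature
    if index == len(profile):
        return STANDARD
    return profile[index][1]


cool = select_timings("dimm0", 40)
hot = select_timings("dimm0", 70)
unknown = select_timings("dimm9", 40)
```

    Falling back to the standard timings whenever the module or temperature is outside the profiled range keeps the sketch on the safe side, in the spirit of the reliability requirement stated in the summary.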

    Predictable Performance and Fairness Through Accurate Slowdown Estimation in Shared Main Memory Systems

    This paper summarizes the ideas and key concepts in MISE (Memory Interference-induced Slowdown Estimation), which was published in HPCA 2013 [97], and examines the work's significance and future potential. Applications running concurrently on a multicore system interfere with each other at the main memory. This interference can slow down different applications differently. Accurately estimating the slowdown of each application in such a system can enable mechanisms that can enforce quality-of-service. While much prior work has focused on mitigating the performance degradation due to inter-application interference, there is little work on accurately estimating the slowdown of individual applications in a multi-programmed environment. Our goal is to accurately estimate application slowdowns, towards providing predictable performance. To this end, we first build a simple Memory Interference-induced Slowdown Estimation (MISE) model, which accurately estimates slowdowns caused by memory interference. We then leverage our MISE model to develop two new memory scheduling schemes: 1) one that provides soft quality-of-service guarantees, and 2) another that explicitly attempts to minimize maximum slowdown (i.e., unfairness) in the system. Evaluations show that our techniques perform significantly better than state-of-the-art memory scheduling approaches to address the same problems. Our proposed model and techniques have enabled significant research in the development of accurate performance models [35, 59, 98, 110] and interference management mechanisms [66, 99, 100, 108, 119, 120].
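
    The summary above does not spell out the model, so the following is only a sketch of one way such an estimate could be computed. It assumes that slowdown is approximated by the ratio of an application's request service rate when run alone to its rate when sharing memory, weighted by the fraction of time the application stalls on memory (`alpha`). The function names and the sample rates are hypothetical.

```python
def estimate_slowdown(alone_rate, shared_rate, alpha=1.0):
    """Estimate slowdown from request service rates (assumed model).

    alone_rate:  requests served per unit time when running alone
    shared_rate: requests served per unit time when sharing memory
    alpha:       fraction of execution time stalled on memory (1.0 = memory-bound)
    """
    if shared_rate <= 0:
        raise ValueError("shared_rate must be positive")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return (1.0 - alpha) + alpha * (alone_rate / shared_rate)


def most_slowed_down(slowdowns):
    """Return the application with the largest estimated slowdown."""
    return max(slowdowns, key=slowdowns.get)


# Hypothetical measurements for two applications.
slowdowns = {
    "app_a": estimate_slowdown(alone_rate=4.0, shared_rate=2.0),
    "app_b": estimate_slowdown(alone_rate=4.0, shared_rate=2.0, alpha=0.5),
}
victim = most_slowed_down(slowdowns)
```

    A scheduler aiming to minimize maximum slowdown could use `most_slowed_down` to decide which application to prioritize next; that policy is a simplification for illustration.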

    Recent Advances in Overcoming Bottlenecks in Memory Systems and Managing Memory Resources in GPU Systems

    This article features extended summaries and retrospectives of some of the recent research done by our research group, SAFARI, on (1) various critical problems in memory systems and (2) how memory system bottlenecks affect graphics processing unit (GPU) systems. As more applications share a single system, operations from each application can contend with each other at various shared components. Such contention can slow down each application or thread of execution. The compound effect of contention, high memory latency and access overheads, as well as inefficient management of resources, greatly degrades performance, quality-of-service, and energy efficiency. The ten works featured in this issue study several aspects of (1) inter-application interference in multicore systems, heterogeneous systems, and GPUs; (2) the growing overheads and expenses associated with growing memory densities and latencies; and (3) performance, programmability, and portability issues in modern GPUs, especially those related to memory system resources. Comment: arXiv admin note: text overlap with arXiv:1805.0912

    Recent Advances in DRAM and Flash Memory Architectures

    This article features extended summaries and retrospectives of some of the recent research done by our group, SAFARI, on (1) understanding, characterizing, and modeling various critical properties of modern DRAM and NAND flash memory, the dominant memory and storage technologies, respectively; and (2) several new mechanisms we have proposed based on our observations from these analyses, characterization, and modeling, to tackle various key challenges in memory and storage scaling. In order to understand the sources of various bottlenecks of the dominant memory and storage technologies, these works perform rigorous studies of device-level and application-level behavior, using a combination of detailed simulation and experimental characterization of real memory and storage devices. Comment: arXiv admin note: substantial text overlap with arXiv:1805.0640

    RowHammer and Beyond

    We will discuss the RowHammer problem in DRAM, which is a prime (and likely the first) example of how a circuit-level failure mechanism in Dynamic Random Access Memory (DRAM) can cause a practical and widespread system security vulnerability. RowHammer is the phenomenon that repeatedly accessing a row in a modern DRAM chip predictably causes errors in physically-adjacent rows. It is caused by a hardware failure mechanism called read disturb errors. Building on our initial fundamental work that appeared at ISCA 2014, Google Project Zero demonstrated that this hardware phenomenon can be exploited by user-level programs to gain kernel privileges. Many other recent works demonstrated other attacks exploiting RowHammer, including remote takeover of a server vulnerable to RowHammer. We will analyze the root causes of the problem and examine solution directions. We will also discuss what other problems may be lurking in DRAM and other types of memories, e.g., NAND flash and Phase Change Memory, which can potentially threaten the foundations of reliable and secure systems, as the memory technologies scale to higher densities. Comment: A version of this paper is to appear in the COSADE 2019 proceedings. arXiv admin note: text overlap with arXiv:1703.0062

    Adaptive-Latency DRAM: Reducing DRAM Latency by Exploiting Timing Margins

    This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was published in HPCA 2015, and examines the work's significance and future potential. AL-DRAM is a mechanism that optimizes DRAM latency based on the DRAM module and the operating temperature, by exploiting the extra margin that is built into the DRAM timing parameters. DRAM manufacturers provide a large margin for the timing parameters as a provision against two worst-case scenarios. First, due to process variation, some outlier DRAM chips are much slower than others. Second, chips become slower at higher temperatures. The timing parameter margin ensures that the slow outlier chips operate reliably at the worst-case temperature, and hence leads to a high access latency. Using an FPGA-based DRAM testing platform, our work first characterizes the extra margin for 115 DRAM modules from three major manufacturers. The experimental results demonstrate that it is possible to reduce four of the most critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C while maintaining reliable operation. AL-DRAM uses these observations to adaptively select reliable DRAM timing parameters for each DRAM module based on the module's current operating conditions. AL-DRAM does not require any changes to the DRAM chip or its interface; it only requires multiple different timing parameters to be specified and supported by the memory controller. Our real system evaluations show that AL-DRAM improves the performance of memory-intensive workloads by an average of 14% without introducing any errors. Our characterization and proposed techniques have inspired several other works on analyzing and/or exploiting different sources of latency and performance variation within DRAM chips. Comment: arXiv admin note: substantial text overlap with arXiv:1603.0845

    Phase Change Logic via Thermal Cross-Talk for Computation in Memory

    We have computationally demonstrated logic function implementations using lateral and vertical multi-contact phase change devices integrated with CMOS circuitry, which use thermal cross-talk as a coupling mechanism to implement logic functions at smaller CMOS footprints. Thermal cross-talk during the write operations is utilized to recrystallize the previously amorphized regions to achieve toggle operations. Amorphized regions formed between different pairs of write contacts are utilized to isolate read contacts. The typical expected reduction in CMOS footprint is ~50% using the described approach for toggle-multiplexing, JK-multiplexing and 2x2 routing. The phase change devices switch on the order of nanoseconds and are inherently non-volatile. An electro-thermal modeling framework with dynamic materials models is used to capture the device dynamics, and the current and voltage requirements. Comment: 7 pages, 6 figures

    SAWL: A Self-adaptive Wear-leveling NVM Scheme for High Performance Storage Systems

    In order to meet the needs of high performance computing (HPC) in terms of large memory, high throughput and energy savings, non-volatile memory (NVM) has been widely studied due to its salient features of high density, near-zero standby power, byte addressability and non-volatility. In HPC systems, the multi-level cell (MLC) technique is used to significantly increase device density and decrease the cost, which however leads to much weaker endurance than the single-level cell (SLC) counterpart. Although wear-leveling techniques can mitigate this weakness in MLC, the improvements for MLC-based NVM are very limited because they do not achieve a uniform write distribution before some cells are worn out. To address this problem, our paper proposes a self-adaptive wear-leveling (SAWL) scheme for MLC-based NVM. The idea behind SAWL is to dynamically tune the wear-leveling granularities and balance the writes across the cells of the entire memory, thus achieving a suitable tradeoff between the lifetime and cache hit rate. Moreover, to reduce the size of the address-mapping table, SAWL maintains a few recently-accessed mappings in a small on-chip cache. Experimental results demonstrate that SAWL significantly improves the NVM lifetime and the performance for HPC systems, compared with state-of-the-art schemes. Comment: 14 pages, 17 figures

    High-Performance and Energy-Efficient Memory Scheduler Design for Heterogeneous Systems

    When multiple processor cores (CPUs) and a GPU integrated together on the same chip share the off-chip DRAM, requests from the GPU can heavily interfere with requests from the CPUs, leading to low system performance and starvation of cores. Unfortunately, state-of-the-art memory scheduling algorithms are ineffective at solving this problem due to the very large amount of GPU memory traffic, unless a very large and costly request buffer is employed to provide these algorithms with enough visibility across the global request stream. Previously-proposed memory controller (MC) designs use a single monolithic structure to perform three main tasks. First, the MC attempts to schedule together requests to the same DRAM row to increase row buffer hit rates. Second, the MC arbitrates among the requesters (CPUs and GPU) to optimize for overall system throughput, average response time, fairness and quality of service. Third, the MC manages the low-level DRAM command scheduling to complete requests while ensuring compliance with all DRAM timing and power constraints. This paper proposes a fundamentally new approach, called the Staged Memory Scheduler (SMS), which decouples the three primary MC tasks into three significantly simpler structures that together improve system performance and fairness. Our evaluation shows that SMS provides 41.2% performance improvement and fairness improvement compared to the best previous state-of-the-art technique, while enabling a design that is significantly less complex and more power-efficient to implement.
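
    The first of the three MC tasks named above, grouping requests to the same DRAM row, can be sketched as a standalone step. This is an illustrative assumption about how such grouping might be done per requester, not the structure used in SMS; the function name `form_batches` and the request format `(source, row)` are hypothetical.

```python
from itertools import groupby


def form_batches(requests):
    """Group each source's consecutive same-row requests into batches.

    requests: list of (source, row) tuples in arrival order.
    Returns a dict mapping each source to a list of batches, where a batch
    is a list of identical row ids that could be served as row-buffer hits.
    """
    per_source = {}
    for source, row in requests:
        per_source.setdefault(source, []).append(row)  # keep per-source FIFO order
    return {
        source: [list(group) for _, group in groupby(rows)]
        for source, rows in per_source.items()
    }


# Hypothetical interleaved request stream from one CPU core and the GPU.
stream = [("cpu0", 5), ("gpu", 1), ("cpu0", 5), ("gpu", 1), ("gpu", 1), ("cpu0", 8)]
batches = form_batches(stream)
print(batches)
```

    Keeping a separate FIFO per source means the grouping step never has to compare requests across requesters, which leaves arbitration among CPUs and the GPU to a later, independent step.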