    Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency

    This paper summarizes the idea of ChargeCache, which was published in HPCA 2016 [51], and examines the work's significance and future potential. DRAM latency continues to be a critical bottleneck for system performance. In this work, we develop a low-cost mechanism, called ChargeCache, that enables faster access to recently-accessed rows in DRAM, with no modifications to DRAM chips. Our mechanism is based on the key observation that a recently-accessed row has more charge and thus the following access to the same row can be performed faster. To exploit this observation, we propose to track the addresses of recently-accessed rows in a table in the memory controller. If a later DRAM request hits in that table, the memory controller uses lower timing parameters, leading to reduced DRAM latency. Row addresses are removed from the table after a specified duration to ensure rows that have leaked too much charge are not accessed with lower latency. We evaluate ChargeCache on a wide variety of workloads and show that it provides significant performance and energy benefits for both single-core and multi-core systems. Comment: arXiv admin note: substantial text overlap with arXiv:1609.0723
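
    The table-based tracking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name, the capacity, the expiry duration, the oldest-first eviction policy, and the timing values (tRCD and tRAS in cycles) are all assumptions chosen for illustration.

```python
from collections import OrderedDict

# Illustrative timing sets in DRAM cycles (assumed values, not from the paper).
DEFAULT_TIMINGS = {"tRCD": 11, "tRAS": 28}
LOWERED_TIMINGS = {"tRCD": 7, "tRAS": 20}


class ChargeCacheTable:
    """Hypothetical table of recently-accessed row addresses in the memory controller."""

    def __init__(self, capacity=128, duration=1_000_000):
        self.capacity = capacity    # maximum number of tracked rows (assumed)
        self.duration = duration    # cycles after which an entry is treated as expired (assumed)
        self.rows = OrderedDict()   # row address -> cycle of the most recent access

    def access(self, row, now):
        """Return True if `row` was accessed recently enough, then record this access."""
        last = self.rows.get(row)
        hit = last is not None and now - last < self.duration
        self.rows.pop(row, None)
        self.rows[row] = now                 # move the row to the most-recent position
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)    # evict the oldest tracked row
        return hit


def timings_for(table, row, now):
    """Pick lowered timings on a table hit and default timings otherwise."""
    return LOWERED_TIMINGS if table.access(row, now) else DEFAULT_TIMINGS
```

    In this sketch an expired entry is simply treated as a miss on lookup; a hardware design would more likely invalidate entries periodically.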

    High-Performance and Energy-Efficient Memory Scheduler Design for Heterogeneous Systems

    When multiple processor cores (CPUs) and a GPU integrated together on the same chip share the off-chip DRAM, requests from the GPU can heavily interfere with requests from the CPUs, leading to low system performance and starvation of cores. Unfortunately, state-of-the-art memory scheduling algorithms are ineffective at solving this problem due to the very large amount of GPU memory traffic, unless a very large and costly request buffer is employed to provide these algorithms with enough visibility across the global request stream. Previously proposed memory controller (MC) designs use a single monolithic structure to perform three main tasks. First, the MC attempts to schedule together requests to the same DRAM row to increase row buffer hit rates. Second, the MC arbitrates among the requesters (CPUs and GPU) to optimize for overall system throughput, average response time, fairness, and quality of service. Third, the MC manages the low-level DRAM command scheduling to complete requests while ensuring compliance with all DRAM timing and power constraints. This paper proposes a fundamentally new approach, called the Staged Memory Scheduler (SMS), which decouples the three primary MC tasks into three significantly simpler structures that together improve system performance and fairness. Our evaluation shows that SMS provides a 41.2% performance improvement and a fairness improvement compared to the best previous state-of-the-art technique, while enabling a design that is significantly less complex and more power-efficient to implement.
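
    The three decoupled tasks can be sketched as three small functions. This is a toy illustration under stated assumptions, not the SMS design itself: requests are hypothetical `(bank, row, column)` tuples, the second stage uses plain round-robin across sources as a stand-in arbitration policy, and the third stage ignores all DRAM timing and power constraints.

```python
from collections import deque
from itertools import groupby


def form_batches(requests):
    """Stage 1: group one source's consecutive requests to the same (bank, row)."""
    return [list(group) for _, group in groupby(requests, key=lambda r: (r[0], r[1]))]


def schedule_batches(batches_by_source):
    """Stage 2: pick whole batches, round-robin across sources (assumed policy)."""
    queues = {source: deque(batches) for source, batches in batches_by_source.items()}
    order = []
    while any(queues.values()):
        for source in sorted(queues):
            if queues[source]:
                order.append((source, queues[source].popleft()))
    return order


def issue_commands(scheduled):
    """Stage 3: turn scheduled batches into DRAM commands (timing constraints omitted)."""
    open_row = {}   # bank -> currently open row
    commands = []
    for source, batch in scheduled:
        for bank, row, col in batch:
            if open_row.get(bank) != row:
                if bank in open_row:
                    commands.append(("PRE", bank))   # close the previously open row
                commands.append(("ACT", bank, row))  # open the requested row
                open_row[bank] = row
            commands.append(("RD", bank, row, col, source))
    return commands
```

    Because each stage only sees the output of the previous one, the batch scheduler arbitrates over a handful of batches rather than over every buffered request.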

    Tiered-Latency DRAM (TL-DRAM)

    This paper summarizes the idea of Tiered-Latency DRAM, which was published in HPCA 2013. The key goal of TL-DRAM is to provide low DRAM latency at low cost, a critical problem in modern memory systems. To this end, TL-DRAM introduces heterogeneity into the design of a DRAM subarray by segmenting the bitlines, thereby creating a low-latency, low-energy, low-capacity portion in the subarray (called the near segment), which is close to the sense amplifiers, and a high-latency, high-energy, high-capacity portion, which is farther away from the sense amplifiers. Thus, DRAM becomes heterogeneous, with a small portion having lower latency and a large portion having higher latency. Various techniques can be employed to take advantage of the low-latency near segment and this new heterogeneous DRAM substrate, including hardware-based caching, software-based caching, and memory allocation of frequently used data in the near segment. Evaluations with simple versions of such techniques show significant performance and energy-efficiency benefits. Comment: This is a summary of the original paper, entitled "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture" which appears in HPCA 201

    Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface

    Limited memory bandwidth is a critical bottleneck in modern systems. 3D-stacked DRAM enables higher bandwidth by leveraging wider Through-Silicon-Via (TSV) channels, but today's systems cannot fully exploit them due to the limited internal bandwidth of DRAM. DRAM reads a whole row simultaneously from the cell array to a row buffer, but can transfer only a fraction of the data from the row buffer to the peripheral IO circuit, through a limited and expensive set of wires referred to as global bitlines. In the presence of wider memory channels, the major bottleneck becomes the limited data transfer capacity through these global bitlines. Our goal in this work is to enable higher bandwidth in 3D-stacked DRAM without the increased cost of adding more global bitlines. We instead exploit otherwise-idle resources, such as global bitlines, already existing within the multiple DRAM layers by accessing the layers simultaneously. Our architecture, Simultaneous Multi Layer Access (SMLA), provides higher bandwidth by aggregating the internal bandwidth of multiple layers and transferring the available data at a higher IO frequency. To implement SMLA, simultaneous data transfer from multiple layers through the same IO TSVs requires coordination between layers to avoid channel conflicts. We first study coordination by static partitioning, which we call Dedicated-IO, that assigns groups of TSVs to each layer. We then provide a simple yet sophisticated mechanism, called Cascaded-IO, which enables simultaneous access to each layer by time-multiplexing the IOs. By operating at a frequency proportional to the number of layers, SMLA provides a higher bandwidth (4X for a four-layer stacked DRAM). Our evaluations show that SMLA provides significant performance improvement and energy reduction (55% and 18% on average, respectively, for multi-programmed workloads) over a baseline 3D-stacked DRAM with very low area overhead.
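
    The 4X figure for a four-layer stack follows from simple bandwidth arithmetic, sketched below. The function names, the 128-bit channel width, and the 200 MHz base frequency are assumptions for illustration, and the sketch assumes that under Dedicated-IO each layer drives its own TSV group at the raised frequency.

```python
def baseline_bandwidth(width_bits, freq_hz):
    """Bytes per second when one layer at a time drives all TSVs at the base frequency."""
    return width_bits * freq_hz // 8


def dedicated_io_bandwidth(width_bits, freq_hz, layers):
    """Static partitioning: each layer owns width/layers TSVs, clocked at layers * freq."""
    group_width = width_bits // layers   # assumes width is divisible by the layer count
    per_layer = group_width * (layers * freq_hz) // 8
    return layers * per_layer


def cascaded_io_bandwidth(width_bits, freq_hz, layers):
    """Time-multiplexing: all TSVs are shared and clocked at layers * freq."""
    return width_bits * (layers * freq_hz) // 8


base = baseline_bandwidth(128, 200_000_000)
dedicated = dedicated_io_bandwidth(128, 200_000_000, layers=4)
cascaded = cascaded_io_bandwidth(128, 200_000_000, layers=4)
print(dedicated // base, cascaded // base)
```

    Under these assumptions both organizations reach the same aggregate bandwidth; they differ in how the TSVs are shared among layers, not in the peak rate.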

    Adaptive-Latency DRAM (AL-DRAM)

    This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was published in HPCA 2015. The key goal of AL-DRAM is to exploit the extra margin that is built into the DRAM timing parameters to reduce DRAM latency. The key observation is that the timing parameters are dictated by the worst-case temperatures and worst-case DRAM cells, both of which lead to a small amount of charge storage and hence high access latency. One can therefore reduce latency by adapting the timing parameters to the current operating temperature and the current DIMM that is being accessed. Using an FPGA-based testing platform, our work first characterizes the extra margin for 115 DRAM modules from three major manufacturers. The experimental results demonstrate that it is possible to reduce four of the most critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C while maintaining reliable operation. AL-DRAM adaptively selects between multiple different timing parameters for each DRAM module based on its current operating condition. AL-DRAM does not require any changes to the DRAM chip or its interface; it only requires multiple different timing parameters to be specified and supported by the memory controller. Real system evaluations show that AL-DRAM improves the performance of memory-intensive workloads by an average of 14% without introducing any errors. Comment: This is a summary of the original paper, entitled "Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case" which appears in HPCA 201
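
    The adaptive selection step can be sketched as a lookup over a per-module profile. Everything specific here is an assumption for illustration: the parameter names, the nanosecond values, the temperature thresholds, and the profile layout are hypothetical and are not taken from the paper's characterization data.

```python
# Conservative timings in nanoseconds, used when no faster set is known to be reliable
# (illustrative values).
STANDARD = {"tRCD": 13.75, "tRAS": 35.0, "tWR": 15.0, "tRP": 13.75}

# Hypothetical profile for one module: (maximum temperature in Celsius, timing set).
# A faster set applies only while the module stays at or below its threshold.
MODULE_PROFILE = [
    (55, {"tRCD": 10.0, "tRAS": 23.75, "tWR": 10.0, "tRP": 11.25}),
    (85, STANDARD),
]


def select_timings(profile, temperature_c):
    """Return the first timing set whose temperature threshold covers the reading."""
    for max_temp_c, timings in sorted(profile, key=lambda entry: entry[0]):
        if temperature_c <= max_temp_c:
            return timings
    return STANDARD   # outside every profiled range: fall back to the safe set
```

    A memory controller following this sketch would call `select_timings` whenever the temperature reading for a module changes, keeping one profile per module.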

    Predictable Performance and Fairness Through Accurate Slowdown Estimation in Shared Main Memory Systems

    This paper summarizes the ideas and key concepts in MISE (Memory Interference-induced Slowdown Estimation), which was published in HPCA 2013 [97], and examines the work's significance and future potential. Applications running concurrently on a multicore system interfere with each other at the main memory. This interference can slow down different applications differently. Accurately estimating the slowdown of each application in such a system can enable mechanisms that enforce quality-of-service. While much prior work has focused on mitigating the performance degradation due to inter-application interference, there is little work on accurately estimating the slowdown of individual applications in a multi-programmed environment. Our goal is to accurately estimate application slowdowns, towards providing predictable performance. To this end, we first build a simple Memory Interference-induced Slowdown Estimation (MISE) model, which accurately estimates slowdowns caused by memory interference. We then leverage our MISE model to develop two new memory scheduling schemes: 1) one that provides soft quality-of-service guarantees, and 2) another that explicitly attempts to minimize maximum slowdown (i.e., unfairness) in the system. Evaluations show that our techniques perform significantly better than state-of-the-art memory scheduling approaches to address the same problems. Our proposed model and techniques have enabled significant research in the development of accurate performance models [35, 59, 98, 110] and interference management mechanisms [66, 99, 100, 108, 119, 120].
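
    The abstract does not state the model's formula. The sketch below assumes the request-service-rate formulation, in which slowdown is the ratio of the rate at which an application's memory requests are served when it runs alone to the rate when it shares memory, weighted by the fraction of time the application stalls on memory. The function and parameter names are illustrative.

```python
def estimate_slowdown(alone_rate, shared_rate, memory_stall_fraction=1.0):
    """Estimate slowdown from request service rates.

    alone_rate: requests served per unit time when the application runs alone
    shared_rate: requests served per unit time when it shares main memory
    memory_stall_fraction: fraction of execution time spent stalled on memory
        (1.0 models a fully memory-bound application)
    """
    if shared_rate <= 0:
        raise ValueError("shared_rate must be positive")
    ratio = alone_rate / shared_rate
    return (1.0 - memory_stall_fraction) + memory_stall_fraction * ratio


def most_slowed_down(slowdowns):
    """Return the application with the largest estimated slowdown (the unfairness victim)."""
    return max(slowdowns, key=slowdowns.get)
```

    A scheduler that aims to minimize maximum slowdown could, for example, periodically give more memory bandwidth to the application returned by `most_slowed_down`.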

    Decoupling GPU Programming Models from Resource Management for Enhanced Programming Ease, Portability, and Performance

    The application resource specification (a static specification of several parameters, such as the number of threads and the scratchpad memory usage per thread block) forms a critical component of modern GPU programming models. This specification determines the parallelism, and hence performance, of the application during execution because the corresponding on-chip hardware resources are allocated and managed based on this specification. This tight coupling between the software-provided resource specification and resource management in hardware leads to significant challenges in programming ease, portability, and performance. Zorua is a new resource virtualization framework that decouples the programmer-specified resource usage of a GPU application from the actual allocation in the on-chip hardware resources. Zorua enables this decoupling by virtualizing each resource transparently to the programmer. We demonstrate that by providing the illusion of more resources than physically available via controlled and coordinated virtualization, Zorua offers several important benefits: (i) Programming Ease. Zorua eases the burden on the programmer to provide code that is tuned to efficiently utilize the physically available on-chip resources. (ii) Portability. Zorua alleviates the necessity of re-tuning an application's resource usage when porting the application across GPU generations. (iii) Performance. By dynamically allocating resources and carefully oversubscribing them when necessary, Zorua improves or retains the performance of applications that are already highly tuned to best utilize the resources. Comment: arXiv admin note: substantial text overlap with arXiv:1802.0257

    Adaptive-Latency DRAM: Reducing DRAM Latency by Exploiting Timing Margins

    This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was published in HPCA 2015, and examines the work's significance and future potential. AL-DRAM is a mechanism that optimizes DRAM latency based on the DRAM module and the operating temperature, by exploiting the extra margin that is built into the DRAM timing parameters. DRAM manufacturers provide a large margin for the timing parameters as a provision against two worst-case scenarios. First, due to process variation, some outlier DRAM chips are much slower than others. Second, chips become slower at higher temperatures. The timing parameter margin ensures that the slow outlier chips operate reliably at the worst-case temperature, and hence leads to a high access latency. Using an FPGA-based DRAM testing platform, our work first characterizes the extra margin for 115 DRAM modules from three major manufacturers. The experimental results demonstrate that it is possible to reduce four of the most critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C while maintaining reliable operation. AL-DRAM uses these observations to adaptively select reliable DRAM timing parameters for each DRAM module based on the module's current operating conditions. AL-DRAM does not require any changes to the DRAM chip or its interface; it only requires multiple different timing parameters to be specified and supported by the memory controller. Our real system evaluations show that AL-DRAM improves the performance of memory-intensive workloads by an average of 14% without introducing any errors. Our characterization and proposed techniques have inspired several other works on analyzing and/or exploiting different sources of latency and performance variation within DRAM chips. Comment: arXiv admin note: substantial text overlap with arXiv:1603.0845

    A Hardware-Software Blueprint for Flexible Deep Learning Specialization

    Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operators, or numerical systems threaten the viability of specialized hardware accelerators. We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads. VTA achieves this flexibility via a parametrizable architecture, a two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA that explicitly orchestrates concurrent compute and memory tasks and (2) a microcode-ISA that implements a wide variety of operators with single-cycle tensor-tensor operations. Next, we propose a runtime system equipped with a JIT compiler for flexible code generation and heterogeneous execution that enables effective use of the VTA architecture. VTA is open-sourced and integrated into Apache TVM, a state-of-the-art deep learning compilation stack that provides flexibility for diverse models and divergent hardware backends. We propose a flow that performs design space exploration to generate a customized hardware architecture and software operator library that can be leveraged by mainstream learning frameworks. We demonstrate our approach by deploying optimized deep learning models used for object classification and style transfer on edge-class FPGAs. Comment: 6 pages plus references, 8 figure

    Exploiting the DRAM Microarchitecture to Increase Memory-Level Parallelism

    This paper summarizes the idea of Subarray-Level Parallelism (SALP) in DRAM, which was published in ISCA 2012, and examines the work's significance and future potential. Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of off-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low-cost approach. To this end, we propose three new mechanisms, SALP-1, SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the serialization of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing few global peripheral structures. Our three proposed mechanisms mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank. SALP-1 requires no changes to the existing DRAM structure, and only needs to reinterpret some of the existing DRAM timing parameters. SALP-2 and MASA require only modest changes (< 0.15% area overhead) to the DRAM peripheral structures, which are much less design constrained than the DRAM core. Our evaluations show that SALP-1, SALP-2, and MASA significantly improve performance for both single-core systems (7%/13%/17%) and multi-core systems (15%/16%/20%), averaged across a wide range of workloads. We also demonstrate that our mechanisms can be combined with application-aware memory request scheduling in multicore systems to further improve performance and fairness.
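
    The benefit of overlapping latency components can be seen in a toy model. The sketch below only models hiding one request's precharge behind the next request's activation when the two target different subarrays of the same bank; it does not model SALP-2 or MASA, and the cycle counts, function name, and the assumption that every request opens a new row are illustrative.

```python
def bank_service_time(subarrays, t_act=15, t_rd=15, t_pre=15, overlap=False):
    """Total cycles to serve back-to-back row-conflict requests to one bank.

    subarrays: subarray index targeted by each request, in service order
    overlap: if True, a request's precharge overlaps with the next request's
        activation whenever the two requests target different subarrays
    """
    total = 0
    for i, subarray in enumerate(subarrays):
        total += t_act + t_rd + t_pre   # fully serialized cost of one request
        next_differs = i + 1 < len(subarrays) and subarrays[i + 1] != subarray
        if overlap and next_differs:
            # Precharge and the next activation proceed in parallel, so the pair
            # costs max(t_pre, t_act) instead of their sum.
            total -= min(t_pre, t_act)
    return total


requests = [0, 1, 0, 0]   # subarray index per request
print(bank_service_time(requests), bank_service_time(requests, overlap=True))
```

    Requests that fall in the same subarray gain nothing in this model, which is why the benefit depends on how requests spread across subarrays.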