Understanding Reduced-Voltage Operation in Modern DRAM Chips: Characterization, Analysis, and Mechanisms
The energy consumption of DRAM is a critical concern in modern computing
systems. Improvements in manufacturing process technology have allowed DRAM
vendors to lower the DRAM supply voltage conservatively, which reduces some of
the DRAM energy consumption. We would like to reduce the DRAM supply voltage
more aggressively, to further reduce energy. Aggressive supply voltage
reduction requires a thorough understanding of the effect voltage scaling has
on DRAM access latency and DRAM reliability.
In this paper, we take a comprehensive approach to understanding and
exploiting the latency and reliability characteristics of modern DRAM when the
supply voltage is lowered below the nominal voltage level specified by DRAM
standards. Using an FPGA-based testing platform, we perform an experimental
study of 124 real DDR3L (low-voltage) DRAM chips manufactured recently by three
major DRAM vendors. We find that reducing the supply voltage below a certain
point introduces bit errors in the data, and we comprehensively characterize
the behavior of these errors. We discover that these errors can be avoided by
increasing the latency of three major DRAM operations (activation, restoration,
and precharge). We perform detailed DRAM circuit simulations to validate and
explain our experimental findings. We also characterize the various
relationships between reduced supply voltage and error locations, stored data
patterns, DRAM temperature, and data retention.
Based on our observations, we propose a new DRAM energy reduction mechanism,
called Voltron. The key idea of Voltron is to use a performance model to
determine by how much we can reduce the supply voltage without introducing
errors and without exceeding a user-specified threshold for performance loss.
Voltron reduces the average system energy by 7.3% while limiting the average
system performance loss to only 1.8%, for a variety of workloads.
Comment: 25 pages, 25 figures, 7 tables. Proceedings of the ACM on Measurement
and Analysis of Computing Systems (POMACS).
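As a concrete illustration of the selection policy described above, the following
minimal C sketch picks the lowest supply voltage whose model-predicted performance
loss stays within a user-specified threshold. The candidate voltages and loss
figures are illustrative placeholders, not measurements from the paper; only the
1.35 V nominal DDR3L level is a standard value.

```c
/* A minimal sketch of the Voltron selection policy described in the
 * abstract: pick the lowest supply voltage whose model-predicted
 * performance loss stays within a user-specified threshold. The
 * voltage points and loss figures are illustrative placeholders. */
#include <stdio.h>

struct vpoint {
    double volts;          /* candidate supply voltage (V)             */
    double perf_loss_pct;  /* predicted performance loss at this point */
};

/* Return the lowest voltage whose predicted loss is acceptable. */
static double pick_voltage(const struct vpoint *pts, int n, double max_loss_pct)
{
    double best = pts[0].volts;  /* fall back to the nominal voltage */
    for (int i = 0; i < n; i++)
        if (pts[i].perf_loss_pct <= max_loss_pct && pts[i].volts < best)
            best = pts[i].volts;
    return best;
}

int main(void)
{
    /* Hypothetical model output: lower voltage -> higher predicted loss,
     * because safe operation needs longer DRAM timings. */
    struct vpoint pts[] = {
        {1.35, 0.0}, {1.30, 0.4}, {1.25, 1.1}, {1.20, 2.3}, {1.15, 4.0},
    };
    double v = pick_voltage(pts, 5, 1.8);  /* user threshold: 1.8% loss */
    printf("selected supply voltage: %.2f V\n", v);
    return 0;
}
```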
In-DRAM Bulk Bitwise Execution Engine
Many applications heavily use bitwise operations on large bitvectors as part
of their computation. In existing systems, performing such bulk bitwise
operations requires the processor to transfer a large amount of data over the
memory channel, thereby incurring high latency and consuming significant memory
bandwidth and energy.
In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk
bitwise operations completely inside main memory. Ambit exploits the internal
organization and analog operation of DRAM-based memory to achieve low cost,
high performance, and low energy. Ambit exposes a new bulk bitwise execution
model to the host processor. Evaluations show that Ambit significantly improves
the performance of several applications that use bulk bitwise operations,
including databases.
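To make the cost that Ambit eliminates concrete, the following C sketch performs
a bulk bitwise AND the conventional way: both source bitvectors cross the memory
channel into the processor and the result is written back. The 32 MiB operand
size is an arbitrary assumption; Ambit would compute the same result inside DRAM
without this traffic.

```c
/* Baseline illustration of the cost Ambit targets: a bulk bitwise AND
 * computed by the processor forces both source bitvectors (and the
 * result) across the narrow memory channel. Ambit instead performs the
 * operation inside DRAM; this sketch shows only the conventional path. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORDS (1u << 22)   /* 4M 64-bit words = 32 MiB per bitvector */

int main(void)
{
    uint64_t *a = malloc(NWORDS * sizeof *a);
    uint64_t *b = malloc(NWORDS * sizeof *b);
    uint64_t *r = malloc(NWORDS * sizeof *r);
    if (!a || !b || !r) return 1;

    for (size_t i = 0; i < NWORDS; i++) { a[i] = i; b[i] = ~i; }

    /* Every word of a and b is read over the memory channel and every
     * word of r is written back: ~96 MiB of traffic for one AND. */
    for (size_t i = 0; i < NWORDS; i++)
        r[i] = a[i] & b[i];

    printf("moved ~%zu MiB over the channel for one bulk AND\n",
           3 * NWORDS * sizeof(uint64_t) >> 20);
    free(a); free(b); free(r);
    return 0;
}
```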
A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps
Modern Graphics Processing Units (GPUs) are well provisioned to support the
concurrent execution of thousands of threads. Unfortunately, different
bottlenecks during execution and heterogeneous application requirements create
imbalances in utilization of resources in the cores. For example, when a GPU is
bottlenecked by the available off-chip memory bandwidth, its computational
resources are often overwhelmingly idle, waiting for data from memory to
arrive.
This work describes the Core-Assisted Bottleneck Acceleration (CABA)
framework that employs idle on-chip resources to alleviate different
bottlenecks in GPU execution. CABA provides flexible mechanisms to
automatically generate "assist warps" that execute on GPU cores to perform
specific tasks that can improve GPU performance and efficiency.
CABA enables the use of idle computational units and pipelines to alleviate
the memory bandwidth bottleneck, e.g., by using assist warps to perform data
compression to transfer less data from memory. Conversely, the same framework
can be employed to handle cases where the GPU is bottlenecked by the available
computational units, in which case the memory pipelines are idle and can be
used by CABA to speed up computation, e.g., by performing memoization using
assist warps.
We provide a comprehensive design and evaluation of CABA to perform effective
and flexible data compression in the GPU memory hierarchy to alleviate the
memory bandwidth bottleneck. Our extensive evaluations show that CABA, when
used to implement data compression, provides an average performance improvement
of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU
applications.
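The following back-of-envelope C model, using assumed rather than measured
numbers, shows why compressed transfers help a bandwidth-bound kernel: with
compression ratio c, the channel carries 1/c of the bytes, shrinking the
memory-bound portion of the run time until computation becomes the limit.

```c
/* Back-of-envelope model of why CABA's assist-warp compression helps a
 * bandwidth-bound kernel: if the kernel must fetch D bytes and assist
 * warps compress transfers by ratio c, the channel only carries D/c
 * bytes. The model assumes compute and memory transfer overlap
 * perfectly, so run time is the larger of the two; all numbers are
 * illustrative, not from the paper's evaluation. */
#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

int main(void)
{
    double data_bytes   = 8e9;    /* bytes the kernel fetches (assumed) */
    double bandwidth    = 320e9;  /* off-chip bandwidth in bytes/s      */
    double compute_time = 0.012;  /* seconds of useful computation      */
    double ratio        = 1.5;    /* assumed average compression ratio  */

    double t_base = max2(compute_time, data_bytes / bandwidth);
    double t_caba = max2(compute_time, data_bytes / ratio / bandwidth);

    printf("baseline: %.1f ms, compressed transfers: %.1f ms (%.0f%% faster)\n",
           t_base * 1e3, t_caba * 1e3, (t_base / t_caba - 1.0) * 100.0);
    return 0;
}
```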
LISA: Increasing Internal Connectivity in DRAM for Fast Data Movement and Low Latency
This paper summarizes the idea of Low-Cost Interlinked Subarrays (LISA),
which was published in HPCA 2016, and examines the work's significance and
future potential. Contemporary systems perform bulk data movement
inefficiently, by transferring data from DRAM to the processor, and then back
to DRAM, across a narrow off-chip channel. The use of this narrow channel
results in high latency and energy consumption. Prior work proposes to avoid
these high costs by exploiting the existing wide internal DRAM bandwidth for
bulk data movement, but the limited connectivity of wires within DRAM allows
fast data movement within only a single DRAM subarray. Each subarray is only a
few megabytes in size, greatly restricting the range over which fast bulk data
movement can happen within DRAM.
Our HPCA 2016 paper proposes a new DRAM substrate, Low-Cost Inter-Linked
Subarrays (LISA), whose goal is to enable fast and efficient data movement
across a large range of memory at low cost. LISA adds low-cost connections
between adjacent subarrays. By using these connections to interconnect the
existing internal wires (bitlines) of adjacent subarrays, LISA enables
wide-bandwidth data transfer across multiple subarrays with little (only 0.8%)
DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling a variety
of new applications. We describe and evaluate three such applications in
detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a
DRAM architecture whose rows have heterogeneous access latencies, and (3)
accelerated bitline precharging by linking multiple precharge units together.
Our extensive evaluations show that each of LISA's three applications
significantly improves performance and memory energy efficiency on a variety of
workloads and system configurations.
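A minimal C model of the channel traffic at stake in bulk copies: a conventional
page copy crosses the narrow off-chip channel twice per byte (DRAM to processor,
processor back to DRAM), whereas a LISA-style inter-subarray copy keeps the data
inside the DRAM chip. The page size and copy count below are illustrative
assumptions.

```c
/* A minimal model of the bulk-copy traffic LISA eliminates. A
 * conventional page copy moves every byte over the off-chip channel
 * twice (DRAM -> processor, processor -> DRAM); a LISA-style
 * inter-subarray copy keeps the data inside the DRAM chip. */
#include <stdio.h>

int main(void)
{
    unsigned long page_bytes = 8192;      /* one DRAM row / OS page (assumed) */
    unsigned long copies     = 1000000;   /* bulk copies performed (assumed)  */

    unsigned long long conventional = 2ULL * page_bytes * copies; /* read + write */
    unsigned long long in_dram      = 0;  /* data never leaves the chip */

    printf("channel traffic, conventional copy: %llu GiB\n", conventional >> 30);
    printf("channel traffic, in-DRAM copy:      %llu GiB\n", in_dram >> 30);
    return 0;
}
```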
FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads
In this work, we propose FUSE, a novel GPU cache system that integrates
spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip
L1D cache. FUSE can minimize the number of outgoing memory accesses over the
interconnection network of the GPU's multiprocessors, which in turn allows GPUs
to better sustain their massive computing parallelism. Specifically, FUSE
predicts the read level of GPU memory accesses by extracting GPU runtime
information and places write-once-read-multiple (WORM) data blocks into the
STT-MRAM, while accommodating write-multiple data blocks in a small SRAM
portion of the L1D cache. To further reduce the off-chip memory
accesses, FUSE also allows WORM data blocks to be allocated anywhere in the
STT-MRAM by approximating the associativity with the limited number of tag
comparators and I/O peripherals. Our evaluation results show that, in
comparison to a traditional GPU cache, our proposed heterogeneous cache reduces
the number of outgoing memory references by 32% across the interconnection
network, thereby improving the overall performance by 217% and reducing energy
cost by 53%.
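The placement decision can be sketched as follows, based only on the abstract's
description: blocks predicted to be write-once-read-multiple (WORM) are steered
to the STT-MRAM portion of the L1D cache, while write-intensive blocks stay in
the small SRAM portion. The predicted_worm function is a hypothetical stand-in
for FUSE's actual runtime predictor.

```c
/* Sketch of the FUSE placement decision described in the abstract:
 * WORM data tolerates STT-MRAM's slower, more energy-hungry writes
 * because it is written once; write-intensive data stays in SRAM.
 * The predictor below is a placeholder, not the paper's mechanism. */
#include <stdbool.h>
#include <stdio.h>

enum partition { PART_SRAM, PART_STT_MRAM };

/* Hypothetical predictor: returns true if runtime information suggests
 * the block will be written once and then only read. */
static bool predicted_worm(unsigned long addr)
{
    return (addr & 0x1) == 0;  /* placeholder heuristic for the sketch */
}

static enum partition place_block(unsigned long addr)
{
    return predicted_worm(addr) ? PART_STT_MRAM : PART_SRAM;
}

int main(void)
{
    unsigned long addrs[] = { 0x1000, 0x1001, 0x2000, 0x2003 };
    for (int i = 0; i < 4; i++)
        printf("block %#lx -> %s\n", addrs[i],
               place_block(addrs[i]) == PART_STT_MRAM ? "STT-MRAM" : "SRAM");
    return 0;
}
```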
Tiered-Latency DRAM (TL-DRAM)
This paper summarizes the idea of Tiered-Latency DRAM, which was published in
HPCA 2013. The key goal of TL-DRAM is to provide low DRAM latency at low cost,
a critical problem in modern memory systems. To this end, TL-DRAM introduces
heterogeneity into the design of a DRAM subarray by segmenting the bitlines,
thereby creating a low-latency, low-energy, low-capacity portion in the
subarray (called the near segment), which is close to the sense amplifiers, and
a high-latency, high-energy, high-capacity portion, which is farther away from
the sense amplifiers. Thus, DRAM becomes heterogeneous with a small portion
having lower latency and a large portion having higher latency. Various
techniques can be employed to take advantage of the low-latency near segment
and this new heterogeneous DRAM substrate, including hardware-based caching,
software-based caching, and memory allocation of frequently used data in the
near segment. Evaluations with such simple techniques show significant
performance and energy-efficiency benefits.
Comment: This is a summary of the original paper, entitled "Tiered-Latency
DRAM: A Low Latency and Low Cost DRAM Architecture", which appears in HPCA
2013.
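A simple average-latency model, with assumed latencies and hit rate rather than
the paper's figures, shows why placing frequently used data in the low-latency
near segment pays off:

```c
/* Average-latency model for a TL-DRAM-style subarray: accesses that hit
 * the small near segment see a low latency, the rest pay the far-segment
 * latency. All values below are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double t_near   = 8.0;   /* ns, near-segment access (assumed)          */
    double t_far    = 14.0;  /* ns, far-segment access (assumed)           */
    double hit_rate = 0.7;   /* fraction of accesses served near (assumed) */

    double avg = hit_rate * t_near + (1.0 - hit_rate) * t_far;
    printf("average access latency: %.1f ns (vs. %.1f ns without caching)\n",
           avg, t_far);
    return 0;
}
```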
FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies and the availability of
credible data, deep learning, an area of artificial intelligence, has emerged
and demonstrated its effectiveness in solving complex learning problems that
were previously intractable. In particular, convolutional neural networks
(CNNs) have demonstrated their effectiveness in image detection and recognition
applications. However, they require so much computation and memory bandwidth
that general-purpose CPUs fail to achieve the desired performance levels.
Consequently, hardware accelerators that use application specific integrated
circuits (ASICs), field programmable gate arrays (FPGAs), and graphic
processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have been recently adopted for accelerating the
implementation of deep learning networks due to their ability to maximize
parallelism as well as due to their energy efficiency. In this paper, we review
recent existing techniques for accelerating deep learning networks on FPGAs. We
highlight the key features employed by the various techniques for improving the
acceleration performance. In addition, we provide recommendations for enhancing
the utilization of FPGAs for CNN acceleration. The techniques investigated in
this paper represent the recent trends in FPGA-based accelerators of deep
learning networks. Thus, this review is expected to direct the future advances
on efficient hardware accelerators and to be useful for deep learning
researchers.
Comment: This article has been accepted for publication in IEEE Access
(December 2018).
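To make the compute intensity behind this motivation concrete, the following
naive C convolution loop, with illustrative layer dimensions, counts the
multiply-accumulate operations a single small CNN layer requires; FPGA
accelerators exploit the abundant parallelism in exactly these loop nests.

```c
/* Naive convolution layer: every output element needs K*K*C
 * multiply-accumulates. Dimensions are illustrative only. */
#include <stdio.h>

#define C 64   /* input channels      */
#define H 56   /* input height/width  */
#define K 3    /* kernel size         */
#define M 64   /* output channels     */

static float in[C][H][H], w[M][C][K][K], out[M][H - K + 1][H - K + 1];

int main(void)
{
    long long macs = 0;
    for (int m = 0; m < M; m++)
        for (int y = 0; y + K <= H; y++)
            for (int x = 0; x + K <= H; x++) {
                float acc = 0.0f;
                for (int c = 0; c < C; c++)
                    for (int i = 0; i < K; i++)
                        for (int j = 0; j < K; j++) {
                            acc += in[c][y + i][x + j] * w[m][c][i][j];
                            macs++;
                        }
                out[m][y][x] = acc;
            }
    printf("multiply-accumulates for one small layer: %lld\n", macs);
    return 0;
}
```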
RowHammer: A Retrospective
This retrospective paper describes the RowHammer problem in Dynamic Random
Access Memory (DRAM), which was initially introduced by Kim et al. at the ISCA
2014 conference. RowHammer is a prime (and perhaps
the first) example of how a circuit-level failure mechanism can cause a
practical and widespread system security vulnerability. It is the phenomenon
that repeatedly accessing a row in a modern DRAM chip causes bit flips in
physically-adjacent rows at consistently predictable bit locations. RowHammer
is caused by a hardware failure mechanism called "DRAM disturbance errors",
which is a manifestation of circuit-level cell-to-cell interference in a scaled
memory technology.
Researchers from Google Project Zero demonstrated in 2015 that this hardware
failure mechanism can be effectively exploited by user-level programs to gain
kernel privileges on real systems. Many other follow-up works demonstrated
other practical attacks exploiting RowHammer. In this article, we
comprehensively survey the scientific literature on RowHammer-based attacks as
well as mitigation techniques to prevent RowHammer. We also discuss what other
related vulnerabilities may be lurking in DRAM and other types of memories,
e.g., NAND flash memory or Phase Change Memory, that can potentially threaten
the foundations of secure systems, as the memory technologies scale to higher
densities. We conclude by describing and advocating a principled approach to
memory reliability and security research that can enable us to better
anticipate and prevent such vulnerabilities.
Comment: A version of this work is to appear in IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems (TCAD), Special Issue
on Top Picks in Hardware and Embedded Security, 2019.
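The access pattern underlying RowHammer can be sketched in a few lines of C: two
addresses that map to different rows of the same bank are read alternately, with
cache-line flushes so that every read activates a DRAM row. This is purely an
illustration of the mechanism; the heap addresses used here are placeholders,
and whether they act as effective aggressor rows depends on the system's DRAM
address mapping.

```c
/* Sketch of the RowHammer access pattern (x86): repeatedly read two
 * addresses mapping to different rows of the same bank, flushing them
 * from the cache so every read reaches DRAM and activates a row. */
#include <emmintrin.h>   /* _mm_clflush (x86, SSE2) */
#include <stdint.h>
#include <stdlib.h>

static void hammer(volatile uint64_t *a, volatile uint64_t *b, long iters)
{
    for (long i = 0; i < iters; i++) {
        (void)*a;                      /* activate the row containing a */
        (void)*b;                      /* activate the row containing b */
        _mm_clflush((const void *)a);  /* force the next reads to DRAM  */
        _mm_clflush((const void *)b);
    }
}

int main(void)
{
    /* Two buffers far apart, as stand-ins for two rows of one bank. */
    uint64_t *x = malloc(1 << 20);
    uint64_t *y = malloc(1 << 20);
    if (!x || !y) return 1;
    hammer(x, y, 1000000);  /* many activations within one refresh window */
    free(x); free(y);
    return 0;
}
```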
Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Deep neural networks (DNNs) are currently widely used for many artificial
intelligence (AI) applications including computer vision, speech recognition,
and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks,
this accuracy comes at the cost of high computational complexity. Accordingly,
techniques
that enable efficient processing of DNNs to improve energy efficiency and
throughput without sacrificing application accuracy or increasing hardware cost
are critical to the wide deployment of DNNs in AI systems.
This article aims to provide a comprehensive tutorial and survey about the
recent advances towards the goal of enabling efficient processing of DNNs.
Specifically, it will provide an overview of DNNs, discuss various hardware
platforms and architectures that support DNNs, and highlight key trends in
reducing the computation cost of DNNs either solely via hardware design changes
or via joint hardware design and DNN algorithm changes. It will also summarize
various development resources that enable researchers and practitioners to
quickly get started in this field, and highlight important benchmarking metrics
and design considerations that should be used for evaluating the rapidly
growing number of DNN hardware designs, optionally including algorithmic
co-designs, being proposed in academia and industry.
The reader will take away the following concepts from this article:
understand the key design considerations for DNNs; be able to evaluate
different DNN hardware implementations with benchmarks and comparison metrics;
understand the trade-offs between various hardware architectures and platforms;
be able to evaluate the utility of various DNN design techniques for efficient
processing; and understand recent implementation trends and opportunities.
Comment: Based on a tutorial on DNN hardware at eyeriss.mit.edu/tutorial.htm
High-Performance and Energy-Efficient Memory Scheduler Design for Heterogeneous Systems
When multiple processor cores (CPUs) and a GPU integrated together on the
same chip share the off-chip DRAM, requests from the GPU can heavily interfere
with requests from the CPUs, leading to low system performance and starvation
of cores. Unfortunately, state-of-the-art memory scheduling algorithms are
ineffective at solving this problem due to the very large amount of GPU memory
traffic, unless a very large and costly request buffer is employed to provide
these algorithms with enough visibility across the global request stream.
Previously-proposed memory controller (MC) designs use a single monolithic
structure to perform three main tasks. First, the MC attempts to schedule
together requests to the same DRAM row to increase row buffer hit rates.
Second, the MC arbitrates among the requesters (CPUs and GPU) to optimize for
overall system throughput, average response time, fairness and quality of
service. Third, the MC manages the low-level DRAM command scheduling to
complete requests while ensuring compliance with all DRAM timing and power
constraints. This paper proposes a fundamentally new approach, called the
Staged Memory Scheduler (SMS), which decouples the three primary MC tasks into
three significantly simpler structures that together improve system performance
and fairness. Our evaluation shows that SMS provides a 41.2% system performance
improvement, along with improved fairness, compared to the best previous
state-of-the-art technique, while enabling a design that is significantly less
complex and more power-efficient to implement.
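A structural C sketch of this decoupling: the three memory-controller tasks
become three simple stages connected by queues rather than one monolithic
scheduler. The batching and arbitration policies shown are simplified stand-ins,
not the paper's exact algorithms.

```c
/* Structural sketch of an SMS-style staged memory controller: stage 1
 * forms per-source batches of row-hit requests, stage 2 arbitrates
 * across sources, stage 3 issues DRAM commands. Policies simplified. */
#include <stdio.h>

struct request { int source; unsigned long row; };  /* source: 0 = CPU, 1 = GPU */
struct batch   { int source; int count; unsigned long row; };

/* Stage 1: per-source batch formation groups requests to the same row. */
static struct batch form_batch(const struct request *reqs, int n)
{
    struct batch b = { reqs[0].source, 0, reqs[0].row };
    for (int i = 0; i < n && reqs[i].row == b.row; i++)
        b.count++;                       /* consecutive row hits -> one batch */
    return b;
}

/* Stage 2: batch scheduler arbitrates across sources (here, a simplified
 * policy that favors latency-sensitive CPU batches over GPU batches). */
static int pick_source(int cpu_ready, int gpu_ready)
{
    return cpu_ready ? 0 : (gpu_ready ? 1 : -1);
}

/* Stage 3: a simple in-order DRAM command scheduler issues the chosen
 * batch while respecting timing constraints (elided in this sketch). */
static void issue(struct batch b)
{
    printf("issue %d request(s) from %s to row %lu\n",
           b.count, b.source ? "GPU" : "CPU", b.row);
}

int main(void)
{
    struct request cpu_reqs[] = { {0, 7}, {0, 7}, {0, 9} };
    struct batch b = form_batch(cpu_reqs, 3);
    if (pick_source(1, 1) == 0)
        issue(b);
    return 0;
}
```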