Understanding Reduced-Voltage Operation in Modern DRAM Chips: Characterization, Analysis, and Mechanisms
The energy consumption of DRAM is a critical concern in modern computing
systems. Improvements in manufacturing process technology have allowed DRAM
vendors to lower the DRAM supply voltage conservatively, which reduces some of
the DRAM energy consumption. We would like to reduce the DRAM supply voltage
more aggressively, to further reduce energy. Aggressive supply voltage
reduction requires a thorough understanding of the effect voltage scaling has
on DRAM access latency and DRAM reliability.
In this paper, we take a comprehensive approach to understanding and
exploiting the latency and reliability characteristics of modern DRAM when the
supply voltage is lowered below the nominal voltage level specified by DRAM
standards. Using an FPGA-based testing platform, we perform an experimental
study of 124 real DDR3L (low-voltage) DRAM chips manufactured recently by three
major DRAM vendors. We find that reducing the supply voltage below a certain
point introduces bit errors in the data, and we comprehensively characterize
the behavior of these errors. We discover that these errors can be avoided by
increasing the latency of three major DRAM operations (activation, restoration,
and precharge). We perform detailed DRAM circuit simulations to validate and
explain our experimental findings. We also characterize the various
relationships between reduced supply voltage and error locations, stored data
patterns, DRAM temperature, and data retention.
Based on our observations, we propose a new DRAM energy reduction mechanism,
called Voltron. The key idea of Voltron is to use a performance model to
determine by how much we can reduce the supply voltage without introducing
errors and without exceeding a user-specified threshold for performance loss.
Voltron reduces the average system energy by 7.3% while limiting the average
system performance loss to only 1.8%, for a variety of workloads.
Comment: 25 pages, 25 figures, 7 tables. Proceedings of the ACM on Measurement
and Analysis of Computing Systems (POMACS).
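As a concrete illustration of the selection policy described above, the following
minimal C sketch picks the lowest supply voltage whose model-predicted performance
loss stays within a user-specified threshold. The candidate voltages and loss
figures are illustrative placeholders, not measurements from the paper; only the
1.35 V nominal DDR3L level is a standard value.

```c
/* A minimal sketch of the Voltron selection policy described in the
 * abstract: pick the lowest supply voltage whose model-predicted
 * performance loss stays within a user-specified threshold. The
 * voltage points and loss figures are illustrative placeholders. */
#include <stdio.h>

struct vpoint {
    double volts;          /* candidate supply voltage (V)             */
    double perf_loss_pct;  /* predicted performance loss at this point */
};

/* Return the lowest voltage whose predicted loss is acceptable. */
static double pick_voltage(const struct vpoint *pts, int n, double max_loss_pct)
{
    double best = pts[0].volts;  /* fall back to the nominal voltage */
    for (int i = 0; i < n; i++)
        if (pts[i].perf_loss_pct <= max_loss_pct && pts[i].volts < best)
            best = pts[i].volts;
    return best;
}

int main(void)
{
    /* Hypothetical model output: lower voltage -> higher predicted loss,
     * because safe operation needs longer DRAM timings. */
    struct vpoint pts[] = {
        {1.35, 0.0}, {1.30, 0.4}, {1.25, 1.1}, {1.20, 2.3}, {1.15, 4.0},
    };
    double v = pick_voltage(pts, 5, 1.8);  /* user threshold: 1.8% loss */
    printf("selected supply voltage: %.2f V\n", v);
    return 0;
}
```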
In-DRAM Bulk Bitwise Execution Engine
Many applications heavily use bitwise operations on large bitvectors as part
of their computation. In existing systems, performing such bulk bitwise
operations requires the processor to transfer a large amount of data over the
memory channel, thereby incurring high latency and consuming significant memory
bandwidth and energy.
In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk
bitwise operations completely inside main memory. Ambit exploits the internal
organization and analog operation of DRAM-based memory to achieve low cost,
high performance, and low energy. Ambit exposes a new bulk bitwise execution
model to the host processor. Evaluations show that Ambit significantly improves
the performance of several applications that use bulk bitwise operations,
including databases.
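To make the cost that Ambit eliminates concrete, the following C sketch performs
a bulk bitwise AND the conventional way: both source bitvectors cross the memory
channel into the processor and the result is written back. The 32 MiB operand
size is an arbitrary assumption; Ambit would compute the same result inside DRAM
without this traffic.

```c
/* Baseline illustration of the cost Ambit targets: a bulk bitwise AND
 * computed by the processor forces both source bitvectors (and the
 * result) across the narrow memory channel. Ambit instead performs the
 * operation inside DRAM; this sketch shows only the conventional path. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORDS (1u << 22)   /* 4M 64-bit words = 32 MiB per bitvector */

int main(void)
{
    uint64_t *a = malloc(NWORDS * sizeof *a);
    uint64_t *b = malloc(NWORDS * sizeof *b);
    uint64_t *r = malloc(NWORDS * sizeof *r);
    if (!a || !b || !r) return 1;

    for (size_t i = 0; i < NWORDS; i++) { a[i] = i; b[i] = ~i; }

    /* Every word of a and b is read over the memory channel and every
     * word of r is written back: ~96 MiB of traffic for one AND. */
    for (size_t i = 0; i < NWORDS; i++)
        r[i] = a[i] & b[i];

    printf("moved ~%zu MiB over the channel for one bulk AND\n",
           3 * NWORDS * sizeof(uint64_t) >> 20);
    free(a); free(b); free(r);
    return 0;
}
```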
A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps
Modern Graphics Processing Units (GPUs) are well provisioned to support the
concurrent execution of thousands of threads. Unfortunately, different
bottlenecks during execution and heterogeneous application requirements create
imbalances in utilization of resources in the cores. For example, when a GPU is
bottlenecked by the available off-chip memory bandwidth, its computational
resources are often overwhelmingly idle, waiting for data from memory to
arrive.
This work describes the Core-Assisted Bottleneck Acceleration (CABA)
framework that employs idle on-chip resources to alleviate different
bottlenecks in GPU execution. CABA provides flexible mechanisms to
automatically generate "assist warps" that execute on GPU cores to perform
specific tasks that can improve GPU performance and efficiency.
CABA enables the use of idle computational units and pipelines to alleviate
the memory bandwidth bottleneck, e.g., by using assist warps to perform data
compression to transfer less data from memory. Conversely, the same framework
can be employed to handle cases where the GPU is bottlenecked by the available
computational units, in which case the memory pipelines are idle and can be
used by CABA to speed up computation, e.g., by performing memoization using
assist warps.
We provide a comprehensive design and evaluation of CABA to perform effective
and flexible data compression in the GPU memory hierarchy to alleviate the
memory bandwidth bottleneck. Our extensive evaluations show that CABA, when
used to implement data compression, provides an average performance improvement
of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU
applications.
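The following back-of-envelope C model, using assumed rather than measured
numbers, shows why compressed transfers help a bandwidth-bound kernel: with
compression ratio c, the channel carries 1/c of the bytes, shrinking the
memory-bound portion of the run time until computation becomes the limit.

```c
/* Back-of-envelope model of why CABA's assist-warp compression helps a
 * bandwidth-bound kernel: if the kernel must fetch D bytes and assist
 * warps compress transfers by ratio c, the channel only carries D/c
 * bytes. The model assumes compute and memory transfer overlap
 * perfectly, so run time is the larger of the two; all numbers are
 * illustrative, not from the paper's evaluation. */
#include <stdio.h>

static double max2(double a, double b) { return a > b ? a : b; }

int main(void)
{
    double data_bytes   = 8e9;    /* bytes the kernel fetches (assumed) */
    double bandwidth    = 320e9;  /* off-chip bandwidth in bytes/s      */
    double compute_time = 0.012;  /* seconds of useful computation      */
    double ratio        = 1.5;    /* assumed average compression ratio  */

    double t_base = max2(compute_time, data_bytes / bandwidth);
    double t_caba = max2(compute_time, data_bytes / ratio / bandwidth);

    printf("baseline: %.1f ms, compressed transfers: %.1f ms (%.0f%% faster)\n",
           t_base * 1e3, t_caba * 1e3, (t_base / t_caba - 1.0) * 100.0);
    return 0;
}
```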
LISA: Increasing Internal Connectivity in DRAM for Fast Data Movement and Low Latency
This paper summarizes the idea of Low-Cost Interlinked Subarrays (LISA),
which was published in HPCA 2016, and examines the work's significance and
future potential. Contemporary systems perform bulk data movement
inefficiently, by transferring data from DRAM to the processor, and then back
to DRAM, across a narrow off-chip channel. The use of this narrow channel
results in high latency and energy consumption. Prior work proposes to avoid
these high costs by exploiting the existing wide internal DRAM bandwidth for
bulk data movement, but the limited connectivity of wires within DRAM allows
fast data movement within only a single DRAM subarray. Each subarray is only a
few megabytes in size, greatly restricting the range over which fast bulk data
movement can happen within DRAM.
Our HPCA 2016 paper proposes a new DRAM substrate, Low-Cost Inter-Linked
Subarrays (LISA), whose goal is to enable fast and efficient data movement
across a large range of memory at low cost. LISA adds low-cost connections
between adjacent subarrays. By using these connections to interconnect the
existing internal wires (bitlines) of adjacent subarrays, LISA enables
wide-bandwidth data transfer across multiple subarrays with little (only 0.8%)
DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling a variety
of new applications. We describe and evaluate three such applications in
detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a
DRAM architecture whose rows have heterogeneous access latencies, and (3)
accelerated bitline precharging by linking multiple precharge units together.
Our extensive evaluations show that each of LISA's three applications
significantly improves performance and memory energy efficiency on a variety of
workloads and system configurations.
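A minimal C model of the channel traffic at stake in bulk copies: a conventional
page copy crosses the narrow off-chip channel twice per byte (DRAM to processor,
processor back to DRAM), whereas a LISA-style inter-subarray copy keeps the data
inside the DRAM chip. The page size and copy count below are illustrative
assumptions.

```c
/* A minimal model of the bulk-copy traffic LISA eliminates. A
 * conventional page copy moves every byte over the off-chip channel
 * twice (DRAM -> processor, processor -> DRAM); a LISA-style
 * inter-subarray copy keeps the data inside the DRAM chip. */
#include <stdio.h>

int main(void)
{
    unsigned long page_bytes = 8192;      /* one DRAM row / OS page (assumed) */
    unsigned long copies     = 1000000;   /* bulk copies performed (assumed)  */

    unsigned long long conventional = 2ULL * page_bytes * copies; /* read + write */
    unsigned long long in_dram      = 0;  /* data never leaves the chip */

    printf("channel traffic, conventional copy: %llu GiB\n", conventional >> 30);
    printf("channel traffic, in-DRAM copy:      %llu GiB\n", in_dram >> 30);
    return 0;
}
```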
FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads
In this work, we propose FUSE, a novel GPU cache system that integrates
spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip
L1D cache. FUSE can minimize the number of outgoing memory accesses over the
interconnection network of the GPU's multiprocessors, which in turn allows GPUs
to better sustain their massive computing parallelism. Specifically, FUSE
predicts the read level of GPU memory accesses by extracting GPU runtime
information and places write-once-read-multiple (WORM) data blocks into the
STT-MRAM, while accommodating write-multiple data blocks in a small SRAM
portion of the L1D cache. To further reduce the off-chip memory
accesses, FUSE also allows WORM data blocks to be allocated anywhere in the
STT-MRAM by approximating the associativity with the limited number of tag
comparators and I/O peripherals. Our evaluation results show that, in
comparison to a traditional GPU cache, our proposed heterogeneous cache reduces
the number of outgoing memory references by 32% across the interconnection
network, thereby improving the overall performance by 217% and reducing energy
cost by 53%.
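The placement decision can be sketched as follows, based only on the abstract's
description: blocks predicted to be write-once-read-multiple (WORM) are steered
to the STT-MRAM portion of the L1D cache, while write-intensive blocks stay in
the small SRAM portion. The predicted_worm function is a hypothetical stand-in
for FUSE's actual runtime predictor.

```c
/* Sketch of the FUSE placement decision described in the abstract:
 * WORM data tolerates STT-MRAM's slower, more energy-hungry writes
 * because it is written once; write-intensive data stays in SRAM.
 * The predictor below is a placeholder, not the paper's mechanism. */
#include <stdbool.h>
#include <stdio.h>

enum partition { PART_SRAM, PART_STT_MRAM };

/* Hypothetical predictor: returns true if runtime information suggests
 * the block will be written once and then only read. */
static bool predicted_worm(unsigned long addr)
{
    return (addr & 0x1) == 0;  /* placeholder heuristic for the sketch */
}

static enum partition place_block(unsigned long addr)
{
    return predicted_worm(addr) ? PART_STT_MRAM : PART_SRAM;
}

int main(void)
{
    unsigned long addrs[] = { 0x1000, 0x1001, 0x2000, 0x2003 };
    for (int i = 0; i < 4; i++)
        printf("block %#lx -> %s\n", addrs[i],
               place_block(addrs[i]) == PART_STT_MRAM ? "STT-MRAM" : "SRAM");
    return 0;
}
```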
Tiered-Latency DRAM (TL-DRAM)
This paper summarizes the idea of Tiered-Latency DRAM, which was published in
HPCA 2013. The key goal of TL-DRAM is to provide low DRAM latency at low cost,
a critical problem in modern memory systems. To this end, TL-DRAM introduces
heterogeneity into the design of a DRAM subarray by segmenting the bitlines,
thereby creating a low-latency, low-energy, low-capacity portion in the
subarray (called the near segment), which is close to the sense amplifiers, and
a high-latency, high-energy, high-capacity portion, which is farther away from
the sense amplifiers. Thus, DRAM becomes heterogeneous with a small portion
having lower latency and a large portion having higher latency. Various
techniques can be employed to take advantage of the low-latency near segment
and this new heterogeneous DRAM substrate, including hardware-based caching,
software-based caching, and memory allocation of frequently used data in the
near segment. Evaluations with such simple techniques show significant
performance and energy-efficiency benefits.
Comment: This is a summary of the original paper, entitled "Tiered-Latency
DRAM: A Low Latency and Low Cost DRAM Architecture", which appears in HPCA
2013.
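A simple average-latency model, with assumed latencies and hit rate rather than
the paper's figures, shows why placing frequently used data in the low-latency
near segment pays off:

```c
/* Average-latency model for a TL-DRAM-style subarray: accesses that hit
 * the small near segment see a low latency, the rest pay the far-segment
 * latency. All values below are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    double t_near   = 8.0;   /* ns, near-segment access (assumed)          */
    double t_far    = 14.0;  /* ns, far-segment access (assumed)           */
    double hit_rate = 0.7;   /* fraction of accesses served near (assumed) */

    double avg = hit_rate * t_near + (1.0 - hit_rate) * t_far;
    printf("average access latency: %.1f ns (vs. %.1f ns without caching)\n",
           avg, t_far);
    return 0;
}
```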
FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies and the availability of
credible data, deep learning, an area of artificial intelligence, has emerged
and demonstrated its effectiveness in solving complex learning problems that
were previously intractable. In particular, convolutional neural networks
(CNNs) have demonstrated their effectiveness in image detection and recognition
applications. However, they require so much computation and memory bandwidth
that general-purpose CPUs fail to achieve the desired performance levels.
Consequently, hardware accelerators that use application specific integrated
circuits (ASICs), field programmable gate arrays (FPGAs), and graphic
processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have been recently adopted for accelerating the
implementation of deep learning networks due to their ability to maximize
parallelism as well as due to their energy efficiency. In this paper, we review
recent existing techniques for accelerating deep learning networks on FPGAs. We
highlight the key features employed by the various techniques for improving the
acceleration performance. In addition, we provide recommendations for enhancing
the utilization of FPGAs for CNN acceleration. The techniques investigated in
this paper represent the recent trends in FPGA-based accelerators of deep
learning networks. Thus, this review is expected to direct the future advances
on efficient hardware accelerators and to be useful for deep learning
researchers.
Comment: This article has been accepted for publication in IEEE Access
(December 2018).
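To make the compute intensity behind this motivation concrete, the following
naive C convolution loop, with illustrative layer dimensions, counts the
multiply-accumulate operations a single small CNN layer requires; FPGA
accelerators exploit the abundant parallelism in exactly these loop nests.

```c
/* Naive convolution layer: every output element needs K*K*C
 * multiply-accumulates. Dimensions are illustrative only. */
#include <stdio.h>

#define C 64   /* input channels      */
#define H 56   /* input height/width  */
#define K 3    /* kernel size         */
#define M 64   /* output channels     */

static float in[C][H][H], w[M][C][K][K], out[M][H - K + 1][H - K + 1];

int main(void)
{
    long long macs = 0;
    for (int m = 0; m < M; m++)
        for (int y = 0; y + K <= H; y++)
            for (int x = 0; x + K <= H; x++) {
                float acc = 0.0f;
                for (int c = 0; c < C; c++)
                    for (int i = 0; i < K; i++)
                        for (int j = 0; j < K; j++) {
                            acc += in[c][y + i][x + j] * w[m][c][i][j];
                            macs++;
                        }
                out[m][y][x] = acc;
            }
    printf("multiply-accumulates for one small layer: %lld\n", macs);
    return 0;
}
```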
RowHammer: A Retrospective
This retrospective paper describes the RowHammer problem in Dynamic Random
Access Memory (DRAM), which was initially introduced by Kim et al. at the ISCA
2014 conference. RowHammer is a prime (and perhaps
the first) example of how a circuit-level failure mechanism can cause a
practical and widespread system security vulnerability. It is the phenomenon
that repeatedly accessing a row in a modern DRAM chip causes bit flips in
physically-adjacent rows at consistently predictable bit locations. RowHammer
is caused by a hardware failure mechanism called "DRAM disturbance errors",
which is a manifestation of circuit-level cell-to-cell interference in a scaled
memory technology.
Researchers from Google Project Zero demonstrated in 2015 that this hardware
failure mechanism can be effectively exploited by user-level programs to gain
kernel privileges on real systems. Many other follow-up works demonstrated
other practical attacks exploiting RowHammer. In this article, we
comprehensively survey the scientific literature on RowHammer-based attacks as
well as mitigation techniques to prevent RowHammer. We also discuss what other
related vulnerabilities may be lurking in DRAM and other types of memories,
e.g., NAND flash memory or Phase Change Memory, that can potentially threaten
the foundations of secure systems, as the memory technologies scale to higher
densities. We conclude by describing and advocating a principled approach to
memory reliability and security research that can enable us to better
anticipate and prevent such vulnerabilities.
Comment: A version of this work is to appear in IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems (TCAD), Special Issue
on Top Picks in Hardware and Embedded Security, 2019.
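The access pattern underlying RowHammer can be sketched in a few lines of C: two
addresses that map to different rows of the same bank are read alternately, with
cache-line flushes so that every read activates a DRAM row. This is purely an
illustration of the mechanism; the heap addresses used here are placeholders,
and whether they act as effective aggressor rows depends on the system's DRAM
address mapping.

```c
/* Sketch of the RowHammer access pattern (x86): repeatedly read two
 * addresses mapping to different rows of the same bank, flushing them
 * from the cache so every read reaches DRAM and activates a row. */
#include <emmintrin.h>   /* _mm_clflush (x86, SSE2) */
#include <stdint.h>
#include <stdlib.h>

static void hammer(volatile uint64_t *a, volatile uint64_t *b, long iters)
{
    for (long i = 0; i < iters; i++) {
        (void)*a;                      /* activate the row containing a */
        (void)*b;                      /* activate the row containing b */
        _mm_clflush((const void *)a);  /* force the next reads to DRAM  */
        _mm_clflush((const void *)b);
    }
}

int main(void)
{
    /* Two buffers far apart, as stand-ins for two rows of one bank. */
    uint64_t *x = malloc(1 << 20);
    uint64_t *y = malloc(1 << 20);
    if (!x || !y) return 1;
    hammer(x, y, 1000000);  /* many activations within one refresh window */
    free(x); free(y);
    return 0;
}
```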
Efficient Processing of Deep Neural Networks: A Tutorial and Survey
Deep neural networks (DNNs) are currently widely used for many artificial
intelligence (AI) applications including computer vision, speech recognition,
and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks,
this accuracy comes at the cost of high computational complexity. Accordingly,
techniques
that enable efficient processing of DNNs to improve energy efficiency and
throughput without sacrificing application accuracy or increasing hardware cost
are critical to the wide deployment of DNNs in AI systems.
This article aims to provide a comprehensive tutorial and survey about the
recent advances towards the goal of enabling efficient processing of DNNs.
Specifically, it will provide an overview of DNNs, discuss various hardware
platforms and architectures that support DNNs, and highlight key trends in
reducing the computation cost of DNNs either solely via hardware design changes
or via joint hardware design and DNN algorithm changes. It will also summarize
various development resources that enable researchers and practitioners to
quickly get started in this field, and highlight important benchmarking metrics
and design considerations that should be used for evaluating the rapidly
growing number of DNN hardware designs, optionally including algorithmic
co-designs, being proposed in academia and industry.
The reader will take away the following concepts from this article:
understand the key design considerations for DNNs; be able to evaluate
different DNN hardware implementations with benchmarks and comparison metrics;
understand the trade-offs between various hardware architectures and platforms;
be able to evaluate the utility of various DNN design techniques for efficient
processing; and understand recent implementation trends and opportunities.
Comment: Based on a tutorial on DNN hardware at eyeriss.mit.edu/tutorial.htm
High-Performance and Energy-Efficient Memory Scheduler Design for Heterogeneous Systems
When multiple processor cores (CPUs) and a GPU integrated together on the
same chip share the off-chip DRAM, requests from the GPU can heavily interfere
with requests from the CPUs, leading to low system performance and starvation
of cores. Unfortunately, state-of-the-art memory scheduling algorithms are
ineffective at solving this problem due to the very large amount of GPU memory
traffic, unless a very large and costly request buffer is employed to provide
these algorithms with enough visibility across the global request stream.
Previously-proposed memory controller (MC) designs use a single monolithic
structure to perform three main tasks. First, the MC attempts to schedule
together requests to the same DRAM row to increase row buffer hit rates.
Second, the MC arbitrates among the requesters (CPUs and GPU) to optimize for
overall system throughput, average response time, fairness and quality of
service. Third, the MC manages the low-level DRAM command scheduling to
complete requests while ensuring compliance with all DRAM timing and power
constraints. This paper proposes a fundamentally new approach, called the
Staged Memory Scheduler (SMS), which decouples the three primary MC tasks into
three significantly simpler structures that together improve system performance
and fairness. Our evaluation shows that SMS provides a 41.2% system performance
improvement, along with improved fairness, compared to the best previous
state-of-the-art technique, while enabling a design that is significantly less
complex and more power-efficient to implement.
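A structural C sketch of this decoupling: the three memory-controller tasks
become three simple stages connected by queues rather than one monolithic
scheduler. The batching and arbitration policies shown are simplified stand-ins,
not the paper's exact algorithms.

```c
/* Structural sketch of an SMS-style staged memory controller: stage 1
 * forms per-source batches of row-hit requests, stage 2 arbitrates
 * across sources, stage 3 issues DRAM commands. Policies simplified. */
#include <stdio.h>

struct request { int source; unsigned long row; };  /* source: 0 = CPU, 1 = GPU */
struct batch   { int source; int count; unsigned long row; };

/* Stage 1: per-source batch formation groups requests to the same row. */
static struct batch form_batch(const struct request *reqs, int n)
{
    struct batch b = { reqs[0].source, 0, reqs[0].row };
    for (int i = 0; i < n && reqs[i].row == b.row; i++)
        b.count++;                       /* consecutive row hits -> one batch */
    return b;
}

/* Stage 2: batch scheduler arbitrates across sources (here, a simplified
 * policy that favors latency-sensitive CPU batches over GPU batches). */
static int pick_source(int cpu_ready, int gpu_ready)
{
    return cpu_ready ? 0 : (gpu_ready ? 1 : -1);
}

/* Stage 3: a simple in-order DRAM command scheduler issues the chosen
 * batch while respecting timing constraints (elided in this sketch). */
static void issue(struct batch b)
{
    printf("issue %d request(s) from %s to row %lu\n",
           b.count, b.source ? "GPU" : "CPU", b.row);
}

int main(void)
{
    struct request cpu_reqs[] = { {0, 7}, {0, 7}, {0, 9} };
    struct batch b = form_batch(cpu_reqs, 3);
    if (pick_source(1, 1) == 0)
        issue(b);
    return 0;
}
```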