81 research outputs found
LISA: Increasing Internal Connectivity in DRAM for Fast Data Movement and Low Latency
This paper summarizes the idea of Low-Cost Interlinked Subarrays (LISA),
which was published in HPCA 2016, and examines the work's significance and
future potential. Contemporary systems perform bulk data movement
inefficiently, by transferring data from DRAM to the processor, and then back
to DRAM, across a narrow off-chip channel. The use of this narrow channel
results in high latency and energy consumption. Prior work proposes to avoid
these high costs by exploiting the existing wide internal DRAM bandwidth for
bulk data movement, but the limited connectivity of wires within DRAM allows
fast data movement within only a single DRAM subarray. Each subarray is only a
few megabytes in size, greatly restricting the range over which fast bulk data
movement can happen within DRAM.
Our HPCA 2016 paper proposes a new DRAM substrate, Low-Cost Inter-Linked
Subarrays (LISA), whose goal is to enable fast and efficient data movement
across a large range of memory at low cost. LISA adds low-cost connections
between adjacent subarrays. By using these connections to interconnect the
existing internal wires (bitlines) of adjacent subarrays, LISA enables
wide-bandwidth data transfer across multiple subarrays with little (only 0.8%)
DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling a variety
of new applications. We describe and evaluate three such applications in
detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a
DRAM architecture whose rows have heterogeneous access latencies, and (3)
accelerated bitline precharging by linking multiple precharge units together.
Our extensive evaluations show that each of LISA's three applications
significantly improves performance and memory energy efficiency on a variety of
workloads and system configurations.
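To make the bandwidth argument concrete, below is a minimal back-of-the-envelope latency model (in Python) contrasting a conventional copy over the narrow off-chip channel with a LISA-style copy that hops a row across linked subarrays. The row size, channel width, timing constants, and per-hop cost are all illustrative assumptions, not values from the paper.

```python
# Back-of-the-envelope latency model: conventional off-chip bulk row copy
# vs. a LISA-style in-DRAM copy that hops the row across linked subarrays.
# All constants are illustrative assumptions, not values from the paper.

ROW_SIZE_BYTES = 8192         # one DRAM row (assumed)
CHANNEL_BYTES_PER_CYCLE = 16  # narrow off-chip channel width (assumed)
T_ACT, T_RD, T_WR, T_PRE = 35, 15, 15, 15  # DRAM timings in cycles (assumed)
T_HOP = 8                     # one row-buffer hop between adjacent subarrays (assumed)

def conventional_copy_cycles() -> int:
    """Read the row over the channel to the processor, then write it back."""
    transfer = ROW_SIZE_BYTES // CHANNEL_BYTES_PER_CYCLE
    return (T_ACT + T_RD + transfer) + (T_ACT + T_WR + transfer) + T_PRE

def lisa_copy_cycles(src_subarray: int, dst_subarray: int) -> int:
    """Activate the source row, hop the full row buffer across intermediate
    subarrays, then restore it into the destination -- no channel transfer."""
    hops = abs(dst_subarray - src_subarray)
    return T_ACT + hops * T_HOP + T_ACT + T_PRE

if __name__ == "__main__":
    for dist in (1, 4, 16):
        print(f"subarray distance {dist:2d}: "
              f"conventional={conventional_copy_cycles()} cycles, "
              f"LISA={lisa_copy_cycles(0, dist)} cycles")
```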
Reducing DRAM Refresh Overheads with Refresh-Access Parallelism
This article summarizes the idea of "refresh-access parallelism," which was
published in HPCA 2014, and examines the work's significance and future
potential. The overarching objective of our HPCA 2014 paper is to reduce the
significant negative performance impact of DRAM refresh with intelligent memory
controller mechanisms.
To mitigate the negative performance impact of DRAM refresh, our HPCA 2014
paper proposes two complementary mechanisms, DARP (Dynamic Access Refresh
Parallelization) and SARP (Subarray Access Refresh Parallelization). The goal
is to address the drawbacks of the state-of-the-art per-bank refresh mechanism by
building more efficient techniques to parallelize refreshes and accesses within
DRAM. First, instead of issuing per-bank refreshes in a round-robin order, as
is done today, DARP issues per-bank refreshes to idle banks in an
out-of-order manner. Furthermore, DARP proactively schedules refreshes during
intervals when a batch of writes is draining to DRAM. Second, SARP exploits
the existence of mostly-independent subarrays within a bank. With minor
modifications to DRAM organization, it allows a bank to serve memory accesses
to an idle subarray while another subarray is being refreshed. Our extensive
evaluations on a wide variety of workloads and systems show that our mechanisms
improve system performance (and energy efficiency) compared to three
state-of-the-art refresh policies, and their performance benefits increase as
DRAM density increases.
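As a rough illustration of DARP's out-of-order refresh scheduling, here is a toy Python sketch that steers the next per-bank refresh to an idle bank instead of following a fixed round-robin order. The bank count, queue structure, and idleness heuristic are assumptions for illustration, not the paper's exact algorithm.

```python
# Toy DARP-style refresh scheduler: instead of refreshing banks in a fixed
# round-robin order, steer the next per-bank refresh to an idle bank.
# Bank count, queue structure, and the idleness heuristic are illustrative
# assumptions, not the paper's exact algorithm.

from collections import deque

NUM_BANKS = 8

def darp_pick_refresh_bank(pending_per_bank, refresh_due):
    """Among banks that still owe a refresh this interval, prefer one with
    no pending demand requests; otherwise pick the least-loaded bank."""
    idle = [b for b in sorted(refresh_due) if not pending_per_bank[b]]
    if idle:
        return idle[0]
    if refresh_due:
        return min(refresh_due, key=lambda b: len(pending_per_bank[b]))
    return None

if __name__ == "__main__":
    pending = {b: deque() for b in range(NUM_BANKS)}
    pending[0].extend(["rd", "rd"])  # bank 0 is busy with reads
    pending[1].append("wr")          # bank 1 is busy with a write
    due = {0, 1, 2}                  # banks that still owe a refresh
    print("refresh bank", darp_pick_refresh_bank(pending, due))  # -> 2 (idle)
```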
Tiered-Latency DRAM (TL-DRAM)
This paper summarizes the idea of Tiered-Latency DRAM, which was published in
HPCA 2013. The key goal of TL-DRAM is to provide low DRAM latency at low cost,
a critical problem in modern memory systems. To this end, TL-DRAM introduces
heterogeneity into the design of a DRAM subarray by segmenting the bitlines,
thereby creating a low-latency, low-energy, low-capacity portion in the
subarray (called the near segment), which is close to the sense amplifiers, and
a high-latency, high-energy, high-capacity portion, which is farther away from
the sense amplifiers. Thus, DRAM becomes heterogeneous with a small portion
having lower latency and a large portion having higher latency. Various
techniques can be employed to take advantage of the low-latency near segment
and this new heterogeneous DRAM substrate, including hardware-based caching and
software-based caching and memory allocation of frequently used data in the
near segment. Evaluations with simple versions of such techniques show
significant performance and energy-efficiency benefits.
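As one hedged illustration of hardware-based caching on this substrate, the Python sketch below treats the near segment as an LRU cache for recently accessed rows. The segment size, the two latency values, and the LRU policy are illustrative assumptions rather than the HPCA 2013 design.

```python
# Minimal sketch of one way to use TL-DRAM's near segment: treat it as a
# cache for frequently accessed rows. Segment size, latencies, and the LRU
# policy are illustrative assumptions.

from collections import OrderedDict

NEAR_SEGMENT_ROWS = 32          # small, low-latency portion (assumed)
T_NEAR, T_FAR = 25, 50          # access latencies in cycles (assumed)

class TLDRAMSubarray:
    def __init__(self):
        self.near = OrderedDict()   # row -> True, kept in LRU order

    def access(self, row: int) -> int:
        """Return the latency of accessing `row`, caching it in the near segment."""
        if row in self.near:
            self.near.move_to_end(row)       # refresh LRU position
            return T_NEAR
        # Miss: serve from the far segment, then copy the row into the
        # near segment (inter-segment copy is cheap in TL-DRAM).
        if len(self.near) >= NEAR_SEGMENT_ROWS:
            self.near.popitem(last=False)    # evict the LRU row
        self.near[row] = True
        return T_FAR

if __name__ == "__main__":
    sub = TLDRAMSubarray()
    trace = [7, 7, 7, 123, 7, 123]
    print([sub.access(r) for r in trace])    # repeated rows are served fast
```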
Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in Modern DRAM Chips
This article summarizes key results of our work on experimental
characterization and analysis of latency variation and latency-reliability
trade-offs in modern DRAM chips, which was published in SIGMETRICS 2016, and
examines the work's significance and future potential.
The goal of this work is to (i) experimentally characterize and understand
the latency variation across cells within a DRAM chip for three fundamental
DRAM operations (activation, precharge, and restoration), and (ii) develop new
mechanisms that exploit our
understanding of the latency variation to reliably improve performance. To this
end, we comprehensively characterize 240 DRAM chips from three major vendors,
and make six major new observations about latency variation within DRAM.
Notably, we find that (i) there is large latency variation across the cells for
each of the three operations; (ii) variation characteristics exhibit
significant spatial locality: slower cells are clustered in certain regions of
a DRAM chip; and (iii) the three fundamental operations exhibit different
reliability characteristics when the latency of each operation is reduced.
Based on our observations, we propose Flexible-LatencY DRAM (FLY-DRAM), a
mechanism that exploits latency variation across DRAM cells within a DRAM chip
to improve system performance. The key idea of FLY-DRAM is to exploit the
spatial locality of slower cells within DRAM, and access the faster DRAM
regions with reduced latencies for the fundamental operations. Our evaluations
show that FLY-DRAM improves the performance of a wide range of applications by
13.3%, 17.6%, and 19.5%, on average, for each of the three different vendors'
real DRAM chips, in a simulated 8-core system.
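A minimal sketch of the FLY-DRAM idea, under assumed region granularity and timing values: the memory controller keeps per-region timing parameters obtained from offline profiling and issues each request with the fastest timings its region reliably supports.

```python
# Sketch of region-aware timings in the FLY-DRAM spirit: use profiled,
# reliably reduced timings for fast regions and fall back to conservative
# datasheet timings elsewhere. Granularity and values are assumptions.

DEFAULT_TRCD = 13            # conservative datasheet timing in cycles (assumed)
REGION_SHIFT = 12            # profile at 4096-row granularity (assumed)

# Hypothetical profiling result: region id -> reliably reduced tRCD.
profiled_trcd = {0: 10, 1: 10, 2: 13, 3: 11}

def trcd_for_row(row: int) -> int:
    """Pick the timing for this row's region; unprofiled regions stay safe."""
    region = row >> REGION_SHIFT
    return profiled_trcd.get(region, DEFAULT_TRCD)

if __name__ == "__main__":
    for row in (100, 5000, 9000, 70000):
        print(f"row {row}: tRCD = {trcd_for_row(row)} cycles")
```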
Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency
This paper summarizes the idea of ChargeCache, which was published in HPCA
2016 [51], and examines the work's significance and future potential. DRAM
latency continues to be a critical bottleneck for system performance. In this
work, we develop a low-cost mechanism, called ChargeCache, that enables faster
access to recently-accessed rows in DRAM, with no modifications to DRAM chips.
Our mechanism is based on the key observation that a recently-accessed row has
more charge and thus the following access to the same row can be performed
faster. To exploit this observation, we propose to track the addresses of
recently-accessed rows in a table in the memory controller. If a later DRAM
request hits in that table, the memory controller uses lower timing parameters,
leading to reduced DRAM latency. Row addresses are removed from the table after
a specified duration to ensure rows that have leaked too much charge are not
accessed with lower latency. We evaluate ChargeCache on a wide variety of
workloads and show that it provides significant performance and energy benefits
for both single-core and multi-core systems.
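The mechanism lends itself to a compact sketch. The Python model below keeps a small table of recently closed rows in the memory controller and serves hits within an expiry window using reduced timings; the table size, expiry window, and timing values are illustrative assumptions.

```python
# Minimal sketch of a ChargeCache-like table in the memory controller:
# remember recently accessed rows and, within an expiry window, serve hits
# with lowered timing parameters. Sizes, window, and timings are assumed.

from collections import OrderedDict

TABLE_ENTRIES = 128
EXPIRY_NS = 1_000_000               # drop entries before cells leak too much (assumed)
T_RCD_DEFAULT, T_RCD_FAST = 13, 9   # cycles (assumed)

class ChargeCache:
    def __init__(self):
        self.table = OrderedDict()  # (bank, row) -> insertion time in ns

    def on_precharge(self, bank: int, row: int, now_ns: int) -> None:
        """A row was just closed: its cells hold extra charge, so remember it."""
        if len(self.table) >= TABLE_ENTRIES:
            self.table.popitem(last=False)   # evict the oldest entry
        self.table[(bank, row)] = now_ns

    def trcd_for(self, bank: int, row: int, now_ns: int) -> int:
        """Use reduced timings only if the row was accessed recently enough."""
        t = self.table.get((bank, row))
        if t is not None and now_ns - t <= EXPIRY_NS:
            return T_RCD_FAST
        self.table.pop((bank, row), None)    # expired or absent: stay safe
        return T_RCD_DEFAULT

if __name__ == "__main__":
    cc = ChargeCache()
    cc.on_precharge(0, 42, now_ns=0)
    print(cc.trcd_for(0, 42, now_ns=500_000))    # hit within window -> 9
    print(cc.trcd_for(0, 42, now_ns=2_000_000))  # expired -> 13
```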
Exploiting the DRAM Microarchitecture to Increase Memory-Level Parallelism
This paper summarizes the idea of Subarray-Level Parallelism (SALP) in DRAM,
which was published in ISCA 2012, and examines the work's significance and
future potential. Modern DRAMs have multiple banks to serve multiple memory
requests in parallel. However, when two requests go to the same bank, they have
to be served serially, exacerbating the high latency of off-chip memory. Adding
more banks to the system to mitigate this problem incurs high system cost. Our
goal in this work is to achieve the benefits of increasing the number of banks
with a low-cost approach. To this end, we propose three new mechanisms, SALP-1,
SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the
serialization of different requests that go to the same bank. The key
observation exploited by our mechanisms is that a modern DRAM bank is
implemented as a collection of subarrays that operate largely independently
while sharing few global peripheral structures.
Our three proposed mechanisms mitigate the negative impact of bank
serialization by overlapping different components of the bank access latencies
of multiple requests that go to different subarrays within the same bank.
SALP-1 requires no changes to the existing DRAM structure, and needs to only
reinterpret some of the existing DRAM timing parameters. SALP-2 and MASA
require only modest changes (< 0.15% area overhead) to the DRAM peripheral
structures, which are much less design constrained than the DRAM core. Our
evaluations show that SALP-1, SALP-2 and MASA significantly improve performance
for both single-core systems (7%/13%/17%) and multi-core systems (15%/16%/20%),
averaged across a wide range of workloads. We also demonstrate that our
mechanisms can be combined with application-aware memory request scheduling in
multicore systems to further improve performance and fairness.
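To illustrate why subarray-level parallelism helps, here is a toy Python model: two requests to the same bank serialize fully unless they map to different subarrays, in which case part of their access latencies can overlap. The subarray mapping, latency, and overlap fraction are assumptions for illustration, not the paper's measured values.

```python
# Toy model of subarray-level parallelism: same-bank requests serialize,
# unless they fall in different subarrays, where parts of the two access
# latencies can overlap. Mapping and overlap factor are assumptions.

SUBARRAYS_PER_BANK = 64
ROWS_PER_SUBARRAY = 512
T_ACCESS = 50      # full bank access latency in cycles (assumed)
OVERLAP = 0.6      # fraction of latency overlappable across subarrays (assumed)

def subarray_of(row: int) -> int:
    return (row // ROWS_PER_SUBARRAY) % SUBARRAYS_PER_BANK

def two_request_latency(row_a: int, row_b: int) -> float:
    """Latency to serve two back-to-back requests to the same bank."""
    if subarray_of(row_a) == subarray_of(row_b):
        return 2.0 * T_ACCESS                    # fully serialized
    return T_ACCESS + (1 - OVERLAP) * T_ACCESS   # partially overlapped

if __name__ == "__main__":
    print(two_request_latency(10, 20))     # same subarray -> 100.0
    print(two_request_latency(10, 5000))   # different subarrays -> 70.0
```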
In-DRAM Bulk Bitwise Execution Engine
Many applications heavily use bitwise operations on large bitvectors as part
of their computation. In existing systems, performing such bulk bitwise
operations requires the processor to transfer a large amount of data on the
memory channel, thereby consuming high latency, memory bandwidth, and energy.
In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk
bitwise operations completely inside main memory. Ambit exploits the internal
organization and analog operation of DRAM-based memory to achieve low cost,
high performance, and low energy. Ambit exposes a new bulk bitwise execution
model to the host processor. Evaluations show that Ambit significantly improves
the performance of several applications that use bulk bitwise operations,
including databases.
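Functionally, Ambit's core primitive can be sketched in a few lines: simultaneously activating three DRAM rows computes the bitwise majority of their contents, from which AND and OR follow by presetting the third row to all zeros or all ones (NOT is provided separately by dual-contact cells). The Python sketch below models only this logic; the underlying analog charge-sharing is abstracted away.

```python
# Functional model of Ambit's triple-row activation: the result is the
# bitwise majority of three rows. AND and OR are obtained by presetting
# the third (control) row. The analog operation itself is abstracted away.

def maj(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equal-width bitvectors."""
    return (a & b) | (b & c) | (a & c)

def ambit_and(a: int, b: int) -> int:
    return maj(a, b, 0)             # control row preset to all zeros

def ambit_or(a: int, b: int, width: int) -> int:
    ones = (1 << width) - 1
    return maj(a, b, ones)          # control row preset to all ones

if __name__ == "__main__":
    a, b, width = 0b1100, 0b1010, 4
    print(bin(ambit_and(a, b)))         # 0b1000
    print(bin(ambit_or(a, b, width)))   # 0b1110
```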
Enabling Practical Processing in and near Memory for Data-Intensive Computing
Modern computing systems suffer from the dichotomy between computation on one
side, which is performed only in the processor (and accelerators), and data
storage/movement on the other, which all other parts of the system are
dedicated to. Due to this dichotomy, data moves a lot in order for the system
to perform computation on it. Unfortunately, data movement is extremely
expensive in terms of energy and latency, much more so than computation. As a
result, a large fraction of system energy is spent and performance is lost
solely on moving data in a modern computing system.
In this work, we re-examine the idea of reducing data movement by performing
Processing in Memory (PIM). PIM places computation mechanisms in or near where
the data is stored (i.e., inside the memory chips, in the logic layer of
3D-stacked logic and DRAM, or in the memory controllers), so that data movement
between the computation units and memory is reduced or eliminated. While the
idea of PIM is not new, we examine two new approaches to enabling PIM: 1)
exploiting analog properties of DRAM to perform massively-parallel operations
in memory, and 2) exploiting 3D-stacked memory technology design to provide
high bandwidth to in-memory logic. We conclude by discussing work on solving
key challenges to the practical adoption of PIM.
Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity
In modern systems, DRAM-based main memory is significantly slower than the
processor. Consequently, processors spend a long time waiting to access data
from main memory, making the long main memory access latency one of the most
critical bottlenecks to achieving high system performance. Unfortunately, the
latency of DRAM has remained almost constant in the past decade. This is mainly
because DRAM has been optimized for cost-per-bit, rather than access latency.
As a result, DRAM latency is not reducing with technology scaling, and
continues to be an important performance bottleneck in modern and future
systems.
This dissertation seeks to achieve low latency DRAM-based memory systems at
low cost in three major directions. First, based on the observation that long
bitlines in DRAM are one of the dominant sources of DRAM latency, we propose a
new DRAM architecture, Tiered-Latency DRAM (TL-DRAM), which divides the long
bitline into two shorter segments using an isolation transistor, allowing one
segment to be accessed with reduced latency. Second, we propose a fine-grained
DRAM latency reduction mechanism, Adaptive-Latency DRAM, which optimizes DRAM
latency for the common operating conditions of each individual DRAM module.
Third,
we propose a new technique, Architectural-Variation-Aware DRAM (AVA-DRAM),
which reduces DRAM latency at low cost, by profiling and identifying only the
inherently slower regions in DRAM to dynamically determine the lowest latency
DRAM can operate at without causing failures.
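As a hedged sketch of the second direction, Adaptive-Latency DRAM can be pictured as a small lookup from (module, temperature band) to profiled-safe timing parameters, falling back to conservative datasheet values when no profile exists. The table contents, temperature bands, and timing values below are illustrative assumptions.

```python
# Sketch of per-module, per-temperature timing selection in the spirit of
# Adaptive-Latency DRAM. Profiled values, bands, and the threshold are
# illustrative assumptions, not measured data.

# Hypothetical profiling result: (module id, temperature band) -> safe tRCD.
safe_trcd = {
    ("module_A", "cool"): 10,   # band boundary at 55 C is assumed
    ("module_A", "hot"):  12,
    ("module_B", "cool"): 11,
    ("module_B", "hot"):  13,
}
DATASHEET_TRCD = 13             # worst-case datasheet timing (assumed)

def adaptive_trcd(module: str, temp_c: float) -> int:
    """Pick the profiled-safe timing for this module and temperature."""
    band = "cool" if temp_c <= 55.0 else "hot"
    return safe_trcd.get((module, band), DATASHEET_TRCD)

if __name__ == "__main__":
    print(adaptive_trcd("module_A", 40.0))   # 10
    print(adaptive_trcd("module_A", 70.0))   # 12
    print(adaptive_trcd("module_C", 40.0))   # unprofiled module -> 13
```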
This dissertation provides a detailed analysis of DRAM latency by using both
circuit-level simulation with a detailed DRAM model and FPGA-based profiling of
real DRAM modules. Our latency analysis shows that our low latency DRAM
mechanisms enable significant latency reductions, leading to large improvement
in both system performance and energy efficiency.
CLR-DRAM: A Low-Cost DRAM Architecture Enabling Dynamic Capacity-Latency Trade-Off
DRAM is the prevalent main memory technology, but its long access latency can
limit the performance of many workloads. Although prior works provide DRAM
designs that reduce DRAM access latency, their reduced storage capacities
hinder the performance of workloads that need large memory capacity. Because
the capacity-latency trade-off is fixed at design time, previous works cannot
achieve maximum performance under very different and dynamic workload demands.
This paper proposes Capacity-Latency-Reconfigurable DRAM (CLR-DRAM), a new
DRAM architecture that enables dynamic capacity-latency trade-off at low cost.
CLR-DRAM allows dynamic reconfiguration of any DRAM row to switch between two
operating modes: 1) max-capacity mode, where every DRAM cell operates
individually to achieve approximately the same storage density as a
density-optimized commodity DRAM chip and 2) high-performance mode, where two
adjacent DRAM cells in a DRAM row and their sense amplifiers are coupled to
operate as a single low-latency logical cell driven by a single logical sense
amplifier.
We implement CLR-DRAM by adding isolation transistors in each DRAM subarray.
Our evaluations show that CLR-DRAM can improve system performance by 18.6% and
reduce DRAM energy consumption by 29.7%, on average, across four-core
multiprogrammed
workloads. We believe that CLR-DRAM opens new research directions for a system
to adapt to the diverse and dynamically changing memory capacity and access
latency demands of workloads.
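A toy Python model of the per-row trade-off CLR-DRAM exposes: in max-capacity mode each cell stores one bit, while in high-performance mode two adjacent cells are coupled into one faster logical cell, halving the row's capacity. The cell count and latency values are illustrative assumptions.

```python
# Toy model of CLR-DRAM's per-row capacity-latency trade-off. Coupling two
# adjacent cells halves capacity but lowers access latency. The constants
# are illustrative assumptions.

from enum import Enum

CELLS_PER_ROW = 65536
T_MAX_CAPACITY, T_HIGH_PERF = 50, 32   # access latency in cycles (assumed)

class RowMode(Enum):
    MAX_CAPACITY = "max-capacity"
    HIGH_PERFORMANCE = "high-performance"

def row_capacity_bits(mode: RowMode) -> int:
    """High-performance mode couples cell pairs, halving storage."""
    return CELLS_PER_ROW if mode is RowMode.MAX_CAPACITY else CELLS_PER_ROW // 2

def row_latency(mode: RowMode) -> int:
    return T_MAX_CAPACITY if mode is RowMode.MAX_CAPACITY else T_HIGH_PERF

if __name__ == "__main__":
    for mode in RowMode:
        print(f"{mode.value}: {row_capacity_bits(mode)} bits, "
              f"{row_latency(mode)} cycles")
```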