Tiered-Latency DRAM (TL-DRAM)
This paper summarizes the idea of Tiered-Latency DRAM, which was published in
HPCA 2013. The key goal of TL-DRAM is to provide low DRAM latency at low cost,
a critical problem in modern memory systems. To this end, TL-DRAM introduces
heterogeneity into the design of a DRAM subarray by segmenting the bitlines,
thereby creating a low-latency, low-energy, low-capacity portion in the
subarray (called the near segment), which is close to the sense amplifiers, and
a high-latency, high-energy, high-capacity portion, which is farther away from
the sense amplifiers. Thus, DRAM becomes heterogeneous with a small portion
having lower latency and a large portion having higher latency. Various
techniques can be employed to take advantage of the low-latency near segment
and this new heterogeneous DRAM substrate, including hardware-based caching and
software-based caching and memory allocation of frequently used data in the
near segment. Evaluations of simple versions of these techniques show
significant performance and energy-efficiency benefits.

Comment: This is a summary of the original paper, entitled "Tiered-Latency
DRAM: A Low Latency and Low Cost DRAM Architecture," which appears in HPCA
2013.
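The near-segment caching idea above can be illustrated with a small model. This is a hypothetical sketch, not the paper's actual mechanism or parameters: the segment sizes and latencies below are illustrative, and the eviction policy is a simple stand-in.

```python
# Toy model of a TL-DRAM-style subarray: the near segment (close to the
# sense amplifiers) has lower access latency than the far segment, and
# the controller caches frequently used rows in the near segment.
# All constants here are illustrative, not the paper's parameters.

NEAR_ROWS = 32          # small, low-latency portion of the subarray
NEAR_LATENCY_NS = 8     # illustrative near-segment access latency
FAR_LATENCY_NS = 14     # illustrative far-segment access latency

class TieredSubarray:
    """Hardware-based caching of hot rows in the near segment."""
    def __init__(self):
        self.near = set()  # far-segment rows currently cached near

    def access(self, row):
        if row in self.near:
            return NEAR_LATENCY_NS            # hit in the near segment
        if len(self.near) >= NEAR_ROWS:
            self.near.pop()                   # evict an arbitrary row
        self.near.add(row)                    # cache the row for reuse
        return FAR_LATENCY_NS                 # first access pays far latency

sub = TieredSubarray()
first = sub.access(42)     # miss: served at far-segment latency
second = sub.access(42)    # hit: now served at near-segment latency
print(first, second)       # prints "14 8"
```

Repeated accesses to the same row benefit from the low-latency near segment, which is the source of the performance and energy gains the summary describes.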
Adaptive-Latency DRAM (AL-DRAM)
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was
published in HPCA 2015. The key goal of AL-DRAM is to exploit the extra margin
that is built into the DRAM timing parameters to reduce DRAM latency. The key
observation is that the timing parameters are dictated by the worst-case
temperatures and worst-case DRAM cells, both of which lead to a small amount
of
charge storage and hence high access latency. One can therefore reduce latency
by adapting the timing parameters to the current operating temperature and the
current DIMM that is being accessed. Using an FPGA-based testing platform, our
work first characterizes the extra margin for 115 DRAM modules from three major
manufacturers. The experimental results demonstrate that it is possible to
reduce four of the most critical timing parameters by a minimum/maximum of
17.3%/54.8% at 55C while maintaining reliable operation. AL-DRAM adaptively
selects between multiple different timing parameters for each DRAM module based
on its current operating condition. AL-DRAM does not require any changes to the
DRAM chip or its interface; it only requires multiple different timing
parameters to be specified and supported by the memory controller. Real system
evaluations show that AL-DRAM improves the performance of memory-intensive
workloads by an average of 14% without introducing any errors.

Comment: This is a summary of the original paper, entitled "Adaptive-Latency
DRAM: Optimizing DRAM Timing for the Common-Case," which appears in HPCA 2015.
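The adaptive selection described above can be sketched as a lookup keyed by operating temperature. This is an illustrative sketch, not AL-DRAM's actual tables: the baseline timings are typical DDR3 values, and the per-temperature scaling factors are invented for illustration.

```python
# Hypothetical sketch of AL-DRAM-style timing selection: the memory
# controller stores multiple timing profiles per DIMM and picks one
# based on the current operating temperature. Profiles are illustrative.

STANDARD_TIMING = {"tRCD": 13.75, "tRAS": 35.0}   # ns, typical DDR3 values

# Illustrative mapping: temperature ceiling (C) -> timing multiplier.
PROFILES = {55: 0.55, 70: 0.75, 85: 1.00}

def select_timing(temperature_c):
    """Pick the most aggressive profile the temperature allows."""
    for limit in sorted(PROFILES):
        if temperature_c <= limit:
            scale = PROFILES[limit]
            return {p: v * scale for p, v in STANDARD_TIMING.items()}
    return dict(STANDARD_TIMING)   # hotter than all profiles: worst case

cool = select_timing(50)   # cool DIMM: aggressive (reduced) timings
hot = select_timing(85)    # hot DIMM: standard worst-case timings
```

Because only the controller's timing tables change, the DRAM chip and its interface stay untouched, matching the deployment property stated above.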
Predictable Performance and Fairness Through Accurate Slowdown Estimation in Shared Main Memory Systems
This paper summarizes the ideas and key concepts in MISE (Memory
Interference-induced Slowdown Estimation), which was published in HPCA 2013
[97], and examines the work's significance and future potential. Applications
running concurrently on a multicore system interfere with each other at the
main memory. This interference can slow down different applications
differently. Accurately estimating the slowdown of each application in such a
system can enable mechanisms that enforce quality-of-service. While much
prior work has focused on mitigating the performance degradation due to
inter-application interference, there is little work on accurately estimating
slowdown of individual applications in a multi-programmed environment. Our goal
is to accurately estimate application slowdowns, towards providing predictable
performance.
To this end, we first build a simple Memory Interference-induced Slowdown
Estimation (MISE) model, which accurately estimates slowdowns caused by memory
interference. We then leverage our MISE model to develop two new memory
scheduling schemes: 1) one that provides soft quality-of-service guarantees,
and 2) another that explicitly attempts to minimize maximum slowdown (i.e.,
unfairness) in the system. Evaluations show that our techniques perform
significantly better than state-of-the-art memory scheduling approaches to
address the same problems.
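The core of the MISE model can be stated compactly. For a memory-bound application, slowdown is estimated as the ratio of the application's request service rate when run alone (estimated by periodically giving it highest priority at the memory controller) to its service rate when sharing memory. The numbers below are illustrative.

```python
# MISE-style slowdown estimate for a memory-bound application:
# slowdown = alone request service rate / shared request service rate.

def estimate_slowdown(alone_service_rate, shared_service_rate):
    """Estimate how much memory interference slows an application down."""
    return alone_service_rate / shared_service_rate

# Example: an application served at 8 requests/us when run alone but only
# 2 requests/us under interference is estimated to be slowed down 4x.
slowdown = estimate_slowdown(8.0, 2.0)
print(slowdown)   # prints 4.0
```

Such per-application estimates are what the two scheduling schemes above consume: one to keep each application's slowdown under a soft bound, the other to minimize the maximum slowdown across applications.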
Our proposed model and techniques have enabled significant research in the
development of accurate performance models [35, 59, 98, 110] and interference
management mechanisms [66, 99, 100, 108, 119, 120].
Recent Advances in Overcoming Bottlenecks in Memory Systems and Managing Memory Resources in GPU Systems
This article features extended summaries and retrospectives of some of the
recent research done by our research group, SAFARI, on (1) various critical
problems in memory systems and (2) how memory system bottlenecks affect
graphics processing unit (GPU) systems. As more applications share a single
system, operations from each application can contend with each other at various
shared components. Such contention can slow down each application or thread of
execution. The compound effect of contention, high memory latency and access
overheads, as well as inefficient management of resources, greatly degrades
performance, quality-of-service, and energy efficiency. The ten works featured
in this issue study several aspects of (1) inter-application interference in
multicore systems, heterogeneous systems, and GPUs; (2) the growing overheads
and expenses associated with growing memory densities and latencies; and (3)
performance, programmability, and portability issues in modern GPUs, especially
those related to memory system resources.
Recent Advances in DRAM and Flash Memory Architectures
This article features extended summaries and retrospectives of some of the
recent research done by our group, SAFARI, on (1) understanding,
characterizing, and modeling various critical properties of modern DRAM and
NAND flash memory, the dominant memory and storage technologies, respectively;
and (2) several new mechanisms we have proposed based on our observations from
these analyses, characterization, and modeling, to tackle various key
challenges in memory and storage scaling. In order to understand the sources of
various bottlenecks of the dominant memory and storage technologies, these
works perform rigorous studies of device-level and application-level behavior,
using a combination of detailed simulation and experimental characterization of
real memory and storage devices.
RowHammer and Beyond
We will discuss the RowHammer problem in DRAM, which is a prime (and likely
the first) example of how a circuit-level failure mechanism in Dynamic Random
Access Memory (DRAM) can cause a practical and widespread system security
vulnerability. RowHammer is the phenomenon that repeatedly accessing a row in a
modern DRAM chip predictably causes errors in physically-adjacent rows. It is
caused by a hardware failure mechanism called read disturb errors. Building on
our initial fundamental work that appeared at ISCA 2014, Google Project Zero
demonstrated that this hardware phenomenon can be exploited by user-level
programs to gain kernel privileges. Many other recent works demonstrated other
attacks exploiting RowHammer, including remote takeover of a server vulnerable
to RowHammer. We will analyze the root causes of the problem and examine
solution directions. We will also discuss what other problems may be lurking in
DRAM and other types of memories, e.g., NAND flash and Phase Change Memory,
which can potentially threaten the foundations of reliable and secure systems,
as the memory technologies scale to higher densities.

Comment: A version of this paper is to appear in the COSADE 2019 proceedings.
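The read-disturb mechanism described above can be illustrated with a toy model: repeatedly activating one row disturbs its physical neighbors, and a neighbor whose disturbance count crosses a threshold suffers a bit flip. Only the rough activation count (the ISCA 2014 study observed failures with as few as ~139K activations) comes from the work; everything else is a simplification.

```python
# Toy model of RowHammer-style read disturb. Activating a row disturbs
# its two physically adjacent rows; enough disturbances between refreshes
# cause a bit flip. The threshold is illustrative of the worst case.

DISTURB_THRESHOLD = 139_000

class ToyDram:
    def __init__(self, num_rows):
        self.acts = [0] * num_rows    # disturbances seen by each row
        self.flipped = set()          # rows that have suffered a bit flip

    def activate(self, row):
        for neighbor in (row - 1, row + 1):   # physically adjacent rows
            if 0 <= neighbor < len(self.acts):
                self.acts[neighbor] += 1
                if self.acts[neighbor] >= DISTURB_THRESHOLD:
                    self.flipped.add(neighbor)

dram = ToyDram(8)
for _ in range(DISTURB_THRESHOLD):    # hammer row 3 between refreshes
    dram.activate(3)
print(sorted(dram.flipped))           # prints [2, 4]
```

The security implication follows directly: an attacker who controls only which rows are accessed can deterministically corrupt data in rows it has no permission to write.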
Adaptive-Latency DRAM: Reducing DRAM Latency by Exploiting Timing Margins
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was
published in HPCA 2015, and examines the work's significance and future
potential. AL-DRAM is a mechanism that optimizes DRAM latency based on the DRAM
module and the operating temperature, by exploiting the extra margin that is
built into the DRAM timing parameters. DRAM manufacturers provide a large
margin for the timing parameters as a provision against two worst-case
scenarios. First, due to process variation, some outlier DRAM chips are much
slower than others. Second, chips become slower at higher temperatures. The
timing parameter margin ensures that the slow outlier chips operate reliably at
the worst-case temperature, and hence leads to a high access latency.
Using an FPGA-based DRAM testing platform, our work first characterizes the
extra margin for 115 DRAM modules from three major manufacturers. The
experimental results demonstrate that it is possible to reduce four of the most
critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55C while
maintaining reliable operation. AL-DRAM uses these observations to adaptively
select reliable DRAM timing parameters for each DRAM module based on the
module's current operating conditions. AL-DRAM does not require any changes to
the DRAM chip or its interface; it only requires multiple different timing
parameters to be specified and supported by the memory controller. Our real
system evaluations show that AL-DRAM improves the performance of
memory-intensive workloads by an average of 14% without introducing any errors.
Our characterization and proposed techniques have inspired several other works
on analyzing and/or exploiting different sources of latency and performance
variation within DRAM chips.
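The characterization result quoted above (a 17.3% minimum to 54.8% maximum reduction across four critical timing parameters at 55C) can be applied numerically. The baseline values below are illustrative DDR3 timings, not the paper's.

```python
# Apply the observed 55C timing margins to a set of baseline parameters.
# Baseline values are illustrative DDR3 timings (in ns).

BASELINE_NS = {"tRCD": 13.75, "tRP": 13.75, "tRAS": 35.0, "tWR": 15.0}

def reduced_timings(baseline, reduction_pct):
    """Scale every timing parameter down by the observed margin."""
    return {p: round(v * (1 - reduction_pct / 100), 3)
            for p, v in baseline.items()}

conservative = reduced_timings(BASELINE_NS, 17.3)  # minimum observed margin
aggressive = reduced_timings(BASELINE_NS, 54.8)    # maximum observed margin
```

Even the conservative end of the range shaves several nanoseconds off every row access, which compounds into the 14% average speedup reported for memory-intensive workloads.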
Phase Change Logic via Thermal Cross-Talk for Computation in Memory
We have computationally demonstrated logic function implementations using
lateral and vertical multi-contact phase change devices integrated with CMOS
circuitry, which use thermal cross-talk as a coupling mechanism to implement
logic functions at smaller CMOS footprints. Thermal cross-talk during write
operations is utilized to recrystallize the previously amorphized regions to
achieve toggle operations. Amorphized regions formed between different pairs of
write contacts are utilized to isolate read contacts. Typical expected
reduction in CMOS footprint is ~ 50% using the described approach for
toggle-multiplexing, JK-multiplexing, and 2x2 routing. The phase change
devices switch on the order of nanoseconds and are inherently non-volatile.
An electro-thermal modeling framework with dynamic materials models is used
to capture the device dynamics and the current and voltage requirements.

Comment: 7 pages, 6 figures
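The toggle behavior described above can be abstracted into a boolean sketch. This deliberately ignores all device physics (geometry, thermal coupling, melt/recrystallization dynamics) and only illustrates the resulting state sequence: each write pulse's thermal cross-talk flips the phase of a neighboring region.

```python
# Boolean abstraction of the toggle operation: a write pulse on one
# contact pair thermally couples to a neighboring region and flips its
# phase between amorphous and crystalline. Purely illustrative.

class ToggleCell:
    def __init__(self):
        self.crystalline = False   # starts amorphized (high resistance)

    def write_pulse(self):
        """A neighboring write's thermal cross-talk flips the phase."""
        self.crystalline = not self.crystalline

    def read(self):
        return 1 if self.crystalline else 0   # low resistance reads as 1

cell = ToggleCell()
states = [(cell.write_pulse(), cell.read())[1] for _ in range(4)]
print(states)   # prints [1, 0, 1, 0]
```

The non-volatility noted above means this toggled state persists with no standby power, unlike a CMOS flip-flop.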
SAWL: A Self-adaptive Wear-leveling NVM Scheme for High Performance Storage Systems
In order to meet the needs of high performance computing (HPC) in terms of
large memory, high throughput, and energy savings, non-volatile memory (NVM)
has been widely studied due to its salient features of high density, near-zero
standby power, byte-addressability, and non-volatility. In HPC systems,
the multi-level cell (MLC) technique is used to significantly increase device
density and decrease the cost, which however leads to much weaker endurance
than the single-level cell (SLC) counterpart. Although wear-leveling
techniques can mitigate this weakness, their benefit on MLC-based NVM is
limited because they fail to achieve a uniform write distribution before some
cells wear out. To address this problem, our paper proposes a
self-adaptive wear-leveling (SAWL) scheme for MLC-based NVM. The idea behind
SAWL is to dynamically tune the wear-leveling granularity and balance writes
across the cells of the entire memory, thus achieving a suitable tradeoff
between lifetime and cache hit rate. Moreover, to reduce the size of the
address-mapping table, SAWL maintains a few recently-accessed mappings in a
small on-chip cache. Experimental results demonstrate that SAWL significantly
improves the NVM lifetime and the performance for HPC systems, compared with
state-of-the-art schemes.

Comment: 14 pages, 17 figures
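The two ideas above, remapping writes to spread wear and caching only recently used mappings on chip, can be sketched together. The rotation-based remapping and all sizes below are illustrative stand-ins, not SAWL's actual scheme.

```python
# Sketch of wear leveling with a small mapping cache: logical lines are
# remapped to physical lines via a rotating offset (a stand-in for a real
# remapping function), and recently used mappings are kept in a small
# LRU cache, falling back to the full table on a miss.

from collections import OrderedDict

class WearLeveler:
    def __init__(self, num_lines, cache_size=4):
        self.num_lines = num_lines
        self.cache_size = cache_size
        self.shift = 0                      # rotating remap offset
        self.writes = [0] * num_lines       # per-physical-line wear
        self.cache = OrderedDict()          # small on-chip mapping cache

    def _map(self, logical):
        if logical in self.cache:           # fast path: cached mapping
            self.cache.move_to_end(logical)
            return self.cache[logical]
        physical = (logical + self.shift) % self.num_lines  # full table
        self.cache[logical] = physical
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used
        return physical

    def write(self, logical):
        self.writes[self._map(logical)] += 1

    def rotate(self):
        """Periodically change the remapping to rebalance wear."""
        self.shift = (self.shift + 1) % self.num_lines
        self.cache.clear()                  # old mappings are now stale

wl = WearLeveler(4)
for _ in range(4):        # a write-hot logical line, remapped each period
    wl.write(0)
    wl.rotate()
print(wl.writes)          # prints [1, 1, 1, 1]: wear is spread evenly
```

Tuning how often `rotate` runs is the granularity knob the summary refers to: rotating more often levels wear faster but invalidates cached mappings more often, lowering the mapping-cache hit rate.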
High-Performance and Energy-Efficient Memory Scheduler Design for Heterogeneous Systems
When multiple processor cores (CPUs) and a GPU integrated together on the
same chip share the off-chip DRAM, requests from the GPU can heavily interfere
with requests from the CPUs, leading to low system performance and starvation
of cores. Unfortunately, state-of-the-art memory scheduling algorithms are
ineffective at solving this problem due to the very large amount of GPU memory
traffic, unless a very large and costly request buffer is employed to provide
these algorithms with enough visibility across the global request stream.
Previously-proposed memory controller (MC) designs use a single monolithic
structure to perform three main tasks. First, the MC attempts to schedule
together requests to the same DRAM row to increase row buffer hit rates.
Second, the MC arbitrates among the requesters (CPUs and GPU) to optimize for
overall system throughput, average response time, fairness and quality of
service. Third, the MC manages the low-level DRAM command scheduling to
complete requests while ensuring compliance with all DRAM timing and power
constraints. This paper proposes a fundamentally new approach, called the
Staged Memory Scheduler (SMS), which decouples the three primary MC tasks into
three significantly simpler structures that together improve system performance
and fairness. Our evaluation shows that SMS provides a 41.2% performance
improvement, along with improved fairness, compared to the best previous
state-of-the-art technique, while enabling a design that is significantly
less complex and more power-efficient to implement.
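The three decoupled stages described above can be sketched end to end. The arbitration policy (preferring CPU batches over the GPU's) and all structures below are simplified stand-ins for illustration, not SMS's actual design.

```python
# Sketch of a staged memory scheduler: (1) per-source batch formation
# groups requests to the same DRAM row, (2) a batch scheduler arbitrates
# among sources, here naively preferring latency-sensitive CPU batches
# over the GPU's, and (3) the chosen batch's requests are handed to the
# DRAM command scheduler (represented by simply returning them).

from collections import deque

class StagedScheduler:
    def __init__(self):
        self.formation = {}     # stage 1: per-source open batch (by row)
        self.batches = deque()  # stage 2: completed batches awaiting pick

    def enqueue(self, source, row):
        cur = self.formation.get(source)
        if cur and cur["row"] != row:       # row changed: close the batch
            self.batches.append(cur)
            cur = None
        if cur is None:
            cur = {"source": source, "row": row, "reqs": []}
            self.formation[source] = cur
        cur["reqs"].append((source, row))   # row-hit requests batch up

    def schedule(self):
        """Stages 2+3: pick one batch and issue its requests."""
        for src in list(self.formation):    # flush any open batches
            self.batches.append(self.formation.pop(src))
        if not self.batches:
            return []
        cpu = [b for b in self.batches if b["source"] != "gpu"]
        chosen = cpu[0] if cpu else self.batches[0]
        self.batches.remove(chosen)
        return chosen["reqs"]

sms = StagedScheduler()
sms.enqueue("gpu", row=7)
sms.enqueue("cpu0", row=3)
print(sms.schedule())   # prints [('cpu0', 3)]: the CPU batch goes first
```

Because each stage only solves one problem, none of them needs visibility across the entire global request stream, which is how the design avoids the large, costly request buffer that monolithic schedulers require.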