Exploiting Row-Level Temporal Locality in DRAM to Reduce the Memory Access Latency
This paper summarizes the idea of ChargeCache, which was published in HPCA
2016 [51], and examines the work's significance and future potential. DRAM
latency continues to be a critical bottleneck for system performance. In this
work, we develop a low-cost mechanism, called ChargeCache, that enables faster
access to recently-accessed rows in DRAM, with no modifications to DRAM chips.
Our mechanism is based on the key observation that a recently-accessed row has
more charge and thus the following access to the same row can be performed
faster. To exploit this observation, we propose to track the addresses of
recently-accessed rows in a table in the memory controller. If a later DRAM
request hits in that table, the memory controller uses lower timing parameters,
leading to reduced DRAM latency. Row addresses are removed from the table after
a specified duration to ensure rows that have leaked too much charge are not
accessed with lower latency. We evaluate ChargeCache on a wide variety of
workloads and show that it provides significant performance and energy benefits
for both single-core and multi-core systems.
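The core of the mechanism can be illustrated with a small table keyed by row address. The Python sketch below is a minimal model; the entry count, expiry duration, and the cycle-based interface are illustrative assumptions, not the paper's parameters.

```python
from collections import OrderedDict

class ChargeCache:
    """Minimal sketch of a ChargeCache-style table in the memory
    controller (parameters and method names are illustrative)."""

    def __init__(self, num_entries=128, expiry_cycles=1_000_000):
        self.table = OrderedDict()          # row address -> insertion cycle
        self.num_entries = num_entries
        self.expiry_cycles = expiry_cycles  # duration before too much charge leaks

    def on_row_access(self, row_addr, now_cycle):
        """Record a recently-accessed (and thus highly-charged) row."""
        if row_addr in self.table:
            self.table.move_to_end(row_addr)
        self.table[row_addr] = now_cycle
        if len(self.table) > self.num_entries:
            self.table.popitem(last=False)  # evict the oldest entry

    def hits(self, row_addr, now_cycle):
        """True if the next activation of this row may use reduced
        timing parameters (e.g., lower tRCD/tRAS)."""
        inserted = self.table.get(row_addr)
        if inserted is None:
            return False
        if now_cycle - inserted > self.expiry_cycles:
            del self.table[row_addr]        # charge has leaked too much
            return False
        return True
```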
High-Performance and Energy-Efficient Memory Scheduler Design for Heterogeneous Systems
When multiple processor cores (CPUs) and a GPU, integrated on the same chip,
share the off-chip DRAM, requests from the GPU can heavily interfere
with requests from the CPUs, leading to low system performance and starvation
of cores. Unfortunately, state-of-the-art memory scheduling algorithms are
ineffective at solving this problem due to the very large amount of GPU memory
traffic, unless a very large and costly request buffer is employed to provide
these algorithms with enough visibility across the global request stream.
Previously-proposed memory controller (MC) designs use a single monolithic
structure to perform three main tasks. First, the MC attempts to schedule
together requests to the same DRAM row to increase row buffer hit rates.
Second, the MC arbitrates among the requesters (CPUs and GPU) to optimize for
overall system throughput, average response time, fairness and quality of
service. Third, the MC manages the low-level DRAM command scheduling to
complete requests while ensuring compliance with all DRAM timing and power
constraints. This paper proposes a fundamentally new approach, called the
Staged Memory Scheduler (SMS), which decouples the three primary MC tasks into
three significantly simpler structures that together improve system performance
and fairness. Our evaluation shows that SMS provides a 41.2% performance
improvement, along with improved fairness, compared to the best previous
state-of-the-art technique, while enabling a design that is significantly less
complex and more power-efficient to implement.
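To make the decoupling concrete, the following toy Python sketch models the three stages as separate structures: per-source batch formation of same-row requests, a batch scheduler that arbitrates among sources, and a queue feeding the low-level DRAM command scheduler. The probabilistic policy knob and all names are illustrative assumptions, not SMS's actual design parameters.

```python
import random
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    row: int            # DRAM row targeted by the request

class StagedMemoryScheduler:
    """Toy sketch of SMS's three decoupled stages (structure only)."""

    def __init__(self, sources, p_sjf=0.9):
        self.batch_queues = {s: deque() for s in sources}  # stage 1 -> stage 2
        self.dram_queue = deque()                          # stage 2 -> stage 3
        self.p_sjf = p_sjf   # probability of shortest-job-first arbitration

    def form_batches(self, source, requests):
        """Stage 1: group consecutive same-row requests into batches,
        preserving row-buffer locality without a global view."""
        batch = []
        for req in requests:
            if batch and req.row != batch[-1].row:
                self.batch_queues[source].append(batch)
                batch = []
            batch.append(req)
        if batch:
            self.batch_queues[source].append(batch)

    def arbitrate(self):
        """Stage 2: pick one source's oldest batch. Favor sources with
        short batches (latency-sensitive CPUs) with probability p_sjf,
        otherwise fall back to a round-robin-like choice."""
        ready = [s for s, q in self.batch_queues.items() if q]
        if not ready:
            return
        if random.random() < self.p_sjf:
            src = min(ready, key=lambda s: len(self.batch_queues[s][0]))
        else:
            src = random.choice(ready)  # stand-in for round-robin
        # Stage 3 (not modeled): drain dram_queue respecting DRAM timing.
        self.dram_queue.extend(self.batch_queues[src].popleft())
```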
Tiered-Latency DRAM (TL-DRAM)
This paper summarizes the idea of Tiered-Latency DRAM, which was published in
HPCA 2013. The key goal of TL-DRAM is to provide low DRAM latency at low cost,
a critical problem in modern memory systems. To this end, TL-DRAM introduces
heterogeneity into the design of a DRAM subarray by segmenting the bitlines,
thereby creating a low-latency, low-energy, low-capacity portion in the
subarray (called the near segment), which is close to the sense amplifiers, and
a high-latency, high-energy, high-capacity portion, which is farther away from
the sense amplifiers. Thus, DRAM becomes heterogeneous with a small portion
having lower latency and a large portion having higher latency. Various
techniques can be employed to take advantage of the low-latency near segment
and this new heterogeneous DRAM substrate, including hardware-based caching and
software-based caching and memory allocation of frequently used data in the
near segment. Evaluations with such simple techniques show significant
performance and energy-efficiency benefits.
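As one concrete example of the software-based approach, the toy Python sketch below maps the most frequently accessed pages to the low-latency near segment and everything else to the far segment; the interface and the frequency-ranking policy are illustrative assumptions, not the paper's mechanism.

```python
def place_pages(page_access_counts, near_capacity_pages):
    """Toy software-based data placement for a TL-DRAM-like substrate:
    put the hottest pages in the low-latency near segment and the rest
    in the high-capacity far segment."""
    ranked = sorted(page_access_counts,
                    key=page_access_counts.get, reverse=True)
    near_segment = set(ranked[:near_capacity_pages])
    far_segment = set(ranked[near_capacity_pages:])
    return near_segment, far_segment

# e.g., with room for two near-segment pages:
near, far = place_pages({0x1000: 90, 0x2000: 5, 0x3000: 40}, 2)
assert near == {0x1000, 0x3000}
```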
Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface
Limited memory bandwidth is a critical bottleneck in modern systems.
3D-stacked DRAM enables higher bandwidth by leveraging wider
Through-Silicon-Via (TSV) channels, but today's systems cannot fully exploit
them due to the limited internal bandwidth of DRAM. DRAM reads a whole row
simultaneously from the cell array to a row buffer, but can transfer only a
fraction of the data from the row buffer to the peripheral IO circuitry, through a
limited and expensive set of wires referred to as global bitlines. In the presence
of wider memory channels, the major bottleneck becomes the limited data
transfer capacity through these global bitlines. Our goal in this work is to
enable higher bandwidth in 3D-stacked DRAM without the increased cost of adding
more global bitlines. We instead exploit otherwise-idle resources, such as
global bitlines, already existing within the multiple DRAM layers by accessing
the layers simultaneously. Our architecture, Simultaneous Multi Layer Access
(SMLA), provides higher bandwidth by aggregating the internal bandwidth of
multiple layers and transferring the available data at a higher IO frequency.
To implement SMLA, simultaneous data transfer from multiple layers through
the same IO TSVs requires coordination between layers to avoid channel
conflict. We first study coordination by static partitioning, which we call
Dedicated-IO, which assigns a dedicated group of TSVs to each layer. We then
provide a more flexible mechanism, called Cascaded-IO, which enables
simultaneous access to each layer by time-multiplexing the IOs. By operating at
a frequency proportional to the number of layers, SMLA provides a higher
bandwidth (4X for a four-layer stacked DRAM). Our evaluations show that SMLA
provides significant performance improvement and energy reduction (55% and 18%
on average, respectively, for multi-programmed workloads) over a baseline
3D-stacked DRAM, with very low area overhead.
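A back-of-the-envelope Python sketch of the bandwidth argument, under idealized assumptions (no timing overheads, Dedicated-IO-style static TSV partitioning, one bit per TSV per cycle): each layer drives 1/N of the TSVs at an IO frequency scaled by N, so aggregate bandwidth grows roughly linearly with layer count.

```python
def smla_bandwidth_mbps(num_layers, tsvs, base_io_freq_mhz):
    """Idealized aggregate IO bandwidth (Mbit/s) for an SMLA-style
    interface: each layer gets tsvs/num_layers TSVs (Dedicated-IO)
    but runs its IO at num_layers times the base frequency."""
    per_layer_tsvs = tsvs / num_layers
    io_freq = base_io_freq_mhz * num_layers
    return num_layers * per_layer_tsvs * io_freq  # 1 bit per TSV per cycle

baseline = 128 * 800                       # all TSVs at the base frequency
assert smla_bandwidth_mbps(4, 128, 800) == 4 * baseline  # 4X for four layers
```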
Adaptive-Latency DRAM (AL-DRAM)
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was
published in HPCA 2015. The key goal of AL-DRAM is to exploit the extra margin
that is built into the DRAM timing parameters to reduce DRAM latency. The key
observation is that the timing parameters are dictated by the worst-case
temperatures and worst-case DRAM cells, both of which lead to a small amount of
charge storage and hence high access latency. One can therefore reduce latency
by adapting the timing parameters to the current operating temperature and the
current DIMM that is being accessed. Using an FPGA-based testing platform, our
work first characterizes the extra margin for 115 DRAM modules from three major
manufacturers. The experimental results demonstrate that it is possible to
reduce four of the most critical timing parameters by a minimum/maximum of
17.3%/54.8% at 55°C while maintaining reliable operation. AL-DRAM adaptively
selects between multiple different timing parameters for each DRAM module based
on its current operating condition. AL-DRAM does not require any changes to the
DRAM chip or its interface; it only requires multiple different timing
parameters to be specified and supported by the memory controller. Real system
evaluations show that AL-DRAM improves the performance of memory-intensive
workloads by an average of 14% without introducing any errors.
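A minimal sketch of how a memory controller might select among multiple timing parameter sets, assuming per-module profiles binned by temperature; all table values, bins, and names below are illustrative placeholders, not the paper's measured numbers.

```python
# DDR-standard defaults used when no profile applies (ns, illustrative).
DEFAULT_TIMINGS = {"tRCD": 13.75, "tRAS": 35.0, "tRP": 13.75, "tWR": 15.0}

# Hypothetical profiled timings per (module, temperature bin).
PROFILED_TIMINGS = {
    ("dimm0", 55): {"tRCD": 10.0, "tRAS": 27.5, "tRP": 11.25, "tWR": 10.0},
    ("dimm0", 85): {"tRCD": 12.5, "tRAS": 32.5, "tRP": 12.50, "tWR": 12.5},
}

def select_timings(module_id, temperature_c, bins=(55, 70, 85)):
    """Pick the profiled timings for the coolest bin that still covers
    the current temperature (hotter cells are slower, so a higher bin
    is more conservative); fall back to the standard defaults."""
    for temp_bin in bins:
        if temperature_c <= temp_bin:
            profile = PROFILED_TIMINGS.get((module_id, temp_bin))
            if profile:
                return profile
    return DEFAULT_TIMINGS
```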
Predictable Performance and Fairness Through Accurate Slowdown Estimation in Shared Main Memory Systems
This paper summarizes the ideas and key concepts in MISE (Memory
Interference-induced Slowdown Estimation), which was published in HPCA 2013
[97], and examines the work's significance and future potential. Applications
running concurrently on a multicore system interfere with each other at the
main memory. This interference can slow down different applications
differently. Accurately estimating the slowdown of each application in such a
system can enable mechanisms that enforce quality-of-service. While much
prior work has focused on mitigating the performance degradation due to
inter-application interference, there is little work on accurately estimating
slowdown of individual applications in a multi-programmed environment. Our goal
is to accurately estimate application slowdowns, towards providing predictable
performance.
To this end, we first build a simple Memory Interference-induced Slowdown
Estimation (MISE) model, which accurately estimates slowdowns caused by memory
interference. We then leverage our MISE model to develop two new memory
scheduling schemes: 1) one that provides soft quality-of-service guarantees,
and 2) another that explicitly attempts to minimize maximum slowdown (i.e.,
unfairness) in the system. Evaluations show that our techniques perform
significantly better than state-of-the-art memory scheduling approaches to
address the same problems.
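The essence of the model can be stated compactly. As a hedged sketch of how we understand MISE: an application's execution is split into a memory-bound fraction alpha and a compute fraction; only the memory fraction is slowed, by the ratio of the application's request service rate when run alone (estimated by periodically giving it the highest priority at the memory controller) to its rate when sharing memory.

```python
def mise_slowdown(rsr_alone, rsr_shared, alpha):
    """Estimated slowdown under the MISE-style model sketched above.
    rsr_alone:  request service rate with highest priority (a proxy
                for running alone)
    rsr_shared: observed request service rate when sharing memory
    alpha:      memory-bound fraction of execution (0..1)"""
    assert 0.0 <= alpha <= 1.0 and rsr_shared > 0
    return (1.0 - alpha) + alpha * (rsr_alone / rsr_shared)

# e.g., a 90%-memory-bound application served at half its alone rate
# is estimated to run 1.9x slower than it would alone.
assert mise_slowdown(rsr_alone=2.0, rsr_shared=1.0, alpha=0.9) == 1.9
```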
Our proposed model and techniques have enabled significant research in the
development of accurate performance models [35, 59, 98, 110] and interference
management mechanisms [66, 99, 100, 108, 119, 120].
Decoupling GPU Programming Models from Resource Management for Enhanced Programming Ease, Portability, and Performance
The application resource specification--a static specification of several
parameters such as the number of threads and the scratchpad memory usage per
thread block--forms a critical component of modern GPU programming models. This
specification determines the parallelism, and hence performance, of the
application during execution because the corresponding on-chip hardware
resources are allocated and managed based on this specification. This
tight-coupling between the software-provided resource specification and
resource management in hardware leads to significant challenges in programming
ease, portability, and performance. Zorua is a new resource virtualization
framework that decouples the programmer-specified resource usage of a GPU
application from the actual allocation in the on-chip hardware resources. Zorua
enables this decoupling by virtualizing each resource transparently to the
programmer.
We demonstrate that by providing the illusion of more resources than
physically available via controlled and coordinated virtualization, Zorua
offers several important benefits: (i) Programming Ease. Zorua eases the burden
on the programmer to provide code that is tuned to efficiently utilize the
physically available on-chip resources. (ii) Portability. Zorua alleviates the
necessity of re-tuning an application's resource usage when porting the
application across GPU generations. (iii) Performance. By dynamically
allocating resources and carefully oversubscribing them when necessary, Zorua
improves or retains the performance of applications that are already highly
tuned to best utilize the resources.
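To illustrate the decoupling, the toy Python sketch below virtualizes a single resource (e.g., scratchpad bytes): allocations are admitted against a virtual budget larger than the physical capacity, with the overflow notionally backed by a memory-based swap space. The capacities, oversubscription factor, and policy are illustrative assumptions, not Zorua's actual design.

```python
class VirtualizedResource:
    """Toy model of one Zorua-style virtualized on-chip resource."""

    def __init__(self, physical_capacity, oversubscribe=1.5):
        self.physical_capacity = physical_capacity
        self.virtual_capacity = int(physical_capacity * oversubscribe)
        self.allocated = 0          # total virtual units handed out

    def try_allocate(self, amount):
        """Admit a thread block's request if it fits the virtual budget.
        Returns how much is backed by hardware vs. swap space, or None
        to defer the launch when even the virtual budget is exhausted."""
        if self.allocated + amount > self.virtual_capacity:
            return None
        in_hw = max(0, min(amount, self.physical_capacity - self.allocated))
        self.allocated += amount
        return {"hardware": in_hw, "swap": amount - in_hw}

    def release(self, amount):
        self.allocated = max(0, self.allocated - amount)
```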
Adaptive-Latency DRAM: Reducing DRAM Latency by Exploiting Timing Margins
This paper summarizes the idea of Adaptive-Latency DRAM (AL-DRAM), which was
published in HPCA 2015, and examines the work's significance and future
potential. AL-DRAM is a mechanism that optimizes DRAM latency based on the DRAM
module and the operating temperature, by exploiting the extra margin that is
built into the DRAM timing parameters. DRAM manufacturers provide a large
margin for the timing parameters as a provision against two worst-case
scenarios. First, due to process variation, some outlier DRAM chips are much
slower than others. Second, chips become slower at higher temperatures. The
timing parameter margin ensures that the slow outlier chips operate reliably at
the worst-case temperature, and hence leads to a high access latency.
Using an FPGA-based DRAM testing platform, our work first characterizes the
extra margin for 115 DRAM modules from three major manufacturers. The
experimental results demonstrate that it is possible to reduce four of the most
critical timing parameters by a minimum/maximum of 17.3%/54.8% at 55°C while
maintaining reliable operation. AL-DRAM uses these observations to adaptively
select reliable DRAM timing parameters for each DRAM module based on the
module's current operating conditions. AL-DRAM does not require any changes to
the DRAM chip or its interface; it only requires multiple different timing
parameters to be specified and supported by the memory controller. Our real
system evaluations show that AL-DRAM improves the performance of
memory-intensive workloads by an average of 14% without introducing any errors.
Our characterization and proposed techniques have inspired several other works
on analyzing and/or exploiting different sources of latency and performance
variation within DRAM chips.
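A sketch of the characterization loop such a study implies: for each module and timing parameter, lower the value step by step and test for errors, recording the lowest reliable setting. The `run_test` hook is an assumed stand-in for the FPGA platform's write/read-verify pass; nothing here reproduces the paper's methodology in detail.

```python
def characterize_margin(module, param, default_ns, step_ns, run_test):
    """Lower one timing parameter from its standard default until the
    module starts to fail, and return the lowest reliable value.
    run_test(module, param, value) -> True iff all test patterns read
    back correctly at a fixed operating temperature (assumed hook)."""
    value = default_ns
    while value - step_ns > 0 and run_test(module, param, value - step_ns):
        value -= step_ns            # still error-free: keep reducing
    return value

# The margin, as a fraction of the default, would then be:
# margin = 1 - characterize_margin(m, "tRCD", 13.75, 1.25, run_test) / 13.75
```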
A Hardware-Software Blueprint for Flexible Deep Learning Specialization
Specialized Deep Learning (DL) acceleration stacks, designed for a specific
set of frameworks, model architectures, operators, and data types, offer the
allure of high performance while sacrificing flexibility. Changes in
algorithms, models, operators, or numerical systems threaten the viability of
specialized hardware accelerators. We propose VTA, a programmable deep learning
architecture template designed to be extensible in the face of evolving
workloads. VTA achieves this flexibility via a parametrizable architecture,
two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA
that explicitly orchestrates concurrent compute and memory tasks and (2) a
microcode-ISA which implements a wide variety of operators with single-cycle
tensor-tensor operations. Next, we propose a runtime system equipped with a JIT
compiler for flexible code-generation and heterogeneous execution that enables
effective use of the VTA architecture. VTA is open-sourced and integrated into
Apache TVM, a state-of-the-art deep learning compilation stack that provides
flexibility for diverse models and divergent hardware backends. We propose a
flow that performs design space exploration to generate a customized hardware
architecture and software operator library that can be leveraged by mainstream
learning frameworks. We demonstrate our approach by deploying optimized deep
learning models used for object classification and style transfer on edge-class
FPGAs.
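The two-level ISA idea can be sketched as follows: coarse task instructions carry dependency tokens between the load, compute, and store modules, and each COMPUTE task points into a buffer of tensor-tensor micro-ops. All field names and the encoding below are illustrative assumptions, not VTA's actual ISA.

```python
from dataclasses import dataclass

@dataclass
class TaskInstr:
    """One task-ISA instruction (illustrative encoding)."""
    opcode: str                # "LOAD" | "COMPUTE" | "STORE"
    pop_prev: bool = False     # wait for a token from the upstream module
    push_next: bool = False    # release a token to the downstream module
    uop_begin: int = 0         # COMPUTE only: range into the microcode buffer
    uop_count: int = 0

def gemm_microcode(tile_rows, tile_cols):
    """Microcode-ISA kernel: one tensor-tensor multiply-accumulate
    micro-op per output element of the tile."""
    return [("MAC", r, c) for r in range(tile_rows) for c in range(tile_cols)]

# A LOAD that signals compute, then a COMPUTE gated on that token, so the
# two modules can otherwise run concurrently:
program = [
    TaskInstr("LOAD", push_next=True),
    TaskInstr("COMPUTE", pop_prev=True, uop_begin=0,
              uop_count=len(gemm_microcode(4, 4))),
]
```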
Exploiting the DRAM Microarchitecture to Increase Memory-Level Parallelism
This paper summarizes the idea of Subarray-Level Parallelism (SALP) in DRAM,
which was published in ISCA 2012, and examines the work's significance and
future potential. Modern DRAMs have multiple banks to serve multiple memory
requests in parallel. However, when two requests go to the same bank, they have
to be served serially, exacerbating the high latency of off-chip memory. Adding
more banks to the system to mitigate this problem incurs high system cost. Our
goal in this work is to achieve the benefits of increasing the number of banks
with a low-cost approach. To this end, we propose three new mechanisms, SALP-1,
SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the
serialization of different requests that go to the same bank. The key
observation exploited by our mechanisms is that a modern DRAM bank is
implemented as a collection of subarrays that operate largely independently
while sharing few global peripheral structures.
Our three proposed mechanisms mitigate the negative impact of bank
serialization by overlapping different components of the bank access latencies
of multiple requests that go to different subarrays within the same bank.
SALP-1 requires no changes to the existing DRAM structure; it only requires
reinterpreting some of the existing DRAM timing parameters. SALP-2 and MASA
require only modest changes (< 0.15% area overhead) to the DRAM peripheral
structures, which are much less design constrained than the DRAM core. Our
evaluations show that SALP-1, SALP-2 and MASA significantly improve performance
for both single-core systems (7%/13%/17%) and multi-core systems (15%/16%/20%),
averaged across a wide range of workloads. We also demonstrate that our
mechanisms can be combined with application-aware memory request scheduling in
multicore systems to further improve performance and fairness.
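As a minimal illustration of the key observation, the Python sketch below models a bank as a set of subarrays that can each hold an activated row, so requests to different subarrays can overlap while same-subarray row conflicts still serialize. Real DRAM timing constraints and the shared global structures are deliberately not modeled.

```python
class SubarrayBank:
    """Toy MASA-style bank: multiple subarrays may be activated at once."""

    def __init__(self, num_subarrays):
        self.num_subarrays = num_subarrays
        self.open_rows = {}     # subarray id -> currently activated row

    def conflicts(self, subarray, row):
        """A request serializes only on a row conflict within its own
        subarray; accesses to other subarrays can proceed in parallel."""
        open_row = self.open_rows.get(subarray)
        return open_row is not None and open_row != row

    def activate(self, subarray, row):
        if self.conflicts(subarray, row):
            # Would require a precharge first (serialized, as in a
            # conventional bank with a single row buffer).
            self.open_rows.pop(subarray)
        self.open_rows[subarray] = row

bank = SubarrayBank(num_subarrays=8)
bank.activate(0, 42)
assert not bank.conflicts(1, 7)   # different subarray: can overlap
assert bank.conflicts(0, 7)       # same subarray, different row: serialize
```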