Adaptive-Latency DRAM: Reducing DRAM Latency by Exploiting Timing
  Margins by Lee, Donghyuk et al.
Donghyuk Lee1,2 Yoongu Kim2 Gennady Pekhimenko3,2
Samira Khan4,2 Vivek Seshadri5,2 Kevin Chang6,2 Onur Mutlu7,2
1NVIDIA Research 2Carnegie Mellon University 3University of Toronto
4University of Virginia 5Microsoft Research India 6Facebook 7ETH Zürich
This paper summarizes the idea of Adaptive-Latency DRAM
(AL-DRAM), which was published in HPCA 2015 [90], and
examines the work’s significance and future potential. AL-DRAM
is a mechanism that optimizes DRAM latency based on
the DRAM module and the operating temperature, by exploiting
the extra margin that is built into the DRAM timing parameters.
DRAM manufacturers provide a large margin for the timing
parameters as a provision against two worst-case scenarios.
First, due to process variation, some outlier DRAM chips are
much slower than others. Second, chips become slower at higher
temperatures. The timing parameter margin ensures that the
slow outlier chips operate reliably at the worst-case temperature,
and hence leads to a high access latency.
Using an FPGA-based DRAM testing platform, our work first
characterizes the extra margin for 115 DRAM modules from
three major manufacturers. The experimental results demon-
strate that it is possible to reduce four of the most critical tim-
ing parameters by a minimum/maximum of 17.3%/54.8% at
55℃ while maintaining reliable operation. AL-DRAM uses
these observations to adaptively select reliable DRAM timing
parameters for each DRAM module based on the module’s cur-
rent operating conditions. AL-DRAM does not require any
changes to the DRAM chip or its interface; it only requires multiple
different timing parameters to be specified and supported
by the memory controller. Our real system evaluations show
that AL-DRAM improves the performance of memory-intensive
workloads by an average of 14% without introducing any errors.
Our characterization and proposed techniques have inspired
several other works on analyzing and/or exploiting different
sources of latency and performance variation within DRAM
chips [30, 34, 51, 71, 89, 127].
1. Problem: High DRAM Latency
A DRAM chip is made of capacitor-based cells that rep-
resent data in the form of electrical charge. To store data
in a cell, charge is injected, whereas to retrieve data from a
cell, charge is extracted. Such movement of charge happens
through a wire called bitline. Due to the large resistance and
the large capacitance of the bitline, it takes a long time to
access DRAM cells. To guarantee correct operation for every
module sold, DRAM manufacturers impose a set of mini-
mum latency restrictions on DRAM accesses, called timing
parameters [60]. Ideally, timing parameters should provide
just enough time for a DRAM chip to operate correctly. In
practice, however, there is a very large margin in the tim-
ing parameters to ensure correct operation under worst-case
conditions with respect to two aspects. First, due to process
variation, some outlier cells suffer from a larger RC-delay
than other cells [64,94], and require more time to be accessed.
Second, due to temperature dependence, DRAM cells lose more
charge at high temperature [97, 171], and therefore require
more time to be accessed. Due to the worst-case provisioning
of the fixed timing parameters, which ensure reliable operation
up to a temperature of 85℃, it takes a longer time to
access most of DRAM under most operating conditions than
is actually necessary for correct operation.
2. Key Observations and Our Goal
First, we observe that most DRAM chips do not con-
tain the worst-case cells that require the largest access
latency. Using an FPGA-based testing platform [52], we profile
115 real DRAM modules and observe that the slowest
cell (i.e., the cell that stores the smallest amount of charge)
for a typical chip is still significantly faster than the slowest
cell of the worst-case chip. Our profiling exposes the large
margin built into DRAM timing parameters. In particular,
we identify four timing parameters that are the most critical
during a DRAM access: tRCD, tRAS, tWR, and tRP.1 At 55℃,
we demonstrate that the parameters can be reduced by an
average of 17.3%, 37.7%, 54.8%, and 35.2%, respectively, while
still maintaining correctness.
Second, we observe that most DRAM chips are not exposed
to the worst-case temperature of 85℃. We measure
the DRAM ambient temperature in a server cluster running
a very memory-intensive benchmark, and find that the temperature
never exceeds 34℃, and never changes by more than
0.1℃ per second. Other works [48, 99] also observe that worst-case
DRAM temperatures are not common, and that servers
typically operate at much lower temperatures.
Based on these two observations, we show that typical
DRAM chips operating at typical temperatures (e.g., 55℃) are
capable of operating correctly when accessed with a much
smaller access latency, but are nevertheless forced to operate
at the largest latency of the worst-case module and operating
conditions. Modules in existing systems use these worst-case
latencies because existing memory controllers are equipped
with only a single set of timing parameters that are dictated
by the worst case.
1For a detailed background on the operation of DRAM, and an explanation
of each timing parameter, we refer the reader to our prior works [30, 31, 32,
34, 51, 52, 67, 68, 69, 70, 71, 73, 75, 76, 77, 78, 88, 89, 90, 92, 93, 97, 98, 127, 146, 147].
arXiv:1805.03047v1 [cs.AR] 4 May 2018
Our goal in our HPCA 2015 paper [90] is to exploit the
extra margin that is built into the DRAM timing parameters
to reduce DRAM latency, and thus improve performance as
well as energy consumption. To this end, we first provide a
detailed analysis of why we can reduce DRAM timing parameters
without sacrificing reliability.
3. Charge & Latency Interdependence
The operation of a DRAM cell is governed by two impor-
tant parameters: i) the quantity of charge and ii) the latency
it takes to move charge. These two parameters are closely
related to each other. Based on SPICE simulations with a de-
tailed DRAM model, we identify the quantitative relationship
between charge and latency [90]. We briefly summarize our
three key observations from these analyses here. Section 7
of our HPCA 2015 paper [90] provides a detailed analysis of
our observations.
First, having more charge in a DRAM cell accelerates the
sensing operation in the cell, especially at the beginning of
sensing, enabling the opportunity to shorten the timing pa-
rameters that correspond to sensing (tRCD and tRAS). Second,
when restoring the charge in a DRAM cell, a large amount of
the time is spent on injecting the final small amount of charge
into the cell. If there is already enough charge in the cell for
the next access, the cell does not need to be fully restored.
In this case, it is possible to shorten the latter part of the
restoration time, creating the opportunity to shorten the tim-
ing parameters that correspond to restoration (tRAS and tWR).
Third, at the end of precharging, i.e., setting the bitline into
the initial voltage level (before accessing a cell) for the next
access, a large amount of the time is spent on precharging
the final small amount of bitline voltage difference from the
initial level. When there is already enough charge in the cell
to overcome the voltage difference in the bitline, the bitline
does not need to be fully precharged. Thus, it is possible to
shorten the final part of the precharge time, creating the opportunity
to shorten the timing parameter that corresponds
to precharge (tRP). Based on these three observations, we
conclude that timing parameters can be shortened if DRAM
cells have enough charge.
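The second and third observations share one intuition: the last small fraction of charge (or bitline voltage) takes a disproportionate share of the time. The sketch below illustrates this with a first-order RC charging curve, V(t) = V_final · (1 − e^(−t/RC)). This model is an assumption made here for illustration only; the analysis in [90] is based on SPICE simulations of a detailed DRAM model, not on this formula. The function name `time_to_fraction` is hypothetical.

```python
import math


def time_to_fraction(fraction: float, rc: float = 1.0) -> float:
    """Time for a first-order RC node to reach `fraction` of its final voltage.

    Solves fraction = 1 - exp(-t / rc) for t. Times are in units of `rc`.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return -rc * math.log(1.0 - fraction)


t90 = time_to_fraction(0.90)  # time to reach 90% of the final level
t99 = time_to_fraction(0.99)  # time to reach 99% of the final level

print(f"90% of final level after {t90:.2f} RC")
print(f"99% of final level after {t99:.2f} RC")
print(f"share of the time spent on the last 9%: {(t99 - t90) / t99:.0%}")
```

Under this simplified model, going from 90% to 99% takes about as long as going from 0% to 90%, which is why skipping the tail of restoration or precharge can save a large part of the latency when the cell already holds enough charge.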
4. Adaptive-Latency DRAM
As explained in Section 3, the amount of charge in the
cell right before an access to it plays a critical role in how
long it takes to retrieve the correct data from the cell. In
Figure 1, we illustrate the impact of process variation using
two different cells: one is a typical cell (left column) and the
other is the worst-case cell that deviates the most from the
typical (right column). The worst-case cell initially contains
less charge than the typical cell for two reasons. First, due to
its large resistance, the worst-case cell cannot allow charge to
flow inside quickly. Second, due to its small capacitance, the
worst-case cell cannot store much charge even when it is fully
charged. To accommodate such a worst-case cell, existing
timing parameters are conservatively set to large values.
[Figure 1 graphic: a 2×2 grid of DRAM cells. Columns: typical cell (Rlow) and worst-case cell (Rhigh). Rows: worst-case temperature (large leakage) and typical temperature (small leakage). Labels in the cells: unfilled (Rhigh), unfilled (by design).]
Figure 1: Effect of reduced latency: typical vs. worst-case. Reproduced from [90].
In Figure 1, we also illustrate the impact of temperature
dependence using two cells at two different operating temperatures:
i) a typical temperature (55℃, bottom row), and ii) the
worst-case temperature (85℃, top row) supported by DRAM
standards. Both typical and worst-case cells leak charge at
a faster rate at the worst-case temperature. Therefore, not
only does the worst-case cell have less charge to begin with,
but it is left with even less charge at the worst temperature
because it leaks charge at a faster rate (top-right in Figure 1).
To accommodate the combined effect of process variation and
temperature dependence, existing timing parameters are set
to very large values. That is why the worst-case condition
for correctness is specified by the top-right of Figure 1, which
shows the least amount of charge stored in the worst-case cell
at the worst-case temperature in its initial state. On top of
this, DRAM manufacturers add an extra latency margin to
the access time under worst-case conditions. In other words,
the amount of charge in a cell under worst-case conditions
is still greater than the minimum amount of charge required
for correctness.
If we were to reduce the timing parameters, we would
also reduce the amount of charge stored in the cells. It is
important to note, however, that we are proposing to exploit
only the additional slack (in terms of charge) compared to the
worst case. This allows us to provide as strong of a reliability
guarantee as manufacturers currently do for worst-case cells
and operating conditions. In Figure 1, we illustrate the impact
of reducing the timing parameters. The lightened portions
inside the cells represent the amount of charge that we are
giving up by using reduced timing parameters. Note that
we are not giving up any charge for the worst-case cell at
the worst-case temperature. Although the other three cells
are not fully charged in their initial state, we propose to give
up just enough charge from them such that they are left
with a similar amount of charge as the worst case (top-right).
This is because these cells are capable of either holding more
charge to begin with (typical cell, left column) or holding
their charge for longer (typical temperature, bottom row).
Therefore, optimizing the timing parameters (based on the
amount of existing charge slack) provides the opportunity to
reduce overall DRAM latency while still maintaining the same
reliability guarantees provided by DRAM manufacturers.
Based on these observations, we propose Adaptive-Latency
DRAM (AL-DRAM), a mechanism that dynamically optimizes
the timing parameters for different modules at different temperatures.
AL-DRAM exploits the additional charge slack
present in the common-case compared to the worst-case,
thereby preserving the level of reliability (at least as high as
the worst-case) provided by DRAM manufacturers.
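A minimal sketch of how a memory controller might hold several timing-parameter sets per module and pick one based on temperature is given below. It is not the implementation from [90]. The names `TimingSet`, `MODULE_TABLE`, and `select_timings` are hypothetical, the standard values are assumed DDR3-1600-style numbers that this summary does not state, and a real controller would program values in clock cycles rather than nanoseconds. The 27%/32%/33%/18% reductions are the ones used in the real-system evaluation at 55℃ (Section 6).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TimingSet:
    """One set of the four critical timing parameters, in nanoseconds."""
    tRCD: float
    tRAS: float
    tWR: float
    tRP: float


# Assumed DDR3-1600-style standard values (not given in this summary).
STANDARD = TimingSet(tRCD=13.75, tRAS=35.0, tWR=15.0, tRP=13.75)


def reduced(base: TimingSet, pct_rcd: float, pct_ras: float,
            pct_wr: float, pct_rp: float) -> TimingSet:
    """Return a copy of `base` with each parameter shrunk by a percentage."""
    return TimingSet(
        tRCD=base.tRCD * (1 - pct_rcd / 100),
        tRAS=base.tRAS * (1 - pct_ras / 100),
        tWR=base.tWR * (1 - pct_wr / 100),
        tRP=base.tRP * (1 - pct_rp / 100),
    )


# Reductions for tRCD/tRAS/tWR/tRP used at 55 C in the real-system evaluation.
REDUCED_55C = reduced(STANDARD, 27, 32, 33, 18)

# Hypothetical per-module table, sorted by temperature: each entry is
# (highest temperature the set is known to be reliable at, timing set).
MODULE_TABLE: List[Tuple[float, TimingSet]] = [
    (55.0, REDUCED_55C),
    (85.0, STANDARD),
]


def select_timings(table: List[Tuple[float, TimingSet]],
                   temp_c: float) -> TimingSet:
    """Pick the first set whose temperature bound covers `temp_c`."""
    for max_temp_c, timings in table:
        if temp_c <= max_temp_c:
            return timings
    # Outside the profiled range: fall back to the worst-case set.
    return STANDARD


print(select_timings(MODULE_TABLE, 34.0))  # reduced set
print(select_timings(MODULE_TABLE, 70.0))  # standard set
```

Falling back to the standard set whenever the temperature is above every profiled bound keeps the sketch conservative: a missing or stale table entry can only cost performance, not correctness.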
5. DRAM Latency Profiling:
Experimental Analysis of 115 Modules
We present and analyze the results of our DRAM profiling
experiments, performed on our FPGA-based DRAM testing
infrastructure, SoftMC [52], which is also used in our various
past works analyzing various DRAM characteristics [30,34,68,
69,75,89,90,97,136]. In total, we analyze 115 DRAM modules
from three major manufacturers, comprising 920 total DRAM
chips. Our full methodology is explained in Section 6 of our
HPCA 2015 paper [90].
5.1. Analysis of a Representative DRAM Module
We study the possible timing parameter reductions of a
DRAM module while still maintaining correctness. To guaran-
tee reliable DRAM operation, DRAM manufacturers provide
a built-in safety margin in retention time, also referred to as a
guardband [2, 68, 97, 127, 166]. This way, DRAM manufactur-
ers are able to guarantee that even the weakest cell is insured
against various other modes of failure. We first measure the
safety margin of a DRAM module by sweeping the refresh
interval at the worst-case operating temperature (85℃), using
the standard timing parameters. Figure 2a plots the maxi-
mum refresh intervals of each bank and each chip in a DRAM
module for both read and write operations. We make several
observations. First, the maximum error-free refresh intervals
of both read and write operations are much larger than the
DRAM standard (208 ms for the read and 160 ms for the write
operations vs. the 64 ms standard). Second, for the smaller
architectural units (banks and chips in the DRAM module),
some of them operate without incurring errors even at much
higher refresh intervals than others (as high as 352 ms for
the read operations and 256 ms for the write operations).
This is because the error-free retention time is determined
by the worst single cell in each architectural component (i.e.,
bank/chip/module).
(a) Maximum error-free refresh interval at 85℃ (bank/chip/module)
(b) Read latency (refresh interval: 200 ms)
(c) Write latency (refresh interval: 152 ms)
Figure 2: Latency reductions while maintaining the safety
margin of DRAM. Reproduced from [90].
Based on this experiment, we define the safe refresh interval
for a DRAM module as the maximum refresh interval that
leads to no errors, minus an additional margin of 8 ms, which
is the increment at which we sweep the refresh interval. The
safe refresh intervals for the read and write operations are 200
ms and 152 ms, respectively. We then use the safe refresh
intervals to run the tests with all possible combinations of
timing parameters. For each combination, we run our tests
at two temperatures: 85℃ and 55℃.
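The safe-refresh-interval definition can be written out as a short calculation. The sketch below is illustrative only: the helper names are hypothetical, and the per-interval error counts in `read_sweep` are invented. Only the 8 ms sweep step and the 208 ms (read) and 160 ms (write) module-level boundaries come from the text.

```python
SWEEP_STEP_MS = 8  # increment at which the refresh interval is swept


def max_error_free_interval(error_counts):
    """Largest swept interval such that it and all shorter ones had no errors.

    `error_counts` maps a refresh interval in ms to the number of errors
    observed at that interval. Returns None if even the shortest one fails.
    """
    best = None
    for interval_ms in sorted(error_counts):
        if error_counts[interval_ms] > 0:
            break
        best = interval_ms
    return best


def safe_refresh_interval(max_error_free_ms, step_ms=SWEEP_STEP_MS):
    """Maximum error-free interval minus one sweep step as extra margin."""
    return max_error_free_ms - step_ms


# Invented error counts around the 208 ms read boundary reported in the text.
read_sweep = {192: 0, 200: 0, 208: 0, 216: 3, 224: 17}

max_read_ms = max_error_free_interval(read_sweep)
print(max_read_ms, safe_refresh_interval(max_read_ms))  # 208 200
print(safe_refresh_interval(160))                       # 152 (write)
```

Subtracting one sweep step accounts for the resolution of the measurement: the true failure boundary lies somewhere between the last passing interval and the first failing one.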
Figure 2b plots the error-free timing parameter combina-
tions (tRCD, tRAS, and tRP) in the read operation test. For
each combination, there are two stacked bars — the left bar
for the test at 55℃ and the right bar for the test at 85℃. Miss-
ing bars indicate that the test (with that timing parameter
combination at that temperature) causes errors. Figure 2c
plots the same data for the write operation test (tRCD, tWR, and
tRP).
We make two observations. First, even at the highest tem-
perature of 85℃, the DRAM module reliably operates with
reduced timing parameters (24% reduction for read, and 35%
reduction for write operations). Second, at the lower temper-
ature of 55℃, the potential latency reduction is even higher
(36% for read, and 47% for write operations). These latency
reductions are possible while maintaining the safety margin
of the DRAM module. From these two observations, we conclude
that there is significant opportunity to reduce DRAM
timing parameters without compromising reliability.
5.2. Analysis of 115 DRAM Modules
We have studied the effect of temperature and the potential
to reduce various timing parameters at different temperatures
for a single DRAM module. The same trends and observations
also hold true for all of the other modules we studied. In this
section, we analyze the effect of process variation by studying
the results of our profiling experiments on 115 DIMMs.
We also present results for intra-chip process variation by
studying the process variation across different banks within
each DIMM.
Figure 3a (solid line) plots the highest refresh interval that
leads to correct operation across all cells at 85℃ within each
DIMM for the read operation test. The red dots on top show
the highest refresh interval that leads to correct operation
across all cells within each bank for all 8 banks. Figure 3b
plots the same data for the write operation test.
(a) Read retention time (b) Write retention time
(c) Read latency (d) Write latency
Figure 3: Analysis of 115 modules. Reproduced from [90].
We draw two conclusions. First, although there exist a
few modules which just meet the timing parameters (with
a low safety margin), a vast majority of the modules very
comfortably meet the standard timing parameters (with a
high safety margin). This indicates that a majority of the
DIMMs have significantly higher safety margins than the
worst-case module even at the highest-acceptable operating
temperature of 85℃. Second, the effect of process variation is
even higher for banks within the same DIMM, as shown by
the large spread in the red dots for each DIMM. Since banks
within a DIMM can be accessed independently with different
timing parameters, one can potentially imagine a mechanism
that more aggressively reduces timing parameters at a bank
granularity and not just the DIMM granularity. We leave this
for future work.2
2Note that our later works [30, 33, 34, 87, 89] explain this observation of
latency heterogeneity within a DRAM chip.
To study the potential of reducing timing parameters for
each DIMM, we sweep all possible combinations of timing
parameters (tRCD/tRAS/tWR/tRP) for all the DIMMs at both
the highest acceptable operating temperature (85℃) and a
more typical operating temperature (55℃). We then determine
the acceptable DRAM timing parameters for each DIMM for
both temperatures while maintaining its safety margin.
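The sweep over timing-parameter combinations can be sketched as an exhaustive search for the error-free combination with the smallest latency sum. This is a simplified illustration, not the SoftMC test flow. The function `best_combination`, the candidate values, and the predicate `synthetic_read_test` are all hypothetical, and the predicate is a synthetic stand-in for actually running a read test on the FPGA platform at a given temperature.

```python
from itertools import product


def best_combination(candidates, is_error_free):
    """Return the error-free combination with the smallest latency sum.

    `candidates` holds one list of values (in ns) per timing parameter and
    `is_error_free` maps a tuple of values to True or False. Returns None
    if every combination causes errors.
    """
    passing = [combo for combo in product(*candidates) if is_error_free(combo)]
    return min(passing, key=sum) if passing else None


# Hypothetical candidate values for (tRCD, tRAS, tRP) in ns.
READ_CANDIDATES = [
    [13.75, 12.5, 11.25, 10.0, 8.75],  # tRCD
    [35.0, 30.0, 25.0, 20.0],          # tRAS
    [13.75, 12.5, 11.25, 10.0, 8.75],  # tRP
]


def synthetic_read_test(combo):
    """Invented pass/fail rule standing in for a real test at one temperature."""
    t_rcd, t_ras, t_rp = combo
    return t_rcd >= 10.0 and t_ras >= 25.0 and t_rp >= 10.0


best = best_combination(READ_CANDIDATES, synthetic_read_test)
print(best, sum(best))
```

In the actual experiments the search is repeated per DIMM and per temperature, and the pass/fail outcome of one parameter depends on the values of the others, which a real test captures but this synthetic rule does not.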
Figures 3c and 3d show the results of this experiment for
the DRAM read and DRAM write, respectively. The y-axis
plots the sum of the relevant timing parameters (tRCD, tRAS,
and tRP for the DRAM read and tRCD, tWR, and tRP for the
DRAM write). The solid black line shows the latency sum of
the standard timing parameters (DDR3 DRAM specification).
The dotted red line and the dotted blue line show the most
acceptable latency parameters for each DIMM at 85℃ and
55℃, respectively. The solid red line and blue line show the
average acceptable latency across all DIMMs.
We make two observations. First, even at the highest tem-
perature of 85℃, DIMMs can reliably operate at reduced
access latencies: 21.1% on average for read, and 34.4% on
average for write operations. This is a direct result of the pos-
sible reductions in timing parameters tRCD/tRAS/tWR/ tRP —
15.6%/20.4%/20.6%/28.5% on average across all the DIMMs.3
As a result, we conclude that process variation and lower
temperatures enable a significant potential to reduce DRAM
access latencies. Second, we observe that at lower tempera-
tures (e.g., 55℃) the potential for latency reduction is even
greater (32.7% on average for read, and 55.1% on average
for write operations), where the corresponding reduction
in timing parameters tRCD/tRAS/tWR/ tRP are 17.3%/37.7%/
54.8%/35.2% on average across all the DIMMs.
We conclude that existing DRAM modules can be ac-
cessed reliably with lower access latencies, especially at lower
temperatures than the worst-case temperature specified by
DRAM manufacturers.
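As a rough consistency check, the per-parameter average reductions can be combined into a reduction of the read latency sum (tRCD + tRAS + tRP). The sketch below assumes DDR3-1600-style standard values of tRCD = tRP = 13.75 ns and tRAS = 35 ns; these values are an assumption and are not stated in this summary. Only the read sum is computed, and the helper name is hypothetical.

```python
# Assumed standard timing values in ns (not stated in this summary).
STANDARD_NS = {"tRCD": 13.75, "tRAS": 35.0, "tRP": 13.75}

# Average per-parameter reductions (%) reported above, by temperature in C.
REDUCTION_PCT = {
    85: {"tRCD": 15.6, "tRAS": 20.4, "tRP": 28.5},
    55: {"tRCD": 17.3, "tRAS": 37.7, "tRP": 35.2},
}


def read_latency_reduction_pct(temp_c):
    """Reduction of tRCD + tRAS + tRP, as a percentage of the standard sum."""
    base_ns = sum(STANDARD_NS.values())
    saved_ns = sum(
        STANDARD_NS[name] * pct / 100
        for name, pct in REDUCTION_PCT[temp_c].items()
    )
    return 100 * saved_ns / base_ns


for temp_c in (85, 55):
    print(f"{temp_c} C: {read_latency_reduction_pct(temp_c):.1f}% lower read latency sum")
```

Under the assumed standard values, the results come out at about 21.1% at 85℃ and 32.7% at 55℃, in line with the average read latency reductions reported above.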
6. Real-System Evaluation
We evaluate AL-DRAM on a real system that offers dynamic
software-based control over DRAM timing parameters
at runtime [10, 11]. We use the minimum values of
the timing parameters that do not introduce any errors at
55℃ for any module to determine the latency reduction at
55℃. Thus, the latency is reduced by 27%/32%/33%/18% for
tRCD/tRAS/tWR/tRP, respectively. Our full methodology is
described in Section 8 of our HPCA 2015 paper [90].
3Due to space constraints, we present only the average potential reduction
for each timing parameter. However, detailed characterization of each DIMM
can be found online at the SAFARI Research Group website [91].
Figure 4 shows the performance improvement of reducing
the timing parameters in the evaluated memory system with
one rank and one memory channel at a 55℃ operating temperature.
We run a variety of different applications in two
different configurations. The first one (single-core) runs only
one thread, and the second one (multi-core) runs multiple
applications/threads. We run each configuration 30 times
(only SPEC benchmarks are executed 3 times due to their
large execution times), and present the average performance
improvement across all the runs and their standard deviation
as an error bar. Based on the last-level cache misses
per kilo instructions (MPKI), we categorize our applications
into memory-intensive or non-intensive groups, and report
the geometric mean performance improvement across all
applications from each group.
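The grouping and averaging step described above can be sketched as follows. Everything specific here is an assumption: the MPKI threshold, the workload names, and all numbers are invented for illustration, and the geometric mean is taken over speedup ratios (1 + improvement), which is one common reading of "geometric mean performance improvement".

```python
import math

MPKI_THRESHOLD = 10.0  # hypothetical cut-off; the text does not give one


def geomean_improvement_pct(improvements_pct):
    """Geometric mean of speedup ratios, reported back as a percentage."""
    ratios = [1.0 + p / 100.0 for p in improvements_pct]
    mean_log = sum(math.log(r) for r in ratios) / len(ratios)
    return (math.exp(mean_log) - 1.0) * 100.0


def split_by_mpki(workloads, threshold=MPKI_THRESHOLD):
    """Split workloads into (memory-intensive, non-intensive) lists."""
    intensive = [w for w in workloads if w["mpki"] >= threshold]
    non_intensive = [w for w in workloads if w["mpki"] < threshold]
    return intensive, non_intensive


# Invented workloads; these are not measurements from the paper.
workloads = [
    {"name": "app_a", "mpki": 0.5, "improvement_pct": 1.0},
    {"name": "app_b", "mpki": 2.0, "improvement_pct": 3.0},
    {"name": "app_c", "mpki": 25.0, "improvement_pct": 12.0},
    {"name": "app_d", "mpki": 40.0, "improvement_pct": 18.0},
]

intensive, non_intensive = split_by_mpki(workloads)
for label, group in (("intensive", intensive), ("non-intensive", non_intensive)):
    mean = geomean_improvement_pct([w["improvement_pct"] for w in group])
    print(f"{label}: {mean:.1f}%")
```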
[Figure 4 graphic: bar chart of performance improvement (%), y-axis from 0 to 25, with single-core and multi-core bars. Workloads on the x-axis: hmmer, namd, calculix, gromac, povray, h264, bzip2, sjeng, tonto, perl, gobmk, astar, xalan, cactus, gcc, sphinx, zeus, dealII, bwave, omnet, soplex, mcf, milc, libq, lbm, gems, triad, add, copy, scale, s.cluster, canneal, mcached, apache, gups, followed by the means for the non-intensive, intensive, and all-workloads groups.]
Figure 4: Real system performance improvement with AL-DRAM.
Reproduced from [90].
We draw three key conclusions from Figure 4. First,
AL-DRAM provides significant performance improvement
over the baseline (as high as 20.5% for the very memory-bandwidth-intensive
STREAM applications [109]). Second,
when the memory system is under higher pressure with multi-core/multi-threaded
applications, we observe significantly
higher performance improvement (than in the single-core case) across all
applications from our workload pool. Third, as expected,
memory-intensive applications benefit more in performance
than non-memory-intensive workloads (14.0% vs. 2.9% on
average). We conclude that by reducing the DRAM timing
parameters using AL-DRAM, we can speed up a real system
by 10.5% (on average across all 35 workloads on the multi-core/multi-thread
configuration).
We also conducted reliability stress tests for our mechanism.
We ran our workloads for 33 days without interruption
at the reduced latencies. We observed no errors, and all
results were correct.
7. Other Results and Analyses in Our Paper
Our HPCA 2015 paper [90] includes a significant amount
of DRAM latency analyses and system performance evalua-
tions. We refer the reader to [90] for detailed evaluations and
analyses.
• Effect of Changing the Refresh Interval on DRAM
Latency. We evaluate DRAM latency at different refresh
intervals. We observe that refreshing DRAM cells more
frequently enables more DRAM latency reduction (Section 7.1
of our HPCA 2015 paper [90]).
• Effect of Reducing Multiple Timing Parameters. We
study the potential for reducing multiple timing parameters
simultaneously. Our key observation is that reducing
one timing parameter decreases the opportunity
to reduce another timing parameter simultaneously
(Section 7.2 of our HPCA 2015 paper [90]).
• Analysis of the Repeatability of Cell Failures. We
perform tests for five different scenarios to determine
whether a cell failure due to reduced latency is repeatable: i)
same test, ii) test with different data patterns, iii) test with
timing-parameter combinations, iv) test with different
temperatures, and v) DRAM read/write. Most of these
scenarios show that a very high fraction (more than 95%)
of the erroneous cells consistently experience an error
over multiple iterations of the same test (Section 7.6 of
our HPCA 2015 paper [90]).
• Performance Sensitivity Analyses. We analyze the
impact of increasing the number of ranks and channels,
executing heterogeneous workloads, and using different row
buffer policies. We show that AL-DRAM effectively improves
performance in all cases (Section 8.4 of our HPCA
2015 paper [90]).
• Power Consumption Analysis. We show that AL-
DRAM reduces DRAM power consumption by 5.8%. This
reduced power consumption is due to the reduced DRAM
latencies (Section 8.4 of our HPCA 2015 paper [90]).
8. Related Work
To our knowledge, our HPCA 2015 paper is the first work
to i) provide a detailed qualitative and empirical analysis of
the relationship between process variation and temperature
dependence of modern DRAM devices on the one side, and
DRAM access latency on the other side (we directly attribute
the relationship between the two to the amount of charge in
cells), ii) experimentally characterize a large number of exist-
ing DIMMs to understand the potential of reducing DRAM
timing constraints, iii) provide a practical mechanism that can
take advantage of this potential, and iv) evaluate the performance
benefits of this mechanism by dynamically optimizing
DRAM timing parameters on a real system using a variety of
real workloads.
Several works investigated the possibility of reducing
DRAM latency by either exploiting DRAM latency variation
or changing the DRAM architecture. We discuss these below.
DRAM Latency Variation. Chandrasekar et al. [29] eval-
uate the potential of relaxing some DRAM timing parame-
ters to reduce DRAM latency. This work observes latency
variations across DIMMs as well as for a DIMM at different
operating temperatures. However, there is no explanation
as to why this phenomenon exists. In contrast, our HPCA
2015 paper [90] (i) identifies and analyzes the root cause of
latency variation in detail, (ii) provides a practical mechanism
that can relax timing parameters, and (iii) provides a real sys-
tem evaluation of this new mechanism, using real workloads,
showing improved performance and preserved reliability.
NUAT [153] and ChargeCache [51] show that recently-
refreshed rows contain more charge, and propose mecha-
nisms to access recently-refreshed rows with reduced latency.
Even though some of the observations in these works are
similar to ours, the approaches to leverage them are different.
AL-DRAM exploits temperature dependence in a DIMM and
process variations across DIMMs, while NUAT and ChargeCache
use the time difference between a row refresh and an
access to the row (hence their benefits are dependent on when
the row is accessed after it is refreshed). Therefore, NUAT
and ChargeCache are complementary to AL-DRAM, and can
potentially be combined for better performance.
Voltron [34] uses an experimental characterization of real
DRAM modules to identify the relationship between the
DRAM supply voltage and access latency variation. Voltron
uses this relationship to identify the combination of voltage
and access latency that minimizes system-level energy consumption
without exceeding a user-specified threshold for
the maximum acceptable performance loss.
Flexible-Latency DRAM (FLY-DRAM) [30] uses an exper-
imental characterization of real DRAM modules to capture
access latency variation across DRAM cells within a single
DRAM chip due to manufacturing process variation. FLY-DRAM
identifies that there is spatial locality in the slower
cells, resulting in fast regions (i.e., regions where all DRAM
cells can operate at significantly-reduced access latency without
experiencing errors) and slow regions (i.e., regions where
some of the DRAM cells cannot operate at significantly-reduced
access latency without experiencing errors) within
each chip. To take advantage of this heterogeneity in the reli-
able access latency of DRAM cells within a chip, FLY-DRAM
(1) categorizes the cells into fast and slow regions; and (2) low-
ers the overall DRAM latency by accessing fast regions with
a lower latency.
Design-Induced Variation-Aware DRAM (DIVA-
DRAM) [89] uses an experimental characterization of
real DRAM modules to identify the latency variation within
a single DRAM chip that occurs due to the architectural
design of the chip. For example, a cell that is farther away
from the row decoder requires a longer access time than a
cell that is close to the row decoder. Similarly, a cell that
is farther away from the wordline driver requires a longer
access time than a cell that is close to the wordline driver.
DIVA-DRAM uses design-induced variation to reduce the
access latency to different parts of the chip.
Low-Latency DRAM Architectures. Various works [31,
32, 33, 53, 78, 92, 108, 116, 142, 146, 154, 176] propose new
DRAM architectures that provide lower latency. Many of
these works improve DRAM latency at the cost of either
significant additional DRAM chip area (i.e., extra sense amplifiers
[108, 142, 154], an additional SRAM cache [53, 176]),
specialized protocols [31,78,92,146] or a combination of these.
Our proposed mechanism requires no changes to the DRAM
chip and the DRAM interface, and hence has almost negligi-
ble overhead. Furthermore, AL-DRAM is largely orthogonal
to these proposed designs, and can be applied in conjunction
with them, providing greater cumulative reduction in latency.
Binning or Overclocking DRAM. AL-DRAM has multiple
sets of DRAM timing parameters for different temperatures
and dynamically optimizes the timing parameters at runtime.
Therefore, AL-DRAM is different from simple binning
(performed by manufacturers) or over-clocking (performed
by end-users; e.g., [58, 126]) that are used to figure out the
highest static frequency or lowest static timing parameters
for DIMMs.
Other Methods for Lowering Memory Latency. There
are many works that reduce overall memory access latency
by modifying DRAM, the DRAM-controller interface, and
DRAM controllers. These works enable more parallelism and
bandwidth [3, 4, 31, 32, 78, 88, 93, 145, 146, 147, 167, 174, 178],
reduce refresh counts [66, 68, 70, 97, 98, 136, 164], accelerate
bulk operations [32,145,146,147,148], accelerate computation
in the logic layer of 3D-stacked DRAM [5, 6, 14, 15, 50, 54, 55,
72, 100, 129, 173], enable better communication between the
CPU and other devices through DRAM [93], leverage DRAM
access patterns [51, 153], reduce write-related latencies by
better designing DRAM and DRAM control policies [35, 83,
144], reduce overall queuing latencies in DRAM by better
scheduling memory requests [12, 13, 38, 46, 49, 56, 59, 61, 65, 76,
77, 84, 85, 86, 96, 109, 110, 111, 112, 120, 121, 125, 129, 141, 152,
159, 160, 161, 162, 163, 177], employ prefetching [9, 28, 36, 37,
42, 44, 45, 47, 84, 113, 114, 115, 119, 122, 124, 128, 158], perform
memory/cache compression [1, 7, 8, 39, 41, 43, 130, 131, 132,
133, 134, 151, 165, 168, 175], or perform better caching [67,
137, 138, 149, 150]. Our proposal is orthogonal to all of these
approaches and can be applied in conjunction with them to
achieve higher latency and energy benefits.
Experimental Studies of DRAM Chips. There are
several studies that characterize various errors in DRAM.
Many of these works observe how specific factors affect
DRAM errors, analyzing the impact of temperature [48] and
hard errors [57]. Other works have conducted studies of
DRAM error rates in the field, studying failures across a
large sample size [95, 106, 143, 155, 156, 157]. There are also
works that have studied errors through controlled experi-
ments, usually using FPGA-based DRAM testing infrastruc-
tures like SoftMC [52], to investigate errors due to reten-
tion time [52, 66, 68, 69, 70, 97, 98, 127, 136], disturbance from
neighboring DRAM cells [62, 74, 75, 118], latency variation
across/within DRAM chips [29,30, 33, 87,89], and supply volt-
age [33, 34]. None of these works extensively study latency
variation across DRAM modules, which we characterize in
our work.
9. Significance
Our work on AL-DRAM is the first to extensively characterize
and exploit the large access latency variation that
exists in modern DRAM devices. In this section, we discuss
the novelty of AL-DRAM and its expected future impact on
the community.
9.1. Novelty
We make the following major contributions in our HPCA
2015 paper [90]:
Addressing a Critical Real Problem, High DRAM La-
tency, with Low Cost. High DRAM latency is a critical
bottleneck for overall system performance in a variety of
modern computing systems [117, 123], especially in real large-scale
server systems [63, 101]. Considering the significant
difficulties in DRAM scaling [64, 117, 118, 123], the DRAM
latency problem is expected to worsen in future systems due to
process variation. Our HPCA 2015 work [90] leverages the
heterogeneity created by DRAM process variation across
DRAM chips and system operating conditions to mitigate
the DRAM latency problem. We propose a practical mech-
anism, Adaptive-Latency DRAM, which mitigates DRAM la-
tency with very modest hardware cost, and with no changes
to the DRAM chip itself.
Large-Scale Latency Profiling of Modern DRAM
Chips. Using our FPGA-based DRAM testing infrastructure
[30, 33, 34, 52, 68, 69, 75, 87, 89, 90, 97, 127, 136], we profile
115 DRAM modules (920 DRAM chips in total) and show
that there is significant timing variation between different
DIMMs at different temperatures. We believe that our results
are statistically significant enough to validate our hypothesis that the
DRAM timing parameters strongly depend on the amount of
cell charge. We provide a detailed characterization of each
DIMM online at the SAFARI Research Group website [91].
Furthermore, we introduce our FPGA-based DRAM infrastructure
and experimental methodology for DRAM profiling,
which are carefully constructed to represent the worst-case
conditions in power noise, bitline/wordline coupling, data
patterns, and access patterns. Such information will hopefully
be useful for future DRAM research.
Extensive Real System Evaluation of DRAM Latency.
We evaluate our mechanism on a real system [10, 11] and
show that our mechanism provides significant performance
improvements. Reducing the timing parameters strips the ex-
cessive margin in the electrical charge stored within a DRAM
cell. We show that the remaining margin is enough for DRAM
to operate reliably. To verify the correctness of our experiments,
we ran our workloads for 33 days nonstop, and examined
the correctness of both the workloads and the system with reduced timing
parameters. Using the reduced timing parameters over the
course of 33 days, our real system was able to execute 35
different workloads in both single-core and multi-core configurations
while preserving correctness and being error-free.
Note that these results do not absolutely guarantee that no
errors can be introduced by reducing the timing parameters.
However, we believe that we have demonstrated a proof-of-concept
which shows that DRAM latency can be reduced with
no impact on DRAM reliability. Ultimately, DRAM manufacturers
can provide the reliable timing parameters for different
operating conditions and modules.
9.2. Potential Long-Term Impact
Tolerating High DRAM Latency by Exploiting
DRAM Intrinsic Characteristics. Today, there is a large
latency cliff between the on-chip last level cache and off-chip
DRAM, leading to a large performance fall-off when applications
start missing in the last level cache. By enabling lower
DRAM latency, our mechanism, Adaptive-Latency DRAM,
smooths this latency cliff without adding another layer
into the memory hierarchy.
Applicability to Future Memory Devices. We show the
benefits of the common-case timing optimization in modern
DRAM devices by taking advantage of intrinsic characteristics
of DRAM. Considering that most memory devices adopt a
unified specification that is dictated by the worst-case operating
condition, our approach that optimizes device latency for
the common case can be applicable to other memory devices
by leveraging the intrinsic characteristics of the technology
they are built with. We believe there is significant potential
for approaches that could reduce the latency of Phase Change
Memory (PCM) [40, 80, 81, 82, 105, 135, 139, 140, 170, 172],
STT-MRAM [79, 105], RRAM [169], and NAND flash memory
[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 102, 103, 104, 107].
New Research Opportunities. Adaptive-Latency DRAM
creates new opportunities by enabling mechanisms that can
leverage the heterogeneous latency offered by our mechanism.
We describe a few of these briefly.
Optimizing the operating conditions for faster DRAM access:
Adaptive-Latency DRAM provides different access latencies
for different operating conditions. Future works can explore
how the operating conditions themselves can be optimized,
which can be used in conjunction with AL-DRAM to further
improve the DRAM access latency. For instance, balancing
DRAM accesses over multiple DRAM channels and ranks can
potentially reduce the DRAM operating temperature, maximizing
the benefits provided by AL-DRAM. At the system
level, operating the system at a constant low temperature can
enable the use of lower DRAM latencies more frequently.
Optimizing data placement to reduce overall DRAM access
latency: We characterize the latency variation in different
DIMMs due to process variation. Placing data based on this
information and on the latency criticality of the data can maximize the
benefits of lowering DRAM latency: the data that
is most sensitive to latency is placed in the fastest DRAM chips and
is thus served with the lowest access
latency.
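As a rough illustration of this kind of latency-aware placement, the sketch below greedily assigns the most latency-critical data regions to the fastest DIMM that still has room. It is only a minimal sketch under stated assumptions: the names Dimm, Region, and place_regions, the page-granularity capacity model, and the criticality score are all hypothetical and are not part of AL-DRAM or of our profiling infrastructure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Dimm:
    """A DIMM with a profiled access latency (hypothetical model)."""
    name: str
    access_latency_ns: float  # assumed to come from offline profiling
    free_pages: int


@dataclass(frozen=True)
class Region:
    """A data region with an assumed latency-criticality score."""
    name: str
    pages: int
    criticality: float  # higher means more sensitive to DRAM latency


def place_regions(dimms: List[Dimm],
                  regions: List[Region]) -> Dict[str, Optional[str]]:
    """Greedy placement: most critical region first, fastest DIMM first.

    Returns a mapping from region name to DIMM name, or None when no
    DIMM has enough free pages for the region.
    """
    # Track remaining capacity separately so the inputs are not mutated.
    free = {d.name: d.free_pages for d in dimms}
    fastest_first = sorted(dimms, key=lambda d: d.access_latency_ns)
    placement: Dict[str, Optional[str]] = {}
    for region in sorted(regions, key=lambda r: r.criticality, reverse=True):
        placement[region.name] = None
        for dimm in fastest_first:
            if free[dimm.name] >= region.pages:
                free[dimm.name] -= region.pages
                placement[region.name] = dimm.name
                break
    return placement
```

A greedy policy is used here only because it is easy to follow; a real system would also have to account for migration cost and for latency that changes with temperature.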
Error correction mechanisms to further reduce DRAM latency:
Error correction mechanisms allow us to lower DRAM latency
even further, by correcting bit errors that occur when a small
number of the DRAM operations end before the minimum
charge is stored in the DRAM cell. Such mechanisms can rely
on error correction to compensate for the reduced reliability
of read and write operations at even lower latencies, leading
to a further reduction in DRAM latency without errors. Future
research that uses error correction to enable even lower
latency DRAM is therefore promising as it opens a new set
of trade-offs. Note that our recent work, DIVA-DRAM [89],
explores this direction and finds very promising benefits.
Inspired by our characterization and proposed techniques,
several recent works [30, 34, 51, 71, 89, 127] have explored
many of these new research opportunities, by (1) analyzing
dierent sources of latency and performance variation within
DRAM chips, and (2) exploiting these sources of latency and
performance variation to reduce access latency and/or energy
consumption.
10. Conclusion
This paper summarizes our HPCA 2015 work on Adaptive-
Latency DRAM (AL-DRAM), a simple and effective mechanism
for dynamically tailoring the DRAM timing parameters
for the current operating condition without introducing any
errors. AL-DRAM takes advantage of the large latency margin
available in the DRAM timing parameters for common-case
operation, by dynamically measuring the operating temperature
of each DRAM module and employing timing constraints
optimized for a particular module at the current tempera-
ture. AL-DRAM provides an average 14% improvement in
overall system performance across a wide variety of memory-
intensive applications run on a real multi-core system. We
conclude that AL-DRAM is a simple and effective mechanism
to reduce DRAM latency. We hope that our experimental
exposure of the large margin present in the standard DRAM
timing constraints will inspire other approaches to optimize
DRAM chips, latencies, and parameters at low cost.
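The selection step summarized above can be sketched as a small per-module lookup: given the measured temperature, pick the most aggressive timing set that was validated for that temperature, and fall back to the standard timings otherwise. This is a minimal sketch, not the memory controller implementation. The names Timings, ModuleProfile, and STANDARD are hypothetical, and all timing values passed to the profile are placeholders rather than measured results.

```python
import bisect
from typing import List, NamedTuple, Tuple


class Timings(NamedTuple):
    """The four timing parameters targeted by the sketch, in nanoseconds."""
    trcd_ns: float
    tras_ns: float
    twr_ns: float
    trp_ns: float


# Assumed worst-case baseline, used whenever no validated entry applies.
# The values are illustrative DDR3-1600-style numbers, not measurements.
STANDARD = Timings(trcd_ns=13.75, tras_ns=35.0, twr_ns=15.0, trp_ns=13.75)


class ModuleProfile:
    """Per-module table of timing sets, each validated up to a temperature."""

    def __init__(self, entries: List[Tuple[float, Timings]]):
        # Each entry is (max_temp_c, timings): the timings are assumed to be
        # reliable for this module at any temperature <= max_temp_c.
        self._entries = sorted(entries, key=lambda e: e[0])
        self._limits = [limit for limit, _ in self._entries]

    def select(self, temp_c: float) -> Timings:
        """Return the timing set for the coolest bucket that covers temp_c."""
        i = bisect.bisect_left(self._limits, temp_c)
        if i == len(self._entries):
            # Hotter than anything validated: use the standard timings.
            return STANDARD
        return self._entries[i][1]
```

A controller following this sketch would call select() whenever the temperature reading of a module changes bucket and then reprogram its timing registers; how often to sample and how to switch timings safely are left open here.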
Acknowledgments
We thank Saugata Ghose for his dedicated effort in the
preparation of this article. We thank the anonymous re-
viewers for their valuable feedback. We thank Uksong Kang,
Jung-Bae Lee, and Joo Sun Choi from Samsung, and Michael
Kozuch from Intel for their helpful comments. We acknowl-
edge the support of our industrial partners: Facebook, IBM,
Intel, Microsoft, Qualcomm, VMware, and Samsung. This
research was partially supported by NSF (grants 0953246,
1212962, 1065112), the Semiconductor Research Corporation,
and the Intel Science and Technology Center for Cloud Com-
puting. Donghyuk Lee was supported in part by the John and
Claire Bertucci Graduate Fellowship.
References
[1] B. Abali, H. Franke, D. Poff, R. Saccone, C. Schulz, L. Herger, and T. Smith, “Memory
Expansion Technology (MXT): Software support and performance,” in IBM
JRD, 2001.
[2] J.-H. Ahn et al., “Adaptive Self Refresh Scheme for Battery Operated High-
Density Mobile DRAM Applications,” in ASSCC, 2006.
[3] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Improving
System Energy Eciency with Memory Rank Subsetting,” in ACM TACO, 2012.
[4] J. H. Ahn, J. Leverich, R. Schreiber, and N. P. Jouppi, “Multicore DIMM: an Energy
Ecient Memory Module with Independently Controlled DRAMs,” in IEEE CAL,
2009.
[5] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable Processing-in-Memory
Accelerator for Parallel Graph Processing,” in ISCA, 2015.
[6] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-Enabled Instructions: A Low-
Overhead, Locality-Aware Processing-in-Memory Architecture,” in ISCA, 2015.
[7] A. R. Alameldeen and D. A. Wood, “Adaptive Cache Compression for High-
Performance Processors,” in ISCA, 2004.
[8] A. R. Alameldeen and D. A. Wood, “Frequent Pattern Compression: A
Signicance-Based Compression Scheme for L2 Caches,” Univ. of Wisconsin–
Madison, Computer Sciences Dept., Tech. Rep. 1500, 2004.
[9] A. Alameldeen and D. Wood, “Interactions Between Compression and Prefetch-
ing in Chip Multiprocessors,” in HPCA, 2007.
[10] AMD, AMD Opteron 4300 Series processors, http://www.amd.com/en-us/
products/server/4000/4300.
[11] AMD, “BKDG for AMD Family 16h Models 00h-0Fh Processors,” 2013.
[12] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh, and O. Mutlu, “Staged
memory scheduling: achieving high performance and scalability in heteroge-
neous systems,” in ISCA, 2012.
[13] R. Ausavarungnirun, S. Ghose, O. Kayiran, G. H. Loh, C. R. Das, M. T. Kandemir,
and O. Mutlu, “Exploiting Inter-Warp Heterogeneity to Improve GPGPU Perfor-
mance,” in PACT, 2015.
[14] A. Boroumand et al., “Google Workloads for Consumer Devices: Mitigating Data
Movement Bottlenecks,” in ASPLOS, 2018.
[15] A. Boroumand, S. Ghose, B. Lucia, K. Hsieh, K. Malladi, H. Zheng, and
O. Mutlu, “LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory,”
in IEEE CAL, 2016.
[16] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characterization,
Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives,” in Proceed-
ings of the IEEE, 2017.
[17] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characteri-
zation, Mitigation, and Recovery in Flash Memory Based Solid-State Drives,”
arXiv:1706.08642 [cs.AR], 2017.
[18] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Errors in Flash-Memory-
Based Solid-State Drives: Analysis, Mitigation, and Recovery,” arXiv:1711.11427
[cs.AR], 2017.
[19] Y. Cai, S. Ghose, Y. Luo, K. Mai, O. Mutlu, and E. F. Haratsch, “Vulnerabilities in
MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and
Mitigation Techniques,” in HPCA, 2017.
[20] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Error Patterns in MLC NAND Flash
Memory: Measurement, Characterization, and Analysis,” in DATE, 2012.
[21] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Threshold voltage distribution in
MLC NAND flash memory: Characterization, analysis, and modeling,” in DATE,
2013.
[22] Y. Cai, Y. Luo, S. Ghose, and O. Mutlu, “Read Disturb Errors in MLC NAND Flash
Memory: Characterization, Mitigation, and Recovery,” in DSN, 2015.
[23] Y. Cai, Y. Luo, E. Haratsch, K. Mai, and O. Mutlu, “Data retention in MLC NAND
flash memory: Characterization, optimization, and recovery,” in HPCA, 2015.
[24] Y. Cai, O. Mutlu, E. F. Haratsch, and K. Mai, “Program Interference in MLC
NAND Flash Memory: Characterization, Modeling, and Mitigation,” in ICCD,
2013.
[25] Y. Cai, G. Yalcin, O. Mutlu, E. Haratsch, A. Cristal, O. Unsal, and K. Mai,
“Flash correct-and-refresh: Retention-aware error management for increased
flash memory lifetime,” in ICCD, 2012.
[26] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. Unsal, and K. Mai, “Error
Analysis and Retention-Aware Error Management for NAND Flash Memory,” in
ITJ, 2013.
[27] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, and K. Mai,
“Neighbor-cell Assisted Error Correction for MLC NAND Flash Memories,” in
SIGMETRICS, 2014.
[28] P. Cao, E. W. Felten, A. R. Karlin, and K. Li, “A Study of Integrated Prefetching
and Caching Strategies,” in SIGMETRICS, 1995.
[29] K. Chandrasekar, S. Goossens, C. Weis, M. Koedam, B. Akesson, N. Wehn, and
K. Goossens, “Exploiting Expendable Process-margins in DRAMs for Run-time
Performance Optimization,” in DATE, 2014.
[30] K. Chang, A. Kashyap, H. Hassan, S. Khan, K. Hsieh, D. Lee, S. Ghose, G. Pekhi-
menko, T. Li, and O. Mutlu, “Understanding Latency Variation in Modern DRAM
Chips: Experimental Characterization, Analysis, and Optimization,” in SIGMET-
RICS, 2016.
[31] K. Chang, D. Lee, Z. Chishti, A. Alameldeen, C. Wilkerson, Y. Kim, and O. Mutlu,
“Improving DRAM performance by parallelizing refreshes with accesses,” in
HPCA, 2014.
[32] K. Chang, P. J. Nair, S. Ghose, D. Lee, M. K. Qureshi, and O. Mutlu, “Low-Cost
Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in
DRAM,” in HPCA, 2016.
[33] K. K. Chang, “Understanding and Improving Latency of DRAM-Based Memory
Systems,” Ph.D. dissertation, Carnegie Mellon University, 2017.
[34] K. K. Chang, A. G. Yaglikci, A. Agrawal, N. Chatterjee, S. Ghose, A. Kashyap,
H. Hassan, D. Lee, M. O’Connor, and O. Mutlu, “Understanding Reduced-Voltage
Operation in Modern DRAM Devices: Experimental Characterization, Analysis,
and Mechanisms,” in SIGMETRICS, 2017.
[35] N. Chatterjee, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. P. Jouppi,
“Staged Reads: Mitigating the Impact of DRAM Writes on DRAM Reads,” in
HPCA, 2012.
[36] R. Cooksey, S. Jourdan, and D. Grunwald, “A Stateless, Content-directed Data
Prefetching Mechanism,” in ASPLOS, 2002.
[37] F. Dahlgren, M. Dubois, and P. Stenström, “Sequential Hardware Prefetching in
Shared-Memory Multiprocessors,” in IEEE TPDS, 1995.
[38] R. Das, R. Ausavarungnirun, O. Mutlu, A. Kumar, and M. Azimi, “Application-
to-core mapping policies to reduce memory system interference in multi-core
systems,” in HPCA, 2013.
[39] R. de Castro, A. Lago, and M. Silva, “Adaptive compressed caching: design and
implementation,” in SBAC-PAD, 2003.
[40] G. Dhiman, R. Ayoub, and T. Rosing, “PDRAM: A hybrid PRAM and DRAM main
memory system,” in DAC, 2009.
[41] F. Douglis, “The Compression Cache: Using On-line Compression to Extend
Physical Memory,” in Winter USENIX Conference, 1993.
[42] J. Dundas and T. Mudge, “Improving Data Cache Performance by Pre-executing
Instructions Under a Cache Miss,” in ICS, 1997.
[43] J. Dusser, T. Piquet, and A. Seznec, “Zero-content Augmented Caches,” in ICS,
2009.
[44] E. Ebrahimi, O. Mutlu, and Y. Patt, “Techniques for bandwidth-efficient prefetching
of linked data structures in hybrid prefetching systems,” in HPCA, 2009.
[45] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Prefetch-aware Shared Resource
Management for Multi-core Systems,” in ISCA, 2011.
[46] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N.
Patt, “Parallel application memory scheduling,” in MICRO, 2011.
[47] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt, “Coordinated Control of Multiple
Prefetchers in Multi-core Systems,” in MICRO, 2009.
[48] N. El-Sayed, I. A. Stefanovici, G. Amvrosiadis, A. A. Hwang, and B. Schroeder,
“Temperature Management in Data Centers: Why Some (Might) Like It Hot,” in
SIGMETRICS, 2012.
[49] S. Ghose, H. Lee, and J. F. Martínez, “Improving Memory Scheduling via
Processor-Side Load Criticality Information,” in ISCA, 2013.
[50] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi, J. C. Hoe, and
F. Franchetti, “3D-Stacked Memory-Side Acceleration: Accelerator and System
Design,” in WoNDP, 2014.
[51] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and
O. Mutlu, “ChargeCache: Reducing DRAM Latency by Exploiting Row Access
Locality,” in HPCA, 2016.
[52] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee,
O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-Source Infras-
tructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[53] H. Hidaka, Y. Matsuda, M. Asakura, and K. Fujishima, “The Cache DRAM Archi-
tecture: A DRAM with an On-Chip Cache Memory,” in IEEE Micro, 1990.
[54] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and
O. Mutlu, “Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges,
Mechanisms, Evaluation,” in ICCD, 2016.
[55] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar,
O. Mutlu, and S. W. Keckler, “Transparent Offloading and Mapping (TOM):
Enabling Programmer-Transparent Near-Data Processing in GPU Systems,” in
ISCA, 2016.
[56] I. Hur and C. Lin, “Adaptive History-Based Memory Schedulers,” in MICRO, 2004.
[57] A. A. Hwang, I. A. Stefanovici, and B. Schroeder, “Cosmic Rays Don’t Strike
Twice: Understanding the Nature of DRAM Errors and the Implications for Sys-
tem Design,” in ASPLOS, 2012.
[58] Intel Corp., “Intel Extreme Memory Profile (Intel XMP) DDR3
Technology,” http://www.intel.com/content/www/us/en/chipsets/
extreme-memory-prole-ddr3-technology-paper.html, 2009.
[59] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, “Self-optimizing memory con-
trollers: A reinforcement learning approach,” in ISCA, 2008.
[60] JEDEC, Standard No. 79-3F. DDR3 SDRAM Specification, Jul. 2012.
[61] A. Jog, O. Kayiran, A. Pattnaik, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das,
“Exploiting Core-Criticality for Enhanced GPU Performance,” in SIGMETRICS,
2016.
[62] M. Jung, C. C. Rheinländer, C. Weis, and N. Wehn, “Reverse Engineering of
DRAMs: Row Hammer with Crosshair,” in MEMSYS, 2016.
[63] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and
D. Brooks, “Profiling a Warehouse-Scale Computer,” in ISCA, 2015.
[64] U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. Choi,
“Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling,” in
The Memory Forum, 2014.
[65] D. Kaseridis, J. Stuecheli, and L. K. John, “Minimalist Open-Page: A DRAM Page-
Mode Scheduling Policy for the Many-Core Era,” in MICRO, 2011.
[66] S. Khan et al., “Detecting and Mitigating Data-Dependent DRAM Failures by
Exploiting Current Memory Content,” in MICRO, 2017.
[67] S. Khan, A. R. Alameldeen, C. Wilkerson, O. Mutlu, and D. A. Jimenez, “Improv-
ing Cache Performance by Exploiting Read-Write Disparity,” in HPCA, 2014.
[68] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C. Wilkerson, and O. Mutlu, “The
Ecacy of Error Mitigation Techniques for DRAM Retention Failures: A Com-
parative Experimental Study,” in SIGMETRICS, 2014.
[69] S. Khan, D. Lee, C. Wilkerson, and O. Mutlu, “PARBOR: An Efficient System-Level
Technique to Detect Data Dependent Failures in DRAM,” in DSN, 2016.
[70] S. Khan, C. Wilkerson, D. Lee, A. R. Alameldeen, and O. Mutlu, “A Case for
Memory Content-Based Detection and Mitigation of Data-Dependent Failures
in DRAM,” in IEEE CAL, 2016.
[71] J. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly
Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability
Tradeoff in Modern DRAM Devices,” in HPCA, 2018.
[72] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin,
C. Alkan, and O. Mutlu, “GRIM-Filter: Fast Seed Location Filtering in DNA Read
Mapping Using Processing-in-Memory Technologies,” BMC Genomics, 2018.
[73] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simu-
lator,” in IEEE CAL, 2015.
[74] Y. Kim, “Architectural Techniques to Enhance DRAM Scaling,” Ph.D. dissertation,
Carnegie Mellon University, 2015.
[75] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and
O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental
Study of DRAM Disturbance Errors,” in ISCA, 2014.
[76] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A scalable and high-
performance scheduling algorithm for multiple memory controllers,” in HPCA,
2010.
[77] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread Cluster Memory
Scheduling: Exploiting Differences in Memory Access Behavior,” in MICRO,
2010.
[78] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray-
Level Parallelism (SALP) in DRAM,” in ISCA, 2012.
[79] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating STT-RAM
as an energy-efficient main memory alternative,” in ISPASS, 2013.
[80] B. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger,
“Phase-Change Technology and the Future of Main Memory,” in IEEE Micro, 2010.
[81] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory
As a Scalable DRAM Alternative,” in ISCA, 2009.
[82] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Phase Change Memory Architecture
and the Quest for Scalability,” in CACM, 2010.
[83] C. J. Lee, E. Ebrahimi, V. Narasiman, O. Mutlu, and Y. N. Patt, “DRAM-Aware
Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory
Systems,” Univ. of Texas at Austin, High Performance Systems Group, Tech. Rep.
TR-HPS-2010-002, 2010.
[84] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-Aware DRAM Con-
trollers,” in MICRO, 2008.
[85] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-Aware Memory Con-
trollers,” in IEEE TC, 2011.
[86] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving Memory Bank-level
Parallelism in the Presence of Prefetching,” in MICRO, 2009.
[87] D. Lee, “Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity,”
Ph.D. dissertation, Carnegie Mellon University, 2016.
[88] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-
Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost,” in ACM
TACO, 2016.
[89] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko,
V. Seshadri, and O. Mutlu, “Design-Induced Latency Variation in Modern DRAM
Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in SIG-
METRICS, 2017.
[90] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu,
“Adaptive-latency DRAM: Optimizing DRAM timing for the common-case,” in
HPCA, 2015.
[91] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu,
“Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,”
http://www.ece.cmu.edu/~safari/tools/aldram-hpca2015-fulldata.html.
[92] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-latency
DRAM: A low latency and low cost DRAM architecture,” in HPCA, 2013.
[93] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu, “Decoupled
Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port
DRAM,” in PACT, 2015.
[94] J. Lee, K. Kim, Y. Shin, K. Lee, J. Kim, D. Kim, J. Park, and J. Lee, “Simultaneously
Formed Storage Node Contact and Metal Contact Cell (SSMC) for 1Gb DRAM
and Beyond,” in IEDM, 1996.
[95] X. Li, M. C. Huang, K. Shen, and L. Chu, “A Realistic Evaluation of Memory
Hardware Errors and Software System Susceptibility,” in USENIX ATC, 2010.
[96] Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, and O. Mutlu, “Utility-Based Hybrid
Memory Management,” in CLUSTER, 2016.
[97] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of
Data Retention Behavior in Modern DRAM Devices: Implications for Retention
Time Proling Mechanisms,” in ISCA, 2013.
[98] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-Aware Intelligent
DRAM Refresh,” in ISCA, 2012.
[99] S. Liu, B. Leung, A. Neckar, S. Memik, G. Memik, and N. Hardavellas, “Hard-
ware/software techniques for DRAM thermal management,” in HPCA, 2011.
[100] Z. Liu, I. Calciu, M. Herlihy, and O. Mutlu, “Concurrent Data Structures for Near-
Memory Computing,” in SPAA, 2017.
[101] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis, “Heracles:
Improving resource eciency at scale,” in ISCA, 2015.
[102] Y. Lu, J. Shu, J. Guo, S. Li, and O. Mutlu, “High-Performance and Lightweight
Transaction Support in Flash-Based SSDs,” in IEEE TC, 2015.
[103] Y. Luo, Y. Cai, S. Ghose, J. Choi, and O. Mutlu, “WARM: Improving NAND flash
memory lifetime with write-hotness aware retention management,” in MSST,
2015.
[104] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu, “Enabling Accurate and
Practical Online Flash Channel Modeling for Modern MLC NAND Flash Mem-
ory,” in JSAC, 2016.
[105] J. Meza, J. Li, and O. Mutlu, “A Case for Small Row Buffers in Non-Volatile Main
Memories,” in ICCD, Poster Session, 2012.
[106] J. Meza, Q. Wu, S. Kumar, and O. Mutlu, “Revisiting Memory Errors in Large-
Scale Production Data Centers: Analysis and Modeling of New Trends from the
Field,” in DSN, 2015.
[107] J. Meza, Q. Wu, S. Kumar, and O. Mutlu, “A large-scale study of flash memory
failures in the field,” in SIGMETRICS, 2015.
[108] Micron, “RLDRAM 2 and 3 Specifications,” http://www.micron.com/products/
dram/rldram-memory.
[109] T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory
Service in Multi-core Systems,” in USENIX Security, 2007.
[110] T. Moscibroda and O. Mutlu, “Distributed Order Scheduling and Its Application
to Multi-core Dram Controllers,” in PODC, 2008.
[111] J. Mukundan and J. F. Martínez, “MORSE: Multi-Objective Reconfigurable Self-Optimizing
Memory Scheduler,” in HPCA, 2012.
[112] S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda,
“Reducing memory interference in multicore systems via application-aware
memory channel partitioning,” in MICRO, 2011.
[113] O. Mutlu, H. Kim, and Y. Patt, “Address-value delta (AVD) prediction: increasing
the eectiveness of runahead execution by exploiting regular memory allocation
patterns,” in MICRO, 2005.
[114] O. Mutlu, H. Kim, and Y. Patt, “Techniques for efficient processing in runahead
execution engines,” in ISCA, 2005.
[115] O. Mutlu, J. Stark, C. Wilkerson, and Y. Patt, “Runahead execution: an alternative
to very large instruction windows for out-of-order processors,” in HPCA, 2003.
[116] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in MemCon,
2013.
[117] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in IMW, 2013.
[118] O. Mutlu, “The RowHammer Problem and Other Issues We May Face as Memory
Becomes Denser,” in DATE, 2017.
[119] O. Mutlu, H. Kim, and Y. N. Patt, “Efficient Runahead Execution: Power-efficient
Memory Latency Tolerance,” in IEEE Micro, 2006.
[120] O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for
Chip Multiprocessors,” in MICRO, 2007.
[121] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing
both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
[122] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead execution: An effective
alternative to large instruction windows,” in IEEE Micro, 2003.
[123] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in Memory
Systems,” in SUPERFRI, 2014.
[124] K. Nesbit, A. Dhodapkar, and J. Smith, “AC/DC: an adaptive data cache
prefetcher,” in PACT, 2004.
[125] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith, “Fair Queuing Memory
Systems,” in MICRO, 2006.
[126] NVIDIA Corp., “Extreme DDR3 Performance with SLI-Ready Memory,” http://
www.nvidia.com/docs/IO/52280/NVIDIA_EPP2_TB.pdf, 2008.
[127] M. Patel, J. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the Mitigation
of DRAM Retention Failures via Profiling at Aggressive Conditions,” in
ISCA, 2017.
[128] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka, “Informed
Prefetching and Caching,” in SOSP, 1995.
[129] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu,
and C. R. Das, “Scheduling Techniques for GPU Architectures with Processing-
in-Memory Capabilities,” in PACT, 2016.
[130] G. Pekhimenko, E. Bolotin, M. O’Connor, O. Mutlu, T. C. Mowry, and S. W. Keck-
ler, “Toggle-Aware Compression for GPUs,” in IEEE CAL, 2015.
[131] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W.
Keckler, “Toggle-Aware Bandwidth Compression for GPUs,” in HPCA, 2016.
[132] G. Pekhimenko, T. Huberty, R. Cai, O. Mutlu, P. P. Gibbons, M. A. Kozuch, and
T. C. Mowry, “Exploiting Compressed Block Size as an Indicator of Future Reuse,”
in HPCA, 2015.
[133] G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, P. B. Gibbons, M. A.
Kozuch, and T. C. Mowry, “Linearly Compressed Pages: A Low-complexity, Low-
latency Main Memory Compression Framework,” in MICRO, 2013.
[134] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C.
Mowry, “Base-Delta-Immediate Compression: A Practical Data Compression
Mechanism for On-Chip Caches,” in PACT, 2012.
[135] M. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali,
“Enhancing lifetime and security of PCM-based main memory with start-gap
wear leveling,” in MICRO, 2009.
[136] M. Qureshi, D.-H. Kim, S. Khan, P. Nair, and O. Mutlu, “AVATAR: A Variable-
Retention-Time (VRT) Aware Refresh for DRAM Systems,” in DSN, 2015.
[137] M. K. Qureshi, A. Jaleel, Y. Patt, S. Steely, and J. Emer, “Adaptive Insertion Policies
for High Performance Caching,” in ISCA, 2007.
[138] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, “A Case for MLP-Aware
Cache Replacement,” in ISCA, 2006.
[139] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable High Performance Main
Memory System Using Phase-change Memory Technology,” in ISCA, 2009.
[140] S. Raoux et al., “Phase-change random access memory: A scalable technology,”
in IBM Journal of Research and Development, 2008.
[141] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory Access
Scheduling,” in ISCA, 2000.
[142] Y. Sato et al., “Fast Cycle RAM (FCRAM); a 20-ns random row access, pipe-lined
operating DRAM,” in VLSIC, 1998.
[143] B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM Errors in the Wild: A Large-
Scale Field Study,” in SIGMETRICS, 2009.
[144] V. Seshadri, A. Bhowmick, O. Mutlu, P. Gibbons, M. Kozuch, and T. Mowry, “The
Dirty-Block Index,” in ISCA, 2014.
[145] V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M. Kozuch, O. Mutlu, P. Gibbons,
and T. Mowry, “Fast Bulk Bitwise AND and OR in DRAM,” in IEEE CAL, 2015.
[146] V. Seshadri et al., “RowClone: Fast and Energy-efficient in-DRAM Bulk Data
Copy and Initialization,” in MICRO, 2013.
[147] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch,
O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory Accelerator for
Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
[148] V. Seshadri, T. Mullins, A. Boroumand, O. Mutlu, P. B. Gibbons, M. A. Kozuch,
and T. C. Mowry, “Gather-Scatter DRAM: In-DRAM Address Translation to Im-
prove the Spatial Locality of Non-unit Strided Accesses,” in MICRO, 2015.
[149] V. Seshadri, O. Mutlu, M. A. Kozuch, and T. C. Mowry, “The Evicted-Address
Filter: A Unied Mechanism to Address Both Cache Pollution and Thrashing,”
in PACT, 2012.
[150] V. Seshadri, S. Yedkar, H. Xin, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C.
Mowry, “Mitigating Prefetcher-Caused Pollution Using Informed Caching Poli-
cies for Prefetched Blocks,” in ACM TACO, 2015.
[151] A. Shaee, M. Taassori, R. Balasubramonian, and A. Davis, “MemZip: Exploring
Unconventional Benets from Memory Compression,” in HPCA, 2014.
[152] J. Shao and B. T. Davis, “A Burst Scheduling Access Reordering Mechanism,” in
HPCA, 2007.
[153] W. Shin, J. Yang, J. Choi, and L.-S. Kim, “NUAT: A Non-Uniform Access Time
Memory Controller,” in HPCA, 2014.
[154] Y. H. Son, O. Seongil, Y. Ro, J. W. Lee, and J. H. Ahn, “Reducing Memory Access
Latency with Asymmetric DRAM Bank Organizations,” in ISCA, 2013.
[155] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf,
and S. Gurumurthi, “Memory Errors in Modern Systems: The Good, the Bad, and
the Ugly,” in ASPLOS, 2015.
[156] V. Sridharan and D. Liberty, “A Study of DRAM Failures in the Field,” in SC, 2012.
[157] V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and S. Gurumurthi,
“Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM
Faults,” in SC, 2013.
[158] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, “Feedback Directed Prefetching:
Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,”
in HPCA, 2007.
[159] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The Blacklisting
Memory Scheduler: Achieving high performance and fairness at low cost,” in
ICCD, 2014.
[160] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “BLISS: Balancing
Performance, Fairness and Complexity in Memory Access Scheduling,” in TPDS,
2016.
[161] L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, and O. Mutlu, “The Application
Slowdown Model: Quantifying and Controlling the Impact of Inter-Application
Interference at Shared Caches and Main Memory,” in MICRO, 2015.
[162] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu, “MISE: Providing
Performance Predictability and Improving Fairness in Shared Main Memory Sys-
tems,” in HPCA, 2013.
[163] H. Usui, L. Subramanian, K. Chang, and O. Mutlu, “DASH: Deadline-Aware High-
Performance Memory Scheduler for Heterogeneous Systems with Hardware Ac-
celerators,” in ACM TACO, 2016.
[164] R. Venkatesan, S. Herr, and E. Rotenberg, “Retention-Aware Placement in DRAM
(RAPID): Software Methods for Quasi-Non-Volatile DRAM,” in HPCA, 2006.
[165] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun,
C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, “A Case for Core-Assisted Bot-
tleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist
Warps,” in ISCA, 2015.
[166] M.-J. Wang, R.-L. Jiang, J.-W. Hsia, C.-H. Wang, and J.-E. Chen, “Guardband
determination for the detection of off-state and junction leakages in DRAM testing,”
in Asian Test Symposium, 2001.
[167] F. Ware and C. Hampel, “Improving Power and Data Efficiency with Threaded
Memory Modules,” in ICCD, 2006.
[168] P. R. Wilson, S. F. Kaplan, and Y. Smaragdakis, “The Case for Compressed
Caching in Virtual Memory Systems,” in ATEC, 1999.
[169] H.-S. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee, F. Chen, and
M.-J. Tsai, “Metal Oxide RRAM,” in Proceedings of the IEEE, 2012.
[170] H.-S. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi,
and K. E. Goodson, “Phase Change Memory,” in Proceedings of the IEEE, 2010.
[171] D. Yaney, C. Y. Lu, R. Kohler, M. J. Kelly, and J. Nelson, “A meta-stable leakage
phenomenon in DRAM charge storage - Variable hold time,” in IEDM, 1987.
[172] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, and O. Mutlu, “Efficient Data
Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories,”
in ACM TACO, 2014.
[173] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Igna-
towski, “TOP-PIM: Throughput-oriented Programmable Processing in Memory,”
in HPCA, 2014.
[174] T. Zhang, K. Chen, C. Xu, G. Sun, T. Wang, and Y. Xie, “Half-DRAM: A high-bandwidth
and low-power DRAM architecture from the rethinking of fine-grained
activation,” in ISCA, 2014.
[175] Y. Zhang, J. Yang, and R. Gupta, “Frequent value locality and value-centric data
cache design,” in ASPLOS, 2000.
[176] Z. Zhang, Z. Zhu, and X. Zhang, “Cached DRAM for ILP Processor Memory
Access Latency Reduction,” in IEEE Micro, 2001.
[177] J. Zhao, O. Mutlu, and Y. Xie, “FIRM: Fair and High-Performance Memory Con-
trol for Persistent Memory Systems,” in MICRO, 2014.
[178] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu, “Mini-rank: Adaptive
DRAM architecture for improving memory power efficiency,” in MICRO,
2008.
