Flexible-Latency DRAM: Understanding and Exploiting Latency Variation in
  Modern DRAM Chips by Chang, Kevin K. et al.
Kevin K. Chang1,2 Abhijith Kashyap3,2 Hasan Hassan4,2,5 Saugata Ghose2
Kevin Hsieh2 Donghyuk Lee6,2 Tianshi Li2,7 Gennady Pekhimenko8,2
Samira Khan9,2 Onur Mutlu4,2
1Facebook 2Carnegie Mellon University 3NVIDIA 4ETH Zürich
5TOBB University of Economics & Technology 6NVIDIA Research
7Peking University 8University of Toronto 9University of Virginia
This article summarizes key results of our work on experi-
mental characterization and analysis of latency variation and
latency-reliability trade-os in modern DRAM chips, which was
published in SIGMETRICS 2016 [24], and examines the work’s
signicance and future potential. Our work is motivated to
reduce the long DRAM latency, which is a critical performance
bottleneck in current systems. DRAM access latency is dened
by three fundamental operations that take place within the
DRAM cell array: (i) activation of a memory row, which opens
the row to perform accesses; (ii) precharge, which prepares the
cell array for the next memory access; and (iii) restoration of
the row, which restores the values of cells in the row that were
destroyed due to activation. There is significant latency variation
for each of these operations across the cells of a single
DRAM chip due to irregularity in the manufacturing process.
As a result, some cells are inherently faster to access, while
others are inherently slower. Unfortunately, existing systems do
not exploit this variation.
The goal of this work is to (i) experimentally characterize and
understand the latency variation across cells within a DRAM
chip for these three fundamental DRAM operations, and (ii) de-
velop new mechanisms that exploit our understanding of the
latency variation to reliably improve performance. To this end,
we comprehensively characterize 240 DRAM chips from three
major vendors, and make six major new observations about
latency variation within DRAM. Notably, we find that (i) there
is large latency variation across the cells for each of the three operations;
(ii) variation characteristics exhibit significant spatial
locality: slower cells are clustered in certain regions of a DRAM
chip; and (iii) the three fundamental operations exhibit different
reliability characteristics when the latency of each operation is
reduced.
Based on our observations, we propose Flexible-LatencY
DRAM (FLY-DRAM), a mechanism that exploits latency varia-
tion across DRAM cells within a DRAM chip to improve system
performance. The key idea of FLY-DRAM is to exploit the spa-
tial locality of slower cells within DRAM, and access the faster
DRAM regions with reduced latencies for the fundamental op-
erations. Our evaluations show that FLY-DRAM improves the
performance of a wide range of applications by 13.3%, 17.6%,
and 19.5%, on average, for each of the three different vendors’
real DRAM chips, in a simulated 8-core system.
We have open-sourced the data from our research online. We
hope the characterization and analysis we provide open up
new research directions for both researchers and practitioners
in computer architecture and systems.
1. Introduction
Over the past few decades, the long latency of memory has
been a critical bottleneck in system performance. Increas-
ing core counts, the emergence of more data-intensive and
latency-critical applications, and increasingly limited band-
width in the memory system are together leading to higher
memory latency. Thus, low-latency memory operation is
now even more important to improving overall system per-
formance [30, 55, 93, 101, 102, 105, 143].
The latency of a memory request is predominantly defined
by the timings of three fundamental operations: (1) activation,
which “opens” a row of DRAM cells to access stored data,
(2) precharge, which “closes” an activated row, and (3) restora-
tion, which restores the charge level of each DRAM cell in
a row to prevent data loss.1 The latencies of these three
DRAM operations, as dened by vendor specications, have
not improved signicantly in the past 18 years, as depicted
in Figure 1. This is especially true when we compare latency
improvements to the capacity (128×) and bandwidth improve-
ments (20×) [23] commodity DRAM chips experienced in the
past 18 years. In fact, the activation and precharge latencies
increased from 2013 to 2015, when DDR DRAM transitioned
from the third generation (12.5ns for DDR3-1600J [51]) to
the fourth generation (14.06ns for DDR4-2133P [53]). As the
latencies specied by vendors have not reduced over time,
the memory latency remains as a critical system performance
bottleneck in many modern applications, such as big data
workloads [28] and Google’s warehouse-scale workloads [55].
1We refer the reader to our prior works [22, 24, 25, 26, 43, 44, 61, 64, 65,
66, 67, 68, 76, 77, 79, 81, 82, 86, 87, 110, 127, 128] for a detailed background on
DRAM.
Figure 1: DRAM latency trends over time [50, 51, 53, 97].
Adapted from [24].
2. Motivation
In this work, we observe that the three fundamental DRAM
operations can actually complete with a much lower latency
for many DRAM cells than the vendor specification, because
there is inherent latency variation present across the DRAM
cells within a DRAM chip. This is a result of manufacturing
process variation, which causes the sizes and strengths of cells
to be dierent, thus making some cells faster and other cells
slower to be accessed reliably [85]. The speed gap between
the fastest and slowest DRAM cells is getting worse [20, 107],
as the technology node continues to scale down to sub-20nm
feature sizes. Unfortunately, instead of optimizing the latency
specications for the common case, DRAM vendors use a sin-
gle set of standard access latencies, called timing parameters,
which provide reliable operation guarantees for the worst case
(i.e., the slowest cells), to maximize manufacturing yield.
We experimentally demonstrate that significant latency
variation is present across DRAM cells in 240 DDR3 DRAM
chips from three major vendors, and that a large fraction
of cells can be read reliably even if the activation/restoration/precharge
latencies are reduced significantly. By repeatedly
testing these DRAM chips, we observe that access latency
variation exhibits spatial locality within DRAM — slower cells
cluster in certain regions of a DRAM chip. In Section 4, we
propose a new mechanism, called FLY-DRAM, which exploits
the lower latencies of DRAM regions with faster cells by in-
troducing heterogeneous timing parameters into the memory
controller. By analyzing and exploiting the latency variation
that exists in DRAM cells, we can greatly reduce the DRAM
access latency.
We discuss our major experimental observations in Sec-
tion 3. For a detailed discussion on all of our observations,
we refer the reader to our SIGMETRICS 2016 paper [24].
3. Latency Variation Analysis
To capture the eect of latency variation in modern DDR3
DRAM chips, we tune the timing parameters that control the
amount of time taken for each of the fundamental DRAM
operations. We developed an FPGA-based DRAM testing
platform [43] that allows us to precisely control the timing
parameter values and the tested DRAM location (i.e., banks,
rows, and columns). A photo of the platform is shown in Fig-
ure 2. Using this platform, we characterize latency variation
on a total of 30 DDR3 DRAM modules (or DIMMs), compris-
ing 240 DRAM chips from three major vendors. Each chip
has a 1Gb density. Thus, each of our DIMMs has a 1GB ca-
pacity. Table 1 lists the relevant information about the tested
DRAM modules. Unless otherwise specified, we test modules
at an ambient temperature of 20±1℃. For results using
higher temperatures, we refer the reader to Section 4.5 of our
SIGMETRICS 2016 paper [24].
Figure 2: FPGA-based DRAM testing infrastructure. Repro-
duced from [24].
Vendor         Total Number   Timing (ns)            Assembly
               of Chips       (tRCD/tRP/tRAS)        Year
A (8 DIMMs)    64             13.125/13.125/35-36    2012-13
B (9 DIMMs)    72             13.75/13.75/35         2011-12
C (13 DIMMs)   104            13.75/13.75/34-36      2011-12
Table 1: Main properties of the tested DIMMs. Reproduced
from [24].
In this section, we present a short summary of our key
results on varying the activation, precharge, and restoration
latencies, which are controlled by the tRCD, tRP, and tRAS
timing parameters, respectively. For more details on the ex-
perimental results and observations, see Sections 4–6 of our
SIGMETRICS 2016 paper [24].
3.1. Behavior of Timing Errors
We analyze the variation in the latencies of activation,
precharge, and restoration by operating DRAM at multiple
reduced latencies for each of these operations. Faster cells are
not affected by the reduced timings, and can be accessed
reliably without any errors; however, slower cells cannot be
read reliably with reduced latencies for the three operations,
leading to bit flips. In this work, we define a timing error as a
bit flip in a cell that occurs due to a reduced-latency access,
and characterize timing errors incurred by the three DRAM
operations.
Our experiments yield several new observations on the
behavior of timing errors. When we reduce the three latencies,
we observe that each latency exhibits a different level
of impact on the inherently-slower cells. Lowering the activation
latency (tRCD) affects only the cells (data) read in
the first accessed cache line, but not the subsequently read
cache lines from the same row. This is mainly due to two
reasons. First, a read command accesses only its corresponding
sense amplifiers, without accessing the other columns.
Hence, a read’s effect is isolated to its target cache line. Second,
by the time a subsequent read is issued to the same
activated row, a sufficient amount of time has already passed
for the row buffer to fully sense and latch in the row data. In
contrast, lowering the restoration (tRAS) or precharge (tRP)
latencies affects all cells in the activated row (see Section 5 of
our SIGMETRICS 2016 paper [24] for a detailed explanation).
Lowering these latencies affects the entire row because these
commands operate at the row level, and they directly affect
the restoration and sensing of all cells in the row.
We also nd that the number of timing errors introduced is
very sensitive to reducing the activation or precharge latency,
but not that sensitive to reducing the restoration latency. We
conclude that dierent levels of mitigation are required to
address the timing errors that result from lowering each of
the dierent DRAM operation latencies, and that reducing
restoration latency to the lowest levels allowed by our infras-
tructure does not introduce timing errors in our experiments
(see Section 6 in our SIGMETRICS 2016 paper [24]).
3.2. Timing Error Distribution
We briey present the distribution of activation and
precharge errors collected from all of the tests conducted
on every DIMM. Figure 3 shows the box plots of the bit error
rate (BER) observed on every DIMM as activation latency
(tRCD) varies. The BER is dened as the fraction of bits with
errors due to reducing tRCD in the total population of tested
bits. In other words, the BER represents the fraction of cells
that cannot operate reliably under the specied shortened
latency. The box plot shows the maximum and minimum
BER of all of our tested DIMMs as whiskers, and the box
shows the quartiles of the distribution. In addition, we show
all observation points for each specic tRCD/tRP value by
overlaying them on top of their corresponding box plot. Each
point shows a BER collected from one round of tests on one
DIMM with a specic data pattern and tRCD value. For box
plots showing the BER distribution when the precharge la-
tency (tRP) is reduced, see Figure 12 in the original paper [24].
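To make the BER definition concrete, the following is a minimal sketch of how it could be computed from a written data pattern and the data read back at a reduced tRCD. The function name `bit_error_rate` and the toy buffers are hypothetical illustrations, not part of our FPGA testing infrastructure.

```python
def bit_error_rate(written: bytes, read_back: bytes) -> float:
    """Fraction of tested bits that differ between the written and read-back data."""
    if len(written) != len(read_back):
        raise ValueError("buffers must cover the same tested bits")
    if not written:
        raise ValueError("no bits were tested")
    # XOR exposes the flipped bits; count the 1s in each byte-wise difference.
    flipped = sum(bin(w ^ r).count("1") for w, r in zip(written, read_back))
    return flipped / (8 * len(written))


# Toy example (made-up data): one 64-byte cache line with two flipped bits.
written = bytes([0xFF] * 64)
corrupted = bytearray(written)
corrupted[3] ^= 0b00000001
corrupted[10] ^= 0b10000000
ber = bit_error_rate(written, bytes(corrupted))  # 2 flipped bits out of 512
```

Each point in a plot such as Figure 3 would correspond to one such value, computed over all tested bits of one DIMM for one data pattern and one tRCD setting.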
We make two observations from the BER distributions when
reducing tRCD or tRP.
First, at tRCD or tRP values of 12.5ns and 10ns, we observe
no timing errors on any DIMM due to reduced activation or
precharge latency. This shows that the tRCD/tRP latencies
of the slowest cells in our tested DIMMs likely fall between
7.5 and 10ns, which are lower than the value provided in
the vendor specications (13.125ns). DRAM vendors use the
extra latency as a guardband to provide additional protection
against process variation.
Second, there exists a large BER variation among DIMMs
at tRCD of 7.5ns, and the BER variation becomes smaller as
the tRCD or tRP value decreases. The number of fast cells that
can operate at tRCD=7.5ns or tRP=7.5ns varies significantly
across different DIMMs. These results demonstrate that there
exists significant latency variation among and within DIMMs,
as not all of the cells exhibit timing errors at 7.5ns.
Figure 3: Bit error rate of all DIMMs with reduced tRCD
(x-axis: tRCD in ns; y-axis: bit error rate (BER), log scale).
Reproduced from [24].
3.3. Spatial Locality of Timing Errors
In this section, we investigate the location and distribu-
tion of timing errors within a DIMM when the activation or
precharge latencies are reduced. Figure 4 shows the probabil-
ity of every cache line (64B) in one bank of a specic DIMM
observing at least 1 bit of error with reduced activation la-
tency (Figure 4a) or precharge latency (Figure 4b). See [24]
for additional results. The x-axis and y-axis indicate the cache
line number and row number (in thousands), respectively. In
our tested DIMMs, a row size is 8KB, comprising 128 cache
lines.
Figure 4: Probability of observing timing errors in one DIMM.
Panels: (a) activation latency (tRCD) at 7.5ns (43% reduction);
(b) precharge latency (tRP) at 7.5ns (43% reduction). In both
panels, the x-axis is the cache line number, the y-axis is the row
number (in thousands), and the color scale is
Pr(cache line with ≥ 1-bit error). Adapted from [24].
The main observation is that timing errors due to reducing
activation or precharge latency are not distributed uniformly
across locations within this DIMM. Timing errors tend to
cluster at certain regions of cache lines. For the remaining
cache lines, we observe that they do not exhibit timing errors
due to reduced latency throughout the experiments. We ob-
serve similar characteristics in other DIMMs — timing errors
concentrate within certain spatial regions of memory.
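As an illustration of how a map like the one in Figure 4 could be derived from repeated test rounds, the sketch below estimates, for every cache line location, the fraction of rounds in which it showed at least one bit error, and then lists the cache line columns that never failed. The input format (one set of failing `(row, cache_line)` pairs per round) and the function names are assumptions made for this example only.

```python
from collections import Counter

CACHE_LINES_PER_ROW = 128  # 8KB row / 64B cache line, as in our tested DIMMs


def error_probability(rounds):
    """rounds: list of sets of (row, cache_line) pairs with >= 1 bit error.

    Returns a dict mapping each failing location to the fraction of rounds
    in which it failed. Locations that never fail are omitted.
    """
    counts = Counter()
    for failing in rounds:
        counts.update(failing)
    return {loc: n / len(rounds) for loc, n in counts.items()}


def error_free_columns(prob, cache_lines_per_row=CACHE_LINES_PER_ROW):
    """Cache line columns that showed no timing error in any row or round."""
    failing_cols = {col for (_row, col) in prob}
    return sorted(set(range(cache_lines_per_row)) - failing_cols)


# Toy data (made up): errors cluster in cache line columns 5 and 6.
rounds = [{(0, 5), (1, 5)}, {(0, 5)}, {(0, 5), (7, 6)}, set()]
prob = error_probability(rounds)
clean = error_free_columns(prob)
```

In this toy input, location (0, 5) fails in three of four rounds, and 126 of the 128 columns never fail, which mirrors the clustering behavior described above in a highly simplified form.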
We hypothesize that the spatial locality of
timing errors is caused by the locality of variation in the fabrication
process during manufacturing. Certain cache line
locations can end up with less robust components, such as
weaker sense amplifiers, weaker cells, or higher resistance
bitlines.
3.4. Other Characterization Results
We briey summarize our other observations on the eects
of reducing timing parameters. First, we analyze the number
of timing errors that occur when DRAM access latencies are
reduced, and experimentally demonstrate that most of the
erroneous cache lines have a single-bit error, with only a
small fraction of cache lines experiencing more than one bit
flip (see Section 4.7 of our SIGMETRICS 2016 paper [24]). We
conclude, therefore, that using simple error-correcting codes
(ECC) can correct most of these errors, thereby enabling lower
latency for many inherently slower cells (see Section 4.8 of
our SIGMETRICS 2016 paper [24] for a detailed analysis of
ECC).
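The reasoning behind the ECC observation can be sketched as follows: count the bit flips in each ECC word and check what fraction of the erroneous words contain exactly one flip, since a single-error-correcting code can repair only those. The 8-byte word size is an assumption chosen to resemble a common SECDED layout, and the helper names are hypothetical; this is not the analysis code used in [24].

```python
def flips_per_word(written: bytes, read_back: bytes, word_bytes: int = 8):
    """Number of flipped bits in each ECC word (8-byte words are an assumption)."""
    counts = []
    for i in range(0, len(written), word_bytes):
        w = int.from_bytes(written[i:i + word_bytes], "little")
        r = int.from_bytes(read_back[i:i + word_bytes], "little")
        counts.append(bin(w ^ r).count("1"))
    return counts


def single_bit_fraction(counts):
    """Fraction of erroneous words that a single-error-correcting code could fix."""
    erroneous = [c for c in counts if c > 0]
    if not erroneous:
        return 1.0  # nothing to correct
    return sum(1 for c in erroneous if c == 1) / len(erroneous)


# Toy cache line (made-up data): words 0 and 7 have one flip, word 1 has two.
written = bytes(64)
corrupted = bytearray(written)
corrupted[0] ^= 0b00000001
corrupted[9] ^= 0b00000011
corrupted[63] ^= 0b10000000
counts = flips_per_word(written, bytes(corrupted))
fraction = single_bit_fraction(counts)
```

Here two of the three erroneous words are correctable, so the fraction is 2/3. The higher this fraction is in real measurements, the more timing errors simple ECC can absorb.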
Second, we nd that the stored data pattern in cells aects
access latency variation. Certain patterns lead to more timing
errors than others. For example, the bit value 1 can be read
signicantly more reliably at a reduced access latency than
the bit value 0 (see Section 4.4 of our SIGMETRICS 2016
paper [24]). This observation is similar to the data pattern
dependence observation made for retention times of DRAM
cells [57, 58, 59, 60, 86, 110].
Third, we nd no clear correlation between temperature
and variation in cell access latency. We believe that it is
not essential for latency reduction techniques that exploit
such variation to be aware of the operating temperature (Sec-
tion 4.5 in [24]).
4. Exploiting Latency Variation
Based on our extensive experimental characterization and
new observations on latency-reliability trade-offs in modern
DRAM chips, we propose a new hardware mechanism, called
Flexible-LatencY DRAM (FLY-DRAM), to reduce DRAM la-
tency for better system performance. FLY-DRAM exploits
the key observations that (i) different cells can operate reliably
at different DRAM latencies, and (ii) there is a strong
correlation between the location of a cell and the lowest la-
tency that the cell can operate reliably at. The key idea of
FLY-DRAM is to (i) categorize the DRAM cells into fast and
slow regions, (ii) expose this categorization to the memory
controller, and (iii) reduce overall DRAM latency by accessing
the fast regions with a lower latency.
The FLY-DRAM memory controller (i) loads the latency
proling results [24] into on-chip SRAM at system boot time,
(ii) looks up the proled latency for each memory request
based on its memory address, and (iii) applies the correspond-
ing latency to the request. By reducing the values of tRCD,
tRAS, and tRP for some memory requests, FLY-DRAM im-
proves overall system performance. In addition, we also pro-
pose an OS page allocator design that exploits the latency
variation in DRAM to improve system performance (see Sec-
tion 7.2 of our paper [24]).
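The three controller steps above can be sketched in software as a simple table lookup. This is only an illustrative model under stated assumptions: the class and constant names are hypothetical, regions are assumed to be fixed-size address ranges, the reduced tRCD/tRP value of 7.5ns matches the reduced latency shown in Figure 4, and the reduced tRAS of 27ns is an assumed value (a 25% cut from the 36ns baseline). A real memory controller would implement this lookup in hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timings:
    tRCD: float  # activation latency (ns)
    tRP: float   # precharge latency (ns)
    tRAS: float  # restoration latency (ns)


STANDARD = Timings(tRCD=13.125, tRP=13.125, tRAS=36.0)  # baseline timings
FAST = Timings(tRCD=7.5, tRP=7.5, tRAS=27.0)            # assumed reduced timings


class LatencyTable:
    """Hypothetical model of the profile held in the controller's on-chip SRAM."""

    def __init__(self, fast_regions, region_bytes):
        # Step (i): load the latency profiling results (here, the fast region IDs).
        self.fast_regions = frozenset(fast_regions)
        self.region_bytes = region_bytes

    def timings_for(self, address: int) -> Timings:
        # Step (ii): look up the profiled latency from the request's address.
        region = address // self.region_bytes
        # Step (iii): apply reduced timings only to regions profiled as fast.
        return FAST if region in self.fast_regions else STANDARD


table = LatencyTable(fast_regions={0, 2}, region_bytes=8192)
first = table.timings_for(100)            # region 0: fast
second = table.timings_for(8192)          # region 1: standard
third = table.timings_for(2 * 8192 + 5)   # region 2: fast
```

Defaulting to the standard timings for any region not marked as fast keeps the sketch conservative: an incomplete profile costs performance rather than reliability.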
There are two key design challenges of FLY-DRAM. The
first challenge is determining the fraction of fast cells within
a DRAM chip and the innate access latency of the fast cells.
Since DRAM vendors have detailed information on their
DRAM chips from the DRAM post-production tests, DRAM
vendors can embed the latency profiling results in the Serial
Presence Detect (SPD) circuitry (a ROM present in each
DIMM) [52]. The memory controller can read the profiling
results from the SPD circuitry during DRAM initialization,
and apply the correct latency for each DRAM region.
The second design challenge is limiting the storage overhead
of the latency profiling results. Recording the shortest
latency for each cache line can incur a large storage overhead.
Fortunately, the storage overhead can be reduced based on a
new observation of ours. As discussed in Section 3.3, timing
errors typically concentrate at certain DRAM regions. There-
fore, FLY-DRAM records the shortest latency at the granu-
larity of DRAM regions (i.e., a group of adjacent cache lines,
rows, or banks). One can imagine using more sophisticated
structures, such as Bloom Filters [6], to provide finer-grained
latency information within a reasonable storage overhead,
as shown in prior work on variable DRAM refresh inter-
vals [87, 115].
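A back-of-the-envelope calculation shows why the recording granularity matters. The sketch below assumes one fast/slow bit per region and treats the 1GB DIMM capacity as 2^30 bytes; both are assumptions for illustration, and the function name is hypothetical.

```python
def profile_bits(capacity_bytes: int, region_bytes: int, bits_per_entry: int = 1) -> int:
    """Storage needed to keep one latency entry per region, in bits."""
    if capacity_bytes % region_bytes:
        raise ValueError("capacity must be a whole number of regions")
    return (capacity_bytes // region_bytes) * bits_per_entry


GIB = 2 ** 30  # assumed capacity of one tested DIMM, in bytes

per_cache_line_bytes = profile_bits(GIB, 64) // 8   # one bit per 64B cache line
per_row_bytes = profile_bits(GIB, 8 * 1024) // 8    # one bit per 8KB row-sized region
```

Under these assumptions, a per-cache-line profile needs 2 MiB per DIMM, while a profile kept at the granularity of 8KB regions needs 16 KiB, a 128× reduction.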
4.1. Summary of Results
We evaluate FLY-DRAM on an 8-core system with a
wide variety of workloads by using Ramulator [64, 120], a
cycle-level open-source DRAM simulator developed by our
research group. Table 2 summarizes the configuration of our
evaluated system. We use the standard DDR3-1333H timing
parameters [51] as our baseline.
Processor   8 cores, 3.3 GHz, OoO 128-entry window
LLC         8 MB shared, 8-way set associative
DRAM        DDR3-1333H [51], open-row policy [66, 67, 118],
            2 channels, 1 rank per channel, 8 banks per rank,
            Baseline: tRCD/tCL/tRP = 13.125ns, tRAS = 36ns
Table 2: Evaluated system configuration. Adapted from [24].
Figure 5 illustrates the system performance improvement
of FLY-DRAM over the baseline (DDR3-1333) for 40 work-
loads. The x-axis indicates each of the evaluated DRAM
congurations. D2A, D7B, and D
2
C correspond to latency proles
collected from three real DIMMs. Our SIGMETRICS 2016
paper [24] describes these real-DRAM proles in more detail.
For these three DIMMs, FLY-DRAM improves system performance
significantly, by 17.6%, 13.3%, and 19.5% on average
across all 40 workloads. This is because FLY-DRAM reduces
the latency of tRCD, tRP, and tRAS by 42.8%, 42.8%, and 25%,
respectively, for a large fraction of cache lines. In particular,
DIMM D2C, in which 99% of cells operate reliably at
low tRCD and tRP, performs within 1% of the upper-bound
performance (19.7% on average), which is obtained by operating
all DRAM cells at low tRCD and tRP. We conclude
that FLY-DRAM is an effective mechanism to improve system
performance by exploiting the widespread latency variation
present across DRAM cells.
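The latency reductions quoted above can be checked with simple arithmetic. The sketch below assumes that the reduced tRCD/tRP value is 7.5ns (the reduced latency shown in Figure 4) and derives the reduced tRAS from the stated 25% cut; the variable names are illustrative only.

```python
def reduction(baseline_ns: float, reduced_ns: float) -> float:
    """Relative latency reduction, as a fraction of the baseline."""
    return (baseline_ns - reduced_ns) / baseline_ns


BASE_TRCD_NS = 13.125  # baseline tRCD/tRP from Table 2
FAST_TRCD_NS = 7.5     # assumed reduced tRCD/tRP (value shown in Figure 4)
BASE_TRAS_NS = 36.0    # baseline tRAS from Table 2

trcd_cut = reduction(BASE_TRCD_NS, FAST_TRCD_NS)  # about 0.4286
fast_tras_ns = BASE_TRAS_NS * (1 - 0.25)          # tRAS after a 25% cut
gap_to_upper_bound = 19.7 - 19.5                  # D2C vs. upper bound, in points
```

The computed tRCD/tRP reduction of roughly 42.86% is consistent with both the 42.8% quoted above and the 43% in the Figure 4 panel titles, and a 25% cut in tRAS corresponds to 27ns.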
Figure 5: System performance improvement of FLY-DRAM
for various DIMMs (y-axis: Normalized WS; bars: D2A 17.6%,
D7B 13.3%, D2C 19.5%, Upper Bound 19.7%). Reproduced from [24].
As we show in our SIGMETRICS 2016 paper [24],
FLY-DRAM can take advantage of an intelligent
DRAM-aware page allocator that allocates frequently
used and latency-critical pages in fast DRAM regions. We
leave the detailed design and evaluation of such an allocator
to future work.
5. Related Work
To our knowledge, this is the rst work to (i) provide a
detailed experimental characterization and analysis of latency
variation for three major DRAM operations (tRCD, tRP, and
tRAS) across dierent cells within a DRAM chip, (ii) demon-
strate that a reduction in latency for each of these funda-
mental operations has a dierent impact on dierent cells,
(iii) show that access latency variation exhibits spatial lo-
cality, (iv) demonstrate that the error rate due to reduced
latencies is correlated with the stored data pattern but not
conclusively correlated with temperature, and (v) propose
mechanisms that take advantage of variation within a DRAM
chip to improve system performance. We discuss the most
closely related works here.
5.1. DRAM Latency Variation
Adaptive-Latency DRAM (AL-DRAM) also characterizes
and exploits DRAM latency variation, but does so at a much
coarser granularity [79]. This work experimentally characterizes
latency variation across different DRAM chips under
different operating temperatures. AL-DRAM sets a uniform
operation latency for the entire DIMM. In contrast, our work
characterizes latency variation within each chip, at the granu-
larity of individual DRAM cells. Our mechanism, FLY-DRAM,
can be combined with AL-DRAM to further improve perfor-
mance.2
A recent work by Lee et al. [76] also observes latency vari-
ation within DRAM chips. The work analyzes the variation
that is due to the circuit design of DRAM components, which
it calls design-induced variation. Furthermore, it proposes
a new proling technique to identify the lowest DRAM la-
tency without introducing errors. In this work, we provide
the rst detailed experimental characterization and analy-
sis of the general latency variation phenomenon within real
DRAM chips. Our analysis is broad and is not limited to
design-induced variation. Our proposal for exploiting latency
variation, FLY-DRAM, can employ Lee et al.’s new profiling
mechanism [76] to identify additional latency variation regions
for reducing access latency.
2A description of the AL-DRAM work and its impact is provided in a
companion article in the very same issue of this journal [80].
Chandrasekar et al. study the potential of reducing some
DRAM timing parameters [21]. Similar to AL-DRAM, this
work observes and characterizes latency variation across
DIMMs, whereas our work studies variation across cells
within a DRAM chip.
5.2. DRAM Error Studies
There are several studies that characterize various errors
in DRAM. Many of these works observe how specific
factors affect DRAM errors, analyzing the impact of temperature
[32, 79] and hard errors [48]. Other works have
conducted studies of DRAM error rates in the field, studying
failures across a large sample size [84, 95, 123, 132, 133].
There are also works that have studied errors through con-
trolled experiments, investigating errors due to retention
time [43, 57, 58, 59, 60, 86, 110, 115], disturbance from neigh-
boring DRAM cells [65, 101], latency variation across/within
DRAM chips [21, 76, 78, 79], and supply voltage [26]. None of
these works study errors due to latency variation across the
cells within a DRAM chip, which we extensively characterize
in our work.
5.3. DRAM Latency Reduction
Several types of commodity DRAM (Micron’s RL-
DRAM [98] and Fujitsu’s FCRAM [122]) provide low latency
at the cost of high area overhead [68, 81]. Many prior works
(e.g., [22, 25, 45, 68, 81, 88, 101, 102, 106, 125, 127, 128, 131, 150])
propose various architectural changes within DRAM chips to
reduce latency. In contrast, FLY-DRAM does not require any
changes to a DRAM chip. Other works [44, 75, 124, 129, 130]
reduce DRAM latency by changing the memory controller,
and FLY-DRAM is complementary to them.
5.4. ECC DRAM
Many memory systems incorporate ECC DIMMs, which
store information used to correct data during a read operation.
Prior work (e.g., [39, 54, 60, 63, 83, 140, 142, 145, 146]) proposes
more flexible or more powerful ECC schemes for DRAM.
While these ECC mechanisms are designed to protect against
faults using standard DRAM timings, we show that they also
have the potential to correct timing errors that occur due to
reduced DRAM latencies. A recent work by Lee et al. [76]
exploits this observation and uses ECC to correct errors that
occur due to reduced latency in DRAM.
5.5. Other Latency Reduction Mechanisms
Various prior works [1, 2, 3, 5, 7, 8, 25, 31, 33, 34, 35, 36, 38, 40,
42, 46, 47, 56, 62, 69, 92, 109, 111, 112, 114, 125, 126, 128, 129, 134,
139, 149] examine processing in memory to reduce DRAM
latency. Other prior works propose memory scheduling techniques
[4, 37, 49, 66, 67, 74, 99, 100, 103, 104, 135, 136, 137, 138, 141],
which generally reduce latency to access DRAM. Our analyses
and techniques can be combined with these works to
enable further low-latency operation.
6. Signicance
Our SIGMETRICS 2016 paper [24] presents a new exper-
imental characterization and analysis of latency variation
in modern DRAM chips. In this section, we describe the
potential impact that our study can have on the research
community and industry.
6.1. Potential Research Impact
Our paper develops a new way of using manufactured
DRAM chips: accessing dierent regions of memory using
each region’s inherent latency instead of a homogeneous
xed standard latency for all regions of memory. We show
that (i) there is signicant latency variation within a DRAM
chip, and (ii) it is possible to exploit the variation with sim-
ple mechanisms. We believe one key impact of our paper is
demonstrating the eectiveness of designing memory opti-
mizations based on real-world characterization. We expect
that this same principle can be used to craft new memory
architectures for both existing and future memory technolo-
gies, such as SRAM, PCM [71, 72, 73, 116, 117, 147, 148], STT-
MRAM [27, 41, 70], or RRAM [144].
Our work exposes several opportunities for both operating
systems and hardware to further optimize for memory access
latency. We have open-sourced our raw characterization data,
to allow other researchers to further analyze and build off of
our work [120]. Other researchers can find many other ways
to take advantage of the insights and the characterization data
we provide. Our FLY-DRAM implementation is also available
as part of the open-source release of Ramulator [64, 119].
ECC to Reduce Latency. In our paper, we analyze the
distribution of timing errors (due to reduced latency) at the
granularity of data beats, as conventional error-correcting
codes (ECC) work at the same granularity. Our data shows
that many of the erroneous data beats experience only a
single-bit error, while the majority of the data beats contain
no errors. Therefore, this creates an opportunity for applying
ECC to correct timing errors. We also envision an oppor-
tunity for applying ECC to only certain regions of DRAM,
which takes advantage of the spatial locality of timing errors
exposed by our work. Lee et al. [76] provide examples of the
use of ECC to reduce latency further, but they apply ECC
globally to the entire DRAM chip. We believe a significant
opportunity exists in customizing ECC to latency errors and
different DRAM reliability issues.
Data Pattern Dependence. We nd that timing errors
caused by reducing activation latency are dependent on the
stored data pattern. Reading bit 1 is signicantly more reliable
than bit 0 at reduced activation latencies. This asymmetric
sensing strength can potentially be a good direction for study-
ing DRAM reliability. Currently, DRAM commonly employs
data bus inversion [53] as an encoding scheme to reduce tog-
gle rate on the data bus, thereby saving channel power [113].
Similar encoding techniques can be developed to reduce the number of
0s and increase the overall number of 1s in data. We believe
that developing asymmetric data encodings or ECC mecha-
nisms that favor 1s over 0s is a promising research direction
to improve DRAM reliability.
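As one purely hypothetical example of such an asymmetric encoding, the sketch below inverts any byte that contains more 0s than 1s and records the inversion in a per-byte flag, so that every stored byte has at least four 1s. This scheme is our illustration of the general direction only; it is not evaluated in [24], and it ignores the cost and reliability of storing the flag bits themselves.

```python
def encode_favor_ones(data: bytes):
    """Invert bytes with fewer than four 1s; return (encoded bytes, inversion flags)."""
    encoded = bytearray()
    flags = []
    for b in data:
        invert = bin(b).count("1") < 4
        encoded.append(b ^ 0xFF if invert else b)
        flags.append(invert)
    return bytes(encoded), flags


def decode_favor_ones(encoded: bytes, flags) -> bytes:
    """Undo the inversion using the stored flags."""
    return bytes(b ^ 0xFF if inv else b for b, inv in zip(encoded, flags))


# Toy example (made-up data).
data = bytes([0x00, 0xFF, 0x01, 0xF0])
encoded, flags = encode_favor_ones(data)
restored = decode_favor_ones(encoded, flags)
```

For this input, the all-0 byte is stored as 0xFF and 0x01 is stored as 0xFE, while 0xFF and 0xF0 are stored unchanged; decoding restores the original data.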
DRAM-Aware Page Allocator. We developed a hard-
ware mechanism (FLY-DRAM) that exploits latency variation
to improve system performance in a software-transparent
manner. Researchers can take better advantage of the variation
by exposing the different latency regions to the software
stack. In our SIGMETRICS 2016 paper [24], we discuss the
potential of a DRAM-aware page allocator in the OS (Section
7.2), which can improve FLY-DRAM performance by intelli-
gently mapping more frequently-accessed application pages
to faster DRAM regions. We believe that the key idea of en-
abling the OS to allocate pages based on the accessed memory
region’s latency can be applied to other types of memory characteristics
(e.g., energy efficiency or voltage [26, 29]) without
needing to modify the architecture.
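A minimal sketch of the page allocation idea is shown below: pages are ranked by access frequency, and the most frequently accessed pages receive frames in fast DRAM regions first. The function name, the use of access counts as the hotness metric, and the frame lists are assumptions for illustration; this is not the allocator design discussed in Section 7.2 of [24].

```python
def allocate_pages(page_access_counts: dict, fast_frames: list, slow_frames: list) -> dict:
    """Map the most frequently accessed pages to fast frames first.

    page_access_counts: page ID -> observed access count (assumed hotness metric).
    Returns a dict mapping page ID -> physical frame number.
    """
    # Hottest pages first; ties are broken by page ID for determinism.
    ordered = sorted(page_access_counts, key=lambda p: (-page_access_counts[p], p))
    frames = list(fast_frames) + list(slow_frames)
    if len(ordered) > len(frames):
        raise MemoryError("not enough free frames")
    return dict(zip(ordered, frames))


# Toy example (made-up data): two fast frames and two slow frames.
counts = {"A": 900, "B": 20, "C": 450}
mapping = allocate_pages(counts, fast_frames=[0, 1], slow_frames=[100, 101])
```

In this toy input, pages A and C land in the fast frames and page B in a slow frame. A practical allocator would also need to handle page migration as access patterns change, which this sketch omits.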
Applicability to Other Memory Technologies. In this
work, we focus on characterizing only DRAM technology. A
class of emerging memory technology is non-volatile memory
(NVM), which has the capability of retaining data even when
the memory is not powered. Since the memory organization
of NVM mostly resembles that of DRAM [71, 96, 147], we
believe that our characterization and optimization can be
extended to dierent types of NVMs, such as PCM [71, 72,
73, 116, 117, 147, 148], STT-MRAM [27, 41, 70], or NAND ash
memory [9,10,11,12,13,14,15,16,17,18,19,89,90,91] to further
enhance their reliability or performance.
6.2. Long-Term Impact on Industry
High main memory latency remains a problem for many
modern applications, such as in-memory databases (e.g., Re-
dis [121], MemSQL [94], TimesTen [108]), Spark, Google’s dat-
acenter workloads [28, 55], and many mobile and interactive
workloads. We propose two simple ideas that exploit latency
variation in existing DRAM chips. Both can be adopted rela-
tively easily in the processor architecture (i.e., the memory
controller) or in the OS.
In addition to improving memory access latency, reduc-
ing the latency of the three fundamental DRAM operations
also increases the eective memory bandwidth. To fully uti-
lize the available memory bandwidth, memory controllers
would have to maximize the number of read or write com-
mands. However, due to interference between access streams
within and across applications, memory controllers need to
constantly open and close rows by issuing activate and
precharge commands due to an increasing number of bank
conicts [44, 68]. These commands increase the queuing
latency of accesses (read and write), thus decreasing the
eective memory bandwidth utilization.
As pin count is limited and increasing bus frequency is
becoming more dicult (due to signal integrity issues [29]),
our work oers a new alternative to help improve bandwidth
utilization. By reducing the latency of DRAM operations,
which fall on the critical path of DRAM access time, more
accesses per second are allowed, thereby improving the over-
all eective bandwidth. Furthermore, improving latency and
eective bandwidth also leads to lower memory energy con-
sumption due to reduced execution time and fewer active
cycles.
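The relationship between operation latency and effective bandwidth can be illustrated with a deliberately simple back-of-the-envelope model. The sketch below is not the evaluation methodology of this work: it assumes a single bank, ignores bank-level parallelism and command scheduling, and uses hypothetical function names (`avg_access_time_ns`, `effective_bandwidth_gbps`) and illustrative timing values. It only shows that, for a fixed rate of bank conflicts, shorter precharge and activation latencies translate into more bytes delivered per unit time.

```python
def avg_access_time_ns(t_rcd_ns, t_rp_ns, conflict_rate, burst_ns=5.0):
    """Toy model: every access pays the data burst; a bank conflict
    additionally pays precharge (tRP) plus activation (tRCD)."""
    if not 0.0 <= conflict_rate <= 1.0:
        raise ValueError("conflict_rate must be in [0, 1]")
    return burst_ns + conflict_rate * (t_rp_ns + t_rcd_ns)


def effective_bandwidth_gbps(t_rcd_ns, t_rp_ns, conflict_rate,
                             burst_ns=5.0, line_bytes=64):
    """Bytes delivered per nanosecond, which is numerically GB/s."""
    return line_bytes / avg_access_time_ns(t_rcd_ns, t_rp_ns,
                                           conflict_rate, burst_ns)


if __name__ == "__main__":
    # Illustrative timings only, not measurements from this work.
    baseline = effective_bandwidth_gbps(13.125, 13.125, conflict_rate=0.5)
    reduced = effective_bandwidth_gbps(10.0, 10.0, conflict_rate=0.5)
    print(f"baseline: {baseline:.2f} GB/s, reduced: {reduced:.2f} GB/s")
```

Under this model the benefit grows with the conflict rate: with no conflicts the two configurations deliver the same bandwidth, since only the burst time remains.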
All these benets (e.g., reduced latency, increased band-
width, and reduced energy) will become much more impor-
tant as applications become more data-intensive and sys-
tems become more energy-constrained in the foreseeable
future [102, 105].
In conclusion, we believe that in the longer term, the idea of
leveraging variation in different characteristics (e.g., latency,
reliability) inside memory chips will become more beneficial
for both the software and hardware industries. For example, if
the CPU is made aware of variation behavior in memory devices,
memory vendors have an incentive to sell memory with larger
variation at a lower price, allowing system designers to lower
costs with a small amount of additional logic in hardware.
Many other opportunities to improve system performance,
energy, and cost abound, which we hope future work
can build upon and exploit.
7. Conclusion
This paper provides the rst experimental study that com-
prehensively characterizes and analyzes the latency variation
within modern DRAM chips for three fundamental DRAM
operations (activation, precharge, and restoration). We nd
that signicant latency variation is present across DRAM
cells in all 240 of our tested DRAM chips, and that a large
fraction of cache lines can be read reliably even if the activa-
tion/restoration/precharge latencies are reduced signicantly.
Consequently, exploiting the latency variation in DRAM cells
can greatly reduce the DRAM access latency. Based on the
ndings from our experimental characterization, we propose
and evaluate a new mechanism, FLY-DRAM (Flexible-LatencY
DRAM), which reduces DRAM latency by exploiting the in-
herent latency variation in DRAM cells. FLY-DRAM reduces
DRAM latency by categorizing the DRAM cells into fast and
slow regions, and accessing the fast regions with a reduced
latency. We demonstrate that FLY-DRAM can greatly reduce
DRAM latency, leading to signicant system performance
improvements on a variety of workloads.
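As a rough illustration of the fast/slow categorization idea, the following sketch shows how a memory controller might look up per-region timings from a profiling result. This is a minimal sketch under stated assumptions, not the FLY-DRAM implementation: the region granularity (a fixed number of rows), the names `Timing`, `classify_fast_regions`, and `RegionLatencyTable`, and the timing values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    t_rcd_ns: float  # activation latency
    t_rp_ns: float   # precharge latency


# Illustrative values only; real values would come from profiling.
STANDARD = Timing(t_rcd_ns=13.125, t_rp_ns=13.125)
REDUCED = Timing(t_rcd_ns=10.0, t_rp_ns=10.0)


def classify_fast_regions(errors_at_reduced_timing):
    """Return the regions that showed zero errors at the reduced timing."""
    return {region for region, errors in errors_at_reduced_timing.items()
            if errors == 0}


class RegionLatencyTable:
    """Hypothetical per-region lookup a memory controller could consult."""

    def __init__(self, rows_per_region, fast_regions):
        if rows_per_region <= 0:
            raise ValueError("rows_per_region must be positive")
        self.rows_per_region = rows_per_region
        self.fast_regions = frozenset(fast_regions)

    def timing_for(self, row):
        region = row // self.rows_per_region
        # Unprofiled or error-prone regions conservatively use standard timing.
        return REDUCED if region in self.fast_regions else STANDARD


if __name__ == "__main__":
    # Hypothetical profiling result: region id -> errors seen at REDUCED timing.
    profile = {0: 0, 1: 3, 2: 0}
    table = RegionLatencyTable(rows_per_region=512,
                               fast_regions=classify_fast_regions(profile))
    for row in (10, 600, 1024, 5000):
        print(row, table.timing_for(row))
```

The design choice worth noting is the conservative default: any region without a clean profiling result falls back to the standard timing, so a reduced latency is applied only where reliable operation was observed.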
We conclude that it is promising to understand and exploit
the inherent latency variation within modern DRAM chips.
We hope that the experimental characterization, analysis, and
optimization techniques presented in this paper will enable
the development of other new mechanisms that exploit the
latency variation within DRAM to improve system perfor-
mance and perhaps reliability.
Acknowledgments
We thank the anonymous reviewers and SAFARI group
members for their feedback. We acknowledge the support
of Google, Intel, NVIDIA, and Samsung. This research was
supported in part by the ISTC-CC, SRC, and NSF (grants
1212962 and 1320531). Kevin Chang was supported in part
by the SRCEA/Intel Fellowship.
References
[1] J. Ahn et al., “A Scalable Processing-in-Memory Accelerator for Parallel Graph
Processing,” in ISCA, 2015.
[2] J. Ahn et al., “PIM-Enabled Instructions: A Low-Overhead, Locality-Aware
Processing-in-Memory Architecture,” in ISCA, 2015.
[3] B. Akin et al., “Data Reorganization in Memory Using 3D-stacked DRAM,” in
ISCA, 2015.
[4] R. Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Perfor-
mance and Scalability in Heterogeneous Systems,” in ISCA, 2012.
[5] O. O. Babarinsa and S. Idreos, “Jafar: Near-data processing for databases,” in
SIGMOD, 2015.
[6] B. H. Bloom, “Space/Time Tradeoffs in Hash Coding with Allowable Errors,”
CACM, July 1970.
[7] A. Boroumand et al., “LazyPIM: An Efficient Cache Coherence Mechanism for
Processing-in-Memory,” CAL, 2016.
[8] A. Boroumand et al., “Google Workloads for Consumer Devices: Mitigating Data
Movement Bottlenecks,” in ASPLOS, 2018.
[9] Y. Cai et al., “Read Disturb Errors in MLC NAND Flash Memory: Characteriza-
tion and Mitigation,” in DSN, 2015.
[10] Y. Cai et al., “Error Characterization, Mitigation, and Recovery in Flash-Memory-
Based Solid-State Drives,” Proceedings of the IEEE, 2017.
[11] Y. Cai et al., “Error Characterization, Mitigation, and Recovery in Flash Memory
Based Solid-State Drives,” arXiv:1706.08642 [cs.AR], 2017.
[12] Y. Cai et al., “Errors in Flash-Memory-Based Solid-State Drives: Analysis, Miti-
gation, and Recovery,” arXiv:1711.11427 [cs.AR], 2017.
[13] Y. Cai et al., “Vulnerabilities in MLC NAND Flash Memory Programming: Ex-
perimental Analysis, Exploits, and Mitigation Techniques,” in HPCA, 2017.
[14] Y. Cai et al., “Error Patterns in MLC NAND Flash Memory: Measurement, Char-
acterization, and Analysis,” in DATE, 2012.
[15] Y. Cai et al., “Data Retention in MLC NAND Flash Memory: Characterization,
Optimization, and Recovery,” in HPCA, 2015.
[16] Y. Cai et al., “Flash Correct-and-Refresh: Retention-Aware Error Management
for Increased Flash Memory Lifetime,” in ICCD, 2012.
[17] Y. Cai et al., “Error Analysis and Retention-Aware Error Management for NAND
Flash Memory,” in ITJ, 2013.
[18] Y. Cai et al., “Neighbor Cell Assisted Error Correction in MLC NAND Flash Mem-
ories,” in SIGMETRICS, 2014.
[19] Y. Cai et al., “Threshold Voltage Distribution in MLC NAND Flash Memory:
Characterization, Analysis, and Modeling,” in DATE, 2013.
[20] K. Chakraborty and P. Mazumder, Fault-Tolerance and Reliability Techniques for
High-Density Random-Access Memories. Prentice Hall, 2002.
[21] K. Chandrasekar et al., “Exploiting Expendable Process-Margins in DRAMs for
Run-Time Performance Optimization,” in DATE, 2014.
[22] K. K. Chang et al., “Improving DRAM Performance by Parallelizing Refreshes
with Accesses,” in HPCA, 2014.
[23] K. K. Chang, “Understanding and Improving the Latency of DRAM-Based Mem-
ory Systems,” Ph.D. dissertation, Carnegie Mellon University, 2017.
[24] K. K. Chang et al., “Understanding Latency Variation in Modern DRAM Chips:
Experimental Characterization, Analysis, and Optimization,” in SIGMETRICS,
2016.
[25] K. K. Chang et al., “Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-
Subarray Data Movement in DRAM,” in HPCA, 2016.
[26] K. K. Chang et al., “Understanding Reduced-Voltage Operation in Modern DRAM
Devices: Experimental Characterization, Analysis, and Mechanisms,” in SIGMET-
RICS, 2017.
[27] M. T. Chang et al., “Technology Comparison for Large Last-Level Caches (L3Cs):
Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized
eDRAM,” in HPCA, 2013.
[28] R. Clapp et al., “Quantifying the performance impact of memory latency and
bandwidth for big data workloads,” in IISWC, 2015.
[29] H. David et al., “Memory Power Management via Dynamic Voltage/Frequency
Scaling,” in ICAC, 2011.
[30] J. Dean and L. A. Barroso, “The Tail at Scale,” CACM, 2013.
[31] J. Draper et al., “The Architecture of the DIVA Processing-in-memory Chip,” in
ICS, 2002.
[32] N. El-Sayed et al., “Temperature Management in Data Centers: Why Some
(Might) Like It Hot,” in SIGMETRICS, 2012.
[33] A. Farmahini-Farahani et al., “NDA: Near-DRAM acceleration architecture lever-
aging commodity DRAM devices and standard memory modules,” in HPCA,
2015.
[34] B. B. Fraguela et al., “Programming the FlexRAM Parallel Intelligent Memory
System,” in PPoPP, 2003.
[35] M. Gao et al., “Practical near-data processing for in-memory analytics frame-
works,” in PACT, 2015.
[36] M. Gao and C. Kozyrakis, “HRL: Efficient and flexible reconfigurable logic for
near-data processing,” in HPCA, 2016.
[37] S. Ghose et al., “Improving Memory Scheduling via Processor-Side Load Critical-
ity Information,” in ISCA, 2013.
[38] M. Gokhale et al., “Processing in memory: the Terasys massively parallel PIM
array,” Computer, vol. 28, no. 4, pp. 23–31, 1995.
[39] S.-L. Gong et al., “CLEAN-ECC: High Reliability ECC for Adaptive Granularity
Memory System,” in MICRO, 2015.
[40] Q. Guo et al., “3D-Stacked Memory-Side Acceleration: Accelerator and System
Design,” in WONDP, 2014.
[41] X. Guo et al., “Resistive Computation: Avoiding the Power Wall with Low-
Leakage, STT-MRAM Based Computing,” in ISCA, 2010.
[42] M. Hashemi et al., “Accelerating Dependent Cache Misses with an Enhanced
Memory Controller,” in ISCA, 2016.
[43] H. Hassan et al., “SoftMC: A Flexible and Practical Open-Source Infrastructure
for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[44] H. Hassan et al., “ChargeCache: Reducing DRAM Latency by Exploiting Row
Access Locality,” in HPCA, 2016.
[45] H. Hidaka et al., “The Cache DRAM Architecture,” IEEE Micro, 1990.
[46] K. Hsieh et al., “Transparent Offloading and Mapping (TOM): Enabling
Programmer-Transparent Near-Data Processing in GPU Systems,” in ISCA, 2016.
[47] K. Hsieh et al., “Accelerating pointer chasing in 3D-stacked memory: Challenges,
mechanisms, evaluation,” in ICCD, 2016.
[48] A. A. Hwang et al., “Cosmic Rays Don’t Strike Twice: Understanding the Nature
of DRAM Errors and the Implications for System Design,” in ASPLOS, 2012.
[49] E. Ipek et al., “Self-Optimizing Memory Controllers: A Reinforcement Learning
Approach,” in ISCA, 2008.
[50] JEDEC, “DDR2 SDRAM Standard,” 2009.
[51] JEDEC, “DDR3 SDRAM Standard,” 2010.
[52] JEDEC, “Standard No. 21-C. Annex K: Serial Presence Detect (SPD) for DDR3
SDRAM Modules,” 2011.
[53] JEDEC, “DDR4 SDRAM Standard,” 2012.
[54] X. Jian et al., “Low-Power, Low-Storage-Overhead Chipkill Correct via Multi-
Line Error Correction,” in SC, 2013.
[55] S. Kanev et al., “Profiling a Warehouse-Scale Computer,” in ISCA, 2015.
[56] Y. Kang et al., “FlexRAM: toward an advanced intelligent memory system,” in
ICCD, 1999.
[57] S. Khan et al., “Detecting and Mitigating Data-Dependent DRAM Failures by
Exploiting Current Memory Content,” in MICRO, 2017.
[58] S. Khan et al., “A Case for Memory Content-Based Detection and Mitigation of
Data-Dependent Failures in DRAM,” CAL, 2016.
[59] S. Khan et al., “PARBOR: An Efficient System-Level Technique to Detect Data
Dependent Failures in DRAM,” in DSN, 2016.
[60] S. Khan et al., “The Efficacy of Error Mitigation Techniques for DRAM Retention
Failures: A Comparative Experimental Study,” in SIGMETRICS, 2014.
[61] J. S. Kim et al., “The DRAM Latency PUF: Quickly Evaluating Physical Unclonable
Functions by Exploiting the Latency–Reliability Tradeoff in Modern DRAM
Devices,” in HPCA, 2018.
[62] J. S. Kim et al., “GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping
Using Processing-in-Memory Technologies,” BMC Genomics, 2018.
[63] J. Kim et al., “Bamboo ECC: Strong, Safe, and Flexible Codes for Reliable Com-
puter Memory,” in HPCA, 2015.
[64] Y. Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator,” CAL, 2015.
[65] Y. Kim et al., “Flipping Bits in Memory Without Accessing Them: An Experimen-
tal Study of DRAM Disturbance Errors,” in ISCA, 2014.
[66] Y. Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm
for Multiple Memory Controllers,” in HPCA, 2010.
[67] Y. Kim et al., “Thread Cluster Memory Scheduling: Exploiting Differences in
Memory Access Behavior,” in MICRO, 2010.
[68] Y. Kim et al., “A Case for Exploiting Subarray-Level Parallelism (SALP) in
DRAM,” in ISCA, 2012.
[69] P. M. Kogge, “EXECUBE-A New Architecture for Scaleable MPPs,” in ICPP, 1994.
[70] E. Kultursay et al., “Evaluating STT-RAM as an energy-efficient main memory
alternative,” in ISPASS, 2013.
[71] B. C. Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alter-
native,” in ISCA, 2009.
[72] B. C. Lee et al., “Phase Change Memory Architecture and the Quest for Scalabil-
ity,” CACM, vol. 53, no. 7, pp. 99–106, 2010.
[73] B. C. Lee et al., “Phase-Change Technology and the Future of Main Memory,”
IEEE Micro, vol. 30, no. 1, pp. 143–143, 2010.
[74] C. J. Lee et al., “Prefetch-Aware DRAM Controllers,” in MICRO, 2008.
[75] C. J. Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-
Caused Interference in Memory Systems,” Univ. of Texas at Austin, High Per-
formance Systems Group, Tech. Rep. TR-HPS-2010-002, 2010.
[76] D. Lee et al., “Design-Induced Latency Variation in Modern DRAM Chips: Char-
acterization, Analysis, and Latency Reduction Mechanisms,” in SIGMETRICS,
2017.
[77] D. Lee et al., “Decoupled Direct Memory Access: Isolating CPU and IO Traffic
by Leveraging a Dual-Data-Port DRAM,” in PACT, 2015.
[78] D. Lee, “Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity,”
Ph.D. dissertation, Carnegie Mellon University, 2016.
[79] D. Lee et al., “Adaptive-Latency DRAM: Optimizing DRAM Timing for the
Common-Case,” in HPCA, 2015.
[80] D. Lee et al., “Adaptive-Latency DRAM: Reducing DRAM Latency by Exploiting
Timing Margins,” IPSI Transactions on Advanced Research (TAR), 2018.
[81] D. Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Ar-
chitecture,” in HPCA, 2013.
[82] D. Lee et al., “Simultaneous Multi Layer Access: A High Bandwidth and Low
Cost 3D-Stacked Memory Interface,” TACO, 2016.
[83] S. Li et al., “MAGE: Adaptive Granularity and ECC for Resilient and Power
Efficient Memory Systems,” in SC, 2012.
[84] X. Li et al., “A Realistic Evaluation of Memory Hardware Errors and Software
System Susceptibility,” in USENIX ATC, 2010.
[85] Y. Li et al., “DRAM Yield Analysis and Optimization by a Statistical Design Ap-
proach,” in IEEE TCSI, 2011.
[86] J. Liu et al., “An Experimental Study of Data Retention Behavior in Modern
DRAM Devices: Implications for Retention Time Profiling Mechanisms,” in ISCA,
2013.
[87] J. Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” in ISCA, 2012.
[88] S.-L. Lu et al., “Improving DRAM Latency with Dynamic Asymmetric Subarray,”
in MICRO, 2015.
[89] Y. Luo et al., “WARM: Improving NAND flash memory lifetime with
write-hotness aware retention management,” in MSST, 2015.
[90] Y. Luo et al., “Enabling Accurate and Practical Online Flash Channel Modeling
for Modern MLC NAND Flash Memory,” JSAC, 2016.
[91] Y. Luo et al., “HeatWatch: Improving 3D NAND Flash Memory Device Reliability
by Exploiting Self-Recovery and Temperature Awareness,” in HPCA, 2018.
[92] K. Mai et al., “Smart memories: a modular reconfigurable architecture,” in ISCA,
2000.
[93] S. A. McKee, “Reflections on the memory wall,” in CF, 2004.
[94] MemSQL, Inc., “MemSQL,” https://www.memsql.com.
[95] J. Meza et al., “Revisiting Memory Errors in Large-Scale Production Data Centers:
Analysis and Modeling of New Trends from the Field,” in DSN, 2015.
[96] J. Meza et al., “A Case for Small Row Buffers in Non-Volatile Main Memories,” in
ICCD Poster Session, 2012.
[97] Micron Technology, Inc., “128Mb: x4, x8, x16 Automotive SDRAM,” 1999.
[98] Micron Technology, Inc., “576Mb: x18, x36 RLDRAM3,” 2011.
[99] T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory
Service in Multi-core Systems,” in USENIX Security, 2007.
[100] S. P. Muralidhara et al., “Reducing Memory Interference in Multicore Systems
via Application-aware Memory Channel Partitioning,” in MICRO, 2011.
[101] O. Mutlu, “The RowHammer problem and other issues we may face as memory
becomes denser,” in DATE, 2017.
[102] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” IMW, 2013.
[103] O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for
Chip Multiprocessors,” in MICRO, 2007.
[104] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing
Both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
[105] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in Memory
Systems,” SUPERFRI, 2014.
[106] S. O et al., “Row-Buffer Decoupling: A Case for Low-Latency DRAM
Microarchitecture,” in ISCA, 2014.
[107] M. Onabajo and J. Silva-Martinez, Analog Circuit Design for Process Variation-
Resilient Systems-on-a-Chip. Springer, 2012.
[108] Oracle, “Oracle TimesTen In-Memory Database,” https://www.oracle.com/database/timesten-in-memory-database/index.html.
[109] M. Oskin et al., “Active pages: a computation model for intelligent memory,” in
ISCA, 1998.
[110] M. Patel et al., “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM
Retention Failures via Profiling at Aggressive Conditions,” in ISCA, 2017.
[111] D. Patterson et al., “A Case for Intelligent RAM,” IEEE Micro, 1997.
[112] A. Pattnaik et al., “Scheduling Techniques for GPU Architectures with
Processing-In-Memory Capabilities,” in PACT, 2016.
[113] G. Pekhimenko et al., “A Case for Toggle-Aware Compression for GPU Systems,”
in HPCA, 2016.
[114] S. H. Pugsley et al., “NDC: Analyzing the impact of 3D-stacked memory+logic
devices on MapReduce workloads,” in ISPASS, 2014.
[115] M. K. Qureshi et al., “AVATAR: A Variable-Retention-Time (VRT) Aware Refresh
for DRAM Systems,” in DSN, 2015.
[116] M. K. Qureshi et al., “Enhancing Lifetime and Security of PCM-based Main Mem-
ory with Start-gap Wear Leveling,” in MICRO, 2009.
[117] M. K. Qureshi et al., “Scalable High Performance Main Memory System Using
Phase-change Memory Technology,” in ISCA, 2009.
[118] S. Rixner et al., “Memory Access Scheduling,” in ISCA, 2000.
[119] SAFARI Research Group, “Ramulator – GitHub Repository,” https://github.com/CMU-SAFARI/ramulator.
[120] SAFARI Research Group, “SAFARI Software Tools – GitHub Repository,” https://github.com/CMU-SAFARI.
[121] S. Sanlippo, “Redis,” https://redis.io.
[122] Y. Sato et al., “Fast cycle RAM (FCRAM): A 20-ns Random Row Access, Pipe-
Lined Operating DRAM,” in VLSIC, 1998.
[123] B. Schroeder et al., “DRAM Errors in the Wild: A Large-Scale Field Study,” in
SIGMETRICS, 2009.
[124] V. Seshadri et al., “The Dirty-Block Index,” in ISCA, 2014.
[125] V. Seshadri et al., “Fast Bulk Bitwise AND and OR in DRAM,” CAL, 2015.
[126] V. Seshadri, “Simple DRAM and Virtual Memory Abstractions to Enable Highly
Ecient Memory Systems,” Ph.D. dissertation, Carnegie Mellon University, 2016.
[127] V. Seshadri et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data
Copy and Initialization,” in MICRO, 2013.
[128] V. Seshadri et al., “Ambit: In-Memory Accelerator for Bulk Bitwise Operations
Using Commodity DRAM Technology,” in MICRO, 2017.
[129] V. Seshadri et al., “Gather-Scatter DRAM: In-DRAM Address Translation to Im-
prove the Spatial Locality of Non-Unit Strided Accesses,” in MICRO, 2015.
[130] W. Shin et al., “NUAT: A Non-Uniform Access Time Memory Controller,” in
HPCA, 2014.
[131] Y. H. Son et al., “Reducing Memory Access Latency with Asymmetric DRAM
Bank Organizations,” in ISCA, 2013.
[132] V. Sridharan et al., “Memory Errors in Modern Systems: The Good, The Bad, and
The Ugly,” in ASPLOS, 2015.
[133] V. Sridharan and D. Liberty, “A Study of DRAM Failures in the Field,” in SC, 2012.
[134] H. S. Stone, “A Logic-in-Memory Computer,” IEEE TC, 1970.
[135] L. Subramanian et al., “BLISS: Balancing Performance, Fairness and Complexity
in Memory Access Scheduling,” in IEEE TPDS, 2016.
[136] L. Subramanian et al., “The Blacklisting Memory Scheduler: Achieving High
Performance and Fairness at Low Cost,” in ICCD, 2014.
[137] L. Subramanian et al., “Mise: Providing performance predictability and improv-
ing fairness in shared main memory systems,” in HPCA, 2013.
[138] L. Subramanian et al., “The Application Slowdown Model: Quantifying and Con-
trolling the Impact of Inter-application Interference at Shared Caches and Main
Memory,” in MICRO, 2015.
[139] Z. Sura et al., “Data access optimization in a processing-in-memory system,” in
CF, 2015.
[140] A. N. Udipi et al., “LOT-ECC: Localized and Tiered Reliability Mechanisms for
Commodity Memory Systems,” in ISCA, 2012.
[141] H. Usui et al., “DASH: Deadline-Aware High-Performance Memory Scheduler
for Heterogeneous Systems with Hardware Accelerators,” TACO, vol. 12, no. 4,
pp. 65:1–65:28, 2016.
[142] C. Wilkerson et al., “Reducing Cache Power with Low-cost, Multi-bit Error-
correcting Codes,” in ISCA, 2010.
[143] M. V. Wilkes, “The Memory Gap and the Future of High Performance Memories,”
SIGARCH CAN, 2001.
[144] H.-S. P. Wong et al., “Metal-Oxide RRAM,” Proc. IEEE, 2012.
[145] D. H. Yoon et al., “BOOM: Enabling Mobile Memory Based Low-Power Server
DIMMs,” in ISCA, 2012.
[146] D. H. Yoon and M. Erez, “Virtualized ECC: Flexible Reliability in Main Memory,”
in ASPLOS, 2010.
[147] H. Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid
Memories,” in ICCD, 2012.
[148] H. Yoon et al., “Efficient Data Mapping and Buffering Techniques for Multilevel
Cell Phase-Change Memories,” TACO, vol. 11, no. 4, pp. 40:1–40:25, 2014.
[149] D. Zhang et al., “TOP-PIM: Throughput-Oriented Programmable Processing in
Memory,” in HPDC, 2014.
[150] T. Zhang et al., “Half-DRAM: A High-Bandwidth and Low-Power DRAM Archi-
tecture from the Rethinking of Fine-grained Activation,” in ISCA, 2014.
