Reducing DRAM Refresh Overheads with Refresh-Access Parallelism by Chang, K. K. et al.
Reducing DRAM Refresh Overheads
with Refresh–Access Parallelism
Kevin K. Chang1,2 Donghyuk Lee3,2 Zeshan Chishti4
Alaa R. Alameldeen4 Chris Wilkerson4 Yoongu Kim2 Onur Mutlu5,2
1Facebook 2Carnegie Mellon University 3NVIDIA Research 4Intel Labs 5ETH Zürich
This article summarizes the idea of “refresh–access paral-
lelism,” which was published in HPCA 2014 [17], and examines
the work’s significance and future potential. The overarching
objective of our HPCA 2014 paper is to reduce the significant
negative performance impact of DRAM refresh with intelligent
memory controller mechanisms.
To mitigate the negative performance impact of DRAM re-
fresh, our HPCA 2014 paper proposes two complementary mech-
anisms, DARP (Dynamic Access Refresh Parallelization) and
SARP (Subarray Access Refresh Parallelization). The goal is
to address the drawbacks of the state-of-the-art per-bank refresh
mechanism by building more efficient techniques to parallelize
refreshes and accesses within DRAM. First, instead of issuing
per-bank refreshes in a round-robin order, as it is done today,
DARP issues per-bank refreshes to idle banks in an out-of-order
manner. Furthermore, DARP proactively schedules refreshes
during intervals when a batch of writes is draining to DRAM.
Second, SARP exploits the existence of mostly-independent subarrays
within a bank. With minor modifications to DRAM
organization, it allows a bank to serve memory accesses to an
idle subarray while another subarray is being refreshed. Our
extensive evaluations on a wide variety of workloads and sys-
tems show that our mechanisms improve system performance
(and energy efficiency) compared to three state-of-the-art refresh
policies, and their performance benefits increase as DRAM
density increases.
1. Introduction
Modern main memory is predominantly built using dy-
namic random access memory (DRAM) cells. A DRAM cell
consists of a capacitor to store one bit of data as electrical
charge. The capacitor leaks charge over time, causing stored
data to change. As a result, DRAM requires an operation
called refresh that periodically restores electrical charge in
DRAM cells to maintain data integrity.
There are two major ways refresh operations are performed
in modern DRAM systems: all-bank refresh (or, rank-level refresh)
and per-bank refresh. These methods differ in what
levels of the DRAM hierarchy refresh operations tie up. A
modern DRAM system is organized as a hierarchy of ranks
and banks. Each rank is composed of multiple banks. Dif-
ferent ranks and banks can be accessed independently. Each
bank contains a number of rows (e.g., 16-32K in modern
chips). Because successively refreshing all rows in a DRAM
chip would cause very high delay by tying up the entire
DRAM device, modern memory controllers issue a number
of refresh commands that are evenly distributed throughout
the refresh interval [38, 40, 73, 74, 93]. Each refresh command
refreshes a small number of rows.1 The two common refresh
methods of today differ in where in the DRAM hierarchy the
rows refreshed by a refresh command reside.
In all-bank refresh (REFab), employed by both commodity
DDR and LPDDR DRAM chips, a refresh command operates
at the rank level: it refreshes a number of rows in all banks
of a rank concurrently. This causes every bank within a rank
to be unavailable to serve memory requests until the refresh
command is complete. Therefore, it degrades performance
signicantly [4, 17, 74, 88, 93, 96, 115].
An alternative method is to perform refresh operations
at the bank level, called per-bank refresh (REFpb), which is
currently supported in LPDDR DRAM used in mobile plat-
forms [40]. In contrast to REFab, REFpb enables a bank to be
accessed while another bank is being refreshed, alleviating
part of the negative performance impact of refresh. Figure 1
shows pictorially how REFpb provides performance benefits
over REFab from parallelization of refreshes and reads. REFpb
reduces refresh interference on reads by issuing a refresh
to Bank 0 while Bank 1 is serving reads. Subsequently, it
refreshes Bank 1 while allowing Bank 0 to serve a read. As
a result, REFpb alleviates part of the performance loss due to
refreshes by enabling parallelization of refreshes and accesses
across banks.
[Figure 1 timelines: Bank 0 and Bank 1 serving reads under No-Refresh, All-Bank Refresh (REFab), and Per-Bank Refresh (REFpb); the per-bank timeline is annotated with "Saved Cycles in REFpb".]
Figure 1: Service timelines of all-bank and per-bank refresh.
Adapted from [17].
Unfortunately, there are two shortcomings of per-bank
refresh. First, refreshes to different banks are scheduled in
a strict round-robin order, as specified by the LPDDR standard [40].
Using this static policy may force a busy bank to
be refreshed, delaying the memory requests queued in that
bank, while other idle banks are available to be refreshed.
Second, a bank that is refreshing cannot concurrently serve
memory requests. Hence, requests to a refreshing bank get
delayed due to a “refresh–access bank conflict.”
1The time between two refresh commands is fixed to an amount that is
dependent on the DRAM type and temperature. We refer the reader to our
prior works [17, 18, 19, 20, 30, 31, 49, 52, 53, 54, 55, 56, 67, 68, 69, 70, 71, 73, 74, 96,
107, 108] for a detailed background on DRAM.
arXiv:1805.01289v1 [cs.AR] 2 May 2018
We show that the negative performance impact of DRAM
refresh becomes exacerbated as DRAM density increases in
the future. Figure 2 shows the average performance degradation
of all-bank/per-bank refresh compared to an ideal baseline
without any refresh.2 Although REFpb performs slightly better
than REFab, the performance loss due to refresh is still significant,
especially as the density grows (16.6% loss at 32Gb).
Therefore, the goal of this work is to provide practical mechanisms
to overcome the aforementioned two shortcomings to
mitigate the performance overhead of DRAM refresh.
[Figure 2 plot: performance loss (%) versus DRAM density (8Gb, 16Gb, 32Gb) for REFab and REFpb.]
Figure 2: Performance loss due to all-bank refresh (REFab)
and per-bank refresh (REFpb). Reproduced from [17].
2. Parallelizing Refreshes with Memory Accesses
We propose two mechanisms, Dynamic Access Refresh Par-
allelization (DARP) and Subarray Access Refresh Parallelization
(SARP), that hide refresh latency by parallelizing refreshes
with memory accesses across banks and subarrays, respec-
tively. In this section, we present a brief overview of these
two new mechanisms. We refer the reader to Section 4 of our
HPCA 2014 paper [17] for more detail on the algorithm and
implementation.
2.1. Dynamic Access Refresh Parallelization (DARP)
DARP is a new refresh scheduling policy that consists of
two components. The first component is out-of-order per-bank
refresh, which enables the memory controller to specify
a particular (idle) bank to be refreshed as opposed to the stan-
dard per-bank refresh policy that refreshes banks in a strict
round-robin order. With out-of-order refresh scheduling,
DARP can avoid refreshing (non-idle) banks with pending
memory requests, thereby avoiding the refresh latency for
those requests. The second component is write-refresh paral-
lelization that proactively issues REFpb to a bank while DRAM
is draining write batches to other banks, thereby overlapping
refresh latency with write request latencies.
2.1.1. DARP: Out-of-order Per-bank Refresh. A major
limitation of the current REFpb mechanism is that it disallows
a memory controller from specifying which bank to
refresh. Instead, a DRAM chip has internal logic that strictly
refreshes banks in a sequential round-robin order. Because
DRAM lacks visibility into a memory controller’s state (e.g.,
request queues’ occupancy), simply using an in-order REFpb
policy can unnecessarily refresh a bank that has multiple
pending requests to be served when other banks may be free
to serve a refresh command. To address this problem, we
propose the first component of DARP, out-of-order per-bank
refresh. The idea is to remove the bank selection logic from
DRAM and make it the memory controller’s responsibility
to determine which bank to refresh. As a result, the memory
controller can refresh an idle bank to enhance parallelization
of refreshes and accesses, avoiding refreshing a bank that has
pending requests as much as possible.
2Our detailed methodology is described in Section 5 of our full paper [17].
Due to REFpb reordering, the memory controller needs to
guarantee that deviating from the original in-order refresh
schedule still preserves data integrity. To achieve this, we take
advantage of the fact that the contemporary DDR JEDEC standard [39]
provides some refresh scheduling flexibility. The
standard allows up to eight all-bank refresh commands to be
issued late (postponed) or early (pulled-in). This implies that
each bank can tolerate up to eight REFpb commands to be post-
poned or pulled in. Therefore, the memory controller ensures
that reordering REFpb preserves data integrity by limiting the
number of postponed or pulled-in commands. Our HPCA
2014 paper [17] describes our new algorithm for out-of-order
per-bank refresh in detail.
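The bank-selection idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm from [17]: the names `BankState`, `pick_refresh_bank`, and `issue_refresh`, the use of queue occupancy as the idleness test, and the per-bank `owed` counter are all hypothetical. Only the limit of eight postponed or pulled-in refreshes comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_SLACK = 8  # from the text: up to eight refreshes may be postponed or pulled in


@dataclass
class BankState:
    pending: int = 0  # queued demand requests for this bank (assumed visible to the controller)
    owed: int = 0     # refreshes owed: >0 means postponed, <0 means pulled in early


def pick_refresh_bank(banks: List[BankState]) -> Optional[int]:
    """Return the index of the bank to refresh now, or None to postpone."""
    # 1) A bank that has used up its postponement budget must be refreshed,
    #    even if it is busy, to preserve data integrity.
    for i, bank in enumerate(banks):
        if bank.owed >= MAX_SLACK:
            return i
    # 2) Otherwise prefer an idle bank that can still accept a refresh,
    #    choosing the one that is furthest behind schedule.
    idle = [i for i, bank in enumerate(banks)
            if bank.pending == 0 and bank.owed > -MAX_SLACK]
    if idle:
        return max(idle, key=lambda i: banks[i].owed)
    # 3) Every eligible bank is busy: postpone rather than delay requests.
    return None


def issue_refresh(banks: List[BankState], index: int) -> None:
    """Record that one refresh was performed for the given bank."""
    banks[index].owed -= 1
```

In this sketch a busy bank is refreshed only when its postponement budget is exhausted, which mirrors the goal of avoiding banks with pending requests as much as possible.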
2.1.2. DARP: Write-refresh Parallelization. The key idea
of the second component of DARP is to actively avoid refresh
interference on read requests and instead enable more par-
allelization of refreshes with write requests. We make two
observations that lead to our idea. First, write batching in
DRAM [65] creates an opportunity to overlap a refresh op-
eration with a sequence of writes, without interfering with
reads. A modern memory controller typically buffers DRAM
writes and drains them to DRAM in a batch to amortize the
bus turnaround latency, also called tWTR or tRTW [39,56,65],
which is the additional latency incurred from switching between
serving writes and serving reads. Typical systems
start draining writes when the write buffer occupancy
exceeds a certain threshold and continue until the buffer reaches a low
watermark. This draining time period is called the writeback
mode, during which no rank within the draining channel can
serve read requests [22, 65, 116]. Second, DRAM writes are
usually not latency-critical because processors do not stall
to wait for them: DRAM writes are due to dirty cache line
evictions from the last-level cache [65, 105, 116].
Given that writes are not latency-critical and are drained
in a batch for some time interval, they are more flexible to
be scheduled with minimal performance impact. We propose
the second component of DARP, write-refresh parallelization,
that attempts to maximize parallelization of refreshes and
writes. Write-refresh parallelization selects the bank with the
minimum number of pending demand requests (both read
and write) and preempts the bank’s writes with a per-bank
refresh. As a result, the bank’s refresh operation is hidden by
the writes in other banks.
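The selection rule described above can be sketched in a few lines. This is an illustrative sketch only: the function name `pick_bank_during_writeback`, the dictionary layout, and the tie-breaking rule are assumptions, and the sketch omits the postpone/pull-in limit that a real controller would also have to enforce.

```python
from typing import Dict, Optional, Tuple


def pick_bank_during_writeback(
    pending: Dict[int, Tuple[int, int]],  # bank id -> (queued reads, queued writes)
    in_writeback_mode: bool,
) -> Optional[int]:
    """Pick the bank whose writes are preempted by a per-bank refresh."""
    if not in_writeback_mode or not pending:
        return None  # this sketch only schedules refreshes while writes drain
    # Fewest pending demand requests (reads + writes); ties go to the lowest bank id.
    return min(sorted(pending), key=lambda bank: sum(pending[bank]))
```

For example, with queues `{0: (2, 5), 1: (0, 1), 2: (1, 3)}` in writeback mode, the sketch selects bank 1, so the refresh overlaps with the writes draining to banks 0 and 2.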
Figure 3 shows the service timeline and benefits of
write-refresh parallelization. There are two scenarios when
the scheduling policy parallelizes refreshes with writes to increase
DRAM’s availability to serve read requests. Figure 3a
shows the first scenario when the scheduler postpones issuing
a REFpb command to avoid delaying a read request in Bank
0 and instead serves the refresh in parallel with writes from
Bank 1, effectively hiding the refresh latency in the writeback
mode. Even though the refresh can potentially delay individ-
ual write requests during writeback mode, the delay does not
impact performance as long as the length of writeback mode
remains the same as in the baseline due to longer prioritized
write request streams in other banks. In the second scenario
shown in Figure 3b, the scheduler proactively pulls in a REFpb
command early in Bank 0 to fully hide the refresh latency
from the later read request while Bank 1 is draining writes
during the writeback mode (note that the read request cannot
be scheduled during the writeback mode).
[Figure 3 timelines: Bank 0 and Bank 1 under baseline per-bank refresh, where the refresh delays a read, and under write-refresh parallelization, where the refresh overlaps with writes and saves cycles.]
(a) Scenario 1: Parallelize postponed refresh with writes.
(b) Scenario 2: Parallelize pulled-in refresh with writes.
Figure 3: Service timeline of a per-bank refresh operation
along with read and write requests using different refresh
scheduling policies. Reproduced from [17].
2.2. Subarray Access Refresh Parallelization (SARP)
To tackle the problem of refreshes and accesses colliding
within the same bank, we propose SARP (Subarray Access
Refresh Parallelization), which exploits the existence of subarrays [56]
within a bank. A DRAM bank is sub-divided into
multiple subarrays [19, 23, 31, 56, 67, 69, 70, 76, 106, 107, 108, 110,
120, 125, 126], as shown in Figure 4. A subarray consists of
a 2-D array of cells organized in rows and columns.3 Each
DRAM cell has two components: 1) a capacitor that stores one
bit of data as electrical charge, and 2) an access transistor that
connects the capacitor to a wire called bitline that is shared
by a column of cells. The access transistor is controlled by a
wire called wordline that is shared by a row of cells. When a
wordline is raised to VDD, a row of cells becomes connected
to the bitlines, allowing reading or writing data to the connected
row of cells. The component that reads (i.e., senses)
or writes a bit of data on a bitline is called a sense amplifier,
shared by an entire column of cells. A row of sense amplifiers
is also called a row buffer. All subarrays’ row buffers are connected
to an I/O buffer [22, 48, 68, 87] that reads and writes
data from/to the bank’s I/O bus.
3Physically, DRAM has 32 to 128 subarrays, which varies depending
on the number of rows (typically 16-64K) within a bank. This work divides
them into 8 subarray groups. We refer to a subarray group as a subarray [56],
without loss of generality.
[Figure 4 diagram: a bank with its row decoder, multiple subarrays, I/O buffer, and I/O bus (left); a subarray with cells, wordlines, bitlines, and a row of sense amplifiers forming the row buffer (right).]
Figure 4: DRAM bank and subarray organization. Repro-
duced from [17].
The key observation leading to our second mechanism,
SARP, is that a refresh operation is constrained to only a few
subarrays within a bank whereas the other subarrays and
the I/O bus remain idle during the process of refreshing. The
reasons for this are two-fold. First, refreshing a row requires
only its subarray’s sense amplifiers that restore the charge in
the row without transferring any data through the I/O bus.
Second, each subarray has its own set of sense amplifiers that
are not shared with other subarrays.
Based on this observation, SARP’s key idea is to allow
memory accesses to an idle subarray while other subarrays
are refreshing. Figure 5 shows the service timeline and the
performance benet of our mechanism. As shown, SARP
reduces the read latency by performing the read operation to
Subarray 1 in parallel with the refresh in Subarray 0. Com-
pared to DARP, SARP provides the following advantages: 1)
SARP is applicable to both all-bank and per-bank refresh, 2)
SARP enables memory accesses to a refreshing bank, which
cannot be achieved with DARP, and 3) SARP also utilizes
bank-level parallelism [66, 91] by serving memory requests
to multiple banks in parallel while the entire rank is under
refresh.
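The controller-side check that SARP implies can be sketched as below. This is a simplified illustration, assuming that exactly one subarray per bank is under refresh at a time (the text says a refresh is constrained to "only a few" subarrays); the names `Request` and `can_issue` are hypothetical.

```python
from typing import Optional, Tuple

Request = Tuple[int, int]  # (bank, subarray) targeted by a memory request


def can_issue(req: Request, refreshing: Optional[Tuple[int, int]], sarp: bool) -> bool:
    """True if the request can be served while `refreshing` (bank, subarray) is under refresh."""
    if refreshing is None:
        return True  # nothing is refreshing
    bank, subarray = req
    ref_bank, ref_subarray = refreshing
    if bank != ref_bank:
        return True  # different bank: per-bank refresh already allows this
    # Same bank: without SARP the whole bank is blocked; with SARP only the
    # refreshing subarray is.
    return sarp and subarray != ref_subarray
```

Without SARP, any request to the refreshing bank waits; with SARP, only a request to the refreshing subarray itself has to wait.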
SARP requires modifications to 1) the DRAM architecture,
because two distinct wordlines in different subarrays need
to be raised simultaneously (to accommodate parallel refresh
and access to the two subarrays), which cannot be done in
today’s DRAM due to the shared peripheral logic among
subarrays; and 2) the memory controller, such that it can
keep track of which subarray is under refresh in order to send
the appropriate memory request to an idle subarray. Section
4.3 of our HPCA 2014 paper [17] describes these changes in
detail. To evaluate the benefits and die area overhead of SARP,
we use 8 subarrays per bank and 8 banks per DRAM chip.
Based on this configuration, we calculate the area overhead
of SARP using parameters from a Rambus DRAM model at
55nm technology [101], and find it to be 0.71% in a 2Gb DDR3
DRAM chip with a die area of 73.5mm². The power overhead
of the additional components is negligible compared to the
entire DRAM chip.
[Figure 5 timelines: Subarray 0 and Subarray 1 of Bank 0 under all-bank or per-bank refresh, where the refresh delays the read, and under subarray access refresh parallelization, where the read overlaps with the refresh and saves cycles.]
Figure 5: Service timeline of a refresh and a read request to
two different subarrays within the same bank. Reproduced
from [17].
3. Evaluation
We briey summarize our results on an eight-core system.
Section 6 of our HPCA 2014 paper provides detailed evalua-
tions on a wide variety of systems and sensitivity studies. We
evaluate the performance of our proposed mechanisms on an
eight-core system using Ramulator [52, 103], an open-source
cycle-level DRAM simulator, driven by CPU traces generated
from Pin [77]. We use benchmarks from SPEC CPU2006 [113],
STREAM [83], TPC [118], and a microbenchmark with random-
access behavior similar to HPCC RandomAccess [34]. Table 1
summarizes the configuration of our evaluated system.
Processor: 8 cores, 4GHz, 3-wide issue, 8 MSHRs/core, 128-entry instruction window
Last-level Cache: 64B cache-line, 16-way associative, 512KB private cache-slice per core
Memory Controller: 64/64-entry read/write request queue, FR-FCFS [102], writes are scheduled in batches [22, 65, 116] with low watermark = 32, closed-row policy [22, 54, 55, 102]
DRAM: DDR3-1333 [86], 2 channels, 2 ranks per channel, 8 banks/rank, 8 subarrays/bank, 64K rows/bank, 8KB rows
Refresh Settings: tRFCab = 350/530/890ns for 8/16/32Gb DRAM chips, tREFIab = 3.9µs, tRFCab-to-tRFCpb ratio = 2.3
Table 1: Evaluated system configuration. Adapted from [17].
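As a back-of-the-envelope reading of the refresh settings in Table 1, the fraction of time a rank is unavailable under all-bank refresh is tRFCab divided by tREFIab, and the per-bank refresh latency follows from the stated ratio. The sketch below works through that arithmetic; the constant and function names are ours, and the resulting fractions are a rough estimate of refresh busy time, not the performance loss measured in [17].

```python
T_REFI_AB_NS = 3900.0                           # tREFIab = 3.9 us (Table 1)
T_RFC_AB_NS = {8: 350.0, 16: 530.0, 32: 890.0}  # tRFCab per density in Gb (Table 1)
AB_TO_PB_RATIO = 2.3                            # tRFCab-to-tRFCpb ratio (Table 1)


def rank_busy_fraction(density_gb: int) -> float:
    """Fraction of time a rank is unavailable under all-bank refresh."""
    return T_RFC_AB_NS[density_gb] / T_REFI_AB_NS


def t_rfc_pb_ns(density_gb: int) -> float:
    """Per-bank refresh latency implied by the Table 1 ratio."""
    return T_RFC_AB_NS[density_gb] / AB_TO_PB_RATIO


for density in (8, 16, 32):
    print(f"{density}Gb: busy {rank_busy_fraction(density):.1%}, "
          f"tRFCpb ~ {t_rfc_pb_ns(density):.0f} ns")
```

Under these settings the rank is busy refreshing for roughly 9%, 14%, and 23% of the time at 8, 16, and 32Gb, which illustrates why the refresh penalty grows with density.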
Figure 6 shows the average system performance (left) and
energy per DRAM access (right) of our final mechanism,
DSARP, the combination of DARP and SARP, compared to
two baseline refresh schemes and an ideal scheme without
any refreshes. We measure system performance with the
commonly-used weighted speedup (WS) [26, 109] metric. The
percentage numbers on top of the bars are the performance
improvement of DSARP over REFab.
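For readers unfamiliar with the metric, weighted speedup is commonly defined as the sum, over all applications in a workload, of each application's IPC when running together with the others divided by its IPC when running alone. A minimal sketch of that definition follows; the function name and the IPC values in the usage note are made up for illustration and are not results from [17].

```python
from typing import Sequence


def weighted_speedup(ipc_shared: Sequence[float], ipc_alone: Sequence[float]) -> float:
    """Sum of per-application IPC ratios (shared run over alone run)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))
```

For instance, two applications that each run at half of their standalone IPC, `weighted_speedup([1.0, 0.5], [2.0, 1.0])`, give a weighted speedup of 1.0.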
We make two observations. First, DSARP consistently improves
system performance and energy efficiency over prior
refresh schemes, capturing most of the benefit of the ideal system
with no refresh. Second, as DRAM density (i.e., refresh
latency) increases, the performance benefit of DSARP gets
larger. We conclude that DSARP is an effective mechanism to
alleviate the negative performance impact of DRAM refresh.
[Figure 6 plots: weighted speedup (left) and energy per access in nJ (right) versus DRAM density (8Gb, 16Gb, 32Gb) for REFab, REFpb, DSARP, and No REF. Labels above the bars: 7.9%, 12.3%, 20.2% (weighted speedup) and 3.0%, 5.2%, 9.0% (energy per access).]
Figure 6: Average system performance and energy consumption
due to different refresh mechanisms.
3.1. Comparison to DDR4 Fine Granularity Refresh
DDR4 DRAM supports a new refresh mode called fine granularity
refresh (FGR) in an attempt to mitigate the increasing
refresh latency (tRFCab) [39]. FGR trades off shorter tRFCab
with a faster refresh rate (1/tREFIab) that increases by either 2x
or 4x. Figure 7 shows the effect of FGR in comparison to REFab,
adaptive refresh policy (AR) [88], and DSARP. 2x and 4x FGR ac-
tually reduce average system performance by 3.9%/4.0%/4.3%
and 8.1%/13.7%/15.1% compared to REFab with 8/16/32Gb den-
sities, respectively. As the refresh rate increases by 2x/4x
(higher refresh penalty), tRFCab does not scale down with the
same constant factors. Instead, tRFCab reduces by 1.35x/1.63x
with 2x/4x higher rate [39], thus increasing the worst-case
refresh latency by 1.48x/2.45x. This performance degradation
due to FGR has also been observed in Mukundan et al. [88].
AR [88] dynamically switches between 1x (i.e., REFab) and 4x
refresh modes to mitigate the downsides of FGR. AR performs
slightly worse than REFab (within 1%) for all densities. Be-
cause using 4x FGR greatly degrades performance, AR can
only mitigate the large loss from the 4x mode and cannot
improve performance over REFab. On the other hand, DSARP
is a more effective mechanism to tolerate the long refresh
latency than both FGR and AR as it overlaps refresh latency
with access latency without increasing the refresh rate.
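The 1.48x/2.45x factors quoted above follow from dividing the refresh-rate multiplier by the tRFCab reduction factor. A small sketch of that arithmetic, with constant and function names chosen by us for illustration:

```python
# Refresh-rate multiplier -> tRFCab reduction factor, as quoted in the text.
FGR_TRFC_REDUCTION = {2: 1.35, 4: 1.63}


def refresh_time_factor(rate_multiplier: int) -> float:
    """Refresh time relative to the 1x mode: more commands, each only somewhat shorter."""
    return rate_multiplier / FGR_TRFC_REDUCTION[rate_multiplier]


for mode in (2, 4):
    print(f"FGR {mode}x: {refresh_time_factor(mode):.2f}x")
```

Because tRFCab shrinks by less than the refresh rate grows, both FGR modes spend more total time refreshing than the 1x mode.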
[Figure 7 plot: normalized weighted speedup versus DRAM density (8Gb, 16Gb, 32Gb) for REFab, FGR 2x, FGR 4x, AR, and DSARP.]
Figure 7: Performance comparisons to FGR and AR [88]. Re-
produced from [17].
We conclude that DSARP is an effective mechanism that
can tolerate and hide longer refresh latencies,
which are expected in future DRAM devices as DRAM tech-
nology scales to even smaller feature sizes.
4. Related Work
To our knowledge, this is the first work to comprehensively
study the effect of per-bank refresh and propose 1) a
refresh scheduling policy built on top of per-bank refresh
and 2) a mechanism that achieves parallelization of refresh
operations and memory accesses within a refreshing bank.
We discuss prior works that mitigate the negative effects of
DRAM refresh and compare them to our mechanisms.
Retention-Aware Refresh. Various works (e.g., [1, 3, 4,
5, 27, 50, 72, 74, 94, 95, 96, 98, 119]) propose mechanisms to
reduce unnecessary refresh operations by taking advantage
of the fact that different DRAM cells have widely different
retention times [51, 73, 96]. These works assume that the
retention time of DRAM cells can be accurately profiled and
they depend on having this accurate profile to guarantee data
integrity [73]. However, as shown in Liu et al. [73] and later
analyzed in detail by several other works [44, 45, 46, 47, 96, 98],
accurately determining the retention time profile of DRAM
is an outstanding research problem due to the Variable Retention
Time (VRT) and Data Pattern Dependence (DPD)
phenomena, which can cause the retention time of a cell to
fluctuate over time. As such, retention-aware refresh techniques
need to overcome the profiling challenges to be viable.
A recent work, AVATAR [98], proposes a retention-aware
refresh mechanism that addresses VRT by using ECC chips,
which introduces extra cost. In contrast, our refresh mitigation
techniques enable parallelization of refreshes and
accesses without relying on cell data retention profiles or
ECC, thus providing high reliability at low cost.
Refresh Scheduling. Stuecheli et al. [115] propose elastic
refresh that postpones refreshes by a time delay that varies
based on the number of postponed refreshes and the predicted
rank idle time to avoid interfering with demand requests.
Elastic refresh has two shortcomings. First, it becomes less
eective when the average rank idle period is shorter than
tRFCab as the refresh latency cannot be fully hidden in that pe-
riod. This occurs especially with 1) more memory-intensive
workloads that inherently have less idleness and 2) higher
density DRAM chips that have higher tRFCab. Second, elastic
refresh incurs more refresh latency when it incorrectly pre-
dicts a time period as idle when the time period actually has
pending requests. In contrast, our mechanisms parallelize
refresh operations with accesses even if there is no idle period
and therefore outperform elastic refresh.
Ishii et al. [37] propose a write scheduling policy that pri-
oritizes write draining over read requests in a rank while
another rank is refreshing (even if the write queue has not
reached the threshold to trigger write mode). This technique
is only applicable in multi-ranked memory systems. Our
mechanisms are also applicable to single-ranked memory sys-
tems by enabling parallelization of refreshes and accesses at
the bank and subarray levels, and they can be combined with
Ishii et al. [37].
Mukundan et al. [88] propose scheduling techniques (in addition
to adaptive refresh discussed in Section 3.1) to address
the problem of command queue seizure, whereby a command
queue gets filled up with commands to a refreshing rank,
blocking commands to another non-refreshing rank. In our
work, we use a different memory controller design that does
not have command queues, similarly to prior work [32]. Our
controller generates a command for a scheduled request right
before the request is sent to DRAM instead of pre-generating
the commands and queuing them up. Thus, our baseline design
does not suffer from the problem of command queue
seizure.
Subarray-Level Parallelism (SALP). Kim et al. [56] propose
SALP to reduce bank serialization latency by enabling
multiple accesses to different subarrays within a bank to proceed
in a pipelined manner. In contrast to SALP, our mechanism
(SARP) parallelizes refreshes and accesses to different
subarrays within the same bank. Therefore, SARP exploits
the existence of subarrays for a different purpose and in a
different way from SALP. We reduce the sharing of the peripheral
circuits for refreshes and accesses, not for arbitrary
accesses. As such, our implementation is not only different,
but also less intrusive than SALP: SARP does not require
new DRAM commands and timing constraints. We
note that several other works exploit the existence of subarrays
for various performance and energy improvement purposes
[19, 67, 69, 70, 106, 107, 108]. We refer the reader to the
SALP paper in this very same issue for a detailed treatment
of SALP [57].
DRAM Refresh Architecture. Several other works propose
different refresh architectures. Nair et al. [93] propose
Refresh Pausing, which pauses a refresh operation to serve
pending memory requests when the refresh causes conflicts
with the requests. Although our work already significantly
reduces conflicts between refreshes and memory requests
by enabling parallelization, it can be combined with Refresh
Pausing to address rare conflicts. Tavva et al. [117] propose
EFGR, which exposes non-refreshing banks during an all-bank
refresh operation so that a few accesses can be scheduled
to those non-refreshing banks during the refresh operation.
However, such a mechanism does not provide additional
performance and energy benefits over per-bank refresh,
which we use to build our mechanism in this work.
Isen and John [36] propose ESKIMO, which modifies
the ISA to enable memory allocation libraries to skip
refreshes on memory regions that do not affect programs’
execution. ESKIMO is orthogonal to our mechanism, and its
modification has high system-level complexity by requiring
system software libraries to make refresh decisions. Other
techniques (e.g., heterogeneous-reliability memory [81] or
Flikker [75]) can eliminate or reduce refreshes in parts of
memory. Our techniques are complementary to such refresh
elimination/reduction techniques.
eDRAM Concurrent Refresh. Kirihata et al. [58] pro-
pose a mechanism to enable a bank to refresh independently
while another bank is being accessed in embedded DRAM
(eDRAM). Our work differs from [58] in two major ways. First,
unlike SARP, [58] parallelizes refreshes only across banks, not
within each bank. Second, there are significant differences
between DRAM and eDRAM architectures, which make it
non-trivial to apply [58]’s mechanism directly to DRAM. In
particular, eDRAMs have no standardized timing/power in-
tegrity constraints and access protocol, making it simpler
for each bank to independently manage its refresh sched-
ule. In contrast, refreshes in DRAM need to be managed by
the memory controller to ensure that parallelizing refreshes
with accesses does not violate other constraints. Other works
(e.g., [2, 25]) exploit the fact that eDRAM is used as a cache
to avoid refresh operations.
5. Significance
In this section, we describe three trends in the current and
future DRAM subsystem that will likely make our proposed
solutions more important and attractive in the future, and
examine the work’s impact on future research.
5.1. Long-Term Impact
Worsening Retention Time. As the DRAM cell feature
size continues to scale, the cells’ retention time will likely
become shorter, exacerbating the refresh penalty [43, 89, 90].
When the surface area of cells gets smaller with further scal-
ing, the depth/height of the cell needs to increase to maintain
the same amount of capacitance that can be stored in a cell. In
other words, the aspect ratio (the ratio of a cell’s depth to its
diameter) needs to be increased to maintain the capacitance.
However, many works have shown that fabricating high aspect
ratio cells is becoming more difficult due to processing
technology [33, 43, 82]. Therefore, the cells’ capacitance (and,
thus, their retention time) may potentially decrease with further
scaling, increasing the refresh frequency. Using DSARP
is a cost-effective way to alleviate the increasing negative
impact of refresh as our results show [17]. Note that errors
have started appearing in DRAM chips due to aggressive
technology scaling [53, 85, 89, 104, 111, 112]. The RowHammer
problem is a prime example of DRAM errors that have
been slipping into the field [53, 89], and one solution for it is
to increase the refresh rate [53, 89]. Such solutions to tech-
nology scaling issues clearly exacerbate the refresh problem.
Therefore, DSARP can alleviate the performance impact un-
der these conditions.
New DRAM Standards with Flexible Per-Bank Re-
fresh. According to newer DRAM standards, the industry is
already in the process of implementing a similar concept of
enabling the memory controller to determine which bank to
refresh. In particular, the two standards are: 1) HBM [41, 71]
(October 2013, after the submission of our HPCA 2014 pa-
per [17]) and 2) LPDDR4 [42] (August 2014). Both standards
have incorporated a new refresh mode that allows per-bank
refresh commands to be issued in any order by the memory
controllers. Neither standard specifies a preferred order
which the memory controller needs to follow for issuing
refresh commands.
Our work has done extensive evaluations to show that
our proposed per-bank refresh scheduling policy, DARP, outperforms
a naive round-robin policy by opportunistically refreshing
idle banks. As a result, our policy can potentially be
adopted in future processors that use HBM or LPDDR4
DRAM.
Increasing Number of Subarrays. As DRAM density
keeps increasing, more rows of cells are added within each
DRAM bank. To avoid the disadvantage of increasing sensing
latency due to longer bitlines in subarrays [18, 70], more
subarrays will likely be added within a single bank instead of
increasing the size of each subarray. Our proposed refreshing
scheme at the subarray level, SARP, becomes more effective
at mitigating refresh as the number of subarrays increases
because the probability of a refresh and a demand request
colliding at the subarray level decreases with more subarrays.
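A simple way to see this trend is a toy model in which a demand request targets each subarray of a bank with equal probability while a fixed number of subarrays are refreshing. Both the uniform-access assumption and the function name below are ours; real access patterns are not uniform, so this only illustrates the direction of the effect.

```python
def collision_probability(num_subarrays: int, refreshing: int = 1) -> float:
    """Chance that a uniformly distributed request hits a refreshing subarray."""
    if num_subarrays <= 0 or not 0 <= refreshing <= num_subarrays:
        raise ValueError("need 0 <= refreshing <= num_subarrays and num_subarrays > 0")
    return refreshing / num_subarrays


for n in (8, 16, 32):
    print(f"{n} subarrays: {collision_probability(n):.3f}")
```

Under this toy model, doubling the number of subarrays halves the chance that a request collides with a refresh.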
5.2. Potential Research Impact
Impact on Recent Research Work. To our knowledge,
this is the first work to comprehensively study and extend the
concept of per-bank refresh to DDRx DRAM chips. Several
works [5, 28, 117] use our per-bank refresh mechanism as
a baseline for comparison. Kotra et al. [60] propose a new
refresh mechanism to further enhance our per-bank refresh
mechanism. Kong et al. [59] extend our per-bank refresh idea
to eDRAM.
Future Research Directions. This work will likely create
new research opportunities for studying refresh scheduling
policies at different dimensions (i.e., bank and subarray level)
to mitigate worsening refresh overheads. Among many
opportunities, one potential way to further reduce the
refresh latency (i.e., tRFCab/pb) is to trade it off against a higher refresh
rate (i.e., a shorter tREFI), which is currently supported as fine gran-
ularity refresh in DDR4 DRAM for all-bank refresh. In this
work, we assume a fixed refresh rate for per-bank refresh
as it is specified in the standard. Therefore, a new research
question that our work raises is: how can one combine per-bank
refresh with fine granularity refresh and design a new schedul-
ing policy for that combination? We think that DARP can inspire new
scheduling policies to improve the performance of existing
DRAM designs.
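The trade-off between refresh latency and refresh rate can be made concrete with a small calculation. The sketch below is illustrative only: the function name is hypothetical, and the timing values in the usage note are placeholders chosen for this example, not values from any DRAM standard or from our evaluation.

```python
def refresh_busy_fraction(t_rfc_ns: float, t_refi_ns: float) -> float:
    """Fraction of time spent refreshing (hypothetical helper).

    t_rfc_ns:  latency of one refresh command (tRFC), in nanoseconds.
    t_refi_ns: interval between refresh commands (tREFI), in nanoseconds.
    Assumes one refresh command is issued every tREFI and ignores any
    overlap between refreshes and accesses.
    """
    if t_refi_ns <= 0 or t_rfc_ns < 0:
        raise ValueError("tREFI must be positive and tRFC non-negative")
    return t_rfc_ns / t_refi_ns


# Placeholder timings (assumptions for illustration, not standard values).
baseline = refresh_busy_fraction(t_rfc_ns=350.0, t_refi_ns=7800.0)
# Doubling the refresh rate halves tREFI; here we assume tRFC shrinks,
# but by less than half.
finer = refresh_busy_fraction(t_rfc_ns=260.0, t_refi_ns=3900.0)
```

With these placeholder numbers, each refresh command is shorter in the finer-grained mode, yet the total fraction of time spent refreshing rises from about 4.5% to about 6.7%. A scheduling policy that combines per-bank refresh with fine granularity refresh would need to weigh the shorter per-command blocking time against this higher aggregate cost.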
Applicability to Other Memory Technologies. Re-
fresh is used in NAND flash memory to improve lifetime [12,
13, 14, 78], and can be used as a general solution to several
other NAND flash reliability problems that are characterized
and discussed in various recent works [6, 7, 8, 9, 10, 11, 15, 16,
79, 80]. We believe the idea of DSARP and refresh scheduling
can also be applied to refresh mechanisms in flash memory,
and this can be especially beneficial toward the end of the
lifetime of flash memory when the device is refreshed more
frequently [7, 8, 9, 13]. We refer the reader to our recent works
to understand the mechanisms for refresh in modern flash
memories [7, 8, 9].
We believe the principles of DSARP are also applica-
ble to emerging memory technologies [84], e.g., phase-
change memory (PCM) [62, 63, 64, 99, 100, 122, 123, 124], STT-
MRAM [21, 29, 61, 92], or RRAM/memristors [24, 114, 121].
For example, PCM suers from resistance drift [35, 97, 122],
where the resistance used to represent the value becomes
higher over time (and eventually can introduce a bit error).
To mitigate resistance drift, PCM can use refresh-like opera-
tions to rewrite the original data value, and as the density of
PCM grows, more such operations are required. We leave a
detailed exploration of how DSARP can be used for emerging
memory technologies to future work.
6. Conclusion
We introduced two new complementary techniques, DARP
(Dynamic Access Refresh Parallelization) and SARP (Subar-
ray Access Refresh Parallelization), to mitigate the DRAM
refresh penalty by enhancing refresh–access parallelization at
the bank and subarray levels, respectively. DARP 1) issues
per-bank refreshes to idle banks in an out-of-order manner
instead of issuing refreshes in a strict round-robin order, and 2)
proactively schedules per-bank refreshes during intervals
when a batch of writes are draining to DRAM. SARP enables
a bank to serve requests from idle subarrays in parallel with
other subarrays that are being refreshed. Our extensive evalu-
ations on a wide variety of systems and workloads show that
these mechanisms signicantly improve system performance
and outperform state-of-the-art refresh policies, approaching
the performance of ideally eliminating all refreshes. We con-
clude that DARP and SARP are eective at hiding the refresh
latency penalty in modern and near-future DRAM systems,
and that their benets increase as DRAM density increases.
We believe these techniques are also applicable to other
memory technologies, such as NAND flash memory and
phase change memory. We hope our work inspires future
research to develop even more effective refresh latency toler-
ance techniques.
Acknowledgments
We thank Saugata Ghose for his dedicated effort in the
preparation of this article. We thank the anonymous review-
ers and Jamie Liu for helpful feedback and the members of
the SAFARI research group for feedback and the stimulating
environment they provide. We acknowledge the support of
IBM, Intel, and Samsung. This research was supported in
part by the Intel Science and Technology Center on Cloud
Computing, the Semiconductor Research Corporation, and
an NSF CAREER Award (grant 0953246).
References
[1] A. Agrawal et al., “Mosaic: Exploiting the spatial locality of process variation to
reduce refresh energy in on-chip eDRAM modules,” in HPCA, 2014.
[2] A. Agrawal et al., “Refrint: Intelligent Refresh to Minimize Power in On-Chip
Multiprocessor Cache Hierarchies,” in HPCA, 2013.
[3] A. Agrawal et al., “CLARA: Circular Linked-List Auto and Self Refresh Architec-
ture,” in MEMSYS, 2016.
[4] S. Baek et al., “Refresh Now and Then,” IEEE TC, vol. 63, no. 12, pp. 3114–3126,
2014.
[5] I. Bhati et al., “Flexible Auto-refresh: Enabling Scalable and Energy-efficient
DRAM Refresh Reductions,” in ISCA, 2015.
[6] Y. Cai et al., “Read Disturb Errors in MLC NAND Flash Memory: Characteriza-
tion and Mitigation,” in DSN, 2015.
[7] Y. Cai et al., “Error Characterization, Mitigation, and Recovery in Flash-Memory-
Based Solid-State Drives,” Proceedings of the IEEE, 2017.
[8] Y. Cai et al., “Error Characterization, Mitigation, and Recovery in Flash Memory
Based Solid-State Drives,” arXiv:1706.08642 [cs.AR], 2017.
[9] Y. Cai et al., “Errors in Flash-Memory-Based Solid-State Drives: Analysis, Miti-
gation, and Recovery,” arXiv:1711.11427 [cs.AR], 2017.
[10] Y. Cai et al., “Vulnerabilities in MLC NAND Flash Memory Programming: Ex-
perimental Analysis, Exploits, and Mitigation Techniques,” in HPCA, 2017.
[11] Y. Cai et al., “Error Patterns in MLC NAND Flash Memory: Measurement, Char-
acterization, and Analysis,” in DATE, 2012.
[12] Y. Cai et al., “Data Retention in MLC NAND Flash Memory: Characterization,
Optimization, and Recovery,” in HPCA, 2015.
[13] Y. Cai et al., “Flash Correct-and-Refresh: Retention-Aware Error Management
for Increased Flash Memory Lifetime,” in ICCD, 2012.
[14] Y. Cai et al., “Error Analysis and Retention-Aware Error Management for NAND
Flash Memory,” in ITJ, 2013.
[15] Y. Cai et al., “Neighbor Cell Assisted Error Correction in MLC NAND Flash Mem-
ories,” in SIGMETRICS, 2014.
[16] Y. Cai et al., “Threshold Voltage Distribution in MLC NAND Flash Memory:
Characterization, Analysis, and Modeling,” in DATE, 2013.
[17] K. K. Chang et al., “Improving DRAM Performance by Parallelizing Refreshes
with Accesses,” in HPCA, 2014.
[18] K. K. Chang et al., “Understanding Latency Variation in Modern DRAM Chips:
Experimental Characterization, Analysis, and Optimization,” in SIGMETRICS,
2016.
[19] K. K. Chang et al., “Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-
Subarray Data Movement in DRAM,” in HPCA, 2016.
[20] K. K. Chang et al., “Understanding Reduced-Voltage Operation in Modern DRAM
Devices: Experimental Characterization, Analysis, and Mechanisms,” in SIGMET-
RICS, 2017.
[21] M. T. Chang et al., “Technology Comparison for Large Last-Level Caches (L3Cs):
Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized
eDRAM,” in HPCA, 2013.
[22] N. Chatterjee et al., “Staged reads: Mitigating the impact of DRAM writes on
DRAM reads,” in HPCA, 2012.
[23] J. Choi et al., “Multiple Clone Row DRAM: A Low Latency and Area Optimized
DRAM,” in ISCA, 2015.
[24] L. Chua, “Memristor – The Missing Circuit Element,” TCT, 1971.
[25] P. G. Emma et al., “Rethinking Refresh: Increasing Availability and Reducing
Power in DRAM for Cache Applications,” IEEE Micro, vol. 28, no. 6, pp. 47–56,
2008.
[26] S. Eyerman and L. Eeckhout, “System-Level Performance Metrics for Multipro-
gram Workloads,” IEEE Micro, 2008.
[27] Y. H. Gong and S. W. Chung, “Exploiting Refresh Effect of DRAM Read Opera-
tions: A Practical Approach to Low-Power Refresh,” IEEE TC, vol. 65, no. 5, pp.
1507–1517, 2016.
[28] M. Guan and L. Wang, “Temperature aware refresh for DRAM performance im-
provement in 3D ICs,” in ISQED, 2015.
[29] X. Guo et al., “Resistive Computation: Avoiding the Power Wall with Low-
Leakage, STT-MRAM Based Computing,” in ISCA, 2010.
[30] H. Hassan et al., “SoftMC: A Flexible and Practical Open-Source Infrastructure
for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[31] H. Hassan et al., “ChargeCache: Reducing DRAM Latency by Exploiting Row
Access Locality,” in HPCA, 2016.
[32] E. Herrero et al., “Thread row buffers: Improving memory performance isolation
and throughput in multiprogrammed environments,” IEEE TC, vol. 62, no. 9, pp.
1879–1892, 2013.
[33] S. Hong, “Memory technology trend and future challenges,” in IEDM, 2010.
[34] HPC Challenge, “RandomAccess,” http://icl.cs.utk.edu/hpcc.
[35] D. Ielmini et al., “Recovery and Drift Dynamics of Resistance and Threshold
Voltages in Phase-Change Memories,” TED, 2007.
[36] C. Isen and L. John, “ESKIMO - energy savings using semantic knowledge of
inconsequential memory occupancy for DRAM subsystem,” in MICRO, 2009.
[37] Y. Ishii et al., “High performance memory access scheduling using compute-
phase prediction and writeback-refresh overlap,” in JILP Memory Scheduling
Championship, 2012.
[38] JEDEC, “DDR3 SDRAM Standard,” 2010.
[39] JEDEC, “DDR4 SDRAM Standard,” 2012.
[40] JEDEC, “Low Power Double Data Rate 3 (LPDDR3),” 2012.
[41] JEDEC, “High Bandwidth Memory (HBM) DRAM,” 2013.
[42] JEDEC, “Low Power Double Data Rate 4 (LPDDR4),” 2014.
[43] U. Kang et al., “Co-Architecting Controllers and DRAM to Enhance DRAM Pro-
cess Scaling,” in The Memory Forum, 2014.
[44] S. Khan et al., “Detecting and Mitigating Data-Dependent DRAM Failures by
Exploiting Current Memory Content,” in MICRO, 2017.
[45] S. Khan et al., “A Case for Memory Content-Based Detection and Mitigation of
Data-Dependent Failures in DRAM,” CAL, 2016.
[46] S. Khan et al., “PARBOR: An Efficient System-Level Technique to Detect Data
Dependent Failures in DRAM,” in DSN, 2016.
[47] S. Khan et al., “The Efficacy of Error Mitigation Techniques for DRAM Retention
Failures: A Comparative Experimental Study,” in SIGMETRICS, 2014.
[48] R. Kho et al., “75nm 7Gb/s/pin 1Gb GDDR5 graphics memory device with
bandwidth-improvement techniques,” in ISSCC, 2011.
[49] J. S. Kim et al., “The DRAM Latency PUF: Quickly Evaluating Physical Unclon-
able Functions by Exploiting the Latency–Reliability Tradeoff in Modern DRAM
Devices,” in HPCA, 2018.
[50] J. Kim and M. C. Papaefthymiou, “Block-based multi-period refresh for energy
ecient dynamic memory,” in ASIC, 2001.
[51] K. Kim and J. Lee, “A New Investigation of Data Retention Time in Truly
Nanoscaled DRAMs,” EDL, vol. 30, no. 8, pp. 846–848, 2009.
[52] Y. Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator,” CAL, 2015.
[53] Y. Kim et al., “Flipping Bits in Memory Without Accessing Them: An Experimen-
tal Study of DRAM Disturbance Errors,” in ISCA, 2014.
[54] Y. Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm
for Multiple Memory Controllers,” in HPCA, 2010.
[55] Y. Kim et al., “Thread Cluster Memory Scheduling: Exploiting Differences in
Memory Access Behavior,” in MICRO, 2010.
[56] Y. Kim et al., “A Case for Exploiting Subarray-Level Parallelism (SALP) in
DRAM,” in ISCA, 2012.
[57] Y. Kim et al., “Exploiting the DRAM Microarchitecture to Increase Memory-Level
Parallelism,” IPSI Transactions on Advanced Research (TAR), 2018.
[58] T. Kirihata et al., “An 800-MHz embedded DRAM with a concurrent refresh
mode,” IEEE JSSC, pp. 1377–1387, 2005.
[59] J. Kong et al., “Towards Refresh-optimized EDRAM-based Caches with a Selec-
tive Fine-grain Round-robin Refresh Scheme,” Microprocess. Microsyst., vol. 49,
no. C, pp. 95–104, 2017.
[60] J. B. Kotra et al., “Hardware-Software Co-design to Mitigate DRAM Refresh Over-
heads: A Case for Refresh-Aware Process Scheduling,” in ASPLOS, 2017.
[61] E. Kultursay et al., “Evaluating STT-RAM as an energy-efficient main memory
alternative,” in ISPASS, 2013.
[62] B. C. Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alter-
native,” in ISCA, 2009.
[63] B. C. Lee et al., “Phase Change Memory Architecture and the Quest for Scalabil-
ity,” CACM, vol. 53, no. 7, pp. 99–106, 2010.
[64] B. C. Lee et al., “Phase-Change Technology and the Future of Main Memory,”
IEEE Micro, vol. 30, no. 1, pp. 143–143, 2010.
[65] C. J. Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-
Caused Interference in Memory Systems,” Univ. of Texas at Austin, High Per-
formance Systems Group, Tech. Rep. TR-HPS-2010-002, 2010.
[66] C. J. Lee et al., “Improving Memory Bank-Level Parallelism in the Presence of
Prefetching,” in MICRO, 2009.
[67] D. Lee et al., “Design-Induced Latency Variation in Modern DRAM Chips: Char-
acterization, Analysis, and Latency Reduction Mechanisms,” in SIGMETRICS,
2017.
[68] D. Lee et al., “Decoupled Direct Memory Access: Isolating CPU and IO Traffic
by Leveraging a Dual-Data-Port DRAM,” in PACT, 2015.
[69] D. Lee et al., “Adaptive-Latency DRAM: Optimizing DRAM Timing for the
Common-Case,” in HPCA, 2015.
[70] D. Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Ar-
chitecture,” in HPCA, 2013.
[71] D. Lee et al., “Simultaneous Multi Layer Access: A High Bandwidth and Low
Cost 3D-Stacked Memory Interface,” TACO, 2016.
[72] C. H. Lin et al., “SECRET: Selective Error Correction for Refresh Energy Reduc-
tion in DRAMs,” in ICCD, 2012.
[73] J. Liu et al., “An Experimental Study of Data Retention Behavior in Modern
DRAM Devices: Implications for Retention Time Profiling Mechanisms,” in ISCA,
2013.
[74] J. Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” in ISCA, 2012.
[75] S. Liu et al., “Flikker: Saving dram refresh-power through critical data partition-
ing,” in ASPLOS, 2011.
[76] S.-L. Lu et al., “Improving DRAM Latency with Dynamic Asymmetric Subarray,”
in MICRO, 2015.
[77] C.-K. Luk et al., “Pin: Building Customized Program Analysis Tools with Dy-
namic Instrumentation,” in PLDI, 2005.
[78] Y. Luo et al., “WARM: Improving NAND flash memory lifetime with write-
hotness aware retention management,” in MSST, 2015.
[79] Y. Luo et al., “Enabling Accurate and Practical Online Flash Channel Modeling
for Modern MLC NAND Flash Memory,” JSAC, 2016.
[80] Y. Luo et al., “HeatWatch: Improving 3D NAND Flash Memory Device Reliability
by Exploiting Self-Recovery and Temperature Awareness,” in HPCA, 2018.
[81] Y. Luo et al., “Characterizing Application Memory Error Vulnerability to Opti-
mize Datacenter Cost via Heterogeneous-Reliability Memory,” in DSN, 2014.
[82] J. A. Mandelman et al., “Challenges and Future Directions for the Scaling of Dy-
namic Random-access Memory (DRAM),” IBM JRD, vol. 46, no. 2-3, pp. 187–212,
2002.
[83] J. D. McCalpin, “STREAM Benchmark,” http://www.cs.virginia.edu/stream/.
[84] J. Meza et al., “A Case for Efficient Hardware/Software Cooperative Management
of Storage and Memory,” in WEED, 2013.
[85] J. Meza et al., “Revisiting Memory Errors in Large-Scale Production Data Centers:
Analysis and Modeling of New Trends from the Field,” in DSN, 2015.
[86] Micron Technology, “8Gb: x4, x8 1.5V TwinDie DDR3 SDRAM,” 2011.
[87] Y. Moon et al., “1.2V 1.6Gb/s 56nm 6F² 4Gb DDR3 SDRAM with hybrid-I/O sense
amplifier and segmented sub-array architecture,” in ISSCC, 2009.
[88] J. Mukundan et al., “Understanding and mitigating refresh overheads in high-
density DDR4 DRAM systems,” in ISCA, 2013.
[89] O. Mutlu, “The RowHammer problem and other issues we may face as memory
becomes denser,” in DATE, 2017.
[90] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” IMW, 2013.
[91] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing
Both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
[92] H. Naeimi et al., “STT-RAM Scaling and Retention Failure,” Intel Technology Jour-
nal, 2013.
[93] P. Nair et al., “A case for refresh pausing in DRAM memory systems,” in HPCA,
2013.
[94] P. J. Nair et al., “ArchShield: Architectural Framework for Assisting DRAM Scal-
ing by Tolerating High Error Rates,” in ISCA, 2013.
[95] T. Ohsawa et al., “Optimizing the DRAM Refresh Count for Merged DRAM/Logic
LSIs,” in ISLPED, 1998.
[96] M. Patel et al., “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM
Retention Failures via Profiling at Aggressive Conditions,” in ISCA, 2017.
[97] A. Pirovano et al., “Low-Field Amorphous State Resistance and Threshold Volt-
age Drift in Chalcogenide Materials,” TED, 2004.
[98] M. K. Qureshi et al., “AVATAR: A Variable-Retention-Time (VRT) Aware Refresh
for DRAM Systems,” in DSN, 2015.
[99] M. K. Qureshi et al., “Enhancing Lifetime and Security of PCM-based Main Mem-
ory with Start-gap Wear Leveling,” in MICRO, 2009.
[100] M. K. Qureshi et al., “Scalable High Performance Main Memory System Using
Phase-change Memory Technology,” in ISCA, 2009.
[101] Rambus, “DRAM Power Model,” 2010.
[102] S. Rixner et al., “Memory Access Scheduling,” in ISCA, 2000.
[103] SAFARI Research Group, “Ramulator – GitHub Repository,” https://github.com/
CMU-SAFARI/ramulator.
[104] B. Schroeder et al., “DRAM Errors in the Wild: A Large-Scale Field Study,” in
SIGMETRICS, 2009.
[105] V. Seshadri et al., “The Dirty-Block Index,” in ISCA, 2014.
[106] V. Seshadri et al., “Fast Bulk Bitwise AND and OR in DRAM,” CAL, 2015.
[107] V. Seshadri et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data
Copy and Initialization,” in MICRO, 2013.
[108] V. Seshadri et al., “Ambit: In-Memory Accelerator for Bulk Bitwise Operations
Using Commodity DRAM Technology,” in MICRO, 2017.
[109] A. Snavely and D. Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multi-
threading Processor,” in ASPLOS, 2000.
[110] Y. H. Son et al., “Reducing Memory Access Latency with Asymmetric DRAM
Bank Organizations,” in ISCA, 2013.
[111] V. Sridharan et al., “Memory Errors in Modern Systems: The Good, The Bad, and
The Ugly,” in ASPLOS, 2015.
[112] V. Sridharan and D. Liberty, “A Study of DRAM Failures in the Field,” in SC, 2012.
[113] Standard Performance Evaluation Corp., “SPEC CPU2006 Benchmarks,”
http://www.spec.org/cpu2006.
[114] D. B. Strukov et al., “The Missing Memristor Found,” Nature, 2008.
[115] J. Stuecheli et al., “Elastic refresh: Techniques to mitigate refresh penalties in
high density memory,” in MICRO, 2010.
[116] J. Stuecheli et al., “The virtual write queue: Coordinating DRAM and last-level
cache policies,” in ISCA, 2010.
[117] V. K. Tavva et al., “EFGR: An Enhanced Fine Granularity Refresh Feature for
High-Performance DDR4 DRAM Devices,” TACO, vol. 11, no. 3, 2014.
[118] Transaction Processing Performance Council, “TPC Benchmarks,”
http://www.tpc.org/.
[119] R. Venkatesan et al., “Retention-Aware Placement in DRAM (RAPID): Software
Methods for Quasi-Non-Volatile DRAM,” in HPCA, 2006.
[120] T. Vogelsang, “Understanding the Energy Consumption of Dynamic Random Ac-
cess Memories,” in MICRO, 2010.
[121] H.-S. P. Wong et al., “Metal-Oxide RRAM,” Proc. IEEE, 2012.
[122] H.-S. P. Wong et al., “Phase Change Memory,” Proc. IEEE, 2010.
[123] H. Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memo-
ries,” in ICCD, 2012.
[124] H. Yoon et al., “Efficient Data Mapping and Buffering Techniques for Multilevel
Cell Phase-Change Memories,” TACO, vol. 11, no. 4, pp. 40:1–40:25, 2014.
[125] J. Yue and Y. Zhu, “Exploiting Subarrays Inside a Bank to Improve Phase Change
Memory Performance,” in DATE, 2013.
[126] T. Zhang et al., “CREAM: A Concurrent-Refresh-Aware DRAM Memory Archi-
tecture,” in HPCA, 2014.
