RowClone: Accelerating Data Movement and Initialization Using DRAM by Seshadri, Vivek et al.
RowClone: Accelerating Data Movement and Initialization Using DRAM
Vivek Seshadri1,2 Yoongu Kim2 Chris Fallin2 Donghyuk Lee3,2
Rachata Ausavarungnirun2 Gennady Pekhimenko4,2 Yixin Luo2
Onur Mutlu5,2 Phillip B. Gibbons2,6 Michael A. Kozuch6 Todd C. Mowry2
1Microsoft Research India 2Carnegie Mellon University 3NVIDIA Research
4University of Toronto 5ETH Zürich 6Intel Labs
This paper summarizes the idea of RowClone, which was
published in MICRO 2013 [151], and examines the work’s signif-
icance and future potential. In existing systems, to perform any
bulk data movement operation (copy or initialization), the data
has to first be read into the on-chip processor, all the way into the
L1 cache, and the result of the operation must be written back to
main memory. This is despite the fact that these operations do
not involve any actual computation. RowClone exploits the or-
ganization and operation of commodity DRAM to perform these
operations completely inside DRAM using two mechanisms. The
first mechanism, Fast Parallel Mode, copies data between two
rows inside the same DRAM subarray by issuing back-to-back
activate commands to the source and the destination row. The
second mechanism, Pipelined Serial Mode, transfers cache lines
between two banks using the shared internal bus. RowClone
significantly reduces the raw latency and energy consumption
of bulk data copy and initialization. This reduction directly
translates to improvement in performance and energy efficiency
of systems running copy or initialization-intensive workloads.
Our proposed technique has inspired significant research on
various ways to perform operations in memory and reduce data
movement between the CPU and DRAM [2, 25, 69, 76, 102, 103,
153, 154, 157, 162].
1. Problem: Bulk Data Movement
The main memory subsystem is an increasingly
significant limiter of system performance and energy effi-
ciency [123, 124] for at least two reasons. First, the available
memory bandwidth between the processor and main memory
is not growing, nor is it expected to grow, commensurately
with the compute bandwidth available in modern multi-core
processors [61, 64]. Second, a significant fraction (20% to
42%) of the energy required to access data from memory is
consumed in driving the high-speed bus connecting the pro-
cessor and memory [149] (calculated using [112]). Therefore,
judicious use of the available memory bandwidth is critical to
ensure both high system performance and energy efficiency.
In this work, we focus our attention on optimizing two
important classes of bandwidth-intensive memory operations
that frequently occur in modern systems: 1) bulk data copy—
copying a large quantity of data from one location in physical
memory to another, and 2) bulk data initialization—initializing
a large quantity of data to a specific value. We refer to these
two operations as bulk data movement operations. Prior re-
search [68, 131, 147] has shown that operating systems and
data center workloads spend a significant portion of their
time performing bulk data movement operations. There-
fore, accelerating these operations will likely improve system
performance. In fact, the x86 ISA has recently introduced
instructions to provide enhanced performance for bulk copy
and initialization (ERMSB [60]), highlighting the importance
of bulk operations.
The main reason bulk data movement operations degrade
system performance and energy efficiency is that they require
large amounts of data to be transferred back and forth on the
memory bus. This large data transfer has three shortcom-
ings. First, because the data is transferred one cache line at
a time across the bus, these operations incur high latency,
directly degrading the performance of the application per-
forming the operation. Second, transferring a large amount
of data on the bus interferes with the memory accesses of
other concurrently-running applications, degrading their per-
formance as well. Finally, the large data transfer contributes
to a significant fraction of the energy consumed by these bulk
movement operations.
While bulk data movement operations also degrade per-
formance by hogging the CPU and potentially polluting the
on-chip caches, prior works [66, 192] have proposed simple
solutions to address these problems by adding support for
such operations in the memory controller. However, the tech-
niques proposed by these works do not eliminate the need to
transfer data over the memory bus, which is an increasingly
critical bottleneck for performance in modern systems.
2. RowClone: Fast In-DRAM Copy
The fact that both bulk data copy and initialization do not
require any computation on the part of the processor enables
the opportunity to perform these operations completely inside
DRAM. Our MICRO 2013 paper [151] presents a new mecha-
nism, RowClone, which exploits the internal organization and
operation of DRAM to perform bulk data copy/initialization
quickly and efficiently inside DRAM.
Figure 1 illustrates the organization of a DRAM chip. The
chip contains multiple banks, each of which is divided into
subarrays, and each subarray in turn consists of multiple
rows of DRAM cells. Each subarray contains a row buffer,
which is used to extract the data from the DRAM cells. Data
transfer between the DRAM cells and the row buffer happens
at a row granularity, i.e., even to read a single byte from a
row, the chip copies the entire row of data from the DRAM
cells to the corresponding row buffer.1
[Figure 1 labels: DRAM Chip, Memory Channel, Chip I/O, Shared Internal Bus, Bank, Bank I/O, Subarray, Row buffer, Row of DRAM cells]
Figure 1: DRAM chip microarchitecture. Reproduced
from [151].
2.1. RowClone Mechanisms
RowClone consists of two mechanisms: (1) Fast Parallel
Mode (FPM), which is used to copy data from one row to an-
other row in the same subarray; and (2) Pipelined Serial Mode
(PSM), which is used to copy data from one row to another
row in a different subarray or bank. We briefly discuss how
each mechanism performs bulk data copy and bulk data ini-
tialization. Section 3 of our MICRO 2013 paper [151] provides
a detailed implementation and discussion of FPM and PSM.
Fast Parallel Mode (FPM). FPM uses the high internal
bandwidth offered by DRAM to quickly and efficiently copy
data between two rows within the same subarray in two
simple steps. First, FPM copies the data from the source
row to the local row buffer of the subarray. Second, FPM
copies the data from the row buffer to the destination row.
To perform the copy, FPM simply issues two back-to-back
ACTIVATE commands to the bank, the first with the source row
address and the second with the destination row address.
Implementing this in existing DRAM chips requires almost
negligible changes. These small changes are to the peripheral
logic that controls back-to-back ACTIVATEs.
FPM imposes two constraints on the copy operation. First,
it requires the source and the destination row to be within
the same subarray. Second, it copies the entire row’s worth
of data. It cannot partially copy data from one row to another.
Despite these constraints, FPM can be used to accelerate many
operations in modern systems (Section 3).
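The two FPM steps described above can be illustrated with a toy software model. This is only a sketch under stated assumptions: the `Subarray` class, its method names, and the trailing precharge are hypothetical and are not taken from the text above. The model treats an ACTIVATE issued while the row buffer already holds data as overwriting the newly connected row.

```python
class Subarray:
    """Toy model of one DRAM subarray: rows of cells plus one row buffer."""

    def __init__(self, num_rows, row_bytes):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]
        self.row_buffer = None  # None models an empty (precharged) row buffer

    def activate(self, row):
        if self.row_buffer is None:
            # Step 1: the row buffer captures the whole activated row.
            self.row_buffer = bytearray(self.rows[row])
        else:
            # Step 2: a back-to-back ACTIVATE (no precharge in between)
            # drives the row buffer contents into the newly activated row.
            self.rows[row][:] = self.row_buffer

    def precharge(self):
        # Assumption: the sketch empties the row buffer after the copy.
        self.row_buffer = None


def fpm_copy(subarray, src_row, dst_row):
    """Copy one entire row to another row of the same subarray."""
    subarray.activate(src_row)   # source row -> row buffer
    subarray.activate(dst_row)   # row buffer -> destination row
    subarray.precharge()
```

Note that the model reflects both FPM constraints: the copy stays within one `Subarray` object, and it always moves a full row.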
Pipelined Serial Mode (PSM). PSM accelerates copy op-
erations between rows in different banks/subarrays. As shown
in Figure 1, each DRAM chip uses a shared internal bus to
transfer data between the bank and the memory channel (for
both reads and writes). PSM exploits this fact to overlap the
latency of the read and write operations involved in a copy. To
implement PSM, we propose a new DRAM command called
TRANSFER. TRANSFER is equivalent to appropriately overlapping
READ to the source bank and WRITE to the destination bank.
However, unlike READ or WRITE, TRANSFER does not transfer the
data on to the memory channel, saving significant amounts
of energy.
1 We refer the reader to our prior works [25, 26, 27, 28, 53, 54, 77, 78, 79, 80,
81, 82, 96, 97, 98, 99, 100, 108, 109, 132, 151, 154] for a detailed background on
DRAM.
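A PSM copy between two banks can be sketched in the same toy style. The `Bank` class, the `psm_copy` function, and the 64-byte cache line size are assumptions made for illustration. The sketch only captures the data flow (one cache line per TRANSFER, nothing placed on the memory channel) and does not model timing or the overlap of read and write latencies.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size, not specified in the text


class Bank:
    """Toy bank: a list of rows, each a bytearray."""

    def __init__(self, num_rows, row_bytes):
        self.rows = [bytearray(row_bytes) for _ in range(num_rows)]


def psm_copy(src_bank, src_row, dst_bank, dst_row):
    """Copy one row between banks, one cache line per TRANSFER.

    Returns (number of TRANSFER commands, bytes sent on the memory channel).
    """
    src = src_bank.rows[src_row]
    dst = dst_bank.rows[dst_row]
    transfers = 0
    for offset in range(0, len(src), CACHE_LINE_BYTES):
        end = offset + CACHE_LINE_BYTES
        # TRANSFER: read from the source bank and write to the destination
        # bank over the shared internal bus.
        dst[offset:end] = src[offset:end]
        transfers += 1
    channel_bytes = 0  # the data never leaves the DRAM chip
    return transfers, channel_bytes
```

With 4KB rows, this sketch issues 64 TRANSFER commands per row copy.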
Bulk Data Initialization. For bulk initialization, Row-
Clone initializes one row of the destination with the required
data and then initializes the remaining rows by copying the
data from the pre-initialized row using the appropriate bulk
copy mechanism described above. For bulk zeroing (which
happens frequently), our mechanism reserves a single row in
each subarray, which is pre-initialized to zero. This enables
the memory controller to use FPM to zero out any row in
the system. We refer the reader to Section 3.4 of our MICRO
2013 paper [151] for more details on performing bulk data
initialization with RowClone.
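The initialization strategy above can be sketched as follows. The helper names, the choice of row 0 as the reserved zero row, and the use of a plain Python list of `bytearray` rows to stand in for a subarray are all hypothetical. `row_copy` is a stand-in for an FPM copy.

```python
ZERO_ROW = 0  # hypothetical index of the reserved all-zero row


def make_subarray(num_rows, row_bytes):
    """Build a toy subarray whose rows hold stale data, except the zero row."""
    rows = [bytearray(b"\xff" * row_bytes) for _ in range(num_rows)]
    rows[ZERO_ROW] = bytearray(row_bytes)  # pre-initialized to zero
    return rows


def row_copy(rows, src, dst):
    """Stand-in for an in-DRAM row copy within one subarray."""
    rows[dst][:] = rows[src]


def bulk_zero(rows, targets):
    """Zero each target row by copying from the reserved zero row."""
    for row in targets:
        if row == ZERO_ROW:
            raise ValueError("the reserved zero row is not a valid target")
        row_copy(rows, ZERO_ROW, row)


def bulk_init(rows, targets, pattern):
    """Write the pattern into one row, then clone it into the others."""
    first, rest = targets[0], targets[1:]
    if len(pattern) != len(rows[first]):
        raise ValueError("pattern must be exactly one row long")
    rows[first][:] = pattern      # the only row written explicitly
    for row in rest:
        row_copy(rows, first, row)
```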
2.2. Latency and Energy Benefits
Table 1 shows the reduction in latency and energy con-
sumption due to our mechanisms for different cases of 4KB
copy and zeroing operations. To be fair to the baseline, the re-
sults include only the energy consumed by the DRAM and the
DRAM channel. We draw two conclusions from our results.
Table 1: DRAM latency and memory energy reductions due
to RowClone. Adapted from [151].
Operation  Mechanism         Latency (ns)  Latency reduction  Memory energy (µJ)  Energy reduction
Copy       Baseline          1046          1.0x               3.6                 1.0x
Copy       FPM               90            11.6x              0.04                74.4x
Copy       Inter-Bank - PSM  540           1.9x               1.1                 3.2x
Copy       Intra-Bank - PSM  1050          1.0x               2.5                 1.5x
Zero       Baseline          546           1.0x               2.0                 1.0x
Zero       FPM               90            6.0x               0.05                41.5x
First, FPM significantly improves both the latency and the
energy consumed by bulk data operations: 11.6x and 6x
reduction in latency of 4KB copy and zeroing, and 74.4x and
41.5x reduction in memory energy of 4KB copy and zeroing.
Second, although PSM does not provide as much benefit as
FPM, it still reduces the latency and energy of a 4KB inter-
bank copy by 1.9x and 3.2x, while providing a more generally
applicable mechanism. As we show in Section 4, these latency
and energy benefits translate to significant improvements in
both overall system performance and energy efficiency.
2.3. End-to-End System Design
To fully extract the potential benefits of RowClone, changes
are required to the ISA, processor microarchitecture, and the
system software. First, we introduce two new instructions
to the ISA, namely, memcopy and meminit, which enable the
software to indicate occurrences of bulk data operations to the
processor. Second, for each instance of the memcopy/meminit
instruction, the processor microarchitecture determines if
the operation can be partially/fully accelerated by RowClone
and issues appropriate commands to the memory controller.
While existing mechanisms to handle Direct Memory Ac-
cess requests can be used to ensure cache coherence with
RowClone, we also propose two simple mechanisms, called
in-cache copy and clean zero cache line insertion, to further
reduce memory bandwidth requirements and improve perfor-
mance. We call this optimized version of RowClone, which
includes in-cache copy and clean zero cache line insertion,
RowClone-ZI. Third, to maximize the use of FPM, we make the
system software aware of subarrays and the minimum gran-
ularity of copy (required by FPM). Section 4 of our MICRO
2013 paper [151] describes these changes in detail.
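The decision made by the processor for each memcopy can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the geometry constants, the address mapping in `locate`, and the mechanism labels. The sketch also simplifies by handling only one row-sized, row-aligned request and falling back to the CPU otherwise, whereas the design above also allows partial acceleration.

```python
from typing import NamedTuple

# Hypothetical geometry and address mapping, chosen only for illustration.
ROW_BYTES = 4096
ROWS_PER_SUBARRAY = 512
SUBARRAYS_PER_BANK = 8


class RowLocation(NamedTuple):
    bank: int
    subarray: int
    row: int


def locate(addr):
    """Map a physical address to (bank, subarray, row) under the toy mapping."""
    row_index = addr // ROW_BYTES
    row = row_index % ROWS_PER_SUBARRAY
    subarray = (row_index // ROWS_PER_SUBARRAY) % SUBARRAYS_PER_BANK
    bank = row_index // (ROWS_PER_SUBARRAY * SUBARRAYS_PER_BANK)
    return RowLocation(bank, subarray, row)


def choose_mechanism(src_addr, dst_addr, size=ROW_BYTES):
    """Pick a copy mechanism for one row-sized memcopy request."""
    aligned = src_addr % ROW_BYTES == 0 and dst_addr % ROW_BYTES == 0
    if not aligned or size != ROW_BYTES:
        return "CPU"  # simplification: no partial acceleration in this sketch
    src, dst = locate(src_addr), locate(dst_addr)
    if src.bank != dst.bank:
        return "PSM_INTER_BANK"
    if src.subarray != dst.subarray:
        return "PSM_INTRA_BANK"
    return "FPM"
```

Under this toy mapping, two adjacent 4KB rows fall in the same subarray and are copied with FPM, while rows 512 row-indices apart fall in different subarrays of the same bank.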
3. Applications
RowClone can be used to accelerate any bulk copy and
initialization operation to improve both system performance
and energy efficiency. We quantitatively evaluate the effi-
cacy of RowClone by using it to accelerate two primitives
widely used by modern system software: 1) Copy-on-Write
and 2) Bulk Zeroing. We first describe these primitives, and
then discuss several applications that frequently trigger the
primitives.
3.1. Primitives Accelerated by RowClone
Copy-on-Write (CoW) is a technique used by most modern
operating systems (OS) to postpone an expensive copy op-
eration until it is actually needed. When data of one virtual
page needs to be copied to another, instead of creating a copy,
the OS points both virtual pages to the same physical page
(source) and marks the page as read-only. In the future, when
one of the sharers attempts to write to the page, the OS al-
locates a new physical page (destination) for the writer and
copies the contents of the source page to the newly allocated
page. Fortunately, prior to allocating the destination page, the
OS already knows the location of the source physical page.
Therefore, it can ensure that the destination is allocated in the
same subarray as the source, thereby enabling the processor
to use FPM to perform the copy.
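A subarray-aware allocation policy for CoW destinations could look like the sketch below. The `SubarrayAwareAllocator` class, its free-list layout, and the fallback to any other subarray are hypothetical choices for illustration and do not describe any particular operating system.

```python
from collections import defaultdict


class SubarrayAwareAllocator:
    """Toy page allocator that keeps one free list per subarray."""

    def __init__(self):
        # subarray id -> list of free physical page frame numbers
        self.free = defaultdict(list)

    def add_free_page(self, subarray_id, pfn):
        self.free[subarray_id].append(pfn)

    def alloc_for_cow(self, src_subarray):
        """Return (subarray id, page frame) for a CoW destination page.

        Prefers the source page's subarray so that the copy can use FPM.
        """
        same = self.free.get(src_subarray)
        if same:
            return src_subarray, same.pop()
        # Fallback (assumption): take a page from any other subarray, in
        # which case the copy would need PSM or a conventional copy.
        for subarray_id, pages in self.free.items():
            if pages:
                return subarray_id, pages.pop()
        raise MemoryError("no free pages")
```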
Bulk Zeroing (BuZ) is an operation where a large block of
memory is zeroed out. Our mechanism maintains a reserved
row that is fully initialized to zero in each subarray. For each
row in the destination region to be zeroed out, the processor
uses FPM to copy the data from the reserved zero-row of the
corresponding subarray to the destination row.
3.2. Applications That Use CoW/BuZ
We now describe seven example applications or use-cases
that extensively use the CoW or BuZ operations. Note that
these are just a small number of example scenarios that incur
a large number of copy and initialization operations. Some
other applications and scenarios are provided in one of our
more recent works [155]. Recent work from Google [68]
shows that a considerable fraction of execution time is spent
on memset and memcpy calls in Google’s data center
workloads.
Process Forking. fork is a frequently-used system call in
modern operating systems (OS). When a process (parent) calls
fork, it creates a new process (child) with the exact same
memory image and execution state as the parent. These seman-
tics of fork make it useful for different scenarios. Common
uses of the fork system call are to 1) create new processes,
and 2) create stateful threads from a single parent thread in
multi-threaded programs. One main limitation of fork is
that it results in a CoW operation whenever the child/parent
updates a shared page. Hence, despite its wide usage, as a
result of the large number of copy operations triggered by
fork, it remains one of the most expensive system calls in
terms of memory performance [150].
Initializing Large Data Structures. Initializing large data
structures often triggers Bulk Zeroing. In fact, many managed
languages (e.g., C#, Java, PHP) require zero initialization of
variables to ensure memory safety [185]. In such cases, to
reduce the overhead of zeroing, memory is zeroed-out in
bulk.
Secure Deallocation. Most operating systems (e.g.,
Linux [18], Windows [148], Mac OS X [166]) zero out pages
newly allocated to a process. This is done to prevent malicious
processes from gaining access to the data that previously be-
longed to other processes or the kernel itself. Not doing so
can potentially lead to security vulnerabilities, as shown by
prior works [31, 41, 51, 52].
Process Checkpointing. Checkpointing is an operation dur-
ing which a consistent version of a process state is backed-up,
so that the process can be restored from that state in the future.
This checkpoint-restore primitive is useful in many cases in-
cluding high-performance computing servers [15], software
debugging with reduced overhead [168], hardware-level fault
and bug tolerance mechanisms [33,34,105,106,107], and spec-
ulative OS optimizations to improve performance [24, 182].
However, to ensure that the checkpoint is consistent (i.e., the
original process does not update data while the checkpoint-
ing is in progress), the pages of the process are marked with
copy-on-write. As a result, checkpointing often results in a
large number of CoW operations.
Virtual Machine Cloning/Deduplication. Virtual machine
(VM) cloning [88] is a technique to significantly reduce the
startup cost of VMs in a cloud computing server. Similarly,
deduplication is a technique employed by modern hypervi-
sors [180] to reduce the overall memory capacity require-
ments of VMs. With this technique, different VMs share
physical pages that contain the same data. Similar to forking,
both these operations likely result in a large number of CoW
operations for pages shared across VMs [155].
Page Migration. Bank conflicts, i.e., concurrent requests
to different rows within the same bank, typically result in
reduced row buffer hit rate and hence degrade both system
performance and energy efficiency [80]. Prior work [175]
proposed techniques to mitigate bank conflicts using page
migration. The PSM mode of RowClone can be used in con-
junction with such techniques to 1) significantly reduce the
migration latency and 2) make the migrations more energy-
efficient.
CPU-GPU Communication. In many current and future
processors, the GPU is, or is expected to be, integrated on
the same chip with the CPU. Even in such systems where
the CPU and GPU share the same off-chip memory, the off-
chip memory is partitioned between the two devices. As a
consequence, whenever a CPU program wants to offload some
computation to the GPU, it has to copy all the necessary data
from the CPU address space to the GPU address space [62].
When the GPU computation is finished, all the data needs
to be copied back to the CPU address space. This copying
involves a significant overhead. By spreading out the GPU
address space over all subarrays and mapping the application
data appropriately, RowClone can significantly speed up these
copy operations. Note that communication between different
processors and accelerators in a heterogeneous system-on-
chip (SoC) is done similarly to the CPU-GPU communication
and can also be accelerated by RowClone.
4. Results
In this section, we briefly summarize our evaluation of
RowClone. We evaluate three configurations: Baseline, an
unmodified main memory subsystem that cannot perform
bulk data copy or initialization within memory; RowClone,
which uses the FPM and PSM mechanisms described in Sec-
tion 2.1; and RowClone-ZI, an optimized version of RowClone
that includes the two optimizations discussed in Section 2.3.
Section 6 of our MICRO 2013 paper [151] discusses our full
evaluation methodology, including details on the simulator,
system configuration, and benchmarks used for our evalua-
tions.
4.1. Single-Core Evaluations
Figure 2 shows the performance improvement and reduc-
tion in DRAM energy consumption due to RowClone-ZI com-
pared to the baseline for six copy- and initialization-intensive
benchmarks. As we observe from the figure, these applica-
tions improve significantly with RowClone-ZI. Compared
with Baseline, RowClone-ZI improves the IPC by up to 43%,
while reducing DRAM energy consumption by up to 67%.
Section 7 of our MICRO 2013 paper [151] provides more
detailed single-core results, including (1) the individual per-
formance of the FPM and PSM mechanisms using a fork
benchmark (Section 7.2 of [151]); (2) a breakdown of memory
traffic for each application into read, write, copy, and initial-
ization operations (Section 7.3 of [151]); (3) the performance,
energy, and bandwidth improvements of both RowClone and
RowClone-ZI (Section 7.3 of [151]); and (4) a comparison
of RowClone to a memory-controller-based DMA approach
for data copy and initialization, similar to [192] (Section 7.5
of [151]).
[Figure 2 plot: IPC improvement and energy reduction over Baseline (y-axis, 10% to 70%) for bootup, compile, forkbench, mcached, mysql, and shell.]
Figure 2: Performance improvement and energy reduction
of RowClone-ZI compared to a baseline memory subsystem
without bulk copy support.
4.2. Multi-Core Evaluations
As RowClone performs bulk data operations completely
within DRAM, it significantly reduces the memory bandwidth
consumed by these operations. As a result, RowClone can
benefit other applications that are running concurrently on
the same system, even if these applications do not perform
bulk data operations themselves. We evaluate this benefit
of RowClone by running our copy/initialization-intensive
applications alongside memory-intensive applications from
the SPEC CPU2006 benchmark suite [169] (i.e., those appli-
cations with last-level cache misses per kilo-instruction, or
MPKI, greater than 1). Table 2 lists the set of applications
used for our multi-programmed workloads.
Table 2: List of benchmarks used for multi-core evaluation.
Reproduced from [151].
Copy/Initialization-intensive benchmarks
bootup, compile, forkbench, mcached, mysql, shell
Memory-intensive benchmarks from SPEC CPU2006
bzip2, gcc, mcf, milc, zeusmp, gromacs, cactusADM, leslie3d, namd,
gobmk, dealII, soplex, hmmer, sjeng, GemsFDTD, libquantum,
h264ref, lbm, omnetpp, astar, wrf, sphinx3, xalancbmk
We generate multi-programmed workloads for two-core,
four-core and eight-core systems. In each workload, half of
the cores run copy/initialization-intensive benchmarks, while
the remaining cores run memory-intensive SPEC benchmarks.
Benchmarks from each category are chosen at random.
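The workload construction described above can be sketched as follows. The benchmark names come from Table 2. The function name, the use of sampling with replacement, and the placement of the copy/initialization-intensive benchmarks on the first half of the cores are assumptions, since the text only states that benchmarks are chosen at random.

```python
import random

COPY_INIT_INTENSIVE = [
    "bootup", "compile", "forkbench", "mcached", "mysql", "shell",
]
SPEC_MEMORY_INTENSIVE = [
    "bzip2", "gcc", "mcf", "milc", "zeusmp", "gromacs", "cactusADM",
    "leslie3d", "namd", "gobmk", "dealII", "soplex", "hmmer", "sjeng",
    "GemsFDTD", "libquantum", "h264ref", "lbm", "omnetpp", "astar", "wrf",
    "sphinx3", "xalancbmk",
]


def make_workload(num_cores, rng):
    """Return one multi-programmed workload as a list of benchmark names.

    Half of the cores get copy/initialization-intensive benchmarks and the
    remaining cores get memory-intensive SPEC CPU2006 benchmarks.
    """
    half = num_cores // 2
    copy_part = [rng.choice(COPY_INIT_INTENSIVE) for _ in range(half)]
    spec_part = [rng.choice(SPEC_MEMORY_INTENSIVE)
                 for _ in range(num_cores - half)]
    return copy_part + spec_part


# Example: generate 50 four-core workloads with a fixed seed.
workloads = [make_workload(4, random.Random(seed)) for seed in range(50)]
```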
Figure 3 plots the performance improvement due to Row-
Clone and RowClone-ZI for the 50 four-core workloads that
we evaluate (sorted based on the performance improvement
due to RowClone-ZI). Two conclusions are in order. First, al-
though RowClone degrades performance of certain four-core
workloads (with compile, mcached or mysql benchmarks), it
significantly improves performance for all other workloads
(by 10% across all workloads). Second, RowClone-ZI elim-
inates the performance degradation due to RowClone and
consistently outperforms both the baseline and RowClone
for all workloads (20% on average).
[Figure 3 plot: normalized weighted speedup (y-axis, 0.95 to 1.40) of Baseline, RowClone, and RowClone-ZI across the 50 workloads (x-axis).]
Figure 3: System performance improvement of RowClone
for four-core workloads. Reproduced from [151].
To provide more insight into the benefits of RowClone
on multi-core systems, we classify our copy/initialization-
intensive benchmarks into two categories: 1) moderately
copy/initialization-intensive (compile, mcached, and mysql)
and 2) highly copy/initialization-intensive (bootup, forkbench,
and shell). Figure 4 shows the average improvement in
weighted speedup for the different multi-core workloads, cat-
egorized based on the number of highly copy/initialization-
intensive benchmarks. As the trends indicate, RowClone’s
performance improvement increases with increasing number
of such benchmarks for all three multi-core systems, indi-
cating the effectiveness of RowClone in accelerating bulk
copy/initialization operations.
[Figure 4 plot: weighted speedup improvement over Baseline (y-axis, ticks 5 to 35) versus number of highly copy/initialization-intensive benchmarks (x-axis: 0 to 1 for 2-core, 0 to 2 for 4-core, 0 to 4 for 8-core).]
Figure 4: Effect of increasing copy/initialization intensity.
Reproduced from [151].
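The multi-core results are reported in terms of weighted speedup, which this summary does not define. As a sketch, assuming the common definition (the sum over all applications of the IPC when running together divided by the IPC when running alone), the metric and its improvement over a baseline can be computed as below. The function names are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-application slowdown-adjusted throughput.

    ipc_shared[i] is application i's IPC in the multi-programmed run and
    ipc_alone[i] is its IPC when it runs alone on the same system.
    """
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC value per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def improvement_over_baseline(ws_mechanism, ws_baseline):
    """Fractional improvement of a mechanism's weighted speedup."""
    return ws_mechanism / ws_baseline - 1.0
```

For example, two applications that each run at half of their standalone IPC give a weighted speedup of 1.0.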
We conclude that RowClone is an effective mechanism to
improve system performance, energy efficiency and band-
width efficiency of future, bandwidth-constrained multi-core
systems.
5. Related Work
To our knowledge, this is the first paper to propose a con-
crete mechanism to perform bulk data copy and initializa-
tion operations completely in DRAM. In this section, we
discuss related work and qualitatively compare it to Row-
Clone. Other treatments of related works can be found
in [156, 158, 159].
Patents on Data Copy in DRAM. Several patents [3, 48,
113, 114] propose the abstract notion that the row buffer in
DRAM can be used to copy data from one row to another.
These patents have four major drawbacks. First, they do not
provide any concrete mechanism to perform the copy opera-
tion. Second, while using the row buffer to copy data between
two rows is possible only when the two rows are within the
same subarray, these patents make no such distinction. Third,
these patents do not discuss the support required from the
other layers of the system to realize a working system. Fourth,
these patents do not provide any concrete evaluation to show
the benefits of performing copy operations in DRAM. In con-
trast, RowClone is more generally applicable, and our MICRO
2013 paper [151] discusses the concrete changes required to
all layers of the system stack, from the DRAM architecture
to the system software, to enable bulk data copy.
Offloading Copy/Initialization Operations. Prior
works [66, 192] propose mechanisms to 1) offload bulk data
copy/initialization operations to a separate engine; 2) reduce
the impact of pipeline stalls (by waking up instructions de-
pendent on a copy operation as soon as the necessary blocks
are copied without waiting for the entire copy operation to
complete); and 3) reduce cache pollution by using hints from
software to decide whether to cache blocks involved in the
copy or initialization. While Section 7.5 of our MICRO 2013
paper [151] shows the effectiveness of RowClone compared
to offloading bulk data operations to a separate engine, tech-
niques to reduce pipeline stalls and cache pollution [66] can
be naturally combined with RowClone to further improve
performance.
Low-cost Interlinked Sub-Arrays (LISA) [25] proposes to
connect adjacent subarrays inside a DRAM bank using a set
of isolation transistors. Using this structure, LISA proposes
mechanisms to efficiently copy data across rows in differ-
ent subarrays within the same bank. LISA and RowClone
can be combined to perform all bulk copy and initialization
operations efficiently inside DRAM. However, unlike LISA,
RowClone does not require any changes to the DRAM array.
The Compute Cache [2] performs copy, zero, and bitwise
operations completely inside the on-chip SRAM cache. Like
RowClone, the Compute Cache exploits the fact that many
cells are connected to the same bitline to efficiently perform
these operations across cells connected to the same bitline.
Again, depending on the location of the data, RowClone and
Compute Cache can be combined to further improve system
performance and efficiency.
Bulk Memory Initialization. Jarrod et al. [63] propose
a mechanism for avoiding the memory access required to
fetch uninitialized blocks on a store miss. They use a special-
ized cache to keep track of uninitialized regions of memory.
RowClone can potentially be combined with this mechanism.
While Jarrod et al.’s approach can be used to reduce band-
width consumption for irregular initialization (initializing
different pages with different values), RowClone can be used
to push regular initialization (e.g., initializing multiple pages
with the same values) to DRAM, thereby freeing up the CPU
to perform other useful operations.
Yang et al. [185] propose to reduce the cost of zero initial-
ization by 1) using non-temporal store instructions to avoid
cache pollution, and 2) using idle cores/threads to perform
zeroing ahead of time. While the proposed optimizations
reduce the negative performance impact of zeroing, their
mechanism does not reduce memory bandwidth consump-
tion of the bulk zeroing operations. In contrast, RowClone
significantly reduces the memory bandwidth consumption
and the associated energy overhead.
Processing-in-Memory. Recent works propose mecha-
nisms that exploit the internal organization and operation of
DRAM [102, 153, 154], SRAM [2, 69], phase-change memory
(PCM) [103], or memristors [162] to perform bulk bitwise
Boolean algebra and/or simple arithmetic operations. One
such mechanism, called Ambit [153, 154], uses a number of
row copy and initialization operations to perform Boolean
algebra using DRAM. Ambit makes use of RowClone to ef-
ficiently perform these row copy and initialization opera-
tions. Another mechanism, the Compute Cache [2], can
perform copy and initialization operations within SRAM.
Other mechanisms for in-memory Boolean algebra or arith-
metic [69, 102, 103, 162] can be trivially used to perform data
copy and initialization operations (e.g., a data copy can be
performed by performing a bulk addition, where the row to
be copied is added to a row of all zeroes).
Various prior works (e.g., [6,7,16,17,49,55,56,76,83,110,133,
135,188]) have investigated mechanisms to add logic circuitry
closer to memory to perform bandwidth-intensive compu-
tations (e.g., SIMD vector operations) more efficiently. The
main limitation of such approaches is that adding logic to or
near DRAM significantly increases the cost of main memory.
In contrast, RowClone exploits the existing internal organiza-
tion and operation of DRAM to perform bandwidth-intensive
copy and initialization operations quickly and efficiently with
low cost.
Other Methods for Lowering Memory Latency. There
are many works that improve the performance of applications
by reducing the overall memory access latency. These works
enable more parallelism and bandwidth [4,5,27,80,97,100,153,
154,181,189,193], exploit latency variation within DRAM [23,
26, 28, 96, 98, 99], reduce refresh counts [71, 72, 74, 75, 108, 109,
141, 178], enable better communication between the CPU
and other devices through DRAM [100], leverage DRAM
access patterns to reduce access latency [54, 165], reduce
write-related latencies by better designing DRAM and DRAM
control policies [30, 92, 152], reduce overall queuing latencies
in DRAM by better scheduling memory requests [13, 14, 37,
45, 47, 57, 61, 67, 70, 78, 79, 93, 94, 95, 104, 115, 116, 117, 118,
125, 126, 130, 135, 146, 164, 171, 172, 173, 174, 177, 191], employ
prefetching [12, 22, 35, 36, 40, 43, 44, 46, 93, 119, 120, 121, 122,
127, 129, 134, 167], perform memory/cache compression [1,
10, 11, 38, 39, 42, 136, 137, 138, 139, 140, 163, 179, 183, 190], or
perform better caching [73, 142, 144, 160, 161]. RowClone is
orthogonal to all of these approaches, and can be combined
with any of them to achieve greater latency and
energy benefits.
6. Significance
Our MICRO 2013 paper [151] proposes RowClone, a simple
mechanism to export bulk copy and initialization operations
to DRAM. In this section, we describe the novelty of our
approach, the long term impact of our proposed techniques,
and new research directions triggered by our work.
6.1. Novelty
Prior works investigate mechanisms to add logic closer to
memory to perform bandwidth-intensive operations more
efficiently. Although this approach has the potential to be
used for a wide range of applications, it has two shortcom-
ings. First, adding logic to DRAM increases the cost of DRAM
significantly. Second, this approach does not reduce the band-
width requirement of simple bulk copy/initialization opera-
tions.
In contrast, our work is the first (to our knowledge) to
propose mechanisms that exploit the internal organization
and operation of DRAM to perform bandwidth-intensive copy
and initialization operations quickly and efficiently in DRAM.
The changes required by our mechanism in the DRAM chip
are limited to the peripheral logic and are very modest, with
a DRAM die area overhead of only 0.2%. With this small
overhead, our mechanisms significantly reduce the latency,
bandwidth, and energy consumed by bulk data operations.
6.2. Long-Term Impact
We believe four trends in current and future systems make
our proposed solutions even more relevant. We discuss each
trend, and how RowClone can be applied in the context of
the trend.
Increasingly Limited Memory Bandwidth. Processor
manufacturers are integrating more and more cores on a single
chip, thereby significantly increasing the compute capability
of the processing chip. However, due to (1) the high cost asso-
ciated with increasing pin counts and (2) limitations in DRAM
scalability, the available memory bandwidth is not expected
to grow at the same rate [61, 64]. This makes mechanisms
like RowClone, which significantly reduce the overall mem-
ory bandwidth utilization of the system, likely even more
important in future systems.
Increasing Use of Hardware Accelerators. Many mod-
ern processors already integrate the GPU on the same die as
the CPU. With emerging systems moving towards a system-
on-chip (SoC) model, many components/accelerators (called
agents) are integrated on the same die as the CPU, and share
the off-chip memory [176, 177]. To reduce the complexity of
managing these agents, each agent is given its own share of
the physical address space, and agents typically communicate
with each other by copying data in bulk across the individual
device address spaces. By enabling faster bulk data copies,
we expect RowClone to significantly reduce the communica-
tion latency between different agents without increasing the
complexity of the system.
Increasing Use of Virtualization. Modern systems
(especially data centers and cloud computers) are increasingly
employing virtualization to improve the utilization, security,
and availability of systems and services. As described in
our MICRO 2013 paper [151], the use of techniques such
as VM cloning and deduplication [88, 180] to reduce the
memory capacity requirements will likely increase the number
of copy operations and zeroing operations (to protect
data across VMs). RowClone can improve the performance
and energy efficiency of such systems by performing these
copy/initialization operations efficiently.
Ease of Adoption. Given the low implementation com-
plexity of RowClone, it can be easily adopted in existing
systems. RowClone is not limited only to DDR DRAMs. It
can be used with 3D-stacked DRAM technologies [97, 111]
such as the Hybrid Memory Cube [58, 59] and High Band-
width Memory [65], which are gaining increasing interest
among researchers, DRAM manufacturers, and system de-
signers [6, 7, 82].
6.3. New Research Directions
Our proposed approach to performing bulk data copy and
initialization in DRAM inspires several important research
directions (and hopefully many more that others will imagine).
We describe a few of them below.
One important research question that our work raises is
how can one redesign system software (e.g., operating system,
hypervisors) and application software to take better advantage
of RowClone? Existing systems assume that copies are expensive
and hence trade off complexity for performance. However,
with RowClone, it may be possible to design simpler
yet high-performance systems by rethinking software design
in the presence of very fast bulk copy and initialization.
Our MICRO 2013 paper [151] proposes low-cost mecha-
nisms to export bulk copy and initialization to DRAM. These
are by no means the only bandwidth-intensive operations.
There are other operations that unnecessarily move data be-
tween the main memory and the processor, which can be
optimized using low-cost mechanisms. Therefore, another
natural research question is what other bandwidth-intensive
operations can be exported to main memory using low-cost
mechanisms? We believe RowClone can inspire similar mechanisms
for other such operations. For example, one of our
recent works [157] proposes an efficient method to perform
gather/scatter operations in DRAM. Another of our recent
works proposes mechanisms to perform bulk bitwise operations
in DRAM [153, 154], building upon and taking advantage
of RowClone.
Recently, there has been increased interest in emerging
non-volatile memory technologies (e.g., PCM [89, 90, 91,
143, 145, 184, 186, 187], STT-MRAM [29, 50, 84, 128], mem-
ristors [32, 170]). Given this trend, exploring the feasibility
of extending RowClone to these new memory technologies is
a relevant and important research direction. For example,
two recent works [103, 162] use the principles discussed in
RowClone to perform bulk Boolean algebra and arithmetic
operations within emerging memories. Similarly, exploring
the idea of RowClone in other storage/memory technologies,
e.g., NAND flash memory [19, 20, 21], is promising.
Given that memory bandwidth is expected to become an
even more scarce resource in future systems, answers to
these research questions have the potential to greatly mitigate
bandwidth contention, and, thus, significantly improve both
the performance and energy efficiency of these systems.
6.4. Works Building on RowClone
RowClone has inspired a number of follow-up works that
propose 1) new mechanisms to perform bulk operations inside
various memory technologies (e.g., DRAM [25, 102], SRAM [2,
69], PCM [103], memristors [162]), and 2) mechanisms that
exploit RowClone to speed up other operations (e.g., in-DRAM
bulk bitwise operations [76, 153, 154]). A survey of related
works is provided in [159].
One of our recent works, Ambit [153, 154], proposes a mechanism
to perform bulk bitwise operations completely inside
DRAM. Ambit operations involve a number of row copy and
initialization operations. Ambit uses RowClone to perform
these operations quickly and efficiently inside DRAM. In fact,
RowClone is essential for Ambit to obtain the performance
and energy efficiency improvements. Other recent works that
perform bulk bitwise Boolean algebra and/or simple arithmetic
operations [2, 8, 9, 69, 85, 86, 87, 101, 102, 103, 162] exploit
the organization and operation of memory arrays, akin to
RowClone, and can be used to perform bulk data copy and
initialization operations.
Data movement is expected to become an even more crit-
ical problem in future systems. We believe RowClone can
inspire other works that propose mechanisms to reduce data
movement, thereby enabling higher system performance and
energy efficiency.
7. Conclusion
Our MICRO 2013 paper [151] proposes RowClone, a mech-
anism that performs bulk data copy and initialization oper-
ations completely inside DRAM. RowClone consists of two
mechanisms, Fast Parallel Mode and Pipelined Serial Mode,
that are used to copy data using existing peripheral struc-
tures within DRAM, requiring no changes to the DRAM cell
array. By enabling efficient bulk data copy and initialization,
RowClone provides significant performance and DRAM energy
improvements of between one and two orders of
magnitude over existing systems.
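As a rough illustration of how the two mechanisms divide the work, the following Python sketch picks a mode from the locations of the source and destination rows. The `RowAddress` type, the `select_copy_mode` function, and the field names are hypothetical and introduced only for this example; the conditions follow the summary given at the start of this paper (Fast Parallel Mode between rows of the same subarray, Pipelined Serial Mode between banks). Copies between different subarrays of the same bank are not described in that summary, so the sketch leaves that case unresolved rather than guess.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CopyMode(Enum):
    FPM = "Fast Parallel Mode"      # back-to-back activates within one subarray
    PSM = "Pipelined Serial Mode"   # cache-line transfers over the shared internal bus


@dataclass(frozen=True)
class RowAddress:
    # Hypothetical decomposition of a DRAM row location (assumption for this sketch).
    bank: int
    subarray: int
    row: int


def select_copy_mode(src: RowAddress, dst: RowAddress) -> Optional[CopyMode]:
    """Return the RowClone mode suggested by the summary, or None if not covered."""
    if src == dst:
        raise ValueError("source and destination rows must differ")
    if src.bank == dst.bank and src.subarray == dst.subarray:
        return CopyMode.FPM
    if src.bank != dst.bank:
        return CopyMode.PSM
    # Same bank, different subarray: not covered by the summary in this paper.
    return None


if __name__ == "__main__":
    print(select_copy_mode(RowAddress(0, 1, 5), RowAddress(0, 1, 9)))
    print(select_copy_mode(RowAddress(0, 1, 5), RowAddress(3, 1, 5)))
```

In a real system this decision would be made by the memory controller from the physical address mapping, which is outside the scope of this sketch.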
RowClone is one of the first steps towards reducing
unnecessary data movement between the processor and the
main memory using a low-cost in-memory approach. Current
trends in system design indicate that our approach will
be more relevant to future, bandwidth-limited systems. We
hope that our work triggers research that leads to 1) simpler
and more efficient software design and 2) extensions of our
approach to other operations and memory technologies, with
the goal of continuing to greatly improve system performance
and energy efficiency.
Acknowledgments
We thank Saugata Ghose for his dedicated effort in the
preparation of this article. We acknowledge the support of
AMD, IBM, Intel, Oracle, Qualcomm, and Samsung. This
research was partially supported by the NSF (grants 0953246,
1147397, and 1212962), the Intel University Research Office
Memory Hierarchy Program, the Intel Science and Technology
Center for Cloud Computing, and the Semiconductor
Research Corporation.
References
[1] B. Abali, H. Franke, D. Poff, R. Saccone, C. Schulz, L. Herger, and T. Smith,
“Memory Expansion Technology (MXT): Software support and performance,” in IBM
JRD, 2001.
[2] S. Aga et al., “Compute Caches,” in HPCA, 2017.
[3] J. Ahn, “Memory device having page copy mode,” U.S. Patent 5,886,944, 1999.
[4] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Improving
System Energy Efficiency with Memory Rank Subsetting,” in ACM TACO, 2012.
[5] J. H. Ahn, J. Leverich, R. Schreiber, and N. P. Jouppi, “Multicore DIMM: an Energy
Efficient Memory Module with Independently Controlled DRAMs,” in IEEE CAL,
2009.
[6] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable Processing-in-Memory
Accelerator for Parallel Graph Processing,” in ISCA, 2015.
[7] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-Enabled Instructions: A Low-
Overhead, Locality-Aware Processing-in-Memory Architecture,” in ISCA, 2015.
[8] A. Akerib, O. Agam, E. Ehrman, and M. Meyassed, “Using Storage Cells to Per-
form Computation,” U.S. Patent 8,908,465, 2014.
[9] A. Akerib and E. Ehrman, “In-Memory Computational Device,” U.S. Patent
9,653,166, 2015.
[10] A. R. Alameldeen and D. A. Wood, “Adaptive Cache Compression for High-
Performance Processors,” in ISCA, 2004.
[11] A. R. Alameldeen and D. A. Wood, “Frequent Pattern Compression: A
Significance-Based Compression Scheme for L2 Caches,” Univ. of Wisconsin–Madison,
Computer Sciences Dept., Tech. Rep. 1500, 2004.
[12] A. Alameldeen and D. Wood, “Interactions Between Compression and Prefetch-
ing in Chip Multiprocessors,” in HPCA, 2007.
[13] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh, and O. Mutlu, “Staged
memory scheduling: achieving high performance and scalability in heteroge-
neous systems,” in ISCA, 2012.
[14] R. Ausavarungnirun, S. Ghose, O. Kayiran, G. H. Loh, C. R. Das, M. T. Kandemir,
and O. Mutlu, “Exploiting Inter-Warp Heterogeneity to Improve GPGPU Perfor-
mance,” in PACT, 2015.
[15] J. Bent et al., “PLFS: A checkpoint filesystem for parallel applications,” in SC,
2009.
[16] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim,
A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu, “Google Workloads for Con-
sumer Devices: Mitigating Data Movement Bottlenecks,” in ASPLOS, 2018.
[17] A. Boroumand, S. Ghose, B. Lucia, K. Hsieh, K. Malladi, H. Zheng, and
O. Mutlu, “LazyPIM: An Efficient Cache Coherence Mechanism for
Processing-in-Memory,” in IEEE CAL, 2016.
[18] D. P. Bovet and M. Cesati, Understanding the Linux Kernel. O’Reilly Media, 2005,
p. 388.
[19] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characterization,
Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives,” Proc. IEEE,
2017.
[20] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Error Characteri-
zation, Mitigation, and Recovery in Flash Memory Based Solid-State Drives,”
arXiv:1706.08642 [cs.AR], 2017.
[21] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu, “Errors in Flash-Memory-
Based Solid-State Drives: Analysis, Mitigation, and Recovery,” arXiv:1711.11427
[cs.AR], 2017.
[22] P. Cao, E. W. Felten, A. R. Karlin, and K. Li, “A Study of Integrated Prefetching
and Caching Strategies,” in SIGMETRICS, 1995.
[23] K. Chandrasekar, S. Goossens, C. Weis, M. Koedam, B. Akesson, N. Wehn, and
K. Goossens, “Exploiting Expendable Process-margins in DRAMs for Run-time
Performance Optimization,” in DATE, 2014.
[24] F. Chang and G. A. Gibson, “Automatic I/O hint generation through speculative
execution,” in OSDI, 1999.
[25] K. K. Chang et al., “Low-cost Inter-linked Subarrays (LISA): Enabling Fast Inter-
subarray Data Movement in DRAM,” in HPCA, 2016.
[26] K. K. Chang, A. Kashyap, H. Hassan, S. Khan, K. Hsieh, D. Lee, S. Ghose, G. Pekhi-
menko, T. Li, and O. Mutlu, “Understanding Latency Variation in Modern DRAM
Chips: Experimental Characterization, Analysis, and Optimization,” in SIGMET-
RICS, 2016.
[27] K. K. Chang, D. Lee, Z. Chishti, A. Alameldeen, C. Wilkerson, Y. Kim, and
O. Mutlu, “Improving DRAM Performance by Parallelizing Refreshes with
Accesses,” in HPCA, 2014.
[28] K. K. Chang, A. G. Yaglikci, S. Ghose, A. Agrawal, N. Chatterjee, A. Kashyap,
D. Lee, M. O’Connor, H. Hassan, and O. Mutlu, “Understanding Reduced-Voltage
Operation in Modern DRAM Devices: Experimental Characterization, Analysis,
and Mechanisms,” in SIGMETRICS, 2017.
[29] M. T. Chang, P. Rosenfeld, S. L. Lu, and B. Jacob, “Technology Comparison for
Large Last-Level Caches (L3Cs): Low-Leakage SRAM, Low Write-Energy STT-
RAM, and Refresh-Optimized eDRAM,” in HPCA, 2013.
[30] N. Chatterjee, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. P. Jouppi,
“Staged Reads: Mitigating the Impact of DRAM Writes on DRAM Reads,” in
HPCA, 2012.
[31] J. Chow et al., “Shredding Your Garbage: Reducing data lifetime through secure
deallocation,” in USENIX SS, 2005.
[32] L. Chua, “Memristor—The Missing Circuit Element,” TCT, Sep. 1971.
[33] K. Constantinides et al., “Software-Based Online Detection of Hardware Defects:
Mechanisms, architectural support, and evaluation,” in MICRO, 2007.
[34] K. Constantinides et al., “Online Design Bug Detection: RTL analysis, flexible
mechanisms, and evaluation,” in MICRO, 2008.
[35] R. Cooksey, S. Jourdan, and D. Grunwald, “A Stateless, Content-directed Data
Prefetching Mechanism,” in ASPLOS, 2002.
[36] F. Dahlgren, M. Dubois, and P. Stenström, “Sequential Hardware Prefetching in
Shared-Memory Multiprocessors,” in IEEE TPDS, 1995.
[37] R. Das, R. Ausavarungnirun, O. Mutlu, A. Kumar, and M. Azimi, “Application-
to-core mapping policies to reduce memory system interference in multi-core
systems,” in HPCA, 2013.
[38] R. de Castro, A. Lago, and M. Silva, “Adaptive compressed caching: design and
implementation,” in SBAC-PAD, 2003.
[39] F. Douglis, “The Compression Cache: Using On-line Compression to Extend
Physical Memory,” in Winter USENIX Conference, 1993.
[40] J. Dundas and T. Mudge, “Improving Data Cache Performance by Pre-executing
Instructions Under a Cache Miss,” in ICS, 1997.
[41] A. M. Dunn et al., “Eternal Sunshine of the Spotless Machine: Protecting privacy
with ephemeral channels,” in OSDI, 2012.
[42] J. Dusser, T. Piquet, and A. Seznec, “Zero-content Augmented Caches,” in ICS,
2009.
[43] E. Ebrahimi, O. Mutlu, and Y. Patt, “Techniques for bandwidth-efficient
prefetching of linked data structures in hybrid prefetching systems,” in HPCA, 2009.
[44] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Prefetch-aware Shared Resource
Management for Multi-core Systems,” in ISCA, 2011.
[45] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N.
Patt, “Parallel application memory scheduling,” in MICRO, 2011.
[46] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt, “Coordinated Control of Multiple
Prefetchers in Multi-core Systems,” in MICRO, 2009.
[47] S. Ghose, H. Lee, and J. F. Martínez, “Improving Memory Scheduling via
Processor-Side Load Criticality Information,” in ISCA, 2013.
[48] P. B. Gillingham and R. Torrance, “DRAM page copy method,” U.S. Patent
5,625,601, 1997.
[49] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi, J. C. Hoe, and
F. Franchetti, “3D-Stacked Memory-Side Acceleration: Accelerator and System
Design,” in WoNDP, 2014.
[50] X. Guo, E. İpek, and T. Soyata, “Resistive Computation: Avoiding the Power Wall
with Low-Leakage, STT-MRAM Based Computing,” in ISCA, 2009.
[51] J. A. Halderman et al., “Lest We Remember: Cold boot attacks on encryption
keys,” in USENIX SS, 2008.
[52] K. Harrison and S. Xu, “Protecting cryptographic keys from memory disclosure
attacks,” in DSN, 2007.
[53] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. K. Chang, G. Pekhimenko, D. Lee,
O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-Source Infras-
tructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[54] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and
O. Mutlu, “ChargeCache: Reducing DRAM Latency by Exploiting Row Access
Locality,” in HPCA, 2016.
[55] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and
O. Mutlu, “Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges,
Mechanisms, Evaluation,” in ICCD, 2016.
[56] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar,
O. Mutlu, and S. W. Keckler, “Transparent Offloading and Mapping (TOM):
Enabling Programmer-Transparent Near-Data Processing in GPU Systems,” in
ISCA, 2016.
[57] I. Hur and C. Lin, “Adaptive History-Based Memory Schedulers,” in MICRO, 2004.
[58] Hybrid Memory Cube Consortium, “HMC Specification 1.1,” 2013.
[59] Hybrid Memory Cube Consortium, “HMC Specification 2.0,” 2014.
[60] Intel, “Intel 64 and IA-32 Architectures Optimization Reference Manual,” Apr.
2012.
[61] E. Ipek et al., “Self Optimizing Memory Controllers: A Reinforcement Learning
Approach,” in ISCA, 2008.
[62] T. B. Jablin et al., “Automatic CPU-GPU communication management and opti-
mization,” in PLDI, 2011.
[63] L. A. Jarrod et al., “Avoiding Initialization Misses to the Heap,” in ISCA, 2002.
[64] JEDEC, “Server memory roadmap,” http://www.jedec.org/sites/default/files/
Ricki_Dee_Williams.pdf.
[65] JEDEC, “High Bandwidth Memory (HBM) DRAM,” Standard No. JESD235, 2013.
[66] X. Jiang et al., “Architecture support for improving bulk memory copying and
initialization performance,” in PACT, 2009.
[67] A. Jog, O. Kayiran, A. Pattnaik, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das,
“Exploiting Core-Criticality for Enhanced GPU Performance,” in SIGMETRICS,
2016.
[68] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, and
D. Brooks, “Profiling a Warehouse-Scale Computer,” in ISCA, 2015.
[69] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, “An
Energy-Efficient VLSI Architecture for Pattern Recognition via Deep Embedding of
Computation in SRAM,” in ICASSP, 2014.
[70] D. Kaseridis, J. Stuecheli, and L. K. John, “Minimalist Open-Page: A DRAM Page-
Mode Scheduling Policy for the Many-Core Era,” in MICRO, 2011.
[71] S. Khan et al., “Detecting and Mitigating Data-Dependent DRAM Failures by
Exploiting Current Memory Content,” in MICRO, 2017.
[72] S. Khan, D. Lee, and O. Mutlu, “PARBOR: An Efficient System-Level Technique
to Detect Data-Dependent Failures in DRAM,” in DSN, 2016.
[73] S. Khan, A. R. Alameldeen, C. Wilkerson, O. Mutlu, and D. A. Jimenez, “Improv-
ing Cache Performance by Exploiting Read-Write Disparity,” in HPCA, 2014.
[74] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C. Wilkerson, and O. Mutlu, “The
Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A
Comparative Experimental Study,” in SIGMETRICS, 2014.
[75] S. Khan, C. Wilkerson, D. Lee, A. R. Alameldeen, and O. Mutlu, “A Case for
Memory Content-Based Detection and Mitigation of Data-Dependent Failures
in DRAM,” in IEEE CAL, 2016.
[76] J. S. Kim et al., “Genome Read In-Memory (GRIM) Filter: Fast Location Filtering
in DNA Read Mapping Using Emerging Memory Technologies,” in APBC, 2018.
[77] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly
Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability
Tradeoff in Modern DRAM Devices,” in HPCA, 2018.
[78] Y. Kim et al., “ATLAS: A scalable and high-performance scheduling algorithm
for multiple memory controllers,” in HPCA, 2010.
[79] Y. Kim et al., “Thread Cluster Memory Scheduling: Exploiting differences in
memory access behavior,” in MICRO, 2010.
[80] Y. Kim et al., “A Case for Exploiting Subarray-Level Parallelism (SALP) in
DRAM,” in ISCA, 2012.
[81] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and
O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental
Study of DRAM Disturbance Errors,” in ISCA, 2014.
[82] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simu-
lator,” CAL, 2015.
[83] P. M. Kogge, “EXECUBE - A new architecture for scaleable MPPs,” in ICPP, 1994.
[84] E. Kültürsay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating
STT-RAM as an Energy-Efficient Main Memory Alternative,” in ISPASS, 2013.
[85] S. Kvatinsky, D. Belousov, S. Liman, G. Satat, N. Wald, E. G. Friedman, A. Kolodny,
and U. C. Weiser, “MAGIC—Memristor-Aided Logic,” IEEE TCAS II: Express Briefs,
2014.
[86] S. Kvatinsky, A. Kolodny, U. C. Weiser, and E. G. Friedman, “Memristor-Based
IMPLY Logic Design Procedure,” in ICCD, 2011.
[87] S. Kvatinsky, G. Satat, N. Wald, E. G. Friedman, A. Kolodny, and U. C. Weiser,
“Memristor-Based Material Implication (IMPLY) Logic: Design Principles and
Methodologies,” TVLSI, 2014.
[88] H. A. Lagar-Cavilla et al., “SnowFlock: Rapid virtual machine cloning for cloud
computing,” in EuroSys, 2009.
[89] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory
as a Scalable DRAM Alternative,” in ISCA, 2009.
[90] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Phase Change Memory Architecture
and the Quest for Scalability,” CACM, 2010.
[91] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger,
“Phase-Change Technology and the Future of Main Memory,” IEEE Micro, 2010.
[92] C. J. Lee, E. Ebrahimi, V. Narasiman, O. Mutlu, and Y. N. Patt, “DRAM-Aware
Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory
Systems,” Univ. of Texas at Austin, High Performance Systems Group, Tech. Rep.
TR-HPS-2010-002, 2010.
[93] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-Aware DRAM Con-
trollers,” in MICRO, 2008.
[94] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-Aware Memory Con-
trollers,” in IEEE TC, 2011.
[95] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving Memory Bank-level
Parallelism in the Presence of Prefetching,” in MICRO, 2009.
[96] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-Latency
DRAM: A Low Latency and Low Cost DRAM Architecture,” in HPCA, 2013.
[97] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-
Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost,” in ACM
TACO, 2016.
[98] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko,
V. Seshadri, and O. Mutlu, “Design-Induced Latency Variation in Modern DRAM
Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in SIG-
METRICS, 2017.
[99] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. K. Chang, and O. Mutlu,
“Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,” in
HPCA, 2015.
[100] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu, “Decoupled
Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a
Dual-Data-Port DRAM,” in PACT, 2015.
[101] Y. Levy, J. Bruck, Y. Cassuto, E. G. Friedman, A. Kolodny, E. Yaakobi, and
S. Kvatinsky, “Logic Operations in Memory Using a Memristive Akers Array,”
Microelectronics Journal, 2014.
[102] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “DRISA: A
DRAM-Based Reconfigurable In-Situ Accelerator,” in MICRO, 2017.
[103] S. Li et al., “Pinatubo: A Processing-in-Memory Architecture for Bulk Bitwise
Operations in Emerging Non-Volatile Memories,” in DAC, 2016.
[104] Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, and O. Mutlu, “Utility-Based Hybrid
Memory Management,” in CLUSTER, 2017.
[105] Y. Li, S. Makar, and S. Mitra, “CASP: Concurrent Autonomous Chip Self-Test
Using Stored Test Patterns,” in DATE, 2008.
[106] Y. Li, O. Mutlu, D. S. Gardner, and S. Mitra, “Concurrent Autonomous Self-Test
for Uncore Components in System-on-Chips,” in VTS, 2010.
[107] Y. Li, O. Mutlu, and S. Mitra, “Operating System Scheduling for Efficient Online
Self-Test in Robust Systems,” in ICCAD, 2009.
[108] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of
Data Retention Behavior in Modern DRAM Devices: Implications for Retention
Time Profiling Mechanisms,” in ISCA, 2013.
[109] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-aware Intelligent
DRAM Refresh,” in ISCA, 2012.
[110] Z. Liu, I. Calciu, M. Herlihy, and O. Mutlu, “Concurrent Data Structures for Near-
Memory Computing,” in SPAA, 2017.
[111] G. H. Loh, “3D-Stacked Memory Architectures for Multi-Core Processors,” in
ISCA, 2008.
[112] Micron, “DDR3 SDRAM system-power calculator,” 2011.
[113] D. M. Morgan and M. A. Shore, “DRAMs having on-chip row copy circuits for
use in testing and video imaging and method for operating same,” U.S. Patent
5,440,517, 1995.
[114] K. Mori, “Semiconductor memory device including copy circuit,” U.S. Patent
5,854,771, 1998.
[115] T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory
Service in Multi-core Systems,” in USENIX Security, 2007.
[116] T. Moscibroda and O. Mutlu, “Distributed Order Scheduling and Its Application
to Multi-core Dram Controllers,” in PODC, 2008.
[117] J. Mukundan and J. F. Martínez, “MORSE: Multi-Objective Reconfigurable
Self-Optimizing Memory Scheduler,” in HPCA, 2012.
[118] S. P. Muralidhara et al., “Reducing memory interference in multi-core systems
via application-aware memory channel partitioning,” in MICRO, 2011.
[119] O. Mutlu et al., “Efficient Runahead Execution: Power-efficient memory latency
tolerance,” IEEE Micro, vol. 26, no. 1, 2006.
[120] O. Mutlu, H. Kim, and Y. Patt, “Address-value delta (AVD) prediction: increasing
the effectiveness of runahead execution by exploiting regular memory allocation
patterns,” in MICRO, 2005.
[121] O. Mutlu, H. Kim, and Y. Patt, “Techniques for efficient processing in runahead
execution engines,” in ISCA, 2005.
[122] O. Mutlu, J. Stark, C. Wilkerson, and Y. Patt, “Runahead execution: an alternative
to very large instruction windows for out-of-order processors,” in HPCA, 2003.
[123] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in Memory
Systems,” SUPERFRI, 2014.
[124] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in IMW, 2013.
[125] O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for
Chip Multiprocessors,” in MICRO, 2007.
[126] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing
both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
[127] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead execution: An
effective alternative to large instruction windows,” in IEEE Micro, 2003.
[128] H. Naeimi, C. Augustine, A. Raychowdhury, S.-L. Lu, and J. Tschanz, “STT-RAM
Scaling and Retention Failure,” Intel Technol. J., May 2013.
[129] K. Nesbit, A. Dhodapkar, and J. Smith, “AC/DC: an adaptive data cache
prefetcher,” in PACT, 2004.
[130] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith, “Fair Queuing Memory
Systems,” in MICRO, 2006.
[131] J. K. Ousterhout, “Why aren’t operating systems getting faster as fast as hard-
ware?” in USENIX STC, 1990.
[132] M. Patel, J. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the
Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,” in
ISCA, 2017.
[133] D. Patterson et al., “A case for Intelligent RAM,” IEEE Micro, 1997.
[134] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka, “Informed
Prefetching and Caching,” in SOSP, 1995.
[135] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu,
and C. R. Das, “Scheduling Techniques for GPU Architectures with Processing-
in-Memory Capabilities,” in PACT, 2016.
[136] G. Pekhimenko, E. Bolotin, M. O’Connor, O. Mutlu, T. C. Mowry, and S. W. Keck-
ler, “Toggle-Aware Compression for GPUs,” in IEEE CAL, 2015.
[137] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W.
Keckler, “Toggle-Aware Bandwidth Compression for GPUs,” in HPCA, 2016.
[138] G. Pekhimenko, T. Huberty, R. Cai, O. Mutlu, P. P. Gibbons, M. A. Kozuch, and
T. C. Mowry, “Exploiting Compressed Block Size as an Indicator of Future Reuse,”
in HPCA, 2015.
[139] G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, P. B. Gibbons, M. A.
Kozuch, and T. C. Mowry, “Linearly Compressed Pages: A Low-complexity, Low-
latency Main Memory Compression Framework,” in MICRO, 2013.
[140] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C.
Mowry, “Base-Delta-Immediate Compression: A Practical Data Compression
Mechanism for On-Chip Caches,” in PACT, 2012.
[141] M. Qureshi, D.-H. Kim, S. Khan, P. Nair, and O. Mutlu, “AVATAR: A Variable-
Retention-Time (VRT) Aware Refresh for DRAM Systems,” in DSN, 2015.
[142] M. K. Qureshi, A. Jaleel, Y. Patt, S. Steely, and J. Emer, “Adaptive Insertion Policies
for High Performance Caching,” in ISCA, 2007.
[143] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali,
“Enhancing Lifetime and Security of PCM-based Main Memory with Start-gap
Wear Leveling,” in MICRO, 2009.
[144] M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt, “A Case for MLP-Aware
Cache Replacement,” in ISCA, 2006.
[145] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable High Performance Main
Memory System Using Phase-change Memory Technology,” in ISCA, 2009.
[146] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory Access
Scheduling,” in ISCA, 2000.
[147] M. Rosenblum et al., “The impact of architectural trends on operating system
performance,” in SOSP, 1995.
[148] M. E. Russinovich et al., Windows Internals. Microsoft Press, 2009, p. 701.
[149] G. Sandhu, “DRAM scaling and bandwidth challenges,” in WETI, 2012.
[150] R. F. Sauers et al., HP-UX 11i Tuning and Performance. Prentice Hall, 2004, ch.
8. Memory Bottlenecks.
[151] V. Seshadri et al., “RowClone: Fast and energy-efficient in-DRAM bulk data copy
and initialization,” in MICRO, 2013.
[152] V. Seshadri et al., “The Dirty-Block Index,” in ISCA, 2014.
[153] V. Seshadri et al., “Fast Bulk Bitwise AND and OR in DRAM,” in IEEE CAL, 2015.
[154] V. Seshadri et al., “Ambit: In-Memory Accelerator for Bulk Bitwise Operations
Using Commodity DRAM Technology,” in MICRO, 2017.
[155] V. Seshadri, G. Pekhimenko, O. Ruwase, O. Mutlu, P. B. Gibbons, M. A. Kozuch,
T. C. Mowry, and T. Chilimbi, “Page Overlays: An Enhanced Virtual Memory
Framework to Enable Fine-Grained Memory Management,” in ISCA, 2015.
[156] V. Seshadri, “Simple DRAM and Virtual Memory Abstractions to Enable Highly
Efficient Memory Systems,” Ph.D. dissertation, Carnegie Mellon University, 2016.
[157] V. Seshadri, T. Mullins, A. Boroumand, O. Mutlu, P. B. Gibbons, M. A. Kozuch,
and T. C. Mowry, “Gather-Scatter DRAM: In-DRAM Address Translation to Im-
prove the Spatial Locality of Non-Unit Strided Accesses,” in MICRO, 2015.
[158] V. Seshadri and O. Mutlu, “The Processing Using Memory Paradigm: In-DRAM
Bulk Copy, Initialization, Bitwise AND and OR,” arXiv:1610.09603 [cs.AR], 2016.
[159] V. Seshadri and O. Mutlu, “Simple Operations in Memory to Reduce Data Move-
ment,” in Advances in Computers, Volume 106, 2017.
[160] V. Seshadri, O. Mutlu, M. A. Kozuch, and T. C. Mowry, “The Evicted-Address
Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing,”
in PACT, 2012.
[161] V. Seshadri, S. Yedkar, H. Xin, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C.
Mowry, “Mitigating Prefetcher-Caused Pollution Using Informed Caching Poli-
cies for Prefetched Blocks,” in TACO, 2015.
[162] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu,
R. S. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Ac-
celerator with In-Situ Analog Arithmetic in Crossbars,” in ISCA, 2016.
[163] A. Shafiee, M. Taassori, R. Balasubramonian, and A. Davis, “MemZip: Exploring
Unconventional Benefits from Memory Compression,” in HPCA, 2014.
[164] J. Shao and B. T. Davis, “A Burst Scheduling Access Reordering Mechanism,” in
HPCA, 2007.
[165] W. Shin, J. Yang, J. Choi, and L.-S. Kim, “NUAT: A Non-Uniform Access Time
Memory Controller,” in HPCA, 2014.
[166] A. Singh, Mac OS X Internals: A Systems Approach. Addison-Wesley Profes-
sional, 2006.
[167] S. Srinath et al., “Feedback Directed Prefetching: Improving the performance
and bandwidth-efficiency of hardware prefetchers,” in HPCA, 2007.
[168] S. M. Srinivasan et al., “Flashback: A lightweight extension for rollback and de-
terministic replay for software debugging,” in USENIX ATC, 2004.
[169] Standard Performance Evaluation Corporation, “SPEC CPU2006,” http://www.
spec.org/cpu2006.
[170] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The Missing Mem-
ristor Found,” Nature, May 2008.
[171] L. Subramanian et al., “MISE: Providing performance predictability and improv-
ing fairness in shared main memory systems,” in HPCA, 2013.
[172] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The Blacklisting
Memory Scheduler: Achieving high performance and fairness at low cost,” in
ICCD, 2014.
[173] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “BLISS: Balancing
Performance, Fairness and Complexity in Memory Access Scheduling,” in TPDS,
2016.
[174] L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, and O. Mutlu, “The Application
Slowdown Model: Quantifying and Controlling the Impact of Inter-Application
Interference at Shared Caches and Main Memory,” in MICRO, 2015.
[175] K. Sudan et al., “Micro-pages: Increasing DRAM efficiency with locality-aware
data placement,” in ASPLOS, 2010.
[176] M. A. Suleman, O. Mutlu, J. A. Joao, Khubaib, and Y. N. Patt, “Data Marshaling
for Multi-Core Architectures,” in ISCA, 2010.
[177] H. Usui, L. Subramanian, K. Chang, and O. Mutlu, “DASH: Deadline-Aware High-
Performance Memory Scheduler for Heterogeneous Systems with Hardware Ac-
celerators,” in ACM TACO, 2016.
[178] R. Venkatesan, S. Herr, and E. Rotenberg, “Retention-Aware Placement in DRAM
(RAPID): Software Methods for Quasi-Non-Volatile DRAM,” in HPCA, 2006.
[179] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun,
C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, “A Case for Core-Assisted Bot-
tleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist
Warps,” in ISCA, 2015.
[180] C. A. Waldspurger, “Memory resource management in VMware ESX server,” in
OSDI, 2002.
[181] F. Ware and C. Hampel, “Improving Power and Data Efficiency with Threaded
Memory Modules,” in ICCD, 2006.
[182] B. Wester et al., “Operating system support for application-specific speculation,”
in EuroSys, 2011.
[183] P. R. Wilson, S. F. Kaplan, and Y. Smaragdakis, “The Case for Compressed
Caching in Virtual Memory Systems,” in ATEC, 1999.
[184] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran,
M. Asheghi, and K. E. Goodson, “Phase Change Memory,” Proc. IEEE, Dec. 2010.
[185] X. Yang et al., “Why Nothing Matters: The impact of zeroing,” in OOPSLA, 2011.
[186] H. Yoon, J. Meza, R. Ausavarungnirun, R. Harding, and O. Mutlu, “Row Buffer
Locality Aware Caching Policies for Hybrid Memories,” in ICCD, 2012.
[187] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, and O. Mutlu, “Efficient Data
Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories,”
TACO, 2014.
[188] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Igna-
towski, “TOP-PIM: Throughput-oriented Programmable Processing in Memory,”
in HPCA, 2014.
[189] T. Zhang, K. Chen, C. Xu, G. Sun, T. Wang, and Y. Xie, “Half-DRAM: A high-bandwidth
and low-power DRAM architecture from the rethinking of fine-grained
activation,” in ISCA, 2014.
[190] Y. Zhang, J. Yang, and R. Gupta, “Frequent value locality and value-centric data
cache design,” in ASPLOS, 2000.
[191] J. Zhao, O. Mutlu, and Y. Xie, “FIRM: Fair and High-Performance Memory Con-
trol for Persistent Memory Systems,” in MICRO, 2014.
[192] L. Zhao et al., “Hardware support for bulk data movement in server platforms,”
in ICCD, 2005.
[193] H. Zheng et al., “Mini-rank: Adaptive DRAM architecture for improving memory
power efficiency,” in MICRO, 2008.
