This paper summarizes the idea of RowClone, which was published in MICRO 2013 [151], and examines the work's significance and future potential. In existing systems, to perform any bulk data movement operation (copy or initialization), the data has to first be read into the on-chip processor, all the way into the L1 cache, and the result of the operation must be written back to main memory. This is despite the fact that these operations do not involve any actual computation. RowClone exploits the organization and operation of commodity DRAM to perform these operations completely inside DRAM using two mechanisms. The first mechanism, Fast Parallel Mode, copies data between two rows inside the same DRAM subarray by issuing back-to-back activate commands to the source and the destination row. The second mechanism, Pipelined Serial Mode, transfers cache lines between two banks using the shared internal bus. RowClone significantly reduces the raw latency and energy consumption of bulk data copy and initialization. This reduction directly translates to improvement in performance and energy efficiency of systems running copy or initialization-intensive workloads.
Problem: Bulk Data Movement
The main memory subsystem is an increasingly significant limiter of system performance and energy efficiency [123, 124] for at least two reasons. First, the available memory bandwidth between the processor and main memory is not growing, nor is it expected to grow, commensurately with the compute bandwidth available in modern multi-core processors [61, 64]. Second, a significant fraction (20% to 42%) of the energy required to access data from memory is consumed in driving the high-speed bus connecting the processor and memory [149] (calculated using [112]). Therefore, judicious use of the available memory bandwidth is critical to ensure both high system performance and energy efficiency.
In this work, we focus our attention on optimizing two important classes of bandwidth-intensive memory operations that frequently occur in modern systems: 1) bulk data copy, i.e., copying a large quantity of data from one location in physical memory to another, and 2) bulk data initialization, i.e., initializing a large quantity of data to a specific value. We refer to these two operations as bulk data movement operations. Prior research [68, 131, 147] has shown that operating systems and data center workloads spend a significant portion of their time performing bulk data movement operations. Therefore, accelerating these operations will likely improve system performance. In fact, the x86 ISA has recently introduced instructions to provide enhanced performance for bulk copy and initialization (ERMSB [60]), highlighting the importance of bulk operations.
The main reason bulk data movement operations degrade system performance and energy efficiency is that they require large amounts of data to be transferred back and forth on the memory bus. This large data transfer has three shortcomings. First, because the data is transferred one cache line at a time across the bus, these operations incur high latency, directly degrading the performance of the application performing the operation. Second, transferring a large amount of data on the bus interferes with the memory accesses of other concurrently-running applications, degrading their performance as well. Finally, the large data transfer contributes to a significant fraction of the energy consumed by these bulk movement operations.
While bulk data movement operations also degrade performance by hogging the CPU and potentially polluting the on-chip caches, prior works [66, 192] have proposed simple solutions to address these problems by adding support for such operations in the memory controller. However, the techniques proposed by these works do not eliminate the need to transfer data over the memory bus, which is an increasingly critical bottleneck for performance in modern systems.
RowClone: Fast In-DRAM Copy
The fact that both bulk data copy and initialization do not require any computation on the part of the processor enables the opportunity to perform these operations completely inside DRAM. Our MICRO 2013 paper [151] presents a new mechanism, RowClone, which exploits the internal organization and operation of DRAM to perform bulk data copy/initialization quickly and efficiently inside DRAM. Figure 1 illustrates the organization of a DRAM chip. The chip contains multiple banks, each of which is divided into subarrays, and each subarray in turn consists of multiple rows of DRAM cells. Each subarray contains a row buffer, which is used to extract the data from the DRAM cells. Data transfer between the DRAM cells and the row buffer happens at a row granularity, i.e., even to read a single byte from a row, the chip copies the entire row of data from the DRAM cells to the corresponding row buffer.

RowClone Mechanisms
RowClone consists of two mechanisms: (1) Fast Parallel Mode (FPM), which is used to copy data from one row to another row in the same subarray; and (2) Pipelined Serial Mode (PSM), which is used to copy data from one row to another row in a different subarray or bank. We briefly discuss how each mechanism performs bulk data copy and bulk data initialization. Section 3 of our MICRO 2013 paper [151] provides a detailed implementation and discussion of FPM and PSM.
Fast Parallel Mode (FPM). FPM uses the high internal bandwidth offered by DRAM to quickly and efficiently copy data between two rows within the same subarray in two simple steps. First, FPM copies the data from the source row to the local row buffer of the subarray. Second, FPM copies the data from the row buffer to the destination row. To perform the copy, FPM simply issues two back-to-back ACTIVATE commands to the bank, first with the source row address and then with the destination row address. Implementing this in existing DRAM chips requires almost negligible changes. These small changes are to the peripheral logic that controls back-to-back ACTIVATEs.
FPM imposes two constraints on the copy operation. First, it requires the source and the destination row to be within the same subarray. Second, it copies the entire row's worth of data. It cannot partially copy data from one row to another. Despite these constraints, FPM can be used to accelerate many operations in modern systems (Section 3).
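The two FPM steps can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the circuit-level mechanism from [151]: the class `Subarray`, the function `fpm_copy`, the 8-byte row size, and the explicit `precharge` step are all hypothetical names and simplifications introduced here for illustration.

```python
ROW_BYTES = 8  # toy row size (assumption); real DRAM rows span kilobytes


class Subarray:
    """Toy model of one DRAM subarray and its local row buffer."""

    def __init__(self, num_rows: int) -> None:
        self.rows = [bytearray(ROW_BYTES) for _ in range(num_rows)]
        self.row_buffer = bytearray(ROW_BYTES)
        self.buffer_holds_data = False  # False models the precharged state

    def activate(self, row: int) -> None:
        """Model of ACTIVATE.

        On an empty row buffer, the row's data is latched into the buffer.
        If the buffer already holds data (a back-to-back ACTIVATE), that
        data is driven into the newly activated row instead.
        """
        if self.buffer_holds_data:
            self.rows[row][:] = self.row_buffer
        else:
            self.row_buffer[:] = self.rows[row]
            self.buffer_holds_data = True

    def precharge(self) -> None:
        """Return the row buffer to its empty state (assumed step)."""
        self.buffer_holds_data = False


def fpm_copy(subarray: Subarray, src_row: int, dst_row: int) -> None:
    """Copy a whole row within one subarray using two ACTIVATEs."""
    subarray.precharge()
    subarray.activate(src_row)  # step 1: source row -> row buffer
    subarray.activate(dst_row)  # step 2: row buffer -> destination row
    subarray.precharge()
```

Both FPM constraints are visible in the model: `fpm_copy` takes a single subarray, and it always moves a full row.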
Pipelined Serial Mode (PSM). PSM accelerates copy operations between rows in different banks/subarrays. As shown in Figure 1, each DRAM chip uses a shared internal bus to transfer data between the bank and the memory channel (for both reads and writes). PSM exploits this fact to overlap the latency of the read and write operations involved in a copy. To implement PSM, we propose a new DRAM command called TRANSFER. TRANSFER is equivalent to appropriately overlapping READ to the source bank and WRITE to the destination bank. However, unlike READ or WRITE, TRANSFER does not transfer the data onto the memory channel, saving significant amounts of energy.
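The difference between a conventional copy and a TRANSFER-based copy can be sketched with a small counting model. This is an illustrative sketch only: the class `Chip`, the functions `baseline_copy` and `psm_copy`, and the choice of four cache lines per row are assumptions made here, and the model restricts itself to the inter-bank case for simplicity.

```python
CACHE_LINES_PER_ROW = 4  # toy value (assumption)


class Chip:
    """Toy DRAM chip: banks of rows, each row a list of cache lines."""

    def __init__(self, num_banks: int, rows_per_bank: int) -> None:
        self.banks = [
            [[0] * CACHE_LINES_PER_ROW for _ in range(rows_per_bank)]
            for _ in range(num_banks)
        ]
        self.channel_transfers = 0   # cache lines crossing the memory channel
        self.internal_transfers = 0  # cache lines moved on the internal bus only

    def read(self, bank: int, row: int, col: int) -> int:
        self.channel_transfers += 1
        return self.banks[bank][row][col]

    def write(self, bank: int, row: int, col: int, value: int) -> None:
        self.channel_transfers += 1
        self.banks[bank][row][col] = value

    def transfer(self, src_bank, src_row, dst_bank, dst_row, col) -> None:
        """Model of TRANSFER: one cache line moves bank to bank on-chip."""
        if src_bank == dst_bank:
            raise ValueError("this toy model only handles inter-bank copies")
        self.internal_transfers += 1
        self.banks[dst_bank][dst_row][col] = self.banks[src_bank][src_row][col]


def baseline_copy(chip: Chip, src: tuple, dst: tuple) -> None:
    """Copy a row by reading and writing each cache line over the channel."""
    for col in range(CACHE_LINES_PER_ROW):
        chip.write(dst[0], dst[1], col, chip.read(src[0], src[1], col))


def psm_copy(chip: Chip, src: tuple, dst: tuple) -> None:
    """Copy a row one cache line at a time using TRANSFER."""
    for col in range(CACHE_LINES_PER_ROW):
        chip.transfer(src[0], src[1], dst[0], dst[1], col)
```

In this model, `psm_copy` leaves `channel_transfers` untouched, which mirrors the claim that TRANSFER keeps data off the memory channel. The model does not capture timing or the overlap of read and write latencies.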
Bulk Data Initialization. For bulk initialization, RowClone initializes one row of the destination with the required data and then initializes the remaining rows by copying the data from the pre-initialized row using the appropriate bulk copy mechanism described above. For bulk zeroing (which happens frequently), our mechanism reserves a single row in each subarray, which is pre-initialized to zero. This enables the memory controller to use FPM to zero out any row in the system. We refer the reader to Section 3.4 of our MICRO 2013 paper [151] for more details on performing bulk data initialization with RowClone.
Latency and Energy Benefits
Table 1 shows the reduction in latency and energy consumption due to our mechanisms for different cases of 4KB copy and zeroing operations. To be fair to the baseline, the results include only the energy consumed by the DRAM and the DRAM channel. We draw two conclusions from our results. First, FPM significantly improves both the latency and the energy consumed by bulk data operations: an 11.6x and 6x reduction in the latency of 4KB copy and zeroing, respectively, and a 74.4x and 41.5x reduction in the memory energy of 4KB copy and zeroing, respectively. Second, although PSM does not provide as much benefit as FPM, it still reduces the latency and energy of a 4KB interbank copy by 1.9x and 3.2x, respectively, while providing a more generally applicable mechanism. As we show in Section 4, these latency and energy benefits translate to significant improvements in both overall system performance and energy efficiency.
End-to-End System Design
To fully extract the potential benefits of RowClone, changes are required to the ISA, processor microarchitecture, and the system software. First, we introduce two new instructions to the ISA, namely, memcopy and meminit, which enable the software to indicate occurrences of bulk data operations to the processor. Second, for each instance of the memcopy/meminit instruction, the processor microarchitecture determines if the operation can be partially/fully accelerated by RowClone and issues appropriate commands to the memory controller. While existing mechanisms to handle Direct Memory Access requests can be used to ensure cache coherence with RowClone, we also propose two simple mechanisms, called in-cache copy and clean zero cache line insertion, to further reduce memory bandwidth requirements and improve performance. We call this optimized version of RowClone, which includes in-cache copy and clean zero cache line insertion, RowClone-ZI. Third, to maximize the use of FPM, we make the system software aware of subarrays and the minimum granularity of copy (required by FPM). Section 4 of our MICRO 2013 paper [151] describes these changes in detail.
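To make the "partially/fully accelerated" decision concrete, the following sketch splits a memcopy request into row-sized pieces and picks a mechanism for each. Everything in it is a hypothetical illustration rather than the design in [151]: the address mapping in `locate`, the constants `ROW_BYTES`, `NUM_BANKS`, and `ROWS_PER_SUBARRAY`, and the rule that anything not covering a whole aligned row falls back to the processor are assumptions chosen to keep the example short.

```python
from typing import NamedTuple

ROW_BYTES = 4096          # assumed row size
NUM_BANKS = 8             # assumed bank count
ROWS_PER_SUBARRAY = 512   # assumed subarray height


class RowLocation(NamedTuple):
    bank: int
    subarray: int
    row: int


def locate(addr: int) -> RowLocation:
    """Hypothetical physical-address-to-DRAM mapping (rows interleaved across banks)."""
    row_index = addr // ROW_BYTES
    bank = row_index % NUM_BANKS
    row_in_bank = row_index // NUM_BANKS
    return RowLocation(bank, row_in_bank // ROWS_PER_SUBARRAY,
                       row_in_bank % ROWS_PER_SUBARRAY)


def plan_memcopy(src: int, dst: int, size: int) -> list:
    """Return a list of (mode, src_addr, dst_addr, nbytes) pieces.

    mode is "FPM" for a full row within one subarray, "PSM" for a full
    row that crosses subarrays or banks, and "CPU" for everything else.
    """
    plan = []
    offset = 0
    while offset < size:
        s, d = src + offset, dst + offset
        remaining = size - offset
        aligned = s % ROW_BYTES == 0 and d % ROW_BYTES == 0
        if aligned and remaining >= ROW_BYTES:
            sl, dl = locate(s), locate(d)
            same_subarray = (sl.bank, sl.subarray) == (dl.bank, dl.subarray)
            mode, chunk = ("FPM" if same_subarray else "PSM"), ROW_BYTES
        else:
            # Copy conventionally up to the nearest row boundary.
            to_boundary = min(ROW_BYTES - s % ROW_BYTES,
                              ROW_BYTES - d % ROW_BYTES)
            mode, chunk = "CPU", min(remaining, to_boundary)
        plan.append((mode, s, d, chunk))
        offset += chunk
    return plan
```

For example, under this assumed mapping a 4KB copy from address 0 to address 32768 stays in one subarray and is planned as a single FPM piece, while a copy to address 4096 lands in a different bank and is planned as PSM.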
Applications
RowClone can be used to accelerate any bulk copy and initialization operation to improve both system performance and energy efficiency. We quantitatively evaluate the efficacy of RowClone by using it to accelerate two primitives widely used by modern system software: 1) Copy-on-Write and 2) Bulk Zeroing. We first describe these primitives, and then discuss several applications that frequently trigger the primitives.
Primitives Accelerated by RowClone
Copy-on-Write (CoW) is a technique used by most modern operating systems (OS) to postpone an expensive copy operation until it is actually needed. When data of one virtual page needs to be copied to another, instead of creating a copy, the OS points both virtual pages to the same physical page (source) and marks the page as read-only. In the future, when one of the sharers attempts to write to the page, the OS allocates a new physical page (destination) for the writer and copies the contents of the source page to the newly allocated page. Fortunately, prior to allocating the destination page, the OS already knows the location of the source physical page. Therefore, it can ensure that the destination is allocated in the same subarray as the source, thereby enabling the processor to use FPM to perform the copy.
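The allocation policy described above can be sketched as follows. This is a simplified illustration, not an actual OS implementation: the class `SubarrayAwareAllocator`, the function `handle_cow_fault`, and the assumption that one physical page frame corresponds to one DRAM row and that consecutive frames share a subarray are all hypothetical.

```python
from collections import defaultdict


class SubarrayAwareAllocator:
    """Toy physical-frame allocator with one free list per subarray."""

    def __init__(self, frames_per_subarray: int, num_subarrays: int) -> None:
        self.frames_per_subarray = frames_per_subarray
        self.free = defaultdict(list)
        for frame in range(frames_per_subarray * num_subarrays):
            self.free[self.subarray_of(frame)].append(frame)

    def subarray_of(self, frame: int) -> int:
        # Assumed mapping: consecutive frames share a subarray.
        return frame // self.frames_per_subarray

    def alloc(self, preferred_subarray=None) -> int:
        """Allocate a frame, preferring the given subarray when possible."""
        if preferred_subarray is not None and self.free[preferred_subarray]:
            return self.free[preferred_subarray].pop()
        for frames in self.free.values():
            if frames:
                return frames.pop()
        raise MemoryError("out of physical frames")


def handle_cow_fault(allocator: SubarrayAwareAllocator, src_frame: int):
    """Pick a destination frame for a CoW copy and the copy mode it permits."""
    src_subarray = allocator.subarray_of(src_frame)
    dst_frame = allocator.alloc(preferred_subarray=src_subarray)
    same = allocator.subarray_of(dst_frame) == src_subarray
    return dst_frame, ("FPM" if same else "PSM")
```

When the source's subarray has no free frame, this sketch falls back to any free frame and reports PSM, reflecting that FPM requires source and destination in the same subarray.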
Bulk Zeroing (BuZ) is an operation where a large block of memory is zeroed out. Our mechanism maintains a reserved row that is fully initialized to zero in each subarray. For each row in the destination region to be zeroed out, the processor uses FPM to copy the data from the reserved zero-row of the corresponding subarray to the destination row.
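A minimal sketch of zeroing with a reserved zero row, under stated assumptions: subarrays are modeled as plain lists of rows, `ZERO_ROW` is a hypothetical index for the reserved row, and `fpm_copy` stands in for the in-DRAM row copy rather than modeling it.

```python
ROW_BYTES = 8   # toy row size (assumption)
ZERO_ROW = 0    # hypothetical index of the reserved all-zero row


def make_subarray(num_rows: int) -> list:
    """Create a subarray whose reserved row is zero and whose other rows hold stale data."""
    rows = [bytearray(b"\xff" * ROW_BYTES) for _ in range(num_rows)]
    rows[ZERO_ROW] = bytearray(ROW_BYTES)
    return rows


def fpm_copy(subarray: list, src_row: int, dst_row: int) -> None:
    """Stand-in for an in-subarray row copy (whole row, same subarray)."""
    subarray[dst_row][:] = subarray[src_row]


def bulk_zero(subarrays: list, targets: list) -> None:
    """Zero each (subarray_index, row) target from that subarray's zero row."""
    for subarray_index, row in targets:
        if row == ZERO_ROW:
            raise ValueError("the reserved zero row is not a valid target")
        fpm_copy(subarrays[subarray_index], ZERO_ROW, row)
```

Because every subarray carries its own zero row, each target row is zeroed with a copy that never leaves its subarray.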
Applications That Use CoW/BuZ
We now describe seven example applications or use-cases that extensively use the CoW or BuZ operations. Note that these are just a small number of example scenarios that incur a large number of copy and initialization operations. Some other applications and scenarios are provided in one of our more recent works [155]. Recent work from Google [68] shows that a considerable fraction of execution time is spent on memset and memcpy library calls in Google's data center workloads.
Process Forking. fork is a frequently-used system call in modern operating systems (OS). When a process (parent) calls fork, it creates a new process (child) with the exact same memory image and execution state as the parent. This semantics of fork makes it useful for different scenarios. Common uses of the fork system call are to 1) create new processes, and 2) create stateful threads from a single parent thread in multi-threaded programs. One main limitation of fork is that it results in a CoW operation whenever the child/parent updates a shared page. Hence, despite its wide usage, as a result of the large number of copy operations triggered by fork, it remains one of the most expensive system calls in terms of memory performance [150].
Initializing Large Data Structures. Initializing large data structures often triggers Bulk Zeroing. In fact, many managed languages (e.g., C#, Java, PHP) require zero initialization of variables to ensure memory safety [185]. In such cases, to reduce the overhead of zeroing, memory is zeroed-out in bulk.
Secure Deallocation. Most operating systems (e.g., Linux [18], Windows [148], Mac OS X [166]) zero out pages newly allocated to a process. This is done to prevent malicious processes from gaining access to the data that previously belonged to other processes or the kernel itself. Not doing so can potentially lead to security vulnerabilities, as shown by prior works [31, 41, 51, 52].
Process Checkpointing. Checkpointing is an operation during which a consistent version of a process state is backed up, so that the process can be restored from that state in the future. This checkpoint-restore primitive is useful in many cases including high-performance computing servers [15], software debugging with reduced overhead [168], hardware-level fault and bug tolerance mechanisms [33, 34, 105, 106, 107], and speculative OS optimizations to improve performance [24, 182]. However, to ensure that the checkpoint is consistent (i.e., the original process does not update data while the checkpointing is in progress), the pages of the process are marked with copy-on-write. As a result, checkpointing often results in a large number of CoW operations.
Virtual Machine Cloning/Deduplication. Virtual machine (VM) cloning [88] is a technique to significantly reduce the startup cost of VMs in a cloud computing server. Similarly, deduplication is a technique employed by modern hypervisors [180] to reduce the overall memory capacity requirements of VMs. With this technique, different VMs share physical pages that contain the same data. Similar to forking, both these operations likely result in a large number of CoW operations for pages shared across VMs [155].
Page Migration. Bank conflicts, i.e., concurrent requests to different rows within the same bank, typically result in reduced row buffer hit rate and hence degrade both system performance and energy efficiency [80]. Prior work [175] proposed techniques to mitigate bank conflicts using page migration. The PSM mode of RowClone can be used in conjunction with such techniques to 1) significantly reduce the migration latency and 2) make the migrations more energy-efficient.
CPU-GPU Communication. In many current and future processors, the GPU is or is expected to be integrated on the same chip with the CPU. Even in such systems where the CPU and GPU share the same off-chip memory, the off-chip memory is partitioned between the two devices. As a consequence, whenever a CPU program wants to offload some computation to the GPU, it has to copy all the necessary data from the CPU address space to the GPU address space [62]. When the GPU computation is finished, all the data needs to be copied back to the CPU address space. This copying involves a significant overhead. By spreading out the GPU address space over all subarrays and mapping the application data appropriately, RowClone can significantly speed up these copy operations. Note that communication between different processors and accelerators in a heterogeneous system-on-chip (SoC) is done similarly to the CPU-GPU communication and can also be accelerated by RowClone.
Results
In this section, we briefly summarize our evaluation of RowClone. We evaluate three configurations: Baseline, an unmodified main memory subsystem that cannot perform bulk data copy or initialization within memory; RowClone, which uses the FPM and PSM mechanisms described in Section 2.1; and RowClone-ZI, an optimized version of RowClone that includes the two optimizations discussed in Section 2.3. Section 6 of our MICRO 2013 paper [151] discusses our full evaluation methodology, including details on the simulator, system configuration, and benchmarks used for our evaluations.
Single-Core Evaluations
Figure 2 shows the performance improvement and reduction in DRAM energy consumption due to RowClone-ZI compared to the baseline for six copy- and initialization-intensive benchmarks. As we observe from the figure, these applications improve significantly with RowClone-ZI. Compared with Baseline, RowClone-ZI improves the IPC by up to 43%, while reducing DRAM energy consumption by up to 67%. Section 7 of our MICRO 2013 paper [151] provides more detailed single-core results, including (1) the individual performance of the FPM and PSM mechanisms using a fork benchmark (Section 7.2 of [151]); (2) a breakdown of memory traffic for each application into read, write, copy, and initialization operations (Section 7.3 of [151]); (3) the performance, 
Multi-Core Evaluations
As RowClone performs bulk data operations completely within DRAM, it significantly reduces the memory bandwidth consumed by these operations. As a result, RowClone can benefit other applications that are running concurrently on the same system, even if these applications do not perform bulk data operations themselves. We evaluate this benefit of RowClone by running our copy/initialization-intensive applications alongside memory-intensive applications from the SPEC CPU2006 benchmark suite [169] (i.e., those applications with last-level cache misses per kilo-instruction, or MPKI, greater than 1). Table 2 lists the set of applications used for our multi-programmed workloads.
Table 2: List of benchmarks used for multi-core evaluation. Reproduced from [151].
Copy/Initialization-intensive benchmarks: bootup, compile, forkbench, mcached, mysql, shell
Memory-intensive benchmarks from SPEC CPU2006: bzip2, gcc, mcf, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, gobmk, dealII, soplex, hmmer, sjeng, GemsFDTD, libquantum, h264ref, lbm, omnetpp, astar, wrf, sphinx3, xalancbmk
We generate multi-programmed workloads for two-core, four-core and eight-core systems. In each workload, half of the cores run copy/initialization-intensive benchmarks, while the remaining cores run memory-intensive SPEC benchmarks. Benchmarks from each category are chosen at random.
Figure 3 plots the performance improvement due to RowClone and RowClone-ZI for the 50 four-core workloads that we evaluate (sorted based on the performance improvement due to RowClone-ZI). Two conclusions are in order. First, although RowClone degrades performance of certain four-core workloads (with compile, mcached or mysql benchmarks), it significantly improves performance for all other workloads (by 10% across all workloads). Second, RowClone-ZI eliminates the performance degradation due to RowClone and consistently outperforms both the baseline and RowClone for all workloads (20% on average).
To provide more insight into the benefits of RowClone on multi-core systems, we classify our copy/initialization-intensive benchmarks into two categories: 1) moderately copy/initialization-intensive (compile, mcached, and mysql) and 2) highly copy/initialization-intensive (bootup, forkbench, and shell). Figure 4 shows the average improvement in weighted speedup for the different multi-core workloads, categorized based on the number of highly copy/initialization-intensive benchmarks. As the trends indicate, RowClone's performance improvement increases with increasing number of such benchmarks for all three multi-core systems, indicating the effectiveness of RowClone in accelerating bulk copy/initialization operations. We conclude that RowClone is an effective mechanism to improve system performance, energy efficiency and bandwidth efficiency of future, bandwidth-constrained multi-core systems.
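The weighted speedup metric used for Figure 4 is not spelled out in this summary. The sketch below assumes the commonly used definition, in which each application's IPC when sharing the system is divided by its IPC when running alone and the ratios are summed. The function names and the numbers in the usage note are illustrative only and are not results from [151].

```python
def weighted_speedup(ipc_shared: list, ipc_alone: list) -> float:
    """Sum over applications of IPC when sharing the system / IPC when alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one alone IPC per application")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


def weighted_speedup_improvement(ipc_shared_base: list,
                                 ipc_shared_new: list,
                                 ipc_alone: list) -> float:
    """Relative improvement of a new configuration over the baseline."""
    base = weighted_speedup(ipc_shared_base, ipc_alone)
    new = weighted_speedup(ipc_shared_new, ipc_alone)
    return new / base - 1.0
```

As a made-up example, two applications with alone IPCs of 2.0 and 1.0 that reach 1.0 and 0.5 in the baseline and 1.2 and 0.6 in a new configuration have weighted speedups of 1.0 and 1.2, an improvement of 20%.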
Related Work
To our knowledge, this is the first paper to propose a concrete mechanism to perform bulk data copy and initialization operations completely in DRAM. In this section, we discuss related work and qualitatively compare it to RowClone. Other treatments of related works can be found in [156, 158, 159].
Patents on Data Copy in DRAM. Several patents [3, 48, 113, 114] propose the abstract notion that the row buffer in DRAM can be used to copy data from one row to another. These patents have four major drawbacks. First, they do not provide any concrete mechanism to perform the copy operation. Second, while using the row buffer to copy data between two rows is possible only when the two rows are within the same subarray, these patents make no such distinction. Third, these patents do not discuss the support required from the other layers of the system to realize a working system. Fourth, these patents do not provide any concrete evaluation to show the benefits of performing copy operations in DRAM. In contrast, RowClone is more generally applicable, and our MICRO 2013 paper [151] discusses the concrete changes required to all layers of the system stack, from the DRAM architecture to the system software, to enable bulk data copy.
Offloading Copy/Initialization Operations. Prior works [66, 192] propose mechanisms to 1) offload bulk data copy/initialization operations to a separate engine; 2) reduce the impact of pipeline stalls (by waking up instructions dependent on a copy operation as soon as the necessary blocks are copied without waiting for the entire copy operation to complete); and 3) reduce cache pollution by using hints from software to decide whether to cache blocks involved in the copy or initialization. While Section 7.5 of our MICRO 2013 paper [151] shows the effectiveness of RowClone compared to offloading bulk data operations to a separate engine, techniques to reduce pipeline stalls and cache pollution [66] can be naturally combined with RowClone to further improve performance.
Low-cost Interlinked Sub-Arrays (LISA) [25] proposes to connect adjacent subarrays inside a DRAM bank using a set of isolation transistors. Using this structure, LISA proposes mechanisms to efficiently copy data across rows in different subarrays within the same bank. LISA and RowClone can be combined to perform all bulk copy and initialization operations efficiently inside DRAM. However, unlike LISA, RowClone does not require any changes to the DRAM array.
The Compute Cache [2] performs copy, zero, and bitwise operations completely inside the on-chip SRAM cache. Like RowClone, the Compute Cache exploits the fact that many cells are connected to the same bitline to efficiently perform these operations across cells connected to the same bitline. Again, depending on the location of the data, RowClone and Compute Cache can be combined to further improve system performance and efficiency.
Bulk Memory Initialization. Jarrod et al. [63] propose a mechanism for avoiding the memory access required to fetch uninitialized blocks on a store miss. They use a specialized cache to keep track of uninitialized regions of memory. RowClone can potentially be combined with this mechanism. While Jarrod et al.'s approach can be used to reduce bandwidth consumption for irregular initialization (initializing different pages with different values), RowClone can be used to push regular initialization (e.g., initializing multiple pages with the same values) to DRAM, thereby freeing up the CPU to perform other useful operations.
Yang et al. [185] propose to reduce the cost of zero initialization by 1) using non-temporal store instructions to avoid cache pollution, and 2) using idle cores/threads to perform zeroing ahead of time. While the proposed optimizations reduce the negative performance impact of zeroing, their mechanism does not reduce memory bandwidth consumption of the bulk zeroing operations. In contrast, RowClone significantly reduces the memory bandwidth consumption and the associated energy overhead.
Processing-in-Memory. Recent works propose mechanisms that exploit the internal organization and operation of DRAM [102, 153, 154], SRAM [2, 69], phase-change memory (PCM) [103], or memristors [162] to perform bulk bitwise Boolean algebra and/or simple arithmetic operations. One such mechanism, called Ambit [153, 154], uses a number of row copy and initialization operations to perform Boolean algebra using DRAM. Ambit makes use of RowClone to efficiently perform these row copy and initialization operations. Another mechanism, the Compute Cache [2], can perform copy and initialization operations within SRAM. Other mechanisms for in-memory Boolean algebra or arithmetic [69, 102, 103, 162] can be trivially used to perform data copy and initialization operations (e.g., a data copy can be performed by performing a bulk addition, where the row to be copied is added to a row of all zeroes).
Various prior works (e.g., [6, 7, 16, 17, 49, 55, 56, 76, 83, 110, 133, 135, 188]) have investigated mechanisms to add logic circuitry closer to memory to perform bandwidth-intensive computations (e.g., SIMD vector operations) more efficiently. The main limitation of such approaches is that adding logic to or near DRAM significantly increases the cost of main memory. In contrast, RowClone exploits the existing internal organization and operation of DRAM to perform bandwidth-intensive copy and initialization operations quickly and efficiently with low cost.
Other Methods for Lowering Memory Latency. There are many works that improve the performance of applications by reducing the overall memory access latency. These works enable more parallelism and bandwidth [4, 5, 27, 80, 97, 100, 153, 154, 181, 189, 193], exploit latency variation within DRAM [23, 26, 28, 96, 98, 99], reduce refresh counts [71, 72, 74, 75, 108, 109, 141, 178], enable better communication between the CPU and other devices through DRAM [100], leverage DRAM access patterns to reduce access latency [54, 165], reduce write-related latencies by better designing DRAM and DRAM control policies [30, 92, 152], reduce overall queuing latencies in DRAM by better scheduling memory requests [13, 14, 37, 45, 47, 57, 61, 67, 70, 78, 79, 93, 94, 95, 104, 115, 116, 117, 118, 125, 126, 130, 135, 146, 164, 171, 172, 173, 174, 177, 191], employ prefetching [12, 22, 35, 36, 40, 43, 44, 46, 93, 119, 120, 121, 122, 127, 129, 134, 167], perform memory/cache compression [1, 10, 11, 38, 39, 42, 136, 137, 138, 139, 140, 163, 179, 183, 190], or perform better caching [73, 142, 144, 160, 161]. RowClone is orthogonal to all of these approaches, and can be combined with any of them to achieve higher latency and energy benefits.
Significance
Our MICRO 2013 paper [151] proposes RowClone, a simple mechanism to export bulk copy and initialization operations to DRAM. In this section, we describe the novelty of our approach, the long-term impact of our proposed techniques, and new research directions triggered by our work.
Novelty
Prior works investigate mechanisms to add logic closer to memory to perform bandwidth-intensive operations more efficiently. Although this approach has the potential to be used for a wide range of applications, it has two shortcomings. First, adding logic to DRAM increases the cost of DRAM significantly. Second, this approach does not reduce the bandwidth requirement of simple bulk copy/initialization operations.
In contrast, our work is the first (to our knowledge) to propose mechanisms that exploit the internal organization and operation of DRAM to perform bandwidth-intensive copy and initialization operations quickly and efficiently in DRAM. The changes required by our mechanism in the DRAM chip are limited to the peripheral logic and are very modest, with a DRAM die area overhead of only 0.2%. With this small overhead, our mechanisms significantly reduce the latency, bandwidth, and energy consumed by bulk data operations.
Long-Term Impact
We believe four trends in current and future systems make our proposed solutions even more relevant. We discuss each trend, and how RowClone can be applied in the context of the trend.
Increasingly Limited Memory Bandwidth. Processor manufacturers are integrating more and more cores on a single chip, thereby significantly increasing the compute capability of the processing chip. However, due to (1) the high cost associated with increasing pin counts and (2) limitations in DRAM scalability, the available memory bandwidth is not expected to grow at the same rate [61, 64]. This makes mechanisms like RowClone, which significantly reduce the overall memory bandwidth utilization of the system, likely even more important in future systems.
Increasing Use of Hardware Accelerators. Many modern processors already integrate the GPU on the same die as the CPU. With emerging systems moving towards a system-on-chip (SoC) model, many components/accelerators (called agents) are integrated on the same die as the CPU, and share the off-chip memory [176, 177]. To reduce the complexity of managing these agents, each agent is given its own share of the physical address space, and agents typically communicate with each other by copying data in bulk across the individual device address spaces. By enabling faster bulk data copies, we expect RowClone to significantly reduce the communication latency between different agents without increasing the complexity of the system.
Increasing Use of Virtualization. Modern systems (especially data centers and cloud computers) are increasingly employing virtualization to improve the utilization, security, and availability of systems and services. As described in our MICRO 2013 paper [151], the use of techniques such as VM cloning and deduplication [88, 180] to reduce the memory capacity requirements will likely increase the number of copy operations and zeroing operations (to protect data across VMs). RowClone can improve the performance and energy efficiency of such systems by performing these copy/initialization operations efficiently.
Ease of Adoption. Given the low implementation complexity of RowClone, it can be easily adopted in existing systems. RowClone is not limited only to DDR DRAMs. It can be used with 3D-stacked DRAM technologies [97, 111] such as the Hybrid Memory Cube [58, 59] and High Bandwidth Memory [65], which are gaining increasing interest among researchers, DRAM manufacturers, and system designers [6, 7, 82].
New Research Directions
Our proposed approach to performing bulk data copy and initialization in DRAM inspires several important research directions (and hopefully many more that others will imagine). We describe a few of them below.
One important research question that our work raises is how can one redesign system software (e.g., operating system, hypervisors) and application software to take better advantage of RowClone? Existing systems assume that copies are expensive and hence trade off complexity for performance. However, with RowClone, it may be possible to design simpler yet high performance systems by rethinking software design in the presence of very fast bulk copy and initialization.
Our MICRO 2013 paper [151] proposes low-cost mechanisms to export bulk copy and initialization to DRAM. These are by no means the only bandwidth-intensive operations. There are other operations that unnecessarily move data between the main memory and the processor, which can be optimized using low-cost mechanisms. Therefore, another natural research question is what other bandwidth-intensive operations can be exported to main memory using low-cost mechanisms? We believe RowClone can inspire similar mechanisms for other such operations. For example, one of our recent works [157] proposes an efficient method to perform gather/scatter operations in DRAM. Another of our recent works proposes mechanisms to perform bulk bitwise operations in DRAM [153, 154], building upon and taking advantage of RowClone.
Recently, there has been increased interest in emerging non-volatile memory technologies (e.g., PCM [89, 90, 91, 143, 145, 184, 186, 187], STT-MRAM [29, 50, 84, 128], memristors [32, 170]). Given this trend, exploring the feasibility of extending RowClone to these new memory technologies is a relevant and important research direction. For example, two recent works [103, 162] use the principles discussed in RowClone to perform bulk Boolean algebra and arithmetic operations within emerging memories. Similarly, exploring the idea of RowClone in other storage/memory technologies, e.g., NAND flash memory [19, 20, 21], is promising.
Given that memory bandwidth is expected to become an even more scarce resource in future systems, answers to these research questions have the potential to greatly mitigate bandwidth contention, and, thus, significantly improve both the performance and energy efficiency of these systems.
Works Building on RowClone
RowClone has inspired a number of follow-up works that propose 1) new mechanisms to perform bulk operations inside various memory technologies (e.g., DRAM [25, 102], SRAM [2, 69], PCM [103], memristors [162]), and 2) mechanisms that exploit RowClone to speed up other operations (e.g., in-DRAM bulk bitwise operations [76, 153, 154]). A survey of related works is provided in [159].
One of our recent works, Ambit [153, 154], proposes a mechanism to perform bulk bitwise operations completely inside DRAM. Ambit operations involve a number of row copy and initialization operations. Ambit uses RowClone to perform these operations quickly and efficiently inside DRAM. In fact, RowClone is essential for Ambit to obtain the performance and energy efficiency improvements. Other recent works that perform bulk bitwise Boolean algebra and/or simple arithmetic operations [2, 8, 9, 69, 85, 86, 87, 101, 102, 103, 162] exploit the organization and operation of memory arrays, akin to RowClone, and can be used to perform bulk data copy and initialization operations.
Data movement is expected to become an even more critical problem in future systems. We believe RowClone can inspire other works that propose mechanisms to reduce data movement, thereby enabling higher system performance and energy efficiency.
Conclusion
Our MICRO 2013 paper [151] proposes RowClone, a mechanism that performs bulk data copy and initialization operations completely inside DRAM. RowClone consists of two mechanisms, Fast Parallel Mode and Pipelined Serial Mode, that are used to copy data using existing peripheral structures within DRAM, requiring no changes to the DRAM cell array. By enabling efficient bulk data copy and initialization, RowClone provides significant performance and DRAM energy improvements of between one and two orders of magnitude over existing systems.
RowClone is one of the first steps towards reducing unnecessary data movement between the processor and the main memory using a low-cost in-memory approach. Current trends in system design indicate that our approach will be more relevant to future, bandwidth-limited systems. We hope that our work triggers research that leads to 1) simpler and more efficient software design and 2) extensions of our approach to other operations and memory technologies, with the goal of continuing to greatly improve system performance and energy efficiency.
