LISA: Increasing Internal Connectivity in DRAM
for Fast Data Movement and Low Latency
Kevin K. Chang1,2 Prashant J. Nair3,4 Saugata Ghose2
Donghyuk Lee5,2 Moinuddin K. Qureshi4 Onur Mutlu6,2
1Facebook 2Carnegie Mellon University 3IBM Research
4Georgia Institute of Technology 5NVIDIA Research 6ETH Zürich
This paper summarizes the idea of Low-Cost Inter-Linked Subarrays (LISA), which was published in HPCA 2016 [10], and examines the work's significance and future potential. Our HPCA 2016 paper introduces a new DRAM design that enables fast and energy-efficient bulk data movement across subarrays in a DRAM chip. While bulk data movement is a key operation in many applications and operating systems, we observe that contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and energy consumption. Prior work proposes to avoid these high costs by exploiting the existing wide internal DRAM bandwidth for bulk data movement, but the limited connectivity of wires within DRAM allows fast data movement within only a single DRAM subarray. Each subarray is only a few megabytes in size, greatly restricting the range over which fast bulk data movement can happen within DRAM.
Our HPCA 2016 paper proposes a new DRAM substrate, Low-Cost Inter-Linked Subarrays (LISA), whose goal is to enable fast and efficient data movement across a large range of memory at low cost. LISA adds low-cost connections between adjacent subarrays. By using these connections to interconnect the existing internal wires (bitlines) of adjacent subarrays, LISA enables wide-bandwidth data transfer across multiple subarrays with little (only 0.8%) DRAM area overhead. As a DRAM substrate, LISA is versatile, enabling a variety of new applications. We describe and evaluate three such applications in detail: (1) fast inter-subarray bulk data copy, (2) in-DRAM caching using a DRAM architecture whose rows have heterogeneous access latencies, and (3) accelerated bitline precharging by linking multiple precharge units together. Our extensive evaluations show that each of LISA's three applications significantly improves performance and memory energy efficiency, and their combined benefit is higher than the benefit of each alone, on a variety of workloads and system configurations.
1. Introduction
Bulk data movement, the movement of thousands or mil-
lions of bytes between two memory locations, is a common
operation performed by an increasing number of real-world
applications (e.g., [6, 37, 57, 58, 74, 82, 85, 88, 89, 94, 99, 110]).
Therefore, it has been the target of several architectural opti-
mizations (e.g., [4,6,35,40,58,70,86,88,103,110]). In fact, bulk
data movement is important enough that modern commer-
cial processors are adding specialized support to improve its
performance, such as the ERMSB instruction recently added
to the x86 ISA [28].
In today’s systems, to perform a bulk data movement be-
tween two locations in memory, the data needs to go through
the processor even though both the source and destination are
within memory. To perform the movement, the data is rst
read out one cache line at a time from the source location in
memory into the processor caches, over a pin-limited o-chip
channel (typically 64 bits wide). Then, the data is written back
to memory, again one cache line at a time over the pin-limited
channel, into the destination location. By going through the
processor, this data movement incurs a signicant penalty in
terms of latency and energy consumption.
To address the inefficiencies of traversing the pin-limited
channel, a number of mechanisms have been proposed to
accelerate bulk data movement (e.g., [35, 63, 88, 110]). The
state-of-the-art mechanism, RowClone [88], performs data
movement completely within a DRAM chip, avoiding costly
data transfers over the pin-limited memory channel. How-
ever, its effectiveness is limited because RowClone can enable
fast data movement only when the source and destination are
within the same DRAM subarray. A DRAM chip is divided
into multiple banks (typically 8), each of which is further
split into many subarrays (16 to 64) [45], shown in Figure 1a,
to ensure reasonable read and write latencies at high den-
sity [8, 32, 33, 45, 101].1 Each subarray is a two-dimensional
array with hundreds of rows of DRAM cells, and contains
only a few megabytes of data (e.g., 4MB in a rank of eight
1Gb DDR3 DRAM chips with 32 subarrays per bank). While
two DRAM rows in the same subarray are connected via a
wide (e.g., 8K bits) bitline interface, rows in different subar-
rays are connected via only a narrow 64-bit data bus within
the DRAM chip (Figure 1a). Therefore, even for previously-
proposed in-DRAM data movement mechanisms such as Row-
Clone [88], inter-subarray bulk data movement incurs long
latency and high memory energy consumption even though
data does not move out of the DRAM chip.
1We refer the reader to our prior works [8, 9, 10, 11, 21, 22, 39, 41, 42, 43,
44, 45, 54, 55, 56, 57, 58, 60, 61, 75, 88, 89] for a detailed background on DRAM.
[Figure 1 compares the two designs: (a) RowClone [88], where subarrays within a bank share only a slow, narrow (64-bit) internal data bus, and (b) LISA, which links the wide (8K-bit) bitlines of neighboring subarrays through isolation transistors.]
Figure 1: Transferring data between subarrays using the internal data bus takes a long time in the state-of-the-art DRAM design, RowClone [88] (a). Our work, LISA, enables fast inter-subarray data movement with a low-cost substrate (b). Reproduced from [10].
While it is clear that fast inter-subarray data movement can have several applications that improve system performance and memory energy efficiency [6, 37, 55, 74, 82, 85, 88, 89, 110], there is currently no mechanism that performs such data movement quickly and efficiently. This is because no wide datapath exists today between subarrays within the same bank (i.e., the connectivity of subarrays is low in modern DRAM). Our goal is to design a low-cost DRAM substrate that enables fast and energy-efficient data movement across subarrays.
2. Low-Cost Inter-Linked Subarrays (LISA)
We make two key observations that allow us to improve the
connectivity of subarrays within each bank in modern DRAM.
First, accessing data in DRAM causes the transfer of an entire
row of DRAM cells to a buffer (i.e., the row buffer, where the
row data temporarily resides while it is read or written) via
the subarray’s bitlines. Each bitline connects a column of cells
to the row buffer, interconnecting every row within the same
subarray (Figure 1a). Therefore, the bitlines essentially serve
as a very wide bus that transfers a row’s worth of data (e.g.,
8K bits in a chip) at once. Second, subarrays within the same
bank are placed in close proximity to each other. Thus, the
bitlines of a subarray are very close to (but are not currently
connected to) the bitlines of neighboring subarrays (as shown
in Figure 1a).
Key Idea. Based on these two observations, we intro-
duce a new DRAM substrate, called Low-cost Inter-linked
SubArrays (LISA). LISA enables low-latency, high-bandwidth
inter-subarray connectivity by linking neighboring subarrays’
bitlines together with isolation transistors, as illustrated in Fig-
ure 1b. We use the new inter-subarray connection in LISA to
develop a new DRAM operation, row buffer movement (RBM), which moves data that is latched in an activated row buffer in one subarray into an inactive row buffer in another subarray, without having to send data through the narrow internal data bus in DRAM. RBM exploits the fact that the activated row buffer has enough drive strength to induce charge perturbation within the idle (i.e., precharged) bitlines of neighboring subarrays, allowing the destination row buffer to sense and
latch this data when the isolation transistors are enabled. We
describe the detailed operation of RBM in our HPCA 2016
paper [10].
By using a rigorous DRAM circuit model that conforms
to the JEDEC standards [32] and ITRS specifications [30, 31], we show that RBM performs row buffer movement at 26x
the bandwidth of a modern 64-bit DDR4-2400 memory chan-
nel (500 GB/s vs. 19.2 GB/s), even after we conservatively
add a large (60%) timing margin to account for process and
temperature variation.
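For intuition, the bandwidth comparison above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in this summary (the DDR4-2400 channel rate and the reported RBM bandwidth); it is an illustrative back-of-the-envelope check, not part of the circuit evaluation in [10].

```python
# Back-of-the-envelope check of the quoted bandwidth comparison.
channel_width_bytes = 8                 # 64-bit DDR4 channel
transfer_rate = 2400e6                  # DDR4-2400: 2400 MT/s
channel_bw = channel_width_bytes * transfer_rate   # bytes/s

rbm_bw = 500e9                          # RBM bandwidth reported in the text

print(channel_bw / 1e9)                 # -> 19.2 GB/s, matching the text
print(rbm_bw / channel_bw)              # -> ~26x, matching the claim
```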
Die Area Overhead. To evaluate the area overhead of
adding isolation transistors, we use area values from prior
work, which adds isolation transistors to disconnect bitlines
from sense amplifiers [73]. That work shows that adding an
isolation transistor to every bitline incurs a total of 0.8% die
area overhead in a 28nm DRAM process technology. Similar
to prior work that adds isolation transistors to DRAM [57,
73], our LISA substrate also requires additional control logic
outside the DRAM banks to control the isolation transistors,
which incurs a small amount of area and is non-intrusive to
the cell arrays.
3. Applications of LISA
We exploit LISA’s fast inter-subarray movement capability
to enable many applications that can improve system perfor-
mance and energy efficiency. In our HPCA 2016 paper [10], we implement and evaluate three applications of LISA, which significantly improve system performance in different ways.
3.1. Rapid Inter-Subarray Bulk Data Copying
(LISA-RISC)
Due to the narrow memory channel width, bulk copy oper-
ations used by applications and operating systems are perfor-
mance limiters in today’s systems [35, 37, 55, 88, 110]. These
operations are commonly invoked through the memcpy and memmove functions. Recent work reported that these two operations con-
sume 4-5% of all of Google’s data center cycles [37], making
them an important target for lightweight hardware accelera-
tion.
Our goal is to design a new mechanism that enables low-
latency and energy-efficient memory copy between rows in different subarrays within the same bank. To this end, we
propose a new in-DRAM copy mechanism that uses LISA to
exploit the high-bandwidth links between subarrays. The
key idea, step by step, is to: (1) activate a source row in a
subarray; (2) rapidly transfer the data in the activated source
row buffers to the destination subarray's row buffers, through LISA's RBM operation; and (3) activate the destination row, which enables the contents of the destination row buffers to be
latched into the destination row. We call this inter-subarray
row-to-row copy mechanism LISA-Rapid Inter-Subarray Copy
(LISA-RISC).
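A minimal sketch of how a memory controller might sequence these three steps is shown below. The MemoryController interface and the command names (activate, rbm, precharge) are hypothetical illustrations of the flow described above, not the actual command interface defined in the HPCA 2016 paper [10].

```python
# Hypothetical memory-controller sequence for LISA-RISC (illustrative only).
def lisa_risc_copy(mc, bank, src_subarray, src_row, dst_subarray, dst_row):
    """Copy one DRAM row from src to dst subarray within the same bank."""
    # (1) Activate the source row: its data is latched into the source
    #     subarray's row buffer via the bitlines.
    mc.activate(bank, src_subarray, src_row)

    # (2) Issue one RBM hop per subarray boundary toward the destination.
    #     Each hop enables the isolation transistors between two adjacent
    #     subarrays, so the active row buffer drives the neighbor's
    #     precharged bitlines and the neighbor's row buffer latches the data.
    step = 1 if dst_subarray > src_subarray else -1
    for sa in range(src_subarray, dst_subarray, step):
        mc.rbm(bank, sa, sa + step)

    # (3) Activate the destination row so the destination row buffer's
    #     contents are restored (written) into the destination cells.
    mc.activate(bank, dst_subarray, dst_row)

    # Close the bank when the copy is complete.
    mc.precharge(bank)
```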
3.1.1. DRAM Latency and Energy Consumption. Fig-
ure 2 shows the DRAM latency and DRAM energy consump-
tion of memcpy (i.e., the baseline system), RowClone [88] (state-
of-the-art work), and LISA-RISC for copying a row of data
(8KB). The exact latency and energy numbers are listed in
[Plot: DRAM energy (µJ) vs. latency (ns) for an 8KB copy, comparing memcpy, RowClone (IntraSA, InterBank, InterSA), and LISA-RISC at 1, 7, and 15 hops; LISA-RISC improves on both baselines.]
Figure 2: Latency and DRAM energy of 8KB copy. Reproduced from [10].
Copy Commands (8KB)            | Latency (ns)              | Energy (µJ)
memcpy (via mem. channel)      | 1366.25                   | 6.2
RC-InterSA / Bank / IntraSA    | 1363.75 / 701.25 / 83.75  | 4.33 / 2.08 / 0.06
LISA-RISC (1 / 7 / 15 hops)    | 148.5 / 196.5 / 260.5     | 0.09 / 0.12 / 0.17

Table 1: Copy latency and DRAM energy. Reproduced from [10].
Table 1. For LISA-RISC, we define a hop as the number of
subarrays that LISA-RISC needs to copy data across to move
the data from the source subarray to the destination subarray.
For example, if the source and destination subarrays are adja-
cent to each other, the number of hops is 1. The DRAM chips
we evaluate have 16 subarrays per bank, so the maximum
number of hops is 15.
We make two observations from these numbers. First, al-
though inter-subarray RowClone (RC-InterSA) incurs similar
latencies as memcpy, it consumes 1.43x less energy, as it does
not transfer data over the channel and DRAM I/O for each
copy operation. However, as we discuss in Section 4.1 of our
HPCA 2016 paper [10], RC-InterSA incurs a higher system
performance penalty because it is a blocking long-latency
memory command. Second, copying between subarrays us-
ing LISA reduces the copy latency by 9x and copy energy by
48x compared to RowClone, even though the total latency
of LISA-RISC grows linearly with the hop count. An additional benefit of using LISA-RISC is that its inter-subarray
copy operations are performed completely inside a bank. As
the internal DRAM data bus is untouched, other banks can
concurrently serve memory requests, exploiting bank-level
parallelism.
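The ratios quoted above follow directly from Table 1; a quick arithmetic check (using the RC-InterSA row and the 1-hop LISA-RISC figures) is shown below for reference.

```python
# Ratios derived from Table 1 (8KB copy).
memcpy_energy, rc_intersa_energy = 6.2, 4.33            # µJ
rc_intersa_latency, lisa_1hop_latency = 1363.75, 148.5  # ns
lisa_1hop_energy = 0.09                                  # µJ

print(memcpy_energy / rc_intersa_energy)       # ~1.43x less energy than memcpy
print(rc_intersa_latency / lisa_1hop_latency)  # ~9x lower copy latency
print(rc_intersa_energy / lisa_1hop_energy)    # ~48x lower copy energy
```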
3.1.2. Evaluation. We briefly summarize the system per-
formance improvement due to LISA-RISC on a quad-core
system. We evaluate our system using Ramulator [41, 83], an
open-source cycle-accurate DRAM simulator, driven by traces
generated from Pin [64]. Our workload evaluation results
show that LISA-RISC outperforms RowClone and memcpy:
its average performance improvement and energy reduction
over the best performing inter-subarray copy mechanism
(i.e., memcpy) are 66.2% and 55.4%, respectively, on a quad-
core system, across 50 workloads that perform bulk copies.
We refer the reader to Section 9 of our HPCA 2016 paper [10]
for detailed evaluation and analysis.
3.2. In-DRAM Caching Using
Heterogeneous Subarrays (LISA-VILLA)
Our second application aims to reduce the DRAM access la-
tency for frequently-accessed (hot) data. We propose to intro-
duce heterogeneity within a bank by designing heterogeneous-
latency subarrays. We call this heterogeneous DRAM de-
sign VarIabLe LAtency DRAM (VILLA-DRAM). To design
a low-cost fast subarray, we take an approach similar to
prior work, attaching fewer cells to each bitline to reduce
the parasitic capacitance and resistance. This reduces the la-
tency of the three fundamental DRAM operations (activation, precharge, and restoration) when accessing data in the fast subarrays [57, 67, 94]. Activation "opens" a row of DRAM cells to access stored data. Precharge "closes" an activated row. Restoration restores the charge level of each DRAM cell in a row to prevent data loss. Together, these three operations predominantly define the latency of a memory request [8, 9,
10,11,21,22,39,41,42,43,44,45,54,55,56,57,58,60,61,75,88,89].
In this work, we focus on managing the fast subarrays in
hardware, as doing so offers better adaptivity to dynamic
changes in the hot data set.
In order to take advantage of VILLA-DRAM, we rely on
LISA-RISC to rapidly copy rows across subarrays, which significantly reduces the caching latency. We call this synergistic
design, which builds VILLA-DRAM using LISA, LISA-VILLA.
Nonetheless, the cost of transferring data to a fast subarray
is still non-negligible, especially if the fast subarray is far
from the subarray where the data to be cached resides. There-
fore, an intelligent cost-aware mechanism is required to make
astute decisions on which data to cache and when.
3.2.1. Caching Policy for LISA-VILLA. We design a simple
epoch-based caching policy to evaluate the benefits of caching
a row in LISA-VILLA. Every epoch, we track the number of
accesses to rows by using a set of 1024 saturating counters
for each bank.2 The counter values are halved every epoch
to prevent staleness. At the end of an epoch, we mark the 16
most frequently-accessed rows as hot, and cache them when
they are accessed the next time. For our cache replacement
policy, we use the benefit-based caching policy proposed by Lee et al. [57]. Specifically, it uses a benefit counter for each row cached in the fast subarray: whenever a cached row is accessed, its counter is incremented. The row with the least benefit is replaced when a new row needs to be inserted. Note
that a large body of work proposes various caching policies
(e.g., [20, 23, 26, 34, 38, 59, 66, 78, 79, 87, 91, 100, 104, 106]), each
of which can potentially be used with LISA-VILLA.
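A minimal software sketch of this epoch-based policy is shown below. The class and parameter names are ours for illustration, the counter width is an assumption, and the sketch tracks rows in a Python dictionary rather than the fixed set of 1024 per-bank hardware counters used in [10].

```python
from collections import defaultdict

HOT_ROWS_PER_EPOCH = 16   # rows marked hot at the end of each epoch
COUNTER_MAX = 63          # assumed saturation value (6-bit counters)

class VillaCachePolicy:
    """Epoch-based hot-row tracking with benefit-based replacement (sketch)."""
    def __init__(self, cache_capacity):
        self.access_counters = defaultdict(int)  # row -> saturating access count
        self.hot_rows = set()                    # rows marked hot last epoch
        self.cache = {}                          # cached row -> benefit counter
        self.capacity = cache_capacity           # rows that fit in the fast subarray

    def on_access(self, row):
        # Track accesses with a saturating counter.
        self.access_counters[row] = min(self.access_counters[row] + 1, COUNTER_MAX)
        if row in self.cache:
            self.cache[row] += 1                 # benefit counter [57]
            return "hit"
        if row in self.hot_rows:
            self._insert(row)                    # cache a hot row on its next access
        return "miss"

    def _insert(self, row):
        if len(self.cache) >= self.capacity:
            victim = min(self.cache, key=self.cache.get)  # least-benefit row
            del self.cache[victim]
        self.cache[row] = 0

    def end_epoch(self):
        # Mark the most frequently accessed rows as hot; halve all counters
        # to prevent staleness.
        ranked = sorted(self.access_counters,
                        key=self.access_counters.get, reverse=True)
        self.hot_rows = set(ranked[:HOT_ROWS_PER_EPOCH])
        for row in self.access_counters:
            self.access_counters[row] //= 2
```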
3.2.2. Evaluation. Figure 3 shows the system performance
improvement of LISA-VILLA over a baseline without any fast
subarrays in a four-core system. It also shows the hit rate
in VILLA-DRAM, i.e., the fraction of accesses that hit in the
fast subarrays. We make two main observations. First, by
2The hardware cost of these counters is low, requiring only 6KB of stor-
age in the memory controller (see Section 7.1 of our HPCA 2016 paper [10]).
exploiting LISA-RISC to quickly cache data in VILLA-DRAM,
LISA-VILLA improves system performance for a wide variety
of workloads — by up to 16.1%, with a geometric mean of
5.1%. This is mainly due to reduced DRAM latency of accesses
that hit in the fast subarrays. The performance improvement
heavily correlates with the VILLA cache hit rate. Second, the
VILLA-DRAM design, which consists of heterogeneous subar-
rays, is not practical without LISA. Figure 3 shows that using
RC-InterSA (i.e., RowClone copying data across subarrays)
to move data into the cache reduces performance by 52.3%
due to slow data movement, which overshadows the benefits
of caching. The results indicate that LISA is an important
substrate to enable not only fast bulk data copy, but also a
fast in-DRAM caching scheme.
[Plot: normalized weighted speedup and VILLA cache hit rate (%) across 50 workloads for LISA-VILLA; a second panel compares the geometric-mean normalized weighted speedup of RC-InterSA and LISA-VILLA.]
Figure 3: Performance improvement and hit rate with LISA-VILLA, and performance comparison to using RC-InterSA with VILLA-DRAM. Reproduced from [10].
3.3. Fast Precharge Using Linked Precharge Units
(LISA-LIP)
Our third application aims to accelerate the process of
precharge. The precharge time for a subarray is determined
by the drive strength of the precharge unit (i.e., the circuitry in a subarray's row buffer that precharges the connected
subarray). We observe that in modern DRAM, while a subar-
ray is being precharged, the precharge units (PUs) of other
subarrays remain idle.
We propose to exploit these idle PUs to accelerate a
precharge operation by connecting them to the subarray that
is being precharged. Our mechanism, LISA-LInked Precharge
(LISA-LIP), precharges a subarray using two sets of PUs: one
from the row buffer that is being precharged, and a second set from a neighboring subarray's row buffer (which is already
in the precharged state), by enabling the links between the
two subarrays.
To evaluate the accelerated precharge process, we use the
same DRAM circuit model described in Section 2 and sim-
ulate the linked precharge operation in SPICE. Our SPICE
simulation reports that LISA-LIP significantly reduces the
precharge latency by 2.6x compared to the baseline (5ns vs.
13ns). Our system evaluation shows that LISA-LIP improves
performance by 10.3% on average, across 50 four-core work-
loads. We refer the reader to Section 6 of our HPCA 2016
paper [10] for a detailed analysis of LISA-LIP.
3.4. Evaluation: Putting Everything Together
As all three proposed applications are complementary to each other, we evaluate the effect of putting them
together on a four-core system. Figure 4 shows the system
performance improvement of adding LISA-VILLA to LISA-
RISC, as well as combining all three optimizations, compared
to our baseline using memcpy and standard DDR3-1600 mem-
ory across 50 workloads. We refer the reader to our full
paper [10] for the detailed configuration and workloads. We draw several key conclusions. First, the performance benefits
from each scheme are additive. On average, adding LISA-
VILLA improves performance by 16.5% over LISA-RISC alone,
and adding LISA-LIP further provides an 8.8% gain over LISA-
(RISC+VILLA). Second, although LISA-RISC alone provides a
majority of the performance improvement over the baseline
(59.6% on average), the use of both LISA-VILLA and LISA-LIP
further improves performance, resulting in an average per-
formance gain of 94.8% and memory energy reduction (not
plotted) of 49.0%. Taken together, these results indicate that
LISA is an effective substrate that enables a wide range of high-performance and energy-efficient applications in the
DRAM system.
[Plot: normalized weighted speedup (log2 scale) across 50 workloads for LISA-RISC, LISA-(RISC+VILLA), and LISA-(RISC+VILLA+LIP).]
Figure 4: Combined weighted speedup (WS) [14,93] improvement of LISA applications. Reproduced from [10].
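For reference, weighted speedup is the standard multiprogrammed-workload performance metric defined in [14, 93]: each application's IPC when running together with the other applications is normalized to its IPC when running alone, and the ratios are summed over the N applications in the workload:

\[
\mathrm{WS} = \sum_{i=1}^{N} \frac{\mathrm{IPC}_i^{\text{shared}}}{\mathrm{IPC}_i^{\text{alone}}}
\]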
We conclude that LISA is an effective substrate that can greatly improve system performance and reduce system energy consumption by synergistically enabling multiple different applications. Our HPCA 2016 paper [10] provides many more experimental results and analyses confirming this finding.
4. Related Work
To our knowledge, this is the first work to propose a DRAM
substrate that supports fast data movement between subar-
rays in the same bank, which enables a wide variety of appli-
cations for DRAM systems. We now discuss prior works that
focus on each of the optimizations that LISA enables.
4.1. Bulk Data Transfer Mechanisms
Prior works [7, 16, 17, 36, 108] propose to add scratchpad
memories to reduce CPU pressure during bulk data transfers,
which can also enable sophisticated data movement (e.g.,
scatter-gather [90]), but they still require data to first be
moved on-chip. A patent proposes a DRAM design that can
copy a page across memory blocks [84], but lacks concrete
analysis and evaluation of the underlying copy operations.
Intel I/O Acceleration Technology [27] allows for memory-to-
memory DMA transfers across a network, but cannot transfer
data within main memory.
Zhao et al. [110] propose to add a bulk data movement
engine inside the memory controller to speed up bulk-copy
operations. Jiang et al. [35] design a different copy engine,
placed within the cache controller, to alleviate pipeline and
cache stalls that occur when these transfers take place. How-
ever, these works do not directly address the problem of data
movement across the narrow memory channel.
A concurrent work by Lu et al. [63] proposes a het-
erogeneous DRAM design similar to VILLA-DRAM, called
DAS-DRAM, but with a very different data movement mech-
anism from LISA. It introduces a row of migration cells into
each subarray to move rows across subarrays. Unfortunately,
the latency of DAS-DRAM is not scalable with movement
distance, because it requires writing the migrating row into
each intermediate subarray’s migration cells before the row
reaches its destination, which prolongs data transfer latency.
In contrast, LISA provides a direct path to transfer data between the row buffers of adjacent subarrays, without requiring intermediate data writes into any subarray.
4.2. Cached DRAM
Several prior works (e.g., [20,23,26,38,109]) propose to add
a small SRAM cache to a DRAM chip to lower the access la-
tency for data that is kept in the SRAM cache (e.g., frequently
or recently used data). There are two main disadvantages of
these works. First, adding an SRAM cache into a DRAM chip
is very intrusive: it incurs a high area overhead (38.8% for
64KB in a 2Gb DRAM chip) and design complexity [45, 57].
Second, transferring data from DRAM to SRAM uses a nar-
row global data bus, internal to the DRAM chip, which is
typically 64 bits wide. Thus, installing data into the DRAM cache incurs high latency. Compared to these works, our LISA-VILLA design enables low latency without significant
area overhead or complexity.
4.3. Heterogeneous-Latency DRAM
Prior works propose DRAM architectures that provide het-
erogeneous latency either spatially (dependent on where in
the memory an access targets) or temporally (dependent on
when an access occurs).
Spatial Heterogeneity. Prior work introduces spatial het-
erogeneity into DRAM, where one region has a fast access
latency but fewer DRAM rows, while the other has a slower
access latency but many more rows [57, 94]. Recent works
show that latency heterogeneity inherent in DRAM chips
due to process or design-induced variation can also naturally
enable such heterogeneous-latency substrates [9, 54]. The
fast region in DRAM can be utilized as a caching area, for
the frequently or recently accessed data. We briefly describe two state-of-the-art works that offer different heterogeneous-
latency DRAM designs.
CHARM [94] introduces heterogeneity within a rank by
designing a few fast banks with (1) shorter bitlines for faster
data sensing, and (2) closer placement to the chip I/O for faster
data transfers. To exploit these low-latency banks, CHARM
uses an OS-managed mechanism to statically map hot data to
these banks, based on profiled information from the compiler
or programmers. Unfortunately, this approach cannot adapt
to program phase changes, limiting its performance gains.
If it were to adopt dynamic hot data management, CHARM
would incur high migration costs over the narrow 64-bit bus
that internally connects the fast and slow banks.
TL-DRAM [57] provides heterogeneity within a subarray
by dividing it into fast (near) and slow (far) segments that have
short and long bitlines, respectively, using isolation transis-
tors. The fast segment can be managed as an OS-transparent
hardware cache. The main disadvantage is that it needs to
cache each hot row in two near segments as each subarray
uses two row buffers on opposite ends to sense data in the open-bitline architecture (as discussed in our HPCA 2016 paper [10]). This prevents TL-DRAM from using the full near segment capacity. As we can see, neither CHARM nor TL-DRAM strikes a good design balance for heterogeneous-latency DRAM. Our proposal, LISA-VILLA, is a new heterogeneous DRAM design that offers fast data movement with a
low-cost and easy-to-implement design.
Temporal Heterogeneity. Prior work observes that
DRAM latency can vary depending on when an access oc-
curs. The key observation is that a recently-accessed or re-
freshed row has nearly full electrical charge in the cells, and
thus the following access to the same row can be performed
faster [21, 22, 92]. We briey describe two state-of-the-art
works that focus on providing heterogeneous latency tempo-
rally.
ChargeCache [22] enables faster access to recently-accessed
rows in DRAM by tracking the addresses of recently-accessed
rows. NUAT [92] enables accesses to recently-refreshed rows
at low latency because these rows are already highly-charged.
In contrast to ChargeCache and NUAT, LISA does not require
data to be recently-accessed/refreshed in order to reduce
DRAM latency. Adaptive-Latency DRAM (AL-DRAM) [56]
adapts the DRAM latency of each DRAM module to temper-
ature, observing that each module can be operated faster at
lower temperatures. LISA is orthogonal to AL-DRAM. The
ideas of LISA can be employed in conjunction with works
that exploit the temporal heterogeneity of DRAM latency.
4.4. Other Latency Reduction Mechanisms
Many prior works propose memory scheduling techniques,
which generally reduce latency to access DRAM [3, 13, 15,
29, 43, 44, 51, 52, 53, 68, 69, 71, 72, 96, 97, 98, 102]. Other works
propose mechanisms to perform in-memory computation to
reduce data movement and access latency [1, 2, 5, 6, 18, 24, 25,
40, 46, 62, 76, 77, 88, 89, 95, 107]. LISA is complementary to
these works, and it can work synergistically with in-memory
computation mechanisms by enabling fast aggregation of
data.
5. Significance
Our HPCA 2016 paper [10] proposes a new DRAM sub-
strate that significantly improves the performance and efficiency of bulk data movement in modern systems. In this section, we briefly discuss the expected future impact of our
work, and discuss several research directions that our work
motivates.
5.1. Potential Industry Impact
We believe that our LISA substrate can have a large impact
on mobile systems as well as data centers that consume a
significant amount of cycle time performing bulk data movement. A recent study [37] by Google reports that the memcpy() and memmove() library functions alone represent 4-5% of their data center cycles even though Google has a significant work-
load diversity running within their data centers. Another re-
cent study shows that 62.7% of system energy is spent on data
movement on consumer devices (e.g., smartphones, wearable
devices, web-based computers such as Chromebooks) [6].
In this work, we demonstrate that one potential applica-
tion of using the LISA substrate is to accelerate memcpy()
and memmove(), as discussed in Section 3.1. Our detailed
DRAM circuit model reports that LISA reduces the latency
and DRAM energy of these functions by 9x and 69x compared
to today’s systems, respectively. Hence, we expect LISA can
improve the efficiency and performance of both mobile and
data center systems.
5.2. Future Research Directions
This work opens up several avenues of future research
directions. In this section, we describe several directions that
can enable researchers to tackle other problems related to
memory systems based on the LISA substrate.
Reducing Subarray Conflicts via Remapping. When two memory requests access two different rows in the same bank, they have to be served serially, even if they are to different subarrays. To mitigate such bank conflicts, Kim et al. [45] propose subarray-level parallelism (SALP), which enables multiple subarrays to remain activated at the same time. However, if two accesses are to the same subarray, they still have to be served serially. This problem is exacerbated when frequently-accessed rows reside in the same subarray. To help alleviate such subarray conflicts, LISA can enable a simple mechanism that efficiently remaps or moves the conflicting rows to different subarrays by exploiting fast RBM operations.
Enabling LISA to Perform 1-to-N Memory Copy or
Move Operations. A typical memcpy or memmove call only
allows the data to be copied from one source location to one
destination location. To copy or move data from one source
location to multiple dierent destinations, repeated calls are
required. The problem is that such repeated calls incur long
latency and high bandwidth consumption. One potential
application that can be enabled by LISA is performing memcpy
or memmove from one source location to multiple destinations
completely in DRAM without requiring multiple calls of these
operations.
By using LISA, we observe that moving data from the
source subarray to the destination subarray latches the source
row's data in all the intermediate subarrays' row buffers. As a result, activating these intermediate subarrays would copy their row buffers' data into the specified row within these subarrays. By extending LISA to perform multi-point (1-to-N) copy or move operations, we can significantly increase the system performance of several commonly-used system operations. For example, forking multiple child processes can utilize 1-to-N copy operations to efficiently copy memory pages from the parent's address space to all the children. As another example, LISA can extend the range of in-DRAM bulk bitwise operations [85, 89]. Thus, LISA can efficiently enable architectural support for a new, useful system and programming primitive: 1-to-N bulk memory copy/movement.
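Building on the LISA-RISC sketch in Section 3.1, a hypothetical 1-to-N extension could look as follows. The controller interface is again an assumption for illustration, and the sketch simplifies by assuming all destinations lie on the RBM path beyond the source subarray.

```python
# Hypothetical 1-to-N in-DRAM copy using LISA (illustrative sketch).
def lisa_copy_1_to_n(mc, bank, src_subarray, src_row, destinations):
    """Copy one source row to several destination rows in the same bank.

    `destinations` is a list of (subarray, row) pairs, assumed (for this
    sketch) to have subarray indices >= src_subarray.
    """
    mc.activate(bank, src_subarray, src_row)

    # Sweep the row buffer contents across subarrays: each RBM hop latches
    # the source data in one more intermediate row buffer along the path.
    farthest = max(sa for sa, _ in destinations)
    for sa in range(src_subarray, farthest):
        mc.rbm(bank, sa, sa + 1)

    # Activating a row in any subarray whose row buffer now holds the data
    # restores (writes) that data into the selected row.
    for dst_subarray, dst_row in destinations:
        mc.activate(bank, dst_subarray, dst_row)

    mc.precharge(bank)
```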
In-Memory Computation with LISA. One important
requirement of efficient in-memory computation is being
able to move data from its stored location to the computation
units with very low latency and energy. We believe using
the LISA substrate can enable a new in-memory computation
framework. The idea is to add a small computation unit inside
each or a subset of banks, and connect these computation
units to the neighboring subarrays which store the data. Do-
ing so allows the system to utilize LISA to move bulk data
from the subarrays to the computation units with low latency
and low area overhead.
Extending LISA to Non-Volatile Memory. In this work,
we only focus on the DRAM technology. A class of emerging
memory technology is non-volatile memory (NVM), which
has the capability of retaining data without power supply. We
believe that the LISA substrate can be extended to NVM (e.g.,
PCM [48, 49, 50, 80, 81, 104, 105] and STT-MRAM [12, 19, 47])
since the memory organization of NVM mostly resembles
that of DRAM. A potential application of LISA in NVM is an
ecient le copy operation that does not incur costly I/O
data transfer. We believe LISA can provide further benets
when main memory becomes persistent [65].
6. Conclusion
We present a new DRAM substrate, low-cost inter-linked
subarrays (LISA), that expedites bulk data movement across
subarrays in DRAM. LISA achieves this by creating a new
high-bandwidth datapath at low cost between subarrays, via
the insertion of a small number of isolation transistors. We
describe and evaluate three applications that are enabled by
LISA. First, LISA significantly reduces the latency and mem-
ory energy consumption of bulk copy operations between
subarrays over state-of-the-art mechanisms [88]. Second,
LISA enables an effective in-DRAM caching scheme on a new
heterogeneous DRAM organization, which uses fast subar-
rays for caching hot data in every bank. Third, we reduce
precharge latency by connecting two precharge units of adja-
cent subarrays together using LISA. We experimentally show
that the three applications of LISA greatly improve system
performance and memory energy efficiency when used individually or together, across a variety of workloads and system configurations.
We conclude that LISA is an effective substrate that enables several effective applications. We believe that this substrate, which enables low-cost interconnections between DRAM subarrays, can pave the way for other applications that can further improve system performance and energy efficiency
through fast data movement in DRAM. We greatly encourage
future work to 1) investigate new applications and benets of
LISA, and 2) develop new low-cost interconnection substrates
within a DRAM chip to improve internal connectivity and
data transfer ability.
Acknowledgments
We thank the anonymous reviewers and SAFARI group
members for their helpful feedback. We acknowledge the
support of Google, Intel, NVIDIA, Samsung, and VMware.
This research was supported in part by the ISTC-CC, SRC,
CFAR, and NSF (grants 1212962, 1319587, and 1320531). Kevin
Chang was supported in part by the SRCEA/Intel Fellowship.
References
[1] J. Ahn et al., “A Scalable Processing-in-Memory Accelerator for Parallel Graph
Processing,” in ISCA, 2015.
[2] J. Ahn et al., “PIM-Enabled Instructions: A Low-Overhead, Locality-Aware
Processing-in-Memory Architecture,” in ISCA, 2015.
[3] R. Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Perfor-
mance and Scalability in Heterogeneous Systems,” in ISCA, 2012.
[4] S. Blagodurov et al., “A Case for NUMA-Aware Contention Management on Mul-
ticore Systems,” in USENIX ATC, 2011.
[5] A. Boroumand et al., “LazyPIM: An Efficient Cache Coherence Mechanism for
Processing-in-Memory,” CAL, 2016.
[6] A. Boroumand et al., “Google Workloads for Consumer Devices: Mitigating Data
Movement Bottlenecks,” in ASPLOS, 2018.
[7] J. Carter et al., “Impulse: Building a Smarter Memory Controller,” in HPCA, 1999.
[8] K. K. Chang et al., “Improving DRAM Performance by Parallelizing Refreshes
with Accesses,” in HPCA, 2014.
[9] K. K. Chang et al., “Understanding Latency Variation in Modern DRAM Chips:
Experimental Characterization, Analysis, and Optimization,” in SIGMETRICS,
2016.
[10] K. K. Chang et al., “Low-Cost Inter-Linked Subarrays (LISA): Enabling Fast Inter-
Subarray Data Movement in DRAM,” in HPCA, 2016.
[11] K. K. Chang et al., “Understanding Reduced-Voltage Operation in Modern DRAM
Devices: Experimental Characterization, Analysis, and Mechanisms,” in SIGMET-
RICS, 2017.
[12] M. T. Chang et al., “Technology Comparison for Large Last-Level Caches (L3Cs):
Low-Leakage SRAM, Low Write-Energy STT-RAM, and Refresh-Optimized
eDRAM,” in HPCA, 2013.
[13] E. Ebrahimi et al., “Parallel Application Memory Scheduling,” in MICRO, 2011.
[14] S. Eyerman and L. Eeckhout, “System-Level Performance Metrics for Multipro-
gram Workloads,” IEEE Micro, 2008.
[15] S. Ghose et al., “Improving Memory Scheduling via Processor-Side Load Critical-
ity Information,” in ISCA, 2013.
[16] M. Gschwind, “Chip Multiprocessing and the Cell Broadband Engine,” in CF,
2006.
[17] J. Gummaraju et al., “Architectural Support for the Stream Execution Model on
General-Purpose Processors,” in PACT, 2007.
[18] Q. Guo et al., “3D-Stacked Memory-Side Acceleration: Accelerator and System
Design,” in WONDP, 2014.
[19] X. Guo et al., “Resistive Computation: Avoiding the Power Wall with Low-
Leakage, STT-MRAM Based Computing,” in ISCA, 2010.
[20] C. A. Hart, “CDRAM in a Unified Memory Architecture,” in Intl. Computer Con-
ference, 1994.
[21] H. Hassan et al., “SoftMC: A Flexible and Practical Open-Source Infrastructure
for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[22] H. Hassan et al., “ChargeCache: Reducing DRAM Latency by Exploiting Row
Access Locality,” in HPCA, 2016.
[23] H. Hidaka et al., “The Cache DRAM Architecture,” IEEE Micro, 1990.
[24] K. Hsieh et al., “Transparent Offloading and Mapping (TOM): Enabling
Programmer-Transparent Near-Data Processing in GPU Systems,” in ISCA, 2016.
[25] K. Hsieh et al., “Accelerating pointer chasing in 3D-stacked memory: Challenges,
mechanisms, evaluation,” in ICCD, 2016.
[26] W.-C. Hsu and J. E. Smith, “Performance of Cached DRAM Organizations in
Vector Supercomputers,” in ISCA, 1993.
[27] Intel Corp., “Intel®I/O Acceleration Technology,” http://www.intel.com/content/
www/us/en/wireless-network/accel-technology.html.
[28] Intel Corp., “Intel 64 and IA-32 Architectures Optimization Reference Manual,”
2012.
[29] E. Ipek et al., “Self-Optimizing Memory Controllers: A Reinforcement Learning
Approach,” in ISCA, 2008.
[30] ITRS, http://www.itrs.net/ITRS1999-2014Mtgs,Presentations&Links/2013ITRS/
2013Tables/FEP_2013Tables.xlsx, 2013.
[31] ITRS, http://www.itrs.net/ITRS1999-2014Mtgs,Presentations&Links/2013ITRS/
2013Tables/Interconnect_2013Tables.xlsx, 2013.
[32] JEDEC, “DDR3 SDRAM Standard,” 2010.
[33] JEDEC, “DDR4 SDRAM Standard,” 2012.
[34] X. Jiang et al., “CHOP: Adaptive Filter-Based DRAM Caching for CMP Server
Platforms,” in HPCA, 2010.
[35] X. Jiang et al., “Architecture Support for Improving Bulk Memory Copying and
Initialization Performance,” in PACT, 2009.
[36] J. A. Kahle et al., “Introduction to the Cell Multiprocessor,” IBM JRD, 2005.
[37] S. Kanev et al., “Profiling a Warehouse-Scale Computer,” in ISCA, 2015.
[38] G. Kedem and R. P. Koganti, “WCDRAM: A Fully Associative Integrated Cached-
DRAM with Wide Cache Lines,” Duke Univ. Dept. of Computer Science, Tech.
Rep. CS-1997-03, 1997.
[39] J. S. Kim et al., “The DRAM Latency PUF: Quickly Evaluating Physical Unclon-
able Functions by Exploiting the Latency–Reliability Tradeoff in Modern DRAM
Devices,” in HPCA, 2018.
[40] J. S. Kim et al., “GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping
Using Processing-in-Memory Technologies,” BMC Genomics, 2018.
[41] Y. Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator,” CAL, 2015.
[42] Y. Kim et al., “Flipping Bits in Memory Without Accessing Them: An Experimen-
tal Study of DRAM Disturbance Errors,” in ISCA, 2014.
[43] Y. Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm
for Multiple Memory Controllers,” in HPCA, 2010.
[44] Y. Kim et al., “Thread Cluster Memory Scheduling: Exploiting Differences in
Memory Access Behavior,” in MICRO, 2010.
[45] Y. Kim et al., “A Case for Exploiting Subarray-Level Parallelism (SALP) in
DRAM,” in ISCA, 2012.
[46] P. M. Kogge, “EXECUBE-A New Architecture for Scaleable MPPs,” in ICPP, 1994.
[47] E. Kultursay et al., “Evaluating STT-RAM as an energy-efficient main memory
alternative,” in ISPASS, 2013.
[48] B. C. Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alter-
native,” in ISCA, 2009.
[49] B. C. Lee et al., “Phase Change Memory Architecture and the Quest for Scalabil-
ity,” CACM, vol. 53, no. 7, pp. 99–106, 2010.
[50] B. C. Lee et al., “Phase-Change Technology and the Future of Main Memory,”
IEEE Micro, vol. 30, no. 1, pp. 143–143, 2010.
[51] C. J. Lee et al., “Prefetch-Aware DRAM Controllers,” in MICRO, 2008.
[52] C. J. Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-
Caused Interference in Memory Systems,” Univ. of Texas at Austin, High Per-
formance Systems Group, Tech. Rep. TR-HPS-2010-002, 2010.
[53] C. J. Lee et al., “Improving Memory Bank-Level Parallelism in the Presence of
Prefetching,” in MICRO, 2009.
[54] D. Lee et al., “Design-Induced Latency Variation in Modern DRAM Chips: Char-
acterization, Analysis, and Latency Reduction Mechanisms,” in SIGMETRICS,
2017.
[55] D. Lee et al., “Decoupled Direct Memory Access: Isolating CPU and IO Traffic
by Leveraging a Dual-Data-Port DRAM,” in PACT, 2015.
[56] D. Lee et al., “Adaptive-Latency DRAM: Optimizing DRAM Timing for the
Common-Case,” in HPCA, 2015.
[57] D. Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Ar-
chitecture,” in HPCA, 2013.
[58] D. Lee et al., “Simultaneous Multi Layer Access: A High Bandwidth and Low
Cost 3D-Stacked Memory Interface,” TACO, 2016.
[59] Y. Li et al., “Utility-Based Hybrid Memory Management,” in CLUSTER, 2017.
[60] J. Liu et al., “An Experimental Study of Data Retention Behavior in Modern
DRAM Devices: Implications for Retention Time Profiling Mechanisms,” in ISCA,
2013.
[61] J. Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” in ISCA, 2012.
[62] Z. Liu et al., “Concurrent Data Structures for Near-Memory Computing,” in SPAA,
2017.
[63] S.-L. Lu et al., “Improving DRAM Latency with Dynamic Asymmetric Subarray,”
in MICRO, 2015.
[64] C.-K. Luk et al., “Pin: Building Customized Program Analysis Tools with Dy-
namic Instrumentation,” in PLDI, 2005.
[65] J. Meza et al., “A Case for Efficient Hardware/Software Cooperative Management
of Storage and Memory,” in WEED, 2013.
[66] J. Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-
Granularity DRAM Cache Management,” CAL, 2012.
[67] Micron Technology, Inc., “576Mb: x18, x36 RLDRAM3,” 2011.
[68] T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory
Service in Multi-core Systems,” in USENIX Security, 2007.
[69] J. Mukundan and J. F. Martinez, “MORSE: Multi-objective Reconfigurable Self-
Optimizing Memory Scheduler,” in HPCA, 2012.
[70] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” IMW, 2013.
[71] O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for
Chip Multiprocessors,” in MICRO, 2007.
[72] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing
Both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
[73] S. O et al., “Row-Buffer Decoupling: A Case for Low-Latency DRAM Microarchi-
tecture,” in ISCA, 2014.
[74] J. K. Ousterhout, “Why Aren’t Operating Systems Getting Faster as Fast as Hard-
ware?” in USENIX Summer Conf., 1990.
[75] M. Patel et al., “The Reach Profiler (REAPER): Enabling the Mitigation of DRAM
Retention Failures via Profiling at Aggressive Conditions,” in ISCA, 2017.
[76] D. Patterson et al., “A Case for Intelligent RAM,” IEEE Micro, 1997.
[77] A. Pattnaik et al., “Scheduling Techniques for GPU Architectures with
Processing-In-Memory Capabilities,” in PACT, 2016.
[78] M. Qureshi et al., “A Case for MLP-Aware Cache Replacement,” in ISCA, 2006.
[79] M. K. Qureshi et al., “Adaptive Insertion Policies for High-Performance Caching,”
in ISCA, 2007.
[80] M. K. Qureshi et al., “Enhancing Lifetime and Security of PCM-based Main Mem-
ory with Start-gap Wear Leveling,” in MICRO, 2009.
[81] M. K. Qureshi et al., “Scalable High Performance Main Memory System Using
Phase-change Memory Technology,” in ISCA, 2009.
[82] M. Rosenblum et al., “The Impact of Architectural Trends on Operating System
Performance,” in SOSP, 1995.
[83] SAFARI Research Group, “Ramulator – GitHub Repository,” https://github.com/
CMU-SAFARI/ramulator.
[84] S.-Y. Seo, “Methods of Copying a Page in a Memory Device and Methods of Man-
aging Pages in a Memory System,” U.S. Patent Application 20140185395, 2014.
[85] V. Seshadri et al., “Fast Bulk Bitwise AND and OR in DRAM,” CAL, 2015.
[86] V. Seshadri et al., “Page overlays: An enhanced virtual memory framework to
enable ne-grained memory management,” in ISCA, 2015.
[87] V. Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address
Both Cache Pollution and Thrashing,” in PACT, 2012.
[88] V. Seshadri et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data
Copy and Initialization,” in MICRO, 2013.
[89] V. Seshadri et al., “Ambit: In-Memory Accelerator for Bulk Bitwise Operations
Using Commodity DRAM Technology,” in MICRO, 2017.
[90] V. Seshadri et al., “Gather-Scatter DRAM: In-DRAM Address Translation to Im-
prove the Spatial Locality of Non-Unit Strided Accesses,” in MICRO, 2015.
[91] V. Seshadri et al., “Mitigating Prefetcher-Caused Pollution Using Informed
Caching Policies for Prefetched Blocks,” TACO, vol. 11, no. 4, pp. 51:1–51:22, 2015.
[92] W. Shin et al., “NUAT: A Non-Uniform Access Time Memory Controller,” in
HPCA, 2014.
[93] A. Snavely and D. Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multi-
threading Processor,” in ASPLOS, 2000.
[94] Y. H. Son et al., “Reducing Memory Access Latency with Asymmetric DRAM
Bank Organizations,” in ISCA, 2013.
[95] H. S. Stone, “A Logic-in-Memory Computer,” IEEE TC, 1970.
[96] L. Subramanian et al., “BLISS: Balancing Performance, Fairness and Complexity
in Memory Access Scheduling,” in IEEE TPDS, 2016.
[97] L. Subramanian et al., “The Blacklisting Memory Scheduler: Achieving High
Performance and Fairness at Low Cost,” in ICCD, 2014.
[98] L. Subramanian et al., “Mise: Providing performance predictability and improv-
ing fairness in shared main memory systems,” in HPCA, 2013.
[99] K. Sudan et al., “Micro-Pages: Increasing DRAM Efficiency with Locality-Aware
Data Placement,” in ASPLOS, 2010.
[100] G. Tyson et al., “A Modified Approach to Data Cache Management,” in MICRO,
1995.
[101] A. N. Udipi et al., “Rethinking DRAM Design and Organization for Energy-
Constrained Multi-Cores,” in ISCA, 2010.
[102] H. Usui et al., “DASH: Deadline-Aware High-Performance Memory Scheduler
for Heterogeneous Systems with Hardware Accelerators,” TACO, vol. 12, no. 4,
pp. 65:1–65:28, 2016.
[103] S. Wong et al., “A Hardware Cache memcpy Accelerator,” in FPT, 2006.
[104] H. Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memo-
ries,” in ICCD, 2012.
[105] H. Yoon et al., “Efficient Data Mapping and Buffering Techniques for Multilevel
Cell Phase-Change Memories,” TACO, vol. 11, no. 4, pp. 40:1–40:25, 2014.
[106] X. Yu et al., “Banshee: Bandwidth-Efficient DRAM Caching via Software/Hard-
ware Cooperation,” in MICRO, 2017.
[107] D. Zhang et al., “TOP-PIM: Throughput-Oriented Programmable Processing in
Memory,” in HPDC, 2014.
[108] L. Zhang et al., “The Impulse Memory Controller,” IEEE TC, vol. 50, no. 11, pp.
1117–1132, 2001.
[109] Z. Zhang et al., “Cached DRAM for ILP Processor Memory Access Latency Re-
duction,” IEEE Micro, vol. 21, no. 4, Jul. 2001.
[110] L. Zhao et al., “Hardware Support for Bulk Data Movement in Server Platforms,”
in ICCD, 2005.
