Exploiting the DRAM Microarchitecture to Increase Memory-Level
  Parallelism by Kim, Yoongu et al.
Exploiting the DRAM Microarchitecture
to Increase Memory-Level Parallelism
Yoongu Kim1 Vivek Seshadri2,1 Donghyuk Lee3,1 Jamie Liu4,1 Onur Mutlu5,1
1Carnegie Mellon University 2Microsoft Research India
3NVIDIA Research 4Google 5ETH Zürich
This paper summarizes the idea of Subarray-Level Parallelism (SALP) in
DRAM, which was published in ISCA 2012 [66], and examines the work’s
significance and future potential. Modern DRAMs have multiple banks to
serve multiple memory requests in parallel. However, when two requests go
to the same bank, they have to be served serially, exacerbating the high
latency of off-chip memory. Adding more banks to the system to mitigate
this problem incurs high system cost. Our goal in this work is to achieve
the benefits of increasing the number of banks with a low-cost approach.
To this end, we propose three new mechanisms, SALP-1, SALP-2, and MASA
(Multitude of Activated Subarrays), to reduce the serialization of
different requests that go to the same bank. The key observation exploited
by our mechanisms is that a modern DRAM bank is implemented as a
collection of subarrays that operate largely independently while sharing
few global peripheral structures.
Our three proposed mechanisms mitigate the negative impact of bank
serialization by overlapping different components of the bank access
latencies of multiple requests that go to different subarrays within the
same bank. SALP-1 requires no changes to the existing DRAM structure, and
only needs to reinterpret some of the existing DRAM timing parameters.
SALP-2 and MASA require only modest changes (< 0.15% area overhead) to the
DRAM peripheral structures, which are much less design constrained than
the DRAM core. Our evaluations show that SALP-1, SALP-2 and MASA
significantly improve performance for both single-core systems
(7%/13%/17%) and multi-core systems (15%/16%/20%), averaged across a wide
range of workloads. We also demonstrate that our mechanisms can be
combined with application-aware memory request scheduling in multi-core
systems to further improve performance and fairness.
Our proposed technique has enabled significant research in the use of
subarrays for various purposes (e.g., [15, 16, 21, 37, 76, 78, 84, 87,
128, 129, 130, 135, 156, 159]). SALP has also been described and evaluated
by a recent work by Samsung and Intel [54] as a promising mechanism to
tolerate long write latencies that are a result of aggressive DRAM
technology scaling.
1. Introduction
To be able to serve multiple memory requests in parallel, modern DRAM
chips employ multiple banks that can be accessed independently, providing
bank-level parallelism. Unfortunately, if two memory requests go to the
same bank, they have to be served one after another. This is called a
bank conflict. In the worst case, bank conflicts may delay a memory
request by hundreds or even thousands of nanoseconds [16, 37, 66, 129]. In
particular, bank conflicts cause three specific problems that degrade the
access latency, bandwidth utilization, and energy efficiency of the main
memory subsystem:
1. Serialization. Bank conflicts serialize requests that could
potentially have been served in parallel. Such serialization
exacerbates the already large latency of a memory access,
and may cause processor cores to stall for much longer.
2. Write Recovery. A request scheduled after a write request
to the same bank experiences an extra delay called
the write recovery penalty, which is the additional time
required to safely store new data in the cells. This write
recovery latency further aggravates the impact of serialization.
3. Row Buffer Thrashing. Each bank has a row buffer that
caches the last accessed row. A request that hits in the row
buffer is much cheaper in terms of both latency and energy
than a request that misses in the row buffer. However, bank
conflicts between requests that access different rows lead
to costly row buffer misses.
A naive solution to bank conflicts is to increase the number of banks.
Unfortunately, as we discuss in Section 1 of our ISCA 2012 paper [66],
simply adding more banks to the memory subsystem comes at significantly
high cost or reduced performance regardless of the way it is done: more
banks per chip, more ranks per channel, or more channels.¹
The goal in our ISCA 2012 paper [66] is to mitigate such detrimental
effects of bank conflicts in a cost-effective manner. Toward that end, we
make two key observations that lead to our proposed solutions.
Observation 1. A modern DRAM bank is not implemented as a monolithic
component equipped with only a single row buffer. Implementing a DRAM bank
in such a way requires very long internal wires (called bitlines) to
connect the row buffer to all the rows in the bank, which can
significantly increase the access latency. Instead, as Figure 1b shows, a
bank consists of multiple subarrays, each with its own local row buffer.
Subarrays within a bank share two important global structures: i) a global
row address decoder, and ii) a global row buffer.
¹ We refer the reader to our prior works [14, 15, 16, 17, 37, 38, 61, 62, 63,
64, 65, 66, 75, 76, 77, 78, 79, 80, 81, 115, 129, 130] for a detailed background on
DRAM.
arXiv:1805.01966v1 [cs.AR] 4 May 2018
[Figure 1 graphic. (a) Logical abstraction: a bank with a row-decoder, 32k rows, and a single row-buffer. (b) Physical implementation: Subarray1 through Subarray64, each with 512 rows and a local row-buffer, sharing a global decoder and a global row-buffer.]
Figure 1: DRAM bank organization. Adapted from [66].
Observation 2. The latency of a bank access predominantly consists of
three major components: i) loading a row into the local row buffer
(activation), ii) accessing the data from the local row buffer (read or
write), and iii) clearing the local row buffer (precharging) [14, 37, 38,
66, 76, 77, 78]. In existing DRAM banks, all three operations must be
completed for one request before serving another request to a different
row, even if the two rows reside in different subarrays. However, this
does not need to be the case for two reasons. First, activation and
precharging are mostly local to each subarray, which enables the
opportunity to overlap these operations when they are to different
subarrays. Second, if we reduce the sharing of the global structures among
subarrays, we can parallelize the concurrent activation of different
subarrays. Doing so would allow us to exploit the existence of multiple
local row buffers across the subarrays, enabling more than just a single
row to be cached for each bank and thereby increasing the row buffer hit
rate.
2. Subarray-Level Parallelism
Subarray-Oblivious Baseline. Let us consider the baseline example shown in
Figure 2, which presents a timeline of four memory requests being served
at the same bank in a subarray-oblivious manner. This example highlights
the three key problems that we discussed in Section 1. First, requests are
completely serialized, even though they are to different subarrays.
Second, although the write-recovery penalty is local to a subarray, it
delays a subsequent request to a different subarray. Third, a request to
one subarray unnecessarily evicts (i.e., precharges) the other subarray’s
local row buffer, which must be reloaded (i.e., activated) when a future
request accesses the evicted row. In this section, we describe how SALP-1,
SALP-2 and MASA can take advantage of the DRAM bank organization to enable
parallel DRAM operations in a cost-effective manner.
2.1. SALP-1: Subarray-Level-Parallelism-1
We observe that precharging and activation are mostly local to a subarray.
Based on this observation, we propose SALP-1, which overlaps the
precharging of one subarray with the activation of another subarray. In
contrast, existing systems always serialize precharging and activation to
the same bank, conservatively provisioning for when they are to the same
subarray. SALP-1 requires no modifications to the existing DRAM structure.
It only requires reinterpretation of an existing timing constraint (tRP)
and, potentially, the addition of a new timing constraint (which we
describe in Section 5.1 of our ISCA 2012 paper [66]). Figure 3 (top) shows
the timeline of the same four requests from Figure 2 when we use SALP-1
instead of our Baseline. As the timeline shows, overlapping the precharge
operation reduces the overall time needed to complete the four requests.
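As a rough illustration of this saving (a sketch under assumed values, not the timing model of [66]), the following Python snippet compares the delay between the PRECHARGE of the open row and the first column command to the new row. The timing values are placeholders chosen for the example, and `tCMD` is a hypothetical command gap that merely stands in for whatever new timing constraint separates a PRECHARGE from an ACTIVATE to a different subarray.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """Illustrative timing values in nanoseconds (assumed, not from the paper)."""
    tRCD: float = 15.0  # ACTIVATE -> first column command
    tRP: float = 15.0   # PRECHARGE -> next ACTIVATE to the same subarray
    tCMD: float = 1.0   # hypothetical gap between back-to-back commands


def conflict_penalty_baseline(t: Timing) -> float:
    """Subarray-oblivious bank: always wait tRP before the next ACTIVATE."""
    return t.tRP + t.tRCD


def conflict_penalty_salp1(t: Timing, same_subarray: bool) -> float:
    """SALP-1: tRP is only enforced when both rows share a subarray."""
    if same_subarray:
        return t.tRP + t.tRCD
    # Different subarrays: the ACTIVATE follows the PRECHARGE almost
    # immediately, so the precharge latency is hidden behind the activation.
    return t.tCMD + t.tRCD
```

With these placeholder values, a conflict between rows in different subarrays costs 16 ns under SALP-1 instead of 30 ns in the baseline, while a conflict within one subarray is unchanged.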
2.2. SALP-2: Subarray-Level-Parallelism-2
While SALP-1 pipelines the precharging and activation of different
subarrays, the relative ordering between the two commands is still
preserved. This is because existing DRAM banks do not allow two subarrays
to be activated at the same time. As a result, the write-recovery latency
of an activated subarray delays not only a PRECHARGE to itself, but also a
subsequent ACTIVATE to another subarray. Based on the observation that the
write-recovery latency is also local to a subarray, we propose SALP-2.
SALP-2 issues the ACTIVATE to another subarray before the PRECHARGE to the
currently-activated subarray. As a result, SALP-2 can overlap the write
recovery of the currently-activated subarray with the activation of
another subarray, further reducing the service time compared to SALP-1 (as
shown in the middle timeline of Figure 3).
However, as highlighted in the figure, SALP-2 requires two subarrays to
remain activated at the same time. This is not possible in existing DRAM
banks as the global row-address latch, which determines the wordline in
the bank that is raised, is shared by all of the subarrays. Section 5.2 of
our ISCA 2012 paper [66] discusses how to enable SALP-2 by eliminating
this sharing. The key idea is to push the global address latch to each
subarray, thereby creating local address latches, one per subarray.
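To make the effect of the command reordering concrete, the sketch below estimates the delay from the end of a write in one subarray to the first column command for a row in another subarray of the same bank. It is a simplified model under stated assumptions, not the timing analysis of [66]: all timing values are placeholders, `tCMD` is a hypothetical command gap, and for SALP-2 we assume the column command may be issued once the other subarray's PRECHARGE has been issued.

```python
def write_then_access_other_subarray(scheme, tWR=15.0, tRP=15.0,
                                     tRCD=15.0, tCMD=1.0):
    """Delay (ns) from the end of a write in subarray A to the first column
    command for a row in subarray B. All timing values are assumptions."""
    if scheme == "baseline":
        # PRECHARGE(A) waits for write recovery, ACTIVATE(B) waits for tRP.
        return tWR + tRP + tRCD
    if scheme == "SALP-1":
        # ACTIVATE(B) overlaps the precharge of A, but still follows
        # PRECHARGE(A), which itself must wait for write recovery.
        return tWR + tCMD + tRCD
    if scheme == "SALP-2":
        # ACTIVATE(B) is issued first and overlaps write recovery in A;
        # the column command waits for both tRCD and PRECHARGE(A).
        return max(tRCD, tWR + tCMD)
    raise ValueError(f"unknown scheme: {scheme}")
```

Under these assumed values the delay shrinks from 45 ns (baseline) to 31 ns (SALP-1) and 16 ns (SALP-2), which mirrors the ordering of the timelines described above.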
2.3. MASA: Multitude of Activated Subarrays
Although SALP-2 allows two subarrays within a bank to be activated, it
requires the controller to precharge one of them before issuing a column
command (e.g., READ) to the bank. This is because when a bank receives a
column command, all activated subarrays in the bank will connect their
local row buffers to the global bitlines. If more than one subarray is
activated, this will result in a short circuit. As a result, SALP-2 cannot
allow multiple subarrays to concurrently remain activated and serve column
commands.
To solve this, we propose MASA, whose key idea is to allow multiple
subarrays to be activated at the same time, while allowing the memory
controller to designate exactly one of the activated subarrays to drive
the global bitlines during the next column command. MASA has two
advantages over SALP-2. First, MASA overlaps the activation of different
subarrays within a bank. Just before issuing a column command to any of
the activated subarrays, the memory controller designates one particular
subarray whose row buffer should serve the column command. Second, MASA
eliminates extra ACTIVATEs to the same row, thereby mitigating row buffer
thrashing. This is because the local row buffers of multiple subarrays can
remain activated at the same time without experiencing collisions on the
global bitlines. As a result, MASA further improves performance compared
to SALP-2, as shown in the bottom timeline of Figure 3.
[Figure 2 graphic. Baseline timeline for Bank0: ACT/W/PRE commands (with tWR) to Row0@Subarray0 and ACT/R/PRE commands to Row512@Subarray1, annotated with ❶ Serialization, ❷ Write-Recovery, and ❸ Row-Buffer Thrashing.]
Figure 2: Timeline of four requests to two different rows in the same bank. Adapted from [66].
[Figure 3 graphic. Timelines for SALP-1 (top), SALP-2 (middle), and MASA (bottom).]
Figure 3: Timeline of four requests to two different rows in the same bank but different subarrays, using our mechanisms to
exploit subarray-level parallelism. Adapted from [66].
MASA: Overhead. To designate one of the multiple ac-
tivated subarrays, the controller needs a new command,
SA_SEL (subarray-select). In addition to the changes required
by SALP-2, MASA requires a single-bit latch per subarray to
denote whether a subarray is designated or not. According to
our detailed circuit-level analysis, MASA increases the DRAM
die size by only 0.15% (due to extra latches) and the static
power consumption by only ∼1% (each additional activated
subarray consumes 0.56 mW). Also, the memory controller
needs less than 256 bytes to track the status of subarrays
across all DRAM banks. We discuss a detailed implementa-
tion of MASA, along with its overhead, in Section 5.3 of our
ISCA 2012 paper [66].
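The bookkeeping implied by MASA can be sketched as a small Python function that counts the ACTIVATE and SA_SEL commands needed for a stream of row accesses to one bank. This is a toy model for illustration only: the function name and its parameters are hypothetical, the 512 rows per subarray follow Figure 1, and we assume that an ACTIVATE implicitly designates the subarray it targets.

```python
def count_activates(requests, rows_per_subarray=512, masa=False):
    """Return (ACTIVATEs, SA_SELs) for a list of row numbers in one bank."""
    open_rows = {}      # subarray index -> currently activated row
    designated = None   # subarray that would drive the global bitlines
    activates = sa_sels = 0
    for row in requests:
        sa = row // rows_per_subarray
        if open_rows.get(sa) != row:
            if not masa:
                # Subarray-oblivious bank: only one row may be open at a time.
                open_rows.clear()
            open_rows[sa] = row
            activates += 1
            designated = sa  # assumption: ACTIVATE also designates the subarray
        elif masa and designated != sa:
            # Row is still latched in its local row buffer: just re-designate.
            sa_sels += 1
            designated = sa
    return activates, sa_sels
```

For the four-request pattern of Figure 2 (rows 0, 512, 0, 512), this model issues four ACTIVATEs in the baseline but only two ACTIVATEs plus two SA_SELs with MASA, which is the row buffer thrashing that MASA avoids.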
3. Experimental Methodology
We evaluate our three mechanisms for subarray-level parallelism
using Ramulator [62, 124], an open-source cycle-accurate
DRAM simulator that we developed, which accurately
models DRAM subarrays. We use Ramulator as part of
a cycle-level in-house x86 multi-core simulator, whose front-
end is based on Pin [85]. We calculate DRAM dynamic energy
consumption by associating an energy cost with each DRAM
command, derived using Micron’s DDR3 DRAM tool [93],
Rambus’ DRAM power model [123], and previously published
data [150].
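The per-command energy accounting described above can be sketched in a few lines of Python. The energy values below are placeholders invented for the example (they are not taken from the Micron or Rambus models), and the function name is hypothetical.

```python
from collections import Counter

# Hypothetical per-command energy costs in nanojoules (placeholders only).
ENERGY_NJ = {"ACTIVATE": 20.0, "PRECHARGE": 10.0, "READ": 5.0, "WRITE": 5.0}


def dynamic_energy_nj(commands, cost=ENERGY_NJ):
    """Sum the energy of a command trace by weighting each command count."""
    counts = Counter(commands)
    return sum(cost[name] * n for name, n in counts.items())
```

In this toy table, a row buffer hit issues only the column command and costs 5 nJ, whereas a miss that also needs a PRECHARGE and an ACTIVATE costs 35 nJ.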
We evaluate SALP-1, SALP-2, and MASA on a wide variety of workloads [39,
41, 89, 146] and system configurations [45, 46, 134, 143]. The results
shown in Section 4 are based on the conservative assumption that a DRAM
bank exposes only 8 subarrays to be exploited by our subarray-level
parallelism mechanisms, whereas in practice the number of subarrays in
current DRAM banks is typically much higher (∼64). Section 9.2 of our
ISCA 2012 paper [66] shows that the performance improvement of our three
mechanisms over a subarray-oblivious baseline increases with a greater
number of subarrays.
For our full methodology, we refer the reader to Section 8
of our ISCA 2012 paper [66].
4. Evaluation
Figure 4 shows the performance improvement of SALP-1,
SALP-2, and MASA on a system with 8 subarrays-per-bank
over a subarray-oblivious baseline. The gure also shows
the performance improvement of an “Ideal” scheme which is
the subarray-oblivious baseline with 8 times as many banks
(this represents a system where all subarrays are fully inde-
pendent). The benchmarks are sorted along the x-axis by
increasing memory intensity. We make two observations
from the gure. First, SALP-1, SALP-2, and MASA consis-
3
tently perform better than the baseline for all benchmarks.
On average, they improve the average performance by 6.6%,
13.4%, and 16.7%, respectively. Second, MASA captures most
of the benets of “Ideal,” which improves performance by
19.6% compared to baseline.
The dierence in performance improvement across bench-
marks can be explained by a combination of three factors
related to the benchmarks’ individual memory access behav-
ior. First, subarray-level parallelism in general is most bene-
cial for memory-intensive benchmarks that frequently access
memory (e.g., the benchmarks located towards the right of
Figure 4). By increasing the memory throughput for such
applications, subarray-level parallelism signicantly allevi-
ates their memory bottleneck. The average memory intensity
of the applications that gain >5% performance with SALP-1
is 18.4 MPKI (last-level cache misses per kilo-instruction),
compared to 1.14 MPKI for the other applications.
Second, the advantage of SALP-2 is large for applications that are
write-intensive (i.e., those with the most write misses per
kilo-instruction, or WMPKI). For such applications, SALP-2 can overlap the
long write-recovery latency with the activation of a subsequent access. In
Figure 4, the three applications that improve more than 38% with SALP-2
are among both the most memory-intensive (>25 MPKI) and the most
write-intensive (>15 WMPKI).
Third, MASA is beneficial for applications that experience frequent bank
conflicts. For such applications, MASA parallelizes accesses to different
subarrays by concurrently activating multiple subarrays (ACTIVATE) and
allowing the application to switch between the activated subarrays at low
cost (SA_SEL). Therefore, the subarray-level parallelism offered by MASA
can be gauged by the SA_SEL-to-ACTIVATE ratio. For the nine applications
that benefit more than 30% from MASA, on average, one SA_SEL was issued for
every two ACTIVATEs, compared to one-in-seventeen for all other
applications. For a few benchmarks, MASA performs slightly worse than
SALP-2. This is because the baseline scheduling algorithm used with MASA
tries to overlap as many ACTIVATEs as possible, and in the process
inadvertently delays the column command of the most critical request. This
delay to the most critical request slightly degrades performance for these
benchmarks.²
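The SA_SEL-to-ACTIVATE ratio can be computed from a DRAM command trace as in the following minimal Python sketch. The function name and the representation of a trace as a list of command-name strings are assumptions made for illustration.

```python
def sa_sel_to_activate_ratio(commands):
    """Ratio of SA_SEL commands to ACTIVATE commands in a command trace."""
    activates = commands.count("ACTIVATE")
    if activates == 0:
        return 0.0  # no activations, so the ratio is defined as zero here
    return commands.count("SA_SEL") / activates
```

A ratio of 0.5 corresponds to one SA_SEL for every two ACTIVATEs, while one-in-seventeen corresponds to a ratio of roughly 0.06.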
Energy Eciency. We focus on the energy eciency of
MASA. MASA utilizes multiple local row buers across sub-
arrays and increases the chance that an access will hit in a
local row buer. Specically, MASA increases the row buer
hit rate by an average of 12.8% across 32 benchmarks. A row
buer hit not only has a lower access latency, but also con-
sumes less energy, since it does not require the power-hungry
operations of activation and, to a lesser degree, precharging.
Consequently, MASA reduces the dynamic energy consump-
tion by 18.6% as shown in Figure 5.
² For one benchmark, MASA performs slightly better than “Ideal” due
to interactions with the scheduler.
Our ISCA 2012 paper [66] provides a detailed evaluation
of SALP-1, SALP-2, and MASA, including:
• Sensitivity studies to (1) the number of channels (1–8),
ranks (1–8), banks (8–64), and subarrays per bank (1–128)
in the memory system; (2) the mapping policy (row-/line-interleaved);
and (3) an open-row or closed-row policy (Sections 9.2 and 9.3 of [66]).
• Multi-core results using an application-aware memory
scheduling algorithm, where we show significant performance
improvements (Section 9.3 of [66]).
• An analysis of the power and area overhead at both the
DRAM chip and the memory controller (Section 6 of [66]).
5. Related Work
To our knowledge, our ISCA 2012 paper [66] is the first to exploit the
existence of subarrays within a DRAM bank and enable their parallel
operation in a cost-effective manner. We propose three schemes that
exploit the existence of subarrays within DRAM banks to mitigate the
negative effects of bank conflicts. Related works propose increasing the
performance and energy efficiency of DRAM through approaches such as DRAM
module reorganization, changes to DRAM chip design, and memory controller
optimizations. We briefly discuss these works here.
DRAM Module Reorganization. Several prior works [3, 4, 151, 164] partition
a DRAM rank and the DRAM data bus into multiple rank subsets, each of
which can be operated independently. While these techniques increase
parallelism, they reduce the width of the data bus of each rank subset,
leading to longer latencies to transfer a 64-byte cache line. Furthermore,
having many rank subsets requires a correspondingly large number of DRAM
chips to compose a DRAM rank, an assumption that does not hold in mobile
DRAM systems where a rank may consist of as few as two chips [95]. Unlike
these works, our mechanisms increase memory-level parallelism [72, 100,
101, 105, 107, 108, 120] without increasing memory latency or the number
of DRAM chips.
Changes to DRAM Design. Cached DRAM organizations, which have been widely
proposed [25, 36, 40, 44, 56, 110, 125, 152, 161], augment DRAM chips with
an additional SRAM cache that can store recently accessed data in order to
reduce memory access latency. However, these proposals increase the chip
area and design complexity of DRAM designs. Furthermore, cached DRAM
provides parallelism only when accesses hit in the SRAM cache, while
serializing cache misses that access the same DRAM bank. Our schemes
parallelize DRAM bank accesses while incurring significantly lower area
and logic complexity.
Fujitsu’s FCRAM [126] and Micron’s RLDRAM [57] propose to implement
shorter local bitlines (i.e., fewer cells per bitline) that are quickly
drivable due to their lower capacitance in order to reduce DRAM latency.
However, this significantly increases the DRAM die size (30-40% for FCRAM,
40-80% for RLDRAM) because the large area of sense-amplifiers is amortized
over a smaller number of cells. Hybrid memory systems can reduce the die
size overhead by using a small amount of FCRAM [126] or RLDRAM [57] in
conjunction with conventional DRAM and managing which subset of the data
resides in FCRAM/RLDRAM at any given time to lower the latency of memory
accesses.
[Figure 4 graphic. Y-axis: IPC Improvement (0% to 80%). Series: SALP-1, SALP-2, MASA, "Ideal". Benchmark key: c (SPEC CPU2006), t (TPC), s (STREAM), random (random-access).]
Figure 4: IPC improvement of SALP-1, SALP-2, MASA, and an ideal mechanism over the subarray-oblivious baseline. Reproduced from [66].
[Figure 5 graphic. Y-axis: Normalized Dyn. Energy (0.0 to 1.2). Legend: Baseline, SALP.]
Figure 5: Dynamic DRAM energy consumption for MASA.
A patent by Qimonda [113] proposes the high-level notion of separately
addressable sub-banks, but lacks concrete mechanisms for exploiting the
independence between sub-banks. Yamauchi et al. propose the Hierarchical
Multi-Bank (HMB) [154], which parallelizes accesses to different subarrays
in a fine-grained manner. However, this scheme adds complex logic to all
subarrays.
Udipi et al. [147] propose two techniques (SBA and SSA) to lower DRAM
power. In SBA, global wordlines are segmented and controlled separately so
that tiles in the horizontal direction are not activated in lockstep, but
selectively. However, this increases DRAM chip area by 12-100% [147]. SSA
combines SBA with chip-granularity rank-subsetting to achieve even higher
energy savings. Both SBA and SSA increase DRAM latency, more significantly
so for SSA (due to rank-subsetting).
When transitioning from serving a write request to serving a read request,
and vice versa [18, 73, 137], a DRAM chip experiences bubbles in the data
bus, called the bus-turnaround penalty (tWTR and tRTW). Chatterjee et
al. [18] propose to use the bus-turnaround period to internally “prefetch”
data for subsequent read requests into extra registers that are added to
the DRAM chip.
Other works propose new DRAM designs that are capable of reducing the
memory latency of conventional DRAM [3, 4, 14, 16, 19, 36, 40, 44, 56, 75,
76, 77, 78, 79, 86, 94, 112, 118, 126, 133, 135, 151, 164] as well as
non-volatile memory [68, 69, 70, 71, 90, 91, 121, 122, 155]. Previous
works on bulk data transfer [13, 16, 33, 34, 47, 51, 53, 84, 127, 129,
158, 163] and in-memory computation [1, 2, 5, 9, 11, 12, 23, 26, 27, 28,
29, 30, 32, 35, 42, 43, 55, 60, 67, 88, 114, 116, 117, 119, 128, 130, 131,
132, 136, 144, 157] can be used to improve DRAM bandwidth utilization and
lower the number of costly data movements between CPU cores and DRAM. All
these works can benefit from SALP as the underlying memory substrate.
Memory Controller Optimizations. To reduce bank conflicts and increase row
buffer locality, Zhang et al. [160] propose to randomize the bank address
of memory requests by XOR hashing. Sudan et al. [142] propose to improve
row buffer locality by placing frequently-referenced data from different
rows together in the same row buffer. Both proposals can be combined with
our mechanisms to further improve parallelism and row buffer locality.
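As a generic illustration of XOR-based bank hashing (a minimal sketch in the spirit of this idea, not necessarily the exact function used in [160]), the bank index can be permuted by XOR-ing it with low-order row address bits. The function name and the 3-bit bank index are assumptions made for the example.

```python
def xor_bank_index(row, bank, bank_bits=3):
    """Permute the bank index by XOR-ing it with the low-order row bits."""
    mask = (1 << bank_bits) - 1
    return (bank ^ row) & mask
```

With this mapping, rows 0 through 7 that would otherwise all fall into bank 0 are spread across eight distinct banks, so consecutive rows no longer conflict in the same bank.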
Prior works propose memory scheduling algorithms for CPUs (e.g., [24, 31,
48, 58, 59, 64, 65, 72, 73, 74, 82, 96, 97, 98, 99, 106, 107, 111, 137,
138, 139, 140, 141, 153, 162]), GPUs (e.g., [7, 8, 20, 50, 52]), and other
systems (e.g., [148, 149, 162]) that prioritize certain favorable requests
in the memory controller to improve system performance and/or fairness.
Subarrays expose more parallelism to the memory controller, increasing the
controller’s flexibility to schedule requests. Our subarray-level
parallelism mechanisms can be combined with many of these schedulers to
provide increased performance benefits. Enabling higher benefit from SALP
by designing SALP-aware memory scheduling algorithms is a promising open
research topic.
6. Signicance and Long-Term Impact
We believe SALP will have long-term impact because: i) it
tackles a critical problem, bank conicts and memory paral-
lelism, whose importance will increase in the future; and ii)
the memory substrate it provides can further be leveraged to
enable other novel optimizations in the memory subsystem.
In fact, as Section 6.2 shows, there has been a signicant
amount of work that built upon our ISCA 2012 paper in the
past six years.
6.1. Trends and Opportunities in Favor of SALP
Worsening Bank Conflicts. Future many-core systems with large numbers of
cores and accelerators (e.g., bandwidth-hungry GPUs) will exert an
increasingly large amount of pressure on the memory subsystem. On the
other hand, naively adding more DRAM banks is difficult without incurring
high costs, high energy or reduced performance. Therefore, as more and
more memory requests contend to access a limited number of banks, bank
conflicts will occur with increasing likelihood and severity. SALP is a
cost-effective mechanism to alleviate the bank conflict problem by
exploiting the existing subarrays in DRAM at low cost.
Challenges in DRAM Scaling. DRAM process scaling is becoming more
difficult due to increased manufacturing complexity/cost and reduced cell
reliability [6, 49, 54, 63, 102, 103, 104, 109]. As a result, it is
critical to examine alternative ways of improving memory performance while
still maintaining low cost. SALP is a new cost-effective DRAM design whose
advantages are mostly orthogonal to the advantages of DRAM process
scaling. Therefore, SALP can further improve the performance and the
energy efficiency of future DRAM. In fact, a recent industry proposal to
enhance the DDR standard incorporates one of our SALP mechanisms [54].
This work by Samsung and Intel quantitatively shows that SALP is an
effective mechanism to tolerate increasing write latencies in DRAM,
corroborating the results in our ISCA 2012 paper on SALP-2.
A Building Block for New Optimizations. SALP enables new DRAM
optimizations that were not possible before. We discuss three potential
examples. First, exploiting subarray-level parallelism can potentially
mitigate DRAM unavailability during refresh by parallelizing refreshes in
one subarray with accesses to another subarray within the same bank. Work
by Chang et al. [15], which builds on our ISCA 2012 paper, shows that such
parallelization can eliminate most of the performance overhead of refresh.
Second, subarrays provide an additional degree of freedom in mapping the
physical address space onto different levels of the DRAM hierarchy
(channels, ranks, banks, subarrays, rows, columns). Thus, they enable more
flexibility in performance and energy optimization via data mapping.
Third, DRAM can be divided among different applications (to provide
quality-of-service) at the finer-grained partitions of subarrays that are
less vulnerable to capacity and bandwidth fragmentation. As we discuss,
some research has explored these approaches (also see Section 6.2). We
expect even more future research will tap into these and other
opportunities that can use our proposed SALP substrate as a building block
for other optimizations.
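To illustrate the second point, the sketch below decodes a cache-line address into the levels of the DRAM hierarchy, with the subarray index exposed as its own field. The bit layout is entirely hypothetical and chosen only for the example (8 subarrays per bank and 512 rows per subarray, matching numbers used elsewhere in this paper); a real mapping policy would choose the field order to balance locality and parallelism.

```python
# (field name, bit width), listed from least- to most-significant bits.
# This layout is an assumption for illustration, not a mapping from [66].
LAYOUT = [
    ("column", 7),
    ("channel", 1),
    ("bank", 3),
    ("subarray", 3),  # 8 subarrays per bank
    ("rank", 1),
    ("row", 9),       # 512 rows within each subarray
]


def decode(line_addr, layout=LAYOUT):
    """Split a cache-line address into DRAM hierarchy fields."""
    fields = {}
    for name, width in layout:
        fields[name] = line_addr & ((1 << width) - 1)
        line_addr >>= width
    return fields
```

Placing the subarray bits just above the bank bits, as in this layout, means that addresses that would conflict in one bank of a subarray-oblivious system are first spread across that bank's subarrays; moving the field elsewhere in `LAYOUT` changes that trade-off.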
Widely Applicable Substrate. SALP is a general-purpose substrate that is
also applicable to embedded DRAM (eDRAM) [10] and 3D die-stacked DRAM
(3D-DRAM), both of which consist of subarrays [75, 83]. For example, eDRAM
is known to be vulnerable to the write-recovery penalty [22], since it is
typically used as the last-level cache and thus exposed to higher amounts
of write traffic. SALP can increase the availability of eDRAM by hiding the
write-recovery penalty. In addition, SALP may be applied to future
emerging memory technologies as long as their banks are organized
hierarchically [69, 92], similar to how a DRAM bank consists of subarrays.
New Research Opportunities. SALP creates new opportunities for exploiting
and enhancing the parallelism and the locality of the memory subsystem.
• Enhancing Memory-Level Parallelism. To tolerate the long
latency of DRAM, computer architects often design mechanisms
that perform multiple memory requests in a concurrent
manner [72, 100, 101, 105, 107, 108, 120, 145]. Such
efforts may become ineffective when requests access the
same DRAM bank and, as a consequence, are not actually
served in parallel [107]. SALP, on the other hand, parallelizes
requests to different subarrays within the same bank.
In this regard, we believe SALP not only enhances previous
approaches to memory-level parallelism, but also creates
opportunities for developing new techniques that preserve
memory-level parallelism in a subarray-aware manner.
• Enhancing Memory Locality. Memory access patterns that
exhibit high locality benefit greatly from a DRAM bank’s
row buffer where the last accessed row is cached (4–8 kB).
While a DRAM bank has multiple row buffers across multiple
subarrays, an existing DRAM system exposes only one
row buffer at a time in a bank and, as a result, is prone to
row buffer thrashing. In contrast, SALP allows a DRAM
bank to utilize multiple row buffers concurrently. This
enables the opportunity for new techniques that can take
advantage of the multiple row buffers, whether they be
for streaming/strided accesses (demand or prefetch), vector
processing, or GPUs.
6.2. Works Building on SALP
The introduction of the notion of subarrays and their microarchitecture
has enabled the use of subarrays in many works. These include
RowClone [129], TL-DRAM [78], DSARP [15], DIVA-DRAM [76], LISA [16],
ChargeCache [37], Multiple Clone Row DRAM [21], Ambit [128, 130],
ERUCA [87], and other works on improving DRAM [84, 135, 156, 159]. Some of
these works exploit subarray-level parallelism, e.g., DSARP [15] reduces
the overhead of a DRAM refresh by decoupling independent subarrays from
the subarray that is being refreshed. This decoupling allows DRAM to
service memory accesses while a subarray is being refreshed. Others make
changes to subarrays to improve a particular aspect of DRAM, e.g.,
TL-DRAM [78] creates two different latency regions in a subarray to
improve DRAM latency at low cost.
7. Conclusion
Our ISCA 2012 paper [66] introduces three new mechanisms that exploit the
existence of subarrays within a DRAM bank to mitigate the performance
impact of bank conflicts. Our mechanisms are built on the key observation
that subarrays within a DRAM bank operate largely independently and have
their own row buffers. Hence, the latencies of accesses to different
subarrays within the same bank can potentially be overlapped to a large
degree. Our three mechanisms take advantage of this fact and progressively
increase the independence of operation of subarrays by making small
modifications to the DRAM chip. Our most sophisticated scheme, MASA,
enables i) multiple subarrays to be accessed in parallel, and ii) multiple
row buffers to remain activated at the same time in different subarrays,
thereby improving both memory-level parallelism and row buffer locality.
We show that our schemes significantly improve system performance on both
single-core and multi-core systems on a variety of workloads while
incurring little (<0.15%) or no area overhead in the DRAM chip. Our
techniques can also improve memory energy efficiency.
We conclude that exploiting subarray-level parallelism in a DRAM bank can
be a promising and cost-effective method for overcoming the negative
effects of DRAM bank conflicts, without paying the large cost of increasing
the number of banks in the DRAM system. Significant recent work has built
upon our ISCA 2012 paper, and we expect many other new works can exploit
the new substrate we have enabled to achieve even bigger goals and higher
benefits.
Acknowledgments
We thank Rachata Ausavarungnirun and Saugata Ghose
for their dedicated eort in the preparation of this article.
Many thanks to Uksong Kang, Hak-soo Yu, Churoo Park,
Jung-Bae Lee, and Joo Sun Choi from Samsung for their help-
ful comments. We thank the anonymous reviewers for their
feedback. We gratefully acknowledge members of the SA-
FARI group for feedback and for the stimulating intellectual
environment they provide. We acknowledge the generous
support of AMD, Intel, Oracle, and Samsung. This research
was also partially supported by grants from NSF (CAREER
Award CCF-0953246), GSRC, and Intel ARO Memory Hierarchy
Program.
References
[1] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable Processing-in-memory
Accelerator for Parallel Graph Processing,” in ISCA, 2015.
[2] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled Instructions: A Low-
overhead, Locality-aware Processing-in-memory Architecture,” in ISCA, 2015.
[3] J. H. Ahn, J. Leverich, R. Schreiber, and N. P. Jouppi, “Multicore DIMM: an Energy
Ecient Memory Module with Independently Controlled DRAMs,” IEEE CAL,
2009.
[4] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Improving
System Energy Efficiency with Memory Rank Subsetting,” TACO, 2012.
[5] B. Akin, F. Franchetti, and J. C. Hoe, “Data Reorganization in Memory Using
3D-stacked DRAM,” in ISCA, 2015.
[6] G. Atwood, “Current and Emerging Memory Technology Landscape (Micron),”
in Flash Memory Summit, 2011.
[7] R. Ausavarungnirun, K. K. Chang, L. Subramanian, G. Loh, and O. Mutlu, “Staged
Memory Scheduling: Achieving High Performance and Scalability in Heteroge-
neous Systems,” in ISCA, 2012.
[8] R. Ausavarungnirun, S. Ghose, O. Kayıran, G. H. Loh, C. R. Das, M. T. Kandemir,
and O. Mutlu, “Exploiting Inter-Warp Heterogeneity to Improve GPGPU Perfor-
mance,” in PACT, 2015.
[9] O. O. Babarinsa and S. Idreos, “JAFAR: Near-Data Processing for Databases,” in
SIGMOD, 2015.
[10] J. Barth, D. Plass, E. Nelson, C. Hwang, G. Fredeman, M. Sperling, A. Mathews,
T. Kirihata, W. R. Reohr, K. Nair, and N. Caon, “A 45 nm SOI Embedded DRAM
Macro for the POWER Processor 32 MByte On-Chip L3 Cache,” JSSC, 2011.
[11] A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim,
A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu, “Google Workloads for Con-
sumer Devices: Mitigating Data Movement Bottlenecks,” in ASPLOS, 2018.
[12] A. Boroumand, S. Ghose, B. Lucia, K. Hsieh, K. Malladi, H. Zheng, and
O. Mutlu, “LazyPIM: An Efficient Cache Coherence Mechanism for
Processing-in-Memory,” IEEE CAL, 2016.
[13] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis,
C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama, “Impulse:
Building a Smarter Memory Controller,” in HPCA, 1999.
[14] K. K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhi-
menko, S. Khan, and O. Mutlu, “Understanding Latency Variation in Modern
DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in
SIGMETRICS, 2016.
[15] K. K. Chang, D. Lee, Z. Chishti, A. Alameldeen, C. Wilkerson, Y. Kim, and
O. Mutlu, “Improving DRAM Performance by Parallelizing Refreshes with
Accesses,” in HPCA, 2014.
[16] K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi, and O. Mutlu, “Low-Cost
Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in
DRAM,” in HPCA, 2016.
[17] K. K. Chang, A. G. Yaglikci, S. Ghose, A. Agrawal, N. Chatterjee, A. Kashyap,
D. Lee, M. O’Connor, H. Hassan, and O. Mutlu, “Understanding Reduced-Voltage
Operation in Modern DRAM Devices: Experimental Characterization, Analysis,
and Mechanisms,” in SIGMETRICS, 2017.
[18] N. Chatterjee et al., “Staged reads: Mitigating the impact of DRAM writes on
DRAM reads,” in HPCA, 2012.
[19] N. Chatterjee, M. Shevgoor, R. Balasubramonian, A. Davis, Z. Fang, R. Illikkal,
and R. Iyer, “Leveraging Heterogeneity in DRAM Main Memories to Accelerate
Critical Word Access,” in MICRO, 2012.
[20] N. Chatterjee, M. O’Connor, G. H. Loh, N. Jayasena, and R. Balasubramonian,
“Managing DRAM Latency Divergence in Irregular GPGPU Applications,” in SC,
2014.
[21] J. Choi, W. Shin, J. Jang, J. Suh, Y. Kwon, Y. Moon, and L.-S. Kim, “Multiple Clone
Row DRAM: A Low Latency and Area Optimized DRAM,” in ISCA, 2015.
[22] B. W. Curran et al., “The zEnterprise 196 System and Microprocessor,” IEEE Micro,
Mar 2011.
[23] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin,
C. Chen, C. W. Kang, I. Kim, and G. Daglikoca, “The Architecture of the DIVA
Processing-in-memory Chip,” in ICS, 2002.
[24] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N.
Patt, “Parallel Application Memory Scheduling,” in MICRO, 2011.
[25] Enhanced Memory Systems, “Enhanced SDRAM SM2604,” 2002.
[26] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “NDA: Near-DRAM
Acceleration Architecture Leveraging Commodity DRAM Devices and Standard
Memory Modules,” in HPCA, 2015.
[27] B. B. Fraguela, J. Renau, P. Feautrier, D. Padua, and J. Torrellas, “Programming
the FlexRAM Parallel Intelligent Memory System,” in PPoPP, 2003.
[28] M. Gao, G. Ayers, and C. Kozyrakis, “Practical Near-Data Processing for In-
Memory Analytics Frameworks,” in PACT, 2015.
[29] M. Gao and C. Kozyrakis, “HRL: Efficient and Flexible Reconfigurable Logic for
Near-Data Processing,” in HPCA, 2016.
[30] S. Ghose, K. Hsieh, A. Boroumand, R. Ausavarungnirun, and O. Mutlu, “The
Processing-in-Memory Paradigm: Mechanisms to Enable Adoption,” arXiv
[cs.AR], 2018.
[31] S. Ghose, H. Lee, and J. F. Martínez, “Improving Memory Scheduling via
Processor-side Load Criticality Information,” in ISCA, 2013.
[32] M. Gokhale, B. Holmes, and K. Iobst, “Processing in Memory: the Terasys Mas-
sively Parallel PIM Array,” Computer, 1995.
[33] M. Gschwind, “Chip Multiprocessing and the Cell Broadband Engine,” in CF,
2006.
[34] J. Gummaraju, M. Erez, J. Coburn, M. Rosenblum, and W. J. Dally, “Architec-
tural Support for the Stream Execution Model on General-Purpose Processors,”
in PACT, 2007.
[35] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T.-M. Low, L. Pileggi, J. C. Hoe, and
F. Franchetti, “3D-Stacked Memory-Side Acceleration: Accelerator and System
Design,” in WoNDP, 2014.
[36] C. A. Hart, “CDRAM in a Unified Memory Architecture,” in Compcon, 1994.
[37] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and
O. Mutlu, “ChargeCache: Reducing DRAM Latency by Exploiting Row Access
Locality,” in HPCA, 2016.
[38] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. K. Chang, G. Pekhimenko, D. Lee,
O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-source Infras-
tructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[39] J. L. Henning, “SPEC CPU2006 benchmark descriptions,” CAN, 2006.
[40] H. Hidaka, Y. Matsuda, M. Asakura, and K. Fujishima, “The Cache DRAM Archi-
tecture,” IEEE Micro, 1990.
[41] HPC Challenge, “GUPS,” http://icl.cs.utk.edu/projectsfiles/hpcc/RandomAccess/.
[42] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and
O. Mutlu, “Accelerating Pointer Chasing in 3D-stacked Memory: Challenges,
Mechanisms, Evaluation,” in ICCD, 2016.
[43] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar,
O. Mutlu, and S. W. Keckler, “Transparent Offloading and Mapping (TOM):
Enabling Programmer-Transparent Near-Data Processing in GPU Systems,” in
ISCA, 2016.
[44] W.-C. Hsu and J. E. Smith, “Performance of Cached DRAM Organizations in
Vector Supercomputers,” in ISCA, 1993.
[45] Intel, “2nd Gen. Intel Core Processor Family Desktop Datasheet,” 2011.
[46] Intel, “Intel Core Desktop Processor Series Datasheet,” 2011.
[47] Intel Corp., “Intel® I/O Acceleration Technology,” http://www.intel.com/content/
www/us/en/wireless-network/accel-technology.html.
[48] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, “Self-optimizing memory con-
trollers: A reinforcement learning approach,” in ISCA, 2008.
[49] ITRS, “International Technology Roadmap for Semiconductors,” 2011.
[50] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, “A QoS-Aware Memory Con-
troller for Dynamically Balancing GPU and CPU Bandwidth Use in an MPSoC,”
in DAC, 2012.
[51] X. Jiang, Y. Solihin, L. Zhao, and R. Iyer, “Architecture Support for Improving
Bulk Memory Copying and Initialization Performance,” in PACT, 2009.
[52] A. Jog, O. Kayiran, A. Pattnaik, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das,
“Exploiting Core Criticality for Enhanced GPU Performance,” in SIGMETRICS,
2016.
[53] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy,
“Introduction to the Cell Multiprocessor,” IBM JRD, 2005.
[54] U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. Choi,
“Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling,” in
The Memory Forum, 2014.
[55] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas,
“FlexRAM: Toward an Advanced Intelligent Memory System,” in ICCD, 1999.
[56] G. Kedem and R. P. Koganti, “WCDRAM: A Fully Associative Integrated Cached-
DRAM with Wide Cache Lines,” Duke Univ. Dept. of Computer Science, Tech.
Rep. CS-1997-03, 1997.
[57] B. Keeth et al., DRAM Circuit Design: Fundamental and High-Speed Topics.
Wiley-IEEE Press, 2007.
[58] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar, “Bounding
Memory Interference Delay in COTS-based Multi-core Systems,” in RTAS, 2014.
[59] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar, “Bounding
and Reducing Memory Interference in COTS-based Multi-core Systems,” RTS,
2016.
[60] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin,
C. Alkan, and O. Mutlu, “GRIM-Filter: Fast Seed Location Filtering in DNA Read
Mapping Using Processing-in-Memory Technologies,” BMC Genomics, 2018.
[61] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly
Evaluating Physical Unclonable Functions by Exploiting the Latency-Reliability
Tradeoff in Modern DRAM Devices,” in HPCA, 2018.
[62] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simu-
lator,” IEEE CAL, 2015.
[63] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and
O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental
Study of DRAM Disturbance Errors,” in ISCA, 2014.
[64] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A Scalable and High-
Performance Scheduling Algorithm for Multiple Memory Controllers,” in HPCA,
2010.
[65] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread Cluster
Memory Scheduling: Exploiting Differences in Memory Access Behavior,” in MICRO,
2010.
[66] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray-
Level Parallelism (SALP) in DRAM,” in ISCA, 2012.
[67] P. M. Kogge, “EXECUBE-A New Architecture for Scaleable MPPs,” in ICPP, 1994.
[68] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating
STT-RAM as an energy-efficient main memory alternative,” in ISPASS, 2013.
[69] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory
as a Scalable DRAM Alternative,” in ISCA, 2009.
[70] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Phase Change Memory Architecture
and the Quest for Scalability,” CACM, 2010.
[71] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger,
“Phase-Change Technology and the Future of Main Memory,” IEEE Micro, 2010.
[72] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving Memory Bank-Level
Parallelism in the Presence of Prefetching,” in MICRO, 2009.
[73] C. J. Lee, E. Ebrahimi, V. Narasiman, O. Mutlu, and Y. N. Patt, “DRAM-Aware
Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory
Systems,” Univ. of Texas at Austin, High Performance Systems Group, Tech. Rep.
TR-HPS-2010-002, 2010.
[74] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-aware DRAM Con-
trollers,” in MICRO, 2008.
[75] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-
layer Access: Improving 3D-stacked Memory Bandwidth at Low Cost,” TACO,
2016.
[76] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko,
V. Seshadri, and O. Mutlu, “Design-Induced Latency Variation in Modern DRAM
Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in SIG-
METRICS, 2017.
[77] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. K. Chang, and O. Mutlu,
“Adaptive-latency DRAM: Optimizing DRAM Timing for the Common-case,” in
HPCA, 2015.
[78] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-latency
DRAM: A Low Latency and Low Cost DRAM Architecture,” in HPCA, 2013.
[79] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu, “Decoupled
Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a
Dual-Data-Port DRAM,” in PACT, 2015.
[80] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of
Data Retention Behavior in Modern DRAM Devices: Implications for Retention
Time Proling Mechanisms,” in ISCA, 2013.
[81] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-Aware Intelligent
DRAM Refresh,” in ISCA, 2012.
[82] W. Liu, P. Huang, T. Kun, T. Lu, K. Zhou, C. Li, and X. He, “LAMS: A Latency-
aware Memory Scheduling Policy for Modern DRAM Systems,” in IPCCC, 2016.
[83] G. H. Loh, “3D-stacked Memory Architectures for Multi-core Processors,” in
ISCA, 2008.
[84] S.-L. Lu, Y.-C. Lin, and C.-L. Yang, “Improving DRAM Latency with Dynamic
Asymmetric Subarray,” in MICRO, 2015.
[85] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J.
Reddi, and K. Hazelwood, “Pin: building customized program analysis tools with
dynamic instrumentation,” in PLDI, 2005.
[86] Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza, A. Kansal, J. Liu,
B. Khessib, K. Vaid, and O. Mutlu, “Characterizing Application Memory Error
Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Mem-
ory,” in DSN, 2014.
[87] S. Lym, H. Ha, Y. Kwon, C. Chang, J. Kim, and M. Erez, “ERUCA: Efficient DRAM
Resource Utilization and Resource Conflict Avoidance for Memory System
Parallelism,” in HPCA, 2018.
[88] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, “Smart
Memories: A Modular Reconfigurable Architecture,” in ISCA, 2000.
[89] J. D. McCalpin, “STREAM Benchmark,” http://www.streambench.org/.
[90] J. Meza, Y. Luo, S. Khan, J. Zhao, Y. Xie, and O. Mutlu, “A Case for Efficient
Hardware/Software Cooperative Management of Storage and Memory,” in WEED,
2013.
[91] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling Efficient and
Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,”
IEEE CAL, 2012.
[92] J. Meza, J. Li, and O. Mutlu, “A Case for Small Row Buffers in Non-volatile Main
Memories,” in ICCD, 2012.
[93] Micron Technology, Inc., “DDR3 SDRAM System-Power Calculator,” 2010.
[94] Micron Technology, Inc., “576Mb: x18, x36 RLDRAM3,” 2011.
[95] Micron Technology, Inc., “2Gb: x16, x32 Mobile LPDDR2 SDRAM,” 2012.
[96] T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory
Service in Multi-core Systems,” in USENIX Security, 2007.
[97] T. Moscibroda and O. Mutlu, “Distributed Order Scheduling and Its Application
to Multi-core DRAM Controllers,” in PODC, 2008.
[98] J. Mukundan and J. F. Martinez, “MORSE: Multi-objective Reconfigurable
Self-optimizing Memory Scheduler,” in HPCA, 2012.
[99] S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda,
“Reducing Memory Interference in Multicore Systems via Application-Aware
Memory Channel Partitioning,” in MICRO, 2011.
[100] O. Mutlu, H. Kim, and Y. N. Patt, “Address-value Delta (AVD) Prediction:
Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory
Allocation Patterns,” in MICRO, 2005.
[101] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead execution: An
effective alternative to large instruction windows,” IEEE Micro, 2003.
[102] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” IMW, 2013.
[103] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” in MEMCON,
2013.
[104] O. Mutlu, “The RowHammer Problem and Other Issues We May Face as Memory
Becomes Denser,” in DATE, 2017.
[105] O. Mutlu, H. Kim, and Y. N. Patt, “Techniques for Efficient Processing in
Runahead Execution Engines,” in ISCA, 2005.
[106] O. Mutlu and T. Moscibroda, “Stall-time fair memory access scheduling for chip
multiprocessors,” in MICRO, 2007.
[107] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing
both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
[108] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead Execution: An Al-
ternative to Very Large Instruction Windows for Out-of-Order Processors,” in
HPCA, 2003.
[109] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in Memory
Systems,” SUPERFRI, 2014.
[110] NEC, “Virtual Channel SDRAM uPD4565421,” 1999.
[111] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith, “Fair queuing memory
systems,” in MICRO, 2006.
[112] S. O, Y. H. Son, N. S. Kim, and J. H. Ahn, “Row-Buffer Decoupling: A Case for
Low-Latency DRAM Microarchitecture,” in ISCA, 2014.
[113] J.-H. Oh, “Semiconductor memory having a bank with sub-banks,” U.S. Patent
7,782,703, 2010.
[114] M. Oskin, F. T. Chong, and T. Sherwood, “Active Pages: A Computation Model
for Intelligent Memory,” in ISCA, 1998.
[115] M. Patel, J. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the
Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,” in
ISCA, 2017.
[116] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis,
R. Thomas, and K. Yelick, “A Case for Intelligent RAM,” IEEE Micro, 1997.
[117] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu,
and C. R. Das, “Scheduling Techniques for GPU Architectures with Processing-
In-Memory Capabilities,” in PACT, 2016.
[118] S. Phadke and S. Narayanasamy, “MLP Aware Heterogeneous Memory System,”
in DATE, 2011.
[119] S. H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuk-
tosunoglu, A. Davis, and F. Li, “NDC: Analyzing the Impact of 3D-stacked Mem-
ory+logic Devices on MapReduce Workloads,” in ISPASS, 2014.
[120] M. K. Qureshi, D. Lynch, O. Mutlu, and Y. Patt, “A Case for MLP-Aware Cache
Replacement,” in ISCA, 2006.
[121] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali,
“Enhancing Lifetime and Security of PCM-based Main Memory with Start-gap
Wear Leveling,” in MICRO, 2009.
[122] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable High Performance Main
Memory System Using Phase-change Memory Technology,” in ISCA, 2009.
[123] Rambus, “DRAM Power Model,” 2010.
[124] SAFARI Research Group, “Ramulator – GitHub Repository,” https://github.com/
CMU-SAFARI/ramulator.
[125] R. H. Sartore, K. J. Mobley, D. G. Carrigan, and O. F. Jones, “Enhanced DRAM
with embedded registers,” U.S. Patent 5,887,272, 1999.
[126] Y. Sato, T. Suzuki, T. Aikawa, S. Fujioka, W. Fujieda, H. Kobayashi, H. Ikeda, T. Na-
gasawa, A. Funyu, Y. Fuji, K. Kawasaki, M. Yamazaki, and M. Taguchi, “Fast cycle
RAM (FCRAM): A 20-ns Random Row Access, Pipe-Lined Operating DRAM,” in
VLSIC, 1998.
[127] S.-Y. Seo, “Methods of Copying a Page in a Memory Device and Methods of Man-
aging Pages in a Memory System,” U.S. Patent Application 20140185395, 2014.
[128] V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M. Kozuch, O. Mutlu, P. Gibbons,
and T. Mowry, “Fast Bulk Bitwise AND and OR in DRAM,” IEEE CAL, 2015.
[129] V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo,
O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, “RowClone: Fast and
Energy-Ecient In-DRAM Bulk Data Copy and Initialization,” in ISCA, 2013.
[130] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch,
O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory Accelerator for
Bulk Bitwise Operations using Commodity DRAM Technology,” in MICRO, 2017.
[131] V. Seshadri, T. Mullins, A. Boroumand, O. Mutlu, P. B. Gibbons, M. A. Kozuch,
and T. C. Mowry, “Gather-Scatter DRAM: In-DRAM Address Translation to Im-
prove the Spatial Locality of Non-Unit Strided Accesses,” in MICRO, 2015.
[132] V. Seshadri and O. Mutlu, “Simple Operations in Memory to Reduce Data Move-
ment,” in Advances in Computers, Volume 106, 2017.
[133] W. Shin, J. Yang, J. Choi, and L.-S. Kim, “NUAT: A Non-Uniform Access Time
Memory Controller,” in HPCA, 2014.
[134] B. Sinharoy, R. Kalla, W. J. Starke, H. Q. Le, R. Cargnoni, J. A. Van Norstrand, B. J.
Ronchetti, J. Stuecheli, J. Leenstra, G. L. Guthrie et al., “IBM POWER7 multicore
server processor,” IBM Journal Res. Dev., May 2011.
[135] Y. H. Son, S. O, Y. Ro, J. W. Lee, and J. H. Ahn, “Reducing Memory Access Latency
with Asymmetric DRAM Bank Organizations,” in ISCA, 2013.
[136] H. S. Stone, “A Logic-in-Memory Computer,” IEEE TC, vol. C-19, no. 1, pp. 73–78,
1970.
[137] J. Stuecheli, D. Kaseridis, D. Daly, H. C. Hunter, and L. K. John, “The Virtual
Write Queue: Coordinating DRAM and Last-level Cache Policies,” in ISCA, 2010.
[138] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “BLISS: Balancing
Performance, Fairness and Complexity in Memory Access Scheduling,” in IEEE
TPDS, 2016.
[139] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The Blacklisting
Memory Scheduler: Achieving high performance and fairness at low cost,” in
ICCD, 2014.
[140] L. Subramanian, V. Seshadri, A. Ghosh, S. Khan, and O. Mutlu, “The Application
Slowdown Model: Quantifying and Controlling the Impact of Inter-application
Interference at Shared Caches and Main Memory,” in MICRO, 2015.
[141] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu, “MISE: Providing
Performance Predictability and Improving Fairness in Shared Main Memory Sys-
tems,” in HPCA, 2013.
[142] K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi, R. Balasubramonian, and
A. Davis, “Micro-pages: Increasing DRAM efficiency with locality-aware data
placement,” in ASPLOS, 2010.
[143] Sun Microsystems, “OpenSPARC T1 microarch. specification,” 2006.
[144] Z. Sura, A. Jacob, T. Chen, B. Rosenburg, O. Sallenave, C. Bertolli, S. Antao,
J. Brunheroto, Y. Park, K. O’Brien, and R. Nair, “Data Access Optimization in
a Processing-in-memory System,” in CF, 2015.
[145] R. M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic
Units,” IBM JRD, 1967.
[146] Transaction Processing Performance Council, “,” http://www.tpc.org/.
[147] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis,
and N. P. Jouppi, “Rethinking DRAM Design and Organization for Energy-
constrained Multi-cores,” in ISCA, 2010.
[148] H. Usui, L. Subramanian, K. K. Chang, and O. Mutlu, “SQUASH: Simple QoS-
aware high-performance memory scheduler for heterogeneous systems with
hardware accelerators,” arXiv:1505.07502 [cs.AR], 2015.
[149] H. Usui, L. Subramanian, K. K. Chang, and O. Mutlu, “DASH: Deadline-Aware
High-Performance Memory Scheduler for Heterogeneous Systems with Hard-
ware Accelerators,” TACO, 2016.
[150] T. Vogelsang, “Understanding the Energy Consumption of Dynamic Random Ac-
cess Memories,” in MICRO, 2010.
[151] F. A. Ware and C. Hampel, “Improving Power and Data Efficiency with Threaded
Memory Modules,” in ICCD, 2006.
[152] W. A. Wong and J.-L. Baer, “DRAM caching,” 1997.
[153] D. Xiong, K. Huang, X. Jiang, and X. Yan, “Memory Access Scheduling Based on
Dynamic Multilevel Priority in Shared DRAM Systems,” TACO, 2016.
[154] T. Yamauchi, L. Hammond, and K. Olukotun, “The Hierarchical Multi-bank
DRAM: A High-performance Architecture for Memory Integrated with Proces-
sors,” in Advanced Research in VLSI, 1997.
[155] H. Yoon, J. Meza, R. Ausavarungnirun, R. Harding, and O. Mutlu, “Row Buffer
Locality Aware Caching Policies for Hybrid Memories,” in ICCD, 2012.
[156] J. Yue and Y. Zhu, “Exploiting Subarrays Inside a Bank to Improve Phase Change
Memory Performance,” in DATE, 2013.
[157] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Igna-
towski, “TOP-PIM: Throughput-oriented Programmable Processing in Memory,”
in HPDC, 2014.
[158] L. Zhang, Z. Fang, M. Parker, B. K. Mathew, L. Schaelicke, J. B. Carter, W. C.
Hsieh, and S. A. McKee, “The Impulse Memory Controller,” IEEE TC, 2001.
[159] T. Zhang, M. Poremba, C. Xu, G. Sun, and Y. Xie, “CREAM: A Concurrent-
Refresh-Aware DRAM Memory Architecture,” in HPCA, 2014.
[160] Z. Zhang, Z. Zhu, and X. Zhang, “A permutation-based page interleaving scheme
to reduce row-buer conicts and exploit data locality,” in MICRO, 2000.
[161] Z. Zhang, Z. Zhu, and X. Zhang, “Cached DRAM for ILP processor memory
access latency reduction,” IEEE Micro, July 2001.
[162] J. Zhao, O. Mutlu, and Y. Xie, “FIRM: Fair and High-Performance Memory Con-
trol for Persistent Memory Systems,” in MICRO, 2014.
[163] L. Zhao, R. Iyer, S. Makineni, L. Bhuyan, and D. Newell, “Hardware Support for
Bulk Data Movement in Server Platforms,” in ICCD, 2005.
[164] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu, “Mini-rank:
Adaptive DRAM Architecture for Improving Memory Power Efficiency,” in MICRO,
2008.
