High-Performance and Energy-Efficient Memory Scheduler Design for
  Heterogeneous Systems by Ausavarungnirun, Rachata et al.
High-Performance and Energy-Ecient Memory Scheduler Design
for Heterogeneous Systems
Rachata Ausavarungnirun1 Gabriel H. Loh2
Lavanya Subramanian3,1 Kevin Chang4,1 Onur Mutlu5,1
1Carnegie Mellon University 2AMD Research 3Intel Labs 4Facebook 5ETH Zürich
This paper summarizes the idea of the Staged Memory Scheduler
(SMS), which was published at ISCA 2012 [14], and examines
the work’s significance and future potential. When multiple
processor cores (CPUs) and a GPU integrated together on the
same chip share the off-chip DRAM, requests from the GPU
can heavily interfere with requests from the CPUs, leading to
low system performance and starvation of cores. Unfortunately,
state-of-the-art memory scheduling algorithms are ineffective
at solving this problem due to the very large amount of GPU
memory traffic, unless a very large and costly request buffer
is employed to provide these algorithms with enough visibility
across the global request stream.
Previously-proposed memory controller (MC) designs use a
single monolithic structure to perform three main tasks. First,
the MC attempts to schedule together requests to the same
DRAM row to increase row buer hit rates. Second, the MC
arbitrates among the requesters (CPUs and GPU) to optimize
for overall system throughput, average response time, fairness
and quality of service. Third, the MC manages the low-level
DRAM command scheduling to complete requests while ensur-
ing compliance with all DRAM timing and power constraints.
This paper proposes a fundamentally new approach, called
the Staged Memory Scheduler (SMS), which decouples the three
primary MC tasks into three significantly simpler structures
that together improve system performance and fairness. Our
three-stage MC first groups requests based on row buffer locality.
This grouping allows the second stage to focus only on inter-
application scheduling decisions. These two stages enforce high-
level policies regarding performance and fairness, and therefore
the last stage can use simple per-bank FIFO queues (i.e., there is
no need for further command reordering within each bank) and
straightforward logic that deals only with the low-level DRAM
commands and timing.
We evaluated the design trade-os involved and compared it
against four state-of-the-art MC designs. Our evaluation shows
that SMS provides 41.2% performance improvement and 4.8×
fairness improvement compared to the best previous
state-of-the-art technique, while enabling a design that is significantly
less complex and more power-efficient to implement.
Our analysis and proposed scheduler have inspired significant
research on (1) predictable and/or deadline-aware memory
scheduling [91, 92, 194, 195, 197, 201, 202, 216] and (2) memory
scheduling for heterogeneous systems [161, 201, 202, 207].
1. Introduction
As the number of cores continues to increase in modern
chip multiprocessor (CMP) systems, the DRAM memory sys-
tem has become a critical shared resource [139, 145]. Mem-
ory requests from multiple cores interfere with each other,
and this inter-application interference is a significant
impediment to individual application and overall system performance.
Various works on application-aware memory scheduling
[98,99,141,142] address the problem by making the memory
controller aware of application characteristics and appropriately
prioritizing memory requests to improve system
performance and fairness.
Recent heterogeneous CPU-GPU systems [1, 27, 28, 37, 76,
77, 78, 133, 152, 153, 167, 209] present an additional challenge
by introducing integrated graphics processing units (GPUs)
on the same die with CPU cores. GPU applications typically
demand significantly more memory bandwidth than
CPU applications due to the GPU’s capability of executing
a large number of concurrent threads [1, 2, 3, 4, 13, 23, 27, 37,
38, 40, 62, 68, 77, 78, 133, 149, 150, 151, 152, 153, 154, 167, 176, 178,
179, 188, 189, 199, 200, 206, 209]. GPUs use single-instruction
multiple-data (SIMD) pipelines to concurrently execute mul-
tiple threads [53]. In a GPU, a group of threads executing the
same instruction is called a wavefront or warp, and threads
in a warp are executed in lockstep. When a wavefront stalls
on a memory instruction, the GPU core hides this memory
access latency by switching to another wavefront to avoid
stalling the pipeline. Therefore, there can be thousands of
outstanding memory requests from across all of the wave-
fronts. This is fundamentally more memory intensive than
CPU memory trac, where each CPU application has a much
smaller number of outstanding requests due to the sequential
execution model of CPUs.
Figure 1 (a) shows the memory request rates for a represen-
tative subset of our GPU applications and the most memory-
intensive SPEC2006 (CPU) applications, as measured by mem-
ory requests per thousand cycles when each application runs
alone on the system. The raw bandwidth demands (i.e., mem-
ory request rates) of the GPU applications are often multiple
times higher than the SPEC benchmarks. Figure 1 (b) shows
the row buer hit rates (also called row buer locality or
RBL [134]). The GPU applications show consistently high
levels of RBL, whereas the SPEC benchmarks exhibit more
variability. The GPU programs have high levels of spatial
arXiv:1804.11043v1 [cs.AR] 30 Apr 2018
[Figure 1 plots: (a) MPKC, (b) RBL, and (c) BLP for GAME01, GAME03, GAME05, BENCH02, BENCH04, gcc, h264ref, astar, omnetpp, leslie3d, and mcf.]
Figure 1: GPU memory characteristics. (a) Memory intensity, measured by memory requests per thousand cycles, (b) row
buffer locality, measured by the fraction of accesses that hit in the row buffer, and (c) bank-level parallelism. Reproduced
from [14].
locality, often due to access patterns related to large sequential
memory accesses (e.g., frame buffer updates). Figure 1(c)
shows the bank-level parallelism (BLP) [109, 142], which is the
average number of parallel memory requests that can be issued
to different DRAM banks, for each application, with the GPU
programs consistently making use of four banks at the same
time.
In addition to the high-intensity memory traffic of GPU
applications, there are other properties that distinguish GPU
applications from CPU applications. Prior work [99] observed
that CPU applications with streaming access patterns typi-
cally exhibit high RBL but low BLP, while applications with
less uniform access patterns typically have low RBL but high
BLP.
In contrast, GPU applications have both high RBL and high
BLP. The combination of high memory intensity, high RBL
and high BLP means that the GPU will cause significant interference
to other applications across all banks, especially when
using a memory scheduling algorithm that preferentially favors
requests that result in row buffer hits (e.g., [173, 220]).
Recent memory scheduling research has focused on mem-
ory interference between applications in CPU-only scenarios.
These past proposals are built around a single centralized request
buffer at each memory controller (MC). The scheduling
algorithm implemented in the memory controller analyzes
the stream of requests in the centralized request buffer to
determine application memory characteristics, decides on
a priority for each core, and then enforces these priorities.
Observable memory characteristics may include the number
of requests that result in row buffer hits, the bank-level parallelism
of each core, memory request rates, overall fairness
metrics, and other information. Figure 2(a) shows the CPU-only
scenario where the request buffer only holds requests
from the CPUs. In this case, the memory controller sees a
number of requests from the CPUs and has visibility into
their memory behavior. On the other hand, when the request
buffer is shared between the CPUs and the GPU, as shown in
Figure 2(b), the large volume of requests from the GPU occupies
a significant fraction of the memory controller’s request
[Figure 2 diagram: request buffer contents, distinguishing CPU requests from GPU requests, for cases (a), (b), and (c).]
Figure 2: Example of the limited visibility of the memory
controller. (a) CPU-only information, (b) Memory controller’s
visibility, (c) Improved visibility. Adapted from [14].
buer, thereby limiting the memory controller’s visibility of
the CPU applications’ memory characteristics.
One approach to increasing the memory controller’s visibil-
ity across a larger window of memory requests is to increase
the size of its request buer. This allows the memory con-
troller to observe more requests from the CPUs to better char-
acterize their memory behavior, as shown in Figure 2(c). For
instance, with a large request buer, the memory controller
can identify and service multiple requests from one CPU core
to the same row such that they become row buer hits, how-
ever, with a small request buer as shown in Figure 2(b), the
memory controller may not even see these requests at the
same time because the GPU’s requests have occupied the
majority of the entries.
Unfortunately, very large request buffers impose significant
implementation challenges, including the die area for
the larger structures and the additional circuit complexity for
analyzing so many requests, along with the logic needed for
assignment and enforcement of priorities [194, 195]. There-
fore, while building a very large, centralized memory con-
troller request buer could perhaps lead to reasonable mem-
ory scheduling decisions, the approach is unattractive due to
the resulting area, power, timing and complexity costs.
In this work, we propose the Staged Memory Scheduler
(SMS), a decentralized architecture for memory scheduling in
the context of integrated multi-core CPU-GPU systems. The
key idea in SMS is to decouple the various functional tasks of
memory controllers and partition these tasks across several
simpler hardware structures which operate in a staged fashion.
The three primary functions of the memory controller,
which map to the three stages of our proposed memory con-
troller architecture, are:
1. Detection of basic intra-application memory characteristics
(e.g., row buffer locality).
2. Prioritization across applications (CPUs and GPU) and
enforcement of policies to reflect the priorities.
3. Low-level command scheduling (e.g., activate, precharge,
read/write), enforcement of DRAM device timing constraints
(e.g., tRAS, tFAW), and resolution of resource
conflicts (e.g., data bus arbitration).¹
Our specic SMS implementation makes widespread use
of distributed FIFO structures to maintain a very simple im-
plementation, but at the same time SMS can provide fast
service to low memory-intensity (likely latency-sensitive)
applications and eectively exploit row buer locality and
bank-level parallelism for high memory-intensity (bandwidth-
demanding) applications. While SMS provides a specic im-
plementation, our staged approach for memory controller
organization provides a general framework for exploring scal-
able memory scheduling algorithms capable of handling the
diverse memory needs of integrated heterogeneous process-
ing systems of the future (e.g., systems-on-chip that contain
CPUs, GPUs, and accelerators).
2. Staged Memory Scheduler Design
Overview: Our proposed Staged Memory Scheduler [14]
architecture introduces a new memory controller (MC) de-
sign that provides 1) scalability and simpler implementation
by decoupling the primary functions of an application-aware
MC into a simpler multi-stage MC, and 2) performance and
fairness improvement by reducing the interference caused by
very bandwidth-intensive applications. SMS provides these
benets by introducing a three-stage design. The rst stage
is the per-core batch formation stage, which groups requests
from the same application that access the same row to im-
prove row buer locality. The second stage is the batch sched-
uler, which schedules batches of requests from across dierent
applications. The last stage is the DRAM command scheduler,
which sends requests to DRAM while satisfying all DRAM
constraints.
The staged organization of SMS lends itself directly to a
low-complexity hardware implementation. Figure 3 illustrates
the overall hardware organization of the SMS. We
briefly discuss each stage below. Section 4 of our ISCA 2012
paper [14] includes a detailed description of each stage.
Stage 1 - Batch Formation. The goal of this stage is to
combine individual memory requests from each source into
batches of requests that are to the same row buffer entry. It
consists of several simple FIFO structures, one per source (i.e.,
a CPU core or the GPU). Each request from a given source
¹ We refer the reader to our prior works [32, 33, 34, 35, 66, 67, 93, 96, 97, 98,
99, 100, 112, 113, 114, 115, 116, 120, 121, 158, 183, 184] for a detailed background
on DRAM.
[Figure 3 diagram: per-source FIFOs (Core 1 to Core 4 and the GPU) in the batch formation stage feed the batch scheduler in the batch scheduling stage, which feeds per-bank FIFOs (Bank 1 to Bank 4) in the DRAM command scheduler stage, which issues to DRAM.]
Figure 3: The organization of SMS. Adapted from [14].
is initially inserted into its respective FIFO upon arrival at
the MC. A batch is simply one or more memory requests
from the same source that access the same DRAM row. That
is, all requests within a batch, except perhaps for the first
one, would be row buffer hits if scheduled consecutively. A
batch is deemed complete or ready when an incoming request
accesses a different row, when the oldest request in the batch
has exceeded a threshold age, or when the FIFO is full. Only
ready batches are considered for future scheduling by the
second stage of SMS.
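The batch-completion rules above can be illustrated with a minimal software sketch. This is not the hardware design from [14]; the names `Request`, `BatchFormer`, `capacity`, and `age_threshold` are hypothetical, and rejecting an insert when the FIFO is full is an assumption made only to keep the model self-contained.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    source: int   # hypothetical id of the CPU core or GPU that issued it
    row: int      # DRAM row targeted by the request
    arrival: int  # cycle at which the request reached the MC


class BatchFormer:
    """Per-source FIFO that groups consecutive same-row requests into batches."""

    def __init__(self, capacity, age_threshold):
        self.capacity = capacity
        self.age_threshold = age_threshold
        self.ready = deque()  # complete batches, oldest first
        self.open = []        # batch still being formed

    def occupancy(self):
        return sum(len(b) for b in self.ready) + len(self.open)

    def insert(self, req):
        if self.occupancy() >= self.capacity:
            return False      # assumption: a full FIFO rejects the request
        if self.open and req.row != self.open[0].row:
            self._close()     # a request to a different row ends the batch
        self.open.append(req)
        if self.occupancy() >= self.capacity:
            self._close()     # a full FIFO also makes the batch ready
        return True

    def tick(self, now):
        # The oldest request in the open batch is its first entry.
        if self.open and now - self.open[0].arrival > self.age_threshold:
            self._close()

    def _close(self):
        self.ready.append(self.open)
        self.open = []
```

Only the batches in `ready` would be visible to the second stage in this model; the batch in `open` stays hidden until one of the three conditions fires.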
Stage 2 - Batch Scheduler. The batch scheduler deals
directly with batches, and therefore does not need to worry
about optimizing for row buer locality. Instead, the batch
scheduler focuses on higher-level policies regarding inter-
application interference and fairness. The goal of this stage is
to prioritize batches from applications that are latency critical,
while making sure that bandwidth-intensive applications (e.g.,
those running on the GPU) still make good progress.
The batch scheduler considers every source FIFO (from
stage 1) that contains a ready batch. It picks one ready batch
based on either a shortest job first (SJF) or a round-robin
policy. Using the SJF policy, the batch scheduler chooses
the oldest ready batch from the source with the fewest total
in-flight memory requests across all three stages of SMS.
SJF prioritization reduces average request service latency,
and it tends to favor latency-sensitive applications, which
tend to have fewer total requests [98, 99, 109, 142]. Using the
round-robin policy, the batch scheduler simply picks the next
ready batch in a round-robin manner across the source FIFOs.
This ensures that memory-intensive applications receive ade-
quate service. The batch scheduler uses the SJF policy with
probability p and the round-robin policy with probability
1 − p. The value of p determines whether the CPU or the GPU
receives higher priority. When p is high, the SJF policy is
applied more often and applications with fewer outstanding
requests are prioritized. Hence, the batches of the likely less
memory-intensive CPU applications are prioritized over the
batches of the GPU application. On the other hand, when p is
low, request batches are scheduled in a round-robin fashion
more often. Hence, the memory-intensive GPU application’s
naturally-large request batches are likely scheduled more
frequently, and the GPU is thus prioritized over the CPU.
After picking a batch, the batch scheduler enters a drain
state where it forwards the requests from the selected batch
to the nal stage of the SMS. The batch scheduler dequeues
one request per cycle until all requests from the batch have
been removed from the selected FIFO.
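The probabilistic choice between SJF and round-robin can be sketched as follows. This is an illustrative model under stated assumptions, not the implementation from [14]: the function name `pick_source`, the `in_flight` dictionary, the `rr_pointer` argument, and the tie-break by source id are all hypothetical, and the drain state is omitted.

```python
import random


def pick_source(ready_sources, in_flight, rr_pointer, num_sources, p, rng=random):
    """Return (chosen source id, updated round-robin pointer).

    ready_sources: set of source ids whose FIFO holds a ready batch.
    in_flight: source id -> total in-flight requests across all stages.
    """
    if not ready_sources:
        return None, rr_pointer
    if rng.random() < p:
        # SJF: the source with the fewest in-flight requests wins
        # (ties broken by source id, an assumption of this sketch).
        chosen = min(ready_sources, key=lambda s: (in_flight[s], s))
        return chosen, rr_pointer
    # Round-robin: first source at or after the pointer with a ready batch.
    for offset in range(num_sources):
        s = (rr_pointer + offset) % num_sources
        if s in ready_sources:
            return s, (s + 1) % num_sources
    return None, rr_pointer
```

With `p` close to 1, a source with hundreds of in-flight requests (such as the GPU) is rarely chosen while lighter sources have ready batches; with `p` close to 0, every source with a ready batch is visited in turn regardless of its load.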
Stage 3 - DRAM Command Scheduler (DCS). DCS consists
of one FIFO queue per DRAM bank. The drain state of
the batch scheduler places the memory requests directly into
these FIFOs. Note that because batches are moved into DCS
FIFOs one batch at a time, row buffer locality within a batch is
preserved within a DCS FIFO. At this point, higher-level pol-
icy decisions have already been made by the batch scheduler.
Therefore, the DCS simply issues low-level DRAM commands,
ensuring DRAM protocol compliance.
In any given cycle, DCS considers only the requests at
the head of each of the per-bank FIFOs. For each request,
DCS determines whether that request can issue a command
based on the request’s current row buffer state (e.g., is the row
buffer already open with the requested row?) and the current
DRAM state (e.g., time elapsed since a row was opened in
a bank, and data bus availability). If more than one request
is eligible to issue a command in any given cycle, the DCS
arbitrates between DRAM banks in a round-robin fashion.
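A minimal sketch of this per-cycle arbitration is shown below. The function name `dcs_select` and the `can_issue` predicate are hypothetical; `can_issue` stands in for all row buffer state and DRAM timing checks, which are not modeled here.

```python
from collections import deque


def dcs_select(bank_fifos, can_issue, rr_bank):
    """Return (bank chosen this cycle, updated round-robin pointer).

    bank_fifos: list of deques, one per DRAM bank.
    can_issue(bank, request): hypothetical predicate that is True when the
    head request of that bank may legally issue a command this cycle.
    """
    n = len(bank_fifos)
    for offset in range(n):
        b = (rr_bank + offset) % n
        # Only the head of each per-bank FIFO is ever considered.
        if bank_fifos[b] and can_issue(b, bank_fifos[b][0]):
            return b, (b + 1) % n
    return None, rr_bank
```

Because only FIFO heads are inspected, the selection logic in this model needs no associative search over queued requests.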
3. Qualitative Comparison with
Previous Scheduling Algorithms
In this section, we compare SMS qualitatively to previously
proposed scheduling policies and analyze the basic differences
between SMS and these policies. The fundamental difference
between SMS and previously-proposed memory scheduling
policies for CPU-only scenarios is that the latter are designed
around a single, centralized request buffer which has poor
scalability and complex scheduling logic, while SMS is built
around a decentralized, scalable framework.
First-Ready FCFS (FR-FCFS). FR-FCFS [173, 220] is a
commonly used scheduling policy in commodity DRAM sys-
tems. An FR-FCFS scheduler prioritizes requests that result
in row buer hits over row buer misses and otherwise pri-
oritizes older requests. Since FR-FCFS unfairly prioritizes
applications with high row buer locality to maximize DRAM
throughput, prior works [42, 45, 98, 99, 134, 137, 141, 142, 194,
195] have observed that it has low system performance and
high unfairness.
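The FR-FCFS priority order can be written as a two-level sort key. The sketch below is a simplification that ignores DRAM timing readiness; the tuple layout `(arrival, bank, row)` and the name `fr_fcfs_pick` are assumptions made for illustration.

```python
def fr_fcfs_pick(requests, open_rows):
    """Pick the next request under a simplified FR-FCFS policy.

    requests: list of (arrival_cycle, bank, row) tuples.
    open_rows: bank -> currently open row (banks with no open row are absent).
    """
    if not requests:
        return None
    # Row buffer hits sort first (False < True), then older arrivals.
    return min(requests, key=lambda r: (open_rows.get(r[1]) != r[2], r[0]))
```

In this model, a younger request that hits an open row bypasses every older row miss, which is how an application with high row buffer locality can delay the others.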
Parallelism-Aware Batch Scheduling (PAR-BS). PAR-
BS [142, 143] aims to improve both fairness and system per-
formance. In order to prevent unfairness, it forms batches of
outstanding memory requests and prioritizes the oldest batch,
to avoid request starvation. To improve system throughput,
it prioritizes applications with a smaller number of outstanding
memory requests within a batch. However, PAR-BS has
two major shortcomings. First, batching could cause older
GPU requests and requests of other memory-intensive CPU
applications to be prioritized over latency-sensitive CPU ap-
plications. Second, as previous work [98] has also observed,
PAR-BS does not take into account an application’s long term
memory-intensity characteristics when it assigns application
priorities within a batch. This could cause memory-intensive
applications’ requests to be prioritized over latency-sensitive
applications’ requests within a batch, due to the application-
agnostic nature of batching.
Adaptive Per-Thread Least-Attained-Service Memory
Scheduling (ATLAS). ATLAS [98] aims to improve system
performance by prioritizing requests of applications with
lower attained memory service. This improves the perfor-
mance of low memory-intensity applications as they tend to
have low attained service. However, ATLAS has the disadvan-
tage of not preserving fairness. Previous works [98, 99] have
shown that simply prioritizing applications based on attained
service leads to signicant slowdown of memory-intensive
applications.
Thread Cluster Memory Scheduling (TCM). TCM [99]
is a state-of-the-art application-aware cluster memory sched-
uler providing both high system throughput and high fairness.
It groups an application into either a latency-sensitive or a
bandwidth-sensitive cluster based on the application memory
intensity. In order to achieve high system throughput and
low unfairness, TCM employs a different prioritization policy
for each cluster. To improve system throughput, a fraction of
total memory bandwidth is dedicated to the latency-sensitive
cluster and applications within the cluster are then ranked
based on memory intensity with the least memory-intensive
application receiving the highest priority. On the other hand,
TCM minimizes unfairness by periodically shuffling applications
within the bandwidth-sensitive cluster to avoid starvation.
This approach provides both high system performance
and fairness in CPU-only systems. In an integrated CPU-GPU
system, the GPU generates a significantly larger number of
memory requests compared to the CPUs and fills up the centralized
request buffer. As a result, the memory controller
lacks the visibility into CPU memory requests to accurately
determine each application’s memory access characteristics.
Without such visibility, TCM makes incorrect and non-robust
clustering decisions, which classify some applications with
high memory intensity into the latency-sensitive cluster and
vice versa. Such misclassied applications cause interference
not only to low memory intensity applications, but also to
each other. Therefore, TCM cannot always provide high sys-
tem performance and high fairness in an integrated CPU-GPU
system. Increasing the request buer size is a practical way
[Figure 4 plots: system performance (higher is better) and unfairness (lower is better) of FR-FCFS, ATLAS, PAR-BS, TCM, and SMS for the L, ML, M, HL, HML, HM, and H workload categories and their average.]
Figure 4: System performance and fairness for 7 categories of workloads (total of 105 workloads). Reproduced from [14].
[Figure 5 plots: CPU system performance (higher is better) and GPU speedup (higher is better) of FR-FCFS, ATLAS, PAR-BS, TCM, and SMS for the L, ML, M, HL, HML, HM, and H workload categories and their average.]
Figure 5: CPUs and GPU Speedup for 7 categories of workloads (total of 105 workloads). Reproduced from [14].
to gain more visibility into CPU applications’ memory access
characteristics. However, this approach is not scalable as we
show in our evaluations [14]. In contrast, SMS provides much
better system performance and fairness than TCM with the
same number of request buer entries and lower hardware
cost, as we show in Section 5.
4. Evaluation Methodology
We use an in-house cycle-accurate simulator to perform
our evaluations. For our performance evaluations, we model
a system with sixteen x86 CPU cores and a GPU. For the
CPUs, we model three-wide out-of-order processors with a
cache hierarchy including per-core L1 caches and a shared,
distributed L2 cache. The GPU does not share the CPU caches.
In order to prevent the GPU from taking the majority of
request buer entries, we reserve half of the request buer
entries for the CPUs. To model the memory bandwidth of
the GPU accurately, we perform coalescing on GPU memory
requests before they are sent to the memory controller [119].
We evaluate our system with a set of 105 multiprogrammed
workloads simulated for 500 million cycles. Each workload
consists of sixteen SPEC CPU2006 benchmarks and one GPU
application selected from a mix of video games and graph-
ics performance benchmarks. We classify CPU benchmarks
into three categories (Low, Medium, and High) based on
their memory intensities, measured as last-level cache misses
per thousand instructions (MPKI). We then randomly choose
sixteen CPU benchmarks from these three categories and one
GPU benchmark to form workloads consisting of seven intensity mixes: L
(All low), ML (Low/Medium), M (All medium), HL (High/Low),
HML (High/Medium/Low), HM (High/Medium) and H (All
high). For each CPU benchmark, we use Pin [125, 172] with
PinPoints [159] to select the representative phase. For the
GPU applications, we use an industrial GPU simulator to
collect memory requests with detailed timing information.
These requests are collected after having first been filtered
through the GPU’s internal cache hierarchy; therefore, we
do not further model any caches for the GPU in our final
hybrid CPU-GPU simulation framework. More detail on our
experimental methodology is in Section 5 of our ISCA 2012
paper [14].
5. Experimental Results
We present the performance of ve memory scheduler con-
gurations: FR-FCFS [173, 220], ATLAS [98], PAR-BS [142],
TCM [99], and SMS [14] on the 16-CPU/1-GPU four-memory-
controller system. All memory schedulers use 300 request
buer entries per memory controller. This size was chosen
based on empirical results, which showed that performance
does not appreciably increase for larger request buer sizes.
Results are presented in the workload categories, with work-
load memory intensities increasing from left to right.
Figure 4 shows the system performance (measured as
weighted speedup [50, 51]) and fairness (measured as max-
imum slowdown [43, 98, 99, 194, 195, 203]) of the previously
proposed algorithms and SMS. Compared to TCM, which is
the previous state-of-the-art algorithm for both system perfor-
mance and fairness, SMS provides 41.2% system performance
improvement and 4.8× fairness improvement. Therefore,
we conclude that SMS provides better system performance
and fairness than all previously proposed scheduling poli-
cies, while incurring much lower hardware cost and simpler
scheduling logic, as we show in Section 5.2.
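For readers unfamiliar with the two metrics, the sketch below computes them under their commonly used definitions: weighted speedup as the sum of each application's shared-run IPC over its alone-run IPC, and maximum slowdown as the largest alone-to-shared IPC ratio. The function names are hypothetical, and any additional weighting between the CPU and GPU terms used in [14] is not reproduced here.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC when sharing divided by IPC when alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def maximum_slowdown(ipc_shared, ipc_alone):
    """Largest per-application slowdown (IPC alone divided by IPC shared)."""
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))


# Illustrative values only, not measurements from the paper.
alone = [2.0, 1.0, 4.0]
shared = [1.0, 0.5, 1.0]
print(weighted_speedup(shared, alone))   # 1.25
print(maximum_slowdown(shared, alone))   # 4.0
```

A higher weighted speedup indicates better system throughput, while a lower maximum slowdown indicates that the most penalized application is treated more fairly.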
We study the performance of the CPU system and the
GPU system separately and provide two major observations
in Figure 5. First, SMS gains 1.76× improvement in CPU
system performance over TCM. Second, SMS achieves this
[Figure 6 plots: system performance (higher is better) and unfairness (lower is better) of TCM and SMS for 2, 4, 8, and 16 cores.]
Figure 6: SMS vs. TCM on a 16 CPU/1 GPU, 4 memory controller system with a varying number of cores.
[Figure 7 plots: system performance (higher is better), unfairness (lower is better), and GPU speedup (higher is better) of TCM and SMS for 2, 4, and 8 memory channels.]
Figure 7: SMS vs. TCM on a 16 CPU/1 GPU system with a varying number of channels.
1.76× CPU performance improvement while delivering GPU
performance similar to that of the FR-FCFS baseline. The results
show that TCM (and the other algorithms) end up allocating
far more bandwidth to the GPU, at significant performance
and fairness cost to the CPU applications. SMS appropriately
deprioritizes the memory bandwidth intensive GPU applica-
tion in order to enable higher CPU performance and overall
system performance, while preserving fairness. Previously
proposed scheduling algorithms, on the other hand, allow the
GPU to hog memory bandwidth and therefore significantly
degrade system performance and fairness.
We provide a more detailed analysis in Sections 6.1 and 6.2
of our ISCA 2012 paper [14].
5.1. Scalability with Cores and
Memory Controllers
Figure 6 compares the performance and fairness of SMS
against TCM (averaged over 75 workloads)² with the same
number of request buffers, as the number of cores is varied.
We make the following observations: First, SMS continues to
provide better system performance and fairness than TCM.
Second, the system performance gains and fairness gains increase
significantly as the number of cores, and hence memory
pressure, increases. SMS’s performance and fairness
benefits are likely to become more significant as core counts
in future technology nodes increase.
Figure 7 shows the system performance and fairness of
SMS compared against TCM as the number of memory chan-
nels is varied. For this, and all subsequent results, we perform
our evaluations on 60 workloads from categories that contain
high memory-intensity applications (HL, HML, HM and H
² We use 75 randomly selected workloads per core count. We could not
use the same workloads/categorizations as specified in Section 4 because
those were for 16-core systems, whereas we are now varying the number of
cores.
workload categories). We observe that SMS scales better as
the number of memory channels increases. As the perfor-
mance gain of TCM diminishes when the number of memory
channels increases from 4 to 8 channels, SMS continues to
provide performance improvement for both CPU and GPU.
We provide a detailed scalability analysis in Section 6.3 of our
ISCA 2012 paper [14].
5.2. Power and Area
We present the power and area of FR-FCFS and SMS. We
find that SMS consumes 66.7% less leakage power than FR-FCFS,
which is the simplest of all of the prior memory schedulers
that we evaluate. In terms of die area, SMS requires
46.3% less area than FR-FCFS. The majority of the power
and area savings of SMS over FR-FCFS come from the decentralized
request buffer queues and simpler scheduling
logic in SMS. In comparison, FR-FCFS requires centralized
request buffer queues, content-addressable memories (CAMs),
and complex scheduling logic. Because ATLAS and TCM
require more complex ranking and scheduling logic than
FR-FCFS, we expect that SMS also provides power and area
reductions over ATLAS and TCM.
We provide the following additional results in our ISCA
2012 paper [14]:
• Combined performance of CPU-GPU heterogeneous systems
for different SMS configurations with different Shortest
Job First (SJF) probabilities.
• Sensitivity analysis to SMS’s configuration parameters.
• Performance of SMS in CPU-only systems.
6. Related Work
To our knowledge, our ISCA 2012 paper is the first to
provide a fundamentally new memory controller design for
heterogeneous CPU-GPU systems in order to reduce interference
at the shared off-chip main memory. There are several
prior works that reduce interference at the shared off-chip
main memory in other systems. We provide a brief discussion
of these works.
6.1. Memory Partitioning Techniques
Instead of mitigating the interference problem between
applications by scheduling requests at the memory controller,
Awasthi et al. [18] propose a mechanism that spreads data
in the same working set across memory channels in order
to increase memory level parallelism. Memory channel partitioning
(MCP) [137] maps applications to different memory
channels based on their memory intensities and row
buffer locality, to reduce inter-application interference. Mao
et al. [128] propose to partition GPU channels and allow only
a subset of threads to access each memory channel. In ad-
dition to channel partitioning, several works [74, 122, 210]
also propose to partition DRAM banks to improve perfor-
mance. These partitioning techniques are orthogonal to our
proposals, and can be combined with SMS to improve the
performance of heterogeneous CPU-GPU systems.
6.2. Memory Scheduling Techniques
Memory Scheduling on CPUs. Numerous prior works
propose memory scheduling algorithms for CPUs that improve
system performance. The first-ready, first-come-first-serve
(FR-FCFS) scheduler [173, 220] prioritizes requests that
hit in the row buffer over requests that miss in the row buffer,
with the aim of reducing the number of times rows must be
activated (as row activation incurs a high latency). Several
memory schedulers improve performance beyond FR-FCFS by
identifying critical threads in multithreaded applications [47],
using reinforcement learning to identify long-term memory
behavior [79, 136], prioritizing memory requests based on
the criticality (i.e., latency sensitivity) of each memory re-
quest [57, 123, 211], distinguishing prefetch requests from
demand requests [109, 111], or improving the scheduling of
memory writeback requests [110, 182, 193]. While all of these
schedulers increase DRAM performance and/or throughput,
many of them introduce fairness problems by under-servicing
applications that only infrequently issue memory requests.
To remedy fairness problems, several application-aware mem-
ory scheduling algorithms [98, 99, 135, 141, 142, 194, 195, 197]
use information on the memory intensity of each application
to balance both performance and fairness. Unlike SMS, none
of these schedulers consider the different needs of CPU mem-
ory requests and GPU memory requests in a heterogeneous
system.
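The FR-FCFS policy described above can be illustrated with a minimal sketch. The `Request` type, the `open_rows` map, and both function names are hypothetical names introduced here for illustration, and the sketch assumes every bank is always ready, so it ignores the DRAM timing constraints that a real controller must respect.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    arrival: int  # arrival time (smaller means older)
    bank: int     # target DRAM bank
    row: int      # target row within that bank


def fr_fcfs_pick(queue: List[Request],
                 open_rows: Dict[int, int]) -> Optional[Request]:
    """Pick the next request: row-buffer hits first, then oldest first."""
    if not queue:
        return None
    # Sort key is (is_miss, arrival). False sorts before True, so a
    # row-buffer hit always beats a miss; ties go to the oldest request.
    return min(queue,
               key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))


def service_order(requests: List[Request]) -> List[Request]:
    """Drain a request queue under the simplified FR-FCFS rule."""
    open_rows: Dict[int, int] = {}  # bank -> currently open row
    queue = list(requests)
    order = []
    while queue:
        req = fr_fcfs_pick(queue, open_rows)
        queue.remove(req)
        open_rows[req.bank] = req.row  # servicing leaves this row open
        order.append(req)
    return order
```

In this sketch, a younger request to an already-open row is serviced ahead of an older request that would need a new row activation, which is the reordering that can under-service applications with few requests.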
Memory Scheduling on GPUs. Since GPU applications
are bandwidth intensive, often with streaming access pat-
terns, a policy that maximizes the number of row buffer hits
is effective for GPUs to maximize overall throughput. As a
result, FR-FCFS with a large request buffer tends to perform
well for GPUs [22]. In view of this, prior work [213] proposes
mechanisms to reduce the complexity of FR-FCFS scheduling
for GPUs. Ausavarungnirun et al. [15] propose MeDiC, which
is a cache and memory management scheme to improve the
performance of GPGPU applications. Jeong et al. [80] propose
a QoS-aware memory scheduler that guarantees the perfor-
mance of GPU applications by prioritizing memory requests
from graphics applications over those from CPU applications
until the system can guarantee that a frame can be rendered
within a given deadline, after which it prioritizes requests
from CPU applications. Jog et al. [83] propose CLAM, a mem-
ory scheduler that identifies critical memory requests and
prioritizes them in the main memory. Ausavarungnirun et
al. [17] propose a scheduling algorithm that identifies and pri-
oritizes TLB-related memory requests in GPU-based systems,
to reduce the overhead of memory virtualization. Unlike SMS,
none of these works holistically optimize the performance
and fairness of requests when a memory controller is shared
by a CPU and a GPU.
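The deadline-driven priority switch described above for the QoS-aware scheduler [80] can be approximated with a small sketch. This is not the mechanism from [80]: it assumes, purely for illustration, that frame progress can be measured as a fraction of completed work and compared against the elapsed fraction of the frame period. The names `gpu_behind_schedule` and `pick_source` and all parameters are hypothetical.

```python
from typing import Optional


def gpu_behind_schedule(work_done: float, work_total: float,
                        elapsed: float, frame_period: float) -> bool:
    """Return True while the GPU lags the progress needed for its deadline.

    Assumption: being on or ahead of a linear progress target is treated
    as "the frame will meet its deadline".
    """
    if work_total <= 0 or frame_period <= 0:
        raise ValueError("work_total and frame_period must be positive")
    progress = work_done / work_total
    expected = min(elapsed / frame_period, 1.0)
    return progress < expected


def pick_source(cpu_pending: bool, gpu_pending: bool,
                gpu_first: bool) -> Optional[str]:
    """Choose which requester to serve next under the current priority."""
    if gpu_first and gpu_pending:
        return "gpu"
    if cpu_pending:
        return "cpu"
    if gpu_pending:
        return "gpu"
    return None  # nothing to schedule
```

A controller built this way would call `gpu_behind_schedule` periodically and pass the result as `gpu_first`, so that CPU requests regain priority once the frame is on track.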
Memory Scheduling on Emerging Systems. Recent
proposals investigate memory scheduling on emerging plat-
forms. Usui et al. [201, 202] propose accelerator-aware mem-
ory controller designs that improve the performance of sys-
tems that contain both CPUs and hardware accelerators. Zhao
et al. [216] decouple the design of a memory controller for per-
sistent memory into multiple stages. These works build upon
principles for heterogeneous system memory scheduling that
were rst proposed in SMS.
6.3. Other Related Works
DRAM Designs. Aside from memory scheduling and
memory partitioning techniques, previous works propose
new DRAM designs that are capable of reducing memory la-
tency in conventional DRAM [9,10,31,32,33,34,36,63,69,72,90,
100,112,113,114,115,116,126,132,155,166,177,187,190,208,218]
and non-volatile memory [102, 105, 106, 107, 130, 131, 170, 171,
212]. Previous works on bulk data transfer [30, 34, 59, 60,
75, 81, 86, 124, 180, 183, 215, 217] and in-memory computa-
tion [7, 8, 11, 19, 25, 26, 44, 52, 54, 55, 56, 58, 61, 70, 71, 87, 94, 101,
127,157,160,161,168,181,184,185,192,198,214] can be used to im-
prove DRAM bandwidth. Techniques to reduce the overhead
of DRAM refresh [5,6,20,24,95,118,121,146,156,169,204] can
be applied to improve the performance of GPU-based systems.
Data compression techniques [162, 163, 164, 165, 205] can also
be used on the main memory to increase the effective avail-
able DRAM bandwidth. All of these techniques can mitigate
the performance impact of memory interference and improve
the performance of GPU-based systems. They are orthogonal
to, and can be combined with, SMS to further improve the
performance of heterogeneous CPU-GPU systems.
Previous works on data prefetching [12, 21, 29, 39, 41, 46, 48,
49,64,65,73,84,85,104,108,109,111,138,140,144,148,186,191]
can also be used to mitigate high DRAM latency. However,
these techniques generally increase DRAM bandwidth uti-
lization, which can lead to lower GPU performance.
Other Ways to Improve Performance on Systems
with GPUs. Other works have proposed various methods
of decreasing memory divergence. These methods range
from thread throttling [88, 89, 103, 174] to warp schedul-
ing [117, 129, 147, 174, 175, 219]. While these methods share
our goal of reducing memory divergence, none of them ex-
ploit inter-warp heterogeneity and, as a result, are orthogonal
or complementary to our proposal. Our work also makes new
observations about memory divergence that are not covered
by these works.
7. Signicance and Long-Term Impact
SMS exposes the need to redesign components of the
memory subsystem to better serve integrated CPU-GPU
systems. Systems-on-chip (SoCs) that integrate CPUs and
GPUs on the same die are growing rapidly in popularity
(e.g., [37, 133, 152, 153]), due to their high energy efficiency
and lower costs compared to discrete CPUs and GPUs. As
a result, SoCs are commonly used in mobile devices such as
smartphones, tablets, and laptops, and are being used in many
servers and data centers. We expect that as more powerful
CPUs and GPUs are integrated in SoCs, and as the workloads
running on the CPUs/GPUs become more memory-intensive,
SMS will become even more essential to alleviate the shared
memory subsystem bottleneck.
The observations and mechanisms in our ISCA 2012 pa-
per [14] expose several future research problems. We briefly
discuss two future research areas below.
Interference Management in Emerging Heteroge-
neous Systems. Our ISCA 2012 paper [14] considers het-
erogeneous systems where a CPU executes various general-
purpose applications while the GPU executes graphics work-
loads. Modern heterogeneous systems contain an increas-
ingly diverse set of workloads. For example, programmers
can use the GPU in an integrated CPU-GPU system to exe-
cute general-purpose applications (known as GPGPU appli-
cations). GPGPU applications can have significantly different
access patterns from graphics applications, requiring differ-
ent memory scheduling policies (e.g., [15, 17, 83]). Future
work can adapt the mechanisms of SMS to optimize the per-
formance of GPGPU applications.
Many heterogeneous systems are being deployed in mobile
or embedded environments, and must ensure that memory
requests from some or all of the components of the hetero-
geneous system meet real-time deadlines [91, 92, 201, 202].
Traditionally, applications with real-time deadlines are exe-
cuted using embedded cores or fixed-function accelerators,
which are often integrated into modern SoCs. We believe that
the observations and mechanisms in our ISCA 2012 paper [14]
can be used and extended to ensure that these deadlines are
met. Recent works [201, 202] have shown that the principles
of SMS can be extended to provide deadline-aware memory
scheduling for accelerators within heterogeneous systems.
Even though the mechanisms proposed in our ISCA 2012
paper [14] aim to minimize the slowdown caused by inter-
ference, they do not provide actual performance guarantees.
However, we believe it is possible to combine principles from
SMS with prediction mechanisms for memory access latency
(e.g., [91,92,196,197]) to provide hard performance guarantees
for real-time applications, while still ensuring fairness for all
applications executing on the heterogeneous system.
Memory Scheduling for Concurrent GPGPU Appli-
cations. While SMS allows CPU applications and graphics
applications to share DRAM more efficiently, we assume that
there is only a single GPU application running at any given
point in time. Recent works [16, 82] propose methods to
eciently share the same GPU across multiple concurrently-
executing GPGPU applications. We believe that the tech-
niques and observations provided in our ISCA 2012 paper [14]
can be applied to reduce the memory interference induced by
additional GPGPU applications. Furthermore, as concurrent
GPGPU application execution becomes more widespread, the
concepts of SMS can be extended to provide prioritization
and fairness across multiple GPGPU applications.
Our analysis of memory interference in heterogeneous
systems, and our new Staged Memory Scheduler, have in-
spired a number of subsequent works. These works include
signicant research on predictable and/or deadline-aware
memory scheduling [91, 92, 194, 195, 197, 201, 202, 216], and
on other memory scheduling algorithms for heterogeneous
systems [161, 201, 202, 207].
8. Conclusion
While many advancements in memory scheduling policies
have been made to deal with multi-core processors, the inte-
gration of GPUs on the same chip as the CPUs has created new
system design challenges. Our ISCA 2012 paper [14] demon-
strates how the inclusion of GPU memory traffic can cause
severe difficulties for existing memory controller designs in
terms of performance and especially fairness. We propose a
new approach, Staged Memory Scheduler, which delivers su-
perior performance and fairness for integrated CPU-GPU sys-
tems compared to state-of-the-art memory schedulers, while
providing a design that is significantly simpler to implement
(thus improving the scalability of the memory controller).
The key insight behind simplifying the implementation of
SMS is that the primary functions of sophisticated memory
controller algorithms can be decoupled. As a result, SMS
proposes a multi-stage memory controller architecture. We
show that SMS significantly improves the performance and
fairness in integrated CPU-GPU systems. We hope and ex-
pect that our observations and mechanisms can inspire future
work in memory system design for existing and emerging
heterogeneous systems.
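The decoupling summarized above can be illustrated with a toy three-stage pipeline. This is a sketch under stated assumptions, not the SMS hardware design: requests are modeled as `(bank, row)` tuples, the function names are hypothetical, and the batch selection rule in stage two (serve the source with the fewest queued requests) is a placeholder standing in for the paper's inter-application scheduling policy.

```python
from itertools import groupby


def form_batches(requests):
    """Stage 1: group consecutive requests from one source by (bank, row)."""
    return [list(group) for _, group in groupby(requests)]


def schedule_batches(per_source_batches):
    """Stage 2: choose whole batches, one at a time, across sources.

    Placeholder policy: the source with the fewest queued requests goes
    first; ties are broken by source name.
    """
    queues = {src: list(b) for src, b in per_source_batches.items()}
    order = []
    while any(queues.values()):
        src = min((s for s in queues if queues[s]),
                  key=lambda s: (sum(len(b) for b in queues[s]), s))
        order.append((src, queues[src].pop(0)))
    return order


def issue(order):
    """Stage 3: issue requests in batch order; DRAM timing is omitted."""
    return [(src, req) for src, batch in order for req in batch]
```

Because each stage only sees the output of the previous one, the row-locality logic, the inter-application policy, and the command issue logic can each stay simple.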
Acknowledgments
We thank Saugata Ghose for his dedicated effort in the
preparation of this article. We thank Stephen Somogyi and
Fritz Kruger at AMD for their assistance with the modeling
of the GPU applications. We also thank Antonio Gonzalez,
anonymous reviewers and members of the SAFARI group
at CMU for their feedback. We acknowledge the generous
support of AMD, Intel, Oracle, and Samsung. This research
was also partially supported by grants from the NSF (CAREER
Award CCF-0953246 and CCF-1147397), GSRC, and Intel ARO
Memory Hierarchy Program.
References
[1] Advanced Micro Devices, “AMD Accelerated Processing Units.”
[2] Advanced Micro Devices, “ATI Radeon GPGPUs.”
[3] Advanced Micro Devices, “AMD Radeon R9 290X,” 2013.
[4] A. Agarwal, B. H. Lim, D. Kranz, and J. Kubiatowicz, “APRIL: A Processor Archi-
tecture for Multiprocessing,” in ISCA, 1990.
[5] A. Agrawal, A. Ansari, and J. Torrellas, “Mosaic: Exploiting the Spatial Locality
of Process Variation to Reduce Refresh Energy in On-chip eDRAM Modules,” in
HPCA, 2014.
[6] A. Agrawal, M. O’Connor, E. Bolotin, N. Chatterjee, J. Emer, and S. Keckler,
“CLARA: Circular Linked-List Auto and Self Refresh Architecture,” in MEMSYS,
2016.
[7] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable Processing-in-memory
Accelerator for Parallel Graph Processing,” in ISCA, 2015.
[8] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled Instructions: A Low-
overhead, Locality-aware Processing-in-memory Architecture,” in ISCA, 2015.
[9] J. H. Ahn, J. Leverich, R. Schreiber, and N. P. Jouppi, “Multicore DIMM: an Energy
Ecient Memory Module with Independently Controlled DRAMs,” IEEE CAL,
2009.
[10] J. H. Ahn, N. P. Jouppi, C. Kozyrakis, J. Leverich, and R. S. Schreiber, “Improving
System Energy Eciency with Memory Rank Subsetting,” ACM TACO, vol. 9,
no. 1, pp. 4:1–4:28, 2012.
[11] B. Akin, F. Franchetti, and J. C. Hoe, “Data Reorganization in Memory Using
3D-stacked DRAM,” in ISCA, 2015.
[12] A. R. Alameldeen and D. A. Wood, “Interactions Between Compression and
Prefetching in Chip Multiprocessors,” in HPCA, 2007.
[13] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith,
“The Tera Computer System,” in ICS, 1990.
[14] R. Ausavarungnirun, K. Chang, L. Subramanian, G. Loh, and O. Mutlu, “Staged
Memory Scheduling: Achieving High Performance and Scalability in Heteroge-
neous Systems,” in ISCA, 2012.
[15] R. Ausavarungnirun, S. Ghose, O. Kayıran, G. H. Loh, C. R. Das, M. T. Kandemir,
and O. Mutlu, “Exploiting Inter-Warp Heterogeneity to Improve GPGPU Perfor-
mance,” in PACT, 2015.
[16] R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach,
and O. Mutlu, “Mosaic: A GPU Memory Manager with Application-Transparent
Support for Multiple Page Sizes,” in MICRO, 2017.
[17] R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog, C. J. Ross-
bach, and O. Mutlu, “MASK: Redesigning the GPU Memory Hierarchy to Support
Multi-Application Concurrency,” in ASPLOS, 2018.
[18] M. Awasthi, D. W. Nellans, K. Sudan, R. Balasubramonian, and A. Davis, “Han-
dling the Problems and Opportunities Posed by Multiple On-chip Memory Con-
trollers,” in PACT, 2010.
[19] O. O. Babarinsa and S. Idreos, “JAFAR: Near-Data Processing for Databases,” in
SIGMOD, 2015.
[20] S. Baek, S. Cho, and R. Melhem, “Refresh Now and Then,” IEEE TC, vol. 63, no. 12,
pp. 3114–3126, 2014.
[21] J.-L. Baer and T.-F. Chen, “Effective Hardware-Based Data Prefetching for High-
Performance Processors,” IEEE TC, vol. 44, no. 5, pp. 609–623, 1995.
[22] A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing CUDA
Workloads Using a Detailed GPU Simulator,” in ISPASS, 2009.
[23] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. A. Stokes,
“The Illiac IV Computer,” IEEE TC, vol. 100, no. 8, pp. 746–757, 1968.
[24] I. Bhati, Z. Chishti, S.-L. Lu, and B. Jacob, “Flexible Auto-refresh: Enabling Scal-
able and Energy-efficient DRAM Refresh Reductions,” in ISCA, 2015.
[25] A. Boroumand et al., “Google Workloads for Consumer Devices: Mitigating Data
Movement Bottlenecks,” in ASPLOS, 2018.
[26] A. Boroumand, S. Ghose, B. Lucia, K. Hsieh, K. Malladi, H. Zheng, and
O. Mutlu, “LazyPIM: An Efficient Cache Coherence Mechanism for Processing-
in-Memory,” IEEE CAL, 2016.
[27] D. Bouvier and B. Sander, “Applying AMD’s "Kaveri" APU for Heterogeneous
Computing,” in Hot Chips, 2014.
[28] B. Burgess, B. Cohen, J. Dundas, J. Rupley, D. Kaplan, and M. Denman, “Bobcat:
AMD’s Low-Power x86 Processor,” IEEE Micro, 2011.
[29] P. Cao, E. W. Felten, A. R. Karlin, and K. Li, “A Study of Integrated Prefetching
and Caching Strategies,” in SIGMETRICS, 1995.
[30] J. Carter et al., “Impulse: Building a Smarter Memory Controller,” in HPCA, 1999.
[31] K. Chandrasekar, S. Goossens, C. Weis, M. Koedam, B. Akesson, N. Wehn, and
K. Goossens, “Exploiting Expendable Process-Margins in DRAMs for Run-Time
Performance Optimization,” in DATE, 2014.
[32] K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhi-
menko, S. Khan, and O. Mutlu, “Understanding Latency Variation in Modern
DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in
SIGMETRICS, 2016.
[33] K. Chang, D. Lee, Z. Chishti, A. Alameldeen, C. Wilkerson, Y. Kim, and O. Mutlu,
“Improving DRAM Performance by Parallelizing Refreshes with Accesses,” in
HPCA, 2014.
[34] K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi, and O. Mutlu, “Low-Cost
Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in
DRAM,” in HPCA, 2016.
[35] K. K. Chang, A. G. Yaglikci, S. Ghose, A. Agrawal, N. Chatterjee, A. Kashyap,
D. Lee, M. O’Connor, H. Hassan, and O. Mutlu, “Understanding Reduced-Voltage
Operation in Modern DRAM Devices: Experimental Characterization, Analysis,
and Mechanisms,” in SIGMETRICS, 2017.
[36] N. Chatterjee, M. Shevgoor, R. Balasubramonian, A. Davis, Z. Fang, R. Illikkal,
and R. Iyer, “Leveraging Heterogeneity in DRAM Main Memories to Accelerate
Critical Word Access,” in MICRO, 2012.
[37] M. Clark, “A New X86 Core Architecture for the Next Generation of Computing,”
in Hot Chips, 2016.
[38] Control Data Corporation, “Control Data 7600 Computer Systems Reference
Manual,” 1972.
[39] R. Cooksey, S. Jourdan, and D. Grunwald, “A Stateless, Content-directed Data
Prefetching Mechanism,” in ASPLOS, 2002.
[40] B. A. Crane and J. A. Githens, “Bulk Processing in Distributed Logic Memory,”
IEEE EC, 1965.
[41] F. Dahlgren, M. Dubois, and P. Stenström, “Sequential Hardware Prefetching in
Shared-Memory Multiprocessors,” IEEE TPDS, vol. 6, no. 7, pp. 733–746, 1995.
[42] R. Das, R. Ausavarungnirun, O. Mutlu, A. Kumar, and M. Azimi, “Application-
to-core Mapping Policies to Reduce Memory System Interference in Multi-core
Systems,” in HPCA, 2013.
[43] R. Das, O. Mutlu, T. Moscibroda, and C. R. Das, “Application-Aware Prioritization
Mechanisms for On-Chip Networks,” in MICRO, 2009.
[44] J. Draper et al., “The Architecture of the DIVA Processing-in-memory Chip,” in
ICS, 2002.
[45] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Fairness via Source Throttling: A
Congurable and High-performance Fairness Substrate for Multi-core Memory
Systems,” in ASPLOS, 2010.
[46] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. N. Patt, “Prefetch-aware Shared Resource
Management for Multi-core Systems,” in ISCA, 2011.
[47] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N.
Patt, “Parallel Application Memory Scheduling,” in MICRO, 2011.
[48] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt, “Coordinated Control of Multiple
Prefetchers in Multi-core Systems,” in MICRO, 2009.
[49] E. Ebrahimi, O. Mutlu, and Y. N. Patt, “Techniques for Bandwidth-efficient
Prefetching of Linked Data Structures in Hybrid Prefetching Systems,” in HPCA,
2009.
[50] S. Eyerman and L. Eeckhout, “System-Level Performance Metrics for Multipro-
gram Workloads,” IEEE Micro, 2008.
[51] S. Eyerman and L. Eeckhout, “Restating the Case for Weighted-IPC Metrics to
Evaluate Multiprogram Workload Performance,” IEEE CAL, 2014.
[52] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “NDA: Near-DRAM
Acceleration Architecture Leveraging Commodity DRAM Devices and Standard
Memory Modules,” in HPCA, 2015.
[53] M. Flynn, “Very High-Speed Computing Systems,” Proc. of the IEEE, vol. 54, no. 2,
1966.
[54] B. B. Fraguela, J. Renau, P. Feautrier, D. Padua, and J. Torrellas, “Programming
the FlexRAM Parallel Intelligent Memory System,” in PPoPP, 2003.
[55] M. Gao, G. Ayers, and C. Kozyrakis, “Practical Near-Data Processing for In-
Memory Analytics Frameworks,” in PACT, 2015.
[56] M. Gao and C. Kozyrakis, “HRL: Efficient and Flexible Reconfigurable Logic for
Near-data Processing,” in HPCA, 2016.
[57] S. Ghose, H. Lee, and J. F. Martínez, “Improving Memory Scheduling via
Processor-side Load Criticality Information,” in ISCA, 2013.
[58] M. Gokhale, B. Holmes, and K. Iobst, “Processing in Memory: the Terasys Mas-
sively Parallel PIM Array,” Computer, vol. 28, no. 4, pp. 23–31, 1995.
[59] M. Gschwind, “Chip Multiprocessing and the Cell Broadband Engine,” in CF,
2006.
[60] J. Gummaraju, M. Erez, J. Coburn, M. Rosenblum, and W. J. Dally, “Architec-
tural Support for the Stream Execution Model on General-Purpose Processors,”
in PACT, 2007.
[61] Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T.-M. Low, L. Pileggi, J. C. Hoe, and
F. Franchetti, “3D-Stacked Memory-Side Acceleration: Accelerator and System
Design,” in WONDP, 2014.
[62] R. H. Halstead and T. Fujita, “MASA: A Multithreaded Processor Architecture
for Parallel Symbolic Computing,” in ISCA, 1988.
[63] C. A. Hart, “CDRAM in a Unified Memory Architecture,” in Intl. Computer Con-
ference, 1994.
[64] M. Hashemi, O. Mutlu, and Y. N. Patt, “Continuous Runahead: Transparent Hard-
ware Acceleration for Memory Intensive Workloads,” in MICRO, 2016.
[65] M. Hashemi, Khubaib, E. Ebrahimi, O. Mutlu, and Y. N. Patt, “Accelerating De-
pendent Cache Misses with an Enhanced Memory Controller,” in ISCA, 2016.
[66] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and
O. Mutlu, “ChargeCache: Reducing DRAM Latency by Exploiting Row Access
Locality,” in HPCA, 2016.
[67] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee,
O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-source Infras-
tructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[68] H. Hellerman, “Parallel Processing of Algebraic Expressions,” IEEE Transactions
on Electronic Computers, 1966.
[69] H. Hidaka, Y. Matsuda, M. Asakura, and K. Fujishima, “The Cache DRAM Archi-
tecture,” IEEE Micro, 1990.
[70] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and
O. Mutlu, “Accelerating Pointer Chasing in 3D-stacked Memory: Challenges,
Mechanisms, Evaluation,” in ICCD, 2016.
[71] K. Hsieh, E. Ebrahimi, G. Kim, N. Chatterjee, M. O’Connor, N. Vijaykumar,
O. Mutlu, and S. W. Keckler, “Transparent Offloading and Mapping (TOM): En-
abling Programmer-transparent Near-data Processing in GPU Systems,” in ISCA,
2016.
[72] W.-C. Hsu and J. E. Smith, “Performance of Cached DRAM Organizations in
Vector Supercomputers,” in ISCA, 1993.
[73] I. Hur and C. Lin, “Memory Prefetching Using Adaptive Stream Detection,” in
MICRO, 2006.
[74] T. Ikeda and K. Kise, “Application Aware DRAM Bank Partitioning in CMP,” in
ICPADS, 2013.
[75] Intel Corp., “Intel® I/O Acceleration Technology.”
[76] Intel Corp., “Sandy Bridge Intel Processor Graphics Perfor-
mance Developer’s Guide,” https://software.intel.com/en-us/articles/
intel-snbgraphics-developers-guides, 2012.
[77] Intel Corp., “Introduction to Intel® Architecture,” 2014.
[78] Intel Corp., “6th Generation Intel® Core™ Processor Family Datasheet, Vol. 1,”
2017.
[79] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, “Self-optimizing memory con-
trollers: A reinforcement learning approach,” in ISCA, 2008.
[80] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver, “A QoS-Aware Memory Con-
troller for Dynamically Balancing GPU and CPU Bandwidth Use in an MPSoC,”
in DAC, 2012.
[81] X. Jiang, Y. Solihin, L. Zhao, and R. Iyer, “Architecture Support for Improving
Bulk Memory Copying and Initialization Performance,” in PACT, 2009.
[82] A. Jog, O. Kayıran, T. Kesten, A. Pattnaik, E. Bolotin, N. Chatterjee, S. W. Keckler,
M. T. Kandemir, and C. R. Das, “Anatomy of GPU Memory System for Multi-
Application Execution,” in MEMSYS, 2015.
[83] A. Jog, O. Kayiran, A. Pattnaik, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das,
“Exploiting Core Criticality for Enhanced GPU Performance,” in SIGMETRICS,
2016.
[84] D. Joseph and D. Grunwald, “Prefetching Using Markov Predictors,” in ISCA,
1997.
[85] N. P. Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of
a Small Fully-Associative Cache and Prefetch Buffers,” in ISCA, 1990.
[86] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy,
“Introduction to the Cell Multiprocessor,” IBM JRD, 2005.
[87] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas,
“FlexRAM: Toward an Advanced Intelligent Memory System,” in ICCD, 1999.
[88] O. Kayıran, A. Jog, M. T. Kandemir, and C. R. Das, “Neither More Nor Less: Op-
timizing Thread-Level Parallelism for GPGPUs,” in PACT, 2013.
[89] O. Kayıran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H.
Loh, O. Mutlu, and C. R. Das, “Managing GPU Concurrency in Heterogeneous
Architectures,” in MICRO, 2014.
[90] G. Kedem and R. P. Koganti, “WCDRAM: A Fully Associative Integrated Cached-
DRAM with Wide Cache Lines,” Duke Univ. Dept. of Computer Science, Tech.
Rep. CS-1997-03, 1997.
[91] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar, “Bounding
Memory Interference Delay in COTS-Based Multi-Core Systems,” in RTAS, 2014.
[92] H. Kim, D. de Niz, B. Andersson, M. Klein, O. Mutlu, and R. Rajkumar, “Bounding
and Reducing Memory Interference in COTS-Based Multi-Core Systems,” RTS,
2016.
[93] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly
Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability
Tradeoff in Modern DRAM Devices,” in HPCA, 2018.
[94] J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin,
C. Alkan, and O. Mutlu, “GRIM-Filter: Fast Seed Location Filtering in DNA Read
Mapping Using Processing-in-Memory Technologies,” BMC Genomics, 2018.
[95] J. Kim and M. C. Papaefthymiou, “Block-based Multi-period Refresh for Energy
Ecient Dynamic Memory,” in ASIC, 2001.
[96] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simu-
lator,” CAL, 2015.
[97] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and
O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental
Study of DRAM Disturbance Errors,” in ISCA, 2014.
[98] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A Scalable and High-
Performance Scheduling Algorithm for Multiple Memory Controllers,” in HPCA,
2010.
[99] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, “Thread Cluster Mem-
ory Scheduling: Exploiting Differences in Memory Access Behavior,” in MICRO,
2010.
[100] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A Case for Exploiting Subarray-
Level Parallelism (SALP) in DRAM,” in ISCA, 2012.
[101] P. M. Kogge, “EXECUBE-A New Architecture for Scaleable MPPs,” in ICPP, 1994.
[102] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating STT-
RAM as an energy-efficient main memory alternative,” in ISPASS, 2013.
[103] H.-K. Kuo, B. C. Lai, and J.-Y. Jou, “Reducing Contention in Shared Last-Level
Cache for Throughput Processors,” ACM TODAES, vol. 20, no. 1, 2014.
[104] A.-C. Lai, C. Fide, and B. Falsafi, “Dead-block Prediction & Dead-block Correlat-
ing Prefetchers,” in ISCA, 2001.
[105] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory
as a Scalable DRAM Alternative,” in ISCA, 2009.
[106] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Phase Change Memory Architecture
and the Quest for Scalability,” CACM, vol. 53, no. 7, pp. 99–106, 2010.
[107] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger,
“Phase-Change Technology and the Future of Main Memory,” IEEE Micro, vol. 30,
no. 1, pp. 143–143, 2010.
[108] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-aware Memory Con-
trollers,” IEEE TC, vol. 60, no. 10, pp. 1406–1430, 2011.
[109] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving Memory Bank-Level
Parallelism in the Presence of Prefetching,” in MICRO, 2009.
[110] C. J. Lee, E. Ebrahimi, V. Narasiman, O. Mutlu, and Y. N. Patt, “DRAM-Aware
Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory
Systems,” Univ. of Texas at Austin, High Performance Systems Group, Tech. Rep.
TR-HPS-2010-002, 2010.
[111] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-aware DRAM Con-
trollers,” in MICRO, 2008.
[112] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu, “Decoupled
Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-
Port DRAM,” in PACT, 2015.
[113] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-
layer Access: Improving 3D-stacked Memory Bandwidth at Low Cost,” TACO,
2016.
[114] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko,
V. Seshadri, and O. Mutlu, “Design-Induced Latency Variation in Modern DRAM
Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in SIG-
METRICS, 2017.
[115] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu,
“Adaptive-latency DRAM: Optimizing DRAM Timing for the Common-case,” in
HPCA, 2015.
[116] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-latency
DRAM: A Low Latency and Low Cost DRAM Architecture,” in HPCA, 2013.
[117] S.-Y. Lee and C.-J. Wu, “CAWS: Criticality-Aware Warp Scheduling for GPGPU
Workloads,” in PACT, 2014.
[118] C. H. Lin, D. Y. Shen, Y. J. Chen, C. L. Yang, and M. Wang, “SECRET: Selective
Error Correction for Refresh Energy Reduction in DRAMs,” in ICCD, 2012.
[119] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym, “NVIDIA Tesla: A Unified
Graphics and Computing Architecture,” IEEE Micro, 2008.
[120] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of
Data Retention Behavior in Modern DRAM Devices: Implications for Retention
Time Proling Mechanisms,” in ISCA, 2013.
[121] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-aware Intelligent
DRAM Refresh,” in ISCA, 2012.
[122] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu, “A Software Memory Parti-
tion Approach for Eliminating Bank-level Interference in Multicore Systems,” in
PACT, 2012.
[123] W. Liu, P. Huang, T. Kun, T. Lu, K. Zhou, C. Li, and X. He, “LAMS: A Latency-
aware Memory Scheduling Policy for Modern DRAM Systems,” in IPCCC, 2016.
[124] S.-L. Lu, Y.-C. Lin, and C.-L. Yang, “Improving DRAM Latency with Dynamic
Asymmetric Subarray,” in MICRO, 2015.
[125] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J.
Reddi, and K. Hazelwood, “Pin: Building Customized Program Analysis Tools
with Dynamic Instrumentation,” in PLDI, 2005.
[126] Y. Luo, S. Govindan, B. Sharma, M. Santaniello, J. Meza, A. Kansal, J. Liu,
B. Khessib, K. Vaid, and O. Mutlu, “Characterizing Application Memory Error
Vulnerability to Optimize Datacenter Cost via Heterogeneous-Reliability Mem-
ory,” in DSN, 2014.
[127] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, “Smart Mem-
ories: A Modular Reconfigurable Architecture,” in ISCA, 2000.
[128] M. Mao, W. Wen, X. Liu, J. Hu, D. Wang, Y. Chen, and H. Li, “TEMP: Thread
Batch Enabled Memory Partitioning for GPU,” in DAC, 2016.
[129] J. Meng, D. Tarjan, and K. Skadron, “Dynamic Warp Subdivision for Integrated
Branch and Memory Divergence Tolerance,” in ISCA, 2010.
[130] J. Meza, Y. Luo, S. Khan, J. Zhao, Y. Xie, and O. Mutlu, “A Case for Efficient Hard-
ware/Software Cooperative Management of Storage and Memory,” in WEED,
2013.
[131] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling Efficient and
Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,”
IEEE CAL, 2012.
[132] Micron Technology, Inc., “576Mb: x18, x36 RLDRAM3,” 2011.
[133] R. Mijat, “Take GPU Processing Power Beyond Graphics with Mali GPU Com-
puting,” 2012.
[134] T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory
Service in Multi-core Systems,” in USENIX Security, 2007.
[135] T. Moscibroda and O. Mutlu, “Distributed Order Scheduling and Its Application
to Multi-core DRAM Controllers,” in PODC, 2008.
[136] J. Mukundan and J. F. Martinez, “MORSE: Multi-objective Reconfigurable Self-
optimizing Memory Scheduler,” in HPCA, 2012.
[137] S. P. Muralidhara, L. Subramanian, O. Mutlu, M. Kandemir, and T. Moscibroda,
“Reducing Memory Interference in Multicore Systems via Application-Aware
Memory Channel Partitioning,” in MICRO, 2011.
[138] O. Mutlu, H. Kim, and Y. N. Patt, “Address-value Delta (AVD) Prediction: Increas-
ing the Effectiveness of Runahead Execution by Exploiting Regular Memory Al-
location Patterns,” in MICRO, 2005.
[139] O. Mutlu, “Memory scaling: A systems architecture perspective,” in IMW, 2013.
[140] O. Mutlu, H. Kim, and Y. N. Patt, “Techniques for Efficient Processing in Runa-
head Execution Engines,” in ISCA, 2005.
[141] O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for
Chip Multiprocessors,” in MICRO, 2007.
[142] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing
Both Performance and Fairness of Shared DRAM Systems,” in ISCA, 2008.
[143] O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enabling
High-Performance and Fair Memory Controllers,” IEEE Micro, 2009.
[144] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, “Runahead Execution: An Al-
ternative to Very Large Instruction Windows for Out-of-Order Processors,” in
HPCA, 2003.
[145] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in Memory
Systems,” SUPERFRI, 2014.
[146] P. J. Nair, D.-H. Kim, and M. K. Qureshi, “ArchShield: Architectural Framework
for Assisting DRAM Scaling by Tolerating High Error Rates,” in ISCA, 2013.
[147] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt,
“Improving GPU Performance via Large Warps and Two-Level Warp Scheduling,”
in MICRO, 2011.
[148] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith, “AC/DC: An Adaptive Data Cache
Prefetcher,” in PACT, 2004.
[149] NVIDIA Corp., “NVIDIA’s Next Generation CUDA Compute Architecture:
Fermi,” http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_
compute_architecture_whitepaper.pdf, 2011.
[150] NVIDIA Corp., “NVIDIA’s Next Generation CUDA Compute Architecture: Ke-
pler GK110,” 2012.
[151] NVIDIA Corp., “NVIDIA GeForce GTX 750 Ti,” 2014.
[152] NVIDIA Corp., “NVIDIA Tegra K1,” http://www.nvidia.com/content/pdf/tegra_
white_papers/tegra-k1-whitepaper-v1.0.pdf, 2014.
[153] NVIDIA Corp., “NVIDIA Tegra X1,” https://international.download.nvidia.com/
pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf, 2015.
[154] NVIDIA Corp., “NVIDIA Tesla P100,” 2016.
[155] S. O, Y. H. Son, N. S. Kim, and J. H. Ahn, “Row-Buffer Decoupling: A Case for
Low-Latency DRAM Microarchitecture,” in ISCA, 2014.
[156] T. Ohsawa, K. Kai, and K. Murakami, “Optimizing the DRAM Refresh Count for
Merged DRAM/Logic LSIs,” in ISLPED, 1998.
[157] M. Oskin, F. T. Chong, and T. Sherwood, “Active Pages: A Computation Model
for Intelligent Memory,” in ISCA, 1998.
[158] M. Patel, J. S. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the
Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,”
in ISCA, 2017.
[159] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, “Pinpoint-
ing Representative Portions of Large Intel Itanium Programs with Dynamic In-
strumentation,” in MICRO, 2004.
[160] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis,
R. Thomas, and K. Yelick, “A Case for Intelligent RAM,” IEEE Micro, vol. 17, no. 2,
pp. 34–44, 1997.
[161] A. Pattnaik, X. Tang, A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu,
and C. R. Das, “Scheduling Techniques for GPU Architectures with Processing-
In-Memory Capabilities,” in PACT, 2016.
[162] G. Pekhimenko, E. Bolotin, N. Vijaykumar, O. Mutlu, T. C. Mowry, and S. W.
Keckler, “A Case for Toggle-aware Compression for GPU Systems,” in HPCA,
2016.
[163] G. Pekhimenko, T. Huberty, R. Cai, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and
T. C. Mowry, “Exploiting Compressed Block Size as an Indicator of Future Reuse,”
in HPCA, 2015.
[164] G. Pekhimenko, V. Seshadri, Y. Kim, H. Xin, O. Mutlu, M. A. Kozuch, P. B. Gib-
bons, and T. C. Mowry, “Linearly Compressed Pages: A Main Memory Compres-
sion Framework with Low Complexity and Low Latency,” in MICRO, 2013.
[165] G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C.
Mowry, “Base-delta-immediate Compression: Practical Data Compression for
On-chip Caches,” in PACT, 2012.
[166] S. Phadke and S. Narayanasamy, “MLP Aware Heterogeneous Memory System,”
in DATE, 2011.
[167] PowerVR, “PowerVR Hardware Architecture Overview for Developers,”
http://cdn.imgtec.com/sdk-documentation/PowerVR+Hardware.Architecture+
Overview+for+Developers.pdf, 2016.
[168] S. H. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuk-
tosunoglu, A. Davis, and F. Li, “NDC: Analyzing the Impact of 3D-stacked Mem-
ory+logic Devices on MapReduce Workloads,” in ISPASS, 2014.
[169] M. K. Qureshi, D. H. Kim, S. Khan, P. J. Nair, and O. Mutlu, “AVATAR: A Variable-
Retention-Time (VRT) Aware Refresh for DRAM Systems,” in DSN, 2015.
[170] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali,
“Enhancing Lifetime and Security of PCM-based Main Memory with Start-gap
Wear Leveling,” in MICRO, 2009.
[171] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable High Performance Main
Memory System Using Phase-change Memory Technology,” in ISCA, 2009.
[172] V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn, “PIN: A Binary Instrumenta-
tion Tool for Computer Architecture Research and Education,” in WCAE, 2004.
[173] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory Access
Scheduling,” in ISCA, 2000.
[174] T. G. Rogers, M. O’Connor, and T. M. Aamodt, “Cache-Conscious Wavefront
Scheduling,” in MICRO, 2012.
[175] T. G. Rogers, M. O’Connor, and T. M. Aamodt, “Divergence-Aware Warp Schedul-
ing,” in MICRO, 2013.
[176] R. M. Russell, “The CRAY-1 Computer System,” CACM, vol. 21, no. 1, pp. 63–72,
1978.
[177] Y. Sato et al., “Fast cycle RAM (FCRAM): A 20-ns Random Row Access, Pipe-
Lined Operating DRAM,” in VLSIC, 1998.
[178] P. B. Schneck, The CDC STAR-100. Boston, MA: Springer US, 1987, pp. 99–117.
http://dx.doi.org/10.1007/978-1-4615-7957-1_5
[179] D. N. Senzig and R. V. Smith, “Computer Organization for Array Processing,” in
AFIPS, 1965.
[180] S.-Y. Seo, “Methods of Copying a Page in a Memory Device and Methods of Man-
aging Pages in a Memory System,” U.S. Patent Application 20140185395, 2014.
[181] V. Seshadri, K. Hsieh, A. Boroumand, D. Lee, M. Kozuch, O. Mutlu, P. Gibbons,
and T. Mowry, “Fast Bulk Bitwise AND and OR in DRAM,” IEEE CAL, 2015.
[182] V. Seshadri, A. Bhowmick, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C.
Mowry, “The Dirty-Block Index,” in ISCA, 2014.
[183] V. Seshadri et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data
Copy and Initialization,” in MICRO, 2013.
[184] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch,
O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-memory Accelerator for
Bulk Bitwise Operations using Commodity DRAM Technology,” in MICRO, 2017.
[185] V. Seshadri, T. Mullins, A. Boroumand, O. Mutlu, P. B. Gibbons, M. A. Kozuch,
and T. C. Mowry, “Gather-Scatter DRAM: In-DRAM Address Translation to Im-
prove the Spatial Locality of Non-Unit Strided Accesses,” in MICRO, 2015.
[186] V. Seshadri, S. Yedkar, H. Xin, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C.
Mowry, “Mitigating Prefetcher-Caused Pollution Using Informed Caching Poli-
cies for Prefetched Blocks,” ACM TACO, vol. 11, no. 4, pp. 51:1–51:22, 2015.
[187] W. Shin, J. Yang, J. Choi, and L.-S. Kim, “NUAT: A Non-Uniform Access Time
Memory Controller,” in HPCA, 2014.
[188] D. L. Slotnick, W. C. Borck, and R. C. McReynolds, “The Solomon Computer – A
Preliminary Report,” in Workshop on Computer Organization, 1962.
[189] B. Smith, “Architecture and Applications of the HEP Multiprocessor Computer
System,” SPIE, 1981.
[190] Y. H. Son, S. O, Y. Ro, J. W. Lee, and J. H. Ahn, “Reducing Memory Access Latency
with Asymmetric DRAM Bank Organizations,” in ISCA, 2013.
[191] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, “Feedback Directed Prefetching:
Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,”
in HPCA, 2007.
[192] H. S. Stone, “A Logic-in-Memory Computer,” IEEE TC, vol. C-19, no. 1, pp. 73–78,
1970.
[193] J. Stuecheli, D. Kaseridis, D. Daly, H. C. Hunter, and L. K. John, “The Virtual
Write Queue: Coordinating DRAM and Last-level Cache Policies,” in ISCA, 2010.
[194] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “BLISS: Balancing
Performance, Fairness and Complexity in Memory Access Scheduling,” in IEEE
TPDS, 2016.
[195] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The Blacklisting
Memory Scheduler: Achieving high performance and fairness at low cost,” in
ICCD, 2014.
[196] L. Subramanian et al., “The application slowdown model: Quantifying and con-
trolling the impact of inter-application interference at shared caches and main
memory,” in MICRO, 2015.
[197] L. Subramanian, V. Seshadri, Y. Kim, B. Jaiyen, and O. Mutlu, “MISE: Providing
Performance Predictability and Improving Fairness in Shared Main Memory Sys-
tems,” in HPCA, 2013.
[198] Z. Sura et al., “Data Access Optimization in a Processing-in-memory System,” in
CF, 2015.
[199] J. E. Thornton, “Parallel Operation in the Control Data 6600,” in AFIPS FJCC,
1964.
[200] J. E. Thornton, Design of a Computer–The Control Data 6600. Scott Foresman &
Co, 1970.
[201] H. Usui, L. Subramanian, K. Chang, and O. Mutlu, “SQUASH: Simple QoS-aware
high-performance memory scheduler for heterogeneous systems with hardware
accelerators,” arXiv CoRR, 2015.
[202] H. Usui, L. Subramanian, K. Chang, and O. Mutlu, “DASH: Deadline-Aware High-
Performance Memory Scheduler for Heterogeneous Systems with Hardware Ac-
celerators,” ACM TACO, vol. 12, no. 4, Jan. 2016.
[203] H. Vandierendonck and A. Seznec, “Fairness Metrics for Multi-threaded
Processors,” IEEE CAL, Feb. 2011.
[204] R. Venkatesan, S. Herr, and E. Rotenberg, “Retention-aware Placement in DRAM
(RAPID): Software Methods for Quasi-non-volatile DRAM,” in HPCA, 2006.
[205] N. Vijaykumar, G. Pekhimenko, A. Jog, A. Bhowmick, R. Ausavarungnirun,
C. Das, M. Kandemir, T. C. Mowry, and O. Mutlu, “A Case for Core-Assisted Bot-
tleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist
Warps,” in ISCA, 2015.
[206] Vivante, “Vivante Vega GPGPU Technology,” http://www.vivantecorp.com/
index.php/en/technology/gpgpu.html, 2016.
[207] H. Wang, R. Singh, M. J. Schulte, and N. S. Kim, “Memory Scheduling Towards
High-throughput Cooperative Heterogeneous Computing,” in PACT, 2014.
[208] F. A. Ware and C. Hampel, “Improving Power and Data Efficiency with Threaded
Memory Modules,” in ICCD, 2006.
[209] S. Wasson, “AMD’s A8-3800 Fusion APU,” Oct. 2011.
[210] M. Xie, D. Tong, K. Huang, and X. Cheng, “Improving System Throughput and
Fairness Simultaneously in Shared Memory CMP Systems via Dynamic Bank
Partitioning,” in HPCA, 2014.
[211] D. Xiong, K. Huang, X. Jiang, and X. Yan, “Memory Access Scheduling Based
on Dynamic Multilevel Priority in Shared DRAM Systems,” ACM TACO, vol. 13,
no. 4, Dec. 2016.
[212] H. Yoon, J. Meza, R. Ausavarungnirun, R. Harding, and O. Mutlu, “Row Buffer
Locality Aware Caching Policies for Hybrid Memories,” in ICCD, 2012.
[213] G. Yuan, A. Bakhoda, and T. Aamodt, “Complexity Effective Memory Access
Scheduling for Many-Core Accelerator Architectures,” in MICRO, 2009.
[214] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Igna-
towski, “TOP-PIM: Throughput-oriented Programmable Processing in Memory,”
in HPDC, 2014.
[215] L. Zhang, Z. Fang, M. Parker, B. K. Mathew, L. Schaelicke, J. B. Carter, W. C. Hsieh,
and S. A. McKee, “The Impulse Memory Controller,” IEEE TC, vol. 50, no. 11, pp.
1117–1132, 2001.
[216] J. Zhao, O. Mutlu, and Y. Xie, “FIRM: Fair and High-Performance Memory Con-
trol for Persistent Memory Systems,” in MICRO, 2014.
[217] L. Zhao, R. Iyer, S. Makineni, L. Bhuyan, and D. Newell, “Hardware Support for
Bulk Data Movement in Server Platforms,” in ICCD, 2005.
[218] H. Zheng, J. Lin, Z. Zhang, E. Gorbatov, H. David, and Z. Zhu, “Mini-rank: Adaptive
DRAM Architecture for Improving Memory Power Efficiency,” in MICRO, 2008.
[219] Z. Zheng, Z. Wang, and M. Lipasti, “Adaptive Cache and Concurrency Allocation
on GPGPUs,” IEEE CAL, 2014.
[220] W. K. Zuravleff and T. Robinson, “Controller for a Synchronous DRAM That
Maximizes Throughput by Allowing Memory Requests and Commands to Be
Issued Out of Order,” US Patent No. 5,630,096, 1997.
