Predictable Performance and Fairness Through Accurate Slowdown
  Estimation in Shared Main Memory Systems by Subramanian, Lavanya et al.
Lavanya Subramanian1,2 Vivek Seshadri3,2
Yoongu Kim2 Ben Jaiyen4,2 Onur Mutlu5,2
1Intel Labs 2Carnegie Mellon University 3Microsoft Research India 4Google 5ETH Zürich
This paper summarizes the ideas and key concepts of MISE
(Memory Interference-induced Slowdown Estimation), which
was published in HPCA 2013 [97], and examines the work’s
significance and future potential. Applications running
concurrently on a multicore system interfere with each other at
the main memory. This interference can slow down different
applications differently. Accurately estimating the slowdown
of each application in such a system can enable mechanisms
that can enforce quality-of-service. While much prior work
has focused on mitigating the performance degradation due
to inter-application interference, there is little work on
accurately estimating slowdown of individual applications in a
multi-programmed environment. Our goal is to accurately
estimate application slowdowns, towards providing predictable
performance.
To this end, we rst build a simple Memory Interference-
induced Slowdown Estimation (MISE) model, which accurately
estimates slowdowns caused by memory interference. We then
leverage ourMISEmodel to develop two newmemory scheduling
schemes: 1) one that provides soft quality-of-service guarantees,
and 2) another that explicitly attempts to minimize maximum
slowdown (i.e., unfairness) in the system. Evaluations show that
our techniques perform signicantly better than state-of-the-art
memory scheduling approaches to address the above problems.
Our proposed model and techniques have enabled signicant
research in the development of accurate performance models [35,
59, 98, 110] and interference management mechanisms [66, 66,
99, 100, 108, 119, 120].
1. Problem: Unpredictable Slowdowns
In a multicore system, multiple applications are consolidated
on the same machine. While consolidation may enable
better resource utilization, it results in interference between
applications at the shared resources, slowing down each
application to a different degree. Specifically, main memory is a
heavily contended shared resource between applications in a
multicore system. Each application accessing the memory
experiences different and unpredictable slowdowns depending
on the available memory bandwidth and the other
concurrently running applications.
A large body of work proposed several different approaches
to mitigate memory interference between applications with
the goal of improving overall system performance. This in-
cludes memory scheduling [2,18,27,32,42,43,50,72,76,77,80,99,
100, 103, 117], memory channel/bank partitioning [36, 64, 74],
memory interleaving [38], source throttling [3, 7, 17, 19, 102],
and thread scheduling [14,101,106,121] techniques. However,
few previous works (notably [15, 17, 19, 76]) have attempted
to estimate individual application slowdowns online with the
goal of providing predictable performance.
Our goal in our HPCA 2013 paper [97] is to provide
predictable performance for individual applications. To this
end, we first design a model to accurately estimate
memory-interference-induced slowdowns of applications running
concurrently on a multicore system. We then leverage this model
to design effective mechanisms to enforce quality-of-service
(QoS) and achieve fairness.
2. The Memory Interference-Induced
Slowdown Estimation (MISE) Model
The slowdown of an application indicates the performance
of the application, when it is sharing resources with other
applications, relative to when the application is run alone.
Slowdown can be expressed as
$$\text{Slowdown of an App.} = \frac{\text{alone-performance}}{\text{shared-performance}} \tag{1}$$
Hence, estimating the slowdown of an individual application
requires two pieces of information: 1) the performance of
the application when it is run concurrently with other ap-
plications (i.e., shared-performance), and 2) the performance
of the application when it is run alone on the same system
(i.e., alone-performance). While the former can be directly
measured, the key challenge is to estimate the performance
the application would have if it were running alone while
it is actually running alongside other applications. This
requires quantifying the effect of interference on application
performance.
2.1. Key Observations
In this work, we make two observations that lead to a
simple and effective model to estimate the slowdown of
individual applications.
Observation 1: The performance of a memory-bound application
is roughly proportional to the rate at which its memory
requests are served. This observation stems from the fact that
a memory-bound application spends an overwhelmingly large
fraction of its execution time stalling on memory
accesses. Therefore, the rate at which such an application’s
requests are served has a significant impact on its performance.
arXiv:1805.05926v1 [cs.AR] 15 May 2018
To validate this observation, we conducted a real-system
experiment where we ran a memory-bound application from
the SPEC CPU2006 benchmark suite [96] alongside three
copies of a microbenchmark whose memory intensity can
be varied, on a 4-core Intel Core i7 [31].1 By varying the
memory intensity, i.e., the last-level cache (LLC) miss rate, of
the microbenchmark, we can change the rate at which the
requests of the SPEC application are served. Figure 1 plots
the results of this experiment for three memory-intensive
benchmarks, mcf, omnetpp, and astar. The figure shows the
performance of each application versus the rate at which its
requests are served.
[Figure 1 plot: Normalized Performance (norm. to performance when run alone) vs. Normalized Request Service Rate (norm. to request service rate when run alone), both axes from 0.3 to 1, for mcf, omnetpp, and astar.]
Figure 1: Request service rate vs. performance. Reproduced
from [97].
The results of this experiment validate our observation.
The performance of a memory-bound application is directly
proportional to the rate at which its requests are served. This
suggests that we can use the request-service-rate of an
application as a proxy for its performance. More specifically, we
can estimate the slowdown of an application, i.e., the ratio
of its performance when it is run alone on a system vs. its
performance when it is run alongside other applications on
the same system, as follows:
$$\text{Slowdown of an App.} = \frac{\text{alone-request-service-rate}}{\text{shared-request-service-rate}} \tag{2}$$
Estimating the shared-request-service-rate (SRSR) of an ap-
plication is straightforward. It only requires the memory
controller to keep track of how many requests of the appli-
cation are served in a given number of cycles. However, the
challenge is to estimate the alone-request-service-rate (ARSR)
of an application while it is run alongside other applications.
A naive way of estimating ARSR of an application would be
to prevent all other applications from accessing memory for
a length of time and measure the application’s ARSR. While
this approach might provide an estimate of the application’s
ARSR, it would significantly slow down other applications in
the system and is prone to incorrect estimations due to phase
fluctuations in the application. Our second observation helps
us to address this problem.
1 The microbenchmark streams through a large region of memory (one
block at a time). The memory intensity of the microbenchmark (last-level
cache misses per kilo-instruction, i.e., LLC MPKI) is varied by changing the
amount of computation performed between memory operations.
Observation 2: The ARSR of an application can be estimated
by giving the requests of the application the highest priority in
accessing memory.
Giving an application’s requests the highest priority in
accessing memory results in very little interference from
the requests of other applications. Therefore, requests of
the application are served almost as if the application were
the only one running on the system. Based on the above
observation, the ARSR of an application can be estimated as:
$$\text{ARSR of an App.} = \frac{\text{\# Requests with Highest Priority}}{\text{\# Cycles with Highest Priority}} \tag{3}$$
where # Requests with Highest Priority is the number of re-
quests served when the application is given highest priority,
and # Cycles with Highest Priority is the number of cycles
an application is given highest priority by the memory con-
troller.
The memory controller can use Equation 3 to periodically
estimate the ARSR of an application. We add an interference
counter to capture the remaining interference cycles. The
details of the mechanisms we add to increase the accuracy
of the model are described in Section 4 of our HPCA 2013
paper [97]. Once we estimate ARSR, Equation 2 can be used
to estimate the slowdown of the application.
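The bookkeeping behind Equations 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the hardware mechanism of [97]: the `IntervalCounters` class and its field names are hypothetical, and the interference-counter correction mentioned above is omitted.

```python
from dataclasses import dataclass


@dataclass
class IntervalCounters:
    """Per-application counters for one estimation interval (hypothetical layout)."""
    requests_served: int      # requests served during the whole interval
    cycles: int               # length of the interval, in cycles
    hp_requests_served: int   # requests served while the app had highest priority
    hp_cycles: int            # cycles during which the app had highest priority


def shared_request_service_rate(c: IntervalCounters) -> float:
    """SRSR: requests served per cycle while sharing memory."""
    return c.requests_served / c.cycles


def alone_request_service_rate(c: IntervalCounters) -> float:
    """ARSR estimate following Equation 3 (no interference-cycle correction)."""
    return c.hp_requests_served / c.hp_cycles


def slowdown_memory_bound(c: IntervalCounters) -> float:
    """Slowdown estimate for a memory-bound application, following Equation 2."""
    return alone_request_service_rate(c) / shared_request_service_rate(c)


# Illustrative numbers only: the app is served twice as fast with highest priority.
example = IntervalCounters(requests_served=2000, cycles=1_000_000,
                           hp_requests_served=1000, hp_cycles=250_000)
estimated = slowdown_memory_bound(example)
```

With these made-up counters, the application is served 0.004 requests per cycle when prioritized and 0.002 requests per cycle overall, so the estimated slowdown is about 2.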
2.2. MISE Model for Non-Memory-Bound
Applications
So far, we have described the key observations of the MISE
model for a memory-bound application. We find that the
model presented above has low accuracy for non-memory-bound
applications. This is because a non-memory-bound
application spends a significant fraction of its execution time
in the compute phase (when the core is not stalled waiting for
memory). Hence, varying the request service rate for such
an application will not affect the length of the large compute
phase. Therefore, we take into account the duration of the
compute phase to make the model accurate for non-memory-bound
applications.
Let α be the fraction of time spent by an application stalling
at memory. Therefore, the fraction of time spent by the
application in the compute phase is 1 - α. Since changing the
request service rate affects only the memory phase, we
augment Equation 2 to take into account α as follows:
$$\text{Slowdown of an App.} = (1 - \alpha) + \alpha \, \frac{\text{ARSR}}{\text{SRSR}} \tag{4}$$
In addition to estimating ARSR and SRSR required by
Equation 2, Equation 4 requires estimating the parameter
α, the fraction of time spent in the memory phase. However,
precisely computing α for a modern out-of-order processor
is a challenge since such a processor overlaps computation
with memory accesses. The processor stalls waiting for
memory only when the oldest instruction in the reorder buffer is
waiting on a memory request. For this reason, we estimate
α as the fraction of time the processor spends stalling for
memory:
$$\alpha = \frac{\text{\# Cycles spent stalling on memory requests}}{\text{Total number of cycles}} \tag{5}$$
More details of our MISE slowdown estimation model are
described in Sections 3 and 4 of our HPCA 2013 paper [97].
More recently, we used this model to expand slowdown es-
timation to a memory hierarchy that also includes shared
caches, as part of the Application Slowdown Model [98].
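As a small worked illustration of Equations 4 and 5, the sketch below computes α from stall-cycle counts and plugs it into the augmented slowdown formula. The function names and the example numbers are assumptions made for illustration; they do not come from [97].

```python
def stall_fraction(stall_cycles: int, total_cycles: int) -> float:
    """Equation 5: fraction of cycles the core spends stalled on memory requests."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return stall_cycles / total_cycles


def mise_slowdown(alpha: float, arsr: float, srsr: float) -> float:
    """Equation 4: only the memory phase (fraction alpha) scales with ARSR/SRSR."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if srsr <= 0.0:
        raise ValueError("srsr must be positive")
    return (1.0 - alpha) + alpha * (arsr / srsr)


# Illustrative numbers: the app stalls on memory 25% of the time and is
# served three times faster when it has highest priority.
alpha = stall_fraction(stall_cycles=250_000, total_cycles=1_000_000)
non_memory_bound = mise_slowdown(alpha, arsr=3.0, srsr=1.0)

# With alpha = 1 the formula reduces to Equation 2 (ARSR / SRSR).
memory_bound = mise_slowdown(1.0, arsr=3.0, srsr=1.0)
```

In this example the same threefold change in request service rate yields an estimated slowdown of 1.5 for the application that stalls 25% of the time, versus 3.0 for a fully memory-bound one, which is why ignoring the compute phase overestimates slowdown for non-memory-bound applications.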
3. Evaluation of the MISE Model
We compare the MISE model against the slowdown es-
timation model employed by the Stall Time Fair Memory
Scheduler (STFM) [76], which is the closest previous work on
estimating memory interference-induced slowdown.2 STFM
estimates the slowdown of an application by estimating the
number of cycles it stalls due to interference from other ap-
plications’ requests. In this section, we qualitatively and
quantitatively compare MISE with STFM.
There are two key differences between MISE and STFM in
estimating slowdown. First, MISE uses request service rates
rather than stall times to estimate slowdown. In MISE, the
alone-request-service-rate of an application can be fairly
accurately estimated by giving the application highest priority
in accessing memory. Giving the application highest priority
in accessing memory results in very little interference from
other applications. In contrast, STFM attempts to estimate
the alone-stall-time of an application while it is receiving
significant interference from other applications, which turns
out to be difficult to do accurately. Second, MISE takes into
account the effect of the compute phase for non-memory-bound
applications. STFM, on the other hand, has no such
provision to account for the compute phase. As a result, MISE’s
slowdown estimates for non-memory-bound applications are
significantly more accurate than STFM’s estimates.
Figure 2 compares the accuracy of MISE with STFM for
two representative memory-bound applications, lbm and
leslie3d. Figure 3 compares the accuracy of MISE with STFM
for two representative non-memory-bound applications, wrf
and povray. Each of these applications is run on a 4-core
system with three other applications. Our detailed
experimental methodology is provided in Section 5 of our HPCA
2013 paper [97]. This includes detailed descriptions of our
experimental setup, workloads and metrics. Furthermore,
our simulator implementing the MISE model is available
online [90]. As can be observed, MISE’s slowdown estimates
are much closer to the actual slowdown than STFM’s
estimates. This is because the MISE model eliminates a significant
portion of the interference received by an application while
estimating slowdown, by prioritizing it in the memory
controller. On the other hand, STFM estimates slowdown while
an application is experiencing interference.
2 FST [17] and Du Bois et al.’s per-thread cycle accounting
mechanism [15] are the other two previous works that estimate application
slowdown. The mechanism to estimate main memory interference induced
slowdown in both of these previous works is similar to STFM.
[Figure 2 plots: Slowdown (1 to 4) vs. Million Cycles (0 to 200) for Actual, STFM, and MISE; panels (a) lbm and (b) leslie3d.]
Figure 2: Comparison of MISE with STFM for representative
memory-bound applications. Adapted from [97].
Table 1 shows the average slowdown estimation error for
each benchmark, with STFM and MISE, across 300 4-core
workloads of different memory intensities. As can be
observed, MISE’s slowdown estimates have significantly lower
error than STFM’s slowdown estimates across most
benchmarks. Across 300 workloads, STFM’s estimates deviate from
the actual slowdown by 29.8%, whereas our proposed MISE
model’s estimates deviate from the actual slowdown by only
8.1%. Therefore, we conclude that our slowdown estimation
model provides better accuracy than STFM.
For a more detailed analysis of the MISE model’s accuracy
and characteristics, we refer the reader to our HPCA 2013
paper [97].
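The text above does not spell out how the estimation error is computed, so the following sketch is only an assumption: it treats the error as the absolute difference between estimated and actual slowdown, relative to the actual slowdown, expressed in percent and averaged over samples. The function names and sample values are hypothetical.

```python
from typing import Iterable, Tuple


def estimation_error_percent(estimated: float, actual: float) -> float:
    """Assumed metric: absolute relative error of a slowdown estimate, in percent."""
    if actual <= 0.0:
        raise ValueError("actual slowdown must be positive")
    return abs(estimated - actual) / actual * 100.0


def average_error_percent(samples: Iterable[Tuple[float, float]]) -> float:
    """Mean error over (estimated, actual) slowdown pairs."""
    errors = [estimation_error_percent(est, act) for est, act in samples]
    if not errors:
        raise ValueError("need at least one sample")
    return sum(errors) / len(errors)


# Made-up samples: one overestimate and one underestimate, each off by 10%.
samples = [(2.2, 2.0), (1.8, 2.0)]
mean_error = average_error_percent(samples)
```

Because the absolute value is taken before averaging, overestimates and underestimates do not cancel; both made-up samples above contribute an error of about 10%.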
4. Leveraging the MISE Model
Accurate slowdown estimates are a key enabler towards
designing mechanisms to better enforce quality-of-service
(QoS) and fairness. Slowdown estimates from the MISE model
could be leveraged in hardware to design memory scheduling
policies to provide QoS guarantees and fairness. Alternatively,
the slowdown estimates could be communicated to the
system software, which could leverage them to perform
application scheduling, admission control and migration. We will
describe two such mechanisms that leverage the MISE model:
1) MISE-QoS, a mechanism to provide soft QoS guarantees
in the memory controller; and 2) MISE-Fair, a mechanism to
minimize maximum slowdown [13, 14, 42, 43, 92, 99, 100, 103]
to improve overall system fairness.
[Figure 3 plots: Slowdown (1 to 4) vs. Million Cycles (0 to 200) for Actual, STFM, and MISE; panels (a) wrf and (b) povray.]
Figure 3: Comparison of MISE with STFM for representative
non-memory-bound applications. Adapted from [97].
Table 1: Average slowdown estimation error for each
benchmark (in %). Adapted from [97].
Benchmark         STFM (%)   MISE (%)
453.povray        56.3       0.1
454.calculix      43.5       1.3
400.perlbench     26.8       1.6
447.dealII        37.5       2.4
436.cactusADM     18.4       2.6
450.soplex        29.8       3.5
444.namd          43.6       3.7
437.leslie3d      26.4       4.3
403.gcc           25.4       4.5
462.libquantum    48.9       5.3
459.GemsFDTD      21.6       5.5
470.lbm           6.9        6.3
473.astar         12.3       8.1
456.hmmer         17.9       8.1
464.h264ref       13.7       8.3
401.bzip2         28.3       8.5
458.sjeng         21.3       8.8
433.milc          26.4       9.5
481.wrf           33.6       11.1
429.mcf           83.74      11.5
445.gobmk         23.1       12.5
483.xalancbmk     18.0       13.6
435.gromacs       31.4       15.6
482.sphinx3       21         16.8
471.omnetpp       26.2       17.5
465.tonto         32.7       19.5
4.1. MISE-QoS: Providing Soft QoS Guarantees
MISE-QoS aims to provide soft slowdown guarantees to
an application of interest (AoI) in a workload with many
applications, while trying to maximize overall performance
for the remaining applications. There are two aspects of
providing a soft slowdown guarantee. One is to ensure that
the application of interest is not slowed down by more than
a system-software-specified bound. The other aspect is to
detect if the bound is not met for some reason.
MISE-QoS addresses both of these aspects by using slowdown
estimates from the MISE model. To address the first aspect, it
periodically obtains slowdown estimates from the MISE model and
increases/decreases the memory bandwidth allocated to the
AoI such that the AoI receives just enough bandwidth to meet
its slowdown bound. This enables the other applications to
use the remaining bandwidth, improving their performance.
MISE-QoS addresses the second aspect by periodically comparing
slowdown estimates from the MISE model with the prescribed
bound. When the estimated slowdown exceeds the bound
even though the AoI is always prioritized, MISE-QoS detects
that the bound cannot be met just by prioritizing the
application at the memory controller.
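The periodic adjustment described above can be pictured as a simple feedback step. The sketch below is one possible reading, not the mechanism of [97]: the fixed `step`, the `slack` margin, and the representation of the AoI's bandwidth allocation as a fraction in [0, 1] are all assumptions introduced for illustration.

```python
from typing import Tuple


def adjust_aoi_allocation(allocation: float,
                          estimated_slowdown: float,
                          bound: float,
                          step: float = 0.05,
                          slack: float = 0.1) -> Tuple[float, bool]:
    """One periodic update of the AoI's bandwidth share.

    Returns (new_allocation, bound_unreachable). `step` and `slack` are
    hypothetical tuning knobs, not values from the paper.
    """
    if estimated_slowdown > bound:
        if allocation >= 1.0:
            # The AoI already gets all the bandwidth and still misses the bound:
            # prioritization at the memory controller alone cannot meet it.
            return 1.0, True
        return min(1.0, allocation + step), False
    if estimated_slowdown < bound - slack:
        # Comfortably within the bound: release bandwidth to other applications.
        return max(0.0, allocation - step), False
    # Close to the bound: keep the current allocation.
    return allocation, False


# Illustrative use: the AoI misses a bound of 2.0, so its share grows.
new_allocation, unreachable = adjust_aoi_allocation(0.5, 2.4, 2.0)
```

Giving bandwidth back whenever the estimate is comfortably below the bound is what lets the remaining applications benefit, in contrast to a policy that always prioritizes the AoI.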
Previous work [34] attempts to address the first aspect by
always prioritizing the AoI. This may unnecessarily slow
down other applications in the system by excessively
prioritizing the AoI, especially when the AoI is meeting its
performance bound. Furthermore, such a mechanism, in the
absence of accurate slowdown estimates, does not have the
provision to detect whether or not the bound is met.
Slowdown Evaluation. We evaluate the MISE-QoS
mechanism across 300 workloads with 10 different slowdown
bounds for each workload. Our results show that the MISE-QoS
mechanism meets the prescribed slowdown bound for
97.5% of the workloads for which the naive mechanism that
always prioritizes the AoI meets the bound, while improving
overall system performance by 12%. Furthermore, MISE-QoS
also predicts whether or not the bound is met with an
accuracy of 95.7%, while previous work [34] has no such provision.
To show the effectiveness of MISE-QoS, we compare the
AoI’s slowdown due to MISE-QoS and the mechanism that
always prioritizes the AoI (Always Prioritize) [34]. Figure 4
presents representative results for 8 different AoIs when they
are run alongside three other applications. The label
MISE-QoS-n corresponds to a slowdown bound of 10/n. (Note that
Always Prioritize does not take into account the slowdown
bound.) Note that the slowdown bound decreases (i.e.,
becomes tighter) from left to right for each benchmark in
Figure 4 (as well as in other figures).
We draw three conclusions from the results. First, for most
applications, the slowdown of Always Prioritize is
considerably more than one. This indicates that always prioritizing
the AoI does not completely prevent other applications from
interfering with the AoI. Second, as the slowdown bound
for the AoI is decreased (left to right), MISE-QoS gradually
increases the bandwidth allocation for the AoI, eventually
allocating all the available bandwidth to the AoI. At this point,
MISE-QoS performs very similarly to the Always Prioritize
mechanism. Third, in almost all cases (in this figure and
across all our 3000 data points), MISE-QoS meets the
specified slowdown bound if Always Prioritize is able to meet
the bound (see Section 8.1 of our HPCA 2013 paper [97] for
details).
[Figure 4 plot: QoS-Critical Application Slowdown (1 to 2.4) for perlbench, calculix, gromacs, cactusADM, bzip2, astar, leslie3d, milc, and Avg, under AlwaysPrioritize and MISE-QoS-1 through MISE-QoS-10.]
Figure 4: AoI performance: MISE-QoS vs. AlwaysPrioritize.
Reproduced from [97].
System Performance and Fairness. Figure 5 compares
the system performance (harmonic speedup) and fairness
(maximum slowdown) of MISE-QoS and Always Prioritize
for different values of the bound. We omit the AoI from the
performance and fairness calculations. The results are
categorized into four workload categories (0, 1, 2, 3) indicating the
number of memory-intensive benchmarks in the workload.
For clarity, the figure shows results only for a few slowdown
bounds. Three conclusions are in order.
First, MISE-QoS significantly improves performance
compared to Always Prioritize, especially when the slowdown
bound for the AoI is large. On average, when the bound is
10/3, MISE-QoS improves harmonic speedup [67] by 12% and
weighted speedup [22, 95] by 10% (not shown due to lack
of space) over Always Prioritize, while reducing maximum
slowdown [13, 14, 42, 43, 92, 99, 100, 103] by 13%. Second, as
expected, the performance and fairness of MISE-QoS approach
that of Always Prioritize as the slowdown bound is decreased
(going from left to right for a set of bars). Finally, the
benefits of MISE-QoS increase with increasing memory intensity
because always prioritizing a memory-intensive application
will cause significant interference to other applications.
Based on our results, we conclude that MISE-QoS can
effectively ensure that the AoI meets the specified slowdown
bound while achieving high system performance and fairness
across the other applications.
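For readers unfamiliar with the metrics used in this evaluation, the sketch below computes weighted speedup, harmonic speedup, and maximum slowdown from per-application IPC values, using the standard definitions of these metrics. The function names and the sample IPC values are hypothetical.

```python
from typing import Sequence


def slowdowns(ipc_alone: Sequence[float], ipc_shared: Sequence[float]) -> list:
    """Per-application slowdown: alone performance over shared performance."""
    if len(ipc_alone) != len(ipc_shared) or not ipc_alone:
        raise ValueError("need two non-empty sequences of equal length")
    return [alone / shared for alone, shared in zip(ipc_alone, ipc_shared)]


def weighted_speedup(ipc_alone: Sequence[float], ipc_shared: Sequence[float]) -> float:
    """Sum of per-application speedups (shared IPC over alone IPC)."""
    return sum(1.0 / s for s in slowdowns(ipc_alone, ipc_shared))


def harmonic_speedup(ipc_alone: Sequence[float], ipc_shared: Sequence[float]) -> float:
    """Harmonic mean of per-application speedups: N over the sum of slowdowns."""
    s = slowdowns(ipc_alone, ipc_shared)
    return len(s) / sum(s)


def maximum_slowdown(ipc_alone: Sequence[float], ipc_shared: Sequence[float]) -> float:
    """Unfairness metric: slowdown of the most slowed down application."""
    return max(slowdowns(ipc_alone, ipc_shared))


# Made-up IPC values for the three non-AoI applications of a 4-core workload.
alone = [2.0, 1.0, 1.5]
shared = [1.0, 0.5, 1.5]
ws = weighted_speedup(alone, shared)
hs = harmonic_speedup(alone, shared)
ms = maximum_slowdown(alone, shared)
```

In line with the methodology above, the AoI would be left out of the input lists when these metrics are computed for the MISE-QoS comparison.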
4.2. MISE-Fair: Minimizing Maximum Slowdown
The second mechanism we build on top of our MISE model
is one that seeks to improve overall system fairness. Specif-
ically, this mechanism attempts to minimize the maximum
slowdown across all applications in the system. Ensuring
that no application is unfairly slowed down while maintain-
ing high system performance is an important goal in multi-
core systems where co-executing applications are similarly
important. Many prior works evaluate fairness in such sce-
narios in terms of the maximum slowdown of any applica-
tion [13, 14, 42, 43, 92, 99, 100, 103].
At a high level, our mechanism works as follows. The
memory controller maintains two pieces of information: 1) a target
slowdown bound (B) for all applications, and 2) a bandwidth
allocation policy that partitions the available memory
bandwidth across all applications. The memory controller enforces
the bandwidth allocation policy using a lottery-scheduling
technique proposed in [105]. The controller attempts to
ensure that the slowdown of all applications is within the bound
B. To this end, it modifies the bandwidth allocation policy
so that applications that are slowed down more get more
memory bandwidth. Should the memory controller find that
bound B is not possible to meet, it increases the bound. On
the other hand, if the bound is easily met, it decreases the
bound.
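The control loop described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the algorithm of [97]: the fixed `step` and `bound_step` sizes, the rule for deciding when the bound is "easily met" or "not possible to meet", and all function names are hypothetical. The lottery draw simply picks an application with probability proportional to its bandwidth share.

```python
import random
from typing import Dict, Tuple


def update_fairness_state(slowdowns: Dict[str, float],
                          shares: Dict[str, float],
                          bound: float,
                          step: float = 0.02,
                          bound_step: float = 0.1) -> Tuple[Dict[str, float], float]:
    """One interval update of bandwidth shares and of the target bound B."""
    over = [app for app, s in slowdowns.items() if s > bound]
    under = [app for app in slowdowns if app not in over]
    if not over:
        # Every application is within B: tighten the bound (assumed rule).
        return dict(shares), max(1.0, bound - bound_step)
    if not under:
        # No application has bandwidth to spare: relax the bound (assumed rule).
        return dict(shares), bound + bound_step
    new_shares = dict(shares)
    taken = 0.0
    for app in under:
        # Take a little bandwidth from applications that are less slowed down.
        give = min(step, new_shares[app])
        new_shares[app] -= give
        taken += give
    for app in over:
        # Hand it to the applications that exceed the bound.
        new_shares[app] += taken / len(over)
    return new_shares, bound


def pick_next_app(shares: Dict[str, float], rng: random.Random) -> str:
    """Lottery draw: choose the app to prioritize, weighted by its share."""
    apps = list(shares)
    return rng.choices(apps, weights=[shares[a] for a in apps], k=1)[0]


est = {"a": 3.0, "b": 1.2, "c": 1.5, "d": 1.1}
shares = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
new_shares, new_bound = update_fairness_state(est, shares, bound=2.0)
winner = pick_next_app(new_shares, random.Random(0))
```

In this example only application `a` exceeds the bound of 2.0, so it gains the bandwidth taken from the other three while the bound itself stays unchanged.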
Interaction with the Operating System. As we show in the
evaluation later in this section, our mechanism provides the best fairness
compared to three state-of-the-art approaches for memory
request scheduling [42, 43, 76]. In addition to this, there is
another benefit to using our approach. Our mechanism, based
on the MISE model, can accurately estimate the slowdown of
each application. Therefore, the memory controller can
potentially communicate the estimated slowdown information
to the operating system (OS). The OS can use this information
to make more informed scheduling and mapping decisions
in order to further improve system performance or fairness.
Since prior memory scheduling approaches do not
explicitly attempt to minimize maximum slowdown by accurately
estimating the slowdown of individual applications, such a
mechanism to interact with the OS is not possible with them.
Evaluating the benefits of the interaction between our
mechanism and the OS is beyond the scope of this paper but is an
important area of future work.
Evaluation. Figure 6 compares the system fairness
(maximum slowdown) of different mechanisms with increasing
number of cores. The figure shows results with four
previously proposed memory scheduling policies (FRFCFS [89, 122],
ATLAS [42], TCM [43], and STFM [76]), and our proposed
mechanism using the MISE model (MISE-Fair). We draw three
conclusions from our results.
First, MISE-Fair provides the best fairness compared to all
other previous approaches. The reduction in the maximum
slowdown due to MISE-Fair when compared to STFM (the
best previous mechanism) increases with increasing num-
ber of cores. With 16 cores, MISE-Fair provides 7.2% better
fairness compared to STFM.
Second, STFM, as a result of prioritizing the most slowed
down application, provides better fairness than all other
previous approaches. While the slowdown estimates of STFM
are not as accurate as those of our mechanism, they are good
enough to identify the most slowed down application.
However, as the number of concurrently-running applications
increases, simply prioritizing the most slowed down
application may not lead to better fairness. MISE-Fair, on the other
hand, works towards reducing maximum slowdown by
stealing bandwidth from those applications that are less slowed
down compared to others. As a result, the fairness benefits
of MISE-Fair compared to STFM increase with increasing
number of cores.
[Figure 5 plots: Harmonic Speedup and Maximum Slowdown vs. Number of Memory Intensive Benchmarks in a Workload (0, 1, 2, 3, Avg), for AlwaysPrioritize, MISE-QoS-1, MISE-QoS-3, MISE-QoS-5, MISE-QoS-7, and MISE-QoS-9.]
Figure 5: Average system performance and fairness across 300 workloads of different memory intensities. Reproduced
from [97].
[Figure 6 plot: Maximum Slowdown (1 to 11) vs. Number of Cores (4, 8, 16), for FRFCFS, ATLAS, TCM, STFM, and MISE-Fair.]
Figure 6: Fairness with different core counts. Reproduced
from [97].
Third, ATLAS and TCM are more unfair compared to
FRFCFS. As shown in prior work [42, 43], ATLAS trades off
fairness to obtain better performance. TCM, on the other
hand, is designed to provide high system performance and
fairness. Further analysis showed us that the cause of TCM’s
unfairness is the strict ranking employed by TCM. TCM ranks
all applications based on its clustering and shuffling
techniques [43] and strictly enforces these rankings. We found
that such strict ranking destroys the row-buffer locality of
low-ranked applications. This increases the slowdown of
such applications, leading to high maximum slowdown.3
We conclude that the MISE model’s slowdown estimates
can be used to design a better and more fair memory scheduler.
We expect future works can take advantage of the MISE model
to design even better memory scheduling and other resource
management mechanisms.
3 Note that this observation later led us to develop the Blacklisting
Memory Scheduler (BLISS) [99, 100].
5. Related Work
To our knowledge, this is the first paper to 1) provide a
simple and accurate model to estimate application slowdowns
in the presence of main memory interference, and 2) use
this model to devise two new memory scheduling techniques
that either aim to satisfy slowdown bounds of applications or
improve system fairness and performance. In this section, we
discuss several related works. We discuss works that build
upon MISE in Section 6.1.
Slowdown Estimation. Stall Time Fair Memory Schedul-
ing (STFM) [76] attempts to estimate each application’s slow-
down, with the goal of improving fairness by prioritizing
the most slowed down application. STFM estimates an ap-
plication’s slowdown as the ratio of its memory stall time
when it is run alone versus when it is concurrently run along-
side other applications. The challenge is in determining the
alone stall time of an application while the application is actu-
ally running alongside other applications. STFM proposes to
address this challenge by counting the number of cycles an ap-
plication is stalled due to interference from other applications
at the DRAM channels, banks and row-buffers. STFM uses
this interference cycle count to estimate the alone-stall-time
of the application, and hence the application’s slowdown.
Fairness via Source Throttling (FST) [17] estimates appli-
cation slowdowns due to inter-application interference at the
shared caches and memory, as the ratio of uninterfered to
interfered execution times. FST uses the slowdown estimates
to make informed source throttling decisions, to improve
fairness. The mechanism to account for memory interfer-
ence to estimate uninterfered execution time is similar to that
employed in STFM. Prefetch-Aware Shared Resource Man-
agement [19] extends the FST model to take into account
prefetch requests.
A concurrent work by Du Bois et al. [15] proposes per-
thread cycle accounting (PTCA) for multicore processors,
which determines an application’s standalone execution time
when it shares cache and memory with other applications
in a multicore system. In order to quantify memory inter-
ference, PTCA counts the number of waiting cycles due to
inter-application interference and factors out these waiting
cycles to estimate alone execution times, which is similar to
STFM’s alone stall time estimation mechanism.
Eyerman and Eeckhout [23] and Cazorla et al. [5] propose
mechanisms to determine an application’s slowdown while
it is running alongside other applications on an SMT proces-
sor. Luque et al. [68] estimate application slowdowns in the
presence of shared cache interference. Lin and Balasubra-
monian [60] propose a regression-based model to estimate
performance for different cache allocations. None of these
studies take into account inter-application interference at
the main memory. Therefore, MISE, which estimates slow-
down due to main memory interference, can be combined
with the above approaches to quantify interference at the
SMT processor and shared cache to build a comprehensive
mechanism.
Quality-of-Service (QoS). Several prior works provide
QoS guarantees in shared memory CMP systems. Mars et
al. [69] propose a mechanism to estimate an application’s
sensitivity towards interference and its propensity to cause
interference. They utilize this knowledge to make informed
mapping decisions between applications and cores. However,
this mechanism 1) assumes a priori knowledge of applications,
which may not always be possible to have, and 2) is designed
for only 2 cores, and it is not clear how it can be extended to
more than 2 cores. In contrast, MISE does not assume any a
priori knowledge of applications and works well with large
core counts, as we have shown in this paper. That said, MISE
can possibly be used to provide feedback to the mapping
mechanism proposed by [69] to overcome the shortcomings
of their mechanism.
Iyer et al. [30, 33, 34] propose mechanisms to provide guar-
antees on shared cache space, memory bandwidth or IPC for
dierent applications. The slowdown guarantee provided by
MISE-QoS is stricter than these mechanisms as MISE-QoS
takes into account the alone-performance of each applica-
tion. Nesbit et al. [80] propose a mechanism to enforce a
bandwidth allocation policy, by partitioning the available
bandwidth across concurrently running applications based
on some policy. While we use a scheduling technique similar
to lottery-scheduling [85, 105] to enforce the bandwidth allo-
cation policies of MISE-QoS and MISE-Fair, the mechanism
proposed by Nesbit et al. can also be used in our proposal
to allocate bandwidth instead of our lottery-scheduling ap-
proach.
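
As an illustration of how a lottery-style scheme can enforce a bandwidth allocation, the following Python sketch picks, for each scheduling epoch, the application whose requests receive highest priority, with probability equal to its allocated bandwidth fraction. This is a minimal sketch under stated assumptions, not the hardware mechanism itself: the function names, the epoch abstraction, and the example shares are hypothetical.

```python
import random
from collections import Counter


def pick_highest_priority(shares, rng):
    """Draw one application, weighted by its bandwidth share (its "tickets").

    shares: dict mapping application name -> non-negative bandwidth fraction.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    apps = list(shares)
    weights = [shares[app] for app in apps]
    return rng.choices(apps, weights=weights, k=1)[0]


def simulate_epochs(shares, num_epochs, seed=0):
    """Count how many epochs each application spends at highest priority."""
    rng = random.Random(seed)
    return Counter(pick_highest_priority(shares, rng) for _ in range(num_epochs))


if __name__ == "__main__":
    # Hypothetical allocation: application A is given 60% of the bandwidth.
    print(simulate_epochs({"A": 0.6, "B": 0.3, "C": 0.1}, num_epochs=10000))
```

Over many epochs, the fraction of epochs each application spends at highest priority approaches its share, which is what makes a probabilistic scheme a plausible substitute for a deterministic bandwidth partitioner.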
Memory Interference Mitigation. Many prior works
focus on the problem of mitigating inter-application interfer-
ence at the main memory to improve system performance
and/or fairness. Most of these approaches address memory
interference by modifying the memory request scheduling
algorithm [2, 18, 27, 32, 34, 42, 43, 50, 51, 52, 53, 72, 73, 76, 77, 80,
99, 100, 115, 117]. We quantitatively compare MISE-Fair to
STFM [76], ATLAS [42], and TCM [43] in Section 4.2, and
show that MISE-Fair provides better fairness than these prior
approaches.
Other works examine approaches such as sub-row inter-
leaving [38], channel/bank partitioning [36, 64, 74, 109], band-
width partitioning [61,97], source throttling [3,7,17,19,39,81,
82, 102], thread scheduling [14, 101, 106, 121], and changes to
DRAM design [44, 58]. These approaches are complementary
to MISE, and can be combined to achieve better fairness.
Prior Work on Analytical Performance Modeling.
Prior works attempt to quantify the impact of cache/memory
contention through oine proling. Mars et al. [69] esti-
mate an application’s sensitivity/propensity to receive/cause
interference. Other previous works propose to estimate an ap-
plication’s sensitivity to cache capacity [20, 91] and memory
bandwidth [21] through profiling. Yang et al. [111] attempt
to estimate applications’ sensitivity to interference online.
However, this work assumes that latency-critical applications
run alone at times, when they can be profiled (which could
degrade system throughput). These works assume the ability
to profile (1) entire applications offline; or (2) specific exe-
cution scenarios, such as an application executing alone. In
contrast, MISE can estimate the slowdown of any applica-
tion online, in the general scenario of multiple applications
running together.
Several previous works [24, 25, 37, 104] propose analytical
models to estimate processor performance, as an alternative
to time-consuming simulations. The goal of our MISE model,
in contrast, is to estimate slowdowns at runtime, in order to
enable mechanisms to provide QoS and high fairness. Its use
in simulation is possible, but is left to future work.
6. Signicance
To our knowledge, our HPCA 2013 paper [97] is the rst to
build a simple yet accurate hardware-based model to estimate
application slowdowns due to main memory interference
online with the goal of providing predictable performance. Pre-
vious works [15, 17, 19, 76] propose mechanisms to estimate
application slowdowns. However, these mechanisms are not
accurate enough (as we demonstrate in Section 3) since they
were not designed with the goal of providing predictable per-
formance. Rather, the slowdown estimates were used to make
prioritization/throttling decisions to improve overall fairness.
This work is also the rst to design a hardware-based mech-
anism to i) provide soft guarantees on slowdown for appli-
cations and ii) detect when a prescribed slowdown bound
is not being met, by leveraging slowdown estimates from
the MISE model, while also improving overall system per-
formance. Previous work [34], in the absence of a model
to accurately estimate application slowdowns, always pri-
oritizes the application that needs guaranteed performance,
degrading the performance of other co-running applications.
Furthermore, previous work also does not have the provision
to detect whether or not the prescribed slowdown bounds
are being met (as we describe in Section 4).
6.1. Retrospective and Works Building on
Our HPCA 2013 Paper
Adoption of the Principles of the MISE Model. The
principles employed in the MISE model have been adopted
towards slowdown estimation in several works that followed.
The application slowdown model (ASM) [98], a follow-on
work, builds on top of MISE’s memory slowdown estimation
model and extends it to take into account shared cache in-
terference. In doing so, ASM also addresses one of the major
caveats of the MISE model, the estimation of slowdown for
non-memory-intensive applications. While MISE has a mech-
anism to address the slowdown of non-memory-intensive
applications, this mechanism relies on the estimation of the
memory-bound fraction of an application. Estimating the
fraction of an application’s execution that is memory bound,
with high delity, is challenging. ASM addresses this chal-
lenge by applying the observation on request service rate as
a proxy for performance at the input to the shared caches.
This seamlessly enables slowdown estimation for applications
with dierent memory and cache intensities/sensitivities. The
ASM work shows that it can accurately estimate slowdowns
with only 9.9% error across 100 workloads. We refer the
reader to [98] for details.
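
To make the role of the memory-bound fraction concrete, the following Python sketch computes a slowdown estimate as a weighted sum of a compute phase that is assumed to be unaffected by memory interference and a memory phase whose slowdown is the ratio of alone to shared request service rate. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the weighted-sum form reflects our reading of the model in [97] rather than a definitive implementation.

```python
def estimate_slowdown(alone_rsr, shared_rsr, alpha=1.0):
    """Estimate slowdown from request service rates.

    alone_rsr:  estimated alone request service rate (requests per cycle).
    shared_rsr: measured shared request service rate (requests per cycle).
    alpha:      assumed memory-bound fraction of execution, in [0, 1].
                alpha = 1 models a fully memory-bound application.
    """
    if alone_rsr <= 0 or shared_rsr <= 0:
        raise ValueError("request service rates must be positive")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # The compute-bound part (1 - alpha) is assumed not to slow down; the
    # memory-bound part slows down by the ratio of the two service rates.
    return (1.0 - alpha) + alpha * (alone_rsr / shared_rsr)
```

With this form, an error in alpha translates directly into an error in the estimate, which is one way to see why estimating the memory-bound fraction with high fidelity matters for non-memory-intensive applications.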
A later work by Xiong et al. [110] proposes a slowdown
estimation model that adopts the principle of giving an ap-
plication highest priority in order to estimate its alone run
behavior. This work directly measures alone-IPC during such
high priority periods, rather than estimating alone request
service rate and employs this alone-IPC estimate towards
determining slowdown.
Applications of the MISE Model. The MISE model has
been applied towards slowdown estimation in multiple con-
texts. Zhou and Wentzlaff [120] employ the MISE model in
the context of throttling memory traffic at the source, based
on inter-arrival times between requests. Specifically, they
employ a set of bins, each corresponding to a range of inter-
arrival times, and allocate a certain number of credits to each
bin, depending on an application’s request inter-arrival times.
In order to determine the optimal credit allocation in different
bins corresponding to different arrival times, they employ a
genetic algorithm. This credit allocation determines the even-
tual number of requests that can be served corresponding to
different inter-arrival times, for an application, and hence,
shapes the memory traffic of the application. Slowdown esti-
mates from the MISE model are leveraged to determine the
optimal bins/credits configuration, to effectively shape mem-
ory traffic. Camouflage [119] employs the MISE model for the
purposes of traffic shaping, but in the context of providing
security. Camouflage shapes memory traffic into a predeter-
mined distribution, in order to prevent attackers from probing
the memory bus to infer the program’s memory access and
response patterns. Slowdown estimates from the MISE model
are used to determine the optimal bins/credits configuration.
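
The bins-and-credits idea can be sketched in a few lines of Python. The class below is a simplified illustration, not the mechanism of [120] or [119]: the class name, the bin boundaries, and the rule that a request is simply refused when its bin has no credits are all assumptions, and the search for a good credit configuration (for example, with a genetic algorithm guided by slowdown estimates) is omitted.

```python
from bisect import bisect_left


class BinShaper:
    """Credit-based shaper with one bin per range of inter-arrival times."""

    def __init__(self, upper_bounds, credits):
        # upper_bounds: sorted, inclusive upper edges (in cycles) of all bins
        # except the last, which catches every larger inter-arrival time.
        if len(credits) != len(upper_bounds) + 1:
            raise ValueError("need exactly one more credit count than bounds")
        self.upper_bounds = list(upper_bounds)
        self.initial_credits = list(credits)
        self.credits = list(credits)

    def bin_index(self, inter_arrival):
        """Return the index of the bin covering this inter-arrival time."""
        return bisect_left(self.upper_bounds, inter_arrival)

    def try_issue(self, inter_arrival):
        """Spend one credit from the matching bin; False if none is left."""
        index = self.bin_index(inter_arrival)
        if self.credits[index] > 0:
            self.credits[index] -= 1
            return True
        return False

    def replenish(self):
        """Restore all credits, e.g., at the start of a new period."""
        self.credits = list(self.initial_credits)
```

For example, `BinShaper([10, 50], [1, 2, 0])` admits one request with an inter-arrival time of at most 10 cycles and two with an inter-arrival time between 11 and 50 cycles per period, and refuses requests that arrive more than 50 cycles apart.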
Employing Slowdown-Proportional Resource Allo-
cation. The general principle of allocating resources in pro-
portion to the estimated slowdown at that resource is
a key principle employed in the MISE-QoS and MISE-Fair
schemes. Two prior works [66, 108] apply a similar principle
in the context of addressing interference at the on-chip net-
work. Towards mitigating on-chip network contention, they
build a scheme that allocates channel bandwidth proportional
to the aggregate rate of flow of traffic from each thread.
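
The principle of slowdown-proportional allocation can be illustrated with a short Python sketch. It is deliberately minimal and rests on assumptions: the function name is hypothetical, and the actual MISE-QoS and MISE-Fair schemes adjust allocations over time rather than applying a single normalization step.

```python
def proportional_shares(slowdowns):
    """Split a resource among applications in proportion to their slowdowns.

    slowdowns: dict mapping application name -> estimated slowdown (> 0).
    Returns a dict of fractions that sum to 1.
    """
    total = sum(slowdowns.values())
    if total <= 0:
        raise ValueError("total slowdown must be positive")
    return {app: slowdown / total for app, slowdown in slowdowns.items()}
```

An application estimated to be slowed down three times as much as another would receive three times its share under this rule.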
These works [66,98,110,119,120] are clear instances of the
applicability of the MISE model itself and its principles in
various contexts. The works that build on our original MISE
paper [97] are strongly indicative of the potential impact this
work could have in the long term, as we describe in the next
section.
6.2. Long-Term Impact
Predictable Performance in Current and Future Sys-
tems. Building predictable systems is a grand research chal-
lenge [12, 75, 78]. Predictable performance is a key require-
ment in current and future systems where 1) multiple ap-
plications are consolidated onto the same machine, sharing
resources and 2) some applications need a certain guaranteed
performance. Data centers, virtualized systems, interactive
mobile systems and real-time systems are all examples of
scenarios where predictable performance is desirable or nec-
essary. We expect the need for predictable performance to
increase in the future as more systems will likely move to-
wards consolidation as a means to effectively utilize resources.
Given this trend, accurately quantifying the effect of shared
resource interference on performance is an important enabler
towards providing predictable performance. Therefore, we
believe that slowdown estimates from the MISE model and
the hardware/software techniques that can be built on top of
our model are important steps towards providing predictable
performance.
Request Service Rate as a Proxy for Performance. One
of the key ideas behind MISE is to use memory request service
rate as a proxy for performance for memory-bound applica-
tions. We hypothesize that the performance of an application
that is bottlenecked at a certain resource is likely correlated
with the request service rate at that resource. Hence, the
notion of using request service rate as a proxy for perfor-
mance can be used as a primitive for performance prediction
and applied more generally to other shared resources such
as shared caches, storage and network. ASM [98], described
in Section 6.1, is one such work that takes advantage of this
key idea of request service rate as a proxy for performance,
measured at the shared caches.
Accurate and Ecient Estimation of Alone Perfor-
mance. Another key idea behind MISE is to periodically
give each application the highest priority in order to estimate
alone-request-service-rate. In doing so, the highest priority
application receives minimal interference when its slowdown
is being estimated, while also not disrupting other applica-
tions’ execution. This leads to better accuracy than previous
work [15, 17, 19, 76] that estimates an application’s slowdown
while it is receiving interference from other applications. We
believe that the principle of estimating slowdown while using
techniques such as prioritization to minimize interference
can be applied at other shared resources such as I/O, storage
and network as well.
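
A minimal Python sketch of this estimation step is shown below. It assumes that a hardware counter records the requests served and the cycles elapsed while the application holds highest priority. The optional correction for cycles in which the prioritized application is still delayed by others is an assumption of this sketch, as are all names.

```python
def alone_request_service_rate(requests_served, high_priority_cycles,
                               interference_cycles=0):
    """Estimate the alone request service rate (requests per cycle).

    requests_served:      requests completed while at highest priority.
    high_priority_cycles: cycles spent at highest priority.
    interference_cycles:  cycles within those epochs in which the application
                          still waited on other applications' requests
                          (assumed to be tracked separately).
    """
    effective_cycles = high_priority_cycles - interference_cycles
    if effective_cycles <= 0:
        raise ValueError("no interference-free cycles to measure over")
    return requests_served / effective_cycles
```

Dividing this estimate by the request service rate measured during normal shared execution gives a slowdown estimate for a memory-bound application.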
Enabling Better Resource Management. The ability
to accurately estimate slowdown in the presence of shared
resource interference can enable a range of resource manage-
ment techniques to provide QoS in both hardware and soft-
ware. Slowdown estimates can be leveraged in the hardware
for resource management (as we demonstrate with memory
bandwidth). Slowdown estimates can also be communicated
to the software, enabling more effective and informed admis-
sion control and migration mechanisms across a cluster of
machines. Therefore, we believe MISE’s slowdown estimates
can enable substantial future research on resource allocation
policies.
Simplicity of the Technique. The MISE model requires
only simple hardware changes to the memory controller and
scheduling logic, while providing high accuracy. By virtue
of the memory bandwidth partitioning scheme we employ,
the memory scheduler only needs to give one application
the highest priority at any point in time, while treating other
applications’ requests similarly. On the other hand, previ-
ously proposed memory scheduling policies such as ATLAS
and TCM [42, 43] employ ranking policies where an ordered
ranking is enforced across all applications’ requests. Hence,
MISE requires simpler comparator logic than previous pro-
posals and can be more easily incorporated into today’s mem-
ory controllers.
Applicability to Other Memory Technologies. In our
HPCA 2013 paper [97], we described MISE within the context
of a system using DRAM as main memory, for which the
reader can nd detailed background information in our prior
works [6, 8, 9, 10, 28, 29, 40, 41, 42, 43, 44, 45, 54, 55, 56, 57, 58, 62,
63, 83, 93, 94]. We believe the principles of MISE are easily
applicable to other memory technologies, e.g., phase-change
memory [47, 48, 49, 87, 107, 112, 118], STT-MRAM [46, 70, 79],
and hybrid memory systems [1,4,11,16,26,59,65,70,71,84,86,
87, 88, 113, 114, 116]. We leave a detailed exploration of these
to future work.
7. Conclusion
Application slowdowns induced by memory interference
are a signicant deterrent to high and predictable per-
formance. Towards tackling such application slowdowns,
our HPCA 2013 paper [97] (1) builds a simple Memory
Interference-induced Slowdown Estimation (MISE) model
to accurately estimate application slowdowns, and (2) demon-
strates two use cases that leverage our MISE model to achieve
predictable performance and fairness. Since our original
HPCA 2013 paper [97] on the MISE model and its applications,
several works have adopted and employed the MISE model
and its principles in dierent contexts. We conclude that the
MISE model and the principles behind it can fuel and inspire
many more such works on high performance, predictable,
and fair memory systems.
Acknowledgments
We thank Saugata Ghose for his dedicated effort in the
preparation of this article. We thank the reviewers for their
valuable feedback and suggestions. We acknowledge mem-
bers of the SAFARI group for their feedback and for the stim-
ulating research environment they provide. Many thanks to
Brian Prasky from IBM and Arup Chakraborty from Freescale
for their helpful comments. We acknowledge the support of
our industrial sponsors, including AMD, HP Labs, IBM, Intel,
Oracle, Qualcomm and Samsung. This research was also par-
tially supported by the NSF (grant 0953246), SRC, and Intel
URO Memory Hierarchy Program.
References
[1] N. Agarwal and T. F. Wenisch, “Thermostat: Application-Transparent Page Man-
agement for Two-Tiered Main Memory,” in ASPLOS, 2017.
[2] R. Ausavarungnirun et al., “Staged memory scheduling: Achieving high perfor-
mance and scalability in heterogeneous systems,” in ISCA, 2012.
[3] E. Baydal et al., “A Family of Mechanisms for Congestion Control in Wormhole
Networks,” IEEE TPDS, 2005.
[4] S. Bock, B. R. Childers, R. Melhem, and D. Mossé, “Concurrent Migration of
Multiple Pages in Software-Managed Hybrid Main Memory,” in ICCD, 2016.
[5] F. J. Cazorla et al., “Predictable performance in SMT processors: Synergy be-
tween the OS and SMTs,” IEEE TC, Jul. 2006.
[6] K. K. Chang, D. Lee, Z. Chishti, A. R. Alameldeen, C. Wilkerson, Y. Kim, and
O. Mutlu, “Improving DRAM Performance by Parallelizing Refreshes with Ac-
cesses,” in HPCA, 2014.
[7] K. K. Chang et al., “HAT: Heterogeneous adaptive throttling for on-chip net-
works,” in SBAC-PAD ’12, 2012.
[8] K. K. Chang, A. Kashyap, H. Hassan, S. Ghose, K. Hsieh, D. Lee, T. Li, G. Pekhi-
menko, S. Khan, and O. Mutlu, “Understanding Latency Variation in Modern
DRAM Chips: Experimental Characterization, Analysis, and Optimization,” in
SIGMETRICS, 2016.
[9] K. K. Chang, P. J. Nair, D. Lee, S. Ghose, M. K. Qureshi, and O. Mutlu, “Low-Cost
Inter-Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in
DRAM,” in HPCA, 2016.
[10] K. K. Chang, A. G. Yağlıkçı, S. Ghose, A. Agrawal, N. Chatterjee, A. Kashyap,
D. Lee, M. O’Connor, H. Hassan, and O. Mutlu, “Understanding Reduced-Voltage
Operation in Modern DRAM Devices: Experimental Characterization, Analysis,
and Mechanisms,” in SIGMETRICS, 2017.
[11] N. Chatterjee et al., “Leveraging Heterogeneity in DRAM Main Memories to Ac-
celerate Critical Word Access,” in MICRO, 2012.
[12] Computing Research Association, “Grand research challenges in information
systems,” 2003.
[13] R. Das et al., “Application-aware prioritization mechanisms for on-chip net-
works,” in MICRO, 2009.
[14] R. Das et al., “Application-to-Core Mapping Policies to Reduce Memory System
Interference in Multi-Core Systems,” in HPCA, 2013.
[15] K. Du Bois et al., “Per-thread cycle accounting in multicore processors,” in
HiPEAC, 2013.
[16] S. R. Dulloor, A. Roy, Z. Zhao, N. Sundaram, N. Satish, R. Sankaran, J. Jackson,
and K. Schwan, “Data Tiering in Heterogeneous Memory Systems,” in EuroSys,
2016.
[17] E. Ebrahimi et al., “Fairness via source throttling: A configurable and high-
performance fairness substrate for multi-core memory systems,” in ASPLOS,
2010.
[18] E. Ebrahimi et al., “Parallel application memory scheduling,” in MICRO, 2011.
[19] E. Ebrahimi et al., “Prefetch-aware shared resource management for multi-core
systems,” in ISCA, 2011.
[20] D. Eklov, N. Nikoleris, D. Black-Schaffer, and E. Hagersten, “Cache Pirating: Mea-
suring the Curse of the Shared Cache,” in ICPP, 2011.
[21] D. Eklov, N. Nikoleris, D. Black-Schaffer, and E. Hagersten, “Bandwidth Bandit:
Quantitative Characterization of Memory Contention,” in PACT, 2012.
[22] S. Eyerman and L. Eeckhout, “System-level performance metrics for multipro-
gram workloads,” IEEE Micro, 2008.
[23] S. Eyerman and L. Eeckhout, “Per-thread cycle accounting in SMT processors,”
in ASPLOS, 2009.
[24] S. Eyerman et al., “A performance counter architecture for computing accurate
CPI components,” in ASPLOS, 2006.
[25] S. Eyerman et al., “A mechanistic performance model for superscalar out-of-
order processors,” TOCS, May 2009.
[26] K. Gai, M. Qiu, H. Zhao, and L. Qiu, “Smart Energy-Aware Data Allocation for
Heterogeneous Memory,” in HPCC, 2016.
[27] S. Ghose, H. Lee, and J. F. Martínez, “Improving Memory Scheduling via
Processor-Side Load Criticality Information,” in ISCA, 2013.
[28] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee,
O. Ergin, and O. Mutlu, “SoftMC: A Flexible and Practical Open-Source Infras-
tructure for Enabling Experimental DRAM Studies,” in HPCA, 2017.
[29] H. Hassan, G. Pekhimenko, N. Vijaykumar, V. Seshadri, D. Lee, O. Ergin, and
O. Mutlu, “ChargeCache: Reducing DRAM Latency by Exploiting Row Access
Locality,” in HPCA, 2016.
[30] A. Herdrich et al., “Rate-based QoS techniques for cache/memory in CMP plat-
forms,” in ICS, 2009.
[31] Intel Corp., “First the tick, now the tock: Next generation Intel microarchitecture
(Nehalem),” White Paper, 2008.
[32] E. Ipek et al., “Self-optimizing memory controllers: A reinforcement learning
approach,” in ISCA, 2008.
[33] R. Iyer, “CQoS: A framework for enabling QoS in shared caches of CMP plat-
forms,” in ICS, 2004.
[34] R. Iyer et al., “QoS policies and architecture for cache/memory in CMP platforms,”
in SIGMETRICS, 2007.
[35] M. Jahre and L. Eeckhout, “GDP: Using Dataflow Properties to Accurately Esti-
mate Interference-free Performance at Runtime,” in HPCA, 2018.
[36] M. K. Jeong, D. H. Yoon, D. Sunwoo, M. Sullivan, I. Lee, and M. Erez, “Balancing
DRAM locality and parallelism in shared memory CMP systems,” in HPCA, 2012.
[37] T. S. Karkhanis and J. E. Smith, “A first-order superscalar processor model,” in
ISCA, 2004.
[38] D. Kaseridis et al., “Minimalist open-page: A DRAM page-mode scheduling pol-
icy for the many-core era,” in MICRO, 2011.
[39] O. Kayiran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H.
Loh, O. Mutlu, and C. R. Das, “Managing GPU Concurrency in Heterogeneous
Architectures,” in MICRO, 2014.
[40] J. S. Kim, M. Patel, H. Hassan, and O. Mutlu, “The DRAM Latency PUF: Quickly
Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability
Tradeoff in Modern DRAM Devices,” in HPCA, 2018.
[41] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and
O. Mutlu, “Flipping Bits in Memory Without Accessing Them: An Experimental
Study of DRAM Disturbance Errors,” in ISCA, 2014.
[42] Y. Kim et al., “ATLAS: A scalable and high-performance scheduling algorithm
for multiple memory controllers,” in HPCA, 2010.
[43] Y. Kim et al., “Thread cluster memory scheduling: Exploiting differences in mem-
ory access behavior,” in MICRO, 2010.
[44] Y. Kim et al., “A case for exploiting subarray-level parallelism (SALP) in DRAM,” in
ISCA, 2012.
[45] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simu-
lator,” CAL, 2015.
[46] E. Kültürsay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu, “Evaluating STT-
RAM as an Energy-Efficient Main Memory Alternative,” in ISPASS, 2013.
[47] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory
as a Scalable DRAM Alternative,” in ISCA, 2009.
[48] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Phase Change Memory Architecture
and the Quest for Scalability,” CACM, 2010.
[49] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger,
“Phase-Change Technology and the Future of Main Memory,” IEEE Micro, 2010.
[50] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-Aware DRAM Con-
trollers,” in MICRO, 2008.
[51] C. J. Lee, O. Mutlu, V. Narasiman, and Y. N. Patt, “Prefetch-Aware Memory Con-
trollers,” TC, 2011.
[52] C. J. Lee, V. Narasiman, E. Ebrahimi, O. Mutlu, and Y. N. Patt, “DRAM-Aware
Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory
Systems,” Univ. of Texas at Austin, High Performance Systems Group, Tech. Rep.
TR-HPS-2010-002, 2010.
[53] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, “Improving Memory Bank-Level
Parallelism in the Presence of Prefetching,” in MICRO, 2009.
[54] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko,
V. Seshadri, and O. Mutlu, “Design-Induced Latency Variation in Modern DRAM
Chips: Characterization, Analysis, and Latency Reduction Mechanisms,” in SIG-
METRICS, 2017.
[55] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu, “Simultaneous Multi-
Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost,” TACO,
2016.
[56] D. Lee, Y. Kim, G. Pekhimenko, S. Khan, V. Seshadri, K. Chang, and O. Mutlu,
“Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,”
in HPCA, 2015.
[57] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-Latency
DRAM: A Low Latency and Low Cost DRAM Architecture,” in HPCA, 2013.
[58] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu, “Decoupled
Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-
Port DRAM,” in PACT, 2015.
[59] Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, and O. Mutlu, “Utility-Based Hybrid
Memory Management,” in CLUSTER, 2017.
[60] X. Lin and R. Balasubramonian, “Refining the Utility Metric for Utility-Based
Cache Partitioning,” in WDDD, 2009.
[61] F. Liu, X. Jiang, and Y. Solihin, “Understanding How Off-Chip Memory Band-
width Partitioning in Chip Multiprocessors Affects System Performance,” in
HPCA, 2010.
[62] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu, “An Experimental Study of
Data Retention Behavior in Modern DRAM Devices: Implications for Retention
Time Proling Mechanisms,” in ISCA, 2013.
[63] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu, “RAIDR: Retention-Aware Intelligent
DRAM Refresh,” in ISCA, 2012.
[64] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu, “A Software Memory Parti-
tion Approach for Eliminating Bank-level Interference in Multicore Systems,” in
PACT, 2012.
[65] L. Liu, H. Yang, Y. Li, M. Xie, L. Li, and C. Wu, “Memos: A Full Hierarchy Hybrid
Memory Management Framework,” in ICCD, 2016.
[66] Z. Lu et al., “Aggregate flow-based performance fairness in CMPs,” TACO, vol. 13,
no. 4, Dec. 2016.
[67] K. Luo et al., “Balancing throughput and fairness in SMT processors,” in ISPASS,
2001.
[68] C. Luque et al., “CPU accounting in CMP processors,” IEEE CAL, Jan. - Jun. 2009.
[69] J. Mars et al., “Bubble-Up: Increasing utilization in modern warehouse scale com-
puters via sensible co-locations,” in MICRO, 2011.
[70] J. Meza, Y. Luo, S. Khan, J. Zhao, Y. Xie, and O. Mutlu, “A Case for Efficient Hard-
ware/Software Cooperative Management of Storage and Memory,” in WEED,
2013.
[71] J. Meza et al., “Enabling efficient and scalable hybrid memories using fine-
granularity DRAM cache management,” CAL, 2012.
[72] T. Moscibroda and O. Mutlu, “Memory performance attacks: Denial of memory
service in multi-core systems,” in USENIX Security, 2007.
[73] T. Moscibroda and O. Mutlu, “Distributed Order Scheduling and its Application
to Multi-Core DRAM Controllers,” in PODC, 2008.
[74] S. P. Muralidhara et al., “Reducing memory interference in multicore systems via
application-aware memory channel partitioning,” in MICRO, 2011.
[75] O. Mutlu, “Memory Scaling: A Systems Architecture Perspective,” IMW, 2013.
[76] O. Mutlu and T. Moscibroda, “Stall-time fair memory access scheduling for chip
multiprocessors,” in MICRO, 2007.
[77] O. Mutlu and T. Moscibroda, “Parallelism-aware batch scheduling: Enhancing
both performance and fairness of shared DRAM systems,” in ISCA, 2008.
[78] O. Mutlu and L. Subramanian, “Research Problems and Opportunities in Memory
Systems,” SUPERFRI, 2014.
[79] H. Naeimi, C. Augustine, A. Raychowdhury, S.-L. Lu, and J. Tschanz, “STT-RAM
Scaling and Retention Failure,” Intel Technology Journal, 2013.
[80] K. J. Nesbit et al., “Fair queuing memory systems,” in MICRO, 2006.
[81] G. Nychis, C. Fallin, T. Moscibroda, and O. Mutlu, “Next Generation On-Chip
Networks: What Kind of Congestion Control Do We Need?” in HotNets, 2010.
[82] G. Nychis, C. Fallin, T. Moscibroda, and O. Mutlu, “On-Chip Networks from
a Networking Perspective: Congestion and Scalability in Many-core Intercon-
nects,” in SIGCOMM, 2012.
[83] M. Patel, J. S. Kim, and O. Mutlu, “The Reach Profiler (REAPER): Enabling the
Mitigation of DRAM Retention Failures via Profiling at Aggressive Conditions,”
in ISCA, 2017.
[84] A. J. Peña and P. Balaji, “Toward the Efficient Use of Multiple Explicitly Managed
Memory Subsystems,” in CLUSTER, 2014.
[85] D. Petrou et al., “Implementing lottery scheduling: Matching the specializations
in traditional schedulers,” in USENIX ATEC, 1999.
[86] S. Phadke et al., “MLP Aware Heterogeneous Memory System,” in DATE, 2011.
[87] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable High Performance Main
Memory System Using Phase-Change Memory Technology,” in ISCA, 2009.
[88] L. E. Ramos, E. Gorbatov, and R. Bianchini, “Page Placement in Hybrid Memory
Systems,” in ICS, 2011.
[89] S. Rixner et al., “Memory access scheduling,” in ISCA, 2000.
[90] SAFARI Research Group, ASMSim – GitHub Repository,
https://github.com/CMU-SAFARI/ASMSim.
[91] A. Sandberg, A. Sembrant, E. Hagersten, and D. Black-Schaffer, “Modeling Per-
formance Variation Due to Cache Sharing,” in HPCA, 2013.
[92] V. Seshadri et al., “The evicted-address filter: A unified mechanism to address
both cache pollution and thrashing,” in PACT, 2012.
[93] V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch,
O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-Memory Accelerator for
Bulk Bitwise Operations Using Commodity DRAM Technology,” in MICRO, 2017.
[94] V. Seshadri et al., “RowClone: Fast and Energy-Efficient In-DRAM Bulk Data
Copy and Initialization,” in MICRO, 2013.
[95] A. Snavely and D. M. Tullsen, “Symbiotic jobscheduling for a simultaneous mul-
tithreaded processor,” in ASPLOS, 2000.
[96] Standard Performance Evaluation Corp., SPEC CPU2006,
http://www.spec.org/spec2006.
[97] L. Subramanian et al., “MISE: Providing performance predictability and improv-
ing fairness in shared main memory systems,” in HPCA, 2013.
[98] L. Subramanian et al., “The application slowdown model: Quantifying and con-
trolling the impact of inter-application interference at shared caches and main
memory,” in MICRO, 2015.
[99] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The blacklisting
memory scheduler: Achieving high performance and fairness at low cost,” in
ICCD, 2014.
[100] L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “BLISS: Balancing
performance, fairness and complexity in memory access scheduling,” TPDS, 2016.
[101] L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa, “The Impact of Mem-
ory Subsystem Resource Sharing on Datacenter Applications,” in ISCA, 2011.
[102] M. Thottethodi, A. R. Lebeck, and S. Mukherjee, “Self-Tuned Congestion Control
for Multiprocessor Networks,” in HPCA, 2001.
[103] H. Usui et al., “DASH: Deadline-aware high-performance memory scheduler for
heterogeneous systems with hardware accelerators,” TACO, Jan. 2016.
[104] K. Van Craeynest et al., “Scheduling heterogeneous multi-cores through perfor-
mance impact estimation (PIE),” in ISCA, 2012.
[105] C. A. Waldspurger and W. E. Weihl, “Lottery scheduling: Flexible proportional-
share resource management,” in OSDI, 1994.
[106] H. Wang et al., “A-DRM: Architecture-aware distributed resource management
of virtualized clusters,” in VEE, 2015.
[107] H.-S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran,
M. Asheghi, and K. E. Goodson, “Phase Change Memory,” Proc. IEEE, 2010.
[108] X. Xiang, W. Shi, S. Ghose, L. Peng, O. Mutlu, and N.-F. Tzeng, “Carpool: A Buffer-
less On-Chip Network Supporting Adaptive Multicast and Hotspot Alleviation,”
in ICS, 2017.
[109] M. Xie, D. Tong, K. Huang, and X. Cheng, “Improving System Throughput and
Fairness Simultaneously in Shared Memory CMP Systems via Dynamic Bank
Partitioning,” in HPCA, 2014.
[110] D. Xiong et al., “Providing predictable performance via a slowdown estimation
model,” TACO, vol. 14, no. 3, Aug. 2017.
[111] H. Yang, A. Breslow, J. Mars, and L. Tang, “Bubble-flux: Precise Online QoS
Management for Increased Utilization in Warehouse Scale Computers,” in ISCA,
2013.
[112] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, and O. Mutlu, “Efficient Data
Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memo-
ries,” TACO, 2014.
[113] H. Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memo-
ries,” in ICCD, 2012.
[114] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-
Efficient DRAM Caching via Software/Hardware Cooperation,” in MICRO, 2017.
[115] G. L. Yuan et al., “Complexity effective memory access scheduling for many-core
accelerator architectures,” in MICRO, 2009.
[116] W. Zhang and T. Li, “Exploring Phase Change Memory and 3D Die-Stacking
for Power/Thermal Friendly, Fast and Durable Memory Architectures,” in PACT,
2009.
[117] J. Zhao, O. Mutlu, and Y. Xie, “FIRM: Fair and High-Performance Memory Con-
trol for Persistent Memory Systems,” in MICRO, 2014.
[118] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, “A Durable and Energy Efficient Main
Memory Using Phase Change Memory Technology,” in ISCA, 2009.
[119] Y. Zhou et al., “Camouflage: Memory traffic shaping to mitigate timing attacks,”
in HPCA, 2017.
[120] Y. Zhou and D. Wentzlaff, “MITTS: Memory inter-arrival time traffic shaping,”
in ISCA, 2016.
[121] S. Zhuravlev et al., “Addressing shared resource contention in multicore proces-
sors via scheduling,” in ASPLOS, 2010.
[122] W. K. Zuravleff and T. Robinson, “Controller for a synchronous DRAM that max-
imizes throughput by allowing memory requests and commands to be issued out
of order,” Patent 5630096, 1997.
