Online Cache Modeling for Commodity Multicore Processors by West, Rich et al.
Boston University
OpenBU http://open.bu.edu
Computer Science CAS: Computer Science: Technical Reports
2010-07-02
West, Rich; Zaroo, Puneet; Waldspurger, Carl; Zhang, Xiao. "Online Cache Modeling
for Commodity Multicore Processors", Technical Report BUCS-TR-2010-015,

ABSTRACT
Modern chip-level multiprocessors (CMPs) contain multiple pro-
cessor cores sharing a common last-level cache, memory intercon-
nects, and other hardware resources. Workloads running on sep-
arate cores compete for these resources, often resulting in highly-
variable performance. It is generally desirable to co-schedule work-
loads that have minimal resource contention, in order to improve
both performance and fairness. Unfortunately, commodity proces-
sors expose only limited information about the state of shared re-
sources such as caches to the software responsible for scheduling
workloads that execute concurrently. To make informed resource-
management decisions, it is important to obtain accurate measure-
ments of per-workload cache occupancies and their impact on per-
formance, often summarized by utility functions such as miss-ratio
curves (MRCs).
In this paper, we first introduce an efficient online technique for
estimating the cache occupancy of individual software threads us-
ing only commonly-available hardware performance counters. We
derive an analytical model as the basis of our occupancy estima-
tion, and extend it for improved accuracy on modern cache config-
urations, considering the impact of set-associativity, line replace-
ment policy, and memory locality effects. We demonstrate the ef-
fectiveness of occupancy estimation with a series of CMP simula-
tions in which SPEC benchmarks execute concurrently on multiple
cores. Leveraging our occupancy estimation technique, we also in-
troduce a lightweight approach for online MRC construction, and
demonstrate its effectiveness using a prototype implementation in
the VMware ESX Server hypervisor. We present a series of experi-
ments involving SPEC benchmarks, comparing the MRCs we con-
struct online with MRCs generated offline in which various cache
sizes are enforced via static page coloring.
Categories and Subject Descriptors





Multicore processors, CMPs, shared cache resource management
∗Copyright © 2010 VMware, Inc. and Boston University. All
rights reserved.
1. INTRODUCTION
Advancements in processor architecture have led to a proliferation
of multicore processors, commonly referred to as chip-level
multiprocessors (CMPs). Commodity client and server platforms
contain one or more CMPs, with each CMP consisting of multiple
processor cores sharing a common last-level cache, memory
interconnects, and other hardware resources [1, 12]. Workloads
running on separate cores compete for these shared resources, often
resulting in highly variable or unpredictable performance [9, 14].
Operating systems and hypervisors are designed to multiplex hard-
ware resources across multiple workloads with varying demands
and importance. Unfortunately, commodity CMPs typically man-
age shared hardware resources, such as cache space and memory
bandwidth, in a manner that is opaque to the software responsi-
ble for higher-level resource management. Without adequate visi-
bility and control over performance-critical hardware resources, it
is extremely difficult to optimize for efficient resource utilization
or to enforce quality-of-service policies. For example, accurate
measurement of workload cache occupancies is needed to make
informed decisions about suitable cache partition sizes, and to en-
able co-scheduling decisions which can potentially reduce shared
cache conflicts.
Many hardware-based resource management approaches have been
proposed, including low-level architectural mechanisms to support
cache occupancy monitoring and/or the ability to partition cache
space and other memory system hardware among multiple work-
loads [2, 5, 7, 8, 13, 14, 17, 22, 23, 25, 27]. However, none of
these techniques provide a method to accurately estimate work-
load cache occupancies on commodity processors, which expose
only limited information to software. To further understand the
impact of shared caches on workload performance, methods have
also been devised to construct cache utility functions, such as miss-
ratio curves (MRCs), which capture miss ratios at different cache
occupancies [3, 21, 26, 27, 28]. However, existing techniques for
generating MRCs either require custom hardware support, or incur
non-trivial software overheads.
Both an accurate method of per-workload cache occupancy estima-
tion and a generic approach to constructing cache utility curves are
important for effective cache management. These techniques pro-
vide the basis for more efficient cache usage, allowing higher-level
resource management policies to provide differential quality of ser-
vice for workloads. For example, schedulers can exploit cache
performance information to make better co-runner placement de-
cisions [4, 26, 29, 33], improving cache efficiency or fairness. Un-
fortunately, strict quality-of-service enforcement generally requires
hardware support. While software-based page-coloring techniques
have been used to provide isolation [6, 15, 16], such hard partition-
ing is inflexible, and generally prevents efficient cache utilization.
Moreover, without special hardware support [24], dynamically re-
coloring a page is expensive, requiring updates to page mappings
and a full page copy, making this approach unattractive for dynamic
workload mixes in general-purpose systems.
We focus on several cache modeling techniques that are useful for
more informed resource management decisions, such as co-runner
selection. This paper does not propose specific policies for im-
proved cache-aware scheduling decisions, but instead lays the foun-
dations for such software-based performance management opera-
tions. We identify two key contributions: first, an efficient online
technique for per-thread cache occupancy estimation and, second,
an online technique for identifying the utility or performance bene-
fits to each thread as a function of cache occupancy. In the first case,
we leverage only commonly available hardware performance coun-
ters, found on most commodity CMPs in use today. We demon-
strate the accuracy of our online cache occupancy estimations using
Intel’s CMPSched$im simulator [19], applied to sets of co-running
SPEC benchmarks. We also extend the basic occupancy estimation
model with cache-hit information, to more accurately reflect fac-
tors such as execution locality, hardware cache-line replacement
policies, and set-associativity. Leveraging our online occupancy
estimation technique, we show how to construct miss-ratio curves
efficiently online, as threads compete dynamically for shared hard-
ware resources. We demonstrate the effectiveness of online MRC
construction by presenting experimental results using a prototype
implementation in VMware’s ESX Server hypervisor [30].
The next section presents our cache occupancy estimation approach,
including a detailed description of its mathematical basis, together
with simulation results demonstrating its effectiveness. Section 3
builds on this foundation, introducing our method for online con-
struction of cache utility curves. Using a prototype implementation
in the VMware ESX Server hypervisor, we examine the accuracy
of our online MRCs by comparing them with MRCs for the same
workloads collected via static page coloring. Related work is ex-
amined in Section 4. Finally, we summarize our conclusions and
highlight opportunities for future work in Section 5.
2. CACHE OCCUPANCY ESTIMATION
In this section, we present our approach for estimating cache oc-
cupancy. We begin with a formal explanation of our basic model,
which requires only cache miss counts for each co-running thread.
We then examine the effects of pseudo-LRU set-associativity as
implemented in modern processors, and extend our model to addi-
tionally incorporate cache hit counts to improve accuracy for such
configurations.
We demonstrate the effectiveness of our cache occupancy estimation
techniques with a series of experiments in which SPEC benchmarks
execute concurrently on multiple cores. Since real processors do not
expose the contents of hardware caches to software¹, we measure
accuracy using the Intel CMPSched$im simulator [19] to compare the
results of our model with actual cache occupancies in several
different configurations.

¹Current processor families do not allow software to inspect cache
tags, although the MIPS R4000 [10] did provide a cache instruction
with this capability.
For the purposes of our model, we consider a shared last-level
cache that may be direct-mapped or n-way set associative. Our
objective is to determine the current amount of cache space
occupied by some thread, τ, at time t, given contention for cache
lines by multiple threads running on all the cores that share that
cache. At time t, thread τ may be descheduled, or it may be actively
executing on one core while other threads are active on the
remaining cores.
2.1 Basic Cache Model
Since hardware caches reveal very little information to software, in
order to derive quantitative information about their state, we must
rely on inference techniques using features such as hardware per-
formance counters. Virtually all modern processors provide per-
formance counters through which information about various sys-
tem events can be determined, such as instructions retired, cache
misses, cache accesses and cycle times for execution sequences.
Using two events, namely the local and global last-level cache
misses, we estimate the number of cache lines, E, occupied by τ
at time t. By global cache misses, we mean the cumulative number
of such events across all cores that share the same last-level cache.
We assume that the shared cache is accessed uniformly at random.
Results show this to be a reasonable assumption, given the unbiased
nature of memory allocation, and the desire for all cache lines to be
used effectively across multiple workloads and execution phases.
We also assume each cache line is allocated to a single thread at
any point in time. Data sharing is not considered in this paper,
although it is part of our ongoing work in this area.
Cache occupancy is effectively dictated by the number of misses
experienced by a thread because cache lines are allocated in
response to such misses. Essentially, the current execution phase of
a thread $\tau_i$ influences its cache investment, because any of its
lines that it no longer accesses may be evicted by conflicting
accesses to the same cache index by other threads. Evicted lines no
longer relevant to the current execution phase of $\tau_i$ will not
incur subsequent misses that would cause them to return to the
cache. Hence, the cache occupancy of a thread is a function of its
misses experienced over some interval of time.
For subsequent discussion, we introduce the following notation:
• Let $C$ represent the number of cache lines in a shared cache,
accessed uniformly at random.
• Let $m_l$ represent the number of misses experienced by the
local thread, $\tau_l$, under observation over some sampling
interval. This term also represents the number of cache lines
allocated due to misses.
• Let $m_o$ represent the aggregate number of misses by every
thread other than $\tau_l$, on all cores of a CMP, that cause cache
lines to be allocated in response to such misses. We use the
notation $\tau_o$ to represent the aggregate behavior of all other
threads, treating it as if it were a single thread.
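Theorem. Consider a cache of size $C$ lines, with $E$ cache lines
belonging to $\tau_l$ and $C - E$ cache lines belonging to $\tau_o$
at some time, $t$. If, in some interval, $\delta t$, there are $m_l$
misses corresponding to $\tau_l$ and $m_o$ misses corresponding to
$\tau_o$, then the expected occupancy of $\tau_l$ at time
$t + \delta t$ is:

\[ E' = E + \left(1 - \frac{E}{C}\right) m_l - \frac{E}{C}\, m_o \]
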
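Proof. First, at time $t$, it is assumed that $\tau_l$ and $\tau_o$
are sufficiently memory-intensive, and have executed for enough time,
to collectively populate the entire cache. Now, considering any
single cache line, $i$, at time $t + \delta t$ we have:

\[
\begin{aligned}
\Pr\{i \text{ belongs to } \tau_l\} ={}& \Pr\{i \text{ belongs to } \tau_l \mid i \text{ belonged to } \tau_l\} \cdot \Pr\{i \text{ belonged to } \tau_l\} \\
&+ \Pr\{i \text{ belongs to } \tau_l \mid i \text{ belonged to } \tau_o\} \cdot \Pr\{i \text{ belonged to } \tau_o\}
\end{aligned}
\]

This follows from the prior probabilities, at time $t$:

\[ \Pr\{i \text{ belonged to } \tau_l\} = \frac{E}{C} \tag{1} \]

\[ \Pr\{i \text{ belonged to } \tau_o\} = 1 - \frac{E}{C} \tag{2} \]
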
Additionally, after $m_l + m_o$ misses, the probability that $\tau_l$
replaces line $i$, previously occupied by $\tau_o$, is one minus the
probability that $\tau_l$ does not replace $\tau_o$ after $m_l + m_o$
misses. More formally,

\[ \Pr\{\tau_l \text{ replaces } \tau_o \text{ on line } i\} = 1 - \left[1 - \frac{m_l}{C(m_l + m_o)}\right]^{m_l + m_o} \tag{3} \]

In Equation 3, $\frac{m_l}{C(m_l + m_o)}$ represents the probability
that a miss by $\tau_l$ will result in an arbitrary line, $i$, being
populated by contents for $\tau_l$. We know that the probability of a
particular line being replaced by a single miss is $1/C$, and the
ratio $\frac{m_l}{m_l + m_o}$ corresponds to the probability of that
miss being caused by one of $\tau_l$'s accesses.
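It follows from Equation 3 that the probability of $\tau_o$ replacing
$\tau_l$ on line $i$ at the end of $m_l + m_o$ misses is:

\[ \Pr\{\tau_o \text{ replaces } \tau_l \text{ on line } i\} = 1 - \left[1 - \frac{m_o}{C(m_l + m_o)}\right]^{m_l + m_o} \tag{4} \]

\[ \Pr\{i \text{ belongs to } \tau_l \mid i \text{ belonged to } \tau_l\} = 1 - \Pr\{\tau_o \text{ replaces } \tau_l \text{ on line } i\} = \left[1 - \frac{m_o}{C(m_l + m_o)}\right]^{m_l + m_o} \tag{5} \]

\[ \Pr\{i \text{ belongs to } \tau_l \mid i \text{ belonged to } \tau_o\} = \Pr\{\tau_l \text{ replaces } \tau_o \text{ on line } i\} = 1 - \left[1 - \frac{m_l}{C(m_l + m_o)}\right]^{m_l + m_o} \tag{6} \]
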
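From Equations 1, 2, 5 and 6, we have:

\[ \Pr\{i \text{ belongs to } \tau_l\} = \frac{E}{C}\left[1 - \frac{m_o}{C(m_l + m_o)}\right]^{m_l + m_o} + \left(1 - \frac{E}{C}\right)\left(1 - \left[1 - \frac{m_l}{C(m_l + m_o)}\right]^{m_l + m_o}\right) \tag{7} \]
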
Ignoring the effects of quadratic and higher-degree terms, the
first-degree linear approximation of Equation 7 becomes:

\[ \Pr\{i \text{ belongs to } \tau_l\} = \frac{E}{C}\left(1 - \frac{m_o}{C}\right) + \left(1 - \frac{E}{C}\right)\frac{m_l}{C} \tag{8} \]

This is a reasonable approximation given that $1/C$ is small.
Consequently, the expected number of cache lines, $E'$, belonging to
$\tau_l$ at time $t + \delta t$ is:

\[ E' = E\left(1 - \frac{m_o}{C}\right) + \left(1 - \frac{E}{C}\right) m_l = E + \left(1 - \frac{E}{C}\right) m_l - \frac{E}{C}\, m_o \tag{9} \]

This follows from Equation 8 by considering the state of each of
the $C$ cache lines as independent of all others.
Observe that the recurrence relation in Equation 9 captures the
changes in cache occupancy for some thread over a given interval of
time, with known local and global misses. The terms
$\left[1 - \frac{m_o}{C(m_l + m_o)}\right]^{m_l + m_o}$ and
$\left[1 - \frac{m_l}{C(m_l + m_o)}\right]^{m_l + m_o}$ in Equation 7
approximate to $e^{-m_o/C}$ and $e^{-m_l/C}$, respectively. Thus, for
situations where $m_l + m_o \gg 1$, Equation 9 becomes

\[ E' = E e^{-m_o/C} + C\left(1 - \frac{E}{C}\right)\left(1 - e^{-m_l/C}\right) \tag{10} \]

Equation 10 is significant in that it shows the cache occupancy of
a thread (here, $\tau_l$) mimics the charge on an electrical
capacitor. Given some initial occupancy, $E$, a growth rate
proportional to $(1 - e^{-m_l/C})$ applies to lines currently
unoccupied by $\tau_l$. Similarly, the rate of reduction in occupancy
(i.e., the equivalent discharge rate in a capacitor) is proportional
to $e^{-m_o/C}$.
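
To make the relationship between the three forms concrete, the following minimal sketch evaluates Equation 7 (scaled by C), the linear Equation 9, and the exponential Equation 10 side by side. The function names and the sample miss counts are illustrative assumptions, not part of the original work; the cache size of 65536 lines corresponds to a 4 MB cache with 64-byte lines.

```python
import math


def exact_occupancy(E, ml, mo, C):
    """Equation 7 multiplied by C: expected lines owned by the local thread."""
    n = ml + mo
    if n == 0:
        return E  # no misses, so occupancy cannot change
    keep = (1.0 - mo / (C * n)) ** n        # Pr{line stays with tau_l}
    gain = 1.0 - (1.0 - ml / (C * n)) ** n  # Pr{tau_l takes a line from tau_o}
    return E * keep + (C - E) * gain


def linear_occupancy(E, ml, mo, C):
    """Equation 9: first-degree approximation of Equation 7."""
    return E * (1.0 - mo / C) + (1.0 - E / C) * ml


def exponential_occupancy(E, ml, mo, C):
    """Equation 10: capacitor-like form."""
    return E * math.exp(-mo / C) + C * (1.0 - E / C) * (1.0 - math.exp(-ml / C))


if __name__ == "__main__":
    C = 65536      # 4 MB cache / 64-byte lines
    E = 20000.0    # hypothetical starting occupancy
    # Hypothetical (ml, mo) pairs, from a short interval to a very long one.
    for ml, mo in [(100, 300), (2000, 6000), (20000, 60000)]:
        print(ml, mo,
              round(exact_occupancy(E, ml, mo, C), 1),
              round(linear_occupancy(E, ml, mo, C), 1),
              round(exponential_occupancy(E, ml, mo, C), 1))
```

For miss counts that are small relative to C the three forms agree closely, which is the regime in which the linear update is cheap and adequate; for miss counts comparable to C the linear form drifts away from the other two.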
The linear model in Equation 9 is practical for online occupancy
estimation, since it consists of an inexpensive computation that
requires only the ability to measure per-core and per-CMP cache
misses, which is provided by most modern processor architectures.
For example, in the Intel Core architecture [11] used for our
experiments in Section 3, the performance counter event L2_LINES_IN
represents lines allocated in the L2 cache, in response to both on-
demand and prefetch misses. A mask can be used to specify whether
to count misses on a single core or on both cores sharing the cache.
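
As an illustration of how the linear model might be driven from counter readings, the sketch below keeps a running estimate for one thread. It assumes cumulative per-core and per-CMP line-allocation counts are passed in by the caller; the class name, field names, and the clamping of the result to the range [0, C] are assumptions made for this example rather than details from the original implementation.

```python
from dataclasses import dataclass


@dataclass
class OccupancyEstimator:
    """Tracks estimated cache lines held by one thread (Equation 9)."""
    cache_lines: int          # C, total lines in the shared cache
    occupancy: float = 0.0    # E, current estimate
    last_local: int = 0       # previous cumulative local miss count
    last_global: int = 0      # previous cumulative global miss count

    def sample(self, local_count: int, global_count: int) -> float:
        # Convert cumulative counter readings into per-interval deltas.
        ml = local_count - self.last_local
        total = global_count - self.last_global
        mo = max(total - ml, 0)   # misses by all other threads
        self.last_local, self.last_global = local_count, global_count

        C = self.cache_lines
        E = self.occupancy
        E = E * (1.0 - mo / C) + (1.0 - E / C) * ml
        # Assumption: clamp so long intervals cannot push E outside [0, C].
        self.occupancy = min(max(E, 0.0), float(C))
        return self.occupancy


if __name__ == "__main__":
    est = OccupancyEstimator(cache_lines=65536)
    # Hypothetical cumulative (local, global) readings at successive samples.
    for local, glob in [(1000, 4000), (1500, 6000), (1500, 9000)]:
        print(est.sample(local, glob))
```

In the last sample the local count does not advance, which is how a descheduled thread would look: its estimate only decays in response to misses by other threads.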
2.2 Extended Cache Model for LRU Replacement Policies
So far, our analysis has assumed that each line of the cache is
equally likely to be accessed. Over the lifetime of a large set of
threads, this is a reasonable assumption. However, commodity
CMP configurations feature n-way set associative caches, and lines
within sets are not usually replaced randomly. Rather, victim lines
are typically selected using some approximation to a least recently
used (LRU) replacement policy. We modified Equation 9 to
additionally incorporate cache hit information, modeling the reduced
replacement probability due to LRU effects when lines are reused.
Equation 9 can be rewritten as

\[ E' = E(1 - m_o p_l) + (C - E)\, m_l p_o \tag{11} \]

where $p_l$ is the probability that a miss falls on a line belonging
to $\tau_l$, and $p_o$ is the probability that a miss falls on a line
belonging to $\tau_o$. Since Equation 9 does not model LRU effects,
each line is equally likely to be replaced and $p_l = p_o = 1/C$. In
order to model LRU effects, we calculate

\[ r_l = (h_l + m_l)/E \tag{12} \]

\[ r_o = (h_o + m_o)/(C - E) \tag{13} \]

to quantify the frequency of reuse of the cache lines of $\tau_l$ and
$\tau_o$, respectively. $h_l$ and $h_o$ represent the number of cache
hits experienced by $\tau_l$ and $\tau_o$, respectively, in the
measurement interval. As with miss counts, these hit counts can be
obtained using hardware performance counters available on most
modern processors.

When the cache replacement policy is an LRU variant, $r_o$ and $r_l$
approximate the frequency of reuse of the cache lines belonging to
$\tau_o$ and $\tau_l$, respectively, since we are unable to precisely
know which line is the most recently accessed. Since the probability
that a miss evicts a line belonging to a thread is inversely
proportional to its reuse frequency, we assume the following
relationship:

\[ p_o / p_l = r_l / r_o \tag{14} \]

Furthermore, since a miss must fall on some line in the cache with
probability 1:

\[ p_l E + p_o (C - E) = 1 \tag{15} \]

Solving Equations 14 and 15, we obtain:

\[ p_o = \frac{r_l}{r_o E + r_l (C - E)} \tag{16} \]

\[ p_l = \frac{r_o}{r_o E + r_l (C - E)} \tag{17} \]

The values of $p_o$ and $p_l$ obtained from Equations 16 and 17 can
be substituted in Equation 11 to obtain the hit-adjusted occupancy
estimation model which handles LRU cache replacement effects.
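
The hit-adjusted update can be sketched as a single function that combines Equations 11 through 17. This is an illustrative reading of the model, not the authors' code: the function name is hypothetical, and the fallback to p_l = p_o = 1/C when a partition is empty (where Equations 12 and 13 would divide by zero) is an assumption added for this example.

```python
def estimate_mh(E, ml, mo, hl, ho, C):
    """One hit-adjusted occupancy update (Equations 11-17).

    E      : current estimated lines owned by the local thread
    ml, mo : misses by the local thread and by all other threads
    hl, ho : hits by the local thread and by all other threads
    C      : total lines in the shared cache
    """
    uniform = 1.0 / C
    if E <= 0 or E >= C:
        # Assumption: reuse frequency is undefined for an empty partition,
        # so fall back to the basic model (Equation 9).
        pl = po = uniform
    else:
        rl = (hl + ml) / E          # Equation 12
        ro = (ho + mo) / (C - E)    # Equation 13
        denom = ro * E + rl * (C - E)
        if denom == 0:
            pl = po = uniform       # no accesses at all in this interval
        else:
            po = rl / denom         # Equation 16
            pl = ro / denom         # Equation 17
    new_E = E * (1.0 - mo * pl) + (C - E) * ml * po   # Equation 11
    return min(max(new_E, 0.0), float(C))


if __name__ == "__main__":
    # Equal reuse frequencies reduce to the basic model.
    print(estimate_mh(1000.0, 10, 30, 90, 270, 4000))
    # Heavier reuse by the local thread protects its lines from eviction.
    print(estimate_mh(1000.0, 10, 30, 900, 270, 4000))
```

When both threads reuse their lines at the same rate, p_l and p_o collapse to 1/C and the result matches Equation 9; raising the local hit count shifts evictions toward the other thread's lines.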
2.3 Experiments
We evaluated the cache estimation models on Intel’s CMPSched$im
simulator [19], which supports binary execution and co-scheduling
of multiple workloads. This enabled us to measure the accuracy
of our cache occupancy models by comparing the estimated occu-
pancy values with the actual values returned by the simulator. The
ability to control scheduling allowed us to perform experiments in
both under-committed and over-committed scenarios.
By default, the Intel simulator implements a CMP architecture us-
ing a pseudo-LRU policy used in modern processors, although it is
also configurable to simulate random and other replacement poli-
cies. We configured the simulator to use a 3 GHz clock frequency,
with private per-core 32 KB 4-way set-associative L1 caches, and
a shared 4 MB 16-way set-associative L2 cache. All caches used
a 64-byte line size. The number of hardware cores and software
threads was varied across different experiments to test the effective-
ness of our occupancy estimation models under diverse conditions.
During simulation, the per-core and per-CMP performance coun-
ters measuring L2 misses and hits were sampled once per millisec-
ond, after which the occupancy estimates were updated for each
software thread. Since cache occupancies exhibit rapid changes at
this time scale, we averaged occupancies over 100 millisecond in-
tervals. We plot one value per second for both the estimated and
actual occupancy values, in order to display results more clearly
over longer time scales. We refer to the miss-based occupancy
estimation technique using the basic cache model presented in
Section 2.1 as method Estimate-M. The extended cache model
presented in Section 2.2 that also incorporates hit information to
better model associativity is referred to as method Estimate-MH.
Our first experiment tests the effectiveness of the basic Estimate-M
method in a dual-core configuration where a 16-way set-associative
L2 cache is configured to use a simple random cache line replace-
ment policy instead of pseudo-LRU. Figure 1 plots the estimated
and actual cache occupancies over time when the two cores were
running mcf and omnetpp from the SPEC CPU2006 benchmark
suite. The estimated occupancy for each benchmark tracks its ac-
tual occupancy very closely, which is expected since the random
replacement policy is consistent with our assumption of random
cache access.
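
The agreement under random replacement can be reproduced in miniature. The toy Monte Carlo sketch below is not the CMPSched$im experiment: it assumes a small cache, two synthetic threads whose misses arrive with a fixed probability split, and a uniformly random victim line for every miss. All parameter values and names are hypothetical.

```python
import random


def simulate(C=1024, total_misses=20000, p_local=0.7, interval=64, seed=1):
    """Compare actual and estimated occupancy under random replacement.

    Returns (actual, estimate) for the local thread after total_misses.
    """
    rng = random.Random(seed)
    owner = [0] * C      # 0 = other threads, 1 = local thread
    actual = 0           # lines currently owned by the local thread
    estimate = 0.0       # Estimate-M value (Equation 9)
    ml = mo = 0          # misses accumulated in the current interval

    for step in range(1, total_misses + 1):
        victim = rng.randrange(C)                 # random replacement
        is_local = 1 if rng.random() < p_local else 0
        actual += is_local - owner[victim]        # +1, 0, or -1
        owner[victim] = is_local
        if is_local:
            ml += 1
        else:
            mo += 1
        if step % interval == 0:                  # periodic estimator update
            estimate = estimate * (1.0 - mo / C) + (1.0 - estimate / C) * ml
            ml = mo = 0
    return actual, estimate


if __name__ == "__main__":
    actual, estimate = simulate()
    print("actual lines:", actual, "estimated lines:", round(estimate, 1))
```

Because the simulated replacement policy matches the model's uniform-access assumption, the estimate and the true count should stay close and both settle near the local thread's share of the misses.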
Figure 1: Accuracy of basic Estimate-M method on dual-core
system with random line replacement policy.

Our next experiment evaluates the same workload with the default
pseudo-LRU line replacement policy, which is used by actual
processors. Figures 2(a) and 2(b) plot the estimated and actual
cache occupancies over time, for mcf and omnetpp respectively,
using both the basic Estimate-M and extended Estimate-MH methods.
Figures 2(c) and 2(d) present the absolute error between the
actual and estimated values. The workloads in this experiment were
selected to highlight the difference in accuracy between the two
estimation methods, which generally agreed more closely for other
workload pairings. In this case, the Estimate-M method is
considerably less accurate, often showing a substantial discrepancy
relative to the actual occupancies, especially during the interval
between 8 and 18 seconds. On the other hand, the hit-adjusted
Estimate-MH method, designed to better reflect LRU effects, is much
more accurate, and tracks the actual occupancies fairly closely.

The remaining experiments focus on the more accurate Estimate-MH
method with various sets of co-running workloads. Figure 3 presents
the results of two separate experiments with different co-running
SPEC CPU2006 benchmarks with a dual-core configuration. Figures 3(a)
and 3(b) show gcc running with mcf on the two cores; omnetpp
and perlbmk are co-runners in Figures 3(c) and 3(d). The estimated
occupancies match the actual values very closely.
Figure 4 presents the results of three separate experiments. Each
row plots occupancy over time for four different co-running bench-
marks from the SPEC CPU2000 and CPU2006 suites in a quad-
core configuration. As with the dual-core results, the estimated
occupancies in the quad-core system match the actual values with
reasonably good accuracy.
We also evaluated the effectiveness of occupancy estimation in an
over-committed system, in which many software threads are time-
multiplexed onto a smaller number of hardware cores. In such a
scenario, some threads will be descheduled at various points in
time, waiting in a scheduler run queue to be dispatched onto a
processor core. In our experiments, we used a 100 millisecond
scheduling time quantum, with a simple round-robin scheduling
policy selecting threads from a global run queue.
Figure 5 plots the actual and estimated occupancies over time for a
quad-core system with ten software threads running various bench-
marks from the SPEC CPU2000 and CPU2006 suites. In this exper-
iment, the ten threads are scheduled to run on the four cores sharing
the L2 cache. The accuracy of occupancy estimation remains high,
despite the time-sliced scheduling.
In order to look at the estimation accuracy over shorter time
intervals, Figure 6 shows fine-grained occupancy estimation for the
mcf, equake00, and xalancbmk workloads from Figures 5(a), 5(c), and
5(e). The actual and estimated occupancies are plotted every 100
milliseconds. Estimated occupancy tracks actual occupancy very
closely, even during periods when a thread is descheduled and its
occupancy falls to zero. Although these fine-grained results are
reported for only three of the ten workloads from Figure 5, we
observed similar behavior for the remaining benchmarks.

Figure 3: Two pairs of co-runners in dual-core systems: mcf vs. gcc, and omnetpp vs. perlbmk.
3. CACHE UTILITY CURVES
Our work on online cache occupancy prediction is intended to lay
a foundation for improved resource management on multicore pro-
cessors. In particular, we are currently investigating ways to im-
prove fair and efficient scheduling of workloads (such as software
threads, processes, tasks, or virtual CPUs) on hardware contexts
(e.g., cores or SMT threads) that compete for shared resources such
as cache lines and memory bus bandwidth. In this paper, we do not
propose any new cache-aware scheduling algorithms. Instead, we
show how our method for online cache occupancy estimation can
be used to produce workload-specific cache utility curves, which
have proved valuable in prior research [3, 21, 26, 27, 28, 33].
These curves are presented with cache occupancy as the independent
variable on the x-axis and a performance metric on the y-axis, such
as the number of cache misses per reference, instruction, or cycle
at different occupancies. In this section we explain our technique
for lightweight online construction of cache utility curves,
yielding information about the effect of cache size on expected
performance for running workloads. We then present experimental MRC
results for a series of benchmarks, using a prototype
implementation, and compare them to MRCs collected for the same
workloads using static page coloring.

Figure 4: Three sets of co-runners in quad-core systems: occupancy over time for different sets of co-running SPEC CPU2000 and
CPU2006 benchmarks in (a)–(d), (e)–(h), and (i)–(l).
All experiments were conducted on a Dell PowerEdge SC1430
host, configured with two 2.0 GHz Intel Xeon E5335 processors
and 4GB RAM. Each quad-core Xeon processor actually consists
of two separate dual-core CMPs in a single physical package. The
two cores in each CMP share a common 4MB L2 cache. We im-
plemented our techniques for cache utility curve generation in the
VMware ESX Server 4.0 hypervisor [30]. Each benchmark appli-
cation was deployed in a separate virtual machine, configured with
a single CPU and 256MB RAM, running an unmodified Red Hat
Enterprise Linux 5 guest OS (Linux 2.6.18-8.el5 kernel).
3.1 Curve Types
Most work in this area has focused on per-thread miss-ratio curves
that plot cache misses per memory reference at different cache oc-
cupancies [3, 21, 26, 27, 28]. Another type of miss-ratio curve plots
cache misses per instruction retired at different cache occupancies.
We refer to miss-ratio curves in units of misses per kilo-reference
as MPKR curves, and to those in units of misses per kilo-instruction
as MPKI curves.
It is also possible to construct miss-rate curves, defined in terms
of misses per kilo-cycle. Such MPKC curves are attractive for use
with cache-aware scheduling policies, since they indicate the num-
ber of misses expected over a real-time interval for a workload with
a given cache occupancy. However, a problem with MPKC curves
is that they are sensitive to contention for memory bandwidth from
co-running workloads. Under high contention, workloads start ex-
periencing more memory stalls, throttling back their instruction is-
sue rate, thereby decreasing their cache misses per unit time. Con-
sequently, a cache utility function based on miss rates is dependent
on dynamic memory bandwidth contention from co-running work-
loads. In contrast, MPKR and MPKI curves measure cache metrics
that are intrinsic to a workload, independent of co-runners and tim-
ing details.
Figure 7 illustrates the problem of MPKC sensitivity to memory
bandwidth contention using the SPEC2000 mcf00 workload.
Miss-rate curves for mcf00 were collected using page coloring, but with
different levels of memory read bandwidth contention generated by
a micro-benchmark running on a different CMP sharing the same
memory bus, but not the same cache. For a given cache occupancy
value, the miss rates are higher when there is less memory band-
width contention, resulting in variable miss-rate curves.
One can also generate CPKI curves, which measure the impact of
cache size on the cycles per kilo-instruction efficiency of a
workload. The CPKI metric has the advantage of directly showing the
impact of cache size on a workload’s performance, reflecting the
effects of instruction-level parallelism that help tolerate cache
miss latency. However, CPKI curves share the problem of co-runner
variability due to contention for memory bandwidth or other shared
hardware resources.

Figure 6: Fine-grained occupancy estimation in over-committed quad-core system.
Since MPKI and MPKR curves do not vary based on memory con-
tention caused by co-runners, they are good candidates for deter-
mining a workload’s intrinsic cache behavior. In some cases, how-
ever, it is also useful to infer the impact on workload performance
due to the combined effects of cache and memory bandwidth con-
tention. Our utility curve generator can produce both MPKI and
CPKI curves to guide higher-level scheduling policies.
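
The four metrics discussed in this section differ only in which counter deltas are divided. The short sketch below computes them from one set of counts and shows why MPKC and CPKI move under memory stalls while MPKI and MPKR do not. The function name and all sample numbers are hypothetical and chosen only for illustration.

```python
def cache_metrics(misses, references, instructions, cycles):
    """Return the per-kilo metrics used for cache utility curves."""
    return {
        "MPKR": 1000.0 * misses / references,     # misses per kilo-reference
        "MPKI": 1000.0 * misses / instructions,   # misses per kilo-instruction
        "MPKC": 1000.0 * misses / cycles,         # misses per kilo-cycle
        "CPKI": 1000.0 * cycles / instructions,   # cycles per kilo-instruction
    }


if __name__ == "__main__":
    # Same work done at the same occupancy, first with an idle memory bus...
    quiet = cache_metrics(misses=5000, references=200000,
                          instructions=1000000, cycles=2000000)
    # ...then with extra stall cycles caused by bandwidth contention.
    contended = cache_metrics(misses=5000, references=200000,
                              instructions=1000000, cycles=4000000)
    for name in ("MPKR", "MPKI", "MPKC", "CPKI"):
        print(name, quiet[name], contended[name])
```

Only the cycle count changes between the two calls, so the miss-ratio metrics are identical while the miss rate halves and CPKI doubles.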
3.2 Curve Generation
We implemented our online cache-utility curve generator in ESX
Server. Utilizing the occupancy estimation method described in
Section 2, curve generation consists of two components at different
time scales: fine-grained occupancy updates, and coarse-grained
curve construction.
3.2.1 Occupancy Updates
Each core updates the cache occupancy estimate for its currently-
running thread every two milliseconds, using the linear occupancy
model in Equation 9 to implement the Estimate-M method pre-
sented in Section 2.1. We were unable to use the more accurate
hit-adjusted Estimate-MH method described in Section 2.2, due
to the limited number of hardware performance counters available
on our experimental platform.² A high-precision timer callback
reads hardware performance counters to obtain the number of cache
misses for both the local core and the whole CMP since the last
update. In addition to this periodic update, occupancy estimates are
also updated whenever a thread is rescheduled, based on the number
of intervening cache misses since it last ran.

²The Intel E5335 processor supports only two programmable counters,
which we used for counting cache misses by the local core and
the whole CMP. More recent processors from Intel and AMD support
at least four programmable counters, sufficient for supporting
the Estimate-MH method.

Figure 7: Effect of memory bandwidth contention on MPKC
miss-rate curve for mcf00 workload.
While always maintaining a precise cache occupancy estimate, our
current implementation additionally quantizes cache occupancy in
discrete levels equal to one-eighth of the total cache size, in or-
der to support efficient curve generation. We construct discrete
curves to bound the space and time complexity of their genera-
tion, while providing sufficient accuracy to be useful in cache-
aware CPU scheduling enhancements. During each occupancy up-
date for a thread, we record the change in values since the previ-
ous update for several hardware performance counters, including
cache misses, instructions retired, and elapsed CPU cycles. These
changes are added to aggregate values associated with the current
discrete occupancy level. However, if an update spans multiple oc-
cupancy levels, the current incremental changes are not added to
the aggregate values, because they cannot be attributed to a single
level. Since occupancy updates are invoked very frequently, we
tuned the timer callback carefully, and measured its cost as approx-
imately 320 cycles on our experimental platform.
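
The bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions rather than the ESX Server code: the class and method names are hypothetical, occupancy is mapped to a level by truncating into eighths of the cache, and an update is treated as spanning multiple levels when the occupancy before and after it falls into different levels.

```python
LEVELS = 8  # occupancy is quantized into eighths of the cache


def level_of(occupancy, cache_lines):
    """Map an occupancy in lines to a discrete level 0..LEVELS-1."""
    return min(int(occupancy * LEVELS / cache_lines), LEVELS - 1)


class CurveAccumulator:
    """Per-thread aggregate counters, one slot per occupancy level."""

    def __init__(self, cache_lines):
        self.cache_lines = cache_lines
        self.misses = [0] * LEVELS
        self.instructions = [0] * LEVELS
        self.cycles = [0] * LEVELS
        self.updates = [0] * LEVELS

    def record(self, prev_occupancy, new_occupancy,
               d_misses, d_instructions, d_cycles):
        """Add counter deltas to the current level; return True if recorded."""
        before = level_of(prev_occupancy, self.cache_lines)
        after = level_of(new_occupancy, self.cache_lines)
        if before != after:
            # The interval cannot be attributed to a single level: drop it.
            return False
        self.misses[after] += d_misses
        self.instructions[after] += d_instructions
        self.cycles[after] += d_cycles
        self.updates[after] += 1
        return True


if __name__ == "__main__":
    acc = CurveAccumulator(cache_lines=65536)
    print(acc.record(9000, 9500, 40, 10000, 30000))    # stays within one level
    print(acc.record(16000, 17000, 55, 10000, 32000))  # crosses a level boundary
    print(acc.updates)
```

Keeping only fixed-size arrays per thread is what bounds the space and time cost of later curve construction.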
3.2.2 Generating Miss-Ratio Curves
Miss-ratio curves are generated after a configurable time period,
typically several seconds spanning thousands of fine-grained occu-
pancy updates. For each discrete occupancy level, an MPKI value
is computed by dividing its associated aggregate cache misses
(accumulated during the fine-grained occupancy updates) by the
accumulated retired instructions for that level.
MPKI values are expected to be monotonically decreasing with
increasing cache occupancy; i.e., more cache leads to fewer misses
per instruction. Our utility curve generator enforces this mono-
tonicity property explicitly by adjusting MPKI values. Preference
is given to those occupancy points that have the most updates, since
we have more confidence in the performance metrics correspond-
ing to these points. Starting with the most-updated occupancy point
with MPKI value m, any lower MPKI values to its left or higher
MPKI values to its right are set to m. Interestingly, monotonicity
violations are good indicators of phase changes in workload behav-
ior, although we do not yet exploit such hints. We instrumented our
MRC generation code, including monotonicity enforcement, and
found that it takes approximately 2850 cycles to execute on our ex-
perimental platform. The overheads for occupancy estimation and
MRC construction are sufficiently low that these techniques can
remain enabled at all times in production systems.
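One way to read the monotonicity rule described above is sketched below in Python. The function name `enforce_monotonicity` is hypothetical, and two details are assumptions of this sketch rather than statements about our implementation: occupancy points are visited in decreasing order of update count, and points with no updates carry an initial MPKI of zero.

```python
def enforce_monotonicity(mpki, updates):
    """Return a non-increasing copy of mpki, trusting most-updated points first.

    mpki[i] is the MPKI value at occupancy level i (left = small occupancy);
    updates[i] is the number of occupancy updates recorded at that level.
    """
    curve = list(mpki)
    # Visit points from most to least updated; ties go to the lower index.
    order = sorted(range(len(curve)), key=lambda i: (-updates[i], i))
    for i in order:
        m = curve[i]
        for j in range(i):                    # smaller occupancies
            if curve[j] < m:
                curve[j] = m                  # raise lower values on the left
        for j in range(i + 1, len(curve)):    # larger occupancies
            if curve[j] > m:
                curve[j] = m                  # cap higher values on the right
    return curve


# Example: the two lowest levels were never visited (zero updates).
raw = [0.0, 0.0, 12.0, 9.0, 10.0, 6.0, 7.0, 5.0]
counts = [0, 0, 40, 300, 120, 900, 50, 200]
smoothed = enforce_monotonicity(raw, counts)
```

In this example the unvisited low-occupancy levels inherit the value of the nearest visited level to their right, which produces the flat left-hand region that appears when a workload never reaches low occupancies.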
Our cache utility curve generator is extremely flexible. By record-
ing appropriate statistics with each discrete occupancy point, a va-
riety of different cache performance curves can be constructed. By
default, we collect cache misses, instructions retired, and elapsed
cycles, enabling generation of MPKI, MPKC, and CPKI curves.
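As a small illustration, the three default metrics can be derived from the per-level aggregates as follows. The sketch reads the acronyms as misses per kilo-instruction (MPKI), misses per kilo-cycle (MPKC), and cycles per kilo-instruction (CPKI); the function name `level_metrics` and the choice to return `None` for a level without data are assumptions.

```python
def level_metrics(misses, instructions, cycles):
    """Return (MPKI, MPKC, CPKI) for one occupancy level, or None if no data."""
    if instructions == 0 or cycles == 0:
        return None  # level was never visited; no metric can be computed
    mpki = 1000.0 * misses / instructions   # misses per 1000 instructions
    mpkc = 1000.0 * misses / cycles         # misses per 1000 cycles
    cpki = 1000.0 * cycles / instructions   # cycles per 1000 instructions
    return mpki, mpkc, cpki


# Example aggregates for a single level (made-up numbers).
example = level_metrics(misses=5000, instructions=1_000_000, cycles=4_000_000)
```

Recording an additional counter per level would allow further curves to be derived in the same way.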
3.2.3 Obtaining Full Curves
A key challenge with our approach is obtaining performance met-
rics at all discrete occupancy points. In the steady state, a group
of threads co-running on a shared cache achieve equilibrium occu-
pancies. As a result, the cache performance curve for each thread
has performance metrics concentrated around its equilibrium occu-
pancy, leading to inaccuracies in the full cache performance curves.
In addition to passive monitoring, we have explored ways to ac-
tively perturb the execution of co-running threads to alter their rel-
ative cache occupancies temporarily. For example, varying the
group of co-runners scheduled with a thread typically causes it to
visit a wider range of occupancy points. An alternative approach
is to dynamically throttle the execution of some cores, allowing
threads on other cores to increase their occupancies. Our utility
curve generator cannot use frequency and voltage scaling to throt-
tle cores, since in commodity CMPs, all cores must operate at the
same frequency [20]. However, we did have some success with
duty-cycle modulation techniques [11, 31] to slow down specific
cores dynamically.
For thermal management, Intel processors allow system code to
specify a multiplier (in discrete units of 12.5%) that sets the fraction
of regular cycles during which a core should be halted. When
a core is slowed down, its co-runners get an opportunity to increase
their cache occupancy, while the occupancy of the thread running
on the throttled core is decreased. To limit any potential perfor-
mance impact, we enable duty-cycle modulation during less than
2% of execution time. Experiments with SPEC CPU2000 bench-
marks did not reveal any observable performance impact due to
cache performance curve generation with duty-cycle modulation.
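The arithmetic behind these settings can be illustrated with a short Python sketch. It assumes a simple model in which a core halted for k eighths of its cycles runs for the remaining (8 - k) eighths and slows down by the reciprocal of that fraction, ignoring any cache effects. The function names are hypothetical and no hardware register is programmed.

```python
STEP = 0.125  # modulation granularity from the text: 12.5% of regular cycles


def run_fraction(halted_steps: int) -> float:
    """Fraction of cycles a core still runs when halted for halted_steps eighths."""
    if not 0 <= halted_steps <= 7:
        raise ValueError("halted_steps must be in 0..7")
    return 1.0 - halted_steps * STEP


def slowdown(halted_steps: int) -> float:
    """Slowdown factor of the throttled core relative to full speed."""
    return 1.0 / run_fraction(halted_steps)


def relative_throughput(halted_steps: int, modulated_share: float) -> float:
    """Average speed of a core that is throttled during modulated_share of its
    execution time and runs at full speed otherwise (simple linear model)."""
    return (1.0 - modulated_share) + modulated_share * run_fraction(halted_steps)
```

Under this model the deepest setting, `slowdown(7)`, gives a factor of 8.0, and `relative_throughput(7, 0.02)` is about 0.9825, i.e., a worst-case slowdown of under 2% for the throttled core when modulation is active for 2% of execution time.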
3.3 Experiments
We evaluated cache curve construction techniques using our ESX
Server implementation. We first collected miss-ratio curves for
various SPEC CPU2000 benchmarks (mcf00, swim00, twolf00,
equake00, gzip00 and perlbmk00), by running them to
completion with access to an increasing number of page colors in each
successive run. We then ran all six benchmarks together on a single
CMP of the Dell system, using our online approach to generate the
miss-ratio curves at benchmark completion time.
Figure 8 compares the miss-ratio curves of the benchmarks ob-
tained online with those obtained by page coloring. In most cases,
the MRC shapes and absolute MPKI values match reasonably well.
However, in Figure 8(a), the MRC generated online for mcf00
is flat at lower occupancy points, differing significantly from the
page-coloring results. Even with duty-cycle modulation there is
insufficient interference from co-runners to push mcf00 into lower
occupancy points. Since there are no updates for these points, the
miss-ratio values for higher occupancy points are used as the best
estimate due to monotonicity enforcement.
To analyze this further, Figure 9 shows separate MRCs generated
online for mcf00 with different co-runners, swim00 and gzip00.
The MRC generated when mcf00 is running with gzip00 is flat
because mcf00 only has updates at the highest occupancy point.
Figure 9: MRC for mcf00 with different co-runners.
of sixty more than the miss ratio of gzip00, which renders
duty-cycle modulation ineffective, since it can throttle a core by at most
a factor of eight. In contrast, the MRC generated with co-runner
swim00 matches the MRC obtained by page-coloring closely.
3.4 Discussion
Our online technique for MRC construction builds upon our cache
occupancy estimation model. While the MRCs generated for a
working system in Section 3.3 are encouraging, there remain sev-
eral open issues. By using only commodity hardware features, our
MRCs may not always yield data points across the full spectrum of
cache occupancies. Duty-cycle modulation addresses this problem
to some degree, but some sensitivity to co-runner selection may
still remain. Although an MPKI curve is intrinsic to a workload,
and does not vary based on contention from co-runners, the work-
load may be prevented from visiting certain occupancy levels due to
co-runner interference, as observed in Figure 9. In practice, it may
be necessary to vary co-runners selectively during some execution
intervals, in order to allow a workload to reach high cache occu-
pancies, or alternatively, to force a workload into low occupancy
states, depending on the memory demands of the co-runners.
While the experiments in Section 3.3 compare offline MRCs with
those from our online approach, the online curves are produced at the
time of benchmark completion. This introduces some potential
differences between the online and offline curves, since online we plot
MPKI values based on the time during workload execution at which a
given occupancy is reached. We are currently investigating MRCs at
different time granularities. Early investigations yield curves that remain
stable for an execution phase, but which fluctuate while changing
phases. We intend to study how MRCs can be used to identify
phase changes as part of future work.
4. RELATED WORK
The focus of this paper is on efficient online cache modeling, in-
cluding software techniques for estimating both runtime cache oc-
cupancies and performance curves for individual workloads. The
intent is for such estimates to inform performance-related resource
management decisions, including cache-aware scheduling. To the
best of our knowledge, no prior software techniques exist for online
estimation of per-thread cache occupancies in commodity proces-
sors with shared caches. Other researchers have, however, inferred
cache usage and utility of different cache sizes. In CacheScouts [32],
for example, hardware support for monitoring IDs and set sam-
pling are used to associate cache lines with different workloads,
enabling cache occupancy measurements. The use of special IDs
differs from our occupancy estimation approach, which requires
only currently-available performance monitoring events common
to modern CMPs. For estimation of performance curves, the ap-
proach taken by RapidMRC [28] comes closest to our goal of an
efficient online software technique, although as detailed below, it
still requires uncommon hardware support and incurs significant
runtime overhead.
In the area of shared-cache resource management, there is a sig-
nificant literature on cache partitioning, using either hardware or
software techniques [2, 5, 7, 13, 14, 17, 22, 23, 25, 27]. This has
been prompted by the observation that multiple workloads sharing
a cache may experience interference in the form of conflict misses
and memory bus bandwidth contention, resulting in significant per-
formance degradation. For example, several studies have shown
significant variation in execution times for SPEC benchmarks, de-
pending on co-runners competing for shared resources [14, 33].
Cache partitioning has the potential to eliminate conflict misses and
improve fairness or overall performance. While hardware-based
approaches are typically faster and more efficient than those imple-
mented by software, they are not commonly available on current
processors [26, 27]. Software techniques such as those based on
page coloring require careful coordination with the memory man-
agement subsystem of the underlying OS or hypervisor, and are
generally too expensive for workloads with dynamically varying
memory demands [6, 15, 16].
A significant challenge with cache partitioning is deriving the op-
timal allocation size for a workload. One approach is to construct
cache utility functions, or performance curves, that associate work-
load benefits (e.g., in terms of miss ratio, miss rate, or CPI) with
different cache sizes. In particular, methods for constructing miss-
ratio curves (MRCs) have been proposed to capture workload per-
formance impacts at different cache occupancies, but either require
special hardware [21, 26, 27], or incur high overhead [3, 28].
The Mattson Stack Algorithm [18] generates MRCs by maintain-
ing an LRU-ordered stack of memory addresses. RapidMRC [28]
uses this algorithm as the basis for its online MRC construction, but
requires hardware support, in the form of a Sampled Data Address
Register (SDAR) in the IBM POWER5 performance monitoring
unit, to obtain a trace of all data accesses to the L2 cache. The total
cost of online MRC construction is several hundred milliseconds,
with more than 80 milliseconds of workload stall time due to the
high overhead of trace collection. This overhead is mitigated by
triggering MRC construction only when phase transitions are de-
tected, based on changes in the overall cache miss rate. However,
since changes in cache miss rates can be triggered by cache con-
tention caused by co-runners, and not necessarily phase changes,
the phase transition detection in RapidMRC does not seem robust
in over-committed environments.
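For readers unfamiliar with the Mattson Stack Algorithm, the following Python sketch shows the basic idea for a fully associative LRU cache: a single pass over an address trace yields miss ratios for every cache size up to a chosen limit. It is a textbook-style illustration, not the RapidMRC implementation; the function name and the list-based stack are choices made here for brevity.

```python
def mattson_miss_ratios(trace, max_lines):
    """Miss ratios of fully associative LRU caches holding 1..max_lines lines.

    trace is a sequence of (line-granularity) addresses. The LRU stack keeps
    the most recently used address at index 0; a reference found at stack
    depth d (1-based) hits in every cache with at least d lines.
    """
    total = len(trace)
    if total == 0:
        return []
    stack = []
    hits_at_depth = [0] * (max_lines + 1)
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr) + 1
            stack.remove(addr)
            if depth <= max_lines:
                hits_at_depth[depth] += 1
        stack.insert(0, addr)      # addr becomes the most recently used
    ratios = []
    hits = 0
    for size in range(1, max_lines + 1):
        hits += hits_at_depth[size]            # cumulative hits for this size
        ratios.append((total - hits) / total)
    return ratios


# Example: six references to three distinct lines.
example_ratios = mattson_miss_ratios(["a", "b", "a", "c", "b", "a"], 3)
```

The linear stack search makes this sketch quadratic in the worst case, and it needs the address of every access, which is the kind of trace that commodity performance counters do not provide.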
In contrast, we deploy an efficient online method to construct MRCs
and other cache-performance curves, requiring only commonly-
available performance counters. Due to the low overhead of our
cache-performance curve construction, it can remain enabled at all
times, providing up-to-date information pertaining to the most re-
cent phase. As a result, our technique does not require an offline
reference point to account for vertical shifts in the online curves
due to phase transitions as in [28], and is also robust in the pres-
ence of cache contention from co-runners. We do, however, suffer
from the problem of obtaining enough occupancy data points to
construct full curves. Using duty-cycle modulation to temporarily
reduce the rate of memory access by competing workloads is one
technique that has the potential to alleviate this problem.
5. CONCLUSIONS AND FUTURE WORK
We have introduced several novel techniques for practical online
cache modeling of commodity multicore processors sharing a last-
level cache. Using both simulations and empirical results from a
prototype implementation, we demonstrated the effectiveness of
our software-based approach with quantitative experiments for a
variety of workloads and CMP configurations.
Our first contribution is efficient online estimation of cache oc-
cupancies for software threads, using only performance counters
commonly available on commodity processors. To the best of our
knowledge, our technique is the first to enable accurate online cache
occupancy monitoring without requiring additional hardware sup-
port. We derive a basic statistical model for cache occupancy es-
timation based on cache miss counts, and extend it to incorpo-
rate cache hit counts for improved accuracy with set-associative
caches that employ pseudo-LRU replacement policies. Simula-
tion results using Intel’s CMPSched$im verify that our estimates
track actual occupancies closely, across various sets of co-running
SPEC benchmarks on realistic dual-core and quad-core CMP con-
figurations, even in the presence of workload phase changes and
descheduling events in over-committed scenarios.
Building on occupancy estimation, we demonstrate how to dynam-
ically generate cache performance curves, such as MRCs, which
capture the effect of cache space on workload performance. Empirical
results using the VMware ESX Server hypervisor demonstrate
that we are able to construct per-thread MRCs online with low over-
head, in the presence of interference from co-runners. Compar-
isons with MRCs generated offline using static page coloring indi-
cate that our lightweight approach is sufficiently accurate to inform
online scheduling algorithms. We also highlight remaining
challenges, and show how duty-cycle modulation can be used to
facilitate obtaining a wider range of MRC occupancy points, by
dynamically varying the level of cache contention between co-runners.
While we have presented several new online techniques for CMP
cache modeling, many interesting research opportunities remain.
We plan to enhance our occupancy estimation approach to incor-
porate the effects of data sharing and constructive interference be-
tween threads. We are also examining various approaches to obtain
accurate cache performance curves at all cache occupancy points.
Additionally, we are exploring ways to extend and integrate our
software techniques with future hardware support for cache QoS
monitoring and enforcement, and to scale to wide CMPs contain-
ing large numbers of cores.
The techniques described in this paper are currently being applied
to cache-aware fair and efficient scheduling in the VMware ESX
Server hypervisor. Our scheduling-related research is directed at
algorithms for optimizing co-runner placement based on estimated
performance trade-offs, as well as mechanisms for improving fair-
ness by adjusting thread scheduling priorities to account for co-
runner interference.
6. REFERENCES
[1] Advanced Micro Devices, Inc. Multi-Core Processors from
AMD, 2009. http://multicore.amd.com/.
[2] D. H. Albonesi. Selective cache ways: on-demand cache
resource allocation. In ACM/IEEE International Symposium
on Microarchitecture (MICRO ’99), pages 248–259,
November 1999.
[3] E. Berg, H. Zeffer, and E. Hagersten. A statistical
multiprocessor cache model. In IEEE International
Symposium on Performance Analysis of Systems and
Software (ISPASS ’06), pages 89–99, 2006.
[4] J. M. Calandrino and J. H. Anderson. Cache-aware real-time
scheduling on multicore platforms: Heuristics and a case
study. In EuroMicro Conference on Real-Time Systems
(ECRTS ’08), pages 299–308, July 2008.
[5] J. Chang and G. S. Sohi. Cooperative cache partitioning for
chip multiprocessors. In International Conference on
Supercomputing (ICS ’07), pages 242–252, June 2007.
[6] S. Cho and L. Jin. Managing distributed, shared L2 caches
through OS-level page allocation. In the 39th Annual
IEEE/ACM International Symposium on Microarchitecture,
pages 455–468, 2006.
[7] H. Dybdahl, P. Stenström, and L. Natvig. A
cache-partitioning aware replacement policy for chip
multiprocessors. In High Performance Computing, volume
4297/2006, pages 22–34, 2006.
[8] E. Ebrahimi, C. J. Lee, O. Mutlu, and Y. Patt. Fairness via
source throttling: A configurable and high-performance
fairness substrate for multi-core memory systems. In
Architectural Support for Programming Languages and
Operating Systems (ASPLOS ’10), pages 335–346, March
2010.
[9] A. Fedorova, M. Seltzer, and M. D. Smith. Cache-fair thread
scheduling for multicore processors. Technical Report
TR-17-06, Harvard University, 2006.
[10] J. Heinrich. MIPS R4000 Microprocessor User’s Manual.
MIPS Technologies, Inc., 1994.
[11] Intel Corporation. Intel 64 and IA-32 Architectures Software
Developer’s Manual, Volume 3: System Programming Guide,
June 2009.
[12] Intel Corporation. Intel Multi-Core Technology, 2009.
http://www.intel.com/multi-core/.
[13] R. Iyer. CQoS: a framework for enabling QoS in shared
caches of CMP platforms. In the 18th Annual International
Conference on Supercomputing, pages 257–266, 2004.
[14] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and
partitioning in a chip multiprocessor architecture. In Parallel
Architectures and Compilation Techniques (PACT ’04),
October 2004.
[15] J. Liedtke, H. Härtig, and M. Hohmuth. OS-controlled cache
predictability for real-time systems. In the 3rd IEEE
Real-time Technology and Applications Symposium, 1997.
[16] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and
P. Sadayappan. Gaining insights into multicore cache
partitioning: Bridging the gap between simulation and real
systems. In the 14th IEEE International Symposium on High
Performance Computer Architecture, pages 367–378, 2008.
[17] C. Liu, A. Sivasubramaniam, and M. Kandemir. Organizing
the last line of defense before hitting the memory wall for
CMPs. In International Symposium on High-Performance
Computer Architecture, pages 176–185, 2004.
[18] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger.
Evaluation techniques for storage hierarchies. IBM Systems
Journal, 9(2):78–117, 1970.
[19] J. Moses, K. Aisopos, A. Jaleel, R. Iyer, R. Illikkal,
D. Newell, and S. Makineni. CMPSched$im: Evaluating
OS/CMP interaction on shared cache management. In IEEE
International Symposium on Performance Analysis of
Systems and Software (ISPASS ’09), pages 113–122, April
2009.
[20] A. Naveh, E. Rotem, A. Mendelson, S. Gochman,
R. Chabukswar, K. Krishnan, and A. Kumar. Power and
thermal management in the Intel Core Duo processor. Intel
Technology Journal, 10(2):109–122, 2006.
[21] M. K. Qureshi and Y. N. Patt. Utility-based cache
partitioning: A low-overhead, high-performance, runtime
mechanism to partition shared caches. In the 39th Annual
IEEE/ACM International Symposium on Microarchitecture,
pages 423–432, 2006.
[22] N. Rafique, W.-T. Lim, and M. Thottethodi. Architectural
support for operating system-driven CMP cache
management. In Parallel Architectures and Compilation
Techniques (PACT ’06), pages 2–12, September 2006.
[23] P. Ranganathan, S. V. Adve, and N. P. Jouppi.
Reconfigurable caches and their application to media
processing. In the 27th Annual International Symposium on
Computer Architecture, pages 214–224, June 2000.
[24] T. Sherwood, B. Calder, and J. S. Emer. Reducing cache
misses using hardware and software page placement. In
International Conference on Supercomputing (ICS ’99), June
1999.
[25] S. Srikantaiah, M. Kandemir, and M. J. Irwin. Adaptive set
pinning: Managing shared caches in CMPs. In Architectural
Support for Programming Languages and Operating Systems
(ASPLOS ’08), March 2008.
[26] G. E. Suh, S. Devadas, and L. Rudolph. Analytical cache
models with applications to cache partitioning. In
International Conference on Supercomputing (ICS ’01),
pages 1–12, June 2001.
[27] G. E. Suh, L. Rudolph, and S. Devadas. Dynamic
partitioning of shared cache memory. Journal of
Supercomputing, 28(1):7–26, April 2004.
[28] D. Tam, R. Azimi, L. Soares, and M. Stumm. RapidMRC:
Approximating L2 miss rate curves on commodity systems
for online optimizations. In Architectural Support for
Programming Languages and Operating Systems (ASPLOS
’09), March 2009.
[29] D. Tam, R. Azimi, and M. Stumm. Thread clustering:
sharing-aware scheduling on SMP-CMP-SMT
multiprocessors. In Proceedings of EuroSys 2007, pages
47–58, March 2007.
[30] VMware, Inc. vSphere Resource Management Guide: ESX
4.0, ESXi 4.0, vCenter Server 4.0, 2009.
[31] X. Zhang, S. Dwarkadas, and K. Shen. Hardware execution
throttling for multi-core resource management. In
Proceedings of the USENIX Annual Technical Conference,
June 2009.
[32] L. Zhao, R. Iyer, R. Illikkal, J. Moses, D. Newell, and
S. Makineni. CacheScouts: Fine-grain monitoring of shared
caches in CMP platforms. In Parallel Architectures and
Compilation Techniques (PACT ’07), September 2007.
[33] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing
shared resource contention in multicore processors via
scheduling. In Architectural Support for Programming
Languages and Operating Systems (ASPLOS ’10), pages
129–141, March 2010.
