Symbiotic Cache Resizing for CMPs with Shared LLC by Choi, Inseok & Yeung, Donald
University of Maryland Technical Report UMIACS-TR-2013-02
Symbiotic Cache Resizing for CMPs with Shared LLC
Inseok Choi and Donald Yeung
Department of Electrical and Computer Engineering
University of Maryland at College Park
{inseok, yeung}@umd.edu
Abstract
This paper investigates the problem of finding the optimal
sizes of private caches and a shared LLC in CMPs. Resiz-
ing private and shared caches in modern CMPs is one way
to squeeze wasteful power consumption out of architectures
to improve power efficiency. However, shrinking each private or shared
cache has a different impact on the performance loss and the power
savings of the CMP, because each cache contributes differently to
performance and power. It is beneficial for both performance and power
to shrink the LRU way of the private or shared cache that saves the
most power and increases data traffic the least.
This paper presents Symbiotic Cache Resizing (SCR), a
runtime technique that reduces the total power consumption
of the on-chip cache hierarchy in CMPs with a shared LLC.
SCR turns off private- and shared-cache ways in an inter-core and
inter-level manner so that each disabled way achieves the best power
saving while maintaining high performance. SCR finds such
optimal cache sizes by utilizing greedy algorithms that we
develop in this study. In particular, Prioritized Way Selec-
tion picks the most power-inefficient way. LLC-Partitioning-
aware Prioritized Way Selection finds optimal cache sizes
from the multi-level perspective. Lastly, Weighted Threshold
Throttling finds optimal threshold per cache level. We eval-
uate SCR in two-core, four-core and eight-core systems. Re-
sults show that SCR saves 13% power in the on-chip cache
hierarchy and 4.2% power in the system compared to an even
LLC partitioning technique. SCR saves 2.7X more power
in the cache hierarchy than the state-of-the-art LLC resizing
technique while achieving better performance.
1. Introduction
The power wall is currently the main limiter to achieving high
performance in modern CPUs, and has been one of the most
critical problems that computer architects face over the past
several years [20]. Unfortunately, this problem will only get
worse in the future as process technologies continue to scale
down feature sizes. As such, power efficiency will remain
an extremely important design goal, and will require hard-
ware designers to continue to make efforts to squeeze waste-
ful power consumption out of architectures.
A key place to look for power savings is in the on-chip
cache hierarchy. Caches occupy a large portion of the CPU’s
available die area–upwards of 50% in today’s CPUs–so they
take up a significant portion of a processor’s overall power
dissipation. In addition, caches are sized for the worst case.
This means an average computation cannot effectively utilize
all of the cache capacity. Such cache over-provisioning can
result in significant waste that, if eliminated, can yield large
power savings without sacrificing much performance.
Several researchers have investigated cache resizing techniques
[1, 3, 4, 23, 24, 28, 40, 41] that target this form of
waste. Cache resizing is an architecture-level power management
technique that determines the minimum cache a program needs to run at
near-peak performance, and then reconfigures the cache by enabling or
disabling cache ways or sets to implement this efficient capacity.
Resizing can reduce the amount of cache activated per access, and also
enables circuit-level techniques (e.g., gated-Vdd [28]) to shut down
the unused portion of the cache. This can translate into significant
dynamic and static power savings.
CMPs are prevalent because exploiting instruction-level
parallelism (ILP) incurs high power dissipation but yields only
modest performance gains. CMP scaling, i.e., increasing
the number of cores, will continue in the foreseeable future
as transistor count increases. Modern CMPs are commonly
equipped with a shared last-level cache (LLC) to efficiently
utilize cache capacity and share data between cores. Ideally,
finding the optimal sizes of the private caches and the shared LLC
in such CMPs eliminates most of the wasteful power consumption in the
on-chip cache hierarchy with no, or only negligible, performance
degradation. While extremely effective on uniprocessors, existing
resizing techniques cannot simply be applied to modern CMPs, because
finding the optimal size of each private cache on a CMP is already an
NP-hard problem. As such, a scalable algorithm is crucial to alleviate
the power inefficiency of the private caches in CMPs.
On the other hand, a shared LLC can be considered a single cache,
so existing cache resizing techniques can be applied to it directly.
A recent study [36] applied cache resizing to the shared LLC in CMPs
on top of the utility-based cache partitioning scheme [29]. That
study targeted the wasteful power consumption of the LLC, mostly its
static power, and reduced LLC power without noticeable performance
degradation by not allocating ways whose utility is lower than a
pre-specified threshold. Unfortunately, this technique alone cannot
squeeze most of the wasteful power consumption out of the on-chip
cache hierarchy, because most of the over-provisioning, from the
perspective of dynamic power, occurs in the private caches. Thus, it
is important to find a way to resize private caches on top of the
resizable shared LLC.
In this paper, we present Symbiotic Cache Resizing (SCR), a novel
greedy algorithm that resizes both the private caches and the shared
LLC in CMPs and squeezes wasteful power consumption out of both. In
particular, SCR achieves higher performance and saves more power than
the existing LLC resizing technique. Figure 2 summarizes our
offline-study results by showing weighted speedups and power
consumption of the on-chip cache hierarchy, as we compare static SCR
to previous studies and our exhaustive searches (we discuss our
offline study in detail in Sections 4.4 and 5.1). As shown, static SCR
achieves both higher performance and higher power efficiency than the
existing LLC resizing technique by eliminating wasteful power
consumption from both the private caches and the LLC.
SCR orchestrates the private caches and the shared LLC through
Weighted Threshold Throttling (WTT). WTT finds optimal thresholds to
control a private-cache resizing technique and an LLC resizing
technique. Having an optimal threshold per cache level is the key to
symbiosis, because resizing each type of cache (private vs. shared)
has a different impact on the overall performance loss and power
consumption. Moreover, the impact varies across applications. For
resizing the LLC, WTT employs the existing state-of-the-art technique
[36]. For private caches, we propose a new technique, called
LLC-Partitioning-aware Prioritized Way Selection (LP-PWS). LP-PWS
resizes private caches symbiotically by utilizing reuse-distance
profiles across cores. It monitors the power efficiency gain (PEG) of
the LRU way of each core and disables ways in PEG order to achieve the
best power saving per unit of performance loss. In addition, LP-PWS
adopts a multi-level cache resizing technique to optimize the total
power consumption of the cache hierarchy, thus achieving symbiosis
when running with a resizable shared LLC. Resizing private caches to
save power can increase their miss rates, resulting in greater power
dissipation at the next level of cache (the LLC) due to increased
accesses. The access energy of a shared LLC is generally larger than
that of a private cache. For this reason, shrinking a private cache
usually reduces the total power consumption of the cache hierarchy,
but beyond some point it increases the total power consumption because
of the larger access energy at the shared LLC. Accordingly, it is
crucial to control private cache resizing with awareness of the total
power consumption of the cache hierarchy.
This paper makes the following contributions:
• We conduct a limit study that searches the solution space
exhaustively to find the solution that outperforms the
state-of-the-art LLC resizing technique in both performance and
power savings.
• We propose Symbiotic Cache Resizing, a greedy algorithm
that can heuristically find good resizing solutions in an
on-line fashion.
• In particular, we explain and show that naive PWS, LP-PWS
and WTT can achieve significant power savings while maintaining
high performance relative to other techniques.
• We show that SCR can save up to 54.5% total cache-
hierarchy power and 16.9% total system power.
• To the best of our knowledge, this is the first study propos-
ing a run-time technique to find the optimal on-chip cache
hierarchy in CMPs by resizing both private caches and the
shared LLC.
The remainder of this paper is organized as follows. Section 2
presents the background and motivation for SCR. Section 3 presents
the analytical model for SCR and its architecture in detail. Section 4
explains the experimental environment and methodology, and Section 5
evaluates the power savings and performance of SCR. Finally, Section 6
discusses related work, and Section 7 concludes the paper.
2. Background and Motivation
Cache resizing has been known for several decades, but its
application to CMPs has not yet been fully investigated. Although
there has been significant work on cache resizing, existing
techniques are limited to uniprocessors. In particular, most
studies consider resizing a single cache level (typically the L1
caches) for a uniprocessor only [1, 23, 24, 28, 40, 41]. A
recent study [36] investigates cache resizing on a multi-core
platform, but only for the shared LLC; it does not cover private
caches. To date, there is no comprehensive study of cache resizing
in a modern CMP cache hierarchy.
Moreover, since both dynamic and static power are important,
controlling the size of only a single cache level potentially misses
significant opportunities for power savings. The trend in modern CPUs
is towards deeper cache hierarchies, which distribute the power
consumption across different caching levels. For dynamic power
consumption, the private cache is the greatest culprit, but for static
power consumption, the LLC is by far the greatest concern due to its
large area. As such, investigating multi-level resizing is mandatory
to eliminate all wasteful power consumption in the cache hierarchy.
2.1. Private Cache Resizing in CMPs
Private caches can account for a significant part of the overall
power consumption, and their share varies across programs. Figure 1
shows the power consumption breakdowns for the SPEC CPU2006 benchmarks
on a 2-way out-of-order core with a 32KB private cache and a 1MB LLC
(details follow in Section 4). In particular, the dynamic power of the
private cache alone can take up to 20% of the total system power
consumption or 50% of the total power consumption of the cache
hierarchy.
[Figure 1 legend: Core, Private Dynamic, Private Static, LLC Dynamic, LLC Static. Panels: (a) Integer Benchmarks, (b) Floating Point Benchmarks.]
Figure 1: Breakdown of system power consumption for the SPEC CPU2006 benchmarks.
Benchmark    TS (W)   L1D (W)
perlbench    2.627    0.525
bzip2        2.397    0.333
mcf          1.696    0.067
gobmk        2.724    0.380
hmmer        2.766    0.472
sjeng        2.513    0.352
libquantum   1.468    0.023
h264ref      3.066    0.665
omnetpp      1.751    0.117
astar        2.606    0.482
xalan        1.579    0.045

bwaves       2.015    0.230
zeusmp       1.752    0.088
gromacs      2.033    0.133
cactusADM    1.730    0.163
leslie3d     1.724    0.097
namd         1.976    0.178
povray       2.232    0.270
calculix     1.811    0.111
GemsFDTD     1.587    0.063
lbm          1.908    0.078
sphinx3      2.029    0.136
Table 1: Total system power (TS) and L1 dynamic power (L1D) for SPEC CPU2006 benchmarks.
mcf (total accesses 856647, total misses 166736)
       Way 1    Way 2    Way 3    Way 4    Way 5    Way 6    Way 7
M+     32534    6994     4755     4024     3406     2838     2442
PEG    26.3     122.5    180.2    212.9    251.5    301.8    350.9

astar (total accesses 4071875, total misses 30415)
       Way 1    Way 2    Way 3    Way 4    Way 5    Way 6    Way 7
M+     122031   14174    4894     2989     2437     2125     1900
PEG    33.4     287.3    832.0    1362.4   1671.2   1916.4   2143.0

[Figure 2 labels: On-Chip Cache Hierarchy (Private Cache + LLC); Previous studies; What we propose for SCR.]
Figure 2: Scope and goal of this work.
Table 1 shows that the dynamic power consumption of the private cache
varies from 0.02 W to 0.5 W. Resizing each core’s private
cache independently, without considering the absolute power
levels, will fail to reach a globally optimal solution. Instead,
for a given amount of acceptable performance loss (i.e., increased
traffic to the shared LLC), it is crucial to apply resizing
to the private cache that will save the most power.
2.2. Intra-Core Multi-Level Optimization
Cache-partitioning techniques have been widely investigated
for higher performance and better power efficiency [29, 30, 22].
Most state-of-the-art runtime techniques change the partitioning
dynamically. As such, private cache resizing in CMPs will encounter
a dynamically changing LLC partition size. Although LLCs commonly
use serialized access, which saves the data-array access power when
a cache miss occurs, the tag access power can be significant
compared to the potential power savings from private cache
resizing. The dynamic power consumption of tag accesses depends on
the LLC partitioning (i.e., the number of LLC ways allocated under
UCP [29]), and the additional LLC accesses caused by private cache
resizing increase the power consumption of the LLC. To alleviate
this problem, private cache resizing should be performed with
awareness of the increased dynamic power at the LLC, and constrained
by a threshold to avoid severe performance degradation.
2.3. Private Cache Resizing vs LLC Resizing
As discussed earlier, multi-level cache hierarchies distribute
power consumption across the different levels. Each level of
cache has its own power savings and performance degradation
when the waste is squeezed out. Because the overall performance
impact and power savings are not trivial to determine when multiple
cache-resizing techniques are combined, it is crucial to predict
the potential power savings and performance loss in order to control
each technique, saving most of the power while maintaining high
performance.
Cooperative Partitioning [30], or LLC Resizing, is a previous
technique focused on resizing a CMP’s shared LLC. It disables ways
in the LLC when the utility of the way is not high enough. Combining
private cache resizing and LLC resizing can be either symbiotic or
destructive, depending on the workload. If the LLC resizing technique
disables some number of ways on top of aggressive private cache
resizing that causes significant additional traffic to the LLC, there
can be significant performance degradation, which is unacceptable
in high performance computing. On the other hand,
it is possible to combine the two techniques in such a way
[Figure 3 panel labels: mcf, astar; (c) Power consumption comparison; x-axis: ways (1 to 8).]
Figure 3: Comparing private-cache resizing techniques.
that achieves most of the potential power savings with negligible
performance degradation. This is possible when the two techniques
form a symbiosis such that private cache resizing saves power with
near-peak performance and LLC resizing employs more ways than it
would in the absence of
[Figure 4 labels: Core 0 and Core 1, each with D-Cache, I-Cache, and Way Counters.]
Figure 4: SCR framework
Figure 4 illustrates the framework of SCR. We design SCR to achieve
most of the potential power savings while attaining high performance,
based on the principal rule of enforcing the configurations that
result in the best power saving per unit of performance degradation,
which is captured as Power Efficiency Gain (PEG). SCR consists
of two main parts: the monitoring part and the control part. The
monitoring part includes way counters, which approximate the
reuse-distance profile in the private and shared caches, together with
the PEG logic and the Power Estimation Logic (PEL). The control part
consists of LP-PWS, LLC Resizing and WTT.
LP-PWS monitors PEG to prioritize the private-cache ways whose
disabling is expected to yield the best power savings. The power
estimation logic (PEL) in SCR approximates the dynamic and static
power consumption of a given cache level and provides these values
to the logic that determines the sizes of the private caches and the
LLC. SCR works symbiotically with the existing LLC resizing
technique. First, LP-PWS leads to an optimal multi-level cache
resizing. Second, PEL-based WTT achieves further power savings
without significant performance degradation. We detail the design of
the two algorithms after presenting the analytical model for SCR.
3.2. Analytical Model for SCR
Power Optimization with Bounded Performance Degradation. In high
performance systems, achieving desired performance levels and
reducing power consumption are both important considerations. As
such, our objective function is the lowest power consumption with a
bounded performance degradation, not the Energy-Delay Product or
Energy-Delay-Squared Product, which are more common in
circuit/device-level designs.
Assume that we solve the problem of finding optimal private cache
sizes for a two-core system. Let $x$ and $y$ be the sizes of the two
private caches. Let $f(x, y)$ be the power consumption of the system
consisting of a core with a private cache of size $x$ and another
core with a private cache of size $y$. Let $g(x, y)$ be the
performance, i.e., the weighted speedup, of the system.
Power $= f(x, y)$, Performance $= g(x, y)$
Let $\mu$ be the normalized value of the degraded performance, and
let $x_B$ and $y_B$ be the baseline sizes of the private caches (in
this study, we only focus on homogeneous systems, so $x_B = y_B$).
Then, a simple form of the problem we want to solve is




Solving the Lagrangian for this problem is very challenging.
Moreover, its solutions are neither accurate nor practical, because
performance and power characterizations are challenging and the
derived solutions are not always configurable. However, our empirical
study, based on an exhaustive search over the solution space, shows
that the approximation given below discovers the solution effectively.
SCR Approximation. We approximate the solution by searching for
$x_{scr}$, $y_{scr}$ such that
$$\text{maximize } f(x_B, y_B) - f(x_{scr}, y_{scr}), \quad \text{subject to } T(x_B, x_{scr}) + T(y_B, y_{scr}) < \text{threshold} \tag{2}$$
where $T(a, b)$ is a function returning the normalized value of the
increased traffic to the LLC. In Equation 2, we change the constraint
from being based on pure performance (normalized speedup) to being
based on performance-related events, which are easier to predict. We
predict the increased traffic with reuse-distance profiling by
utilizing way counters at run time. We take a greedy approach to
solve this problem, based on power efficiency gain, which compares the
potential power savings of disabling each LRU way in the private
caches. The effectiveness of the approximation is discussed in
Section 5.1.
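As a rough illustration of how way counters could supply the traffic term $T(a, b)$ of Equation 2, the sketch below sums the per-way miss increments. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, normalizing by the baseline miss count is an assumed reading of "normalized", and the sample numbers are the mcf M+ values from Table 2.

```python
# Hypothetical sketch: estimating T(a, b) from way counters.
# m_plus[k] = extra misses when a cache shrinks from k+1 ways to k ways
# (the "M+" row reported for mcf in Table 2).
MCF_M_PLUS = {1: 32534, 2: 6994, 3: 4755, 4: 4024, 5: 3406, 6: 2838, 7: 2442}
# Total misses reported for mcf; assumed here to be the baseline LLC traffic.
MCF_BASELINE_MISSES = 166736


def extra_misses(m_plus, from_ways, to_ways):
    """Extra misses when shrinking from `from_ways` to `to_ways` ways."""
    if not 1 <= to_ways <= from_ways:
        raise ValueError("need 1 <= to_ways <= from_ways")
    return sum(m_plus[k] for k in range(to_ways, from_ways))


def normalized_traffic_increase(m_plus, baseline_misses, from_ways, to_ways):
    """Assumed form of T(a, b): extra LLC traffic relative to the baseline."""
    return extra_misses(m_plus, from_ways, to_ways) / baseline_misses


if __name__ == "__main__":
    # Shrinking mcf's private cache from 8 ways to 6 ways.
    print(extra_misses(MCF_M_PLUS, 8, 6))
    print(normalized_traffic_increase(MCF_M_PLUS, MCF_BASELINE_MISSES, 8, 6))
```

The same counters serve every candidate size, which is why a single profiling pass can drive the greedy search.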
3.3. SCR Designs
Power Efficiency Gain. Comparing the impact, from both the power and
the performance perspective, of disabling a single way of a given
core’s private cache in the CMP is crucial when searching for the
global optimum. If the LRU way of the private cache in a given core
($i$) is disabled, then the expected power efficiency gain per unit
of performance degradation is:
$$PEG_i = \frac{\text{total\_accesses}}{\text{way\_counts}[l]}, \tag{3}$$
where $l$ = current_LRU_way. Table 2 shows the PEGs per way for an
example workload consisting of mcf and astar. The M+ numbers are
measured when the number of ways of the cache is reduced by one; e.g.,
the "Way 5" columns denote the increased number of misses when the
cache is changed from 6 ways to 5 ways.
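Equation 3 can be checked directly against the Table 2 numbers. The following is a minimal sketch; the function name and the handling of a zero way count are assumptions introduced for illustration.

```python
# Hypothetical sketch of Equation 3: PEG = total_accesses / way_counts[l].
def power_efficiency_gain(total_accesses, way_count):
    """PEG of disabling the current LRU way; way_count is its M+ value."""
    if way_count == 0:
        # Assumption: a way whose removal adds no misses is free to disable.
        return float("inf")
    return total_accesses / way_count


# (total accesses, M+ for Way 1 .. Way 7), copied from Table 2.
WORKLOAD = {
    "mcf": (856647, [32534, 6994, 4755, 4024, 3406, 2838, 2442]),
    "astar": (4071875, [122031, 14174, 4894, 2989, 2437, 2125, 1900]),
}

if __name__ == "__main__":
    for name, (accesses, m_plus) in WORKLOAD.items():
        pegs = [round(power_efficiency_gain(accesses, m), 1) for m in m_plus]
        print(name, pegs)
```

Because astar issues far more accesses per added miss than mcf, its LRU ways have much larger PEG values and are the first candidates for disabling.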




allowedTraffic = trafficToLLC * threshold

selectedCore = core with maximum PEG
increasedTraffic += disable_way(selectedCore)
end
PEG-based Prioritized Way Selection. Each private cache consumes
dynamic power proportional to its number of accesses, which can vary
significantly across cores. As such, disabling a single way in
different private caches can have a different power impact, or power
efficiency gain. Decisions made by monitoring only each core’s own
performance are prone to be destructive, because each core’s
increased data traffic to the LLC aggravates resource conflicts in
the LLC. Figure 3-(a) shows an example. Downsizing each core’s
private cache while monitoring only its own performance degradation
results in the sum of the worst IPCs of each application. This is
because each downsizing causes more traffic to the LLC, resulting in
poorer IPC than would occur in a uniprocessor without a shared LLC.
PWS addresses this problem by first selecting the way that provides
the biggest power efficiency gain. This results either in saving
more power, by selecting ways that provide a higher power efficiency
gain, or in achieving higher performance, by disabling fewer ways for
the same power savings compared to Figure 3-(a). Figure 3-(b) shows
an example of symbiotic cache resizing with the PWS technique. In
Figure 3-(b), only 3 ways in total are disabled, compared to 8 ways
in Figure 3-(a), resulting in lower power dissipation than in
Figure 3-(a). PWS selects the ways to be disabled based on their PEG
values: it disables ways in PEG order until the expected performance
degradation reaches a pre-defined threshold, as shown in Algorithm 1.
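The greedy selection described above can be sketched as follows. This is a minimal sketch, not the authors' Algorithm 1: the function name, the data layout, the stop-before-exceeding rule, and the use of the summed Table 2 misses as the baseline LLC traffic are all assumptions made for illustration.

```python
# Hypothetical sketch of naive PWS: repeatedly disable the LRU way with
# the largest PEG until the traffic budget would be exceeded.
def prioritized_way_selection(total_accesses, m_plus, traffic_to_llc, threshold):
    """
    total_accesses[c]: private-cache accesses of core c.
    m_plus[c]: extra misses for each successive way disabled on core c,
               starting with the current LRU way.
    Returns (ways disabled per core, total increased traffic).
    """
    allowed_traffic = traffic_to_llc * threshold
    increased_traffic = 0
    disabled = [0] * len(m_plus)
    while True:
        best_core, best_peg = None, -1.0
        for core, costs in enumerate(m_plus):
            if disabled[core] >= len(costs):
                continue  # this core has no further way to disable
            cost = costs[disabled[core]]
            peg = float("inf") if cost == 0 else total_accesses[core] / cost
            if peg > best_peg:
                best_core, best_peg = core, peg
        if best_core is None:
            break
        cost = m_plus[best_core][disabled[best_core]]
        if increased_traffic + cost > allowed_traffic:
            break  # assumption: stop before exceeding the budget
        increased_traffic += cost
        disabled[best_core] += 1
    return disabled, increased_traffic


if __name__ == "__main__":
    accesses = [856647, 4071875]  # mcf, astar (Table 2)
    # M+ values from Table 2, reordered from Way 7 down to Way 1.
    costs = [
        [2442, 2838, 3406, 4024, 4755, 6994, 32534],
        [1900, 2125, 2437, 2989, 4894, 14174, 122031],
    ]
    baseline_traffic = 166736 + 30415  # assumed: sum of both cores' misses
    print(prioritized_way_selection(accesses, costs, baseline_traffic, 0.05))
```

With a hypothetical 5% traffic budget, every selected way comes from astar, since each of its LRU ways has a higher PEG than any mcf way.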




allowedTraffic = trafficToLLC * threshold
prevPower[i] = power_estimation(i, private_max_way, llcPartition[i])

privPower = dynamic_power(i, priv) + static_power(i, priv)
llcPower = dynamic_power(i, LLC) + static_power(i, LLC)
return privPower + llcPower
end
Intra-Core LLC Partitioning-aware PWS. Although naive PWS determines
a solution that achieves high power efficiency, each per-core
solution still may not be one of the Pareto-optimal solutions in the
power-performance solution space, because it does not take the
dynamic access energy of the LLC into consideration. The increased
traffic to the LLC results in additional power consumption. This
additional power can be significant as we shrink the private cache
aggressively, resulting in a total power increase. In addition, the
tag-access energy keeps changing, because an LLC resizing technique
partitions the LLC and shrinks the number of active ways dynamically.
For this reason, the optimal degree of private cache disabling cannot
be determined independently. LP-PWS uses the power estimation logic
(PEL), shown in Figure 4, to improve upon naive PWS by throttling the
amount of private cache resizing. The technique is specified in
Algorithm 2.
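The throttling idea can be sketched as a before/after comparison of estimated cache-hierarchy energy. This is a minimal sketch under loudly stated assumptions: the energy model, its field names, and all numeric values are hypothetical (real values would come from a tool such as CACTI), and the model simply assumes that private dynamic energy scales with the number of enabled ways and that LLC tag energy scales with the number of allocated LLC ways.

```python
from dataclasses import dataclass


@dataclass
class EnergyModel:
    """Hypothetical per-epoch energy parameters (arbitrary units)."""
    priv_dyn_per_way: float     # private dynamic energy per access per enabled way
    priv_static_per_way: float  # private leakage per enabled way
    llc_tag_per_way: float      # LLC tag energy per access per allocated way
    llc_data: float             # LLC data-array energy per access (one way, serialized)
    llc_static_per_way: float   # LLC leakage per allocated way


def power_estimation(model, accesses, misses, priv_ways, llc_ways):
    """Estimated private + LLC energy for one core in one epoch."""
    priv = (accesses * priv_ways * model.priv_dyn_per_way
            + priv_ways * model.priv_static_per_way)
    # Simplification: every private miss pays tag and data energy at the LLC.
    llc = (misses * (llc_ways * model.llc_tag_per_way + model.llc_data)
           + llc_ways * model.llc_static_per_way)
    return priv + llc


def should_disable_way(model, accesses, misses, extra_misses, priv_ways, llc_ways):
    """Allow disabling the LRU way only if the estimated total energy drops."""
    before = power_estimation(model, accesses, misses, priv_ways, llc_ways)
    after = power_estimation(model, accesses, misses + extra_misses,
                             priv_ways - 1, llc_ways)
    return after < before


if __name__ == "__main__":
    model = EnergyModel(1.0, 0.0, 1.0, 8.0, 0.0)
    print(should_disable_way(model, 1000, 100, 10, 8, 8))   # few extra misses
    print(should_disable_way(model, 1000, 100, 100, 8, 8))  # many extra misses
```

Under these made-up parameters, the first call allows the disable and the second vetoes it, which mirrors the turning point where extra LLC accesses outweigh the private-cache savings.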
Algorithm 3: Weighted Threshold Throttling

/* LLC_Resizing power */
powerLlcResizing += power_estimation(i)
end
powerSavingPriv = powerBaseline - powerPrivResizing
powerSavingLlc = powerBaseline - powerLlcResizing
powerSaving = powerSavingPriv + powerSavingLlc
throtPriv = powerSavingPriv / powerSaving
throtLlc = powerSavingLlc / powerSaving
LP_PWS(throtPriv * TPriv)
LLC_Resizing(throtLlc * TLLC)
return private and LLC ways
end
Weighted Threshold Throttling. The goal of SCR is to eliminate
wasteful power consumption from both the private caches and the
shared LLC. To achieve this goal, we need to utilize both resizing
techniques symbiotically so as not to degrade performance severely.
In particular, we try to achieve the sum of the power savings of both
techniques while allowing only the performance degradation of one of
the techniques. Although we assume given thresholds for both resizing
techniques, we adjust the effective thresholds dynamically.
We propose Weighted Threshold Throttling (WTT), which adjusts each
threshold dynamically based on the ratio of the expected power
savings of each technique to the overall power savings, as shown in
Algorithm 3. This means WTT gives more weight to the cache resizing
technique that saves more power with the given default threshold,
while discouraging resizing at the other caching level to prevent
severe performance degradation. We use the algorithm described in
[36] for LLC_Resizing() in Algorithm 3, and omit its explanation
here.
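The weighting step of WTT can be sketched in a few lines. This is a minimal sketch that follows the surviving lines of Algorithm 3; the function name, the clamping of negative savings to zero, and the behavior when neither technique saves power are assumptions, and the input values in the example are invented.

```python
# Hypothetical sketch of the WTT weighting step.
def weighted_thresholds(power_baseline, power_priv_resizing,
                        power_llc_resizing, t_priv, t_llc):
    """Scale each default threshold by that technique's share of the savings."""
    # Assumption: a technique that would raise power gets zero weight.
    saving_priv = max(power_baseline - power_priv_resizing, 0.0)
    saving_llc = max(power_baseline - power_llc_resizing, 0.0)
    total_saving = saving_priv + saving_llc
    if total_saving == 0.0:
        # Assumption: with no expected savings, do not resize at all.
        return 0.0, 0.0
    throt_priv = saving_priv / total_saving
    throt_llc = saving_llc / total_saving
    return throt_priv * t_priv, throt_llc * t_llc


if __name__ == "__main__":
    # Invented example: private resizing is expected to save three times
    # as much power as LLC resizing, so it receives 75% of its threshold.
    print(weighted_thresholds(10.0, 7.0, 9.0, t_priv=0.04, t_llc=0.04))
```

Because the two weights sum to one, the combined effective threshold never exceeds the larger of the two default thresholds, which is one way to read the goal of tolerating only one technique's worth of performance degradation.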
3.4. Scalability of SCR
As we mentioned earlier, finding an optimal private cache
size in CMPs is already an NP-hard problem. This is because
it has a time complexity of $O(N_p^M)$, where the CMP
has $M$ cores and each core has an $N_p$-way private cache.
Group   Benchmark   MPKI    Group  Benchmark  MPKI
        Mcf         51
        Libquantum  29             Astar      0.82
High    Lbm         22             Perlbench  0.69
        Omnetpp     15             Hmmer      0.66
        GemsFDTD    14             H264ref    0.54
        Leslie3d    9.5     Low    Sjeng      0.27
        Sphinx3     8.2            Gobmk      0.2
        Xalan       6.8            Calculix   0.2
Medium  Bwaves      4.8            Gromacs    0.13
        Zeusmp      4.1            Namd       0.07
        CactusADM   2.3            Povray     0.03
        Bzip2       1.9
Table 3: Benchmark classification.
We reduced the time complexity to $O(N_p M)$ for naive PWS and
LP-PWS, as shown in Algorithms 1 and 2. Previous studies [29, 36]
showed that LLC partitioning techniques take no longer than
$O(M^2)$. SCR has a time complexity of $O(M(N_p + M))$ because WTT,
shown in Algorithm 3, integrates the private-cache-resizing
algorithm and the LLC-resizing algorithm.
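To make the gap between the two complexity classes concrete, the sketch below counts candidate configurations for an exhaustive search against the bound on greedy way-disabling steps. It is only an illustration of the stated bounds; the function names are hypothetical, and the 8-way private cache with 2, 4 and 8 cores mirrors the evaluated systems.

```python
# Hypothetical sketch comparing the search-space sizes quoted above.
def exhaustive_configurations(ways_per_cache, cores):
    """Each core independently picks 1..ways_per_cache ways: N_p ** M choices."""
    return ways_per_cache ** cores


def greedy_step_bound(ways_per_cache, cores):
    """Greedy PWS disables at most one way per step: bounded by N_p * M."""
    return ways_per_cache * cores


if __name__ == "__main__":
    for cores in (2, 4, 8):
        print(cores,
              exhaustive_configurations(8, cores),
              greedy_step_bound(8, cores))
```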
3.5. Hardware Overhead
Way Counters. The major additional hardware needed to implement SCR
is the way counters. We assume the same circuits used in the shared
LLC in [29] and employ similar circuits in the private caches. The
way counter overheads in terms of area and power consumption are
minimal, though we take them into consideration when we measure the
system power.
4. Experimental Methodology
4.1. Benchmarks
We use 22 SPEC CPU2006 benchmarks (11 integer and 11 floating
point), as shown in Table 3. We compile the benchmarks on an Alpha
CPU emulator [8]. We run Linux (Debian Etch) on the emulator, and
then install SPEC CPU2006 on it. We use the native Alpha compiler,
gcc-4.1.1 (provided with Debian). We compile the benchmarks with the
-O2 option and link glibc-2.5 statically. One integer benchmark
(403.gcc) and six floating point benchmarks (416.gamess, 433.milc,
447.dealII, 450.soplex, 465.tonto, and 481.wrf) could not be
compiled, so they have been omitted from our study. Using the
reference inputs, all of the benchmarks were run to completion with
SimPoint [13]. We take the most representative simpoint, consisting
of 1B instructions, for each benchmark. Each simulation point
contains 1.1B instructions: 100M instructions for cache warmup and
1B instructions for the representative simulation.
Workloads. We generate multi-program workloads randomly to mix all
three categories in Table 3. We created 20

2.0 GHz 2-way out-of-order
64-entry ROB, 32-entry LSQ
Gshare branch predictor, 1024-entry BTB
L1 I-Cache 32 KB, 2-way, 64-byte blocks

16/32/64-way 2MB/4MB/8MB for 2/4/8 cores
64-byte blocks, 13/15/17 cycles
Table 4: Architectural configuration.
4.2. Architectural Simulation
We use modified SimpleScalar tools for the Alpha ISA [5] to
conduct our study. Table 4 shows our baseline processor
configuration. As mentioned in Section 2, we model state-of-the-art
power-efficient cores. As such, we use a relatively narrow 2-way
issue core to achieve high power efficiency. The cores are attached
to a 2-level cache hierarchy. In particular, the on-chip cache
hierarchy has a split 8-way 32KB L1 private cache and a unified,
shared LLC. The LLC is 2MB for 2 cores, 4MB for 4 cores, and 8MB for
8 cores. Its associativity increases by 8 ways per additional core.
The cache block size is 64 bytes for all caches. The baseline cache
hierarchy maintains the non-inclusive property for the LLC. We
study different inclusion properties in Section 5 and compare their
performance and power consumption. Before resizing caches, we apply
existing techniques to ensure the baseline cache hierarchy is
reasonably efficient. In particular, we assume the LLC serializes tag
and data accesses such that only a single data way is ever accessed,
regardless of the number of configured cache ways. Figure 1 shows the
cache-hierarchy energy breakdown of the baseline multi-level caches
for each SPEC CPU2006 benchmark.
Cache Reconfiguration. To enable cache reconfiguration, we modified
SimpleScalar’s cache module to model selective cache ways [1]. We
assume all caches in the hierarchy are reconfigurable and can change
their capacity in increments of one cache way, from 1 way up to the
cache’s full associativity. (Our work does not consider I-cache
resizing, and assumes the I-cache size is always fixed.) While each
cache’s access delay also changes across different configurations,
we assume a constant number of CPU cycles to access each cache,
chosen to handle that cache’s worst-case access delay (i.e., with all
ways enabled).
4.3. Power Modeling
We use McPAT [21] and CACTI 6.5 [26] for power modeling. Our
baseline model uses the 32nm technology node and ITRS high
performance devices. We adopt a state-of-the-art circuit-level static
power reduction technique to model the static power of the shared
LLC more realistically. Specifically, we assume high-Vt devices
throughout [15], but apply reverse body bias (RBB) in standby mode to
further reduce standby leakage [37]. When an access occurs, we apply
a forward body bias (FBB) to restore the threshold voltage for low
access delay. We assume that applying FBB does not impact the access
delay of the cache [37]. We utilize the stack effect in conjunction
with ABB to model way selection [31, 37]. We use the Model for
Assessment of cmoS Technologies And Roadmaps (MASTAR 2011) from ITRS
[25] to derive the parameters required for CACTI according to our
assumptions.
4.4. Implementation
Static Study. We conduct an off-line exhaustive evaluation of static
SCR to examine its potential power savings and performance
improvement compared to other schemes, including UCP and LLC Resizing
based on UCP, as well as exhaustive searches for optimal power,
called Exhaustive-Power, and for optimal performance, called
Exhaustive-Performance. The static study has two goals: first, to
identify potential performance gains and power savings compared to
the other techniques, and second, to determine the limit on the
maximum performance and power savings. The latter allows us to
assess how well our on-line SCR technique performs.
To facilitate the study, we run all possible combinations of ways of
the private caches and the shared LLC. These extensive simulations
enable an exhaustive search over the entire solution space to find
the best static solution. In the case of Exhaustive-Power, we search
for the solution that consumes the least power while achieving a
better weighted speedup than LLC Resizing. Likewise, in the case of
Exhaustive-Performance, we search for the solution that has the
highest weighted speedup while consuming less power than LLC
Resizing.
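The two selection rules can be sketched as simple filters over the simulated configurations. This is a minimal sketch: the `Config` type, the function names, and every number in the example are hypothetical and do not come from the study.

```python
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    """One simulated static configuration (hypothetical record layout)."""
    label: str
    weighted_speedup: float
    power: float  # cache-hierarchy power


def exhaustive_power(configs: Sequence[Config], reference: Config) -> Optional[Config]:
    """Least-power configuration that beats the reference weighted speedup."""
    candidates = [c for c in configs
                  if c.weighted_speedup > reference.weighted_speedup]
    return min(candidates, key=lambda c: c.power, default=None)


def exhaustive_performance(configs: Sequence[Config], reference: Config) -> Optional[Config]:
    """Highest-speedup configuration that uses less power than the reference."""
    candidates = [c for c in configs if c.power < reference.power]
    return max(candidates, key=lambda c: c.weighted_speedup, default=None)


if __name__ == "__main__":
    llc_resizing = Config("llc_resizing", 1.80, 1.00)  # invented reference point
    space = [
        Config("a", 1.85, 0.90),
        Config("b", 1.82, 0.70),
        Config("c", 1.70, 0.50),
        Config("d", 1.90, 1.20),
    ]
    print(exhaustive_power(space, llc_resizing))
    print(exhaustive_performance(space, llc_resizing))
```

Both functions return `None` when no configuration dominates the reference on the required axis, which keeps the two searches independent of each other.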
Dynamic SCR. We implemented the dynamic SCR techniques from
Section 3.3 in our simulator. In particular, we modified our
simulator to raise an interrupt every 1M cycles and execute an
interrupt handler. The interrupt handler estimates performance and
power to determine the best private cache sizes across cores. Every
5M cycles, the interrupt handler also runs an LLC partitioning
algorithm. We also modified the simulator to allow software to
reconfigure the caches within the interrupt handler.
Performance and power estimations are provided by PEG
and PEL in the simulator. The PEG and PEL logic reads the way
counters in the simulator. In particular, at each iteration during
Prioritized Way Selection, the PEG values from each core
are read to pick the maximum value; LLC-Partitioning-Aware
Throttling can veto the decision and pick the core with the
second-largest PEG value instead. PEL requires per-access energies
and leakage energies from CACTI to predict power consumption
per epoch, so we implemented configurable registers to
store these values. To facilitate our SCR algorithm, our simulator
implements PEG and PEL on top of the per-core private-cache
way counters.
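A minimal sketch of one selection step and of a per-epoch power estimate follows. All names are hypothetical. The behavior when the only candidate is vetoed, and the exact form of the power formula (dynamic energy from access counts plus leakage energy per active way, divided by the epoch duration), are assumptions for illustration; the text states only that the maximum PEG value is picked, that the decision can be vetoed in favor of the second-largest value, and that PEL uses per-access and leakage energies.

```python
from typing import Callable, Dict, Optional


def pick_core(peg: Dict[int, float], veto: Callable[[int], bool]) -> Optional[int]:
    """Pick the core with the largest PEG value, or the runner-up if vetoed."""
    ranked = sorted(peg, key=lambda core: peg[core], reverse=True)
    if not ranked:
        return None
    if not veto(ranked[0]):
        return ranked[0]
    # Assumption: with no runner-up available, nothing is selected.
    return ranked[1] if len(ranked) > 1 else None


def estimate_epoch_power(accesses: int,
                         energy_per_access_j: float,
                         active_ways: int,
                         leakage_energy_per_way_j: float,
                         epoch_seconds: float) -> float:
    """Assumed model: (dynamic energy + leakage energy) over one epoch, in watts."""
    dynamic_j = accesses * energy_per_access_j
    static_j = active_ways * leakage_energy_per_way_j
    return (dynamic_j + static_j) / epoch_seconds


peg_values = {0: 0.4, 1: 0.9, 2: 0.6}  # made-up PEG readings per core
chosen = pick_core(peg_values, veto=lambda core: False)
chosen_after_veto = pick_core(peg_values, veto=lambda core: core == 1)
power_w = estimate_epoch_power(1000, 2e-9, 4, 1e-6, 1e-3)
```

In this sketch the veto is a predicate supplied by the caller, so the throttling policy stays separate from the ranking logic.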
Our simulator accounts for the overheads associated with
resizing each cache. When up-sizing the private cache or the shared
LLC, we assume 100 cycles (private cache) or 1000 cycles (shared LLC) to
power up each way, and 10 cycles per way to flash invalidate
the newly powered-on cache blocks. When down-sizing, we
walk the down-sized way(s) to flush its contents. Clean cache
blocks are discarded after checking upstream caches to main-
tain inclusion. Dirty cache blocks check upstream caches and
are also written back to the next-lower level. We assume
these operations are pipelined such that flushing takes 1 cy-
cle per walked cache block. Down-sized ways are selected
in reverse way ID order. Because we do not physically move
cache blocks once they are filled, the flushed cache blocks
have an equal probability of being at any position in the LRU
stack. Moreover, we do not attempt to reconstruct the per-
set LRU stacks after flushing. Resizing is performed on the
private cache and the shared LLC.
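The overhead assumptions above can be turned into a small cycle-count model. This is a sketch under stated assumptions: the function names are hypothetical, ways are assumed to be powered up one after another rather than in parallel, and the number of blocks walked per way is assumed to equal the number of sets.

```python
POWER_UP_CYCLES_PER_WAY = {"private": 100, "llc": 1000}  # from the text
FLASH_INVALIDATE_CYCLES_PER_WAY = 10                     # from the text
FLUSH_CYCLES_PER_BLOCK = 1                               # pipelined walk, from the text


def upsize_cycles(level: str, ways: int) -> int:
    """Cycles to power up and flash invalidate `ways` ways (assumed serial)."""
    per_way = POWER_UP_CYCLES_PER_WAY[level] + FLASH_INVALIDATE_CYCLES_PER_WAY
    return ways * per_way


def downsize_cycles(ways: int, sets: int) -> int:
    """Cycles to walk and flush `ways` ways, assuming one block per set per way."""
    return ways * sets * FLUSH_CYCLES_PER_BLOCK


private_up = upsize_cycles("private", 2)
llc_up = upsize_cycles("llc", 1)
flush = downsize_cycles(ways=2, sets=64)
```

With these assumptions, down-sizing cost scales with the number of sets, while up-sizing cost depends only on the number of ways and the cache level.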
5. Results and Analysis
5.1. Static SCR
Figure 5 presents our off-line study results. The figure shows
weighted speedups and cache-hierarchy power consumption
of two-application workloads. SCR consumes less power
than LLC Resizing while improving performance. On average,
SCR improves performance by 0.7% and consumes 2.5%
less power compared to LLC Resizing. Exhaustive-Power
and Exhaustive-Performance show two Pareto-optimal solutions
in the weighted speedup-power Cartesian space. The
weighted speedup and the cache-hierarchy power consump-
[Figure 5, panel (a): Cache-Hierarchy Power Consumption. Legend: UCP, LLC Resizing, Static SCR, Exhaustive-Power, Exhaustive-Performance.]
Figure 5: Offline study for static SCR.
These results demonstrate our SCR techniques can provide
significant power savings, even when using a static (fixed)
configuration throughout the entire workload run, compared
to the baseline cache hierarchy and to the LLC-only resizing.
The larger power savings and higher performance of SCR
over LLC-only resizing can be explained as follows. First,
private cache resizing saves dynamic and static power
at the cost of some performance degradation. The
results show that the average number of activated ways in the
private caches is 6.5, compared to 8 in the baseline. Second,
SCR compensates for the performance loss from
private cache resizing by utilizing more ways in the shared LLC
than LLC-only resizing does. The average numbers of
activated LLC ways per core for SCR and LLC-only resizing
are 7.38 and 5.88, respectively, compared to 8 in even
partitioning, which explains SCR’s higher performance. Power
consumption in the shared LLC is higher for SCR than for
LLC-only resizing, but overall, the cache-hierarchy power savings
of SCR are greater than those of LLC-only resizing due to the power
savings in the private caches.
5.2. Dynamic SCR
Symbiotic Cache Resizing Figure 6 summarizes our dynamic
SCR results by showing power consumption of the on-chip
cache hierarchy and the system, and weighted speedups,
as we compare the SCR technique to the LLC resizing technique.
Figure 6-(a) shows the power consumption in the
cache hierarchy for the LLC resizing and SCR techniques. The
stacked bars break down cache-hierarchy power dissipation
for SCR, and the line reports total cache-hierarchy power
consumption for the LLC resizing technique. These results
demonstrate that our SCR technique can provide significant power
savings compared to the LLC resizing technique. SCR
saves as much as 55% of the cache-hierarchy power in G2-19,
while the LLC resizing technique provides 25% power savings.
On average, SCR saves private dynamic power by 20%,
private static power by 25% and LLC static power by 4%
across the workloads. LLC dynamic power changes by ±10%
depending on the workload group. As the "AVG" bars show, SCR
provides 13% power savings in the on-chip cache hierarchy
across workloads, representing a 2.7x increase in the
power saved by the LLC resizing technique.
Figure 6-(b) shows weighted speedups and system power
for the different techniques. The two bars per workload report
weighted speedups for the two techniques, and the lines
show system power consumption. These results show that SCR
achieves slightly better performance than the LLC resizing
technique. In addition, the power savings that SCR provides
translate into 4.2% power savings from the system power
perspective.
Effectiveness of Prioritized Way Selection We compare
the weighted speedup and the system power consumption of
PWS to the Independent scheme. The disadvantage of the
Independent scheme compared to PWS is proportionally aggregated
performance loss and sub-optimal power savings.
Figure 8 shows the cache-hierarchy power consumption and
weighted speedups of PWS and Independent. PWS outperforms
Independent by as much as 2.2% in performance and
2.6% in power savings in G2-7. There are a few workloads
where the performance of PWS is lower than that of Inde-
[Figure 6, panel (a): SCR Cache-Hierarchy Power Consumption Breakdown and LLC Resizing Cache-Hierarchy Power Consumption.]
[Figure 6, panel (b): Weighted Speedup and System Power Consumption. Legend: LLC Resizing Weighted Speedup, SCR Weighted Speedup, LLC Resizing System Power, SCR System Power. Workload groups: G2, G4, G8. Data labels: 1.08, 1.09, 1.14, 1.16.]
[Power-breakdown legend: LLC Static Power, LLC Dynamic Power, Private Cache Static Power, Private Cache Dynamic Power.]
[Panel (b): Weighted Speedup.]
Figure 8: PWS effectiveness.
nificant in those cases. For example, the power savings of
PWS over Independent in G2-18 is 7.2% while its performance
loss is only 0.3%. On average, PWS saves system
power by 1.7% and improves performance by 0.5% over the
Independent scheme.
The power savings of PWS compared to Independent are
much more significant when we compare just the dynamic
power of the private caches. Figure 7 shows the cache-hierarchy
power consumption for PWS and Independent.
When we compare the private-cache dynamic power consumption
of the two techniques in Figure 7, PWS saves as
much as 73% of the private-cache dynamic power compared
to Independent in G2-18. On average, PWS reduces the dynamic
power in the private caches by 12% and the overall
cache-hierarchy power by 5.9%.
Effectiveness of LLC-Partitioning-aware PWS Private
cache resizing should be symbiotic, in that aggressive resizing
of the private caches may increase the total power consumption
of the cache hierarchy by increasing LLC access energy
by more than the power saved from the private cache resizing.
We compare the weighted speedup and the system power consumption
of LP-PWS to PWS, which does not consider the
[Figure panel residue. Legend: LLC Static Power, LLC Dynamic Power, Private Cache Static Power, Private Cache Dynamic Power. Panel (b): Weighted Speedup.]
Figure 10: LP-PWS effectiveness (TPriv = 3).
power consumption and weighted speedup of PWS and LP-PWS.
The weighted speedup of PWS degrades significantly
when it allows additional cache misses with a threshold of 3 to
save power aggressively. Nevertheless, PWS still does not
save as much power as LP-PWS, which incurs less performance
degradation.
Aggressive PWS can harm the weighted speedup by as much
as 15% in G2-4, while achieving 6.4% total system power
savings. For G2-4, LP-PWS shows a weighted speedup degradation
of only 11% with a power saving of 9%, which
translates into a 4% performance improvement and 2.7%
power savings over PWS. On average, LP-PWS saves system
power by 1.4% and improves performance by 1.8% over
the PWS scheme. Figure 9 shows the power consumption
of the cache hierarchy for PWS and LP-PWS. PWS saves dynamic
power in the private caches by as much as 67% compared
to LP-PWS, but it consumes 250% of the LLC dynamic
power of LP-PWS in G2-14, resulting in 7.5% additional
power consumption in the cache hierarchy. On average,
LP-PWS reduces the cache-hierarchy power consumption
by 5.7% while consuming more dynamic power in the private
[Figure panel (a): Performance and Power Consumption.]
Figure 11: SCR with non-inclusive and exclusive LLC.
Non-Inclusive vs. Exclusive LLC The decision to use non-inclusive
or exclusive LLCs is non-trivial because exclusive
LLCs are more efficient in cache-capacity utilization while
non-inclusive LLCs are more efficient in on-chip bandwidth
utilization and dynamic power. Moreover, which inclusion
property is more beneficial depends on the workload. For
this reason, the exclusive LLC with even partitioning shows
slightly lower performance than the non-inclusive LLC. As
Figure 11 shows, there is marginal performance degradation
in the exclusive LLC, unlike the performance improvement
in the non-inclusive LLC. After applying SCR, the exclusive LLC
shows a 3.5% increase in DRAM accesses, compared to
a 0.17% decrease for the non-inclusive LLC, which explains the
slight performance degradation in the exclusive LLC and the
performance improvement in the non-inclusive LLC.
The dynamic power consumption of the exclusive LLC is
higher by 27.5% compared to the inclusive LLC, contributing
to the higher system power. Although the non-inclusive LLC
shows lower system power consumption, the ratios compared
to even partitioning are almost identical (0.961), implying that
SCR saves around 4% of system power consumption in both
cases. We note that SCR incurs a larger dynamic power increase
in the exclusive LLC, 11.5% compared to 10.7% in
the non-inclusive LLC, but its overall impact on system power
is almost negligible. The average numbers of activated ways
after applying SCR are virtually the same too.
Epoch Size We ran experiments with 0.5M-, 1M- and 2.5M-cycle
epoch sizes for private cache resizing to measure the
sensitivity of the results to the epoch size, while the epoch size
for the shared LLC is fixed at 5M cycles as in other studies
[29, 36]. We found that a 1M-cycle epoch size is a reasonable
choice for generating good power savings relative to the
cache reconfiguration overhead.
Impact of Thresholds SCR relies on two default thresholds,
TPriv and TLLC, to control the trade-off between power savings and
performance loss for the private-cache resizing and the shared-LLC
resizing, respectively. We explored thresholds ranging
from 0 to 0.1 for TPriv and from 0 to 0.15 for TLLC, both in 5
steps. Setting the thresholds to 0 results in the best performance
without any power savings for both techniques (they
degenerate into the UCP technique when both thresholds are set
to 0). On the other hand, we can save large amounts of power
as the thresholds increase, at the cost of performance loss. Although
we observed interesting trade-offs between power savings
and performance by changing the thresholds, we fixed
the thresholds to TPriv = 0.02 and TLLC = 0.12 for our dynamic
SCR experiments in 2-, 4- and 8-core CMPs.
6. Related Work
A large body of work exists on cache resizing. Selective
cache ways [1] uses off-line profiling to drive disabling of
cache ways for dynamic power savings. DRI caches [28, 41,
40] use cache-miss counts to detect over-provisioning, and resize
across either cache sets or ways. In addition, DRI caches
also gate the power supply to unused portions of cache, conserving
both dynamic and static power. Malik et al. [24] study
selective ways in the MCore CPU. All of these prior studies
consider resizing a single level of cache in uniprocessors only,
whereas SCR addresses the problem of resizing multiple
levels of caches in CMPs. In particular, we develop novel
algorithms for solving an NP-hard problem in O(N^2) time
complexity with a greedy approach.
Cache partitioning explicitly allocates shared cache across
multiprogrammed workloads, providing cache to those programs
that can best utilize it. The majority of techniques
focus on performance [6, 29, 19, 33, 34, 38]. More recently,
techniques have also tried to reduce power consumption
[14, 35, 36] by withholding allocation and shutting down
portions of the shared cache, similar to cache resizing. Madan
et al. [23] propose resizing L2 caches by dynamically extending
their capacity into stacked DRAM. Like SCR, cache partitioning
also employs reuse distance profiles to drive allocation
decisions. But LLC cache partitioning saves mostly static
power, whereas SCR also resizes private
caches, where dynamic power dominates.
Balasubramonian et al. [3, 4] propose resizing two levels of
cache, either the L1/L2 or the L2/L3, by partitioning a common
pool of SRAM arrays to different caching levels. Because
partitionings always utilize all of the available SRAM,
only one cache’s size is controlled independently. Hence,
in this technique, it is impossible to optimize the balance
point of different caching levels simultaneously as is done
in SCR. Moreover, the technique is limited to uniprocessors.
Wang et al. [39] propose private cache resizing in
conjunction with LLC partitioning. This technique requires
off-line profiling. Besides, it adopts LLC partitioning only,
losing the opportunity to save static power in the LLC.
Besides resizing, researchers have studied other adaptive
cache techniques as well. Dropsho et al. [7] propose accounting
caches, which divide a cache’s ways into primary and secondary
groups. Each cache access searches the two groups
sequentially, accessing the secondary only on a primary miss.
This saves power if secondary accesses are infrequent. Zhang
et al. [43] propose way concatenation, which permits flexible
organization of cache banks to form direct-mapped, 2-way, or
4-way set-associative caches. Neither accounting caches nor
way concatenation addresses capacity allocation across different
levels of cache, as done in SCR.
Silva-Filho et al. [32] and Gordon-Ross et al. [11] study
design-time techniques for optimizing 2-level cache hierarchies.
This body of work tries to find the best block size and
associativity, as well as cache capacity, for two caching levels.
They consider a more complex design space than we do,
and employ more costly search techniques that are suitable
for design analysis only. In contrast, SCR is an architecture-level
power management technique. It solves a more constrained
problem, but provides algorithms suitable for runtime
use. Similarly, Zhang and Vahid [42] search for the best
cache architecture using a reconfigurable hardware platform.
But they only consider optimizing a single level of cache.
Finally, significant research has explored circuit-level tech-
niques for reducing a cache’s static power consumption.
Multi-Vt techniques [2, 17] employ low-Vt devices along crit-
ical paths and high-Vt devices along non-critical paths to save
power while still maintaining performance. Similarly, super
high-Vt devices have been explored in [15]. Gated-VDD [28]
uses high-Vt devices to gate the power supply to unused portions
of cache. Adaptive body bias [16, 27] and forward body
bias [37] control the back-gate voltage to place devices in a
standby low-leakage mode when not in use, and then restore
the devices to an active high-performance mode when the
cache is accessed. Lastly, dynamic voltage scaling [10, 18]
can similarly transition between standby and active modes by
scaling the supply voltage. Similar to other cache resizing
techniques [41, 40], SCR relies on Gated-VDD to essentially
eliminate leakage for unused portions of the cache.
7. Conclusion
This paper presents SCR, an architecture-level power management
technique that resizes all caches in a modern CMP
cache hierarchy. Our work shows that a static-optimal version
of SCR can reduce total power dissipation in the on-chip
cache hierarchy by 2.5% while boosting performance by 0.7%
across two-application workloads compared to the LLC resizing
technique. We find that significant power savings come
from the symbiosis of performing private cache and shared
LLC resizing simultaneously. Our work also develops the
SCR algorithms, which employ a greedy approach to find
pseudo-(Pareto) optimal solutions at runtime in a scalable
fashion. We show that dynamic SCR can achieve between 2.6%
and 54.5% power savings in the cache hierarchy while achieving
a performance boost of up to 15.8%.
References
[1] D. H. Albonesi, “Selective Cache Ways: On-Demand Cache Resource
Allocation,” in Proceedings of the 32nd Annual International Sympo-
sium on Microarchitecture, November 1999, pp. 248–259.
[2] R. Bai, N.-S. Kim, D. Sylvester, and T. Mudge, “Total Leakage
Optimization Strategies for Multi-Level Caches,” in Proceedings of the
15th ACM Great Lakes Symposium on VLSI, Chicago, IL, 2005.
[3] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and
S. Dwarkadas, “Dynamic Memory Hierarchy Performance Optimiza-
tion,” in Proceedings of the workshop on Solving the Memory Wall
Problem, June 2000.
[4] R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu, and
S. Dwarkadas, “A Dynamically Tunable Memory Hierarchy,” IEEE
Transactions on Computers, vol. 52, no. 10, pp. 1243–1258, October
2003.
[5] D. Burger and T. M. Austin, “The SimpleScalar Tool Set, Version 2.0,”
University of Wisconsin-Madison, CS TR 1342, June 1997.
[6] J. Chang and G. S. Sohi, “Cooperative Cache Partitioning for Chip
Multiprocessors,” in Proceedings of the International Conference on
Supercomputing, Seattle, WA, June 2007.
[7] S. Dropsho, A. Buyuktosunoglu, R. Balasubramonian, D. H. Albonesi,
S. Dwarkadas, G. Semeraro, G. Magklis, and M. L. Scott, “Integrating
Adaptive On-Chip Storage Structures for Reduced Dynamic Power,”
in Proceedings of 11th Annual International Conference on Parallel
Architectures and Compilation Techniques, 2002.
[8] EmuVM, “AlphaVM-free, version 1.0.2 for Windows 7. Available at,”
http://www.emuvm.com/downloads.php.
[9] X. Fan, W.-D. Weber, and L. A. Barroso, “Power provisioning for a
warehouse-sized computer,” in Proceedings of the 34th annual
international symposium on Computer architecture, ser. ISCA ’07. New
York, NY, USA: ACM, 2007, pp. 13–23.
[10] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge,
“Drowsy Caches: Simple Techniques for Reducing Leakage Power,”
in Proceedings of the International Symposium on Computer
Architecture, Anchorage, AK, May 2002.
[11] A. Gordon-Ross, F. Vahid, and N. Dutt, “Automatic Tuning of
Two-Level Caches to Embedded Applications,” in Proceedings of the
Design, Automation and Test in Europe Conference and Exhibition
(DATE 04), 2004.
[12] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel, “The cost of a
cloud: research problems in data center networks,” SIGCOMM
Comput. Commun. Rev., vol. 39, no. 1, pp. 68–73, Dec. 2008.
[13] G. Hamerly, E. Perelman, J. Lau, and B. Calder, “SimPoint 3.0: Faster
and More Flexible Program Analysis,” in Proceedings of the
Workshop on Modeling, Benchmarking and Simulation, June 2005.
[14] K. Kedzierski, F. J. Cazorla, R. Gioiosa, A. Buyuktosunoglu, and
M. Valero, “Power and Performance Aware Reconfigurable Cache for
CMPs,” in Proceedings of the Second International Forum on Next-
Generation Multicore/Manycore Technologies, Saint-Malo, France,
June 2010.
[15] C. Kim, J.-J. Kim, S. Mukhopadhyay, and K. Roy, “A forward
body-biased low-leakage SRAM cache: device, circuit and architecture
considerations,” Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on, vol. 13, no. 3, pp. 349–357, 2005.
[16] C. H. Kim and K. Roy, “Dynamic Vth Scaling Scheme for Active
Leakage Power Reduction,” in Proceedings of the International
Symposium on Design, Automation, and Test in Europe, 2002, pp.
163–167.
[17] N. S. Kim, D. Blaauw, and T. Mudge, “Leakage Power Optimization
Techniques for Ultra Deep Sub-Micron Multi-Level Caches,” in
Proceedings of the International Conference on Computer-Aided Design,
2003.
[18] N. S. Kim, K. Flautner, D. Blaauw, and T. Mudge, “Circuit and
Microarchitectural Techniques for Reducing Cache Leakage Power,”
IEEE Transactions on Very Large Scale Integration, vol. 12, no. 2, pp.
167–184, February 2004.
[19] S. Kim, D. Chandra, and Y. Solihin, “Fair Cache Sharing and
Partitioning in a Chip Multiprocessor Architecture,” in Proceedings of the
International Symposium on High Performance Computer
Architecture, 2002.
[20] P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally,
M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp,
S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott,
A. Snavely, T. Sterling, R. S. Williams, and K. Yelick, “ExaScale
Computing Study: Technology Challenges in Achieving Exascale Sys-
tems.”
[21] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “McPAT: an integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in Proceedings
of the 42nd Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO 42. New York, NY, USA: ACM, 2009, pp.
469–480.
[22] W. Liu and D. Yeung, “Using aggressor thread information to improve
shared cache management for cmps,” in Proceedings of the 2009 18th
International Conference on Parallel Architectures and Compilation
Techniques, ser. PACT ’09. Washington, DC, USA: IEEE Computer
Society, 2009, pp. 372–383.
[23] N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian,
R. Iyer, S. Makineni, and D. Newell, “Optimizing Communication
and Capacity in a 3D Stacked Reconfigurable Cache Hierarchy,”
in Proceedings of the International Symposium on High Performance
Computer Architecture, 2009.
[24] A. Malik, B. Moyer, and D. Cermak, “A Low Power Unified Cache
Architecture Providing Power and Performance Flexibility,” in Pro-
ceedings of the International Symposium on Low Power Electronics
and Design, Rapallo, Italy, 2000.
[25] “ITRS Working Group Models, MASTAR,”
http://www.itrs.net/models.html, 2011.
[26] N. Muralimanohar, R. Balasubramonian, and N. Jouppi, “Optimizing
nuca organizations and wiring alternatives for large caches with cacti
6.0,” in IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROAR-
CHITECTURE. IEEE Computer Society, 2007, pp. 3–14.
[27] K. Nii, H. Makino, Y. Tujihashi, C. Morishima, Y. Hayakawa,
H. Nunogami, T. Arakawa, and H. Hamano, “A Low Power SRAM
using Auto-Backgate-Controlled MT-CMOS,” in Proceedings of the
International Symposium on Low-Power Electronics and Design,
Monterey, CA, August 1998.
[28] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar,
“Gated-Vdd: A Circuit Technique to Reduce Leakage in
Deep-Submicron Cache Memories,” in Proceedings of the IEEE/ACM
International Symposium on Low Power Electronics & Design, 2000, pp.
90–95.
[29] M. K. Qureshi and Y. N. Patt, “Utility-based cache partitioning: A
low-overhead, high-performance, runtime mechanism to partition shared
caches,” in Proceedings of the 39th Annual IEEE/ACM International
Symposium on Microarchitecture, ser. MICRO 39. Washington, DC,
USA: IEEE Computer Society, 2006, pp. 423–432.
[30] D. Sanchez and C. Kozyrakis, “Vantage: scalable and efficient
fine-grain cache partitioning,” in Proceedings of the 38th annual
international symposium on Computer architecture, ser. ISCA ’11. New
York, NY, USA: ACM, 2011, pp. 57–68.
[31] N. Shukla, R. Singh, and M. Pattanaik, “Design and Analysis of
a Novel Low-Power SRAM Bit-Cell Structure at Deep-Sub-Micron
CMOS Technology for Mobile Multimedia Applications,”
International Journal of Advanced . . ., 2011.
[32] A. G. Silva-Filho and F. R. Cordeiro, “A Combined Optimization
Method for Tuning Two-Level Memory Hierarchy Considering
Energy Consumption,” EURASIP Journal on Embedded Systems, vol.
2011, September 2010.
[33] G. E. Suh, S. Devadas, and L. Rudolph, “A New Memory Monitoring
Scheme for Memory-Aware Scheduling and Partitioning,” in
Proceedings of the International Symposium on High Performance Computer
Architecture, 2002.
[34] G. E. Suh, L. Rudolph, and S. Devadas, “Dynamic Partitioning of
Shared Cache Memory,” The Journal of Supercomputing, vol. 28, no.
7-26, 2004.
[35] K. T. Sundararajan, V. Porpodas, T. M. Jones, M. P. Topham, and
B. Franke, “Cooperative Partitioning: Energy-Efficient Cache
Partitioning for High-Performance CMPs,” in Proceedings of the 18th
International Symposium on High-Performance Computer Architecture,
New Orleans, LA, February 2012.
[36] K. T. Sundararajan, V. Porpodas, T. M. Jones, N. P. Topham, and
B. Franke, “Cooperative partitioning: Energy-efficient cache
partitioning for high-performance cmps,” in HPCA, 2012, pp. 311–322.
[37] J. Tschanz, S. Narendra, Y. Ye, B. Bloechel, S. Borkar, and V. De,
“Dynamic sleep transistor and body bias for active leakage power control
of microprocessors,” Solid-State Circuits, IEEE Journal of, vol. 38,
no. 11, pp. 1838–1845, 2003.
[38] K. Varadarajan, S. K. Nandy, V. Sharda, and A. Bharadwaj, “Molecular
Caches: A Caching Structure for Dynamic Creation of
Application-Specific Heterogeneous Cache Regions,” in Proceedings of the
International Symposium on Microarchitecture, 2006.
[39] W. Wang, P. Mishra, and S. Ranka, “Dynamic cache reconfiguration
and partitioning for energy optimization in real-time multi-core
systems,” in Proceedings of the 48th Design Automation Conference, ser.
DAC ’11. New York, NY, USA: ACM, 2011, pp. 948–953.
[40] S.-H. Yang, M. D. Powell, B. Falsafi, and T. N. Vijaykumar,
“Exploiting Choice in Resizable Cache Design to Optimize Deep-Submicron
Processor Energy-Delay,” in Proceedings of the 29th International
Symposium on Computer Architecture, San Diego, CA, June 2003.
[41] S.-H. Yang, M. D. Powell, B. Falsafi, K. Roy, and T. N. Vijaykumar,
“An Integrated Circuit/Architecture Approach to Reducing Leakage in
Deep-Submicron High-Performance I-Caches,” in Proceedings of the
7th International Symposium on High-Performance Computer
Architecture, 2001.
[42] C. Zhang and F. Vahid, “Cache Configuration Exploration on
Prototyping Platforms,” in Proceedings of the 14th International Workshop
on Rapid Systems Prototyping, 2003.
[43] C. Zhang, F. Vahid, and W. Najjar, “A Highly Configurable Cache
Architecture for Embedded Systems,” in Proceedings of the 30th
International Symposium on Computer Architecture, San Diego, CA, June
2003.
