CuttleSys: Data-Driven Resource Management for Interactive Applications
  on Reconfigurable Multicores by Kulkarni, Neeraj et al.
Neeraj Kulkarni, Gonzalo Gonzalez-Pumariega, Amulya Khurana, Christine Shoemaker,
Christina Delimitrou, and David Albonesi
Cornell University
ABSTRACT
Multi-tenancy for latency-critical applications leads to re-
source interference and unpredictable performance. Core
reconfiguration opens up more opportunities for colocation,
as it allows the hardware to adjust to the dynamic performance
and power needs of a specific mix of co-scheduled applications.
However, reconfigurability also introduces challenges, as even
for a small number of reconfigurable cores, exploring the de-
sign space becomes more time- and resource-demanding.
We present CuttleSys, a runtime for reconfigurable multi-
cores that leverages scalable and lightweight data mining to
quickly identify suitable core and cache configurations for a
set of co-scheduled applications. The runtime combines col-
laborative filtering to infer the behavior of each job on every
core and cache configuration, with Dynamically Dimensioned
Search to efficiently explore the configuration space. We eval-
uate CuttleSys on multicores with tens of reconfigurable cores
and show up to 2.46× and 1.55× performance improvements
compared to core-level gating and oracle-like asymmetric
multicores respectively, under stringent power constraints.
1. INTRODUCTION
Cost efficiency in datacenters is adversely affected by low
resource utilization [12, 24, 28, 29, 30, 31, 32, 33, 34, 35, 59,
60, 64, 84, 103]. Server utilization can be improved through
multi-tenancy which, however, is especially challenging for
latency-critical applications, such as websearch, social net-
works, and ML inference, since it can lead to interference in
shared resources (cores, cache, memory bandwidth, network
bandwidth, power, etc.), and unpredictable performance. Prior
work has proposed techniques to avoid interference by dis-
allowing colocation of contending workloads [24, 29, 32, 33,
64, 103], or techniques to eliminate interference altogether, by
leveraging hardware and software resource isolation mecha-
nisms [24, 41, 46, 47, 53, 59, 60, 87, 90].
In multi-tenant systems with latency-critical applications,
fine-grained resource allocation allows assigning just enough
resources to co-scheduled applications to meet the QoS, which
in turn, improves resource efficiency by allowing more appli-
cations to be co-scheduled. However, prior work is limited
to traditional servers where cores cannot be reconfigured to
enable fine-grained performance and power adjustments. Core
reconfiguration [44, 75, 105] opens up more opportunities for
colocation, as it allows the hardware to adjust to the dynamic
needs of a specific mix of co-scheduled applications.
DVFS, which is widely used in systems today, is another
solution to allow fine-grained performance and power adjust-
ments in cores. However, the movement towards processors
with razor-thin voltage margins and the increase in leakage
power consumption limits the effectiveness of DVFS in future
systems [6,7,56,65,66,67]. Reconfigurable cores [44,75,105]
operate by dynamically power gating core components. Since
they reduce both active and leakage power, they can be effec-
tive in reducing power consumption in technologies where
voltage scaling ranges are limited. Datacenters also suffer
from poor energy proportionality [59, 66], stemming from the
high idle power of processors as technology shrinks. Reconfigurable
cores, with their ability to reduce idle power, also offer
a solution to make cloud servers more energy proportional.
We propose to leverage reconfigurable cores to enable co-
scheduling of latency-critical and batch applications. For sce-
narios that involve colocation of latency-sensitive and batch
applications, this means satisfying the strict quality of service
(QoS) requirements of interactive services, and maximizing
the throughput of the batch applications, while always remain-
ing under the allowed power budget assigned to the server
either by the chip-wide power budget, or by a global power
manager [59] running datacenter-wide. Prior work [75] ad-
dresses reconfiguration exclusively for batch applications, and
leads to QoS violations and unpredictable performance for
latency-critical services. Additionally, Flicker [75] does not
handle interference in the shared memory hierarchy. On the
other hand, fine-tuning architectural parameters also increases
the space of allocations a resource manager must traverse to
identify suitable resource configurations for an application.
As the number of cores and configuration parameters increase,
efficiently exploring this space becomes computationally prohibitive.
This is even more challenging given that decisions
must be made online, as applications and power budgets change.
We design CuttleSys, an online resource manager that com-
bines scalable machine learning to determine the performance
and power of each application across all possible core and
cache reconfigurations, with fast design space exploration to
effectively navigate the large configuration space and arrive
at a high-performing solution. First, the system leverages col-
laborative filtering, namely PQ-reconstruction with Stochastic
Gradient Descent (SGD) to infer the performance (tail latency
for latency-critical and throughput for batch applications) and
power consumption of an application across core and cache
configurations without the overhead of exhaustive profiling.
Second, it leverages a new, parallel Dynamically Dimensioned
Search (DDS) algorithm to efficiently find a per-job near-
optimal configuration that satisfies QoS for latency-sensitive
workloads, and maximizes the throughput for batch jobs, under
a given power budget. Both techniques keep overheads low,
at a couple of milliseconds, allowing CuttleSys to reevaluate its
decisions frequently to adjust to changes in application behavior.
arXiv:2008.00329v1 [cs.AR] 1 Aug 2020
We make the following contributions:
• We demonstrate the potential of reconfigurable cores for
cloud servers when running latency-critical applications
by characterizing five representative interactive cloud
services (Section 3).
• We present CuttleSys, an online resource manager that
efficiently navigates the large design space and deter-
mines suitable core and cache configurations (Section 4).
• We evaluate CuttleSys on 32-core simulated systems
with mixes of latency-sensitive [48] and batch applica-
tions [1]. We show that at near-saturation load and across
different power caps, CuttleSys achieves 2.46× higher
throughput than core-level gating and 1.55× higher than
an oracle-like asymmetric multicore, while always sat-
isfying QoS for the latency-sensitive applications. We
also show that CuttleSys effectively adapts to changes
in input load and power budgets online (Section 8).
2. RELATED WORK
2.1 Power Management
2.1.1 Dynamic Voltage-Frequency Scaling
Dynamic Voltage-Frequency Scaling (DVFS) allows dy-
namically changing a processor’s voltage and frequency, and
is widely used in modern multicores.
Batch Workloads: Isci et al. [43] propose maxBIPS, an algo-
rithm that selects DVFS modes for each core that maximize
throughput under a power budget. Sharkey et al. [92] extend
this work by exploring both DVFS and fetch toggling, as well
as design tradeoffs such as local versus global management.
Bergamaschi et al. [16] further extend maxBIPS, and com-
pare its discrete implementation to continuous power modes.
Chen et al. [22] propose co-ordinated predictive hill climbing
to control distribution of power among cores, and intra-core
resources like IQ, ROB and register files among SMT threads.
Papadimitriou et al. [74] explore safe Vmin for different appli-
cations by exposing pessimistic guardbands and determining
the best voltage, frequency, and core allocation at runtime.
Apart from open-loop solutions, there are also multiple
feedback-based controllers [13,59,62,71,99]. Wang et al. [99]
use Model Predictive Control to maintain the power of a CMP
below the budget by controlling the DVFS states, while Bar-
tolini et al. [13] propose a distributed solution allocating one
MPC-based controller to each core. Ma et al. [62] propose a hi-
erarchical solution for many-core architectures that divides the
problem by allocating frequency budgets to smaller groups of
cores. Intel also supports fine-grained power control through
the RAPL [5] interface that allows software to set a power
limit, which the hardware meets by scaling voltage/frequency.
Latency Sensitive Workloads: Lo et al. [59] propose a feedback-
based controller that reduces power consumption in server
clusters, while meeting the QoS (Quality of Service) require-
ments of latency-critical services by adjusting the server power
limits using RAPL. Nishtala et al. [71] use Reinforcement
Learning to find the best core allocations and frequency set-
tings for latency-critical jobs to save energy while meeting
QoS. Kasture et al. [46] propose Rubik, a fine-grained DVFS
scheme for latency-sensitive workloads and RubikColoc, a
scheme to co-schedule batch and latency-critical workloads.
Adrenaline [40] applies DVFS at a per-query granularity, using
application-level information to speed up long queries. Meis-
ner et al. [66] explore the efficacy of active and idle low-power
modes for latency-critical applications to save power under
QoS, and show that active power modes (DVFS) provide
good power-performance trade-offs but cannot achieve energy
proportionality by themselves. Motivated by their conclusion,
our work explores fine-grained power management techniques
that reduce idle power along with active power.
The movement towards processors with razor-thin volt-
age margins limits the effectiveness of DVFS as technology
scaling slows down. A viable and widely-implemented al-
ternative to DVFS is core-level gating (C states), discussed
in the next section. Reconfigurable cores enable gating at an
even finer granularity allowing further gains over traditional
core-level gating. Similar to how core-level gating is used
alongside DVFS in modern processors, our technique can
augment DVFS by increasing the energy gains for frequency
regions where DVFS is not effective [6, 7, 105].
2.1.2 Core-Level Gating
Core-level gating powers off individual cores by placing
them in a separate domain [2, 3, 6, 54], and has become neces-
sary to reduce power consumption beyond DVFS. Intel CPUs
since Skylake [6, 7] support Duty Cycling Control (DCC),
which rapidly cycles between on (C0) and off (C6) states for
each core at the granularity of tens of microseconds. A few
of the proposals to use core-level gating to maximize perfor-
mance under a power budget are described below.
Batch Workloads: Intel processors [6, 7] implement core-
level gating only during idle core times using auto-demotion.
Ma et al. [63] and Huazhe et al. [104] integrate core-level gat-
ing with DVFS, and propose a controller-based algorithm that
employs power gating at coarse granularity, and DVFS at fine
granularity. Arora et al. [11] develop a linear prediction algorithm
for C6 for CPU-GPU benchmarks. Pothukuchi et al. [77]
use MIMO theory, while Rahmani et al. [79] use Supervisory
Control Theory to dynamically tune architectural parameters
to meet performance and power goals. These feedback-based
controllers become overly expensive as the decision space
expands, taking a prohibitive time to converge.
Latency Sensitive Workloads: Leverich et al. [56] propose
per-core power-gating to dynamically turn cores on/off based
on utilization and QoS. PowerNap [65] and DreamWeaver [67]
coordinate deep CPU sleep states to minimize idle power.
However, Kanev et al. [45] show that deep CPU sleep states,
owing to their long wakeup latencies, can also impact tail la-
tency, as latency-sensitive applications have short idle periods.
We use core-level gating in this work as a baseline for cores
that host batch workloads to meet the power budget.
2.2 Asymmetric Multicores
Asymmetric multicores improve performance and power
by assigning resources to applications based on their dynamic
requirements [14, 20, 21, 52, 55, 89, 93, 98].
Batch Workloads: PIE [27] schedules applications in het-
erogeneous multicores by estimating the performance of an
application on out-of-order cores, while running on an in-order
core and vice-versa. Liu et al. [58] propose a dynamic thread-
mapping approach, maximization-then-swapping, to maxi-
mize performance in power-constrained heterogeneous mul-
ticores. However, this relies on application profiling, which
can become impractical in large-scale multicores.
[Figure 1: ten panels in two rows, one column per application (Xapian, ImgDNN, Masstree, Moses, Silo). Top row: tail latency (ms) at 80% and 20% load. Bottom row: power (W) at 80% and 20% load; the panels are labeled {2,2,6}, {4,2,4}, {4,2,4}, {6,2,4}, and {2,2,4} respectively. Legend of core configurations, highest to lowest: 666 466 266 664 464 264 662 462 262 646 446 246 644 444 244 642 442 242 626 426 226 624 424 224 622 422 222.]
Figure 1: Characterization of tail latency and power of 5 latency-sensitive applications across core configurations. Colors in the
background represent the different core configurations, labeled as {FE,BE,LS}, as shown in the table. Core configurations, from
highest to lowest configuration (dark to light color), are ordered by serially decreasing configurations in LS, FE, and BE. For each
application, x-axis (core configurations) is sorted according to the tail latency observed at 80% load.
Teodorescu et al. [96] and Winter et al. [100] propose thread
scheduling and power management for heterogeneous systems.
Teodorescu et al. [96] propose LinOpt, a linear programming-based
approach, while Winter et al. [100] explore the Hungarian
algorithm to optimize performance under a power budget. Adileh
et al. [8, 9] maximize performance by multiplexing applications
between two voltage/frequency operating points to match
the power budget. The authors propose a technique to shift
“power holes” arising due to core heterogeneity. Navada et al.
[69] propose the use of non-monotonic cores, each optimized
for different instruction-level behavior, and steer applications
to appropriate core types using bottleneck signatures.
Latency Sensitive Workloads: Petrucci et al. [76] show that
simply using asymmetric multicores without redesigning system
software results in QoS violations. They propose a controller
that maps jobs to the least power-hungry processing resources
that can satisfy QoS by incrementally assigning additional
slower or faster cores until QoS is met. Ren et al. [85, 86] propose
a query-level slow-to-fast scheduler, where short queries
run on slower cores and longer queries are promoted to faster
cores to reduce their service latency. The latter work [86]
also theoretically proves the energy efficiency advantages of
asymmetric multicores over homogeneous systems. All of
these efforts assume that cores of the desired speed are al-
ways available, which is not realistic. Haque et al. [39] take
into account the fact that there is a limited number of cores
of each type. They combine asymmetric multicores with
DVFS and implement the slow-to-fast scheduler of [85, 86].
However, asymmetric multicores have a fixed number of core
types (generally two), while reconfigurable cores provide a
finer granularity of heterogeneity, enabling fine-grained per-
formance/power tuning. We compare CuttleSys against an
oracle-like asymmetric multicore in Section 8.
2.3 Reconfigurable Architectures
Previous work on reconfigurable cores focuses on batch,
throughput-bound workloads. Lee et al. show the efficiency
advantages and limits of adapting microarchitecture param-
eters to workloads. Lukefahr et al. [61] propose Composite
cores, which pair big and little compute engines, and save
energy by running applications on the small core as much as
possible, while still meeting performance requirements. Pad-
manabha et al. [73] propose trace-based phase prediction for
migration of applications in Composite cores.
Chrysso [44] proposes an integrated power manager that
uses analytical power and performance models and global
utility-based power allocation. The configuration space of
a core in our work is significantly larger than in Chrysso [44],
which makes the optimization problem more complex. Re-
source Constrained Scaling (RCS) [36] also aims to maximize
performance in power-constrained multicores. In RCS, the
resources of a processor and the number of operating cores
are scaled simultaneously, which means that the system can
operate in only a few different configurations.
Khubaib et al. [49] propose a core architecture that dynam-
ically morphs from a single-threaded out-of-order to a multi-
threaded in-order core. FlexCore [10] can similarly morph
into 4-way out-of-order, 2-way out-of-order, or 2-way in-order
cores at runtime. Tarsa et al. [95] propose a post-silicon clus-
tered CPU architecture that combines 2 out-of-order execution
clusters, which can operate as an 8-wide execution engine or
a low-power 4-wide engine.
The Sharing Architecture [107] and Core Fusion [42] com-
bine multiple simple out-of-order cores to form larger out-
of-order cores. CASH [106] also advances the Sharing Ar-
chitecture with a runtime to find the best configuration for a
single application which minimizes cost and meets QoS, using
control theory and Q-learning. CuttleSys accounts for the
interference between multiple co-scheduled applications that
must all meet performance guarantees, and can be applied to
the Sharing Architecture to quickly explore the design space
of resource slices when multiple applications are hosted on a
multi-tenant server, and arrive at suitable per-job resources.
Zhang et al. [105] and Petrica et al. [75] propose cores that
can be reconfigured by scaling datapath components to save
energy beyond DVFS. The dynamic scheme in Flicker [75]
optimizes performance for a homogeneous multicore with
reconfigurable cores under a power budget. Zhang et al. [105]
also show that reconfigurable cores significantly extend the
performance-energy Pareto frontier provided by DVFS.
However, these systems are limited to batch applications,
and do not consider the implications of tail latency on core
reconfiguration. Moreover, Zhang et al. [105] only consider a
single core running one application. In Section 8.5, we discuss
why Flicker cannot be applied directly in this setting, and pro-
vide a quantitative comparison between Flicker and CuttleSys.
3. CHARACTERIZATION OF
LATENCY-CRITICAL SERVICES
We now quantify the impact of different core configurations
on the tail latency of interactive cloud services. We use five applications,
Xapian, Masstree, ImgDNN, Silo, and Moses, and
configure them based on the analysis in [48]. We simulate
each application on a homogeneous 16-core system using
zsim [91], a fast and cycle-level simulator, combined with
McPAT v1.3 [57] for a 22nm technology for power statistics.
A core is divided into three sections, front-end (FE - fetch,
decode, ROB, rename, dispatch), back-end (BE - issue queues,
register files, functional units), and load-store (LS - LD/ST queues),
each of which can be configured as six-way, four-way, or
two-way, similar to Flicker [75], except that we adopt a more
aggressive superscalar design. These cores dynamically power
gate associated array structures in each pipeline region when
the configuration is downsized.
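As a small illustration of the configuration space described above, the sketch below enumerates every {FE,BE,LS} combination for a single core. The names `WIDTHS` and `core_configurations` are hypothetical and chosen only for this example; the enumeration order is an assumption and need not match the ordering used in Fig. 1.

```python
from itertools import product

# Widths available to each core section: six-way, four-way, two-way.
WIDTHS = (6, 4, 2)


def core_configurations():
    """Return every (FE, BE, LS) width combination for one core."""
    return list(product(WIDTHS, repeat=3))


configs = core_configurations()
print(len(configs))             # number of distinct core configurations
print(configs[0], configs[-1])  # widest and narrowest configuration
```

Three sections with three widths each give 27 core configurations, which matches the number of configurations listed in the legend of Fig. 1.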
Fig. 1 shows the variation of tail latency and power for
each service, across core configurations at low and high load.
Across all services, at high load, tail latency increases dramati-
cally as the back-end and load-store queue are constrained. On
the other hand, at low load, tail latency remains low, even for
the lower-performing configurations. Therefore, when load is
low, interactive services can leverage reconfiguration to reduce
their power consumption, without a performance penalty.
We also observe that the core section that most affects tail
latency varies between applications. For Xapian, tail latency
is primarily determined by the load-store queue size, with low
latency requiring a six-way queue. In the cases of ImgDNN,
Silo, and Masstree, tail latencies are low when FE and LS are
configured to six- or four-way, while in the case of Moses, tail
latency primarily depends on the front-end core section.
At high load, the configuration with the best performance-
power trade-off varies across services. For example, Xapian
consumes the least power in a {2,2,6} configuration while
keeping tail latency low, while for ImgDNN, Masstree, Moses,
and Silo, configurations {4,2,4}, {4,2,4}, {6,2,4} and {2,2,4}
consume the least power respectively. This shows that different
core configurations are indeed needed by diverse applications.
Also, batch applications differ in preferences from latency-
critical applications. This variability across loads and appli-
cations highlights the need for practical runtimes that identify
the best core configurations of each application online.
4. CUTTLESYS OVERVIEW
We co-schedule latency-sensitive applications with batch
workloads on a server with multiple reconfigurable cores, as
shown in Figure 2. The last level cache (LLC) and power
budget are shared across all cores.
4.1 Problem Formulation
[Figure 2: block diagram. A multicore with private L1i/L1d caches, a shared LLC, and main memory is monitored and profiled. The Resource Controller holds three applications × configurations matrices (throughput, power, tail latency) seeded with known apps, runs Perf/Power Reconstruction (Parallel SGD), and then Design Exploration (DDS Algorithm) under the power cap (maxPower). The Configuration Controller applies the selected configs to the lanes of each core: front-end (F/D/Rename/Dispatch/ROB), back-end (IQ/RF/Exec units), and load/store (LSQ). Steps are marked ① to ⑤.]
Figure 2: CuttleSys system overview.
Our objective is to meet the QoS target for the latency-
sensitive application, and maximize the throughput of the
co-located batch applications, under a power budget that can
change dynamically. Since the applications share the last level
cache, the performance of each application depends on the
interference in the last level cache caused by other applica-
tions. In order to mitigate this interference, CuttleSys also
dynamically partitions the LLC among active applications at
the granularity of cache ways [26, 78].
The system consists of N cores. Each core can be configured
in m modes. Each application can be assigned one of p cache
way allocations. Thus, each application can be executed in
m∗p configurations. For simplicity, the formulation below as-
sumes one latency-sensitive application colocated with multi-
ple (B) batch applications. The objective function is as follows:
$B_{i,j,k}$ = throughput (BIPS) of batch app $i$ running in core config $j$ and cache allocation $k$
$T_{0,j,k}$ = tail latency of latency-sensitive app running in core config $j$ and cache allocation $k$
$P_{i,j}$ = power of app $i$ running in core config $j$
$C_{i,j,k}$ = cache ways allocated to app $i$ running in core config $j$ and cache allocation $k$
$I_{i,j,k}$ = 1 if app $i$ is assigned to core configuration $j$ and cache allocation $k$, and 0 otherwise
We maximize the geometric mean of throughput:
\[ \mathrm{BIPS}_{system} = \Big( \prod_{i=1}^{B} \sum_{j,k} B_{i,j,k} * I_{i,j,k} \Big)^{1/B} \tag{1} \]
under the following constraints:
\[ \mathrm{Power}_{system} = \sum_{i=0}^{B} \sum_{j,k} P_{i,j} * I_{i,j,k} \le \mathrm{maxPower} \tag{2} \]
\[ \mathrm{Cache\_alloc}_{system} = \sum_{i=0}^{B} \sum_{j,k} C_{i,j,k} * I_{i,j,k} \le \mathrm{cacheWays} \tag{3} \]
\[ \sum_{j,k} T_{0,j,k} * I_{0,j,k} \le \mathrm{QoS} \tag{4} \]
\[ \sum_{j,k} I_{i,j,k} = 1 \quad \forall i = 1, \ldots, N \tag{5} \]
Eq. 2 states that the total power should be under the budget,
while Eq. 3 states that the total allocated cache ways should
be no higher than the LLC associativity. Eq. 4 addresses the
QoS requirement of the latency-sensitive application. Eq. 5
states that each application can be mapped to a single con-
figuration. We use geometric mean as the objective function,
since all batch applications have equal priority [94]. Exhaus-
tively exploring the full design space of core configurations
and cache allocations ($(m*p)*(m*p)^{B}$) is impractical as the
number of cores/applications increases. This is problematic,
since reconfiguration decisions need to happen online, and the
optimization problem is non-linear and non-convex in nature.
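To make the formulation concrete, the following sketch evaluates Eq. 1 through Eq. 4 for one candidate assignment and computes the size of the exhaustive search space. It is an illustration under stated assumptions rather than the CuttleSys implementation: the names `evaluate` and `search_space_size` are hypothetical, all numbers are invented toy values, and the number of cache ways is assumed to depend only on the cache allocation index k. Eq. 5 holds by construction, since each application carries exactly one (j, k) pair.

```python
from math import prod


def search_space_size(m, p, num_batch):
    """(m*p) choices for the latency-sensitive app times (m*p)^B for batch apps."""
    return (m * p) * (m * p) ** num_batch


def evaluate(assignment, bips, tail, power, ways, max_power, cache_ways, qos):
    """Check Eq. 2-4 and compute Eq. 1 for one assignment.

    assignment[i] = (j, k): core config j and cache allocation k of app i.
    App 0 is the latency-sensitive app; apps 1..B are batch apps (B >= 1).
    """
    total_power = sum(power[i][j] for i, (j, _) in enumerate(assignment))  # Eq. 2
    total_ways = sum(ways[k] for _, k in assignment)                       # Eq. 3
    j0, k0 = assignment[0]
    feasible = (total_power <= max_power
                and total_ways <= cache_ways
                and tail[j0][k0] <= qos)                                   # Eq. 4
    batch = [bips[i][j][k] for i, (j, k) in enumerate(assignment) if i > 0]
    gmean = prod(batch) ** (1.0 / len(batch))                              # Eq. 1
    return feasible, gmean


# Toy instance (assumed values): m = 2 core configs, p = 2 cache allocations.
bips = [None,                       # app 0 is latency-sensitive, no BIPS entry
        [[1.0, 2.0], [3.0, 4.0]],   # batch app 1: bips[1][j][k]
        [[2.0, 8.0], [4.0, 9.0]]]   # batch app 2: bips[2][j][k]
tail = [[9.0, 7.0], [4.0, 3.0]]     # tail[j][k] of app 0
power = [[2.0, 5.0], [1.0, 3.0], [1.5, 3.5]]  # power[i][j]
ways = [1, 2]                       # cache ways granted by allocation k

print(evaluate([(1, 0), (0, 1), (0, 1)], bips, tail, power, ways,
               max_power=8.0, cache_ways=8, qos=5.0))
print(search_space_size(m=2, p=2, num_batch=2))
```

Even this toy instance has 64 candidate assignments, and the count grows exponentially with the number of batch applications, which is why exhaustive enumeration is avoided.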
Our scheme is made practical via two separate, mutually
beneficial optimizations:
1. Lightweight runtime characterization to infer the performance
($B_{i,j,k}$ in Eq. 1 and $T_{0,j,k}$ in Eq. 4) and power ($P_{i,j}$
in Eq. 2) of all applications across all possible m core
configurations and p cache allocations; and
2. Fast and accurate design space exploration, given the
output from (1) to determine a near-optimal solution to
the core configuration and cache allocation problem.
Previous approaches [75] to determine the impact of recon-
figuration require detailed profiling of each active application
against a large number of resource configurations, which incurs
non-trivial profiling overheads, and scales poorly with the num-
ber of configuration parameters. This approach is furthermore
limited to batch applications, and does not take into account
inter-application interference. Instead, we propose to infer per-
formance (tail latency for interactive services and throughput
for batch jobs) and power, across all possible core and cache
configurations, by uncovering the similarities between the be-
havior of new and previously-seen applications across config-
urations. Specifically, we use PQ-reconstruction with Stochas-
tic Gradient Descent [18, 29, 51, 101], a fast and accurate data
mining technique that, given a few profiling samples for an
application collected at runtime, estimates the application’s
performance and power across all remaining system config-
urations, based on how previously-seen, similar applications
behaved on them. While SGD has been previously applied in
the context of cluster scheduling [29,32], core reconfiguration
places much stricter timing constraints (only a few ms) on SGD,
as well as a larger configuration space, requiring a new, more
efficient, parallel approximated SGD implementation.
To quickly explore the design space, we adapt Dynami-
cally Dimensioned Search (DDS) [97], a heuristic algorithm
that searches high-dimensional spaces for near-optimal solu-
tions. DDS is computationally efficient, applicable to discrete
problems, and especially effective for problems with high
dimensionality, such as quickly searching the large space of
resource configurations. The combination of SGD and DDS
significantly improves performance over previous approaches.
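For readers unfamiliar with DDS, the sketch below shows the basic serial form of the algorithm on an integer-valued search space. It is not the parallel variant used in CuttleSys. The function name `dds_maximize`, the perturbation factor `r = 0.2`, the rounding to integers, and the penalty-style toy objective are all assumptions made for illustration. The core idea is that the probability of perturbing each dimension shrinks as iterations progress, so the search moves from global to local, and a candidate is kept only if it is at least as good as the current best.

```python
import math
import random


def dds_maximize(objective, bounds, max_iter=200, r=0.2, seed=0):
    """Basic serial DDS on integer dimensions; bounds[d] = (lo, hi), max_iter >= 2."""
    rng = random.Random(seed)
    best = [rng.randint(lo, hi) for lo, hi in bounds]
    best_val = objective(best)
    for it in range(1, max_iter + 1):
        # Probability of perturbing each dimension decays from 1 to 0.
        p_select = 1.0 - math.log(it) / math.log(max_iter)
        dims = [d for d in range(len(bounds)) if rng.random() < p_select]
        if not dims:
            dims = [rng.randrange(len(bounds))]  # always perturb at least one
        cand = list(best)
        for d in dims:
            lo, hi = bounds[d]
            x = best[d] + r * (hi - lo) * rng.gauss(0.0, 1.0)
            if x < lo:                  # reflect at the lower bound
                x = lo + (lo - x)
                if x > hi:
                    x = lo
            elif x > hi:                # reflect at the upper bound
                x = hi - (x - hi)
                if x < lo:
                    x = hi
            cand[d] = int(round(x))     # assumption: round to a discrete index
        val = objective(cand)
        if val >= best_val:             # greedy acceptance
            best, best_val = cand, val
    return best, best_val


# Toy use: 4 apps each pick a config index 0..2; infeasible points are penalized.
THROUGHPUT = [1.0, 2.0, 3.0]
POWER = [1.0, 2.5, 4.5]


def toy_objective(x, cap=10.0):
    total_power = sum(POWER[v] for v in x)
    if total_power > cap:
        return -(total_power - cap)     # negative score for exceeding the cap
    return math.prod(THROUGHPUT[v] for v in x) ** (1.0 / len(x))


print(dds_maximize(toy_objective, [(0, 2)] * 4))
```

Folding the power cap into the objective as a penalty is one simple way to handle constraints in a heuristic search; other constraint-handling schemes are possible.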
We also note that CuttleSys is an open-loop solution, which
searches the design space and finds the best resource allocation
in a single decision interval, unlike feedback-based
controllers, which take significant time to converge. This is
especially beneficial for latency-critical applications, as they
are not exposed to QoS violations while a controller converges.
4.2 Efficient Resource Management
Fig. 2 shows the high-level architecture of CuttleSys, which
consists of the Configuration Controller and the Resource Con-
troller. At the beginning of each decision quantum (100ms by
default, consistent with prior work [75]), the Configuration
Controller profiles performance and power, which are used
by the Perf/Power Reconstruction module in the Resource
Controller. The Configuration Controller then configures
cores and cache ways based on the solution from the Design
Exploration module for the remainder of the timeslice.
[Figure 3: timeline of one total time slice: ① 2 samples, ② Reconstruction Algorithm, ③ Optimization Algorithm, ④ Steady State.]
Figure 3: Timeline showing the steps of characterization, in-
ference, and steady-state operation in CuttleSys.
The Resource Controller takes as input the collected profil-
ing samples, and the specified Power Cap, and determines the
best core/cache configurations. The first step is Perf/Power Re-
construction, which uses Stochastic Gradient Descent (SGD)
to estimate the power and performance of an application for all
core and cache configurations, based on a small number of sam-
ples (Section 5). The Design Exploration uses SGD’s output
to determine the best configuration for each job (Section 6).
We describe the timeline of this process below, using Fig. 3.
Our approach requires two profiling samples for each currently
running application: one on the highest-performing, widest-issue
({6,6,6}) configuration and one on the lowest-performing,
narrowest-issue ({2,2,2}) configuration, each with one LLC way per
core. It also requires the performance and power of some “training”
applications in all configurations, as shown in Figure 2. We run
applications for the duration of a sample timeframe (1ms, as
described in Section 8.1.1) in each configuration and measure
performance and power (①). QoS for most cloud services is measured
at intervals longer than 1ms [23, 32, 59, 60, 66, 70]. To obtain
meaningful measurements, we therefore measure tail latency over
the entire 100ms of the previous timeslice. After this online
profiling, we run the reconstruction algorithm to estimate the
tail latency of latency-sensitive cloud services, the throughput
of batch applications, and the power consumption of each
application across all m∗p configurations (②).
Finally, we apply DDS to quickly search the space of core
configurations and cache allocations, and find a solution that
meets QoS and maximizes the throughput of batch applications
for the given power budget (③). The system then runs
in steady state (④) with the selected core and LLC configurations.
At the end of the timeslice, power and performance are
measured and updated in the SGD matrix to ensure that any
predictions deviating from the real metrics are corrected.
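The timeline above can be summarized as a short control loop. The sketch below is only a schematic under stated assumptions: `run_quantum` and all of its callback parameters (`profile`, `reconstruct`, `explore`, `apply_config`, `measure`) are hypothetical names, every application is assumed to be sampled in the same core configuration at the same time, and the time spent in reconstruction and search is ignored when computing the steady-state duration.

```python
WIDEST = (6, 6, 6)     # highest-performing core configuration
NARROWEST = (2, 2, 2)  # lowest-performing core configuration


def run_quantum(apps, profile, reconstruct, explore, apply_config, measure,
                quantum_ms=100.0, sample_ms=1.0):
    """One decision quantum: profile, reconstruct, explore, then steady state."""
    samples = {}
    for cfg in (WIDEST, NARROWEST):               # step 1: two profiling samples
        apply_config({app: cfg for app in apps})
        samples[cfg] = profile(apps, sample_ms)
    estimates = reconstruct(samples)              # step 2: infer missing entries
    chosen = explore(estimates)                   # step 3: search configurations
    apply_config(chosen)                          # step 4: steady state
    steady_ms = quantum_ms - 2 * sample_ms        # assumes zero software overhead
    observed = measure(apps, steady_ms)           # fed back to correct estimates
    return chosen, observed


# Stub callbacks, for illustration only.
result = run_quantum(
    ["ls_app", "batch_app"],
    profile=lambda apps, ms: {a: 1.0 for a in apps},
    reconstruct=lambda samples: samples,
    explore=lambda est: {"ls_app": (4, 2, 4), "batch_app": (2, 2, 2)},
    apply_config=lambda cfg: None,
    measure=lambda apps, ms: {a: ms for a in apps},
)
print(result)
```

Passing the stages in as callbacks keeps the loop independent of how profiling, inference, and search are actually implemented.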
5. PRACTICAL INFERENCE WITH SGD
The first step in the Resource Controller estimates the power,
throughput, and tail latency for applications across all core
configurations and cache allocations. Previous techniques [75]
require long profiling runs to accurately estimate an applica-
tion’s power and performance across configurations. More-
over, since previous work only targeted core configurations,
estimating performance for cache allocations too would re-
quire an untenable number of profiling samples. Instead, we
use the following insight to reduce profiling and improve prac-
ticality: the performance and power profile of a new, poten-
tially unknown application may exhibit similarities with the
characteristics of applications the system has previously seen,
even if the exact applications are not the same.
This problem is analogous to a recommender system [15,
17, 19, 37, 50, 80, 101], where the system recommends items
to users based only on sparse information about their prefer-
ences. In our case, users are analogous to applications and
items are analogous to resource configurations (combination
of core configurations and cache allocations). A rating corre-
sponds to the power or performance of an application running
in the particular core and cache configuration. We construct
a sparse matrix R (one each for throughput, tail latency and
power) with applications as rows and resource configurations
(core-cache vectors) as columns. The rows of matrix R include
some “known” applications, along with the previously-unseen
applications that arrive in the system. The matrix is initially
populated with the performance or power of these “known”
applications which have been characterized once offline across
all configurations. For all other new applications, the corre-
sponding rows only have two entries obtained through profil-
ing on two core-cache configurations out of the entire design
space. The missing entries in the matrix are inferred using
PQ-reconstruction with Stochastic Gradient Descent (SGD)
[15, 18, 29, 51, 80]. To reconstruct R, we first decompose it
to matrices P and Q, where the product of Q and $P^T$ gives the
reconstructed R, as shown in Algorithm 1. Matrices Q and
P are then constructed using Singular Value Decomposition
(SVD), and correspond to $Q = U$ and $P^T = \Sigma \cdot V^T$ respectively,
where U and V are the left and right matrices of singular vectors,
and $\Sigma$ is the diagonal matrix of singular values. In Algorithm 1,
A is the total number of applications (including known ones),
and m∗p is the number of resource configurations. The impact
of training set size is discussed in Sec. 8.1.2.
Algorithm 1 Reconstruction Algorithm
Initialization:
  Q ← random(A, m∗p); P ← random(m∗p, m∗p)
  η ← learning rate; λ ← regularization factor
  maxIter ← max # of iterations
for l ← 1 to maxIter do
  for i ← 1 to A do
    for j ← 1 to m∗p do
      ε_ij ← R_ij − Q_i · P_j^T
      Q_i ← Q_i + η(ε_ij P_j − λ Q_i)
      P_j ← P_j + η(ε_ij Q_i − λ P_j)
R ← Q × P^T
There is an obvious trade-off between the maximum num-
ber of iterations and the reconstruction accuracy: the fewer
the iterations, the lower the overhead, but also the higher the
prediction inaccuracy. We have conducted a sensitivity study
to select convergence thresholds for SGD. To further reduce
overheads, we have also limited the number of iterations.
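The reconstruction above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CuttleSys implementation: missing entries are marked with None, SGD updates are applied only to observed entries, and a small latent rank (the hypothetical `rank` parameter) replaces the m∗p-wide factors of Algorithm 1. The function name `reconstruct` and all default values are illustrative.

```python
import random


def reconstruct(R, rank=1, eta=0.05, lam=0.001, max_iter=2000, seed=0):
    """Fill the missing (None) entries of R with SGD-based PQ-reconstruction.

    R has one row per application and one column per resource configuration.
    Assumption: only observed entries drive the updates.
    """
    rng = random.Random(seed)
    n_apps, n_confs = len(R), len(R[0])
    # Q holds one latent vector per application, P one per configuration.
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_apps)]
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_confs)]
    known = [(i, j) for i in range(n_apps) for j in range(n_confs)
             if R[i][j] is not None]

    for _ in range(max_iter):
        for i, j in known:
            pred = sum(Q[i][k] * P[j][k] for k in range(rank))
            err = R[i][j] - pred
            for k in range(rank):
                q, p = Q[i][k], P[j][k]
                # Regularized gradient step on both factors.
                Q[i][k] = q + eta * (err * p - lam * q)
                P[j][k] = p + eta * (err * q - lam * p)

    # The reconstructed matrix is the product Q x P^T.
    return [[sum(Q[i][k] * P[j][k] for k in range(rank))
             for j in range(n_confs)] for i in range(n_apps)]
```

For example, with three fully characterized rows and a fourth row that has only two samples, `reconstruct(R)` returns a dense matrix whose fourth row contains estimates for the configurations that were never profiled. Limiting `max_iter` trades accuracy for lower overhead, as discussed above.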
For the currently-running applications, we obtain two sam-
ples of the highest- and lowest-performing core configurations
with the ways equally allocated at runtime. We also get
additional samples for these applications by monitoring power,
throughput, and tail latency for the configurations from
previous steady states. To predict the throughput and power for
the remaining configurations (m∗p−2 initially, but fewer
as we get more points from previous steady states) and tail
latency for the remaining configurations (m∗p−1 initially),
we run three instances of the reconstruction algorithm, one
each for throughput, tail latency, and power. We run these
three reconstructions in parallel to minimize overheads.
To further accelerate reconstruction, we have implemented
a parallel reconstruction algorithm that executes SGD without
synchronization primitives [72, 88]. This introduces a
small, upper-bounded inaccuracy (approximately 1%), while
improving its execution time by 3.5×.
Figure 4: The DDS design space exploration algorithm.
[Diagram: starting from an initial random best point (an app
configuration vector in which each entry is the configuration
{core-config, cache alloc} in which an application will run), the
current best point (0 35 19 73) plus a perturb vector (34 9 50 0)
gives a new point (34 44 69 73). If objective(new point) >
objective(best point), then best point = new point. After N
iterations, the output is a near-optimal combination of
core configurations and cache allocations.]
6. FAST DESIGN EXPLORATION WITH DDS
Once SGD recovers the missing performance and power
of each job across all core configurations and cache alloca-
tions, the system employs Dynamically Dimensioned Search
(DDS) to quickly explore the space, and select appropriate core
configurations and cache partitions. DDS [97] is specifically
designed to navigate spaces with high dimensionality.
The operation of DDS is shown in Fig. 4. The algorithm
explores new points in the design space by perturbing a small
number of dimensions from the current best point in each it-
eration, with the number of perturbed dimensions decreasing
as the search progresses, and eventually converging to the
best solution. Fig. 4 shows an example of DDS for a simple
system running four applications on four cores. The
application configuration vector is an N-dimensional decision
variable, where the ith dimension denotes the configuration
assigned to the ith application. The configuration assigned can
be any number from 0 to m∗p−1. The algorithm starts with a
set of random points, and selects the point that has the highest
value for the target objective as the current best point. In the
given example, the current best point has applications 0, 1, 2,
and 3 assigned to configurations 0, 35, 19, and 73 respectively. The
current best point is then perturbed to explore new points. If
the new point has a higher objective, it replaces the previous
best point, and the process repeats until the algorithm arrives at
a near-optimal combination of core configurations and cache
allocations. The perturbation vector determines the number of
dimensions to be perturbed and the perturbation magnitude for
each dimension. DDS searches across more dimensions in the
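beginning, and narrows down to fewer dimensions later. The
perturbation quantity is equal to $r \cdot (\#confs) \cdot N(0,1)$, where
r is a perturbation parameter, #confs is the number of resource
configurations, and N(0,1) is a standard normal random variable.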
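A single DDS perturbation step can be sketched as follows. This is an illustrative sketch rather than the authors' code. It assumes that perturbed values are rounded to integer configuration indices, that a value still out of range after one reflection is clamped, and that at least one dimension is always perturbed (a rule taken from the original DDS formulation, not stated in the text above). The probability schedule is the one used in Algorithm 2, and the helper names `reflect` and `perturb` are hypothetical.

```python
import math
import random


def reflect(value, n_confs):
    """Mirror an out-of-range index about the violated bound."""
    if value < 0:
        value = -value
    if value > n_confs - 1:
        value = 2 * (n_confs - 1) - value
    # Assumption: clamp if a single reflection is not enough.
    return min(max(value, 0), n_confs - 1)


def perturb(best, iteration, max_iter, r, n_confs, rng):
    """Generate one new point from the current best point."""
    # Probability of perturbing each dimension shrinks as the search goes on.
    p = 1.0 - math.log(iteration) / math.log(max_iter)
    dims = [d for d in range(len(best)) if rng.random() < p]
    if not dims:
        # Assumption: always perturb at least one dimension.
        dims = [rng.randrange(len(best))]
    new = list(best)
    for d in dims:
        step = r * n_confs * rng.gauss(0.0, 1.0)
        new[d] = reflect(int(round(best[d] + step)), n_confs)
    return new
```

Calling `perturb([0, 35, 19, 73], 1, 40, 0.2, 108, random.Random(1))` perturbs every dimension, because the probability is 1 in the first iteration, whereas in the last iteration only a single dimension changes.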
6.1 Handling Optimization Constraints
The optimization problem described in Sec. 4 has three con-
straints: a) power (Eq. 2), b) cache (Eq. 3), and c) QoS (Eq. 4).
Since latency-critical applications are load-balanced, all
cores assigned to them run in the same configuration. This
simplifies the search for a suitable core configuration to just
scanning through the predicted tail latency values of the m∗p
configurations. We select the lowest cache allocation and
the core configuration that consumes the least power while
meeting QoS. DDS then explores points for the batch appli-
cations, while keeping the configuration of cores and cache
ways assigned to the latency-critical application fixed.
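The scan over predicted tail latencies can be sketched as below. The sketch assumes one interpretation of the selection rule: among the configurations predicted to meet QoS, prefer the smallest cache allocation and break ties by the lowest predicted power. The function name and its list-based inputs are hypothetical.

```python
def select_lc_config(predicted_tail, predicted_power, cache_ways, qos_target):
    """Pick a configuration index for the latency-critical service.

    All three lists are indexed by configuration id (0 .. m*p - 1).
    Returns None if no configuration is predicted to meet QoS.
    """
    feasible = [c for c, lat in enumerate(predicted_tail) if lat <= qos_target]
    if not feasible:
        return None
    # Assumption: lexicographic preference (fewest ways, then least power).
    return min(feasible, key=lambda c: (cache_ways[c], predicted_power[c]))
```

A return value of None corresponds to the case handled later in this section, where cores are reclaimed from the batch workloads.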
To handle the power and cache constraints of Eq. 2 and 3,
we use an objective function that penalizes the points that con-
sume more power and/or more cache than allowed as follows:
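$$\mathrm{objective}(x) = \mathrm{BIPS}_{system}(x) - \mathrm{penalty\_power} \cdot \big(\mathrm{maxPower} - \mathrm{Power}_{system}(x)\big) - \mathrm{penalty\_cache} \cdot \big(\mathrm{maxWays} - \mathrm{Cache\_alloc}_{system}(x)\big)$$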
We choose a soft penalty approach to handle the power con-
straint in the objective function, so that points with slightly
higher power are not heavily penalized.
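As a rough illustration of the soft-penalty idea, the sketch below scores a candidate point. It makes an assumption that the text does not spell out: the penalty is charged only on the amount by which a point exceeds the power cap or the number of available ways, so that points within both limits are ranked purely by throughput. The default penalty weight of 2 is a placeholder, and the function signature is hypothetical.

```python
def objective(bips, power, ways, max_power, max_ways,
              penalty_power=2.0, penalty_cache=2.0):
    """Soft-penalty objective for one candidate point.

    bips, power, and ways are the predicted system-wide throughput,
    power, and allocated LLC ways for the candidate.
    """
    # Assumption: only the excess over each cap is penalized.
    over_power = max(0.0, power - max_power)
    over_ways = max(0.0, ways - max_ways)
    return bips - penalty_power * over_power - penalty_cache * over_ways
```

Because the penalty grows linearly with the excess, a point that overshoots the cap slightly loses little of its score, while a point far above the cap is unlikely to be kept as the best point.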
If no configurations are found which meet the QoS of the
latency-critical service, CuttleSys reclaims cores from the
batch workloads, one per timeslice, and yields them to the
latency-critical service, until QoS is met. The cores are sim-
ilarly incrementally relinquished by the latency-critical appli-
cations when QoS is met with latency slack.
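The per-timeslice core reallocation described above might look like the following sketch. The names are hypothetical, and two details are assumptions: cores are yielded back only down to the initial allocation (`base_lc_cores`), and "latency slack" means the measured tail latency is below (1 − slack) times the QoS target.

```python
def rebalance_cores(lc_cores, batch_cores, qos_feasible, tail_latency,
                    qos_target, base_lc_cores, slack=0.20):
    """Move at most one core per timeslice between batch and LC jobs.

    qos_feasible: True if some configuration is predicted to meet QoS.
    Returns the new (lc_cores, batch_cores) pair.
    """
    if not qos_feasible and batch_cores > 0:
        # No configuration meets QoS: reclaim one core from the batch jobs.
        return lc_cores + 1, batch_cores - 1
    has_slack = tail_latency <= (1.0 - slack) * qos_target
    if qos_feasible and has_slack and lc_cores > base_lc_cores:
        # QoS is met with slack: give one reclaimed core back.
        return lc_cores - 1, batch_cores + 1
    return lc_cores, batch_cores
```

Moving a single core per timeslice keeps the adjustment incremental in both directions.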
6.2 Parallel DDS
To further speed up the design space exploration, we have
designed a new parallel DDS, shown in Alg. 2.
Algorithm 2 Parallel DDS Algorithm
1: Initialization:
2:   maxIter ← max # of iterations
3:   r ← perturbation parameter
4:   lc = get_config_LC()
5:   Initial rand points x = {lc, ..., lc, x_K, ..., x_N}
6:   x_best ← argmax{obj(x) | x ∈ random points}
7: for i ← 1 to maxIter do
8:   x_localbest = x_best
9:   for j ← 1 to pointsPerIteration do
10:    p ← 1 − log(i)/log(maxIter)
11:    add dimensions to {P} with probability p
12:    for d ∈ {P} do
13:      x_new[d] = x_localbest[d] + r·(#confs)·N(0,1)
14:      if x_new[d] ∉ [0, #confs) then
15:        reflect the perturbation
16:    if obj(x_new) > obj(x_localbest) then
17:      x_localbest = x_new
18:  barrier_wait()
19:  if threadID == 0 then
20:    x_best ← argmax{obj(x) | x ∈ {x_localbest}}
21:  barrier_wait()
In the first phase, we initialize the algorithm’s parameters.
Line 2 sets the maximum number of iterations (maxIter) of the
algorithm. As maxIter increases, the quality of the solution
obtained improves, but at the same time the time required to
run the algorithm also increases. We explore this trade-off in
Section 8, and select the appropriate number of iterations.
In parallel DDS, to avoid different threads exploring the
same points (obtained from perturbation of the same best
point), and to explore a larger space of configurations, we use
four different values for the perturbation parameter; r = (r1, r2,
r3, r4). In an N-core system, the first N/4 threads of the parallel
algorithm set r = r1, the next N/4 threads set r = r2, and so on.
Line 4 gets the resource configuration that satisfies the QoS
for latency-critical (LC) applications. Lines 5-6 show the
randomly-chosen points the algorithm starts with, selecting the
best among them as the initial best point. In parallel DDS, for
a current best point, each thread generates pointsPerIteration
number of new points, and finds the best point among them,
as shown in Lines 9-17. The number of dimensions to be per-
turbed is determined by the probability function, as seen on
Lines 10-11, while Line 13 shows the quantity by which the
dimensions are perturbed. If the value of a dimension in the
newly-generated point is out of bounds, the algorithm mirrors
the value about the maximum or minimum bound, to bring the
point back within the valid range (Lines 14-15).
DDS chooses the new point as the next best point if obj(x_new) >
obj(x_localbest) (Lines 16-17). After each core has computed
pointsPerIteration points, a single core aggregates all the per-core
best points, picks the best one, and distributes the selected
configuration to be used for the next iteration (Lines 18-21).
DDS concludes after maxIter iterations, and returns the best
combination of core configurations and LLC allocations.
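The structure of Algorithm 2 can be sketched with Python's standard thread pool. This is a structural illustration only: Python threads do not provide the speedup of a native implementation, one worker per perturbation parameter is used instead of N/4 threads per value, and collecting the futures stands in for the barriers. The optional `start` argument, the seeding scheme, and all function names are assumptions introduced for the example.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def _worker(best, it, max_iter, r, n_confs, points, fixed, obj, seed):
    """Explore `points` new points around the shared best point."""
    rng = random.Random(seed)
    local_best, local_val = list(best), obj(best)
    p = 1.0 - math.log(it) / math.log(max_iter)
    free = [d for d in range(len(best)) if d not in fixed]
    for _ in range(points):
        # Fall back to one random dimension if none was selected (assumption).
        dims = [d for d in free if rng.random() < p] or [rng.choice(free)]
        new = list(local_best)
        for d in dims:
            v = int(round(local_best[d] + r * n_confs * rng.gauss(0.0, 1.0)))
            if v < 0:
                v = -v                          # reflect about the lower bound
            if v > n_confs - 1:
                v = 2 * (n_confs - 1) - v       # reflect about the upper bound
            new[d] = min(max(v, 0), n_confs - 1)
        val = obj(new)
        if val > local_val:
            local_best, local_val = new, val
    return local_val, local_best


def parallel_dds(obj, n_dims, n_confs, fixed=None, max_iter=40, points=10,
                 r_values=(0.2, 0.3, 0.4, 0.5), n_init=50, seed=0, start=None):
    """Maximize obj over integer vectors; `fixed` pins the LC dimensions."""
    fixed = dict(fixed or {})
    rng = random.Random(seed)

    def rand_point():
        return [fixed.get(d, rng.randrange(n_confs)) for d in range(n_dims)]

    if start is not None:
        best = list(start)
    else:
        best = max((rand_point() for _ in range(n_init)), key=obj)
    with ThreadPoolExecutor(max_workers=len(r_values)) as pool:
        for it in range(1, max_iter + 1):
            futures = [pool.submit(_worker, best, it, max_iter, r, n_confs,
                                   points, fixed, obj,
                                   seed * 1000003 + it * 101 + t)
                       for t, r in enumerate(r_values)]
            # Waiting for every result plays the role of barrier_wait().
            results = [f.result() for f in futures]
            best = max(results, key=lambda vr: vr[0])[1]
    return best
```

Since every worker starts from the shared best point and only accepts improvements, the objective of the returned point is never lower than that of the starting point.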
If the power cap is not met even after operating all cores run-
ning batch jobs in the lowest configuration, we turn off cores,
in descending order of power, until the power budget is met.
7. EXPERIMENTAL METHODOLOGY
We evaluate our approach on 32-core multicore architec-
tures consisting of reconfigurable cores. The core’s architec-
tural parameters are shown in Table 1, and are scaled according
to the selected core configuration similar to [75]. Since we
assume six-, four-, and two-way options in each of the front-end,
back-end, and load/store queue sections, we have a total of $3^3 = 27$
(m=27) configurations. Our reconfigurable cores are also
similar to the large cores in AnyCore [25], which evaluates the
performance-energy overheads of reconfiguration.
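The size of the configuration space follows directly from this description, as the short sketch below shows. The four LLC allocations are the ones listed later in Sec. 8.1.2. The ordering of the indices and the constant names are assumptions made for illustration.

```python
from itertools import product

# Widths available for each of the three core sections
# (front-end, back-end, load/store queue).
WIDTHS = (6, 4, 2)
CORE_CONFIGS = list(product(WIDTHS, repeat=3))      # 3^3 = 27 entries

# LLC allocations considered per job (see Sec. 8.1.2).
LLC_WAYS = (0.5, 1, 2, 4)

# Joint core-cache configurations; the index order is an assumption.
RESOURCE_CONFIGS = [(core, ways) for core in CORE_CONFIGS for ways in LLC_WAYS]
```

With m=27 core configurations and four cache allocations, the joint space has 27 × 4 = 108 entries per application, indexed from 0 to 107.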
Front end         BP: gshare + bimodal, 64 entry RAS, 4KB BTB
                  144 entry ROB
                  6-wide fetch/decode/rename/retire
Execution core    out-of-order, 6-wide issue/execute
                  192 integer registers, 144 FP registers
                  48 entry IQueue, Load Queue, Store Queue
                  6 Integer ALUs, 2 FP ALU
                  1 Int/FP Mult Unit, 1 Int/FP Div Unit
Memory hierarchy  L1 I-Cache: 32KB, 2-way, 2 cycles
                  L1 D-Cache: 64KB, 2-way, 2 cycles
                  L2 Cache: 64MB, shared, 32-way, 20 cycles
                  200 cycle DRAM access latency
Technology        22 nm technology, 0.8V Vdd, 4GHz frequency
Table 1: Configuration of the 32-core simulated system.
Based on the RTL analysis of frequency, energy, area over-
heads in [25], we assume 1.67% frequency and 18% energy
penalty per cycle for our reconfigurable cores compared to
fixed ones. Reconfigurable cores also consume 19% higher
area. In our experiments, we consider fixed-power scenarios,
where the power budget is kept constant across the designs
(core gating of symmetric and asymmetric multicore, and re-
configurable cores). Under the power-capped scenarios, even
if more cores can be packed in fixed-core designs (core gating-
based and asymmetric multicores), they cannot be turned on
due to power constraints. The performance benefits of Cut-
tleSys are achieved at the cost of 19% more area.
7.1 Simulation Infrastructure and Workloads
We use zsim [91] to obtain performance statistics com-
bined with McPAT v1.3 [57] for 22nm technology to obtain
power statistics. We simulate 32-core systems, with 50% of the
cores assigned to a latency-sensitive application and 50%
assigned to batch jobs at time t=0. The core allocation
changes at runtime as needed. Batch applications
are multi-programmed mixes chosen from SPECCPU2006
(perlbench, bzip2, gcc, mcf, cactusADM, namd, soplex,
hmmer, libquantum, lbm, bwaves, zeusmp, leslie3d, milc,
h264ref, sjeng, GemsFDTD, omnetpp, xalancbmk, sphinx3,
astar, gromacs, gamess, gobmk, povray, specrand,
calculix, wrf), while the latency-critical (LC) services are
selected from TailBench [48] (Xapian, Masstree, ImgDNN,
Moses, Silo). To examine diverse resource behaviors, we
co-schedule each of the TailBench applications with 10
multi-programmed (16-app) mixes from SPECCPU2006, for a total
of 50 mixes. We use one LC service for simplicity; however,
CuttleSys is generalizable to any number of LC and batch
services, as long as the system is not oversubscribed.
The reconstruction algorithm requires the power and perfor-
mance of a small number of representative applications to be
collected offline, on all core configurations and cache alloca-
tions. We randomly selected 16 (discussed in Section 8.1.2)
of the above SPECCPU2006 applications for offline train-
ing at the beginning, excluding significant platform redesigns.
Each of the multiprogrammed workloads is constructed by ran-
domly selecting one of the remaining SPECCPU2006 bench-
marks to run on each core, to ensure no overlap between the
training and testing datasets. Each SPECCPU2006 benchmark
runs with the reference input dataset.
To find the maximum load each Tailbench service can sus-
tain, we simulate it on a 16-core system and incrementally
increase the queries per second (QPS), until we observe satura-
tion. We use the QPS at the knee-point before saturation as the
maximum load to avoid the instability of saturation [24]. These
max QPS are: a) Xapian: 22kQPS, b) Masstree: 17kQPS, c)
ImgDNN: 8kQPS, d) Moses: 8kQPS, and e) Silo: 24kQPS.
The system’s maximum power is the average per-core power
across all jobs on reconfigurable cores scaled to 32 cores. We
evaluate the system across power caps.
7.2 Baseline Core-Level Gating
We compare our design with core-level gating as it is widely
employed in current systems for power gating. To meet QoS,
the cores running latency-sensitive applications are always
turned on. To determine which cores to turn off, core gating
requires estimations of the power and performance of all
applications. To do this, we profile the applications for one
sample_time. We explore the following approaches for selecting the
cores to turn off: a) descending order of power; b) ascending
order of power; c) ascending order of BIPSperWatt; and d)
ascending order of BIPS. From our experiments, we found that
turning off cores based on descending order of power achieves
the best performance for core-level gating. When turning off
the last core required to meet the power budget, we search
among the active cores and gate the one that meets the power
budget with the smallest slack. We also consider core-gating
with LLC way-partitioning using [78], since the technique is
already available in real cloud servers [60]; the choice of cache
partitioning is orthogonal to the techniques in CuttleSys.
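The best-performing gating policy described above (descending order of power, with a smallest-slack choice for the last core) can be sketched as follows. The sketch assumes that only batch cores are passed in and that `budget` is the power left for them after the latency-critical cores are accounted for. The function name is hypothetical.

```python
def gate_cores(core_power, budget):
    """Return the ids of the batch cores to gate, in gating order.

    core_power maps a batch core id to its measured power.
    """
    active = dict(core_power)
    total = sum(active.values())
    gated = []
    while total > budget and active:
        excess = total - budget
        # Cores whose removal alone brings the system under the budget.
        closers = [c for c, p in active.items() if p >= excess]
        if closers:
            # Last core to gate: pick the one leaving the smallest slack.
            victim = min(closers, key=lambda c: active[c])
        else:
            # Otherwise gate the most power-hungry core.
            victim = max(active, key=lambda c: active[c])
        total -= active.pop(victim)
        gated.append(victim)
    return gated
```

For per-core powers of 10, 8, 6, and 4 and a budget of 17, the sketch first gates the 10-unit core and then the 4-unit core, which leaves less unused budget than gating either of the remaining cores.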
Quantitatively comparing against core-level gating using
the geometric mean of throughput is problematic, since when
a core is gated, fewer applications run to completion. Thus,
we compare the total number of instructions (useful work)
executed over the same amount of time.
7.3 Asymmetric Multicores
Asymmetric multicores, which comprise cores with differ-
ent performance and energy characteristics, have been pro-
posed as an alternative to homogeneous multicores in order
to improve energy efficiency [14, 20, 52, 55, 89, 93, 98]. Het-
erogeneity allows each application to receive resources that
are suitable to its requirements and thus, improve the overall
throughput, while still operating under a power budget. In
asymmetric multicores, each type of core (typically a high-end
and a low-power core type [4]), and the number of cores of each
type are statically designed. In contrast, reconfigurable
multicores allow for finer granularity of configuration by providing
a higher number of different core types. Furthermore, the number
of cores in each configuration can be decided at runtime.
We compare CuttleSys with a heterogeneous system with
two types of cores: big cores, equivalent to the {6,6,6} configu-
ration, and small cores, equivalent to the {2,2,2} configuration.
While the number of cores of each type is typically fixed statically, we
compare against an oracle-like system, which selects the best
number of big and small cores that meets the QoS of latency-
critical applications, and maximizes the throughput of batch
applications under a given power budget. For the oracle sys-
tem, we also ignore any scheduling overheads that the threads
incur to migrate between cores of different types.
8. EVALUATION
8.1 CuttleSys Scheduling Overheads
CuttleSys incurs three types of overheads: (i) for the initial
application profiling that gives the controller a sparse signal of
the application’s characteristics, (ii) for the reconstruction al-
gorithm that infers performance and power on all non-profiled
configurations, and (iii) for the DDS space exploration (Fig. 3).
Table 2 shows these overheads.
             Performance/Power   SGD              DDS
             sampling            reconstruction   search
Single run   1 ms                4.8 ms           1.3 ms
Total time   2 ms
Table 2: Characterization and optimization overheads.
8.1.1 Profiling
We empirically set a monitoring period of 1ms as an
advantageous trade-off between reducing profiling overheads and
increasing decision accuracy, similar to [75]. We profile all
cores in parallel for 2ms (① of Fig. 3), 1ms each in the
widest-issue {6,6,6} and narrowest-issue {2,2,2} configurations with
one way of LLC allocated to each core, and measure
performance and power consumption. To avoid power overshoot by
running all cores in the highest configuration, half of the cores
run in the widest-issue configuration, and the other half in the
narrowest-issue configuration in the first 1ms and vice-versa
in the second 1ms. Note that even core-level gating incurs an
overhead of 1ms for one profiling period.
Figure 5: Box plots of the error (% inaccuracy) between the measured
and predicted performance and power by SGD across configurations
(a) in isolation and (b) with colocation, for 1: throughput, 2: tail
latency, 3: power. (c) Instructions with CuttleSys vs. core-level
gating over 1s across power caps (relative instructions for
Core-gating, Core-gating+wp, Asymm-cores, CuttleSys, and No gating
at power caps of 0.9, 0.8, 0.7, 0.6, and 0.5).
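The alternating profiling schedule described in this subsection can be written down explicitly. The sketch below is illustrative. Which half of the cores starts in the widest-issue configuration is an assumption, and the function name is hypothetical.

```python
def profiling_schedule(n_cores, widest=(6, 6, 6), narrowest=(2, 2, 2)):
    """Return the per-core configurations for the two 1ms profiling phases."""
    half = n_cores // 2
    # Phase 1: first half widest, second half narrowest.
    first = [widest if c < half else narrowest for c in range(n_cores)]
    # Phase 2: the two halves swap configurations.
    second = [narrowest if c < half else widest for c in range(n_cores)]
    return [first, second]
```

Every core is sampled once in each of the two configurations, while at most half of the cores run in the widest-issue configuration at any time.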
8.1.2 Reconstruction Algorithm
Reconstruction requires characterizing offline a few “known”
applications. We select the fewest jobs (16) needed to keep
accuracy over 90% for all running applications. If the training
set instead included 24 applications, the inaccuracy would drop
to 8%, while the execution time of the reconstruction algorithm
would increase by 18%. On the other hand, decreasing the training
set to 8 applications increases inaccuracy to 20%.
We run three instances of the reconstruction algorithm (one
each for throughput of batch jobs, tail latency of latency-
sensitive applications, and power for all jobs). Reconstructing
the throughput for batch jobs takes longer, as it needs to find the
missing values for all combinations of core and LLC configura-
tions for 16 applications, while reconstructing the tail latency
needs to estimate the missing values for all configurations of
1 application at a time. Inferring performance and power for
all possible LLC allocations (32 in our case) increases the
overhead and impacts accuracy, even though many allocations
would not be feasible in practice, as all 32 cores need to share
the 32 ways. Therefore, we limit the LLC allocations for each
job to 1/2, 1, 2, and 4 ways. If two jobs are allocated 1/2 way
each, both are assigned the same LLC way. Any interference
between them is handled by updating the entries in the recon-
struction matrix with the measured values during runtime. The
three reconstructions all run in parallel on the same server.
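The restricted set of LLC allocations makes it easy to check whether a candidate assignment fits in the 32-way cache. The sketch below assumes that jobs with a 1/2-way allocation are paired on a shared way and that an unpaired job still occupies a whole way. Both helper names are hypothetical.

```python
import math

ALLOWED_WAYS = (0.5, 1, 2, 4)


def ways_needed(allocs):
    """Number of physical LLC ways required by a list of per-job allocations."""
    if any(a not in ALLOWED_WAYS for a in allocs):
        raise ValueError("allocation must be one of 1/2, 1, 2, or 4 ways")
    halves = sum(1 for a in allocs if a == 0.5)
    whole = sum(a for a in allocs if a != 0.5)
    # Two half-way jobs share one way; a leftover job takes a full way.
    return whole + math.ceil(halves / 2)


def fits_llc(allocs, total_ways=32):
    """True if the allocations fit in the shared LLC."""
    return ways_needed(allocs) <= total_ways
```

For instance, 32 jobs with 1/2 way each need only 16 physical ways, whereas 31 one-way jobs plus a single two-way job exceed the 32 available ways.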
8.1.3 DDS Algorithm
As described in Section 6, #confs is set to 108, since we
consider four LLC allocations for each of the 27 core configurations
(configuration indices therefore range from 0 to 107). We
have performed sensitivity studies to find the parameters of
parallel DDS that achieve the best trade-off between runtime
and accuracy. We arrived at the parameter values shown in
Figure 6.
Figure 6: DDS parameters.
Parameter               Value
initial random points   50
r = [r1, r2, r3, r4]    [0.2, 0.3, 0.4, 0.5]
penalty_wt              2
pointsPerIteration      10
maxIter                 40
Figure 7: Instructions executed in each time slice (0.1s) on all
cores with core-level gating, asymmetric cores, and CuttleSys.
[Panels: (a) Core gating, (b) Asymm Cores, (c) CuttleSys; x-axis:
Time (s); y-axis: Cores; scale: instructions executed (in billions).]
8.2 CuttleSys Inference Accuracy
CuttleSys uses three instances of the parallel SGD algorithm
to reconstruct the throughput, tail latency, and power of
co-scheduled applications across resource configurations.
To isolate the prediction accuracy of SGD, we run all test
applications in isolation for the full time slice in all core con-
figurations, which avoids both interference from co-scheduled
jobs and inaccuracies from limited profiling time. For the
throughput, power, and tail latency estimation, we profile on
two configurations per job, and infer the remaining 106 en-
tries. Fig. 5(a) shows the estimation errors for throughput, tail
latency, and power across the 12 “testing” SPEC applications
and 5 Tailbench applications at 80% load. Fig. 1 shows that
some configurations incur very high tail latency, and are not se-
lected during runtime. For these configurations, exact latency
prediction is less critical, as long as the prediction shows that
QoS is violated. We observe that the 25th and 75th percentiles
are within 10%, while the 5th and 95th percentiles are less than
20% for throughput, tail latency, and power. The error for tail
latency is higher, as we predict services one at a time and only
use 2 sample runs to predict the remaining 106 configurations.
We now examine the inaccuracy at runtime, which also
includes the inter-application interference and the inaccuracies
due to limited profiling time. Fig. 5(b) shows box plots of these
errors for throughput, tail latency, and power. The median is
close to zero and the 25th and 75th percentiles are within 10%
in all cases. However, the 5th and 95th percentiles in the case of
tail latency increase, as do the outliers for throughput. This is
due to (a) applications changing execution phases, making the
profiling runs not representative of steady state behavior, and
(b) resource contention between co-scheduled applications.
Since CuttleSys updates the reconstruction matrix with the
measured metrics, it accounts for changes at runtime.
8.3 Core Gating and Asymmetric Multicores
Fig. 7 shows the number of instructions executed on all cores
in each timeslice over 1s with core-level gating, asymmetric cores, and CuttleSys
under a 70% power cap. In the case of core-level gating, cores
that consume the most power are turned off to meet the power
budget and do not execute any instructions. In the case of
asymmetric multicores, though all cores remain active, some
jobs execute on small cores. We assume an unrealistic, oracle-
like asymmetric multicore, where the number of big and small
cores is determined to be the optimal, for a given workload,
in each timeslice. To meet QoS, the latency-sensitive
applications usually execute on big cores. For a 70% power cap, an
additional 7 out of 16 batch applications execute on the big
cores, while the remaining 9 applications execute on the small
cores. CuttleSys also keeps all cores active, but portions of the
cores might be turned off to meet the power budget.
Fig. 5(c) quantitatively compares the total number of in-
structions executed by batch applications in (1) core-level
gating without way-partitioning; (2) core-level gating with
way-partitioning; (3) the oracle-like asymmetric multicore;
and (4) CuttleSys, relative to no gating (all cores run in highest
configuration) with no cache partitioning, for each power cap.
QoS is satisfied for all Tailbench applications across all runs
for core-level gating, oracle-like asymmetric multicore, and
CuttleSys. Results include all overheads of Sec. 8.1.
For relaxed power caps (90%), all cores can be turned on for
the fixed-core multicores (core-level gating and asymmetric
multicores), while parts of the cores need to be turned off
with CuttleSys, given the energy overhead of reconfiguration.
Thus, CuttleSys performs worse than core-level gating and
asymmetric multicores.
As the power caps decrease, however, CuttleSys outper-
forms core-level gating both without and with way-partitioning
by 1.64× and 1.52× on average, and up to 2.65× and 2.46×
respectively (Fig. 5(c)). CuttleSys also outperforms the oracle-
like asymmetric multicore by 1.19× on average, and up to
1.55× for the most stringent power cap. As power caps de-
crease, core-level gating turns off additional cores, while the
oracle-like multicore executes more jobs on smaller cores. The
fine granularity of reconfigurable cores provides additional
power/performance operating points, which permit better fine-
tuning during power-constrained scenarios. These gains amor-
tize the energy and scheduling overheads of CuttleSys.
CuttleSys provides modest throughput gains over the oracle-
like asymmetric multicore for relaxed power caps, as more
batch jobs can execute on big cores in the asymmetric multi-
core. In real systems [4], the number of small and big cores
is fixed. CuttleSys outperforms a typical multicore with 50%
big and 50% small cores by 1.70×, 1.65× and 1.50× at 90%,
80% and 70% power caps respectively. The performance of
this 50-50 multicore is the same as that of the oracle-like asym-
metric system at 60% and 50% power cap, since all the batch
applications run on small cores.
8.4 Dynamic behavior of CuttleSys
We now show CuttleSys’s behavior under varying load and
power caps, and an example of core relocation.
8.4.1 Varying Load
We vary the input load of the latency-critical application
by simulating a diurnal pattern, while maintaining the power
budget at 70% of max. Fig. 8a shows the input load of the
latency-critical application, its tail latency with respect to
QoS, the throughput of batch applications, the total power
consumed by the system, and the core configurations for batch
applications for a colocation of Xapian with a mix of 16 SPEC
jobs. When load is low, cores running Xapian are configured
to {4,2,4}, as shown by the background color.
As load increases, the tail latency also increases and violates
QoS. Subsequently, CuttleSys configures the cores allocated
to Xapian to the {6,6,6} configuration in the next time slice,
after which QoS is met, and to {6,2,6} in the following time
slice. Four cache ways are allocated to Xapian throughout the
experiment. Under high load, Xapian consumes a significant
fraction of the power budget, leaving less power for the SPEC
applications. The cores running SPEC jobs therefore have to
run in lower-performing configurations, and as a result achieve
lower throughput. There is a brief interval in t∈[0.3,0.4]s
where the system violates its power budget. This is because
the input load of Xapian increases in the middle of CuttleSys’s
decision interval, and the system needs to wait until the next
interval before reconfiguring the cores. While this may briefly
consume more power than allowed, it avoids ping-ponging
between configurations due to short load spikes. When the
load decreases, CuttleSys again reconfigures Xapian’s cores to
{4,2,4}, and sets the remaining cores to higher configurations,
thus increasing the throughput of SPEC jobs.
Figure 8: CuttleSys under (a) varying input load, (b) varying
power budget, and (c) example of core relocation. The table
shows the colors corresponding to core configurations.
[Each panel plots, over Time (s): % load of the LC app; tail latency
(wrt QoS) of the LC app; throughput (GM) of the batch jobs; power (W),
budget vs. current; and the batch core configurations. Panel (c) also
plots Cores (LS). Color legend: 666 664 662 646 644 642 626 624 622
466 464 462 446 444 442 426 424 422 266 264 262 246 244 242 226 224 222.]
8.4.2 Varying Power Budget
We now vary the power cap over time when running Xapian
and a mix of SPEC applications, while maintaining a con-
stant 80% load for the latency-critical application. The power
budget is set to 90% and reduced to 60% at t=0.3s. In this
case (Fig. 8b), the cores running Xapian are configured to
{6,2,6} and four cache ways for the entire duration of the ex-
periment. When the power cap is reduced, Xapian still needs
the same amount of power to meet its QoS, leaving a lower
power budget for the SPEC workloads, which are configured
to lower-performing configurations, decreasing their through-
put. When the power cap is set back to 90% at t=0.7s, the
SPEC cores revert back to the higher configurations.
8.4.3 Core Relocation
Fig. 8c demonstrates an example co-scheduling Xapian
with a mix of SPEC applications, where CuttleSys relocates
cores to the latency-critical application to meet its QoS. As the
load increases after t=0.3s, Xapian suffers a QoS violation,
after which its allocated cores are reconfigured from {4,2,4}
to the widest-issue configuration {6,6,6}. However, that is not
sufficient to meet QoS in this case. Thus, CuttleSys reclaims
a core from the batch applications, and assigns it to Xapian,
at which point QoS is met. After the load drops back down to
20%, tail latency also drops. Since now the latency slack is
high enough (20% unless otherwise specified), the extra core
is yielded back to the batch applications. As a result of the
core relocation, the SPEC applications time-multiplex on the
reduced number of cores allocated to them, achieving lower
throughput, which is recovered when the core is returned.
Figure 9: (a) Comparison of DDS vs GA’s ability to explore
the design space. (b) Throughput with DDS and GA under
different power caps using SGD for inference.
[Panel (a): x-axis Power, y-axis 1/Throughput; legend: points in
space, points explored by GA, points explored by DDS, best point GA,
best point DDS. Panel (b): x-axis % Power Cap (0.9 to 0.5), y-axis
relative throughput; series: SGD-GA and SGD-DDS.]
8.5 Comparison with Flicker
Flicker [75] is the most relevant prior work to CuttleSys.
Flicker was proposed for multicore architectures running
multi-programmed mixes of exclusively batch applications.
It proposed 3MM3 sampling [102] with RBF surrogate fit-
ting [38, 68, 81, 82, 83] to characterize the impact of core
configurations, and a Genetic Algorithm (GA) for space explo-
ration. Flicker relies on detailed per-configuration profiling,
and is limited to core configurations, still allowing interference
through the memory hierarchy. 3MM3 requires sampling nine
core configurations, which are then used by RBF surrogate
fitting to get the complete performance and power profiles
across all core configurations. To get a meaningful sample for
tail latency, the system needs to run for at least 10ms.
We evaluated Flicker in two ways: a) we set the profiling
period to 10ms and profile the applications for a total of 90ms,
search for the best configuration that meets the QoS and power
budget and maximizes the throughput using GA (takes 2ms),
and run the system in that configuration for the remaining 8ms;
b) Flicker only manages batch applications, and we set the
cores assigned to latency-critical jobs to the highest – {6,6,6} –
configuration, which reduces the power budget available for
batch jobs. In this case, since we only predict throughput and
power, we can directly apply the 3MM3 and RBF techniques
over 1ms samples. Overall, we profile for 9ms, and run GA for
2ms. In both cases, we have to run the latency-critical service
in lower configurations for extended periods of time. Since
QoS is defined with respect to the 99th percentile latency, even
1ms of slow requests is enough to violate QoS. As a result, we
see extensive QoS violations by over an order of magnitude
for the first methodology, and by 1.5× for the second.
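The time budget behind the two methodologies can be tallied with a short calculation. The sketch below is illustrative only: it assumes the 100ms timeslice mentioned in this section applies to both methodologies, and the function name `timeslice_breakdown` is hypothetical rather than part of Flicker or CuttleSys.

```python
def timeslice_breakdown(sample_ms, n_samples, search_ms, timeslice_ms=100.0):
    """Split one timeslice into profiling, search, and steady-state time.

    All durations are in milliseconds. The 100ms default is an assumption
    taken from the timeslice length quoted in the text.
    """
    profile_ms = sample_ms * n_samples
    steady_ms = timeslice_ms - profile_ms - search_ms
    if steady_ms < 0:
        raise ValueError("profiling and search exceed the timeslice")
    return {
        "profile_ms": profile_ms,
        "search_ms": search_ms,
        "steady_ms": steady_ms,
        # Fraction of the timeslice spent before the chosen configuration runs.
        "overhead_fraction": (profile_ms + search_ms) / timeslice_ms,
    }


# Methodology (a): nine 10ms samples (3MM3), then 2ms of GA search.
a = timeslice_breakdown(sample_ms=10, n_samples=9, search_ms=2)

# Methodology (b): nine 1ms samples (3MM3), then 2ms of GA search.
b = timeslice_breakdown(sample_ms=1, n_samples=9, search_ms=2)
```

Under these assumptions, methodology (a) leaves 8ms of the timeslice for the selected configuration, which matches the figure quoted above, while methodology (b) spends 11ms on profiling and search.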
We now compare the individual techniques used in Flicker
and CuttleSys. Flicker requires 9 samples for characterization,
while SGD only uses 2 samples. For a fair comparison of the
characterization mechanisms, Fig. 10 shows the prediction
error of the RBF-based approach in performance and power
when using 3 samples from the full 100ms timeslice (the
algorithm was unable to converge when using two samples).
The error is dramatically higher for Flicker with 3 samples,
with outliers reaching up to 600%. Thus, with a comparable
amount of information (3 samples for RBF versus 2 for SGD),
the SGD-based reconstruction clearly outperforms the
RBF-based approach.
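To make the RBF-based characterization step concrete, the following is a minimal sketch of surrogate fitting over a 27-point core configuration space. It is not Flicker's implementation: the Gaussian kernel, the random choice of nine sampled configurations (Flicker selects them with a 3MM3 design), the encoding of each configuration as three levels in {0, 1, 2}, and the synthetic response `true_perf` are all assumptions made for illustration.

```python
import itertools

import numpy as np


def gaussian_kernel(a, b, eps=1.0):
    """Pairwise Gaussian RBF values between rows of a and rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-eps * d2)


def fit_rbf(x_sampled, y_sampled, eps=1.0):
    """Solve for RBF weights that interpolate the sampled points exactly."""
    K = gaussian_kernel(x_sampled, x_sampled, eps)
    return np.linalg.solve(K, y_sampled)


def predict_rbf(x_sampled, weights, x_query, eps=1.0):
    """Evaluate the fitted surrogate at arbitrary query configurations."""
    return gaussian_kernel(x_query, x_sampled, eps) @ weights


# All 3^3 = 27 core configurations (assumed encoding: three sections, three levels).
configs = np.array(list(itertools.product(range(3), repeat=3)), dtype=float)

# Synthetic "measured" performance; a stand-in for real profiling data.
true_perf = 1.0 + 0.3 * configs[:, 0] + 0.2 * configs[:, 1] + 0.1 * configs[:, 2]

# Profile nine configurations, then reconstruct the remaining ones.
rng = np.random.default_rng(0)
sampled_idx = rng.choice(len(configs), size=9, replace=False)
weights = fit_rbf(configs[sampled_idx], true_perf[sampled_idx])
pred = predict_rbf(configs[sampled_idx], weights, configs)

# Signed percentage error per configuration, in the spirit of Fig. 10.
rel_err_pct = 100.0 * (pred - true_perf) / true_perf
```

The surrogate reproduces the sampled points exactly and extrapolates to the unsampled ones; reducing `size` shows how the reconstruction degrades as fewer configurations are profiled.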
Next, we compare the exploration algorithms, GA and DDS.
Fig. 9a shows a subset of points in the entire space, as well as
the points explored by both DDS and GA. The black dots rep-
resent the points explored by GA, while the pink dots represent
the points explored by DDS. We can see that DDS explores
more points on the Pareto-optimal front and thus obtains a
higher-quality configuration with better throughput than
GA (shown by the blue and yellow stars, respectively) under a
given power budget, shown by the dotted green line.
[Figure 10 plot: % inaccuracy (y-axis) for Throughput RBF, Power RBF, Throughput SGD, and Power SGD.]
Figure 10: Predicted errors in performance (1) & power (2) with RBF.
To quantitatively compare
DDS with GA, we applied
GA during the optimization
phase instead of DDS, and
used SGD for reconstruc-
tion in our 32-core system.
Fig. 9b compares the geometric mean of throughput of
CuttleSys (SGD with DDS) against the variant that uses SGD
with GA across different power caps. Using DDS for
optimization offers a performance improvement of up to 19%
compared to GA for a 32-core system. This can be attributed to
the fact that GA is relatively slow in exploring a
high-dimensional search space compared to DDS. Also, the
optimization algorithm has to explore a larger number
of configurations, 27 × 4 = 108 (including LLC allocations),
compared to only 27 core configurations in [75]. We also note
that the performance improvement is higher under less stringent
power caps, as a large subset of configurations does not violate the
power budget, and DDS can quickly search through the large
space. As the power constraints become more stringent, fewer
configurations are valid, enabling GA to find the best
configurations in the given amount of time. The improvement is the
smallest for a 50% power cap, as at that point all cores often
have to operate in their lowest configurations, and may even
need to be switched off to meet the power budget.
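The exploration step can be sketched as follows. This is a simplified, integer-rounded variant of Dynamically Dimensioned Search in the style of [97], not the CuttleSys implementation. The per-core encoding (three core-section levels plus one LLC allocation level), the toy throughput and power models, the penalty weight, and the names `dds_maximize` and `toy_objective` are all assumptions introduced for illustration.

```python
import math
import random

# Assumed per-core space: 3 levels for each of 3 core sections, 4 LLC levels.
CONFIGS_PER_CORE = 3 ** 3 * 4  # 27 * 4 = 108


def dds_maximize(objective, lower, upper, max_evals=500, r=0.2, seed=0):
    """Greedy DDS over integer decision variables bounded by lower/upper."""
    if max_evals < 2:
        raise ValueError("max_evals must be at least 2")
    rng = random.Random(seed)
    dims = len(lower)
    best = [rng.randint(lo, hi) for lo, hi in zip(lower, upper)]
    best_val = objective(best)
    for i in range(1, max_evals + 1):
        # The chance of perturbing each dimension shrinks as the search ages,
        # moving from global exploration to local refinement.
        p = 1.0 - math.log(i) / math.log(max_evals)
        chosen = [d for d in range(dims) if rng.random() < p]
        if not chosen:
            chosen = [rng.randrange(dims)]
        cand = list(best)
        for d in chosen:
            span = upper[d] - lower[d]
            x = best[d] + r * span * rng.gauss(0.0, 1.0)
            # Reflect at the bounds, then clamp and round to an integer level.
            if x < lower[d]:
                x = lower[d] + (lower[d] - x)
            if x > upper[d]:
                x = upper[d] - (x - upper[d])
            cand[d] = int(round(min(max(x, lower[d]), upper[d])))
        val = objective(cand)
        if val >= best_val:
            best, best_val = cand, val
    return best, best_val


N_CORES = 4
LOWER = [0, 0, 0, 0] * N_CORES
UPPER = [2, 2, 2, 3] * N_CORES
POWER_BUDGET = 0.6 * sum(UPPER)  # toy power cap, in arbitrary units


def toy_objective(x):
    """Toy model: concave throughput, linear power, penalized cap violations."""
    power = sum(x)
    throughput = sum(math.sqrt(1 + v) for v in x)
    return throughput - 10.0 * max(0.0, power - POWER_BUDGET)


best, best_val = dds_maximize(toy_objective, LOWER, UPPER)
```

Folding the power budget into the objective as a penalty is one simple way to handle the constraint; a real runtime would evaluate candidates against the reconstructed performance and power profiles instead of a closed-form model.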
9. CONCLUSIONS
We present CuttleSys, an online and practical resource man-
agement system for reconfigurable multicores, which quickly
infers the performance and power consumption of each co-
scheduled application across all core configurations and cache
allocations, and arrives at a suitable configuration that meets
QoS for the latency-critical services, and maximizes through-
put for the batch workloads, under a power budget.
We evaluated CuttleSys across a set of diverse latency-critical
and batch workloads, and showed that the system meets both
the QoS and power budget at all times, while achieving
significantly higher throughput for the batch applications than
previous work, including core-level gating and Flicker. We
also quantified the inference errors of the reconstruction
algorithm in CuttleSys, and showed that they are low in all cases.
REFERENCES
[1] “Spec cpu 2006,” https://www.spec.org/cpu2006/.
[2] “2nd generation intel core processor family desktop,” January 2011.
[3] “Power management of the third generation intel core micro
architecture formerly codenamed ivy bridge,” Hot Chips: A
Symposium on High Performance Chips, 2012.
[4] “big.little technology: The future of mobile,” https://www.arm.com,
2013.
[5] “Intel® 64 and IA-32 architectures software developer’s manual,
system programming guide, part 2,” 2016.
[6] “6th generation intel processor families for s-platforms,” August 2018.
[7] “8th and 9th generation intel core processor families and intel xeon e
processor family,” October 2018.
[8] A. Adileh, S. Eyerman, A. Jaleel, and L. Eeckhout, “Mind the power
holes: Sifting operating points in power-limited heterogeneous
multicores,” IEEE Computer Architecture Letters, vol. 16, no. 1, pp.
56–59, Jan 2017.
[9] A. Adileh, S. Eyerman, A. Jaleel, and L. Eeckhout, “Maximizing
heterogeneous processor performance under power constraints,” ACM
Trans. Archit. Code Optim., vol. 13, no. 3, pp. 29:1–29:23, Sep. 2016.
[Online]. Available: http://doi.acm.org/10.1145/2976739
[10] F. Afram and K. Ghose, “Flexcore: A reconfigurable processor
supporting flexible, dynamic morphing,” in 2015 IEEE 22nd
International Conference on High Performance Computing (HiPC),
Dec 2015, pp. 30–39.
[11] M. Arora, S. Manne, I. Paul, N. Jayasena, and D. M. Tullsen,
“Understanding idle behavior and power gating mechanisms in the
context of modern benchmarks on cpu-gpu integrated systems,” in
2015 IEEE 21st International Symposium on High Performance
Computer Architecture (HPCA), Feb 2015, pp. 366–377.
[12] L. Barroso and U. Hoelzle, The Datacenter as a Computer: An
Introduction to the Design of Warehouse-Scale Machines. Synthesis
lectures on computer architecture, 2013.
[13] A. Bartolini, M. Cacciari, A. Tilli, and L. Benini, “Thermal and energy
management of high-performance multicores: Distributed and
self-calibrating model-predictive controller,” IEEE Transactions on
Parallel and Distributed Systems, vol. 24, no. 1, pp. 170–183, Jan
2013.
[14] M. Becchi and P. Crowley, “Dynamic thread assignment on
heterogeneous multiprocessor architectures,” in Proceedings of the
3rd Conference on Computing Frontiers, ser. CF ’06. New York, NY,
USA: ACM, 2006, pp. 29–40. [Online]. Available:
http://doi.acm.org/10.1145/1128022.1128029
[15] R. Bell, Y. Koren, and C. Volinsky, “The bellkor 2008 solution to the
netflix prize,” Tech. Rep., 2007.
[16] R. Bergamaschi, G. Han, A. Buyuktosunoglu, H. Patel, I. Nair,
G. Dittmann, G. Janssen, N. Dhanwada, Z. Hu, P. Bose, and
J. Darringer, “Exploring power management in multi-core systems,” in
Proceedings of the 2008 Asia and South Pacific Design Automation
Conference, ser. ASP-DAC ’08. Los Alamitos, CA, USA: IEEE
Computer Society Press, 2008, pp. 708–713. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1356802.1356973
[17] L. Bottou, “Large-scale machine learning with stochastic gradient
descent,” in Proceedings of the International Conference on
Computational Statistics (COMPSTAT). Paris, France, 2010.
[18] L. Bottou, Large-Scale Machine Learning with Stochastic Gradient
Descent. Heidelberg: Physica-Verlag HD, 2010, pp. 177–186.
[19] R. Burke, “Hybrid recommender systems: Survey and experiments,”
User Modeling and User-Adapted Interaction, vol. 12, no. 4, pp.
331–370, Nov. 2002.
[20] J. Chen and L. K. John, “Efficient program scheduling for
heterogeneous multi-core processors,” in 2009 46th ACM/IEEE
Design Automation Conference, July 2009, pp. 927–930.
[21] J. Chen, A. A. Nair, and L. K. John, “Predictive heterogeneity-aware
application scheduling for chip multiprocessors,” IEEE Transactions
on Computers, vol. 63, no. 2, pp. 435–447, 2014.
[22] J. Chen and L. John, “Predictive coordination of multiple on-chip
resources for chip multiprocessors,” 01 2011, pp. 192–201.
[23] Q. Chen, H. Yang, M. Guo, R. S. Kannan, J. Mars, and L. Tang,
“Prophet: Precise qos prediction on non-preemptive accelerators to
improve utilization in warehouse-scale computers,” in Proceedings of
the Twenty-Second International Conference on Architectural Support
for Programming Languages and Operating Systems, ser. ASPLOS
’17. Xi’an, China: ACM, 2017, pp. 17–32.
[24] S. Chen, C. Delimitrou, and J. F. Martínez, “PARTIES: QoS-aware
resource partitioning for multiple interactive services,” in
International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), 2019.
[25] R. B. R. Chowdhury, A. K. Kannepalli, S. Ku, and E. Rotenberg,
“Anycore: A synthesizable RTL model for exploring and fabricating
adaptive superscalar cores,” in 2016 IEEE International Symposium
on Performance Analysis of Systems and Software, ISPASS 2016,
Uppsala, Sweden, April 17-19, 2016. IEEE Computer Society, 2016,
pp. 214–224. [Online]. Available:
https://doi.org/10.1109/ISPASS.2016.7482096
[26] H. Cook, M. Moreto, S. Bird, K. Dao, D. A. Patterson, and
K. Asanovic, “A hardware evaluation of cache partitioning to improve
utilization and energy-efficiency while preserving responsiveness,” in
Proceedings of the 40th Annual International Symposium on
Computer Architecture, ser. ISCA ’13. New York, NY, USA: ACM,
2013, pp. 308–319.
[27] K. V. Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer,
“Scheduling heterogeneous multi-cores through performance impact
estimation (pie),” in 2012 39th Annual International Symposium on
Computer Architecture (ISCA), June 2012, pp. 213–224.
[28] C. Delimitrou and C. Kozyrakis, “iBench: Quantifying Interference
for Datacenter Workloads,” in Proceedings of the 2013 IEEE
International Symposium on Workload Characterization (IISWC).
Portland, OR, September 2013.
[29] C. Delimitrou and C. Kozyrakis, “Paragon: QoS-aware scheduling for
heterogeneous datacenters,” in Proceedings of the Eighteenth
International Conference on Architectural Support for Programming
Languages and Operating Systems, 2013.
[30] C. Delimitrou and C. Kozyrakis, “QoS-aware scheduling in
heterogeneous datacenters with paragon,” in ACM Transactions on
Computer Systems, Vol. 31 Issue 4, 2013.
[31] C. Delimitrou and C. Kozyrakis, “Quality-of-Service-aware
scheduling in heterogeneous datacenters with paragon,” in IEEE
Micro Special Issue on Top Picks from the Computer Architecture
Conferences, 2014.
[32] C. Delimitrou and C. Kozyrakis, “Quasar: Resource-efficient and
qos-aware cluster management,” in Proceedings of the Nineteenth
International Conference on Architectural Support for Programming
Languages and Operating Systems, 2014.
[33] C. Delimitrou and C. Kozyrakis, “HCloud: Resource-efficient
provisioning in shared cloud systems,” in Proceedings of the Twenty
First International Conference on Architectural Support for
Programming Languages and Operating Systems, 2016.
[34] C. Delimitrou and C. Kozyrakis, “Bolt: I know what you did last
summer... in the cloud,” in Proceedings of the Twenty-Second
International Conference on Architectural Support for Programming
Languages and Operating Systems. ACM, 2017, pp. 599–613.
[35] C. Delimitrou, D. Sanchez, and C. Kozyrakis, “Tarcil: Reconciling
Scheduling Speed and Quality in Large Shared Clusters,” in
Proceedings of the Sixth ACM Symposium on Cloud Computing, 2015.
[36] H. R. Ghasemi and N. S. Kim, “Rcs: Runtime resource and core
scaling for power-constrained multi-core processors,” in Proceedings
of the 23rd International Conference on Parallel Architectures and
Compilation, ser. PACT ’14. New York, NY, USA: ACM, 2014, pp.
251–262. [Online]. Available:
http://doi.acm.org/10.1145/2628071.2628095
[37] A. Gunawardana and C. Meek, “A unified approach to building hybrid
recommender systems,” in Proc. of the Third ACM Conference on
Recommender Systems (RecSys). New York, NY, 2009.
[38] H.-M. Gutmann, “A radial basis function method for global
optimization,” Journal of Global Optimization, vol. 19, no. 3, pp.
201–227, 2001. [Online]. Available:
http://dx.doi.org/10.1023/A%3A1011255519438
[39] M. E. Haque, Y. He, S. Elnikety, T. D. Nguyen, R. Bianchini, and
K. McKinley, “Exploiting heterogeneity for tail latency and energy
efficiency,” in Proceedings of the International Symposium on
Microarchitecture (MICRO), October 2017. [Online]. Available:
https://www.microsoft.com/en-us/research/publication/exploiting-
heterogeneity-tail-latency-energy-efficiency/
[40] C. Hsu, Y. Zhang, M. A. Laurenzano, D. Meisner, T. Wenisch, J. Mars,
L. Tang, and R. G. Dreslinski, “Adrenaline: Pinpointing and reining in
tail queries with quick voltage boosting,” in 2015 IEEE 21st
International Symposium on High Performance Computer
Architecture (HPCA), Feb 2015, pp. 271–282.
[41] C.-H. Hsu, Q. Deng, J. Mars, and L. Tang, “Smoothoperator:
Reducing power fragmentation and improving power utilization in
large-scale datacenters,” in Proceedings of the Twenty-Third
International Conference on Architectural Support for Programming
Languages and Operating Systems, ser. ASPLOS ’18. New York,
NY, USA: ACM, 2018, pp. 535–548.
[42] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez, “Core fusion:
Accommodating software diversity in chip multiprocessors,” in
Proceedings of the 34th Annual International Symposium on
Computer Architecture, ser. ISCA ’07. New York, NY, USA: ACM,
2007, pp. 186–197. [Online]. Available:
http://doi.acm.org/10.1145/1250662.1250686
[43] C. Isci, A. Buyuktosunoglu, C.-Y. Cher, P. Bose, and M. Martonosi,
“An analysis of efficient multi-core global power management policies:
Maximizing performance for a given power budget,” in Proceedings of
the 39th Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO 39. Washington, DC, USA: IEEE
Computer Society, 2006, pp. 347–358. [Online]. Available:
http://dx.doi.org/10.1109/MICRO.2006.8
[44] S. S. Jha, W. Heirman, A. Falcón, T. E. Carlson, K. Van Craeynest,
J. Tubella, A. González, and L. Eeckhout, “Chrysso: An integrated
power manager for constrained many-core processors,” in Proceedings
of the 12th ACM International Conference on Computing Frontiers,
ser. CF ’15. New York, NY, USA: ACM, 2015, pp. 19:1–19:8.
[Online]. Available: http://doi.acm.org/10.1145/2742854.2742885
[45] S. Kanev, K. Hazelwood, G. Wei, and D. Brooks, “Tradeoffs between
power management and tail latency in warehouse-scale applications,”
in 2014 IEEE International Symposium on Workload Characterization
(IISWC), Oct 2014, pp. 31–40.
[46] H. Kasture, D. B. Bartolini, N. Beckmann, and D. Sanchez, “Rubik:
Fast analytical power management for latency-critical systems,” in
Proceedings of the 48th International Symposium on
Microarchitecture, 2015.
[47] H. Kasture and D. Sanchez, “Ubik: Efficient cache sharing with strict
QoS for latency-critical workloads,” in Proceedings of the 19th
International Conference on Architectural Support for Programming
Languages and Operating Systems, 2014.
[48] H. Kasture and D. Sanchez, “Tailbench: a benchmark suite and
evaluation methodology for latency-critical applications,” in IEEE
International Symposium on Workload Characterization, 2016.
[49] Khubaib, M. A. Suleman, M. Hashemi, C. Wilkerson, and Y. N. Patt,
“Morphcore: An energy-efficient microarchitecture for high
performance ilp and high throughput tlp,” in Proceedings of the 2012
45th Annual IEEE/ACM International Symposium on
Microarchitecture, ser. MICRO-45. Washington, DC, USA: IEEE
Computer Society, 2012, pp. 305–316. [Online]. Available:
http://dx.doi.org/10.1109/MICRO.2012.36
[50] K. C. Kiwiel, “Convergence and efficiency of subgradient methods for
quasiconvex minimization,” in Mathematical Programming (Series A)
(Berlin, Heidelberg: Springer) 90 (1): pp. 1-25, 2001.
[51] K. C. Kiwiel, “Convergence and efficiency of subgradient methods for
quasiconvex minimization,” Mathematical Programming, vol. 90,
no. 1, pp. 1–25, Mar 2001.
[52] D. Koufaty, D. Reddy, and S. Hahn, “Bias scheduling in
heterogeneous multi-core architectures,” in Proceedings of the 5th
European Conference on Computer Systems, ser. EuroSys ’10. New
York, NY, USA: ACM, 2010, pp. 125–138. [Online]. Available:
http://doi.acm.org/10.1145/1755913.1755928
[53] N. Kulkarni, F. Qi, and C. Delimitrou, “Pliant: Leveraging
approximation to improve datacenter resource efficiency,” in 2019 IEEE
International Symposium on High Performance Computer
Architecture (HPCA), 2019, pp. 159–171.
[54] R. Kumar and G. Hinton, “A family of 45nm ia processors,” in
Solid-State Circuits Conference - Digest of Technical Papers, 2009.
ISSCC 2009. IEEE International, Feb 2009, pp. 58–59.
[55] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I.
Farkas, “Single-isa heterogeneous multi-core architectures for
multithreaded workload performance,” in Proceedings. 31st Annual
International Symposium on Computer Architecture, 2004., June 2004,
pp. 64–75.
[56] J. Leverich, M. Monchiero, V. Talwar, P. Ranganathan, and
C. Kozyrakis, “Power management of datacenter workloads using
per-core power gating,” IEEE Comput. Archit. Lett., vol. 8, no. 2, pp.
48–51, Jul. 2009. [Online]. Available:
http://dx.doi.org/10.1109/L-CA.2009.46
[57] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P.
Jouppi, “Mcpat: An integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in 2009 42nd
Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), Dec 2009, pp. 469–480.
[58] G. Liu, J. Park, and D. Marculescu, “Dynamic thread mapping for
high-performance, power-efficient heterogeneous many-core systems.”
in ICCD. IEEE Computer Society, 2013, pp. 54–61. [Online].
Available:
http://dblp.uni-trier.de/db/conf/iccd/iccd2013.html#LiuPM13
[59] D. Lo, L. Cheng, R. Govindaraju, L. A. Barroso, and C. Kozyrakis,
“Towards energy proportionality for large-scale latency-critical
workloads,” in Proceedings of the 41st Annual International
Symposium on Computer Architecture, 2014.
[60] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis,
“Heracles: Improving resource efficiency at scale,” in Proceedings of
the 42nd Annual International Symposium on Computer Architecture,
2015.
[61] A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski,
T. F. Wenisch, and S. Mahlke, “Composite cores: Pushing
heterogeneity into a core,” in Proceedings of the 2012 45th Annual
IEEE/ACM International Symposium on Microarchitecture, ser.
MICRO-45. Washington, DC, USA: IEEE Computer Society, 2012,
pp. 317–328. [Online]. Available:
http://dx.doi.org/10.1109/MICRO.2012.37
[62] K. Ma, X. Li, M. Chen, and X. Wang, “Scalable power control for
many-core architectures running multi-threaded applications,” in 2011
38th Annual International Symposium on Computer Architecture
(ISCA), June 2011, pp. 449–460.
[63] K. Ma and X. Wang, “Pgcapping: Exploiting power gating for power
capping and core lifetime balancing in cmps,” in 2012 21st
International Conference on Parallel Architectures and Compilation
Techniques (PACT), Sept 2012, pp. 13–22.
[64] J. Mars and L. Tang, “Whare-map: heterogeneity in "homogeneous"
warehouse-scale computers,” in Proceedings of the 40th Annual
International Symposium on Computer Architecture, 2013.
[65] D. Meisner, B. T. Gold, and T. F. Wenisch, “Powernap: eliminating
server idle power,” in Proceedings of the 14th international ASPLOS,
ser. ASPLOS ’09, 2009. [Online]. Available:
http://doi.acm.org/10.1145/1508244.1508269
[66] D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F.
Wenisch, “Power management of online data-intensive services,” in
Proceedings of the 38th annual international symposium on Computer
architecture, 2011, pp. 319–330.
[67] D. Meisner and T. F. Wenisch, “Dreamweaver: Architectural support
for deep sleep,” in Proceedings of the Seventeenth International
Conference on Architectural Support for Programming Languages
and Operating Systems, ser. ASPLOS XVII. New York, NY, USA:
ACM, 2012, pp. 313–324. [Online]. Available:
http://doi.acm.org/10.1145/2150976.2151009
[68] J. Mueller, C. Shoemaker, and R. Piche, “SO-MI: A Surrogate Model
Algorithm for Computationally Expensive Nonlinear Mixed-integer
Black-box Global Optimization Problems,” Computers and
Operations Research, May 2013.
[69] S. Navada, N. Choudhary, S. Wadhavkar, and E. Rotenberg, “A unified
view of non-monotonic core selection and application steering in
heterogeneous chip multiprocessors,” 01 2013, pp. 133–144.
[70] H. Nguyen, Z. Shen, X. Gu, S. Subbiah, and J. Wilkes, “AGILE:
Elastic distributed resource scaling for infrastructure-as-a-service,” in
Proceedings of the 10th International Conference on Autonomic
Computing (ICAC 13). San Jose, CA: USENIX, 2013, pp. 69–82.
[Online]. Available:
https://www.usenix.org/conference/icac13/technical-
sessions/presentation/nguyen
[71] R. Nishtala, V. Petrucci, P. Carpenter, and M. Själander, “Twig:
Multi-agent task management for colocated latency-critical cloud
services,” 12 2019.
[72] F. Niu, B. Recht, C. Re, and S. J. Wright, “Hogwild!: A lock-free
approach to parallelizing stochastic gradient descent,” in Proceedings
of the 24th International Conference on Neural Information
Processing Systems, ser. NIPS’11. USA: Curran Associates Inc.,
2011, pp. 693–701. [Online]. Available:
http://dl.acm.org/citation.cfm?id=2986459.2986537
[73] S. Padmanabha, A. Lukefahr, R. Das, and S. Mahlke, “Trace based
phase prediction for tightly-coupled heterogeneous cores,” in
Proceedings of the 46th Annual IEEE/ACM International Symposium
on Microarchitecture, ser. MICRO-46. New York, NY, USA: ACM,
2013, pp. 445–456. [Online]. Available:
http://doi.acm.org/10.1145/2540708.2540746
[74] G. Papadimitriou, A. Chatzidimitriou, and D. Gizopoulos, “Adaptive
voltage/frequency scaling and core allocation for balanced energy and
performance on multicore cpus,” in 2019 IEEE International
Symposium on High Performance Computer Architecture (HPCA),
2019, pp. 133–146.
[75] P. Petrica, A. M. Izraelevitz, D. H. Albonesi, and C. A. Shoemaker,
“Flicker: A dynamically adaptive architecture for power limited
multicore systems,” in Proceedings of the 40th Annual International
Symposium on Computer Architecture, ser. ISCA ’13. New York,
NY, USA: ACM, 2013, pp. 13–23. [Online]. Available:
http://doi.acm.org/10.1145/2485922.2485924
[76] V. Petrucci, M. A. Laurenzano, J. Doherty, Y. Zhang, D. Mossé,
J. Mars, and L. Tang, “Octopus-man: Qos-driven task management for
heterogeneous multicores in warehouse-scale computers,” in 2015
IEEE 21st International Symposium on High Performance Computer
Architecture (HPCA), 2015, pp. 246–258.
[77] R. P. Pothukuchi, A. Ansari, P. Voulgaris, and J. Torrellas, “Using
multiple input, multiple output formal control to maximize resource
efficiency in architectures,” in 2016 ACM/IEEE 43rd Annual
International Symposium on Computer Architecture (ISCA), June
2016, pp. 658–670.
[78] M. K. Qureshi and Y. N. Patt, “Utility-based cache partitioning: A
low-overhead, high-performance, runtime mechanism to partition
shared caches,” in Proceedings of the 39th Annual IEEE/ACM
International Symposium on Microarchitecture, ser. MICRO 39, 2006.
[Online]. Available: http://dx.doi.org/10.1109/MICRO.2006.49
[79] A. M. Rahmani, B. Donyanavard, T. Mück, K. Moazzemi, A. Jantsch,
O. Mutlu, and N. Dutt, “Spectr: Formal supervisory control and
coordination for many-core systems resource management,” in
Proceedings of the Twenty-Third International Conference on
Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’18. New York, NY, USA: ACM, 2018, pp.
169–183. [Online]. Available:
http://doi.acm.org/10.1145/3173162.3173199
[80] A. Rajaraman and J. Ullman, “Textbook on mining of massive
datasets. rightscale.” 2011,
https://aws.amazon.com/solution-providers/isv/rightscale.
[81] R. G. Regis and C. A. Shoemaker, “Local Function Approximation in
Evolutionary Algorithms for the Optimization of Costly Functions,”
IEEE Transactions on Evolutionary Computation, October 2004.
[82] R. G. Regis and C. A. Shoemaker, “A Stochastic Radial Basis
Function Method for the Global Optimization of Expensive Functions,”
INFORMS Journal on Computing, Fall 2007.
[83] R. G. Regis and C. A. Shoemaker, “Combining Radial Basis Function
Surrogates and Dynamic Coordinate Search in High-dimensional
Expensive Black-box Optimization,” Engineering Optimization, May
2013.
[84] C. Reiss, A. Tumanov, G. Ganger, R. Katz, and M. Kozuch,
“Heterogeneity and dynamicity of clouds at scale: Google trace
analysis,” in Proceedings of the Third ACM Symposium on Cloud
Computing, 2012.
[85] S. Ren, Y. He, S. Elnikety, and K. S. McKinley, “Exploiting processor
heterogeneity in interactive services,” in ICAC, January 2013.
[Online]. Available:
https://www.microsoft.com/en-us/research/publication/exploiting-
processor-heterogeneity-in-interactive-services/
[86] S. Ren, Y. He, and K. S. McKinley, “A theoretical foundation for
scheduling and designing heterogeneous processors for interactive
applications,” in International Symposium on Distributed Computing
(DISC). European Association for Theoretical Computer Science,
October 2014. [Online]. Available:
https://www.microsoft.com/en-us/research/publication/a-
theoretical-foundation-for-scheduling-and-designing-
heterogeneous-processors-for-interactive-applications/
[87] F. Romero and C. Delimitrou, “Mage: Online and Interference-Aware
Scheduling for Multi-Scale Heterogeneous Systems,” in Proceedings
of the 27th International Conference on Parallel Architectures and
Compilation Techniques (PACT18), November 2018.
[88] C. D. Sa, C. Zhang, K. Olukotun, and C. Ré, “Taming the wild: A
unified analysis of hog wild! -style algorithms,” in Proceedings of the
28th International Conference on Neural Information Processing
Systems - Volume 2, ser. NIPS’15. Cambridge, MA, USA: MIT
Press, 2015, pp. 2674–2682.
[89] J. C. Saez, A. Pousa, F. Castro, D. Chaver, and M. Prieto-Matias,
“Acfs: A completely fair scheduler for asymmetric single-isa multicore
systems,” in Proceedings of the 30th Annual ACM Symposium on
Applied Computing, ser. SAC ’15. New York, NY, USA: ACM, 2015,
pp. 2027–2032. [Online]. Available:
http://doi.acm.org/10.1145/2695664.2695714
[90] D. Sanchez and C. Kozyrakis, “Vantage: Scalable and Efficient
Fine-Grain Cache Partitioning,” in Proceedings of the 38th annual
International Symposium in Computer Architecture, 2011.
[91] D. Sanchez and C. Kozyrakis, “Zsim: Fast and accurate
microarchitectural simulation of thousand-core systems,” in
Proceedings of the 40th Annual International Symposium on
Computer Architecture, 2013.
[92] J. Sharkey, A. Buyuktosunoglu, and P. Bose, “Evaluating design
tradeoffs in on-chip power management for cmps,” in Proceedings of
the 2007 International Symposium on Low Power Electronics and
Design, ser. ISLPED ’07. New York, NY, USA: ACM, 2007, pp.
44–49. [Online]. Available:
http://doi.acm.org/10.1145/1283780.1283791
[93] D. Shelepov, J. C. Saez Alcaide, S. Jeffery, A. Fedorova, N. Perez, Z. F.
Huang, S. Blagodurov, and V. Kumar, “Hass: A scheduler for
heterogeneous multicore systems,” SIGOPS Oper. Syst. Rev., vol. 43,
no. 2, pp. 66–75, Apr. 2009. [Online]. Available:
http://doi.acm.org/10.1145/1531793.1531804
[94] E. H. Sibley, P. J. Fleming, and J. J. Wallace, “How not to lie with
statistics: The correct way to summarize benchmark results.”
[95] S. J. Tarsa, R. B. R. Chowdhury, J. Sebot, G. Chinya, J. Gaur,
K. Sankaranarayanan, C.-K. Lin, R. Chappell, R. Singhal, and
H. Wang, “Post-silicon cpu adaptation made practical using machine
learning,” in Proceedings of the 46th International Symposium on
Computer Architecture, ser. ISCA ’19. New York, NY, USA:
Association for Computing Machinery, 2019, pp. 14–26. [Online].
Available: https://doi.org/10.1145/3307650.3322267
[96] R. Teodorescu and J. Torrellas, “Variation-aware application
scheduling and power management for chip multiprocessors,” in
Proceedings of the 35th Annual International Symposium on
Computer Architecture, ser. ISCA ’08. Washington, DC, USA: IEEE
Computer Society, 2008, pp. 363–374. [Online]. Available:
http://dx.doi.org/10.1109/ISCA.2008.40
[97] B. A. Tolson and C. A. Shoemaker, “Dynamically dimensioned search
algorithm for computationally efficient watershed model calibration,”
Water Resources Research, vol. 43, no. 1, pp. n/a–n/a, 2007, w01413.
[Online]. Available: http://dx.doi.org/10.1029/2005WR004723
[98] K. Van Craeynest, S. Akram, W. Heirman, A. Jaleel, and L. Eeckhout,
“Fairness-aware scheduling on single-isa heterogeneous multi-cores,”
in Proceedings of the 22nd International Conference on Parallel
Architectures and Compilation Techniques, ser. PACT ’13.
Piscataway, NJ, USA: IEEE Press, 2013, pp. 177–188. [Online].
Available: http://dl.acm.org/citation.cfm?id=2523721.2523748
[99] Y. Wang, K. Ma, and X. Wang, “Temperature-constrained power
control for chip multiprocessors with online model estimation,” in
Proceedings of the 36th Annual International Symposium on
Computer Architecture, ser. ISCA ’09. New York, NY, USA: ACM,
2009, pp. 314–324. [Online]. Available:
http://doi.acm.org/10.1145/1555754.1555794
[100] J. A. Winter, D. H. Albonesi, and C. A. Shoemaker, “Scalable thread
scheduling and global power management for heterogeneous
many-core architectures,” in Proceedings of the 19th International
Conference on Parallel Architectures and Compilation Techniques, ser.
PACT ’10. New York, NY, USA: ACM, 2010, pp. 29–40. [Online].
Available: http://doi.acm.org/10.1145/1854273.1854283
[101] I. H. Witten, E. Frank, and G. Holmes, Data Mining: Practical
Machine Learning Tools and Techniques. 3rd Edition.
[102] C. F. J. Wu and M. S. Hamada, Experiments: Planning, Analysis, and
Optimization. John Wiley and Sons, Inc., 2009.
[103] H. Yang, A. Breslow, J. Mars, and L. Tang, “Bubble-flux: precise
online qos management for increased utilization in warehouse scale
computers,” in Proceedings of the 40th International Symposium on
Computer Architecture, 2013.
[104] H. Zhang and H. Hoffmann, “Maximizing performance under a power
cap: A comparison of hardware, software, and hybrid techniques,” in
Proceedings of the Twenty-First International Conference on
Architectural Support for Programming Languages and Operating
Systems, ser. ASPLOS ’16. New York, NY, USA: ACM, 2016, pp.
545–559. [Online]. Available:
http://doi.acm.org/10.1145/2872362.2872375
[105] W. Zhang, H. Zhang, and J. Lach, “Dynamic core scaling: Trading off
performance and energy beyond dvfs,” in 2015 33rd IEEE
International Conference on Computer Design (ICCD), Oct 2015, pp.
319–326.
[106] Y. Zhou, H. Hoffmann, and D. Wentzlaff, “Cash: Supporting iaas
customers with a sub-core configurable architecture,” in Proceedings
of the 43rd International Symposium on Computer Architecture, ser.
ISCA ’16. IEEE Press, 2016, pp. 682–694. [Online]. Available:
https://doi.org/10.1109/ISCA.2016.65
[107] Y. Zhou and D. Wentzlaff, “The sharing architecture: Sub-core
configurability for iaas clouds,” in Proceedings of the 19th
International Conference on Architectural Support for Programming
Languages and Operating Systems, ser. ASPLOS ’14. New York,
NY, USA: Association for Computing Machinery, 2014, pp.
559–574. [Online]. Available:
https://doi.org/10.1145/2541940.2541950
