Adaptive Performance Optimization under Power Constraint in Multi-thread
  Applications with Diverse Scalability by Conoci, Stefano et al.
Adaptive Performance Optimization under Power Constraint in
Multi-thread Applications with Diverse Scalability
Stefano Conoci, Pierangelo Di Sanzo, Bruno Ciciani
DIAG - Sapienza University of Rome
Email: {conoci.1483662@studenti.uniroma1.it,
disanzo@dis.uniroma1.it, ciciani@dis.uniroma1.it}
Francesco Quaglia
DICII - University of Rome Tor Vergata
Email: francesco.quaglia@uniroma2.it
Abstract—In modern data centers, energy usage represents
one of the major factors affecting operational costs. Power
capping is a technique that limits the power consumption of
individual systems, which allows reducing the overall power
demand at both cluster and data center levels. However,
power capping approaches in the literature do not fit the nature
of important applications based on first-class multi-thread
technology well. For these applications, performance may not grow
linearly as a function of the thread-level parallelism because
of the need for thread synchronization while accessing shared
resources—such as shared data. In this paper we consider the
problem of maximizing the application performance under a
power cap by dynamically tuning the thread-level parallelism
and the power state of the CPU-cores. Based on experimental
observations, we design an adaptive technique that selects in
linear time the optimal combination of thread-level parallelism
and CPU-core power state for the specific workload profile of
the multi-threaded application. We evaluate our proposal by
relying on different benchmarks, configured to use different
thread synchronization methods, and compare its effectiveness
to different state-of-the-art techniques.
I. INTRODUCTION
Multi-core architectures are nowadays dominating the
market. Also, thanks to the support they offer for sharing
memory among CPU-cores, they have become the main-
stream reference hardware for applications based on the first-
class multi-thread technology. On the downside, powering
many-core machines implies high energy delivery to each
single multi-core server. Therefore, in recent years,
energy and power consumption have emerged as a core concern,
especially in (large) data centers.
Such concern led manufacturers to introduce hardware
mechanisms oriented to improve energy efficiency in op-
erational contexts. These include Dynamic Voltage and Fre-
quency Scaling (DVFS), which allows lowering the voltage
and the frequency (hence the power consumption) of a
processor/core in a controlled manner, and Clock Gating,
which disables some processor/core circuitry during idle periods.
At the same time, today's Operating Systems offer power
management tools (like Linux CPUFreq Governor [1])
which expose interfaces to user code to dynamically
change the power state of cores via DVFS, thus allowing
the performance of cores and their power demand to be tuned
according to the needs of specific applications/workloads.
In this context, one interesting challenge is the one of
controlling the power demand of an application in order
to keep it below a given threshold, also known as the
power cap. However, an even more interesting challenge
is the one of ensuring that an application runs at maximum
performance under a given power cap. Such an achievement,
in addition to performance benefits, would also improve the
application efficiency in terms of energy per task.
Various power capping techniques for multi-core servers
have been proposed. As for the specific case of multi-
thread applications, the problem of regulating the number
of threads and the core frequency to control the balance
between performance and power consumption has been orig-
inally considered in [2], and subsequently in [3]. The main
drawback of these approaches is that the tuning strategies
they rely on do not account for complex and dynamic effects
on performance that may be caused by thread contention
on hardware resources and/or shared data. In more detail,
when multiple threads concurrently run on different CPU-
cores, the presence of shared hardware resources—such as
memory interconnections and cache levels—leads them to
contend for their utilization. This impacts both performance
and the power consumption profile of the application. Also,
in common multi-thread applications that are not disjoint-
access parallel, threads share data whose accesses may
require synchronization. This still affects performance and
the power consumption profile, also depending on the spe-
cific synchronization mechanisms (either speculative or not)
that are employed by the application code. An additional
factor of complexity in the presence of synchronization
is that the speed-up achieved by running the application
with different numbers of threads may change depending on
the workload profile, which in turn can itself be
dynamic. Also, the speed-up can be non-linear as a function of
the number of threads, depending on the workload profile,
as well as the underlying hardware settings. Specifically,
performance can even decrease when increasing the level
of parallelism. This indicates that synchronization costs,
including the energy spent while performing synchronization
operations, can show complex profiles to deal with.
Overall, to select the right combination of thread parallelism
and core power state, i.e. the one that ensures the best performance
under a power cap, it appears mandatory to take into
account the (possibly) limited scalability of a multi-thread
application, as it manifests at run-time due to actual
synchronization dynamics. Further, it is mandatory to react
to variations of the workload profile.
arXiv:1707.09642v2 [cs.PF] 3 Sep 2017
To cope with this problem, we present an adaptive tech-
nique that uses a novel on-line exploration-based tuning
strategy. We devised our technique exploiting empirical
observations of the effects on both performance and power
consumption associated with the combined variation of
thread-level parallelism and core power state. Specifically,
through the results of experiments we conducted with different
multi-thread benchmarks characterized by a non-negligible
incidence of synchronization (e.g. because of thread contention
while accessing shared data), we highlight that their
scalability is not affected by the variation of the power state
of the CPU-cores. Based on this, we defined an optimized
tuning strategy where the exploration moves along specific
directions that depend on the power cap value and on the
intrinsic scalability of the application. Remarkably, we prove
that the proposed technique finds in linear time the optimal
configuration of concurrent threads and CPU frequency/voltage,
i.e. the configuration that provides the highest performance
among the configurations with power consumption lower
than the power cap. Also, we present a refinement of
our technique that exploits continuous fluctuations between
configurations (in terms of thread-level parallelism and core
power state) to further improve the application performance
and reduce the possibility/incidence of power cap violations.
We demonstrate the advantages of our proposal via an
experimental study based on various application contexts,
including various benchmarks that use different thread syn-
chronization methods. This allows us to robustly assess our
technique via disparate test cases where contention among
threads affects the application scalability in significantly
different ways.
The remainder of this article is structured as follows. In
Section II we discuss related works. Section III defines our
target problem and presents the results of the preliminary
analysis. Section IV illustrates the proposed optimization
technique, proves that the selected configuration is optimal
and analyzes the time complexity of the exploration proce-
dure. Section V describes the most relevant implementation
details and presents the experimental results.
II. RELATED WORK
A work specifically focused on optimizing the energy
demand at application level is presented in [2]. The proposed
technique, called Pack and Cap, aims at selecting the best
configuration, in terms of number of cores to be assigned
to an application and the related core frequency, which
ensures a given power cap for multi-thread workloads.
Based on experimental measurements of the performance
and power consumption obtained running benchmarks from
the Parsec suite, the authors conclude that the configuration
that provides the highest performance at a given level of
power consumption always assigns to the application the
highest possible number of cores. However, as shown extensively
in the remainder of this article, this selection strategy
is not optimal for general multi-threaded applications with
less than linear scalability. The work in [3] considers the
problem of maximizing performance under a power cap
while also taking into account the effects of contention.
The solution defines an ordered set of power knobs that are
progressively tuned by performing a binary search on the
respective domain, selecting the setting that provides the
highest performance for the considered power knob while
operating within the power cap. In particular, the solution
first selects the optimal number of cores that should be
assigned to an application while running at the slowest avail-
able frequency/voltage, then selects the optimal CPU P-State
setting for the previously selected number of assigned cores.
Therefore, by tuning the power knobs independently, it does
not consider the changing energy/performance trade-offs at
different levels of parallelism for the specific workload. As
an example, if an application shows a limited speed-up when
increasing the number of cores, the solution would still pick
the highest value that provides a power consumption within
the cap, even if the same power budget could provide higher
performance if spent to further increase the frequency of a
lower number of cores.
Other works in the literature investigate the problem of
improving application performance under power constraint
considering different power management variables. FastCap
[4] defines an approach for optimizing performance under a
system-wide power cap considering both CPU and memory
DVFS. It defines a non-linear optimization problem solved
through a queuing model that considers the interaction
between CPU-cores and memory banks communicating over
a shared bus. Unfortunately, memory DVFS has only been
proposed recently [5] [6] and is not yet available in com-
mercial systems. Kanduri et al. propose approximation as
another knob that can be used in power capping, combined
with DVFS and Clock Gating, to define a trade-off between
performance and accuracy of the results [7]. However, in
order to dynamically switch between different levels of
accuracy, it requires multiple implementations of the same
application. PPEP [8] is an online prediction framework that,
based on hardware performance events and on-chip tem-
perature measurements, estimates the performance, power
consumption and energy efficiency for each different CPU P-
state. Therefore, it allows the definition of a power capping
technique that can meet power targets in a single step
without requiring any exploration. However, it does not
consider the possibility of altering the number of cores
assigned to an application, thus it would provide sub-optimal
performance for multi-thread applications showing less than
linear scalability.
III. PROBLEM STATEMENT AND PRELIMINARY
ANALYSIS
As discussed, in our study we consider the problem of
adaptively tuning the system configuration to ensure the
highest application performance under a power cap. We
consider two tuning parameters, the number of concurrent
threads and the core power state. We focus on the general
scenario of multi-thread applications executed on a working-
thread pool (e.g. multithreaded web/application servers)
whose size can be tuned at run-time. However, we should
note that the proposed technique is orthogonal with respect
to the chosen thread regulation mechanism. Also, we assume
that the power state of cores can be changed, affecting
both power consumption and performance. In practice, this
is what happens when changing the so-called P-state in
modern multi-core processors, which determines a variation
of the core voltage and frequency, thus modifying both the
power consumption and the instruction processing speed. We
adhere to the notation of the ACPI standard, which establishes
that P0 denotes the core state with maximum power and
performance, and P1, P2, ... progressively identify states
with less power and performance. Also, we consider the
core idle state (C-state), where C0 denotes the full operating
core state, and C1, C2, ..., progressively identify lower
power states where the core is idle, i.e. it does not execute
instructions. A core can transit from C0 to a deeper C-
state when it has no instruction to execute. Hence, when
the number of running threads goes below the number of
available cores, unused cores can transit to low power states,
thus reducing the total power consumption.
To provide the reader with real data demonstrating the
effects on power consumption associated with the variation
of P-state and the number of concurrent threads, we show
in Figure 1 the results of an experiment where we run the
multi-thread Intruder benchmark from the STAMP suite [9]
for Transactional Memory systems [10]. Intruder emulates a
signature-based network intrusion detection system where
network packets are processed in parallel by concurrent
threads. We executed different runs while changing P-state
and the number of concurrent threads on top of a machine
with two Intel Xeon E5, 20 physical cores total, 256 ECC
DDR4 memory, with core clock frequency ranging from
1.2 GHz (whose P-state is denoted as P-11) to 2.2 GHz
(denoted as P-1), and TurboBoost from 2.2 GHz to 3.1 GHz
(denoted as P-0). Since we focus on the effects of the joint
variation of core power state and thread parallelism, we
consider power consumption data related to the CPU and
memory subsystems, which we collected via Intel RAPL
interface [11]. The plot shows the power consumption as
a function of the pair (p, t), where p is the P-state and t is
the number of concurrent threads. The results clearly show
that the power consumption grows when either lowering p
(i.e. moving towards P0) or increasing t. Given a power cap value,
if {(p, t)} is the set of all possible configurations, we
denote as {(p, t)}ac ⊆ {(p, t)} the subset of all acceptable
configurations, that is the configurations for which the power
cap is not violated. Formally, it is the subset such that
pwr(p, t) ≤ C, where pwr(p, t) is the power consumption
with configuration (p, t) and C is the power cap value. Since
the function pwr(p, t) is monotonic with respect
to both p and t (it decreases as p grows and increases as t grows),
the subsets of acceptable and unacceptable
configurations are separated by a frontier, as shown in Figure
3.
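As a rough illustration of how a single power sample pwr(p, t) could be derived, the sketch below converts two readings of a RAPL energy counter into an average power value. It assumes the counter is expressed in microjoules and wraps around at a known maximum, as is the case for the energy_uj and max_energy_range_uj files of the Linux powercap interface. The function name and its parameters are hypothetical and are not taken from the implementation described in this paper.

```python
def average_power_watts(e_start_uj: int, e_end_uj: int,
                        elapsed_s: float, max_range_uj: int) -> float:
    """Average power over an interval, from two energy-counter readings.

    e_start_uj / e_end_uj: counter values in microjoules (assumed unit).
    max_range_uj: value at which the counter wraps around (assumed known).
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = e_end_uj - e_start_uj
    if delta_uj < 0:
        # The counter wrapped once between the two readings.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / elapsed_s
```

The sampling interval has to be short enough that the counter wraps at most once between two readings, otherwise the correction above underestimates the consumed energy.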
Our goal is to find the configuration (p, t)∗ ∈ {(p, t)}ac
for which the performance of the application is maximized.
[Figure 1: 3D plot titled "Power Consumption", showing Power (Watts) as a function of P-State and Threads]
Figure 1. Throughput vs. Number of Concurrent Threads and P-state
Without loss of generality, we consider the application
throughput as a performance metric. Other metrics, such as
the application runtime or the operation response time, could
also be used with our approach, depending on the specific
application. We denote as thr(p, t) the
application throughput for configuration (p, t).
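The definitions above can be restated as a brute-force search, sketched below purely as a reference point. It assumes pwr and thr are available as callables; toy_pwr and toy_thr form a hypothetical toy model (not measured data) in which lower P-state values mean faster and more power-hungry cores. An exhaustive search of this kind is exactly what an on-line technique needs to avoid, since every configuration would have to be measured.

```python
from typing import Callable, Optional, Tuple

Config = Tuple[int, int]  # (p, t): P-state index and number of threads


def best_config_exhaustive(pwr: Callable[[int, int], float],
                           thr: Callable[[int, int], float],
                           p_states: range, threads: range,
                           cap: float) -> Optional[Config]:
    """Return the acceptable configuration with the highest throughput."""
    # Acceptable set: every configuration whose power does not exceed the cap.
    acceptable = [(p, t) for p in p_states for t in threads if pwr(p, t) <= cap]
    if not acceptable:
        return None
    return max(acceptable, key=lambda c: thr(*c))


# Hypothetical toy model, for illustration only.
def toy_pwr(p: int, t: int) -> int:
    return 40 + t * (12 - p)          # grows with t, shrinks as p grows


def toy_thr(p: int, t: int) -> int:
    return (12 - p) * (10 + 20 * t - t * t)   # peaks at t = 10 for every p


if __name__ == "__main__":
    print(best_config_exhaustive(toy_pwr, toy_thr, range(0, 12), range(1, 21), 90))
```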
In multi-thread applications, the variation of the application
throughput as a function of t plays a key role
when finding the best configuration. Due to hardware and
data contention phenomena, the profile of the application
throughput curve is generally characterized by two parts,
i.e. an initial ascending part, where the throughput increases
while increasing t, followed by a descending part, where the
throughput decreases while increasing t. However, we note
that in the case of high contention the initial ascending part
may not exist (i.e. the throughput always decreases when
increasing t). Conversely, in the case of low contention the
throughput may never decrease.
In Figure 2, we report the results of an experimental study
we conducted with four different multi-thread applications
also taken from STAMP, namely Intruder, Genome, Vacation
and Ssca2. We selected these applications since their scal-
ability trends are very different. Also, in our experiments
we considered two different implementations of the thread
synchronization logic: a) a coarse-grained lock-based ap-
proach, where critical sections are synchronized by a single
global lock, and b) a fine-grained approach based on soft-
ware transactional memory, where shared data accesses are
synchronized by transactions. We purposely used a coarse-
grained locking scheme to evaluate our approach in various
and antithetical scenarios, spanning from applications with
very limited to very high scalability.
The plots in Figure 2 confirm that the throughput
curves generally have an ascending part followed by
a descending part. In some cases the ascending or the
descending part may not exist. Also, the plots show that,
when changing the application and/or the synchronization
approach, the shapes of the throughput curves change.
Particularly, the number of threads that provides the best
throughput is generally different. Its range varies from 1
(in the case of workloads with very limited scalability, such
as for Intruder Lock-based, Vacation Lock-based and Ssca2
Lock-based), up to 20 (in the case of fully scalable workloads
such as Genome Transaction-based or Vacation Transaction-based).
Notably, in some cases it is in the middle (as for
Intruder Transaction-based, Genome Lock-based or Ssca2
Transaction-based).
[Figure 2: eight panels, each plotting Throughput against Concurrent threads (0 to 20): Genome Lock-based, Intruder Lock-based, Vacation Lock-based, Ssca2 Lock-based, Genome Transaction-based, Intruder Transaction-based, Vacation Transaction-based, Ssca2 Transaction-based]
Figure 2. Throughput vs. Number of Concurrent Threads
On the other hand, for a fixed application
and synchronization approach, the throughput curves
preserve their shape when varying the P-state. Curves appear
proportionally translated, but the number of threads that
provides the best throughput does not change, except for
small and unpredictable variations due to measurement
noise. Finally, the plots show that, keeping the number
of threads fixed, the throughput increases when decreasing the
P-state. We exploit these experimental findings to define the
exploration-based technique presented in the next section.
IV. THE ADAPTIVE POWER CAPPING TECHNIQUE
The adaptive power capping technique we propose aims
at finding the optimal configuration (p, t)∗, i.e. the con-
figuration that provides the highest performance among
the configurations in the set {(p, t)}ac, assuming that it
may change due to variations of the workload profile. The
technique is based on an on-line tuning strategy that peri-
odically performs an exploration procedure. The latter aims
to identify the optimal configuration (p, t)∗ for the current
workload profile, which is actuated until the exploration pro-
cedure restarts after a given period. During the exploration
procedure, the power consumption and the throughput of the
application are measured while moving along configurations
within a given path, discarding the explored configurations
that are not in the set {(p, t)}ac. Then, the one with the
highest throughput is selected. We should note that the
number of threads t∗ of the optimal configuration may be
different from the number of threads that provides the highest
throughput for the specific application, since decreasing the
CPU P-state might provide a higher performance increase
than increasing the number of threads. The procedure is
able to identify the optimal configuration by exploring only
a subset of configurations. In effect, we note that the full
set of configurations may be very large, particularly when
a large number of cores are available. Thus, reducing the
exploration space is fundamental to implement an on-line
exploration-based strategy.
A. The Exploration Procedure
The exploration procedure takes as input a starting configuration
(ps, ts) and a power cap value C and returns (p, t)∗.
For the first execution of the procedure, the starting configuration
is established by the user, while in next executions
it corresponds to the output configuration of the previous
one.
[Figure 3: grid of configurations over Threads (1 to 20) and P-State (0 to 11), marking the frontier and the exploration phases (Phase 1, Phase 2, Phase 3) through the configurations (ps,ts), (ps,t1), (p2,t2) and (p3,t3)]
Figure 3. Example of exploration phases performed by the basic strategy
We note that, based on the shapes of the throughput
curves and the observations that we made in our preliminary
analysis, a set of configurations can be excluded from
the exploration, thus reducing the configuration exploration
space. Specifically, if during the exploration:
1) a configuration (pj, tk) such that thr(pj, tk) ≤
thr(pj, tk − 1) is found, then all configurations (p, t)
where t ≥ tk, for whichever p, can be excluded (since
we are in the descending part of the throughput curve
and since the throughput curves preserve their shape
while varying the P-state).
2) a configuration (pj, tk) such that pwr(pj, tk) ≤ C
is found, then all configurations (p, tk) with p > pj
can be excluded (since increasing the P-state reduces the
application throughput).
3) a configuration (pj, tk) such that pwr(pj, tk) > C
is found, then all configurations (p, t) where t ≥ tk
and p ≤ pj can be excluded (since decreasing the P-state
or increasing the number of concurrent threads
increases the power consumption).
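The three exclusion rules above can be sketched as a function that, given one observation, returns the configurations that no longer need to be explored. This is a minimal sketch assuming P-states numbered 0..p_max and thread counts 1..t_max; the function name and its arguments are hypothetical.

```python
from typing import Optional, Set, Tuple

Config = Tuple[int, int]  # (p, t)


def excluded_by(p_j: int, t_k: int, thr_here: float,
                thr_prev: Optional[float], power: float, cap: float,
                p_max: int, t_max: int) -> Set[Config]:
    """Configurations ruled out by one observation at (p_j, t_k).

    thr_prev is the throughput measured at (p_j, t_k - 1), or None if unknown.
    """
    all_p = range(0, p_max + 1)
    all_t = range(1, t_max + 1)
    out: Set[Config] = set()
    # Rule 1: descending part of the curve reached, for every P-state.
    if thr_prev is not None and thr_here <= thr_prev:
        out |= {(p, t) for p in all_p for t in all_t if t >= t_k}
    if power <= cap:
        # Rule 2: slower P-states with the same thread count cannot do better.
        out |= {(p, t_k) for p in all_p if p > p_j}
    else:
        # Rule 3: anything at least as fast and as parallel also violates the cap.
        out |= {(p, t) for p in all_p for t in all_t if t >= t_k and p <= p_j}
    return out
```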
Based on the above observations, we built an exploration
procedure divided into three phases, plus a final selection phase.
A graphical example is
shown in Figure 3, which refers to an execution where
the number of concurrent threads providing the highest
throughput is equal to 15 and C = 50 watts.
The phases are the following:
Phase 1: this phase starts from the initial configuration
(ps, ts) and, keeping the P-state fixed, aims at finding the
number of threads providing the highest throughput without
violating the power cap. We denote as (ps, t1) the configu-
ration returned by this phase. It performs a search inspired
by the hill-climbing technique. Specifically, it increments by
one the number of threads while the throughput increases
and the power cap is not violated (since it is moving along
the ascending part of the throughput curve), then it returns
the configuration with the highest throughput within the
power cap. If the throughput does not grow after the first in-
crement or the power cap is violated, it starts decreasing the
number of threads (since it is moving along the descending
part of the throughput curve or the power consumption has
to be reduced) until the throughput starts decreasing. Then,
it returns the configuration with the highest throughput if it
does not violate the power cap, otherwise if all the explored
configurations violate the power cap or if the exploration
reaches a number of threads equal to 1 it returns (ps, 1).
In the example in Figure 3, the exploration performed in
Phase 1 is represented by the green line. It starts with
(ps, ts) = (6, 5), then increases the number of threads and
terminates when it explores configuration (6, 13) since it
violates the power cap. It returns (ps, t1) = (6, 12), which
is within the power cap.
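Phase 1 can be sketched as follows. This is one possible reading of the description above, not the authors' code: measure(p, t) is a hypothetical callback returning a (throughput, power) sample for a configuration, and t_max is the assumed maximum number of threads.

```python
from typing import Callable, Dict, Tuple

Sample = Tuple[float, float]  # (throughput, power in watts)


def phase1(measure: Callable[[int, int], Sample],
           p_s: int, t_s: int, cap: float, t_max: int) -> Tuple[int, int]:
    """Hill-climb on the number of threads while keeping the P-state at p_s."""
    cache: Dict[int, Sample] = {}

    def probe(t: int) -> Sample:
        # Measure each thread count at most once.
        if t not in cache:
            cache[t] = measure(p_s, t)
        return cache[t]

    def within_cap(t: int) -> bool:
        return probe(t)[1] <= cap

    t, climbed = t_s, False
    # Ascend while one more thread stays within the cap and raises throughput.
    while (t < t_max and within_cap(t) and within_cap(t + 1)
           and probe(t + 1)[0] > probe(t)[0]):
        t, climbed = t + 1, True
    if not climbed:
        # Descend while over the cap, or while one thread fewer performs better.
        while t > 1 and (not within_cap(t) or probe(t - 1)[0] > probe(t)[0]):
            t -= 1
    # If even one thread violates the cap, (p_s, 1) is still returned,
    # as in the description above; later phases deal with that case.
    return (p_s, t)
```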
Phase 2: This phase starts from the configuration returned
by phase 1 (ps, t1) and is executed only if this configuration
does not violate the power cap (otherwise it jumps to the next
phase). The goal of phase 2 is to continue the exploration for
lower values of P-state (we remark that lower values of P-
state lead to both higher core performance and higher power
consumption). Specifically, it explores by moving from the
current configuration (p, t) to configuration (p − 1, t). If
the latter configuration does not violate the power cap,
it continues to reduce the value of P-state. If the exploration
reaches a configuration such that pwr(p, t) > C,
it starts reducing the number of threads, thus moving to
configuration (p, t − 1), then (p, t − 2) and so on (since
decreasing the number of concurrent threads reduces the
power consumption) until the power cap is no longer violated.
Afterwards, it resumes the exploration by decreasing the value of
P-state. The exploration terminates when p reaches 0 and the
current configuration does not violate the power cap, when
it reaches configuration (0, 1), or when a configuration with
t = 1 violates the power cap. Then, the phase returns the
explored configuration with the highest throughput within
the power cap, that we denote as (p2, t2), or configuration
(0, 1). In Figure 3, the exploration of Phase 2 is shown by
the blue line. It starts from (ps, t1) = (6, 12), then explores
up to configuration (0, 1). It returns (p2, t2) = (3, 6).
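A minimal sketch of Phase 2 under the same assumptions: measure(p, t) is a hypothetical callback returning a (throughput, power) sample, and the caller is expected to invoke the function only when (ps, t1) is within the power cap.

```python
from typing import Callable, Optional, Tuple

Sample = Tuple[float, float]  # (throughput, power in watts)
Config = Tuple[int, int]      # (p, t)


def phase2(measure: Callable[[int, int], Sample],
           p_s: int, t_1: int, cap: float) -> Config:
    """Explore lower P-state values, shedding threads to stay within the cap."""
    p, t = p_s, t_1
    best: Optional[Tuple[float, Config]] = None
    while p > 0:
        p -= 1                      # faster, more power-hungry P-state
        thr, pwr = measure(p, t)
        while pwr > cap and t > 1:
            t -= 1                  # reduce threads until back within the cap
            thr, pwr = measure(p, t)
        if pwr > cap:
            break                   # a single thread already violates the cap
        if best is None or thr > best[0]:
            best = (thr, (p, t))
    # Fall back to (0, 1) when no explored configuration was within the cap.
    return best[1] if best is not None else (0, 1)
```

Since the number of threads never grows and the P-state never increases, the loop visits at most ps + t1 configurations.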
Phase 3: This phase starts again from the configuration
returned by Phase 1, i.e. (ps, t1), and aims at continuing the
exploration for higher values of P-state. If the configuration
returned by Phase 1 is such that t1 is the number of threads
providing the highest throughput and is within the power
cap, Phase 3 is not executed (since incrementing the value of
P-state leads to lower throughput). Otherwise, it increments
by one the value of P-state and starts increasing the number
of concurrent threads until the power cap is violated or the
throughput decreases. In the former case, if the maximum
value of P-state has not been reached, it increments by one
the value of P-state and starts again incrementing the number
of threads. In all the other cases the exploration terminates.
Then, the phase returns the explored configuration, that we
denote as (p3, t3), with the highest throughput within the
power cap, or configuration (pmax, t1), where pmax is the
maximum value of P-state. In Figure 3, the exploration of
Phase 3 is represented by the yellow line. It starts from
(ps, t1) = (6, 12), then explores up to configuration (8, 16),
where it stops since the throughput decreases (in the example
the number of concurrent threads providing the highest
throughput is equal to 15). It returns (p3, t3) = (8, 15).
Final phase: this phase selects the configuration with
the highest throughput between the configurations (ps, t1),
(p2, t2) and (p3, t3), which does not violate the power cap,
or returns null if none of them is within the power cap.
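The final selection reduces to a filtered maximum over the three candidates. In the sketch below, None plays the role of null, and measure(p, t) is again a hypothetical callback returning a (throughput, power) sample; in practice the samples gathered during the earlier phases could be reused instead of measuring again.

```python
from typing import Callable, Iterable, Optional, Tuple

Sample = Tuple[float, float]  # (throughput, power in watts)
Config = Tuple[int, int]      # (p, t)


def final_selection(candidates: Iterable[Optional[Config]],
                    measure: Callable[[int, int], Sample],
                    cap: float) -> Optional[Config]:
    """Pick the candidate with the highest throughput within the power cap."""
    best: Optional[Tuple[float, Config]] = None
    for config in candidates:
        if config is None:
            continue                      # a skipped phase yields no candidate
        thr, pwr = measure(*config)
        if pwr <= cap and (best is None or thr > best[0]):
            best = (thr, config)
    return best[1] if best is not None else None
```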
B. Proof of Optimality
In this subsection we prove that the proposed exploration
procedure returns the optimal configuration, i.e. the configu-
ration (p, t)∗ that provides the highest level of performance
with a power consumption lower than the power cap. The
proof assumes that the observations discussed in Section III
always hold true. Specifically, we take as hypotheses that:
1) the shape of the throughput curve for each fixed P-state
and varying number of active threads is characterized
by an initial ascending part, followed by
a descending part. Also, one of these parts may be
missing;
2) if thr(pj, tk) > thr(pj, tk + 1) then thr(p, tk) >
thr(p, tk + 1) for each p (the throughput
curves preserve their shape while varying the P-state);
3) if pj < pk then thr(pj, t) > thr(pk, t) for each fixed t
(decreasing the P-state with a fixed number of threads
always increases the throughput);
4) pwr(p, t) ≥ pwr(pj, tk) for each p ≤ pj and
t ≥ tk (decreasing the P-state or increasing the number
of threads increases the power consumption);
5) the workload is static during the exploration procedure;
6) the samples of throughput and power consumption obtained
for each explored configuration are equivalent
to their real values.
We should note that hypotheses 5 and 6 are necessary for
any exploration-based solution that relies on data gathered
at run-time. In particular, they guarantee that if the optimal
configuration is explored it will also be selected by the
algorithm as the best configuration. Hypotheses 1, 2, 3 and
4, as shown in Figure 2, reflect properties that appear to be
valid for all the considered workloads.
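Hypotheses 1 to 4 can be checked mechanically on a full table of measurements, for instance one collected off-line. The sketch below assumes thr and pwr are callables over sorted, consecutive P-state and thread values, and treats plateaus as violations of hypothesis 1, which is a stricter reading than the text requires. The function name is hypothetical.

```python
from typing import Callable, Dict, Sequence


def check_hypotheses(thr: Callable[[int, int], float],
                     pwr: Callable[[int, int], float],
                     p_states: Sequence[int],
                     threads: Sequence[int]) -> Dict[int, bool]:
    """Return, for hypotheses 1-4, whether the given tables satisfy them."""
    ps, ts = list(p_states), list(threads)

    def unimodal(p: int) -> bool:
        vals = [thr(p, t) for t in ts]
        k = vals.index(max(vals))
        rising = all(a < b for a, b in zip(vals[:k], vals[1:k + 1]))
        falling = all(a > b for a, b in zip(vals[k:], vals[k + 1:]))
        return rising and falling

    h1 = all(unimodal(p) for p in ps)
    # H2: a drop between consecutive thread counts occurs at every P-state or none.
    h2 = all(len({thr(p, a) > thr(p, b) for p in ps}) == 1
             for a, b in zip(ts, ts[1:]))
    # H3: a lower P-state value gives strictly higher throughput.
    h3 = all(thr(pa, t) > thr(pb, t) for pa, pb in zip(ps, ps[1:]) for t in ts)
    # H4: power does not grow with p and does not shrink with t.
    h4 = (all(pwr(pa, t) >= pwr(pb, t) for pa, pb in zip(ps, ps[1:]) for t in ts)
          and all(pwr(p, b) >= pwr(p, a) for p in ps for a, b in zip(ts, ts[1:])))
    return {1: h1, 2: h2, 3: h3, 4: h4}
```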
Proof: We can partition the search space defined by the
configurations (p, t) in three distinct sub-spaces, delimited
by the starting configuration (p, t)s, such that:
• p = ps;
• p < ps;
• p > ps;
We denote as the optimal configuration for the sub-space
of configurations S, the configuration (p, t)q ∈ S that
provides the highest performance while operating within the
power cap compared to all the configurations (p, t) ∈ S.
Considering that the union of these three sub-spaces covers
the complete space of configurations, the configuration that
provides the highest performance among the optimal configurations
of all the sub-spaces will be the optimal configuration
(p, t)∗. Thus, proving that the exploration procedure
finds the optimal configuration for each of these sub-spaces
is equivalent to proving that it finds the optimal configuration
for the whole space of configurations.
p = ps : Phase 1 of the exploration procedure starts from
(p, t)s and searches for the number of active threads at
P-state ps that maximizes performance with power con-
sumption within the power cap. Therefore, it explores
the considered sub-space. Phase 1 is based on a hill-climbing
optimization algorithm, which generally finds a
local optimum that might not be the best possible solution.
However, by hypothesis 1, the local optimum is also
the global optimum, as a non-global
optimum cannot exist for a function with a single ascending part
followed by a single descending part or, in case one of
those is missing, for a monotonic function. We should note
that, unlike in traditional hill-climbing algorithms, the selected
configuration might not be the global optimum, as the latter might
require a power consumption higher than the power cap. In
this case, exploiting the shape of the throughput function,
the configuration with the highest number of active threads
and a power consumption lower than the power cap is
selected, which is clearly the optimal configuration for the
sub-space, as the optimum is always located either at the
end of the ascending part or at 1 thread if the ascending
part does not exist. Otherwise, the global optimum is selected. Thus,
phase 1 selects the optimal configuration for the sub-space
of configurations with p = ps.
p < p_s: Assume that the optimal configuration (p, t)_{k+1} for
the sub-space of configurations with p = k + 1 is known.
The optimal configuration for the sub-space with p = k
must have t_k ≤ t_{k+1}, since if t_{k+1} is the optimal number
of threads for p = k + 1 it must be that either:
• the throughput with t = t_{k+1} + 1 is lower than with
t = t_{k+1} at p = k + 1. Thus, by hypothesis 2,
all configurations (p, t) where t > t_{k+1}, for whichever
p, cannot be optimal;
• pwr(p_{k+1}, t_{k+1} + 1) > C, which implies, by hypothesis
4, that for p = k < k + 1 all configurations with t >
t_{k+1} would have a power consumption higher than the
power cap and thus cannot be optimal.
If t_{k+1} = 1 we can already conclude that the optimum for
the sub-space of configurations with p = k is (p = k, t =
1). Otherwise, for t_{k+1} > 1, we can state that the throughput
at P-state k monotonically increases in the range from 1 to
t_{k+1}. Indeed, if t_{k+1} > 1 it must be that thr(p_{k+1}, t_{k+1}) ≥
thr(p_{k+1}, t_{k+1} − 1), which implies by hypothesis 2 that
thr(p_k, t_{k+1}) ≥ thr(p_k, t_{k+1} − 1) and consequently that
the throughput curve with p = k for t < t_{k+1} is in
the ascending part. Therefore, as performed by Phase 2,
starting the exploration of the sub-space of configurations
with p = k from the configuration (p_k, t_{k+1}) and, if necessary,
decreasing the number of threads until the power cap is
no longer violated ensures that the optimal configuration for the
sub-space is explored. The configuration returned by Phase
1, which is the optimal configuration for the sub-space with
p = p_s, is used as the base case. The union of the sub-spaces
of configurations with p = j, for j ∈ [0, p_s − 1], is equal
to the sub-space of configurations with p < p_s. Therefore,
the configuration with the highest performance among the
optimal configurations for each of these sub-spaces will be the
optimal configuration for the entire sub-space with p < p_s.
p > p_s: Assume that the optimal configuration (p, t)_{k−1}
for the sub-space of configurations with p = p_{k−1} =
p_k − 1 is known. We can state that thr(p_k, t) <=
thr(p_{k−1}, t_{k−1}) for each t <= t_{k−1} since:
• if t_{k−1} is the optimal value of t for p = p_{k−1}, it
must be true that thr(p_k, t_{k−1}) >= thr(p_k, t) for
each t <= t_{k−1} (hypothesis 2);
• thr(p_k, t_{k−1}) < thr(p_{k−1}, t_{k−1}) (hypothesis 3).
In addition, we can state that if thr(p_k, t_{k−1}) >
thr(p_k, t_{k−1} + 1) then for each configuration (p, t) with
p > p_k it must be true that thr(p, t) < thr(p_k, t_{k−1})
since:
• by hypothesis 2, increasing the number of threads over
t_{k−1} does not improve the throughput for any P-state;
• by hypothesis 3, increasing the P-state reduces the
throughput.
Therefore, considering that pwr(p_k, t_{k−1}) < C
(hypothesis 4) and that (p_k, t_{k−1}) is included in the
sub-space of configurations with p > p_s, if
thr(p_k, t_{k−1}) > thr(p_k, t_{k−1} + 1) then no
configuration with p > p_k can be the optimal configuration
of the sub-space with p > p_s. Starting from the
configuration returned by phase 1, which is the optimal
configuration for p = p_s, phase 3 increments the P-state
and increments the number of threads until the power cap is
violated or the throughput decreases, which ensures that the
optimal configuration for the sub-space is explored. If the
throughput decreases when increasing the number of threads,
or when the maximum P-state is reached, phase 3 is
completed. By induction, it explores the optimal
configuration of each sub-space of configurations with
{p = j | j ∈ [p_s + 1, p_max]}, excluding the sub-spaces
that we proved cannot contain the optimum. Therefore, the
configuration with the highest performance among the
optimal configurations of the considered sub-spaces is the
optimal configuration for the entire sub-space with
p > p_s.
C. Time Complexity Analysis
The time complexity of the exploration procedure is
expressed as the number of exploration steps required by
the procedure to return the optimal configuration (p, t)∗. Let
p_tot be the total number of P-states supported by the system
and t_tot the maximum number of concurrent threads for the
specific application, which, in HPC applications, is usually
set equal to the number of physical/virtual cores available in
the system. Considering that the exploration procedure does
not explore the descending part of the throughput curve, we
could also define t_tot as the maximum number of concurrent
threads that provides, for at least a portion of the execution
time, the highest performance for the specific application run
on the specific hardware. We analyze the time complexity
of each exploration phase separately:
• phase 1: each configuration with a different number
of concurrent threads and p = p_s is explored at most
once, thus the time complexity is O(t_tot);
• phase 2: starting from a configuration (p, t), phase 2
either reduces the value of p or reduces t. Starting from
the configuration returned by phase 1, it can reduce p at
most p_tot times, and reduce t at most t_tot times. Thus,
the time complexity of phase 2 is O(p_tot + t_tot);
• phase 3: starting from a configuration (p, t), phase
3 either increments the value of p or increments t.
Thus, by the same reasoning used for phase 2, the time
complexity of phase 3 is O(p_tot + t_tot).
Therefore, the overall time complexity of the exploration
procedure is O(p_tot + t_tot).
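To make the step count concrete, the following is a minimal sketch of a three-phase exploration that counts how many distinct configurations it measures. It is an illustration under stated assumptions, not the authors' implementation: explore and synthetic_measure are hypothetical names, the model is synthetic and built to satisfy the hypotheses used in the proof, and a real controller would measure each configuration by running it rather than by calling a function.

```python
def explore(measure, p_s, t_s, p_max, t_max, cap):
    """Three-phase search. Returns (best (p, t) or None, explored configs).

    measure(p, t) -> (throughput, power). P-states run from 0 (highest
    frequency) to p_max; thread counts run from 1 to t_max.
    """
    seen = {}  # every configuration is measured at most once

    def m(p, t):
        if (p, t) not in seen:
            seen[(p, t)] = measure(p, t)
        return seen[(p, t)]

    def feasible(p, t):
        return m(p, t)[1] <= cap

    # Phase 1: hill-climb on the thread count at P-state p_s.
    t = t_s
    while t > 1 and not feasible(p_s, t):
        t -= 1
    while t < t_max and feasible(p_s, t + 1) and m(p_s, t + 1)[0] > m(p_s, t)[0]:
        t += 1
    while t > 1 and m(p_s, t - 1)[0] > m(p_s, t)[0]:
        t -= 1
    t1 = t
    candidates = [(p_s, t1)] if feasible(p_s, t1) else []

    # Phase 2: lower P-states (more power per thread); threads only shrink.
    t = t1
    for p in range(p_s - 1, -1, -1):
        while t >= 1 and not feasible(p, t):
            t -= 1
        if t == 0:
            break  # not even one thread fits, and lower P-states cost more
        candidates.append((p, t))

    # Phase 3: higher P-states (less power per thread); threads only grow.
    t = t1
    for p in range(p_s + 1, p_max + 1):
        if not feasible(p, t):
            continue  # only happens while no feasible point exists yet
        peaked = False
        while t < t_max and feasible(p, t + 1):
            if m(p, t + 1)[0] <= m(p, t)[0]:
                peaked = True
                break
            t += 1
        candidates.append((p, t))
        if peaked:
            break  # more threads or a higher P-state cannot help

    best = max(candidates, key=lambda c: seen[c][0], default=None)
    return best, len(seen)


def synthetic_measure(p, t, t_peak=8):
    """Hypothetical model: unimodal in t, slower and cheaper as p grows."""
    shape = t if t <= t_peak else t_peak - 0.5 * (t - t_peak)
    throughput = (1.0 - 0.1 * p) * shape
    power = 20.0 + t * (4.0 - 0.5 * p)
    return throughput, power


if __name__ == "__main__":
    best, steps = explore(synthetic_measure, p_s=3, t_s=4, p_max=5, t_max=16, cap=45.0)
    print("best:", best, "explored:", steps, "of", 6 * 16, "configurations")
```

In this sketch the number of measured configurations grows with p_tot + t_tot, whereas an exhaustive search over the same space would measure p_tot · t_tot configurations.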
D. The Enhanced Tuning Strategy
In this section we present an enhancement of the tuning
strategy that further improves performance and reduces the
power cap violation probability. It exploits the possible
gap between the power cap value and the power consumption
of configuration (p, t)∗, which is due to the discrete
domain of power consumption values of the different
configurations. Specifically, it is unlikely that
pwr((p, t)∗) is exactly equal to C. Rather, we typically
have C − pwr((p, t)∗) > 0. Statistically, the greater the
difference in power consumption between adjacent
configurations, the larger C − pwr((p, t)∗). To reduce the
performance penalty due to this gap, the enhanced tuning
strategy relies on continuous fluctuations between two
configurations (rather than always remaining in (p, t)∗)
during the time interval between the end of the exploration
procedure and the start of the next one. All the phases are
the same as in the previous tuning strategy, except that an
additional configuration (p, t)_H is selected. (p, t)_H is
the configuration with higher throughput than (p, t)∗ (if
any) whose ratio between throughput and power consumption
is the largest among the explored ones. Thus, it is the
configuration with the highest efficiency in terms of
throughput over power consumption. We note that, since
(p, t)∗ is the configuration within the power cap with the
highest throughput, (p, t)_H is a configuration that
violates the power cap.
At the end of the exploration procedure, the enhanced
strategy continuously fluctuates between (p, t)∗ and (p, t)_H
in order to take advantage of the higher throughput of
configuration (p, t)_H, while preventing the average power
consumption over a given time window w from exceeding C.
To this aim, if the average power consumption exceeds
C, then configuration (p, t)∗ is set. Conversely, when the
average power consumption falls below C, configuration
(p, t)_H is set, and so on. To limit the fluctuation frequency,
an upper and a lower tolerance threshold, C + l and C − l,
are used. In real scenarios, the length of w can be set equal
to the actual time window used to calculate the power
consumption of the machine.
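The switching rule with tolerance thresholds can be sketched as a small simulation. This is a hedged illustration rather than the authors' code: the function name fluctuate, the constant per-step power values, and the choice to measure the window in steps are all assumptions made for the example.

```python
from collections import deque


def fluctuate(power_star, power_high, cap, tol, window, n_steps):
    """Simulate switching between (p, t)* and (p, t)_H with hysteresis.

    Returns (choices, averages): the configuration selected after each
    step ("*" or "H") and the windowed average power at that step.
    """
    samples = deque(maxlen=window)  # sliding window of power samples
    current = "*"                   # start from the configuration within the cap
    choices, averages = [], []
    for _ in range(n_steps):
        samples.append(power_high if current == "H" else power_star)
        avg = sum(samples) / len(samples)
        if avg > cap + tol:
            current = "*"           # average too high: fall back to (p, t)*
        elif avg < cap - tol:
            current = "H"           # headroom available: use (p, t)_H
        choices.append(current)
        averages.append(avg)
    return choices, averages


if __name__ == "__main__":
    # Hypothetical values: (p, t)* draws 48 W, (p, t)_H draws 56 W, C = 50 W.
    choices, averages = fluctuate(48.0, 56.0, cap=50.0, tol=0.5, window=10, n_steps=60)
    print("".join(choices))
    print(round(max(averages[30:]), 2))
```

Between the two thresholds the current configuration is kept, which is what limits the switching frequency; a larger l means fewer switches but a wider swing of the average around C.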
Another factor that may impact the effectiveness of our
technique is the variation over time of the power
consumption of the selected configurations. For example,
pwr((p, t)∗) may change due to variations of the workload
profile, thus leading to power cap violations. If this
happens, with our basic tuning strategy it may not be
detected until the next exploration procedure starts. To
limit the effect of this delay on the power cap violation,
the enhanced tuning strategy selects a third configuration,
which we denote as (p, t)_L. It is the configuration with
lower power consumption than (p, t)∗ (if any) with the
highest efficiency in terms of throughput over power
consumption. Thus, if pwr((p, t)∗) exceeds C, the strategy
fluctuates between (p, t)∗ and (p, t)_L rather than between
(p, t)∗ and (p, t)_H. This reduces the probability of power
cap violation as long as the workload profile variation
still leaves pwr((p, t)_L) < C. Similarly, with the same
goal of promptly adapting to workload variations, if
pwr((p, t)_L) > C (pwr((p, t)_H) < C), the P-state of all
configurations is shifted up (down) by one.
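The selection of the two companion configurations can be sketched as follows, assuming the statistics of the explored configurations are kept in a dictionary. The function name pick_companions, the data layout, and the numbers in the usage example are hypothetical.

```python
def pick_companions(explored, star):
    """Pick (p, t)_H and (p, t)_L among the explored configurations.

    explored maps (p, t) -> (throughput, power); star is the key of (p, t)*.
    Either result is None when no explored configuration qualifies.
    """
    thr_star, pwr_star = explored[star]

    def efficiency(cfg):
        throughput, power = explored[cfg]
        return throughput / power

    # (p, t)_H: higher throughput than (p, t)*, best throughput per watt.
    higher = [c for c in explored if explored[c][0] > thr_star]
    # (p, t)_L: lower power than (p, t)*, best throughput per watt.
    lower = [c for c in explored if explored[c][1] < pwr_star]
    high = max(higher, key=efficiency, default=None)
    low = max(lower, key=efficiency, default=None)
    return high, low


if __name__ == "__main__":
    # Hypothetical measurements: (throughput, watts) per (P-state, threads).
    explored = {
        (2, 8): (6.4, 44.0),   # (p, t)* under a 45 W cap
        (1, 8): (7.2, 48.0),
        (0, 7): (7.0, 48.0),
        (1, 7): (6.3, 44.5),
        (3, 8): (5.6, 40.0),
    }
    print(pick_companions(explored, (2, 8)))
```

Only configurations already visited by the exploration procedure are considered, so this selection adds no exploration steps.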
V. EXPERIMENTAL RESULTS
In this section, we present the results of an experimental
study we conducted to assess the proposed power capping
technique. As in previous studies on power capping (e.g. [2],
[12]), we consider two evaluation metrics, the application
performance and the average power cap error. The latter is
the average difference between the power consumption and
the power cap value along the time intervals where the power
cap is violated. We run experiments for all the application
scenarios that we considered in our preliminary study (see
Section III). Thus we use Intruder, Genome, Vacation and
Ssca2 as benchmark applications from STAMP, with both locks
and transactions as the synchronization method. These
applications were specifically selected to cover a wide
range of different scalability scenarios. We compared our
technique with:
1) a reference power capping technique, referred to as
baseline, that selects the configuration with the low-
est P-state from the set of configurations with the
highest number of threads among the configurations
with power consumption lower than the power cap. It
implements the selection strategy proposed in [2];
2) a technique, referred to as dual-phase, that initially
tunes the number of threads starting from the lowest P-
state, and subsequently tunes the CPU P-state keeping
the number of threads fixed. The initial phase is
equivalent to phase 1 of the proposed exploration
procedure. The selection strategy of this technique is
similar to the one presented in [3].
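As an illustration of the first comparison point, the baseline selection rule described in item 1 can be sketched over a table of measured power values. This is a reading of the description given here, not code from [2]; baseline_select and the synthetic power table are hypothetical.

```python
def baseline_select(power, cap):
    """Baseline rule: most threads under the cap, then the lowest P-state.

    power maps (p, t) -> watts. Returns a (p, t) key, or None if no
    configuration has power consumption lower than the cap.
    """
    feasible = [cfg for cfg, watts in power.items() if watts < cap]
    if not feasible:
        return None
    most_threads = max(t for _, t in feasible)
    # Among configurations with the most threads, take the lowest P-state.
    return min((cfg for cfg in feasible if cfg[1] == most_threads),
               key=lambda cfg: cfg[0])


if __name__ == "__main__":
    # Hypothetical power table: 6 P-states (0 = highest frequency), 16 threads.
    table = {(p, t): 20.0 + t * (4.0 - 0.5 * p)
             for p in range(6) for t in range(1, 17)}
    print(baseline_select(table, cap=45.0))
    print(baseline_select(table, cap=90.0))
```

The rule never looks at throughput, so for an application whose throughput peaks at a low thread count it spends the power budget on threads that do not contribute to performance.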
The comparison with the first technique quantifies
the performance benefits achievable by properly allocating
the power budget taking into consideration the scalability
of the specific multi-threaded application. Additionally, we
considered the dual-phase technique in the evaluation to
quantify the possible performance benefits achievable by
exploring the whole bi-dimensional space of configurations
rather than performing two distinct mono-dimensional
explorations, which might not find the optimal
configuration. We should note that, despite exploring a
larger set of configurations, the proposed technique has
the same time complexity as the dual-phase technique.
A. Implementation details
We developed a controller module that implements our
technique and the baseline technique.1 All software of our
experimental study, including the benchmark applications, is
developed in C for Linux. The controller module alters the
number of concurrent threads by exploiting the pause()
system call to suspend threads and thread-specific signals
to reactivate them. The CPU P-state is regulated through
the cpufreq Linux sub-system, while energy readings are
obtained from the powercap sub-system. Both sub-systems are
included by default in recent versions of the Linux kernel
and expose their respective interfaces through the sysfs
virtual file system.
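As a rough illustration of these two sysfs interfaces, the sketch below derives average power from two readings of the powercap energy counter and sets a core frequency through cpufreq. It is written in Python for brevity, whereas the controller described here is written in C. The concrete paths are assumptions: the intel-rapl:0 zone and the cpu0 directory are typical on Intel machines but are system-specific, writing scaling_setspeed requires the userspace governor, and both operations may require root privileges.

```python
import time
from pathlib import Path

# Assumed locations; adjust to the zones and CPUs present on the machine.
RAPL_ZONE = Path("/sys/class/powercap/intel-rapl:0")
CPUFREQ = "/sys/devices/system/cpu/cpu{cpu}/cpufreq/{name}"


def energy_delta_uj(before, after, max_range):
    """Energy consumed between two counter readings, handling wrap-around."""
    if after >= before:
        return after - before
    return after + max_range - before  # the counter wrapped once


def read_uj(name):
    """Read an integer microjoule value from the powercap zone."""
    return int((RAPL_ZONE / name).read_text())


def average_power_watts(interval_s=0.1):
    """Average package power over a short interval, in watts."""
    max_range = read_uj("max_energy_range_uj")
    e0, t0 = read_uj("energy_uj"), time.monotonic()
    time.sleep(interval_s)
    e1, t1 = read_uj("energy_uj"), time.monotonic()
    return energy_delta_uj(e0, e1, max_range) / 1e6 / (t1 - t0)


def set_frequency_khz(cpu, khz):
    """Request a fixed frequency for one CPU (userspace governor only)."""
    Path(CPUFREQ.format(cpu=cpu, name="scaling_setspeed")).write_text(str(khz))
```

The counter arithmetic is kept in a separate pure function so that the wrap-around handling can be checked without access to the hardware counters.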
The exploration procedure relies on the statistics
collected during the previous step, such as average power
consumption and throughput, to define the next
configuration to explore. Each step of statistics
collection is determined by a fixed number of units of work
processed. We cannot rely on application-independent
metrics, such as the number of retired CPU instructions,
since they would also count instructions related to
spin-locking or aborted transactions, which do not provide
execution progress. For applications based on locks we
define the unit of work as the execution of one critical
section guarded by a global lock. For transactions, instead,
we define the unit of work as one commit. The statistics are
collected in a round-robin fashion by all the active threads
to reduce execution overhead and provide NUMA-aware results
in modern multi-package systems.
For the executions presented in the experimental results,
we set the units of work per step to 5000, resulting in tens
of milliseconds per step for all the considered applications
and synchronization methods. In addition, we set to 150 the
number of steps required to restart the exploration procedure
after the conclusion of the previous one.
B. Experimental results
We consider both the tuning strategies of our technique,
referred to as basic strategy and enhanced strategy. We
analyze the performance results of our strategies in terms
of speed-up with respect to the throughput of the baseline
technique. As anticipated, we also compare the average power
cap error. For each test case, we present the results with
three different power cap values, i.e. 50, 60 and 70 watts.
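The two metrics can be computed from measured samples along the following lines. This is a sketch under assumptions: the function names are hypothetical, power is assumed to be sampled at regular intervals, and the error is normalized by the power cap because the figures report it as a percentage; the exact normalization used in the experiments is not stated in the text.

```python
def power_cap_error_pct(samples, cap):
    """Average excess over the cap, as a percentage of the cap.

    Only the samples that violate the cap contribute, matching the
    definition of average power cap error given above.
    """
    excess = [watts - cap for watts in samples if watts > cap]
    if not excess:
        return 0.0
    return 100.0 * (sum(excess) / len(excess)) / cap


def speed_up(throughput, baseline_throughput):
    """Throughput relative to the baseline technique."""
    return throughput / baseline_throughput


if __name__ == "__main__":
    # Hypothetical power samples (watts) under a 50 W cap.
    print(power_cap_error_pct([48.0, 51.0, 52.0, 49.0], cap=50.0))
    print(speed_up(throughput=1100.0, baseline_throughput=500.0))
```

Note that this error metric does not account for how often the cap is violated, only for how far above the cap the consumption is when a violation occurs.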
Results for the case of lock-based synchronization are
reported in Figure 4. Overall, the results show an evident per-
formance improvement with both strategies of our technique
with respect to the baseline technique. Only for the case of
Genome the performance is comparable. In the best cases,
i.e. with Intruder, the performance improvement reaches 2.2x
(2.32x) and 2.15x (2.19x) for the basic (enhanced) strategy
when the power cap is equal to 50 and 60 watts respectively,
and it is close to 1.9x for both the proposed strategies with
power cap set to 70 watts. The enhanced strategy further
improves performance compared to the baseline technique
by up to 12.5% in Intruder at 50 watts, and by 5.3% on
average. For lock-based synchronization, the results of the
dual-phase technique are similar to those achieved by the
basic strategy.
[Figure 4. Throughput Speed-up and Power Cap Error with Locks. Panels: speed-up and power cap error (%) for intruder, vacation, genome and ssca2, with the power cap at 50, 60 and 70 watts.]
1 See github.com/StefanoConoci/STMEnergyOptimization
As for the power cap error, with both the strategies of our
technique and the dual-phase technique, it is clearly reduced
compared to the baseline. Also, the results show that with
the enhanced strategy in many cases there is a reduction of
the power cap error compared to the basic strategy. Indeed,
except for the case of Vacation with power cap equal to 60
watts, where it is increased by less than 0.1%, the error with
the enhanced strategy is lower. In the best case it is about
0.1%, while it is about 2% and 4.8% with the basic strategy
and the baseline technique, respectively.
Results for the case of transaction-based synchronization
are reported in Figure 5. Overall, the performance results
confirm the advantage of our technique compared to the
baseline technique. However, with transactions the speed-up
is generally slightly lower than with locks. In the best
cases, it reaches about 1.9x. Also, there is one case (with
Genome and power cap = 50 watts) where it is slightly less
than 1 with both strategies. As for the power cap error,
it increases with the basic strategy compared to the case
with locks, exceeding the error of the baseline technique in
most of the cases. However, it does not exceed 2% in any
case. The error is considerably reduced with the enhanced
strategy. In particular, it is clearly lower than with the
baseline technique for all applications when the power cap
is equal to 50 watts and for Intruder when the power cap is
equal to 60 watts, while the results are similar for the
other power cap values. In addition, the enhanced strategy
can further increase performance by up to 8% (Vacation with
power cap set to 50 watts) and by 3.5% on average.
Differently from the lock-based case, both strategies of the
proposed technique show a higher speed-up than the
dual-phase technique, by up to 21% (Ssca2 with power cap set
to 50 watts) and by 7.7% and 10.7% on average for the basic
strategy and the enhanced strategy, respectively.
C. Analysis of the Results
As a first observation, results show that in various cases
with locks, the error of our technique and of the dual-phase
technique is very close to zero. This is due to the fact that, in
our study, the scalability is limited for all applications when
using locks. In these scenarios, the number of concurrent
threads providing the highest throughput (which is selected
by our technique and by the dual-phase technique) is low,
thus the P-state can be lowered all the way to 0 while the
power cap is still far from being reached. This keeps the
error very close to 0, since it is unlikely that the power
cap is violated during the exploration procedure or due to
workload variations.
The error is generally reduced with the enhanced strategy
compared to the basic strategy, while also improving perfor-
mance. This arises since the former is able to react along the
time between two consecutive exploration procedures to the
possible variations of the power consumption of the selected
configurations, as discussed at the end of Section IV-D.
The speed-up with our technique is less than 1 only in
one case, i.e. for Genome with transactions when the power
cap value is equal to 50 watts. We note that Genome with
transactions is highly scalable (see Figure 2). This leads
both the baseline technique and our technique to select 20
as number of concurrent threads. As shown by the plot in
Figure 2, the throughput of Genome with transactions is
subject to noise when close to 20 threads. Also, we remark
that our technique is able to react to workload variations also
in terms of scalability. In this scenario, these factors cause
lower performance with our technique due to the noise,
which sometimes (wrongly) leads to temporarily selecting
a less than optimal number of concurrent threads.
[Figure 5. Throughput Speed-up and Power Cap Error with Transactions. Panels: speed-up and power cap error (%) for intruder, vacation, genome and ssca2, with the power cap at 50, 60 and 70 watts.]
As expected, for lock-based synchronization the proposed
technique shows similar results to the dual-phase
technique, since both techniques return the same
configuration when the ascending part of the throughput
curve is missing. For transaction-based synchronization, the
highest speed-up improvements over the dual-phase technique
are obtained for Ssca2 and Genome, which show a less than
linear ascending part of the throughput curve for each fixed
P-state (Figure 2). As the most significant example, in
Ssca2 the throughput increases only slightly when increasing
the number of threads from 6 to 15, which makes the
dual-phase technique select a configuration with 15 threads.
The proposed technique, instead, allocates the power budget
more efficiently by selecting a configuration with a lower
number of threads at an increased frequency. We should note
that the benefits of the proposed technique over the
dual-phase technique are not limited to applications that
rely on transaction-based synchronization. In fact,
performance benefits should be obtained for any application
with a throughput function that shows an ascending part
followed by a descending part, or only an ascending part
that is less than linear.
Overall, the results of our experimental study show that it
is possible to achieve significant performance benefits by
appropriately selecting the number of concurrent threads
and the CPU P-state, taking into consideration the
scalability of the specific multi-threaded application. As
expected, compared to the baseline technique, the proposed
solution achieves the best results with poorly scalable
applications, i.e. where contention is significant. Compared
to the dual-phase technique, the exploration of the whole
bi-dimensional space of configurations performed by the
proposed technique can provide an appreciable improvement in
performance for some applications, while achieving the same
results for others. Finally, the enhanced strategy manages
to further improve performance and reduce the power cap
error over the basic strategy.
VI. CONCLUSIONS
In this work we introduced a novel power capping technique
that, by jointly tuning the CPU performance state and
the number of concurrent threads, improves the performance
of multi-thread applications, specifically for applications
that show less than linear scalability due to contention.
Exploiting the results of a preliminary analysis, the proposed
technique can return in linear time the optimal configuration,
which provides the highest performance among all
configurations with power consumption lower than the power
cap. We also presented an enhanced strategy that, by
fluctuating between different configurations, optimizes the
dynamic allocation of the power budget, resulting in both
increased performance and reduced power cap error. Compared
to the baseline technique, which always assigns to the
application the highest possible number of cores, our
strategy provides an average speed-up of 1.48x, with
individual test cases reaching up to 2.32x. Furthermore, we
showed that by exploring the overall bi-dimensional space of
configurations, the proposed technique can improve
performance by up to 21% compared to techniques that tune
the number of threads and the CPU performance state
independently.
REFERENCES
[1] V. Pallipadi and A. Starikovskiy, “The ondemand governor:
past, present and future,” in Proceedings of Linux Symposium,
vol. 2, pp. 223-238, 2006.
[2] S. Reda, R. Cochran, and A. Coskun, “Adaptive power
capping for servers with multithreaded workloads,” IEEE
Micro, vol. 32, no. 5, pp. 64–75, Sep. 2012. [Online].
Available: http://dx.doi.org/10.1109/MM.2012.59
[3] H. Zhang and H. Hoffmann, “Maximizing performance under
a power cap: A comparison of hardware, software, and hybrid
techniques,” in Proceedings of the Twenty-First International
Conference on Architectural Support for Programming Lan-
guages and Operating Systems, ser. ASPLOS ’16. New York,
NY, USA: ACM, 2016, pp. 545–559.
[4] Y. Liu, G. Cox, Q. Deng, S. C. Draper, and R. Bianchini,
“FastCap: An efficient and fair algorithm for power capping in
many-core systems,” ISPASS 2016 - International Symposium
on Performance Analysis of Systems and Software, no. 3, pp.
57–68, 2016.
[5] Q. Deng, L. Ramos, R. Bianchini, D. Meisner, and
T. Wenisch, “Active low-power modes for main memory with
memScale,” IEEE Micro, vol. 32, no. 3, pp. 60–69, 2012.
[6] H. David, C. Fallin, E. Gorbatov, U. R. Hanebutte, and
O. Mutlu, “Memory Power Management via Dynamic Volt-
age/Frequency Scaling,” Proceedings of the 8th ACM Inter-
national Conference on Autonomic Computing, pp. 31–40,
2011.
[7] A. Kanduri, M.-H. Haghbayan, A. M. Rahmani, P. Liljeberg,
A. Jantsch, N. Dutt, and H. Tenhunen, “Approximation knob:
power capping meets energy efficiency,” Proceedings of the
35th International Conference on Computer-Aided Design -
ICCAD ’16, pp. 1–8, 2016.
[8] B. Su, J. Gu, L. Shen, W. Huang, J. L. Greathouse, and
Z. Wang, “PPEP: Online Performance, Power, and Energy
Prediction Framework and DVFS Space Exploration,” 2014
47th Annual IEEE/ACM International Symposium on Mi-
croarchitecture, pp. 445–457, 2014.
[9] C. Cao Minh, J. Chung, C. Kozyrakis, and K. Oluko-
tun, “STAMP: Stanford transactional applications for multi-
processing,” in Proc. 4th IEEE Int. Symposium on Workload
Characterization. IEEE, 2008, pp. 35–46.
[10] N. Shavit and D. Touitou, “Software transactional memory,”
in Proc. 14th ACM Symposium on Principles of Distributed
Computing. ACM, 1995, pp. 204–213.
[11] “Intel 64 and IA-32 Architectures Software Developer's Manual,
Volume 3C: System Programming Guide, Part 3,” (Accessed
on 06/26/2017).
[12] C. Lefurgy, X. Wang, and M. Ware, “Power capping: A
prelude to power shifting,” Cluster Computing, vol. 11,
no. 2, pp. 183–195, Jun. 2008. [Online]. Available:
http://dx.doi.org/10.1007/s10586-007-0045-4
