The Power-Optimised Software Envelope by Roberts, Stephen I. et al.
This is a repository copy of The Power-Optimised Software Envelope.
White Rose Research Online URL for this paper:
http://eprints.whiterose.ac.uk/144543/
Version: Accepted Version
Article:
Roberts, Stephen I., Wright, Steven A. orcid.org/0000-0001-7133-8533, Fahmy, Suhaib A. 
et al. (1 more author) (Accepted: 2019) The Power-Optimised Software Envelope. ACM 
Transactions on Architecture and Code Optimization. ISSN 1544-3566 (In Press) 
eprints@whiterose.ac.uk
https://eprints.whiterose.ac.uk/
The Power-Optimised Software Envelope∗
STEPHEN I. ROBERTS2, Arm Ltd., UK
STEVEN A. WRIGHT, University of York, UK
SUHAIB A. FAHMY, University of Warwick, UK
STEPHEN A. JARVIS, University of Warwick, UK
Advances in processor design have delivered performance improvements for decades. As physical limits are
reached, refinements to the same basic technologies are beginning to yield diminishing returns. Unsustainable
increases in energy consumption are forcing hardware manufacturers to prioritise energy efficiency in their
designs. Research suggests that software modifications may be needed to exploit the resulting improvements
in current and future hardware. New tools are required to capitalise on this new class of optimisation.
In this paper, we present the Power Optimised Software Envelope (POSE) model, which allows developers
to assess the potential benefits of power optimisation for their applications. The POSE model is metric agnostic
and in this paper we provide derivations using the established Energy-Delay Product metric and the novel
Energy-Delay Sum and Energy-Delay Distance metrics that we believe are more appropriate for energy-aware
optimisation efforts. We demonstrate POSE on three platforms by studying the optimisation characteristics of
applications from the Mantevo benchmark suite. Our results show that the PathFinder application has very
little scope for power optimisation while TeaLeaf has the most, with all other applications in the benchmark
suite falling between the two.
Finally, we extend our POSE model with a formulation known as System Summary POSE, a meta-heuristic
that allows developers to assess the scope a system has for energy-aware software optimisation independent
of the code being run.
Additional Key Words and Phrases: Energy-efficiency, Energy-aware Computing, Power Optimisation
ACM Reference Format:
Stephen I. Roberts, Steven A. Wright, Suhaib A. Fahmy, and Stephen A. Jarvis. 2019. The Power-Optimised
Software Envelope. ACM Trans. Arch. Code Optim. X, Y (April 2019), 25 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
∗Extension of Conference Paper: In our previous paper [35] we introduced a visual modelling tool called POSE, designed to
guide energy aware optimisation. In this paper, we extend our model with formulations based on two newly developed
metrics for assessing energy-aware optimisation, known as Energy-Delay Sum and Energy-Delay Distance [34]. We then
further extend our POSE model to include a new formulation known as System Summary POSE. System Summary POSE
allows us to reason about the scope an entire system has for energy-aware optimisations independently of any particular
code being run.
2Work completed while registered at the University of Warwick
Authors' addresses: Stephen I. Roberts, Development Solutions Group, Arm Ltd. UK, stephen.roberts@arm.com; Steven A.
Wright, Department of Computer Science, University of York, UK, steven.wright@york.ac.uk; Suhaib A. Fahmy, School of
Engineering, University of Warwick, UK, s.fahmy@warwick.ac.uk; Stephen A. Jarvis, Department of Computer Science,
University of Warwick, UK, s.a.jarvis@warwick.ac.uk.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 Association for Computing Machinery.
XXXX-XXXX/2019/4-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
ACM Trans. Arch. Code Optim., Vol. X, No. Y, Article . Publication date: April 2019.
1 INTRODUCTION
Scientific computing and numerical simulation have become indispensable tools in many areas of
science and engineering. Simulations allow scientists to test their theories in domains where physical
experimentation would be prohibitively costly, impractical, or dangerous. As a result, computational
methods have joined theory and experiment as central pillars of scientific investigation.
Maximising performance is paramount in scientific computing. Higher performance means more
calculations can be carried out, allowing scientists to increase the size, complexity or resolution
of their simulations. The field of High Performance Computing (HPC) exists to improve the
performance of supercomputers and the software which they run. HPC covers a broad spectrum of
disciplines. At one extreme, domain experts write high-level simulation software to model
phenomena of interest. At the other, hardware engineers design the processors and other components
that make up supercomputers. Performance engineering bridges the gap between these extremes,
seeking ways to optimise software to make better use of the available hardware.
This work investigates how conventional performance engineering techniques can be adapted to
support energy-aware software optimisation. It seeks to quantify the benefits which can realistically
be expected as a result of energy-aware optimisation. Specifically, this paper makes the following
contributions:
• We introduce the Power-Optimised Software Envelope (POSE), a model which helps
performance engineers to determine whether power or runtime optimisation will provide the
greatest benefits for their code;
• We provide derivations for POSE using the established Energy-Delay Product family of
metrics as well as the Energy-Delay Sum and Energy-Delay Distance [34] metrics;
• We demonstrate POSE on a number of applications from the Mantevo benchmark suite,
showing that PathFinder is the application least amenable to power optimisation and TeaLeaf
is the most amenable;
• Finally, we extend POSE to provide a model for system-wide power optimisation
characteristics. System Summary POSE is able to derive upper limits for the benefit of energy-aware
software optimisation on a given system independent of any specific software application.
The remainder of this paper is structured as follows: Section 2 summarises background work in
energy measurement and optimisation; Section 3 presents a survey of related work; Section 4
outlines the construction of POSE models; Section 5 demonstrates the use of POSE on applications
from the Mantevo benchmark suite; Section 6 introduces the System Summary POSE model and
demonstrates its application; finally, Section 7 concludes the paper.
2 BACKGROUND
2.1 Energy Measurement
Accurate measurement is fundamental to performance engineering. Processors incorporate built-in
clocks to maintain synchronisation and schedule interrupts. Engineers can use these clocks to
measure the runtime performance of their code. Energy monitoring capabilities are also appearing
in new processor designs.
Energy is the integral of power over time, or $E = \bar{P}t$, where $\bar{P}$ is the mean power draw. Energy consumption can therefore be
calculated based on measurements of power draw and time. Various methods have been used to
measure power draw in HPC systems, both at system and component levels.
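As a minimal sketch of this relationship, the snippet below approximates the integral of power over time from a series of timestamped power samples using the trapezoidal rule. The function names and the sample format are illustrative assumptions and are not part of any of the measurement tools discussed here.

```python
def energy_from_samples(times_s, power_w):
    """Approximate E = integral of P dt (joules) with the trapezoidal rule.

    times_s: sample timestamps in seconds, strictly increasing (assumed format).
    power_w: instantaneous power draw in watts at each timestamp.
    """
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    if len(times_s) < 2:
        raise ValueError("at least two samples are needed")
    energy_j = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Area of the trapezoid between two consecutive samples.
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j


def mean_power(times_s, power_w):
    """Mean power draw, so that E = mean_power * elapsed time."""
    elapsed = times_s[-1] - times_s[0]
    return energy_from_samples(times_s, power_w) / elapsed
```

Higher sample rates reduce the error of the trapezoidal approximation, which is one reason why the sampling rate of a measurement platform matters.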
Energy used by computers is converted to waste heat, as per the first law of thermodynamics.
Thermal cameras can be used to measure the temperature of different components, and hence
estimate their power draw. Mesa-Martinez et al. use thermal cameras and custom heat sinks to
measure CPU power consumption [32], while Hackenberg et al. follow a similar approach to
measure system-wide power consumption [20].
Computing platforms can also be instrumented with dedicated power sensors. Bedard et al.
develop PowerMon, a scheme for measuring component-level power draw in commodity systems [4].
Using sense resistors with a known resistance, and a voltmeter, they measure the voltage drop
across the resistors to calculate the current flow using Ohm's law.
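The arithmetic behind a sense-resistor measurement can be sketched as follows. This is an illustrative example rather than the PowerMon implementation: the function name is hypothetical, and multiplying the current by the supply rail voltage to obtain power is an assumption about how such readings are typically used.

```python
def power_from_sense_resistor(v_drop, r_sense, v_rail):
    """Estimate the power drawn through a supply rail.

    v_drop:  voltage measured across the sense resistor (volts).
    r_sense: known resistance of the sense resistor (ohms).
    v_rail:  supply rail voltage (volts); assumed to be known or measured.
    """
    current = v_drop / r_sense  # Ohm's law: I = V / R
    return current * v_rail     # P = I * V (assumed use of the current reading)
```

For instance, a 0.05 V drop across a 0.01 ohm resistor corresponds to a current of about 5 A, or roughly 60 W on a 12 V rail.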
An alternative approach to power measurement relies on the magnetic fields induced when
current flows through a wire. Laros et al. develop PowerInsight, a production quality power
monitoring platform which uses Hall effect sensors and Ampere's law, rather than sense resistors,
to improve accuracy and reliability [27].
Hackenberg et al. instrument a large HPC cluster called Taurus with commercial power
sensors [18]. The resulting High Definition Energy Efficiency Monitoring (HDEEM) infrastructure
can be used to measure component-level power and energy consumption across large numbers of
nodes at high sample rates.
Intel introduced Running Average Power Limit (RAPL) to support power-aware frequency scaling
in the Sandy Bridge Processor [12]. As a side effect, performance engineers gained access to an
interface capable of reporting CPU energy consumption. Early versions of RAPL were model
based, but more recent processors incorporate dedicated power sensors. AMD included equivalent
functionality starting with their Bulldozer CPU [1], while similar schemes exist for GPU [8] and
Xeon Phi [28] platforms.
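On Linux, RAPL energy counters are commonly exposed through the powercap sysfs interface. The sketch below shows one way a cumulative counter could be sampled around a workload, including a correction for counter wrap-around. The domain path is an assumption (it varies between systems and reading it may require elevated privileges), and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed location of the package-0 RAPL domain; this varies between systems.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def read_uj(name):
    """Read an integer microjoule value from the assumed RAPL domain."""
    return int((RAPL_DIR / name).read_text())


def energy_delta_uj(before_uj, after_uj, max_range_uj):
    """Difference between two readings of a cumulative counter.

    Assumes the counter wraps at most once between the two readings.
    """
    if after_uj >= before_uj:
        return after_uj - before_uj
    # The counter wrapped past its maximum range and restarted from zero.
    return after_uj + (max_range_uj - before_uj)


def measure_energy_j(workload):
    """Run a callable and return the energy it consumed in joules."""
    max_range = read_uj("max_energy_range_uj")
    before = read_uj("energy_uj")
    workload()
    after = read_uj("energy_uj")
    return energy_delta_uj(before, after, max_range) / 1e6
```

The wrap-around handling matters for long-running codes, since a counter with a limited range can overflow between two widely spaced readings.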
2.2 Energy-Aware Metrics
Metrics allow performance engineers to assess HPC systems and software based on properties
of interest. They enable meaningful comparison between different platforms and can be used to
quantify the effects of code changes.
Some metrics act as utility functions which measure the cost of running different programs.
These Figure-of-Merit (FoM) metrics can be used to rank different implementations of the same
algorithm in order to identify valid optimisations [22]. Runtime and energy consumption are both
examples of FoM metrics.
Until recently, runtime optimisation was ubiquitous in HPC while energy optimisation has been
confined to domains such as embedded systems and mobile robotics. Although energy
consumption is becoming a constraint for scientific computing, minimising runtime is still an important
optimisation objective.
Optimising software according to multiple properties simultaneously is known as Multi-Objective
Optimisation (MOO). MOO requires FoM metrics that strike the right balance between the
potentially conflicting requirements imposed by different optimisation objectives.
Gonzalez et al. propose Energy-Delay Product, a dimensionless FoM metric which combines
the energy and runtime costs incurred by processors [17]. Martin et al. generalise this into the
$Et^n$ family of FoM metrics, with parameters $E$ and $t$ corresponding to energy and time [31]. They
argue that $Et^2$ provides the best balance for microprocessor design. Srinivasan et al. reach the same
conclusion, although for slightly different reasons [39].
Many authors have adopted these metrics from the hardware community and applied them to
software optimisation problems. Vincent et al. describe a technique which minimises $Et^1$ using CPU
throttling [15]. Bingham and Greenstreet use $Et^n$ metrics to analyse runtime constraints imposed
by a fixed energy budget for various algorithms [6]. Laros et al. use $Et^n$ metrics to assess a number
of production applications and state that $Et^3$ strikes the right balance between runtime and energy
for HPC [26]. $Et^1$ has also been used extensively to quantify the efficiency of resource provisioning
and scheduling in cloud computing environments [36, 42].
Bekas and Curioni further generalise $Et^n$ metrics to the form $E \cdot f(t)$, a product between energy
and an application dependent function of time [5]. They argue that this formalisation is able to
drive software optimisation, assuming an appropriate function $f(t)$ can be identified.
In our previous work we have shown that metrics originating from the hardware community are
not suitable for measuring software performance. We subsequently proposed two new dimensionless
metrics which are designed to support energy-aware performance optimisation [34]. In this paper
we provide derivations for POSE using both of these metrics, namely Energy-Delay Sum (EDS,
Equation 1) and Energy-Delay Distance (EDD, Equation 2).
$$M(\theta) = \alpha E_\theta + \beta t_\theta \tag{1}$$

$$M(\theta) = \sqrt{(\alpha E_\theta)^2 + (\beta t_\theta)^2} \tag{2}$$
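The three metric families used in this paper can be written down directly as cost functions of energy and runtime. The sketch below is illustrative only: the function names are our own, and the default values of 1.0 for α and β are placeholder assumptions rather than recommended parameterisations.

```python
import math


def etn(energy, time, n=1):
    """Energy-Delay Product family: M = E * t^n."""
    return energy * time ** n


def eds(energy, time, alpha=1.0, beta=1.0):
    """Energy-Delay Sum (Equation 1): M = alpha*E + beta*t."""
    return alpha * energy + beta * time


def edd(energy, time, alpha=1.0, beta=1.0):
    """Energy-Delay Distance (Equation 2): M = sqrt((alpha*E)^2 + (beta*t)^2)."""
    return math.hypot(alpha * energy, beta * time)
```

All three are cost metrics, so a lower value indicates a better code, and each is monotonic in both energy and runtime for positive coefficients.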
2.3 Energy-Aware Optimisation
Energy use can be reduced either by shortening runtime or decreasing power consumption. While
runtime optimisation has been well studied, power optimisation is less developed; however some
progress in this area has been made.
Dynamic Voltage and Frequency Scaling (DVFS) and sleep states are two hardware features often
exploited by power optimisations; DVFS allows processors to run at different clock speeds and
supply voltages, while sleep states allow processors to power down during periods of inactivity.
In multi-node systems, nodes off the critical path can use DVFS to lower their clock speeds and
reduce power draw [13]. Alternatively, they can temporarily increase their clock speeds to finish
their work quickly before entering into sleep states [38].
The Intel Intelligent Power Node Manager uses a combination of features like DVFS, sleep states
and memory throttling to maintain a system power cap. Pedretti et al. demonstrate its application
for node-level power capping on a Cray XC40 system [33]. Their findings indicate that, while power
capping can be used to maintain power limits, it can also introduce significant and unpredictable
runtime overheads. Nodes which approach the power limit are aggressively throttled, leading to
slow-downs which can cause cascading delays and performance variability.
3 RELATED WORK
Performance modelling techniques enable the rapid exploration of large hardware and software
design spaces. This paper presents a performance modelling technique which enables engineers to
make decisions that may influence the energy consumption of their codes.
3.1 Simulators
Performance simulators such as SST [37], WARRP [21] and PACE [9] gather performance data by
executing simplified representations of target applications. Using code as a modelling input shifts
the burden of model construction away from the user. Consequently, model accuracy depends
primarily on how faithfully the simulator is able to represent a target system.
Tools such as Wattch [7] and McPAT [29] extend performance simulators with models of power
draw. These models use the energy costs associated with particular hardware events to estimate
the power consumption of a simulated code.
3.2 Analytical Models
Analytical models distil the structure and behaviour of a program into a set of parameterised
mathematical expressions. Performance predictions are then obtained by solving these expressions
for the required input parameters. As a result of their mathematical nature, analytical models
produce results more quickly than simulations, making them particularly suitable for parameter
studies. Ensuring the model is expressive enough to capture all possible program behaviours is
often challenging and requires a deep understanding of the target application and platform.
Examples of this approach include LogP [11], LogGP [2] and PRAM [25], which provide model
skeletons which must then be tailored to individual codes. This approach has also been applied to
modelling energy consumption, with examples including BTL [30] and CAPE [24].
Wu et al. construct an analytical performance model of runtime and energy consumption for two
HPC applications using performance counter data [41]. They use Spearman correlation and principal
component analysis to identify the counters most significantly correlated with the applications'
performance and then use multivariate regression analysis to build a model capable of predicting
runtime and energy consumption.
3.3 Heuristic Models
Heuristic models represent the most abstract category of performance models and the one to which
our work belongs. Rather than attempting to faithfully represent an entire system, heuristic models
provide a simplified analogy which helps developers reason about particular properties of a code.
Ease of construction and the clarity of their insights mean heuristic models are well suited to the
early stages of optimisation.
Arguably the best known heuristic model is Amdahl's Law [3], which states that the performance
gains from parallelisation are limited by the serial portion of a parallel program. A second prominent
example is the Roofline model [40], which frames application performance in terms of its operational
intensity and two system bottlenecks: off-chip memory bandwidth and floating point performance.
This simplification limits Roofline's use as a predictive model but does mean a developer can easily
isolate the limiting factor of code performance and target their optimisation efforts accordingly.
Choi et al. extended the Roofline model to identify the algorithmic conditions necessary for
trade-offs between runtime and energy [10]. In particular their "Roofline model of energy" highlights
how power consumption peaks when operational intensity places equal demands on memory and
floating point performance. With both subsystems under equal load, neither one can become a
bottleneck and force the other to enter an idle state waiting for more work. Idle subsystems draw
less power, so overall power consumption drops when either subsystem is left idling.
4 THE POWER OPTIMISED SOFTWARE ENVELOPE MODEL
In this paper, we outline the POSE model, a heuristic model that serves as a preliminary 'first
cut' modelling technique intended to guide energy-aware optimisation efforts. Our model draws
inspiration from the Roofline model in that its insights are presented in an intuitive graphical
format. Also like Roofline, our model does not directly identify optimisation opportunities but
rather identifies where optimisation efforts should be focussed.
The energy efficiency of a code can be improved either by shortening its runtime or by decreasing
its power consumption. The POSE model can quantify the potential benefits of each approach,
allowing developers to focus their efforts on whichever offers the greatest rewards.
POSE is metric agnostic and is compatible with all members of the $Et^n$ family, the EDS and EDD
metrics [34], and indeed any metric which is an element-wise monotonic function of runtime and
energy consumption. The only prerequisites are that runtime and energy consumption can be
accurately measured or calculated for the target platform.
4.1 Model Construction
POSE models partition the energy/runtime plane into areas with different performance
characteristics relative to some initial, unoptimised code.
[Figure 1: three panels, (a) $Et^n$, (b) EDS and (c) EDD, each plotting Energy (J) against Runtime (s). Each panel marks the initial code θ, the vertices A to E, the $P_{max}$ and $P_{min}$ energy bounds, the optimisation bound (B-E), the contribution bound (C-θ) and the optimisation limit (A-C), together with four regions: 1. Strong Runtime Optimisation, 2. Weak Runtime Optimisation, 3. Power Optimisation, 4. Performance Degradation.]
Fig. 1. POSE Model for $Et^n$, EDS and EDD Metrics
Feasible Performance Envelope
POSE is built around the concept of a Feasible Performance Envelope (FPE). This is constructed
by plotting lines with gradient $P_{min}$ and $P_{max}$ as shown in Figure 1. These values represent the
minimum and maximum rates of power draw possible during normal operation of the target
platform. As such, the runtime and energy costs incurred by running any given code θ under
similar conditions are represented by a single point somewhere within this envelope.
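A membership test for the envelope follows directly from this construction: a point is feasible if its energy lies between the two lines through the origin. The sketch below is illustrative, with a hypothetical function name, and treats points lying exactly on either bound as feasible.

```python
def in_envelope(time_s, energy_j, p_min, p_max):
    """Return True if (time, energy) lies within the Feasible Performance Envelope.

    p_min and p_max are the minimum and maximum power draw of the platform
    in watts, giving the gradients of the two energy bounds.
    """
    if time_s <= 0:
        return False
    return p_min * time_s <= energy_j <= p_max * time_s
```

For example, on a platform with power draw between 50 W and 150 W, a 10 s run can only consume between 500 J and 1500 J.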
The quantitative insights offered by POSE are calculated from the positions of the five vertices
labelled A to E in Figure 1. Four of these vertices lie on an intersection between the FPE and one
of the POSE bounds. The remaining vertex D lies directly below the initial code θ on the $P_{min}$
energy bound at coordinates $(t_\theta, P_{min} t_\theta)$. This vertex corresponds to the largest possible pure power
optimisation of θ, meaning an optimisation which reduces power consumption without any change
to runtime.
Optimisation Bound
POSE considers the metric used to guide optimisation in order to constrain the search space for
valid optimisations within the FPE.
Definition 1. For logically equivalent codes θ and λ, the transformation θ → λ is a valid
optimisation with respect to a cost metric M if M(λ) dominates M(θ).
The optimisation bound passes through θ, linking all points λ with the same metric value as the
original code, such that M(λ) = M(θ). This bound is represented by the curve B-E in Figure 1.
Compared to θ, all points below the optimisation bound will have strictly better performance in
terms of metric M, and all points above it will have strictly worse performance in terms of M.
The equation for the optimisation bound depends on the optimisation metric used. Deriving an
equation for the optimisation bound involves finding an expression for the curve which links all
points λ with the same metric value as θ. Equations 3, 4 and 5 give derivations for the optimisation
bound using the $Et^n$, EDS and EDD metrics, respectively.
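$$
\begin{aligned}
M(\lambda) &= M(\theta) \\
E_\lambda t_\lambda^{n} &= E_\theta t_\theta^{n} \\
E_\lambda &= E_\theta \left(\frac{t_\theta}{t_\lambda}\right)^{n}
\end{aligned}
\tag{3}
$$

$$
\begin{aligned}
M(\lambda) &= M(\theta) \\
\alpha E_\lambda + \beta t_\lambda &= \alpha E_\theta + \beta t_\theta \\
\alpha E_\lambda &= \alpha E_\theta + \beta t_\theta - \beta t_\lambda \\
E_\lambda &= E_\theta + \frac{\beta}{\alpha}\left(t_\theta - t_\lambda\right)
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
M(\lambda) &= M(\theta) \\
\sqrt{(\alpha E_\lambda)^2 + (\beta t_\lambda)^2} &= \sqrt{(\alpha E_\theta)^2 + (\beta t_\theta)^2} \\
(\alpha E_\lambda)^2 &= (\alpha E_\theta)^2 + (\beta t_\theta)^2 - (\beta t_\lambda)^2 \\
E_\lambda &= \sqrt{E_\theta^{2} + \left(\frac{\beta}{\alpha}\right)^{2}\left(t_\theta^{2} - t_\lambda^{2}\right)}
\end{aligned}
\tag{5}
$$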
The intersections between the optimisation bound and the FPE determine the position of vertices B
and E in Figure 1. Vertex B represents the fastest possible code within the FPE which shares the
same metric value as θ. Any optimised version of θ with a runtime faster than B is guaranteed to
outperform the original unoptimised code in terms of M. Similarly, vertex E represents the slowest
possible code with the same metric value as θ. By definition, any optimised version of θ must run
faster than E.
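For the $Et^n$ family the two intersections can be found in closed form. The sketch below assumes, following Figure 1, that vertex B lies where the optimisation bound meets the $P_{max}$ energy bound and vertex E where it meets the $P_{min}$ energy bound. The function names are illustrative and the closed form is our own rearrangement of Equation 3, not a formula quoted from the text.

```python
def optimisation_bound_etn(t, t_theta, e_theta, n):
    """Energy on the Et^n optimisation bound at runtime t (Equation 3)."""
    return e_theta * (t_theta / t) ** n


def vertex_on_power_line(t_theta, e_theta, n, power):
    """Intersect the Et^n optimisation bound with the line E = power * t.

    Solving power * t = e_theta * (t_theta / t)^n for t gives
    t = (e_theta * t_theta^n / power)^(1 / (n + 1)).
    Use power = p_max for vertex B and power = p_min for vertex E.
    Returns the (runtime, energy) coordinates of the vertex.
    """
    t = (e_theta * t_theta ** n / power) ** (1.0 / (n + 1))
    return t, power * t
```

With n = 0 the metric is energy alone, the bound is a horizontal line, and the vertices reduce to the runtimes at which the platform would consume the same energy as θ while drawing maximum or minimum power.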
Contribution Bound
All optimised versions of the initial, unoptimised code θ must appear inside the FPE in the region
below the optimisation bound. The contribution bound further subdivides this region into runtime
and power optimisations.
Performance engineers seek to use the most appropriate tools while searching for optimisations.
Conventional time-based performance engineering techniques are more appropriate when searching
for optimisations which result in large reductions in runtime, whereas energy-aware techniques
are better suited to finding optimisations which primarily reduce power consumption. POSE uses
the contribution bound to make this distinction.
Definition 2. An optimisation θ → λ with respect to metric M is considered to be a power
optimisation if the improvement in terms of M stems primarily from a reduction in power draw, such
that $M(t_\theta, P_\lambda t_\theta)$ dominates $M(t_\lambda, P_\theta t_\lambda)$.
Most optimisations will impact both runtime and power consumption to some degree. Definition 2
determines which of these impacts causes most improvement in terms of metric M. It does this by
treating them as if they were two separate optimisations: a pure power optimisation
$(t_\theta, P_\theta t_\theta) \to (t_\theta, P_\lambda t_\theta)$, and a pure runtime optimisation
$(t_\theta, P_\theta t_\theta) \to (t_\lambda, P_\theta t_\lambda)$, and then comparing them to see
which is most beneficial. Power optimisations are those which derive most of their benefits from
reduced power consumption rather than shorter runtimes, meaning that $M(t_\theta, P_\lambda t_\theta)$ dominates
$M(t_\lambda, P_\theta t_\lambda)$.
Curve C-θ in Figure 1 links all points for which power and runtime factors contribute to M in
the same ratio as the original code. By Definition 2, any power-optimised versions of θ must lie
below this contribution bound.
The equation for the contribution bound also depends on the metric chosen. It is obtained by
letting $M(t_\theta, P_\lambda t_\theta) = M(t_\lambda, P_\theta t_\lambda)$, expanding the definition of M, re-arranging to make $P_\lambda$ the
subject, then finally multiplying by $t_\lambda$ to provide a result in terms of energy. The general form for
$Et^n$ metrics, and EDS and EDD metrics is derived as follows:
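$$
\begin{aligned}
M(t_\theta, P_\lambda t_\theta) &= M(t_\lambda, P_\theta t_\lambda) \\
P_\lambda t_\theta \cdot t_\theta^{n} &= P_\theta t_\lambda \cdot t_\lambda^{n} \\
P_\lambda &= P_\theta \left(\frac{t_\lambda}{t_\theta}\right)^{n+1} \\
E_\lambda &= P_\theta t_\lambda \left(\frac{t_\lambda}{t_\theta}\right)^{n+1}
\end{aligned}
\tag{6}
$$

$$
\begin{aligned}
M(t_\theta, P_\lambda t_\theta) &= M(t_\lambda, P_\theta t_\lambda) \\
\alpha P_\lambda t_\theta + \beta t_\theta &= \alpha P_\theta t_\lambda + \beta t_\lambda \\
\alpha P_\lambda + \beta &= \frac{t_\lambda}{t_\theta}\left(\alpha P_\theta + \beta\right) \\
\alpha P_\lambda &= \frac{t_\lambda}{t_\theta}\left(\alpha P_\theta + \beta\right) - \beta \\
P_\lambda &= \frac{t_\lambda}{t_\theta}\left(P_\theta + \frac{\beta}{\alpha}\right) - \frac{\beta}{\alpha} \\
E_\lambda &= \frac{t_\lambda^{2}}{t_\theta}\left(P_\theta + \frac{\beta}{\alpha}\right) - t_\lambda \frac{\beta}{\alpha}
\end{aligned}
\tag{7}
$$

$$
\begin{aligned}
M(t_\theta, P_\lambda t_\theta) &= M(t_\lambda, P_\theta t_\lambda) \\
\sqrt{(\alpha P_\lambda t_\theta)^2 + (\beta t_\theta)^2} &= \sqrt{(\alpha P_\theta t_\lambda)^2 + (\beta t_\lambda)^2} \\
(\alpha P_\lambda t_\theta)^2 &= (\alpha P_\theta t_\lambda)^2 + (\beta t_\lambda)^2 - (\beta t_\theta)^2 \\
P_\lambda^{2} &= \left(P_\theta \frac{t_\lambda}{t_\theta}\right)^{2} + \left(\frac{\beta t_\lambda}{\alpha t_\theta}\right)^{2} - \left(\frac{\beta}{\alpha}\right)^{2} \\
P_\lambda &= \sqrt{\left(P_\theta \frac{t_\lambda}{t_\theta}\right)^{2} + \left(\frac{\beta t_\lambda}{\alpha t_\theta}\right)^{2} - \left(\frac{\beta}{\alpha}\right)^{2}} \\
E_\lambda &= t_\lambda \cdot \sqrt{\left(P_\theta \frac{t_\lambda}{t_\theta}\right)^{2} + \left(\frac{\beta t_\lambda}{\alpha t_\theta}\right)^{2} - \left(\frac{\beta}{\alpha}\right)^{2}}
\end{aligned}
\tag{8}
$$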
The intersection between the contribution bound and the $P_{min}$ energy bound determines the position of vertex
C in Figure 1. This vertex represents the fastest possible code which still meets the criteria to count
as a power-optimised version of θ. Any optimisation which reduces runtime below that of C must
have a larger impact on runtime than on power consumption, and as such would be considered a
runtime optimisation.
Vertex C can also be interpreted as the best possible outcome for power optimisation. This is
because, in addition to having the smallest runtime of any power optimisation, it also has the lowest
possible power draw as it lies on the $P_{min}$ energy bound. As such, it will have the best possible
metric value of any point within the power optimised region.
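For the $Et^n$ family the position of vertex C has a simple closed form, obtained by setting the contribution bound of Equation 6 equal to the $P_{min}$ energy bound. The sketch below is our own rearrangement under that assumption, with illustrative function names.

```python
def contribution_bound_etn(t, t_theta, p_theta, n):
    """Energy on the Et^n contribution bound at runtime t (Equation 6)."""
    return p_theta * t * (t / t_theta) ** (n + 1)


def vertex_c_etn(t_theta, p_theta, p_min, n):
    """Coordinates of vertex C for the Et^n family.

    Setting p_theta * t * (t / t_theta)^(n + 1) = p_min * t gives
    t_C = t_theta * (p_min / p_theta)^(1 / (n + 1)).
    Returns (t_C, E_C) with E_C = p_min * t_C.
    """
    t_c = t_theta * (p_min / p_theta) ** (1.0 / (n + 1))
    return t_c, p_min * t_c
```

Because $P_{min}$ is no greater than the power draw of θ, the resulting runtime never exceeds that of θ, and it moves closer to θ as n grows.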
Optimisation Limit
The bounds described so far delineate those regions of the energy/runtime plane in which runtime
and power optimised versions of a given code can be found. The optimisation limit further partitions
runtime optimisations into those which could potentially be outperformed by some hypothetical
power optimisation and those which strictly dominate all possible power optimisations.
As its name suggests, the optimisation limit is closely related to the optimisation bound. The
optimisation limit links all points with the same metric value as a reference code, and as such is
similarly defined by Equations 3-5 for each of the metrics considered. The only difference is that
the optimisation limit connects all points with the same metric value as vertex C rather than θ.
Vertex C represents the best possible outcome from power optimisation; all optimisations which
lie below the optimisation limit must strictly dominate any possible power optimisation. Vertex
A lies on the intersection between the optimisation limit and the $P_{max}$ energy bound in Figure 1.
This vertex represents the fastest possible code with the same metric value as C, which in turn
corresponds to the best possible outcome from power optimisation. As such, any optimisation
which results in a faster code than A will outperform all possible power optimisations.
Because the optimisation bound and the optimisation limit are both based on Equations 3-5, the
expressions for their coordinates are also similar. The only difference is that C replaces θ as the
reference point used, yielding Equations 9-11, for $Et^n$, EDS and EDD, respectively.
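$$
\begin{aligned}
M(\lambda) &= M(C) \\
E_\lambda &= P_{min} t_C \left(\frac{t_C}{t_\lambda}\right)^{n}
\end{aligned}
\tag{9}
$$

$$
\begin{aligned}
M(\lambda) &= M(C) \\
E_\lambda &= P_{min} t_C + \frac{\beta}{\alpha}\left(t_C - t_\lambda\right)
\end{aligned}
\tag{10}
$$

$$
\begin{aligned}
M(\lambda) &= M(C) \\
E_\lambda &= \sqrt{(P_{min} t_C)^{2} + \left(\frac{\beta}{\alpha}\right)^{2}\left(t_C^{2} - t_\lambda^{2}\right)}
\end{aligned}
\tag{11}
$$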
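Vertex A can be located for the $Et^n$ family by intersecting Equation 9 with the $P_{max}$ energy bound. The sketch below is our own rearrangement of that intersection, with illustrative function names, and takes the runtime of vertex C as an input.

```python
def optimisation_limit_etn(t, t_c, p_min, n):
    """Energy on the Et^n optimisation limit at runtime t (Equation 9)."""
    return p_min * t_c * (t_c / t) ** n


def vertex_a_etn(t_c, p_min, p_max, n):
    """Coordinates of vertex A for the Et^n family.

    Setting p_max * t = p_min * t_c * (t_c / t)^n gives
    t_A = t_c * (p_min / p_max)^(1 / (n + 1)).
    Returns (t_A, E_A) with E_A = p_max * t_A.
    """
    t_a = t_c * (p_min / p_max) ** (1.0 / (n + 1))
    return t_a, p_max * t_a
```

Any code faster than the returned runtime lies to the left of vertex A and so, by the argument above, outperforms every possible power optimisation.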
4.2 POSE Insights
Figure 1 shows how POSE models partition the FPE into four distinct regions, each with different
performance characteristics.
Region 1 contains runtime optimisations which dominate the best case power optimisation in
terms of a given metric M (Strong Runtime Optimisation). Region 2 contains runtime optimisations
which dominate θ in terms of M, yet may be outperformed by some power optimised version of θ
(Weak Runtime Optimisation). Region 3 contains optimisations for which improvements to M are
primarily due to reduced power consumption (Power Optimisation). Finally, Region 4 corresponds
to codes with performance strictly worse than that of θ (Performance Degradation).
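The four regions can be expressed as a small classifier. The sketch below is restricted to the $Et^n$ family and makes several assumptions that should be read as illustrative: a lower metric value is taken to dominate a higher one, points lying exactly on a bound are assigned to the less favourable region, and the function and label names are our own.

```python
def etn(energy, time, n):
    """Et^n cost metric (lower is better)."""
    return energy * time ** n


def classify(t_l, e_l, t_th, e_th, p_min, p_max, n=1):
    """Assign a candidate code (t_l, e_l) to a POSE region relative to theta."""
    if t_l <= 0 or not (p_min * t_l <= e_l <= p_max * t_l):
        return "outside envelope"
    m_theta = etn(e_th, t_th, n)
    m_lambda = etn(e_l, t_l, n)
    if m_lambda >= m_theta:
        return "4: performance degradation"
    p_th, p_l = e_th / t_th, e_l / t_l
    # Definition 2: compare the pure power and pure runtime components.
    pure_power = etn(p_l * t_th, t_th, n)
    pure_runtime = etn(p_th * t_l, t_l, n)
    if pure_power < pure_runtime:
        return "3: power optimisation"
    # Vertex C: contribution bound meets the P_min energy bound.
    t_c = t_th * (p_min / p_th) ** (1.0 / (n + 1))
    m_c = etn(p_min * t_c, t_c, n)
    if m_lambda < m_c:
        return "1: strong runtime optimisation"
    return "2: weak runtime optimisation"
```

As a usage example, with θ at 100 s and 10000 J on a platform drawing between 50 W and 200 W, a candidate at (100 s, 6000 J) is classified as a power optimisation, while one at (40 s, 4000 J) is a strong runtime optimisation under $Et^1$.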
The five vertices labelled A to E correspond to the extreme outcomes of energy-aware
optimisation. Comparing these outcomes to the initial performance of θ provides quantitative insights
about the optimisation potential for this code. These insights fall into two broad categories which
together help performance engineers decide if power optimisation is likely to prove worthwhile.
The first category relates to the potential benefits from power optimisation. The difference in
energy between points θ and D places an upper bound on the amount of energy which can be saved
by reducing power consumption. Similarly, the difference in value between M(θ) and M(C) gives
an upper bound for the improvement in a metric which can be delivered by power optimisation.
The second category relates to the scope a code has for power optimisation. The ratio $t_\theta / t_B$
represents the smallest speed-up which guarantees a code that outperforms θ with respect to M.
The difference in runtime between points E and θ represents the maximum increase in runtime
which could be traded off to achieve a slower yet more energy efficient code. Finally, $t_\theta / t_A$ is the
smallest speed-up guaranteed to outperform any power optimised version of θ.
POSE results can be given in either relative or absolute forms by taking the ratio or the difference
between values. For example, an optimisation guaranteed to outperform θ in terms of M must
reduce runtime by at least $t_\theta - t_B$ seconds, or equivalently yield a relative speed-up of $t_\theta / t_B$ times.
Expressions for POSE coordinates are all linear functions in terms of $t_\theta$, meaning the ratios between
them remain constant regardless of changes to runtime. This property means relative results can
be used to predict large-scale optimisation characteristics from tests with shorter runtimes.
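This scale-invariance can be checked by hand for vertex B in the $Et^n$ case. The following worked derivation is an illustrative sketch rather than part of the original derivations; it assumes that B lies on the $P_{max}$ energy bound and that $E_\theta = P_\theta t_\theta$, where $P_\theta$ is the mean power draw of θ.

```latex
\begin{aligned}
P_{max}\, t_B &= E_\theta \left(\frac{t_\theta}{t_B}\right)^{n}
              = P_\theta t_\theta \left(\frac{t_\theta}{t_B}\right)^{n} \\
t_B^{\,n+1} &= \frac{P_\theta}{P_{max}}\; t_\theta^{\,n+1} \\
t_B &= t_\theta \left(\frac{P_\theta}{P_{max}}\right)^{\frac{1}{n+1}}
\quad\Longrightarrow\quad
\frac{t_\theta}{t_B} = \left(\frac{P_{max}}{P_\theta}\right)^{\frac{1}{n+1}}
\end{aligned}
```

Under these assumptions $t_B$ is linear in $t_\theta$, and the required speed-up depends only on the ratio of power draws and the exponent n, not on how long the code runs.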
The results given by POSE are all bounds, and the true beneits of power optimisation will be more
modest in practice. Additionally, how potential benefits are realised will be application-specific;
two applications that exhibit the same runtime and power draw will have the same opportunities
for energy-optimisation, but may require different solutions. Even so, these values are useful as
they allow performance engineers to make informed decisions about where best to focus their
optimisation efforts.
4.3 POSE Metric Tuning
One thing to note is how metric tuning parameters affect POSE models. Figure 2 shows how POSE
varies in response to different Et^n exponents; in this case, Energy (Et^0) and Energy-Delay Cubed
Product (Et^3). Higher values of n place more emphasis on runtime, resulting in less scope for
energy-aware optimisation. POSE is able to reflect this change through its various insights and
identify exactly how much the opportunity for energy-aware optimisation has been reduced by.
Figure 3 shows how POSE models for the EDS and EDD metrics compare to the same model
built with the Et^3 metric. The parameterisations of the α and β coefficients used for the EDS and
EDD POSE models in this figure were chosen to mirror the relative energy/time costs of Et^3. As a
result, the gradients of their optimisation bounds at point θ are the same as for Et^3. Even so, the
optimisation bound for Et^3 diverges from the other metrics, moving further away from the origin
and suggesting a larger scope for energy-aware optimisation.
This divergence happens because Et^n metrics produce perverse optimisation incentives; Et^n
places more emphasis on energy optimisations for efficient codes and on runtime optimisations for
fast codes [34]. Any small optimisation which improves energy efficiency will increase the apparent
benefits of further energy optimisations, leading to the concave curvature of the optimisation
bounds for Et^n metrics.
Avoiding perverse optimisation incentives was a key design principle for both EDS and EDD.
They do not over-emphasise energy optimisation for efficient codes or runtime optimisations for fast
ones. As a result, POSE models built for these metrics will show less opportunity for energy-aware
optimisation than equivalent models built for Et^n metrics if equivalent parameterisations are used.
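The divergence described above can be illustrated numerically. The sketch below traces the optimisation bound (the contour of constant metric value through θ) for an Et^n metric and for EDS, taking the metric forms E·t^n and αE + βt. The point θ = (100 s, 25 000 J) is hypothetical, and β is chosen here so that both bounds share the same gradient at θ; that choice is an assumption made for illustration, not the parameterisation used elsewhere in the paper.

```python
def etn_bound(t, t_theta, e_theta, n):
    """Energy at runtime t on the Et^n bound, where E * t**n equals M(theta)."""
    return e_theta * (t_theta / t) ** n


def eds_bound(t, t_theta, e_theta, alpha, beta):
    """Energy at runtime t on the EDS bound, where alpha*E + beta*t equals M(theta)."""
    return e_theta + (beta / alpha) * (t_theta - t)


# Hypothetical measured code theta.
t_theta, e_theta, n = 100.0, 25_000.0, 3

# Match gradients at theta: dE/dt = -n * E/t for Et^n and -beta/alpha for EDS.
alpha = 1.0
beta = n * e_theta / t_theta

for t in (80.0, 90.0, 100.0, 110.0, 120.0):
    e_etn = etn_bound(t, t_theta, e_theta, n)
    e_eds = eds_bound(t, t_theta, e_theta, alpha, beta)
    print(f"t = {t:6.1f} s   Et^{n} bound = {e_etn:9.1f} J   EDS bound = {e_eds:9.1f} J")
```

Under these assumptions the two bounds touch at θ and the Et^n bound lies further from the origin everywhere else, which is the behaviour attributed to Figure 3.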
5 POSE INVESTIGATION
This section uses POSE to investigate the energy-aware optimisation characteristics of codes from
the Mantevo [23] mini-application benchmark suite. Experiments were carried out on the Taurus
system operated by TU Dresden and an Intel KNL Developer Access Program (DAP) platform
installed at the University of Warwick. Results were gathered using the HDEEM instrumentation
infrastructure present on Taurus [18], and using a power meter on the Intel DAP platform.
Fig. 2. Et^n POSE Model Tunability. Axes: Runtime (s) against Energy (J). Curves through θ: Et^3 POSE Model; Et^0 (Energy) POSE Model.
Fig. 3. Comparison of POSE Models for Different Metrics. Axes: Runtime (s) against Energy (J). Panels: (a) EDS, showing the Et^3 and EDS POSE Models; (b) EDD, showing the Et^3 and EDD POSE Models.
Taurus is a heterogeneous cluster with several different classes of node. Work was carried out
on two of these partitions: one featuring dual twelve core Intel Xeon E5-2680 v3 CPUs and 64GB
of memory, and one featuring dual fourteen core Intel Xeon E5-2680 v4 CPUs and 64GB of memory.
This represents one CPU from the Haswell microarchitecture family and one from the Broadwell
microarchitecture family (a die shrink of Haswell). The KNL platform contains an Intel Xeon Phi
7210 CPU with 16 GB MCDRAM and 96 GB of DDR4 memory.
5.1 Feasible Performance Envelope
The first step when applying POSE is to construct an FPE. Many manufacturers publish power
dissipation figures for their hardware; however, for safety reasons these are usually conservative
estimates. POSE works best when the power bounds are as tight as possible; it is therefore advisable
to determine Pmin and Pmax empirically.
The value of Pmin is dependent on the programming model used and the nature of the application.
For this reason, four custom micro-benchmarks have been developed. Each micro-benchmark
executes a single jmp instruction each clock cycle, but does so in differing circumstances. omp_serial
is representative of an OpenMP application that contains a substantial portion of serial work; as
such, it executes some instructions in a parallel block before looping on a single jmp instruction in
a critical section. omp_parallel executes the same jmp instruction, but does so in a parallel block
on all available threads. mpi_parallel similarly performs the same jmp loop, but does so on each
available MPI process. Finally, mpi_serial is representative of an MPI application that contains
substantial serial work; as such, each rank waits on an MPI_Barrier except for rank 0, which
executes a jmp loop.

Table 1. Values that can be used for Pmin and Pmax for the three platforms (power draw in W)
Benchmark       Haswell    Broadwell   KNL
omp_serial      111.90     125.78      122.20
omp_parallel    181.14     180.90      166.00
mpi_parallel    167.76     172.20      166.10
mpi_serial      219.79     203.10      194.90
FIRESTARTER     345.57     329.69      311.80

Table 2. Description and run parameters for applications
Application     Runtime Parameters / Input File                 Description
PathFinder      -x medium_test.adj_list                         A graph search application
TeaLeaf         tea_bm16_short.in                               A linear heat conduction equation solver
CloverLeaf      clover_bm4.in                                   A Lagrangian-Eulerian Hydrodynamics benchmark
CloverLeaf3D    clover_bm.in                                    A 3D implementation of CloverLeaf
miniMD          -t 1 -n 30000 --half_neigh 0 (OpenMP)           A molecular dynamics proxy using neighbour lists
                -t num_ranks -n 30000 --half_neigh 0 (MPI)
CoMD            -e -x 90 -y 90 -z 90                            A classical molecular dynamics proxy using cell lists
miniFE          -nx 512 -ny 512 -nz 256                         A mini-app for unstructured implicit finite element codes
HPCCG           128 128 5376 (OpenMP)                           A simple conjugate gradient benchmark code
                128 128 [5376 ÷ num_ranks] (MPI)
Any non-trivial code will perform more work per unit time than these minimal benchmarks.
Additional work means more transistors changing state per cycle, and hence a higher power draw.
FIRESTARTER [19] is used as the benchmark to measure Pmax. This tool is designed to trigger
peak power consumption on x86-64 based servers. It consists of hand optimised assembly routines
which raise the activity factor above the level achievable with high level languages.
In this paper we only consider fully occupied nodes, running at their highest available CPU
frequency. This corresponds to 24 threads/ranks at 2.5GHz for Haswell, 28 threads/ranks at 2.4GHz
for Broadwell and 256 threads/ranks at 1.3GHz for KNL.
Table 1 gives the values that can be used for the FPE for the Haswell, Broadwell and KNL systems
used in this study. Haswell has the largest range of power draw of the three platforms, while the
KNL platform has the smallest.
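The comparison of power ranges can be checked directly from Table 1. The sketch below transcribes the table and, as an assumption for this example, takes the lowest micro-benchmark reading on each platform as Pmin and the FIRESTARTER reading as Pmax; it also shows how a pair of power bounds turns into energy bounds for a given runtime via E = P·t. The helper names are illustrative only.

```python
# Power draw in watts, transcribed from Table 1.
TABLE_1 = {
    "Haswell": {"omp_serial": 111.90, "omp_parallel": 181.14,
                "mpi_parallel": 167.76, "mpi_serial": 219.79,
                "FIRESTARTER": 345.57},
    "Broadwell": {"omp_serial": 125.78, "omp_parallel": 180.90,
                  "mpi_parallel": 172.20, "mpi_serial": 203.10,
                  "FIRESTARTER": 329.69},
    "KNL": {"omp_serial": 122.20, "omp_parallel": 166.00,
            "mpi_parallel": 166.10, "mpi_serial": 194.90,
            "FIRESTARTER": 311.80},
}


def power_range(platform):
    """Width of the power envelope: Pmax (FIRESTARTER) minus the lowest reading."""
    row = TABLE_1[platform]
    return row["FIRESTARTER"] - min(row.values())


def energy_bounds(p_min, p_max, runtime_s):
    """Lowest and highest feasible energy (J) for a code with the given runtime."""
    return p_min * runtime_s, p_max * runtime_s


for platform in TABLE_1:
    print(f"{platform:10s} power range = {power_range(platform):6.2f} W")
```

With these readings the ranges come out at roughly 234 W, 204 W and 190 W for Haswell, Broadwell and KNL respectively.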
Of particular note, there is a large difference in the power draw of the omp_serial and mpi_serial
benchmarks, which are both representative of serialised portions of parallel applications. MPI
applications with critical sections typically keep idle threads active using spinlocks. As a result, in
addition to the single active thread performing computation, the other threads also consume energy
checking the state of a barrier while waiting for continuation. For applications that must necessarily
contain some serial work, OpenMP will therefore likely produce a more energy-efficient solution.
5.2 POSE Models for Code Optimisation
The next step in this investigation is to capture energy and runtime figures for real applications.
The Mantevo application suite was chosen because it covers a broad range of scientific computing
workloads.
All codes were compiled with the Intel C Compiler (icc) version 18.0. Each application was run
fifteen times on the same node to reduce the impact of random variations in runtime and energy.
Application parameters were tuned where necessary to ensure reasonable run times on single
nodes.

Table 3. Runtime and Energy for the Mantevo applications
                         Haswell                     Broadwell                   KNL
Application    Variant   Runtime (s)  Energy (J)     Runtime (s)  Energy (J)     Runtime (s)  Energy (J)
PathFinder     OpenMP    212.91       38 952.89      194.72       34 205.62      243.75       40 803.46
TeaLeaf        OpenMP    322.65       98 593.56      293.42       79 075.99      126.97       34 610.35
TeaLeaf        MPI       322.23       100 404.76     294.59       83 223.96      132.66       35 907.50
CloverLeaf     OpenMP    187.89       56 459.45      164.26       41 721.59      82.78        22 213.09
CloverLeaf     MPI       187.41       57 443.92      162.99       43 601.47      116.78       29 950.38
CloverLeaf3D   OpenMP    130.19       34 461.48      105.34       26 652.87      87.69        20 485.27
CloverLeaf3D   MPI       111.71       33 056.51      101.14       26 912.57      158.55       40 805.38
miniMD         OpenMP    140.83       34 939.13      104.68       24 986.60      89.28        20 037.16
miniMD         MPI       132.06       34 493.59      99.08        25 366.42      85.93        20 114.99
CoMD           OpenMP    109.49       23 206.60      90.59        19 350.64      97.29        19 242.44
CoMD           MPI       87.73        19 239.83      70.65        15 251.38      57.54        12 975.61
miniFE         OpenMP    113.66       26 754.53      103.15       23 764.70      203.89       33 898.46
miniFE         MPI       86.35        23 953.16      74.27        18 845.43      87.25        19 706.01
HPCCG          OpenMP    199.90       39 226.18      136.01       28 021.77      167.80       28 249.75
HPCCG          MPI       57.15        17 738.34      52.26        14 472.43      78.86        17 764.55
Table 2 details the applications used in this study and the runtime parameters or input files
used. Each application except PathFinder (for which only an OpenMP implementation exists) was
executed using pure OpenMP and pure MPI; where parameters differed between these executions,
they have been listed separately.
In these experiments, we use Et^3 because Laros et al. found that this strikes the right balance
between energy and runtime for HPC [26]. This implies that a 1% reduction in runtime is
approximately three times more valuable than the same reduction in energy consumption. In order to
facilitate fair comparison between metrics, EDS and EDD parameterisations are based on the same
3:1 ratio. Whereas the Et^n parameter operates in a relative fashion, EDS and EDD parameters are
based on absolute costs of consumption. In our study, the energy costs are around 300 times greater
than the runtime costs; for simplicity, we therefore scale runtime costs by a factor of 300 before
applying the same 3:1 ratio in order to compensate for this effect.
The parameterisation used for EDS is obtained by multiplying the 300 scaling factor and the
3:1 ratio together, resulting in the parameters α = 1 and β = 3 × 300 = 900. The parameterisation
of EDD is very similar, except that it uses a multiplier of √3 rather than 3 to account for the
square root present in the definition of EDD. This results in a parameterisation of α = 1 and
β = √3 × 300 ≈ 519.615.
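The arithmetic behind these parameters can be written out as a short sketch. The metric forms used below (E·t^n for Et^n, αE + βt for EDS, and the square root of (αE)² + (βt)² for EDD) follow the expressions that appear in Equations 13 to 15; the function and constant names are illustrative, and the sample point is the PathFinder measurement on Haswell from Table 3.

```python
import math

SCALE = 300.0   # factor used in the text to put runtime on the same scale as energy
RATIO = 3.0     # runtime weighted three times as heavily as energy

ALPHA = 1.0
BETA_EDS = RATIO * SCALE               # 3 x 300 = 900
BETA_EDD = math.sqrt(RATIO) * SCALE    # sqrt(3) x 300, about 519.615


def et_n(energy, runtime, n=3):
    """Et^n metric: energy multiplied by runtime to the power n."""
    return energy * runtime ** n


def eds(energy, runtime, alpha=ALPHA, beta=BETA_EDS):
    """EDS metric: weighted sum of energy and runtime."""
    return alpha * energy + beta * runtime


def edd(energy, runtime, alpha=ALPHA, beta=BETA_EDD):
    """EDD metric: Euclidean distance from the origin in weighted (E, t) space."""
    return math.hypot(alpha * energy, beta * runtime)


# PathFinder on Haswell, from Table 3.
runtime_s, energy_j = 212.91, 38952.89
print(f"beta (EDS) = {BETA_EDS:.3f}, beta (EDD) = {BETA_EDD:.3f}")
print(f"Et^3 = {et_n(energy_j, runtime_s):.4e}")
print(f"EDS  = {eds(energy_j, runtime_s):.2f}")
print(f"EDD  = {edd(energy_j, runtime_s):.2f}")
```

All three metrics are lower-is-better, so a figure of merit improves when the value printed here falls.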
We note that the metric parameterisation used in this paper is based on the previous work of
Laros et al. [26] and is used only to illustrate our model and analysis techniques. We believe that
the EDS and EDD metrics can be parameterised such that they produce a dollar-cost value based
on purchasing, energy and maintenance costs. These values will be machine and site
specific, and so we leave this to future work.
Table 3 lists the mean energy and runtime costs incurred by running codes from the Mantevo
suite. PathFinder, miniMD and TeaLeaf broadly cover the full range of mini-application power
consumption, with PathFinder having the lowest average power consumption on each platform and
TeaLeaf having the highest average power consumption. POSE models are reproduced graphically
in Figures 4, 5 and 6 respectively, and model summaries are presented in Table 4.
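Average power is not listed in Table 3, but it follows from each row as energy divided by runtime. The sketch below does this for the three applications singled out above, using the Haswell columns and the variants shown in Figures 4 to 6; the dictionary layout and function name are illustrative.

```python
# (runtime in s, energy in J) on Haswell, transcribed from Table 3.
HASWELL = {
    "PathFinder (OpenMP)": (212.91, 38952.89),
    "miniMD (MPI)": (132.06, 34493.59),
    "TeaLeaf (OpenMP)": (322.65, 98593.56),
}


def average_power(runtime_s, energy_j):
    """Mean power draw in watts over the whole run."""
    return energy_j / runtime_s


for name, (runtime_s, energy_j) in HASWELL.items():
    print(f"{name:22s} {average_power(runtime_s, energy_j):7.2f} W")
```

These averages can then be compared against the Pmin and Pmax readings in Table 1 to see where each code sits inside the feasible performance envelope.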
As a result of having the lowest average power consumption, our results show that PathFinder
is the code least amenable to power optimisation. PathFinder's average power usage is near (or
even below) Pomp_parallel, which indicates that it is not using all available threads for all of its
execution. Using Pomp_serial as a baseline power opens up a larger power optimisation envelope, as
seen in Figure 4, but reducing parallelisation to achieve this power usage will almost certainly lead
to a longer runtime. Instead, it seems more likely that increasing parallelisation (and the average
power draw) would more effectively reduce energy consumption by lowering the runtime. Without
reducing parallelisation further, only 1% of the energy can be saved through power optimisation
on Haswell and KNL in the very best case, equating to a saving of between 341 J and 386 J.
For Broadwell, the power draw is already below Pomp_parallel; for this reason, some of the insights
presented in Table 4 show negative improvements, since this would require increasing the power
draw.

Legend: Pmax Energy Bound; Pmin_parallel Energy Bound; Pmin_serial Energy Bound; B-E Optimisation Bound; C-θ Contribution Bound; A-C Optimisation Limit.
Fig. 4. Et^3 POSE Models for PathFinder. Axes: Runtime (s) against Energy (J). Panels: (a) Haswell, (b) Broadwell, (c) KNL.
Fig. 5. EDS POSE Models for miniMD using MPI. Axes: Runtime (s) against Energy (J). Panels: (a) Haswell, (b) Broadwell, (c) KNL.
Fig. 6. EDD POSE Models for TeaLeaf using OpenMP. Axes: Runtime (s) against Energy (J). Panels: (a) Haswell, (b) Broadwell, (c) KNL.

Table 4. Platform specific POSE model summaries
                                                         Haswell                  Broadwell                KNL
PathFinder (Runtime; Energy)                             212.91 s; 38 952.89 J    194.72 s; 34 205.62 J    243.75 s; 40 803.46 J
  Maximum energy saved by reduced power consumption      386.33 J; 1.01×          −1019.23 J; 0.97×        341.51 J; 1.01×
  Maximum improvement in Et^3 from power optimisation    1.02×                    0.94×                    1.02×
  Minimum speed-up guaranteed to outperform θ            31.30 s; 1.17×           28.36 s; 1.17×           35.10 s; 1.17×
  Worst case slowdown as a result of power optimisation  0.53 s; 1.00×            −1.42 s; 0.99×           0.51 s; 1.00×
  Speed-up required to dominate power optimisation       32.20 s; 1.18×           25.90 s; 1.15×           35.98 s; 1.17×
miniMD [MPI] (Runtime; Energy)                           132.06 s; 34 493.59 J    99.08 s; 25 366.42 J     85.93 s; 20 114.99 J
  Maximum energy saved by reduced power consumption      12 339.58 J; 1.56×       3986.12 J; 1.19×         5842.41 J; 1.41×
  Maximum improvement in EDS from power optimisation     1.18×                    1.07×                    1.13×
  Minimum speed-up guaranteed to outperform θ            8.94 s; 1.07×            5.94 s; 1.06×            5.51 s; 1.07×
  Worst case slowdown as a result of power optimisation  11.56 s; 1.09×           3.57 s; 1.04×            5.48 s; 1.06×
  Speed-up required to dominate power optimisation       27.96 s; 1.27×           12.31 s; 1.14×           14.86 s; 1.21×
TeaLeaf [OpenMP] (Runtime; Energy)                       322.65 s; 98 593.56 J    293.42 s; 79 075.99 J    126.97 s; 34 610.35 J
  Maximum energy saved by reduced power consumption      40 148.49 J; 1.69×       25 996.31 J; 1.49×       13 532.77 J; 1.64×
  Maximum improvement in EDD from power optimisation     1.20×                    1.13×                    1.16×
  Minimum speed-up guaranteed to outperform θ            10.98 s; 1.04×           14.32 s; 1.05×           4.03 s; 1.03×
  Worst case slowdown as a result of power optimisation  30.80 s; 1.10×           18.74 s; 1.06×           9.61 s; 1.08×
  Speed-up required to dominate power optimisation       62.92 s; 1.24×           46.83 s; 1.19×           20.72 s; 1.19×
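The PathFinder rows of Table 4 can be approximately reproduced from the Table 3 measurements and the Table 1 power bounds. The sketch below is one way to do so for an Et^n metric. It uses the expressions for tB, tC and tA that appear in Equations 13, 16 and 20 before Pθ is set to Pmax; the expression for tE is assumed by analogy with tB, and taking Pmin as the Haswell omp_parallel reading is likewise an assumption made for this example. Function and key names are illustrative.

```python
def etn_pose_insights(t_theta, e_theta, p_min, p_max, n=3):
    """Five POSE insights for an Et^n metric, each as an (absolute, relative) pair."""
    p_theta = e_theta / t_theta                      # average power of the code
    k = 1.0 / (n + 1)
    t_b = t_theta * (p_theta / p_max) ** k           # form used in Eq. 16
    t_e = t_theta * (p_theta / p_min) ** k           # assumed by analogy with t_b
    t_a = t_theta * (p_min ** 2 / (p_theta * p_max)) ** k   # form used in Eq. 20
    return {
        "energy_saved": (e_theta - p_min * t_theta, p_theta / p_min),
        "metric_gain": (None, (p_theta / p_min) ** 2),      # from t_C in Eq. 13
        "min_speedup": (t_theta - t_b, t_theta / t_b),
        "max_slowdown": (t_e - t_theta, t_e / t_theta),
        "dominating_speedup": (t_theta - t_a, t_theta / t_a),
    }


# PathFinder on Haswell: Table 3 measurement, Table 1 power bounds.
pathfinder = etn_pose_insights(212.91, 38952.89, p_min=181.14, p_max=345.57)
for name, (absolute, relative) in pathfinder.items():
    shown = "" if absolute is None else f"{absolute:9.2f}; "
    print(f"{name:20s} {shown}{relative:.2f}x")
```

Small differences from the published figures are to be expected, since the inputs used here are the rounded values printed in the tables.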
Conversely, TeaLeaf is the application most amenable to power optimisation, closely followed by
CloverLeaf. This is illustrated by the difference in scale between Figures 4 and 6, with POSE models
for TeaLeaf showing much greater scope for energy-aware optimisation. On all three platforms,
TeaLeaf has the highest average power usage (311.6W on Haswell, 282.5W on Broadwell, 272.6W
on KNL), and therefore logically has the most scope for power optimisation. As shown in Table 4,
energy consumption can potentially be improved by between 1.49× and 1.69× on the three platforms.
Through power optimisation the EDD FoM value can also be improved in the best case by between
1.13× and 1.20×.
For both TeaLeaf and CloverLeaf, there is only a small variation in performance between the
OpenMP and MPI variants. On the two Xeon platforms, the runtimes are within a few seconds in the
worst case, with MPI marginally faster, but using more energy as a result of a higher average power
draw. For the KNL platform, OpenMP leads to a higher average power draw but a considerably
lower runtime and therefore a more energy-efficient execution.
miniMD falls somewhere between the extremes of PathFinder and TeaLeaf with an average
power lying somewhere between 225W and 260W. On all three platforms, the MPI variant has a
lower runtime than the OpenMP implementation, but on both Broadwell and KNL, it runs 18W and
10W higher, respectively, leading to higher overall energy usage. Table 4 shows that the EDS FoM
can be improved by 1.18× for Haswell, 1.07× for Broadwell, and 1.13× for the KNL platform in the
best case focussing only on power optimisations. Compared to TeaLeaf, the possible improvements
from power optimisation are more modest, which logically follows from miniMD having a lower
average power usage.
Between the three platforms used in this paper, the maximum energy that can be saved is usually
lower on KNL, as a result of it exhibiting lower runtimes. The only exception to this is PathFinder,
where the application does not use all available threads throughout the execution. Across all
applications, the Broadwell platform consistently outperforms the Haswell platform in terms of
both runtime and energy usage; the Broadwell CPU is a die shrink of the Haswell CPU with two
additional cores per socket and lower average power usage, showing the progress made between
processor generations.
Table 5. POSE model summaries for Different Metrics on Haswell
                                                           Et^3                  EDS                   EDD
PathFinder (212.91 s; 38 952.89 J)
  Maximum energy saved by reduced power consumption        386.33 J; 1.01×       386.33 J; 1.01×       386.33 J; 1.01×
  Maximum improvement in metric from power optimisation    1.02×                 1.00×                 1.00×
  Minimum speed-up guaranteed to outperform θ              31.30 s; 1.17×        27.80 s; 1.15×        24.96 s; 1.13×
  Worst case slowdown as a result of power optimisation    0.53 s; 1.00×         0.36 s; 1.00×         0.23 s; 1.00×
  Speed-up required to dominate power optimisation         32.20 s; 1.18×        28.42 s; 1.15×        25.37 s; 1.14×
TeaLeaf [OpenMP] (322.65 s; 98 593.56 J)
  Maximum energy saved by reduced power consumption        40 148.49 J; 1.69×    40 148.49 J; 1.69×    40 148.49 J; 1.69×
  Maximum improvement in metric from power optimisation    2.85×                 1.24×                 1.20×
  Minimum speed-up guaranteed to outperform θ              9.77 s; 1.03×         10.36 s; 1.03×        10.98 s; 1.04×
  Worst case slowdown as a result of power optimisation    45.06 s; 1.14×        37.14 s; 1.12×        30.80 s; 1.10×
  Speed-up required to dominate power optimisation         81.76 s; 1.34×        71.50 s; 1.28×        62.92 s; 1.24×
Table 5 shows how summary data changes between the three metrics outlined in this paper
for PathFinder and TeaLeaf on the Haswell platform. Overall, our POSE
models show relatively minor differences between the Et^3, EDS and EDD metrics. For PathFinder, in
particular, the choice of metric has very little effect on the insights presented. As previously stated,
PathFinder is the application that has the lowest average power draw throughout its execution
and therefore its POSE model places tight bounds on its scope for energy-aware optimisation,
regardless of the metric used. In particular, only half a second can be theoretically traded for a
more energy-efficient execution of PathFinder.
The differences between metrics become more apparent on an application that is potentially
more amenable to power optimisation, like TeaLeaf. The biggest difference between Et^3 and our
novel metrics is illustrated in the second insight for TeaLeaf, where the Et^3 FoM value can be
improved by up to 2.85× through power optimisation. For EDS and EDD, the FoM values
can only be improved by 1.24× and 1.20×, respectively.
The divergence in metric value highlights an important issue with Et^n metrics. In this paper, the
EDS and EDD metrics have been parameterised to allow a fair comparison to Et^3; however, one
particular feature of both the EDS and EDD metrics is that they can be parameterised to represent
the monetary cost [34]; with carefully chosen parameters for EDS and EDD, POSE models can
represent potential savings in dollar cost. The divergence between Et^3 and EDS/EDD may lead an
application developer to incorrectly focus on optimising for power when runtime optimisations
would deliver a better EDS or EDD value, and therefore a lower monetary cost.
For many of the applications summarised in Table 3, the choice of programming model results in
minor differences in runtime and energy performance. However, both miniFE and HPCCG use less
time and energy when parallelised with MPI. Figure 7 shows POSE models and power traces for
the OpenMP and MPI variants of miniFE.
The OpenMP implementations of both miniFE and HPCCG exhibit a lower average power draw
than their MPI equivalents but have a much higher runtime. Like PathFinder previously, this
suggests that they are not fully exploiting the available parallelism. From the POSE models shown
in Figure 7a, it seems there is less scope for power optimisation on the OpenMP variant and so it
would be more appropriate to explore runtime optimisations.
Figure 7b shows a temporal power trace for miniFE using OpenMP and MPI. From this data it is
clear that there is a serialised period present in the OpenMP variant, likely due to the grid creation
and decomposition being serialised. By parallelising this stage (as has been done successfully in
the MPI implementation), the runtime can be greatly improved and a more energy-efficient code
can be achieved.
Fig. 7. POSE models for miniFE using OpenMP and MPI on Haswell, and associated temporal power traces. Panels: (a) EDD POSE Models, Runtime (s) against Energy (J), showing θomp and θmpi with the Pmax, Pomp_parallel and Pmpi_parallel bounds; (b) Power Trace, Runtime (s) against Power (W) for OpenMP and MPI, with Pmax, Pomp_parallel and Pmpi_parallel marked.
This leads to the observation that codes with less scope for power optimisation likely have much
greater scope for energy optimisation through increasing parallelisation, thereby significantly
reducing runtime at the expense of slightly increasing the average power consumption. Conversely,
codes with more scope for power optimisation can likely have their power draw reduced through
close analysis of the balance between the memory and CPU subsystems, or through reducing the
parallelisation if this is possible without increasing the runtime. Furthermore, the choice of parallel
programming model may be important if there are inherently serial portions in the application.
6 SYSTEM SUMMARY POSE
Ordinary POSE models quantify the scope which exists for the energy-aware optimisation of
a specific code running on a given system. In this section, we present System Summary POSE,
an extension of POSE that allows developers to reason about system-wide power optimisation
characteristics without reference to any particular code.
Typically, POSE models use system Pmin and Pmax energy bounds together with the energy and
runtime costs incurred when running a code to calculate the scope that code has for power and
runtime optimisation. System Summary POSE is a meta-heuristic which determines the range
of results conventional POSE models could produce for a given system. This "bound-of-bounds"
approach allows developers to understand the scope a system has for energy-aware software
optimisation independent of the code being run.
6.1 System Summary POSE Derivation
System Summary POSE examines how the insights provided by POSE models vary in response
to changes in the initial code θ. Increasing the power consumption of a code while keeping its
metric value fixed leads to a corresponding increase in the scope for power optimisation. Figure 8
illustrates how such a change would be reflected in the output of an Et^3 POSE model.
System Summary POSE determines which point along the optimisation bound B-E maximises
the value of each of the five key insights provided by POSE models. This maximum value then
serves as an upper limit on the values which the corresponding insight could take for real codes
running on the target system.
Fig. 8. Et^3 System Summary POSE Intuition. Axes: Runtime (s) against Energy (J). Marked points: θ, A, B, C, D, E. Legend: Pmax Energy Bound; Pmin Energy Bound; B-E Optimisation Bound; C-θ Contribution Bound; A-C Optimisation Limit.
In practice, all POSE insights assume their maximum values at either vertex B or vertex E because
these points correspond to extremes of power consumption. As such, another interpretation
of System Summary POSE is as a pair of ordinary POSE models for the Pmin and Pmax energy
benchmarks.
Ordinary POSE models require four input parameters: the Pmin and Pmax values which define an
FPE, and the energy and runtime costs for a specific code. A key feature of the relative forms of
POSE insights is that their runtime terms always cancel. Furthermore, the power draws at vertices
B and E are by definition Pmax and Pmin respectively. As a result, System Summary POSE is able to
derive system-wide power optimisation limits from just two unknowns, namely the values for Pmin
and Pmax.
The first relative POSE insight, Eθ/ED, places an upper limit on the amount of energy which can
be saved by reducing power consumption. Figure 8 makes it clear that this value is maximised when
θ = B and therefore Pθ = Pmax. Intuitively, the code with the most to gain from energy optimisation
is the one which exhibits the highest rate of power consumption. Substituting Pθ = Pmax into
the definition of the first insight yields the following metric agnostic expression for system-wide
energy savings:
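\[
\underset{\theta}{\operatorname{arg\,max}}\ \frac{E_\theta}{E_D} = B,
\qquad
\frac{E_B}{E_D} = \frac{P_{\max} \cdot t_\theta}{P_{\min} \cdot t_\theta} = \frac{P_{\max}}{P_{\min}}
\tag{12}
\]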
The second relative POSE insight, M(θ)/M(C), limits the maximum improvement in a metric which
can be attributed to power optimisation. This value depends on the metric used; however, for any
valid metric (a monotonically increasing function of time and energy) this value is again maximised
when θ is at point B. Substituting Pθ = Pmax and PC = Pmin yields the following system-wide
bounds for the Et^n, EDS and EDD metrics respectively:
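\[
\begin{aligned}
\underset{\theta}{\operatorname{arg\,max}}\ \frac{M(\theta)}{M(C)}
  = \frac{E_\theta\, t_\theta^{\,n}}{E_C\, t_C^{\,n}}
  &= \frac{P_{\max}\, t_\theta^{\,n+1}}{P_{\min}\, t_C^{\,n+1}} \\
  &= \frac{P_{\max}^{2}\, t_\theta^{\,n+1}}{P_{\min}^{2}\, t_\theta^{\,n+1}}
     \qquad \left( t_C = t_\theta \left( \frac{P_{\min}}{P_\theta} \right)^{\frac{1}{n+1}} \right) \\
  &= \left( \frac{P_{\max}}{P_{\min}} \right)^{2}
\end{aligned}
\tag{13}
\]

\[
\begin{aligned}
\underset{\theta}{\operatorname{arg\,max}}\ \frac{M(\theta)}{M(C)}
  = \frac{\alpha E_\theta + \beta t_\theta}{\alpha E_C + \beta t_C}
  &= \frac{t_\theta}{t_C} \cdot \frac{P_{\max} + \beta/\alpha}{P_{\min} + \beta/\alpha}
     \qquad \left( t_C = t_\theta \left( \frac{P_{\min} + \beta/\alpha}{P_\theta + \beta/\alpha} \right) \right) \\
  &= \left( \frac{P_{\max} + \beta/\alpha}{P_{\min} + \beta/\alpha} \right)^{2}
\end{aligned}
\tag{14}
\]

\[
\begin{aligned}
\underset{\theta}{\operatorname{arg\,max}}\ \frac{M(\theta)}{M(C)}
  = \frac{\sqrt{(\alpha E_\theta)^2 + (\beta t_\theta)^2}}{\sqrt{(\alpha E_C)^2 + (\beta t_C)^2}}
  &= \frac{t_\theta}{t_C} \cdot \sqrt{\frac{P_{\max}^{2} + (\beta/\alpha)^2}{P_{\min}^{2} + (\beta/\alpha)^2}}
     \qquad \left( t_C = t_\theta \sqrt{\frac{P_{\min}^{2} + (\beta/\alpha)^2}{P_\theta^{2} + (\beta/\alpha)^2}} \right) \\
  &= \frac{P_{\max}^{2} + (\beta/\alpha)^2}{P_{\min}^{2} + (\beta/\alpha)^2}
\end{aligned}
\tag{15}
\]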
The third relative POSE insight, tθ/tB, represents the smallest speed-up which guarantees a code
that outperforms θ with respect to M. Uniquely, this value is maximised when θ runs at minimum
power, and is therefore located at point E. This is because any speed-up at all would guarantee an
improvement in terms of M for codes with maximum power consumption Pmax. The derivations
for this system-wide bound for the Et^n, EDS and EDD metrics are as follows:
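\[
\underset{\theta}{\operatorname{arg\,max}}\ \frac{t_\theta}{t_B}
  = \frac{t_\theta}{t_\theta \left( \frac{P_\theta}{P_{\max}} \right)^{\frac{1}{n+1}}},
\qquad
\frac{t_E}{t_B} = \left( \frac{P_{\max}}{P_{\min}} \right)^{\frac{1}{n+1}}
\tag{16}
\]

\[
\underset{\theta}{\operatorname{arg\,max}}\ \frac{t_\theta}{t_B}
  = \frac{t_\theta}{t_\theta \cdot \frac{P_\theta + \beta/\alpha}{P_{\max} + \beta/\alpha}},
\qquad
\frac{t_E}{t_B} = \frac{P_{\max} + \beta/\alpha}{P_{\min} + \beta/\alpha}
\tag{17}
\]

\[
\underset{\theta}{\operatorname{arg\,max}}\ \frac{t_\theta}{t_B}
  = \frac{t_\theta}{t_\theta \cdot \sqrt{\frac{P_\theta^{2} + (\beta/\alpha)^2}{P_{\max}^{2} + (\beta/\alpha)^2}}},
\qquad
\frac{t_E}{t_B} = \sqrt{\frac{P_{\max}^{2} + (\beta/\alpha)^2}{P_{\min}^{2} + (\beta/\alpha)^2}}
\tag{18}
\]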
The fourth relative POSE insight, tE/tθ, represents the maximum slowdown which could be traded
off to achieve a slower yet more energy efficient code. This insight is maximised at vertex B because
this point has the most scope for power optimisation. As a result, this system-wide bound takes on
the same values as Equations 16, 17 and 18 for the three metrics considered.
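\[
\underset{\theta}{\operatorname{arg\,max}}\ \frac{t_E}{t_\theta} = B,
\qquad
\frac{t_E}{t_\theta}\bigg|_{\theta = B} = \frac{t_E}{t_B}
\tag{19}
\]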
The final relative POSE insight, tθ/tA, represents the smallest speed-up guaranteed to outperform
any power optimised version of θ. This insight is once again maximised at vertex B because this
point has the most scope for power optimisations and as such larger runtime optimisations are
required in order to guarantee they outperform all possible power optimisations. Equations 20 to 22
give the derivations of this system-wide bound for the Et^n, EDS and EDD metrics.
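\[
\underset{\theta}{\operatorname{arg\,max}}\ \frac{t_\theta}{t_A}
  = \frac{t_\theta}{t_\theta \left( \frac{P_{\min}^{2}}{P_\theta P_{\max}} \right)^{\frac{1}{n+1}}}
  = \frac{1}{\left( \frac{P_{\min}^{2}}{P_{\max}^{2}} \right)^{\frac{1}{n+1}}}
  = \left( \frac{P_{\max}}{P_{\min}} \right)^{\frac{2}{n+1}}
\tag{20}
\]

\[
\underset{\theta}{\operatorname{arg\,max}}\ \frac{t_\theta}{t_A}
  = \frac{t_\theta}{t_\theta \cdot \frac{(P_{\min} + \beta/\alpha)^2}{(P_\theta + \beta/\alpha)(P_{\max} + \beta/\alpha)}}
  = \frac{1}{\frac{(P_{\min} + \beta/\alpha)^2}{(P_{\max} + \beta/\alpha)^2}}
  = \left( \frac{P_{\max} + \beta/\alpha}{P_{\min} + \beta/\alpha} \right)^{2}
\tag{21}
\]

\[
\underset{\theta}{\operatorname{arg\,max}}\ \frac{t_\theta}{t_A}
  = \frac{t_\theta}{t_\theta \cdot \frac{P_{\min}^{2} + (\beta/\alpha)^2}{\sqrt{P_\theta^{2} + (\beta/\alpha)^2} \cdot \sqrt{P_{\max}^{2} + (\beta/\alpha)^2}}}
  = \frac{1}{\left( \frac{P_{\min}^{2} + (\beta/\alpha)^2}{P_{\max}^{2} + (\beta/\alpha)^2} \right)}
  = \frac{P_{\max}^{2} + (\beta/\alpha)^2}{P_{\min}^{2} + (\beta/\alpha)^2}
\tag{22}
\]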
Equations 12 to 22 highlight a number of interesting properties. Equation 12 does not depend on
the metric used, as it deals exclusively with energy savings and does not consider runtime. For Et^n
metrics, Equation 13 shows that the runtime exponent n does not influence the degree to which
power optimisation can improve an Et^n metric. Further, for the EDS and EDD metrics, Equations 14
and 21, and Equations 15 and 22 show that the second and fifth relative POSE insights are identical,
i.e., the maximum improvement in the metric that can be attributed to power optimisation is equal
to the minimum speed-up required to dominate power optimisations. Finally, the fact that the third
and fourth relative POSE insights are identical shows that the maximum slowdown from power
optimisation is the same as the smallest speed-up which is guaranteed to improve performance in
terms of M. Most significantly, all of these equations only depend on Pmin and Pmax. As a result,
System Summary POSE analysis can be carried out on any system for which these parameters are
known.
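Because the closed forms in Equations 12 to 22 depend only on Pmin and Pmax (plus the metric parameters), they are straightforward to evaluate. The sketch below does so for the Haswell readings in Table 1, taking Pmax as the FIRESTARTER value and Pmin as the lower of the two parallel micro-benchmarks. The β/α values of 900 and √3 × 300 are the parameterisations given in Section 5.2; the function and key names are illustrative.

```python
import math


def system_summary(p_min, p_max, n=3,
                   k_eds=900.0, k_edd=math.sqrt(3.0) * 300.0):
    """System Summary POSE bounds; k_eds and k_edd are the beta/alpha ratios."""
    r = p_max / p_min                                        # Eq. 12
    s = (p_max + k_eds) / (p_min + k_eds)                    # Eq. 17
    d = (p_max ** 2 + k_edd ** 2) / (p_min ** 2 + k_edd ** 2)  # Eq. 15
    return {
        "energy": r,
        "Et^n": {"metric": r ** 2,                 # Eq. 13
                 "speedup": r ** (1.0 / (n + 1)),  # Eq. 16
                 "slowdown": r ** (1.0 / (n + 1)),  # Eq. 19
                 "dominate": r ** (2.0 / (n + 1))},  # Eq. 20
        "EDS": {"metric": s ** 2,                  # Eq. 14
                "speedup": s, "slowdown": s,       # Eqs. 17 and 19
                "dominate": s ** 2},               # Eq. 21
        "EDD": {"metric": d,                       # Eq. 15
                "speedup": math.sqrt(d), "slowdown": math.sqrt(d),  # Eqs. 18, 19
                "dominate": d},                    # Eq. 22
    }


# Haswell, Table 1: omp_parallel = 181.14 W, mpi_parallel = 167.76 W.
haswell = system_summary(p_min=min(181.14, 167.76), p_max=345.57)
print(f"energy: {haswell['energy']:.2f}x")
for metric in ("Et^n", "EDS", "EDD"):
    row = haswell[metric]
    print(metric, ", ".join(f"{key} {value:.2f}x" for key, value in row.items()))
```

Grouping the results by metric makes the identities noted above visible at a glance: the speed-up and slowdown entries always agree, and for EDS and EDD the metric and dominate entries agree as well.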
6.2 System Summary POSE Investigation
Building a System Summary POSE model requires only that we can measure the Pmin and Pmax
bounds for a given system. Table 1 in Section 5.1 provides these bounds for the three
platforms used throughout this paper. Table 6 shows the five key insights that System Summary
POSE can produce using the Et^3, EDS and EDD metrics, where Pmax = PFIRESTARTER and Pmin =
min(Pomp_parallel, Pmpi_parallel).
The irst key insight, the maximum amount of energy that can be saved by reducing power
consumption, does not depend on the metric used and shows that power optimisation could deliver a
2.06× improvement in energy consumption for Haswell nodes, a 2.01× improvement for Broadwell
ACM Trans. Arch. Code Optim., Vol. X, No. Y, Article . Publication date: April 2019.
The Power-Optimised Sotware Envelope :21
Table 6. System Summary POSE Insights for Haswell, Broadwell and KNL

| Metric | Insight | Haswell | Broadwell | KNL |
|---|---|---|---|---|
| Any | Maximum energy saved by reduced power consumption | 2.06× | 2.01× | 1.88× |
| Et3 | Maximum improvement in Et3 from power optimisation | 4.24× | 4.03× | 3.53× |
| Et3 | Minimum speed-up guaranteed to outperform θ | 1.20× | 1.19× | 1.17× |
| Et3 | Worst case slowdown as a result of power optimisation | 1.20× | 1.19× | 1.17× |
| Et3 | Speed-up required to dominate power optimisation | 1.44× | 1.42× | 1.37× |
| EDS | Maximum improvement in EDS from power optimisation | 1.36× | 1.35× | 1.29× |
| EDS | Minimum speed-up guaranteed to outperform θ | 1.17× | 1.16× | 1.14× |
| EDS | Worst case slowdown as a result of power optimisation | 1.17× | 1.16× | 1.14× |
| EDS | Speed-up required to dominate power optimisation | 1.36× | 1.35× | 1.29× |
| EDD | Maximum improvement in EDD from power optimisation | 1.31× | 1.30× | 1.23× |
| EDD | Minimum speed-up guaranteed to outperform θ | 1.14× | 1.14× | 1.11× |
| EDD | Worst case slowdown as a result of power optimisation | 1.14× | 1.14× | 1.11× |
| EDD | Speed-up required to dominate power optimisation | 1.31× | 1.30× | 1.23× |

nodes and a 1.88× improvement for KNL, in the best case. However, an improvement of this
magnitude would require reducing power consumption to near Pmin, with no increase in runtime.
The three platforms have similar minimum power draws, but the Broadwell and KNL nodes
have a lower maximum power draw, leaving less scope for energy-aware
optimisation.
Each of the remaining insights depends on the parameterisation of the metric used. For
Et3, an application's runtime could potentially be increased by between 1.17× and 1.20× in the
worst case, in order to produce a slower, but more energy-efficient code. A minimum speed-up of
the same magnitude is required on each respective platform to outperform any power optimisation
in terms of Et3. Furthermore, a speed-up greater than approximately 1.4× is likely to outperform
any power-optimised application.
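One way to see where slowdown figures of this size come from is a break-even calculation. Since E = P × t, an Etn metric can be written as P × t^(n+1). Assuming the baseline code draws Pmax and the power-optimised code draws Pmin (assumptions of this sketch, not a restatement of the equations in Section 6.1), the two codes have the same Etn value when the slowdown s satisfies s^(n+1) = Pmax / Pmin. The Python sketch below applies this to the rounded ratios in the first row of Table 6, which are taken here to be Pmax / Pmin; the function name is hypothetical.

```python
def break_even_slowdown(power_ratio: float, n: int) -> float:
    """Slowdown at which a code drawing Pmin matches the E * t**n value
    of a baseline drawing Pmax.

    With E = P * t, the metric E * t**n equals P * t**(n + 1), so the
    break-even slowdown s satisfies s**(n + 1) == Pmax / Pmin.
    """
    if power_ratio < 1.0:
        raise ValueError("power_ratio is Pmax / Pmin and must be at least 1")
    return power_ratio ** (1.0 / (n + 1))


# Rounded ratios from the first row of Table 6, assumed to be Pmax / Pmin.
POWER_RATIO = {"Haswell": 2.06, "Broadwell": 2.01, "KNL": 1.88}

for platform, ratio in POWER_RATIO.items():
    print(f"{platform}: {break_even_slowdown(ratio, n=3):.2f}x")
```

Under these assumptions the sketch gives 1.20×, 1.19× and 1.17× for Haswell, Broadwell and KNL, in line with the Et3 slowdown rows of Table 6.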
Across all three platforms, both EDS and EDD show less scope for system-wide energy-aware
optimisation. On Haswell, a 1.17× reduction in runtime will lead to better EDS performance, and a
1.36× reduction in runtime will outperform any power optimisation. For Broadwell and KNL, a
1.16× and 1.14× reduction in runtime will lead to a better EDS FoM value, respectively; a 1.35×
and 1.29× reduction in runtime will outperform any power optimisations. The insights produced
using EDD suggest that even smaller improvements are required to outperform power optimisations.
As with ordinary POSE models, the biggest difference between the three metrics lies in the
second insight – the maximum improvement in the metric FoM value that can be achieved through
power optimisation. In the best case, the Et3 FoM value can be improved by up to 4.24× on Haswell,
whereas the EDS and EDD values can only be improved by 1.36× and 1.31×, respectively. This
pattern is repeated across both Broadwell and KNL architectures, whereby the Et3 value can be
improved by up to 4.03× and 3.53×, whereas the EDS and EDD metrics can only be improved by
between 1.23× and 1.35×.
For applications exhibiting a high power draw (near Pmax), power optimisation can deliver
a significant reduction in energy costs; but the results in Table 6 show that modest runtime
optimisations (greater than ≈ 1.3×) are more likely to reduce energy expenditure.
System Summary POSE for Components
System Summary POSE models can also be built for individual subsystems as well as entire nodes.
Table 7 gives values that can be used for the Pmin and Pmax bounds for the CPU components of the
three platforms used in this study.
Table 7. Values that can be used for Pmin and Pmax for the CPU component of the three platforms

| Benchmark | Haswell | Broadwell | KNL |
|---|---|---|---|
| omp_serial | 52.33 | 67.85 | 80.84 |
| omp_parallel | 114.71 | 115.79 | 125.07 |
| mpi_parallel | 102.18 | 108.48 | 125.05 |
| mpi_serial | 148.54 | 147.66 | 142.64 |
| FIRESTARTER | 231.00 | 223.87 | 229.10 |

Table 8. System Summary POSE Insights for the CPU components in Haswell, Broadwell and KNL

| Metric | Insight | Haswell | Broadwell | KNL |
|---|---|---|---|---|
| Any | Maximum energy saved by reduced power consumption | 2.26× | 2.06× | 1.83× |
| Et3 | Maximum improvement in Et3 from power optimisation | 5.11× | 4.26× | 3.36× |
| Et3 | Minimum speed-up guaranteed to outperform θ | 1.23× | 1.20× | 1.16× |
| Et3 | Worst case slowdown as a result of power optimisation | 1.23× | 1.20× | 1.16× |
| Et3 | Speed-up required to dominate power optimisation | 1.50× | 1.44× | 1.35× |
| EDS | Maximum improvement in EDS from power optimisation | 1.40× | 1.35× | 1.31× |
| EDS | Minimum speed-up guaranteed to outperform θ | 1.18× | 1.16× | 1.14× |
| EDS | Worst case slowdown as a result of power optimisation | 1.18× | 1.16× | 1.14× |
| EDS | Speed-up required to dominate power optimisation | 1.40× | 1.35× | 1.31× |
| EDD | Maximum improvement in EDD from power optimisation | 1.60× | 1.53× | 1.48× |
| EDD | Minimum speed-up guaranteed to outperform θ | 1.27× | 1.24× | 1.22× |
| EDD | Worst case slowdown as a result of power optimisation | 1.27× | 1.24× | 1.22× |
| EDD | Speed-up required to dominate power optimisation | 1.60× | 1.53× | 1.48× |

As previously, the parameters for EDS and EDD are chosen to facilitate fair comparison with
Et3, i.e., using a 1:3 ratio. Since the magnitude of energy costs for the CPU will be around 200
times greater than that of the runtime, we additionally factor this into our choice of value for
the parameters. For EDS this results in the parameters α = 1 and β = 3 × 200 = 600. The
parameterisation of EDD is similar, but using a multiplier of √3 instead, resulting in α = 1 and
β = √3 × 200 ≈ 346.410.
Table 8 shows that power optimisation could potentially deliver a 2.26× reduction in Haswell
CPU energy consumption, a 2.06× reduction in Broadwell CPU energy consumption and a 1.83×
reduction in KNL CPU energy consumption. For each of the three metrics, the Haswell and
Broadwell CPUs show slightly more scope for power optimisation, with greater speed-ups required
to dominate power optimisation. This is also the case for KNL, except when using the Et3 metric –
where smaller improvements are required due to the KNL CPU accounting for a greater proportion
of the node's power draw (≈ 75%, compared to ≈ 65% for Haswell and Broadwell).
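The first row of Table 8 can be reproduced from the CPU power figures in Table 7. The sketch below is a minimal Python illustration, assuming the same choice of bounds given for the node-level analysis, i.e., Pmax = PFIRESTARTER and Pmin = min(Pomp_parallel, Pmpi_parallel), and assuming that with runtime held fixed the maximum energy saving is the ratio Pmax / Pmin. The dictionary layout and function name are illustrative only.

```python
# CPU power figures transcribed from Table 7, keyed by benchmark then platform.
TABLE_7 = {
    "omp_serial": {"Haswell": 52.33, "Broadwell": 67.85, "KNL": 80.84},
    "omp_parallel": {"Haswell": 114.71, "Broadwell": 115.79, "KNL": 125.07},
    "mpi_parallel": {"Haswell": 102.18, "Broadwell": 108.48, "KNL": 125.05},
    "mpi_serial": {"Haswell": 148.54, "Broadwell": 147.66, "KNL": 142.64},
    "FIRESTARTER": {"Haswell": 231.00, "Broadwell": 223.87, "KNL": 229.10},
}


def max_energy_saving(platform: str) -> float:
    """Return Pmax / Pmin for a platform's CPU component.

    Assumes Pmax is the FIRESTARTER power draw and Pmin is the lower of the
    omp_parallel and mpi_parallel power draws. With runtime held fixed,
    E = P * t, so the energy ratio equals the power ratio.
    """
    p_max = TABLE_7["FIRESTARTER"][platform]
    p_min = min(TABLE_7["omp_parallel"][platform],
                TABLE_7["mpi_parallel"][platform])
    return p_max / p_min


for platform in ("Haswell", "Broadwell", "KNL"):
    print(f"{platform}: {max_energy_saving(platform):.2f}x")
```

With the Table 7 values this gives 2.26×, 2.06× and 1.83× for the Haswell, Broadwell and KNL CPUs, matching the first row of Table 8.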
CPU energy consumption accounts for a significant portion of the energy used by high performance
systems [16]. It is therefore unsurprising that System Summary POSE yields similar
values for the platforms in this study and the CPUs they contain. However, the results in Table 8
do suggest that, in general, there is a greater opportunity for energy-aware optimisation on the
CPU, and these optimisations should translate to energy savings on the whole node.
System Summary POSE models for individual components are especially useful, since the results
can more easily be transferred to other machines containing the same, or similar, hardware. Providing
the Pmin and Pmax bounds can be measured, a System Summary POSE model can be built and used.
7 CONCLUSION
Historically, runtime was the main factor used to define the performance of HPC applications. More
recently, unsustainable increases in power draw have led energy consumption to join runtime as a
primary constraint in HPC. Performance engineers are facing a future in which they must minimise
both runtime and energy consumption in tandem. Existing tools must be updated and new tools
must be developed in order to support this emerging class of optimisation.
This paper outlines POSE, a mathematical and visual modelling tool which captures the trade-off
between software power consumption and runtime. POSE allows developers to compare the
potential benefits of power and runtime optimisation and determine which approach is most
suitable for their code.
In this paper, we have demonstrated the POSE model by studying the optimisation characteristics
of codes from the Mantevo mini-application suite on three Intel-based platforms. On each platform,
TeaLeaf was found to have the most scope for single node power optimisation, with potential
improvements in the Et3 metric of up to 2.85× on the Haswell architecture. Conversely, PathFinder
had the least scope for power optimisation with improvements to the same metric limited to just
1.02× without sacrificing concurrency. When POSE models are formulated using energy-aware
metrics such as EDS and EDD, the potential benefits of power optimisation are considerably
constrained. Power optimisation of TeaLeaf could improve the EDS metric by up to 1.24× in the
best case, and the EDD metric value by up to just 1.20× on the Haswell platform. For all other
platforms, the best-case improvement is constrained further still.
This paper also presented an extension to POSE that allows developers to reason about the power
optimisation potential for a system or component, independently of any particular code, using only
the minimum and maximum power draw. Our results showed that, for the Haswell architecture,
power optimisation is limited to reducing node-level energy consumption by at most 2.06×. Again,
the Broadwell and KNL architectures show less opportunity for power optimisation – a maximum
improvement of 2.01× and 1.88×, respectively.
Between the three metrics outlined in this paper, Etn metrics consistently show a greater scope
for power optimisation. In particular, System Summary POSE models show that the Et3 FoM value
can be improved by up to 4.24× from power optimisation on the Haswell platform. For both EDS
and EDD, the models suggest at most a 1.36× or 1.31× improvement – highlighting how Etn metrics
may lead application developers to pursue power optimisation where little benefit may be derived.
POSE models like those contained in this paper are useful because they allow performance
engineers to focus their efforts where they will yield the greatest return. Additionally, it may be
possible to build energy-efficient supercomputers using component-level System Summary POSE
models to choose hardware that is amenable to this class of optimisation.
Future Work
This paper lays the foundation for the POSE model and outlines its use in energy-aware optimisation
studies. POSE has a number of potential uses which we intend to explore in the future. First, we
plan to revisit the use of frequency scaling and P-state selection to identify whether POSE can
highlight additional opportunities for power-optimisation [35]. Second, we wish to use POSE to
investigate the use of accelerator architectures; we believe GPU and FPGA architectures may offer
greater opportunities for power optimisation compared to the x86-64 architectures outlined in
this paper [14]. Third, we would like to validate the insights presented by POSE through a code
optimisation case study.
Further, we would like to investigate potential extensions to our POSE model. In contrast to
POSE, the Roofline Model of Energy can be used to analyse how the characteristics of a code relate
to its power draw. The relationship between a POSE model and its roofline model of energy may
offer additional, more targeted insights for optimisation opportunities.
ACKNOWLEDGMENTS
The authors would like to thank Thomas Ilsche, Daniel Hackenberg and the Center of Information
Services and High Performance Computing (ZIH) at TU Dresden. This research was funded in part
by a UK Technology Strategy Board project, number 131197 (Energy-Efficiency Tools for High-Performance
Multi- and Many-core Applications), which supported this collaboration between the
University of Warwick and Allinea Software Ltd. (now part of Arm Ltd.). Professor Stephen Jarvis
is an AWE William Penney Fellow.
Benchmarks, scripts and additional results can be found at https://github.com/pose-model.
REFERENCES
[1] Advanced Micro Devices. 2012. BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h Models 00h-0Fh
Processors.
[2] A. Alexandrov, F. I. Mihai, E. S. Klaus, and S. Chris. 1997. LogGP: Incorporating Long Messages into the LogP Model
for Parallel Computation. Journal of Parallel and Distributed Computing (JPDC) 44 (1997), 71–79.
[3] G. M. Amdahl. 1967. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. In
Proceedings of the Joint Computer Conference. 483–485.
[4] D. Bedard, M. Y. Lim, R. Fowler, and A. Porterfield. 2010. Powermon: Fine-Grained and Integrated Power Monitoring
for Commodity Computer Systems. In Proceedings of the IEEE SoutheastCon. 479–484.
[5] C. Bekas and A. Curioni. 2010. A New Energy Aware Performance Metric. Computer Science-Research and Development
25, 3-4 (2010), 187–195.
[6] B. D. Bingham and M. R. Greenstreet. 2008. Computation with Energy-Time Trade-Offs: Models, Algorithms and
Lower Bounds. In IEEE International Symposium on Parallel and Distributed Processing with Applications. 143–152.
[7] D. Brooks, V. Tiwari, and M. Martonosi. 2000. Wattch: A Framework for Architectural-level Power Analysis and
Optimizations. In Proceedings of the International Symposium on Computer Architecture (ISCA). 83–94.
[8] M. Burtscher, I. Zecena, and Z. Zong. 2014. Measuring GPU Power with the K20 Built-In Sensor. In Proceedings of
Workshop on General Purpose Processing Using GPUs. 28–36.
[9] J. Cao, D. Kerbyson, E. Papaefstathiou, and G. Nudd. 2000. Performance Modelling of Parallel and Distributed Computing
using PACE. In Proceedings of the IEEE International Performance Computing and Communications Conference (IPCCC).
485–492.
[10] J. W. Choi, D. Bedard, R. Fowler, and R. Vuduc. 2013. A Roofline Model of Energy. In Proceedings of the IEEE International
Symposium on Parallel & Distributed Processing (IPDPS). 661–672.
[11] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. 1993. LogP:
Towards a Realistic Model of Parallel Computation. In Proceedings of the ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming (PPOPP). 1–12.
[12] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le. 2010. RAPL: Memory Power Estimation and Capping. In
Proceedings of the International Symposium on Low-Power Electronics and Design (ISLPED). 189–194.
[13] J. Eastep, S. Sylvester, C. Cantalupo, B., F. Ardanaz, A. Al-Rawi, K. Livingston, F. Keceli, M. Maiterth, and S. Jana.
2017. Global Extensible Open Power Manager: A Vehicle for HPC Community Collaboration on Co-Designed Energy
Management Solutions. ISC 2017: High Performance Computing. Lecture Notes in Computer Science (LNCS) 10266 (2017),
394–412.
[14] S. A. Fahmy, K. Vipin, and S. Shreejith. 2015. Virtualized FPGA Accelerators for Efficient Cloud Computing. In
Proceedings of the IEEE International Conference on Cloud Computing Technology and Science. 430–435.
[15] V. W. Freeh, D. K. Lowenthal, F. Pan, N. Kappiah, R. Springer, B. L. Rountree, and M. E. Femal. 2007. Analyzing the
Energy-Time trade-off in High-Performance Computing Applications. IEEE Transactions on Parallel and Distributed
Systems 18, 6 (2007), 835–848.
[16] R. Ge, X. Feng, S. Song, H. Chang, D. Li, and K. W. Cameron. 2010. PowerPack: Energy Profiling and Analysis of
High-Performance Systems and Applications. IEEE Transactions on Parallel and Distributed Systems 21, 5 (2010),
658–671.
[17] R. Gonzales and M. Horowitz. 1996. Energy Dissipation in General Purpose Microprocessors. IEEE Journal of Solid
State Circuits 31, 9 (1996), 1277–1284.
[18] D. Hackenberg, T. Ilsche, J. Schuchart, R. Schöne, W. E. Nagel, M. Simon, and Y. Georgiou. 2014. HDEEM: High
Definition Energy Efficiency Monitoring. In Energy Efficient Supercomputing Workshop (E2SC). 1–10.
[19] D. Hackenberg, R. Oldenburg, D. Molka, and R. Schöne. 2013. Introducing FIRESTARTER: A Processor Stress Test
Utility. In International Green Computing Conference (IGCC ’13). 1–9.
[20] D. Hackenberg and M. K. Patterson. 2016. Evaluation of a new data center air-cooling architecture: The down-flow
Plenum. In Proceedings of the IEEE Conference on Thermal and Thermomechanical Phenomena in Electronic Systems.
395–403.
[21] S. D. Hammond, G. R. Mudalige, J. A. Smith, S. A. Jarvis, J. A. Herdman, and A. Vadgama. 2009. WARPP: A Toolkit for
Simulating High-performance Parallel Scientific Codes. In Proceedings of the International Conference on Simulation
Tools and Techniques (Simutools). 19:1–19:10.
[22] M. Harman and J. Clark. 2004. Metrics are Fitness Functions Too. In Proceedings of the International Symposium on
Software Metrics. 58–69.
[23] M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, H. K.
Thornquist, and R. W. Numrich. 2009. Improving Performance via Mini-Applications. SAND2009-5574 (2009).
[24] M. B. Kamble and K. Ghose. 1997. Analytical Energy Dissipation Models for Low Power Caches. In Proceedings of the
International Symposium on Low Power Electronics and Design. 143–148.
[25] R. Karp and V. Ramachandran. 1990. Handbook of Theoretical Computer Science. MIT Press, Cambridge, MA, Chapter
Parallel Algorithms for Shared-Memory Machines, 869–941.
[26] J. H. Laros, K. Pedretti, S. M. Kelly, Wei Shu, K. Ferreira, J. Vandyke, and C. Vaughan. 2013. Energy Delay Product. In
Energy-Efficient High Performance Computing: Measurement and Tuning. Springer, 51–55.
[27] J. H. Laros, P. Pokorny, and D. DeBonis. 2013. PowerInsight - A Commodity Power Measurement Capability. In
International Green Computing Conference (IGCC). 1–6.
[28] G. Lawson, V. Sundriyal, M. Sosonkina, and Y. Shen. 2015. Modeling Performance and Energy for Applications
Offloaded to Intel Xeon Phi. In Proceedings of the International Workshop on Hardware-Software Co-Design for High
Performance Computing. 7:1–7:8.
[29] S. Li, J. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. 2009. McPAT: An Integrated Power, Area, and
Timing Modeling Framework for Multicore and Manycore Architectures. In Proceedings of the IEEE/ACM International
Symposium on Microarchitecture (MICRO). 469–480.
[30] I. Manousakis and D. Nikolopoulos. 2012. BTL: A Framework for Measuring and Modeling Energy in Memory
Hierarchies. In IEEE International Symposium on Computer Architecture and High Performance Computing. 139–146.
[31] A. J. Martin, M. Nyström, and P. I. Pénzes. 2002. ET2: A Metric for Time and Energy Efficiency of Computation. In
Power Aware Computing. Springer, 293–315.
[32] F. J. Mesa-Martinez, M. Brown, J. Nayfach-Battilana, and J. Renau. 2007. Measuring Performance, Power, and
Temperature from Real Processors. In Proceedings of the Workshop on Experimental Computer Science. 16:1–16:10.
[33] K. Pedretti, S. L. Olivier, K. B. Ferreira, G. Shipman, and W. Shu. 2015. Early Experiences with Node-level Power
Capping on the Cray XC40 Platform. In Energy Efficient Supercomputing Workshop (E2SC), 2015. 1–10.
[34] S. I. Roberts, S. A. Wright, S. A. Fahmy, and S. A. Jarvis. 2017. Metrics for Energy-Aware Software Optimisation. ISC
2017: High Performance Computing. Lecture Notes in Computer Science (LNCS) 10266 (2017), 413–430.
[35] S. I. Roberts, S. A. Wright, D. Lecomber, C. January, J. Byrd, X. Oró, and S. A. Jarvis. 2015. POSE: A Mathematical
and Visual Modelling Tool to Guide Energy Aware Code Optimisation. In Proceedings of the International Green and
Sustainable Computing Conference (IGSC).
[36] I. Rodero, H. Viswanathan, E. K. Lee, M. Gamell, D. Pompili, and M. Parashar. 2012. Energy-Efficient Thermal-Aware
Autonomic Management of Virtualized HPC Cloud Infrastructure. Journal of Grid Computing 10, 3 (2012), 447–473.
[37] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E.
Cooper-Balis, and B. Jacob. 2011. The Structural Simulation Toolkit. ACM SIGMETRICS Performance Evaluation Review
38, 4 (2011), 37–42.
[38] E. Rotem, R. Ginosar, C. Weiser, and A. Mendelson. 2014. Energy Aware Race To Halt: A Down to EARTH Approach
for Platform Energy Management. IEEE Computer Architecture Letters 13, 1 (2014), 25–28.
[39] V. Srinivasan, D. Brooks, M. Gschwind, P. Bose, V. Zyuban, P. N. Strenski, and P. G. Emma. 2002. Optimizing Pipelines
for Power and Performance. In Proceedings of the International Symposium on Microarchitecture (MICRO). 333–344.
[40] S. Williams, A. Waterman, and D. Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore
Architectures. Commun. ACM 52, 4 (April 2009), 65–76.
[41] X. Wu, V. Taylor, J. Cook, and P. J. Mucci. 2016. Using Performance-Power Modeling to Improve Energy Efficiency of
HPC Applications. Computer 49, 10 (2016), 20–29.
[42] S. Yeo and H. Lee. 2011. Using Mathematical Modeling in Provisioning a Heterogeneous Cloud Computing Environment.
Computer 44, 8 (2011), 55–62.
