On the accuracy and usefulness of analytic energy models for
  contemporary multicore processors by Hofmann, Johannes et al.
arXiv:1803.01618v1 [cs.PF] 5 Mar 2018
On the accuracy and usefulness of analytic energy
models for contemporary multicore processors
Johannes Hofmann1, Georg Hager2, and Dietmar Fey1
1 Computer Architecture, University of Erlangen-Nuremberg, 91058 Erlangen, Germany,
johannes.hofmann@fau.de, dietmar.fey@fau.de
2 Erlangen Regional Computing Center (RRZE), 91058 Erlangen, Germany,
georg.hager@fau.de
Abstract. This paper presents refinements to the execution-cache-memory per-
formance model and a previously published power model for multicore proces-
sors. The combination of both enables a very accurate prediction of performance
and energy consumption of contemporary multicore processors as a function of
relevant parameters such as number of active cores as well as core and Uncore fre-
quencies. Model validation is performed on the Sandy Bridge-EP and Broadwell-
EP microarchitectures. Production-related variations in chip quality are demon-
strated through a statistical analysis of the fit parameters obtained on one hundred
Broadwell-EP CPUs of the same model. Insights from the models are used to ex-
plain the performance- and energy-related behavior of the processors for scalable
as well as saturating (i.e., memory-bound) codes. In the process we demonstrate
the models’ capability to identify optimal operating points with respect to high-
est performance, lowest energy-to-solution, and lowest energy-delay product and
identify a set of best practices for energy-efficient execution.
Keywords: Performance modeling, power modeling, energy modeling
1 Introduction
The usefulness of analytic models for the performance and power consumption of code
running on modern processors is undisputed. Here, “analytic” means a simplified de-
scription of the interactions between software and hardware, simple enough to identify
relevant performance and energy issues but also elaborate enough to be realistic at least
in some important scenarios. There is a large gray area between the extremes of mod-
eling procedures: Purely analytic, also called first-principles or white-box models, try
to start from known technical details of the hardware and how the software executes,
without additional phenomenological input such as measured quantities or parameter-
ized fit functions. The other end of the spectrum is set by black-box models that can
be constructed from almost zero knowledge; measured runtime, hardware performance
metrics, power dissipation, etc., are used to identify crucial influence factors for the
metrics to be modeled. One can then use the “trained” system to predict properties of
arbitrary code, or play with parameters to explore design spaces. In either case, the pre-
dictive power of the model enables insight beyond what we would get by just running
the code on the hardware at hand.
The power dissipation and energy consumption of HPC systems have become a major
concern. Developing a good understanding of the mechanisms behind it and how
code can be executed in the most energy-efficient way is thus of great interest to the
community. It is certainly out of the question that navigating the parameter space of
core count, clock frequencies, and (possibly) supply voltage will be sufficient to meet
the challenges of future top-tier parallel computers in terms of power, but energy is still
a major part of the operating costs of HPC clusters. Moreover, there is a trend to employ
power capping in order to enable a more accurate tailoring of the power supply to the
needs of the machine, thereby saving a lot of expenses in the infrastructure. Under such
conditions, letting code run “cooler” and knowing the energy vs. performance tradeoffs
will directly yield more science (i.e., useful core hours) per dollar.
This paper is concerned with core- and chip-level performance and power models
for Intel server CPUs. These models are precise enough to yield quantitative predic-
tions of energy consumption. In terms of performance we rely on the execution-cache-
memory (ECM) performance model [4,11] (of which the well-known roofline model
is a special case), which can deliver single-core and chip-level runtime estimates for
loop-based code on multicore CPUs. A simple multicore power model [4] serves as
a starting point for energy modeling. Both models are rather qualitative in nature; although
the ECM model is precise on the single core, it is over-optimistic once the memory
bandwidth starts saturating. The original power model is very approximate and can
only track the rough energy consumption behavior of the processor. In this work we
refine both models to a point where the prediction accuracy for performance and power
dissipation, and thus also for energy consumption, becomes unprecedented. This comes
at the price of making the models more “gray-box”-like in the above terminology, i.e.,
they need more phenomenological input and fit parameters. However, the actual choice
of functional dependencies is still motivated by white-box thinking.
This paper is organized as follows: The remainder of this section describes related
work and lists our new contributions. In Section 2 we describe the hardware and soft-
ware setup and our measurement methodology. Section 3 refines the ECM performance
model to yield more accurate predictions for code near the bandwidth saturation point.
In Section 4 we extend the simple multicore power model by refining it for better base-
line power prediction and adapt it to the new Intel processors with dual clock frequency
domains (core and Uncore). Section 5 links the two models to validate the predicted
energy consumption. Motivated by the results we give some guidelines for energy op-
timization in Section 6 and conclude the paper with an outlook to future work in Sec-
tion 7.
1.1 Related work
Energy and performance models on the chip level have received intense interest in the
past decade. The roofline model [13] is still the starting point for most code analysis
activities, but it lacks accuracy and predictive power on the single core and for saturation
behavior. The ECM model [4,11] requires less phenomenological input but encompasses
more details of the underlying architecture than roofline, yielding better results
on the single core. In contrast to the original ECM model we allow for latency penalty
contributions that depend on the memory bus utilization, making the model accurate
across the whole scaling curve.
Energy-performance tradeoffs have been studied since the power envelope of pro-
cessors became a major concern, but were only treated phenomenologically [2,10].
Rauber et al. [9] show using a simple heuristic model that the typical energy minimum
versus clock speed observed for scalable code can be derived analytically. However,
they do not have a useful performance model and do not take saturation patterns due
to memory bandwidth exhaustion into account. Khabi et al. [8] study the energy con-
sumption of simple, scalable compute kernels using a similar underlying power model,
but they also lack a performance prediction. The energy model introduced by Hager et
al. [4] includes performance saturation but is only qualitative and thus allows only rough
estimates of energy consumption, due to the combined shortcomings of the underlying
ECM and power models.
Manufacturing variations of processors and their consequences have been studied
by several authors [7,12], and we do not add to their wisdom here; our contribution in
this area is to show the relation between fitting parameters for a specific specimen and
the “batch,” yielding insight about the usefulness of a particular set of parameters.
1.2 Contribution
This paper makes the following contributions: We refine the ECM performance model
to accurately describe the saturation behavior of memory-bound loops across cores. A
previously published multicore power model is extended to include dual clock domains
(core and Uncore) and frequency- and core-dependent baseline power. The achieved
accuracy in predicting runtime, power, and energy (using both models combined) with
respect to core frequency, Uncore frequency, and number of active cores is unprece-
dented. This is demonstrated with Intel Xeon Sandy Bridge and Broadwell CPUs. We
also identify which of the power model parameters depend on the code and which do
not. A statistical analysis of the variation of power parameters due to production spread
is given for a batch of Intel “Broadwell” 10-core CPUs, setting the limits for the gener-
ality of the power fit parameters. Finally, based on the energy modeling results, we use
Z-plots3 to identify best practices for energy-efficient, best-performance, and lowest-
EDP (energy-delay product) execution of scalable (DGEMM) and saturating (STREAM)
code, with special emphasis on the Uncore clock of the Broadwell CPU, which we
identify as a crucial parameter in energy-aware computing.
2 Testbed and methodology
All measurements were performed on one socket of standard two-socket Intel Xeon
servers. A summary of key specifications of the testbed processors is shown in Table 1.
The Sandy Bridge-EP (SNB) and Broadwell-EP (BDW) chips were selected for their
relevance in scientific computing. Along with their “relatives,” the Ivy Bridge-EP (IVB)
3 We coined the name to honor a colleague that came up with the idea, not knowing that it had
been around for some time. Lacking any better name, we stick to it here.
Table 1. Key specification of test bed machines.
Microarchitecture (Shorthand)        Sandy Bridge-EP (SNB)    Broadwell-EP (BDW)
Chip Model                           Xeon E5-2680             E5-2697 v4
Thermal Design Power (TDP)           130 W                    145 W
Supported core frequency range       1.2–2.7 GHz              1.2–2.3 GHz
Supported Uncore frequency range     1.2–2.7 GHz              1.2–2.8 GHz
Cores/Threads                        8/16                     18/36
Core-private L1/L2 cache capacities  8×32 kB / 8×256 kB       18×32 kB / 18×256 kB
Shared L3 cache capacity             20 MB (8×2.5 MB)         45 MB (18×2.5 MB)
Memory Configuration                 4 ch. DDR3-1600          4 ch. DDR4-2400
Theoretical Memory Bandwidth         51.2 GB/s                76.8 GB/s
and Haswell-EP (HSW) microarchitectures, they make up more than 85% of the sys-
tems in the latest TOP500 list published in November 2017.
Apart from obvious advances over processor generations such as the increased core
count or microarchitectural improvements concerning SIMD ISA extensions, major
frequency-related changes were made on the HSW/BDW microarchitectures. On the
older SNB/IVB microarchitectures the chip’s Uncore4 was clocked at the same fre-
quency as CPU cores. On Haswell/Broadwell chips, separate clock domains are pro-
vided for CPU cores and the Uncore. As will be demonstrated, the capability to run
cores and the Uncore at different clock speeds proves to be a distinguishing feature of
the newer designs that has significant impact on energy-efficient operation.
Processors featuring the new Uncore clock domain provide a feature called Uncore
frequency scaling that allows chips to dynamically set the Uncore frequency based on
the workload. When this feature is disabled the Uncore frequency is fixed at the max-
imum supported frequency. Although not officially documented, a means to manually
set the Uncore frequency via a model specific register is supported by all HSW, BDW,
and Skylake processors; starting with version 4.3.0, the likwid-setFrequencies tool
from the LIKWID tool suite5 provides a comfortable way to manually set the Uncore fre-
quency.
Since previous investigations of the running average power limit (RAPL) interface
indicate that data provided by this interface is of high quality [3], all power-related
empirical data was collected via RAPL using likwid-perfctr (also from the LIKWID
tool suite). Representatives from the classes of scalable and saturating applications for
which performance and energy behavior was investigated were DGEMM (from Intel’s
MKL, version 16.0.1) and the STREAM triad pattern (executed in likwid-bench, again
from the LIKWID tool suite), respectively. Variance of empirical performance and power
data was addressed by taking each measurement ten times; afterwards, the coefficient of
4 On Intel processors, the term Uncore refers to all parts of the chip that are not part of the core
design, such as the shared last-level cache, ring interconnect, and memory controllers.
5 http://tiny.cc/LIKWID
[Figure 1 graphic: timelines of TL3Mem, Tchip, and TP phases together with memory bus utilization over time for (a) one core (core 0), (b) two cores (core 0, core 1), and (c) three cores (core 0, core 1, core 2).]
Fig. 1. Visualization of memory bandwidth saturation under the refined ECM model. The white
boxes show the average latency penalty T¯P, which grows with the utilization u(n).
variation6 was used to assess variance, which in no case was higher than 2%, indicating
that variance is not a problem.
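The variance check described above can be sketched in a few lines of Python. This is a minimal illustration only: the readings are made-up numbers rather than data from the paper, and the use of the sample standard deviation (n−1 denominator) is an assumption, since the text does not state which estimator was applied.

```python
import statistics


def coefficient_of_variation(samples):
    """Return the ratio sigma/mu for a list of repeated measurements."""
    mean = statistics.fmean(samples)
    # Sample standard deviation (n-1 denominator); this choice is an
    # assumption, the text only defines CV as sigma over mu.
    return statistics.stdev(samples) / mean


# Hypothetical package-power readings in watts (ten repetitions).
readings = [100.0, 101.0, 99.0, 100.5, 99.5, 100.0, 100.2, 99.8, 100.1, 99.9]
cv = coefficient_of_variation(readings)
print(f"coefficient of variation: {100 * cv:.2f}%")
```

A measurement series would pass the 2% criterion quoted above if `cv` stays below 0.02.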
3 Refined ECM performance model
The ECM model takes into account predictions of single-threaded in-core execution
time and data transfers through the complete cache hierarchy. These predictions can
be put together in different ways, depending on the CPU architecture. On all recent
Intel server microarchitectures it turns out that the model yields the best predictions if
one assumes no (temporal) overlap of data transfers through the cache hierarchy and
between the L1 cache and registers, while in-core execution (such as arithmetic) shows
full overlap. Scalability is assumed to be perfect until a bandwidth bottleneck is hit. A
full account of the ECM model would exceed the scope of this paper, so we refer to [11]
for a recent discussion.
One of the known shortcomings of the ECM model is that it is rather optimistic
near the saturation point [4,11], i.e., it overestimates performance when the memory
interface is nearly saturated. There are several possible explanations for this effect. For
example, it is documented that Intel’s hardware prefetching mechanism reduces the
prefetch distance when the memory bus is near saturation [1], which leads to larger
latencies for individual accesses, causing an additional latency contribution to the data
access time in the model. Thus the assumption that the scaling is linear with unchanged
data delay contributions across all cores until the bandwidth bottleneck is exhausted
cannot be upheld in this simple form. Based on this insight we make the following
additional assumptions about performance saturation:
– Let u(n) be the utilization of the memory interface with n active cores, i.e., the
fraction of time in which the memory bus is actively transferring data. The plain
6 The coefficient of variation is used to measure the relative variance of a sample. It is defined
as the ratio of the standard deviation σ to the mean µ of a sample.
[Figure 2 graphic: performance [MFlop/s] versus number of active cores, showing old model, refined model, and measurement. (a) SNB, 1 to 8 cores, for fcore = 1.2 GHz and fcore = 2.7 GHz. (b) BDW, 1 to 18 cores, for fcore = 1.9 GHz, fUncore = 1.2 GHz and for fcore = 2.3 GHz, fUncore = 2.8 GHz.]
Fig. 2. Comparison of original and refined ECM model multi-core estimates to empirical perfor-
mance data on (a) SNB (p0 = 7.8cy) (b) BDW (p0 = 5.2cy) for the STREAM benchmark with a
16GB data set size.
ECM model predicts
$$ u(1) = \frac{T_\mathrm{L3Mem}}{T_\mathrm{ECM}} = \frac{T_\mathrm{L3Mem}}{T_\mathrm{L3Mem} + T_\mathrm{chip}} \tag{1} $$
Here, TL3Mem is the runtime contribution of L3-memory data transfers, and Tchip
quantifies the data delay up to and including the L3 cache (see Figure 1(a)). No
change to the model is necessary at this level.
– For n > 1, the probability that a memory request initiated by a core hits a busy
memory bus is proportional to the utilization of the bus caused by the n−1 remain-
ing cores. If this happens, the core picks up an additional average latency penalty
T¯P (see Figures 1(b) and (c)) that is proportional to (n− 1)u(n− 1):
$$ \bar{T}_P(n) = (n-1)\,u(n-1)\,p_0 . \tag{2} $$
Here, p0 is a free parameter that has to be fitted to the data. Hence, we get a recur-
sive formula for predicting the utilization:
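$$ u(n) = \min\left(1, \frac{n\,T_\mathrm{L3Mem}}{T_\mathrm{ECM} + \bar{T}_P}\right) = \min\left(1, \frac{n\,T_\mathrm{L3Mem}}{T_\mathrm{ECM} + (n-1)\,u(n-1)\,p_0}\right) . \tag{3} $$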
The penalty increases with the number of cores and with the utilization, so it has
the effect of delaying the bandwidth saturation.
– The expected performance at n cores is then π(n) = u(n) π_BW, where π_BW is the
bandwidth-bound performance limit as given, e.g., by the roofline model. If u(n)<
1 for all n ≤ ncores, where ncores is the number of available cores connected to the
memory interface (i.e., in the ccNUMA domain), the code cannot saturate the mem-
ory bandwidth.
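The recursion in Eqs. (1)–(3) is easy to evaluate numerically. The following Python sketch is not the authors' script; it only illustrates the formulas. The function names and the cycle counts for TL3Mem and Tchip are hypothetical, while p0 = 7.8 cy is the SNB value quoted in the caption of Fig. 2.

```python
def utilization(n_cores, t_l3mem, t_chip, p0):
    """Return [u(1), ..., u(n_cores)] following Eqs. (1)-(3)."""
    t_ecm = t_l3mem + t_chip
    u = [t_l3mem / t_ecm]  # Eq. (1): single-core utilization
    for n in range(2, n_cores + 1):
        penalty = (n - 1) * u[-1] * p0  # Eq. (2): average latency penalty
        u.append(min(1.0, n * t_l3mem / (t_ecm + penalty)))  # Eq. (3)
    return u


def performance(u, pi_bw):
    """Scale the bandwidth-bound limit pi_bw by the utilization."""
    return [ui * pi_bw for ui in u]


# Hypothetical inputs: 10 cy for L3-memory transfers, 20 cy up to the L3.
refined = utilization(8, t_l3mem=10.0, t_chip=20.0, p0=7.8)
original = utilization(8, t_l3mem=10.0, t_chip=20.0, p0=0.0)  # no penalty
for n, (u_new, u_old) in enumerate(zip(refined, original), start=1):
    print(f"n={n}: refined u={u_new:.3f}, original u={u_old:.3f}")
```

Setting p0 = 0 recovers the original linear-scaling assumption, so the two lists show directly how the penalty delays saturation.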
One phenomenological input for the ECM model is the saturated memory bandwidth,
which is typically determined by a streaming benchmark. There is no straightforward
[Figure 3 graphic: (a) bandwidth [GB/s] versus Uncore frequency [GHz] for SNB and BDW. (b) Package power via RAPL [W] versus number of cores for DGEMM, STREAM triad, and Graph500 at fcore = 1.7 GHz and fcore = 2.7 GHz.]
Fig. 3. (a) Maximum memory bandwidth (measured with STREAM) versus (Un)core clock fre-
quency on SNB and BDW. (b) Package power consumption for DGEMM (N = 20,000), STREAM
triad (16GB data set size), and Graph500 (scale 24, edge factor 16) subject to active core count
and CPU core frequency on SNB.
way to derive the memory bandwidth from core and Uncore clock frequencies, so we
have to measure its dependence on these parameters. Figure 3(a) shows memory band-
width versus (Un)core clock frequency on BDW and SNB, respectively. The measured
bandwidth at a particular clock setting is then used together with the known memory
traffic to determine TL3Mem.
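As a worked illustration of this step, the conversion from a measured bandwidth to a transfer time in core cycles can be written as below. The numbers are assumptions for illustration only: 256 B of traffic per unit of work corresponds to a STREAM triad with write-allocate (four 64 B cache lines), and 40 GB/s is a made-up bandwidth, not a value read from Fig. 3(a). Expressing the result in core clock cycles is also an assumption of this sketch.

```python
def t_l3mem_cycles(bytes_per_unit, bandwidth_gb_s, f_core_ghz):
    """Core cycles needed to move bytes_per_unit at the given bandwidth."""
    seconds = bytes_per_unit / (bandwidth_gb_s * 1e9)
    return seconds * f_core_ghz * 1e9


# Hypothetical example: 256 B per unit of work, 40 GB/s, 2.7 GHz core clock.
t = t_l3mem_cycles(bytes_per_unit=256, bandwidth_gb_s=40.0, f_core_ghz=2.7)
print(f"T_L3Mem = {t:.2f} cy")
```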
In Figure 2 we show comparisons between the original scaling assumption (dashed
lines) and our improved version (solid lines, open circles) together with measured per-
formance data (solid circles) on the SNB and BDW chips. The agreement between the
new model and the measurement is striking. It is unclear and left to future work whether
and how p0 depends on the code, e.g., the number of concurrent streams or the potential
cache reuse. Note that the single-core ECM model is unchanged and does not require a
latency correction.
Since there is no closed formula for the ECM-based runtime and performance pre-
dictions due to the recursive nature of the utilization ratio (3), setting up the model
becomes more complicated. We provide a Python script for download that implements
the improved multi-core model (and also the power model described in the following
section) at http://tiny.cc/hbpmpy. The single-core model can either be constructed
manually or via the open-source Kerncraft tool [5], which can automatically derive the
ECM and roofline models from C source and architectural information.
4 Refined power model
In [4] a simple power dissipation model for multicore CPUs was proposed, which as-
sumed a baseline (static) and a dynamic power component, with only the latter being
dependent on the number of active cores: P = P_static + n P_dyn, with P_dyn = P_1 f + P_2 f^2.
Several interesting conclusions can be drawn from this, but the agreement of the model
with actual power measurements remains unsatisfactory: The decrease of dynamic per-
core power in the saturated performance regime, per-core static power, and the depen-
dence of static power on the core frequency (via the automatically adjusted supply volt-
age) are all neglected. Together with the inaccuracies of the original ECM performance
model, predictions of energy consumption become cursory at best [14]. Moreover, since
the introduction of the Uncore clock domain with the Intel Haswell processor, a single
clock speed f became inadequate. An improved power model can be constructed, how-
ever, by adjusting some assumptions:
– There is a baseline (“static”) power component for the whole chip (i.e., independent
of the number of active cores) that is not constant but depends on the clock speed
of the Uncore:7
$$ P_\mathrm{base}(f_\mathrm{Uncore}) = W_0^\mathrm{base} + W_1^\mathrm{base} f_\mathrm{Uncore} + W_2^\mathrm{base} f_\mathrm{Uncore}^2 . \tag{4} $$
– As long as there is no bandwidth bottleneck there is a power component per active
core, comprising static and dynamic power contributions:
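$$ P_\mathrm{core}(f_\mathrm{core}, n) = W_0^\mathrm{core} + \left( W_1^\mathrm{core} f_\mathrm{core} + W_2^\mathrm{core} f_\mathrm{core}^2 \right) \varepsilon(n)^\alpha . \tag{5} $$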
In the presence of a bandwidth bottleneck, performance stagnates but power in-
creases (albeit more slowly than in the scalable case) as the number of active cores
goes up. We accommodate this behavior by using a damping factor ε(n)^α, where
ε(n) is the parallel efficiency at n cores and α is a fitting parameter.
The complete power model for n active cores is then
$$ P_\mathrm{chip} = P_\mathrm{base}(f_\mathrm{Uncore}) + n\,P_\mathrm{core}(f_\mathrm{core}, n) . \tag{6} $$
The model fixes all deficiencies of the original formulation, but this comes at the price
of a significant number of fitting parameters. The choice of a quadratic polynomial for
the f dependence is to some extent arbitrary; it turns out that a cubic term does not
improve the accuracy, nor does an exponential form such as β + γ f^δ. Thus we stick
to the quadratic form in the following. Note that there is a connection between the
model parameters and “microscopic” energy parameters such as the energy per cache
line transfer, per floating-point instruction, etc., which we do not use here since they
also result from fitting procedures; they also cannot predict the power dissipation of
running code with sufficient accuracy.
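Equations (4)–(6) translate directly into code. The sketch below is an illustration rather than the authors' implementation; the function and argument names are hypothetical. The example call uses the SNB fit parameters for DGEMM from Table 2 and assumes perfect parallel efficiency (ε = 1), which is appropriate for scalable code.

```python
def p_base(f_uncore, w_base):
    """Baseline power in W, Eq. (4); w_base = (W0, W1, W2)."""
    w0, w1, w2 = w_base
    return w0 + w1 * f_uncore + w2 * f_uncore**2


def p_core(f_core, efficiency, w_core, alpha):
    """Per-core power in W, Eq. (5); efficiency is epsilon(n) in (0, 1]."""
    w0, w1, w2 = w_core
    return w0 + (w1 * f_core + w2 * f_core**2) * efficiency**alpha


def p_chip(n, f_core, f_uncore, efficiency, w_base, w_core, alpha):
    """Chip power in W for n active cores, Eq. (6)."""
    return p_base(f_uncore, w_base) + n * p_core(f_core, efficiency, w_core, alpha)


# SNB, DGEMM (Table 2); on SNB the Uncore runs at the core frequency.
SNB_BASE = (14.62, 1.07, 1.02)
SNB_CORE_DGEMM = (1.42, -0.52, 1.51)
power = p_chip(n=8, f_core=2.7, f_uncore=2.7, efficiency=1.0,
               w_base=SNB_BASE, w_core=SNB_CORE_DGEMM, alpha=0.4)
print(f"predicted chip power: {power:.1f} W")
```

For saturating code, the parallel efficiency passed in would come from the performance model, so that the damping factor lowers the dynamic per-core contribution.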
The model parameters W_*^* and α have to be determined by fitting procedures, running
code with different characteristics. In Figure 3(b) we show how Pbase is determined
by extrapolating the measured power consumption towards zero cores at different clock
7 Although Sandy Bridge and Ivy Bridge processors do not have a separate Uncore frequency
domain, on chips based on these microarchitectures the Uncore is clocked at the same speed
as the cores, which implies a variable Uncore frequency and thereby a non-constant baseline
power on these processors as well.
[Figure 4 graphic: (a) Pbase [W] versus Uncore frequency [GHz] with empirical data and estimates for SNB and BDW (separate BDW estimates for fUncore ≤ 1.7 GHz and fUncore > 1.7 GHz). (b) Pcore [W] versus core frequency [GHz] with empirical data and estimates for SNB and BDW.]
Fig. 4. (a) Pbase and (b) Pcore parameters derived from empirical data for different CPU
core/Uncore frequencies on SNB and BDW. The Pcore fit was done for the DGEMM benchmark;
STREAM yields different fitting parameters (see Table 2).
frequencies on SNB (there is only one clock domain on this architecture). The STREAM
triad, a DGEMM, and the Graph500 benchmark were chosen because they have very
different requirements towards the hardware. In case of STREAM we ignore data points
beyond saturation (more precisely, for parallel efficiency smaller than 90%) in order to
get sensible results. The extrapolation shows that the baseline power is independent of
the code characteristics, which is surprising since the Uncore includes the L3 cache,
whose power consumption is certainly a function of data transfer activity. Its variation
with the clock speed can be used to determine the three parameters W_*^base, as shown in
Figure 4(a) for both architectures. In this figure, each data point (circles and squares)
is an extrapolated baseline power measurement for a different (Un)core frequency. On
the BDW CPU, the measurements exhibit a peculiar change of trend as the frequency
falls below 1.7GHz; a different set of fit parameters is needed in this regime. We can
only speculate that the chip employs more aggressive power saving techniques at low
fUncore.
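The extrapolation towards zero cores amounts to a straight-line fit of package power versus active core count in the non-saturated regime, with the intercept taken as the baseline power. The following sketch uses a plain least-squares fit; the data points are synthetic and the helper name is hypothetical, so it only illustrates the procedure and does not reproduce the measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic package power [W] for 1..8 active cores at one clock setting.
cores = [1, 2, 3, 4, 5, 6, 7, 8]
package_power = [25.0 + 11.0 * n for n in cores]

baseline, per_core = fit_line(cores, package_power)
print(f"extrapolated baseline: {baseline:.1f} W, per-core power: {per_core:.1f} W")
```

Repeating this for several clock settings gives one baseline value per frequency; a quadratic fit through those values then yields the three baseline parameters of Eq. (4).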
As for the core power parameters W_*^core, they do depend on the code as can already
be inferred from the data in Figure 3(b). In Figure 4(b) we show the quality of the fitting
procedure for DGEMM. The parameter α, which quantifies the influence of parallel
efficiency reduction on dynamic core power in the saturated regime of STREAM, can be
determined as well. Table 2 shows all fit parameters.
5 Energy model and validation
Putting together performance and power dissipation, we can now validate predictions
for energy consumption. Guidelines for energy-efficient execution as derived from this
data will be given in Sect. 6 below. We normalize the energy to the work, quantifying
[Figure 5 graphic: energy cost [pJ/Flop] versus performance [GFlop/s], showing model estimates and empirical data. (a) fcore = 1.2 GHz, 1.9 GHz, and 2.7 GHz. (b) fcore = fUncore = 1.2 GHz, 1.8 GHz, and 2.3 GHz.]
Fig. 5. Z-plots relating performance and energy cost for DGEMM from Intel’s MKL on (a) the
Sandy Bridge-EP processor for N = 40,000 and (b) the Broadwell-EP processor for N =
60,000, using different CPU core counts and CPU core frequencies; on the Broadwell-EP
processor the Uncore frequency was set equal to the core frequency.
the energy cost per work unit:
$$ E = \frac{P_\mathrm{chip}(f_\mathrm{core}, f_\mathrm{Uncore}, n)}{\pi(f_\mathrm{core}, f_\mathrm{Uncore}, n)} \tag{7} $$
In our case this quantity has a unit of J/flop. Unfortunately, this model is too intricate to
deliver general analytic predictions of minimum energy or EDP and the required oper-
ating points to attain them. Some simple cases, however, can be tackled. On a CPU with
only one clock speed domain (such as SNB), where fcore = fUncore = f , and assuming
that the code runtime is proportional to the inverse clock frequency, one can differen-
tiate (7) with respect to f and set it to zero in order to get the frequency for minimum
Microarchitecture     SNB                               BDW
α                     0.4                               0.5
W_0^base [W]          14.62                             27.2† / 70.8‡
W_1^base [W/GHz]      1.07                              −6.45† / −44.1‡
W_2^base [W/GHz^2]    1.02                              5.71† / 13.1‡
W_0^core [W]          1.42 (DGEMM) / 1.33 (STREAM)      −0.11 (DGEMM) / 0.45 (STREAM)
W_1^core [W/GHz]      −0.52 (DGEMM) / 0.80 (STREAM)     −1.46 (DGEMM) / 2.95 (STREAM)
W_2^core [W/GHz^2]    1.51 (DGEMM) / 1.22 (STREAM)      1.47 (DGEMM) / −0.24 (STREAM)
† fUncore ≤ 1.7 GHz
‡ fUncore > 1.7 GHz
Table 2. Fitted parameters for the power model (4)–(6), using the STREAM and DGEMM benchmarks.
Note that these numbers are fit parameters only; their physical relevance should not be
overstressed.
energy consumption. This yields
$$ f_\mathrm{opt} = \sqrt{\frac{W_0^\mathrm{base} + n\,W_0^\mathrm{core}}{W_2^\mathrm{base} + n\,W_2^\mathrm{core}}} , \tag{8} $$
which simplifies to the expression derived in [4] if we set W_0^core = W_2^base = 0. The optimal
frequency is large when the static power components dominate, which is plausible
(“race to idle”).
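To make Eq. (8) concrete, the following sketch evaluates it with the SNB fit parameters for DGEMM from Table 2. The function name is hypothetical, and the result inherits the assumptions stated above: a single clock domain and runtime inversely proportional to the clock frequency.

```python
import math


def f_opt(n, w_base0, w_base2, w_core0, w_core2):
    """Frequency [GHz] of minimum energy for n active cores, Eq. (8)."""
    return math.sqrt((w_base0 + n * w_core0) / (w_base2 + n * w_core2))


# SNB, DGEMM parameters from Table 2.
f_one_core = f_opt(1, w_base0=14.62, w_base2=1.02, w_core0=1.42, w_core2=1.51)
f_full_chip = f_opt(8, w_base0=14.62, w_base2=1.02, w_core0=1.42, w_core2=1.51)
print(f"f_opt(1 core)  = {f_one_core:.2f} GHz")
print(f"f_opt(8 cores) = {f_full_chip:.2f} GHz")
```

With these parameters the optimum drops from roughly 2.5 GHz on one core to roughly 1.4 GHz on the full chip, because the baseline power is shared among more cores.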
We have chosen the SNB and BDW processors for our study because they are rep-
resentatives of server CPU microarchitectures that exhibit significantly different power
consumption properties. The DGEMM and STREAM benchmarks are used for validation;
it should be emphasized that almost all parameters in the energy and power models
(apart from the base power) are code-dependent, so our validation makes no claim of
generality other than that it is suitable for codes with substantially different characteris-
tics. For STREAM we constructed the refined ECM model as described in Sect. 3, while
for DGEMM we assumed a performance of 95% of peak, which is quite accurate on the
two platforms. The “Turbo” feature was disabled.
In order to discuss performance and power behavior the Z-plot [2,14] has proven to
be useful. It is a Cartesian plot that shows (normalized or absolute) energy consumption
versus code performance,8 with some parameter varying along the data set. This can be,
e.g., the number of active cores, a clock frequency, a loop nest tile size, or any other
parameter that affects energy or runtime. In a Z-plot, lines of constant energy cost are
horizontal, lines of constant performance are vertical (e.g., a roofline limit is a hard
barrier), and lines of constant energy-delay product (EDP) are lines through the origin
whose slope is proportional to the EDP (assuming constant amount of work).
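The geometric statement about EDP lines can be checked with a few lines of Python. In this sketch the y coordinate is the energy cost per flop from Eq. (7); the function names and the operating point (100 W, 100 GFlop/s, 10^12 flops of work) are hypothetical values chosen for illustration.

```python
import math


def z_plot_point(power_w, perf_flops):
    """Return (x, y) = (performance, energy cost per flop) for a Z-plot."""
    return perf_flops, power_w / perf_flops


def energy_delay_product(power_w, perf_flops, work_flops):
    """EDP in J*s for a fixed amount of work."""
    runtime = work_flops / perf_flops
    energy = power_w * runtime
    return energy * runtime


# Hypothetical operating point and problem size.
power, perf, work = 100.0, 1.0e11, 1.0e12
x, y = z_plot_point(power, perf)
slope = y / x  # slope of the line from the origin through the point
edp = energy_delay_product(power, perf, work)
print(f"slope = {slope:.3e}, EDP = {edp:.1f} Js")
```

For a constant amount of work W, the EDP equals the slope times W squared, so all operating points on one line through the origin share the same EDP.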
Figure 5 shows Z-plots comparing model predictions (open circles) and measure-
ments (dots) for DGEMM on the two platforms, with varying number of active cores
along the data sets. In case of SNB (Fig. 5(a)), each of the three data sets is for a dif-
ferent core frequency (and hence implicitly different Uncore frequency). To mimic the
SNB behavior on BDW (Fig. 5(b)), we have set the core and Uncore frequencies to the
same value. The accuracy of the energy model is impressive, and much improved over
previous work. As predicted by the model, lowest energy for constant clock frequency
is always achieved with all cores. The clock frequency for minimum energy cost at a
given number of cores depends on both parameters: the more cores, the lower the opti-
mal frequency due to the waning influence of the base power. The spread in energy cost
across the parameter range is naturally larger on BDW with its large core count (18 vs.
8). At full chip, both architectures show lowest EDP at the fastest clock speed. To get
a better overview of the model quality on BDW we show in Fig. 6 a heat map of the
model error with respect to core count and core frequency at fixed Uncore clock. The
error is never larger than 4%; if one excludes the regions of small core counts and small
core frequencies, which are not very relevant for practical applications anyway, then the
maximum error falls below 2% and is typically smaller than 1%.
It is well known that manufacturing variations cause significant fluctuation across
chips of the same type in terms of power dissipation [7,6]. This poses problems, e.g.,
8 Wallclock time can also be used, which essentially mirrors the plot about the y axis.
[Figure 6 heat map: relative model error [%] by number of cores (rows) and CPU core frequency in GHz (columns).]
Cores  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
 1     3.6  3.2  3.7  4.0  2.6  2.5  2.9  3.1  2.9  1.9  2.3  2.9
 2     3.6  2.1  3.0  2.0  2.3  2.3  1.6  1.8  1.0  1.2  0.8  1.4
 3     2.4  1.3  2.4  1.3  2.4  1.6  0.7  0.6  0.8  0.1  0.1  0.6
 4     2.4  2.4  1.5  1.3  0.9  1.9  0.2  1.0  0.8  0.0  0.7  0.5
 5     0.4  1.5  0.5  0.6  0.6  0.4  0.8  1.6  1.8  0.5  0.8  0.2
 6     0.9  0.3  0.3  0.6  0.6  0.8  1.2  1.4  0.8  0.1  0.9  0.0
 7     0.2  0.5  0.5  0.3  0.5  0.0  0.1  0.6  0.8  0.5  0.7  0.1
 8     0.9  0.7  0.5  1.1  0.5  0.6  0.4  1.2  1.7  0.1  1.2  0.1
 9     1.7  0.5  0.3  0.1  0.1  1.5  0.4  0.7  0.7  0.2  0.3  0.5
10     2.5  0.6  0.2  0.8  0.5  0.5  0.6  0.8  0.7  0.2  0.1  0.1
11     1.8  0.3  0.8  0.8  0.0  0.3  0.9  0.8  0.4  0.4  0.4  0.3
12     0.7  0.5  0.0  0.0  0.8  0.3  0.1  0.8  0.1  0.5  0.8  1.0
13     0.6  0.1  0.4  0.6  0.8  0.5  0.3  0.0  0.6  1.2  1.2  1.4
14     0.5  0.2  0.6  0.4  1.1  0.5  0.2  0.4  1.2  0.8  1.7  1.9
15     1.7  1.0  0.2  0.5  0.2  0.9  0.1  0.7  1.2  1.2  1.0  1.8
16     1.3  0.6  1.1  0.2  0.0  0.5  1.1  0.5  1.6  1.7  0.6  1.2
17     1.5  0.3  1.0  1.1  1.5  1.3  1.3  1.6  2.1  0.9  0.9  1.5
18     1.1  0.3  0.6  1.4  1.6  1.1  1.3  1.6  0.1  0.0  0.1  0.5
Fig. 6. Relative model error for DGEMM on the Broadwell-EP processor for different core counts
and CPU core frequencies. The Uncore clock speed was set to the maximum (2.8GHz).
when power capping is enforced because power variations then translate into perfor-
mance variations [7], but it can also be leveraged for saving energy by intelligent
scheduling [12]. For modeling purposes it is interesting to analyze the variation of fit-
ted power model parameters in order to see how generic these values are. Figures 7
and 8 show histograms of W_*^base and W_*^core for DGEMM, including Gaussian fits for 100
chips of a cluster9 based on dual-socket nodes with Intel Xeon Broadwell E5-2630v4
CPUs (10 cores). The data clearly shows that the accuracy of the power dissipation
model can only be achieved for a particular specimen; however, the general insights are
unchanged. It cannot be ruled out that some of the variation is caused by changes in
environmental conditions across the nodes and CPUs. For example, the typical front-
to-back airflow in many cluster node designs leads to one of the chips getting warmer.
The (weak) bi-modal distribution ofW base0 may be a consequence of this. We have also
observed that chips with a particularly high value of one parameter (e.g.,W base2 ) are not
necessarily “hot” specimen, because other parameters can be average or even smaller.
These observations underpin our claim that one should not put too much physical in-
terpretation into the power model fit parameters but rather take them as they are and
try to reach qualitative conclusions, although the predictions for an individual chip are
accurate.
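The size of the chip-to-chip variation can be made concrete by relating the standard
deviation of each Gaussian fit to its mean. The following is a minimal sketch, not part of
the original analysis: the numbers are the μ and σ values printed in the panels of Figs. 7
and 8, while the dictionary keys and function names are chosen here for illustration only.

```python
# Mean (mu) and standard deviation (sigma) of the Gaussian fits as printed in the
# histogram panels of Figs. 7 and 8 (DGEMM, 100 chips). Key names are illustrative.
FIT_STATS = {
    "W0_base": (17.11, 2.16, "W"),
    "W1_base": (-1.65, 1.48, "W/GHz"),
    "W2_base": (1.56, 0.39, "W/GHz^2"),
    "W0_core": (0.44, 0.34, "W"),
    "W1_core": (1.86, 0.48, "W/GHz"),
    "W2_core": (0.22, 0.17, "W/GHz^2"),
}


def relative_spread(mu, sigma):
    """Coefficient of variation sigma/|mu|, returned as a fraction."""
    return sigma / abs(mu)


def spread_table(stats=FIT_STATS):
    """Map each parameter name to its relative spread."""
    return {name: relative_spread(mu, sigma) for name, (mu, sigma, _unit) in stats.items()}


if __name__ == "__main__":
    for name, cv in spread_table().items():
        mu, sigma, unit = FIT_STATS[name]
        print(f"{name}: mu = {mu} {unit}, sigma = {sigma} {unit}, sigma/|mu| = {100 * cv:.1f} %")
```

Parameters whose mean is close to zero show a large relative spread in this measure, which
is one more reason to be careful with a physical interpretation of individual fit values.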
In Figure 9 we show Z-plots comparing the predictions and measurements for the STREAM triad
with the same frequency settings as in Fig. 5. The saturation bandwidth, which limits the
performance to the right in the plot, was taken from the data in Fig. 3(a). The prediction
accuracy is not worse than for DGEMM, despite the fact that the model is now much more
complicated due to the saturating performance and the drop in parallel efficiency beyond the
saturation point. A major difference between the two processors
9 Link to cluster doc hidden for double-blind review
[Figure 7: three histogram panels with Gaussian fits; vertical axis: normalized density [%].
Panel $W^{\mathrm{base}}_0$ [W]: μ = 17.11 W, σ = 2.16 W.
Panel $W^{\mathrm{base}}_1$ [W/GHz]: μ = -1.65 W/GHz, σ = 1.48 W/GHz.
Panel $W^{\mathrm{base}}_2$ [W/GHz$^2$]: μ = 1.56 W/GHz$^2$, σ = 0.39 W/GHz$^2$.]
Fig. 7. Histograms of $W^{\mathrm{base}}_*$ for DGEMM among 100 Xeon Broadwell E5-2630v4 chips.
The sum of all probabilities was normalized to one.
[Figure 8: three histogram panels with Gaussian fits; vertical axis: normalized density [%].
Panel $W^{\mathrm{core}}_0$ [W]: μ = 0.44 W, σ = 0.34 W.
Panel $W^{\mathrm{core}}_1$ [W/GHz]: μ = 1.86 W/GHz, σ = 0.48 W/GHz.
Panel $W^{\mathrm{core}}_2$ [W/GHz$^2$]: μ = 0.22 W/GHz$^2$, σ = 0.17 W/GHz$^2$.]
Fig. 8. Histograms of $W^{\mathrm{core}}_*$ for DGEMM among 100 Xeon Broadwell E5-2630v4 chips.
The sum of all probabilities was normalized to one.
strikes the eye: The waste in energy for core counts beyond saturation is considerably
smaller on BDW (although it is still about 20–25%), and the saturation point is almost
independent of the clock speed. Only the refined ECM model can predict this accurately; in
the original model, the saturation point depends very strongly on the frequency. At
saturation, the energy consumption varies only weakly with the clock speed, which makes
finding the saturation point the paramount energy optimization strategy. In contrast, on SNB
one has to find the saturation point and choose the right frequency for the global energy
minimum. If the EDP is the target metric, finding the optimal operating point is more
difficult. For both chips it coincides with the saturation point at a frequency that is
roughly halfway between minimum and maximum.
In summary, our power model yields meaningful estimates of high quality with an error below
1% for relevant operating points (i.e., away from saturation and using more than a single
core). In contrast to the work in [6], where the power/performance behavior was only observed
empirically, we have presented an analytic model based on simplifying assumptions that can
accurately describe the observed behavior.
[Figure 9: two Z-plot panels; horizontal axis: Performance [MFlop/s], vertical axis: Energy
cost [nJ/Flop]; model estimates are plotted together with measurements, and a "lowest EDP"
marker is shown.
(a) Curves for $f_{\mathrm{core}}$ = 1.2 GHz, 1.9 GHz, and 2.7 GHz.
(b) Curves for $f_{\mathrm{core}} = f_{\mathrm{Uncore}}$ = 1.2 GHz, 1.8 GHz, and 2.3 GHz.]
Fig. 9. Z-plots relating performance and energy cost for the STREAM triad using a 4 GB data
set for different CPU core counts as well as CPU core and Uncore frequencies on the (a) Sandy
Bridge-EP and (b) Broadwell-EP processors. On BDW, core and Uncore clock frequencies were
set to the same value for this experiment.
6 Consequences for energy optimization
It is satisfying that our refined ECM and power models are accurate enough to describe
the energy consumption of scalable and bandwidth-saturating code with unprecedented
quality on two quite different multicore architectures. However, in order to go beyond
an exercise in curve fitting, we have to derive guidelines for the efficient execution of
code that are independent of the specific fit parameters determined for a given chip
specimen. As usual, we differentiate between scalable and saturating code, exemplified
by DGEMM and the STREAM triad, respectively.
6.1 Scalable code
Figure 10(a) shows a Z-plot with two data sets (four and eight cores, respectively) and the
core frequency as a parameter for the SNB processor running DGEMM. The highest performance,
lowest energy, and lowest EDP observed are marked with dashed lines. From the energy model
and our measurements we expect minimum energy for full-chip execution at a clock speed which
is determined by the ratio of the baseline power and the $f^2$ component of dynamic power
(see (8)). For the chip at hand, this minimum is at $f_{\mathrm{opt}} \approx 1.4$ GHz with
all cores and at $f_{\mathrm{opt}} \approx 1.7$ GHz with only four cores. The global
performance maximum (and EDP minimum) is at the highest core clock speed using all cores, as
predicted by the model. Hence, on this chip, where the Uncore and core frequencies are the
same by design, there is only a choice between highest performance (and, at the same time,
lowest EDP) and lowest energy consumption. The latter is achieved at a rather low clock speed
setting using all cores. About 21% of energy can be saved by choosing $f_{\mathrm{opt}}$,
albeit at the price of a 50% performance loss.
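Why the optimal frequency shifts upward when fewer cores are used can be illustrated with a
small calculation. The sketch below assumes a simplified power model, P(n, f) = W_base +
n (W0 + W1 f + W2 f^2), and performance proportional to n f for scalable code; this form is an
assumption made here for illustration and is not necessarily identical to (8). All parameter
values are hypothetical round numbers, not fit results for either chip.

```python
import math


def chip_power(n_cores, f_ghz, w_base, w0, w1, w2):
    """Assumed model: baseline power plus a per-core quadratic polynomial in f."""
    return w_base + n_cores * (w0 + w1 * f_ghz + w2 * f_ghz ** 2)


def energy_per_work(n_cores, f_ghz, w_base, w0, w1, w2):
    """Energy per unit of work (arbitrary units), assuming performance ~ n_cores * f."""
    return chip_power(n_cores, f_ghz, w_base, w0, w1, w2) / (n_cores * f_ghz)


def f_opt(n_cores, w_base, w0, w2):
    """Frequency minimizing energy_per_work for the assumed model.

    Setting the derivative with respect to f to zero gives
    f_opt = sqrt((w_base + n_cores * w0) / (n_cores * w2));
    the linear coefficient w1 drops out.
    """
    return math.sqrt((w_base + n_cores * w0) / (n_cores * w2))


if __name__ == "__main__":
    # Hypothetical parameters (W, W, W/GHz, W/GHz^2); NOT taken from the paper.
    w_base, w0, w1, w2 = 16.0, 1.0, 1.0, 1.0
    for n in (8, 4):
        print(f"{n} cores: f_opt = {f_opt(n, w_base, w0, w2):.2f} GHz")
```

Under these assumptions the baseline power is shared among fewer cores when only half of the
chip is used, so the energy-optimal frequency is higher with four cores than with eight, in
qualitative agreement with the values quoted above.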
[Figure 10: two Z-plot panels; horizontal axis: Performance [GFlop/s], vertical axis: Energy
cost [pJ/Flop]; the points of lowest energy cost, highest performance, and lowest EDP are
marked.
(a) Curves: 8 cores (full-chip), $f_{\mathrm{core}}$ varies; 4 cores, $f_{\mathrm{core}}$
varies; curve ends labeled 1.2 GHz and 2.7 GHz.
(b) Curves: $f_{\mathrm{core}}$ = 1.2 GHz, $f_{\mathrm{Uncore}}$ varies; $f_{\mathrm{core}}$ =
2.3 GHz, $f_{\mathrm{Uncore}}$ varies; $f_{\mathrm{core}}$ varies, $f_{\mathrm{Uncore}}$ =
2.8 GHz.]
Fig. 10. (a) Z-plot relating performance and energy cost for DGEMM from Intel’s MKL for N =
40,000 running on four (half-chip) and eight (full-chip) cores of the Sandy Bridge-EP
processor clocked at different CPU core frequencies, i.e., the core frequency varies along
the curves. The energy minimum is exactly at the optimal frequency predicted by (8). (b)
Z-plot relating performance and energy cost for DGEMM from Intel’s MKL for N = 60,000 running
on all cores of the Broadwell-EP processor clocked at different CPU core frequencies and
fixed, maximum Uncore clock (along the red curve) and at fixed maximum (blue curve) and
minimum (green curve) core frequency with varying Uncore speed. Black arrows indicate the
direction of rising (Un)core clock frequency.
The situation is more intricate on BDW, where the Uncore clock speed has a strong impact on
the power consumption as well as on the performance, even for DGEMM. Figure 10(b) shows
energy-performance Z-plots for different operating modes: Along the red curve the core clock
speed is varied at maximum Uncore clock. This is also the mode in which most production
clusters are run today, since the automatic Uncore frequency scaling of the BDW processor
favors high Uncore frequencies. In this case the energy-optimal core frequency is beyond the
accessible range on this chip, which is why the lowest-energy (and highest-performance)
“naive” operating point is at the largest $f_{\mathrm{core}}$. Starting at this point one can
now reduce the Uncore clock frequency at constant, maximum core clock (2.3 GHz) until the
slow Uncore clock speed starts to impact the DGEMM performance (blue curve) due to the
slowdown of the L3 cache. At $f_{\mathrm{Uncore}}$ = 2.1 GHz we arrive at the global
performance maximum and EDP minimum, saving about 17% of energy compared to the naive
operating point. At even lower $f_{\mathrm{Uncore}}$ the performance starts to degrade,
ultimately leading to a rise in energy cost. The question arises whether one could save even
more energy by accepting a performance degradation, just as on the SNB CPU. The green curve
shows the extreme case where the core clock speed is at the minimum of 1.2 GHz. Here the
Uncore frequency cannot be lowered to a point where it impacts the performance, which thus
stays constant, but the energy consumption goes down significantly. However, the additional
energy saving is only about 5% compared to the case of maximum performance at optimal Uncore
frequency. This does not justify the almost 50% performance loss.
[Figure 11: two Z-plot panels; horizontal axis: Performance [MFlop/s], vertical axis: Energy
cost [nJ/Flop]; the points of lowest energy cost, highest performance, and lowest EDP are
marked.
(a) Curves for $f_{\mathrm{core}}$ = 1.3 GHz, 1.8 GHz, and 2.7 GHz.
(b) Curves: $f_{\mathrm{core}}$ = 1.2 GHz, 1.7 GHz, and 2.3 GHz, each with varying
$f_{\mathrm{Uncore}}$; $f_{\mathrm{core}}$ varies, $f_{\mathrm{Uncore}}$ = 2.8 GHz.]
Fig. 11. Z-plots relating performance and energy cost for the STREAM triad with a 4 GB data
set using various CPU core counts as well as core and Uncore frequencies. (a) Sandy Bridge-EP
at three clock frequencies with varying number of cores along the curves. (b) Broadwell-EP,
three core frequencies (red/green/blue), varying Uncore clock speed along the curves. Black:
fixed Uncore frequency (at maximum), varying core clock speed along curve. The number of
cores at each data point on BDW was determined by minimizing the EDP vs. cores at fixed clock
speeds.
In conclusion, the BDW CPU shows a qualitatively different performance/energy tradeoff due to
its large and power-hungry Uncore. The Uncore clock speed is the dominating parameter here.
It is advisable to set the core clock speed to the maximum and then lower the Uncore clock
until performance starts to degrade. This is also the point where the global EDP minimum is
reached. For codes that are insensitive to the Uncore (e.g., with purely core-bound
performance characteristics), the chip should be operated at the lowest possible Uncore
frequency setting.
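The tuning recipe for scalable code on BDW can be written down as a small selection step over
a measured Uncore frequency sweep. This is only a sketch under stated assumptions: the
function name, the tolerance parameter, and the sample numbers are hypothetical and do not
come from the measurements in this paper, and acquiring the sweep (setting frequencies,
running the benchmark) is left out.

```python
def pick_uncore_frequency(samples, max_perf_loss=0.01):
    """Return the lowest Uncore frequency that keeps performance near its maximum.

    samples: iterable of (f_uncore_ghz, performance) pairs, all measured at the
             same (maximum) core clock and the same number of cores.
    max_perf_loss: tolerated performance loss relative to the best sample,
                   given as a fraction (0.01 means 1%).
    """
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one measurement")
    best_perf = max(perf for _, perf in samples)
    threshold = (1.0 - max_perf_loss) * best_perf
    return min(f for f, perf in samples if perf >= threshold)


# Synthetic sweep (GHz, GFlop/s) for illustration only; not measured data.
SWEEP = [(1.2, 410.0), (1.6, 520.0), (2.0, 596.0), (2.4, 600.0), (2.8, 600.0)]

if __name__ == "__main__":
    print("chosen Uncore frequency:", pick_uncore_frequency(SWEEP), "GHz")
```

The tolerance makes the notion of "until performance starts to degrade" explicit; a code that
is insensitive to the Uncore would simply yield the lowest frequency in the sweep.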
6.2 Saturating code
A code with saturating performance characteristics due to the memory bandwidth bottleneck is
more complicated to describe than a scalable code, because the saturation point marks an
abrupt change in the energy behavior. The saturation point (i.e., the number of cores
required to saturate) depends on the clock speed(s) of the chip, so it cannot be assumed that
the minimum energy point is reached at the same number of cores (as was the case for scalable
code).
Figure 11(a) shows Z-plots for SNB with the STREAM triad code and three different clock
speeds (1.3, 1.8, and 2.7 GHz). The number of cores varies from one to eight along the
curves. On this CPU the lowest-energy and highest-performance operating points are quite
distinct; the saturation point with respect to core count can be clearly identified by the
lowest EDP (per core frequency), and coincides with the highest-performance point with good
accuracy. Hence, there is a simple tradeoff between performance and energy, with a
performance loss of 25% for 28% of energy savings (only considering the saturation points).
As mentioned before, using the whole chip is wasteful, especially at a fast clock speed.
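Reading the three operating points off a Z-plot amounts to simple comparisons. For a fixed
amount of work W, the runtime is T = W / P and the energy is E = e W, where P is the
performance and e the energy cost per Flop, so EDP = E T = e W^2 / P is proportional to e / P.
The sketch below applies this to a list of points; the data are synthetic numbers for one
hypothetical core-count scan and are not measurements from this paper.

```python
def edp_figure_of_merit(perf, energy_per_flop):
    """Quantity proportional to the energy-delay product for a fixed amount of work."""
    return energy_per_flop / perf


def characterize(points):
    """points: list of (label, performance, energy_per_flop) tuples.

    Returns the labels of the highest-performance, lowest-energy, and
    lowest-EDP points.
    """
    return {
        "highest_performance": max(points, key=lambda p: p[1])[0],
        "lowest_energy": min(points, key=lambda p: p[2])[0],
        "lowest_edp": min(points, key=lambda p: edp_figure_of_merit(p[1], p[2]))[0],
    }


# Synthetic scan over core counts: (cores, MFlop/s, nJ/Flop); illustration only.
POINTS = [
    (1, 900.0, 30.0), (2, 1700.0, 20.0), (3, 2300.0, 15.5), (4, 2600.0, 16.0),
    (5, 2650.0, 17.5), (6, 2660.0, 19.0), (7, 2665.0, 20.5), (8, 2670.0, 22.0),
]

if __name__ == "__main__":
    print(characterize(POINTS))
```

Geometrically, points of equal EDP lie on straight lines through the origin of the Z-plot,
and the lowest EDP belongs to the point on the line with the smallest slope.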
Already during the analysis of scalable code the Uncore clock frequency was identified as a
decisive factor on BDW. In Figure 11(b) we thus show in red, green, and blue the Z-plots for
three different core clock speeds (1.2, 1.7, and 2.3 GHz) with the Uncore clock as the
parameter along the curves. At each point, the number of cores was determined by the minimum
EDP vs. active cores with fixed frequencies. As indicated in Figure 3(a), there is a minimum
Uncore frequency required to saturate the memory interface; Figure 11(b) shows that it is
largely independent of the core frequency. In other words, there is an
$f_{\mathrm{core}}$-independent $f_{\mathrm{Uncore}} \approx 2$ GHz that leads to (almost)
maximum performance. $f_{\mathrm{core}}$ can then be set very low (1.2 GHz) for optimal
energy and EDP without a significant performance loss. Again it can be observed that a
sensible setting of the Uncore frequency is the major contributor to saving energy on BDW.
The black data set shows a typical operating mode in practice, where the Uncore clock is set
very high (or left to be determined by Uncore frequency scaling) and the core clock is
varied. Even with sensible concurrency throttling, the energy cost only varies by about 10%,
whereas optimal parameters allow for additional savings between 27% and 33%.
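Concurrency throttling, i.e., choosing the number of active cores by minimizing the EDP at
fixed frequencies, can be sketched with a deliberately crude model. The sketch assumes
performance min(n P1, P_sat) and power W_base + n W_core; the refined ECM model describes the
transition to saturation more accurately, so this is an illustration of the selection step
only. All function names and parameter values are hypothetical.

```python
def saturating_performance(n_cores, perf_single, perf_sat):
    """Assumed performance model: linear scaling capped at the saturation level."""
    return min(n_cores * perf_single, perf_sat)


def chip_power(n_cores, w_base, w_core):
    """Assumed power model at fixed clock speeds: baseline plus per-core power."""
    return w_base + n_cores * w_core


def cores_for_min_edp(max_cores, perf_single, perf_sat, w_base, w_core):
    """Return the core count with the lowest EDP for a fixed amount of work.

    With runtime T ~ 1/performance and energy E = power * T, the EDP is
    proportional to power / performance**2.
    """
    def edp(n_cores):
        perf = saturating_performance(n_cores, perf_single, perf_sat)
        return chip_power(n_cores, w_base, w_core) / perf ** 2

    return min(range(1, max_cores + 1), key=edp)


if __name__ == "__main__":
    # Hypothetical values: 1000 MFlop/s per core, saturation at 3500 MFlop/s,
    # 25 W baseline power, 5 W per active core.
    print(cores_for_min_edp(10, 1000.0, 3500.0, 25.0, 5.0))
```

In this simplified picture the EDP minimum falls on the first core count that reaches
saturation; every additional core only adds power without adding performance.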
7 Summary and outlook
By refining known ECM performance and power models we have constructed an analytic energy
model for Intel multicore CPUs that can predict the energy consumption and runtime of
scalable and saturating loop code with high accuracy. The power model parameters show
significant manufacturing variation among CPU specimens. The Uncore frequency on the latest
Intel x86 designs was identified as a major factor in energy optimization, even more
important than the core frequency, for scalable and saturating code alike. Overall, energy
savings of 20–50% are possible depending on the CPU and code type by choosing the optimal
operating point in terms of clock speed(s) and number of active cores. If the energy-delay
product (EDP) is the target metric, the Z-plot delivers a simple yet sufficiently accurate
method to determine the point of lowest EDP.
Our work can be extended in several directions. The refined ECM model should be tested
against a variety of codes to check the generality of the recursive latency penalty. We have
ignored the “Turbo Mode” feature of the Intel CPUs, but our models should be able to
encompass Turbo Mode if the dynamic clock frequency variation (depending mainly on the number
of active cores) is properly taken into account. A related problem, the reduction of the base
clock speed when using AVX SIMD instructions on the latest Intel CPUs, could be handled in
the same way. An analysis of the new Intel Skylake-SP and AMD Epyc server CPUs for their
performance and power properties is currently ongoing. It would furthermore be desirable to
identify more cases where the energy model (7) can yield simple analytic results. Finally, it
would be useful to ease the construction of our improved analytic performance and energy
models by extending tools such as Kerncraft [5].
References
1. Intel 64 and IA-32 Architectures Optimization Reference Manual. Intel Press (June 2016),
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
2. Freeh, V.W., Lowenthal, D.K., Pan, F., Kappiah, N., Springer, R., Rountree, B.L., Femal,
M.E.: Analyzing the energy-time trade-off in high-performance computing applications.
IEEE Transactions on Parallel and Distributed Systems 18(6), 835–848 (June 2007)
3. Hackenberg, D., Schöne, R., Ilsche, T., Molka, D., Schuchart, J., Geyer, R.: An energy
efficiency feature survey of the Intel Haswell processor. In: 2015 IEEE International
Parallel and Distributed Processing Symposium Workshop. pp. 896–904 (May 2015)
4. Hager, G., Treibig, J., Habich, J., Wellein, G.: Exploring performance and power
properties of modern multicore chips via simple machine models. Concurrency Computat.:
Pract. Exper. (2013), DOI: 10.1002/cpe.3180
5. Hammer, J., Eitzinger, J., Hager, G., Wellein, G.: Kerncraft: A tool for analytic
performance modeling of loop kernels. In: Niethammer, C., Gracia, J., Hilbrich, T., Knüpfer,
A., Resch, M.M., Nagel, W.E. (eds.) Tools for High Performance Computing 2016: Proceedings of
the 10th International Workshop on Parallel Tools for High Performance Computing, October
2016, Stuttgart, Germany. pp. 1–22. Springer International Publishing, Cham (2017),
https://doi.org/10.1007/978-3-319-56702-0_1
6. Hofmann, J., Hager, G., Wellein, G., Fey, D.: An analysis of core- and chip-level
architectural features in four generations of Intel server processors. In: Kunkel, J.M.,
Yokota, R., Balaji, P., Keyes, D. (eds.) High Performance Computing: 32nd International
Conference, ISC High Performance 2017, Frankfurt, Germany, June 18–22, 2017, Proceedings.
pp. 294–314. Springer International Publishing, Cham (2017),
https://doi.org/10.1007/978-3-319-58667-0_16
7. Inadomi, Y., Patki, T., Inoue, K., Aoyagi, M., Rountree, B., Schulz, M., Lowenthal, D.,
Wada, Y., Fukazawa, K., Ueda, M., Kondo, M., Miyoshi, I.: Analyzing and mitigating the impact
of manufacturing variability in power-constrained supercomputing. In: Proceedings of the
International Conference for High Performance Computing, Networking, Storage and Analysis.
pp. 78:1–78:12. SC ’15, ACM, New York, NY, USA (2015),
http://doi.acm.org/10.1145/2807591.2807638
8. Khabi, D., Küster, U.: Power consumption of kernel operations. In: Resch, M.M., Bez, W.,
Focht, E., Kobayashi, H., Kovalenko, Y. (eds.) Sustained Simulation Performance 2013:
Proceedings of the joint Workshop on Sustained Simulation Performance, University of
Stuttgart (HLRS) and Tohoku University, 2013. pp. 27–45. Springer International Publishing,
Cham (2013), https://doi.org/10.1007/978-3-319-01439-5_3
9. Rauber, T., Rünger, G.: Towards an energy model for modular parallel scientific
applications. In: 2012 IEEE International Conference on Green Computing and Communications.
pp. 523–532 (Nov 2012)
10. Song, S., Su, C., Rountree, B., Cameron, K.W.: A simplified and accurate model of
power-performance efficiency on emergent GPU architectures. In: 2013 IEEE 27th International
Symposium on Parallel and Distributed Processing. pp. 673–686 (May 2013)
11. Stengel, H., Treibig, J., Hager, G., Wellein, G.: Quantifying performance bottlenecks of
stencil computations using the Execution-Cache-Memory model. In: Proceedings of the 29th
ACM International Conference on Supercomputing. ICS ’15, ACM, New York, NY, USA (2015),
http://doi.acm.org/10.1145/2751205.2751240
12. Wilde, T., Auweter, A., Shoukourian, H., Bode, A.: Taking advantage of node power
variation in homogenous HPC systems to save energy. In: Kunkel, J.M., Ludwig, T. (eds.)
High Performance Computing: 30th International Conference, ISC High Performance 2015,
Frankfurt, Germany, July 12-16, 2015, Proceedings. pp. 376–393. Springer International
Publishing, Cham (2015), https://doi.org/10.1007/978-3-319-20119-1_27
13. Williams, S., Waterman, A., Patterson, D.: Roofline: An insightful visual performance
model for multicore architectures. Commun. ACM 52(4), 65–76 (2009),
http://doi.acm.org/10.1145/1498765.1498785
14. Wittmann, M., Hager, G., Zeiser, T., Treibig, J., Wellein, G.: Chip-level and multi-node
analysis of energy-optimized lattice Boltzmann CFD simulations. Concurrency and Computation:
Practice and Experience 28(7), 2295–2315 (2016), http://dx.doi.org/10.1002/cpe.3489
