Benchmarking for power consumption monitoring by Weiland, Michele & Johnson, Nicholas
  
 
 
 
Benchmarking for power consumption monitoring
Citation for published version:
Weiland, M & Johnson, N 2015, 'Benchmarking for power consumption monitoring' Computer Science -
Research and Development, vol. 30, no. 2, pp. 155-163. DOI: 10.1007/s00450-014-0260-1
Digital Object Identifier (DOI):
10.1007/s00450-014-0260-1
Link:
Link to publication record in Edinburgh Research Explorer
Document Version:
Peer reviewed version
Published In:
Computer Science - Research and Development
Benchmarking for power consumption monitoring
Description of benchmarks designed to expose power usage characteristics of parallel hardware
systems, and preliminary results
Michèle Weiland · Nick Johnson
Abstract This paper presents a set of benchmarks that are
designed to measure power consumption in parallel systems.
The benchmarks range from low-level, single instructions
or operations, to small kernels. In addition to describing the
motivation behind developing the benchmarks and the de-
sign principles that were followed, the paper also introduces
a metric to quantify the power-performance of a parallel sys-
tem. Initial results are presented and help to illustrate the
contribution of the paper.
Keywords Benchmarks · power consumption · energy
efficiency metrics
1 Introduction
The quest for Exascale computing has put research into the
power consumption of (parallel) software and hardware firmly
on the agenda of the HPC community. Recent advances in
HPC-specific hardware architectures, and the advent of
low-power multi- and many-core architectures and accelerator
technologies, have meant that developers of parallel software
have had to adapt their programming techniques and
models to exploit the full performance of today's systems.
The Adept project is partially funded by the European Commission un-
der the 7th Framework Programme, grant agreement number 610490.
Michèle Weiland
EPCC
The University of Edinburgh
Tel.: +44-131-6505030
Fax: +44-131-6506555
E-mail: m.weiland@epcc.ed.ac.uk
Nick Johnson
EPCC
The University of Edinburgh
Tel.: +44-131-6505030
Fax: +44-131-6506555
E-mail: Nick.Johnson@ed.ac.uk
Developing parallel software that is efficient in both perfor-
mance and power usage in an increasingly complex hard-
ware landscape is one of the core Exascale challenges [2].
It would however be inaccurate to believe that this chal-
lenge only exists at the top end of parallel computing; it is
also a crucial obstacle to overcome for smaller scale parallel
systems, including mobile processing devices and applica-
tions [1]. In order to be able to implement energy efficient
algorithms or choose the most power-performance efficient
hardware, it is first necessary to understand and quantify any
factors that dictate power consumption. Benchmarks can be
used to gain a deeper understanding of how implementation
and architecture choices can impact on the overall efficiency
of software. If we can identify the power usage profiles of
computational patterns, it will become possible to optimise
for energy and power in the same way we optimise for per-
formance today.
This paper describes the design and implementation of a
set of benchmarks that can be used to measure both performance
and power usage on a wide range of hardware architectures,
and outlines the motivation for developing new
benchmarks rather than using existing suites. The benchmarks
introduced here are representative of the whole spectrum
of parallel computing (from Embedded to HPC) and
provide measurements that can be interpreted on a wide range
of platforms. They also cover different levels of compute
granularity, from single instructions and operations up to
specific computational patterns in the form of kernels.
This paper makes contributions to and progresses the state-
of-the-art in the following areas:
– The development of a set of benchmarks which expose
system behaviour and allow measurement of power and
energy consumption;
– The design of a methodology to quantify power and energy
consumption in a range of systems from embedded
to HPC;
– Initial results exposing the power-performance charac-
teristics of two different CPUs.
The work presented here is part of the research undertaken
in the EU FP7 project Adept (“Addressing Energy in Parallel
Technologies”)¹, which at its heart has the objective to
develop a tool that will allow the prediction of both perfor-
mance and power usage of parallel software on a wide range
of hardware architectures. The development of benchmarks
to further our understanding of energy use and power con-
sumption of software and hardware is central to fulfilling
this objective.
2 Benchmarking for power consumption
Benchmarks are used to measure and quantify the performance
of certain aspects of a given system; here, they are
used specifically to measure performance in combination
with the power and energy usage of computations on
a wide range of hardware platforms. The purpose of the
benchmarks is not to measure pure runtime performance.
Rather, they are used to gain an understanding of the power
usage of a system (where system includes hardware, software
environment, programming models and algorithms) to
inform the development of a power usage model for different
operations and computational patterns. Because the benchmarks
span the entire landscape of parallel computing, from
Embedded to large-scale HPC systems, they need to be
representative of the different computational demands on
these systems. While an HPC application may focus on
floating-point arithmetic, an Embedded application
may be more concerned with thread management and
low-level communication.
2.1 Motivation
The decision to develop a new set of benchmarks rather than
using existing benchmarks (for example BenchIT [6], LMbench
[10], or MultiMaps [8]) was motivated by two factors.
Firstly, to the best of our knowledge, no single benchmark
suite incorporates computational patterns from both Embedded
and high-performance computing; as parallel computing
is no longer restricted to HPC, but is becoming increasingly
commonplace in the Embedded sector as well, it is
important that a benchmark suite for power measurement
encompasses this branch of computing. Secondly, and more
importantly, in order to measure the power usage of particular
fine-grained operations such as inter-process communication
or basic arithmetic, it is necessary to have a clear understanding
of any overheads that stem from the basic benchmark
initialisation and management code. Developing the
benchmarks from scratch means that they can be designed so
that any overheads can be minimised and discounted whenever
necessary. Existing benchmarks would likely need to
be modified considerably in order to allow for a clear distinction
between what needs to be measured and what needs to
be discounted. This second point is particularly important as
it dictates the methodology that was followed in the design
and implementation of the benchmarks, which will be elaborated
on in Section 3. In order to be able to measure
power on a wide range of hardware platforms it is important
to have a clear set of well-defined software benchmarks as a
starting point to ensure accurate and reliable measurements.
Our benchmarks are designed to expose the computational
loads and patterns of typical workloads from both the Embedded
and HPC sectors in order to allow separate,
platform-dependent tools to extract power and energy usage
information.
¹ www.adept-project.eu
2.2 Designing the benchmarks
As mentioned before, an important factor motivating the de-
cision to write new benchmarks rather than use existing ones
is that of low-level control over the implementation and the
separation of overheads. The following design decisions also
dictated the implementation:
Language All the benchmarks are implemented in C; the
language was chosen for portability reasons, as well as for
being closest to the system. C is used widely in both HPC
and Embedded applications. In scenarios where C is too
high level and too many overheads are introduced, or where
the compiler may optimise the source code in an unpre-
dictable manner, alternative assembler language implemen-
tations are provided. This is especially true for very low-
level operations, such as basic arithmetic.
Optimisation The benchmarks are tested with a range of
different C compilers such as GNU, PGI, Intel and Clang.
The performance and power usage of the benchmarks should
represent real-life performance whenever possible and com-
piling the benchmarks with optimisation enabled should be
the default. However, for some of the benchmarks (e.g. func-
tion calls), enabling optimisation would simply result in the
operation that we want to measure being removed (i.e. by in-
lining the function calls). In such cases the benchmarks are
built with all optimisations disabled.
Assembler In order to ensure that the C version of the benchmark
codes results in the correct instructions being executed,
the development process involves disassembling the compiled
binaries and inspecting the machine code. This way it is
possible to verify exactly which instructions are executed for each
benchmark, and which optimisations the compilers perform.
Overheads It is not possible to eliminate overheads entirely.
For instance, in order to measure the time and energy used
to perform a single multiplication, it is necessary to perform
this operation many times in a loop. The overhead (both in
terms of performance and power) that is introduced by the
loop is significant when compared to the multiplication itself.
It is therefore necessary to measure the overhead on
its own in order to discount it from the overall measurements.
This is achieved by adding empty loops containing a nop
operation to the benchmarks, thus making sure that the time
and energy spent in an empty loop can be measured:
    for (i = 0; i < max_rep; i++) {
        asm("nop");
    }
Timing Runtime measurements are currently taken using the
POSIX call gettimeofday(); however, alternatives with finer
granularity are being investigated, such as clock_gettime()
operating in the CLOCK_MONOTONIC_RAW mode. When taking
measurements, the convention is to take a minimum of
10 readings and retain the measurement with the lowest runtime
(i.e. the best possible observed performance).
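The overhead and timing conventions above can be illustrated with a minimal sketch. It is not the Adept harness: the helper names (now_seconds, empty_loop, best_runtime) are hypothetical, the warm-up length is an arbitrary choice, the more portable CLOCK_MONOTONIC is used in place of CLOCK_MONOTONIC_RAW, and the GCC/Clang inline-assembly syntax on a target with a nop instruction is assumed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime() under strict ISO C modes */
#endif
#include <time.h>

/* Hypothetical helper: seconds read from a monotonic clock. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Empty loop used to estimate the loop overhead on its own.
 * Assumes GCC/Clang inline assembly and a target with a "nop" instruction. */
void empty_loop(long max_rep)
{
    for (long i = 0; i < max_rep; i++) {
        __asm__ volatile("nop");
    }
}

/* Warm up with a shorter run (length chosen arbitrarily here), then time
 * `work` `readings` times and keep the lowest runtime. */
double best_runtime(void (*work)(long), long max_rep, int readings)
{
    double best = -1.0;

    work(max_rep / 10 + 1);                 /* warm-up, not measured */
    for (int r = 0; r < readings; r++) {
        double t0 = now_seconds();
        work(max_rep);
        double elapsed = now_seconds() - t0;
        if (best < 0.0 || elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}
```

Calling best_runtime(empty_loop, max_rep, 10) gives an estimate of the loop overhead, which can then be subtracted from the best runtime of a work loop with the same iteration count.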
Warm-up Measuring the power usage of a system “from
cold” may give false and misleading results. It is therefore
important that the benchmarks put the hardware into a known
and stable state by warming it up, i.e. making sure the CPU
is running at a relevant clock-speed, and that the pipeline and
caches are filled. This is especially important for the low-
level benchmarks that measure small operations; in a real-
life application those operations are not isolated and mea-
suring them in a “warm” system takes this into account. The
warm-up is achieved by executing a moderate number of the
benchmark operations before taking any measurements.
3 Adept benchmarks
In the interest of brevity, this section only describes a selection
of the benchmarks that were developed to test
power-performance and scaling; a full list is given in Table 1. These
benchmarks were chosen to represent operations and algorithms
of interest to both HPC and Embedded systems engineers.
In the table, parentheses indicate that a benchmark is only
partly relevant in that area, or that it is not a commonly used
operation or workload there.
Table 1 Complete list of micro- and kernel-level power measurement
benchmarks and an indication of their relevance for either Embedded
or HPC computing.
Benchmark Embedded HPC
Bus transfer ✓ ✓
Memory ✓ ✓
Basic arithmetic ✓ ✓
SIMD instructions ✓ ✓
Network I/O ✓ (✓)
Disk I/O (✓) ✓
Jump & Branch ✓ ✓
Function calls ✓ ✓
Cache misses ✓ ✓
IPC ✓
Thread & process management ✓
BLAS ✓ ✓
File parsing ✓
Pattern matching ✓ ✓
Kernel invocation ✓ ✓
FFT ✓ ✓
Stencil operations ✓ ✓
3.1 Arithmetic Operations
The basic algebra benchmark exercises four basic numeri-
cal operations: addition, subtraction, multiplication and di-
vision. For each operation, different data types may be tested
to compare performance. Measuring a single operation may
be beyond the measurement capabilities of the system under
test, especially for in-band measurement. Therefore a num-
ber of operations N of the same data type are performed in a
loop, in which the number of iterations R is user-specifiable.
For each numerical operation, multiple tests are performed.
The first begins with a single operation (N = 1) being exe-
cuted R times. In each subsequent test, N is increased whilst
R is decreased such that the product (N×R) remains con-
stant. The motivation is to expose any differences in perfor-
mance for what should remain a constant volume of work
with a reduced overhead being incurred by the loop due to
the execution of fewer iterations. Because this benchmark in
particular deals with small, very basic operations, the work
loops are implemented both in C and in assembly language.
While it is possible to measure single arithmetic instructions
using the assembly implementation, the C implementation
will include load and store instructions in addition to the
arithmetic.
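As a rough illustration of keeping the product N×R constant, the sketch below shows two hypothetical work loops for unsigned integer addition, one with N = 1 and one with N = 4 operations per iteration. The function names are invented for this example, and the operand is declared volatile here only so that an optimising compiler does not fold the loop away; the actual benchmark sources may differ.

```c
#include <stdint.h>

/* N = 1: one addition per iteration, `reps` iterations. */
uint32_t add_loop_n1(long reps)
{
    volatile uint32_t one = 1;   /* volatile: forces a load on every use */
    uint32_t acc = 0;
    for (long i = 0; i < reps; i++) {
        acc += one;
    }
    return acc;
}

/* N = 4: four additions per iteration, `reps` iterations. */
uint32_t add_loop_n4(long reps)
{
    volatile uint32_t one = 1;
    uint32_t acc = 0;
    for (long i = 0; i < reps; i++) {
        acc += one;
        acc += one;
        acc += one;
        acc += one;
    }
    return acc;
}
```

With R = 1000 for the first loop and R = 250 for the second, both perform 1000 additions, but the second executes a quarter of the loop iterations, so any difference in runtime or energy can be attributed to the loop overhead.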
3.2 Memory Benchmarks
This benchmark is designed to test the performance of each
level of memory and observe conditions when hierarchical
boundaries are crossed, for example, from L1 cache to L2
cache.
The benchmark performs reads or writes to a block of mem-
ory, the size of which is user specified. Accesses to this
block are made in one of three ways: contiguous, strided, or
random, each of which is detailed below. For write bench-
marks, a single data value is pre-computed and assigned to
each desired element of the array. For read benchmarks, the
array is pre-filled with random data.
– For the contiguous-access case, elements of the array
are accessed in order of monotonically increasing index.
Each element of the array is accessed once, and once
only.
– For the strided-access case, the array is treated as a
quasi-circular buffer. The indices of the elements to be accessed
increase by a constant, the stride length, which
begins at two elements and doubles on each pass up to a
maximum value requested by the user. For example, if
the user requests a stride length of 4, the benchmark will
be run twice, first using a stride length of 2 and then
again using a stride length of 4. Because the array is
considered quasi-circular, all elements of the array are
accessed for each stride length. Quasi-circular, in this
case, means that for each pass through the array, the offset
increases by 1. For example, with an array of length
10 and a stride length of 2, the elements would be accessed
in the following order: 0, 2, 4, 6, 8, 1, 3, 5, 7, 9.
With a true circular buffer, only elements 0, 2, 4, 6 and 8
would be accessed, each of them twice. Here, when the end
of the array is reached, the offset, initially 0, is increased by 1,
allowing access to elements 1 (0+1), 3 (2+1), 5 (4+1), 7
(6+1) and 9 (8+1). Each element of the array is accessed
once, and once only, for each stride length.
– For the random-access case, the element of the array to
be accessed is determined randomly, once per iteration;
the number of iterations is equal to the number of el-
ements in the array. The random-access case does not
store a list of previously accessed elements so it is likely
that some elements may be accessed more than once and
some never accessed.
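The quasi-circular strided pattern described in the list above can be sketched as follows. The function name strided_order and the idea of recording the visit order in a separate array are assumptions made for illustration; the benchmark itself would read or write the data elements directly.

```c
#include <stddef.h>

/* Record the order in which a quasi-circular strided traversal visits an
 * array of n elements for a single stride length. `order` must have room
 * for n entries. Returns the number of indices written. */
size_t strided_order(size_t n, size_t stride, size_t *order)
{
    size_t k = 0;

    /* Each pass starts one element further in than the previous one. */
    for (size_t offset = 0; offset < stride && offset < n; offset++) {
        for (size_t i = offset; i < n; i += stride) {
            order[k++] = i;
        }
    }
    return k;
}
```

For n = 10 and a stride of 2 this yields 0, 2, 4, 6, 8, 1, 3, 5, 7, 9, matching the example in the text, and every index is visited exactly once.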
The memory benchmark also has an option to measure a calloc
operation, i.e. allocating and zeroing a block of memory,
for a user-specified amount of memory.
3.3 Function Calls
This benchmark exposes data relating to the overheads incurred
in calling a function. Many optimising compilers will
attempt to inline functions wherever feasible; however, this
is not always possible, and being able to quantify the implications
of a function call on performance and energy use
is therefore of interest. This benchmark uses a single code,
an iterative approximation algorithm for pi, in all three cases
tested: inline, nested and recursive, each of which is detailed
below. For each case, the entire approximation is repeated
a number of times, R, as specified by the user, and within
each repeat, the number of iterations of the approximation,
N, may also be specified.
– In the inline case, the code is inlined to the caller, giving
a baseline for measurement.
– In the nested case, the caller executes a single function,
which contains the code for each repeat, R, of the calcu-
lation. The called function performs N iterations of the
approximation.
– In the recursive case, the caller executes a function for
each repeat, R, which calls itself for each iteration, N.
Because systems have a maximum recursion depth, the
number of repetitions may have to be broken up. When this
is the case, and R_user > R_max, a fraction of the repeats,
R_frac, is executed M times, i.e. R_frac = R_user/M. The
extra overhead introduced by the additional loop is
discounted.
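The difference between the nested and recursive cases can be sketched as below. The paper does not state which approximation of pi is used, so the Leibniz series here is an assumption, as are the function names; the sketch only illustrates that the same N iterations are performed either inside one called function or as N nested calls.

```c
/* Nested case: one call from the caller, all n iterations inside.
 * Approximates pi with the Leibniz series 4 * sum((-1)^k / (2k + 1)). */
double pi_nested(long n)
{
    double sum = 0.0;
    double sign = 4.0;
    for (long k = 0; k < n; k++) {
        sum += sign / (double)(2 * k + 1);
        sign = -sign;
    }
    return sum;
}

/* Recursive case: one function call per iteration, so the recursion
 * depth equals the number of iterations still to be performed. */
double pi_recursive(long k, long n)
{
    if (k >= n) {
        return 0.0;
    }
    double term = ((k % 2 == 0) ? 4.0 : -4.0) / (double)(2 * k + 1);
    return term + pi_recursive(k + 1, n);
}
```

A caller would invoke pi_nested(N) or pi_recursive(0, N) once per repeat. As noted in the text, such code has to be built without optimisation (or with inlining otherwise prevented) for the call overhead to remain measurable.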
3.4 IPC operations
The inter-process communication (IPC) benchmark exercises
three mechanisms of IPC: FIFO buffers, UNIX domain sockets
and shared memory segments.
In each case, two threads of execution are spawned using the
Pthreads library. One thread runs a server code that sends a
timestamp via the selected IPC mechanism to the other thread,
which runs a client code. The client receives the timestamp
and adds it to an array along with a timestamp representing
when it received the data from the server. After a user-specified
number of repetitions, the differences between pairs of
timestamps (server and client) are computed to give an
approximation to the transit time of the IPC mechanism.
Both the socket and FIFO methods have an implicit buffer,
which negates the need for explicit signalling between server
and client for the sending of each timestamp, provided there
is consensus about readiness prior to the measurement loop.
The shared memory method does require explicit signalling,
both for each timestamp and to establish readiness consensus.
Both these requirements are handled using the Pthreads
condition variable and mutex operations.
3.5 BLAS operations
This benchmark is a naïve implementation of selected BLAS
routines for dense data, for example: dot product, vector
product, Euclidean norm, matrix-vector product and
scalar-vector product. In normal coding, it would always be
preferable to make use of a BLAS library, often provided by the
CPU or system vendor, which has been tuned to provide the best
performance for the system in question. In this case, we are
interested in a simple code that is portable between platforms
(to compare system performance), portable between
programming methodologies (parallelisation in OpenMP [7]
versus UPC [11], for example) and allows coding with different
compiler options and optimisation flags. The baseline
implementation will allow for direct comparison of the
implications that different methods of programming, compilation
and execution have on power consumption and performance.
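To give a flavour of what a naïve, untuned BLAS-style routine looks like, the following sketch shows serial versions of the dot product and AXPY (y = a·x + y). The function names and the choice of double precision are assumptions for illustration; they are not taken from the Adept sources.

```c
#include <stddef.h>

/* Naive dot product: sum of x[i] * y[i] over n elements. */
double naive_dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Naive AXPY: y[i] = a * x[i] + y[i], updating y in place. */
void naive_axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Because the loops carry no platform-specific tuning, the same source can be compiled with different compilers and flags, or parallelised with different programming models, and the resulting power and performance compared directly.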
3.6 FFT
This benchmark is again a naïve implementation of a
2-dimensional FFT, based on the well-known Cooley-Tukey
algorithm [3]. Whilst alternative implementations exist, this
algorithm has been well studied and provides a good comparison
with implementations in libraries such as FFTW. As with the
BLAS benchmark, in many applications a tuned,
platform-specific implementation would be used. However, the desire
here is for a portable implementation which has no
target-specific tuning, but exposes the computational patterns
typical of an FFT computation.
3.7 Stencil algorithms
This benchmark is a naïve implementation of 5-, 9-, 19- and
27-point stencil operations. Stencil algorithms are widely used in
numerical HPC codes and the computational patterns they
exhibit are therefore of interest. The stencil operations are
computed multiple times to ensure a runtime large enough
for measurement. Each iteration performs the stencil operation
(an N-point average) on the complete dataset and saves the
intermediate result to an out-of-place buffer. This buffer then
becomes the input buffer for the next iteration, whilst the
current input buffer becomes the storage buffer.
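The buffer-swapping scheme can be sketched for the 5-point case on a 2D grid as follows. The function names, the row-major layout and the treatment of the boundary (left untouched, so both buffers are assumed to hold the same boundary values on entry) are assumptions made for this illustration.

```c
#include <stddef.h>

/* One sweep of a 5-point average over the interior of a rows x cols grid
 * stored in row-major order. Boundary elements of `out` are not written. */
void stencil5_sweep(size_t rows, size_t cols, const double *in, double *out)
{
    for (size_t i = 1; i + 1 < rows; i++) {
        for (size_t j = 1; j + 1 < cols; j++) {
            size_t c = i * cols + j;
            out[c] = (in[c] + in[c - cols] + in[c + cols]
                      + in[c - 1] + in[c + 1]) / 5.0;
        }
    }
}

/* Repeat the sweep `iters` times, swapping the roles of the two buffers
 * after each iteration. Returns the buffer holding the final result. */
double *stencil5_run(size_t rows, size_t cols, double *a, double *b, int iters)
{
    double *in = a;
    double *out = b;
    for (int it = 0; it < iters; it++) {
        stencil5_sweep(rows, cols, in, out);
        double *tmp = in;   /* output becomes the next input */
        in = out;
        out = tmp;
    }
    return in;
}
```

Swapping pointers rather than copying data keeps the cost of each iteration dominated by the stencil computation itself, which is the pattern the benchmark is meant to expose.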
4 Metrics
Perhaps one of the most important issues when benchmark-
ing power consumption and performance is how best to re-
port and analyse the data. Traditionally, performance is re-
ported in units of operations of interest per second (op/s).
In this case, an operation of interest may be bytes written to
disk, packets sent across a network interface, or the number
of dot-products calculated.
The biggest problem with this approach is that it takes ac-
count of neither energy consumption nor other system com-
ponents. When considering energy consumption, the whole
system should be accounted for, rather than a specific com-
ponent, which implies adding idle, or unused, components
to any metric.
We therefore propose a metric of operations-of-interest per
second per Watt (op/s/W) which allows a comparison of
the power efficiency of different components in a given sys-
tem. From this we can derive a metric for the energy scaling
performance of a system.
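\begin{align}
E(1) &= E_A \times 1 + E_I \times (N_T - 1) \tag{1a} \\
E(N_A) &= E_A \times N_A + E_I \times (N_T - N_A) \tag{1b} \\
\frac{E(N_A)}{E(1)} &= \frac{E_A \times N_A + E_I \times (N_T - N_A)}{E_A + E_I \times (N_T - 1)} \tag{1c}
\end{align}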
Consider Equation 1, where E represents energy consumed,
N represents the number of components and the subscripts A,
I and T represent active, idle and total respectively. For this
explanation, it is assumed that the components in question
are cores in a multi-core CPU. The equation is cast in terms
of energy and operations of interest, i.e. the time element has
been factored out. This is because we seek to compare the
performance and efficiency of a system using a fixed volume
of computational work. The runtime of this work will
vary with changes in the system configuration (for example
choice of CPU or number of active threads), hence the
normalisation by runtime. We are also concerned with the energy
consumed by the system doing the work, not the peak power,
which is an instantaneous measurement and a function of
the design of the system.
Equation 1a gives the total energy consumed with one active
core as the energy consumed by an active core (E_A) plus the
energy consumed by N_T − 1 idle cores. Equation 1b generalises
this to an arbitrary mix of idle and active cores. Finally,
Equation 1c gives the scaling ratio; that is, the ratio of
the energy consumption with an arbitrary number of active
cores to that with one active core (and the remainder idle).
In a system where idle cores consume no energy, this would
result in a linear scaling with N_A, but this may not be the
case in practice.
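Equation 1 translates directly into code. The sketch below is a literal transcription under the paper's definitions; the function and parameter names are hypothetical, and any numbers passed in are made up for illustration rather than taken from the measurements.

```c
/* Equation 1b: total energy with n_active active cores and the remaining
 * (n_total - n_active) cores idle. */
double energy_total(double e_active, double e_idle, int n_total, int n_active)
{
    return e_active * (double)n_active
         + e_idle * (double)(n_total - n_active);
}

/* Equation 1c: energy with n_active active cores relative to the energy
 * with a single active core. */
double energy_scaling_ratio(double e_active, double e_idle,
                            int n_total, int n_active)
{
    return energy_total(e_active, e_idle, n_total, n_active)
         / energy_total(e_active, e_idle, n_total, 1);
}
```

With e_idle set to zero the ratio reduces to n_active, i.e. the linear scaling mentioned in the text. With illustrative values e_active = 3, e_idle = 1, n_total = 4 and n_active = 2, the ratio is (6 + 2)/(3 + 3) ≈ 1.33, which shows how idle energy pulls the ratio below linear.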
This approach could be further generalised by including terms
for all system components such as memory, GPU and disk
in both active and idle states.
5 Early results
In this section we show results from selected multi-threaded
BLAS benchmarks, namely AXPY and dot-product using
a vector length of 50 million elements, as well as a simple
arithmetic benchmark. The results were obtained by using
the power measurement systems available on an ODROID
XU+E platform [4]. The motivation for using this system
is that it provides easy access to power measurements. The
benchmarks were run on the platform’s performance CPU,
an ARM A15 and on the powersaving CPU, an ARM A7.
Ordinarily, the system is free to migrate loads between
processors; however, for this test the load (the benchmark) was
fixed to one CPU. In all cases, the energy consumption figures
are for the CPU in question only, to give an estimate
of performance scaling for the CPU. Additionally, the A15
core is complex and offers out-of-order execution in a similar
manner to an x86-based system. More details of this system
can be found in Appendix A. In the case of an HPC system,
whole-node power readings may be obtained from the system
itself, but are more usually provided by the job scheduler
upon job completion. The Cray XC30 is an example of a
system that already offers this functionality [5].
[Plot: A7 axpy; x-axis: Thread Count (1 to 4); y-axes: Runtime (ms) and Energy (J); series: int, float, double]
Fig. 1 AXPY benchmark as run on A7 processor for three data types:
int, float & double.
In Figure 1 we see the A7 processor performance and power
consumption for three data types when running the AXPY
benchmark. It can be clearly seen that using the double data
type consumes more energy and takes longer to complete
than the float data type, as may naïvely be expected.
The energy consumption scales reasonably well for all three
data types, although it is clearly sub-linear. It may at first
seem counter-intuitive that the total energy consumption
decreases with increasing core count; however, this is a result of
the reduced total runtime. It is clear that, on this low-power
CPU, there is a benefit in running the AXPY benchmark
with as many cores as possible to get the best performance
in terms of runtime and energy.
[Plot: A7 dot_product; x-axis: Thread Count (1 to 4); y-axes: Runtime (ms) and Energy (J); series: int, float, double]
Fig. 2 Dot-product benchmark as run on A7 processor for three data
types: int, float & double.
In Figure 2 we see the same processor executing the dot-
product benchmark. Again, the scaling performance of both
runtime and energy consumption is good, but sub-linear. In-
terestingly, for this benchmark on the A7 CPU, the perfor-
mance of the float data type is better (faster runtime, but
higher energy consumption) than for the integer data type.
[Plot: A15 axpy; x-axis: Thread Count (1 to 4); y-axes: Runtime (ms) and Energy (J); series: int, float, double]
Fig. 3 AXPY benchmark as run on A15 processor for three data types:
int, float & double.
In Figure 3 we see the performance of the A15 processor
running the AXPY benchmark. Compared with Figure 1, it
is clear that this processor consumes a much larger amount
of power, of the order of 7 times that of the A7, for the sin-
gle core case. Runtime performance is much improved com-
pared to the A7 CPU, with this benchmark taking around
one order of magnitude less time to complete for the single
core case. However, the scaling is poor up to 4 threads with
little runtime decrease from additional cores, but a marked
increase in power consumption.
[Plot: A15 dot_product; x-axis: Thread Count (1 to 4); y-axes: Runtime (ms) and Energy (J); series: int, float, double]
Fig. 4 Dot-product benchmark as run on A15 processor for three data
types: int, float & double.
In Figure 4 we see the same processor running the dot-product
benchmark. The result, when comparing with Figure 2, cor-
relates to that of the AXPY case: the runtime performance
improvement is good, with a marked reduction for the single
core case, however the scaling is poor and power consump-
tion markedly increases with increasing core counts.
Taken together, these results show that balancing the power
and performance of the two BLAS benchmarks is more com-
plex than might be initially assumed. With a simple proces-
sor such as the A7, results scale well in terms of runtime
and power consumption as the number of active cores is in-
creased. In both benchmarks shown, the most efficient oper-
ating mode for this processor is to use all available cores, re-
gardless of data type. For a more complex processor such as
the A15, it is more difficult to balance runtime against power
consumption. Increasing the core count can reduce the run-
time, but markedly increases power consumption, even with
the saving of a reduced runtime.
Figure 5 shows the performance for 1 billion integer additions
as part of the Basic Arithmetic benchmark. The “nop”
instance of the benchmark is used to quantify the overhead
that is introduced by a loop with 1 billion iterations. It is
possible to see that changing the number of operations inside
the loop (N) has a minimal effect on both runtime and energy
consumption. What affects these quantities most is the data
locality, such as whether a value is in a CPU register, in the
cache or in main memory, as well as the number of iterations
used to compute the operations. In the 1_volatile case,
only one of the variables inside the work loop is declared
volatile. In the some_volatile and all_volatile cases,
some or all of the variables are declared as volatile
respectively; in the some_volatile case the number of accesses
to volatile variables per iteration remains constant
for all values of N (i.e. the total number of volatile memory
accesses reduces as N increases), whereas the total number
of volatile memory accesses is independent of N in the
all_volatile scenario. It can be seen that, as expected,
the more variables are declared volatile, the higher the power
consumption, noticeably so for three or more operations in the
work loop (N ≥ 3). This is because non-volatile variables
may be held very efficiently in the CPU registers, whereas
the use of volatile tells the compiler to re-read the variables
with every use. The use of the volatile keyword is
widespread in Embedded systems programming, though it is less
commonly used in HPC.
[Plot: int_add; x-axis: Number of loop operations N (no-op, 1, 2, 4, 5, 8, 10); y-axes: Runtime (ms) and Energy (J); series: 1_volatile, some_volatile, all_volatile]
Fig. 5 Performance of integer additions benchmark
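The effect of the volatile qualifier on a work loop can be sketched as below. The two functions are hypothetical and are not the benchmark's 1_volatile, some_volatile or all_volatile variants; they only contrast a loop whose variables may live in registers with one where every access must go through memory.

```c
#include <stdint.h>

/* No volatile variables: the compiler is free to keep `acc` and `inc`
 * in registers (or, with optimisation, to simplify the loop entirely). */
uint32_t add_plain(long reps)
{
    uint32_t acc = 0;
    uint32_t inc = 1;
    for (long i = 0; i < reps; i++) {
        acc = acc + inc;
    }
    return acc;
}

/* All variables volatile: each iteration must load `acc` and `inc`
 * from memory and store the result back to `acc`. */
uint32_t add_all_volatile(long reps)
{
    volatile uint32_t acc = 0;
    volatile uint32_t inc = 1;
    for (long i = 0; i < reps; i++) {
        acc = acc + inc;
    }
    return acc;
}
```

Both functions compute the same value, so any difference in measured runtime or energy between them comes from the additional memory traffic that volatile forces.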
Figures 6 and 7 show the energy efficiency as described by
Equation 1 for the AXPY and dot-product benchmarks. The
data used to compute these results is the same as used for
the previous figures; the workload is fixed and strong scal-
ing behaviour is shown. What can be seen is that for the
benchmarks shown the CPU-power efficiency of the A15
scales sub-linearly (scaling factor > 1), whereas the CPU-
power efficiency of the A7 is super-linear (< 1). What the
figures represent is the efficiency of changing processors
8 Weiland & Johnson
1 2 3 4
Number of Active cores (NA)
0.8
1.0
1.2
1.4
1.6
En
er
gy
 S
ca
lin
g 
R
at
io
 (E
(N
A
)
E
(1
)
)
Scaling ratio for AXPY benchmark
A15 int
A7 int
A15 float
A7 float
A15 double
A7 double
Linear
Fig. 6 Energy scaling for AXPY benchmark
1 2 3 4
Number of Active cores (NA)
0.8
1.0
1.2
1.4
1.6
En
er
gy
 S
ca
lin
g 
R
at
io
 (E
(N
A
)
E
(1
)
)
Scaling ratio for dot-product benchmark
A15 int
A7 int
A15 float
A7 float
A15 double
A7 double
Linear
Fig. 7 Energy scaling for dot-product benchmark
from idle to active. If idle CPUs consumed zero energy, and
if perfect scaling of energy usage of active cores were
assumed, the efficiency ratio would be 1 for all core counts.
However, this is not the case: firstly, because idle cores
consume power (namely 0.0354W per A15 core and 0.0125W per
A7 core; see footnote 2) and therefore need to be taken into
account in the overall energy usage of the CPU; and secondly,
because E(NA) ≠ N×E(1).
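The contribution of idle cores can be made concrete with a small calculation. The following is a sketch under stated assumptions: the helper names idle_energy_joules and energy_scaling_ratio are hypothetical, the ratio simply follows the E(NA)/E(1) quantity plotted in Figures 6 and 7, and the runtime used in the note below is purely illustrative. Only the idle power figures are taken from the text.

```c
/* Energy in Joules drawn by cores that stay idle for the whole run,
   using E = (number of idle cores) * (idle power per core) * (runtime). */
double idle_energy_joules(int idle_cores, double idle_watts_per_core,
                          double runtime_s)
{
    return idle_cores * idle_watts_per_core * runtime_s;
}

/* Energy scaling ratio E(NA)/E(1). A value of 1.0 means the energy used
   with NA active cores equals the energy used with a single core. */
double energy_scaling_ratio(double energy_na_joules, double energy_1_joules)
{
    return energy_na_joules / energy_1_joules;
}
```

For instance, if a single A15 core were active for a hypothetical 10 s run, the three idle A15 cores alone would account for 3 × 0.0354 W × 10 s ≈ 1.06 J of the measured CPU energy.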
The A15 processor, in both benchmark cases, consumes
∼1.6×E(1) Joules when using 4 active cores (NA = 4 and
E(NA) = 24.13 Joules), which means that a 60% energy usage
penalty is incurred for the improved runtime achieved by
using 4 threads (with E(1) = 16.06 Joules).
2 These numbers were extracted from the system when the respective CPU was idle.
The A7 processor, however, shows the opposite behaviour. Using
more cores results in a lower overall energy consumption,
∼0.8×E(1) Joules when using 4 active cores (NA = 4). The
implication here is that the more cores are used, the more
efficient (in terms of energy consumed) the processor becomes
at computing the benchmarks. An indication of this is also
that the difference between the idle and active power for a
single core on this processor is small, whilst it is much
more significant for the A15.
6 Future Work
The work presented in this paper is still in the early stages of
research. The baseline implementations of the benchmarks
have been developed and tested on a variety of platforms
to ensure portability and correctness; the next steps involve
developing alternative implementations and parallelisation
strategies using different programming models and, where
applicable, different algorithms. To date, we have been
restricted to performing power measurements on the two ARM
CPUs offered by the ODROID platform. As part of the Adept
project, we are working on designing a flexible and accurate
power measurement solution that will allow us to run the
benchmarks on a wider range of platforms.
7 Conclusions
Understanding the power consumption profile of an appli-
cation on a given hardware architecture is the all-important
first step in being able to optimise this application for en-
ergy and power usage. Without this prerequisite understand-
ing, trying to achieve good power-performance efficiency is
akin to implementing code optimisations without knowing
the performance hotspots. This is where the Adept bench-
marks, and the associated metrics, come in: they provide
detailed, quantifiable and comparable information that will
deepen the understanding of software and hardware power
usage profiles.
A ODROID Specifications
The board used in the evaluation section of this paper is an ODROID
XU+E. This is a complete System-on-Chip based on the Samsung Exynos
5410 Octa processor with two quad-core ARM CPUs [9]: the perfor-
mance CPU, a complex out-of-order ARM A15 running at 1.6GHz,
and the powersaving CPU, a simple in-order ARM A7, with a clock
speed of 200MHz. Both CPUs have 32KB L1 instruction and data
caches per compute core. However, the L2 cache (which is shared
between all cores of the CPU) is 2MB for the A15, as opposed to only
512KB for the A7. The ODROID has 2GB of LPDDR3 DRAM, which
runs at 800MHz and has a maximum bandwidth of 12.8GB/s. Ordinarily,
the system is free to migrate loads between processors; however, for
all results in this paper the load (the benchmark) was fixed to one CPU.
The ODROID has built-in power measurement sensors for both the
SoC and the board, allowing easy access to power usage data without
external instrumentation. These sensors can measure the voltage, current
and power consumption of each of the CPUs, as well as the memory and
the on-board GPU. The sensor readings are reported via the Linux
filesystem. The update period for the sensors is set to the default of
262ms, although it can be lowered to measure shorter loads at the cost
of an increased sampling overhead, as for any in-band measurement
system. The measurements themselves are taken by INA231 sensor
modules from TI, which use 16-bit ADCs with a resolution of 2.5µV.
A block diagram for the ODROID is shown in Figure 8.
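Since the sensors report power readings at a fixed update period, one way to turn them into an energy figure is to integrate the samples over the duration of a run. A minimal sketch of that step is given below; the function name energy_from_samples and the rectangle-rule integration are assumptions made for illustration, not a description of the measurement code used for this paper. Reading the sensor values themselves is platform specific and is deliberately left out.

```c
#include <stddef.h>

/* Approximate energy in Joules from n power samples in Watts taken every
   period_s seconds, treating each sample as constant over its interval. */
double energy_from_samples(const double *watts, size_t n, double period_s)
{
    double energy = 0.0;
    for (size_t i = 0; i < n; i++) {
        energy += watts[i] * period_s;
    }
    return energy;
}
```

With the default 262ms update period, a load lasting well under a second is represented by only a few samples, which is why shorter loads call for a lower update period.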
Acknowledgements Thanks to James Perry and Iakovos Panourgias,
both EPCC, for testing/reviewing the benchmarks, and to Andrew Mc-
Cormick from Alpha Data Parallel Systems Ltd for deriving the energy
scaling metrics.
References
1. Towards a breakthrough in software for advanced computing systems. Report from a Workshop organised by the European Commission in preparation for HORIZON 2020 (2012)
2. Amarasinghe, S., Campbell, D., Carlson, W., Chien, A., Dally, W., Elnozahy, E., Harrison, R., Harrod, W., Hiller, J., Karp, S., Koelbel, C., Koester, D., Kogge, P., Levesque, J., Reed, D., Schreiber, R., Richards, M., Scarpelli, A., Shalf, J., Snavely, A., Sterling, T.: Exascale software study: Software challenges in extreme scale systems (2009)
3. Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19(90), 297–301 (1965). DOI 10.1090/s0025-5718-1965-0178586-1. URL http://dx.doi.org/10.1090/S0025-5718-1965-0178586-1
4. Hardkernel: ODROID XU+E Specification. Online. URL http://bit.ly/1sLd62v
5. Hart, A., Richardson, H., Doleschal, J., Ilsche, T., Bielert, M., Kappel, M.: User-level power monitoring and application performance on Cray XC30 supercomputers. In: Proceedings of the Cray User Group (CUG) 2014, Lugano, Switzerland (2014)
6. Juckeland, G., et al.: BenchIT – Performance measurement and comparison for scientific applications. In: G. Joubert, W. Nagel, F. Peters, W. Walter (eds.) Parallel Computing: Software Technology, Algorithms, Architectures and Applications, Advances in Parallel Computing, vol. 13, pp. 501–508. North-Holland (2004)
7. OpenMP ARB: OpenMP Specification (2013)
8. PMaC: MultiMaps. URL http://bit.ly/1hG2vwr
9. Samsung: Samsung Exynos 5 Octa Specification. URL http://bit.ly/OOsOcZ
10. Staelin, C., Hewlett-Packard Laboratories: lmbench: Portable tools for performance analysis. In: USENIX Annual Technical Conference, pp. 279–294 (1996)
11. UPC Consortium: UPC Language Specifications (2005)
Dr Miche`le Weiland is a Project
Manager at EPCC, the supercom-
puting centre at the University of
Edinburgh. She is the Coordina-
tor of the EU FP7-funded Adept
project; her main research interests
are in the power-performance
optimisation of HPC applications.
Dr Nick Johnson is an Applica-
tions Consultant at EPCC, the su-
percomputing centre at the Uni-
versity of Edinburgh. He cur-
rently works on the EU FP7-funded
Adept project; his research in-
terests are the power-performance
optimisation of applications, and
methods for the measurement and
quantification of power in com-
puter systems.
Fig. 8 ODROID block diagram, courtesy of HardKernel.
