BEEBS: Open Benchmarks for Energy Measurements on Embedded Platforms by Pallister, James et al.
arXiv:1308.5174v2 [cs.PF] 28 Sep 2013
BEEBS: Open Benchmarks for Energy
Measurements on Embedded Platforms
James Pallister
jp@cs.bris.ac.uk
Simon Hollis
simon@cs.bris.ac.uk
Jeremy Bennett
jeremy.bennett@embecosm.com
Department of Computer Science,
University of Bristol,
Merchant Venturers Building,
Woodland Road, Bristol, BS8 1UB
United Kingdom
Embecosm
Palamos House #104,
66/67 High Street,
Lymington, SO41 9AL, United
Kingdom.
1 Abstract
This paper presents and justifies an open bench-
mark suite named BEEBS, targeted at evaluating
the energy consumption of embedded processors.
We explore the possible sources of energy
consumption, then select individual benchmarks
from contemporary suites to cover these areas.
Version one of BEEBS is presented here and con-
tains 10 benchmarks that cover a wide range of
typical embedded applications. The benchmark
suite is portable across diverse architectures and
is freely available.
The benchmark suite is extensively evaluated,
and the properties of its constituent programs
are analysed. Using real hardware platforms
we show case examples which illustrate the
difference in power dissipation between three
processor architectures and their related ISAs.
We observe significant differences in the aver-
age instruction dissipation between the architec-
tures of 4.4x, specifically 170µW/MHz (ARM
Cortex-M0), 65µW/MHz (Adapteva Epiphany)
and 88µW/MHz (XMOS XS1-L1).
2 Introduction
Benchmarking is frequently used to gain an idea of how
a system will perform during general use, when the spe-
cific environment cannot be reproduced at design-time.
This gives designers feedback on how their system will
perform and where performance is lacking. Typically,
one benchmark cannot exercise all aspects of a target,
leading to suites of benchmarks. Each benchmark tests
a combination of areas of the hardware. This separation
of benchmarks allows the designer to see which parts of
the hardware perform the best.
The energy consumption of electronic devices is
rapidly becoming a large factor in the design process.
A portable embedded system will typically have severe
power constraints placed upon it, if it is to have a long
battery life. To recognize whether these constraints have
been met, the power consumption of the device under a
typical load must be tested. To build a full picture of a
platform’s energy consumption characteristics, a bench-
mark suite that hits possible combinations of an applica-
tion’s characteristics (such as memory accesses, integer
and floating point operations, etc.) is needed. This allows
the energy consumption of various components of
the system to be determined, ensuring that the system
is fit for purpose.
There are few freely available benchmark suites for
deeply embedded systems and none exist which are de-
signed to allow energy consumption to be measured. Ex-
isting suites, such as MiBench [1], MediaBench [2], LIN-
PACK [3] and Dhrystone [4] are all targeted towards
larger desktop-based applications, with significant com-
pute power. This is due to their emphasis on measuring
performance, as opposed to energy efficiency. Most as-
sume a host operating system is present, which may not
be true on an embedded system. Furthermore, when
analysing energy consumption, having to account for
the operating system’s effect on the result is non-trivial.
These benchmarks, while portable in theory, have
significant difficulties running unmodified on embedded
platforms. There are a variety of issues that
cause these difficulties, such as lack of an OS, lack of a
storage system, small memory size and run-time scala-
bility. The issue of run-time scalability only occurs with
a diverse range of platforms — large differences in clock
speed and microarchitecture may mean that without scal-
ing down a benchmark it is infeasible to run it on less
powerful platforms.
Of the existing suites, MiBench is the closest to our
Name               Source    B  M  I  FP  License  Category
Blowfish           MiBench   L  M  H  L   GPL      Security
CRC32              MiBench   M  L  H  L   GPL      Network, telecomm
Cubic root solver  MiBench   L  M  H  L   GPL      Automotive
Dijkstra           MiBench   M  L  H  L   GPL      Network
FDCT               WCET      H  H  L  H   None†    Consumer
Float Matmult      WCET      M  H  M  M   None†    Automotive, consumer
Integer Matmult    WCET      M  M  H  L   None†    Automotive
Rijndael           MiBench   H  L  M  L   GPL      Security
SHA                MiBench   H  M  M  L   GPL      Network, security
2D FIR             DSPstone  H  M  L  H   None†    Automotive, consumer
Table 1: Benchmarks selected, and the categories they fit in. Legend in Table 2.
† Redistributed under the GPL.
requirements in terms of variety of benchmarks and appli-
cability but assumes there is a host operating system for
the majority of the benchmarks. In particular it requires
access to a filesystem which is usually unavailable on
small embedded platforms. The benchmarks represent a
broad range of embedded areas. Our benchmark suite
keeps this cross-section of areas while selecting bench-
marks which bring out a range of energy consumption
characteristics.
The WCET benchmarks [5] are also quite suitable, in
that none of them require an operating system. However,
many of these programs are small and not representative
of computations that would typically be done on an em-
bedded platform (e.g. searching for primes).
The DSPstone suite [6] is aimed at evaluating com-
pilers for DSP-type platforms, therefore it fits into the
criteria of no OS and small memory footprint. However
the majority of these benchmarks are too small to be
useful in a realistic benchmark set.
In this paper we create a new set of benchmarks — the
Bristol Energy Efficiency Benchmark Suite (BEEBS) [7]
— chosen from popular benchmark suites, and their use
justified for benchmarking energy consumption. The
benchmark suite is designed to expose the processor
and memory’s performance, with other factors such as
I/O and peripherals excluded for portability. The selection
was designed such that the benchmarks would
be portable and would expose the change in energy consumption
when exercising the platform in different ways, such
as with memory versus arithmetic intensive computation.
The benchmarks are intended to be run on the bare metal
with no host operating system.
We consider four orthogonal aspects that the bench-
mark suite must cover, allowing the range of benchmarks
to expose all of the behaviour of the platform.
• Integer operations. Operations which use the inte-
ger ALU will have similar energy consumptions.
• Floating point operations. These operations may
use different pipelines or functional units to the in-
teger operations, so may consume a different amount
of energy.
• Memory access intensity. An access to memory is
known to take a significantly different amount of
energy to other operations [8].
• Branching frequency. Branching frequently will
stress parts of the processor, such as an instruction
prefetch phase. This is similar to memory access in-
tensity, but as the code and data are often held in
different areas and types of memory this should be
considered separately.
Using benchmarks that hit combinations of these, in-
teresting observations about the energy consumption of
the device can be made.
The benchmark suite has been extensively tested on
three different processors, with the rest of the paper de-
tailing the results, shown in the top half of Table 3. The
suite has been confirmed to run successfully on a further
three platforms (shown in the bottom half of Table 3).
Targeting multiple platforms ensures that more general
conclusions can be drawn about the nature of the energy
consumption.
This paper discusses previous benchmark suites, justi-
fying the need for a benchmark suite targeted at exposing
energy consumption characteristics. Then a set of bench-
marks chosen from subsets of these pre-existing suites is
presented, with justifications listed for the benchmarks
and the modifications made to them. An analysis of the
new BEEBS suite is given, with instruction distributions
and examples of how the benchmarks can be used to ex-
pose energy consumption characteristics.
Key  Description
L    Low
M    Medium
H    High
B    Branching
M    Memory intensity
I    Integer pipeline intensity
FP   FPU pipeline intensity
Table 2: Legend for the benchmark table
Vendor     Processor
ARM        Cortex-M0
Adapteva   Epiphany
XMOS       L1
---------  --------------
ARM        Cortex-M3
ARM        Cortex-A8
Microchip  PIC32MX (MIPS)
Table 3: Platforms considered for the benchmark suite.
The top half of the table is analysed in depth, whereas
the suite is verified to compile and run on the lower half.
3 Previous Work
Of the many existing benchmark suites, few target em-
bedded systems. Most target either desktop machines
(e.g. Dhrystone) or HPC (e.g. PARSEC). Few also explicitly
target multithreaded systems, and none explicitly
aim for energy as the target metric.
MiBench established a well known set of benchmarks
with well characterised behaviour. This suite consisted
of 37 different benchmarks split across six different cate-
gories, chosen to be representative of which applications
would be run on both desktop and embedded platforms.
Each benchmark is justified, with instruction traces anal-
ysed on a model of the StrongARM architecture. This
gave a good overview of the proportions of each type of
instructions that the benchmarks executed. The draw-
back of this was that the instruction traces were only
gathered for one platform — each benchmark could have
a radically different instruction distribution on alternative
platforms, exposing different performance
characteristics.
MiBench was used as the main benchmark suite for
MILEPOST GCC [9]. This study applied machine learn-
ing to predict which optimizations would benefit a pro-
gram without needing to perform expensive iterative
compilation techniques. In this study they emphasised
how the performance achieved can be very dependent
on the structure of the benchmarks. This highlights the
need to have a wide range of benchmarks which each hit
different combinations of the types of computation they
could perform.
ParMiBench, a variant of MiBench, was created to
address the lack of multithreadedness in the original
suite [10]. It attempts to parallelise some of the bench-
marks, allowing them to be used to benchmark multi-
core systems. This has an advantage over other paral-
lel benchmark suites in that it also targets the embed-
ded space. Very few other benchmark suites (such as
LINPACK, PARSEC and SPLASH-2 [11]) target mul-
tithreadedness at this level — most are aimed at large
clusters and HPC applications.
DSPstone is a benchmark suite for Digital Signal Processors
(DSPs) and was originally designed to evaluate
how effectively compilers generate code for DSPs. This suite
contains a large number of non-integer tests, with most
tests replicated in fixed point and floating point form.
As this set is aimed at DSPs rather than general purpose
processors, only one benchmark (2D FIR, see Table 1) was
chosen from DSPstone.
A set of benchmarks is maintained by the worst case
execution time (WCET) initiative. These benchmarks
are appropriate because they are self contained and writ-
ten completely in C. Each benchmark is less comprehen-
sive than its equivalent from the MiBench set, but fo-
cusses on one particular application that may be specif-
ically what a low end processor will perform. Some of
these applications fit well with typical embedded appli-
cations.
In addition to the suites discussed above, we evaluated
several other suites:
• MediaBench
• OpenBench [12]
• SPEC2006 [13]
• LINPACK
• Livermore Fortran Kernels
All of these benchmark suites were found to be unsuit-
able for the aim of characterising energy consumption on
embedded platforms due to their reliance on the operat-
ing system and features provided by it.
A specific suite to target energy consumption is useful
because of the differing energy costs of each instruction
in a processor’s instruction set. Many previous studies
[14, 15, 16, 17] have attached an energy cost to each
instruction and find that different instructions can have
significantly different energies even if they take a similar
amount of time.
Brooks et al. created the Wattch toolkit [18] which
provides architectural models and instruction level mod-
els to allow design-space exploration of the power con-
sumption of processors, as well as evaluating software’s
energy consumption. BEEBS provides the missing component,
a benchmark suite designed for energy exploration
that allows these kinds of explorations to be done
consistently and systematically.
Energy modelling has also been used to optimise a pro-
gram’s execution, through selecting compiler optimisa-
tions [19], instruction scheduling [20] and automatically
inserting idle instructions [21].
Optimisation can also be achieved at the microarchi-
tecture level, for example, by choosing an instruction
encoding to minimise the number of bit flips [22]. Other
methods of reducing energy in this way include encoding
bus traffic [23], adaptive scheduling of DRAM accesses [24]
and exposing energy efficient versions of instructions
in an ISA [25].
4 Platforms
We intend BEEBS to be applicable to a wide range of
hardware platforms. For our evaluation, a range of plat-
forms has been chosen, covering different types of ar-
chitectures. The processors are mainly small embedded
systems which are designed for low power usage. As a
consequence, some of the platforms are very memory lim-
ited, restricting the types of applications that can be run
on them.
A set of platforms is needed to complement the bench-
mark suite due to the varying capabilities of each plat-
form. For example, a benchmark will behave very dif-
ferently on platforms which have a cache, compared to
platforms which do not. As such, we have chosen plat-
forms with different pipeline depths, numbers of registers
and types of memory. A comparison of the platforms can
be seen in Table 4.
The number of registers has a large effect on the energy
consumption due to the high cost of memory accesses —
if a variable can be stored in a register there will be fewer
memory accesses and overall less energy consumed. For
similar reasons the type of memory the code is executing from
can have a large impact on energy: flash and SRAM
consume different amounts of energy.
The XMOS platform is unusual in that
it is an event driven multicore platform with eight hardware
threads. Of these threads, up to four can run at full
speed [26]. The Epiphany platform is superscalar, having
one integer pipeline and another integer/floating point
pipeline. The Epiphany processor used has 16 cores, connected
by a network-on-chip [27]. The ARM Cortex-M0
is a simple single-core processor.
All three platforms also have diverse instruction sets
with different features. This diversity makes this selec-
tion of platforms ideal for testing the benchmarks.
Platform   Registers  Pipeline depth  FPU  Execution memory
Cortex-M0  16         3               No   Flash
XMOS L1    12         4               No   RAM
Epiphany   64         8               Yes  RAM
Table 4: Features of the platforms experimented on.
5 The BEEBS Benchmarks
A set of benchmarks to test all aspects of the target platforms
is presented in this section. The benchmarks were
selected by defining a coverage matrix which included all
the individual benchmarks from the following suites:
• MiBench
• DSPstone
• WCET
• Livermore Fortran Kernels
• Dhrystone
• MediaBench
The matrix (listed in full in Appendix A) also broadly
evaluated other benchmark suites for their suitability.
Two sets of parameters are evaluated in this table —
type of operations performed by the benchmark and suit-
ability for inclusion in the final suite. The suitability for
inclusion evaluates whether the benchmark should be in-
cluded, based on what the benchmark does, whether it
will work on the target platforms and the effort required
to port it.
The type of operations was derived from examining
the source of each benchmark and roughly categorising
it as to the types of operations it performs. This allows
benchmarks with similar properties to be excluded before
a lengthy examination.
Benchmarks with a high suitability and a minimal set
covering suitably different types of operations were se-
lected to be included in the final suite (shown in Table 1).
The types of operations listed were determined from a
combination of inspecting the source code and the
instruction traces generated. This is shown in the table
under the following columns:
• Branching.
• Memory.
• Integer.
• Floating Point.
In the final suite, a large number of benchmarks are derived
from MiBench. MiBench has 37 well defined benchmarks;
however, a large proportion of these are targeted
at much higher end platforms than those chosen. This led to a
small subset of the MiBench benchmarks being selected.
Several benchmarks were sourced from the WCET set.
These tested small applications which could conceivably
be run by the platforms discussed earlier. One benchmark
is taken from the DSPstone suite, to cover this
application area and type of computation.
The other applications considered were all found to be
too time consuming to port to a small embedded system,
or unnecessary for inclusion because other benchmarks
performed a similar set of operations.
6 Benchmark Descriptions
This section gives a short description of each benchmark,
the modifications made to it, and why it is included.
Categories
MiBench divided the embedded processor applications
into six categories (see Table 5): automotive, network,
consumer, security, telecomms and office. The bench-
marks selected broadly fit into these categories, however
consumer and office in particular require the higher end
embedded processors. This is due to the benchmarks
running ‘off the shelf’ programs such as ghostscript and
rsynth.
Similarly we divide the chosen benchmarks into the
same categories, since they are appropriately descriptive.
However, some of the benchmarks are broad enough that
they fit into several categories. A more accurate classification
of the groups the benchmarks fit into is shown in
the table of benchmarks (Table 1).
Blowfish
Blowfish is an encryption algorithm commonly used in
cryptography. This benchmark was taken from MiBench
but modified to both encrypt and decrypt small blocks of
data, as if the data was being streamed into the proces-
sor. The stream is generated pseudo-randomly to avoid
platform dependencies on input and output. Encryption
typically involves many integer operations with fewer,
predictable branches.
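The pseudo-random input stream can be illustrated with a small generator. This is a minimal sketch only: the paper does not say which generator BEEBS uses, so the linear congruential generator, its constants, and the name lcg_stream are assumptions chosen for illustration.

```python
def lcg_stream(seed, count):
    """Yield `count` pseudo-random bytes from a 32-bit linear congruential generator.

    The multiplier and increment are a commonly published LCG parameter set;
    they are an assumption here, not taken from the BEEBS sources.
    """
    state = seed & 0xFFFFFFFF
    for _ in range(count):
        # Advance the generator, keeping the state within 32 bits.
        state = (1664525 * state + 1013904223) & 0xFFFFFFFF
        # Use the top byte, since the low bits of an LCG are the least random.
        yield (state >> 24) & 0xFF


# The same seed always gives the same stream, so no input file is needed.
block = bytes(lcg_stream(seed=1, count=8))
```

Because the stream depends only on the seed, every platform encrypts the same data without needing a filesystem or any other input device.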
Rijndael
Rijndael is the algorithm for the Advanced Encryption
Standard. It is commonly used in many security applications,
and has a similar structure to Blowfish. It also has
similar execution characteristics except for more frequent
branching.
SHA
Secure Hash Algorithm (SHA) is a hashing algorithm
commonly used for fingerprinting and verification of data
Category    Description
Automotive  This category demonstrates the mathematical ability of
            the processor.
Consumer    Embedded processors are frequently used in consumer
            applications, performing tasks such as audio and video
            decoding.
Network     Processors in routers frequently perform the operations
            in this category. This involves handling packets and
            routing graphs.
Telecomm    Applications that include radio frequency analysis and
            encoding.
Security    Encryption algorithms, hashing and signing applications
            are placed in this category.
Table 5: Categories used to indicate the application area
of the benchmarks.
streams. It is useful for stressing integer pipelines, and
has low memory requirements. The benchmark hashes a
stream of pseudo-randomly generated data.
CRC32
Similar to SHA, CRC32 is used for verification of data
streams, notably Ethernet frames. It differs from SHA
in that it can be implemented with very few instructions
as it consists mainly of shifts and XORs. As it consists
of few instructions in a tight loop, this benchmark
should exercise processors with superscalar execution or
branch prediction. The benchmark performs the CRC
on a stream of pseudo-randomly generated data.
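To show why CRC32 reduces to shifts and XORs in a tight loop, the following is a minimal bit-at-a-time sketch of the standard reflected CRC-32. It is not the BEEBS or MiBench source, which may well use a table-driven variant, and the function name is an assumption.

```python
def crc32_bitwise(data):
    """Bit-at-a-time CRC-32 using the reflected polynomial 0xEDB88320."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; XOR in the polynomial if that bit was set.
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


checksum = crc32_bitwise(b"123456789")  # standard CRC-32 check input
```

The inner loop contains only a test, a shift and an XOR, which is the instruction mix the description above attributes to this benchmark.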
Integer Matrix Multiplication
Integer matrix multiplication is used very frequently in
many applications, and so is a useful benchmark to have.
It consists of a tight inner loop with many array accesses,
making it useful for stressing the memory and integer
pipeline at the same time. This should also expose data
caching effects of the platform.
Float Matrix Multiplication
Floating point matrix multiplication is also used fre-
quently. This benchmark is a modified version of the
integer matrix multiplication benchmark, with floating
point numbers in place of integer — all other code is
identical. This should allow a good metric of relative per-
formance between the integer and floating point pipeline
to be produced.
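The shared structure of the two matrix benchmarks can be sketched as a single triple loop in which only the element type changes. This is an illustrative sketch; the function name and the matrix sizes are assumptions, not the BEEBS sources.

```python
def matmult(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p) with a plain triple loop."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = 0
            # Tight inner loop: two array reads, one multiply and one add.
            for k in range(m):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


int_result = matmult([[1, 2], [3, 4]], [[5, 6], [7, 8]])
float_result = matmult([[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.0], [0.0, 0.5]])
```

Running the same loop on integer and floating point inputs keeps the memory access pattern fixed, so any difference in measured energy can be attributed to the arithmetic units.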
Dijkstra
This benchmark implements Dijkstra’s shortest-path
algorithm. This benchmark performs lots of non-linear
accesses to memory, and branches unpredictably.
This makes it good for stressing caches and branch units
that the processor may have. This algorithm is com-
monly used by routers to calculate the shortest path to
another router. This benchmark was modified from the
MiBench version to have the adjacency matrix embedded
in the source code, rather than loaded from the filesys-
tem.
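A minimal sketch of the idea of embedding the adjacency matrix in the source is given below. The five-node graph, its weights and the names ADJ and dijkstra are hypothetical and chosen only for illustration; the benchmark itself uses its own data.

```python
INF = float("inf")

# Hypothetical graph embedded in the source; 0 means "no edge".
ADJ = [
    [0, 7, 9, 0, 14],
    [7, 0, 10, 15, 0],
    [9, 10, 0, 11, 2],
    [0, 15, 11, 0, 6],
    [14, 0, 2, 6, 0],
]


def dijkstra(adj, source):
    """Return the shortest distance from `source` to every node."""
    n = len(adj)
    dist = [INF] * n
    dist[source] = 0
    visited = [False] * n
    for _ in range(n):
        # Select the closest unvisited node (data-dependent branching).
        u = -1
        for v in range(n):
            if not visited[v] and (u == -1 or dist[v] < dist[u]):
                u = v
        if u == -1 or dist[u] == INF:
            break
        visited[u] = True
        # Relax every edge leaving u (scattered reads of the matrix).
        for v in range(n):
            w = adj[u][v]
            if w and dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist


distances = dijkstra(ADJ, 0)
```

With the graph held in a constant array, the program needs no filesystem access, which is the property the modification above is meant to achieve.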
Cubic root solver
This benchmark performs a large amount of trigonome-
try to solve various cubic equations. This tests the float-
ing point pipeline with very little memory required. This
is a portion of the ‘basicmath’ benchmark in MiBench,
cut down to fit on smaller processors.
2D FIR
FIR filters are frequently used in image transformations.
In the embedded space this could be the type of operation
done by digital cameras. This benchmark is similar
to the matrix multiplications but with potentially more
memory accesses and spatially different arithmetic.
FDCT
The Finite Discrete Cosine Transform (FDCT) bench-
mark was included as it is a core algorithm behind many
video decoders used in consumer products. This bench-
mark represents real-world usage of the systems as well
as testing the floating point pipeline and caches.
7 Benchmark Analysis
This section provides a concrete analysis of all the chosen
benchmarks by collecting their instruction traces across
three of the platforms. From these graphs, the instruc-
tions can be categorised to demonstrate that each bench-
mark performed a different distribution of operations.
Figures 1, 2 and 3 show the instruction distributions
for the Epiphany, XMOS and ARM Cortex-M0 (Thumb
instruction set) platforms respectively. The ‘Other’ cat-
egory of instructions contains miscellaneous control in-
structions that do not fit into other categories (for exam-
ple, interrupt control on the Epiphany platform).
[Plot: BEEBS instruction distributions - Epiphany. x-axis: benchmark (blowfish, crc32, cubic, dijkstra, fdct, float_matmult, int_matmult, rijndael, sha, 2dfir); y-axis: percentage of instructions (0 to 100); legend: Integer, Floating point, Memory, Branch, Other.]
Figure 1: BEEBS Instruction distribution for the
Epiphany platform.
[Plot: BEEBS instruction distributions - XMOS. x-axis: benchmark (blowfish, crc32, cubic, dijkstra, fdct, float_matmult, int_matmult, rijndael, sha, 2dfir); y-axis: percentage of instructions (0 to 100); legend: Integer, Memory, Branch, Other.]
Figure 2: BEEBS Instruction distribution for the XMOS
platform.
[Plot: BEEBS instruction distributions - ARM. x-axis: benchmark (blowfish, crc32, cubic, dijkstra, fdct, float_matmult, int_matmult, rijndael, sha, 2dfir); y-axis: percentage of instructions (0 to 100); legend: Integer, Memory, Branch.]
Figure 3: BEEBS Instruction distribution for the ARM
Cortex-M0 platform.
Overall these results show that the benchmarks give
a good spread of different distributions of instruction
types.
Type  Platforms (%)  Benchmarks (%)
                     Epiphany  XMOS   ARM
I     30             26–77     28–68  37–79
FP    –              0–49      –      –
M     30             10–30     17–43  6–34
B     29             1–20      1–30   1–42
Table 6: Variation in instruction distributions between
the platforms and between the benchmarks.
Integer operations are the most common type of in-
struction in almost every benchmark. Across the plat-
forms, the distributions are similar, with small variations
due to the underlying instruction set. For example, there
is a larger percentage of mov-type instructions in the
Epiphany results because there are several predicated
mov instructions (moveq, movlt, etc). This reduces the
need for conditional branches, so this category decreases
in proportion.
Epiphany is also the only platform in the subset cho-
sen which has hardware support for floating point. For
the other platforms, software emulation is used. On the
XMOS platform this manifests in extra branch and mem-
ory instructions, whereas for the ARM platform the pro-
portion of integer operations rises. These differences are
due to different emulation strategies used.
The ARM traces follow the same general trend as the
traces for XMOS and Epiphany, but with fewer
memory operations overall. This is due to the ARM processor
having support for the ldm and stm instructions,
which allow multiple accesses to memory in a single instruction.
These instructions are used extensively in function
prologues and epilogues to save and restore registers.
The integer instruction category is the largest group
in almost every case, for all platforms and benchmarks.
This comes from the integer category covering the largest
number of types of instructions, as it groups arithmetic,
register copying and bit-wise operations.
These benchmarks show a range of different quanti-
ties of each instruction, with similarities across platforms.
This makes the set of benchmarks ideal for use in energy
profiling of a system.
We see that for all platforms a given benchmark pro-
duces a similar instruction profile (within 30% between
all platforms). This is shown in Table 6, where the
platforms column shows the maximum variation between
each platform for each instruction category. The bench-
mark columns show the ranges of instruction proportions
across the benchmarks on that platform. Between bench-
marks there is significant variation, therefore the suite
explores a wide range of input configurations in a consis-
tent way between architectures.
Category        Power (mW)
                Epiphany  XMOS   ARM
Integer         28        33     8.4
Floating Point  31        –      –
Memory          20        35     9.3
Branching       40        35     6.8
Other           14        –      –
Average         26        35     8.3
Average/MHz     65µW      88µW   170µW
Table 7: Power dissipation for each instruction category
calculated by linear regression.
[Diagram blocks: Processor, Shunt resistor, Power monitor, Power logger.]
Figure 4: Hardware setup to measure the power of the
processor under test.
8 Case Study
The use of the benchmark suite is demonstrated through
collecting power measurements for each benchmark on
each of the platforms. Linear regression is then used to
assign an average power dissipation to each class of in-
structions by considering the average power and instruc-
tion distribution per benchmark.
The power of each platform was measured by instru-
menting hardware as in Figure 4. This set-up allowed
real measurements to be taken, rather than using an ab-
stract power model for the processor.
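The arithmetic behind a shunt-based measurement can be sketched as follows. The paper does not give the resistor value, supply voltage or sampling details, so the function names, the high-side shunt arrangement and all numbers below are assumptions for illustration only.

```python
def processor_power_w(v_supply, v_shunt, r_shunt_ohm):
    """Power drawn by the processor for one sample, in watts.

    Assumes a shunt in series with the supply rail: the current follows from
    Ohm's law, and the processor sees the supply voltage minus the shunt drop.
    """
    current_a = v_shunt / r_shunt_ohm
    return (v_supply - v_shunt) * current_a


def average_power_w(samples, r_shunt_ohm):
    """Average power over (v_supply, v_shunt) samples logged during a run."""
    powers = [processor_power_w(vs, vsh, r_shunt_ohm) for vs, vsh in samples]
    return sum(powers) / len(powers)


# Hypothetical samples: 3.3 V rail, 1 ohm shunt, 10 mV and 30 mV drops.
example = average_power_w([(3.3, 0.01), (3.3, 0.03)], r_shunt_ohm=1.0)
```

Averaging many such samples over a benchmark run gives the per-benchmark average power that the analysis in this section relies on.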
The average power dissipation of each benchmark
was measured on the three hardware platforms. Lin-
ear regression is applied, with the categorized instruc-
tion counts gathered from the traces. This allows each
category of instructions to be assigned an average power
dissipation. The results of this analysis are presented in
Table 7. These are scaled results, representing the cost
of a single instruction per core/hardware thread (Scaled
down by 16 for Epiphany and by 4 for XMOS).
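The regression step can be sketched as an ordinary least-squares fit in which each benchmark's average power is modelled as the sum of its instruction-category fractions weighted by a per-category power. This is a sketch under stated assumptions: the paper does not specify the exact model (for example whether an intercept is used), the instruction mixes below are invented, and the "measured" powers are synthesised from the ARM column of Table 7 purely to check that the fit recovers known values.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_category_power(fractions, powers):
    """Least-squares fit of powers ~ fractions . p via the normal equations."""
    k = len(fractions[0])
    ata = [[sum(row[i] * row[j] for row in fractions) for j in range(k)]
           for i in range(k)]
    aty = [sum(row[i] * p for row, p in zip(fractions, powers))
           for i in range(k)]
    return solve_linear(ata, aty)


# Hypothetical instruction mixes (integer, memory, branch), one row per benchmark.
FRACTIONS = [
    [0.70, 0.20, 0.10],
    [0.50, 0.35, 0.15],
    [0.40, 0.30, 0.30],
    [0.60, 0.10, 0.30],
    [0.45, 0.40, 0.15],
]
# ARM column of Table 7 (mW), used only to synthesise the inputs below.
KNOWN_POWER_MW = [8.4, 9.3, 6.8]
POWERS_MW = [sum(f * p for f, p in zip(row, KNOWN_POWER_MW)) for row in FRACTIONS]

estimate_mw = fit_category_power(FRACTIONS, POWERS_MW)
```

With real measurements the fit would not be exact, and the residuals would indicate how well a per-category average describes the platform.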
Overall, the main difference in power dissipations is
due to differing clock rates — XMOS and Epiphany run
at 400MHz and ARM at 48MHz.
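The Average/MHz row of Table 7 follows from dividing each platform's average power by its clock rate. The short check below reproduces that arithmetic from the figures quoted in the paper; the dictionary and function names are illustrative.

```python
# Average power per platform (mW, Table 7) and clock rates (MHz, from the text).
AVERAGE_POWER_MW = {"Epiphany": 26, "XMOS": 35, "ARM": 8.3}
CLOCK_MHZ = {"Epiphany": 400, "XMOS": 400, "ARM": 48}


def power_per_mhz_uw(platform):
    """Average power normalised by clock rate, in microwatts per MHz."""
    return AVERAGE_POWER_MW[platform] * 1000 / CLOCK_MHZ[platform]


per_mhz = {name: power_per_mhz_uw(name) for name in AVERAGE_POWER_MW}
```

This gives 65 µW/MHz for Epiphany, 87.5 µW/MHz for XMOS and roughly 173 µW/MHz for the ARM Cortex-M0, which match the table's 65, 88 and 170 after rounding to two significant figures.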
From these results several conclusions can be drawn.
For the ARM Cortex-M0, a memory access is more costly
than an arithmetic instruction, as is expected. The
branch power dissipation, however, disagrees with other
measurements taken. The power measured when executing a while(1);
loop was found to be 11mW. This figure is higher than
a memory access, due to the instruction being loaded
from flash as opposed to RAM. The discrepancy is due
to conditional branches having a lower power when the
branch is not taken (further results indicate that when
a conditional branch is not taken, the power dissipation
is roughly 4mW).
The XMOS results show memory operations are
slightly more costly than arithmetic. The identical cost
for branching and memory access is due to the structure
of the processor’s pipeline: the final stage is a memory
access which either does an instruction fetch or a mem-
ory operation.
The results for the Epiphany exhibit the most variabil-
ity, with a branch instruction requiring almost twice the
power of a memory access. We believe this is due to the
longer pipeline having to be flushed, then new instruc-
tions fetched. A floating point operation also takes more
power than an integer instruction — this is attributed
to the larger complexity of an FPU.
9 Conclusion
This paper presented BEEBS, a benchmark suite of 10
programs that has been carefully designed to expose the
energy consumption characteristics of the target plat-
form. The benchmarks were chosen after evaluating an
extensive list of embedded programs for their character-
istics and suitability. This included modifying existing
benchmarks to be more suitable for a bare-metal bench-
mark suite for testing energy. The benchmarks are avail-
able online [7].
Each of the benchmarks in the suite was analysed
for its instruction distribution, verifying that the benchmark
suite sufficiently covered a range of distributions.
This was repeated across three platforms with
very different features, showing that the suite is consis-
tently good even for different instruction sets. This is im-
portant when considering energy consumption, as each
type of instruction can consume very different amounts
of energy.
An example of how the benchmark suite could be used
was given in Section 8. This case study took physi-
cal measurements of three platforms, ARM Cortex-M0,
XMOS XS1-L1 and Adapteva Epiphany. Then an aver-
age power for each instruction was derived, by perform-
ing linear regression on the power figures and the in-
struction distributions. We find that different categories
of instruction have different power consumptions, as ex-
pected. The power dissipations differ per platform in
ways which can be explained. One example of this is
the memory and branching consuming similar powers on
the XMOS platform, due to the nature of the processor’s
pipeline. On the Epiphany platform floating point
was slightly more power hungry than integer calculations,
due to the extra circuitry in FPUs. The fact that these
features can be highlighted by the suite shows that the
benchmark suite is fit for purpose when evaluating dif-
ferent processors.
9.1 Future Work
The benchmark suite targeted the processor core of the
embedded platforms, not exercising peripherals or I/O.
In future this benchmark suite could be extended to allow
these items to be tested, but it remains to be seen how
this can be done in a portable way.
References
[1] M. R. Guthaus and J. S. Ringenberg. “MiBench: A
free, commercially representative embedded bench-
mark suite”. In: IEEE International Workshop on
Workload Characterization (WWC-4). 2001, pp. 3–
14.
[2] Jason E. Fritts et al. “MediaBench II video: Ex-
pediting the next generation of video systems
research”. In: Microprocessors and Microsystems
33.4 (June 2009), pp. 301–318.
[3] Jack J. Dongarra, Piotr Luszczek, and Antoine Pe-
titet. “The LINPACK Benchmark: past, present
and future”. In: Concurrency and Computation:
Practice and Experience 15.9 (Aug. 2003), pp. 803–
820.
[4] R. P. Weicker. “Dhrystone benchmark: rationale
for version 2 and measurement rules”. In: ACM
SIGPLAN Notices 23.8 (1988).
[5] J. Gustafsson. “The Ma¨lardalen WCET bench-
markspast, present and future”. In: Proceedings of
the 10th International Workshop on Worst-Case
Execution Time Analysis (2010).
[6] Vojin Zivojnovic et al. “DSPstone: A DSP-oriented
benchmarking methodology”. In: Proc. of ICSPAT
(1994).
[7] James Pallister, Simon Hollis, and Jeremy Ben-
nett. The BEEBS Benchmark Suite. 2013. url:
http://www.cs.bris.ac.uk/Research/Micro/beebs.jsp.
8
[8] Vivek Tiwari, Sharad Malik, and Andrew Wolfe.
“Compilation techniques for low energy: an
overview”. In: Proceedings of 1994 IEEE Sym-
posium on Low Power Electronics. IEEE, 1994,
pp. 38–39.
[9] Grigori Fursin et al. “Milepost GCC: machine
learning enabled self-tuning compiler”. In: Inter-
national Journal of Parallel Programming (2011),
pp. 1–31.
[10] Syed Muhammad Zeeshan Iqbal, Yuchen Liang,
and Hakan Grahn. “ParMiBench - An Open-
Source Benchmark for Embedded Multiprocessor
Systems”. In: IEEE Computer Architecture Letters
9.2 (Feb. 2010), pp. 45–48.
[11] Christian Bienia, Sanjeev Kumar, and Kai Li.
“PARSEC vs. SPLASH-2: A quantitative compar-
ison of two multithreaded benchmark suites on
Chip-Multiprocessors”. In:Workload Characteriza-
tion, 2008. IISWC 2008. IEEE International Sym-
posium on (Oct. 2008), pp. 47–56.
[12] Rene´ Rebe. Openbench. 2012.
[13] J. L. Henning. “SPEC CPU2006 benchmark de-
scriptions”. In: ACM SIGARCH Computer Archi-
tecture News (2006).
[14] Holger Blume et al. “Hybrid functional and instruc-
tion level power modeling for embedded proces-
sors”. In: Embedded Computer Systems: Architec-
tures, Modeling, and Simulation (2006), pp. 216–
226.
[15] S. Lee et al. “An accurate instruction-level energy
consumption model for embedded risc processors”.
In: ACM SIGPLAN Notices (2001).
[16] Stefan Steinke et al. “An accurate and fine grain
instruction-level energy model supporting software
optimizations”. In: Proc. of PATMOS (2001).
[17] Vivek Tiwari et al. “Instruction level power anal-
ysis and optimization of software”. In: Journal of
VLSI Signal Processing Systems for Signal, Image,
and Video Technology 13.2-3 (1996), pp. 223–238.
[18] David Brooks, Vivek Tiwari, and Margaret
Martonosi. “Wattch: a framework for architectural-
level power analysis and optimizations”. In: Pro-
ceedings of the 27th Annual International Sympo-
sium on Computer Architecture (2000).
[19] Tomasz Patyk et al. “Energy consumption reduc-
tion by automatic selection of compiler options”.
In: 2009 International Symposium on Signals, Cir-
cuits and Systems. IEEE, July 2009, pp. 1–4.
[20] A. Parikh et al. “Instruction scheduling based on
energy and performance constraints”. In: 2000 Pro-
ceedings. IEEE Computer Society Workshop on
VLSI. IEEE Comput. Soc, 2000, pp. 37–42.
[21] Anil Seth, R. B. Keskar, and R. Venugopal. “Al-
gorithms for energy optimization using processor
instructions”. In: CASES ’01 Proceedings of the
2001 international conference on Compilers, archi-
tecture, and synthesis for embedded systems (2001),
p. 195.
[22] Seungdo Woo, Jungmin Yoon, and Jihong Kim.
“Low-power instruction encoding techniques”. In:
SOC Design Conference (2001).
[23] M. R. Stan and W. P. Burleson. “Bus-invert cod-
ing for low-power I/O”. In: IEEE Transactions on
Very Large Scale Integration (VLSI) Systems 3.1
(Mar. 1995), pp. 49–58.
[24] Ibrahim Hur and Calvin Lin. “A comprehen-
sive approach to DRAM power management”. In:
2008 IEEE 14th International Symposium on High
Performance Computer Architecture (Feb. 2008),
pp. 305–316.
[25] K Asanovic. “Energy-exposed instruction set ar-
chitectures”. In: Work in Progress Session, HPCA.
January. 2000.
[26] Steve Kerrison and Kerstin Eder. Energy modelling
and optimisation of software for a hardware multi-
threaded embedded microprocessor. Tech. rep. Bris-
tol: University of Bristol, 2013.
[27] Adapteva. E16G301 Epiphany 16-core micropro-
cessor datasheet. 2013.
9
A Benchmark Evaluation Table
This appendix gives a comprehensive list of all the benchmarks evaluated to choose the final set. Each benchmark
was examined and its rough characteristics estimated. Each benchmark was also evaluated for three other
properties: embedded applicability, memory footprint, and the modifications required to make it run on an embedded
system. These three properties were combined in a rule-based manner, producing a ‘suitability’ for inclusion in the
suite. This allowed us to immediately exclude benchmarks with a very low suitability.
The characteristics estimate the amount of computation in the following areas:
• Integer, Floating Point, or neither. This was estimated from the ratio of floating point operations to integer
operations.
• Branching. The benchmark was deemed to be branch-intensive if there was a high ratio of control structures
compared to other computation. For example, if more than 20% of the code is control structures the benchmark
was marked as branch-intensive.
• Memory. The benchmark was said to be memory-intensive if there were frequent accesses to large arrays or
other data structures.
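As a rough illustration of the first two criteria, the sketch below labels a benchmark from simple counts. Only the 20% branching threshold comes from the text above. The function name characterise, the input counts, the rule that "neither" means less than 10% arithmetic, and the rule that FP wins when floating point operations outnumber integer operations are all assumptions made for this example.

```python
def characterise(fp_ops, int_ops, control_structures, total_statements,
                 branch_share=0.20, min_arith_share=0.10):
    """Return (FP/I label, branching label) for one benchmark.

    branch_share follows the 20% example in the text; min_arith_share
    and the FP-versus-integer comparison are assumed, not from the paper.
    "-" stands for the dash used in the table.
    """
    if total_statements <= 0:
        raise ValueError("total_statements must be positive")
    arith = fp_ops + int_ops
    if arith / total_statements < min_arith_share:
        kind = "-"    # neither integer- nor floating-point-heavy
    elif fp_ops > int_ops:
        kind = "FP"
    else:
        kind = "I"
    share = control_structures / total_statements
    branch = "Y" if share > branch_share else "-"
    return kind, branch


if __name__ == "__main__":
    # Hypothetical counts for an integer, branch-heavy benchmark.
    print(characterise(fp_ops=5, int_ops=40,
                       control_structures=30, total_statements=100))
```

The memory criterion is left out because the text describes it only qualitatively.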
The other columns in the table are:
• Embedded Applicability. This is the likelihood that the functionality of the benchmark would be used in
a real embedded system. For example, checksumming is frequently done in embedded systems, so this would
receive a ‘High’ embedded applicability.
• Fit in memory. This column specifies whether the benchmark would fit into a small amount of memory.
Some benchmarks receive a ‘possibly’ result for this, where it may be possible to reduce the size of the dataset
the program uses.
• Modifications for bare metal. This field indicates the amount of modification necessary to make the
benchmark run without operating system support. For example, if the benchmark does not make extensive
use of the operating system, and simply loads a dataset, the modifications to make this run bare metal are
‘minor’. However, if the benchmark needs graphical display support or other complex features, the modifications
necessary are ‘major’.
                    Characteristics  Embedded       Fit in    Modifications
Benchmark           FP/I  B  M       Applicability  memory    for bare metal  Suitability

DSPstone
Real updates        FP    –  –       Medium         Yes¹      None            High
Matrix products     FP    –  –       High           Yes       None            Very High
Complex product     FP    –  –       Medium         Yes¹      None            High
LMS filter          FP    –  –       Low            Yes¹      None            Medium
2D FIR filter       FP    –  Y       Medium         Yes       None            High
Complex updates     FP    –  –       Medium         Yes¹      None            High
Convolution         FP    –  –       Medium         Yes¹      None            High
IIR biquad filter   FP    –  Y       Low            Yes¹      None            Medium
FIR filter          FP    –  Y       Low            Yes¹      None            Medium

MiBench
basicmath           FP    –  –       Medium         Possibly  None            Medium
bitcount            I     Y  –       Medium         Yes       None            High
qsort               I     Y  –       Medium         Yes       None            High
susan (edges)       FP    –  Y       Low            Possibly  None            Low
susan (corners)     FP    –  Y       Low            Possibly  None            Low
susan (smoothing)   FP    –  Y       Low            Possibly  None            Low
jpeg                –     Y  Y       Medium         No        None            Medium
lame                FP    Y  Y       Low            No        None            Low
mad                 –     Y  Y       Low            No        Major           Very Low
tiff2bw             –     Y  Y       Medium         Possibly  Major           Low
tiff2rgba           –     Y  Y       Medium         Possibly  Major           Low
tiffdither          –     Y  Y       Medium         Possibly  Major           Low
tiffmedian          –     Y  Y       Medium         Possibly  Major           Low
typeset             –     Y  –       Low            No        Major           Very Low
ghostscript         –     Y  –       Low            No        Major           Very Low
ispell              –     –  –       Low            No        Major           Very Low
rsynth              FP    –  –       Low            No        Major           Very Low
sphinx              –     Y  –       Low            No        Major           Very Low
stringsearch        –     Y  Y       Medium         Yes       None            High
dijkstra            –     Y  –       High           Yes       Minor           High
patricia            –     Y  Y       High           Yes       Minor           High
blowfish enc        I     –  –       High           Yes       Minor           High
blowfish dec        I     –  –       High           Yes       Minor           High
pgp sign            I     –  –       Medium         Yes       Minor           Medium
pgp verify          I     –  –       Medium         Yes       Minor           Medium
rijndael enc        I     Y  Y       High           Yes       Minor           High
rijndael dec        I     Y  Y       High           Yes       Minor           High
sha                 I     –  Y       High           Yes       Minor           High
CRC32               I     –  Y       High           Yes       Minor           High
FFT                 FP    Y  Y       Medium         Yes       None            High
IFFT                FP    Y  Y       Medium         Yes       None            High
ADPCM enc           –     Y  –       High           Yes       None            Very High
ADPCM dec           –     Y  –       High           Yes       None            Very High
GSM enc             I     –  –       High           Yes       None            Very High
GSM dec             I     –  –       High           Yes       None            Very High

WCET Benchmarks
adpcom              I     –  –       High           Yes       None            Very High
bs                  –     Y  –       Medium         Yes       None            High
bsort100            –     Y  –       Medium         Yes       None            High
cnt                 –     –  Y       Low            Yes       None            Medium
compress            I     –  –       Medium         Yes       None            High
cover               –     Y  –       Low            Yes       None            Medium
crc                 I     Y  –       Medium         Yes       None            High
duff                –     Y  –       Low            Yes       None            Medium
edn                 I     –  Y       Low            Yes       None            Medium
expint              I     –  –       Medium         Yes       None            High
fac                 –     Y  –       Low            Yes       None            Medium
fdct                I     –  –       High           Yes       None            Very High
fft1                I     –  –       High           Yes       None            Very High
fibcall             I     Y  –       Low            Yes       None            Medium
fir                 –     –  Y       High           Yes       None            Very High
insert sort         –     –  Y       Medium         Yes       None            High
janne complex       –     Y  –       Low            Yes       None            Medium
jfdctint            I     –  Y       Medium         Yes       None            High
lcdnum              I     Y  –       Medium         Yes       None            High
lms                 FP    Y  –       Medium         Yes       None            High
ludcmp              FP    Y  –       Low            Yes       None            Medium
matmult             I     –  –       High           Yes       None            Very High
minver              FP    –  –       Low            Yes       None            Medium
ndes                –     –  Y       Medium         Yes       None            High
ns                  –     –  Y       Low            Yes       None            Medium
nsichneu            I     –  Y       Low            Yes       None            Medium
prime               I     –  –       Low            Yes       None            Medium
qsort-exam          –     Y  Y       High           Yes       None            Very High
qurt                FP    Y  –       Medium         Yes       None            High
recursion           –     Y  –       Low            Yes       None            Medium
select              –     Y  Y       Medium         Yes       None            High
sqrt                FP    Y  –       Medium         Yes       None            High
st                  –     –  –       Low            Yes       None            Medium
statemate           –     Y  –       Low            Possibly  None            Low
ud                  I     Y  –       Low            Yes       None            Medium

MediaBench
cjpeg               FP    –  –       Medium         No        Major           Low
djpeg               FP    –  –       Medium         No        Major           Low
h263dec             –     Y  Y       Low            No        Major           Very Low
h263enc             –     Y  Y       Low            No        Major           Very Low
h264dec             –     Y  Y       Low            No        Major           Very Low
h264enc             –     Y  Y       Low            No        Major           Very Low
jpg2000dec          FP    –  –       Medium         No        Major           Low
jpg2000enc          FP    –  –       Medium         No        Major           Low
mpeg2dec            –     Y  Y       Low            No        Major           Very Low
mpeg2enc            –     Y  Y       Low            No        Major           Very Low
mpeg4dec            –     Y  Y       Low            No        Major           Very Low
mpeg4enc            –     Y  Y       Low            No        Major           Very Low

OpenBench           –     –  –       Medium         No        Major           Low
Livermore Loops     FP    –  –       Low            Possibly  None            Low
LINPACK             FP    Y  Y       Low            No        Major           Very Low
SPEC2006²           –     –  –       Low            No        Major           Very Low

¹ These benchmarks are too small to be useful; their final suitability is adjusted to reflect this.
² These benchmarks were not available, as this is not a free benchmark suite.
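The appendix does not state the rules used to combine embedded applicability, memory fit and bare-metal modifications into a suitability. The sketch below is one inferred rule set, not the authors' method: it starts from the applicability level and subtracts one step if the benchmark does not clearly fit in memory and one step if any modification is needed. The names suitability, LEVELS and APPLICABILITY are hypothetical. This scoring is consistent with the rows of the table as listed here, but other rule sets could match equally well.

```python
# Suitability levels in increasing order, as used in the table.
LEVELS = ["Very Low", "Low", "Medium", "High", "Very High"]
# Assumed numeric encoding of embedded applicability.
APPLICABILITY = {"Low": 0, "Medium": 1, "High": 2}


def suitability(applicability, fits_in_memory, modifications):
    """Combine the three table properties into a suitability label.

    Inferred scoring (an assumption, not taken from the paper):
      start at applicability + 2,
      subtract 1 unless fits_in_memory is "Yes",
      subtract 1 unless modifications is "None".
    """
    score = APPLICABILITY[applicability] + 2
    if fits_in_memory != "Yes":       # "Possibly" and "No" both cost a step
        score -= 1
    if modifications != "None":       # "Minor" and "Major" both cost a step
        score -= 1
    # Clamp defensively to the valid range of levels.
    return LEVELS[max(0, min(score, len(LEVELS) - 1))]


if __name__ == "__main__":
    # A few rows from the table, checked against the inferred rule.
    print(suitability("High", "Yes", "None"))         # Matrix products
    print(suitability("Medium", "Possibly", "Major"))  # tiff2bw
    print(suitability("Low", "No", "Major"))           # mad
```

Under this scoring a benchmark can only reach ‘Very High’ if it has high applicability, fits in memory, and needs no modification, which matches the use of the suitability to exclude poor candidates early.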
