This paper presents and justifies an open benchmark suite named BEEBS, targeted at evaluating the energy consumption of embedded processors.
Introduction
Benchmarking is frequently used to gain an idea of how a system will perform during general use, when the specific environment cannot be reproduced at design-time. This gives designers feedback on how their system will perform and where performance is lacking. Typically, one benchmark cannot exercise all aspects of a target, leading to suites of benchmarks. Each benchmark tests a combination of areas of the hardware. This separation of benchmarks allows the designer to see which parts of the hardware perform the best.
The energy consumption of electronic devices is rapidly becoming a large factor in the design process. A portable embedded system will typically have severe power constraints placed upon it, if it is to have a long battery life. To recognize whether these constraints have been met, the power consumption of the device under a typical load must be tested. To build a full picture of a platform's energy consumption characteristics, a benchmark suite that covers the possible combinations of an application's characteristics (such as memory accesses, integer and floating point operations, etc.) is needed. This allows the energy consumption of various components of the system to be determined, ensuring that the system is fit for purpose.
There are few freely available benchmark suites for deeply embedded systems and none exist which are designed to allow energy consumption to be measured. Existing suites, such as MiBench [1], MediaBench [2], LINPACK [3] and Dhrystone [4], are all targeted towards larger desktop-based applications, with significant compute power. This is due to their emphasis on measuring performance, as opposed to energy efficiency. Most assume a host operating system is present, which may not be true on an embedded system. Furthermore, when analysing energy consumption, having to account for the operating system's effect on the result is non-trivial. These benchmarks, while portable in theory, have significant difficulties running unmodified on embedded platforms. There are a variety of issues that cause these difficulties, such as lack of an OS, lack of a storage system, small memory size and run-time scalability. The issue of run-time scalability only occurs with a diverse range of platforms: large differences in clock speed and microarchitecture may mean that without scaling down a benchmark it is infeasible to run it on less powerful platforms.
Table 1: Benchmarks selected, and the categories they fit in. Legend in Table 2. † Redistributed under the GPL.
requirements in terms of variety of benchmarks and applicability but assumes there is a host operating system for the majority of the benchmarks. In particular it requires access to a filesystem which is usually unavailable on small embedded platforms. The benchmarks represent a broad range of embedded areas. Our benchmark suite keeps this cross-section of areas while selecting benchmarks which bring out a range of energy consumption characteristics. The WCET benchmarks [5] are also quite suitable, in that none of them require an operating system. However, many of these programs are small and not representative of computations that would typically be done on an embedded platform (e.g. searching for primes).
The DSPstone suite [6] is aimed at evaluating compilers for DSP-type platforms, and therefore meets the criteria of no OS and small memory footprint. However, the majority of these benchmarks are too small to be useful in a realistic benchmark set.
In this paper we create a new set of benchmarks, the Bristol Energy Efficiency Benchmark Suite (BEEBS) [7], chosen from popular benchmark suites, and justify their use for benchmarking energy consumption. The benchmark suite is designed to expose the processor and memory's performance, with other factors such as I/O and peripherals excluded for portability. The selection was designed such that the benchmarks would be portable, and would expose the change in energy consumption when exercising the platform in different ways, such as with memory-intensive versus arithmetic-intensive computation. The benchmarks are intended to be run on the bare metal with no host operating system.
We consider four orthogonal aspects that the benchmark suite must cover, allowing the range of benchmarks to expose all of the behaviour of the platform.
• Integer operations. Operations which use the integer ALU will have similar energy consumptions.
• Floating point operations. These operations may use different pipelines or functional units to the integer operations, so may consume a different amount of energy.
• Memory access intensity. An access to memory is known to take a significantly different amount of energy to other operations [8] .
• Branching frequency. Branching frequently will stress parts of the processor, such as an instruction prefetch phase. This is similar to memory access intensity, but as the code and data are often held in different areas and types of memory this should be considered separately.
Using benchmarks that hit combinations of these, interesting observations about the energy consumption of the device can be made.
The benchmark suite has been extensively tested on three different processors (shown in the top half of Table 3), and the rest of the paper details the results. The suite has been confirmed to run successfully on a further three platforms (shown in the bottom half of Table 3). Targeting multiple platforms ensures that more general conclusions can be drawn about the nature of the energy consumption.
This paper discusses previous benchmark suites, justifying the need for a benchmark suite targeted at exposing energy consumption characteristics. Then a set of benchmarks chosen from subsets of these pre-existing suites is presented, with justifications listed for the benchmarks and the modifications made to them. An analysis of the new BEEBS suite is given, with instruction distributions and examples of how the benchmarks can be used to expose energy consumption characteristics. 
Previous Work
Of the many existing benchmark suites, few target embedded systems. Most target either desktop machines (e.g. Dhrystone) or HPC (e.g. PARSEC). Few also explicitly target multithreaded systems, and none explicitly aim for energy as the target metric.
MiBench established a well known set of benchmarks with well characterised behaviour. This suite consisted of 37 different benchmarks split across six different categories, chosen to be representative of the applications that would be run on both desktop and embedded platforms. Each benchmark is justified, with instruction traces analysed on a model of the StrongARM architecture. This gave a good overview of the proportions of each type of instruction that the benchmarks executed. The drawback of this was that the instruction traces were only gathered for one platform: each benchmark could have a radically different instruction distribution on alternative platforms, leading to different performance characteristics being exposed.
MiBench was used as the main benchmark suite for MILEPOST GCC [9] . This study applied machine learning to predict which optimizations would benefit a program without needing to perform expensive iterative compilation techniques. In this study they emphasised how the performance achieved can be very dependent on the structure of the benchmarks. This highlights the need to have a wide range of benchmarks which each hit different combinations of the types of computation they could perform.
ParMiBench, a variant of MiBench, was created to address the lack of multithreadedness in the original suite [10]. It attempts to parallelise some of the benchmarks, allowing them to be used to benchmark multicore systems. This has an advantage over other parallel benchmark suites in that it also targets the embedded space. Very few other benchmark suites (such as LINPACK, PARSEC and SPLASH-2 [11]) target multithreadedness at this level; most are aimed at large clusters and HPC applications.
DSPstone is a benchmark suite for Digital Signal Processors (DSPs) and was originally designed to evaluate compiler effectiveness at compiling for DSPs. This suite contains a large number of non-integer tests, with most tests replicated in fixed point and floating point form. As this set is aimed at DSPs rather than general purpose processors no benchmarks were chosen from DSPstone.
A set of benchmarks is maintained by the worst case execution time (WCET) initiative. These benchmarks are appropriate because they are self contained and written completely in C. Each benchmark is less comprehensive than its equivalent from the MiBench set, but focusses on one particular application that may be specifically what a low end processor will perform. Some of these applications fit well with typical embedded applications.
In addition to the suites discussed above, several other suites were evaluated:
• MediaBench
• OpenBench [12]
• SPEC2006 [13]
• LINPACK
• Livermore Fortran Kernels
All of these benchmark suites were found to be unsuitable for the aim of characterising energy consumption on embedded platforms due to their reliance on the operating system and features provided by it.
A specific suite to target energy consumption is useful because of the differing energy costs of each instruction in a processor's instruction set. Many previous studies [14, 15, 16, 17] have attached an energy cost to each instruction and found that different instructions can have significantly different energies even if they take a similar amount of time.
Brooks et al. created the Wattch toolkit [18] which provides architectural models and instruction level models to allow design-space exploration of the power consumption of processors, as well as evaluating software's energy consumption. BEEBS provides the missing component, a benchmark suite designed for energy exploration that allows this kind of exploration to be done consistently and systematically.
Energy modelling has also been used to optimise a program's execution, through selecting compiler optimisations [19] , instruction scheduling [20] and automatically inserting idle instructions [21] .
Optimisation can also be achieved at the microarchitecture level, for example, by choosing an instruction encoding to minimise the number of bit flips [22]. Other methods of reducing energy in this way include encoding bus traffic [23], adaptive scheduling of DRAM accesses [24] and exposing energy efficient versions of instructions in an ISA [25].
Platforms
We intend BEEBS to be applicable to a wide range of hardware platforms. For our evaluation, a range of platforms has been chosen, covering different types of architectures. The processors are mainly small embedded systems which are designed for low power usage. As a consequence, some of the platforms are very memory limited, restricting the types of applications that can be run on them.
A set of platforms is needed to complement the benchmark suite due to the varying capabilities of each platform. For example, a benchmark will behave very differently on platforms which have a cache, compared to platforms which do not. As such, we have chosen platforms with different pipeline depths, numbers of registers and types of memory. A comparison of the platforms can be seen in Table 4 .
The number of registers has a large effect on the energy consumption due to the high cost of memory accesses: if a variable can be stored in a register there will be fewer memory accesses and overall less energy consumed. For similar reasons the type of memory the code is executing from can have a large impact on energy, since flash and SRAM consume different amounts of energy.
The XMOS platform is an unusual platform, in that it is an event driven multicore platform, with eight hardware threads. Of these threads, up to four can run at full speed [26]. The Epiphany platform is superscalar, having one integer pipeline and another integer/floating point pipeline. The Epiphany processor used has 16 cores, connected by a network-on-chip [27]. The ARM Cortex-M0 is a simple single-core processor.
All three platforms also have diverse instruction sets with different features. This diversity makes this selection of platforms ideal for testing the benchmarks.
Table 4: Features of the platforms experimented on.
The BEEBS Benchmarks
A set of benchmarks to test all aspects of the target platforms is presented in this section. The benchmarks were selected by defining a coverage matrix which included all the individual benchmarks from the following suites:
The matrix (listed in full in Appendix A) also broadly evaluated other benchmark suites for their suitability. Two sets of parameters are evaluated in this table: the type of operations performed by the benchmark and its suitability for inclusion in the final suite. The suitability for inclusion evaluates whether the benchmark should be included, based on what the benchmark does, whether it will work on the target platforms and the effort required to port it.
The type of operations was derived from examining the source of each benchmark and roughly categorising it as to the types of operations it performs. This allows benchmarks with similar properties to be excluded before a lengthy examination.
Benchmarks with a high suitability were selected for the final suite (shown in Table 1), keeping the selection to a minimal set that covers suitably different types of operations. The types of operations listed were determined from a combination of inspecting the source code and examining the instruction traces generated. This is shown in the table under the following columns:
• Branching.
• Memory.
• Integer.
• Floating Point.
In the final suite, a large number of benchmarks are derived from MiBench. MiBench has 37 well defined benchmarks; however, a large proportion of these are targeted at much higher end platforms than those chosen. This led to a small subset of the MiBench benchmarks being selected. Several benchmarks were sourced from the WCET set. These tested small applications which could conceivably be run by the platforms discussed earlier. One benchmark is taken from the DSPstone suite, to cover this application area and type of computation.
The other applications considered were all found to be too time consuming to port to a small embedded system, or unnecessary for inclusion because other benchmarks performed a similar set of operations.
Benchmark Descriptions
This section talks about each benchmark, giving a short description of the benchmark, modifications made, and why it is included.
Categories
MiBench divided the embedded processor applications into six categories (see Table 5): automotive, network, consumer, security, telecomms and office. The benchmarks selected broadly fit into these categories; however, consumer and office in particular require the higher end embedded processors. This is due to the benchmarks running 'off the shelf' programs such as ghostscript and rsynth.
We divide the chosen benchmarks into the same categories, since they are appropriately descriptive. However, some of the benchmarks are broad enough that they fit into several categories. A more accurate classification of the groups the benchmarks fit into is shown in the table of benchmarks (Table 1).
Blowfish
Blowfish is an encryption algorithm commonly used in cryptography. This benchmark was taken from MiBench but modified to both encrypt and decrypt small blocks of data, as if the data was being streamed into the processor. The stream is generated pseudo-randomly to avoid platform dependencies on input and output. Encryption typically involves many integer operations with fewer, predictable branches.
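As an illustration of how an input stream can be generated without any I/O, the following minimal sketch produces a deterministic pseudo-random byte stream from a linear congruential generator. The function name `make_stream` and the generator constants (the widely published Numerical Recipes values) are assumptions made for this example; the generator actually used in BEEBS is not specified here.

```python
def make_stream(seed, length):
    """Return `length` pseudo-random bytes from a 32-bit linear congruential generator.

    The multiplier and increment are the Numerical Recipes constants, chosen
    here only for illustration; they are not taken from the BEEBS sources.
    """
    state = seed & 0xFFFFFFFF
    out = bytearray()
    for _ in range(length):
        # Advance the generator, keeping the state within 32 bits.
        state = (1664525 * state + 1013904223) & 0xFFFFFFFF
        # Emit the top byte; the low bits of an LCG have short periods.
        out.append(state >> 24)
    return bytes(out)
```

Because the stream depends only on the seed, every platform processes identical data, and no filesystem or host communication is needed.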
Rijndael
Rijndael is the algorithm for the Advanced Encryption Standard. It is commonly used in many security applications, and has a similar structure to Blowfish. It also has similar execution characteristics except for more frequent branching.
SHA
SHA is a hash function used to verify data streams. It is useful for stressing integer pipelines, and has low memory requirements. The benchmark hashes a stream of pseudo-randomly generated data.
CRC32
Similar to SHA, CRC32 is used for verification of data streams, notably Ethernet frames. It differs from SHA in that it can be implemented with very few instructions, as it consists mainly of shifts and XORs. As it consists of few instructions in a tight loop, this benchmark should exercise processors with superscalar execution or branch prediction. The benchmark performs the CRC on a stream of pseudo-randomly generated data.
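The shift-and-XOR structure described above can be seen in a bit-at-a-time implementation of the standard CRC-32 used by Ethernet (reflected polynomial 0xEDB88320). This is a minimal sketch for illustration; the BEEBS benchmark may well use a table-driven variant instead, and the function name `crc32_bitwise` is an assumption.

```python
def crc32_bitwise(data):
    """Compute the standard (Ethernet/zlib) CRC-32 of a bytes object, one bit at a time."""
    crc = 0xFFFFFFFF  # initial value defined by the CRC-32 standard
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right by one bit; if the bit shifted out was set,
            # XOR in the reflected polynomial.
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # final inversion defined by the standard
```

The inner loop contains only a test, a shift and an XOR, which is why the kernel is dominated by integer and branch instructions rather than memory traffic.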
Integer Matrix Multiplication
Integer matrix multiplication is used very frequently in many applications, and so is a useful benchmark to have. It consists of a tight inner loop with many array accesses, making it useful for stressing the memory and integer pipeline at the same time. This should also expose data caching effects of the platform.
Float Matrix Multiplication
Floating point matrix multiplication is also used frequently. This benchmark is a modified version of the integer matrix multiplication benchmark, with floating point numbers in place of integers; all other code is identical. This should allow a good metric of relative performance between the integer and floating point pipelines to be produced.
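The kernel shared by the two matrix multiplication benchmarks can be sketched as a triple loop. The version below is an illustrative sketch rather than the BEEBS source; the `zero` parameter is an assumption introduced so that the same code can be run with integer or floating point elements, mirroring the "all other code is identical" relationship between the two benchmarks.

```python
def matmul(a, b, zero):
    """Multiply matrix a (n x m) by matrix b (m x p) using the naive triple loop.

    `zero` is the additive identity of the element type (0 or 0.0).
    """
    n, m, p = len(a), len(b), len(b[0])
    result = [[zero] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            acc = zero
            # Tight inner loop: two array reads and one multiply-accumulate
            # per iteration.
            for k in range(m):
                acc += a[i][k] * b[k][j]
            result[i][j] = acc
    return result
```

Calling `matmul(a, b, 0)` with integer matrices and `matmul(a, b, 0.0)` with floating point matrices exercises the same loop structure with different arithmetic.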
Dijkstra
This benchmark implements Dijkstra's shortest path algorithm. It performs many nonlinear accesses to memory, and branches unpredictably. This makes it good for stressing any caches and branch units that the processor may have. This algorithm is commonly used by routers to calculate the shortest path to another router. This benchmark was modified from the MiBench version to have the adjacency matrix embedded in the source code, rather than loaded from the filesystem.
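A minimal sketch of this approach is given below, with the adjacency matrix held as a constant in the source so that no filesystem is needed. The five-node graph `ADJ` is hypothetical example data, not the matrix used in BEEBS, and the simple array-scanning formulation is an assumption chosen for brevity.

```python
INF = float("inf")

# Hypothetical 5-node undirected graph embedded in the source; 0 means "no edge".
ADJ = [
    [0, 7, 9, 0, 14],
    [7, 0, 10, 15, 0],
    [9, 10, 0, 11, 2],
    [0, 15, 11, 0, 6],
    [14, 0, 2, 6, 0],
]

def dijkstra(adj, source):
    """Return the shortest distance from `source` to every node of `adj`."""
    n = len(adj)
    dist = [INF] * n
    dist[source] = 0
    visited = [False] * n
    for _ in range(n):
        # Select the closest unvisited node (data-dependent branching).
        u = -1
        for v in range(n):
            if not visited[v] and (u == -1 or dist[v] < dist[u]):
                u = v
        if dist[u] == INF:
            break  # the remaining nodes are unreachable
        visited[u] = True
        # Relax every edge leaving u (irregular accesses into the matrix).
        for v in range(n):
            weight = adj[u][v]
            if weight and dist[u] + weight < dist[v]:
                dist[v] = dist[u] + weight
    return dist
```

The order in which rows of the matrix are visited depends on the edge weights, which is the source of the nonlinear memory access pattern noted above.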
Cubic root solver
This benchmark performs a large amount of trigonometry to solve various cubic equations. This tests the floating point pipeline with very little memory required. This is a portion of the 'basicmath' benchmark in MiBench, cut down to fit on smaller processors.
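To show where the trigonometry arises, the sketch below solves a cubic using the classical trigonometric method: when the equation has three real roots they are obtained from an arccosine and three cosines. This is an illustrative reconstruction under that assumption, not the BEEBS source; the function name `solve_cubic` and the choice to return sorted real roots are made for this example.

```python
import math

def solve_cubic(a, b, c, d):
    """Return the real roots of a*x^3 + b*x^2 + c*x + d = 0 (a != 0), ascending."""
    a1, a2, a3 = b / a, c / a, d / a
    q = (a1 * a1 - 3.0 * a2) / 9.0
    r = (2.0 * a1 ** 3 - 9.0 * a1 * a2 + 27.0 * a3) / 54.0
    disc = r * r - q ** 3
    shift = a1 / 3.0
    if disc <= 0.0:
        if q == 0.0:
            return [-shift]  # triple root
        # Three real roots: clamp guards acos against rounding error.
        ratio = max(-1.0, min(1.0, r / math.sqrt(q ** 3)))
        theta = math.acos(ratio)
        scale = -2.0 * math.sqrt(q)
        return sorted(
            scale * math.cos((theta + 2.0 * math.pi * k) / 3.0) - shift
            for k in range(3)
        )
    # One real root.
    x = (math.sqrt(disc) + abs(r)) ** (1.0 / 3.0)
    x += q / x
    if r > 0.0:
        x = -x
    return [x - shift]
```

For example, `solve_cubic(1, -6, 11, -6)` corresponds to (x - 1)(x - 2)(x - 3) = 0 and takes the three-root branch. The working set is a handful of scalars, consistent with the low memory requirement described above.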
2D FIR
FIR filters are frequently used in image transformations. In the embedded space this could be the type of operation done by digital cameras. This benchmark is similar to the matrix multiplications but with potentially more memory accesses and spatially different arithmetic.
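The access pattern of a 2D FIR filter can be sketched as follows. This is a minimal illustration, assuming a filter applied only where the kernel fully overlaps the image and without flipping the kernel; the function name `fir2d` and these boundary choices are assumptions, not details of the BEEBS benchmark.

```python
def fir2d(image, kernel):
    """Apply a 2D FIR kernel to an image over the fully overlapping region."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            # Each output sample reads a kh x kw window of the image,
            # so neighbouring outputs reuse overlapping input data.
            for i in range(kh):
                for j in range(kw):
                    acc += image[y + i][x + j] * kernel[i][j]
            out[y][x] = acc
    return out
```

Compared with matrix multiplication, each output here depends on a small two-dimensional neighbourhood rather than a full row and column, which gives the spatially different access pattern mentioned above.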
FDCT
The Finite Discrete Cosine Transform (FDCT) benchmark was included as it is a core algorithm behind many video decoders used in consumer products. This benchmark represents real-world usage of the systems as well as testing the floating point pipeline and caches.
Benchmark Analysis
This section provides a concrete analysis of all the chosen benchmarks by collecting their instruction traces across three of the platforms. From these graphs, the instructions can be categorised to demonstrate that each benchmark performed a different distribution of operations. Overall these results show that the benchmarks give a good spread of different distributions of instruction types.
Table 6: Variation in instruction distributions between the platforms and between the benchmarks.
Integer operations are the most common type of instruction in almost every benchmark. Across the platforms, the distributions are similar, with small variations due to the underlying instruction set. For example, there is a larger percentage of mov-type instructions in the Epiphany results because there are several predicated mov instructions (moveq, movlt, etc.). This reduces the need for conditional branches, so this category decreases in proportion.
Epiphany is also the only platform in the subset chosen which has hardware support for floating point. For the other platforms, software emulation is used. On the XMOS platform this manifests in extra branch and memory instructions, whereas for the ARM platform the proportion of integer operations rises. These differences are due to different emulation strategies used.
The ARM traces follow the same general trend as the traces for XMOS and Epiphany, but with fewer memory operations overall. This is due to the ARM processor having support for the ldm and stm instructions, allowing multiple accesses to memory in a single instruction. These instructions are used extensively in function prologues and epilogues to save and restore registers.
The integer instruction category is the largest group in almost every case, for all platforms and benchmarks. This comes from the integer category covering the largest number of types of instructions, as it groups arithmetic, register copying and bit-wise operations.
These benchmarks show a range of different quantities of each instruction, with similarities across platforms. This makes the set of benchmarks ideal for use in energy profiling of a system. We see that for all platforms a given benchmark produces a similar instruction profile (within 30% between all platforms). This is shown in Table 6 , where the platforms column shows the maximum variation between each platform for each instruction category. The benchmark columns show the ranges of instruction proportions across the benchmarks on that platform. Between benchmarks there is significant variation, therefore the suite explores a wide range of input configurations in a consistent way between architectures. 
Case Study
The use of the benchmark suite is demonstrated through collecting power measurements for each benchmark on each of the platforms. Linear regression is then used to assign an average power dissipation to each class of instructions by considering the average power and instruction distribution per benchmark.
The power of each platform was measured by instrumenting hardware as in Figure 4 . This set-up allowed real measurements to be taken, rather than using an abstract power model for the processor.
The average power dissipation of each benchmark was measured on the three hardware platforms. Linear regression is applied, with the categorized instruction counts gathered from the traces. This allows each category of instructions to be assigned an average power dissipation. The results of this analysis are presented in Table 7. These are scaled results, representing the cost of a single instruction per core/hardware thread (scaled down by 16 for Epiphany and by 4 for XMOS).
Overall, the main difference in power dissipations is due to differing clock rates: XMOS and Epiphany run at 400MHz and ARM at 48MHz.
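The regression step can be sketched as an ordinary least-squares fit in which the average power of each benchmark is modelled as the sum, over instruction categories, of the fraction of instructions in that category multiplied by a per-category power. The model without an intercept, the use of fractions rather than raw counts, the function names and the example figures in the test data are all assumptions made for illustration; they are not the measured values or the exact procedure used in this paper.

```python
def solve_linear(a, b):
    """Solve the square system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_category_power(fractions, powers):
    """Least-squares fit of powers[b] ~ sum_c fractions[b][c] * p[c].

    fractions: one row per benchmark, one column per instruction category.
    powers:    measured average power of each benchmark.
    Returns the estimated power p[c] of each category (normal equations).
    """
    k = len(fractions[0])
    xtx = [[sum(row[i] * row[j] for row in fractions) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * p for row, p in zip(fractions, powers))
           for i in range(k)]
    return solve_linear(xtx, xty)
```

At least as many benchmarks as categories are needed, and their instruction distributions must differ enough for the system to be well conditioned, which is one reason the spread of distributions across the suite matters.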
From these results several conclusions can be drawn.
For the ARM Cortex-M0, a memory access is more costly than an arithmetic instruction, as is expected. The branch power dissipation disagrees with other results taken. The power measured when executing a while(1); loop was found to be 11mW. This figure is higher than a memory access, due to the instruction being loaded from flash as opposed to RAM. The discrepancy is due to conditional branches having a lower power when the branch is not taken (further results indicate that when a conditional branch is not taken, the power dissipation is roughly 4mW).
The XMOS results show memory operations are slightly more costly than arithmetic. The identical cost for branching and memory access is due to the structure of the processor's pipeline: the final stage is a memory access which either does an instruction fetch or a memory operation.
The results for the Epiphany exhibit the most variability, with a branch instruction requiring almost twice the power of a memory access. We believe this is due to the longer pipeline having to be flushed, then new instructions fetched. A floating point operation also takes more power than an integer instruction; this is attributed to the greater complexity of an FPU.
Conclusion
This paper presented BEEBS, a benchmark suite of 10 programs that has been carefully designed to expose the energy consumption characteristics of the target platform. The benchmarks were chosen after evaluating an extensive list of embedded programs for their characteristics and suitability. This included modifying existing benchmarks to be more suitable for a bare-metal benchmark suite for testing energy. The benchmarks are available online [7] .
Each of the benchmarks in the suite was analysed for its instruction distribution, verifying that the benchmark suite sufficiently covered a range of distributions. This was repeated across three platforms with very different features, showing that the suite is consistently good even for different instruction sets. This is important when considering energy consumption, as each type of instruction can consume very different amounts of energy.
An example of how the benchmark suite could be used was given in Section 8. This case study took physical measurements of three platforms: ARM Cortex-M0, XMOS XS1-L1 and Adapteva Epiphany. An average power for each instruction category was then derived by performing linear regression on the power figures and the instruction distributions. We find that different categories of instruction have different power consumptions, as expected. The power dissipations differ per platform in ways which can be explained. One example of this is memory accesses and branching consuming similar powers on the XMOS platform, due to the nature of the processor's pipeline. On the Epiphany platform floating point was slightly more power hungry than integer calculations, due to the extra circuitry in FPUs. The fact that these features can be highlighted by the suite shows that the benchmark suite is fit for purpose when evaluating different processors.
Future Work
The benchmark suite targeted the processor core of the embedded platforms, not exercising peripherals or I/O. In future this benchmark suite could be extended to allow these items to be tested, but it remains to be seen how this can be done in a portable way.
A Benchmark Evaluation Table
This appendix gives a comprehensive list of all the benchmarks evaluated to choose the final set. Each benchmark was examined and its rough characteristics estimated. Each benchmark was also evaluated for several other properties: embedded applicability, memory footprint, and the modifications required to make it run on an embedded system. These three properties were combined in a rule-based manner, producing a 'suitability' for inclusion in the suite. This allowed us to immediately exclude benchmarks with a very low suitability.
The characteristics of the benchmarks estimated the amount of computation in the following areas:
• Integer, Floating Point, or neither. This was estimated from the ratio of floating point operations to integer operations.
• Branching. The benchmark was deemed to be branch-intensive if there was a high ratio of control structures compared to other computation. For example, if more than 20% of the code is control structures the benchmark was marked as branch-intensive.
• Memory. The benchmark was said to be memory intensive if there were frequent accesses to large arrays or other data structures.
The other columns in the table are:
• Embedded Applicability. This is the likelihood that the functionality of the benchmark would be used in a real embedded system. For example, checksumming is frequently done in embedded systems, so this would receive a 'High' embedded applicability.
• Fit in memory. This column specifies whether the benchmark would fit into a small amount of memory. Some benchmarks receive a 'possibly' result for this, where it may be possible to reduce the size of the dataset the program uses.
• Modifications for bare metal. This field indicates the amount of modification necessary to make the benchmark run without operating system support. For example, if the benchmark does not make extensive use of the operating system, and simply loads a dataset, the modifications to make this run bare metal are 'minor'. However, if the benchmark needs graphical display support or other complex features, the modifications necessary are 'major'.
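The kind of rule-based combination described in this appendix can be sketched as below. Only the 20% branch-intensity threshold and the column values ('High', 'possibly', 'minor', 'major') come from the text above; the specific rules in `suitability`, the remaining value names and both function names are hypothetical, since the exact rule set is not reproduced here.

```python
def is_branch_intensive(control_lines, total_lines, threshold=0.20):
    """Mark a benchmark branch-intensive if control structures exceed the threshold."""
    return total_lines > 0 and control_lines / total_lines > threshold

def suitability(applicability, fits_in_memory, modifications):
    """Combine the three evaluation columns into a suitability rating.

    applicability:  'low', 'medium' or 'high'
    fits_in_memory: 'yes', 'possibly' or 'no'
    modifications:  'minor' or 'major'
    The rules below are hypothetical examples, not the rules used for BEEBS.
    """
    # Anything that cannot fit, or needs a major port, is ruled out early.
    if fits_in_memory == "no" or modifications == "major":
        return "low"
    # Only clearly applicable, clearly portable benchmarks rate highly.
    if applicability == "high" and fits_in_memory == "yes":
        return "high"
    return "medium"
```

Under these example rules a checksum benchmark rated `suitability("high", "yes", "minor")` would be a strong candidate, while anything needing graphical display support would be excluded immediately.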