Energy models can be constructed by characterizing the energy consumed by executing each instruction in a processor's instruction set. This can be used to determine how much energy is required to execute a sequence of assembly instructions.
Introduction
In embedded systems, low energy consumption is a very important requirement. The software running on these systems has a profound effect on the energy consumed. The design of software and algorithms, the programming language, and the compiler together with its optimization level all contribute to the energy consumption of an application. Estimates of the energy consumption of programs are very useful to software engineers, who can use them to understand the effect of their code on the energy consumption of the final system. Accurate energy consumption and timing analysis of programs involves analyzing low-level machine code representations. However, programs are written in high-level languages with rich abstraction mechanisms, and the relation between the two is often blurred. For instance, optimizations such as dead code elimination, various kinds of code motion, inlining and other loop optimization techniques obfuscate the structure of the program and make the resultant code difficult to analyze.
In this paper, we develop a static analyzer that works on the intermediate compiler representation of the program (LLVM IR). Our analysis is based on a well-developed approach in which recursive equations (cost relations) are extracted from a program, representing the cost of running the program in terms of its input [2, 6, 7, 25, 31], and summarized in this survey [?]. Finally, these cost relations are converted to closed form, i.e. without recurrences, by means of a solver. For example, we can analyze the following program.
where l denotes the length of the array v, i stands for the counter of the loop and Cproc, Codd and Ceven approximate, respectively, the costs of executing their corresponding methods. The constraints, denoted on the right hand side of the relations, specify a condition that must be true for the cost relation to be applicable. For instance, relation (a) corresponds to the cost of executing proc with an array of length greater than 0 (stated in the condition l > 0), where cost k1 is accumulated to the cost of executing the loop, given by Cfor. Note that the transition into (c) and (d) is nondeterministic. The constants k1, ..., k4 take different values depending on the cost model that one adopts. In this paper, our cost model focuses on energy. These constants are obtained from energy models created at the Instruction Set Architecture (ISA) level [13]. Such models have previously been applied to analysis at the same level [15, 18], and in this paper we propagate this up to the LLVM level.
Many modern compilers such as Clang or XCC are built using the LLVM framework. These internally transform source programs into intermediate compiler representations, which are more amenable to analysis than either source or machine level programs. We show how resource consumption analysis techniques can be adapted and applied to programming languages targeting LLVM IR (such as C or XC [30]) by reusing some of the existing machinery available in the compiler framework (for instance LLVM analysis passes). We show how cost relations can be extracted from programs, such that these can be solved using PUBS (practical upper bounds solver) [2]. Specifically, we focus on optimized LLVM IR that has been compiled with optimization level O2 or higher. This ensures that the experiments we perform are realistic and that the techniques can be used for real-world applications.
Time is a significant component of energy consumption, in that a program that computes its result more quickly will typically consume less energy by virtue of a shorter run-time. However, the correlation between time and energy varies between architectures, and is related to the complexity of the processor's pipeline [23]. For example, one of the target architectures for this paper exhibits an approximately 2× difference in energy depending on the instructions that are executed, with a similar relationship for the number of threads executed upon it [13]. Analysis of system energy, and not just of execution time, will therefore yield better information on the energy characteristics of a program.
Energy models can be constructed for a processor's instruction set; however, this information needs to be constructed at, or propagated to, a higher level program representation in order to benefit our analysis mechanism. We propose two different techniques (Section 4) for assigning energy to a higher level program representation (LLVM IR). We first propose a mechanism for mapping program segments at ISA level to program segments at LLVM IR level. Using this mapping, we can perform a multi-level program analysis where we consider the LLVM IR for the structure and semantics of the program and the ISA instructions for the physical effect on the hardware. We also propose an alternative technique: determining the instruction energy model directly at the LLVM IR level. This is based on empirical data and domain knowledge of the compiler backend and underlying processor.
The analysis toolchain is illustrated in Figure 1. The static resource analysis mechanism is described in Section 3. Parts of this mechanism perform a symbolic execution of LLVM IR, which is described in Section 2.
The techniques described are built into a tool, which can be integrated into the build process and statically estimates the energy consumption of an embedded program (and its constituent parts, such as procedures and functions) as a function of several parameters of the input data. Our approach is validated in Section 5 on a number of embedded systems benchmarks, on both xCORE and Cortex-M platforms. Finally, we describe related work in Section 6 and conclude in Section 7.
Structure and interpretation of LLVM IR
In this section we describe the core language and an important technique we utilize in the resource analysis mechanism (Section 3), which infers energy formulae given an LLVM IR program.
The LLVM IR language
LLVM IR is a Static Single Assignment (SSA) based representation. This is used in a number of compilers, and is designed to represent high-level languages. For presentation purposes, we first formalize a simple calculus of LLVM IR, based on the following syntax:
.an (generic op., no side-effects)
We use metavariable names p, f, a, x to describe predicates, function names, generic arguments and variables respectively. The instruction semantics are modeled on the actual LLVM IR semantics [33]. Instruction op represents any side-effect-free operation such as icmp or add in LLVM. The φ instruction takes a list of pairs as arguments, with one pair for each predecessor basic block of the current block. Each pair contains a reference to the predecessor block together with the variable that is propagated to the current block. The only place where a φ instruction can appear is at the beginning of a basic block. Two interesting instructions are memload and memstore. These represent any dynamic memory load and store operation respectively. For instance, getelementptr and load are examples of instructions represented by memload. These instructions typically compute pointers dynamically and load data from memory. In our abstract semantics of LLVM IR, we therefore treat variables assigned with values dynamically loaded from memory as unknown (denoted '?').
LLVM IR instructions are arranged in basic blocks, labeled with a unique name. A basic block BB over a CFG is a maximal sequence of instructions, inst1 through instn, such that all instructions up to instn−1 are not branch or return instructions and instn is br or ret. The φ instructions always appear as the first instructions in a block, as a block can have multiple in-edges. All call instructions are assumed to eventually return.
Symbolic evaluation of LLVM IR variables
At the core of our resource analysis mechanism of LLVM IR is a symbolic evaluation function seval. Given a block of code BB and a variable x, seval(BB, x) symbolically executes a slice from this block to compute the effect on x. A program slice is a set of instructions that may affect the value of x at some point of interest. During this static analysis phase, we do not simply execute the LLVM IR, but use a non-standard semantics, which abstracts away dynamic memory reads and writes, i.e., memload and memstore. This has the effect of producing simple expressions, which can be handled by the PUBS solver. We proceed by showing examples of actual LLVM IR snippets and the effect of symbolic evaluation on some variables:
In this case, the symbolic evaluation concludes that seval(BB, %not.zerocmp7) = ?. The evaluation starts at the assignment of %not.zerocmp7, which evaluates to %deref6 == 0. However, since %deref6 is a dynamically loaded value (memload), the analyzer concludes that %deref6 is ? and therefore that seval(BB, %not.zerocmp7) = ?.
Sometimes, the code inside a block has no effect on a variable of interest. In this case seval(BB, %i.0) is %i.0.
In this case seval(BB, %exitcond) is (%i.0+1) == %1, which is easily found by traversing the structure of the LLVM block backwards.
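The behaviour of seval described above can be illustrated with a minimal Python sketch. The triple-based instruction encoding, the opcode spellings and the name %inc are assumptions made for illustration only; they stand in for the LLVM IR snippets and are not the tool's actual data structures.

```python
UNKNOWN = "?"  # abstract value for anything loaded dynamically from memory

def seval(block, var):
    """Symbolically evaluate `var` in terms of values flowing into `block`.

    `block` is a list of (dest, opcode, operands) triples in SSA form. This
    toy encoding is an assumption for illustration, not real LLVM IR.
    """
    defs = {dest: (op, args) for dest, op, args in block}

    def expand(v):
        if v not in defs:
            return v                  # constant, or defined in another block
        op, args = defs[v]
        if op == "phi":
            return v                  # value enters the block; keep the name
        if op == "memload":
            return UNKNOWN            # dynamic memory reads are abstracted away
        parts = [expand(a) for a in args]
        if UNKNOWN in parts:
            return UNKNOWN            # an unknown operand makes the result unknown
        return "(" + f" {op} ".join(parts) + ")"

    return expand(var)

# Hypothetical stand-ins for the three situations discussed in the text.
load_block = [
    ("%deref6", "memload", ["%ptr"]),
    ("%not.zerocmp7", "==", ["%deref6", "0"]),
]
loop_block = [
    ("%i.0", "phi", []),
    ("%inc", "+", ["%i.0", "1"]),
    ("%exitcond", "==", ["%inc", "%1"]),
]

print(seval(load_block, "%not.zerocmp7"))  # ?
print(seval(loop_block, "%i.0"))           # %i.0
print(seval(loop_block, "%exitcond"))      # ((%i.0 + 1) == %1)
```

The backward traversal is expressed here as recursive substitution of SSA definitions, which is sufficient within a single block because every name has exactly one definition.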
Resource Analysis for LLVM IR
The techniques described here are used to infer cost relations [2]. Cost relations are recursively defined and closely follow the flow of the program. What we actually want to infer is a closed-form formula modeling the cost, parametric in any relevant input arguments to the program; obtaining it requires a cost relation solver. These solvers typically work with simplified control flow graph structures, and therefore we must first perform some simplifications on the control flow graphs, as described in Section 3.3. The analysis then infers block arguments by using symbolic evaluation as described in Section 2.2.
Inferring block arguments
Block arguments characterize the input data, which flows into the block, and is either consumed (killed) or propagated to another block or function. Unfortunately, solving multi-variate cost relations and recurrence relations automatically is still an open problem, and the fewer arguments each relation has, the easier it is to solve these. For this reason, we designed an analysis algorithm to minimize the block arguments before inferring the cost relations.
The algorithm for inferring block arguments is a data flow analysis algorithm. We use a standard means to describe this algorithm, as in [21]. We define a data flow analysis function gen, which, given a basic block, returns the variables of interest in that block:
The function gen_blk returns the input arguments that affect the branching in a block BB, composed of instructions inst1 through instn, and gen_fn returns the variables that affect the input to any external calls in the block. gen_blk is defined as follows:
The function ref returns all variables referred to in the symbolically evaluated expression given as argument, for example ref(x > (y + 3)) returns {x, y}. We also define function gen_fn. This returns all the input arguments that affect the parameters given to the function, and is defined as:
The data flow analysis function kill is defined as:
Finally, we combine gen and kill by utilizing a transfer function, which is inlined into args_in and args_out. These compute the relevant block arguments utilized by our resource analysis. args_in(BB) is defined as the function's arguments if BB is the function's first block. In all other cases, args_in and args_out are defined as:
where phimap maps variables between adjacent blocks BB and BB′ based on the φ instructions in BB′.
Functions args_in and args_out are recomputed until their least fixpoint is found. Finally, the block arguments are found in args_in. The analysis explained in this section is closely related to live variable analysis. A crucial difference, however, is in the function gen. In our case, this returns a smaller subset of variables than live variable analysis, i.e., only the ones that may affect control flow.
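The fixpoint computation can be sketched in Python as follows. The transfer equations used here are an assumption, namely the usual backward form known from live variable analysis: args_out(BB) is the union over successors BB′ of phimap applied to args_in(BB′), and args_in(BB) = gen(BB) ∪ (args_out(BB) \ kill(BB)). The dictionary layout, the block names and the precomputed gen and kill sets are likewise hypothetical.

```python
def block_arguments(blocks, entry, fn_args):
    """Iterate args_in / args_out to their least fixpoint.

    blocks: name -> {"gen": set, "kill": set, "succ": {succ_name: phimap}},
    where phimap renames a successor's phi-defined variable to the value that
    flows in along this edge (assumed encoding, for illustration only).
    """
    args_in = {b: set() for b in blocks}
    args_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for name, blk in blocks.items():
            out = set()
            for succ, phimap in blk["succ"].items():
                out |= {phimap.get(v, v) for v in args_in[succ]}
            if name == entry:
                new_in = set(fn_args)      # first block: the function's arguments
            else:
                new_in = blk["gen"] | (out - blk["kill"])
            if out != args_out[name] or new_in != args_in[name]:
                args_out[name], args_in[name] = out, new_in
                changed = True
    return args_in

# Hypothetical counting loop: i = phi(i0, i1); x = load ...; i1 = i + 1;
# cond = (i1 == n); branch back to loop or to exit.
blocks = {
    "entry": {"gen": set(), "kill": {"i0"}, "succ": {"loop": {"i": "i0"}}},
    "loop": {"gen": {"i", "n"}, "kill": {"i1", "x", "cond"},
             "succ": {"loop": {"i": "i1"}, "exit": {}}},
    "exit": {"gen": set(), "kill": set(), "succ": {}},
}
result = block_arguments(blocks, "entry", {"n"})
print(result["loop"])  # contains only i and n
```

In this example the loaded value x never becomes a block argument, because gen only reports variables that influence branching, which is the difference from live variable analysis noted above.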
Generating and solving cost relations
In order to generate cost relations we need to characterize the energy exerted by executing the instructions in a single block. We also need to model the continuations of each block. Continuations, expressed as calls to other cost relations, arise from either branching at the end of a block, or from function calls in the middle of a block. For instance, consider the following LLVM IR block: This would translate to the following relation:
where CLI, Cret and CLB characterize the energy exerted when running the blocks LoopIncrement, return and LoopBody respectively. We therefore refer to Cret and CLB as continuations of CLI. Expressing these calls to other cost relations involves evaluating their arguments, which cannot be done without evaluating the program. Instead, by symbolically executing the block, we can express the arguments of the continuation in terms of the input arguments to the block. In order to do so, we perform symbolic evaluation using the function seval.
The cost relations, extracted from recursive programs using the techniques discussed in this section, can be automatically solved by PUBS [2] after translating to its proprietary format. In the output, this shows the upper bound obtained, as a formula together with the results of the intermediate steps performed. Internally, PUBS solves these by computing ranking functions and loop invariants. The problem of solving cost relations is composable, i.e., complex functions can be inferred by first inferring simpler ones and composing these together mathematically.
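To make the notion of continuations concrete, the following Python sketch encodes a small system of cost relations as mutually recursive functions and compares it with a closed form. The relations, their guards and the constants are hypothetical, chosen to resemble a counting loop; a solver such as PUBS would infer an upper bound, whereas for this simple loop the closed form happens to be exact.

```python
# Hypothetical per-block energies; real values would come from the energy model.
K_LI, K_LB, K_RET = 3.0, 5.0, 1.0

def c_ret():
    return K_RET

def c_li(i, n):
    # The continuation's arguments are written in terms of the block's own
    # inputs, as symbolic evaluation would provide (exit when i + 1 == n).
    if i + 1 == n:
        return K_LI + c_ret()
    return K_LI + c_lb(i + 1, n)

def c_lb(i, n):
    return K_LB + c_li(i, n)

def closed_form(n):
    # n executions of LoopBody and LoopIncrement, then one return block.
    return n * (K_LI + K_LB) + K_RET

print(c_lb(0, 4), closed_form(4))  # 33.0 33.0
```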
There are cases where the optimized program structures produced by LLVM based compilers prevent the cost relation solvers from finding unique cover points in the structure of the cost relations. In order to solve this problem, we need to perform transformations to the call graph upon which we construct our cost relations. This is described in the next section.
Transformations for control flow graphs
After compilation, nested loop program structures are mangled by compiler optimizations. When the resulting Control Flow Graph (CFG) is directly used to produce cost relations (CRs), it is usually not possible to infer closed form solutions. For instance, PUBS [2] cannot handle complex CFGs, and therefore in order to analyze programs with nested loops, the CFG needs to be simplified. The simplification is done at an early stage in the analysis, right after generating an initial CFG, using the following steps:
1. Identify a loop's CFG, A, that has nested loops.
2. Identify the sub-CFG, B, of A that corresponds to the inner loop.
3. Extract B out of A, so that B is a separate CFG. This can be thought of as a new function with multiple return points. Hence B's exit edges are removed.
4. In A, in the place where B used to be, keep the continuation to B. Append a continuation to B's exit targets to B's caller in A.
In order to perform the first two steps, we need to identify the loops in the CFG. While LLVM has specific passes to do so, we had better success when using the algorithm described in [32]. As an example, we show how these steps can be used to transform the CFG of a simple insertion sort, as shown in Listing 1. The original CFG of this program, when compiled using clang with optimization level O2, is shown in Figure 2 (left). In this CFG, the nested loops are identified, which also involves identifying their corresponding entries, re-entries, exits and loop headers. Here, blocks bb1, bb2 and .backedge form the inner loop. These blocks are hoisted and the exit edge from .backedge (dotted) is eliminated. Instead, .loopexit is then called after bb1 "returns" (Figure 2, right).
The CFG simplifications described in this section preserve the same order of operations when applied to an existing CFG.
In order to verify the transformation with respect to energy, let us consider a typical while or for loop and show that the same sequence of blocks is called after the transformation takes place. We can assume that such a loop has a single header, but may have multiple exits or re-entries, and that induction variables of the outer loops are not modified in the inner loops. After the transformation takes place on a nested loop structure (B inside A), B is still called from A; however, B's exit edges are now removed. The target of B's exit edges will still be called after B completes. This is because we have appended a continuation in A to this target, in Step 4. Hence all blocks will be called in the same sequence. The argument above can be inductively applied to loops with arbitrary nesting levels.
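Steps 3 and 4 can be sketched in Python under simplifying assumptions: the CFG is a dictionary from block names to successor lists, the inner loop and its header are already known (steps 1 and 2), and a continuation is modeled as a list of blocks called in sequence. The block names below are hypothetical and do not correspond to Figure 2.

```python
def extract_inner_loop(cfg, inner, header):
    """Split `cfg` into the outer CFG A and the extracted inner-loop CFG B.

    cfg: block -> list of successors; inner: set of blocks forming the inner
    loop; header: the inner loop's header. Illustrative encoding only.
    """
    # Targets of B's exit edges, in first-seen order.
    exits = []
    for block, succs in cfg.items():
        if block in inner:
            for s in succs:
                if s not in inner and s not in exits:
                    exits.append(s)
    # Step 3: B becomes a separate CFG; its exit edges are removed ("return").
    extracted = {b: [s for s in cfg[b] if s in inner] for b in cfg if b in inner}
    # Step 4: keep the continuation to B, then append B's exit target.
    outer = {}
    for block, succs in cfg.items():
        if block in inner:
            continue
        conts = []
        for s in succs:
            if s == header and exits:
                conts.extend([header, t] for t in exits)
            else:
                conts.append([s])
        outer[block] = conts
    return outer, extracted

cfg = {
    "outer_header": ["inner_header"],
    "inner_header": ["inner_body", "outer_latch"],  # second edge leaves the loop
    "inner_body": ["inner_header"],
    "outer_latch": ["outer_header", "exit"],
    "exit": [],
}
outer, extracted = extract_inner_loop(cfg, {"inner_header", "inner_body"}, "inner_header")
print(outer["outer_header"])  # [['inner_header', 'outer_latch']]
```

With several exit targets the caller receives one alternative continuation per target, which mirrors the view of B as a function with multiple return points.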
Computing energy cost of LLVM IR blocks
The intermediate representation used by LLVM is architecture independent. Any given LLVM IR sequence can be passed to one of many different backends, each targeting a different ISA [16]. The exact implementation of the ISA determines the energy consumed by each instruction that is executed. Thus, the conversion to machine code, together with the processor implementation, affects the energy consumption of an instruction at the LLVM IR level.
For static analysis of LLVM IR to produce useful energy formulae for programs, a method of assigning an energy cost to an LLVM IR segment must be used. Two possible methods are demonstrated in this paper:
1. ISA energy model w/mapping. LLVM IR is mapped to its corresponding ISA instructions and the energy cost is obtained from the ISA level cost model. The advantage is that it is simpler to characterize at ISA level, however this requires an additional step to correlate LLVM with ISA instructions.
2. LLVM energy model. Attributing costs directly to LLVM IR removes the need for a mapping. However, it necessarily simplifies the energy consumption characteristics, reducing accuracy.
In principle, both methods can be explored for both architectures. This paper utilizes an ISA level model for the XMOS processor. The Cortex-M is modeled at the LLVM IR level directly.
XMOS XS1-L ISA level modeling
Figure 2. CFG of an insertion sort compiled using clang with optimization level O2 before (left) and after (right) simplification.
The aim of ISA level modeling is to associate machine instructions with an energy cost. To achieve this, energy consumption samples must be collected and an appropriate representation of the underlying hardware must be used as a basis for the model. A single-threaded model, such as that defined by Tiwari [27] and expressed in Equation 1, describes the energy of a sequence of instructions, or program.
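Written out from the term-by-term description that follows, Equation 1 takes the form below. This is a reconstruction from the prose; the index set labels ISA and ext are notational assumptions.

```latex
E_{prog} = \sum_{i \in ISA} B_i \, N_i
         + \sum_{i,j \in ISA} O_{i,j} \, N_{i,j}
         + \sum_{k \in ext} E_k
\qquad (1)
```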
The program's energy, Eprog, is first formed from the base cost, Bi, of all instructions, i, in the ISA, multiplied by the number of occurrences, Ni, of each instruction. For each transition in a sequence of instructions, the overhead, Oi,j, of switching from instruction i to instruction j is added, multiplied by the number of times the combination i, j occurs, Ni,j. Finally, for a set of k external effects, the cost of each of these effects, Ek, is added. For example, these external effects may represent the cache and memory costs, based on the cache hit rate statistics of the program.
The XS1-L architecture implements multi-threading in a hardware pipeline. Even for single-threaded programs, we need to consider the behavior of this multi-threaded pipeline. The power of individual instructions varies by up to 2×, with multi-threading introducing up to a 1.6× increase with a 4× performance boost. This means execution time and energy are related in a more complex way than in a simpler single-threaded architecture. The model for the XS1-L is built upon the existing work of [28] and the more detailed [26], which obtain model data through the energy measurement of specific instruction sequences, and create a representation of some of the processor's internal structure in the model equations. A full description of the XS1-L's energy characteristics and the model is given in [13].
To extend a Tiwari style approach to model the XS1-L processor, two new characteristics must be accounted for: idle time and concurrency. The XS1 ISA has a number of event-driven instructions, which can result in the processor executing no instructions for a period of time, until the event occurs. Furthermore, the multithreaded pipeline permits only one instruction from a given thread to be present in the pipeline at any one time. These changes are expressed in Equation 2. Here, the energy exerted by running a program depends on a base power, Pbase, which represents the energy cost when no instructions are executed, multiplied by the number of idle periods, Nidle. The clock period of the processor, Tclk is also introduced, to allow different clock speeds to be considered. The inter-instruction overhead, previously described in Equation 1 as Oi,j, is generalized to a constant overhead, O, due to the unpredictability of instruction interaction between threads. For each instruction, the base cost is added to the instruction cost, Pi, which is scaled by the overhead and an additional scaling factor based on the number of active threads, Mt. This is multiplied by the number of occurrences of this instruction at t threads, Ni,t and the clock period, Tclk. This is done for the varying number of threads, t that may be active in the program over its lifetime.
The multi-threaded ISA level model for the XS1-L requires that for each level of concurrency, t, the number of instructions executed at that level should be known, or estimated. If a single threaded program is run on its own on the XS1-L and there are no idle periods, then Equation 2 simplifies to Equation 3, where the idle accounting is removed, and only the first threading level, t = 1, is considered.
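Based on the descriptions above, Equations 2 and 3 can be written as follows. This is a reconstruction from the prose rather than a verbatim copy; the upper limit N_t (the maximum number of concurrently active threads) is a symbol introduced here for illustration.

```latex
E_{prog} = P_{base} \, N_{idle} \, T_{clk}
         + \sum_{t=1}^{N_t} \sum_{i \in ISA}
           \bigl( (M_t \, P_i \, O + P_{base}) \, N_{i,t} \, T_{clk} \bigr)
\qquad (2)

E_{prog} = \sum_{i \in ISA}
           \bigl( (M_1 \, P_i \, O + P_{base}) \, N_{i,1} \, T_{clk} \bigr)
\qquad (3)
```

Equation 3 follows from Equation 2 by setting N_idle = 0 and keeping only the t = 1 term of the outer sum.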
The current analysis effort focuses upon single threaded experiments, thus Equation 3 can be used. Multi-threaded analysis is proposed as future work in Section 7.
XMOS LLVM IR energy characterization by mapping
To enable the analysis at the LLVM IR level we need a mechanism to propagate the existing energy model at the ISA level up to the LLVM level. The mapping technique described in this section serves this purpose by creating a fine grained mapping between segments of ISA instructions to LLVM IR instructions, in order to enable the energy characterization of each LLVM IR instruction in a program. A full description of the mapping techniques is given in [8] .
Our mapping technique leverages the existing debug mechanism in the XMOS compiler toolchain. This mechanism is originally meant to facilitate the debugging process of an application, particularly when stepping through a program line by line. During the lowering phase of the compilation process, the LLVM IR code is transformed to the specific ISA code by the backend. The debug information (DI) is also stored alongside the ISA code using the DWARF standard [1], a standardized debugging data format used by many compilers and debuggers to support source level debugging. By tracking this information we can extract an n:m relationship between the two levels, because one source code instruction can be related to many different sequences of LLVM IR instructions and therefore many different sequences of ISA instructions. Because this n:m relation complicates static analysis, there is a need for a more fine grained mapping.
To address this issue, we created an LLVM pass that traverses the LLVM IR and replaces the source location information with LLVM IR location information, right after all the optimization passes and just before emitting the ISA code. In this way, we can extract a 1:m relationship between LLVM IR instructions and ISA instructions. Also, because the mapping is extracted after the LLVM optimization passes, it is based on optimized LLVM IR, which is closer to the ISA code than the unoptimized IR that still has to go through a series of transformations. There are optimizations that happen during the lowering phase, such as peephole optimizations and some late target specific optimizations, that can affect the mapping. However, the effect of these optimizations on the structure of the code is not as profound as those applied to LLVM IR. After a mapping is extracted for a particular program, the associated energy values for the ISA instructions corresponding to a specific LLVM IR instruction are aggregated and then associated with the LLVM IR instruction, and finally with every LLVM IR block.
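The aggregation step at the end of this process can be sketched in Python as follows. The trace format (one triple per emitted ISA instruction, carrying the LLVM IR instruction and block recovered from the rewritten debug locations), the identifiers and the energy values are all hypothetical; they are not the output format of the actual mapper tool.

```python
from collections import defaultdict

def aggregate_energy(isa_trace, isa_energy):
    """Sum ISA-level energies per LLVM IR instruction, then per LLVM IR block.

    isa_trace: list of (isa_opcode, llvm_inst_id, llvm_block) triples.
    isa_energy: ISA opcode -> energy per execution (illustrative values).
    """
    per_inst = defaultdict(float)
    inst_block = {}
    for opcode, inst_id, block in isa_trace:
        per_inst[inst_id] += isa_energy[opcode]   # 1:m mapping, so accumulate
        inst_block[inst_id] = block
    per_block = defaultdict(float)
    for inst_id, energy in per_inst.items():
        per_block[inst_block[inst_id]] += energy
    return dict(per_inst), dict(per_block)

# Hypothetical costs and mapping: LLVM IR instruction "i7" lowers to two ISA
# instructions, while "i8" and "i9" lower to one each.
isa_energy = {"add": 1.0, "ldw": 2.0, "eq": 1.0, "bt": 1.5}
isa_trace = [
    ("add", "i7", "LoopBody"),
    ("ldw", "i7", "LoopBody"),
    ("eq", "i8", "LoopBody"),
    ("bt", "i9", "LoopBody"),
]
per_inst, per_block = aggregate_energy(isa_trace, isa_energy)
print(per_inst["i7"], per_block["LoopBody"])  # 3.0 5.5
```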
Although we use the XMOS tool-chain for the mapper tool, the approach is generic and transferable, due to the use of the common LLVM optimizer and code generator, and the use of the DWARF standardized debugging data format, used by many compilers and debuggers to support source level debugging.
LLVM IR energy model for ARM
An energy model for ARM Cortex-M series is applied directly at the LLVM IR level, based upon empirical energy measurement data, and knowledge of both the processor architecture and the compiler backend. The Cortex-M3 model is for the most part a simplification of the Tiwari model [27] , applied at the LLVM IR level. The simple, embedded nature of this processor removes the need to model external effects such as cache misses, and the effect of the switching cost between instructions is approximated into the actual instruction cost. Through analysis of energy measurements for a large set of the target ISA instructions, it was found that LLVM IR instructions can be segmented into four groups: memory, M , program flow, B, division, D, and all other instructions, G.
The LLVM IR syntax described in Section 2 can be related to these groupings. In particular, br, call and ret can be combined into group B; memload and memstore are members of M; the subset of op relating to division makes up group D; and finally, φ and all remaining members of op form group G.
This yields a model equation that accumulates the energy of a program based on the number of instructions executed from each group. Equation 4 considers each group, which is assigned an energy cost, which combined give the total program energy, Eprog, where Ei is the energy cost of a single instruction in group i, and Ni is the number of instructions executed in that group.
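From this description, Equation 4 amounts to summing Ei × Ni over the four groups M, B, D and G. A minimal Python sketch of this accumulation is given below; the per-group energy values and the choice of sdiv and udiv as the division opcodes are assumptions for illustration, not measured data.

```python
from collections import Counter

# Grouping of the calculus from Section 2; anything not listed falls into G.
GROUP = {
    "memload": "M", "memstore": "M",
    "br": "B", "call": "B", "ret": "B",
    "sdiv": "D", "udiv": "D",
}
# Hypothetical energy cost per executed instruction of each group.
ENERGY = {"M": 4.0, "B": 3.0, "D": 10.0, "G": 1.0}

def program_energy(executed):
    """Return sum over groups i of E_i * N_i for a list of executed opcodes."""
    counts = Counter(GROUP.get(op, "G") for op in executed)
    return sum(ENERGY[group] * n for group, n in counts.items())

print(program_energy(["phi", "add", "memload", "sdiv", "br"]))  # 19.0
```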
In addition, there are a number of other factors that affect energy, due to the relation between the LLVM IR and the ISA:
1. Variadic arguments. LLVM has instructions with variadic arguments. Typically, the number of arguments in the instruction affects the energy consumed in a linear manner.
2. Data types. LLVM operations op can be performed on values of different data types. If the data type is larger than 32 bits, or floating point, this will translate into a larger number of ISA instructions on a Cortex-M with no floating point unit.
3. Predicated instructions. The Cortex-M processor is capable of executing predicated instruction sequences. In some cases, short LLVM IR blocks originating from ternary expressions in the original source code are directly translated to a number of predicated instructions in the ARM ISA. Therefore, the number of ISA instructions generated could be less than the number of instructions in LLVM IR, and the static analysis over-approximates the energy consumption of these blocks.
Table 1. Benchmark Characteristics.
Factors (1) and (2) As we can see, the energy consumed by an LLVM call instruction is parametric in the number and types of the arguments and return value.
Experimental Evaluation
We have selected a series of benchmarks of core algorithmic functions, particularly from the BEEBS [22] and MDH WCET benchmark [9] suites. These are collections of open source benchmarks for deeply embedded systems, slightly modified to work with our test harness. The benchmarks are single threaded, reflecting the scope of the analysis performed in this paper. Table 1 summarizes the characteristics of the benchmarks and the meaning of the last 5 columns is as follows: (L) contains loops, (NL) contains nested loops, (A) uses arrays and/or matrices, (B) contains bitwise operations, (C) contains loops with complex control flow predicates.
In order to show that our techniques are applicable to multiple languages and platforms, we have ported some of the benchmarks from C to XC. Porting C code to XC typically does not involve rewriting, since the syntax is very similar and both use the same preprocessors. However, since XC does not provide pointers, some changes need to be made to the benchmarks during the porting process. For the benchmarks that run on the xCORE, we have used the XC compiler, version 13. For Cortex-M benchmarks we have used Clang version 3.5. In both cases, the benchmarks are compiled under optimization level O2. We proceed by describing the benchmarks.
GCD. This benchmark is an implementation of the Euclidean algorithm, which computes the greatest common divisor between any two numbers. This is implemented using an iterative style and parameterized with its two input numbers (A and B).
Insertion sort. The code of the main function is shown in Listing 1. The energy exerted by the insertion sort partly depends on how many swaps need to take place, and this is dependent on the actual data present inside the array. Since our analysis infers approximate upper bounds, we measure the energy consumed in sorting a reverse-ordered list, and compare this to the statically inferred formula. Note that the number of iterations in the inner loop depends on an induction variable in the outer loop. This benchmark is parameterized by the length of the list to be sorted, P.
Matrix multiply (BEEBS/MDH WCET). We slightly modified this so that it can work with matrices of various sizes. The matrices are all square, of size P.
Base64 encode. Computes the base64 encoding as a string, given an input string of length P.
MAC (MDH WCET). Dot product of two vectors together with sum of squares. This is parameterized by the length of the vectors, P.
Jpegdct (MDH WCET). Performs a JPEG discrete cosine transform. Taken from the MDH WCET benchmark suite. This benchmark is not parameterized.
Levenshtein distance (BEEBS). Computes the minimum number of edits to change one string into another. The lengths of the two strings are parameterized with the variables A and B.
Experimental Setup
For both ARM and XMOS platforms, power measurement data is collected by using appropriately instrumented power supplies, a power sense chip and an embedded system running control and data acquisition software. The implementations differ, but are structurally very similar. Both of these periodically calculate the power using Equation 5 during a test run by sampling the voltage on either side of a shunt resistor (Vbus and Vshunt) to determine the supplied current.
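From this description, Equation 5 presumably computes the supplied current as Vshunt / Rshunt and multiplies it by Vbus. The Python sketch below follows that reading, assuming Vshunt is the voltage drop across the shunt resistor. Summing the power samples into an energy figure is an additional assumption, as are the function names and sample values.

```python
def power_watts(v_bus, v_shunt, r_shunt):
    """P = V_bus * (V_shunt / R_shunt); assumed form of Equation 5."""
    current = v_shunt / r_shunt      # supplied current through the shunt
    return v_bus * current

def energy_joules(samples, r_shunt, period_s):
    """Approximate energy as the sum of power samples times the sample period.

    samples: list of (v_bus, v_shunt) pairs taken every `period_s` seconds.
    This integration step is an assumption, not taken from the text.
    """
    return sum(power_watts(vb, vs, r_shunt) for vb, vs in samples) * period_s

# Hypothetical readings: 3.0 V bus, 5 mV across a 50 mOhm shunt.
print(power_watts(3.0, 0.005, 0.05))
```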
For the Cortex-M processor, the measurements are taken on an ST Microelectronics STM32VLDISCOVERY board while for the xCORE, a custom XMOS board with an XS1-L based XS1-U16A chip is used.
Results
Figure 6. Example program where the analysis infers a max formula, together with its CFG.
Mean relative error 9.9 3.4
The results for the XMOS xCORE and ARM Cortex-M processors are shown in Figures 4 and 5, respectively. These graphs show the insertion sort, matrix multiplication and mac benchmarks, with data series for the static analysis results and actual energy measurements. The static analysis closely fits the empirical results, validating our approach. Table 2 shows the formulae and final errors for all benchmarks. Overall, the final error is typically less than 10% and 20% on the XMOS and ARM platforms respectively, showing that the general trend of the static analysis results can be relied upon to give an estimate of the energy consumption. We explain the sources of error in our results below:
Simple LLVM IR energy model (ARM). For the case of Cortex-M the errors in the analysis mostly stem from the greatly simplified model of energy consumption in the Cortex-M. The LLVM energy model used for the Cortex-M assigns an energy cost to each IR instruction. Therefore, when an IR instruction expands to unexpected, or many ISA level instructions, the energy consumption can be inaccurate. In particular, for base64, ternary operators are heavily used inside its main loop. In LLVM IR, this introduces a number of short conditional blocks inside this loop. These multiple basic blocks in LLVM IR are translated to a smaller number of predicated instructions in the ARM ISA by the compiler, so the static analysis will over approximate the energy consumed.
Measurement error. Measurement errors are introduced by environmental factors such as temperature and power supply fluctuations. The tolerance of the components is another factor. A further important factor is the test harness itself, which has to call a function repeatedly in order to obtain its energy measurements. The loop surrounding this function, together with the call itself, can be a significant overhead when the amount of computation inside the function is low. In fact, we can see that in all cases except GCD, the relative error converges to a single value. This is expected because in all of the benchmarks the parameter controls the number of iterations performed in one or more loops. As the parameter increases, the constant energy overhead becomes negligible with respect to the energy consumption of the function under test. Measurements were repeated numerous times to ensure consistency of results within the expected error margins described above.
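The convergence of the relative error can be illustrated with a small model. It assumes, purely for illustration, that the measured energy is a per-iteration cost times the parameter plus a constant harness overhead, while the static estimate has no overhead term. All numbers are hypothetical.

```python
def relative_error(n, est_per_iter, true_per_iter, harness_overhead):
    """Relative error of a linear estimate against a measurement that
    includes a constant harness overhead (assumed model)."""
    measured = true_per_iter * n + harness_overhead
    estimated = est_per_iter * n
    return abs(estimated - measured) / measured


# Hypothetical costs: 9 nJ estimated and 10 nJ actual per iteration,
# plus 50 nJ of constant overhead from the test harness.
for n in (1, 10, 100, 1000, 10**6):
    print(n, round(relative_error(n, 9.0, 10.0, 50.0), 4))
```

Under this model the error tends to |est_per_iter - true_per_iter| / true_per_iter as n grows, so the constant overhead dominates only for small parameter values.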
Data flow through the processor's execution units. The energy models for the xCORE and ARM assume a random distribution of operand data. In practice, however, operations such as logical tests, bit-manipulation and instructions performed on shorter data types such as char will not use the full bit-range of the data path. In cases such as these, energy consumption will be lower, therefore introducing some estimation inaccuracy.
LLVM IR to ISA mapping (xCORE). In the case of the xCORE, the overall results are better than those for the Cortex-M. This is due to a more accurate assignment of energy values to LLVM IR instructions, which the mapper can produce for each individual program, as described in Section 4.2. Nevertheless, the mapper introduces analysis error. For instance, the mapper does not consider instruction scheduling on the processor, where an instruction fetch stall can happen in some limited scenarios. This can be addressed by performing a further local analysis on the ISA code to determine the possible locations where this happens, and adjusting the energy accordingly. Another problem arises when mapping LLVM IR phi instructions to the "corresponding" ISA code. This code is sometimes hoisted out of loops at a later compilation phase. Also, multiple ISA instruction sequences that are conditionally executed sometimes map to a single phi instruction. This phenomenon was partially addressed by automatically adjusting the energy of phi instructions in the mapping in such cases.
Static analysis and data dependence. Programs whose behavior and state depend on complex properties of the actual input data are problematic for static resource analysis. An extreme example of such a program would be an interpreter. The execution time of an interpreter depends not only on the size of the program file it is supplied, but also on the contents of this file. A more typical example would be the Euclidean algorithm (GCD), where the number of steps taken to execute depends on a relationship between its parameters A and B. Our static analysis technique, however, still manages to compute an approximate, logarithmic upper bound, which depends on only one of the arguments. Part of the reason why we can analyze programs of this type is that symbolic evaluation of the modulus of two variables, x mod y, returns an upper bound of y − 1, a lower bound of 0 and an approximation of (y − 1)/2.
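The following sketch illustrates both points for the Euclidean algorithm. The bounds on x mod y are those stated above. The logarithmic step bound is the textbook one (the second argument at least halves every two steps), used here as an assumption to show that a bound in a single argument exists. It is not the formula inferred by the analysis.

```python
def gcd_steps(a, b):
    """Number of modulo operations performed by Euclid's algorithm."""
    steps = 0
    while b != 0:
        a, b = b, a % b
        steps += 1
    return steps


def mod_bounds(y):
    """(lower bound, approximation, upper bound) of x mod y for y > 0."""
    return 0, (y - 1) / 2, y - 1


def log_step_bound(b):
    """Textbook bound on the step count, a function of b only.

    b.bit_length() equals floor(log2(b)) + 1 for b > 0 and 0 for b == 0.
    """
    return 2 * b.bit_length()


# The step count depends on both arguments, the bound on b alone.
for a, b in [(1071, 462), (89, 55), (1000, 1)]:
    print(a, b, gcd_steps(a, b), log_step_bound(b))
```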
The levenshtein cost function for the xCORE processor includes a max function, making it a different type of formula from the Cortex-M's cost function. This occurs when a branch affects the upper bound of the function and the analysis is unable to resolve it statically, for example because the branch condition is data dependent. An example of this is shown in Figure 6. The analysis cannot statically ascertain the outcome of the A < B expression, so it simply returns the cost function as the maximum of the two possible branches:
where k1, ..., k6 are the costs of executing the respective basic blocks, as seen in Figure 5.2. The same effect causes max to appear in the xCORE's formula: there is a data dependent if statement in an inner loop of levenshtein.
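The way an unresolved branch introduces max into an upper bound can be sketched as follows. The control-flow shape (a loop whose body contains one if/else) and the block costs are hypothetical. They do not reproduce Figure 6 or the levenshtein formula.

```python
def if_else_upper_bound(cost_cond, cost_then, cost_else):
    """Upper bound for an if/else whose condition cannot be resolved
    statically: pay for the test, then assume the costlier branch."""
    return cost_cond + max(cost_then, cost_else)


def loop_upper_bound(n, cost_header, cost_cond, cost_then, cost_else):
    """Upper bound for n iterations of a loop containing that if/else."""
    per_iteration = cost_header + if_else_upper_bound(cost_cond, cost_then, cost_else)
    return n * per_iteration


# Hypothetical block costs in nJ.
print(loop_upper_bound(n=10, cost_header=1, cost_cond=2, cost_then=5, cost_else=3))
```

If the block costs are symbolic rather than numeric, the max cannot be simplified away, which is why it survives into the final formula.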
Composability
All of the benchmarks so far have consisted of relatively simple code, for which a single function is analyzed. However, the analysis can handle nesting and recursion, in the same way that it can handle functions with multiple basic blocks. In the code in Listing 2, the levenshtein and modified insertion sort functions are composed into a simple spell checker, sortbysimilarity: for a given target string, it sorts a list of strings by their similarity to that string.
In this listing, dictword_len is the maximum size of the strings in the dictionary. Inferring a cost formula for this program does not present any issues as long as it is possible to infer formulas for its constituent parts. Our techniques construct Cost Relations (CRs) from the program that is being analyzed. An important feature of CRs is their compositionality. This allows computing upper bounds of CRs composed of multiple relations by concentrating on one relation at a time. The process starts by computing upper bounds for cost relations which do not depend on any other relations, which we refer to as stand-alone cost relations, and continues by substituting the computed upper bounds into the equations that call such relations. For instance, for the above program levenshtein distance has an associated energy cost of
where A and B are the third and fourth arguments to the function. Our modified string sorting routine has a cost of:
These functions are systematically combined so that a cost for sortbysimilarity is computed. In this case it is 530ABC + 157AC + 346BC + 366C^2 + 629C + 210 nJ,
where A is word_len, B is dictword_len and C is n_strings.
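Once a closed-form cost function is available, estimating the energy for concrete input sizes is a direct evaluation. The sketch below encodes the formula given above. The function name, the parameter spellings and the example sizes are illustrative choices, not part of the original tool.

```python
def sortbysimilarity_energy_nj(word_len, dictword_len, n_strings):
    """Evaluate the inferred cost formula, in nJ, with A = word_len,
    B = dictword_len and C = n_strings as in the text."""
    a, b, c = word_len, dictword_len, n_strings
    return (530 * a * b * c + 157 * a * c + 346 * b * c
            + 366 * c ** 2 + 629 * c + 210)


# Hypothetical input sizes: an 8-character word checked against 100
# dictionary strings of at most 10 characters each.
estimate = sortbysimilarity_energy_nj(word_len=8, dictword_len=10, n_strings=100)
print(f"estimated energy: {estimate / 1e6:.3f} mJ")
```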
Related Work
Related work exists in four different areas: energy modeling of processors, mapping low-level program segments to higher level structures, static resource usage analysis and worst-case execution time (WCET) analysis.
Energy models of processors for program analysis require energy consumption data in relation to the program's instructions. This data can be collected by simulating the hardware at various levels, including semiconductor [17] and CMOS [4]. Alternatively, higher level representations may be used, such as the functional block level [26], which reflects the micro-architecture, direct measurement on a per-instruction basis [27], or profiling the energy consumption of commonly used software blocks [24]. Higher level data collection and modeling efforts are typically quicker to use once the data has been acquired, as there is less computational burden than in a low-level simulation. However, the accuracy may be lower, so a suitable trade-off must be found.
Although substantial effort has been devoted to ISA energy modeling, little work has been done for higher level program representations. This is mostly because precision decreases when moving further away from the hardware. One of the most pertinent recent works on LLVM IR energy modeling is [5]. The authors performed statistical analysis and characterization of LLVM IR code, together with instrumentation and execution on the host machine, to estimate the performance and energy requirements of embedded software. In their case, retargeting the LLVM IR energy model to a new platform requires performing the statistical analysis again. Our LLVM IR energy model takes into consideration types and other aspects of the instructions. Furthermore, our mapping technique requires only adjusting the LLVM mapping pass for the new architecture.
Static cost analysis techniques based on setting up and solving recurrence equations date back to Wegbreit's [31] seminal paper, and have been developed significantly in subsequent work [2, 6, 7, 20, 25, 29] . In [18] this approach is applied to inferring statically the energy consumption of Java programs as functions of input data sizes, by specializing a generic resource analyzer [10, 20] to Java bytecode analysis [19] . However, this work did not compare the results with measured energy consumptions. In [15] the approach is applied to the energy analysis of XC programs using ISA-level models [13] , and the results are compared to actual hardware measurements. Our analysis continues in this line of work but with a number of important differences. First, analysis is performed at the LLVM-IR level and we propose novel techniques for reflecting the ISA-level energy models at the LLVM-IR level. Also, instead of using a generic resource analyzer (requiring translating blocks to its Horn Clause-based input syntax) and delegating the generation of cost equations to it, we generate the equations directly from the LLVM-IR compiler representation, performing control flow simplifications, and reducing the number of variables modelled by the analysis mechanism. Finally, we study a larger set of benchmarks. There exist other approaches to cost analysis such as those using dependent types [11] , SMT solvers [3] , or size change abstraction [34] .
As discussed in Section 1, energy and time are often correlated to some degree. Techniques such as implicit path enumeration [14] are often used in worst-case execution time analysis of programs. In most cases, programs are assumed to be preprocessed such that no loops are present (e.g. using loop unrolling). Some approaches such as in [12] focus on statically predicting cache behavior. WCET analysis is concerned with getting an absolute worst-case timing for hard real-time systems. In practice, for energy consumption analysis we typically are more interested in average cases. Also, most WCET analysis approaches produce absolute timing figures. In our case, we infer energy formulae parameterized by the program's input.
Conclusion and Future Work
In this paper we have introduced an approach for estimating the energy consumption of programs based on the LLVM compiler framework. We have shown that this approach can be applied to multiple embedded languages (such as C or XC), compiled using optimization level O2 with different compilers (such as Clang or XCC). We have also validated this approach for multiple backends, via two target architectures: ARM Cortex-M3 and XMOS XS1-L. Our approach is validated by comparing the static analysis results to physical measurements taken from the hardware. The results on our benchmarks show that energy estimations using our technique are within 10% and 20% or better in the case of the xCORE and the Cortex-M processors, respectively.
Although the techniques discussed here were initially designed for single threaded programs, these can be adapted to multithreaded programs. For these programs, we also need to take the synchronization time into consideration. For example, the XC language has explicit constructs for thread communication using channels, and therefore the blocking communication between threads needs to be modeled. In order to do so, we can analyze the communication throughput of individual threads using techniques discussed in this paper. Using this information we can estimate the time between events happening on channels and hence the utilization of the processor. This, coupled with multi-threaded energy models as discussed in Section 4.1, can be used to analyze multithreaded programs.
An interesting direction is to further develop the assignment of energy to LLVM IR program segments. In particular, an LLVM IR energy model for the xCORE can be implemented by using the information gathered from the mapping technique together with statistical analysis. The mapping technique used for the xCORE can also be adapted for the ARM case. We aim to further develop our techniques so they can be applied and evaluated against other embedded processor architectures, such as MIPS, or other ARM variants.
Finally, the static analysis techniques can be improved further. Currently the biggest limitation is solving the cost relations. Cost relations could also be solved numerically. In some cases this can produce tighter upper bounds and enable us to analyze more complex programs. An implementation of this can be used when actual formulae are not required.
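As an illustration of numerical solving, a cost relation can be evaluated directly for a concrete input size instead of being converted to closed form. The relation used here, C(0) = k_base and C(n) = k_step + C(n // 2) for n > 0, and its constants are hypothetical and chosen only to keep the sketch short.

```python
from functools import lru_cache


def make_cost(k_base, k_step):
    """Build a numeric evaluator for the hypothetical cost relation
    C(0) = k_base, C(n) = k_step + C(n // 2) for n > 0."""
    @lru_cache(maxsize=None)
    def cost(n):
        if n == 0:
            return k_base
        return k_step + cost(n // 2)
    return cost


cost = make_cost(k_base=5, k_step=3)  # hypothetical constants, in nJ
for n in (0, 1, 8, 1000):
    print(n, cost(n))
```

Memoization keeps the evaluation cheap when the same sub-problems recur, but the result is a number for one input size rather than a formula parameterized by the input.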
