Optimizing the flash-RAM energy trade-off in deeply embedded systems by Pallister, James et al.
Optimizing the flash-RAM energy trade-off in deeply embedded systems
James Pallister∗1, Kerstin Eder†1 and Simon J. Hollis‡1
1University of Bristol
Abstract
Deeply embedded systems often have the tightest constraints
on energy consumption, requiring that they consume tiny
amounts of current and run on batteries for years. However,
they typically execute code directly from flash, instead of the
more energy efficient RAM. We implement a novel compiler
optimization that exploits the relative efficiency of RAM by
statically moving carefully selected basic blocks from flash to
RAM. Our technique uses integer linear programming, with
an energy cost model to select a good set of basic blocks to
place into RAM, without impacting stack or data storage.
We evaluate our optimization on a common ARM micro-
controller and succeed in reducing the average power con-
sumption by up to 41% and reducing energy consumption by
up to 22%, while increasing execution time. A case study is
presented, where an application executes code then sleeps
for a period of time. For this example we show that our opti-
mization could allow the application to run on battery for up
to 32% longer. We also show that for this scenario the total
application energy can be reduced, even if the optimization
increases the execution time of the code.
1. Introduction
Deeply embedded Systems on Chip (SoCs) are prevalent in
our portable devices. These SoCs are small microcontrollers
without caches, and they typically contain RAM and flash. Code is
executed directly from the flash, and the RAM is used for volatile
data storage (initialized on startup).
Many of these devices are low speed, allowing both flash
and RAM to be accessed in a single cycle. However, these
memories have very different energy consumption characteris-
tics. Reading from flash memory typically takes more power
than reading from RAM. This means that if code can be exe-
cuted from RAM instead of flash, there should be a marked
improvement in energy consumption.
Figure 1 compares a set of extremely simple programs run
in flash and RAM. These programs consist of 16 identical
instructions in a loop. This loop is placed in flash and then
in RAM, showing the difference in power consumption. The
power consumption is significantly lower when the code is
executing from RAM, except when the code in RAM also
accesses the flash (for example to load constant data), as seen
in the last bar of Figure 1. This motivates moving as much
code as possible into RAM.
∗james.pallister@bristol.ac.uk
†kerstin.eder@bristol.ac.uk
‡simon.hollis@bristol.ac.uk
Unfortunately, it is not economical to have equal amounts
of flash and RAM in deeply embedded systems; typically
there is an 8:1 (or higher) ratio of flash to RAM. Therefore,
not all of the code can be placed into RAM, especially as the
RAM must also contain the volatile data and stack. Only
the most effective parts of the code can be moved, to minimize
the energy consumption while still fitting within the
available amount of RAM.
A motivating example is given in Figure 2. This example
shows a function and its control flow graph, for which the
inner loop will be executed much more frequently than its
surrounding basic blocks (a basic block is a sequence of code
in which the control flow enters or exits only from the begin-
ning or end, respectively). It may not be possible to move
the entire function into RAM. However, it may be possible to
move just some basic blocks. Because the loop is executed
more frequently, the relative gain from moving it is larger. The
block following the loop is also moved into the RAM, as this
negates the need to add the long range branch to the inner
loop.
This problem has been tackled in a similar form by the
scratchpad memory community. Many of these techniques
are transferable, and a review of these techniques is given in
Section 2. In this paper, Integer Linear Programming (ILP) [7]
is used with a basic block cost model to find a set of basic
blocks which would benefit from being in RAM. Individual
basic blocks are statically placed into RAM, rather than full
functions. This allows better use to be made of the limited
RAM in these deeply embedded systems, by only placing
energy intensive basic blocks into RAM. The optimization is
applicable to any microcontroller with a unified address space
— the ability to transfer control flow into RAM is necessary to
implement the optimization.
[Figure 1 (bar chart): power in mW (axis from 0 to 16) against type of
instruction (RAM store, RAM load, add, nop, branch, flash load), with one
bar for execution from flash and one for execution from RAM.]
Figure 1: Average power for different instructions, when executing
out of flash and RAM.
arXiv:1406.0403v2 [cs.OH] 3 Jun 2014
[Figure 2 has three panels: the original code, the original compiled code
(all blocks in flash), and the optimized compiled code (blocks split
between flash and RAM).]

Original code:

    int fn(int k)
    {
      int i, x;
      x = 1;
      for(i = 0; i < 64; ++i)
      {
        x *= k;
      }
      if(x > 255)
        x = 255;
      return x;
    }

Original compiled code (all blocks in flash):

    init:
      mov r1, #1
      mov r0, #0
    loop:
      mul r1, r1, r2
      add r0, r0, #1
      cmp r0, #64
      bne loop
    if:
      cmp r1, #255
      ble return
    iftrue:
      mov r0, #255
    return:
      mov r0, r1
      bx  lr

Optimized compiled code (init, iftrue and return in flash; loop and if in RAM):

    init:                      (flash, instrumented)
      mov r1, #1
      mov r0, #0
      ldr pc, =loop
    loop:                      (RAM)
      mul r1, r1, r2
      add r0, r0, #1
      cmp r0, #64
      bne loop
    if:                        (RAM, instrumented)
      cmp   r1, #255
      it    le
      ldrle r5, =return
      ldrgt r5, =iftrue
      bx    r5
    iftrue:                    (flash)
      mov r0, #255
    return:                    (flash)
      mov r0, r1
      bx  lr

Annotations: additional long range branches are needed to jump between
memories. Instrumenting the if block instead of the loop block reduces the
overall energy/time.

Figure 2: An example of a function (left), its original compiled version (center), and the optimized version with basic blocks in
RAM (right). The given code is for the Cortex-M3.
In this paper, the optimization is implemented on a Cortex-M3
processor [22] that is instrumented for power measurement. This
processor is a commonly used microcontroller. It has a unified address
space, allowing instructions to be executed from either flash
or RAM. The SoC has 64KB of flash and 8KB of RAM.
This SoC is frequently used in low power applications, and
these applications could directly benefit from the techniques
introduced in this paper.
This paper makes the following contributions to the areas of
embedded energy efficiency and compiler optimization:
• The design of a novel optimization which analyzes the
program to find a set of basic blocks which should be trans-
ferred into RAM to improve the energy consumption. This
involves the construction of a model describing the costs
associated with moving this set of basic blocks into RAM.
The optimization rewrites the branches at the end of basic
blocks that need to jump between RAM and flash.
• The evaluation of this optimization on BEEBS [14], a bench-
mark suite designed to allow testing of embedded systems
for energy consumption. The model is evaluated by exam-
ining the solutions it selects, compared to a large sample of
possible solutions.
• A realistic case study of periodic sensing is presented,
where a device wakes from sleep to perform computation,
then returns to sleep. This presents the unintuitive result
that, even when the energy of the active region is not reduced
and the execution time is increased, the overall energy
consumption of the application can still be decreased.
In this scenario, the lower energy consumption and higher
execution time actually benefit the application, and the
device’s battery life is calculated to be extended by up to 32%.
In the following section, related work is discussed. Section 3
then presents the optimization’s general methodology. The
methodology is then discussed in more detail, first the
cost model (Section 4) and then the code transformation (Section 5).
Section 6 describes the tests and efficacy of the optimization,
and Section 7 presents a case study which examines
the optimization’s effectiveness in a real world situation.
Finally, the paper is concluded in Section 8.
2. Related work
The problem of moving parts of code and data from one mem-
ory to a faster memory has been studied extensively in the
context of scratchpad memory. Most studies focus on static
assignment of code and data to the scratchpad memory with
the aim of decreasing program execution time or energy con-
sumption. Steinke et al. [21] compare scratchpad memories
and caches, finding that a scratchpad memory can save up
to 43% of the energy consumption compared to a cache of
the same size. This is achieved by using Integer Linear
Programming (ILP) to minimize the cost of placing basic blocks
into memory. There have been many different formulations of
ILP problems, considering data objects [18], individual basic
blocks [27] and cache-awareness [25]. Ishitobi et al. [8] use
ILP in a system with both caches and scratchpad memory,
creating a model to decide whether the particular item should
be placed in either a cacheable or scratchpad memory region.
This reduces both energy consumption and execution time.
Other work on scratchpad memory has attempted to dy-
namically move objects into memory as they are needed [19].
This study identified which parts of the code should remain in
scratchpad memory and which parts should be brought in dy-
namically at specific locations through the program. Another
study [24] applies techniques developed for global register
allocation to scratchpad memories, reducing the energy con-
sumption by up to 34%. In both of these studies, the energy
saving is partly due to the decrease in execution time.
A different approach is taken by Kandemir et al. [9], where
Presburger formulae are used to minimize the number of trans-
fers between main memory and the scratchpad memory. This
technique manages to reduce the number of off-chip references
and memory energy consumption.
Sharing a scratchpad memory between multiple tasks has
been tackled in [6], by attempting to optimally pack the code
and data regions of different tasks into the scratchpad memory.
Many other scratchpad memory allocation schemes have
been proposed. A comprehensive review of these is given
in [28], including multiple scratchpad memories, and parti-
tioned memories.
For embedded systems with no scratchpad memory, but
just flash and RAM, there has been less extensive research.
Park et al. [16] attempt to minimize the amount of RAM
required for programs executing directly from NAND flash
by using a dynamic code overlay to execute the code from
RAM. Execution out of flash has been optimized using a page
manager to copy pages of flash into RAM at runtime, with
analysis support provided by the compiler [15]. Our study
instruments branches in a similar way, however in deeply
embedded systems both memories are single cycle access, so
the dynamic approach is not needed.
Software-level energy modeling was first discussed by Ti-
wari et al. [23]. This model consisted of an energy cost for
each instruction, an energy cost for each transition between
sequential instructions and a term for extra effects (such as
caches). This allowed the energy consumption to be estimated
from simulation traces, without the program having to be run on
physical hardware. The method was refined into a fine grained model also
accounting for the energy due to differing data in [20]. These
approaches have been further developed to allow architectural
exploration, with Wattch [3] and SimplePower [29]. More re-
cent studies have explored how these models can be extended
to processors with hardware multithreading [10].
Modeling energy consumption has been explored at the
function level of code [17]. This involves creating a ‘data bank’
of how much energy each function costs to run. These energy
figures can then be distributed with libraries, or combined with
instruction level modeling [2] to estimate a program’s energy
consumption.
Energy modeling has also been explored at a higher level,
by considering the average power of each state the processor
can be in [11, 12]. This requires less knowledge about the
exact instruction stream of the processor, simply the times
spent in each mode.
3. Methodology
Much of the previous work has focused on devices that have a
scratchpad memory operating several times faster than the
data or instruction memory. The systems targeted by this work
do not have a scratchpad memory, just a unified address space
with embedded flash and RAM. Both memories are single
cycle access, resulting in no performance gain if the code
is moved out of flash and into RAM. Instead, distributing code
across both flash and RAM causes a net performance loss;
however, it should save energy due
to RAM’s lower average power.
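Since energy is the product of average power and execution time, a slower program still saves energy as long as the power falls by a larger factor than the time grows. The following Python sketch states this relationship; the function names and the example ratios are illustrative assumptions rather than results from this paper.

```python
def energy_ratio(power_ratio: float, time_ratio: float) -> float:
    """Energy after / energy before.

    Energy = average power * execution time, so the two ratios multiply.
    """
    return power_ratio * time_ratio


def break_even_time_ratio(power_ratio: float) -> float:
    """Largest slowdown factor for which the energy does not increase."""
    return 1.0 / power_ratio


# Illustrative numbers only: 20% lower average power, 10% longer execution.
ratio = energy_ratio(power_ratio=0.80, time_ratio=1.10)
limit = break_even_time_ratio(0.80)
print(f"energy ratio: {ratio:.2f}, break-even slowdown: {limit:.2f}x")
```

A ratio below 1.0 means the transformed program uses less energy than the original despite running for longer.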
A cost model is created to describe the effect of moving
different regions of code from flash to RAM. This model con-
siders the energy consumption in a simplified way — an aver-
age power is assigned to each instruction, based on whether
the instruction executes out of RAM or flash. The cycle cost
of each basic block is modeled in more detail, since moving
code from flash to RAM results in the code taking additional
execution cycles. The overhead is integrated into the model,
enabling the developer to set a maximum slow down. This
allows multi-objective optimization, balancing the trade-off
between code size, energy and performance.
The total energy is minimized by an ILP solver. This
solver uses the cost model and parameters extracted from the
program to estimate which basic blocks of the
program are best placed into RAM.
The set of basic blocks to be placed in RAM is used to
transform the code, moving the correct basic blocks to a sec-
tion which is loaded to RAM. The basic blocks which jump
between memories are instrumented such that they have the
range to make the long jump.
The resultant code is run on physical processors. The ma-
jority of studies rely on models, which may not incorporate
all effects the hardware may have on the energy. Thus ac-
tual measurements are taken to verify the efficacy of the final
code. Effects such as position dependent energy consump-
tion in memory [13] and large variability between supposedly
identical processors [26] necessitate evaluation with real, in-
strumented hardware.
4. Formulating the ILP model
In this section a model and set of constraints are developed to
allow optimal selection of which parts of code are moved into
RAM. This model is heavily influenced by [21], extending
and improving it in the following ways:
• The cost of modifying the code in the basic block to branch
between memories is accurately described. This results in
the solver automatically ‘clustering’ small basic blocks into
RAM, which would otherwise have caused a large overhead
when branching between memories.
• The model is based on the number of cycles spent executing
from RAM and flash as the cost metric, instead of the
number of instructions. This is necessary since the Cortex-
M3 attempts to prefetch instructions as well as speculatively
fetching branch destinations [30]. Having the cycle count of
the code accounted for means that the overhead in execution
time will also be minimized.
• There is no need to consider data items in the model, since
the volatile data items are already in RAM (copied on
startup by the runtime). Constant data is still stored in
the flash, however this is typically accessed infrequently,
and mostly used for initialization.
There are a couple of factors which affect the efficacy of
placing a basic block in RAM. One factor is the overhead
of instrumenting the basic block to be able to jump to the
other memory. Each transition between flash and RAM must
be done with an indirect branch, rather than a typical direct
branch. Indirect branches allow the long ranging jumps be-
tween memory spaces to be made. However they typically
take longer to execute, or require other supporting instruc-
tions. The instrumentation overhead is discussed further in
Section 5. This instrumentation is only performed if one of
the basic block’s successors is in a different memory space —
the instrumentation is not needed if the block’s successors are
in the same memory.
The relative benefit of placing the block into RAM must
also be considered. It is more beneficial to place frequently
executed blocks in RAM, since the reduction in energy con-
sumption will be greater. However, it is also beneficial to
place blocks in RAM if it removes the need to instrument
a frequently executed block. It is for this reason that small
joining basic blocks between frequently executed loops may
be moved into RAM.
The rest of this section is divided into a discussion of the
parameters required by the model, and then how the model is
constructed from these parameters.
4.1. Parameters
Several parameters are given to the model, allowing a solver to
consider most of the factors which will affect the energy con-
sumption of the code. This section discusses the parameters
required by the model, and how they are calculated.
The following parameters are derived automatically from
the structure of the source code. The parameters differ for each
basic block. A diagram showing these parameters is given in
Figure 3.
Figure 3: Model parameters for a basic block, b.
Sb This parameter describes the size of the basic block, b,
in bytes.
Cb The number of cycles taken to execute the basic block.
This will always be a best estimate, due to complexi-
ties in the processor, such as fetch stalling and pipeline
flushes when conditional branches are taken.
Fb This is the ‘frequency’ of the basic block — the number
of times it is executed. This can be found by profiling
the application, or by statically analyzing the code.
A simple estimate of this parameter can be made by
considering the block’s loop-depth. Section 6
discusses how estimates of this parameter affect the
final solution, showing that a rough estimate is good
enough in most cases.
Kb The instrumentation cost of the basic block, in bytes.
This is the number of necessary extra bytes to instru-
ment the basic block with jumps between memory
spaces.
Tb The instrumentation cost of the basic block, in cycles.
This is the number of additional cycles executed when
the basic block is instrumented with jumps between
memory spaces.
Lb This is the number of additional cycles required when
the block is in RAM. This stems from contention on
the memory bus, and is proportional to the number of
load instructions in the basic block.
Succ(b) This specifies the set of basic blocks which are
immediate successors to the block, b.
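The loop-depth estimate of Fb mentioned above could look like the following Python sketch. The exact estimator is not specified here, so the assumption that every loop level runs a fixed number of iterations (10 by default) is purely illustrative, as is the function name.

```python
def estimate_frequency(loop_depth: int, iterations_per_loop: int = 10) -> int:
    """Static estimate of F_b from the loop depth of a basic block.

    Assumption (not from the paper): each enclosing loop is taken to run
    `iterations_per_loop` times, so the estimate grows geometrically
    with the nesting depth.
    """
    if loop_depth < 0:
        raise ValueError("loop depth cannot be negative")
    return iterations_per_loop ** loop_depth


# A block outside any loop, inside one loop, and inside two nested loops.
print([estimate_frequency(d) for d in range(3)])
```

Only the relative ordering of the blocks matters for the selection, which is one reason a rough estimate of this kind can be sufficient.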
The following parameters are specified by the developer.
Xlimit This is a ‘time factor’, indicating the maximum
overhead that should be allowed in the solution. For
example, setting Xlimit = 1.1 allows the solver to pick
a combination of blocks to go in RAM that should
take at most 10% longer to execute.
Rspare The maximum amount of RAM to use for code.
The solver can be restricted to using fewer bytes of
RAM, to fit within memory limits. This can also be de-
rived statically, by considering the size of the variables
in RAM, heap and the stack usage [4].
The following parameters are determined from the hardware.
Eflash A coefficient representing the energy cost of executing
out of flash. The average power when executing
instructions out of flash is assigned to this parameter.
Eram A coefficient representing the energy cost of executing
out of RAM. The average power when executing
instructions out of RAM is assigned to this parameter.
4.2. Auxiliary parameters/functions
The following sets are determined during the solving.
B The set of all basic blocks. This is extracted from the
control flow graph.
R This set is the set of basic blocks that are moved into
RAM.
I This set is the set of basic blocks that need to be instrumented,
since one of their successors is present in a
different memory. This set exists purely for convenience,
since it can be calculated from R alone.
Several other functions are based on the given parameters.
M(b) This function returns the memory energy cost, depending
on whether b is in RAM or not.
Oc(b) A function that returns the cycle overhead for the
basic block, b. If the block is not instrumented then
this is 0.
Or(b) This function returns the cycle overhead if the basic
block is in RAM. This occurs because of contention
on the memory bus when a load instruction is en-
countered.
Os(b) A function which returns the space overhead for the
basic block, b, when the block is instrumented. If
the block is not instrumented then this is 0 for this
particular basic block.
4.3. The model
The problem is formulated as a minimization of the total en-
ergy of the program, by finding a set of basic blocks, R, which
represents the set of basic blocks placed into RAM.
$$\text{Minimize:} \quad \sum_{b \in B} E(b), \tag{1}$$
where B is the set of all basic blocks. The energy of each basic
block can be determined by
$$E(b) = \left( C_b + O_c(b) + O_r(b) \right) \cdot M(b) \cdot F_b, \tag{2}$$
where Cb is the number of cycles the basic block takes to ex-
ecute, Oc(b) is the cycle overhead added to the basic block
(dependent on whether b is placed into RAM or not), M(b)
is an energy coefficient, describing the energy cost per cycle
of executing out of a particular type of memory. Fb is an exe-
cution frequency for that particular block, scaling the energy
cost of b by how many times it is executed. Or(b) is the cycle
overhead from a block being in RAM.
The cycle count, Cb, and the execution frequency, Fb, of
the block are input variables into the equation. In previous
works, the number of instructions is used in place of the cycles
per basic block [21]. However, under most circumstances this
processor (the Cortex-M3) will fetch every cycle, filling the
prefetch buffer during a multi-cycle instruction.
The memory energy cost of a basic block is easily deter-
mined — it is only dependent on whether the basic block is in
RAM or not:
$$M(b) = \begin{cases} E_{ram} & b \in R \\ E_{flash} & b \notin R, \end{cases} \tag{3}$$
where R is the set of basic blocks which are in RAM, and Eram
and Eflash are coefficients describing the energy cost of RAM
and flash respectively.
If the block needs to be instrumented, then the instrumenta-
tion overhead needs to be factored in,
$$O_c(b) = \begin{cases} T_b & b \in I \\ 0 & b \notin I, \end{cases} \tag{4}$$
where Tb is an instrumentation overhead for that block, in
cycles, and I is the set of basic blocks which need to be instru-
mented. A basic block needs to be instrumented if it and any
of its successors are not in the same memory space.
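$$\begin{aligned}
b \notin I &\quad \text{if } b \in R \text{ and } \forall x \in Succ(b) : x \in R \\
b \notin I &\quad \text{if } b \notin R \text{ and } \forall x \in Succ(b) : x \notin R \\
b \in I &\quad \text{otherwise,}
\end{aligned} \tag{5}$$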
where Succ(b) returns a set of all blocks which are immediate
successors to b, and can be extracted from the control flow
graph.
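Equation (5) can be read as a simple set computation. The Python sketch below derives I from R and the successor relation; the function name is hypothetical, and the successor map is transcribed from the control flow graph of the function in Figure 2.

```python
from typing import Dict, Set


def instrumented_blocks(succ: Dict[str, Set[str]], in_ram: Set[str]) -> Set[str]:
    """Return I: every block with at least one successor in the other memory."""
    return {
        b
        for b, successors in succ.items()
        if any((s in in_ram) != (b in in_ram) for s in successors)
    }


# Successor sets for the blocks of the function in Figure 2.
succ = {
    "init": {"loop"},
    "loop": {"loop", "if"},
    "if": {"iftrue", "return"},
    "iftrue": {"return"},
    "return": set(),
}

# Placing both "loop" and "if" in RAM leaves the hot loop uninstrumented.
print(sorted(instrumented_blocks(succ, in_ram={"loop", "if"})))
# Placing only "loop" in RAM forces the loop block itself to be instrumented.
print(sorted(instrumented_blocks(succ, in_ram={"loop"})))
```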
Although both the flash and the RAM are nominally single
cycle access, there are cases when the number of cycles can
differ. The most prevalent case is when executing a load from
RAM, while executing out of RAM. The processor stalls for
extra cycles in this case, due to contention for the RAM mem-
ory interface. This contention does not occur when executing
out of flash, due to there being a separate memory interface
for flash. The cycle overhead when executing from RAM is
given by:
$$O_r(b) = \begin{cases} L_b & b \in R \\ 0 & b \notin R, \end{cases} \tag{6}$$
where Lb is the number of stall cycles due to load instructions
being executed when in RAM (b ∈ R).
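Equations (2), (3), (4) and (6) combine into a short per-block calculation. The following Python sketch is one way to write it; the `Block` container, the function name and all numeric values are hypothetical, and whether a block is instrumented is passed in rather than derived.

```python
from dataclasses import dataclass


@dataclass
class Block:
    cycles: int        # C_b: cycles to execute the block once
    frequency: int     # F_b: number of times the block is executed
    instr_cycles: int  # T_b: extra cycles if the block is instrumented
    load_stalls: int   # L_b: extra stall cycles if the block is in RAM


def block_energy(blk: Block, in_ram: bool, instrumented: bool,
                 e_flash: float, e_ram: float) -> float:
    """Energy estimate E(b) for one basic block."""
    m = e_ram if in_ram else e_flash                # Equation (3)
    o_c = blk.instr_cycles if instrumented else 0   # Equation (4)
    o_r = blk.load_stalls if in_ram else 0          # Equation (6)
    return (blk.cycles + o_c + o_r) * m * blk.frequency  # Equation (2)


# Hypothetical loop body: 5 cycles, run 64 times, with one load stall in RAM.
loop = Block(cycles=5, frequency=64, instr_cycles=4, load_stalls=1)
in_flash = block_energy(loop, in_ram=False, instrumented=False, e_flash=15.0, e_ram=9.0)
in_ram = block_energy(loop, in_ram=True, instrumented=False, e_flash=15.0, e_ram=9.0)
print(in_flash, in_ram)
```

The result is in arbitrary units (coefficient times cycles), which is sufficient because only the comparison between placements matters to the solver.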
These definitions form the basis of the model; however,
additional constraints are added to ensure that the maximum
amount of spare RAM is not exceeded, and that the execution
time does not grow beyond the permitted limit. The constraint on
RAM usage is given by
$$\sum_{b \in R} \left( S_b + O_s(b) \right) \le R_{spare}, \tag{7}$$
where Rspare is the amount of ‘spare’ RAM which the program
does not use and Os(b) is the amount of RAM necessary
to instrument a block (bytes). In many cases for embedded
systems, the spare RAM can be determined statically, by stack
analysis. Otherwise, the RAM dedicated to program code will
be a design decision, similar to the amount of stack and heap
space to allocate to the program.
The size instrumentation cost of the basic block is similar
to Oc(b):
$$O_s(b) = \begin{cases} K_b & b \in I \\ 0 & b \notin I, \end{cases} \tag{8}$$
where Kb is the instrumentation cost of b in bytes.
The execution time constraint can be formulated in a similar
way by considering the number of cycles that each basic block
requires to execute. The ratio of the execution time with overhead
to the original execution time can be constrained, ensuring
that the execution time does not grow by more than Xlimit:
$$\frac{\sum_{b \in B} \left( \left( C_b + O_c(b) + O_r(b) \right) \cdot F_b \right)}{\sum_{b \in B} \left( C_b \cdot F_b \right)} \le X_{limit}, \tag{9}$$
where Cb is the number of cycles the block b takes to execute,
Oc(b) is the cycle overhead from block instrumentation, and
Or(b) is the cycle overhead from a block being in RAM.
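The two constraints can be checked for a given placement with a few lines of code. This Python sketch evaluates the left-hand sides of Equations (7) and (9); the function names and the per-block numbers are hypothetical, with block names borrowed from Figure 2.

```python
def ram_usage(size, instr_bytes, in_ram, instrumented):
    """Left-hand side of Equation (7): bytes of RAM used for code."""
    return sum(size[b] + (instr_bytes[b] if b in instrumented else 0)
               for b in in_ram)


def time_ratio(cycles, freq, instr_cycles, load_stalls, in_ram, instrumented):
    """Left-hand side of Equation (9): new cycle count / original cycle count."""
    original = sum(cycles[b] * freq[b] for b in cycles)
    with_overhead = sum(
        (cycles[b]
         + (instr_cycles[b] if b in instrumented else 0)   # O_c(b)
         + (load_stalls[b] if b in in_ram else 0))          # O_r(b)
        * freq[b]
        for b in cycles
    )
    return with_overhead / original


# Hypothetical parameters for the five blocks of Figure 2.
size = {"init": 4, "loop": 8, "if": 4, "iftrue": 2, "return": 4}          # S_b
cycles = {"init": 2, "loop": 5, "if": 3, "iftrue": 1, "return": 4}        # C_b
freq = {"init": 1, "loop": 64, "if": 1, "iftrue": 1, "return": 1}         # F_b
instr_bytes = {"init": 4, "loop": 6, "if": 6, "iftrue": 4, "return": 0}   # K_b
instr_cycles = {"init": 4, "loop": 4, "if": 4, "iftrue": 4, "return": 0}  # T_b
load_stalls = {b: 0 for b in size}                                        # L_b

in_ram, instrumented = {"loop", "if"}, {"init", "if"}
used = ram_usage(size, instr_bytes, in_ram, instrumented)
ratio = time_ratio(cycles, freq, instr_cycles, load_stalls, in_ram, instrumented)
print(used, round(ratio, 4), used <= 32 and ratio <= 1.1)
```

Note that the instrumentation bytes of a block that stays in flash (here `init`) do not count against the RAM budget, since the sum in Equation (7) runs over R only.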
This model can be linearized so that it can be solved with a
standard ILP solver. In this paper the GNU Linear Program-
ming Kit (GLPK) [5] is integrated into the optimization. This
solver returns the set of basic blocks that should be in RAM.
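One standard way to carry out such a linearization is sketched below. The binary variables r_b, i_b and z_b and the shorthand Δ are introduced here for illustration only; the paper does not state which linearization was actually used. Because instrumenting a block only ever adds cost, lower bounds on i_b are sufficient: the solver will set i_b to zero whenever the bounds allow it.

```latex
% Binary decision variables (introduced here for illustration):
%   r_b = 1 if b \in R,   i_b = 1 if b \in I,   z_b = r_b \cdot i_b
% Shorthand: \Delta = E_{ram} - E_{flash}
\begin{align*}
% Equation (5): b is instrumented if any successor is in the other memory
i_b &\ge r_b - r_s, \quad i_b \ge r_s - r_b && \forall s \in Succ(b) \\
% Standard linearization of the product of two binary variables
z_b &\le r_b, \quad z_b \le i_b, \quad z_b \ge r_b + i_b - 1 \\
% Equation (2), expanded using r_b^2 = r_b
E(b) &= F_b \left[ C_b E_{flash}
        + \left( C_b \Delta + L_b E_{ram} \right) r_b
        + T_b E_{flash} \, i_b + T_b \Delta \, z_b \right] \\
% Equation (7)
\sum_{b \in B} \left( S_b r_b + K_b z_b \right) &\le R_{spare} \\
% Equation (9), multiplied through by the original cycle count
\sum_{b \in B} \left( T_b i_b + L_b r_b \right) F_b
        &\le \left( X_{limit} - 1 \right) \sum_{b \in B} C_b F_b
\end{align*}
```

With these substitutions the objective and both constraints are linear in r_b, i_b and z_b, which is the form an ILP solver such as GLPK accepts.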
5. Code transformation
Once the basic blocks to be in RAM are chosen, the transfor-
mation can relocate these blocks and instrument the necessary
blocks. The actual transformation itself happens at the very
end of compilation. The parameters for the model are extracted
from the CFG and passed to the solver, and the basic blocks are
modified according to whether they are in RAM or not.
All basic blocks which are required to be in RAM are moved
into a custom section of the executable which is loaded into
RAM at start-up by the runtime.
The transformation also modifies the basic blocks, based on
whether they are required to jump between memory spaces or
not. In general, there are three basic forms the instrumentation
takes, each corresponding to the type of jump at the end of a
basic block:
Unconditional branch. If the branch at the end of the block
is unconditional this just needs to be exchanged to an indirect
branch, loading the target address from memory. This enables
a much larger branch range, necessary for the jump into RAM
or back into flash.
Conditional branch. A basic block with a conditional
branch at the end could transfer execution to two possible
locations. Both must be instrumented. If the branch is not
taken, the execution falls through into the following block.
Here, an indirect branch must be added, since the following
block may not be in the same memory space.
No branch. As with a conditional branch, if the following
basic block is not in the same memory space the transition
must be instrumented. This is simply an indirect branch.
The specific code changes made to instrument a basic block
for the Cortex-M3 are shown in Figure 4. This illustrates how
the code is modified for each type of basic block, along with
the space and time overheads for doing so. For the Thumb2
instruction set [1] used by the Cortex-M3 there is an additional
type of conditional branch which requires slightly different
instrumentation, due to it combining the comparison into the
instruction.
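The per-block instrumentation costs Tb and Kb can be derived from the code sequences in Figure 4. The Python sketch below assumes that each cost is the difference between the instrumented and the original sequence, and the pairing of the cycle and byte counts with each sequence is as read from Figure 4; both the table name and the function name are hypothetical.

```python
# (cycles, bytes) of the original and of the instrumented code sequence,
# as read from Figure 4 for the Cortex-M3.
FIGURE_4 = {
    "unconditional":     ((3, 2), (4, 4)),    # b label    -> ldr pc, =label
    "conditional":       ((3, 2), (7, 8)),    # bne label  -> it/ldr/ldr/bx
    "short_conditional": ((3, 2), (8, 10)),   # cbnz       -> cmp + it/ldr/ldr/bx
    "fall_through":      ((0, 0), (4, 4)),    # no branch  -> ldr pc, =label
}


def instrumentation_overhead(kind: str) -> tuple:
    """Return (T_b, K_b): the extra cycles and extra bytes for a block ending.

    Assumption: the overhead is the instrumented cost minus the original cost.
    """
    (cycles_before, bytes_before), (cycles_after, bytes_after) = FIGURE_4[kind]
    return cycles_after - cycles_before, bytes_after - bytes_before


for kind in FIGURE_4:
    print(kind, instrumentation_overhead(kind))
```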
6. Evaluation
The optimization was evaluated using BEEBS [14], a bench-
mark suite designed to analyze energy consumption char-
acteristics of the processor. The benchmark suite consists
of ten programs taken from different areas of embedded
applications. All of the benchmarks were measured on a
STM32VLDISCOVERY board instrumented with energy mea-
surement equipment.
The optimization was run on the set of benchmarks at the
O0, O1, O2, O3 and Os optimization levels (compiled with
GCC 4.8.2). Across all benchmarks and optimization levels,
the average reduction in energy and power is 7.7% and 21.9%
respectively. The execution time is increased by an average
of 19.5%. This indicates that the optimization is effective for
a wide range of benchmarks across different combinations of
optimizations.
The results of applying the optimization to this benchmark
suite at the O2 and Os optimization levels are shown in Figure 5.
This graph shows the percentage change in execution
time and energy consumption when compared to the program
without the optimization pass applied. Also shown are the results
when an actual basic block frequency is used, as opposed
to an estimate. Overall the optimization decreases
the energy significantly in many cases, with a reduction in
energy consumption of up to 22% (int_matmult,
O2). It is notable that this is successful despite the additional
instructions executed and the overall increase in execution
time.
In all cases, the average power is reduced. In some cases,
the reduction is significant: 41% reduction in the case of fdct
at O2 optimization level. This reduction in power is large
due to both the increase in execution time and the decrease
in energy consumption. As such, it occurs even when the
energy is not reduced significantly. Applications which must
not exceed a certain peak power will find this beneficial.

[Figure 4 lists, for each type of block ending, the original code sequence
and its instrumented replacement, with their costs.]

Unconditional branch
  Original (3 cycles, 2 bytes):
      b label
  Instrumented (4 cycles, 4 bytes):
      ldr pc, =label

Conditional branch
  Original (3 cycles, 2 bytes):
      bne label
    label2:
  Instrumented (7 cycles, 8 bytes):
      it    ne
      ldrne r5, =label
      ldreq r5, =label2
      bx    r5
    label2:

Short conditional branch
  Original (3 cycles, 2 bytes):
      cbnz r0, label
    label2:
  Instrumented (8 cycles, 10 bytes):
      cmp   r0, #0
      it    ne
      ldrne r5, =label
      ldreq r5, =label2
      bx    r5
    label2:

No branch (fall through)
  Original (0 cycles, 0 bytes):
      ...
    label:
  Instrumented (4 cycles, 4 bytes):
      ldr pc, =label
    label:

Figure 4: The transformations applied to the basic blocks which need to be instrumented (b ∈ I). The text gives the execution
time (cycles) and the size (bytes) of the code sequences.

[Figure 5 (bar chart): percentage change (axis from −30 to 50) in energy and
time for each benchmark (2dfir, blowfish, crc32, cubic, dijkstra, fdct,
float_matmult, int_matmult, rijndael, sha) and the average, at O2 and Os,
with dots marking the runs with actual basic block frequencies.]
Figure 5: Results for applying the optimization pass on the
BEEBS benchmark suite at different optimization levels.
Each pair of energy and time bars is a single run
of the benchmark. The dots indicate an additional
run with actual basic block frequencies as opposed
to the estimate.
Some of the benchmarks show very little improvement
(cubic, float_matmult). These benchmarks make heavy
use of library calls and emulated floating point calculations.
The library calls are statically linked into the final executable
and the optimization pass does not see these functions, so can-
not place them into RAM. This limitation could be removed if
the optimization pass was moved into the linker, allowing it to
operate on all emitted code.
In all of the cases, the results are very similar when the basic
block frequency is estimated, versus the actual frequencies.
This demonstrates that a static estimate is good enough to
achieve good results, without having to go through the lengthy
procedure of instrumenting the application for profiling, or
simulating it.
The choice of basic blocks to go into RAM made by the ILP
solver is not necessarily the optimal solution. However, it is
usually a good solution, out of the possible choices. Figure 6
shows the space of possible solutions ($2^k$ solutions, where k
is the number of basic blocks). This space shows the energy
usage and execution time of each solution, along with the
RAM usage of each solution (colored). The point marked
‘All blocks in flash’ is the base case, where no basic blocks
have been moved to RAM. The point marked ‘No RAM or
time constraint’ is the solution chosen when no constraints are
placed on the RAM usage, or the execution time overhead.
The dashed line shows the choices made by the solver as the
RAM constraint is relaxed. For the int_matmult benchmark
(Figure 6a), the solver identifies good solutions to reduce the
energy consumption, while avoiding clusters of low energy
but much higher execution time. Similarly, as the solver is
allowed to increase the overhead (thus increasing execution
time), solutions with lower energy consumption are found.
This is shown by the solid line. This graph has several clusters
because the benchmark has 3 basic blocks with a large size
and iteration count. There are $2^3 = 8$ combinations of these 3
basic blocks in RAM or flash, and each of these combinations
forms a separate cluster in the graph (the larger bottom cluster
is formed of two combinations).
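The size of this solution space can be illustrated with a small brute-force enumeration. The sketch below is not the ILP formulation used by the optimization: it assumes a simplified additive cost model, and the per-block sizes, energies and times are hypothetical values invented for illustration. It only shows how $k$ blocks give $2^k$ candidate placements and how a RAM limit restricts the choice.

```python
from itertools import product

# Hypothetical per-block figures, invented for illustration only:
# (size in bytes, energy in flash, energy in RAM, time in flash, time in RAM)
BLOCKS = [
    (64, 5.0, 3.5, 0.40, 0.50),
    (96, 4.0, 3.0, 0.30, 0.40),
    (120, 3.0, 2.4, 0.20, 0.24),
]


def evaluate(placement):
    """Return (energy, time, ram_bytes) for a tuple of booleans (True = in RAM)."""
    energy = time = ram = 0.0
    for in_ram, (size, e_flash, e_ram, t_flash, t_ram) in zip(placement, BLOCKS):
        energy += e_ram if in_ram else e_flash
        time += t_ram if in_ram else t_flash
        ram += size if in_ram else 0
    return energy, time, ram


def all_solutions():
    """Enumerate every one of the 2^k placements of the k blocks."""
    return [(p, *evaluate(p)) for p in product((False, True), repeat=len(BLOCKS))]


def best_under_ram_limit(ram_limit):
    """Pick the lowest-energy placement whose RAM usage fits in ram_limit bytes."""
    feasible = [s for s in all_solutions() if s[3] <= ram_limit]
    return min(feasible, key=lambda s: s[1])


solutions = all_solutions()
print(len(solutions))             # number of candidate placements for k = 3
print(best_under_ram_limit(0))    # base case: all blocks stay in flash
print(best_under_ram_limit(160))  # choice with a limited amount of spare RAM
```

Exhaustive enumeration is only practical for a handful of blocks, which is one reason a solver is used in practice.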
The graph for the fdct benchmark (Figure 6b) shows sim-
ilar effects, with more energy efficient, but slower solutions
being found as the constraints are relaxed. This graph only
has three clusters, because there are two large and similarly
sized basic blocks. When neither block is in RAM, the points
are clustered in the bottom right of the graph. When both are
in RAM, then the points are in the cluster at the top left. Oth-
erwise, when just one of the blocks is selected the solutions
are in the middle.
[Figure 6 plots: (a) int_matmult, (b) fdct. Axes: energy (mJ) and time (s); points colored by amount of RAM usage (bytes). Legend: All blocks in flash; No RAM or time constraint; Constraining RAM; Constraining time; Possible choices.]
Figure 6: These diagrams show the trade-off space for possible sets of basic blocks in RAM. Each point represents a possible
combination of basic blocks put into RAM, along with its energy, time and RAM required. The solid line shows the
solutions selected when changing the $R_{spare}$ parameter. The dashed line shows the solutions selected when changing
the $X_{limit}$ parameter.
[Figure 7 plots: power against time, with the active and idle regions marked.]
(a) The power profile of an application which periodically wakes to perform
computation.
(b) The power profile of the application after the optimization is applied.
Figure 7: Power profiles for an application before and after the
optimization is applied. The application periodically
executes some code (the active region) then waits in
a sleep state until the end of the period.
7. Case study
The optimization is particularly useful for some types of appli-
cations that these deeply embedded processors are typically
used for, such as periodic sensing. In this application, the
processor spends the majority of its time in a sleep mode, and
wakes infrequently to perform some processing.
In this section we consider the scenario where the processor
must wake up every T seconds, and transform a signal using
a Finite Discrete Cosine Transform (FDCT). The initial power
profile for this is shown in Figure 7a.
The total energy, $E$, for one period, $T$, of the application is
given below,
$$E = E_0 + P_S \cdot (T - T_A), \tag{10}$$
where $E_0$ is the energy consumed by the active region of code,
$T_A$ is the length of time spent in the active region (for one
time period, $T$) and $P_S$ is the average power of the sleep state
(quiescent power).
When the optimization is applied, the energy consumption
is:
$$E' = k_e \cdot E_0 + P_S \cdot (T - k_t T_A), \tag{11}$$
where $k_e$ and $k_t$ are factors describing how the optimization
affects the energy and time respectively. As such we expect
$k_e \le 1$ and $k_t \ge 1$ most of the time. This corresponds to the
optimization reducing the energy consumption while increasing
the execution time, as seen in the experimental results.
From $E$ and $E'$ we can calculate the energy saved by applying
the optimization to this case example. Here, $E_s$ is the
energy saved:
$$E_s = E - E' = E_0 (1 - k_e) + P_S T_A (k_t - 1). \tag{12}$$
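To make the step from Equations 10 and 11 to Equation 12 explicit, the subtraction can be expanded as below. The $P_S T$ terms cancel, so the energy saved per period does not depend on the period $T$ itself:

```latex
\begin{align*}
E_s &= E - E' \\
    &= \left[ E_0 + P_S (T - T_A) \right] - \left[ k_e E_0 + P_S (T - k_t T_A) \right] \\
    &= E_0 (1 - k_e) + P_S (k_t T_A - T_A) \\
    &= E_0 (1 - k_e) + P_S T_A (k_t - 1).
\end{align*}
```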
From this equation it can be seen that either decreasing $k_e$
(i.e. reducing the energy) or increasing $k_t$ (i.e. increasing the
execution time) will increase the amount of energy saved.
An overall energy saving can be achieved even if the energy
of the active region has not been reduced, in cases where the
energy is similar and the execution time is increased. This
phenomenon is shown in Figure 8. In this diagram the active
region where code is being executed requires the same amount
of energy, but takes a longer time to execute in the optimized
example. This leads to a reduction in the amount of time that
is spent in the sleep state, and an overall decrease in energy
consumption. Overall the energy is reduced from 60 µJ to
55 µJ in this illustration.
[Figure 8 diagram: Unoptimized: 10 mW active region for 5 ms, then 1 mW sleep for 10 ms. Optimized: 5 mW active region for 10 ms, then 1 mW sleep for 5 ms. The active regions have the same energy.]
Figure 8: Diagram illustrating how keeping the energy consumption
of the active region constant while the execution time
increases affects the total energy. The beginning region
is the active code, which is optimized.
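The totals quoted for this illustration can be checked with a few lines of arithmetic. The sketch below assumes the labels in Figure 8 read as follows: an unoptimized active region of 10 mW for 5 ms, an optimized active region of 5 mW for 10 ms, a 1 mW sleep state, and a 15 ms period in both cases. The function name is illustrative only.

```python
def period_energy_uj(active_mw, active_ms, sleep_mw, period_ms):
    """Energy for one period in microjoules (mW * ms = uJ)."""
    sleep_ms = period_ms - active_ms
    return active_mw * active_ms + sleep_mw * sleep_ms


# Values read from the Figure 8 labels (an assumption, see above).
unoptimized = period_energy_uj(active_mw=10, active_ms=5, sleep_mw=1, period_ms=15)
optimized = period_energy_uj(active_mw=5, active_ms=10, sleep_mw=1, period_ms=15)

print(unoptimized, optimized)  # 60 55
```

Both active regions cost 50 µJ; the 5 µJ difference comes entirely from the shorter sleep interval.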
For certain benchmarks the energy saved is low (see Section 6)
compared to the increase in execution time. However,
if such a benchmark were used in a real application that sleeps
after executing the function, this would still result in an overall
reduction in energy consumption. The sleep power consumption of
the SoC used to prototype this optimization (STM32F103RB)
was measured at $P_S = 3.5$ mW. The values of the energy and
time for fdct, as seen in Figure 5, are:
$$E_0 = 16.9\ \text{mJ}, \quad T_A = 1.18\ \text{s}, \quad k_e = 0.825, \quad k_t = 1.33. \tag{13}$$
Substituting these values into Equation 12 gives a total
energy saved of $E_s = 4.32$ mJ.
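The substitution can be reproduced directly. The following sketch evaluates Equation 12 with the fdct values given above; the function and parameter names are illustrative, and the units are chosen so that mW multiplied by s gives mJ.

```python
def energy_saved_mj(e0_mj, ta_s, ke, kt, ps_mw):
    """Equation 12: E_s = E_0 (1 - k_e) + P_S T_A (k_t - 1), in mJ."""
    active_saving = e0_mj * (1 - ke)        # saving inside the active region
    sleep_saving = ps_mw * ta_s * (kt - 1)  # saving from the shortened sleep period
    return active_saving + sleep_saving


es = energy_saved_mj(e0_mj=16.9, ta_s=1.18, ke=0.825, kt=1.33, ps_mw=3.5)
print(f"{es:.2f} mJ")  # 4.32 mJ
```

Of this total, roughly 2.96 mJ comes from the active region and 1.36 mJ from the reduced time spent in the sleep state.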
The energy saving as a percentage of total energy consumption
cannot be calculated without knowing the value of the
time period, $T$, which varies from application to application.
Figure 9 shows how $T$ affects the final energy consumption of
several benchmarks. Intuitively, shorter time periods benefit more
from the savings achieved in the active region,
since the active region then accounts for a larger share of the
processor’s activity.
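The dependence on the period can be sketched by evaluating the ratio of Equation 11 to Equation 10 at multiples of the active time. The code below uses the fdct values quoted earlier; the function name and the choice of sample periods are assumptions made for illustration, not part of the original evaluation.

```python
def energy_ratio(period_s, e0_mj, ta_s, ke, kt, ps_mw):
    """Return E'/E for one period, using Equations 10 and 11 (mW * s = mJ)."""
    e = e0_mj + ps_mw * (period_s - ta_s)
    e_opt = ke * e0_mj + ps_mw * (period_s - kt * ta_s)
    return e_opt / e


# fdct values from the text: E_0, T_A, k_e, k_t and the measured sleep power P_S.
FDCT = dict(e0_mj=16.9, ta_s=1.18, ke=0.825, kt=1.33, ps_mw=3.5)

# Sample the period at multiples of the active time T_A.
ratios = [energy_ratio(n * FDCT["ta_s"], **FDCT) for n in range(1, 11)]
for n, ratio in enumerate(ratios, start=1):
    print(f"T = {n:2d} * T_A: E'/E = {ratio:.3f}")
```

For these values the ratio is roughly 0.74 at $T = T_A$ and rises towards 1 as $T$ grows, because the fixed saving $E_s$ is divided by an ever larger total energy.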
Figure 9 also shows other benchmarks used as the active
region in this scenario. In the results graph for the optimization
(Figure 5), 2dfir did not achieve any significant energy saving
but the execution time was increased. This can still result in
an overall reduction in energy consumption, as seen in
this scenario, although the improvement becomes minimal as
$T$ increases.
The optimization is very beneficial for this class of applica-
tions, providing up to 25% reduction in energy consumption.
This leads to up to 32% longer battery life for devices which
perform periodic sensing.
[Figure 9 plot: energy consumption (%) against overall period, T (s), for fdct, int_matmult and 2dfir.]
Figure 9: The proportion of energy consumed after the optimization has
been applied, for different time periods, $T$. The
points indicate multiples in time of the active region,
i.e. the first point is for $T = T_A$ (no sleep period) and
the second point is for $T = 2 \cdot T_A$ (with $T_A$ s of sleep
time).
8. Conclusion
An optimization is presented that exploits the spare RAM
often available in deeply embedded SoCs. The optimization
inputs parameters into an energy cost model. This model
is minimized using integer linear programming, resulting in
a set of basic blocks that should be moved into RAM, to
exploit its low power consumption. The code transformation is
performed post-compilation, moving the required basic blocks
into RAM and modifying the necessary branches to ensure the
flow of execution is preserved. Significant energy savings are
achieved, comparable to previous scratchpad memory studies,
despite the more restrictive low power environment.
The optimization is evaluated over all major optimization
levels (using GCC), with a set of representative benchmarks.
On average it reduces energy consumption by 7.7% and power
dissipation by 21.9%. Much higher energy and power savings
are seen (up to 22% and 41% respectively) in some cases. In
particular, the savings for some of the benchmarks are limited
by the current implementation of the optimization, rather than
the technique itself. If the optimization were given visibility of
all the code (such as library and compiler intrinsic calls), a
greater reduction would be expected.
The techniques implemented result in a trade-off between
execution time and energy consumption, since moving the
code to RAM necessarily results in an overhead. The execu-
tion time increased by an average of 19.5% on our platform.
Despite this, the code relocation manages to reduce the overall
energy consumption. The increase in execution time is not
problematic for a large class of applications which run on
deeply embedded SoCs. A case study of periodic sensing,
in which the processor periodically performs a computation and
then returns to sleep, was presented. The combination of
a reduction in energy consumption and increase in execution
time was shown to be beneficial, resulting in larger energy
consumption savings (up to 25%) than the savings achieved
without the context of the application (22%). As a result, the
battery life of the device can be extended by up to 32%.
Future work
The current prototype of this optimization cannot operate on
library code and other code which is only introduced at link time.
The optimization could be moved into the linker, allowing it to
have a full view of the program. This should enable library
code to be moved into RAM as well, improving the results
since all basic blocks become candidates for RAM and the
entire structure of the program is accounted for.
References
[1] ARM Limited, “ARMv7-M Architecture Reference Manual,” 2010.
[2] H. Blume, D. Becker, M. Botteck, J. Brakensiek, and T. G.
Noll, “Hybrid functional and instruction level power modeling for
embedded processors,” Embedded Computer Systems: Architectures,
Modeling, and Simulation, pp. 216–226, 2006. [Online]. Available:
http://www.springerlink.com/index/n0767t2911665588.pdf
[3] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: a framework for
architectural-level power analysis and optimizations,” Proceedings of
the 27th Annual International Symposium on Computer Architecture,
2000. [Online]. Available: http://dl.acm.org/citation.cfm?id=339657
[4] D. Brylow, N. Damgaard, and J. Palsberg, “Static checking of
interrupt-driven software,” in Proceedings of the 23rd International
Conference on Software Engineering. ICSE 2001. IEEE Comput.
Soc, 2001, pp. 47–56. [Online]. Available: http://ieeexplore.ieee.org/
lpdocs/epic03/wrapper.htm?arnumber=919080
[5] Free Software Foundation, “GNU Linear Programming Kit, Version
4.52,” 2014. [Online]. Available: http://www.gnu.org/software/glpk/
glpk.html
[6] L. Gauthier, T. Ishihara, H. Takase, H. Tomiyama, and H. Takada,
“Minimizing Inter-Task Interferences in Scratch-Pad Memory Usage
for Reducing the Energy Consumption of Multi-Task Systems,” in
Proceedings of the 2010 international conference on Compilers, Archi-
tectures and Synthesis for Embedded Systems. ACM, 2010.
[7] K. L. Hoffman and T. K. Ralphs, “Integer and Combinatorial Opti-
mization,” in Encyclopedia of Operations Research and Management
Science, 2013, pp. 771–783.
[8] Y. Ishitobi, T. Ishihara, and H. Yasuura, “Code and Data
Placement for Embedded Processors with Scratchpad and Cache
Memories,” Journal of Signal Processing Systems, vol. 60,
no. 2, pp. 211–224, Nov. 2008. [Online]. Available: http:
//www.springerlink.com/index/10.1007/s11265-008-0306-3
[9] M. Kandemir, I. Kadayif, and U. Sezer, “Exploiting scratch-pad
memory using Presburger formulas,” Proceedings of the 14th
international symposium on Systems synthesis - ISSS ’01, p. 7, 2001.
[Online]. Available: http://portal.acm.org/citation.cfm?doid=500001.
500004
[10] S. Kerrison and K. Eder, “Energy modelling and optimisation of soft-
ware for a hardware multi-threaded embedded microprocessor,” Uni-
versity of Bristol, Bristol, Tech. Rep., 2013.
[11] A. W. Min, R. Wang, J. Tsai, M. a. Ergin, and T.-Y. C.
Tai, “Improving energy efficiency for mobile platforms by
exploiting low-power sleep states,” in Proceedings of the 9th
conference on Computing Frontiers - CF ’12. New York, New
York, USA: ACM Press, 2012, p. 133. [Online]. Available:
http://dl.acm.org/citation.cfm?doid=2212908.2212928
[12] J. Nunez-Yanez and G. Lore, “Enabling accurate modeling of
power and energy consumption in an ARM-based System-on-Chip,”
Microprocessors and Microsystems, vol. 37, no. 3, pp. 319–332, May
2013. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/
S0141933113000021
[13] J. Pallister, K. Eder, S. J. Hollis, and J. Bennett, “A high-level model of
embedded flash energy consumption,” Apr. 2014. [Online]. Available:
http://arxiv.org/abs/1404.1602
[14] J. Pallister, S. Hollis, and J. Bennett, “BEEBS: Open Benchmarks
for Energy Measurements on Embedded Platforms,” 2013. [Online].
Available: http://arxiv.org/abs/1308.5174
[15] C. Park, J. Lim, K. Kwon, J. Lee, and S. L. Min, “Compiler-
assisted demand paging for embedded systems with flash memory,”
in Proceedings of the fourth ACM international conference
on Embedded software - EMSOFT ’04. New York, New
York, USA: ACM Press, 2004, p. 114. [Online]. Available:
http://portal.acm.org/citation.cfm?doid=1017753.1017775
[16] H.-w. Park, S. Park, and M.-m. Sim, “Dynamic Code Overlay of SDF-
Modeled Programs on Low-end Embedded Systems,” Proceedings of
the Design Automation & Test in Europe Conference, pp. 1–2, 2006.
[Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.
htm?arnumber=1657026
[17] G. Qu, N. Kawabe, K. Usami, and M. Potkonjak, “Function-
level power estimation methodology for microprocessors,” in
Proceedings of the 37th conference on Design automation - DAC
’00. New York, New York, USA: ACM Press, 2000, pp. 810–813.
[Online]. Available: http://dl.acm.org/citation.cfm?id=337786http:
//portal.acm.org/citation.cfm?doid=337292.337786
[18] J. Sjödin, B. Fröderberg, and T. Lindgren, “Allocation of Global Data
Objects in On-Chip RAM,” in Proc. Workshop on Compiler and Ar-
chitectural Support for Embedded Computer Systems, Washington DC,
1998.
[19] S. Steinke, N. Grunwald, L. Wehmeyer, R. Banakar, M. Balakrishnan,
and P. Marwedel, “Reducing Energy Consumption by Dynamic Copy-
ing of Instructions onto Onchip Memory,” in International Symposium
on Systems Synthesis. Kyoto, Japan: ACM, 2002.
[20] S. Steinke, M. Knauer, L. Wehmeyer, and P. Marwedel, “An
accurate and fine grain instruction-level energy model supporting
software optimizations,” in Proceedings of PATMOS, 2001. [Online].
Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.
21.6971&rep=rep1&type=pdf
[21] S. Steinke, L. Wehmeyer, and P. Marwedel, “Assigning program and
data objects to scratchpad for energy reduction,” in Proceedings 2002
Design, Automation and Test in Europe Conference and Exhibition.
IEEE Comput. Soc, 2002, pp. 409–415. [Online]. Available: http:
//ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=998306
[22] STMicroelectronics, “STM32F100XX Reference Manual,” 2011.
[23] V. Tiwari, S. Malik, A. Wolfe, and M. Tien-Chien Lee,
“Instruction level power analysis and optimization of software,”
Journal of VLSI Signal Processing Systems for Signal, Image,
and Video Technology, vol. 13, no. 2-3, pp. 223–238, 1996.
[Online]. Available: http://www.springerlink.com/index/10.1007/
BF01130407http://link.springer.com/10.1007/BF01130407
[24] M. Verma, L. Wehmeyer, and P. Marwedel, “Dynamic Overlay of
Scratchpad Memory for Energy Minimization,” in International Con-
ference on Hardware/software Codesign and System Synthesis. ACM,
2004.
[25] M. Verma, L. Wehmeyer, and P. Marwedel, “Cache-Aware
Scratchpad-Allocation Algorithms for Energy-Constrained Embedded
Systems,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 25, no. 10, pp. 2035–2051, Oct. 2006.
[Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.
htm?arnumber=1677689
[26] L. Wanner, C. Apte, and R. Balani, “A case for opportunistic
embedded sensing in presence of hardware power variability,” in
Proceedings of the 2010 international conference on Power aware
computing and systems., 2010. [Online]. Available: https://www.
usenix.org/legacy/event/hotpower/tech/full_papers/Wanner.pdf
[27] L. Wehmeyer, U. Helmig, and P. Marwedel, “Compiler-optimized
usage of partitioned memories,” in Proceedings of the 3rd workshop on
Memory performance issues in conjunction with the 31st international
symposium on computer architecture - WMPI ’04. New York, New
York, USA: ACM Press, 2004, pp. 114–120. [Online]. Available:
http://portal.acm.org/citation.cfm?doid=1054943.1054959
[28] L. Wehmeyer and P. Marwedel, “Scratchpad Memory Optimizations,”
in Fast , Efficient and Predictable Memory Accesses: Optimization
Algorithms for Memory Architecture Compilation, 1st ed. Dordrecht,
The Netherlands: Springer Netherlands, 2006, ch. 4, pp. 89—-169.
[29] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “The Design
and Use of SimplePower : A Cycle-Accurate Energy Estimation Tool,”
in 37th Conference on Design Automation (DAC’00), 2000, pp. 340–
345.
[30] J. Yiu, The Definitive Guide to the ARM Cortex-M3, 2nd ed. Newnes,
2010.
10
