Less is more: exploiting the standard compiler optimization levels for better performance and energy consumption by Georgiou, Kyriakos et al.
                          Georgiou, K., Blackmore, C., Xavier De Souza, S., & Eder, K. (2018). Less
is more: exploiting the standard compiler optimization levels for better
performance and energy consumption. In 21st International Workshop on
Software and Compilers for Embedded Systems (SCOPES 2018) Association
for Computing Machinery (ACM). https://doi.org/10.1145/3207719.3207727
Peer reviewed version
This is the author accepted manuscript (AAM). The final published version (version of record) is available online
via ACM at https://dl.acm.org/citation.cfm?doid=3207719.3207727 . Please refer to any applicable terms of use
of the publisher.
Less is More: Exploiting the Standard Compiler
Optimization Levels for Better Performance and
Energy Consumption
Kyriakos Georgiou1, Craig Blackmore1, Samuel Xavier-de-Souza2, Kerstin
Eder1
1 University of Bristol, UK
2 Universidade Federal do Rio Grande do Norte, Brazil
Abstract. This paper presents the interesting observation that by per-
forming fewer of the optimizations available in a standard compiler opti-
mization level such as -O2, while preserving their original ordering, sig-
nificant savings can be achieved in both execution time and energy con-
sumption. This observation has been validated on two embedded proces-
sors, namely the ARM Cortex-M0 and the ARM Cortex-M3, using two
different versions of the LLVM compilation framework; v3.8 and v5.0.
Experimental evaluation with 71 embedded benchmarks demonstrated
performance gains for at least half of the benchmarks for both proces-
sors. An average execution time reduction of 2.4% and 5.3% was achieved
across all the benchmarks for the Cortex-M0 and Cortex-M3 processors,
respectively, with execution time improvements ranging from 1% up to
90% over -O2. The savings that can be achieved are in the same range
as what can be achieved by the state-of-the-art compilation approaches
that use iterative compilation or machine learning to select flags or to determine
phase orderings that result in more efficient code. In contrast to
these time-consuming and expensive techniques, our approach
only needs to test a limited number of optimization configurations, at most
64, to obtain similar or even better savings. Furthermore, our approach
can support multi-criteria optimization as it targets execution
time, energy consumption and code size at the same time.
1 Introduction
Compilers were introduced to abstract away the ever-increasing complexity of
hardware and improve software development productivity. At the same time,
compiler developers face a hard challenge: producing optimized code. A modern
compiler supports a large number of architectures and programming languages
and it is used for a vast diversity of applications. Thus, tuning the compiler
optimizations to perform well across all possible applications is impractical. The
task is even harder as compilers need to adapt to rapid advancements in hardware
and programming languages.
Modern compilers adopted two main practices to mitigate the problem and
find a good balance between the effort needed to develop compilers and their
effectiveness in optimizing code. The first approach is the splitting of the compi-
lation process into distinct phases. Modern compilers such as those based on the
LLVM compilation framework [Lat02], allow for a common optimizer that can
be used by any architecture and programming language. This is made possible
by the use of an Intermediate Representation (IR) language on which optimiza-
tions are applied. Then a front-end framework is provided to allow programming
languages to be translated into the IR, and a back-end framework exists that
allows the IR to be translated into specific instruction set architectures (ISA).
Therefore, to take advantage of the common optimizer one only needs to cre-
ate a new front-end for a programming language and a new back-end for an
architecture.
The second practice is the use of standard optimization levels, typically -O0,
-O1, -O2, -O3 and -Os. Most modern compilers expose a large number of transformations
to software developers via compiler flags; for example,
LLVM's optimizer has 56 documented transformations [LLV18]. A software developer
using a compiler faces two major challenges: first, selecting
the right set of transformations, and second, ordering the chosen transformations
in a meaningful way, which is known as the compiler phase-ordering problem. The common
objective is to achieve the best resource usage based on the application's requirements.
To address this, each standard optimization level offers a predefined
sequence of optimizations, which has been shown to perform well on a number
of micro-benchmarks and a range of architectures. For example, for the LLVM
compilation framework, starting from the -O0 level, which has no optimizations
enabled, and moving to -O3, each level offers more aggressive optimizations with
the main focus being performance, while -Os is focused on optimizing code size.
Code size is critical for embedded applications with a limited amount of memory
available. Furthermore, the optimization sequences defined for each level encap-
sulate the accumulated empirical knowledge of compiler engineers over the years.
For example, some optimizations depend on other code transformations being
applied first, and some optimizations offer more opportunities for other opti-
mizations. Note that a code transformation is not necessarily an optimization,
but instead, it can facilitate an IR structure which enables the application of
other optimizations. Thus, a code transformation does not always lead to better
performance.
Although standard optimization levels are a good starting point, they are
far from optimal in many cases, depending on the application and architecture
used. An optimization configuration is a sequence of ordered flags. Due to the
huge number of possible flag combinations and their possible orderings, it is
impractical to explore the whole optimization-configuration space. Thus, finding
optimal optimization configurations is still an open challenge. To tackle this
issue, iterative compilation and machine-learning techniques have been used to
find good optimization sequences by exploiting only a fraction of the optimization
space [AKC+18]. Techniques involving iterative compilation are expensive since
typically a large number of optimization configurations, in the order of hundreds
to thousands, need to be exercised before reaching any performance gains over
standard optimization levels. On the other hand, machine learning approaches
require an extensive training phase and are hard to port across compilers and
architectures.
This paper takes a different approach. Instead of trying to explore a fraction
of the whole optimization space, we focus on exploiting the existing
optimization levels. For example, using the optimization flags included in the
-O2 optimization level as a starting point, a new optimization configuration is
generated each time by removing the last transformation flag of the current op-
timization configuration. In this way, each new configuration is a subsequence of
the -O2 configuration, that preserves the ordering of flags in the original opti-
mization level. Thus, each new optimization configuration stops the optimization
earlier than the previously generated configuration did. This approach aims to
preserve the empirical knowledge built into the ordering of flags for the standard
optimization levels. The advantages of using this technique are:
– The architecture and the compiler are treated as a black box, and thus, this
technique is easy to port across different compilers or versions of the same
compiler, and different architectures. To demonstrate this we applied our
approach to two embedded architectures (Arm Cortex-M0 and Cortex-M3)
and two versions of the LLVM compilation framework (v3.8 and v5.0);
– An expensive training phase similar to the ones needed by the machine
learning approaches is not required;
– The empirical knowledge built into the existing optimization levels by the
compiler engineers is being preserved;
– In contrast to machine-learning approaches and random iterative compila-
tion [BKK+98], which permit reordering transformation passes, our tech-
nique retains the original order of the transformation passes. Reordering can
break the compilation or create a malfunctioning executable;
– In contrast to the majority of machine-learning approaches, which are often
opaque, our technique provides valuable insights to the software engineer on
how each optimization flag affects the resource of interest;
– Because energy consumption, execution time and code size of each optimiza-
tion configuration are being monitored during compilation, multi-criteria
optimizations are possible without needing to train a new model for each
resource.
Our experimental evaluation demonstrates an average of 2.4% and 5.3% exe-
cution time improvement for the Cortex-M0 and Cortex-M3 processors, respec-
tively. Similar savings were achieved for energy consumption. These results are
in the range of what existing complex machine-learning or time-consuming
iterative-compilation approaches can offer on the same embedded
processors [BRE15, PHB15].
The rest of the paper is organized as follows. Section 2 gives an overview
of the compilation and analysis methodology used. Our experimental evaluation
methodology, benchmarks and results are presented and discussed in Section 3.
Section 4 critically reviews previous work related to ours. Finally, Section 5
concludes the paper and outlines opportunities for future work.
[Figure 1: flowchart with the elements Programs, Clang Front-End, LLVM Optimizer, Generate Optimization Config., LLVM Back-End, Executable, Resource Usage Measurement, Results, Finished? (Yes/No), Selection and Best Config., connected by control and data flows.]
Fig. 1: Compilation and evaluation process.
2 Compilation and Analysis
As the primary focus of this work is deeply embedded systems, we demonstrate
the portability of our technique across different architectures by exploring two
of the most popular embedded processors: the Arm Cortex-M0 [ARM18a] and
the Arm Cortex-M3 [ARM18b]. Although the two architectures belong to the
same family, they have significant differences in terms of performance and power
consumption characteristics [ARM18c]. The technique treats an architecture
as a black box, as no resource models (e.g. energy-consumption or
execution-time models) are required. Instead, physical measurements of execution time
and energy consumption are used to assess the effectiveness of a new optimization
configuration on a program.
To demonstrate the portability of the technique across different compiler
versions, the analysis for the Cortex-M0 processor was performed using the
LLVM compilation framework v3.8, and for the Cortex-M3 using the LLVM
compilation framework v5.0. The technique treats the compiler as a black box
since it only uses the compilation framework to exercise the different optimization-
configuration scenarios, extracted from a predefined optimization level, on a par-
ticular program. In contrast, machine-learning-based techniques typically require
a heavy training phase for each new compiler version or when a new optimization
flag is introduced [ABP+17, BRE15].
Figure 1 demonstrates the process used to evaluate the effectiveness of the
different optimization configurations explored. Each configuration is a set of or-
dered flags used to drive the analysis and transformation passes by the LLVM
optimizer. An analysis pass can identify properties and expose optimization op-
portunities that can later be used by transformation passes to perform optimiza-
tions. A standard optimization level (-O1, -O2, -O3, -Os, -Oz) can be selected
as the starting point. Each optimization level represents a list of optimization
flags which have a predefined order. Their order influences the order in which
the transformation / optimization and analysis passes will be applied to the code
under compilation. A new flag configuration is obtained by excluding the last
transformation flag from the current list of flags. The new optimization configuration
is then applied to the unoptimized intermediate representation (IR)
of the program, obtained from the Clang front-end. Note that the program’s un-
optimized IR only needs to be generated once by the Clang front-end; it can then
be used throughout the exploration process thus saving compilation time. The
optimized IR is then passed to the LLVM back-end and linker to generate the
executable for the architecture under consideration. Note that both the back-end
and linker are always called using the optimization level selected for exploration;
in our case -O2. The executable’s energy consumption, execution time and code
size are measured and stored. The exploration process finishes when the current
list of transformation flags is empty. This is equivalent to optimization level -O0,
where no optimizations are applied by the optimizer. Then, depending on the
resource requirements, the best flag configuration is selected.
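The configuration-generation step described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `prefix_configurations`, the `is_transformation` predicate and the toy flag sequence are assumptions made for the example, and the toy sequence is not the real -O2 ordering. Each yielded configuration is a prefix of the original sequence that ends at a transformation flag, and the final, empty configuration corresponds to -O0.

```python
from typing import Callable, Iterator, List, Sequence


def prefix_configurations(
    passes: Sequence[str],
    is_transformation: Callable[[str], bool],
) -> Iterator[List[str]]:
    """Yield prefixes of `passes`, from the longest to the empty one.

    Every non-empty prefix ends at a transformation flag, so removing the
    last transformation also removes the analysis flags that followed the
    previous cut point. The original ordering is never changed.
    """
    # A cut point is the prefix length that ends right after a transformation.
    cut_points = [i + 1 for i, p in enumerate(passes) if is_transformation(p)]
    for end in reversed(cut_points):
        yield list(passes[:end])
    # No transformation flags left: equivalent to -O0 for the optimizer.
    yield []


if __name__ == "__main__":
    # Toy sequence (hypothetical ordering), mixing one analysis flag with
    # transformation flags whose names appear on the axes of Figure 3.
    toy = ["-domtree", "-simplifycfg", "-domtree", "-sroa", "-instcombine"]
    transformations = {"-simplifycfg", "-sroa", "-instcombine"}
    for cfg in prefix_configurations(toy, lambda p: p in transformations):
        print(len(cfg), cfg[-1] if cfg else "-O0")
```

Because a repeated transformation produces a separate cut point each time it occurs, the number of configurations grows with the number of transformation occurrences rather than with the number of distinct flags.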
There are two kinds of pass dependencies for the LLVM optimizer: explicit
and implicit dependencies. An explicit dependency exists when a transformation
pass requires another analysis pass to execute first. In this case, the optimizer
will automatically schedule the analysis pass if only the transformation pass was
requested by the user. An implicit dependency exists when a transformation
or analysis pass is designed to work after another transformation instead of an
analysis pass. In this case, the optimizer will not schedule the pass automatically;
instead, the user must manually add the passes in the correct order to be executed,
either using the opt tool or the pass manager. The pass manager is the LLVM
built-in mechanism for scheduling passes and handling their dependencies. If a
pass is requested but its dependencies have not been requested in the correct
order, then the specified pass will be automatically skipped by the optimizer.
For the predefined optimization levels, the implicit dependencies are predefined
in the pass manager.
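The difference between the two kinds of dependency can be illustrated with a toy scheduler. This is a deliberately simplified model and not the LLVM pass manager: the function `schedule`, the pass names and the dependency tables are all hypothetical, and effects such as analysis invalidation are ignored. It only mirrors the behavior described above: explicitly required analyses are scheduled automatically, while a pass whose implicit prerequisites have not run yet is skipped.

```python
from typing import Dict, List, Sequence


def schedule(
    requested: Sequence[str],
    explicit_deps: Dict[str, List[str]],
    implicit_deps: Dict[str, List[str]],
) -> List[str]:
    """Return the passes that would run, in order, under the toy model."""
    executed: List[str] = []
    for name in requested:
        # Implicit dependency: prerequisite transformations must already
        # have been executed, otherwise the requested pass is skipped.
        if not all(dep in executed for dep in implicit_deps.get(name, [])):
            continue
        # Explicit dependency: missing analyses are scheduled automatically.
        for analysis in explicit_deps.get(name, []):
            if analysis not in executed:
                executed.append(analysis)
        executed.append(name)
    return executed


if __name__ == "__main__":
    explicit = {"transform_a": ["analysis_x"]}   # hypothetical names
    implicit = {"transform_b": ["transform_a"]}  # hypothetical names
    print(schedule(["transform_a", "transform_b"], explicit, implicit))
    print(schedule(["transform_b", "transform_a"], explicit, implicit))
```

In the second call the ordering is wrong for `transform_b`, so only `analysis_x` and `transform_a` run. Truncating a predefined sequence from the end never produces this situation, because every remaining pass keeps all of its predecessors.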
To extract the list of transformation and analysis passes, their ordering, and
their dependencies for a predefined level of optimization, we use the argument
"-debug-pass=Structure" with the opt tool (the LLVM optimizer). This information
is passed to our flag-selection process, which, to extract a new configuration,
simply eliminates the last optimization flag applied. This ensures that all the im-
plicit dependencies for the remaining passes in the new configuration are still in
place. Thus, the knowledge built into the predefined optimization levels about
effective pass orderings is preserved in the newly generated optimization config-
urations. What we are actually questioning is whether the pass scheduling in the
predefined-optimization levels is a good choice. In other words, can stopping the
optimizations at an earlier point yield better code for a specific program
and architecture?
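As an illustration of how such configurations might be driven from a script, the sketch below only builds the argument lists for the opt tool; it does not run anything. The helper names, the file names and the use of `-disable-output` to suppress the bitcode output are assumptions made for this example. Passing individual pass flags directly to opt corresponds to the legacy command-line syntax of the LLVM versions used in this paper; other releases may expect a different syntax.

```python
from typing import List, Sequence


def structure_dump_command(level: str, ir_in: str,
                           opt_binary: str = "opt") -> List[str]:
    """Arguments that ask opt to print the pass structure of `level`."""
    return [opt_binary, level, "-debug-pass=Structure",
            "-disable-output", ir_in]


def opt_command(flags: Sequence[str], ir_in: str, ir_out: str,
                opt_binary: str = "opt") -> List[str]:
    """Arguments that run opt with exactly `flags`, in the given order."""
    return [opt_binary, *flags, ir_in, "-o", ir_out]


if __name__ == "__main__":
    # Hypothetical file names; the unoptimized IR is produced once by Clang.
    print(" ".join(structure_dump_command("-O2", "prog.ll")))
    print(" ".join(opt_command(["-simplifycfg", "-sroa"],
                               "prog.ll", "prog.opt.bc")))
```

The returned lists can be handed to `subprocess.run` once the LLVM tools are on the path; keeping command construction separate from execution makes the exploration loop easy to test without a toolchain.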
The BEEBS benchmark suite [PHB13] was used for evaluation. BEEBS is
designed for assessing the energy consumption of embedded processors. The resource-usage
estimation process retrieves the execution time, energy consumption
and code size for each executable generated. The code size can be retrieved
by examining the size of the executable. The execution time and energy consumption
are measured using the MAGEEC board [Hol13] together with
the pyenergy [Pal15] firmware and host-side software. The BEEBS benchmark
suite utilizes this energy measurement framework and allows for triggering the
beginning and the end of the execution of a benchmark. Thus, energy measurements
are reported only during a benchmark’s execution. Energy consumption, exe-
cution time and average power dissipation are reported back to the host. The
MAGEEC board supports a sampling rate of up to six million samples per sec-
ond. A calibration process was needed prior to measurement to determine the
number of times a benchmark should be executed in a loop while measuring
to obtain an adequate number of measurements. This ensured the collection of
reliable energy values for each benchmark. Finally, the BEEBS benchmark suite
has a built-in self-test mechanism that flags up when a generated executable
is invalid, i.e. it does not provide the expected results. Standard optimization
levels shipped with each new version of a compiler are typically heavily tested
to ensure the production of functionally correct executables. In our case, using
optimization configurations that are subsequences of the standard optimization
levels increases the chance of generating valid executables. In fact, all the exe-
cutables we tested passed the BEEBS validation.
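The calibration step mentioned above is not specified in detail, so the following is only one plausible way to pick the loop count. It assumes that "an adequate number of measurements" can be expressed as a minimum number of samples; the function name `repetitions_needed`, the parameter `min_samples` and the example numbers are assumptions made for this sketch.

```python
import math


def repetitions_needed(single_run_seconds: float,
                       samples_per_second: float,
                       min_samples: int) -> int:
    """Smallest loop count whose total run time yields `min_samples` samples."""
    if single_run_seconds <= 0 or samples_per_second <= 0:
        raise ValueError("run time and sampling rate must be positive")
    samples_per_run = single_run_seconds * samples_per_second
    # At least one execution is always needed, even for long benchmarks.
    return max(1, math.ceil(min_samples / samples_per_run))


if __name__ == "__main__":
    # Hypothetical numbers: a 0.5 s benchmark sampled at 8 samples/s
    # needs 3 repetitions to collect at least 10 samples.
    print(repetitions_needed(0.5, 8.0, 10))
```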
3 Results and Discussion
For the evaluation of our approach, the same 71 benchmarks from the BEEBS
[PHB13] benchmark suite were used for both the Cortex-M0 and the Cortex-
M3 processors. Two benchmarks were left out because they did not fit into the
memory of the Cortex-M0 development board. For each benchmark, Figure 2
(Figure 2a for the Cortex-M0 with LLVM v3.8 and Figure 2b for the Cortex-M3
with LLVM v5.0) shows the largest performance gains achieved
by the proposed technique compared to the standard optimization level under
investigation, -O2. In other words, this figure represents the resource usage re-
sults obtained by using the optimization configuration, among the configurations
exercised by our technique, that achieves the best performance gains compared
to -O2 for each benchmark. A negative percentage represents an improvement
on a resource, e.g. a result of -20% for execution time represents a 20% reduc-
tion in the execution time obtained by the selected optimization configuration
when compared to the execution time retrieved by -O2. The energy-consumption
and code-size improvements are also given for the selected configurations. If two
optimization configurations have the same performance gains, then energy con-
sumption improvement is used as a second criterion and code size improvement
as a third criterion to select the best optimization configuration. The selection
criteria can be modified according to the resource requirements for a specific
application. Moreover, a function can be introduced to further formalize the
selection process when complex multi-objective optimization is required.
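The selection rule described above can be sketched as a lexicographic comparison of percentage changes relative to -O2. The `Measurement` record, the function names and the example values are hypothetical and serve only to illustrate the rule; the `weighted_score` function is one simple example of the kind of function that could formalize a multi-objective selection, not a function taken from this work. Ties are detected by exact equality here, whereas real measurements would probably need a tolerance.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Measurement:
    label: str        # e.g. the last flag applied in the configuration
    time: float       # execution time
    energy: float     # energy consumption
    code_size: float  # size of the executable


def percent_change(value: float, baseline: float) -> float:
    """Negative values mean a reduction (improvement) over the baseline."""
    return 100.0 * (value - baseline) / baseline


def changes(m: Measurement, base: Measurement) -> Tuple[float, float, float]:
    return (percent_change(m.time, base.time),
            percent_change(m.energy, base.energy),
            percent_change(m.code_size, base.code_size))


def select_best(candidates: Iterable[Measurement],
                base: Measurement) -> Measurement:
    """Best time first; energy, then code size, break ties."""
    return min(candidates, key=lambda m: changes(m, base))


def weighted_score(m: Measurement, base: Measurement,
                   weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Hypothetical multi-objective score: lower is better."""
    return sum(w * c for w, c in zip(weights, changes(m, base)))


if __name__ == "__main__":
    o2 = Measurement("-O2", 100.0, 50.0, 2000.0)  # made-up baseline values
    a = Measurement("a", 80.0, 45.0, 2000.0)
    b = Measurement("b", 80.0, 40.0, 2100.0)
    print(select_best([a, b], o2).label)  # same time, b uses less energy
```

Changing the order of the tuple returned by `changes`, or replacing `select_best` with a minimization of `weighted_score`, adapts the selection to a different resource priority without repeating any measurement.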
[Figure 2: two plots, each titled "Improvements over -O2". Horizontal axis: Benchmarks. Vertical axis: Percentage time, energy, code size vs -O2. Series: Execution Time, Energy Usage, Code Size.]
(a) Results for the Cortex-M0 processor and the LLVM v3.8 compilation framework.
(b) Results for the Cortex-M3 processor and the LLVM v5.0 compilation framework.
Fig. 2: Best achieved execution-time improvements over the standard optimization
level -O2. For the best execution-time optimization configuration, energy
consumption and code size improvements are also given. A negative percentage
represents a reduction of resource usage compared to -O2.
[Figure 3: four plots, each titled "<benchmark> Benchmark (Improvements over -O2)", for the Mergesort and Montecarlo benchmarks (panel a) and the Levenshtein and Ns benchmarks (panel b). Horizontal axis: Compilation Configuration, with tick labels running from -O0, -simplifycfg 10, -sroa 12, -ipsccp 23 up to -constmerge 107 in panel a, and from -O0, -simplifycfg 6, -sroa 8, -ipsccp 19 up to -simplifycfg 236 in panel b. Vertical axis: Percentage time, energy, code size vs -O2. Series: Execution Time, Energy Usage, Code Size.]
(a) Compilation profiles for two of the benchmarks, using the Cortex-M0 processor
and the LLVM v3.8 compilation framework.
(b) Compilation profiles for two of the benchmarks, using the Cortex-M3 processor
and the LLVM v5.0 compilation framework.
Fig. 3: For each optimization configuration tested by the proposed technique the
execution-time, energy-consumption and code-size improvements over -O2 are
given. A negative percentage represents a reduction of resource usage compared
to -O2. Each element of the horizontal axis has the name of the last flag applied
and the total number of flags used. The configurations are incremental subsequences
of -O2, starting from -O0 and adding optimization flags until reaching
the complete -O2 set of flags.
For the Cortex-M0 processor, we observed an average reduction in execution
time of 2.5%, with 29 out of the 71 benchmarks seeing execution time improve-
ments over -O2 ranging from around 1% to around 23%. For the Cortex-M3
processor, we observed an average reduction in execution time of 5.3%, with 38
out of the 71 benchmarks seeing execution time improvements over -O2 ranging
from around 1% to around 90%. The energy consumption improvements were
always closely related to the execution time improvements for both of the proces-
sors. This is expected due to the predictable nature of these deeply embedded
processors. In contrast, there were no significant fluctuations in the code size
between different optimization configurations. We anticipate that, if the -Os or
-Oz optimization levels, which both aim to achieve smaller code size, had been
used as a starting point for our exploration, then more variation would have
been observed for code size.
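Summary figures of this kind can be derived from the per-benchmark best results with a few lines of code. The sketch below uses made-up numbers, not the measured data, and the 1% threshold used to count a benchmark as improved is an assumption suggested by the reported ranges. The average is taken over all benchmarks, including those for which -O2 itself was the best configuration (a change of 0%).

```python
from typing import Dict, Tuple


def summarize(best_time_change: Dict[str, float],
              threshold: float = -1.0) -> Tuple[float, int]:
    """Return (average percentage change, number of improved benchmarks).

    `best_time_change` maps a benchmark name to the best execution-time
    change over -O2 in percent, where negative values are reductions.
    """
    values = list(best_time_change.values())
    average = sum(values) / len(values)
    improved = sum(1 for v in values if v <= threshold)
    return average, improved


if __name__ == "__main__":
    # Hypothetical values for four hypothetical benchmarks.
    example = {"bench_a": -20.0, "bench_b": 0.0,
               "bench_c": -1.0, "bench_d": -0.5}
    print(summarize(example))
```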
As can be seen from Figures 2a and 2b, our optimization strategy performed
significantly differently for the two processors per benchmark. This can
be caused by the different performance and power consumption characteristics
of the two processors and/or the use of different compiler versions in each case.
Furthermore, the technique performed better on the Cortex-M3 with the LLVM
v5.0 compilation framework. This could be due to the compilation framework
improvements from version 3.8 to version 5.0. Another possible reason might
be that the -O2 optimization level for LLVM v5.0 includes more optimization
flags than that of LLVM v3.8. The more flags in an optimization level, the more
optimization configurations will be generated and exercised by our exploitation
technique, and thus, more opportunities for execution-time, energy-consumption
and code-size savings can be exposed.
Figures 3a and 3b demonstrate the effect of each optimization configuration,
exercised by our exploitation technique, on the three resources (execution time,
energy consumption and code size), for two of the benchmarks for the Cortex-
M0 and Cortex-M3 processors, respectively. Similar figures were obtained for
all the 71 benchmarks and for both of the processors. Similarly to Figure 2, a
negative percentage represents an improvement on the resource compared to the
one achieved by -O2. The horizontal axis of the figures shows the flag at which
compilation stopped together with the total number of flags included up to that
point. This represents an optimization configuration that is a subsequence of
the -O2. For example, the best optimization configuration for all three resources
for the Levenshtein benchmark (see top part of Figure 3b) is achieved when
the compilation stops at flag number 91, -loop-deletion. This means that the
optimization configuration includes the first 91 flags of the -O2 configuration
with their original ordering preserved. The optimization configurations include
both transformation and analysis passes.
The number of optimization configurations exercised in each case depends
on the number of transformation flags included in the -O2 level of the version
of the LLVM optimizer used. Note that we are only considering the documented
transformation passes [LLV18]. For example, 50 and 64 different configurations
are tested in the case of the Cortex-M0 processor with the LLVM compilation
framework v3.8, and in the case of the Cortex-M3 with the LLVM framework v5.0,
respectively. Many of the transformation passes are repeated multiple times in
a standard optimization level, but because of their different ordering, they have
a different effect. Thus, we consider each repetition as an opportunity to create
a new optimization configuration. Furthermore, note that more transformation
passes exist in the LLVM optimizer, but typically, these are passes that have
implicit dependencies on the documented passes. The methodology of creating a
new optimization configuration explained in Section 2 ensures the preservation
of all the implicit dependencies for each configuration. This is part of preserv-
ing the empirical knowledge of good interactions between transformations built
into the predefined optimization levels and reusing it in the new configurations
generated.
Typically, optimization approaches based on iterative compilation are ex-
tremely time consuming [ABP+17], since thousands of iterations are needed to
reach levels of resource savings similar to the ones achieved by our approach.
In our case, the maximum number of iterations we had to apply was 64, for
the Cortex-M3 processor. This makes our simple and inexpensive
approach an attractive alternative, before moving to the more expensive ap-
proaches, such as iterative-compilation-based and machine-learning-based com-
pilation techniques [AKC+18, Ash16].
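
Since the number of candidate configurations is this small, the search can simply be exhaustive: measure each configuration once and keep the best one relative to -O2. The sketch below illustrates that selection step under stated assumptions. The function name best_configuration, the dictionary layout, and all numbers are hypothetical and are not measurements from this paper.

```python
from typing import Dict, Tuple


def best_configuration(
    measurements: Dict[str, float], baseline: float
) -> Tuple[str, float]:
    """Return the label with the lowest measurement and its % change vs. baseline.

    A negative percentage means the configuration uses less of the
    resource (for example time or energy) than the baseline.
    """
    label = min(measurements, key=measurements.get)
    change = (measurements[label] - baseline) / baseline * 100.0
    return label, change


# Made-up execution times in milliseconds, keyed by configuration label.
times = {"simplifycfg 2": 130.0, "sroa 4": 96.0, "early-cse 5": 101.0}
o2_time = 100.0  # made-up measurement for the full -O2 configuration

label, change = best_configuration(times, o2_time)
print(f"{label}: {change:+.1f}% vs -O2")
```

The same function can be applied separately to energy or code-size measurements; the cost of the whole search grows linearly with the number of configurations.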
By manually observing the compilation profiles obtained for all the bench-
marks, similar to the ones demonstrated in Figure 3, no common behavior pat-
terns were detected, except that typically there is a significant improvement on
the execution time and the energy consumption at the third optimization config-
uration, i.e. the sroa 12 and the sroa 8 configurations shown in Figure 3 for the
Cortex-M0 and Cortex-M3 processors, respectively. Future work will use clus-
tering to see if programs can be grouped together based on their compilation
profiles. This can be useful to identify optimization sequences that perform well
for a particular type of program. Furthermore, the retrieved optimization profiles
can also give valuable insights to compiler engineers and software developers on
the effect of each optimization flag on a specific program and architecture. It is
beyond the scope of this work to investigate these effects.
4 Related Work
Iterative compilation has proved to be an effective technique for tackling both
the problem of choosing the right set of transformations and that of ordering
them to maximize their effectiveness [ABP+17]. The technique is typically used to
iterate over different sets of optimizations with the aim of satisfying an objec-
tive function. Usually, each iteration involves some feedback, such as profiling
information, to evaluate the effectiveness of the tested configuration. In random
iterative compilation [BKK+98], random optimization sequences are generated,
ranging from hundreds to thousands, and then used to optimize a program.
Random iterative compilation has been shown to provide significant performance
gains over standard optimization levels. Thus, it has become a standard
baseline metric for evaluating the effectiveness of machine-guided compilation
approaches [FKM+11, ABP+17, BRE15], where the goal is to achieve better
performance gains with less exploration time. Due to the huge number of pos-
sible flag combinations and their possible orderings, it is impossible to explore
a large fraction of the optimization space. To mitigate this problem, machine
learning is used to drive iterative compilation [ABC+06, OPWL17, CFA+07].
Based on either static code features [FKM+11] or profiling data [CFA+07],
such as performance counters, machine learning algorithms try to predict the
best set of flags to apply to satisfy the objective function with as few iterations
as possible. These techniques have proven effective in optimizing the resource
usage, mainly execution time, of programs on a specific architecture, but generally
suffer from a number of drawbacks. Typically, these techniques require a
large training phase [OPWL17] to create their predictive models. Furthermore,
they are not easily portable across different compilers, versions of the same
compiler, or architectures. Even if a single flag is added to a compiler’s
existing set of flags, the whole training phase has to be repeated. Moreover,
extracting some of the metrics that these techniques depend on, such as static
code features, might require a significant amount of engineering.
A recent work focused on mitigating the phase-ordering problem [ABP+17]
divided the -O3 standard optimization flags of the LLVM compilation
framework v3.8 into five subgroups using clustering. The authors then used
iterative compilation and machine learning techniques to select optimization
configurations by reordering the subgroups. The approach demonstrated an average
performance speedup of 1.31. An interesting observation is that 79% of the -O3
optimization flags were part of a single subgroup with a fixed ordering that is
similar to that used in the -O3 configuration. This suggests that the ordering of
flags in a predefined optimization level is a good starting point for further per-
formance gains. Our results actually confirm this hypothesis for the processors
under consideration.
Embedded applications typically have to meet strict timing, energy consumption,
and code-size constraints [GdSE17, EGLG+16]. Writing optimized code by hand
is a complex task that requires extensive knowledge of architectures. Therefore,
utilizing the compiler's optimizations to achieve optimal resource usage is
critical.
In an attempt to find better optimization configurations than the ones offered
by the standard optimization levels, the authors in [BRE15] applied inductive
logic programming (ILP) to predict compiler flags that minimize the execution
time of software running on embedded systems. This was done by using ILP to
learn logical rules that relate effective compiler flags to specific program features.
For their experimental evaluation they used the GCC compiler [GCC18] and
the Arm Cortex-M3 architecture, the same architecture used in this paper. Their
method was evaluated on 60 benchmarks selected from the BEEBS benchmark
suite, the same suite used in this work. They were able to achieve an average reduction
in execution time of 8%, with about half of the benchmarks seeing performance
improvements. The main drawback of their approach was the large training phase
of their predictive model. For each benchmark, they needed to create and test
1000 optimization configurations. This resulted in about a week of training time.
Furthermore, for their approach to be transferred to a new architecture, compiler
or compiler version, or even to add a new optimization flag, the whole training
phase has to be repeated from scratch. The same holds when applying their
approach to resources other than execution time, such as energy consumption
or code size. In contrast, our approach, for the same architecture and more
benchmarks of the same benchmark suite, was able to achieve similar savings
in execution time (average 5.3%) by only testing 65 optimization configurations
for each program. At the same time, our approach does not suffer from the
portability issues faced by their technique.
In [PHB15], the authors used fractional factorial design (FFD) to explore
the large optimization space (2^82 possible combinations for the GCC compiler
used) and determine the effects of optimizations and optimization combinations.
The resources under investigation were execution time and energy consumption.
They tested their approach on five different embedded platforms including the
Cortex-M0 and Cortex-M3, which are also used in this work. For their results
to be statistically significant, they needed to exercise 2048 optimization config-
urations for each benchmark. Although they claimed that FFD was able to find
optimization configurations that perform better than the standard optimization
levels, they demonstrated this only on a couple of benchmarks. Again, this ap-
proach suffers from the same portability issues as [BRE15].
In our work, to maximize the accuracy of our results, hardware measurements
were used for both the execution time and energy consumption. Although high
accuracy is desirable, in many cases physical hardware measurements are difficult
to deploy and use. Existing works demonstrated that energy modeling and
estimation techniques could accurately estimate both execution time and energy
consumption for embedded architectures similar to the ones used in this pa-
per [GKCE17, GGP+15]. Such estimation techniques can replace the physical-
hardware measurements used in our approach in order to make the proposed
technique accessible to more software developers.
5 Conclusion
Finding optimal optimization configurations for a specific compiler, architecture,
and program has been an open challenge since the introduction of compilers. Standard
optimization levels that are built into modern compilers perform well on average
on a range of architectures and programs and provide convenience to the
software developer. Over the past years, iterative compilation and complex ma-
chine learning approaches have been exploited to yield optimization configura-
tions that outperform these standard optimization levels. These techniques are
typically expensive either due to their large training phases or the large number
of configurations that they need to test. Moreover, they are not easily portable to
new architectures and compilers.
In contrast, in this work an inexpensive and easily portable approach that
generates and tests fewer than 64 optimization configurations proved able to
achieve execution-time and energy-consumption savings in the same range as
the ones achieved by state of the art machine learning and iterative compilation
techniques [BRE15, PHB15, AKC+18]. The effectiveness of this simple approach
is attributed to the fact that we used subsequences of the optimization passes
defined in the standard optimization levels, but stopped the optimizations at
an earlier point than the standard optimization level under exploitation. This
indicates that the accumulated empirical knowledge built into the standard op-
timization levels is a good starting point for creating optimization configurations
that will perform better than the standard ones.
The approach is compiler and target independent. Thus, for its validation,
two processors and two versions of the LLVM compiler framework were used;
namely, the Arm Cortex-M0 with the LLVM v3.8 and the Arm Cortex-M3 with
the LLVM v5.0. An average execution time reduction of 2.4% and 5.3% was
achieved across all the benchmarks for the Cortex-M0 and Cortex-M3 processors,
respectively, with at least half of the 71 benchmarks tested seeing performance
and energy consumption improvements. Finally, our approach can support multi-
criteria optimization as it targets execution time, energy consumption and code
size at the same time.
In future work, clustering and other machine learning techniques can be
applied on the compilation profiles retrieved by our exploitation approach (Fig-
ure 3) to fine-tune the standard optimization levels of a compiler to perform
better for a specific architecture. Furthermore, the technique is currently being
evaluated on more complex architectures, such as Intel’s x86.
6 Acknowledgments
The authors would like to thank Dr. Zbigniew Chamski for his valuable com-
ments and helpful suggestions. The work is supported by the European Union’s
Horizon 2020 Research and Innovation Programme under Grant agreement No.:
779882, TeamPlay (Time, Energy and security Analysis for Multi/Many-core
heterogeneous PLAtforms), and by the Royal Society Newton Advanced
Fellowship Programme under Grant No.: NA160108.
References
ABC+06. F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin, M. F. P. O’Boyle,
J. Thomson, M. Toussaint, and C. K. I. Williams. Using machine learn-
ing to focus iterative optimization. In Proceedings of the International
Symposium on Code Generation and Optimization, CGO ’06, pages 295–
305, Washington, DC, USA, 2006. IEEE Computer Society. URL: http:
//dx.doi.org/10.1109/CGO.2006.37, doi:10.1109/CGO.2006.37.
ABP+17. Amir H. Ashouri, Andrea Bignoli, Gianluca Palermo, Cristina Silvano,
Sameer Kulkarni, and John Cavazos. Micomp: Mitigating the compiler
phase-ordering problem using optimization sub-sequences and machine
learning. ACM Trans. Archit. Code Optim., 14(3):29:1–29:28, Septem-
ber 2017. URL: http://doi.acm.org/10.1145/3124452, doi:10.1145/
3124452.
AKC+18. A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano. A
Survey on Compiler Autotuning using Machine Learning. ArXiv e-prints,
jan 2018. arXiv:1801.04405.
ARM18a. ARM. Arm cortex-m0 processor, 2018. URL: https://developer.arm.
com/products/processors/cortex-m/cortex-m0.
ARM18b. ARM. Arm cortex-m3 processor, 2018. URL: https://developer.arm.
com/products/processors/cortex-m/cortex-m3.
ARM18c. ARM. Processors cortex-m series, 2018. URL: https://www.arm.com/
products/processors/cortex-m.
Ash16. Amir H. Ashouri. Compiler autotuning using machine learning techniques.
PhD thesis, Politecnico Di Milano, Department of Computer Science and
Engineering, 2016.
BKK+98. François Bodin, Toru Kisuki, Peter Knijnenburg, Mike O’Boyle, and Erven
Rohou. Iterative compilation in a non-linear optimisation space. In
Workshop on Profile and Feedback-Directed Compilation, Paris, France,
Oct 1998. URL: https://hal.inria.fr/inria-00475919.
BRE15. Craig Blackmore, Oliver Ray, and Kerstin Eder. A logic programming
approach to predict effective compiler settings for embedded software.
Theory and Practice of Logic Programming, 15(4-5):481–494, 2015. doi:
10.1017/S1471068415000174.
CFA+07. John Cavazos, Grigori Fursin, Felix Agakov, Edwin Bonilla, Michael F. P.
O’Boyle, and Olivier Temam. Rapidly selecting good compiler opti-
mizations using performance counters. In Proceedings of the Interna-
tional Symposium on Code Generation and Optimization, CGO ’07, pages
185–197, Washington, DC, USA, 2007. IEEE Computer Society. URL:
http://dx.doi.org/10.1109/CGO.2007.32, doi:10.1109/CGO.2007.32.
EGLG+16. K. Eder, J. P. Gallagher, P. López-García, H. Muller, Z. Banković,
K. Georgiou, R. Haemmerlé, M. V. Hermenegildo, B. Kafle, S. Kerrison,
M. Kirkeby, M. Klemen, X. Li, U. Liqat, J. Morse, M. Rhiger, and
M. Rosendahl. ENTRA: Whole-systems energy transparency. Micropro-
cess. Microsyst., 47, Part B:278–286, November 2016. URL: https://
doi.org/10.1016/j.micpro.2016.07.003, doi:10.1016/j.micpro.2016.
07.003.
FKM+11. Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Cham-
ski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendel-
son, Ayal Zaks, Eric Courtois, Francois Bodin, Phil Barnard, Elton Ash-
ton, Edwin Bonilla, John Thomson, Christopher K. I. Williams, and
Michael O’Boyle. Milepost gcc: Machine learning enabled self-tuning
compiler. International Journal of Parallel Programming, 39(3):296–
327, Jun 2011. URL: https://doi.org/10.1007/s10766-010-0161-2,
doi:10.1007/s10766-010-0161-2.
GCC18. GCC. Gcc, the gnu compiler collection, 2018. URL: https://gcc.gnu.
org/.
GdSE17. K. Georgiou, S. Xavier de Souza, and K. Eder. The iot energy challenge: A
software perspective. IEEE Embedded Systems Letters, PP(99):1–1, 2017.
doi:10.1109/LES.2017.2741419.
GGP+15. Neville Grech, Kyriakos Georgiou, James Pallister, Steve Kerrison, Jeremy
Morse, and Kerstin Eder. Static analysis of energy consumption for llvm
ir programs. In Proceedings of the 18th International Workshop on Soft-
ware and Compilers for Embedded Systems, SCOPES ’15, pages 12–21,
New York, NY, USA, 2015. ACM. URL: http://doi.acm.org/10.1145/
2764967.2764974, doi:10.1145/2764967.2764974.
GKCE17. Kyriakos Georgiou, Steve Kerrison, Zbigniew Chamski, and Kerstin Eder.
Energy transparency for deeply embedded programs. ACM Trans. Archit.
Code Optim., 14(1):8:1–8:26, March 2017. URL: http://doi.acm.org/10.
1145/3046679, doi:10.1145/3046679.
Hol13. Simon Hollis. The mageec energy measurement board, aug 2013. URL:
http://mageec.org/wiki/Power_Measurement_Board.
Lat02. Chris Lattner. LLVM: An Infrastructure for Multi-Stage Optimization.
Master’s thesis, Computer Science Dept., University of Illinois at Urbana-
Champaign, Urbana, IL, Dec 2002. URL: http://llvm.cs.uiuc.edu.
LLV18. LLVMorg. LLVM’s Analysis and Transform Passes, 2018. URL: https:
//llvm.org/docs/Passes.html.
OPWL17. W. F. Ogilvie, P. Petoumenos, Z. Wang, and H. Leather. Minimizing the
cost of iterative compilation with active learning. In 2017 IEEE/ACM
International Symposium on Code Generation and Optimization (CGO),
pages 245–256, Feb 2017. doi:10.1109/CGO.2017.7863744.
Pal15. James Pallister. Pyenergy: An interface to the mageec energy monitor
boards, feb 2015. URL: https://pypi.python.org/pypi/pyenergy.
PHB13. James Pallister, Simon J. Hollis, and Jeremy Bennett. BEEBS: open
benchmarks for energy measurements on embedded platforms. CoRR,
abs/1308.5174, 2013. URL: http://arxiv.org/abs/1308.5174, arXiv:
1308.5174.
PHB15. James Pallister, Simon J. Hollis, and Jeremy Bennett. Identifying com-
piler options to minimize energy consumption for embedded platforms.
The Computer Journal, 58(1):95–109, 2015. URL: http://dx.doi.org/
10.1093/comjnl/bxt129, doi:10.1093/comjnl/bxt129.
