Lost in translation: Exposing hidden compiler optimization opportunities by Georgiou, Kyriakos et al.
Lost in translation: Exposing hidden compiler
optimization opportunities
Kyriakos Georgiou, Zbigniew Chamski, Andres Amaya Garcia, David May,
Kerstin Eder
University of Bristol, UK
Abstract. To increase productivity, today’s compilers offer a two-fold
abstraction: they hide hardware complexity from the software developer,
and they support many architectures and programming languages. At
the same time, due to fierce market competition, most processor vendors
do not disclose many of their implementation details. These factors force
software developers to treat both compilers and architectures as black
boxes. In practice, this leads to suboptimal compiler behavior where
the maximum potential of improving an application’s resource usage,
such as execution time, is often not realized. This paper exposes missed
optimization opportunities and is of interest to all three communities:
compiler engineers, software developers and hardware architects. By exploiting
the behavior of the standard optimization levels of the LLVM v6.0 compiler,
such as -O3, we show how to reveal hidden cross-architecture and
architecture-dependent potential optimizations on two
popular processors: the Intel i5-6300U, widely used in portable PCs, and
the ARM Cortex-A53-based Broadcom BCM2837 used in the Raspberry
Pi 3B+. The classic nightly regression testing can then be extended to
use the resource usage and compilation information collected while ex-
ploiting subsequences of the standard optimization levels. This provides
a systematic means of detecting and tracking missed optimization oppor-
tunities. The enhanced nightly regression system is capable of driving the
improvement and tuning of the compiler’s common optimizer.
arXiv:1903.11397v2 [cs.PL] 28 Mar 2019
1 Introduction
Producing cost-effective software requires a high degree of productivity without
sacrificing quality. The relationship between these two factors is complex.
Typically, a careless increase in productivity aiming at reducing software
development costs can harm quality, while poor end-product quality can lead to
increased costs of deploying and maintaining an application. Furthermore, while
application quality in terms of a reduced number of bugs being discovered is
easy to quantify, assessing other quality metrics such as execution time and
energy consumption is often less straightforward. For example, consider a mobile
application that has been extensively tested and offers the end user an almost
bug-free experience and a good overall response time. Still, it is difficult
to argue that this application has reached its maximum potential with regard to
execution time and energy efficiency on a given architecture. Perhaps there is
another, more energy-efficient version of the same application that provides the
same functionality? Considering that this application may run on millions of
devices, the aggregate effect of even small energy savings can be substantial.
Compilers are at the heart of software development. Their primary goal is
to increase software productivity. They are a key element of the software stack,
providing an abstraction between high-level languages and machine code. The
challenge now lies with the compiler engineers, as they have to support a vast
number of architectures and programming languages and adapt to their rapid
advances. To mitigate this, modern compilers, such as the LLVM [LA04, LLVc]
and GCC [GCC] compilers, are designed to be modular. For example, they make
use of a common optimizer across all the architectures and programming lan-
guages supported. The common optimizer exposes to the software developer a
large number of available code optimizations via compiler flags; for example,
LLVM’s optimizer has 56 documented transformations [LLVb]. The challenge
then becomes to select and order the flags to create optimization configurations
that can achieve the best resource usage possible for a given program and archi-
tecture. Due to the huge number of possible flag combinations and their possible
orderings, it is impractical to explore the complete space of optimization con-
figurations. Thus, finding optimal optimization configurations is still an open
challenge.
To address this, compilers offer standard optimization levels, typically -O0,
-O1, -O2, -O3 and -Os, which are predefined sequences of optimizations. These
sequences are tuned through each new compiler release to perform well on a
number of micro-benchmarks and a range of mainstream architectures. Starting
from the -O0 level, which has no optimizations enabled, and moving to level -O3,
each level offers more aggressive optimizations with the main focus being
performance, while -Os is focused on optimizing code size. Still, these levels
have been shown to be suboptimal, as iterative compilation and machine-learning
approaches can find optimization sequences that offer better resource usage than
the standard optimization levels on a particular program and architecture [WO18]. The
main idea of such approaches is to find good optimization sequences by exploiting
only a fraction of the optimization space [AKC+18a]. Although such techniques
are promising for auto-tuning the compiler’s optimizer settings for a particular
application [WO18], they typically act as a “black box”, providing no insights
into why certain configurations are better than others. Thus, it is difficult to use
them as guidance for developing new systematic optimizations or improving the
existing ones.
Another dimension to the problem is the non-disclosure of hardware imple-
mentation details by processor vendors. This has two serious implications. First,
compilers are slow in adapting to architectural performance innovations. Even
worse, in some cases legacy optimization techniques which performed well on
previous hardware generations can actually perform poorly on newer hardware
(for example, see the potential optimizations concerning if-conversion and its
interaction with branch prediction reported in Section 4.2). Second, programmers
often have no clear view of the architecture’s and compiler’s internals and thus
may produce code that is neither compiler- nor architecture-friendly.
In [GBXdSE18], we made the interesting observation that by performing
fewer of the optimizations available in a standard compiler optimization level
such as -O2, while preserving their original ordering, significant savings can
be achieved in both execution time and energy consumption. This observation
has been validated on two embedded processors, namely the ARM Cortex-M0
and the ARM Cortex-M3, using two different versions of the LLVM compilation
framework: v3.8 and v5.0. Building on these findings, this paper makes the
following contributions:
– It investigates whether the technique proposed in [GBXdSE18] is applicable to a
broader class of architectures beyond the deeply embedded processors initially
tested. The technique is applied to the Intel i5-6300U, an x86-based architecture
popular in desktop and laptop PCs, and to the ARM Cortex-A53,
an ARMv8-A 64-bit based architecture frequently used in mobile devices
(Section 3).
– It proposes an enhanced nightly regression system that can help tune the
standard compiler optimization levels for better resource usage (execution
time, energy consumption and code size), and demonstrates how this ap-
proach can expose hidden architecture-dependent and cross-architecture op-
timization opportunities (Section 4).
Experimental evaluation with 42 benchmarks, which are part of LLVM’s
test suite [LLVa], demonstrated performance gains for at least half of the
benchmarks, with an average of 11.5% and 5.1% execution time improvement for the
i5-6300U and the Cortex-A53 processors, respectively. These findings confirm
that the technique can detect compiler inefficiencies beyond deeply embedded
architectures, like the Cortex-M0 and Cortex-M3 examined in [GBXdSE18], and
across multiple compiler versions, namely LLVM 3.8 and LLVM 5.0 used
in [GBXdSE18] and LLVM 6.0 used in this paper.
These results motivate the search for a systematic way of exploiting the com-
piler optimization inefficiencies exposed by the technique proposed in [GBXdSE18]
to drive the tuning of the compiler’s common optimizer. Thus, benchmarking
results were then exploited by our enhanced nightly regression system. The en-
hanced system directly pinpoints behaviors common across multiple benchmarks
and architectures and reveals their possible causes. Using only a selection of the
information collected during enhanced regression tests, we demonstrate the value
of the enhanced system by exposing two significant cross-target shortcomings of
the LLVM common optimizer, two distinct opportunities for target-aware heuris-
tics adjustments and a possible direction for improving the support of advanced
hardware branch prediction at compiler level.
The rest of the paper is organized as follows. Section 2 gives a brief overview of
the common optimizer exploitation technique that was introduced in [GBXdSE18]
and the adjustments needed for the architectures used in this paper. Our bench-
marking experimental evaluation results are presented and discussed in Section 3.
Section 4 introduces the enhanced nightly regression system and demonstrates
how it can guide compiler engineers to tune the common compiler optimizer.
Section 5 critically reviews previous work related to ours. Finally, Section 6
concludes the paper and outlines opportunities for future work.
2 Exploiting Standard Optimization Levels
Figure 1 demonstrates the process used to evaluate the effectiveness of the dif-
ferent optimization configurations explored. Each configuration is a set of or-
dered flags used to drive the analysis and transformation passes by the LLVM
optimizer. An analysis pass can identify properties and expose optimization op-
portunities that can later be used by transformation passes to perform optimiza-
tions. A standard optimization level (-O1, -O2, -O3, -Os, -Oz) can be selected
as the starting point. Each optimization level represents a list of optimization
flags which have a predefined order. Their order influences the order in which
the transformation/optimization and analysis passes will be applied to the code
under compilation. A new flag configuration is obtained by excluding the last
transformation flag from the current list of flags. The new optimization
configuration is then applied to the unoptimized IR of the program, obtained
from the Clang front-end. Note that the program’s unoptimized IR only needs
to be generated once by the Clang front-end; it can then be used throughout
the exploration process, thus saving compilation time. The optimized IR is then
passed to the LLVM back-end and linker to generate the executable for the archi-
tecture under consideration. Note that both the back-end and linker are always
called using the optimization level selected for exploration, in our case -O3. The
executable’s resource usage is measured and stored for each tested configuration.
The exploration process finishes when the current list of transformation
flags is empty. This is equivalent to optimization level -O0, where the optimizer
applies no optimizations. Then, depending on the resource requirements, defined
as optimization criteria, the best flag configuration is selected. A more detailed
explanation of the technique is given in [GBXdSE18].
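The exploration loop described above can be sketched as follows. This is a minimal illustration rather than the implementation used in [GBXdSE18]: the function name `prefix_configurations`, the toy flag list and the set of flags treated as transformations are assumptions made for the example. It also assumes that each configuration is the prefix of the ordered flag list that ends at a transformation flag, so that the analysis flags preceding that transformation are kept.

```python
def prefix_configurations(ordered_flags, transformation_flags):
    """Yield configurations from the full level down to the empty list.

    Each configuration is a prefix of ordered_flags that ends at a
    transformation flag. The last configuration yielded is the empty
    list, which corresponds to applying no optimizations.
    """
    flags = list(ordered_flags)
    while flags:
        # Drop trailing analysis flags so the prefix ends at a transformation.
        while flags and flags[-1] not in transformation_flags:
            flags.pop()
        if not flags:
            break
        yield list(flags)
        # Exclude the last transformation flag to obtain the next configuration.
        flags.pop()
    yield []


# Toy example (hypothetical ordering, not the real -O3 sequence).
toy_level = ["-targetlibinfo", "-simplifycfg", "-domtree", "-sroa"]
toy_transformations = {"-simplifycfg", "-sroa"}
for config in prefix_configurations(toy_level, toy_transformations):
    print(len(config), config)
```

Each yielded list would be passed to the optimizer on the same unoptimized IR, and the measured resource usage stored under the name of its last flag and its length.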
In [GBXdSE18], the primary focus was deeply-embedded processors, typi-
cally used in Internet of Things (IoT) applications, and thus, we demonstrated
the technique’s effectiveness on the Arm Cortex-M0 [Cora] and the Arm
Cortex-M3 [Corb] processors. In this paper, the technique is ported to two more
complex processors, namely the Intel i5-6300U and the Arm Cortex-A53. Porting
to a new architecture is not time-consuming since the technique treats an
architecture as a black box. This is feasible because no resource models are re-
quired, neither for execution time nor for energy consumption. Instead, physical
measurements of such resources can be used to assess the effectiveness of a new
optimization configuration on a program.
Similarly, the technique treats the compiler as a black box. It only uses the
compilation framework to exercise the different optimization configuration sce-
narios extracted from a predefined optimization level on a particular program. In
contrast, machine-learning-based techniques typically require a heavy training
phase for each new compiler version or when a new optimization flag is
introduced [ABP+17, BRE15]. To demonstrate the portability of the technique
across different compiler versions, in [GBXdSE18] the analysis for the
Cortex-M0 processor was performed using the LLVM compilation framework v3.8, and
for the Cortex-M3 using the LLVM compilation framework v5.0. In this paper,
the LLVM compilation framework v6.0 is used. Overall, the porting to the new
architectures and the new version of the LLVM compiler was completed within
an hour.
[Figure 1 (diagram): boxes labeled Programs, Clang Front-End, LLVM Optimizer,
LLVM Back-End, Resource Usage Measurement, Results, Configuration Selection,
Generate Optimization Config., Opt. Criteria and Best Config.; decision node
"Finished?" (Yes/No); legend: Control, Data.]
Fig. 1: Compilation and evaluation process (modified from [GBXdSE18]).
Collective Knowledge (CK) [FLSU18, cTu], a framework for collaborative
research that supports compiler optimization auto-tuning, was used for
evaluation. CK includes a variety of benchmark suites for the training and the
evaluation of auto-tuning techniques for compiler optimization, such as iterative-
based and ML-based techniques. One of them is the Milepost-GCC-codelet
benchmark suite, which was used in the seminal work on ML-based compiler opti-
mization, MilePostGcc [Fea11]. These benchmarks represent hot spots extracted
together with their datasets from several real software projects [FLSU18]. These
benchmarks are also part of the LLVM-compiler’s test-suite, under the MiBench
benchmark suite [LLVa]. Both Milepost-GCC and its benchmark suite are
now integrated into the CK framework and are often used as the baseline to
compare the effectiveness of new auto-tuning techniques [FLSU18, ABP+17, BRE15].
Thus, this paper also uses the Milepost-GCC-Codelet benchmark suite for evaluation.
The Resource Usage Measurement box that is part of our compilation and
evaluation process, shown in Fig. 1, can be used to determine the execution
time, energy consumption and code size for each executable generated. Various
measurement or estimation techniques can be utilized as part of our framework
in a plug-and-play approach to address different hardware platforms and opti-
mization requirements. For this work, we focus on the execution time and code
size since there is no support for direct hardware energy measurements on the
devices under test. As demonstrated in [GBXdSE18], the technique is capable
of accounting for energy consumption, when accurate energy measurements or
estimations are available. The code size can be obtained by examining the size
of the “.text” section of the executable. For execution time measurements we
use the CK’s built-in execution time measurement framework. The framework
has a calibration process that is needed prior to measurement to determine the
number of times a benchmark should be executed in a loop while measuring to
obtain a representative average execution time for each benchmark. In addition,
we repeat the evaluation process ten times and take the mean execution time
for each benchmark to minimize the influence of other events that can affect
a benchmark’s execution, such as Dynamic Voltage and Frequency Scaling
(DVFS) and noise from the operating system or other applications running on
the machines under test.
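The measurement procedure described above (a calibration step followed by repeated timed runs) can be illustrated with the sketch below. It is not CK's implementation: the doubling strategy, the `min_seconds` threshold and the names `calibrate` and `mean_time` are assumptions chosen for illustration. The clock is passed in as a parameter so that the logic can be exercised without real timing noise.

```python
import statistics
import time


def calibrate(run_once, min_seconds=0.1, clock=time.perf_counter):
    """Return a loop count for which one timed loop lasts at least min_seconds.

    Assumption: the count is doubled until the loop is long enough.
    """
    reps = 1
    while True:
        start = clock()
        for _ in range(reps):
            run_once()
        if clock() - start >= min_seconds:
            return reps
        reps *= 2


def mean_time(run_once, reps, rounds=10, clock=time.perf_counter):
    """Mean per-execution time over `rounds` timed loops of `reps` executions."""
    samples = []
    for _ in range(rounds):
        start = clock()
        for _ in range(reps):
            run_once()
        samples.append((clock() - start) / reps)
    return statistics.mean(samples)


if __name__ == "__main__":
    def workload():
        # Stand-in for running one benchmark executable.
        sum(range(10_000))

    reps = calibrate(workload)
    print(reps, mean_time(workload, reps))
```

Averaging over ten rounds mirrors the repetition described in the text; it reduces, but does not remove, the effect of DVFS and operating-system noise.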
In the case of the Intel i5-6300U, the above settings were adequate to provide
stable and trustworthy results. In the case of the ARM Cortex-A53 as used in
a Raspberry Pi 3B+ board, the measurements were still not sufficiently stable.
To achieve stable measurements, we had to disable DVFS, fixing the proces-
sor’s frequency to 1200 MHz, and disable the wireless communication (WiFi
and Bluetooth) modules.
CK has a built-in self-test mechanism that detects and reports when a gen-
erated executable is invalid, i.e., it does not provide the expected results. We
modified this mechanism to check the benchmark’s results for each optimiza-
tion configuration against those of the -O0 compilation with no optimizations
enabled. This is because compiled unoptimized programs are considered to act
as intended by the programmer.
3 Benchmark Evaluation
The 42 benchmarks from the CK Milepost-GCC-Codelet benchmark suite, listed
in Figure 2c, were used for both the Intel i5-6300U and the Arm Cortex-A53
processors to facilitate the discovery of potential cross-architecture compiler op-
timizations. For each benchmark, Figure 2 (Figure 2a for the i5-6300U and Fig-
ure 2b for the Cortex-A53) demonstrates the biggest performance gains achieved
by the proposed technique compared to the standard optimization level under
investigation, -O3. In other words, this figure represents the resource usage re-
sults of the optimization configuration which achieved the best performance
gains among the configurations exercised by our technique, when compared to
-O3 for each benchmark. A negative percentage represents an improvement on
a resource, e.g., a result of -20% for execution time represents a 20% reduction
in the execution time obtained by the selected optimization configuration when
compared to the execution time of the reference -O3 optimization configuration.
The code-size improvements are also given for the selected configurations. If
two optimization configurations have the same performance gains, then code-
size improvement is used as a second criterion to select the best optimization
configuration. The selection criteria can be modified according to the resource
requirements for a specific application.
[Figure 2a (bar chart, "Improvements over -O3"): x-axis "Benchmarks", in the
order 8 24 29 2 37 42 38 40 9 23 35 5 13 25 34 6 7 41 27 26 33 17 20 14 11 1 4
12 10 15 21 28 22 31 3 16 18 36 19 30 32 39; y-axis "Percentage time, code size
vs -O3"; series: Execution Time, Code Size.]
(a) Results for the i5-6300U processor and the LLVM v6.0 compilation framework.
[Figure 2b (bar chart, "Improvements over -O3"): x-axis "Benchmarks", in the
order 9 18 10 17 8 36 5 22 21 42 38 39 31 13 26 3 14 7 4 23 27 40 35 11 33 34 1
2 25 15 20 24 12 32 29 19 16 30 37 41 28 6; y-axis "Percentage time, code size
vs -O3"; series: Execution Time, Code Size.]
(b) Results for the Cortex-A53 processor and the LLVM v6.0 compilation framework.
ID  Benchmark Name                           ID  Benchmark Name
1   automotive-basicmath-cubic-3-1           2   automotive-basicmath-isqrt-1-1
3   automotive-qsort1-src-qsort-1-1          4   automotive-susan-e-src-susan-10-1
5   automotive-susan-e-src-susan-2-1         6   automotive-susan-s-src-susan-1-1
7   consumer-jpeg-c-src-jcdctmgr-13-1        8   consumer-jpeg-c-src-jchuff-9-1
9   consumer-jpeg-c-src-jfdctint-2-1         10  consumer-lame-src-fft-2-1
11  consumer-lame-src-newmdct-10-1           12  consumer-lame-src-newmdct-3-1
13  consumer-lame-src-psymodel-17-1          14  consumer-lame-src-quantize-7-1
15  consumer-lame-src-quantize-pvt-6-1       16  consumer-lame-src-takehiro-16-1
17  consumer-lame-src-takehiro-5-1           18  consumer-mad-src-layer3-5-1
19  consumer-mad-src-layer3-6-1              20  consumer-tiff2rgba-src-tif-predict-4-1
21  consumer-tiffdither-src-tif-fax3-8-1     22  consumer-tiffdither-src-tif-fax3-9-1
23  consumer-tiffdither-src-tiffdither-1-1   24  consumer-tiffmedian-src-tiffmedian-1-1
25  consumer-tiffmedian-src-tiffmedian-3-1   26  consumer-tiffmedian-src-tiffmedian-4-1
27  consumer-tiffmedian-src-tiffmedian-5-1   28  consumer-tiffmedian-src-tiffmedian-6-1
29  network-dijkstra-src-dijkstra-large-5-1  30  office-ghostscript-src-gdevpbm-1-1
31  office-rsynth-src-nsynth-5-1             32  office-rsynth-src-nsynth-9-1
33  security-pgp-d-src-mpilib-1-1            34  security-pgp-e-src-mpilib-1-1
35  security-pgp-e-src-mpilib-3-1            36  security-pgp-e-src-mpilib-4-1
37  telecomm-adpcm-c-src-adpcm-1-1           38  telecomm-adpcm-d-src-adpcm-1-1
39  telecomm-fft-fftmisc-5-1                 40  telecomm-fft-fourierf-3-1
41  telecomm-gsm-src-rpe-4-1                 42  telecomm-gsm-src-short-term-2-1
(c) The GCC-Milepost benchmarks used for evaluation.
Fig. 2: Best achieved execution-time improvements over the standard optimization
level -O3. For the best execution-time optimization configuration, code size
improvements are also given. A negative percentage represents a reduction of
resource usage compared to -O3.
Moreover, a function can be introduced to further formalize the selection
process when complex multi-objective optimization is required. Energy
consumption can be another resource to be exploited whenever accurate energy
measurements are available for the processor under investigation.
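The reporting convention and the selection rule described in this section can be written down as a short sketch. The function names and the numbers in the example are made up for illustration; only the convention (negative percentages are improvements over -O3) and the rule (best execution time first, code size as the tie-breaker) come from the text.

```python
def percent_vs_reference(value, reference):
    """Relative change in percent; a negative value means lower resource usage."""
    return 100.0 * (value - reference) / reference


def select_best(results, reference):
    """Pick the configuration with the best execution time versus -O3.

    results:   dict mapping a configuration label to (exec_time, code_size)
    reference: (exec_time, code_size) measured for -O3
    Ties on execution time are broken by the code-size change.
    """
    def key(label):
        exec_time, code_size = results[label]
        return (
            percent_vs_reference(exec_time, reference[0]),
            percent_vs_reference(code_size, reference[1]),
        )

    return min(results, key=key)


# Hypothetical measurements (seconds, bytes of .text); not data from the paper.
o3 = (10.0, 1000)
measured = {
    "-sroa 9": (8.0, 1000),
    "-inline 38": (8.0, 900),
    "-licm 88": (9.0, 800),
}
best = select_best(measured, o3)
print(best, percent_vs_reference(measured[best][0], o3[0]))
```

A multi-objective criterion could replace the tuple key with a weighted function of the two percentages; the weights would depend on the application's resource requirements.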
For the i5-6300U processor, we observed an average reduction in execution
time of 11.5%, with 26 out of the 42 benchmarks seeing execution time improve-
ments over -O3 ranging from around 1% to 71%. For the Cortex-A53 processor,
we observed an average reduction in execution time of 5.1%, with 26 out of
the 42 benchmarks seeing execution time improvements over -O3 ranging from
around 1% to 28%. In contrast, there were only a few significant code-size im-
provements; namely benchmarks labeled 13, 34, 33, 42, 18, 40 have a code size
reduction of 14.7%, 6.8%, 6.8%, 4.3%, 4%, and 1.9%, respectively, for the i5-
6300U processor, and benchmarks labeled 9, 10, 13 have a code-size reduction
of 33.3%, 33.3%, 25%, respectively, for the Cortex-A53 processor. For embedded
applications, code size is often the first resource targeted for optimization due to
the limited memory of the processor. In such cases, our optimization exploitation
can use as a starting point the -Os or -Oz optimization levels, which both aim
to achieve smaller code size.
Considering Figures 2a and 2b, at first sight it seems that our optimization
strategy performed significantly differently for the two processors for most of the
benchmarks. Section 4 will take a closer look into the results and how we can
expose cross-architecture compiler optimizations.
[Figure 3 (plot), panel title: consumer-jpeg-c-src-jchuff-9-1 (ID: 8). x-axis:
Compilation Configuration; y-axis: Percentage time, code size vs -O3; series:
Execution Time, Code Size. Configurations along the x-axis, in order: -O0,
-O0-custom, -simplifycfg 7, -sroa 9, -ipsccp 20, -globalopt 22, -mem2reg 24,
-deadargelim 25, -instcombine 33, -simplifycfg 34, -prune-eh 37, -inline 38,
-functionattrs 39, -sroa 41, -jump-threading 50, -simplifycfg 52,
-instcombine 60, -tailcallelim 76, -simplifycfg 77, -reassociate 78,
-loop-simplify 81, -lcssa 83, -loop-rotate 87, -licm 88, -loop-unswitch 89,
-simplifycfg 90, -instcombine 98, -loop-simplify 99, -lcssa 101, -indvars 103,
-loop-deletion 105, -loop-unroll 106, -gvn 113, -memcpyopt 117, -sccp 118,
-instcombine 128, -jump-threading 130, -dse 136, -loop-simplify 138,
-lcssa 140, -licm 143, -adce 145, -simplifycfg 146, -instcombine 154,
-globalopt 159, -globaldce 160, -loop-simplify 166, -lcssa 168,
-loop-rotate 172, -loop-simplify 189, -instcombine 199, -simplifycfg 200,
-instcombine 212, -loop-simplify 213, -lcssa 215, -loop-unroll 217,
-instcombine 221, -loop-simplify 222, -lcssa 224, -licm 226,
-strip-dead-prt 228, -globaldce 229, -constmerge 230, -loop-simplify 235,
-lcssa 237, -simplifycfg 249.]
Fig. 3: Optimization-performance example on the i5-6300U. For each optimiza-
tion configuration tested by the proposed technique, the execution-time and
code-size improvements over -O3 are given. A negative percentage represents a
reduction of resource usage compared to -O3. Each element of the horizontal
axis has the name of the last flag applied and the total number of flags used.
The configurations are incremental subsequences of -O3, starting from -O0
and adding optimization flags until the complete -O3 set of flags is reached.
Figure 3 demonstrates the effect of each optimization configuration, exer-
cised by our exploitation technique, on the two resources (execution time and
code size), for the consumer-jpeg-c-src-jchuff-9-1 benchmark on the i5-6300U
processor. Similar figures were obtained for all the 42 benchmarks and for both
of the processors. As in Figure 2, a negative percentage represents a reduction
(thus, an improvement) in the usage of the given resource compared to the one
achieved by standard -O3 optimization. The horizontal axis of the figures shows
the flag at which compilation stopped together with the total number of flags
included up to that point. This represents an optimization configuration that is a
subsequence of the -O3 optimization sequence. For example, the best optimiza-
tion configuration for performance for the benchmark in Figure 3 is achieved
when the compilation stops at flag number 9, sroa. This means that the op-
timization configuration includes the first nine flags of the -O3 configuration
with their original ordering preserved. The optimization configurations include
both transformation and analysis passes. The -O0-custom configuration is the
split version of the -O0 optimization level where the compilation is explicitly
decomposed into the front-end, common optimizer and back-end, as described
in Section 2. Its results are compared with those of the normal -O0 compilation
to ensure that the decomposition of the compilation did not introduce any
significant variation in benchmark performance.
The number of optimization configurations exercised in each case depends
on the number of transformation flags included in the -O3 level of the version
of the LLVM optimizer used. Note that we are only considering the documented
transformation passes [LLVb]. For example, 64 and 66 different configurations
are automatically detected and tested by our technique for the Cortex-A53
and the i5-6300U processors, respectively. The difference for the -O3 optimiza-
tion level in terms of optimization flags between the two processors is probably
an attempt by the compiler engineers to better address the performance char-
acteristics of the two architectures. Overall, more analysis passes are used for
the i5-6300U processor. Many of the transformation passes are applied multiple
times in a standard optimization level, but because of their different position in
the configuration sequence they may have a different effect. Thus, we consider
each repetition as an opportunity to create a new optimization configuration.
Furthermore, note that more transformation passes exist in the LLVM optimizer,
but typically, these are passes that have implicit dependencies on the documented
passes. The methodology of creating a new optimization configuration explained
in Section 2 ensures the preservation of all the implicit dependencies for each
configuration.
Identifying optimization patterns across multiple benchmarks by manually
inspecting the compilation profiles obtained for all the benchmarks, such as
the one presented in Figure 3, is time-consuming. In the next section,
we show how the benchmarks can be automatically clustered based on their
compilation profiles. We then demonstrate the value of such clustering as part
of a nightly-regression system, as it can expose potential hidden architecture-
dependent and cross-architecture optimizations. Moreover, it can pinpoint opti-
mizations that degrade performance.
4 Exposing Hidden Optimization Opportunities
Retargetable compiler frameworks achieve their generality by abstracting target
architecture properties and by relying on cross-target heuristics in the front-
and middle-end compilation passes. The abstract properties may be parameter-
ized by quantitative characteristics of each actual target used, but the decision
heuristics and the actual sequence of optimizations are often defined by experi-
mentation and, once established, are seldom questioned in subsequent releases of
the compiler framework. Therefore, evaluating the pertinence and the quality of
the heuristics used in a compiler may provide valuable insights into the quality
of the current compiler configuration and its potential for further improvement.
The standard approach to tuning a compiler’s common optimizer remains the
repetitive testing of the compiler on a variety of benchmarks and mainstream
architectures. This approach is typically called nightly regression testing, and
it mainly aims at validating benchmark results in terms of correctness and im-
proving performance (or, in some application domains, code size). The output
of a nightly regression session is typically a report with information about the
compilation time, execution time and the correctness of the output for each test.
These results are then compared to a reference point, usually the result of a pre-
vious nightly-regression run that passed all the validation tests and exhibits the
best achievable execution and compilation times so far. The purpose of nightly
regression is to constantly monitor the quality of the modifications in a compiler
towards the release of a new version.
All observed regressions (either correctness failures or significant degrada-
tions in the execution time of a benchmark relative to its reference point) have
to be investigated by a compiler engineer. However, the detection of a regres-
sion does not offer any insights into what actually caused it and requires the
engineer to manually examine and track the source of the problem. Depending
on the engineer’s experience and the complexity of the issue, the identification
of the root cause of a regression can be an extremely time-consuming task.
Furthermore, a standard nightly regression system will only report regressions
or improvements for individual tests, but will not directly pinpoint behaviors
common across multiple benchmarks and architectures that can indicate hidden
optimization opportunities.
In this section we propose an enhancement of the classic nightly-regression
testing that utilizes the technique explained in Section 2 to extract recurring
behaviors of the compiler. By exposing and quantifying the effect of successive
optimizations across all tested benchmarks and supported target architectures
of a compiler, the enhanced nightly regression approach enables the discovery
of unexploited cross-architecture and architecture-dependent optimization op-
portunities and the identification of optimization passes that have a negative
impact on target resource utilization (execution time, energy consumption, or
code size). The insights gained in this way can drastically improve the process
of tuning the compiler’s common optimizer, even without detailed knowledge of
the target architecture.
To demonstrate this, we will use the results obtained by our technique on
the Milepost-GCC benchmarks, as described in Section 3. Milepost-GCC bench-
marks are an excellent candidate for this exercise as they are also part of the
LLVM compiler’s test suite (under the MiBench subsuite [LLVa]). From Fig-
ure 2, we already know that significant performance gains can be achieved using
the proposed technique across both architectures. A compiler engineer will need
to focus first on the cases where the same optimizations appear across multiple
benchmarks. These repeating patterns indicate potential optimization opportu-
nities that can benefit a wider group of programs and/or architectures. To this
end, the benchmark results are first classified to expose common optimization
behaviors which are then analyzed in more depth.
4.1 Classification of nightly regression results
Figure 4 shows the outcome of the initial result classification. Figure 4a and Fig-
ure 4b are the new proposed reports for a compiler’s nightly-regression system for
the i5-6300U and the Cortex-A53 processors, respectively. The reports include
all the benchmarks where our technique achieved an execution time reduction
of more than 3%. The benchmarks are then grouped in terms of their observed
optimization behavior. The first level of grouping is done on the First Config.
Better than -O3 column, which represents the first optimization configuration
that outperformed the -O3 (e.g., pass sroa 9 in Figure 5a), and on the Con-
fig. Removing Gains column, which represents the configuration in which those
achieved gains were lost by the addition of more optimization flags (e.g., pass
simplifycfg 34 in Figure 5a). The second grouping appears on the Best Overall
Config. column which represents the configuration that achieved the best per-
formance against -O3. Finally, the benchmarks within groups are sorted based
on their achieved performance gains over -O3, in descending order. This is also
the case for any benchmarks that do not belong to any group, e.g., the last 8
benchmarks in Figure 4b. The reports presented in Figure 4 will be used in the
later sections to demonstrate how they can guide the tuning of the compiler’s
optimizer.
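The grouping and sorting rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used to produce Figure 4: the `Row` record and the `classify` function are hypothetical names, single-benchmark keys and "no pattern" entries are assumed to stay ungrouped, and the sample rows are copied from Figure 4a.

```python
from collections import defaultdict
from typing import NamedTuple


class Row(NamedTuple):
    bench_id: int
    first_better: str    # "First Config. Better than -O3"
    removing_gains: str  # "Config. Removing Gains"
    best_overall: str    # "Best Overall Config."
    reduction: float     # execution time change vs -O3 in % (negative = faster)


def classify(rows):
    """Group rows by (first_better, removing_gains), then by best_overall.

    Returns (groups, ungrouped). Rows inside each group, and the ungrouped
    rows, are sorted by gain with the largest reduction first.
    """
    level1 = defaultdict(list)
    for row in rows:
        level1[(row.first_better, row.removing_gains)].append(row)

    groups, ungrouped = [], []
    for key, members in level1.items():
        # Assumption: a key seen once, or a "no pattern" key, forms no group.
        if len(members) < 2 or "no pattern" in key:
            ungrouped.extend(members)
            continue
        # Second-level grouping on the best overall configuration.
        level2 = defaultdict(list)
        for row in members:
            level2[row.best_overall].append(row)
        ordered = []
        for sub in sorted(level2.values(), key=lambda s: min(r.reduction for r in s)):
            ordered.extend(sorted(sub, key=lambda r: r.reduction))
        groups.append((key, ordered))
    ungrouped.sort(key=lambda r: r.reduction)
    return groups, ungrouped


rows = [
    Row(13, "sroa 9", "simplifycfg 34", "sroa 9", -12.53),
    Row(8, "sroa 9", "simplifycfg 34", "instcombine 33", -70.98),
    Row(29, "no pattern", "no pattern", "jump-threading 130", -50.00),
    Row(2, "sroa 9", "simplifycfg 34", "instcombine 33", -40.98),
    Row(34, "instcombine 33", "lcssa 83", "instcombine 33", -6.25),
    Row(33, "instcombine 33", "lcssa 83", "instcombine 33", -3.13),
]
groups, ungrouped = classify(rows)
```

With these six rows, benchmarks 8, 2 and 13 end up in one group, benchmarks 34 and 33 in a second one, and benchmark 29 stays ungrouped.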
The comparison of performance figures achieved after each optimizing trans-
formation gives a direct insight into that transformation’s effectiveness, relative
both to preceding and subsequent optimizations, and to the “best optimization
level” baseline. Our experiments demonstrate that for many compute kernels the
best overall performance is achieved at an intermediate step of the optimization
process, indicating that certain transformations applied at later optimization
stages are in fact counter-productive.
The number of cases where an intermediate optimization configuration leads
to substantially better performance than the reference “best” optimization
level -O3 is significant: 21 out of 42 benchmarks on the i5-6300U platform, and
20 out of 42 benchmarks on the ARM Cortex-A53 core achieve a performance
gain of at least 3%, and in some cases up to 71%, relative to optimization level
-O3. For these benchmarks, simply stopping the optimization process at the
appropriate intermediate stage provides a directly exploitable gain.
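As a rough sketch of how the three configuration columns of such a report could be derived from raw measurements, the function below scans the execution times recorded after each successive pass. The function name `summarize`, the sample timings, and the exact definition of the configuration that removes the gains (here: the first later configuration that is no longer at least 3% faster than -O3) are assumptions made for illustration.

```python
def summarize(configs, times, o3_time, threshold=3.0):
    """Return (first_better, removing_gains, best_overall, best_change_pct).

    configs: configuration labels in the order in which passes are added.
    times:   measured execution time for each configuration.
    A configuration counts as better than -O3 when it is at least
    `threshold` percent faster. Returns None if no configuration qualifies.
    """
    # Percentage change vs -O3; negative values mean faster than -O3.
    change = [100.0 * (t - o3_time) / o3_time for t in times]
    better = [c <= -threshold for c in change]
    if not any(better):
        return None
    first = better.index(True)
    # Assumed definition: first later configuration that is no longer better.
    removing = next(
        (i for i in range(first + 1, len(better)) if not better[i]), None
    )
    best = min(range(len(change)), key=change.__getitem__)
    return (
        configs[first],
        configs[removing] if removing is not None else None,
        configs[best],
        round(change[best], 2),
    )


configs = ["sroa 9", "ipsccp 20", "instcombine 33", "simplifycfg 34", "inline 38"]
times = [0.60, 0.58, 0.50, 0.99, 1.00]  # made-up timings in seconds
result = summarize(configs, times, o3_time=1.00)
```

For the made-up timings above, the gains first appear at `sroa 9`, peak at `instcombine 33`, and are lost at `simplifycfg 34`.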
Benchmark ID | First Config. Better than -O3 | Config. Removing Gains  | Best Overall Config.  | Execution Time Reduction %
8            | sroa - 9                      | simplifycfg - 34        | instcombine - 33      | -70.98
2            | sroa - 9                      | simplifycfg - 34        | instcombine - 33      | -40.98
37           | sroa - 9                      | simplifycfg - 34        | instcombine - 33      | -32.34
23           | sroa - 9                      | simplifycfg - 34        | instcombine - 33      | -24.76
13           | sroa - 9                      | simplifycfg - 34        | sroa - 9              | -12.53
25           | sroa - 9                      | simplifycfg - 34        | sroa - 9              | -8.82
7            | sroa - 9                      | simplifycfg - 34        | sroa - 9              | -5.11
42           | sroa - 9                      | simplifycfg - 34        | ipsccp - 20           | -31.61
35           | sroa - 9                      | simplifycfg - 34        | instcombine - 221     | -21.05
24           | sroa - 9                      | simplifycfg - 90        | functionattrs - 39    | -50.79
34           | instcombine - 33              | lcssa - 83              | instcombine - 33      | -6.25
33           | instcombine - 33              | lcssa - 83              | instcombine - 33      | -3.13
29           | no pattern                    | no pattern              | jump-threading - 130  | -50.00
38           | sroa - 9                      | instcombine - 60        | instcombine - 33      | -31.53
40           | sroa - 9                      | loop-rotate - 87        | ipsccp - 20           | -26.51
9            | loop-unroll - 217             | after simplifycfg - 249 | mem2reg - 24          | -25.00
5            | no pattern                    | no pattern              | loop-simplify - 138   | -17.82
6            | sroa - 9                      | globaldce - 229         | loop-rotate - 87      | -6.00
41           | reassociate - 78              | indvars - 103           | loop-rotate - 87      | -4.76
27           | sroa - 9                      | lcssa - 101             | ipsccp - 20           | -3.92
26           | loop-rotate - 87              | instcombine - 98        | loop-rotate - 87      | -3.17
(a) Advanced nightly regression report for the i5-6300U processor.
Benchmark ID | First Config. Better than -O3 | Config. Removing Gains | Best Overall Config.  | Execution Time Reduction %
10           | sroa - 8                      | instcombine - 27       | sroa - 8              | -17.18
36           | sroa - 8                      | instcombine - 27       | sroa - 8              | -11.35
42           | sroa - 8                      | instcombine - 27       | sroa - 8              | -10.48
31           | sroa - 8                      | instcombine - 27       | sroa - 8              | -6.25
7            | sroa - 8                      | instcombine - 27       | sroa - 8              | -3.23
5            | loop-rotate - 73              | jump-threading - 109   | instcombine - 80      | -10.82
22           | loop-rotate - 73              | jump-threading - 109   | instcombine - 80      | -10.71
21           | loop-rotate - 73              | jump-threading - 109   | memcpyopt - 100       | -10.71
39           | loop-rotate - 73              | instcombine - 80       | loop-rotate - 73      | -7.14
26           | loop-rotate - 73              | instcombine - 80       | simplifycfg - 76      | -4.92
13           | loop-unswitch - 75            | instcombine - 80       | loop-unswitch - 75    | -5.16
23           | loop-unswitch - 75            | instcombine - 80       | simplifycfg - 76      | -3.07
9            | loop-rotate - 145             | loop-unroll - 186      | loop-simplify - 182   | -27.72
18           | loop-rotate - 145             | no pattern             | strip-dead-prot - 194 | -24.68
17           | sroa - 8                      | loop-rotate - 73       | ipsccp - 19           | -16.23
8            | sroa - 8                      | instcombine - 80       | globalopt - 20        | -11.38
38           | sroa - 8                      | instcombine - 53       | sroa - 8              | -9.22
3            | no pattern                    | no pattern             | licm - 192            | -4.00
14           | sroa - 8                      | indvars - 86           | functionattrs - 33    | -3.85
4            | no pattern                    | no pattern             | strip-dead-prot - 194 | -3.07
(b) Advanced nightly regression report for the Cortex-A53 processor.
Fig. 4: Advanced regression reports using our technique on the Milepost-GCC
benchmarks.
The analysis of performance degradations between consecutive transforma-
tions provides a means of improving the overall quality of the optimizations
constituting the -O3 level. Such degradations are a direct indication of incorrect
or inadequate transformation behavior, unless the degradation is transitory and
enables subsequent, highly effective optimizations.
ID | Opportunity                        | Category | Target and Benchmark ID          | Location in Repository [Z. ]
1  | If-conversion heuristics           | GI       | i5-6300U: 8; A53: 31             | results/i5/benchmark-8; results/A53/benchmark-31
2  | Dead code in unrolling             | GI       | A53: 9; A53: 18                  | results/A53/benchmark-9; results/A53/benchmark-18
3  | Tuning of unrolling parameters     | TA       | A53: 9; A53: 18                  | results/A53/benchmark-9; results/A53/benchmark-18
4  | Store-vs-recompute tradeoffs       | TA       | A53: 10                          | results/A53/benchmark-10
5  | Explicit conversion instructions   | TS       | A53: 7; A53: 42                  | results/A53/benchmark-7; results/A53/benchmark-42
6  | Better-predicted branch conditions | TS       | A53: 17; A53: 36                 | results/A53/benchmark-17; results/A53/benchmark-36
Categories: GI: General Improvement, TA: Target-Aware heuristic tuning, TS:
Target-Specific optimization refinement
Table 1: Selected compiler improvement opportunities with locations of example
target code in [Z. ].
In the following sections we illustrate one possible approach to analyzing
the data produced by optimization-enhanced nightly regression tests. The list
of findings in this illustrative study is by no means exhaustive and additional
compiler improvement opportunities could be identified by further exploring
the collected data. We begin the analysis with the identification of recurring
sources of untapped optimization potential on the i5-6300U and Cortex-A53
platforms. We then review the reasons for the potential gains and the ways in
which the potential is canceled. We mainly focus on the Cortex-A53 architecture
which exhibits a more diverse range of performance and code size artefacts, and
we only use the i5-6300U case for demonstrating potential cross-architecture
optimization opportunities.
The findings are grouped into three categories corresponding to compiler
reengineering tasks with increasing levels of knowledge and understanding of the
target architectures: generic optimization improvements, target-aware heuristic
tuning, and target-specific optimization refinement (Table 1). Generic optimiza-
tion improvements are expected to be applicable to all targets, or to large classes
of targets sharing a common feature such as predicated instructions or advanced
branch predictors. Target-aware heuristic tuning is intended to help better
exploit the target architectures without modifying the common optimizer of
a compiler. Finally, findings falling into the target-specific optimization refine-
ment category identify the interactions between architectural mechanisms and
compiler technology which cannot be easily captured in a common optimizer.
For each case discussed below, a set of supporting IR and object files is
available in repository [Z. ] at the location indicated in the corresponding entry
of Table 1. Each set contains matching IR and target object files corresponding
to:
– the state of optimization immediately before and after the transformation
that introduces the better-than-O3 performance;
– the state of optimization immediately before and after the transformation
that discards the corresponding gains;
– the outcome of the standard -O3 optimization flow.
4.2 Identifying recurring patterns of optimization potential
As shown in Figure 4, there is potential for improvement over the -O3 perfor-
mance baseline across recurring ranges of optimization passes. The number of
benchmarks sharing a given “opportunity range” is a direct indication of the rele-
vance of that range, and can be directly used to focus the compiler re-engineering
effort. For each such range, the first configuration which exhibits the potential
gains helps identify the unexploited feature, whereas the configuration which
cancels the potential improvement points directly to the counter-productive op-
timization. Since our enhanced nightly regression system stores all IR files, the
corresponding object files, and the executables for all configurations being tested,
the compiler engineer can start the analysis process by reviewing the IR files gen-
erated before and after the passes that delimit each opportunity range.
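A first review of the IR on either side of a delimiting pass can be as simple as a textual diff. The sketch below uses Python's standard `difflib` module; the function name `ir_diff`, the file names, and the two hand-written IR snippets (a conditional branch collapsed into a `select`) are illustrative assumptions and are not taken from the stored regression artefacts.

```python
import difflib


def ir_diff(before, after, before_name="before.ll", after_name="after.ll"):
    """Return a unified diff (list of lines) between two IR texts."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=before_name,
            tofile=after_name,
            lineterm="",
        )
    )


# Hand-written stand-ins for the IR stored before and after a pass.
BEFORE = """\
define i32 @pick(i1 %c, i32 %a, i32 %b) {
entry:
  br i1 %c, label %then, label %else
then:
  br label %end
else:
  br label %end
end:
  %r = phi i32 [ %a, %then ], [ %b, %else ]
  ret i32 %r
}
"""

AFTER = """\
define i32 @pick(i1 %c, i32 %a, i32 %b) {
entry:
  %r = select i1 %c, i32 %a, i32 %b
  ret i32 %r
}
"""

diff = ir_diff(BEFORE, AFTER)
print("\n".join(diff))
```

In practice the two strings would be read from the IR files kept by the regression system for the configurations that open and close an opportunity range.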
The largest cluster of optimization configurations offering hidden optimization
potential on the i5-6300U architecture involves 9 benchmarks with potential
performance gains of up to 71% (cf. Figure 4a). The corresponding opportunity
range begins at the first application of the scalar replacement of aggregates pass
(sroa 9) and ends with the subsequent application of the control flow graph
simplification pass (simplifycfg 34). Overall, applications of simplifycfg remove
the potential gains in 10 out of 21 benchmarks: the 9 benchmarks of this cluster
at simplifycfg 34, and benchmark 24 at simplifycfg 90.
The analysis of the generated target code and the IR files shows that on i5-
6300U the first application of the simplifycfg pass is repeatedly too aggressive
in applying the conversion of conditional control flow to predicated instructions
(called also if-conversion) inside loop bodies. In benchmark 8 (Figure 5a), the
potential gain is available until the application of pass simplifycfg 34. This
pass replaces a sequence of four conditional loopback jumps with the computation
of the loopback condition using predicated instructions and a single conditional
jump. As a result, the average loop execution time increases three-fold, thus canceling
almost the entire gain potential. This behavior calls for an in-depth revision of
if-conversion strategies in the compiler and is a clear opportunity for a generic
optimization improvement that should benefit multiple targets, cf. case 1 in Ta-
ble 1.
[Figure 5, plots omitted. Each panel plots Execution Time and Code Size as a
percentage relative to -O3 (y-axis) against the Compilation Configuration
(x-axis), i.e., the numbered sequence of optimization passes.]
(a) Impact of pass simplifycfg 34 in benchmark 8 (consumer-jpeg-c-src-jchuff-9-1)
on i5-6300U
(b) Impact of pass instcombine 27 in benchmark 10 (consumer-lame-src-fft-2-1)
on Cortex-A53
(c) Interaction of passes loop-rotate 145 and loop-unroll 186 on Cortex-A53.
Upper graph: benchmark 9 (consumer-jpeg-c-src-jfdctint-2-1); lower graph:
benchmark 18 (consumer-mad-src-layer3-5-1).
Fig. 5: Selected examples of better-than-O3 optimization potential. Note these
figures are similar to Figure 3 but with the first 3 configurations (-O0, -O0-
custom, simplifycfg) removed. These configurations were significantly slower than
-O3, and thus, they were obfuscating the rest of the configurations’ results.
The largest cluster of similarly behaving benchmarks on Cortex-A53 (see Figure 4b)
consists of benchmarks 10, 36, 42, 31, and 7. Its corresponding opportunity
range starts with the first “scalar replacement of aggregates” (sroa 8) pass
and ends with the first application of the instruction combiner (instcombine
27) pass. The instruction combiner pass removes many of the explicit conversion
instructions and performs selective if-conversion. A deeper analysis of the IR
files and the generated target code for the benchmarks of the cluster leads to a
broad range of findings:
– In benchmark 31, the source of the hidden performance potential is the
presence of explicit conditional control flow with unbalanced workloads in
the “true” and “false” paths. The instruction combiner pass replaces the
explicit conditional control flow structure with predicated instructions, thus
aligning the critical path of the resulting code on the longest of the critical
paths of the original control flow structure. As in the case of i5-6300U and
the simplifycfg 34 pass, this issue signals a deficiency of the if-conversion
strategy and is an example of a general optimization improvement which can
benefit all targets. The similarity with the case of benchmark 8 on i5-6300U
suggests that the two deficiencies of if-conversion may have to be addressed
in conjunction, and have therefore been grouped together as case 1 in Table 1.
– In benchmark 10 (Figure 5b, case 4 in Table 1), the presence of an explicit
conversion instruction forces the recomputation of a value which would oth-
erwise require an additional register. The corresponding reduction in register
pressure increases the performance and reduces both memory traffic and the
actual code size. This case can lead to target-aware heuristic tuning of store-
vs.-recompute tradeoffs.
– The presence of explicit conversion instructions enables the recognition of
complex instruction patterns involving explicit conversions (multiply-accumulate
in benchmark 42 and addition/subtraction with operand shift in benchmark
7, case 5 in Table 1) and the use of seemingly faster branch instructions (con-
ditional branches on signed rather than unsigned comparison conditions in
benchmark 36, case 6 in Table 1). These three cases are linked to the specific
instruction set and the microarchitectural behavior of the target architecture
and belong to the category of target-specific optimization refinements.
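The if-conversion effect reported for benchmark 31 can be made concrete with a toy cost model. The sketch below is a deliberate simplification and not a model of either processor: the function names, the cycle counts, the misprediction rate and the penalty are all made-up assumptions. It only shows why aligning the critical path on the longer of two unbalanced paths can cost more than keeping a well-predicted branch.

```python
def branch_cost(p_true, cost_true, cost_false, mispredict_rate, penalty):
    """Expected cycles when the two paths stay behind a conditional branch."""
    expected_path = p_true * cost_true + (1.0 - p_true) * cost_false
    return expected_path + mispredict_rate * penalty


def predicated_cost(cost_true, cost_false):
    """Cycles after if-conversion in this toy model: both paths are computed,
    so the critical path follows the longer of the two."""
    return max(cost_true, cost_false)


# Unbalanced workload: the long "true" path is taken only 10% of the time.
with_branch = branch_cost(
    p_true=0.1, cost_true=20, cost_false=2, mispredict_rate=0.05, penalty=15
)
if_converted = predicated_cost(cost_true=20, cost_false=2)
print(f"branch: {with_branch:.2f} cycles, if-converted: {if_converted} cycles")
```

Under these assumed numbers the branchy version averages about 4.55 cycles per execution while the if-converted version always pays 20, which is the kind of imbalance an if-conversion heuristic would need to detect.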
The opportunity ranges opened on Cortex-A53 by loop rotate passes (loop-
rotate 73 and loop-rotate 145) are associated with loop transformations. Op-
timization opportunities offered by the second loop rotation pass (loop-rotate
145, cf. Figure 5c) are more significant and illustrate a changing behavior of
the compiler regarding the interactions between loop vectorization and loop un-
rolling.
In benchmark 9 (upper graph of Figure 5c), pass loop-rotate 145 vectorizes
the original loop, yielding an outer loop with only two iterations and a perfor-
mance improvement of 27.7% over the code generated using the standard -O3
optimizations. The subsequent unrolling of the outer loop in pass loop-unroll
186 fully unrolls the loop body producing code that is fully sequential but twice
as large, and the corresponding performance loss may be caused by instruction
cache thrashing artefacts.
In contrast, in benchmark 18 (lower graph of Figure 5c) a similar performance
gain is achieved through loop vectorization, but it is not canceled by a subsequent
loop unrolling of the vectorized loop. This difference in behavior is explained
by the fact that quantitative settings of the loop-unroll pass depend on the
optimization flag used when invoking the optimizer. Our optimization sequences
start from level -O0 and therefore, the loop unrolling passes use the default loop
unrolling threshold value applied at optimization levels lower than -O3.
On the other hand, the standard optimization sequence of the -O3 level uses
a default unroll threshold value which is twice as large, enabling the unrolling
where our partial optimization sequences prevent it. This artefact raises the
importance of target-aware heuristic tuning (case 3 in Table 1) which requires
significant understanding of the target architecture, but may be needed to utilize
the target architecture at its best.
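The role of the unrolling threshold can be illustrated with a simplified decision rule. The sketch below is not LLVM's actual cost model: the function name, the size metric and the concrete threshold values are hypothetical, and only the factor of two between the two thresholds is taken from the discussion above.

```python
def fully_unrolls(body_size, trip_count, threshold):
    """Toy rule: fully unroll when the unrolled loop fits within the budget."""
    return body_size * trip_count <= threshold


BASE_THRESHOLD = 150               # hypothetical value used below -O3
O3_THRESHOLD = 2 * BASE_THRESHOLD  # "twice as large", as stated in the text

# Hypothetical vectorized outer loop: large body, only two iterations.
body_size, trip_count = 120, 2

unrolled_below_o3 = fully_unrolls(body_size, trip_count, BASE_THRESHOLD)
unrolled_at_o3 = fully_unrolls(body_size, trip_count, O3_THRESHOLD)
```

With these assumed sizes the loop is left intact under the lower threshold and fully unrolled under the -O3 threshold, mirroring the difference between the partial optimization sequences and the standard -O3 flow.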
In Figure 5c, the final loop unrolling pass (loop-unroll 186) not only im-
pacts performance, but also cancels the potential for code size reduction observed
in benchmarks 9 and 18. The increase in code size caused by this optimization
pass is linked to the introduction of additional “catch-up” loops intended to han-
dle the cases where the actual number of iterations is not known beforehand and
might not be a multiple of the unrolling factor. However, in the tested bench-
marks the loop has a constant number of iterations and once vectorized, it is
fully unrolled to linear code. This means that the catch-up loops are redundant
and should be removed, yet they are left in the code, which calls for a generic
optimization improvement (cf. case 2 in Table 1).
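The structure of an unrolled loop with a catch-up loop can be sketched as follows. The example is written in Python purely for illustration (the function name and the unrolling factor of 4 are assumptions); it shows that the catch-up loop does no work whenever the iteration count is a multiple of the unrolling factor, which is the situation in which a compiler that knows the constant trip count could drop it.

```python
def sum_unrolled(values):
    """Sum `values` with a main loop unrolled by 4 plus a catch-up loop.

    Returns (total, catch_up_iterations).
    """
    factor = 4
    n = len(values)
    main_end = n - (n % factor)
    total = 0
    # Main loop: four elements per iteration.
    for i in range(0, main_end, factor):
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
    # Catch-up loop: handles the leftover n % factor elements.
    catch_up = 0
    for i in range(main_end, n):
        total += values[i]
        catch_up += 1
    return total, catch_up


exact = sum_unrolled(list(range(8)))      # length is a multiple of 4
leftover = sum_unrolled(list(range(10)))  # two elements left over
```

For the 8-element input the catch-up loop runs zero times; for the 10-element input it runs twice.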
As a last example, the analysis of behavior of benchmark 17 on Cortex-A53
leads to a potential target-specific optimization refinement (case 6 in Table 1): the
loss of performance potential during the first loop-rotate pass (loop-rotate 73)
corresponds to the inversion of conditional branch conditions in the benchmark
core loop, with all other instructions of the core loop remaining identical. The
associated 16.2% decrease in code performance hints at a branch prediction
artefact that could be related to the findings of benchmark 36 (described above)
in which the benchmark performance is directly linked to the relative execution
times of signed vs. unsigned conditional branch instructions.
4.3 Leveraging the identified optimization opportunities
The example findings described in the preceding section suggest that our ap-
proach of testing the quality of partial optimization configurations in compilers
can benefit the compiler technology community, industrial users and developers
of compilers, as well as hardware architects. Generic optimization improvement
opportunities, once identified, should ideally be reported to compiler maintainers
and the compiler technology community at large. The resulting improvements
in the given compiler will benefit all developers and users of that compiler on
many if not all target platforms it supports.
Target-aware heuristic tuning opportunities are of particular importance to
developers and maintainers of industrial compilers, who focus on the best possible
utilization of their target architectures. The findings can help select the
most appropriate values of quantitative parameters of transformations, if these
parameters can be controlled by the user (such as the loop-unrolling threshold),
and can identify the cases where new parameters should be introduced.
The performance potential identified in nightly regression tests can then be
easily made available to users, e.g., by supplying sets of parameter options tuned
for the different configurations of the target architecture. In addition, the per-
formance of the generated code can be finely matched to the target platforms
without affecting the basic principle of a common optimizer and without having
to modify the optimizer code (with all the quality risks that would imply).
Finally, target-specific optimization refinements identify subtle interactions
between architectural mechanisms and compiler technology which may require a
coordinated effort of the hardware and compiler communities. This category of
findings requires by far the deepest levels of hardware architecture and compiler
technology knowledge. Findings regarding the behavior of branch predictors, for
example, can simultaneously provide useful feedback to hardware architects and
to compiler developers. The former can gain additional awareness of the ways
the branch prediction is behaving on compiler-generated code, and the latter
will be able to review the flow of predictor-aware code generation. We have seen
in the previous section that branch prediction and code generation may interfere
in significant ways.
In order to assess the actual impact of these interactions, the static analysis
of generated code may prove insufficient, requiring detailed information about
the behaviour of specific micro-architectural features of the target platform,
e.g., in the form of data from hardware performance counters [Opr]. The use of
performance counters requires a good understanding of the target architecture,
making them a tool aimed primarily at expert compiler engineers.
However, once the correlation between specific hardware events, the readings
of the performance counters and the effects of a given optimization has been
established, the monitoring of the relevant hardware events can be integrated
into the nightly regression tests as a further metric to be tracked alongside
execution time, code size or energy consumption.
5 Related Work
Auto-tuning of compiler optimizations has emerged in the last decade, taking two
main forms: iterative and machine-learning-based (MLB) compilation [WO18,
AKC+18b]. Typically, the aim is to find new optimization sequences that can
outperform what the standard compiler optimization levels can achieve in terms
of effective resource usage on an architecture; the resource of interest being exe-
cution time, energy consumption or memory usage (code size). The motivation
for automatic tuning is that the possible optimization configuration space is too
large to be explored in practice, and thus, hidden optimization opportunities can
exist within that space. These can outperform the standard optimization levels
for a specific architecture or programming language. For example, GCC v4.7 has
2^82 possible optimization combinations [PHB15], not counting the possible values
of quantitative parameters. The concept of common architecture-independent
optimizers, while helping compiler developers in supporting more programming
languages and more architectures, has the adverse effect of preventing high-level
optimizations from matching target architectures’ quantitative characteristics.
This can produce suboptimal executables in terms of efficiently using a specific
architecture’s resources.
Iterative compilation typically randomly samples the optimization configura-
tion space until finding a configuration that outperforms a predefined optimiza-
tion level [ABP+17]. The technique has in many cases proven to provide signifi-
cant performance gains [BKK+98, FLSU18], but typically a large number of op-
timization configurations, in the order of hundreds to thousands, need to be eval-
uated before reaching any performance gains over standard optimization levels.
Thus, iterative compilation has been traditionally used as a baseline to assess the
performance of MLB compiler auto-tuning techniques [Fea11, ABP+17, BRE15].
MLB techniques aim to beat the performance of iterative compilation by finding
a better optimization configuration in a shorter time. Thus, MLB techniques try
to strategically sample the optimization configuration space based on the models
built during their training phase. Such models are being trained on either static
code features [Fea11] or profiling information [CFA+07], such as performance
counter values that characterize the programs in the training set, and a per-
formance metric for the dependent variable. An example of such a performance
metric is the execution time of programs when applying a specific optimization
configuration.
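For readers unfamiliar with iterative compilation, the random-sampling loop described above can be sketched as follows. This is a generic illustration rather than any of the cited systems: the function name, the flag names `f0` to `f3`, and the toy timing model standing in for an actual compile-and-run step are all assumptions.

```python
import random


def iterative_search(flags, measure, baseline, budget, seed=0):
    """Randomly sample flag subsets until one beats `baseline`.

    Returns (flag_subset, time, evaluations), or None if the budget runs out.
    """
    rng = random.Random(seed)
    for evaluations in range(1, budget + 1):
        # Each flag is switched on with probability 0.5.
        subset = frozenset(f for f in flags if rng.random() < 0.5)
        elapsed = measure(subset)
        if elapsed < baseline:
            return subset, elapsed, evaluations
    return None


def toy_measure(subset):
    """Made-up timing model: f1 helps, f3 hurts, other flags are neutral."""
    return 10.0 - 2.0 * ("f1" in subset) + 3.0 * ("f3" in subset)


flags = ["f0", "f1", "f2", "f3"]
result = iterative_search(flags, toy_measure, baseline=10.0, budget=500)
```

Note that the search reports which subset was faster but not which pass interaction caused the difference, which is the limitation discussed in the remainder of this section.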
Typically, these techniques require a large training phase [OPWL17] to create
their predictive models. Furthermore, they are hardly portable across different
compilers, different versions of the same compiler, or different architectures.
Even if a single flag is added to the set of a compiler’s existing flags, the whole
training phase has to be repeated. Moreover, extracting some of the metrics
that these techniques depend on, such as static code features, might require a
significant amount of engineering [WO18]. Thus, MLB techniques are inadequate
for systematic testing and improvement of compilers.
Furthermore, while iterative compilation and MLB approaches aim to assist
software developers in improving their applications’ resource usage by auto-tuning
the compiler’s settings, they offer limited value to the compiler engineer on
how to improve the compiler’s common optimizer. This is because they typically
offer limited information regarding the potential causes of their achieved
gains over a standard optimization level. In addition, MLB approaches only
provide suggestions of good optimization sequences, which might or might not
work well on applications that are unseen by the machine-learning training phase.
Compiler engineers need more concrete evidence to guide their efforts of tuning
the compiler’s common optimizer.
Our enhanced nightly regression system, introduced in Section 4, offers a dif-
ferent approach which can assist the compiler engineer to “debug” the compiler
optimization sequences in terms of their effectiveness in a systematic way. This
is due to: a) the ability of our technique to attribute the optimization effects ob-
served to specific transformation passes exercised in an optimization sequence,
and b) the technique offering concrete data to drive the tuning of the common
optimizers, instead of MLB predictions.
Energy consumption of computing is becoming critically important for eco-
nomic, environmental, and reliability reasons [Eea16, GdSE17]. In [GBXdSE18],
the technique also used in this paper for exploring the standard optimization
levels was able to accurately account for energy consumption through physical
hardware measurements on deeply embedded devices. In future work, we will
explore whether energy profilers [INT] can achieve the same for platforms with
higher-end architectures that do not allow direct energy measurements of the
processor, such as the ones explored in this paper.
6 Conclusion
Traditional auto-tuning techniques, such as iterative compilation and MLB ap-
proaches, are not suitable for routine testing of compilers as they tend to require
a new training phase at each compiler update, or need to run for thousands of
iterations. Furthermore, such techniques typically act as a “black box”, with-
out providing any insights into why any detected optimization configuration
performs better than expected or, conversely, degrades benchmark performance.
Thus, while they can be useful in achieving better resource usage than the stan-
dard optimization levels for particular applications, they are of limited value to
a compiler engineer.
In this paper, we propose a new take on the classic nightly regression
system, enhanced with statistics that expose the behavior of the standard
compiler optimization levels with respect to performance. To achieve this, we
adopt the technique
proposed in [GBXdSE18], i.e. we exploit subsequences of the standard optimiza-
tion levels rather than arbitrary permutations of optimizations. Thus, in contrast
with iterative compilation or MLB techniques, our approach offers compiler en-
gineers an intuitive way of correlating performance variations with the internal
structure of the optimizer.
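As a minimal sketch of how such configurations could be enumerated, the following assumes that a subsequence is a prefix of the ordered pass list of a standard optimization level; the precise construction is the one defined in [GBXdSE18] and may differ. The helper name prefix_subsequences is hypothetical, and the four-pass list is a heavily shortened stand-in for a real -O3 sequence.

```python
from typing import List, Sequence


def prefix_subsequences(passes: Sequence[str]) -> List[List[str]]:
    """Return every non-empty prefix of an ordered pass list, shortest first."""
    return [list(passes[:n]) for n in range(1, len(passes) + 1)]


# Shortened, illustrative pass list; a real -O3 sequence is much longer.
example_passes = ["-simplifycfg", "-sroa", "-early-cse", "-instcombine"]

configurations = prefix_subsequences(example_passes)
for config in configurations:
    # Each line is one candidate configuration to compile and measure.
    print(" ".join(config))
```

A sequence of n passes yields only n configurations under this assumption, compared with the factorially many orderings that arbitrary permutations would produce, and each configuration differs from its predecessor by exactly one pass.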
By applying the technique to benchmarks from the LLVM test-suite, we
established the existence of significant optimization opportunities within the
standard optimization levels, firstly, on more complex architectures (the
x86-based i5-6300U and the ARMv8-A-based Cortex-A53) than the deeply embedded
ones used in [GBXdSE18], and secondly, across multiple versions of the LLVM
compiler, namely the LLVM v3.8 and v5.0 examined in [GBXdSE18] and also
the v6.0 examined in this paper. These findings motivated our investigation into
how the technique can be utilized for systematic tuning of compiler optimizers.
Significant performance gains were observed for more than half of the 42
benchmarks tested, with an average of 11.5% and 5.1% execution time improve-
ment for the i5-6300U and the Cortex-A53 processors, respectively. These results
were collected, classified and exploited by the proposed nightly regression
system to expose a series of potential architecture-dependent and
cross-architecture optimizations (see Section 4).
This is of significant value for compiler engineers who can focus their ef-
forts on exploiting the hidden gains and removing the shortcomings of the key
performance-affecting optimizations. The resulting insights may lead to cross-
architecture optimizer improvements that benefit all users of the compiler, to
architecture-specific tuning relevant for suppliers and users of industrial compil-
ers, and to new ways of handling innovative hardware mechanisms at compiler
level. To the best of our knowledge, this is the first work on automated tuning
of compilers that enables the discovery and the analysis of new optimization
potential to this extent.
In the future, we plan to extend our nightly regression system with the collec-
tion of hardware performance counter data to further support compiler engineers
in identifying and exploiting potential optimization opportunities that are not
statically analyzable and may be linked to micro-architectural features of the
target processors.
Acknowledgments
This research is supported by the European Union’s Horizon 2020 Research and
Innovation Programme under grant agreement No. 779882, TeamPlay (Time,
Energy and security Analysis for Multi/Many-core heterogeneous PLAtforms).
References
ABP+17. Amir H. Ashouri, Andrea Bignoli, Gianluca Palermo, Cristina Silvano,
Sameer Kulkarni, and John Cavazos. Micomp: Mitigating the compiler
phase-ordering problem using optimization sub-sequences and machine
learning. ACM Trans. Archit. Code Optim., 14(3):29:1–29:28, Septem-
ber 2017. URL: http://doi.acm.org/10.1145/3124452, doi:10.1145/
3124452.
AKC+18a. A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Sil-
vano. A Survey on Compiler Autotuning using Machine Learning.
CoRR, abs/1801.04405, 2018. URL: http://arxiv.org/abs/1801.04405,
arXiv:1801.04405.
AKC+18b. Amir Hossein Ashouri, William Killian, John Cavazos, Gianluca Palermo,
and Cristina Silvano. A survey on compiler autotuning using machine
learning. CoRR, abs/1801.04405, 2018. URL: http://arxiv.org/abs/
1801.04405, arXiv:1801.04405.
BKK+98. François Bodin, Toru Kisuki, Peter Knijnenburg, Mike O’Boyle, and
Erven Rohou. Iterative compilation in a non-linear optimisation space.
In Workshop on Profile and Feedback-Directed Compilation, Paris, France,
Oct 1998. URL: https://hal.inria.fr/inria-00475919.
BRE15. Craig Blackmore, Oliver Ray, and Kerstin Eder. A logic programming
approach to predict effective compiler settings for embedded software.
Theory and Practice of Logic Programming, 15(4-5):481–494, 2015. doi:
10.1017/S1471068415000174.
CFA+07. John Cavazos, Grigori Fursin, Felix Agakov, Edwin Bonilla, Michael F. P.
O’Boyle, and Olivier Temam. Rapidly selecting good compiler opti-
mizations using performance counters. In Proceedings of the Interna-
tional Symposium on Code Generation and Optimization, CGO ’07, pages
185–197, Washington, DC, USA, 2007. IEEE Computer Society. URL:
http://dx.doi.org/10.1109/CGO.2007.32, doi:10.1109/CGO.2007.32.
Cora. Arm Cortex-M0 Processor. https://developer.arm.com/products/
processors/cortex-m/cortex-m0 (accessed February 19, 2018).
Corb. Arm Cortex-M3 Processor. https://developer.arm.com/products/
processors/cortex-m/cortex-m3 (accessed February 19, 2018).
cTu. cTuning Foundation and Dividiti. Collective Knowledge. http://
cknowledge.org/ (accessed October 18, 2018).
Eea16. K. Eder et al. ENTRA: Whole-systems energy transparency.
Microprocess. Microsyst., 47, Part B:278–286, November
2016. URL: https://doi.org/10.1016/j.micpro.2016.07.003, doi:
10.1016/j.micpro.2016.07.003.
Fea11. Grigori Fursin et al. Milepost gcc: Machine learning enabled
self-tuning compiler. International Journal of Parallel Programming,
39(3):296–327, Jun 2011. URL: https://doi.org/10.1007/s10766-010-0161-2,
doi:10.1007/s10766-010-0161-2.
FLSU18. Grigori Fursin, Anton Lokhmotov, Dmitry Savenko, and Eben Up-
ton. A collective knowledge workflow for collaborative research
into multi-objective autotuning and machine learning techniques.
CoRR, abs/1801.08024, 2018. URL: http://arxiv.org/abs/1801.08024,
arXiv:1801.08024.
GBXdSE18. Kyriakos Georgiou, Craig Blackmore, Samuel Xavier-de Souza, and Ker-
stin Eder. Less is more: Exploiting the standard compiler optimization
levels for better performance and energy consumption. In Proceedings
of the 21st International Workshop on Software and Compilers for Em-
bedded Systems, SCOPES ’18, pages 35–42, New York, NY, USA, 2018.
ACM. URL: http://doi.acm.org/10.1145/3207719.3207727, doi:10.
1145/3207719.3207727.
GCC. GCC team. GCC, the GNU Compiler Collection. https://gcc.gnu.org/
(accessed February 10, 2019).
GdSE17. K. Georgiou, S. Xavier de Souza, and K. Eder. The IoT energy challenge:
A software perspective. IEEE Embedded Systems Letters, PP(99):1–1,
2017. doi:10.1109/LES.2017.2741419.
INT. INTEL Open Source Org. RAPL Power Meter. https://01.org/
rapl-power-meter (accessed February 20, 2019).
LA04. C. Lattner and V.S. Adve. LLVM: A compilation framework for lifelong
program analysis and transformation. In CGO, pages 75–88, 2004.
LLVa. LLVM Org. LLVM Test Suite - MiBench. https://github.com/llvm/
llvm-test-suite/tree/master/MultiSource/Benchmarks/MiBench
(accessed January 29, 2019).
LLVb. LLVM Org. LLVM’s Analysis and Transform Passes.
https://llvm.org/docs/Passes.html (accessed February 19,
2018).
LLVc. LLVM Org. The LLVM Compiler Infrastructure. http://www.llvm.org/
(accessed January 19, 2019).
Opr. Oprofile community. Oprofile - An open source project that includes a sta-
tistical profiler for Linux systems. http://cknowledge.org/ (accessed
February 20, 2019).
OPWL17. W. F. Ogilvie, P. Petoumenos, Z. Wang, and H. Leather. Minimizing the
cost of iterative compilation with active learning. In 2017 IEEE/ACM
International Symposium on Code Generation and Optimization (CGO),
pages 245–256, Feb 2017. doi:10.1109/CGO.2017.7863744.
PHB15. James Pallister, Simon J. Hollis, and Jeremy Bennett. Identifying com-
piler options to minimize energy consumption for embedded platforms.
The Computer Journal, 58(1):95–109, 2015. URL: http://dx.doi.org/
10.1093/comjnl/bxt129, doi:10.1093/comjnl/bxt129.
WO18. Z. Wang and M. O’Boyle. Machine learning in compiler optimization.
Proceedings of the IEEE, 106(11):1879–1901, Nov 2018. doi:10.1109/
JPROC.2018.2817118.
Z. . Z. Chamski and K. Georgiou. “Lost in translation” GitHub repository.
https://github.com/TrustworthySystemLab/LostInTranslation
(accessed March 25, 2019).
