Portable compiler optimisation across embedded programs and microarchitectures using machine learning by Christophe Dubach et al.
Portable Compiler Optimisation Across Embedded
Programs and Microarchitectures using Machine Learning
Christophe Dubach,
Timothy M. Jones,
Edwin V. Bonilla
Members of HiPEAC
School of Informatics
University of Edinburgh, UK
Grigori Fursin
Member of HiPEAC
INRIA Saclay, France
Michael F.P. O’Boyle
Member of HiPEAC
School of Informatics
University of Edinburgh, UK
ABSTRACT
Building an optimising compiler is a difﬁcult and time consum-
ing task which must be repeated for each generation of a micro-
processor. As the underlying microarchitecture changes from one
generation to the next, the compiler must be retuned to optimise
speciﬁcally for that new system. It may take several releases of the
compiler to effectively exploit a processor’s performance potential,
by which time a new generation has appeared and the process starts
again.
We address this challenge by developing a portable optimising
compiler. Our approach employs machine learning to automati-
cally learn the best optimisations to apply for any new program on
a new microarchitectural conﬁguration. It achieves this by learn-
ing a model off-line which maps a microarchitecture description
plus the hardware counters from a single run of the program to the
best compiler optimisation passes. Our compiler gains 67% of the
maximum speedup obtainable by an iterative compiler search using
1000 evaluations. We obtain, on average, a 1.16x speedup over the
highest default optimisation level across an entire microarchitec-
ture conﬁguration space, achieving a 4.3x speedup in the best case.
We demonstrate the robustness of this technique by applying it to
an extended microarchitectural space where we achieve compara-
ble performance.
Categories and Subject Descriptors
D.3.4 [Programming languages]: Processors—Compilers; Opti-
mization; Retargetable compilers; C.0 [Computer Systems Or-
ganization]: General—Hardware/software interfaces; C.4 [Com-
puter Systems Organization]: Performance of systems—Design
studies; Modeling techniques; I.2.6 [Artificial intelligence]: Learning.
General Terms
Design, Experimentation, Performance.
MICRO’09, December 12–16, 2009, New York, NY, USA.
Copyright 2009 ACM 978-1-60558-798-1/09/12 ...$10.00.
Keywords
architecture/compiler co-design, design-space exploration, machine
learning.
1. INTRODUCTION
Creating an optimising compiler for a new microprocessor is a
time consuming and laborious process. For each new microarchi-
tecture generation the compiler has to be retuned and specialised to
the particular characteristics of the new machine. Several releases
of a compiler might be needed to effectively exploit the proces-
sor’s performance potential, by which time the next microarchitec-
ture generation has been developed and the process starts again.
This never-ending game of catch-up means that we rarely exploit a
shipped processor to the full and this inevitably delays the time to
market. Although this is a general issue for all processor domains,
it is particularly acute for embedded systems. Ideally, we would
like a portable compiler technology that provides retargetable op-
timisation while fully exploiting the characteristics of the new mi-
croarchitecture. In other words, given any new processor genera-
tion, deliver a compiler that automatically optimises for that new
target and achieves high performance.
Building such a compiler is, however, extremely challenging.
This is primarily due to the complexity of the underlying machine’s
behaviour and the varying structure of the programs being com-
piled. Iterative compilation, which tunes each new program on
a specific architecture [6, 16, 24, 30], has provided a methodology
to find good optimisations. Techniques such as genetic algorithms [24],
hill climbing [2] or optimisation orchestration [30]
have been explored, all showing impressive performance improve-
ments. Although useful, these approaches all suffer from the large
number of compilations and executions required to optimise each
program. Every time the program or architecture changes, this
time-consuming process must be repeated.
In order to overcome these challenges, researchers have developed
compilers that learn optimisation strategies using prior knowledge
of other programs’ behaviour. Stephenson et al. [34] showed
that genetic programming can learn good individual compiler opti-
misations on a ﬁxed architecture, eliminating the need for any iter-
ative compilations of the new program. Cavazos et al. [3] showed
that this could be used to learn the best set of compiler options
on a ﬁxed architecture. Although these approaches dramatically
reduce or even eliminate the need for extra compilations and exe-
cutions of the target program, they suffer from the need to entirely
retrain the compiler whenever the platform changes.
In this paper we develop, to the best of our knowledge, the first
compiler that can automatically adapt to underlying microarchitectural
changes. This enables portable performance across different
generations of a microprocessor. This represents the first step
towards the development of a universal compiler that can automatically
optimise applications for any platform without requiring extensive
tuning.
Given a new microarchitecture, our approach automatically de-
termines the right optimisation passes for any new program. Our
scheme learns a machine learning model off-line which maps a mi-
croarchitecture description plus the hardware counters from a sin-
gle run of the program to the best compiler optimisation passes.
The learning process is a one-off activity whose cost is amortised
across all future users of the compiler on subsequent variations of
the processor’s microarchitecture.
Using this approach we can, on average, achieve a 1.16x speedup
over the highest default compiler optimisation across 200 microar-
chitectural conﬁgurations. In addition, we show that this approach
achieves 67% of the maximum performance improvement gained
by standard iterative compilation search using 1000 evaluations.
Given our approach, a new compiler does not need to be tuned
whenever the processor microarchitecture changes or a new pro-
gram needs compiling. This allows compilers to become fully in-
tegrated into the design space exploration of a new processor gen-
eration, helping designers to fully evaluate the potential of any new
microarchitecture. Over time, designers may wish to add new
microarchitectural features not originally envisaged. We show that
our approach adapts to new microarchitectural conﬁgurations and
is able to deliver the same level of performance.
In summary, this paper makes the following contributions:
• We develop a machine learning model that can predict the
best optimisation passes to use for any new program when
compiling for a new microarchitecture conﬁguration;
• We show how our scheme accurately delivers the performance
improvements available across the MiBench benchmark suite
and an embedded microarchitectural design space;
• We demonstrate the robustness of our scheme showing that
it delivers comparable performance on an extended microar-
chitecture space.
The next section provides a short example demonstrating the dif-
ﬁculty of achieving portable optimisation. Section 3 provides a
description of how our compiler is trained and deployed using ma-
chine learning. Section 4 describes the experimental setup. Sec-
tion 5 then evaluates the technique and analyses the results. Sec-
tion 7 evaluates this approach on a new extended space. This is
followed by a description of related work in section 8. Finally, sec-
tion 9 concludes the paper.
2. EXAMPLE
In this paper we limit our study to selecting the right passes
within an existing compiler framework for varying programs and
microarchitectures. Although this may seem like a restricted set-
ting, the best optimisation passes to apply vary signiﬁcantly be-
tween programs and microarchitectural conﬁgurations. Finding the
best set of optimisation passes across programs and microarchitec-
tures is highly non-trivial.
To illustrate this point, consider ﬁgure 1 which shows segment
diagrams for three programs (rijndael_e, untoast and madplay) on
three microarchitectures from our design space (described in sec-
tion 4). For these three programs and microarchitectures, we found
the best optimisation passes to apply (described in section 4.3).
These optimisations lead to signiﬁcant speedups, ranging from
1.16x to 2.62x speedup over the highest default optimisation level.
In this example we show only five significant optimisation passes:
block reordering, loop unrolling, function inlining, instruction
scheduling and global common sub-expression elimination, labelled
on the right of figure 1. For each program/microarchitecture pair
there is a circle of five segments representing the five passes. If
the segment is filled, then the corresponding optimisation should
be enabled for the given program and microarchitecture. If empty,
it should be disabled.

[Figure 1: segment diagrams for the programs rijndael_e, untoast and
madplay on microarchitectures A (XScale), B (XScale with small insn
cache) and C (XScale with small insn cache and small data cache).
Segments: global common subexpression elimination, block reordering,
loop unrolling, function inlining, instruction scheduling.]
Figure 1: Segment diagrams representing the optimisation
passes to enable in order to achieve the highest performance
for three programs executed on three microarchitectures. A
filled segment means that the optimisation should be enabled,
an empty one means it should be disabled.
What is immediately clear is that the best set of optimisation
passes to apply changes across programs and microarchitectures.
If we consider madplay for instance, three optimisations should
be enabled for microarchitecture A, a different set of three for B
and four enabled for conﬁguration C. If we now consider microar-
chitecture B, two optimisations should be turned on for rijndael_e,
four when compiling untoast and only three for madplay. Given the
large number of optimisations available in a typical compiler, pro-
viding a portable optimising compiler for programs and microar-
chitectures is non-trivial.
However, there are similarities between programs and microar-
chitectures that we can exploit. The best set of optimisations for
rijndael_e on conﬁguration C and madplay on microarchitecture A
are exactly the same. This is also true for untoast on microarchitec-
tures B and C and madplay on conﬁguration C. If we can somehow
characterise the program madplay on conﬁguration A and relate
it to the characteristics of rijndael_e on microarchitecture C, then
we can apply the same optimisation passes to madplay as we did
to rijndael_e. This will allow us to obtain the best speedups on
this new program/microarchitecture pair without having ever seen
madplay or conﬁguration A before. In the next section we develop
a machine learning model that automatically identiﬁes these simi-
larities. We then use and evaluate it in section 5 to provide portable
optimisation across programs and microarchitectures.
3. ENABLING PORTABLE OPTIMISATION
As seen in the previous section, the best performance is achieved
by applying different optimisations depending on the program and
the underlying microarchitecture. This means that with current
approaches to tuning an optimising compiler [1, 3, 29, 34] a new
compiler needs to be developed for each new generation or variation of
the microarchitecture. To overcome this problem, we develop a
machine learning model that automatically adapts the compiler’s
optimisation strategy for any program and any microarchitecture
variation.
3.1 Overview
Figure 2 gives an overview of our compiler’s structure. The tool
works like any other compiler, taking as an input the source code
of a program and producing an optimised binary. However, in
addition to the source code our compiler has two other inputs which
it uses internally to optimise the program specifically for the
microarchitecture it will run on.

Figure 2: Overview of our portable optimising compiler. The
compiler takes in a program source, some performance counters
and a microarchitecture description and outputs an optimised
program binary for the microarchitecture. At the heart
of the compiler is a machine learning model that predicts the
best passes to run, controlling the optimisations applied.

Firstly, our compiler takes in a description of the microarchitec-
ture to target. This is similar to standard compilers where this de-
scription is hard-coded in a machine description ﬁle; here it is just
an input. Secondly, it takes in performance counters derived from
a previous run of the program. This is similar to feedback-directed
compilers that typically use proﬁling information from a previous
run to generate an optimised version of the program. However,
unlike any existing technique, our compiler generates an optimised
binary speciﬁcally for the target microarchitecture even when it has
never seen the program or the microarchitecture before. Therefore,
the compiler does not have to be modiﬁed or regenerated whenever
a new program or microarchitecture is encountered.
At the heart of our compiler is a model that correlates the
behaviour of the new input program and microarchitecture with
programs and microarchitectures that it has previously seen. Such a
model is built using machine learning and our approach can be
considered as a three-stage process: generating training data, building
a model and deploying it. The next three subsections describe each
of these activities in detail, allowing us to create the overall
compiler shown in figure 2.
3.2 Generating Training Data
In order to build a model that predicts good optimisation passes,
we need examples of various optimisation passes on different
programs and microarchitectures as well as a description of each
program and microarchitecture. We generate this training data by
evaluating $N$ different sets of optimisation passes, $y$, on a set of
training program/microarchitecture pairs, $X_1, \ldots, X_M$, and
recording their execution times, $t$. We can characterise each
program/microarchitecture pair using a vector of features,
$x_1, \ldots, x_M$. Therefore, for each program/microarchitecture pair
$X_j$ we have an associated dataset
$D_j = \{(y_i, t_i)_j\}_{i=1}^{N}$, with $j = 1, \ldots, M$. Our goal is
to predict the best set of optimisation passes $y^*$ whenever a new
program/microarchitecture pair $X^*$ is encountered.
Although the generated dataset may be large, producing it is a
one-off cost incurred by our model. Furthermore, techniques such as
clustering [31] can reduce this cost; applying them is the subject of
future work.
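The data collection described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the pass names, the program/microarchitecture labels and the `toy_time` function are all assumptions standing in for a real compile-and-simulate step.

```python
import random

def sample_settings(pass_space, n, seed=0):
    """Draw n optimisation settings y_1..y_N uniformly at random."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in pass_space.items()}
            for _ in range(n)]

def build_datasets(pairs, settings, measure_time):
    """Return {X_j: [(y_i, t_i), ...]}, i.e. one dataset D_j per training pair."""
    return {pair: [(y, measure_time(pair, y)) for y in settings]
            for pair in pairs}

# Toy stand-ins (assumptions, not the paper's setup).
PASS_SPACE = {"unroll": [0, 1], "inline": [0, 1], "schedule": [0, 1]}
PAIRS = [("madplay", "A"), ("madplay", "B"), ("untoast", "A")]

def toy_time(pair, y):
    """Fake execution time; a real setup would run the binary on a simulator."""
    _program, uarch = pair
    penalty = 1.5 if (uarch == "B" and y["inline"]) else 0.0
    return 10.0 - 2.0 * y["unroll"] - 0.5 * y["schedule"] + penalty

SETTINGS = sample_settings(PASS_SPACE, n=8)
DATASETS = build_datasets(PAIRS, SETTINGS, toy_time)
```

Each entry of `DATASETS` plays the role of one $D_j$: the same N settings are timed on every training pair, so the expensive part is the N × M calls to the timing function.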
Features
We characterise program interaction with the processor using 11
performance counters, c, and the microarchitectures using 8
descriptors, d. The performance counters are shown in table 1
and are similar to those typically found in processor analytic
models [12, 22]. To capture the features of the microarchitecture we
simply record its static description shown in table 2. The performance
counters from a program running on a microarchitecture, c, are
concatenated together with the microarchitecture description, d, to
form a single feature vector for the program/microarchitecture pair,
x = (c,d).

Table 1: Performance counters used as a representation of
program/microarchitecture pairs.
Performance Counter
Instructions per cycle      Insn cache access rate    ALU usage
Decoder access rate         Insn cache miss rate      Mac usage
Register file access rate   Data cache access rate    Shifter usage
Branch pred. access rate    Data cache miss rate
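As a small illustration of the concatenation x = (c,d), the sketch below builds a 19-element feature vector from the 11 counters of table 1 and the 8 descriptors of table 2. The identifier names and the counter values are invented for the example; only the XScale descriptor values come from table 2, and reading "32K" as 32 × 1024 is an assumption. The text does not say whether features are rescaled before use, so none is applied here.

```python
# Hypothetical identifiers for the 11 counters (table 1) and 8 descriptors (table 2).
COUNTER_NAMES = [
    "ipc", "decoder_access_rate", "regfile_access_rate", "bpred_access_rate",
    "icache_access_rate", "icache_miss_rate", "dcache_access_rate",
    "dcache_miss_rate", "alu_usage", "mac_usage", "shifter_usage",
]
DESCRIPTOR_NAMES = [
    "il1_size", "il1_assoc", "il1_block", "dl1_size", "dl1_assoc",
    "dl1_block", "btb_entries", "btb_assoc",
]

def feature_vector(counters, descriptors):
    """Concatenate counters c and descriptors d into x = (c, d)."""
    c = [float(counters[name]) for name in COUNTER_NAMES]
    d = [float(descriptors[name]) for name in DESCRIPTOR_NAMES]
    return c + d

# XScale column of table 2 (sizes assumed to be multiples of 1024).
XSCALE = {
    "il1_size": 32 * 1024, "il1_assoc": 32, "il1_block": 32,
    "dl1_size": 32 * 1024, "dl1_assoc": 32, "dl1_block": 32,
    "btb_entries": 512, "btb_assoc": 1,
}
# Made-up counter readings from a single profiling run.
COUNTERS = {name: 0.1 * (i + 1) for i, name in enumerate(COUNTER_NAMES)}

X = feature_vector(COUNTERS, XSCALE)
```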
3.3 Building a Model
Our aim is to build a model, M(x,y), that provides the mapping
from any set of program/microarchitecture features to a set of good
optimisation passes: M : x → y. We approach this problem by
learning the mapping from the features, x, to a probability
distribution over good optimisation passes, q(y|x). Once this
distribution has been learnt (see next section), prediction on a new
program and on a new microarchitecture is achieved by taking the mode
of the distribution. Thus we obtain the predicted set of optimisations
by computing:

$$ y^* = \operatorname*{argmax}_{y} \; q(y \mid x^*). \qquad (1) $$

In other words, we find the value of y that gives the greatest
probability of being a good optimisation.
3.3.1 Fitting Individual Distributions
In order to learn the model we need to fit a probability
distribution over good optimisation passes to each training
program/microarchitecture pair. Let g(y|X) be a parametric distribution
specific to a program/microarchitecture pair X. Note that whereas
g(y|X) is specific to the identity of a program/microarchitecture
pair, q(y|x) allows generalisation across programs and
microarchitectures by being conditioned on a set of features x.
Let $\tilde{Y}$ be a set of good optimisation passes and
$\tilde{p}(y|X)$ be the empirical distribution over these passes$^1$
for program and microarchitecture pair X. We wish to fit the
parametric distribution g(y|X) for each program/microarchitecture pair
to be as close as possible to the empirical distribution
$\tilde{p}(y|X)$. To do this we can minimise the Kullback-Leibler (KL)
divergence:$^2$

$$ \mathrm{KL}(\tilde{p}(y), g(y)) = \left\langle \log \frac{\tilde{p}(y)}{g(y)} \right\rangle_{\tilde{p}(y)} = \text{constant} + H(\tilde{p}(y), g(y)), \qquad (2) $$

where $H(\tilde{p}(y), g(y))$ is the cross-entropy of $\tilde{p}(y)$
and $g(y)$. Thus, we can maximise the objective function:

$$ \mathcal{L} = -H(\tilde{p}(y), g(y)) = \sum_{y \in \tilde{Y}} \tilde{p}(y) \log g(y). \qquad (3) $$
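The step from equation (2) to equation (3) can be written out explicitly. The expansion below is a standard identity rather than something taken from the paper; it only makes visible that the "constant" is the term that does not depend on g.

```latex
\begin{align*}
\mathrm{KL}(\tilde{p}(y), g(y))
  &= \sum_{y \in \tilde{Y}} \tilde{p}(y) \log \frac{\tilde{p}(y)}{g(y)} \\
  &= \underbrace{\sum_{y \in \tilde{Y}} \tilde{p}(y) \log \tilde{p}(y)}_{\text{constant in } g}
     \; - \; \sum_{y \in \tilde{Y}} \tilde{p}(y) \log g(y) \\
  &= \text{constant} + H(\tilde{p}(y), g(y)),
  \qquad H(\tilde{p}(y), g(y)) = -\sum_{y \in \tilde{Y}} \tilde{p}(y) \log g(y).
\end{align*}
```

Since the first term is fixed by the data, minimising the KL divergence over g is the same as minimising the cross-entropy, which is the same as maximising the objective in equation (3).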
1 In our experiments we have chosen the set of “good” optimisations
$\tilde{Y}$ to be those combinations of passes that are within
the top 5% of all training optimisations for the respective
program/microarchitecture pair. We have then weighted these
optimisations uniformly.
2 For the following derivation, to simplify the notation, we will omit
the conditional dependency of these distributions on a specific
program/microarchitecture pair X.

In principle, our model’s probability distribution g(y) can belong
to any parametric family. However, we have selected a very
simple IID (independent and identically distributed) model, where
the impact of each pass ($y_\ell$) is considered to be independent of
all others, i.e. $g(y) = \prod_{\ell=1}^{L} g(y_\ell)$, where $L$ is the
number of available passes. If $g(y_\ell)$ is a multinomial
distribution we have that:
$$ g(y) = \prod_{\ell=1}^{L} g(y_\ell) = \prod_{\ell=1}^{L} \prod_{j=1}^{|S_\ell|} \left(\theta_\ell^{j}\right)^{I[y_\ell = s_\ell^{(j)}]}, \qquad (4) $$

where $S_\ell = \{s_\ell^{(1)}, \ldots, s_\ell^{(|S_\ell|)}\}$ is the set
of possible values that the pass $y_\ell$ can take (i.e. on, off or a
parameter value); $I[y_\ell = s_\ell^{(j)}]$ is an indicator function
that is 1 only when the particular optimisation pass $y_\ell$ takes on
the value $s_\ell^{(j)}$ (i.e. $I[y_\ell = s_\ell^{(j)}] = 1$ when
$y_\ell = s_\ell^{(j)}$ and zero otherwise); and $\theta_\ell^{j}$ is
the probability of the optimisation pass $y_\ell$ taking on the
particular value $s_\ell^{(j)}$ (i.e.
$\theta_\ell^{j} = p(y_\ell = s_\ell^{(j)})$); and, by definition, we
have that $\sum_j \theta_\ell^{j} = 1$.
By using equation (4) in equation (3) and maximising the latter
with respect to each parameter $\theta_\ell^{j}$ subject to the
constraints $\sum_j \theta_\ell^{j} = 1$ we obtain:

$$ \theta_\ell^{j} = \sum_{y \in \tilde{Y}} \tilde{p}(y) \, I[y_\ell = s_\ell^{(j)}]. \qquad (5) $$

This result is known as the maximum likelihood estimator. Since
we have used the uniform distribution as $\tilde{p}(y)$, the
estimate of each parameter $\theta_\ell^{j}$ of the IID distribution
is the number of (selected) sets of optimisation passes in which
$y_\ell = s_\ell^{(j)}$ divided by the total number of (selected) sets.
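The counting estimate can be sketched in a few lines. This is an illustrative reading of equation (5) with uniform weights and the top-5% rule from the footnote; the function names and the toy data are assumptions made for the example.

```python
from collections import Counter

def select_good(samples, percent=5):
    """Keep the fastest `percent`% of (setting, time) samples, at least one."""
    ranked = sorted(samples, key=lambda sample: sample[1])
    keep = max(1, len(ranked) * percent // 100)
    return [setting for setting, _time in ranked[:keep]]

def fit_iid(good_settings, pass_space):
    """Equation (5) with uniform weights: theta[pass][value] is the fraction
    of good settings in which that pass takes that value."""
    theta = {}
    for name, values in pass_space.items():
        counts = Counter(setting[name] for setting in good_settings)
        theta[name] = {v: counts[v] / len(good_settings) for v in values}
    return theta

# Toy data: 100 timed settings of a single on/off pass (assumption).
SAMPLES = [({"unroll": i % 2}, float(i)) for i in range(100)]
GOOD = select_good(SAMPLES)                 # the 5 fastest settings
THETA = fit_iid(GOOD, {"unroll": [0, 1]})   # per-pass relative frequencies
```

With this toy data the five fastest settings have `unroll` equal to 0, 1, 0, 1, 0, so the estimate assigns probability 3/5 to "off" and 2/5 to "on".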
Our assumption of statistical independence between compiler
optimisations may seem simplistic because it is well known that
optimisations interact with each other. However, as we shall see in
section 5, this IID model generalises well across programs and
microarchitectures. Our model works on the assumption that although
compiler optimisations do interact, these interactions are less im-
portant across good sets of optimisations. Additionally, more com-
plicated distributions, e.g. a Markov model, could be considered
without modifying our approach.
3.3.2 Learning a Predictive Distribution Across Programs
and Microarchitectures
Once the individual training distributions for each pro-
gram/microarchitecture pair g(y|X) have been obtained, we can
learn a predictive distribution q(y|x). This will predict the proba-
bility of each optimisation pass value being good from the features,
x, of a program/microarchitecture pair (performance counters, c,
and microarchitecture descriptors, d).
One possible way of learning this distribution is to use memory-
based methods such as K-nearest neighbours. In other words, we
can set the predictive distribution q(y|x) to be a convex combina-
tion of the K distributions corresponding to the training programs
and microarchitectures that are closest in the feature space to the
new (test) program and microarchitecture. The coefﬁcients of the
combination are obtained by using:

$$ w_k = \frac{\exp\{-\beta \, d(x^{(k)}, x^{(*)})\}}{\sum_{k'=1}^{K} \exp\{-\beta \, d(x^{(k')}, x^{(*)})\}}, \qquad (6) $$

where $\beta$ is a constant and $d(\cdot,\cdot)$ is our distance
function, i.e. the Euclidean distance of each corresponding nearest
training point to the test point, so that the distributions of the
closest training points are assigned larger weights. We have set
$\beta = 1$ and $K = 7$ neighbours, although we have found
experimentally that the technique is not sensitive to similar values
of $K$.
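A minimal sketch of the neighbour weighting in equation (6) follows. Two points are assumptions rather than statements from the text: the convex combination is applied to the per-pass parameters of each neighbour, which keeps the result factorised, and the smallest distance is subtracted before exponentiating, which leaves the weights unchanged but avoids numerical underflow.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_weights(x_new, train_features, k=7, beta=1.0):
    """Return [(training index, weight)] for the k nearest training points."""
    nearest = sorted((euclidean(x, x_new), i)
                     for i, x in enumerate(train_features))[:k]
    d_min = nearest[0][0]
    # Shifting by d_min cancels in the ratio of equation (6).
    scores = [math.exp(-beta * (d - d_min)) for d, _ in nearest]
    total = sum(scores)
    return [(i, s / total) for (_, i), s in zip(nearest, scores)]

def combine(weights, thetas):
    """Weighted average of per-pass distributions theta[pass][value]."""
    q = {}
    for i, w in weights:
        for name, dist in thetas[i].items():
            slot = q.setdefault(name, {})
            for value, p in dist.items():
                slot[value] = slot.get(value, 0.0) + w * p
    return q

# Toy example with three training points and K = 2 (assumption).
TRAIN_X = [[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]]
TRAIN_THETA = [
    {"unroll": {0: 0.0, 1: 1.0}},
    {"unroll": {0: 1.0, 1: 0.0}},
    {"unroll": {0: 0.5, 1: 0.5}},
]
WEIGHTS = knn_weights([0.0, 0.0], TRAIN_X, k=2)
Q = combine(WEIGHTS, TRAIN_THETA)
```

The nearest training point (distance 0) receives almost all of the weight here, because the second neighbour is at distance 5 and its score is scaled by exp(-5).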
Table 2: Microarchitectural parameters and their values. Each
parameter varies as a power of 2, meaning 288,000 total
configurations. Also shown are the values for the XScale processor.
Parameter     Values        XScale
IL1 size      4K...128K     32K
IL1 assoc.    4...64        32
IL1 block     8...64        32
DL1 size      4K...128K     32K
DL1 assoc.    4...64        32
DL1 block     8...64        32
BTB entries   128...2048    512
BTB assoc.    1...8         1
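The configuration count quoted in table 2 can be checked by enumerating the powers of two in each range. The parameter keys below are invented labels; the ranges are those of the table, with "K" read as 1024.

```python
def powers_of_two(lo, hi):
    """List lo, 2*lo, 4*lo, ... up to and including hi."""
    values = []
    v = lo
    while v <= hi:
        values.append(v)
        v *= 2
    return values

K = 1024
RANGES = {
    "il1_size": (4 * K, 128 * K), "il1_assoc": (4, 64), "il1_block": (8, 64),
    "dl1_size": (4 * K, 128 * K), "dl1_assoc": (4, 64), "dl1_block": (8, 64),
    "btb_entries": (128, 2048), "btb_assoc": (1, 8),
}

TOTAL = 1
for lo, hi in RANGES.values():
    TOTAL *= len(powers_of_two(lo, hi))
```

Each cache contributes 6 × 5 × 4 = 120 combinations and the branch target buffer 5 × 4 = 20, so `TOTAL` is 120 × 120 × 20 = 288,000.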
3.4 Deployment
Once the model is built, it can be used to predict the best
optimisation passes for any new program on any new microarchitecture,
as shown in figure 2. It does this using just one run of the new
program compiled with the default optimisation level, O3, on the new
microarchitecture. Thus, given a new program/microarchitecture
pair, $X^*$, we extract its features by using the microarchitecture
description, $d^*$, and the performance counters from a run of this
program (compiled with O3) on this microarchitecture, $c^*$, so that
we form $x^* = (c^*, d^*)$. We then use equation (1) above to give
the predicted-best optimisation passes, $y^*$, and compile and execute
the program with this new optimisation.
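The last step can be illustrated with a short sketch. It assumes that the predictive distribution is kept in factorised, per-pass form, in which case the mode in equation (1) is simply the most probable value of each pass; the text does not state this explicitly. The pass names and their mapping to gcc's `-f`/`-fno-` switches are illustrative choices for the five passes named in section 2, and only on/off passes are handled.

```python
# Assumed mapping from illustrative pass names to gcc flag stems.
FLAG_STEMS = {
    "block_reordering": "reorder-blocks",
    "loop_unrolling": "unroll-loops",
    "function_inlining": "inline-functions",
    "instruction_scheduling": "schedule-insns",
    "gcse": "gcse",
}

def predict_passes(q):
    """Per-pass argmax of a factorised distribution q[pass][value]."""
    return {name: max(dist, key=dist.get) for name, dist in q.items()}

def gcc_flags(setting):
    """Turn an on/off setting into gcc switches layered on top of -O3."""
    flags = ["-O3"]
    for name, enabled in setting.items():
        stem = FLAG_STEMS[name]
        flags.append(f"-f{stem}" if enabled else f"-fno-{stem}")
    return flags

# Made-up predictive distribution for two passes.
Q = {
    "loop_unrolling": {True: 0.7, False: 0.3},
    "gcse": {True: 0.2, False: 0.8},
}
PREDICTED = predict_passes(Q)
FLAGS = gcc_flags(PREDICTED)
```

For this toy distribution the prediction enables loop unrolling and disables global common sub-expression elimination, giving the switches `-O3 -funroll-loops -fno-gcse`.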
4. DESIGN SPACE
The previous section has developed a machine learning model
that automatically predicts the correct optimisation passes to apply
for any new program on any microarchitectural conﬁguration. This
section describes our experimental setup later used to evaluate this
model. It also provides a brief characterisation of the compiler and
microarchitectural design spaces.
4.1 Benchmarks
We chose to use MiBench [15], a common embedded bench-
mark suite, well suited to the microarchitecture space of this pa-
per. It contains a mix of programs, from signal processing algo-
rithms to full ofﬁce applications. We used all 35 programs, running
each to completion using an input set requiring at least 100 million
executed instructions, wherever possible. Therefore, susan_c, su-
san_e, djpeg, tiff2rgba and search were run with the large input set,
all others were run with the small inputs.
4.2 Microarchitecture Space
In this paper we consider a typical embedded microarchitectural
design space based on the XScale processor shown in table 2. We
show the microarchitecture parameters of the XScale that we varied
along with the values each parameter can take. To generate our de-
sign space we varied the cache and branch predictor conﬁgurations
because they are important components of an embedded proces-
sor. Other signiﬁcant parameters such as pipeline depth or voltage
scaling were not considered, but could be easily added to our ex-
perimental setup. We varied the parameters over a wide range of
values, beyond those in current systems, to fully explore the de-
sign space of this processor’s microarchitecture. In total there are
288,000 different configurations, some of which give increased
performance over the original processor (up to 19%), while others give
significantly reduced power consumption (up to 21%), which is equally
significant for embedded processors. In our experiments we used
a sample space of 200 configurations selected with uniform random
sampling. Each configuration was implemented in the Xtrem
simulator [5] which has been validated for performance against the
XScale processor. We also used Cacti [35] to accurately model the
cache access latencies, ensuring our experiments were as realistic
as possible.

Figure 3: Compiler optimisations and their parameters. Each is a
pass within gcc and can be varied independently. In total there
are 642 million combinations.
[Figure 4: box plot. x-axis (programs, left to right): qsort,
rawcaudio, tiff2rgba, gs, djpeg, patricia, basicmath, lout, fft_i, fft,
susan_s, susan_c, tiffmedian, ispell, pgp, tiffdither, bf_e, bf_d,
rawdaudio, pgp_sa, tiff2bw, cjpeg, lame, dijkstra, susan_e, toast,
madplay, untoast, sha, bitcnts, say, rijndael_d, crc, rijndael_e,
search, AVERAGE. y-axis: speedup, from 1 to 2.4, with two off-scale
values annotated as 4.4 and 4.8.]
Figure 4: Distribution of the maximum speedup available
across all microarchitectures on a per-program basis. The
x-axis represents the program and the y-axis the speedup relative
to gcc’s default optimisation level O3. The central line denotes
the median speedup. The box represents the 25 and 75
percentile area while the outer whiskers denote the extreme points
of the distribution.
Application to other spaces
We have considered an embedded microarchitectural space and
benchmark suite because this represents a real-world challenge for
portable compiler optimisation, based around an existing processor
configuration. However, the machine learning schemes we develop in this
paper are independent of this and can equally be applied to other,
more complex spaces. In section 7 we extend the microarchitectural
space by varying frequency and issue width and show that our
approach adapts to the new space.
4.3 Compiler Optimisation Space
Finding the best optimisation for a speciﬁc program on a spe-
ciﬁc microarchitecture is intractable. To do this, all equivalent pro-
grams, of which there are inﬁnitely many, must be evaluated and
the best selected. Therefore, in this paper we limit ourselves to
ﬁnding the best optimisation within a ﬁnite space of optimisations.
The space we have considered consists of all combinations of the
compiler passes and their parameters shown in ﬁgure 3. These
passes are applied within gcc 4.2, an industry standard for the XS-
cale processor. They were found to have a performance impact on
the XScale microarchitecture conﬁguration and other researchers
have explored a similar space [38], allowing independent
comparisons with existing work. Turning passes on or off leads to a
design space of 642 million different optimisations. Varying the
parameters controlling some of the optimisations (e.g. gcse has five
further options) leads to a total of $1.69 \times 10^{17}$ unique
optimisation settings.
Clearly, it is not feasible to exhaustively enumerate this entire
space to ﬁnd the best optimisations for each program on each mi-
croarchitecture. However, iterative compilation can be used to
quickly ﬁnd an approximation of the best and has been shown to
out-perform other approaches [30]. Hence, to find the best
optimisations, we used iterative compilation which evaluated 1000
different optimisations. These optimisations were selected using
uniform random sampling. In our experimental setup we saw almost
no additional improvement after 1000 evaluations, showing
that it is a useful indicator of the upper bound on realistic
performance achievable by a compiler.
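Iterative compilation with uniform random sampling amounts to the loop sketched below. The pass space and the `toy_time` function are assumptions made so that the example runs on its own; in the experiments each evaluation would be a compilation followed by a simulated run.

```python
import random

def iterative_search(pass_space, run_time, baseline_time,
                     evaluations=1000, seed=0):
    """Sample settings uniformly at random and keep the fastest one.
    Returns (best setting, speedup over the baseline time)."""
    rng = random.Random(seed)
    best_setting, best_time = None, float("inf")
    for _ in range(evaluations):
        setting = {name: rng.choice(values)
                   for name, values in pass_space.items()}
        t = run_time(setting)
        if t < best_time:
            best_setting, best_time = setting, t
    return best_setting, baseline_time / best_time

# Toy stand-ins (assumptions): three on/off passes and a fake timing model.
PASS_SPACE = {"unroll": [0, 1], "gcse": [0, 1], "inline": [0, 1]}

def toy_time(setting):
    """Fake execution time in arbitrary units."""
    return (10.0 - 3.0 * setting["unroll"] - 2.0 * setting["gcse"]
            + 1.0 * setting["inline"])

BEST, SPEEDUP = iterative_search(PASS_SPACE, toy_time, baseline_time=10.0)
```

In this toy space there are only eight settings, so 1000 samples find the optimum; in the real space of section 4.3 the same loop only approximates it.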
4.4 Characterising the Compiler Space
Before trying to build a compiler that optimises across microar-
chitectures, it is important to examine whether there is any per-
formance to gain. For this purpose, we evaluated the impact of
the compiler optimisations on the 35 MiBench programs compiled
with the 1000 random optimisation passes, each of which was exe-
cuted on the 200 different architectural conﬁgurations, as described
earlier. This corresponds to a sample space of 7 million simulations
(35 programs × 1000 optimisation settings × 200 configurations)
and should provide some evidence of the potential benefits of
optimisation pass selection across microarchitectures and programs.
We then record the best performance achieved on a per-program,
per-microarchitecture basis.
Figure 4 shows the sample space’s distribution of maximum
speedups for each program across the microarchitectural conﬁg-
urations when compiling with the best set of optimisations per pro-
gram per microarchitecture. What is immediately clear is that there
is signiﬁcant variation across the programs. For some the per-
formance improvement is modest; selecting the best optimisations
does not help the library-bound benchmarks qsort or basicmath for
instance. For rijndael_e there is significant performance improvement
to be gained, ranging from a 1.2x speedup to 4.8x in the best
case, 1.8x being the average. In the case of search the extremes
are much less but on average selecting the best optimisation gives a
2.2x speedup across all conﬁgurations. In programs such as toast,
madplay and untoast, there are modest speedups to be gained on
average but signiﬁcant improvements available on certain microar-
chitectures (up to 2.4x for madplay as the top whisker shows).
The right-most entry shows that there is an average speedup of
1.23x available across the design space if we were able to select the
best optimisations per program per microarchitecture. The challenge
is to develop a compiler that can automatically obtain this
speedup without having seen the program or target microarchitectural
configuration before. Furthermore, it should be able to capture
the high performance available on certain microarchitectures and
avoid the potential slowdowns found by picking the wrong
optimisations. We found that choosing the wrong set of passes can lead
to an average speedup of 0.7 across programs and 0.2 in the worst
case (i.e. 5 times slower). The next section evaluates
experimentally the performance of the machine-learning compiler
developed in section 3.

[Figure 5: two surface plots, (a) Best passes and (b) Our compiler.
Axes: Microarchitecture, Program (the 35 MiBench programs, from qsort
to search) and Speedup (0.5 to 5).]
Figure 5: Speedup over O3 for each program/microarchitecture pair.
Figure (a) shows the best improvement possible over the
programs and microarchitectures. Figure (b) shows the performance of
the passes predicted by our compiler scheme. Our portable
optimising compiler can accurately predict the best passes to enable
across this space to achieve the speedups available across all
programs and microarchitectures.
5. EXPERIMENTAL METHODOLOGY
AND RESULTS
We ﬁrst describe our evaluation methodology and then evaluate
our approach across programs and microarchitectures. This is fol-
lowed by an analysis of the results.
5.1 Evaluation Methodology
This section describes how we perform our experiment and de-
termine the best performance achievable in our space.
5.1.1 Cross-Validation
To evaluate the accuracy of our approach we use leave-one-out
cross-validation. This means that we remove a program and a mi-
croarchitecture from our training set, build a model based on the
remaining data and then predict the best optimisations for the re-
moved program and microarchitecture. Following this, the program
is compiled and executed with the predicted optimisations on that
microarchitecture and its performance recorded. We then repeat
this for each program/microarchitecture. This is a standard evalua-
tion methodology for machine learning techniques and means that
we never train using the program or microarchitecture for which
we will optimise. Hence, this is a fair evaluation methodology.
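The leave-one-out procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in our experiments: the record layout and the train_model and predict callables are hypothetical placeholders for whatever model is being evaluated.

```python
from typing import Callable, Dict, Hashable, List, Tuple

# One training record: (program, microarchitecture, payload).
# The payload (features, best passes, ...) is opaque to this sketch.
Record = Tuple[str, str, Hashable]


def leave_one_out(
    records: List[Record],
    train_model: Callable[[List[Record]], object],   # hypothetical trainer
    predict: Callable[[object, str, str], object],   # hypothetical predictor
) -> Dict[Tuple[str, str], object]:
    """For every (program, microarchitecture) pair, train on data that
    contains neither that program nor that microarchitecture, then predict."""
    pairs = sorted({(prog, march) for prog, march, _ in records})
    predictions = {}
    for prog, march in pairs:
        # Drop every record involving the held-out program OR the
        # held-out microarchitecture, so neither is seen during training.
        training = [r for r in records if r[0] != prog and r[1] != march]
        model = train_model(training)
        predictions[(prog, march)] = predict(model, prog, march)
    return predictions
```

Filtering on both coordinates is what guarantees that the model is never trained on the program or the microarchitecture it is asked to optimise for.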
5.1.2 Best Performance Achievable
In addition to comparison with O3, we want to evaluate our ap-
proach by assessing how close its performance is to the maximum
achievable. Although it is intractable to determine the best per-
formance that can be achieved by any set of optimisation passes,
as stated in section 4.3, we consider the performance of an itera-
tive compiler with 1000 evaluations as being an appropriate upper
bound for a compiler using a single proﬁle run.
5.2 Program/Microarchitecture Optimisation Space
We ﬁrst consider the performance of our compiler compared to
the maximum speedups available. Figure 5(a) shows the maximum
speedups achievable, when selecting the best optimisations, rela-
tive to the default optimisation level, O3, across the program and
microarchitectural spaces. The microarchitectures are ordered so
that those with large speedups available over O3 are on the left.
The benchmarks are ordered so that those with large performance
increases (such as search) are on the right (as in ﬁgure 4). In the
back corner, the maximum speedup achievable with the best com-
piler passes is obtained by rijndael_e. This benchmark achieves
a 4.85x speedup on a microarchitecture with a small instruction
cache size. The optimisations leading to this result do not include
any loop optimisations (apart from moving loop-invariant code out
of the loops). In particular, no loop unrolling is performed because
there is already extensive, optimised software loop unrolling pro-
grammed into the source code.
The performance of our compiler across the programs and mi-
croarchitectures is shown in ﬁgure 5(b). As is immediately appar-
ent, it is almost identical to the performance achieved when using
the best optimisations, ﬁgure 5(a). Our model is highly accurate
at predicting very good compiler passes across the programs and
microarchitecture space. The coefﬁcient of correlation between
the performance of the predicted optimisations and the best ones
is 0.93 when evaluated across the joint program/microarchitecture
space. For all program/microarchitecture pairs with large perfor-
mance available, our approach is able to achieve signiﬁcant
speedups, as is shown by the peaks for programs ispell, madplay,
rijndael_d and rijndael_e. These graphs clearly demonstrate that
our model is able to capture the variation in speedups available
across the program and microarchitecture spaces. The next two
sections conduct further evaluations of our model.
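As an illustration of how such a correlation coefficient can be computed, the sketch below implements the Pearson coefficient over paired speedups. It assumes that the coefficient of correlation is the Pearson one, which is not stated explicitly above, and the function name is our own.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences, e.g. the
    speedup of the predicted passes and of the best passes for each
    program/microarchitecture pair."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 indicates that pairs with large available speedups are also the pairs where the predicted passes perform well.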
5.3 Evaluation Across Programs
[Figure 6: bar chart of speedup (axis range 0.9 to 2.2) for each of the 35 programs (qsort to search) plus AVERAGE, with two bars per program: Our Model and Best.]
Figure 6: The performance of our approach and the best opti-
misations achieved by iterative compilation for each program,
normalised to O3 and averaged over all microarchitectures.
This section focuses on the performance of our compiler on each
program rather than examining the microarchitectural space. Fig-
ure 6 shows the performance of each program when optimised with
our portable optimising compiler, relative to compiling with O3,
averaged across all microarchitecture configurations. The second
bar, labelled Best, is the maximum speedup achievable for each
program. On average, our technique obtains a 1.16x performance
improvement across all programs and microarchitectures with just
one proﬁle run, achieving a 1.94x speedup for search on average.
For three benchmarks in particular (search, rijndael_e and ri-
jndael_d), our scheme achieves signiﬁcant speedups, approaching
the best performance available. Figure 6 shows that our model is
able to correctly identify good optimisations, allowing these pro-
grams to exploit the large performance gains when available.
However, ﬁgure 6 also shows that some programs experience
minor slowdowns compared with O3. Considering rawcaudio for
example, our approach achieves only a 0.97x speedup. This can
be explained if we refer back to ﬁgure 4 where we can see that
there is negligible performance improvement to be gained over O3
for this program, even when picking the best optimisations per mi-
croarchitecture. Unfortunately, for this benchmark, the majority of
optimisations are detrimental to performance and being less than
100% accurate in picking optimisations means that compiling with
our scheme causes a small amount of performance loss.
Considering our technique compared to the maximum speedup
achievable, we approach Best in most cases. For some programs,
such as susan_e, we obtain over 95% of the maximum perfor-
mance. However, for crc we achieve only 30%. This shortfall
is due to a subtlety in the source code of crc. The
main loop within this benchmark updates a pointer on every itera-
tion, resulting in a large number of loads and stores. By perform-
ing function inlining and allowing a large growth factor (parameter
max-inline-insns-auto), this pointer increment is reduced to a sim-
ple register addition which in turn reduces the number of data cache
accesses. The performance counters are not sufficiently informative
to enable our machine learning model to capture this behaviour.
This prevents our model from selecting the best passes. However,
the addition of extra features, in particular code features [9], would
enable us to pick this up and will be considered in future work.
By way of comparison, standard iterative compilation would re-
quire approximately 50 iterations on average to achieve similar per-
formance. For some programs, more than 100 iterations would be
needed to match our model. This clearly shows the benefit of
using machine-learning models to build a portable optimising com-
piler.
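The comparison above amounts to asking at which iteration a search first does as well as the model. A minimal sketch of that measurement is given below; it assumes the speedups observed by iterative compilation are available in evaluation order, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def iterations_to_match(search_speedups: Sequence[float],
                        target: float) -> Optional[int]:
    """Return the 1-based iteration at which the best speedup found so far
    by iterative compilation first reaches `target` (the speedup of the
    model's single prediction), or None if the search never reaches it."""
    for iteration, speedup in enumerate(search_speedups, start=1):
        # The best-so-far value reaches the target exactly when the first
        # individual evaluation does, so no running maximum is needed.
        if speedup >= target:
            return iteration
    return None
```

Averaging this count over programs and microarchitectures gives the kind of figure quoted above.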
5.4 Evaluation Across Microarchitectures
[Figure 7: line plot of speedup (axis range 1 to 1.5) against microarchitecture configuration, with configurations 32 and 185 marked on the horizontal axis; two series: Best and Our Model.]
Figure 7: The performance of our approach and the best op-
timisations achieved by iterative compilation for each microar-
chitecture, normalised to O3 and averaged across programs.
We now turn our attention to the performance of our compiler
across the microarchitecture space rather than across programs. Fig-
ure 7 shows the performance of our compiler compared to the best
performance available for each microarchitecture, labelled Best.
The microarchitectural conﬁgurations are ordered in terms of in-
creasing speedup available over O3 (i.e. the Best line). Those on
the left have little speedup available whereas those on the right can
gain signiﬁcantly.
For our portable optimising compiler we see that the amount of
improvement over O3 varies from 1.08x to 1.35x. This gives an
average speedup of 1.16x across all programs and microarchitec-
tures. It is important to see that our scheme closely follows the
trend of the Best optimisations, showing how our approach cap-
tures the variation between conﬁgurations, exploiting architectural
features when performance improvements can be achieved.
Looking at ﬁgure 7 in more detail we can see that it is divided
into roughly three regions. On the left, up to conﬁguration 32, the
ﬁrst region has little performance improvement available. All mi-
croarchitectures in this area have a small data cache. Unfortunately
gcc has very few data access optimisations, meaning the available
speedups are relatively small. Following this is the second region
where the Best optimisations gain an average 1.2x speedup and our
scheme manages to capture a respectable 1.16x.
Finally, in the third region, after configuration 185, the available
performance improvement increases dramatically. These microar-
chitectures on the right have a small instruction cache, meaning
that it is important to prevent code duplication wherever possible.
This is typical of embedded systems where code size is frequently
an important optimisation goal. The performance counter speci-
fying the instruction cache miss rate enables our model to learn
this from the training programs. In particular, our compiler learns
that instruction scheduling (schedule-insns) and function inlining
(inline-functions) must be disabled to prevent code size increases.
In the case of instruction scheduling, this increase is due to a sub-
sequent register allocation pass which emits more spill code for
certain schedules. Here we can see an effect of the complex re-
lationships between passes within the compiler. Nonetheless, our
model is able to cope with these interactions and achieve the ma-
jority of the speedups available in this area.
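To make the effect of such a prediction concrete, the sketch below turns a set of predicted on/off settings and parameter values into gcc command-line arguments. The helper and its dictionary layout are assumptions made for illustration; only the gcc option syntax (-fNAME, -fno-NAME and --param NAME=VALUE) is taken from gcc itself.

```python
from typing import Dict, List


def to_gcc_args(flags: Dict[str, bool], params: Dict[str, int]) -> List[str]:
    """Translate predicted settings into gcc arguments.

    flags  maps an optimisation name (e.g. "schedule-insns") to on/off.
    params maps a --param name (e.g. "max-inline-insns-auto") to a value.
    """
    args = []
    for name, enabled in flags.items():
        # gcc enables a pass with -fNAME and disables it with -fno-NAME.
        args.append(f"-f{name}" if enabled else f"-fno-{name}")
    for name, value in params.items():
        args.extend(["--param", f"{name}={value}"])
    return args


# Settings of the kind learnt for small instruction caches (see above).
small_icache = {"schedule-insns": False, "inline-functions": False}
print(" ".join(to_gcc_args(small_icache, {})))
# prints: -fno-schedule-insns -fno-inline-functions
```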
5.5 Summary
We have shown that our portable optimising compiler achieves
an average 1.16x speedup over O3 across the entire microarchitec-
ture space for the MiBench benchmark suite. This is equivalent to
67% of the speedup achieved by the Best optimisation passes and is
roughly consistent across the architecture conﬁguration space. In
addition, our approach is able to achieve higher levels of perfor-
mance whenever they are available, accurately following the trends
in the optimisation space across programs and microarchitectures.
The next section analyses our results, describing the passes that
are important in our space and how our model selects good optimi-
sation passes for new programs and microarchitectures.
[Figure 8: Hinton diagram with the 35 programs (qsort to search) on the horizontal axis and the following optimisations on the vertical axis: param_max_unrolled_insns, param_max_unroll_times, funroll_loops, param_inline_call_cost, param_inline_unit_growth, param_large_unit_insns, param_large_function_growth, param_large_function_insns, param_max_inline_insns_auto, finline_functions, fno_sched_spec, fno_sched_interblock, fschedule_insns, param_max_gcse_passes, fgcse_after_reload, fgcse_las, fgcse_sm, fno_gcse_lm, fgcse, funswitch_loops, ftree_pre, ftree_vrp, falign_labels, falign_loops, falign_jumps, falign_functions, freorder_blocks, fregmove, fpeephole2, fcaller_saves, frerun_loop_opt, fre_run_cse_after_loop, fstrength_reduce, fexpensive_optimizations, fcse_skip_blocks, fcse_follow_jumps, foptimize_sibling_calls, fcrossjumping, fthread_jumps.]
Figure 8: A Hinton diagram showing the optimisations that our model con-
siders most likely to affect performance for each benchmark. The larger the
box, the more likely an optimisation affects the performance of the respec-
tive program.
[Figure 9: Hinton diagram with the following features on the horizontal axis: btb_size, btb_assoc, i_size, i_assoc, i_block, d_size, d_assoc, d_block, IPC, dec_acc_rate, reg_acc_rate, bpred_acc_rate, icache_acc_rate, icache_miss_rate, dcache_acc_rate, dcache_miss_rate, ALU_usg, MAC_usg, Shft_usg. The vertical axis lists the same 39 optimisations, from param_max_unrolled_insns to fthread_jumps.]
Figure 9: A Hinton diagram showing the relation-
ship between optimisations and features. The larger
the box, the more informative a feature is in predict-
ing whether to apply the optimisation.
6. ANALYSIS OF RESULTS
This section analyses the results of our experiments. It ﬁrst de-
scribes how programs affect the choice of optimisation to apply.
It then shows how the microarchitecture influences the choice of
optimisations.
6.1 Program Impact on Optimisations
Section 5.2 showed that our compiler’s performance closely fol-
lows the speedups achieved by the best optimisations for each pro-
gram/microarchitecture pair. We now consider how it achieves this
by focusing on those optimisations that are most likely to affect
performance. Note that this is a post-hoc analysis and, in general,
we cannot know in advance whether an optimisation will be likely
to affect performance for a speciﬁc program and microarchitecture.
Figure 8 shows a Hinton diagram of the normalised mutual in-
formation between each optimisation and the speedups obtained on
each program. Intuitively, mutual information gives an indication
of the impact (good or bad) of a speciﬁc compiler pass on each
program. The larger the box, the greater the impact of the pass.
However, as this is a summary across all architectures an optimisa-
tion may be important for just a few microarchitectures but not for
the others, leading to a small box being drawn.
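For readers unfamiliar with the measure, the following sketch computes a normalised mutual information between an optimisation setting and a discretised speedup. It rests on two assumptions that are not specified above: speedups are first binned into discrete classes, and the normalisation divides by the geometric mean of the two entropies (one of several common choices).

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def normalised_mutual_information(xs: Sequence[Hashable],
                                  ys: Sequence[Hashable]) -> float:
    """I(X;Y) / sqrt(H(X) * H(Y)) for paired discrete observations, e.g.
    xs = whether a pass was enabled, ys = the speedup class observed."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
    mi = sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )
    h_x, h_y = entropy(xs), entropy(ys)
    if h_x == 0 or h_y == 0:
        return 0.0  # a constant variable carries no information
    return mi / math.sqrt(h_x * h_y)
```

A value near 1 means that knowing the setting of the pass almost determines the speedup class, whereas a value near 0 means the pass tells us little about performance.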
It is clear from ﬁgure 8 that some optimisations are important
across all programs, whereas others are only important to a few
benchmarks. For example, instruction scheduling (schedule-insns)
is important for almost all benchmarks. As discussed in section 5.4,
in some cases this optimisation has a negative impact on microar-
chitectures with a small instruction cache. Loop unrolling (unroll-
loops) is also an important optimisation for many programs. For
programs such as search, which contains loops with a known num-
ber of iterations, it is important to consider this optimisation to
achieve good performance. However, for others, such as rijndael_e,
this optimisation does not play a crucial role in achieving good per-
formance because extensive unrolling is already implemented in
the source code.
The optimisation passes affecting function inlining (inline-
functions to param-inline-call-cost) have little impact on most pro-
grams. However, for four programs, ispell, pgp, pgp_sa and say,
these are the most important passes. By using the mutual informa-
tion shown in ﬁgure 8 our model focuses on those optimisations
that are most likely to affect performance on a per program/archi-
tecture basis.
6.2 Microarchitecture Impact on Optimisations
Having analysed the optimisations that have most impact on dif-
ferent programs, we now turn our attention to the relationship be-
tween the microarchitecture and compiler optimisations. Figure 9
presents another Hinton diagram showing the microarchitectural
impact on the best optimisations to apply. These results are aver-
aged over all programs. The features are separated into two groups:
the ﬁrst contains the eight architectural parameters d whilst the sec-
ond contains the 11 performance counters c.
Of all the microarchitectural parameters, the size of the instruc-
tion cache (denoted i_size) has the biggest impact on compiler opti-
misation. In particular, it strongly influences the optimisations that
control function inlining (inline_functions) and loop unrolling
(unroll_loops). It is therefore critical to predict these optimisations cor-
rectly (based on i_size) to avoid increasing the cache miss rate on
small cache conﬁgurations. Furthermore, on larger cache conﬁgu-
rations, it is important to perform aggressive inlining and unrolling
to exploit the full potential of the cache.
Now considering the performance counters, c, we can see that
IPC has significant impact. This is used by the model in conjunc-
tion with the other features to predict the most important optimi-
sations to apply, such as block reordering (reorder_blocks), global
common subexpression elimination (gcse), instruction scheduling
(schedule_insns), function inlining (inline_functions) and loop un-
rolling (unroll_loops). The performance counters that record cache
and branch predictor access/miss rates also have significant impact
on choosing the best optimisation flags. Surprisingly, knowledge
of register and functional unit usage has little importance in deter-
mining the correct compiler optimisations to apply.
[Figure 10: bar chart of speedup (axis range 0.9 to 2.2) for each of the 35 programs (qsort to search) plus AVERAGE, with two bars per program: Our Model and Best.]
Figure 10: The performance of our approach and the best opti-
misations achieved by iterative compilation for each program,
normalised to O3 and averaged over all microarchitectures on
an extended space.
While some of these observations may seem rather intuitive, cur-
rent production compilers, such as gcc, always use the same strat-
egy when applying optimisation passes, independently of the archi-
tectural parameters. One immediate recommendation would be to
make gcc’s unrolling and inlining optimisations sensitive to the in-
struction cache size and to make use of branch predictor and cache
access performance counters. However, this is just a post hoc anal-
ysis based on the results of this space. The technique developed
in this paper is micro-architectural space neutral enabling a com-
piler to automatically adapt to any underlying microarchitecture, as
shown in the next section.
7. EXTENDING THE MICROARCHITECTURAL SPACE
While our approach works well on a predeﬁned architecture
space, it is reasonable to ask how it would perform if the architec-
ture space were changed at a later date. We therefore extended our
space by varying two microarchitectural parameters not considered
in section 4.3, namely frequency and processor width. Frequency
ranges from 200 to 600 MHz while issue width is either 1 or 2. As a
reference, the corresponding XScale values are 400 MHz and issue
width 1.
Given this extended space, we then applied our approach, pre-
dicting the best optimisation passes for it. Figure 10 shows the re-
sulting performance across programs, compared to the best perfor-
mance available. In this new space, selecting the correct compiler
optimisation passes has a similar impact as before. The Best opti-
misations give an average 1.24x improvement over O3 compared
to 1.23x in the previous space. Our approach achieves an
average speedup of 1.14x. This is comparable to the per-
formance achieved on the previous space without any modification
to our approach. If we were to include new features that capture
the behaviour of the additional architectural parameters, the perfor-
mance of our model would be further improved.
8. RELATED WORK
There is a signiﬁcant volume of prior work related to this paper
which we discuss in the following seven sections.
Domain-Speciﬁc Optimisations
Yotov et al. [40] investigated a model-driven approach for the
ATLAS self-tuning linear algebra library that uses the machine de-
scription to compute the optimal parameters of the optimisations.
SPIRAL [32] is another self-tuned library. It automatically gen-
erates high-performance code for digital signal processing by ex-
ploiting domain-speciﬁc knowledge to search the parameter space
at compile-time. These two systems both required domain-speciﬁc
knowledge and the use of iterative compilation to optimise them-
selves on the target system. They have to be retuned for each new
platform. This contrasts with our work where the compiler is built
only once and optimises across a range of microarchitectures using
just one proﬁle run for any new program.
Iterative Compilation
Iterative compilation optimises a single program on a speciﬁc mi-
croarchitecture by searching the optimisation space. Cooper et
al. [7] were amongst the ﬁrst to use a genetic algorithm to solve
the phase ordering problem, achieving impressive code size reduc-
tions. Later, an extensive study of this problem was conducted, ad-
vocating the use of multiple hill-climber runs [2]. Vuduc et al. [39]
looked at the problem of optimising a matrix multiplication library
using a statistical criterion to stop search. Kulkarni et al. [24] used
their previously developed VISTA compiler infrastructure [42] to
search for effective optimisation phases at a function level. They
build a tree of effective transformation sequences and use it to limit
the search of the optimisation space with a genetic algorithm.
Orthogonally to this, other researchers have focused on ﬁnding
the best optimisation settings to apply. Triantafyllis et al. [36]
concentrated on a small set of optimisations that perform well on a
given set of code segments. These are placed in a search tree which
is traversed to search for good optimisation combinations for a new
application. Finally, Pan and Eigenmann [30] compared these tech-
niques with their own algorithm that iteratively eliminates settings
with the most negative effect from the search space. Compared to
our approach, all these techniques speciﬁcally tune each program
on a per-program, per-architecture basis by searching its optimisa-
tion space. Conversely, our technique avoids search and recompila-
tion by directly predicting the correct set of compiler optimisations
to apply on a new micro-architecture.
Analytic Models for Compilation
The use of analytic models has also been investigated to speed up it-
erative compilation. Triantafyllis et al. [36] used an analytic model
to reduce the required time to evaluate different compiler optimi-
sations for different code segments. Zhao et al. [41] developed
an approach named FPO to estimate the impact of different loop
transformations. To overcome the high cost of iterative compila-
tion, Cooper et al. [6] developed ACME which uses the concept of
virtual execution; a simple analytic model that estimates the execu-
tion time of basic blocks. Analytic models have proved to be useful
for searching the optimisation space quickly. However, since our
model does not perform any search but directly predicts the best
optimisation passes to apply, they are not applicable in this context.
Machine-Learning Compilers
Some of the ﬁrst researchers to incorporate machine learning into
optimising compilers were McGovern and Moss [29] who used re-
inforcement learning for the scheduling of straight-line code.
Stephenson and Amarasinghe [33] looked at tuning the unroll fac-
tor using supervised classification techniques such as K-Nearest-
Neighbour and Support Vector Machines. All these approaches
only consider one compiler optimisation and, furthermore, are spe-
ciﬁc to the target architecture.
Subsequent researchers have considered predictive models to au-
tomatically tune a compiler for an existing microarchitecture. These
models use a program's features to focus the search of the optimisa-
tion space in promising areas. Agakov et al. [1] used code features
to characterise programs while Cavazos et al. [3] investigated the
use of performance counters. However, both still require a search
of the space and as such are comparable to iterative compilation.
To tackle this problem, Cavazos [4] developed a logistic regressor
that predicts which optimisations to apply at a method level within
the Jikes RVM. Recently, Milepost GCC has been developed to
drive the compiler optimisation process based on machine learn-
ing [14]. Each of these approaches, however, has to be entirely
retrained for any new platform and cannot be used for “compiler in
the loop” architecture design-space exploration. In a similar direc-
tion, Stephenson et al. [34] investigated the use of meta optimisa-
tions by tuning the compiler heuristics using genetic programming
and Hoste and Eeckhout [17] used genetic algorithms to search for
the best static compiler ﬂags across various programs. In contrast
to these static heuristics, we have developed a model that predicts
the best optimisations to apply based on the characteristics of any
new program or microarchitecture.
Retargetable Compilers
Integration of compiler and microarchitecture development is not
new. Frameworks such as Buildabong [13] and Trimaran [37] al-
low automatic exploration of both compiler and microarchitecture
spaces. Other researchers have focused on creating portable com-
pilers such as LLVM [25]. However, these infrastructures focus
purely on portability from an engineering point of view: develop-
ing tools and optimisations that can be reused across many microar-
chitectures.
Microarchitectural Design Space
Recently there has been signiﬁcant interest in predicting the per-
formance of different programs across a microarchitectural design
space. Schemes include linear regressors [20], artiﬁcial neural net-
works [18, 19], radial basis functions [21, 38] and spline func-
tions [26, 27]. These models obtain similar accuracy to each
other [28]. Other researchers have since proposed new models that
learn across programs [11, 23]. However, all these models are lim-
ited to microarchitectural exploration and have not considered com-
piler optimisations.
Co-design Space Exploration
Finally, other researchers have explored the microarchitecture and
compiler optimisation co-design space on a per program basis.
Vaswani et al. [38] focused primarily on allowing exploration of
this space. They built a model for a speciﬁc program that predicts
the performance of compiler flags on microarchitecture configura-
tions for that program. However, their model cannot handle unseen
programs and therefore cannot be used for
portable optimisation. Dubach et al. [10] and Desmet et al. [8]
independently also explored the microarchitectural and compiler
optimisation co-design space. In addition, Dubach et al. [10] de-
veloped models that predict the performance that the best set of
compiler flags could achieve for a given program on any microar-
chitecture, without actually searching the optimisation space. How-
ever, these models are program-specific and predict program per-
formance, rather than the actual optimisations to apply. In contrast,
our technique directly predicts the optimisation passes to apply for
any unseen program on any unseen microarchitecture.
9. CONCLUSIONS AND FUTURE WORK
This paper has presented a portable optimising compiler that au-
tomatically learns the best optimisation passes to apply for any new
program on any new microarchitecture. Using a machine learning
approach, we can achieve on average a 1.16x speedup over the
highest default optimisation level (O3) after just one profile run. This corre-
sponds to 67% of the maximum speedup available if we were to use
iterative compilation with 1000 evaluations. We achieve this after
a one-off training cost which is amortised across all generations of
the processor. We also show that similar performance is achieved
when applied to a new extended micro-architectural space. Future
work will consider ﬁne-grained optimisations at a function level
and the ability of the compiler to alter its optimisation pass order-
ings. We will remove the single proﬁle run we currently require by
considering abstract syntax tree features to characterise programs.
Furthermore, we will look at reducing the training cost of our ap-
proach by using clustering techniques which can dramatically re-
duce the amount of training data needed.
10. REFERENCES
[1] F. Agakov, E. Bonilla, J. Cavazos, B. Franke, G. Fursin,
M. F. P. O’Boyle, J. Thomson, M. Toussaint, and C. K. I.
Williams. Using machine learning to focus iterative
optimization. In CGO, 2006.
[2] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S. W.
Reeves, D. Subramanian, L. Torczon, and T. Waterman.
Finding effective compilation sequences. SIGPLAN Not.,
39(7), 2004.
[3] J. Cavazos, G. Fursin, F. Agakov, E. Bonilla, M. F. P.
O’Boyle, and O. Temam. Rapidly selecting good compiler
optimizations using performance counters. In CGO, 2007.
[4] J. Cavazos and M. F. P. O’Boyle. Method-speciﬁc dynamic
compilation using logistic regression. In OOPSLA, 2006.
[5] G. Contreras, M. Martonosi, J. Peng, R. Ju, and G. Lueh.
XTREM: a power simulator for the Intel XScale core. In
LCTES, 2004.
[6] K. D. Cooper, A. Grosul, T. J. Harvey, S. Reeves,
D. Subramanian, L. Torczon, and T. Waterman. Acme:
adaptive compilation made efﬁcient. SIGPLAN Not., 40(7),
2005.
[7] K. D. Cooper, P. J. Schielke, and D. Subramanian.
Optimizing for reduced code space using genetic algorithms.
In LCTES, 1999.
[8] V. Desmet, S. Girbal, and O. Temam. Archexplorer.org: Joint
compiler/hardware exploration for fair comparison of
architectures. In INTERACT workshop at HPCA, 2009.
[9] C. Dubach, J. Cavazos, B. Franke, G. Fursin, M. F. O’Boyle,
and O. Temam. Fast compiler optimisation evaluation using
code-feature based performance prediction. In CF, 2007.
[10] C. Dubach, T. M. Jones, and M. F. O’Boyle. Exploring and
predicting the architecture/optimising compiler co-design
space. In CASES, 2008.
[11] C. Dubach, T. M. Jones, and M. F. P. O’Boyle.
Microarchitectural design space exploration using an
architecture-centric approach. In MICRO, 2007.
[12] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith. A
performance counter architecture for computing accurate cpi
components. In ASPLOS, 2006.
[13] D. Fischer, J. Teich, R. Weper, U. Kastens, and M. Thies.
Design space characterization for architecture/compiler
co-exploration. In CASES, 2001.
[14] G. Fursin, C. Miranda, O. Temam, M. Namolaru,
E. Yom-Tov, A. Zaks, B. Mendelson, E. Bonilla, J. Thomson,
H. Leather, C. Williams, and M. O’Boyle. MILEPOST
GCC: machine learning based research compiler. In GCC
Summit, 2008.
[15] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge,
and R. Brown. MiBench: A free, commercially
representative embedded benchmark suite. In WWC, 2001.
[16] M. Haneda, P. Knijnenburg, and H. Wijshoff. Automatic
selection of compiler options using non-parametric
inferential statistics. In PACT, 2005.
[17] K. Hoste and L. Eeckhout. Cole: compiler optimization level
exploration. In CGO, 2008.
[18] E. İpek, B. R. de Supinski, M. Schulz, and S. A. McKee. An
approach to performance prediction for parallel applications.
In Euro-Par, 2005.
[19] E. İpek, S. A. McKee, R. Caruana, B. R. de Supinski, and
M. Schulz. Efﬁciently exploring architectural design spaces
via predictive modeling. In ASPLOS-XII, 2006.
[20] P. J. Joseph, K. Vaswani, and M. J. Thazhuthaveetil.
Construction and use of linear regression models for
processor performance analysis. In HPCA-12, February
2006.
[21] P. J. Joseph, K. Vaswani, and M. J. Thazhuthaveetil. A
predictive performance model for superscalar processors. In
MICRO-39, 2006.
[22] T. S. Karkhanis and J. E. Smith. A ﬁrst-order superscalar
processor model. In ISCA, 2004.
[23] S. Khan, P. Xekalakis, J. Cavazos, and M. Cintra. Using
predictive modeling for cross-program design space
exploration in multicore systems. In PACT, 2007.
[24] P. Kulkarni, S. Hines, J. Hiser, D. Whalley, J. Davidson, and
D. Jones. Fast searches for effective optimization phase
sequences. In PLDI, 2004.
[25] C. Lattner and V. Adve. LLVM: A compilation framework
for lifelong program analysis & transformation. In CGO,
2004.
[26] B. C. Lee and D. Brooks. Illustrative design space studies
with microarchitectural regression models. In HPCA-13,
2007.
[27] B. C. Lee and D. M. Brooks. Accurate and efficient
regression modeling for microarchitectural performance and
power prediction. In ASPLOS-XII, 2006.
[28] B. C. Lee, D. M. Brooks, B. R. de Supinski, M. Schulz,
K. Singh, and S. A. McKee. Methods of inference and
learning for performance modeling of parallel applications.
In PPoPP-12, 2007.
[29] A. McGovern and J. E. B. Moss. Scheduling straight-line
code using reinforcement learning and rollouts. In NIPS,
1998.
[30] Z. Pan and R. Eigenmann. Fast and effective orchestration of
compiler optimizations for automatic performance tuning. In
CGO, 2006.
[31] A. Phansalkar, A. Joshi, L. Eeckhout, and L. K. John.
Measuring program similarity: Experiments with SPEC CPU
benchmark suites. In ISPASS, 2005.
[32] M. Puschel, J. Moura, J. Johnson, D. Padua, M. Veloso,
B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko,
K. Chen, R. Johnson, and N. Rizzolo. SPIRAL: Code
generation for DSP transforms. Proceedings of the IEEE,
93(2):232–275, Feb. 2005.
[33] M. Stephenson and S. Amarasinghe. Predicting unroll factors
using supervised classification. In CGO, 2005.
[34] M. Stephenson, S. Amarasinghe, M. Martin, and
U. O’Reilly. Meta optimization: improving compiler
heuristics with machine learning. In PLDI, 2003.
[35] D. Tarjan, S. Thoziyoor, and N. P. Jouppi. CACTI 4.0.
Technical Report HPL-2006-86, HP Laboratories Palo Alto,
2006.
[36] S. Triantafyllis, M. Vachharajani, N. Vachharajani, and D. I.
August. Compiler optimization-space exploration. In CGO,
2003.
[37] Trimaran: An infrastructure for research in instruction-level
parallelism. http://www.trimaran.org/, 2000.
[38] K. Vaswani, M. J. Thazhuthaveetil, Y. N. Srikant, and P. J.
Joseph. Microarchitecture sensitive empirical models for
compiler optimizations. In CGO, 2007.
[39] R. Vuduc, J. W. Demmel, and J. A. Bilmes. Statistical
models for empirical search-based performance tuning. Int.
J. High Perform. Comput. Appl., 18(1):65–94, 2004.
[40] K. Yotov, X. Li, G. Ren, M. Cibulskis, G. DeJong,
M. Garzaran, D. Padua, K. Pingali, P. Stodghill, and P. Wu.
A comparison of empirical and model-driven optimization.
SIGPLAN Not., 38(5):63–76, 2003.
[41] M. Zhao, B. Childers, and M. L. Soffa. Predicting the impact
of optimizations for embedded systems. SIGPLAN Not.,
38(7), 2003.
[42] W. Zhao, B. Cai, D. Whalley, M. W. Bailey, R. van Engelen,
X. Yuan, J. D. Hiser, J. W. Davidson, K. Gallivan, and D. L.
Jones. VISTA: a system for interactive code improvement. In
LCTES/SCOPES, 2002.