Improved Basic Block Reordering by Newell, Andy & Pupyrev, Sergey
Andy Newell
Facebook, Inc.
Menlo Park, CA, USA
newella@fb.com
Sergey Pupyrev
Facebook, Inc.
Menlo Park, CA, USA
spupyrev@fb.com
Abstract
Basic block reordering is an important step for profile-guided
binary optimization. The state-of-the-art for basic block re-
ordering is to maximize the number of fall-through branches.
However, we demonstrate that such orderings may impose
suboptimal performance on instruction and I-TLB caches.
We propose a new algorithm that relies on a model combining
the effects of fall-through and caching behavior. As the
details of modern processor caching are quite complex and
often unknown, we show how to use machine learning in
selecting parameters that best trade off different caching
effects to maximize binary performance.
An extensive evaluation on a variety of applications, in-
cluding Facebook production workloads, the open-source
compiler Clang, and SPEC CPU 2006 benchmarks, indicates
that the new method outperforms existing block reordering
techniques, improving the resulting performance of large-
scale data-center applications. We have open sourced the
code of the new algorithm as a part of a post-link binary
optimization tool, BOLT.
1 Introduction
Profile-guided binary optimization (PGO) is an important
step for improving performance of large-scale applications
that tend to contain huge amounts of code. Such techniques,
also known as feedback-driven optimization (FDO), are de-
signed to improve code locality which leads to better utiliza-
tion of CPU instruction caches. In practice tools like Aut-
oFDO [8], Ispike [22], PLTO [35], HFSort [26], and BOLT [27]
speed up binaries by 5% − 15% depending on workload and
CPU architecture, and thus, are widely used for a variety of
complex applications.
PGO is comprised of a number of optimization passes such
as function and basic block reordering, identical code folding,
function inlining, unreachable code elimination, register al-
location, and others. Typical targets for optimizations are an
instruction cache (I-cache) used to hold executable instruc-
tions and a translation lookaside buffer (I-TLB) used to speed
up virtual-to-physical address translation for instructions.
The reordering passes directly optimize code layout, and
thus impact performance the most [22, 27]. Therefore, even
small improvements in the underlying algorithms for code
reordering significantly affect the benefit of PGO tools.
Current techniques for basic block reordering optimize a
specific dimension of CPU performance such as (i) cache line
utilization by increasing the average number of instructions
executed per cache line, (ii) the branch predictor by reducing
the number of mispredicted branches, and (iii) the instruc-
tion cache miss rate by minimizing cache line conflicts. An
application’s overall performance depends on a combina-
tion of these dimensions. Since modern processors employ a
complex and often non-disclosed strategy for execution, it
is challenging to consider all of these effects at once when
optimizing an ordering of basic blocks. In this paper, we make
the first, to the best of our knowledge, attempt to design and
implement a block reordering algorithm that directly optimizes
the performance of an application.
Our approach consists of two main steps. Firstly, we learn
a proxy metric that describes the relationship between the
performance of a binary and the ordering of its basic blocks.
This is achieved by (i) selecting a set of features representing
how basic block ordering can influence performance, (ii) col-
lecting training data by running extensive experiments and
measuring the performance, and (iii) using machine learning
to suggest the best combination of the features for a score
that best predicts CPU performance. Secondly, we suggest
an efficient algorithm that, given a control flow graph for a
procedure, builds an improved ordering of the basic blocks
optimizing the learned metric. Since the constructed metric
correlates highly with the performance of a binary, we ob-
serve overall efficiency gains, despite possible regressions of
individual CPU characteristics.
The contributions of the paper are the following.
• We identify an opportunity for improvement over the
classical approach for basic block reordering, initiated by
Pettis and Hansen [29]. Then we extend the model and
suggest a new optimization problem with the objective
closely related to the performance of a binary.
• We then develop a new practical algorithm for basic
block reordering. The algorithm relies on a greedy tech-
nique for solving the optimization problem. We describe
the details of our implementation, which scales to real-
world instances without significant impact on the run-
ning time of a binary optimization tool.
• We propose a Mixed Integer Programming formulation
for the aforementioned optimization problem, which is
capable of finding optimal solutions on small functions.
Our experiments with the exact method demonstrate
that the new suggested heuristic finds an optimal order-
ing of basic blocks in 98% of real-world functions with
30 or fewer blocks.
arXiv:1809.04676v1 [cs.PL] 12 Sep 2018
• Finally, we extensively evaluate the new algorithm on a
variety of applications, including Facebook production
workloads, the open-source compiler Clang, and SPEC
CPU 2006 benchmarks. The experiments indicate that
the new method outperforms the state-of-the-art block
reordering techniques, improving the resulting perfor-
mance by 0.5% − 1%. We have open sourced the code of
our new algorithm as a part of BOLT [5, 27].
The paper is organized as follows. We first discuss limitations
of the existing model for basic block reordering and
suggest an improvement in Section 2. We describe an effi-
cient heuristic (Section 3) and an exact algorithm (Section 4)
for solving the new problem. Next, in Section 5, we present
experimental results, which are followed by a discussion of
related work in Section 6. We conclude the paper with some
limitations and promising future directions in Section 7.
2 Learning an Optimization Model
The state-of-the-art approach for basic block reordering is
based on the idea of collocating frequently executed blocks
together. The goal is to position blocks so that the hottest
successor of a block will most likely be a fall-through branch,
that is, located right next to the predecessor. This strategy
reduces the number of taken branches and the working set
size of the I-cache, while relieving pressure from the branch
predictor unit. More formally, the reordering problem can be
formulated as follows. Given a directed control flow graph
comprising basic blocks and frequencies of jumps between
the blocks, find an ordering of the blocks such that the num-
ber of fall-through jumps is maximized. This is the maximum
directed Traveling Salesman Problem (TSP), a widely stud-
ied NP-hard combinatorial optimization problem.
The simplicity of the model and solid practical results
made TSP-based algorithms very popular in the code opti-
mization community. To the best of our knowledge, Boesch
and Gimpel [4] are the first ones to formulate the problem of
finding an ordering of basic blocks as the path covering problem
on a control flow graph, which is equivalent to solving
TSP. They describe an optimal algorithm on acyclic directed
graphs and suggest a heuristic for general digraphs. Later
the same path covering model has been studied in a series of
papers suggesting optimal algorithms for special classes of
digraphs and heuristics for general digraphs [14, 23, 38]. In
their seminal paper from 1990 [29], Pettis and Hansen present
two heuristics for positioning of basic blocks. We notice that
both heuristics are designed to solve (possibly non-optimally)
TSP. Later, one of the heuristics (seemingly producing bet-
ter results) has been extended by Calder and Grunwald [6],
Torrellas et al. [36], and Luk et al. [22]. We stress that the
majority of existing algorithms for block reordering utilize
the TSP model. A variant of the Pettis-Hansen algorithm
Figure 1. Two orderings of basic blocks, (B0, B1, B2, B3, B4)
and (B0, B1, B2, B4, B3), with the same TSP score (1995)
resulting in different I-cache utilization. All blocks have
the same size of 16 bytes and are colored according to their
hotness in the profile.
is used by most of the modern binary optimizers, includ-
ing PLTO [35], Ispike [22], BOLT [27], and the link-time
optimizer (LTO) of the GCC compiler [32].
We observe that solving TSP alone is not sufficient for
constructing a good ordering of basic blocks. It is easy to
find examples of control flow graphs with multiple different
orderings that are all optimal with respect to the TSP objec-
tive. Consider for example a control flow graph in Figure 1
in which the maximal number of fall-through branches is
achieved with two orderings that utilize a different number
of I-cache lines during a typical execution. For these cases, an
algorithm needs to take into consideration non-fall-through
branches to choose the best ordering. However, maximizing
the number of fall-through jumps is not always preferred
from the performance point of view. Consider a control flow
graph with seven basic blocks in Figure 2. It is not hard
to verify that the ordering with the maximum number of
fall-through branches is one containing two concatenated
chains, B0B1B3B4 and B5B6B2 (upper-right in Fig-
ure 2). Notice that for this placement, the hot part of the
function occupies three 64-byte cache lines. Arguably a bet-
ter ordering is the lower-right in Figure 2, which uses only
two cache lines for the five hot blocks, B0,B1,B2,B3,B4, at
the cost of breaking the lightly weighted branch B6B2.
How do we identify the best ordering of basic blocks? The
question is fairly difficult, and even experts may have a hard
time determining which ordering leads to the maximum performance
of a binary. A naive approach is to exhaustively
evaluate every valid block placement and then profile the bi-
nary to collect relevant performance metrics. Obviously due
to the enormous search space, this approach is infeasible for
practical use. A natural improvement is to reduce the search
space and experiment only with the most promising order-
ings. This technique, also known as iterative compilation
or autotuning, is a natural task for machine learning [2, 37].
While in certain scenarios the overhead is justifiable, we
found this approach impractical for our production systems
due to long build and deployment times. Therefore, we use
Figure 2. A control flow graph with jump frequencies (left)
and two possible orderings of basic blocks (right). All blocks
have the same size (in bytes) and are colored according to their
hotness in the profile.
another strategy for optimization by trying to develop a
score function that is used as a proxy for estimating the
quality of an ordering. The idea is to perform extensive ex-
periments profiling an application in order to understand
what aspects and features of the block placement affect the
resulting performance. After this first phase, employ a ma-
chine learning technique to build an optimization model and
derive a quality metric for an ordering. As a final step, design
an algorithm to optimize the constructed score function.
Next we describe the process in detail, explaining the data
collection phase and presenting the developed score function.
Since our new approach for basic block reordering is imple-
mented in a post-link optimizer BOLT [27] and evaluated on
modern Intel x86 processors, we describe the steps that are
typical for this setup. Note however that the approach is not
tied to the binary optimizer and can be similarly applied in
other environments.
Data Collection. Following most of the recent works on
profile-guided code optimizations [8, 22, 26, 27], we rely on
sampling techniques for collecting profile data. Although
the sampling-based approach is typically less accurate than
the instrumentation-based one, it incurs significantly less
memory and performance overheads, making it the preferred
way of profiling binaries in actual production environments.
We utilize hardware support of Intel x86 processors to col-
lect Last Branch Records (LBR), which is a list of the last
16 taken branches. From the list of branches, that are sam-
pled according to a specified event, we infer the frequencies
of jumps between basic blocks. Specifically, we extract a
weighted directed control flow graph for every function in
the profiled binary. The vertices (basic blocks) and the edges
(branches) of the graph, along with the sizes of the blocks,
are extracted statically via the BOLT infrastructure, which is
based on LLVM [15]. The weights between the basic blocks
correspond to the total number of times the jumps appear in
collected LBRs. We stress that before processing, we augment
collected LBRs with fall-through jumps, as LBRs only contain
information about taken branches; to this end, we utilize a
simple algorithm similar to one described by Chen et al. [7].
Figure 3. Two examples, (a) and (b), of an original incomplete
profile (left) and its realistic correction (right) that cannot
be reconstructed with the Minimum Cost Maximum Flow model.
Blue numbers show the likely most realistic adjustments of
the edge weights satisfying flow conservation constraints.
Notice that we ignore indirect branches, procedure calls, and
returns while constructing the control flow graph.
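The augmentation of LBRs with fall-through jumps can be illustrated with a minimal sketch. It is not the algorithm of Chen et al. [7] or the BOLT implementation; it only assumes that execution is sequential between the target of one taken branch and the source of the next one in the same record. The names `block_index` and `add_fallthroughs`, and the representation of blocks by their start addresses, are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def block_index(starts, addr):
    """Index of the basic block containing addr, given sorted start addresses."""
    return bisect_right(starts, addr) - 1


def add_fallthroughs(starts, lbr):
    """Count taken jumps and inferred fall-throughs from one LBR sample.

    starts: sorted start addresses of the basic blocks of one function.
    lbr:    list of (source_addr, target_addr) taken branches, oldest first.
    Returns a Counter keyed by (source_block, target_block).
    """
    counts = Counter()
    for src, dst in lbr:
        counts[(block_index(starts, src), block_index(starts, dst))] += 1
    # Between the target of one taken branch and the source of the next,
    # no branch was taken, so every block boundary crossed in that address
    # range is assumed to be a fall-through.
    for (_, dst), (next_src, _) in zip(lbr, lbr[1:]):
        first, last = block_index(starts, dst), block_index(starts, next_src)
        for b in range(first, last):
            counts[(b, b + 1)] += 1
    return counts


# Four 16-byte blocks; a jump from block 2 to block 0, then one from block 1 to block 3.
counts = add_fallthroughs([0, 16, 32, 48], [(40, 0), (20, 48)])
print(dict(counts))
```

In this example the only inferred fall-through is from block 0 to block 1, since the second branch was taken from inside block 1.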
We have experimented with several different events to
collect LBRs, including cycles, retired instructions, and taken
branches, and using various levels of precise event based
sampling. We observed that independently of the utilized
event and processor microarchitecture, the extracted jump
frequencies do not always follow the expected distribution;
refer to [7, 25, 27] for concrete examples and possible expla-
nations of the phenomenon. Hence, we use by default the
cycles event to sample LBRs.
A common technique for ensuring edge weights are more
realistic is solving the Minimum Cost Maximum Flow problem
on the control flow graph, which was initiated by Levin
et al. [19] and later adopted by several groups [7, 21, 24, 25].
In contrast with the earlier works, our experiments with the
flow-based approach did not produce performance gains in
comparison with the original (possibly biased) data. A prob-
lematic example for the approach is illustrated in Figure 3a,
where the arguably most realistic adjustment is highlighted.
Depending on how the costs of the edges are assigned, an
algorithm for Minimum Cost Maximum Flow will either
produce frequencies 2600 and 400 or 600 and 2400 for jumps
B2B3 and B2B4, respectively; it is desirable to keep the
probabilities of the branches at B2 (approximately) the same.
A related issue is shown in Figure 3b. Here an algorithm
may decide to send some flow along the edge B2 → B4,
thus making basic block B4 hot and prohibiting future compiler
optimizations that position hot and cold parts of the func-
tion in different sections of the binary. As a result, we avoid
modifying jump frequencies via the flow-based approach,
leaving for the future the task of increasing profile precision.
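To make the notion of a "realistic" adjustment concrete, the following sketch rescales the outgoing weights of a block so that they match its incoming flow while keeping the branch probabilities unchanged. This is only an illustration of the corrections discussed for Figure 3; it is not part of the described approach, which leaves the sampled frequencies unmodified. The function name `rescale_outgoing` is hypothetical, and the numbers follow the example in the text (600 and 400 sampled on the branches out of B2, with a flow of 3000 through B2).

```python
def rescale_outgoing(out_weights, inflow):
    """Scale the outgoing edge weights of one block so that they sum to
    `inflow`, keeping the branch probabilities unchanged."""
    total = sum(out_weights.values())
    if total == 0:
        return dict(out_weights)  # no samples: nothing to extrapolate from
    return {target: w * inflow / total for target, w in out_weights.items()}


fig_3a = rescale_outgoing({"B3": 600, "B4": 400}, 3000)
fig_3b = rescale_outgoing({"B3": 1000, "B4": 0}, 3000)
print(fig_3a, fig_3b)
```

Unlike a flow-based correction, this local rule never assigns weight to an edge that was not sampled, so the cold block B4 in the second example stays cold.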
Engineering a Score Function. Our goal is to design a func-
tion x → f (x), that takes in a feature vector x , characterizing
an ordering of basic blocks, and produces a real value f (x),
indicating an expected performance of a binary for the or-
dering. We assume that the execution time of a single basic
Figure 4. (a) The lengths of a forward jump, s → t, and a
backward jump, t → s, are 16 and 48 bytes, respectively. (b) The
dependency between the length of a jump (in bytes) and its
importance for the ExtTSP model.
block is independent of the block ordering within a func-
tion. Thus, the ordering only affects branches between the
blocks, which may incur some delay in the execution, for
example due to a miss in the instruction cache. However, not
all branches equally affect the performance. An important
feature of a branch is the jump length, that is, the distance (in
bytes) between the end of the source block and the beginning
of the target block; see Figure 4a. For example, it is a com-
mon belief that zero-length jumps (equivalently, fall-through
branches) impose the smallest performance overhead. This
is the main motivation for the TSP model, whose objective
can be formally expressed as follows:
\[
\mathrm{TSP} = \sum_{(s,t)} w(s,t) \times
\begin{cases}
1 & \text{if } \operatorname{len}(s,t) = 0, \\
0 & \text{if } \operatorname{len}(s,t) > 0,
\end{cases}
\]
where w(s, t) is the frequency and len(s, t) is the length of
branch st . An optimal ordering corresponds to the max-
imum value of the expression; thus, we call it the score of
TSP. The performance, however, might also depend on other
characteristics of a branch. In our study, we consider the
following features.
• The length of a jump impacts the performance of in-
struction caches. Longer jumps are more likely to result
in a cache miss than shorter ones. In particular, a jump
with the length shorter than 64 bytes has a chance to
remain within the same cache line.
• The direction of a branch plays a role in branch prediction.
We distinguish between forward branches, s → t
with s < t, and backward branches, t → s with s < t.
• The branches can be classified into unconditional (if the
out-degree is one) and conditional (if the out-degree is
two). A special kind of branch is one between consecutive
blocks in the ordering, s → s + 1, which is called fall-through;
in this case, a jump instruction is not needed.
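The branch features listed above, together with the TSP objective, can be sketched in a few lines. The code below is illustrative only: the helper names (`layout_offsets`, `jump`, `tsp_score`) and the block names are hypothetical, and the three-block layout with a middle block "m" is an assumed reading of Figure 4a that reproduces its lengths of 16 and 48 bytes.

```python
def layout_offsets(order, size):
    """Start offset (in bytes) of every block for a given ordering."""
    offsets, pos = {}, 0
    for block in order:
        offsets[block] = pos
        pos += size[block]
    return offsets


def jump(offsets, size, s, t):
    """Return (length_in_bytes, kind) for branch s -> t in a given layout.

    The length is the distance between the end of s and the start of t.
    """
    delta = offsets[t] - (offsets[s] + size[s])
    if delta == 0:
        return 0, "fall-through"
    return (delta, "forward") if delta > 0 else (-delta, "backward")


def tsp_score(order, size, weights):
    """Total frequency of the branches that become fall-throughs."""
    offsets = layout_offsets(order, size)
    return sum(w for (s, t), w in weights.items()
               if jump(offsets, size, s, t)[0] == 0)


size = {"s": 16, "m": 16, "t": 16}
order = ["s", "m", "t"]
offsets = layout_offsets(order, size)
forward = jump(offsets, size, "s", "t")
backward = jump(offsets, size, "t", "s")
score = tsp_score(order, size, {("s", "m"): 10, ("s", "t"): 3, ("t", "s"): 2})
print(forward, backward, score)
```

Only the branch from "s" to "m" contributes to the TSP score here; the other two branches are ignored by the TSP model regardless of their length or direction.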
We introduce a new score that estimates the quality of a
basic block ordering taking into account the branch charac-
teristics. In the most generic form, the new function, called
Extended TSP (ExtTSP), is expressed as follows:
\[
\mathrm{ExtTSP} = \sum_{(s,t)} w(s,t) \times K_{s,t} \times h_{s,t}\bigl(\operatorname{len}(s,t)\bigr),
\]
where the sum is taken over all branches in the control flow
graph. Here w(s, t) is the frequency of branch s → t and $K_{s,t}$ is
a weight coefficient modeling the relative importance of the
branch for optimization. We distinguish six types of branches
arising in code: conditional and unconditional versions of
fall-through, forward, and backward branches. Thus, we con-
sider six different coefficients for ExtTSP. The lengths of
the jumps are accounted in the last term of the expression,
which increases the importance of short jumps. The function
$h_{s,t}(\operatorname{len}(s,t))$ takes the value 1 for zero-length jumps
and the value 0 for jumps exceeding a prescribed length, and it
monotonically decreases between the two values. To be con-
sistent with the objective of TSP, the ExtTSP score needs to
be maximized for the best performance. Notice that ExtTSP
is a generalization of TSP, as the latter can be modeled by
setting $K = 1$, $h(\operatorname{len}(s,t)) = 1$ for fall-through branches and
$K = 0$, $h(\operatorname{len}(s,t)) = 0$ otherwise.
In general we cannot manually select the most appropriate
constants of ExtTSP that best describe the performance of
modern processors. Next we describe a process for learning
these constant values that lead to the best performance.
Learning Parameters. As a preliminary step of our study,
we run multiple experiments with two binaries, the Clang
compiler and the HipHop Virtual Machine (HHVM) [1]. Each
experiment consists of constructing a distinct ordering of ba-
sic blocks, running a binary, and measuring its performance
metrics via the Linux perf tool. In order to build different
block orderings, we utilize five algorithms (described in Sec-
tion 5.1) and apply them for a certain percentage (ranging
from 0% to 100%) of functions. In total, we evaluated 50 dis-
tinct block orderings and conducted 250 experiments (five
per ordering) for each of the two binaries.
Our first finding is that the traditional TSP score has a rela-
tively high correlation with the performance of the binaries;
see Figure 5. However, there are several unexpected outliers
that cannot be explained by the model. In order to choose
suitable parameters for the ExtTSP score, we employ the
so-called black-box solver developed by Letham et al. [18]
which is a powerful tool for optimizing functions with com-
putationally expensive evaluations. Formally, our problem
can be stated as finding parameters for ExtTSP that have
the highest correlation with the performance of a binary in
the experiments. Here we try to maximize the Kendall rank
correlation coefficient, τ , between the observed performance
(instructions per cycle) and the predicted improvement given
by the ExtTSP score. Notice that the Pearson correlation
coefficient, ρ, is not the best choice for optimization, as the re-
lationship between observed and predicted values might not
be linear. The black-box solver, which is based on Bayesian
optimization, is able to compute values for a collection of
Figure 5. The relationship between the performance (instructions
per cycle) of two binaries and the TSP and ExtTSP scores measured
for various orderings of basic blocks. The values correspond to
the relative improvements over the original non-optimized binary;
each panel plots speedup (%) against score improvement (%).
(a) Clang: TSP ρ=0.973, τ=0.877; ExtTSP ρ=0.984, τ=0.906.
(b) HHVM: TSP ρ=0.985, τ=0.897; ExtTSP ρ=0.989, τ=0.921.
continuous parameters that maximize the correlation coef-
ficient. This is done via a careful exploration of the search
space taking into account noise in real-world experiment
outcomes. In our study we introduce six variables for weight
coefficients, K, of ExtTSP. The jump-length function, h(·),
is considered to be of the form $1 - \left(\frac{\operatorname{len}(jump)}{M}\right)^{\alpha}$ with two
variables, M > 0 and α > 0, for different types of branches.
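The preference for a rank correlation over the Pearson coefficient can be seen on a small synthetic example. The sketch below implements the tau-a variant of the Kendall coefficient (ties counted as neither concordant nor discordant); which variant was used in the experiments is not stated, so this choice is an assumption, and the data are hypothetical rather than measured.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a): (concordant - discordant) / all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        concordant += sign > 0
        discordant += sign < 0
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


def pearson_rho(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


score_gain = [1, 2, 3, 4, 5]             # hypothetical score improvements
speedup = [g ** 3 for g in score_gain]   # monotone but non-linear response
tau = kendall_tau(score_gain, speedup)
rho = pearson_rho(score_gain, speedup)
print(tau, rho)
```

A score that ranks the orderings perfectly reaches τ = 1 even though the relationship is not linear, whereas ρ stays below 1 for the same data.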
The black-box solver found a model that better predicts the
observed values; see Figure 5. The new model increases the
Kendall correlation coefficient τ from 0.877 to 0.906 for Clang
and from 0.897 to 0.921 for HHVM. The models constructed
for Clang and HHVM are not identical, though they share
many similarities. In addition, the actual (optimal) solutions
contain fractional values that are not human-readable, as the
black-box solver works on a continuous space of parameters.
Next we present a simplified variant of the model in which
we round the parameters, combine parameters with similar
weights (having difference less than 0.05), and exclude ones
having small values (less than 0.05). We did not notice a
discrepancy between the actual and the rounded ExtTSP
models, meaning that the unified solution works well for
both of the binaries and is reasonably robust to the choice
of constants. Recall that better block orderings correspond
to higher values of ExtTSP.
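\[
\mathrm{ExtTSP} = \sum_{(s,t)} w(s,t) \times
\begin{cases}
1 & \text{if } \operatorname{len}(s,t) = 0, \\
0.1 \cdot \left(1 - \frac{\operatorname{len}(s,t)}{1024}\right) & \text{if } 0 < \operatorname{len}(s,t) \le 1024 \text{ and } s < t, \\
0.1 \cdot \left(1 - \frac{\operatorname{len}(s,t)}{640}\right) & \text{if } 0 < \operatorname{len}(s,t) \le 640 \text{ and } s > t, \\
0 & \text{otherwise.}
\end{cases}
\]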
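The simplified score above can be evaluated directly for a given ordering. The following is a minimal sketch under stated assumptions: blocks are laid out back to back without alignment padding, the function name `ext_tsp` and its parameter names are hypothetical, and the three-block graph is an invented example rather than one taken from the experiments.

```python
def ext_tsp(order, size, weights, fwd_max=1024, bwd_max=640, k=0.1):
    """Simplified ExtTSP score of an ordering (higher is better).

    order:   list of block names in layout order.
    size:    block name -> size in bytes.
    weights: (source, target) -> jump frequency.
    """
    offsets, pos = {}, 0
    for block in order:
        offsets[block] = pos
        pos += size[block]
    score = 0.0
    for (s, t), w in weights.items():
        # Signed distance from the end of s to the start of t.
        delta = offsets[t] - (offsets[s] + size[s])
        if delta == 0:                        # fall-through
            score += w
        elif 0 < delta <= fwd_max:            # short forward jump
            score += k * w * (1 - delta / fwd_max)
        elif delta < 0 and -delta <= bwd_max: # short backward jump
            score += k * w * (1 + delta / bwd_max)
    return score


size = {"A": 16, "B": 16, "C": 16}
weights = {("A", "B"): 99, ("A", "C"): 100, ("B", "C"): 99}
score_abc = ext_tsp(["A", "B", "C"], size, weights)
score_acb = ext_tsp(["A", "C", "B"], size, weights)
print(score_abc, score_acb)
```

In this example the ordering (A, B, C) turns two branches into fall-throughs and scores about 207.8, while (A, C, B) keeps only the heaviest one and scores about 119.2.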
Intuitively, ExtTSP resembles the traditional TSP model,
as the number of fall-through branches is the dominant factor.
The main difference is that ExtTSP respects longer jumps.
The impact of such jumps is significantly lower and it linearly
decreases with the length of a jump. Next we summarize our
high-level observations regarding the new score function.
• The suggested parameters for ExtTSP correlate well
with the overall performance of a binary in a production-
like environment, though we also observe moderate correlation
(on the order of ρ = 0.8) between the values of
ExtTSP and the measured number of I-cache misses.
• We have not observed significant differences between
the importance of conditional and unconditional branches
that seem to be similarly relevant for the quality of a
block ordering. This contradicts the intuition of Calder
and Grunwald [6], who assign noticeably different weights
depending on the type of a branch.
• The maximum length of a jump affecting ExtTSP is fairly
large: 16 and 10 cache lines for forward and backward
branches, respectively. The importance of a non-fall-
through branch linearly decreases with its length; see
Figure 4b. We experimented with non-linear decreasing
functions but did not discover a significant improvement;
hence, we use the simpler variant.
• We found that forward branches are more important
for an ordering than backward ones; see Figure 4b for a
dependency of the weights of the two types of branches
on the ExtTSP score. Both types of non-fall-through
branches are noticeably less important than fall-throughs
in the constructed model, which is reflected in the low
coefficient (0.1) in the expression.
Finding an optimal solution for the Extended TSP prob-
lem is NP-hard. Next we describe our heuristic.
3 A Heuristic for ExtTSP
Our algorithm finds an optimized ordering of basic blocks
for every function in the binary. It operates with a weighted
control flow graph G = (V ,E,w) containing a set of basic
Algorithm 1: Basic Block Reordering
Input:  control flow graph G = (V, E, w),
        the entry point v∗ ∈ V
Output: ordering of basic blocks (v∗ = B_1, B_2, ..., B_|V|)

Function ReorderBasicBlocks
    for v ∈ V do                    /* initial chain creation */
        Chains ← Chains ∪ (v);
    while |Chains| > 1 do           /* chain merging */
        for c_i, c_j ∈ Chains do
            gain[c_i, c_j] ← ComputeMergeGain(c_i, c_j);
        /* find best pair of chains */
        src, dst ← argmax_{i,j} gain[c_i, c_j];
        /* merge the pair and update chains */
        Chains ← Chains ∪ Merge(src, dst) \ {src, dst};
    return ordering given by the remaining chain;

Function ComputeMergeGain(src, dst)
    /* try all ways to split chain src */
    for i = 1 to size(src) do
        /* break the chain at index i */
        s1 ← src[1 : i];
        s2 ← src[i + 1 : size(src)];
        /* try all valid ways to concatenate */
        score_i ← max of
            ExtTSP(s1, s2, dst)   if v∗ ∉ dst
            ExtTSP(s1, dst, s2)   if v∗ ∉ dst
            ExtTSP(s2, s1, dst)   if v∗ ∉ s1, dst
            ExtTSP(s2, dst, s1)   if v∗ ∉ s1, dst
            ExtTSP(dst, s1, s2)   if v∗ ∉ src
            ExtTSP(dst, s2, s1)   if v∗ ∉ src
    /* the gain of merging chains src and dst */
    return max_i score_i − ExtTSP(src) − ExtTSP(dst);
blocks, V , and directed edges, E, representing branches be-
tween the blocks. An edge (s, t) ∈ E corresponds to a jump
from a block s ∈ V to a block t ∈ V and its weight, w(s, t),
corresponds to the frequency of the jump. We assume that
the sizes (in bytes) of the basic blocks are a part of the input.
The goal of the algorithm is to find an ordering of V with
an improved ExtTSP score (as defined in Section 2) while
keeping a given entry point, v∗ ∈ V , the first in the ordering.
On a high level, our algorithm is a greedy heuristic that
works with chains (ordered sequences) of basic blocks; see
Algorithm 1 for an overview. Initially all chains are isolated
basic blocks. Then we iteratively merge pairs of chains so
as to improve the ExtTSP score. On every iteration, we pick
a pair of chains whose merging yields the biggest increase
in ExtTSP, and the pair is merged into a new chain. The
procedure stops when there is only one chain left, which
determines the resulting ordering of basic blocks.
An important aspect of our approach is the way two
chains are merged; see function ComputeMergeGain of
Algorithm 1. In order to merge a pair of chains, src and dst ,
we first split chain src into two subchains, s1 and s2, that
retain the ordering of blocks given by src . Then we consider
all six ways of combining the three chains, s1, s2, and dst ,
into a single one, discarding the ones that do not place entry
point v∗ at the beginning. A chain with the largest ExtTSP
over all possible splitting indices of src and permutations
of s1, s2, dst is chosen as the result. The motivation here
Figure 6. Chain concatenation produces suboptimal ordering.
(The figure shows a control flow graph over blocks B0, B1,
and B2 with edge weights 99, 100, and 99.)
is to increase the search space in com-
parison to the simpler concatenation
of two chains. The simplest exam-
ple in which chain splitting helps is
depicted in Figure 6. A greedy con-
catenation merges block B0 with B2
on the first iteration, which results in
the final ordering (B0,B2,B1). In contrast,
chain splitting makes it possible to build
an ordering (B0,B1,B2), which has
a higher ExtTSP score since all the
edges are forward.
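The greedy merging with chain splitting can be sketched as follows. This is a simplified, unoptimized illustration and not the BOLT implementation: it recomputes all gains on every iteration, uses the rounded ExtTSP constants, and relies on hypothetical names (`ext_tsp`, `merge_gain`, `reorder_basic_blocks`). The example graph assumes that the edges of Figure 6 are B0 → B1 (99), B0 → B2 (100), and B1 → B2 (99), and that every block is 16 bytes long; neither assumption is confirmed by the text.

```python
def ext_tsp(chain, size, weights):
    """ExtTSP score of the jumps whose endpoints both lie in `chain`."""
    offsets, pos = {}, 0
    for block in chain:
        offsets[block] = pos
        pos += size[block]
    score = 0.0
    for (s, t), w in weights.items():
        if s not in offsets or t not in offsets:
            continue
        delta = offsets[t] - (offsets[s] + size[s])
        if delta == 0:
            score += w
        elif 0 < delta <= 1024:
            score += 0.1 * w * (1 - delta / 1024)
        elif -640 <= delta < 0:
            score += 0.1 * w * (1 + delta / 640)
    return score


def merge_gain(src, dst, entry, size, weights):
    """Best (gain, merged_chain) over all splits of src and valid concatenations."""
    base = ext_tsp(src, size, weights) + ext_tsp(dst, size, weights)
    best = None
    for i in range(1, len(src) + 1):
        s1, s2 = src[:i], src[i:]
        for cand in (s1 + s2 + dst, s1 + dst + s2, s2 + s1 + dst,
                     s2 + dst + s1, dst + s1 + s2, dst + s2 + s1):
            if entry in cand and cand[0] != entry:
                continue  # the entry block must stay first
            gain = ext_tsp(cand, size, weights) - base
            if best is None or gain > best[0]:
                best = (gain, cand)
    return best


def reorder_basic_blocks(blocks, entry, size, weights):
    """Greedily merge chains until a single ordering remains."""
    chains = [(b,) for b in blocks]
    while len(chains) > 1:
        best = None
        for src in chains:
            for dst in chains:
                if src == dst:
                    continue
                gain, merged = merge_gain(src, dst, entry, size, weights)
                if best is None or gain > best[0]:
                    best = (gain, merged, src, dst)
        _, merged, src, dst = best
        chains = [c for c in chains if c not in (src, dst)] + [merged]
    return list(chains[0])


size = {"B0": 16, "B1": 16, "B2": 16}
weights = {("B0", "B1"): 99, ("B0", "B2"): 100, ("B1", "B2"): 99}
result = reorder_basic_blocks(["B0", "B1", "B2"], "B0", size, weights)
print(result)
```

On this input the first iteration merges B0 with B2, as plain concatenation would, and the second iteration splits the chain (B0, B2) to insert B1 in the middle.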
What is the computational complexity of Algorithm 1?
A naive implementation takes O(|V|^5) steps: There are |V|
merge iterations that process at most |V|^2 pairs of chains
per iteration with O(|V|^2) steps needed to compute a merge
gain between two chains. However, this is an overestima-
tion, as we argue below. First observe that the ExtTSP score
between two chains, c1 and c2, can be positive only if there
is a branch between c1 and c2; thus, the number of candidate
chain pairs for merging is upper bounded by the total num-
ber of branches in the control flow graph, |E |. The second
observation is that one can memoize the results of Com-
puteMergeGain function and re-use them throughout the
computation. It is easy to see that the merge gain depends
only on a pair of chains; hence, if neither of the two chains
is merged at an iteration, then we do not need to recompute
the gain for the pair at the next iteration. Putting the two
observations together, the running time of Algorithm 1 is
bounded by O(∑_c size(c) · degree(c)), where the sum is taken
over all chains taking part in a merge with size(c) being the
number of blocks in the chain and degree(c) being the num-
ber of branches from and to the chain. In the worst case,
this sums up to O(|V|^2 |E|) time in general, which equals
O(|V|^3) for real-world control flow graphs, in which |E| is typically O(|V|).
Large functions. While cubic running time is acceptable
for reordering most of the functions we experimented with,
there are several exceptions with a large number of basic
blocks. In order to deal with these cases, we introduce a
threshold, k > 1, on the maximum size of a chain that is
considered for splitting in ComputeMergeGain. If the size of
a chain, c , exceeds the threshold, that is, size(c) > k , then we
only try a simple concatenation of c with other chains. With
the modification, the complexity of Algorithm 1 is estimated
by O(k · |V|^2), which is quadratic when k is a constant. In
our implementation, we use k = 128 as the default value.
Reordering of cold blocks. Notice that Algorithm 1 is not
trying to optimize layout of cold basic blocks that are never
sampled during profiling. However, one may still want to
modify their relative order, as this could affect the code size
as follows. Consider pairs of cold basic blocks, s and t , such
that the only outgoing branch from s is the only incoming
branch to t . If s and t are not consecutive in the resulting
ordering, then one would need to introduce an unconditional
branch instruction. In contrast, if t follows s in the ordering,
then the instruction is not needed as t is on the fall-through
path of s . In order to guarantee that Algorithm 1 always
merges such pairs into a chain, we modify the weights of
cold edges in the control flow graph before the computation.
Specifically we set w(s, t) = ϵ1 (for some 0 < ϵ1 ≪ 1) if
(s, t) ∈ E corresponds to a cold fall-through branch in the
original binary, and set w(s, t) = ϵ2 (for some 0 < ϵ2 < ϵ1)
if (s, t) corresponds to a cold non-fall-through branch. Such
weights make it desirable to merge original fall-through
branches, even if they are cold according to the profile.
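The weight adjustment for cold edges can be written as a small preprocessing step. The sketch below is an illustration under assumptions: the function name `adjust_cold_weights` is hypothetical, cold edges are taken to be those with a sampled weight of zero, and the concrete values of ε1 and ε2 are arbitrary placeholders that only need to satisfy 0 < ε2 < ε1 ≪ 1.

```python
def adjust_cold_weights(weights, original_fallthroughs, eps1=1e-3, eps2=1e-4):
    """Give never-sampled edges tiny positive weights.

    weights:               (source, target) -> sampled jump frequency.
    original_fallthroughs: set of edges that are fall-throughs in the
                           original binary.
    Cold original fall-throughs get eps1, other cold edges get eps2 < eps1.
    """
    adjusted = {}
    for edge, w in weights.items():
        if w > 0:
            adjusted[edge] = w        # hot edge: keep the profile weight
        elif edge in original_fallthroughs:
            adjusted[edge] = eps1     # cold original fall-through
        else:
            adjusted[edge] = eps2     # cold non-fall-through branch
    return adjusted


weights = {("B0", "B1"): 500, ("B1", "B2"): 0, ("B1", "B3"): 0}
adjusted = adjust_cold_weights(weights, {("B0", "B1"), ("B1", "B2")})
print(adjusted)
```

Because both values are far below any sampled frequency, they act only as tie-breakers among cold blocks and do not change decisions that involve hot edges.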
Code layout in memory. Apart from basic block reorder-
ing, profile-guided optimization tools typically perform two
other passes directly affecting the layout of functions in
the generated code: hot/cold code splitting and code align-
ment [8, 27]. The first one splits hot and cold basic blocks into
separate sections, while the second pass aligns the blocks at
cache line boundaries via introducing NOP instructions. We
stress that both optimizations are complementary to basic
block reordering and their benefits are additive. In our ex-
periments, we evaluate the effect of reordering alone, with
all other optimization passes applied for the binary.
4 An Optimal Algorithm for ExtTSP
We now demonstrate how ExtTSP is formulated as a Mixed
Integer Program (MIP). A MIP is a method to find optimal
solutions for NP-hard problems whose objective and require-
ments are represented by linear functions. This is a time-
consuming technique that can be applied only for small
instances, and we use the optimal MIP solutions to better
understand the quality of our heuristic.
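\[
\begin{aligned}
\text{maximize}\quad & \sum_{(s,t) \in E} w(s,t) \times f(d_{s,t}) \\
\text{subject to}\quad & x_s \in \mathbb{R}, && s \in V \\
& d_{s,t} \in \mathbb{R}, && (s,t) \in E \\
& z_{s,t} \in \{0, 1\}, && s, t \in V,\ s \neq t \\
& x_t - x_s \ge L_s - M \cdot (1 - z_{s,t}), && s, t \in V,\ s \neq t \\
& x_s - x_t \ge L_t - M \cdot z_{s,t}, && s, t \in V,\ s \neq t \\
& d_{s,t} = x_t - x_s - L_s, && (s,t) \in E
\end{aligned}
\]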
The objective is a summation over the contribution of
each edge (s, t) ∈ E in the control flow graph to ExtTSP.
The contribution of an edge is the number of jumps, w(s, t),
weighted by a value dependent on the length of the jump,
ds,t = len(s, t). The piece-wise function shown in Figure 4b
converts the length to the desired weight. This function is for-
mulated in MIP by introducing additional integer variables
to cope with the non-convex shape [10].
The constraints express the complete search space of all
legal starting bytes xs for each block s ∈ V considering the
size of the block, Ls . For all pairs of blocks, s and t , either the
final ordering has s before t (that is, xt −xs ≥ Ls ) or t before s
(that is, xs − xt ≥ Lt ). A binary variable zs,t is utilized to
enforce one of those two constraints. The distances used in
the objective ds,t are constrained to be the distance between
the end of block s and start of block t . Negative distances
correspond to backward jumps which are incorporated into
the piece-wise function in the objective.
We use the Xpress solver to find optimal solutions of
the MIP model.
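For intuition, the quantities of the model can be evaluated directly on a tiny instance. The sketch below is not a MIP: it replaces the solver with brute-force enumeration of all block orderings, computes the starting bytes x_s and the distances d_{s,t} = x_t − x_s − L_s as defined above, and returns the ordering with the highest score. The piece-wise function f is assumed here to be 1 for d = 0, 0.1 · (1 − d/1024) for forward jumps with 0 < d ≤ 1024, 0.1 · (1 − |d|/640) for backward jumps with |d| ≤ 640, and 0 otherwise; these constants stand in for Figure 4b and should be treated as an assumption. The block names, sizes, and edge counts are made up for illustration.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

Edge = Tuple[str, str]


def f(d: int) -> float:
    """Assumed piece-wise weight of a jump of signed length d (in bytes)."""
    if d == 0:
        return 1.0                      # fall-through
    if 0 < d <= 1024:
        return 0.1 * (1 - d / 1024)     # short forward jump
    if -640 <= d < 0:
        return 0.1 * (1 + d / 640)      # short backward jump
    return 0.0


def score(order: Sequence[str], sizes: Dict[str, int], weights: Dict[Edge, float]) -> float:
    """Objective value of a layout: sum of w(s, t) * f(d_{s,t}) over edges."""
    x, addr = {}, 0
    for block in order:                 # x[s] is the starting byte of block s
        x[block] = addr
        addr += sizes[block]
    return sum(w * f(x[t] - x[s] - sizes[s]) for (s, t), w in weights.items())


def best_order(sizes: Dict[str, int], weights: Dict[Edge, float]) -> Tuple[str, ...]:
    """Exhaustive search over all orderings; feasible only for tiny inputs."""
    return max(permutations(sizes), key=lambda o: score(o, sizes, weights))


sizes = {"a": 16, "b": 32, "c": 16}
weights = {("a", "b"): 10.0, ("a", "c"): 90.0, ("c", "b"): 90.0}
best = best_order(sizes, weights)
print(best, score(best, sizes, weights))
```

Like the MIP, the enumeration is exact but scales poorly: the number of orderings grows factorially with the number of blocks, so it is useful only as a reference for small functions.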
5 Evaluation
The experiments presented in this section were conducted
on Linux-based servers powered by Intel microprocessors.
The applications were compiled using either GCC 5.5 or
Clang 6.0 with -O3 optimization level.
5.1 Techniques
We compare our new algorithm (referred to as ext-tsp) with
the following competitors.
• original is the ordering provided by the compiler.
• tsp is the ordering constructed by the “top-down” heuris-
tic suggested by Pettis and Hansen [29]. The algorithm
starts by placing the entry basic block for a function,
and then iteratively finds a successor with the heaviest
edge to the last placed block. If all successors have
already been selected, then one picks the block with the
largest connection to the already placed blocks.
• ph is the Pettis-Hansen “bottom-up” algorithm [29]. The
algorithm maintains a collection of chains of basic
blocks, which correspond to paths in the control flow
graph. Initially every block forms its own chain. Looking
at the arcs from largest to smallest, two different chains
are merged together if the arc connects the tail of one
chain to the head of another. Once the merging stage
is done, the chains are ordered so as to maximize the
weight of backward edges to achieve the best performance
of the branch predictor.
• cache represents a modification of the Pettis-Hansen
algorithm suggested by Luk et al. for the Ispike post-link
optimizer [22]. The difference from ph is in the
last step, ordering of chains of basic blocks. The chains
are sorted by their density, that is, the total execution
count of a chain divided by the sum of sizes of its basic
blocks. Placing hottest chains first reduces conflicts in
the I-cache and improves code locality.
• mip is an optimal algorithm for ExtTSP described in
Section 4. Since the running time of the approach is not
practical for large functions, we only compare the results
of mip on a subset of small functions.
All the algorithms are implemented in an open-source
post-link binary optimizer BOLT [5].

Table 1. Basic properties of evaluated binaries

                                            hot blocks per function
            .text (MB)    IPC    functions    avg     p95      max
Clang               29   0.71       10,600     19     226    1,556
HHVM               143   0.83        9,500     60     465    9,375
Multifeed          469   0.95       28,500     19     172    9,228
Proxygen           216   0.63        9,900     15     108      945
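The bottom-up chain merging of ph and the density-based chain ordering of cache, as described in the list above, can be sketched in a few lines of Python. This is a simplified illustration and not the BOLT implementation: the function names are hypothetical, edges into the entry block are skipped so that the entry stays at the head of its chain (an added assumption), and the backward-edge ordering step of ph is omitted.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def ph_chains(blocks: List[str], weights: Dict[Edge, float]) -> List[List[str]]:
    """Greedy bottom-up merging; blocks[0] is assumed to be the entry block."""
    chain_of = {b: [b] for b in blocks}       # initially one chain per block
    # Process arcs from the largest weight to the smallest.
    for (s, t), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        if t == blocks[0]:
            continue                          # keep the entry block first
        cs, ct = chain_of[s], chain_of[t]
        # Merge only if the arc connects the tail of one chain
        # to the head of a different chain.
        if cs is not ct and cs[-1] == s and ct[0] == t:
            cs.extend(ct)
            for b in ct:
                chain_of[b] = cs
    # Collect distinct chains in order of first appearance.
    seen, chains = set(), []
    for b in blocks:
        c = chain_of[b]
        if id(c) not in seen:
            seen.add(id(c))
            chains.append(c)
    return chains


def order_by_density(chains: List[List[str]], counts: Dict[str, float],
                     sizes: Dict[str, int]) -> List[List[str]]:
    """Keep the entry chain first, then sort the rest by decreasing density."""
    def density(chain: List[str]) -> float:
        return sum(counts[b] for b in chain) / sum(sizes[b] for b in chain)
    return [chains[0]] + sorted(chains[1:], key=density, reverse=True)


blocks = ["entry", "a", "b", "c"]
weights = {("entry", "a"): 10.0, ("entry", "b"): 90.0,
           ("b", "c"): 80.0, ("a", "c"): 10.0}
chains = ph_chains(blocks, weights)
print(chains)

demo = order_by_density([["entry"], ["x"], ["y"]],
                        counts={"entry": 1, "x": 10, "y": 100},
                        sizes={"entry": 4, "x": 10, "y": 10})
print(demo)
```

In the first example the heaviest arcs (entry, b) and (b, c) are merged into one chain, after which neither arc touching block a connects a tail to a head, so a remains a separate chain.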
5.2 Facebook Workloads and Clang
This section evaluates various basic block ordering algo-
rithms on four large-scale binaries deployed at Facebook’s
data centers and a binary of the open-source Clang compiler.
The first system, which is our primary use case for ext-tsp,
is the HipHop Virtual Machine (HHVM) [1], which serves as an
execution engine for PHP at Facebook, Wikipedia, Baidu, and
other large websites. Multifeed and Proxygen are Facebook
services responsible for News Feed and cluster load balancing,
respectively. The HHVM binary is built using GCC with
LTO, while Multifeed and Proxygen are compiled with Clang
with AutoFDO enabled to enhance their performance. All
the Facebook services are running with huge pages enabled
and utilize function reordering [26]. Table 1 provides basic
properties of the evaluated binaries.
Figure 7 presents a performance comparison of four ba-
sic block ordering algorithms on the Facebook workloads.
The results are obtained by using internal performance-
measurement tools for running A/B experiments. The tests
are performed on a set of isolated machines that process the
same production traffic. We report performance as the CPU
utilization during steady state. We note that differences in
CPU utilization among block ordering were highly correlated
with instructions per cycle (IPC). As a baseline, we measure
the performance of the binaries optimized with BOLT us-
ing the original block ordering algorithm. In the case of
HHVM and Clang, this is an original ordering constructed
by the compiler.

[Figure 7: bar chart of speedup (%) per binary; bar values as extracted]
            Clang-small   Clang-large   HHVM
tsp                6.94          4.39   4.83
ph                 6.93          3.94   4.82
cache              6.75          4.08   5.07
ext-tsp            7.67          5.17   5.58
Figure 7. Performance improvements of various reordering
algorithms over original measured for different binaries.
We exclude Multifeed and Proxygen from Figure 7 because
production limitations prevented us from running enough experiments
to accurately show differences among the basic block ordering
algorithms, so including them would be misleading. On these
binaries, each ordering algorithm tends to provide about a 2%
performance improvement, with little difference among the ordering
algorithms. We attribute the lower performance improvement
to the fact that FDO was applied to these binaries.
Basic block reordering can be beneficial for any front-end
bound application, as illustrated by our tests with the bi-
nary of the Clang compiler whose .text section is around
30MB. For these experiments, we utilize a dual-node 2.4 GHz
Intel Xeon E5-2680 (Broadwell) with 256GB RAM. We ran
the evaluation in two modes, Clang-small and Clang-large,
that consist of building a single source file and a collection of about
100 template-heavy C++14 source files, respectively. For the
baseline, we utilize a Clang binary compiled with GCC and
then optimized with BOLT using original ordering of ba-
sic blocks. Every experimental run is repeated 500 times to
increase precision of our measurements so that the average
mean deviation is within 0.05%. According to our profile data
(collected on a training medium-sized source file), the binary
of Clang contains around 11MB of hot code making it a good
candidate for profile-guided layout optimization given that
the size of L1 I-cache is only 32KB on the processor.
Overall we observe that ext-tsp performs better than
alternative ordering algorithms on all but one evaluated in-
stance. The relative speedup is close to 1% for Clang and
around 0.5% for HHVM and Proxygen. We stress that the
experiments for Multifeed are noticeably noisier than the
alternatives with a typical deviation from the mean around
0.5%; hence, identifying smaller efficiency gains is problematic.
A more detailed analysis of the evaluation for the HHVM
binary is depicted in Figure 8. The main advantage of block
ordering optimization is an improved performance of the
L1 I-cache, which exhibits over 19% miss reduction. The new
ordering algorithm increases this value to 21%. The number
of branch and I-TLB misses is also significantly reduced,
with ext-tsp being the best for the branch misses counter.
We also see a modest improvement in the performance of
the last level cache, though the difference between various
ordering algorithms is not prominent.

[Figure 8: bar chart of miss reduction (%) per counter; bar values as extracted]
            I-cache   I-TLB   LLC   Branch
tsp            18.7    10.3   3.5      5.7
ph             18.9     9.8   3.4      6.0
cache          19.3    10.9   3.3      6.1
ext-tsp        20.8     9.9   3.5      6.2
Figure 8. perf metrics measured for the HHVM binary.
5.3 SPEC CPU 2006
In this section we evaluate basic block reordering on the
SPEC CPU 2006 benchmark. We utilize 19 C/C++ programs
compiled using GCC with LTO and ran experiments on the
same hardware as in the previous section. We analyze the
performance of the binaries optimized by BOLT with various
ordering algorithms, using original order as a baseline.
Profile data is collected using a separate SPEC train mode.
We observe that the SPEC binaries are much smaller than
the typical applications used in modern data centers. Therefore,
they are unlikely to be front-end bound or to exhibit
many I-cache and I-TLB misses. Figure 9 presents the results
of our experiments on the largest binaries that contain at
least 100KB of hot code according to the collected profile.
Since we are interested in the impact of various block or-
dering algorithms on performance, we report the relative
differences to the baseline (positive values indicate improve-
ments, negative ones indicate regressions). We do not see
a consistent advantage of applying basic block reordering
for the binaries. In all the experiments (with the exception
of h264ref) we record a high variance in the running times,
which often exceeds the differences between means. For
the two largest programs, gcc and perlbench, the ext-tsp
algorithm achieves 3% and 2% speedups, respectively, out-
performing the competitors by 0.3%.
To better understand the source of regressions, we analyze
a binary from the benchmark, astar, whose running time
increases by 2% − 3% after block reordering. It turns out
that there is a single function whose reordering yields a 2%
regression, despite the fact that the number of fall-through
branches increases by more than 20%. We conclude that for
small binaries that are not front-end bound, both TSP and
ExtTSP are not accurate models.

[Figure 9: bar chart of speedup (%) for tsp, ph, cache, and ext-tsp. Binaries with hot code size: gcc 790KB, perlbench 275KB, gobmk 240KB, dealII 210KB, povray 170KB, h264ref 130KB, namd 110KB, soplex 105KB.]
Figure 9. Performance improvement using various block
reordering methods on the largest binaries of the SPEC 2006
dataset. For every binary, the size of the hot code is specified.
5.4 Analysis of ExtTSP
Here we present an evaluation of Algorithm 1 (ext-tsp) for
solving the ExtTSP problem. We design the experiments to
answer two questions: (a) How do various parameters of the
algorithm contribute to the solution and what are the best
default values? (b) How does the algorithm perform in com-
parison with existing heuristics and the optimal technique?
Considering the first question, we observe that ext-tsp
has only one parameter that affects its quality and perfor-
mance. As explained in Section 3, we introduce a threshold k ,
which controls the maximum size of a chain that can be split
during optimization. In the extreme case with k = |V |, all
chains can be broken if that improves the objective; however,
the running time of the algorithm is cubic in the number of
basic blocks comprising a function. Another extreme, k = 0,
forbids chain splitting but makes the running time quadratic.
As Table 1 illustrates, some functions in the dataset contain
a few thousand basic blocks. Hence, the threshold should
be chosen carefully, since it impacts the quality of a solution
and the time needed to process a binary, which can be crucial
in production environments.
Figure 10 illustrates the results of the experiments with
the chain splitting threshold. Multifeed is the binary whose
processing time is substantially affected by large values of k.
For k ≥ 1024, the combined running time of ext-tsp on
all functions of the Multifeed binary is around 10 minutes,
while for HHVM it is less than 2 minutes. For k = 0, the
processing times are 15 and 6 seconds for the binaries, respec-
tively. The difference between the corresponding solutions
in the ExtTSP score ranges between 0.3% and 0.7%; note
that according to our analysis in Section 2 and Figure 5, this
translates to a performance difference of 0.1% − 0.3%. The
value k = 128 provides a reasonable compromise between
processing speed and solution quality, and thus, is utilized
as the default value for ext-tsp in all our experiments.
[Figure 10: two panels plotted against the split threshold, k (2 to 512): ExtTSP improvement (%) and running time (sec), with one curve each for Clang, HHVM, Multifeed, and Proxygen.]
Figure 10. The running times and the resulting ExtTSP
score produced by ext-tsp (Algorithm 1) using various
chain splitting thresholds for binaries described in Table 1.
In order to analyze the quality of solutions for ExtTSP
generated by Algorithm 1, we employ the optimal technique
presented in Section 4. We apply mip to all 2992 functions
containing at most 30 basic blocks in the HHVM binary.
For these small functions, mip finds a provably optimal so-
lution in 2963 (99%) of the cases in less than one minute.
Out of these instances with the known optimal ordering,
our heuristic, ext-tsp, finds an equivalent solution in 2914
(98.3%).
For the remaining 49 functions, the ExtTSP score produced
by ext-tsp is on average 0.14% lower than the optimum.
For comparison, the runner-up approach on the same bi-
nary is cache, which is able to reconstruct 2745 (92.6%) of
optimal orderings in the binary. The relative improvements
in the ExtTSP score over non-reordered functions are 27%,
29%, 29%, 30% for tsp, ph, cache, and ext-tsp, respectively,
which aligns with our experiments in Figure 7.
We emphasize that mip is not considered a practical ap-
proach, as it does not scale to instances with many basic
blocks. The average running time of mip on a function with
30 blocks exceeds 10 seconds, while it is below a millisecond
for all four alternatives, tsp, ph, cache, and ext-tsp. Never-
theless, the aforementioned analysis demonstrates that the
new heuristic provides close-to-optimal solutions in the
majority of real-world instances, while being sufficiently fast
to process large functions.
6 Related Work
There exists a rich literature on profile-guided optimizations.
Here we discuss previous works that are closely related to
code layout and our main contributions.
The work by Pettis and Hansen [29] is the basis for the ma-
jority of modern code reordering techniques. The goal is to
create chains of basic blocks that are frequently executed to-
gether in the order. As discussed in Section 2, many variants
of the technique have been suggested in the literature and
implemented in various tools [4, 6, 14, 22, 23, 32, 35, 36, 38].
Similar to our work, the techniques operate on a
control flow graph and try to lay out basic blocks tackling a
variant of the Traveling Salesman Problem. Alternative
models have been studied by Bahar et al. [3], Gloy et al. [12],
and Lavaee and Ding [17], where a temporal-relation graph is
taken into account. Temporal affinities between code instruc-
tions can be utilized for reducing conflict cache misses [13]
and improving the performance of multiple applications us-
ing a shared cache [20]. We emphasize that according to our
experiments, the performance of a front-end bound large-
scale binary can be largely predicted by its control flow graph
without considering more expensive models.
Code reordering at the function level was also initiated by
Pettis and Hansen [29], who describe an algorithm that is
implemented in many compilers and binary optimization
tools [27, 31, 32, 34]. This approach greedily merges chains of
functions and is designed to primarily reduce I-TLB misses.
An improvement was recently proposed by Ottoni and Maher [26],
who suggest working with a directed call graph in
order to reduce I-cache misses. Note that unlike our work,
the techniques are heuristics not aiming to produce code
layouts that are optimal from the performance point of view.
Another opportunity for improving performance is to
modify layout of data [9, 11, 30, 33]. Most of the existing
works focus on field reordering and structure splitting based
on the field hotness and data affinities. While the problem of
finding an optimal data layout is computationally hard [16,
28], we believe that utilizing machine learning may lead
to improved heuristics resulting in performance gains for
real-world applications.
7 Discussion
In this work we extended the state-of-the-art model for re-
ordering of basic blocks and developed a new efficient algo-
rithm to optimize the layout of a binary. There are several
interesting aspects and possible limitations of our approach
that we discuss next. Firstly, our approach employs a machine
learning toolkit to build a desired objective for optimization.
An advantage is that such a technique can be used by non-
experts in the field of code optimization. As our evaluation
demonstrates, the resulting model outperforms the classical
one, as the new objective correlates very well with the per-
formance of large-scale binaries. A possible risk here is to
over-tune a model for a specific application and miss impor-
tant details that might affect performance. The experiments
with the SPEC benchmark imply that the models based on
maximizing the number of fall-through branches are too
simplistic for binaries that are not front-end bound.
Secondly, our study focuses on optimizing data center applications
built with particular compilers and running on
specific hardware. A reasonable direction for future work is to verify
whether the presented approach can be generalized to other
use cases. Our preliminary experiments indicate that compa-
rable gains can be achieved on other Intel microprocessors
and alternative processor architectures. Similarly, the new
reordering algorithm is applied as a post-link optimization,
and we did not examine how it behaves on earlier compi-
lation stages. It is interesting to investigate the effect of an
improved reordering done at compilation time. In particular,
we plan in the future to integrate and compare ext-tsp with
the algorithms implemented in GCC [36] and Clang.
Finally, we point out that this paper considers a certain
aspect related to code generation: reordering of basic blocks
within a function. There are many complementary optimiza-
tions that we did not investigate in detail, for example, un-
rolling loops or duplicating blocks in order to avoid extra
jumps. An attractive direction is to allow cross-procedure
reordering in which basic blocks from different functions can
be interleaved in the final layout. This might further increase
code locality and improve cache utilization. Unfortunately,
our preliminary experiments with existing cross-procedure
heuristics [36] did not produce measurable gains; further
research on the technique is an intriguing direction for future work.
Acknowledgments
We would like to thank Pol Mauri Ruiz and Alon Shalita for
fruitful discussions of the project. We would also like to thank
Rafael Auler, Guilherme Ottoni, and Maksim Panchenko for
their help with integrating ext-tsp into BOLT.
References
[1] K. Adams, J. Evans, B. Maher, G. Ottoni, A. Paroski, B. Simmers,
E. Smith, and O. Yamauchi. The HipHop Virtual Machine. In SIGPLAN
Notices, volume 49, pages 777–790. ACM, 2014.
[2] A. H. Ashouri, W. Killian, G. P. John Cavazos, and C. Silvano. A
survey on compiler autotuning using machine learning. arXiv preprint
arXiv:1801.04405, 2018.
[3] I. Bahar, B. Calder, and D. Grunwald. A comparison of software code
reordering and victim buffers. ACM SIGARCH Computer Architecture
News, 27(1):51–54, 1999.
[4] F. T. Boesch and J. F. Gimpel. Covering points of a digraph with point-
disjoint paths and its application to code optimization. Journal of the
ACM (JACM), 24(2):192–198, 1977.
[5] BOLT. Binary Optimization and Layout Tool. https://github.com/
facebookincubator/BOLT.
[6] B. Calder and D. Grunwald. Reducing branch costs via branch align-
ment. In SIGPLAN Notices, volume 29, pages 242–251. ACM, 1994.
[7] D. Chen, N. Vachharajani, R. Hundt, X. Li, S. Eranian, W. Chen, and
W. Zheng. Taming hardware event samples for precise and versatile
feedback directed optimizations. IEEE Transactions on Computers, 62
(2):376–389, 2013.
[8] D. Chen, D. X. Li, and T. Moseley. AutoFDO: Automatic feedback-
directed optimization for warehouse-scale applications. In Interna-
tional Symposium on Code Generation and Optimization, pages 12–23.
ACM, 2016.
[9] T. M. Chilimbi and R. Shaham. Cache-conscious coallocation of hot
data streams. In SIGPLAN Notices, volume 41, pages 252–262. ACM,
2006.
[10] K. L. Croxton, B. Gendron, and T. L. Magnanti. A comparison of
mixed-integer programming models for nonconvex piecewise linear
cost minimization problems. Management Science, 49(9):1268–1273,
2003.
[11] T. Eimouri, K. B. Kent, A. Micic, and K. Taylor. Using field access frequency
to optimize layout of objects in the JVM. In Annual Symposium
on Applied Computing, pages 1815–1818. ACM, 2016.
[12] N. Gloy and M. D. Smith. Procedure placement using temporal-
ordering information. Transactions on Programming Languages and
Systems, 21(5):977–1027, 1999.
[13] A. H. Hashemi, D. R. Kaeli, and B. Calder. Efficient procedure mapping
using cache line coloring. In SIGPLAN Notices, volume 32, pages 171–
182. ACM, 1997.
[14] T. Hirata, A. Maruoka, and M. Kimura. A polynomial time algorithm
to find a path cover of a reducible flow graph. Syst. Comput. Control,
10(3):71–78, 1979.
[15] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong
program analysis & transformation. In International Symposium on
Code Generation and Optimization, page 75. IEEE Computer Society,
2004.
[16] R. Lavaee. The hardness of data packing. SIGPLAN Notices, 51(1):
232–242, 2016.
[17] R. Lavaee and C. Ding. ABC Optimizer: Affinity based code layout
optimization. Technical report, University of Rochester, 2014.
[18] B. Letham, B. Karrer, G. Ottoni, and E. Bakshy. Constrained Bayesian
optimization with noisy experiments. Bayesian Analysis, 2018. doi:
10.1214/18-BA1110. Advance publication.
[19] R. Levin, I. Newman, and G. Haber. Complementing missing and
inaccurate profiling using a minimum cost circulation algorithm. In
International Conference on High-Performance Embedded Architectures
and Compilers, pages 291–304. Springer, 2008.
[20] P. Li, H. Luo, C. Ding, Z. Hu, and H. Ye. Code layout optimization
for defensiveness and politeness in shared cache. In International
Conference on Parallel Processing, pages 151–161. IEEE, 2014.
[21] X.-h. Liu, Y. Peng, and J.-y. Zhang. A sample profile-based optimization
method with better precision. DEStech Transactions on Computer
Science and Engineering, 2016.
[22] C.-K. Luk, R. Muth, H. Patil, R. Cohn, and G. Lowney. Ispike: A post-link
optimizer for the Intel® Itanium® architecture. In Code Generation and
Optimization: Feedback-Directed and Runtime Optimization, page 15.
IEEE Computer Society, 2004.
[23] O. Medvedev. Optimal basic block reordering via hammock decompo-
sition. In Spring/Summer Young Researchers Colloquium on Software
Engineering, number 1, 2007.
[24] D. Novillo. SamplePGO: the power of profile guided optimizations
without the usability burden. In Proceedings of the 2014 LLVM Compiler
Infrastructure in HPC, pages 22–28. IEEE Press, 2014.
[25] A. Nowak, A. Yasin, A. Mendelson, and W. Zwaenepoel. Establishing
a base of trust with performance counters for enterprise workloads.
In USENIX Annual Technical Conference, pages 541–548, 2015.
[26] G. Ottoni and B. Maher. Optimizing function placement for large-
scale data-center applications. In International Symposium on Code
Generation and Optimization, pages 233–244. IEEE Press, 2017.
[27] M. Panchenko, R. Auler, B. Nell, and G. Ottoni. BOLT: A practi-
cal binary optimizer for data centers and beyond. arXiv preprint
arXiv:1807.06735, 2018.
[28] E. Petrank and D. Rawitz. The hardness of cache conscious data
placement. In SIGPLAN Notices, volume 37, pages 101–112. ACM,
2002.
[29] K. Pettis and R. C. Hansen. Profile guided code positioning. In SIGPLAN
Notices, volume 25, pages 16–27. ACM, 1990.
[30] E. Raman, R. Hundt, and S. Mannarswamy. Structure layout optimiza-
tion for multithreaded programs. In International Symposium on Code
Generation and Optimization, pages 271–282. IEEE, 2007.
[31] A. Ramirez, L. A. Barroso, K. Gharachorloo, R. Cohn, J. Larriba-Pey,
P. G. Lowney, and M. Valero. Code layout optimizations for transac-
tion processing workloads. In SIGARCH Computer Architecture News,
volume 29, pages 155–164. ACM, 2001.
[32] A. Ramírez, J.-L. Larriba-Pey, C. Navarro, J. Torrellas, and M. Valero.
Software trace cache. In International Conference on Supercomputing,
pages 261–268. ACM, 2014.
[33] P. Roy and X. Liu. StructSlim: A lightweight profiler to guide struc-
ture splitting. In International Symposium on Code Generation and
Optimization, pages 36–46. ACM, 2016.
[34] W. J. Schmidt, R. R. Roediger, C. S. Mestad, B. Mendelson, I. Shavit-
Lottem, and V. Bortnikov-Sitnitsky. Profile-directed restructuring of
operating system code. IBM Systems Journal, 37(2):270–297, 1998.
[35] B. Schwarz, S. Debray, G. Andrews, and M. Legendre. PLTO: A link-time
optimizer for the Intel IA-32 architecture. In Workshop on Binary
Rewriting, 2001.
[36] J. Torrellas, C. Xia, and R. L. Daigle. Optimizing the instruction cache
performance of the operating system. IEEE Transactions on Computers,
47(12):1363–1381, 1998.
[37] Z. Wang and M. O’Boyle. Machine learning in compiler optimization.
Proceedings of the IEEE, PP(99):1–23, 2018.
[38] C. Young, D. S. Johnson, M. D. Smith, and D. R. Karger. Near-optimal
intraprocedural branch alignment. SIGPLAN Notices, 32(5):183–193,
1997.
