BOLT: A Practical Binary Optimizer for Data Centers and Beyond by Panchenko, Maksim et al.
BOLT: A Practical Binary Optimizer
for Data Centers and Beyond
Maksim Panchenko Rafael Auler Bill Nell Guilherme Ottoni
Facebook, Inc.
Menlo Park, CA, USA
{maks,rafaelauler,bnell,ottoni}@fb.com
Abstract
Performance optimization for large-scale applications has
recently become more important as computation continues
to move towards data centers. Data-center applications are
generally very large and complex, which makes code lay-
out an important optimization to improve their performance.
This has motivated recent investigation of practical tech-
niques to improve code layout at both compile time and link
time. Although post-link optimizers had some success in the
past, no recent work has explored their benefits in the context
of modern data-center applications.
In this paper, we present BOLT, a post-link optimizer
built on top of the LLVM framework. Utilizing sample-
based profiling, BOLT boosts the performance of real-world
applications even for highly optimized binaries built with
both feedback-driven optimizations (FDO) and link-time op-
timizations (LTO). We demonstrate that post-link perfor-
mance improvements are complementary to conventional
compiler optimizations, even when the latter are done at a
whole-program level and in the presence of profile infor-
mation. We evaluated BOLT on both Facebook data-center
workloads and open-source compilers. For data-center appli-
cations, BOLT achieves up to 8.0% performance speedups
on top of profile-guided function reordering and LTO. For
the GCC and Clang compilers, our evaluation shows that
BOLT speeds up their binaries by up to 20.4% on top of FDO
and LTO, and up to 52.1% if the binaries are built without
FDO and LTO.
1. Introduction
Given the large scale of data centers, optimizing their work-
loads has recently gained a lot of interest. Modern data-
center applications tend to be very large and complex pro-
grams. Due to their sheer amount of code, optimizing the
code locality for these applications is very important to im-
prove their performance.
The large size and performance bottlenecks of data-center
applications make them good targets for feedback-driven
optimizations (FDO), also called profile-guided optimiza-
tions (PGO), particularly code layout. At the same time,
the large sizes of these applications also impose scala-
bility challenges to apply FDO to them. Instrumentation-
based profilers incur significant memory and computa-
tional performance costs, often making it impractical to
gather accurate profiles from a production system. To sim-
plify deployment and increase adoption, it is desirable
to have a system that can obtain profile data for FDO
from unmodified binaries running in their normal produc-
tion environments. This is possible through the use of
sample-based profiling, which enables high-quality pro-
files to be gathered with minimal operational complexity.
This is the approach taken by tools such as Ispike [21],
AutoFDO [6], and HFSort [25]. This same principle is used
as the basis of the BOLT tool presented in this paper.
Profile data obtained via sampling can be retrofitted to
multiple points in the compilation chain. The point at
which the profile data is used can vary from compilation
time (e.g. AutoFDO [6]), to link time (e.g. LIPO [18] and
HFSort [25]), to post-link time (e.g. Ispike [21]). In gen-
eral, the earlier in the compilation chain the profile infor-
mation is inserted, the larger is the potential for its impact,
because more phases and optimizations can benefit from
this information. This benefit has motivated recent work on
compile-time and link-time FDO techniques. At the same
time, post-link optimizations, which in the past were ex-
plored by a series of proprietary tools such as Spike [8],
Etch [28], FDPR [11], and Ispike [21], have not attracted
much attention in recent years. We believe that the lack of in-
terest in post-link optimizations is due to folklore and the in-
tuition that this approach is inferior because the profile data
is injected very late in the compilation chain.
In this paper, we demonstrate that the intuition described
above is incorrect. The important insight that we leverage
in this work is that, although injecting profile data earlier
in the compilation chain enables its use by more optimiza-
tions, injecting this data later enables more accurate use of
the information for better code layout. In fact, one of the
main challenges with AutoFDO is to map the profile data,
collected at the binary level, back to the compiler’s intermediate
representations [6]. In the original compilation used to
produce the binary where the profile data is collected, many
optimizations are applied to the code by the compiler and
linker before the machine code is emitted. In a post-link
optimizer, which operates at the binary level, this problem is
much simpler, resulting in more accurate use of the profile
data. This accuracy is particularly important for low-level
optimizations such as code layout.
arXiv:1807.06735v2 [cs.PL] 12 Oct 2018
[Figure 1 diagram: Source Code -> Parser -> Compiler IR -> Code Gen. -> Object Files -> Linker -> Executable Binary -> Binary Opt. -> Optimized Executable Binary, with Profile Data as an input.]
Figure 1: Example of a compilation pipeline and the various alternatives to retrofit sample-based profile data.
We demonstrate the finding above in the context of a
static binary optimizer we built, called BOLT. BOLT is a
modern, retargetable binary optimizer built on top of the
LLVM compiler infrastructure [16]. Our experimental eval-
uation on large real-world applications shows that BOLT can
improve performance by up to 20.41% on top of FDO and
LTO. Furthermore, our analysis demonstrates that this im-
provement is mostly due to the improved code layout that is
enabled by the more accurate usage of sample-based profile
data at the binary level.
Overall, this paper makes the following contributions:
1. It describes the design of a modern, open-source post-
link optimizer built on top of the LLVM infrastructure.1
2. It demonstrates empirically that a post-link optimizer
is able to better utilize sample-based profiling data to
improve code layout compared to a compiler-based ap-
proach.
3. It shows that neither compile-time, link-time, nor post-
link-time FDO supersedes the others but, instead, they
are complementary.
This paper is organized as follows. Section 2 motivates
the case for using sample-based profiling and static binary
optimization to improve performance of large-scale applica-
tions. Section 3 then describes the architecture of the BOLT
binary optimizer, followed by a description of the optimiza-
tions that BOLT implements in Section 4 and a discussion
about profiling techniques in Section 5. An evaluation of
BOLT and a comparison with other techniques is presented
in Section 6. Finally, Section 7 discusses related work and
Section 8 concludes the paper.
2. Motivation
In this section, we motivate the post-link optimization ap-
proach used by BOLT.
1 BOLT is available at https://github.com/facebookincubator/BOLT.
2.1 Why sample-based profiling?
Feedback-driven optimizations (FDO) have been proved to
help increase the impact of code optimizations in a variety
of systems (e.g. [6, 9, 13, 18, 24]). Early developments in
this area relied on instrumentation-based profiling, which
requires a special instrumented build of the application to
collect profile data. This approach has two drawbacks. First,
it complicates the build process, since it requires a special
build for profile collection. Second, instrumentation typi-
cally incurs very significant CPU and memory overheads.
These overheads generally render instrumented binaries in-
appropriate for running in real production environments.
In order to increase the adoption of FDO in production
environments, recent work has investigated FDO-style tech-
niques based on sample-based profiling [6, 7, 25]. Instead
of instrumentation, these techniques rely on much cheaper
sampling using hardware profile counters available in mod-
ern CPUs, such as Intel’s Last Branch Records (LBR) [15].
This approach is more attractive not only because it does not
require a special build of the application, but also because
the profile-collection overheads are negligible. By address-
ing the two main drawbacks of instrumentation-based FDO
techniques, sample-based profiling has increased the adop-
tion of FDO-style techniques in complex, real-world pro-
duction systems [6, 25]. For these same practical reasons,
we opted to use sample-based profiling in this work.
2.2 Why a binary optimizer?
Sample-based profile data can be leveraged at various levels
in the compilation pipeline. Figure 1 shows a generic com-
pilation pipeline to convert source code into machine code.
As illustrated in Figure 1, the profile data may be injected at
different program-representation levels, ranging from source
code, to the compiler’s intermediate representations (IR), to
the linker, to post-link optimizers. In general, the designers
of any FDO tool are faced with the following trade-off. On
the one hand, injecting profile data earlier in the pipeline al-
lows more optimizations along the pipeline to benefit from
this data. On the other hand, since sample-based profile data
must be collected at the binary level, the closer a level is
to this representation, the higher the accuracy with which
the data can be mapped back to this level’s program repre-
sentation. Therefore, a post-link binary optimizer allows the
profile data to be used with the greatest level of accuracy.
(01) function foo(int x) {
(02)   if (x > 0) {
(03)     ... // B1
(04)   } else {
(05)     ... // B2
(06)   }
(07) }
(08) function bar() {
(09)   foo(... /* > 0 */); // gets inlined
(10) }
(11) function baz() {
(12)   foo(... /* < 0 */); // gets inlined
(13) }
Figure 2: Example showing a challenge in mapping binary-
level events back to higher-level code representations.
AutoFDO [6] retrofits profile data back into a compiler’s
intermediate representation (IR). Chen et al. [7] quantified
the precision of the profile data that is lost by retrofitting
profile data even at a reasonably low-level representation in
the GCC compiler. They quantified that the profile data had
84.1% accuracy, which they were able to improve to 92.9%
with some techniques described in that work.
The example in Figure 2 illustrates the difficulty in map-
ping binary-level performance events back to a higher-level
representation. In this example, both functions bar and baz
call function foo, which gets inlined in both callers. Func-
tion foo contains a conditional branch for the if statement
on line (02). For forward branches like this, on modern pro-
cessors, it is advantageous to make the most common suc-
cessor be the fall-through, which can lead to better branch
prediction and instruction-cache locality. This means that,
when foo is inlined into bar, block B1 should be placed
before B2, but the blocks should be placed in the opposite
order when inlined into baz. When this program is profiled
at the binary level, two branches corresponding to the if in
line (02) will be profiled, one within bar and one within
baz. Assume that functions bar and baz execute the same
number of times at runtime. Then, when mapping the branch
frequencies back to the source code in Figure 2, one will
conclude that the branch at line (02) has a 50% chance
of branching to both B1 and B2. And, after foo is inlined
in both bar and baz, the compiler will not be able to tell
what layout is best in each case. Notice that, although this
problem can be mitigated by injecting the profile data into
a lower-level representation after function inlining has been
performed, this does not solve the problem in case foo is
declared in a different module than bar and baz because in
this case inlining cannot happen until link time.
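The loss of precision in Figure 2 can be made concrete with a small sketch. The counts below are invented for illustration (the paper gives none), and the dictionary layout and the helper name `outcome_probability` are hypothetical. Each inlined copy of the branch at line (02) is perfectly biased, yet summing the two copies by source line produces an uninformative even split.

```python
from collections import Counter

# Hypothetical binary-level counts for the two inlined copies of the
# branch at source line (02). Keys: (enclosing function, source line).
# The numbers are made up: bar always passes x > 0, baz always x < 0.
binary_profile = {
    ("bar", 2): {"B1": 1000, "B2": 0},
    ("baz", 2): {"B1": 0, "B2": 1000},
}


def outcome_probability(counts):
    """Turn raw successor counts into probabilities."""
    total = sum(counts.values())
    return {block: n / total for block, n in counts.items()}


# Binary-level view: each copy of the branch keeps its own counts.
per_copy = {site: outcome_probability(c) for site, c in binary_profile.items()}

# Source-level view: both copies collapse onto line (02) of foo.
merged = Counter()
for (_function, _line), counts in binary_profile.items():
    merged.update(counts)
per_source_line = outcome_probability(merged)

print(per_copy[("bar", 2)])  # B1 is the only successor seen inside bar
print(per_copy[("baz", 2)])  # B2 is the only successor seen inside baz
print(per_source_line)       # both successors look equally likely
```

A layout pass that only sees `per_source_line` has no basis for choosing a fall-through, whereas one that sees `per_copy` can order the blocks differently in each caller.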
Since our initial motivation for BOLT was to improve
large-scale data-center applications, where code layout plays
a major role, a post-link binary optimizer was very appeal-
ing. Traditional code-layout techniques are highly dependent
on accurate branch frequencies [26], and using inaccurate
profile data can actually lead to performance degradation [7].
Nevertheless, as we mentioned earlier, feeding profile infor-
mation at a very low level prevents earlier optimizations in
the compilation pipeline from leveraging this information.
Therefore, with this approach, any optimization that we want
to benefit from the profile data needs to be applied at the bi-
nary level. Fortunately, code layout algorithms are relatively
simple and easy to apply at the binary level.
2.3 Why a static binary optimizer?
The benefits of a binary-level optimizer outlined above can
be exploited either statically or dynamically. We opted for
a static approach for two reasons. The first is the simplicity
of the approach. The second is the absence of
runtime overheads. Even though dynamic binary optimiz-
ers have had some success in the past (e.g. Dynamo [2],
DynamoRIO [5], StarDBT [29]), these systems incur non-
trivial overheads that go against the main goal of improving
the overall performance of the target application. In other
words, these systems need to perform really well in order to
recover their overheads and achieve a net performance win.
Unfortunately, since they need to keep their overheads low,
these systems often have to implement faster, sub-optimal
code optimization passes. This has been a general challenge
to the adoption of dynamic binary optimizers, as they are
not suited for all applications and can easily degrade per-
formance if not tuned well. The main benefit of a dynamic
binary optimizer over a static one is the ability to handle dy-
namically generated and self-modifying code.
3. Architecture
Large-scale data-center binaries may contain over 100 MB
of code from multiple source-code languages, including as-
sembly language. In this section, we discuss the design of
the BOLT binary optimizer that we created to operate in this
scenario.
3.1 Initial Design
We developed BOLT by incrementally increasing its binary
code coverage. At first, BOLT was only able to optimize the
code layout of a limited set of functions. With time, code
coverage gradually increased by adding support for more
complex functions. Even today, BOLT is still able to leave
some functions in the binary untouched while processing
and optimizing others, conservatively skipping code that
violates its current assumptions.
The initial implementation targeted x86_64 Linux ELF
binaries and relied exclusively on ELF symbol tables to
guide binary content identification. By doing that, BOLT
was able to optimize code layout within existing function
boundaries. When BOLT was not able to reconstruct the
control-flow graph of a given function with full confidence,
it would just leave the function untouched.
Due to the nature of code layout optimizations, the effec-
tive code size may increase for a couple of reasons. First, this
may happen due to an increase in the number of branches on
cold paths. Second, there is a peculiarity of x86’s conditional
branch instruction, which occupies 2 bytes if a (signed) offset
to a destination fits in 8 bits but otherwise takes 6 bytes
for 32-bit offsets. Naturally, moving cold code further away
showed a tendency to increase the hot code size. If an optimized
function would not fit into the original function’s allocated
space, BOLT would split the cold code and move it to
a newly created ELF segment. Note that such function splitting
was involuntary and did not provide any extra benefit
beyond allowing code straightening optimizations as BOLT
was not filling out the freed space between the split point and
the next function.
[Figure 3 diagram: Function discovery -> Read debug info -> Read profile data -> Disassembly -> CFG construction -> Optimization pipeline -> Emit and link functions -> Rewrite binary file.]
Figure 3: Diagram showing BOLT’s binary rewriting pipeline.
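The branch-size effect described in Section 3.1 can be illustrated with a short sketch. The function name and the addresses in the usage note are hypothetical. The only facts assumed are the ones stated in the text: the short form takes 2 bytes with an 8-bit signed offset, and the long form takes 6 bytes with a 32-bit offset. The offset is measured from the end of the branch instruction.

```python
SHORT_JCC_SIZE = 2  # opcode byte + 8-bit signed displacement
NEAR_JCC_SIZE = 6   # two opcode bytes + 32-bit signed displacement


def conditional_branch_size(branch_addr, target_addr):
    """Return the encoded size in bytes of an x86 conditional branch.

    The displacement is relative to the end of the instruction, so the
    short form is tried first using its own 2-byte length.
    """
    displacement = target_addr - (branch_addr + SHORT_JCC_SIZE)
    if -128 <= displacement <= 127:
        return SHORT_JCC_SIZE
    return NEAR_JCC_SIZE


# A cold block that sits right after the hot code keeps the short form.
print(conditional_branch_size(0x1000, 0x1040))
# Moving the same cold block far away forces the 6-byte form.
print(conditional_branch_size(0x1000, 0x200000))
```

Each hot branch whose cold target is moved out of range therefore grows by 4 bytes, which is why relocating cold code tends to enlarge the hot part of a function.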
3.2 Relocations Mode
A second and more ambitious mode was later added to op-
erate by changing the position of all functions in the bi-
nary. While multiple approaches were considered, the most
obvious and straightforward one was to rely on relocations
recorded and saved by the linker in an executable. Both BFD
and Gold linkers provide such an option (--emit-relocs).
However, even with this option, there are still some missing
pieces of information. An example is the relative offsets for
PIC jump tables which are removed by the linker. Other ex-
amples are some relocations that are not visible even to the
linker, such as cross-function references for local functions
within a single compilation unit (they are processed inter-
nally by the compiler). Therefore, in order to detect and fix
such references, it is important to disassemble all the code
correctly before trying to rearrange the functions in the bi-
nary. Nevertheless, with relocations, the job of gaining com-
plete control over code re-writing became much easier. Han-
dling relocations gives BOLT the ability to change the order
of functions in the binary and split function bodies to further
improve code locality.
Since linkers have access to relocations, it would be pos-
sible to use them for similar binary optimizations. However,
there are multiple open-source linkers for x86 Linux alone,
and which one is being used for any particular application
depends on a number of circumstances that may also change
over time. Therefore, in order to facilitate the tool’s adop-
tion, we opted for writing an independent post-link optimizer
instead of being tied to a specific linker.
3.3 Rewriting Pipeline
Analyzing an arbitrary binary and locating code and data is
not trivial. In fact, the problem of precisely disassembling
machine code is undecidable in the general case. In practice,
there is more information than just an entry point available,
and BOLT relies on correct ELF symbol table information
for code discovery. Since BOLT works with 64-bit Linux
binaries, the ABI requires an inclusion of function frame in-
formation that contains function boundaries as well. While
BOLT could have relied on this information, it is often the
case that functions written in assembly omit frame informa-
tion. Thus, we decided to employ a hybrid approach using
both symbol table and frame information when available.
Figure 3 shows a diagram with BOLT’s rewriting steps.
Function discovery is the very first step, where function
names are bound to addresses. Later, debug information and
profile data are retrieved so that disassembly of individual
functions can start.
BOLT uses the LLVM compiler infrastructure [16] to
handle disassembly and modification of binary files. There
are a couple of reasons LLVM is well suited for BOLT. First,
LLVM has a nice modular design that enables relatively easy
development of tools based on its infrastructure. Second,
LLVM supports multiple target architectures, which allows
for easily retargetable tools. To illustrate this point, a work-
ing prototype for the ARM architecture was implemented in
less than a month. In addition to the assembler and disassem-
bler, many other components of LLVM proved to be useful
while building BOLT. Overall, this decision to use LLVM
has worked out well. The LLVM infrastructure has enabled
a quick implementation of a robust and easily retargetable
binary optimizer.
As Figure 3 shows, the next step in the rewriting pipeline
is to build the control-flow graph (CFG) representation for
each function. The CFG is constructed using the
MCInst objects provided by LLVM’s TableGen-generated
disassembler. BOLT reconstructs the control-flow informa-
tion by analyzing any branch instructions encountered dur-
ing disassembly. Then, in the CFG representation, BOLT
runs its optimization pipeline, which is explained in detail
in Section 4. For BOLT, we have added a generic annotation
mechanism to MCInst in order to facilitate certain optimiza-
tions, e.g. as a way of recording dataflow information. The
final steps involve emitting functions and using LLVM’s run-
time dynamic linker (created for the LLVM JIT systems) to
resolve references among functions and local symbols (such
as basic blocks). Finally, the binary is rewritten with the new
contents while also updating ELF structures to reflect the
new sizes.
 #   Pass Name          Description
 1.  strip-rep-ret      Strip repz from repz retq instructions used for legacy AMD processors
 2.  icf                Identical code folding
 3.  icp                Indirect call promotion
 4.  peepholes          Simple peephole optimizations
 5.  inline-small       Inline small functions
 6.  simplify-ro-loads  Fetch constant data in .rodata whose address is known statically and mutate a load into a mov
 7.  icf                Identical code folding (second run)
 8.  plt                Remove indirection from PLT calls
 9.  reorder-bbs        Reorder basic blocks and split hot/cold blocks into separate sections (layout optimization)
10.  peepholes          Simple peephole optimizations (second run)
11.  uce                Eliminate unreachable basic blocks
12.  fixup-branches     Fix basic block terminator instructions to match the CFG and the current layout (redone by reorder-bbs)
13.  reorder-functions  Apply HFSort [25] to reorder functions (layout optimization)
14.  sctc               Simplify conditional tail calls
15.  frame-opts         Removes unnecessary caller-saved register spilling
16.  shrink-wrapping    Moves callee-saved register spills closer to where they are needed, if profiling data shows it is better to do so
Table 1: Sequence of transformations applied in BOLT’s optimization pipeline.
3.4 C++ Exceptions and Debug Information
BOLT is able to recognize DWARF [10] information and
update it to reflect the code modifications and relocations
performed during the rewriting pass.
Figure 4 shows an example of a CFG dump demonstrat-
ing BOLT’s internal representation of the binary for the first
two basic blocks of a function with C++ exceptions and a
throw statement. The function is quite small with only five
basic blocks in total, and each basic block is free to be relo-
cated to another position, except the entry point. Placehold-
ers for DWARF Call Frame Information (CFI) instructions
are used to annotate positions where the frame state changes
(for example, when the stack pointer advances). BOLT re-
builds all CFI for the new binary based on these annota-
tions so the frame unwinder works properly when an excep-
tion is thrown. The callq instruction at offset 0x00000010
can throw an exception and has a designated landing pad
as indicated by a landing-pad annotation displayed next to
it (handler: .LLP0; action: 1). The last annotation on
the line indicates a source line origin for every machine-level
instruction.
4. Optimizations
BOLT runs passes with either code transformations or anal-
yses, similar to a compiler. BOLT is also equipped with a
dataflow-analysis framework to feed information to passes
that need it. This enables BOLT to check register liveness at
a given program point, a technique also used by Ispike [21].
Some passes are architecture-independent while others are
not. In this section, we discuss the passes applied to the Intel
x86_64 target.
Binary Function "_Z11filter_onlyi" after building cfg {
  State       : CFG constructed
  Address     : 0x400ab1
  Size        : 0x2f
  Section     : .text
  LSDA        : 0x401054
  IsSimple    : 1
  IsSplit     : 0
  BB Count    : 5
  CFI Instrs  : 4
  BB Layout   : .LBB07, .LLP0, .LFT8, .Ltmp10, .Ltmp9
  Exec Count  : 104
  Profile Acc : 100.0%
}
.LBB07 (11 instructions, align : 1)
  Entry Point
  Exec Count : 104
  CFI State : 0
    00000000: pushq %rbp # exception4.cpp:22
    00000001: !CFI $0 ; OpDefCfaOffset -16
    00000001: !CFI $1 ; OpOffset Reg6 -16
    00000001: movq %rsp, %rbp # exception4.cpp:22
    00000004: !CFI $2 ; OpDefCfaRegister Reg6
    00000004: subq $0x10, %rsp # exception4.cpp:22
    00000008: movl %edi, -0x4(%rbp) # exception4.cpp:22
    0000000b: movl -0x4(%rbp), %eax # exception4.cpp:23
    0000000e: movl %eax, %edi # exception4.cpp:23
    00000010: callq _Z3fooi # handler: .LLP0; action: 1
                            # exception4.cpp:23
    00000015: jmp .Ltmp9 # exception4.cpp:24
  Successors: .Ltmp9 (mispreds: 0, count: 100)
  Landing Pads: .LLP0 (count: 4)
  CFI State: 3
.LLP0 (2 instructions, align : 1)
  Landing Pad
  Exec Count : 4
  CFI State : 3
  Throwers: .LBB07
    00000017: cmpq $-0x1, %rdx # exception4.cpp:24
    0000001b: je .Ltmp10 # exception4.cpp:24
  Successors: .Ltmp10 (mispreds: 0, count: 4),
              .LFT8 (inferred count: 0)
  CFI State: 3
....
Figure 4: Partial CFG dump for a function with C++ exceptions.

Table 1 shows each individual BOLT optimization pass
in the order they are applied. For example, the first line
presents strip-rep-ret at the start of the pipeline. Notice
that passes 1 and 4 are focused on leveraging precise target
architecture information to remove or mutate some instruc-
tions. A use case of BOLT for data-center applications is to
allow the user to trade any optional choices in the instruction
space in favor of I-cache space, such as removing alignment
NOPs and AMD-friendly REPZ bytes, or using shorter ver-
sions of instructions. Our findings show that, for large ap-
plications, it is better to aggressively reduce I-cache occu-
pation, except if the change incurs D-cache overhead, since
cache is one of the most constrained resources in the data-
center space. This explains BOLT’s policy of discarding all
NOPs after reading the input binary. Even though compiler-
generated alignment NOPs are generally useful, the extra
space required by them does not pay off and simply strip-
ping them from the binary provides a small but measurable
performance improvement.
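The space-saving policy above can be sketched as a filter over decoded instructions. This is only an illustration of the idea, not BOLT's implementation: the `(mnemonic, size_in_bytes)` pairs and the function name are hypothetical stand-ins for real decoded instructions, and all NOP variants are assumed to be normalized to the mnemonic `nop`.

```python
def shrink_instruction_stream(instructions):
    """Drop NOPs and strip the repz prefix from 'repz retq'.

    `instructions` is a list of (mnemonic, size_in_bytes) pairs, a
    hypothetical stand-in for decoded machine instructions.
    """
    result = []
    for mnemonic, size in instructions:
        if mnemonic == "nop":
            continue  # alignment padding is discarded outright
        if mnemonic == "repz retq":
            # The prefix occupies one byte; the bare return keeps the rest.
            result.append(("retq", size - 1))
            continue
        result.append((mnemonic, size))
    return result


before = [("pushq", 1), ("nop", 4), ("movq", 3), ("repz retq", 2)]
after = shrink_instruction_stream(before)
saved = sum(s for _, s in before) - sum(s for _, s in after)
print(after, saved)
```

In this made-up stream, five bytes are saved out of ten, all of which would otherwise occupy I-cache space without doing useful work.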
BOLT features identical code folding (ICF) to complement
the ICF optimization done by the linker. An additional
benefit of doing ICF at the binary level is the ability
to optimize functions that were compiled without the
-ffunction-sections flag and functions that contain
jump tables. As a result, BOLT is able to fold more identical
functions than the linkers. We have measured the reduction
of code size for the HHVM binary [1] to be about 3% on top
of the linker’s ICF pass.
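A minimal sketch of the folding idea follows, under a strong simplifying assumption: two functions are treated as identical only when their bodies are byte-for-byte equal. A real ICF pass also has to compare the symbols that the code references, which this sketch ignores. The function name and the sample bodies are hypothetical.

```python
from collections import defaultdict


def fold_identical_functions(functions):
    """Map every function name to the representative that replaces it.

    `functions` maps a name to the raw bytes of its body. Functions with
    byte-identical bodies are folded into the first name in sorted order.
    """
    groups = defaultdict(list)
    for name in sorted(functions):
        groups[functions[name]].append(name)

    folded = {}
    for names in groups.values():
        representative = names[0]
        for name in names:
            folded[name] = representative
    return folded


bodies = {
    "get_x": bytes.fromhex("8b07c3"),
    "get_first": bytes.fromhex("8b07c3"),
    "get_y": bytes.fromhex("8b4704c3"),
}
mapping = fold_identical_functions(bodies)
print(mapping)
```

After folding, every call to `get_x` can be redirected to `get_first`, and only one copy of the shared body needs to be emitted.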
Passes 3 (indirect call promotion), 5 (inline small functions),
and 8 (PLT call optimization) leverage call frequency
information to either eliminate or mutate a function call
into a more performant version. We note that BOLT’s func-
tion inlining is a limited version of what compilers per-
form at higher levels. We expect that most of the inlin-
ing opportunities will be leveraged by the compiler (poten-
tially using FDO). The remaining inlining opportunities for
BOLT are typically exposed by more accurate profile data,
BOLT’s indirect-call promotion (ICP) optimization, cross-
module nature, or a combination of these factors.
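The selection step behind indirect call promotion can be sketched as follows. The thresholds `min_fraction` and `max_targets`, the function name, and the sample histogram are all assumptions made for illustration; the paper does not state BOLT's actual heuristics. The sketch only decides which targets are frequent enough to deserve a guarded direct call, with the original indirect call kept as the fallback.

```python
def pick_promotion_targets(target_counts, min_fraction=0.3, max_targets=2):
    """Choose the indirect-call targets worth promoting to direct calls.

    target_counts: {callee name: times this call site reached it}.
    A target qualifies if it is among the `max_targets` most frequent
    ones and accounts for at least `min_fraction` of all calls.
    """
    total = sum(target_counts.values())
    if total == 0:
        return []  # no profile for this call site: leave it alone
    ranked = sorted(target_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [
        target
        for target, count in ranked[:max_targets]
        if count / total >= min_fraction
    ]


# Hypothetical histogram for one virtual call site.
site_profile = {"A::run": 900, "B::run": 80, "C::run": 20}
print(pick_promotion_targets(site_profile))
```

A promoted target becomes a direct call, which in turn makes it a candidate for the inlining pass that runs later in the pipeline.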
Pass 6, simplification of load instructions, explores a
tricky tradeoff by fetching data from statically known values
(in read-only sections). In these cases, BOLT may convert
loads into immediate-loading instructions, relieving pres-
sure from the D-cache but possibly increasing pressure on
the I-cache, since the data is now encoded in the instruction
stream. BOLT’s policy in this case is to abort the promotion
if the new instruction encoding is larger than the original
load instruction, even if it means avoiding an arguably more
computationally expensive load instruction. However, we
found that such opportunities are not very frequent in our
workloads.
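The size-based policy for pass 6 can be expressed in a few lines. This is a sketch under stated assumptions: the function name, the flat `rodata` byte string, and the 32-bit little-endian value width are hypothetical, and the instruction sizes are passed in by the caller rather than computed.

```python
import struct


def simplify_load(rodata, rodata_base, data_addr, load_size, imm_mov_size):
    """Return the constant to embed in the instruction, or None.

    The load is only replaced when the immediate-loading instruction is
    not larger than the original load, mirroring the policy in the text.
    """
    if imm_mov_size > load_size:
        return None  # would grow the code: keep the original load
    offset = data_addr - rodata_base
    (value,) = struct.unpack_from("<I", rodata, offset)
    return value


# Hypothetical read-only section holding two 32-bit constants.
section = struct.pack("<II", 7, 42)
# A 6-byte load replaced by a 5-byte immediate move: accepted.
print(simplify_load(section, 0x400000, 0x400004, 6, 5))
# A 7-byte load that would need a 10-byte immediate move: rejected.
print(simplify_load(section, 0x400000, 0x400004, 7, 10))
```

The second call shows the case the text describes: the load stays in place even though removing it would relieve the D-cache.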
Pass 9, reorder and split hot/cold basic blocks, reorders
basic blocks according to the most frequently executed
paths, so the hottest successor will most likely be a fall-through,
reducing taken branches and relieving pressure from
the branch predictor unit.
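One simple way to make the hottest successor the fall-through is a greedy chain: after placing a block, place its most frequently taken unplaced successor next. The sketch below is a simplified heuristic chosen for illustration, not the algorithm BOLT implements, and it omits hot/cold splitting. The function name is hypothetical. The example reuses the edge counts visible in Figure 4, treats the landing pad as an ordinary successor, and leaves the successor lists of blocks not shown in the figure empty.

```python
def greedy_block_layout(entry, successors, original_order):
    """Order blocks so that each hottest successor follows its predecessor.

    successors: {block: {successor: edge count}}.
    original_order: the existing layout, used to pick the next chain
    start when the current chain cannot be extended.
    """
    layout = []
    placed = set()
    current = entry
    while current is not None:
        layout.append(current)
        placed.add(current)
        candidates = [
            (count, block)
            for block, count in successors.get(current, {}).items()
            if block not in placed
        ]
        if candidates:
            current = max(candidates, key=lambda pair: pair[0])[1]
        else:
            current = next(
                (block for block in original_order if block not in placed),
                None,
            )
    return layout


edges = {
    ".LBB07": {".Ltmp9": 100, ".LLP0": 4},
    ".LLP0": {".Ltmp10": 4, ".LFT8": 0},
}
old_layout = [".LBB07", ".LLP0", ".LFT8", ".Ltmp10", ".Ltmp9"]
print(greedy_block_layout(".LBB07", edges, old_layout))
```

With these counts, `.Ltmp9` moves directly behind the entry block, so the jump that ends `.LBB07` targets the next block in memory, while the rarely executed landing pad and its successors move toward the end.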
Finally, pass 13 reorders the functions via the HFSort
technique [25]. This optimization mainly improves I-TLB
performance, but it also helps with I-cache to a smaller ex-
tent. Combined with pass 9, these are the most effective ones
in BOLT because they directly optimize the code layout.
5. Profiling Techniques
This section discusses pitfalls and caveats of different sample-
based profiling techniques when trying to produce accurate
profiling data.
5.1 Techniques
In recent Intel microprocessors, LBR is a list of the last 32
taken branches. LBRs are important for profile-guided op-
timizations not only because they provide accurate counts
for critical edges (which cannot be inferred even with per-
fect basic block count profiling [17]), but also because they
make block-layout algorithms more resilient to bad sam-
pling. When evaluating several different sampling events to
collect LBRs for BOLT, we found that the performance im-
pact in LBR mode is very consistent even for different sam-
pling events. We have experimented with collecting LBR
data with multiple hardware events on Intel x86, including
retired instructions, taken branches, and cycles, and also
experimented with different levels of Precise Event Based
Sampling (PEBS) [15]. In all these cases, for a workload for
which BOLT provided a 5.4% speedup, the performance dif-
ferences were within 1%. In non-LBR mode, using biased
events with a non-ideal algorithm to infer edge counts can
cause as much as 5% performance penalty when compared
to LBR, meaning it misses nearly all optimization opportuni-
ties. An investigation showed that non-LBR techniques can
be tuned to stay under 1% worse than LBR in this example
workload, but if LBR is available in the processor, one is
better off using it to obtain higher and more robust perfor-
mance numbers. We also evaluate this effect for HHVM in
Section 6.5.
5.2 Consequences for Block Layout
Using LBRs, in a hypothetical worst-case biasing scenario
where all samples in a function are recorded in the same
basic block, BOLT will lay out blocks in the order of the path
that leads to this block. It is an incomplete layout that misses
the ordering of successor blocks, but it is neither an invalid nor
a cold path. In contrast, when trying to infer the same edge
counts with non-LBR samples, the scenario is that of a single
hot basic block with no information about which path was
taken to get to it.
In practice, even in LBR mode, the collected profile is often
contradictory, for example by stating that predecessors execute
many times more often than their single successor, among other
violations of flow equations.2 Previous work [17, 23], which
includes techniques implemented in IBM’s FDPR [12], reports
handling the problem of reconstructing edge counts by
solving an instance of minimum cost flow (MCF [17]), a
graph network flow problem. However, these reports predate
LBRs. LBRs only store taken branches, so when handling
very skewed data such as the cases mentioned above, BOLT
satisfies the flow equation by attributing all surplus flow to
the non-taken path that is naturally missing from the LBR,
similarly to Chen et al. [7]. BOLT also benefits from being
applied after the static compiler: to cope with uncertainty, by
putting weight on the fall-through path, it trusts the original
layout done by the static compiler. Therefore, the program
trace needs to show a significant number of taken branches,
which contradict the original layout done by the compiler, to
convince BOLT to reorder the blocks and change the orig-
inal fall-through path. Without LBRs, it is not possible to
take advantage of this: algorithms start with guesses for both
taken and non-taken branches without being sure if the taken
branches, those taken for granted in LBR mode, are real or
the result of bad edge-count inference.
2 I.e., the sum of a block’s input flow is equal to the sum of its output flow.
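The rule of attributing surplus flow to the non-taken path can be sketched for a single block. The function name and the numbers are hypothetical, and the block's execution count is simply assumed to be known. The sketch only shows how the missing fall-through edge is filled in so that the block's outgoing flow matches its incoming flow whenever the recorded taken branches do not already exceed it.

```python
def complete_outgoing_edges(block_count, taken_edges, fallthrough_block):
    """Add the fall-through edge that LBR samples cannot record.

    block_count: how often the block executed (assumed known here).
    taken_edges: {successor: count} for taken branches seen in LBR data.
    fallthrough_block: name of the not-taken successor.
    """
    taken_total = sum(taken_edges.values())
    # Flow not explained by taken branches went to the fall-through.
    # Skewed samples can make the surplus negative; clamp it at zero.
    surplus = max(block_count - taken_total, 0)
    edges = dict(taken_edges)
    edges[fallthrough_block] = edges.get(fallthrough_block, 0) + surplus
    return edges


# 1000 executions, but only 150 taken branches were recorded.
print(complete_outgoing_edges(1000, {".Lcold": 150}, ".Lnext"))
```

Because the unexplained 850 executions land on the fall-through edge, the original compiler layout is kept unless the taken-branch counts clearly contradict it.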
5.3 Consequences for Function Layout
BOLT uses HFSort [25] to perform function reordering
based on a weighted call graph. If LBRs are used, the
edge weights of the call graph are directly inferred from the
branch records, which may also include function calls and
returns. However, without LBRs, BOLT is still able to build
an incomplete call graph by looking at the direct calls in the
binary and creating caller-callee edges with weights corre-
sponding to the number of samples recorded in the blocks
containing the corresponding call instructions. This approach,
however, cannot take indirect calls into account. Even
with these limitations, we did not observe a performance
penalty as severe as the one caused by using non-LBR mode
for basic block reordering (Section 6.5).
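As a rough sketch of the non-LBR case, the following code builds a weighted call graph from direct calls only, using the sample count of the block that contains each call instruction as the edge weight. The input layout (a dictionary of functions, each a list of blocks with a sample count and a list of direct callees) and all function names are hypothetical; BOLT's internal data structures are not described in the text.

```python
from collections import defaultdict


def build_call_graph(functions):
    """Build caller->callee edge weights from per-block sample counts.

    functions -- dict: caller name -> list of (block_samples, direct_callees)
    Returns a dict mapping (caller, callee) to the accumulated weight.
    Indirect calls cannot be represented, mirroring the stated limitation.
    """
    weights = defaultdict(int)
    for caller, blocks in functions.items():
        for samples, callees in blocks:
            for callee in callees:
                # Each direct call inherits the sample count of its block.
                weights[(caller, callee)] += samples
    return dict(weights)


if __name__ == "__main__":
    # Hypothetical profile: two functions with sampled basic blocks.
    profile = {
        "main": [(100, ["parse"]), (5, ["report_error"]), (40, [])],
        "parse": [(90, ["lex", "lex"])],
    }
    print(build_call_graph(profile))
```

A graph built this way is incomplete, but it can still be handed to a function-ordering algorithm that expects weighted caller-callee edges.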
6. Evaluation
This section evaluates BOLT in a variety of scenarios, in-
cluding Facebook server workloads and the GCC and Clang
open-source compilers. A comparison with GCC’s and
Clang’s PGO and LTO is also provided in some scenarios.
The evaluation presented in this section was conducted
on Linux-based servers featuring Intel microprocessors.
6.1 Facebook Workloads
The impact of BOLT was measured on five binaries inside
Facebook’s data centers. The first is HHVM [1], the PHP
virtual machine that powers the web servers at Facebook and
many other web sites, including Baidu and Wikipedia. The
second is TAO [4], a highly distributed, in-memory, data-
caching service used to store Facebook’s social graph. The
third one is Proxygen, which is a cluster load balancer built
on top of the open-source library with the same name [27].
Finally, the other two binaries implement a service called
Multifeed, which is used to select what is shown in the
Facebook News Feed.
In this evaluation, we compared the performance impact
of BOLT on top of binaries built using GCC and function
reordering via HFSort [25]. The HHVM binary specifically
is compiled with LTO to further enhance its performance.
Unfortunately, a comparison with FDO and AutoFDO was
not possible. The difficulties with FDO were the common
ones outlined in Section 2.1 for deploying instrumented binaries
in these applications’ normal production environments. In addition,
we found that AutoFDO support in the latest version of GCC
available in our environment (version 5.4.1) is not stable
and caused either internal compiler errors or runtime errors
related to C++ exceptions. Nevertheless, a direct comparison
between BOLT and FDO was possible for other applications,
and the results are presented in Section 6.2.
Figure 5 shows the performance results for applying
BOLT on top of HFSort for our set of Facebook data-center
workloads (and, in case of HHVM, also on top of LTO). In
all cases, BOLT’s application resulted in a speedup, with an
average of 5.4% and a maximum of 8.0% for HHVM. Note
that HHVM, despite containing a large amount of dynamically
compiled code that is not optimized by BOLT, spends
more time in statically compiled code than in the dynamically
generated code. Among these applications, HHVM
has the largest total code size, which makes it very front-end
bound and thus more amenable to the code layout optimizations
that BOLT implements.

[Figure 5: bar chart. x-axis: HHVM, TAO, Proxygen, Multifeed 1, Multifeed 2, GeoMean; y-axis: % Speedup (0 to 10).]
Figure 5: Performance improvements from BOLT for our set
of Facebook data-center workloads.

[Figure 6: bar chart. x-axis: Branch, D-Cache, I-Cache, I-TLB, D-TLB, LLC; y-axis: % Miss Reduction (0 to 20).]
Figure 6: Improvements on micro-architecture metrics for
HHVM.

[Figure 7: bar chart of % Speedup per input; values from the bar labels:]
Input         BOLT      PGO+LTO   PGO+LTO+BOLT
input1        52.14%    39.92%    68.49%
input2        40.15%    30.54%    53.25%
input3        22.27%    21.52%    33.98%
clang-build   36.22%    29.93%    49.42%
Figure 7: Performance improvements for Clang.
To better understand the performance benefits of BOLT,
we performed a more detailed performance analysis of
HHVM. Figure 6 shows BOLT’s improvements on impor-
tant performance metrics, including i-cache misses, i-TLB
misses, branch misses, and LLC misses. Improving branch
prediction is an important benefit from the block layout op-
timization done by BOLT, and for HHVM this metric im-
proved by 11%. Moreover, improving locality leads to better
metrics across the entire cache hierarchy, especially the first
level of i-cache, which exhibits an 18% reduction in misses. It
is possible to see a small improvement of 1% in the first level
of d-cache as well, due to reordering jump tables for locality
and frame optimizations. The observed TLB improvements
come from packing accessed instructions and data into fewer
pages. To better illustrate how cache and TLB locality are
improved, we present heat maps of address-space accesses
in Section 6.4.
6.2 Clang and GCC Compilers
BOLT should be able to improve the performance of any
front-end bound application, not just data-center workloads.
To test this theory, we ran BOLT on two open-source compilers:
Clang and GCC.

[Figure 8: bar chart of % Speedup per input; values from the bar labels:]
Input         BOLT      PGO       PGO+BOLT
input1        24.28%    16.46%    27.08%
input2        24.12%    17.28%    27.52%
input3        13.99%    12.42%    17.76%
clang-build   21.26%    15.73%    24.35%
Figure 8: Performance improvements for GCC. Different
than Clang, we did not use LTO due to build errors.
6.2.1 Clang Setup
For our Clang evaluation, we used the release_60 branch
of the llvm, clang, and compiler-rt open-source repositories [19].
We built a bootstrapped release version of the compiler first.
This stage1 compiler provided a baseline for our evaluation.
We then built an instrumented version of Clang,3 and then
used the instrumented compiler to build Clang again with
default options. The collected profile data was used to do
another build of Clang with LTO enabled.4 This is referred
to as PGO+LTO in our chart.
Each of the 2 compilers was profiled with our training
input, a full build of GCC. We used the Linux perf utility
with the options record -e cycles:u -j any,u. The
profile from perf was converted using the perf2bolt utility into
YAML format (-w option). Then the profile was used to
optimize the compiler binary using BOLT with the following
options:
-b profile.yaml -reorder-blocks=cache+
-reorder-functions=hfsort+ -split-functions=3 -split-all-cold
-split-eh -dyno-stats -icf=1 -use-gnu-stack
The four compilers were then used to build Clang, and the
overall build time was measured for benchmarking purposes.
For all builds above we used ninja instead of GNU make,
and all benchmarks were run with the -j40 clang options.
We chose to build only the clang binary (as opposed
to the full build) to minimize the effect of link time on our
evaluation.
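For readers who want to script the optimization step, the sketch below assembles a BOLT command line from the options listed above. The option list is copied from the text; the executable name `llvm-bolt`, the `-o` output flag, and the binary file names are assumptions that are not stated in this section, so they should be checked against the installed tool before use.

```python
import shlex

# Options exactly as listed in the text above.
BOLT_OPTIONS = [
    "-b", "profile.yaml",
    "-reorder-blocks=cache+",
    "-reorder-functions=hfsort+",
    "-split-functions=3",
    "-split-all-cold",
    "-split-eh",
    "-dyno-stats",
    "-icf=1",
    "-use-gnu-stack",
]


def bolt_command(input_binary, output_binary, tool="llvm-bolt"):
    """Return the argument list for one BOLT run.

    Assumptions: the tool name and the -o flag are not given in the text.
    """
    return [tool, input_binary, "-o", output_binary] + BOLT_OPTIONS


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(bolt_command("clang.orig", "clang.bolt")))
```

The resulting list can be passed to `subprocess.run` directly, which avoids shell quoting issues with arguments such as `cache+`.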
We have also selected 3 Clang/LLVM source files rang-
ing from small to large sizes and preprocessed those files
such that they could be compiled without looking up header
dependencies. The 3 source files we used are:
• input1: tools/clang/lib/CodeGen/CGVTT.cpp
• input2: lib/ExecutionEngine/Orc/OrcCBindings.cpp
• input3: lib/Target/X86/X86ISelLowering.cpp
Each of the files was then compiled with -std=c++11
-O2 options multiple times, and the results were recorded
for benchmarking purposes. Tests were run on a dual-node
20-core (40-core with hyperthreading) IvyBridge (Intel(R)
Xeon(R) CPU E5-2680 v2 @ 2.80GHz) system with 32GiB
RAM.
3 -DLLVM_BUILD_INSTRUMENTED=ON
4 -DLLVM_ENABLE_LTO=Full -DLLVM_PROFDATA_FILE=clang.profdata
6.2.2 GCC Setup
For our GCC evaluation, we used version 8.2.0. First, GCC
was built using the default build process. The result of this
bootstrap build was our baseline. Second, we built a PGO
version using the following configuration:
--enable-linker-build-id --enable-bootstrap
--enable-languages=c,c++ --with-gnu-as --with-gnu-ld
--disable-multilib
Afterwards, make profiledbootstrap was used to
generate our PGO version of GCC.
Since BOLT is incompatible with GCC function splitting,
we had to repeat the above builds passing BOOT_CFLAGS='-O2
-g -fno-reorder-blocks-and-partition' to the make
command. The resulting compiler, ready to be BOLTed, was
used to build GCC again (our training input), this time with-
out the bootstrap. The profile was then recorded and con-
verted using perf2bolt to YAML format, and the cc1plus
binary was optimized using BOLT with the same options
used for Clang and later copied over to GCC’s installation
directory.
All 4 different types of GCC compilers, 2 without BOLT
and 2 with BOLT, were later used to build the Clang com-
piler using the default configuration.
6.2.3 Experimental Results
Figures 7 and 8 show the experimental results for Clang and
GCC, respectively. We observed a significant improvement
on both compilers by using BOLT. On top of GCC with
PGO, BOLT provided a 7.45% speedup when doing a full
build of Clang. On top of Clang with LTO and PGO, BOLT provided
a 15.0% speedup when doing a full build of Clang.
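The incremental speedups quoted above can be reproduced from the bar labels of Figures 7 and 8 with a short calculation. The sketch assumes that each label is a percent speedup over the same baseline, so that the gain of one configuration over another is the ratio of the two speedup factors. That interpretation is an assumption, but it matches the numbers reported in the text.

```python
def incremental_speedup(combined_pct, base_pct):
    """Percent speedup of one configuration relative to another.

    Both arguments are percent speedups measured over the same baseline.
    """
    return ((1 + combined_pct / 100) / (1 + base_pct / 100) - 1) * 100


if __name__ == "__main__":
    # GCC, clang-build input: PGO+BOLT (24.35%) versus PGO (15.73%).
    print(round(incremental_speedup(24.35, 15.73), 2))  # 7.45
    # Clang, clang-build input: PGO+LTO+BOLT (49.42%) versus PGO+LTO (29.93%).
    print(round(incremental_speedup(49.42, 29.93), 1))  # 15.0
    # Clang, input1: PGO+LTO+BOLT (68.49%) versus PGO+LTO (39.92%).
    print(round(incremental_speedup(68.49, 39.92), 1))  # 20.4
```

The last line corresponds to the "up to 20.4%" figure given in the abstract for BOLT on top of FDO and LTO.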
Table 2 shows some statistics reported by BOLT as
it optimizes the Clang binaries for the baseline and with
PGO+LTO applied. These statistics are based on the input
profile data. Even when applied on top of PGO+LTO, BOLT
has a very significant impact on many of these metrics,
particularly the ones that affect code locality. For example, we
see that BOLT reduces the number of taken branches by
44.3% over PGO+LTO (69.8% over the baseline), which
significantly improves i-cache locality.
6.3 Analysis of Suboptimal Compiler Code Layout
Using BOLT’s -report-bad-layout option, we inspected
Clang’s binary built with LTO+PGO to identify frequently
executed functions that contain cold basic blocks interleaved
with hot ones. Combined with options -print-debug-info
and -update-debug-sections, this allowed us to trace
the source of such blocks. Using this methodology, we an-
alyzed such suboptimal code layout occurrences among the
hottest functions. Our analysis revealed that the majority of
such cases originated from function inlining as motivated in
the example in Figure 2. Figure 10 illustrates one of these
functions at the binary level. This function contains 3 basic
blocks, each one corresponding to source code from a different
source file. In Figure 10, the blocks are annotated with
their profile counts (Exec Count). The source code corresponding
to block .LFT680413 is not cold, but it is very
cold when inlined in this particular call site. By operating at
the binary level and being guided by the profile data, BOLT
can easily identify these inefficiencies and improve the code
layout.

[Figure 9: two heat-map panels, (a) without BOLT and (b) with BOLT; axis ticks from 0 to 60 on both axes, color scale from 0 to 10.]
Figure 9: Heat maps for instruction memory accesses of the HHVM binary, without and with BOLT. Heat is in a log scale.

Metric                             Over Baseline   Over PGO+LTO
executed forward branches          -1.6%           -1.0%
taken forward branches             -83.9%          -61.1%
executed backward branches         +9.6%           +6.0%
taken backward branches            -9.2%           -21.8%
executed unconditional branches    -66.6%          -36.3%
executed instructions              -1.2%           -0.7%
total branches                     -7.3%           -2.2%
taken branches                     -69.8%          -44.3%
non-taken conditional branches     +60.0%          +13.7%
taken conditional branches         -70.6%          -46.6%
Table 2: Statistics reported by BOLT when applied to
Clang’s baseline and PGO+LTO binaries.
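The kind of layout problem discussed in Section 6.3 can be detected with a simple scan over a function's blocks in address order. The sketch below flags never-executed blocks whose immediate neighbors are both executed. This is a simplified heuristic written for illustration, not the criterion used by BOLT's -report-bad-layout option, which is not specified in the text. The block labels and counts are taken from Figure 10.

```python
def cold_blocks_between_hot(layout, hot_threshold=1):
    """Return labels of zero-count blocks sandwiched between hot blocks.

    layout        -- list of (label, exec_count) in address order
    hot_threshold -- minimum count for a neighbor to be considered hot
                     (an assumed, illustrative cut-off)
    """
    flagged = []
    for i in range(1, len(layout) - 1):
        label, count = layout[i]
        prev_hot = layout[i - 1][1] >= hot_threshold
        next_hot = layout[i + 1][1] >= hot_threshold
        if count == 0 and prev_hot and next_hot:
            flagged.append(label)
    return flagged


if __name__ == "__main__":
    # Blocks and Exec Count values as shown in Figure 10.
    blocks = [
        (".Ltmp1100284", 1635334),
        (".LFT680413", 0),
        (".Ltmp1100279", 1769771),
    ]
    print(cold_blocks_between_hot(blocks))  # ['.LFT680413']
```

Moving the flagged block out of line turns the frequently taken forward branch in the first block into a fall-through.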
6.4 Heat Maps
Figure 9 shows heat maps of the instruction address space
for HHVM running with Facebook production traffic. Fig-
ure 9a illustrates addresses fetched through I-cache for the
regular binary, while Figure 9b shows the same for HHVM
processed with BOLT.
This heat map is built as a matrix of addresses. Each
line has 64 blocks and the complete graph has 64 lines.
The HHVM binary chosen for this study has 148.2 MB of
text size, which is fully represented in the heat map. Each
block represents 36,188 bytes and the heat map shows how
many times, on average, each byte of a block is fetched as
indicated by profiling data. For example, the line at Y = 0
from X = 0 to X = 63 plots how code is being accessed in
the first 2,316,032 bytes of the address space. The average
number of times a byte is fetched is reduced by a logarithm
function to help visualize the data, so we can easily identify
even code that is executed just a few times. Completely white
areas show cold basic blocks that were never sampled during
profiling, while strong red highlights the most frequently
accessed areas of instruction memory.

    Function:
    clang::Redeclarable<clang::TagDecl>::DeclLink::getNext(...) const
    Exec Count : 1723213

    .Ltmp1100284 (4 instructions, align : 1)
      Exec Count : 1635334
      Predecessors: .Ltmp1100286, .LBB087908
        0000001d: movq %r12, %rbx    # PointerIntPair.h:152:40
        00000020: andq $-0x8, %rbx   # PointerIntPair.h:152:40
        00000024: testb $0x4, %r12b  # PointerUnion.h:143:9
        00000028: je .Ltmp1100279    # ExternalASTSource.h:462:19
      Successors: .Ltmp1100279 (mispreds: 2036, count: 1635334),
                  .LFT680413 (mispreds: 0, count: 0)

    .LFT680413 (2 instructions, align : 1)
      Exec Count : 0
      Predecessors: .Ltmp1100284
        0000002a: testq %rbx, %rbx   # ExternalASTSource.h:462:19
        0000002d: jne .Ltmp1100280   # ExternalASTSource.h:462:19
      Successors: .Ltmp1100280 (mispreds: 0, count: 0),
                  .Ltmp1100279 (mispreds: 0, count: 0)

    .Ltmp1100279 (9 instructions, align : 1)
      Exec Count : 1769771
      Predecessors: .Ltmp1100284, .LFT680414, .Ltmp1100282,
                    .LFT680413
        0000002f: movq %rbx, %rax    # Redeclarable.h:140:5
        00000032: addq $0x28, %rsp   # Redeclarable.h:140:5
        00000036: popq %rbx          # Redeclarable.h:140:5
        00000037: popq %r12          # Redeclarable.h:140:5
        00000039: popq %r13          # Redeclarable.h:140:5
        0000003b: popq %r14          # Redeclarable.h:140:5
        0000003d: popq %r15          # Redeclarable.h:140:5
        0000003f: popq %rbp          # Redeclarable.h:140:5
        00000040: retq               # Redeclarable.h:140:5
Figure 10: Real example of poor code layout produced by
the Clang compiler (compiling itself) even with PGO. Block
.LFT680413 is cold (Exec Count: 0), but it is placed between
two hot blocks connected by a forward taken branch.

[Figure 11: bar chart. x-axis: Instructions, Branch-miss, I-cache-miss, LLC-miss, iTLB-miss, CPU time; y-axis: % Reduction (0 to 15); series: Functions, BBs, Both.]
Figure 11: Improvements on different metrics for HHVM by
using LBRs (higher is better).
Figure 9b demonstrates how BOLT packs together hot
code to use about 4 MB of space instead of the original range
spanning 148.2 MB. There is still some activity outside the
dense hot area, but the code there is relatively cold. The functions
in that region were ignored by BOLT’s function-reordering pass
because they have an indirect tail call. BOLT marks the functions
it can fully understand as simple, a mechanism that allows it to
operate on the binary even if it does not fully process all
functions. The ignored functions are non-simple.
Indirect tail calls are more challenging for static binary
rewriters because it is difficult to guess if the target is an-
other function or another basic block of the same function,
which could affect the CFG. BOLT leaves these functions
untouched. This also explains a large cold block of about
160 KB in the hot area at Y = 1 and 16 ≤ X ≤ 20: this
cold block is part of a large non-simple function whose CFG
is not fully processed by BOLT, so it is not split in the same
way as other functions.
Function splitting and reordering are important to move
cold basic blocks out of the hot area, and BOLT uses these
techniques on the vast majority of functions in the HHVM
binary. The result is a tight packing of frequently executed
code as shown at Y = 1 of Figure 9b, which greatly benefits
I-cache and I-TLB.
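The heat-map geometry described in this section can be checked with a few lines of arithmetic. The sketch below maps a byte offset within the text section to a heat-map cell. It assumes that cells are filled in row-major order, with X as the block index within a line and Y as the line index, which is consistent with the statement that the line at Y = 0 covers the first 2,316,032 bytes. The function and constant names are illustrative.

```python
BLOCKS_PER_LINE = 64      # blocks in each heat-map line
LINES = 64                # lines in the complete heat map
BYTES_PER_BLOCK = 36_188  # bytes represented by one block


def heatmap_cell(offset):
    """Map a byte offset in the text section to an (x, y) cell."""
    block = offset // BYTES_PER_BLOCK
    return block % BLOCKS_PER_LINE, block // BLOCKS_PER_LINE


def bytes_per_line():
    """Number of bytes covered by one heat-map line."""
    return BLOCKS_PER_LINE * BYTES_PER_BLOCK


if __name__ == "__main__":
    print(bytes_per_line())                                    # 2316032
    print(BLOCKS_PER_LINE * LINES * BYTES_PER_BLOCK / 1e6)     # total MB covered
    print(heatmap_cell(bytes_per_line()))                      # first cell of Y = 1
```

With these constants the full map covers 148,226,048 bytes, which agrees with the 148.2 MB text size quoted for the HHVM binary.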
6.5 Importance of LBR
Not all CPU vendors support a hardware mechanism to col-
lect a trace of the last branches, such as LBRs on Intel CPUs.
We compared the impact of using LBRs for BOLT’s profile versus
relying on plain samples with no such traces.
Figure 11 summarizes our evaluation on different metrics
for HHVM, in 3 different scenarios: reordering functions using
HFSort, reordering basic blocks and applying other optimizations,
and with both (all optimizations on). For example,
the first data set shows that the more accurate profiling
enabled by LBRs reduces the number of instructions executed
by 0.35% overall. As Figure 5 shows, total CPU time
improvements by using BOLT on HHVM are about 8%. Figure 11
shows us that using LBRs is responsible for about
2 percentage points of these improvements. Furthermore, the
impact is more significant for basic block layout optimizations
than it is for function reordering. The reason is that basic-block
reordering requires more fine-grained profiling, at the basic-block
level, which is harder to obtain without LBRs.
7. Related Work
Binary or post-link optimizers have been extensively ex-
plored in the past. There are two different categories for
binary optimization in general: static and dynamic, operat-
ing before program execution or during program execution.
Post-link optimizers such as BOLT are static binary optimizers.
Large platforms for prototyping and testing dynamic
binary optimizations include DynamoRIO [5], for same-host
execution, and QEMU [3], for emulation. Even though it is
challenging to overcome the overhead of the virtual machine with
wins due to the optimizations themselves, these tools can be useful
for performing dynamic binary instrumentation to analyze program
execution, as Pin [20] does, or for debugging, which
is the main goal of Valgrind [22].
Static binary optimizers are typically focused on low-
level program optimizations, preferably using information
about the precise host that will run the program. MAO [14]
is an example where microarchitectural information is used
to rewrite programs, although it rewrites source-level as-
sembly and not the binary itself. Naturally, static optimiz-
ers tend to be architecture-specific. Ispike [21] is a post-link
optimizer developed by Intel to optimize for the quirks of
the Itanium architecture. Ispike also utilizes block layout
techniques similar to BOLT, which are variations of Pettis
and Hansen [26]. However, despite supporting architecture-
specific passes, BOLT was built on top of LLVM [16] to en-
able it to be easily ported to other architectures. Ottoni and
Maher [25] present an enhanced function-reordering tech-
nique based on a dynamic call graph. BOLT implements the
same algorithm in one of its passes.
Profile information is most commonly used to augment
the compiler to optimize code based on run-time informa-
tion, such as done by AutoFDO [6]. The latter has also been
studied in the context of data-center applications, like BOLT.
Even though there is some expected overlap in gains be-
tween AutoFDO and BOLT, since both tools perform layout,
in this paper we show that the gains with FDO in general (not
just AutoFDO) and BOLT can be complementary and both
tools can be used together to obtain maximum performance.
8. Conclusion
The complexity of data-center applications often results in
large binaries that tend to exhibit poor CPU performance
due to significant pressure on multiple important hardware
structures, including caches, TLBs, and branch predictors.
To tackle the challenge of improving performance of such
applications, we created a post-link optimizer, called BOLT,
which is built on top of the LLVM infrastructure. The main
goal of BOLT is to reorganize the applications’ code to
reduce the pressure that they impose on those important
hardware structures. BOLT achieves this goal with a series
of optimizations, with particular focus on code layout. A
key insight of this paper is that a post-link optimizer is in
a privileged position to perform these optimizations based
on profiling, even beyond what a compiler can achieve.
We tested our assumptions in Facebook data-center appli-
cations and obtained improvements ranging from 2% to 8%.
Unlike profile-guided static compilers, BOLT does not need
to retrofit profiling data back to source code, making the
profile more accurate. Nevertheless, a post-link optimizer
has fewer optimizations than a compiler. We show that the
strengths of both strategies combine instead of purely over-
lapping, indicating that using both approaches leads to the
highest efficiency for large, front-end bound applications. To
show this, we measure the performance improvements on
two open-source compilers, GCC and Clang, featuring large
code bases dependent on the instruction cache performance.
Overall, BOLT achieves 15% performance improvement for
Clang on top of LTO and FDO.
Acknowledgments
We would like to thank Gabriel Poesia and Theodoros
Kasampalis for their work on BOLT during their internships
at Facebook. We would also like to thank Sergey Pupyrev
for his work on improving the basic block layout algorithms
used by BOLT.
References
[1] Keith Adams, Jason Evans, Bertrand Maher, Guilherme Ot-
toni, Andrew Paroski, Brett Simmers, Edwin Smith, and
Owen Yamauchi. 2014. The Hiphop Virtual Machine. In Pro-
ceedings of the ACM International Conference on Object Ori-
ented Programming Systems Languages & Applications. 777–
790.
[2] Vasanth Bala, Evelyn Duesterwald, and Sanjeev Banerjia.
2000. Dynamo: A Transparent Dynamic Optimization Sys-
tem. In Proceedings of the ACM SIGPLAN Conference on
Programming Language Design and Implementation. ACM,
1–12.
[3] F. Bellard. 2005. QEMU, a fast and portable dynamic transla-
tor. In USENIX Annual Technical Conference.
[4] Nathan Bronson, Zach Amsden, George Cabrera, Prasad
Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Gia-
rdullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri
Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkatara-
mani. 2013. TAO: Facebook’s Distributed Data Store for the
Social Graph. In Proceedings of the USENIX Conference on
Annual Technical Conference. 49–60.
[5] Derek Bruening, Timothy Garnett, and Saman Amarasinghe.
2003. An infrastructure for adaptive dynamic optimization. In
Proceedings of the International Symposium on Code Gener-
ation and Optimization. IEEE, 265–275.
[6] Dehao Chen, David Xinliang Li, and Tipp Moseley. 2016.
AutoFDO: Automatic Feedback-directed Optimization for
Warehouse-scale Applications. In Proceedings of the Inter-
national Symposium on Code Generation and Optimization.
12–23.
[7] Dehao Chen, Neil Vachharajani, Robert Hundt, Xinliang Li,
Stephane Eranian, Wenguang Chen, and Weimin Zheng. 2013.
Taming hardware event samples for precise and versatile feed-
back directed optimizations. IEEE Trans. Comput. 62, 2
(2013), 376–389.
[8] Robert Cohn, D. Goodwin, and P. G. Lowney. 1997. Optimiz-
ing Alpha executables on Windows NT with Spike. Digital
Technical Journal 9, 4 (1997), 3–20.
[9] James C. Dehnert, Brian K. Grant, John P. Banning, Richard
Johnson, Thomas Kistler, Alexander Klaiber, and Jim Matt-
son. 2003. The Transmeta Code Morphing Software: Using
Speculation, Recovery, and Adaptive Retranslation to Address
Real-life Challenges. In Proceedings of the International Sym-
posium on Code Generation and Optimization. 15–24.
[10] DWARF Debugging Standards Committee. 2017. DWARF
Debugging Information Format version 5.
[11] Ealan A Henis, Gadi Haber, Moshe Klausner, and Alex War-
shavsky. 1999. Feedback based post-link optimization for
large subsystems. In Workshop on Feedback Directed Opti-
mization. 13–20.
[12] E. A. Henis, G. Haber, M. Klausner, and A. Warshavsky. 1999.
Feedback based postlink optimization for large subsystems.
In Proceedings of the 2nd workshop on Feedback Directed
Optimization. 13–20.
[13] Urs Hölzle and David Ungar. 1994. Optimizing Dynamically-
dispatched Calls with Run-time Type Feedback. In Proceed-
ings of the ACM Conference on Programming Language De-
sign and Implementation. 326–336.
[14] Robert Hundt, Easwaran Raman, Martin Thuresson, and
Neil Vachharajani. 2011. MAO – An Extensible Micro-
architectural Optimizer. In Proceedings of the 9th Annual
IEEE/ACM International Symposium on Code Generation
and Optimization. IEEE Computer Society, 1–10.
[15] Intel Corporation. 2011. Intel® 64 and IA-32 Architectures
Software Developer’s Manual. Number 325384-039US.
[16] Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation
Framework for Lifelong Program Analysis & Transformation.
In Proceedings of the International Symposium on Code Gen-
eration and Optimization. 75–86.
[17] Roy Levin. 2007. Complementing incomplete edge profile by
applying minimum cost circulation algorithms.
[18] Xinliang David Li, Raksit Ashok, and Robert Hundt. 2010.
Lightweight Feedback-Directed Cross-Module Optimization.
In Proceedings of the International Symposium on Code Gen-
eration and Optimization. 53–61.
[19] LLVM Community. 2018. The LLVM open-source code
repositories. Web site:
http://llvm.org/releases.
[20] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil,
Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa
Reddi, and Kim Hazelwood. 2005. Pin: Building Customized
Program Analysis Tools with Dynamic Instrumentation. In
Proceedings of the 2005 ACM SIGPLAN Conference on Pro-
gramming Language Design and Implementation. ACM, 190–
200.
[21] C-K Luk, Robert Muth, Harish Patil, Robert Cohn, and Ge-
off Lowney. 2004. Ispike: a post-link optimizer for the Intel
Itanium architecture. In Proceedings of the International Sym-
posium on Code Generation and Optimization. IEEE, 15–26.
[22] Nicholas Nethercote and Julian Seward. 2007. Valgrind:
A Framework for Heavyweight Dynamic Binary Instrumen-
tation. In Proceedings of the 28th ACM SIGPLAN Confer-
ence on Programming Language Design and Implementation.
ACM, 89–100.
[23] Diego Novillo. 2014. SamplePGO: The Power of Profile
Guided Optimizations Without the Usability Burden. In Pro-
ceedings of the 2014 LLVM Compiler Infrastructure in HPC.
IEEE Press, 22–28.
[24] Guilherme Ottoni. 2018. HHVM JIT: A Profile-guided,
Region-based Compiler for PHP and Hack. In Proceedings of
the 39th ACM SIGPLAN Conference on Programming Lan-
guage Design and Implementation. ACM, 151–165.
[25] Guilherme Ottoni and Bertrand Maher. 2017. Optimizing
Function Placement for Large-scale Data-center Applications.
In Proceedings of the International Symposium on Code Gen-
eration and Optimization. IEEE, 233–244.
[26] Karl Pettis and Robert C. Hansen. 1990. Profile Guided
Code Positioning. In Proceedings of the ACM Conference on
Programming Language Design and Implementation. ACM,
16–27.
[27] Proxygen Team. 2017. Proxygen: Facebook’s C++ HTTP
Libraries. Web site: https://github.com/facebook/proxygen.
[28] Ted Romer, Geoff Voelker, Dennis Lee, Alec Wolman, Wayne
Wong, Hank Levy, Brian Bershad, and Brad Chen. 1997. In-
strumentation and optimization of Win32/Intel executables
using Etch. In Proceedings of the USENIX Windows NT Work-
shop, Vol. 1997. 1–8.
[29] Cheng Wang, Shiliang Hu, Ho-seop Kim, Sreekumar R Nair,
Mauricio Breternitz, Zhiwei Ying, and Youfeng Wu. 2007.
StarDBT: an efficient multi-platform dynamic binary transla-
tion system. In Proceedings of the Asia-Pacific Conference on
Advances in Computer Systems Architecture. Springer, 4–15.
