Studying Compiler Optimizations on Superscalar Processors through Interval Analysis
Stijn Eyerman†, Lieven Eeckhout†, James E. Smith‡
†ELIS Department, Ghent University, Belgium
‡ECE Department, University of Wisconsin – Madison
Email: {seyerman,leeckhou}@elis.UGent.be, jes@ece.wisc.edu
Abstract
Understanding the performance impact of compiler optimizations on superscalar
processors is complicated because compiler optimizations interact with the microarchi-
tecture in complex ways. This paper analyzes this interaction using interval analysis,
an analytical processor model that allows for breaking total execution time into cycle
components. By studying the impact of compiler optimizations on the various cycle
components, one can gain insight into how compiler optimizations affect out-of-order
processor performance. The analysis provided in this paper reveals various interesting
insights and suggestions for future work on compiler optimizations for out-of-order
processors. In addition, we contrast the effect compiler optimizations have on out-of-order
versus in-order processors.
1 Introduction
In modern processors, both the hardware implementation and optimizing compilers are
very complex, and they often interact in unpredictable ways. A high performance mi-
croarchitecture typically issues instructions out-of-order and must deal with a number
of disruptive miss events such as branch mispredictions and cache misses. An optimiz-
ing compiler implements a large number of individual optimizations which not only
interact with the microarchitecture, but also interact with each other. These interactions
can be constructive (improved performance), destructive (lost performance), or neutral.
Furthermore, whether there is performance gain or loss often depends on the particular
program being optimized and executed.
In practice, the only way that the performance gain (or loss) for a given compiler
optimization can be determined is by running optimized programs on the hardware
and timing them. This method, while useful, does not provide insight regarding the
underlying causes for performance gain/loss. By using the recently proposed method
of interval analysis [1,2,3,4], one can decompose total execution time into intuitively
meaningful cycle components. These components include a base cycle count, which is
a measure of the time required to execute the program in the absence of all disruptive
miss events, along with additional cycle counts for each type of miss event. Performance
gain (or loss) resulting from a compiler optimization can then be attributed to either the
base cycle count or to speciﬁc miss event(s).
By analyzing the various cycle count components for a wide range of compiler
optimizations one can gain insight into the underlying mechanisms by which compiler
optimizations affect out-of-order processor performance. To the best of our knowledge,
this paper is the first to analyze compiler optimizations on out-of-order processors using
an analytical-based superscalar processor model. The work reported here provides and
supports a number of key insights. Some of these insights provide quantitative support
for conventional wisdom, while others provide a fresh view of how compiler optimiza-
tions interact with superscalar processor performance. To be more speciﬁc:
– We demonstrate the use of interval analysis for studying the impact of compiler
optimizations on superscalar processor performance; this is done by breaking up
the total execution time into cycle components and by analyzing the effect of
compiler optimizations on the various cycle components. Compiler builders can use this
methodology to better understand the impact of compiler optimizations.
– Our analysis provides a number of interesting insights with respect to how com-
piler optimizations affect out-of-order processor performance. For one, the critical
path leading to mispredicted branches is the only place during program execution
where optimizations reducing the length of the chain of dependent operations affect
overall performance on a balanced out-of-order processor; inter-operation dependencies
not residing on the critical path leading to a mispredicted branch are typically
hidden by out-of-order execution. Second, reducing the dynamic instruction
count (an important optimization objective dating back to sequential processors)
still is an important compiler optimization criterion for today’s out-of-order pro-
cessors. Third, some compiler optimizations (unintentionally) bring long-latency
loads closer to each other in the dynamic instruction stream, thereby exposing more
memory-level parallelism (MLP) and improving performance.
– We show that compiler optimizations have a different performance impact on in-
order versus out-of-order processors. In fact, the biggest fraction of the total per-
formance gain on in-order processors is achieved by reducing the dynamic instruc-
tion count and critical path length. For out-of-order processors on the other hand,
only about half the performance gain comes from reducing the dynamic instruction
count and critical path length; the other half of the performance gain comes from
optimizations related to the I-cache, L2 D-cache and branch predictor behavior.
2 Decomposing execution time into cycle components
In order to gain insight into how compiler optimizations affect out-of-order processor
performance, we use a previously developed analytical model called interval analysis.
This section brieﬂy summarizes interval analysis; for a more elaborate discussion the
reader can refer to a number of prior references [1,2,3,4].
2.1 Interval analysis
[Figure 1: IPC (vertical axis) versus time (horizontal axis), divided into intervals 0
through 3 that are delimited by branch mispredictions, an I-cache miss and a long
D-cache miss.]
Fig. 1. Performance can be analyzed by dividing time into intervals between miss events.
Interval behavior observed on superscalar processors is illustrated in Figure 1. The
number of (useful) instructions issued per cycle (IPC) is shown on the vertical axis, and time
(in clock cycles) is shown on the horizontal axis. As illustrated in the figure, an interval
begins at the time new instructions start to fill the issue and reorder buffers following
the preceding miss event (regardless of which type). Initially, only a small number of
instructions are in the issue buffer, so relatively few can be found to issue in parallel.
However, as instructions continue to ﬁll the window, the scope for ﬁnding parallel in-
structions increases, as does the issue rate. In the limit, the window becomes full (or
nearly so) and a steady state issue rate is achieved. At some point, the next miss event
occurs and the stream of useful instructions entering the window ceases. The window
begins draining of useful instructions as they commit, but no new instructions take their
place. Finally, there are no more instructions to issue until the interval ends. In the
meantime, the miss event is being handled by the hardware, for example, an instruction
cache miss is serviced. After the time during which no instructions issue or commit,
the next interval begins with a ramp-up transient as the window re-fills, and instructions
once again begin issuing. The exact mechanisms which cause instructions to stop filling
the window and the timing of the window drain with respect to the occurrence of the
miss event are dependent on the type of miss event, so each type of miss event should
be analyzed separately.
When we use interval analysis to decompose performance into cycle count com-
ponents, there are three main aspects: base cycle counts, miss event cycle counts, and
overlap of miss events.
Base Cycle Counts. If there are N instructions in a given interval and the dispatch width
is D, then it will take $\lceil N/D \rceil$ cycles to dispatch them into the window. In the
absence of all miss events, a balanced superscalar processor can then issue and retire the
instructions at (nearly) the dispatch rate. Consequently, the base cycle count is computed
as $\lceil N/D \rceil$ for an interval containing N instructions.
This stems from the observation that for practical pipeline widths, say up to eight,
one can, in most cases, make the window size big enough that an issue rate matching
the pipeline width can be achieved (under ideal, no miss event, conditions) [5,6,7]. Note
that we equate the dispatch width D with the processor’s ‘pipeline width’, because it
typically deﬁnes the maximum sustainable instruction fetch/decode/execution rate. We
call a superscalar processor design balanced when the ROB and other resources such
as the issue buffer, load/store buffers, rename registers, MSHRs, functional units, etc.,
are sufﬁciently large to support the processor width in the absence of all miss events.
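As a minimal sketch of the base cycle count, the following Python fragment evaluates ⌈N/D⌉ for a few intervals. The function name base_cycles and the interval lengths are illustrative assumptions; only the formula itself and the 4-wide dispatch width of the processor simulated in this paper come from the text.

```python
def base_cycles(num_instructions: int, dispatch_width: int) -> int:
    """Base cycle count of one interval: ceil(N / D), in integer arithmetic."""
    if num_instructions < 0 or dispatch_width <= 0:
        raise ValueError("need N >= 0 and D > 0")
    # Negated floor division rounds up without going through floats.
    return -(-num_instructions // dispatch_width)


# Hypothetical interval lengths (instructions between consecutive miss events).
interval_lengths = [10, 7, 128]
per_interval = [base_cycles(n, dispatch_width=4) for n in interval_lengths]
total_base = sum(per_interval)
print(per_interval, total_base)
```

Summing the per-interval counts gives the base component for the whole run under the balanced-processor assumption stated above.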
Miss Event Cycle Counts. The cycle counts (penalties) for miss events depend on the
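type of miss event.
[Figure 2: timeline around a branch misprediction, with segments labeled time = N/D,
time = branch latency (approximately the window drain time, cdr) and time = pipeline
length (cfe).]
Fig. 2. Interval behavior for a branch misprediction.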
– For a front-end miss event such as an L1 I-cache miss, L2 I-cache miss or an I-TLB
miss, the penalty equals the access time to the next cache level [3]. For example,
the cycle count penalty for an L1 I-cache miss event is the L2 cache access latency.
– The penalty for a branch misprediction equals the branch resolution time plus the
number of pipeline stages in the front-end pipeline cfe, see Figure 2. Previous
work [2] has shown that the branch resolution time can be approximated by the
window drain time cdr; i.e., the mispredicted branch is very often the last instruc-
tion being executed before the window drains. Also, this previous work has shown
that, in many cases, the branch resolution time is the main contributor to the overall
branch misprediction penalty.
– Short back-end misses, i.e., L1 D-cache misses, are modeled as if they are instruc-
tions that are serviced by long latency functional units, not miss events. In other
words, it is assumed that the latencies can be hidden by out-of-order execution,
which is the case in a balanced processor design.
– The penalty for isolated long back-end misses, such as L2 D-cache misses and D-
TLB misses, equals the main memory access latency.
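The isolated penalties listed above can be sketched as a small lookup. This is an illustration only: the names MachineParams and miss_penalty are hypothetical, the latencies are those of the processor model used later in this paper (10-cycle L2, 250-cycle main memory, 5 front-end stages), and charging an L2 I-cache miss with the main memory latency is our reading of "the access time to the next cache level". The I-TLB case is left out because the text does not give its latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineParams:
    l2_access_cycles: int = 10       # L2 cache access time
    memory_access_cycles: int = 250  # main memory access time
    frontend_stages: int = 5         # front-end pipeline depth (cfe)


def miss_penalty(event: str, machine: MachineParams,
                 window_drain_cycles: int = 0) -> int:
    """Penalty in cycles for one isolated miss event."""
    if event == "l1_icache":
        return machine.l2_access_cycles
    if event == "l2_icache":
        # Assumption: the level after L2 is main memory.
        return machine.memory_access_cycles
    if event == "branch_mispredict":
        # Branch resolution time (approximated by the window drain time, cdr)
        # plus the front-end pipeline depth (cfe).
        return window_drain_cycles + machine.frontend_stages
    if event == "l1_dcache":
        # Modeled as a long-latency instruction, not as a miss event.
        return 0
    if event in ("l2_dcache", "dtlb"):
        return machine.memory_access_cycles
    raise ValueError(f"unknown miss event: {event}")


machine = MachineParams()
print(miss_penalty("branch_mispredict", machine, window_drain_cycles=12))
```

The window drain time of 12 cycles in the usage line is a made-up value; in practice it varies per mispredicted branch.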
Overlapping Miss Events. The above discussion of miss event cycle counts essentially
assumes that the miss events occur in isolation. In practice, of course, they may overlap.
We deal with overlapping miss events in the following manner.
– For long back-end misses that occur within an interval of W instructions (the ROB
size), the penalties overlap completely [8,3]. We refer to the latter case as over-
lapping long back-end misses; in other words, memory-level parallelism (MLP) is
present.
– Simulation results in prior work [1,3] show that front-end misses only infrequently
interact with long back-end misses. Our simulation results
conﬁrm that for all except three of the SPEC CPU2000 benchmarks, less than 1%
of the cycles are spent servicing front-end and long back-end misses in parallel;
only gap (5.4%), twolf (4.9%), vortex (3.1%) spend more than 1% of their cycles
servicing front-end and long data cache misses in parallel.
2.2 Evaluating Cycle Count Components
To evaluate the cycle count components in this paper, we use detailed simulation. We
compute the following cycle components: base (no miss events), L1 I-cache miss, L2
I-cache miss, I-TLB miss, L1 D-cache miss, L2 D-cache miss, D-TLB miss, branch
misprediction and resource stall (called ‘other’ throughout the paper). The cycle counts
are determined in the following way: (i) the cycles caused by a branch misprediction are
computed as the branch resolution time plus the number of pipeline stages in the front-end
pipeline, (ii) the cycles for an I-cache miss event are computed as the time to access the
next level in the memory hierarchy, (iii) the cycles for overlapping long back-end misses
are computed as a single penalty, and (iv) the L1 D-cache and resource stall cycle
components account for the cycles in which no instructions can be committed because of an
L1 D-cache miss or long latency instruction (such as a multiply or floating-point
operation) blocking the head of the ROB. Furthermore, we do not count miss events along
mispredicted control flow paths. The infrequent case of front-end miss events overlapping
with long back-end miss events is handled by assigning a cycle count to the front-end miss
event unless
a full ROB triggers the long back-end miss penalty. Given the fact that front-end miss
events only rarely overlap with long back-end miss events, virtually any mechanism for
dealing with overlaps would sufﬁce. The base cycle component then is the total cycle
count minus all the individual cycle components.
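Two of the accounting rules above, a single penalty for overlapping long back-end misses and the base component as a residual, can be sketched as follows. This is a simplified illustration, not the counting mechanism of [1]: the function names and all numbers are hypothetical, and we assume that a miss overlaps with the first miss of its group when it lies fewer than W instructions (the ROB size) behind it.

```python
def count_long_miss_penalties(miss_positions, rob_size):
    """Number of penalties charged for long back-end misses.

    miss_positions are dynamic instruction indices of the misses. A miss
    that falls within rob_size instructions of the first miss of its group
    is treated as fully overlapped and adds no penalty of its own.
    """
    penalties = 0
    group_leader = None
    for position in sorted(miss_positions):
        if group_leader is None or position - group_leader >= rob_size:
            penalties += 1
            group_leader = position
    return penalties


def base_component(total_cycles, miss_components):
    """Base cycles: total cycle count minus all individual components."""
    return total_cycles - sum(miss_components.values())


# Hypothetical run: five L2 D-cache misses, 128-entry ROB, 250-cycle memory.
penalties = count_long_miss_penalties([100, 150, 220, 400, 1000], rob_size=128)
components = {"bpred": 1200, "IL1": 300, "DL2": penalties * 250}
print(penalties, base_component(10_000, components))
```

In this example the misses at positions 150 and 220 overlap with the one at 100, so three penalties are charged instead of five.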
Note that although we are using simulation in this paper, this is consistent with what
could be done in real hardware. In particular, Eyerman et al. [1] proposed an archi-
tected hardware counter mechanism that computes CPI components using exactly the
same counting mechanism as we do in this paper. The hardware performance counter
architecture proposed in [1] was shown to compute CPI components that are accurate to
within a few percent of components computed by detailed simulations. Note that in this
paper we are counting cycle components (C alone) and not CPI components as done
in [1] because the number of instructions is subject to compiler optimization. However,
the mechanisms for computing cycle components and CPI components are essentially
the same.
3 Experimental setup
This paper uses all the C benchmarks from the SPEC CPU2000 benchmark suite, see
Table 1. Because we want to run all the benchmarks to completion for all the compiler
optimizations, we use the lgred inputs provided by MinneSPEC [9]. The dynamic
instruction count of the lgred input varies between several hundreds of millions of
instructions and a number of billions of instructions.
The simulated superscalar processor is detailed in Table 2. It is a 4-wide out-of-order
microarchitecture with a 128-entry reorder buffer (ROB). This processor was simulated
using SimpleScalar/Alpha v3.0 [10].

Table 1. The SPEC CPU2000 benchmarks, their inputs and the dynamic instruction count when
compiled using the -O3 optimization flag (in millions).
benchmark   input             dyn. I-cnt (M)
bzip2       lgred.program              2,102
crafty      lgred                        781
gap         lgred                        672
gcc         lgred.cp-decl.i            4,576
gzip        lgred.graphic              1,682
mcf         lgred                        659
parser      lgred                      3,944
perlbmk     lgred.makerand             1,943
twolf       lgred                      1,236
vortex      lgred                      1,256
vpr         lgred route                  643
ammp        lgred                      1,344
art         lgred                      2,038
equake      lgred                        817
mesa        lgred                      1,691

Table 2. Processor model assumed in our experimental setup.
ROB                  128 entries
processor width      4 wide
fetch width          8 wide
latencies            load 2 cycles, mul 3 cycles, div 20 cycles, arith/log 1 cycle
L1 I-cache           16KB 4-way set-associative, 32-byte cache lines
L1 D-cache           16KB 4-way set-associative, 32-byte cache lines
L2 cache             unified, 1MB 8-way set-associative, 128-byte cache lines,
                     10 cycle access time
main memory          250 cycle access time
branch predictor     hybrid predictor consisting of 4K-entry meta, bimodal and
                     gshare predictors
front-end pipeline   5 stages

All the benchmarks were compiled using gcc v4.1.1 (dated May 2006) on an Alpha
21264 processor machine. We chose the gcc compiler because, in contrast to the
native Compaq cc compiler, it comes with a rich set of compiler flags that can be set
individually. This enables us to consider a wide range of optimization levels. The 22
optimization levels considered in this paper are given in Table 3. This ordering of
optimization settings is inspired by gcc’s -O1, -O2 and -O3 optimization levels; the
compiler optimizations are applied on top of each other to progressively evolve from
the base optimization level to the most advanced optimization level. The reason for
working with these optimization levels is to keep the number of optimization combina-
tions at a tractable number while exploring a wide enough range of optimization levels
— the number of possible optimization settings by setting individual ﬂags is obviously
huge and impractical to do. We believe the particular ordering of optimization levels
does not affect the overall conclusions from this paper.

Table 3. Compiler optimization levels considered in this paper.
Abbreviation         Description
base                 base optimization level: -O1 -fno-tree-ccp -fno-tree-dce
                     -fno-tree-dominator-opts -fno-tree-dse -fno-tree-ter -fno-tree-lrs
                     -fno-tree-sra -fno-tree-copyrename -fno-tree-fre -fno-tree-ch
                     -fno-cprop-registers -fno-merge-constants -fno-loop-optimize
                     -fno-if-conversion -fno-if-conversion2 -fno-unit-at-a-time
basic tree opt       basic optimizations on intermediate SSA code tree
const prop/elim      merge identical constants across compilation units;
                     constant propagation and copy elimination
loop opt             loop optimizations: move constant expressions out of loop and
                     simplify exit test conditions
if-conversion        if-conversion: convert control dependencies to data dependencies using
                     predicated execution through conditional move (cmov) instructions
O1                   optimization flag -O1
O2 -fnoO2            optimization flag -O2 with all individual -O2 optimization flags disabled
CSE                  apply common subexpression elimination
BB reorder           reorder basic blocks in order to reduce the number of taken branches
                     and improve code locality
strength red         strength reduction optimization and elimination of iteration variables
recursion opt        optimize sibling and tail recursive function calls
insn scheduling      reorder instructions to eliminate stalls due to required data being
                     unavailable; includes scheduling instructions across basic blocks;
                     is specific for target platform on which the compiler runs
strict aliasing      assumes that an object of one type never resides at the same address as
                     an object of a different type, unless the types are almost the same
alignment            align the start of branch targets, loops and functions to a
                     power-of-two boundary
adv tree opt         advanced intermediate code tree optimizations
O2                   optimization flag -O2
aggr loop opt        perform more aggressive loop optimizations
inlining             integrate simple functions (determined based on heuristics) into
                     their callers
O3                   optimization flag -O3
loop unroll          unroll loops whose number of iterations can be determined at compile
                     time or upon entry to the loop
software pipelining  modulo scheduling
FDO                  feedback-directed optimization using edge counts

4 The impact of compiler optimizations
This section first analyzes the impact various compiler optimizations have on the
various cycle components in an out-of-order processor. We subsequently analyze how
compiler optimizations affect out-of-order processor performance as compared to in-order
processor performance.
4.1 Out-of-order processor performance
Before discussing the impact of compiler optimizations on out-of-order processor per-
formance in great detail on a number of case studies, we ﬁrst present and discuss some
general ﬁndings.
Figure 3 shows the average normalized execution time for the sequence of opti-
mizations used in this study. The horizontal axis shows the various optimization levels;
the vertical axis shows the normalized execution time (averaged across all benchmarks)
compared to the base optimization setting. On average, over the set of benchmarks
and the set of optimization settings considered in this paper, performance improves by
15.2% compared to the base optimization level. (Note that our base optimization setting
already includes a number of optimizations, and results in 40% better performance than
the -O0 compiler setting.) Some benchmarks, such as ammp and mesa, observe no or
almost no performance improvement. Other benchmarks benefit substantially, such as
mcf (19%), equake (23%) and art (over 40%).
[Figure 3: bar chart; horizontal axis: the 22 optimization levels from base to FDO;
vertical axis: average normalized execution time, 0.800 to 1.000.]
Fig. 3. Averaged normalized cycle counts on a superscalar out-of-order processor.
[Figure 4: bar chart; horizontal axis: base, L1 I-cache, L2 I-cache, I-TLB, L1 D-cache,
no. of L2 D-cache misses, no. of D-TLB misses, MLP, no. of branch misses, branch penalty,
other; vertical axis: cycle component decrease, 0% to 10%; series: out-of-order processor
and in-order processor.]
Fig. 4. Overall performance improvement on an out-of-order processor and an in-order processor
across the various compiler settings partitioned by cycle component.
Figure 4 summarizes the total performance improvement for the individual cycle
components. This graph divides the total 15.2% performance improvement by the con-
tributions in each of the cycle components. There are a number of interesting insights
to be gained from the above analysis concerning the impact of compiler optimizations
on out-of-order processor performance.
First, compiler optimizations reduce the dynamic instruction count and improve the
base cycle component. Figure 4 shows that an absolute 6.6% performance improvement
(or 43.9% of the total improvement) comes from reducing the base cycle component. As
such, we conclude that reducing the dynamic instruction count, which has been a
traditional objective for optimization dating back to sequential (non-pipelined) processors,
is still an important optimization criterion for today’s out-of-order processors.
Compiler optimizations that aim at improving the critical path of inter-operation
dependencies only improve the branch misprediction penalty. This is a key new in-
sight from this paper: the critical path of inter-operation dependencies is only visible
through the branch misprediction penalty and by consequence, optimizations targeted
at reducing chains of dependent instructions only affect the branch resolution time; on a
balanced processor, inter-operation dependencies not residing on the critical path
leading to a mispredicted branch can be effectively hidden by out-of-order execution. Note
that optimizations targeting the inter-operation critical path may also improve the base
and resource stall cycle components in case of unbalanced execution, i.e., when the re-
order buffer is too small to sustain a given issue rate of instructions; in practice though,
this is an infrequent case. Figure 4 shows the improvement in the branch resolution time
across the optimization settings; this is a 1.2% absolute improvement or a 7.8% relative
improvement.
Finally, compiler optimizations signiﬁcantly affect the number of miss events and
their overlap behavior. According to Figure 4, 9.6% of the total performance
improvement comes from a reduced number of branch mispredictions, and 16.7% and 19.5%
of the total performance improvement come from improved L1 I-cache and L2 D-cache
cycle components, respectively. The key observation here is that the reduced L2
D-cache cycle component is almost entirely due to improved memory-level parallelism
(MLP). In other words, compiler optimizations that bring L2 cache miss loads closer
to each other in the dynamic instruction stream improve performance substantially by
increasing the amount of MLP.
4.2 Compiler optimization analysis case studies
We now present some case studies illustrating the power of interval analysis for
gaining insight into how compiler optimizations affect out-of-order processor performance.
Figure 5 shows normalized cycle distributions for individual benchmarks — we se-
lected the benchmarks that are affected most by the compiler optimizations. These bars
are computed as follows. For all compiler optimization settings, we compute the cycle
counts for each of the nine cycle components: base, L1 I-cache, L2 I-cache, I-TLB,
L1 D-cache, L2 D-cache, D-TLB, branch misprediction and other resource stalls. Once
these cycle counts are computed we then normalize the cycle components for all opti-
mization settings to the total cycle count for the base optimization setting.
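The normalization just described can be sketched in a few lines of Python. The function name normalize_to_base and the cycle counts are hypothetical and only three of the nine components are shown; the one thing taken from the text is that every component, for every optimization setting, is divided by the total cycle count of the base setting.

```python
def normalize_to_base(cycles_by_setting, base_setting="base"):
    """Divide every cycle component by the total cycle count of the base setting."""
    base_total = sum(cycles_by_setting[base_setting].values())
    if base_total <= 0:
        raise ValueError("base setting must have a positive total cycle count")
    return {
        setting: {name: cycles / base_total for name, cycles in components.items()}
        for setting, components in cycles_by_setting.items()
    }


# Made-up cycle counts for two optimization settings of one benchmark.
cycles_by_setting = {
    "base": {"base": 600, "bpred": 300, "DL2": 100},
    "O1": {"base": 550, "bpred": 250, "DL2": 100},
}
normalized = normalize_to_base(cycles_by_setting)
o1_total = sum(normalized["O1"].values())
print(round(o1_total, 3))
```

With this normalization the bar for the base setting always sums to 1, and the height of any other bar is that setting's execution time relative to base.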
During the analysis presented in the next discussion we will also refer to Tables 4
and 5 which show the number of benchmarks for which a given compiler optimization
results in a positive or negative effect, respectively, on the various cycle components.
These tables also show the number of benchmarks for which the dynamic instruction
count is signiﬁcantly affected by the various compiler optimizations; likewise for the
number of long back-end misses and their amount of MLP as well as for the number
of branch mispredictions and their penalty. We do not show average performance im-
provement numbers in these tables because outliers make the interpretation difﬁcult;
instead, we treat outliers in the following discussion.
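As a rough illustration of how such per-optimization counts could be tallied, the sketch below counts the benchmarks whose cycle count changes by more than 0.1% in either direction. The function name count_effects and the sign convention (negative means fewer cycles) are assumptions; the perlbmk and art values are the basic loop optimization improvements reported below, while the mesa and vpr values are invented for the example.

```python
def count_effects(relative_change, threshold=0.001):
    """Count benchmarks with a positive or negative effect beyond the threshold.

    relative_change maps a benchmark name to the fractional change in cycles
    caused by one optimization; negative values mean fewer cycles.
    """
    improved = sum(1 for change in relative_change.values() if change < -threshold)
    degraded = sum(1 for change in relative_change.values() if change > threshold)
    return improved, degraded


change_in_total_cycles = {
    "perlbmk": -0.067,  # 6.7% improvement
    "art": -0.059,      # 5.9% improvement
    "mesa": 0.0004,     # hypothetical, below the 0.1% threshold
    "vpr": 0.012,       # hypothetical slowdown
}
improved, degraded = count_effects(change_in_total_cycles)
print(improved, degraded)
```

Benchmarks inside the threshold band are counted in neither table, which is why the positive and negative counts for an optimization need not add up to 15.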
Basic loop optimizations. Basic loop optimizations move constant expressions out of
the loop and simplify loop exit conditions. Most benchmarks beneﬁt from these loop
optimizations; the reasons for improved performance include a smaller dynamic in-
struction count which reduces the base cycle component. A second reason is that the
simpliﬁed loop exit conditions result in a reduced branch misprediction penalty. Two
benchmarks that benefit significantly from loop optimizations are perlbmk (6.7%
improvement) and art (5.9% improvement). The reason for these improvements is different
for the two benchmarks. For perlbmk, the reason is a reduced L1 I-cache component
and a reduced branch misprediction component. The reduced L1 I-cache component is
due to fewer L1 I-cache misses. The branch misprediction cycle component is reduced
mainly because of a reduced branch misprediction penalty; the number of branch
mispredictions is not affected very much. In other words, the loop optimizations reduce the
critical path leading to the mispredicted branch so that the branch gets resolved earlier.
For art on the other hand, the major cycle reduction is observed in the L2 D-cache cycle
component. The reason is an increased number of overlapping L2 D-cache misses:
the number of L2 D-cache misses remains the same, but the reduced code footprint
brings the L2 D-cache misses closer to each other in the dynamic instruction stream,
which results in more memory-level parallelism.
[Figure 5: five bar-chart panels (art, equake, mcf, perlbmk, vpr); horizontal axis: the 22
optimization levels from base to FDO; vertical axis: normalized cycle distribution, 0.0 to
1.0; components: base, IL1, IL2, ITLB, DL1, DL2, DTLB, bpred, other.]
Fig. 5. Normalized cycle distributions for the out-of-order processor for art, equake, mcf,
perlbmk and vpr.

Table 4. The number of benchmarks (out of 15) for which a given compiler optimization has a
positive (more than 0.1%) effect on the various cycle components, the number of retired
instructions, the number of long back-end misses and their MLP, and the number of branch
mispredictions and their penalties. Numbers larger than or equal to 9 are shown in bold.
Column groups: cycle components (total through other); #insns; DL2 and DTLB misses
(#DL2, #DTLB, MLP); br mispredicts (#bmp, pen).
optimization           total   base    IL1    IL2   ITLB    DL1    DL2   DTLB  bpred  other #insns   #DL2  #DTLB    MLP   #bmp    pen
basic tree opt            11     14      5      0      0      2      6      2     11      0     14      3      0      5     11      3
cst prop/elim              6      2      1      0      0      1      0      0      8      0      7      0      0      1      4      4
loop opt                  12     12      3      0      0      3      3      0      8      8     13      3      0      3      3      9
if-conversion              7      1      3      0      0      1      6      1     10      1      1      2      0      6      8      5
jump opt (O1)              9     10      4      0      0      2      2      0      3      1     11      1      0      2      3      5
O2 -fnoO2                  5      0      3      0      0      2      2      1      8      1      0      1      0      1      5      6
CSE                        6      5      3      0      0      2      1      1      8      2      9      1      1      1      6      6
BB reordering             10     10      6      0      0      2      2      0      4      1     11      2      1      0      6      1
strength red               4      3      1      0      0      0      1      1      3      0      2      1      0      1      2      1
recursion opt              7      4      4      0      0      0      3      0      4      0      6      3      0      0      4      3
insn scheduling            5      1      2      0      0      4      3      1     10      4      0      1      0      3      5     10
strict aliasing            8     11      2      0      0      5      3      1      6      3     11      1      1      5      0     10
alignment                  5      3      2      0      0      0      3      1      5      0      4      2      0      2      4      2
adv tree opt               6      3      3      0      0      2      4      2      4      3      5      1      2      3      3      5
O2                         9      7      4      0      0      0      2      0      6      0      7      3      0      2      4      4
aggr loop opt              7      2      3      0      0      1      2      0      4      1      3      1      0      1      4      0
inlining                  12     10      3      0      0      0      7      1      9      1     12      4      1      4      7      4
O3                         5      2      2      0      0      0      1      0      3      0      2      0      0      1      3      1
loop unrolling             9     11      1      0      0      3      4      0      7      2     12      0      0      5      3      6
software pipelining        3      1      2      0      0      1      1      0      1      1      0      1      1      0      4      0
FDO                        8      7      3      1      0      2      3      1      6      1     10      5      2      1      7      5

Table 5. The number of benchmarks (out of 15) for which a given compiler optimization has a
negative (more than 0.1%) effect on the various cycle components, the number of retired
instructions, the number of long back-end misses and their MLP, and the number of branch
mispredictions and their penalties. Numbers larger than or equal to 9 are shown in bold.
Column groups: cycle components (total through other); #insns; DL2 and DTLB misses
(#DL2, #DTLB, MLP); br mispredicts (#bmp, pen).
optimization           total   base    IL1    IL2   ITLB    DL1    DL2   DTLB  bpred  other #insns   #DL2  #DTLB    MLP   #bmp    pen
basic tree opt             4      0      1      0      0      2      3      0      3      5      1      2      1      4      1     10
cst prop/elim              6      1      4      0      0      0      2      1      2      0      0      2      0      2      3      2
loop opt                   1      1      2      0      0      1      3      1      4      0      0      2      1      2      6      3
if-conversion              6      7      3      0      0      2      1      1      1      1     11      1      0      2      2      4
jump opt (O1)              4      1      1      0      0      1      3      2      7      2      0      3      1      3      5      4
O2 -fnoO2                  8     11      3      0      0      0      3      0      3      0     13      1      0      2      4      1
CSE                        6      6      3      0      0      1      0      1      3      1      4      1      0      1      5      4
BB reordering              2      1      0      0      0      1      3      2      8      2      2      1      1      4      5     11
strength red               3      1      1      0      0      0      1      0      1      0      1      0      0      1      0      0
recursion opt              2      1      1      0      0      0      2      1      4      0      0      1      0      2      4      3
insn scheduling            8     10      4      0      0      1      5      0      1      1     11      3      1      1      3      1
strict aliasing            4      0      2      0      0      0      3      1      2      1      0      3      0      1      6      2
alignment                  4      1      4      0      0      2      2      0      3      1      0      2      0      2      3      2
adv tree opt               7      3      3      0      0      0      2      1      7      0      4      2      0      2      6      2
O2                         3      1      2      0      0      1      3      1      1      0      2      3      0      1      3      3
aggr loop opt              2      0      1      0      0      1      1      0      1      0      0      1      0      0      1      2
inlining                   1      2      2      2      0      3      0      1      4      1      1      1      1      0      3      5
O3                         4      1      3      0      0      0      2      0      1      1      1      2      0      0      1      2
loop unrolling             5      1      6      2      0      1      1      0      3      1      1      3      0      1      5      2
software pipelining        3      1      2      0      0      0      1      1      2      0      1      1      0      2      3      4
FDO                        6      4      2      0      0      2      3      0      5      2      3      0      0      6      5      7

If-conversion. The goal of if-conversion is to eliminate hard-to-predict branches through
predicated execution. The potential drawback of if-conversion is that more instructions
need to be executed because instructions along multiple control ﬂow paths need to be
executed and part of these will be useless. Executing more instructions reﬂects itself in
a larger base cycle component. In addition, more instructions need to be fetched; we
observe that this also increases the number of L1 I-cache misses for several benchmarks.
Approximately half the benchmarks beneﬁt from if-conversion; for these benchmarks,
the reduction in the number of branch mispredictions outweighs the increased number
of instructions that need to be executed. For the other half of the benchmarks, the main
reason for the decreased performance is the increased number of dynamically executed
instructions.
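The trade-off described in this paragraph can be written down as a back-of-the-envelope estimate in terms of cycle components. This is our own simplification, not a model from the paper: it charges extra instructions at the base rate of one cycle per D instructions, credits each removed misprediction with an average penalty, and ignores additional I-cache misses and any lengthening of dependence chains. The function name and all numbers are hypothetical.

```python
def if_conversion_net_cycles(removed_mispredictions, avg_mispredict_penalty,
                             extra_instructions, dispatch_width):
    """Estimated cycles saved (positive) or lost (negative) by if-conversion."""
    saved_branch_cycles = removed_mispredictions * avg_mispredict_penalty
    added_base_cycles = extra_instructions / dispatch_width
    return saved_branch_cycles - added_base_cycles


# Hypothetical case 1: many hard-to-predict branches are removed.
print(if_conversion_net_cycles(1000, 20, 40_000, dispatch_width=4))
# Hypothetical case 2: the converted branches were mostly predictable.
print(if_conversion_net_cycles(100, 20, 40_000, dispatch_width=4))
```

Under these assumptions the first case gains cycles and the second loses them, which matches the observation that roughly half of the benchmarks benefit and the other half do not.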
An interesting benchmark to consider more closely is vpr: its base, resource stall
and L1 D-cache cycle components increase by 4.5%, 9.6% and 3.9%, respectively. This
analysis shows that if-conversion adds to the already very long critical path in vpr —
vpr executes a tight loop with loop-carried dependencies which results in very long
dependence chains. If-conversion adds to the critical path because registers may need
to be copied using conditionalmoveinstructionsat the reconvergencepoint. Because of
this very long critical path in vpr, issue is unable to keep up with dispatch which causes
the reorder buffer to ﬁll up. In other words, the reorder buffer is unable to hide the
instruction latencies and dependencies through out-of-orderexecution, which results in
increased base, L1 D-cache and resource stall cycle components.
Instruction scheduling. Instruction scheduling tends to increase the dynamic instruction
count which, in turn, increases the base cycle component. This observation was also
made by Valluri and Govindarajan [11]. The reason for the increased dynamic
instruction count is that spill code is added during the scheduling process by the compiler. Note
also that instruction scheduling reduces the branch misprediction penalty for 10 out of
15 benchmarks (see Table 4); i.e., the critical path leading to the mispredicted branch
is shortened through the improved instruction scheduling. Unfortunately, this does not
compensate for the increased dynamic instruction count, resulting in a net performance
decrease for most of the benchmarks.
Strict aliasing. The assumption that references to different object types never access
the same address allows for more aggressive scheduling of memory operations; this
is a safe optimization as long as the C program complies with the ISO C99 standard (footnote 1).
This results in significant performance improvements for a number of benchmarks, see
for example art (16.2%). Strict aliasing reduces the number of non-overlapping L2
D-cache misses by 11.5% for art while keeping the total number of L2 D-cache misses
almost unchanged; in other words, memory-level parallelism is increased.

[Figure 6: bar chart. Y-axis: avg normalized execution time (0.800 to 1.000). X-axis, optimization settings: base, tree opt, const prop/elim, basic loop opt, if-conversion, O1, O2 -fnoO2, CSE, BB reorder, strength red, recursion opt, insn scheduling, strict aliasing, alignment, adv tree opt, O2, aggr loop opt, inlining, O3, loop unrolling, software pipelining, FDO.]
Fig. 6. Average normalized cycle counts on a superscalar in-order processor.
4.3 Comparison with in-order processors
Having discussed the impact of compiler optimizations on out-of-order processor
performance, it is interesting to compare against the impact these compiler optimizations
have on in-order processor performance. Figure 6 shows the average normalized cycle
counts on a superscalar in-order processor. Performance improves by 17.5% on average
compared to the base optimization level. The most striking observation to be made when
comparing the in-order graph (Figure 6) against the out-of-order graph (Figure 3) is that
instruction scheduling improves performance on the in-order processor whereas on the
out-of-order processor, it degrades performance. The reason is that on in-order
architectures, the improved instruction schedule outweighs the additional spill code that may
be generated for accommodating the improved instruction schedule. On an out-of-order
processor, the additional spill code only adds overhead through an increased base cycle
component.
To better understand the impact of compiler optimizations on out-of-order versus
in-order processor performance, we now compare in-order processor cycle components
against out-of-order processor cycle components. In order to do so, we use the
following cycle counting mechanism for computing the cycle components on the in-order
processor. For each cycle in which no instruction can be issued, the
mechanism increments the count of the appropriate cycle component. For example, when the
next instruction to issue stalls for a register to be produced by an L2 miss, the cycle
is assigned to the L2 D-cache miss cycle component. Similarly, if no instructions are
available in the pipeline to issue because of a branch misprediction, the cycle is assigned
to the branch misprediction cycle component.
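The counting mechanism described above can be sketched as follows. This is a minimal illustration, not the simulator code used in the paper: the per-cycle trace format and the component names are hypothetical, and the sketch assumes that cycles in which at least one instruction issues are charged to the base component.

```python
from collections import Counter


def account_cycles(trace):
    """Attribute every simulated cycle to exactly one cycle component.

    `trace` is a hypothetical per-cycle log: an entry is None when at least
    one instruction issued in that cycle, or a string naming the reason why
    nothing could issue (for example "l2_dcache" or "branch_mispredict").
    """
    components = Counter()
    for stall_reason in trace:
        if stall_reason is None:
            # Assumption: a cycle with useful issue counts as a base cycle.
            components["base"] += 1
        else:
            # No instruction issued: charge the cycle to the blocking event.
            components[stall_reason] += 1
    return components


if __name__ == "__main__":
    trace = [None, None, "l2_dcache", "l2_dcache", "branch_mispredict", None]
    for name, cycles in sorted(account_cycles(trace).items()):
        print(name, cycles)
```

Because each cycle is charged to exactly one component, the components add up to the total cycle count, which is what makes them directly comparable across optimization settings.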
1 The current standard for Programming Language C is ISO/IEC 9899:1999, published 1999-12-01.

The result of comparing the in-order processor cycle components against the
out-of-order processor cycle components is presented in Figure 4. To facilitate the discussion,
we make the following distinction in cycle components. The first group of cycle
components is affected by the dynamic instruction count and the critical path of inter-operation
dependencies; these are the base, resource stall, and branch misprediction penalty cycle
components. We observe from Figure 4 that these cycle components are affected more
by the compiler optimizations for the in-order processor than for the out-of-order
processor: 14.6% versus 8.0%. The second group of cycle components is related to the
L1 and L2 cache and TLB miss events and the number of branch misprediction events.
This second group is affected more for the out-of-order processor:
7% versus only 2.3% for the in-order processor. In
other words, most of the performance gain through compiler optimizations on an
in-order processor comes from reducing the dynamic instruction count and shortening the
critical path of inter-operation dependencies. On an out-of-order processor, the dynamic
instruction count and the critical path are also important factors affecting overall
performance; however, about one half of the total performance speedup comes from secondary
effects related to I-cache, long-latency D-cache and branch misprediction behavior.
There are three reasons that support these observations. First, out-of-order execution
hides part of the inter-operation dependencies and latencies, which reduces the impact
of critical path optimizations. In particular, in a balanced out-of-order processor, the
critical path of inter-operation dependencies is only visible on a branch misprediction.
Second, the base and resource stall cycle components are more significant for an
in-order processor than for an out-of-order processor; this makes the miss event cycle
components relatively less significant for an in-order processor than for an out-of-order
processor. As such, an improvement to these miss event cycle components results in
a smaller impact on overall performance for in-order processors. Third, scheduling
instructions can have a bigger impact on memory-level parallelism on an out-of-order
processor than on an in-order processor. A good static instruction schedule will place
independent long-latency D-cache and D-TLB misses closer to each other in the
dynamic instruction stream. An out-of-order processor will be able to exploit the available
MLP at run time in case the independent long-latency loads appear within a ROB size
from each other in the dynamic instruction stream. An in-order processor, on the other
hand, may not be able to get to the independent long-latency loads because of the
processor stalling on instructions that are dependent on the first long-latency load.
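The ROB-window argument above can be illustrated with a small counting sketch. It is a simplified, first-order illustration and not the model used in the paper: the function name and inputs are hypothetical, all loads are assumed to be independent and to miss, and a miss is assumed to overlap with an earlier one whenever it lies fewer than ROB-size instructions after the miss that opened the current window.

```python
def count_nonoverlapping_misses(miss_positions, rob_size):
    """Count long-latency misses that are not hidden under an earlier miss.

    miss_positions: dynamic instruction indices of independent long-latency
    loads (hypothetical input). rob_size: reorder buffer size in instructions.
    """
    serialized = 0
    window_start = None
    for pos in sorted(miss_positions):
        if window_start is None or pos - window_start >= rob_size:
            # This miss cannot be in the ROB together with the window leader,
            # so it exposes its full latency and opens a new window.
            serialized += 1
            window_start = pos
        # Otherwise the miss fits in the ROB with the leader and overlaps.
    return serialized


if __name__ == "__main__":
    rob_size = 128
    far_apart = [0, 500, 1000]   # same three misses, widely spaced
    clustered = [0, 40, 80]      # same three misses, scheduled close together
    print(count_nonoverlapping_misses(far_apart, rob_size))
    print(count_nonoverlapping_misses(clustered, rob_size))
```

In both inputs the total number of misses is three; only the number of non-overlapping misses changes. This is the effect reported for art, where the miss count stays almost unchanged while the exposed miss latency drops.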
5 Related work
A small number of research papers exist on compiler optimizations for out-of-order
processors; however, none of this prior work analyzes compiler optimizations
in terms of their impact on the various cycle components.
Valluri and Govindarajan [11] evaluate the effectiveness of postpass and prepass
instruction scheduling techniques on out-of-order processor performance. In postpass
scheduling, register allocation precedes instruction scheduling. The potential drawback
is that false dependencies introduced by the register allocator may limit the scheduler’s
ability to efficiently schedule instructions. Prepass scheduling, on the other hand, only
allocates registers after completing instruction scheduling. The potential drawback is
that register lifetimes may increase, which possibly leads to more spill code. Silvera et
al. [12] also emphasize the importance of reducing register spill code in out-of-order
issue processors. This is also what we observe in this paper: instruction scheduling
increases the dynamic instruction count, which degrades the base cycle component and,
for most benchmarks, also degrades overall performance. This paper is different from
the study conducted by Valluri and Govindarajan [11] in two main ways. First, Valluri
and Govindarajan limit their study to instruction scheduling; our paper studies a wide
range of compiler optimizations. Second, the study done by Valluri and Govindarajan is
an empirical study and does not provide the insight that we provide using an analytical
processor model.
Pai and Adve [13] propose read miss clustering, a code transformation technique
suitable for compiler implementation that improves memory-level parallelism on
out-of-order processors. Read miss clustering aims to schedule likely long-latency
independent memory accesses as close to each other as possible. At execution time, these
long-latency loads will then overlap, improving overall performance.
Holler [14] discusses various compiler optimizations for the out-of-order HP
PA-8000 processor. The paper enumerates heuristics for driving compiler
optimizations such as loop unrolling, if-conversion, superblock formation, instruction
scheduling, etc. However, Holler does not quantify the impact of each of these compiler
optimizations on out-of-order processor performance.
Cohn and Lowney [15] study feedback-directed compiler optimizations on the
out-of-order Alpha 21264 processor. Again, Cohn and Lowney do not provide insight into
how compiler optimizations affect cycle components.
Vaswani et al. [16] build empirical models that predict the effect of compiler
optimizations and microarchitecture configurations on superscalar processor performance.
Those models do not provide the insights in terms of cycle components obtained from
interval analysis as presented in this paper.
6 Conclusion and Impact on Future Work
The interaction between compiler optimizations and superscalar processors is difficult
to understand, especially because of overlap effects in superscalar out-of-order
processors. This paper analyzed the impact compiler optimizations have on out-of-order
processor performance using interval analysis by dividing total execution time into cycle
components.
This paper provides a number of key insights that can help drive future work in
compiler optimizations for out-of-order processors. First, the critical path leading to
mispredicted branches is the only place during program execution where the impact
of the critical path of inter-operation dependencies is visible on overall performance.
As such, limiting the focus of instruction scheduling to paths leading to mispredicted
branches could yield improved performance and/or limit compilation time; the latter is
an important consideration for dynamic compilation systems. Second, the analysis in
this paper showed that reducing the dynamic instruction count improves performance
by reducing the base cycle component. As such, compiler builders can use this
insight to gear optimizations for out-of-order processors towards minimizing the
dynamic instruction count rather than increasing the amount of ILP, since ILP can be
extracted dynamically by the hardware. The results presented in this paper show that
reducing the dynamic instruction count remains an important optimization criterion
for today’s high-performance microprocessors. Third, since miss events have a large
impact on overall performance, more so on out-of-order processors than on in-order
processors, it is important to make compiler optimizations conscious of their potential
impact on miss events. In particular, across the optimization settings considered in this
paper, 47.3% of the total performance improvement comes from reduced miss event
cycle components for the out-of-order processor versus only 17.3% for the in-order
processor. Fourth, compiler optimizations can improve the amount of memory-level
parallelism by scheduling long-latency back-end loads closer to each other in the
binary. Independent long-latency loads that occur within ROB size instructions from each
other in the dynamic instruction stream overlap at run time, which results in
memory-level parallelism and thus improved performance. In fact, most of the L2 D-cache miss
cycle component reduction observed in our experiments comes from improved MLP,
not from reducing the number of L2 D-cache misses. We believe more research can be
conducted in exploring compiler optimizations that expose memory-level parallelism.
Acknowledgements
The authors would like to thank the reviewers for their insightful comments. Stijn
Eyerman and Lieven Eeckhout are supported by the Fund for Scientific Research in
Flanders (Belgium) (FWO-Vlaanderen). Additional support was provided by the European
HiPEAC Network of Excellence.
References
1. Eyerman, S., Eeckhout, L., Karkhanis, T., Smith, J.E.: A performance counter architecture
for computing accurate CPI components. In: ASPLOS. (2006) 175–184
2. Eyerman, S., Smith, J.E., Eeckhout, L.: Characterizing the branch misprediction penalty. In:
ISPASS. (2006) 48–58
3. Karkhanis, T.S., Smith, J.E.: A ﬁrst-order superscalar processor model. In: ISCA. (2004)
338–349
4. Taha, T.M., Wills, D.S.: An instruction throughput model of superscalar processors. In: RSP.
(2003) 156–163
5. Michaud, P., Seznec, A., Jourdan, S.: Exploring instruction-fetch bandwidth requirement in
wide-issue superscalar processors. In: PACT. (1999) 2–10
6. Riseman, E.M., Foster, C.C.: The inhibition of potential parallelism by conditional jumps.
IEEE Transactions on Computers C-21 (1972) 1405–1411
7. Wall, D.W.: Limits of instruction-level parallelism. In: ASPLOS. (1991) 176–188
8. Chou, Y., Fahs, B., Abraham, S.: Microarchitecture optimizations for exploiting memory-
level parallelism. In: ISCA. (2004) 76–87
9. KleinOsowski, A.J., Lilja, D.J.: MinneSPEC: A new SPEC benchmark workload for
simulation-based computer architecture research. Computer Architecture Letters 1 (2002)
10–13
10. Burger, D.C., Austin, T.M.: The SimpleScalar Tool Set. Computer Architecture News (1997)
See also http://www.simplescalar.com for more information.
11. Valluri, M.G., Govindarajan, R.: Evaluating register allocation and instruction scheduling
techniques in out-of-order issue processors. In: PACT. (1999) 78–83
12. Silvera, R., Wang, J., Gao, G.R., Govindarajan, R.: A register pressure sensitive instruction
scheduler for dynamic issue processors. In: PACT. (1997) 78–89
13. Pai, V.S., Adve, S.V.: Code transformations to improve memory parallelism. In: MICRO.
(1999) 147–155
14. Holler, A.M.: Optimization for a superscalar out-of-order machine. In: MICRO. (1996)
336–348
15. Cohn, R., Lowney, P.G.: Design and analysis of profile-based optimization in Compaq’s
compilation tools for Alpha. Journal of Instruction-Level Parallelism 3 (2000) 1–25
16. Vaswani, K., Thazhuthaveetil, M.J., Srikant, Y.N., Joseph, P.J.: Microarchitecture sensitive
empirical models for compiler optimizations. In: CGO. (2007) 131–143