MICROARCHITECTURE MODELING FOR
TIMING ANALYSIS OF EMBEDDED
SOFTWARE
LI XIANFENG
(B.Eng, Beijing Institute of Technology)
A THESIS SUBMITTED
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2005
ACKNOWLEDGEMENTS
I am deeply grateful to my supervisors, Dr. Abhik Roychoudhury and Dr. Tulika
Mitra. I sincerely thank them for introducing me to such an exciting research topic
and for their constant guidance on my research. I consider myself very fortunate to
be their first Ph.D. student, and because of this I had the privilege of receiving their
guidance almost exclusively in my junior graduate years (sometimes I feel guilty for
taking up so much of their time).
I have also benefited from Professors P.S. Thiagarajan, Samarjit Chakraborty and
Wong Weng Fai. They have given me many insightful comments and advice. Their
lectures and talks have not only been another source of knowledge and inspiration for
me, but also excellent examples of how to communicate scientific thoughts.
The weekly seminars of our embedded systems research group have been a unique
forum for us to exchange ideas. I have learnt a lot by either presenting my own work
or by listening to the talks given by our group members or visiting professors. I will
certainly miss it after I leave our group.
I would like to thank the National University of Singapore for funding me with
a research scholarship and for providing such an excellent environment and services.
My thanks also go to the administrative and support staff in the School of Computing,
NUS. Their support is more than I expected.
I thank my friends Dr. Zhu Yongxin, Chen Peng, Luo Ming, Shen Qinghua and
Daniel Högberg, with whom I play tennis and badminton. Doing sports has made
my life here more fun and less stressful. I will also miss my other friends and
lab mates Liang Yun, Pan Yu, Kathy Nguyen Dang, Wang Tao, Andrew Santosa,
Marciuca Gheorghita, Mihail Asavoae, Sufatrio Rio, Xie Lei and Wang Zhanqing. Our
discussions, gatherings and other social activities made my stay at NUS enjoyable.
I owe special thanks to my parents, my brother and sister for their love and
encouragement. To let me concentrate on my study, they even tried to
conceal from me a serious illness of my mother when she was suffering from it a
couple of years ago.
Most of all, this thesis would not have been possible without the enormous support
of Cailing, my wife. She has sacrificed a great deal ever since I decided to pursue my
Ph.D. study. As an indebted husband, I hope this thesis could be a gift to her, and I
take this chance to make a promise that I will never leave her struggling alone in the
future.
The work presented in this thesis was partially supported by the National University
of Singapore.
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
SUMMARY
LIST OF TABLES
LIST OF FIGURES
I    INTRODUCTION
     1.1  Real-time Embedded Systems
     1.2  Worst Case Execution Time Analysis
     1.3  Contributions
     1.4  Organization of the Thesis
II   OVERVIEW
     2.1  Background on Microarchitecture
          2.1.1  Pipelining
          2.1.2  Branch Prediction
          2.1.3  Instruction Caching
     2.2  A Processor Model
     2.3  Our Framework
          2.3.1  Program Path Analysis and WCET Calculation
          2.3.2  Microarchitecture Modeling
     2.4  Experimental Setup
III  RELATED WORK
     3.1  WCET Calculation
     3.2  Microarchitecture Modeling
     3.3  Program Path Analysis
IV   OUT-OF-ORDER PIPELINE ANALYSIS
     4.1  Background
          4.1.1  Out-of-Order Execution
          4.1.2  Timing Anomaly
          4.1.3  Overview of the Pipeline Modeling
     4.2  The Analysis
          4.2.1  Estimation for a Basic Block without Context
          4.2.2  Estimation for a Basic Block with Context
     4.3  Experimental Evaluation
     4.4  Summary
V    BRANCH PREDICTION ANALYSIS
     5.1  Modeling Branch Prediction
          5.1.1  The Technique
          5.1.2  An Example
          5.1.3  Retargetability
     5.2  Integration with Instruction Cache Analysis
          5.2.1  Instruction Cache Analysis
          5.2.2  Changes to Instruction Cache Analysis
     5.3  Experimental Evaluation
     5.4  Summary
VI   ANALYSIS OF PIPELINE, BRANCH PREDICTION AND INSTRUCTION CACHE
     6.1  Timing Estimation of a Basic Block in Presence of Branch Prediction
          6.1.1  Changes to Execution Graph
          6.1.2  Changes to Estimation Algorithm
          6.1.3  Handling Prediction of Other Branches
     6.2  Timing Estimation of a Basic Block in Presence of Instruction Caching
     6.3  Putting It All Together
     6.4  Experimental Evaluation
     6.5  Summary
VII  CONCLUSION
     7.1  Summary of the Thesis
     7.2  Future Work
APPENDIX A — PROOFS FOR THE PIPELINE ANALYSIS ALGORITHMS
SUMMARY
Worst Case Execution Times (WCET) of tasks are an essential input to the
schedulability analysis of hard real-time systems. Obtaining the WCET of a program
by exhaustive simulation over all sets of data input is often unaffordable. As an
alternative, static WCET analysis predicts the worst case without actually running
the program. One important yet difficult problem for static WCET analysis is to
model the hardware features which have a great impact on the execution time of
the program. In this thesis, we study the features that are commonly found in high
performance processors but have not been effectively modeled for WCET analysis.
First, we model out-of-order pipelines. This in general is difficult even for a basic
block (a sequence of instructions with single-entry and single-exit points) if some
of the instructions have variable latencies. This is because the WCET of a basic
block on out-of-order pipelines cannot be obtained by assuming maximum latencies
of the individual instructions; on the other hand, exhaustively enumerating pipeline
schedules could be very inefficient. In this thesis, we propose an innovative technique
which takes into account the timing behavior of all possible pipeline schedules but
avoids their exhaustive enumeration.
Next, we present a technique for modeling dynamic branch prediction. Dynamic
branch prediction is superior to static branch prediction in terms of accuracy,
but it is much harder to model. There are very few studies dealing with dynamic
branch prediction, and the existing techniques are limited to relatively simple
branch prediction schemes. Our technique can effectively model a variety of dynamic
prediction schemes, including the popular two-level branch prediction used in
current commercial processors. We also study the effect of speculative execution (via
branch prediction) on instruction caching and capture it by augmenting an existing
instruction cache analysis technique.
Finally, we integrate the analyses of different features into a single framework. The
features being modeled include an out-of-order pipeline, a dynamic branch predictor,
and an instruction cache. Modeling multiple features in combination has long been
acknowledged as a difficult problem due to their interactions. However, the combined
analysis in our work does not need significant changes to the modeling techniques for
the individual features and the analysis complexity remains modest.
LIST OF TABLES

2.1  The Benchmark Programs
4.1  Accuracy and Performance of Out-of-Order Pipeline Analysis
5.1  Modeling Gshare Branch Prediction Scheme for WCET Analysis
5.2  Configurations of Branch Prediction Schemes
5.3  Observed and Estimated WCET and Misprediction Counts of Gshare, GAg and Local Schemes
5.4  Combined Analysis of Branch Prediction and Instruction Caching
5.5  ILP Solving Times (in seconds) with Different BHT Sizes and BHR Bits
6.1  Combined Analysis of Out-of-Order Pipelining, Branch Prediction and Instruction Caching
LIST OF FIGURES

2.1  The Speedup of Pipelined Execution
2.2  Categorization of Branch Prediction Schemes
2.3  Illustration of Branch Prediction Schemes. The branch prediction table is shown as PHT, denoting Pattern History Table.
2.4  Two-bit Saturating Counter Predictor
2.5  The Organization of a Direct Mapped Cache
2.6  The Block Diagram of the Processor
2.7  The Organization of the Pipeline
2.8  The WCET Analysis Framework
2.9  A Control Flow Graph Example
3.1  An Example of Infeasible Paths (by Healy and Whalley)
4.1  Timing Anomaly due to Variable-Latency Instructions
4.2  A basic block and its execution graph. The solid edges represent dependencies and the dashed edges represent contention relations.
4.3  An Example Prologue
4.4  Overall and Pipeline Overestimations
5.1  Example of the Control Flow Graph
5.2  Additional edges in the Cache Conflict Graph due to Speculative Execution. The l-blocks are shown as rectangular boxes, and the ml-blocks among them are shaded.
5.3  Changes to Cache Conflict Graph (Shaded nodes are ml-blocks)
5.4  The Importance of Modeling Branch Prediction: Mispredictions in Observation and Estimation
5.5  Overall and Branch Prediction Overestimation
5.6  A Fragment of the Whetstone Benchmark
5.7  Change (in Percentage) of Cache Misses and Overall Penalties in Combined Modeling to Those in Individual Modelings
5.8  Est./Obs. WCET Ratio under Different Misprediction Penalties and Cache Miss Penalties
6.1  Execution Graph with Branch Prediction
6.2  Comparison of Overestimations of Pure Pipeline Analysis and Combined Analysis

I INTRODUCTION
1.1 Real-time Embedded Systems
Today a large portion of computing devices are serving as components of other systems
for the purpose of data processing, control or communication. These computing
devices are called embedded systems. The application domains of embedded systems
are diverse, ranging from mission-critical systems, such as aviation systems, power
plant monitoring systems, vehicle engine control systems, etc., to consumer electronics,
such as mobile phones, mp3 players, etc.
Many of the embedded systems are required to interact with the environment
in a timely fashion and they are called real-time systems. The correctness of such
systems depends not only on the computed results, but also on the time at which
the results are produced. Real-time systems can be further divided into two classes:
hard real-time systems and soft real-time systems. Hard real-time systems do not allow
any violation of their timing requirements. They are typically mission-critical systems
such as vehicle control systems, avionics, automated manufacturing and sophisticated
medical devices. With such systems, any failure to meet their deadlines may cause
disastrous loss. In contrast, soft real-time systems can tolerate occasional misses of
deadlines. For example, in voice communication systems or multimedia streaming
applications, the loss or delay of a few frames may be tolerable. In this thesis, we are
concerned with hard real-time systems.
1.2 Worst Case Execution Time Analysis
Typically, a hard real-time system is a collection of tasks running on a set of hardware
resources. Each task repeats periodically or sporadically and can be characterized by
a release time, a deadline, and a computation time. The schedulability analysis is
concerned with whether it is possible to find a schedule for the tasks such that they
all complete executions within their deadlines each time they are released (ready to
execute).
Clearly, to perform schedulability analysis, the computation time for each task
needs to be known a priori. Furthermore, to guarantee that the deadline is met
in any circumstance, the Worst Case Execution Time (WCET) should be used as
input instead of average case execution time. In reality, it may not be possible to
know an exact WCET of a task and a conservative estimate is used. Tight WCET
estimates are of primary importance for schedulability analysis as they reduce the
waste of hardware resources. In this thesis, we study efficient methods for WCET
estimations.
The Worst Case Execution Time to be studied in this thesis is defined as the
maximum possible execution time of a task running on a hardware platform without
being interrupted. Several points about this definition should be noted. First, a
simplified assumption is made that the task is executed uninterruptedly, while in a
hard real-time system the task may be interrupted, e.g., by a higher priority task.
The impact of interruptions on the execution of a task is another topic and it is
beyond our research scope in this thesis. Second, the WCET is hardware-specific as
the execution time of a task depends on the underlying hardware platform. Last, the
execution time of a task varies with different data input and the WCET should cover
all possible sets of data input.
In general, there are two approaches to determine the WCET of a task, or equiva-
lently, the WCET of a program (as we are now shifting from a multi-tasking context
of schedulability analysis to a single task context of WCET determination, we will
use the term program instead of task). The first approach is to obtain the WCET
by simulating or by actually running the program on the target hardware over all
sets of possible data input. However, simulation or execution can only examine one
set of data input each time. On the other hand, most non-trivial programs have a
tremendous number of sets of possible data input, rendering an exhaustive simula-
tion over all of them unaffordable. Another approach is to estimate the WCET by
static analysis, which studies the program, derives its timing properties, and makes
an estimation on the WCET without actually running the program. Static WCET
analysis is expected to have the following properties:
• Conservative. The analysis should not underestimate the actual WCET, other-
wise the system which is reported by the analysis as "safe" may actually fail. For
example, the task is assigned a computation time which is above the reported
WCET but lower than what is required for the actual worst case, resulting in
its deadline being missed in some circumstances.
• Tight. The analysis should be reasonably close to the actual WCET, other-
wise the task will be assigned an unnecessarily long computation time, i.e.,
a computation time no less than the estimated WCET. With the increase of the
computational requirement of each task, the chance of schedulability on the
target hardware platform decreases, and a more powerful and expensive hardware
platform may be needed.
• Efficient. The static analysis should be efficient in both time and space con-
sumption.
Note the first property is compulsory and the other two are desirable.
Since the execution time of a program is affected by two factors: (a) the data input
to the program, and (b) the hardware platform on which the program is running, their
effects need to be studied for WCET determination. The first factor mainly affects
the execution path of a program and the second factor affects instruction timing, i.e.,
how long an instruction executes. Correspondingly, static WCET analysis can be
divided into three sub-problems.
The first sub-problem is called program path analysis. It works on either the
source program or the compiled code and derives program flow information, such as
the feasible paths and infeasible paths that an execution can go through.
Later on, during the search of the worst case execution path, the identified infeasible
paths will be excluded from consideration. Therefore the more infeasible paths are
discovered, the more efficient and accurate the computation of the WCET.
The second sub-problem is called microarchitecture modeling. It is concerned
with instruction timing. Traditionally, the execution time of an instruction is ei-
ther a constant or easy to predict on processors with simple architectures. Modern
processors, however, employ aggressive microarchitectural features such as pipelin-
ing, caching and branch prediction to improve the performance of the applications
running on them. These features, which are designed to speed up the average-case
execution, pose difficulties for instruction timing prediction. Firstly, the execution
time of an instruction is no longer a constant, e.g., a cache miss may result in a much
longer execution time than a cache hit does. Furthermore, the variation of instruction
timing can be highly dynamic, e.g., without detailed execution history information, it
may be unclear whether a cache access is a hit or a miss. Microarchitecture modeling
studies the impact of the microarchitectural features on the executions of instructions.
It provides instruction timing information which later on will be used to evaluate the
costs of the execution paths during the search for the worst case execution path.
The third sub-problem is called WCET calculation. With the program path
information and instruction timing information, the costs of the program paths are
evaluated and the maximum one will be taken as the estimated WCET. In contrast
to the simulation approach, where program paths are evaluated individually, static
WCET analysis performs this task more efficiently by simultaneously considering a
set of paths which share some common properties. The correctness of the WCET
calculation (the estimated WCET is not an underestimation of the actual WCET)
relies on the earlier two sub-problems. First, no feasible path should be excluded by the
program path analysis, otherwise the estimated WCET would be an underestimation
in case the worst case execution path is among the excluded ones. Second, instruction
timing estimated by microarchitecture modeling should be conservative, such that
the cost of each program path will not be underestimated. On the other hand, the
tightness of the estimated WCET depends on the first two sub-problems as well: the
more infeasible paths are discovered, the fewer infeasible paths (which may have longer
execution times than the feasible paths) are to be considered; and the more accurate
the instruction timing, the tighter the estimation of the paths. There have been a few
WCET calculation methods, which are different in the way that program paths are
evaluated and the way instruction timing information is used. We will discuss them
in the related work.
1.3 Contributions
In this thesis, we study microarchitecture modeling for WCET analysis. Our goal is
to develop a framework for microarchitecture modeling which accurately estimates
the timing effects of the three most popular microarchitectural features: instruction
caching, branch prediction and pipelining (in-order/out-of-order). The framework
should have an extensible structure, such that the modeling of more features can be
conveniently incorporated. The contributions of this thesis can be summarized as
follows.
• We propose a technique for out-of-order pipeline modeling. In out-of-order
pipelines, an instruction can execute if its operands are ready and the corre-
sponding resource is available, irrespective of whether earlier instructions have
started execution or not. Since out-of-order execution improves a processor's per-
formance significantly by replacing pipeline stalls with useful computations, it
has become popular in high performance processors. The main challenge to
out-of-order pipeline modeling is that out-of-order pipelines exhibit a phenom-
enon called timing anomaly [50], where counterintuitive events may arise. For
example, a cache miss may result in shorter overall execution time of the pro-
gram than a cache hit does, which means that assuming a cache miss wherever
the actual cache access result is not available may not be conservative. Unfor-
tunately, existing techniques largely rely on these conservative assumptions to
make accuracy-performance trade-offs by only considering conservative cases.
In the presence of timing anomalies, such trade-offs are no longer safe. As a
result, all cases need to be examined. However, examining the possible cases
individually could be very inefficient. In this thesis, we address the timing
anomaly problem by proposing a novel technique which avoids enumerating the
individual cases. Our technique is a fixed-point analysis over time intervals,
where multiple cases of an event at a point are represented as an interval. This
way, these cases can be studied in one go, and at the same time the analysis
result obtained is still safe as long as the interval covers all cases.
• We develop a framework for the modeling of a variety of dynamic branch pre-
diction schemes. The presence of branch instructions introduces control de-
pendencies among different parts of the program. Control dependencies cause
pipeline stalls called control hazards [30]. Current generation processors per-
form control flow speculation through branch prediction, which predicts the
outcome of a branch instruction long before the actual outcome is available.
If the prediction is correct, then execution proceeds without any interruption.
Otherwise (known as misprediction), the speculatively executed instructions are
undone, incurring a branch misprediction penalty. If branch prediction is not
modeled, all the branches in the program have to be assumed mispredicted to
avoid underestimation. However, a majority of the branches can be correctly
predicted in reality, which means the estimated WCET will be very pessimistic
if branch prediction is not modeled. In this thesis, we propose a generic and
parameterizable framework by using Integer Linear Programming (ILP). Since
it is integrated with our ILP-based WCET calculation method, it can make
good use of program path information for a tight estimate. Our framework can
model the popular branch prediction schemes, including both global and local
ones [52, 74].
• We propose a framework for combined analyses of the three features: out-of-
order pipelining, branch prediction and instruction caching. The major issue
with the combined analyses of multiple features is the sharp increase of the
analysis complexity due to their interactions. By decomposing the timing ef-
fects of the various features into local timing effects (which affect nearby instruc-
tions) and global timing effects (which affect remote instructions), our combined
analyses are divided into two levels: local analyses and global analyses. By
doing so, we can keep the analysis at a reasonable complexity, yet we can still
achieve good accuracy.
We have implemented a publicly available prototype tool called "Chronos" for
evaluating the WCET techniques proposed in this thesis. It consists of an analysis
engine and a graphical front-end. The analysis engine contains 16 C source files and
11 header files, and it has 16,108 source lines in total. More details of this tool can
be found on the following web site.
http://www.comp.nus.edu.sg/~rpembed/chronos
1.4 Organization of the Thesis
The rest of the thesis is organized as follows. The next chapter presents an overview of
the approach taken in this thesis. Chapter 3 surveys the literature of WCET analysis.
Chapter 4 presents the out-of-order pipeline analysis. Branch prediction analysis is
discussed in Chapter 5, where its integration with an ILP-based instruction cache
analysis is also discussed. The combined analysis of the three features is presented in
Chapter 6. Finally, Chapter 7 gives a summary of what has been achieved in this
thesis and discusses future work.

II OVERVIEW
In this chapter, we provide an overview of the approach taken in this thesis. First,
we give some background information on the three microarchitectural features: out-
of-order pipelining, branch prediction, and instruction caching. Then we introduce a
concrete processor model used in this thesis. Next we present our overall approach
for WCET analysis. Finally, we introduce the experimental setup used throughout
this thesis.
2.1 Background on Microarchitecture
Microarchitecture is the term used to describe the resources and methods used to
realize the architecture specification of a processor. Modern processors employ aggres-
sive microarchitectural features such as pipelining, caching and branch prediction to
improve the performance of the applications running on them. The purpose of this
section is to give some background information on the three popular microarchitec-
tural features studied in this thesis.
2.1.1 Pipelining
The execution of an instruction naturally involves several tasks performed sequen-
tially, or in other words, the execution proceeds through several stages. Therefore,
instead of starting the execution of an instruction after the completion of an ear-
lier one, we can overlap the executions of multiple instructions, where each one is
in a particular execution stage at a time. This implementation technique is called
pipelining. Ideally, if an execution that takes T time units is divided into N pipeline
stages with equal latencies, an instruction can complete execution every T/N time
units, achieving a speedup of factor N. The speedup of pipelined execution is
illustrated in Figure 2.1. With a two-stage pipeline, the execution of four instructions
takes roughly half the time of the unpipelined execution. Modern processors have
much deeper pipelines and the improvement is more substantial.

Figure 2.1: The Speedup of Pipelined Execution ((a) Unpipelined Execution;
(b) Pipelined Execution)
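As a quick sanity check (a standard textbook derivation, not taken from the thesis),
the execution time of k instructions on an N-stage pipeline with stage latency T/N,
and the resulting speedup, are

\[
T_{pipe}(k) \;=\; (N + k - 1)\,\frac{T}{N},
\qquad
Speedup(k) \;=\; \frac{k\,T}{T_{pipe}(k)} \;=\; \frac{k\,N}{N + k - 1} \;\longrightarrow\; N \quad (k \to \infty).
\]

For the two-stage pipeline and four instructions of Figure 2.1, this gives a speedup of
8/5 = 1.6: five time slots instead of eight, roughly halving the execution time, and the
speedup approaches the ideal factor N for long instruction streams.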
However, the ideal speedup of pipelined execution is often not reached because
there are some events preventing the instructions from proceeding through the pipeline
smoothly. These events are called hazards in the literature [30]. There are three classes
of hazards.
• Structural hazards. Some of the resources needed by an instruction are currently
unavailable, e.g., occupied by another instruction.
• Data hazards. Some of the data operands on which an instruction depends are
currently unavailable, e.g., an operand to be provided by an earlier instruction
is still under computation.
• Control hazards. The next instruction to be executed is currently unknown,
e.g., due to branches or other control flow transfer instructions.
Because of these hazards, the execution time of an instruction or a sequence of
instructions is not straightforwardly predictable, resulting in difficulties for timing
analysis. This problem becomes more serious with aggressive pipelining mechanisms
such as out-of-order execution. On an out-of-order pipeline, instructions can proceed
through some of the pipeline stages out of their program order. This increase in complexity
makes the hazards harder to predict. For example, in an out-of-order pipeline, a
structural hazard happening to an instruction might be caused by either an earlier
instruction or a later instruction, while in an in-order pipeline, it can only be caused
by an earlier instruction.
2.1.2 Branch Prediction
The motivation for branch prediction is to address control hazards. When a condi-
tional branch is executed, it computes the address of the subsequent instruction to
be executed. There can be two possible outcomes: taken or not taken. If the branch
outcome is taken, the subsequent execution will be redirected to a target address indi-
cated by the branch instruction, otherwise it is not taken and the execution continues
sequentially. However, the branch outcome is often available only late in the
pipeline, which means the processor does not know what to do during the interval
from the start of the branch instruction to the production of its outcome.
If we do nothing with control hazards and let the processor idly wait for the branch
outcome (the waiting time is called a branch penalty), we will have a significant
performance loss. Hennessy and Patterson [30] have shown that for a program with
a 30% branch frequency and a branch penalty of three clock cycles, their processor
with branch stalls achieves only about half the ideal speedup with pipelining.
In light of this, various techniques have been proposed to reduce branch stalls.
One effort is to reduce the branch penalty by computing the branch outcome and
the branch target as early as possible in the pipeline. However, due to the nature
of the pipelined execution, the computation of the branch outcome often cannot be
done immediately after or very close to the start of the branch's execution, thus the
branch stall cannot be completely overcome. In fact, on current processors with deep
pipelines, the branch penalty can be over ten clock cycles.

Figure 2.2: Categorization of Branch Prediction Schemes (dynamic schemes include
GAg, gshare, gselect, ...)
Another method is to predict the branch outcome before it is available, such that
the processor can continue execution along the predicted direction instead of idly
waiting for the actual outcome. In case the prediction is correct, the branch penalty
is completely avoided, otherwise it is a misprediction and some recovery actions must
be taken to undo the effects of the wrong path instructions. The interval from the time
the wrong path instructions enter the pipeline to the time execution resumes
on the correct path is called a misprediction penalty. It is the delay compared to the
scenario of a correct prediction and is usually equal to or slightly higher than the
branch penalty.
A variety of branch prediction schemes have been proposed and they can be
broadly categorized as static and dynamic (see Figure 2.2; the most popular cate-
gory in each level is underlined). In a static scheme, a branch is predicted the same
direction every time it is encountered. Either the compiler can attach a prediction
to each branch, or the hardware can predict using simple heuristics, such as backward
branches are predicted taken and forward branches are predicted not taken. Static
schemes are simple to realize and easy to model. However, they do not make very
accurate predictions.

Figure 2.3: Illustration of Branch Prediction Schemes. The branch prediction table
is shown as PHT, denoting Pattern History Table.

Figure 2.4: Two-bit Saturating Counter Predictor (states 11 and 10 predict taken;
states 01 and 00 predict not taken)
Dynamic schemes predict the outcome of a branch according to the execution
history. The first dynamic technique proposed is called local branch prediction (il-
lustrated in Figure 2.3(c)), where the prediction of a branch is based on its last few
outcomes. It is called "local" because the prediction of a branch is only dependent
on its own history. This scheme uses a 2^n-entry branch prediction table to store past
branch outcomes, and this table is indexed by the n lower order bits of the branch
address. Obviously, two or more branches with the same lower order address bits
will map to the same table entry and they will affect each other’s predictions (con-
structively or destructively). This is known as the aliasing effect. In the simplest
case, each prediction table entry is one-bit and stores the last outcome of the branch
mapped to that entry.
In this thesis, for simplicity of exposition, we discuss our modeling only for the
one-bit scheme. When a branch is encountered, the corresponding table entry is
looked up and used for prediction; and the entry will be updated after the outcome
is resolved. In practice, two-bit saturating counters are often used for prediction, as
shown in Figure 2.4. Furthermore, the two-bit counter can be extended to an n-bit scheme
straightforwardly. We are aware that subsequent to our work, there is an effort by
Bate and Reutemann [4] on modeling an n-bit saturating counter (in each row of
the prediction table). However, their work has some restrictions, e.g., they assume
that there are no interferences in the BHT among different branches for bimodal
branch predictors, and they make another assumption that there are no conditional
constructs in loops when they model two-level branch predictors. Apparently, these
restrictions severely limit the applicability of their technique in practice.
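As a concrete illustration, the following C sketch (my own, not from the thesis; the
table size and names are illustrative assumptions) shows the lookup and update for
the one-bit scheme and for the two-bit saturating counter of Figure 2.4:

#include <stdint.h>

#define N_BITS 10                          /* prediction table has 2^n entries */
#define TABLE_SIZE (1u << N_BITS)

static uint8_t pht[TABLE_SIZE];            /* prediction (pattern history) table */

/* Indexed by the n lower-order bits of the branch address; branches that
   share these bits alias to the same entry and perturb each other. */
static unsigned pht_index(uint32_t branch_addr) {
    return branch_addr & (TABLE_SIZE - 1);
}

/* One-bit scheme: the entry simply stores the branch's last outcome. */
int predict_1bit(uint32_t addr)            { return pht[pht_index(addr)] & 1; }
void update_1bit(uint32_t addr, int taken) { pht[pht_index(addr)] = taken & 1; }

/* Two-bit saturating counter (Figure 2.4): states 11 and 10 predict taken,
   01 and 00 predict not taken; updates move one step and saturate. */
int predict_2bit(uint32_t addr) { return pht[pht_index(addr)] >= 2; }
void update_2bit(uint32_t addr, int taken) {
    uint8_t *c = &pht[pht_index(addr)];
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
}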
Local prediction schemes cannot exploit the fact that a branch outcome may be
dependent on the outcomes of other recent branches. The global branch prediction
schemes can take advantage of this situation [74]. Global schemes use a single shift
register called branch history register (BHR) to record the outcomes of the n most
recent branches. As in local schemes, there is a branch prediction table in which pre-
dictions are stored. The various global schemes differ from each other (and from local
schemes) in the way the prediction table is looked up when a branch is encountered.
Among the global schemes, three are quite popular and have been widely implemented
[52]. In the GAg scheme (refer to Figure 2.3(a)), the BHR is simply used as an index
to look up the prediction table. In the popular gshare scheme (refer to Figure 2.3(b)),
the BHR is XOR-ed with the last n bits of the branch address (the PC register in
Figure 2.3(b)) for prediction table look-up. Usually, gshare results in a more uniform
distribution of table indices compared to GAg. Finally, in the gselect (GAp) scheme
(not illustrated in Figure 2.3 but can be derived from the gshare scheme), the BHR
is concatenated with the last few bits of the branch address to look up the table.
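The three global schemes differ only in how the prediction table index is formed; a
small C sketch (mine; the BHR width is an assumed parameter) makes the differences
explicit:

#include <stdint.h>

#define H 8                         /* BHR records the outcomes of the H most recent branches */
static uint32_t bhr;                /* global branch history register */

static uint32_t low_bits(uint32_t x, int n) { return x & ((1u << n) - 1); }

/* GAg: the BHR alone indexes the prediction table */
uint32_t index_GAg(void) { return low_bits(bhr, H); }

/* gshare: BHR XOR-ed with the last H bits of the branch address (PC) */
uint32_t index_gshare(uint32_t pc) { return low_bits(bhr ^ pc, H); }

/* gselect (GAp): BHR concatenated with the last m bits of the branch address */
uint32_t index_gselect(uint32_t pc, int m) {
    return (low_bits(bhr, H) << m) | low_bits(pc, m);
}

/* after a branch resolves, shift its outcome into the history register */
void update_bhr(int taken) { bhr = low_bits((bhr << 1) | (taken & 1), H); }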
Note that even with accurate branch prediction, the processor needs the target
address of a taken branch instruction. Current processors employ a small branch tar-
get buffer to cache this information. We have not modeled this buffer in our analysis
technique; its effect can be easily modeled via techniques similar to instruction cache
analysis [43]. Furthermore, the effect of the branch target buffer on a program’s
WCET is small compared to the total branch misprediction penalty. This is because
the target address is available at the beginning of the pipeline whereas the branch
outcome is available near the end of the pipeline.
2.1.3 Instruction Caching
Caching in our context is a mechanism used to bridge the gap between a faster
processor and a relatively slower memory. A cache is a small, fast memory close
to the processor that accommodates the most recently accessed code or data in the
memory. If the data item needed by the processor is found in the cache, it is called a
cache hit, otherwise the processor has to get it from the main memory and it is called
a cache miss. The cost of a cache miss is called cache miss penalty. The caching
mechanism is effective thanks to the principle of locality, which says that programs
tend to reuse data and instructions they have used recently. It has been observed that
a program may spend 90% of its execution time on only 10% of the code. Thus, by
storing the recently accessed data in the cache, we will have a high chance of visiting
them again from the cache in the future.
Program instructions and data can be cached either in a single storage, called von
Neumann architecture, or in two separate storages, called Harvard architecture.

Figure 2.5: The Organization of a Direct Mapped Cache

For embedded systems, Harvard architecture is more widely used. This makes it
possible to study instruction caching and data caching separately. In this thesis we
only study instruction caching.
Now we look at the organization of a cache with a simplified view. A cache is
organized in fixed-size blocks, each of which accommodates consecutive data items
located in the memory (called memory blocks). Depending on where a memory block
can be placed in the cache, there are three organization categories.
• If a memory block has only one place to go in the cache, the cache is called
direct mapped.
• If a memory block can be placed anywhere in the cache, the cache is called fully
associative.
• If a memory block can be placed in a restricted set of places in the cache, the
cache is called set associative.
Direct mapped cache and fully associative cache can be viewed as two special cases
of set associative caches. In this thesis, for simplicity of exposition, we will take direct
mapped cache as an example, but our work can be extended to set associative caches.
Figure 2.5 gives a simplified view of the organization of a direct mapped cache. A
direct mapped cache is divided into multiple cache lines. Each cache line has three
portions: a data portion which contains the memory block; a tag portion which is
used to differentiate multiple possible memory blocks mapped to the same cache line;
and a valid bit to indicate whether the cache line contains any valid data. When the
processor accesses a data item, it dispatches the address of the data item to the cache.
The address is divided into three fields as shown in Figure 2.5: The index field is used
to determine which cache line to access; the tag field is used to decide whether the
cache line contains the desired data (true if the tag field matches the tag portion of
the corresponding cache line); and the block offset field is used to select the desired
data item from the corresponding cache line. In case the memory block is not in the
cache, access is directed to the main memory, and the memory block fetched from
the main memory will displace the current one from the corresponding cache line.
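The address decomposition can be written down directly; the following C sketch
(mine; the block size and line count are arbitrary powers of two, not a configuration
from the thesis) shows the hit test for a direct mapped cache:

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_BYTES 16                    /* => 4-bit block offset field */
#define NUM_LINES   64                    /* => 6-bit index field */

struct cache_line {
    bool     valid;                       /* valid bit */
    uint32_t tag;                         /* tag portion */
    uint8_t  data[BLOCK_BYTES];           /* the cached memory block */
};

static struct cache_line line[NUM_LINES];

bool lookup(uint32_t addr) {
    uint32_t block = addr / BLOCK_BYTES;  /* strip the block offset field */
    uint32_t index = block % NUM_LINES;   /* index field selects the cache line */
    uint32_t tag   = block / NUM_LINES;   /* tag field disambiguates blocks */
    if (line[index].valid && line[index].tag == tag)
        return true;                      /* cache hit */
    /* cache miss: the block fetched from main memory displaces the
       current occupant of this line */
    line[index].valid = true;
    line[index].tag   = tag;
    return false;
}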
2.2 A Processor Model
In this section we present the processor model used in this thesis. It is a simplified
version of the SimpleScalar sim-outorder processor model [6], which is in turn based
on [68]. The processor consists of three components: an out-of-order pipeline, a
branch predictor and an instruction cache. The block diagram of the processor and
the interactions among the three components are shown in Figure 2.6.
The pipeline consists of five stages. The interaction between the pipeline and
the instruction cache takes place at the instruction fetch stage (IF on the diagram),
where the pipeline dispatches an instruction address to the instruction cache and
the instruction is sent to the pipeline upon a hit, otherwise the instruction will be
fetched from the main memory and the instruction cache is updated accordingly.
The interaction between the pipeline and the branch predictor takes place at two
stages. In the IF stage, the pipeline consults the branch predictor for the subsequent
instruction to be executed. In the EX stage, where computed results are available, the
branch predictor is updated with the branch outcome if the instruction is a conditional
branch.

Figure 2.6: The Block Diagram of the Processor

The interaction between the branch predictor and the instruction cache is
indirect (via the pipeline). The content of the instruction cache can be changed by
the branch prediction in the following way: If the branch prediction is incorrect,
the pipeline will execute instructions on the wrong path, which might bring some
instructions into the instruction cache and displace some existing instructions. The
instruction cache does not change the state of the branch predictor because the state
of the branch predictor is only updated by the branch outcomes of the program,
which is independent of the behaviors of both the pipeline and the instruction cache.
Next, we give the organization of the pipeline and explain in more detail how an
instruction is executed by this processor.
The pipeline is shown in Figure 2.7. It consists of the following components: an
instruction buffer (I-buffer), which accommodates instructions that have been fetched
from the instruction cache or main memory, but yet to be decoded and executed;
a circular reorder buffer (ROB), which accommodates instructions that have been
decoded, but have not completed execution; several functional units, which carry out
the computations; and register files, including an integer register file and a
floating-point register file.

Figure 2.7: The Organization of the Pipeline
An instruction proceeds through the five-stage pipeline as follows.
1. Instruction Fetch (IF). In this stage, the instruction specified by the pro-
gram counter is fetched from the instruction cache or memory into the I-buffer.
There are several rules dictating the behavior of the IF stage. Instructions en-
ter and leave the I-buffer in program order. If the I-buffer is full, the processor
stops fetching more instructions until the earliest instruction leaves the I-buffer.
2. Instruction Decode & Dispatch (ID). In this stage, the earliest instruction
in the I-buffer is removed from the I-buffer, decoded, and dispatched into the
ROB. The instruction is stored there until it commits (see CM stage). The
instruction decode cannot proceed if the ROB is full or the I-buffer is empty.
3. Instruction Execute (EX). In this stage, an instruction in the ROB is is-
sued to its corresponding functional unit for execution when all its operands
are ready and the functional unit is available. If more than one instruction cor-
responding to a functional unit is ready for execution, the earliest instruction
has the highest priority. We assume that the functional units are not pipelined,
that is, an instruction can be issued to a functional unit F only after the previ-
ous instruction occupying F has completed execution. We also assume that the
number of instructions issued in a clock cycle is only bounded by the number
of functional units. When an arithmetic instruction completes this stage, it
forwards the computed result to awaiting instructions, if any, in the ROB; if
all the operands of an awaiting instruction become ready, the instruction will
be among the candidates scheduled for execution in the next cycle. The EX
stage exhibits true out-of-order behavior as an instruction can start execution
irrespective of whether earlier instructions have started execution or not.
4. Write Back (WB). In this stage, load instructions dispatch the addresses
computed in the EX stage to the memory system and fetch the data from the
memory. Since we do not model data caching, we assume it takes a single
cycle to fetch the data, and we also assume there is no resource limit in this
stage. Thus, every instruction proceeds through this stage in one clock cycle.
Like arithmetic instructions in the EX stage, load instructions forward data to
awaiting instructions, if any, in the ROB; if all the operands of an awaiting in-
struction become ready, the instruction will be among the candidates scheduled
for execution in the next cycle.
5. Commit (CM). This is the last stage where the earliest instruction which has
completed the WB stage writes its output to the register files and frees its ROB
entry. Note that the instructions commit in program order. Therefore, even if
an instruction has completed its WB stage, it still has to wait for the earlier
instructions to commit before it can commit in a later clock cycle.

Figure 2.8: The WCET Analysis Framework
In summary, in this processor model, EX and WB are the pipeline stages where
instructions can proceed out-of-order, but resource contentions (contention for func-
tional units) only happen in the EX stage.
2.3 Our Framework
In this section, we provide an overview of our approach for WCET analysis and mi-
croarchitecture modeling. As mentioned in Section 1.2, there are three sub-problems
for WCET analysis: program path analysis, microarchitecture modeling, and WCET
calculation. Our approach to performing these sub-problems and handling their in-
teractions is illustrated in Figure 2.8. We divide the analyses into two levels: local
analyses and global analyses, depending on whether global program flow information
is needed or not in the respective analysis.
2.3.1 Program Path Analysis and WCET Calculation
The purpose of program path analysis is to identify feasible paths which later on will
be used by WCET calculation. There has been extensive research work in this
direction. This thesis does not propose new techniques for program path analysis,
and the existing program path analysis techniques may be adopted here. The rest of
this section mainly addresses WCET calculation.

Figure 2.9: A Control Flow Graph Example
WCET calculation evaluates the costs of the program paths and takes the maxi-
mum one as the WCET. In contrast to simulation, where each program path is eval-
uated separately (the major drawback of the simulation approach), WCET analysis
evaluates multiple program paths simultaneously. The key problem is how the pro-
gram paths are grouped for evaluation. There has been an approach proposed by Li
and Malik [40] which uses Integer Linear Programming (ILP) to represent the pro-
gram paths. We adopt their approach for WCET calculation in our work. The idea
is as follows.
We work on the compiled code of the program. We first construct the Control Flow
Graph (CFG) [1] for the program. The vertices of the graph are basic blocks, each
of which is a sequence of instructions where flow of control can only enter from the
beginning of the basic block and leave from the end. The basic blocks are connected
by directed edges. There is an edge from block B1 to block B2 if and only if B2 can
follow the execution of B1 in some execution sequence. The diagram on the left hand
side of Figure 2.9 gives a simple CFG example.
Suppose the costs (execution times) of the basic blocks are known, then the ex-
ecution time of a path can be calculated by first collecting the execution counts of
the basic blocks on the path, then summing up the terms of the execution counts
weighted by their costs. More formally, given a path $P$, its execution time $T_P$ can
be computed as

\[ T_P \;=\; \sum_{i=1}^{N} cost_i \cdot v_i \]

where $cost_i$ and $v_i$ are the cost and the execution count of block $B_i$ respectively. If
$P$ does not contain block $B_i$, $v_i$ is set to zero.
As mentioned earlier, static analysis evaluates a set of paths (or a segment of a
set of paths) at a time. The ILP approach achieves this by exploiting the fact that if
two paths P1 and P2 have the same execution counts for each of their corresponding
basic blocks, that is to say, they only differ in the execution order of the basic blocks,
then their execution time will be the same (under the assumption that the costs of
each basic block in the two paths are identical). From another point of view, the ILP
assigns feasible execution counts to the basic blocks and gives them an evaluation.
This assignment actually represents a collection of paths with the same execution
time, hence they need to be evaluated only once by the ILP solver. The right hand
side in Figure 2.9 gives a concrete example. Suppose the loop from A to D iterates
four times. Since there is an "if-then-else" branch inside the loop, in each iteration the
control flow may go through either B or C, thus there can be 16 paths of the program
in total. By assigning one to the execution count of B (vB = 1) and three to the
execution count of C (vC = 3), there can be four paths satisfying this situation and
having the same execution time. These paths are listed in the upper half on the
right hand side. Similarly, with vB = 2 and vC = 2, there are six paths that can be
evaluated together (listed in the lower half on the right hand side).
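In general, the number of paths realizing a given count assignment here is a binomial
coefficient, which reproduces the counts above:

\[
\#\{\text{paths with } v_B = k,\ v_C = 4-k\} \;=\; \binom{4}{k},
\qquad
\binom{4}{1} = 4, \quad \binom{4}{2} = 6, \quad \sum_{k=0}^{4}\binom{4}{k} = 2^4 = 16.
\]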
Above we have discussed in an intuitive way how program paths are grouped
by ILP for evaluation. Actually, an ILP solver can do an even better job by exploiting
relationships between different groups of paths (sets of execution counts). For details,
please refer to [66, 70]. Formally, the WCET of the program with $N$ basic blocks can
be computed as

\[ Time \;=\; \max \sum_{i=1}^{N} cost_i \cdot v_i \tag{2.1} \]
We call Equation 2.1 the objective function. The ILP solver maximizes $Time$
by trying to assign different execution counts to the $v_i$. Obviously, there must be some
constraints on the execution counts that can be assigned. A ready set of constraints
comes from the structure of the control flow graph:

\[ v_i \;=\; \sum_{j} e_{j \to i} \;=\; \sum_{k} e_{i \to k} \tag{2.2} \]

where $e_{i \to j}$ is the count of control flow transfers from block $B_i$ to block $B_j$. Equation 2.2
captures the fact that the execution count of a basic block is equal to the sum of its
incoming control flow as well as the sum of its outgoing control flow. Furthermore, for
the start and end blocks, which execute exactly once, we have

\[ v_{start} = 1, \qquad v_{end} = 1. \tag{2.3} \]
The flow constraints by themselves are not enough. For instance, a program
typically has loops whose iterations must be bounded, but the above constraints by no
means give such bounds. The loop bounds can either be derived by the program
path analysis or be provided manually. For example, if we found that the loop in
Figure 2.9 can iterate no more than four times, we add a bound vA ≤ 4 to the existing
constraints. Besides the compulsory loop bounds, some more flow facts discovered
by the program path analysis can be transformed to constraints to further bound the
possible execution count assignment. For example, suppose costB is larger than costC
in Figure 2.9, if the program path analysis finds out that B can only execute a limited
number of times (less than the loop iterations) and this fact is transformed into an
extra constraint, then the ILP solver will not be able to assign vB a loop iteration
count which leads to an unnecessarily overestimated WCET.
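Putting the pieces together for the example of Figure 2.9 (assuming the CFG structure
described above, with A branching to B or C, both merging at D, and D either looping
back to A or exiting; the block costs are left symbolic), the complete ILP would read:

\[
\begin{aligned}
\text{maximize}\quad & cost_A\,v_A + cost_B\,v_B + cost_C\,v_C + cost_D\,v_D\\
\text{subject to}\quad
& v_A = e_{start \to A} + e_{D \to A} = e_{A \to B} + e_{A \to C}\\
& v_B = e_{A \to B} = e_{B \to D}, \qquad v_C = e_{A \to C} = e_{C \to D}\\
& v_D = e_{B \to D} + e_{C \to D} = e_{D \to A} + e_{D \to exit}\\
& e_{start \to A} = e_{D \to exit} = 1, \qquad v_A \le 4
\end{aligned}
\]

If $cost_B > cost_C$, the solver will simply assign $v_B = 4$ and $v_C = 0$ unless an extra
constraint from program path analysis (such as a bound on $v_B$) rules that out.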
WCET calculation works on the scope of the global program, thus it belongs to
the global analyses in our framework in Figure 2.8.
It is worth noting that when the microarchitecture is modeled, the cost of a basic block
varies under different execution scenarios. In that case, we will identify the timing
events that affect the cost and refine the execution of a basic block into a few scenar-
ios, each of which may have a distinct cost and its occurrences will be bounded by
microarchitecture modeling. The objective function will be changed accordingly.
2.3.2 Microarchitecture Modeling
Some of the timing effects of the microarchitecture are mainly exercised in a local scope,
and their analyses need little program flow information. Pipelining is a typical
example, where adjacent instructions affect each other, but remote instructions, such
as those that have completed execution, do not affect instructions currently in the
pipeline. As a result, pipeline analysis is performed at the level of basic blocks with
very limited program flow information taken into account (e.g., a short sequence of
instructions preceding or succeeding the analyzed basic block).
For instruction caching and branch prediction, it is well known that they exhibit
global timing effects in the sense that an earlier cache access or branch instruction
can update the state of the instruction cache or the branch predictor, which will
affect future cache accesses or branch predictions. How long the effect is exercised
is highly dynamic. For example, a cache access to an instruction I may displace
another instruction I′ from the cache; when I′ will be visited again depends on the
program path taken from I to I′. We call the analyses for the global effects global
analyses ("Global IC Analysis" and "Global BP Analysis" in Figure 2.8). To receive
reasonably accurate results, global program flow information needs to be taken into
account for global instruction cache analysis and global branch prediction analysis.
On the other hand, instruction caching and branch prediction have local effects
– mainly on the pipeline. For example, a cache miss results in a longer latency of
the corresponding pipeline IF stage, and a branch misprediction results in the flush
of the pipeline. We call the analyses for the local effects local analyses ("Local IC
Analysis" and "Local BP Analysis" 1 in Figure 2.8).
Local analyses. Since the pipeline is the place where instructions are executed and the
execution time is accounted for, the pipeline analysis is taken as the core of the local
level analyses, while the local analyses of the other two features, instruction caching and
branch prediction, are incorporated into the pipeline analysis with their effects on the
corresponding pipeline stages being captured (indicated by the arrows from "Local
IC Analysis" and "Local BP Analysis" to "Pipeline Analysis" in Figure 2.8).
Global analyses. The global instruction cache analysis and the global branch pre-
diction analysis are concerned with the occurrences of the timing effects, e.g., cache
misses and branch mispredictions. Li et al. [41, 43] have proposed an ILP-based in-
struction cache analysis which can be conveniently integrated with their ILP-based
WCET calculation. In our global branch prediction analysis, to better exploit the
program flow information, we also use ILP to model the global behavior of branch
prediction (The technical details appear in Chapter 5). Recall in Section 2.2, we have
mentioned that the state of the instruction cache can be affected by the behavior of
the branch prediction. Now we revisit this issue with a perspective of global/local
effects. Clearly, a misprediction, which may affect the cache state, has no impact
on how a cache miss or hit affects the pipeline; rather, by changing the cache state,
it affects whether a future cache access is a hit or a miss. Therefore, an arrow is
drawn from local branch prediction analysis to global instruction cache analysis. In
Chapter 5, we will augment the instruction cache analysis by Li et al. to capture the
branch prediction effect.

1 Note that local branch prediction analysis is not the analysis for local branch prediction schemes.
Now we show the changes to WCET calculation with microarchitecture modeling
enabled. The execution time of a basic block varies with the timing events (cache
misses, branch mispredictions) that may happen during its execution, while the
occurrences of these timing events are bounded by the global analyses. The objective
function of the WCET calculation is therefore refined to distinguish execution
scenarios:

    \text{maximize} \; \sum_{i=1}^{N} \sum_{sc \in SC_i} cost_i^{sc} \cdot v_i^{sc}    (2.4)

where sc is an execution scenario of block B_i; e.g., it may carry relevant cache state
and branch prediction information. The possible execution scenarios of B_i are
captured by the set SC_i. For two different scenarios sc and sc′ of the same B_i,
cost_i^{sc} and cost_i^{sc′} are expected to be different. The occurrences of each sc
are bounded by the global analyses, such that for an sc which results in a higher
cost_i^{sc} than the other scenarios, the corresponding v_i^{sc} will not be assigned
an impossibly high count. Note the scenario mentioned here is generic – we will see
concrete scenarios in the respective microarchitecture modeling chapters.
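As an illustration (our own hypothetical instantiation; the concrete scenario sets are
defined in the respective modeling chapters, and the numbers here are invented),
consider a block B_i whose only timing event is an instruction cache hit or miss. The
scenario set, costs and constraints could take the form

    % Hypothetical scenario split: a hit costs 6 cycles, a miss 16 cycles.
    SC_i = \{\mathit{hit}, \mathit{miss}\}, \qquad cost_i^{hit} = 6, \quad cost_i^{miss} = 16
    % The scenario counts partition the block's execution count, and the
    % global cache analysis bounds the miss count by some m_i:
    v_i^{hit} + v_i^{miss} = v_i, \qquad v_i^{miss} \le m_i

so the ILP solver cannot charge the miss cost to more executions of B_i than the
global analysis permits.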
In summary, we decompose microarchitecture modeling into two levels: local level
and global level. The local level analyses are concerned with the timing of the analy-
sis units (e.g., basic blocks) by modeling local timing effects, and pipeline analysis is
the core at this level. The global level analyses are concerned with the occurrences
of timing events, and they work on the scale of the whole program. By decomposing
microarchitecture modeling into two levels, the analyses can be performed with
reasonable complexity, and the modeling can be conveniently extended when more
features are to be modeled.
Program Description Bytes #P #BB #BR #LP S
adpcm Adaptive pulse code modulation 7296 16 140 43 18 N
compress Data compression program 3424 8 99 49 7 N
dhry Dhrystone benchmark 3144 16 98 34 8 Y
fdct Fast Discrete Cosine Transform 2800 1 10 3 3 Y
fft 1024-point Fast Fourier Transformation 2216 1 27 8 5 Y
fir FIR filter with Gaussian function 3824 7 69 13 8 N
ludcmp LU decomposition algorithm 4728 2 60 17 11 N
matmul Multiplication of two 10x10 matrices 272 1 7 3 3 Y
matsum Summation of two 100x100 matrices 232 1 5 2 2 Y
minver Inversion of a floating point matrix 6144 3 102 31 17 N
qurt Root computation of quadratic equations 1928 3 32 8 1 N
whet Whetstone benchmark 2520 4 36 18 8 Y
Table 2.1: The Benchmark Programs
2.4 Experimental Setup
We will conduct experiments to evaluate our out-of-order pipeline analysis, branch
prediction analysis and the combined analysis of the three features. The experiments
share some commonalities such as the benchmarks used, the methodology, and the
experimental environment.
Benchmarks Table 2.1 lists the benchmark programs used for experiments. These
programs have been used by other researchers for WCET analysis. Among them,
dhry, fdct, fft, matsum, matmul, and whet were used by Li et al. [43]; the others are
from the real-time research group at Seoul National University [64] and the Real-Time
Research Center at Mälardalen University [51].
In Table 2.1, column "Bytes" gives the size of the object code for each benchmark
program. Here we do not count library code or other segments that are not included
in our WCET analysis (data segments, stack, symbol table, etc). Column "#P" gives
the number of procedures in each benchmark. Column "#BB" gives the total number
of basic blocks in each program. Column "#BR" gives the number of conditional
branches. Column "#LP" gives the number of loops. Finally, column "S" indicates
whether the program has a single execution path or multiple execution paths.
By comparing columns "#BR" and "#LP", we can see that fdct, fft, matsum, and
matmul are loop-intensive programs, while the rest are control-oriented programs. As
for program size, since we will use an instruction cache of 1K bytes in our experiments,
a few programs (matmul and matsum) can be completely accommodated by the cache.
The other programs have sizes ranging from two to seven times the cache size; thus
they will suffer from conflict misses, and a good cache analysis is needed.
For the single-path programs (dhry, fdct, fft, matmul, matsum, and whet), whose
branch conditions do not depend on input data, we can determine their actual worst
case execution times precisely by simulation, provided the execution latencies of the
instructions are deterministic. For the multiple-path programs, or programs with
variable-latency instructions, simulation usually provides only a lower bound on the
actual worst case.
Methodology To evaluate the accuracy of our analysis, the estimated result should
be compared against a reference value. Ideally, this should be the actual worst case.
However, as explained earlier, it is often impossible to know the actual worst case.
As an alternative, we approximate the actual worst case by a non-exhaustive
simulation over sets of input data which are likely to produce the worst case. We call
the result obtained this way the observed worst case. Correspondingly, the result
produced by our analysis is called the estimated worst case. The relationship of the
three values is: observed WCET ≤ actual WCET ≤ estimated WCET. Finding a set
of input data that yields a good observed worst case is not easy, especially when
timing effects introduced by microarchitectural features come into play. What we do
is to inspect the important parts of a program (with the timing effects in mind), e.g.,
the inner loops, to understand how the executions of these parts are affected by the
input data; then we feed the program with a set of input data which is likely to
maximize their execution.
For both the simulation and the estimation, we use SimpleScalar [6], a popular
architectural simulation toolset, for a variety of tasks. Our experiments start with the
source program. The first step is to compile the source program into object code
using the GCC compiler provided by SimpleScalar. This GCC version generates code
for an instruction set architecture (ISA) which is a superset of the MIPS ISA [61].
Then, we simulate the object code on one of the SimpleScalar simulators with
the selected input data. Which simulator is used depends on which microarchitecture
features are being modeled. In addition, to match our processor configuration, the
simulators are tailored and their parameters are set correspondingly.
Next, we conduct the analysis with a prototype analyzer written by us. It reads
the object code, constructs control flow graphs (CFG), performs local and global mi-
croarchitecture modeling, and formulates an ILP problem by producing an objective
function, a set of flow constraints as well as constraints from microarchitecture mod-
eling. In addition, flow information collected by program path analysis or by user
observation is transformed into an extra set of flow constraints, e.g., loop bounds,
which are called functional constraints.
Finally, the ILP problem is submitted to an ILP solver and the objective function
is maximized by the solver. The produced result will be the estimated worst case. In
our experiments, we use CPLEX [15], a commercial ILP solver for this task.
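To make the formulation step concrete, the following is a minimal sketch (the costs,
variable names, and constraints are invented for illustration and are not taken from
our analyzer) of emitting such an ILP in the standard LP file format that solvers
such as CPLEX read:

    # Minimal sketch: writing a WCET ILP in the LP file format accepted by
    # solvers such as CPLEX. All names and numbers below are illustrative.
    def write_lp(path, costs, constraints, bounds):
        """costs: {variable: objective coefficient}; constraints/bounds:
        lists of LP-format strings."""
        with open(path, "w") as f:
            f.write("Maximize\n obj: ")
            f.write(" + ".join(f"{c} {v}" for v, c in costs.items()) + "\n")
            f.write("Subject To\n")
            for i, c in enumerate(constraints):
                f.write(f" c{i}: {c}\n")
            f.write("Bounds\n")
            for b in bounds:
                f.write(f" {b}\n")
            f.write("General\n")  # execution counts are integer variables
            f.write("".join(f" {v}\n" for v in costs))
            f.write("End\n")

    # Hypothetical two-block loop body: both blocks execute equally often,
    # and a loop bound (a functional constraint) caps the counts at 100.
    write_lp("wcet.lp",
             costs={"v1": 4, "v2": 7},
             constraints=["v1 - v2 = 0"],
             bounds=["v1 <= 100", "v2 <= 100"])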
Environment We run all the experiments on a 1.3 GHz Pentium IV machine with
1-GB main memory. The operating system is Linux-2.4.18. The parameters of the





The importance of Worst Case Execution Time (WCET) analysis has been recognized
by the real-time community and substantial progress has been made over the past fif-
teen years. Earlier work includes Kligerman and Stoyenko's Real-time Euclid [35],
Shaw and Park's timing schema [67, 59], and Puschner and Koza's study [62] on the
calculation of WCET and its decidability. Early WCET analysis targeted simple
hardware, on which the timing of an instruction is constant, so no microarchitecture
modeling was needed; and if the compiler performs little optimization, working on
the source program is sufficient. However, with the advent of modern processors
employing aggressive performance enhancement features, forgoing microarchitecture
modeling is no longer feasible, and WCET calculation is usually carried out on the
compiled code.
In the rest of this section, we review the literature on three topics: program path
analysis, microarchitecture modeling, and WCET calculation. They correspond to
the three sub-problems of WCET analysis introduced in Chapter 1. Since WCET
calculation is directly connected to the aim of the analysis (the WCET of the
program), and the other two sub-problems are performed to enable and improve
WCET calculation, our review first covers WCET calculation methods and then the
other two topics.
3.1 WCET Calculation
There are primarily three WCET calculation methods: timing schema, path-based
calculation, and implicit path enumeration (IPET).

[Figure 3.1: An Example of Infeasible Paths (by Healy and Whalley). The code
listing is not reproduced here; Lines 4 and 7 of the example contain the mutually
exclusive statements discussed below.]
Timing Schema. Shaw and Park [67, 59, 58] proposed a tree-based approach called
timing schema. It determines the execution times of program constructs with a
bottom-up traversal of the syntax tree. Once the times of lower level constructs
have been obtained, the time of the higher level construct containing them can be
estimated. The advantage of this approach is that it is very efficient. However,
the local estimation in timing schema cannot account for infeasible paths which are
defined by constraints across multiple constructs. Consider the example in Figure 3.1
(which is from [27]). Clearly, the statements on Lines 4 and 7 are mutually exclusive,
and any path crossing both statements in the same iteration is infeasible. Timing
schema estimates the costs of the two if statements separately, assuming that both
statements on Lines 4 and 7 execute. Thus the estimated worst case for this example
arises from an infeasible path.
Timing schema has been adopted and extended by some other researchers [10, 11,
31, 44, 45, 46]. Lim et al. [44, 45] and Hur et al. [31] have used it for WCET analysis
on RISC processors. In their work, they used new data structures and replaced some
of the operations in the original timing schema with operations which work on these
data structures. Their revised timing schema can better account for timing effects of
the pipelines and the caches. Colin and Puaut [10, 11] have recognized the importance
of loop nestings for tight WCET estimates. In their work, a construct is estimated
under the context of its different loop nestings. Because of this, the result for a
construct i is a set of tuples ⟨wcet_i, ln_level_j⟩ instead of a single wcet_i, where
ln_level_j is a loop level in which the construct can be located. They have developed
a static analysis tool named Heptane.1
Path-based Calculation. To better exploit the correlations between different program
parts, some researchers work on program paths for WCET calculation [3, 24, 27, 28,
49, 69]. Arnold et al. [3] and Healy et al. [24, 27, 28] search for the longest loop paths2
in each loop-nesting level. Infeasible loop paths found by program path analysis are
disregarded (e.g., paths going through both Lines 4 and 7 of the program in Figure
3.1). Furthermore, the longest loop path may only execute in a limited number/range
of iterations. In that case, the search continues to find the next longest path as well
as the iterations in which it can execute. This process terminates when all iterations
are exhausted. Then, the cost of the loop can be calculated by summing up the
contributions of the longest paths, weighted by their costs. This path-based
calculation traverses the program hierarchically, such that when the cost of an outer
loop is being calculated, the costs of its inner loops are available for use.
Stappert et al. [69] developed another path-based WCET calculation method.
They construct a scope graph – a hierarchical representation of the program. The
longest paths are searched for within the scopes. To simplify the work, each scope may
be expanded into several virtual scopes, in each of which the iterations are covered by
the same set of flow facts (flow information derived from program path analysis).
They then search for the longest path in each virtual scope. If the longest path is an infeasible
one, it is discarded and the search continues. Unlike the first path-based calculation,
1 http://www.irisa.fr/aces/work/heptane-demo/heptane.html
2 A loop path is a control-flow connected sequence of blocks in a loop which starts with the loop
header and terminates at a block with a transition either to the loop head or out of the loop.
each virtual scope has a unique longest path applicable to all iterations of the virtual
scope. Thus the cost of a virtual scope is simply the cost of its longest path times the
number of scope iterations. The WCET of the program can be calculated via a bottom-up
traversal of the scope graph.
Lundqvist and Stenström [49] used a cycle-level symbolic simulation technique for
WCET calculation. Symbolic simulation needs to handle two problems: unknown
data values in data-manipulating instructions and unknown conditions in conditional
branches. In the latter case, both paths of the branch need to be simulated. Since the
number of feasible paths across the entire program can be substantial, they apply
path merging to reduce the paths maintained for simulation; the merging is typically
carried out at the beginning of each loop iteration. Path merging must guarantee
that the execution following the merged path will not take less time than an
execution following any of the pre-merged paths. Symbolic simulation
can exclude some infeasible paths. For example, if the branch condition evaluated is
known, the false path will not be simulated.
IPET. Li and Malik [40] proposed a technique which considers all paths implicitly
by using integer linear programming (ILP). Suppose the cost of each basic block
B_i, denoted cost_i, is known, and let its execution count be denoted v_i; then the
execution time of a complete program with N basic blocks can be expressed as
$\sum_{i=1}^{N} cost_i \cdot v_i$. The rest of the task is to maximize the value of this
function over all valid combinations of the execution counts. The values the execution
count v_i can take are bounded by the control flow of the program as well as extra
flow information derived from program path analysis or observed by the user. The
path enumeration is implicit in the sense that each combination of execution counts
actually captures a set of program paths which have the same execution counts for
the corresponding basic blocks but differ in the order in which the basic blocks are
executed. An example illustrating this is given in Figure 2.9 in the overview chapter.
This approach (IPET) differs from the path-based approaches in the following
aspects. First, the paths (implicitly enumerated) in IPET are entire program paths
whereas the paths in most of the path-based approaches are segments of program
paths, e.g., paths within loops. Second, IPET considers a set of paths having the
same combination of execution counts, whereas path-based approaches consider a
single path during the longest path search. Last, a path in IPET does not contain
temporal information (the order in which basic blocks are executed), whereas a path in
path-based approaches specifies a deterministic execution order for the basic blocks
on the path. Note that both approaches can have some optimizations to speed up the
search for the longest path. For example, in a path-based approach, Dijkstra’s algo-
rithm for longest-path search can be used to more efficiently find the longest path in
a loop or a scope [13]; in IPET, the ILP solver can employ very aggressive algorithms
to explore the relationships between different combinations of execution counts (refer
to [66, 70] for more details).
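To see the implicit enumeration at work on a tiny instance (a toy example of our
own; a real analysis hands the identical objective and constraints to an ILP solver
instead of enumerating), consider a loop bounded at 10 iterations whose body is an
if-then-else with invented block costs:

    # Toy IPET instance: maximize sum(cost_i * v_i) over all execution-count
    # combinations permitted by the flow constraints. Costs are made up.
    COST = {"header": 2, "then": 5, "else": 3, "exit": 1}
    MAX_ITER = 10  # loop bound supplied as a functional constraint

    best_time, best_counts = -1, None
    for v_then in range(MAX_ITER + 1):
        counts = {
            "header": MAX_ITER,         # the header runs once per iteration
            "then": v_then,
            "else": MAX_ITER - v_then,  # structural: then + else = header
            "exit": 1,
        }
        total = sum(COST[b] * c for b, c in counts.items())
        if total > best_time:
            best_time, best_counts = total, counts

    print(best_time, best_counts)  # maximal when the costlier branch always runs

Note that each combination of counts stands for every ordering of the corresponding
paths, which is exactly the implicitness discussed above.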
Because of its simplicity and efficiency of path enumeration, the availability of
powerful ILP solvers, and its potential for closer integration with microarchitecture
modeling (explained later), the IPET approach has been adopted for WCET
calculation by several other researchers, including us [8, 37, 55, 71, 72, 38, 39].
3.2 Microarchitecture Modeling
Microarchitectural features, especially pipelining and caching, have caught a lot of
attention for accurate WCET analysis. We review the various microarchitecture
modeling techniques in this section.
Extended timing schema. Researchers at Seoul National University [31, 44, 45,
46] proposed a technique for modeling RISC processors. They extended timing schema
to account for pipeline and cache effects. In their work, the time-bound for a pro-
gram construct in the original timing schema is replaced by a data structure called
worst case timing abstraction (WCTA). It contains a set of elements, each of which
corresponds to a possible worst case path in the program construct. An element in
a WCTA consists of a time-bound for its respective path and a reservation table,
which captures the use of pipeline stages and instruction interactions. When two ad-
jacent constructs are concatenated, path concatenation is realized by concatenating
the reservation tables in the two constructs, where interactions between instructions
across construct borders are modeled. After concatenation, a prune operation may
discard some concatenated reservation tables which cannot be the worst case.
To model instruction cache effects, they divide memory accesses in a path of a
construct into three groups: first/last/other references to the cache lines. Cache
hits/misses of the first references need to be resolved with execution information
preceding the path and the last references are needed by paths succeeding it, thus they
are remembered by augmenting the WCTA. When concatenating two paths across
program constructs, the last references are used to resolve some of the hits/misses in
the first references in the later path, and the first/last references of the concatenated
path will be computed from the first/last references of the two concatenating paths.
Their treatment of the combination of the two analyses is simple: cache miss penalties
are simply superimposed on the execution time obtained from the pipeline analysis,
which itself does not consider instruction cache effects.
Flow analysis technique. The approach proposed by researchers at Florida State
University [3, 24, 28] is based on flow analysis techniques found in optimizing com-
pilers. The target architecture includes pipelines and instruction caches. They first
perform instruction cache analysis by using a static cache simulator [56, 57]. The
simulator analyzes the program control flow and categorizes instructions into four
classes: always hit, always miss, first hit, and first miss. The categorization infor-
mation is associated with loop levels. For example, an instruction I categorized as
always miss for an outer loop L1 might be categorized as first miss for an inner loop
L2. This more accurately accounts for the cache behavior and tightens the WCET.
Next, they perform pipeline analysis by using the cache category information. This
work consists of two steps. First, they perform pipeline analysis for loop paths. To
model the pipeline behavior, the key point is to model its structural hazards and data
hazards. They use two data structures for this purpose. The structural/data hazards
information stored in each path will be used by the path concatenation algorithm.
Next, they perform loop analysis to predict the worst case execution time of a loop.
To avoid the complexity of calculating all combinations of paths, they union the
pipeline effects of the paths for a single iteration of a loop. The union operation
should guarantee conservativeness for safe WCET of the loop.
Last, the timing analyzer predicts the WCET of the program by using the worst
case execution times of the code segments containing loops, function calls etc. Like
timing schema, this is done in a bottom-up manner.
Abstract Interpretation Researchers at Saarland University [22, 72] used abstract
interpretation [14] for instruction cache analysis. The analysis consists of two steps.
The first step is to collect abstract cache states at program points. Intuitively, in an
abstract cache state, each cache line contains a set of memory blocks. They define
two functions: an abstract cache update function, which specifies how an abstract
cache state is updated by a cache access; and a join function, which combines two or
more abstract cache states at program joins. By traversing the program flow, abstract
cache states are updated and joined. In the second step, the abstract cache states
are used to categorize memory references into four categories: always hit, always miss,
persistent and not classified. The category information will be used for subsequent
analysis where cache information is needed.
They have also used abstract interpretation for pipeline analysis [65]. They first
introduce concrete pipeline semantics to model the pipeline behavior and capture
pipeline hazards (structural and data). Instruction executions on the pipeline are
described by updates of concrete pipeline states. A concrete pipeline state describes
the occupancy of the pipeline stages by instructions, resource allocations and states of
some other resources. Based on the concrete pipeline semantics, they build abstract
pipeline semantics, in which an abstract pipeline state is a set of concrete pipeline
states. Update on an abstract pipeline state is realized by updating each of the
contained concrete pipeline states. In some cases, if the update involves some non-
deterministic events (e.g., a load with unknown address), one concrete pipeline state
is split into multiple successor states. If a successor state cannot be ruled out as a
worst-case candidate, it has to be kept in the new abstract state. They claim that
in general the number of concrete states in an abstract state is small,
therefore operations on abstract pipeline states are efficient.
In recent years, they have targeted their work to real-life modern processors.
Langenbach et al. [36] modeled Motorola ColdFire-5307, and Heckmann et al. [29]
modeled PowerPC-755, an out-of-order processor.
Integer linear programming. Li et al. [41, 42, 43] used integer linear program-
ming (ILP) for instruction cache modeling and combined it with their ILP-based
WCET calculation method. In their work, the cache behavior is modeled by a set of
graphs called Cache Conflict Graphs (CCG) for a directly mapped instruction cache.
The CCG models flow transfer information among memory blocks3 mapping to the
same cache line. Cache misses are captured as flow transfer between conflicting mem-
ory blocks. Variables and linear constraints are generated from the CCGs and are
3 A sequence of instructions in a basic block which map to the same cache line.
incorporated into the existing ILP problem. This way, the modeling of cache behavior
is tightly coupled with the modeling of program flow. For set associative instruction
caches, an extra set of graphs called Cache State Transition Graphs (CSTG) are
introduced to model their more complicated behavior. This ILP-based instruction
cache modeling, due to its ability to use more detailed flow information, achieves
good accuracy. On the other hand, its tight integration with WCET calculation
results in an increase in analysis time, especially for set associative caches.
Symbolic simulation Lundqvist and Stenström [49] used a cycle-level symbolic
simulation technique for WCET calculation. Microarchitectural features such as caching
and pipelining are modeled during the symbolic execution. The instruction cache
state in the simulation is updated along an execution path and cache states from
multiple paths are merged at a path join. Each cache line in the cache state contains
either a block of program instructions or invalid content (for direct mapped cache).
They have two merge strategies: pessimistic merge and optimistic merge. With the
pessimistic merge, if the contents of the respective cache lines from two different paths
are different, invalid content is assumed for the cache line in the merged cache state.
Optimistic merge is based on the idea that if it is known in advance that one partial
path does not belong to the worst case path, the cache state of this path is simply
ignored by the path merge. In their work, they predict the worst case penalty and
best case penalty that the cache state of each path can incur. For two partial paths
P1 and P2, if the cost of P1 plus its worst case penalty is less than the cost of P2 plus
the best case penalty of P2, then P1’s cache state will be ignored in the merge. For
pipeline modeling, they use pipeline reservation tables to maintain the pipeline state.
A reservation table records when each resource (pipeline stage or register) is released.
With the reservation tables, pipeline hazards (structural and data) can be captured.
During the simulation, the reservation table is updated one instruction at a time. For
the path merge, the pipeline reservation tables are merged following the same strategy
as the cache state merge. The accuracy of this approach depends on how
many infeasible paths can be identified during simulation and how many path merges
can be applied with the optimistic merge.
Other techniques There are some other techniques for modeling pipelines and
instruction caches, and there is also some work on modeling other microarchitectural
features such as branch prediction, data caching, prefetching, etc.
Engblom [16] provides a comprehensive study of various pipelines for WCET
analysis in his doctoral dissertation. His work on pipeline modeling is based on a
concept called timing effects, which reflect the impact of an earlier instruction on
subsequent instructions. Formally, given two consecutive instructions I_1 and I_2, let
their isolated execution times be T(I_1) and T(I_2) respectively, and let their combined
execution time be T(I_1 I_2); the timing effect is defined as
$\delta_{I_1 I_2} = T(I_1 I_2) - (T(I_1) + T(I_2))$.
Due to pipeline overlap, $\delta_{I_1 I_2}$ is often negative, and such a timing effect is
called a negative timing effect. The concept of timing effect can be extended to a
sequence of more than two instructions; a timing effect related to a long instruction
sequence is called a long timing effect. If long timing effects are absent or insignificant
on a pipeline, then the execution time of an instruction sequence can be obtained by
simulating its short sub-sequences; otherwise, one either performs extensive
simulations on both its short and long sub-sequences to get a tight estimate, or trades
accuracy for performance by ignoring the long timing effects. Note a timing effect can
only be ignored if it is negative: ignoring positive timing effects results in
underestimation. Therefore, positive long timing effects pose a problem for this
approach. Unfortunately, it has been observed in his dissertation that out-of-order
pipelines exhibit positive long timing effects.
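For a quick numeric illustration of the definition (the numbers are ours, not
Engblom's): if two adjacent instructions overlap by two cycles in the pipeline, one
might observe

    % Hypothetical isolated and combined execution times, in cycles:
    T(I_1) = 5, \quad T(I_2) = 4, \quad T(I_1 I_2) = 7
    \Rightarrow\; \delta_{I_1 I_2} = 7 - (5 + 4) = -2

a negative timing effect of two cycles due to the pipeline overlap.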
Branch prediction started getting attention in recent years. Compared to instruction
caching, dynamic branch prediction [52, 74] is more difficult to model, as the regular
properties of instruction caching do not exist in dynamic branch prediction schemes.
For instance, for an inner loop which can be completely accommodated by the cache,
all accesses to an instruction except the first will be hits as long as the execution
repeats within the loop. This locality has been exploited by techniques which
differentiate instruction executions with respect to their execution contexts, such as
loop levels and function calls ([3, 24, 28] and the VIVU approach in [22, 72]). In
contrast, such locality is not obvious or does not exist for dynamic branch prediction
schemes. For example, a conditional branch which is repeatedly executed in an inner
loop may disturb its own predictor state by changing direction on every iteration,
causing itself to be mispredicted each time. As a result, branch prediction modeling
is expected to take more effort. The difficulties of branch prediction modeling have
been discussed by Engblom [17].
To our knowledge, the first detailed branch prediction analysis for WCET was
performed by Colin and Puaut [9]. They modeled the Branch Target Buffer (BTB),
which can be found in Intel Pentium processors. With the BTB scheme, a branch is
either predicted according to its history in the BTB or is predicted as not taken if it
is absent from the BTB. In their work, the evolution of the BTB state with program
flow is studied and information is collected along with the evolution. Next, with the
collected information, branch instructions are classified according to whether they are
predicted by their history or by default. The classification is connected with correct
predictions/mispredictions in the following way. Since their WCET calculation is
based on timing schema, the worst case path taken in a construct is always the
same path across different iterations, thereby a branch instruction always takes the
same direction on the worst case execution path of the program. Thus, for a branch
predicted by its history, the prediction is correct. For a branch predicted by default,
depending on its direction in the worst case path, it can be statically determined
whether it is correctly predicted or mispredicted. Only for a branch whose source
of prediction (by history or by default) is unknown, its prediction is assumed to be
mispredicted for the sake of conservativeness. This way, the timing effects of branch
predictions can be accounted for in WCET analysis. It should be pointed out that
the above description takes a simplified view of their work. In fact, due to their
extension of the original timing schema, the worst case path of a construct and the
direction of a branch in it may not be globally unique; rather, they are unique only
within a specific loop level. But the underlying rationale remains unchanged.
Another work on branch prediction analysis is by Bate and Reutemann [4]. They
modeled bimodal branch predictors and two-level branch predictors under some
restrictive assumptions on the program structure (which severely limit the
applicability of their technique).
Comparison In this part, we compare the various modeling techniques (including
ours).
As for instruction cache analysis, the flow analysis approach and the abstract in-
terpretation approach perform it before WCET calculation; while in the extended
timing schema, integer linear programming and symbolic simulation approaches, in-
struction cache analysis is integrated with WCET calculation. Integrated approaches
have the potential of achieving more accurate results, as more program path
information can be used for cache analysis, but they may incur a higher computation cost.
For example, when the ILP approach is used for modeling set associative instruction
caches, very long computation time has been observed. For separated approaches, the
analysis results are general and conservative enough to be applicable to all possible
program paths or to one of a few sets of program paths (when execution context
information is imposed, e.g., loop levels and function calls). This can be viewed as
trading accuracy for performance. However, due to its locality, instruction caching
can still be modeled with good accuracy by separated approaches if execution context
information such as loop levels is used to distinguish the accesses of an instruction.
As for pipeline analysis, we compare our work with the various approaches. We
model an out-of-order pipeline [39] in which an instruction can execute with variable
latency. For such a pipeline, considering only one latency for each instruction, such
as the longest one, would be unsafe [50]. In contrast, most of the surveyed pipeline
analysis approaches are only applicable to in-order pipelines. In addition, they assume
that an instruction executes with a single latency or implicitly take the longest latency
for estimation. Recently, the abstract interpretation approach has been applied to
out-of-order pipelines [29]. However, as mentioned earlier, the issue is that the pipeline
states are updated against each possible latency when a variable-latency instruction
is encountered, leading to an accumulation of pipeline states along the estimation
process. In case the sequence of instructions to be estimated is not very short and
the pipeline is complex, this approach can result in state space explosion [73]. Our
approach avoids enumerating the individual execution latencies of an instruction by
using an interval to represent the latencies, and it employs an efficient fixed-point
algorithm to iteratively tighten the intervals. Another advantage of our approach is
its convenience for integrating with the analyses of other microarchitectural features.
For example, it can either be integrated with an instruction cache analysis where
cache accesses are classified as hits or misses before pipeline analysis is carried out, or
be integrated with an ILP-based instruction cache analysis, where cache hits/misses
are resolved during WCET calculation (in this thesis, we use the latter approach).
In contrast, most of the surveyed approaches have not demonstrated such flexibility.
3.3 Program Path Analysis
Program path analysis studies a number of topics including automatic flow analysis for
infeasible path detection and loop bounding, path annotation methods, source-code
level to compiled-code level flow information translation, interaction with optimizing
compilers, etc. Substantial research work has been done in this area.
Automatic flow analysis. Feasible/infeasible path information is either manually
provided or explored automatically by flow analysis. The latter approach has been
investigated by many researchers.
Altenbernd [2] proposed a method to exclude false paths during the search for
the worst case execution path. His work combines path enumeration with pruning
and symbolic execution. He used a branch-and-bound algorithm to perform the
actual path search in the control flow graph.
Ermedahl and Gustafsson [20] used symbolic execution to discover false paths
and loop bounds. They work on abstract semantics of programs. The key concept
is an environment σ_i^h, which captures the abstract values (split integer intervals)
of the variables at a program point i following a specific path h. Rules updating the
environments at program points are generated based on the program semantics. If a
variable's abstract value in σ_i^h is ⊥, which denotes the empty value, then the path
h to i is infeasible.
Lundqvist and Stenström [49] used cycle-level symbolic simulation for WCET
calculation as well as infeasible path detection. In their work, the domain of variable
values is extended with an extra value called unknown. When a conditional branch
is reached and the value of the condition variable is not unknown, the execution
follows one path, and the other path is infeasible and is simply not simulated. In
case the condition value is unknown, both paths need to be simulated.
Liu and Gomez [47, 48] proposed another technique using symbolic evaluation
on partially known input structures. They work on the source-language level. In
contrast to the earlier symbolic execution based techniques, they do not merge paths
from loops. This reduces nondeterminism due to path merging but raises concerns
about time and space complexity. They apply some program transformations, such
as incremental computation and transformation of conditionals, to make the analysis
more efficient. In their experiments they observed that the analysis is still feasible
for input sizes in the thousands.
The above symbolic-execution-based methods need to iterate through loops many
times, which can be inefficient. Healy et al. [25, 26] implemented techniques to
automatically determine the minimum/maximum number of iterations of loops. They
do so by (1) identifying conditional branches within the loop that can affect the
number of loop iterations, (2) calculating the ranges of iterations in which these
branches can be reached, and (3) calculating the minimum/maximum number of
iterations from the information computed in (2). In another work [27], they developed
techniques
the information computed in (2). In another work [27], they developed techniques
for automatic detection of branch constraints. They do so by analyzing the effect
of a variable assignment on a branch and the correlation between the outcomes of
different branches. The fall through or taken frequency of a branch in a loop may also
be calculated by using value range analysis on loop induction variables. The branch
constraints will be used in the subsequent analyses to exclude infeasible paths.
Ferdinand et al. [21] used abstract interpretation to detect infeasible paths. They
call it value analysis, which computes for each processor register an interval of possible
values. If at a conditional branch, the value interval for the branch condition indicates
a deterministic direction, then the path along the other direction is an infeasible one.
Annotation methods. To make use of the feasible/infeasible path information,
there should be methods to describe it.
Puschner and Koza [62] proposed a language called MARS-C. They use constructs
like scopes, markers, and loop sequences to describe feasible/infeasible paths.
Park [58] developed a script language called IDL (information description lan-
guage), which is subsequently translated into regular expressions. IDL can capture
some frequent path relationships such as that a statement is executed a certain num-
ber of times, or that two statements are always executed together or they are mutually
exclusive. The major problem is that manipulations on regular expressions, e.g., in-
tersection of two regular expressions, are difficult.
Li and Malik [40] used linear constraints to specify the flow information, which
they called functional constraints. Functional constraints can be used to give loop
bounds and relationships among the execution counts of multiple basic blocks. They have
shown that every IDL information clause in [58] can be transformed into functional
constraints.
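For instance (our own illustrative constraints in the spirit of [40], using the example
of Figure 3.1 with invented variable names), if v_1 counts the executions of the loop
header and v_4, v_7 count the then-blocks on Lines 4 and 7 respectively, one could
supply

    % A loop bound of 100 iterations, and the mutual exclusion of the
    % statements on Lines 4 and 7 within any iteration:
    v_1 \le 100, \qquad v_4 + v_7 \le v_1

and an ILP-based calculation simply adds these to its constraint set.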
Colin and Puaut [9] proposed an annotation method for loops with a varying number
of iterations. They used pairs of mathematical expressions instead of constants for
inner loops whose iteration counts depend on counter variables of outer loops. For
example, [maxiter, counter] is such a loop bound, where maxiter is the maximum
number of iterations and counter is the loop counter value, both given as mathematical
expressions. These expressions are symbolically evaluated by Maple [7]. Using
this annotation method, they have achieved significant accuracy improvements for
programs having inner loops with a varying number of iterations.
Engblom and Ermedahl [18] defined a language called flow facts language to de-
scribe complex flow information. They define flow facts for scopes, which are program
segments under some execution context, e.g., a loop or a function call reached from
a path. A flow fact consists of three parts: the name of a scope, a context specifier,
which typically gives the iterations of the scope, and a constraint expression specify-
ing the flow information. For example, a flow fact foo : [1..10] : X_A ≤ 2 specifies that
a block A in the scope foo cannot execute more than twice in the first ten iterations
of the scope. Depending on the WCET calculation method being used, not all flow
facts can be accurately transformed to path information that can be used for that
WCET calculation.
Translation and compiler support. Program path information is often provided
on the source-program level, but WCET analysis is usually on the compiled-code level.
Thus a translation of the annotations from the source-program level to the compiled-code
level is necessary. This is a non-trivial problem because optimizing compilers perform
many code transformations, which makes the mapping from source program
constructs to instructions/basic blocks in the compiled code difficult.
Puschner [63] described a mapping function to translate path information on the
source level to the assembly level. It assumes that the programs are compiled with
moderate optimization. The mapping function traverses the parse tree of the source
program. In each step down the tree it tries to find the corresponding assembly
code by using information about the nesting of constructs, line numbers etc. in the
assembly code. If the mapping fails on a construct, it outputs a warning.
Engblom et al. [19] proposed an approach called co-transformation for supporting
the mapping of execution information from source program to compiled code. They
defined a language called Optimization Description Language (ODL) to characterize
what typical optimizations do. The co-transformation engine can be generated from
the ODL source. To apply their work, the compiler needs to be modified slightly to tell
the transformer what kind of optimizations have been done. As long as the optimiza-
tion types performed by the compiler are described by ODL, the co-transformation
can map the source code constructs to compiled code segments.
Kirner and Puschner [33, 34] developed another transformation method that is in-
tegrated into the compiler. The path information is transformed through all compiler
stages. Therefore substantial effort is needed to extend the existing compiler, but this
is repaid by the ability to support strong code optimizations in WCET analysis.
Summary The above discussion covers several issues in program path analysis, which
address, from different angles, the problem of providing program path information
for WCET calculation. More accurate program path information is essential for tight
WCET estimates. On the other hand, automated path information derivation
techniques and integration with compilers will facilitate WCET analysis and promote
its application. In this thesis, we focus on microarchitecture modeling and do not
explore new program path analysis methods. The existing techniques can be
integrated with our analysis framework.




Our aim in this chapter is to obtain a safe and tight WCET estimate for out-of-
order pipelined execution without enumerating possible instruction schedules. Our
technique is inspired by an iterative performance analysis technique for real-time
distributed systems proposed by Yen and Wolf [75], which estimates the execution
time of tasks with data dependencies and resource contentions. For estimating the
WCET of a basic block, we exploit and augment their technique by treating individual
instructions as tasks. Clearly, there are data dependencies between instructions in
a program; resource contention is defined in terms of two instructions requiring the
same functional unit. We then extend our solution for estimating the WCET of a
basic block to arbitrary programs with complex control flows. This extension involves
several steps. First, we apply the timing estimation technique to each basic block.
Next, we bound the timing effects of instructions preceding or succeeding a basic
block. Finally, an Integer Linear Programming (ILP) formulation over the control
flow graph is employed to estimate the WCET of the entire program.
The rest of this chapter is organized as follows. In the next section we discuss the
difficulties of out-of-order pipeline analysis and present an overview of our approach
for addressing them. In Section 4.2 we present the analysis technique in two steps: in
the first step, we develop the core algorithms for the execution of a basic block without
considering its execution context; and in the next step we extend the algorithms to
handle the issues related to the execution context of a basic block. In Section 4.3 we
experimentally validate the analysis technique. The concluding remarks for out-of-order
pipeline analysis close the chapter.



Modern processors such as the one presented in Section 2.2 employ out-of-order ex-
ecution where the instructions can be scheduled for execution in an order different
from the original program order. In such a processor, an instruction can execute if
its operands are ready and the corresponding functional unit is available, irrespective
of whether earlier instructions have started execution or not. Out-of-order execution
improves the processor's performance significantly as it replaces pipeline stalls (due
to dependencies and/or resource contentions) with useful computations. However,
out-of-order execution exhibits a phenomenon called timing anomaly1, which makes
WCET analysis difficult.
4.1.2 Timing Anomaly
The problem of timing anomaly was originally discussed by Lundqvist and Stenström
[50]. Let us consider an instruction I with two possible latencies l_min and l_max such
that l_max > l_min. The variation in latency can be due to different reasons: cache
hit/miss for a load instruction, a variable number of cycles taken by an arithmetic
instruction like multiplication, etc. Let us assume that the execution time of a sequence
of instructions containing I is g_max (g_min) if I incurs a latency of l_max (l_min). The
latencies of the other instructions in the sequence are fixed. A timing anomaly happens
if either (g_max − g_min) < 0 or (g_max − g_min) > (l_max − l_min).
Figure 4.1 illustrates timing anomaly with an example. In the code fragment,
instruction B depends on A, instruction C depends on B, and instruction E depends
on D. Instructions A and E use the MULTU functional unit with a latency of 1~4
cycles, and the other instructions use the single-cycle ALU functional unit.
1 It has been observed by Langenbach et al. [36] that timing anomaly can also happen on some
in-order processors, such as the Motorola ColdFire 5307, where a unified cache for instructions and
data is employed.
[Figure 4.1: Timing Anomaly due to Variable-Latency Instructions. (a) Instruction
sequence: A: mult r3 r1 r2; B: add r3 r3 8; C: and r3 r3 0xff; D: addu r5 r4 8;
E: a MULTU instruction depending on D. (b) Latencies: MULTU 1~4 cycles, ALU
1 cycle. (c) Execution timeline when instruction A executes for 3 cycles. (d)
Execution timeline when instruction A executes for 4 cycles.]
We illustrate two possible execution scenarios. In the first scenario illustrated in
Figure 4.1(c), instruction A executes for three cycles – cycles 0 − 2. Since A starts
executing at cycle 0, it is ready for execution at cycle 0 or earlier. Therefore at
the beginning of cycle 3, all of B, C, D are ready for execution; all of them are
contending for the ALU. Thus, instructions B and C execute on cycles 3 and 4,
respectively. Instruction D is ready for execution in cycle 3 itself, but it can only be
scheduled for execution in cycle 5 after B and C (which appear earlier in program
order). The overall execution time in this case is 10 cycles. In the second scenario
as illustrated in Figure 4.1(d), A executes for four cycles. Now D is the only ready
instruction in cycle 3 (B and C are still waiting for their operands); D executes in
clock cycle 3 allowing E to start execution in clock cycle 4. The overall execution
time in this case is only eight cycles. Thus, a longer latency of A results in a shorter
overall execution time.
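In the notation of Section 4.1.2, the two scenarios of Figure 4.1 instantiate the
anomaly condition directly:

    % Values taken from the example: A's latency is 3 or 4 cycles, and the
    % corresponding overall times are 10 and 8 cycles.
    l_{min} = 3, \quad l_{max} = 4, \qquad g_{min} = 10, \quad g_{max} = 8
    \Rightarrow\; g_{max} - g_{min} = -2 < 0

so the first branch of the definition, (g_max − g_min) < 0, is met.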
In the presence of timing anomaly, techniques which simply take the local worst
case for WCET estimation no longer guarantee safe bounds. For example, it is not
safe to assume that the worst case cache behavior of a sequence of instructions results
from a cache miss in every instruction. For the same reason, it is not safe to assume
that taking the longest latency for variable-latency arithmetic instructions leads to
the overall WCET of a program. This prompts the need to consider all possible
schedules of instructions. For a piece of code with N instructions, each of which has
K possible latencies, a naive approach that examines each possible schedule
individually will have to consider K^N schedules. We now explain the basic idea
behind our approach
which allows us to avoid such expensive enumeration.
4.1.3 Overview of the Pipeline Modeling
Given the control flow graph of a program, our WCET analysis method first derives
a WCET estimate for each basic block. Then the basic block estimates are combined
using Integer Linear Programming (ILP) to produce the program’s WCET estimate
(refer to Equation 2.1).
How do we find the WCET estimate for a basic block Bi? This is done by first
considering the basic block’s execution in isolation, that is, starting with an empty
pipeline. We find the WCET estimate without enumerating instruction schedules as
follows. We observe that the worst-case timing behavior of Bi arises from maximum
resource contention among instructions in Bi, that is, each instruction being delayed
by the maximum number of other instructions. We produce very coarse estimates of
the time intervals at which instructions in Bi can start/finish execution by initially
assuming that any instruction in Bi can delay the others, except the contentions ruled
out by data dependencies. The estimates allow us to rule out certain contentions –
if the earliest time instruction I is ready for execution occurs after the latest time
at which I ′ finishes, clearly I cannot delay I ′. This allows us to further refine the
estimates, thereby ruling out more contentions. The process continues until a fixed
point is reached. The WCET of the basic block Bi (where Bi’s execution starts with
an empty pipeline) is the maximum time between the fetch of Bi’s first instruction
and commit of Bi’s last instruction.
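In outline, the refinement loop can be sketched as follows (a schematic paraphrase of
ours with hypothetical structure names; the precise update rules and their safety
argument are developed in Section 4.2):

    # Schematic sketch of the interval-refinement fixed point. Each node of
    # the execution graph keeps a conservative interval [earliest, latest]
    # for its start time; dependency predecessors push the interval bounds,
    # and contenders that provably cannot overlap a node are dropped.
    def refine(nodes, preds, contenders, lat_min, lat_max):
        """nodes: node ids; preds[v]: dependency predecessors of v;
        contenders[v]: set of EX nodes that may delay v;
        lat_min[v], lat_max[v]: latency bounds of v's pipeline stage."""
        earliest = {v: 0 for v in nodes}
        latest = {v: float("inf") for v in nodes}
        changed = True
        while changed:
            changed = False
            for v in nodes:
                lo = max((earliest[p] + lat_min[p] for p in preds[v]), default=0)
                hi = max((latest[p] + lat_max[p] for p in preds[v]), default=0)
                for u in list(contenders[v]):
                    if latest[u] + lat_max[u] <= lo:
                        contenders[v].discard(u)   # contention ruled out
                    else:
                        hi += lat_max[u]           # u may delay v, worst case
                if (lo, hi) != (earliest[v], latest[v]):
                    earliest[v], latest[v] = lo, hi
                    changed = True
        return earliest, latest

The basic block's WCET estimate is then read off as the latest completion time of
its last CM node.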
Given the execution time estimate of Bi’s execution starting with an empty
pipeline, how do we find costi, block Bi’s WCET estimate? We observe that the
number of instructions before and after Bi which can affect the timing of Bi’s exe-
cution is bounded by architectural parameters. Accordingly, we extend our timing
estimation technique to operate on a basic block with a prologue/epilogue (instructions
before/after Bi which directly affect the timing of Bi). Time intervals for execution of
instructions in prologue/epilogue are estimated conservatively by assuming maximum
possible contentions. We also consider (a) the data dependencies between instructions
in prologue and instructions in Bi, and (b) possible time overlap between instructions
in Bi and instructions prior to Bi. In this way, we find the timing estimate of basic
block Bi for all possible choices of prologues and epilogues. The maximum of these
estimates is costi, the estimated WCET of Bi.
In the preceding, we have given an overview of our modeling technique which
captures the timing effects of out-of-order pipelines. The technical details of this
modeling will be presented in the following sections.
4.2 The Analysis
Our analysis technique is presented in two steps. First, we estimate the execution
time of a basic block in isolation by assuming an empty pipeline at the beginning.
Next, we extend the technique by taking into account the possible initial pipeline
states and context instructions before/after the basic block.
4.2.1 Estimation for a Basic Block without Context
Our effort in this section is to develop an algorithm for estimating the WCET of
a basic block executing on the out-of-order processor pipeline presented in Section
2.2. Instructions in a basic block are executed sequentially, that is, there is no non-
determinism in terms of control flow transfer. The main advantage of our approach
is that explicit enumeration of possible instruction schedules is avoided. Thus the
estimation is both time and space efficient. The technical details are presented in the
following order. First, we formulate the problem as an execution graph, which cap-
tures data dependencies and resource contentions — the two major factors dictating
instruction executions. Next, based on the execution graph, we develop an algorithm
which starts with very coarse yet safe estimates, and iteratively refines the estimates
until a fixed point is reached.
Definition 4.1 (Execution Graph). The execution graph for a basic block B under
a pipeline model is defined as
GB = (VB, DEB)
where VB represents all possible combinations of instruction identifiers and pipeline
stages for basic block B, and DEB ⊆ VB×VB represents a dependency relation among
nodes. For two nodes u, v ∈ VB, we say that (u, v) ∈ DEB iff v can start execution
only after u has completed execution; this is indicated by a solid directed edge from u
to v in the execution graph. Clearly (u, v) ∈ DEB ⇒ (v, u) ∉ DEB.
Apart from the dependency relation among nodes in an execution graph (denoted
by solid edges), we also define a contention relation as follows. We do not make
the contention relation part of the execution graph so as to clearly identify what we
mean by “paths” in the execution graph; paths in the execution graph refer to chains
of dependency edges. This will be required in our analysis.
Definition 4.2 (Contention Relation). Let B be a basic block, and GB = (VB, DEB)
be its execution graph. We define a contention relation CEB ⊆ VB×VB such that for
two nodes u, v ∈ VB, we say that (u, v) ∈ CEB iff
• nodes u and v denote the EX stages of two different instructions I and J re-
spectively, and
• instructions I and J can delay each other by contending for a functional unit.
Our definition of contention relation is symmetric, that is, (u, v) ∈ CEB ⇒ (v, u) ∈
CEB. We will show the contention between u and v as an undirected dashed edge in
the execution graph.
We now explain the nodes, dependencies and contentions captured in an execu-
tion graph in details. This will also clarify how the dependency and the contention
relations can be computed.
Let CodeB = I1 . . . In represent the sequence of instructions in a basic block B.
Then each node v ∈ VB is represented by a tuple: an instruction identifier and a
pipeline stage denoted as stage(instruction id). For example, the node v = IF (Ii)
represents the fetch stage of the instruction Ii. If basic block B contains n instructions,
then |VB| = n×P where P is the number of stages in the pipeline. Each node in the
execution graph is associated with the latency of the corresponding pipeline stage. In
our processor pipeline, all pipeline stages except EX have single cycle latency.
Our definition of dependency edges includes dependencies due to resource con-
straints and pipelined execution in addition to traditional data dependencies. We
consider:
• Dependencies among pipeline stages of the same instruction. This is because
an instruction must proceed from the first stage to the last stage in order;
for example, ID(Ii) must follow IF(Ii).
• Dependencies due to in-order execution in IF, ID, and CM pipeline stages.
That is, different instructions should proceed through these pipeline stages in
program order, for example, IF (Ii+1) can only start after IF (Ii).
• Dependencies due to resource constraints, such as a full I-buffer or ROB. For example,
assuming I-buffer has two entries, there will be no entry available for IF (Ii+2)
before the completion of ID(Ii) (which removes Ii from the I-buffer). Therefore,
there should be a dependency edge ID(Ii)→ IF (Ii+2). Similarly, with a 4-entry
ROB, there should be a dependency edge CM(Ii)→ ID(Ii+4) because CM(Ii)
frees up the entry occupied by Ii in the ROB. Note that we can draw these edges
as both the I-buffer and the ROB are allocated and freed in program order.
• Data dependencies among instructions. If instruction Ii produces a result that
is used by instruction Ij, then there should be a dependency edge EX(Ii) →
EX(Ij) if Ii is an arithmetic instruction (because of data forwarding for arith-
metic instructions at the end of the EX stage), or WB(Ii)→ EX(Ij) if Ii is a
load instruction.
The above summarizes the dependencies; we now describe the contention rela-
tion among nodes in the execution graph of a basic block B. We define contention
relation CEB among the EX stages of different instructions utilizing the same FU
for execution. This is because contention can only happen in the EX stage with
our pipeline model. For two instructions Ii, Ij in basic block B (i ≠ j) we define
(EX(Ii), EX(Ij)) ∈ CEB iff
1. instructions Ii and Ij utilize the same functional unit,
2. there is no path from EX(Ii) to EX(Ij) or from EX(Ij) to EX(Ii) in the
execution graph GB, and
3. |i− j| < ROB size
The second condition ensures that there is no dependency between the two nodes,
i.e., they can indeed contend for a functional unit. The final condition simply excludes
the possibility of two far-away nodes contending with each other. For example, if the
ROB has four entries then clearly instructions Ii and Ii+4 cannot coexist in the ROB.
Note that the contention between two instructions obeys the following rules.
• If two instructions contend for a functional unit in the same clock cycle, the
earlier instruction (according to program order) gets access to the functional
unit, and
• Once an instruction gets access to a functional unit, it runs to completion
without getting pre-empted.
Given two instructions Ii, Ij (where i < j, i.e. Ii appears earlier in program order)
contending for a functional unit, suppose Ij becomes ready earlier than Ii. This is
possible since Ii may be delayed due to data dependencies. Instruction Ij thus starts
executing ahead of Ii. Meanwhile Ii may receive its operands and get ready. However,
Ii now has to wait for the functional unit to be free, that is, until Ij completes. This
is how instructions later in the program order can delay the execution of an earlier
instruction.
Figure 4.2 shows an example of an execution graph, constructed from the basic block with five instructions shown in Figure 4.2(a). In Figure 4.2(b), the edges $EX(I_1) \rightarrow EX(I_3)$, $EX(I_2) \rightarrow EX(I_5)$, and $EX(I_4) \rightarrow EX(I_5)$ reflect data dependencies. The other solid edges capture dependencies due to the structure of the pipeline and resource constraints. The dashed edges represent contention relations. The contention relation between $EX(I_1)$ and $EX(I_4)$ implies: (a) if instructions $I_1$ and $I_4$ are both ready to execute and the functional unit MULTU is free, then $EX(I_1)$ will be issued for execution as it is from an earlier instruction and thus has higher priority; and (b) if $EX(I_4)$ has already started execution before $EX(I_1)$ is ready, then $EX(I_4)$ will be allowed to complete, thereby delaying $EX(I_1)$. Our execution graph is similar to the dynamic dependency graph among instructions of Fields et al. [23]. In their work, the dependency graph is obtained from a concrete simulation run, that is, a trace of dynamic instructions. Therefore, the actual resource contentions exercised in that particular run are known, and the nodes are annotated with the execution latency as well as the wait time for a functional unit. They study how much each instruction can be delayed (the slack) without increasing the execution time of the run. Our execution graph, in contrast, is static: all possible resource contentions between instructions are represented for the purposes of static analysis.

(a) Code Example:
    I1: mult r6 r10 4
    I2: mult r1 r10 r1
    I3: sub  r6 r6 r2
    I4: mult r4 r8 r4
    I5: add  r1 r1 r4
(b) Execution Graph of the Code: each instruction $I_1, \ldots, I_5$ has a row of nodes $IF(I_i)$, $ID(I_i)$, $EX(I_i)$, $WB(I_i)$, $CM(I_i)$ (edges as described above).

Figure 4.2: A basic block and its execution graph. The solid edges represent dependencies and the dashed edges represent contention relations.
Problem Definition Let $B$ be a basic block consisting of a sequence of instructions $Code_B = I_1 \ldots I_n$ and let $G_B = (V_B, DE_B)$ be its execution graph. Estimating the WCET of $B$ can be formulated as finding the maximum (latest) completion time of the node $CM(I_n)$, assuming that $IF(I_1)$ starts at time zero. Note that this problem is not equivalent to finding the longest path from $IF(I_1)$ to $CM(I_n)$ in the execution graph (taking the maximum latency of each pipeline stage). The execution time of a path in the execution graph is not a summation of the latencies of the individual nodes, for two reasons.
• The total time spent in making the transition from $ID(I_i)$ to $EX(I_i)$ depends on the contentions from other ready instructions.
• The initiation time of a node is computed as the max of the completion times of its immediate predecessors in the execution graph. This models the effect of dependencies, including data dependencies.
A Related Problem Given the problem formulated as an execution graph, we
propose an iterative algorithm to estimate the WCET of a sequence of instructions.
The basic structure of our algorithm is inspired by a performance analysis technique
for real-time distributed systems [75] which analyzes a system consisting of several
periodic tasks represented by task graphs. Each task consists of a partially ordered set
of processes, and each process has lower and upper bounds on its computation time.
The hardware architecture consists of a set of Processing Elements (PEs) connected via communication edges. Processes are allocated to the PEs and priorities are assigned among the processes assigned to the same PE. A process $P$ is scheduled to execute on a processing element $E$ if (1) all of $P$'s predecessors have completed execution, and (2) no higher priority process is running on $E$. $P$ can possibly preempt a lower priority process to start execution; on the other hand, $P$ may itself get preempted by higher priority processes during its execution. The algorithm estimates the worst case completion time of all the tasks.
The problem addressed by Yen and Wolf's algorithm is similar to our analysis problem in some key aspects. The similarities include the fact that the execution graph in our problem is similar to the task graph considered in [75]; both graphs capture data dependencies between nodes. Furthermore, there are resource contentions between the nodes, and contending nodes are assigned priorities. However, there are some significant differences as well. First of all, [75] captures periodic tasks, whereas the instructions in our execution graph are not periodic. More importantly, in [75] a higher priority process $hp$ may delay a lower priority process $lp$ by preemption, but $lp$ cannot delay $hp$. In our problem, however, it is possible for a lower priority instruction $li$ (appearing later in program order) to delay the execution of a higher priority instruction $hi$: as there is no preemption, if $li$ is executing when $hi$ becomes ready, then $li$ is allowed to complete its execution, thereby delaying $hi$. Such differences make the computation of the response time of a node $v$ (the time from when all of $v$'s predecessors have completed execution to the time $v$ completes execution) different in our problem.
Notations Before we discuss our WCET estimation method, we explain the notation used in our estimation algorithm. In the following, $u, v$ denote nodes in the execution graph of the basic block $B$ being analyzed.

• $t^{ready}_v$: Ready time of node $v$, defined as the time when all its predecessors have completed execution.

• $t^{start}_v$: Start time of node $v$, defined as the time when it starts execution. Except for nodes corresponding to EX stages, $t^{start}_v = t^{ready}_v$. A node $EX(I_i)$ may not be able to start execution when it becomes ready if another instruction is using the corresponding functional unit, or if some higher priority instructions (earlier than $I_i$ in program order) are also ready. Therefore, $t^{start}_v \geq t^{ready}_v$.

• $t^{finish}_v$: Finish time of node $v$, defined as the time when it completes execution. Pipeline stages other than EX need only one cycle to execute; for them, $t^{finish}_v = t^{start}_v + 1$. For the EX stage, we add the minimum (maximum) latency of the functional unit to $t^{start}_v$ when we compute its earliest (latest) finish time.

• $separated[u, v]$: If the executions of the two nodes $u$ and $v$ cannot overlap, then $separated[u, v]$ is assigned true; otherwise, they might overlap and it is assigned false.

• $instr\_id(v)$: The instruction id corresponding to a node $v$.

• $early\_contenders(v)$: Contending instructions that appear earlier in program order, i.e., the set of nodes $u$ s.t. $(u, v) \in CE_B$ and $instr\_id(u) < instr\_id(v)$. Recall that $CE_B$ denotes the contention relation among the nodes in the execution graph of basic block $B$.

• $late\_contenders(v)$: Contending instructions that appear later in program order, i.e., the set of nodes $u$ s.t. $(u, v) \in CE_B$ and $instr\_id(u) > instr\_id(v)$.

• $min\_lat_v, max\_lat_v$: Minimum and maximum execution latencies of node $v$. Note that in actual execution, node $v$ may only take some of the discrete values in the interval $[min\_lat_v, max\_lat_v]$. For example, a cache hit takes $min\_lat_v$ clock cycles and a cache miss takes $max\_lat_v$ cycles. In the estimation, however, by operating on the interval instead of discrete values, it is implicitly assumed that all discrete values within this interval are possible.
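For concreteness, the per-node bookkeeping implied by this notation can be pictured as the following record (a minimal sketch; the field names are ours, not those of the thesis implementation):

from dataclasses import dataclass

INF = float("inf")

@dataclass
class NodeTimes:
    # Sketch: timing bounds tracked for one execution-graph node.
    min_lat: int                   # minimum execution latency
    max_lat: int                   # maximum execution latency
    earliest_ready: float = 0.0    # lower bounds on event times
    earliest_start: float = 0.0
    earliest_finish: float = 0.0
    latest_ready: float = INF      # upper bounds, tightened iteratively
    latest_start: float = INF
    latest_finish: float = INF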
Summary of our method As mentioned earlier, our problem is not equivalent to finding the longest path in the execution graph, due to resource contentions and dependencies. We account for the timing effects of the dependencies by using a modified longest path algorithm that traverses the nodes in topologically sorted order. This topological traversal ensures that when a node is visited, the completion times of all its predecessors are known. To model the effect of resource contentions, we conservatively estimate an upper bound on the delay due to contentions for a functional unit by other instructions. A single pass of the modified longest path algorithm computes loose bounds on the lifetime of each node. These bounds are used to identify nodes with disjoint lifetimes; such nodes are not allowed to contend in the next pass of the longest path search, yielding tighter bounds. These two steps repeat until either there is no change in the bounds or a pre-defined number of iterations has elapsed.
1:  separated[.,.] := false; step := 0;
2:  foreach node v ∈ V do
3:      earliest[t^{start}_v] := 0; earliest[t^{finish}_v] := min_lat_v;
        latest[t^{start}_v] := ∞; latest[t^{finish}_v] := ∞;
4:  repeat
5:      LatestTimes(G); EarliestTimes(G);
6:      foreach u, v ∈ V do
7:          if earliest[t^{ready}_v] ≥ latest[t^{finish}_u] then separated[u, v] := true;
8:          if earliest[t^{ready}_u] ≥ latest[t^{finish}_v] then separated[u, v] := true;
9:      step := step + 1;
    until separated[.,.] are unchanged or step > limit;
10: WCET := latest[t^{finish}_{CM(I_n)}];   /* I_n is the last instruction of the basic block */

Algorithm 1: WCET Estimation for Execution Graph G = (V, DE)
1:  latest[t^{ready}_{IF(I_1)}] := 0;   /* I_1 is the first instruction of the basic block */
2:  foreach node v ∈ V in topologically sorted order do
3:      latest[t^{start}_v] := latest[t^{ready}_v];
4:      S_late := late_contenders(v) ∩ {u | ¬separated[u, v] ∧ earliest[t^{start}_u] < latest[t^{ready}_v]};
5:      if S_late ≠ ∅ then
6:          latest[t^{start}_v] := min( max_{u ∈ S_late} latest[t^{finish}_u], latest[t^{ready}_v] + max_lat_v − 1 );
7:      S_early := early_contenders(v) ∩ {u | ¬separated[u, v]};
8:      if S_early ≠ ∅ then
9:          latest[t^{start}_v] := min( max_{u ∈ S_early} latest[t^{finish}_u],
10:                                     latest[t^{start}_v] + |S_early| × max_lat_v );
11:     latest[t^{finish}_v] := latest[t^{start}_v] + max_lat_v;
12:     foreach immediate successor w of v do
13:         latest[t^{ready}_w] := max( latest[t^{ready}_w], latest[t^{finish}_v] );

Algorithm 2: LatestTimes(G = (V, DE))
1:  earliest[t^{ready}_{IF(I_1)}] := 0;   /* I_1 is the first instruction of the basic block */
2:  foreach node v ∈ V in topologically sorted order do
3:      earliest[t^{start}_v] := earliest[t^{ready}_v];
4:      S_late := late_contenders(v) ∩ {u | ¬separated[u, v] ∧ latest[t^{start}_u] < earliest[t^{ready}_v] < earliest[t^{finish}_u]};
5:      S_early := early_contenders(v) ∩ {u | ¬separated[u, v] ∧ latest[t^{start}_u] ≤ earliest[t^{ready}_v] < earliest[t^{finish}_u]};
6:      S := S_late ∪ S_early;
7:      if S ≠ ∅ then
8:          earliest[t^{start}_v] := max( earliest[t^{start}_v], min_{u ∈ S} earliest[t^{finish}_u] );
9:      earliest[t^{finish}_v] := earliest[t^{start}_v] + min_lat_v;
10:     foreach immediate successor w of v do
11:         earliest[t^{ready}_w] := max( earliest[t^{ready}_w], earliest[t^{finish}_v] );

Algorithm 3: EarliestTimes(G = (V, DE))
Estimation Algorithm Algorithm 1 gives the outline for computing the WCET given an execution graph $G = (V, DE)$ corresponding to a basic block. The top level algorithm iteratively performs two operations: timing bounds computation and separation analysis. The first operation is done by LatestTimes and EarliestTimes, which compute the upper and lower timing bounds of the nodes. The second operation is done by re-assigning the values of $separated[u, v]$ for all nodes $u, v$. Basically, we find pairs of nodes $(u, v)$ whose lifetimes are guaranteed not to overlap; for these nodes we set $separated[u, v]$ to true. How do we find pairs of nodes with non-overlapping lifetimes? Given two nodes $u$ and $v$ in the execution graph, we simply set $separated[u, v]$ to true if $earliest[t^{ready}_u] \geq latest[t^{finish}_v]$ or $earliest[t^{ready}_v] \geq latest[t^{finish}_u]$.² Thus, the tighter the time intervals obtained, the more pairs of nodes can be identified as separated. Conversely, the more separated pairs identified, the tighter the timing intervals computed in subsequent iterations, due to the smaller number of competing nodes.
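A sketch of this separation check, using the interval record introduced earlier, is shown below (an illustration under our assumed data layout, with separated kept as a set of node-id pairs):

def update_separations(nodes, separated):
    # Sketch: one round of separation analysis; returns True if any new
    # pair of nodes with provably disjoint lifetimes was found.
    changed = False
    for u in nodes:
        for v in nodes:
            if u is v or (id(u), id(v)) in separated:
                continue
            # lifetimes cannot overlap if one node becomes ready only
            # after the other is guaranteed to have finished
            if (u.earliest_ready >= v.latest_finish
                    or v.earliest_ready >= u.latest_finish):
                separated.add((id(u), id(v)))
                changed = True
    return changed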
Algorithm 2 computes the latest ready, start, and finish times for each node of the execution graph. The latest start time of node $v$, denoted $latest[t^{start}_v]$, is computed according to (a) its latest ready time $latest[t^{ready}_v]$ (which is obtained from the latest finish times of its predecessors), and (b) its contenders. We first consider the delay of $v$'s start time by contenders later in program order. Note that the start time of node $v$ can be delayed by at most one late contender. Obviously, a late contender $u \in late\_contenders(v)$ cannot delay $v$ after $v$ is ready (since $v$ has higher priority); therefore, late contenders that do not satisfy the condition $earliest[t^{start}_u] < latest[t^{ready}_v]$ are excluded. We also exclude the contenders that have been identified as separated from $v$ (i.e., whose lifetimes cannot overlap with $v$'s). The delay from a late contender $u$ is bounded by $u$'s latest finish time $latest[t^{finish}_u]$. In addition, $u$ cannot delay $v$ by more than its maximum latency; thus, we have another bound $latest[t^{ready}_v] + max\_lat_v - 1$, where $max\_lat_u = max\_lat_v$ is the maximum latency of the contended functional unit. The minimum of the two bounds is taken.

²There exist more sophisticated techniques for finding nodes with disjoint lifetimes in a graph, e.g., see [53]. In our experiments we found that our simplified approach for identifying separated nodes substantially increases the efficiency of our WCET analysis.
Apart from the delay due to late contenders of node $v$, we also need to estimate the delay in $v$'s start time due to its early contenders. Note that the early contenders appear before $v$ in program order. So in the worst case, all of them, except those proved to be separated from $v$ (i.e., not overlapping with $v$'s lifetime), can contend with $v$ and delay its start time. This is captured on Lines 7–10 of Algorithm 2. First, it is obvious that the delay due to early contention cannot extend beyond the time when all the early contenders have completed execution, which is bounded by $\max_{u \in S_{early}} latest[t^{finish}_u]$. On the other hand, the maximum delay is also bounded by $|S_{early}| \times max\_lat_v$, where each early contender executes for its maximum latency.

The latest finish time of $v$ is obtained by simply adding the maximum latency of the functional unit to $latest[t^{start}_v]$ (Line 11). This is because an instruction cannot get preempted once it has started execution on a functional unit. The immediate successors of $v$ get their latest ready times updated if $v$'s latest finish time is higher than the current approximation of their latest ready times (Lines 12–13). In this way the LatestTimes algorithm estimates the latest ready/start/finish times of each node in the execution graph.
Similar to the LatestTimes algorithm, the EarliestTimes algorithm (see Algorithm 3) computes the earliest ready, start, and finish times of all nodes in the execution graph. The main difference is that we allow a node $u$ to contend with, and thereby delay the earliest start time of, a node $v$ only if the contention can be guaranteed. A formal proof of the correctness of the algorithms is given in Appendix A.1.
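Putting the pieces together, the driver of Algorithm 1 can be sketched as follows (illustrative Python; latest_times and earliest_times stand for Algorithms 2 and 3, and graph.node / graph.nodes are assumed accessors):

def estimate_wcet(graph, limit=10):
    # Sketch: the top-level fixed-point iteration of Algorithm 1.
    separated = set()
    for step in range(limit):
        latest_times(graph, separated)    # Algorithm 2: upper bounds
        earliest_times(graph, separated)  # Algorithm 3: lower bounds
        if not update_separations(graph.nodes, separated):
            break                         # no new separations: fixed point
    # latest finish time of CM(I_n), the commit of the last instruction
    return graph.node("CM", graph.num_instrs - 1).latest_finish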
4.2.2 Estimation for a Basic Block with Context
In the last section, our technique for estimating the WCET of a basic block $B_i$ was based on the simplifying assumptions that the execution of instructions outside $B_i$ does not interact with $B_i$'s execution and that the initial pipeline state is empty. These are, however, unrealistic assumptions. In this section, we extend our technique to consider the instructions preceding and succeeding $B_i$.

The execution context of a basic block $B_i$ is defined in terms of the instructions that directly affect the timing of $B_i$'s execution. To model the execution time of a basic block $B_i$, we need to consider (1) contentions and data dependencies among instructions prior to $B_i$ and instructions in $B_i$, and (2) contentions between instructions in $B_i$ and instructions after $B_i$.³ The instructions before (after) a basic block $B_i$ that directly affect the execution time of $B_i$ constitute the contexts of $B_i$ and are called the prologue (epilogue) of $B_i$. For example, assuming a 2-entry I-buffer and a 4-entry ROB, at most $(4+2)-1 = 5$ instructions can be in the pipeline when $B_i$ enters the pipeline. Similarly, due to the 4-entry ROB, at most $4-1 = 3$ instructions after $B_i$ can contend with instructions in $B_i$. Of course, a basic block $B_i$ may have multiple prologues and epilogues corresponding to the different paths along which $B_i$ can be entered or exited. To capture the effects of contexts, our analysis constructs execution graphs corresponding to all possible combinations of prologues and epilogues. Each execution graph consists of three parts: the prologue, the basic block itself (called the body) and the epilogue.
Time Intervals for Prologue Nodes Figure 4.3 shows a prologue with 5 instructions preceding the body. We need to estimate the time intervals of the ready/start/finish times of prologue nodes in order to compute their effects on body nodes.

³Here, we only consider contentions but not dependencies, because data dependencies between $B_i$ and the instructions after $B_i$ can only delay the latter, not the execution of $B_i$.

[Figure 4.3 shows the execution graph of a 5-instruction prologue $I_{-4}, \ldots, I_0$; each instruction has a row of nodes $IF$, $ID$, $EX$, $WB$, $CM$, and the nodes with a path to $IF(I_1)$ are shaded.]

Figure 4.3: An Example Prologue
    latest[t^{ready}_{IF(I_1)}] := 0;   /* I_1 is the first instruction in the basic block */
1:  foreach node v ∈ prologue do
2:      shaded[v] := false;
3:      if paths(v, IF(I_1)) ≠ ∅ then
4:          shaded[v] := true;
5:          latest[t^{finish}_v] := −max_{π ∈ paths(v, IF(I_1))} Σ_{x ∈ nodes(π)} min_lat_x;   /* Equation 4.2 */
6:          latest[t^{start}_v] := latest[t^{finish}_v] − min_lat_v;  latest[t^{ready}_v] := latest[t^{start}_v];
    /* I_{−p} is the instruction just before the prologue */
7:  latest[t^{ready}_{CM(I_{−p})}] := −max_{π ∈ paths(CM(I_{−p}), IF(I_1))} Σ_{x ∈ nodes(π)} min_lat_x − 1;
8:  foreach node v ∈ prologue in topologically sorted order where shaded[v] = false do
9:      latest[t^{ready}_v] := max_{u ∈ pred(v)} latest[t^{finish}_u];
10:     latest[t^{ready}_v] := max( latest[t^{ready}_v], latest[t^{ready}_{CM(I_{−p})}] );
11:     latest[t^{start}_v] := latest[t^{ready}_v] + max_lat_v − 1;   /* conservative late contention */
12:     S_early := early_contenders(v);
13:     if S_early ≠ ∅ then
14:         latest[t^{start}_v] := min( max_{u ∈ S_early} latest[t^{finish}_u],
15:                                     latest[t^{start}_v] + |S_early| × max_lat_v );
16:     latest[t^{finish}_v] := latest[t^{start}_v] + max_lat_v;

Algorithm 4: Estimation of latest times of prologue nodes
As the execution context of the prologue itself is not known, we conservatively estimate the time intervals as follows. We set the ready time of $IF(I_1)$ to 0 and then derive the time intervals of the prologue nodes with respect to the ready time of $IF(I_1)$. Algorithm 4 shows the computation of the latest ready, start, and finish times of the nodes in the prologue.
First, we observe that certain nodes in the prologue (shaded in Figure 4.3) have at least one path to the node $IF(I_1)$, where $I_1$ is the first instruction in the body, that is, the basic block being analyzed. The latest finish time of a shaded prologue node is clearly bounded by $latest[t^{ready}_{IF(I_1)}] = 0$. Let $v$ be a node in the prologue with a path to $IF(I_1)$. Consider any path $\pi$ connecting $v$ and $IF(I_1)$, and let $nodes(\pi)$ be the nodes in $\pi$ appearing between $v$ and $IF(I_1)$. Clearly

$$t^{finish}_v \leq t^{ready}_{IF(I_1)} - \sum_{x \in nodes(\pi)} min\_lat_x \quad (4.1)$$

where $min\_lat_x$ is the minimum latency of node $x$. That is, the finish time of shaded prologue node $v$ cannot be later than the right-hand-side expression in Inequality 4.1 even assuming an ideal execution where each node along the path from $v$ to $IF(I_1)$ (a) becomes ready immediately at the completion of execution of its predecessor, (b) starts execution as soon as it becomes ready (i.e., there is no delay due to contention) and (c) executes as fast as possible by taking the minimum latency. Clearly, Inequality 4.1 holds for all paths between $v$ and $IF(I_1)$. Therefore, for any shaded prologue node $v$ (i.e., a node with a path to $IF(I_1)$) we can estimate the latest finish time of $v$ as

$$latest[t^{finish}_v] \leq t^{ready}_{IF(I_1)} - \max_{\pi \in paths(v, IF(I_1))} \sum_{x \in nodes(\pi)} min\_lat_x \quad (4.2)$$

where $paths(v, IF(I_1))$ is the set of paths between $v$ and $IF(I_1)$ in the execution graph with prologue/epilogue. Since we compute the time intervals for prologue nodes relative to the ready time of $IF(I_1)$, we can set $latest[t^{ready}_{IF(I_1)}] = 0$ in Inequality 4.2; this is shown on Line 5 of Algorithm 4. In this way we compute the latest finish times of prologue nodes which have a path to $IF(I_1)$. Given the latest finish times, it is straightforward to estimate the latest start and ready times of these nodes (Line 6 of Algorithm 4).
For the rest of the prologue nodes (unshaded nodes in Figure 4.3), the latest time calculation is similar to Algorithm 2 with some modifications (see Lines 8–16 of Algorithm 4). First, the processing of the nodes proceeds in topologically sorted order. Thus, each of the unshaded nodes, when visited, has at least one predecessor node whose latest finish time has already been computed. The ready time of an unshaded node is estimated as the maximum of the finish times of its immediate predecessors (Line 9 of Algorithm 4). However, we still have not accounted for the immediate predecessors that belong to the pre-prologue part. This effect is conservatively estimated on Line 10 of Algorithm 4. We observe that all pre-prologue nodes must have completed execution by the time the commit stage of the last pre-prologue instruction ($CM(I_{-p})$, where $p$ is the length of the prologue) is ready. Since $CM(I_{-p})$ has a path to $IF(I_1)$, its latest ready time can be computed easily (Line 7 of Algorithm 4). We bound the ready time of the unshaded prologue nodes by the ready time of $CM(I_{-p})$ to take care of the dependencies from the pre-prologue nodes. The latest start time of an unshaded prologue node is estimated conservatively from the latest ready time by taking into account the effect of contentions. First, we conservatively assume that late contention is always present; by definition, at most one late contender can delay an instruction (Line 11). For early contenders, we do not need to look beyond the prologue as (1) all the pre-prologue nodes have completed execution by the ready time of the node $CM(I_{-p})$ and (2) the ready times of the prologue nodes have been bounded by the ready time of $CM(I_{-p})$ on Line 10. The maximum delay due to early contenders is estimated in a manner similar to Algorithm 2 (Lines 13–15 of Algorithm 4).
Earliest times of prologue nodes do not affect the WCET estimation significantly. Therefore, we conservatively assume the earliest ready, start, and finish times of the prologue nodes to be $-\infty$.
Time intervals for epilogue nodes Time intervals for epilogue nodes are initialized and iteratively tightened in much the same way as in Algorithms 2 and 3, with one difference: the EX nodes of the last $ROB\_size - 1$ epilogue instructions may have late contenders beyond the epilogue; therefore we conservatively assume maximum late contention for each of them when latest times are estimated.
Time intervals for body nodes Given the time intervals for prologue and epilogue nodes, the timing estimation of body nodes (i.e., the nodes in the basic block we are analyzing) still follows Algorithms 2 and 3. The only difference is that the dependencies and contentions from the prologue nodes and the late contentions from the epilogue nodes are taken into account in the estimation process.
Overlapped execution For a basic block $B_i$ with instructions $I_1, \ldots, I_n$, the execution time estimate of $B_i$ can be calculated as the time between the fetch of $I_1$ and the commit of $I_n$, that is, $t^{finish}_{CM(I_n)} - t^{ready}_{IF(I_1)}$. However, this definition does not produce tight timing estimates, because the executions of two or more successive basic blocks overlap in the pipeline.
Definition 4.3. The overlap $\delta$ between a basic block $B_i$ and its preceding basic block $B_j$ is the period during which instructions from both basic blocks are in the pipeline, that is,

$$\delta = t^{finish}_{CM(I_0)} - t^{ready}_{IF(I_1)}$$

where $I_0$ is the last instruction of block $B_j$ and $I_1$ is the first instruction of block $B_i$.
We want to avoid duplicating the overlap in the time estimates of successive basic blocks. Therefore, we calculate the execution time estimate of a basic block with a given context as follows.

Definition 4.4. For a basic block $B_i$ with instructions $I_1, \ldots, I_n$ executed under a context (prologue and epilogue) $ctx$, its estimated execution time, denoted $cost^{ctx}_i$, is the interval from the time when the instruction immediately preceding the basic block commits to the time when the last instruction of the basic block commits, that is,

$$cost^{ctx}_i = t^{finish}_{CM(I_n)} - t^{finish}_{CM(I_0)}$$

where $I_0$ is the instruction immediately prior to $B_i$.

Note that the first basic block of the program does not have any preceding instructions. As a special case, we calculate its execution time as the time between the fetch of its first instruction and the commit of its last instruction.
Now, we estimate $cost^{ctx}_i$ for basic block $B_i$ with respect to the time at which the first instruction $I_1$ of $B_i$ is fetched, i.e., $t^{ready}_{IF(I_1)} = 0$. Combining Definitions 4.3 and 4.4,

$$cost^{ctx}_i = t^{finish}_{CM(I_n)} - t^{finish}_{CM(I_0)} = t^{finish}_{CM(I_n)} - \delta \leq latest[t^{finish}_{CM(I_n)}] - \delta$$

where $latest[t^{finish}_{CM(I_n)}]$ is calculated by our LatestTimes algorithm. The smallest value of the overlap $\delta$ is bounded by

$$\delta \geq \min_{u \in pred(IF(I_1))} \; \max_{\pi \in paths(u, CM(I_0))} \; \sum_{x \in nodes(\pi)} min\_lat_x \quad (4.5)$$

where $pred(IF(I_1))$ denotes the set of immediate predecessors of $IF(I_1)$ in the execution graph.

Proof. Let $u$ be the node among $IF(I_1)$'s immediate predecessors with the longest (maximum) finish time. Then,

$$t^{ready}_{IF(I_1)} = t^{finish}_u$$

Furthermore, for any path $\pi \in paths(u, CM(I_0))$,

$$t^{finish}_{CM(I_0)} \geq t^{finish}_u + \sum_{x \in nodes(\pi)} min\_lat_x$$

This is because $CM(I_0)$ can become ready only after its predecessors along the paths from $u$ have executed. Therefore,

$$\delta = t^{finish}_{CM(I_0)} - t^{ready}_{IF(I_1)} \geq \max_{\pi \in paths(u, CM(I_0))} \; \sum_{x \in nodes(\pi)} min\_lat_x$$

Since the identity of $u$ is not known statically, taking the minimum over all immediate predecessors of $IF(I_1)$ yields Inequality 4.5.
Above we have proved that the overlap is lower-bounded by the right hand side of
Inequality 4.5, which will be used as the estimated minimum overlap. The complete
proof for the correctness of the estimation for a basic block with context can be found
in Appendix A.2.
Putting it all together Note that $cost^{ctx}_i$ is obtained for a specific prologue and a specific epilogue of $B_i$. Since a basic block in general has multiple choices of prologues and epilogues, they might result in different estimates. So, we estimate $B_i$'s execution time under all possible combinations of prologues and epilogues, denoted $CTX_i$, and set $cost_i = \max_{ctx \in CTX_i} (cost^{ctx}_i)$, where $cost_i$ is the WCET of $B_i$ used in the WCET objective function

$$Time = \sum_i cost_i \times v_i$$

where $v_i$ is the execution count of basic block $B_i$. This objective function is maximized over the constraints on $v_i$ given by control flow equations, loop bounds and user-provided infeasible flow information. This is done by using an Integer Linear Programming solver like CPLEX.
4.3 Experimental Evaluation
In this section, we evaluate the accuracy of our estimation technique on the twelve benchmarks listed in Table 2.1. All benchmarks except matsum contain variable-latency arithmetic instructions, e.g., integer multiplications and floating-point operations.

The pipeline configuration for our experiments is as follows. It has a 4-entry I-buffer and an 8-entry ROB, and it contains the following variable-latency functional unit types: (a) an integer multiplication unit with 1–4 cycle latency, (b) a floating point add unit with 1–2 cycle latency, and (c) a floating point multiplication unit with 1–12 cycle latency. In addition, the processor has an integer ALU unit and a load/store unit, each with one cycle latency. We assume single-cycle latency for the load/store unit because we have not modeled the data cache. Since instruction caching and branch prediction have not been modeled so far, we simply assume that every instruction fetch takes a single clock cycle and that every branch instruction is correctly predicted, i.e., there is no pipeline stall caused by these two events.
Table 4.1 presents the observed WCET (column Obs. WCET) and the estimated WCET (column Est. WCET), both in clock cycles. As can be seen from the ratio column (Est. WCET / Obs. WCET), the estimated WCET is not far from the observed WCET for most benchmarks, especially considering that the difference between the actual and the observed WCET is unknown. There are two main sources of overestimation: (1) overestimation from program path analysis, i.e., the bounds on execution counts of basic blocks used in the estimation are often higher than the actual execution counts during simulation; and (2) overestimation from pipeline analysis, i.e., for each basic block, the pipeline analysis might introduce some amount of pessimism compared to the actual worst case of the basic block.

Program     Obs. WCET   Est. WCET   Ratio   Analysis Time(sec.)   ILP Solving Time(sec.)
adpcm       142169      208722      1.47    1.12                  0.01
compress    4477        6194        1.38    1.24                  0.01
dhry        115612      118715      1.03    0.78                  0.01
fdct        3959        4137        1.04    0.02                  0.01
fft         855017      976147      1.14    0.11                  0.01
fir         41580       51477       1.24    0.26                  0.01
ludcmp      9220        10976       1.19    0.13                  0.01
matmul      14078       18079       1.28    0.02                  0.01
matsum      100812      100816      1.00    0.02                  0.01
minver      5801        7023        1.21    0.39                  0.01
qurt        1613        1979        1.23    0.29                  0.01
whet        850042      941632      1.11    0.37                  0.01

Table 4.1: Accuracy and Performance of Out-of-Order Pipeline Analysis
To see how much of the overestimation is caused by pipeline analysis, we use the execution counts of basic blocks observed in simulation as user constraints for the analysis. Figure 4.4 compares the overall overestimation to the pipeline-only overestimation (benchmarks with a single execution path are excluded from this figure as they suffer no overestimation from program path analysis). For adpcm and compress, program path analysis contributes a significant portion of the overestimation. For the remaining benchmarks, pipeline overestimation is dominant, but it remains reasonably low.
The pipeline analysis time and ILP solving time (counted in seconds) for the
benchmarks are given in the last two columns of Table 4.1. As we can see, both the pipeline analysis and the ILP solving take very little time. The reason for the short analysis time is that Algorithm 1 reaches its fixed point very quickly: we have observed that at most three iterations were spent on the estimation of any basic block, although the number of iterations needed in general is unknown. It is likely that more iterations are needed when the estimation algorithms are changed to target different architectures. The reason for the short ILP solving time is that the complexity of the ILP problem comes only from the program flow, not from microarchitecture modeling (pipeline analysis in this context). This will not be true when branch prediction and instruction caching are modeled in subsequent chapters, where significant ILP solving time comes from the modeling of these two hardware features.

[Figure 4.4 compares, for each multi-path benchmark (including adpcm, ludcmp, minver and qurt), the overall overestimation with the pipeline-only overestimation.]

Figure 4.4: Overall and Pipeline Overestimations
4.4 Summary

Timing anomalies in out-of-order processors complicate Worst Case Execution Time (WCET) analysis by invalidating the assumption that the local worst case always leads to the global worst case. On the other hand, an exhaustive enumeration of all possible local cases would be quite inefficient. In this chapter, we have modeled an out-of-order processor pipeline for WCET analysis. The key idea behind our approach is to avoid exhaustive enumeration by bounding the time intervals at which the events in pipelined execution can occur. We have implemented our technique and experimentally validated its estimation accuracy against several standard benchmarks.
In this chapter, we study another popular microarchitectural feature: branch prediction. Branch prediction is used to address control hazards [30] on pipelined processors. If a prediction is correct, the corresponding control hazard is overcome; otherwise a misprediction penalty is incurred. Apart from misprediction penalties, branch prediction also exerts indirect effects on the performance of other microarchitectural features, such as the instruction cache. As the processor caches instructions along the mispredicted path, the instruction cache content is modified by the time the branch is resolved. This prefetching of instructions can have both constructive and destructive effects on cache performance and hence on the WCET.

Clearly, we cannot assume perfect branch prediction for the purposes of WCET analysis. This assumption may result in an incorrect WCET (i.e., lower than the actual value), particularly when a hard-to-predict conditional statement (if-then-else) is present inside a loop body and contributes substantially to a program's WCET. Alternatively, certain works assume that all branches in a program are mispredicted. This pessimism results in significant overestimation of the WCET, as branch prediction accuracy is quite high for loop control branches.
Our effort in this chapter is to develop techniques to bound the occurrences of mispredictions and to model their interactions with the instruction cache. To focus on this task, we do not consider pipeline effects here; thus all variations in the execution times of basic blocks are caused solely by mispredictions and cache misses. The integration with pipeline modeling will be discussed in the next chapter.
We propose an Integer Linear Programming (ILP) based framework to model
branch prediction as well as its interaction with the instruction cache. We use ILP because the global nature of branch prediction behavior requires global program path information, which is provided by the ILP-based WCET calculation method we use. Our branch prediction modeling is generic and parameterizable with respect to currently used branch prediction schemes. Effects of branch misprediction on cache performance are integrated into our framework by extending previous work on instruction cache modeling [43]. Based on the branch prediction scheme and cache organization, our modeling derives linear constraints from the control flow graph of a program. These constraints are fed to an ILP solver for computing an upper bound on the program's execution time.
The rest of this chapter is organized as follows. We first examine dynamic branch prediction mechanisms from the modeling perspective and present our modeling technique in Section 5.1. In Section 5.2 we show the combined analysis of branch prediction and instruction caching. In Section 5.3 we show by experimentation that our technique yields tight estimates. We conclude this chapter in Section 5.4.
5.1 Modeling Branch Prediction
In this section, we discuss the modeling of dynamic branch prediction schemes for WCET analysis. Recall that dynamic schemes make predictions according to the execution history. They commonly use a branch prediction table to store past branch outcomes and make predictions according to the stored information; they differ in the way the prediction table is indexed. In the GAg scheme, a shift register called the branch history register (BHR), which stores the outcomes of the $n$ most recent branches, is used as the index into the prediction table; the entry indexed by the BHR provides the prediction and is updated with the outcome of the current branch. In the local scheme, the prediction table is indexed by the $n$ lower order bits of the branch address. Other schemes such as gshare and gselect use a combination of the BHR and the address of the branch as the index into the prediction table (details were given in Section 2.1.2). Our technique can model all of the above dynamic schemes.
5.1.1 The Technique
Issues in modeling branch prediction We proceed to examine the difficulties in
modeling branch prediction for worst case execution time analysis. So far, microar-
chitectural features such as pipelining and instruction caching have been modeled for
WCET analysis. In the presence of these features, the execution time of an instruc-
tion may depend on the past execution trace. For pipelining, these dependencies are
typically local. That is, the execution time of an instruction may depend only on the
past few instructions which are still in the pipeline. To model instruction caching
and branch prediction, global analysis is required. This is because the effect of an
instruction’s execution on caches and branch predictors could affect the execution of
remote instructions. However, there are two significant differences between the global
analysis of the instruction caching and of branch prediction.
Both instruction caching and branch prediction maintain global data structures that record information about the past execution trace, namely the cache and the branch prediction table. For instruction caching, a given instruction can reside in only one row of the cache: if it is present, it is a cache hit; otherwise, it is a cache miss.¹ Local branch prediction is quite similar: outcomes of a given branch instruction are stored in only one fixed entry of the prediction table, where predictions are made. However, for global branch prediction schemes, a given branch instruction may use different entries of the prediction table at different points of execution. Given a branch instruction $I$, a global branch prediction scheme uses the history $H_I$ (the outcomes of the last few branches before arriving at $I$) to decide the prediction table entry. Because it is possible to arrive at $I$ with various histories, the prediction for $I$ can use different entries of the prediction table at different points of execution.

¹To be precise, in associative caches, an address can be present in only one cache set.
The other difference between instruction caching and branch prediction modeling is obvious. In the case of instruction caching, if two instructions $I$ and $I'$ compete for the same cache entry, then the flow of control either from $I$ to $I'$ or from $I'$ to $I$ will always cause a cache miss. For branch prediction, however, even if two branch instructions $I$ and $I'$ map to the same entry in the prediction table, the flow of control between them does not imply a correct or an incorrect prediction. Their competition for the same entry may have a constructive or a destructive effect in terms of branch prediction, depending on the outcomes of the branches $I$ and $I'$.
For ease of discussion, we take GAg, a global branch prediction scheme described
in Section 2.1.2, as a modeling example. However, our modeling is generic and not
restricted to GAg (as will be shown in Section 5.1.3). In fact, the default scheme in
our experiments is the more popular gshare scheme.
Control Flow Graph (CFG) The starting point of our analysis is the control
flow graph of the program, from which we can derive program flow constraints, as
described in Section 2.3.1.
Defining WCET In Section 2.3.1, the WCET is defined by Equation 2.1 under the assumption that the cost of a basic block is a constant. In the presence of branch prediction, however, the cost of a basic block under a misprediction is higher than its cost under a correct prediction, and the WCET definition should be modified to reflect this difference. Suppose the cost of $B_i$ under a correct prediction is $cost_i$ and the misprediction penalty is $bmp$; then the cost under a misprediction is $cost_i + bmp$. Let $bm_i$ be the misprediction count of $B_i$ (thus $B_i$ is correctly predicted $v_i - bm_i$ times). The total execution time is then

$$Time = \sum_i \left( cost_i \ast (v_i - bm_i) + (cost_i + bmp) \ast bm_i \right)$$

The first term is the sum of execution times under correct predictions and the second term is the sum of execution times under mispredictions. The WCET objective function thus simplifies to

$$Time = \sum_i \left( cost_i \ast v_i + bmp \ast bm_i \right) \quad (5.1)$$

To find the worst case execution time, we need to maximize this objective function. For this purpose, we need to derive constraints on $bm_i$.
Introducing History Patterns To predict the direction of the branch in $B_i$, first the index into the prediction table is computed. In the case of GAg, this index is the outcome of the last $k$ branches before $B_i$ is executed, recorded in the $k$-bit Branch History Register (BHR). Thus, if $k = 2$ and the last two branches are taken (1) followed by not taken (0), then the index will be 10. We define annotated execution counts and misprediction counts $v^{\pi}_i$ and $bm^{\pi}_i$, corresponding to the execution of $B_i$ with $BHR = \pi$ when $B_i$ is reached. Similarly, $e^{\pi}_{i \rightarrow j}$ denotes the number of times the edge $i \rightarrow j$ is traversed with $BHR = \pi$ at the beginning of basic block $B_i$. By definition,

$$v_i = \sum_{\pi} v^{\pi}_i \qquad bm_i = \sum_{\pi} bm^{\pi}_i \qquad e_{i \rightarrow j} = \sum_{\pi} e^{\pi}_{i \rightarrow j}$$

For each $B_i$ and history $\pi$, we find out whether it is possible to reach $B_i$ with history $\pi$. This information can be obtained via a terminating least fixed point analysis on the control flow graph. Clearly, if it is not possible to reach $B_i$ with $\pi$, then $v^{\pi}_i = bm^{\pi}_i = 0$.
Control flow among history patterns To provide an upper bound on $bm^{\pi}_i$, we first define constraints on $v^{\pi}_i$ (since $bm^{\pi}_i \leq v^{\pi}_i$). This is done by modeling the change in history along the control flow graph.
Definition 5.1. Let $label(i \rightarrow j)$ be an annotation on an edge $i \rightarrow j$ of the CFG, which is given a value according to the following rules: $label(i \rightarrow j) = U$ if $i \rightarrow j$ implies unconditional flow; $1$ if $i \rightarrow j$ implies the branch at $i$ is taken; $0$ if $i \rightarrow j$ implies the branch at $i$ is not taken.

Definition 5.2. Let $\pi$ be a history pattern with $k$ bits (the width of the Branch History Register) at $B_i$. It is composed of the sequence of outcomes of the most recent $k$ branches, with the latest outcome at the rightmost bit. The change in history pattern along $i \rightarrow j$ is given by: $\Gamma(\pi, i \rightarrow j) = \pi$ if $label(i \rightarrow j) = U$; $left(\pi, 0)$ if $label(i \rightarrow j) = 0$; $left(\pi, 1)$ if $label(i \rightarrow j) = 1$, where $left(\pi, 0)$ ($left(\pi, 1)$) shifts pattern $\pi$ to the left by one bit (the old leftmost bit is therefore discarded) and puts 0 (1) as the rightmost bit.
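In code, $\Gamma$ is just a one-bit shift on a bit-string representation of the pattern. A minimal sketch (ours, for illustration):

def gamma(pi, label):
    # Sketch: the Gamma function of Definition 5.2; pi is a k-bit pattern
    # as a string of '0'/'1', label is 'U', '0' or '1'.
    if label == "U":
        return pi
    return pi[1:] + label  # shift left, drop oldest bit, append new outcome

For instance, gamma("011", "1") returns "111".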
Now, $B_i$ can execute with history $\pi$ only if there exists $B_j$ executing with history $\pi'$ such that $\Gamma(\pi', j \rightarrow i) = \pi$. Note that for any such incoming edge $j \rightarrow i$, there can be two history patterns $\pi'$ such that $\Gamma(\pi', j \rightarrow i) = \pi$. For example, if $label(j \rightarrow i) = 1$, then $\Gamma(011, j \rightarrow i) = \Gamma(111, j \rightarrow i) = 111$. Therefore, from the inflows along $B_i$'s incoming edges we get

$$v^{\pi}_i = \sum_{j \rightarrow i} \; \sum_{\pi' : \Gamma(\pi', j \rightarrow i) = \pi} e^{\pi'}_{j \rightarrow i}$$

Similarly, considering the outflows of $B_i$,

$$v^{\pi}_i = \sum_{i \rightarrow k} e^{\pi}_{i \rightarrow k}$$
Repetition of a history pattern Let us assume a misprediction of the branch in $B_i$ with history $\pi$. This means that certain blocks (perhaps $B_i$ itself) were executed with history $\pi$ such that the outcomes of these branches created a prediction different from the current outcome of $B_i$. Thus, to model mispredictions, we need to capture repeated occurrences of a history $\pi$ during the program's execution. For this purpose, we define $p^{\pi}_{i \leadsto j}$.

Definition 5.3. Let $B_i$ and $B_j$ be two basic blocks with branch instructions and $\pi$ be a history pattern. Then $p^{\pi}_{i \leadsto j}$ is the number of times a path is taken from $B_i$ to $B_j$ s.t.
• $\pi$ never occurs at a node with a branch instruction between $B_i$ and $B_j$,
• if $B_i \neq B_{start}$, then $\pi$ occurs at $B_i$, and
• if $B_j \neq B_{end}$, then $\pi$ occurs at $B_j$.

Intuitively, $p^{\pi}_{i \leadsto j}$ denotes the number of times control flows from $B_i$ to $B_j$ s.t. the $\pi$th row of the prediction table is used for branch prediction only at $B_i$ and $B_j$, and is never accessed in between. In these scenarios, the outcome of $B_i$ may cause a correct or an incorrect prediction at $B_j$. The variable $p^{\pi}_{start \leadsto i}$ ($p^{\pi}_{i \leadsto end}$) models the number of times the $\pi$th row of the prediction table is looked up for the first (last) time at $B_i$.

When the $\pi$th row is used for branch prediction at $B_i$, either the $\pi$th row is used for the first time (denoted by $p^{\pi}_{start \leadsto i}$) or the $\pi$th row was last used for branch prediction in some block $B_j \neq B_{start}$. Similarly, for every use of the $\pi$th row of the prediction table at $B_i$, either it is the last use (denoted by $p^{\pi}_{i \leadsto end}$) or the row will be used next in some $B_j \neq B_{end}$. Since $v^{\pi}_i$ denotes the number of times $B_i$ uses the $\pi$th row of the prediction table, we get

$$v^{\pi}_i = p^{\pi}_{start \leadsto i} + \sum_{j \neq start} p^{\pi}_{j \leadsto i} = p^{\pi}_{i \leadsto end} + \sum_{j \neq end} p^{\pi}_{i \leadsto j}$$

Also, there can be at most one first use, and at most one last use, of the $\pi$th row of the prediction table during program execution. Therefore, we get:

$$\sum_i p^{\pi}_{start \leadsto i} \leq 1 \qquad \text{and} \qquad \sum_i p^{\pi}_{i \leadsto end} \leq 1$$
Introducing branch outcomes To model mispredictions, we need to model not only the repetition of history patterns, but also branch outcomes. A misprediction occurs on differing branch outcomes for the same history pattern. Therefore, we partition the paths contributing to the count $p^{\pi}_{i \leadsto j}$ based on the branch outcome at $B_i$: $p^{\pi,1}_{i \leadsto j}$ and $p^{\pi,0}_{i \leadsto j}$ denote the execution counts of those paths that begin with the outgoing edge of $B_i$ labeled 1 (i.e., outcome 1) and 0, respectively. By definition,

$$p^{\pi}_{i \leadsto j} = p^{\pi,1}_{i \leadsto j} + p^{\pi,0}_{i \leadsto j}$$

Furthermore, the $p$ variables are related to the $e$ variables:

$$\sum_j p^{\pi,1}_{i \leadsto j} = e^{\pi}_{i \rightarrow k} \qquad \text{and} \qquad \sum_j p^{\pi,0}_{i \leadsto j} = e^{\pi}_{i \rightarrow l}$$

where $label(i \rightarrow k) = 1$ and $label(i \rightarrow l) = 0$. In other words, $i \rightarrow l$ and $i \rightarrow k$ are the outgoing edges of basic block $B_i$ with labels 0 and 1, respectively.
Modeling mispredictions For simplicity of exposition, let us assume that each row of the prediction table contains a one-bit prediction: 0 denotes a prediction that the branch will not be taken, and 1 denotes a prediction that the branch will be taken. However, our technique for estimating mispredictions is generic; it can be extended if the prediction table maintains more than one bit per entry. In particular, a recent work [4] has modeled an $n$-bit saturating counter in each row of the prediction table.

Recall that $bm^{\pi}_i$ denotes the number of mispredictions of the branch in $B_i$ when it is executed with history pattern $\pi$. There are two scenarios in which $B_i$ can be mispredicted with history $\pi$:

• Case 1: The branch of $B_i$ is taken. Since the actual control flow will go through the taken edge $i \rightarrow k$, we denote the misprediction count of this case as $em^{\pi}_{i \rightarrow k}$. Obviously, $em^{\pi}_{i \rightarrow k} \leq e^{\pi}_{i \rightarrow k}$. On the other hand, when a branch at $B_i$ is mispredicted as not taken, the prediction in row $\pi$ of the prediction table must be 0 (not taken). This is possible only if another block $B_j$ is executed with history $\pi$ and outcome 0, and history $\pi$ never appears between $B_j$ and $B_i$. The total number of such inflows into $B_i$ is at most $\sum_j p^{\pi,0}_{j \leadsto i}$; therefore

$$em^{\pi}_{i \rightarrow k} \leq \min\left( e^{\pi}_{i \rightarrow k}, \; \sum_j p^{\pi,0}_{j \leadsto i} \right)$$

• Case 2: The branch of $B_i$ is not taken. Since the actual control flow will go through the not-taken edge $i \rightarrow l$, we denote the misprediction count of this case as $em^{\pi}_{i \rightarrow l}$. Following the reasoning in Case 1,

$$em^{\pi}_{i \rightarrow l} \leq \min\left( e^{\pi}_{i \rightarrow l}, \; \sum_j p^{\pi,1}_{j \leadsto i} \right)$$

The misprediction count of $B_i$ with history $\pi$ is the sum of the two cases: $bm^{\pi}_i = em^{\pi}_{i \rightarrow k} + em^{\pi}_{i \rightarrow l}$. Additionally, let $em_{i \rightarrow j}$ be the misprediction count for control flow transfers along the edge $i \rightarrow j$, aggregated over all history patterns: $em_{i \rightarrow j} = \sum_{\pi} em^{\pi}_{i \rightarrow j}$.
Putting it all together We have derived linear inequalities on $v_i$ (the execution count of $B_i$) and $bm_i$ (the misprediction count of $B_i$). We now maximize the objective function (denoting the execution time of the program) subject to these constraints using an integer linear programming solver. This gives an upper bound on the program's WCET.







[Figure 5.1 depicts the example CFG: $B_{start} \rightarrow B_1$ (label U), $B_1 \rightarrow B_2$ (label 0), $B_1 \rightarrow B_{end}$ (label 1), $B_2 \rightarrow B_1$ (label 1), $B_2 \rightarrow B_{end}$ (label 0), annotated with the execution and misprediction counts computed by the ILP solver.]

Figure 5.1: Example of the Control Flow Graph
5.1.2 An Example
In this part, we illustrate our WCET estimation technique with a simple example. Consider the control flow graph in Figure 5.1. The start and end blocks are called $B_{start}$ and $B_{end}$ respectively. All edges of the graph are labeled. Recall that the label U denotes unconditional control flow and the label 1 (0) denotes control flow by taking (not taking) a conditional branch. We assume that a two-bit history pattern is maintained, i.e., the prediction table has four rows for the four possible history patterns: 00, 01, 10, 11. Also, each row of the prediction table contains one bit to store the last outcome for that pattern: 0 for not taken and 1 for taken.
Flow constraints and loop bounds The start and end nodes execute only once. Hence

$$v_{start} = v_{end} = 1 = e_{start \rightarrow 1} = e_{2 \rightarrow end} + e_{1 \rightarrow end}$$

From the inflows and outflows of blocks 1 and 2, we get:

$$v_1 = e_{start \rightarrow 1} + e_{2 \rightarrow 1} = e_{1 \rightarrow 2} + e_{1 \rightarrow end}$$
$$v_2 = e_{1 \rightarrow 2} = e_{2 \rightarrow end} + e_{2 \rightarrow 1}$$

Furthermore, the edge $2 \rightarrow 1$ is a loop back edge, so its bound must be given. In our method, this bound is either computed offline or provided by the user. Let us consider a loop bound of 100. Then,

$$e_{2 \rightarrow 1} < 100$$
Defining WCET Let us assume a branch misprediction penalty of three clock cycles. The WCET of the program is obtained by maximizing

$$Time = 2v_{start} + 2v_1 + 4v_2 + 2v_{end} + 3bm_1 + 3bm_2$$

assuming $t_{start} = t_1 = 2$, $t_2 = 4$ and $t_{end} = 2$. Recall that $t_i$ is the execution time of block $i$ (assuming perfect prediction) and $bm_i$ is the number of mispredictions of block $i$. There are no mispredictions for the executions of the start and end blocks, since they do not have branches.
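To make the calculation tangible, here is a minimal sketch of this ILP in Python using the open-source PuLP package (the thesis experiments use CPLEX). It encodes only the flow constraints, the loop bound, and the trivial bounds $bm_i \leq v_i$; the history-pattern constraints derived in the rest of this section would be added on top to tighten $bm_1$ and $bm_2$.

import pulp

# Sketch: Equation 5.1 for the example CFG (PuLP used for illustration;
# the history-pattern constraints of this section are omitted here).
prob = pulp.LpProblem("wcet_example", pulp.LpMaximize)
v = pulp.LpVariable.dicts("v", ["start", "b1", "b2", "end"], lowBound=0, cat="Integer")
e = pulp.LpVariable.dicts("e", ["s1", "12", "1e", "21", "2e"], lowBound=0, cat="Integer")
bm = pulp.LpVariable.dicts("bm", ["b1", "b2"], lowBound=0, cat="Integer")

# objective: block costs 2, 2, 4, 2 plus a 3-cycle misprediction penalty
prob += (2*v["start"] + 2*v["b1"] + 4*v["b2"] + 2*v["end"]
         + 3*bm["b1"] + 3*bm["b2"])

# flow constraints
prob += v["start"] == 1
prob += v["end"] == 1
prob += v["start"] == e["s1"]
prob += v["b1"] == e["s1"] + e["21"]      # inflow of block 1
prob += v["b1"] == e["12"] + e["1e"]      # outflow of block 1
prob += v["b2"] == e["12"]                # inflow of block 2
prob += v["b2"] == e["21"] + e["2e"]      # outflow of block 2
prob += v["end"] == e["1e"] + e["2e"]
prob += e["21"] <= 99                     # loop bound: e_{2->1} < 100

# without the pattern constraints, only the trivial bounds apply
prob += bm["b1"] <= v["b1"]
prob += bm["b2"] <= v["b2"]

prob.solve()
print(pulp.value(prob.objective))

Without the pattern constraints, the solver pessimistically treats every branch as mispredicted; the constraints developed next rule most of these mispredictions out.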
Introducing History Patterns We find out the possible history patterns $\pi$ for each basic block $B_i$ via static analysis of the control flow graph. The initial history at the beginning of program execution is assumed to be 00. In our example, the possible patterns are:

$B_{start}$: {00} $\quad$ $B_1$: {00, 01} $\quad$ $B_2$: {00, 10} $\quad$ $B_{end}$: {00, 01, 11}
We now introduce the variables $v^{\pi}_i$ and $bm^{\pi}_i$: the execution count and misprediction count of block $i$ with history $\pi$. Using the possible history patterns at each block, we get:

$$v_{start} = v^{00}_{start} = 1 \qquad bm_{start} = 0$$
$$v_1 = v^{00}_1 + v^{01}_1 \qquad bm_1 = bm^{00}_1 + bm^{01}_1$$
$$v_2 = v^{00}_2 + v^{10}_2 \qquad bm_2 = bm^{00}_2 + bm^{10}_2$$
$$v_{end} = v^{00}_{end} + v^{01}_{end} + v^{11}_{end} = 1 \qquad bm_{end} = 0$$
$$bm^{00}_1 \leq v^{00}_1 \qquad bm^{01}_1 \leq v^{01}_1$$
$$bm^{00}_2 \leq v^{00}_2 \qquad bm^{10}_2 \leq v^{10}_2$$
We also define variables of the form $e^{\pi}_{i \rightarrow j}$ (by using the set of patterns possible at each basic block):

$$e_{start \rightarrow 1} = e^{00}_{start \rightarrow 1}$$
$$e_{1 \rightarrow 2} = e^{00}_{1 \rightarrow 2} + e^{01}_{1 \rightarrow 2} \qquad e_{1 \rightarrow end} = e^{00}_{1 \rightarrow end} + e^{01}_{1 \rightarrow end}$$
$$e_{2 \rightarrow 1} = e^{00}_{2 \rightarrow 1} + e^{10}_{2 \rightarrow 1} \qquad e_{2 \rightarrow end} = e^{00}_{2 \rightarrow end} + e^{10}_{2 \rightarrow end}$$
Control flow among history patterns We now derive the constraints on $v^{\pi}_i$ based on the flow of the pattern $\pi$. Let us consider the inflows and outflows of block 1 with history 01:

$$v^{01}_1 = e^{00}_{2 \rightarrow 1} + e^{10}_{2 \rightarrow 1} = e^{01}_{1 \rightarrow 2} + e^{01}_{1 \rightarrow end}$$

Note that the inflow from block start to block 1 is automatically disregarded in this constraint since it cannot produce a history 01 when we arrive at block 1. Also, for the inflows from block 2, the history at block 2 can be either 00 or 10. Both of these patterns produce history 01 at block 1 when control flows via the edge $2 \rightarrow 1$, i.e., $\Gamma(00, 2 \rightarrow 1) = \Gamma(10, 2 \rightarrow 1) = 01$ from Definition 5.2. Constraints for the inflows/outflows of block 1 with history 00, block 2 with history 00, and block 2 with history 10 are derived similarly.
Repetition of history pattern To model the repetition of a history pattern along a program path, variables $p^{\pi}_{i \leadsto j}$ are introduced (refer to Definition 5.3). We now present the constraints for the pattern 01. Corresponding to the first and last occurrence of the history pattern 01, we get:

$$p^{01}_{start \leadsto 1} \leq 1 \qquad \text{and} \qquad p^{01}_{1 \leadsto end} \leq 1$$

Corresponding to the repetition of the pattern 01, the execution count with pattern 01 equals the inflow from the last occurrence of 01 and also the outflow to the next occurrence of 01:

$$v^{01}_1 = p^{01}_{1 \leadsto 1} + p^{01}_{start \leadsto 1} = p^{01}_{1 \leadsto 1} + p^{01}_{1 \leadsto end}$$

Similarly, we provide constraints for the other patterns.
Introducing branch outcomes For each $p^{\pi}_{i \leadsto j}$, we define the variables $p^{\pi,0}_{i \leadsto j}$ and $p^{\pi,1}_{i \leadsto j}$ via the equation $p^{\pi}_{i \leadsto j} = p^{\pi,0}_{i \leadsto j} + p^{\pi,1}_{i \leadsto j}$. More importantly, we relate the $p^{\pi}_{i \leadsto j}$ variables to the $e^{\pi}_{i \rightarrow j}$ variables via $p^{\pi,0}_{i \leadsto j}$ and $p^{\pi,1}_{i \leadsto j}$. For example, in Figure 5.1 we have

$$p^{10,1}_{2 \leadsto 2} + p^{10,1}_{2 \leadsto end} = e^{10}_{2 \rightarrow 1}$$

In our simple example, we only derive trivial constraints in this category. In general, a sum of $p^{\pi,1}_{i \leadsto j}$ (or $p^{\pi,0}_{i \leadsto j}$) variables equals an $e^{\pi}_{i \rightarrow j}$ variable.
Modeling mispredictions Let us now derive the constraints for $bm^{01}_1$, the number of mispredictions of block 1 with history 01. For this, we consider two cases corresponding to the outcome of the branch at block 1.

• Case 1: The branch at block 1 is taken, and the last branch using the 01 row of the predictor table was not taken. The number of times the branch at block 1 under history 01 is taken is $e^{01}_{1 \rightarrow end}$. The number of times the last branch (before arriving at block 1) using the 01 row of the predictor table was not taken is $p^{01,0}_{start \leadsto 1} + p^{01,0}_{1 \leadsto 1}$. Note that the other block (block 2) is not considered since block 2 cannot be reached with pattern 01. The misprediction count in this case is therefore at most $\min(e^{01}_{1 \rightarrow end}, \; p^{01,0}_{start \leadsto 1} + p^{01,0}_{1 \leadsto 1})$.

• Case 2: The branch at block 1 under history 01 is not taken, and the last branch using the 01 row of the predictor table was taken. Following the reasoning in Case 1, the misprediction count in this case is at most $\min(e^{01}_{1 \rightarrow 2}, \; 0) = 0$. Note that 0 appears in the above formula because, in this particular example, no earlier branch using the 01 row of the predictor table with outcome taken can reach block 1.
























Summing the two cases gives the constraint on $bm^{01}_1$. Constraints for the other misprediction counts are derived in the same fashion; for example, for block 2 with history 10:

$$bm^{10}_2 \leq \min\left(e^{10}_{2 \rightarrow 1}, \; p^{10,0}_{start \leadsto 2}\right) + \min\left(e^{10}_{2 \rightarrow end}, \; p^{10,1}_{2 \leadsto 2}\right)$$

These correspond to the constraints on $bm^{\pi}_i$ in Section 5.1.1. Maximizing the objective function with respect to all these constraints gives the program's WCET. The execution counts of basic blocks as well as their misprediction counts computed by the ILP solver are given in Figure 5.1.
5.1.3 Retargetability
We now discuss how our modeling can be used to capture the effects of various local and global branch prediction schemes. Our modeling of branch prediction is independent of the definition of the prediction table index, so far called the history pattern $\pi$. All our constraints only assume the following: (a) the presence of a global prediction table, (b) the index $\pi$ into this prediction table, and (c) every time the $\pi$th row is looked up for branch prediction, it is updated subsequent to the branch outcome. These constraints continue to hold even if $\pi$ does not denote the history pattern (as it does in the GAg scheme).

In fact, the different branch prediction schemes differ from each other primarily in how they index into the prediction table. Thus, to predict a branch $I$, the index computed is a function of (a) the past execution trace (history) and (b) the address of the branch instruction $I$. In the GAg scheme, the index depends solely on the history and not on the branch instruction address. Other global prediction schemes (gshare, gselect) use both the history and the branch address, while local schemes use only the branch address.

To model the effect of another branch prediction scheme, we only alter the meaning of $\pi$ and show how $\pi$ is updated with the control flow (the $\Gamma$ function of Definition 5.2). This of course affects the possible prediction table indices that can be looked up at a basic block $B_i$. No change is made to the linear constraints (parameterized w.r.t. the possible prediction table indices at each basic block) described in the previous subsection. These constraints then bound a program's WCET under the new branch prediction scheme.
Other global schemes We now discuss two other global prediction schemes: gshare and gselect [52, 74]. In gshare, the index $\pi$ used for a branch instruction $I$ is defined as

$$\pi = history_m \oplus address_n(I)$$

where $m, n$ are constants with $n \geq m$, $\oplus$ is XOR, $address_n(I)$ denotes the lower order $n$ bits of $I$'s address, and $history_m$ denotes the most recent $m$ branch outcomes (which are XOR-ed with the higher-order $m$ bits of $address_n(I)$). The updating of $\pi$ due to control flow is modeled by the function

$$\Gamma_{gshare}(\pi, i \rightarrow j) = \Gamma(history_m, i \rightarrow j) \oplus address_n(j)$$

where $i \rightarrow j$ is an edge in the control flow graph, $address_n(j)$ is the least significant $n$ bits of the address of the branch instruction in basic block $j$, and $\Gamma$ is the function on history patterns described in Definition 5.2.

The modeling of the gselect prediction scheme is similar. Here, the index $\pi$ into the prediction table is defined as

$$\pi = history_m \bullet address_n(j)$$

where $m$ and $n$ are constants and $\bullet$ denotes concatenation. The updating of $\pi$ due to control flow is given by the function

$$\Gamma_{gselect}(\pi, i \rightarrow j) = \Gamma(history_m, i \rightarrow j) \bullet address_n(j)$$

Again, $i \rightarrow j$ is an edge in the control flow graph and $\Gamma$ is the function described in Definition 5.2.
Local prediction schemes In local schemes, the index $\pi$ into the prediction table for predicting the outcome of instruction $I$ is $\pi = address_n(I)$. Here, $n$ is a constant and $address_n(I)$ denotes the least significant $n$ bits of the address of branch instruction $I$. The updating of the index $\pi$ due to control flow is given by $\Gamma_{local}(\pi, i \rightarrow j) = address_n(j)$, where $i \rightarrow j$ is an edge in the control flow graph and $address_n(j)$ is the least significant $n$ bits of the address of the last instruction in basic block $j$. If block $j$ contains a branch instruction $I$, it must be the last instruction of $j$; thus, the least significant $n$ bits of the address of $I$ are used to index into the prediction table (as demanded by local schemes). If $j$ does not contain any branch instruction, then the index computed is never used to look up the prediction table. Clearly, since each block $j$ always uses the same index $\pi$ into the prediction table, index $\pi$ is used at basic block $j$ if and only if $\pi$ denotes the least significant $n$ bits of the address of the branch instruction of block $j$ (if any).
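The corresponding index-update functions can be sketched uniformly (illustrative Python, reusing gamma from Section 5.1; addr_n stands for $address_n(j)$, the least significant n bits of block j's branch address represented as a bit string, and the raw history is assumed to be tracked separately from the composed index):

def gamma_gshare(history_m, label, addr_n, m):
    # Sketch: new gshare index after traversing an edge with this label.
    h = gamma(history_m, label)
    # XOR the m history bits with the higher-order m bits of addr_n
    top = "".join(str(int(a) ^ int(b)) for a, b in zip(addr_n[:m], h))
    return top + addr_n[m:]

def gamma_gselect(history_m, label, addr_n):
    # Sketch: gselect concatenates the updated history with the address bits.
    return gamma(history_m, label) + addr_n

def gamma_local(addr_n):
    # Sketch: local schemes use only the address bits of the target block.
    return addr_n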
5.2 Integration with Instruction Cache Analysis
Our branch prediction analysis is an Integer Linear Programming (ILP) based approach. In this section, we show how to integrate it with an ILP based instruction cache analysis. The key point in a combined analysis of multiple microarchitectural features is to capture their interactions. In the context of branch prediction and instruction caching, the interaction is unidirectional. Speculative execution due to branch prediction can alter the behavior of the instruction cache, i.e., instructions can be prefetched into, or displaced from, the cache during speculative execution. On the other hand, the instruction cache does not change the branch prediction outcome, as a cache access does not read or change the state of the branch predictor. This means that our branch prediction technique need not be changed in the combined analysis.

To discuss the combined analysis, we first review the instruction cache analysis technique proposed by Li et al. [43]. Then we modify it to take into account the effect of speculative execution.
5.2.1 Instruction Cache Analysis
We recapitulate the earlier instruction cache modeling of [43]. A basic block $B_i$ is partitioned into $n_i$ l-blocks³ denoted $B_{i.1}, B_{i.2}, \ldots, B_{i.n_i}$. Let $cm_{i.j}$ be the total number of cache misses for l-block $B_{i.j}$ and $cmp$ be the constant denoting the cache miss penalty. The cache miss penalties are added to the WCET objective function, which becomes

$$Time = \sum_i \left( cost_i \ast v_i + \sum_{j=1}^{n_i} cmp \ast cm_{i.j} \right)$$

For simplicity of exposition, let us assume a direct mapped cache. For each cache line $c$, a Cache Conflict Graph (CCG) $G_c$ [43] is constructed. The nodes of $G_c$ are the l-blocks mapped to $c$. An edge $B_{i.j} \leadsto B_{u.v}$ exists in $G_c$ iff there exists a path in the CFG s.t. control flows from $B_{i.j}$ to $B_{u.v}$ without going through any other l-block mapped to $c$. In other words, there is an edge from l-block $B_{i.j}$ to $B_{u.v}$ if $B_{i.j}$ can be present in the cache when control reaches $B_{u.v}$.
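A sketch of the line-mapping step behind the CCG construction (illustrative only; the CFG reachability test that decides the CCG edges is assumed to exist elsewhere):

def cache_line(addr, line_size, num_lines):
    # Sketch: direct-mapped placement of an address.
    return (addr // line_size) % num_lines

def group_by_line(lblocks, line_size, num_lines):
    # Sketch: group l-blocks by the cache line they map to; the nodes of
    # the CCG G_c for line c are exactly groups[c].
    groups = {}
    for lb in lblocks:                 # lb.addr: start address of the l-block
        c = cache_line(lb.addr, line_size, num_lines)
        groups.setdefault(c, []).append(lb)
    return groups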
Let $r_{i.j \leadsto u.v}$ be the execution count of the edge between l-blocks $B_{i.j}$ and $B_{u.v}$ in a CCG. The execution count of l-block $B_{i.j}$ equals the execution count of basic block $B_i$. Also, at each node of the CCG, the inflow equals the outflow, and both equal the execution count of the l-block:

$$v_i = \sum_{u.v} r_{i.j \leadsto u.v} = \sum_{u.v} r_{u.v \leadsto i.j}$$

The cache miss count $cm_{i.j}$ equals the inflow from conflicting l-blocks in the CCG (whether two l-blocks are conflicting or non-conflicting is statically determined by their memory addresses):

$$cm_{i.j} = \sum_{B_{u.v} \text{ conflicting with } B_{i.j}} r_{u.v \leadsto i.j}$$

³A line-block, or l-block, is a sequence of instructions in a basic block that belong to the same instruction cache line.
5.2.2 Changes to Instruction Cache Analysis
Effects of speculative execution on caching WCET analysis as described in
the previous section does not take into account the effect of branch misprediction on
instruction cache performance. When a branch is predicted, instructions are fetched
and executed from the predicted path. If all branches are predicted correctly, then the analysis described in the previous section gives accurate results. Now, consider a
branch that is mispredicted. The processor will fetch and execute instructions along
the mispredicted path till the branch is resolved. There can be two scenarios during
mispredicted path execution: (1) there is no cache miss, and (2) there is at least one
cache miss. In the first scenario, the misprediction has no effect on the instruction
cache. However, in the second scenario, the instruction cache content is modified
when the processor resumes execution from the correct path. Various studies have
concluded that depending on the application, this wrong-path prefetching can have
a constructive or a destructive effect on the instruction cache’s performance [12, 60].
Our goal here is to model this wrong-path cache effect for WCET analysis.
We make two standard assumptions. First, we assume that the processor allows
only one unresolved branch at any point of time during execution. Thus, if another
branch is encountered during speculative execution, the processor simply waits till
the previous branch is resolved. We also assume that the instruction cache is blocking
(i.e., it can support only one pending cache miss). This is indeed the case in almost
all commercial processors.
We introduce some notation for the subsequent parts. We use [Bi.j] to denote the cache line to which l-block Bi.j maps. The shorthand Bi.j ≅ Bu.v is used to denote that l-blocks Bi.j and Bu.v map to the same cache line. Thus Bi.j ≅ Bu.v iff [Bi.j] = [Bu.v].
The effects of speculation on instruction cache performance can be categorized as
follows:
1. An l-block Bi.j misses during normal execution because it is displaced by another l-block Bu.v ≅ Bi.j during speculative execution (destructive effect).
2. An l-block Bi.j hits during normal execution because it is prefetched during speculative execution (constructive effect).
3. A pending cache miss of Bi.j during speculative execution along the wrong path causes the processor to stall when the branch is resolved. How long the stall lasts depends on the portion of the cache miss penalty that is masked by the branch misprediction penalty. If the speculative fetch is completely masked by the misprediction penalty, then no delay is incurred.
The last situation cannot be simply deemed constructive or destructive, although
a delay often happens in that case. The cost of the delay may be offset later by a
cache hit to the l-block.
The following changes to the Cache Conflict Graph (CCG) capture both the con-
structive and destructive effects of speculative execution on cache.
Additional nodes in Cache Conflict Graph We add all the l-blocks fetched
along the mispredicted path to their respective cache conflict graphs. Given a con-
ditional branch b, its actual outcome X (not taken or taken, denoted as 0 and 1,
respectively) and misprediction penalty bmp (a constant number of clock cycles), we
can identify the set of l-blocks accessed along the mispredicted path, called Spec(b,X).
Clearly, the cost of executing the blocks in Spec(b,X) cannot exceed bmp. If one or
more blocks cause cache misses, then not all the l-blocks in Spec(b,X) can execute.
Those l-blocks executed along the mispredicted path are called ml-blocks and are annotated with the basic block containing the branch instruction and the actual outcome. For example, if Bi.j ∈ Spec(b,X), then the corresponding ml-block is denoted by Bi.j^{b,X}. Note that it is possible to have multiple ml-blocks corresponding
to an l-block. For an l-block Bi.j, all its ml-blocks are added to the CCG of the cache
line it maps to.
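The construction of Spec(b,X) can be viewed as a bounded walk along the wrong path: wrong-path fetches can only be initiated while the misprediction is unresolved. A sketch under this reading (the inputs and the per-block cost function are assumptions):

    def spec(wrong_path_lblocks, cost, bmp):
        """wrong_path_lblocks: l-blocks in fetch order along the path
        taken when branch b is mispredicted with actual outcome X.
        cost[lb]: best-case cycles consumed by lb before the next
        fetch can begin. bmp: misprediction penalty in cycles."""
        accessed, elapsed = [], 0
        for lb in wrong_path_lblocks:
            if elapsed >= bmp:
                break            # branch resolved: wrong-path fetch stops
            accessed.append(lb)  # lb can be accessed speculatively
            elapsed += cost[lb]
        return accessed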
Additional edges in Cache Conflict Graph We now need to add additional edges in the cache conflict graphs. Given a CCG, we add edges between ml-blocks and the normal l-blocks; we also add edges between ml-blocks. For an ml-block Bi.j^{b,X}, we add edges to/from all the other l-blocks Bu.v in the CCG of cache line [Bi.j] and their corresponding ml-blocks as follows:
1. Bu.v → Bi.j^{b,X} if there exists a path from Bu.v to Bi.j through branch b that does not contain any other l-block mapped to [Bi.j]. This models the flow from the last normal use of the cache line to the ml-block.
2. Bi.j^{b,X} → Bu.v^{b,X} if Bu.v is the next use of the cache line [Bi.j] in Spec(b,X) after Bi.j. This models the flow from the ml-block to the next possible use of the cache line along the mispredicted path.
3. Bi.j^{b,X} → Bu.v if there exists a path from branch b with outcome X to Bu.v that does not contain any other l-block mapped to [Bi.j].
4. In addition, in case 3, if the path to Bu.v goes through branch b′ and Bu.v ∈ Spec(b′, Y) (b′ can be the same as or different from b), then we also add Bi.j^{b,X} → Bu.v^{b′,Y}. The edges in cases 3 and 4 model the flow from the ml-block to the next possible use of the cache line after the branch is resolved.
Figure 5.2 illustrates these cases. The shaded rectangles are the ml-blocks and the unshaded ones are the normal l-blocks. The third and fourth types of edges require some explanation. If there are multiple l-blocks along the speculative path that map to a particular cache line, then we conservatively add outgoing edges from all of them to the first use of the cache line in the correct path (or another speculative path). This is because any of these l-blocks may be the one actually present in the cache when the branch is resolved; exactly which one will be in the cache when the branch is resolved depends on the exact values of bmp, cmp and the execution time of the individual basic blocks.
Figure 5.2: Additional edges in the Cache Conflict Graph due to Speculative Exe-
cution. The l-blocks are shown as rectangular boxes, and the ml-blocks among them
are shaded.
Figure 5.3 illustrates the modifications to the CCG with an example. The control flow graph is shown in Figure 5.3(a). Let us assume that l-blocks B0.1, B1.2 and B3.1 belong to the same cache line. The original CCG for that cache line is shown in Figure 5.3(b). A dummy start node and an end node are added to each CCG to make the initial and terminal flow equations correct.
The modifications to the CCG due to wrong-path prefetching are shown in Figure 5.3(c). We add two ml-blocks B3.1^{2,1} and B1.2^{3,0} corresponding to the mispredictions at node B2 and node B3, respectively. Note that we do not add any node corresponding to a 0 outcome at branch B2 or a 1 outcome at branch B3. This is because with a 0 outcome at branch B2, the mispredicted path fetches basic block B2, which does not contain any l-block that maps to the cache line, and similarly for B3 with outcome 1. Among the additional edges, B1.2 → B3.1^{2,1} and B3.1 → B1.2^{3,0} belong to the first type of edges described above.
Figure 5.3: Changes to Cache Conflict Graph (Shaded nodes are ml-blocks)
Figure 5.3 also shows the modeling of the constructive effect of wrong-path prefetching. In the original CCG, there is an edge B1.2 → B3.1 and that is the only path between the two nodes. Therefore, every time control reaches B3.1 from B1.2, it is a cache miss. In the modified CCG in Figure 5.3(c), there is another path from B1.2 to B3.1 via the ml-block B3.1^{2,1}. First, there is no cache miss along B3.1^{2,1} → B3.1 as they are physically the same l-block. Second, the cache miss along B1.2 → B3.1^{2,1} is partially masked by the branch misprediction delay. Thus, this kind of prefetching is constructive to the execution.
Additional constraints on ml-blocks The execution count of a normal l-block is equal to the execution count of the basic block it belongs to. However, for an ml-block Bi.j^{b,X}, this count depends on the number of mispredictions at branch b where the actual outcome is X (X is 0 or 1). To derive this execution count, note that the l-blocks in Spec(b,X) enter the pipeline only when branch b is mispredicted with actual outcome X, and that wrong-path fetching stops once a cache miss is encountered. Here bmp (cmp) denotes branch misprediction penalty (cache miss penalty). In accordance with our processor model, we assume bmp < cmp; this assumption is, however, not required, and our modeling can be easily extended. Given bmp < cmp, a single misprediction can result in at most one cache miss along the mispredicted path. Let Spec(b,X) = 〈Bu1.v1, . . . , Buk.vk〉. Therefore, the execution count of ml-block Bul.vl^{b,X} is

    count(Bul.vl^{b,X}) = bm_b^X − Σ_{s=1}^{l−1} cm_{us.vs}^{b,X}
where bm_b^X is the number of mispredictions at branch b with outcome X (obtained from the modeling of branch prediction) and cm_{ul.vl}^{b,X} is the number of cache misses for the ml-block Bul.vl^{b,X}. Constraints on cm_{ul.vl}^{b,X} are obtained from the CCG as shown in Equation 5.8 (Section 5.2.1). Constraints on bm_b^X are obtained from our modeling of branch prediction described in Section 5.1.1.
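Under the counting argument reconstructed above (bmp < cmp, so at most one cache miss per misprediction), the execution-count constraint of each ml-block can be emitted mechanically. A sketch with hypothetical variable naming:

    def mlblock_count_constraints(b, X, spec_blocks):
        """spec_blocks: l-block names of Spec(b, X) in fetch order."""
        cons = []
        for l, lb in enumerate(spec_blocks):
            earlier = [f"cm_{u}_{b}_{X}" for u in spec_blocks[:l]]
            rhs = f"bm_{b}_{X}" + "".join(f" - {m}" for m in earlier)
            # the l-th ml-block executes once per misprediction, unless
            # an earlier ml-block missed (the miss outlasts the
            # misprediction and fetching never reaches this block)
            cons.append(f"count_{lb}_{b}_{X} = {rhs};")
        return cons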
The objective function for WCET analysis is accordingly extended with the wrong-path delays:

    WCET = Σ_i ( cost_i × v_i + bmp × bm_i + cmp × Σ_{j=1..n_i} cm_i.j ) + Σ_{b,X} mp_delay(b,X)

The three subterms of the first term are the ideal execution time, the branch penalty and the cache penalty, respectively. The last term, mp_delay(b,X), is the delay during which the processor waits for pending cache misses (arising during mispredictions) after the mispredictions have been resolved. As the assumption bmp < cmp holds, the criteria for such a delay to occur are: (a) a cache miss happens during a misprediction, and (b) this cache miss is not completely masked by the misprediction (i.e., it is still pending when the branch is resolved). Accordingly, mp_delay(b,X) sums, over the ml-blocks Bui.vi ∈ Spec(b,X), the product of the miss count cm_{ui.vi}^{b,X} and the delay introduced due to the cache miss of Bui.vi along the mispredicted path of branch b (where the actual outcome is X). This delay is not a constant, as part of the cache miss penalty cmp can be masked, depending on the location of the cache miss in the mispredicted path.
5.3 Experimental Evaluation
In this section we experimentally evaluate our branch prediction analysis as well as
the combined analysis with instruction caching.
Since we want to examine the effects of instruction caching and branch predic-
tion, we exclude the impact of other factors, such as pipelining, data caching, data
dependencies, etc. In our experiments, we assume a perfect processor pipeline with
no stalls due to data dependencies. This allows each instruction to take a fixed num-
ber of clock cycles to execute. The only timing overhead is introduced by instruction
cache misses and branch mispredictions of conditional branches.
We use the same set of benchmarks as in Chapter 4, and we compare our esti-
mation against the simulation on SimpleScalar. Our analyzer is parameterized with
respect to the prediction scheme, the predictor table size, the misprediction penalty,
the cache configuration and the cache miss penalty. The default parameters in our
experiments are as follows: (1) branch prediction scheme is gshare; (2) a 128-entry
BHT is used, and the 3-bit branch history is XOR-ed with the higher portion of
the address bits PC[9:3] as index to the BHT; (3) the branch misprediction penalty is five clock cycles; and (4) the instruction cache is a 1KB direct-mapped cache with 32 cache lines of 32 bytes each, and the cache miss penalty is 10 clock cycles.
Figure 5.4: The Importance of Modeling Branch Prediction: Mispredictions in Observation and Estimation
Experiments on the impact of changing these parameters are reported later in this section.
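For concreteness, the default gshare indexing can be written down directly. The sketch below assumes the 3-bit history is aligned with the higher portion of the 7-bit index, matching the wording above:

    def gshare_index(pc: int, bhr: int) -> int:
        """128-entry BHT index: PC[9:3] XOR-ed with the 3-bit BHR
        shifted into the higher index bits (alignment is an assumption)."""
        addr_bits = (pc >> 3) & 0x7F          # PC[9:3], 7 bits
        return addr_bits ^ ((bhr & 0x7) << 4)

    # Two branches alias only if the history difference compensates
    # for their address-bit difference:
    assert gshare_index(0x408, 0b001) == gshare_index(0x488, 0b000)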
We first justify the need for modeling branch prediction. Figure 5.4 shows the
correct predictions and mispredictions in observation as well as in estimation. On
chart (a), we can see that for the majority of the benchmarks, more than eighty percent of the branches are correctly predicted, which means that if we do not model branch
Program    Obs. WCET    Est. WCET    Est./Obs. Ratio    Analysis Time (sec.)    ILP Solving Time (sec.)
adpcm 139802 245854 1.76 0.09 7.9
compress 4565 7878 1.73 0.03 0.1
dhry 121559 126033 1.04 0.04 0.1
fdct 3968 3968 1.00 0.01 0.01
fft 653006 688697 1.05 0.01 0.01
fir 29032 32570 1.12 0.01 0.3
ludcmp 8520 9143 1.07 0.01 0.01
matmul 15205 15220 1.00 0.01 0.01
matsum 101842 101852 1.00 0.01 0.01
minver 5731 6297 1.10 0.01 0.1
qurt 1285 1610 1.25 0.01 0.2
whet 528928 580708 1.10 0.01 0.01
Table 5.1: Modeling Gshare Branch Prediction Scheme for WCET Analysis.
predictions, these predominantly correct predictions will be pessimistically taken as mispredictions. On chart (b), we can see that our analysis indeed captures a considerable number of these correct predictions.
The results of branch prediction modeling are reported in Table 5.1. It shows
the observed WCET (column Obs. WCET) obtained from SimpleScalar and the
estimated WCET (column Est. WCET) obtained from our ILP based technique.
We use the popular gshare prediction scheme in these experiments. We also evaluate
the accuracy of our estimation technique by presenting the ratio of the estimated
WCET over the observed WCET in the Ratio column in Table 5.1. Note that the
overestimation comes from two sources: program flow analysis and branch prediction
modeling. To see how much pessimism is contributed by branch prediction, we use the
execution counts of basic blocks observed in simulation as user constraints (as what
we did in the pipeline analysis). Figure 5.5 compares the overall estimation to branch-
prediction-only overestimation (benchmarks with single execution path are excluded
from this figure as they suffer no overestimation from program path analysis). As we





adpcm    0.76 0.1
compress 0.73 0.02





ludcmp   
minver   
qurt     













Figure 5.5: Overall and Branch Prediction Overestimation
Scheme BHT size BHR bits Address bits
gshare 128 3 PC[9:3]
GAg 8 3 none
local 128 none PC[9:3]
Table 5.2: Configurations of Branch Prediction Schemes
As we can see from Figure 5.5, for some benchmarks (notably adpcm and compress) the major portion of the overestimation comes from program path analysis, while for the remaining benchmarks branch prediction analysis contributes more overestimation than program path analysis. However, in the latter case, the overall overestimation is low. Thus, in our experiments, branch prediction analysis does not produce considerable overestimation in any case.
Now we evaluate our branch prediction technique against three schemes: gshare,
GAg, and local. The respective configurations for the three schemes are described
in Table 5.2, and the results are presented in Table 5.3. Note the WCETs are in
clock cycles while mispredictions are in counts. There are a few benchmarks (adpcm,
compress, whet) with relatively higher WCET and/or misprediction overestimation.
As we have seen from Figure 5.5, the major portion of overestimation comes from
program flow analysis for adpcm and compress. For whet, the observed mispredictions under the gshare scheme (128-entry BHT with 3-bit BHR) are much lower than those
WCET
Pgm. gshare GAg local
Obs. Est. Ratio Obs. Est. Ratio Obs. Est. Ratio
adpcm 139802 245854 1.76 154057 254437 1.65 139662 254428 1.82
compress 4565 7878 1.73 5025 9581 1.91 4510 8376 1.86
dhry 121559 126033 1.04 130434 134539 1.03 123459 128769 1.04
fdct 3968 3968 1.00 3948 3978 1.01 3923 3948 1.01
fft 653006 688697 1.05 652976 696680 1.07 660646 698819 1.06
fir 29032 32570 1.12 29367 34337 1.17 29407 34260 1.17
ludcmp 8520 9143 1.07 8565 9153 1.07 8640 8996 1.04
matmul 15205 15220 1.00 15205 15220 1.00 15185 15190 1.00
matsum 101842 101852 1.00 101842 101852 1.00 101822 101832 1.00
minver 5731 6297 1.10 5716 6409 1.12 5726 6159 1.08
qurt 1285 1610 1.25 1280 1618 1.26 1265 1560 1.23
whet 528928 580708 1.10 546143 580838 1.06 580568 580668 1.00
Misprediction Counts
Pgm. gshare GAg local
Obs. Est. Ratio Obs. Est. Ratio Obs. Est. Ratio
adpcm 2965 22389 7.55 5816 24292 4.18 2937 23038 7.84
compress 28 491 17.54 120 819 6.83 17 572 33.65
dhry 1230 2124 1.73 3005 3826 1.27 1610 2672 1.66
fdct 15 15 1.00 11 17 1.55 6 11 1.83
fft 2080 4149 1.99 2074 6198 2.99 3608 6170 1.71
fir 111 659 5.94 178 1012 5.69 186 1037 5.58
ludcmp 99 157 1.59 108 159 1.47 123 120 0.98
matmul 226 229 1.01 226 229 1.01 222 223 1.00
matsum 206 208 1.01 206 208 1.01 202 204 1.01
minver 113 184 1.63 110 197 1.79 112 146 1.30
qurt 23 86 3.74 22 89 4.05 19 75 3.95
whet 314 10670 33.99 3757 10696 2.85 10642 10662 1.00
Table 5.3: Observed and Estimated WCET and Misprediction Counts of Gshare,
GAg and Local Schemes.
 L0:  j = 1; 
L1:  for (i = 1; i <= n4; i += 1) { /* 3450 */ 
L2:      if (j == 1) 
L3:          j = 2; 
L4:      else 
L5:          j = 3; 
 
L6:      if (j > 2) 
L7:          j = 0; 
L8:      else 
L9:          j = 1; 
 
L10:     if (j < 1)
L11:         j = 1;
L12:     else
L13:         j = 0;
     }

(a) Source Code Segment
Itr. Paths 
(1) L2 L3 L6 L9 L10 L13 
(2) L2 L5 L6 L7 L10 L11 
(3) L2 L3 L6 L9 L10 L13 
(4) L2 L5 L6 L7 L10 L11 
  ...        ...
(2n-1) L2 L3 L6 L9 L10 L13 
(2n) L2 L5 L6 L7 L10 L11
 
 
(b) Paths in Loop 
L3 = 1725 (1a) 
L5 = 1725 (1b) 
 
L7 = 1725 (2a) 
L9 = 1725 (2b) 
 
L11 = 1725 (3a) 
L13 = 1725 (3b) 
 
(c) Linear Constraints 
Figure 5.6: A Fragment of the Whetstone Benchmark
observed under the other schemes. This is because the gshare scheme under this configuration happens to avoid destructive aliasing effects in the BHT, while the other two schemes suffer from such effects. However, the much higher estimated misprediction count for whet under gshare is caused by a drawback of our ILP-based modeling. The detailed reason is explained in the following paragraph.
Difficulty in Exploiting Temporal Path Information One reason for the over-
estimation of misprediction counts is the aggregate nature of the ILP approach. The
ILP approach only allows us to provide linear constraints on basic block execution
counts. However, path information (even if provided by the user) cannot be exploited
by the ILP solver. For example, let us study a program segment of the whet bench-
mark given in Figure 5.6. Figure 5.6(a) is a loop body with loop iteration counts
annotated. There are three if-then-else constructs embedded in the loop body.
By taking a closer look, we can figure out that the outcomes of these branches are fully determined by the value of j, so the loop alternates between two fixed paths; this path sequence is given in Figure 5.6(b). We can see there are only two paths and they alternate during the iterations. However, this temporal information cannot be fed into the ILP solver. Instead, the ILP solver uses the constraints in Figure 5.6(c) to implicitly consider any path satisfying these constraints. All such paths are considered in the ILP solver's quest to maximize branch mispredictions (leading to overestimation).

Figure 5.7: Change (in Percentage) of Cache Misses and Overall Penalties in Combined Modeling to Those in Individual Modelings
So far, we have presented the experimental results for branch prediction modeling.
We now discuss the integrated modeling of instruction caching and branch prediction.
First, we illustrate the importance of combined modeling of cache and speculation
for WCET analysis by comparing it against a naive technique which models both
caching and speculation but ignores the cache-speculation interaction. Figure 5.7
shows this comparison with benchmarks for which we can find the actual WCET
(and the corresponding cache miss and branch misprediction overheads).
The first group of bars indicates the percentage increase/decrease in cache misses due to the effect of branch prediction on cache behavior. For matmul and fdct, there are more cache misses in the combined modeling than in the naive modeling, indicating that the destructive effects of speculation are more significant than the
              WCET                  Mispred.              Cache miss
Pgm.      Obs.    Est.    Ratio   Obs.   Est.   Ratio   Obs.   Est.   Ratio
adpcm 145302 252811 1.74 2965 22388 7.55 550 633 1.15
compress 8905 14150 1.59 28 482 17.21 434 573 1.32
dhry 193969 217322 1.12 1230 2072 1.68 7241 8323 1.15
fdct 7028 7576 1.08 15 15 1.00 306 328 1.07
fft 716526 759438 1.06 2080 4148 1.99 6352 6431 1.01
fir 42962 51474 1.20 111 665 5.99 1393 1726 1.24
ludcmp 12060 18427 1.53 99 157 1.59 354 844 2.38
matmul 15275 15330 1.00 226 229 1.01 7 10 1.43
matsum 101902 101940 1.00 206 208 1.01 6 8 1.33
minver 8941 12724 1.42 113 175 1.55 321 589 1.83
qurt 2295 3009 1.31 23 84 3.65 101 129 1.28
whet 529618 581608 1.10 314 10674 33.99 69 80 1.16
Table 5.4: Combined Analysis of Branch Prediction and Instruction Caching
constructive effects. For other programs, the constructive effects outperform the
destructive effects, thereby decreasing the number of cache misses. The second group
of bars shows the percentage change in total timing overhead of cache misses and
branch mispredictions due to cache-speculation interaction. The timing overhead
shows similar behavior as cache misses. The results show that if naive modeling is
used (i.e., the effect of branch prediction on caching is not modeled), the WCET
can either be overestimated (as the downward bars indicate), or, more seriously, be
underestimated (as the upward bars indicate).
The results for combined modeling of instruction caching and branch prediction
are given in Table 5.4. Note that the numbers for the WCET columns are in processor
cycles while the Mispred. and Cache miss columns denote misprediction and cache
miss counts. As we can see from the ratio column, most benchmarks have tight
estimated bounds.
Modern processors have deep pipelines and an increasing gap between processor speed and memory latency. Deeper pipelining leads to larger misprediction penalties (in terms of clock cycles). The increasing processor-memory speed gap results in longer cache miss penalties. Due to this trend of hardware advancement, we examine the accuracy of our WCET analysis with more aggressive parameters (bmp is increased from five clock cycles to 10 clock cycles, and cmp is increased from 10 clock cycles to 50 clock cycles).
Figure 5.8: Est./Obs. WCET Ratio under Different Misprediction Penalties and Cache Miss Penalties
Figure 5.8 gives the Est./Obs. WCET ratios for the benchmarks under
different bmp/cmp settings. We have three observations from this figure. First, for
most benchmarks, their ratios do not change significantly with the increases of bmp
and cmp. This is a desirable result. Second, for a few benchmarks, such as ludcmp
and minver, there is a non-trivial increase of overestimation. The reason is that
for these benchmarks, the contribution of cache miss/misprediction penalties to the
overall execution time is high, thus the increase of bmp/cmp has a high impact. Last,
for a few benchmarks such as adpcm and compress, their overestimation decreases
with the increase of bmp/cmp. This sounds counter-intuitive. But it can happen,
depending on how much the ideal execution (execution without cache misses and
mispredictions) contributes to the overall execution time. The following example
explains the reason quantitatively. Let us assume that for a benchmark, the observed
ideal execution and penalties under lower bmp/cmp are 5000 cycles and 1000 cycles
respectively; while its estimated ideal execution and penalties are 10000 cycles and
             fixed BHT, variable BHR        fixed BHR, variable BHT
program      128/3     128/4     128/5      256/5    512/5    1024/5
adpcm 13.3 38.7 161.5 168.5 125.2 102.3
compress 0.2 2.0 9.2 5.4 2.8 3.7
dhry 0.1 1.3 1.9 2.9 1.2 1.5
fdct 0.01 0.01 0.01 0.01 0.01 0.01
fft 0.01 0.1 0.1 0.7 0.2 0.3
fir 0.05 10.3 28.4 22.0 13.2 14.6
ludcmp 0.1 0.2 2.8 2.7 2.8 3.0
matmul 0.01 0.01 0.01 0.1 0.1 0.1
matsum 0.01 0.01 0.01 0.01 0.01 0.01
minver 0.3 2.5 7.8 13.8 29.2 12.0
qurt 0.2 2.9 3.8 0.4 0.3 0.3
whet 0.01 0.2 2.4 1.3 1.5 1.2
Table 5.5: ILP Solving Times (in seconds) with Different BHT Sizes and BHR Bits
1500 cycles. Thus the Est/Obs ratio would be (10000 + 1500)/(5000 + 1000) = 1.92.
Now under higher bmp/cmp, the observed penalty is increased from 1000 to 5000,
while the estimated penalty is increased from 1500 to 7500 (the ideal execution times remain unchanged). Then the new ratio would be (10000+7500)/(5000+5000) = 1.75.
Thus it is possible that more aggressive configurations (higher bmp/cmp) may result
in lower overestimation.
Finally, we look at the performance of the analysis. Since the complexity of ILP problems grows quickly with problem size, we examine the ILP solving times under various configura-
tions of the branch predictor, in particular, the BHT size and BHR length. The
size and complexity of the program are also an important factor affecting the ILP
solving time. The characteristics of the benchmarks have been given in Table 2.1 in
Chapter 2. We present the ILP solving times in Table 5.5. Columns 2 to 4 study
the increase of ILP solving time with the length of BHR being increased from 3 to 5
bits. The results show that ILP solving time increases rapidly with longer branch history. However, the benefit of using a history length of more than four branches diminishes quickly because correlations across so many branches are usually weak. Rather, it
is more effective to reduce alias effects by using more address bits as index to the
BHT. Thus columns 5 to 7 study the variation of ILP solving time under different
BHT sizes (with a fixed 5-bit BHR). In most cases, increasing the BHT size does not result in a significant increase in ILP solving time. In fact, it reduces the solving time in some cases. This is because purely increasing the BHT size does not give rise to more history
patterns for each branch. On the other hand, reduced alias effects can lead to simpler
relationships among different branches in our analysis. If the analyzed program is much
larger than our benchmarks, and over 6 recent branch outcomes are used as history,
the analysis time could be significantly longer than the times reported in Table 5.5.
One possible way to address the scalability problem is to divide a large program
into smaller components and conduct analysis on each of the smaller components.
Some amount of accuracy loss is expected since at the beginning and end of each
component, we have to make conservative assumptions on the state of the branch
predictor. In practice, a large procedure or a sub-graph in the call graph is a possible
candidate for analysis.
5.4 Summary
In this chapter, we presented a framework to model dynamic branch predictions for
WCET analysis. Our modeling can be targeted to various dynamic branch prediction
schemes (which are used in both general-purpose and embedded processors [32, 54]).
This ILP-based modeling is conveniently integrated with the ILP-based program path
analysis. We also extended the branch prediction modeling to a combined analysis
of branch prediction and instruction caching. The destructive/constructive effects of
branch prediction on cache behavior are captured uniformly. Using our technique, we
have obtained tight timing estimates for benchmark programs under various branch
prediction schemes. We have also studied the scalability of our technique. Some of the benchmarks used are not among the smallest of the widely used WCET benchmarks, and branch predictors of moderate configurations are considered.
The results show that our technique has reasonable scalability. Our current technique
is primarily for modeling the prediction of branch directions. Some other important aspects of branch prediction, such as the branch target buffer, have not been considered. In our future work on modeling a realistic processor, we will investigate the full features of branch predictors and extend our current technique to model them.
CHAPTER VI
ANALYSIS OF PIPELINE, BRANCH
PREDICTION AND INSTRUCTION CACHE
We have studied out-of-order pipelines for WCET analysis in Chapter 4 and branch
prediction as well as its interaction with instruction caches in Chapter 5. In this
chapter we integrate the timing effects of branch prediction and instruction caches
with our out-of-order pipeline modeling. To achieve this, we need to study the impact
of instruction caches and branch prediction on the pipeline. This involves changes in
our estimation algorithm as well as the execution graph for each basic block (since a
branch misprediction may execute additional code speculatively). These changes are
now described.
The rest of this chapter is organized as follows. First we describe how the WCET
estimation of a basic block is affected by branch prediction (Section 6.1), and instruc-
tion cache (Section 6.2). Then, in Section 6.3 we describe the ILP formulation for
WCET estimation of the whole program in presence of pipeline, cache and branch
prediction. In Section 6.4 we show by experimentation that the combined analysis
yields tight estimates. We conclude this chapter in Section 6.5.
6.1 Timing Estimation of a Basic Block in Pres-
ence of Branch Prediction
Clearly, if a branch is predicted correctly, then our pipeline analysis does not require
any modification. However, a branch misprediction results in instructions along the
wrong path being executed in the pipeline (without commit) and flushed out after
the branch is resolved. This involves changes in the execution graph of a basic block.
Before describing these changes, we make the following assumptions.
Assumptions First, we assume that the processor allows only one unresolved branch
at any point of time during execution. Thus, if another branch is encountered during
speculative execution, the processor simply waits till the previous branch is resolved.
Second, we assume that the outcome of a branch is resolved upon the completion of
its WB stage. If it is a misprediction, the wrong path instructions are flushed out
and the processor resumes execution along the correct path immediately. Last, we
assume that the branch prediction takes place at the end of the fetch stage. That is,
the target address is available at the end of the fetch stage irrespective of whether
a branch is predicted as taken or non-taken. In reality, this is easy for a non-taken
prediction; but for a taken prediction, extra resources, such as a branch target buffer, are needed to achieve this goal [30].
6.1.1 Changes to Execution Graph
We now describe the changes to the execution graph of a basic block in order to
account for instructions executed due to branch misprediction; these instructions are
also referred to as wrong path instructions. In particular, we discuss the changes to
execution graph nodes, dependency relation and contention relation among nodes.
Consider the execution graph of a basic block B with a body, a prologue and an
epilogue. If the last instruction of the prologue is a branch b, we include instructions
along the mispredicted path of b; otherwise no change is made to the execution graph.
A fragment of an execution graph without misprediction is shown in Figure 6.1(a)
and the modified execution graph fragment due to the misprediction of branch b is
shown in Figure 6.1(b).
Additional nodes in the execution graph A mispredicted branch brings the
instructions along the wrong path into the pipeline. In order to capture their effect
(a) Original execution graph        (b) Modifications due to branch misprediction
Figure 6.1: Execution Graph with Branch Prediction
on the execution of normal instructions, we construct nodes corresponding to these
wrong path instructions in the execution graph. Given a conditional branch b and
its actual outcome X (non-taken or taken, denoted as 0 and 1, respectively), we
can identify the maximum sequence of wrong path instructions that can enter the
pipeline, called Spec(b,X). The length of this sequence is bounded by two factors.
• |Spec(b,X)| ≤ ROB_size + IBuffer_size, where ROB_size is the size of the re-order buffer and IBuffer_size is the size of the instruction fetch buffer.
• If another conditional branch b′ is encountered along the wrong path, then the sequence Spec(b,X) is terminated at b′.
In Figure 6.1(b), the shaded nodes are the wrong path nodes (only one instruction is
drawn for simplicity). There are no CM nodes for wrong path instructions as these
instructions are not allowed to commit.
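The two bounds on Spec(b,X) translate into a simple truncation rule. A sketch with hypothetical inputs (whether b′ itself enters the pipeline is a modeling detail; here it terminates the sequence):

    def wrong_path_instructions(path, rob_size, ibuf_size, is_cond_branch):
        """path: instructions in fetch order along the mispredicted path."""
        limit = rob_size + ibuf_size   # capacity bound on in-flight instrs
        out = []
        for ins in path[:limit]:
            if is_cond_branch(ins):
                break                  # only one unresolved branch allowed
            out.append(ins)
        return out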
Changes to dependency relation Due to the changes in the execution graph
nodes, the nodes can now be categorized as (a) prologue nodes (b) wrong path nodes
(c) body nodes (this is the basic block being analyzed) and (d) epilogue nodes. The
dependency edges among the nodes in each category are drawn as usual. However,
the dependency edges among nodes in different categories require some explanation.
First, we observe that the lifetimes of the wrong-path nodes and body nodes are
disjoint. Hence we do not draw any dependency edges between wrong path nodes
and body nodes. Instead we add a dependency edge between EX(b) and IF (I1)
where b is the branch in the prologue whose misprediction we are considering, and I1
is the first instruction in the basic block being analyzed. This reflects the fact that
instructions in the correct path (the body nodes) are fetched after the mispredicted
branch is resolved. The dependency edges between the prologue and body nodes
are drawn as usual, that is, they are not affected by the insertion of the wrong path
nodes. This is because we do not make any assumptions about when the mispredicted
branch is resolved.
Changes to contention relation Contention relation among prologue, body and
epilogue nodes remains unchanged. We also consider contention of prologue and wrong
path nodes in the estimation algorithm. Contention of body and wrong path nodes
are not considered since the body nodes and wrong path nodes are guaranteed to
have disjoint lifetimes.
6.1.2 Changes to Estimation Algorithm
As before, we use Algorithm 4 to estimate the latest times of prologue nodes; the earliest times of prologue nodes are conservatively estimated as −∞. We still use Algorithm 2 to estimate the latest times and Algorithm 3 to estimate the earliest times of the body and epilogue nodes in the modified execution graph. For the wrong path nodes, we use Algorithms 2 and 3 to estimate the latest/earliest times, but with one important change. We observe that the wrong path nodes are flushed after branch b is resolved. Therefore, the latest ready, start, and finish times of all the wrong path nodes are additionally bounded by latest[t^finish_{EX(b)}].
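This change amounts to clamping the latest times of the wrong-path nodes. A sketch over a hypothetical table of latest times:

    def clamp_wrong_path_nodes(latest, wrong_path_nodes, ex_b):
        """latest[(node, event)]: latest time of an event of a node;
        ex_b: the EX node of the mispredicted branch b."""
        bound = latest[(ex_b, "finish")]
        for node in wrong_path_nodes:
            for event in ("ready", "start", "finish"):
                # wrong-path nodes are flushed when b is resolved
                latest[(node, event)] = min(latest[(node, event)], bound)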
6.1.3 Handling Prediction of Other Branches
So far we have discussed how to handle a mispredicted branch at the end of the
prologue (i.e., if the last instruction before the current basic block is a mispre-
dicted branch). However, the prologue and epilogue can contain multiple conditional
branches if the basic blocks are small. One possibility is to consider both scenarios (correct prediction and misprediction) for each of these conditional branches. However, this
would require considering a large number of possibilities and is clearly very inefficient.
We observe that only the last conditional branch in the prologue has significant
impact on the execution time of a basic block. Therefore, for this branch, we consider
both the correct prediction and the misprediction scenarios and compute the execu-
tion time of the basic block accordingly. This leads to two possible WCET estimates
of the basic block under the two scenarios.
We avoid enumerating correct/wrong prediction of other branches in the prologue
or epilogue (i.e. any branch in the prologue or epilogue apart from the last branch
of prologue) as follows. Consider any such branch b in the prologue or epilogue. We
modify the execution graph such that correct as well as wrong prediction of b is con-
sidered. This is done by defining the special edge from EX(b) to the IF stage of
the first instruction along the correct path as a conditional edge. This conditional
edge is considered during the estimation of the latest times; but it is ignored in the
estimation of earliest times. Similarly, all the wrong path nodes due to misprediction
of b and their contentions are also considered to be conditional. They are considered
during latest times calculations but are ignored for earliest times calculations. The in-
tuition behind this approach is to take both possibilities of prediction (correct/wrong
prediction) into account so as to compute safe upper and lower bounds.
6.2 Timing Estimation of a Basic Block in Pres-
ence of Instruction Caching
We now perform combined analysis of pipelining, branch prediction and instruction
caching. In our earlier discussions, we have assumed that there is no instruction
cache and each instruction fetch takes a single clock cycle. We now discuss how we
can capture the effects of instruction cache misses.
Given a cache configuration, a basic block Bi can be partitioned into a fixed
number of memory blocks, with instructions in each memory block being mapped to
the same cache line (cache accesses of instructions other than the first one in a memory
block are always cache hits). Let the memory blocks be denoted as Bi.1, Bi.2, . . . , Bi.ni ,
where ni is the number of memory blocks in Bi; a cache scenario of Bi is defined as
a mapping of hit or miss to each of the memory blocks of Bi.
Now we study the changes to be made to the estimation of Bi under a particular
cache scenario ω. First, it is obvious that the instruction cache only affects the
latency of the instruction fetch (IF) stage, but does not affect data dependencies or
contentions, thus no changes need to be made to the execution graph. Second, there
is a slight change to the estimation algorithm. Recall that when the instruction cache was not modeled, the IF stage was assigned a single-cycle latency. Now the latency of the IF stage is determined by the cache access result of an instruction. If it is a hit, a single cycle is assigned; if it is a miss, the cache miss penalty N is assigned; otherwise the access result is unknown and the interval [1, N] is assigned, which covers both possibilities.
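The latency assignment can be summarized in a few lines; a minimal sketch, where None encodes an unknown access result:

    def if_stage_latency(access_result, N):
        """Returns the [min, max] latency interval of an IF node;
        N is the cache miss penalty."""
        if access_result == "hit":
            return (1, 1)
        if access_result == "miss":
            return (N, N)
        return (1, N)    # unknown: interval covering both possibilities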
Note that for the context instructions (prologue and epilogue) of Bi we do not
distinguish their cache scenarios. In other words, we conservatively assume the cache
access results of the first instructions of the memory blocks in prologue/epilogue
are unknown. Thus, we assign the interval [1, N ] to the IF stage of each of those
instructions. This policy is based on the observation that cache hits/misses in context
instructions do not affect the execution time of Bi significantly.
In the preceding, we have clarified how to account for timing effects of instruction
cache if we know the cache scenario, that is, whether the memory blocks of a basic
block are in the cache. In reality, we need to consider all possible cache scenarios
and bound the number of occurrences of the different cache scenarios under which a
basic block may be executed. We accomplish this via Integer Linear Programming.
In particular, we introduce ILP variables to capture the number of occurrences of
any basic block Bi under (a) correct/wrong prediction of the preceding branch (b) a
specific cache scenario to denote hit/miss of memory blocks of Bi. Constraints are
imposed on these ILP variables to bound their values, thereby obtaining an estimate
of the program’s WCET.
6.3 Putting It All Together
We now describe the ILP formulation which integrates our analyses of pipelining,
instruction caching and branch prediction. Let B1, . . . , BN be the set of basic blocks
of the program whose WCET we are estimating. Now the execution of Bi is associated
with the prediction of its preceding branch and its cache scenario. We denote the
set of possible cache scenarios at Bi as Ωi. For the possible cache scenarios Ωi of Bi,
the number of cache scenarios can be 2^{n_i} in the worst case, where n_i is the number
of memory blocks of Bi. However, constrained by the program control flow, only a
few scenarios are possible in reality. For better accuracy and less analysis time, it is
necessary to exclude those infeasible ones. This can be achieved by a preprocessing
step. The preprocessing traverses the program control flow graph by propagating and updating cache states; at the entry of each basic block, distinct cache scenarios are collected. The preprocessing terminates when no new scenarios are found at any basic block.
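The preprocessing is a standard worklist fixed point. A sketch assuming a hashable abstract cache state and a per-block transfer function (both hypothetical):

    from collections import deque

    def collect_cache_scenarios(cfg, entry, init_state, transfer):
        """cfg[b]: successors of basic block b; transfer(state, b):
        cache state after executing b. Returns the distinct states
        observed at the entry of each block."""
        seen = {b: set() for b in cfg}
        work = deque([(entry, init_state)])
        while work:
            b, state = work.popleft()
            if state in seen[b]:
                continue               # nothing new at this block
            seen[b].add(state)
            out_state = transfer(state, b)
            for succ in cfg[b]:
                work.append((succ, out_state))
        return seen                    # terminates: finitely many states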
Considering the possible cache scenarios and the correct/wrong prediction of the preceding branch for a basic block, the ILP objective function denoting a program's WCET becomes

    WCET = max Σ_{i=1}^{N} Σ_{j ∈ pred(i)} Σ_{ω ∈ Ω_i} ( costc^ω_{j→i} × ec^ω_{j→i} + costm^ω_{j→i} × em^ω_{j→i} )        (6.1)
where costc^ω_{j→i} is the WCET of Bi executed under the following contexts: (1) Bi is reached from a preceding block Bj; (2) the branch prediction at the end of Bj is correct or Bj does not have a conditional branch; (3) Bi is executed under a cache scenario ω ∈ Ω_i; ec^ω_{j→i} is the number of times that Bi is executed under these contexts. Similarly, costm^ω_{j→i} is the WCET of Bi executed under the following contexts: (1) Bi is reached from a preceding block Bj; (2) the branch at the end of Bj is mispredicted; (3) Bi is executed under a cache scenario ω ∈ Ω_i; em^ω_{j→i} is the number of times that Bi is executed under these contexts.
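Assembling Equation 6.1 is mechanical once the per-context costs have been estimated. A sketch emitting an lp_solve-style objective (hypothetical naming):

    def wcet_objective(blocks, preds, scenarios, costc, costm):
        """costc[(j, i, w)] and costm[(j, i, w)]: constant per-context
        WCETs of block Bi; the ec/em variables count occurrences."""
        terms = []
        for i in blocks:
            for j in preds[i]:
                for w in scenarios[i]:
                    terms.append(f"{costc[(j, i, w)]} ec_{j}_{i}_{w}")
                    terms.append(f"{costm[(j, i, w)]} em_{j}_{i}_{w}")
        return "max: " + " + ".join(terms) + ";"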
Using our out-of-order pipeline analysis in Chapter 4 as well as the extensions proposed in Section 6.1 and Section 6.2, we can estimate the WCET of a basic block provided the correct/wrong prediction of the preceding branch and the cache scenario are known. In other words, we can estimate costc^ω_{j→i} and costm^ω_{j→i} as constants. We now need to develop constraints to bound the ILP variables ec^ω_{j→i} and em^ω_{j→i}.
In Chapter 5, we have proposed an ILP-based branch prediction modeling technique, which can bound correct branch predictions and mispredictions. For instance, the count of mispredictions of a basic block Bi along an edge j → i was denoted as a variable em_{j→i} and was bounded by constraints (5.2)-(5.5) in Section 5.1.1. The correct predictions along the same edge, which were not explicitly defined in Chapter 5, can be straightforwardly defined as ec_{j→i} = e_{j→i} − em_{j→i}.
Now we observe that ec^ω_{j→i} and em^ω_{j→i} are refined forms of ec_{j→i} and em_{j→i}, where block Bi's executions are further distinguished by the cache scenarios at Bi. This leads to the following constraints:

    Σ_{ω∈Ω_i} ec^ω_{j→i} = ec_{j→i}        Σ_{ω∈Ω_i} em^ω_{j→i} = em_{j→i}        (6.2)
On the other hand, with the ILP-based instruction cache modeling by Li et al. [43] as well as our modification to it in Section 5.2, we can further bound the occurrences of cache scenarios by relating them to the cache hits/misses of memory blocks. Recall that in Section 5.2, the cache miss count for a memory block Bi.k was denoted as cm_{i.k} (the cache hit count for Bi.k, which was not explicitly defined, can be straightforwardly defined as ch_{i.k} = v_i − cm_{i.k}, where v_i is the execution count of both the basic block Bi and the memory block Bi.k). Since a cache scenario ω of Bi is an assignment of hit or miss to each of Bi's memory blocks, we can partition Ω_i, the set of possible cache scenarios at Bi, as

    Ω_i = Ω^h_{i.k} ∪ Ω^m_{i.k}

where Ω^h_{i.k} (Ω^m_{i.k}) is the set of those cache scenarios in Ω_i in which memory block Bi.k results in a cache hit (miss); both sets can be computed straightforwardly. The occurrences of the cache scenarios are then related to the hit/miss counts of each memory block Bi.k:

    Σ_{ω∈Ω^h_{i.k}} Σ_j ( ec^ω_{j→i} + em^ω_{j→i} ) = ch_{i.k}
    Σ_{ω∈Ω^m_{i.k}} Σ_j ( ec^ω_{j→i} + em^ω_{j→i} ) = cm_{i.k}        (6.3)

With the constraints in Equations 6.2 and 6.3, ec^ω_{j→i} and em^ω_{j→i} can be effectively bounded.
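A sketch emitting the refinement constraints (6.2) and, under the reconstruction above, the linking constraints (6.3); cache scenarios are encoded as hit/miss tuples (a hypothetical encoding):

    def scenario_constraints(i, preds, scenarios):
        """scenarios: list of tuples; scenarios[w][k] is True iff
        memory block Bi.k hits under cache scenario number w."""
        cons = []
        ws = range(len(scenarios))
        for j in preds[i]:
            for kind in ("ec", "em"):          # Equation 6.2
                lhs = " + ".join(f"{kind}_{j}_{i}_s{w}" for w in ws)
                cons.append(f"{lhs} = {kind}_{j}_{i};")
        for k in range(len(scenarios[0])):     # Equation 6.3
            for hit, var in ((True, f"ch_{i}_{k}"), (False, f"cm_{i}_{k}")):
                occ = [f"{kind}_{j}_{i}_s{w}"
                       for w in ws if scenarios[w][k] == hit
                       for j in preds[i]
                       for kind in ("ec", "em")]
                cons.append(f"{' + '.join(occ) or '0'} = {var};")
        return cons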
Finally, the objective function in Equation 6.1 can be maximized by the ILP
solver subject to (1) the control flow constraints, (2) the branch prediction modeling
constraints, (3) the instruction cache modeling constraints, and (4) the constraints
presented in this section.
6.4 Experimental Evaluation
In this section, we evaluate the accuracy of our estimation technique for the same
benchmarks used in the earlier chapters. The configurations of the pipeline, the
instruction cache and the branch predictor are the same as in the earlier chapters, except that for the branch predictor no branch misprediction penalty is explicitly given. This is because the misprediction penalties are now accounted for by the pipelined execution: how long a misprediction of a branch b lasts depends on when WB(b) completes, which means the misprediction penalties are no longer constants (as assumed in Chapter 5).
The experimental results are given in Table 6.1. As the Ratio column indicates,
the combined analysis yields tight estimations. To further illustrate the effectiveness
of the technique for the combined analysis, we compare the overestimation to that in
the pure pipeline modeling in Figure 6.2. It clearly shows that the combined analysis
does not produce significantly more overestimation than the pure pipeline modeling
does. This justifies our modifications to the pipeline analysis algorithms which take
Program    Obs. WCET    Est. WCET    Ratio    Analysis time (sec.)    ILP Solving time (sec.)
adpcm 153845 240615 1.56 11.75 10.9
compress 8933 11283 1.26 17.71 0.2
dhry 195746 201597 1.03 11.79 0.1
fdct 6970 7106 1.02 0.22 0.01
fft 897481 1044961 1.16 2.20 0.01
fir 53221 64172 1.21 2.79 0.9
ludcmp 12941 15989 1.24 2.46 0.2
matmul 14792 18633 1.26 0.11 0.01
matsum 101276 101312 1.00 0.13 0.01
minver 9382 11562 1.23 8.40 0.5
qurt 2563 3034 1.18 4.03 1.1
whet 850714 963639 1.13 4.18 0.01

Table 6.1: Results of the Combined Analysis of Pipelining, Branch Prediction and Instruction Caching
Figure 6.2: Comparison of Overestimations of Pure Pipeline Analysis and Combined
Analysis
into account mispredictions and cache misses. We also observed that for some of the benchmarks, such as compress, fir, and qurt, the overestimation in the combined analysis is even less than in the pure pipeline modeling. This can be explained as follows: with the occurrences of cache misses and branch mispredictions in the pipelined execution, instructions may have fewer chances to contend with each other. For example, with a cache miss, it might be possible for our estimation algorithms to determine that for an instruction I preceding the cache miss and another instruction I′ following it, separated[EX(I), EX(I′)] = true, which means I and I′ cannot contend with each other in any case. This fact might not hold without the cache miss. As we learned from Chapter 4, fewer contentions can lead to more accurate estimates.
On the other hand, as the Analysis time and ILP Solving time columns indicate, the combined analysis can be performed very efficiently. For example, compared to the ILP solving times in the branch prediction analysis (refer to Table 5.1), the increase in ILP solving time in the combined analysis is not significant. This is because the main difference between the combined analysis and the branch prediction analysis is how the cost of a basic block is determined, while the differences between the ILP problems in the two analyses are not substantial.
6.5 Summary
This chapter presents a combined analysis of out-of-order pipelining, branch predic-
tion and instruction caching. We achieve this by studying the timing effects of branch
prediction and instruction caching on the pipeline, and we extend the pipeline analysis
algorithms to capture these effects. For branch prediction, we add additional nodes
and edges corresponding to speculatively executed code into the execution graph,
and make slight changes to the estimation algorithms to account for the differences introduced by speculative execution. For instruction caching, the change made to the pipeline analysis is even more straightforward: the original single-cycle latencies of the instruction fetches are replaced by latencies corresponding to cache hits, cache misses, or both possibilities. The insignificant modifications made to the pure pipeline analysis suggest that our framework can be readily extended to model further microarchitectural features.

CHAPTER VII

CONCLUSION
This chapter concludes the thesis. In Section 7.1 we summarize the contribution of
this thesis and in Section 7.2 we discuss some future directions.
7.1 Summary of the Thesis
Worst Case Execution Time (WCET) prediction has been a fundamental problem for
hard real-time systems. Typically, the WCET of a task is hard to predict by running
the task because all possible data inputs have to be evaluated to guarantee that
the worst case is covered. As a result, static Worst Case Execution Time analysis,
which predicts the maximum running time of the program without actually running
it, has become a promising alternative approach and extensive research has been
conducted in this direction. In general, it consists of three subtasks: (1) program
path analysis, which identifies feasible/infeasible program paths; (2) microarchitec-
ture modeling, which models the timing effects of hardware features to determine
instruction timing; and (3) WCET calculation, which calculates the WCET of the
program with program path information and instruction timing information. Among
them, microarchitecture modeling has become an increasingly important yet difficult
task mainly because modern processors have employed aggressive microarchitectural
features in the quest for higher performance.
In this thesis, we study the core microarchitectural features of modern processors,
namely out-of-order pipelines, dynamic branch predictions and instruction caching.
We have developed a microarchitecture modeling framework which models the above
three features in combination. The framework consists of two levels: the local level
analyses estimate the worst case execution time of a basic block under a specific
execution context, while the global level analyses are responsible for identifying exe-
cution contexts for basic blocks and bounding the occurrences of these contexts. This
way, we can estimate the WCET of the whole program by summing up the execution
times of basic blocks under different execution contexts. Under this framework, we
have developed analytical techniques for the individual microarchitectural features
and have proposed a method for combining them all.
First, for out-of-order pipelines, we have proposed an innovative technique to
address a phenomenon called timing anomaly [50]. In the presence of timing anomaly,
techniques which generally take the local worst case for WCET estimation no longer
guarantee safe bounds. This prompts the need to consider all possible local cases
and their subsequent executions. However, a naive approach which enumerates the
possible cases individually is often expensive in terms of both the analysis time and
resource needs. Our technique avoids enumeration for individual cases. The key point
of this technique is a fixed-point analysis of time intervals at which the instructions can
enter/leave the pipeline stages. Experimental results have shown that this technique
yields accurate results and works efficiently.
Second, for dynamic branch predictions, we have proposed an Integer Linear Pro-
gramming based framework to bound branch mispredictions. The branch prediction
analysis is integrated with the ILP based WCET calculation. We follow this strategy
because branch prediction exhibits a strong global nature, that is, the prediction is
based on the executions of earlier branches, whose distance from the current branch may be near or far. As a result, global program flow information is needed, which
can be provided by the ILP based WCET calculation. This ILP-based framework is
parameterized and can be straightforwardly targeted to a variety of branch prediction
schemes. Apart from branch prediction modeling, we have also studied the effect of
speculative execution (via branch prediction) on instruction caching. The effect is
captured by modifications to an existing ILP based instruction cache analysis [43].
Last, we have combined the analyses of the three features: out-of-order pipelines,
branch prediction and instruction caching. We do so by studying the timing effects of
branch prediction and instruction caching on the pipeline and making modifications
to the pure pipeline analysis algorithms to capture their effects. The modifications
are not substantial and the combined analysis works efficiently, suggesting a good
extensibility of our framework for modeling more microarchitectural features.
7.2 Future Work
We have identified the following directions to be pursued in the future.
Data cache analysis Data cache is another important feature in current proces-
sors. Unlike instruction cache, whose behavior is only determined by the program
flow, the behavior of data cache is affected by both the program flow and data values.
As a result, techniques which exploit control flow information for instruction cache
analysis are not sufficient in the context of data cache; and we need to develop new
methods to model it.
Analysis for real-life processors We would like to extend our work to real-life su-
perscalar processors which essentially have the three components we have addressed.
Working on a real-life processor is more challenging. An important issue faced by
our pipeline analysis technique is the size of the instruction window. Current high-
performance processors have much larger instruction windows than the 8-entry in-
struction window assumed in our processor model. With a large instruction win-
dow, significantly more context instructions as well as more possible contexts need
to be considered in pipeline analysis. Improvements need to be made to address the
degradation of analysis performance when modeling processors with large instruction
windows.
Integration with program path analysis There has been substantial program
path analysis work in the literature. By excluding infeasible paths with the help of
program path analysis, the accuracy of WCET analysis can be significantly improved.
We will try to adopt some program path analysis techniques into our framework.
WCET optimization Its purpose is to reduce the estimated WCET by program
transformation. There have been some research activities in this direction. Zhao et
al. [77, 76] optimize the WCET of a program by code positioning or by optimizing the
worst case path using compiler optimizations like path duplication and loop unrolling.
Bodin and Puaut [5] propose a WCET-oriented static branch prediction algorithm
for processors supporting compiler-directed branch prediction.
Integrating the timing analyzer with the compiler The timing analyzer needs both high-level source code information and low-level object code information. For example, users of the analyzer may want to give program path information at the source code level and want the compiler to transform it into an object code level representation. The compiler can also pass the results of its data flow analyses, such as loop bounds or infeasible paths, to the timing analyzer. The key issue in integrating the timing analyzer with the compiler is to develop a standard interface between the two parties, such that when the timing analyzer is targeted to a new compiler, both sides can be adapted with minimal effort.

APPENDIX A

PROOFS FOR THE PIPELINE ANALYSIS ALGORITHMS
In Chapter 4, we have presented how the algorithms (Algorithms 1, 2, 3 and 4) produce estimates for the worst case costs of basic blocks. Intuitively, they start with conservative timing intervals for the executions of instructions and iteratively tighten the intervals until a fixed point is reached. In this appendix, we give a formal proof of their correctness, that is, the calculated intervals indeed cover all possible execution times of instructions and the worst case costs of basic blocks are not underestimated.

A.1 Proofs for the Context-Free Estimation

In this section we prove the correctness of the algorithms in Section 4.2.1, where we do not consider the execution context of a basic block. We want to prove that the estimated WCET of a basic block is no less than any possible execution time of that basic block, by showing that the latest times and earliest times calculated by Algorithm 2 and Algorithm 3 for the execution graph nodes are indeed upper and lower bounds of the corresponding execution times.
Lemma A.1. Let u and v be two contending nodes in the execution graph with
u ∈ late_contenders(v), and let S_late be the set of late contenders of v computed
by Algorithm 2. If in a particular run u delays v, and the relationship of the execution
times with the earliest and latest times calculated by our algorithms is that
∀w ∈ {u, v},

    earliest[t_w^ready] ≤ t_w^ready ≤ latest[t_w^ready]
    earliest[t_w^start] ≤ t_w^start ≤ latest[t_w^start]
    earliest[t_w^finish] ≤ t_w^finish ≤ latest[t_w^finish]        (A.1)

then u ∈ S_late, which means the actual late contender delaying v is in the calculated
set of late contenders.
Proof. Since u is a late contender delaying v, u starts execution before v becomes
ready and is still executing at that point, i.e., t_u^ready ≤ t_u^start < t_v^ready < t_u^finish. Now we prove
the lemma in two steps. First, we show that separated(u, v) = false. By definition,
separated(u, v) = true must satisfy the following inequality:

    earliest[t_u^ready] ≥ latest[t_v^finish]  ∨  earliest[t_v^ready] ≥ latest[t_u^finish]

Now we prove that neither disjunct can be true. With t_u^ready < t_v^ready and (A.1),

    earliest[t_u^ready] ≤ t_u^ready < t_v^ready < t_v^finish ≤ latest[t_v^finish]

Similarly,

    earliest[t_v^ready] ≤ t_v^ready < t_u^finish ≤ latest[t_u^finish]

Combining the two, separated(u, v) = false.

Second, we show that earliest[t_u^start] < latest[t_v^ready]. This is true because

    earliest[t_u^start] ≤ t_u^start < t_v^ready ≤ latest[t_v^ready]

Therefore, following the calculation of S_late in Algorithm 2, u ∈ S_late.
Lemma A.2. Let v be a node in the execution graph and let S_early be its early
contenders computed by Algorithm 2. If in a particular run the actual early contenders
delaying v are S′_early, and the inequalities in (A.1) are true for v and S′_early, then
S′_early ⊆ S_early, which means the actual early contenders delaying v are included in
the set of early contenders calculated by Algorithm 2.
Proof. Since every u ∈ S′_early is an early contender delaying v, t_u^ready < t_v^finish and
t_v^ready < t_u^finish. With (A.1), we have

    earliest[t_u^ready] ≤ t_u^ready < t_v^finish ≤ latest[t_v^finish]

and

    earliest[t_v^ready] ≤ t_v^ready < t_u^finish ≤ latest[t_u^finish]

which means separated(u, v) = false. Thus u ∈ S_early and S′_early ⊆ S_early.
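
Both lemmas hinge on the separated test and the contender-set construction. In code
form they amount to the following sketch (consistent with the dictionary
representation used above, not a verbatim transcription of the thesis pseudocode):

    def separated(u, v, earliest, latest):
        # u and v can never contend if one provably becomes ready only
        # after the other has already finished.
        return (earliest[(u, 'ready')] >= latest[(v, 'finish')] or
                earliest[(v, 'ready')] >= latest[(u, 'finish')])

    def contender_sets(v, contenders, earliest, latest):
        # Over-approximations of the contenders that may delay v:
        # s_early contains every contender not separated from v (Lemma A.2);
        # s_late additionally requires that the contender may start before
        # v is ready (Lemma A.1).
        s_early = {u for u in contenders
                   if not separated(u, v, earliest, latest)}
        s_late = {u for u in s_early
                  if earliest[(u, 'start')] < latest[(v, 'ready')]}
        return s_early, s_late

The two lemmas state exactly that these over-approximations never miss an actual
contender that delays v.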
Theorem A.3. For every node v in the execution graph, the following relationship
between the actual execution times of v and its earliest/latest times calculated by
Algorithms 2 and 3 in each iteration of Algorithm 1 is true.
    earliest[t_v^ready] ≤ t_v^ready ≤ latest[t_v^ready]        (A.2)
    earliest[t_v^start] ≤ t_v^start ≤ latest[t_v^start]        (A.3)
    earliest[t_v^finish] ≤ t_v^finish ≤ latest[t_v^finish]        (A.4)
Proof. We prove it by induction. Assume (A.2 - A.4) are true for all nodes in previous
iterations and for the nodes earlier than v in topologically sorted order in the current
iteration. We show that (A.2 - A.4) are also true for v in the current iteration.
Obviously, the base case is true, since the latest times are initialized as ∞ and the
earliest times are initialized as 0 or as the minimum latencies (for finish events). For
the induction case, we discuss the latest times.
For v's ready time, v becomes ready when all of its immediate predecessors have
finished, i.e., t_v^ready = max_{u∈DE(v)}(t_u^finish), where DE(v) denotes the immediate
predecessors of v. By the induction assumption, t_u^finish ≤ latest[t_u^finish] for every
u ∈ DE(v). On the other hand, by Algorithm 2 (Lines 12 - 13), latest[t_v^ready] is no
less than max_{u∈DE(v)}(latest[t_u^finish]). Therefore t_v^ready ≤ latest[t_v^ready].
For v's start time, let the late contender delaying v, if any, be w, and its delay to
v be d1 cycles; let the early contenders delaying v, if any, be S′_early, and their delays
to v be d2 cycles (note that d1 must happen before d2, as w can only delay v by starting
execution before v is ready). Then t_v^start = t_v^ready + d1 + d2. For d1,

    t_v^ready + d1 ≤ max(t_v^ready, t_w^start + max_lat_v − 1)

According to Lemma A.1, w ∈ S_late; along with the induction assumption, we can
derive the following from the above inequality:

    t_v^ready + d1 ≤ max(latest[t_v^ready], max_{u∈S_late}(latest[t_u^start]) + max_lat_v − 1)

which means

    t_v^ready + d1 ≤ latest[t_v^start]′        (A.5)

where latest[t_v^start]′ is the intermediate latest start time computed on Line 6 in
Algorithm 2.
Next, for d2, suppose each u ∈ S′_early delays v for d_u cycles (where
d_u ≤ max_lat_u = max_lat_v). Then d2 = Σ_{u∈S′_early} d_u ≤ |S′_early| × max_lat_v.
According to Lemma A.2, S′_early ⊆ S_early. Thus

    d2 ≤ |S_early| × max_lat_v        (A.6)
Now we examine t_v^start under two cases: d2 = 0 and d2 > 0.

In the first case, t_v^start = t_v^ready + d1, and according to (A.5),
t_v^start ≤ latest[t_v^start]′. Comparing to the latest[t_v^start] calculated on
Line 10 in Algorithm 2, t_v^start ≤ latest[t_v^start].
In the second case, one implication is that t_v^start cannot be later than the finish
times of the early contenders delaying v, that is, t_v^start ≤ max_{u∈S′_early}(t_u^finish).
Since S′_early ⊆ S_early and t_u^finish ≤ latest[t_u^finish] (by induction), we can derive from
the above

    t_v^start ≤ max_{u∈S_early}(latest[t_u^finish])        (A.7)
On the other hand, by applying (A.5) and (A.6),

    t_v^start = t_v^ready + d1 + d2
              ≤ latest[t_v^start]′ + d2
              ≤ latest[t_v^start]′ + |S_early| × max_lat_v        (A.8)
Combining (A.7) and (A.8),

    t_v^start ≤ min( max_{u∈S_early}(latest[t_u^finish]), latest[t_v^start]′ + |S_early| × max_lat_v )        (A.9)
in which the right hand side corresponds to tmp on Line 9 in Algorithm 2. Comparing
to the latest[t_v^start] calculated on Line 10, t_v^start ≤ latest[t_v^start].
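
The bounds (A.5) - (A.9) correspond to the following computation of a node's latest
start time. This is a sketch based on the formulas derived above (the max_lat field
and the explicit set arguments are assumptions), not a verbatim transcription of
Algorithm 2:

    def latest_start(v, s_late, s_early, latest):
        # Line 6: intermediate bound accounting only for the late contender;
        # a late contender can occupy the unit for at most max_lat - 1 more
        # cycles once v is ready.
        lst = latest[(v, 'ready')]
        if s_late:
            lst = max(lst, max(latest[(u, 'start')] for u in s_late)
                           + v.max_lat - 1)
        if s_early:
            # Line 9 (tmp): early contenders add at most |s_early| * max_lat
            # cycles, and v starts no later than their latest finish (A.9).
            tmp = min(max(latest[(u, 'finish')] for u in s_early),
                      lst + len(s_early) * v.max_lat)
            # Line 10: cover both cases of the proof.
            lst = max(lst, tmp)
        return lst

Taking the maximum in the last step covers both cases: with d2 = 0 the intermediate
bound (A.5) applies, and with d2 > 0 the bound (A.9) applies.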
For v's finish time, suppose v executes for lat_v (≤ max_lat_v) cycles. Then

    t_v^finish = t_v^start + lat_v ≤ latest[t_v^start] + max_lat_v

which is the latest[t_v^finish] computed by Algorithm 2.
Thus we have proved that the latest times calculated by Algorithm 2 indeed
provide upper bounds for the actual execution times of the nodes in the execution
graph. Similarly, we can prove that the earliest times calculated by Algorithm 3
indeed provide lower bounds.
With Theorem A.3, we can claim that the WCET of a basic block estimated by the
algorithms (1, 2 and 3) in Section 4.2.1 is a safe upper bound of the possible execution
times of that basic block: the estimated WCET is given by the latest completion time
latest[t_CM(In)^finish] of the commit of the last instruction In,
which by Theorem A.3 is no less than the actual execution time, t_CM(In)^finish.
A.2 Proofs for the Context-Inclusive Estimation
In this section we prove the correctness of the algorithms in Section 4.2.2, where we
take the execution context of a basic block into account. We want to prove that the
estimated WCET for a basic block is no less than any possible execution time of
that basic block under any context. The execution time of a basic block is estimated
as its completion time minus δ, the overlap of its execution with that of the preceding
blocks. If the completion time is not underestimated and δ, the overlap, is
not overestimated, then the estimated execution time is correct. The correctness of
the overlap estimation has been guaranteed by Theorem 4.1. Therefore we only need to
prove that the completion time of the basic block is not underestimated. We do this by proving
that for any node (prologue, body or epilogue), the estimated latest and earliest times
are indeed upper and lower bounds for its actual execution times.
We first show that the execution times of the prologue nodes are correctly bounded.
Algorithm 4, which estimates the prologue nodes, consists of two parts: one part for the
estimation of the shaded nodes, which have paths to IF(I1), the fetch of the first
instruction in the body; and the other part for the rest of the prologue nodes. The
correctness of the first part has already been guaranteed by Inequality 4.2. The second
part differs only in a latest ready time calculated differently on Lines 9 and 10 and
a maximized estimated delay from the late contender on Line 11. Now we only need
to prove the correctness of these two differences, because the proof for the rest of the
algorithm can follow that in the previous section.
Lemma A.4. Suppose that for each prologue node preceding an unshaded node v in
topologically sorted order, its latest and earliest times provide upper and lower bounds
for its execution times. Then v's latest ready time calculated by Lines 9 and 10 in
Algorithm 4 is an upper bound for its actual ready time.
Proof. Let the immediate predecessors of v (those with a dependence edge to v),
denoted as DE(v), be partitioned into two parts: those in the prologue, denoted as
DE1(v), and those before the prologue, denoted as DE2(v).

































Second, all nodes in DE2(v) are pre-prologue nodes, and they should have completed
by the time the commit of the last pre-prologue instruction I−p becomes ready, that is,

    max_{u∈DE2(v)}(t_u^finish) ≤ t_CM(I−p)^ready ≤ latest[t_CM(I−p)^ready]        (A.12)
Combining the two parts, v's actual ready time, max_{u∈DE(v)}(t_u^finish), is bounded
from above by the latest ready time calculated by Lines 9 and 10.





The correctness of Line 11 for bounding the delay from the late contender is obvious:
the maximum possible delay, max_lat_v − 1, is assumed.
Theorem A.5. For every node v in the execution graph including the prologue, body
and epilogue, Inequalities (A.2 - A.4) are satisfied. In other words, the estimated latest
and earliest times indeed provide upper and lower bounds for the actual execution
times.
Proof. For the prologue nodes, the correctness of the only differences between Algorithm 2
and Algorithm 4 has been proved by Lemma A.4, and the proof for the rest
of Algorithm 4 is the same as the proof for Algorithm 2 in the last section. Similarly,
the estimation algorithms for the body and epilogue nodes are exactly the same as in
the last section, whose correctness has already been proved. Thus Inequalities (A.2 - A.4)
hold.
With Theorem A.5, the estimated completion time of the basic block, latest[t_CM(In)^ready],
is an upper bound to the actual t_CM(In)^ready. Since the estimated overlap δ has been proved
not to exceed the actual overlap (Theorem 4.1), the estimate latest[t_CM(In)^ready]
− δ is an upper bound to the actual execution time.
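
In code form, this final safety argument is a one-line computation (the names are
illustrative):

    def estimated_wcet(latest_completion, delta):
        # latest_completion over-approximates the actual completion time of
        # the basic block (Theorem A.5), while the overlap delta is never
        # overestimated (Theorem 4.1); hence the difference over-approximates
        # the actual execution time of the block under its context.
        return latest_completion - delta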
REFERENCES
[1] Aho, A., Sethi, R., and Ullman, J., Compilers: Principles, Techniques and Tools.
Addison-Wesley, 1986.
[2] Altenbernd, P., “On the false path problem in hard real-time programs,” in
8th Euromicro Workshop on Real Time Systems (WRTS), 1996.
[3] Arnold, R., Mueller, F., Whalley, D., and Harmon, M., “Bounding
worst-case instruction cache performance,” in IEEE Real-Time Systems Sympo-
sium, 1994.
[4] Bate, I. and Reutemann, R., “Worst-case timing analysis for dynamic branch
predictors,” in Proceedings of the 16th Euromicro Conference on Real-Time Sys-
tems (ECRTS’04), 2004.
[5] Bodin, F. and Puaut, I., “A WCET-oriented static branch prediction scheme
for real-time systems,” in Proc. of the 17th Euromicro Conference on Real-Time
Systems, (Palma de Mallorca, Spain), July 2005.
[6] Burger, D. and Austin, T., “The SimpleScalar Tool Set, Version 2.0,” Tech-
nical Report CS-TR-1997-1342, University of Wisconsin, Madison, June 1997.
[7] Char, B., Geddes, K., Gonnet, G., Leong, B., Monagan, M., and
Watt, S., Maple V Language Reference Manual. Springer-Verlag, 1991.
[8] Chen, K., Malik, S., and August, D., “Retargetable static software timing
analysis,” in IEEE/ACM Intl. Symp. on System Synthesis (ISSS), 2001.
[9] Colin, A. and Puaut, I., “Worst case execution time analysis for a processor
with branch prediction,” Journal of Real-Time Systems, May 2000.
[10] Colin, A. and Puaut, I., “A modular and retargetable framework for tree-
based WCET analysis,” in Proc. of the 13th Euromicro Conference on Real-Time
Systems, (Delft, The Netherlands), pp. 37–44, June 2001.
[11] Colin, A. and Puaut, I., “A modular and retargetable framework for tree-
based WCET analysis,” Tech. Rep. 0, IRISA, March 2001.
[12] Combs, J., Combs, C., and Shen, J., “Mispredicted path cache effects,” in
Euro-Par Conference, 1999.
[13] Cormen, T., Leiserson, C., Rivest, R., and Stein, C., Introduction to
Algorithms (Second Edition). MIT Press, 2001.
[14] Cousot, P. and Cousot, R., “Abstract interpretation: a unified lattice model
for static analysis of programs by construction or approximation of fixpoints,”
in ACM Symposium on Principles of Programming Languages, 1977.
[15] CPLEX, “The ILOG CPLEX Optimizer v7.5,” 2002. Commercial software,
http://www.ilog.com.
[16] Engblom, J., Processor Pipelines and Static Worst-Case Execution Time
Analysis. PhD thesis, Uppsala University, Sweden, 2002.
[17] Engblom, J., “Analysis of the execution time unpredictability caused by dy-
namic branch prediction,” in IEEE Real-Time and Embedded Technology and
Applications Symposium (RTAS), 2003.
[18] Engblom, J. and Ermedahl, A., “Modeling complex flows for worst-case
execution time analysis,” in IEEE Real-Time Systems Symposium, 2000.
[19] Engblom, J., Ermedahl, A., and Altenbernd, P., “Facilitating worst-
case execution times analysis for optimized code,” in Proceedings of the 10th
Euromicro Real-Time Systems Workshop, 1998.
[20] Ermedahl, A. and Gustafsson, J., “Deriving annotations for tight calcula-
tion of execution time,” in European Conference on Parallel Processing, 1997.
[21] Ferdinand, C., Heckmann, R., Langenbach, M., Martin, F., Schmidt,
M., Theiling, H., Thesing, S., and Wilhelm, R., “Reliable and precise
WCET determination for a real-life processor,” in Intl. Workshop on Embedded
Software (EmSoft), 2001.
[22] Ferdinand, C. and Wilhelm, R., “Fast and Efficient Cache Behavior Pre-
diction for Real-Time Systems,” Real-Time Systems, vol. 17, no. (2/3), 1999.
[23] Fields, B., Bodik, R., and Hill, M., “Slack: Maximizing performance under
technological constraints,” in 29th ACM Annual International Symposium on
Computer Architecture, 2002.
[24] Healy, C., Arnold, R., Mueller, F., Whalley, D., and Harmon, M.,
“Bounding pipeline and instruction cache performance,” IEEE Transactions on
Computers, vol. 48, no. 1, 1999.
[25] Healy, C., Sjodin, M., Rustagi, V., and Whalley, D., “Bounding loop
iterations for timing analysis,” in IEEE Real-Time Applications Symposium
(RTAS), 1998.
[26] Healy, C., Sjodin, M., Rustagi, V., Whalley, D., and Engelen, R.,
“Supporting timing analysis by automatic bounding of loop iterations,” Real-
Time Systems, vol. 18, no. 2/3, pp. 129–156, 2000.
[27] Healy, C. and Whalley, D., “Automatic detection and exploitation of branch
constraints for timing analysis,” IEEE Transactions on Software Engineering,
vol. 28, no. 8, 2002.
[28] Healy, C., Whalley, D., and Harmon, M., “Integrating the timing analysis
of pipelining and instruction caching,” in IEEE Real-Time Systems Symposium
(RTSS), 1995.
[29] Heckmann, R., Langenbach, M., Thesing, S., and Wilhelm, R., “The
Influence of Processor Architecture on the Design and the Results of WCET
Tools,” Proceedings of the IEEE, vol. 91, July 2003.
[30] Hennessy, J. and Patterson, D., Computer Architecture: A Quantitative
Approach. Morgan Kaufmann, 1996.
[31] Hur, Y., Bae, Y. H., Lim, S.-S., Kim, S.-K., Rhee, B.-D., Min, S. L.,
Park, C. Y., Shin, H., and Kim, C. S., “Worst case timing analysis of RISC
processors: R3000/R3010 case study,” in IEEE Real-Time Systems Symposium
(RTSS), 1995.
[32] SiByte Inc., “SiByte SB-1 MIPS64 embedded CPU Core,” in Embedded Processor
Forum, 2000.
[33] Kirner, R. and Puschner, P., “Transformation of path information for
WCET analysis during compilation,” in 13th Euromicro Conference on Real-
Time Systems, 2001.
[34] Kirner, R. and Puschner, P., Extending Optimising Compilation to Support
Worst-Case Execution Time Analysis. PhD thesis, Vienna University of
Technology, 2003.
[35] Kligerman, E. and Stoyenko, A. D., “Real-Time Euclid: a language for
reliable real-time systems,” IEEE Trans. Softw. Eng., vol. 12, no. 9, pp. 941–
949, 1986.
[36] Langenbach, M., Thesing, S., and Heckmann, R., “Pipeline modeling for
timing analysis,” in Static Analysis Symposium (SAS), 2002.
[37] Li, X., Mitra, T., and Roychoudhury, A., “Accurate timing analysis by
modeling caches, speculation and their interaction,” in ACM Design Automation
Conf. (DAC), 2003.
[38] Li, X., Mitra, T., and Roychoudhury, A., “Modeling control speculation
for timing analysis,” Journal of Real-Time Systems, vol. 29, no. 1, 2005.
[39] Li, X., Roychoudhury, A., and Mitra, T., “Modeling out-of-order proces-
sors for software timing analysis,” in IEEE Real-Time Systems Symposium, 2004.
[40] Li, Y.-T. S. and Malik, S., “Performance analysis of embedded software using
implicit path enumeration,” in Workshop on Languages, Compilers and Tools for
Real-Time Systems, 1995.
[41] Li, Y.-T. S., Malik, S., and Wolfe, A., “Efficient microarchitecture modeling
and path analysis for real-time software,” in Proceedings of the IEEE Real-Time
Systems Symposium, 1995.
[42] Li, Y.-T. S., Malik, S., and Wolfe, A., “Cache modeling for real-time
software: Beyond direct mapped instruction caches,” in Proceedings of the IEEE
Real-Time Systems Symposium, 1996.
[43] Li, Y.-T. S., Malik, S., and Wolfe, A., “Performance estimation of embed-
ded software with instruction cache modeling,” ACM Transactions on Design
Automation of Electronic Systems, vol. 4, no. 3, 1999.
[44] Lim, S.-S., Bae, Y., Jang, G., Rhee, B.-D., Min, S., Park, C., Shin, H.,
Park, K., and Kim, C., “An accurate worst-case timing analysis technique for
RISC processors,” IEEE Transactions on Software Engineering, vol. 21, no. 7,
1995.
[45] Lim, S.-S., Bae, Y., Jang, G., Rhee, B., Min, S., Park, C., Shin, H.,
Park, K., and Kim, C., “An accurate worst case timing analysis technique for
RISC processors,” in IEEE Real-Time Systems Symposium, 1994.
[46] Lim, S.-S., Han, J., Kim, J., and Min, S., “A worst case timing analysis
technique for multiple-issue machines,” in IEEE Real Time Systems Symposium
(RTSS), pp. 334–345, 1998.
[47] Liu, Y. and Gomez, G., “Automatic time-bound analysis for a higher-order
language,” in Proceedings of the ACM SIGPLAN Workshop on Languages, Com-
pilers, and Tools for Embedded Systems (LCTES), 1998.
[48] Liu, Y. and Gomez, G., “Automatic accurate cost-bound analysis for high-level
languages,” IEEE Transactions on Computers, vol. 50, no. 12, 2001.
[49] Lundqvist, T. and Stenström, P., “An integrated path and timing analysis
method based on cycle-level symbolic execution,” Journal of Real-Time Systems,
vol. 17, no. 2-3, 1999.
[50] Lundqvist, T. and Stenström, P., “Timing anomalies in dynamically scheduled
microprocessors,” in IEEE Real-Time Systems Symposium, 1999.
[51] Mälardalen Real-Time Research Centre, “WCET Benchmarks.”
http://www.mrtc.mdh.se/projects/wcet/benchmarks.html.
[52] McFarling, S., “Combining branch predictors,” tech. rep., DEC Western Re-
search Laboratory, 1993.
[53] McMillan, K. and Dill, D., “Algorithms for interface timing verification,”
in IEEE International Conference on Computer Design, 1992.
[54] IBM Microelectronics, “PowerPC 440GP Embedded Processor,” in Embedded
Processor Forum, 2001.
[55] Mitra, T., Roychoudhury, A., and Li, X., “Timing analysis of embedded
software for speculative processors,” in ACM SIGDA International Symposium
on System Synthesis (ISSS), 2002.
[56] Mueller, F. and Whalley, D. B., “Fast instruction cache analysis via static
cache simulation,” in Simulation Symposium, 1995.
[57] Mueller, F., Static Cache Simulation and its Applications. PhD thesis, The
Florida State University, 1994.
[58] Park, C., Predicting Deterministic Execution Times of Real-Time Programs.
PhD thesis, University of Washington, 1992.
[59] Park, C. and Shaw, A., “Experiments with a program timing tool based on
source-level timing schema,” Computer, vol. 24, no. 5, 1991.
[60] Pierce, J. and Mudge, T., “Wrong-path instruction prefetching,” in ACM
Intl. Symp. on Microarchitecture (MICRO), 1996.
[61] Price, C., “MIPS IV Instruction Set, revision 3.1,” 1995.
[62] Puschner, P. and Koza, C., “Calculating the maximum execution time of
real-time programs,” Journal of Real-Time Systems, vol. 1, no. 2, 1989.
[63] Puschner, P., “Worst-case execution time analysis at low cost,” Control En-
gineering Practice, vol. 6, pp. 129–135, Jan. 1998.
[64] Real-Time Research Group at Seoul National University, “SNU
Real-Time Benchmarks.” http://archi.snu.ac.kr/RESEARCH/index.html.
[65] Schneider, J. and Ferdinand, C., “Pipeline behavior prediction for super-
scalar processors by abstract interpretation,” in ACM Intl. Workshop on Lan-
guages, Compilers and Tools for Embedded Systems (LCTES), 1999.
[66] Schrijver, A., Theory of Linear and Integer Programming. John Wiley Ltd.,
1986.
[67] Shaw, A., “Reasoning about time in higher level language software,” IEEE
Transactions on Software Engineering, vol. 15, no. 7, 1989.
[68] Sohi, G., “Instruction issue logic for high-performance, interruptible, multiple
functional unit, pipelined computers,” IEEE Transactions on Computers, vol. 39,
no. 3, 1990.
[69] Stappert, F., Ermedahl, A., and Engblom, J., “Efficient longest exe-
cutable path search for programs with complex flows and pipeline effects,” Tech.
Rep. 2001-012, Uppsala University, 2001.
[70] Sultan, A., Linear Programming, An Introduction with Applications. Academic
Press Inc., 1986.
[71] Theiling, H. and Ferdinand, C., “Combining Abstract Interpretation and
ILP for Microarchitecture Modelling and Program Path Analysis,” in Proceedings
of the 19th IEEE Real-Time Systems Symposium, 1998.
[72] Theiling, H., Ferdinand, C., and Wilhelm, R., “Fast and precise WCET
prediction by separated cache and path analysis,” Journal of Real-Time Systems,
May 2000.
[73] Thesing, S., Safe and Precise Worst-Case Execution Time Prediction by Ab-
stract Interpretation of Pipeline Models. PhD thesis, University of Saarland,
2004.
[74] Yeh, T. and Patt, Y., “Alternative implementations of two-level adaptive
branch prediction,” in ACM Intl. Symp. on Computer Architecture (ISCA), 1992.
[75] Yen, T. and Wolf, W., “Performance estimation for real-time distributed
embedded systems,” IEEE Transactions on Parallel and Distributed Systems,
vol. 9, no. 11, 1998.
[76] Zhao, W., Kreahling, W., Whalley, D., Healy, C., and Mueller, F.,
“Improving WCET by optimizing worst-case paths,” in IEEE Real-Time and
Embedded Technology and Applications Symposium, 2005.
[77] Zhao, W., Whalley, D., Healy, C., and Mueller, F., “WCET code po-
sitioning,” in IEEE Real-Time Systems Symposium, 2004.
