PRECISE YET SCALABLE RESOURCE ANALYSIS VIA SYMBOLIC EXECUTION by RASOOL MAGHAREH




FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE





I hereby declare that the thesis is my original work and it has been written by me
in its entirety. I have duly acknowledged all the sources of information which
have been used in the thesis.
This thesis has also not been submitted for any degree in any university
previously.
Rasool Maghareh
Friday 10th February, 2017
Dedicated to my grandmother Mina (1941 - 2013)
Sadly missed, dearly remembered, forever loved.
Acknowledgments
In the name of Allah, God, the Creator and Nurturer. I can never express all my
gratitude and appreciation for all his favors. I am sincerely thankful for each
and every one of his blessings in the past few years.
I would like to thank my wife Narjes for her love and support. Without her
support, I would not have been able to write this thesis. I would also like to
thank my parents and my wife's parents Mohammad, Zohreh, Ali and Farahnaz and
all my family members including our grandparents Morteza, Maryam, Fatemeh,
Zahra, my siblings and their families Hossein, Sadegh, Ghazaleh, my brothers-
and sisters-in-law and their families Motahareh, Ali, Mohammad, Hengameh, Ahmad
and Bita. Thank you all for your love, support and prayers, and for bearing with
our absence during the course of my PhD studies. You are the true reason I am
here today. In the last year of my PhD my wife and I were blessed to have our
son Mahdi. The love and joy he brought to our life was a great motivation
towards the end of my PhD journey.
I would like to extend my sincere gratitude to my thesis advisor, Professor
Joxan Jaffar for his guidance and professional insights throughout my PhD. It
was only by his support that I embarked on this PhD journey. Thank you Prof.
Jaffar for your valuable advice that helped me to be an independent researcher.
Besides, I truly appreciate Dr. Duc-Hiep Chu for all his technical help and advice
during these years. It was a pleasure to work with him. This thesis would never
have come together without Prof. Jaffar and Dr. Chu’s continuous guidance and
support. I am very much grateful to my lab mates and my friends from School of
Computing at NUS, for all the wonderful time that we had with each other.
Finally, I would like to thank National University of Singapore and Agency for
Science, Technology and Research (A*STAR) and specifically, Singapore Interna-
tional Graduate Award (SINGA) program and Lee Foundation for providing the
financial support for this research.
I also feel a deep appreciation for my friends in Singapore, who have made the
experience of living in Singapore and studying at NUS a unique and unforgettable one:
Hossein Nejati, Shima Bayat, Saeid Montazeri, Narges Nourieh, Mohammadreza
Chamanbaz, Faezeh Emami, Mohammad Sadegh Tavallali, Mehdy Ghaeminia,
Sogand Zarei, Safoura Sameni, Ahmadreza Shehabinia, Siavash Sakhavi, Sajjad
Seifozzakerini, Amin Kianinejad, Mohammad Akbari, Reihaneh Eslami, Ahmadreza
Pourghaderi, Fatemeh Jamshidian, Ali Khalili, Mona Khalighi, Seyyed Mohammad
Majedi, Samaneh Tavakolinia, Hamed Mirabolghasemi, Fatemeh Safari, Hamed
Sepehr, Zahra Barikbin, Saeid Arabnejad, Mohammad Nouri, Amin Torabi Jahromi,
Alireza Partovi, Asad Norouzi, Fahimeh Masoumian, Alireza Rezvanpour, Shima
Najafi Nobar, Ehsan Rismani, Khatereh Hajizadeh, Davood Afshari, Fatemeh
Movahednia, Meghdad Attatzadeh, Elham Jahanshahi and other Iranian friends
in Singapore as well as Nadim, Fatemeh, Mohamed, Simi, Rahim, Aeylia and Yati




Contents
List of Abbreviations xix
List of Symbols xxi
1 Introduction 1
1.1 Estimating Worst-case Resource Consumption . . . . . . . . . . . . 2
1.1.1 Dynamic (Measurement-based) Methods . . . . . . . . . . . 2
1.1.2 Static Methods . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Modular Worst-case Resource Consumption Analysis . . . . . . . . 5
1.2.1 Imprecision in Modular Approaches . . . . . . . . . . . . . . 7
1.3 Integrated Worst-case Resource Analysis . . . . . . . . . . . . . . . 9
1.4 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5.1 Worst-case Execution Time Analysis with Cache . . . . . . . 13
1.5.2 Memory High-watermark Analysis with Symbolic Bounds . . 14
1.5.3 Worst-case Energy Consumption Analysis with Cache and
Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Integrated Resource Analysis Framework 17
2.1 Symbolic Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.1 Symbolic Execution for Program Verification . . . . . . . . . 18
2.1.2 Symbolic Execution for Program Analysis . . . . . . . . . . 19
2.1.3 Symbolic Execution on Loops . . . . . . . . . . . . . . . . . 21
2.2 Overview of Our Resource Analysis Framework . . . . . . . . . . . 22
2.2.1 Microarchitecture-Aware Symbolic Execution . . . . . . . . . 23
2.2.2 Precision of Our Resource Analysis Framework . . . . . . . . 24
2.2.3 Scalability of Our Resource Analysis Framework . . . . . . . 25
2.3 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Symbolic Execution with Machine State . . . . . . . . . . . 27
2.3.2 Constructing Summarizations . . . . . . . . . . . . . . . . . 33
2.3.3 Reuse with Interpolation and Dominance . . . . . . . . . . . 36
2.4 Customizing the Analysis Framework . . . . . . . . . . . . . . . . . 37
2.5 The Integrated Algorithm . . . . . . . . . . . . . . . . . . . . . . . 39
2.5.1 The Summarize Function . . . . . . . . . . . . . . . . . . . . 39
2.5.2 Compounding Two Summarizations . . . . . . . . . . . . . . 43
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 Integrated Worst-case Execution Time Analysis with Cache 47
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.1 High-level analysis . . . . . . . . . . . . . . . . . . . . . . . 49
3.2.2 Low-level analysis . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.3 Other Related Work . . . . . . . . . . . . . . . . . . . . . . 51
3.2.4 Commercial WCET Tools . . . . . . . . . . . . . . . . . . . 52
3.2.5 A discussion on the State-of-the-art WCET Analysis . . . . 54
3.3 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 Basic Operations and Timing Cost Model . . . . . . . . . . 55
3.3.2 The Machine State . . . . . . . . . . . . . . . . . . . . . . . 56
3.3.3 Witness and Dominating Condition . . . . . . . . . . . . . . 57
3.3.4 Machine State Summary . . . . . . . . . . . . . . . . . . . . 62
3.4 Example Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.5.2 Discussion on Precision . . . . . . . . . . . . . . . . . . . . . 75
3.5.3 Discussion on Scalability . . . . . . . . . . . . . . . . . . . . 77
3.5.4 Analysis of Benchmarks with Complicated Loops . . . . . . 78
3.5.5 Comparison of Analysis Results with Simulations . . . . . . 80
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4 Symbolic Memory High-watermark Analysis 85
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.1.1 Overview Examples . . . . . . . . . . . . . . . . . . . . . . . 89
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.1 Instrumentation Tools . . . . . . . . . . . . . . . . . . . . . 94
4.2.2 Worst-case Stack Usage Analysis . . . . . . . . . . . . . . . 95
4.2.3 Worst-case Heap Usage Analysis . . . . . . . . . . . . . . . . 95
4.3 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.3.1 Basic Operations and Memory Cost Model . . . . . . . . . . 97
4.3.2 The Machine State . . . . . . . . . . . . . . . . . . . . . . . 98
4.3.3 Witness and Dominating Condition . . . . . . . . . . . . . . 98
4.3.4 Machine State Summary . . . . . . . . . . . . . . . . . . . . 106
4.4 Example Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 110
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5 Integrated Worst-case Energy Consumption Analysis 117
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.1.1 Integrated WCEC Analysis . . . . . . . . . . . . . . . . . . 119
5.1.2 Pipeline Analysis . . . . . . . . . . . . . . . . . . . . . . . . 120
5.1.3 Analysis of Loop-free Programs . . . . . . . . . . . . . . . . 123
5.1.4 Extension to Programs with Loops . . . . . . . . . . . . . . 126
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.2.1 Worst-case Energy Consumption Analysis . . . . . . . . . . 126
5.2.2 Out-of-order Pipeline Analysis . . . . . . . . . . . . . . . . . 127
5.3 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.3.1 Basic Operations and Energy Cost Model . . . . . . . . . . 129
5.3.2 The Machine State . . . . . . . . . . . . . . . . . . . . . . . 130
5.3.3 Witness and Dominating Condition . . . . . . . . . . . . . . 131
5.3.4 Generating Machine State Summaries . . . . . . . . . . . . . 136
5.4 An Example Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 139
5.5.1 Discussion on Precision . . . . . . . . . . . . . . . . . . . . . 140
5.5.2 Discussion on Scalability . . . . . . . . . . . . . . . . . . . . 141
5.6 Extension to Out-of-order Pipelines . . . . . . . . . . . . . . . . . . 143
5.6.1 Timing Anomaly . . . . . . . . . . . . . . . . . . . . . . . . 146
5.6.2 Dominating Condition and Machine State Summary . . . . . 147
5.6.3 Analysis of Loops . . . . . . . . . . . . . . . . . . . . . . . . 148
5.6.4 Example Analysis . . . . . . . . . . . . . . . . . . . . . . . . 148
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6 Conclusion and Future Work 155
Bibliography 159
A Compiling LLVM IR to Transition System 169
A.1 The LLVM Pass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
B Energy Consumption Model in Embedded Systems 175
B.1 Energy Consumption Model . . . . . . . . . . . . . . . . . . . . . . 176
B.1.1 Instruction-specific Energy . . . . . . . . . . . . . . . . . . . 177
B.1.2 Pipeline-specific Energy . . . . . . . . . . . . . . . . . . . . 179
Summary
The growing complexity of embedded systems has created the need for techniques
to analyze different non-functional properties of programs such as performance,
resource consumption or security. Three of the important non-functional
properties for embedded systems are execution time, consumed memory and
consumed energy.
Both dynamic (measurement-based) and static resource analysis methods can
be employed to estimate such non-functional properties of programs. While
dynamic methods are unable to generate safe upper bounds for these properties,
static methods have been proven to generate safe upper bounds on the highest
resource consumption of a program. In general, for scalability reasons,
resource analysis is usually performed in two separate phases, high-level and
low-level. Such approaches are called modular approaches. In contrast,
integrated methods perform resource analysis in a single phase. To the best of
our knowledge, integrated methods have not been scalable until now.
In this thesis, we present a novel integrated framework for resource analysis
where micro-architectural modeling and systematic path-sensitivity are
synergized. This gives us a very high precision for resource analysis but, at
the same time, poses a huge challenge for scalability. Our contribution is then
a dynamic programming algorithm with a powerful concept of reuse. Reuse, in turn, depends on the
core concepts of interpolation and dominance. While interpolation-based meth-
ods have been used in program verification for the purpose of pruning the search
space of symbolic execution, our setting is novel not just because we are perform-
ing analysis instead of verification, but because our interpolation with dominance
covers reuse under an environment where resource consumption of program paths
is dynamic and/or symbolic.
We should highlight that since we are systematically path-sensitive, our algo-
rithm is more precise. The important point, however, is that it also can scale to
a reasonable level. Our realistic benchmarks will show both aspects: that system-
atic path-sensitivity, in fact, brings significant accuracy gains; and, likewise, the
algorithm scales well.
The thesis makes several contributions in the area of worst-case resource con-
sumption analysis. First, an integrated analysis framework is presented in Chapter
2. Next, the analysis framework is customized for the following three analyses:
1. Worst-case Execution Time (WCET) Analysis: The analysis of the
execution time of an embedded system becomes necessary in hard real-time
systems. Worst-case execution time can be used to ensure the temporal
correctness of hard real-time systems. However, the worst-case input of a
program is not known in general and is hard to derive. As a result, an upper
bound on the worst-case execution time is generated by utilizing WCET
analysis.
Our worst-case resource analysis framework is customized for WCET analysis
in Chapter 3. The main contribution of the analysis presented in Chapter
3 is performing a precise integrated WCET analysis down to instruction and
data caches. The key challenge, scalability, is addressed by using a notion of
reuse in an environment where the contribution of each basic block to the
overall WCET is dynamic. On realistic benchmarks, it is shown that the
extreme attempt at precision in fact pays off: there is a significant
increase in precision, and this is obtained in a reasonable time.
2. Memory High-watermark (MHW) Analysis: In Chapter 4, our anal-
ysis framework is customized for symbolic MHW analysis. Memory high-
watermark refers to the highest amount of memory that a program can ac-
quire in its executions. It can be compared to WCET, except that WCET
represents the longest execution time of a program in all its executions, while
MHW represents the highest amount of memory usage of a program (along
all its possible executions). We will explain how MHW analysis is needed to
ensure the reliability of safety-critical embedded systems.
The main contributions of the memory high-watermark analysis are (1) it
handles a non-cumulative resource, where the resource of interest can be
consumed or released, and (2) the concept of reuse with interpolation and
dominance is performed in the presence of symbolic bounds.
3. Worst-case Energy Consumption (WCEC) Analysis: WCEC analysis
has gained interest in recent years and it is important for embedded
systems with limited access to energy. The aim of this analysis is to ensure
that a hard real-time system would have access to enough energy to perform
its assigned tasks. Due to the complex micro-architectural features in
modern systems, the path resulting in the worst-case execution time can be
different from the worst-case energy consuming path.
In Chapter 5, we customize our analysis framework for WCEC analysis in
the presence of cache and in-order pipelines. Energy is data dependent, and
our integrated approach can generate more precise bounds compared to other
analyses. In this chapter, we extend reuse to be performed in the presence
of an in-order pipeline. Finally, Chapter 5 finishes by presenting an extension
of our WCEC analysis framework to out-of-order pipelines.

List of Tables
2.1 Customizable Terms . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1 Comparing our Algorithm (Unroll d) to the State-of-the-art . . . . 74
3.2 The results of our analysis, Unroll d, compared to AI+SAT⊕Unroll s and
AI+SAT⊕ILP to Illustrate the Super-linearity of Loop Unrolling . . 79
4.1 Comparison of Memory Analysis Tools and Methods . . . . . . . . 111
4.2 Result of Analysis of Experimental Benchmarks on Our Algorithm
Unroll d . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.1 Pipeline States Along the First Path . . . . . . . . . . . . . . . . . 138
5.2 Pipeline States Along the Second Path . . . . . . . . . . . . . . . . 138
5.3 Results of Analysis of Benchmarks with In-order Pipeline . . . . . . 141
B.1 Instruction Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
B.2 Data dependencies between the instructions . . . . . . . . . . . . . 185
B.3 The energy consumption for the leftmost path (Path 1) . . . . . . . 185
B.4 The energy consumption for path 2 . . . . . . . . . . . . . . . . . . 187
B.5 The energy consumption for path 3 . . . . . . . . . . . . . . . . . . 188




List of Figures
1.1 The distribution of possible execution times of a program . . . . . . . . 3
2.1 Reuse of Summarizations: (a) [CJ11] vs. (b) Our Analysis . . . . . 20
2.2 (a) Loop analysis in our framework (b) Loop analysis in AI Frame-
work with some infeasible path detection, where C is the fixed point
context of loop and di is the incoming context of a loop iteration . 21
2.3 An Academic Example . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 From a C program to its Transition System . . . . . . . . . . . . . 30
2.5 (a) a CFG and (b) Its Symbolic Execution Tree . . . . . . . . . . . 31
2.6 Helper Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1 Function to summarize a transition . . . . . . . . . . . . . . . . . . 58
3.2 Combining (Vertically) Witnesses and Dominating Conditions . . . 59
3.3 Merging Witnesses and Dominating Conditions . . . . . . . . . . . 61
3.4 Combining (Vertically) and Merging (horizontally) Machine State
Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.5 (a) a CFG (with memory accesses and static instruction timing
shown in each block); and (b) Our Analysis Tree . . . . . . . . . . . 68
3.6 Comparison of the generated WCET with GEM5 simulations for
benchmarks in Table 3.1 . . . . . . . . . . . . . . . . . . . . . . . . 81
3.7 Comparison of the generated WCET with GEM5 simulations for
Different Benchmarks Groups . . . . . . . . . . . . . . . . . . . . . 82
4.1 An Annotated C program . . . . . . . . . . . . . . . . . . . . . . . 89
4.2 Reuse with Interpolation and Dominance . . . . . . . . . . . . . . . 90
4.3 A Complicated Allocation Pattern . . . . . . . . . . . . . . . . . . . 92
4.4 Analysis of the Complicated Allocation Pattern . . . . . . . . . . . 92
4.5 Summarize-a-Trans Function . . . . . . . . . . . . . . . . . . . . . . . 101
4.6 Helper Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.7 Combining Witness Formula and Dominating Conditions . . . . . . 103
4.8 Merging Witness Formulas and Dominating Conditions . . . . . . . 104
4.9 (a) The CFG of an annotated program (b) The analysis tree of the
program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.1 In-order processor model . . . . . . . . . . . . . . . . . . . . . . . . 122
5.2 Full Symbolic Execution Tree . . . . . . . . . . . . . . . . . . . . . 124
5.3 Reuse Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.4 Window of Epilogue Instructions from Root to Successor Nodes . . 133
5.5 Summarize-a-Trans Function . . . . . . . . . . . . . . . . . . . . . . . 133
5.6 Merge-Witness Function . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.7 Combine-Witness function . . . . . . . . . . . . . . . . . . . . . . . . 135
5.8 (a) a CFG (with instructions and their respective execution unit
shown in each block); and (b) Our Analysis Tree . . . . . . . . . . . 137
5.9 Comparison of Estimated WCEC with Wattch Simulation . . . . . 142
5.10 Out-of-order processor model . . . . . . . . . . . . . . . . . . . . . 144
B.1 (a) program control flow graph with accessed memory blocks and
instruction running time shown inside each block, (b) the analysis
tree of the program . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
B.2 The energy analysis of the leftmost path (Path 1) . . . . . . . . . . 186
B.3 The energy analysis of path 2 . . . . . . . . . . . . . . . . . . . . . 187
B.4 The energy analysis of path 3 . . . . . . . . . . . . . . . . . . . . . 189
B.5 The energy analysis of path 4 . . . . . . . . . . . . . . . . . . . . . 189
List of Abbreviations
ACS Abstract Cache State
AI Abstract interpretation
ALU Arithmetic Logic Unit






FIFO First In First Out
FPU Floating-point Unit
I-cache Instruction cache
ID Instruction Decode & Dispatch
IF Instruction Fetch
ILP Integer linear programming
IPET Implicit path enumeration technique
LRU Least Recently Used
LD/ST Load/Store Unit
LLVM IR LLVM Intermediate representation
MULTU Multiplication Unit
MHW Memory High-watermark
MRU Most Recently Used
PLRU Pseudo-LRU
RCSP Resource-constrained Shortest Path
RISC Reduced Instruction Set Computing
SSA Static Single Assignment
VIVU Virtual interpretation of virtual unrolled (loops)
WB Write Back
WCEC Worst-case Energy Consumption
WCET Worst-case Execution Time
WCHA Worst-case Heap Analysis







List of Symbols
∆m Machine State Summary
∆s Cache-set Abstract Transformer
E Energy
Energyreg Energy of Registers
Energywk Wakeup Logic Energy
EnergyFPU Energy of FPU
EnergyMULTU Energy of MULTU
EnergyALU1 Energy of ALU1
EnergyALU2 Energy of ALU2
Energyic Energy of Instruction Cache
Energydc Energy of Data Cache





instr Sequence of Instructions
Ψ Interpolant
ℓ0 Initial Program Point
leakagepath Leakage Energy
m Machine State
m0 Initial Machine State
mem Memory Access
M Set of Pairs 〈mem, i〉
L Set of Program Points
Ops Set of Operations
P Power
P Transition System
pi Path Constraints Along the Witness
seq Sequence of Memory Block Accesses
s Symbolic State
s0 Initial Symbolic State
SymStates Set of Symbolic States
selectionpath Selection Logic Energy
Switchoff Sum of the Switch-off Energy
σ Symbolic Store
tr Transition
t Execution Time of Instructions
Υ Sequence of All Memory Accesses Along a Path
Υh Sequence of alc(±, e) Depicting All Memory Allocations or Deallocations
Vars All Program Variables
Chapter 1
Introduction
An embedded system is a system which performs a specific task and is composed of
software and hardware or mechanical parts. Embedded systems have been utilized
to accomplish safety-critical tasks in different domains, e.g., in medical devices,
aircraft flight control systems or automobiles. The failure of such embedded
systems may result in massive financial loss or even loss of lives. An example of a
safety-critical embedded system is a heart pacemaker. Similarly, modern aircraft
and automobiles rely on a network of embedded systems, some of which might be
performing critical tasks.
Besides the functional safety of embedded systems, reasoning about their
non-functional properties has been an interesting research topic for many
years. Non-functional properties are properties that do not directly influence
the input-output behavior of embedded systems. The execution time, consumed
memory and consumed energy are examples of important non-functional properties
of embedded systems.
Analyzing non-functional properties has also become part of the safety criteria
for certain groups of embedded systems. In these embedded systems, the worst-case
usage of a non-functional property is estimated to make sure that the embedded
system will function properly in extreme cases.
1.1 Estimating Worst-case Resource Consumption
There are two classes of methods to obtain an approximation of the resource
consumption of a program:
• Dynamic (measurement-based) methods which use real executions of the pro-
gram or parts of the program to estimate the consumed resource from the
executions or the combination of the executions of the program parts.
• Static methods which use the program itself plus extra information such as
loop bounds and an abstract model of the corresponding hardware to ap-
proximate the highest consumption of a resource.
1.1.1 Dynamic (Measurement-based) Methods
Dynamic methods for estimating resource consumption are frequently used in
industry. For example, for measuring the worst-case execution time (WCET), the
easiest method is to measure the execution time of a subset of the real program
executions and report the longest observed execution time. Figure 1.1 illustrates the
distribution of all possible resource costs of a sample program. As can be seen in
Figure 1.1, the reported result would be the maximal observed resource consumption,
which can be different from the actual WCET.
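The gap between the maximal observed execution time and the actual WCET can be illustrated with a minimal sketch. The cost model `execution_time`, the input range and the sampling stride are all hypothetical and chosen only so that a rare slow input falls outside the measured subset; they are not taken from any benchmark in this thesis.

```python
def execution_time(x: int) -> int:
    """Hypothetical cost model: one rare input (x == 777) triggers a slow path."""
    return 1000 if x == 777 else 100 + (x % 7)


def measured_wcet(inputs) -> int:
    """Measurement-based estimate: the longest execution time observed."""
    return max(execution_time(x) for x in inputs)


# All possible inputs of the (hypothetical) program.
all_inputs = range(1024)
# The measured subset: every 16th input, which happens to miss x == 777.
measured_subset = range(0, 1024, 16)

actual_wcet = measured_wcet(all_inputs)       # exhaustive, infeasible in practice
observed_max = measured_wcet(measured_subset)  # what measurement reports

print(f"maximal observed: {observed_max}, actual WCET: {actual_wcet}")
```

In this sketch the measured value is a lower bound on the actual WCET, never a safe upper bound, which is the limitation described above.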
Figure 1.1: The distribution of possible execution times of a program
Another traditional method, which is still being used in industry [Wil06],
measures the execution time of small snippets of code and adds a safety margin.
The results are combined based on the control flow graph. This method helps to
extend the subset of real executions measured in the program, since each code
snippet is measured once and used several times. The correctness of this method
depends on the correctness of the combine step. However, [The04] shows that the
assumptions on the correctness of the combine step can be wrong, and the
measured WCET would still be the maximal observed execution time (dashed line
with orange color in Figure 1.1).
Dynamic methods can be used to estimate the memory usage of an embedded
system too. However, similar to WCET, dynamic methods cannot generate a safe
upper-bound on the resource usage. In fact, the bound returned by measurement
methods is the peak memory usage on a concrete execution of the software. There
are many commercial and open-source memory profiling tools, some of which,
such as Valgrind Massif [Val] and Perfrewrite [Kru14], measure the peak memory.
Last but not least, due to the complicated nature of energy consumption
analysis, many methods have been developed for estimating the energy
consumption of embedded systems [T+94, Q+00] and mobile applications
[B+14, A+12b, M+13].
In general, the main disadvantage of dynamic methods is that they cannot gen-
erate safe upper-bounds over the worst-case resource usage of embedded systems.
1.1.2 Static Methods
The estimation of the worst-case usage of a non-functional property is also
performed using program analysis techniques. In general, these techniques are
categorized under static methods for worst-case resource analysis. Static methods
provide estimations over the resource consumption by examining the program itself
with some extra information (such as the loop bounds) and without executing it
directly on the hardware. Static methods can generate safe upper-bounds over the
worst-case resource consumption of embedded systems.
As illustrated in Figure 1.1, the highest resource cost that a program will take
to execute is referred to as the worst-case resource consumption of the program.
In order to find this bound for a program, the worst-case input for the program
is needed. The worst-case input is not known in general and, due to the large
state space of all possible executions of a program, it is difficult to
derive. Therefore, an approximation of the worst-case resource consumption is
returned as the upper timing bound (line with purple color in Figure 1.1).
Generally, the drawback of static methods is that they need specific
models of processor behavior and of the micro-architectural features which
enhance the performance of a processor (e.g., caches or pipelines).
Additionally, static methods may generate imprecise results due to inaccuracy
in the hardware models or overestimation in the analysis (dashed line with
green color in Figure 1.1). Static analysis methods analyze either the source
code or the disassembled binary executable to determine the structure of a
program.
1.2 Modular Worst-case Resource Consumption Analysis
Traditionally, modular approaches to static analysis are used for worst-case re-
source consumption analysis. These approaches are performed in two phases
[W+08]:
• Micro-architectural Modeling Phase: Also referred to as low-level analysis,
this phase often involves micro-architectural modeling to accurately determine
the maximum resource consumption of each basic block. In this phase,
the effects of performance-enhancing processor features such as pipelines and
caches on the consumed resource are considered.
• Path Analysis Phase: Also referred to as high-level analysis, this phase
concerns a program-level path analysis to determine loop bounds and infeasible
paths in the program's control flow graph (CFG); moreover, it generates
an upper-bound on the worst-case resource consumption.
For example, in the case of WCET analysis [LM95], the low-level analysis gen-
erates a timing for each basic block (using the timing information of the real
hardware with all its specific features). By providing the information from the
low-level analysis to the high-level analysis, the high-level analysis generates an
upper-bound over the WCET of a given program and hardware platform.
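The division of labor between the two phases can be sketched as follows. This is a minimal illustration, assuming a loop-free CFG and a fixed per-block cost; the block names, the costs in `BLOCK_COST` and the helper `wcet_bound` are hypothetical and do not come from any tool discussed here.

```python
from functools import lru_cache

# Hypothetical loop-free CFG: basic block -> successor blocks.
CFG = {
    "entry": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}

# Assumed output of the low-level phase: a worst-case cost (in cycles)
# for each basic block, fixed independently of the path taken.
BLOCK_COST = {"entry": 5, "then": 20, "else": 8, "exit": 3}


def wcet_bound(cfg, cost, entry):
    """High-level phase: cost of the most expensive entry-to-exit path."""

    @lru_cache(maxsize=None)
    def longest_from(block):
        successors = cfg[block]
        tail = max((longest_from(s) for s in successors), default=0)
        return cost[block] + tail

    return longest_from(entry)


bound = wcet_bound(CFG, BLOCK_COST, "entry")
print(bound)  # 5 + 20 + 3 along entry -> then -> exit
```

Because each block receives a single cost before any path is considered, the high-level phase cannot exploit the fact that a block may be cheaper on some paths than on others.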
Low-level and high-level analyses are still performed separately for
reasons of scalability. Throughout this thesis, we will call such approaches to
static analysis modular approaches.
Micro-architectural modeling has been an active research topic in resource
analysis. Initial works on instruction cache modeling used integer linear
programming (ILP) [LMW99]. However, the work did not scale due to the huge
number of generated ILP constraints. Subsequently, the abstract interpretation
(AI) [CC77] framework for micro-architectural modeling, proposed in [TFW00],
made an important step towards scalability. The solution has proved scalable,
and it has also been applied in commercial WCET tools (e.g., [aiT]). For most
existing WCET analyzers, the AI framework has emerged as the basic approach
used for micro-architectural modeling.
On the other hand, three classes of methods are suggested for high-level analysis
in modular approaches:
• Structure-based Methods: In structure-based methods, as used in [CP00],
an upper-bound is calculated in a bottom-up traversal of the syntax tree of
the program which combines the bounds computed for the structures accord-
ing to the combination rules [W+08].
• Implicit Path Enumeration (IPET): Implicit path enumeration (IPET)
represents the control flow of the program as a set of linear equations and
formulates the resource consumption as the objective function of an ILP to be
maximized. The ILP is passed to an ILP solver, and the
result is an upper-bound on the resource consumption of the program.
• Path-based Methods: Path-based techniques try to bound the amount of
resource consumed by a program by computing an upper-bound for the
different paths in the program.
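As an illustration of the IPET item above, a commonly used textbook form of the ILP is sketched below. The notation is assumed here and is not taken from this thesis: B is the set of basic blocks, x_i is the execution count of block i, c_i is the cost bound of block i supplied by the low-level analysis, d_e is the traversal count of CFG edge e (with one virtual edge into the entry block and one out of the exit block), and n_h is a loop bound for loop header h.

```latex
\begin{align*}
\text{maximize}\quad
  & \sum_{i \in B} c_i \, x_i \\
\text{subject to}\quad
  & x_i = \sum_{e \in \mathrm{in}(i)} d_e = \sum_{e \in \mathrm{out}(i)} d_e
      && \text{for every basic block } i \in B, \\
  & x_{\mathrm{entry}} = 1, \\
  & x_h \le n_h \cdot \sum_{e \in \mathrm{entry}(h)} d_e
      && \text{for every loop header } h, \\
  & x_i,\; d_e \in \mathbb{Z}_{\ge 0}.
\end{align*}
```

The flow-conservation constraints describe paths implicitly, which is why no path is enumerated explicitly; the price is that each c_i is a single constant, regardless of the path on which block i is reached.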
1.2.1 Imprecision in Modular Approaches
The AI framework for micro-architectural modeling combined with IPET [TFW00]
has been used to estimate the WCET of programs. This method combines the sets
of possible states at the merging points in the control flow graph. In order to
manage the potentially exponential number of states, merging is inevitable.
However, this may lead to significant overestimation, since some associated
information will be lost during the merge step. The situation becomes worse
when feasible paths are merged with infeasible paths, which results in an
unnecessary loss of information. The analysis is still guaranteed to be safe:
all possible states are considered and the result will never be smaller than
the maximum consumed resource. The results, however, are often not precise.
There are two main causes of imprecision in the AI framework + IPET:
1. The analysis of the estimations of memory accesses will lose precision through
joins at the control flow merge points. This results in subsequently overesti-
mating in the micro-architectural modeling.
2. Beyond the one-iteration virtual unrolling of loops [TFW00], AI is unable to
give different resource cost for a basic block executed in different iterations
of a loop.
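The loss of precision at a merge point can be sketched with the must-cache join commonly used in AI-based cache analysis: a memory block survives the join only if both incoming abstract states contain it, and it keeps the larger (more pessimistic) age bound. The dictionary representation and the block names below are assumptions made for illustration.

```python
def must_join(state_a, state_b):
    """Join two abstract must-cache states (block -> upper bound on age)."""
    return {
        block: max(age, state_b[block])
        for block, age in state_a.items()
        if block in state_b          # keep only blocks cached on both paths
    }

after_then = {"m1": 0, "m3": 1}      # state after the then-branch
after_else = {"m2": 0, "m3": 1}      # state after the else-branch
merged = must_join(after_then, after_else)
print(merged)
```

After the join, neither m1 nor m2 is guaranteed to be cached, so a later access to either block has to be treated as a possible miss, whichever branch was actually taken.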
Much research has been conducted on finding infeasible paths. Detecting
infeasible paths and not merging them with feasible paths reduces the effect of
scenario (1). Some methods, such as [L+10], apply infeasible path detection
only in the path analysis phase and still use the AI framework for the
micro-architectural modeling. In these cases, the detected infeasible paths do
not help to reduce the effect of scenario (1). Besides, not all techniques for
finding infeasible paths are applicable to IPET-based resource analysis.
Recent approaches such as [CR11, BCR13] have improved the AI framework using
verification technology. These works have demonstrated that the accuracy of
WCET estimates (time being the resource of interest) can be improved by
integrating some form of infeasible path discovery into micro-architectural
modeling. We note, however, that due to scalability issues, such additions of
path-sensitivity are quite limited. Moreover, although these works show
promising results in reducing some of the overestimation resulting from
scenario (1) within an AI framework, they are unable to address systematically
the imprecision caused by scenario (2). In other words, these works, while
possessing some flavor of path-sensitivity, essentially still employ a fixed
point computation. Therefore, they assign a single worst-case resource
consumption to each basic block, even though the resource consumption of a
basic block can differ significantly between iterations of a loop. We will
elaborate on these methods in Chapter 3.
Path analysis has also been well investigated by the resource analysis research
community (e.g., with focus on WCET analysis [LM95, EG97, G+05, GEL06,
CJ11, CJ13]). These works aim to improve the estimate of the consumed resources
at the program path level.
The most recent related works are [CJ11, CJ13]. The fully automated symbolic
simulation algorithm presented in [CJ11] performs loop unrolling and utilizes the
notion of interpolation and reuse to scale. We will elaborate more on these works
while presenting the foundations of our resource analysis framework in Chapter 2.
1.3 Integrated Worst-case Resource Analysis
In the previous section, we reviewed the main causes of imprecision in modular
approaches. Although these methods are safe, the amount of overestimation in
the analysis can be quite high. In general, the two most important criteria for
evaluating a resource analysis are safety, meaning that the computed bound is
equal to or greater than the real worst-case consumed resource, and precision,
meaning that the computed bound is close to the exact worst-case consumed
resource. Any resource analysis which maintains the safety of the generated
resource consumption bound is called a sound resource analysis.
The abstract hardware model used in the low-level analysis can affect both the
safety and the precision of the estimated resource consumption through the effects
of the hardware components such as cache, pipeline, branch prediction, etc.
On the other hand, the existence of infeasible paths in the high-level analysis
of modular approaches does not affect the safety of the estimated resource
consumption bound, but it does affect the precision of the estimate. A more
precise estimate is obtained by excluding the infeasible paths from the
resource analysis. Moreover, the high-level analysis finds upper bounds on the
number of iterations of the loops, which is a requirement for precise
estimation of the resource consumption.
In contrast to modular approaches, integrated methods perform both the
micro-architectural modeling and path analysis in one phase. An integrated path
and resource analysis would theoretically give ideal precision. The first
attempt to perform integrated resource analysis based on symbolic execution was
made by Lundqvist et al. [LS99]; it did not scale to realistic programs.
1.4 Thesis Contributions
In this thesis, we investigate a scalable integrated resource analysis which
integrates micro-architectural modeling with program path analysis. The
objective of this thesis is to address the precision issue in resource
analysis. We propose a symbolic execution framework where micro-architectural
modeling and systematic path-sensitivity are synergized. The essence of our
proposed method is that it is fully path-sensitive modulo summarization, which
leads to a more precise inspection of the consumed resource measured from the
underlying hardware model.
Our framework can be compared with the state-of-the-art methods in terms of the
number of paths disregarded due to infeasibility. The current state-of-the-art
methods either disregard none of the infeasible paths [TFW00] or disregard only
the infeasible paths whose reason for infeasibility falls within a constant
number of merge points [CR11, BCR13], whereas our method disregards the maximum
possible number of infeasible paths (limited only by the merging performed at
the end of the loops and by the power of the theorem solver).
In our framework, loops are unrolled fully and summarized (inherited from
[CJ11]). We should note that the loop unrolling in our algorithm is done
virtually and not physically, and is different from the loop unrolling in
compilers. As a result, our framework can detect the exact number of iterations
in complicated loop patterns such as non-rectangular loops and amortized loops
[GZ10].
Moreover, our framework generates an exact incoming context for each loop
iteration. This enables our framework to detect the infeasible paths that are
missed due to loss of information in a fixed point computation, as well as the
paths which are infeasible in only one loop iteration. The only abstraction
performed in our framework is within a loop iteration, and not across loop
iterations. This leads to a precise inspection of the resource consumed by the
underlying hardware model (because the micro-architectural state can be tracked
across the iterations).
Based on this, we emphasize that the significance of our resource analysis
framework (presented in Chapter 2) is that it exhibits the highest level of
path sensitivity, and so in theory produces the most precise estimates.
Moreover, integrated methods have not been scalable up to now [W+08], and to
the best of our knowledge, our method is the first scalable integrated resource
analysis. The following are the advantages of our framework:
• Advantage 1: Our framework possesses maximum infeasible path detection.
This reduces the overestimation in the resource analysis.
• Advantage 2: Our framework differentiates between the incoming contexts
of different loop iterations. As a result, the state of the micro-architectural
features can be precisely tracked during the analysis, which in turn results
in a more precise estimation of the analyzed resources.
• Advantage 3: Our framework can dynamically perform complete loop unrolling
and aggregate the results from each loop iteration separately. It is thus
able to calculate the exact number of loop iterations (inherited from [CJ11]).
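To illustrate why following a loop nest iteration by iteration matters for non-rectangular loops, the following sketch compares the count obtained by multiplying per-loop bounds with the count obtained by actually walking the iterations. The triangular loop nest is a hypothetical example and is executed concretely here; it only mimics the effect of complete unrolling and is not the algorithm of [CJ11].

```python
def rectangular_bound(n):
    """Bound obtained by multiplying the iteration bounds of the two loops."""
    return n * n

def exact_iterations(n):
    """Count inner-body executions by walking the loop nest iteration by iteration."""
    count = 0
    for i in range(n):
        for _ in range(i, n):    # the inner bound depends on the outer counter
            count += 1
    return count

print(rectangular_bound(10), exact_iterations(10))
```

For n = 10 the product of the bounds charges 100 executions of the inner body, whereas only 55 actually occur.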
We have customized our resource analysis framework for three different analyses:
1. Worst-case Execution Time Analysis (WCET): The analysis of the
worst-case execution time in embedded systems, and especially in real-time
embedded systems, has received a lot of attention from researchers. Real-time
embedded systems are required to complete their work and deliver their
services on a timely basis. These systems are becoming increasingly
important in everyday life. Traffic light controllers and airport aviation
monitoring systems are examples of real-time embedded systems.
2. Memory High-watermark Analysis (MHW): Memory is also an important
resource in embedded systems. Memory consumption behavior is an important
consideration in the design of embedded systems using cost-efficient
hardware.
3. Worst-case Energy Consumption Analysis (WCEC): Energy has received
increasing attention in the last few years. The impact of embedded
systems and their energy consumption behavior on the environment is
becoming increasingly important. Furthermore, energy consumption has been
a major barrier for embedded devices which have limited access to energy.
1.5 Thesis Organization
We will present our integrated resource analysis framework in Chapter 2. Next,
in Chapters 3-5, we will customize our framework for WCET, MHW and WCEC
analysis, respectively. In this section, we separately review each of these
customized analyses. Finally, the thesis concludes in Chapter 6 with a
discussion of conclusions and future work.
1.5.1 Worst-case Execution Time Analysis with Cache
The analysis of execution time has been a major research topic in real-time
embedded systems. Real-time systems can be classified as hard real-time systems
or soft real-time systems. By definition, hard real-time systems must satisfy
all deadline constraints, and a failure to satisfy any of them may result in a
disaster. For example, the effectiveness of a car airbag system, a typical hard
real-time system, depends highly on the timing of its ignition. In a typical
frontal collision at 50 km/h against a wall, the gas generator must be ignited
approximately 10 ms to 30 ms after the crash has begun. The constraint is
tighter in a side collision, where the gas generator must be ignited only 5 ms
after the crash begins [FV98].
A hard real-time system is deemed to have failed unless all of its deadline
constraints are satisfied. Accordingly, the correctness of a hard real-time
system depends both on its logical correctness (i.e., the program implements
exactly what it is supposed to do) and on its temporal correctness (i.e., all
deadlines are met). In contrast, soft real-time systems are not required to
satisfy deadline constraints strictly, although it is desirable that they do so
[Kop11].
In order to evaluate the temporal correctness of hard real-time systems,
upper-bounds on their execution times are needed. The upper-bounds are then
used to demonstrate that the execution of the hard real-time systems will
finish before the deadlines. Note that, in general, it is not possible to
calculate upper-bounds on the execution time of arbitrary programs; otherwise,
we could solve the halting problem [Sch94]. However, restrictions such as
allowing only explicitly bounded recursion and loops are imposed on real-time
systems, which guarantee the termination of real-time programs.
Chapter 3 presents a precise WCET analysis based on our resource analysis
framework. The scope of the research presented in Chapter 3 is analyzing both the
instruction and data cache for WCET analysis.
1.5.2 Memory High-watermark Analysis with Symbolic Bounds
Traditionally, dynamic memory allocation was discouraged in safety-critical
embedded systems for two main reasons: (a) the allocation instructions might
take longer than expected, resulting in the violation of temporal constraints
in hard real-time systems; and (b) the memory fragmentation issue. As a result,
the stack was the only memory that grew dynamically during execution.
Worst-case stack usage was estimated by methods such as the one proposed in
[KF14]; the estimate is compared with the available memory to ensure that stack
overflow errors do not occur.
In the past few years, there have been advances in both the hardware and the
software of embedded systems. Among them are the drop in hardware cost, the
development of customized operating systems for embedded systems, and the
advent of constant-time memory allocation algorithms with reasonable handling
of memory fragmentation [MRC03, M+08]. In addition, as embedded systems become
more complex, the use of third-party code, which might require dynamic memory
allocation, becomes harder to avoid. As a result, dynamic memory allocation is
now used more frequently in embedded software [A+15].
Such increased use of dynamic memory allocation raises concerns about the
reliability of embedded systems that are deployed for safety-critical tasks.
Thus, there is a real need for program analysis methods that help avoid both
stack and heap overflows in safety-critical systems. Besides, the estimates
produced by such analyzers are useful in the design process of embedded systems
for reducing hardware cost [T+13]; they can also be presented to programmers
who are interested in dissecting the memory footprint of an embedded system.
Memory is a non-cumulative resource: what is acquired can later be released.
As a result, unlike time and energy, where the maximum consumption of an
execution path is reached at the end of the path, the maximum memory usage of a
path can occur at any point on that path, e.g., right in the middle of it.
Thus, many approaches developed for worst-case analysis of cumulative
resources, such as WCET analysis, become inapplicable. More specifically, these
methods often abstract away the order of the acquires and releases, which is
crucial for precise analysis of a non-cumulative resource.
Memory high-watermark (MHW) refers to the highest amount of memory that a
program might require in any execution. It is the counterpart of the WCET: the
WCET is the longest execution time of a program over all its executions, while
the MHW is the highest memory usage of a program over all its executions.
Chapter 4 presents a precise MHW analysis as another application of worst-case
resource analysis of embedded systems.
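The difference between a cumulative and a non-cumulative resource can be made concrete with a small sketch. The traces below are hypothetical sequences of acquires (positive sizes) and releases (negative sizes) along two paths; they contain the same operations in a different order.

```python
def high_watermark(trace):
    """Return (peak usage, usage at the end of the path) for a trace of signed sizes."""
    current = peak = 0
    for delta in trace:          # +n acquires n bytes, -n releases n bytes
        current += delta
        peak = max(peak, current)
    return peak, current

path_a = [+64, +32, -64, -32, +16]
path_b = [+64, -64, +32, -32, +16]
print(high_watermark(path_a), high_watermark(path_b))
```

Both paths end with 16 bytes in use, yet their high-watermarks differ (96 versus 64), so an analysis that abstracts away the order of acquires and releases cannot tell them apart.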
1.5.3 Worst-case Energy Consumption Analysis with Cache
and Pipeline
In some embedded systems, energy is a scarce resource. Energy consumption
analysis is becoming an important trend in different domains such as embedded
systems and mobile applications.
A first approach in the literature measures the average energy usage of a
program. The measured value can be reported to customers as an indicator of the
average energy usage of the software on specific hardware [GGP+14].
On the other hand, a second approach in the literature argues that energy
consumption analysis is an important factor for energy harvesting embedded
systems. Such systems lack access to an unlimited energy source, either because
their energy source is not always available and cannot guarantee unlimited
operation (e.g., solar powered sensor nodes) or because they cannot be
recharged after being deployed (e.g., underwater sensor nodes). This analysis
is particularly important for embedded systems which have scarce access to
energy. The researchers in this group suggest that an energy consumption
analysis should be performed which can guarantee that an embedded system will
have access to enough energy to perform the set of tasks assigned to it
[B+05, V+14, PS01].
Similar to WCET, the path resulting in the worst-case energy consumption
(WCEC) of an embedded system is not known in general, and a WCEC analysis
must be performed to find an upper-bound on the WCEC of an embedded system
[J+06, W+15]. Note that, due to the complex micro-architectural features of
modern systems, the path resulting in the worst-case execution time is not
necessarily the same as the path resulting in the worst-case energy
consumption. Hence, the path returned by the WCET analysis cannot be used
directly to generate the WCEC of a program [J+06]. In Chapter 5, a framework
for precise WCEC analysis is presented. The resource analysis method presented
in Chapter 5 extends our framework to resource analysis in the presence of
superscalar in-order processors. Moreover, a discussion of the extension of
the resource analysis framework to out-of-order processors is presented at the
end of Chapter 5.
Chapter 2
Integrated Resource Analysis Framework
In this chapter, we present our symbolic execution resource analysis framework.
Symbolic execution is a well-established method in program verification.
However, it suffers from the path explosion problem, which makes it unattractive
for program analysis. In this chapter, we present a modified version of
symbolic execution, empowered with the notions of reuse and domination, which
can be used to perform precise and scalable resource analysis.
2.1 Symbolic Execution
Symbolic execution [C+76, Kin76] is a method for reasoning about programs. In
symbolic execution, symbolic values are used as inputs instead of actual data.
The values of the program variables are stored as symbolic expressions over the
symbolic input values. Symbolic execution was first developed for program
testing, but it has subsequently been used for program verification and program
analysis.
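The idea of storing program variables as expressions over symbolic inputs can be sketched with a toy symbolic executor. The statement and expression encodings below are assumptions made for illustration; the sketch enumerates every path, records its path condition, and does not check feasibility (which would require a theorem solver).

```python
def substitute(expr, store):
    """Rewrite expr so that it mentions only symbolic input values."""
    if isinstance(expr, str):                 # a program variable
        return store.get(expr, expr)          # unassigned: its symbolic input
    if isinstance(expr, tuple):               # (operator, left, right)
        op, left, right = expr
        return (op, substitute(left, store), substitute(right, store))
    return expr                               # an integer constant

def symbolic_execute(stmts, store=None, path_cond=()):
    """Yield (path condition, symbolic store) for every path through stmts."""
    store = dict(store or {})
    for index, stmt in enumerate(stmts):
        if stmt[0] == "assign":
            _, var, expr = stmt
            store[var] = substitute(expr, store)
        elif stmt[0] == "if":
            _, cond, then_branch, else_branch = stmt
            guard = substitute(cond, store)
            rest = stmts[index + 1:]
            # Fork: one successor per branch, each with an extended path condition.
            yield from symbolic_execute(then_branch + rest, store, path_cond + (guard,))
            yield from symbolic_execute(else_branch + rest, store, path_cond + (("not", guard),))
            return
    yield path_cond, store

# y = x + 1; if (y > 0) z = y * 2; else z = 0;
program = [
    ("assign", "y", ("+", "x", 1)),
    ("if", (">", "y", 0),
        [("assign", "z", ("*", "y", 2))],
        [("assign", "z", 0)]),
]
paths = list(symbolic_execute(program))
for condition, final_store in paths:
    print(condition, final_store)
```

Each conditional doubles the number of paths in this sketch, which is the path explosion that the rest of the chapter addresses.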
The main challenges in performing symbolic execution for program verification
are (1) the exponential number of paths and (2) infinite-length paths in the
presence of unbounded loops. In the scope of this thesis, we assume that
programs contain only bounded loops. This is a valid assumption for programs
used in safety-critical embedded systems. However, the first challenge still
remains when utilizing symbolic execution.
2.1.1 Symbolic Execution for Program Verification
In recent years, symbolic execution has emerged as a successful method for
program verification. The main advantage of using symbolic execution for
program verification is that it avoids exploring infeasible paths (limited only
by the power of the theorem solver); hence, program verification using symbolic
execution is more accurate and does not result in false alarms. More
specifically, the paths in the symbolic execution tree are enumerated and
verified one by one. If a path is found which contains a bug, the verification
of the program has failed and the path is reported to the user. Otherwise,
after all paths have been enumerated, the program is reported as safe.
The verification tool most relevant to our analysis, TRACER [JMNS12], is a
symbolic execution-based verification tool for sequential C programs. It
attempts to address the first challenge by summarizing each subtree of the
symbolic execution tree once its traversal is finished. Next, it attempts to
subsume other similar subtrees of the symbolic execution tree using the
generated summarization. The ability to subsume portions of the symbolic
execution tree by previously analyzed subtrees enables TRACER to remain
scalable while verifying a program.
For the subsumption to be sound, an interpolation test is performed which
checks whether the infeasible paths in the analyzed subtree remain infeasible
w.r.t. the context reaching the subsumed subtree. We will elaborate on the
concept of summarization with interpolation in the next section.
2.1.2 Symbolic Execution for Program Analysis
In program analysis, symbolic execution, also known as symbolic simulation, is
used to simulate the execution of tasks on an abstract model of the processor.
The simulation is performed without input values, and it is the simulator’s
duty to deal with unknown execution states such as branches.
The first attempt to make symbolic execution for program analysis scalable
was [JSV08], where the concept of summarization with interpolation was applied
to the specific problem of resource-constrained shortest path (RCSP).
The problem was formulated such that scalability could be achieved by reusing
the summarizations of previously computed sub-problems. After analyzing the
paths in a subtree, an interpolant, preserving the infeasible paths discovered
in the analyzed subtree, and a witness formula, preserving the (sub-)analysis
of the subtree, were computed and stored. If the same node was encountered on
another path such that its context entailed the previously computed interpolant
and witness formula, the paths emerging from that node were not explored. This
step was called reuse, and the subsumed node would share the analysis results
of the subsuming node. Otherwise, symbolic execution would explore all the
paths emerging from that node as usual. In other words, this was a generalized
form of dynamic programming. However, reuse of summarizations was limited to
loop-free programs.
The next step was taken in [CJ11, CJ13], where reuse of summarizations was
enhanced for WCET analysis of programs with loops. In this modular resource
analysis approach, the timings of the basic blocks could be generated by a
state-of-the-art low-level analysis. In the analysis, loops were fully unrolled
and the analysis was made scalable by introducing compounded summarizations,
where one or more loop iterations were summarized and reused for subsuming
other loop iterations.
Figure 2.1: Reuse of Summarizations: (a) [CJ11] vs. (b) Our Analysis
Example 1. Figure 2.1(a) informally depicts a symbolic execution tree, where
each triangle represents a subtree. The program contexts of the left and right
subtrees, i.e., the symbolic states s0 and s1 respectively, are at the same
program point. Applying the algorithm in [CJ11] to the left subtree yields two
things. The first is an interpolant Ψ0, a generalization of s0 encapsulating
any context that would preserve the infeasible paths (indicated with a red
cross) of the subtree. The second is a “representative” path, called a witness
and indicated in blue, which gives rise to the WCET (15 in this case) of the
subtree.
The algorithm [CJ11] now considers the right subtree, where two tests are
performed. The first test checks whether the context s1 implies the
interpolant. If so, every infeasible path in the left subtree remains
infeasible in the right subtree. The second test checks whether the witness
path is still feasible. If both tests pass, the analysis can be reused here,
and the WCET of the right subtree can be computed without traversal.
The final analysis, at the root of the tree, can be computed by collating the
analyses of the left and right subtrees, and we can now determine its value of 22 as
indicated. Note, importantly, that we never actually traversed the path that gives
rise to this result; instead, we inferred its value.
Figure 2.2: (a) Loop analysis in our framework (b) Loop analysis in AI Framework with
some infeasible path detection, where C is the fixed point context of loop and di is the
incoming context of a loop iteration
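The two reuse tests of Example 1 can be sketched under strong simplifying assumptions: a context is modeled as a set of atomic facts, entailment is approximated by set inclusion, and a witness is infeasible only if the context contains the negation of one of its guards. In the actual algorithm these checks are theorem solver queries; the names Summary, negate and try_reuse and the facts used are hypothetical. Only the value 15 is taken from the example.

```python
from collections import namedtuple

# Hypothetical summary of an analyzed subtree.
Summary = namedtuple("Summary", "interpolant witness_guards wcet")

def negate(fact):
    """Negation of an atomic fact, written with a leading '!'."""
    return fact[1:] if fact.startswith("!") else "!" + fact

def try_reuse(summary, context):
    """Return the stored WCET if both reuse tests pass, otherwise None."""
    # Test 1: the new context implies the interpolant, so every infeasible
    # path of the analyzed subtree remains infeasible.
    if not summary.interpolant <= context:
        return None
    # Test 2: the witness path is still feasible in the new context.
    if any(negate(guard) in context for guard in summary.witness_guards):
        return None
    return summary.wcet

left = Summary(interpolant=frozenset({"x>0"}), witness_guards=("y>0",), wcet=15)
print(try_reuse(left, {"x>0", "z==1"}))
```

When either test fails, the sketch returns None, which corresponds to exploring the subtree from scratch.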
2.1.3 Symbolic Execution on Loops
In [CJ11], loops are unrolled fully and summarized. Loops are unrolled
gradually, meaning that a loop is unrolled once and then checked, until
unrolling is no longer feasible. As a result, the exact number of iterations in
complicated loop patterns can be found. In [CJ11], the loops are unfolded
dynamically, and for each loop iteration a different incoming context
d1, d2, d3, ... is generated (Figure 2.2(a)). Note that the paths in blue are
feasible paths and the paths in red are infeasible paths. By checking the exact
incoming context of each loop iteration, the framework can detect the
infeasible paths which, due to the fixed point computation, are not detected by
methods based on the AI framework (Figure 2.2(b)). For example, the framework
in [CJ11] can detect the paths which are infeasible in only one loop iteration
(e.g., the right-most path, which is infeasible in only one triangle in
Figure 2.2(a) and is feasible in Figure 2.2(b)). The process of loop unfolding
continues until it reaches the end of the loop (blue arrow).
The method presented in [CJ11] is a powerful framework which can handle
most of the challenges of utilizing symbolic execution for resource analysis. It can
be generalized and used for resource analysis as long as it can be combined with
a sound low-level analysis. However, the symbolic execution framework in [CJ11]
performs program path analysis only and does not take into account the state of
the micro-architectural features.
2.2 Overview of Our Resource Analysis Framework
In our framework, we need to go beyond program path analysis; thus we inherit
and extend the fundamental concepts in [CJ11] (unrolling loops, reuse with
summarization, etc.) to include micro-architectural features. This enables our
framework to inherit the benefits of the symbolic execution framework presented
in [CJ11] and yet be more precise by taking into account the state of the
micro-architectural features.
2.2.1 Microarchitecture-Aware Symbolic Execution
In our symbolic execution framework, each symbolic state is coupled with the
internal states of all relevant micro-architectural features. We refer to the
internal states of these micro-architectural features as the implicit machine
state. The implicit machine state comprises the internal states of the
micro-architectural features relevant to a resource analysis, which cannot be
explicitly derived from the source of a program. It contains enough information
to perform precise resource analysis. For brevity, we use the term machine
state to refer to the implicit machine state.
Not all micro-architectural features affect every kind of resource consumption
behavior of programs. For example, in WCET analysis, various
micro-architectural features, ranging from caches to branch prediction, affect
the generated bound, whereas in MHW analysis, the internal cache state has no
effect on the amount of allocated memory. Hence, our proposal is to capture
only the internal states of the micro-architectural features which are relevant
to the resource analysis at hand.
Our resource analysis framework is integrated in the sense that it performs high-
level and low-level analysis in one phase. In summary, our framework performs
resource analysis by:
1. Fully enumerating all symbolic paths in the symbolic execution tree.
2. Generating and precisely updating the machine state across the symbolic
paths.
3. Using machine states to precisely capture the resource consumption behavior
of the program along the symbolic paths.
2.2.2 Precision of Our Resource Analysis Framework
The three key features that enable our framework to perform precise resource
analysis are (1) path-sensitive symbolic execution, (2) microarchitecture-aware
symbolic execution and (3) loop unrolling. We demonstrate the precision of our
framework in the following example.
for (i = 0; i < 100; i += 3) {
if (i % 5 == 0) { /* a */ }
else { /* b */ }
}
Figure 2.3: An Academic Example
Example 2. Consider the academic example in Figure 2.3. We would like to
measure a certain resource, denoted by r, which is affected by the presence of
caches (e.g., time or energy). We assume a direct-mapped cache, where block a
and block b access the memory blocks m1 and m2 respectively, and m1 and m2 map
to the same cache set. In other words, the memory blocks m1 and m2 conflict in
the cache. Note that both block a and block b are feasible, though in different
iterations.
A pure AI approach such as [TFW00] has to conservatively declare that the
fixed point must-cache at the looping point contains neither m1 nor m2. Thus,
all the accesses are classified as misses, which inflates the estimate of the
consumed resource r.
On the other hand, [BCR13] might improve the analysis by excluding some
“simple” infeasible states from consideration. For example, if the conditional
expression were (i < 50), [BCR13] could discover that an execution of b cannot
be followed by an execution of a in the next iteration. In other words, it
could discover that the access to m2 is indeed persistent [FW98, H+11].
Specifically, only the first access to m2 is a cache miss; the remaining
accesses to m2 are cache hits. Now consider the case where the conditional
expression is a bit more complicated, for example, i % 5 > 1. Without knowing
the precise value of i, after executing block a (or block b) in the current
iteration, it is possible to execute either block a or block b in the next
iteration. In such a case, a fixed point method, even when equipped with some
form of path-sensitivity, will not be of much help in improving the analysis
precision.
In our analysis, we precisely capture the value of i throughout the analysis
process. Consequently, we are able to exclude all infeasible states from
consideration, thus achieving a more accurate analysis result compared to the
state-of-the-art.
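The effect of tracking i precisely can be reproduced with a small simulation of the loop in Figure 2.3. The sketch assumes that each of block a and block b performs exactly one access (to m1 and m2 respectively), that the shared cache set is initially empty, and that a hit costs 1 and a miss costs 10; these costs are hypothetical. The loop is executed concretely here, which is possible because it has no symbolic inputs.

```python
def simulate(hit_cost=1, miss_cost=10):
    """Replay the loop of Figure 2.3 against one direct-mapped cache set."""
    cached = None              # block currently held by the shared cache set
    hits = misses = 0
    for i in range(0, 100, 3):                 # for (i = 0; i < 100; i += 3)
        block = "m1" if i % 5 == 0 else "m2"   # block a reads m1, block b reads m2
        if cached == block:
            hits += 1
        else:
            misses += 1
            cached = block                     # the conflicting block is evicted
    return hits, misses, hits * hit_cost + misses * miss_cost

hits, misses, precise_cost = simulate()
all_miss_cost = (hits + misses) * 10   # every access classified as a miss
print(hits, misses, precise_cost, all_miss_cost)
```

Under these assumptions, 20 of the 34 accesses are hits, so the iteration-by-iteration cost is 160, whereas classifying every access as a miss gives 340.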
2.2.3 Scalability of Our Resource Analysis Framework
The precision of our analysis comes at the cost of scalability. Clearly, any
framework which attempts the full exploration of the symbolic execution tree
will not scale. As elaborated in Section 2.1.2, the core concept of reuse was
used for scalability in [CJ11]. Reuse in [CJ11] was based on the idea that the
contribution of each basic block to the timing (resource consumption) of a path
is constant. However, in the presence of micro-architectural features such as
caches, the contribution of each basic block is no longer constant. We refer to
this as the dynamic resource consumption model. Under this model, reusing
summarizations as in [CJ11] would no longer be sound. Note that we explained
the terms safety and soundness for worst-case resource consumption bounds and
analyses in Section 1.3.
In our framework, resource analysis is performed in the presence of the dynamic
resource consumption model. The main contribution of our framework is
furnishing the concept of reuse with the means to dynamically generate the
resource consumption behavior of a subtree based on the stored summarization
and the machine state reaching a node. For this to be possible, the witness
should dominate all the other paths in the presence of the dynamic resource
consumption model. As a consequence, the safety of reuse is guarded not only by
the concept of interpolation, but also by the concept of dominance. That is,
given a new symbolic state, it is sound to reuse a summarization if, first, the
interpolant is satisfied, which ensures that all the discovered infeasible
paths are maintained at the reuse point, and, second, a dominating condition is
satisfied, which ensures that the machine state reaching the reuse point is
similar enough to the machine state of the subsuming node. We demonstrate this
in the next example.
Example 3. Consider Figure 2.1(b), where we now focus on dynamic resource
consumption, which arises because of cache configurations. A cache state c0 is
also part of the context s0 of the subtree on the left. After analyzing the
left subtree, we obtain an interpolant Ψ0 and a witness path (indicated in
blue) as before. As explained above, the reuse of this witness path is unsound
in general. To remedy this, we now compute a dominating condition c0.
Essentially, this is a formula which describes an abstract cache configuration
which is sufficient to guarantee that the witness path remains optimal, i.e.,
remains the worst-case path in the subtree, when encountering a new context.
In Figure 2.1(b), suppose the dominating condition applies, that is, suppose that
the cache context c1 is covered by c0. We indicate this by the predicate DOM(c0, c1).
Now this allows us to reuse the witness path. We then need to replay
the witness path under the new cache configuration c1. This, importantly, can lead
to a new value of the path (now 17), which is different from the original value (15).
Finally, we can conclude the analysis on the whole tree with the value 24.
Now suppose the dominating condition did not apply. Then the path indicated
by 17 may not be the worst-case path in the right subtree. For example, there could
be a path of length 18 somewhere else in the subtree. If we reuse the witness path,
we would now report, wrongly, a final value of 24.
We end this section by emphasizing that reuse is the key to scalability, since
with reuse our analysis avoids the full exploration of a symbolic execution tree.
Reuse in the presence of the dynamic resource consumption model is only possible
when path and resource analysis are performed in one integrated phase. We now
present our analysis framework.
2.3 General Framework
2.3.1 Symbolic Execution with Machine State
We model a program by a transition system. A transition system P is a tuple
〈L, ℓ0, −→〉 where L is the set of program points and ℓ0 ∈ L is the unique initial
program point. Let −→ ⊆ L × L × Ops, where Ops is the set of operations, be the
transition relation that relates a state to its (possible) successors by executing the
operations.
Basic operations are either assignments or “assume” operations. The set of all
program variables is denoted by Vars. An assignment x := e assigns
the evaluation of the expression e to the variable x. The expression assume(cond)
means: if the conditional expression cond evaluates to true, execution continues;
otherwise it halts. We shall use ℓ −op→ ℓ′ to denote a transition from ℓ ∈ L to
ℓ′ ∈ L executing the operation op ∈ Ops.³ Clearly a transition system is derivable
from a CFG.
³ Note that based on the resource of interest, basic operations can also have other forms such
as memory allocations/deallocations for MHW analysis. We will customize the set of basic
operations for each analysis in Chapters 3-5.
Definition 1 (Symbolic State). A symbolic state s is a tuple 〈ℓ, m, σ, Π〉 where
ℓ ∈ L is the current program point, m is the machine state, the symbolic store σ
is a function from program variables to terms over input symbolic variables, and
finally the path condition Π is a first-order formula over the symbolic inputs.
Let s0 ≝ 〈ℓ0, m0, σ0, Π0〉 denote the unique initial symbolic state, where m0
is the initial machine state. At s0 each program variable is initialized to a fresh
input symbolic variable. For every state s ≡ 〈ℓ, m, σ, Π〉, the evaluation [[e]]σ of an
arithmetic expression e in a store σ is defined as usual: [[v]]σ = σ(v), [[n]]σ = n,
[[e+e′]]σ = [[e]]σ + [[e′]]σ, [[e−e′]]σ = [[e]]σ − [[e′]]σ, etc. The evaluation of the conditional
expression [[cond]]σ can be defined analogously. The sets of first-order logic formulas
and symbolic states are denoted by FO and SymStates, respectively.
Definition 2 (Transition Step). Given a transition system 〈L, ℓ0, −→〉 and a
symbolic state s ≡ 〈ℓ, m, σ, Π〉 ∈ SymStates, the symbolic execution of transition
tr : ℓ −op→ ℓ′ returns another symbolic state s′ defined as:
$$
s' \;\overset{\mathrm{def}}{=}\;
\begin{cases}
\langle \ell', m', \sigma, \Pi \wedge cond \rangle & \text{if } op \equiv \mathtt{assume}(cond) \\
\langle \ell', m', \sigma[x \mapsto [\![e]\!]_{\sigma}], \Pi \rangle & \text{if } op \equiv x := e
\end{cases}
$$
where m′ is the new machine state derived from m and op by the
updMachineState function, i.e., m′ ≡ updMachineState(op, m).
In order to capture the precise machine state, we need a more low-level
representation of a program. We have chosen the LLVM intermediate
representation (IR) [LA04], which, while being expressive enough to precisely capture the
machine state, preserves the overall CFG of the program. Moreover, LLVM IR
is platform independent and well suited to static program
analyses [S+12, ZGSV11]. For a more detailed discussion on how we generate the
transition system from LLVM IR, we refer the interested reader to Appendix B.
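As a minimal sketch of the transition step in Definition 2, the following Python fragment executes a few transitions symbolically. It is an illustration under stated assumptions, not the implementation used in this thesis: terms are kept as plain text, the names SymState, step and upd_machine_state are hypothetical, the machine state is reduced to a step counter, and the condition of an assume operation is evaluated in the current store before being added to the path condition.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SymState:
    loc: int                             # current program point
    machine: int                         # machine state m (toy: a step counter)
    store: Tuple[Tuple[str, str], ...]   # symbolic store sigma as sorted pairs
    path: Tuple[str, ...]                # conjuncts of the path condition Pi


def upd_machine_state(op, m):
    """Hypothetical updMachineState: only counts executed operations."""
    return m + 1


def step(s, tr):
    """Symbolically execute transition tr = (src, op, dst) from state s."""
    src, op, dst = tr
    if src != s.loc:
        raise ValueError("transition does not start at the current point")
    sigma = dict(s.store)
    m2 = upd_machine_state(op, s.machine)
    if op[0] == "assume":
        # Pi is extended with the condition evaluated in the store.
        return SymState(dst, m2, s.store, s.path + (op[1](sigma),))
    if op[0] == "assign":
        _, var, expr = op
        sigma[var] = expr(sigma)         # sigma[x -> [[e]]sigma]
        return SymState(dst, m2, tuple(sorted(sigma.items())), s.path)
    raise ValueError("unknown operation")


# Each variable starts as a fresh input symbol (A0, C0, T0).
s0 = SymState(1, 0, (("a", "A0"), ("c", "C0"), ("t", "T0")), ())
trs = [
    (1, ("assign", "c", lambda st: "0"), 2),
    (2, ("assume", lambda st: f"{st['a']} > 0"), 3),
    (3, ("assign", "t", lambda st: f"({st['t']} + 1)"), 4),
]
s3 = s0
for tr in trs:
    s3 = step(s3, tr)
```

After the three steps, s3 sits at program point 4 with t bound to the term (T0 + 1) and the single path constraint A0 > 0.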
The updMachineState function can be defined to capture the updates on the
machine state based on the update criteria for the micro-architectural features.
The updMachineState should be defined with regard to the resource of interest.
We will customize the updMachineState function for each of the resource analyses
presented in Chapters 3-5.
Abusing notation, the execution step from s to s′ is denoted as s −tr→ s′ where
tr is a transition. Given a symbolic state s ≡ 〈ℓ, m, σ, Π〉 we also define [[s]] :
SymStates → FO as the projection of the formula
$$
\Big( \bigwedge_{v \in Vars} v = [\![v]\!]_{\sigma} \Big) \wedge \Pi
$$
onto the set of program variables Vars. The projection is performed by the
elimination of existentially quantified variables.
For convenience, when there is no ambiguity, we just refer to the symbolic state
s using the abbreviated tuple 〈ℓ, m, [[s]]〉 where ℓ and m are as before, and [[s]] is
obtained by projecting s as described above.
Example 4 (Translation into Transition System). Consider the C program frag-
ment in Figure 2.4(a) and part of its respective LLVM IR in Figure 2.4(b). The
program points are enclosed in angle brackets. Some of the transitions are shown
in Figure 2.4(c). For instance, the transition 〈〈1〉, c := 0, 〈2〉〉 represents that the
system state switches from program point 〈1〉 to 〈2〉 and that the operation
resets c to 0.
〈1〉 c = 0;
〈2〉 if (a > 0) 〈3〉 t += 1;
〈4〉 if (b > 0) 〈5〉 t += 2;
〈6〉 if (c > 0) 〈7〉 t += 3;
〈8〉
(a) The C Program
〈1〉 store i32 0, i32* %c
    br label %2
〈2〉 %10 = load i32* %a
    %cmp = icmp sgt i32 %10, 0
    br i1 %cmp, label %3, label %4
〈3〉 %11 = load i32* %t
    %add = add nsw i32 %11, 1
    store i32 %add, i32* %t
    br label %4
〈4〉 %12 = load i32* %b
    %cmp1 = icmp sgt i32 %12, 0
    br i1 %cmp1, label %5, label %6
    ...
(b) The LLVM IR
〈〈1〉, c := 0, 〈2〉〉
〈〈2〉, assume(a > 0), 〈3〉〉
〈〈3〉, t := t + 1, 〈4〉〉
〈〈2〉, assume(a ≤ 0), 〈4〉〉
...
(c) The Transition System
Figure 2.4: From a C program to its Transition System
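The transitions of Figure 2.4(c) can be written down directly as data. The sketch below is only illustrative: it keeps operations as text, and the helper names outgoing and paths are hypothetical (outgoing mirrors the notation used later in Algorithm 1). The enumeration assumes the listed fragment is acyclic.

```python
# Transitions <l, op, l'> taken from Figure 2.4(c); operations kept as text.
TRANSITIONS = [
    (1, "c := 0", 2),
    (2, "assume(a > 0)", 3),
    (3, "t := t + 1", 4),
    (2, "assume(a <= 0)", 4),
]


def outgoing(loc, transitions):
    """All transitions leaving program point loc."""
    return [tr for tr in transitions if tr[0] == loc]


def paths(src, dst, transitions):
    """Operation sequences of all paths from src to dst (acyclic input)."""
    if src == dst:
        return [[]]
    result = []
    for _, op, nxt in outgoing(src, transitions):
        for rest in paths(nxt, dst, transitions):
            result.append([op] + rest)
    return result


for p in paths(1, 4, TRANSITIONS):
    print(" ; ".join(p))
```

Two operation sequences lead from 〈1〉 to 〈4〉, one per outcome of the branch on a.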
A path π ≡ s0 → s1 → . . . → sn is feasible if sn ≡ 〈ℓ, mn, [[sn]]〉 and [[sn]] is
satisfiable. Otherwise, the path is called infeasible and sn is called an infeasible
state. Here we query a theorem prover for satisfiability checking on the path
condition. We assume the theorem prover is sound, but not complete. If ℓ ∈ L
and there is no transition from ℓ to another program point, then ℓ is called the end
point of the program.
[Figure 2.5 (image not reproduced). Labels in the figure: branch conditions
x = 0 / x ≠ 0 and x > 1 / x ≤ 1; resource updates r = 0, r = r + 20 and
r = r + 30; tree nodes are named by program point and visit letter, e.g. 〈0a〉.]
Figure 2.5: (a) a CFG and (b) Its Symbolic Execution Tree
Example 5 (Symbolic Execution Tree). Consider the CFG in Figure 2.5(a). Each
node abstracts a basic block. In each basic block, a program point is shown. For
brevity, we might use interchangeably the identifying program point when referring
to a basic block. Two outgoing edges signify a branching structure, while the branch
conditions are labeled beside the edges. Moreover, r is set to 0 in the beginning and
the updates to it are also shown.
Next, in Figure 2.5(b), we show our analysis tree. Each node, shown as a circle,
is identified by the corresponding program point, followed by a letter to distinguish
between multiple visits to the same program point. Each path denotes a symbolic
execution path of the program. Each node is associated with a symbolic state, but
for simplicity we do not explicitly show any state content.
Now assume that no basic block modifies x. At node 〈5a〉, the projection of the
path condition over program variables Vars, namely [[s5a]], is r = 20 ∧ x = 0 ∧ x >
1, which is equivalent to false. In other words, the leftmost path in Figure 2.5(b)
is in fact infeasible. On the other hand, at node 〈7a〉, the projection of the path
condition over program variables Vars, namely [[s7a]], is r = 50 ∧ x = 0 ∧ x ≤ 1.
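The two feasibility checks of this example can be mimicked with a small sketch. This is not the theorem prover used by the framework: as a stand-in, it searches a small finite range of values for x, which is an assumption made purely for illustration (an exhaustive search over a finite range is not a proof in general). The equalities on r are omitted because they only fix the value of r and do not interact with x.

```python
def satisfiable(constraints, domain=range(-10, 11)):
    """Search the finite domain for a value of x meeting all constraints."""
    return any(all(c(x) for c in constraints) for x in domain)


# [[s5a]] restricted to x:  x = 0  and  x > 1
s5a = [lambda x: x == 0, lambda x: x > 1]
# [[s7a]] restricted to x:  x = 0  and  x <= 1
s7a = [lambda x: x == 0, lambda x: x <= 1]

print(satisfiable(s5a))  # no value of x satisfies both constraints
print(satisfiable(s7a))  # x = 0 satisfies both constraints
```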
The set of program variables Vars also includes a resource variable r. Our
analysis computes a sound and accurate bound for r at the end, across all
feasible paths of the program. Note that r is always initialized to 0. The allowed
operations on r depend on whether the resource of interest is
cumulative or non-cumulative. For example, for WCET analysis the only operation
allowed is concrete increment, but in MHW analysis, r can be both incremented
and decremented.
Given a symbolic state s ≡ 〈ℓ, m, [[s]]〉 and a transition tr : ℓ −op→ ℓ′, the amount
of change in the value of r at s by executing tr is evaluated by a function
ResEval. ResEval receives the machine state m and the current value of r as input
and returns the updated value of r. ResEval is defined specifically for each of the
resource analyses in Chapters 3-5. The resource variable r is not used in any other
way.
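To make the interplay of updMachineState and ResEval concrete, the sketch below uses a toy machine state: a two-entry LRU cache over named memory blocks. The cost constants, the cache capacity, and the function names are all assumptions chosen for illustration; the actual definitions are given per analysis in Chapters 3-5. Unlike the description above, the toy res_eval also takes the accessed block as an argument.

```python
HIT, MISS = 1, 10    # hypothetical access costs
CAPACITY = 2         # hypothetical cache size (in blocks)


def upd_machine_state(block, m):
    """LRU update; m is a tuple of blocks, most recently used last."""
    m = tuple(b for b in m if b != block) + (block,)
    return m[-CAPACITY:]


def res_eval(block, m, r):
    """Updated resource value after accessing block under machine state m."""
    return r + (HIT if block in m else MISS)


def replay(path, m, r=0):
    """Replay a sequence of block accesses from machine state m."""
    for block in path:
        r = res_eval(block, m, r)
        m = upd_machine_state(block, m)
    return r, m


path = ["A", "B", "A"]
cold_cost, _ = replay(path, ())           # empty cache on entry
warm_cost, _ = replay(path, ("A", "B"))   # both blocks already cached
print(cold_cost, warm_cost)               # 21 3
```

The same path costs 21 from an empty cache and 3 from a warm one, which is the sense in which the contribution of a basic block is no longer constant.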
We stated in Section 2.2.1 that in our framework, loops are unrolled fully and
summarized. We now introduce the concepts required for our loop unrolling frame-
work. Our method benefits from the loop unrolling technology introduced in [CJ11]
to find the exact number of iterations in complicated loop patterns (explained in
Section 2.1.3).
We assume that each loop has only one loop head and one unique end point.
This can be achieved by a preprocessing phase. For each loop, following the back
edge from the end point to the loop head, we do not execute any operation. Note
that our transition system is a directed graph. Moreover, since programs in
safety-critical systems do not contain break and return instructions in loops, the same
preprocessing can ensure that each loop has a single loop head.
Definition 3 (Loop). Given a directed graph G = (V,E) (transition system), we
call a strongly connected component S = (VS, ES) in G with |ES| > 0, a loop of G.
Definition 4 (Loop Head). Given a directed graph G = (V,E) and a loop L =
(VL, EL) of G, we call E ∈ VL a loop head of L, also denoted by E(L), if no node
in VL, other than E has a direct successor outside L.
Definition 5 (End Point of Loop Body). Given a directed graph G = (V,E), a
loop L = (VL, EL) of G and its loop head E, we say that a node u ∈ VL is an end
point of a loop body if there exists an edge (u, E) ∈ EL.
Definition 6 (Same Nesting Level). Given a directed graph G = (V,E), we say
two nodes u and v are in the same nesting level if for
each loop L = (VL, EL) of G, u ∈ VL ⇐⇒ v ∈ VL.
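Definitions 3-5 can be checked on a small graph with the following sketch. It is a direct, unoptimized reading of the definitions and not part of the framework: the graph encoding (a dict from node to successor list) and the function names are assumptions, and the strongly connected components are computed by plain reachability rather than by a linear-time algorithm.

```python
def reachable(graph, src):
    """Nodes reachable from src via at least one edge."""
    seen, stack = set(), [src]
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def loops(graph):
    """Definition 3: strongly connected components with at least one edge."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    reach = {u: reachable(graph, u) for u in nodes}
    found = []
    for u in sorted(nodes):
        if u in reach[u]:                       # u lies on a cycle
            scc = frozenset(v for v in reach[u] if u in reach[v])
            if scc not in found:
                found.append(scc)
    return found


def loop_heads(graph, loop):
    """Definition 4: no node other than the head has a successor outside."""
    return sorted(
        e for e in loop
        if all(set(graph.get(n, ())) <= loop for n in loop if n != e)
    )


def end_points(graph, loop, head):
    """Definition 5: nodes of the loop with an edge back to the head."""
    return sorted(u for u in loop if head in graph.get(u, ()))


# 1 -> 2 -> 3 -> 2 (back edge), and 2 -> 4 leaves the loop.
G = {1: [2], 2: [3, 4], 3: [2], 4: []}
the_loop = loops(G)[0]
print(sorted(the_loop), loop_heads(G, the_loop), end_points(G, the_loop, 2))
```

On this graph the only loop is {2, 3}, its head is 2, and its body ends at 3.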
2.3.2 Constructing Summarizations
In our framework, a “subtree” is a portion of the symbolic execution tree.
Given a state s and program point ℓ2 such that (a) state s ≡ 〈ℓ1, m, [[s]]〉 appears
in the tree, and (b) ℓ2 post-dominates ℓ1, then subtree(s, ℓ2) depicts all the paths
emanating from s and, if feasible, terminating at ℓ2. (Note that ℓ2 may not be the
end point of the whole tree.) We call ℓ1 and ℓ2 the entry and exit points of the
subtree.
A summarization of a subtree, intuitively, is a succinct description of its
analysis. This is formalized as a tuple of certain important components of the analysis.
These are: the entry and exit program points, an interpolant describing
infeasible paths, one or more witnesses describing the paths with the highest resource
consumption in the subtree, one or more dominating conditions ensuring that the witnesses
represent the appropriate worst-case paths in the subtree, and finally an abstract
transformer relating the input and output program variables and an abstract
transformer relating the input and output machine states. The abstract transformers
are used to generate the outgoing context at the exit point.
We start with our notion of interpolant. The idea here is to approximate, at the
root of a subtree, the weakest precondition in order to maintain the infeasibility of
all the nodes inside.⁴ In the context of program verification, an interpolant
captures succinctly a condition which ensures the safety of the tree at hand. Adapting
this to program analysis was first done in [JSV08], which formalized a generalized
form of dynamic programming. Interpolation was then applied for reuse in
program analysis in [CJ11]. In the context of program analysis, the problem
is formulated such that scalability can be achieved by reusing previously
computed sub-problems. Since all infeasible nodes are excluded from calculating the
analysis result of a subtree, in order to ensure soundness, at the point of reuse,
all such infeasibility must also be maintained. Our framework adopts the efficient
algorithm for computing the interpolant from [JSV09].
⁴ An exact computation of the weakest precondition is in general intractable. However, the
approximation of the weakest precondition generated by our framework is in conjunctive
form and, as a result, is tractable.
Next, we discuss the witness concept. Intuitively, it is a path that depicts the
worst-case resource consuming path of a subtree and/or the path with the peak
resource consumption (for non-cumulative resources). A witness is denoted by Γ.
The resource cost of a witness is obtained dynamically by replaying the sequence of
resource consuming items along the path under an incoming machine state m.
We say that two nodes in a symbolic execution tree are similar if they refer
to the same program point. Thus two subtrees are similar if they share the same
entry and exit program points.
We next discuss the dominating condition, another component of our analysis of a
subtree. Intuitively, this is a description of what machine state is needed in order
that the witness remains optimal in a similar subtree. That is, in the similar subtree,
the witness remains the most resource consuming path.
The witness and the dominating condition are customized based on the resource
analysis. We will present customized witness and dominating condition for each of
the resource analyses presented in Chapters 3-5.
We now discuss an abstract transformer ∆p of a subtree from ℓ1 to ℓ2, which
is an abstraction of all feasible paths (w.r.t. the incoming symbolic state s) from
ℓ1 to ℓ2. Its purpose is to capture an input-output relation between the program
variables. In our implementation, we adopt the approach of [CJ11], which uses the
polyhedral domain [CH].
Similarly, we also need an abstract transformer for the machine state. We call it
a machine state summary ∆m. The machine state summary is generated based
on the micro-architectural features captured in the machine state. More details on
how to generate and apply machine state summaries are presented separately
for each of the resource analyses in Chapters 3-5.
Definition 7. A summarization of subtree(s, ℓ2), where ℓ1 is the program point of
s, is a tuple
[ℓ1, ℓ2, Ψ, Γ, r, δ, ∆p, ∆m]
where Ψ is an interpolant, Γ is the witness, r is the resource cost of subtree(s, ℓ2),
and δ is the dominating condition. ∆p is an abstract transformer relating the input
and output variables and finally, ∆m is a machine state summary.
Note that all components stored in the summarization (except the abstract
transformer) are in conjunctive form, i.e., these components do not contain any
disjunctions.
2.3.3 Reuse with Interpolation and Dominance
We now present a key feature of our algorithm: reuse of a summarization.
Suppose we have already computed a summarization [ℓ1, ℓ2, Ψ, Γ, r, δ, ∆p, ∆m]. Suppose
we then encounter a symbolic state s′ ≡ 〈ℓ1, m, [[s′]]〉. The summarization can now
be reused if:
1. [[s′]] implies the stored interpolant Ψ, i.e., [[s′]] |= Ψ.
2. The context of s′ is consistent with the witness, i.e., [[Γ]] ∧ [[s′]] is satisfiable.
3. The dominating condition is satisfied by the incoming machine state m, i.e.,
DOM(δ, m) holds.
The worst-case resource cost of the subtree beneath the state s′ is then derived from
the witness Γ and the machine state m. Note that the resource cost of the subtree
beneath s′ can be different from r. Finally, the outgoing state of the subtree at ℓ2,
denoted by s′2, is generated by s′ −∆p,∆m→ s′2.
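The three reuse conditions above can be sketched as a single predicate. Everything concrete in this fragment is an assumption made for illustration: formulas are Python predicates over one integer variable x, entailment and satisfiability are decided by exhaustive search over a small range (a stand-in for the theorem prover), the machine state is a set of cached blocks, and DOM(δ, m) is modelled as δ ⊆ m. The abstract transformers are omitted.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

DOMAIN = range(-5, 6)            # finite stand-in for the prover's domain
Formula = Callable[[int], bool]


def entails(a, b):
    """a |= b, checked exhaustively over DOMAIN."""
    return all(b(x) for x in DOMAIN if a(x))


def satisfiable(a):
    return any(a(x) for x in DOMAIN)


@dataclass(frozen=True)
class Summarization:
    entry: int                   # entry program point
    interpolant: Formula         # Psi
    witness: Formula             # [[Gamma]], path constraints of the witness
    cost: int                    # r, cost recorded when first analysed
    dom: FrozenSet[str]          # delta, toy form: blocks that must be cached


def can_reuse(summ, loc, ctx, m):
    return (
        loc == summ.entry
        and entails(ctx, summ.interpolant)                        # condition 1
        and satisfiable(lambda x: summ.witness(x) and ctx(x))     # condition 2
        and summ.dom <= m                                         # condition 3
    )


summ = Summarization(1, lambda x: x <= 1, lambda x: x <= 0, 15, frozenset({"A"}))
print(can_reuse(summ, 1, lambda x: x == 0, frozenset({"A", "B"})))  # True
print(can_reuse(summ, 1, lambda x: x == 0, frozenset({"B"})))       # False
```

In the second call the interpolant and the witness are still satisfied, but the machine state does not meet the dominating condition, so the subtree would have to be explored again.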
We conclude this subsection by mentioning that we only summarize at
selected program points. Given an entry point ℓ1, the corresponding exit point ℓ2 is
determined as follows. It is the program point that post-dominates ℓ1 such that ℓ2 is of
the same nesting level as ℓ1 and is either (1) an end point of the program, or (2) an
end point of some loop body. In other words, we only perform “merging”
abstraction at loop boundaries. As ℓ2 can always be deduced from ℓ1, in a summarization,
we omit the component about ℓ2. So, in short form, we store a summarization as
[ℓ, Ψ, Γ, r, δ, ∆p, ∆m].
2.4 Customizing the Analysis Framework
Before presenting the integrated resource analysis algorithm, we review
the terms and functions which should be customized for performing any specific
resource analysis with our framework. These terms have been briefly defined in this
chapter; a detailed presentation of these terms and functions is given
in Chapters 3-5.
Table 2.1 presents the list of the terms and the functions that need to be
customized based on the resource of interest. By defining these terms and their
respective functions in detail, our framework can perform different
resource analyses. We now review the terms and functions one by one:
Table 2.1: Customizable Terms
    Customizable Terms                                    Respective Functions
1   Basic Operations & Resource Consumption Cost Model    ResEval
2   Machine State                                         updMachineState
3   Witness & Dominating Condition                        Summarize-a-Trans, Combine-Witnesses, Merge-Witnesses
4   Machine State Summary                                 Combine-Summary, Merge-Summary
1. Basic Operations: As explained in Section 2.3, based on the resource of
interest, basic operations can also have other forms such as memory
allocations/deallocations for MHW analysis. We will customize the set of basic
operations for each analysis.
Moreover, the resource consumption cost model should be defined. For
example, in WCET analysis the resource which we are interested in estimating
is time, and we should define the cost model for time in the presence of the
micro-architectural features tracked in the machine state (cache state in this case).
The ResEval function captures the consumed resource based on the cost
model.
2. Machine State: Defining precisely the machine state is the second step
for customizing the resource analysis framework. The machine state will
represent the internal state of the micro-architectural features that affect the
resource analysis. For example, in the WCET analysis presented in Chapter
3, we will take into account the instruction and data cache. So the machine
state will be defined such that it would contain the internal states of the
data and instruction cache. Moreover, the updMachineState function which
tracks the updates of the machine state during the traversal of the symbolic
execution tree would be defined.
3. Witness and Dominating Condition: In this chapter, we defined the
witness as the most resource consuming path in an analyzed subtree and the
dominating condition as the set of constraints that guarantee the domination
of the witness over the other paths in a subtree w.r.t. the machine state.
The witness only stores the items which affect the amount of the resource
consumed in the path. As a result, the details of the items stored in the
witness and the details of the constraints stored in the dominating condition
are defined based on the analyzed resources.
The Summarize-a-Trans function computes a summarization for a single
transition tr at state s. This can be seen as a basic step in our algorithm. This
function is customized based on the resource of interest.
Finally, the Combine-Witnesses and Merge-Witnesses functions should be defined to
perform the basic combine and merge steps for generating the witness and the
dominating condition recursively.
4. Machine State Summary: The machine state summaries are used when
reusing summarizations of loop iterations. Machine state summary can only
be defined after machine state is precisely defined. The machine state sum-
mary is the last term that should be defined for customizing the resource
analysis framework.
The Combine-Summary and Merge-Summary functions are defined to present the
merge and combine steps for the machine state summaries.
2.5 The Integrated Algorithm
In this section, we explain Algorithm 1 in a top-down manner. The function Analyze
takes the initial symbolic state s0 and the transition system P of an input program.
It then invokes Summarize to generate a summarization for the whole program (line
1) and returns r as the maximum resource consumption of the program (line 2).
2.5.1 The Summarize Function
The Summarize function performs a depth-first traversal of the symbolic
execution tree. During the depth-first traversal, at each node either (1) a summarization
is reused, thus we do not need to expand the node; or (2) after expanding it, we
compute its summarization based on the summarizations of its child nodes. We
now discuss the Summarize function in detail.
Algorithm 1 Integrated WCET Analysis Algorithm
function Analyze(s0, P)
     Let s0 be 〈ℓ0, m0, [[s0]]〉
〈1〉  [ℓ0, ·, ·, r, ·, ·, ·] := Summarize(s0, P)
〈2〉  return r
function Summarize(s, P)
     Let s be 〈ℓ, m, [[s]]〉
〈3〉  if ([[s]] ≡ false) return [ℓ, false, [ ], −∞, false, [ ], [ ]]
〈4〉  if (outgoing(ℓ, P) = ∅) return [ℓ, false, [ ], 0, [ ], [Id(Vars, m)], [ ]]
〈5〉  if (loop-end(ℓ, P)) return [ℓ, false, [ ], 0, [ ], [Id(Vars, m)], [ ]]
〈6〉  S := [ℓ, Ψ, Γ, r, δ, ∆p, ∆m] := memoed(ℓ)
〈7〉  if ([[s]] |= Ψ ∧ [[Γ]] ≢ false ∧ DOM(δ, m)) return S
〈8〉  if (loop-head(ℓ, P))
〈9〉      S1 := [·, ·, Γ1, ·, ·, ∆p1, ∆m1] := TransStep(s, P, entry(ℓ, P))
〈10〉     if ([[Γ1]] ≡ false)
〈11〉         S := JoinHorizontal(m, S1, TransStep(s, P, exit(ℓ, P)))
         else
〈12〉         Let tr be ℓ −∆p1,∆m1→ ℓ′
〈13〉         s −tr→ s′
〈14〉         S′ := Summarize(s′, P)
〈15〉         S := Compose(S1, S′)
〈16〉         S := JoinHorizontal(m, S, TransStep(s, P, exit(ℓ, P)))
〈17〉 else S := TransStep(s, P, outgoing(ℓ, P))
〈18〉 memo and return S
Base Cases: Summarize handles four base cases. First, the symbolic state s is
infeasible (line 3). Note that here path-sensitivity plays a role because provably
infeasible paths are excluded from contributing to the analysis result. Thus the
returned witness is [ ]. The abstract transformer is [ ] too. Note that the empty
machine state summary is also depicted by [ ] for brevity. Second, s is a terminal
state (line 4). Here Id refers to the identity function, which keeps the symbolic
state unchanged. The end point of a loop is treated similarly in the third base
case (line 5). The last base case, lines 6-7, is the case where a summarization can
be reused. We have discussed this step in Section 2.3.3.
Expanding to the next program point: Line 17 depicts the case when
transitions can be taken from the current program point ℓ, and ℓ is not a loop head.
We call TransStep to move recursively to the next program points. TransStep considers
all transitions emanating from ℓ, denoted as outgoing(ℓ, P), then calls Summarize
recursively and compounds the returned summarizations into a summarization of
ℓ.
In more detail, for each tr in TransSet, TransStep extends the current state
with the transition. We then call Summarize with the resulting child state (line 22).
The algorithm aggregates each returned summarization into a single
summarization, namely S. This is achieved by first calling Compose (line 23), then calling
JoinHorizontal (line 24). We use Compose and JoinHorizontal (explained
later in this section) to combine two summarizations vertically and to merge them
horizontally. Note here that we construct a summarization from a single
transition before calling Compose. Since all components (except the abstract transformer)
stored in the summarizations are free of disjunctions, the Compose and
JoinHorizontal steps are computationally inexpensive.
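The shape of the traversal (expand, compose with the child summarization, join siblings, memoize) can be illustrated with a deliberately stripped-down sketch. It assumes a loop-free program, a static cost per transition, and no infeasible paths, so interpolants, dominating conditions and machine states play no role and reuse degenerates to a lookup by program point. The function name, the transition encoding (src, cost, dst) and the example graph are hypothetical.

```python
def summarize(loc, transitions, memo=None, expanded=None):
    """Return (worst-case cost, witness path) from loc to the end point."""
    memo = {} if memo is None else memo
    expanded = [] if expanded is None else expanded
    if loc in memo:                        # reuse a stored summarization
        return memo[loc]
    expanded.append(loc)
    outs = [t for t in transitions if t[0] == loc]
    best = (0, [])                         # terminal program point
    if outs:
        best = None
        for _, cost, dst in outs:
            sub_cost, sub_path = summarize(dst, transitions, memo, expanded)
            # "Compose": prepend the single transition to the child result.
            cand = (cost + sub_cost, [(loc, dst)] + sub_path)
            # "JoinHorizontal": keep the more expensive sibling.
            if best is None or cand[0] > best[0]:
                best = cand
    memo[loc] = best
    return best


# Two diamonds in sequence; point 4 is reached along two different paths.
T = [(1, 5, 2), (1, 7, 3), (2, 1, 4), (3, 1, 4),
     (4, 20, 5), (4, 30, 6), (5, 0, 7), (6, 0, 7)]
visited = []
cost, witness = summarize(1, T, expanded=visited)
print(cost, witness, visited.count(4))
```

The subtree below point 4 is expanded once and its summarization is reused on the second visit, which is the effect that the interpolant and dominance checks make safe in the full algorithm.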
Handling Loops: Lines 9-16 handle the case when the current program point ℓ
is a loop head. Let entry(ℓ, P) denote the set of transitions going into the body of
the loop, and exit(ℓ, P) denote the set of transitions exiting the loop.
Upon encountering a loop, our algorithm attempts to unroll it once by calling
the function TransStep to explore the entry transitions (line 9). If the returned
witness formula is false, meaning that it is infeasible to execute another iteration, we
proceed with the exit branches. The returned summarization is merged (using
JoinHorizontal) with the summarization of the previous unrolling attempt (line
11). Otherwise, we first use the returned abstract transformers to produce a new
continuation context (lines 12 and 13), then we continue the analysis from the next
loop iteration onwards (line 14). The returned information is then compounded
with the summarization of the first iteration (line 15). Note that, importantly,
compounded summarizations of the inner loop(s) can be reused in later iterations
of the outer loop. In other words, the compounded summarizations generated from
the iterations of an inner loop can be reused to avoid exploring part or all of the
inner loop in later iterations of the outer loop.
function TransStep(s, P, TransSet)
     Let s be 〈ℓ, m, [[s]]〉
〈19〉 S := [ℓ, false, [ ], 0, [ ], [Id(Vars, m)], [ ]]
〈20〉 foreach (tr ∈ TransSet) do
〈21〉     s −tr→ s′
〈22〉     S′ := Summarize(s′, P)
〈23〉     S′ := Compose(Summarize-a-Trans(s, tr), S′)
〈24〉     S := JoinHorizontal(m, S, S′)
     endfor
〈25〉 return S
end function
function Compose(S1, S2)
     Let S1 be [ℓ1, Ψ1, Γ1, r1, δ1, ∆p1, ∆m1]
     Let S2 be [ℓ2, Ψ2, Γ2, r2, δ2, ∆p2, ∆m2]
〈26〉 r := r1 + r2
〈27〉 ∆p := ∆p1 ∧ ∆p2
〈28〉 ∆m := Combine-Summary(∆m1, ∆m2)
〈29〉 Ψ := Ψ1 ∧ Pre-Cond(∆p1, Ψ2)
〈30〉 {Γ, δ} := Combine-Witnesses(Γ1, Γ2, δ1, δ2)
〈31〉 return [ℓ1, Ψ, Γ, r, δ, ∆p, ∆m]
end function
function JoinHorizontal(m, S1, S2)
     Let S1 be [ℓ, Ψ1, Γ1, r1, δ1, ∆p1, ∆m1]
     Let S2 be [ℓ, Ψ2, Γ2, r2, δ2, ∆p2, ∆m2]
〈32〉 {Γ, r, δ} := Merge-Witnesses(m, Γ1, Γ2, r1, r2, δ1, δ2)
〈33〉 ∆m := Merge-Summary(∆m1, ∆m2)
〈34〉 ∆p := ∆p1 ∨ ∆p2
〈35〉 Ψ := Ψ1 ∧ Ψ2
〈36〉 return [ℓ, Ψ, Γ, r, δ, ∆p, ∆m]
end function
Figure 2.6: Helper Functions
2.5.2 Compounding Two Summarizations
Next, we will elaborate on how summarizations are compounded through the
functions Compose and JoinHorizontal in Figure 2.6.
Compounding Vertically Two Summarizations: Consider the case where
subtree(s2, ℓ3) suffixes subtree(s1, ℓ2), where s2 ≡ 〈ℓ2, m2, [[s2]]〉 and
s1 ≡ 〈ℓ1, m1, [[s1]]〉. In other
words, a path π1 from ℓ1 to ℓ2 followed by a path π2 from ℓ2 to ℓ3 corresponds to
a path π in subtree(s1, ℓ3). The Compose function returns a summarization for
subtree(s1, ℓ3) by compounding the two existing summarizations, respectively for
subtree(s1, ℓ2) and subtree(s2, ℓ3).
The resource cost of subtree(s1, ℓ3) is computed as the sum of the resource
costs of the subtrees (line 26), and the abstract transformer ∆p is computed as the
conjunction of the input abstract transformers (line 27), with proper variable
renaming. Note that in our implementation, abstract transformers are computed
using the polyhedral domain. We employ ∆p to generate one continuation context,
before proceeding with the analysis of subsequent program fragments. Next, the
desired interpolant must capture the infeasibility of S1, as well as the infeasibility of
S2 given that we treat subtree(s1, ℓ2) as an abstract transition, of which the
operation is ∆p. We rely on the function Pre-Cond, which in line 29 under-approximates
the weakest precondition of the post-condition Ψ2 w.r.t. the transition relation
∆p. Finally, we use Combine-Summary to construct the overall machine state
input-output relation for subtree(s1, ℓ3) (line 28) and Combine-Witnesses to compound the
witnesses and the dominating conditions of the summarizations (line 30).
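A toy rendering of lines 26, 29 and 30 of Compose is given below. It is a sketch under strong simplifying assumptions that are not part of the framework: interpolants are sets of conjuncts kept as text, the first abstract transformer is taken to be the identity so that Pre-Cond(∆p1, Ψ2) is simply Ψ2, witnesses are tuples of items concatenated in path order, and dominating conditions, ∆p and ∆m are left out.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Summ(NamedTuple):
    interpolant: FrozenSet[str]   # conjuncts of Psi
    witness: Tuple[str, ...]      # Gamma, items along the witness path
    cost: int                     # r


def compose(s1, s2):
    """Vertical compounding of s1 (upper subtree) with s2 (lower subtree)."""
    return Summ(
        interpolant=s1.interpolant | s2.interpolant,  # Psi1 and Pre-Cond(Id, Psi2)
        witness=s1.witness + s2.witness,              # witness of s1, then of s2
        cost=s1.cost + s2.cost,                       # r := r1 + r2
    )


upper = Summ(frozenset({"x <= 1"}), ("A", "B"), 20)
lower = Summ(frozenset({"y > 0"}), ("A",), 1)
whole = compose(upper, lower)
print(whole.cost, whole.witness, sorted(whole.interpolant))
```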
Compounding Horizontally Two Summarizations:
Given two summarizations rooted at two nodes which are siblings, we want
to propagate the information back and compute the summarization for the par-
ent node. While propagation can be achieved by Compose, we need JoinHorizontal
(presented in Figure 2.6) to “merge” the contributions of the two children to the
parent node. Note that unlike Compose, we need to select the longer path between
the two witnesses of the input summarizations. Such selection depends
on the current machine state. That is why the machine state m is passed as
an input to JoinHorizontal, which subsequently passes it on to Merge-Witnesses.
Merge-Witnesses and Merge-Summary, which are customized based on the analysis,
are used to merge witnesses, dominating conditions and machine state summaries.
The abstract transformer ∆p, however, is computed straightforwardly as the dis-
junction of the input abstract transformers. All the infeasible paths in both sub-
structures must be maintained, thus the desired interpolant is the conjunction of
the two input interpolants. Examples on compounding summarizations vertically
or horizontally are presented in the Example Analysis Sections of Chapters 3-5.
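Why the machine state must be passed to JoinHorizontal can be seen in a small sketch. The cost model here is an assumption for illustration only (a cache without eviction, hit cost 1, miss cost 10), as are the function names; dominating conditions, ∆p and ∆m are omitted, and interpolants are sets of conjuncts kept as text.

```python
HIT, MISS = 1, 10   # hypothetical access costs


def replay_cost(witness, m):
    """Cost of replaying the witness from cache contents m (no eviction)."""
    cached, r = set(m), 0
    for block in witness:
        r += HIT if block in cached else MISS
        cached.add(block)
    return r


def join_horizontal(m, s1, s2):
    """Merge two sibling summarizations given as (interpolant, witness)."""
    (psi1, gamma1), (psi2, gamma2) = s1, s2
    r1, r2 = replay_cost(gamma1, m), replay_cost(gamma2, m)
    gamma, r = (gamma1, r1) if r1 >= r2 else (gamma2, r2)
    return psi1 | psi2, gamma, r          # Psi := Psi1 and Psi2


left = (frozenset({"x > 1"}), ("A", "A", "A"))
right = (frozenset({"x <= 1"}), ("B", "C"))
print(join_horizontal(frozenset(), left, right)[1:])            # (('B', 'C'), 20)
print(join_horizontal(frozenset({"B", "C"}), left, right)[1:])  # (('A', 'A', 'A'), 12)
```

With an empty cache the right sibling is the worst case; once B and C are cached the left sibling takes over, so the selected witness depends on m.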
Finally, we conclude this section with a formal statement about the soundness
and termination of our framework.
Theorem 1 (Sound Resource Consumption Estimation). Our algorithm always
produces safe worst-case resource consumption estimates.
Proof. Our algorithm performs a depth-first traversal of the symbolic execution
tree. In all steps except when reuse happens, what we perform is only widening
the execution contexts, not narrowing them. Because of such steps, we might
over-approximate the real worst-case resource cost; but this is safe. However, reuse in
our setting is different from [CJ11], so we need to check its soundness separately.
Assume that, at symbolic state s ≡ 〈`,m, [[s]]〉, we reuse a summarization
[`,Ψ,Γ, r, δ,∆p,∆m] of a subtree T .
Also assume that the reuse is unsafe. Note that when reuse happens, we employ
the abstract transformers to generate a continuation context and continue the
analysis from there. This step is also a widening step, thus it is safe. As a result,
there must be a feasible path in the avoided subtree emanating from s, of which
the resource cost is more than the resource cost of the witness Γ. Let us call this
path Γ′.
Because the first condition for reuse implies that all infeasible paths of T stay
infeasible under the new context s, Γ′ must be feasible in T as well. Obviously, in
order for Γ to be reported as the witness, in T , the resource cost of Γ′ must be not
more than the resource cost of Γ.
The third condition for reuse ensures that the dominating condition is satisfied.
This implies that the machine state at s maintains the optimality of Γ. In particu-
lar, if the resource cost of Γ (in T ) is not less than the resource cost of some other
feasible path (in T ), it is still the case under the new context s. Consequently,
under context s, the resource cost of Γ′ can not be more than the resource cost of
Γ. This is a contradiction.
We remark here that we do not make use of the second condition for reuse
in the proof of soundness. In fact, that condition has to do with the precision
of reuse, rather than its soundness. An important implication – which has been
shown in [JSV08] – is that our algorithm produces “exact” analysis for loop-free
programs.
Theorem 2 (Termination of Resource Analysis). Our algorithm always termi-
nates.
Proof. Our algorithm performs a depth-first traversal of the symbolic execution
tree. Due to the assumptions on programs used in safety-critical embedded systems,
the programs always terminate. Thus, the symbolic execution tree is a
bounded tree. Considering that all the data structures generated and utilized in
our algorithm are also bounded in length, our algorithm will always terminate.
2.6 Summary
In this chapter, we presented our resource analysis framework which performs inte-
grated resource analysis. We demonstrated that our framework is able to generate
the most accurate bounds over the consumed resources. Moreover, the notions of
reuse and domination were introduced or enhanced to help our framework remain
scalable. Finally, we presented the steps needed to customize our framework. Fol-
lowing these steps, in the next three chapters, we will customize our framework for
three different analyses: (1) WCET analysis, (2) MHW analysis and (3) WCEC
analysis.
Chapter 3
Integrated Worst-case Execution Time
Analysis with Cache
3.1 Introduction
In this chapter, we customize the resource analysis framework presented in Chapter
2 for WCET analysis. The main contribution of the WCET analysis presented
in this chapter is that while our integrated resource analysis framework increases
the precision, it also scales for a class of realistic benchmarks. Thus, our method
can be employed for applications where precise WCET analysis is pivotal.
As discussed in Chapter 1, hard real-time systems need to meet hard deadlines.
Static Worst-Case Execution Time (WCET) analysis is therefore very important
in the design process of real-time systems. However, micro-architectural features
in processors (e.g., caches) make WCET analysis a difficult task.
WCET analysis usually proceeds in two phases for scalability: a high-level
phase and a low-level phase. In low-level analysis, we need to consider timing effects
of performance enhancing processor features such as pipeline and caches. This
chapter focuses on caches, since the impact of caches on the real-time behavior
of programs is much greater than that of other features [MHH02]. Consequently, the machine
state will only track the updates on the cache states in this chapter.
Cache analysis, to be scalable, is usually accomplished by abstract interpretation
(AI) [TFW00]. In other words, we need to analyze the memory accesses
of the input program via an iterative fixed-point computation. In Chapter 1, two
reasons for the imprecision of the AI framework were presented. We will elaborate
on these reasons with respect to caches in Section 3.2 and in the results presented
in Section 3.5 of this chapter. We explained that most such algorithms employ
a fixed-point computation in order to aggregate the analysis across loop iterations.
Consequently, they still inherit the imprecision of an AI framework, identified
as point (2) in Section 1.2.1. More specifically, a fixed-point method will compute
a worst-case timing for each basic block in all possible contexts, even though the
timings of a basic block in different iterations of a loop can diverge significantly.
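To make the divergence across iterations concrete, the following minimal sketch compares a single per-block worst-case cost charged in every iteration with the sum of the per-iteration costs obtained when the loop is unrolled. The latencies, the loop bound and the function names are hypothetical and chosen only for illustration; they are not taken from the thesis, and real AI-based analyzers may recover part of this gap with additional techniques.

```python
# Illustrative numbers only (assumptions, not values from the thesis).
HIT, MISS = 1, 10        # hypothetical cache hit / miss latencies in cycles
ITERATIONS = 8           # hypothetical loop bound


def per_iteration_costs(iterations):
    """Cost of one block in each unrolled iteration: a cold miss, then hits."""
    return [MISS if i == 0 else HIT for i in range(iterations)]


def context_insensitive_bound(iterations):
    """One worst-case cost for the block, charged in every iteration."""
    return max(per_iteration_costs(iterations)) * iterations


def unrolled_bound(iterations):
    """Sum of the per-iteration costs, as seen when the loop is unrolled."""
    return sum(per_iteration_costs(iterations))


coarse = context_insensitive_bound(ITERATIONS)   # 10 * 8
precise = unrolled_bound(ITERATIONS)             # 10 + 7 * 1
```

With these assumed numbers the context-insensitive bound is 80 cycles while the unrolled bound is 17 cycles.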
In Chapter 2, we demonstrated that our resource analysis framework is able
to systematically address these two causes of imprecision. However, we left certain
terms in the framework to be defined while customizing the framework for different
resources. In this chapter, we will customize the resource analysis framework for
WCET analysis.
3.2 Related Work
WCET analysis has been the subject of much research, and substantial progress
has been made in the area (see [PB00, W+08] for surveys of WCET). As discussed
before, WCET analysis is often conducted by separating low-level analysis and
high-level analysis into different phases.
3.2.1 High-level analysis
Among the works on high-level analysis, our most important related work
is [CJ11], which was thoroughly investigated in Chapter 2. On the other hand,
there are recent CEGAR-like methods, which start by generating a rough WCET
estimate and then gradually refine it. “WCET squeezing” [KKZ13] is built on
top of the Implicit Path Enumeration Technique (IPET) [LM95]. A solution to
the given integer linear programming (ILP) formula corresponds to a number of
program traces, whose feasibility is checked (one by one) via SMT solving.
If such a trace is infeasible, additional ILP constraints are added to exclude it from
further consideration. Subsequently, [Č+] proposes hierarchical segment abstraction,
which allows the computation of WCET by solving a number of independent
ILP problems, instead of one large global ILP problem. Since the abstract segment
trees can store more expressive constraints than ILP, a better refinement procedure
can be implemented.
We also mention the recent work [H+14a], which also employs the concept of
interpolation, but under the SMT framework, to avoid state explosion in WCET
analysis. Like [JSV08], this approach is formulated for loop-free programs, and is not
yet suitable for analyzing programs with loops.
In summary, we can see a trend of research where recent advances in software
verification are employed for WCET high-level analysis. However, it is unclear if
these approaches will remain scalable when extended towards low-level analysis,
under the presence of loops and/or many infeasible paths.
3.2.2 Low-level analysis
Low-level analysis, with emphasis on caches, has always been an active re-
search topic in WCET analysis. Initial work on instruction cache modeling uses
ILP [LMW99]. However, the work does not scale due to a huge number of gener-
ated ILP constraints. Subsequently, the AI framework [CC77] for low-level anal-
ysis, proposed in [TFW00], has made an important step towards scalability. The
solution has also been applied in commercial WCET tools (e.g., [aiT]). For most
existing WCET analyzers, the AI framework has emerged as the basic approach
used for low-level analysis. Additionally, static timing analysis with data cache has
been investigated in [W+97, FW98, H+11].
Recent approaches [CR11, BCR13] by the same research group, combining AI
with verification technology, have shown some promising results. In the more recent
work [BCR13], a partial path is tracked together with each micro-architectural
state µ. This partial path captures a subset of the control flow edges along which
the micro-architectural state µ has been propagated. If a partial path is infeasible,
its associated micro-architectural state can be excluded from consideration.
To be tractable, micro-architectural states are merged at appropriate sink nodes.
(In fact, the partial path constraints are merged to true.) As a result, the approach
is only effective for detecting infeasible paths whose conflicting branch conditions
appear relatively close to each other in the CFG.
In a similar spirit as [KKZ13] and [CR11], Nagar and Srikant [NS] propose the
concept of cache miss paths. The method employs IPET formulation, using the
information from the worst-case solution of the ILP problem (which corresponds to
a number of program paths) to improve the precision of AI-based cache analysis.
However, it is reported that for benchmarks statemate and nsichneu – which
contain a large number of program paths – little improvement is obtained.
It is important to note that, in general, the above-mentioned approaches still
employ a fixed-point computation in order to ensure sound analysis across loop
iterations. Thus, they inherit the imprecision of AI, because the timings of a basic
block in different iterations of a loop often can diverge significantly.
3.2.3 Other Related Work
We mention some orthogonal works that represent recent and interesting ad-
vances in WCET research. [L+] performs loop unrolling, passing flow information
from source level through the process of compiler optimization in order to help
tighten the WCET estimates. At the current stage, the approach seems to be
limited to single-path programs. Zolda and Kirner [ZK15] propose to incorporate
the information from collected program execution traces into IPET framework to
enhance the precision of calculated WCET bounds. The effectiveness of this approach
seems dependent on the quality of the collected traces as well as the number
of infeasible paths in the given input program.
We remark that the idea of coupling low-level analysis with high-level analysis
(with loop unrolling) dates back to [LS99]. However, to counter state explosion,
the only solution of [LS99] is to perform merging frequently. In the end, the
approach forfeits its intended precision and, at the same time, does not scale to
realistic benchmarks.
Finally, we remark on the issue of timing anomaly [R+06]. In general, timing
anomaly can make abstraction (and therefore AI) unsound. It is extremely hard to
systematically address this issue. More often, custom solutions are employed. For
example, [RS09a] can compute a constant bound to be added to the local worst-case
path to safely handle timing anomalies, provided they are not of “domino-effect”
type. This approach is also applicable to our framework. In this chapter, we explore
WCET analysis without timing anomalies.
3.2.4 Commercial WCET Tools
In this section, we will briefly mention a few of the many commercial WCET
tools. Most commercial tools, such as [aiT], use the AI framework for the low-level
analysis. The worst-case timing for each basic block is then aggregated using the
ILP formulation to give the final WCET estimate. In other words, low-level and
high-level analyses are performed separately. The immediate benefit is that these
tools scale impressively and are applicable to a wide range of input programs.
AbsInt timing-analysis tool (aiT)
The purpose of the AbsInt timing-analysis tool aiT [aiT] is to obtain upper
bounds for the execution times of code snippets in executables. aiT works on exe-
cutables in order to acquire the information on the register usage and the instruc-
tion and data addresses. The addresses are later used in cache analysis. Moreover,
their tool supports memories with different timing behavior areas. aiT is one of
the most powerful tools using the ILP method and collects constraints based on
the abstract processor model. The team working on aiT has extended the tool
with support for abstract models of different commercial processors.
Bound-T Time and Stack Analyzer
Bound-T [Bou] is another tool, which was planned for the analysis of on-board
software in spacecraft. The tool determines an upper bound on the execution time
of subroutines, including called functions. It takes a binary executable program as
input with an embedded symbol table containing the debug information. The tool
is able to compute upper bounds on some normal loops. For other loops the user
provides annotations, called assertions in Bound-T.
Florida WCET Analysis Tool
The research prototype of Florida State and Furman Universities [Flo14] targets
hard real-time systems and energy-aware embedded systems with timing constraints.
It performs path-based static analysis.
TU Vienna Analysis Tools
Three tools come from TU Vienna. The first is a prototype tool for static timing
analysis with IPET that has been integrated into a Matlab/Simulink tool chain and can
analyze C code or Matlab/Simulink models. It performs timing analysis for C
programs coded in WCETC (a subset of C with extensions to make annotations
about (in)feasible execution paths [K+02]). The second tool is a measurement
based tool that uses genetic algorithms to direct input-data generation for timing
measurements in the search for the worst case or long program execution times.
The third tool is a hybrid tool for timing analysis that uses both measurements
and static analysis with IPET to assess the timing of C code.
Chronos WCET Analysis Tool
Chronos [chr14] is developed at the National University of Singapore. It is
an open-source static WCET analysis tool. Its aim is to find tighter
bounds on the WCET of programs executed on modern processors. Chronos
receives a C program together with the configuration of the target processor. First, the
tool performs data-flow analysis to compute loop bounds. In case of failure, user
annotations have to be provided. The user can add information about infeasible
paths to improve the accuracy of the results. Chronos supports the analysis of
out-of-order pipelines, dynamic branch prediction schemes, and instruction/data
caches. More details on the commercial WCET tools are available in [W+08].
3.2.5 A discussion on the State-of-the-art WCET Analysis
Considering the different methods and commercial WCET tools which have
been discussed in this section, identifying the state-of-the-art WCET analysis is not a
trivial task. The reason is that the accuracy of a WCET analysis depends not only
on how well the high-level and low-level analyses are performed but also
on how well the hardware is simulated in the analysis. However, comparing different
methods based on the accuracy of their high-level and low-level analyses can help
assess the ultimate precision that a WCET analysis can reach.
Based on this, we can state that the algorithm in [BCR13], which extends
some path-sensitivity to low-level analysis, comprises the state-of-the-art
micro-architectural modeling (AI+SAT) combined with an ILP formulation for WCET
analysis. In this algorithm, must analysis and persistence analysis are used to
model abstract instruction and data cache. In the rest of this chapter, we will refer
to this algorithm as AI+SAT⊕ILP.
On the other hand, the path-based method presented in [CJ11] represents the
state of the art in high-level analysis. It appears that the state-of-the-art modular
WCET analysis would be a hypothetical algorithm which combines
the AI+SAT low-level analysis with the high-level analysis in [CJ11]. More specifically,
this algorithm improves on other methods because of fully unrolling loops
and increased infeasible path detection in the high-level and low-level analyses. In
the rest of this chapter, we will refer to this algorithm as AI+SAT⊕Unroll s. We
will compare our framework with these two analyses in Section 3.5.
3.3 General Framework
In this section, we will present the customized resource analysis framework for
WCET analysis with emphasis on caches. As explained in Section 3.1, in this
chapter, we will limit the machine state only to the data and instruction caches.
We adopt the concept of abstract cache state (ACS) and the update and join
functions in our framework from those of the A-way set-associative
cache with LRU replacement policy for the must analysis in the AI
framework [TFW00]. The Update function of the LRU cache replacement policy
in the abstract domain models the impact on the ACS of every reference inside a
basic block, and the Join function merges two ACSs at join points. Note that in our
analysis we only merge cache states at the end of loop iterations.
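The Update and Join functions can be sketched for a single cache set as follows. This is a minimal illustration of the textbook must-analysis semantics for LRU, assuming an ACS is represented as a dictionary from memory block to an upper bound on its age; the representation and the function names are choices made here for illustration, not the data structures of the actual implementation.

```python
def update(acs, block, assoc):
    """Must-analysis LRU update of one cache set for an access to `block`.

    `acs` maps each block that is guaranteed to be cached to an upper bound
    on its age (0 = youngest). `assoc` is the associativity A.
    """
    old_age = acs.get(block, assoc)   # an absent block behaves like age >= A
    new_acs = {}
    for other, age in acs.items():
        if other == block:
            continue
        # Only blocks younger than the accessed block grow older.
        aged = age + 1 if age < old_age else age
        if aged < assoc:              # age >= A means the block may be evicted
            new_acs[other] = aged
    new_acs[block] = 0                # the accessed block becomes the youngest
    return new_acs


def join(acs_a, acs_b):
    """Must-analysis join: blocks cached in both states, with the larger age."""
    return {b: max(acs_a[b], acs_b[b]) for b in acs_a.keys() & acs_b.keys()}


state = {}
for accessed in ["m1", "m2", "m3", "m2"]:
    state = update(state, accessed, assoc=4)
```

After the four accesses, `state` holds m2 at age 0, m3 at age 1 and m1 at age 2. The join keeps only blocks that are guaranteed to be cached on both incoming paths, which is why merging loses precision.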
In Table 2.1, we indicated the list of the terms and functions needed to be
defined for customizing the resource analysis framework. In the rest of this section,
we will define these terms and functions for WCET analysis.
3.3.1 Basic Operations and Timing Cost Model
Recall from Chapter 2 that we model a program by a transition system. The set
of basic operations, as the first customizable term, remains the same for WCET
analysis.
In WCET analysis the variable of interest r models the execution time. The
execution time is the sum of the execution time of the instructions and the
cache access time, which varies depending on whether an access is a cache hit or a cache miss.
The WCET analysis computes a sound and accurate bound for r in the end,
across all feasible paths of the program. Note that this variable is always initialized
to 0 and the only operation allowed upon it is a constant increment. Given a symbolic
state s ≡ 〈ℓ, m, [[s]]〉 and a transition tr : ℓ −op→ ℓ′, the amount of increment at
s by executing tr is given by the execution time of the LLVM instructions
and the access time for seq (the sequence of memory accesses defined in
Section 3.3.2). r is not used in any other way.
3.3.2 The Machine State
Since in this chapter we are limiting the machine state to the data and instruction
caches, we define the machine state assigned to
each symbolic state to be a tuple of the form 〈ci, cd〉, where ci and cd denote the
instruction and data caches. As a result, a symbolic state is depicted by
〈ℓ, m, σ, Π〉 where m ≡ 〈ci, cd〉.
In the domain of WCET analysis where time is the resource of interest and
in the presence of caches, the updMachineState function would be equivalent to
the standard Update function from the abstract cache semantics for must analysis
from [TFW00]. Although the must analysis is originally only used to model the
abstract instruction cache, since our framework performs loop unrolling, the must
analysis is still expressive enough to model the abstract data cache.
Given a program point ℓ, an operation op ∈ Ops, and a symbolic store σ, the
function acc(ℓ, op, σ) denotes the sequence of memory block accesses made by executing
op at the symbolic state s ≡ 〈ℓ, m, σ, ·〉. In this chapter, for short, we denote
acc(ℓ, op, σ) by seq. While the program point ℓ identifies the instruction cache
access, the sequence of data accesses is obtained by considering both op and σ together.
We denote the updated machine state by m′ ≡ updMachineState(op, σ, m).
3.3.3 Witness and Dominating Condition
Intuitively, a witness Γ is a path that depicts the WCET of a subtree. More
specifically, it is depicted by Γ ≝ 〈t, Υ, π〉 where t is the (static) execution time of
the instructions along the path assuming all the memory accesses are cache hits, Υ
is the sequence of all memory accesses along the path, and π is the path constraints
along the witness.
In case Υ contains consecutive accesses to the same memory block, all-but-first
accesses in that sub-sequence can be classified as Always Hit and, importantly,
they will not affect the resulting cache state. As an optimization, we consider them
redundant and remove them from Υ. In set-associative caches the chance that a
memory access gets omitted in this way is higher, because each memory access is
checked against its consecutive accesses in the cache set. This helps reduce the
size of Υ.
The timing of a witness is obtained dynamically from t and by replaying the
sequence Υ under an incoming cache state c. The feasibility of a witness w.r.t. an
incoming context is determined by checking if [[π]] ∧ [[s]] is satisfiable. Throughout
this thesis, we abbreviate [[π]] ∧ [[s]] by [[Γ]].
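The two steps above, removing consecutive repeats from Υ and replaying Υ under an incoming cache state, can be sketched as follows. The sketch assumes a fully associative concrete LRU cache and, since t already assumes that all accesses are hits, charges only a hypothetical miss penalty for each miss; the function names and the penalty value are illustrative assumptions.

```python
from itertools import groupby


def drop_consecutive_repeats(accesses):
    """Keep only the first access of every run of identical memory blocks."""
    return [block for block, _ in groupby(accesses)]


def replay_time(static_time, accesses, cache, assoc, miss_penalty=10):
    """Timing of a witness under an incoming cache state.

    `cache` lists the cached blocks from youngest to oldest (fully
    associative LRU, an assumption of this sketch). `static_time` is t,
    i.e. the time assuming all accesses hit, so only misses add cost.
    """
    lru = list(cache)
    total = static_time
    for block in accesses:
        if block in lru:
            lru.remove(block)          # hit: no extra cost
        else:
            total += miss_penalty      # miss: pay the (hypothetical) penalty
        lru.insert(0, block)           # accessed block becomes the youngest
        del lru[assoc:]                # evict blocks beyond the associativity
    return total


raw = ["a", "a", "b", "a", "c", "c"]
compact = drop_consecutive_repeats(raw)
```

Replaying `raw` and `compact` from the same incoming cache state yields the same timing, which illustrates why the removed accesses are redundant.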
We next discuss dominating condition δ, another customizable component of
our analysis. Intuitively, this is a description of what cache configuration is needed
in order that the witness remains optimal in a similar subtree. That is, in an anal-
ysis of the latter subtree, the witness remains the longest path. More specifically,
the constraints in the dominating condition are either of the form age(mi) < k or
age(mi) ≥ k, where age is a function returning the relative age of mi in the cache
and k is a non-negative integer. As an example, age(mi) < A means the memory
block mi is in the cache; in contrast, age(mi) ≥ A indicates the memory block is not
in the cache. We next present the customized Summarize-a-Trans, Combine-Witness
and Merge-Witness functions.
function Summarize-a-Trans(s, tr)
     Let s be 〈ℓ, m, σ, ·〉 and let tr be ℓ −op→ ℓ′
〈37〉 t := Execution-Time(op); Υ := acc(ℓ, op, σ)
〈38〉 Iterate through Υ and remove repeating accesses
〈39〉 i := 0; M := ∅; w := t
〈40〉 foreach mem ∈ Reverse(acc(ℓ, op, σ)) do
〈41〉     Add 〈mem, i〉 into M; i := i + 1
〈42〉     w := w + Acc-Time(mem, m)
     endfor
〈43〉 ∆m := 〈M, i〉
〈44〉 return [ℓ, true, 〈t, Υ, [[op]]σ〉, w, true, op∆, ∆m]
end function
Figure 3.1: Function to summarize a transition
Summarize-a-Trans Function
Summarize-a-Trans in Figure 3.1 computes a summarization for a single transition
tr at state s. Because no infeasible path has been discovered, the interpolant Ψ is
just true. There is a single path, thus the dominating condition is true (the machine
state is unconstrained). Moreover, the machine state summary for a single path
is simply generated from the reverse order of the sequence of all memory accesses,
namely acc(ℓ, op, σ), and i, increasing from 0, which indicates the relative age of the
memory accesses in the instruction/data cache states at the end of the path (lines
39-41 and 43). Note that for brevity, here we demonstrate the steps to generate
the machine state summary for a fully-associative cache. These steps can trivially
be extended for set-associative caches. The abstract transformer ∆p (for program
variables) is the operation op itself, but translated to the language of input-output
relation. As an example, x := x + 1 is translated to xout = xin + 1. We use op∆ to
denote such translated op.
function Combine-Witness(Γ1, Γ2, δ1, δ2)
     Let Γ1 be 〈t1, Υ1, π1〉 and let Γ2 be 〈t2, Υ2, π2〉
〈45〉 t := t1 + t2
〈46〉 if (Last(Υ1) ≡ First(Υ2)) then Υ2 := Remove-First(Υ2)
〈47〉 Υ := Υ1 · Υ2
〈48〉 π := π1 ∧ π2
〈49〉 δ′2 := true
〈50〉 foreach {age(memi) < k} ∈ δ2 do
〈51〉     AlwaysTrueFlag := false
〈52〉     foreach memj ∈ Υ1 do
〈53〉         if memi ≡ memj then AlwaysTrueFlag := true
〈54〉         else if Conflict(memi, memj) then k := k − 1
         endfor
〈55〉     if AlwaysTrueFlag ≡ true then skip;
〈56〉     else if k > 0 then δ′2 := δ′2 + {age(memi) < k}
〈57〉     else if k ≤ 0 then δ′2 := δ′2 + false
     endfor
〈58〉 foreach {age(memi) ≥ k} ∈ δ2 do
〈59〉     AlwaysFalseFlag := false
〈60〉     foreach mj ∈ Υ1 do
〈61〉         if k ≥ 0 ∧ memi ≡ mj then AlwaysFalseFlag := true
〈62〉         else if Conflict(memi, mj) then k := k − 1
         endfor
〈63〉     if AlwaysFalseFlag ≡ true then δ′2 := δ′2 + false
〈64〉     else if k ≥ 0 then δ′2 := δ′2 + {age(memi) ≥ k}
〈65〉     else if k < 0 then skip;
     endfor
〈66〉 δ := δ1 ∧ δ′2
〈67〉 return {〈t, Υ, π〉, δ}
end function
Figure 3.2: Combining (Vertically) Witnesses and Dominating Conditions
Combine-Witness and Merge-Witness Functions
In Figure 3.2, Combine-Witness produces a witness and a dominating condition,
by compounding the witnesses and the dominating conditions of the two subtrees,
where one suffixes the other. This can be understood as a sequential composition.
The static timing of the witness t is initialized as the sum of t1 and t2 (line 45).
Let mem be the last access in Υ1. If mem is also the first access in Υ2, it would
always be a cache hit and is removed from Υ2 (line 46). The combined Υ is then
the concatenation of Υ1 and Υ2 (line 47). Next, the witness path constraint pi is
computed as the conjunction of pi1 and pi2 (line 48).
The combined dominating condition δ is computed as the conjunction of δ1 and
a condition δ′2, in line 66. Intuitively, δ′2 describes a cache state m, such that if
we perform all the accesses in Υ1 on the cache states inside m, we will produce a
cache state m′ which satisfies δ2.
The computation of δ′2 is a precondition computation, adapted to the nature of
caches. More specifically, δ′2 is initialized to true (line 49). Next, all the conditions
in δ2 are updated w.r.t. Υ1. If a condition becomes always true, it is not added
to δ′2 (lines 53, 55 and 65). Otherwise, k is decreased by the number of
conflicting memory blocks in Υ1 (lines 54 and 62) and the condition is added
to δ′2 (lines 56 and 64). There is a special case where a condition always resolves to
false; in that case, false is added to δ′2 (line 57). Lines 61 and 63 test for another similar
case. As a result, δ′2 would always resolve to false. This scenario rarely happened
in our experiments.
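As an illustration of this precondition computation, the following sketch transcribes lines 49-65 of Figure 3.2 under stated assumptions: constraints are represented as hypothetical triples (block, op, k) with op being "<" or ">=", a conjunction that contains false is represented by the Python value False, a block is assumed not to conflict with itself, and the default Conflict relation is the one of a fully associative cache (any two distinct blocks conflict). It is a reading aid, not the implementation used in the thesis.

```python
def precondition(delta2, accesses1, conflict=lambda a, b: a != b):
    """Compute delta2' from delta2 and the prefix accesses (Figure 3.2, 49-65).

    Returns a list of (block, op, k) constraints, or False when the
    conjunction contains false.
    """
    result = []
    for block, op, k in delta2:
        touched = False
        for other in accesses1:
            if other == block:
                # Lines 53 and 61: the block itself is accessed in the prefix.
                if op == "<" or k >= 0:
                    touched = True
            elif conflict(block, other):
                k -= 1                       # lines 54 and 62
        if op == "<":
            if touched:
                continue                     # always true, dropped (line 55)
            if k > 0:
                result.append((block, "<", k))    # line 56
            else:
                return False                 # line 57
        else:
            if touched:
                return False                 # line 63
            if k >= 0:
                result.append((block, ">=", k))   # line 64
            # k < 0: always true, dropped (line 65)
    return result


delta2 = [("a", "<", 3), ("b", "<", 2), ("c", ">=", 4)]
delta2_prime = precondition(delta2, ["b", "d"])
```

Here the constraint on b disappears because b is accessed in the prefix, while the bounds on a and c are each lowered by the two conflicting accesses.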
In Figure 3.3, Merge-Witness produces a witness and a dominating condition, by
compounding the witnesses and the dominating conditions of two sibling subtrees.
We need to choose one witness as the dominating witness out of the two input wit-
nesses. Moreover, the combined dominating condition must ensure the dominance
of each witness (in its respective subtree) and the dominance of the chosen witness
over the other witness.
The dominating condition δ is initialized as the conjunction of the two dominat-
ing conditions (line 68). We next compare the two WCET values; and we select
the one with higher timing as the dominating witness. After line 70, the chosen
witness and its corresponding WCET and dominating condition are captured by
Γ1, WCET1 and δ1.
Next, we test if δ is sufficient to ensure that Γ1 dominates Γ2. Given a condition
δ, a witness dominates another witness if its minimum timing is no less than
the maximum timing of the other. The minimum timing is calculated by (1) first
determining which accesses in the Υ component are necessary misses as a consequence
of the condition δ, and (2) classifying the remaining accesses in Υ as cache hits.
The maximum timing is calculated in the opposite manner: (1′) first
determining which accesses in the Υ component are necessary hits as a consequence of
the condition δ, and (2′) classifying the remaining accesses in Υ as cache misses. This
dominance test is shown in line 71.
function Merge-Witness(m, Γ1, Γ2, r1, r2, δ1, δ2)
     Let Γ1 be 〈t1, Υ1, π1〉 and let Γ2 be 〈t2, Υ2, π2〉
〈68〉 δ := δ1 ∧ δ2
〈69〉 if (r1 < r2)
〈70〉     swap(Γ1, Γ2), swap(r1, r2), swap(δ1, δ2)
〈71〉 if (t1 + Min-Time(Υ1, δ) ≥ t2 + Max-Time(Υ2, δ))
〈72〉     return {Γ1, r1, δ}
〈73〉 foreach memi ∈ Υ1 do
〈74〉     if (Not-Constrained(memi, δ))
〈75〉         δ := δ ∧ {age(memi) ≥ A}
〈76〉     if (t1 + Min-Time(Υ1, δ) ≥ t2 + Max-Time(Υ2, δ))
〈77〉         return {Γ1, r1, δ}
     endfor
〈78〉 foreach memj ∈ Υ2 do
〈79〉     if (Not-Constrained(memj, δ))
〈80〉         δ := δ ∧ {age(memj) < A}
〈81〉     if (t1 + Min-Time(Υ1, δ) ≥ t2 + Max-Time(Υ2, δ))
〈82〉         return {Γ1, r1, δ}
     endfor
end function
Figure 3.3: Merging Witnesses and Dominating Conditions
If Γ1 dominates Γ2, then Γ1 is returned as the dominating witness with δ as
the dominating condition. If not, we need to further constrain the dominating
condition δ.
First, for each access memi in Υ1, if memi has not been constrained in δ,
age(memi) ≥ A is added to δ (line 75). This cache constraint might increase the
minimum timing of Γ1 and lead to passing the dominance test. If the dominance
test indeed succeeds, Γ1, WCET1 and δ are returned.
If we have not succeeded yet, we proceed similarly for each memj in Υ2. Note
the difference that now we add a cache constraint of the form age(memj) < A,
with the hope of reducing the maximum timing of Γ2 enough that the dominance
test can be passed (line 81).
At the end of the first for loop, Min-Time(Υ1, δ) would be larger than (or equal to)
the original timing of Γ1 (w.r.t. cache context c) while at the end of the second for
loop, Max-Time(Υ2, δ) would be less than (or equal to) the original timing of Γ2 (w.r.t.
cache context c). In other words, eventually, we will end up with a condition δ so
that Γ1 dominates Γ2.
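The dominance test can be sketched as follows. This is one possible reading of Min-Time and Max-Time, assuming a fully associative LRU cache, a dominating condition given as a dictionary from block to a single ("<", k) or (">=", k) constraint on its age at entry, and a hypothetical miss penalty added on top of the static time t. How necessary hits and misses are derived in the actual implementation is not specified here, so the classification rules below are assumptions.

```python
def classify(accesses, delta, assoc):
    """Label each access 'hit', 'miss' or 'unknown' under the condition delta."""
    labels = []
    for i, block in enumerate(accesses):
        earlier = accesses[:i]
        if block in earlier:
            # Re-access: the age equals the number of distinct blocks
            # accessed since the previous access to the same block.
            last = i - 1 - earlier[::-1].index(block)
            distinct = len(set(accesses[last + 1:i]))
            labels.append("hit" if distinct < assoc else "miss")
            continue
        distinct = len(set(earlier))
        op, k = delta.get(block, (None, None))
        if op == ">=" and k >= assoc:
            labels.append("miss")      # not cached at entry, not loaded since
        elif op == "<" and k + distinct <= assoc:
            labels.append("hit")       # still cached despite earlier accesses
        else:
            labels.append("unknown")
    return labels


def min_time(accesses, delta, assoc, miss_penalty=10):
    """Only necessary misses are charged; everything else counts as a hit."""
    return miss_penalty * classify(accesses, delta, assoc).count("miss")


def max_time(accesses, delta, assoc, miss_penalty=10):
    """Only necessary hits are free; everything else counts as a miss."""
    labels = classify(accesses, delta, assoc)
    return miss_penalty * (len(labels) - labels.count("hit"))


def dominates(t1, acc1, t2, acc2, delta, assoc):
    """Dominance test of line 71 in Figure 3.3."""
    return t1 + min_time(acc1, delta, assoc) >= t2 + max_time(acc2, delta, assoc)


weak = {}
strong = {"a": (">=", 2), "b": (">=", 2)}   # constraints added as in line 75
before = dominates(5, ["a", "b"], 10, ["c"], weak, 2)
after = dominates(5, ["a", "b"], 10, ["c"], strong, 2)
```

In this example the test fails under the unconstrained condition and succeeds once both accesses of the first witness are constrained to be misses, mirroring the strengthening performed in the first loop.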
3.3.4 Machine State Summary
In the WCET analysis presented in this chapter, the machine state summary
only captures the relation between the abstract cache components. For short,
we call it the cache summary in this chapter. The cache summary presented
in this section can be used as part of the machine summary for other resource
analyses presented in the subsequent chapters. The cache summary is one of the
contributions of the work presented in this chapter and, to the best of our knowledge,
it is the first time that the concept of summarizing the cache for program analysis
is presented.¹
Suppose the state at program point ℓ1 is s and a summarization of subtree(s, ℓ2)
is reused at another visit to ℓ1 with an incoming cache state c1. Then c2, the cache
state at ℓ2, can be generated by applying the cache abstract transformer to c1.
That is, the transformer over-approximates the memory accesses along the feasible
paths, which start from s and end at ℓ2.
Let us first review the abstract set-associative cache for must analysis. An
abstract set-associative cache c consists of N cache sets, where N = C/(BS ∗ A),
C is the cache capacity and BS is the block size. Specifically, c is [cs1, ..., csN] where
each csj is a cache set. In turn, each cache set is a sequence of cache lines, i.e., cs =
[l1, ..., lA]. We use cs(li) = mem to indicate the presence of a memory block
mem in a cache-set, where i describes the relative age of the memory block and
not the physical position in the cache hardware. The cache abstract transformer
∆c is partitioned into N independent abstract transformers of the respective cache-sets,
∆c ≡ [∆s0, ..., ∆sN−1]. In the process of applying a cache abstract transformer to a
cache state, each abstract transformer of a cache-set is applied to the corresponding
cache-set.
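As a small worked instance of N = C/(BS ∗ A), the sketch below computes the number of cache sets and the set a given address falls into. The modulo mapping from block number to set index is the common scheme and is an assumption here, as are the function names and the example parameters.

```python
def num_sets(capacity, block_size, assoc):
    """N = C / (BS * A); all sizes in bytes, assoc in lines per set."""
    sets, remainder = divmod(capacity, block_size * assoc)
    if remainder:
        raise ValueError("capacity must be a multiple of block_size * assoc")
    return sets


def set_index(address, block_size, n_sets):
    """Cache set of an address, assuming the usual modulo placement."""
    return (address // block_size) % n_sets


n = num_sets(capacity=1024, block_size=16, assoc=4)   # 1024 / (16 * 4)
index = set_index(0x1234, block_size=16, n_sets=n)
```

For a 1 KB, 4-way cache with 16-byte blocks this gives 16 sets, so the cache abstract transformer would be partitioned into 16 independent per-set transformers.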
Each cache-set abstract transformer ∆s is depicted by 〈M, n〉, where M is a
set of pairs 〈mem, i〉. Each pair indicates a memory block mem and its relative
age i in the cache. Moreover, n depicts the number of cache lines that the memory
blocks in M are loaded to. It is computed as the maximum i in the sequence M
plus 1. While computing the cache abstract transformer, only the memory blocks
with an age less than the associativity A are stored. The rest of the memory blocks
would naturally be pushed out of the cache and we do not need to maintain them
in the abstract transformer. Thus, it is always true that n ≤ A and the size of
the cache abstract transformer is linear w.r.t. the cache capacity.
¹ We would like to also mention that the cache summary in our work is different from the
concept of non-preemptive fixed-priority scheduling where execution times depend on history
from [A+12c]. The concept of cache summary presented in this section foresees the sequence of
memory accesses in a sub-tree based on a similar sub-tree, explored in the past, which can be
quite different from the idea of using history for task scheduling.
At the time of reuse, each ∆s ≡ 〈M, n〉 in the cache abstract transformer is
applied to its respective cache-set. First, the memory blocks in the cache-set are
aged n times. Next, for each pair 〈mem, i〉 in M, the memory block mem is loaded
to its cache-set with relative age i. Considering that the pairs in M maintain the
memory blocks accessed in subtree(s, ℓ2), the cache abstract transformer simulates
the updates and merges of the cache state along the paths in subtree(s, ℓ2).
Example 6. For example, consider the sequence of memory blocks 〈m1,m2,m3,m2〉.
Assume, the cache summary of this sequence of memory blocks for a fully associa-
tive cache would be 〈3, ([m2, 0], [m3, 1], [m1, 2])〉. Now, consider a fully associative
cache of size 4 which initially contains m0:
0 1 2 3
m0
By applying the memory block sequence 〈m1,m2,m3,m2〉 to the cache the final
cache state would be:
0 1 2 3
m2 m3 m1 m0
If we apply the cache summary 〈3, ([m2, 0], [m3, 1], [m1, 2])〉 to the initial cache
state, in the first step the items which are already loaded into the cache are aged
by n (the static number stored in the cache summary which is 3):
3.3 General Framework 65
0 1 2 3
m0
Next, each of the [m2, 0], [m3, 1], [m1, 2] tuples in the cache summary are loaded
into the cache with their relative ages:
0 1 2 3
m2 m3 m1 m0
As can be seen, the resulting cache state is the same as the cache state
obtained by loading the memory blocks one by one.
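The two steps of Example 6 can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary representation of a cache set and the names apply_summary and access are assumptions made here, not part of the thesis implementation. The summary is written as a pair (n, M), as in Example 6.

```python
def apply_summary(cache, summary, assoc):
    """Apply a cache summary (n, M) to one LRU cache set.

    cache   : dict mapping block -> age (0 = most recently used)
    summary : (n, M), with M a list of (block, relative_age) pairs
    assoc   : associativity A; a block whose age reaches A is evicted
    """
    n, loaded = summary
    # Step 1: age every block already in the set by n, dropping overflow.
    aged = {blk: age + n for blk, age in cache.items() if age + n < assoc}
    # Step 2: load each summarized block at its relative age.
    for blk, rel_age in loaded:
        if rel_age < assoc:
            aged[blk] = rel_age
    return aged


def access(cache, blk, assoc):
    """Concrete LRU update for one access, used only as a reference."""
    old_age = cache.get(blk, assoc)  # a miss behaves like age == assoc
    updated = {}
    for other, age in cache.items():
        if other == blk:
            continue
        # Only blocks younger than the accessed block get older.
        new_age = age + 1 if age < old_age else age
        if new_age < assoc:
            updated[other] = new_age
    updated[blk] = 0
    return updated


initial = {"m0": 0}
summary = (3, [("m2", 0), ("m3", 1), ("m1", 2)])
via_summary = apply_summary(initial, summary, assoc=4)

one_by_one = dict(initial)
for blk in ["m1", "m2", "m3", "m2"]:
    one_by_one = access(one_by_one, blk, assoc=4)

print(via_summary == one_by_one)
```

Under these assumptions both routes end with m2, m3, m1 and m0 at ages 0, 1, 2 and 3, which is the final cache state shown in the example.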
The size of the cache summary is of the order of O(Capacity), where Capacity
is the cache capacity. This is because, in the process of generating the cache
summary, only the memory blocks whose relative age is less than the cache
associativity are stored. The rest of the memory blocks would naturally be pushed
out of the cache. As a result, while our analysis remains sound, the size of the
cache summaries remains bounded by the cache capacity. Next, we will
describe how to generate a cache summary.
Combine-Summary and Merge-Summary Functions
Consider Combine-Summary in Figure 3.4. The cache abstract transformer is first
initialized to the cache abstract transformer of the prefix subtree (line 83). Next,
since the memory accesses in the suffix subtree are more recent along the feasible
paths, before adding each of them to the transformer (line 89), the previous memory
accesses in reverse(M) are aged by 1 (line 86) for as long as mem itself is not
visited. If mem is visited, it is removed from M (line 87), because the more recent
access 〈mem, 0〉 replaces it. Finally, the new value for n is calculated (line 91).
Note that storing the pairs
function Combine-Summary(∆m1, ∆m2)
     Let ∆m1 be 〈M1, n1〉 and let ∆m2 be 〈M2, n2〉
〈83〉 M := M1; n := n1
〈84〉 foreach 〈mem, k〉 ∈ M2 do
〈85〉     foreach 〈mem′, i〉 ∈ reverse(M) do
〈86〉         if mem′ ≢ mem then update 〈mem′, i〉 to 〈mem′, i + 1〉
             else
〈87〉             Remove 〈mem′, i〉 from M
〈88〉             break
         endfor
〈89〉     Add 〈mem, 0〉 to the beginning of M
     endfor
〈90〉 foreach 〈m, i〉 ∈ M do
〈91〉     if i > n then n := i + 1
〈92〉     if i ≥ A then Remove 〈m, i〉 from M; n := A
     endfor
〈93〉 return 〈M, n〉
end function

function Merge-Summary(∆m1, ∆m2)
     Let ∆m1 be 〈M1, n1〉 and let ∆m2 be 〈M2, n2〉
〈94〉 M := ∅
〈95〉 foreach 〈mem, i〉 ∈ M1 ∧ 〈mem, j〉 ∈ M2 do
〈96〉     M := M + 〈mem, max(i, j)〉
     endfor
〈97〉 return 〈M, max(n1, n2)〉
end function
Figure 3.4: Combining (Vertically) and Merging (horizontally) Machine State Summaries
with a relative age of A or more would be redundant. Such pairs
are removed from M in line 92 and the value of n is updated accordingly.
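The idea of combining a prefix summary with a more recent suffix summary can be illustrated with a coarser composition than the one in Figure 3.4. The sketch below is not a transcription of Combine-Summary: it ages the prefix entries by n2 in one step instead of one access at a time, and the names apply_summary and compose, as well as the (M, n) tuple layout, are assumptions for illustration. Its defining property is that applying the composed summary gives the same cache set as applying the two summaries in sequence.

```python
def apply_summary(cache, summary, assoc):
    """Age the set by n, then load each (block, age) pair of the summary."""
    loaded, n = summary
    result = {blk: age + n for blk, age in cache.items() if age + n < assoc}
    result.update({blk: age for blk, age in loaded if age < assoc})
    return result


def compose(prefix, suffix, assoc):
    """Summary whose effect equals applying prefix first, then suffix."""
    m1, n1 = prefix
    m2, n2 = suffix
    recent = {blk for blk, _ in m2}
    # Suffix accesses are the most recent ones and keep their ages.
    combined = [(blk, age) for blk, age in m2 if age < assoc]
    # Prefix accesses not touched again are aged by the suffix count n2;
    # entries whose age reaches the associativity are redundant.
    for blk, age in m1:
        if blk not in recent and age + n2 < assoc:
            combined.append((blk, age + n2))
    return (combined, min(n1 + n2, assoc))


prefix = ([("m2", 0), ("m1", 1)], 2)   # hypothetical summary of <m1, m2>
suffix = ([("m2", 0), ("m3", 1)], 2)   # hypothetical summary of <m3, m2>
start = {"m0": 0}

stepwise = apply_summary(apply_summary(start, prefix, 4), suffix, 4)
at_once = apply_summary(start, compose(prefix, suffix, 4), 4)
print(stepwise == at_once)
```

Because every prefix entry is aged by the full n2, this variant may be less precise than the per-access update of Figure 3.4, but it keeps the summary size bounded in the same way.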
Merge-Summary in Figure 3.4 joins two cache state summaries. Since the cache
states are updated based on the semantics of the abstract cache for must
analysis, the summaries are joined in the same way: the intersection of the memory
accesses in the respective summaries is preserved with their maximum age. The
memory access sequence M is initialized to ∅. Next, each memory access mem that
is in the memory access sequences of both the left and right subtrees, M1 and M2,
is added to M with the maximum of its ages in M1 and M2 (line 96). Finally, M is
returned together with max(n1, n2), the maximum number of cache lines that the
memory accesses will be loaded to (line 97).
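A direct reading of Merge-Summary can be sketched as follows. The tuple layout (M, n) follows Figure 3.4, while the Python representation and the name merge_summary are assumptions for illustration. The sample inputs are the set transformers of the two sub-paths below 〈7a〉 that are discussed in Section 3.4.

```python
def merge_summary(left, right):
    """Join two set summaries (M, n) for must analysis.

    Only blocks present in both summaries survive, each with its
    maximum age, and the larger n is kept.
    """
    m1, n1 = left
    m2, n2 = right
    ages_right = dict(m2)
    merged = [(blk, max(age, ages_right[blk]))
              for blk, age in m1 if blk in ages_right]
    return (merged, max(n1, n2))


# First cache set: both sub-paths load m3 at relative age 0.
set0 = merge_summary(([("m3", 0)], 1), ([("m3", 0)], 1))
# Third cache set: only one sub-path loads m4.
set2 = merge_summary(([], 0), ([("m4", 0)], 1))
print(set0, set2)
```

The third set shows the conservative behaviour of the join: m4 is dropped because it is not accessed on both sub-paths, yet n stays 1 so that existing contents are still aged.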
Extension to other Cache Policies
As stated in the beginning of this section, our framework performs instruction
and data cache analysis with LRU cache policy. In order to extend the cache
analysis to support other cache policies, a sound cache summarization should be
defined. Our inspection shows that if the memory blocks in a cache set can be
ordered (by a ranking), a sound cache summarization can be defined for them.
As a result, our framework can be extended to other cache policies such as MRU
or FIFO where memory blocks are ranked based on the order they entered the
cache. However, our framework does not seem to be extendable to the PLRU cache
policy. Finally, we would like to note that our framework can support cache
hierarchies while remaining scalable. To model a cache hierarchy, each cache level
is added as an abstract cache to the machine state, and the update function is
defined such that all cache states are updated with respect to the hierarchy at
each transition. As a result, our framework does not need to change substantially
when changing to other cache policies.
3.4 Example Analysis
Consider the CFG and the symbolic execution tree in Figure 3.5. Here we assume
a direct-mapped cache with 3 cache sets, initially empty, and a cache miss penalty
of 10 cycles. Consider accesses to memory blocks m1,m2,m3, and m4, where only
m1 and m3 conflict with each other in the first cache set. Note that in Figure 3.5(b),
we have not (fully) drawn the subtree below node 〈4b〉.
Figure 3.5: (a) a CFG (with memory accesses and static instruction timing shown in
each block); and (b) Our Analysis Tree
Suppose the subtree 〈7a〉 has been analyzed, and its summarization is [〈7〉,Ψ,Γ,
w, δ,∆p,∆c]. We now explain the components of this summarization. The inter-
polant Ψ is easily determined as true because all (two) paths of this subtree are
feasible. Next, because the incoming cache state contains only m1, the timing of
the sub-path 〈7a〉, 〈8a〉, 〈10a〉 is 40 = (10 + 5 + 10 + 15), with both accesses
as cache misses. Similarly, the timing of the other sub-path 〈7a〉, 〈9a〉, 〈10b〉 is
45 = (10 + 5 + 10 + 10 + 10). So, the sub-path 〈7a〉, 〈9a〉, 〈10b〉 is longer than the
other and it is chosen as the worst-case path2 of subtree 〈7a〉. Consequently, the
witness Γ is computed as 〈15, [m2,m3,m4], z ≥ 0〉, where 15 is the static timing of
the witness path, [m2,m3,m4] are the memory accesses along the path, and z ≥ 0
is the (partial) path constraints of the path. Moreover, the WCET of the subtree
(w) is 45. Next, we capture a dominating condition (δ) as age(m4) ≥ 1 (The cache
associativity of a direct-mapped cache is 1). This condition is sufficient to ensure
2When it is clear, we often use “path” to mean “sub-path”.
that the chosen path dominates (i.e., is longer than) any other path in the subtree.
The abstract transformer ∆p is the trivial one where the output is the same as
the input. This is because in this example we abstract away all the instructions
executed by the basic blocks. The memory blocks m2 and m3 are accessed along
the sub-path 〈7a〉, 〈8a〉, 〈10a〉. The abstract transformer of the first cache set
is ∆s0 ≡ 〈[〈m3, 0〉], 1〉, where 〈m3, 0〉 indicates the relative age of m3 in the first
cache set of the cache state at 〈10〉 and 1 shows the number of cache lines that
the memory blocks are loaded to. Similarly, the relative age of m2 in the second
cache set is 0 and the transformer of the second cache set is ∆s1 ≡ 〈[〈m2, 0〉], 1〉.
The abstract transformer of the third cache set ∆s2 is empty 〈[ ], 0〉. In a similar
manner, the set abstract transformers for the sub-path 〈7a〉, 〈9a〉, 〈10b〉 are ∆s0 ≡
〈[〈m3, 0〉], 1〉, ∆s1 ≡ 〈[〈m2, 0〉], 1〉 and ∆s2 ≡ 〈[〈m4, 0〉], 1〉. The respective set
abstract transformers are joined at 〈7a〉. The joined abstract transformer would
maintain the common memory block accesses from both abstract transformers and
the maximum of the number of cache lines where memory blocks are loaded to.
The memory accesses m2 and m3 are the common accesses on both sub-paths, so
∆c ≡ [∆s0,∆s1,∆s2] ≡ [〈[〈m3, 0〉], 1〉, 〈[〈m2, 0〉], 1〉, 〈[ ], 1〉].
In short, after analyzing 〈7a〉, we have also computed a summarization [7, true,
〈15, [m2,m3,m4], z ≥ 0〉, 45, age(m4) ≥ 1, Id(Vars),∆c]. For brevity, in what
follows, we do not detail the abstract transformers ∆p and ∆c.
Next we propagate the analysis of 〈7a〉 to its parent 〈5a〉 whose summarization
is now updated so that the witness is stored in the form 〈20, [m1,m2,m3,m4], z ≥
0〉, where 20 is computed as the sum of: (1) the static timing of block 〈5〉, which
is 5; (2) the static timing of the witness for 〈7a〉, which is 15. The dominating
condition is age(m4) ≥ 1, as before.
We fast forward to node 〈7b〉, and consider now if the above analysis of 〈7a〉 can
be reused. That is, even though we have depicted the subtree 〈7b〉 in full, could we
in fact have simply declared that the witness in the subtree below 〈7b〉 would remain
the same as the witness in subtree below 〈7a〉? (Recall that the witness in the
subtree below 〈7a〉 spans along the program points 〈7〉, 〈9〉, 〈10〉.) Unfortunately,
the answer is negative, and the reason is that the dominating condition, age(m4) ≥
1, is not met because m4 is in the cache at 〈7b〉. This non-reuse is depicted by a
red cross. We thus have to analyze 〈7b〉 fully. We get a different longest sub-path
this time, 〈7b〉, 〈8b〉, 〈10c〉, with the witness 〈20, [m2,m3], z < 0〉. The dominating
condition is also different: δ : age(m4) < 1.
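The timings of the two sub-paths below node 〈7〉 can be reproduced with a small direct-mapped cache model. This is a sketch under stated assumptions: the mapping of m1 and m3 to the first set, m2 to the second and m4 to the third is read off the set transformers above, the static timings 20 and 15 are taken from the two witnesses, and the function and variable names are illustrative only.

```python
MISS_PENALTY = 10
# Assumed set mapping: only m1 and m3 share a cache set.
CACHE_SET = {"m1": 0, "m3": 0, "m2": 1, "m4": 2}


def path_time(static_time, accesses, cached):
    """Static timing plus one miss penalty per access that misses.

    cached is the set of blocks present on entry (one block per set,
    since the cache is direct-mapped).
    """
    lines = {CACHE_SET[blk]: blk for blk in cached}
    total = static_time
    for blk in accesses:
        idx = CACHE_SET[blk]
        if lines.get(idx) != blk:
            total += MISS_PENALTY
            lines[idx] = blk
    return total


VIA_8 = (20, ["m2", "m3"])        # sub-path through block 8
VIA_9 = (15, ["m2", "m3", "m4"])  # sub-path through block 9

timings = {}
for node, context in [("7a", {"m1"}), ("7b", {"m1", "m4"})]:
    timings[node] = (path_time(*VIA_8, context), path_time(*VIA_9, context))
    print(node, timings[node])
```

With m4 absent at 〈7a〉 the sub-path through 〈9〉 is the longer one, while with m4 present at 〈7b〉 the sub-path through 〈8〉 takes over, which is why the witness of 〈7a〉 cannot be reused at 〈7b〉.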
Finally, this analysis of 〈7b〉 is propagated to its parent 〈6a〉. The dominating
condition is age(m4) < 1, which always holds due to the access of m4 at 〈6〉. Thus
the dominating condition for 〈6a〉 is simply true.
Having now analyzed both 〈5a〉 and 〈6a〉, we can now compute an analysis
for their common parent 〈4a〉. Here the observed longest sub-path is 〈4a〉, 〈6a〉,
〈7b〉, 〈8b〉, 〈10c〉, and the witness is stored as 〈41, [m4,m2,m3], y ≥ 0 ∧ z < 0〉.
The dominating condition is conjoined from: (a) the dominating condition of its
left child 〈5a〉; (b) the dominating condition of its right child 〈6a〉; and (c) the
reason for the dominance of the above observed longest path over the other path.
In particular, δ is age(m4) ≥ 1 ∧ true ∧ age(m1) < 1.
Now we can exemplify reuse on the subtree 〈4b〉. We first check if the context
of 〈4b〉 implies the interpolant computed for 〈4a〉. Because all paths from 〈4a〉
are feasible, the interpolant is true, thus, it trivially holds. We then check if the
dominating condition holds. Examining the cache context of 〈4b〉, indeed m1 is
in the cache and m4 is not in the cache. Furthermore, the witness is still feasible
w.r.t. the incoming context (x ≥ 0). So we can reuse the witness of 〈4a〉, yielding
the timing of 61. We remark here that the timing of the sub-path 〈4b〉, 〈6b〉, 〈7c〉,
〈8c〉, 〈10e〉 is less than the timing of 〈4a〉,〈6a〉, 〈7b〉, 〈8b〉, 〈10c〉 because now m2
is present in the cache at 〈4b〉.
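The reuse step at 〈4b〉 can be sketched as a check of the dominating condition followed by a re-timing of the stored witness in the new cache context. The code below is a simplified illustration: the set mapping of the blocks, the cache context {m1} assumed for 〈4a〉, and all function names are assumptions, and the interpolant and path-constraint checks are omitted because both hold trivially in this example.

```python
MISS_PENALTY = 10
# Assumed set mapping: only m1 and m3 share a cache set.
CACHE_SET = {"m1": 0, "m3": 0, "m2": 1, "m4": 2}
WITNESS_4A = (41, ["m4", "m2", "m3"])  # static timing and accesses


def witness_time(static_time, accesses, cached):
    """Re-time a witness path for a direct-mapped cache context."""
    lines = {CACHE_SET[blk]: blk for blk in cached}
    total = static_time
    for blk in accesses:
        idx = CACHE_SET[blk]
        if lines.get(idx) != blk:
            total += MISS_PENALTY
            lines[idx] = blk
    return total


def dominating_condition(cached):
    """age(m4) >= 1 and age(m1) < 1 for associativity 1."""
    return "m4" not in cached and "m1" in cached


def try_reuse(cached):
    """Return the re-timed witness, or None when reuse is not allowed."""
    if not dominating_condition(cached):
        return None
    return witness_time(*WITNESS_4A, cached)


print(try_reuse({"m1", "m2"}))  # context of node 4b
print(try_reuse({"m1"}))        # assumed context of node 4a
print(try_reuse({"m1", "m4"}))  # dominating condition violated
```

In the context of 〈4b〉 only m4 and m3 miss, giving 41 + 20 = 61, whereas in the assumed context of 〈4a〉 all three accesses miss.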
Finally, we easily arrive at the WCET of the entire tree, and thus of the entire
example program: 106 cycles (= 10 + 10 + 10 + 15 + 61, since the accesses
to m1 and m2 at 〈3a〉 are cache misses).
Let us reconsider the same example using a pure abstract interpretation (AI)
framework such as [TFW00]. A pure AI method would typically perform merging
at the three join points: 〈4〉, 〈7〉, 〈10〉. Importantly, it discovers that at 〈4〉, m1
must be in the cache. Thus, the access to m1 at 〈5〉 is a hit. However, at 〈7〉, AI has
to conservatively declare that m4 is not in the cache. As a result, the access to m4
at 〈9〉 will be a cache miss. Consequently, the final worst-case timings for the basic
blocks that have some memory accesses are: (〈2〉,20), (〈3〉,35), (〈5〉,5), (〈6〉,21),
(〈7〉,15), (〈8〉,25), (〈9〉,30).
If we aggregate using a path-insensitive high-level analysis (such as [CJ11]),
the WCET estimate is 121 (= 10 + max(20, 35) + 10 + max(5, 21) + 15 +
max(25, 30)). We cannot improve the estimate for this example, since the timing
of the witness w.r.t. the new context remains the same. However, as our experiments
in the next section show, in general, the timing generated by our framework is
more accurate.
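The path-insensitive aggregation above is a plain sum over the join structure of the CFG, taking the maximum at every branch. The following sketch only re-evaluates the formula from the text; the two constants of 10 are copied from that formula without being attributed to particular blocks, and the variable names are illustrative.

```python
# Worst-case block timings obtained from the pure AI analysis.
block_time = {2: 20, 3: 35, 5: 5, 6: 21, 7: 15, 8: 25, 9: 30}

path_insensitive = (
    10                                     # constant from the formula
    + max(block_time[2], block_time[3])    # branch between blocks 2 and 3
    + 10                                   # constant from the formula
    + max(block_time[5], block_time[6])    # branch between blocks 5 and 6
    + block_time[7]
    + max(block_time[8], block_time[9])    # branch between blocks 8 and 9
)

integrated = 10 + 10 + 10 + 15 + 61        # WCET derived with reuse above

print(path_insensitive, integrated)
```

Each max is taken independently of the others, so the sum may combine branch choices and cache states that never occur together on one feasible path.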
3.5 Experimental Evaluation
The data and instruction cache settings in our experiments are borrowed from
[wtc14] for the ARM9 target processor. Our instruction and data caches are
separate. A cache state c contains two separate abstract caches 〈ci, cd〉, where ci
is a 4KB abstract instruction cache and cd is a 4KB abstract data cache. Each is a
write-through, no-write-allocate, 4-way set-associative L1 cache with LRU
replacement policy. The cache miss and cache hit latencies are 10 and 0 cycles,
respectively.
Because we fully unroll loops in our analysis, it is sufficient to employ a must
analysis for precisely tracking the data cache, as opposed to a persistence analysis.
We follow the treatment as in [FW98] for loading memory ranges into the cache
for persistence analysis3 when a data access cannot be resolved to a single memory
address, meaning that the blocks in the memory address range are not loaded into
the cache, but the blocks already in the cache are relocated as if all the blocks in
the memory address range were loaded into the cache.
3.5.1 Results
We used an Intel Core i5 @ 3.2 GHz processor with 4 GB of RAM for our
experiments and built our system upon CLP(R) [JMSY92] and Z3 as the constraint
solver, thus providing an accurate test for feasibility. The analysis was performed
on LLVM IR, which is expressive enough while still allowing a program's transition
system to be constructed easily. The LLVM instructions are simulated for a RISC
architecture. We use Clang 3.2 [Cla14] to generate the IR.
3 Huynh et al. [H+11] have fixed a safety issue with the treatment of loading memory ranges
into the cache from [FW98]. However, this safety issue occurs in the semantics of abstract cache
for persistence analysis and does not affect the semantics of abstract cache for must analysis,
which is used by our method.
Table 3.1 presents the results of the comparison of our method with the two
state-of-the-art algorithms explained in Section 3.2.5:
• AI+SAT⊕ILP implements the algorithm in [BCR13]. It comprises the state-of-
the-art micro-architectural modeling (AI+SAT) combined with an ILP formulation
for WCET aggregation.
• AI+SAT⊕Unroll s implements a hypothetical algorithm, to benefit from com-
bining the AI+SAT low-level analysis and the high-level analysis in [CJ11]. This
combined algorithm generates static timing for each basic block before aggregating
results via a path analysis phase.
• Unroll d is the algorithm presented in this chapter. This further improves on the
already quite accurate hypothetical algorithm above because we now accommodate
dynamic timing. As explained in the earlier sections, this entails more cost. Our
results below show that this cost is bearable.
In Table 3.1, the columns T(s) and State denote the running time and number
of states (in symbolic execution) respectively. The symbol ∞ denotes
out-of-memory. The WCET precision improvement is computed as
(B − U)/B × 100%, where U is the WCET obtained using our analysis algorithm,
and B is the WCET obtained using the baseline approach. In order to highlight the
importance of reuse,
we tabulate separate results for the cases where it is employed or not. The last
two columns, separated by a vertical double line, summarize the improvement of
Unroll d over the other two analyses.
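For reference, the improvement metric used in the tables is a relative reduction of the baseline estimate. The helper below is a minimal sketch; the function name is an assumption, and the sample values are the two estimates derived for the example of Section 3.4 (121 and 106 cycles), not entries of Table 3.1.

```python
def precision_improvement(baseline_wcet, our_wcet):
    """Relative WCET reduction (B - U) / B, expressed in percent."""
    return (baseline_wcet - our_wcet) / baseline_wcet * 100.0


example = precision_improvement(121, 106)
print(f"{example:.1f}%")
```

A value of 0% means both analyses report the same bound, and larger values mean a tighter bound from our analysis.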
We have divided our benchmark programs, which are quite standard in evaluat-
ing WCET analysis algorithms, into three groups, separated by horizontal double
lines:
Benchmarks with lots of Infeasible Paths: The first group contains statemate
and nsichneu from the Mälardalen benchmarks [M0¨6] and tcas, a real-life
implementation of a safety-critical embedded system. tcas is a loop-free program
with many infeasible paths, which is used to illustrate the performance of our
method in analyzing loop-free programs. On the other hand, nsichneu and statemate
are programs which contain loops with large bodies, also with many infeasible paths.
These benchmarks are often used to evaluate the scalability of WCET analysis
algorithms [W+08].
Standard Timing Analysis Benchmarks with Infeasible Paths: This group
contains standard programs from [M0¨6], and fly-by-wire from [N+06].
Benchmarks with Simple Loops: This group contains a set of academic pro-
grams from [M0¨6]. Though the loops in these programs are simple for high-level
analysis, they contain memory accesses that a fixed-point computation might re-
solve to a range of memory addresses, leading to imprecise low-level WCET anal-
ysis.
3.5.2 Discussion on Precision
For the first group of benchmarks, the WCET generated by Unroll d is improved
on average by 34% compared to AI+SAT⊕ILP and by 17% compared to AI+SAT⊕
Unroll s. Focusing on nsichneu and statemate, it can be
seen that part of the improvement of Unroll d over AI+SAT⊕ILP comes from the
detection of infeasible paths (i.e., the common improvement between Unroll d and
AI+SAT⊕Unroll s over AI+SAT⊕ILP). The improvement of Unroll d over AI+SAT⊕
Unroll s, on the other hand, is due to infeasible path detection directly reflected
in the tracking of micro-architectural states. This avoids lossy merging of cache
states at the join points in the CFG.
For a loop-free program like tcas, the improvement of Unroll d over the
other two analyses is clearly not advantaged by tighter loop bounds in unrolling,
nor disadvantaged by fixed-point computation in AI+SAT⊕ILP. Next, consider the
fact that the (high-level) infeasible paths detected by Unroll d and AI+SAT⊕
Unroll s are the same. Even so, Unroll d is more accurate by 8%. Once again,
this improvement comes from our integration of low-level analysis with high-level
analysis, making infeasible path detection reflected in the precise tracking of micro-
architectural states.
For benchmarks in the second group, Unroll d produces significantly more ac-
curate WCET than AI+SAT⊕ILP, on average 42%, peaking at 94%. In compress,
ud and ndes, many infeasible paths have to do with loops, and being able to
detect them improves the WCET estimates dramatically. AI+SAT⊕Unroll s per-
forms relatively well on this group of benchmarks. However, for ud, ndes and
fly-by-wire, the accuracy improvement of Unroll d over AI+SAT⊕Unroll s is
still noticeable. Further investigation reveals that two of these benchmarks
contain memory accesses which are resolved to address ranges in the AI component,
which ultimately is still a fixed-point computation, leading to imprecise analysis
results from the combined algorithm.
The effect of such memory accesses on analysis precision can be seen more
clearly by examining the third benchmark group. Unroll d is still better than
the other two algorithms by 18% on average. These benchmarks do not contain
many infeasible paths nor complicated loops and that is the reason why AI+SAT⊕
Unroll s does not produce better estimates than AI+SAT⊕ILP. However, these
benchmarks contain memory accesses which are resolved to address ranges in a
fixed-point computation, leading to the imprecision of AI+SAT⊕ILP. In contrast,
Unroll d performs loop unrolling, thus it can precisely resolve the addresses of
the accesses, leading to superior precision.
In summary, in terms of precision, Unroll d outperforms the other two al-
gorithms in all benchmarks. The WCET estimations from Unroll d have im-
proved 33% on average compared to AI+SAT⊕ILP and 14% on average compared
to AI+SAT⊕Unroll s. These improvements clearly uphold our proposal that per-
forming WCET analysis in one integrated phase in the presence of dynamic timing
will enhance the precision over modular approaches. However, the scalability of
our method is not yet discussed.
3.5.3 Discussion on Scalability
As expected, reuse is important for scalability. For most of the benchmarks (8
out of 12) the analysis cannot finish without reuse. Among the benchmarks in the
first group, which contain many infeasible paths (tcas, nsichneu and statemate),
none can be analyzed without reuse. The two largest benchmarks, nsichneu and
statemate, are used as an indicator of the scalability of WCET tools. The WCET
analyses for nsichneu and statemate use at most 53% and 40% of the 4GB available,
respectively. It is worth noting that, for nsichneu, the overhead in analysis time
and memory usage compared to AI+SAT⊕Unroll s is 31% and 40%, respectively,
while the precision is improved by 27%.
In conclusion, our analysis framework relies heavily on reuse for scalability. From
these experiments we can infer that only small programs, where the number
of paths is limited, can be analyzed without reuse. In the next section, we will
elaborate on the point that similar to the symbolic simulation algorithm in [CJ11],
our integrated analysis remains super-linear with regard to both time and space
complexity.
3.5.4 Analysis of Benchmarks with Complicated Loops
We have performed a set of experiments on benchmarks containing nested or
complicated loops from [M0¨6]. These benchmarks are analyzed not only to show
the ability of our analyzer to generate tight WCET estimates, but also to show
that the super-linearity of the symbolic execution WCET analyzer is maintained in
the integrated WCET analysis.
Table 3.2 presents the results of the comparison of our symbolic execution
analysis, Unroll d, with AI+SAT⊕ILP and AI+SAT⊕Unroll s for these benchmarks.
In this table, the WCET estimated by Unroll d compared to AI+SAT⊕ILP is
improved on average by 50% and peaks at 97% for expint and insertsort. Of the
benchmarks in this group, bubblesort, expint, fft1, fir, insertsort, janne_complex
and ud contain complicated loops whose number of iterations depends on the
context reaching the loop. In order to have a fair comparison between the
three analyses, we provided the ILP formulation with the tightest possible loop
bounds. Our aim was to eliminate the effect of the loop bounds for complicated
loops on the estimated WCET. We would like to elaborate on the results of the
analyses for fir and bubblesort. These two benchmarks, while containing
complicated loops, show the least improvement. It is noteworthy that the generated
LLVM IR for insertsort, which shows the best improvement in the generated WCET,
is slightly different from the CFG of the C code. This has resulted in a much more
costly path with less flow receiving the maximum flow in the ILP formulation. The
AI+SAT⊕Unroll s analysis is able to tackle this issue in the high-level analysis,
but the issue still remains in the low-level analysis due to the fixed-point
computation. However, Unroll d is able to generate a more precise WCET estimation.
The estimated WCET for all of these benchmarks shows a significant improvement.
Part of the improvement comes from the ability to find the exact number
of loop iterations for complicated loops. This improvement is shared between
our integrated analysis, Unroll d, and AI+SAT⊕Unroll s. However, comparing
Unroll d and AI+SAT⊕Unroll s, the estimated WCET is improved on average
by 12% and peaks at 31% for insertsort. The extra improvement comes from the
integrated nature of the WCET analysis.
Finally, we would like to note one other contribution of our method, which is
performing loop unrolling in super-linear time. Our analysis shares this
contribution with AI+SAT⊕Unroll s. It is reached through depth-wise loop
compression, i.e., reusing across the loop iterations and generating and using
compound summarizations where many loop iterations can be summarized at once.
This summarization can be used later to avoid exploring large portions of loops.
By comparing the size parameter of the benchmarks in Table 3.2 (second column)
with the analysis time, it can be clearly seen that the super-linear complexity of
the brute-force loop unrolling remains in our integrated analysis. For example, in
ns the complexity is O(n^4) while the analysis is performed in a super-linear
complexity.
3.5.5 Comparison of Analysis Results with Simulations
In this section, we address the hypothesis that our analysis tool generates
realistic WCET bounds. In order to test this hypothesis, we have simulated the
benchmarks from Table 3.1 on the cycle-accurate GEM5 simulator [B+11] and
compared the results.
GEM5 is an emerging simulator which is able to evaluate different architectural
designs. Our simulations were performed on TimingSimpleCPU, which supports
timing memory accesses, with a 4kB L1 instruction cache and a 4kB L1 data cache,
the ARM instruction set and the simple memory format, which was the closest to
the hardware settings of our experiments. The static binaries passed to GEM5 were
cross-compiled with arm-linux-gnueabihf-gcc.
Figure 3.6 presents the comparison of the generated WCET and the results of
the simulation of the benchmarks. As can be seen, the simulations and the
generated WCET in most cases follow the same trend, which supports our hypothesis.
However, for nsichneu, ndes and edn there is a difference between the simulation
result and the generated WCET. Our further investigation suggests that this
difference can come from the different software characteristics of these benchmarks.
Figure 3.6: Comparison of the generated WCET with GEM5 simulations for benchmarks
in Table 3.1
The research in [Als14] examines the relation between executions of a program
and its software characteristics, such as having single paths, containing loops or
nested loops, array manipulation, bit operation, etc. It uses regression algorithms
to find a relation between GEM5 simulations for Ma¨lardalen benchmarks [M0¨6]
and their software characteristics. In that research, the benchmarks are classified
into different groups based on the program characteristics presented in [M0¨6], and
for each classification group a relationship between the simulated clock cycles and
the software characteristics is derived. From our simulated benchmarks, ndes
individually falls into one classification group. This can be the reason for the slight
difference that we see between the simulation result and the generated WCET.
However, nsichneu falls into a classification group with some other benchmarks.
It has been stated that the WCET for nsichneu is unknown [KKZ13] and any
simulation in general can be far from the WCET. This can be the reason of the
slight difference seen between the simulation results and the generated WCET for
nsichneu.
Figure 3.7: Comparison of the generated WCET with GEM5 simulations for Different
Benchmarks Groups
We have followed the same classifications presented in [Als14] and compared the
generated WCET and the simulation results. The results of this comparison for the
benchmark groups are presented in Figure 3.7. From the simulated benchmarks, tcas
and ndes each fall into a classification group of their own and cannot be compared
to other benchmarks. The LNA classification group, which covers benchmarks with
nested loops and arrays/matrices, contains cnt, compress and fdct. We have compared
these benchmarks with other benchmarks in Figure 3.7 (a). In Figure 3.7 (b) and
(e), fft1 and insertsort benchmarks are compared with each other for different sizes.
On the other hand, the benchmarks in group L containing non-nested loops
are nsichneu, statemate and adpcm. Moreover, fly-by-wire from Papabench [N+06]
has similar program characteristics and is added to this group. We have compared
statemate, adpcm and fly-by-wire with each other in Figure 3.7 (c). We have excluded
nsichneu due to the reason explained above. Moreover, edn, jfdctint and matmult are
compared in Figure 3.7 (d). In general, these comparisons show that in the majority
of cases the generated WCET and the simulation results follow a similar trend.
The slight differences in the trends may be because the WCET
analysis is performed on LLVM IR, while the simulations are performed on the binary.
3.6 Summary
In this chapter, we have customized our framework for the WCET analysis of pro-
grams with consideration of a cache micro-architecture. At its core, the framework
is a symbolic execution, which preserves the program’s operational semantics in de-
tail, down to the cache. The only abstraction performed is that the analysis of one
loop iteration is summarized; importantly, the analysis proceeds precisely across
loop iterations. The key challenge, scalability, was obtained by using a custom
notion of reuse. In realistic benchmarks, it was shown that the extreme attempt
at precision in fact pays off because there was a significant increase in precision,
and this was obtained in a reasonable time.
Chapter 4
Symbolic Memory High-watermark Analysis with Symbolic Bounds
4.1 Introduction
Traditionally, in safety-critical embedded systems, it was recommended not to use
dynamic memory allocation for two main reasons: (a) the allocation instructions
might take longer than expected, resulting in the violation of temporal
constraints in hard real-time systems; and (b) the memory fragmentation issue.
As a result, the stack was the only memory that grew dynamically during
execution. Worst-case stack usage was estimated by methods such as the one proposed
in [KF14]; the estimate is compared with the available memory to ensure that stack
overflow errors do not occur.
In the past few years, there have been advances in both hardware and soft-
ware of embedded systems. The drop in the hardware cost, the development of
customized operating systems for embedded systems, and finally the advent of con-
stant time memory allocation algorithms with a reasonable handling of memory
fragmentation [MRC03, M+08] are among these advances. Besides these, as
embedded systems become more complex, the need to use third-party code, which
might require dynamic memory allocation, becomes harder to avoid. As a result,
dynamic memory allocation is now used more frequently in embedded software
[A+15].
The use of dynamic memory allocation in embedded systems has raised more
concerns about ensuring the reliability of embedded systems used for safety-critical
tasks. Thus, methods have been developed to avoid heap and stack overflows in
safety-critical systems [RRW05]. The worst-case memory consumption of a program can
be estimated to ensure the reliability of such safety-critical systems against heap
and stack overflow errors. Besides that, the estimate of the worst-case memory
consumption is useful in the design process of embedded systems to reduce
hardware cost [T+13]. Furthermore, this estimate can be presented to a programmer
or a customer who is interested in knowing the memory footprint of an
embedded system.
Memory is a non-cumulative resource: unlike time and energy, where the maximum
execution time or consumed energy of a path in a program is reached at the end of
the path, the maximum memory used along a path can be reached at any point
inside the path (e.g., right in the middle of the path). Thus, many approaches
developed for worst-case analysis of cumulative resources, such as WCET analysis,
become inapplicable. More specifically, these methods often abstract away the
order of the acquires/releases, which is crucial for precise analysis of
non-cumulative resources.
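The difference between a cumulative and a non-cumulative resource can be made concrete with a short sketch. The event encoding (positive numbers for allocations, negative numbers for deallocations, in bytes) and the function name are assumptions for illustration; the two sample paths are hypothetical and contain the same operations in a different order.

```python
def high_watermark(events):
    """Return (peak, final) memory usage along one path.

    events is a sequence of signed sizes: positive for an allocation,
    negative for a deallocation.
    """
    current = 0
    peak = 0
    for delta in events:
        current += delta
        peak = max(peak, current)
    return peak, current


# Same operations, different order.
path_a = [+64, +32, -64, +16, -32]
path_b = [+64, -64, +32, -32, +16]

print(high_watermark(path_a))
print(high_watermark(path_b))
```

Both paths end with the same net usage, yet their peaks differ, so an analysis that only sums the acquires and releases, or ignores their order, cannot recover the high-watermark.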
There has been a large body of work for automatically deriving symbolic upper
bounds of memory consumption. Such analyses can provide a bound even when
4.1 Introduction 87
the program loops or recursions are not statically bounded. A bound generated
by these methods is parametric in two types of program inputs: (1) the inputs
that determine the maximum depths of the loops and recursions; and (2) the other
program inputs. It is worth noting that the generated bound is often a non-linear
formula over the first type of inputs.
Resource analysis of imperative programs with non-regular loop patterns and
signed integers, to model both memory allocation and deallocation, has long been
an open problem. By “non-regular”, we mean that the loop does not behave
uniformly across different iterations. We now mention some notable related work.
COSTA [A+07, A+12a], formulating the problem using the framework of cost
relations, can infer parametric upper-bounds on the memory consumption of Java
programs with region-based garbage collection. It was then extended [FMH14] to
generate more refined cost relations. On the other hand, [CHS15] performs amortized
resource analysis for C programs. However, these methods are quite limited
in coping with non-linear formulas, in the sense that either the generated bounds
are too imprecise or manual user interaction is required. They are further
challenged by programs whose termination can only be decided if
path-sensitivity is carefully taken into account.
In this chapter, we present a memory high-watermark analysis, based upon the
analysis framework presented in Chapter 2. Our analysis unrolls loops, precisely
computing and summarizing the memory consumption of each iteration. The key
result of this loop unrolling is that non-regular loop patterns can be analyzed
efficiently, often in a linear number of steps.
The main contribution of the analysis framework presented in this chapter
is that the bound generated by our analysis is symbolic. This is mainly because
programs can allocate and/or deallocate a non-fixed amount of memory (e.g., via
some input variable that is not statically determined). Our bound, however, is not
parametric w.r.t. program inputs dictating the maximum depths of the program
loops. As a result, we do not need to deal with the challenging problem of inferring
closed-form expressions for the loops. This enables our method to have a higher
level of automation, while producing more accurate bounds.
In detail, given a program, our analysis starts by constructing the symbolic ex-
ecution tree, from which an estimate of the memory high-watermark can be easily
extracted. Being highly path-sensitive, our analysis disregards infeasible combi-
nations of allocations/deallocations from consideration, thus producing accurate
bounds. In the previous chapter, we introduced the concept of reuse with interpolation
and dominance to achieve scalable symbolic execution for integrated WCET
analysis. In this chapter, though we use the same method for scalability, we still
need to address two major challenges:
• First, there is the issue of a non-cumulative resource. This requires an interplay
between the net usage and the high-watermark usage of memory. As will be shown
later, to accommodate this, we need to store more than one witness and dominating
condition.
• Second, there is the issue of dealing with the symbolic cost of an instruction, as
opposed to the concrete cost assumed in much related work, as well as in Chapter 3.
The need to compare symbolic expressions leads us to use the standard max
function. Importantly, how it is used in tandem with the capturing of “dominance




Consider the C program from Figure 4.1. The program points are shown in
brackets, e.g., 〈1〉. The allocations (deallocations) in the program are annotated
with increment (decrement) statements (in red color). The resource variable r
captures the resource of interest: memory. Note that in this chapter both increment
and decrement operations are allowed on it.
〈1〉 int main(int j){
〈2〉 int n; r = 0;
〈3〉 if(j > 0){
〈4〉 n = 10;
〈5〉 char *matrix = malloc(n); r = r + n;
〈6〉 /* normal computation */
〈7〉 free(matrix); r = r - n;
}else{
〈8〉 n = 5;
〈9〉 char *matrix = malloc(n + 10); r = r + (n + 10);
〈10〉 /* normal computation */




Figure 4.1: An Annotated C program
As elaborated in Chapter 2, our analysis computes a sound and precise bound
over r (possibly symbolic here) in the end, across all feasible paths of the program.
Note that due to the non-cumulative nature of the resource of interest, the largest
value of the resource variable r can be at any place along a path (and not in general
at the end of a path).
For the program in Figure 4.1, the highest value of r in the then branch is 10
(at 〈5〉) and in the else branch is 15 (at 〈10〉). These two values are compared at
the parent node, namely 〈3〉. Because the highest value of r in the else branch
is larger, we say the else branch dominates the then branch, under the current
context. Then the path 〈3〉, 〈8〉, . . . , 〈11〉 is returned as the dominating path in
the program. As a result, 15 – the highest value of r in the dominating path –
is returned as a sound estimate for the worst-case memory consumption of the
program.
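The comparison of the two branches can be mimicked on concrete values. The following Python sketch is only an illustration, not the analysis itself: it replays the updates of r that are visible in Figure 4.1 along each branch and keeps the branch with the larger peak. The helper name `peak` and the encoding of a branch as a list of signed updates are assumptions made for this example.

```python
from typing import List


def peak(deltas: List[int]) -> int:
    """Largest value reached by r along a path, starting from r = 0."""
    r = 0
    highest = 0
    for d in deltas:
        r += d
        highest = max(highest, r)
    return highest


# Updates to r visible in Figure 4.1
# (n = 10 in the then branch, n = 5 in the else branch).
then_branch = [+10, -10]    # malloc(n); free(matrix)
else_branch = [+(5 + 10)]   # malloc(n + 10)

peaks = {"then": peak(then_branch), "else": peak(else_branch)}
dominating = max(peaks, key=peaks.get)   # branch with the larger peak
```

Note that the then branch ends with a net usage of 0 although its peak is 10, which is exactly why the peak, and not the final value of r, has to be tracked.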
Before stepping into a full analysis example, let us revisit the scalability issue.
As stated in the previous section, one contribution of this chapter is to adapt
the concept of “reuse with interpolation and dominance” to the setting where the







Figure 4.2: Reuse with Interpolation and Dominance
Figure 4.2 depicts a symbolic execution tree, where each triangle represents a
subtree. The symbolic states s0 and s1 act as the program contexts for the left
and right subtrees, respectively; and they are different visits to the same program
point. Our analysis starts with exploring the left subtree. After traversing the left
subtree, we obtain a summarization, comprised of four main components:
1) An interpolant Ψ0, which is a generalization of s0 that captures a condition
preserving the infeasible paths of the subtree. Infeasible paths are marked with a
red cross.
2) A witness path denoted by Γh, which gives rise to the worst-case memory con-
sumption of the subtree. This path is indicated in green color.
3) A second witness path, denoted by Γn, which captures the highest net memory
usage at the end of the subtree. This path is indicated in blue color. The use
of two witness paths is critical for safely combining summarizations (presented in
Section 4.3).
4) A dominating condition δ, a formula which sufficiently guarantees that the
dominating path remains optimal, i.e., the worst-case path in the subtree, when
encountering a new context.
Consider now that we are at s1. Suppose that all the paths that were infeasible
in s0 stay infeasible, i.e., s1 |= Ψ0, and the dominating condition applies, i.e.,
DOM(δ, s1). This allows us to “reuse” the previous summarization. We then need
to replay the witness path Γh under the context s1. This, importantly, can lead to
a new value for the path (now 19), which is different from the original value (14). This
is because the valuations of some symbolic expressions (or variables) are different
under the new context s1, as opposed to the old context s0. The replayed path may
also reach its peak at another point along the path (as shown in Figure 4.2).
Backtracking to the root, and assuming that i > 0, we have i + 19 > 4 + 14. Therefore
the right path in green is chosen as the overall dominating path. We can then
conclude the analysis of the whole tree with the symbolic value i + 19.
Next, consider the C program in Figure 4.3 where we will discuss the concepts
under the presence of a loop. Figure 4.4 depicts a symbolic execution tree of the
〈0〉 void main(int c){
〈1〉 assume(c ≥ 0); int N = 3;
r = 0;
〈2〉 int** matrix = malloc(5 * sizeof(int*));
r = r + 5 * 8;
〈3〉 char * b = (char*) malloc(c);
r = r + c;
〈4〉 for (int i = 0; i ≤ N; i++){
〈5〉 if ((i % 2) == 0){
〈6〉 matrix[i]=malloc(i*sizeof(int));
r = r + (i * 4);
}
}
〈7〉 free (b); r = r - c;
}
〈8〉
Figure 4.3: A Complicated Allocation Pattern
Figure 4.4: Analysis of the Complicated Allocation Pattern
corresponding program, where each triangle represents a loop iteration. For each
node, in addition to a program point, we also use a letter to distinguish multiple
visits to the same program point. An infeasible node is identified with a red bullet.
For instance, at 〈4e〉, the red bullet indicates that it is not possible to re-enter the
loop body. For readability, the program points 〈2〉 and 〈3〉 are not shown in the
tree.
We note that loops are exhaustively unrolled and contexts (of feasible paths)
are merged at the end of each loop iteration.
Starting the analysis, the value of r is successively increased by 40 and then by
the value of c (input argument) from program points 〈1〉 to 〈4〉. In the first loop
iteration, the then branch is feasible and the value of r is increased by the value
of i ∗ 4, which is 0. Note that the else branch (colored in red in the symbolic
execution tree) is infeasible because i = 0 ∧ i % 2 ≠ 0 is equivalent to false.
At 〈4b〉, the analysis moves to the second iteration where the then branch
is infeasible, because (i = 1 ∧ i % 2 == 0) is equivalent to false. There are
no allocation/deallocations in the else branch, thus the value of r is unchanged.
Similarly, in the third and fourth iterations the value of r is increased by 8 (2 ∗ 4)
and 0, respectively. Finally, only at 〈4e〉, exiting the loop is possible, while re-
entering is not. We continue with the node 〈7a〉, where r is then decreased by
c. Because c is non-negative, the maximum value of r is reached at 〈7a〉 which is
48 + c.
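The value 48 + c can be cross-checked by replaying the annotated updates of r for concrete inputs. The sketch below is a plain Python simulation of Figure 4.3 for one value of c at a time; it is not the symbolic analysis, and the function name `simulate_figure_4_3` is introduced only for this illustration.

```python
def simulate_figure_4_3(c: int):
    """Replay the updates of r in Figure 4.3 for one concrete input c >= 0.

    Returns (high-watermark of r, value of r at the end of the program).
    """
    assert c >= 0                 # mirrors assume(c >= 0)
    N = 3
    r = 0
    trace = []                    # value of r after each update

    r += 5 * 8                    # <2>: five pointers of 8 bytes each
    trace.append(r)
    r += c                        # <3>
    trace.append(r)
    for i in range(N + 1):        # <4>: i = 0, 1, 2, 3
        if i % 2 == 0:            # <5>
            r += i * 4            # <6>
            trace.append(r)
    r -= c                        # <7>
    trace.append(r)
    return max(trace), r


high_watermark, final_r = simulate_figure_4_3(10)
```

For every non-negative c the simulation peaks just before the deallocation at 〈7〉, while the final value of r no longer depends on c.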
Now let us go a bit deeper into the technicality. After the traversal of the first
iteration, the maximum increase/decrease in the value of r in the iteration, which
is +(i ∗ 4), was stored in a summarization of the loop iteration as a witness Γh.
(The second witness path Γn coincides with Γh in this example.) Since there is
only one feasible path in the loop iteration, the dominating condition is true and
the interpolant stored in the summarization is i % 2 ≠ 0 which is enough to
capture the reason of infeasibility.
In a following loop iteration where the interpolant and the dominating condition
hold, for example at 〈4c〉, the summarization of the first iteration is then reused.
The analysis of the new iteration can be deduced to be +(8), without the need of
exploring all other paths in the loop iteration.
We also note that this summarization cannot be applied to the second iteration
at 〈4b〉. This is because the interpolant test fails. Fast forwarding, we finally
mention that in Figure 4.4, the respective triangles of the third and the fourth
iterations are shown in dotted lines to indicate that they were not explored in full.
Instead, we reuse the summarizations of other iterations.
4.2 Related Work
We will briefly review three groups of related work.
4.2.1 Instrumentation Tools
Several dynamic analysis tools have been developed which perform
different forms of memory analysis; they can be categorized as instrumentation
tools. Such tools often start by profiling an input program before analyzing the
collected data; depending on the granularity of the data and the analysis overhead,
we can broadly classify them as “lightweight” or “heavyweight”.
First, Valgrind [NS07] is a tool that has been widely used for memory debugging.
One of its components, Massif, is a heap profiler that can measure the heap
usage in a given execution of the program. Second, DynamoRIO [B+03] has
a memory debugging tool which can be used to detect heap-overflow errors. One
state-of-the-art tool from Intel, Pin [L+05], tracks the amount of system resources
used by a program. Finally, WMTrace [P+11] is the tool most relevant to our analysis.
It tracks memory allocation events in a multi-threaded program and, in a
post-processing phase, measures the worst-case heap usage observed during the traced
execution. All these methods are based on dynamic analysis, and are thus not able to
calculate the worst-case memory consumption over all feasible executions.
4.2.2 Worst-case Stack Usage Analysis
Worst-case stack analysis is important for detecting stack overflows in safety-critical
embedded systems. One state-of-the-art tool is AbsInt’s StackAnalyzer
[KF14]. It is a variant of value analysis performed on memory cells and CPU
registers, where the highest value of the stack pointer is reported as the worst-case
stack usage. This approach employs interval analysis and, for precision, it is
context-sensitive. Contexts are differentiated by a call string, whose length is bounded
by some N (for scalability). The value analysis keeps updating the intervals until a
fixpoint is reached.
Recently, [C+14] employed a variant of Hoare logic to establish bounds on the stack
usage of C programs. However, this method cannot be extended to dynamic heap
allocation: in its current formulation, it requires the size of the stack frame to be
static.
In contrast, our method can be used for analyzing both worst-case heap and
stack consumption. Importantly, our approach is path-sensitive, thus it produces
more precise results.
4.2.3 Worst-case Heap Usage Analysis
Recently, object-oriented languages have been proposed for use in real-time
critical systems [B+13, Sch15]. In such languages, besides WCET analysis,
analysis of worst-case heap consumption is also crucial for ensuring the safety of the
deployed systems [PHS10].
One attempt is [PHS10], targeting Java. It employs an IPET-based framework,
which originated in WCET analysis. The method does not take into account memory
deallocations and thus the bounds it produces can be imprecise. Another similar
work is [A+13], which only measures the allocations and assumes a scope-based
memory model: all the allocations are deallocated together with the entire scope.
The work most closely related to ours is the static analysis presented in [HIE12], where
an algorithm, extended from [CJ11, CJ13], has been outlined to perform non-cumulative
resource analysis. This algorithm, however, assumes that the amount
of resource consumed by each basic block is a constant. In this chapter, we make
no such assumption. In fact, to accommodate that, we need to introduce a new
form of reuse, as detailed in Section 4.3.
Last but not least, inferring parametric bounds for the memory consumption of
imperative and object-oriented programs has been an important research topic
[A+07, A+12a, B+08, FMH14, HAH11, CHS15, H+16]. We have discussed
representative approaches in Section 4.1.
4.3 General Framework
In this section, we present the resource analysis framework customized for
memory high-watermark (MHW) analysis. In Table 2.1, we listed the terms and
functions that need to be defined to customize the resource analysis framework. In
the rest of this section, we define these terms and functions for MHW analysis.
4.3.1 Basic Operations and Memory Cost Model
Recall from Chapter 2 that we model a program by a transition system. The
basic operations are assignments, “assume” operations, and memory
allocations/deallocations. The set of all program variables is denoted by Vars,
which includes a special variable r tracking the amount of memory consumption. An
assignment x := e assigns the value of the expression e to the
variable x. The operation assume(cond) means: if the conditional expression
cond evaluates to true, execution continues; otherwise it halts. Moreover, alc(+, e)
and alc(−, e) correspond to a memory allocation and a deallocation, respectively, of
size e. These operations are compiled from the malloc and free statements in the
input C programs. We shall use $\ell \xrightarrow{op} \ell'$ to denote a transition from $\ell \in L$
to $\ell' \in L$ executing the operation $op \in Ops$. The transition step is then
defined as follows.
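Definition 8 (Transition Step). Given a transition system $\langle L, \ell_0, \longrightarrow \rangle$ and a
symbolic state $s \equiv \langle \ell, \sigma, \Pi \rangle \in SymStates$, the symbolic execution of transition
$tr : \ell \xrightarrow{op} \ell'$ returns another symbolic state $s'$ defined as:
$$
s' \overset{\text{def}}{=}
\begin{cases}
\langle \ell', \sigma, \Pi \wedge cond \rangle & \text{if } op \equiv \texttt{assume}(cond) \\
\langle \ell', \sigma[x \mapsto [\![e]\!]_{\sigma}], \Pi \rangle & \text{if } op \equiv x := e \\
\langle \ell', \sigma[r \mapsto r + [\![e]\!]_{\sigma}], \Pi \rangle & \text{if } op \equiv \texttt{alc}(+, e) \\
\langle \ell', \sigma[r \mapsto r - [\![e]\!]_{\sigma}], \Pi \rangle & \text{if } op \equiv \texttt{alc}(-, e)
\end{cases}
$$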
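The four cases of the transition step can be illustrated with a small Python sketch. It is a deliberate simplification and not our implementation: the store holds concrete integers instead of symbolic expressions, the path condition is kept as a tuple of condition texts, and the names `State` and `step` as well as the tuple encoding of operations are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Store = Dict[str, int]
Expr = Callable[[Store], int]      # an expression is evaluated against a store


@dataclass
class State:
    loc: str                       # program point
    store: Store                   # valuation of the variables, including r
    path: Tuple[str, ...]          # accumulated path condition


def step(s: State, op: tuple, target: str) -> State:
    """Execute one operation and return the successor state at `target`."""
    kind = op[0]
    store = dict(s.store)          # never mutate the incoming state
    path = s.path
    if kind == "assume":           # ("assume", cond_text): extend the path condition
        path = path + (op[1],)
    elif kind == "assign":         # ("assign", x, e): x := e
        store[op[1]] = op[2](s.store)
    elif kind == "alc+":           # ("alc+", e): r := r + e
        store["r"] = s.store["r"] + op[1](s.store)
    elif kind == "alc-":           # ("alc-", e): r := r - e
        store["r"] = s.store["r"] - op[1](s.store)
    else:
        raise ValueError(kind)
    return State(target, store, path)


# The then branch of Figure 4.1, one operation at a time.
s0 = State("l0", {"r": 0, "n": 0}, ())
s1 = step(s0, ("assume", "j > 0"), "l1")
s2 = step(s1, ("assign", "n", lambda st: 10), "l2")
s3 = step(s2, ("alc+", lambda st: st["n"]), "l3")
s4 = step(s3, ("alc-", lambda st: st["n"]), "l4")
```

Copying the store in each step keeps earlier states intact, which matters because the states of a symbolic execution tree are shared between sibling paths.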
4.3.2 The Machine State
The machine state for MHW analysis is empty, since estimating the memory
consumption of a program does not rely on the internal state of any
micro-architectural feature. As a result, the machine state attached to each symbolic
state is the empty tuple (〈〉).
Consequently, in MHW analysis updMachineState is equivalent to the identity
function (Id), which returns its input state unchanged:
m′ := updMachineState(seq, m) ≡ Id(m).
4.3.3 Witness and Dominating Condition
Next, we discuss the concept of witness. As explained in Section 4.1.1, a high-watermark
witness and a net-usage witness are stored in the summarization. A
high-watermark witness is a sub-path from the root of the subtree subtree(s, ℓ2)
to a program point ℓ inside the subtree where the resource variable r reaches its
peak value. It is denoted by Γh ≡ 〈Υh, πh〉, where Υh is the sequence of alc(±, e)
operations capturing all memory allocations and deallocations, and πh is the path
constraint along the witness.
The witness of highest net usage is a sub-path from the root of the subtree
subtree(s, ℓ2) to the program point ℓ2 along which the resource variable r has the
highest value at ℓ2; it is denoted by Γn ≡ 〈Υn, πn〉.
Note that the two witnesses can be different. Also, to reduce the size of the
witness, consecutive allocations of concrete amounts are merged into one, and
similarly for consecutive deallocations.
Non-constant allocations may evaluate to different values under
different contexts. For such evaluation, we rely on the estimating upper-bound and
the estimating lower-bound functions.
Definition 9 (Estimating Upper-bound (EUB)). Given a symbolic state s ≡
〈ℓ, [[s]]〉 and a non-constant allocation of the amount captured by an expression e,
the function EUB(e, [[s]]) returns the smallest expression ub over symbolic input
parameters and concrete values such that [[s]] |= e ≤ ub. In case no such upper bound
can be generated, e is returned.
Definition 10 (Estimating Lower-bound (ELB)). Given a symbolic state s ≡
〈ℓ, [[s]]〉 and a non-constant deallocation of the amount captured by an expression
e, the function ELB(e, [[s]]) returns the largest expression lb over symbolic input
parameters and concrete values such that [[s]] |= lb ≤ e.
In summary, EUB over-estimates non-constant allocations and ELB under-estimates
non-constant deallocations. Note that, in the worst case, ELB can
always return 0 as a trivial lower bound. Bounds from EUB or ELB that are sound
but imprecise reduce the overall precision of the analysis.
Example 7 (Witness Path). Assume the following sequence of allocating/deallocating
statements along a symbolic path, which is selected as a high-watermark witness:
x=malloc(10), y=malloc(5), free(y), free(x), z=malloc(c)
This sequence would be stored as Υh ≡ ([+, 15], [−, 15], [+, c]).
When the summarization is reused, the high-watermark memory usage is com-
puted by replaying the sequence Υh under the new incoming context s, making
use of the EUB and ELB functions. For example, given that [[s]] ≡ c < 5, Υh in the
above example will be approximated by Υ1 = [+, 15], [−, 15], [+, 4], which gives us
a high-watermark of 15. In a different context where [[s]] ≡ c < 20, Υh will be
approximated by Υ1 = [+, 15], [−, 15], [+, 19], which gives us a high-watermark of
19.
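Example 7 and its two replays can be reproduced with the following Python sketch. It is a simplified illustration under stated assumptions, not our implementation: a context is modelled as inclusive integer bounds per variable (so that c < 5 becomes the interval [0, 4], assuming sizes are non-negative integers), and the functions `merge`, `eub`, `elb` and `replay` are stand-ins written for this example.

```python
from typing import Dict, List, Tuple, Union

Amount = Union[int, str]                 # concrete size or name of a symbolic variable
Entry = Tuple[str, Amount]               # ("+", e) or ("-", e)
Context = Dict[str, Tuple[int, int]]     # variable -> inclusive (lower, upper) bound


def merge(seq: List[Entry]) -> List[Entry]:
    """Merge consecutive concrete entries that have the same sign."""
    out: List[Entry] = []
    for sign, amt in seq:
        if out and out[-1][0] == sign and isinstance(amt, int) \
                and isinstance(out[-1][1], int):
            out[-1] = (sign, out[-1][1] + amt)
        else:
            out.append((sign, amt))
    return out


def eub(amt: Amount, ctx: Context) -> int:
    """Upper bound of an allocated amount under the context."""
    return amt if isinstance(amt, int) else ctx[amt][1]


def elb(amt: Amount, ctx: Context) -> int:
    """Lower bound of a deallocated amount under the context."""
    return amt if isinstance(amt, int) else ctx[amt][0]


def replay(seq: List[Entry], ctx: Context) -> int:
    """High-watermark obtained by replaying the sequence under the context."""
    r = hw = 0
    for sign, amt in seq:
        if sign == "+":
            r += eub(amt, ctx)           # over-estimate allocations
        else:
            r -= elb(amt, ctx)           # under-estimate deallocations
        hw = max(hw, r)
    return hw


# x=malloc(10), y=malloc(5), free(y), free(x), z=malloc(c)
upsilon_h = merge([("+", 10), ("+", 5), ("-", 5), ("-", 10), ("+", "c")])
hw_small = replay(upsilon_h, {"c": (0, 4)})     # context c < 5
hw_large = replay(upsilon_h, {"c": (0, 19)})    # context c < 20
```

The same stored sequence yields a different peak location in the two contexts: the merged concrete allocation in the first, the symbolic allocation in the second.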
Finally, similar to Chapter 3, the feasibility of a witness Γ ≡ 〈Υ, π〉 w.r.t.
an incoming context s is determined by checking whether [[π]] ∧ [[s]] is satisfiable. In
what follows, we abbreviate [[π]] ∧ [[s]] by [[Γ]].
We say that two nodes in a symbolic execution tree are similar if they refer
to the same program point. Thus, two subtrees are similar if they share the same
entry and exit program points.
We next discuss the dominating condition, another component of our analysis of
a subtree. Each dominating condition is generated with respect to a witness.
Intuitively, this answers the question “in what context of a similar subtree does
the witness remain optimal?”
More specifically, the constraints in the dominating condition are typically of
the form x ≤ y where x, y are either program variables or some concrete values —
note that at least one must be a variable. The dominating condition is computed
by abstracting the context that gives rise to dominance in the first place. We next
present the customized Summarize-a-Trans function.
Summarize-a-Trans Function
Summarize-a-Trans, presented in Figure 4.5, computes a summarization for a single
transition tr at state s. This can be seen as a basic step in our algorithm. We
first elaborate on the computation of the witnesses and the high-watermark usage
mhw. First, Υ is initialized to the sequence of allocations/deallocations (line 98),
i.e., [+, e] and/or [−, e] in op. Next, consecutive concrete allocations/deallocations
are merged by iterating through Υ once (lines 99 and 100). Moreover, for each
function Summarize-a-Trans(s, tr)
      Let s be 〈ℓ, σ, Π〉 and let tr be $\ell \xrightarrow{op} \ell'$
〈98〉  Υ := sequence of (de-)allocations in op
〈99〉  Iterate on Υ and merge consecutive concrete allocations
〈100〉 Iterate on Υ and merge consecutive concrete deallocations
〈101〉 i := 0; netusg := mhw := 0
〈102〉 foreach [sign, m] ∈ Υ do
〈103〉     if sign is + then netusg := netusg + m
〈104〉     else netusg := netusg − m
〈105〉     if netusg > mhw then mhw := netusg
      endfor
〈106〉 Γh := 〈Υ, [[op]]σ〉; Γn := 〈Υ, [[op]]σ〉
〈107〉 return [ℓ, true, Γh, Γn, mhw, true, true, op∆]
Figure 4.5: Summarize-a-Trans Function
(de-)allocation, the netusg is updated in lines 103 and 104. If the value of netusg
is greater than the high-watermark value, mhw is updated and ℓ′ is stored as the
peak location (line 105).
Next, the path constraint for each witness is computed by projecting op onto the
set of program variables w.r.t. the symbolic store σ, denoted as [[op]]σ (line 106). We
now elaborate on the rest of the components stored in the summarization. Because
no infeasible path has been discovered, the interpolant Ψ is just true. There is a
single path, thus the dominating conditions are true. The abstract transformer ∆p
is the operation op itself, but translated to the language of input-output relations.
As an example, y := y + 1 is translated to yout = yin + 1. We use op∆ to denote
such a translated op.
Compounding Two Summarizations
We presented the high-level steps for compounding two summarizations in
Chapter 2. However, since the summarizations in this chapter contain both
high-watermark and net-usage witnesses, we present an updated version of the
function Compose(s, S1, S2)
      Let S1 be [ℓ1, Ψ1, Γh1, Γn1, mhw1, δh1, δn1, ∆p1]
      Let S2 be [ℓ2, Ψ2, Γh2, Γn2, mhw2, δh2, δn2, ∆p2]
      Let Γn1 be 〈Υn1, πn1〉
〈108〉 ∆p := ∆p1 ∧ ∆p2
〈109〉 Ψ := Ψ1 ∧ Pre-Cond(∆p1, Ψ2)
〈110〉 if (mhw1 > net-usg(s, Υn1) + mhw2)
〈111〉     mhw := mhw1
〈112〉     Γh := Γh1; δh := δh1
      else
〈113〉     mhw := net-usg(s, Υn1) + mhw2
〈114〉     {Γh, δh} := Combine-Witnesses(∆p1, Γn1, Γh2, δn1, δh2)
      endif
〈115〉 {Γn, δn} := Combine-Witnesses(∆p1, Γn1, Γn2, δn1, δn2)
〈116〉 return [ℓ1, Ψ, Γh, Γn, mhw, δh, δn, ∆p]
end function
function JoinHorizontal(s, S1, S2)
      Let S1 be [ℓ, Ψ1, Γh1, Γn1, mhw1, δh1, δn1, ∆p1]
      Let S2 be [ℓ, Ψ2, Γh2, Γn2, mhw2, δh2, δn2, ∆p2]
〈117〉 mhw := max(mhw1, mhw2)
〈118〉 Ψ := Ψ1 ∧ Ψ2
〈119〉 ∆p := ∆p1 ∨ ∆p2
〈120〉 {Γh, δh} := Merge-Witness-h(Γh1, Γh2, mhw1, mhw2, δh1, δh2)
〈121〉 {Γn, δn} := Merge-Witness-n(s, Γn1, Γn2, δn1, δn2)
〈122〉 return [ℓ, Ψ, Γh, Γn, mhw, δh, δn, ∆p]
end function
Figure 4.6: Helper Functions
Compose and JoinHorizontal functions in Figure 4.6.
Compounding Vertically Two Summarizations: Consider subtree(s2, ℓ3)
suffixing subtree(s1, ℓ2), where s2 ≡ 〈ℓ2, [[s2]]〉 and s1 ≡ 〈ℓ1, [[s1]]〉. In other words,
a path π1 from ℓ1 to ℓ2 followed by a path π2 from ℓ2 to ℓ3 corresponds to a path π
in subtree(s1, ℓ3). The Compose function (in Figure 4.6) returns a summarization
for subtree(s1, ℓ3) by compounding the two existing summarizations for
subtree(s1, ℓ2) and subtree(s2, ℓ3), respectively.
The abstract transformer ∆p is computed as the conjunction of the input
function Combine-Witnesses(∆p, Γ1, Γ2, δ1, δ2)
      Let Γ1 be 〈Υ1, π1〉 and let Γ2 be 〈Υ2, π2〉
〈123〉 π := π1 ∧ π2
〈124〉 δ′2 := true
〈125〉 foreach cond ∈ δ2 do
〈126〉     δ′2 := δ′2 ∧ Pre-Cond(∆p, cond)
      endfor
〈127〉 δ := δ1 ∧ δ′2
〈128〉 Υ′2 := [ ]
〈129〉 foreach [sign, m] ∈ Υ2 do
〈130〉     Υ′2 := Add [sign, Pre-Cond(∆p, m)] into Υ′2
      endfor
〈131〉 Υ := Υ1 • Υ′2 // concatenation
〈132〉 return {〈Υ, π〉, δ}
Figure 4.7: Combining Witness Formula and Dominating Conditions
abstract transformers (line 108), with proper variable renaming. Note that in
our implementation, abstract transformers are computed using the polyhedral
domain. We employ ∆p to generate one continuation context, before proceeding
with the analysis of subsequent program fragments. Next, the desired interpolant
must capture the infeasibility of S1, as well as the infeasibility of S2 given that we
treat subtree(s1, ℓ2) as an abstract transition whose operation is ∆p. We
rely on the function Pre-Cond, which in line 109 under-approximates the weakest
precondition of the post-condition Ψ2 w.r.t. the transition relation ∆p.
Next we update the high-watermark witness. Here the net-usage witness becomes
important. In the combined subtree, the high-watermark is chosen by comparing
(1) the high-watermark of the prefix subtree and (2) the (worst) net usage of
the prefix subtree plus the high-watermark of the suffix subtree (line 110). In case
(1) is greater, the witness and the dominating condition of the prefix subtree are
returned (lines 111 and 112). Otherwise, the net-usage witness and its dominating
condition of the prefix subtree are combined with the high-watermark witness
and the corresponding dominating condition of the suffix subtree (lines 113 and
114). This is achieved by calling the function Combine-Witnesses. Finally, we again
invoke Combine-Witnesses to combine the net-usage witnesses and their respective
dominating conditions (line 115).
In Figure 4.7, Combine-Witnesses produces a witness and a dominating condition,
by compounding the witnesses and dominating conditions of the two subtrees,
where one suffixes the other. This can be understood as a sequential composition.
First, the path constraint π is simply the conjunction of π1 and π2 (line 123).
Next, the combined dominating condition δ is computed as the conjunction of δ1
and a condition δ′2 (line 127), where, intuitively, δ′2 is a precondition of δ2
w.r.t. the transition relation ∆p. Similarly,
the allocations/deallocations in Υ2 are updated w.r.t. the transition relation ∆p
and stored in Υ′2.
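The need for the net-usage witness in Compose can be seen on a small concrete instance. The Python sketch below is only an illustration with made-up paths: the prefix subtree has a path A with the highest peak and a path B with the highest net usage, and the suffix subtree has a single path C. The names `peak_and_net` and `summarize` and all numbers are assumptions for this example.

```python
from itertools import product


def peak_and_net(deltas):
    """Return (high-watermark, net usage) of a path, starting from r = 0."""
    r = hw = 0
    for d in deltas:
        r += d
        hw = max(hw, r)
    return hw, r


def summarize(paths):
    """Highest peak and highest net usage over all paths of a subtree."""
    mhw = max(peak_and_net(p)[0] for p in paths.values())
    net = max(peak_and_net(p)[1] for p in paths.values())
    return mhw, net


prefix_paths = {"A": [+20, -20],   # peak 20, net 0: high-watermark witness
                "B": [+12]}        # peak 12, net 12: net-usage witness
suffix_paths = {"C": [+10, -10]}   # peak 10

mhw1, net1 = summarize(prefix_paths)
mhw2, _ = summarize(suffix_paths)

# Lines 110 to 113 of Compose on concrete values.
composed = max(mhw1, net1 + mhw2)

# Combining the suffix with the net usage of the high-watermark witness A
# instead would miss the true peak.
only_hw_witness = max(mhw1, peak_and_net(prefix_paths["A"])[1] + mhw2)

# Ground truth: enumerate every prefix/suffix combination.
brute_force = max(peak_and_net(p + q)[0]
                  for p, q in product(prefix_paths.values(),
                                      suffix_paths.values()))
```

In this instance the overall peak is reached on the path B followed by C, which is neither the prefix path with the highest peak nor visible from the high-watermark witness alone.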
Compounding Horizontally Two Summarizations: Given two summariza-
tions of two subtrees rooted at two nodes which are siblings, we want to propagate
function Merge-Witness-h(Γ1, Γ2, mhw1, mhw2, δ1, δ2)
      Let Γ1 be 〈Υ1, π1〉 and let Γ2 be 〈Υ2, π2〉
〈133〉 δ := δ1 ∧ δ2
〈134〉 if (true |= mhw1 ≥ mhw2) then return {Γ1, δ}
〈135〉 if (true |= mhw1 ≤ mhw2) then return {Γ2, δ}
〈136〉 return {〈max(Υ1, Υ2), π1 ∧ π2〉, δ}
function Merge-Witness-n(s, Γ1, Γ2, δ1, δ2)
      Let Γ1 be 〈Υ1, π1〉 and let Γ2 be 〈Υ2, π2〉
〈137〉 δ := δ1 ∧ δ2
〈138〉 if (true |= net-usg(s, Υ1) ≥ net-usg(s, Υ2))
〈139〉     return {Γ1, δ}
〈140〉 if (true |= net-usg(s, Υ1) ≤ net-usg(s, Υ2))
〈141〉     return {Γ2, δ}
〈142〉 return {〈max(Υ1, Υ2), π1 ∧ π2〉, δ}
Figure 4.8: Merging Witness Formulas and Dominating Conditions
the information back and compute the summarization for the (common) parent
node. While propagation can be achieved by Compose, we need JoinHorizontal
(presented in Figure 4.6) to “merge” the contributions of the two children to the
parent node. Note that unlike Compose, we need to select the path with the larger
memory high-watermark usage between the two witnesses of the input summarizations.
Thus, the high-watermark usage is the maximum of mhw1 and mhw2
(line 117). Moreover, all the infeasible paths in both sub-structures must be
maintained, thus the desired interpolant is the conjunction of the two input interpolants
(line 118). On the other hand, the abstract transformer ∆p is computed
straightforwardly as the disjunction of the input abstract transformers (line 119). Finally,
Merge-Witness-h is invoked to merge the high-watermark witnesses and the respective
dominating conditions (line 120) and, similarly, Merge-Witness-n is invoked to
merge the net-usage witnesses and the respective dominating conditions (line 121).
In Figure 4.8, Merge-Witness-h produces a high-watermark witness and a dominating
condition, by compounding the witnesses and dominating conditions of two
sibling subtrees. We need to choose one witness from the two input witnesses. The
combined dominating condition must ensure the dominance of each witness (in its
respective subtree) and the dominance of the chosen witness, which produces a
higher value of r, over the other.
The dominating condition δ is initialized as the conjunction of the two dominating
conditions (line 133). Next we compare the two high-watermarks mhw1
and mhw2; if mhw1 is greater than or equal to mhw2, Γ1 dominates Γ2, and Γ1 is
returned as the dominating witness with δ as the dominating condition (line 134).
If not, we check whether mhw2 is greater than or equal to mhw1, in which case we
return Γ2 as the dominating witness with δ as the dominating condition (line 135).
If both tests fail, which can happen when we deal with symbolic expressions, we employ
the max function, delaying the dominance test to a higher level in the symbolic
execution tree (line 136) in the hope that another witness might dominate this
path. Similarly, the Merge-Witness-n function produces a net-usage witness and
a dominating condition. Here we make use of the function net-usg, which extracts
the net usage given a context s and a sequence of allocations/deallocations (Υ1
or Υ2). This function is straightforward to implement; we omit the details for
reasons of space.
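The three outcomes of the dominance test (lines 134 to 136) can be sketched in Python as follows. The sketch makes a simplifying assumption that is not part of the algorithm above: each high-watermark is represented by an inclusive interval of the values it can take under the current context, so that an entailment such as mhw1 ≥ mhw2 becomes a comparison of interval end points. The dominating condition δ is left out, and the names `entails_ge` and `merge_witness_h` are introduced only for this illustration.

```python
from typing import Tuple

Interval = Tuple[int, int]     # inclusive (lowest, highest) possible value


def entails_ge(a: Interval, b: Interval) -> bool:
    """True if every value of a is at least as large as every value of b."""
    return a[0] >= b[1]


def merge_witness_h(w1, w2, mhw1: Interval, mhw2: Interval):
    """Pick the dominating witness, or defer the choice with a max node."""
    if entails_ge(mhw1, mhw2):       # line 134
        return w1
    if entails_ge(mhw2, mhw1):       # line 135
        return w2
    return ("max", w1, w2)           # line 136: dominance not decidable here


concrete = (15, 15)                                           # mhw1 = 15
first = merge_witness_h("G1", "G2", concrete, (0, 4))         # mhw2 = c, c < 5
second = merge_witness_h("G1", "G2", concrete, (20, 30))      # mhw2 = c, 20 <= c <= 30
deferred = merge_witness_h("G1", "G2", concrete, (0, 19))     # mhw2 = c, c < 20
```

Only the last call has to keep both witnesses, since a symbolic amount below 20 may or may not exceed the concrete peak of 15.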
4.3.4 Machine State Summary
In MHW analysis, since the machine state is empty, the machine state summary
is equal to the identity function (Id), which returns the same machine state: ∆m ≡
Id(m).
4.4 Example Analysis
Figure 4.9(a) presents the CFG of an example program. Its symbolic execution
tree is depicted in Figure 4.9(b). Both the CFG and the tree are annotated with
the updates on the resource variable r (in red color). Assume that b is a symbolic
input parameter for this example program. The variable of interest r is initialized
to 0 between nodes 〈1a〉 and 〈2a〉. The value of r can be seen beside node 〈2a〉 (in
green color). Note that in Figure 4.9(b), we do not (fully) show the subtree below
node 〈5b〉 and that we do not discuss the abstract transformers in detail.
In the left-most path, the value of r is updated to 20 at 〈5a〉, 0 at 〈8a〉 and c at
〈11a〉, which is a symbolic value. Note that c is a random number in the range
[0, b − 1] and cannot be determined statically. In the second path, reaching 〈10a〉,
the path constraints contain both j = 0 and j < 0 (only relevant constraints are
shown for brevity), thus an infeasible path is detected.
Figure 4.9: (a) The CFG of an annotated program (b) The analysis tree of the program
After finishing the subtree beneath 〈8a〉, the following summarization is com-
puted and stored:
[〈8〉,Ψ,Γh,Γn,mhw, δh, δn,∆p],
where the stored interpolant is Ψ ≡ j ≥ 0, which is a succinct reason for the
infeasibility of the right sub-path; the stored dominating conditions δh and δn are
true, given that there is only one feasible path. Similarly, the only feasible path
(in blue color) is stored both as the high-watermark witness Γh ≡ 〈([+, c]), j ≥ 0〉
and as the net-usage witness Γn ≡ 〈([+, c]), j ≥ 0〉, where j ≥ 0 is the witness
path constraint and ([+, c]) is the sequence of the allocation/deallocations along
the witness path. Finally, mhw, the worst-case heap memory consumption of the
subtree, is computed to be b − 1 by evaluating the memory consumption of the high-
watermark witness. This, in turn, is achieved by invoking the function EUB with
c as the first argument and the current context as the second argument.
Continuing the analysis, consider the pair of nodes 〈8a〉 and 〈8b〉. At node
〈8b〉, the value of r is 40, due to the memory allocation from 〈7a〉 to 〈8b〉. We will
check the reuse conditions here. We first check whether the stored interpolant (Ψ)
is implied. This does not hold. The key reason is: some infeasible path in 〈8a〉 is
in fact feasible in 〈8b〉. As a result, reuse does not happen and the node 〈8b〉 is
expanded.
After the analysis of the subtree beneath 〈8b〉, a summarization is generated
from the analysis of node 〈8b〉. The summarization would be
[〈8〉,Ψ′,Γ′h,Γ′n,mhw′, δ′h, δ′n,∆′p]. The stored interpolant Ψ′ is simply true
because both paths in the subtree are feasible.
Comparing the peak value of r along these two feasible sub-paths, it can be
seen that the value of r is larger in the right sub-path. So, the right sub-path is
chosen as the high-watermark witness and Γ′h is stored as 〈([+, b], [−, b]), j < 0〉.
Consequently, mhw′ is set to b.
Now, we need to capture a dominating condition such that, when it holds, it
is guaranteed that the chosen witness path dominates all the other path(s) in
the subtree. For any symbolic state s, it is the case that [[s]] |= EUB(c, [[s]]) <
EUB(b, [[s]]) (since c is a random value in the range [0, b − 1]). Thus the stored
dominating condition δ′h is simply true.
On the other hand, the value of r at 〈11c〉 is 40 + c, which is higher than the
value of r at 〈11d〉, which is 40. So the net-usage witness is the sub-path 〈8b〉, 〈9b〉,
〈11c〉 (in green color), stored as Γ′n ≡ 〈([+, c]), j ≥ 0〉. Moreover, the net-usage
dominating condition δ′n would also be simply true. For brevity, we do not discuss
net-usage witnesses and their dominating conditions in the rest of this example.
Continuing the analysis, consider the pair of nodes 〈5a〉 and 〈5b〉. We will show
that reuse can in fact take place here. We note, without proof, that: (1)
the high-watermark witness of the subtree rooted at 〈5a〉 is the rightmost feasible
path 〈5a〉, 〈7a〉, 〈8b〉, 〈10b〉, 〈11d〉 (in blue color), stored as 〈([+, a], [+, b]), j < 0〉;
(2) the interpolant of interest is true; and (3) the dominating condition for the
high-watermark witness is also true.
Now we can exemplify the reuse at 〈5b〉. We first check if the context of 〈5b〉,
called [[s5b]] implies the interpolant computed after finishing 〈5a〉. In this case, the
interpolant is true, thus the implication trivially holds. We then check whether
the dominating condition, which is true, holds. This is also trivially satisfied, thus
we can reuse the high-watermark witness of 〈5a〉, yielding an overall worst-case
memory consumption of 20 + EUB(a, [[s5b]]) + EUB(b, [[s5b]]), which is simplified to
be 50 + b.
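The arithmetic behind this bound can be replayed with a small sketch. It assumes that every upper bound is linear in the single symbolic input b, and it replaces EUB by a plain lookup in a hand-written context. The class Lin, the function eub and the dictionary s5b are hypothetical names introduced for illustration only; they do not describe how EUB is implemented in our framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lin:
    """A bound of the form const + coeff * b, where b is the symbolic input."""
    const: int
    coeff: int = 0

    def __add__(self, other: "Lin") -> "Lin":
        return Lin(self.const + other.const, self.coeff + other.coeff)

    def __str__(self) -> str:
        if self.coeff == 0:
            return str(self.const)
        term = "b" if self.coeff == 1 else f"{self.coeff}*b"
        return term if self.const == 0 else f"{self.const} + {term}"


def eub(var: str, context: dict) -> Lin:
    """Hypothetical stand-in for EUB: look the upper bound up in the context."""
    return context[var]


# Assumed context at <5b>: a evaluates to 30, b is the input itself,
# and c = Rand() % b lies in [0, b - 1].
s5b = {"a": Lin(30), "b": Lin(0, 1), "c": Lin(-1, 1)}

bound = Lin(20) + eub("a", s5b) + eub("b", s5b)
print(bound)  # 50 + b
```

Under a different context, for instance one in which a evaluates to another constant, the same witness sequence would yield a different bound, which is the point made in the remark that follows.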
We remark here that the worst-case consumption of the sub-path 〈5〉 〈7〉 〈8〉
〈10〉 〈11〉 is indeed different for the contexts 〈5a〉 and 〈5b〉. This is because,
fundamentally, the worst-case consumption of a symbolic path depends on its context.
In this particular example, the valuation of a in the two contexts is different.
Finally, we easily arrive at the worst-case heap memory consumption of the
entire tree, and thus of the entire example program, which is 50 + b. This is a
symbolic bound, since b is an input argument of this example program.
4.5 Experimental Evaluation
Table 4.1 compares our framework with the state-of-the-art tools and methods.
We have elaborated on these methods and tools in Sections 4.1 and 4.2. The first four
tools perform dynamic analysis and hence are unable to perform worst-case heap
analysis (WCHA) and worst-case stack analysis (WCSA). The next four tools
perform either WCHA or WCSA only, while our framework performs both WCHA
and WCSA.
The first method that can be compared to our framework is [HIE12], which
generates non-symbolic bounds, while our method is able to generate
symbolic bounds. A second group of tools that can be compared to our analysis
are COSTA [A+07, A+12a] and the methods presented in [CHS15] and [FMH14].
These methods might generate non-linear bounds and have limitations in handling
them. Our framework generates non-linear bounds only if a
dynamic memory allocation/deallocation has a non-linear size. In other words,
since our framework fully unrolls loops, our method does not require manual user
interaction to solve non-linear formulas. Moreover, non-linear formulas, in general,
might be more imprecise compared to the (probably linear) bound generated by
our framework.
As a result, we can state that our method would be at least as precise as other
methods. However, this precision might affect the scalability of our framework.
We evaluate our proposed algorithm using a number of benchmarks collected
from the literature. The suite includes: (1) memory allocator tests such as shbench,
larson and cache-scratch from Hoard benchmarks [BMBW00]; (2) embedded
programs from MiBench [G+01] and from Mälardalen WCET benchmarks [M0¨6];
and (3) heap manipulating benchmarks from [LLV15].

Table 4.1: Comparison of our framework with state-of-the-art tools and methods

DynamoRio                        Detect heap-overflow errors
Pin                              Tracks used system resources
...                       *      Worst-case stack usage (variant of value analysis)
[C+14]                    *      Worst-case stack usage (Static frames)
[PHS10]                   *      Does not consider deallocations
[A+13]                    *      Scope-based deallocations
[HIE12]                   * *    Generates non-symbolic bounds
COSTA & [FMH14], [CHS15]  * *    Limited in coping with non-linear formulas, either:
                                 · Generate imprecise bounds
                                 · Require manual user interaction
Our Framework (Unroll d)  * *    1. Generating symbolic bounds
                                 2. Generate bounds only parametric to input variables
                                 3. Not requiring manual user interaction
Table 4.2 presents the results of our experiments. The third column indicates
the values to which input parameters are concretized. Time, States and Reuses
columns present the running time, the number of visited states, and the number
of reuses in each benchmark instance. In other words, they illustrate the cost of
our algorithm.
Among the benchmarks, nsichneu, statemate and ndes contain many (possibly
infeasible) paths. They are included to stress the scalability of our algorithm.
Benchmarks cache-thrash and cache-scratch are used to test active and passive false
sharing. To test for memory fragmentation, shbench is often used. It randomly
allocates and deallocates random chunks of memory. As illustrated in
Section 4.4, our analysis can generate bounds even for programs where memory
allocations/deallocations are highly randomized. Finally, larson is a well-known
benchmark which simulates a server. Similar to shbench, it has random behavior in
memory allocation/deallocation. The analyzed benchmarks are categorized into
four groups, separated by a double line in Table 4.2. We discuss each
group below.
The first group of benchmarks contains complicated patterns of allocations
and/or deallocations (e.g., inside loops and conditional branches). In these bench-
marks, path-sensitivity plays a crucial role in generating a precise worst-case esti-
mate. Although the method presented in [HIE12] also benefits from path-sensitivity,
one key distinction of this work is the employment of symbolic witnesses, which
make our analysis applicable to programs with unbounded allocations/deallocations.
Moreover, let us elaborate on puzzle to highlight the impact of addressing
the issue of non-cumulative resource directly, as opposed to applying some known
approaches for analyzing cumulative resource. In puzzle, the outer loop iterates
5 times, and in each iteration it acquires memory, performs some operations in
some inner loops and releases the memory at the end of each outer loop iteration.
Analyzing memory as a cumulative resource would return 1020 as an estimate of
the worst-case consumption, which is 5 times larger than the bound produced by
our method.
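The difference between the two views can be reproduced with a minimal sketch. It assumes that each of the 5 outer iterations acquires a single chunk and releases it at the end of the iteration; the per-iteration size of 204 is inferred here from 1020 / 5 and is not a figure taken from the benchmark itself. The function names are illustrative.

```python
ITERATIONS = 5
CHUNK = 1020 // ITERATIONS  # 204, inferred from the cumulative estimate


def cumulative(events):
    """Treat memory as a cumulative resource: releases are ignored."""
    return sum(size for sign, size in events if sign == '+')


def high_watermark(events):
    """Treat memory as non-cumulative: track the peak of the running total."""
    current = peak = 0
    for sign, size in events:
        current += size if sign == '+' else -size
        peak = max(peak, current)
    return peak


# One acquire and one release per outer loop iteration.
events = []
for _ in range(ITERATIONS):
    events.append(('+', CHUNK))
    events.append(('-', CHUNK))

print(cumulative(events))      # 1020
print(high_watermark(events))  # 204
```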
The second group of benchmarks contains cache-thrash and cache-scratch,
which are analyzed for different input parameters. Both cache-thrash and
cache-scratch are multi-threaded benchmarks. We have analyzed their local
computation for the cases when the number of threads (nthreads) is either 1 or 2 (exact
bounds). In both benchmarks, the number of iterations of an inner loop is
determined by repetitions/nthreads. As a result, the execution with nthreads = 2
is indeed shorter than the execution with nthreads = 1. Finally, note that the
generated bound for cache-scratch is dependent on the number of threads.
The third group of benchmarks contains nsichneu and statemate. These two
benchmarks contain very large loops which iterate twice. These benchmarks are
often used by the WCET research community to test the scalability of an
algorithm, especially when it is based upon symbolic execution. Our algorithm is
able to fully analyze these benchmarks, demonstrating the potential of our new
concept of “reuse” to scale in the presence of symbolic bounds.
Finally, the last group of benchmarks contains loops with a large number of
iterations, where the ability to reuse compounded summarizations to avoid state
explosion is crucial. For example, himenobmtxpa contains 14 loops with a nesting
level of 3 and shbench contains a complicated loop pattern with a nesting level
of 3. Among these benchmarks, shbench can be executed in both single-threaded
and multi-threaded forms; we have analyzed it in the single-threaded form.
We highlight that the bounds generated for fft1 and nsieve-bits are symbolic
bounds.
4.6 Summary
In this chapter, we presented memory high-watermark analysis with symbolic
bounds. Memory high-watermark analysis is important since it is a non-cumulative
resource analysis where the resource of interest can be consumed or released.
Moreover, we extended the concept of reuse with interpolation and domination
in the presence of symbolic bounds. Memory high-watermark analysis is not
affected by any of the micro-architectural features. As a result, the machine state
was empty. However, as illustrated in the examples, our resource analysis
framework can still return more precise bounds, because the contribution
of each basic block to the memory consumption is dynamic. Finally, we presented
the results of our experiments on memory-intensive benchmarks to verify both the




Chapter 5
Integrated Worst-case Energy Consumption Analysis with Cache and
Pipeline
5.1 Introduction
In this chapter, we will present a worst-case energy consumption analysis. Energy
consumption is one of the important non-functional features in embedded systems.
A lower energy consuming embedded system can function for a longer time in
an environment with limited access to energy. The software being executed on
the embedded system has a profound effect on the consumed energy. The energy
consumption of an application can vary based on the design of the software, its
algorithms, the programming language used and the compiler and its optimizations.
Measuring energy consumption of software in general needs specific knowledge
about the underlying hardware and also instrumentation, which is not an easy
task for the programmers or software engineers.
On the other hand, knowing an estimate of the energy consumption of a
program beforehand is very useful. It can help in understanding the effect of the
code on the energy consumption of the final system without instrumentation or
even having access to the system.
One approach in the literature has been to measure the average energy usage of
a program. This measured value can be reported to the end-user as an indicator of
the average energy usage of software on a specific hardware [GGP+14]. On the
other hand, a second approach [PS01, B+05, V+14] states that, in cases where energy is
a scarce resource, energy consumption analysis is an important factor for the safety
of embedded systems. The lack of access to an infinite energy source can be
because either: (1) the energy source of these embedded systems is not always available
and cannot guarantee an infinite operation in energy harvesting systems (e.g.,
solar powered sensor nodes); or (2) such embedded systems do not have access to
rechargeable energy resources after being deployed (e.g., underwater sensor nodes).
These works suggest that an energy consumption analysis should be performed
to guarantee that an embedded system would have enough energy to perform the
set of assigned tasks.
In general, the issues with measurement-based methods in timing analysis remain
valid in energy consumption analysis too. Measurement-based methods
cannot give a safe upper-bound on the energy consumption of a program.
In order to evaluate if a system can operate in an environment with limited
access to energy, an upper-bound on the worst-case energy consumption (WCEC)
of a program should be generated. The WCEC of a program can be a reliable
guarantee to evaluate the safety of the operation of an embedded system in an
environment with limited access to energy.
Similar to worst-case execution time, the WCEC of a program is not known
in general and WCEC analysis is performed to generate a safe and precise upper-
bound on the energy usage of the program.
Although there has been extensive research on the worst-case execution time
analysis of programs, unfortunately, the current worst-case execution time analysis
tools cannot be used to determine the worst-case energy behavior of programs.
Due to the complicated micro-architectural features in modern processors, in
general, the path resulting in the worst-case execution time is not the same as the path
resulting in the WCEC of a program [J+06]. Indeed, it is not true that if a path
has a lower execution time, its energy consumption would be lower too [H+14b].
5.1.1 Integrated WCEC Analysis
Static analysis methods can generate safe upper bounds for the energy consumption
of software provided that a worst-case cost model for the energy consumption
of the underlying hardware is present. Similar to timing analysis, accurate energy
consumption analysis of a program involves low-level micro-architectural analysis.
WCEC analysis is more difficult compared to worst-case execution time analysis,
since the complexity of the micro-architectural features can have more effect
on the energy consumption of a basic block compared to its execution time. In
other words, energy consumption is more data dependent, meaning that a small
change in the value of a variable might result in a significant difference in the
energy consumption of a certain path. As a result, the methods utilizing fixed point
computation in WCEC analysis might lead to a greater overestimation.
This point brings us to the main proposal of this chapter: we propose that
in the domain of energy consumption (where the nature of a consumed resource is
very much data dependent), integrated resource analysis is more plausible compared
to other methods, since it remains sound and yet is able to generate precise
bounds on the worst-case energy consumption behavior of a program. In the
following sections, we will present our resource analysis customized for WCEC
analysis. Our WCEC analysis:
1. Is able to remain precise (to a great extent) concerning the effect of data-
dependency on the consumed energy across different paths¹.
2. Extends the cache analysis presented in Chapter 3 with first in-order and then
out-of-order superscalar pipeline analysis as an important micro-architectural
feature affecting the energy consumption in programs.
5.1.2 Pipeline Analysis
Much research has been performed on integrating pipeline analysis into
worst-case analyses (details are available in the survey [W+08]). In general, most of
these analyses, for scalability reasons, perform pipeline analysis and the analysis
of other micro-architectural elements such as caches separately. However, for
accuracy reasons, all micro-architectural elements should be analyzed in an integrated
analysis².
Most of the energy harvesting embedded systems are built using in-order
processors (due to their lower energy consumption compared to out-of-order processors).
In in-order pipelines, all of the pipeline stages execute instructions based on the
instruction order, while this is not the case in out-of-order pipelines.
¹ Our analysis does not consider the battery tailing effect.
² In the presence of complicated patterns such as out-of-order execution, the state of one
micro-architectural feature can affect the internal state of other micro-architectural features [H+03].
So, especially for out-of-order pipelines, integrated low-level analysis should be performed to ensure
the soundness of the analysis.
In this chapter, we first focus on presenting an integrated analysis where high-
level path analysis and low-level analysis of cache and in-order pipeline are per-
formed at the same time. Our aim is to show that such an analysis would remain
scalable in the presence of cache and in-order pipeline. We should note that we
follow the abstract cache semantics from [SF99]. By performing pipeline analysis
during the symbolic execution tree traversal, we also model the following features:
• Superscalarity
• Worst-case Energy Consumption Analysis
• The Multiplicity of Resources: Our processor model can contain two
ALUs, and arithmetic instructions (except multiply and divide) can be issued
to any of them.
• Resource Capacity: Some resources such as buffers and queues have lim-
ited capacity and thus introduce dependencies between instructions in an
earlier pipeline stage and other instructions at a later stage.
• Instruction and Data Cache: We have modeled instruction and data
cache and their interactions with the pipeline.
We model a processor with an in-order superscalar pipeline and instruction and data
caches. Our model can be parametrized w.r.t. the cache configuration, the number
of entries in the instruction window, the latency of the functional units, etc.
The processor has a standard 5-stage pipeline consisting of Instruction Fetch (IF),
Instruction Decode & Dispatch (ID), Instruction Execute (EX), Write Back (WB)
and Commit (CM).
Figure 5.1: In-order processor model
Figure 5.1 presents the in-order superscalar processor which we will be using in
the analysis framework in this chapter. The pipeline consists of five stages:
1. Instruction Fetch (IF): In this stage, instructions are fetched from the
memory in the program order into the instruction fetch buffer. We assume a
4-entry buffer in this chapter which can load two instructions into the buffer
per clock cycle.
2. Instruction Decode & Dispatch (ID): In this stage, instructions are
decoded and dispatched in program order for execution. Two instructions at
most can be decoded and dispatched per clock cycle.
3. Instruction Execute (EX): In this stage, the earliest instruction is issued
to its corresponding functional unit for execution. A new instruction can be
assigned to an execution unit when the previous instruction in the unit has
finished execution. The components in the processor model are two ALU
units, one MULTU unit for integer multiplications, one FPU unit for floating
point multiplications and one unit for load/store instructions.
4. Write Back (WB): In this stage, the execution of an instruction has finished
and the results are written to the registers or passed to the data cache.
5. Commit (CM): In this stage, the instructions which have finished writing
their output to the registers are removed from the pipeline. Two
instructions can be committed per cycle.
At the end of this chapter, in Section 5.6, we will present the steps needed to
extend our analysis to out-of-order execution.
5.1.3 Analysis of Loop-free Programs
We first focus on a sound analysis of loop-free programs. Our analysis performs
a depth-first traversal of the symbolic execution tree of a program. Each node
in the tree is assigned a unique instruction and data cache and a pipeline state
(as seen in Figure 5.3). Our analysis explores the symbolic execution
tree and reports the resource cost of the longest path.
Unlike [The04, H+03], where a pipeline state could have many successor states,
our analysis can usually generate one precise successor pipeline state. The only
case where more than one successor pipeline state may be generated is when a
data cache access points to a memory range access. However, since we do not
merge machine states in our analysis³, more than one successor state
is relatively rare: in our experiments, a pipeline state had more than one
successor state in only two of the benchmarks.
³ We only merge the program context at the end of the loops. So, in loop-free programs we do
not perform merge at any point and in programs with loops the machine states are not merged.
Figure 5.2: Full Symbolic Execution Tree
In Chapter 3 we employed reuse with interpolation and dominance to perform
exhaustive symbolic execution in an environment where the contribution of each
basic block to the worst-case resource consumption was dynamic w.r.t. the context.
However, it only covered cache states. In this chapter, we extend reuse to be
performed in the presence of cache and in-order pipeline.
Figure 5.3 informally depicts a symbolic execution tree, where each triangle
represents a subtree. The contexts for the left and right subtrees (s0 and s1) contain
the machine state reaching the subtree. The machine states for the left and right
subtrees are depicted by m0 ≡ 〈c0, p0〉 at ℓ0 and m1 ≡ 〈c1, p1〉 at ℓ1, containing the
cache and pipeline states.
In our setting, we keep the interpolant and witness test from Chapter 3 and we
enhance the domination test to be able to express the dominance of the witness
path w.r.t. the incoming context. We now explain the dominating condition.
If at s1 the pipeline and cache states p1 and c1 were exactly the same as p0 and
c0, we could safely reuse the summarization of s0. However, in general, such cases
do not happen often. The instructions in the pipeline state might be different, and
also the memory blocks in the cache can be different.
Figure 5.3: Reuse Step
We define successor nodes inside a subtree as nodes where the pipeline state
does not contain any instruction from before the subtree. Successor
nodes are chosen in all feasible paths except the witness (highlighted with green
color in Figure 5.3). We can check if the pipeline states at the successor nodes of s1
(successor states of p1) remain similar enough to the successor states of p0. If so,
the witness path will remain the dominating path in the new context. To decide
whether the successor states of p1 remain similar enough to the successor states of p0,
we compare the energy cost of the witness under the new cache and pipeline state
against the maximum possible energy cost of other paths from successor nodes plus
the energy cost of the path from the root to the successor nodes. If the energy
cost of the witness is higher, we conclude that the cache and pipeline states are
similar enough to maintain the dominance of the witness. As illustrated in Figure 5.3,
the energy cost of the witness is 14, while the sum of (1) the highest energy cost of
the other paths from the root of the subtree to the successor nodes and (2) the
maximum energy cost of the rest of the path is 3 + 8 = 11. Note that the maximum
possible energy cost of the paths (from successor nodes to the end of the subtree)
is computed based on the worst-case cache state at the successor node.
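The comparison described above can be sketched as follows. This is only an illustration under simplifying assumptions: energy costs are plain numbers, and each competing path is summarised by the pair (cost from the subtree root to its successor node, worst-case cost from the successor node to the end of the subtree). The function name and the encoding are hypothetical and are not part of our framework.

```python
from typing import Iterable, Tuple


def witness_still_dominates(witness_cost: float,
                            others: Iterable[Tuple[float, float]]) -> bool:
    """Check whether the witness remains the dominating path.

    Each entry of `others` is (prefix_cost, max_suffix_cost) for one
    non-witness path: the cost from the subtree root to its successor node,
    and the maximum possible cost from there to the end of the subtree.
    """
    totals = [prefix + suffix for prefix, suffix in others]
    if not totals:
        return True  # no competing path, the witness trivially dominates
    return witness_cost > max(totals)


# Numbers from the example in the text: witness 14 against 3 + 8 = 11.
print(witness_still_dominates(14, [(3, 8)]))  # True
```

When the check fails, the summarization cannot be reused and the subtree has to be explored in the new context.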
5.1.4 Extension to Programs with Loops
We adopt the loop unrolling framework presented in Chapter 2. However,
unlike [R+06], where in analyzing loops the contexts are merged at the merge
points (the end of loop iterations in our framework), here, to remain sound, the
machine states are not merged and all the contexts at the end of a loop iteration
are passed to the next iteration.
5.2 Related Work
5.2.1 Worst-case Energy Consumption Analysis
Cycle-accurate power simulators such as Wattch [BTM00] and SimplePower
[Y+00] perform power simulation for one path and in general are not able to
generate a safe estimation of the WCEC for a program. Wattch produces estimations
of energy consumption by modeling micro-architectural components such as cache and
branch prediction. It can model software on different architectures with an accuracy
within 10% of the commercial low-level hardware modeling tools. Furthermore,
JouleTrack [SC01] is an instruction-level energy estimation method which
does not consider the effect of cache misses or branch prediction.
The work presented in [GGP+14] develops a static analyzer that works on
LLVM IR. Their analysis generates cost relations which represent the cost of run-
ning the program in terms of its input. These cost relations are then solved to
generate the energy consumption.
Other promising works such as [W+15, J+06] perform WCEC analysis. The
work presented in [J+06] is, to the best of our knowledge, the first research
presenting WCEC analysis. On the other hand, [W+15] presents a comprehensive
approach to WCEC analysis. It presents a method that enables WCEC computations
based on relative energy models and measurements. It produces WCEC
estimates by combining different techniques for three different hardware platforms.
However, these works are only applicable to in-order pipelines.
The majority of the literature on WCEC analysis assumes, as we do, that a
worst-case cost model is either present or can be generated. Recall from Chapter 1
that modular approaches can generate a worst-case energy consumption for
each basic block in the low-level analysis phase. Although this has been a valid
assumption among the resource analysis community, researchers from the hardware
community doubt this point. Such researchers believe that due to the high
level of data dependency in the energy consumption domain, a worst-case energy
consumption value generated for a basic block in the low-level phase is either not
safe or very imprecise [PKME15]. For example, there have been cases where
the energy cost of a certain path returned as the WCEC path has varied by
up to 20% depending on the input data [KE15]. This issue
seems to be a valid point and the current approaches cannot perform full
data-dependent WCEC analysis. This issue is an open problem and can be a topic for
future research. In industry, it is addressed by adding a safety margin to the
bound generated by WCEC analyses.
5.2.2 Out-of-order Pipeline Analysis
In the literature, many works focus on in-order pipelines. However, the number of
works addressing both in-order and out-of-order pipeline analysis is limited. We
will review some of the more important methods.
Lundqvist et al. in [LS98, LS99] proposed an integrated approach using symbolic
execution for computing WCET bounds where all possible contexts were
enumerated. Although their method is able to generate safe bounds for complicated
execution patterns in pipelines, it does not scale to a reasonable level.
Three important algorithms have been proposed to address out-of-order execution
in pipelines. The first work, presented in [The04, H+03], modeled the out-of-order
processor PowerPC 755. It generates a reduced concrete pipeline model for
performing pipeline analysis. The reduced concrete pipeline states are maintained
for each basic block and are updated along the program points. However, a pipeline
state may have up to several different successor states where certain memory
accesses cannot be resolved to cache hit/cache miss. The analysis is sound; however,
it can take a long time (12 hrs on average for an avionics program), as reported
in [S+07], and consume a huge amount of memory.
The work presented by Li et al. [LRM06] first identifies the dependencies among
pipeline stages and then uses an iterative technique to estimate, as an interval, the
earliest/latest start and completion time of each basic block in the presence of
out-of-order execution. The process continues until it reaches a fixpoint. Although their
method considers overlapping between the execution of basic blocks in pipelines,
the paper does not present any proof that their method generates safe bounds.
Furthermore, their model does not take into account data cache analysis.
Rochange and Sainrat [RS09b] report in their experiments that the bounds
returned by [LRM06] are quite imprecise. Their method is an extension of [LRM06]
which attempts to decrease this imprecision by incorporating a set of parameters
to represent the usage of pipeline resources in each basic block. They
present a proof in their paper. However, their model only takes into consideration
out-of-order execution in pipelines and assumes a perfect cache.
Our method performs integrated WCEC analysis and considers (in-order and
out-of-order) pipeline, instruction and data cache states at the same time.
5.3 General Framework
In Chapter 2, we presented our resource analysis framework. In this section, we will
customize the analysis framework for WCEC analysis with emphasis on instruction
and data cache, and in-order superscalar pipelines.
5.3.1 Basic Operations and Energy Cost Model
Given a program point ℓ, an operation op ∈ Ops, and a symbolic store σ, we
denote the sequence of instructions in op by instr, and the function acc(ℓ, op, σ)
denotes the sequence of memory block accesses made by executing op at the symbolic
state s ≡ 〈ℓ, m, σ, ·〉. In this chapter, for short, we denote acc(ℓ, op, σ) by memseq.
In WCEC analysis the variable of interest r models the consumed energy. Note
that this variable is always initialized to 0 and the only operation allowed upon it
is a constant increment.
The WCEC analysis computes a sound and accurate bound for r in the end,
across all feasible paths of the program. Given a symbolic state s ≡ 〈`,m, [[s]]〉
and a transition tr : ` op−→ `′, the amount of increment at s by executing tr will be
evaluated by the energy consumption in different components in the superscalar
in-order processor and other energy consuming components. r is not used in any
other way.
The energy cost model used throughout this chapter is presented in Appendix B.
Our cost model uses parts of the energy cost model presented in [J+06], which we
have updated to fit our analysis framework. We have also indicated the parts
where our analysis can compute the energy cost more accurately.
5.3.2 The Machine State
In this chapter, the machine state consists of the data and instruction caches and
the pipeline. We therefore define the machine state attached to each symbolic
state as a tuple of the form 〈ci, cd, p〉, where ci and cd denote the instruction
and data caches and p is the pipeline state. A symbolic state is then denoted by
〈ℓ, m, σ, Π〉, where m ≡ 〈ci, cd, p〉.
A concrete pipeline state can be modeled by a set of 〈ins, s, t〉 tuples, where ins
denotes an instruction, s a pipeline stage and t the time remaining to finish
executing instruction ins in stage s. A pipeline state is then modeled as
[〈insi, sj, t1〉 . . . 〈insk, sl, t2〉], which indicates that insi is at stage sj and
needs t1 more clock cycles to finish its execution in stage sj, and so on.
An update function U can be defined that updates a pipeline state cycle-wise,
based on a previous pipeline state p′, the set of instructions in a basic block instBB
and the dependency graph of the basic block dep: p = U(p′, instBB, dep).
By cycle-wise update we mean that the pipeline state p′ is updated from cycle
to cycle until reaching p.
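The cycle-wise update can be illustrated with a minimal Python sketch. It assumes a scalar, stall-free 5-stage in-order pipeline in which every stage takes one clock cycle, and it ignores the dependency graph dep. The names STAGES, step and update are hypothetical and are not part of the framework.

```python
# Hypothetical sketch: scalar, stall-free 5-stage in-order pipeline in which
# every stage takes exactly one clock cycle. The dependency graph is ignored.
STAGES = ["IF", "ID", "EX", "WB", "CM"]


def step(pipeline, fetched=None):
    """Advance every (instruction, stage) entry by one cycle.

    Instructions that were in CM leave the pipeline. If `fetched` is given,
    it enters the IF stage in this cycle.
    """
    advanced = []
    for ins, stage in pipeline:
        nxt = STAGES.index(stage) + 1
        if nxt < len(STAGES):
            advanced.append((ins, STAGES[nxt]))
    if fetched is not None:
        advanced.append((fetched, "IF"))
    return advanced


def update(pipeline, instructions):
    """Cycle-wise update: feed one instruction per cycle, starting from `pipeline`."""
    for ins in instructions:
        pipeline = step(pipeline, ins)
    return pipeline


state = update([], ["ins1", "ins2", "ins3"])
print(state)
```

Calling step(state) repeatedly without a new instruction drains the pipeline one stage per cycle, which mirrors updating the state until all instructions have left it.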
The sequence of instructions, instr, is used to update the pipeline state and the
instruction cache state. The updMachineState function updates both the cache and
pipeline states. For the caches, the updMachineState function that was used to
update the cache state in Chapter 3 is applied to both the instruction and data
cache states. The pipeline state is updated following the standard process in
in-order superscalar processors.
5.3.3 Witness and Dominating Condition
In WCEC analysis, the witness (Γ) is the sequence of program points
demonstrating the most energy-consuming path in an explored subtree. Because
the pipeline is in-order, the contributions of Energyreg, Energywk, EnergyFPU and
EnergyMULTU are affected only by the sequence of instructions on the path. As a
result, the contributions of these energy-consuming elements remain static
regardless of the machine state reaching a node. Moreover, although EnergyALU1
and EnergyALU2 vary with the machine state, EnergyALU1 + EnergyALU2 remains
constant.
In the presence of an in-order pipeline, the rest of the energy-consuming elements
in a path cannot be considered static. These dynamic elements are the instruction
cache energy Energyic, the data cache energy Energydc, the clock energy clockpath,
the selection logic energy selectionpath, the leakage energy leakagepath and the
sum of the switch-off energy in the different components, Switchoff. The witness is
therefore stored as the sequence of instructions on the witness path, which is used
to measure the energy consumption of these elements dynamically, based on the
incoming machine state, and to add it to the energy cost of the static items. As a
result, the energy consumption of a witness path is obtained dynamically and can
vary with the incoming machine state.
In WCEC analysis, the dominating condition (δ) should guarantee dominance
in the presence of both the cache states and the pipeline state. As briefly explained
in Section 5.1.1, the dominating condition compares the energy cost of the
witness, evaluated on the machine state at the reuse point, with the maximum
energy cost from the successor nodes to the end of the subtree plus the energy from
the root to the successor nodes.
In general, the energy cost is affected by the cache states and by the data
dependencies and contentions in the in-order pipeline. Due to the in-order nature of
the pipeline, contentions are always resolved in favor of the earlier instruction,
so they are the same regardless of the context.
In order to account for data dependencies, the successor nodes are chosen after
a window of instructions such that the instructions from before the root of the
subtree have been committed. As a result, the data dependencies after these
successor nodes remain the same and the maximum energy cost of the paths after
the successor nodes can be computed using an empty cache.
The window of instructions from the root to the successor nodes is the set of
epilogue instructions in all the paths emanating from the node at the reuse point.
In general, since our framework performs the analysis on the symbolic execution
tree and with respect to the in-order pipeline model, the set of epilogue
instructions of such paths contains at most 8 − 1 = 7 instructions (Figure 5.4).
Finally, the energy cost from the root to the successor nodes is computed based on
the new machine state, which accounts for the different cache access times and
data dependencies in the new pipeline state.
As a result, the dominating condition is a set of instruction sequences from the
root to the successor nodes, each with a respective maximum energy from the
successor node to the end of the sub-path. At reuse time, the energy cost of each
instruction sequence is added to the respective maximum energy stored with it and
compared to the energy cost of the witness in the new context. If the energy cost
of the witness is still higher, the summarization can be reused and the energy cost
of the witness is returned (as illustrated in Figure 5.4). In our experiments, the
number of successor nodes is bounded by O(1). We next present the customized
Summarize-a-Trans, Combine-Witness and Merge-Witness functions.
Figure 5.4: Window of Epilogue Instructions from Root to Successor Nodes
function Summarize-a-Trans(s, tr)
Let s be 〈ℓ, m, σ, ·〉 and let tr be ℓ −op→ ℓ′
Let m ≡ 〈ci, cd, p〉
〈143〉 Γ := Sequence of instructions in op
〈144〉 Energyp := Energy(p);
〈145〉 Execute Instructions in p
〈146〉 wcec := Energy(p) − Energyp;
〈147〉 return [ℓ, true, Γ, wcec, true, op∆]
end function
Figure 5.5: Summarize-a-Trans Function
Summarize-a-Trans Function
Summarize-a-Trans computes a summarization for a single transition tr at state
s. This can be seen as a basic step in our algorithm. Because no infeasible path
has been discovered, the interpolant Ψ is just true. There is a single path, thus
the witness is simply the sequence of the instructions in the transition and the
dominating condition is also simply true. Moreover, the wcec is measured as the
difference in the energy cost of the pipeline state before and after executing the
instructions (lines 144 - 146). The abstract transformer ∆p is the operation op
itself, but translated to the language of input-output relations. As an example,
y := y + 1 is translated to yout = yin + 1. We use op∆ to denote such a translated
op.
function Merge-Witness(s,Γ1,Γ2, δ1, δ2)
Let s be 〈ℓ, m, σ, ·〉 and let tr be ℓ −op→ ℓ′
Let m ≡ 〈ci, cd, p〉
〈148〉 if (max(Γ1, p) ≤ max(Γ2, p))
〈149〉 swap(Γ1,Γ2), swap(δ1, δ2)
〈150〉 instrsucc := Generate-Succ(Γ2);
〈151〉 maxsucc := Max-from-Succ(Γ2);
〈152〉 δ := Union(δ1, δ2)
〈153〉 δ := δ ∧ 〈instrsucc,maxsucc〉
〈154〉 return {Γ1, δ}
end function
Figure 5.6: Merge-Witness Function
The Combine-Witness and Merge-Witness Functions
The Merge-Witness function in Figure 5.6 presents the step that horizontally
merges two witnesses and their respective dominating conditions. Due to the
complicated nature of WCEC analysis with a pipeline, the pipeline state is also
needed in order to choose the dominating path. In the first step, the witness with
the higher energy consumption is chosen (lines 148 to 149). We now elaborate on
the computation of the dominating condition. The dominating condition consists of
the instructions up to the successor node and the maximum energy cost from the
successor node to the end of the path. The union of the two dominating conditions
is computed (line 152). Next, a successor node is generated on the second witness,
and the instructions from the root to the successor node (generated in line 150)
together with the maximum energy cost from the successor node to the end of the
subtree (generated in line 151) are added to the dominating condition (line 153).
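The merge step can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the callbacks energy, generate_succ and max_from_succ are hypothetical stand-ins for the machinery behind max, Generate-Succ and Max-from-Succ, the pipeline state is hidden inside energy, and the toy cost table is invented for the example.

```python
def merge_witness(w1, w2, d1, d2, energy, generate_succ, max_from_succ):
    """Merge two witnesses and their dominating conditions (illustrative only)."""
    # Keep the witness that consumes more energy; the other one is dominated.
    if energy(w1) <= energy(w2):
        w1, w2 = w2, w1
        d1, d2 = d2, d1
    # Union of the two dominating conditions, order-preserving, no duplicates.
    delta = list(d1) + [c for c in d2 if c not in d1]
    # Record the dominated witness as a pair of
    # (instructions up to the successor node, max energy after the successor node).
    cond = (tuple(generate_succ(w2)), max_from_succ(w2))
    if cond not in delta:
        delta.append(cond)
    return w1, delta


# Toy stand-ins: a fixed energy per instruction, successor node after two instructions.
COST = {"a": 5, "b": 7, "c": 2, "d": 9}
energy = lambda w: sum(COST[i] for i in w)
generate_succ = lambda w: w[:2]
max_from_succ = lambda w: energy(w[2:])

witness, delta = merge_witness(
    ["a", "c", "c"], ["b", "d", "a"], [], [],
    energy, generate_succ, max_from_succ,
)
print(witness, delta)
```

In this toy run the second witness costs more, so it is kept, and the first witness contributes one pair to the dominating condition.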
function Combine-Witness(s, tr,Γ′, δ′)
Let s be 〈ℓ, m, σ, ·〉
Let tr contain {instr}
Let m ≡ 〈ci, cd, p〉
〈155〉 Γ = Γ′
〈156〉 foreach insi ∈ instr do
〈157〉 Γ = Γ + insi
endfor
〈158〉 δ := [ ]
〈159〉 foreach 〈instrsucc,maxsucc〉 ∈ δ′ do
〈160〉 δ := Union(δ , update-succ(instrsucc,maxsucc,Γ))
endfor
〈161〉 return {Γ, δ}
end function
Figure 5.7: Combine-Witness function
The Combine-Witness function in Figure 5.7 combines two witnesses and
dominating conditions. Note that tr can be an abstract transition, by which we
mean an abstraction of several transitions between program points ℓ1 and ℓ2. An
abstract transition is the summarization between ℓ1 and ℓ2. As a result, in case
tr is an abstract transition, instr is generated from the witness (Γ) in the
summarization.
First, each of the instructions in the respective transition (instr) is added
to the beginning of the witness list Γ (line 157). Next, the dominating condition is
updated with regard to the newly generated witness Γ in line 160. In other words,
for each 〈instrsucc, maxsucc〉 in δ′ the update-succ function updates
〈instrsucc, maxsucc〉. Note that the update-succ function can return more than one
〈instrsucc, maxsucc〉 pair. The output of the update-succ function is then added to
δ by union.
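A minimal Python sketch of the combine step is given here for illustration. It follows the prose description, in which the instructions of the transition are placed at the beginning of the witness. The update_succ used in the example is a hypothetical toy that merely extends each stored instruction window with the instructions of the transition; the behavior of the actual update-succ function is not specified in this section.

```python
def combine_witness(instr, witness, delta, update_succ):
    """Prepend a transition to a witness and refresh the dominating condition."""
    # The instructions of the transition come before the witness of the subtree.
    new_witness = list(instr) + list(witness)
    new_delta = []
    for instr_succ, max_succ in delta:
        # update_succ may return several (instr_succ, max_succ) pairs.
        for cond in update_succ(instr_succ, max_succ, new_witness):
            if cond not in new_delta:
                new_delta.append(cond)
    return new_witness, new_delta


# Toy stand-in (an assumption): extend each stored window with the transition.
instr = ["i1", "i2"]
update_succ = lambda instr_succ, max_succ, w: [(tuple(instr) + tuple(instr_succ), max_succ)]

w, d = combine_witness(instr, ["i3", "i4"], [(("i3",), 40)], update_succ)
print(w, d)
```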
5.3.4 Generating Machine State Summaries
In this chapter, the machine state summary should be redefined. In Chapter 3,
we introduced the cache summary, which captured the relation between the abstract
cache components. The cache summary was needed when a summarization was
reused across a loop iteration. In the presence of in-order execution pipelines, the
pipeline and cache states cannot be merged.
As a result, the witness (the most energy-consuming path) is used to generate
the pipeline and cache states at the end of the reused subtree. The combination
of the cache summaries and the witness acts as the machine state summary.
Customizing the Combine-Summary and Merge-Summary functions would be quite
similar to the functions presented in Chapter 3.
5.4 An Example Analysis
In the example presented in this section, we illustrate the logic of the reuse
step, which ensures the dominance of the witness over another path in a subtree
with respect to a new machine state. Moreover, this example clarifies that the
reuse step is sound.
Consider the CFG and the symbolic execution tree in Figure 5.8. Each rectangle
in the CFG represents a basic block in which the program point (e.g., 〈1a〉) and
the instructions in the basic block can be seen. In front of each instruction, the
respective functional unit (executing it) is depicted in red. For simplicity, in this
example we assume a perfect data cache. The pipeline is a scalar 5-stage in-order
pipeline, initially empty, where every stage takes exactly 1 clock cycle (CC) to
complete. The functional units in the EX stage of the pipeline are MEM and ALU.
Note that in
Figure 5.8: (a) a CFG (with instructions and their respective execution unit shown in
each block); and (b) Our Analysis Tree
The analysis starts with an empty pipeline state at 〈1a〉. The pipeline states at
〈2a〉 and 〈4a〉 are [〈ins1,ID〉, 〈ins2,IF〉] (CC 2) and
[〈ins1,MEM〉, 〈ins2,ID〉, 〈ins3,IF〉] (CC 3), respectively. The analysis continues
updating the pipeline state cycle-wise until all the instructions leave the
pipeline. The energy cost of the sub-path from 〈4a〉 would be 123 mJoules. The
pipeline state at each clock cycle can be seen in Table 5.1.
Note that the pipeline state at program point 〈7a〉 does not contain any
instruction from before the root of the subtree (〈4a〉). As a result, 〈7a〉 is chosen as
the successor node. The instructions between the root 〈4a〉 and the successor node
〈7a〉 are ins5, ins6, ins7, ins8, ins9. Moreover, the maximum energy cost from the
successor node 〈7a〉 till the end of the path is stored with the instruction window:
[〈7〉, 〈ins5, ins6, ins7, ins8, ins9〉, 108].
Table 5.1: Pipeline States Along the First Path
PP    Pipeline State    Clk
〈4a〉 [〈ins1,MEM〉, 〈ins2,ID〉, 〈ins3,IF〉] 3
      [〈ins1,WB〉, 〈ins2,MEM〉, 〈ins3,ID〉, 〈ins5,IF〉] 4
〈5a〉 [〈ins1,CM〉, 〈ins2,WB〉, 〈ins3,ALU〉, 〈ins5,ID〉, 〈ins6,IF〉] 5
      [〈ins2,CM〉, 〈ins3,WB〉, 〈ins5,MEM〉, 〈ins6,ID〉, 〈ins7,IF〉] 6
      [〈ins3,CM〉, 〈ins5,WB〉, 〈ins6,MEM〉, 〈ins7,ID〉, 〈ins8,IF〉] 7
〈7a〉 [〈ins5,CM〉, 〈ins6,WB〉, 〈ins7,ALU〉, 〈ins8,ID〉, 〈ins9,IF〉] 8
      [〈ins6,CM〉, 〈ins7,WB〉, 〈ins8,ALU〉, 〈ins9,ID〉, 〈ins14,IF〉] 9
      [〈ins7,CM〉, 〈ins8,WB〉, 〈ins9,ALU〉, 〈ins14,ID〉, 〈ins15,IF〉] 10
      [〈ins8,CM〉, 〈ins9,WB〉, 〈ins14,ALU〉, 〈ins15,ID〉] 11
      [〈ins9,CM〉, 〈ins14,WB〉, 〈ins15,MEM〉] 12
      [〈ins14,CM〉, 〈ins15,WB〉] 13
      [〈ins15,CM〉] 14
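The choice of the successor node on the first path can be replayed with a small Python sketch. It assumes that ins1, ins2 and ins3 are the instructions from before the root 〈4a〉, and it copies the pipeline contents at the labeled program points of Table 5.1 (instruction indices only). The function name pick_successor is hypothetical.

```python
# Pipeline contents (instruction indices only) at the labeled program points
# of Table 5.1, in the order in which they are reached.
states = [
    ("4a", [1, 2, 3]),
    ("5a", [1, 2, 3, 5, 6]),
    ("7a", [5, 6, 7, 8, 9]),
]

# Assumption: these instructions come from before the root of the subtree.
prefix = {1, 2, 3}


def pick_successor(states, prefix):
    """Return the first program point after the root whose pipeline state
    holds no instruction from before the root, or None if there is none."""
    for point, in_flight in states[1:]:  # skip the root itself
        if prefix.isdisjoint(in_flight):
            return point
    return None


succ = pick_successor(states, prefix)
print(succ)
```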
Moving to the right sub-path, the energy cost of the second path from 〈4a〉
would be 125 mJoules and the pipeline state at the program points can be seen in
Table 5.2.
Table 5.2: Pipeline States Along the Second Path
PP Pipeline State Clk
〈4a〉 [〈ins1,MEM〉, 〈ins2,ID〉, 〈ins3,IF 〉] 3
〈6a〉 [〈ins1,CM〉, 〈ins2,WB〉, 〈ins3,ALU〉, 5
〈7b〉 [〈ins6,CM〉, 〈ins10,WB〉, 〈ins11,MEM〉, 〈ins12,ID〉, 〈ins13,IF 〉] 8
After traversing the second path, similar to the first path, an instruction window
and the maximum energy for the rest of the path are generated and stored:
[〈7〉, 〈ins5, ins6, ins10, ins11, ins12, ins13〉, 52].
Continuing to 〈4a〉, the two paths are compared and the path on the right is
chosen as the witness. Next, the [SuccNode, Witness, MaxEnergy] of the other
path is stored as the dominating condition.
Fast-forwarding the analysis to the right subtree at 〈4b〉, this dominating
condition is tested on the pipeline state [〈ins1,MEM〉, 〈ins2,ID〉, 〈ins4,IF〉] (CC
3). Assume that there is some dependency between the nodes 〈ins4, CM〉 and
〈ins5, MEM〉.
The energy cost of the witness with regard to the current pipeline state is
computed, which is 128 mJoules. Next, the energy cost of the instructions in the
instruction window in the new pipeline state is computed, which is 71 + 3 = 74
mJoules this time. This energy cost is added to the maximum energy stored with the
instruction window and compared to the energy cost of the witness: 74 + 52 < 128.
Since the energy cost of the witness is larger, we can reuse the summarization and
return the energy cost of the witness as the WCEC of the subtree.
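The arithmetic of this reuse test can be written as a short Python sketch. The numbers are the ones quoted in the example (128, 74 and 52 mJoules); the function can_reuse and the constant window-cost callback are hypothetical simplifications, since the real window cost is obtained by replaying the window on the new pipeline state.

```python
def can_reuse(witness_energy, dominating_condition, window_energy):
    """Reuse is allowed if the witness still costs more than every stored
    (instruction window, max energy after successor node) pair."""
    return all(
        window_energy(window) + max_succ < witness_energy
        for window, max_succ in dominating_condition
    )


# Instruction window and stored maximum energy taken from the example.
window = ("ins5", "ins6", "ins10", "ins11", "ins12", "ins13")
condition = [(window, 52)]

# In the new pipeline state the window costs 71 + 3 = 74 mJoules.
reusable = can_reuse(128, condition, lambda w: 71 + 3)
print(reusable)
```

If the witness had cost 126 mJoules or less in the new context, the test would fail and the subtree would have to be explored again.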
The analysis then continues and arrives at the WCEC of the whole program,
which is 150.
5.5 Experimental Evaluation
We used an Intel Core i5 processor at 3.2 GHz with 4 GB of RAM for our
experiments and built our system upon CLP(R) [JMSY92], with Z3 as the constraint
solver, thus providing an accurate test for feasibility. The analysis was performed
on LLVM IR which, while being expressive enough, maintains the general CFG of
the program. Our framework receives a C program as input and produces a
transition system from the LLVM IR of the program. The LLVM instructions are
simulated for a RISC architecture. We use Clang 3.2 [Cla14] to generate the IR. The
data and instruction cache settings in our experiments were similar to those of
Chapter 3 and the pipeline settings have been presented in Section 5.1.2.
Table 5.3 presents the results of our analysis on a set of benchmarks. The
benchmarks used in our experiments are embedded programs (fft1 from
MiBench [G+01] and the rest from [M0¨6]). The second column presents the size
parameter of the benchmarks. We compare the result of our analysis framework,
Unroll d, with the cycle-accurate power simulator Wattch [BTM00]. The third
column reports the result of the simulation on Wattch. The State and Reuse columns
in Table 5.3 present the number of visited states and reused states in the analysis.
The Time column reports the analysis time and the WCEC column reports the
worst-case energy consumption for the benchmarks.
5.5.1 Discussion on Precision
We have compared our analysis results with the Wattch power simulator. As
can be seen in Table 5.3 and Figure 5.9, in all cases our analysis result is higher
than the Wattch simulation. Among the benchmarks, we were able to simulate the
worst-case path on edn, matmult, ns, fdct, bubblesort and jfdctint. On these
benchmarks, the WCEC estimated by our analysis is compared to the actual
WCEC of the program. The results show that, on average, our analysis
overestimates by 73%. On the other benchmarks, finding the worst-case path is not
trivial, so we compared our analysis result with the simulation of the benchmarks
without changes. As a result, in some cases the simulated path might be far from
the WCEC path.
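As an illustration of how such an average can be obtained, the following Python sketch computes the relative overestimation for the six benchmarks named above from the values in Table 5.3. The averaging method is an assumption: the sketch takes the arithmetic mean of the per-row ratio WCEC / simulation − 1 over all listed sizes of these benchmarks. Under this reading the mean comes out at roughly 73%.

```python
# (benchmark and size, Wattch simulation in nJ, estimated WCEC in nJ),
# copied from Table 5.3 for the benchmarks whose worst-case path was simulated.
rows = [
    ("edn", 1964967.475, 3909066.00),
    ("matmult 20", 4379671.380, 7290374.00),
    ("ns 5", 358722.764, 645745.80),
    ("ns 10", 6165369.361, 9531186.00),
    ("ns 20", 107642166.000, 147183400.00),
    ("fdct 8", 125910.057, 205605.60),
    ("jfdctint", 302495.221, 354978.80),
    ("bubblesort 25", 311460.787, 314025.60),
    ("bubblesort 50", 732892.352, 1279203.80),
    ("bubblesort 100", 1518204.211, 5163788.00),
]

# Relative overestimation of the analysis with respect to the simulation.
over = {name: wcec / sim - 1.0 for name, sim, wcec in rows}

# Assumption: plain arithmetic mean over all rows.
average = sum(over.values()) / len(over)

for name, value in over.items():
    print(f"{name:<15} {value:7.1%}")
print(f"average         {average:7.1%}")
```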
Table 5.3: Results of Analysis of Benchmarks with In-order Pipeline
                            Wattch    Unroll d
Benchmark    Size  Simulation (nJ)    Time  State  Reuse      WCEC (nJ)
edn                    1964967.475  606.42   2047    571     3909066.00
compress                170967.348  268.64   1666    398    15205680.00
ndes                    634466.991   27.11    714    246     1848224.00
adpcm                 Not Runnable  262.51   1100    255      265783.00
fft1            8       181638.497    2.87    313     51      546861.00
matmult        20      4379671.380    2.16    462    114     7290374.00
ns              5       358722.764    0.11     78     18      645745.80
ns             10      6165369.361    0.24    142     40     9531186.00
ns             20    107642166.000     0.6    266     78   147183400.00
fir                     364811.870    3.78    343    129      685058.00
ud              5       301514.463    1.41    514     53      302174.40
expint                  118812.395   25.26    867    247      147140.00
cnt            10     Not Runnable    0.25    142     38      133072.80
fdct            8       125910.057    0.14     56     14      205605.60
jfdctint                302495.221    1.41    251     77      354978.80
bubblesort     25       311460.787    0.62    194     40      314025.60
bubblesort     50       732892.352    2.41    387     84     1279203.80
bubblesort    100      1518204.211   13.02    773    172     5163788.00
two shapes     50       165419.981    2.02    375     47      179106.20
two shapes    100       428837.735   10.73    750     97      695741.20
two shapes    200      1455017.221   89.09   1500    197     2742264.00
5.5.2 Discussion on Scalability
The three groups of benchmarks used in our experiments are separated by
a double line:
Benchmarks with Long Witnesses: The first group contains benchmarks with
mainly long nested loops. The witness becomes quite long in these benchmarks,
and in order for our analysis to scale we had to make some adjustments to the
analysis of these benchmarks: we overestimate the later part of the witness to
force its size to remain within our threshold. This group of benchmarks
illustrates the performance of our method in analyzing benchmarks with very long
paths. Due to the large number of infeasible paths in these benchmarks, their
analysis also demonstrates the precision of our proposed analysis.
Figure 5.9: Comparison of Estimated WCEC with Wattch Simulation
Benchmarks with Long Witnesses and Complicated Loops: This group
of benchmarks contains standard programs from [M0¨6] which, although they have
long witnesses, do not exceed the threshold on the length of the witness. The
reason is that in these benchmarks, the number of iterations of the complicated
loop patterns is precisely measured by the loop unrolling technique of our
analysis. As a result, the length of the witness remains below the threshold and
the analysis can generate precise results for these benchmarks.
Academic Benchmarks: Although the loops in these benchmark programs are
considered to be simple, they contain memory accesses which might be resolved to
a range of memory addresses, leading to imprecision in low-level analysis
approaches that employ a fixed-point computation.
In conclusion, with the adjustment to handle long witness paths, our framework
is able to scale to benchmarks of reasonable size. This adjustment can affect the
precision of our results for the benchmarks analyzed with it, but the
overestimation still falls in an acceptable range.
One other limitation of our framework is that the processor model does not
consider all micro-architectural features. In the comparison with the Wattch
simulator, we tried to use a similar processor setting. We had to add a bounded
overestimation to handle the missing micro-architectural features (such as
branch prediction). Although our estimates remain sound, a more accurate
processor model can increase the precision of our analysis. In other words,
our analysis is not bound to a specific pipeline model, and with a more accurate
processor model a more precise analysis can be provided.
5.6 Extension to Out-of-order Pipelines
In Section 5.1.2, we stated that most embedded systems with limited access to
energy operate on in-order pipelines. However, precise and scalable out-of-order
pipeline analysis has long been an open problem. We explored three
state-of-the-art out-of-order pipeline analysis methods in Section 5.2.2 and
elaborated on the existing issues in out-of-order pipeline analysis.
We now present a further discussion on extending the analysis framework
presented in this chapter to out-of-order pipelines. We update the dominating
condition to preserve soundness in the presence of the timing anomalies
introduced by out-of-order pipelines.
The extension in this section is presented to demonstrate that our WCEC
analysis framework can also be applied to out-of-order superscalar pipelines
together with instruction and data caches. We note that, in order to remain
sound, we need to be conservative in the reuse step. This might affect the
scalability of our analysis on out-of-order pipelines.
Figure 5.10: Out-of-order processor model
Figure 5.10 presents the out-of-order superscalar processor which we will be
using in the analysis framework in this section. The processor has a standard
5-stage pipeline consisting of Instruction Fetch (IF), Instruction Decode &
Dispatch (ID), Instruction Execute (EX), Write-Back (WB) and Commit (CM). The
instruction fetch, decode and commit stages are performed in program order, while
in the execute and write-back stages instructions can be processed out of order:
1. Instruction Fetch (IF): In this stage, instructions are fetched from
memory in program order into the instruction fetch buffer (I-buffer). We
assume a 4-entry I-buffer in this section, into which two instructions can be
loaded per clock cycle.
2. Instruction Decode & Dispatch (ID): In this stage, the instructions in the
I-buffer are decoded and dispatched in program order into the ROB.
We assume an 8-entry ROB in this section. Instructions are stored in this
buffer from the time they are dispatched to the time they are committed.
At most two instructions can be decoded and dispatched per clock cycle.
3. Instruction Execute (EX): In this stage, the earliest instruction in the
ROB whose operands are ready is issued to its corresponding functional
unit for execution. This stage executes instructions out of order, and a new
instruction can be assigned to an execution unit when the previous
instruction in the unit has finished execution. The functional units
in the processor model are two ALU units, one MULTU unit for integer
multiplications, one FPU unit for floating point multiplications and one unit
for load/store instructions.
4. Write-Back (WB): In this stage, the execution of an instruction has finished
and its results are passed to the waiting instructions in the ROB. In case all
other operands of an instruction in the ROB are ready, the instruction can be
executed in the next cycle. For the sake of simplicity, we assume that an
instruction writes back immediately after finishing execution. The WB stage is
also an out-of-order stage.
5. Commit (CM): In this stage, the instructions which have finished the WB
stage write their output to the registers and free their ROB entries in
program order. Two instructions can commit per cycle.
5.6.1 Timing Anomaly
Out-of-order execution in modern processors can result in timing anomalies.
A timing anomaly refers to the case where a local worst case/best case does
not result in the global worst case/best case. Consider an instruction I with the
possible execution times E1 and E2, which lead to the worst-case execution
time estimations T1 and T2, respectively. E1 and E2 can differ for reasons
such as a cache hit versus a cache miss. A timing anomaly happens when T1 > T2
but E1 < E2. Generally, in worst-case execution time analysis, we need to keep
track of the state of the hardware components and their possible effects on the
overall worst-case execution time.
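The condition above can be stated as a small Python predicate. The function name is_timing_anomaly and the sample numbers are hypothetical; the predicate also covers the symmetric case in which the roles of the two outcomes are swapped.

```python
def is_timing_anomaly(e1, e2, t1, t2):
    """Check the anomaly condition for one instruction.

    e1, e2: the two possible local execution times of the instruction.
    t1, t2: the global execution times that result from e1 and e2.
    The locally faster outcome must lead to the globally slower execution.
    """
    return (e1 < e2 and t1 > t2) or (e1 > e2 and t1 < t2)


# Hypothetical numbers: a cache hit (1 cycle) versus a cache miss (10 cycles).
print(is_timing_anomaly(1, 10, 30, 25))  # hit is locally faster, globally slower
print(is_timing_anomaly(1, 10, 20, 25))  # hit is faster both locally and globally
```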
Three types of timing anomaly have been identified by Reineke et al. [R+06].
The scheduling timing anomaly, which has been thoroughly studied, happens when
a faster execution time of a basic block leads to the global worst case. The
second form, the speculation timing anomaly, occurs when an initial cache hit
changes the order of the executed instructions and increases the overall
execution time. Finally, the third type is the cache timing anomaly, which is
caused by non-LRU cache replacement policies.
In the presence of an out-of-order pipeline, the effect of timing anomalies should
be considered in worst-case execution time analysis. This issue remains valid
for WCEC analysis, and it becomes worse since energy is more data dependent
than timing.
5.6.2 Dominating Condition and Machine State Summary
In the presence of an out-of-order pipeline, the dominating condition and the
machine state summary should be updated to ensure that the safety of our analysis
is maintained. Due to the out-of-order pipeline, the local worst case does not
generally result in the global worst case, and the reuse test presented
in Section 5.3.3 is no longer sound. The domination test should ensure that all
the components affecting the energy cost of the paths in a subtree remain the same
at reuse time. As a result, the dominating condition here would be a record of the
ready time and start time of the instructions at the successor nodes. The ready
time is the time at which an instruction is ready to enter a pipeline stage and
the start time is the time at which it actually enters that stage. Due to pipeline
stalls, these two times might not be the same. For example, ReadyIF1 ≤ StartIF1
indicates that instruction 1 never starts stage IF before it is ready for it.
In other words, the difference between the ready time and the start time indicates
a pipeline stall in that stage. Since the pipeline model is non-preemptive, by
storing these times and checking them at the successor nodes at the reuse point,
we can make sure that the components affecting the energy cost are the same at
the reuse point. The drawback of this reuse test is that the number of conditions
can grow exponentially.
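The relation between ready time, start time and stalls can be illustrated with a short Python sketch. The record layout and the sample times are hypothetical; the sketch only shows how a stall is derived as the difference between the two times and is not the reuse test itself.

```python
# Hypothetical record: (instruction, stage) -> (ready time, start time),
# both measured in clock cycles.
record = {
    ("ins1", "IF"): (1, 1),
    ("ins2", "IF"): (2, 2),
    ("ins5", "EX"): (6, 8),  # ready at cycle 6, enters EX only at cycle 8
}


def stalls(record):
    """Return the stall length for every (instruction, stage) entry whose
    start time differs from its ready time."""
    return {
        key: start - ready
        for key, (ready, start) in record.items()
        if start != ready
    }


print(stalls(record))
```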
In order to address this issue, we adopt the concept of dependency graph from
[LRM04]. As a result, besides the dominating condition from Section 5.3.3 a subset
of the dependency graph is stored and tested at reuse point. We will illustrate
the dominating condition test based on the dependency graph, in the example
presented in Section 5.6.4.
Dependency Graph
Dependency graphs can model the dependencies and contentions that may
occur in pipelines [B+06]. The dependency graph models the following
dependencies and contentions [LRM06]:
1. Dependencies among pipeline stages of the same instruction
2. Dependencies due to finite-sized buffers and queues
3. Dependencies due to out-of-order execution stages
4. Data dependencies among instructions
5. Contention relations modeling structural hazards in the pipeline
We should note that the dominating condition, though sound, is conservative
in the analysis of large programs, and our analysis of out-of-order pipelines
might not scale to large benchmarks.
5.6.3 Analysis of Loops
In order for the analysis to remain sound, our analysis avoids merging at any
point inside the symbolic execution tree. As a result, there will be no need to store
machine state summaries. Note that by avoiding merge at any point inside the
loop our analysis will remain sound and will be quite precise. However, the size of
the symbolic execution tree can grow exponentially in terms of the loop size. This
can affect the scalability of our method.
5.6.4 Example Analysis
Consider the CFG and the symbolic execution tree of the example presented
in Section 5.4. This time we analyze the example program in Figure 5.8(a) on
an out-of-order pipeline. The pipeline is a scalar 5-stage pipeline, initially
empty, where every stage takes exactly 1 clock cycle (CC) to complete. The IF, ID
and CM stages are in-order and the EX and WB stages are out-of-order. The
functional units in the EX stage of the pipeline are MEM and ALU. For brevity, in
this example we only explain the extra dominating condition for the out-of-order
pipeline.
As before, the analysis starts with an empty pipeline state at 〈1a〉. The
pipeline states at 〈2a〉 and 〈4a〉 are [〈ins1,ID〉, 〈ins2,IF〉] (CC 2) and
[〈ins1,MEM〉, 〈ins2,ID〉, 〈ins3,IF〉] (CC 3), respectively. The analysis continues
updating the pipeline state cycle-wise until all the instructions leave the
pipeline. The energy cost of this path would be 197 mJoules. The pipeline state at
each clock cycle can be seen in the table below:
PP Pipeline State Clk
〈4a〉 [〈ins1,MEM〉, 〈ins2,ID〉, 〈ins3,IF 〉] 3
[〈ins1,WB〉, 〈ins2,MEM〉, 〈ins3,ID〉, 〈ins5,IF 〉] 4
〈5a〉 [〈ins1,CM〉, 〈ins2,WB〉, 〈ins3,ALU〉, 〈ins5,ID〉, 〈ins6,IF 〉] 5
[〈ins2,CM〉, 〈ins3,WB〉, 〈ins5,MEM〉, 〈ins6,ID〉, 〈ins7,IF 〉] 6
[〈ins3,CM〉, 〈ins5,WB〉, 〈ins6,MEM〉, 〈ins7,ID〉, 〈ins8,IF 〉] 7
〈7a〉 [〈ins5,CM〉, 〈ins6,WB〉, 〈ins7,ALU〉, 〈ins8,ID〉, 〈ins9,IF 〉] 8
[〈ins6,CM〉, 〈ins7,WB〉, 〈ins8,ALU〉, 〈ins9,ID〉, 〈ins14,IF 〉] 9
[〈ins7,CM〉, 〈ins8,WB〉, 〈ins9,ALU〉, 〈ins14,ID〉, 〈ins15,IF 〉] 10
[〈ins8,CM〉, 〈ins9,WB〉, 〈ins14,ALU〉, 〈ins15,ID〉] 11
[〈ins9,CM〉, 〈ins14,WB〉, 〈ins15,MEM〉] 12
[〈ins14,CM〉, 〈ins15,WB〉] 13
[〈ins15,CM〉] 14
After traversing the first path, the dependence graph at 〈5a〉 is as follows:
[Figure: dependence graph at 〈5a〉, including the nodes IF,ins14, ID,ins14, ALU,ins14, WB,ins14 and CM,ins14]
Note that the prefix instructions ins1, ins2, ins3, ins5 and ins6 are added to the
dependence graph since these instructions exist in the pipeline state at 〈5a〉.
Moreover, the instructions ins7, ins8, ins9 and ins14 are added to the dependence
graph, since these instructions exist in the pipeline state when the last prefix
instruction ins6 is committed (at CC 9). All the black edges in this dependence
graph would always hold in future visits to 〈5〉. However, the blue edges might not
hold. So, we generate a reduced dependence graph:
[Figure: reduced dependence graph at 〈5a〉, including the nodes IF,ins14, ID,ins14, ALU,ins14, WB,ins14 and CM,ins14]
Next, we will generalize it and store it as the dominating condition:
Moving to the right sub-path, the execution time of the second path would be 15
CCs and the pipeline state at the program points can be seen in the table below:
PP Pipeline State Clk
〈4a〉 [〈ins1,MEM〉, 〈ins2,ID〉, 〈ins3,IF 〉] 3
〈6a〉 [〈ins1,CM〉, 〈ins2,WB〉, 〈ins3,ALU〉, 5
〈7b〉 [〈ins6,CM〉, 〈ins10,WB〉, 〈ins11,MEM〉, 〈ins12,ID〉, 〈ins13,IF 〉] 8
[Figure: dependence graph for this path; last row of nodes: IF,ins13, ID,ins13, MEM,ins13, WB,ins13, CM,ins13]
The reduced dependence graph is as follows:
[Figure: reduced dependence graph; not recoverable from the source]
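Moreover, the generalized dependence graph would be as follows:
[Figure: generalized dependence graph over the nodes MEM,Pre, MEM,ins12, MEM,ins11, MEM,ins10, MEM,ins13; edges not recoverable from the source]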
Continuing to 〈4a〉, the generalized dependence graphs from 〈5a〉 and 〈6a〉 are
merged with the generalized dependence graphs at 〈4a〉, and the result is:
[Figure: merged generalized dependence graphs; node rows as extracted:
(MEM,ins6, MEM,ins12, MEM,ins11, MEM,ins10, MEM,ins13, MEM,Pre, MEM,ins5);
(ALU,Pre, ALU,ins9, ALU,ins8, ALU,ins7, ALU,ins14);
(MEM,ins6, MEM,Pre, MEM,ins5); edges not recoverable from the source]
Moving to the right subtree at 〈4b〉, this dominating condition is tested on the pipeline
state [〈ins1,MEM〉, 〈ins2,ID〉, 〈ins4,IF〉] (CC 3). Assume that there is some dependency
between the nodes CM,ins4 and MEM,ins5. As a result, this time ins5 is delayed and
[Figure not recoverable from the source]
The pipeline state is updated cycle by cycle until the instructions in the generalized
dependence graphs at 〈4b〉 reach the EX stage. We then test whether the edges in the generalized
dependence graphs still hold, which is the case here. So the summarization for 〈4〉 is reused,
the pipeline state on the right sub-path (〈4b〉, 〈6c〉, 〈7c〉) is replayed, and the total energy
cost for the path is computed, which is 221 mJoules:
PP Pipeline State Clk
〈4b〉 [〈ins1,MEM〉, 〈ins2,ID〉, 〈ins4,IF 〉] 3
〈6c〉 [〈ins1,CM〉, 〈ins2,WB〉, 〈ins4,ALU〉, 〈ins5,ID〉, 〈ins6,IF 〉] 5
〈7c〉 [〈ins5,WB〉, 〈ins6,ST 〉, 〈ins10,WB〉, 〈ins11,MEM〉, 〈ins12,IF 〉, 〈ins13,IF 〉] 8
Continuing the analysis to the root, the WCEC of the whole tree is returned as 247
mJoules.
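The reuse test used in this example, checking whether the edges of a stored generalized dependence graph still hold in a new context, can be sketched as follows. This is an illustration under stated assumptions rather than the thesis's algorithm: an edge (u, v) is taken to hold when node u finishes no later than node v starts, and the start and finish cycles below are made-up values chosen only to mimic a context in which ins5 is delayed by one cycle. The names violated_edges and can_reuse are hypothetical.

```python
def violated_edges(edges, start, finish):
    """Return the edges (u, v) for which u does not finish before v starts."""
    return [(u, v) for (u, v) in edges if finish[u] > start[v]]


def can_reuse(edges, start, finish):
    """A stored summary is reusable only if every edge of the condition holds."""
    return not violated_edges(edges, start, finish)


# Hypothetical dominating condition: ins5 must leave MEM before ins6 enters it.
condition = [(("MEM", "ins5"), ("MEM", "ins6"))]

# Hypothetical timings in the context where the condition was generated.
start_a = {("MEM", "ins5"): 6, ("MEM", "ins6"): 7}
finish_a = {("MEM", "ins5"): 7, ("MEM", "ins6"): 8}

# Hypothetical timings in a new context where ins5 (and thus ins6) is delayed.
start_b = {("MEM", "ins5"): 7, ("MEM", "ins6"): 8}
finish_b = {("MEM", "ins5"): 8, ("MEM", "ins6"): 9}

# Hypothetical timings where the order is broken, so reuse must be rejected.
start_c = {("MEM", "ins5"): 8, ("MEM", "ins6"): 7}
finish_c = {("MEM", "ins5"): 9, ("MEM", "ins6"): 8}
```

In the first two contexts the ordering is preserved even though the absolute cycles differ, so the summary could be reused; in the third the edge is violated and the subtree would have to be analyzed again.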
5.7 Summary
In this chapter we presented the worst-case energy consumption analysis. We highlighted
that in the energy domain, which is highly data-dependent, modular approaches
can generate more imprecise bounds. We proposed that the best application for
integrated analysis is when the resource of interest is data-dependent (such as energy). Our
framework is able to precisely track the machine state across the paths in the symbolic
execution tree. This gives us the precision needed to measure the worst-case energy
cost of a program. However, the scalability of our method relies on the possibility of
reuse.
Compared to the state of the art, our method, besides inheriting the precision from
infeasible path detection and more precise cache analysis (from Chapter 3), can generate
more precise energy costs in the pipeline analysis (wake-up logic, selection and clock
energies, explained in Appendix B). Our focus in this chapter has been on ensuring
that our framework can remain scalable while performing superscalar pipeline analysis
and WCEC analysis. We were able to extend the fundamental notion of reuse to
in-order execution while remaining sound and scalable, and to
out-of-order execution while remaining sound.
Our pipeline analysis can be applied to other architectures as long as a sound
transition function for the pipeline can be defined to represent the changes in the internal
state of the pipeline. We consider timing anomalies in the out-of-order pipeline analysis.
However, we do not consider timing anomalies in the in-order pipeline analysis, nor do we
consider multi-threaded pipelines.
Chapter 6
Conclusion and Future Work
In this thesis, we have presented a framework which performs resource analysis with the
help of fully path-sensitive, integrated symbolic execution. Our framework,
presented in Chapter 2, performs resource analysis in one integrated phase where the
low-level analysis and high-level analysis are performed at the same time. The main
contribution of our method is that our algorithm precisely tracks the underlying
micro-architectural features, represented by the machine state, along the symbolic execution.
This, in turn, allows our framework to precisely estimate the total resource consumed along
different paths.
Our framework addresses the scalability issue with the notion of reuse. Reuse has
previously been used to scale path-based methods [CJ11, CJ13]. In this thesis,
we first extended the notion of reuse to a dynamic resource consumption
model, which rests on the realistic assumption that the contribution of each basic block to the
resource consumption of a program is not a constant value. Secondly, we extended
the notion of reuse to a symbolic resource consumption model, where the
contribution of each basic block to the resource consumption of a program is a symbolic
value.
In Chapter 3, we customized our resource analysis framework for WCET analysis in
the presence of instruction and data caches. WCET analysis with data cache
analysis adds a new dimension (the memory range accessed through the data cache) to the problem.
The experiments presented show that our analysis can scale to realistic benchmarks
such as nsichneu.
Our framework performs instruction and data cache analysis with the LRU replacement policy. In order
to extend the cache analysis to support other cache policies, a sound cache summarization
should be defined. Our inspection shows that if the memory blocks in a cache set can be
ordered (by a ranking), a sound cache summarization can be defined for them. As a result,
our framework can be extended to some other cache policies, such as FIFO. Finally, we
would like to note that our framework can support cache hierarchies while remaining scalable.
The result of our research in Chapter 3 is published in [CJMa].
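To illustrate what it means for the memory blocks of a cache set to be ordered by a ranking, the following Python sketch models a single concrete LRU cache set as a list ordered from most to least recently used. It is only a concrete-state illustration, not the cache summarization of Chapter 3; the function name lru_access and the block names are hypothetical.

```python
def lru_access(cache_set, block, associativity):
    """Access `block` in an LRU-ordered cache set.

    `cache_set` is a list ordered from most to least recently used.
    Returns the updated set and whether the access was a hit.
    """
    hit = block in cache_set
    # The accessed block becomes the most recently used one.
    remaining = [b for b in cache_set if b != block]
    new_set = [block] + remaining
    # On a miss in a full set, the least recently used block falls off the end.
    return new_set[:associativity], hit


def run(accesses, associativity):
    """Replay a sequence of accesses from an empty set; return final set and hit flags."""
    cache_set, hits = [], []
    for block in accesses:
        cache_set, hit = lru_access(cache_set, block, associativity)
        hits.append(hit)
    return cache_set, hits
```

The position of a block in the list is its rank. A policy such as FIFO admits a similar ordering (by insertion time), which is the property the paragraph above relies on.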
Next, we extended our resource analysis framework to estimate the MHW of a
program in Chapter 4. Note that although none of the micro-architectural features have an
effect on the memory consumption in MHW analysis, the resource of interest, memory,
is a non-cumulative resource. In other words, since memory can be acquired and
released, the MHW of a program is not necessarily reached at the end of its paths; rather, it can be
reached at any node along a path. In that chapter, we extended our framework to perform symbolic
resource analysis on non-cumulative resources.
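The difference between a cumulative resource and a non-cumulative one can be made concrete with a small sketch. It assumes a path is given as a list of signed memory deltas (positive for an allocation, negative for a release); the function names and the example path are hypothetical and the values are concrete, whereas the analysis in Chapter 4 works on symbolic values.

```python
def final_usage(deltas):
    """Memory still held at the end of the path (what a cumulative view reports)."""
    return sum(deltas)


def high_water_mark(deltas):
    """Largest amount of memory held at any node along the path."""
    peak = current = 0
    for delta in deltas:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical path: allocate 8, allocate 16, release 16, allocate 4.
example_path = [8, 16, -16, 4]
```

For this path the memory held at the end is 12 units, but the peak of 24 units is reached in the middle of the path, which is why the bound cannot simply be read off at the end of each path.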
As elaborated in Chapter 4, compared to the state-of-the-art methods, our method
can perform both stack and heap analysis and also generate bounds which are less com-
plex (parametric to input arguments only). The result of our research in Chapter 4 is
published in [CJMb].
Finally, in Chapter 5, we extended our resource analysis framework to perform
WCEC analysis in the presence of in-order and out-of-order superscalar processors.
The machine state tracks the pipeline and cache states precisely along the paths
during the analysis. We were able to extend the fundamental notion of reuse to
in-order execution while remaining sound and scalable, and to
out-of-order execution while remaining sound.
Compared to the state of the art, our method, besides inheriting the precision from
infeasible path detection and more precise cache analysis (from Chapter 3), can generate
more precise energy costs in the pipeline analysis (wake-up logic, selection and clock
energies). Our focus in that chapter has been on ensuring that our framework can
remain scalable while performing superscalar pipeline analysis and WCEC analysis.
Our pipeline analysis can be applied to other architectures as long as a sound
transition function for the pipeline can be defined to represent the changes in the internal
state of the pipeline. We consider timing anomalies in the out-of-order pipeline analysis.
However, we do not consider timing anomalies in the in-order pipeline analysis, nor do we
consider multi-threaded pipelines.
We will end this chapter by proposing some future directions to the research in this
thesis for interested readers.
Extension to Binary Analysis
As discussed in Chapter 2, our resource analysis framework performs resource analysis
on LLVM IR. However, the analysis of highly safety-critical systems may require
analyzing the binary code itself. The analysis of binary code is cumbersome and costly. Moreover,
certain issues, such as unbounded jumps whose target location can only be identified
at run time, make performing symbolic execution on binary code quite difficult.
The research presented in [H+13] and implemented in the respective tool T-Crest
can bridge this gap. In this method, with the help of a control-flow relation graph, each
LLVM basic block is related to one or more flow paths in the binary code, and the timing
of each LLVM basic block can then be measured precisely. By integrating this method
into our analysis framework, we would be able to perform resource analysis on machine
code. Moreover, note that since our framework is path-sensitive, we might be able to
track more precisely the context reaching an LLVM basic block and, as a result, prune
certain flow paths in the respective control-flow relation graph.
Pipeline Analysis with Branch Prediction
In the WCEC analysis presented in Chapter 5, we assumed perfect branch prediction,
where all the predictions for branch instructions are correct. One important extension
of the research presented in Chapter 5 is to extend the resource analysis framework to
account for the effect of branch prediction. Note that branch prediction affects
both in-order and out-of-order execution. The soundness of our framework relies on the
soundness of the reuse step, and a realistic branch prediction model should
be integrated in a way that does not affect the soundness of the reuse step. In this
way, we would be able to estimate the WCEC of a program with more precision.
Summarizing Functions
Our proposed framework can be extended with the means to generate
summarizations of functions and to reuse them in later calls to those functions.
A function might be called several times during the execution of a program, so
generating summarizations for functions would improve the scalability of our framework.
The main idea is to analyze functions as loops with one iteration. With
this method, we can use the concepts proposed for loops (witness, dominating condition
and machine state summary) in the summarization of functions.
Bibliography
[A+07] Elvira Albert et al. Cost analysis of java bytecode. In Programming Lan-
guages and Systems, pages 157–172. Springer, 2007.
[A+12a] Elvira Albert et al. Cost analysis of object-oriented bytecode programs.
Theoretical Computer Science, 413(1):142–159, 2012.
[A+12b] Saswat Anand et al. Automated concolic testing of smartphone apps. In
Proceedings of the ACM SIGSOFT 20th International Symposium on the
Foundations of Software Engineering, page 59. ACM, 2012.
[A+12c] Björn Andersson et al. Non-preemptive scheduling with history-dependent
execution time. In ECRTS, pages 363–372, 2012.
[A+13] Jeppe L Andersen et al. Worst-case memory consumption analysis for scj.
In Proceedings of the 11th International Workshop on Java Technologies for
Real-time and Embedded Systems, pages 2–10. ACM, 2013.
[A+15] David Atienza Alonso et al. Dynamic memory management optimization for
multimedia applications. In Dynamic Memory Management for Embedded
Systems, pages 167–192. Springer, 2015.
[aiT] aiT Worst-Case Execution Time Analyzers. URL
http://www.absint.com/ait/index.htm.
[Als14] Mohammad Alshamlan. A regression approach to execution time estimation
for programs running on multicore systems. 2014.
[B+03] Derek Bruening et al. An infrastructure for adaptive dynamic optimization.
In Code Generation and Optimization, 2003. CGO 2003. International Sym-
posium on, pages 265–275. IEEE, 2003.
[B+05] Enrico Bini et al. Speed modulation in energy-aware real-time systems.
In Real-Time Systems, 2005.(ECRTS 2005). Proceedings. 17th Euromicro
Conference on, pages 3–10. IEEE, 2005.
[B+06] Jonathan Barre et al. Modeling instruction-level parallelism for wcet evalu-
ation. In Proceedings of the 12th IEEE International Conference on Embed-
ded and Real-Time Computing Systems and Applications, RTCSA ’06, pages
61–67, Washington, DC, USA, 2006. IEEE Computer Society.
[B+08] Víctor Braberman et al. Parametric prediction of heap memory requirements.
In Proceedings of the 7th international symposium on Memory management,
pages 141–150. ACM, 2008.
[B+11] Nathan Binkert et al. The gem5 simulator. ACM SIGARCH Computer
Architecture News, 39(2):1–7, 2011.
[B+13] Thomas Bøgholm et al. Towards harnessing theories through tool support
for hard real-time java programming. Innovations in Systems and Software
Engineering, 9(1):17–28, 2013.
[B+14] Abhijeet Banerjee et al. Detecting energy bugs and hotspots in mobile
apps. In Proceedings of the 22nd ACM SIGSOFT International Symposium
on Foundations of Software Engineering, pages 588–598. ACM, 2014.
[BCR13] Abhijeet Banerjee, Sudipta Chattopadhyay, and Abhik Roychoudhury. Pre-
cise micro-architectural modeling for wcet analysis via ai+sat. In 19th RTAS.
IEEE, 2013.
[BMBW00] Emery D Berger, Kathryn S McKinley, Robert D Blumofe, and Paul R
Wilson. Hoard: A scalable memory allocator for multithreaded applications.
ACM Sigplan Notices, 35(11):117–128, 2000.
[Bou] Bound-T time and stack analyser. URL http://www.bound-t.com.
[BTM00] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: a framework
for architectural-level power analysis and optimizations, volume 28. ACM,
2000.
[Č+] Pavol Černý et al. Segment abstraction for worst-case execution time analysis.
In ESOP 2015.
[C+76] Lori Clarke et al. A system to generate test data and symbolically execute
programs. Software Engineering, IEEE Transactions on, (3):215–222, 1976.
[C+14] Quentin Carbonneaux et al. End-to-end verification of stack-space bounds
for c programs. In ACM SIGPLAN Notices, volume 49. ACM, 2014.
[CC77] Patrick Cousot and Radhia Cousot. Abstract interpretation: A unified lat-
tice model for static analysis. In POPL, pages 238–252. ACM, 1977.
[CH] Patrick Cousot and Nicolas Halbwachs. Automatic discovery of linear restraints
among variables of a program. In POPL, pages 84–96, 1978.
[chr14] Chronos WCET analysis tool. www.comp.nus.edu.sg/~rpembed/chronos,
January 2014.
[CHS15] Quentin Carbonneaux, Jan Hoffmann, and Zhong Shao. Compositional cer-
tified resource bounds. In PLDI, pages 467–478. ACM, 2015.
[CJ11] Duc-Hiep Chu and Joxan Jaffar. Symbolic simulation on complicated loops
for wcet path analysis. In Embedded Software (EMSOFT), pages 319–328.
IEEE, 2011.
[CJ13] Duc-Hiep Chu and Joxan Jaffar. Path-sensitive resource analysis compliant
with assertions. In Embedded Software (EMSOFT), pages 1–10. IEEE, 2013.
[CJMa] Duc-Hiep Chu, Joxan Jaffar, and Rasool Maghareh. Precise cache timing
analysis via symbolic execution. In RTAS 2016.
[CJMb] Duc-Hiep Chu, Joxan Jaffar, and Rasool Maghareh. Symbolic execution for
memory consumption analysis. In LCTES 2016.
[Cla14] clang: a c language family front-end for llvm. http://www.clang.llvm.org,
2014. Accessed: 2015-02-01.
[CP00] Antoine Colin and Isabelle Puaut. Worst case execution time analysis for
a processor with branch prediction. Real-Time Systems, 18(2-3):249–274,
2000.
[CR11] Sudipta Chattopadhyay and Abhik Roychoudhury. Scalable and precise
refinement of cache timing analysis via model checking. In RTSS, pages
193–203. IEEE, 2011.
[EG97] Andreas Ermedahl and Jan Gustafsson. Deriving annotations for tight cal-
culation of execution time. In European Conference on Parallel Processing,
pages 1298–1307. Springer, 1997.
[Flo14] Florida WCET analysis tool. http://moss.csc.ncsu.edu/mueller/, January
2014.
[FMH14] Antonio Flores-Montoya and Reiner Hähnle. Resource analysis of complex
programs with cost equations. In Programming Languages and Systems,
pages 275–295. Springer, 2014.
[FV98] Klaus Friedewald and AG Volkswagen. Design methods for adjusting the
side airbag sensor and the car body. In Proceedings of 16th International
Technical Conference on the Enhanced Safety of Vehicles, Windsor, Ontario,
Canada, 31 May-4 June 1998. VOLUME 3, 1998.
[FW98] Christian Ferdinand and Reinhard Wilhelm. On predicting data cache be-
havior for real-time systems. In LCTES, pages 16–30. Springer, 1998.
[G+01] Matthew R Guthaus et al. Mibench: A free, commercially representative
embedded benchmark suite. In Workload Characterization, 2001. WWC-4.
2001 IEEE International Workshop on, pages 3–14. IEEE, 2001.
[G+05] Jan Gustafsson et al. Towards a flow analysis for embedded system c pro-
grams. In 10th IEEE International Workshop on Object-Oriented Real-Time
Dependable Systems, pages 287–297. IEEE, 2005.
[GEL06] Jan Gustafsson, Andreas Ermedahl, and Björn Lisper. Algorithms for
infeasible path calculation. In OASIcs-OpenAccess Series in Informatics,
volume 4, 2006.
[GGP+14] Neville Grech, Kyriakos Georgiou, James Pallister, Steve Kerrison, and Ker-
stin Eder. Static energy consumption analysis of llvm ir programs. arXiv
preprint arXiv:1405.4565, 2014.
[GZ10] Sumit Gulwani and Florian Zuleger. The reachability-bound problem. SIG-
PLAN Not., 45(6), June 2010.
[H+03] Reinhold Heckmann et al. The influence of processor architecture on the
design and the results of wcet tools. Proceedings of the IEEE, 91(7):1038–
1054, 2003.
[H+11] Bach Khoa Huynh et al. Scope-aware data cache analysis for wcet estima-
tion. In 2011 17th IEEE Real-Time and Embedded Technology and Applica-
tions Symposium, pages 203–212. IEEE, 2011.
[H+13] Benedikt Huber et al. Combined wcet analysis of bitcode and machine code
using control-flow relation graphs. In ACM SIGPLAN Notices, volume 48,
pages 163–172. ACM, 2013.
[H+14a] Julien Henry et al. How to compute worst-case execution time by optimiza-
tion modulo theory and a clever encoding of program semantics. In RTNS,
pages 43–52. ACM, 2014.
[H+14b] Timo Hönig et al. Proactive energy-aware programming with peek. In
Proceedings of the 2014 International Conference on Timely Results in Op-
erating Systems, pages 6–6. USENIX Association, 2014.
[H+16] Rémy Haemmerlé et al. A transformational approach to parametric
accumulated-cost static profiling. In International Symposium on Functional
and Logic Programming, pages 163–180. Springer, 2016.
[HAH11] Jan Hoffmann, Klaus Aehlig, and Martin Hofmann. Multivariate amortized
resource analysis. In POPL’11, pages 357–370, 2011.
[HIE12] CHU DUC HIEP. Interpolation Methods for Symbolic Execution. PhD thesis,
NATIONAL UNIVERSITY OF SINGAPORE, 2012.
[J+05] Hans Jacobson et al. Stretching the limits of clock-gating efficiency in server-
class processors. In High-Performance Computer Architecture, 2005. HPCA-
11. 11th International Symposium on, pages 238–242. IEEE, 2005.
[J+06] Ramkumar Jayaseelan et al. Estimating the worst-case energy consumption
of embedded software. In Real-Time and Embedded Technology and Applica-
tions Symposium, 2006. Proceedings of the 12th IEEE, pages 81–90. IEEE,
2006.
[JMNS12] Joxan Jaffar, Vijayaraghavan Murali, Jorge A Navas, and Andrew E San-
tosa. Tracer: A symbolic execution tool for verification. In Computer Aided
Verification, pages 758–766. Springer, 2012.
[JMSY92] Joxan Jaffar, Spiro Michaylov, Peter J Stuckey, and Roland HC Yap. The
CLP(R) language and system. ACM TOPLAS 14(3), 14(3):339–395, 1992.
[JSV08] Joxan Jaffar, Andrew E Santosa, and Razvan Voicu. Efficient memoization
for dynamic programming with ad-hoc constraints. In AAAI, 2008.
[JSV09] Joxan Jaffar, Andrew E Santosa, and Răzvan Voicu. An interpolation
method for clp traversal. In CP, 2009.
[K+02] Raimund Kirner et al. Fully automatic worst-case execution time analysis
for matlab/simulink models. In Real-Time Systems, 2002. Proceedings. 14th
Euromicro Conference on, pages 31–40. IEEE, 2002.
[KE13] Steve Kerrison and Kerstin Eder. Energy modelling and optimisation of soft-
ware for a hardware multi-threaded embedded microprocessor. University
of Bristol, Bristol, Tech. Rep, 2013.
[KE15] Steve Kerrison and Kerstin Eder. Energy modeling of software for a hard-
ware multithreaded embedded microprocessor. ACM Transactions on Em-
bedded Computing Systems (TECS), 14(3):56, 2015.
[KF14] Daniel Kästner and Christian Ferdinand. Proving the absence of stack
overflows. In Computer Safety, Reliability, and Security. Springer, 2014.
[Kin76] James C King. Symbolic execution and program testing. Communications
of the ACM, 19(7):385–394, 1976.
[KKZ13] Jens Knoop, Laura Kovács, and Jakob Zwirchmayr. Wcet squeezing:
on-demand feasibility refinement for proven precise wcet-bounds. In RTNS,
pages 161–170. ACM, 2013.
[Kop11] Hermann Kopetz. Real-time systems: design principles for distributed em-
bedded applications. Springer, 2011.
[Kru14] Michael Kruse. Perfrewrite–program complexity analysis via source code
instrumentation. arXiv preprint arXiv:1409.2089, 2014.
[L+] Hanbing Li et al. Tracing flow information for tighter wcet estimation:
Application to vectorization. In RTCSA 2015.
[L+05] Chi-Keung Luk et al. Pin: building customized program analysis tools with
dynamic instrumentation. In ACM Sigplan Notices, volume 40, pages 190–
200. ACM, 2005.
[L+10] Mingsong Lv et al. Combining abstract interpretation with model checking
for timing analysis of multicore software. In Real-Time Systems Symposium
(RTSS), 2010 IEEE 31st, pages 339–349. IEEE, 2010.
[LA04] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong
program analysis & transformation. In Code Generation and Optimization,
2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.
[LLV15] Llvm test suite guide. URL
http://llvm.org/releases/2.2/docs/TestingGuide.html, 2015.
[LM95] Yau-Tsun Steven Li and Sharad Malik. Performance analysis of embedded
software using implicit path enumeration. SIGPLAN Not., 30(11):88–98,
1995.
[LMW99] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Performance esti-
mation of embedded software with instruction cache modeling. TODAES
4(3), 4(3), 1999.
[LRM04] Xianfeng Li, Abhik Roychoudhury, and Tulika Mitra. Modeling out-of-order
processors for software timing analysis. In RTSS, pages 92–103. IEEE, 2004.
[LRM06] Xianfeng Li, Abhik Roychoudhury, and Tulika Mitra. Modeling out-of-order
processors for wcet analysis. Real-Time Systems, 34(3):195–227, 2006.
[LS98] Thomas Lundqvist and Per Stenström. Integrating path and timing analysis
using instruction-level simulation techniques. In (LCTES),1998, pages 1–15.
Springer, 1998.
[LS99] Thomas Lundqvist and Per Stenström. An integrated path and timing
analysis method based on cycle-level symbolic execution. RTS, 1999.
[M0¨6] Mälardalen WCET research group benchmarks. URL
http://www.mrtc.mdh.se/projects/wcet/benchmarks.html, 2006.
[M+08] Miguel Masmano et al. A constant-time dynamic storage allocator for real-
time systems. Real-Time Systems, 40(2):149–179, 2008.
[M+13] Aravind Machiry et al. Dynodroid: An input generation system for an-
droid apps. In Proceedings of the 2013 9th Joint Meeting on Foundations of
Software Engineering, pages 224–234. ACM, 2013.
[MHH02] Frank Mehnert, Michael Hohmuth, and Hermann Hartig. Cost and benefit
of separate address spaces in real-time operating systems. In 23rd RTSS,
pages 124–133. IEEE, 2002.
[MRC03] Miguel Masmano, Ismael Ripoll, and Alfons Crespo. Dynamic storage alloca-
tion for real-time embedded systems. Proc. of Real-Time System Simposium
WIP, 2003.
[N+06] Fadia Nemer et al. Papabench: a free real-time benchmark. In WCET,
volume 4. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2006.
[NS] Kartik Nagar and YN Srikant. Path sensitive cache analysis using cache
miss paths. In VMCAI 2015.
[NS07] Nicholas Nethercote and Julian Seward. Valgrind: a framework for heavy-
weight dynamic binary instrumentation. In ACM Sigplan notices, volume 42,
pages 89–100. ACM, 2007.
[P+11] OFJ Perks et al. Wmtrace: a lightweight memory allocation tracker and
analysis framework. 2011.
[PB00] P. Puschner and A. Burns. A review of worst-case execution-time analysis.
J. RTS, 2000.
[PHS10] Wolfgang Puffitsch, Benedikt Huber, and Martin Schoeberl. Worst-case
analysis of heap allocations. In Leveraging Applications of Formal Methods,
Verification, and Validation, pages 464–478. Springer, 2010.
[PKME15] James Pallister, Steve Kerrison, Jeremy Morse, and Kerstin Eder. Data
dependent energy modelling: A worst case perspective. arXiv preprint
arXiv:1505.03374, 2015.
[PS01] Padmanabhan Pillai and Kang G Shin. Real-time dynamic voltage scaling
for low-power embedded operating systems. In ACM SIGOPS Operating
Systems Review, number 5, pages 89–102. ACM, 2001.
[PSSG10] Preeti Ranjan Panda, BVN Silpa, Aviral Shrivastava, and Krishnaiah Gum-
midipudi. Power-efficient system design. Springer Science & Business Media,
2010.
[Q+00] Gang Qu et al. Function-level power estimation methodology for micropro-
cessors. In Proceedings of the 37th Annual Design Automation Conference,
pages 810–813. ACM, 2000.
[R+06] Jan Reineke et al. A definition and classification of timing anomalies.
WCET, 4, 2006.
[RRW05] John Regehr, Alastair Reid, and Kirk Webb. Eliminating stack overflow by
abstract interpretation. ACM Transactions on Embedded Computing Sys-
tems (TECS), 4(4):751–778, 2005.
[RS09a] Jan Reineke and Rathijit Sen. Sound and efficient wcet analysis in the
presence of timing anomalies. In WCET, page 101, 2009.
[RS09b] Christine Rochange and Pascal Sainrat. A context-parameterized model for
static analysis of execution times. In Transactions on High-Performance
Embedded Architectures and Compilers II, pages 222–241. Springer, 2009.
[S+07] Jean Souyris et al. Computing the worst case execution time of an avionics
program by abstract interpretation. In OASIcs-OpenAccess Series in Informatics,
volume 1. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2007.
[S+12] Carsten Sinz et al. Llbmc: A bounded model checker for llvm’s intermediate
representation. In Tools and Algorithms for the Construction and Analysis
of Systems, pages 542–544. Springer, 2012.
[SC01] Amit Sinha and Anantha P Chandrakasan. Jouletrack: a web based tool
for software energy profiling. In Proceedings of the 38th annual Design Au-
tomation Conference, pages 220–225. ACM, 2001.
[Sch94] Werner Schütz. Fundamental issues in testing distributed real-time systems.
Real-Time Systems, 7(2):129–157, 1994.
[Sch15] Martin Schoeberl. Scala for real-time systems? In Proceedings of the 13th
International Workshop on Java Technologies for Real-time and Embedded
Systems, page 13. ACM, 2015.
[SF99] Jörn Schneider and Christian Ferdinand. Pipeline behavior prediction for
superscalar processors by abstract interpretation. In ACM SIGPLAN Notices,
volume 34, pages 35–44. ACM, 1999.
[T+94] Vivek Tiwari et al. Power analysis of embedded software: a first step towards
software power minimization. Very Large Scale Integration (VLSI) Systems,
IEEE Transactions on, 2(4):437–445, 1994.
[T+13] Philip W Trinder et al. Resource analyses for parallel and distributed coordi-
nation. Concurrency and Computation: Practice and Experience, 25(3):309–
348, 2013.
[TFW00] H. Theiling, C. Ferdinand, and R. Wilhelm. Fast and precise WCET
prediction by separated cache and path analyses. RTS 18(2/3), 18(2/3):157–179,
May 2000.
[The04] Stephan Thesing. Safe and precise WCET determination by abstract
interpretation of pipeline models. PhD thesis, Universitätsbibliothek, 2004.
[V+14] Marcus Volp et al. Has energy surpassed timeliness? scheduling energy-
constrained mixed-criticality systems. In Real-Time and Embedded Technol-
ogy and Applications Symposium (RTAS), 2014 IEEE 20th, pages 275–284.
IEEE, 2014.
[Val] Valgrind instrumentation framework. http://valgrind.org/.
[W+97] Randall T White et al. Timing analysis for data caches and set-associative
caches. In RTAS, pages 192–202. IEEE, 1997.
[W+08] Reinhard Wilhelm et al. The worst-case execution-time problem—overview
of methods and survey of tools. ACM Transactions on Embedded Computing
Systems (TECS), 7(3):36, 2008.
[W+15] Peter Wägemann et al. Worst-case energy consumption analysis for
energy-constrained embedded systems. ECRTS, 4, 2015.
[Wil06] Reinhard Wilhelm. Determining bounds on execution times. In Richard
Zurawski, editor, Handbook on Embedded Systems, chapter 14. CRC Press,
2006.
[wtc14] Wcet tool competition 2014. URL www.mrtc.mdh.se/projects/WTC/, 2014.
[Y+00] Wu Ye et al. The design and use of simplepower: a cycle-accurate energy
estimation tool. In Proceedings of the 37th Annual Design Automation Con-
ference, pages 340–345. ACM, 2000.
[ZGSV11] Florian Zuleger, Sumit Gulwani, Moritz Sinn, and Helmut Veith. Bound
analysis of imperative programs with the size-change abstraction. In Static
Analysis, pages 280–297. Springer, 2011.
[ZK15] Michael Zolda and Raimund Kirner. Calculating wcet estimates from timed
traces. RTS, pages 1–50, 2015.

Appendix A
Compiling LLVM IR to Transition System
LLVM is a modular compiler infrastructure which compiles high-level languages to an intermediate
representation named LLVM IR. LLVM IR is in static single assignment (SSA) form.
LLVM IR instructions are arranged in basic blocks (labeled with unique names). A
basic block is a sequence of instructions, inst_1 ... inst_n, such that none of the instructions up
to inst_{n-1} is a branch or return instruction. The last instruction inst_n is always br
or ret. Moreover, a Phi instruction, when present, always appears at the beginning of the basic
block.
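The structural properties of a basic block stated above can be written down as a small check. The sketch below follows the appendix's simplified description, in which br and ret are the only terminators; real LLVM IR has further terminator instructions, which this sketch deliberately ignores. A basic block is represented as a plain list of opcode strings, and the function name is_well_formed is hypothetical.

```python
# Terminators as described in this appendix (a simplification of real LLVM IR).
TERMINATORS = {"br", "ret"}


def is_well_formed(opcodes):
    """Check a basic block given as a list of opcode names.

    The block must end with a terminator, contain no terminator before the
    last instruction, and have all of its phi instructions at the beginning.
    """
    if not opcodes or opcodes[-1] not in TERMINATORS:
        return False
    if any(op in TERMINATORS for op in opcodes[:-1]):
        return False
    seen_non_phi = False
    for op in opcodes:
        if op == "phi":
            if seen_non_phi:
                return False  # a phi after an ordinary instruction
        else:
            seen_non_phi = True
    return True
```

For example, the block ["phi", "add", "br"] satisfies all three properties, while ["add", "phi", "br"] violates the placement of phi.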
Our symbolic execution framework receives a C program as input and produces
a corresponding transition system in CLP(R) [JMSY92]. The transition system is passed
to the resource analysis framework. We use the open-source C/C++ front-end Clang
3.2 [Cla14] to compile a C program and generate the LLVM IR. Next, an LLVM pass, which
we have developed, reads the LLVM IR and generates the transition system. Finally, our
resource analysis framework analyzes the transition system.
During the compilation, for each basic block BB and its sets of input and output
variables, Vars_input and Vars_output, the updates on each of the variables in Vars_output
relative to Vars_input are captured through a set of constraints and stored in the generated
transition system.
Note that besides capturing the relations between the input and output variables
of a BB, the framework needs more low-level information for the resource analysis to
be able to precisely generate and update the machine state. For example, in WCET
analysis, the sequence of memory block accesses should also be captured.
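As an illustration of the information kept per basic block, the following sketch models a transition as a record with the updates of its output variables and the sequence of memory accesses. It is a hypothetical Python stand-in, not the CLP(R) encoding produced by the pass: the updates are ordinary functions evaluated on concrete values, whereas the generated transition system stores symbolic constraints. All names (Transition, updates, memory_accesses, apply) are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    """A basic-block transition: variable updates plus low-level access information."""
    source: str                     # program point of the basic block
    target: str                     # program point of the successor
    updates: dict                   # output variable -> function of the input valuation
    memory_accesses: list = field(default_factory=list)

    def apply(self, valuation):
        """Compute the output valuation from the input valuation.

        Every update reads the input valuation, so the updates behave like
        simultaneous constraints between Vars_input and Vars_output.
        """
        result = dict(valuation)
        for var, update in self.updates.items():
            result[var] = update(valuation)
        return result


# Hypothetical block computing s = s + i; i = i + 1 while loading a[i].
example = Transition(
    source="bb1",
    target="bb2",
    updates={"s": lambda v: v["s"] + v["i"], "i": lambda v: v["i"] + 1},
    memory_accesses=["load a[i]"],
)
```

Applying the example transition to the valuation i = 2, s = 10 yields i = 3, s = 12, and the recorded access list is what a WCET analysis would use to update the cache state.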
A.1 The LLVM Pass
The LLVM pass that we have developed transforms the LLVM IR into a corresponding
transition system in CLP(R). The LLVM pass performs the transformation in nine
phases:
Phase 1:
In the first phase, three groups of variables are collected:
1. Global variables
2. Local variables needed for infeasibility test
3. Local variables accessed through the data cache, which are collected to create the data map
The global variables are added both to the list of global variables
and to the data map. Since LLVM introduces many intermediate variables, not all of
the local variables are collected. Only the local variables whose values are important
for the infeasibility test or the data range calculation are collected. All the variables which
appear in the arguments of the branch and store instructions are tested, and the ones
which are needed for the infeasibility test or the data range calculation are added to the
list of local variables. The data range calculation is performed for load instructions. In
the next step, all the dynamic-range data accessed in load instructions are tested,
and the variables whose values affect the data access ranges are collected
(similar to taint analysis, but from bottom to top). This process is repeated until it reaches
a fixed point (where all the important variables, and all the variables writing to them,
are collected).
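The fixed-point collection described above can be sketched as a backward worklist traversal. This is only an illustration and not the code of our LLVM pass: the function name and the `defs` summary (mapping each variable to the variables read by the instructions that write to it) are hypothetical.

```python
def collect_relevant_variables(seeds, defs):
    """Return every variable that can influence one of the seed variables.

    seeds: variables used directly by branch conditions or load addresses.
    defs:  hypothetical summary mapping each variable to the set of
           variables read by the instructions that write to it.
    """
    relevant = set(seeds)
    worklist = list(seeds)
    while worklist:                      # iterate until the fixed point
        var = worklist.pop()
        for source in defs.get(var, ()):
            if source not in relevant:   # newly discovered dependency
                relevant.add(source)
                worklist.append(source)
    return relevant


# Toy input: i feeds a branch condition, idx feeds a load address,
# tmp is an intermediate value that nothing important depends on.
defs = {"i": {"i", "n"}, "idx": {"i", "base"}, "tmp": {"x"}}
result = collect_relevant_variables({"i", "idx"}, defs)
print(sorted(result))  # ['base', 'i', 'idx', 'n']
```

The traversal moves from uses to definitions, which is the bottom-to-top direction mentioned above, and it stops once no new variable is added.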
Phase 2:
In phase 2, the function names and their input arguments are collected.
Phase 3:
In phase 3, each basic block is mapped to a program point and the CFG structure is
collected and stored too.
Phase 4:
This phase is the core of the compiler, where the information of the transition system
is collected from the LLVM IR and stored in the related data structures. For brevity, we
only explain phase 4 and phase 5 for WCET analysis. In this phase, the transitions
between the basic blocks are generated, and the constraints needed for the infeasibility test
and the data range calculation are collected.
For each transition, firstly, the list of global and local variables is generated. Note
that each transition keeps a list of incoming variables and a list of outgoing variables. The
variables updated in the transition are kept in the outgoing list.
Secondly, a function collects all the data accesses in the basic blocks and stores
them in the transitions. Data accesses occur in store and load instructions. For store
instructions, all the data accesses are assumed to be cache misses (since a
write-through cache configuration is used in the experiments).
The data access of a load instruction might be a cache hit or a cache miss. In addition,
the access can be a constant access (to a specific memory location)
or a dynamic access (whose address is resolved at run-time). For constant
accesses, the constant address is recorded; for dynamic accesses, a constraint is
generated which calculates the accessed address, or a range of data addresses, at run-time.
Finally, the constraints updating the variables in the transitions are collected. We
highlight the following instructions and explain the information collected
from them:
• Load instruction: In addition to collecting the data range access information,
if the load instruction loads from a pointer location, a corresponding constraint is
generated and added to the transition.
• Store instruction: Besides collecting the data range access information, if
the store instruction updates a variable in the list of collected variables, a
constraint capturing the update made to the variable is generated recursively.
For example, for the following piece of LLVM IR, the generated constraint would
be A4 = A3 − (A1 + A2).
%Add = add %A1, %A2
%Sub = sub %A3, %Add
store %Sub, %A4
• GetElementPtr instruction: Since this instruction holds the addresses of pointer
accesses and array manipulations, its information is used to calculate the dynamic
data cache accesses.
• Br instruction: This instruction is usually the final instruction of a basic block
and represents a conditional or an unconditional jump. For an unconditional
jump, only one transition is generated. For a conditional jump, however,
two or three transitions are generated, and the constraint indicating the
condition of the jump is added to each transition.
• Cmp instruction: For the icmp and fcmp instructions, the comparison type and its
arguments are collected; they are used later in the Br instruction.
• Ret instruction: The ret instruction is the last instruction of a function. For
each ret instruction in a function, the respective transition is updated with the
returned value of the function. For the main function, these
transitions are marked as the transitions to the leaf nodes.
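As an illustration of the recursive constraint generation for the store instruction, the following sketch expands the stored value through the temporaries that define it. It is a simplified assumption about how such an expansion could look, not the implementation of the pass; `expand`, `store_constraint` and the `defs` table are hypothetical names, and only add and sub are handled.

```python
def expand(value, defs):
    """Recursively rewrite an SSA value in terms of non-temporary variables."""
    if value not in defs:          # a collected variable or constant: keep as is
        return value
    op, lhs, rhs = defs[value]
    symbol = {"add": "+", "sub": "-"}[op]
    return f"({expand(lhs, defs)} {symbol} {expand(rhs, defs)})"


def store_constraint(stored_value, target, defs):
    """Constraint describing the update made by storing stored_value into target."""
    return f"{target} = {expand(stored_value, defs)}"


# %Add = add %A1, %A2 ; %Sub = sub %A3, %Add ; store %Sub, %A4
defs = {"Add": ("add", "A1", "A2"), "Sub": ("sub", "A3", "Add")}
constraint = store_constraint("Sub", "A4", defs)
print(constraint)  # A4 = (A3 - (A1 + A2))
```

Keeping the parentheses during expansion matters: the subtraction applies to the whole sum, so the constraint is A3 − (A1 + A2) and not A3 − A1 + A2.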
Phase 5:
In phase 5, the instructions in a basic block and their addresses are added to the transi-
tions. This information is used to capture the instruction cache accesses.
Phase 6:
In phase 6, the information regarding loops (such as the loop-head, exit-edges, back-edges
and nesting level of each loop) is collected. This information is used to perform loop
simplification, which is carried out in phase 7.
Phase 7:
In this phase, a dummy node is first added at the end of each loop. All the back-edges
of the loop are redirected to this dummy node, and an edge is then added from the
dummy node to the loop-head. As a result, each loop has only one back-edge, and the
dummy nodes are selected as the abstraction points (merge points) in our analysis of loops.
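The back-edge redirection of this phase can be sketched on a CFG represented as a set of edges. The representation and the names `simplify_loop` and `dummy` are assumptions made for illustration only.

```python
def simplify_loop(edges, loop_head, back_edge_sources, dummy):
    """Merge all back-edges of one loop into a single back-edge via `dummy`.

    edges: set of (source, target) pairs of the CFG.
    back_edge_sources: nodes that have a back-edge to `loop_head`.
    """
    new_edges = set(edges)
    for source in back_edge_sources:
        new_edges.discard((source, loop_head))   # drop the old back-edge
        new_edges.add((source, dummy))           # redirect it to the dummy node
    new_edges.add((dummy, loop_head))            # the only remaining back-edge
    return new_edges


# Loop with head 2 and two back-edges, coming from nodes 4 and 5.
cfg = {(1, 2), (2, 3), (3, 4), (3, 5), (4, 2), (5, 2), (2, 6)}
simplified = simplify_loop(cfg, loop_head=2, back_edge_sources=[4, 5], dummy="d")
back_edges = [(s, t) for (s, t) in simplified if t == 2 and s != 1]
print(back_edges)  # [('d', 2)]
```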
Phase 8:
In phase 8, a virtual unrolling of the loops is performed, in which each loop is unrolled
once (similar to the virtual step in the AI framework). The virtual unrolling ensures that
the comparison to the AI framework is fair.
Phase 9:




Energy Consumption Model in Embedded
Systems
In this Appendix, we will present the energy consumption model for the in-order and the
out-of-order pipeline model presented in Chapter 5.
The energy consumed by an embedded system is governed by the physical properties
of the hardware components in use, the power management capabilities of that hardware,
and the activity driven by the software running on it [KE13]. Jayaseelan et al. [J+06]
argue that, although the terms power and energy are often used interchangeably, energy is
the important metric regarding battery life.
The energy consumption of a task running on a processor is defined as $E = P \times t$,
where $P$ is the average power and $t$ is the execution time. In other words, energy is the
integral of power over a period of time, $E = \int_0^T P(t)\,dt$. Energy is measured in joules,
whereas power is measured in watts (joules per second).
Power consumption consists of two main components, dynamic power and leakage
power: $P = P_{\mathit{dynamic}} + P_{\mathit{leakage}}$. Dynamic power is caused by the switching activity in
the underlying hardware; it is data dependent and related to the executed program.
Leakage power captures the power lost through the leakage current irrespective of the
switching activity.
Furthermore, in clocked processors, part of the energy is consumed in the clock network.
This energy is referred to as the clock energy. Moreover, we refer to the power
consumed in the idle cycles as switch-off power.
B.1 Energy Consumption Model
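The energy consumption of a path can be calculated through the following formula:
$$\mathit{energy}_{\mathit{path}} = \mathit{dynamic}_{\mathit{path}} + \mathit{leakage}_{\mathit{path}} + \mathit{switchoff}_{\mathit{path}} + \mathit{clock}_{\mathit{path}} \tag{B.1}$$
where the total energy consumed during the execution of the path, $\mathit{energy}_{\mathit{path}}$,
is the sum of $\mathit{dynamic}_{\mathit{path}}$ (the energy consumed due to the switching activity)
and $\mathit{leakage}_{\mathit{path}}$, $\mathit{switchoff}_{\mathit{path}}$ and $\mathit{clock}_{\mathit{path}}$ (the energy consumed due to the leakage
power, the switch-off power and the clock power, respectively).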
On the other hand, the energy consumption of a program can be divided into
a time-independent and a time-dependent component, which can be estimated
separately:
1. Instruction-specific energy: The energy consumed by an instruction, which
can come from the energy consumed in the ALU units or from the fetching of
parameters, which can resolve to a cache hit or a cache miss. This energy is entirely
time-independent.
2. Pipeline-specific energy: The energy consumed in micro-architectural
features such as the pipeline. This energy cannot be attributed to a particular instruction
and depends on the execution time.
B.1.1 Instruction-specific Energy
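The instruction-specific energy of a path is the dynamic energy consumed due to the
execution of the instructions along the path:
$$\mathit{dynamic}_{\mathit{path}} = \sum_{\mathit{ins} \in \mathit{path}} \mathit{dynamic}_{\mathit{ins}} \tag{B.2}$$
where $\mathit{dynamic}_{\mathit{ins}}$ is the dynamic energy consumed by an instruction $\mathit{ins}$.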
The energy consumed by an instruction as it travels through the pipeline is
as follows:
• Fetch and decode: The energy due to the fetch, the decode and the instruction
cache access is consumed in this stage. We follow the instruction cache
model presented in Chapter 3 to measure the energy consumed in this stage in
the presence of the instruction cache. $P_{ic}$ denotes the power consumed per cycle
to access the instruction cache.
• Register access: The energy consumed for register file accesses due to reads
and/or writes can vary between classes of instructions. Throughout this thesis
we assume realistic clock gating [J+06]. As a result, the energy consumed
in the register file by an instruction is proportional to its number of register
operands. $P_{\mathit{RegisterAccess}}$ denotes the power consumed by an instruction to access one
operand in a clock cycle.
• Branch prediction: Currently we assume perfect branch prediction. Extending
our framework to model branch prediction is a goal for future
research. For now, this energy consumption is fixed to a constant value.
• Wake-up logic: The wake-up logic informs the dependent instructions that their
arguments are ready while the result is written onto the result bus. The energy
consumed in the wake-up logic is proportional to the number of output operands
of an instruction. $P_{\mathit{WakeUp}}$ is the power consumed in one clock cycle for making
the dependent instructions ready for one output operand. Wake-up logic energy is
only consumed in out-of-order pipelines.
• Selection logic: After the wake-up logic notifies an instruction that all its arguments
are ready, the instruction is chosen for execution by the selection logic, which selects
an instruction from a pool of ready instructions. Note that, due to the existence of
instructions with higher priority, an instruction might be accessed by the selection
logic several times before it is executed. $P_{\mathit{Selection}}$ denotes the power consumed by
the selection logic in each clock cycle in which it chooses an instruction
for execution from the pool of ready instructions. Unlike other methods, which
conservatively assume that the selection logic operates in every clock cycle, our
approach can accurately identify the clock cycles in which the selection logic is
accessed, except at the end of the loop iterations. Similar to the wake-up logic, the
energy for the selection logic is only consumed in out-of-order pipelines.
• Functional units: The energy consumed in the execution stage by an instruction
depends on the functional unit it uses and on its latency. Since our framework is able
to precisely track the context reaching each basic block, unlike other methods, we
can usually track the energy consumed in the functional unit precisely for
variable-latency instructions. Otherwise, similar to other methods such as [J+06], we
assume the maximum energy consumption.
The power consumed by each component $C$ when it is active is denoted by $P_C$; for the
functional units, these are $P_{\mathit{ALU}}$, $P_{\mathit{MULTU}}$ and $P_{\mathit{FPU}}$. The energy used by the
execution of an instruction is then the latency of the instruction multiplied by the power
consumed in the component. Moreover, the access to the data cache, and hence the energy
consumption of the load/store units, is measured here. $P_{dc}$ denotes the power consumed
per cycle to access the data cache.
B.1.2 Pipeline-specific Energy
Pipeline-specific energy consists of three components: switch-off energy, clock
energy and leakage energy. All three components are influenced by the execution
time of the paths:
• Switch-off Energy: The switch-off energy refers to the energy consumed in an idle
unit when it is disabled through clock gating. The switch-off energy corresponding
to a path is:
$$\mathit{switchoff}_{\mathit{path}} = \sum_{C \in \mathit{Components}} \mathit{switchoff}(C) \tag{B.3}$$
where $\mathit{Components}$ is the set of all hardware components. Since we track the
pipeline state and the accesses to different components, such as caches, along each
path, we measure the switch-off energy of each component for each clock cycle.
$\mathit{Switchoff}_C$ is the power consumed per clock cycle by a component $C$ in the
idle state. Following the realistic clock gating from [J+06], we take the switch-off
power of a component, $\mathit{Switchoff}_C$, to be 10% of its peak power. As a result,
the switch-off energy of each component is as follows:
$$\mathit{switchoff}(C) = (\mathit{ExecutionTime}_{\mathit{path}} - \mathit{access}_{\mathit{path}}(C)) \times P_C \times 10\% \tag{B.4}$$
where $\mathit{ExecutionTime}_{\mathit{path}}$ is the execution time of the basic blocks along a path
with regard to the context which reaches the basic blocks, $\mathit{access}_{\mathit{path}}(C)$ is the
total number of accesses to a component $C$ by the instructions in the path, and $P_C$ is
the active power consumed by the component $C$. The switch-off powers of the functional
units are denoted by $\mathit{Switchoff}_{\mathit{ALU}}$, $\mathit{Switchoff}_{\mathit{MULTU}}$ and $\mathit{Switchoff}_{\mathit{FPU}}$.
• Clock Network Energy: In modern high-frequency microprocessors, nearly 70%
of the active (switching) power is consumed by the clock circuit [J+05]. Clock
gating helps to reduce the dynamic power by pruning the idle components
from the clock tree [PSSG10]. The power consumed by the clock network
under clock gating is denoted by $P_{\mathit{Clock}}$. $P_{\mathit{Clock}}$ is proportional to the
number of active components in each clock cycle. It can at most reach the
peak power of the clock network, $\mathit{clockPower}_{\mathit{cycle}}$.
Since our framework measures the energy consumed in a path based on a concrete
pipeline state, it is able to determine which components are active and which
components are idle in each clock cycle and, as a result, to measure $P_{\mathit{Clock}}$ with
more precision. $P_{\mathit{Clock}}$ at each clock cycle is then used to measure the energy
consumed in the clock network ($\mathit{clock}_{\mathit{Path}}$). However, this measurement needs the
list of the components which are clock gated in a processor and also the amount of
clock power which each of them uses. In general, this information might not be available,
or the manufacturer may not disclose it for a processor. In such cases
we can follow the estimation technique used in [BTM00, J+06] to measure the
energy consumed in the clock network:
$$\mathit{clock}_{\mathit{Path}} = \mathit{nonGatedClock}_{\mathit{Path}} \times \frac{\mathit{circuit}_{\mathit{Path}}}{\mathit{nonGatedCircuit}_{\mathit{Path}}} \tag{B.5}$$
where $\mathit{nonGatedClock}_{\mathit{Path}}$ is the clock energy without gating and can be defined
as:
$$\mathit{nonGatedClock}_{\mathit{Path}} = \mathit{clockPower}_{\mathit{cycle}} \times \mathit{WCET}_{\mathit{Path}} \tag{B.6}$$
where $\mathit{clockPower}_{\mathit{cycle}}$ is the peak power consumed per cycle in the clock network
and $\mathit{WCET}_{\mathit{Path}}$ is the worst-case execution time of the basic blocks in the path.
Moreover, $\mathit{circuit}_{\mathit{Path}}$ is the energy consumed in all the components except the clock
network in the presence of clock gating:
$$\mathit{circuit}_{\mathit{Path}} = \mathit{dynamic}_{\mathit{Path}} + \mathit{switchoff}_{\mathit{Path}} + \mathit{leakage}_{\mathit{Path}} \tag{B.7}$$
$\mathit{nonGatedCircuit}_{\mathit{Path}}$, on the other hand, is the energy consumed in all the
components except the clock network in the absence of clock gating:
$$\mathit{nonGatedCircuit}_{\mathit{Path}} = \mathit{circuitPower}_{\mathit{cycle}} \times \mathit{WCET}_{\mathit{Path}} \tag{B.8}$$
$\mathit{circuitPower}_{\mathit{cycle}}$ is a constant defining the peak dynamic plus leakage power per
cycle, excluding the clock network.
• Leakage Energy: The energy lost due to the leakage current, regardless of the
circuit activity per clock cycle, is the leakage energy. For measuring the leakage
energy we follow the measurement presented in [J+06]:
$$\mathit{leakage}_{\mathit{path}} = P_{\mathit{leakage}} \times \mathit{ExecutionTime}_{\mathit{path}} \tag{B.9}$$
where $P_{\mathit{leakage}}$ is the power lost per processor cycle from the leakage current,
regardless of the circuit activity. This quantity is a constant for a given processor
architecture. $\mathit{ExecutionTime}_{\mathit{path}}$ is the execution time of the basic blocks along a
path with regard to the context which reaches the basic blocks.
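The three pipeline-specific components can be written down directly from Equations (B.4) to (B.9). The sketch below is only illustrative: the function names are hypothetical, the switch-off and leakage inputs are the FPU and leakage coefficients of a 32-cycle path with unit power, and the clock inputs (in particular the circuit power of 7 per cycle) are made-up values chosen to exercise the formula.

```python
def switchoff_energy(execution_time, accesses, active_power, ratio=0.10):
    """Equation (B.4): idle cycles of a component times 10% of its active power."""
    return (execution_time - accesses) * active_power * ratio


def leakage_energy(p_leakage, execution_time):
    """Equation (B.9): leakage power per cycle times the execution time of the path."""
    return p_leakage * execution_time


def clock_energy(dynamic, switchoff, leakage, wcet,
                 clock_power_cycle, circuit_power_cycle):
    """Equations (B.5)-(B.8): scale the non-gated clock energy by circuit activity."""
    non_gated_clock = clock_power_cycle * wcet             # (B.6)
    circuit = dynamic + switchoff + leakage                # (B.7)
    non_gated_circuit = circuit_power_cycle * wcet         # (B.8)
    return non_gated_clock * circuit / non_gated_circuit   # (B.5)


# A component that is active for 12 of 32 cycles, with active power 1.
fpu_switchoff = switchoff_energy(execution_time=32, accesses=12, active_power=1.0)
path_leakage = leakage_energy(p_leakage=1.0, execution_time=32)
# Hypothetical inputs for the estimation technique of (B.5).
path_clock = clock_energy(dynamic=80.0, switchoff=10.4, leakage=32.0, wcet=32,
                          clock_power_cycle=1.0, circuit_power_cycle=7.0)
print(round(fpu_switchoff, 2), round(path_leakage, 2), round(path_clock, 2))
```

Note that (B.5) is the fallback estimate for processors whose gating details are not disclosed; when the active components are known per cycle, the clock power can instead be accumulated cycle by cycle.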
Revisiting the formula for the instruction-specific energy, the dynamic energy
of a path can be measured by the following formula:
$$\mathit{dynamic}_{\mathit{path}} = \mathit{Energy}_{ic} + \mathit{Energy}_{dc} + \mathit{Energy}_{\mathit{reg}} + \mathit{Energy}_{wk} + \mathit{Energy}_{\mathit{ALU1}} + \mathit{Energy}_{\mathit{ALU2}} + \mathit{Energy}_{\mathit{FPU}} + \mathit{Energy}_{\mathit{MULTU}}$$
where $\mathit{Energy}_{ic}$ is the energy consumed to access the instruction cache, $\mathit{Energy}_{dc}$ is the
energy consumed to access the data cache, $\mathit{Energy}_{\mathit{reg}}$ is the energy consumed to access
the register file, $\mathit{Energy}_{wk}$ is the energy consumed by the wake-up logic, and $\mathit{Energy}_{\mathit{ALU1}}$,
$\mathit{Energy}_{\mathit{ALU2}}$, $\mathit{Energy}_{\mathit{FPU}}$ and $\mathit{Energy}_{\mathit{MULTU}}$ are the energies consumed by ALU1, ALU2,
FPU and MULTU along the path.
Furthermore, we denote the energy consumed by the clock by $\mathit{clock}_{\mathit{path}}$, by the selection
logic by $\mathit{selection}_{\mathit{path}}$, the leakage energy by $\mathit{leakage}_{\mathit{path}}$, and the sum of the switch-off
energy in the different components by $\mathit{Switchoff}$.
Example 8 (Computing Energy Cost of a Sub-path on an Out-of-order Pipeline).
Consider the control flow graph (CFG) of a program fragment in Figure B.1(a). We
abstract a basic block, shown as a rectangle, using (1) the beginning program point of
the block and (2) the sequence of instructions in the basic block. Two outgoing edges
signify a branching structure, and the branch conditions are labeled beside the edges.
The processor model has an 8-entry ROB, a 4-entry I-buffer queue and the following
functional units: two single-cycle integer ALUs, an integer multiplier with 1 ∼ 4 cycle
latency and a floating-point multiplier with 1 ∼ 12 cycle latency. Moreover, all the
constant power consumptions of the different devices are set to 1.
For simplicity, in this example, we consider a perfect data cache (a single-cycle load/store
unit) and perfect branch prediction. As a result, all accesses to the data cache are
cache hits and all the predictions for the next branch are correct.
Figure B.1: (a) program control flow graph with accessed memory blocks and instruction
running time shown inside each block, (b) the analysis tree of the program
However, the effect of the instruction cache is considered and an instruction cache hit
would take one clock cycle, while an instruction cache miss would cost 10 clock cycles.
In this example, we consider the out-of-order superscalar RISC processor presented
in Figure 5.10 and for the sake of illustration the size of the instructions is set to 2 bytes.
As a result, instructions ins1 to ins16 fall in the first memory block m1 and the rest of
the instructions (ins17 to ins21) fall into m2.
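The mapping of instructions to memory blocks can be reproduced with a short calculation. The sketch assumes a 32-byte memory block and instructions laid out contiguously from the start of a block; the block size is not stated in the example and is inferred here from the fact that exactly 16 two-byte instructions fall into m1.

```python
INSTRUCTION_SIZE = 2   # bytes, as fixed in the example
BLOCK_SIZE = 32        # bytes; assumed, so that 16 instructions fill one block


def memory_block(instruction_index):
    """Return the 1-based memory block of the 1-based instruction index."""
    address = (instruction_index - 1) * INSTRUCTION_SIZE
    return address // BLOCK_SIZE + 1


blocks = {i: memory_block(i) for i in range(1, 22)}
print(blocks[16], blocks[17], blocks[21])  # 1 2 2
```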
Next, in Figure B.1(b), we depict a symbolic execution tree of the program. The
nodes, shown as circles, represent the program points, with superscripts to distinguish
multiple occurrences. Each path denotes a symbolic execution of the program. Each
node is associated with a symbolic state $s \equiv \langle \ell, c, [\![s]\!] \rangle$, which preserves the context
reaching the node. While the context is not explicitly shown in the figure (since the
basic blocks are shown only abstractly), we shall make use of some obvious properties of
the context of some nodes. Note that we have not (fully) drawn the subtree below node
$\langle 4 \rangle^{(2)}$ in Figure B.1(b). In order to focus on the role of the dominating condition for
the pipeline in reuse, all paths in this example are feasible. For the role of infeasible
paths, we refer the interested reader to the example presented in
Section 2.2.2.








Instruction  Execution time (cycles)  Execution unit  Input arguments  Output arguments
ins1         1                        LD/ST           1                1
ins2         1                        LD/ST           1                1
ins3         4                        MULTU           2                1
ins4         1                        LD/ST           1                1
ins5         1                        ALU             2                1
ins6         4                        MULTU           2                1
ins7         1                        LD/ST           1                1
ins8         3                        MULTU           2                1
ins9         1                        LD/ST           1                1
ins10        12                       FPU             2                1
ins11        1                        LD/ST           1                1
ins12        1                        ALU             2                1
ins13        1                        LD/ST           1                1
ins14        1                        ALU             2                1
ins15        1                        LD/ST           1                1
ins16        4                        MULTU           2                1
ins17        1                        LD/ST           1                1
ins18        1                        LD/ST           1                1
ins19        1                        ALU             2                1
ins20        1                        LD/ST           1                1
ins21        1                        LD/ST           1                1
Since the execution time, the execution unit, the number of input arguments and the number of
output arguments affect the energy cost of an instruction, Table B.1 presents this
information for each of the instructions. For example, ins8 is an integer multiplication
instruction with a 3-cycle latency, 2 input arguments and 1 output argument.
Furthermore, Table B.2 presents the list of data dependencies between the instructions. For
example, ins16 can only be executed after the output of ins12 is written back.
We will compute the energy cost of the four paths in the left subtree. We start with
the leftmost path in the symbolic execution tree. Note that at the beginning, at node $\langle 1 \rangle^{1}$,
the variable of interest r, which tracks the energy consumption, is set to 0. The execution
of the leftmost path with respect to the pipeline state can be seen in Figure B.2. This path
takes 32 clock cycles to execute. The energy consumption for this path is calculated accordingly
and can be seen in Table B.3.
Table B.3: The energy consumption for the leftmost path (Path 1)
Component      Consumed Energy
Energyic       25 × Pic
Energydc       8 × Pdc
Energyreg      18 × PRegister−access
Energywk       5 × PWake−up
EnergyALU1     1 × PALU
EnergyALU2     0 × PALU
EnergyFPU      12 × PFPU
EnergyMULTU    11 × PMULTU
Leakage        32 × Pleakage
Clock          (75/7) × clock-powercycle
Selection      10 × Pselection
Switchoff      3.1 × SwitchoffALU + 3.2 × SwitchoffALU + 2 × SwitchoffFPU + 2.1 × SwitchoffMULTU
Figure B.2: The energy analysis of the leftmost path (Path 1)
The numbers in Table B.3 are generated from the execution times in Figure B.2
and the cost model presented in this appendix. For example, $\mathit{Energy}_{\mathit{ALU1}}$ is estimated
as $1 \times P_{\mathit{ALU}}$, since ALU1 is active in clock cycle 23 in Figure
B.2. Furthermore, the leakage energy is calculated as $32 \times P_{\mathit{leakage}}$, since path
1 takes 32 clock cycles to execute. The selection energy, in turn, is
calculated as $10 \times P_{\mathit{selection}}$, since the selection logic is accessed in clock cycles
12, 13, 14, 15, 16, 17, 21, 23, 26 and 27. Finally, consider the clock energy.
In a clock cycle, at most 7 gated components consume clock power (instruction
cache, data cache, ALU 1, ALU 2, MULTU, FPU and register file). Counting the
clock cycles in which these components are active gives 75, while
this number can reach at most 7 × 32, where 32 is the execution time of the path.
As a result, knowing that $\mathit{clockPower}_{\mathit{cycle}}$ is the maximum power consumed in the clock
network, the formula $\frac{75}{7} \times \mathit{clockPower}_{\mathit{cycle}}$ estimates the clock energy of the path. Fixing
all power consumptions to 1, the total energy consumption of the leftmost path would
be 143.11 Micro-watt.
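The total for the leftmost path can be checked by summing the coefficients of Table B.3 with every power constant set to 1. The sketch below reads the clock entry as 75/7 (75 active component-cycles over 7 gated components); the dictionary keys are illustrative names, not identifiers from our implementation.

```python
from fractions import Fraction

# Coefficients of Table B.3 (leftmost path); every power constant is set to 1.
path1 = {
    "ic": 25, "dc": 8, "reg": 18, "wk": 5,
    "ALU1": 1, "ALU2": 0, "FPU": 12, "MULTU": 11,
    "leakage": 32,
    "selection": 10,
}
gated = ["ic", "dc", "reg", "ALU1", "ALU2", "FPU", "MULTU"]

active_cycles = sum(path1[c] for c in gated)     # active component-cycles
clock = Fraction(active_cycles, len(gated))      # fraction of the peak clock power
switchoff = Fraction("3.1") + Fraction("3.2") + Fraction("2") + Fraction("2.1")

total = sum(path1.values()) + clock + switchoff
print(active_cycles, round(float(total), 2))     # 75 143.11
```

The sum of the seven gated components reproduces the count of 75, and the overall total agrees with the reported value of 143.11.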
Figure B.3: The energy analysis of path 2
Table B.4: The energy consumption for path 2
Component      Consumed Energy
Energyic       25 × Poweric
Energydc       8 × Powerdc
Energyreg      18 × PRegister−access
Energywk       5 × PWake−up
EnergyALU1     2 × PowerALU
EnergyALU2     0 × PowerALU
EnergyFPU      12 × PowerFPU
EnergyMULTU    7 × PowerMULTU
Leakage        31 × Pleakage
Clock          (72/7) × clock-powercycle
Selection      10 × Powerselection
Switchoff      2.9 × SwitchoffALU + 3.1 × SwitchoffALU + 1.9 × SwitchoffFPU + 2.4 × SwitchoffMULTU
Moving to the second path, the execution of this path with respect
to the pipeline state can be seen in Figure B.3. This path takes
31 clock cycles to execute. The energy consumption for this path is calculated accordingly and
can be seen in Table B.4. The total energy consumption of this path would be 138.88
Micro-watt.
As can be seen, the energy consumption of the second path is less than that of the first
path. Since the two paths share the prefix $\langle \langle 1 \rangle^{1}, \langle 2 \rangle^{1}, \langle 4 \rangle^{1}, \langle 5 \rangle^{1}, \langle 7 \rangle^{1} \rangle$, the energy
consumption of the sub-path $\langle \langle 7 \rangle^{1}, \langle 8 \rangle^{1}, \langle 10 \rangle^{1} \rangle$ is more than that of the sub-path
$\langle \langle 7 \rangle^{1}, \langle 9 \rangle^{1}, \langle 10 \rangle^{2} \rangle$.
As a result, in an analysis, the witness of the subtree beneath $\langle 7 \rangle^{1}$ is $\langle \langle 7 \rangle^{1}, \langle 8 \rangle^{1}, \langle 10 \rangle^{1} \rangle$.
We now move to the third path in the symbolic execution tree. The execution of this
path with respect to the pipeline state can be seen in Figure B.4. This path takes
32 clock cycles to execute. The energy consumption for this path is calculated
accordingly and can be seen in Table B.5. The total energy consumption of this
path would be 132.35 Micro-watt.
Table B.5: The energy consumption for path 3
Component      Consumed Energy
Energyic       25 × Poweric
Energydc       8 × Powerdc
Energyreg      18 × PRegister−access
Energywk       5 × PWake−up
EnergyALU1     2 × PowerALU
EnergyALU2     0 × PowerALU
EnergyFPU      0 × PowerFPU
EnergyMULTU    11 × PowerMULTU
Leakage        32 × Pleakage
Clock          (69/7) × clock-powercycle
Selection      10 × Powerselection
Switchoff      3 × SwitchoffALU + 3.2 × SwitchoffALU + 3.2 × SwitchoffFPU + 2.1 × SwitchoffMULTU
Figure B.4: The energy analysis of path 3
Figure B.5: The energy analysis of path 4
Table B.6: The energy consumption for path 4
Component      Consumed Energy
Energyic       25 × Poweric
Energydc       8 × Powerdc
Energyreg      18 × PRegister−access
Energywk       5 × PWake−up
EnergyALU1     3 × PowerALU
EnergyALU2     0 × PowerALU
EnergyFPU      0 × PowerFPU
EnergyMULTU    7 × PowerMULTU
Leakage        29 × Pleakage
Clock          (60/7) × clock-powercycle
Selection      10 × Powerselection
Switchoff      2.6 × SwitchoffALU + 2.9 × SwitchoffALU + 2.9 × SwitchoffFPU + 2.2 × SwitchoffMULTU
The depth-first traversal of the symbolic execution tree continues to $\langle \langle 7 \rangle^{2}, \langle 9 \rangle^{2}, \langle 10 \rangle^{4} \rangle$.
The execution of the fourth path with respect to the pipeline state can be seen in
Figure B.5. This path takes 29 clock cycles to execute. The energy
consumption for this path is calculated accordingly and can be seen in Table B.6. The total
energy consumption of this path would be 124.17 Micro-watt. Moving back to
node $\langle 4 \rangle^{1}$, we can note that the sub-path $\langle \langle 4 \rangle^{1}, \langle 5 \rangle^{1}, \langle 7 \rangle^{1}, \langle 8 \rangle^{1}, \langle 10 \rangle^{1} \rangle$ (in blue) is
the most energy-consuming sub-path in the tree and would be chosen as
the witness path in an analysis.
